Don't ask too much from data literacy

December 22, 2015

This essay is written for a special issue of the Journal of Community Informatics on data literacy^[1]. In the call for submissions, the journal points out that some argue that more data literacy will foster greater transparency, thanks to better use of open data, and let citizens better engage with science. With regard to journalism, I argue here that this view is mistaken.

“Datajournalism is social science on a deadline”, Steven Doig, a Pulitzer Prize winner, once said^[2]. Indeed, doing journalism with structured data requires that journalists use the social scientist’s toolbox, from data collection to data analysis. Data literacy is needed at every step of the process: to know what data to collect, understand biases in a data set, perform sensible analyses and visualize the results properly.

Statisticians, once pariahs, are now in demand within newsrooms. Many news operations have created dedicated teams to work with structured data. In the United States, the New York Times is famous for its embrace of datajournalism. Examples abound throughout Europe as well, from public service broadcasters such as Bayerischer Rundfunk (Munich, Germany) or Schweizer Radio und Fernsehen (Zurich, Switzerland) to newspapers of record such as Gazeta Wyborcza (Warsaw, Poland) to local news outlets such as Le Télégramme (Brest, France) or Heilbronner Stimme (Heilbronn, Germany). They have published several stories that could never have been produced without data literacy skills. Just as muckrakers in the late 19^th century forced public institutions to listen to the voiceless, datajournalists today force them to measure the unmeasured.

Data literacy enabled The Guardian to record every person killed by the police in the United States in 2015 with The Counted, for instance. This prompted the FBI to rethink the way it measured the issue. In Spain, Civio systematically analyzed official pardons with the Indultómetro. The number of pardons went from 500 per year to fewer than 100 a year after the project was published. Journalism++, an agency for datajournalism I co-founded, had similar successes when we measured the gap in what women and men pay (The Woman Tax) or when we measured the mortality rates of people trying to come to or stay in Europe, with The Migrants’ Files (coming to Europe is slightly less dangerous than going into battle at Verdun in 1916). Both projects were instrumental in pushing authorities to measure the issue.

Journalists who have acquired data literacy skills are able to produce new, exclusive stories that have an impact and resonate with the audience. Datajournalism can even be profitable. The Upshot, the New York Times’ datajournalism operation, made up to 5% of the newsroom’s page views in 2014 with 1.5% of the total staff^[3]. Prejudice against math and stats in the newsroom is waning and datajournalism operations are popping up across the world. From this perspective, one could think that data literacy will trickle down through the newsroom and, from there, throughout society. But one would be mistaken to do so.

Misusing data

The Migrants’ Files, our body counting project, provides a table where each line is a separate incident during which a person died in their attempt to reach or stay in Europe. It aggregates information from a wide variety of sources: coast guards, news reports, social media, etc. The table has several columns, such as the country in which the event happened, the cause of death and the source. This table is freely accessible online as a Google Spreadsheet. Anyone with basic data manipulation skills can map, filter and analyze the data in a few minutes.

While the table was reused many times, a few journalists misused the data we collected and came to false conclusions. Some journalists, for instance, grouped the data by country to compute a list of the most dangerous countries for refugees, even though we made clear, in the data itself and through email contacts, that the data collection process did not allow for such a usage. Our data set has more lines for events that happened in Germany because more organizations there record them. By grouping by country, these journalists wrote that Germany was the country where most refugees died. If anything, Germany might be the safest country for refugees and migrants in Europe.
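The bias behind this mistake can be sketched in a few lines of Python. The figures below are entirely made up for illustration (they are not taken from The Migrants’ Files), and the names `countries`, `recording_rate` and `recorded_deaths` are hypothetical. The sketch only assumes that the share of incidents that gets recorded differs from one country to the next.

```python
# Hypothetical figures for illustration only; none come from The Migrants' Files.
# Each country has a true number of deaths and a share of them that gets recorded
# (the recording rate depends on how many organizations monitor incidents there).
countries = {
    "Country A": {"true_deaths": 100, "recording_rate": 0.9},
    "Country B": {"true_deaths": 400, "recording_rate": 0.1},
}


def recorded_deaths(info):
    """Number of deaths that end up as lines in the data set."""
    return round(info["true_deaths"] * info["recording_rate"])


# What a "group by country" on the table would return.
recorded = {name: recorded_deaths(info) for name, info in countries.items()}

# Ranking by recorded lines versus ranking by actual deaths.
most_recorded = max(recorded, key=recorded.get)
most_deadly = max(countries, key=lambda name: countries[name]["true_deaths"])

print(recorded)
print("Most lines in the table:", most_recorded)
print("Most actual deaths:", most_deadly)
```

In this toy example, the country with the most lines in the table is not the one where most people died. A count per country measures how well incidents are recorded at least as much as it measures danger.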

Using data without data literacy skills is the source of many more errors. It is not rare to see journalists confuse units, for instance between watts, a unit of power, and watt-hours, a unit of energy^[4]. It is not rare to read analyses that distort the data to the point where journalists reach conclusions that are precisely opposite to what a data literate person would have said^[5]. It is not rare to see articles where billions are used instead of millions. How often such mistakes happen is impossible to say. To date, no systematic study of data-driven mistakes has been carried out by academia or professional organizations.
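To make the unit confusion concrete, here is a minimal Python sketch with made-up numbers (they are not taken from any of the articles cited). The function name `energy_wh` and the 2 MW plant are hypothetical. Power is a rate, and energy is that rate multiplied by a duration.

```python
# Hypothetical numbers for illustration; they are not taken from any cited article.
def energy_wh(power_w: float, hours: float) -> float:
    """Energy (watt-hours) delivered by a constant power (watts) over a duration."""
    return power_w * hours


plant_power_w = 2_000_000  # a 2 MW plant: power, i.e. a rate
hours_in_day = 24

daily_energy_wh = energy_wh(plant_power_w, hours_in_day)
print(f"Power: {plant_power_w / 1e6} MW")
print(f"Energy over one day: {daily_energy_wh / 1e6} MWh")

# Mixing up millions and billions changes a figure by a factor of 1,000.
budget_in_millions = 3
wrong_budget = budget_in_millions * 1_000_000_000  # written as billions by mistake
right_budget = budget_in_millions * 1_000_000
print("Error factor:", wrong_budget // right_budget)
```

The same plant can thus be described by two very different numbers, and only the unit tells the reader which one is meant.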

It could be argued that this happens because some people in these newsrooms are still data illiterate. As data literacy increases in the workplace, such mistakes will occur less and less often, the argument goes. It would be true if publishers had an incentive to seek out the truth and root out data illiteracy. They do not.

Wrong incentives

No publisher makes money by publishing facts that are true. Instead, they either sell their readers’ attention to advertisers or sell content to their readers, or both. When selling attention, publishers need to garner as much of it as possible. To do so, they do not need to publish true stories; they need to publish articles that will be read and shared while minimizing production costs. This explains why rumor mills abound on the web and why once-respectable media outlets like The Daily Mail routinely publish articles that are factually wrong. The lack of data literacy of the staff does not help, but managers have no need for data literate writers. And writers with data literacy skills have no incentive to use them^[6].

Selling content directly to readers does not fundamentally change a publisher’s incentives. Instead of maximizing volume, a publisher needs to please the audience as much as possible, even when facts need to be distorted to fit the audience’s beliefs. Fox News, a cable TV station in the United States, is famous for reporting and creating lies, often with the help of wrong charts and statistics^[7]. The news operation derives most of its income not from advertising, but from subscriptions^[8]. It grew by accompanying the rise of reactionary Republicans in the United States and feeds on their disregard for facts.

Still, in the absence of systematic measurement of mistakes made with data, one could argue that a general lack of data literacy skills is more responsible for bad reporting of numerical information than misplaced incentives. The argument does not hold when one looks at other fields where data literacy is widespread and incentives are just as misaligned.

Testimonies from financiers, many of whom have degrees in statistics from top-tier universities, show how wrong incentives can make people knowingly ignore their data literacy skills. An employee of Standard & Poor’s, a company that rates the risk of financial products, once said that they would rate any securities, even if they were “structured by cows”. The reason is given by an employee of Moody’s, another ratings agency, when he explained that the errors committed in the run-up to the financial crisis “made [them] look either incompetent at credit analysis or like [they] sold [their] soul to the devil for revenue” (emphasis mine). It is unlikely that high-level staffers of the world’s top finance institutions were incompetent. The explanation why data literate people put blinders on and pretended not to understand the data before their eyes is given by another Standard & Poor’s employee, who said before the financial crisis: “Let’s hope we are all wealthy and retired by the time this house of cards falters”^[9].

In finance as in journalism, greed trumps data literacy. In academia, too, stories abound where renowned professors manipulate data to increase their standing, gain public visibility (the Séralini affair is a good example) or make money^[10]. A meta-analysis of data falsification found that a significant chunk (about 10%) of academics had manipulated data at some point^[11].

These examples show that, while data literacy skills are needed to work correctly with data, they can easily be switched off by other incentives, to the point where part (academia) or most (finance) of the actions purporting to be driven by data are actually driven by entirely different factors. It would be foolish to believe that journalists are different and that, confronted with conflicting incentives, they will resist and keep looking for the truth in data.

The good data literacy can do

Current incentives do encourage journalists to put data literacy skills to use when their audience is interested in facts. Newsrooms that set up datajournalism teams are mostly the newspapers of record of their respective markets. Tabloids and other mass-market media outlets have yet to invest in data literacy. They do use data and visualizations to serve “truthiness”, to give a veneer of seriousness to lies, as Fox News exemplifies. An experiment conducted by Italian researchers on a large sample (close to 10,000 Facebook users) showed that factually wrong news items were in great demand by groups exhibiting distrust of established institutions^[12]. As distrust in institutions increases in many places (think of Trump’s or Le Pen’s popularity), the market for factually inaccurate news grows. In an article announcing the end of a fact-checking column, The Washington Post’s Caitlin Dewey rightly asks “Is it the point at which we start segmenting off into alternate realities?”^[13] Only one of these realities, which might not be the largest, is interested in data literacy.

Data literacy does enable some journalists to do better, more efficient reporting. Anyone interested in quality journalism should undoubtedly support this trend. But we should not fool ourselves and believe that this will result in better information overall. It will, at most, impact the minority of the public that is interested in developing a fact-based world view.

Teaching data literacy to the general population, by changing the primary school curriculum, for instance, would make little difference. The desire of a person to apprehend the world through facts and consume content from data literate journalists is probably not a function of the curriculum. If it were, the development of curricula focused more on concrete skills throughout the 20^th century (think basic science versus Latin and ancient Greek or versus an absence of education) would have created populations more willing to embrace a fact-based worldview. It did not^[14].

To achieve the larger goal of engaging citizens with science and data, journalists need more than data literacy skills. They need a reason to acquire and use them. In Code and Other Laws of Cyberspace^[15], Lawrence Lessig writes that socially desirable behaviors can be influenced by markets, law, code or norms. Many consumers are uninterested in data literate journalists. The successes of The Daily Mail and Fox News are proof of this. Markets will not work. Legislation would not either. Prohibiting a wrongful usage of data would necessarily infringe on freedom of speech, making the cure worse than the illness. Some tried to use code to address the issue. Trooclick, a French startup, created a browser plug-in that automatically looked for mistakes in news items (it didn’t work^[16]). Beyond the enormous technical challenges, the lack of public interest in the issue will not entice computer scientists to automate data literacy. The last option in Lessig’s framework is norms. Journalists could be incentivized to become data literate through encouragement (by setting up more datajournalism prizes, for instance) and through deterrents, such as the condemnation of data illiteracy by civil society groups. Scientists could gather and create a media watchdog dedicated to data literacy, for instance. It would be a colossal and expensive endeavor, but, absent a change in the context that fosters data illiteracy (distrust in institutions), this is the only available option.

Notes

See the call for submissions website here.
“Social science done on deadline”: Research chat with ASU’s Steve Doig on data journalism
As stated in an AdAge article from early 2015. Staff count comes from this Fast Company article, published at the same time.
As in this Reuters article from December 21st, 2015.
As in this article from The Daily Telegraph, where an increase in rape reporting is equated to an increase in rape cases.
To understand how The Daily Mail operates, do read this testimony from a former staffer: My Year Ripping Off the Web With the Daily Mail Online.
This partial collection of wrong or dishonest charts by Fox News covers only the period 2009-2012 but reveals a continuous lack of data literacy among the station’s staff: A History Of Dishonest Fox Charts.
Data from SNL Kagan available at the Pew Research Center.
All quotes are from the 2008 hearing committee of the House of Representatives.
Read Greenpeace exposes sceptics hired to cast doubt on climate science or Robert Proctor’s Golden Holocaust, which shows in great detail how academics were bought by tobacco companies.
Fanelli, Daniele. “How Many Scientists Fabricate and Falsify Research? A Systematic Review and Meta-Analysis of Survey Data.” PLoS ONE 4.5 (2009).
Bessi, Alessandro, Mauro Coletto, George Alexandru Davidescu, Antonio Scala, Guido Caldarelli, and Walter Quattrociocchi. “Science vs Conspiracy: Collective Narratives in the Age of Misinformation.” PLoS ONE 10.2 (2015).
Dewey, Caitlin. “What Was Fake on the Internet This Week: Why This Is the Final Column.” Washington Post. The Washington Post, 18 Dec. 2015.
Evidence for the disconnect between facts and majority opinion can be found most easily on immigration topics, where the gap between facts and public discourse is widest. Work by Hein de Haas, among many others, exemplifies this. Read for instance his lecture Human Migration: Myths, Hysteria and Facts.
Lessig, Lawrence. Code and Other Laws of Cyberspace. New York: Basic, 1999. I discovered this book in Ethan Zuckerman’s brilliant lecture Insurrectionist Civics in the Age of Mistrust.
Read ‘There is No Market for Fact-Checking’: Trooclick Exits the Verification Scene.