From acquisition to actual use, the value chain of data is often compared to mining. Raw data is the ore, to be extracted from mountains of documents or painstakingly collected from streams of information. It is then refined and repackaged in big data smelters. At the very end of the chain, data is beautified in visualizations that can be consumed by end-users.
This metaphor does work for certain uses of structured data. Vehicle registration data, for instance, is collected by the administration. It is then sold, in bulk, to companies that repackage it and sell it on to statisticians within insurance companies or car manufacturers. They, in turn, use it to run the analyses that will hopefully raise the bottom line of their employers.
If this value chain worked in the news industry, news organizations could decide to expand into data gathering, analysis or visualization, depending on their resources and strategy. However, using data to produce journalism is nothing like mining. I see three reasons why, listed below in reverse order of importance.
Libel risk and quality control
Datajournalism ethics (and the reality of front-end development) dictate that data sets used in a story must be published in full. A mistake in a database of corrupt politicians or homicide suspects would have terrible consequences for the person whose data is erroneously labeled. More importantly for this argument, it would have tragic consequences for the publisher. Defamation legislation holds the publisher accountable for all of its content. Blaming a data supplier would not convince a judge.
The imperative of quality control pushes a journalism operation to keep direct oversight of the collection and analysis process.
News is about the missing data
Running routine queries against a database to conclude that the situation has not changed from the previous month will never make for a good headline. Finding interesting stories is only possible if one knows the data set very intimately. Databases are not the shiny reflection of some reality. Instead, they are marshy swamps and anyone who wants to explore them needs extensive knowledge of their structure and contents.
To be able to properly analyze a dataset, a journalism operation needs to control data collection as tightly as possible.
Datajournalism measures the unmeasured
Just as journalists like to describe their work as “giving a voice to the voiceless”, datajournalism could be the act of measuring the unmeasured. By measuring a problem, a datajournalism team transforms it from a series of stories into an issue that can be followed over time and rigorously assessed. The methodology chosen for data collection can have tremendous consequences for the results.
The action of journalism encompasses all stages of the production process. Data collection is as journalistic as visualization. The essence of datajournalism means that a news operation needs to operate at all stages of the data value chain.
Too many news organizations focus on the last steps of the value chain: visualization and distribution. Acquiring data from third parties and prettifying it makes for nice articles, but it falls short of the standards of datajournalism. Rather, it is the data-driven equivalent of churnalism, the practice of copying and slightly editing press releases and newswire reports. 
To excel at all steps of the data value chain, news organizations must internalize the required skills of data scraping, crowdsourcing, setting up sensors, maintaining databases and statistical analysis, all within the newsroom. Having small teams of datajournalists is a good start, but few organizations have reorganized from an article-centric logic to a data-centric one. Maintaining databases requires thinking in the long run. It requires devoting resources to curating information that might be reused months or years later. It requires digging into topics that are not “in the news” anymore. It requires that the chief-of-data have the same hierarchical position as the editor-in-chief.
This vision is not far-fetched. The Economist Group, an established British publisher, runs a data operation (the Intelligence Unit) that feeds into the editorial operations of the company. Non-global companies do it, too. La Nacion of Argentina has a data desk that is specifically tasked with building databases that will give the newsroom an edge against the competition. They work with, but are not part of, the design and interactive desk. Younger newsrooms do it, especially ProPublica. The New York-based non-profit tends to organize its reporting around databases (Dollars For Docs is one famous example, as is their database of non-profits or the one on college debt). There is no reason why other publishers could not adopt the workflows of these organizations.
Importantly, a news operation can sell its data along the way. Both ProPublica and The Economist Group act as suppliers of data to other companies. They sell the structured information they collect as datasets (ProPublica) or as informed analyses (The Economist). The data-driven newsroom not only provides original journalism, it also opens a new revenue stream.
- The ongoing demotion of datajournalism to data processing by European judges could change that. Drop me a line if you hear of an interesting case.
- On the topic of churnalism, do read Nick Davies’ Flat Earth News.
- Check their blog in English.