'Free your data' is over. Now, we need data to be free.

April 13, 2016

My keynote at the Nordics Datajournalism Conference in Helsinki: why datajournalists should be more careful with government data and should collect more data independently.

You probably know that Finland has some of the best population data in the world. Finland was part of Sweden in the 18th century, at the time when the Kingdom decided to start counting its subjects regularly. This was the first modern population census and gave birth to the first modern public statistics office, the Tabular Bureau.

What you might not know is how the data was collected. For much of the 18th century, the Tabular Bureau had one single employee. He did not go door-to-door to collect data on the women, men and children of Finland. Instead, the Church was in charge. Clergymen filled out the forms and sent them back to Stockholm. This had some unintended consequences.

Unbaptized children, for instance, were not counted. In remote areas, it took a few days to walk through the woods to the church to be baptized. In cities, children could be baptized sooner. In the 18th century, a lot of children died a few days after birth, and in remote areas many of them died before they could be baptized and recorded. This explains why recorded infant mortality was lower in remote areas than in towns^[1].

Another consequence of the pastoral data collection concerned age. Tax was due on people aged 15 to 60 and was to be collected by the Church. Unsurprisingly, to avoid being hated too much by their flock, pastors let a few Finns pretend they were under 15 or over 60^[2].

The Tabular Bureau of Stockholm was the best statistical office in the world for all of the 18th century, but the data quality was still bad. Poor communication lines explain some of the low data quality. But, and this is the point of this essay, the rationale behind data collection matters much more when assessing the quality of data collected by the government.

The Swedish nobility was not interested in knowing about infant mortality when it created the Tabular Bureau. They were interested in population numbers, and, more precisely, in population growth. The thinking in Stockholm at the time was not that a large population was good. It was that a large and poor population was good^[3]. More poor people meant lower wages and more pliable servants. This is why the only metric that was recorded in the first censuses apart from age and gender was social status.

What was true of the first data collection initiative by the government in Finland in the 18th century remains true all over the world today. Data is collected by the bureaucracy to serve its missions. When governments needed to take control of their economies to organize the war effort in the 1930’s, they started measuring economic output. The whole concept of GDP was created before and during World War 2, mainly in the United States. Its purpose was to help the administration find ways to produce more tanks and more planes^[4]. This is why this metric is so good at measuring industrial output and so inadequate for everything else.

Bad and worse public data

When an administration is honest, the quality of collected data can be assessed from the administration’s mission. The role of an administration is not to seek out the truth, but to do its job. Take unemployment statistics. Their role is not to measure the number of people who need a job, but to count the number of people who will get a check for unemployment benefits at the end of the month. Anyone working in the gray economy and registered at the job center will be counted as unemployed; anyone looking for work but not registered at the job center will not be. This limitation is fairly straightforward.

When an administration is not honest, the quality of the data is very hard to assess. When infant mortality in Romania increased during the 1980’s, the government simply ordered hospitals not to register babies immediately after birth. It had the effect of keeping infant mortality down^[5]. Not registering dead babies is an extreme measure. But administrations routinely produce terribly bad data for political aims. In January 2016, Europol, the European police coordination center, announced that 10,000 refugee children were missing and were very probably in the hands of traffickers. We asked them how they came to this number. They told us the methodology was secret. We asked them if they had encountered a single case of a trafficked refugee kid. They said no. The “10,000” figure was obviously phony, but it was used to justify the creation of a European Migrant Smuggling Centre - by Europol - a few days later^[6].

Sometimes, no data is better than bad data. The police and other coercive forces, for instance, engage in vast data collection enterprises. The error rates in these files can be mind-boggling. In France, in 2009, the privacy ombudsman analyzed part of one such file and found that only one in five personal profiles was accurate^[7]. The French police alone has more than 70 such databases^[8]. Because of these mistakes, men, women and children will be wrongly suspected and arrested.

Political data

The political aims or constraints of an organization influence the data it produces. You are probably familiar with the Global Terrorism Database, which keeps track of all terror attacks in the world since 1970. Its 2015 revision had to make a choice about Ukraine. Either it was a war between Ukraine and Russia, fought via Russia's proxies, the Luhansk and Donetsk Republics. Or it was not a war and the separatists were non-state actors, ergo their violence consisted of acts of terrorism. Of course, anyone who follows what happens in Ukraine knows that the situation there is an actual war. That the actions of one side should count as terrorism while those of the Ukrainian side do not makes no sense. The Global Terrorism Database has links to the American government and toes the official line^[9]. In this case, politics trumps the truth.

That public data can be of poor quality is not news. We’ve all faced data sets riddled with errors or bogus methodologies. But there are many deliberate attempts by our administrations to weaken the quality of the collected data. This is why I called this essay “‘Free your data’ is over”. Being able to access bogus data is pointless. What is needed to make sense of the world around us is better data, free from government interference. To be free to contextualize news, free to interpret events and free to think independently, we need independently collected data.

This is important because good information is a prerequisite for the bureaucracy to run properly. Removing good data is like removing the foundations of our administrations. And this is already happening.

Measuring what counts

Inflation is the measure of the evolution of prices. You probably know that prices have been rising predictably at around 2% per year since the end of the 1980’s and have been flat since the late 2000’s. You probably know as well that the European Central Bank has been pumping cash into the system since 2009, in effect “printing money”^[10]. According to all economic theories, this should have pushed prices up. Inflation did not budge and this made a few observers queasy.

But this should not be a surprise. Inflation represents the price level for average people. It works for communities where the average represents something tangible. The less evenly distributed the purchasing power, the less meaningful the average becomes. We now live in a society with two-tiered inflation. For the middle tranche of society, inflation is very low. For the rich, inflation is skyrocketing. The price of investment wines has doubled since 2007^[11]. Brueghel’s painting Noli Me Tangere was bought for $76,000 in April 2015 and sold for $128,000 six months later^[12]. That’s an inflation rate of nearly 70% in six months. In most very large European cities, housing prices have surged. In Stockholm, prices went up more than 50% in the past eight years^[13].
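The arithmetic behind the painting example can be checked in a few lines. This is only a sketch: the two prices are the ones quoted above, while the function names and the idea of compounding the six-month change into a yearly rate are illustrative assumptions, not part of the original talk.

```python
# Hypothetical helper names; only the two prices come from the text.

def percent_change(old: float, new: float) -> float:
    """Return the relative change from old to new, in percent."""
    return (new - old) / old * 100


def annualized_rate(old: float, new: float, months: float) -> float:
    """Compound the change observed over `months` into a 12-month rate, in percent."""
    growth = new / old
    return (growth ** (12 / months) - 1) * 100


# Prices quoted for the Brueghel painting, in US dollars.
bought, sold = 76_000, 128_000

six_month_change = percent_change(bought, sold)
yearly_equivalent = annualized_rate(bought, sold, months=6)

print(f"Change over six months: {six_month_change:.1f}%")
print(f"Annualized equivalent:  {yearly_equivalent:.1f}%")
```

The six-month change comes out at about 68%, which is where the "nearly 70%" figure comes from. Compounded over a full year, the same pace would mean prices almost tripling, far from the roughly 2% that official inflation figures aim for.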

I don’t claim to understand inflation perfectly, but it seems that increasing the money supply does increase price levels. But this inflation for the rich is not measured. It’s not that economic theory is wrong, it’s just that we do not have the tools to see the effect of our monetary policy.

Measuring what counts is not limited to economics. In France, an NGO wants to know how hospitals treat families during childbirth. (In France, almost all babies are born in hospitals.) The NGO, CIANE, manages several long-running questionnaires on key aspects of childbirth. They measure aspects such as whether or not fathers can be present, whether the information given to the family was sufficient to make informed decisions or whether the mothers felt well after delivery^[14]. Such items were never on the Health ministry’s agenda.

Datajournalists have a role to play

NGOs need not be the only ones acting as a counterweight to official measurers. Measuring the unmeasured is what datajournalism is all about. Journalists have shown that they could create data gathering activities that helped both their readers and the institutions around them. In 2012, Slate.com started to count victims of gun violence in the United States^[15]. It was a breakthrough because, for the first time, it counted suicides as gun-related deaths, and not only homicides. The project was taken over by a special-purpose website, the Gun Violence Archive, in 2013. Two years later, The Guardian’s American operation created The Counted, a database of people killed by law enforcement. Its methodology was adopted in late 2015 by the FBI, which admitted that its existing measurement system was broken. The company I co-founded, Journalism++, has been coordinating The Migrants’ Files since 2013, a database of people who died in their attempt to reach or stay in Europe. Since publication in 2014, international organizations have reused our data and methodology, making the death toll of Fortress Europe an actual issue.

The journalists behind these three projects realized that a problem existed and needed to be measured. You probably noticed that these three projects all count deaths. The reason is simple: corpses are relatively easy to count. They are hard to hide, are usually noticed either in official records or in the news, and everyone agrees on the definition of who’s dead and who’s not.

Other journalists tackle harder topics. Some try to measure police violence^[16], which seems to have increased markedly since non-lethal tools such as Tasers and Flash-balls became ubiquitous. Others measure detention^[17]. Others still measure the cost surcharge imposed on products marketed to women^[18].

As well-meaning as they are, these projects often lack the knowledge and skills to measure an issue efficiently. There is space for datajournalists to jump in.

Methodology

The hardest part when measuring an issue is methodology. The contours of the problem to be measured need to be defined very precisely. Not everything that can be measured needs to be stored, only the relevant bits. With The Migrants’ Files, for instance, we started by storing the names and genders of the victims. We quickly realized that this information was much too biased to be of use (we only had detailed information about some categories of victims) and stopped collecting it, thereby freeing resources to focus on the core task of measuring the location and number of victims.

Once the methodology is in place, data collection can be done manually, by structuring information from open sources. Or it can be done automatically, with data scrapers or sensors. In India, the Hindustan Times was so fed up with the unreliability of the official measures of pollution that it is setting up its own network of sensors^[19].

Although no systematic study has been done, I sense that most of the cost of such data collection initiatives stems from poorly designed processes, such as using non-structured formats to store structured data. This is precisely why datajournalists should get in the game: they have the skills to make data collection efficient.
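To make the point about structured formats concrete, here is a minimal sketch of what storing incidents as structured records could look like. The schema, the field names and the sample rows are all made up for illustration; they are not the format used by The Migrants’ Files or by any other project mentioned here.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class Incident:
    # Hypothetical schema, assumed for this example only.
    date: str              # ISO 8601 date, e.g. "2015-01-01"
    latitude: float
    longitude: float
    dead_and_missing: int
    source_url: str


def to_csv(incidents):
    """Serialize incidents to CSV text with one column per field."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(Incident)])
    writer.writeheader()
    for incident in incidents:
        writer.writerow(asdict(incident))
    return buffer.getvalue()


def total_victims(csv_text):
    """Sum the victim counts straight from the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return sum(int(row["dead_and_missing"]) for row in reader)


# Made-up rows, for illustration only.
sample = [
    Incident("2015-01-01", 35.0, 14.0, 3, "https://example.org/a"),
    Incident("2015-01-02", 36.5, 12.5, 12, "https://example.org/b"),
]

csv_text = to_csv(sample)
print(csv_text)
print("Total:", total_victims(csv_text))
```

Because every record has the same named fields, totals and maps can be computed directly from the file. The same information kept as free text in a document would have to be re-read and re-typed by hand each time.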

In 18th century Finland, authorities counted the population to divide it into productive and non-productive elements in order for the government to increase the supply of the former. Starting in 1755, the government added a category to the census. It grouped beggars, prisoners and Sami people, who were all considered to be a dead weight to society. Later, in the 1850’s, the Swedish government realized the need to homogenize Sweden (Finland wasn’t part of Sweden then). In the census, a new category replaced the previous one. It now aggregated others: “Lapps, Gypsies and Jews”. Efforts to measure these others evolved until 1945 (including a “half-Lapp” category in the 1930 census), when they stopped. In this 200-year period, not one Sami was asked their opinion on what defined them or how they should be counted. It was not until 1993 that the Sami started compiling their own statistics^[20].

What happened to the Sami is happening to many people and issues in Europe, today. Datajournalism has the tools and the possibility to ensure that they need not wait 200 years for their suffering to be measured and addressed.

Notes

[1] In The reliability of the registration of births and deaths in Finland in the eighteenth and nineteenth centuries: Some examples.
[2] In What is Tabellverket? of Umeå University.
[3] In Swedish Population Thought in the Eighteenth Century. Malthusian thinking penetrated Sweden only in the 1830’s, later than elsewhere in Europe.
[4] This is from GDP: A Brief but Affectionate History. A fantastic book.
[5] That’s research I did in 2013. Read this article: A Fundamental Way Data Repositories Must Change.
[6] Read the story in the newsletter of The Migrants’ Files.
[7] The report by CNIL (in French).
[8] Read Nicolas Sarkozy a créé 44 fichiers policiers (in French).
[9] Contacted, GTD denied that the decision had anything to do with politics. Interestingly enough, the database counts only 50 fatalities of terrorism in Bosnia in 1992, even though the status of Serbian irregulars was similar to that of the LPR’s and DPR’s fighters.
[10] Yes, it’s more complicated. This article is a good introduction to the subject.
[11] See the Liv-Ex indices.
[12] In The art market in 2015, page 20.
[13] Read at Global Property Guides.
[14] Read the results of the Enquête Accouchement of CIANE (in French).
[15] Read How Many People Have Been Killed by Guns Since Newtown?
[16] In France, see this Facebook page.
[17] See the Global Detention Project.
[18] In France, see The Woman Tax Tumblr blog. We, at Journalism++, tackled the problem, too. Read Why we could not compute the Woman Tax.
[19] The data is live at airquality.hindustantimes.com. On 19 April 2016, Hindustan Times’ Piyush Aggarwal said that the sensors were “being calibrated”.
[20] Information for this paragraph comes from Abandoning “the Other”: Statistical Enumeration of Swedish Sami, 1700 to 1945 and Beyond.

Many thanks to Pekka Myrskylä for his help in understanding the 1749 Swedish census. Thanks Pierre and Anne-Lise for your precious feedback.