For researchers, accessing data is one thing. Assessing its quality another | AlgorithmWatch

By Nicolas Kayser-Bril

Online platforms often provide data that is riddled with errors. Rather than launching quixotic attempts at fixing them, researchers increasingly investigate why platforms bias their data.

Facebook created an “ad library” in May 2018. It provides detailed information about advertisements paid for by politicians and advocacy groups, and more general data on other types of ads. On a website specifically made for academic researchers, Facebook describes the tool as a “comprehensive collection of all ads running across Facebook”.

But when researchers attempted to use the tool in the run-up to the 2019 British general election, they saw that many adverts had mysteriously disappeared from the library. Days before the vote, ads from the Conservatives, the Liberal Democrats and the Brexit Party could not be found. The page “Boris Johnson” was listed as having spent 181 pounds (200 euros) on ads during the campaign, when in fact it had spent 90,000 pounds (100,000 euros).

Even when large online platforms make data available, researchers have no guarantee that the data they see is accurate or truthful.

Non-random samples

Jakob Jünger, a postdoctoral researcher at the University of Greifswald, develops Facepager, a tool that lets researchers extract information from Facebook, YouTube and Twitter. He told AlgorithmWatch that Facebook only lets the tool extract 600 posts for each year of a Page’s existence. However, Facebook does not disclose the selection criteria for these 600 posts (a recent experiment showed that posts with more engagement are returned). The non-random selection of posts and the absence of a transparent methodology severely limit the range of hypotheses researchers can test on such data.
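One way to check whether returned posts skew toward high engagement is a permutation test. The sketch below is only an illustration under strong assumptions: it presumes the researcher has, for one Page, an independently collected list of posts that the platform did not return, along with an engagement count for every post. The function name `permutation_p_value` and all the numbers are hypothetical and are not taken from the experiment mentioned above.

```python
import random
from statistics import mean


def permutation_p_value(returned, missing, n_permutations=10_000, seed=0):
    """One-sided permutation test: is mean engagement higher among returned posts?"""
    rng = random.Random(seed)
    observed = mean(returned) - mean(missing)
    pooled = list(returned) + list(missing)
    k = len(returned)
    hits = 0
    for _ in range(n_permutations):
        # Reassign posts to the two groups at random.
        rng.shuffle(pooled)
        diff = mean(pooled[:k]) - mean(pooled[k:])
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical engagement counts (likes plus comments), for illustration only.
returned_posts = [120, 340, 95, 410, 230, 180]
missing_posts = [12, 30, 8, 45, 22, 15]
p = permutation_p_value(returned_posts, missing_posts)
print(f"p-value: {p:.4f}")
```

A small p-value would suggest that the gap in engagement is unlikely under random selection. It would not reveal the platform's actual selection criteria.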

In a 2012 paper, Sandra González-Bailón, who was at the time a research fellow at Oxford University (she is now an associate professor at the University of Pennsylvania), compared two samples of tweets, containing the same hashtags, over the same period, obtained from two different access points made available by Twitter (the “Search” and “Stream” interfaces). While there was substantial overlap between the two samples, the difference suggested that the selection of tweets was not random. But without access to the data set containing all tweets, “we can only speculate about the extent” of the selection bias, she and her co-authors wrote. In 2016, computer scientists from Arizona State University showed that bots could bias the Stream interface to influence the tweets it returned.
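The kind of overlap comparison described above can be sketched in a few lines. This is a minimal illustration, not the method used in the 2012 paper: it assumes each sample has already been reduced to a collection of tweet IDs, and the function name `compare_samples` and the example IDs are hypothetical.

```python
def compare_samples(search_ids, stream_ids):
    """Summarize how two samples of tweet IDs overlap."""
    search, stream = set(search_ids), set(stream_ids)
    shared = search & stream
    union = search | stream
    return {
        "shared": len(shared),
        "only_search": len(search - stream),
        "only_stream": len(stream - search),
        # Jaccard index: shared items divided by all distinct items.
        "jaccard": len(shared) / len(union) if union else 0.0,
    }


# Hypothetical tweet IDs, for illustration only.
summary = compare_samples([1, 2, 3, 4, 5], [3, 4, 5, 6])
print(summary)
```

A high overlap alone says nothing about whether the tweets missing from one sample were dropped at random. As the authors noted, settling that question would require the full data set.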

Scholars are still working to assess these biases. Mr Jünger is currently running a research program, titled “When is a Like a Like?”, in which he and his colleagues attempt to better understand what researchers can do with the data made available by different platforms. The first results are expected to be published in 2022.


Several platforms have been caught red-handed falsifying information for corporate gain. In 2017, the British Competition and Markets Authority investigated the practice of some hotel reservation websites of pressuring consumers by displaying messages such as “only 1 room left!” although many rooms were still available. The authority announced last year that several websites agreed to refrain from using deceptive practices. It remains unclear whether they complied, according to Which?, a consumer-rights magazine.

In 2019, the US Federal Trade Commission announced that it would sue a dating service after the platform knowingly let scammers target prospective users. While the platform knew that some messages were inauthentic and prevented them from being shown to paying users, it still allowed scammers to reach out to non-paying users in the hope that they would buy a subscription to the service.

Data falsification is a punishable offense when it misleads consumers. I am not aware of a legal case where a company was fined for misleading researchers.

Handling data biases

In an upcoming article, US scholars Angela Xiao Wu and Harsh Taneja argue that data provided by online platforms can never be free from biases. Instead, researchers should investigate the reasons why a platform provides data in a certain way, they wrote. When researchers take platform data at face value, they do little more than produce analytics that benefit the companies themselves.

Mr Jünger of the University of Greifswald concurs. If one considers platforms as “socio-technical systems”, investigating data quality really means investigating which mechanisms produced the data, he told AlgorithmWatch. The platforms themselves become the object of investigation, he said. To do so, he analyzes in particular what the companies operating the platform publish, what content tends to be posted on different platforms, and what users themselves have to say about it.



Source: AlgorithmWatch News (CC BY 4.0).
