Data Access

By: Sara Bundtzen

Ensuring public interest researchers have what is considered ‘meaningful access’ to social media data (data access) is essential for evidence gathering, informed decision-making, and platform accountability. Yet, researchers continue to face barriers to accessing the data needed to establish a complete picture of platforms’ content moderation and curation systems – and, more broadly, their impact on users and society. A data access infrastructure, backed by enforcement powers that are enshrined in legislation and co-regulatory frameworks, should aim to grapple with the information asymmetry (lack of equal access) between platforms and researchers. 

Glossary  

Application Programming Interfaces (APIs) are software intermediaries that allow two applications to communicate with each other. APIs have a huge range of uses, but in the context of this Explainer, they allow researchers to access certain types of data from some online platforms via requests. As an intermediary, APIs also provide an additional layer of security by not allowing direct access to data, alongside logging, managing and controlling the volume and frequency of requests. 
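
For illustration, the sketch below shows what a researcher-facing API request might look like. The endpoint, token and parameter names are hypothetical placeholders rather than any platform's actual API; real APIs differ in authentication, pagination and rate-limiting behaviour.

```python
import time
import requests

API_URL = "https://api.example-platform.com/v1/public_posts"  # hypothetical endpoint
TOKEN = "RESEARCHER_ACCESS_TOKEN"  # placeholder for a token issued after vetting

def fetch_posts(query: str, max_pages: int = 3) -> list[dict]:
    """Request public posts matching a query, respecting the API's rate limits."""
    posts, cursor = [], None
    for _ in range(max_pages):
        resp = requests.get(
            API_URL,
            params={"q": query, "cursor": cursor},
            headers={"Authorization": f"Bearer {TOKEN}"},
            timeout=30,
        )
        if resp.status_code == 429:  # rate limited: wait as instructed, then retry
            time.sleep(int(resp.headers.get("Retry-After", 60)))
            continue
        resp.raise_for_status()
        data = resp.json()
        posts.extend(data.get("posts", []))
        cursor = data.get("next_cursor")
        if not cursor:  # no further pages of results
            break
    return posts

if __name__ == "__main__":
    print(len(fetch_posts("climate change")))
```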

Blockchain technology – the technology that underpins cryptocurrencies such as Bitcoin – commonly refers to a digital database that stores and distributes data in a decentralised and public peer-to-peer network. It stands out from other technologies due to its transparent and decentralised structure, in which data is stored in multiple locations and continuously compared and updated. Blockchain technology allows for pseudonymous transactions and communication, which makes it attractive for malign use.  

Data donations (including crowdsourcing and surveying methods) involve users of platforms voluntarily reporting certain content to researchers through mechanisms such as browser extensions or reporting forms. 

Data scraping is the process of collecting data directly and independently from a platform, typically by writing code to automatically process a website’s HTML/CSS (the code that the website’s visual interface is written in). 
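
As a minimal illustration of scraping, the sketch below downloads a hypothetical public page and extracts the text of elements marked up as posts. The URL and CSS selector are invented for this example; real pages use their own markup, may render content via JavaScript, and may prohibit scraping in their Terms of Service.

```python
import requests
from bs4 import BeautifulSoup  # pip install beautifulsoup4

URL = "https://www.example-platform.com/public-page"  # hypothetical public page

def scrape_post_texts(url: str) -> list[str]:
    """Download a public page and extract the text of elements marked as posts."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    # "div.post-text" is an illustrative selector; real sites use their own markup,
    # which can change without notice.
    return [node.get_text(strip=True) for node in soup.select("div.post-text")]

if __name__ == "__main__":
    for text in scrape_post_texts(URL):
        print(text)
```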

The Fediverse is an interconnected group of servers hosted by a multitude of individuals rather than one central company. Together the servers form a decentralised network.  

Public interest research commonly refers to research with the explicit aim to develop society’s collective knowledge. Regulatory precedent suggests that public interest research must be independent of commercial interests and reveal the source of its funding. Public interest researchers are not necessarily linked to academic institutions and can also include researchers affiliated to non-profit or media organisations. 

Sock puppet accounts are fictitious accounts that simulate real users on a platform. Researchers use sock puppet accounts to understand what a particular user profile, or set of user profiles, may experience on a platform. The data generated by the platform in response to the programmed (fictitious) users is recorded and analysed.  

Regulatory and co-regulatory frameworks  

EU: The Digital Services Act (DSA) introduces harmonised rules across the European Union (EU) for intermediary services to ensure a safe and accountable online environment, including the effective protection of users’ fundamental rights online. 

EU: Strengthened Code of Practice on Disinformation 2022 (CoPD) empowers industry to adhere to self-regulatory standards to combat disinformation. 

EU: European Digital Media Observatory’s (EDMO) draft Code of Conduct on platform-to-researcher data access includes guidelines on how platforms can share data with independent researchers while protecting users’ rights, as described under Article 40 of the General Data Protection Regulation (GDPR).  

US: Platform Accountability and Transparency Act (PATA), introduced by Senator Coons (D) alongside Senators Cassidy (R), Klobuchar (D), Cornyn (R), Blumenthal (D), and Romney (R), is a bipartisan bill that would support research about the impact of digital communication platforms on society by providing privacy-protected, secure pathways for independent research on data held by large internet companies. 

US: Social Media Disclosure And Transparency of Advertisements Act of 2021 (Social Media DATA Act), introduced by Representative Trahan (D), is a bill that would require certain consumer-facing websites and mobile applications to maintain advertisement libraries and make them available to academic researchers and the Federal Trade Commission. 

US: Digital Services Oversight and Safety Act of 2022 (DSOSA), introduced by Representative Trahan (D), is a bill that would provide for, among other things, the facilitation of independent research on covered platforms through the Federal Trade Commission.  

US: American Data Privacy and Protection Act (ADPPA), co-sponsored by Representatives Pallone (D) and McMorris Rodgers (R) and Senator Wicker (R), is a bipartisan bill that would create a comprehensive federal consumer privacy framework, including exceptions for “publicly available information” and for “research” purposes.

A changing landscape – New rules versus restricted access 

The lack of data access not only undermines the development of knowledge about human experiences, societal phenomena and trends in the online ecosystem, but will also soon affect assessments of compliance with obligations introduced by the EU’s Digital Services Act (DSA). Having entered into force in November 2022, the DSA is the first major piece of legislation containing provisions on data access for researchers. These will allow researchers to access data from very large online platforms (VLOPs) – those with more than 45 million monthly active users in the EU – to conduct research on systemic risks. Such research may comprise monitoring platform actions to tackle illegal content, such as illegal hate speech, as well as a range of other societal risks such as the spread of disinformation, and risks that may affect users’ fundamental rights. 

In the US, members of Congress have introduced numerous bills with provisions calling for external researcher access to platform data, including the Platform Accountability and Transparency Act, the Social Media DATA Act, and the Digital Services Oversight and Safety Act. If passed, these bills could address concerns about platforms’ content moderation – ranging from claims that platforms are biased and too much content is removed, to criticism that platforms are not doing enough to tackle illegal hate speech.  

While data access will be essential for national regulators who hold enforcement powers, as well as independent researchers who scrutinise platform action, some tech companies have started to restrict data access by cutting off or raising the costs of access to APIs. In February 2023, Twitter announced that it would no longer support free access to its API. The Coalition for Independent Technology Research released an open letter criticising Twitter’s API plans, saying they “will devastate public interest research” – and noting that over 250 projects would be jeopardised by the end of free and low-cost API access, including research into the spread of harmful content, mis- and disinformation, news consumption, public health, and elections. 

Ultimately, data access for the purpose of public interest research is needed to support scrutiny and accountability of both platform action and government intervention – to ensure they are fit for purpose to protect users against harassment, abuse and incitement, while not setting precedent that threatens users’ rights of privacy and freedom of expression. Current barriers to access – including technological, legal and ethical challenges – often stand in the way of public interest research that aims to understand the spectrum of actors, content and behaviour on both mainstream and emerging alternative platforms.  

Barriers to data access

Barriers to access currently risk undermining researchers’ ability to conduct causal research over time, for example, understanding the effects of platforms’ recommendations on user experiences. This is especially challenging when it comes to ‘disinformation studies’ which have been criticised at times for seeming to favour “rapid and attention-grabbing results over those deriving from more time-consuming and rigorous approaches.” Barriers have required researchers to resort to an array of independent data collection methods (such as sock puppet accounts or data donations), rather than accessing data directly from the platforms. This, combined with a lack of common data documentation practices and quality standards in the field, has made advancing cumulative research and peer-reviewing results more difficult. In sum, researchers are facing three types of barriers to data access: 

  • Technological barriers may arise from platforms deliberately restricting access to data or having technological features which inadvertently create barriers. For example, certain content formats, particularly audio or audiovisual content, are hard to search through systematically, making video-sharing platforms such as YouTube or image-based platforms such as Instagram difficult to analyse at scale. Moreover, the use of blockchain technology by platforms such as Odysee and the emergence of decentralised networks such as the Fediverse represent relatively unexplored territory for systematic data collection.
  • Legal and/or ethical barriers may arise from platforms’ data privacy concerns. For example, the use of third-party technologies (such as browser extensions) for data donation purposes that are prohibited by the platforms’ Terms of Service could lead to legal action. Furthermore, platform efforts to prevent automated data collection through scraping or other methods inadvertently hinder researchers from verifying platforms’ compliance with their own guidelines. Platforms’ retention of data is a further issue, given research demands for deleted content, including data that platforms removed due to violations of their Community Guidelines; for example, some types of research require examining deleted content that could provide evidence of criminal activities. Ethical barriers also arise from uncertain expectations of user privacy, when the platform interface lies in a grey space between private and public (such as large WhatsApp or Telegram groups), as well as from emerging questions of informed consent from users.
  • Fragmentation barriers may arise when data that is publicly available is scattered among vast amounts of sources or features that cannot be searched systematically via platform-wide functions or through an API. For example, Discord’s public groups can only be searched server-by-server (individual channels on the messaging platform are known as servers) and not in a systematic way. Moreover, platforms use metrics with varying definitions and opaque methodology behind how they are tallied. For example, how individual ‘views’ are counted and what they describe can differ between platforms. This adds to the difficulty of comparing behaviour and content across platforms.

Categorisation – What types of social media data are needed?

As the online communication and information ecosystem evolves, so too does the nature of conducting research. Before outlining the reasoning behind data access and potential research questions, it is important to explain what types of social media data the research community may be interested in: 

  • User-generated data includes information about user content and behaviour on a platform. This comprises data gathered from content such as posts and comments, as well as user behaviour, including likes, shares and other types of engagement. This data can be ‘public’ (for example, a post accessible to any member of the public) or ‘private’ (for example, a post shared in a private or closed chat) – though researchers lack a common definition of ‘private online spaces’, which creates additional ethical barriers and considerations. Data may include non-aggregated, individual-level data (personal data) as well as aggregated data on reach (unique number of users who saw a post at least once), impressions (number of times a post was seen) and engagement (likes, comments, shares) – see the short sketch after this list for how these aggregated metrics differ. Platform APIs may enable access to public data with varying requirements and restrictions.
  • Platform curation data includes information relating to how human- and algorithmic systems moderate and sort (e.g. boost or demote) content on a platform. This also includes community and recommendation guidelines (e.g. content moderation policies), and how they are enforced, including by means of content removal, content demotion or account suspension. Transparency reporting of curation data may include aggregated information about content moderation decisions at scale, specifying the type of content, the detection method, the type of restriction applied, and whether removal or suspension was due to the Community Guidelines, legal requirements or government removal requests. This type of data is usually available in a non-machine-readable format (such as PDF documents containing tables of data), making further analysis difficult. On a granular level, platform curation data can include signals or ‘tags’ associated with specific types of content or accounts used for content moderation as well as recommendation algorithms.
  • Platform decision-making data includes information about internal decision-making, including decisions related to the introduction of new features on the platform or experiments conducted by platforms to test and evaluate the ranking algorithms of the recommender systems. For example, such data may include information about changes intended to increase certain types of engagement. Concretely, this can be the quantitative figures from the outcome of experiments with ranking systems. Information about methodology and decision-making would be accessible in the form of qualitative information. Researchers would rely on access to platform employees or company leadership, either through on-site inspections and interviews or access to internal documents, decision-making processes, and communications. In part, the so-called ‘Twitter Files’ uncovered this type of data, albeit with caveats regarding its selectivity and verifiability.
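
To make the distinction between the aggregated user-generated metrics above concrete, the short sketch below computes reach, impressions and engagement from a small, hypothetical and already-pseudonymised event log; the field names and values are invented for illustration and do not reflect any platform’s actual data model.

```python
import pandas as pd

# Hypothetical, pseudonymised event log: one row each time a user sees or engages with a post.
log = pd.DataFrame([
    {"post_id": "p1", "user_id": "u1", "event": "view"},
    {"post_id": "p1", "user_id": "u1", "event": "view"},
    {"post_id": "p1", "user_id": "u2", "event": "view"},
    {"post_id": "p1", "user_id": "u3", "event": "view"},
    {"post_id": "p1", "user_id": "u2", "event": "like"},
    {"post_id": "p1", "user_id": "u3", "event": "share"},
])

views = log[log["event"] == "view"]
reach = views.groupby("post_id")["user_id"].nunique()    # unique users who saw the post at least once
impressions = views.groupby("post_id").size()            # total number of times the post was seen
engagement = (log[log["event"].isin(["like", "comment", "share"])]
              .groupby("post_id").size())                # likes, comments and shares combined

summary = pd.DataFrame({"reach": reach, "impressions": impressions, "engagement": engagement})
print(summary)  # for p1: reach = 3, impressions = 4, engagement = 2
```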

Reasoning behind data access – Advancement of collective knowledge

Researchers from multiple disciplines are interested in understanding a range of emerging social phenomena such as the spread of health-related misinformation, growing distrust in institutions or news consumption. This in turn requires a better (long-term) understanding of the relationship between the use of social media platforms and observed trends. Without reliable and searchable data access, researchers lack the resources to monitor and externally assess compliance with regulations such as the EU’s DSA that cover a range of risks and platform functionalities. Beyond compliance with regulation, social media data can serve as a proxy to assess multiple societal phenomena and trends, as well as human behaviour, attitudes or opinions. Scholars such as Nate Persily have argued that access to social media data has become a prerequisite to investigating and understanding most contemporary problems “in the real world” – whether in the context of election cycles, foreign interference, public health, or societal attitudes towards climate change, immigration or LGBTQ+ rights. Social media data can further offer timely and comprehensive datasets compared with traditional, retrospective social science methods, especially in crisis situations such as a global pandemic, conflict, natural disaster or terrorist attack. The overview below outlines sample indicative research questions, grouped by the type of data primarily required.

Within each group, the questions range from those directly linked to compliance with current platform regulation to those only indirectly linked.

Primarily user-generated data required
  • What is the prevalence of content that could be classified as “incitement to hatred” under the German penal code on Facebook?
  • How many views did video clips of RT and Sputnik broadcasting activities receive on YouTube one month prior and one month after Russia’s invasion of Ukraine?
  • How do discussions around the COVID-19 pandemic differ across Facebook and Twitter?
  • What online news outlets are shared most prominently among German-language influencers on Instagram?

Primarily platform curation data required
  • How effective are warning labels from independent fact-checkers or authoritative sources in reducing the spread of misinformation on Twitter?
  • What types of users are more likely to be exposed to content categorised as hate speech?
  • Do moderation decisions about what content is allowed on a platform affect some user groups disproportionately?
  • Are Instagram’s ‘Explore’ page algorithms systematically amplifying the visibility of cyber-abuse content?
  • What is the proportion of so-called ‘superusers’ that show hyperactive and abusive behaviour on Facebook? How can we measure the effect of ‘superusers’ on algorithmic feeds?
  • How does historical user behaviour impact YouTube ‘Shorts’ recommendation algorithms? What is the role and impact of feedback loops between user behaviour and algorithmic recommendations?
  • How do users adapt their posting behaviour in response to a changed choice architecture of a platform (referring to the platform design)? For example, how did user interactions change when Facebook introduced the ‘angry’ reaction?
  • To what extent does revealing the source of factual interventions affect the likelihood of users sharing misinformation? Does context added to posts, such as Twitter’s Community Notes, mitigate the spread of false and misleading information? To what extent do people from different points of view find them helpful?
  • Does opting for a reverse-chronological timeline over an algorithmic feed alter the ‘stickiness’ of social media platforms (i.e. whether users spend more time, or engage more, on a platform)?

Primarily platform decision-making data required
  • Are high-profile users treated preferentially in content moderation processes?
  • Are TikTok’s algorithms intentionally demoting Black Lives Matter activists, i.e. reducing how frequently their videos appear on the ‘For You’ feed?
  • Are users able to silence others through the misuse of moderation tools or through systemic harassment designed to censor certain viewpoints?
  • Is it possible to generate a quantitative estimate of the proportion of reach and engagement resulting from algorithmic ‘amplification’?
  • How could platforms and researchers assess user behaviour in a ‘counterfactual’ scenario, e.g. comparing user groups engaging with algorithmic vs. reverse-chronological feeds?
  • How are Meta’s Oversight Board decisions received by company leadership? What effect do these decisions have on the content moderation practices of other companies?
  • How do ranking and product teams at social media companies decide on and use experiments to test and evaluate changes to the algorithms?

Privacy-compliant access to data

Access to social media data, including personal data, can activate data privacy obligations included in the EU’s GDPR, in particular its special regime for data processing for research purposes. The European Digital Media Observatory’s (EDMO) Working Group on Platform-to-Researcher Data Access has published a draft Code of Conduct, which establishes a process under which researchers can be given access to personal data in compliance with the GDPR. Specifically, the Code notes that the GDPR provides exceptions to the limitations on processing personal data for research purposes, provided that appropriate safeguards are implemented. Safeguards can relate to informed consent, data storage (retention periods and criteria), pseudonymisation (processing personal data in such a way that this data can no longer be attributed to an individual), commingling of data (e.g. combining crowdsourced data with API data) or sharing with third parties (such as research partners). 
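
As an illustration of the pseudonymisation safeguard mentioned above, the sketch below replaces user identifiers with keyed hashes before analysis, so that records remain linkable within a dataset without exposing the original identifiers. The key, function and record names are illustrative assumptions; this is a simplified sketch of one possible technique, not a description of what the GDPR or the draft Code prescribes.

```python
import hmac
import hashlib

# Illustrative only: in practice the key would be held separately (e.g. by the data
# controller) and never hard-coded, so the mapping cannot be reversed by the researcher.
SECRET_KEY = b"held-separately-by-the-data-controller"

def pseudonymise(user_id: str) -> str:
    """Replace a user identifier with a keyed hash so records stay linkable
    across the dataset without exposing the original identifier."""
    return hmac.new(SECRET_KEY, user_id.encode("utf-8"), hashlib.sha256).hexdigest()

records = [
    {"user_id": "alice@example.com", "post": "example post A"},
    {"user_id": "bob@example.com", "post": "example post B"},
]
pseudonymised = [{**r, "user_id": pseudonymise(r["user_id"])} for r in records]
print(pseudonymised)
```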

In the US, the proposed American Data Privacy and Protection Act would establish similar exceptions and allow for the collecting, processing, or transferring of data that is “reasonably necessary, proportionate, and limited to […] a public or peer-reviewed scientific, historical, or statistical research project that is in the public’s interest.” The bill does not contain specific privacy or security safeguards but refers to existing laws governing such research. 

Regulatory precedent in the EU suggests that “manifestly-made public data” and data that is “publicly accessible” such as posts from public pages and public groups should be made available to researchers through APIs or other visual interfaces, with certain access criteria in place (see below). In the US, PATA would require platforms to make available to the public through APIs “reasonably public content that has been highly disseminated; or was originated or spread by major public accounts.” The latter is defined as accounts “whose content is followed by or reaches at least 25,000 users per month.” Such data would also include “the number of impressions, reach, and engagements.”  

However, when using thresholds, the type of data deemed “publicly accessible” can change, and keeping track of these developments requires a shared taxonomy of what data is ‘publicly accessible’. For example, platforms can clarify which features of their platforms are truly public or private, and set reasonable thresholds for the number of users that can participate in private online spaces. In terms of accessing API endpoints, platforms should provide comprehensive public documentation about legitimate use cases and research requirements, as well as technical specifications and support. While API access will require some form of ‘light vetting’ to prevent malicious or commercial uses, access should be free or available at a nominal cost for researchers. Higher costs risk a de facto inability to access data, or inequity among less well-resourced research organisations. 

Implementation of data access provisions – What’s on the horizon

Although companies conduct internal research experiments based on the data they collect, they rarely share the outcomes and methodology of these studies with the public and policymakers. This puts into question the effectiveness of their self-regulatory commitments and reiterates the need for legislation that grants regulators enforcement powers. 

Article 40.4 of the DSA provides for vetted researcher access to data from VLOPs. Researchers who meet certain conditions (e.g. being affiliated to a research organisation, independent of commercial interests, and capable of fulfilling specific data security and confidentiality requirements) and have been vetted by a national regulator will gain access to data for the “sole purpose of conducting research that contributes to the detection, identification and understanding of systemic risks […] and the assessment of the adequacy, efficiency and impacts of the risk mitigation measures.” Specifically, such data access will enable research on illegal content; negative effects on a range of fundamental rights such as the prohibition of discrimination; negative effects on civic discourse, electoral processes, public security, gender-based violence, public health, or minors; as well as serious negative consequences for users’ physical and mental well-being.  

Article 40.12 of the DSA further stipulates that access to data “publicly accessible in their online interface” should be made available, where possible, in real-time to researchers, “including those affiliated to not-for-profit bodies, organisations and associations.” Though researchers do not need to be affiliated with a research organisation, they still need to meet certain conditions. These data access provisions will be applicable across the EU by 17 February 2024. The European Commission will adopt a delegated act (a secondary piece of EU legislation) to specify the conditions under which accessing and sharing of data under the DSA can take place. 

In parallel, company signatories of the 2022 Strengthened Code of Practice on Disinformation committed to voluntary standards that will serve as co-regulatory measures for the DSA. The Code includes the commitment to “continuous, real-time or near real-time, searchable stable access to non-personal data and anonymised, aggregated, or manifestly-made public data for research purposes on Disinformation through automated means such as APIs.”  

Signatories also committed to developing, funding and cooperating with an “independent, third-party body that can vet researchers and research proposals”, which may serve as the “independent advisory mechanism” envisioned in the DSA. EDMO launched a Working Group to discuss the creation of such an independent body. Representatives of VLOPs, academia and civil society will develop an organisational model for a body that will facilitate data sharing between platforms and researchers. This body could streamline the management of access requests and the vetting of researchers. It could further help with capacity-building measures in the research community, the development of standards of review and certification processes for platform data and codebooks, as well as the formulation of agreed-upon data dictionaries and glossaries for data access requests. Given that research into systemic risks is an interdisciplinary effort involving global research teams, an intermediary body should support international coordination and the extraterritorial applicability of Article 40 of the DSA, enabling cross-border research collaborations and partnerships with researchers in non-EU countries.  

_________________________________________________________________________________

This Explainer was uploaded on 4 July 2023. 
