Mapping the Next Frontier of Open Data: Corporate Data Sharing

16 September 2014

(cross-posted at the UN Global Pulse Blog)
When it comes to data, we are living in the Cambrian Age. About ninety percent of the data that exists today has been generated within the last two years. We create 2.5 quintillion bytes of data on a daily basis—equivalent to a “new Google every four days.”
All of this means that we are certain to witness a rapid intensification in the process of “datafication,” which is already well underway. Use of data will grow increasingly critical. Data will confer strategic advantages; it will become essential to addressing many of our most important social, economic and political challenges.
This explains, at least in large part, why the Open Data movement has grown so rapidly in recent years. More and more, it has become evident that data access and use are emerging as one of the transformational opportunities of our time.
Today, it is estimated that over one million datasets have been made open or public. The vast majority of this open data is government data—information collected by agencies and departments in countries as varied as India, Uganda and the United States. But what of the terabyte after terabyte of data that is collected and stored by corporations? This data is also quite valuable, but it has been harder to access.
The topic of private sector data sharing was the focus of a recent conference organized by the Responsible Data Forum, Data and Society Research Institute and Global Pulse (see event summary). Participants at the conference, which was hosted by The Rockefeller Foundation in New York City, included representatives from a variety of sectors who converged to discuss ways to improve access to private data, that is, the data held by private entities and corporations. The push for that access was rooted in a broad recognition that private data has the potential to foster much public good. At the same time, a variety of constraints (notably privacy and security, but also proprietary interests and data protectionism on the part of some companies) hold back this potential.
The framing for issues surrounding sharing private data has been broadly referred to under the rubric of “corporate data philanthropy.” The term refers to an emerging trend whereby companies have started sharing anonymized and aggregated data with third-party users who can then look for patterns or otherwise analyze the data in ways that lead to policy insights and other public good. The term was coined at the World Economic Forum meeting in Davos, in 2011, and has gained wider currency through Global Pulse, a United Nations data project that has popularized the notion of a global “data commons.”
Although still far from prevalent, some examples of corporate data sharing exist. Here is a sampling of those discussed at the conference:

In Ivory Coast and Senegal, Orange Telecom hosted a Data for Development Challenge that allowed researchers to use anonymized, aggregated data to help solve various development problems, including those related to transportation, health, and agriculture.
South Africa-based telecom MTN makes anonymized call records available to researchers through a trusted intermediary, Real Impact Analytics, a data analytics firm that provides guided and predictive analytics solutions.
Last year, regional bank BBVA hosted a contest, “Innova Challenge Big Data,” inviting developers to create applications, services and content based on anonymous card transaction data. The first prize went to an application called Qkly, which helps users plan their time by estimating what time of day a given place will be most crowded so that they can avoid lines.

Taxonomy of current corporate data sharing efforts

For all the growing attention corporate data sharing has recently been receiving, it remains very much a fledgling field. Much remains to be defined and understood. There has been little rigorous analysis of different ways of sharing, though our survey of the landscape identified six main categories of activity to date:

Academic research partnerships, in which corporations share data with universities and other research organizations. For instance:

o Using anonymized data from Safaricom, one of Kenya’s leading mobile companies, researchers from the Harvard School of Public Health mapped how human travel patterns contributed to the spread of malaria in the country.

o Just recently, popular online communities have joined forces with a select number of academic institutions as a part of the Digital Ecologies Research Partnership (DERP) in order to promote research on Internet social behavior.

Prizes and challenges, in which companies make data available to qualified applicants who compete to develop new apps or discover innovative uses for the data. In its 2014 Dataset Challenge, Yelp is making its data on restaurants in cities like Phoenix, Madison, and Edinburgh available to academic researchers to build models and provide research on urban trends and behavior (such as whether Yelp data can help predict environmental conditions of restaurants).
Trusted intermediaries, where companies share data with a limited number of known (often commercial) partners. For example, Twitter recently acquired the social media aggregator Gnip in order to provide its data products to clients.
Application programming interfaces (APIs), which allow developers and others to access data for testing, product development, and data analytics. Through metadata and click tracking, Bitly’s Social Data API estimates social trends and allows users to build tools from real-time data.
Intelligence products, where companies share (often aggregated) data that provides general insight into market conditions, customer demographic information, or other broad trends. Google shares search query-based data through Google Flu Trends, which estimates the current level of influenza activity in conjunction with traditional health surveillance systems.
Corporate data cooperatives or pooling, in which corporations group together to create “collaborative databases” with shared data resources. In its “Big Data Challenge,” Telecom Italia pooled its data with partners from various Italian industries (local news, automobile, energy and weather) into one aggregated, geo-referenced dataset for participants to use in the competition. The data was available in batches and through an API, and contained millions of call data records, energy consumption records, tweets, and weather data points.

Beyond such broad taxonomies, there exists almost no systematic analysis of corporate data sharing. Much research remains to be done on the value proposition for corporations doing the sharing (or, indeed, for end-users), and on ways to maximize the potential and, importantly, minimize the potential harms of shared data.

Help us map the field

A more comprehensive mapping of the field of corporate data sharing would draw on a wide range of case studies and examples to identify opportunities and gaps, and to inspire more corporations to allow access to their data (consider, for instance, the GovLab Open Data 500 mapping for open government data). From a research point of view, the following questions would be important to ask:

What types of data sharing have proven most successful, and which ones least?
Who are the users of corporate shared data, and for what purposes?
What conditions encourage companies to share, and what are the concerns that prevent sharing?
What incentives can be created (economic, regulatory, etc.) to encourage corporate data philanthropy?
What differences (if any) exist between shared government data and shared private sector data?
What steps need to be taken to minimize potential harms (e.g., to privacy and security) when sharing data?
What’s the value created from using shared private data?

We (the GovLab; Global Pulse; and Data & Society) welcome your input to add to this list of questions, or to help us answer them by providing case studies and examples of corporate data philanthropy. Please add your examples below, use our Google Form or email them to us at [email protected]

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License