We’ve all heard how data science will transform (if it hasn’t already) the business landscape, touching everything from our supermarkets to our hospitals and our airlines to our credit cards. Most companies in these areas use proprietary information from millions of private transactions to gain insight into our behavior that, in turn, allows them to turn a profit. However, if you are an amateur data scientist, a hobbyist, a student or a data-minded citizen, this information is typically off limits. And a simulation just isn’t enough, because it doesn’t meaningfully replicate the complexity and multi-dimensionality of real data.
Public Data Sets
How about all the publicly available information though? Now, here’s an underused treasure trove for data scientists. Concerns about the quality of data aside, open data provides unparalleled opportunities. There are typically no usage restrictions for data in the public domain, and stitching together disparate sources of data (census, crime records, traffic, air pollution, public maps, etc.) gives you the opportunity to test interactions between various data sets. Possibly the most complete list of public datasets is available at this GitHub page.
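Stitching together disparate sources usually comes down to joining tables on a shared key, such as a district or region name. Here is a minimal sketch of that idea using pandas; the census and crime figures are invented for illustration, not drawn from any real dataset.

```python
import pandas as pd

# Hypothetical extracts from two separate public releases:
# a census table and a crime-records table, keyed by district.
census = pd.DataFrame({
    "district": ["North", "South", "East"],
    "population": [120000, 95000, 60000],
})
crime = pd.DataFrame({
    "district": ["North", "South", "West"],
    "incidents": [340, 210, 150],
})

# An outer join keeps districts that appear in only one source,
# which immediately surfaces coverage gaps between the two datasets.
merged = pd.merge(census, crime, on="district", how="outer")

# A derived rate is only defined where the two sources overlap.
merged["incidents_per_1k"] = 1000 * merged["incidents"] / merged["population"]
print(merged)
```

The outer join is the interesting design choice here: an inner join would silently drop the districts that only one dataset covers, and with public data those gaps are often exactly what you need to investigate first.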
Notice I said ‘concerns about the quality of data’ in the previous paragraph? That can be a massive problem. The biggest impediment to using public data is its unreliability. Often, the data sets are poorly indexed or incomplete. Even more commonly, these public stores of information are held in formats that resist data wrangling: scanned documents and hand-written ledgers don’t lend themselves to easy analysis. So a large part of any public data project ends up being a transcription effort. Web scraping, dimensionality reduction, imputation, bias removal and normalization are all skills that a data scientist needs to develop when working with public, open data.
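Two of those skills, imputation and normalization, can be sketched in a few lines. The figures below are made up to stand in for a messy public-data extract; mean imputation and min-max scaling are the simplest possible choices, not the only ones.

```python
import numpy as np
import pandas as pd

# Illustrative, messy extract: missing values and mixed scales,
# as you might get from transcribed or hand-entered records.
df = pd.DataFrame({
    "rainfall_mm": [120.0, np.nan, 95.0, 110.0],
    "temperature_c": [31.0, 29.5, np.nan, 30.0],
})

# Mean imputation: fill each gap with that column's mean.
df_imputed = df.fillna(df.mean())

# Min-max normalization: rescale every column into the [0, 1] range
# so variables with different units become comparable.
df_norm = (df_imputed - df_imputed.min()) / (df_imputed.max() - df_imputed.min())
print(df_norm)
```

Mean imputation is a blunt instrument; for real public datasets you would want to ask *why* a value is missing (a scanning error and a deliberate omission call for different treatments) before choosing a strategy.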
Where is all this public data?
Of course, there are some extremely powerful sources of public data with somewhat clean, reliable and ready-to-use data as well. For government and public sector data, the first port of call is India’s Open Government Data Platform, which includes robust data on commodity prices, national statistics, company registers and even government budgets. Macroeconomic data is best sourced from the World Bank or from Google’s Public Data Explorer. The Public Data Explorer stitches together information from a range of sources (IMF, World Bank, OECD, university libraries and even Google’s own data collection efforts), and contains some slick, interactive visualizations. Other interesting sources include Reserve Bank of India data for banking, forex and CPI information, and Bhuvan, ISRO’s geo-platform, for geographical data.
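The World Bank exposes its indicators through a public REST API, so you can pull series like GDP straight into a script. The sketch below only constructs the request URL (no network call is made), following the v2 URL pattern from the Bank’s API documentation; check the current docs before relying on it.

```python
from urllib.parse import urlencode

def worldbank_url(country: str, indicator: str, start: int, end: int) -> str:
    """Build a World Bank API v2 request URL for one indicator and country."""
    base = f"https://api.worldbank.org/v2/country/{country}/indicator/{indicator}"
    query = urlencode({"format": "json", "date": f"{start}:{end}", "per_page": 100})
    return f"{base}?{query}"

# GDP in current US$ (indicator NY.GDP.MKTP.CD) for India, 2010-2015.
url = worldbank_url("IND", "NY.GDP.MKTP.CD", 2010, 2015)
print(url)
```

Fetching that URL with any HTTP client returns JSON that loads directly into a DataFrame, which is what makes sources like this so much friendlier than scanned ledgers.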
Recognizing just how time-intensive and complicated data cleaning and collation can be, some interesting companies now focus on getting you clean data sets. Not surprisingly, they concentrate on the most immediately lucrative sector – finance. Quandl provides some intriguing financial data sets for free, including the renowned Big Mac price index, and all the data is designed to be easily imported and ready for use in minutes. Another company challenging the traditional (paid) data powerhouses is StockTwits. Their API allows you to get real-time data for free all day, every day. If you want historical data (going back about 3–5 years), numerous users have pulled it through StockTwits and published data sets that you can easily repurpose.
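“Ready for use in minutes” usually means the provider ships a tidy CSV. The sketch below mimics that workflow with an in-memory CSV so it runs anywhere; the column names and prices are invented placeholders, not any provider’s actual schema.

```python
import io
import pandas as pd

# Stand-in for a CSV downloaded from a provider such as Quandl.
# Real files swap io.StringIO(csv_text) for a file path or URL.
csv_text = """date,price_usd
2016-01-31,4.93
2016-02-29,4.97
2016-03-31,5.04
"""

# Parsing dates at read time gives an immediately usable time series.
prices = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"], index_col="date")

# One line from raw file to month-on-month returns.
print(prices["price_usd"].pct_change().round(4))
```

That one-liner from file to returns is the whole pitch of these services: the cleaning has already been done, so analysis starts on line one.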
If you’re the sort who likes a competitive challenge rather than tinkering with datasets by yourself, there are some wonderful competition platforms that pair public datasets with a well-defined problem statement. The best known is Kaggle, whose competition problems include flu forecasting and automated essay scoring. Kaggle also hosts a set of very interesting data sets for the self-driven data scientist. DrivenData is another such platform, albeit with a more limited selection of competitions.
Once you’re ready to meet and work meaningfully with others interested in data-driven solutions to social problems, you can seek out global movements like DataKind. Their efforts range from weekend marathons to long-term cross-sector engagements. Earlier this year, DataKind’s Bangalore chapter created a tool to help you understand various aspects of the Union Budget for 2016-17. The source code is public and entirely open to being repurposed for use on any other data set. There are also academic paths to learning and collaboration in data science – the most prominent of which is the University of Chicago’s Data Science for Social Good fellowship.
Public datasets offer the best opportunity to learn, experiment and produce valuable analytical insights to benefit society. In a world where data is an increasingly valuable currency, these public data sets are perhaps the last bastion for the precious, complex data necessary to draw meaningful conclusions about the way we live.