Top 20 Free Dataset Sources for Data Science Projects

Overview

With over 147 zettabytes of data available worldwide, data science enthusiasts have access to a vast array of free, high-quality datasets to fuel their projects. Spanning industries like finance, healthcare, social media, and climate science, this wealth of data offers endless opportunities to sharpen skills, tackle real-world challenges, and build a standout portfolio.

While many high-quality datasets remain behind company firewalls and paywalls, others are freely accessible, offering a multitude of opportunities for exploration and insight. The real challenge lies in finding the right dataset: one with minimal outliers and a sample size large enough to support meaningful insights for your project.

To help you focus on your project rather than on data quality, we’ve gathered the best websites and repositories for free data science datasets, so you can build on a solid foundation.

Explore the Top 20 Free Dataset Sources for Data Science Projects

1. Kaggle Datasets

Beyond its vast collection, Kaggle’s datasets are often accompanied by community-driven notebooks and tutorials, making it an excellent resource for beginners and experts alike. You can also discuss specific datasets with other users, which makes for a collaborative learning environment. Advanced users can tackle more complex datasets or participate in Kaggle competitions to challenge their skills. To get started, search for the popular Titanic dataset, which is used in one of Kaggle’s beginner ML competitions.

  • Approximate Size: Over 50,000 datasets.
  • History: Launched in 2010, Kaggle started as a platform for machine learning competitions.
  • Topics: Machine learning, image data, natural language processing, healthcare, big data, predictive modeling, climate science, urban transportation, retail and e-commerce, and finance.
  • Source: User-generated, public data contributions, companies sharing data for competitions, and open data repositories.

2. Google Dataset Search

If you’re unsure where to start looking for datasets, Google Dataset Search is like the “Google” of datasets—efficient and comprehensive. It indexes datasets from a wide range of sources, including government portals, research institutions, and public repositories. Whether you need healthcare statistics or social science information, this tool makes it easy to locate high-quality data. You can refine your search with filters for file formats, accessibility, or frequency of updates.

  • Approximate Size: Aggregates millions of datasets from across the web.
  • History: Google Dataset Search was launched in 2018 to help users find publicly available datasets across the web.
  • Topics: Wide-ranging, covering virtually any field from healthcare to social sciences.
  • Source: Aggregates from a variety of public data repositories, research institutions, government sources, and open-access data.

3. UCI Machine Learning Repository

While not as vast as the sources above, this repository is known for its simplicity and accessibility. UCI’s datasets are frequently small, clean, and structured, making them ideal for beginners in machine learning. It also includes detailed documentation on variables and previous research usage, facilitating reproducibility. Beginners can start with the Iris dataset or the Wine Quality dataset: both are classics in the field and well suited to learning foundational machine learning techniques.

  • Approximate Size: Around 600 datasets.
  • History: Established in 1987 by researchers at the University of California, Irvine, the UCI Repository was one of the first to provide datasets specifically for machine learning research.
  • Topics: Machine learning and AI, commonly used for academic research.
  • Source: Primarily academic sources, with datasets contributed by universities and researchers worldwide.

4. Data.gov

If you’re looking for real-world, continuously updated datasets, Data.gov is an invaluable resource. As the US government’s open data portal, it provides access to thousands of datasets across diverse topics like weather, healthcare, and agriculture. It’s also API-friendly, making it ideal for developers who want to integrate live data into their applications or analyses. If you’re new to working with APIs, try experimenting with simpler datasets like NOAA’s weather data or USDA agriculture statistics to learn the ropes of live data integration.

  • Approximate Size: Over 330,000 datasets.
  • History: Created in 2009 during the Obama administration as part of the Open Government initiative, Data.gov aimed to make US government data accessible to the public.
  • Topics: US government data in areas like health, climate, education, and agriculture.
  • Source: US federal, state, and local government agencies providing public-access data.
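
To make the API angle concrete, here is a minimal Python sketch of a catalog search. It assumes the Data.gov catalog exposes the standard CKAN package_search action at the URL shown; check the current Data.gov API documentation before relying on it. The helper names and the sample response are hypothetical and written only for illustration, and the snippet itself makes no network call.

```python
from urllib.parse import urlencode

# Assumption: the Data.gov catalog serves the standard CKAN search action here.
CATALOG_SEARCH_URL = "https://catalog.data.gov/api/3/action/package_search"


def build_search_url(query, rows=5):
    """Return a search URL for datasets matching a free-text query."""
    return CATALOG_SEARCH_URL + "?" + urlencode({"q": query, "rows": rows})


def summarize_results(payload):
    """Extract (title, sorted unique resource formats) from a CKAN-style response."""
    if not payload.get("success"):
        raise ValueError("catalog reported an unsuccessful request")
    summary = []
    for dataset in payload["result"]["results"]:
        formats = sorted(
            {res["format"] for res in dataset.get("resources", []) if res.get("format")}
        )
        summary.append((dataset.get("title", ""), formats))
    return summary


# Hand-written placeholder in the CKAN response shape, NOT real catalog output.
sample_payload = {
    "success": True,
    "result": {
        "count": 1,
        "results": [
            {
                "title": "Example weather observations",
                "resources": [{"format": "CSV"}, {"format": "JSON"}, {"format": "CSV"}],
            }
        ],
    },
}

print(build_search_url("weather"))
print(summarize_results(sample_payload))
```

To try it against live data, you could open the generated URL with urllib.request.urlopen, decode the body with json.load, and pass the resulting dictionary to summarize_results.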

5. AWS Public Datasets

For those tackling large-scale or computationally intensive projects, AWS public datasets are a valuable resource. AWS enables seamless analysis of massive datasets through its cloud infrastructure, with many datasets pre-loaded into Amazon’s storage services to minimize data transfer costs. It’s ideal for research requiring significant computing power, such as genomics or large-scale social data.

  • Approximate Size: Hundreds of datasets, totaling over 1 petabyte.
  • History: Launched in 2013, AWS Public Datasets provides free access to large datasets hosted on Amazon’s cloud platform.
  • Topics: Genomics, satellite imagery, economic data, and weather data.
  • Source: Publicly accessible datasets from academic institutions, government agencies, companies, and research collaborations.

6. FiveThirtyEight

FiveThirtyEight provides datasets in easy-to-use formats that are popular for data journalism and storytelling. Many of the datasets are accompanied by visualizations and analysis guides, helping users interpret data in the context of current events. Many FiveThirtyEight datasets also link back to the corresponding articles. Read these pieces to understand the context, methodology, and interpretation.

  • Approximate Size: Around 100 curated datasets.
  • History: Founded in 2008 by statistician Nate Silver, FiveThirtyEight is a data-driven news site focusing on politics, economics, and sports.
  • Topics: Sports, politics, economics, and cultural datasets.
  • Source: Generated by FiveThirtyEight’s data journalism team, often tied to research or analysis for published articles.

7. World Bank Open Data

The World Bank’s data is structured for cross-country comparisons, making it invaluable for development economics and policy research. Many datasets include long time-series data, allowing for historical analyses and trend tracking over decades. You may start by comparing cross-country datasets for specific key metrics such as poverty headcount ratios or health expenditures.

  • Approximate Size: Thousands of time-series datasets, covering 300+ indicators across 200+ countries.
  • History: Launched in 2010, World Bank Open Data was part of the bank’s commitment to open data and transparency.
  • Topics: Global development indicators, economics, population, and environment.
  • Source: Publicly sourced global development data from the World Bank and its member countries.
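
A cross-country comparison like the one suggested above can be sketched in a few lines of Python. This example assumes the World Bank Indicators API (v2) URL pattern and its two-element JSON response of metadata followed by records; treat both as assumptions to verify against the official API documentation. SI.POV.DDAY is used as an example poverty headcount indicator code, the function names are hypothetical, and the values in the sample payload are made-up placeholders, not real statistics.

```python
from collections import defaultdict

# Assumption: URL pattern of the World Bank Indicators API (v2).
API_TEMPLATE = (
    "https://api.worldbank.org/v2/country/{countries}"
    "/indicator/{indicator}?format=json&per_page={per_page}"
)


def build_indicator_url(countries, indicator, per_page=1000):
    """Build a request URL; multiple country codes are joined with ';'."""
    return API_TEMPLATE.format(
        countries=";".join(countries), indicator=indicator, per_page=per_page
    )


def pivot_by_country(payload):
    """Turn a [metadata, records] response into {country: {year: value}}."""
    _metadata, records = payload
    table = defaultdict(dict)
    for record in records or []:
        if record["value"] is None:  # skip years with no reported value
            continue
        table[record["country"]["value"]][int(record["date"])] = record["value"]
    # Sort each series by year so trends read chronologically.
    return {country: dict(sorted(series.items())) for country, series in table.items()}


# Placeholder payload in the assumed response shape. The numbers are invented.
sample_payload = [
    {"page": 1, "pages": 1},
    [
        {"country": {"id": "BR", "value": "Brazil"}, "date": "2021", "value": 5.0},
        {"country": {"id": "BR", "value": "Brazil"}, "date": "2020", "value": None},
        {"country": {"id": "BR", "value": "Brazil"}, "date": "2019", "value": 4.0},
        {"country": {"id": "IN", "value": "India"}, "date": "2019", "value": 10.0},
    ],
]

print(build_indicator_url(["BRA", "IND"], "SI.POV.DDAY"))
print(pivot_by_country(sample_payload))
```

Keeping the URL builder separate from the parser means the pivot logic can be checked offline before any live request is made.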

8. UNdata

UNdata is especially valuable for international research, as it aggregates data from multiple UN agencies and provides an interface for finding data across fields like education, poverty, and environment. It supports multilingual access, broadening its usability.

A quick tip: sustainability is, rightly, a prominent topic right now. For a global perspective on social issues, try analyzing datasets on gender equality or education enrollment ratios.

  • Approximate Size: Over 60 million data points across hundreds of datasets.
  • History: UNdata was established by the United Nations in the early 2000s to provide centralized access to statistical data from various UN bodies.
  • Topics: Global health, economics, demographics, and social indicators.
  • Source: Data collected by the United Nations and various UN agencies from member countries and international surveys.

9. European Union Open Data Portal

This portal offers access to official statistics and research data, often linked to EU-funded projects and policy reports. It is a crucial resource for comparative studies on topics like energy efficiency, digital infrastructure, and public health across EU nations. For a hands-on project, consider creating a comparison of energy efficiency or carbon emissions across several EU countries to visualize trends or highlight policy successes.

  • Approximate Size: Around 14,000 datasets.
  • History: The European Union Open Data Portal was launched in 2012 to provide access to EU data and foster transparency and innovation in public services.
  • Topics: Economy, energy, environment, and public policy within the EU.
  • Source: Datasets provided by European Union institutions, agencies, and member countries.

10. Stanford Large Network Dataset Collection (SNAP)

SNAP datasets are frequently cited in academic papers on network theory and are valuable for testing algorithms at scale. The datasets are optimized for compatibility with popular graph analysis tools, supporting research on network structure and dynamics.

  • Approximate Size: 50+ datasets, often very large.
  • History: The SNAP project was initiated by Stanford University in the 2000s to provide large-scale network data for research in social network analysis, data mining, and graph theory.
  • Topics: Social networks, web graphs, citation networks, and road networks.
  • Source: Academic data, primarily generated by researchers at Stanford University for network science studies.
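
As a small illustration of working with this kind of network data, the Python sketch below parses an edge list and counts node degrees using only the standard library. It assumes the plain-text layout many SNAP graph files use: comment lines starting with '#', followed by one whitespace-separated pair of node IDs per line. The sample edge list is a hand-made toy graph, not a real SNAP dataset, and the function names are hypothetical.

```python
from collections import Counter

# Toy edge list in the assumed SNAP-style layout (NOT a real SNAP file).
SAMPLE_EDGE_LIST = """\
# Directed graph (toy example)
# FromNodeId\tToNodeId
0\t1
0\t2
1\t2
2\t0
3\t2
"""


def parse_edges(text):
    """Return a list of (source, target) integer pairs, skipping comments."""
    edges = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        source, target = line.split()[:2]
        edges.append((int(source), int(target)))
    return edges


def degree_counts(edges):
    """Count outgoing and incoming edges per node for a directed graph."""
    out_degree, in_degree = Counter(), Counter()
    for source, target in edges:
        out_degree[source] += 1
        in_degree[target] += 1
    return out_degree, in_degree


edges = parse_edges(SAMPLE_EDGE_LIST)
out_degree, in_degree = degree_counts(edges)
print(len(edges), "edges")
print("most-linked node:", in_degree.most_common(1))
```

For real SNAP files, which can run to millions of edges, you would read the file line by line rather than loading it as a single string, or hand the file to a dedicated graph library.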

11. IMF Data

IMF’s data is known for its focus on finance, providing insights into global economic trends, inflation, and fiscal policies. It’s regularly updated with projections, making it a primary source for economic forecasting and macroeconomic studies.

  • Approximate Size: Thousands of datasets, including long time series data.
  • History: The International Monetary Fund (IMF) began collecting and distributing global economic data in the 1940s, shortly after its founding.
  • Topics: Global financial data, economic indicators, and exchange rates.
  • Source: Economic and financial data collected by the International Monetary Fund (IMF) from member countries and international organizations.

12. Awesome Public Datasets (GitHub)

As its name suggests, this GitHub list collects noteworthy public datasets. It is community-driven and constantly evolving, often featuring links to niche or newly released datasets not easily found elsewhere. It’s organized by topic, making it convenient for researchers seeking data in specific fields.

  • Approximate Size: Links to thousands of datasets.
  • History: The Awesome Public Datasets list started in 2014 as a collaborative GitHub project to collect and share useful datasets for research and learning.
  • Topics: Comprehensive, covering everything from geospatial to financial data.
  • Source: Community-curated, linking to a variety of public datasets sourced from academic institutions, companies, and government repositories.

13. Harvard Dataverse

Harvard Dataverse is notable for its preservation policies, making it a reliable source for long-term research projects. Datasets are often peer-reviewed and cited, making them a credible source for academic references and scholarly work. Always properly cite the dataset if you’re using it for your project. Harvard Dataverse makes this easy by providing citation formats in various styles.

  • Approximate Size: Over 100,000 datasets.
  • History: Launched in 2011, the Harvard Dataverse is part of a broader initiative at Harvard University to support research data management and preservation.
  • Topics: Academic research data across social sciences, medicine, and environmental studies.
  • Source: Academic research datasets contributed by researchers and institutions, hosted by Harvard University.

14. OpenStreetMap (OSM)

OSM’s data is frequently updated and highly granular, enabling detailed geographical analysis and urban planning projects. Its open-source nature has led to numerous applications, from navigation to disaster response mapping. What sets OSM apart is its community-driven approach: anyone can contribute to and improve its enormous dataset, making it one of the most dynamic and comprehensive mapping resources available.

  • Approximate Size: Terabytes of data, covering global geospatial information.
  • History: OpenStreetMap was founded in 2004 by Steve Coast as a volunteer-driven mapping project to create freely accessible and editable maps.
  • Topics: Roads, buildings, geographical features, and administrative boundaries.
  • Source: Crowdsourced data contributed by the global OpenStreetMap community of volunteers.

15. Humanitarian Data Exchange (HDX)

HDX’s datasets are tailored for use by NGOs and crisis-response teams, often focusing on real-time needs such as disease outbreaks or refugee movements. The platform prioritizes accessibility, offering multiple formats and visualization tools. For a project, try combining population, health, and food insecurity datasets to build a comprehensive analysis of a specific humanitarian crisis, such as a refugee displacement or a disease outbreak.

  • Approximate Size: Over 18,000 datasets.
  • History: HDX was launched by the United Nations Office for the Coordination of Humanitarian Affairs (OCHA) in 2014.
  • Topics: Crisis and humanitarian data, including health, population, and emergency data.
  • Source: Data from humanitarian organizations, non-governmental organizations (NGOs), and the United Nations Office for the Coordination of Humanitarian Affairs (OCHA).

16. Pew Research Center

Pew Research data is renowned for its methodological rigor, with many datasets containing comprehensive demographic breakdowns. The data often reflects nuanced insights into social trends, making it ideal for public opinion and sociological research.

  • Approximate Size: Hundreds of datasets, updated frequently.
  • History: Founded in 2004, Pew Research Center is a nonpartisan “fact tank” that provides data and analysis on social issues, public opinion, and demographic trends.
  • Topics: Public opinion, US demographics, social trends, and political data.
  • Source: Data generated by Pew Research Center through public opinion surveys and social research studies.

17. Quandl

Quandl’s datasets are especially useful for financial modeling and quantitative analysis, with many datasets offering in-depth, granular data points. It includes specialized financial indicators and alternative data, such as sentiment analysis or macroeconomic indicators.

  • Approximate Size: Thousands of datasets, some requiring paid access.
  • History: Founded in 2012, Quandl began as a provider of financial and economic data, focusing on delivering alternative data to analysts and investors. Nasdaq acquired it in 2018, and the service now operates as Nasdaq Data Link.
  • Topics: Financial data, stock market, economics, and alternative data.
  • Source: Aggregated from public data sources, as well as proprietary data providers and financial institutions (some datasets are paid).

18. Google BigQuery Public Datasets

Google BigQuery lets you query large datasets in SQL without downloading them first, making it ideal for data-intensive applications like real-time analytics. The platform hosts data from providers like NOAA and CryptoCompare, expanding its coverage in weather and finance.

  • Approximate Size: Dozens of datasets, totaling terabytes of cloud-based data.
  • History: Google BigQuery launched in 2011 as a cloud data warehouse for large-scale analytics.
  • Topics: Genomics, weather, cryptocurrency, and social media.
  • Source: Curated public datasets hosted on Google Cloud, with contributions from government agencies, academic research, and industry partnerships.

19. NASA Earth Data

NASA Earth Data offers datasets that support advanced geospatial analysis, often used in climate studies and predictive modeling. It includes tools for downloading, visualizing, and processing satellite data, ideal for environmental monitoring and space science.

  • Approximate Size: Petabytes of data from various Earth observation missions.
  • History: NASA’s Earth Data initiative began in the 1970s as part of its Earth observation program, initially focusing on satellite imagery. The data portal, established in the 2000s, now includes petabytes of geospatial and environmental data from multiple NASA missions.
  • Topics: Climate, weather patterns, satellite imagery, and land use.
  • Source: Generated by NASA’s Earth observation missions, satellites, and climate research programs.

20. RE3DATA (Registry of Research Data Repositories)

RE3DATA serves as a central directory for accessing specialized data from diverse research fields, including life sciences, humanities, and engineering. It’s especially useful for locating data repositories that adhere to open-access policies, supporting transparent and reproducible research.

  • Approximate Size: Indexes over 2,600 data repositories, each hosting numerous datasets.
  • History: Launched in 2013, RE3DATA was created to provide a global registry of research data repositories.
  • Topics: Multidisciplinary, spanning all research areas from humanities to life sciences.
  • Source: Index of research data repositories from universities, research institutions, and government bodies across disciplines.

The Bottom Line

We’ve gathered the best sources of publicly available datasets, spanning diverse platforms that offer invaluable resources for data scientists, researchers, and analysts. Bookmark this list and refer to it when you need clean datasets for your next project. Whether you are looking for machine learning datasets, economic data for policy analysis, or satellite imagery for environmental studies, these sources provide high-quality, accessible data that can power insightful projects and fuel innovation in various fields.

Explore our website further if you’re looking for data science interview questions, portfolio ideas, our AI Interviewer, and data science jobs.