Data science projects start with quality data. You’ve heard it before: garbage in, garbage out. Fortunately, there’s a wealth of free and good datasets online, from governmental and economic data to narrower topics like MLB stats and video game sales.
Whether you’re putting together a data science project to land a job or just want to brush up on your SQL or data analyst skills, we’ve selected some of our favorite sources of useful, free datasets for data science in 2024 that you can use for your next project.
You can collect data from just about everywhere, from Wikipedia to your personal Facebook data. But if you don’t know what kind of data you are looking for, these dataset search engines and data repositories are the places to start:
Google’s data search engine is useful for finding datasets in a particular niche. This is a great starting point for both paid and free datasets from top sources around the web. Other useful Google sources are Google Trends and Google’s Public Data Directory.
Find all of the U.S. government’s free and open datasets here. This is a rich source for public economic data—like housing, wages, and inflation—and education, health, agriculture, and census data. With more than 300,000 datasets available, this repository is extremely helpful.
FiveThirtyEight might be best known for its data journalism. Fortunately, the site also makes most of its data in its reporting open to the public. This is a great source for a wide range of data focusing on politics, sports, and culture.
Kaggle is one of the most popular communities for data scientists, and the site’s user-published datasets are great for self-guided machine learning or analysis projects. You’ll find a wide range of data, from movie reviews to customer sales data; fortunately, most of the data have been preprocessed. This is a useful data source for sentiment analysis projects and data analysis and visualization projects.
Check out the University of California Irvine’s repository, which features nearly 500 public datasets. This is a great source for clean, ready-to-model data in a wide range of niches, from chickenpox cases to bank marketing data.
This regularly updated library of datasets is a great place to start. The data is organized by category, with options like machine learning and software, and you’ll find quick links to sources.
Amazon’s registry provides public access to data from various organizations, from the 1000 Genomes Project to NASA. You’ll also find helpful usage examples for many of the datasets, as well as project links for various organizations and groups.
The Pew Research Center’s data repository focuses mainly on culture and media. In particular, you’ll find datasets and surveys covering media consumption, social media use, and demographic trends like this 2018 Twitter Survey.
data.world calls itself a “collaborative data community,” and the site has built a dedicated audience of data scientists who have collaborated on projects like social bot detection and data journalism. You’ll find datasets in a range of categories from crime to Twitter.
There’s a plethora of regularly updated public COVID data available online. Some of the best sources include CDC COVID Data Tracker and Our World In Data. For more niche projects, try the Coronavirus Tweets Database, featuring more than 1 billion Tweets, as well as The Marshall Project’s COVID cases in prisons datasets.
Start with the WHO’s Global Health Observatory repository if you’re looking for healthcare data. The platform features a variety of health-related statistics on topics such as HIV/AIDS, vaccination rates, and malaria. If you want to build a machine learning health project, this is the source to use.
Academic Torrents is a database that provides large-scale datasets for research projects. Researchers share the data, and there’s a variety of interesting sources, including the classic Enron email dataset or the annotated New York Times text corpus, which contains 1.8 million articles.
For FinTech machine learning projects, you’ll find a variety of finance-related datasets on Nasdaq Data Link. The site features both paid and free data. Some free datasets of note include Zillow Real Estate Data and Federal Reserve Economic Data. You must create an account to access the 20+ free sources, and numerous premium datasets are available as well. This is a great data source for a real estate data science project.
Each week, Jeremy Singer-Vine compiles a newsletter of useful and curious datasets. Since 2015, he’s published more than 300 newsletters; you can access the full archive on his site. The latest edition featured a dataset of restaurant “chains” as well as an energy-demand dataset. If you’re interested in data, you should subscribe to the weekly newsletter.
NASA’s Earth Science Data Systems Program is a repository of the organization’s Earth science data. You’ll find datasets related to sea level rise, wildfire frequency, and tropical storms, among other interesting earth sciences insights. See the Data Pathfinders tool to learn how to source and access science datasets.
Datahub is a wonderful source of open data. Jump to the Collections tab to browse datasets in various categories covering everything from climate change to football. You can also use the Find Data tool to search for relevant datasets.
For open crime and law enforcement data, this is one of the best sources for U.S. crime statistics. You can search by state or jump into various datasets, including use of force or arrests.
Looking for a specific type of data for your project? We’re surfacing some of the most useful datasets to use in a wide range of data science projects from data analysis and visualization to machine learning and data cleaning.
Whether you want to work with predictions or classification, these datasets are interesting and helpful for machine learning projects. The data is relatively clean and lends itself nicely to machine learning, with plenty of variables that can help predict the target column.
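As a rough illustration of what "variables that help predict the target column" means in practice, here is a minimal sketch that separates a target column from the remaining features and holds out a test set. The column names (`x1`, `x2`, `target`) and the rows are made up for illustration and do not come from any dataset listed here; real projects would more likely use a library such as pandas or scikit-learn for this step.

```python
import random


def split_features_target(rows, target):
    """Separate each row dict into its feature columns and its target value."""
    X = [{k: v for k, v in row.items() if k != target} for row in rows]
    y = [row[target] for row in rows]
    return X, y


def train_test_split(X, y, test_fraction=0.25, seed=0):
    """Shuffle row indices reproducibly and hold out a fraction for testing."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return pick(X, train_idx), pick(X, test_idx), pick(y, train_idx), pick(y, test_idx)


# Hypothetical rows, invented purely to show the mechanics.
rows = [
    {"x1": 1.0, "x2": 0, "target": 0},
    {"x1": 2.5, "x2": 1, "target": 1},
    {"x1": 0.7, "x2": 0, "target": 0},
    {"x1": 3.1, "x2": 1, "target": 1},
    {"x1": 1.9, "x2": 0, "target": 0},
    {"x1": 2.8, "x2": 1, "target": 1},
    {"x1": 0.4, "x2": 0, "target": 0},
    {"x1": 3.6, "x2": 1, "target": 1},
]

X, y = split_features_target(rows, target="target")
X_train, X_test, y_train, y_test = train_test_split(X, y)
```

Holding out a test set before any modeling is a common design choice because it gives an estimate of how the model behaves on rows it has never seen.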
This dataset, used in DoorDash data science take-homes, features user and transaction data and asks you to build a model to predict delivery time.
Build a stroke prediction model with this handy dataset. The CSV contains patient information like gender, age, pre-existing conditions, and smoking status that can help you build a model.
The dataset used in this take-home contains all pitches from the 2011 MLB season and asks you to build a model to predict the probability of a pitch type.
This UCI Machine Learning Repository dataset contains survey data from married couples. Use the data to identify predictive divorce indicators or build a prediction model.
The data used in this take-home includes cyber security threats and events faced in the healthcare industry. You can use this data to build a healthcare risk assessment model.
With data from more than 400,000 flights in January 2019 and January 2020, this dataset from the Bureau of Transportation Statistics is well suited for building a model for winter flight delays. It is useful for a regression data science project.
Can you predict gender from a Twitter user’s profile and tweets? Build models to answer that question with this dataset, which contains information on more than 20,000 Twitter users.
This classic dataset from UCI is a great source for a classification data science project. One great project idea is to build a classifier that identifies poisonous mushrooms.
This is a great dataset for a financial prediction model. Use the data to understand if an applicant is “good” or “bad.”
Use water quality metrics from nearly 4,000 bodies of water to predict whether the water is safe for consumption or not.
This dataset features ratings for submitted New Yorker caption contest entries. Get some ideas for using this data here.
Numerous movie rating datasets are available here, including one featuring 25 million ratings, making it a great source for building a recommendation engine.
This open dataset covers health inspection scores for restaurants in San Francisco. One option is to use the data to build a model to predict a restaurant’s repeat health scores.
This dataset is useful for demand forecasting projects. It contains bike rental data from a bike-sharing program, including travel duration, departure and arrival locations, and weather data. This dataset is similar to one that’s used in the McKinsey data analytics take-home.
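For a demand forecasting project like this one, a simple baseline helps put more elaborate models in context. The sketch below is one possible baseline, not the method used in any take-home: it forecasts the next period as the average of the last few periods and backtests that rule with mean absolute error. The `rentals` numbers are invented for illustration.

```python
def moving_average_forecast(history, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    if len(history) < window:
        raise ValueError("need at least `window` observations")
    recent = history[-window:]
    return sum(recent) / window


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def backtest(history, window=3):
    """Forecast each point from the `window` points before it and return the MAE."""
    actual, predicted = [], []
    for t in range(window, len(history)):
        predicted.append(moving_average_forecast(history[:t], window))
        actual.append(history[t])
    return mean_absolute_error(actual, predicted)


# Hypothetical daily rental counts, made up for this example.
rentals = [120, 130, 125, 140, 150, 145]

next_day = moving_average_forecast(rentals)
baseline_error = backtest(rentals)
```

Any model that uses weather or location features should beat `baseline_error` on the same backtest to justify its extra complexity.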
With more than 20,000 images of cats and dogs, this is one of the best datasets for beginner image classification projects. The data is already separated into training and testing datasets, and it’s already labeled.
The MNIST dataset is a large database of handwritten digits. It’s widely used for image classification tasks, where the goal is to identify the digit based on the image.
This popular dataset contains information about different types of iris flowers and their characteristics, such as petal length, petal width, and sepal length. The goal is to predict the species of the iris flower based on these characteristics.
Build data visualization projects with these helpful datasets. We looked at data with the potential for interesting visualizations and datasets that weren’t too messy or overly complex.
This dataset includes revenue and sales data from Supercell and asks you to visualize a single aspect of the data that you find important.
With more than 11 million nodes and 85 million edges, this dataset is useful for building graphical relationship models of X users.
This is a great dataset for visualizing hotel bookings. You’ll be able to build visualizations that answer questions like:
Design visualizations that show top authors, best-selling titles, and review ratings for the best-selling books on Amazon.
Visualize the impact COVID is having on hiring with this dataset from the Amazon Open Data Registry. It features regularly updated hiring data from 3+ million job organizations.
If you’re interested in political visualizations, FiveThirtyEight is one of the best data sources. Its updated polling data is great for visualizing averages and polling movements.
Build charts to visualize the United States’ international trade, including top imports, top exports, and annual trade balances.
This dataset is useful for Matplotlib visualizations. You can create visualizations of exchange rates and currency valuations over time. The dataset features more than 20 years of daily exchange rate data.
This dataset has more than 400,000 records featuring daily circulation for the San Francisco library system. You can build visualizations related to new acquisitions, most checked-out authors, most checked-out titles, etc.
This Kaggle dataset features daily trending video data from YouTube. Trending videos aren’t necessarily the most watched but are generally the most interacted-with videos. Possible visualizations include the most popular videos of the year or month, or the top trending videos by artist or creator.
This dataset features more than 31 years of unemployment data for numerous countries worldwide. There is a wide range of visualizations you can create, including comparisons of countries, unemployment rates over time, or countries with the lowest unemployment.
This dataset originated on New York State Open Data and features station, line, location information, etc. You can use this dataset to visualize popular lines or subway maps.
This dataset contains information on more than 10,000 athletes in 40+ sports, and it’s a great source for building country medal count visualizations. There’s also coaching data, so you can add medal information by coach.
Say you want to take a big dataset and investigate. As you dive into the data, you discover patterns, trends, and anomalies. These datasets are perfect for exploratory data analysis projects because they contain large amounts of mostly clean data.
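A first pass at exploratory analysis often starts with summary statistics and a crude check for outliers. The following is a minimal sketch using only the standard library. The z-score threshold of 2.0 and the `daily_sales` values are assumptions chosen for illustration, and a z-score rule is only one of many ways to flag anomalies.

```python
import statistics


def summarize(values):
    """Return basic descriptive statistics for a numeric column."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }


def flag_anomalies(values, threshold=2.0):
    """Return values whose z-score (distance from the mean in stdevs) exceeds threshold."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    if stdev == 0:
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Hypothetical column of daily sales, invented for this example.
daily_sales = [100, 102, 98, 101, 99, 103, 97, 100, 250, 101]

summary = summarize(daily_sales)
outliers = flag_anomalies(daily_sales)
```

Here the single spike of 250 is the only value flagged. Note how far it pulls the mean away from the median, which is itself a useful signal when scanning a new dataset.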
This Airbnb dataset, part of a sample data analytics take-home, contains user information for bookings in Brazil.
A fun dataset to explore and great for beginners, this features all of the Netflix original movies up to June 1, 2020, and their corresponding IMDb scores.
This Stripe dataset, which features product usage and marketing data, is perfect for diving into marketing and product analytics to determine how well a product performs.
Featuring 4 years of data from a superstore, this dataset is perfect for analyzing and identifying trends and sales forecasting.
This sample dataset from a Home Depot data science take-home can be used to produce a gross sales forecast for a new product launch.
This dataset is made up of mock marketing analytics data used by master’s in business analytics students. A great source for a marketing analytics project.
This is a great dataset for surfacing actionable insights for animal shelters, including what factors led to successful animal outcomes.
Another FiveThirtyEight dataset, this one features survey data from non-voters in the U.S. A few project ideas are identifying key factors that result in non-voting or building a voting likelihood model.
A sprawling dataset from Amazon, the Common Crawl corpus features crawling data from billions of websites. Check out the Example Projects page for ideas.
This dataset is useful for a sports analytics project. Featuring data on more than 20,000 matches and individual stats from 2008 to 2016, this is great for exploratory data analysis projects on line-ups, team stats, wins, and individual player stats.
This large-scale dataset, which was originally developed in 2018, features product information for more than 600,000 food items. Data includes allergens, ingredients, and nutrition facts, and there is a wide range of data analytics projects you can do with it.
This useful marketing analytics dataset features survey data from 2,500+ millennials. The survey asked which social platform has influenced your online shopping the most.
This dataset features Google Analytics metrics from Austin, TX’s website. This dataset is great for working in Google Analytics or analyzing website traffic.
This dataset features more than 20 million metrics on Uber pickups in NYC in 2014 and 2015. This is great for an exploratory data analysis or analytics project, and you can gather insights into popular pickup locations, common trip routes, and the locations with the longest pickups.
This dataset is a great source for a campaign budget optimization project or for diving into exploratory data analysis for marketing analytics projects.
This dataset contains a wide range of economic and social indicators for countries worldwide, including information about their GDP, population, and education levels.
This dataset contains salaries for roles in the data science field for the year 2023. You can group the data by domain, years of experience, and even by country of employment, allowing many angles for exploratory analysis.
There are plenty of large datasets great for sentiment analysis and natural language processing (NLP) projects. Data like movie reviews, tweets, Reddit comments, and more are all great for these types of projects.
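To make the idea of sentiment analysis concrete, here is a deliberately tiny lexicon-based scorer. It is a toy sketch, not a production approach: the `POSITIVE` and `NEGATIVE` word lists are hand-picked assumptions, and real projects on the datasets in this section would typically train a model on labeled reviews instead.

```python
import re

# Hand-picked word lists, assumed for illustration only.
POSITIVE = {"great", "good", "love", "excellent", "amazing", "friendly"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "rude", "dirty"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment_score(text):
    """Count positive words minus negative words."""
    tokens = tokenize(text)
    positives = sum(token in POSITIVE for token in tokens)
    negatives = sum(token in NEGATIVE for token in tokens)
    return positives - negatives


def label(text):
    """Map the score to a positive, negative, or neutral label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

For example, `label("Terrible service and a dirty room")` returns `"negative"`. A scorer like this ignores negation and sarcasm, which is exactly the kind of weakness a labeled dataset lets you measure.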
This take-home provides a dataset of human vs. bot texts and asks you to build a classification model to label the data correctly.
An interesting dataset for performing sentiment or text analysis, this features thousands of posts from the popular subreddit Vaccine Myths.
There are more than 270,000 book chapters in 12 languages in this dataset. It’s perfect for performing various NLP tasks like text parsing, text generation, or semantic analysis.
Featuring 3+ million headlines from the now-defunct tabloid The Examiner, this is a great place to start an NLP news analysis Python project.
Explore thousands of hotel reviews from TripAdvisor and build semantic prediction or topic clustering models.
Another helpful media source, this features headlines from nearly 20 years. It’s a great dataset for performing latent semantic analysis or latent Dirichlet allocation tasks.
With more than 40,000 reviews from three Disneyland locations, this is a great data source for performing sentiment analysis.
This dataset, a classic produced in 2009, features star ratings for numerous Amazon products.
The Stanford Sentiment Treebank contains over 10,000 sentences from Rotten Tomatoes movie reviews and provides sentiment annotations on a 25-point scale.
This dataset features thousands of airline reviews on X (formerly Twitter) from February 2015. The data has already been classified as positive, negative, or neutral, and sometimes includes a reason for the negative tweet.
Featuring 25,000 movie reviews, this dataset can be used for a binary classification project or to analyze movie review ratings by title.
This classic NLP dataset has been studied and written about numerous times, and it is great for text classification and analysis projects.
Whether you’re looking into an image recognition project or speech recognition project, these great image and audio datasets will help you practice your deep learning skills.
The VoxCeleb large-scale dataset features audio-visual data from 7,000 speakers. It’s a great dataset for performing emotion recognition, speaker recognition, or talking face synthesis.
There are about 900 images in this dataset of people wearing facemasks. You can use this to build models to detect if someone is wearing a mask, not wearing a mask, or wearing a mask improperly.
This rich visual-text dataset is loaded with helpful information. Use the photos for object detection. A bonus: There are millions of keywords and metadata you can also use for exploratory data analysis projects.
This dataset from Stanford features 200,000+ chest radiographs. Build a model to detect pathologies and see how well your model performs against radiologists.
There are thousands of images of Pokemon characters in this dataset. Use the data to build a prediction model to determine the type of Pokemon based on the image.
This classic image dataset from Stanford contains more than 14 million images. It is one of the best datasets for performing object recognition tasks.
Featuring more than 20,000 photos of dogs, this is a useful dataset for building classification models or a dog breed image classifier project.
Featuring more than 5,000 images with fine annotations, as well as 20,000 images with coarse annotations, this is one of the best datasets for understanding urban street scenes at the pixel level.
This is a smaller dataset featuring 165 images of 11 subjects. Each subject has images with various expressions and configurations, for example, “sleepy” or “without glasses.”
Similar to the MNIST handwritten digit dataset, this image set includes a training set of 60,000 images of clothing articles and a test set of 10,000 images. Each image carries one of 10 class labels, such as “bag” or “trouser.” This is useful for testing machine learning models.
Image recognition indoors is more difficult, and this dataset, which features 15,000+ images of indoor scenes, is useful for building indoor recognition models.
Pricing optimization is one of the most important levers for increasing revenue with data. Try to identify prices that maximize revenue for these different products and environments:
Over 1 million transactions from an online retailer, including customer data, product data, and transaction data.
Weekly retail prices and volume data for avocados in various US markets from 2015 to 2018.
Data on beer consumption and prices in Sao Paulo, Brazil, from 2015 to 2018.
Information on Airbnb listings in New York City, including listing prices and attributes such as location, number of bedrooms, and amenities.
Information on Uber pickups in New York City from 2014 to 2015, including pickup times and locations.
Information on over 53,000 diamonds, including their cut, color, clarity, carat weight, and price.
Weekly sales data for 45 Walmart stores across the US from 2010 to 2012, including information on promotions, holidays, and weather conditions.
Using this dataset, you can build predictive pricing models, especially around surge pricing.
Using this dataset, you can build a predictive model for used car pricing. The dataset includes various features such as the car’s make and model, location, year of manufacture, kilometers driven, fuel type, transmission, owner type, mileage, engine size, power, and the number of seats. These attributes can be used to analyze how different factors influence the resale value of cars.
Using this dataset, you can build predictive models to estimate health insurance costs based on various factors such as age, gender, BMI, number of children, smoking status, and region. These models can help insurance companies better understand the risk factors associated with individual clients and set more accurate premiums.
By analyzing patterns within the data, you can also explore how different variables interact with each other to influence the overall charges, potentially identifying key drivers behind high insurance costs.
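To make the revenue-maximization idea from this pricing section concrete, here is a minimal sketch under a strong simplifying assumption: demand falls linearly as price rises. It fits that line with ordinary least squares and then searches a grid of candidate prices for the one with the highest revenue (price times predicted quantity). The price and quantity observations are invented for illustration, and real demand curves are rarely this tidy.

```python
def fit_linear_demand(prices, quantities):
    """Ordinary least squares fit of quantity = intercept + slope * price."""
    n = len(prices)
    mean_p = sum(prices) / n
    mean_q = sum(quantities) / n
    cov = sum((p - mean_p) * (q - mean_q) for p, q in zip(prices, quantities))
    var = sum((p - mean_p) ** 2 for p in prices)
    slope = cov / var
    intercept = mean_q - slope * mean_p
    return intercept, slope


def revenue_maximizing_price(intercept, slope, candidates):
    """Return the candidate price with the highest predicted revenue."""

    def revenue(price):
        # Predicted demand cannot go below zero.
        demand = max(0.0, intercept + slope * price)
        return price * demand

    return max(candidates, key=revenue)


# Hypothetical observations, made up so that quantity = 1100 - 200 * price.
prices = [1.0, 1.5, 2.0, 2.5, 3.0]
quantities = [900, 800, 700, 600, 500]

intercept, slope = fit_linear_demand(prices, quantities)
candidates = [round(0.25 * i, 2) for i in range(1, 21)]  # 0.25 through 5.00
best_price = revenue_maximizing_price(intercept, slope, candidates)
```

With these made-up numbers the search lands on 2.75, which matches the closed-form optimum for a linear demand curve, `-intercept / (2 * slope)`. A grid search is used here because it keeps working when the demand model is swapped for something non-linear.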