Interview Query
Top 22 Social Media Datasets That You Can Use (Updated in 2025)

Top 22 Social Media Datasets That You Can Use (Updated in 2025)

Introduction

Collecting and combining good data is a challenge for all data scientists. Even more difficult, however, is turning that good data into projects that generate riveting insights and inspire new discussion. As data scientists, we define good data as raw, uncompromised, and observed in their natural state - this is where social media data sets come into play.

Social media data sets are superior to other datasets due to various factors. One of those factors is that social media data sets contain information that is already correlated, sorted (i.e., user profiling), and numerous. Moreover, data shared through websites such as X and Reddit tend to be unfiltered, which can be used as leverage for businesses to create unadulterated datasets for sentiment analysis or research.

A dataset is made up of related data cells, which can then be maneuvered and accessed either as a singular unit, curated based on similar characteristics or modified individually. At their core, datasets have untapped potential and are waiting to be utilized. As data scientists, we should be able to use datasets in the following manner:

  • Generating sentiment analysis.
  • Training neural networks.
  • Creating business insights (which drive business decisions).

X Datasets And Project Ideas

X is often cited as one of the best places to explore datasets. Not only are existing datasets numerous, but the core user demographic for X is extensive and home to communities of polarizing ideologies and raw and genuine opinions.

The authenticity of X datasets makes them great for businesses to explore trends that drive business decisions.

We have gathered five datasets that you can use to form a project around (linked below). Each of the datasets can be of varying specificity, and they will tackle:

  • Overspending Analysis Take-Home
  • 2015 New Year’s Resolutions
  • The US 2016 Elections X Analysis
  • Apple (FAANG) 2016 Sentiments
  • Sentiments on Climate Change

These four datasets are seemingly random and unrelated, but this only highlights how much variety you can get from X.

1. X: Overspending Analysis Take-Home

X Take-Home Challenge/Social Media Data set

This social media data set doesn’t focus on users. Instead, it’s more of a product case study challenge. Specifically, the data provided relates to X’s advertising business.

You’re provided a dataset of campaigns, budgets, and impressions, and you’re asked to measure the data generated by an A/B test on a new advertising product.

This take-home tests your ability to:

  • Process social media advertising data
  • Calculate and measure the effectiveness of an A/B test
  • Determine the winner of an A/B test

Overall, this X data science takehome is a product case study in that you’re asked to determine how a new product compares to an old product in an A/B test.

2. 2015 New Year’s Resolutions

One of X’s most organized and detailed social media data sets compiles tweets related to 2015 New Year’s resolutions. While 2015 resolutions might seem outdated, it is evident that most people who make New Year’s resolutions tend to fail halfway through and, as a result, commit to the same promises year after year.

This dataset is helpful as it can assist businesses either downscale or upscale their operations at certain times of the year. For example, a quick skim through the dataset will reveal the prevalence of gym and weight-loss-related resolutions at the beginning of the year, just after the resolutions are made.

A project that tackles and predicts the increase or decrease of demand for certain services can drive business decisions and generate those actions that produce service supply shifts throughout the year. Moreover, the dataset is straightforward to navigate as it sorts the resolution based on relevant categories (for example, philanthropy, health and fitness, and personal growth).

3. Climate Change X Sentiment Analysis

On July 22, 2022, the UK recorded its highest temperatures in recorded history. Concurrently, reports of melting ice caps in the Arctic spread around the internet. Multiple research papers dating from 1896 have helped back up the authenticity of climate change.

Despite the body of scientific evidence piling up from social media, we can gather that there are still people doubting the correlation between carbon emissions and climate change.

For one of your projects, you can perform a sentiment analysis on climate change using this data set provided by CrowdFlower. This social media data set contains a column specifying whether a tweet suggests accepting or rejecting the climate change phenomena.

4. Apple (FAANG) 2016 Sentiment Analysis

By January 2022, Apple became the world’s first $3 Trillion company. Many factors led to this point, but it has long been understood that Apple’s valuation rides on a public perception of the brand’s products being labeled as “premium”. This is based in no small part on their product’s decent build quality and innovative marketing. However, does X share the same opinion?

2016 was a big year for Apple, introducing many novel products with features that are still prevalent today. These are the following:

  • iPhone SE, Apple’s budget iPhone lineup.
  • 9.7-inch iPad Pro, Apple’s dip at iPad Pros on a smaller form factor.
  • iPhone 7 and iPhone 7 Plus, one of which sported the double camera setup but also started the trend of removing the headphone jack.
  • Apple Watch Series 2.
  • AirPods, which were controversial due to the headphone jack issue.
  • The Macbook Pro 2016 introduced the contentious touch bar and the butterfly keyboard second generation, which was also plagued with quality control issues.
  • New software innovations and operating system updates.

Sentiment analysis dataset

2016 is the perfect year to conduct a project on sentiment analysis for Apple at a time when the company introduced many features and hardware that created a large divide in the tech community. To this day, there is still no consensus on the preference for the touch bar.

It would also be interesting to see how far opinion has developed and how the opinions from 2016 have grown since that time.

5. US 2016 Elections and ISIS X Analysis

The US 2016 Elections, alongside the peak of the terrorist group ISIS’s self-declared caliphate, were two of the most talked about events and phenomena of 2016, with the elections garnering approximately 60 million tweets and ISIS gaining a quarter of that. In the same year, Paragon Science Inc. compiled a list of all related X networks that mention or are related to the US 2016 elections alongside tweets that heavily emphasize ISIS.

Paragon Science Inc. compiled this social media data set by employing dynamic anomaly detection technology, a deviation from the chaos and dynamical theories most often employed to understand these phenomena.

The dataset is a lovechild of many methods, utilizing sentiment analysis, network analysis, community detection, and topic detection.

We can use this dataset to develop the following projects:

  • Using Natural language processing (NLP), we can gather keywords related to either the US elections or ISIS.
  • The gathered keywords can also create an exploratory data analysis project, visualizing related keywords and finding weakly or strongly connected graphs.
  • Another project we can explore is the creation of an algorithm that produces political Tweets.
  • We can also do a statistical data analytics project to identify what factors aided Trump’s victory and Clinton’s loss in that 2016 election.

Meta Datasets And Project Ideas

If you are actively looking at tech stocks, you might encounter quite a bit of doom-posting about the fall of Meta. This pessimism does not match the active user numbers; however, 69% of Americans, 79% of Canadians, and 329 million people in India alone still actively use Meta. Moreover, in the 35 to 44 age demographic, Meta remains the dominant social media site.

Meta datasets might be the most helpful for businesses explicitly targeting developing countries and/or Millennials in the 35 to 44 age demographic.

For Meta, we have four datasets to get started with:

  • Exploratory Data Analysis: General Meta Dataset
  • SNAP Meta Social Circles
  • Political Advertisements from Meta
  • A Dataset Compiling Removed Domestic Meta Pages

6. Exploratory Data Analysis: General Meta Dataset

A dataset that explores the demographic of Meta users allows data scientists to grasp the contours of the current active community. Because the dataset contains a lot of general information, it might not be as valuable for a project that requires specific statistics and user preferences.

However, this social media data set holds comprehensive information such as age, year born and birth date, gender, friend count, and likes. Because of that, this dataset can be used for exploratory data analysis.

Meta dataset

Exploratory data analysis allows you to explore the data with visual techniques, check trends, confirm assumptions, and get a general gist of the data.

7. SNAP Meta Social Circles

SNAP is an initiative by Stanford University and stands for Stanford Network Analysis Platform. SNAP primarily contains social media data sets for network analysis and various platforms running C++ and Python. However, due to the sheer amount of data in SNAP datasets, it is recommended that the faster C++ libraries be utilized.

The reason behind C++’s better performance is that many Python libraries use C code in them, thus hampering performance. Moreover, Python is interpreted while C++ is compiled.

SNAP’s social media data sets are not limited to Meta; they offer Reddit, X, and more. However, this section will focus on their Meta social circles dataset.

The SNAP Meta dataset is focused on friend lists (i.e., social circles) commonly used in ego networks. Ego networks are a type of network that is based around a focal node (a central node) named an “ego,” with social relationships surrounding them as “alters.” The relationships between alters are also specified.

These alters, in themselves, can become the focal node as well. Moreover, the SNAP dataset also analyzes the “features” (i.e., political party) but anonymizes them as to be unidentifiable. For example, political party A will be labeled “1,” and party B will be labeled “0.”

Projects focusing on an analysis with a particular feature in mind will not do well with this dataset. For instance, if an organization prioritizes campaigns for the supporters of a specific political party, it will be challenging as the dataset itself does not specify which users belong to which political party.

However, projects that analyze the divisiveness and the clustering of nodes (e.g., how friend groups influence the appearance of certain features) will be a perfect fit for this dataset.

8. Political Advertisements from Meta

Political Advertisements from Meta dataset

Advertisements that advocate social change, whether political or non-political, are grouped into the “political ad campaigns list.” As such, it is vital to distinguish these two categories from each other.

In this dataset, various ad campaigns from Meta were collected, including their HTML code, metadata, and their “political” or “not political” classification. Do note that the classification is not made by hand but rather by a machine learning classifier.

One project idea is to utilize this dataset and build neural networks to improve the current AI. Moreover, you can use the dataset for an AI developing political ad campaigns from scratch (like the Dalle mini, which is an ambitious project). The dataset also contains the necessary metadata and HTML code to train neural networks or build a classification project.

9. A Dataset Compiling Removed Domestic Meta Pages

In 2018, Meta removed hundreds of US political pages for inauthentic activity (i.e., political spam). The social media data set below contains daily engagement metrics of three specific removed political pages:

  • Right Wing News
  • Daily Vine
  • Silence is Consent

The data starts from February 14, 2011, and ends on December 9, 2018. The dataset also contains the number of daily interactions, including shares and likes, and a day-on-day growth statistic.

One project idea is to create and analyze the trends and growth of the said Meta pages by thoroughly looking at the engagement statistics. Alternatively, a pseudo-sentiment analysis (without the use of NLP) can be introduced by analyzing the Meta reactions (Like, Love, Haha, Wow, Sad, Angry) on these pages using posts from 2016 to 2018 (as Meta introduced the reactions menu as an alternative to the singular Like button in 2016).

Reddit Datasets And Project Ideas

While Meta and X may get all the credit for being the conventional social media platforms for gathering social media data sets, Reddit offers a unique platform underutilized by many organizations and data scientists, overlooked frequently due to its unfamiliar setup.

Despite its different format, Reddit’s community-first approach is often more useful in generating business insights on target consumers once you account for the more than 50,000 niche communities (subreddits). While Reddit may not deliver the most extensive datasets, the social media data sets from Reddit are collected from a user base that is easier to confirm as relevant, as they self-select themselves for you in the communities they choose to join.

For Reddit, we tackle questions from the following databases:

  • r/financialindependence subreddit demographic dataset.
  • Redditor Demographic 2011 Survey.
  • Reddit growth dataset.
  • Reddit AAPL (Apple Inc.) stock sentiment analysis.

10. r/financialindependence Subreddit Demographic Dataset

One of the most extraordinary things Reddit datasets offer is the ability to look at a subreddit and its members’ demographic and be able to grasp a product’s demand and outlook.

For example, when interested in learning which online games have the most active communities, you can quickly look up the membership numbers of a specific game’s subreddit and compare the numbers to other competing games.

r/financialindependence is a subreddit full of individuals aiming to become financially independent, specifically meaning they wish to generate income without having to work for the income (passive income streams) actively. This social media data set contains the said subreddit’s demographics (age, income, marital status, country of residence), allowing you to create insights that might drive business decisions that can better utilize such demographics.

Organizations that heavily benefit from projects like these offer services that provide opportunities for generating passive income (i.e., providing real estate for other companies such as Airbnb, investment platforms such as Binance, and more). Specifically, you can utilize this social media data set to predict whether a particular person might be interested in downloading apps like Binance.

11. Redditor Demographic 2011 Survey: Exploratory Data Analysis

The following dataset resembles Meta’s exploratory data analysis project but differs in one crucial aspect. Because Reddit is built around communities, this dataset is much more substantial. Aside from providing the general gist of Reddit’s user base, it also gives insight into how a person of a particular demographic might favor specific subreddits over others.

The 2011 Survey contains information about the user, such as their age range, education, country of residence, income, and most importantly, their favorite subreddit. Because of how robust it is, we can do the following projects:

  • A project that helps predict, based on demographics, what subreddit a user is likely to be a part of.
  • A project that can help brands determine which subreddits are most relevant to their businesses.
  • A project determining which subreddit holds the highest and lowest members by earnings.

As you can see, the dataset is incredibly flexible. There are definitely a lot more projects that come into mind, but these are a select few that are notable.

12. Reddit Growth Dataset

Reddit Growth Dataset The following dataset contains Reddit’s engagement data (submissions, comments, votes, account membership) per month from the month of conception (June 2005) to March 2017.

One project you can do using this data is to predict the engagement numbers for the next few months from today and cross-check them with the currently available data (2017 - current year).

13. Reddit AAPL stock sentiment analysis

The value of stocks heavily relies on public sentiment, and while Reddit, at first glance, might not be a determining factor in stock prices, a closer look at the news will reveal otherwise. One of the most extraordinary events in Reddit history was the inflation of the Gamestop stock price due to a subreddit called r/WallStreetBets.

The subreddit single-handedly modified and inflated the Gamestop stock price, revealing a vulnerability in stock pricing and Reddit’s sheer power. When it comes to stocks and trading, Reddit is no pushover and is, in fact, a key tool.

AAPL (Apple Inc.) remains one of the world’s most potent stocks despite the recent decline in the value of other tech stocks. Running a sentiment analysis of the AAPL stock using Reddit data can be a powerful way to examine the industry.

Since the data’s scope lasts only from November 2016 until October 2021, you can build a machine learning project to predict AAPL’s stock price and cross-check the result on the proceeding AAPL stock prices from October 2021 to the present.

Another project idea is to compare sentiment analysis results from a particular time frame (i.e., November 2017) and see how it is reflected in AAPL’s November 2017 stock price.

Youtube Datasets And Project Ideas

YouTube is the world’s leading video streaming platform. Moreover, it is currently one of the world’s most significant outlets for audiovisual ad campaigns and the primary outlet of multimedia content, including news, music, travel, education, and virtually any other video-related content.

Below, we list four social media data sets that utilize YouTube in particular:

  • Trending YouTube Video Statistics
  • YouTube 8M Dataset
  • Donald Glover’s This is America - YouTube Comments Sentiment Analysis
  • Most Subscribed YouTube Channels

14. Trending Youtube Video Statistics

Trending Youtube Video Statistics dataset YouTube’s trending chart is a position most video creators would want to be listed on, as it reflects the general popularity of the video you’ve made and the hype it receives around a community. Nevertheless, the trending tab is not uniform around the world, as it elevates content differently in every country.

This dataset records trending videos, including their engagement statistics, tags, metadata and more. As specified earlier, the dataset heavily varies per country, and currently, the dataset includes the trending chart of the following countries:

  • USA
  • Great Britain
  • Canada
  • Germany
  • France
  • South Korea
  • Mexico
  • Russia
  • India
  • Japan

Moreover, one can utilize this dataset for the following projects:

  • Create an algorithm that determines what tags and content will trend in a specific country.
  • Analysis of which content tends to trend more.
  • Determining what type of YouTube content generates the most user engagement.
  • Using machine learning algorithms (i.e., RNNs) to generate YouTube comments.

15. YouTube 8M Dataset

youTube 8M Dataset

Unstructured data, especially multimedia data, is tough to analyze and requires a commitment of huge amounts of processing power and resources. The YouTube 8M dataset compiles 350,000 hours of unstructured, raw video files and turns them into a CSV file that allows for easier processing.

The dataset also contains human-verified annotations describing the video’s audiovisual elements. The dataset can be used to train machine learning algorithms that require video input. You can consider the following projects:

  • For example, you can develop a project that determines the genre of a YouTube video based on its content.
  • You can also create video content based on prompts for a more challenging (significantly more challenging, in fact) project.
  • Another project you can do is to create an algorithm that describes and annotates YouTube videos.

16. Donald Glover’s This is America - YouTube Comments Sentiment Analysis

At the forefront of music activism, Childish Gambino’s “This Is America,” is a viral anthem that displays and protests the police brutality often experienced by the black community in America. As of writing, the music video currently has 846 million views with more than 750,000 comments.

This dataset contains the information of more than 200,000 comments scrapped using YouTube’s API. One of the projects you can do is to use Natural Language Processing to do sentiment analysis on the comments within the video.

Natural language processing (NLP) allows computers to understand human language. It takes complex sentences and breaks them down into simple structures, enabling algorithms to better perform numerical interpretation. NLP is often used in sentiment analysis.

17. Most Subscribed YouTube Channels Datasets

The following dataset contains the most subscribed YouTube channels, including their engagement statistics, date created, video count, genre, and rank. One of the projects you can do is a statistical analysis to determine which genres are most popular.

Other Datasets To Discover

Aside from the datasets above, what datasets can you build projects with and explore? These social media data sets can be of use:

18. LinkedIn Social Media Data Set

This social media data set contains information about accounts on LinkedIn, the individual’s job experience, LinkedIn activity, current company, gender, race, and more. You can do statistical analysis to analyze which demographic is favored in which jobs and which type of individual or group is highest or lowest paid.

19. Shopping Influence And Social Media

This dataset contains the answers to a survey questionnaire about how Millennials get influenced by social media, especially with their shopping habits. You can create a project analyzing which demographic of which social media platform has their shopping behaviors the most heavily influenced.

20. Fake News Detection

While this dataset is not necessarily one we could call a social media data set, it is a dataset where one can build awesome projects. Debuted as part of a hackathon, this database contains information to help you make a fake news detection algorithm using NLP.

21. User Behavior on Instagram

This dataset provides insights into user behavior on Instagram, focusing on various aspects such as engagement metrics, content analysis, and user demographics. You can develop a project to analyze which content types, hashtags, and user demographics drive the highest engagement on the platform. Additionally, you can explore how follower growth, churn, and influencer collaborations impact overall user interaction and brand reach on Instagram.

22. Waze Synthetic User Churn Data

While this dataset primarily focuses on navigation, Waze functions similarly to social media by fostering a community-driven experience where users share real-time information about traffic, hazards, and road conditions. This interactive element of Waze, where users can contribute and receive updates from others, makes it a valuable dataset for building projects that analyze user engagement patterns similar to those found on social media platforms. Released as part of the Google Advanced Data Analytics Professional Certificate program, this dataset enables you to develop a user churn prediction model using machine learning techniques to understand user behavior within the app.

Social Media Data Science Projects

All of these free datasets can be used on your next data science project. In particular, social media data sets are great for data analytics projects, especially LinkedIn data. You can also use these sources for advanced machine learning and classification data science projects, as X data works well for text classification projects.