Collecting and combining good data is a challenge for all data scientists. Even more difficult, however, is turning that good data into projects that generate riveting insights and inspire new discussion. As data scientists, we define good data as raw, uncompromised, and observed in its natural state, and this is where social media data sets come into play.
Social media data sets have several advantages over other datasets. One is that they contain information that is already correlated, sorted (e.g., through user profiling), and plentiful. Moreover, data shared through websites such as X and Reddit tends to be unfiltered, which businesses can leverage to create unadulterated datasets for sentiment analysis or research.
A dataset is made up of related data cells, which can be accessed and manipulated as a single unit, curated based on similar characteristics, or modified individually. At their core, datasets have untapped potential and are waiting to be utilized. As data scientists, we should be able to use datasets in the following manner:
X is often cited as one of the best places to explore datasets. Not only are existing datasets numerous, but the core user demographic for X is extensive and home to communities of polarizing ideologies and raw and genuine opinions.
The authenticity of X datasets makes them great for businesses to explore trends that drive business decisions.
We have gathered five datasets that you can use to form a project around (linked below). Each of the datasets can be of varying specificity, and they will tackle:
These five datasets are seemingly random and unrelated, but this only highlights how much variety you can get from X.
This social media data set doesn’t focus on users. Instead, it’s more of a product case study challenge. Specifically, the data provided relates to X’s advertising business.
You’re provided a dataset of campaigns, budgets, and impressions, and you’re asked to measure the data generated by an A/B test on a new advertising product.
This take-home tests your ability to:
Overall, this X data science take-home is a product case study in that you’re asked to determine how a new product compares to an old product in an A/B test.
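As a rough illustration of the kind of comparison this take-home asks for, the sketch below computes Welch's t-statistic for a per-campaign metric in the old-product and new-product groups. The metric (impressions per dollar of budget), the row layout, and the sample numbers are all assumptions made for illustration; the real dataset's columns may differ.

```python
from math import sqrt
from statistics import mean, variance

def impressions_per_dollar(campaigns):
    """Turn (budget, impressions) pairs into an efficiency metric per campaign."""
    return [impressions / budget for budget, impressions in campaigns]

def welch_t(control, treatment):
    """Welch's t-statistic for the difference in means (treatment - control)."""
    standard_error = sqrt(
        variance(control) / len(control) + variance(treatment) / len(treatment)
    )
    return (mean(treatment) - mean(control)) / standard_error

# Hypothetical (budget, impressions) rows, not taken from the real dataset.
old_product = [(100, 1000), (200, 2200), (150, 1350), (120, 1320)]
new_product = [(100, 1300), (200, 2400), (150, 2100), (120, 1560)]

control = impressions_per_dollar(old_product)
treatment = impressions_per_dollar(new_product)
lift = mean(treatment) - mean(control)
t_stat = welch_t(control, treatment)
print(f"lift: {lift:.2f} impressions per dollar, t-statistic: {t_stat:.2f}")
```

A large absolute t-statistic suggests the difference is unlikely to be noise, though a complete answer would also report a p-value or confidence interval and check how campaigns were assigned to each group.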
One of X’s most organized and detailed social media data sets compiles tweets related to 2015 New Year’s resolutions. While 2015 resolutions might seem outdated, it is evident that most people who make New Year’s resolutions tend to fail halfway through and, as a result, commit to the same promises year after year.
This dataset is helpful because it can help businesses decide when to scale their operations down or up during the year. For example, a quick skim through the dataset will reveal the prevalence of gym and weight-loss-related resolutions at the beginning of the year, just after the resolutions are made.
A project that predicts increases or decreases in demand for certain services can drive business decisions and prompt shifts in service supply throughout the year. Moreover, the dataset is straightforward to navigate as it sorts the resolutions into relevant categories (for example, philanthropy, health and fitness, and personal growth).
On July 19, 2022, the UK recorded its highest temperature on record. Around the same time, reports of melting ice caps in the Arctic spread around the internet. Research papers dating back to 1896 have helped build the scientific case for climate change.
Despite the body of scientific evidence piling up, we can gather from social media that there are still people who doubt the link between carbon emissions and climate change.
For one of your projects, you can perform a sentiment analysis on climate change using this data set provided by CrowdFlower. This social media data set contains a column specifying whether a tweet suggests accepting or rejecting the climate change phenomenon.
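Because the dataset already labels each tweet's stance, one possible starting point is a small supervised text classifier. The following is a minimal sketch of a multinomial Naive Bayes model with add-one smoothing. The label names ("accepts", "rejects") and the example tweets are invented for illustration and are not taken from the dataset.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase and split on whitespace; real tweets need more cleaning."""
    return text.lower().split()

def train(rows):
    """rows: iterable of (tweet_text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    for text, label in rows:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return label_counts, word_counts, vocab

def predict(model, text):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are skipped
                smoothed = (word_counts[label][word] + 1) / (total_words + len(vocab))
                score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled tweets standing in for the dataset's stance column.
rows = [
    ("warming is real and caused by emissions", "accepts"),
    ("rising seas show the climate is changing", "accepts"),
    ("climate change is a hoax", "rejects"),
    ("global warming is a scam and a hoax", "rejects"),
]
model = train(rows)
print(predict(model, "this hoax is a scam"))
print(predict(model, "emissions caused rising seas"))
```

On the real data you would hold out part of the labeled tweets to measure accuracy before trusting any predictions.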
In January 2022, Apple became the world’s first $3 trillion company. Many factors led to this point, but it has long been understood that Apple’s valuation rides on the public perception of the brand’s products as “premium”. This is based in no small part on its products’ decent build quality and innovative marketing. However, does X share the same opinion?
2016 was a big year for Apple, introducing many novel products with features that are still prevalent today. These are the following:
2016 is a fitting year for a sentiment analysis project on Apple, as the company introduced many features and hardware changes that divided the tech community. To this day, there is still no consensus on the Touch Bar.
It would also be interesting to see how opinions from 2016 have shifted since that time.
The US 2016 Elections, alongside the peak of the terrorist group ISIS’s self-declared caliphate, were two of the most talked about events and phenomena of 2016, with the elections garnering approximately 60 million tweets and ISIS gaining a quarter of that. In the same year, Paragon Science Inc. compiled a list of all related X networks that mention or are related to the US 2016 elections alongside tweets that heavily emphasize ISIS.
Paragon Science Inc. compiled this social media data set by employing dynamic anomaly detection technology, a deviation from the chaos and dynamical theories most often employed to understand these phenomena.
The dataset is a lovechild of many methods, utilizing sentiment analysis, network analysis, community detection, and topic detection.
We can use this dataset to develop the following projects:
If you are actively looking at tech stocks, you might encounter quite a bit of doom-posting about the fall of Meta. This pessimism, however, does not match the active user numbers: 69% of Americans, 79% of Canadians, and 329 million people in India alone still actively use Meta. Moreover, in the 35 to 44 age demographic, Meta remains the dominant social media site.
Meta datasets might be the most helpful for businesses explicitly targeting developing countries and/or Millennials in the 35 to 44 age demographic.
For Meta, we have four datasets to get started with:
A dataset that explores the demographic of Meta users allows data scientists to grasp the contours of the current active community. Because the dataset contains a lot of general information, it might not be as valuable for a project that requires specific statistics and user preferences.
However, this social media data set holds comprehensive information such as age, year born and birth date, gender, friend count, and likes. Because of that, this dataset can be used for exploratory data analysis.
Exploratory data analysis allows you to explore the data with visual techniques, check trends, confirm assumptions, and get a general gist of the data.
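Before reaching for plots, a first pass at exploratory data analysis can be as simple as summary statistics and group comparisons. The sketch below assumes rows with the fields mentioned above (age, gender, friend count, likes). The exact field names and all values are made up for illustration.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical rows using the kinds of fields the dataset is described as holding.
users = [
    {"age": 19, "gender": "female", "friend_count": 320, "likes": 45},
    {"age": 24, "gender": "male", "friend_count": 150, "likes": 10},
    {"age": 37, "gender": "female", "friend_count": 410, "likes": 120},
    {"age": 42, "gender": "male", "friend_count": 90, "likes": 5},
    {"age": 58, "gender": "female", "friend_count": 200, "likes": 60},
]

def summarize(rows, field):
    """Basic distribution summary for one numeric field."""
    values = [row[field] for row in rows]
    return {
        "min": min(values),
        "median": median(values),
        "mean": mean(values),
        "max": max(values),
    }

def mean_by_group(rows, group_field, value_field):
    """Average of value_field within each distinct value of group_field."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_field]].append(row[value_field])
    return {group: mean(values) for group, values in groups.items()}

age_summary = summarize(users, "age")
friends_by_gender = mean_by_group(users, "gender", "friend_count")
print(age_summary)
print(friends_by_gender)
```

Summaries like these help confirm or reject assumptions quickly, and they point to which variables are worth visualizing next.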
SNAP is an initiative by Stanford University and stands for Stanford Network Analysis Platform. SNAP primarily offers social media data sets for network analysis, along with libraries for C++ and Python. However, due to the sheer amount of data in SNAP datasets, it is recommended that the faster C++ library be used.
C++ tends to perform better because it is compiled ahead of time to native machine code, while Python is interpreted at run time. Many Python libraries narrow this gap by delegating heavy computation to compiled C or C++ code, but the interpreter overhead around those calls remains.
SNAP’s social media data sets are not limited to Meta; they offer Reddit, X, and more. However, this section will focus on their Meta social circles dataset.
The SNAP Meta dataset is focused on friend lists (i.e., social circles) commonly used in ego networks. Ego networks are a type of network that is based around a focal node (a central node) named an “ego,” with social relationships surrounding them as “alters.” The relationships between alters are also specified.
These alters can, in turn, become focal nodes themselves. Moreover, the SNAP dataset also records “features” (e.g., political party) but anonymizes them so that they are unidentifiable. For example, political party A will be labeled “1,” and party B will be labeled “0.”
Projects focusing on an analysis with a particular feature in mind will not do well with this dataset. For instance, if an organization prioritizes campaigns for the supporters of a specific political party, it will be challenging as the dataset itself does not specify which users belong to which political party.
However, projects that analyze the divisiveness and the clustering of nodes (e.g., how friend groups influence the appearance of certain features) will be a perfect fit for this dataset.
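To make the ego-network idea concrete, the sketch below extracts the alters of a chosen ego from an edge list and measures how densely those alters are connected to each other, which is a simple clustering signal. The edge list and node ids are hypothetical, and the code does not use the SNAP library or its file format.

```python
def ego_network(edges, ego):
    """Return the alters of `ego` and the edges that run between those alters."""
    alters = {b if a == ego else a for a, b in edges if ego in (a, b)}
    alter_edges = {
        frozenset((a, b)) for a, b in edges if a in alters and b in alters
    }
    return alters, alter_edges

def alter_density(alters, alter_edges):
    """Share of possible alter-to-alter ties that actually exist."""
    n = len(alters)
    possible = n * (n - 1) // 2
    return len(alter_edges) / possible if possible else 0.0

# Hypothetical undirected friendship edges with anonymized node ids.
edges = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (3, 4), (4, 5)]
alters, alter_edges = ego_network(edges, ego=0)
density = alter_density(alters, alter_edges)
print(sorted(alters), round(density, 3))
```

A density close to 1 means the ego's friends mostly know each other, while a low value with a few tight pairs hints at separate social circles.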
Advertisements that advocate social change, whether political or non-political, are grouped into the “political ad campaigns list.” As such, it is vital to distinguish these two categories from each other.
In this dataset, various ad campaigns from Meta were collected, including their HTML code, metadata, and their “political” or “not political” classification. Do note that the classification is not made by hand but rather by a machine learning classifier.
One project idea is to use this dataset to build neural networks that improve on the current classifier. Moreover, you can use the dataset to train an AI that develops political ad campaigns from scratch (similar in spirit to DALL·E mini, though this would be an ambitious project). The dataset also contains the necessary metadata and HTML code to train neural networks or build a classification project.
In 2018, Meta removed hundreds of US political pages for inauthentic activity (i.e., political spam). The social media data set below contains daily engagement metrics of three specific removed political pages:
The data starts from February 14, 2011, and ends on December 9, 2018. The dataset also contains the number of daily interactions, including shares and likes, and a day-on-day growth statistic.
One project idea is to create and analyze the trends and growth of the said Meta pages by thoroughly looking at the engagement statistics. Alternatively, a pseudo-sentiment analysis (without the use of NLP) can be introduced by analyzing the Meta reactions (Like, Love, Haha, Wow, Sad, Angry) on these pages using posts from 2016 to 2018 (as Meta introduced the reactions menu as an alternative to the singular Like button in 2016).
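Both ideas reduce to simple arithmetic on the engagement columns. The sketch below shows one way to compute day-on-day growth and a reaction-based pseudo-sentiment score. The reaction weights are an assumption (treating Like, Love, and Haha as positive, Sad and Angry as negative, and Wow as neutral), and the sample numbers are invented.

```python
# Assumed weights: the sign given to each reaction is a modeling choice.
REACTION_WEIGHTS = {"like": 1, "love": 1, "haha": 1, "wow": 0, "sad": -1, "angry": -1}

def reaction_score(reactions):
    """Net reaction sentiment in [-1, 1]; 0.0 when a post has no reactions."""
    total = sum(reactions.values())
    if total == 0:
        return 0.0
    weighted = sum(REACTION_WEIGHTS[name] * count for name, count in reactions.items())
    return weighted / total

def day_on_day_growth(daily_totals):
    """Percentage change between consecutive days; None where the prior day is 0."""
    return [
        None if prev == 0 else (curr - prev) / prev * 100
        for prev, curr in zip(daily_totals, daily_totals[1:])
    ]

# Hypothetical reaction counts for one post and daily interaction totals for a page.
post = {"like": 50, "love": 10, "haha": 5, "wow": 5, "sad": 10, "angry": 20}
score = reaction_score(post)
growth = day_on_day_growth([200, 250, 200, 0, 40])
print(score)
print(growth)
```

Haha in particular can signal mockery rather than approval, so it is worth testing how sensitive your conclusions are to the chosen weights.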
While Meta and X may get all the credit for being the conventional social media platforms for gathering social media data sets, Reddit offers a unique platform underutilized by many organizations and data scientists, overlooked frequently due to its unfamiliar setup.
Despite its different format, Reddit’s community-first approach is often more useful in generating business insights on target consumers once you account for the more than 50,000 niche communities (subreddits). While Reddit may not deliver the most extensive datasets, the social media data sets from Reddit are collected from a user base that is easier to confirm as relevant, as users self-select into the communities they choose to join.
For Reddit, we tackle questions from the following datasets:
One of the most extraordinary things Reddit datasets offer is the ability to look at a subreddit and its members’ demographics and grasp a product’s demand and outlook.
For example, when interested in learning which online games have the most active communities, you can quickly look up the membership numbers of a specific game’s subreddit and compare the numbers to other competing games.
r/financialindependence is a subreddit full of individuals aiming to become financially independent, meaning they wish to generate income without actively having to work for it (passive income streams). This social media data set contains the subreddit’s demographics (age, income, marital status, country of residence), allowing you to draw insights that might drive business decisions targeting such demographics.
Organizations that benefit heavily from projects like these are those offering opportunities to generate passive income (e.g., platforms for renting out real estate such as Airbnb, investment platforms such as Binance, and more). Specifically, you can utilize this social media data set to predict whether a particular person might be interested in downloading apps like Binance.
The following dataset resembles Meta’s exploratory data analysis project but differs in one crucial aspect. Because Reddit is built around communities, this dataset is much more substantial. Aside from providing the general gist of Reddit’s user base, it also gives insight into how a person of a particular demographic might favor specific subreddits over others.
The 2011 Survey contains information about the user, such as their age range, education, country of residence, income, and most importantly, their favorite subreddit. Because of how robust it is, we can do the following projects:
As you can see, the dataset is incredibly flexible. There are definitely a lot more projects that come to mind, but these are a select few that are notable.
The following dataset contains Reddit’s engagement data (submissions, comments, votes, account membership) per month from the month of conception (June 2005) to March 2017.
One project you can do using this data is to predict the engagement numbers for the months after the dataset ends and cross-check them with the currently available data (2017 to the current year).
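A simple way to prototype this kind of forecast is to fit a trend on the earlier months, predict the later ones, and measure the error against held-out values. The sketch below uses an ordinary least-squares line, which is a deliberate simplification since real engagement growth is unlikely to be linear. The monthly counts are hypothetical.

```python
def fit_line(ys):
    """Ordinary least squares for y = intercept + slope * t, with t = 0, 1, 2, ..."""
    n = len(ys)
    t_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    covariance = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(ys))
    t_variance = sum((t - t_mean) ** 2 for t in range(n))
    slope = covariance / t_variance
    return y_mean - slope * t_mean, slope

def forecast(ys, steps):
    """Extend the fitted line `steps` months past the end of the series."""
    intercept, slope = fit_line(ys)
    return [intercept + slope * (len(ys) + k) for k in range(steps)]

def mean_absolute_pct_error(actual, predicted):
    """Average absolute error as a percentage of the actual values."""
    return sum(abs(a - p) / a for a, p in zip(actual, predicted)) / len(actual) * 100

# Hypothetical monthly comment counts; the last three months are held out.
monthly_comments = [100, 120, 140, 160, 180, 200, 230, 235, 265]
history, holdout = monthly_comments[:6], monthly_comments[6:]
predicted = forecast(history, steps=len(holdout))
error_pct = mean_absolute_pct_error(holdout, predicted)
print(predicted, round(error_pct, 2))
```

The same train-then-check structure carries over if you swap the line for an exponential or seasonal model.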
The value of stocks heavily relies on public sentiment, and while Reddit, at first glance, might not be a determining factor in stock prices, a closer look at the news will reveal otherwise. One of the most extraordinary events in Reddit history was the inflation of the GameStop stock price due to a subreddit called r/WallStreetBets.
The subreddit drove up the GameStop stock price largely on its own, revealing a vulnerability in stock pricing and Reddit’s sheer power. When it comes to stocks and trading, Reddit is no pushover and is, in fact, a key tool.
AAPL (Apple Inc.) remains one of the world’s most potent stocks despite the recent decline in the value of other tech stocks. Running a sentiment analysis of the AAPL stock using Reddit data can be a powerful way to examine the industry.
Since the data only covers November 2016 to October 2021, you can build a machine learning project to predict AAPL’s stock price and cross-check the result against the subsequent AAPL stock prices from October 2021 to the present.
Another project idea is to compare sentiment analysis results from a particular time frame (e.g., November 2017) and see how they are reflected in AAPL’s stock price for the same period.
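One hedged way to check whether sentiment and price move together is to compute the correlation between a monthly sentiment score and the monthly stock return. The sketch below implements the Pearson correlation coefficient directly. Both series are invented for illustration, and producing the sentiment scores themselves is assumed to happen in an earlier step.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    covariance = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    x_spread = sum((x - x_mean) ** 2 for x in xs)
    y_spread = sum((y - y_mean) ** 2 for y in ys)
    return covariance / sqrt(x_spread * y_spread)

# Hypothetical monthly values: mean post sentiment in [-1, 1] and AAPL return in %.
monthly_sentiment = [0.10, 0.30, -0.20, 0.05, 0.40, -0.10]
monthly_return_pct = [1.5, 4.0, -3.0, 0.5, 5.5, -1.0]
r = pearson(monthly_sentiment, monthly_return_pct)
print(f"correlation: {r:.2f}")
```

A high correlation on its own does not show that Reddit sentiment moves the price; comparing sentiment against the following month's return is a more telling test.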
YouTube is the world’s leading video streaming platform. Moreover, it is currently one of the world’s most significant outlets for audiovisual ad campaigns and the primary outlet of multimedia content, including news, music, travel, education, and virtually any other video-related content.
Below, we list four social media data sets that utilize YouTube in particular:
A spot on YouTube’s trending chart is something most video creators want, as it reflects the general popularity of a video and the hype it receives within a community. Nevertheless, the trending tab is not uniform around the world, as it elevates different content in every country.
This dataset records trending videos, including their engagement statistics, tags, metadata and more. As specified earlier, the dataset heavily varies per country, and currently, the dataset includes the trending chart of the following countries:
Moreover, one can utilize this dataset for the following projects:
Unstructured data, especially multimedia data, is tough to analyze and requires a commitment of huge amounts of processing power and resources. The YouTube 8M dataset compiles 350,000 hours of unstructured, raw video files and turns them into a CSV file that allows for easier processing.
The dataset also contains human-verified annotations describing the video’s audiovisual elements. The dataset can be used to train machine learning algorithms that require video input. You can consider the following projects:
At the forefront of music activism, Childish Gambino’s “This Is America” is a viral anthem that depicts and protests the police brutality often experienced by the Black community in America. As of writing, the music video has 846 million views and more than 750,000 comments.
This dataset contains the information of more than 200,000 comments scraped using YouTube’s API. One of the projects you can do is to use Natural Language Processing to do sentiment analysis on the comments within the video.
Natural language processing (NLP) allows computers to understand human language. It takes complex sentences and breaks them down into simple structures, enabling algorithms to better perform numerical interpretation. NLP is often used in sentiment analysis.
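The "break sentences into simple structures" step can be illustrated with a very small lexicon-based scorer: split each comment into word tokens, look each token up in a sentiment lexicon, and flip the sign after a negation. The lexicon, the negation list, and the example comments below are all made up for illustration. Production projects typically rely on a much larger lexicon or a trained model.

```python
import re

# Tiny hand-made lexicon for illustration only.
LEXICON = {"powerful": 1, "brilliant": 1, "love": 1, "boring": -1, "overrated": -1, "hate": -1}
NEGATIONS = {"not", "never", "no"}

def tokenize(comment):
    """Lowercase a comment and split it into word tokens."""
    return re.findall(r"[a-z']+", comment.lower())

def sentiment(comment):
    """Sum lexicon scores, flipping the sign of a word that follows a negation."""
    tokens = tokenize(comment)
    score = 0
    for i, token in enumerate(tokens):
        value = LEXICON.get(token, 0)
        if i > 0 and tokens[i - 1] in NEGATIONS:
            value = -value
        score += value
    return score

for comment in ["Powerful video, I love it!", "Not brilliant, honestly boring."]:
    print(tokenize(comment), sentiment(comment))
```

A positive total suggests a favorable comment and a negative total an unfavorable one. Sarcasm and slang, both common in YouTube comments, are where this simple approach breaks down.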
The following dataset contains the most subscribed YouTube channels, including their engagement statistics, date created, video count, genre, and rank. One of the projects you can do is a statistical analysis to determine which genres are most popular.
Aside from the datasets above, what datasets can you build projects with and explore? These social media data sets can be of use:
This social media data set contains information about accounts on LinkedIn, the individual’s job experience, LinkedIn activity, current company, gender, race, and more. You can do statistical analysis to analyze which demographic is favored in which jobs and which type of individual or group is highest or lowest paid.
This dataset contains the answers to a survey questionnaire about how Millennials are influenced by social media, especially in their shopping habits. You can create a project analyzing which demographic, on which social media platform, has its shopping behavior most heavily influenced.
While this dataset is not necessarily one we could call a social media data set, it is a dataset where one can build awesome projects. Debuted as part of a hackathon, this database contains information to help you make a fake news detection algorithm using NLP.
This dataset provides insights into user behavior on Instagram, focusing on various aspects such as engagement metrics, content analysis, and user demographics. You can develop a project to analyze which content types, hashtags, and user demographics drive the highest engagement on the platform. Additionally, you can explore how follower growth, churn, and influencer collaborations impact overall user interaction and brand reach on Instagram.
While this dataset primarily focuses on navigation, Waze functions similarly to social media by fostering a community-driven experience where users share real-time information about traffic, hazards, and road conditions. This interactive element of Waze, where users can contribute and receive updates from others, makes it a valuable dataset for building projects that analyze user engagement patterns similar to those found on social media platforms. Released as part of the Google Advanced Data Analytics Professional Certificate program, this dataset enables you to develop a user churn prediction model using machine learning techniques to understand user behavior within the app.
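As a minimal sketch of what a churn prediction model involves, the code below fits a logistic regression with plain stochastic gradient descent and outputs a churn probability for a new user. The two features (sessions and drives, both scaled to the range 0 to 1), the labels, and the hyperparameters are assumptions for illustration and do not reflect the actual Waze columns. In practice you would likely use an established machine learning library instead.

```python
from math import exp

def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1 / (1 + exp(-z))

def train_logistic(rows, labels, lr=0.1, epochs=2000):
    """Fit logistic regression with plain stochastic gradient descent."""
    weights, bias = [0.0] * len(rows[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            error = sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - y
            weights = [w - lr * error * v for w, v in zip(weights, x)]
            bias -= lr * error
    return weights, bias

def churn_probability(model, x):
    """Predicted probability that a user with features x churns."""
    weights, bias = model
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))

# Hypothetical features scaled to [0, 1]: (sessions, drives); label 1 means churned.
rows = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.8, 0.9], [0.9, 0.7], [0.7, 0.8]]
labels = [1, 1, 1, 0, 0, 0]
model = train_logistic(rows, labels)
low_activity = churn_probability(model, [0.15, 0.15])
high_activity = churn_probability(model, [0.85, 0.85])
print(round(low_activity, 3), round(high_activity, 3))
```

In this toy data, low activity goes with churn, so the learned weights come out negative. The real dataset will need a train and test split and a check for class imbalance.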
All of these free datasets can be used on your next data science project. In particular, social media data sets are great for data analytics projects, especially LinkedIn data. You can also use these sources for advanced machine learning and classification data science projects, as X data works well for text classification projects.