It’s estimated that we generate 328.77 million terabytes of data daily worldwide, so making even 1% of that data usable is a difficult task. This is where data engineers come in.
Data engineers are responsible for creating a solid foundation upon which data scientists and analysts build their models and businesses make informed decisions. As the amount of generated data grows exponentially, so does the need for efficient pipelines, databases, and data processing strategies.
While data engineering is in high demand, recent developments in the tech industry have made the field incredibly competitive. Knowing fundamental data engineering concepts alone is no longer enough to stand out as a candidate. Many applicants today have a robust portfolio of data engineering projects. Not only are these projects proof of strong technical experience, but they also showcase problem-solving abilities, creativity, and an aptitude for translating complex data needs into scalable solutions.
The field requires knowledge of a broad set of concepts. These data engineering projects can help you hone your skills in these domains.
System design is an essential aspect of data engineering. Data engineers use system design to provide a high-level overview of how a data pipeline meets its requirements, such as efficiency, scalability, and real-time processing.
At the very least, a data engineer should know what a possible system looks like at a high level. Here are some system design projects that you can practice with:
Here’s an Interview Query system design question that provides a great starting point.
Let’s say you’re in charge of getting payment data into your internal data warehouse.
How would you build an ETL pipeline to get Stripe payment data into the database so analysts can build revenue dashboards and run analytics?
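One way to sketch a first pass at this pipeline is a simple batch job: pull charges from Stripe’s API, flatten the fields analysts care about, and load them into a warehouse staging table. The snippet below is a minimal sketch, assuming a Postgres-compatible warehouse reachable through SQLAlchemy and hypothetical table and column names; a production version would add incremental loading, retries, and orchestration.

```python
import os
import pandas as pd
import stripe
from sqlalchemy import create_engine

# Assumption: credentials live in environment variables.
stripe.api_key = os.environ["STRIPE_API_KEY"]
engine = create_engine(os.environ["WAREHOUSE_URI"])  # e.g. postgresql://...

def extract_charges(limit=100):
    """Pull charge objects from Stripe, paginating automatically."""
    charges = stripe.Charge.list(limit=limit)
    for charge in charges.auto_paging_iter():
        yield charge

def transform(charge):
    """Keep only the fields analysts need for revenue dashboards."""
    return {
        "charge_id": charge["id"],
        "amount_cents": charge["amount"],
        "currency": charge["currency"],
        "status": charge["status"],
        "created_at": pd.to_datetime(charge["created"], unit="s"),
    }

def load(rows):
    """Append the batch to a staging table (hypothetical name)."""
    df = pd.DataFrame(rows)
    df.to_sql("stg_stripe_charges", engine, if_exists="append", index=False)

if __name__ == "__main__":
    load([transform(c) for c in extract_charges()])
```

From there, analysts can build revenue models on top of the staging table, and the job can be scheduled and monitored with an orchestrator like Airflow.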
As a data engineer, you must ensure that, even under load, your systems can scale without issues.
Consider the following edge case:
Suppose you have a table with a billion rows. How would you add a column inserting data from the original source without affecting the user experience?
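A common pattern here is to avoid one giant blocking operation: add the new column as nullable (so the DDL is cheap), then backfill it from the source in small batches so locks stay short and user-facing queries keep flowing. Below is a minimal sketch assuming Postgres via psycopg2 and hypothetical table and column names; exact locking behavior depends on your database and version.

```python
import time
import psycopg2

# Assumption: a hypothetical `events` table gains a new nullable `region` column.
conn = psycopg2.connect("dbname=analytics user=etl")  # hypothetical DSN
conn.autocommit = True

with conn.cursor() as cur:
    # Cheap, metadata-only change: nullable column, no default, no table rewrite.
    cur.execute("ALTER TABLE events ADD COLUMN IF NOT EXISTS region TEXT")

BATCH_SIZE = 10_000
while True:
    with conn.cursor() as cur:
        # Backfill a small slice at a time so row locks are held only briefly.
        cur.execute(
            """
            UPDATE events e
            SET region = s.region
            FROM source_regions s
            WHERE e.id = s.event_id
              AND e.region IS NULL
              AND e.id IN (
                  SELECT id FROM events
                  WHERE region IS NULL
                  ORDER BY id
                  LIMIT %s
              )
            """,
            (BATCH_SIZE,),
        )
        if cur.rowcount == 0:
            break
    time.sleep(0.5)  # give user-facing traffic room to breathe between batches
```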
Let’s try something a bit more complex and extensive. As a data engineer, you should be able to create database schemas that are fully compliant with normalization constraints. Let’s assume the following problem:
We want to build an application that allows anyone to review restaurants. This app should allow users to sign up, create a profile, and leave reviews on restaurants if they wish. These reviews can include texts and images, and a user should only be able to write one review for a restaurant, with the ability to update it.
Design a database schema that would support the application’s main functions.
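One possible normalized schema is sketched below using SQLite (from Python’s standard library) so it stays runnable; the table and column names are illustrative. The key detail is the UNIQUE(user_id, restaurant_id) constraint, which enforces one review per user per restaurant while still allowing updates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (
        user_id     INTEGER PRIMARY KEY,
        username    TEXT NOT NULL UNIQUE,
        email       TEXT NOT NULL UNIQUE,
        created_at  TEXT NOT NULL DEFAULT (datetime('now'))
    );

    CREATE TABLE restaurants (
        restaurant_id INTEGER PRIMARY KEY,
        name          TEXT NOT NULL,
        address       TEXT
    );

    CREATE TABLE reviews (
        review_id     INTEGER PRIMARY KEY,
        user_id       INTEGER NOT NULL REFERENCES users(user_id),
        restaurant_id INTEGER NOT NULL REFERENCES restaurants(restaurant_id),
        rating        INTEGER CHECK (rating BETWEEN 1 AND 5),
        body          TEXT,
        updated_at    TEXT NOT NULL DEFAULT (datetime('now')),
        UNIQUE (user_id, restaurant_id)  -- one review per user per restaurant
    );

    -- Images live in their own table so a review can have zero or many.
    CREATE TABLE review_images (
        image_id  INTEGER PRIMARY KEY,
        review_id INTEGER NOT NULL REFERENCES reviews(review_id),
        image_url TEXT NOT NULL
    );
    """
)
```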
In the age of artificial intelligence, integrating machine learning into data pipelines is becoming increasingly important. Machine learning systems need vast amounts of data to train models effectively.
As a data engineer, understanding how to design systems that can handle, pre-process, and serve data for machine learning processes is invaluable. It’s not just about feeding data to models; it’s also about optimizing the data flow, ensuring quality, and deploying trained models seamlessly. Let’s look at some data engineering project ideas you can get started with.
You’ve been asked to help the Chicago Police Department build machine learning services to power the next generation of mobile crime analytics software. This software aims to predict, in real-time, the category of a crime (e.g., ‘robbery’, ‘assault’, ‘theft’) as soon as it’s reported by an emergency call. This prediction can only be made with information available at the time of the call (like time and location) without on-the-ground assessment or knowledge of ex post action (such as arrest, conviction, demographics of victim(s) or offender(s)).
One of the best ways to approach this model is through random forests, a robust and widely used algorithm for classification tasks. Because it can capture non-linear relationships between features and handle a mix of categorical and continuous variables, it’s a good strategy for this problem in particular. However, while it’s a good starting point, performance should be evaluated continuously using real-world data to ensure accuracy and fairness.
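As a quick illustration of that approach, here is a minimal scikit-learn sketch. The file path, feature columns (hour, day of week, district, coordinates), and crime categories are hypothetical stand-ins for whatever call-time features the real dataset exposes.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Hypothetical extract with call-time features only: no post-incident information.
df = pd.read_csv("crime_calls.csv")
X = pd.get_dummies(
    df[["hour", "day_of_week", "district", "latitude", "longitude"]],
    columns=["day_of_week", "district"],
)
y = df["crime_category"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = RandomForestClassifier(n_estimators=200, class_weight="balanced", random_state=42)
clf.fit(X_train, y_train)

# Evaluate per-class precision/recall, which matters for fairness across crime types.
print(classification_report(y_test, clf.predict(X_test)))
```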
Here’s a system design question for a machine learning pipeline.
Suppose we want to design a data pipeline that ingests real-time traffic data from multiple sources, such as traffic cameras, IoT devices on roads, and mobile app data. The pipeline should be capable of handling a high velocity and volume of data, pre-processing it, and then forwarding it to analytics tools or machine learning models for further insights. Consider the following questions in your solution:
You’re tasked with creating a central repository of health data from various city hospitals, clinics, and health apps. The system should cater to batch processing, ensure data quality and consistency, and support fast querying for analytics purposes. Here are some key challenges:
You are tasked with forecasting a town’s electricity demand to avoid outages and reduce costs. A strong approach is using the ARIMA model (Auto Regressive Integrated Moving Average), which excels in time series forecasting. ARIMA effectively captures patterns in daily electricity usage and accounts for seasonality, such as higher demand in winter. You might also consider advanced variants like SARIMA or SARIMAX to include external factors. By implementing ARIMA, you can align electricity supply with demand, optimizing efficiency and minimizing waste.
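A minimal statsmodels sketch of that idea is shown below; the CSV path, column names, and (p, d, q) orders are placeholders you would tune with ACF/PACF plots or a grid search. SARIMAX is used here so the daily seasonal cycle can be modeled explicitly.

```python
import pandas as pd
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Hypothetical hourly demand series indexed by timestamp.
demand = (
    pd.read_csv("electricity_demand.csv", parse_dates=["timestamp"])
      .set_index("timestamp")["demand_mw"]
      .asfreq("H")
)

# SARIMAX = ARIMA plus a seasonal component (here: a 24-hour daily cycle).
model = SARIMAX(demand, order=(1, 1, 1), seasonal_order=(1, 1, 1, 24))
result = model.fit(disp=False)

# Forecast the next 24 hours to plan generation and avoid outages.
print(result.forecast(steps=24))
```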
When predicting housing prices with missing square footage data, the first step is to evaluate how this missing data affects model accuracy. You can train models on different subsets (e.g., 60% vs. 80%) to see if performance drops significantly. If the accuracy remains stable, it might be feasible to exclude the missing entries, though this could suggest square footage isn’t as critical as other features.
Alternatively, imputation can be used to fill in the gaps. Simple methods like using the mean or median are quick but might miss important feature relationships. A more effective approach is to use nearest neighbors, imputing values based on similar listings with shared characteristics like bedrooms, bathrooms, or location.
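scikit-learn’s KNNImputer makes the nearest-neighbor approach easy to prototype; the dataset and feature names below are hypothetical listing attributes.

```python
import pandas as pd
from sklearn.impute import KNNImputer

listings = pd.read_csv("housing.csv")  # hypothetical dataset with missing square footage
features = ["bedrooms", "bathrooms", "lot_size", "square_footage"]

# Note: categorical features such as neighborhood need numeric encoding before KNN.
imputer = KNNImputer(n_neighbors=5)
listings[features] = imputer.fit_transform(listings[features])
```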
Often, data is not publicly available through conventional methods like RESTful APIs or in cleaned CSV datasets. Data scraping, often known as web scraping, is the process of extracting structured data from web pages. This process is integral to gathering data that isn’t readily available through traditional means. This is a critical skill that any competitive data engineer should have, as it’s often utilized to pull valuable insights from web resources or feed data into systems.
Before diving into web scraping, let’s review one important concept you need to know: robots.txt.
In the world of data scraping and web crawling, it’s imperative to ensure that one’s activities are both ethical and legal. This is where the robots.txt file plays a pivotal role, serving as a guideline for web robots (often referred to as ‘bots’ or ‘crawlers’).
robots.txt is a standard adopted by websites to guide web scraping and crawling activities. Located in the root directory of a website (e.g., https://www.example.com/robots.txt), this file provides directives to web crawlers about which pages or sections of the site shouldn’t be accessed or scraped by automated processes.
Respecting robots.txt matters on two fronts. Scraping that ignores robots.txt can put undue stress on web servers, leading to slow performance or even causing websites to crash. Ignoring robots.txt and accessing restricted parts of a website can also lead to legal consequences; following this file ensures that scraping activities remain within legal boundaries.
A typical robots.txt file contains “User-agent” declarations followed by “Allow” or “Disallow” directives.
```
User-agent: *
Disallow: /private/
Disallow: /test/
Allow: /public/
```
In the above example, the wildcard User-agent means the rules apply to all crawlers: the /private/ and /test/ directories are off-limits to automated access, while /public/ may be crawled.
Online shopping platforms hold a treasure trove of consumer opinions and feedback. By extracting product reviews from these platforms, businesses can gain deep insights into product performance, customer preferences, and areas for improvement.
Imagine you want to analyze public opinion on a particular product category, say wireless earbuds. You could start by identifying popular e-commerce websites selling this product and then design a scraper to extract reviews, ratings, and associated comments. Leveraging libraries like Beautiful Soup or Scrapy in Python can simplify this process. Once the data is gathered, it can be cleaned, structured, and further analyzed to gauge consumer preferences and even predict future sales trends.
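A bare-bones requests + Beautiful Soup sketch is shown below. The URL and CSS selectors are hypothetical, since every site’s markup differs, and you should check the site’s robots.txt and terms of service before scraping.

```python
import time
import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "review-research-bot/0.1 (contact@example.com)"}  # identify yourself

def scrape_reviews(url):
    """Return (rating, text) pairs from one product page; selectors are hypothetical."""
    resp = requests.get(url, headers=HEADERS, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    reviews = []
    for block in soup.select("div.review"):           # hypothetical selector
        rating = block.select_one("span.rating")
        text = block.select_one("p.review-text")
        if rating and text:
            reviews.append((rating.get_text(strip=True), text.get_text(strip=True)))
    time.sleep(2)  # be polite: throttle requests between pages
    return reviews

print(scrape_reviews("https://www.example.com/products/wireless-earbuds/reviews"))
```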
To remain competitive in the market, businesses often need to monitor the pricing strategies of their competitor(s). While there are tools available that offer this service, building a custom scraper allows for tailored insights.
Consider an example where a local bookstore is trying to keep its pricing competitive against major online retailers. By regularly scraping these online platforms for the prices of top-selling books, the bookstore can adjust its prices dynamically. This project would involve identifying the specific books to monitor, setting up a scraper to extract current pricing information at regular intervals, and potentially even implementing an alert system for drastic price changes. This way, the bookstore stays ahead, offering competitive pricing to its customers.
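Once prices are being collected on a schedule, the alerting piece can be as simple as comparing the two most recent observations for each book and retailer and flagging large moves. The sketch below assumes the scraped prices already sit in a local SQLite table with a hypothetical schema (book_title, retailer, price, scraped_at).

```python
import sqlite3

ALERT_THRESHOLD = 0.10  # flag changes of more than 10%

conn = sqlite3.connect("book_prices.db")  # hypothetical database of scraped prices
rows = conn.execute(
    """
    SELECT book_title, retailer, price, scraped_at
    FROM prices
    ORDER BY book_title, retailer, scraped_at DESC
    """
).fetchall()

latest = {}
for title, retailer, price, scraped_at in rows:
    key = (title, retailer)
    if key not in latest:
        latest[key] = price              # most recent price for this pair
    elif latest[key] is not None:
        previous = price                 # next-most-recent price
        change = (latest[key] - previous) / previous
        if abs(change) >= ALERT_THRESHOLD:
            print(f"ALERT: {title} at {retailer} moved {change:+.1%}")
        latest[key] = None               # only compare the two newest rows per pair
```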
For individuals entering the job market or considering a career change, understanding the current demand for certain positions, required skills, and average salaries can be invaluable. This data, while available on job listing websites, is usually scattered and not always easy to analyze.
You could develop a scraper to extract listings from popular job portals for specific roles or industries. This information, once collated, could provide insights into the most in-demand skills, emerging job roles, and salary benchmarks. For instance, someone considering a transition into data science could scrape listings for data scientist roles, analyze the common tools and languages required, and even predict which sectors are hiring the most.
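After the listings are scraped, a small pandas pass can surface which skills appear most often. The skills list and the CSV of scraped job descriptions below are hypothetical.

```python
import pandas as pd

SKILLS = ["python", "sql", "spark", "airflow", "aws", "kafka", "dbt"]

jobs = pd.read_csv("data_engineer_listings.csv")  # hypothetical scraped listings
descriptions = jobs["description"].str.lower()

# Count how many listings mention each skill at least once.
skill_counts = {skill: descriptions.str.contains(skill, regex=False).sum() for skill in SKILLS}
print(pd.Series(skill_counts).sort_values(ascending=False))
```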
ETL (Extract, Transform, Load) is the backbone of many data processing tasks. It involves taking data from source systems, transforming it into a format suitable for analysis, and loading it into a data warehouse. Designing efficient ETL pipelines can be the difference between data that’s actionable and data that’s a mess.
ETL is the bread and butter of a data engineer’s career. Knowing how to design, implement, and automate these pipelines is critical. Here are some ETL data engineering projects you can try out.
You’re tasked with designing an ETL pipeline for a model that uses videos as input.
How would you collect and aggregate data for multimedia information, specifically when it’s unstructured data from videos?
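One common first step is to turn each raw video into a structured metadata record before any heavier processing. Here’s a minimal sketch using OpenCV; the directory layout is hypothetical, and a real pipeline would likely add audio transcription, frame sampling, or object detection downstream.

```python
from pathlib import Path

import cv2
import pandas as pd

def video_metadata(path):
    """Extract basic structured metadata from one video file."""
    cap = cv2.VideoCapture(str(path))
    fps = cap.get(cv2.CAP_PROP_FPS)
    frames = cap.get(cv2.CAP_PROP_FRAME_COUNT)
    record = {
        "file": path.name,
        "width": int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)),
        "height": int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)),
        "fps": fps,
        "duration_s": frames / fps if fps else None,
    }
    cap.release()
    return record

# Hypothetical landing zone of raw video uploads.
records = [video_metadata(p) for p in Path("raw_videos").glob("*.mp4")]
print(pd.DataFrame(records))
```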
Interested in identifying your music listening habits and storing this data for future analysis? Spotify provides a Spotify Web API, which developers can use to access various types of data from the Spotify platform. This includes data about music tracks, albums, and artists, as well as playlist data and user profiles.
To do this project, create a Spotify developer account, extract this data, load it into a database, and create an Airflow task that automates the process for you. If you want something a bit more complicated, you can add a notification system that alerts you whenever the pipeline is down.
If you’re down for a bit of analytics, extract the data from the database and create a pipeline for an analytics sandbox. This type of task allows you to explore the different facets of data engineering without dealing with real-time high-velocity data streams.
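A minimal Airflow (2.x-style) sketch of that daily extraction is shown below, using the spotipy client. Credentials are assumed to come from the standard SPOTIPY_* environment variables, and the SQLite destination is a stand-in for whatever database you choose; failure alerting can be layered on with Airflow’s notification hooks.

```python
from datetime import datetime

import pandas as pd
import spotipy
from airflow import DAG
from airflow.operators.python import PythonOperator
from spotipy.oauth2 import SpotifyOAuth
from sqlalchemy import create_engine


def extract_and_load():
    """Pull recently played tracks and append them to a local table."""
    # SpotifyOAuth reads client id/secret/redirect URI from environment variables.
    sp = spotipy.Spotify(auth_manager=SpotifyOAuth(scope="user-read-recently-played"))
    items = sp.current_user_recently_played(limit=50)["items"]
    rows = [
        {
            "played_at": item["played_at"],
            "track": item["track"]["name"],
            "artist": item["track"]["artists"][0]["name"],
        }
        for item in items
    ]
    engine = create_engine("sqlite:///spotify_history.db")  # stand-in destination
    pd.DataFrame(rows).to_sql("plays", engine, if_exists="append", index=False)


with DAG(
    dag_id="spotify_listening_history",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="extract_and_load", python_callable=extract_and_load)
```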
Ever wonder about your reading habits? Goodreads, a haven for book enthusiasts, has an API that offers a wealth of information on books, authors, and personal reading records (note that Goodreads no longer issues new API keys, so you may need to work from its data export feature instead). By integrating with this data, you can extract information on the books you’ve read, your ratings, favorite genres, and more. This raw data can then be cleaned and standardized, for example by ensuring genres like ‘Mystery’ and ‘Detective Fiction’ align.
Once polished, the next step is to store this data in a relational database like PostgreSQL or MySQL. Use platforms like Tableau or Power BI to create dashboards that depict your reading trends, preferred genres, and global author distribution.
For an added challenge, consider implementing a notification system for reading milestones or even a basic recommendation system.
Conventionally, data is stored in relational database management systems (RDBMS), where data follows a strict schema and a certain level of normalization. However, as data requirements continue to morph, many systems require a more flexible approach. NoSQL databases are used to fill this gap.
Data engineers need to interact with different types of data, so it’s essential to be familiar with database systems other than SQL. Let’s look at these NoSQL databases and see how to integrate them into a data pipeline.
Apache Cassandra is a wide-column store NoSQL database known for its high write and read throughput. It’s often the go-to for many tech companies when it comes to real-time big data analytics due to its seamless scalability.
Let’s say you’re trying to design a real-time user activity tracking system for a sprawling e-commerce website using Cassandra. Data modeling in Cassandra is distinctly different from that in relational databases, as the design revolves around the queries that will be executed rather than the data’s relational structure.
For our proposed project, tracking activities like user logins, product views, cart additions, and purchases would be vital. These activities can be modeled in a way that makes it efficient to query for real-time analytics or even retrospective analysis. Additionally, given the distributed nature of Cassandra, data replication, partitioning, and consistency must be considered in the design.
The end goal of this data engineering project would be to have a system robust enough to handle massive influxes of user data in real-time and provide insights, perhaps even integrating it with a dashboard that showcases user activity metrics. This project would not only provide hands-on experience with NoSQL databases but also focus on the importance of scalable data design and real-time analytics in today’s digital age.
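To make the query-first modeling concrete, here’s a small sketch with the Python cassandra-driver. The keyspace, table, and partitioning choices are illustrative: partitioning by user and day keeps partitions bounded, and clustering by event time (descending) makes “latest activity for a user” a cheap read.

```python
import uuid
from datetime import datetime, timezone

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])  # assumes a local Cassandra node
session = cluster.connect()

session.execute(
    "CREATE KEYSPACE IF NOT EXISTS shop "
    "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}"
)

# Query-first design: "show a user's recent activity" drives the primary key.
session.execute(
    """
    CREATE TABLE IF NOT EXISTS shop.user_activity (
        user_id     uuid,
        day         date,
        event_time  timestamp,
        event_type  text,       -- login, product_view, cart_add, purchase
        product_id  text,
        PRIMARY KEY ((user_id, day), event_time)
    ) WITH CLUSTERING ORDER BY (event_time DESC)
    """
)

now = datetime.now(timezone.utc)
session.execute(
    "INSERT INTO shop.user_activity (user_id, day, event_time, event_type, product_id) "
    "VALUES (%s, %s, %s, %s, %s)",
    (uuid.uuid4(), now.date().isoformat(), now, "product_view", "sku-123"),
)
```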
Neo4j is a graph database that excels in representing and traversing relationships between entities. Using Neo4j, design a system that finds the shortest path of relationships between two random movie actors, similar to the ‘Six Degrees of Kevin Bacon’ game. Pull in data from movie databases and establish relationships between actors, movies, directors, and genres. Then, create a user interface where users can challenge the system with two actors’ names and the system finds and displays their relationship across various movies.
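The core query for that “six degrees” lookup is a one-liner in Cypher. Below is a small sketch with the official Neo4j Python driver; the connection details, credentials, and the Actor/ACTED_IN labels are assumptions about how you loaded the movie data.

```python
from neo4j import GraphDatabase

# Assumes a local Neo4j instance with (:Actor)-[:ACTED_IN]->(:Movie) data loaded.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

QUERY = """
MATCH (a:Actor {name: $actor_a}), (b:Actor {name: $actor_b}),
      p = shortestPath((a)-[:ACTED_IN*..12]-(b))
RETURN [n IN nodes(p) | coalesce(n.name, n.title)] AS chain
"""

def degrees_of_separation(actor_a, actor_b):
    """Return the alternating actor/movie chain linking the two actors, if any."""
    with driver.session() as session:
        record = session.run(QUERY, actor_a=actor_a, actor_b=actor_b).single()
        return record["chain"] if record else None

print(degrees_of_separation("Kevin Bacon", "Tom Hanks"))
```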
Using MongoDB, a leading document-oriented NoSQL database known for its dynamic schema capabilities, you can develop a dynamic content personalization engine for different platforms, including news websites or streaming services. This engine would curate recommendations based on user viewing or reading habits, interests, and preferences.
Data modeling would revolve around user profiles, encapsulating their behavior, interactions, and feedback. In parallel, content metadata (spanning articles, videos, or music tracks) would be structured to hold detailed attributes, facilitating effective matching with user profiles.
An essential component of the engine would be the recommendation algorithm, which would constantly analyze the relationship between user profiles and content metadata to churn out the best content suggestions. Over time, as more interactions are logged and more content is added, the engine would refine its recommendations.
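A first cut with pymongo might look like the sketch below: one collection for user profiles, one for content metadata, and a naive recommendation query that matches content tags against a user’s interests. Collection names and fields are illustrative, and a real engine would replace the tag match with a proper scoring model.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumes a local MongoDB
db = client["personalization"]

# Flexible, document-shaped profiles and content metadata (illustrative fields).
db.users.insert_one({
    "_id": "user_42",
    "interests": ["data engineering", "jazz", "cycling"],
    "history": [{"content_id": "article_9", "action": "read"}],
})
db.content.insert_many([
    {"_id": "article_9", "type": "article", "tags": ["data engineering", "etl"]},
    {"_id": "video_3", "type": "video", "tags": ["jazz", "live sessions"]},
    {"_id": "article_17", "type": "article", "tags": ["gardening"]},
])

def recommend(user_id, limit=5):
    """Naive recommender: unseen content whose tags overlap the user's interests."""
    user = db.users.find_one({"_id": user_id})
    seen = {h["content_id"] for h in user.get("history", [])}
    cursor = db.content.find(
        {"tags": {"$in": user["interests"]}, "_id": {"$nin": list(seen)}}
    ).limit(limit)
    return list(cursor)

print(recommend("user_42"))
```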
Understanding various data sources and the art of data migration are two essential pillars for all data engineers to master. The modern digital ecosystem is vast and varied. Data doesn’t just reside in traditional databases: it emerges from APIs, is processed in big data platforms like Apache Spark, and is communicated via messaging systems like Google Pub/Sub.
Data migration, the process of transferring data between storage types, formats, or computer systems, is crucial. As organizations evolve, they frequently need to move data to newer systems, formats, and applications. Ensuring this data transition is seamless, accurate, and efficient is a significant responsibility.
For data engineers, understanding data sources and migration isn’t just a job requirement; it’s the cornerstone of building resilient, dynamic, and efficient systems. Below are some data engineering projects that highlight the importance and complexity of working with diverse data sources and migration.
In this example, you’re tasked with harnessing public weather APIs to construct a platform that provides historical weather data analysis. Users can query conditions for a specific date and location, which would be invaluable for sectors like agriculture, event planning, or even crime investigation.
The platform should aggregate data from multiple APIs to ensure coverage for as many regions and dates as possible. To go beyond just fetching data, you should integrate Apache Spark to perform real-time analytics, offering trends like annual rainfall and temperature fluctuations or even predicting weather anomalies. The complexity here lies in handling a vast amount of data, ensuring accuracy by cross-referencing sources, and presenting it in a user-friendly manner.
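Once the API responses are landed (say, as JSON files in object storage), Spark handles the heavy aggregation. The sketch below assumes a hypothetical bucket layout and a normalized schema with station, observation date, rainfall, and temperature fields.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("weather-history").getOrCreate()

# Hypothetical landing zone of normalized API responses.
weather = spark.read.json("s3a://weather-landing/observations/*.json")

annual_trends = (
    weather
    .withColumn("year", F.year(F.to_date("observation_date")))
    .groupBy("station_id", "year")
    .agg(
        F.sum("rainfall_mm").alias("total_rainfall_mm"),
        F.avg("temp_c").alias("avg_temp_c"),
        F.max("temp_c").alias("max_temp_c"),
    )
)

# Write curated aggregates for the query layer or dashboards to consume.
annual_trends.write.mode("overwrite").parquet("s3a://weather-curated/annual_trends/")
```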
Many universities provide public access to vast collections of research articles through repositories like DSpace or EPrints. Let’s say a university wants to consolidate multiple repositories into a modern cloud-based system. This would involve migrating terabytes of data while ensuring each document’s metadata (like author and publication date) remains intact.
The new platform should offer enhanced search capabilities and analytics tools, with the possibility of integrating AI-driven recommendation systems for research papers. Major hurdles in this project include ensuring data consistency, handling varied document formats, and providing uninterrupted access during the migration process.
An e-commerce platform is transitioning to a new system and needs to migrate user-generated content, especially product reviews. While they have access to their database, they also want to incorporate reviews from competitor sites to enhance their product listings.
Using web scraping tools, design a project to extract reviews from top e-commerce sites for products also listed on your platform. This data, combined with the platform’s existing reviews, will be migrated to the new system. The challenges in this example are multi-fold: adhering to ethical web scraping guidelines, ensuring data accuracy, matching products across different platforms, and integrating these external reviews seamlessly with existing ones.
For help with data engineering concepts and fundamentals, Interview Query’s data engineering learning path will guide you through the basics of data engineering.