Interview Query
18 Data Engineering Projects You Must Know in 2025
Overview

In 2025, data engineers are still among the most in-demand tech professionals and some of the highest-paid. However, more people are also trying to break into this career. To be competitive in the job market, you’ll need to stand out, and a portfolio of solid data engineering projects is key.

A robust data engineering portfolio today should showcase an excellent grasp of fundamentals such as SQL, building data pipelines, data storage and management, and cloud computing. However, modern data engineers are also expected to be well-versed in areas such as AI and familiar with technologies like Kafka, Cassandra, Talend, and Redshift, as well as web scraping tools.

In this post, we have compiled 18 data engineering projects covering different aspects of this field, including system design, ETL, and machine learning. These projects are a great way to practice your data engineering skills and will also be excellent additions to your portfolio. Let’s dive in!

System Design Data Engineering Projects

A system design offers a high-level overview of how a data pipeline meets its functional requirements. As a data engineer, you’ll need system design skills to demonstrate how your solutions meet efficiency, scalability, and real-time data access requirements.

1. Stripe Payment Data Pipeline (ETL)

You are a data engineer at Stripe and have been tasked with getting payment data into the company’s internal data warehouse. Design an ETL pipeline that loads all Stripe payment data into the warehouse and allows analysts to run analytics and build relevant dashboards.

This project will familiarize you with fundamental data engineering concepts such as extraction, transformation, and loading. You can also learn about schema design, security and access control, and error handling. Try this problem on Interview Query.
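As a quick illustration of the transform-and-load half of such a pipeline, here is a minimal sketch using SQLite and an invented payment record loosely shaped like a Stripe charge (in the real project, the extract step would pull from Stripe’s API and the target would be a proper warehouse):

```python
import sqlite3

# Hypothetical sample of raw payment records, loosely shaped like
# Stripe charge objects (fields simplified for illustration).
raw_payments = [
    {"id": "ch_001", "amount": 2500, "currency": "usd", "status": "succeeded", "created": 1735689600},
    {"id": "ch_002", "amount": 990, "currency": "eur", "status": "failed", "created": 1735693200},
]

def transform(payment):
    """Normalize amounts to major units and keep only the columns analysts need."""
    return (
        payment["id"],
        payment["amount"] / 100,      # Stripe reports amounts in minor units (cents)
        payment["currency"].upper(),
        payment["status"],
        payment["created"],
    )

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE payments (id TEXT PRIMARY KEY, amount REAL, currency TEXT, status TEXT, created INTEGER)"
)
conn.executemany("INSERT INTO payments VALUES (?, ?, ?, ?, ?)", [transform(p) for p in raw_payments])

succeeded = conn.execute("SELECT COUNT(*) FROM payments WHERE status = 'succeeded'").fetchone()[0]
print(succeeded)  # 1
```

Even in a sketch this small, the transform step is where most design decisions live: unit normalization, column selection, and a primary key that makes re-runs idempotent.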

2. Restaurant Review System

Your organization is looking to build an application allowing the public to leave restaurant reviews. App users should be able to sign up, create profiles, and leave reviews for specific restaurants. The review system should be able to accommodate text and images, but users should only be able to leave one review per restaurant. However, they should be able to update their reviews. Design a database schema that would support these app functions.

This project will help you learn how to create database schemas based on the project’s requirements and constraints. Check out some possible approaches on Interview Query.
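One possible schema, sketched here with SQLite for easy experimentation (table and column names are illustrative), enforces the one-review-per-restaurant rule with a uniqueness constraint:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    user_id    INTEGER PRIMARY KEY,
    username   TEXT NOT NULL UNIQUE
);
CREATE TABLE restaurants (
    restaurant_id INTEGER PRIMARY KEY,
    name          TEXT NOT NULL
);
CREATE TABLE reviews (
    review_id     INTEGER PRIMARY KEY,
    user_id       INTEGER NOT NULL REFERENCES users(user_id),
    restaurant_id INTEGER NOT NULL REFERENCES restaurants(restaurant_id),
    body          TEXT NOT NULL,
    image_url     TEXT,                          -- optional photo attachment
    updated_at    TEXT DEFAULT CURRENT_TIMESTAMP,
    UNIQUE (user_id, restaurant_id)              -- one review per user per restaurant
);
""")

conn.execute("INSERT INTO users (user_id, username) VALUES (1, 'alice')")
conn.execute("INSERT INTO restaurants (restaurant_id, name) VALUES (1, 'Blue Bistro')")
conn.execute("INSERT INTO reviews (user_id, restaurant_id, body) VALUES (1, 1, 'Great pasta')")

# A second review by the same user for the same restaurant violates the
# constraint; the application layer would issue an UPDATE instead.
try:
    conn.execute("INSERT INTO reviews (user_id, restaurant_id, body) VALUES (1, 1, 'Changed my mind')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

A single `image_url` column is a simplification; supporting multiple images per review would call for a separate `review_images` table.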

Machine Learning Data Engineering Projects

More organizations are adopting AI than ever before. Therefore, creating or modifying data pipelines to serve machine learning applications is now a key part of many data engineers’ jobs. An engineer must know how to design systems that can efficiently handle, pre-process, and serve data needed for training and be able to seamlessly deploy a trained model.

3. Chicago Crime Categorization Project

The Chicago Police Department has asked you to build a machine learning system that predicts the category of a crime based on information typically provided during an emergency call (e.g., time and location), before any on-scene assessment is made.

This project is a good opportunity to learn about or showcase your skills in using random forests for classification. You can find the initial code for this project and a guided approach to the solution on Interview Query.
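A minimal random-forest sketch, using synthetic stand-in features and toy labels rather than the real Chicago dataset, looks like this with scikit-learn:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for call features: hour of day, latitude, longitude.
# The real project would engineer these from the Chicago crime dataset.
n = 500
X = np.column_stack([
    rng.integers(0, 24, n),            # hour of the call
    rng.uniform(41.6, 42.1, n),        # latitude
    rng.uniform(-87.9, -87.5, n),      # longitude
])
# Toy labels: pretend late-night calls skew toward one category.
y = np.where(X[:, 0] >= 22, "burglary", "theft")

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
preds = clf.predict(X[:5])
print(preds)
```

On real data you would hold out a test set and compare per-class precision and recall, since crime categories are heavily imbalanced.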

4. Real-Time Data Processing for Traffic Management

Your organization is building a system that will be used for traffic management. The system is expected to receive data from different sources, including traffic cameras, IoT devices on roads, and mobile apps. As the data engineer, your task is to design a data pipeline that will ingest this data in real time, pre-process it, and forward it to machine learning models or analytics tools.

In your solution, consider the following questions:

  1. How would you implement a scalable data ingestion system using technologies such as Kafka and Kinesis?
  2. What factors must you consider to ensure low-latency processing so insights are available instantaneously?
  3. Which database would be suitable for the real-time querying this system requires?

You can check out some traffic prediction datasets and papers on this page to get started.
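Before wiring up Kafka or Kinesis, it helps to see the core computation such a pipeline performs. Here is a tumbling-window aggregation sketched in plain Python over invented sensor events; a Kafka Streams or Flink job would apply the same logic continuously:

```python
from collections import defaultdict

# Hypothetical traffic events as they might arrive from cameras and road sensors.
events = [
    {"sensor_id": "cam-1", "ts": 1_000, "vehicle_count": 12},
    {"sensor_id": "cam-1", "ts": 1_030, "vehicle_count": 18},
    {"sensor_id": "iot-7", "ts": 1_045, "vehicle_count": 4},
    {"sensor_id": "cam-1", "ts": 1_075, "vehicle_count": 25},
]

WINDOW = 60  # seconds per tumbling window

def window_counts(events, window=WINDOW):
    """Sum vehicle counts per (sensor, tumbling-window start)."""
    totals = defaultdict(int)
    for e in events:
        window_start = e["ts"] - (e["ts"] % window)
        totals[(e["sensor_id"], window_start)] += e["vehicle_count"]
    return dict(totals)

print(window_counts(events))
```

In production, the windowing would be handled by the stream processor itself, and late-arriving events (a real problem with mobile-app sources) would need watermarking rather than this naive bucketing.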

5. Health Data Warehouse

You have been asked to create a central repository of health data from various city hospitals, clinics, and health apps. The system should support batch processing, maintain data quality and consistency, and allow fast querying.

Some key challenges in this project will be:

  • Integrating data from different sources, possibly in different formats
  • Using tools such as Talend or Apache NiFi to implement ETL processes
  • Using tools such as Snowflake or Redshift to design the data warehouse schema
  • Ensuring data privacy and compliance with health data regulations
  • Creating a data sandbox for a machine learning pipeline

For this project, you can work with real or synthetic data, which you can find from sources such as data.cms.gov, Kaggle, and PhysioNet.
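The warehouse schema design typically follows a star pattern: a fact table of visits surrounded by dimension tables. Here is a toy sketch in SQLite (table and column names are illustrative; in Snowflake or Redshift the DDL would be nearly identical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables
CREATE TABLE dim_patient  (patient_key  INTEGER PRIMARY KEY, birth_year INTEGER, sex TEXT);
CREATE TABLE dim_facility (facility_key INTEGER PRIMARY KEY, name TEXT, type TEXT);
CREATE TABLE dim_date     (date_key     INTEGER PRIMARY KEY, full_date TEXT, year INTEGER, month INTEGER);

-- Fact table: one row per visit, keyed into the dimensions
CREATE TABLE fact_visit (
    visit_id     INTEGER PRIMARY KEY,
    patient_key  INTEGER REFERENCES dim_patient(patient_key),
    facility_key INTEGER REFERENCES dim_facility(facility_key),
    date_key     INTEGER REFERENCES dim_date(date_key),
    cost         REAL
);
""")

conn.execute("INSERT INTO dim_facility VALUES (1, 'City General', 'hospital')")
conn.execute("INSERT INTO dim_date VALUES (20250103, '2025-01-03', 2025, 1)")
conn.execute("INSERT INTO dim_patient VALUES (1, 1980, 'F')")
conn.execute("INSERT INTO fact_visit VALUES (1, 1, 1, 20250103, 420.0)")

row = conn.execute("""
    SELECT f.name, SUM(v.cost)
    FROM fact_visit v JOIN dim_facility f USING (facility_key)
    GROUP BY f.name
""").fetchone()
print(row)  # ('City General', 420.0)
```

For real health data, the patient dimension would also need de-identification to satisfy privacy regulations, which is part of what makes this project interesting.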

6. Electricity Demand Forecast

An energy supplier has asked you to create a model that can forecast the electricity demand for a town for the next year. This will help avoid outages or an oversupply of energy in the town. How would you create this model?

This is a time-series problem because there is significant variability in daily electricity consumption as well as a seasonal component. The ARIMA model and its variants are a great choice for this type of problem. Check out how to approach this project on Interview Query or see how one user handled a similar project using SARIMA models.
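Whatever SARIMA variant you fit, it should beat a trivial baseline to justify its complexity. A seasonal-naive forecast, which simply repeats the value from one season earlier, makes a good yardstick and can be written without any libraries (the demand figures below are invented):

```python
def seasonal_naive_forecast(history, season_length, horizon):
    """Forecast each future step with the value from one season earlier.

    A deliberately simple baseline: a fitted SARIMA model should
    outperform it before you trust the model's forecasts.
    """
    forecast = []
    for h in range(horizon):
        # Index of the corresponding point in the last observed season
        forecast.append(history[-season_length + (h % season_length)])
    return forecast

# Hypothetical daily demand (MWh) with a weekly cycle: weekends dip.
weekly = [100, 102, 101, 103, 105, 80, 78]
history = weekly * 8  # eight weeks of observations
print(seasonal_naive_forecast(history, season_length=7, horizon=10))
```

Electricity demand usually has both weekly and annual seasonality, so the real project would need a seasonal period (or multiple) chosen to match the data.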

Data Scraping Data Engineering Projects

Web scraping is an essential skill for a data engineer. It is sometimes the only way to gather data that isn’t available through APIs or other conventional sources. When scraping, it is good practice to abide by the guidelines in a website’s robots.txt file; ignoring them can overload a site’s servers or have legal consequences.

7. Analyzing Customer Reviews on E-Commerce Websites

Build a web scraper to extract customer reviews from an e-commerce site of your choosing. Your scraper should gather all information that could be used for sentiment analysis, including the ratings and comments. The program should also be able to clean and structure the data automatically. Finally, find out which insights can be extracted from the data.

This project is a good opportunity to showcase your knowledge of tools such as Scrapy and Beautiful Soup. If you are new to web scraping, you can use this project on scraping data from Amazon as a guide.
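The extraction step boils down to locating rating and comment nodes in the page markup. Here is a sketch over a small, invented, well-formed snippet using the standard library’s ElementTree; real pages are far messier, which is where Beautiful Soup or Scrapy earn their keep:

```python
import xml.etree.ElementTree as ET

# Invented markup standing in for a product-review page.
html = """
<div>
  <div class="review"><span class="rating">5</span><p class="comment">Loved it</p></div>
  <div class="review"><span class="rating">2</span><p class="comment">Broke in a week</p></div>
</div>
"""

root = ET.fromstring(html)
reviews = []
for node in root.findall(".//div[@class='review']"):
    reviews.append({
        "rating": int(node.find("span[@class='rating']").text),
        "comment": node.find("p[@class='comment']").text.strip(),
    })

avg_rating = sum(r["rating"] for r in reviews) / len(reviews)
print(reviews, avg_rating)
```

The cleaned records can then be fed straight into a sentiment-analysis step, with the numeric rating serving as a sanity check on the predicted sentiment.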

8. Online Book Retailer Price Monitoring

A local bookstore wants to switch to a dynamic pricing model to keep up with online retailers. They have asked you to build a web scraper that will regularly scrape a specific competitor’s website and gather price data on the top-selling books. You are provided with the specific books to monitor. Apart from extracting price information regularly, your solution should also have a system to alert the store owner of any drastic pricing changes.

For this project, you can also use the Amazon scraping project mentioned earlier as a guide.
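The alerting half of the system is independent of the scraper and easy to prototype. A sketch with invented book prices, flagging any fractional change beyond a threshold:

```python
def price_alerts(previous, current, threshold=0.15):
    """Flag books whose price moved more than `threshold` (fractionally)
    since the last scrape. Titles and prices here are illustrative."""
    alerts = []
    for title, old_price in previous.items():
        new_price = current.get(title)
        if new_price is None:
            continue  # listing disappeared; could alert on this too
        change = (new_price - old_price) / old_price
        if abs(change) > threshold:
            alerts.append((title, old_price, new_price, round(change, 3)))
    return alerts

previous = {"Dune": 9.99, "Project Hail Mary": 14.50, "The Hobbit": 7.00}
current  = {"Dune": 7.49, "Project Hail Mary": 14.99, "The Hobbit": 7.00}
print(price_alerts(previous, current))
```

Hooking the returned alerts to an email or messaging notification completes the loop for the store owner.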

9. Job Market Analysis

You have been tasked with building a tool to scrape listings from a job portal and offer insights on the available jobs. The tool is to be used to assess:

  • The demand for specific positions and specializations
  • Most in-demand skills for specific roles
  • Average salaries and other compensation benchmarks
  • Tools and languages required
  • Sectors and industries with the highest demand

Your web scraper should be able to gather the data, clean it, and structure it. You can also add a dashboard to display key information.

This project is an excellent way to showcase advanced web scraping skills. You can also opt to build your own dashboard instead of using tools such as Tableau. To get started, check out this guide on scraping LinkedIn using Selenium and Beautiful Soup in Python.
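Once listings are scraped and cleaned, the insight layer is mostly aggregation. A sketch over invented listing records, counting skill demand and averaging salaries:

```python
from collections import Counter
from statistics import mean

# Invented records standing in for cleaned, scraped job listings.
listings = [
    {"title": "Data Engineer", "skills": ["sql", "python", "airflow"], "salary": 125_000},
    {"title": "Data Engineer", "skills": ["sql", "spark", "kafka"],   "salary": 140_000},
    {"title": "Analytics Engineer", "skills": ["sql", "dbt"],          "salary": 115_000},
]

skill_demand = Counter(s for job in listings for s in job["skills"])
avg_salary = mean(job["salary"] for job in listings)

print(skill_demand.most_common(3))  # sql appears in every listing
print(round(avg_salary))
```

In practice, the hard part is normalization: scraped skill strings ("PySpark", "Apache Spark", "spark") must be mapped to canonical names before counting means anything.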

ETL Pipelines Data Engineering Projects

Extract, transform, and load (ETL) operations are at the heart of data engineering. This refers to the practice of collecting data from different sources, transforming it into a suitable format, and loading it into a database or data warehouse. As a data engineer, you can expect to be regularly tasked with designing efficient ETL pipelines or improving existing ones.

10. Unstructured Data ETL Pipeline

Your organization wants to extract data from video input. Design an ETL pipeline that can achieve this. In this project, the key questions to answer are: What data is available from multimedia input, and how would you collect and aggregate this data? Find out how to approach this problem on Interview Query. You can also check out this guide on performing video content analysis in Python.

11. Spotify Self-Analytics ETL

For this project, you’ll need to create a Spotify developer account. The goal is to gather data using the Spotify Web API and load it into a database, using Airflow to automate the process. The data you’ll be gathering includes music tracks, artists, albums, playlist data, and user profiles.

Optional objectives are creating a notification system that alerts you when the system is down and creating a pipeline for an analytics sandbox that helps you explore different facets of data engineering without dealing with real-time high-velocity data streams. To help you get started, check out this guide on data extraction using the Spotify Music API.
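The transform step of this pipeline turns the API’s nested JSON into flat rows for a relational table. A sketch using a simplified version of the track object shape (many fields omitted; check the Spotify Web API reference for the full schema):

```python
# Simplified shape of a track object as returned by the Spotify Web API.
raw_track = {
    "id": "3n3Ppam7vgaVa1iaRUc9Lp",
    "name": "Mr. Brightside",
    "duration_ms": 222973,
    "popularity": 77,
    "artists": [{"name": "The Killers"}],
    "album": {"name": "Hot Fuss"},
}

def flatten_track(t):
    """Turn a nested track object into one flat row for a relational table."""
    return {
        "track_id": t["id"],
        "title": t["name"],
        "duration_s": t["duration_ms"] / 1000,
        "popularity": t["popularity"],
        "primary_artist": t["artists"][0]["name"],
        "album": t["album"]["name"],
    }

row = flatten_track(raw_track)
print(row)
```

An Airflow DAG would chain three tasks around this: authenticate and extract, flatten, then load, with the extract task scheduled on whatever cadence your analysis needs.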

12. Goodreads Reading Insights Platform

This project is similar to the Spotify one, except you’ll be building a system to gather data about books. You can do this using Goodreads’ API, which offers information on personal reading records, authors, and books. The API can be used to gather data on books you’ve read, ratings, favorite genres, etc. You’ll need to clean and standardize this data before storing it in a relational database. Finally, you can create a dashboard using Power BI or Tableau to show key information such as genres, global author distribution, etc.

Optional objectives are implementing a notification system when reading milestones are reached and adding a basic book recommendation system. You can check out how one person used the Goodreads API here.

NoSQL Data Engineering Projects

Relational databases have been the conventional data storage solution, but their scaling limitations are much more apparent in the age of Big Data. This has led to the rise of NoSQL databases, which trade the rigid relational model for more flexible, horizontally scalable designs. As a data engineer, you’ll need to demonstrate your skills in working with NoSQL databases.

13. Real-Time User Activity Tracking with Cassandra

You’ve been hired by an e-commerce company to design a system that can track the activities of users in real time. Activities to be tracked include logins, product views, cart additions, and purchases. The system must be modeled in a way that makes querying for real-time analytics or retrospective analysis efficient. You’ll be using the Cassandra DBMS.

The design must factor in data replication, partitioning, and consistency due to Cassandra’s distributed nature. The system must be robust enough to handle massive influxes of user data in real time and provide insights. An optional dashboard can be added to show relevant metrics. This project will give you hands-on experience with NoSQL databases and creating scalable data designs that offer real-time analytics. To get started, use this dataset of user activity from a cosmetics store.
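The central modeling decision in Cassandra is the partition key. A common pattern for user activity is bucketing each user’s events by day, so no partition grows without bound. The bucketing logic can be sketched in Python (the matching CQL table would use `PRIMARY KEY ((user_id, day), event_ts)`; identifiers here are illustrative):

```python
from datetime import datetime, timezone

def partition_key(user_id, ts):
    """Compute a (user_id, day-bucket) partition key, the common Cassandra
    pattern for keeping time-series partitions bounded in size."""
    day = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")
    return (user_id, day)

events = [
    ("u42", 1_735_689_600, "login"),        # 2025-01-01 00:00 UTC
    ("u42", 1_735_700_000, "product_view"),
    ("u42", 1_735_790_000, "purchase"),     # next day -> new partition
]

for user, ts, action in events:
    print(partition_key(user, ts), action)
```

Because Cassandra queries are efficient only within a partition, picking the bucket size is really picking which queries ("show me this user’s activity today") will be fast.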

14. Six Degrees of Separation Challenge with Neo4j

This project is based on the Six Degrees of Kevin Bacon game. The goal is to build a system to find the shortest relationship path between two random actors. You’ll need to collect data from movie databases and establish relationships between actors, movies, directors, and genres. You’ll then create a user interface so users can challenge the system with two actors’ names so it can find and display their relationship across different movies.

For this project, you’ll be using Neo4j. This is a graph database that excels at representing and traversing relationships between entities. This is a great project for showcasing your skill with graph database management systems. Get started with this IMDB Movies Dataset on Kaggle.
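Under the hood, the query is a shortest-path search over a bipartite actor-movie graph. Here is the traversal in plain Python over a tiny sample graph; in Neo4j, Cypher’s `shortestPath()` expresses the same search declaratively over the full dataset:

```python
from collections import deque

# Tiny sample actor-movie graph for illustration.
movies = {
    "Footloose": ["Kevin Bacon", "John Lithgow"],
    "Interstellar": ["John Lithgow", "Anne Hathaway"],
    "The Devil Wears Prada": ["Anne Hathaway", "Meryl Streep"],
}

# Build an actor -> co-star adjacency map
graph = {}
for cast in movies.values():
    for a in cast:
        graph.setdefault(a, set()).update(c for c in cast if c != a)

def degrees(start, goal):
    """Breadth-first search for the shortest co-star path between two actors."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no connection found

print(degrees("Kevin Bacon", "Meryl Streep"))
```

Breadth-first search guarantees the first path found is the shortest, which is exactly the "degrees of separation" the game asks for.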

15. Dynamic Content Personalization Engine with MongoDB

This project requires you to create a system that personalizes content for users dynamically. A recommendation algorithm will be at the heart of this system. The system can be for a news website or streaming service with user recommendations based on viewing/reading habits, areas of interest, feedback, etc. Other key attributes should also be captured in the metadata to better match user profiles with content.

This system must be dynamic so recommendations are refined as a user interacts more with the system. You’ll be using MongoDB, a document-oriented NoSQL database known for its flexible schema. The Netflix Prize dataset can be used for this project.
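The simplest content-based scoring, ranking items by overlap between their metadata tags and the user’s interest profile, can be sketched as follows (titles and tags are invented; in MongoDB the tags would live inside each content document):

```python
def recommend(user_tags, catalog, top_n=2):
    """Rank content by overlap between its tags and the user's interest
    profile -- the simplest content-based recommendation score."""
    scored = []
    for item in catalog:
        score = len(user_tags & set(item["tags"]))
        if score:
            scored.append((score, item["title"]))
    scored.sort(key=lambda s: (-s[0], s[1]))  # high score first, ties by title
    return [title for _, title in scored[:top_n]]

catalog = [
    {"title": "Markets Today", "tags": ["finance", "economy"]},
    {"title": "Deep Sea Documentary", "tags": ["nature", "science"]},
    {"title": "AI Weekly", "tags": ["technology", "science"]},
]

user_tags = {"science", "technology"}
print(recommend(user_tags, catalog))
```

Making the system dynamic then means updating `user_tags` (or weighted equivalents) from each click, view, or rating event as it arrives.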

APIs and Data Migration Data Engineering Projects

Data engineers must be skilled at using APIs in their data collection efforts, with some projects gathering data from multiple APIs at the same time. Additionally, engineers must also understand data migration as organizations regularly have to move to new systems or formats. Here are a few projects focusing on these challenges.

16. Historical Weather Analysis Platform Using Public APIs

You have been tasked with constructing a platform to be used for analyzing historical weather data. The data must be acquired through public weather APIs. Multiple APIs should be used to cover more regions and dates. Users should be able to query the weather conditions for a specific date and location.

This project requires dealing with a massive amount of data. This is ideal for demonstrating your skills in using Apache Spark to perform real-time analytics and machine learning on a large scale. You can identify trends in annual rainfall and temperature fluctuations or try to predict adverse weather events or anomalies. It’s important to ensure accuracy by cross-referencing sources and presenting the data in a user-friendly format. You can get started with this list of free weather APIs.
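The anomaly-detection piece can be prototyped at small scale before moving to Spark. A z-score sketch over invented daily temperature highs, flagging readings far from the mean:

```python
from statistics import mean, stdev

# Invented July daily highs (deg C) for one location, with one anomalous spike.
temps = [29.1, 30.4, 28.8, 31.0, 29.5, 30.2, 41.5, 29.9, 30.7, 28.5]

mu, sigma = mean(temps), stdev(temps)

def anomalies(values, threshold=2.0):
    """Flag readings more than `threshold` standard deviations from the mean.
    At warehouse scale, the same logic becomes a Spark aggregation per region."""
    return [v for v in values if abs(v - mu) / sigma > threshold]

print(anomalies(temps))  # [41.5]
```

With Spark, `mu` and `sigma` would be computed per region and month with a groupBy-aggregate, and the flagged rows cross-referenced against a second API source before being reported as genuine anomalies.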

17. University Digital Library Migration

A university has a large collection of research articles stored in multiple repositories such as EPrints and DSpace and wants to consolidate the repos into a modern cloud-based system. This data is in the order of terabytes and must be migrated while keeping the documents’ metadata, e.g., authors and publication dates, intact.

The new platform needs to have advanced search capabilities and analytics tools. If possible, an AI-driven recommendation system for papers should be included. The key challenges in this project include maintaining data consistency, handling different document formats, and maintaining access to documents during the migration process. You can check out this article on data migration best practices to see how you can approach this project.
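Maintaining data consistency during migration usually comes down to checksums: verify each document’s bytes before writing it to the new system and quarantine anything that fails. A sketch with invented repository records (field names are illustrative):

```python
import hashlib

def checksum(payload: bytes) -> str:
    return hashlib.sha256(payload).hexdigest()

def migrate(records):
    """Copy records to the 'new' store while verifying integrity and
    preserving metadata such as authors and publication dates."""
    new_store, failures = [], []
    for rec in records:
        digest = checksum(rec["content"])
        if digest != rec["expected_sha256"]:
            failures.append(rec["doc_id"])   # quarantine for manual review
            continue
        new_store.append({
            "doc_id": rec["doc_id"],
            "content": rec["content"],
            "metadata": dict(rec["metadata"]),
            "sha256": digest,
        })
    return new_store, failures

paper = b"Abstract: ..."
records = [
    {"doc_id": "eprints:101", "content": paper,
     "expected_sha256": checksum(paper),
     "metadata": {"authors": ["A. Researcher"], "published": "2019-06-01"}},
    {"doc_id": "dspace:202", "content": b"corrupted bytes",
     "expected_sha256": "0" * 64,
     "metadata": {"authors": ["B. Writer"], "published": "2021-02-11"}},
]

migrated, failed = migrate(records)
print(len(migrated), failed)
```

Storing the digest alongside each migrated document also lets you re-verify the archive later, long after the source repositories are gone.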

18. Product Review Migration with Web Scraping for an E-commerce Site

An e-commerce platform is changing to a new system and would like to migrate its users’ product reviews. Additionally, they want to incorporate reviews from competitors’ websites to enhance their product listings. Design a program to extract reviews from other e-commerce sites for products listed on this platform using web scraping tools. Combine the data with existing reviews and migrate all the data to the new system.

Challenges in this project include adhering to ethical web scraping guidelines, maintaining data accuracy, matching products across different platforms, and integrating external reviews with existing ones. You can use this article on combining datasets to get started on this project.
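Matching products across platforms is the trickiest of these challenges, since titles rarely agree verbatim. A first-pass fuzzy match using only the standard library (product titles are invented) might look like this:

```python
from difflib import SequenceMatcher

def best_match(title, candidates, cutoff=0.6):
    """Match a product title from one site to the closest title on another,
    using stdlib fuzzy matching. Production systems usually add brand and
    model-number extraction on top of raw string similarity."""
    best, best_score = None, cutoff
    for cand in candidates:
        score = SequenceMatcher(None, title.lower(), cand.lower()).ratio()
        if score > best_score:
            best, best_score = cand, score
    return best

our_product = "Sony WH-1000XM5 Wireless Headphones"
competitor_titles = [
    "Sony WH1000XM5 Noise Cancelling Wireless Headphones",
    "Bose QuietComfort Ultra Headphones",
    "Sony WF-1000XM5 Earbuds",
]

print(best_match(our_product, competitor_titles))
```

Note the near-miss in the sample data: "WF-1000XM5" earbuds versus "WH-1000XM5" headphones differ by one character, which is why extracting model numbers explicitly beats similarity scores alone.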

Conclusion

High-level data engineering skills are in high demand, but the barrier to entry has also increased in recent years. To land a role today, you’ll need to showcase a good grasp of the fundamentals as well as an understanding of the data engineering demands of emerging fields such as AI. The 18 projects on this list offer a good starting point and will allow you to develop your skills as you build solid portfolio pieces. These projects will also give you opportunities to explore both established and emerging technologies that are shaping the field of data engineering today.

Once you have developed a robust portfolio and are ready to take on the job market, Interview Query offers a wide variety of tools to assist you at this crucial stage. We provide access to commonly asked data engineer interview questions in our question bank and company interview guides tailored to different roles. You can also view salary data or visit our job board to find out who is hiring. If you need a little extra help, you can use our mock interview feature or AI interviewer. You can also sign up for one-on-one interview coaching.

Starting a career in data engineering today is challenging, but we hope these projects will give you a leg up and encourage you to stay the course.