Top 100+ Data Engineer Interview Questions (With Answers)

Overview

Whether you’re just entering the data engineer job market or your interview is tomorrow, practice is an essential part of preparing for a data engineering interview.

Data engineering interview questions assess your data engineering skills and domain expertise. They are based on a company’s tech stack and technology goals, and they test your ability to perform job functions.

We have detailed the most common skills tested after analyzing 1000+ data engineering interview questions.

To help, we’ve counted down the top 100 data engineering interview questions. These questions come from real-life interview experiences, and they cover the essential skills for data engineers: behavioral questions, basic technical questions, SQL, Python, database design and data modeling, case studies, ETL, and data structures and algorithms.

Behavioral Interview Questions for Data Engineers

Behavioral questions assess soft skills (e.g., communication, leadership, adaptability), your skill level, and how you fit into the company’s data engineering team.

Behavioral questions show up early in the data engineering interview process (e.g., the recruiter call) and include questions about your experience.

Examples of behavioral interview questions for a data engineer role would be:

1. Describe a data engineering problem you have faced. What were some challenges?

Questions like this assess many soft skills, including your ability to communicate and how you respond to adversity. Your answer should convey:

  • The situation
  • Specific tactics you proposed
  • What actions you took
  • The results you achieved

2. Talk about a time you noticed a discrepancy in company data or an inefficiency in the data processing. What did you do?

Your response might demonstrate your experience level, your initiative, and your problem-solving approach. This question is your chance to show the unique skills and creative solutions you bring to the table.

Don’t have this type of experience? You can relate your experiences to coursework or projects. Or you can talk hypothetically about your knowledge of data governance and how you would apply that in the role.

3. Imagine you are tasked with developing a new product. Where would you begin?

Candidates should have an understanding of how data engineering plays into product development. Interviewers want to know how well you’ll fit in with the team, your organizational ability in product development, or how you might simplify an existing workflow.

One Tip: Get to know the company’s products and business model before the interview. Knowing this will help you relate your most relevant skills and experiences. Plus, it shows you did your homework and care about the position.

4. Tell me about a time you exceeded expectations on a project. What did you do, and how did you accomplish it?

The STAR framework is the perfect model for answering a question like this; it covers the how and why. However, one difference with this type of question is showing the value you added. For data engineering positions, you might have gone above and beyond and, as a result, reduced costs, saved time, or improved your team’s analytics capacity.

5. Describe a time you had to explain a complex subject to a non-technical person.

Questions like this assess your communication skills. In particular, interviewers want to know if you can provide clear layperson descriptions of the technology and techniques in data engineering.

For example, you could say: “In a previous job, I was working on a data engineering project. For a credit-risk analysis tool we were developing, I needed to explain the differences between predictive models (random forest, KNN, and decision trees). My approach was to distill each algorithm into an easily understandable 1-2 sentence description. Then, I created a short slide presentation to walk the team through the pros and cons of all three algorithms.”

6. Why are you interested in working at our company?

These questions are common and easy to fail if you haven’t thought through an answer. One option is to focus on the company culture and describe how that excites you about the position.

For example, “I’m interested in working at Google because of the company’s history of experimentation and engineering innovation. I look forward to being presented with engineering problems that require creative, outside-the-box solutions, and I also enjoy developing new tools to solve complex problems. I believe this role would challenge me and provide opportunities to develop novel approaches, which excites me about the role.”

7. How would you describe your communication style?

One helpful tip for a question like this: Use an example to illustrate your communication style.

For example, you could say: “I would describe my communication style as assertive. I believe it’s essential to be direct about my project needs and not be afraid to ask questions and gather information.

In my previous position, I was the lead on an engineering project. Before we started, I met with all stakeholders and learned about their needs and wants. One issue that arose was timing, and I felt I would need more resources to keep the project on schedule, so I communicated this to the PM, and we were able to expand the engineering team to meet the tight deadline.”

8. Tell me about a time when your colleagues disagreed with your approach. What did you do to address their concerns?

When interviewers ask this question, they want to see that you can negotiate effectively with your coworkers. Like most behavioral questions, use the STAR method. State the business situation and the task you need to complete. State the objections your coworker had to your action. Do not try to downplay the complaints or write them off as “stupid”; you will appear arrogant and inflexible.

Hint: The most crucial part of your answer is how you resolved the dispute.

9. Please provide an example of a goal you did not meet and how you handled it.

This scenario is a variation of the failure question. With this question, a framework like STAR can help you describe the situation, the task, your actions, and the results. Remember: Your answer should provide clear insights into your resilience.

10. How do you handle meeting a tight deadline?

This question assesses your time management skills. Provide specific details on how you operate. You might say you approach projects by:

  • Gathering stakeholder input
  • Developing a project timeline with clear milestones
  • Delegating the workload for the project
  • Tracking progress
  • Communicating with stakeholders

11. Tell me about a time you used data to influence a decision or solve a problem.

STAR is a great way to structure your answers to questions like these. You could say:

“My previous job was at a swipe-based job-matching app. We aimed to increase the number of applications submitted (through swiping). I built an Elasticsearch-based model to help users see relevant jobs. The model would weigh previous employment information and then use a weighted flexible query on all the jobs within a 50-mile radius of the applicant. After A/B testing, we saw a 10-percent lift in applications compared to the baseline model.”

12. Talk about a time when you had to persuade someone.

This question addresses communication, but it also assesses cultural fit. The interviewer wants to know if you can collaborate and how you present your ideas to colleagues. Use an example in your response:

“In a previous role, I felt the baseline model we were using - a Naive Bayes recommender - wasn’t providing precise enough search results to users. I felt that we could obtain better results with an elastic search model. I presented my idea and an A/B testing strategy to persuade the team to test the idea. After the A/B test, the elastic search model outperformed the Naive Bayes recommender.”

13. What data engineering projects have you worked on? Which was most rewarding?

If you have professional experience, choose a project you worked on in a previous job. However, if this is your first job or an internship, you can cite a class or personal project. As you present a data science or data engineering project, be sure to:

  • Include an overview of the problem
  • Summarize your approach to the problem
  • Discuss your process and the actions you took
  • Define the results of the project
  • Include information about what you learned, challenges, and what you would do differently

14. What are your strengths and weaknesses?

When discussing strengths, ask yourself, “What sets me apart from others?” Focus on strengths you can back up with examples, using the STAR method to show how your strength solved a business issue. If you have no prior full-time work experience, feel free to mention takeaways or projects from classes you have taken or initiatives from past part-time jobs.

With weaknesses, interviewers want to know that you can recognize your limits and develop effective strategies to manage the flaws that affect your performance and the business.

Basic Data Engineering Technical Questions

Interviewers use easy technical questions to weed out candidates without the right experience. These questions assess your experience level, comfort with specific tools, and the depth of your domain expertise. Basic technical questions include:

15. Describe a time you had difficulty merging data. How did you solve this issue?

Data cleaning and data processing are key job responsibilities in engineering roles, and unexpected issues will inevitably come up. Interviewers ask questions like these to determine:

  • How well you adapt
  • The depth of your experience
  • Your technical problem-solving ability

Clearly explain the issue, what you proposed, the steps you took to solve the problem, and the outcome.

16. What ETL tools do you have experience using? What tools do you prefer?

There are many variations of this type of question. A different version would ask about a specific ETL tool: “Have you had experience with Apache Spark or Amazon Redshift?” If a tool is in the job description, it might come up in a question like this. One tip: Include any training, how long you’ve used the technology, and specific tasks you can perform.

17. Tell me about a situation where you dealt with technology that was unfamiliar to you.

This question asks: What do you do when there are gaps in your technical expertise? In your response, you might include:

  • Education and data engineering boot camps
  • Self-guided learning
  • Working with specialists and collaborators

18. How would you design a data warehouse given X criteria?

This example is a fundamental case study question in data engineering, and it requires you to provide a high-level design for a database based on criteria. To answer questions like this:

  • Start with clarifying questions and state your assumptions
  • Provide a hypothesis or high-level overview of your design
  • Then describe how your design would work

19. How would you design a data pipeline?

A broad, beginner case study question like this assesses how you approach a problem. With all case study questions, you should ask clarifying questions like:

  • What type of data is processed?
  • How will the information be used?
  • What are the requirements for the project?
  • How much data will be pulled? How frequently?

These questions will provide insights into the type of response the interviewer seeks. Then, you can describe your design process, starting with choosing data sources and data ingestion strategies, before moving on to your data processing and implementation plans.

20. What questions do you ask before designing data pipelines?

This question assesses how you gather stakeholder information before starting a project. Some of the most common questions to ask would include:

  • What is the use of the data?
  • Has the data been validated?
  • How often will the information be pulled, and how will it be used?
  • Who will manage the pipeline?

21. How do you gather stakeholder input before beginning a data engineering project?

Understanding what stakeholders need from you is essential in any data engineering job, and a question like this assesses your ability to align your work to stakeholder needs. Describe the processes that you typically utilize in your response; you might include tools like:

  • Surveys
  • Interviews
  • Direct observations
  • Social science / statistical observation
  • Reviewing existing logs of issues or requests

Ultimately, your answer must convey your ability to understand the user and business needs and how you bring stakeholders in throughout the process.

22. What is your experience with X skill in Python?

General experience questions like this are jumping-off points for more technical case studies, and the interviewer will typically tailor them to the role. You should be comfortable with standard Python and supplemental libraries like Matplotlib, pandas, and NumPy, know what’s available, and understand when it’s appropriate to use each library.

One note: Don’t fake it. If you don’t have much experience, be honest. You can also describe a related skill or talk about your comfort level in quickly picking up new Python skills (with an example).

23. What experience do you have with cloud technologies?

If cloud technology is in the job description, chances are it will show up in the interview. Some of the most common cloud technologies for data engineer interviews include Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform, and IBM Cloud. Additionally, be prepared to discuss specific tools for each platform, like AWS Glue, EMR, and AWS Athena.

24. What are some challenges unique to cloud computing?

A broad question like this can quickly assess your experience with cloud technologies in data engineering. Some of the challenges you should be prepared to talk about include:

  • Security and Compliance
  • Cost
  • Governance and control
  • Performance

25. What’s the difference between structured and unstructured data?

With a fundamental question like this, be prepared to answer with a quick definition and then provide an example.

You could say: “Structured data consists of clearly defined data types and easily searchable information. An example would be customer purchase information stored in a relational database. Unstructured data, on the other hand, does not have a clearly defined format and therefore can’t be stored in a relational database. An example would be video or image files.”

26. What are the key features of Hadoop?

Some of the Hadoop features you might talk about in a data engineering interview include:

  • Fault tolerance
  • Distributed processing
  • Scalability
  • Reliability

SQL Interview Questions for Data Engineers

SQL questions for data engineers cover fundamental concepts like joins, subqueries, case statements, and filters. If you’re asked to write SQL code, questions may also test whether you can pull specific metrics or handle errors and NULL values. Common SQL questions include:

27. What is the difference between DELETE and TRUNCATE?

Both of these commands remove data. However, a key difference is that DELETE is a Data Manipulation Language (DML) command, while TRUNCATE is a Data Definition Language (DDL) command.

DELETE is used to remove specific rows from a table and can be combined with a WHERE clause, while TRUNCATE removes all rows at once and cannot be filtered with WHERE. In both cases, the table’s structure is preserved: DELETE without a WHERE clause removes every row one at a time, while TRUNCATE TABLE clears the table in a single, faster operation.
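
As a quick illustration, here’s a minimal sketch, assuming a hypothetical orders table:

-- DELETE can target specific rows with WHERE and removes them one at a time
DELETE FROM orders
WHERE status = 'cancelled';

-- TRUNCATE clears every row in one operation but keeps the empty table structure
TRUNCATE TABLE orders;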

28. What’s the difference between WHERE and HAVING?

Both WHERE and HAVING are used to filter a table to meet the conditions you set. The difference between the two becomes apparent when they are used with the GROUP BY clause: the WHERE clause filters rows before grouping, and HAVING filters the grouped results after aggregation.
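
A minimal sketch of the difference, assuming a hypothetical sales table with sale_date, region, and amount columns:

SELECT region, SUM(amount) AS total_sales
FROM sales
WHERE sale_date >= '2023-01-01'   -- WHERE filters individual rows before grouping
GROUP BY region
HAVING SUM(amount) > 10000;       -- HAVING filters the aggregated groups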

29. What is an index in SQL? When would you use an index?

Indexes are lookup structures used by the database to make data retrieval more efficient. An index can speed up SELECT queries and WHERE clauses, but it slows down UPDATE and INSERT statements because the index must also be maintained.
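
For illustration, a small sketch assuming a hypothetical customers table:

-- Create an index to speed up lookups that filter on email
CREATE INDEX idx_customers_email ON customers (email);

-- This query can now use the index instead of scanning the whole table
SELECT id, name
FROM customers
WHERE email = 'jane@example.com';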

30. What are aggregate functions in SQL?

An aggregate function performs a calculation on a set of values and returns a single value summarizing that set. SQL’s three most common aggregate functions are COUNT, SUM, and AVG.

  • COUNT - Returns the number of items in a group.
  • SUM - Returns the sum of ALL or DISTINCT values in an expression.
  • AVG - Returns the average of the values in a group (ignoring NULL values).

31. What SQL commands are utilized in ETL?

Some of the most common SQL commands and clauses used in the data extraction process include SELECT, JOIN, WHERE, ORDER BY, and GROUP BY.

  • SELECT - Pulls the desired columns and rows.
  • JOIN - Combines rows from multiple tables using a related key (often a foreign key).
  • WHERE - Filters the data to only the rows we want.
  • ORDER BY - Sorts the results by a column in ascending or descending order.
  • GROUP BY - Groups the results of the query so aggregates can be computed.

32. Does JOIN order affect SQL query performance?

How you join tables can have a significant effect on query performance. For example, joining two large tables first and then joining smaller tables can increase the processing required by the SQL engine. One general rule: performing the join that most reduces the number of rows processed in subsequent steps first helps improve performance.

33. How do you change a column name by writing a query in SQL?

You would do this with the ALTER TABLE and RENAME COLUMN statements. Here’s an example syntax for changing a column name in SQL:

ALTER TABLE TableName
RENAME COLUMN OldColumnName TO NewColumnName;

34. How do you handle duplicate data in SQL?

You might want to clarify the question and ask some follow-up questions of your own. Specifically, you might ask: (a) what kind of data is being processed, and (b) which columns or values can contain duplicates?

With some clarity, you’ll be able to suggest more relevant strategies. For example, you might propose using a DISTINCT clause or a unique key to reduce duplicate data, or you could walk the interviewer through how a GROUP BY on the relevant columns collapses duplicate rows.

35. Write a query that returns true or false whether or not each user has a subscription date range that overlaps with any other user.

Hint: Given two date ranges, what determines whether the subscriptions overlap? If one range is neither entirely after the other nor entirely before the other, then the two ranges must overlap.

To answer this SQL question, you can think of De Morgan’s law, which says that:

NOT (A OR B) <=> (NOT A) AND (NOT B)

What is the equivalent? And how could we model that out for a SQL query?
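
One way to express that idea in SQL, as a hedged sketch: the subscriptions table and its user_id, start_date, and end_date columns are assumed names, and the result uses 1/0 for true/false.

SELECT
    s1.user_id,
    MAX(CASE WHEN s1.start_date <= s2.end_date
              AND s2.start_date <= s1.end_date
             THEN 1 ELSE 0 END) AS overlaps_with_another_user  -- 1 = overlap found
FROM subscriptions AS s1
LEFT JOIN subscriptions AS s2
    ON s1.user_id != s2.user_id
GROUP BY s1.user_id;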

36. Given a table of employees and departments, write a query to select the top 3 departments with at least ten employees.

Follow-up question: Rank them by the percentage of employees making $100,000+.

This question is an example of a multi-part logic-based SQL question that data engineers face. With this SQL question, you need:

  • Calculate the total number of employees making $100,000+ by department. This logic means we will have to run a GROUP BY on the department name, since we want a new row for each department.
  • A formula to differentiate employees making $100,000+ from those who make less. What does that formula entail?
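
Putting those two pieces together, here’s a hedged sketch; the employees (salary, department_id) and departments (id, name) column names are assumptions:

SELECT
    d.name,
    AVG(CASE WHEN e.salary >= 100000 THEN 1.0 ELSE 0.0 END) AS pct_100k_plus
FROM departments AS d
JOIN employees AS e
    ON e.department_id = d.id
GROUP BY d.name
HAVING COUNT(*) >= 10          -- keep departments with at least ten employees
ORDER BY pct_100k_plus DESC    -- rank by share of employees making $100,000+
LIMIT 3;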

37. You are given a users table and a neighborhoods table. Write a query that returns all neighborhoods with 0 users.

Whenever the question asks about finding “0 values,” e.g., users or neighborhoods, start thinking LEFT JOIN! An INNER JOIN returns only the rows with matches in both tables; a LEFT JOIN keeps every row from the left table, whether or not it has a match in the right table.

With this question, our goal is to find all the neighborhoods without users. To do this, we do a LEFT JOIN from the neighborhoods table to the users table and keep only the rows with no match. Here’s an example solution:

SELECT n.name
FROM neighborhoods AS n
LEFT JOIN users AS u
    ON n.id = u.neighborhood_id
WHERE u.id IS NULL

This question is used in Facebook data engineer interviews.

38. Write a query to account for the duplicate error and select the top five most expensive projects by budget to employee count ratio.

More context: You have two tables: projects (with columns id, title, start_date, end_date, budget) and employees_projects (with columns project_id, employee_id). You must select the five most expensive projects by budget-to-employee-count ratio. However, due to a bug, duplicate rows exist in the employees_projects table.

One way to remove duplicates from the employees_projects table is simply to GROUP BY the columns project_id and employee_id. By grouping on both columns, we’ve created a table of distinct (project_id, employee_id) pairs, thereby eliminating duplicates.
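
A sketch of that approach using the column names given above:

SELECT
    p.title,
    p.budget * 1.0 / COUNT(ep.employee_id) AS budget_per_employee
FROM projects AS p
JOIN (
    -- GROUP BY collapses the duplicate rows introduced by the bug
    SELECT project_id, employee_id
    FROM employees_projects
    GROUP BY project_id, employee_id
) AS ep
    ON p.id = ep.project_id
GROUP BY p.id, p.title, p.budget
ORDER BY budget_per_employee DESC
LIMIT 5;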

39. Write a SQL query to find the last bank transaction for each day.

More context: You are given a table of bank transactions with id, transaction_value, and created_at, a DateTime for each transaction.

Start by trying to apply a window function to make partitions. Because the created_at column is a DateTime, multiple entries can occur at different times on the same date. For example, transaction 1 could happen at ‘2020-01-01 02:21:47’, and transaction 2 at ‘2020-01-01 14:24:37’. To make partitions, we should strip the time component when partitioning, but we still need the full timestamp to sort the transactions.

To do this, you could try:

ROW_NUMBER() OVER (PARTITION BY DATE(created_at) ORDER BY created_at DESC)

Now, how would you get the last transaction per day?
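
One hedged way to finish the query: keep only the row numbered 1 in each daily partition. The table name bank_transactions is an assumption.

SELECT created_at, transaction_value
FROM (
    SELECT
        created_at,
        transaction_value,
        ROW_NUMBER() OVER (
            PARTITION BY DATE(created_at)
            ORDER BY created_at DESC
        ) AS rn
    FROM bank_transactions
) AS ranked
WHERE rn = 1        -- the latest transaction within each day
ORDER BY created_at;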

40. Given the transactions table, write a query to get the average quantity of each product purchased for each transaction every year.

To answer this question, we need to apply an average function to the quantity for every different year and product_id combination. We can extract the year from the created_at column using the YEAR() function. We can use the ROUND() and AVG() functions to round the average quantity to 2 decimal places.

Finally, we make sure to GROUP BY the year and product_id to get every distinct year and product_id combination.
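
A minimal sketch of that query, assuming the transactions table has created_at, product_id, and quantity columns (MySQL-style YEAR() per the explanation above):

SELECT
    YEAR(created_at) AS year,
    product_id,
    ROUND(AVG(quantity), 2) AS avg_quantity
FROM transactions
GROUP BY YEAR(created_at), product_id;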

Bonus Question: Why should we use surrogate keys over normal keys?

Surrogate keys are system-generated, unique keys for a record in a database. They offer several advantages over normal or natural keys:

  • Consistency: Surrogate keys are typically consistent in format, often being auto-incremented integers.
  • Unchanging: Because they are system-generated, they aren’t subject to the same change potential as natural keys, which might be based on mutable data.
  • Anonymity: They typically can’t be used to infer real-world characteristics of a record, which is useful for ensuring data privacy.
  • Compatibility: They can bridge the gap during scenarios in which natural keys from different systems or contexts are inconsistent or incompatible.
  • Performance: Integer-based surrogate keys can be more efficient for querying compared to longer, string-based natural keys.

It’s important to note that while surrogate keys have their benefits, they should be used judiciously, and the decision should be based on the specific requirements of the database design.

Data Engineer Python Interview Questions

Be prepared for a wide range of data engineer Python questions. Expect questions about 1) data structures and data manipulation (e.g., Python lists, data types, data munging with pandas), 2) explanations (e.g., explain how a search or merge operation works), and 3) Python coding tests. Sample Python questions include:

41. What is the difference between “is” and “==”?

This is a simple Python distinction that’s important to know: “==” determines whether two objects have the same value, while “is” determines whether two references point to the same object.
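
A quick illustration:

a = [1, 2, 3]
b = [1, 2, 3]
c = a

print(a == b)  # True: the two lists hold the same values
print(a is b)  # False: they are two distinct objects in memory
print(a is c)  # True: both names reference the same object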

42. What is a decorator?

In Python, a decorator is a function that takes another function as an argument and returns a closure. The closure accepts positional and keyword arguments and calls the original function using the arguments passed to it, usually adding behavior before or after the call.

Decorators help add logging, test performance, perform caching, verify permissions, or when you need to run the same code on multiple functions.
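
As a small illustrative example (not tied to any particular framework), here’s a sketch of a decorator that times the wrapped function:

import functools
import time

def timed(func):
    """Log how long the wrapped function takes to run."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        print(f"{func.__name__} took {time.perf_counter() - start:.4f}s")
        return result
    return wrapper

@timed
def load_rows(n):
    return list(range(n))

load_rows(1_000_000)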

43. How would you perform web scraping in Python?

With this question, outline the process you use for scraping with Python. You might say:

“First, I’d use the requests library to access the URL and extract data with BeautifulSoup. With the raw data, I would convert it into a structure suitable for pandas, then clean the data using pandas and NumPy. Finally, I would save the data to a spreadsheet or CSV file.”
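
A minimal sketch of that workflow; the URL and the CSS class are hypothetical placeholders:

import pandas as pd
import requests
from bs4 import BeautifulSoup

url = "https://example.com/products"          # hypothetical page
response = requests.get(url, timeout=10)
soup = BeautifulSoup(response.text, "html.parser")

# Pull the text of each (assumed) product card, then clean it with pandas
rows = [tag.get_text(strip=True) for tag in soup.find_all("div", class_="product")]
df = pd.DataFrame({"raw_text": rows}).dropna()
df.to_csv("products.csv", index=False)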

44. Are lookups faster with dictionaries or lists in Python?

Dictionaries are faster. One way to think about this question is to consider it through the lens of Big O notation. Dictionaries are faster because they have constant time complexity O(1), but for lists, it’s linear time complexity or O(n). With lists, you have to go through the entire list to find a value, while with a dictionary, you don’t have to go through all keys.

45. How familiar are you with TensorFlow? Keras? OpenCV? SciPy?

“How familiar are you with X”-type questions show up early in the interview process. You might hear these in the technical interview or on the recruiter screen. If a Python tool, skill, or library is mentioned in the job description, you should expect a question like this in the interview.

You could say: “I have extensive experience with TensorFlow. In my last job, I developed sentiment analysis models, which would read user reviews and determine the polarity of the text. I developed a model with Keras and TensorFlow and used TensorFlow to encode sentences into embedding vectors. I built most of my knowledge through professional development courses and hands-on experimentation.”

46. What is the difference between a list and a tuple?

Both lists and tuples are common data structures that can store one or more objects or values and are also used to store multiple items in one variable. However, the main difference is that lists are mutable, while tuples are immutable.

47. What is data smoothing and how do you do it?

Data smoothing is a technique that reduces noise in a dataset, effectively removing or “smoothing” the rough edges caused by outliers. There are many ways to do this in Python. One option is to compute a rolling average, which is particularly useful for noisy time-series data; both pandas and NumPy support this.
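
For example, a small sketch of a rolling (moving) average in pandas and NumPy:

import numpy as np
import pandas as pd

values = pd.Series([10, 12, 35, 11, 13, 40, 12, 14])   # noisy series

# 3-point rolling average smooths out the spikes
smoothed = values.rolling(window=3, center=True).mean()

# The same idea with NumPy: convolve with an averaging kernel
kernel = np.ones(3) / 3
smoothed_np = np.convolve(values.to_numpy(), kernel, mode="valid")

print(smoothed)
print(smoothed_np)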

48. For what is NumPy used? What are its benefits?

NumPy is one of the most popular Python packages, along with pandas and Matplotlib. NumPy adds data structures, including a multidimensional array, to Python and is used for scientific computing. One benefit of NumPy arrays is that they’re more compact than Python lists and therefore consume less memory.

49. What is a cache database? And why would you use one?

A cache database is a fast storage layer for short-lived structured or unstructured data. Generally, this database is much smaller than a production database and can be stored in memory.

Caching is helpful for faster data retrieval because users can access the data from a temporary location instead of the primary store. There are many ways to implement caching in Python: you can create local data structures to build the cache, or host a cache as a separate server, for example.
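
As one simple in-process example (a local cache rather than a cache server such as Redis), Python’s functools.lru_cache memoizes expensive lookups:

import time
from functools import lru_cache

@lru_cache(maxsize=128)
def get_user_profile(user_id):
    time.sleep(1)                      # stand-in for a slow database or API call
    return {"id": user_id, "name": f"user_{user_id}"}

get_user_profile(42)   # slow: performs the lookup
get_user_profile(42)   # fast: served from the in-memory cache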

50. What are some primitive data structures in Python? What are some user-defined data structures?

The built-in data types in Python include lists, tuples, dictionaries, and sets. These data types are already defined and supported by Python and act as containers for grouping data. User-defined data structures share commonalities with the primitive types and are built on those concepts, but they allow users to create their own structures, such as queues, trees, and linked lists.

51. Given a list of timestamps in sequential order, return a list of lists grouped by week using the first timestamp as the starting point.

This question asks you to aggregate lists in Python, and your goal is an output like this:

def weekly_aggregation(ts) -> [
    ['2019-01-01', '2019-01-02'],
    ['2019-01-08'],
    ['2019-02-01', '2019-02-02'],
    ['2019-02-05'],
]

Hint: This question sounds like it should be a SQL question. Weekly aggregation implies a form of GROUP BY in a regular SQL or pandas question. But since it’s a scripting question, it’s trying to find out whether the candidate can deal with unstructured data, and data engineers deal with a lot of unstructured data.
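
One possible implementation, assuming the timestamps are ‘YYYY-MM-DD’ strings sorted in ascending order (a sketch, not the only valid approach):

from datetime import datetime, timedelta

def weekly_aggregation(ts):
    """Group sorted 'YYYY-MM-DD' timestamps into 7-day buckets,
    using the first timestamp as the starting point."""
    if not ts:
        return []
    dates = [datetime.strptime(t, "%Y-%m-%d") for t in ts]
    buckets, current = [], []
    week_start = dates[0]
    for raw, d in zip(ts, dates):
        # Advance the 7-day window until the current date falls inside it
        while d >= week_start + timedelta(days=7):
            if current:
                buckets.append(current)
                current = []
            week_start += timedelta(days=7)
        current.append(raw)
    if current:
        buckets.append(current)
    return buckets

print(weekly_aggregation(['2019-01-01', '2019-01-02', '2019-01-08',
                          '2019-02-01', '2019-02-02', '2019-02-05']))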

52. Given a string, write a function recurring_char to find its first recurring character.

Given that we have to return the first character that recurs, we should be able to go through the string in one loop, saving each character we’ve seen in a set and checking whether the current character already exists in that set. If it does, return the character. Here’s a sample output for this question:

input = "interviewquery"
output = "i"

input = "interv"
output = "None"
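
A straightforward implementation of that idea:

def recurring_char(s):
    """Return the first character that appears a second time, or None."""
    seen = set()
    for ch in s:
        if ch in seen:
            return ch
        seen.add(ch)
    return None

print(recurring_char("interviewquery"))  # i
print(recurring_char("interv"))          # None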

53. Given a list of integers, find all combinations that equal the value N.

This type of question is the classic subset sum problem presented in a way that requires us to construct a list of all the answers. Subset sum is a type of problem in computer science that broadly asks to find all subsets of a set of integers that sum to a target amount.

We can solve this question through recursion. Even if you didn’t recognize the problem, you could guess at its recursive nature by noticing that it decomposes into identical subproblems. For example, if given integers = [2, 3, 5] and target = 8 as in the prompt, we might recognize that if we first solve for the input integers = [2, 3, 5] and target = 8 - 2 = 6, we can just append 2 to each combination in the output to build part of our final answer. This subproblem recursion is the hallmark of dynamic programming and many other related recursive problem types.

Let’s first think of a base case for our recursive function.
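
Here’s one hedged recursive sketch that builds each combination in non-decreasing order so the same combination isn’t produced twice (positive integers and reuse of values are assumed):

def find_combinations(integers, target, start=0):
    """Return all combinations (with repetition) of integers that sum to target."""
    if target == 0:
        return [[]]                       # base case: the empty combination sums to 0
    results = []
    for i in range(start, len(integers)):
        n = integers[i]
        if n <= target:
            # Reuse integers[i:] so each combination is generated in only one order
            for combo in find_combinations(integers, target - n, i):
                results.append([n] + combo)
    return results

print(find_combinations([2, 3, 5], 8))    # [[2, 2, 2, 2], [2, 3, 3], [3, 5]]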

54. Write a function find_bigrams to take a string and return a list of all bigrams.

Bigrams are two words grouped next to each other, and they’re relevant in feature engineering for NLP models. With this question, we’re looking for output like this:

def find_bigrams(sentence) ->

 [('have', 'free'),
 ('free', 'hours'),
 ('hours', 'and'),
 ('other', 'activities.')]

Solution overview: To parse bigrams out of a string, you must first split the input string. You can do this with Python’s .split() method to create a list with each word as an element. Then, create another empty list that you will eventually fill with tuples of adjacent words.
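
Following that outline, one possible implementation:

def find_bigrams(sentence):
    """Return a list of adjacent word pairs (bigrams) from a sentence."""
    words = sentence.lower().split()
    return [(words[i], words[i + 1]) for i in range(len(words) - 1)]

print(find_bigrams("Have free hours and other activities."))
# [('have', 'free'), ('free', 'hours'), ('hours', 'and'),
#  ('and', 'other'), ('other', 'activities.')]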

This question has appeared in Google data engineer interviews.

Database Design and Data Modeling Questions for Data Engineers

Data modeling and database design questions assess your knowledge of entity-relationship modeling, normalization and denormalization tradeoffs, dimensional modeling, and related concepts. Common questions include:

55. What are the features of a physical data model?

The physical database model is the last step before implementation and includes a plan for how you will build the database. Based on the requirements of the build, the physical model typically differs from the logical data model.

Some of the key features of the physical data model include:

  • Specs for all tables and columns
  • Relationships between tables
  • Customized for a specific DBMS or data storage option
  • Data types, default values, and lengths for columns
  • Foreign and primary keys, views, indexes, authorizations, etc.

56. What database relationships do you know?

With this question, explain the types of relationships you know, and provide examples of the work you’ve done with them. The four main types of database relationships include:

  • 1-to-1 - When one entity is associated with exactly one other entity, and vice versa. An example would be each employee being assigned exactly one company ID badge, and each badge belonging to exactly one employee.
  • 1-to-Many - When one entity is associated with many others. An example would be all the employees associated with a particular work location.
  • Many-to-1 - When many entities are associated with one entity. An example would be all of the students associated with a single project.
  • Many-to-Many - When many entities are associated with many others. An example would be customers and products, as customers can be associated with many products, and many products can be associated with various customers.

57. How would you handle data loss during a migration?

Complex migrations can result in data loss, and data engineering candidates should have ideas for minimizing loss. A few steps you can take to reduce data loss during migration would include:

  • Define the specific data required for migration
  • Avoid migrating data that is no longer needed
  • Profile the data (possibly with a tool) to determine the current quality
  • Perform data cleaning where required
  • Define data quality rules via business analysis, system analysis, or gap analysis
  • Gain approval for the quality rules
  • Perform real-time data verification during the migration
  • Define a clear flow for data, error reporting, and rerun procedures

58. What are the three types of data models?

The three most commonly used data models are relational, dimensional, and entity-relationship. However, many others aren’t widely used, including object-oriented, multi-value, and hierarchical. The type of model used defines the logical structure and how it is organized, stored, and retrieved.

59. What is normalization? Denormalization?

Data normalization is the process of organizing tables and columns so that data is stored consistently and without unnecessary redundancy across all records and fields. Normalization helps provide analysts with more efficient and precise navigation by removing duplicate data and maintaining referential integrity.

On the other hand, Denormalization is a database technique in which redundant data is added to one or more tables. This technique can optimize performance by reducing the need for costly joins.

60. What are some things to avoid when building a data model?

Some of the most common mistakes when modeling data include:

  • Poor naming conventions - Establish a consistent naming convention, which will allow for easier querying.
  • Failing to plan accordingly - Gather stakeholder input and design a model for a specific analytics purpose.
  • Not using surrogate keys when needed - Surrogate keys aren’t always necessary or best practice. However, because they are unique and system-generated, surrogate keys are useful when primary keys are inconsistent or incompatible.

61. Why are NoSQL databases more useful than relational databases?

Compared to relational databases, NoSQL databases have many advantages, including scalability and superior performance. Some of the benefits of NoSQL databases include:

  • Store all types of data (unstructured, semi-structured, and structured data)
  • Simplified updating of schemas and fields
  • Cloud-based, resulting in less downtime
  • Can handle large volumes of data

62. Design a database to represent a Tinder-style dating app. What does the schema look like?

Let’s first approach this problem by understanding the scope of the dating app and what functionality we must design around it.

Start by listing out 1) essential app functions for users (e.g., onboarding, matching, messaging) and 2) specific feature goals to account for (e.g., hard or soft user preferences or how the matching algorithm works).

With this information, we can create an initial design for the database.

63. Create a table schema for the Golden Gate Bridge to track how long each car took to enter and exit the bridge.

64. Write a query on the given tables to get the car model with the fastest average times for the current day.

In this two-part table schema question, we’re tracking not just enter/exit times but also car make, model, and license plate info.

The relationship between car models and license plates will be one-to-many, given that each license plate represents a single car, while each car model can appear many times. Here’s an example schema for crossings (left) and car models/license plates (right):

crossings
  Column         Type
  id             INTEGER
  license_plate  VARCHAR
  enter_time     DATETIME
  exit_time      DATETIME
  car_model_id   INTEGER

car_models
  Column       Type
  id           INTEGER
  model_name   VARCHAR
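
For the second part, a hedged sketch using MySQL-style date functions; the table names crossings and car_models mirror the schema above:

SELECT
    m.model_name,
    AVG(TIMESTAMPDIFF(SECOND, c.enter_time, c.exit_time)) AS avg_seconds
FROM crossings AS c
JOIN car_models AS m
    ON c.car_model_id = m.id
WHERE DATE(c.enter_time) = CURRENT_DATE   -- restrict to today's crossings
GROUP BY m.model_name
ORDER BY avg_seconds ASC                  -- fastest average crossing first
LIMIT 1;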

65. How would you create a schema representing client click data on the web?

This question is more architecture-based and assesses experience within developing databases, setting up architectures, and in this case, representing client-side tracking in the form of clicks.

A simple but effective design schema would be to represent each action with a specific label. In this case, each click event is assigned a name or title describing its particular action.

66. You have a table with a billion rows. How would you add a column inserting data without affecting user experience?

Many database design questions for data engineers are vague and require follow-up. With a question like this, you might want to ask: What’s the potential impact of downtime?

Don’t rush into answers. A helpful tip for all technical and design questions is to ask for more information; this shows you’re thoughtful and look at problems from every angle.

67. How would you design a data mart or data warehouse for a new online retailer?

In addition, you’re tasked with using the star schema for your design. The star schema is a database structure that uses one primary fact table to store transactional data and a smaller table or tables that store attributes about the data.

Some key transactional details you would want to include in the model:

  • orders - orderid, itemid, customerid, price, date, payment, promotion
  • customer - customer_id, cname, address, city, country, phone
  • items - itemid, subcategory, category, brand, mrp
  • payment - payment, mode, amount
  • promotions - promotionid, category, discount, start_date, end_date
  • date - datesk, date, month, year, day

68. How would you design a database that could record rides between riders and drivers for a ride-sharing app?

Follow-up question: How would the table schema look?

See a complete mock interview solution for this database design question on YouTube:

Database Design mock interview

Bonus Question: When should we consider using a graph database? A columnar store?

Graph databases are particularly effective when relationships between data points are complex and need to be queried frequently. They shine in scenarios involving social networks, recommendation engines, and fraud detection, where the connection and path between entities are of primary importance.

A columnar store, or column-oriented database, is instead advantageous when the workload involves reading large amounts of data with minimal update operations. They are apt for analytical queries that involve large datasets, such as in data warehousing scenarios, because they allow for better compression and efficient I/O.

Data Engineering Case Study

Data engineering case studies, or “data modeling case studies,” are scenario-based data engineering problems. Many questions focus on designing architecture, and then you walk the interviewer through developing a solution.

69. How would you design a relational database of customer data?

A simple four-step process for designing relational databases might include these steps:

  • Step 1 - Gather stakeholder input and determine the purpose of the database. What types of analysis will the database support?
  • Step 2 - Next, you could gather data, begin the cleaning and organization, and specify primary keys.
  • Step 3 - Next, create relationships between tables. There are four main relationship types: 1-to-many, many-to-many, many-to-1, and 1-to-1.
  • Step 4 - Lastly, you should refine the data. Perform normalization, add columns, and reduce the size of larger tables if necessary.

70. How would this design process change for customer data? What factors would you need to consider in Step 1?

How do you go about debugging an ETL error?

Start your response by gathering information about the system. What tools are used, and how does the existing process look?

Next, you could talk about two approaches:

  • Error prevention: Fixing error conditions to prevent the process from failing.
  • Error response: How you respond to an ETL error.

At a minimum, an ETL process failure should log details of the failure via a logging subsystem. Log data is one of the first places you should look to triage an error.

71. With what database design patterns do you have the most experience?

With architecture problems, you should have a firm grasp of the design patterns, technologies, and products you can use to solve the problem. You might talk about some of the most common database patterns you’ve used, like:

  • Data mapper
  • Identity map
  • Object identity
  • Domain object assembler
  • Lazy load

72. You are tasked with building a notification system for a Reddit-style app. How would the backend and data model look?

Many case study questions for data engineers are similar to database design questions. With a question like this, start with clarifying questions. You might want to know the goals for the notification system, user information, and the types of notifications utilized.

Then, you’ll want to make assumptions. A primary solution might start with notifications:

  • Trigger-based notifications: This might be an email notification for comment replies on a submitted post.
  • Scheduled notifications: This might be a targeted push notification for new content. These are notifications designed to drive engagement.

73. You are analyzing auto insurance data and find that the marriage attribute column is marked TRUE for all customers.

Follow-up question: How would you debug what happened? What data would you look into, and how would you determine who is married and who is not?

With this debugging question, you should start with some clarification, e.g., how far back does the bug extend? What does the table schema look like? One potential solution would be to look at other dimensions and columns that might indicate whether someone is married (like a marriage date or a spouse’s name).

Amazon data engineer interviews have utilized this question.

74. Design a relational database for storing metadata about songs, e.g., song title, song length, artist, album, release year, genre, etc.

When answering this question, you might want to start with questions about the goals and uses of the database. You want to design a database for how the company will use the data.

75. What database optimizations might you consider for a Tinder-style app?

The biggest beneficiary of optimizations would likely be the speed and performance of the locations and swipes tables. While we can easily add an index to the locations table on something like zip code (assuming U.S.-only users), we can’t add one to the swipes table, given its size. One thing to consider when adding indexes is that they trade space for access speed.

One option is to implement a sharded design for our database. While an index builds an ordered structure over selected columns so the relevant records can be read off quickly, sharding lets you add multiple nodes, where the specific record you want lives on only one of those nodes. This allows for a more bounded result in terms of retrieval time.

What other optimizations might you consider?

76. How would you design a system for DoorDash to minimize missing or wrong orders placed on the app?

This question requires clarity: What exactly is a wrong or missing order? For example, if the wrong order means “orders that users placed but ultimately canceled,” you’d have a binary classification problem.

If, instead, it meant “orders in which customers provided a wrong address or other information,” you might try to create a classification model to identify and prevent wrong information from being added.

77. How would you design the YouTube video recommendation system? What are important factors to keep in mind when building recommendation algorithms?

The purpose of a recommendation algorithm is to recommend videos that a user might like. One way to approach this would be to suggest metrics that indicate how well a user likes a video. Let’s say we set a metric to gauge user interest in a video: whether users watch a whole video or stop before the video completes.

Once we have a functioning metric for whether users like or dislike videos, we can associate users with similar interests and attributes to generate a basic framework for a recommendation. Our approach relies on the assumption that if person A likes a lot of the things that person B likes or is similar in other respects (such as age, sex, etc.), there’s an above-average chance that person B will enjoy a video that person A likes.

What other factors might we want to take into account for our algorithm?

Data Engineering ETL Interview Questions

Data engineers and data scientists work hand in hand. Data engineers are responsible for developing ETL processes, analytical tools, and storage tools and software. Thus, expertise with existing ETL and BI solutions is a much-needed requirement.

ETL refers to collecting (extraction) data from a data source, converting (transformation) it into a format that users can easily analyze, and storing (loading) it into a data warehouse. The ETL process then makes the transformed data available in a database or BI platform to be used and viewed by anyone in the organization. The most common ETL interview questions are:

78. You have two ETL jobs that feed into a single production table each day. What problems might this cause?

Many problems can arise from concurrent transactions. One is lost updates: a committed value written by one transaction is overwritten by a subsequently committed write from a simultaneous transaction. Another is write skew, which happens when a transaction makes updates based on stale data.

79. What’s the difference between ETL and ELT?

The critical point to remember for this question is that ETL transforms the data outside the warehouse. In other words, no raw data will transfer to the warehouse. In ELT, the transformation takes place in the warehouse; the raw data goes directly there.

80. What is a full load in ETL? What about an incremental load?

There are two primary ways to load data into a data warehouse: full load and incremental load. The differences between them are:

  • Full load - All the data from the source is loaded into the warehouse in one pass, typically when the warehouse is first populated.
  • Incremental load - Data is moved from the source to the target at regular intervals. The last extract date is stored, so only records added or changed after that date are loaded. This load can be either streaming (better for small volumes) or batch (better for large volumes).

Full loads take more time and include all rows but are less complicated. Incremental loads take less time (because they contain only new or updated records) but are more challenging to implement and debug.

81. With what ETL tools are you most familiar?

You should be comfortable talking about the tools with which you are most skilled. However, if you do not have experience with a specific tool, you can do some pre-interview preparation.

Start by researching the ETL tools the company already uses. Your goal should be a solid overview of the tools’ most common processes and uses. The most common ETL platforms, frameworks, and related technologies are:

  • IBM InfoSphere DataStage
  • SQL
  • Python
  • Informatica PowerCenter
  • Microsoft SQL Server Integration Services (SSIS)
  • Microsoft Power BI
  • Oracle Data Integrator

Note: If you only have basic tool knowledge, do not be afraid to admit it. However, describe how you learn new tools and how you can leverage your existing expertise in evaluating the unknown tool.

82. What are partitions? Why might you increase the number of partitions?

Partitioning is the process of subdividing data to improve performance. The data is partitioned into smaller units, allowing for more straightforward analysis. You can think of it like this: partitioning will enable you to add organization to a large data warehouse, similar to signs and aisle numbers in a large department store.

This practice can help improve performance, aid in management, or ensure the data stays available (if one partition is unavailable, other partitions can remain open).

83. What are database snapshots? What’s their importance?

In short, a snapshot is like a photo of the database. It captures the data from a specific point in time. Database snapshots are read-only, static views of the source database. Snapshots have many uses, including safeguarding against admin errors (by reverting to the snapshot if an error occurs), reporting (e.g., a quarterly database snapshot), or test database management.

84. What are views in ETL? What is used to build them?

Creating views may be a step in the transformation process. A view is a stored SQL query that can be run on demand within the database environment, presenting its results as a virtual table. Users typically build views with SQL or a database management tool.
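
A small illustration, assuming a hypothetical orders table:

-- The view stores the query definition, not the data itself
CREATE VIEW daily_revenue AS
SELECT
    DATE(created_at) AS order_date,
    SUM(amount)      AS revenue
FROM orders
GROUP BY DATE(created_at);

-- Consumers then query the view like an ordinary table
SELECT * FROM daily_revenue WHERE order_date = '2023-06-01';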

85. What could be potential bottlenecks in the ETL process?

Knowing the limitations and weaknesses of ETL is critical to demonstrate in ETL interviews. It allows you to assess, find workarounds or entirely avoid specific processes that may slow the production of relevant data.

For example, staging and transformation are incredibly time-intensive. Moreover, if the sources are unconventional or inherently different, the transformation process might take a long time. Another bottleneck of ETL is the involvement of hardware, specifically disk-based pipelines, during transformation and staging. The hardware limitations of physical disks can create slowdowns that no efficient algorithm can solve.

86. How would you triage an ETL failure?

The first thing to do when checking for errors is to test whether one can duplicate the error.

  • Non-replicable - A non-replicable error can be challenging to fix. Typically, these errors need to be observed more, either through brute force or through analyzing the logic implemented in the schemas and the ETL processes, including the transformation modules.
  • Replicable - If the error is replicable, run through the data and check if the data is delivered. After which, it is best to check for the source of the error. Debugging and checking for ETL errors is troublesome, but it is worth performing in the long run.

87. Describe how to use an operational data store.

An operational data store, or ODS, is a database that provides interim storage for data before it’s sent to a warehouse. An ODS typically integrates data from multiple sources and provides an area for efficient data processing activities like operational reporting. Because an ODS typically includes near-real-time data from various sources, it provides up-to-date snapshots of performance and usage for order tracking, monitoring customer activity, or managing logistics.

88. Create an ETL query for an aggregate table called lifetime_plays that records each user’s song count by date.

For this problem, we use the INSERT INTO keywords to add rows to the lifetime_plays table. If we set this query to run daily, it becomes a daily extract, transform, and load (ETL) process.

The rows we add come from a subquery that selects the created_at date, user_id, song_id, and a count of plays from the song_plays table for the current date.
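
A hedged sketch of that query, using the table and column names described above:

INSERT INTO lifetime_plays (created_at, user_id, song_id, count)
SELECT
    created_at,
    user_id,
    song_id,
    COUNT(*) AS count
FROM song_plays
WHERE created_at = CURRENT_DATE   -- run daily so each load appends one day of plays
GROUP BY created_at, user_id, song_id;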

89. Due to an ETL error, instead of updating yearly salary data for employees, an insert was done instead. How would you get the current salary of each employee?

With a question like this, a business would provide you with a table representing the company payroll schema.

Hint: The first step would be to remove duplicates and retain only the current salary for each employee. Given that there aren’t any duplicate first and last name combinations, we can deduplicate the employees table by running a GROUP BY on the two fields, first and last name. This gives us one row per unique combination of the two fields.
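
A sketch of that approach, assuming the payroll table is named employees and has an auto-incrementing id, so the most recently inserted row per person holds the current salary:

SELECT e.first_name, e.last_name, e.salary
FROM employees AS e
JOIN (
    -- Deduplicate on name and keep the most recently inserted row
    SELECT first_name, last_name, MAX(id) AS latest_id
    FROM employees
    GROUP BY first_name, last_name
) AS latest
    ON e.id = latest.latest_id;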

Data Structures and Algorithms Questions for Data Engineers

Data engineers focus primarily on data modeling and data architecture, but a basic knowledge of algorithms and data structures is also needed. The data engineer’s ability to develop inexpensive methods for transferring large amounts of data is of particular importance. If you’re responsible for a database with potentially millions (let alone billions) of records, finding the most efficient solution is essential. Common algorithm interview questions include:

90. What algorithms support missing values?

There are many algorithms and approaches to handling missing values. You might cite these missing value algorithms:

  • KNN - This algorithm uses K-nearest values to predict the missing value.
  • Random Forest - Random forests work on non-linear and categorical data and are valid for large datasets.

You might also want to incorporate some of the pros and cons of using algorithms for missing values. For example, one downside is that these approaches tend to be time-consuming on large datasets.

91. What is the difference between linear and non-linear data structures?

Linear data structures are those whose elements attach to the previous and adjacent elements and involve only a single level. In non-linear structures, data elements attach hierarchically, and multiple levels are involved. Linear data structures can be traversed in a single run, whereas non-linear structures cannot.

Examples of linear data structures include queue, stack, array, and linked list. Non-linear data structures include graphs and trees.

92. Give some examples of uses for linked lists.

Some potential uses for linked lists include maintaining text directories, implementing stacks and queues, representing sparse matrices, or performing math operations on long integers.

Follow-up: Use list comprehension to print the odd numbers between 0 and 100.

List comprehension defines and creates a list based on an existing list.

list1 = list(range(101))  # 0 through 100

only_odd = [num for num in list1 if num % 2 == 1]

print(only_odd)

93. How would you implement a queue using a stack?

This question asks you to create a queue that supports enqueue and dequeue operations using a stack’s push and pop operations. A queue is a first-in, first-out structure in which elements are removed in the order they were added.

One way to do this would be with two stacks.

To enqueue an item, for example, you would move all the elements from the first stack to the second, push the item into the first, and then move all elements back to the first stack.
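
For reference, here’s a sketch of the complementary two-stack approach, where the extra work happens on dequeue instead of enqueue:

class QueueWithStacks:
    """Queue built from two Python lists used as stacks (append/pop)."""

    def __init__(self):
        self.in_stack = []
        self.out_stack = []

    def enqueue(self, item):
        self.in_stack.append(item)

    def dequeue(self):
        # Refill the output stack only when it's empty; popping from
        # in_stack reverses the order, so the oldest item ends up on top
        if not self.out_stack:
            while self.in_stack:
                self.out_stack.append(self.in_stack.pop())
        return self.out_stack.pop()

q = QueueWithStacks()
q.enqueue(1)
q.enqueue(2)
q.enqueue(3)
print(q.dequeue(), q.dequeue())  # 1 2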

94. What is a dequeue?

Dequeue is a queue operation to remove items from the front of a queue.

95. What are the assumptions of linear regression?

There are several linear regression assumptions, which are baked into the dataset and how the model is built. Otherwise, if these assumptions are violated, we become privy to the phrase “garbage in, garbage out”.

The first assumption is that there is a linear relationship between the features and the response variable, otherwise known as the value you’re trying to predict. This assumption is baked into the definition of linear regression.

What other assumptions exist?

96. Write a function that returns the missing number in the array. Complexity of O(N) required.

Example:

nums = [0, 1, 2, 4, 5]
missingNumber(nums) -> 3

We can solve this problem with a logical iteration or with a mathematical formulation (comparing the expected sum of 0..n to the actual sum of the array).
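
Here’s the mathematical version as a sketch (the function name matches the prompt above):

def missingNumber(nums):
    """The list should contain 0..n with one value missing.
    O(N) time, O(1) extra space."""
    n = len(nums)                     # full range is 0..n, so n = len(nums)
    expected = n * (n + 1) // 2       # sum of 0..n
    return expected - sum(nums)

nums = [0, 1, 2, 4, 5]
print(missingNumber(nums))  # 3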

97. Given a grid and a start and end, find the maximum water height you can traverse to before there is no path. You can only go in horizontal and vertical directions.

Example:

S 3 4 5 6   
2 4 6 1 1   
9 9 9 9 E   

Here’s a solution sketch: recursively backtrack toward the end cell while saving the maximum water level seen along each path on each function call, and track a visited-cell set to trim the search space. This runs in O(n^2) for an n x n grid.

98. Given a string, determine whether any permutation of it is a palindrome.

The brute force solution to this question will be to try every permutation and verify if it’s a palindrome. If we find one, then return true; otherwise, return false. You can see the complete solution on Interview Query.

99. Given a stream of numbers, select a random number from the stream, with O(1) space in the selection.

A function that uses O(1) space does not grow its memory use with the size of the input. For this problem, the function can loop through the stream while keeping a single selected value, replacing it with the i-th element with probability 1/i (reservoir sampling with a reservoir of size one), so every element ends up equally likely to be chosen.
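
A minimal sketch of that idea:

import random

def pick_random(stream):
    """Select one element uniformly at random using O(1) extra space."""
    selection = None
    for i, value in enumerate(stream, start=1):
        # Replace the current selection with probability 1/i, which leaves
        # every element with an equal 1/n chance overall
        if random.randrange(i) == 0:
            selection = value
    return selection

print(pick_random(iter([4, 8, 15, 16, 23, 42])))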

100. Write a function to locate the left insertion point for a specified value in sorted order.

Here’s a solution for this Python data structures question:

import bisect

def index(a, x):
    i = bisect.bisect_left(a, x)
    return i

a = [1, 2, 4, 5]
print(index(a, 6))
print(index(a, 3))

Video: Top 10+ Data Engineer Interview Questions and Answers

Watch a video overview of the types of questions that get asked in data engineer interviews:

Top Data Engineering Questions

More Data Engineer Interview Resources

The best way to prepare for a data engineer interview is practice. Practice as many example interview questions as possible, focusing primarily on the most important skills for the job, as well as where you have gaps in knowledge.

If you’d like to land a job in Data Engineering, Interview Query offers the following resources: