In data engineer interviews, Python is the second most frequent programming language you will find, behind only SQL. In fact, it is listed as a required skill for nearly 75% of data engineer jobs.
Python data engineer interview questions will assess your technical skills and understanding of data engineering concepts. The interview questions may cover Python, data processing, frameworks, and cloud technologies commonly used in data engineering roles.
Python is widely used in data science, machine learning, and AI. Therefore, if you are preparing for a data engineer interview, you should have a strong grasp of its fundamentals and practical uses, including Python definitions, Python theory, and Python functions.
Data engineer Python interview questions can be broken down into three main categories, which are covered in the sections below: easy questions, medium questions, and more advanced coding exercises.
This article provides an overview of Python interview questions for data engineers. For an in-depth guide on how to answer data engineering interview questions, check out the data engineering learning path.
Data engineer Python interview questions typically cover a wide range of Python coding concepts. The most common topics include distribution-based questions, data munging with pandas, and general data manipulation.
The majority of questions that are asked in an interview will be beginner to intermediate Python coding exercises. These assessments require you to write code efficiently, and candidates are graded on their coding skills and the time required to solve the problem.
Generally, the best practice for Python interviews is to work through as many data engineer questions as possible beforehand and focus on a wide range of Python topics in your preparation.
Here is a quick tip to help you prepare for a data engineering Python interview: remember that Python is only one category of data engineer interview questions (our team has compiled 100 questions here). You should also practice SQL, algorithms, product metrics, and machine learning questions to ace a data engineer interview.
Easy Python questions asked of data engineers are commonly theory- or definition-based. These questions most frequently relate to data structures, basic Python functions, and scenarios.
This is a foundational question that quickly assesses your familiarity with data processing. Be sure to include NumPy and Pandas and list their advantages.
NumPy is the go-to library for working with arrays of numerical data, while Pandas, which is built on top of NumPy, is better suited to processing tabular data for statistics and machine learning workflows.
Hint: Be prepared for situational questions as well. The interviewer might give you a situation and ask which Python library you might use to process the data.
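For instance, here is a minimal sketch of how the two libraries complement each other (the column names and values are made up for illustration):

import numpy as np
import pandas as pd

# NumPy: fast element-wise math on homogeneous arrays
temps_f = np.array([67.0, 72.5, 80.1])
temps_c = (temps_f - 32) * 5 / 9

# pandas: labeled, tabular data built on top of NumPy arrays
df = pd.DataFrame({"city": ["Austin", "Boston", "Chicago"], "temp_c": temps_c})
print(df.describe())  # quick summary statistics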
Data smoothing is an approach used to eliminate outliers and noise from data sets. The technique makes underlying patterns more recognizable, and smoothing out the rough edges in the data can also improve machine learning results.
Algorithms are used in Python to reduce noise and smooth data sets. A sample of data smoothing algorithms includes the Savitzky-Golay filter and the Triangular Moving Average.
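As a hedged illustration, here is a simple moving-average smoother alongside a Savitzky-Golay call from SciPy; the window sizes are arbitrary choices:

import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(0)
noisy = np.sin(np.linspace(0, 4 * np.pi, 100)) + rng.normal(0, 0.3, 100)

# simple moving average: convolve with a uniform window of width 5
window = np.ones(5) / 5
smoothed_ma = np.convolve(noisy, window, mode="valid")

# Savitzky-Golay filter: fit a low-order polynomial inside each sliding window
smoothed_sg = savgol_filter(noisy, window_length=11, polyorder=2)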
There are a lot of similarities between these two languages: both are object-oriented, and both have large ecosystems of libraries that extend their core capabilities. In data science, however, Python has the edge, in part due to the language's simplicity and user-friendliness. Java, by contrast, is often the better language for building large-scale applications.
NumPy is an open-source library used to analyze data; it provides multi-dimensional arrays and matrices for Python and supports a wide variety of mathematical and statistical operations.
Python lists are a basic building block of the language and a useful general-purpose data container. With Python lists, however, vectorized operations such as element-wise multiplication are not possible, whereas they are with NumPy arrays. Lists also require Python to store type information for every element, since they support objects of different types, which means type-dispatching code must be executed each time an operation on an element is performed.
Each iteration also has to undergo type checks and Python API bookkeeping, so very few operations end up being carried out by fast C loops.
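A small sketch of the difference (exact timings will vary by machine):

import numpy as np

values = list(range(1_000_000))
arr = np.arange(1_000_000)

# list: element-wise work happens in a Python-level loop
doubled_list = [v * 2 for v in values]

# NumPy array: the same operation runs as a single vectorized C loop
doubled_arr = arr * 2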
The built-in data types in Python include lists, tuples, dictionaries, and sets. These data types are already defined and supported by Python and act as containers for grouping data by type.
User-defined data types are built on top of these primitives and share many of the same concepts. Ultimately, they allow users to create their own data structures, including queues, trees, and linked lists.
Hint: With questions like these, be prepared to discuss the advantages of a particular data structure and when it might be best for a project.
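As an illustration of a user-defined structure, here is a minimal stack built on a linked list (the class and method names are arbitrary):

class Node:
    def __init__(self, value):
        self.value = value
        self.next = None

class Stack:
    """A user-defined last-in, first-out container."""
    def __init__(self):
        self.top = None

    def push(self, value):
        node = Node(value)
        node.next = self.top
        self.top = node

    def pop(self):
        if self.top is None:
            return None
        value = self.top.value
        self.top = self.top.next
        return value

s = Stack()
s.push(1)
s.push(2)
print(s.pop())  # 2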
The is operator in Python checks whether two variables point to the same object, while == checks if the values of two variables are the same.
We can apply this to sample data. Consider the following:
a = [2, 4, 6]
b = [2, 4, 6]
c = b
Here is how this data would be evaluated under the “is” and “==” operators:
a == b evaluates to True, since the values in a and b are the same.
a is b evaluates to False, since a and b are different objects in memory.
c is b evaluates to True, since c references the same object as b.
One technique would be to convert a list into a set because sets do not contain duplicate data. Then, you would convert the set back into a list.
Here is an example with data:
list1 = [3,6,7,9,2,3,7,1]
list2 = list(set(list1))
The resulting list2 would contain the values [3, 6, 7, 9, 2, 1], with each duplicate removed. However, keep in mind that converting to a set does not preserve the original order of the list.
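If you need to preserve the original order, a common alternative (not required by the question) is dict.fromkeys(), since dictionaries preserve insertion order in Python 3.7+:

list1 = [3, 6, 7, 9, 2, 3, 7, 1]
list2 = list(dict.fromkeys(list1))  # [3, 6, 7, 9, 2, 1], order preserved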
You can use the rename() function to rename any column in a dataframe. For example, if in the customers table you wanted to rename the column "user_id_number" to "user_id" and "customer_phone" to "phone," you would write:

customers = customers.rename(columns={"user_id_number": "user_id", "customer_phone": "phone"})
With lists in Python, lookup time is linear: it depends on the number of values in the list, so a lookup is O(n). With dictionaries, lookup time is constant because dictionaries are implemented as hash tables, so you can find a value in O(1).
Because of this, dictionary lookups are generally faster in Python.
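A quick, hedged way to see this yourself (exact timings will vary by machine):

import timeit

setup = "data_list = list(range(100_000)); data_dict = dict.fromkeys(data_list)"
print(timeit.timeit("99_999 in data_list", setup=setup, number=1_000))  # linear scan
print(timeit.timeit("99_999 in data_dict", setup=setup, number=1_000))  # hash lookup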
You have an array of integers, nums, of length n spanning 0 to n with one value missing. Write a function missing_number that returns the missing number in the array.
A list is mutable, meaning its elements can be changed after creation, while a tuple is immutable and cannot be altered once defined. Lists are typically used for collections of items that may change, whereas tuples are used when you need a constant set of values. Since tuples are immutable, they can be used as keys in dictionaries, unlike lists.
Python uses try, except, and finally blocks to handle exceptions. The try block contains code that might raise an exception, while the except block handles it if it occurs. For example:
try:
    x = 10 / 0
except ZeroDivisionError:
    print("Cannot divide by zero")
finally:
    print("Execution completed")
You can iterate over a dictionary in Python using a for loop that iterates over its keys, values, or key-value pairs. For example:
my_dict = {'a': 1, 'b': 2, 'c': 3}
for key, value in my_dict.items():
    print(key, value)
The items() method returns key-value pairs, allowing you to access both simultaneously.
Note: A solution with O(n) complexity is required for the missing_number question above. There are two ways to solve the problem in O(n): a mathematical formulation or a logical iteration.
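A sketch of the mathematical approach, using the fact that the numbers 0 through n sum to n(n+1)/2:

def missing_number(nums):
    n = len(nums)
    expected_sum = n * (n + 1) // 2  # sum of 0..n
    return expected_sum - sum(nums)

print(missing_number([0, 1, 3]))  # 2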
Medium Python coding questions ask you to write Python functions to perform various operations. Typically, these questions test concepts like string manipulation, data munging, statistical analysis, or ETL process builds. Some medium Python coding questions include:
Hint: This problem is pretty straightforward. We can loop through each user and their tip, sum the tips for each user, and then find the user with the highest total amount of tips left. Additionally, we can use Python's collections module: a Counter lets us sort the totals by calling its most_common() method, which returns the entries sorted by value in descending order. Then, all we have to do is grab the first entry.
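A minimal sketch, assuming the input arrives as a list of (user, tip) pairs (the exact input format in the real question may differ):

from collections import Counter

def top_tipper(tips):
    totals = Counter()
    for user, tip in tips:
        totals[user] += tip             # accumulate tips per user
    return totals.most_common(1)[0][0]  # user with the largest total

print(top_tipper([("ana", 5), ("bo", 3), ("ana", 2)]))  # ana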
Note: Treat upper and lower case letters as distinct characters. You may assume the input string includes no spaces.
Example:
input = "interviewquery"
output = "i"
input = "interv"
output = "None"
Hint: We know we have to store the set of unique characters seen so far and loop through the string to check which character occurs twice first.
Given that we have to return the first character whose second occurrence appears earliest, we should be able to go through the string in one loop, save each unique character, and check whether the current character already exists in that saved set. If it does, return the character.
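A short sketch of that approach (the function name here is illustrative):

def first_recurring_char(s):
    seen = set()
    for ch in s:
        if ch in seen:
            return ch   # this character's second occurrence appears first
        seen.add(ch)
    return None

print(first_recurring_char("interviewquery"))  # i
print(first_recurring_char("interv"))          # None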
Hint: To actually parse bigrams out of a string, we first need to split the input string. We would use the Python function .split() to create a list with each individual word as an element, and create another empty list that will eventually be filled with tuples.
Then, once we have identified each individual word, we need to loop through k-1 times (if k is the number of words in the sentence) and append the current word and the subsequent word as a tuple. Each tuple gets added to the list that we eventually return.
Remember to use the Python function .lower() to turn all the words into lowercase.
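A sketch of that approach (the example sentence is made up):

def find_bigrams(sentence):
    words = sentence.lower().split()
    bigrams = []
    for i in range(len(words) - 1):
        bigrams.append((words[i], words[i + 1]))  # current word paired with the next one
    return bigrams

print(find_bigrams("Data engineers love Python"))
# [('data', 'engineers'), ('engineers', 'love'), ('love', 'python')]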
This solution uses the bisect module, which supports maintaining a list in sorted order without having to re-sort the list after each insertion. This can be an improvement over the more common approach for long lists of items with expensive comparison operations.
Here is a sample of Python code for this problem:
import bisect

def index(a, x):
    i = bisect.bisect_left(a, x)  # leftmost position where x can be inserted
    return i

a = [1, 2, 4, 5]
print(index(a, 6))  # 4
print(index(a, 3))  # 2
This Python problem tests your knowledge of the queue module, a built-in Python module for queue data structures. Here is sample code for this problem:
import queue

q = queue.Queue()
for x in range(4):
    q.put(x)

print("Members of the queue:")
for n in list(q.queue):
    print(n, end=" ")

print("\nSize of the queue:")
print(q.qsize())
Note: If a word can be formed by more than one root, replace it with the root of the shortest length. Here is an example input and output for this problem:
Input:
roots = ["cat", "bat", "rat"]
sentence = "the cattle was rattled by the battery"
Output:
"the cat was a rat by the bat"
At first, it looks like we can simply loop through each word and check if the root exists in that word. If the root is present, we would then replace the word with the root. However, since we are technically stemming the words, we must ensure that each root matches the word at its prefix rather than appearing anywhere within the word.
We are given a list of roots and a sentence string. Given that we have to check each word, we can first split the sentence into a list of words:
words = sentence.split()
Next, we loop through each word in the words list, and for each one, we check whether it has a prefix equal to one of the roots. To accomplish this, we loop through each possible substring starting at the first letter. If we find a prefix matching a root, we replace that word in the words list with the root.
What’s the last step?
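(The last step is to join the modified words back into a single string.) A sketch putting the pieces together:

def replace_words(roots, sentence):
    root_set = set(roots)
    words = sentence.split()
    for i, word in enumerate(words):
        # check prefixes from shortest to longest so the shortest matching root wins
        for j in range(1, len(word) + 1):
            if word[:j] in root_set:
                words[i] = word[:j]
                break
    return " ".join(words)

print(replace_words(["cat", "bat", "rat"], "the cattle was rattled by the battery"))
# "the cat was rat by the bat"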
This question sounds like it should be a SQL question, doesn't it? Weekly aggregation implies a form of GROUP BY, as in a regular SQL or pandas question, and in either case, aggregating a dataset of this form by week would be pretty trivial.
But since this is a scripting question, it is trying to find out whether the candidate can deal with unstructured data, since data scientists and engineers often handle a lot of it.
In this function, we have to do a few things; a short sketch follows the steps below.
Loop through all of the datetimes.
Set a beginning timestamp as our reference point.
Check if the next time in the array is more than seven days ahead.
a. If it is more than seven days, set the new timestamp as the reference point.
b. If not, continue to loop through and append the last value.
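A minimal sketch of those steps, assuming the input is a sorted list of date strings in 'YYYY-MM-DD' format (the real question may use a different timestamp format):

from datetime import datetime, timedelta

def weekly_aggregation(dates):
    if not dates:
        return []
    fmt = "%Y-%m-%d"
    start = datetime.strptime(dates[0], fmt)   # reference point for the current week
    weeks = [[dates[0]]]
    for d in dates[1:]:
        current = datetime.strptime(d, fmt)
        if current - start >= timedelta(days=7):
            start = current                    # more than seven days out: new reference point
            weeks.append([d])
        else:
            weeks[-1].append(d)                # still within the current week
    return weeks

print(weekly_aggregation(["2024-01-01", "2024-01-02", "2024-01-10"]))
# [['2024-01-01', '2024-01-02'], ['2024-01-10']]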
You can use a sorting algorithm to solve this problem; examples include bubble sort and quicksort. Here is the solution code for bubble sort:
def sorting(array):
    sorted_list = array.copy()
    for i in range(len(sorted_list)):
        for j in range(len(sorted_list) - i - 1):
            if sorted_list[j] > sorted_list[j + 1]:
                sorted_list[j], sorted_list[j + 1] = sorted_list[j + 1], sorted_list[j]
    return sorted_list
Note: Do not use the built-in sorting functions (such as sorted() or list.sort()) that perform this for you.
Input:
int_list = [8, 16, 24]
Output:
def gcd(int_list) -> 8
Hint: The GCD (greatest common divisor) of three or more numbers equals the product of the prime factors common to all the numbers. It can also be calculated by repeatedly taking the GCDs of pairs of numbers.
The greatest common divisor is also associative: the GCD of multiple numbers, say a, b, c, is equivalent to gcd(gcd(a, b), c). Intuitively, this is because if a number divides gcd(a, b) and c, it must divide a and b as well by the definition of the greatest common divisor.
Thus, the greatest common divisor of multiple numbers can be obtained by iteratively computing the GCD of the first two numbers, then the GCD of that result with the next number, and so on.
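A sketch using Euclid's algorithm for pairs and folding it across the list:

from functools import reduce

def gcd_pair(a, b):
    # Euclid's algorithm: repeatedly replace (a, b) with (b, a % b)
    while b:
        a, b = b, a % b
    return a

def gcd(int_list):
    return reduce(gcd_pair, int_list)  # gcd(gcd(a, b), c), and so on

print(gcd([8, 16, 24]))  # 8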
Priority queues are an abstract data structure that allows enqueuing items with an attached priority. While typically implemented with a heap, implement a priority queue using a linked list.
The Priority Queue implementation should support the following operations:
insert(element, priority): This operation should be able to insert an element into the priority queue, along with its corresponding priority.
delete(): This operation should remove and return the element with the highest priority. If multiple elements share the same highest priority, the element enqueued first should be returned. If the queue is empty, return None.
peek(): This operation should return the element with the highest priority without removing it from the priority queue. Again, the element enqueued first should be returned in the case of equal highest priorities, and if the queue is empty, return None.
Hint: Start by creating a Node class to represent each element in the priority queue, as in the sketch below.
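A minimal sketch of a linked-list implementation, assuming a larger priority value means higher priority (the question may define priority ordering differently):

class Node:
    def __init__(self, element, priority):
        self.element = element
        self.priority = priority
        self.next = None

class PriorityQueue:
    def __init__(self):
        self.head = None  # nodes kept sorted by priority; ties keep insertion order

    def insert(self, element, priority):
        node = Node(element, priority)
        if self.head is None or priority > self.head.priority:
            node.next = self.head
            self.head = node
            return
        current = self.head
        # walk past every node with priority >= the new one so equal priorities stay FIFO
        while current.next is not None and current.next.priority >= priority:
            current = current.next
        node.next = current.next
        current.next = node

    def delete(self):
        if self.head is None:
            return None
        element = self.head.element
        self.head = self.head.next
        return element

    def peek(self):
        return self.head.element if self.head is not None else None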
What is the difference between deepcopy() and copy() in Python?
copy() creates a shallow copy of an object, meaning nested objects within the original are still referenced, whereas deepcopy() creates a completely new object and recursively copies all nested objects. Use deepcopy() when you need to avoid modifying nested objects in the original while working with a copy. The copy() method is faster but not suitable for complex objects with nested references.
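A quick illustration of the difference:

import copy

original = [[1, 2], [3, 4]]
shallow = copy.copy(original)
deep = copy.deepcopy(original)

original[0].append(99)
print(shallow[0])  # [1, 2, 99]: the shallow copy shares the nested lists
print(deep[0])     # [1, 2]: the deep copy has its own nested lists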
Python uses automatic memory management via its garbage collector, which reclaims memory from objects no longer in use. You can optimize memory usage by using generators instead of lists to handle large data sets, or by manually deleting objects with the del keyword. Profiling tools like memory_profiler help identify memory consumption in Python applications.
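For example, a generator can stream a large file line by line instead of holding it all in memory (the file name is only illustrative):

def read_large_file(path):
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")  # one line at a time, never the whole file

# total_chars = sum(len(line) for line in read_large_file("events.log"))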
These questions will help you practice for the coding exercise portion of the interview. Typically, you'll be given some information, like a data set, and asked to write Python code to solve the problem. These types of questions can test anything from beginner Python skills up to advanced sequences and functions.
Input:
integers = [2,3,5]
N = 8
Output:
def sum_to_n(integers, N) ->
[
[2,2,2,2],
[2,3,3],
[3,5]
]
Hint: You may notice in solving this problem that it breaks down into identical subproblems. For example, if given integers = [2,3,5] and target = 8 as in the prompt, we might recognize that if we first solve for the input: integers = [2, 3, 5] and target = 8 - 2 = 6, we can just add 2 to the output to obtain our final answer. This is a key idea in using recursion to solve this problem.
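A recursive backtracking sketch of that idea:

def sum_to_n(integers, target):
    results = []

    def helper(remaining, start, partial):
        if remaining == 0:
            results.append(list(partial))
            return
        for i in range(start, len(integers)):
            value = integers[i]
            if value <= remaining:
                partial.append(value)
                helper(remaining - value, i, partial)  # reuse the same value by keeping start = i
                partial.pop()

    helper(target, 0, [])
    return results

print(sum_to_n([2, 3, 5], 8))  # [[2, 2, 2, 2], [2, 3, 3], [3, 5]]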
Hint: First, we need to calculate where to truncate our distribution. We want a sample where all values are below the percentile threshold.
Say we have a point z and want to calculate the percentage of our normal distribution that resides on or below z. To do this, we would simply plug z into our distribution CDF.
Input:
m = 2
sd = 1
n = 6
percentile_threshold = 0.75
Output:
def truncated_dist(m,sd,n, percentile_threshold): ->
[2, 1.1, 2.2, 3, 1.5, 1.3]
# All values in the output sample are in the distribution's lower 75% = percentile_threshold.
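A hedged sketch using rejection sampling with SciPy's normal distribution helpers:

import numpy as np
from scipy.stats import norm

def truncated_dist(m, sd, n, percentile_threshold):
    # value below which percentile_threshold of the distribution lies
    cutoff = norm.ppf(percentile_threshold, loc=m, scale=sd)
    samples = []
    while len(samples) < n:
        value = np.random.normal(m, sd)
        if value <= cutoff:          # keep only draws in the lower tail
            samples.append(value)
    return samples

print(truncated_dist(2, 1, 6, 0.75))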
Write a function plan_trip to reconstruct the trip path so the trip tickets are in order. More context: you are calculating the trip from one city to another with many layovers. Given the list of flights out of order, each with a starting and ending city, you must reconstruct the flight path.
Here is sample input data:
flights = [
['Chennai', 'Bangalore'],
['Bombay', 'Delhi'],
['Goa', 'Chennai'],
['Delhi', 'Goa'],
['Bangalore', 'Beijing']
]
In problems of this nature, clarifying your assumptions with the interviewer is good. We can start by stating our assumptions (in an interview, you would want to do this aloud).
The first thing we need to do is figure out where the start and end cities are. We can do that by building our graph and traversing through each (start city: end city) combination. There are a few ways to do this, but the simplest is to iterate through the list of tickets and sort the departure and arrival cities into sets. While we are doing this, we can also build up our directed graph as a dictionary where the departure city is the key and the arrival city is the value. We can then take the set difference between the departure cities set and the arrival cities set, yielding a set containing only the first start city.
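A sketch of that idea, assuming each city appears at most once as a departure (as in the sample data); the exact return format in the real question may differ:

def plan_trip(flights):
    graph = {start: end for start, end in flights}   # departure city -> arrival city
    departures = set(graph)
    arrivals = set(graph.values())
    current = (departures - arrivals).pop()           # the start city never appears as an arrival
    path = [current]
    while current in graph:
        current = graph[current]
        path.append(current)
    return path

flights = [
    ['Chennai', 'Bangalore'],
    ['Bombay', 'Delhi'],
    ['Goa', 'Chennai'],
    ['Delhi', 'Goa'],
    ['Bangalore', 'Beijing'],
]
print(plan_trip(flights))
# ['Bombay', 'Delhi', 'Goa', 'Chennai', 'Bangalore', 'Beijing']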
Hint: A function that is O(1) in space means its memory usage does not grow with the input data size. In this problem, that means the function must loop through the stream, holding only the current pick and the next entry at a time, and choose between the two of them with a random method.
The inputs to that choice are the current entry in the stream, the subsequent entry in the stream, and the count (i.e., the total number of entries cycled through so far). What happens if the count is at 1?
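One common way to realize this idea is reservoir sampling with a reservoir of size one (the function name is illustrative):

import random

def pick_random_from_stream(stream):
    chosen = None
    for count, entry in enumerate(stream, start=1):
        # replace the current pick with probability 1/count;
        # when count is 1, the first entry is always chosen
        if random.randint(1, count) == 1:
            chosen = entry
    return chosen

print(pick_random_from_stream(iter(range(100))))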
Hint. The median is the middle value in an ordered integer list. If the list size is even, there is no middle value, so the median is the mean of the two middle values.
Example:
new_value = 2
stream = [1, 2, 3, 4, 5, 6]
def data_stream_median(new_value, stream): -> 3
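A sketch that inserts the new value into a sorted copy of the stream and then takes the middle:

import bisect

def data_stream_median(new_value, stream):
    ordered = sorted(stream)
    bisect.insort(ordered, new_value)   # keep the values in sorted order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(data_stream_median(2, [1, 2, 3, 4, 5, 6]))  # 3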
A decorator is a function that takes another function as input and extends or alters its behavior without changing the original function. In data engineering, decorators can be used to add logging, error handling, or performance monitoring to data processing functions. For example, a decorator could automatically log execution times for ETL processes.
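For instance, here is a hedged sketch of a decorator that logs the runtime of an ETL step (the step itself is a placeholder):

import functools
import time

def log_runtime(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        print(f"{func.__name__} finished in {time.time() - start:.2f}s")
        return result
    return wrapper

@log_runtime
def load_rows(rows):
    return len(list(rows))   # stand-in for a real load step

load_rows(range(1_000_000))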
A closure is a function that captures and remembers its environment, even when called outside of that environment. In Python, closures are created when a nested function references variables from its outer function. For example:
def outer(x):
    def inner(y):
        return x + y
    return inner

add_5 = outer(5)
print(add_5(10))  # Outputs 15
Python uses reference counting and garbage collection for memory management, but handling large datasets requires special attention. To optimize memory, you can use libraries like NumPy for efficient storage or Pandas for in-place modifications, avoiding unnecessary copies. Additionally, using generators or Dask can help handle large datasets by processing them in chunks without loading everything into memory.
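For example, pandas can process a large CSV in chunks rather than loading it all at once (the file and column names here are made up):

import pandas as pd

total = 0
for chunk in pd.read_csv("large_events.csv", chunksize=100_000):
    total += chunk["amount"].sum()   # aggregate chunk by chunk
print(total)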
To design a fault-tolerant and scalable ETL pipeline, you would use frameworks like Airflow for orchestration, Dask for parallel processing, and Pandas for data transformation. You would implement error handling, retries, and logging to capture and retry failures, while breaking tasks into smaller jobs to improve scalability. Also, using cloud services like AWS Lambda or Google Cloud Functions can offer auto-scaling and reliability.
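As a hedged sketch, a minimal Airflow DAG with automatic retries might look like this (the task bodies are placeholders, and parameter names can vary slightly across Airflow versions):

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling raw data")          # placeholder extract step

def transform():
    print("cleaning and reshaping")    # placeholder transform step

def load():
    print("writing to the warehouse")  # placeholder load step

default_args = {
    "retries": 3,                        # retry failed tasks automatically
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="daily_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    default_args=default_args,
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task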
You can create a distributed data pipeline by using Kafka for ingesting real-time data streams and Spark for processing them. In Python, you would use confluent-kafka to consume data from Kafka topics and PySpark to process and transform that data in parallel. The processed data could then be stored in a distributed data store like HDFS or a NoSQL database such as Cassandra for further analysis.
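A hedged sketch of the Spark side using Structured Streaming (the topic name and servers are illustrative, and it requires the Spark-Kafka connector package to be available):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("clickstream").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "clickstream")
    .load()
)

# Kafka delivers values as bytes; cast to string before further transformation
parsed = events.select(col("value").cast("string").alias("raw_event"))

query = parsed.writeStream.format("console").outputMode("append").start()
query.awaitTermination()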
Data partitioning in Spark involves dividing a dataset into smaller, manageable chunks, which are processed in parallel across multiple nodes. Optimizing partitioning requires ensuring that data is evenly distributed to prevent data skew and that each partition holds an appropriate amount of data based on the cluster's resources. You can adjust the number of partitions using repartition() or coalesce() and apply partitioning strategies based on the data's characteristics (e.g., range or hash partitioning).
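A small PySpark sketch of adjusting partition counts (the numbers are arbitrary):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioning").getOrCreate()
df = spark.range(1_000_000)                  # toy DataFrame with a single id column

evenly_spread = df.repartition(200, "id")    # hash-partition on a key to spread rows evenly
compacted = evenly_spread.coalesce(50)       # shrink the partition count without a full shuffle

print(evenly_spread.rdd.getNumPartitions())  # 200
print(compacted.rdd.getNumPartitions())      # 50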
For real-time data processing, serverless technologies like AWS Lambda or Google Cloud Functions can be used to handle incoming data as it arrives. Data can be pushed to services like S3, DynamoDB, or Pub/Sub, which then trigger Lambda functions to process or transform the data. These functions scale automatically based on incoming data, making them cost-effective and efficient for real-time streaming analytics.
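For example, here is a hedged sketch of an AWS Lambda handler triggered by new S3 objects (the bucket and the processing logic are placeholders):

import json

def lambda_handler(event, context):
    # each record describes an object that just landed in S3
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        print(f"processing s3://{bucket}/{key}")  # real code would transform or load the object
    return {"statusCode": 200, "body": json.dumps("ok")}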
A star schema is a type of database schema used in data warehouses that organizes data into fact tables (storing quantitative data) and dimension tables (storing descriptive data). The schema resembles a star, with the fact table at the center and dimension tables surrounding it. Star schemas simplify querying and reporting, as they optimize for fast aggregations and easy-to-understand relationships between the fact and dimension tables.
Ensuring regulatory compliance involves implementing strong data governance practices, including secure data storage, access control, and documentation of data processes. Automated audit logs can track user actions and data transformations, ensuring that every step in the pipeline is traceable. Tools like Apache Ranger for access control, encryption standards, and compliance frameworks like GDPR or HIPAA must be integrated into the pipeline to maintain compliance.
Ensuring data pipeline reliability involves implementing monitoring tools (e.g., Prometheus, Grafana, or AWS CloudWatch) to track performance, alerting on failures, and setting up logging for traceability. You can automate retries, use checkpointing, and version control pipeline code to maintain consistency and ensure reliability. Testing is crucial, so you would integrate unit tests and integration tests into your CI/CD pipeline, ensuring that each component performs as expected before deployment.
For more in-depth interview preparation, check out the Python Learning Path and the Data Engineering Learning Path.
You can also see our list of data engineer interview questions, which includes additional Python questions, as well as SQL, case study questions, and more. You’ll also find more examples in our list of Python data science interview questions.
If you’re interested in exploring further, here are some additional resources for interview preparation:
Looking to hire top data engineers proficient in Python? OutSearch.ai leverages AI to simplify your recruitment, ensuring you find well-rounded candidates efficiently. Consider checking out their website today.