In data engineer interviews, Python is the second most frequent programming language you will find, behind only SQL. In fact, it is listed as a required skill for nearly 75% of data engineer jobs.
Python is widely used in data science, machine learning, and AI. Therefore, if you are preparing for a data engineer interview, you should have a strong grasp of its fundamentals and practical uses, including Python definitions, Python theory, and writing Python functions.
Python data engineering interview questions can be broken down into three main categories: definition- and theory-based questions, coding questions that ask you to write Python functions, and hands-on coding exercises.
This article provides an overview of Python interview questions for data engineers. For an in-depth guide on how to answer data engineering interview questions, check out the data engineering learning path.
Data engineer interview questions typically cover a wide range of Python coding concepts. The most frequently asked topics include distribution-based questions, data munging with pandas, and data manipulation.
The majority of questions that are asked in an interview will be beginner to intermediate Python coding exercises. These assessments require you to write code efficiently, and candidates are graded on their coding skills and the time required to solve the problem.
Generally, the best practice for Python interviews is to work through as many data engineer questions as possible beforehand and focus on a wide range of Python topics in your preparation.
Here are some quick tips to help you prepare for a data engineering Python interview:
Python is only one category of data engineer interview questions (our team has compiled 100 questions here). To ace a data engineer interview, you should also practice SQL, algorithms, product metrics, and machine learning questions.
Easy Python questions asked of data engineers are commonly theory- or definition-based. These questions most frequently relate to data structures, basic Python functions, and basic scenarios.
This is a foundational question, and it quickly assesses your familiarity with data processing. Be sure to include NumPy and Pandas and list the advantages of both.
NumPy is the best solution for working with arrays of numerical data, while pandas is the more efficient choice for processing tabular data for statistics and machine learning.
Hint: Be prepared for situational questions as well. The interviewer might give you a situation and ask which Python library you might use to process the data.
Data smoothing is an approach used to eliminate outliers from data sets. This technique helps to reduce noise and make patterns more recognizable. Smoothing out the "rough edges" helps to improve machine learning models as well.
Algorithms are used in Python to reduce noise and smooth data sets. A sample of data smoothing algorithms includes the Savitzky-Golay filter and the Triangular Moving Average.
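As an illustration, here is a minimal sketch of both techniques, assuming NumPy and SciPy are available; the sample signal, window lengths, and polynomial order are arbitrary choices.

import numpy as np
from scipy.signal import savgol_filter

# Noisy sine wave as illustrative sample data
x = np.linspace(0, 2 * np.pi, 100)
noisy = np.sin(x) + np.random.normal(scale=0.2, size=x.size)

# Savitzky-Golay filter: fits a low-order polynomial over a sliding window
smoothed = savgol_filter(noisy, window_length=11, polyorder=2)

# Triangular moving average: two passes of a simple moving average
simple = np.convolve(noisy, np.ones(5) / 5, mode="same")
triangular = np.convolve(simple, np.ones(5) / 5, mode="same")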
There are a lot of similarities between these two languages: both are object-oriented, and both have large ecosystems of libraries that extend what they can do. In data science, however, Python has the edge, due in part to its simplicity and user-friendliness. Java, on the other hand, is the better language for building applications.
NumPy is an open-source library that is used to analyze data and includes support for Python’s multi-dimensional arrays and matrices. NumPy is used for a variety of mathematical and statistical operations.
Python lists are a basic building block of the language and a useful general-purpose data container. However, vectorized operations such as element-wise multiplication are not possible with Python lists, whereas they are with NumPy arrays. Because lists can hold objects of different types, Python must also store type information for every element and execute type-dispatching code each time an operation is performed on an element.
In addition, each operation involves type checks and Python API bookkeeping, so very few operations can be delegated to fast C loops.
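For example, here is a quick sketch of the difference, assuming NumPy is installed:

import numpy as np

a_list = [1, 2, 3]
a_arr = np.array([1, 2, 3])

# Element-wise multiplication works on NumPy arrays...
print(a_arr * 2)        # [2 4 6]

# ...but the same operator on a list repeats it instead
print(a_list * 2)       # [1, 2, 3, 1, 2, 3]

# With lists, element-wise math requires an explicit Python-level loop
print([x * 2 for x in a_list])  # [2, 4, 6]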
The built-in data types in Python include lists, tuples, dictionaries, and sets. These data types are already defined and supported by Python, and they act as containers for grouping data by type.
User-defined data types are built on top of these primitives and share many of their characteristics. Ultimately, they allow users to create their own data structures, such as queues, trees, and linked lists.
Hint: With questions like these, be prepared to talk about the advantages of a particular data structure and when it might be best for a project.
The is operator in Python checks whether two variables point to the same object, while == checks if the values of two variables are the same.
We can apply this to sample data. Consider the following:
a = [2,4,6]
b = [2,4,6]
c = b
Here is how this data would evaluate under the is and == operators:
a == b
This evaluates to True, since the values in a and b are the same.
a is b
This evaluates to False, since a and b are different objects in memory.
One technique would be to convert a list into a set because sets do not contain duplicate data. Then you would convert the set back into a list.
Here is an example with data:
list1 = [3,6,7,9,2,3,7,1]
list2 = list(set(list1))
The resulting list2 would contain [3,6,7,9,2,1]. However, it is also important to remember that sets may not maintain the order of the list.
In order to rename columns, you can use the rename() function, which can rename any column in a dataframe. For example, if in the customers table you wanted to rename the column "user_id_number" to "user_id" and "customer_phone" to "phone," you would write:
customers.rename(columns=dict(user_id_number="user_id", customer_phone="phone"))
With lists in Python, lookup time is linear and depends on the number of values in the list: a lookup is O(n). With dictionaries, lookup time is constant because dictionaries are implemented as hash tables: a lookup is O(1) on average.
Because of this, dictionary lookups are generally faster in Python.
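A quick sketch of the difference using timeit (the sizes and values here are arbitrary):

import timeit

n = 100_000
as_list = list(range(n))
as_dict = {x: True for x in as_list}

# Membership test scans the list: O(n)
print(timeit.timeit(lambda: n - 1 in as_list, number=100))

# Membership test hashes the key: O(1) on average
print(timeit.timeit(lambda: n - 1 in as_dict, number=100))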
You have an array of integers, nums, of length n spanning 0 to n with one value missing. Write a function missing_number that returns the missing number in the array.
Note: O(n) complexity is required. There are two ways to solve this problem while holding to O(n) complexity: a mathematical formulation or logical iteration.
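A sketch of the mathematical formulation, using the fact that the integers 0 through n sum to n(n + 1)/2:

def missing_number(nums):
    # The full range 0..n should sum to n * (n + 1) / 2;
    # the gap between that and the actual sum is the missing value.
    n = len(nums)
    expected = n * (n + 1) // 2
    return expected - sum(nums)

print(missing_number([0, 1, 3, 4]))  # 2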
Medium Python coding questions ask you to write Python functions to perform various operations. Typically, these questions will test concepts like string manipulation, data munging, statistical analysis, or ETL process builds. Some medium Python coding questions include:
Hint: This problem is pretty straightforward. We can loop through each user and their tip, sum the tips for each user, and then find the user with the highest total by sorting on the summed tip values.
Additionally, we can use Python's collections package, which lets us sort our tallies by calling most_common(). This function sorts by value and returns a sorted list of (key, value) pairs, so all we have to do is grab the first entry.
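A sketch of this approach using collections.Counter; the function name and the parallel-list input format are assumptions, since the prompt's exact input is not shown here:

from collections import Counter

def most_tips(user_ids, tips):
    # Sum tips per user; most_common(1) returns the top (user, total) pair
    totals = Counter()
    for user, tip in zip(user_ids, tips):
        totals[user] += tip
    return totals.most_common(1)[0][0]

print(most_tips([1, 2, 1, 3], [10.0, 5.0, 2.5, 7.0]))  # 1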
Note: Treat upper and lower case letters as distinct characters. You may assume the input string includes no spaces.
Example:
input = "interviewquery"
output = "i"
input = "interv"
output = "None"
Hint: We know we have to store a unique set of characters of the input string and loop through the string to check which ones occur twice.
Given that we have to return the first character that repeats, we should be able to go through the string in one loop, saving each unique character as we see it and checking whether the current character already exists in that saved set. If it does, return it.
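A sketch of this approach; the function name first_recurring_char is an assumption:

def first_recurring_char(s):
    # Track characters we've already seen; the first repeat wins
    seen = set()
    for ch in s:
        if ch in seen:
            return ch
        seen.add(ch)
    return None

print(first_recurring_char("interviewquery"))  # "i"
print(first_recurring_char("interv"))          # None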
Hint: To parse bigrams out of a string, we first need to split the input string. We would use the Python function .split() to create a list with each individual word as an element, and create another empty list that will eventually be filled with tuples.
Then, once we have identified each individual word, we need to loop k-1 times (if k is the number of words in the sentence), appending the current word and the subsequent word as a tuple. Each tuple gets added to the list that we eventually return.
Remember to use the Python function .lower() to turn all the words into lowercase.
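A sketch along these lines; the function name find_bigrams and the example sentence are assumptions:

def find_bigrams(sentence):
    # Split the sentence into lowercase words, then pair each word
    # with the one that follows it
    words = sentence.lower().split()
    bigrams = []
    for i in range(len(words) - 1):
        bigrams.append((words[i], words[i + 1]))
    return bigrams

print(find_bigrams("Have free hours and love children"))
# [('have', 'free'), ('free', 'hours'), ('hours', 'and'), ('and', 'love'), ('love', 'children')]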
This solution uses the bisect module, which provides support for maintaining a list in sorted order without having to re-sort the list after each insertion. For long lists of items with expensive comparison operations, this can be an improvement over the more common approach.
Here is a sample of Python code for this problem:
import bisect

def index(a, x):
    # Locate the leftmost position where x could be inserted
    # while keeping the list a sorted
    i = bisect.bisect_left(a, x)
    return i

a = [1,2,4,5]
print(index(a, 6))  # 4
print(index(a, 3))  # 2
This Python problem tests your knowledge of the queue data structure, which is available through Python's built-in queue module. Here is sample code for this problem:
import queue

q = queue.Queue()

# Enqueue the values 0 through 3
for x in range(4):
    q.put(x)

print("Members of the queue:")
for n in list(q.queue):
    print(n, end=" ")

print("\nSize of the queue:")
print(q.qsize())
Note: If a word can be formed by more than one root, replace it with the root of the shortest length. Here is an example input and output for this problem:
Input:
roots = ["cat", "bat", "rat"]
sentence = "the cattle was rattled by the battery"
Output:
"the cat was a rat by the bat"
At first, it looks like we can simply loop through each word and check if the root exists in that word. If the root is present, we would then replace the word with the root. However, since we are technically stemming the words, we have to make sure the root matches the word at its prefix, rather than appearing anywhere within the word.
We are given a list of roots and a sentence string. Given that we have to check each word, we can first split the sentence string into a list of words:
words = sentence.split()
Next, we loop through each word in the words list, and for each one we check whether it has a prefix equal to one of the roots. To accomplish this, we loop through each possible prefix starting at the first letter. If we find a prefix matching a root, we replace that word in the words list with the root it contains.
What’s the last step?
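As a sketch of the full approach (the remaining step being to join the words back into a sentence), assuming the function is named replace_words:

def replace_words(roots, sentence):
    root_set = set(roots)
    words = sentence.split()
    for i, word in enumerate(words):
        # Check every prefix of the word, shortest first, so the
        # shortest matching root wins
        for j in range(1, len(word) + 1):
            if word[:j] in root_set:
                words[i] = word[:j]
                break
    # Last step: join the (possibly replaced) words back into a sentence
    return " ".join(words)

roots = ["cat", "bat", "rat"]
sentence = "the cattle was rattled by the battery"
print(replace_words(roots, sentence))  # "the cat was rat by the bat"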
This question sounds like it should be a SQL question, doesn't it? Weekly aggregation implies a form of GROUP BY in a regular SQL or pandas question. In either case, aggregating a dataset of this form by week would be pretty trivial.
But since this is a scripting question, it is trying to find out whether the candidate can deal with unstructured data, since data scientists often work with a lot of unstructured data.
In this function, we have to do a few things (see the sketch after this list):
1. Loop through all of the datetimes.
2. Set a beginning timestamp as our reference point.
3. Check if the next time in the array is more than seven days ahead.
   a. If it is more than seven days, set the new timestamp as the reference point.
   b. If not, continue to loop through and append the value to the current group.
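A minimal sketch of these steps, assuming ts is a sorted list of datetime objects and the desired output is a list of weekly buckets (both assumptions, since the prompt's exact input and output formats are not shown here):

from datetime import datetime, timedelta

def weekly_aggregation(ts):
    # Group sorted datetimes into buckets that each span at most
    # seven days from the bucket's first (reference) entry
    if not ts:
        return []
    buckets = [[ts[0]]]
    reference = ts[0]
    for current in ts[1:]:
        if current - reference >= timedelta(days=7):
            # More than a week past the reference: start a new bucket
            reference = current
            buckets.append([current])
        else:
            buckets[-1].append(current)
    return buckets

ts = [datetime(2024, 1, d) for d in (1, 2, 8, 9, 16)]
print(weekly_aggregation(ts))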
You can use a sorting algorithm to solve this problem; examples include bubble sort and quicksort. Here is the solution code for bubble sort:
def sorting(array):
    # Work on a copy so the input list is not modified
    sorted_list = array.copy()
    for i in range(len(sorted_list)):
        for j in range(len(sorted_list) - i - 1):
            # Swap adjacent elements that are out of order
            if sorted_list[j] > sorted_list[j + 1]:
                sorted_list[j], sorted_list[j + 1] = sorted_list[j + 1], sorted_list[j]
    return sorted_list
Note: Do not use a built-in library function that performs the sort for you.
Input:
int_list = [8, 16, 24]
Output:
def gcd(int_list) -> 8
Hint: The GCD (greatest common divisor) of three or more numbers equals the product of the prime factors common to all of the numbers. It can also be calculated by repeatedly taking the GCDs of pairs of numbers.
The greatest common divisor is also associative: the GCD of multiple numbers, say a, b, c, is equivalent to gcd(gcd(a,b), c). Intuitively, this is because if a number divides gcd(a,b) and c, it must divide a and b as well, by the definition of the greatest common divisor.
Thus the greatest common divisor of multiple numbers can be obtained by iteratively computing the GCD of a and b, then the GCD of that result with the next number, and so on.
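A sketch of this iterative pairwise approach using Euclid's algorithm (the helper gcd_pair is an assumption):

def gcd(int_list):
    def gcd_pair(a, b):
        # Euclid's algorithm for two numbers
        while b:
            a, b = b, a % b
        return a

    result = int_list[0]
    for value in int_list[1:]:
        # gcd(a, b, c) == gcd(gcd(a, b), c), so fold the list pairwise
        result = gcd_pair(result, value)
    return result

print(gcd([8, 16, 24]))  # 8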
A priority queue is an abstract data structure that allows items to be enqueued with an attached priority. While it is typically implemented with a heap, implement a priority queue using a linked list.
The Priority Queue implementation should support the following operations:
insert(element, priority): This operation should insert an element into the Priority Queue along with its corresponding priority.
delete(): This operation should remove and return the element with the highest priority. If multiple elements share the same highest priority, the element enqueued first should be returned. If the queue is empty, return None.
peek(): This operation should return the element with the highest priority without removing it from the Priority Queue. Again, in the case of equal highest priorities, the element enqueued first should be returned. If the queue is empty, return None.
Hint: Start by creating a Node class to represent each element in the Priority Queue.
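Below is a minimal sketch of one possible linked-list implementation. It assumes a larger numeric value means higher priority, which the prompt does not specify:

class Node:
    def __init__(self, element, priority):
        self.element = element
        self.priority = priority
        self.next = None

class PriorityQueue:
    # Singly linked list kept sorted by priority (highest first);
    # ties keep insertion (FIFO) order.
    def __init__(self):
        self.head = None

    def insert(self, element, priority):
        node = Node(element, priority)
        if self.head is None or self.head.priority < priority:
            node.next = self.head
            self.head = node
            return
        # Walk past every node with priority >= the new one so that
        # equal priorities stay in enqueue order
        current = self.head
        while current.next and current.next.priority >= priority:
            current = current.next
        node.next = current.next
        current.next = node

    def delete(self):
        if self.head is None:
            return None
        node = self.head
        self.head = node.next
        return node.element

    def peek(self):
        return self.head.element if self.head else None

pq = PriorityQueue()
pq.insert("a", 1)
pq.insert("b", 3)
pq.insert("c", 3)
print(pq.peek())    # "b" (first enqueued among the highest priority)
print(pq.delete())  # "b"
print(pq.delete())  # "c"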
These questions will help you practice for the coding exercise portion of the interview. Typically, you'll be given some information (like a data set) and asked to write Python code to solve the problem. These types of questions can test beginner Python skills all the way up to advanced sequences and functions in Python.
Input:
integers = [2,3,5]
N = 8
Output:
def sum_to_n(integers, N) ->
[
[2,2,2,2],
[2,3,3],
[3,5]
]
Hint: You may notice in solving this problem that it breaks down into identical subproblems. For example, given integers = [2, 3, 5] and target = 8 as in the prompt, we might recognize that if we first solve for integers = [2, 3, 5] and target = 8 - 2 = 6, we can simply prepend 2 to each combination in that output to help build our final answer. This is the key idea behind using recursion to solve this problem.
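A recursive sketch along these lines (the inner helper name is an assumption):

def sum_to_n(integers, N):
    # Recursively build combinations; at each step only consider values
    # from the current index onward so each combination stays sorted
    results = []

    def helper(remaining, start, current):
        if remaining == 0:
            results.append(list(current))
            return
        for i in range(start, len(integers)):
            value = integers[i]
            if value <= remaining:
                current.append(value)
                helper(remaining - value, i, current)
                current.pop()

    helper(N, 0, [])
    return results

print(sum_to_n([2, 3, 5], 8))  # [[2, 2, 2, 2], [2, 3, 3], [3, 5]]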
Hint: First, we need to calculate where to truncate our distribution. We want a sample where all values are below the percentile_threshold. Say we have a point z and want to calculate the percentage of our normal distribution that resides on or below z. In order to do this, we would simply plug z into the CDF of our distribution.
Input:
m = 2
sd = 1
n = 6
percentile_threshold = 0.75
Output:
def truncated_dist(m,sd,n, percentile_threshold): ->
[2, 1.1, 2.2, 3, 1.5, 1.3]
# All values in the output sample are in the lower 75% = percentile_threshold of the distribution.
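One possible sketch using inverse transform sampling, assuming SciPy and NumPy are available (scipy.stats.norm.ppf is the inverse CDF):

import numpy as np
from scipy.stats import norm

def truncated_dist(m, sd, n, percentile_threshold):
    # Sample uniform values in CDF space up to the threshold, then map
    # them back through the inverse CDF so every draw falls below the
    # percentile_threshold cutoff
    u = np.random.uniform(0, percentile_threshold, size=n)
    return norm.ppf(u, loc=m, scale=sd)

print(truncated_dist(2, 1, 6, 0.75))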
Write a function plan_trip to reconstruct the path of the trip so the trip tickets are in order.
More context: You are calculating the trip from one city to another with many layovers. Given the list of flights out of order, each with a starting city and end city, you must reconstruct the flight path.
Here is sample input data:
flights = [
['Chennai', 'Bangalore'],
['Bombay', 'Delhi'],
['Goa', 'Chennai'],
['Delhi', 'Goa'],
['Bangalore', 'Beijing']
]
In problems of this nature, it is good to clarify your assumptions with the interviewer. We can start out by stating our assumptions (in an interview, you would want to do this out loud).
The first thing we need to do is figure out where the start and end cities are. We can do that by building our graph and traversing through each (start city: end city) combination. There are a few ways to do this, but the simplest is to iterate through the list of tickets and sort the departure and arrival cities into sets. While we are doing this, we can also build up our directed graph as a dictionary where the departure city is the key and the arrival city is the value. We can then take the set difference between the departure cities set and the arrival cities set, yielding a set containing only the first start city.
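A sketch of the approach described above, assuming each ticket is a [start_city, end_city] pair as in the sample input:

def plan_trip(flights):
    # Build a directed graph and track departure/arrival cities
    graph = {start: end for start, end in flights}
    departures = set(graph.keys())
    arrivals = set(graph.values())

    # The overall starting city departs but never arrives
    current = (departures - arrivals).pop()

    # Walk the graph to rebuild the ordered itinerary
    ordered = []
    while current in graph:
        ordered.append([current, graph[current]])
        current = graph[current]
    return ordered

flights = [
    ['Chennai', 'Bangalore'],
    ['Bombay', 'Delhi'],
    ['Goa', 'Chennai'],
    ['Delhi', 'Goa'],
    ['Bangalore', 'Beijing']
]
print(plan_trip(flights))
# [['Bombay', 'Delhi'], ['Delhi', 'Goa'], ['Goa', 'Chennai'],
#  ['Chennai', 'Bangalore'], ['Bangalore', 'Beijing']]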
Hint: A function that is O(1) in space means its memory use does not grow with the size of the input data. That means, in this problem, the function must loop through the stream, taking two entries at a time and choosing between them with a random method.
The input data for the function should be the current entry in the stream, the subsequent entry in the stream, and the count (i.e. total number of entries cycled through thus far). What happens if the count is at 1?
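A minimal sketch of this idea using reservoir sampling with a reservoir of size one; the function name random_from_stream is an assumption:

import random

def random_from_stream(stream):
    # Keep only the current choice and a count, so memory stays O(1)
    # no matter how long the stream is
    choice = None
    for count, entry in enumerate(stream, start=1):
        # When count is 1 the first entry is always kept; afterwards each
        # new entry replaces the choice with probability 1 / count
        if random.random() < 1 / count:
            choice = entry
    return choice

print(random_from_stream(range(10)))  # any of 0..9, each with probability 1/10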
Hint: The median is the middle value in an ordered integer list. If the size of the list is even, there is no single middle value, so the median is the mean of the two middle values.
Example:
new_value = 2
stream = [1, 2, 3, 4, 5, 6]
def data_stream_median(new_value, stream): -> 3
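One straightforward sketch matching the example signature, assuming the incoming stream fits in memory (a heap-based approach would be needed for a true streaming solution):

import bisect

def data_stream_median(new_value, stream):
    # Keep the stream sorted, insert the new value in place, then take
    # the middle element (or the mean of the two middle elements)
    ordered = sorted(stream)
    bisect.insort(ordered, new_value)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(data_stream_median(2, [1, 2, 3, 4, 5, 6]))  # 3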
This course is designed to help you learn everything you need to know about working with data, from basic concepts to more advanced techniques.
For more in-depth interview preparation, check out the Python Learning Path and the Data Engineering Learning Path.
You can also see our list of data engineer interview questions, which includes additional Python questions, as well as SQL, case study questions, and more. You’ll also find more examples in our list of Python data science interview questions.
If you’re interested in exploring further, here are some additional resources for interview preparation:
Looking to hire top data engineers proficient in Python? OutSearch.ai leverages AI to simplify your recruitment, ensuring you find well-rounded candidates efficiently. Consider checking out their website today.