Python interview questions feature prominently in data science technical interviews. You will likely be asked questions covering key Python coding concepts during a typical interview. Start your practice with these newly updated Python data science interview questions covering statistics, probability, string parsing, NumPy/matrices, and Pandas.
Python data science interview questions range from asking what the difference between a list and a tuple is, to asking you to find all bigrams in a sentence, to asking you to implement the K-means algorithm from scratch.
The most commonly covered topics in Python data science interview questions include basic Python syntax, string parsing, statistics and probability, Pandas, data manipulation, NumPy and matrices, data structures and algorithms, and machine learning.
Python has reigned as the dominant language in data science over the past few years, taking over former strongholds such as R, Julia, Spark, and Scala. That is thanks in large part to its wide breadth of data science libraries (modules) supported by a strong and growing data science community.
One of the main reasons Python is now the language of choice is that its libraries extend across the full stack of data science. While each data science language has its own specialties, such as R for data analysis and modeling within academia and Spark and Scala for big data ETLs and production, Python has produced an ecosystem of libraries that all fit nicely together.
At the end of the day, it’s much easier to program and perform full-stack data science without having to switch languages. This means running exploratory data analysis, creating graphs and visualizations, building the model, and implementing the deployment, all in one language.
Who Gets Asked Python Questions?
Data scientists, data engineers, machine learning engineers, and data analysts face Python questions in interviews. However, the difficulty of the question is dependent on the role.
Here’s how Python questions differ between data analysts and data scientists:
Data Analyst - Data analyst Python questions are easier and are typically scripting-focused. In general, most questions will be easy Pandas and Python questions.
Data Scientist/Data Engineer - More than two-thirds of data scientists use Python every day. Questions include basic concepts like conditions and branching, loops, functions, and object-oriented programming. Data engineer interview questions also focus heavily on common libraries like NumPy, SciPy and Pandas, as well as advanced concepts like regression, K-means clustering and classification.
Although there are plenty of advanced technical questions, be sure you can quickly and competently answer basic questions like “What data types are used in Python?” and “What is a Python dictionary?” You don’t want to get caught stumbling on an answer for a basic Python syntax question.
If possible, direct your responses back to work experiences or Python data science projects you have worked on.
In Python, data types are used to classify or categorize data, and every value has a data type. Python uses several built-in data types, including:
Numeric types - int, float and complex
Sequence types - str, list, tuple and range
Mapping type - dict
Set types - set and frozenset
Boolean type - bool
Binary types - bytes, bytearray and memoryview
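For example, you can confirm any value’s type with the built-in type() function:

print(type(42))         # <class 'int'>
print(type(3.14))       # <class 'float'>
print(type("hello"))    # <class 'str'>
print(type([1, 2]))     # <class 'list'>
print(type((1, 2)))     # <class 'tuple'>
print(type({"a": 1}))   # <class 'dict'>
print(type({1, 2}))     # <class 'set'>
print(type(True))       # <class 'bool'>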
A key reason Python is such a popular data science programming language is its extensive collection of data analysis libraries. These libraries include functions, tools and methods for managing and analyzing data, and they cover a wide range of data science tasks, including processing image and textual data, data mining and data visualization. The most widely used Python data analysis libraries include NumPy, SciPy, Pandas, Matplotlib, Seaborn and scikit-learn.
Negative indexes are used in Python to access lists, arrays and strings from the end, moving backwards towards the first value. For example, index -1 will return the last item in a list, while -2 will return the second to last. Here’s an example of a negative index in Python:
b = "Python Coding Fun"
print(b[-1])
>> n
Lists and tuples are classes in Python that store one or more objects or values. Key differences include:
Lists are mutable, while tuples are immutable and cannot be changed after creation.
Lists are declared with square brackets, while tuples use parentheses.
Because tuples are immutable (and therefore hashable), they can be used as dictionary keys, while lists cannot.
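A quick sketch of the mutability difference:

my_list = [1, 2, 3]
my_list[0] = 99          # fine: lists are mutable
print(my_list)           # [99, 2, 3]

my_tuple = (1, 2, 3)
try:
    my_tuple[0] = 99     # not allowed: tuples are immutable
except TypeError as e:
    print(e)             # 'tuple' object does not support item assignment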
Seaborn and Matplotlib are two of the most popular visualization libraries in Python. One thing to note is that Seaborn is built on top of Matplotlib and adds higher-level, built-in statistical plotting tools. Because of this, Seaborn can make the work faster, and you can switch to Matplotlib for fine-tuning.
NOTE: This question asks about preferences. The library you choose might be dependent on the task or how familiar you are with the tool. In other words, there is no right or wrong answer; rather, the interviewer wants to understand how proficient you are at creating visualizations in Python.
Yes and no. Python combines features of both object-oriented programming (OOP) and aspect-oriented programming. The reason it can’t be considered a true OOP language is that it doesn’t support strong encapsulation, the one core OOP feature Python lacks.
A Series supports only a single list with an index, whereas a DataFrame supports one or more Series. In other words:
A Series is a one-dimensional array that supports any data type (including integers, strings, floats, etc.). In a Series, the axis labels are the index.
A DataFrame is a two-dimensional data structure with columns that can each hold a different data type. It is similar to a SQL table or a dictionary of Series objects.
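For example:

import pandas as pd

# A Series: one-dimensional, with axis labels as the index
s = pd.Series([10, 20, 30], index=["a", "b", "c"])
print(s["b"])        # 20

# A DataFrame: two-dimensional, essentially a dictionary of Series
df = pd.DataFrame({"price": [10, 20, 30], "qty": [1, 2, 3]})
print(df["price"])   # the 'price' column is itself a Series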
You can check for duplicates using the Pandas duplicated() method. This returns a boolean Series which is True for duplicated rows (which occurrences are flagged depends on the keep parameter).
DataFrame.duplicated(subset=None, keep='last')
In this example, keep determines what to do with duplicates. You can use 'first' to mark all duplicates as True except for the first occurrence, 'last' to mark all except the last occurrence, or False to mark all duplicates as True.
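A minimal sketch of flagging and removing duplicates:

import pandas as pd

df = pd.DataFrame({"id": [1, 2, 2, 3], "name": ["a", "b", "b", "c"]})
print(df.duplicated(keep='first'))  # True only for the second (2, 'b') row
deduped = df.drop_duplicates()      # keeps the first occurrence of each row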
Sometimes called an “anonymous function,” a lambda function is just like a normal function but is not defined with the def keyword. Instead, it is defined with the lambda keyword. Lambda functions are restricted to a single-line expression, and can take in multiple parameters, just like normal functions.
Here is an example of both a normal and a lambda function for the argument (x) and the expression (x+x):
Normal function:
def function_name(x):
    return x + x
Lambda function:
lambda x: x+x
No. Objects with circular references to other objects are not always freed when Python exits. It is also impossible to free some of the memory reserved by the C library.
Compound data structures are single variables that represent multiple values. Some of the most common in Python are lists, tuples, sets and dictionaries.
List comprehension is used to define and create a list based on an existing list. For example, if we wanted to separate all the letters in the word “retain,” and make each letter a list item, we could use list comprehension:
r_letters = [letter for letter in 'retain']
print(r_letters)
Output:
['r', 'e', 't', 'a', 'i', 'n']
The short answer: Unpacking refers to the practice of assigning the elements of a tuple (or any iterable) to multiple variables in a single statement. You can use the * operator in an unpacking assignment to collect any leftover elements into a list.
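For example, the * operator gathers the extra elements:

first, *rest = [1, 2, 3, 4]
print(first)  # 1
print(rest)   # [2, 3, 4]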
With unpacking, you can swap variables without using a temporary variable. For example:
x = 20
y = 30
print(f'x={x}, y={y}')
x, y = y, x
print(f'x={x}, y={y}')
Output:
x=20, y=30
x=30, y=20
Both / and // are division operators. However, / performs true division, dividing the first operand by the second and returning the result as a float. // performs floor division, dividing the first operand by the second and rounding the result down to the nearest whole number.
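For example:

print(7 / 2)    # 3.5  -- true division always returns a float
print(7 // 2)   # 3    -- floor division rounds down
print(-7 // 2)  # -4   -- note that it rounds toward negative infinity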
The most common way to convert an integer to a string in Python is with the built-in str() function. This function converts any data type into a string; however, there are other ways you can do this. You can use f-strings, the % operator with the “%s” placeholder, or the .format() method.
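All four approaches side by side:

n = 42
print(str(n))           # '42'
print(f"{n}")           # '42', using an f-string
print("%s" % n)         # '42', using %-formatting
print("{}".format(n))   # '42', using .format()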
Arrays store multiple values in a single variable. For example, you could create an array “faang” containing Facebook, Apple, Amazon, Netflix and Google.
Example:
faang = ["facebook", "apple", "amazon", "netflix", "google"]
print(faang)
Output:
['facebook', 'apple', 'amazon', 'netflix', 'google']
In Python, mutable or immutable refers to whether or not an object’s value can change after creation. Mutable objects can have their values changed, while immutable objects cannot. Mutable data types include lists, sets, dictionaries and byte arrays. Immutable data types include numeric data types (boolean, float, etc.), strings, frozensets and tuples.
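For example:

nums = [1, 2, 3]
nums.append(4)        # lists are mutable; nums is now [1, 2, 3, 4]

name = "python"
upper = name.upper()  # strings are immutable; .upper() returns a new string
print(name)           # 'python' -- the original is unchanged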
Python is limited in a few key ways, including:
Speed - As an interpreted language, Python is slower than compiled languages like C or C++.
Threading - The Global Interpreter Lock (GIL) prevents true parallel execution of threads.
Memory - Python’s flexibility comes with relatively high memory consumption.
Mobile development - Python has weak support for mobile platforms.
The enumerate() function returns (index, item) pairs for the items of lists, sets and other iterables, while the zip() function combines the elements of multiple iterables into tuples.
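For example:

letters = ["a", "b", "c"]
numbers = [1, 2, 3]

for i, letter in enumerate(letters):
    print(i, letter)          # 0 a, then 1 b, then 2 c

for letter, number in zip(letters, numbers):
    print(letter, number)     # a 1, then b 2, then c 3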
PYTHONPATH tells the Python interpreter where to locate the module files imported into a program. Its role is similar to PATH. PYTHONPATH includes both the source library directory and the source code directories.
String parsing questions are probably among the most common in Python data science interviews. These types of questions focus on how well you can manipulate text data, which always needs to be thoroughly cleaned and transformed before it becomes a dataset.
These types of questions are common for companies that process a lot of text like Twitter, LinkedIn, Indeed or Netflix.
Example:
sentence = """
Have free hours and love children?
"""
output = [('have', 'free'),
('free', 'hours'),
('hours', 'and'),
('and', 'love'),
('love', 'children?')]
When separating a sentence into bigrams, the first thing we need to do is split the sentence into individual words. We then loop through the words and append each pair to the list of bigrams. How many loop iterations would we need, for instance, if the number of words in a sentence was equal to k?
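A minimal sketch (assuming, as in the example output, that the bigrams should be lowercased and punctuation left attached). Note that k words produce exactly k - 1 bigrams:

def find_bigrams(sentence):
    # Split into lowercase words, then pair each word with its successor
    words = sentence.lower().split()
    return [(words[i], words[i + 1]) for i in range(len(words) - 1)]

print(find_bigrams("Have free hours and love children?"))
# [('have', 'free'), ('free', 'hours'), ('hours', 'and'),
#  ('and', 'love'), ('love', 'children?')]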
Example:
A = 'abcde'
B = 'cdeab'
can_shift(A, B) == True
A = 'abc'
B = 'acb'
can_shift(A, B) == False
This problem is relatively simple if we work out the underlying algorithm that allows us to easily check for string shifts between strings A and B. First off, we have to set baseline conditions for string shifting: strings A and B must be the same length and consist of the same letters. We can check the former with a conditional statement testing whether the lengths of A and B are equal.
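One common trick, sketched below: once the lengths match, B is a shifted version of A exactly when B appears inside A concatenated with itself:

def can_shift(a, b):
    # Every shift of a appears as a substring of a + a
    return len(a) == len(b) and b in a + a

print(can_shift('abcde', 'cdeab'))  # True
print(can_shift('abc', 'acb'))      # False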
Example:
string1 = 'qwe'
string2 = 'asd'
string_map(string1, string2) == True
#q = a, w = s, and e = d
Note: This example would return False if the letters were repeated; for example, string1 = 'donut' and string2 = 'fatty'. This is because the letter t from 'fatty' attempts to map to two different outcomes (t = n or t = u).
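A hedged sketch that tracks the mapping in both directions so no letter maps to two different outcomes:

def string_map(string1, string2):
    if len(string1) != len(string2):
        return False
    forward, backward = {}, {}
    for c1, c2 in zip(string1, string2):
        # setdefault records the first mapping and rejects any conflict
        if forward.setdefault(c1, c2) != c2:
            return False
        if backward.setdefault(c2, c1) != c1:
            return False
    return True

print(string_map('qwe', 'asd'))      # True
print(string_map('donut', 'fatty'))  # False: t maps to both n and u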
Example:
input = "interviewquery"
output = "i"
Given that we have to return the first recurring character, we should be able to go through the string in one loop, saving each unique character to a set and checking whether the current character already exists in that set. If it does, return that character.
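A minimal sketch of that approach:

def first_recurring_char(s):
    seen = set()
    for char in s:
        if char in seen:
            return char  # the first character we've seen before
        seen.add(char)
    return None

print(first_recurring_char("interviewquery"))  # 'i'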
Hint: Notice that in the subsequence problem set, one string in this problem will need to be traversed to check for the values of the other string. In this case, it is string2.
The idea to solve this should then be simple. We traverse both strings from left to right. If we find a matching character, we move ahead in both strings; otherwise, we move ahead only in string2.
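A two-pointer sketch of that traversal, checking whether string1 is a subsequence of string2:

def is_subsequence(string1, string2):
    i = 0
    for char in string2:
        # Advance in string1 only when its next character matches
        if i < len(string1) and string1[i] == char:
            i += 1
    return i == len(string1)

print(is_subsequence("abc", "aebdc"))  # True
print(is_subsequence("abc", "acb"))    # False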
Python statistics and probability questions test your ability to translate stats and probability concepts into code. Both types require knowledge of the mathematical concepts, as well as intermediate Python skills.
This is a relatively simple Python problem that requires setting up a distribution and then generating and plotting n samples from it. We can do this with the SciPy library for scientific computing.
First, declare a standard normal distribution, i.e., mean = 0 and standard deviation = 1. Then generate n samples from it with the rvs(size=n) method.
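A minimal sketch, assuming n samples and a histogram for the plot:

from scipy.stats import norm
import matplotlib.pyplot as plt

n = 1000
samples = norm.rvs(loc=0, scale=1, size=n)  # standard normal: mean 0, sd 1

plt.hist(samples, bins=30)
plt.title(f"{n} samples from N(0, 1)")
plt.show()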
input = [
{
'key': 'list1',
'values': [4,5,2,3,4,5,2,3],
},
{
'key': 'list2',
'values': [1,1,34,12,40,3,9,7],
}
]
output = {'list1': 1.12, 'list2': 14.19}
Hint: Remember the equation for standard deviation: take each data value minus the mean, square it, sum those squares over all data points, divide by the total number of data points, and take the square root of the result.
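A sketch implementing that equation directly (population standard deviation, rounded to two decimals to match the example output):

import math

def compute_deviation(data):
    result = {}
    for group in data:
        values = group['values']
        mean = sum(values) / len(values)
        # Population variance: mean of the squared deviations from the mean
        variance = sum((x - mean) ** 2 for x in values) / len(values)
        result[group['key']] = round(math.sqrt(variance), 2)
    return result

print(compute_deviation(input))  # {'list1': 1.12, 'list2': 14.19}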
stock_prices = [10,5,20,32,25,12]
dts = [
'2019-01-01',
'2019-01-02',
'2019-01-03',
'2019-01-04',
'2019-01-05',
'2019-01-06',
]
def max_profit(stock_prices,dts) -> 27
There are many ways you could go about solving this problem. A good first step is thinking about what our goal is: if we want the maximum profit, then ideally we want to buy at the lowest possible price and sell at the highest possible price. However, since we cannot go back in time, we have a constraint that our sell date must be after our buy date.
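A single-pass sketch that tracks the lowest price seen so far and the best profit available at each step:

def max_profit(stock_prices, dts):
    min_price = stock_prices[0]
    best = 0
    for price in stock_prices[1:]:
        best = max(best, price - min_price)  # sell today at the best buy so far
        min_price = min(min_price, price)    # update the cheapest buy so far
    return best

print(max_profit(stock_prices, dts))  # 27 (buy at 5, sell at 32)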
What’s the probability that Amy wins on her first roll? Let’s play out the scenario. If she loses, then Brad must lose his first roll for Amy to have a chance to win again.
You know the probability that Amy wins on her first roll is ⅙. What is then the probability of Amy winning on the 3rd roll? 5th roll?
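Assuming the game is “first player to roll a six wins” with Amy rolling first (consistent with the 1/6 above), her win probabilities form a geometric series; a quick sketch to check the total:

# P(win on roll 1) = 1/6
# P(win on roll 3) = (5/6)**2 * (1/6)  -- both players must miss once first
# P(win on roll 5) = (5/6)**4 * (1/6), and so on
p_amy = (1/6) / (1 - (5/6) ** 2)  # sum of the geometric series
print(p_amy)  # 0.5454... = 6/11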
More context. Every night between 7 p.m. and midnight, two computing jobs from two different sources are randomly started, with each job lasting an hour. When the jobs run simultaneously at any point in their computations, they cause a failure in some of the company’s other nightly jobs, resulting in downtime for the company that costs $1,000.
The CEO needs a single number representing the annual (365 days) cost of this problem.
Hint. We can model this scenario by implementing two random number generators over a range of 0 to 300 minutes, modeling the start time in minutes between 7 p.m. and midnight.
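A hedged Monte Carlo sketch of that model, treating the jobs as colliding when their start times fall within 60 minutes of each other:

import random

trials = 1_000_000
collisions = 0
for _ in range(trials):
    start1 = random.uniform(0, 300)
    start2 = random.uniform(0, 300)
    if abs(start1 - start2) < 60:  # the hour-long jobs overlap
        collisions += 1

p_overlap = collisions / trials
print(p_overlap * 1000 * 365)  # estimated annual cost
# Analytically, P = 1 - (240/300)**2 = 0.36, or about $131,400 per year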
While Pandas has many roles in data science, including analytics-type questions, in most Python interviews Pandas questions are related to data cleaning. These questions include one-hot encoding variables, using the Pandas apply() function to group different variables, and text cleaning different columns.
def bucket_test_scores(df):
    bins = [0, 50, 75, 90, 100]
    labels = ['<50', '<75', '<90', '<100']
    df['test score'] = pd.cut(df['test score'], bins, labels=labels)
    return df
Hint. In this question, we are given a dataframe full of addresses (in the form of strings) and asked to interpolate state names (more strings) into those addresses.
We will need to match our state names with the cities that they contain. That is going to require us to perform a simple merge of our two dataframes. But before we can do that, we need to split df_addresses such that we can isolate the city part of the address to use in our merge.
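A hedged sketch of that split-and-merge, with hypothetical column names ('address' in df_addresses; 'city' and 'state' in a df_cities lookup table):

import pandas as pd

# Isolate the city, assuming addresses look like '123 Main St, Boston'
df_addresses['city'] = df_addresses['address'].str.split(', ').str[-1]

# Merge on the city, then append the matched state to the address
merged = df_addresses.merge(df_cities, on='city', how='left')
merged['address'] = merged['address'] + ', ' + merged['state']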
We need to filter our dataframe by two conditions: grade and favorite color. We can filter our dataframe by grade by setting our dataframe equal to itself with the condition that the grade column is greater than 90:
students_df = students_df[students_df["grade"] > 90]
Now we want to do the same process for favorite color, but the problem is that we have two possible categories for inclusion in the filtered dataframe. How can we write code to include both possibilities in our final dataframe?
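The isin() method handles this, checking membership against several values at once (the column and color names here are hypothetical):

students_df = students_df[students_df["favorite_color"].isin(["green", "red"])]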
There are two steps to solve the problem: find the median of the column, then fill the missing values with that median.
This problem uses two built-in Pandas methods.
dataframe.column.median()
This returns the median of a column in a dataframe.
dataframe.column.fillna(value)
This applies value to all NaN values in a given column.
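Putting the two together (the column name is a hypothetical placeholder):

df['column'] = df['column'].fillna(df['column'].median())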
Given a list of integers, write a function that returns the maximum number in the list. If the list is empty, return None.
Example 1:
Input:
nums = [1, 7, 3, 5, 6]
Output:
find_max(nums) -> 7
Example 2:
Input:
nums = []
Output:
find_max(nums) -> None
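A minimal sketch:

def find_max(nums):
    if not nums:  # empty list
        return None
    maximum = nums[0]
    for n in nums[1:]:
        if n > maximum:
            maximum = n
    return maximum

print(find_max([1, 7, 3, 5, 6]))  # 7
print(find_max([]))               # None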
Data manipulation questions are a common type of Python data engineer interview question. They cover techniques for transforming data outside of NumPy or Pandas, which is common when designing ETLs and when transforming data between raw JSON and database reads.
Many times these types of transformations will require grouping, sorting or filtering data using lists, dictionaries and other Python data structure types. These questions test your general knowledge of Python data munging outside of actual Pandas formatting.
This question sounds like it should be a SQL question, doesn’t it? Weekly aggregation implies a form of GROUP BY in a regular SQL or Pandas question. In either case, aggregation on a dataset of this form by week would be pretty trivial.
But as a scripting question, this task is trying to find out whether the candidate is comfortable dealing with unstructured data, as data scientists may be forced to deal with a lot of unstructured data depending on their specific role or company.
In this function, we have to do a few things: parse the raw date strings, work out which week each date falls into, and group the values into one list per week.
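A hedged sketch, assuming the input is an ascending list of 'YYYY-MM-DD' date strings and each week is a seven-day window starting from the first date:

from datetime import datetime

def weekly_aggregation(dates):
    first = datetime.strptime(dates[0], '%Y-%m-%d')
    weeks = []
    for d in dates:
        # Whole weeks elapsed since the first date picks the bucket
        offset = (datetime.strptime(d, '%Y-%m-%d') - first).days // 7
        while len(weeks) <= offset:
            weeks.append([])
        weeks[offset].append(d)
    return weeks

print(weekly_aggregation(['2019-01-01', '2019-01-02', '2019-01-08']))
# [['2019-01-01', '2019-01-02'], ['2019-01-08']]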
This Python question explores the concept of stemming, which is the heuristic of chopping off the end of a word to clean and bucket it into an easier feature set.
Input:
roots = ["cat", "bat", "rat"]
sentence = "the cattle was rattled by the battery"
Output:
"the cat was rat by the bat"
Hint. You are only looking for friendships that have an end date. Because of this, every friendship that will be in our final output is contained within the friends_removed list.
If you start by iterating through the friends_removed list, you will already have the id pair and the end date of each listing in our final output. Next, you just need to find the corresponding start date for each end date.
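A hedged sketch, assuming both inputs are lists of dicts shaped like {'user_ids': [1, 2], 'created_at': '2020-01-04'} (hypothetical keys):

def friendship_timeline(friends_added, friends_removed):
    timeline = []
    added = list(friends_added)
    for removal in friends_removed:
        pair = set(removal['user_ids'])
        for addition in added:
            if set(addition['user_ids']) == pair:
                timeline.append({
                    'user_ids': sorted(pair),
                    'start_date': addition['created_at'],
                    'end_date': removal['created_at'],
                })
                added.remove(addition)  # pair each removal with one addition
                break
    return timeline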
Many data science problems involve working with the NumPy library and matrices. Matrices and NumPy interview questions are not as common as the others, but they still show up, especially for specialized roles like computer vision. These questions involve using the NumPy library to run matrix multiplication, calculate the Jacobian determinant, and transform matrices.
NumPy is an open-source library that is used to analyze data and includes support for multi-dimensional arrays and matrices in Python. NumPy is used for a variety of mathematical and statistical operations.
You can find the inverse of any square matrix with the numpy.linalg.inv(array) function. In this case, the ‘array’ would be the matrix to be inverted.
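For example:

import numpy as np

m = np.array([[1.0, 2.0], [3.0, 4.0]])
m_inv = np.linalg.inv(m)
print(np.allclose(m @ m_inv, np.eye(2)))  # True: m times its inverse is identity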
More context. Let’s say we have a five-by-five matrix num_employees where each row is a company and each column represents a department. Each cell of the matrix displays the number of employees working in that particular department at each company.
To construct the new array, loop through every cell in a row and divide it by the total number of employees at that company, which is the sum of the whole row.
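A short sketch of that row-wise normalization:

import numpy as np

num_employees = np.array([[10, 20, 30, 20, 20],
                          [5, 5, 5, 5, 5],
                          [1, 2, 3, 4, 0],
                          [8, 8, 4, 0, 0],
                          [7, 1, 1, 1, 0]])
# Divide each cell by its row sum: each department's share of the company
share = num_employees / num_employees.sum(axis=1, keepdims=True)
print(share.sum(axis=1))  # every row now sums to 1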
There are two approaches to this problem. The first would be to analyze how exactly a 90-degree clockwise rotation changes the index of each entry in the matrix. The second is to think of a series of simpler matrix transformations that amount to a 90-degree clockwise rotation when performed in succession.
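A sketch of the second approach: transposing and then reversing each row produces a 90-degree clockwise rotation:

def rotate_90_clockwise(matrix):
    # zip(*matrix) transposes; [::-1] reverses each resulting row
    return [list(row)[::-1] for row in zip(*matrix)]

print(rotate_90_clockwise([[1, 2],
                           [3, 4]]))  # [[3, 1], [4, 2]]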
Python data structures interview questions assess your ability to use Python coding in algorithms. In general, there are two types of questions: algorithmic coding problems and writing algorithms from scratch.
Input:
begin_word = "same",
end_word = "cost",
word_list = ["same","came","case","cast","lost","last","cost"]
Output:
def shortest_transformation(begin_word, end_word, word_list) -> 5
Since the transformation sequence would be:
'same' -> 'came' -> 'case' -> 'cast' -> 'cost'
Generally, shortest path algorithms require the solution to systematically try every possible transformation path from the start to the end; a breadth-first search does this level by level, guaranteeing the first path found is the shortest.
In this question, we have a few constraints:
Every word in word_list is of the same length.
The max difference between two words in the path is only one letter change.
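A breadth-first-search sketch under those constraints:

from collections import deque

def shortest_transformation(begin_word, end_word, word_list):
    words = set(word_list)
    # The queue holds (word, length of the sequence ending at that word)
    queue = deque([(begin_word, 1)])
    while queue:
        word, length = queue.popleft()
        if word == end_word:
            return length
        for i in range(len(word)):
            for c in 'abcdefghijklmnopqrstuvwxyz':
                candidate = word[:i] + c + word[i + 1:]
                if candidate in words:
                    words.remove(candidate)  # visit each word only once
                    queue.append((candidate, length + 1))
    return None

print(shortest_transformation("same", "cost",
    ["same", "came", "case", "cast", "lost", "last", "cost"]))  # 5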
Input:
dictionary = {
'a' : ['b','c','e'],
'm' : ['c','e'],
}
input = 'c'
Output:
closest_key(dictionary, input) -> 'm'
With this question, ask: Is your computed distance always positive? Negative values for distance (for example, between ‘c’ and ‘a’ instead of ‘a’ and ‘c’) will interfere with getting an accurate result.
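A sketch that uses each value's index in its list as the (always non-negative) distance:

def closest_key(dictionary, target):
    closest, best_index = None, float('inf')
    for key, letters in dictionary.items():
        if target in letters:
            index = letters.index(target)  # position from the front, >= 0
            if index < best_index:
                best_index, closest = index, key
    return closest

print(closest_key({'a': ['b', 'c', 'e'], 'm': ['c', 'e']}, 'c'))  # 'm'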
Input:
string1 = 'mississippi'
string2 = 'mossyistheapple'
The idea is that we need to try every matching subsequence of string1 and string2. So, for example, if we have string1 = abbc and string2 = acc, we can take the first letter of string1, a, and look for a match in string2. Once we find one, we are left with the same problem on a smaller portion of the two strings: the remaining part of string1 will be bbc and the remaining part of string2 will be cc, and we repeat the process.
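That recursion is the classic longest-common-subsequence pattern; a bottom-up sketch avoids recomputing the subproblems:

def longest_common_subsequence(string1, string2):
    m, n = len(string1), len(string2)
    # dp[i][j] holds the LCS length of string1[:i] and string2[:j]
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if string1[i - 1] == string2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

print(longest_common_subsequence('mississippi', 'mossyistheapple'))  # 7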
Python machine learning questions tend to focus on model deployment and model building, and, in particular, assess your ability to use Python coding in algorithms. In general, there are two types of questions: algorithmic coding problems and writing algorithms from scratch.
You are provided with:
data_points, consisting of an arbitrary number of data points (rows) n and an arbitrary number of columns m.
k, the number of clusters.
initial_centroids, the starting centroid for each cluster.
Return a list of the cluster to which each point belongs in the original list data_points, maintaining the same order (as an integer).
Example
#Input
data_points = [(0,0),(3,4),(4,4),(1,0),(0,1),(4,3)]
k = 2
initial_centroids = [(1,1),(4,5)]
#Output
k_means_clustering(data_points,k,initial_centroids) -> [0,1,1,0,0,1]
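A from-scratch sketch (a fixed iteration count is used here as a hedge; a fuller version would check for convergence):

import math

def k_means_clustering(data_points, k, initial_centroids, n_iterations=10):
    centroids = [list(c) for c in initial_centroids]
    for _ in range(n_iterations):
        # Assignment step: attach each point to its nearest centroid
        clusters = [min(range(k), key=lambda c: math.dist(point, centroids[c]))
                    for point in data_points]
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, cl in zip(data_points, clusters) if cl == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return clusters

print(k_means_clustering(data_points, k, initial_centroids))  # [0, 1, 1, 0, 0, 1]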
Example Output:
def kNN(k, data, new_point) -> 2
The model should have these conditions:
new_point is a list with a length equal to the number of fields in the df.
All values in new_point are 0 or 1, i.e., all fields are dummy variables, and there are only two classes.
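A hedged sketch, assuming 'data' is a list of (features, label) rows whose features are 0/1 dummy variables; with binary fields, Hamming distance is a natural metric:

from collections import Counter

def kNN(k, data, new_point):
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))

    # The k rows closest to new_point vote on the class
    neighbors = sorted(data, key=lambda row: hamming(row[0], new_point))[:k]
    labels = [label for _, label in neighbors]
    return Counter(labels).most_common(1)[0][0]

data = [([0, 0, 1], 1), ([0, 1, 1], 1), ([1, 0, 0], 2), ([1, 1, 0], 2)]
print(kNN(3, data, [1, 1, 1]))  # 1 (hypothetical example data)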
This course is designed to help you learn everything you need to know about working with data, from basic concepts to more advanced techniques.
Continue your prep with Interview Query. We offer a variety of Python resources, from practice questions to full learning paths.
Streamlining your recruitment process for Python-savvy data science roles? Let OutSearch.ai’s AI-driven platform help you find candidates who not only excel in Python but are perfect for your team’s dynamic. Consider checking out the site!