Data science connects user intentions to business interests. The purpose of data science is to extract and analyze user data to help companies make informed decisions about strategy and product changes.
However, as a data scientist, you’ll also be reconstructing the often-broken bridge between technical and non-technical stakeholders with data visualizations and effective communication.
Most data science roles, with few exceptions, require strong coding skills so that you can manipulate data, design statistical forecasting models, and automate routine work.
To help you prepare for your upcoming data science interview, we’ve compiled a list of data science coding interview questions that you’ll find challenging and useful.
We treat foundational coding problems, such as databases and querying, as basic data science coding interview questions. The questions are pitched at the competitive level that most well-known data science companies expect:
Write a function named `grades_colors` to select only the rows where the student’s favorite color is green or red and their grade is above 90.
`students_df` table
name | age | favorite_color | grade |
---|---|---|---|
Tim Voss | 19 | red | 91 |
Nicole Johnson | 20 | yellow | 95 |
Elsa Williams | 21 | green | 82 |
John James | 20 | blue | 75 |
Catherine Jones | 23 | green | 93 |
Example:
Input:
import pandas as pd
students = {"name" : ["Tim Voss", "Nicole Johnson", "Elsa Williams", "John James", "Catherine Jones"], "age" : [19, 20, 21, 20, 23], "favorite_color" : ["red", "yellow", "green", "blue", "green"], "grade" : [91, 95, 82, 75, 93]}
students_df = pd.DataFrame(students)
Output:
def grades_colors(students_df) ->
name | age | favorite_color | grade |
---|---|---|---|
Tim Voss | 19 | red | 91 |
Catherine Jones | 23 | green | 93 |
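One way to approach this is with boolean masks. The following is a minimal sketch, assuming pandas is installed and that "above 90" means strictly greater than 90; it reuses the sample data from the example.

```python
import pandas as pd


def grades_colors(students_df: pd.DataFrame) -> pd.DataFrame:
    # Mask 1: favorite color is either green or red.
    color_ok = students_df["favorite_color"].isin(["green", "red"])
    # Mask 2: grade is strictly above 90 (assumed reading of "above 90").
    grade_ok = students_df["grade"] > 90
    # Keep only the rows where both conditions hold.
    return students_df[color_ok & grade_ok]


students = {
    "name": ["Tim Voss", "Nicole Johnson", "Elsa Williams", "John James", "Catherine Jones"],
    "age": [19, 20, 21, 20, 23],
    "favorite_color": ["red", "yellow", "green", "blue", "green"],
    "grade": [91, 95, 82, 75, 93],
}
students_df = pd.DataFrame(students)
result = grades_colors(students_df)
print(result)
```

Combining two named masks with `&` keeps each condition readable and easy to test on its own.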
Let’s say you run a wine house. You have detailed information about the chemical composition of wines in a `wines` table.
One day, a customer comes asking specifically for a wine that has certain chemical properties. Write a query that returns the `id` of suitable wines for this customer.
Note: All percentages are reported with two numbers before the decimal point; for example, 13.55% is represented as 13.55 instead of 0.1355.
Example:
Input:
`wines` table
Column | Type |
---|---|
id | INTEGER |
alcohol | FLOAT |
malic_acid | FLOAT |
ash | FLOAT |
alcalinity_of_ash | FLOAT |
magnesium | INTEGER |
total_phenols | FLOAT |
flavanoids | FLOAT |
nonflavanoid_phenols | FLOAT |
proanthocyanins | FLOAT |
color_intensity | FLOAT |
hue | FLOAT |
od280_or_od315_of_diluted_wines | FLOAT |
proline | INTEGER |
Output:
Column | Type |
---|---|
id | INTEGER |
We’re given two tables: a `users` table with demographic information and the neighborhood they live in, and a `neighborhoods` table.
Example:
Input:
`users` table
Columns | Type |
---|---|
id | INTEGER |
name | VARCHAR |
neighborhood_id | INTEGER |
created_at | DATETIME |
`neighborhoods` table
Columns | Type |
---|---|
id | INTEGER |
name | VARCHAR |
city_id | INTEGER |
Output:
Columns | Type |
---|---|
name | VARCHAR |
Note: If more than one person shares the highest salary, the query should select the next highest salary.
Example:
Input:
`employees` table
Column | Type |
---|---|
id | INTEGER |
first_name | VARCHAR |
last_name | VARCHAR |
salary | INTEGER |
department_id | INTEGER |
`departments` table
Column | Type |
---|---|
id | INTEGER |
name | VARCHAR |
Output:
Column | Type |
---|---|
salary | INTEGER |
Given a `bank_transactions` table with the columns `id`, `transaction_value`, and `created_at` representing the date and time for each transaction, write a query to get the last transaction for each day.
The output should include the ID of the transaction, datetime of the transaction, and the transaction amount. Order the transactions by datetime.
Example:
Input:
`bank_transactions` table
Column | Type |
---|---|
id | INTEGER |
created_at | DATETIME |
transaction_value | FLOAT |
Output:
Column | Type |
---|---|
created_at | DATETIME |
transaction_value | FLOAT |
id | INTEGER |
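A possible query for the last transaction of each day is sketched below. It runs against an in-memory SQLite database through Python's built-in `sqlite3` module so it can be tried locally. The sample rows are hypothetical, and the sketch assumes `created_at` is stored as ISO-formatted text and that no two transactions share the exact same timestamp.

```python
import sqlite3

# In-memory database with the schema from the example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bank_transactions ("
    "id INTEGER, created_at DATETIME, transaction_value FLOAT)"
)

# Hypothetical sample rows, used only to illustrate the query.
rows = [
    (1, "2020-01-01 09:15:00", 50.0),
    (2, "2020-01-01 18:30:00", 20.5),
    (3, "2020-01-02 08:00:00", 75.0),
    (4, "2020-01-02 23:59:00", 10.0),
    (5, "2020-01-03 12:00:00", 99.9),
]
conn.executemany("INSERT INTO bank_transactions VALUES (?, ?, ?)", rows)

# The subquery finds the latest timestamp per calendar day;
# the join pulls the full row that matches each of those timestamps.
QUERY = """
SELECT t.created_at, t.transaction_value, t.id
FROM bank_transactions AS t
JOIN (
    SELECT MAX(created_at) AS last_at
    FROM bank_transactions
    GROUP BY DATE(created_at)
) AS last_per_day
  ON t.created_at = last_per_day.last_at
ORDER BY t.created_at
"""

last_transactions = conn.execute(QUERY).fetchall()
for row in last_transactions:
    print(row)
```

A window function such as `ROW_NUMBER()` partitioned by day is a common alternative; the join form is used here because it works on older SQL engines as well.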
A fundamental requirement to succeed as a data scientist involves demonstrating a problem-solving approach and applying algorithm and coding skills to resolve real-world analytical challenges.
To ascertain your coding abilities, data science interviewers typically use Python interview questions. Here are some of them:
Bonus: What’s the time complexity?
Example:
Input:
list1 = [1,2,5]
list2 = [2,4,6]
Output:
def merge_list(list1,list2) -> [1,2,2,4,5,6]
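The example suggests that both inputs are already sorted and that the result should be one sorted list. Under that assumption, a two-pointer merge is one reasonable sketch:

```python
def merge_list(list1, list2):
    """Merge two lists that are assumed to be sorted in ascending order."""
    merged = []
    i = j = 0
    # Repeatedly take the smaller front element of the two lists.
    while i < len(list1) and j < len(list2):
        if list1[i] <= list2[j]:
            merged.append(list1[i])
            i += 1
        else:
            merged.append(list2[j])
            j += 1
    # At most one of these slices is non-empty; append the leftovers.
    merged.extend(list1[i:])
    merged.extend(list2[j:])
    return merged


print(merge_list([1, 2, 5], [2, 4, 6]))
```

Each element is visited once, so this version runs in O(n + m) time for lists of length n and m, which addresses the bonus question.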
You are given two rectangles, `a` and `b`, each defined by four ordered pairs denoting their corners on the `x`, `y` plane. Write a function `rectangle_overlap` to determine whether or not they overlap. Return `True` if so, and `False` otherwise.
Note: If the two rectangles border one another or share a corner like two diagonally adjacent positions on a chessboard, they are said to overlap.
Note: The lists of ordered pairs are in no particular order. The first entry in list `a` could be the top left corner, while the first in list `b` is the bottom right.
Example:
Input:
a = [(-3,5), (-3,2),(0,5),(0,2)]
b = [(-1,4), (3,4), (3,1), (-1,1)]
Output:
def rectangle_overlap(a, b) -> True
Point (0,2) is fully contained in rectangle `b`, and point (-1,4) is fully contained in rectangle `a`.
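A compact way to check this is to reduce each rectangle to its bounding interval on each axis. The sketch below assumes the rectangles are axis-aligned, which the example suggests but the prompt does not state; `bounds` is a hypothetical helper name.

```python
def bounds(corners):
    """Return (min_x, max_x, min_y, max_y) for a list of corner points."""
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    return min(xs), max(xs), min(ys), max(ys)


def rectangle_overlap(a, b):
    a_x0, a_x1, a_y0, a_y1 = bounds(a)
    b_x0, b_x1, b_y0, b_y1 = bounds(b)
    # The rectangles overlap when their x-intervals and y-intervals both
    # intersect. Using <= makes shared borders and corners count as overlap.
    return a_x0 <= b_x1 and b_x0 <= a_x1 and a_y0 <= b_y1 and b_y0 <= a_y1


a = [(-3, 5), (-3, 2), (0, 5), (0, 2)]
b = [(-1, 4), (3, 4), (3, 1), (-1, 1)]
print(rectangle_overlap(a, b))
```

Taking the minimum and maximum of the coordinates sidesteps the fact that the corners arrive in no particular order.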
You are given an array of integers, `nums`, of length `n` spanning `0` to `n` with one missing. Write a function `missing_number` that returns the missing number in the array.
Note: The complexity of O(n) is required.
Example:
Input:
nums = [0,1,2,4,5]
missing_number(nums) -> 3
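One O(n) approach compares the sum of the array with the sum it would have if nothing were missing. This is a minimal sketch that assumes the input really does contain every integer from 0 to n except one:

```python
def missing_number(nums):
    n = len(nums)
    # Sum of 0..n if no number were missing (Gauss formula).
    expected_total = n * (n + 1) // 2
    # The gap between expected and actual totals is the missing value.
    return expected_total - sum(nums)


nums = [0, 1, 2, 4, 5]
print(missing_number(nums))
```

The function makes a single pass over the array and uses constant extra memory.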
Given that it is raining today and rained yesterday, write a function `rain_days` to calculate the probability that it will rain on the nth day after today.
Example:
Input:
n=5
Output:
def rain_days(n) -> 0.39968
In addition to algorithms and coding, data structure fundamentals, especially trees, lists, and maps, also contribute to successful data science projects. We have many data structure interview questions in our database, including the following:
Given two strings, `A` and `B`, write a function `can_shift` to return whether or not `A` can be shifted some number of places to get `B`.
Example:
Input:
A = 'abcde'
B = 'cdeab'
can_shift(A, B) == True
A = 'abc'
B = 'acb'
can_shift(A, B) == False
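A common trick for this problem is that every rotation of a string appears as a substring of the string concatenated with itself. The sketch below relies on that observation and assumes "shifted" means a circular rotation:

```python
def can_shift(A, B):
    # Strings of different lengths can never be rotations of each other.
    if len(A) != len(B):
        return False
    # Every rotation of A is a substring of A + A.
    return B in A + A


print(can_shift("abcde", "cdeab"))
print(can_shift("abc", "acb"))
```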
The inputs are `data` and an array `new_point` with a length equal to the number of fields in the `data`.
All values in `data` and `new_point` are 0 or 1, i.e., all fields are dummy variables and there are only two classes.
[...] `new_point` for that column.
[...] `new_point`.
You may use `pandas` and `NumPy` but NOT `scikit-learn`.
Bonus: The `permutations` function in the `itertools` package can help you easily get all permutations of any iterable object.
Example:
Input:
new_point = [0,1,0,1]
print(data)
...
Var1 Var2 Var3 Var4 Target
0 1.0 1.0 1.0 0.0 1
1 0.0 0.0 0.0 0.0 0
2 1.0 0.0 1.0 0.0 0
3 0.0 1.0 1.0 1.0 1
4 1.0 0.0 1.0 0.0 0
.. ... ... ... ... ...
95 0.0 1.0 0.0 1.0 0
96 1.0 1.0 0.0 0.0 0
97 0.0 0.0 1.0 1.0 0
98 1.0 0.0 0.0 0.0 0
99 0.0 1.0 0.0 0.0 0
[100 rows x 5 columns]
Output:
def random_forest(new_point, data) -> 0
Say you are given a list of tuples where the first element is the slope of a line and the second element is the y-intercept of a line. Write a function `find_intersecting` to find which lines, if any, intersect with any of the others in the given `x_range`.
Example:
Input:
tuple_list = [(2, 3), (-3, 5), (4, 6), (5, 7)]
x_range = (0, 1)
Output:
def find_intersecting(tuple_list, x_range) -> [(2,3), (-3,5)]
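A straightforward sketch checks every pair of lines and solves for the x-coordinate where they meet. It assumes the endpoints of `x_range` are inclusive and that parallel lines (including identical ones) are skipped; neither detail is specified in the prompt.

```python
from itertools import combinations


def find_intersecting(tuple_list, x_range):
    x_min, x_max = x_range
    hits = set()
    for (m1, b1), (m2, b2) in combinations(tuple_list, 2):
        if m1 == m2:
            # Parallel lines never cross at a single point (assumption: skip).
            continue
        # Solve m1*x + b1 == m2*x + b2 for x.
        x = (b2 - b1) / (m1 - m2)
        if x_min <= x <= x_max:
            hits.add((m1, b1))
            hits.add((m2, b2))
    # Report the intersecting lines in their original input order.
    return [line for line in tuple_list if line in hits]


tuple_list = [(2, 3), (-3, 5), (4, 6), (5, 7)]
x_range = (0, 1)
print(find_intersecting(tuple_list, x_range))
```

The pairwise check is O(n²) in the number of lines, which is usually acceptable for interview-sized inputs.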
You may use `pandas` and `NumPy` but NOT `scikit-learn`.
Example:
Input:
k = 5
new_point = [0.5,-2,8]
print(data)
...
Var1 Var2 Var3 Target
0 -3.279536 3.362223 2.847892 2
1 -0.791565 1.742475 2.151587 2
2 -0.785992 -0.938681 -0.459770 0
3 -1.068190 1.461051 0.127130 3
4 -0.367568 -0.870240 -0.225734 0
.. ... ... ... ...
95 -1.327175 1.971085 -0.690689 2
96 -3.203714 1.847649 0.778901 2
97 -0.587640 0.647458 2.094385 2
98 0.363644 -0.509795 2.514191 1
99 -0.673498 2.955285 2.102122 4
[100 rows x 4 columns]
Output:
def kNN(k, new_point, data) -> 2
Write a function `closest_key` to find the key with the input value closest to the beginning of the list.
Example:
Input:
dictionary = {
'a' : ['b','c','e'],
'm' : ['c','e'],
}
input = 'c'
Output:
closest_key(dictionary, input) -> 'm'
c is at a distance of 1 from a and 0 from m. Hence, the closest key for c is m.
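The distance described above is simply the index of the value inside each key's list. A minimal sketch follows; the parameter is named `target` to avoid shadowing Python's built-in `input`, and returning `None` when no list contains the value is an assumption.

```python
def closest_key(dictionary, target):
    best_key, best_pos = None, None
    for key, values in dictionary.items():
        if target in values:
            # Distance from the beginning of the list is the index.
            pos = values.index(target)
            if best_pos is None or pos < best_pos:
                best_key, best_pos = key, pos
    # None signals that no list contains the target (assumed behavior).
    return best_key


dictionary = {
    "a": ["b", "c", "e"],
    "m": ["c", "e"],
}
print(closest_key(dictionary, "c"))
```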
NumPy is a fundamental Python library for scientific computing that provides high-performance multidimensional array objects and tools for working with these arrays. It is an upgrade to Python’s built-in lists for mathematical calculations on large datasets. We have an extensive list of NumPy Interview Questions, some of which are discussed here:
Given a list of integers, write a function `gcd` to find the greatest common divisor among them.
Example:
Input:
int_list = [8, 16, 24]
Output:
def gcd(int_list) -> 8
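Because the greatest common divisor is associative, the pairwise result can be folded across the whole list. The sketch below uses only the standard library and assumes the list is non-empty; `gcd_pair` is just a local alias for `math.gcd`.

```python
from functools import reduce
from math import gcd as gcd_pair


def gcd(int_list):
    # Fold the two-argument gcd over the list:
    # gcd(a, b, c) == gcd(gcd(a, b), c).
    return reduce(gcd_pair, int_list)


int_list = [8, 16, 24]
print(gcd(int_list))
```

With NumPy available, `np.gcd.reduce(int_list)` is a vectorized alternative that follows the same idea.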
Machine learning aids data scientists when they need to gather information faster and assists with trend analysis. While your involvement in building or “coding” ML models will be determined by the company and the type of role you hold, data scientists are generally not expected to approach machine learning interview questions from a strict development standpoint. However, you may be expected to answer algorithm coding questions, such as:
Write a function `search_list` that returns a Boolean indicating if the `target` value is in the `linked_list` or not.
You receive the head of the linked list, which is a dictionary with the following keys: `value` (contains the value of the node) and `next` (contains the next node in the list, or `None`).
If the linked list is empty, you’ll receive `None` since there is no head node for an empty list.
Example:
Input:
target = 2
linked_list = 3 -> 2 -> 5 -> 6 -> 8 -> None
Output:
search_list(target, linked_list) -> True
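Since each node is a dictionary, the search is a simple walk along the `next` references. The following is a minimal sketch; `build_list` is a hypothetical helper added only to construct the example input.

```python
def search_list(target, linked_list):
    node = linked_list
    # An empty list arrives as None, so the loop body is skipped entirely.
    while node is not None:
        if node["value"] == target:
            return True
        node = node["next"]
    return False


def build_list(values):
    """Hypothetical helper: build the dictionary-based list from a sequence."""
    head = None
    for value in reversed(values):
        head = {"value": value, "next": head}
    return head


linked_list = build_list([3, 2, 5, 6, 8])
print(search_list(2, linked_list))
```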
You are given two words, `begin_word` and `end_word`, which are elements of `word_list`. Write a function `shortest_transformation` to find the length of the shortest transformation sequence from `begin_word` to `end_word` through the elements of `word_list`.
Note: Only one letter can be changed at a time, and each transformed word in the list must exist inside of `word_list`.
Note: In all test cases, a path does exist between `begin_word` and `end_word`.
Example:
Input:
begin_word = "same"
end_word = "cost"
word_list = ["same","came","case","cast","lost","last","cost"]
Output:
def shortest_transformation(begin_word, end_word, word_list) -> 5
Since the transformation sequence would be:
'same' -> 'came' -> 'case' -> 'cast' -> 'cost'
which is five elements long.
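Shortest-path questions over an implicit graph like this one are commonly handled with breadth-first search. The sketch below treats each word as a node and connects words that differ by exactly one letter; `differs_by_one` is a hypothetical helper, and returning 0 when no path exists is an assumption.

```python
from collections import deque


def differs_by_one(word1, word2):
    """True when the words have equal length and exactly one differing letter."""
    if len(word1) != len(word2):
        return False
    return sum(c1 != c2 for c1, c2 in zip(word1, word2)) == 1


def shortest_transformation(begin_word, end_word, word_list):
    # Each queue entry holds a word and the number of words used to reach it.
    queue = deque([(begin_word, 1)])
    visited = {begin_word}
    while queue:
        word, length = queue.popleft()
        if word == end_word:
            return length
        for candidate in word_list:
            if candidate not in visited and differs_by_one(word, candidate):
                visited.add(candidate)
                queue.append((candidate, length + 1))
    # No path found (the prompt says this should not happen in test cases).
    return 0


word_list = ["same", "came", "case", "cast", "lost", "last", "cost"]
print(shortest_transformation("same", "cost", word_list))
```

Breadth-first search explores words in order of distance from the start, so the first time `end_word` is dequeued its length is the shortest one.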
Given two strings, `string1` and `string2`, write a function `maximal_substring` to return the maximal substring shared by both strings.
Example:
Input:
string1 = 'mississippi'
string2 = 'mossyistheapple'
Output:
def maximal_substring(string1, string2) -> 'mssispp'
Note: If there are multiple max substrings with the same length, just return any one of them.
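The expected output `'mssispp'` is not a contiguous run in either input, so the sketch below assumes the task is really the longest common subsequence, where characters keep their order but may skip positions. Under that assumption, a standard dynamic-programming table can be used:

```python
def maximal_substring(string1, string2):
    n, m = len(string1), len(string2)
    # dp[i][j] = length of the longest common subsequence
    # of string1[i:] and string2[j:].
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if string1[i] == string2[j]:
                dp[i][j] = 1 + dp[i + 1][j + 1]
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])

    # Walk the table from the start to rebuild one optimal answer.
    i = j = 0
    shared = []
    while i < n and j < m:
        if string1[i] == string2[j]:
            shared.append(string1[i])
            i += 1
            j += 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i += 1
        else:
            j += 1
    return "".join(shared)


print(maximal_substring("mississippi", "mossyistheapple"))
```

The table takes O(n·m) time and memory. When several answers share the maximal length, this version returns whichever one the tie-breaking rule in the walk reaches first.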
To excel in data science coding interviews, focus on a strong foundation in data structures and algorithms.
Practice coding regularly on our platform and utilize our AI Interviewer feature. Understand the trade-offs between different approaches and articulate your thought process clearly, especially in ML coding questions.
Emphasize code readability, efficiency, and test case considerations.
Additionally, delve deep into Python libraries like NumPy, pandas, and scikit-learn for efficient data manipulation and modeling. All the best!