Top 27 Data Science Coding Interview Questions (Updated for 2024)

Overview

Data science connects user intentions to business interests: it extracts and analyzes user data to help companies make informed decisions about strategy and product changes.

As a data scientist, you’ll also rebuild the often-broken bridge between technical and non-technical stakeholders through data visualizations and effective communication.

Most data science roles require you to be highly proficient in coding so that you can manipulate data, design statistical forecasting models, and automate tasks.

To help you prepare for your upcoming data science interview, we’ve compiled a list of data science coding interview questions that you’ll find challenging and useful.

Basic Data Science Coding Interview Questions

We’ve grouped foundational coding problems, such as those on databases and querying, under basic data science coding interview questions. Their difficulty matches the competitive level that most well-known data science companies expect from candidates:

1. Write a function named grades_colors to select only the rows where the student’s favorite color is green or red and their grade is above 90.

students_df table

name             age  favorite_color  grade
Tim Voss         19   red             91
Nicole Johnson   20   yellow          95
Elsa Williams    21   green           82
John James       20   blue            75
Catherine Jones  23   green           93

Example:

Input:

import pandas as pd

students = {"name" : ["Tim Voss", "Nicole Johnson", "Elsa Williams", "John James", "Catherine Jones"], "age" : [19, 20, 21, 20, 23], "favorite_color" : ["red", "yellow", "green", "blue", "green"], "grade" : [91, 95, 82, 75, 93]}

students_df = pd.DataFrame(students)

Output:

def grades_colors(students_df) ->

name             age  favorite_color  grade
Tim Voss         19   red             91
Catherine Jones  23   green           93
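One possible approach, sketched here with pandas boolean masks, combines a color filter and a grade filter. It reuses the example input above and is only one of several reasonable ways to write the selection.

```python
import pandas as pd


def grades_colors(students_df: pd.DataFrame) -> pd.DataFrame:
    # Rows whose favorite color is green or red.
    color_ok = students_df["favorite_color"].isin(["green", "red"])
    # Rows whose grade is strictly above 90.
    grade_ok = students_df["grade"] > 90
    # Keep only the rows that satisfy both conditions.
    return students_df[color_ok & grade_ok]


students = {
    "name": ["Tim Voss", "Nicole Johnson", "Elsa Williams", "John James", "Catherine Jones"],
    "age": [19, 20, 21, 20, 23],
    "favorite_color": ["red", "yellow", "green", "blue", "green"],
    "grade": [91, 95, 82, 75, 93],
}
students_df = pd.DataFrame(students)
print(grades_colors(students_df))
```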

2. Write a query to find the id of suitable wines for this customer.

Let’s say you run a wine house. You have detailed information about the chemical composition of wines in a wines table.

One day, a customer comes asking specifically for a wine that has

  • Alcohol content greater than or equal to 13%
  • Ash content less than 2.4
  • Color intensity less than 3

Note: All percentages are reported with two numbers before the decimal point; for example, 13.55% is represented as 13.55 instead of 0.1355.

Example:

Input:

wines table

Column Type
id INTEGER
alcohol FLOAT
malic_acid FLOAT
ash FLOAT
alcalinity_of_ash FLOAT
magnesium INTEGER
total_phenols FLOAT
flavanoids FLOAT
nonflavanoid_phenols FLOAT
proanthocyanins FLOAT
color_intensity FLOAT
hue FLOAT
od280_or_od315_of_diluted_wines FLOAT
proline INTEGER

Output:

Column Type
id INTEGER
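A minimal sketch of such a query, run through Python's built-in sqlite3 module so it can be executed as-is. The table here is reduced to the four columns the filter needs, and the sample rows are hypothetical values made up for illustration.

```python
import sqlite3

# The three customer requirements map directly to three WHERE conditions.
SUITABLE_WINES_SQL = """
SELECT id
FROM wines
WHERE alcohol >= 13
  AND ash < 2.4
  AND color_intensity < 3
"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE wines (id INTEGER, alcohol FLOAT, ash FLOAT, color_intensity FLOAT);
-- Hypothetical sample rows, not taken from any real dataset.
INSERT INTO wines VALUES
    (1, 13.55, 2.10, 2.80),
    (2, 12.90, 2.00, 2.50),
    (3, 13.00, 2.40, 2.00),
    (4, 14.10, 2.30, 3.50);
""")

suitable_ids = [row[0] for row in conn.execute(SUITABLE_WINES_SQL)]
print(suitable_ids)
```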

3. Write a query that returns all neighborhoods that have 0 users.

We’re given two tables, a users table with demographic information and the neighborhood they live in and a neighborhoods table.

Example:

Input:

users table

Columns Type
id INTEGER
name VARCHAR
neighborhood_id INTEGER
created_at DATETIME

neighborhoods table

Columns Type
id INTEGER
name VARCHAR
city_id INTEGER

Output:

Columns Type
name VARCHAR
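One common way to answer this is a LEFT JOIN that keeps only neighborhoods with no matching user. The sketch below runs the query through Python's built-in sqlite3 module; the sample rows are hypothetical and exist only to make the example executable.

```python
import sqlite3

# A neighborhood with no users has no match on the right side of the join,
# so every users column comes back as NULL for it.
EMPTY_NEIGHBORHOODS_SQL = """
SELECT n.name
FROM neighborhoods AS n
LEFT JOIN users AS u ON u.neighborhood_id = n.id
WHERE u.id IS NULL
"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE neighborhoods (id INTEGER, name VARCHAR, city_id INTEGER);
CREATE TABLE users (id INTEGER, name VARCHAR, neighborhood_id INTEGER, created_at DATETIME);
-- Hypothetical sample rows.
INSERT INTO neighborhoods VALUES (1, 'Castro', 1), (2, 'Mission', 1), (3, 'Sunset', 1);
INSERT INTO users VALUES
    (1, 'Ann', 1, '2024-01-01 10:00:00'),
    (2, 'Bo', 1, '2024-01-02 11:00:00'),
    (3, 'Cy', 3, '2024-01-03 12:00:00');
""")

empty_names = [row[0] for row in conn.execute(EMPTY_NEIGHBORHOODS_SQL)]
print(empty_names)
```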

4. Write an SQL query to select the 2nd highest salary in the engineering department.

Note: If more than one person shares the highest salary, the query should select the next highest salary.

Example:

Input:

employees table

Column Type
id INTEGER
first_name VARCHAR
last_name VARCHAR
salary INTEGER
department_id INTEGER

departments table

Column Type
id INTEGER
name VARCHAR

Output:

Column Type
salary INTEGER
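A sketch of one possible query, executed through Python's built-in sqlite3 module. It assumes the department is stored under the name 'engineering' and that the database supports LIMIT and OFFSET; the sample rows are hypothetical. Selecting DISTINCT salaries first is what handles the case where several people share the top salary.

```python
import sqlite3

# DISTINCT collapses ties, so OFFSET 1 skips the single highest salary value.
SECOND_HIGHEST_SQL = """
SELECT DISTINCT e.salary
FROM employees AS e
JOIN departments AS d ON e.department_id = d.id
WHERE d.name = 'engineering'
ORDER BY e.salary DESC
LIMIT 1 OFFSET 1
"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE departments (id INTEGER, name VARCHAR);
CREATE TABLE employees (
    id INTEGER, first_name VARCHAR, last_name VARCHAR,
    salary INTEGER, department_id INTEGER
);
-- Hypothetical sample rows: two engineers tie for the top salary.
INSERT INTO departments VALUES (1, 'engineering'), (2, 'marketing');
INSERT INTO employees VALUES
    (1, 'Ada', 'Ng', 100000, 1),
    (2, 'Ben', 'Ito', 100000, 1),
    (3, 'Cal', 'Roy', 90000, 1),
    (4, 'Dee', 'Fox', 120000, 2),
    (5, 'Eve', 'Lam', 80000, 1);
""")

second_highest = conn.execute(SECOND_HIGHEST_SQL).fetchone()[0]
print(second_highest)
```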

5. Given a table of bank transactions with columns id, transaction_value, and created_at representing the date and time for each transaction, write a query to get the last transaction for each day.

The output should include the ID of the transaction, datetime of the transaction, and the transaction amount. Order the transactions by datetime.

Example:

Input:

bank_transactions table

Column Type
id INTEGER
created_at DATETIME
transaction_value FLOAT

Output:

Column Type
created_at DATETIME
transaction_value FLOAT
id INTEGER
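One way to approach this is to find the latest timestamp per calendar day in a subquery and keep the rows that match. The sketch below uses Python's built-in sqlite3 module and SQLite's DATE() function; the sample rows are hypothetical. If two transactions share the exact same last timestamp on a day, this version returns both.

```python
import sqlite3

# The subquery yields one MAX(created_at) per day; the outer query keeps
# only the transactions that happened at one of those timestamps.
LAST_PER_DAY_SQL = """
SELECT created_at, transaction_value, id
FROM bank_transactions
WHERE created_at IN (
    SELECT MAX(created_at)
    FROM bank_transactions
    GROUP BY DATE(created_at)
)
ORDER BY created_at
"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bank_transactions (id INTEGER, created_at DATETIME, transaction_value FLOAT);
-- Hypothetical sample rows.
INSERT INTO bank_transactions VALUES
    (1, '2024-01-01 09:00:00', 10.0),
    (2, '2024-01-01 17:30:00', 20.0),
    (3, '2024-01-02 08:15:00', 30.0),
    (4, '2024-01-02 23:59:00', 40.0),
    (5, '2024-01-03 12:00:00', 50.0);
""")

last_rows = conn.execute(LAST_PER_DAY_SQL).fetchall()
for row in last_rows:
    print(row)
```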

Python Data Science Coding Interview Questions

A fundamental requirement to succeed as a data scientist involves demonstrating a problem-solving approach and applying algorithm and coding skills to resolve real-world analytical challenges.

To ascertain your coding abilities, data science interviewers typically use Python interview questions. Here are some of them:

6. Given two sorted lists, write a function to merge them into one sorted list.

Bonus: What’s the time complexity?

Example:

Input:

list1 = [1,2,5]
list2 = [2,4,6]

Output:

def merge_list(list1,list2) -> [1,2,2,4,5,6]
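A minimal two-pointer sketch of the merge follows. Each element of either list is visited once, so the running time is O(n + m) for lists of lengths n and m, which also answers the bonus question under this approach.

```python
def merge_list(list1, list2):
    merged = []
    i = j = 0
    # Repeatedly take the smaller of the two current front elements.
    while i < len(list1) and j < len(list2):
        if list1[i] <= list2[j]:
            merged.append(list1[i])
            i += 1
        else:
            merged.append(list2[j])
            j += 1
    # At most one of the lists still has elements left; append them as-is.
    merged.extend(list1[i:])
    merged.extend(list2[j:])
    return merged


print(merge_list([1, 2, 5], [2, 4, 6]))
```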

7. You are given a singly linked list; write a function to find and return the last node of the list. If the list is empty, return null.
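The question does not say how the list is represented, so the sketch below assumes a small hypothetical Node class with value and next attributes, and uses Python's None in place of null.

```python
class Node:
    """Hypothetical node type assumed for this sketch."""

    def __init__(self, value, next=None):
        self.value = value
        self.next = next


def last_node(head):
    # An empty list has no head, so there is no last node to return.
    if head is None:
        return None
    current = head
    # Follow the next pointers until reaching the node that has no successor.
    while current.next is not None:
        current = current.next
    return current


head = Node(1, Node(2, Node(3)))
print(last_node(head).value)
```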

8. You are given two rectangles a and b, each defined by four ordered pairs denoting their corners on the xy plane. Write a function rectangle_overlap to determine whether or not they overlap. Return True if so, and False otherwise.

Note: If the two rectangles border one another or share a corner like two diagonally adjacent positions on a chessboard, they are said to overlap.

Note: The lists of ordered pairs are in no particular order. The first entry in list a could be the top left corner, while the first in list b is the bottom right.

Example:

Input:

a = [(-3,5), (-3,2),(0,5),(0,2)]
b = [(-1,4), (3,4), (3,1), (-1,1)]

Output:

def rectangle_overlap(a, b) -> True

Point (0,2) is fully contained in rectangle b, and point (-1,4) is fully contained in rectangle a.
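A sketch under the assumption that both rectangles are axis-aligned, as they are in the example. Because the corners arrive in no particular order, the code first reduces each rectangle to its min and max coordinates, then checks that the two x-intervals and the two y-intervals both intersect. The comparisons are inclusive so that shared borders and corners count as overlap.

```python
def rectangle_overlap(a, b):
    def bounds(corners):
        # Corner order is arbitrary, so derive the extent from min/max.
        xs = [x for x, _ in corners]
        ys = [y for _, y in corners]
        return min(xs), max(xs), min(ys), max(ys)

    a_left, a_right, a_bottom, a_top = bounds(a)
    b_left, b_right, b_bottom, b_top = bounds(b)
    # Overlap requires the intervals to intersect on both axes.
    return (
        a_left <= b_right and b_left <= a_right
        and a_bottom <= b_top and b_bottom <= a_top
    )


a = [(-3, 5), (-3, 2), (0, 5), (0, 2)]
b = [(-1, 4), (3, 4), (3, 1), (-1, 1)]
print(rectangle_overlap(a, b))
```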

9. You have an array of integers, nums, of length n, spanning 0 to n with one value missing. Write a function missing_number that returns the missing number in the array.

Note: The complexity of O(n) is required.

Example:

Input:

nums = [0,1,2,4,5]
missing_number(nums) -> 3
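One way to meet the O(n) requirement is to compare the sum of the array with the sum of all integers from 0 to n, which is n(n + 1) / 2. The difference is the missing value. A minimal sketch:

```python
def missing_number(nums):
    n = len(nums)
    # Sum of 0..n if nothing were missing.
    expected_total = n * (n + 1) // 2
    # A single pass over the array keeps this O(n).
    return expected_total - sum(nums)


print(missing_number([0, 1, 2, 4, 5]))
```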

10. The probability that it will rain tomorrow is dependent on whether or not it is raining today and whether or not it rained yesterday. Given that it is raining today and that it rained yesterday, write a function rain_days to calculate the probability that it will rain on the nth day after today.

Example:

Input:

n=5

Output:

def rain_days(n) -> 0.39968

Data Structures Interview Questions

In addition to algorithms and coding, data structure fundamentals—especially trees, lists, and maps—also contribute to successful data science projects. We have a plethora of data structure interview questions in our database, some of which are:

11. Given two strings A and B, write a function can_shift to return whether or not A can be shifted some number of places to get B.

Example:

Input:

A = 'abcde'
B = 'cdeab'
can_shift(A, B) == True

A = 'abc'
B = 'acb'
can_shift(A, B) == False
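A compact sketch relies on the fact that every rotation of A appears as a substring of A concatenated with itself. Checking the lengths first rules out shorter strings that merely happen to occur inside the doubled string.

```python
def can_shift(a, b):
    # Equal length is required; then any rotation of a is a substring of a + a.
    return len(a) == len(b) and b in a + a


print(can_shift("abcde", "cdeab"))
print(can_shift("abc", "acb"))
```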

12. Build a random forest model from scratch with the following conditions:

  • The model takes as input a dataframe data and an array new_point with a length equal to the number of fields in the data.
  • All values of both data and new_point are 0 or 1, i.e., all fields are dummy variables and there are only two classes.
  • Rather than randomly deciding what subspace of the data each tree in the forest will use, like usual, make your forest out of decision trees that go through every permutation of the value columns of the data frame. Split the data according to the value seen in new_point for that column.
  • Return the majority vote on the class of new_point.
  • You may use pandas and NumPy but NOT scikit-learn.

Bonus: The permutations function in the itertools package can help you easily get all permutations of any iterable object.

Example:

Input:

new_point = [0,1,0,1]
print(data)
...
    Var1  Var2  Var3  Var4  Target
0    1.0   1.0   1.0   0.0       1
1    0.0   0.0   0.0   0.0       0
2    1.0   0.0   1.0   0.0       0
3    0.0   1.0   1.0   1.0       1
4    1.0   0.0   1.0   0.0       0
..   ...   ...   ...   ...     ...
95   0.0   1.0   0.0   1.0       0
96   1.0   1.0   0.0   0.0       0
97   0.0   0.0   1.0   1.0       0
98   1.0   0.0   0.0   0.0       0
99   0.0   1.0   0.0   0.0       0

[100 rows x 5 columns]

Output:

def random_forest(new_point, data) -> 0

13. Write a function find_intersecting to find which lines, if any, intersect with any of the others in the given x_range.

Say you are given a list of tuples where the first element is the slope of a line and the second element is the y-intercept of a line.

Example:

Input:

tuple_list = [(2, 3), (-3, 5), (4, 6), (5, 7)]
x_range = (0, 1)

Output:

def find_intersecting(tuple_list, x_range) ->  [(2,3), (-3,5)]
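A brute-force sketch compares every pair of lines. Two lines y = m1·x + b1 and y = m2·x + b2 with different slopes cross at x = (b2 - b1) / (m1 - m2). The code assumes the x_range bounds are inclusive and skips pairs with equal slopes; both choices are assumptions, since the question does not specify them.

```python
from itertools import combinations


def find_intersecting(tuple_list, x_range):
    x_min, x_max = x_range
    hits = set()
    # Compare each pair of lines once, tracking their positions in the input.
    for (i, (m1, b1)), (j, (m2, b2)) in combinations(enumerate(tuple_list), 2):
        if m1 == m2:
            # Equal slopes: treated as non-intersecting in this sketch.
            continue
        x = (b2 - b1) / (m1 - m2)
        if x_min <= x <= x_max:
            hits.update((i, j))
    # Return the intersecting lines in their original order.
    return [tuple_list[i] for i in sorted(hits)]


tuple_list = [(2, 3), (-3, 5), (4, 6), (5, 7)]
print(find_intersecting(tuple_list, (0, 1)))
```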

14. Build a k-nearest neighbors classification model from scratch with the following conditions:

  • Use Euclidean distance (a.k.a., the “2 norm”) as your closeness metric.
  • Your function should be able to handle data frames of many arbitrary rows and columns.
  • If there is a tie in the class of the k-nearest neighbors, rerun the search using k-1 neighbors instead.
  • You may use pandas and NumPy but NOT scikit-learn.

Example:

Input:

k = 5
new_point = [0.5,-2,8]
print(data)
...
        Var1      Var2      Var3  Target
0  -3.279536  3.362223  2.847892       2
1  -0.791565  1.742475  2.151587       2
2  -0.785992 -0.938681 -0.459770       0
3  -1.068190  1.461051  0.127130       3
4  -0.367568 -0.870240 -0.225734       0
..       ...       ...       ...     ...
95 -1.327175  1.971085 -0.690689       2
96 -3.203714  1.847649  0.778901       2
97 -0.587640  0.647458  2.094385       2
98  0.363644 -0.509795  2.514191       1
99 -0.673498  2.955285  2.102122       4

[100 rows x 4 columns]

Output:

def kNN(k, new_point, data) -> 2
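The conditions above can be sketched as follows. The code assumes the class label lives in a column named Target, as in the printout, and that every other column is a numeric feature. The small data frames at the bottom are hypothetical and only there to exercise the function.

```python
from collections import Counter

import numpy as np
import pandas as pd


def kNN(k, new_point, data):
    # Assumption: the label column is called "Target"; the rest are features.
    features = data.drop(columns="Target").to_numpy(dtype=float)
    labels = data["Target"].to_numpy()
    # Euclidean (2-norm) distance from every row to the new point.
    distances = np.linalg.norm(features - np.asarray(new_point, dtype=float), axis=1)
    order = np.argsort(distances, kind="stable")
    while k >= 1:
        votes = Counter(labels[order[:k]].tolist()).most_common()
        # A single class, or a strict lead for the top class, means no tie.
        if len(votes) == 1 or votes[0][1] > votes[1][1]:
            return votes[0][0]
        # Tie between the top classes: rerun with k - 1 neighbors.
        k -= 1
    raise ValueError("k must be at least 1")


# Hypothetical toy data for illustration.
toy = pd.DataFrame({
    "Var1": [0.0, 0.1, 5.0, 5.1, 0.2],
    "Var2": [0.0, 0.1, 5.0, 5.2, 0.0],
    "Target": [0, 0, 1, 1, 0],
})
print(kNN(3, [0.05, 0.05], toy))
```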

15. Given a dictionary with keys of letters and values of a list of letters, write a function closest_key to find the key with the input value closest to the beginning of the list.

Example:

Input:

dictionary = {
    'a' : ['b','c','e'],
    'm' : ['c','e'],
}
input = 'c'

Output:

closest_key(dictionary, input) -> 'm'

c is at a distance of 1 from a and 0 from m. Hence, the closest key for c is m.
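A short sketch of this lookup follows. The parameter is named value rather than input to avoid shadowing Python's built-in, and returning None when no list contains the value is an assumption, since the question does not cover that case.

```python
def closest_key(dictionary, value):
    best_key = None
    best_index = None
    for key, letters in dictionary.items():
        if value in letters:
            # Position of the value in this key's list (0 = at the beginning).
            index = letters.index(value)
            if best_index is None or index < best_index:
                best_key, best_index = key, index
    return best_key


dictionary = {
    "a": ["b", "c", "e"],
    "m": ["c", "e"],
}
print(closest_key(dictionary, "c"))
```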

NumPy Data Science Coding Interview Questions

NumPy is a fundamental Python library for scientific computing that provides high-performance multidimensional array objects and tools for working with these arrays. It is an upgrade to Python’s built-in lists for mathematical calculations on large datasets. We have an extensive list of NumPy Interview Questions, some of which are discussed here:

16. How can you initialize a three-dimensional array in NumPy? Give an example.

17. How can we reverse a NumPy array? Give an example.

18. How can we reshape a NumPy array? Give an example.
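The three questions above (16 to 18) can each be illustrated in a line or two. The array values and shapes below are arbitrary choices for the example.

```python
import numpy as np

# Question 16: two ways to create a three-dimensional array.
zeros_3d = np.zeros((2, 3, 4))  # shape given as a tuple
nested_3d = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])  # from nested lists

# Question 17: reverse a one-dimensional array with a negative-step slice.
arr = np.array([1, 2, 3, 4, 5, 6])
reversed_arr = arr[::-1]  # np.flip(arr) gives the same result

# Question 18: reshape six elements into two rows of three.
reshaped = arr.reshape(2, 3)

print(zeros_3d.shape, nested_3d.shape)
print(reversed_arr)
print(reshaped)
```

The total number of elements must stay the same when reshaping, so a six-element array can become 2 x 3 or 3 x 2 but not 4 x 2.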

19. Given a list of integers, write a function gcd to find the greatest common divisor between them.

Example:

Input:

int_list = [8, 16, 24]

Output:

def gcd(int_list) -> 8
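One concise sketch folds the standard library's two-argument math.gcd over the list, since the greatest common divisor of a whole list equals the pairwise result accumulated from left to right. It assumes the list is non-empty; the alias pair_gcd is just a local name chosen to avoid clashing with the function being defined.

```python
from functools import reduce
from math import gcd as pair_gcd


def gcd(int_list):
    # reduce applies pair_gcd cumulatively: gcd(gcd(a, b), c), and so on.
    return reduce(pair_gcd, int_list)


print(gcd([8, 16, 24]))
```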

20. Given a NumPy array of integers and an integer called num, remove all elements with an instance lower than num. As much as possible, reduce the dependence on Python loops and utilize NumPy functions.

Machine Learning Data Science Coding Interview Questions

Machine learning aids data scientists when they need to gather information faster and assists with trend analysis. While your involvement in building or “coding” ML models will be determined by the company and the type of role you hold, data scientists are generally not expected to approach machine learning interview questions from a strict development standpoint. However, you may be expected to answer algorithm coding questions, such as:

21. Write a function, search_list, that returns a Boolean indicating whether the target value is in the linked_list.

You receive the head of the linked list, which is a dictionary with the following keys:  value (contains the value of the node) and next (contains the next node in the list, or None).

If the linked list is empty, you’ll receive None since there is no head node for an empty list.

Example:

Input:

target = 2
linked_list = 3 -> 2 -> 5 -> 6 -> 8 -> None

Output:

search_list(target, linked_list) -> True
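A minimal sketch that walks the dictionary-based nodes described above. The nested dictionary at the bottom builds the example list 3 -> 2 -> 5 -> 6 -> 8 by hand.

```python
def search_list(target, linked_list):
    node = linked_list
    # An empty list arrives as None, so the loop body never runs for it.
    while node is not None:
        if node["value"] == target:
            return True
        node = node["next"]
    return False


linked_list = {
    "value": 3,
    "next": {
        "value": 2,
        "next": {
            "value": 5,
            "next": {"value": 6, "next": {"value": 8, "next": None}},
        },
    },
}
print(search_list(2, linked_list))
```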

22. Given two strings A and B, write a function can_shift to return whether or not A can be shifted some number of places to get B.

Example:

Input:

A = 'abcde'
B = 'cdeab'
can_shift(A, B) == True

A = 'abc'
B = 'acb'
can_shift(A, B) == False

23. You’re given two words, begin_word and end_word, which are elements of word_list.

Write a function shortest_transformation to find the length of the shortest transformation sequence from begin_word to end_word through the elements of word_list.

Note: Only one letter can be changed at a time, and each transformed word in the list must exist inside of word_list.

Note: In all test cases, a path does exist between begin_word and end_word.

Example:

Input:

begin_word = "same"
end_word = "cost"
word_list = ["same","came","case","cast","lost","last","cost"]

Output:

def shortest_transformation(begin_word, end_word, word_list) -> 5

Since the transformation sequence would be:

'same' -> 'came' -> 'case' -> 'cast' -> 'cost'

which is five elements long.
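Treating each word as a graph node and each one-letter change as an edge, a breadth-first search finds the shortest sequence. The sketch below counts words in the sequence, matching the example, and returns None if no path exists, which is an assumption because the question guarantees a path.

```python
from collections import deque


def shortest_transformation(begin_word, end_word, word_list):
    def one_letter_apart(w1, w2):
        return len(w1) == len(w2) and sum(c1 != c2 for c1, c2 in zip(w1, w2)) == 1

    # Each queue entry holds a word and the number of words used to reach it.
    queue = deque([(begin_word, 1)])
    visited = {begin_word}
    while queue:
        word, length = queue.popleft()
        if word == end_word:
            return length
        for candidate in word_list:
            if candidate not in visited and one_letter_apart(word, candidate):
                visited.add(candidate)
                queue.append((candidate, length + 1))
    return None  # Assumption: signal "no path" with None.


word_list = ["same", "came", "case", "cast", "lost", "last", "cost"]
print(shortest_transformation("same", "cost", word_list))
```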

24. Given two strings, string1 and string2, write a function max_substring to return the maximal substring shared by both strings.

Example:

Input:

string1 = 'mississippi'

string2 = 'mossyistheapple'

Output:

def max_substring(string1, string2) ->  'mssispp'

Note: If there are multiple max substrings with the same length, just return any one of them.
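The example output 'mssispp' is not a contiguous run of characters in either string, so it corresponds to the longest common subsequence rather than a contiguous substring. The sketch below assumes that reading and uses the standard dynamic-programming table, then walks the table to rebuild one optimal answer.

```python
def max_substring(string1, string2):
    rows, cols = len(string1), len(string2)
    # lengths[i][j] = length of the longest common subsequence
    # of string1[i:] and string2[j:].
    lengths = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows - 1, -1, -1):
        for j in range(cols - 1, -1, -1):
            if string1[i] == string2[j]:
                lengths[i][j] = 1 + lengths[i + 1][j + 1]
            else:
                lengths[i][j] = max(lengths[i + 1][j], lengths[i][j + 1])

    # Walk the table from the start, collecting matched characters.
    result = []
    i = j = 0
    while i < rows and j < cols:
        if string1[i] == string2[j]:
            result.append(string1[i])
            i += 1
            j += 1
        elif lengths[i + 1][j] >= lengths[i][j + 1]:
            i += 1
        else:
            j += 1
    return "".join(result)


print(max_substring("mississippi", "mossyistheapple"))
```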

25. Given a sorted list of integers ints with no duplicates, write an efficient function nearest_entries that takes in integers N and k.

Additionally, it should do the following:

  • Find the element of the list closest to N.
  • Then, it returns that element along with the k-next and k-previous elements of the list.
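The two steps above can be sketched with a binary search, which keeps the lookup at O(log n) on the sorted list. Passing ints as the first argument, preferring the smaller element when two are equally close, and clipping the window at the ends of the list are all assumptions made for this illustration.

```python
from bisect import bisect_left


def nearest_entries(ints, N, k):
    if not ints:
        return []
    # Index of the first element that is >= N.
    pos = bisect_left(ints, N)
    if pos == len(ints):
        closest = pos - 1
    elif pos > 0 and N - ints[pos - 1] <= ints[pos] - N:
        # The left neighbor is at least as close (ties go to the smaller value).
        closest = pos - 1
    else:
        closest = pos
    # Take k elements on each side, clipped to the bounds of the list.
    start = max(0, closest - k)
    return ints[start:closest + k + 1]


ints = [1, 3, 5, 8, 12, 20]
print(nearest_entries(ints, 9, 1))
```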

26. You’ve been asked to generate a machine learning model that can map the legal first name of a person to likely nicknames they might have. How do you go about designing this model?

27. Design a machine learning model, which, given a set of health features, classifies whether the individual will undergo major health issues.

Tips to Ace Data Science Coding Interview Questions

To excel in data science coding interviews, focus on a strong foundation in data structures and algorithms.

Practice coding regularly on our platform and utilize our AI Interviewer feature. Understand the trade-offs between different approaches and articulate your thought process clearly, especially in ML coding questions.

Emphasize code readability, efficiency, and test case considerations.

Additionally, delve deep into Python libraries like NumPy, pandas, and scikit-learn for efficient data manipulation and modeling. All the best!