Top 75 Python Data Science Interview Questions (Updated for 2025)

Introduction

Python interview questions feature prominently in data science technical interviews. During a typical interview, you will likely be asked questions covering key Python coding concepts. Start your practice with these newly updated Python data science interview questions covering statistics, probability, string parsing, NumPy/matrices, and Pandas.

In 2025, employers are looking for data scientists who are not only highly skilled in Python and its associated libraries but can also apply modern data engineering, cloud technologies, machine learning techniques, and ethical considerations. Collaboration, adaptability, and the ability to communicate insights clearly remain critical, as do scalability, automation, and business acumen. As the data landscape continues to grow, proficiency in cloud infrastructure and distributed systems, combined with an ethical approach to data, will set candidates apart.

Python interview questions for data analysts and data scientists range from explaining the difference between a list and a tuple to finding all bigrams in a sentence or implementing the K-means algorithm from scratch.

The most commonly covered topics in Python data science interview questions include statistics and probability, string manipulation, Pandas, data manipulation, NumPy and matrices, data structures and algorithms, and machine learning.

Why Is Python Asked in Data Science Interviews?

Python has reigned as the dominant language in data science over the past few years, displacing former strongholds such as R, Julia, and Scala/Spark. That is thanks in large part to its broad collection of data science libraries (modules), supported by a strong and growing data science community.

One of the main reasons Python is now the preferred language is that its libraries cover the full stack of data science. While each data science language has its own specialties, such as R for data analysis and modeling within academia and Spark and Scala for big data ETLs and production, Python has produced an ecosystem of libraries that all fit nicely together.

At the end of the day, it's much easier to perform full-stack data science without switching languages: running exploratory data analysis, creating graphs and visualizations, building the model, and implementing the deployment, all in one language.

Who Gets Asked Python Questions?

Data scientists, machine learning engineers, and data analysts face Python questions in interviews. However, the difficulty of the question is dependent on the role.

Here’s how Python questions differ between data analysts and data scientists:

  • Data Analyst - Python questions for data analysts are typically easier and scripting-focused. In general, most will be basic Pandas and Python questions.

  • Data Scientist/Data Engineer - More than two-thirds of data scientists use Python daily. Questions include basic concepts like conditions and branching, loops, functions, and object-oriented programming. Data engineer interview questions also focus heavily on common libraries like NumPy, SciPy and Pandas, as well as advanced concepts like regression, K-means clustering and classification.


Basic Python Interview Questions


Although there are plenty of advanced technical questions, be sure you can quickly and competently answer basic questions like “What data types are used in Python?” and “What is a Python dictionary?” You don’t want to get caught stumbling on an answer to a basic Python syntax question.

If possible, direct your responses back to work experiences or Python data science projects you have worked on.


1. What built-in data types are used in Python?

Python uses several built-in data types, including:

  • Number (int, float, and complex)
  • String (str)
  • Tuple (tuple)
  • Range (range)
  • List (list)
  • Set (set)
  • Dictionary (dict)

In Python, data types are used to classify or categorize data; every value has a data type.

2. How are data analysis libraries used in Python? What are some of the most common libraries?

Python is such a popular data science programming language because an extensive collection of data analysis libraries is available. These libraries include functions, tools, and methods for managing and analyzing data. There are Python libraries for performing a wide range of data science functions, including processing image and textual data, data mining, and data visualization. The most widely used Python data analysis libraries include:

3. How is a negative index used in Python?

Negative indexes are used in Python to access lists, arrays, and strings from the end, moving backward toward the first value. An index of -1 refers to the last item, -2 to the second-to-last, and so on. Here's an example of a negative index in Python:

b = "Python Coding Fun"
print(b[-1])
>> n

4. What is the difference between lists and tuples in Python?

Lists and tuples are classes in Python that store one or more objects or values. Key differences include:

  • Syntax – Lists are enclosed in square brackets, and tuples are enclosed in parentheses.
  • Mutable vs. Immutable – Lists are mutable, meaning they can be modified after creation. Tuples are immutable, which means they cannot be modified.
  • Operations – Lists have more functionalities available than tuples, including insert and pop operations and sorting.
  • Size – Because tuples are immutable, they require less memory and are subsequently faster.
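
To make the mutability difference concrete, here is a quick illustration:

nums_list = [1, 2, 3]
nums_list.append(4)   # works: lists are mutable
nums_list[0] = 99     # works: items can be reassigned

nums_tuple = (1, 2, 3)
nums_tuple[0] = 99    # raises TypeError: tuples are immutable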

5. What library would you prefer for plotting, Seaborn or Matplotlib?

Seaborn and Matplotlib are two of the most popular visualization libraries in Python. Seaborn is built on top of Matplotlib and provides a higher-level interface with sensible defaults, so it tends to produce polished statistical plots with less code. Matplotlib, in turn, offers more fine-grained control. A common workflow is to draft quickly in Seaborn and then switch to Matplotlib for fine-tuning.

NOTE: This question asks about preferences. The library you choose might depend on the task or your familiarity with the tool. In other words, there is no right or wrong answer; rather, the interviewer wants to understand your proficiency in creating visualizations in Python.


6. Is Python an object-oriented programming language?

Yes and no. Python is a multi-paradigm language: it supports object-oriented programming (OOP) alongside procedural and functional styles. One reason it isn't considered a "pure" OOP language is that it doesn't enforce strong encapsulation; private attributes are a naming convention rather than a language guarantee.

7. What is the difference between a series and a dataframe in Pandas?

A Series supports only a single list of values with an index, whereas a dataframe is a collection of one or more Series. In other words:

  • Series is a one-dimensional array that supports any datatype (including integers, strings, floats, etc.). In a series, the axis labels are the index.

  • A dataframe is a two-dimensional data structure with columns that can support different data types. It is similar to an SQL table or a dictionary of Series objects, as the example below illustrates.
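
For example, a dataframe can be assembled from Series-like columns (the column names here are illustrative):

import pandas as pd

ages = pd.Series([25, 32, 47], name='age')   # one-dimensional, with an index
df = pd.DataFrame({'age': ages, 'city': ['NYC', 'LA', 'Chicago']})   # two-dimensional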

8. How would you find duplicate values in a dataset for a variable in Python?

You can check for duplicates using the Pandas duplicated() method. This returns a boolean series that is True for duplicated rows; which occurrences are flagged depends on the keep argument.

DataFrame.duplicated(subset=None, keep='last')

In this example, keep determines how duplicates are flagged. You can use:

  • 'first' (the default) - Keeps the first occurrence and marks the rest as duplicates.
  • 'last' - Keeps the last occurrence and marks the rest as duplicates.
  • False - Marks every occurrence of a repeated value as a duplicate.
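
A minimal usage example with a single price column:

import pandas as pd

df = pd.DataFrame({'price': [10, 10, 20]})
print(df.duplicated(keep='first'))
# 0    False
# 1     True
# 2    False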

9. What is a lambda function in Python?

Sometimes called an "anonymous function," a lambda function is like a normal function but is not defined with the def keyword; it is defined with the lambda keyword. Unlike normal functions, lambda functions are restricted to a single expression, although they can take multiple parameters.

Here is an example of both a normal function and a lambda function for the argument x and the expression x + x.

Normal function:

def function_name(x):
    return x + x

Lambda function:

lambda x: x+x

10. Is memory de-allocated when you exit Python?

No. Modules with circular references to other objects are not always freed. It is also impossible to free some of the memory reserved by the C library.

11. What is a compound datatype?

Compound data structures are single variables that represent multiple values. Some of the most common in Python are:

  • Lists - An ordered, mutable collection of values.
  • Tuples - An ordered, immutable sequence of values.
  • Sets - An unordered collection of unique values, where membership is what matters.

12. What is list comprehension in Python? Provide an example.

List comprehension defines and creates a list based on an existing one. For example, if we wanted to separate all the letters in the word “retain” and make each letter a list item, we could use list comprehension:

r_letters = [ letter for letter in 'retain' ]
print( r_letters)

Output:

['r', 'e', 't', 'a', 'i', 'n']

13. What is tuple unpacking? Why is it important?

The short answer: unpacking refers to assigning the elements of a tuple (or any iterable) to multiple variables in a single statement. The * operator can be used in an unpacking assignment to collect any leftover elements into a list.

With unpacking, you can swap variables without using a temporary variable. For example:

x = 20
y = 30

print(f'x={x}, y={y}')

x, y = y, x

print(f'x={x}, y={y}')

Output:

x=20, y=30
x=30, y=20
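
The * operator mentioned above collects leftover elements during unpacking, as in this short illustration:

first, *middle, last = [1, 2, 3, 4, 5]
print(first)   # 1
print(middle)  # [2, 3, 4]
print(last)    # 5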


14. What’s the difference between / and // in Python?

Both / and // are division operators. However, / performs true division, dividing the first operand by the second and returning the result in decimal form. // performs floor division, dividing the first operand by the second and rounding the result down to the nearest whole number.

  • / example: 9 / 2 returns 4.5
  • // example: 9 // 2 returns 4

15. How do you convert integers to strings?

Python's most common way to convert an integer to a string is with the built-in str() function, which converts any data type into a string. However, you can do this in other ways: with an f-string, with the % formatting operator and "%s", or with the str.format() method.
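
All of these approaches side by side:

n = 42
str(n)          # '42'
f'{n}'          # '42' (f-string)
'%s' % n        # '42' (percent formatting)
'{}'.format(n)  # '42' (str.format)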

16. What are arrays in Python?

Arrays store multiple values in a single variable. In everyday Python this role is usually played by lists (the array module and NumPy arrays offer typed alternatives). For example, you could create an array "faang" that includes Facebook, Apple, Amazon, Netflix, and Google.

Example:

faang = ["facebook", "apple", "amazon", "netflix", "google"]
print(faang)

Output:

['facebook', 'apple', 'amazon', 'netflix', 'google']

17. What’s the difference between mutable and immutable objects?

In Python, mutable or immutable refers to whether or not the object’s value can change. Mutable objects can change those values, while immutable objects cannot. Mutable data types include lists, sets, dictionaries, and byte arrays. Immutable data types include numeric data types (boolean, float, etc.), strings, frozensets, and tuples.

18. What are some of the limitations of Python?

Python is limited in a few key ways, including:

  • Speed - Benchmarks show that Python is slower than languages like Java and C++. However, Python can be made faster, for example with an alternative runtime such as PyPy.
  • V2 vs V3 - Python 2 and Python 3 are incompatible.
  • Mobile development - Python is great for desktop and server applications but weaker for mobile development.
  • Memory consumption - Python is not great for memory-intensive applications.

19. Explain the zip() and enumerate() functions.

The enumerate() function returns (index, item) pairs while iterating over lists, dictionaries, sets, and other iterables. The zip() function combines elements from multiple iterables into tuples.
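
A short example of both:

names = ['ada', 'bob']
scores = [90, 85]

for i, name in enumerate(names):        # yields (index, item) pairs
    print(i, name)                      # 0 ada, then 1 bob

for name, score in zip(names, scores):  # pairs up elements positionally
    print(name, score)                  # ada 90, then bob 85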

20. Define PYTHONPATH.

PYTHONPATH tells the Python Interpreter where to locate module files imported into a program. The role is similar to PATH. PYTHONPATH includes both the source library directory and the source code directories.


String Manipulation Python Interview Questions


String parsing questions are probably among the most common in Python data science interviews. These questions focus on how well you can manipulate text data, which almost always needs to be thoroughly cleaned and transformed before analysis.

These types of questions are common for companies that process a lot of text like Twitter, LinkedIn, Indeed or Netflix.


21. Write a function that can take a string and return a list of bigrams.

Example:

sentence = """
Have free hours and love children?
"""
output = [('have', 'free'),
('free', 'hours'),
('hours', 'and'),
('and', 'love'),
('love', 'children?')]

When separating a sentence into bigrams, we first need to split the sentence into individual words. We then loop through the words and append each adjacent pair to the list. How many iterations would that take, for instance, if the number of words in a sentence were equal to k?
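
A minimal sketch, assuming we lowercase the sentence and split on whitespace:

def bigrams(sentence):
    words = sentence.lower().split()
    # pair each word with the word that follows it
    return [(words[i], words[i + 1]) for i in range(len(words) - 1)]

A single pass suffices: k words yield k - 1 bigrams.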

22. Given two strings A and B, return whether or not A can be shifted some number of times to get B.

Example:

A = 'abcde'
B = 'cdeab'
can_shift(A, B) == True
A = 'abc'
B = 'acb'
can_shift(A, B) == False

This problem is relatively simple if we work out the underlying algorithm that allows us to check for string shifts between strings A and B easily. First off, we have to set baseline conditions for string shifting. Strings A and B must have the same length and letters. We can check for the former by setting a condition statement when the lengths of A and B are equivalent.
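
One compact sketch exploits the fact that every rotation of A appears inside A concatenated with itself:

def can_shift(a, b):
    # a rotation of a must be the same length and appear in a doubled copy of a
    return len(a) == len(b) and b in a + a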

23. Given two strings, string1 and string2, determine if a one-to-one character mapping exists between each character of string1 to string2.

Example:

string1 = 'qwe'
string2 = 'asd'
string_map(string1, string2) == True
#q = a, w = s, and e = d

Note: This example would return False if the letters were repeated; for example, string1 = 'donut' and string2 = 'fatty'. This is because the letter t from fatty attempts to map back to two different characters (t = n or t = u).
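
A sketch that enforces the mapping in both directions, which is what rules out the donut/fatty case:

def string_map(string1, string2):
    if len(string1) != len(string2):
        return False
    forward, backward = {}, {}
    for c1, c2 in zip(string1, string2):
        # reject any character that tries to map two different ways
        if forward.get(c1, c2) != c2 or backward.get(c2, c1) != c1:
            return False
        forward[c1] = c2
        backward[c2] = c1
    return True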

24. Given a string, return the first recurring character in it or “None” if there is no recurring character.

Example:

input = "interviewquery"
output = "i"

Since we have to return the character that recurs first, we can go through the string in one loop, saving each unique character in a set and checking whether the current character already exists in that set. If it does, return the character.
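
A sketch of that single pass:

def first_recurring(s):
    seen = set()
    for ch in s:
        if ch in seen:
            return ch   # first character we have encountered before
        seen.add(ch)
    return None

first_recurring('interviewquery')   # 'i'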

25. Given two strings, string1 and string2, write a function is_subsequence to find out if string1 is a subsequence of string2.

Hint: Notice that in the subsequence problem set, one string in this problem will need to be traversed to check for the values of the other string. In this case, it is string2.

The idea to solve this should then be simple. We traverse both strings from one side to the other side, going from leftmost to rightmost. If we find a matching character, we move ahead in both strings. Otherwise, we move ahead only in string2.
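
A two-pointer sketch of that traversal:

def is_subsequence(string1, string2):
    i = 0                                         # position in string1
    for ch in string2:
        if i < len(string1) and string1[i] == ch:
            i += 1                                # matched; advance in string1
    return i == len(string1)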


Python Statistics and Probability Interview Questions

Python statistics and probability questions test your ability to translate stats and probability concepts into code. Both types require knowledge of mathematical concepts and intermediate Python skills.

  • Statistics Python questions - These questions take the form of random sampling from a distribution, generating histograms, and computing different statistical metrics such as standard deviation, mean, or median.
  • Probability Python questions - Probability questions typically focus on concepts like Binomial or Bayes Theorem. Since most probability questions are focused on calculating chances based on a condition, almost all questions can be proven by writing Python code.

26. Write a function to generate N samples from a normal distribution and plot them on the histogram.

This relatively simple Python problem requires setting up a distribution and then generating and plotting n samples from it. We can do this with the SciPy library for scientific computing.

First, declare a standard normal distribution, i.e., mean = 0 and standard deviation = 1. Then generate samples with the rvs(size=n) method.
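
A minimal sketch using scipy.stats and matplotlib (the bin count is an arbitrary choice):

import matplotlib.pyplot as plt
from scipy.stats import norm

def plot_normal_samples(n):
    samples = norm(loc=0, scale=1).rvs(size=n)   # n standard-normal draws
    plt.hist(samples, bins=30)
    plt.show()
    return samples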

27. Write a function that takes in a list of dictionaries with both a key and a list of integers and returns a dictionary with the standard deviation of each list.

input = [
    {
        'key': 'list1',
        'values': [4,5,2,3,4,5,2,3],
    },
    {
        'key': 'list2',
        'values': [1,1,34,12,40,3,9,7],
    }
]
output = {'list1': 1.12, 'list2': 14.19}

Hint: Remember the equation for standard deviation: subtract the mean from each data value, square the difference, sum these squares over all data points, divide by the number of data points, and take the square root:

$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$
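
A sketch using the population form of the formula, rounding to two decimals to match the sample output:

import math

def standard_deviation(input_list):
    output = {}
    for d in input_list:
        values = d['values']
        mean = sum(values) / len(values)
        variance = sum((x - mean) ** 2 for x in values) / len(values)
        output[d['key']] = round(math.sqrt(variance), 2)
    return output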

28. Given a list of stock prices in ascending order by datetime, write a function that outputs the max profit by buying and selling at a specific interval.

stock_prices = [10,5,20,32,25,12]
dts = [
    '2019-01-01', 
    '2019-01-02',
    '2019-01-03',
    '2019-01-04',
    '2019-01-05',
    '2019-01-06',
]
def max_profit(stock_prices,dts) -> 27

There are many ways you could go about solving this problem. A good first step is thinking about what our goal is: if we want the maximum profit, then ideally we want to buy at the lowest possible price and sell at the highest possible price. However, since we cannot go back in time, we have a constraint that our sell date must be after our buy date.
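
A single-pass sketch that tracks the lowest price seen so far (the dts argument goes unused because the prices are already in datetime order):

def max_profit(stock_prices, dts):
    min_price = stock_prices[0]
    best = 0
    for price in stock_prices[1:]:
        best = max(best, price - min_price)   # best profit selling today
        min_price = min(min_price, price)     # cheapest buy seen so far
    return best

max_profit(stock_prices, dts)   # 27 (buy at 5, sell at 32)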

29. Amy and Brad take turns rolling a fair six-sided die. Whoever rolls a “6” first wins the game. Amy starts by rolling first.

What’s the probability that Amy wins on her first roll? Let’s play out the scenario. If she loses, then Brad must lose his first roll so Amy can have a chance to win again.

You know the probability that Amy wins on her first roll is ⅙. What is then the probability of Amy winning on the 3rd roll? 5th roll?
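
As a worked check: Amy wins on roll k only if every earlier roll by either player misses, so P(win on roll 1) = 1/6, P(win on roll 3) = (5/6)^2 * (1/6), and P(win on roll 5) = (5/6)^4 * (1/6). Summing the geometric series gives Amy's overall chance of winning: (1/6) / (1 - 25/36) = 6/11.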

30. Write a function to simulate the overlap of two computing jobs and output an estimated cost.

More context. Every night between 7 p.m. and midnight, two computing jobs from two different sources are randomly started, each lasting an hour. When the jobs run simultaneously at any point in their computations, they cause a failure in some of the company’s other nightly jobs, resulting in downtime for the company that costs $1,000.

The CEO needs a single number representing the annual (365 days) cost of this problem.

Hint. We can model this scenario by implementing two random number generators across a spectrum of 0 to 300 minutes, modeling the time in minutes between 7 p.m. and midnight.
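
A Monte Carlo sketch of that model (the trial count is an arbitrary choice):

import random

def simulated_annual_cost(trials=100_000, cost_per_failure=1_000):
    failures = 0
    for _ in range(trials):
        start_a = random.uniform(0, 300)   # start times, minutes after 7 p.m.
        start_b = random.uniform(0, 300)
        if abs(start_a - start_b) < 60:    # hour-long jobs overlap
            failures += 1
    daily_failure_prob = failures / trials
    return daily_failure_prob * 365 * cost_per_failure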


Python Pandas Interview Questions

While Pandas plays many roles in data science, most Pandas interview questions in Python interviews relate to data cleaning. These questions include one-hot encoding variables, using the Pandas apply() function to group different variables, and text cleaning different columns.

31. Given a dataset of test scores, write Pandas code to return cumulative bucketed scores of <50, <75, <90, <100.

import pandas as pd

def bucket_test_scores(df):
    bins = [0, 50, 75, 90, 100]
    labels = ['<50', '<75', '<90', '<100']
    # pd.cut assigns each score to its cumulative bucket
    df['test score'] = pd.cut(df['test score'], bins, labels=labels)
    return df

32. Given two dataframes (one with addresses and the other with various cities and states), write a function to create a single dataframe with complete addresses.

Hint. In this question, we are given a dataframe full of addresses (in the form of strings) and asked to interpolate state names (more strings) into those addresses.

We will need to match our state names with the cities that they contain. That will require us to perform a simple merge of our two dataframes. But before doing that, we need to split df_addresses to isolate the city part of the address to use in our merge.

33. Given a dataframe of students’ favorite colors and test scores, write a function to select only those rows (students) where their favorite color is green or red and their test grade is above 90.

We need to filter our dataframe by two conditions: grade and favorite color. We can filter our dataframe by grade by setting our dataframe equal to itself with the condition that the grade column is greater than 90:

students_df = students_df[students_df["grade"] > 90]

We want to do the same process for the favorite colors, but the problem is that we have two possible categories for inclusion in the filtered dataframe. How can we write code to include both possibilities in our final dataframe?
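
One way, sketched here with assumed column names ('favorite_color', 'grade'), is to combine a Series.isin() mask with the grade condition:

students_df = students_df[
    students_df['favorite_color'].isin(['green', 'red'])
    & (students_df['grade'] > 90)
]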

34. Given a dataframe with rainfall data (day of the week and rainfall inches), write a function to find the median amount of rainfall for the days on which it rained.

There are two steps to solve the problem:

  • Step 1. Remove all days with no rain.
  • Step 2. Calculate the median of the rainfall column on the remaining rows.

35. You are given a dataframe with the prices of cheeses; however, the dataframe is missing values in the price column. Write a function to impute the median price in place of the missing values.

This problem uses two built-in Pandas methods.

dataframe.column.median()

This returns the median of a column in a dataframe.

dataframe.column.fillna(value)

This replaces all NaN values in the given column with the supplied value.
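
Putting the two methods together, assuming the column is named 'price':

def impute_median(df):
    median_price = df['price'].median()             # median skips NaN by default
    df['price'] = df['price'].fillna(median_price)
    return df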

36. Write a function that returns the maximum number in the list.

Given a list of integers, write a function that returns the maximum number in the list. If the list is empty, return None.

Example 1:

Input:

nums = [1, 7, 3, 5, 6]

Output:

find_max(nums) -> 7

Example 2:

Input:

nums = []

Output:

find_max(nums) -> None
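
A minimal sketch that guards against the empty list before calling max():

def find_max(nums):
    if not nums:
        return None   # an empty list has no maximum
    return max(nums)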


Python Data Manipulation Interview Questions


Data manipulation questions are common Python data engineer interview questions. They cover techniques that transform data outside of NumPy or Pandas. This is common when designing ETLs and transforming data between raw JSON and database reads.

Many times, these types of transformations will require grouping, sorting, or filtering data using lists, dictionaries, and other Python data structure types. These questions test your general knowledge of Python data munging outside of actual Pandas formatting.

37. Given a list of timestamps in sequential order, return a list of lists grouped by week (seven days) using the first timestamp as the starting point.

This question sounds like it should be an SQL question. Weekly aggregation implies a form of GROUP BY in a regular SQL or Pandas question. In either case, aggregation on a dataset of this form by week would be pretty trivial.

However, as a scripting question, this task is trying to pry out if the candidate is comfortable dealing with unstructured data, as data scientists may be forced to deal with a lot of unstructured data depending on their specific role or company.

In this function, we have to do a few things:

  1. Loop through all of the datetimes.
  2. Set a beginning timestamp as our reference point.
  3. Check if the next timestamp in the array is more than seven days ahead of the reference point. If so, set that timestamp as the new reference point and start a new group; if not, continue looping and append the value to the current group (see the sketch below).
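
A sketch of that loop, assuming ts_list holds datetime objects in ascending order:

from datetime import timedelta

def group_by_week(ts_list):
    groups = []
    current = []
    week_start = ts_list[0]               # first timestamp is the reference point
    for ts in ts_list:
        if ts - week_start >= timedelta(days=7):
            groups.append(current)        # close out the finished week
            current = []
            week_start = ts               # new reference point
        current.append(ts)
    if current:
        groups.append(current)
    return groups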

38. Given a dictionary consisting of many roots and a sentence, stem all the words in the sentence with the root forming it.

This Python question explores the concept of stemming, which is the heuristic of chopping off the end of a word to clean and bucket it into an easier feature set.

Input:

roots = ["cat", "bat", "rat"]
sentence = "the cattle was rattled by the battery"

Output:

"the cat was rat by the bat"

39. Given two dictionaries (friends_added and friends_removed), write a function to list the pairs of friends with corresponding beginning and ending timestamps.

Hint. You are only looking for friendships that have an end date. Because of this, every friendship that will be in our final output is contained within the friends_removed list.

If you start by iterating through the friends_removed dictionary, you will already have the id pair and the end date of each listing in our final output. Next, you just need to find the corresponding start date for each end date.


Matrices and NumPy Python Interview Questions

Many data science problems involve working with the NumPy library and matrices. Matrices and NumPy interview questions are not as common as the others but still show up, especially for specialized roles such as computer vision. They involve using NumPy to run matrix multiplication, calculate the Jacobian determinant, and transform matrices in some way or form.

40. What is NumPy used for? What are its benefits?

NumPy is an open-source library that is used to analyze data and includes support for Python’s multi-dimensional arrays and matrices. NumPy is used for a variety of mathematical and statistical operations.

41. Compute the inverse of a matrix in NumPy.

You can find the inverse of any square matrix with the numpy.linalg.inv(array) function. In this case, the ‘array’ would be the matrix to be inverted.

42. Write a function to return a 5-by-5 matrix containing the number of employees employed in each department compared to the total number of employees at each company.

More context. Let’s say we have a five-by-five matrix num_employees where each row is a company and each column represents a department. Each matrix cell displays the number of employees working in that particular department at each company.

To reconstruct the new array, loop through every cell in a department and divide by the total number of employees of the whole company, which is the sum of the whole row.
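
With NumPy broadcasting, the row totals can divide every cell in one vectorized step:

import numpy as np

def employee_proportions(num_employees):
    m = np.asarray(num_employees, dtype=float)
    # divide each cell by its row total, i.e., that company's headcount
    return m / m.sum(axis=1, keepdims=True)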

43. Given an array filled with random values, write a function rotate_matrix to rotate the array by 90 degrees in the clockwise direction.

There are two approaches to this problem. The first would be to analyze how exactly a 90-degree clockwise rotation changes the index of each entry in the matrix. The second is to think of a series of simpler matrix transformations that amount to a 90-degree clockwise rotation when performed in succession.
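
Illustrating the second approach: a transpose followed by reversing each row amounts to a 90-degree clockwise rotation (NumPy's np.rot90 covers the counterclockwise case):

import numpy as np

def rotate_matrix(matrix):
    return np.asarray(matrix).T[:, ::-1]   # transpose, then reverse each row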


Python Data Structures and Algorithms Interview Questions

Python data structures interview questions assess your ability to implement algorithms in Python. In general, there are two types of questions: algorithmic coding problems and writing algorithms from scratch.

44. Write a function shortest_transformation to find the length of the shortest transformation sequence from begin_word to end_word through the elements of word_list.

Input:

begin_word = "same",
end_word = "cost",
word_list = ["same","came","case","cast","lost","last","cost"]

Output:

def shortest_transformation(begin_word, end_word, word_list) -> 5

Since the transformation sequence would be:

'same' -> 'came' -> 'case' -> 'cast' -> 'cost'

Generally, shortest path algorithms require the solution to recursively try every possible matching path from the start to the end.

In this question, we have a few constraints.

  1. Every word in word_list is of the same length.

  2. The max difference between two words in the path is only one letter change.

45. Given a dictionary with keys of letters and values of a list of letters, write a function closest_key to find the key with the input value closest to the beginning of the list.

Input:

dictionary = {
    'a' : ['b','c','e'],
    'm' : ['c','e'],
}
input = 'c'

Output:

closest_key(dictionary, input) -> 'm'

With this question, ask: Is your computed distance always positive? Negative values for distance (for example, between ‘c’ and ‘a’ instead of ‘a’ and ‘c’) will interfere with getting an accurate result.
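
One sketch sidesteps negative distances entirely by comparing list indexes, which are always non-negative:

def closest_key(dictionary, target):
    best_key, best_index = None, float('inf')
    for key, letters in dictionary.items():
        if target in letters:
            idx = letters.index(target)   # position of target in this list
            if idx < best_index:
                best_key, best_index = key, idx
    return best_key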

46. Given two strings, string1 and string2, write a function max_substring to return the maximal substring shared by both strings.

Input:

string1 = 'mississippi'

string2 = 'mossyistheapple'

The idea is that we need to try every matching substring of string1 and string2.

So, for example, if we have string1 = abbc, string2 = acc, we can take the first letter of string1, a, and look for a match in string2. Once we find one, we are left with the same problem with a smaller portion of the two strings. The remaining part of string1 will be bbc and string2 cc, and we repeat the process.

  • In the second iteration, we compare the first b of bbc with cc and find no match.
  • In the third iteration, we compare the second b of bbc with cc and again find no match.
  • Finally, the c of bbc matches the first c of cc.
  • Having traversed all of string1, the result is ac.


Python Machine Learning Interview Questions

Python machine learning questions tend to focus on model building and deployment and, in particular, assess your ability to implement algorithms in Python. There are two types of questions: algorithmic coding problems and writing algorithms from scratch.

47. Develop a k-means clustering algorithm in Python from the ground up.

You are provided with:

  • A two-dimensional NumPy array data_points consisting of an arbitrary number of data points (rows) n and an arbitrary number of columns m.
  • The number of clusters, k.
  • The initial centroids value for the data points in each cluster, initial_centroids.

Return a list of the cluster to which each point belongs in the original list data_points, maintaining the same order (as an integer).

Example


# Input
data_points = [(0,0),(3,4),(4,4),(1,0),(0,1),(4,3)]
k = 2
initial_centroids = [(1,1),(4,5)]

# Output
k_means_clustering(data_points, k, initial_centroids) -> [0,1,1,0,0,1]
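
A compact sketch of the standard Lloyd iteration (the iteration cap is an arbitrary safeguard):

import numpy as np

def k_means_clustering(data_points, k, initial_centroids, max_iter=100):
    points = np.asarray(data_points, dtype=float)
    centroids = np.asarray(initial_centroids, dtype=float)
    for _ in range(max_iter):
        # distance from every point to every centroid, then nearest assignment
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # recompute each centroid as the mean of its assigned points
        new_centroids = np.array([
            points[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break                          # converged
        centroids = new_centroids
    return labels.tolist()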

48. Build a K-nearest neighbors classification model from scratch with the following conditions:

  • Use Euclidean distance (the “2 norm”) as your closeness metric.
  • Your function should be able to handle data frames of arbitrarily many rows and columns.
  • If there is a tie in the class of the k nearest neighbors, rerun the search using k-1 neighbors instead.
  • You may use pandas and numpy but NOT scikit-learn.

Example Output:

def kNN(k,data,new_point) -> 2

49. Build a random forest model from scratch.

The model should have these conditions:

  • The model takes as input a dataframe df and an array new_point with a length equal to the number of fields in the df.
  • All values of df and new_point are 0 or 1, i.e., all fields are dummy variables, and only two classes exist.
  • Rather than randomly deciding what subspace of the data each tree in the forest will use like usual, make your forest out of decision trees that go through every permutation of the value columns of the data frame and split the data according to the value seen in new_point for that column.
  • Return the majority vote on the class of new_point.
  • You may use pandas and NumPy, but NOT scikit-learn.


But that’s not all. Here are some additional questions that interviewers may ask. These questions cover a range of topics in data processing, data pipelines, data storage, and infrastructure and test your knowledge of modern Python libraries and best practices. Each answer addresses how you would approach real-world scenarios while using Python to process, store, and manage data at scale.

They test your technical knowledge of Python and libraries and your ability to design scalable and efficient systems for data science applications.

Python - Data Processing

50. How do you handle missing data in a dataset using Python?

In Python, you can handle missing data using libraries like pandas. Common methods include filling missing values with the fillna() function, dropping rows with dropna(), or replacing missing values using interpolation. The method you choose depends on the type of data and the analysis you're performing.
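
A few of those methods side by side (the 'price' column is illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({'price': [1.0, np.nan, 3.0]})
df['price'].fillna(0)        # fill gaps with a constant
df.dropna()                  # drop rows containing NaN
df['price'].interpolate()    # linearly interpolate the gap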

51. What are some common Python libraries you would use for data processing and why?

Common Python libraries for data processing include pandas for data manipulation, numpy for numerical operations, and scipy for advanced scientific computations. pandas is particularly strong for tabular data and allows for efficient filtering, grouping, and aggregation. For large datasets or out-of-core processing, I would use Dask or Vaex, as they are designed for distributed processing and can handle larger-than-memory datasets.

52. How would you clean and preprocess a large dataset for machine learning in Python?

You can clean and preprocess data using libraries like pandas and scikit-learn. Start by handling missing values with fillna() or dropna(), encoding categorical variables using LabelEncoder or OneHotEncoder, and scaling numeric values with StandardScaler. For large datasets, you may also use Dask or PySpark to handle data efficiently in a distributed manner.

53. How do you convert a list of dictionaries into a Pandas DataFrame in Python?

You can convert a list of dictionaries into a DataFrame using the pandas.DataFrame() constructor, which directly accepts a list of dictionaries. Each dictionary will be treated as a row, with the dictionary keys becoming the column names. For example: df = pd.DataFrame(list_of_dicts), where list_of_dicts is the list you want to convert.

54. How would you optimize a data processing pipeline when working with large volumes of data in Python?

To optimize data processing with large datasets, you can use parallel computing libraries like Dask or PySpark, which allow you to process data in parallel across multiple cores or machines. You should also ensure that you're minimizing memory usage by using efficient data types (e.g., the category dtype in pandas), avoiding unnecessary data copies, and performing operations in a vectorized manner. Lastly, writing intermediate results to disk or using a distributed file system like HDFS can help manage data size effectively.

55. Explain how you would handle skewed data during preprocessing in Python.

To handle skewed data, I would first assess the distribution using visualizations like histograms or boxplots. If the data is positively skewed, I could apply a log transformation (np.log()) or use quantile transformation (QuantileTransformer) to normalize the distribution. For extreme cases, I might consider removing outliers or applying robust scaling techniques like RobustScaler in scikit-learn to minimize the effect of skewness.
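
For instance, a log transform of a positively skewed column might look like this (np.log1p is used so zero values are handled gracefully):

import numpy as np
import pandas as pd

df = pd.DataFrame({'income': [20_000, 35_000, 40_000, 1_500_000]})
df['income_log'] = np.log1p(df['income'])   # compresses the long right tail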

Python - Data Pipelines

56. What Python libraries would you use to automate data pipelines, and why?

To automate data pipelines, libraries like Airflow, Luigi, and Prefect are commonly used. These tools allow for the orchestration of tasks, scheduling, and workflow management, which are crucial for building reproducible and scalable data pipelines. They also support error handling, retries, and integration with various data sources and storage systems.

57. What is a data pipeline, and why is it important in data science?

A data pipeline is a series of automated processes used to collect, process, store, and analyze data. It’s essential in data science because it ensures that data is clean, consistent, and continuously available for analysis, reducing manual work and errors. A well-designed pipeline also facilitates reproducibility and scalability, especially in production environments.

58. Can you explain how you would build a data pipeline that processes and loads data into a data warehouse using Python?

A typical Python-based data pipeline involves extracting data from source systems using libraries like pandas, requests, or SQLAlchemy for databases, followed by data processing steps (e.g., cleaning, transformation). Once processed, you can load the data into a data warehouse using libraries like pyodbc or SQLAlchemy for SQL-based systems, or tools like boto3 for cloud data storage (e.g., AWS Redshift). To manage this process, you can use Apache Airflow to orchestrate, schedule, and monitor the pipeline.

59. How do you schedule a data pipeline using Apache Airflow?

In Apache Airflow, a data pipeline is defined as a Directed Acyclic Graph (DAG), where each task is a Python function or operator. To schedule a pipeline, you can set the schedule_interval parameter to specify when the DAG should run (e.g., @daily, @hourly). Once defined, you can use the Airflow UI to monitor, trigger, and manage the DAG executions.
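
A minimal DAG sketch in the Airflow 2.x style (the dag_id and task body are placeholders):

from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    pass   # placeholder extraction step

with DAG(
    dag_id='example_pipeline',
    start_date=datetime(2025, 1, 1),
    schedule_interval='@daily',   # run once per day
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id='extract', python_callable=extract)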

60. How would you ensure data pipeline reliability and scalability when processing large datasets in Python?

To ensure reliability and scalability, I would build a distributed pipeline using frameworks like Apache Spark or Dask to process large datasets in parallel. For fault tolerance, I would incorporate error handling, retries, and monitoring into the pipeline using tools like Apache Airflow or Prefect. Additionally, storing intermediate data in distributed storage like Amazon S3 or HDFS allows for scalability and ensures that data processing can be resumed after failures.

61. How would you optimize an existing data pipeline that processes multiple terabytes of data daily in Python?

To optimize such a data pipeline, I would implement parallel processing using tools like Dask or Apache Spark to distribute the workload across multiple nodes. I would also consider streamlining data transformations by using optimized formats like Parquet or ORC and compressing intermediate data to reduce storage and I/O overhead. Finally, I would implement fault tolerance and retry mechanisms using Apache Airflow to handle failures efficiently and ensure continuous data flow.

Python - Data Storage

62. How do you handle storing and retrieving large datasets in Python?

For handling large datasets, I would use formats like CSV, Parquet, or HDF5, which are optimized for performance and storage. The pandas library can be used to read and write these formats, while pyarrow or fastparquet are useful for Parquet files. For large-scale datasets, cloud storage solutions like AWS S3 or Google Cloud Storage are great choices, as they provide scalable and accessible storage.

63. How do you save a Pandas DataFrame as a CSV file in Python?

You can save a Pandas DataFrame as a CSV file using the to_csv() method, like this: df.to_csv('filename.csv', index=False). The index=False argument ensures that the index is not saved as an extra column in the CSV. If needed, you can customize delimiters, quoting, or header options using additional arguments.

64. Explain how you would use SQLAlchemy to interact with a relational database in Python.

SQLAlchemy is a powerful library for interacting with relational databases, allowing you to write SQL queries in Python code while abstracting away the database connection details. You can use the create_engine() function to connect to a database, and then interact with it using the ORM (Object-Relational Mapping) layer or raw SQL queries. This allows for efficient data retrieval, manipulation, and storage, and supports multiple databases like MySQL, PostgreSQL, and SQLite.

65. Explain how to read and write data to/from a SQL database in Python.

To interact with a SQL database in Python, I would use SQLAlchemy, or sqlite3 for simpler cases. First, you create a connection using create_engine() (from SQLAlchemy), and then you can use pd.read_sql() to load data into a Pandas DataFrame, or df.to_sql() to write data back to the database. SQLAlchemy automatically handles query execution, connection pooling, and transactions.
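
A round-trip sketch (the SQLite URL and table name are illustrative):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///example.db')
df = pd.DataFrame({'id': [1, 2], 'price': [9.5, 12.0]})
df.to_sql('cheeses', engine, if_exists='replace', index=False)   # write
loaded = pd.read_sql('SELECT * FROM cheeses', engine)            # read back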

66. How would you design a database schema to store time-series data efficiently in Python?

To store time-series data efficiently, I would use a columnar storage format like Parquet, which is optimized for reading large datasets, and store it in a time-partitioned table to improve query performance. For databases, I would design the schema with a timestamp column as the primary key and additional indices on frequently queried columns (e.g., sensor_id, location). In Python, I would use libraries like pandas or SQLAlchemy for managing time-series data and ensure fast access using time-based partitioning and indexing.

67. How would you design a scalable data storage solution for real-time sensor data in Python?

For real-time sensor data, I would use a time-series database like InfluxDB or AWS Timestream, which is optimized for storing and querying time-indexed data. In Python, I could use influxdb-python to write data to the database in real time and retrieve it for analysis. For scalability, I would partition the data by time (e.g., daily or monthly) and use compression techniques to reduce storage overhead.

Python - Infrastructure and Deployment

68. How would you deploy a simple data analysis script as a reusable tool in Python?

To deploy a data analysis script as a reusable tool, I would package the code using setuptools or Poetry to create a Python package. I would then upload it to a repository like PyPI for public use or a private repository for internal use. Additionally, I could provide a command-line interface (CLI) using argparse or Click, allowing users to run the tool with different input parameters.

69. What is Docker, and how would you use it in a data science project?

Docker is a containerization tool that allows you to package an application and its dependencies into a portable container, ensuring consistency across environments. In a data science project, I would use Docker to containerize my Python environment, ensuring that all dependencies (e.g., libraries, packages) are included and reducing conflicts across different machines. This makes it easier to deploy the project to production or collaborate with others without worrying about setup inconsistencies.

70. How would you handle versioning and dependencies in a Python-based data science project?

For managing dependencies and versions, I would use a requirements.txt file or a conda environment to ensure all necessary libraries are installed. To maintain reproducibility, I would use tools like pip freeze or conda list to record the exact versions of dependencies. Additionally, Docker can be used to containerize the environment, ensuring the project can be consistently deployed across different systems.

71. How would you deploy a machine learning model to production using Python?

To deploy a machine learning model, I would use a framework like Flask or FastAPI to expose the model as a REST API, allowing users to send data and get predictions in real time. I would containerize the API using Docker and deploy it to cloud platforms like AWS (using EC2 or Lambda) or Google Cloud. Additionally, I would use a version control system like Git and a CI/CD pipeline to automate deployment and ensure model updates are seamlessly deployed.
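
A minimal FastAPI sketch of that pattern (the model scoring line is a placeholder, not a real model):

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Features(BaseModel):
    values: list[float]

@app.post('/predict')
def predict(features: Features):
    # in a real service, a model loaded at startup would score here
    prediction = sum(features.values)   # placeholder for model.predict(...)
    return {'prediction': prediction}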

72. How would you optimize the performance of a data science model deployed in a cloud environment using Python?

To optimize the performance of a model deployed in a cloud environment, I would use services like AWS Lambda or Azure Functions for serverless execution, which automatically scales depending on demand. Additionally, I would consider using GPU or TPU instances for model inference if the model requires high computational power, and I would monitor performance using cloud-native tools like AWS CloudWatch. Lastly, I would implement model versioning and A/B testing strategies to continuously evaluate and improve the model’s performance.

Python - Performance Optimization

73. How would you improve the performance of a Python script that processes a large dataset?

To improve performance, I would optimize data loading by using efficient file formats like Parquet or HDF5 and minimize memory usage by specifying dtype options when reading data with pandas. Additionally, I would ensure that operations are performed in a vectorized manner (using NumPy or pandas) rather than with loops. If necessary, I would switch to distributed processing tools like Dask or PySpark for handling larger-than-memory datasets.
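
A small before-and-after of the vectorization point:

import numpy as np

values = np.random.rand(1_000_000)

# slow: an explicit Python loop over a million elements
total = 0.0
for v in values:
    total += v * 2

# fast: one vectorized NumPy expression
total_vectorized = (values * 2).sum()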

74. What strategies would you use to optimize the performance of SQL queries used in a Python data pipeline?

To optimize SQL queries, I would start by ensuring proper indexing on frequently queried columns, especially foreign keys and timestamps. I would also avoid using SELECT * and instead select only the required columns, which reduces data transfer overhead. Using query optimization techniques like query rewriting, avoiding subqueries, and using batch processing for inserts and updates can significantly improve performance.

75. How would you ensure scalability and high performance when building a distributed data pipeline in Python?

For building a distributed data pipeline, I would use Apache Spark or Dask to parallelize the computation and process large datasets across multiple machines. I would ensure that the pipeline is designed to scale by using cloud-based solutions like Amazon S3 for storage and utilizing distributed file systems (e.g., HDFS). Additionally, I would implement proper partitioning, fault tolerance, and logging to ensure smooth execution and easy troubleshooting in production environments.


Learn more about Python Interview Questions

This course is designed to help you learn everything you need to know about working with data, from basic concepts to more advanced techniques.


More Python Learning Resources

Continue your prep with Interview Query. We offer a variety of Python resources to support your interview practice.

Streamlining your recruitment process for Python-savvy data science roles? Let OutSearch.ai’s AI-driven platform help you find candidates who not only excel in Python but are perfect for your team’s dynamic. Consider checking out the site!