Data Science Coding Skills
As a data scientist, your primary tool will be your computer. To apply mathematics, statistics, and business insight at scale, you’ll need to know how to code.
That’s why coding questions are the most frequently asked questions in data science interviews: coding is a prerequisite to the rest of the work you’ll tackle. When we aggregated the feedback on topics asked during interviews, we found that SQL questions are asked in 90% of interviews and Python questions in 80% of them.
Coding questions are especially common in technical screenings because they’re so straightforward and direct. You can’t talk yourself out of the question. Usually you come up with a direct solution or fail the interview. So coding questions are a quick and reliable way of screening out candidates who lack experience or haven’t prepared enough.
SQL and Python remain the two most common languages for data science, and interviewers will have questions focused directly on testing your proficiency. They will ask you a question, and you’ll need to write a function or query in order to answer it.
There are three main ways of evaluating your coding skills:
- Whiteboard: You’ll have to code on a whiteboard or in a plain text file without any feedback from the computer. This will test your raw proficiency with the language at hand.
- Editor console: You’ll code in an environment that gives you some level of automated feedback, so you can run your code and correct it. However, it’s not a bad idea to treat this as a whiteboard question to minimize mistakes.
- Pair programming: The interviewer will follow along as you code and provide tips on how to continue. You’ll be able to plan your solution with them.
SQL Questions for Data Scientists
SQL questions will ask you to pull metrics, aggregate data, and conduct analysis in a very short time frame. These questions appear frequently and assess your ability to write clean code. For onsite interviews, you should expect to not only get the answers right but also reach them quickly and clearly to make a good impression.
The primary focus will be intermediate questions: you’ll be provided with data or a table schema, and the interviewer will ask you to write a SQL query that generates a metric they request. These questions typically cover basic joins, aggregations, and date manipulation to show how different metrics evolve across time periods.
It’s very unlikely that you will find any definition-based SQL questions in an interview for data science (such as “what’s the difference between WHERE and HAVING?”).
Example: Given a users table and a purchases table, write a query to find the total amount spent on each item by users who registered in 2022.
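One possible answer to this question is sketched below, run through Python's built-in sqlite3 module so it can be tried end to end. The question names only the two tables, so the columns (id, registered_at, user_id, item, amount) and the sample rows are assumptions made purely for illustration.

```python
import sqlite3

# Hypothetical schema and data: only the table names come from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, registered_at TEXT);
    CREATE TABLE purchases (user_id INTEGER, item TEXT, amount REAL);
    INSERT INTO users VALUES
        (1, '2022-03-14'), (2, '2021-11-02'), (3, '2022-12-31');
    INSERT INTO purchases VALUES
        (1, 'book', 10.0), (1, 'pen', 2.5), (2, 'book', 10.0), (3, 'book', 12.0);
""")

# Join purchases to users, keep 2022 registrations, then total per item.
QUERY = """
    SELECT p.item, SUM(p.amount) AS total_spent
    FROM purchases AS p
    JOIN users AS u ON u.id = p.user_id
    WHERE u.registered_at >= '2022-01-01'
      AND u.registered_at <  '2023-01-01'
    GROUP BY p.item
    ORDER BY p.item
"""

totals = conn.execute(QUERY).fetchall()
print(totals)
```

Filtering on a date range rather than extracting the year keeps the query portable across SQL dialects, since date functions differ between engines.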
You’ll likely also encounter harder SQL questions that test more advanced concepts such as subqueries, window functions, and advanced joins.
Example: Given a table called user_experiences that shows when each user started and ended working at each position, write a query to determine the percentage of users that held the title of “data analyst” immediately before holding the title “data scientist”.
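A window function is one way to approach this question. The sketch below uses LAG to look at each user's previous title, again through Python's sqlite3 module (window functions need SQLite 3.25 or newer). The column names (user_id, title, start_date, end_date), the sample rows, and the choice of all users as the denominator are assumptions, since the question does not state them.

```python
import sqlite3

# Hypothetical columns and rows: only the table name comes from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_experiences (
        user_id INTEGER, title TEXT, start_date TEXT, end_date TEXT
    );
    INSERT INTO user_experiences VALUES
        (1, 'data analyst',      '2019-01-01', '2020-06-30'),
        (1, 'data scientist',    '2020-07-01', NULL),
        (2, 'data analyst',      '2018-01-01', '2019-01-31'),
        (2, 'product manager',   '2019-02-01', '2021-01-31'),
        (2, 'data scientist',    '2021-02-01', NULL),
        (3, 'software engineer', '2020-01-01', NULL),
        (4, 'data analyst',      '2017-05-01', '2019-04-30'),
        (4, 'data scientist',    '2019-05-01', '2022-01-31');
""")

# LAG fetches the title of the position each user held just before the
# current one. The outer query counts users with an analyst-to-scientist
# transition and divides by all users (an assumed denominator).
QUERY = """
    WITH ordered AS (
        SELECT user_id,
               title,
               LAG(title) OVER (
                   PARTITION BY user_id ORDER BY start_date
               ) AS previous_title
        FROM user_experiences
    )
    SELECT 100.0 * COUNT(DISTINCT CASE
                       WHEN title = 'data scientist'
                        AND previous_title = 'data analyst'
                       THEN user_id END)
           / COUNT(DISTINCT user_id) AS pct
    FROM ordered
"""

pct = conn.execute(QUERY).fetchone()[0]
print(pct)
```

In the sample data, users 1 and 4 move directly from data analyst to data scientist, while user 2 holds another title in between and is not counted.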
The harder questions also require you to answer multistep SQL case studies and perform more complex queries by joining simpler ones using subqueries.
Example: We have two tables, transactions and products. We want to find pairs of products that are often purchased together by the same user (such as wine and bottle openers, chips and beer, etc.).
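A self-join is a common way to build such pairs. The following sketch joins each user's distinct products against themselves and counts how many users bought both items. The columns (user_id, product_id, id, name) and sample rows are assumptions for illustration, and "often" is interpreted here simply as the number of users who bought both products.

```python
import sqlite3

# Hypothetical columns and rows: only the table names come from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE transactions (
        id INTEGER PRIMARY KEY, user_id INTEGER, product_id INTEGER
    );
    INSERT INTO products VALUES (1, 'wine'), (2, 'bottle opener'), (3, 'chips');
    INSERT INTO transactions (user_id, product_id) VALUES
        (1, 1), (1, 2), (2, 1), (2, 2), (2, 3), (3, 3);
""")

# Step 1: reduce transactions to one row per (user, product).
# Step 2: self-join on user, keeping a.product_id < b.product_id so each
#         pair appears once and a product is never paired with itself.
# Step 3: attach product names and count users per pair.
QUERY = """
    WITH user_products AS (
        SELECT DISTINCT user_id, product_id FROM transactions
    )
    SELECT p1.name AS product_a,
           p2.name AS product_b,
           COUNT(*) AS shared_users
    FROM user_products AS a
    JOIN user_products AS b
      ON a.user_id = b.user_id AND a.product_id < b.product_id
    JOIN products AS p1 ON p1.id = a.product_id
    JOIN products AS p2 ON p2.id = b.product_id
    GROUP BY a.product_id, b.product_id, p1.name, p2.name
    ORDER BY shared_users DESC, product_a, product_b
"""

pairs = conn.execute(QUERY).fetchall()
print(pairs)
```

Breaking the problem into a small subquery followed by joins mirrors the multistep structure these harder questions tend to reward.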
Python Data Science Questions
Python questions test your coding skills in Python. They range from testing your knowledge of algorithms & data structures to your familiarity with NumPy, Pandas, and other data science packages, as well as your ability to apply that knowledge to probability simulations, statistical distributions, string parsing, and data manipulation.
The most basic kind of Python question is the one you can answer verbally. These questions focus on Python syntax or libraries in order to test your level of familiarity with the Python language.
Examples:
- What are the built-in data types used in Python?
- What library do you prefer for plotting? Seaborn or Matplotlib? Why?
- Is Python an object-oriented language?
A common type of Python question is about string manipulation. This question type is good for interviewers because it tests both your algorithmic skills and your ability to understand string processing (which is crucial to cleaning and manipulating data).
Example: Given two strings, A and B, return whether or not A can be shifted some number of times to get B.
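A compact way to answer this, assuming "shifted" means a circular rotation (characters that fall off one end reappear at the other), is to note that every rotation of A is a substring of A concatenated with itself. The function name can_shift is a hypothetical choice for this sketch.

```python
def can_shift(a: str, b: str) -> bool:
    """Return True if rotating `a` some number of positions yields `b`."""
    # Strings of different lengths can never be rotations of each other.
    # Otherwise, every rotation of `a` appears as a substring of `a + a`.
    return len(a) == len(b) and b in a + a


print(can_shift("abcde", "cdeab"))  # True: shift "abcde" left by two
print(can_shift("abc", "acb"))      # False: same letters, different order
```

In an interview it is worth stating the length check out loud, since without it "ab" would wrongly match "abab".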
Closely related are questions that evaluate your ability to clean & process data using Python’s Pandas library, which also come up frequently.
Examples:
- Given a dataset of test scores, write Pandas code to return cumulative bucketed scores of <50, <75, <90, <100.
- Given two dictionaries (friends_added and friends_removed), write a function to list the pairs of friends with corresponding beginning and ending timestamps.
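The second example can be sketched in plain Python. The question does not specify the shape of the two dictionaries, so this sketch assumes each one maps a pair of user IDs to a list of ISO-formatted date strings, that the pair may appear in either order, and that only friendships that have ended should be listed. The function names and output keys are likewise hypothetical.

```python
from collections import defaultdict


def _by_pair(events):
    """Group timestamps under an order-independent (low_id, high_id) key."""
    grouped = defaultdict(list)
    for pair, stamps in events.items():
        grouped[tuple(sorted(pair))].extend(stamps)
    return {pair: sorted(stamps) for pair, stamps in grouped.items()}


def friendship_periods(friends_added, friends_removed):
    """Match each friendship start with the next removal for the same pair."""
    added = _by_pair(friends_added)
    removed = _by_pair(friends_removed)
    periods = []
    for pair in sorted(added):
        ends = list(removed.get(pair, []))
        for start in added[pair]:
            # ISO date strings sort chronologically, so plain comparison works.
            end = next((e for e in ends if e >= start), None)
            if end is None:
                continue  # no removal recorded: friendship assumed still active
            ends.remove(end)  # each removal closes exactly one friendship
            periods.append(
                {"user_ids": pair, "start_date": start, "end_date": end}
            )
    return periods


friends_added = {(1, 2): ["2020-01-01", "2020-03-01"], (3, 4): ["2020-02-01"]}
friends_removed = {(2, 1): ["2020-02-15"]}
print(friendship_periods(friends_added, friends_removed))
```

Sorting the IDs inside each key is what lets (1, 2) in one dictionary match (2, 1) in the other.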
Finally, some data science roles will require you to perform numerical and matrix operations using NumPy, and interviewers will test your proficiency with the NumPy library as a proxy for your coding experience.
Examples:
- Compute the inverse of a matrix in NumPy.
- Given an array filled with random values, write a function rotate_matrix to rotate the array by 90 degrees in the clockwise direction.
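For the rotation example, the sketch below deliberately uses plain Python lists instead of NumPy so that the underlying idea is visible: reverse the row order, then transpose. With NumPy available, np.rot90(array, k=-1) performs the same clockwise rotation.

```python
def rotate_matrix(matrix):
    """Rotate a rectangular list of lists 90 degrees clockwise."""
    # Reversing the rows and then transposing turns the bottom row
    # into the first column, which is a clockwise quarter turn.
    return [list(row) for row in zip(*matrix[::-1])]


grid = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
rotated = rotate_matrix(grid)
for row in rotated:
    print(row)
```

The function returns a new matrix and leaves the input untouched; an interviewer may follow up by asking for an in-place version.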
Coding Questions in Other Contexts
Sometimes, there are specific coding requirements within other topics. For example, in the context of a statistics or a database design question, the interviewer might ask you for some code as part of your answer.
In these cases, interviewers often don’t expect you to go through the whole process of coding your answers because that would take too long for an interview. However, they will test whether you have an idea of how you would solve the problem at hand if you had the time available. One way of doing this is by asking you to give an overview of how you would code your answer without going deep into implementation details.
As an example, some product analytics questions ask you to define the SQL queries you would design in order to retrieve the metrics you propose to track. Database design questions might also ask you for examples of the SQL queries you’d use to retrieve specific data from the databases you designed.
In the case of Python, both statistics and probability-based questions may ask you to code random sampling for specific distributions, compute statistical metrics, or write Python code to calculate probabilities based on specific conditions using Bayes’ theorem.
Finally, Python questions often come up in the context of machine learning questions, especially on how to build and deploy ML models. Depending on your role and company, you may even be asked to code specific ML algorithms without resorting to any Python library.
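As an illustration of the Bayes' theorem and random sampling cases, the sketch below computes a posterior probability analytically and then checks it with a simple simulation. The scenario (a test with a 95% detection rate and 5% false positive rate for a condition with 1% prevalence) and all function and parameter names are invented for this example.

```python
import random


def posterior(prior, likelihood, false_positive_rate):
    """P(H | E) for a binary hypothesis H and observed evidence E."""
    # Total probability of seeing the evidence, with or without H.
    evidence = likelihood * prior + false_positive_rate * (1 - prior)
    return likelihood * prior / evidence


def simulate_posterior(prior, likelihood, false_positive_rate,
                       trials=200_000, seed=0):
    """Estimate the same posterior by sampling individuals at random."""
    rng = random.Random(seed)
    positives = true_positives = 0
    for _ in range(trials):
        has_condition = rng.random() < prior
        rate = likelihood if has_condition else false_positive_rate
        if rng.random() < rate:  # the test comes back positive
            positives += 1
            true_positives += has_condition
    return true_positives / positives


exact = posterior(prior=0.01, likelihood=0.95, false_positive_rate=0.05)
estimate = simulate_posterior(0.01, 0.95, 0.05)
print(f"exact={exact:.4f} simulated={estimate:.4f}")
```

Pairing a closed-form answer with a simulation is a useful habit in interviews, because each one acts as a sanity check on the other.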
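To give a sense of what coding an algorithm without a library can look like, here is a minimal sketch of simple (one-feature) linear regression fitted by ordinary least squares. The choice of algorithm and the function name are assumptions for illustration; actual interview prompts vary.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) minimizing squared error for y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form least squares: slope = cov(x, y) / var(x).
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Points that lie exactly on y = 2x + 1.
slope, intercept = fit_simple_linear_regression([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)
```

The sketch assumes at least two distinct x values; otherwise sxx is zero and the slope is undefined.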