Full-time data scientists use SQL every day to perform analysis and gather insights from data, which is why SQL is tested so frequently (around 70% of the time) in data science interviews.
SQL data science interview questions mimic the work that data scientists actually perform. These questions ask candidates to pull metrics, aggregate data, and conduct analysis. In the majority of interviews, you will be required to write out or whiteboard SQL queries in a short timeframe.
Occasionally, you may have the opportunity to take this portion home and present your findings later, but you should not count on that format.
Data science SQL questions fall into three main categories:
Beginner SQL data science questions - These questions are definition-based, include use case examples or contain comparisons of SQL functions. A basic SQL data science question might be: What is the difference between WHERE and HAVING?
Intermediate data science SQL questions - Intermediate query writing questions will be the primary focus in data science interviews. You are provided with data or table schema and required to set up a SQL query to generate the requested metric. These questions will cover basic joins, aggregations and date manipulations.
Hard SQL data science questions - Hard SQL questions test SQL concepts like subqueries, window functions, and advanced joins. Additionally, you can expect multistep SQL case studies that ask you to investigate a business problem.
In data scientist interviews, you will be asked to write SQL queries and, usually, these queries cover intermediate-to-hard SQL concepts. Although you can never truly know what will be asked, five SQL concepts are tested most frequently based on our survey of the industry.
If you can prepare for these concepts - which include aggregations, joins, date manipulations, window functions and advanced subqueries - you will do well in the SQL portion of any data science interview.
Typically, when you are asked to write SQL code, it is in service of a specific task. Some of the most common tasks you will be asked to demonstrate proficiency with include:
Reporting and metrics - These questions are the most common, and ask you to write queries whose output ends up in dashboards. Although the output is clearly defined, these questions typically require complex joins, subqueries, and window functions to reach the final product.
Analytics case studies - These questions are multi-step problems that test two skills: 1) understanding what metrics you require to answer the question, and 2) writing the correct SQL query to pull the metrics you have identified. Analytics case studies test your problem-solving skills, and are generally left more ambiguous than other types of SQL problems.
Database design - These questions test your knowledge of data architecture and design. Most importantly, they test whether you know how to design a database from scratch when you are given a business idea, application, or any other software that needs to interact with a database.
ETL questions - These questions test your ability to perform ETL tasks: extracting data from a database, transforming the data, and loading the data into another database or table. These questions are most common in engineering roles, but data scientists should expect them to appear periodically.
Logic-based questions - Logic-based SQL interview questions are very tricky. They are typically based less on real-life examples and more on injecting the complexities of algorithm and data structure interviews into SQL questions. This style is exemplified on sites such as LeetCode, where you will see quite a few interview questions that are not practical for real-world scenarios.
The best approach for studying SQL interview questions is to practice as many problems as possible. To that end, Interview Query is constantly updating our question database, which already has 500+ problems banked for you to practice against. Here is a simple process to use for SQL practice problems from our database:
Try the SQL question without running any SQL code in the engine. Give yourself 15 to 20 minutes to work on each problem by just reading the problem statement and using the editor.
Work out all the bugs and make sure that when you click run or submit, you feel like this is the definitive solution that will work without syntax errors.
If you get stuck, try the problem hint. The hints on certain questions will push you into learning how to think about solving the problem at hand.
Try submitting your code to check if it passes our test cases. If it misses a few test cases, figure out why by debugging your code.
If you need to debug your code, you can hit the RUN SQL button to try it on our test database. Run your code and check for syntax errors. If you get any, re-read your query to fix the necessary errors.
Check your solution against ours. Take note of whether there are any optimizations you can use to write a more efficient query.
Definition-based basic SQL interview questions are commonly asked, and these questions are used by interviewers to quickly gauge your SQL knowledge. Often, they lead into more challenging follow-up questions.
For example, you might be asked: “What is the difference between an INNER JOIN and a LEFT JOIN?” If you answer correctly, the interviewer might then provide a scenario and ask whether you would use an INNER JOIN or a LEFT JOIN to solve it.
Some of the most common basic SQL questions for data scientists include:
A JOIN is a clause used to combine rows from two or more tables based on a shared key column.
The INSERT command is used to add rows to an existing table.
OLAP databases are intended for online analytical processing, while OLTP databases are intended for online transaction processing. Denormalize the data when it falls under OLAP operations and normalize it when it falls under OLTP operations.
The WHERE clause is used to filter rows before grouping and HAVING is used to exclude records after grouping.
CASE WHEN lets you write complex conditional statements in the SELECT clause, and also allows you to pivot data from long to wide formats.
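As a quick illustration, here is a hypothetical `events` table being pivoted from long to wide, with one CASE WHEN per output column (all table and column names here are made up):

```sql
-- Count each event type in its own column for every user
SELECT
    user_id
    , SUM(CASE WHEN event = 'click' THEN 1 ELSE 0 END) AS num_clicks
    , SUM(CASE WHEN event = 'view' THEN 1 ELSE 0 END) AS num_views
FROM events
GROUP BY user_id
```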
Breaking it down, a JOIN is used to combine data from multiple data sources. A LEFT JOIN returns all rows from the left table, even if there are no matches in the right table. An INNER JOIN only returns rows where there is a match in both tables.
UNION is used to combine the results of multiple queries into a single result set, removing duplicate rows in the process. Conversely, UNION ALL combines the results of two queries while keeping all rows, including duplicates.
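A minimal sketch of the difference, using hypothetical tables with matching column lists:

```sql
-- UNION removes duplicate rows across the combined results
SELECT name FROM employees_us
UNION
SELECT name FROM employees_eu;

-- UNION ALL keeps every row, including duplicates
SELECT name FROM employees_us
UNION ALL
SELECT name FROM employees_eu;
```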
The COUNT function returns the number of rows specified by the query, which is to say those rows that satisfy any WHERE condition. COUNT DISTINCT eliminates duplicate values before the count is applied, so it returns the number of unique values in a column. Like COUNT on a column, it does not count rows with NULL values.
You would use WHERE and HAVING together with the GROUP BY clause: WHERE filters rows before the GROUP BY is applied, while HAVING filters the grouped results after aggregation.
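Here is a short sketch of the two clauses working together (table and column names are hypothetical):

```sql
SELECT department_id, AVG(salary) AS avg_salary
FROM employees
WHERE hire_date >= '2020-01-01'     -- row-level filter, applied before grouping
GROUP BY department_id
HAVING AVG(salary) > 50000          -- group-level filter, applied after aggregation
```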
Easy SQL interview questions jump into basic SQL query writing. These questions are most common in technical screens, but they could show up in the on-site interview for more junior-level data science roles.
Easy questions in SQL interviews test whether you can perform basic aggregations, like using COUNT with a HAVING clause, and whether you can handle basic joins or date-time manipulations.
Hint: The * serves as a wildcard that will capture all columns in a table.
In this Google SQL interview question, we are given two tables and asked to find all neighborhoods with zero users. The `users` table includes a user_id, name, neighborhood_id, and a created_at timestamp. The second table, `neighborhoods`, includes an id, a name, and a city_id.
One hint: Whenever a SQL question asks you to find values with 0, immediately think of the concept of a LEFT JOIN. Remember that an INNER JOIN finds values that are in both tables, while a LEFT JOIN keeps all of the rows from the left table, even those without a match in the right table.
One additional step in this question is to use a WHERE clause to select only the rows where the right table's values are NULL. Here is a simple solution:
```sql
SELECT n.name
FROM neighborhoods AS n
LEFT JOIN users AS u
    ON n.id = u.neighborhood_id
WHERE u.id IS NULL
```
Here is the schema for the `projects` table:
| Column | Type |
|---|---|
| id | INTEGER |
| title | VARCHAR |
| start_date | DATETIME |
| end_date | DATETIME |
| budget | FLOAT |
With this question, our aim is to use values in a single table for comparisons. Specifically, we are looking to select pairs of projects in which one project starts on the same day another project ends.
This task requires us to perform a self-join: an INNER JOIN of the projects table with itself. The order of the values must be precise to correctly perform this function. Pay particular attention to aliasing to make sure you keep your tables straight. For instance, you do not want to name one of your tables project_starts, but then use that table to source the value for project_title_end, or vice versa.
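As a rough sketch of the self-join (the output column names here are assumptions), we join the table to itself on the date columns:

```sql
-- Pair projects where one starts on the day another ends
SELECT
    p_start.title AS project_title_start
    , p_end.title AS project_title_end
FROM projects AS p_start
INNER JOIN projects AS p_end
    ON p_start.start_date = p_end.end_date
```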
More Context: The search results table includes the search query, the rating of each result, and the result's position.
For example, the query “cat” shows a picture of a cat with a rating of 4 and a position of 1, while the query “dog” shows a picture of a hotdog with a rating of 2 and a position of 1. The ranking precision of “cat” is therefore better than that of “dog.” How would you write a query to create a metric that validates each position by its search precision?
Hint: Taking the average of the ratings seems like a good way to measure the quality of the search results.
We now have to find a way to weight the position against the rating to normalize the metric score.
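One naive sketch, assuming a `search_results` table with query, rating, and position columns: weight each rating by the inverse of its position, so results near the top of the page count more toward the metric.

```sql
-- Position-weighted average rating per query (schema is assumed)
SELECT
    query
    , ROUND(SUM(rating * 1.0 / position) / SUM(1.0 / position), 2) AS weighted_rating
FROM search_results
GROUP BY query
```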
You are presented with an `employees` table (employee_id, first_name, last_name, salary, and department_id) and a `departments` table (id, name). The department_id field in the `employees` table is associated with the id field in the `departments` table.
With this knowledge, department_id is a “foreign key” because it is a column that references the primary key of another table. Based on this common field, we can join both tables using an INNER JOIN to associate each department's name with the employees that are part of it.
```sql
SELECT salary
FROM employees
INNER JOIN departments
    ON employees.department_id = departments.id
```
What else would you need to do to filter down to just the employees in the engineering department, order them by salary, and select the top two salaries?
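One hedged completion, assuming the department is stored under the name 'engineering':

```sql
SELECT salary
FROM employees
INNER JOIN departments
    ON employees.department_id = departments.id
WHERE departments.name = 'engineering'  -- assumed department name
ORDER BY salary DESC                    -- highest salaries first
LIMIT 2                                 -- keep the top two
```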
You are provided with a `transactions` table (id, user_id, created_at, product_id, and quantity), a `users` table (id, name, and sex), and a `products` table (id, name, and price). To answer the question, we need INNER JOINs between the `transactions` table and both the `users` and `products` tables to find the values.
Here is a quick solution:
```sql
SELECT
    u.sex
    , ROUND(AVG(quantity * price), 2) AS aov
FROM users AS u
INNER JOIN transactions AS t
    ON u.id = t.user_id
INNER JOIN products AS p
    ON t.product_id = p.id
GROUP BY 1
```
Let's say you have a database with two tables. The first table holds user information, including each user's registration date, and the second table holds the purchase history (if any) for all users. Here is a process you can use to solve this: identify the key that links the two tables (likely the user id), decide which type of join preserves users without purchases, and then apply the aggregation the question calls for.
Note: Find the duration in minutes and sort results by duration in descending order.
`rides` table:
| Column | Type |
|---|---|
| id | INTEGER |
| passenger_user_id | INTEGER |
| start_dt | DATETIME |
| end_dt | DATETIME |
| distance | FLOAT |
| ride_region | VARCHAR |
| is_completed | INTEGER |
Output:
| Column | Type |
|---|---|
| duration_minutes | INTEGER |
Here is a quick solution:
```sql
SELECT TIMESTAMPDIFF(MINUTE, start_dt, end_dt) AS duration_minutes
FROM rides
WHERE TIMESTAMPDIFF(MINUTE, start_dt, end_dt) > 120
ORDER BY duration_minutes DESC
```
Medium SQL coding questions jump into more complex joins, aggregations, window functions and more. Generally, these questions are more advanced reporting and metrics questions that require multi-step aggregations and creative problem-solving to answer.
We need an indicator of which purchase was the third one. Whenever we think of ranking our dataset, it is helpful to immediately think of a specific window function we can use.
We need to apply the RANK function to the transactions table. The RANK function is a window function that assigns a rank to each row in the partition of the result set.
```sql
SELECT *, RANK() OVER (
    PARTITION BY user_id ORDER BY created_at ASC
) AS rank_value
FROM transactions
LIMIT 100
```
What else do we need to do now that we have the transactions in ranked order?
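Assuming the question asks for each user's third purchase, one hedged completion wraps the ranked rows in a subquery and filters on the rank:

```sql
SELECT *
FROM (
    SELECT *, RANK() OVER (
        PARTITION BY user_id ORDER BY created_at ASC
    ) AS rank_value
    FROM transactions
) AS ranked
WHERE rank_value = 3   -- keep only each user's third purchase
```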
More Context: You are given two tables, `accounts` (account_id, paying_customer) and `downloads` (account_id, download_date, downloads).
We need to use data from both tables to solve the question, so the first thing we figure out is how to join them together. Since we should only consider accounts that had downloads, we can use an INNER JOIN (or just JOIN).
An INNER JOIN will discard accounts with no records in the `downloads` table. To calculate an average, we can use the AVG function. The AVG function is an aggregate function, so we need to apply a GROUP BY clause to group the results by download date and paying-customer status.
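A hedged sketch of that plan, assuming the question asks for average daily downloads split by paying status:

```sql
SELECT
    d.download_date
    , a.paying_customer
    , ROUND(AVG(d.downloads), 2) AS avg_downloads
FROM accounts AS a
INNER JOIN downloads AS d
    ON a.account_id = d.account_id
GROUP BY 1, 2
```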
If a user has five jobs but has only posted each job once, then they are part of the single_post user base. But if the user has five jobs and has posted seven times, then at least one job must have multiple postings.
Generally, if a user's total number of postings exceeds that user's total number of distinct jobs, the pigeonhole principle tells us at least one job must have been posted multiple times.
We first write a subquery to get an organized version of the job_postings table and name it user_job.
We want a count of total job postings per user and job. Since each job posting has a unique id, we write our subquery to count posting ids and distinct job ids per user.
We use COUNT DISTINCT on job_id to get a unique count of jobs and COUNT on id, since all ids are already unique. We then GROUP BY user_id so we can compare the number of distinct jobs per user, denoted num_jobs, with the number of total posts per user, denoted n_posts.
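A hedged sketch of the subquery and the pigeonhole comparison (the `job_postings` table name is assumed from the context above):

```sql
WITH user_job AS (
    SELECT
        user_id
        , COUNT(DISTINCT job_id) AS num_jobs   -- distinct jobs per user
        , COUNT(id) AS n_posts                 -- total postings per user
    FROM job_postings
    GROUP BY user_id
)
-- More posts than distinct jobs means at least one job was posted twice
SELECT user_id
FROM user_job
WHERE n_posts > num_jobs
```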
Note: If the customer purchased two things on the same day, we do not count that as an upsell, as they were purchased within a similar timeframe.
Questions like this are common in Amazon SQL interviews.
Hint: An upsell is determined by purchases made on multiple days by the same user. Therefore, we have to group by both the date field and the user_id to get each transaction broken out by day and user.
```sql
SELECT
    user_id
    , DATE(created_at) AS date
FROM transactions
GROUP BY 1, 2
```
Now we just have to filter for the users that purchased on multiple dates. How can we do this?
Evaluating what is given to us, we have an initial query that shows the purchases on each date. So, effectively, we now need to count the number of distinct dates on which each user purchased.
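One hedged completion, assuming the question asks for the number of upsold customers: group by user, then use HAVING to keep users with more than one distinct purchase date.

```sql
SELECT COUNT(*) AS num_of_upsold_customers
FROM (
    SELECT user_id
    FROM transactions
    GROUP BY user_id
    HAVING COUNT(DISTINCT DATE(created_at)) > 1  -- purchased on multiple days
) AS upsold
```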
Note: If there are multiple students with the same minimum score difference, select the student name combination that is higher in the alphabet.
Given that the problem statement references one table with only two columns, we have to self-join the table against itself. It is helpful to think about this problem in the form of two different tables with the same values.
There are two parts to this question: pairing each student with every other student, and then computing the score difference for each pair.
For the first part of the comparison, we have two copies of the same table, s1 and s2. Since we want to compare each student against every other student, we can do a variation of the CROSS JOIN by joining each student against each other:
```sql
FROM scores AS s1
INNER JOIN scores AS s2
    ON s1.student != s2.student
```
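A hedged full sketch (assuming a `scores` table with student and score columns): using `<` instead of `!=` keeps each pair once and already alphabetizes the combination.

```sql
SELECT
    s1.student AS student_one
    , s2.student AS student_two
    , ABS(s1.score - s2.score) AS score_diff
FROM scores AS s1
INNER JOIN scores AS s2
    ON s1.student < s2.student   -- each pair appears once, alphabetized
ORDER BY score_diff ASC, student_one ASC
LIMIT 1
```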
More Context: You want to build a naïve recommender, and you are given two tables: one called `friends` (user_id and friend_id) and another called `page_likes` (user_id and page_id). Note: the recommender should not suggest pages that the user already likes.
Let’s solve this Facebook SQL interview question by visualizing what kind of output we want from the query. Given that we have to create a metric to evaluate recommending a user to a page, we know we want something with a user_id and a page_id along with some sort of recommendation score.
Let’s try to think of an easy way to represent the scores of each user_id and page_id combo. One naïve method would be to create a score by summing up the total likes by friends on each page that the user has not currently liked. Then the max value on our metric will be the most recommendable page.
The first thing we have to do is then to write a query to associate users to their friends’ liked pages. We can do that easily with an initial join between the two tables.
```sql
WITH t1 AS (
    SELECT
        f.user_id
        , f.friend_id
        , pl.page_id
    FROM friends AS f
    INNER JOIN page_likes AS pl
        ON f.friend_id = pl.user_id
)
```
What comes next to reach the desired output?
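One hedged way to finish: score each (user, page) pair by counting friend likes, then use a LEFT JOIN back to `page_likes` with a NULL check to exclude pages the user already likes.

```sql
WITH t1 AS (
    SELECT
        f.user_id
        , f.friend_id
        , pl.page_id
    FROM friends AS f
    INNER JOIN page_likes AS pl
        ON f.friend_id = pl.user_id
)
SELECT
    t1.user_id
    , t1.page_id
    , COUNT(*) AS num_friend_likes
FROM t1
LEFT JOIN page_likes AS pl
    ON t1.user_id = pl.user_id
    AND t1.page_id = pl.page_id
WHERE pl.page_id IS NULL   -- drop pages the user already likes
GROUP BY 1, 2
```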
More context: Let’s say you’re working in HR at a major tech company, and they ask you to find employees with a high probability of leaving the company. They tell you these employees perform well but have the lowest pay. Also note that we consider projects to be completed when they have an end date, which is the same as saying their End_dt is not NULL.
Given that the problem statement requires finding employees with a high probability of leaving based on their performance and salary, we can break down the process into a few key steps: join the employee and project data, count each employee's completed projects (those with a non-NULL End_dt) to gauge performance, and then sort the strong performers by salary to surface the lowest paid among them.
This approach leverages basic SQL operations like JOINs, GROUP BY, and HAVING clauses, combined with sorting and limiting results to answer the question effectively.
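A heavily hedged sketch of that approach; every table name, column name, and threshold below is an assumption for illustration:

```sql
SELECT
    e.id
    , e.salary
    , COUNT(p.id) AS completed_projects
FROM employees AS e
INNER JOIN projects AS p
    ON e.id = p.employee_id
WHERE p.End_dt IS NOT NULL      -- completed projects only
GROUP BY e.id, e.salary
HAVING COUNT(p.id) >= 3         -- hypothetical "performs well" threshold
ORDER BY e.salary ASC           -- lowest-paid strong performers first
LIMIT 10
```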
More context: Imagine you’re analyzing customer behavior for a financial institution, and they ask you to identify accounts that were closed right after being active for a period. Specifically, they want to find accounts that were active on December 31st, 2019, but closed the very next day, January 1st, 2020. This insight could help understand customer churn patterns or service issues.
To solve this, you would start by identifying accounts that meet the criteria of being active on December 31st and then closed on January 1st. The next step involves calculating the percentage of these accounts relative to the total number of active accounts on December 31st.
This process involves filtering for the accounts' status on each of the two dates, matching the same accounts across both dates, and dividing the count of newly closed accounts by the total number of accounts active on December 31st.
This approach requires careful filtering and comparison of data across specific dates, which is common in SQL but requires attention to detail to ensure accuracy.
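A hedged sketch, assuming an `account_status` table with one row per account per date and a status column:

```sql
SELECT
    ROUND(100.0 * COUNT(closed.account_id) / COUNT(*), 2) AS pct_closed
FROM account_status AS active
LEFT JOIN account_status AS closed
    ON active.account_id = closed.account_id
    AND closed.status_date = '2020-01-01'    -- assumed date column name
    AND closed.status = 'closed'
WHERE active.status_date = '2019-12-31'
  AND active.status = 'open'
```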
Hard SQL questions in data science interviews can include complex queries that involve sub-queries, window functions and advanced joins. However, one of the most common types of hard SQL questions is the SQL case study.
Assume that we are only interested in the monthly reports for a single year (January to December). You are given three tables: `transactions`, `products`, and `users`. But do you need the data from all three tables?
Since there is no information in the `users` table that explicitly relates to the outcome of our query, we can use just the `transactions` and `products` tables to generate our number of customers and their total purchase amounts per month.
How will we get the number of customers per month, excluding repeat customers?
Hint: We have decided to COUNT the DISTINCT users in our transactions table to generate our number of monthly customers. How can we take advantage of the relationship between our transactions and product tables to calculate the total purchase amount per month of any given customer?
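A hedged sketch of the monthly report (assuming transactions carry a created_at timestamp and a quantity, and products carry a price):

```sql
SELECT
    MONTH(t.created_at) AS month
    , COUNT(DISTINCT t.user_id) AS num_customers
    , SUM(t.quantity * p.price) AS total_purchases
FROM transactions AS t
INNER JOIN products AS p
    ON t.product_id = p.id
GROUP BY 1
ORDER BY 1
```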
The schema below is for a retail online shopping company consisting of two tables, `attribution` and `user_sessions`. Here are some details of the two tables:

- The `attribution` table logs a session visit for each row.
- The `user_sessions` table maps session visits back to one user, from a single visit all the way up to several on the same day.

How do we solve this one? First-touch attribution is defined as the channel with which the converted user was associated when they first discovered the website. It is helpful to sketch out the attribution model for converting users.
How do we figure out the beginning path of the Facebook ad and connect it to the end purchasing user?
We need to do two things: 1) subset all of the users that converted to customers, and 2) figure out their first session visit to attribute the actual channel. We can do that by creating a subquery that gets only the distinct users that have actually converted.
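A hedged sketch of that subquery (the session_id join key and the conversion flag are assumed column names):

```sql
SELECT DISTINCT us.user_id
FROM attribution AS a
INNER JOIN user_sessions AS us
    ON a.session_id = us.session_id
WHERE a.conversion = 1   -- keep only users who converted
```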
You are provided with a table of product subscriptions with subscription start and end dates for each user.
Take a look at each of the conditions first and see how they could be triggered. Given two date ranges, what determines if the subscriptions would overlap?
Let’s set an example with two date ranges: A and B.
Let Condition A>B denote that DateRange A is completely after DateRange B:

```
                          |----- DateRange A -----|
|--- DateRange B ---|
```
When would Condition A>B be true?
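Condition A>B is true when A's start date is later than B's end date. Overlap is then the negation of both "completely before" and "completely after." A hedged sketch with assumed table and column names:

```sql
-- Two subscriptions overlap when neither range is entirely
-- before or after the other
SELECT
    s1.user_id AS user_a
    , s2.user_id AS user_b
FROM subscriptions AS s1
INNER JOIN subscriptions AS s2
    ON s1.user_id != s2.user_id
WHERE s1.start_date <= s2.end_date
  AND s1.end_date >= s2.start_date
```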
Write a query on the given table, `flights`, that displays unique pairs of two locations.

Note: Duplicate pairs from the `flights` table, such as Dallas to Seattle and Seattle to Dallas, should have one entry in the flight routes output table.

`flights` table
| Column | Type |
|---|---|
| id | INTEGER |
| source_location | VARCHAR |
| destination_location | VARCHAR |
How could you solve this using the LEAST and GREATEST functions?
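A sketch of that idea: LEAST and GREATEST normalize each pair so that Dallas-to-Seattle and Seattle-to-Dallas collapse into the same row, which DISTINCT then dedupes (output column names are assumptions).

```sql
SELECT DISTINCT
    LEAST(source_location, destination_location) AS location_one
    , GREATEST(source_location, destination_location) AS location_two
FROM flights
```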
Let’s review the common strategies when tackling SQL interview questions.
When presented with a SQL question, listen carefully to the problem description and repeat back what you think the crux of the problem is. The interviewer can then help verify if your understanding is correct.
If time permits, write out a base case and an edge case to show that you understand the problem. For example: if the interviewer asks you to pull the average number of events per user per day, write out an example scenario where you’re verifying this metric.
Do duplicate events matter? Are we looking at distinct users? These are questions we need to clarify.
Sketching out what the output of the SQL question will look like is a great strategy for solving the problem. Usually, if we know what the end output table is supposed to look like, we can work backwards to determine which functions need to be applied to get there.
For example, if the output looks like this:
| date | average events per user |
|---|---|
| 2021-12-01 | 3.5 |
| 2021-12-02 | 4.0 |
We know that the table before this aggregation would have to look something like this:
| date | event | user_id |
|---|---|---|
| 2021-12-01 | click | 1 |
| 2021-12-01 | view | 1 |
| … | … | … |
From here, we can figure out which functions we should use to get to our desired output!
As you practice more and more SQL exercises, you will find that many SQL problems follow similar patterns. There are techniques we can use in SQL, like utilizing HAVING on aggregations, self-joins and cross-joins, and applying window functions, and these techniques will appear repeatedly the more problems you practice. For example, writing a query to get the second-highest salary and writing a query to isolate every fifth purchase by a user both lean on the same ranking window functions, as sketched below.
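For instance, here is the ranking pattern applied to the second-highest salary (a sketch with an assumed `employees` table; DENSE_RANK, a close cousin of RANK, avoids gaps when salaries tie):

```sql
SELECT salary
FROM (
    SELECT
        salary
        , DENSE_RANK() OVER (ORDER BY salary DESC) AS salary_rank
    FROM employees
) AS ranked
WHERE salary_rank = 2
```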
Understanding the commonalities between questions will help you find the first step to solving SQL questions faster, because you can re-use similar code and stitch techniques together on top of each other.
Finally, it's important to just start writing SQL. It is better to start with an imperfect solution than to try to perfectly understand the problem, or to perfect the coding solution, on the first try.
Verbalize your assumptions and what you are doing as you write SQL; your interviewer can then stay on the same page as you and even help you navigate your response. These processes are most often collaborative in a business environment, so this is also a chance to show your ability to brainstorm and communicate during the interview.
Interview Query offers a variety of resources to level up your SQL skills and prepare for your interview. Premium members get access to our data science course, which features an entire SQL module, as well as Python, statistics, and product metrics. Additionally, see our SQL questions for data analysts and our top 10 SQL questions for data engineers.
If you're interested in learning more about our discussions on Interview Query, take a look at our blog, where we cover a wide range of data science topics.