Data Analyst SQL interview questions aim to quickly evaluate your ability to extract metrics and work with data using SQL. You’ll usually need to write SQL queries on a whiteboard or in a code editor.
SQL questions cover a range of topics, from asking when to use a GROUP BY statement to challenging you to create a query that supports or disproves a product metrics hypothesis.
The goal remains simple, regardless of the interview format: produce clean SQL code as fast as possible. Data analyst SQL questions fall into three categories:
Easy SQL questions - These questions focus on defining SQL features, basic use cases, and differentiating between commands like WHERE and HAVING. They may also include simple queries.
Intermediate SQL questions - These questions require you to write complex queries using joins, subqueries, self-joins, and window functions. They may also include analytics case studies.
Hard SQL questions - These questions challenge you to write advanced queries, including the use of indices and complex SQL clauses. They may also include more advanced analytics case studies.
SQL technical screens are a part of nearly every data analyst interview. In these screens, candidates are asked to answer real-world problems using SQL.
Most commonly, candidates are provided a dataset and asked to write a SQL query to return the desired data.
How do companies test proficiency? Through the three types of SQL interview questions outlined above: easy, intermediate, and hard.
Data analyst interviews and technical screens generally start with beginner SQL questions. There are two main types of easy SQL questions:
Basic SQL Queries - You will often be asked basic SQL interview questions that require you to write a query. You might be asked to get the COUNT of a table, make a simple join, or use the HAVING clause.
Definitions - Less frequently encountered, these questions ask you to explain technical concepts, compare two or more SQL functions, or define how a concept is used.
Here are nine easy SQL questions for data analysts:
What is a GROUP BY statement?

The GROUP BY clause enables us to compute aggregate values alongside discrete data by collapsing the distinct rows of our database into summary rows that share a particular characteristic.
If you were to formulate a query selecting a number of rows with a WHERE clause and then try to display an aggregate value alongside them, the query would return an error. This is because SQL cannot display, in a single table, the results of the WHERE query as a list of values conjoined with the aggregate value you are looking for.
An aggregate function performs a calculation on a set of values and returns a single value summarizing the set. The three most common aggregate functions in SQL are COUNT, SUM, and AVG.
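For example, here is a minimal sketch, assuming a hypothetical orders table with a customer_id column, of how GROUP BY pairs discrete values with an aggregate where a plain SELECT cannot:

-- Invalid: a non-aggregated column cannot appear alongside an aggregate
-- SELECT customer_id, COUNT(*) FROM orders;

-- Valid: GROUP BY collapses the rows into one summary row per customer
SELECT customer_id, COUNT(*) AS order_count
FROM orders
GROUP BY customer_id;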
In SQL, a unique key is one or more columns or fields that uniquely identify a record in a database. Tables can have multiple unique keys, which is one difference between unique keys and primary keys. In some databases (such as SQL Server), a unique key column accepts only one NULL value, and it cannot contain duplicate values.
UNION and UNION ALL are SQL operators used to concatenate two or more result sets. This allows us to write multiple SELECT statements, retrieve the desired results, then combine them together into a final, unified set.
The main difference between UNION and UNION ALL is that UNION removes duplicate rows from the combined result set, while UNION ALL returns every row, including duplicates.
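A minimal sketch of the difference, using two hypothetical staff tables:

SELECT name FROM current_employees
UNION
SELECT name FROM former_employees;
-- returns each distinct name once

SELECT name FROM current_employees
UNION ALL
SELECT name FROM former_employees;
-- returns every name, keeping duplicates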
What is the difference between a RIGHT JOIN and a LEFT JOIN?

The two most common types of joins in SQL are the LEFT JOIN and the RIGHT JOIN. The main difference is in how these operators handle matched and unmatched rows.

A LEFT JOIN includes all records from the left table and only the matched rows from the right.

A RIGHT JOIN returns all rows from the right table and only the matched rows from the left.
Tables contain data and are made up of columns and rows. A view is a virtual table; its contents are generally derived from the data in an underlying table. One use case for a view is when you want to look at a subset of data from a table: you can create the view with a SELECT command that queries the data.
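For instance, a minimal sketch, assuming a hypothetical employees table:

-- Create a view exposing only a subset of the table's data
CREATE VIEW high_earners AS
SELECT first_name, last_name, salary
FROM employees
WHERE salary > 100000;

-- The view can then be queried like an ordinary table
SELECT * FROM high_earners;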
The LIKE operator is used to search for a specific pattern within a column. It is used with a WHERE clause to query specific columns. LIKE is used with two wildcard characters: % (which matches any sequence of zero or more characters) and _ (which matches exactly one character).
Here are a few examples:
WHERE EmployeeName LIKE '%r' - Finds matches that end with "r"
WHERE EmployeeName LIKE '%gh%' - Finds matches that include "gh" in any position
WHERE EmployeeName LIKE '_ch%' - Finds matches with "ch" in the second and third positions
WHERE EmployeeName LIKE 'g%r' - Finds matches that start with "g" and end with "r"
You update an existing table with the UPDATE command in SQL. It is used with SET (which contains the new values) and WHERE (which selects the specific rows to change).
Example: In the table ‘Employees’, you want to change the emergency contact, ContactName, for an employee with EmployeeID 3.
UPDATE Employees
SET ContactName = 'Bob Smith'
WHERE EmployeeID = 3;
The BETWEEN operator is used to select values within a range. You can use numbers, text, or dates with BETWEEN.

One important thing to note: the BETWEEN operator is inclusive, meaning it includes both the start and end values of the range.
SELECT EmployeeID
FROM Employees
WHERE EmployeeID BETWEEN 378 AND 492;
Given a table of cars with columns id and make, write a query that outputs a random manufacturer's name with an equal probability of selecting any name.
Input:

cars table

| id | make |
| --- | --- |
| 1 | Ford |
| 2 | Toyota |
| 3 | Toyota |
| 4 | Honda |
| 5 | Honda |
| 6 | Honda |

Output:

| Column | Type |
| --- | --- |
| make | TEXT |
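One way to approach it, as a sketch: group down to the distinct names first, so every manufacturer has the same chance of being picked regardless of how many rows it has. Note that RANDOM() is the PostgreSQL/SQLite function; MySQL uses RAND():

SELECT make
FROM cars
GROUP BY make       -- one row per distinct manufacturer
ORDER BY RANDOM()   -- shuffle the distinct names
LIMIT 1;            -- keep a single random name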
For mid-level data analyst roles, expect intermediate SQL questions to challenge your knowledge and skill. Medium SQL questions fall into two buckets: writing more complex queries (using joins, subqueries, and window functions) and working through analytics case studies.
Here are some medium SQL questions for data analysts:
Given tables employees, employee_projects, and projects, find the 3 lowest-paid employees that have completed at least 2 projects.

Note: Incomplete projects will have an end date of NULL in the projects table.
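A hedged sketch of one approach (the salary, end_date, and key column names are assumptions, since the full schemas are not shown):

SELECT e.id, e.salary
FROM employees e
INNER JOIN employee_projects ep ON ep.employee_id = e.id
INNER JOIN projects p ON p.id = ep.project_id
WHERE p.end_date IS NOT NULL         -- completed projects only
GROUP BY e.id, e.salary
HAVING COUNT(DISTINCT p.id) >= 2     -- at least 2 completed projects
ORDER BY e.salary ASC
LIMIT 3;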
Given two tables, users and rides, write a query to report the distance traveled by each user in descending order.

For this question, you need to accomplish two things: the first is to figure out the total distance traveled for each user_id, and the second is to order each user_id by its calculated distance traveled, from greatest to least.
This question has been asked in Uber data analyst interviews.
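A minimal sketch of that two-step approach, assuming the rides table has user_id and distance columns:

SELECT user_id,
       SUM(distance) AS distance_traveled   -- total distance per user
FROM rides
GROUP BY user_id
ORDER BY distance_traveled DESC;            -- greatest to least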
For this medium SQL problem, assume you work at an advertising firm. You have a table of users’ impressions of ad campaigns over time. Each user_id from these campaigns has an attached impression_id, categorized as either “Excited” or “Bored”. You will need to assess which users are “Excited” by their most recent campaign and have never been “Bored” in any past campaign.
See a full solution to this problem on YouTube:
Note: This is the type of question that might get asked for a marketing analyst job.
To answer this question, you need the name of the department associated with each employee in the employees table, so you can understand which department each employee is part of.

The department_id field in the employees table is associated with the id field in the departments table. You can call department_id a foreign key because it is a column that references the primary key of another table, in this case the id field in the departments table.

Based on this shared field, you can join both tables using an INNER JOIN to associate the department name with its employees.
SELECT departments.name, employees.salary
FROM employees
INNER JOIN departments
ON employees.department_id = departments.id
With the department name in place you can now look at the employees of the Engineering team and sort by salary to find the second highest paid.
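One hedged way to finish (the 'Engineering' filter and the LIMIT/OFFSET approach are assumptions; OFFSET syntax varies by dialect):

SELECT employees.salary
FROM employees
INNER JOIN departments
ON employees.department_id = departments.id
WHERE departments.name = 'Engineering'
ORDER BY employees.salary DESC
LIMIT 1 OFFSET 1;   -- skip the highest salary, keep the second highest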
More Context: The table includes the columns id, transaction_value, and created_at (representing the time of each transaction).
Since our goal in this problem is to pull the last transaction from each day, you want to group the transactions by the day they occurred and create a chronological order within each day from which you can retrieve the latest transaction.
To accomplish the task of grouping and ordering, create a modified version of the bank_transactions table with an added column denoting the chronological ordering of transactions within each day.
To partition by date, you can use an OVER() clause. After partitioning, order each partition in descending order so that the first entry in each partition is chronologically the last transaction of the day. Here is how that query can be written:
SELECT *,
       ROW_NUMBER() OVER (PARTITION BY DATE(created_at)
                          ORDER BY created_at DESC) AS ordered_time
FROM bank_transactions
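From there, one hedged way to finish is to treat this as a subquery and keep only the first-ranked row of each day:

SELECT id, transaction_value, created_at
FROM (
    SELECT *,
           ROW_NUMBER() OVER (PARTITION BY DATE(created_at)
                              ORDER BY created_at DESC) AS ordered_time
    FROM bank_transactions
) t
WHERE ordered_time = 1;   -- the latest transaction of each day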
More context: You are given two tables: a projects table and another that maps employees to their projects, called employee_projects. In this question, however, a bug exists that is causing duplicate rows in the employee_projects table.
Example:

Input:

projects table

| Column | Type |
| --- | --- |
| id | INTEGER |
| title | VARCHAR |
| start_date | DATETIME |
| end_date | DATETIME |
| budget | INTEGER |

employee_projects table

| Column | Type |
| --- | --- |
| project_id | INTEGER |
| employee_id | INTEGER |

Output:

| Column | Type |
| --- | --- |
| title | VARCHAR |
| budget_per_employee | FLOAT |
This is a good example of a logic-based SQL problem. Although there are a few steps to the solution, the actual SQL queries are fairly simple.
HINT: One way to do the debugging is to simply group by the columns project_id and employee_id. By grouping by both columns, you are creating a table that provides distinct values of project_id and employee_id, thereby excluding any duplicates.
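A hedged sketch building on that hint (the multiplication by 1.0 is there only because the output calls for a FLOAT):

WITH deduped AS (
    SELECT project_id, employee_id
    FROM employee_projects
    GROUP BY project_id, employee_id    -- removes the duplicate rows
)
SELECT p.title,
       1.0 * p.budget / COUNT(d.employee_id) AS budget_per_employee
FROM projects p
INNER JOIN deduped d ON d.project_id = p.id
GROUP BY p.title, p.budget;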
This question tests your data sense, as well as your SQL writing skills. It has also appeared in Facebook data analyst interviews.
To answer the first part of the question regarding insights, there are a number of metrics you could evaluate. You can find the total number of messages sent per day, the number of conversations being started, or the average number of messages per conversation. All of these metrics seek to find users’ level of engagement and connectivity.
You can find a full solution to parts one through three in the following YouTube video:
This intermediate SQL question has been asked in Amazon data analyst interviews. Here is a partial answer from Interview Query:
What does a histogram represent, and what kind of story does it tell? In this case, you are interested in using a histogram to represent the distribution of comments each user has made in January 2020. A histogram with bin buckets of size one means that you can avoid the logical overhead of grouping frequencies into specific intervals.
For example, if you want a histogram with bins of size five, you would have to run a SELECT statement like so:
SELECT
    CASE WHEN frequency BETWEEN 0 AND 4 THEN 5
         WHEN frequency BETWEEN 5 AND 9 THEN 10
         -- ...and so on for each additional bucket
    END AS bucket
In this problem, you are given two tables: an employees table and a departments table.

Example:

Input:

employees table

| Column | Type |
| --- | --- |
| id | INTEGER |
| first_name | VARCHAR |
| last_name | VARCHAR |
| salary | INTEGER |
| department_id | INTEGER |

departments table

| Column | Type |
| --- | --- |
| id | INTEGER |
| name | VARCHAR |

Output:

| Column | Type |
| --- | --- |
| percentage_over_100k | FLOAT |
| department_name | VARCHAR |
| number_of_employees | INTEGER |
First, break down the question to understand what it's asking; it splits into three conditions you need to satisfy. From here, think about how you would associate employees with their department, calculate and display the percentage of employees making over $100,000 a year, and order those results to provide an answer to the original question.
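A hedged sketch of how those pieces fit together (the AVG-of-a-CASE trick computes the percentage; column names follow the schemas above):

SELECT d.name AS department_name,
       AVG(CASE WHEN e.salary > 100000 THEN 1.0 ELSE 0.0 END) AS percentage_over_100k,
       COUNT(*) AS number_of_employees
FROM employees e
INNER JOIN departments d ON e.department_id = d.id
GROUP BY d.name
ORDER BY percentage_over_100k DESC;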
Given that this problem references one table with only two columns, you have to self-join the table, referencing the same table twice under different aliases. It is helpful to think about this problem as if it were two different tables with the same values.
There are two parts to this question:
This question provides a table that represents search results on Facebook, including a query, a position, and a human rating.
For this problem, you are given a table of product purchases. Each row in the table represents an individual product purchase.
Note: If the customer purchased two things on the same day, that does not count as an upsell, as they were purchased within a similar timeframe. We are looking for a customer returning on a different date to purchase a product.
This question is a little tricky because you have to note the dates that each user purchased products. You can’t just group by the user_id and look where the number of products purchased is greater than one because of the upsell condition.
You have to group by both the date field and the user_id to get each transaction broken out by day and user:
SELECT user_id,
       DATE(created_at) AS date
FROM transactions
GROUP BY 1, 2
The query above will now give us a user_id and date field for each row. If there exists a duplicate user_id, then you know that the user purchased on multiple days, which satisfies the upsell condition. What comes next?
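One hedged way to finish: count the users who appear with more than one distinct purchase date.

SELECT COUNT(*) AS num_of_upsold_customers
FROM (
    SELECT user_id
    FROM transactions
    GROUP BY user_id
    HAVING COUNT(DISTINCT DATE(created_at)) > 1   -- purchased on multiple days
) upsold;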
Given the transactions table below, write a query that finds the third purchase of every user.

Note: Sort the results by user_id in ascending order. If a user purchases two products at the same time, the lower id field is used to determine which is the first purchase.
Example:

Input:

transactions table

| Column | Type |
| --- | --- |
| id | INTEGER |
| user_id | INTEGER |
| created_at | DATETIME |
| product_id | INTEGER |
| quantity | INTEGER |

Output:

| Column | Type |
| --- | --- |
| user_id | INTEGER |
| created_at | DATETIME |
| product_id | INTEGER |
| quantity | INTEGER |
Here is a helpful hint for this question: You need an indicator of which purchase was the third by a specific user. Whenever you are thinking of ranking a dataset, it is helpful to immediately think of a specific window function you can use. You need to apply the RANK function to the transactions table. The RANK function is a window function that assigns a rank to each row within a partition of the result set.
SELECT *, RANK() OVER (
    PARTITION BY user_id ORDER BY created_at ASC, id ASC
) AS rank_value
FROM transactions
LIMIT 100
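A hedged completion, swapping RANK for ROW_NUMBER so that ties cannot produce two "third" rows (the note above says the lower id breaks ties):

SELECT user_id, created_at, product_id, quantity
FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY user_id
               ORDER BY created_at ASC, id ASC
           ) AS rank_value
    FROM transactions
) ranked
WHERE rank_value = 3    -- the third purchase of each user
ORDER BY user_id ASC;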
This is a LinkedIn data analyst interview question. See a full solution to this question on YouTube:
For this problem, you are given an employees table and a departments table.

Note: If the department contains fewer than three employees, the top two or the top one highest salaries should be listed.
Here’s a hint: You need to order the salaries by department. A window function is useful here. Window functions enable calculations within a certain partition of rows. In this case, the RANK() function would be useful. What would you put in the PARTITION BY and ORDER BY clauses?
Your window function can look something like this:
RANK() OVER (PARTITION BY id ORDER BY metric DESC) AS ranks
Note: When you substitute for the actual id and metric fields, make sure the substitutes are relevant to the question asked and aligned with the data provided to you.
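For instance, a hedged sketch with this problem's fields substituted in:

SELECT department_name, salary
FROM (
    SELECT d.name AS department_name,
           e.salary,
           RANK() OVER (PARTITION BY d.id ORDER BY e.salary DESC) AS salary_rank
    FROM employees e
    INNER JOIN departments d ON e.department_id = d.id
) ranked
WHERE salary_rank <= 3;   -- top three (or fewer) salaries per department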
In this Robinhood data analyst question, assume you work for a small airline, and you are given three tables: flights, planes, and flight_purchases.
To get the number of unsold seats per flight, you need to get each flight’s total number of seats available and the total seats sold.
You can do an inner join on all three tables, since the question states that the flight_purchases table does not have entries for flights or seats that do not exist.
To calculate the number of seats per flight, you use GROUP BY on the flight_id together with COUNT() on seat_id to get a count of seats sold. You then do the calculation of the number of total seats on the flight minus the total seats sold to reach how many seats remained unsold.
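A hedged sketch of that calculation (the plane_id, seat_capacity, and seat_id column names are assumptions, since the schemas are not shown):

SELECT f.id AS flight_id,
       p.seat_capacity - COUNT(fp.seat_id) AS unsold_seats
FROM flights f
INNER JOIN planes p ON f.plane_id = p.id              -- assumed join key
INNER JOIN flight_purchases fp ON fp.flight_id = f.id
GROUP BY f.id, p.seat_capacity;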
Given a transactions table with date timestamps, sample every fourth row ordered by date.

Here's a hint for this question to get you started: if you are sampling from this table and you want to sample every fourth value specifically, you will probably have to use a window function.
A general rule of thumb to follow is that when a question states or asks for some Nth value (like the third purchase of each customer or the tenth notification sent), then a window function is the best option. Window functions allow us to use the RANK() or ROW_NUMBER() function to provide a numerical index based on a certain ordering.
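Following that rule of thumb, a hedged sketch (assuming the timestamp column is named created_at; the modulo operator is % in PostgreSQL, while some dialects use MOD(n, 4)):

SELECT *
FROM (
    SELECT *,
           ROW_NUMBER() OVER (ORDER BY created_at) AS row_num
    FROM transactions
) numbered
WHERE row_num % 4 = 0;   -- keep every fourth row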
More Context: You are given two tables: the first is a users table with demographic information and the neighborhoods they live in, and the second is a neighborhoods table.
This is an intermediate SQL problem that requires you to write a simple query. Our task is to find all the neighborhoods without users. To reframe the task, you need all the neighborhoods that do not have a single user living in them. This means you have to join the two tables in a way that keeps every neighborhood, even those with no matching users, so you can see user counts by neighborhood, including counts of zero.
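A hedged sketch using a LEFT JOIN (the name, id, and neighborhood_id columns are assumptions):

SELECT n.name
FROM neighborhoods n
LEFT JOIN users u ON u.neighborhood_id = n.id
WHERE u.id IS NULL;   -- no matching user rows means no residents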
Advanced SQL questions are common for mid- and senior-level data analyst jobs. The two types of questions you will encounter are advanced query-writing problems and complex, logic-based analytics case studies.
Here’s the full solution to this complex data analytics case study on YouTube:
For this question, you are given a table of bank transactions with three columns: user_id, a deposit or withdrawal value (determined by whether the value is positive or negative), and a created_at time for each transaction.
Here’s a hint: Usually, if the problem asks to solve for a moving/rolling average, you are provided the dataset in the form of a table with two columns: date and value. This problem is taken one step further as it provides a table of just transactions, with an interest in filtering for deposits (positive value) and removing records representing withdrawals (negative value, e.g. -10).
You also need to know the total deposit amount (sum of all deposits) for each day, as it will factor into calculating the numerator for the rolling three-day average:
rolling three-day avg for day 3 = [day 3 + day 2 + day 1] / 3
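A hedged sketch of the full calculation (assuming the table is named bank_transactions and the value column transaction_value; a window frame of the current row plus the two preceding days implements the formula above):

SELECT dt,
       AVG(total_deposits) OVER (
           ORDER BY dt
           ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
       ) AS rolling_three_day_avg
FROM (
    SELECT DATE(created_at) AS dt,
           SUM(transaction_value) AS total_deposits   -- daily deposit total
    FROM bank_transactions
    WHERE transaction_value > 0                       -- deposits only
    GROUP BY DATE(created_at)
) daily;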
To solve this cumulative distribution practice problem, you are given two tables: a users table and a comments table.
| frequency | cumulative |
| --- | --- |
| 0 | 10 |
| 1 | 25 |
| 2 | 27 |
For this question, assume that you work at Twitter. Twitter wants to roll out more push notifications to users because they think users are missing out on good content. Twitter decides to do this in an A/B test. After you release more push notifications, you suddenly see the total number of unsubscribes increase. How would you visually represent this growth in unsubscribes and its effect on login rates?
Write a query to prove or disprove this hypothesis: Data scientists who switch jobs more frequently become managers faster than data scientists who stay at one job for longer.
For this question, you are interested in analyzing the career paths of data scientists. Let’s say that the titles you care about are bucketed into three categories: data scientist, senior data scientist, and data science manager.
Here’s a partial solution to this question:
This question requires a bit of creative problem-solving to understand how you can prove or disprove the hypothesis. The hypothesis is that data scientists who end up switching jobs more often get promoted faster. Therefore, in analyzing this dataset, you can test the hypothesis by separating the data scientists into segments based on how often they switch jobs in their careers.
For example, if you looked at the number of job switches for data scientists that have been in their field for five years, you could prove the hypothesis that the number of data science managers increased with the number of times they had switched jobs.
For this question, you're given two tables: a table of notification_deliveries and a table of users with created and purchase conversion dates. If the user hasn't purchased, then the conversion_date column is NULL.
Example:

Input:

notification_deliveries table

| Column | Type |
| --- | --- |
| notification | VARCHAR |
| user_id | INTEGER |
| created_at | DATETIME |

users table

| Column | Type |
| --- | --- |
| id | INTEGER |
| created_at | DATETIME |
| conversion_date | DATETIME |

Output:

| Column | Type |
| --- | --- |
| total_pushes | INTEGER |
| frequency | INTEGER |
You're given two tables, transactions and products. Hypothetically, the transactions table consists of over a billion rows of purchases bought by users.

You are trying to find paired products that are often purchased together by the same user, such as wine and bottle openers, chips and beer, etc.

Note: For the purposes of satisfying the test case, p2 should be the item that comes first in the alphabet. The qty column represents the count of times the paired products were purchased together.
Example:

Input:

transactions table

| Column | Type |
| --- | --- |
| id | INTEGER |
| user_id | INTEGER |
| created_at | DATETIME |
| product_id | INTEGER |
| quantity | INTEGER |

products table

| Column | Type |
| --- | --- |
| id | INTEGER |
| name | VARCHAR |
| price | FLOAT |

Output:

| Column | Type |
| --- | --- |
| p1 | VARCHAR |
| p2 | VARCHAR |
| qty | INTEGER |
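A hedged sketch of one approach, self-joining transactions on user_id (whether "purchased together" means the same order or any two purchases by the same user is an assumption worth confirming in the interview):

SELECT pr1.name AS p1,
       pr2.name AS p2,          -- p2 is the alphabetically first item, per the note
       COUNT(*) AS qty
FROM transactions t1
INNER JOIN transactions t2
        ON t1.user_id = t2.user_id
       AND t1.product_id <> t2.product_id
INNER JOIN products pr1 ON pr1.id = t1.product_id
INNER JOIN products pr2 ON pr2.id = t2.product_id
WHERE pr2.name < pr1.name       -- keep each pair once, with p2 first in the alphabet
GROUP BY pr1.name, pr2.name
ORDER BY qty DESC;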