The majority of SQL interview questions are straightforward. You may be asked for definitions, or to write a clearly defined SQL query.
But SQL case study questions are an entirely different beast.
These questions usually start with a hypothetical business or product issue, e.g. unsubscribe rates are falling. Then, you have to define what metrics could be used to investigate the problem, and then write the query to produce those metrics.
One of the best ways to prepare for SQL case study interviews is to walk through solutions step-by-step. This will show you how to think about metrics in hypotheticals, as well as how to walk interviewers through your logic.
We’ve done that here, with two breakdowns of SQL case questions with clear solutions.
While case study and SQL case study interview questions can cover a variety of topics, some may specifically require finding correlations as part of the analysis. In this example SQL case question, we’re looking into this issue: Unsubscribe rates have increased after a new notification system has been introduced.
Question:
Twitter wants to roll out more push notifications to users because they think users are missing out on good content. Twitter decides to do this in an A/B test.
Say that after more notifications are released, there is a sudden increase in the total number of unsubscribes.
We’re given two tables: events where actions are ‘login’, ‘nologin’, and ‘unsubscribe’ and another table called variants where user’s are bucketed into a control and a variant A/B test.
Given these tables, write a query to display a graph to understand how unsubscribes are affecting login rates over time.
Note: Let’s say that all users are automatically put into the A/B test.
events
table
Column | Type |
---|---|
user_id | INTEGER |
created_at | DATETIME |
action | STRING |
variants
table
Column | Type |
---|---|
user_id | INTEGER |
experiment | STRING |
variant | STRING |
This question asks us to compare multiple variables at play here. Specifically, we’re looking at:
We’re not sure how unsubscribes are affecting login rates, but we can plot a graph that would help us visualize how the login rates change before and after an unsubscribe from a user.
We can also see how the login rates compare for unsubscribes for each bucket of the A/B test. Given that we want to measure two different changes, we have to eventually do a GROUP BY of two different variables:
In order to visualize this, we’ll need to plot two lines on a 2D graph.
Now that we have what we’re going to graph, it’s a matter of writing a SQL query to get the dataset for the graph.
We can make sure our dataset looks something like this:
variant | days_since_unsub | login_rate |
---|---|---|
control | -30 | 90% |
test | -30 | 91% |
Each column represents a different axis or line for our graph.
We know that we have to get every user that has unsubscribed, so we’ll first INNER JOIN the abtest table to the events table, where there exists an unsubscribe event. Now we’ve isolated all users that have ever unsubscribed.
Additionally, we have to then get every event in which the user has logged in, and divide it by the total number of users that are eligible within the timeframe.
SELECT
variant
, DATEDIFF(e1.created_at, e2.created_at) AS days_since_unsub
, ROUND(COUNT(DISTINCT CASE WHEN e2.action = 'login'
THEN 1 ELSE 0 END)/COUNT(DISTINCT abtest.user_id) * 100, 2)AS login_rate
FROM variants as abtest
INNER JOIN events AS e1
ON abtest.user_id = e1.user_id
AND e1.action = 'unsubscribe'
INNER JOIN events AS e2
ON abtest.user_id = e2.user_id
AND e2.action IN ('login', 'nologin')
AND DATEDIFF(e1.created_at, e2.created_at)
BETWEEN -30 AND 30
GROUP BY 1,2
Many SQL case questions require creativity to solve. You’re given a hypothesis, but then have to determine how to prove or disprove it with specific metrics. The key here is walking the interviewer through your thought process. This example SQL case question from LinkedIn explores user career paths.
Question:
We’re given a table of user experiences representing each person’s past work experiences and timelines.
Specifically, let’s say we’re interested in analyzing the career paths of data scientists. The titles we care about are bucketed into data scientist, senior data scientist, and data science manager.
We’re interested in determining if a data scientist who switches jobs more often ends up getting promoted to a manager role faster than a data scientist that stays at one job for longer.
Write a query to prove or disprove this hypothesis.
user_experiences
table
Column | Type |
---|---|
id | INTEGER |
user_id | INTEGER |
title | STRING |
company | STRING |
start_date | DATETIME |
end_date | DATETIME |
is_current_role | BOOLEAN |
The hypothesis is that data scientists that end up switching jobs more often get promoted faster.
Therefore, in analyzing this dataset, we can prove this hypothesis by separating the data scientists into specific segments based on how often they shift in their careers.
For example, if we look at the number of job switches for data scientists that have been in their field for five years, we could prove the hypothesis if the number of data science managers increased along with the number of career jumps.
Here’s what that might look like:
We could look at this over different buckets of time as well to see if the correlation stays consistent after 10 or 15 years in a data science career.
This analysis proves to be correct except for the fact that it doesn’t count the intention of the data scientist. What happens if the data scientist didn’t ever want to become a manager?
There’s one flaw in the assumption there. It doesn’t account for the intention of the data scientist. It doesn’t answer the question: What happens if the data scientist didn’t ever want to become a manager?
One way to solve this is to do the analysis backwards.
We can subset all of the existing data science managers and see how often they ended up switching jobs before they got to their first manager position.
Then divide the number of job switches by the amount of time it took for them to achieve the manager position themselves. This way, we can end up with a result that looks like this:
But there is a fault with this analysis as well. What about all those data scientists that have switched jobs / not switched jobs but haven’t become managers yet? They could be one month away from being a manager and be subsetted out of our analysis!
We have to then make some assumptions about the distribution of existing data science managers.
Are the years of experience before they became managers normally distributed? If not, then our results might be a bit biased from our hindsight analysis.
We first make a CTE called manager_promo with all the user_ids that have been promoted to data science managers.
Next, we count the number of job switches before getting promoted as num_jobs_switched.
Then, we calculate the number of months before promotion to the data science manager position as month_to_promo.
Finally, we order by the number of jobs switched.
WITH manager_promo as (
SELECT user_id
, MIN(start_date) as promotion_date
FROM playground.user_experiences
WHERE title='data science manager'
GROUP BY user_id
)
SELECT num_jobs_switched
, AVG(
TIMESTAMPDIFF(MONTH, career_started_date, promotion_date)
) as month_to_promo
FROM (
SELECT u.user_id
, mp.promotion_date
, COUNT(DISTINCT u.id) as num_jobs_switched
, MIN(start_date) as career_started_date
FROM playground.user_experiences u
INNER JOIN manager_promo mp
on u.user_id=mp.user_id
WHERE u.start_date<mp.promotion_date
GROUP BY u.user_id, mp.promotion_date
) tt
GROUP BY num_jobs_switched
ORDER BY 2 DESC
Hint: Talk about any conclusions you could draw from your data, but also be prepared to talk about trade-offs and potential flaws.
With the query result, we can draw conclusions about the months it took each distinct user to be promoted to data science manager.
Be warned this solution is not perfect. The edge cases where users never become promoted to data science managers are not considered.
Finally, many adjustments, like creating buckets for different ranges of months (0-20 months to promotion, 20-40 months to promotion, etc.), can present a more digestible, high-level analysis on whether frequent job changes affect promotion opportunities to the data science manager position.
Each bucket would correspond to the average time it took the users in that bucket to be promoted to a data science manager position.
This course is designed to help you learn everything you need to know about working with data, from basic concepts to more advanced techniques.
If you have an interview coming up, review Interview Query’s data science course, which includes modules in SQL.
SQL interviews are demanding, and the more you practice all types of SQL interview questions and not just case questions, the more confident and efficient you’ll become in answering them.