Over ten years ago, a data analyst interview was very simple. All you needed to know was two things:
Nowadays, the role of a data analyst has changed. Not only have salaries shot up, but data analysts are more in-demand than ever before due to their insight and analytical skillset.
Most data analyst jobs at tech companies require a strong technical skillset combined with good judgment. In this guide, we’ll break down the interview process and the most common data analyst interview questions.
Technical interviews for data analyst roles are typically multi-stage interviews. They start with initial screens designed to weed out candidates, and quickly progress to more technically demanding screens. To prepare for a data analyst interview, practice questions in each of these categories.
Here’s a typical breakdown of the data analyst interview process:
Initial screens are generally calls with recruiters. These screens assess your experience, your interests/specializations, and salary expectations. In some cases, you may be asked basic SQL questions or a simple scenario-based case question.
Sample question: Tell us about a challenging project you have worked on. What were some of the obstacles you had to overcome?
The technical screen assesses your technical skills. In many cases, SQL is a primary focus. SQL questions range from basic definitions, to writing intermediate-to-advanced queries. Depending on the job function, you may be asked technical questions about Python, statistics and probability, algorithms and A/B testing.
Sample question: Given two tables, write a query to find the number of users that accepted Friend requests in the last month.
Take-homes are longer tests that may take several hours to complete. These challenges are designed to evaluate your ability to handle data, perform analysis and present your results effectively. Typically, these tests will ask you to perform an investigation of a dataset and present your findings.
Sample question: Prepare a summary of sales and website data for the Vice President of Marketing. Include an overview of website traffic and sales, as well as areas for improvement.
Data analyst on-site interviews typically consist of 3-5 hour-long interviews. They typically cover traditional technical SQL and statistics questions, as well as data analytics case studies and behavioral questions.
Sample question: Describe an analytics project you worked on. What were some challenges you faced?
Interview Query regularly analyzes the contents of data analyst interviews. By tagging common keywords and mapping them back to question topics for over 10K+ tech companies, we’ve found that SQL questions are asked most frequently.
In fact, in interviews for data analyst roles, SQL and data manipulations questions are asked 85% of the time.
Here are the types of technical interview questions data analysts get asked most frequently:
Additionally, for more traditional data analyst roles, expect interview questions around:
Let’s first dive into how to approach and answer behavioral interview questions.
Behavioral questions in data analyst interviews ask about specific situations you’ve faced in which you had to apply particular skills or knowledge.
For many data analysts, behavioral questions can be fairly tough.
One tip: Always try to relate the question back to your experience and strengths.
Successful data analysts can help businesses identify anomalies and respond quickly.
For data sense questions, think about a time that you were able to spot an inconsistency in data quality, and how you eventually addressed it.
Interviewers want to see you demonstrate:
Here’s an example answer: “In my previous job, I was working on a sales forecasting problem under a strict deadline. However, due to a processing error, I was missing the most recent data, and only had 3-year-old sales figures. The strategy I took was applying the growth factor to the data to establish correct correlation and variances. This strategy helped me deliver a close forecast and meet the deadline.”
Interviewers ask this question to see if you can make complex subjects accessible and that you have a knack for communicating insights in a way that persuades people. Here’s a marketing analytics example response:
“I was working on a customer segmentation project. The marketing department wanted to better segment users. I worked on a presentation and knew the audience wouldn’t understand some of the more complex segmenting strategies, so I put together a presentation that talked about the benefits and potential trade-offs of segmenting options like K-means clustering. For each option, I created a slide to show how it worked, and after the presentation, we were able to have an informed discussion about which approach to use.”
Interviewers want to see that you can set manageable goals and understand your process for achieving them. Don’t forget to mention the challenges you faced, which will make your response more dynamic and insightful. For example, you might say:
“Data visualization was something I struggled with in college. I didn’t have a strong design eye, and my visualizations were hard to read. In my last job, I made it a goal to improve, and there were two strategies that were most helpful. I took an online data visualization course, and I built a clip file of my favorite visualizations. The course was great for building my domain knowledge. However, I felt I learned the most by building my clip file and breaking down what made a good visualization on my own.”
This question assesses your ability to remain objective at work, that you communicate effectively in challenging situations, and that you remain calm under fire. Here’s an example response:
“In my previous job, I was the project manager on a dashboard project. One of the BI engineers wasn’t meeting the deadlines I had laid out, and I brought that up with him. At first, he was defensive and angry with me. But I listened to his concerns about the deadlines and asked what I could do to help. From our conversation, I learned he had a full workload in addition to this project. I talked with the engineering manager, and we were able to reduce some of his workload. He caught up quickly and we were able to finish the project on time.”
This is a leadership question in disguise. If you can relate a time you were an effective leader, chances are you will easily incorporate all of these traits. For example:
“I was the lead on a marketing analytics project. We had a critical deadline to meet, but due to a data processing error, we were in danger of missing the deadline. The team morale was low, so I held a quick meeting to lay out a schedule, answer questions, and rally the team. That meeting gave the team the jolt it needed. We made the deadline, and I made sure leadership knew how hard each of the contributors had worked.”
This question tests your resilience, how you respond to adversity, and how you learn from your mistakes. You could say:
“I had to give a presentation about a data analytics project to a client. One mistake I made was assuming the audience had more technical knowledge than they did. The presentation was received by a lot of blank stares. However, I knew the material about our findings was strong. I stopped for questions, and then, I jumped ahead to the visualizations and findings. This helped get the presentation on track, and by the end, the client was impressed. Now, whenever I have a presentation, I take time to understand the audience before I start working on it.”
A strong response to this question shows that you can solve problems creatively and that you don’t just jump at the first or easiest solution. One tip: Illustrate your story with data to make it more credible.
Here’s what you could say: “In my previous job, I was responsible for competitor research, and through my analysis, I noticed that our most significant competitors had increased sales 5% during Q1. This deviated significantly from our sales forecasts for these accounts. I found that we needed to update our competitor sales models with more recent market research and historical data. I tested the model adjustments, and ultimately, I was able to improve our forecasting accuracy by 15%.”
Interviewers ask this question to gauge your emotional maturity, see that you can remain objective, and gain insights into your communication skills. Avoid subjective examples like “my boss was a micromanager.” Instead, you could say:
“One time, I disagreed with my manager over the process for building a dashboard, as their approach was to jump straight into the execution. I knew that it would be better to perform some planning in advance, rather than feeling our way through and reacting to roadblocks as they arose, so I documented a plan that could potentially save us time in development. That documentation and planning showed where pitfalls were likely to arise, and by solving for future issues we were able to launch the new dashboard three weeks early.”
This question is asked to see how confident you are in your communication skills, and it provides insight into how you communicate complex technical ideas. With this question, talk about the various ways you make data and analytics accessible. Try to answer these questions:
This question is basically asking: Are you open to new ideas in your work? Analysts can get stuck on trying to prove their hypothesis, even if the data says otherwise. A successful analyst is OK with being wrong and listens to the data. You could say:
“While working on a customer analytics project, I was surprised to find that a subsegment of our customer base wasn’t actually responding to the offers we were providing. We had lumped the subsegment into a larger customer bucket, and had assumed that a broader segmentation wouldn’t make a difference. I relayed the insight to the marketing team, and we were able to reduce churn among this subsegment.”
This question is super common in analyst behavioral interviews. However, it still trips a lot of candidates up. Another variation of this question is: Why did you want to work in data analytics?
In your response, your goal should be to convey your passion for the work and talk about what excites you about the company/role. You might focus on the company’s culture, a mentor who inspired you, a recommendation you received, or someone in your network who’s connected with the company. A sample response:
“I’m excited by the possibility of using data to foster stronger social connections amongst friends and peers. I also like to ‘go fast’ and experiment, which fits into Meta’s innovative culture.”
Interviewers ask questions like this to assess how you handle adversity and adapt. Don’t be afraid to share what went wrong. But do describe what you learned and how you apply it to future work.
Here’s a sample answer for a data analyst role: “I presented a data analytics project to non-technical stakeholders, but my presentation was far too technical. I realized that the audience wasn’t following the technical aspects, so I stopped and asked for questions. I spent time clarifying the technical details until there were no questions left. One thing I learned was that it’s important to tailor presentations to the audience, so before I start a presentation, I always consider the audience.”
Data analysts use SQL to query data to solve complex business problems or find answers for other employees. In general, SQL data analyst questions focus on analytics and reporting:
Reporting interview questions focus on writing a query to generate an already-known output, such as producing a report or a metric from some example table.
For analytics-based questions, you might have to understand what kind of report or graph to build first and then write a query to generate that report. So it’s an extra step on top of a regular SQL question.
How do you handle NULL values when querying a data set?

To handle such a situation, we can use three different operators:
- IS NULL: returns true if the column value is NULL.
- IS NOT NULL: returns true if the column value is not NULL.
- <=>: the NULL-safe equality operator (MySQL), which (unlike the = operator) returns true even when comparing two NULL values.
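For example, a minimal sketch against a hypothetical orders table with a nullable shipped_date column (note that <=> is MySQL-specific):

-- orders that have not shipped yet
SELECT order_id FROM orders WHERE shipped_date IS NULL

-- orders that have shipped
SELECT order_id FROM orders WHERE shipped_date IS NOT NULL

-- NULL-safe comparison: true even when both sides are NULL (MySQL)
SELECT o.order_id
FROM orders o
JOIN orders_staging s
  ON o.order_id = s.order_id
 AND o.shipped_date <=> s.shipped_date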
What is the difference between UNION and UNION ALL?

UNION and UNION ALL are SQL operators used to concatenate two or more result sets. This allows us to write multiple SELECT statements, retrieve the desired results, then combine them together into a final, unified set. The main difference between UNION and UNION ALL is that:
- UNION: only keeps unique records
- UNION ALL: keeps all records, including duplicates

What is the difference between a table and a view?

A table is structured with columns and rows. A view is a virtual table extracted from a database by writing a query.
What is the difference between an INNER and OUTER JOIN?

The difference between an inner and outer join is that inner joins result in the intersection of two tables, whereas outer joins result in the union of two tables.
What is the difference between WHERE and HAVING?

The WHERE clause is used to filter rows before grouping, and HAVING is used to exclude records after grouping.
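For example, a minimal sketch against a hypothetical orders table:

-- WHERE filters individual rows before grouping;
-- HAVING filters the aggregated groups afterward
SELECT customer_id, SUM(amount) AS total_spent
FROM orders
WHERE order_date >= '2024-01-01'
GROUP BY customer_id
HAVING SUM(amount) > 1000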
What does the CASE WHEN function do?

CASE WHEN lets you write complex conditional statements in the SELECT clause, and also allows you to pivot data from long to wide formats.
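A short sketch of conditional aggregation with CASE WHEN, assuming a hypothetical orders table with a status column; each status becomes its own column in the output:

SELECT
  customer_id,
  SUM(CASE WHEN status = 'completed' THEN 1 ELSE 0 END) AS completed_orders,
  SUM(CASE WHEN status = 'cancelled' THEN 1 ELSE 0 END) AS cancelled_orders
FROM orders
GROUP BY customer_id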
What is the difference between DELETE and TRUNCATE in SQL?

Although they’re both used to delete data, a key difference is that DELETE is a Data Manipulation Language (DML) command, while TRUNCATE is a Data Definition Language (DDL) command.

Therefore, DELETE is used to remove specific rows from a table, while TRUNCATE removes all of the rows from a table while keeping the table’s structure intact.

Another difference: DELETE can be used with the WHERE clause, but TRUNCATE cannot. Without a WHERE clause, DELETE removes all of the data from the table row by row, while TRUNCATE removes it all at once; neither removes the table itself (that’s what DROP TABLE does).
What does the EXTRACT function do?

EXTRACT allows us to pull a part of a temporal value, such as the year, month, day, or hour, from date, time, timestamp, and interval values.
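For example, assuming a hypothetical orders table with an order_date timestamp, EXTRACT can pull out the year and month for a monthly rollup:

SELECT
  EXTRACT(YEAR FROM order_date) AS order_year,
  EXTRACT(MONTH FROM order_date) AS order_month,
  COUNT(*) AS orders
FROM orders
GROUP BY 1, 2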
For this question, assume the table is called “Employees” and the last name column is “LastName”.
SELECT * FROM Employees WHERE LastName BETWEEN 'Bailey' AND 'Frederick'
What is the ISNULL function? When would you use it?

The ISNULL function returns an alternative value if an expression is NULL. Therefore, if you wanted to add a default value for NULL values, you would use ISNULL. For example, in the statement:
SELECT name, ISNULL(price, 50) FROM PRODUCTS
NULL
price values would be replaced with 50.
Let’s say we want to randomly sample a row from a very large table.

big_table table:
Column | Type |
---|---|
id | INTEGER |
name | VARCHAR |
In most SQL databases, there exists a RAND() function which normally we can call:
SELECT * FROM big_table
ORDER BY RAND()
The function will randomly sort the rows in the table. This function works fine and is fast if you only have, let’s say, around 1,000 rows. It might take a few seconds to run at 10K. And then at 100K maybe you have to go to the bathroom or cook a meal before it finishes.
What happens at 100 million rows?
Someone in DevOps is probably screaming at you.
Random sampling is important in SQL with scale. We don’t want to use the pre-built function because it wasn’t meant for performance. But maybe we can re-purpose it for our own use case.
We know that the RAND() function actually returns a floating point between 0 and 1. So if we were to instead call:
SELECT RAND()
We would get a random decimal point to some Nth degree of precision. RAND() essentially allows us to seed a random value. How can we use this to select a random row quickly?
Let’s try to grab a random number using RAND() from our table that can be mapped to an id. Given we have 100 million rows, we probably want a random number from 1 to 100 million. We can do this by multiplying our random seed from RAND() by the max number of rows in our table.
SELECT CEIL(RAND() * (
SELECT MAX(id) FROM big_table)
)
We use the CEIL
function to round the random value up to the nearest integer. Now we have to join back to our existing table to get the value.
What happens if we have missing or skipped id values, though? We can solve for this by running the join on all the ids which are greater or equal than our random value and selecting only the direct neighbor if a direct match is not possible.
As soon as one row is found, we stop (LIMIT 1). And we read the rows according to the index (ORDER BY id ASC). Now our performance is optimal.
SELECT r1.id, r1.name
FROM big_table AS r1
INNER JOIN (
SELECT CEIL(RAND() * (
SELECT MAX(id)
FROM big_table)
) AS id
) AS r2
ON r1.id >= r2.id
ORDER BY r1.id ASC
LIMIT 1
Hint: We want the value of two different metrics: the number of users that have posted a job only once and the number of users that have posted at least one job multiple times. What does that mean exactly?
More context. Let’s say we have a table representing a company payroll schema.
Due to an ETL error, the employees table performed an insert every year during compensation adjustments instead of updating the existing salaries. The head of HR still needs the current salary of each employee.
More context. Let’s say you work at Costco. Costco has a database with two tables. The first is users composed of user information, including their registration date, and the second table is purchases which has the entire item purchase history (if any) for those users.
Here’s a process you can use to solve this question:

1. Use INNER JOIN or JOIN to connect the users and purchases tables on the user_id column.
2. Filter the records you need with the WHERE clause.
3. Use GROUP BY to aggregate items, and apply the SUM() function to calculate the amount spent.

Here’s a code solution:
SELECT
u.name
,u.id AS user_id
,ROUND(SUM(p.price * t.quantity ) ,2) AS total_cost
FROM users u
INNER JOIN transactions t
ON u.id = t.user_id
INNER JOIN products p
ON p.id = t.product_id
GROUP BY u.id, u.name
ORDER BY total_cost DESC
Hint: This question has been asked in Amazon data analyst interviews, and the first step is getting data from the users table to the transactions table. This can be done using a JOIN, based on a common column between the tables. How do we identify when the addresses match? We can use the CASE WHEN statement to produce a flag to use in further calculations. Finally, we need the percentage of all the transactions made to the primary address rounded to two decimals.
You’re provided with three tables representing a forum of users and their comments on posts and are asked to find the top three users with the most upvotes in the year 2020. Additionally, we’re told that upvotes on deleted comments and upvotes that users make on their own comments don’t matter.
Hint: The trickiest part about this question is performing your JOINs on the proper fields. If you join two of our tables on the wrong key, you could make things difficult, or even impossible, for yourself later on.
In this question, you’re given two tables: transactions and users.
Hint: Start by joining the transactions and users tables. Use INNER JOIN or JOIN.
search_results table:
Column | Type |
---|---|
query | VARCHAR |
result_id | INTEGER |
position | INTEGER |
rating | INTEGER |
You want to be able to compute a metric that measures the precision of the ranking system based on position. For example, if the results for dog and cat are….
query | result_id | position | rating | notes |
---|---|---|---|---|
dog | 1000 | 1 | 2 | picture of hotdog |
dog | 998 | 2 | 4 | dog walking |
dog | 342 | 3 | 1 | zebra |
cat | 123 | 1 | 4 | picture of cat |
cat | 435 | 2 | 2 | cat memes |
cat | 545 | 3 | 1 | pizza shops |
…we would rank ‘cat’ as having a better search result ranking precision than ‘dog’ based on the correct sorting by rating.
Write a query to create a metric that can validate and rank the queries by their search result precision. Round the metric (avg_rating column) to 2 decimal places.
Hint: What is a cumulative distribution exactly? If we were to imagine our output and figure out what we wanted to display on a cumulative distribution graph, what would the dataset look like?
Write a query to get the total three day rolling average for deposits by day.
Usually, if the problem states to solve for a moving/rolling average, we’re given the dataset in the form of a table with two columns, the date and the value.
This problem, however, goes one step further: we’re given a table of raw transactions, so the values must first be filtered to keep only deposits and remove records representing withdrawals, which are denoted by negative values (e.g., -10).
More context. Let’s say we’re interested in analyzing the career paths of data scientists. We have job titles bucketed into data scientist, senior data scientist, and data science manager. We’re interested in determining if a data scientist who switches jobs more often ends up getting promoted to a manager role faster than a data scientist that stays at one job for longer.
This question has been asked in Google data analyst interviews, and it requires a bit of creative problem solving to understand how we can prove or disprove the hypothesis. The hypothesis is that data scientists that end up switching jobs more often get promoted faster.
Therefore, in analyzing this dataset, we can test this hypothesis by separating the data scientists into segments based on how often they switch jobs during their careers. How would you do that?
Our focus is getting our key metric, the number of new conversations created per day, in a single query. To get this metric, we have to group by the date field and count the distinct number of users messaged. Afterward, we can group by that frequency value and take the total count as our distribution.
More context. We’re given three tables representing a forum of users and their comments on posts. We want to figure out if users are creating multiple accounts to upvote their own comments. What kind of metrics could we use to figure this out?
Hint. What metric would help you investigate this problem?
More context. Let’s say that Facebook account managers are not able to reach business owners after repeated calls to try to onboard them onto a new Facebook business product. Assume that we have training data on all of the account manager’s outreach in terms of calls made, calls picked up, time of call, etc…
One option would be to investigate when calls are most likely to be connected. Could changing our approach here improve connection rate?
Follow-up question. How would you investigate the causes of such a disparity?
Data analytics case study questions combine a rotating mix of product intuition, business estimation, and data analytics.
Case questions come up in interviews when the job responsibilities lean to more of a heavy analytics space with an emphasis on solving problems and producing insights for management.
Many times data analysts will transition into a heavy analytics role when they’re required to take on more scope around the product and provide insights that upper level management can understand and interpret.
So data analytics case study questions will focus on a particular problem and you will be judged on how you break down the question, analyze the problem, and communicate your insights.
Here’s an example data analytics case study question:
Community members can create a post to ask a question, and other users can reply with answers or comments to that question. The community can express their support for the post by upvoting or downvoting.
post_analytics
table:
Column | Type | Description |
---|---|---|
id | int | Primary key of posts table |
user_id | int | ID of the user who created the post |
created_at | datetime | Timestamp of the post |
title | string | Title of the post |
body | string | Text content of the post |
comment_count | int | Total number of comments on a post |
view_count | int | Total number of views on a post |
answer_count | int | Total number of answers on a post |
upvotes | int | Total number of upvotes on the post |
More context. You work at Stack Overflow on the community team that monitors the health of the platform.
This is a classic data analytics case study. A question like this is designed to assess your data intuition, product sense, and ability to isolate key metrics.
Remember: There isn’t one correct answer, but usually, the conversation should head in a similar direction.
For example, this question asks about community health. Broadly, there are several metrics you’ll want to consider: Growth rate, engagement, and user retention would provide insights into the community’s health.
The challenge with this question is to determine how to measure those metrics with the data provided.
Case questions sometimes take the form of behavioral questions. Data analysts get tasked with experimenting with data to test new features or campaigns. Many behavioral questions will ask about experiments but also tap into how you approach measuring your results.
With questions like these, be sure to describe the objective of the experiment, even if it is a simple A/B test. Don’t be afraid to get technical and explain the metrics you used and the process you used to quantify the results.
Bonus question. How would you measure the success of this new feature?
See a step-by-step solution to this data analytics case study problem.
More context. You’re given a table that represents search results from searches on Facebook. The query column is the search term, the position column represents each position the search result came in, and the rating column represents the human rating from 1 to 5, where 5 is high relevance and 1 is low relevance.
Each row in the search_events
table represents a single search with the has_clicked
column representing if a user clicked on a result or not. We have a hypothesis that the CTR is dependent on the search result rating.
Python coding questions for data analysts are usually pretty simple and not as difficult as the ones seen on Leetcode. Mostly, interviewers just want to test enough basic knowledge of Python to know you can write scripts or simple functions to move data between SQL and Excel or onto a dashboard.
Most data analysts never write production code, so their code is never under the scrutiny of holding a website up or performing some critical business function.
Therefore, most coding questions for data analyst interviews are generally on the easier side and mostly test basic functions that are required for data manipulation. Pandas questions may also be asked in this round of the interview.
Here’s an example Python coding question:
sentence = "Have free hours and love children?"
output = [
('have', 'free'),
('free', 'hours'),
('hours', 'and'),
('and', 'love'),
('love', 'children')
]
Bigrams are two words that are placed next to each other. To actually parse them out of a string, we need to first split the input string.
We would use the Python function .split() to create a list with each individual word as an element. Then, create another empty list that will eventually be filled with tuples.
Then, once we’ve identified each individual word, we need to loop k-1 times (if k is the number of words in the sentence) and append the current word and the subsequent word as a tuple. This tuple gets added to a list that we eventually return. Remember to use the Python function .lower()
to turn all the words into lowercase!
def find_bigrams(sentence):
    input_list = sentence.split()
    bigram_list = []
    # Now we have to loop through each word
    for i in range(len(input_list) - 1):
        # strip the whitespace and lower the word to ensure consistency
        bigram_list.append((input_list[i].strip().lower(), input_list[i + 1].strip().lower()))
    return bigram_list
Negative indexing is a function in Python that allows users to index arrays or lists from the last element. For example, the value -1 returns the last element, while -2 returns the second-to-last element. It is used to display data from the end of a list, or to reverse a number or string.
Example of negative indexing:
a = "Python Data Analyst Questions"
print (a[-1])
>> s
Compound data structures are single variables that represent multiple values. Some of the most common in Python are lists, tuples, sets, and dictionaries.
Lists, tuples, and sets are compound data types that serve a similar purpose: storing collections of items in Python. However, knowing the differences between each of them is crucial for compute and memory efficiency.
You can check for duplicates using the Pandas duplicated() method. This returns a boolean Series which is True for rows that are duplicates of another row.
DataFrame.duplicated(subset=None,keep='last')
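For instance, a small sketch on a made-up DataFrame:

import pandas as pd

df = pd.DataFrame({'id': [1, 2, 2, 3], 'name': ['a', 'b', 'b', 'c']})

print(df.duplicated(keep='last'))     # True for rows that have a later duplicate
print(df[df.duplicated(keep=False)])  # show every row involved in a duplicate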
List comprehension is used to define and create a list based on an existing list. For example, if we wanted to separate all the letters in the word “retain,” and make each letter a list item, we could use list comprehension:
r_letters = [ letter for letter in 'retain' ]
print( r_letters)
We can also use list comprehension for filtering. For example, to get all the vowels in the word “retain”, we do the following:
vowels = [vowel for vowel in 'retain' if vowel in ('a', 'e', 'i', 'o', 'u')]
print(vowels)
If you are concerned about duplicate values, you can opt for sets instead, by replacing “[]” with “{}”.
unique_vowels = {vowel for vowel in 'retain' if vowel in ('a', 'e', 'i', 'o', 'u')}
print(unique_vowels)
Sequence unpacking is a Python operation that allows you to de-structure the elements of a collection and assign them directly to variables without the need for iteration. It provides a terse method for mapping variables to the elements of a compound data structure. For example:
# instead of:
x = coordinates[0]
y = coordinates[1]
# we can unpack a list:
x, y = coordinates
# we can also do the same for sets, tuples, and dictionaries.
We can even swap the elements of two variables without the use of a third variable:
a = 3
b = 2
a, b = b, a
assert a == 2
assert b == 3
# no assertion errors
If the size of a collection is unclear, you can use the *
operator on a variable to assign all extra items to said variable:
food = ('apples', 'oranges', 'carrots', 'cabbages', 'lettuce')
apples, oranges, *vegetables = food
# apples = 'apples', oranges = 'oranges',
# vegetables = ['carrots', 'cabbages', 'lettuce']
Hint: You need the equation for standard deviation to answer this question. The equation takes the sum of the squared differences between each data value and the mean, divides by the total number of data points, and puts it all under a square root.
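Written out, the population standard deviation is:

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$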
This question sounds like it should be a SQL question doesn’t it? Weekly aggregation implies a form of GROUP BY in a regular SQL or pandas question. In either case, aggregation on a dataset of this form by week would be pretty trivial.
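For instance, a minimal pandas sketch with made-up data (the column names are assumptions):

import pandas as pd

df = pd.DataFrame({
    'date': ['2024-01-01', '2024-01-03', '2024-01-09', '2024-01-10'],
    'value': [10, 5, 7, 3],
})
df['date'] = pd.to_datetime(df['date'])

# bucket records into calendar weeks and sum the values in each week
weekly = df.groupby(pd.Grouper(key='date', freq='W'))['value'].sum().reset_index()
print(weekly)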
Example:
A = 'abcde'
B = 'cdeab'
can_shift(A, B) == True
A = 'abc'
B = 'acb'
can_shift(A, B) == False
Hint: This problem is relatively simple if we work out the underlying algorithm that allows us to easily check for string shifts between the strings A and B.
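One possible sketch, using the classic observation that B is a shifted version of A exactly when B has the same length as A and appears inside A concatenated with itself:

def can_shift(a, b):
    # a rotation of a must have the same length and appear within a + a
    return len(a) == len(b) and b in (a + a)

assert can_shift('abcde', 'cdeab') is True
assert can_shift('abc', 'acb') is False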
Hint: Notice that in the subsequence problem set, one string in this problem will need to be traversed to check for the values of the other string. In this case, it is string2.
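A minimal two-pointer sketch, assuming the task is to check whether string1 appears as a subsequence of string2:

def is_subsequence(string1, string2):
    # traverse string2 once, advancing through string1 whenever the characters match
    i = 0
    for ch in string2:
        if i < len(string1) and ch == string1[i]:
            i += 1
    return i == len(string1)

assert is_subsequence('abc', 'axbyc') is True
assert is_subsequence('acb', 'abc') is False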
Statistics and probability questions for data analysts will usually come up on an onsite round as a test of basic fundamentals.
Statistics questions are more likely than probability questions to show up, as statistics are the fundamental building blocks for many analyst formulas and calculations.
Given that X and Y both have a mean of 0 and a standard deviation of 1, what does that indicate for the distributions of X and Y?
Let’s look at this question a little closer.
We’re given two normal distributions. The values can be either positive or negative, and positive and negative values are equally likely to occur. Since we know the mean is 0 and the standard deviation is 1, we understand that the distributions are also symmetrical across the Y-axis.
In this scenario, we are equally likely to randomly sample a value that is greater than 0 or less than 0 from the distribution.
Now, let’s take examples of random values that we could get from each scenario. There are about six different scenarios here.
We can simulate a random sampling by equating that all six are equally likely to occur. If we play out each scenario and plug the variables into 2X > Y, then we see about half of the time the statement is true, or 50%.
Why is this the case? If we go back to the fact that both distributions are symmetric around zero, then 2X - Y is also symmetric around zero, so it is positive (that is, 2X > Y) exactly half of the time.
To answer this question, start by thinking about how a biased estimator looks. Then, think about how an unbiased estimator differs. Ultimately, an estimator is unbiased if its expected value equals the true value of a parameter, meaning that the estimates are in line with the average.
Hint: In order to decrease our margin of error, we’ll probably have to increase our sample size. But by how much?
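The key relationship: for a fixed confidence level, the margin of error shrinks with the square root of the sample size, so halving the margin of error requires roughly four times the sample:

$$ME = z \cdot \frac{\sigma}{\sqrt{n}} \quad\Rightarrow\quad n = \left(\frac{z\sigma}{ME}\right)^2$$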
Covariance measures the direction of the linear relationship between two variables, while correlation measures both the strength and direction of that relationship on a standardized scale. In other words, correlation is a normalized function of covariance. Note that a correlation between two variables does not mean that a change in variable X caused the change in variable Y’s value.
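In formula form, correlation is covariance scaled by the two standard deviations:

$$\rho_{X,Y} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y}$$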
Probability distributions represent random variables and associated probabilities of different outcomes. In essence, a distribution maps the probability of various outcomes.
For example, a distribution of test grades might look similar to a normal distribution, AKA a bell curve, with the highest number of students receiving Bs and Cs, and a smaller percentage of students failing or receiving a perfect score. In this way, the center of the distribution is the highest point, while outcomes at either end of the scale fall lower and lower.
A probability distribution is not normal if most of its observations do not cluster around the mean, forming the bell curve. An example of a non-normal probability distribution is a uniform distribution, in which all values are equally likely to occur within a given range. A random number generator set to produce only the numbers 1-5 would create such a non-normal distribution, as each value would be roughly equally represented after several hundred iterations.
More context. You are about to get on a plane to Seattle. You call 3 random friends in Seattle and ask each if it’s raining. Each has a 2⁄3 chance of telling you the truth and a 1⁄3 chance of messing with you by lying. All 3 friends tell you that “yes” it is raining.
Hint: There are several ways to answer this question. Given that a frequentist approach operates on the set of known principles and variables given in the original problem, you can logically deduce that P(Raining)= 1-P(Not Raining).
Since all three friends have given you the same answer as to whether or not it’s raining, what can you determine about the relationship between P(Not Raining) and the probability that each of your friends is lying?
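Following the hint’s frequentist logic (one common way this answer is framed, treating the three friends’ answers as independent):

$$P(\text{Raining}) = 1 - P(\text{all three are lying}) = 1 - \left(\frac{1}{3}\right)^3 = \frac{26}{27} \approx 0.96$$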
Before jumping into anomaly detection, discuss what the meaning of a univariate dataset is. Univariate means one variable. For example, travel time in hours from your city to 10 other cities is given in an example list below:
12, 27, 11, 41, 35, 22, 18, 43, 26, 10
This kind of single-column data set is called a univariate dataset. Anomaly detection is a way to discover unexpected values in datasets; an anomaly is data that differs from the normal data. For example, in the dataset below, one data point is intuitively and unexpectedly high:
12, 27, 11, 41, 35, 22, 76767676, 18, 43, 26, 10
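A minimal sketch of one simple approach, z-score thresholding (the three-standard-deviation cutoff is a common convention, not the only valid choice; robust methods like IQR or median absolute deviation are also popular):

from statistics import mean, pstdev

data = [12, 27, 11, 41, 35, 22, 76767676, 18, 43, 26, 10]

mu = mean(data)
sigma = pstdev(data)

# flag any point whose z-score exceeds 3 in absolute value
anomalies = [x for x in data if abs(x - mu) > 3 * sigma]
print(anomalies)  # [76767676]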
You should answer these questions in your response:
Hint: The first step in solving this problem is to separate it into two instances– one where you grab the fair coin, and one where you grab the biased coin. Solve for the probabilities of flipping the same side separately for both.
Let’s say our overall capital approval rate has gone down. Last week it was 85%, and this week the approval rate dropped to 82%, which is a statistically significant reduction.
The first analysis shows that all approval rates stayed flat or increased over time when looking at the individual products.
Hint: This would be an example of Simpson’s Paradox which is a phenomenon in statistics and probability. Simpson’s Paradox occurs when a trend shows in several groups but either disappears or is reversed when combining the data.
In statistics, confidence intervals refer to a range of values that you expect your estimate to fall within if you were to rerun a test. A confidence interval equals the mean of your estimate plus or minus the margin of error.
For example, if a presidential popularity poll had a 93% confidence level and an interval spanning 50%-55% approval, it would be expected that, if you re-polled your sample 100 more times, the estimate would fall between the upper and lower values of your interval about 93 times. The other seven estimates would fall outside, which is to say either below 50% or above 55%. More polling would allow you to get closer to the true population average and narrow the interval.
One question to add: does order matter here? Is drawing an ace on the second card the same thing as drawing an ace on the first card and still drawing a second card? Let’s see if we can solve and prove this out.
We can generalize to two scenarios for getting an ace when drawing two cards:
If we model the probability of the first scenario we can multiply the two probabilities of each occurrence to get the actual probability.
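As a worked check, assuming the question asks for the probability that the second card drawn is an ace, the two scenarios (first card is an ace vs. first card is not an ace) give:

$$P(\text{second is ace}) = \frac{4}{52}\cdot\frac{3}{51} + \frac{48}{52}\cdot\frac{4}{51} = \frac{204}{2652} = \frac{4}{52}$$

which equals the probability of drawing an ace on the first card, so the order doesn’t change the answer.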
A/B testing and experimentation questions for data analysts tend to explore the candidate’s ability to properly conduct A/B tests. You should have strong knowledge of p-values, confidence intervals, and assessing the validity of the experiment.
In this particular question, you’ll need to clarify the context of how the A/B test was set up and measured.
If we have an A/B test to analyze, there are two main ways in which we can look for invalidity. We could likely re-phrase the question to: How do you set up and measure an A/B test correctly?
Let’s start out by answering the first part of figuring out the validity of the set up of the A/B test:
1. How were the user groups separated?
Can we determine that the control and variant groups were sampled according to the test conditions?
If we’re testing changes to a landing page to increase conversion, can we compare the two different users in the groups to see different metrics in which the distributions should look the same?
For example, if the groups were randomly bucketed, does the distribution of traffic from different attribution channels still look similar or is the variant A traffic channel coming primarily from Facebook ads and the variant B from email? If testing group B has more traffic coming from email then that could be a biased test.
2. Were the variants equal in all other aspects?
The outside world often has a much larger effect on metrics than product changes do. Users can behave very differently depending on the day of week, the time of year, the weather (especially in the case of a travel company like Airbnb), or whether they learned about the website through an online ad or found the site organically.
If variant A’s landing page has a picture of the Eiffel Tower and the submit button at the top of the page, and variant B’s landing page has a large picture of an ugly man and the submit button at the bottom of the page, then we could get conflicting results based on the change to multiple features.
Measurement
Looking at the actual measurement of the p-value, the industry-standard threshold is 0.05, which means we accept a 1-in-20 chance of concluding there is a difference between the populations when there really isn’t one.
However, we have to note a couple of things about the test in the measurement process.
What was the sample size of the test?
Additionally, how long did it take before the product manager measured the p-value? Lastly, how did the product manager measure the p-value and did they do so by continually monitoring the test?
If the product manager ran a t-test with a small sample size, they could very well get a p-value under 0.05. Many times, the source of confusion in A/B testing is how much time you need before drawing a conclusion about the results of an experiment.
The problem with using the p-value as a stopping criterion is that the statistical test that gives you a p-value assumes that you designed the experiment with a sample and effect size in mind. If we continuously monitor the development of a test and the resulting p-value, we are very likely to see an effect, even if there is none. The opposite error is also common when you stop an experiment too early, before an effect becomes visible.
The most important reason is that you are performing a statistical test every time you compute a p-value, and the more often you do it, the more likely you are to find an effect.
How long should we recommend an experiment to run for then? To prevent a false negative (a Type II error), the best practice is to determine the minimum effect size that we care about and compute, based on the sample size (the number of new samples that come every day) and the certainty you want, how long to run the experiment for, before starting the experiment.
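As a rough sketch of that pre-experiment calculation (the baseline rate, minimum detectable effect, and error rates below are illustrative assumptions):

from scipy.stats import norm

p1 = 0.10                 # baseline conversion rate
p2 = 0.12                 # smallest lift we care about detecting
alpha, power = 0.05, 0.80

z_alpha = norm.ppf(1 - alpha / 2)   # two-sided test
z_beta = norm.ppf(power)
p_bar = (p1 + p2) / 2

# standard two-proportion sample size approximation, per variant
n_per_variant = ((z_alpha + z_beta) ** 2 * 2 * p_bar * (1 - p_bar)) / (p1 - p2) ** 2
print(round(n_per_variant))

Dividing the required sample size by the number of eligible users arriving each day then gives a minimum run time for the experiment.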
Split testing fails when you have unclear goals. That’s why it’s imperative to start backwards with that goal. Is it to increase conversions? Are you trying to increase engagement and time spent on page? Once you have that goal, you can start experimenting with variables.
Statistical significance - or having 95% confidence in the results - requires the right volume of data. That’s why most A/B tests run for 2-8 weeks. Comparing metrics like conversions is fairly easy to calculate. In fact, most A/B tools have built-in calculators.
Hint: A question like this asks you to think hypothetically about A/B testing. But the format is the same: Walk the interviewer through setting up the test and how you arrive at a statistically relevant result.
The p-value is a fundamental concept in statistical testing. First, why does this kind of question matter? The interviewer is looking to see whether you can answer in a way that both conveys your understanding of statistics and makes sense to a non-technical colleague who doesn’t understand why a p-value matters.
For example, if you were a data scientist and explained to a PM that the ad campaign test has a .08 p-value, why should the PM care about this number?
The new channels include YouTube ads, Google search ads, Facebook ads, and direct mail campaigns.
To start, you’d want to follow up with some clarifying questions and make some assumptions. Let’s assume, for example, that most efficient means lowest cost per conversion, and that we’ve been asked to spend evenly across all platforms.
Understanding whether your data abides by or violates a normal distribution is an important first step in your subsequent data analysis.
This understanding will change which statistical tests you want to use if you need to immediately look for statistical significance. For example, you cannot run a t-test if your distribution is non-normal since this test uses mean/average as a way to find differences between groups.
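One quick way to sanity-check normality before picking a test is a Shapiro-Wilk test (a sketch with simulated data; with real data you would also look at a histogram or Q-Q plot):

import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(42)
sample = rng.normal(loc=0, scale=1, size=200)   # replace with your own data

stat, p_value = shapiro(sample)
# at a 0.05 significance level, a small p-value is evidence against normality
if p_value < 0.05:
    print(f"p={p_value:.3f}: evidence against normality; consider a non-parametric test")
else:
    print(f"p={p_value:.3f}: no strong evidence against normality")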
Hint: Is A/B testing a price difference a good idea? Would it encourage users to opt-out of your test, if they were seeing different prices for a product?
Is there a better way to test pricing?
Causal relationships are hard to come by, and truly determining causality is tough as the world is full of confounding variables. Because of this, instead of causality, we can dissect the correlation between the location sharing feature and the user unhappiness level.
At its core, this interview question is testing how you can dig into the science and statistics behind their assumption. The interviewer is asking essentially a small variation of a traditional experimental design with survey research and wants to know how you would either validate or disprove this claim.
Metrics is a common product analyst interview question subject, and you’ll also see this type of question in product-oriented data analyst roles. In general, these questions test your ability to choose metrics to investigate problems or measure success. These questions require strong product sense to answer.
The first thing we’ll want to do when faced with an interview question like this one is to ask a few clarifying questions. Answer these questions first:
Let’s say it’s a SaaS business that offers a free Studio model of their product, but makes their money selling enterprise subscriptions. This gives us a better sense of how they’re approaching their customers. They’re saying: here’s a good free tool, but you can pay to make it even better.
Imagine what your analysis would look like if the answer to this question was “a few.” Now imagine what your analysis would look like if the answer to this question was “hundreds.”
Mode could be spending 90% of its marketing budget on Facebook Ads and 10% on affiliate marketing, or vice versa. We can’t know unless we ask.
Here’s where we start getting into the meat of the question.
More context. Let’s say you work for a social media company that has just done a launch in a new city. Looking at weekly metrics, you see a slight decrease in the average number of comments per user from January to March in this city. The company has been consistently growing new users in the city from January to March.
Let’s model an example scenario to help us see the data.
We’re given information that total user count is increasing linearly, which means that the decreasing comments/user is not an effect of a declining user base creating a loss of network effects on the platform. What else can we hypothesize, then?
Start here: What is the point of Facebook Groups? Primarily we could say Facebook Groups provides a way for Facebook users to connect with other users through a shared interest or real-life/offline relationship.
How could we use the goals of Facebook Groups to measure success?
More context. You have access to a set of tables summarizing user event data for a community forum app. You’re asked to conduct a user journey analysis using this data with the eventual goal of improving the user interface.
See a step-by-step solution for this question on YouTube.
With this question, you might define success in terms of advertising performance. A few metrics you might be interested in are click-through rate, conversion rate, cost per acquisition, and return on ad spend.
Let’s look at two examples: An eCommerce product like Groupon vs. a subscription product like Netflix.
E-commerce metrics tend to be related to conversions and sales. Therefore, you might be interested in the number of purchases, conversion rate, quarterly or monthly sales, and cost of goods sold.
Subscription products tend to focus more on subscriber costs and revenue, like churn rates, cost of customer acquisition, average revenue per user, lifetime value, and monthly recurring revenue.
More context. Let’s say that you’re given event data from users on a social networking site like Facebook. A product manager is interested in understanding the average number of “sessions” that occur every day. However, the company has not technically defined what a “session” is yet.
The best the product manager can do is illustrate an example of a user browsing Facebook in the morning on their phone and then again during lunch as two distinct “sessions.” There must be a period of time where the user leaves Facebook to do another task before coming back again anew.
See a solution for this question on YouTube.
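A minimal pandas sketch of how you might operationalize a session definition, assuming a hypothetical events table with user_id and timestamp columns and a 30-minute inactivity cutoff (the cutoff itself is the assumption you would need to justify to the product manager):

import pandas as pd

events = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'timestamp': pd.to_datetime([
        '2024-01-01 08:00', '2024-01-01 08:10', '2024-01-01 12:30',
        '2024-01-01 09:00', '2024-01-01 09:05',
    ]),
})

events = events.sort_values(['user_id', 'timestamp'])
gap = events.groupby('user_id')['timestamp'].diff()

# a new session starts on a user's first event or after more than 30 idle minutes
new_session = gap.isna() | (gap > pd.Timedelta(minutes=30))
events['session_id'] = new_session.groupby(events['user_id']).cumsum()

# distinct sessions observed per calendar day; averaging this series answers the PM's question
sessions_per_day = (
    events.groupby([events['timestamp'].dt.date, 'user_id'])['session_id']
          .nunique()
          .groupby(level=0)
          .sum()
)
print(sessions_per_day)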
This question provides you with a chance to show your expertise in analyzing sale metrics and KPIs. Some of the challenges you might bring up include competitor price analysis, examining core customer experiences, and investigating evolving customer desires. Your goal in your response should be to outline how you would perform root cause analysis.
Tip. Start with some clarifying questions like, What is the product? Who is the audience? How long has the decline in sales persisted?
More context. Specifically, we want to improve search results for people looking for things to do in San Francisco.
This product question is more focused on growth and very much used for Facebook’s growth marketing analyst technical screen. Here are a couple of things that we have to remember.
Like usual product questions where we are analyzing a problem and coming up with a solution with data, we have to do the same with growth except we have to come up with solutions in the form of growth ideas and provide data points for how they might support our hypothesis.
Measuring the success of Facebook Stories requires an integrated approach that examines how users interact with the feature and its impact on the platform. Key to this evaluation is understanding engagement levels, which are reflected through metrics such as the total number of story views and unique viewers, alongside interactions like replies and reactions. These figures are pivotal because they indicate not just how many people are watching, but how actively they are engaging with the content.
Excel is still a widely used tool by data analysts, and in interviews, Excel questions typically focus on advanced features. These questions might ask for definitions, or you may be required to perform some Excel tasks.
Data analysts should also have strong knowledge of data visualization. Data visualization interview questions typically focus on design and presenting data, and may be more behavioral in nature. Be prepared to talk about how you make data accessible on dashboards.
This function allows users to find data from one column, and return a corresponding value from another.
For example, if you were analyzing a spreadsheet of customer data, you might use VLOOKUP to find a customer name and the corresponding phone number.
One limitation of VLOOKUP is that it only looks to the right of the column you are analyzing. For example, you couldn’t return a value from column A, if you used column B as the lookup column.
Another limitation is that VLOOKUP only returns the first value; if the spreadsheet contains duplicate records, you wouldn’t see any duplicates.
Conditional formatting allows users to change the appearance of a cell based on specified conditions.
Using conditional formatting, you can quickly highlight cells or ranges of cells, based on your conditions. Data analysts use conditional formatting to visualize data, to identify patterns or trends, or to detect potential issues.
Data analysts will get asked what tools they have experience with. Choose a few that you’re most comfortable with and explain the features that you like.
One tip: Think of questions like this in terms of Big Data’s 5 Vs: volume, velocity, variety, veracity and value.
Data can be imported from a variety of sources by selecting the Data tab and clicking Get External Data > From Other Sources. Excel worksheet data, data feeds, text files and other such data formats can be imported, but you will need to create relationships between the imported tables and those in your worksheet before using them to create a pivot table.
In your answer, provide an overview of your data validation process. For example, you might say, “The first step I would do would be to prepare a data validation report, which reveals why the data failed.” Then, you might talk through strategies for analyzing the dataset or techniques to process missing data, like deletion or mean/median/mode imputation.
Data visualization involves presenting data in a graphical or pictorial format. This allows viewers to see data trends and patterns that may not be easy to understand in text-based data. Tableau, Power BI, and Python libraries such as Matplotlib and Seaborn are some of the most commonly used tools for data visualization.
This question requires you to detail your hands-on experience with the mentioned tools. It involves discussing specific features you have used in Tableau, Power BI, and Python, such as creating different types of charts, setting up dashboards, or using Python libraries like Matplotlib and Seaborn for custom visualizations.
DAX, or Data Analysis Expressions, is a library of functions and operators used to create formulas in Power BI, Analysis Services, and Power Pivot in Excel. These formulas, or expressions, are used to define custom calculations for tables and fields, and to manipulate data within the model.
This question tests your understanding of DAX time-intelligence functions. A suitable response could be:
“I would use a combination of the SUM and CALCULATE functions along with a Date table. First, I would create a measure using the SUM function to total the sales. Then, I would use the CALCULATE function along with the DATESMTD (Dates Month to Date) function to calculate the monthly total. The DAX expression would look something like this:
Monthly Sales = CALCULATE(SUM(Sales[Daily Sales]), DATESMTD('Date'[Date]))”
This question assesses your ability to analyze complex datasets and create straightforward, impactful visualizations. Your response might include:
“Understanding the audience is key. For an executive summary, it’s important to focus on high-level insights. I would start by performing exploratory data analysis to identify key trends and relationships within the data. From this, I could determine which aspects are most relevant to the executive team’s interests and strategic goals.
For visualization, I would use a tool like Tableau or Power BI, known for their user-friendly, interactive dashboards. To make the data more digestible, I would utilize various chart types such as bar graphs for categorical data comparison, line graphs for trend analysis, or pie charts for proportions.
To add an interactive element, I’d implement filters to allow executives to view data for different demographics, products, or time periods. It’s crucial to keep the design clean and ensure the visuals tell a clear story. For the presentation, I would walk them through the dashboard, explain key insights, and address any questions.”
Your response should demonstrate your knowledge of both Tableau and Power BI and your ability to select the most appropriate tool for a specific task.
“For real-time sales monitoring, both Tableau and Power BI can be effective. However, if the company uses Microsoft’s suite of products and requires extensive integration with these services, I would lean towards Power BI as it’s part of the same ecosystem.
Power BI has robust real-time capabilities. I would leverage Power BI’s DirectQuery feature to connect to the sales database, ensuring the data displayed on the dashboard is always up-to-date. The tool also allows for streaming datasets that can be used for continuously streaming and updating data.
To visualize sales, I would design a dashboard that includes key metrics such as total sales, sales by product category, and changes in sales over time. I would also include slicers to allow users to filter data by region, time period, or other relevant dimensions.
Power BI also allows creating alerts based on KPIs that could notify the team when a sales target is reached or when there are significant changes in sales trends.”
If you are interested in a career path for a data analyst or have data analyst interview questions coming up, review Interview Query’s data science course, which includes modules in SQL, Python, product metrics and statistics. SQL is hands-down the most commonly asked subject in data analyst interviews. See our list of 50+ SQL data science interview questions or our guide to SQL questions for data analysts.
Here are more articles that you can check out as well: