Interview Query
Top 89 Data Analyst SQL Interview Questions (Updated for 2025)

Top 89 Data Analyst SQL Interview Questions (Updated for 2025)

Introduction

Data Analyst SQL interview questions aim to quickly evaluate your ability to extract metrics and work with data using SQL. You’ll usually need to write SQL queries on a whiteboard or in a code editor.

In 2025, real-time processing, semi-structured data handling, and advanced SQL techniques like window functions and materialized views are becoming increasingly important for data analysts.

SQL questions cover various topics, from asking when to use a GROUP BY statement to challenging you to create a query that supports or disproves a product metrics hypothesis.

Regardless of the interview format, the goal remains simple: produce clean SQL code as fast as possible. Data analyst SQL questions fall into three categories:

  • Easy SQL questions - These questions focus on defining SQL features, basic use cases, and differentiating between commands like ´WHERE´ and ´HAVING´. They may also include simple queries.

  • Intermediate SQL questions - These questions require you to write complex queries using joins, sub-queries, self-joins, and window functions. They may also include analytics case studies.

  • Hard SQL questions - These questions challenge you to write advanced queries, including the use of indices and complex SQL clauses. They may also include more advanced analytics case studies.

How Is SQL Tested in Data Analyst Interviews?

Writing on a white Board

Data analyst SQL technical screens are a part of nearly every data analyst interview. In these screens, candidates are asked to answer real-world problems using SQL.

Most commonly, candidates are provided a dataset and asked to write a SQL query to return the desired data.

How do companies test proficiency? There are three main types of SQL interview questions:

  1. Whiteboarding - SQL whiteboard tests are common in interviews. In a whiteboard test, you’re required to write SQL queries by hand, which allows companies to assess your understanding of SQL concepts and problem-solving ability.
  2. Coding tests - Many companies ask you to write code and run queries in live interviews. With live coding screens, you can check for syntax errors while you work, providing companies with a way to see your coding efficiency.
  3. SQL case studies - In case interviews, you’re given a real-world problem and asked to use your SQL skills to solve the problem. These are typically open-ended questions that leave room for analysis and creative problem-solving.

Easy Data Analyst SQL Questions

Easy SQL Interview Questions for Data Analysts

Data analyst interviews and technical screens generally start with beginner SQL questions. There are two main types of easy SQL questions:

  • Basic SQL Queries - You will often be asked basic SQL interview questions that require you to write a query. You might be asked to get the COUNT of a table, make a simple join, or use the HAVING clause.

  • Definitions - Less frequently encountered, these questions ask you to explain technical concepts, compare two or more SQL functions, or define how a concept is used.

Here are nine easy SQL questions for data analysts:

1. When would you use the GROUP BY statement?

The GROUP BY clause enables us to compute aggregate values alongside discrete data by collapsing the distinct rows of our database into summary rows that share a particular characteristic.

If you were to attempt to formulate a query by selecting a number of rows with a WHERE clause and then displaying an aggregate value alongside it, you would find that the query would return an error. This is because SQL cannot display, in a single table, the results of our WHERE query as a list of values conjoined with the aggregate value you are looking for.

2. What are the most common aggregate functions in SQL? What do they do?

An aggregate function calculates a set of values and returns a single value summarizing the set. The three most common aggregate functions in SQL are: COUNT, SUM, and AVG.

  • COUNT - Returns the number of items of a group.
  • SUM - Returns the sum of ALL or DISTINCT values in an expression.
  • AVG - Returns the average of values in a group (and ignores NULL values).

3. What is a unique key in SQL?

In SQL, a unique key is one or more columns or fields that identify a record in a database. Tables can have multiple unique keys, a difference between unique and primary keys. With unique keys, only one NULL value is accepted for the column and cannot have duplicate values.

4. What is the difference between UNION and UNION ALL?

UNION and UNION ALL are SQL operators that concatenate two or more result sets. This allows us to write multiple SELECT statements, retrieve the desired results, and then combine them together into a final, unified set.

The main difference between UNION and UNION ALL is that:

  • UNION: only keeps unique records.
  • UNION ALL: keeps all records, including duplicates.

5. What is the difference between a RIGHT JOIN and a LEFT JOIN?

The two most common types of join in SQL are LEFT and RIGHT. The main difference is that these JOIN operators deal with matched and unmatched rows.

A LEFT JOIN includes all records from the left table and matched rows from the right. A RIGHT JOIN returns all rows from the right table and unmatched rows from the left.

6. What is the difference between a table and a view?

Tables contain data, and they are made up of columns and rows. A view is a virtual table, which generally depends on data from the table for its display.

One use case for a view can be found if you want to look at a subset of data from a table. You could create a view using the SELECT command to query the data.

Master SQL with Our Question Bank

7. What SQL operator is used for pattern matching?

The LIKE operator searches for a specific pattern within a column. It is used with a WHERE clause to query specific columns.

LIKE is used with two wildcard characters:

  • % represents a 0, 1, or multiple characters
  • _ (underscore) represents a single character

Here are a few examples:

WHERE Employee Name LIKE '%r' - Finds matches that end with "r"

WHERE Employee NAME LIKE '%gh%'  - Finds matches that include "gh" in any position `

WHERE Employee NAME LIKE '_ch%' - Finds matches with "ch" in the second and third positions
 
WHERE Employee NAME LIKE 'g%r' - Finds matches that start with "g" and end with "r"

8. What command would you use to update data in a table?

You update an existing table with the UPDATE command in SQL. It is used with SET (which includes the updated information) and WHERE to select the specific instance.

Example: In the table ‘Employees’, you want to change the emergency contact, ContactName, for an employee with EmployeeID 3.

UPDATE Employees
SET ContactName = "Bob Smith"
WHERE EmployeeID = 3;

9. Which operator is used to select values within a range? What types of values can be selected?

The BETWEEN operator is used to select values within a range. You can use numbers, texts, or dates with BETWEEN.

One important thing to note is that the BETWEEN operator includes the start and end dates.

SELECT EmployeeID
FROM Employees
WHERE EmployeeID BETWEEN 378 AND 492 

10. Write a query that outputs a random manufacturer’s name.

Given a table of cars with columns id and make, write a query that outputs a random manufacturer’s name with an equal probability of selecting any name.

Input:

cars table

id make
1 Ford
2 Toyota
3 Toyota
4 Honda
5 Honda
6 Honda

Output:

Column Type
make Text

In the Easy Difficulty Data Analyst SQL Interview Category, the Interviewer might also ask:

11. What is SQL?

SQL (Structured Query Language) is a standard programming language for managing and manipulating databases. It is used for querying, inserting, updating, and deleting data. SQL allows interaction with relational databases, enabling users to retrieve, modify, and manage data efficiently.

12. What is the difference between WHERE and HAVING clauses?

The WHERE clause filters rows before grouping in SQL, whereas HAVING filters rows after aggregation. WHERE is used to specify conditions on individual rows, while HAVING is used to filter the result of GROUP BY queries. Both are important, but WHERE operates before grouping, and HAVING operates after grouping.

13. What are SQL constraints?

SQL constraints are rules applied to columns in a table to enforce data integrity. Common constraints include NOT NULL, PRIMARY KEY, FOREIGN KEY, CHECK, and UNIQUE. These constraints ensure data accuracy and prevent invalid data from being inserted.

14. What is the difference between INNER JOIN and LEFT JOIN?

An INNER JOIN returns only the rows that have matching values in both tables. A LEFT JOIN (or LEFT OUTER JOIN) returns all the rows from the left table, along with the matching rows from the right table. If there is no match, the result from the right table will contain NULL values.

15. What is a PRIMARY KEY?

A PRIMARY KEY is a unique identifier for each row in a table, ensuring no duplicate or NULL values in the key column(s). It can consist of one or more columns, known as a composite primary key. Each table can have only one primary key.

16. What are the different types of joins in SQL?

The primary types of SQL joins are INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL JOIN. INNER JOIN returns matching rows from both tables, LEFT JOIN returns all rows from the left table and matching rows from the right table, RIGHT JOIN returns all rows from the right table and matching rows from the left table, and FULL JOIN returns matching rows plus all unmatched rows from both tables.

17. What is a FOREIGN KEY?

A FOREIGN KEY is a column (or a set of columns) in one table that uniquely identifies a row in another table. It establishes a relationship between the two tables, ensuring referential integrity. Foreign keys ensure that data in one table corresponds to valid data in another table.

18. What does DISTINCT do in a query?

The DISTINCT keyword in SQL is used to return only unique (non-duplicate) values in the result set. It eliminates duplicate records from the result of a query. It can be applied to one or more columns to ensure the uniqueness of the data returned.

19. What is a NULL value in SQL?

A NULL value in SQL represents the absence of a value or an unknown value. It is not the same as zero or an empty string; NULL indicates that the value is undefined. Special handling of NULL is necessary in queries, as it is not comparable using standard comparison operators.

20. What is a SUBQUERY?

A SUBQUERY is a query embedded within another SQL query. It can be used in SELECT, INSERT, UPDATE, or DELETE statements and returns a single or multiple values that can be used by the outer query. Subqueries can be placed in WHERE, FROM, or SELECT clauses.

21. What is the ORDER BY clause in SQL?

The ORDER BY clause is used to sort the result set of a query based on one or more columns. You can specify ascending (ASC) or descending (DESC) order for the sorting. By default, sorting is done in ascending order if no direction is specified.

22. What is a composite key?

A composite key is a primary key that consists of two or more columns in a table, used to uniquely identify a record. It is typically used when a single column cannot uniquely identify a record on its own. Composite keys are helpful in situations where the combination of multiple fields is necessary to guarantee uniqueness.

23. What is the BETWEEN operator in SQL?

The BETWEEN operator is used to filter the result set within a range of values, which can be numbers, dates, or text. It includes the boundary values, making it inclusive of the specified range. The BETWEEN operator is often used in WHERE clauses to find data within a particular range.

24. What is a transaction in SQL?

A transaction in SQL is a set of SQL statements executed as a single unit of work. It ensures that the database is updated in a consistent manner and follows the ACID properties (Atomicity, Consistency, Isolation, Durability). Transactions are useful in situations where multiple operations need to be completed together, or none should be applied.

25. What is the LIKE operator in SQL?

The LIKE operator is used in SQL to search for a specified pattern in a column. It allows the use of wildcard characters such as % (representing zero or more characters) and _ (representing a single character). This is useful when you need to search for patterns within text fields.

26. What is the IS NULL condition?

The IS NULL condition is used to check whether a value in a column is NULL or not. Since NULL is not the same as an empty string or zero, regular comparison operators (=, <>, etc.) do not work with NULL. The IS NULL condition allows for handling NULL values in queries.

27. What is a TRUNCATE statement in SQL?

The TRUNCATE statement is used to delete all rows from a table, but unlike DELETE, it does not log individual row deletions and cannot be rolled back in some databases. It is faster than DELETE because it does not generate a large number of page reads. However, TRUNCATE removes all data without firing any triggers.

28. What is the EXISTS operator in SQL?

The EXISTS operator is used to check if a subquery returns any rows. It returns TRUE if the subquery contains one or more records, otherwise it returns FALSE. EXISTS is often used in correlated subqueries.

29. How do you perform a join without using the JOIN keyword?

You can perform a join without using the JOIN keyword by specifying the tables and the condition in the WHERE clause. For example, using the WHERE clause with matching conditions on the columns from both tables can act like an INNER JOIN.

30. What are wildcards in SQL?

Wildcards are special characters used in SQL queries to represent one or more unknown characters. The most common wildcards are the percent sign (%) to represent zero or more characters and the underscore (_) to represent a single character. They are used with the LIKE operator to search for patterns in data.

Intermediate Data Analyst SQL Questions

For mid-level data analyst roles, expect intermediate SQL questions to challenge your knowledge and skill. Medium SQL questions fall into two buckets:

  • Complex SQL queries - Intermediate SQL questions ask you to perform joins, sub-queries, self-joins, and window functions.
  • SQL/Analytics case studies - Many intermediate questions take the form of case studies or ask you to perform analysis on the data you pull from a query.

Here are some medium SQL questions for data analysts:

31. Find the 3 lowest-paid employees who have completed at least 2 projects.

Given tables employees, employee_projects, and projects, find the 3 lowest-paid employees that have completed at least 2 projects.

Note: incomplete projects will have an end date of NULL in the projects table.

32. Given the tables users and rides, write a query to report the distance traveled by each user in descending order.

For this question, you need to accomplish two things: the first is to figure out the total distance traveled for each user_id, and the second is to order from greatest to least each user_id by a calculated distance traveled.

This question has been asked in Uber data analyst interviews.

33. Write a query to find all the users that are currently “Excited” and have never been “Bored” within a campaign.

For this medium SQL problem, assume you work at an advertising firm. You have a table of users’ impressions of ad campaigns over time. Each user_id from these campaigns has an attached impression_id, categorized as either “Excited” or “Bored”. You will need to assess which users are “Excited” by their most recent campaign and have never been “Bored” in any past campaign.

See a full solution to this problem on YouTube:

SQL questions

Note: This question might get asked for a marketing analyst job.

34. Write a SQL query to select the second-highest salary in the engineering department.

To answer this question, you need the department’s name to be associated with each employee in the employees table to understand which department each employee is a part of.

The “department_id” field in the employees table is associated with the “id” field in the departments table. You can call the “department_id” a foreign key because it is a column that references the primary key of another table, which in this case is the “id” field in the departments table.

Based on this shared field, you can join both tables using INNER JOIN to associate the department name with their employees.

SELECT salary
    FROM employees
    INNER JOIN departments
        ON employees.department_id = departments.id

With the department name in place, you can now look at the employees of the Engineering team and sort by salary to find the second highest paid.

35. Given a table of bank transactions, write a query to get the last transaction for each day.

More Context: The table includes the columns: id, transaction_value and created_at (representing the time for each transaction).

Since our goal in this problem is to pull the last transaction from each day, you want to group the transactions by the day they occurred and create a chronological order within each day from which you can retrieve the latest transaction.

To accomplish the task of grouping and order, create a modified version of the bank_transactions table with an added column denoting the chronological ordering of transactions within each day.

You can use an OVER() statement to partition by date. After partitioning, you should use a descending order so that the first entry in each partition is the last transaction chronologically. Here is how that query can be written:

SELECT (*), ROW_NUMBER() OVER(PARTITION BY DATE(created_at) 
ORDER BY created_at DESC)

AS ordered_time

36. Write a query to debug an error and select the top five most expensive projects by budget-to-employee ratio.

More context: You are given two tables. A projects table and another that maps employees to their projects, called employee_projects. In this question, however, a bug exists that is causing duplicate rows in the employee_projects table.

Example:

Input:

projects table

column type
id INTEGER
title VARCHAR
state_date DATETIME
end_date DATETIME
budget INTEGER

employee_projects table

Column Type
project_id INTEGER
employee_id INTEGER

Output:

Column Type
title VARCHAR
budget_per_employee FLOAT

This is a good example of a logic-based SQL problem. Although there are a few steps to the solution, the actual SQL queries are fairly simple.

HINT: One way to do the debugging is to group by columns project_id simply and employee_id. By grouping by both columns, you are creating a table that provides distinct values on project_id and employee_id, thereby excluding any duplicates.

37. You have a table representing the total number of messages sent between two users by date on Facebook Messenger. Answer these questions:

  • What are some insights that could be derived from this table?
  • What do you think the distribution of the number of conversations each user creates per day looks like?
  • Write a query to get the distribution of the number of conversations created by each user by day in 2020.

This question tests your data sense, as well as your SQL writing skills. It has also appeared in Facebook data analyst interviews.

To answer the first part of the question regarding insights, you could evaluate a number of metrics. You can find the total number of messages sent daily, the number of conversations being started, or the average number of messages per conversation. All of these metrics seek to find users’ level of engagement and connectivity.

You can find a full solution on parts one through three in the following YouTube video:

SQL mock

38. Write an SQL query to create a histogram of the number of comments per user in January 2020.

This intermediate SQL question has been asked in Amazon data analyst interviews. Here is a partial answer from Interview Query:

What does a histogram represent, and what kind of story does it tell? In this case, you are interested in using a histogram to represent the distribution of comments each user has made in January 2020. A histogram with bin buckets of size one means avoiding the logical overhead of grouping frequencies into specific intervals.

For example, if you want a histogram of size five, you would have to run a SELECT statement like so:

SELECT
    CASE WHEN frequency BETWEEN 0 AND 5 THEN 5 
        WHEN frequency BETWEEN 5 AND 10 THEN 10 etc..

39. Select the largest three departments with ten or more employees and rank them according to the percentage of employees making over 100,000 dollars.

In this problem, you are given two tables: An employees table and a departments table.

Example:

Input:

employees table

Columns Type
id INTEGER
first_name VARCHAR
last_name VARCHAR
salary INTEGER
department_id INTEGER

departments table

Columns Type
id INTEGER
name VARCHAR

Output:

Column Type
percentage_over_100k FLOAT
department_name VARCHAR
number of employees INTEGER

First, break down the question to understand what it’s asking. Specifically, you break the question down into three clauses of conditions:

  • Top three departments by employee count.
  • % of employees making over $\$100,000$ a year.
  • The department must have at least ten employees.

From here, think about how you would associate employees with their department, calculate and display the percentage of employees making over $\$100,000$ a year, and order those results to answer the original question.

40. Given a table of students and their SAT test scores, write a query to return the two students with the closest test scores by score difference.

Given that this problem is referencing one table with only two columns, you have to self-reference different creations of the same table. It is helpful to think about this problem using two different tables with the same values.

There are two parts to this question:

  • The first part compares each combination of students and their SAT scores.
  • The second part is determining which two students’ scores are the closest.

Master SQL with Our Question Bank

41. Write a query to support or disprove the hypothesis: Clickthrough Rate (CTR) depends on search rating.

This question provides a table that represents search results on Facebook, including a query, a position, and a human rating.

42. Write a query to get the number of customers that were upsold by purchasing additional products.

For this problem, you are given a table of product purchases. Each row in the table represents an individual product purchase.

Note: If the customer purchased two things on the same day, that does not count as an upsell, as they were purchased within a similar timeframe. We are looking for a customer returning on a different date to purchase a product.

This question is a little tricky because you have to note the dates that each user purchased products. You can’t just group by the user_id and look where the number of products purchased exceeds one because of the upsell condition.

You have to group by both the date field and the user_id to get each transaction broken out by day and user:

SELECT 
    user_id
    , DATE(created_at) AS date
FROM transactions
GROUP BY 1,2

The query above will now give us a user_id and date field for each row. If a duplicate user_id exists, then you know that the user purchased on multiple days, which satisfies the upsell condition. What comes next?

43. Given the transactions table below, write a query that finds the third purchase of every user.

Note: Sort the results by the user_id in ascending order. If a user purchases two products simultaneously, the lower ID field is used to determine the first purchase.

Example:

Input:

transactions table

Columns Type
id INTEGER
user_id INTEGER
created_at DATETIME
product_id INTEGER
quantity INTEGER

Output:

Columns Type
user_id INTEGER
created_at DATETIME
product_id INTEGER
quantity INTEGER

Here is a helpful hint for this question: You need an indicator of which purchase was the third by a specific user. Whenever you are thinking of ranking a dataset, it is helpful to think of a specific window function you can use immediately. You need to apply the RANK function to the transactions table. The RANK function is a window function that assigns a rank to each row in the partition of the result set.

SELECT *, RANK() OVER (
    PARTITION BY user_id ORDER BY created_at ASC
        ) AS rank_value
FROM transactions
LIMIT 100

44. Write a query to retrieve the number of users who have posted each of their job listings only once and the number of users who have posted at least one job multiple times.

This is a LinkedIn data analyst interview question. See a full solution to this question on YouTube:

SQL Linkedin Video

45. Write a query to get each department’s top three highest employee salaries.

For this problem, you are given an employees and a departments table.

Note: If the department contains less than three employees, the top two or top one highest salaries should be listed.

Here’s a hint: You need to order the salaries by department. A window function is useful here. Window functions enable calculations within a certain partition of rows. In this case, the RANK() function would be useful. What would you put in the PARTITION BY and ORDER BY clauses?

Your window function can look something like this:

RANK() OVER (PARTITION BY id ORDER BY metric DESC) AS ranks

Note: When you substitute for the actual ID and metric fields, ensure the substitutes are relevant to the question asked and aligned with the data provided to you.

46. Write a query to find the number of non-purchased seats for each flight.

In this Robinhood data analyst question, assume you work for a small airline, and you are given three tables: flights, planes, and flight_purchases.

To get the number of unsold seats per flight, you need to get each flight’s total number of seats available and the total seats sold.

You can do an inner join on all 3 tables since the question states that the flight_purchases table does not have entries of flights or seats that do not exist.

To calculate the number of seats per flight, you use GROUP BY on the flight_id together with COUNT() on seat_id to get a count of seats sold. You then calculate the number of total seats on the flight minus the total seats sold to determine how many seats remain unsold.

47. Given a transactions table with date timestamps, sample every fourth row ordered by date.

Here’s a hint for this question to get you started: If you are sampling from this table and you want to sample every fourth value specifically, you will probably have to use a window function.

A general rule of thumb is that when a question states or asks for some Nth value (like the third purchase of each customer or the tenth notification sent), a window function is the best option. Window functions allow us to use the RANK() or ROW_NUMBER() function to provide a numerical index based on a certain ordering.

48. Write a query that returns all neighborhoods with zero users.

More Context: You are given two tables: a users table with demographic information and the neighborhoods they live in and a neighborhoods table.

This is an intermediate SQL problem that requires you to write a simple query. Our task is to find all the neighborhoods without users. To reframe the task, you need all the neighborhoods that do not have a single user living in them. This means you have to introduce a column in one table but not in the other, such that you can see user counts by neighborhood.

In the Medium Difficulty Data Analyst SQL Interview Category, the Interviewer might also ask:

49. Explain the concept of GROUP BY.

The GROUP BY clause is used to group rows based on shared values in one or more columns. It is often used with aggregate functions such as SUM(), COUNT(), AVG(), etc., to perform calculations on each group. The GROUP BY clause comes after the WHERE clause and before the HAVING clause.

50. What is the difference between UNION and UNION ALL?

UNION combines the result sets of two or more queries and removes duplicate rows, while UNION ALL combines all rows from multiple queries without removing duplicates. UNION performs a distinct operation, which is more resource-intensive compared to UNION ALL, which does not perform any duplicates check. Therefore, UNION ALL is usually faster when duplicate rows are not a concern.

51. What are aggregate functions in SQL?

Aggregate functions are used to perform calculations on a set of values to return a single value. Common aggregate functions include COUNT(), SUM(), AVG(), MAX(), and MIN(). These functions help summarize data and are often used in combination with GROUP BY.

52. What is a CASE statement in SQL?

The CASE statement is used to perform conditional logic in SQL queries. It allows you to return a specific value based on conditions, similar to an IF-THEN-ELSE structure. It can be used in SELECT, UPDATE, and other SQL clauses.

53. How do you handle duplicates in SQL?

To handle duplicates, you can use the DISTINCT keyword to eliminate them from query results. Alternatively, you can use GROUP BY to group data and aggregate it, which automatically removes duplicates. In some cases, ROW_NUMBER() or RANK() can be used to assign unique identifiers to rows and filter out duplicates.

54. What is a JOIN condition?

A JOIN condition is the rule that specifies how two or more tables should be combined based on a related column. It defines how rows from different tables should match, typically using a foreign key or matching column. The condition is usually specified in the ON clause of a JOIN.

55. What is the difference between CHAR and VARCHAR?

CHAR is a fixed-length data type, while VARCHAR is a variable-length data type. CHAR pads values with spaces to meet its defined length, whereas VARCHAR only uses as much space as needed. This means VARCHAR is more storage-efficient for variable-length strings.

56. What is the purpose of LIMIT in SQL?

The LIMIT clause is used to restrict the number of rows returned by a query. It is useful for controlling the size of the result set, especially when dealing with large datasets. LIMIT can be particularly useful in performance optimization and pagination.

57. What is a SELF JOIN?

A SELF JOIN is a join where a table is joined with itself. This is useful when you need to compare rows within the same table, such as finding relationships between records. To distinguish between two instances of the same table, aliases are used for the table.

58. What is the difference between IN and EXISTS?

The IN operator is used to check whether a value matches any value in a list or subquery. The EXISTS operator is used to check whether a subquery returns any rows. While IN evaluates a list of values, EXISTS is typically more efficient for correlated subqueries.

59. What is a FULL OUTER JOIN?

A FULL OUTER JOIN combines the results of both a LEFT JOIN and a RIGHT JOIN. It returns all rows from both tables, with matching rows from both sides where available. If there is no match, the missing side will contain NULL values.

60. How would you get the second highest salary in SQL?

To get the second highest salary, you can use a subquery that selects the maximum salary less than the highest. Alternatively, the ROW_NUMBER() window function can be used to rank salaries, and the second row can be retrieved. The query might look like: SELECT MAX(salary) FROM employees WHERE salary < (SELECT MAX(salary) FROM employees);

61. What is a CROSS JOIN?

A CROSS JOIN returns the Cartesian product of the two tables involved, meaning it will return all possible combinations of rows from the two tables. It does not require any condition, and the result is a table with rows equal to the product of the number of rows in both tables. This can be useful in generating combinations, but it can also lead to large result sets.

62. What is an Index and how does it improve query performance?

An Index is a database object that improves the speed of data retrieval operations. It creates a sorted structure for a specific column, which reduces the amount of data the database needs to scan. However, while it improves read operations, it can slow down INSERT, UPDATE, and DELETE operations because the index also needs to be updated.

63. What is a Stored Procedure in SQL?

A Stored Procedure is a precompiled collection of one or more SQL statements that are stored in the database. It can be executed by calling the procedure from a query or program, and it often includes logic such as loops and conditional statements. Stored procedures improve performance by reducing the number of database calls and promoting code reuse.

64. What is denormalization?

Denormalization is the process of deliberately introducing redundancy into a database by merging tables. It is done to improve query performance, as it can reduce the need for complex joins. However, it can lead to data inconsistency and is generally used when performance is a priority over strict normalization.

65. What is a clustered index?

A clustered index determines the physical order of data in the table. A table can only have one clustered index, and it is created on the primary key by default. Since the data is physically sorted based on the clustered index, querying based on the indexed column is faster.

66. What is the WITH clause in SQL?

The WITH clause, or Common Table Expression (CTE), allows you to define temporary result sets that can be referenced within a SELECT, INSERT, UPDATE, or DELETE statement. It is useful for breaking down complex queries into simpler parts. CTEs can improve query readability and can be recursive.

67. How do you handle transactions in SQL?

Transactions are handled using BEGIN TRANSACTION, COMMIT, and ROLLBACK statements. BEGIN TRANSACTION marks the start of a transaction, COMMIT finalizes the changes made, and ROLLBACK undoes all changes if an error occurs. Transactions ensure that the database remains in a consistent state, even when unexpected errors happen.

68. What is relational integrity?

Relational integrity refers to the accuracy and consistency of data within a relational database. It is enforced using various constraints such as PRIMARY KEY, FOREIGN KEY, and CHECK. Maintaining relational integrity ensures that relationships between tables are valid and that data remains accurate.

Hard Data Analyst SQL Questions

data on graph - topic data analyst sql questions

Advanced SQL questions are common for mid and senior-level data analyst jobs, requiring you to write advanced SQL queries or work through complex logic-based case studies. The two types of questions include:

  • Advanced SQL writing - Writing queries to debug code, using indices to tune SQL queries, and using advanced SQL clauses.
  • Logic-based questions - These questions can be more challenging analytics case studies or queries that first require you to solve a logic-based problem.

69. An online marketplace company has introduced a new feature allowing potential buyers and sellers to conduct audio chats with each other before transacting. Answer the following questions:**

  • How would you measure the success of this new feature?
  • Write a query that can show whether the feature is successful or not.

Here’s the full solution to this complex data analytics case study on YouTube:

data analytics case study problem

70. Write a query to get the total three-day rolling average for deposits by day.

For this question, you are given a table of bank transactions with three columns: user_id, a deposit or withdrawal value (determined if the value is positive or negative), and created_at time for each transaction.

Here’s a hint: Usually, if the problem asks to solve for a moving/rolling average, you are provided the dataset in the form of a table with two columns: date and value. This problem is taken one step further as it provides a table of just transactions, with an interest in filtering for deposits (positive value) and removing records representing withdrawals (negative value, e.g. -10).

You also need to know the total deposit amount (sum of all deposits) for each day, as it will factor into calculating the numerator for the rolling three-day average:

rolling three-day avg for day 3 = [day 3 + day 2 + day 1] / 3

71. Write an SQL query that creates a cumulative distribution of the number of comments per user. Assume bin buckets class intervals of one.

To solve this cumulative distribution practice problem, you are given two tables, a users table, and a comments table.

frequency cumulative
0 10
1 25
2 27

72. Write a query to display a graph to understand how unsubscribes affect login rates over time.

For this question, assume that you work at Twitter. Twitter wants to roll out more push notifications to users because they think users are missing out on good content. Twitter decides to do this in an A/B test. After you release more push notifications, you suddenly see an increase in the total number of unsubscribes. How would you visually represent this growth in unsubscribes and its effect on login rates?

73. You are given a table of user experiences representing each person’s past employment history. Answer the following:

Write a query to prove or disprove this hypothesis: Data scientists who switch jobs more frequently become managers faster than data scientists who stay at one job for longer.

For this question, you are interested in analyzing the career paths of data scientists. Let’s say the titles you care about are bucketed into three categories: data scientist, senior data scientist, and data science manager.

Here’s a partial solution to this question:

This question requires a bit of creative problem-solving to understand how you can prove or disprove the hypothesis. The hypothesis is that data scientists who end up switching jobs more often get promoted faster. Therefore, in analyzing this dataset, you can prove this hypothesis by separating the data scientists into specific segments on how often they switch jobs in their careers.

For example, if you looked at the number of job switches for data scientists who have been in their field for five years, you could prove the hypothesis that the number of data science managers increased with the number of times they had switched jobs.

  • Never switched jobs: 10% are managers
  • Switched jobs once: 20% are managers
  • Switched jobs twice: 30% are managers
  • Switched jobs three times: 40% are managers

74. Write a query to get the distribution of total push notifications before a user converts.

For this question, you’re given two tables: a table of notification_deliveries and a table of users with created and purchase conversion dates. If the user hasn’t purchased, then the conversion_date column is NULL.

Example:

Input:

notification_deliveries table

Column Type
notification VARCHAR
user_id INTEGER
created_at DATETIME

users table

Column Type
id INTEGER
created_at DATETIME
conversion_date DATETIME

Output:

Column Type
total_pushes INTEGER
frequency INTEGER

75. Write a query to find the top five paired products and their names.

You’re given two tables, transactions, and products. Hypothetically, the transactions table consists of over a billion rows of purchases bought by users.

You are trying to find paired products often purchased by the same user, such as wine and bottle openers, chips and beer, etc.

Note: For the purposes of satisfying the test case, p2 should be the item that comes first in the alphabet. The qty column represents the paired product count.

Example:

Input:

transactions table

Column Type
id INTEGER
user_id INTEGER
created_at DATETIME
product_id INTEGER
quantity INTEGER

products table

Column Type
id INTEGER
name VARCHAR
price FLOAT

Output:

Column Type
p1 VARCHAR
p2 VARCHAR
qty INTEGER

In the Difficult Data Analyst SQL Interview Category, the Interviewer might also ask:

76. What are window functions in SQL?

Window functions perform calculations across a set of table rows related to the current row. They are used with the OVER() clause to partition data and apply aggregation or ranking without collapsing rows. Examples include ROW_NUMBER(), RANK(), LEAD(), and LAG().

77. What is normalization? Explain the different normal forms.

Normalization is the process of organizing data in a database to reduce redundancy and dependency. The first three normal forms (1NF, 2NF, and 3NF) focus on removing duplicate data, ensuring that each piece of data is atomic, and eliminating partial and transitive dependencies. Higher normal forms like BCNF and 4NF deal with more complex relationships and multi-valued dependencies.

78. What is a CTE (Common Table Expression)?

A CTE is a temporary result set defined within the execution scope of a SELECT, INSERT, UPDATE, or DELETE query. It is defined using the WITH keyword and can be referenced within the main query. CTEs improve query readability and simplify complex joins and subqueries.

79. What is a TRIGGERS in SQL?

A TRIGGER is a stored procedure that is automatically executed or triggered when a specific event occurs in a database, such as INSERT, UPDATE, or DELETE. Triggers help enforce business rules and data integrity by automatically executing operations in response to changes in the database. They are useful for auditing, logging, or ensuring consistency.

80. What is the difference between RANK() and DENSE_RANK()?

Both RANK() and DENSE_RANK() are window functions used for ranking rows. RANK() assigns the same rank to equal values but skips the next rank(s), leaving gaps, while DENSE_RANK() assigns the same rank to equal values but does not leave gaps between ranks. The difference becomes noticeable when there are tied values.

81. What is an INDEX in SQL?

An INDEX in SQL is a data structure that improves the speed of data retrieval operations on a table. Indexes work by creating a sorted map of column values, allowing for faster search queries. However, they can slow down INSERT, UPDATE, and DELETE operations, as the index needs to be updated.

82. Explain the concept of ACID properties in SQL.

ACID stands for Atomicity, Consistency, Isolation, and Durability, which are key properties of a transactional database. These properties ensure that database transactions are processed reliably, even in the event of a system failure. For example, Atomicity ensures transactions are fully completed, while Consistency ensures the database remains in a valid state.

83. What is a VIEW in SQL?

A VIEW is a virtual table in SQL that consists of a stored query. It allows you to present data in a specific way without physically storing it. Views can simplify complex queries and provide a level of abstraction by hiding underlying table structures.

84. What is a MATERIALIZED VIEW?

A MATERIALIZED VIEW is a database object that stores the result of a query physically. Unlike regular views, which are computed each time they are queried, a materialized view stores the result and can be periodically refreshed. This can greatly improve performance for complex queries.

85. How does SQL optimization work?

SQL optimization is the process of improving query performance by analyzing and modifying queries for faster execution. Techniques include using indexes, avoiding unnecessary joins, minimizing subqueries, and optimizing aggregate functions. Query execution plans can be analyzed to identify bottlenecks and areas for improvement.

86. What is a deadlock in SQL?

A deadlock occurs when two or more transactions are waiting for each other to release locks, leading to a situation where none of them can proceed. SQL Server and other databases automatically detect deadlocks and resolve them by rolling back one of the transactions. Preventing deadlocks can involve optimizing queries, using the correct transaction isolation level, or controlling lock ordering.

87. Explain Database Sharding.

Database sharding is a method of splitting a large database into smaller, more manageable parts, called “shards.” Each shard contains a subset of the data, and these shards are distributed across multiple servers to improve scalability and performance. Sharding helps with managing large datasets and reduces the load on a single database server.

88. What is Indexing in SQL and how does it affect performance?

Indexing in SQL refers to creating a structure that helps speed up the retrieval of data. It creates pointers to the data in the table based on the indexed column, which reduces the number of rows the database has to scan. However, while indexes can improve query performance, they also slow down INSERT, UPDATE, and DELETE operations because the index has to be updated each time.

89. What are data anomalies in SQL?

Data anomalies refer to inconsistencies or irregularities in a database that can occur due to improper database design or maintenance. Types of anomalies include insertion anomalies, deletion anomalies, and update anomalies. These issues can arise when data redundancy is present or when normalization is not properly implemented.

Learn more about Hard SQL Questions

This course is designed to help you learn everything you need to know about working with data, from basic concepts to more advanced techniques.

image

More Data Analyst Interview Questions

Question
Topics
Difficulty
Ask Chance
Pandas
SQL
R
Medium
Very High
Product Metrics
Hard
High
Sbwrgw Qphn Onwtq
Machine Learning
Easy
Low
Wtsugjzw Eoxqmc Itwb Wwfwvf
SQL
Hard
Very High
Avpvha Qgmz
Machine Learning
Medium
High
Vtqt Wadb Emglq Njvsm
SQL
Easy
Low
Szuq Ertiy
Machine Learning
Hard
Medium
Ahzspjo Nqxrkyo Ieff Abwqqrat Eiluo
SQL
Medium
Medium
Yfxttm Dwygiulc
SQL
Hard
Low
Gnvbj Ojidyk Ujlwbkww Kebugt
Machine Learning
Easy
Low
Xibiulgg Huphuzd Okuamls Twhpf Qkvpief
SQL
Easy
Very High
Npyx Eunezf Gfntvr Rvdxoalj
Analytics
Easy
Low
Ftlgwa Ejgiwz Qyci Iovl
SQL
Medium
Medium
Lgqgndr Fqvx Oxbpehlz Zmlesyiz Eeeutu
Analytics
Medium
Very High
Glcwkr Eknibpx Azsyagjk Uqygrhgn
Analytics
Easy
Low
Akduallf Zmgyld Tzburwfs Frahoqwg
Machine Learning
Medium
High
Xwaws Cdcbge Mbyehg Qbhzp Fqbrllp
Machine Learning
Medium
Medium
Azorpys Amlxwmw Dask Klnzqos
Analytics
Medium
Medium
Steqeswl Ozxedad Ugsncfwa
Analytics
Hard
High
Loading pricing options

View all Data Analyst questions