SQL has traditionally served as a database querying language for data storage and retrieval. However, with the exponential growth of data, there’s been an increased demand for robust server-side analytic functions that can compute and process data efficiently. In this article, we’ll explore a range of analytic functions in SQL that allow us to extract valuable insights directly from our data.
SQL analytic functions, also known as window functions, are powerful features that perform calculations across rows in a result set, allowing for advanced data analysis and aggregation within queries. This provides data scientists and analysts with a way to implement logic without client-side programming languages that operate individually on the result sets (e.g., Python).
Like aggregation functions, analytic functions provide greater insights into data trends, patterns, rankings, and comparisons. Fundamentally, they differ on a critical feature: the granularity of their operations.
We declare an analytic function through the use of the OVER() clause.
While aggregation functions operate on the entire result set, analytic functions work with subsets of the result set called windows. These are user-defined, allowing for more dynamic results.
Before we dive into the different types of analytic functions in SQL, we must first understand how they’re declared and how to specify their behavior.
Analytic functions in SQL consist of several parts, which determine how a window function operates over a table:
The function component of an analytic function defines the operation that is performed within the window. In other words, while other parts of an analytic function determine WHAT the window is, the function component determines HOW the window is operated on.
Let’s look at an example of the SUM() OVER() analytic function to better understand this concept. The SUM() function calculates the sum over the specified window, allowing us to obtain cumulative totals or perform other types of aggregations. In a way, you can think of the function as the engine that powers the window’s functionality.
Functions vary in purpose and type. Typically, they’re classified under four distinct categories:
We’ll explain how to use these different types of analytic functions later.
The partitioning clause divides the rows of the table into separate groups or partitions based on one or more column values. The function is applied independently within each partition. This allows you to perform calculations and aggregations on specific subsets of data.
PARTITION BY works similarly to GROUP BY, but instead of dividing the table into sets, it defines how we partition the table into windows. Let’s take a look at the following visualization:
student_id | grade_id | student_name | course_name | grade |
---|---|---|---|---|
1 | 1 | John Smith | Mathematics | 3.50 |
1 | 2 | John Smith | Physics | 4.00 |
1 | 3 | John Smith | English | 3.80 |
2 | 4 | Jane Doe | Chemistry | 3.90 |
2 | 5 | Jane Doe | Biology | 3.70 |
This is the table before PARTITION BY (student_id)
is executed.
However, after the partition, the table is divided into windows as defined by the student_id
:
student_id | grade_id | student_name | course_name | grade |
---|---|---|---|---|
1 | 1 | John Smith | Mathematics | 3.50 |
1 | 2 | John Smith | Physics | 4.00 |
1 | 3 | John Smith | English | 3.80 |
student_id | grade_id | student_name | course_name | grade |
---|---|---|---|---|
2 | 4 | Jane Doe | Chemistry | 3.90 |
2 | 5 | Jane Doe | Biology | 3.70 |
The ordering clause defines the order in which the rows are processed within each partition. It specifies the column(s) that the rows are sorted by before applying the analytic function.
The framing clause, which is optional, allows you to specify a subset of rows within each partition on which the analytic function will operate, thereby determining the window used to calculate the function’s result. Think of it as an even more granular version of PARTITION BY. However, instead of partitioning based on a column, it partitions based on position (as dictated by ORDER BY).
The framing clause can be specified using the ROWS BETWEEN
or RANGE BETWEEN
keywords.
The ROWS BETWEEN
clause allows you to define a window or frame of rows based on their relative positions within each partition. It specifies a range of rows that are included in the calculation of the analytic function’s result.
The RANGE BETWEEN
clause defines a window or frame of rows based on their values rather than their positions. The range of rows included in the calculation is based on the order specified in the ORDER BY
clause.
It’s important to note that the behavior of RANGE BETWEEN
can vary depending on the specific analytic function and the data type of the ordering column.
When defining ROWS BETWEEN
and RANGE BETWEEN
, there are certain operations you can declare to further specify the frame size.
1. UNBOUNDED PRECEDING TO CURRENT ROW: This specifies that the window should include all rows from the beginning of the partition up to and including the current row being evaluated by the function.
2. n PRECEDING TO CURRENT ROW: This creates a fixed window size that includes the current row being evaluated by the function and the preceding “n” number of rows.
In this example, n = 1.
3. UNBOUNDED PRECEDING TO UNBOUNDED FOLLOWING: This creates a window that includes all of the rows between the start and end of the partition.
For the first two frame operations, you can swap in the FOLLOWING keyword instead of PRECEDING to define a range of rows after the current row.
The output from ROWS BETWEEN
and RANGE BETWEEN
can vary vastly. Let’s look at the following section to understand why.
ROWS BETWEEN
is typically used with analytic functions that operate on continuous or numerical values. It considers the values of the ordering column. This is different from ROWS BETWEEN
, which considers the relative position of the rows instead.
The following illustration displays how these clauses produce different windows:
With ROWS BETWEEN
, the window includes the current row and the one immediately preceding it based on the specified order (in this case, by units), which are ‘Chemistry’ and ‘English.’
However, with RANGE BETWEEN
,, it would be all the rows with unit values between 3 (from 4-1) and 4. This table includes ‘Math,’ ‘Physics,’ ‘English,’ and ‘Chemistry’ because all of them have units between 3 and 4.
RANGE BETWEEN is a window, while ROWS BETWEEN is a row-based window.
RANGE BETWEEN
includes all four rows because the order of the rows doesn’t matter– only the values. So if multiple rows have the same or close values (within 1 unit in this case), they’ll all be included in the window.
In this section, we’ll cover ranking functions, which are a group of analytic functions that assign ranks to each row in a result set based on certain criteria. These functions are helpful for data rankings, such as identifying the top performers or finding the relative position of a particular row compared to others.
This function assigns a unique sequential number to each row within the result set. The numbers are generated based on the order specified in the ORDER BY clause. ROW_NUMBER() does not consider ties, and each row receives a distinct number.
Example:
Company | Stock_Price | ROW_NUMBER |
---|---|---|
Metamorphosis LLC | 542.12 | 1 |
Orange Inc. | 542.12 | 2 |
MacroString Corp. | 902.93 | 3 |
Spookify | 1252.97 | 4 |
(Using ROW_NUMBER() OVER (ORDER BY Stock_Price)
)
RANK()
assigns a unique rank to each row within the result set. Rows with the same values and order criteria receive the same rank, and the subsequent rank changes based on the number of rows with the same rank.
Example:
Company | Stock_Price | RANK |
---|---|---|
Metamorphosis LLC | 542.12 | 1 |
Orange Inc. | 542.12 | 1 |
MacroString Corp. | 902.93 | 3 |
Spookify | 1252.97 | 4 |
(Using RANK() OVER (ORDER BY Stock_Price)
)
Similar to RANK(), DENSE_RANK()
assigns a unique rank to each row in the result set. However, it doesn’t skip any ranks for tied values.
Example:
Company | Stock_Price | ROW_NUMBER |
---|---|---|
Metamorphosis LLC | 542.12 | 1 |
Orange Inc. | 542.12 | 1 |
MacroString Corp. | 902.93 | 2 |
Spookify | 1252.97 | 3 |
(Using DENSE_RANK() OVER (ORDER BY Stock_Price)
)
NTILE()
divides the result set into equal-sized groups, or “tiles,” and assigns a group number to each row. The number of groups is determined by the argument provided to NTILE().
Example:
Company | Stock_Price | NTILE |
---|---|---|
Metamorphosis LLC | 542.12 | 1 |
Orange Inc. | 542.12 | 1 |
MacroString Corp. | 902.93 | 2 |
Spookify | 1252.97 | 2 |
(Using NTILE(2) OVER (ORDER BY Stock_Price)
)
Window-aggregation functions in SQL are a specialized group of analytic functions that combine the capabilities of aggregate and window functions. These functions allow you to perform aggregate calculations while retaining a granular level of detail within the result set, providing more flexibility for data analysis.
The SUM()
function combined with the OVER()
clause calculates the sum of a specified column over a window of rows. The SUM()
function takes a single argument, which is the column to be aggregated. This allows us to compute running totals, cumulative sums, or other aggregations within the specified window.
Example:
Company | Stock_Price | Cumulative_Sum |
---|---|---|
Metamorphosis LLC | 542.12 | 542.12 |
Orange Inc. | 542.12 | 1084.24 |
MacroString Corp. | 902.93 | 1987.17 |
Spookify | 1252.97 | 3240.14 |
(Using SUM(Stock_Price) OVER (ORDER BY Stock_Price)
)
The AVG()
function with the OVER()
clause calculates the average of a column over a window of rows. It works similarly to SUM() OVER(), but provides the average value instead. This function is useful for calculating moving averages or obtaining average values within specific partitions or orderings.
Example:
Company | Stock_Price | Running_Average |
---|---|---|
Metamorphosis LLC | 542.12 | 542.12 |
Orange Inc. | 642.78 | 592.45 |
MacroString Corp. | 902.93 | 695.94 |
Spookify | 1252.97 | 835.20 |
(Using AVG(Stock_Price) OVER (ORDER BY Company)
)
The MIN()
function with the OVER()
clause returns the minimum value of a column over a defined window. This allows us to find the minimum value over a specific range or partition, which is helpful for finding the earliest date or the minimum value within each group.
Company | Stock_Price | Running_Min |
---|---|---|
Metamorphosis LLC | 542.12 | 542.12 |
Orange Inc. | 642.78 | 542.12 |
MacroString Corp. | 902.93 | 542.12 |
Spookify | 1252.97 | 542.12 |
(Using MIN(Stock_Price) OVER (ORDER BY Company)
)
Similar to MIN() OVER(), the MAX()
function with the OVER()
clause returns the maximum value of a column over a specified window.
Company | Stock_Price | Running_Max |
---|---|---|
Metamorphosis LLC | 542.12 | 542.12 |
Orange Inc. | 642.78 | 642.78 |
MacroString Corp. | 902.93 | 902.93 |
Spookify | 1252.97 | 1252.97 |
(Using MAX(Stock_Price) OVER (ORDER BY Company)
)
The COUNT()
function with OVER()
provides the number of rows within a window. It can count rows within specific partitions or orderings, which is helpful if you need to find the number of items sold within a period or any other counts within specific groups.
Company | Stock_Price | Running_Count |
---|---|---|
Metamorphosis LLC | 542.12 | 1 |
Orange Inc. | 642.78 | 2 |
MacroString Corp. | 902.93 | 3 |
Spookify | 1252.97 | 4 |
(Using COUNT(Stock_Price) OVER (ORDER BY Company)
)
Statistical window functions are another set of analytic functions that perform statistical calculations over a window of rows from a result set. These functions allow us to derive valuable statistical insights and metrics from our data.
The STDDEV() OVER() function calculates the standard deviation of a column over a specified window, providing insight into the data’s spread.
Example:
employee_id | employee_name | department_name | department_id | salary | Sample Standard Deviation |
---|---|---|---|---|---|
1 | Alice | HR | 101 | 60000 | 2500 |
2 | Bob | HR | 101 | 55000 | 2500 |
4 | David | Sales | 102 | 80000 | 5000 |
3 | Charlie | Sales | 102 | 70000 | 5000 |
5 | Eve | IT | 103 | 65000 | 2500 |
6 | Frank | IT | 103 | 70000 | 2500 |
(Using STDDEV(salary) OVER (PARTITION BY department_id)
)
The VARIANCE() function calculates the variance (average squared deviation from the mean) of a column over a defined window.
Example:
employee_id | employee_name | department_name | department_id | salary | Variance |
---|---|---|---|---|---|
1 | Alice | HR | 101 | 60000 | 6250000 |
2 | Bob | HR | 101 | 55000 | 6250000 |
4 | David | Sales | 102 | 80000 | 25000000 |
3 | Charlie | Sales | 102 | 70000 | 25000000 |
5 | Eve | IT | 103 | 65000 | 6250000 |
6 | Frank | IT | 103 | 70000 | 6250000 |
(Using VARIANCE(salary) OVER (PARTITION BY department_id)
)
Offset window functions allow you to access data from rows that are offset from the current row within the window. These functions are helpful when we need to compare the current row to a preceding or successive row. Examples include comparing today’s stock data to last week’s stock data (LAG), or comparing salaries to the lowest salary (FIRST_VALUE with ORDER BY).
The LAG() function retrieves a value from a previous row within the window. It takes two arguments: the column you want to retrieve the value from and the offset (number of rows before the current row). For example, LAG(column_name, offset)
returns the value from column_name
that is offset
rows before the current row.
Note: Offset is an INT type parameter.
Example:
Company | Stock_Price | LAG_Stock_Price |
---|---|---|
Metamorphosis LLC | 542.12 | NULL |
Orange Inc. | 542.12 | 542.12 |
MacroString Corp. | 902.93 | 542.12 |
Spookify | 1252.97 | 902.93 |
(Using LAG(Stock_Price) OVER (ORDER BY Stock_Price) AS LAG_Stock_Price
)
The LEAD() function is similar to LAG(), but it retrieves the value from a subsequent row within the window. It also takes a similar set of arguments with the column you want to retrieve the value from and the offset (number of rows after the current row). For example, LEAD(column_name, offset)
returns the value from column_name
that is offset
rows after the current row.
Example:
Company | Stock_Price | LEAD_Stock_Price |
---|---|---|
Metamorphosis LLC | 542.12 | 542.12 |
Orange Inc. | 542.12 | 902.93 |
MacroString Corp. | 902.93 | 1252.97 |
Spookify | 1252.97 | NULL |
(Using LEAD(Stock_Price) OVER (ORDER BY Stock_Price) AS LEAD_Stock_Price
)
FIRST_VALUE() retrieves the value from the first row in the window. It takes one argument: the column you want to retrieve the value from. For example, FIRST_VALUE(column_name)
returns the value from column_name
in the first row of the window.
Example:
Company | Stock_Price | FIRST_Stock_Price |
---|---|---|
Metamorphosis LLC | 542.12 | 542.12 |
Orange Inc. | 542.12 | 542.12 |
MacroString Corp. | 902.93 | 542.12 |
Spookify | 1252.97 | 542.12 |
(Using FIRST_VALUE(Stock_Price) OVER (ORDER BY Stock_Price) AS FIRST_Stock_Price
)
LAST_VALUE() is similar to FIRST_VALUE(), but it retrieves the value from the last row in the window. It also takes one argument: the column you want to retrieve the value from. For example, LAST_VALUE(column_name)
returns the value from column_name
in the last row of the window.
Example:
Company | Stock_Price | LAST_Stock_Price |
---|---|---|
Metamorphosis LLC | 542.12 | 1252.97 |
Orange Inc. | 542.12 | 1252.97 |
MacroString Corp. | 902.93 | 1252.97 |
Spookify | 1252.97 | 1252.97 |
(Using LAST_VALUE(Stock_Price) OVER (ORDER BY Stock_Price) as LAST_Stock_Price)
)
Given the transactions
table below, write a query that finds the third purchase of every user.
Note: Sort the results by the user_id
in ascending order. If a user purchases two products at the same time, the lower id
field is used to determine which is the first purchase.
Example:
Input:
transactions
table
Columns | Type |
---|---|
id | INTEGER |
user_id | INTEGER |
created_at | DATETIME |
product_id | INTEGER |
quantity | INTEGER |
Output:
Columns | Type |
---|---|
user_id | INTEGER |
created_at | DATETIME |
product_id | INTEGER |
quantity | INTEGER |
This problem set is relatively straightforward at first. We can first determine the order of purchases for every user by looking at the created_at
column and ordering by user_id
and the created_at
column.
However, we still need an indicator of which purchase was the third value. Whenever we want to rank our dataset, it’s helpful to then immediately think of a specific window function to use.
In this case, we need to apply the RANK
function to the transactions table. The RANK
function is a window function that assigns a rank to each row in the partition of the result set.
RANK() OVER (PARTITION BY user_id ORDER BY created_at ASC, id ASC) AS rank_value
In this example, the PARTITION BY
clause distributes the rows in the result set into partitions by one or more criteria.
The ORDER BY
clause then sorts the rows in each partition by the column we indicated– in this case, created_at
.
Finally, the RANK()
function operates on the rows of each partition and re-initializes when crossing each partition boundary. The end result is a column with the rank of each purchase partitioned by user_id
.
All we have to do is then wrap the table in a subquery and filter out where the new column is equal to 3, which is equivalent to subsetting for the third purchase.
SELECT user_id, created_at, product_id, quantity
FROM (
SELECT
user_id
, created_at
, product_id
, quantity
, RANK() OVER (
PARTITION BY user_id
ORDER BY created_at ASC, id ASC
) AS rank_value
FROM transactions
) AS t
WHERE rank_value = 3
ORDER BY user_id ASC
Given a table of flights
, extract the 2nd flight with the longest duration between each pair of cities. Order the flights by the flight id
ascending.
Note: For any cities X
and Y
, (source_location=X, destination_location=Y)
and (source_location=Y, destination_location=X)
are counted as the same pair of cities.
Note: If there are fewer than two flights between two cities, there is no 2nd longest flight.
Example:
Input:
flights
table
Column | Type |
---|---|
id | INTEGER |
destination_location | VARCHAR |
source_location | VARCHAR |
plane_id | INTEGER |
flight_start | DATETIME |
flight_end | DATETIME |
Output:
Column | Type |
---|---|
id | INTEGER |
destination_location | VARCHAR |
source_location | VARCHAR |
flight_start | DATETIME |
flight_end | DATETIME |
The solution for this question is quite long, given that we have a graph-based problem that specifically handles undirected edges. Meanwhile, the SQL table clearly shows that our data is directed. To make up for this, we have to conduct operations that consider symmetrical edges, as the question specifies that (source_location=X, destination_location=Y)
and (source_location=Y, destination_location=X)
are counted as the same pair of cities.
This increases the question’s complexity. Moreover, it’s noted in the question that there are specified edge cases, wherein if there are fewer than two flights between two cities, there is no second-longest flight. This condition adds another level of difficulty, requiring the solution to handle cases where a city pair does not have at least two flights.
Let’s take a look at our solution below:
WITH cte_1 AS (
SELECT id, destination_location AS x, source_location AS y, flight_start, flight_end, TIMESTAMPDIFF(SECOND, flight_start, flight_end) AS diff, 1 AS helper FROM flights
UNION ALL
SELECT id, source_location AS x, destination_location AS y, flight_start, flight_end, TIMESTAMPDIFF(SECOND, flight_start, flight_end) AS diff, 2 AS helper FROM flights
),
cte_2 AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY x,y ORDER BY diff DESC) AS ranking FROM cte_1
)
SELECT id, x AS destination_location, y AS source_location, flight_start, flight_end FROM cte_2 WHERE ranking = 2 AND helper= 1
ORDER BY id ASC
To extract the second longest flight between pairs of cities from the given flights
table, we can use the following SQL solution:
cte_1
, which serves as a foundation for our subsequent calculations.cte_1
, we select the flight details, source location, destination location, and duration of each flight using the TIMESTAMPDIFF()
function to calculate the duration in seconds.cte_1
, we union the previous selection with another selection that swaps the source and destination locations, effectively considering flights in both directions.cte_1
, we create another CTE named cte_2
to rank the flights within each pair of cities based on their duration.ROW_NUMBER()
window function partitioned by the source and destination locations and ordered by the flight duration in descending order, we assign a ranking to each flight within its respective city pair.cte_2
where the ranking equals 2 (indicating the second longest flight) and the helper value is 1 (to avoid duplicate entries).The provided solution efficiently determines the second longest flight between each pair of cities, considering both source and destination locations. By employing CTEs and window functions, the query calculates flight durations, assigns rankings, and filters the desired results.
Given a table of bank transactions with columns id
, transaction_value
, and created_at
representing the date and time for each transaction, write a query to get the last transaction for each day.
The output should include the id of the transaction, datetime of the transaction, and the transaction amount. Order the transactions by datetime.
Example:
Input:
bank_transactions
table
Column | Type |
---|---|
id |
INTEGER |
created_at |
DATETIME |
transaction_value |
FLOAT |
Output:
Column | Type |
---|---|
created_at |
DATETIME |
transaction_value |
FLOAT |
id |
INTEGER |
Solution
Since our goal is to pull the last transaction from each day, we want to somehow group the transactions into days, and create a chronological ordering within each day from which we can retrieve the latest transaction.
First, we create a modified version of the bank_transactions
table with an added column denoting the chronological ordering of transactions within each day.
To do this, we can use an OVER()
statement to partition by date and then order by timestamp. After partitioning, we should use a descending order so that the first entry in each partition is the last transaction chronologically. The query:
SELECT (*), ROW_NUMBER() OVER(PARTITION BY DATE(created_at) ORDER BY created_at DESC)
AS ordered_time
Returns a table identical to bank_transactions
with an appended column named ordered_time
where the last transaction on each day has ordered_time = 1
, the penultimate has ordered_time = 2
and so on. (since we used a descending order).
Now, we can use a WITH
statement to select from this new table. We simply need to select the transactions where ordered_time = 1
. We name the new table timed_transactions
and write our final query as follows:
WITH timed_transactions AS
(
SELECT * , ROW_NUMBER() OVER (PARTITION BY DATE(created_at) ORDER BY created_at DESC) as ordered_time
FROM bank_transactions
)
SELECT created_at,transaction_value, id FROM timed_transactions
WHERE ordered_time = 1;
Alternatively, according to IQ user teddy_25, we can perform this action with a much shorter query.
We write a subquery to get the latest time denoted MAX(created_at)
and group by date
. This subquery creates the table maxdate
which is just a column of the datetime of the last transaction each day.
We then write our main query to return data about all transactions where their created_at
datetime is included in the ‘maxdate’ table.
SELECT * FROM bank_transactions
WHERE created_at IN ( SELECT MAX(created_at)
AS maxdate
FROM bank_transactions
GROUP BY DATE(created_at) )
This query is similar to our first solution in terms of computational efficiency. It checks all created_at
for each day to find the latest in the same way. Aside from elegance, its main advantage is that the maxdate
table in the second solution stores much less data than the timed_transactions
table in the first.
However, the first solution is more flexible. If we wanted to retrieve the latest three transactions each day, we could simply change the WHERE ordered_time = 1
line to include the second and third entries.
The best way to get better in SQL is practice. Interview Query offers a variety of SQL learning resources to help you practice and improve your SQL skills, including: