Owing to its high integrability and easy-to-learn syntax, SQL is heavily used in data analysis and software development. While various relational database management systems exist, Oracle was the first provider to offer a commercial system.
The GROUP BY Clause is among the most used statements in SQL, as it allows the user to group rows that have the same values in the specified columns. It’s often used in conjunction with functions like COUNT, SUM, AVG, MIN, MAX, WHERE, ORDER BY, HAVING, etc., to perform calculations on grouped data.
In this article, we’ll go through how to use GROUP BY in SQL and address some real-world questions you might encounter in upcoming interviews.
As mentioned, the SQL GROUP BY clause gathers rows with the same values in specified columns according to user preference. It also allows the aggregate functions to be applied to the grouped data.
For example,
Suppose we have a table called orders
that contains the following columns: order_id
, customer_id
, order_date
, order_amount
.
Table: orders
order_id | customer_id | order_date | order_amount |
---|---|---|---|
1 | 1 | 2023-01-10 | 100 |
2 | 2 | 2023-01-12 | 200 |
3 | 1 | 2023-01-15 | 150 |
4 | 3 | 2023-01-17 | 250 |
5 | 2 | 2023-01-20 | 100 |
We want to find out the total amount of orders placed by each customer.
SQL Query Using GROUP BY
SELECT customer_id, SUM(order_amount) AS total_order_amount
FROM orders
GROUP BY customer_id;
Result
customer_id | total_order_amount |
---|---|
1 | 250 |
2 | 300 |
3 | 250 |
Explanation
The SELECT statement in the query selects the customer_id
and the sum of order_amount
. The SUM
aggregate function is used to calculate the total amount for each customer. The GROUP BY
statement groups the rows in the orders by customer_id
.
customer_id = 1
, the total order amount is 100 + 150 = 250
.customer_id = 2
, the total order amount is 200 + 100 = 300
.customer_id = 3
, the total order amount is 250
.This example demonstrates the basic functionality of the GROUP BY clause. We’ll delve into more advanced use cases throughout the article.
Some of the practical use cases of GROUP BY in SQL include:
Let’s explore the GROUP BY techniques that your interviewers may expect you to know:
When you need to calculate summary statistics such as sums, averages, counts, etc., for groups of rows.
Example
To calculate the total sales amount for each product from a sales
table:
Table: sales
sale_id | product_id | product_name | sale_amount |
---|---|---|---|
1 | 101 | Widget A | 100 |
2 | 102 | Gadget B | 200 |
3 | 101 | Widget A | 150 |
4 | 103 | Doohickey C | 250 |
5 | 102 | Gadget B | 100 |
Query: Total sales amount for each product
SELECT product_id, product_name, SUM(sale_amount) AS total_sales
FROM sales
GROUP BY product_id, product_name;
Result
product_id | product_name | total_sales |
---|---|---|
101 | Widget A | 250 |
102 | Gadget B | 300 |
103 | Doohickey C | 250 |
When you need to count the number of occurrences of specific values in a column.
Example
To count the number of orders placed by each customer from an orders
table:
Table: orders
order_id | customer_id | customer_name |
---|---|---|
1 | 1 | John Doe |
2 | 2 | Jane Smith |
3 | 1 | John Doe |
4 | 3 | Mary Brown |
5 | 2 | Jane Smith |
Query: Number of orders placed by each customer
SELECT customer_id, customer_name, COUNT(order_id) AS order_count
FROM orders
GROUP BY customer_id, customer_name;
Result
customer_id | customer_name | order_count |
---|---|---|
1 | John Doe | 2 |
2 | Jane Smith | 2 |
3 | Mary Brown | 1 |
When you need to find the maximum or minimum values within groups.
Example
To find the highest sale amount for each product from a sales
table:
Table: sales
sale_id | product_id | product_name | sale_amount |
---|---|---|---|
1 | 101 | Widget A | 100 |
2 | 102 | Gadget B | 200 |
3 | 101 | Widget A | 150 |
4 | 103 | Doohickey C | 250 |
5 | 102 | Gadget B | 100 |
Query: Highest sale amount for each product
SELECT product_id, product_name, MAX(sale_amount) AS highest_sale
FROM sales
GROUP BY product_id, product_name;
Result
product_id | product_name | highest_sale |
---|---|---|
101 | Widget A | 150 |
102 | Gadget B | 200 |
103 | Doohickey C | 250 |
When you need to perform grouping based on multiple columns to get more granular summary statistics.
Example
To calculate the total sales amount for each product in each region from a sales
table:
Table: sales
sale_id | product_id | product_name | region | sale_amount |
---|---|---|---|---|
1 | 101 | Widget A | North | 100 |
2 | 102 | Gadget B | South | 200 |
3 | 101 | Widget A | North | 150 |
4 | 103 | Doohickey C | East | 250 |
5 | 102 | Gadget B | South | 100 |
Query: Total sales amount for each product in each region
SELECT product_id, product_name, region, SUM(sale_amount) AS total_sales
FROM sales
GROUP BY product_id, product_name, region;
Result
product_id | product_name | region | total_sales |
---|---|---|---|
101 | Widget A | North | 250 |
102 | Gadget B | South | 300 |
103 | Doohickey C | East | 250 |
When you need to analyze data that is categorized by specific columns.
Example
To find the average rating for each category of products from a product_reviews
table:
Table: product_reviews
review_id | product_id | category | rating |
---|---|---|---|
1 | 101 | Electronics | 4.5 |
2 | 102 | Home Appliances | 4.0 |
3 | 103 | Tools | 3.8 |
4 | 104 | Automotive | 4.2 |
5 | 105 | Kitchen | 4.7 |
6 | 106 | Outdoor | 3.9 |
7 | 107 | Electronics | 4.6 |
Query: Average rating for each category of products
SELECT category, AVG(rating) AS average_rating
FROM product_reviews
GROUP BY category;
Result
category | average_rating |
---|---|
Electronics | 4.55 |
Home Appliances | 4.00 |
Tools | 3.80 |
Automotive | 4.20 |
Kitchen | 4.70 |
Outdoor | 3.90 |
When you need to filter grouped data using the HAVING
clause, which allows you to specify conditions on aggregate functions.
Example
To find customers who have placed more than 5 orders from an orders
table:
Table: orders
order_id | customer_id | customer_name |
---|---|---|
1 | 1 | John Doe |
2 | 2 | Jane Smith |
3 | 1 | John Doe |
4 | 3 | Mary Brown |
5 | 2 | Jane Smith |
6 | 1 | John Doe |
7 | 4 | Paul Green |
Query: Customers who have placed more than 2 orders
SELECT customer_id, customer_name, COUNT(order_id) AS order_count
FROM orders
GROUP BY customer_id, customer_name
HAVING COUNT(order_id) > 2;
Result
customer_id | customer_name | order_count |
---|---|---|
1 | John Doe | 3 |
When you need to use the GROUP BY
clause in conjunction with other SQL clauses like WHERE
, ORDER BY
, etc.
Example
To calculate the average salary by department for employees who have been with the company for more than 5 years from an employees
table:
Table: Employees
employee_id | employee_name | department | salary | years_with_company |
---|---|---|---|---|
1 | Alice Smith | Sales | 50000 | 6 |
2 | Bob Johnson | HR | 60000 | 7 |
3 | Carol White | IT | 70000 | 5 |
4 | Dave Black | Marketing | 55000 | 4 |
5 | Emma Blue | Finance | 62000 | 8 |
Query: Average salary by department for employees who have been with the company for more than 5 years
SELECT department, AVG(salary) AS average_salary
FROM employees
WHERE years_with_company > 5
GROUP BY department
ORDER BY average_salary DESC;
Result
department | average_salary |
---|---|
HR | 60000 |
Finance | 62000 |
Sales | 50000 |
Even the most experienced data analysts may encounter the common pitfalls of using GROUP BY in SQL. When responding to interview questions, consider reviewing your query for these common mistakes to strengthen your candidacy.
The HAVING clause is used to filter groups after they have been aggregated, whereas the WHERE clause is used to filter rows before aggregation.
An incorrect query might look like:
SELECT product_id, SUM(sale_amount)
FROM sales
WHERE SUM(sale_amount) > 100
GROUP BY product_id;
And, the correct query would be:
SELECT product_id, SUM(sale_amount)
FROM sales
GROUP BY product_id
HAVING SUM(sale_amount) > 100;
Using GROUP BY on large datasets can be slow if not optimized properly. Without indexes or other optimizations, the database may need to perform a full table scan. Furthermore, indexing also helps optimize multiple Join statements within a query.
NULL values in columns used for grouping can lead to groups that are difficult to interpret. SQL treats NULL as a separate group and needs to be replaced with something more interpretable.
Example
Consider the sales
table with some NULL values in the region
column.
sale_id | product_id | product_name | region | sale_amount |
---|---|---|---|---|
1 | 101 | Widget A | North | 100 |
2 | 102 | Gadget B | South | 200 |
3 | 101 | Widget A | NULL | 150 |
4 | 103 | Doohickey C | East | 250 |
5 | 102 | Gadget B | NULL | 100 |
An incorrect query that may lead to confusion:
SELECT region, COUNT(*)
FROM sales
GROUP BY region;
This query will cluster NULL as well, which might be unexpected.
Handling NULL values:
SELECT COALESCE(region, 'Unknown') AS region, COUNT(*)
FROM sales
GROUP BY COALESCE(region, 'Unknown');
Using COALESCE, NULL values are replaced with ‘Unknown’, making the results clearer.
Except for specific use cases, grouping on too many columns can lead to overly detailed results, making it hard to draw meaningful conclusions. Consider applying thoughtful aggregations and more meaningful queries to reach better conclusions.
Misinterpreting grouped data, especially when using multiple aggregate functions, can lead to incorrect conclusions.
Example
Consider the sales
table with an additional quantity
column.
sale_id | product_id | product_name | sale_amount | quantity |
---|---|---|---|---|
1 | 101 | Widget A | 100 | 2 |
2 | 102 | Gadget B | 200 | 1 |
3 | 101 | Widget A | 150 | 3 |
4 | 103 | Doohickey C | 250 | 1 |
5 | 102 | Gadget B | 100 | 2 |
Misleading query:
SELECT product_id, SUM(sale_amount), AVG(quantity)
FROM sales
GROUP BY product_id;
The result shows the total sales and the average quantity per order, which may be misinterpreted.
Clearer query:
SELECT product_id, SUM(sale_amount) AS total_sales, AVG(quantity) AS avg_quantity
FROM sales
GROUP BY product_id;
Renaming the columns in the result makes their meaning clearer.
By learning effective SQL and practicing, you can interpret tables accurately and write queries that generate optimal solutions. Here are a few SQL interview questions requiring the GROUP BY clause to solve. You may encounter these in your upcoming interview:
Note: Students took each exam only once.
Example:
For the given input:
student_id | student_name | exam_id | score |
---|---|---|---|
100 | Anna | 1 | 71 |
100 | Anna | 2 | 72 |
100 | Anna | 3 | 73 |
100 | Anna | 4 | 74 |
101 | Brian | 1 | 65 |
The expected output should be:
student_name | exam_1 | exam_2 | exam_3 | exam_4 |
---|---|---|---|---|
Anna | 71 | 72 | 73 | 74 |
Brian | 65 | NULL | NULL | NULL |
Input:
exam_scores
table
Column | Type |
---|---|
student_id | INTEGER |
student_name | VARCHAR |
exam_id | INTEGER |
score | INTEGER |
Output:
Column | Type |
---|---|
student_name | VARCHAR |
exam_1 | INT |
exam_2 | INT |
exam_3 | INT |
exam_4 | INT |
Given two tables, employees
and projects
, find the five lowest-paid employees who have completed at least three projects.
Note: We consider projects to be completed when they have an end date, which is the same as saying their End_dt
is not NULL
.
Example:
Input:
employees
table
id | salary |
---|---|
INTEGER | FLOAT |
projects
table
employee_id | project_id | Start_dt | End_dt |
---|---|---|---|
INTEGER | INTEGER | DATETIME | DATETIME |
Output:
employee_id |
---|
INTEGER |
Note: Both cities are in the same time zone, so you do not need to worry about converting time zones.
flights
table
Example:
Input:
Columns | Type |
---|---|
id | INTEGER |
destination_location | VARCHAR |
source_location | VARCHAR |
plane_id | INTEGER |
flight_start | DATETIME |
flight_end | DATETIME |
Output:
Columns | Type |
---|---|
plane_id | INTEGER |
calendar_day | VARCHAR |
time_in_min | INTEGER |
Please give calendar_day
in the format YYYY-MM-DD
.
Hopefully, you’ll now be able to utilize the inherent grouping capabilities of SQL’s GROUP BY clause to count concurrencies, aggregate group data, filter data, and combine other clauses. To delve deeper into SQL functions and prepare for interview questions, explore our SQL Learning Path and Company Interview Guides. Best of luck with your SQL journey!