How to Use GROUP BY in SQL (With Examples)

How to Use GROUP BY in SQL (With Examples)

Overview

Owing to its high integrability and easy-to-learn syntax, SQL is heavily used in data analysis and software development. While various relational database management systems exist, Oracle was the first provider to offer a commercial system.

The GROUP BY Clause is among the most used statements in SQL, as it allows the user to group rows that have the same values in the specified columns. It’s often used in conjunction with functions like COUNT, SUM, AVG, MIN, MAX, WHERE, ORDER BY, HAVING, etc., to perform calculations on grouped data.

In this article, we’ll go through how to use GROUP BY in SQL and address some real-world questions you might encounter in upcoming interviews.

What Does SQL GROUP BY Clause Do?

As mentioned, the SQL GROUP BY clause gathers rows with the same values in specified columns according to user preference. It also allows the aggregate functions to be applied to the grouped data.

For example,

Suppose we have a table called orders that contains the following columns: order_id, customer_id, order_date, order_amount.

Table: orders

order_id customer_id order_date order_amount
1 1 2023-01-10 100
2 2 2023-01-12 200
3 1 2023-01-15 150
4 3 2023-01-17 250
5 2 2023-01-20 100

We want to find out the total amount of orders placed by each customer.

SQL Query Using GROUP BY

SELECT customer_id, SUM(order_amount) AS total_order_amount
FROM orders
GROUP BY customer_id;

Result

customer_id total_order_amount
1 250
2 300
3 250

Explanation

The SELECT statement in the query selects the customer_id and the sum of order_amount. The SUM aggregate function is used to calculate the total amount for each customer. The GROUP BY statement groups the rows in the orders by customer_id.

  • For customer_id = 1, the total order amount is 100 + 150 = 250.
  • For customer_id = 2, the total order amount is 200 + 100 = 300.
  • For customer_id = 3, the total order amount is 250.

This example demonstrates the basic functionality of the GROUP BY clause. We’ll delve into more advanced use cases throughout the article.

When and How to Use SQL GROUP BY and Advanced Techniques

Some of the practical use cases of GROUP BY in SQL include:

  • Summarizing sales data by region, product, and salesperson
  • Analyzing survey results by demographics like age group, location, etc.
  • Summarizing expenses by department or project
  • Aggregating stock levels by product category

Let’s explore the GROUP BY techniques that your interviewers may expect you to know:

Aggregating Group Data

When you need to calculate summary statistics such as sums, averages, counts, etc., for groups of rows.

Example

To calculate the total sales amount for each product from a sales table:

Table: sales

sale_id product_id product_name sale_amount
1 101 Widget A 100
2 102 Gadget B 200
3 101 Widget A 150
4 103 Doohickey C 250
5 102 Gadget B 100

Query: Total sales amount for each product

SELECT product_id, product_name, SUM(sale_amount) AS total_sales
FROM sales
GROUP BY product_id, product_name;

Result

product_id product_name total_sales
101 Widget A 250
102 Gadget B 300
103 Doohickey C 250

Counting Occurrences

When you need to count the number of occurrences of specific values in a column.

Example To count the number of orders placed by each customer from an orders table:

Table: orders

order_id customer_id customer_name
1 1 John Doe
2 2 Jane Smith
3 1 John Doe
4 3 Mary Brown
5 2 Jane Smith

Query: Number of orders placed by each customer

SELECT customer_id, customer_name, COUNT(order_id) AS order_count
FROM orders
GROUP BY customer_id, customer_name;

Result

customer_id customer_name order_count
1 John Doe 2
2 Jane Smith 2
3 Mary Brown 1

Finding Maximum and Minimum Values

When you need to find the maximum or minimum values within groups.

Example

To find the highest sale amount for each product from a sales table:

Table: sales

sale_id product_id product_name sale_amount
1 101 Widget A 100
2 102 Gadget B 200
3 101 Widget A 150
4 103 Doohickey C 250
5 102 Gadget B 100

Query: Highest sale amount for each product

SELECT product_id, product_name, MAX(sale_amount) AS highest_sale
FROM sales
GROUP BY product_id, product_name;

Result

product_id product_name highest_sale
101 Widget A 150
102 Gadget B 200
103 Doohickey C 250

Grouping By Multiple Columns

When you need to perform grouping based on multiple columns to get more granular summary statistics.

Example

To calculate the total sales amount for each product in each region from a sales table:

Table: sales

sale_id product_id product_name region sale_amount
1 101 Widget A North 100
2 102 Gadget B South 200
3 101 Widget A North 150
4 103 Doohickey C East 250
5 102 Gadget B South 100

Query: Total sales amount for each product in each region

SELECT product_id, product_name, region, SUM(sale_amount) AS total_sales
FROM sales
GROUP BY product_id, product_name, region;

Result

product_id product_name region total_sales
101 Widget A North 250
102 Gadget B South 300
103 Doohickey C East 250

Analyzing Categorical Data

When you need to analyze data that is categorized by specific columns.

Example

To find the average rating for each category of products from a product_reviews table:

Table: product_reviews

review_id product_id category rating
1 101 Electronics 4.5
2 102 Home Appliances 4.0
3 103 Tools 3.8
4 104 Automotive 4.2
5 105 Kitchen 4.7
6 106 Outdoor 3.9
7 107 Electronics 4.6

Query: Average rating for each category of products

SELECT category, AVG(rating) AS average_rating
FROM product_reviews
GROUP BY category;

Result

category average_rating
Electronics 4.55
Home Appliances 4.00
Tools 3.80
Automotive 4.20
Kitchen 4.70
Outdoor 3.90

Filtering Grouped Data with the HAVING Clause

When you need to filter grouped data using the HAVING clause, which allows you to specify conditions on aggregate functions.

Example

To find customers who have placed more than 5 orders from an orders table:

Table: orders

order_id customer_id customer_name
1 1 John Doe
2 2 Jane Smith
3 1 John Doe
4 3 Mary Brown
5 2 Jane Smith
6 1 John Doe
7 4 Paul Green

Query: Customers who have placed more than 2 orders

SELECT customer_id, customer_name, COUNT(order_id) AS order_count
FROM orders
GROUP BY customer_id, customer_name
HAVING COUNT(order_id) > 2;

Result

customer_id customer_name order_count
1 John Doe 3

Combining with WHERE and ORDER BY Clause

When you need to use the GROUP BY clause in conjunction with other SQL clauses like WHERE, ORDER BY, etc.

Example

To calculate the average salary by department for employees who have been with the company for more than 5 years from an employees table:

Table: Employees

employee_id employee_name department salary years_with_company
1 Alice Smith Sales 50000 6
2 Bob Johnson HR 60000 7
3 Carol White IT 70000 5
4 Dave Black Marketing 55000 4
5 Emma Blue Finance 62000 8

Query: Average salary by department for employees who have been with the company for more than 5 years

SELECT department, AVG(salary) AS average_salary
FROM employees
WHERE years_with_company > 5
GROUP BY department
ORDER BY average_salary DESC;

Result

department average_salary
HR 60000
Finance 62000
Sales 50000

Common Pitfalls of Using GROUP BY in SQL

Even the most experienced data analysts may encounter the common pitfalls of using GROUP BY in SQL. When responding to interview questions, consider reviewing your query for these common mistakes to strengthen your candidacy.

Confusing WHERE and HAVING Clause

The HAVING clause is used to filter groups after they have been aggregated, whereas the WHERE clause is used to filter rows before aggregation.

An incorrect query might look like:

SELECT product_id, SUM(sale_amount) 
FROM sales 
WHERE SUM(sale_amount) > 100 
GROUP BY product_id;

And, the correct query would be:

SELECT product_id, SUM(sale_amount) 
FROM sales 
GROUP BY product_id 
HAVING SUM(sale_amount) > 100;

Performance Issues with Large Datasets

Using GROUP BY on large datasets can be slow if not optimized properly. Without indexes or other optimizations, the database may need to perform a full table scan. Furthermore, indexing also helps optimize multiple Join statements within a query.

Incorrect Results with NULL Values

NULL values in columns used for grouping can lead to groups that are difficult to interpret. SQL treats NULL as a separate group and needs to be replaced with something more interpretable.

Example

Consider the sales table with some NULL values in the region column.

sale_id product_id product_name region sale_amount
1 101 Widget A North 100
2 102 Gadget B South 200
3 101 Widget A NULL 150
4 103 Doohickey C East 250
5 102 Gadget B NULL 100

An incorrect query that may lead to confusion:

SELECT region, COUNT(*)
FROM sales
GROUP BY region;

This query will cluster NULL as well, which might be unexpected.

Handling NULL values:

SELECT COALESCE(region, 'Unknown') AS region, COUNT(*)
FROM sales
GROUP BY COALESCE(region, 'Unknown');

Using COALESCE, NULL values are replaced with ‘Unknown’, making the results clearer.

Grouping on Too Many Columns

Except for specific use cases, grouping on too many columns can lead to overly detailed results, making it hard to draw meaningful conclusions. Consider applying thoughtful aggregations and more meaningful queries to reach better conclusions.

Misinterpreting the Results

Misinterpreting grouped data, especially when using multiple aggregate functions, can lead to incorrect conclusions.

Example

Consider the sales table with an additional quantity column.

sale_id product_id product_name sale_amount quantity
1 101 Widget A 100 2
2 102 Gadget B 200 1
3 101 Widget A 150 3
4 103 Doohickey C 250 1
5 102 Gadget B 100 2

Misleading query:

SELECT product_id, SUM(sale_amount), AVG(quantity)
FROM sales
GROUP BY product_id;

The result shows the total sales and the average quantity per order, which may be misinterpreted.

Clearer query:

SELECT product_id, SUM(sale_amount) AS total_sales, AVG(quantity) AS avg_quantity
FROM sales
GROUP BY product_id;

Renaming the columns in the result makes their meaning clearer.

SQL GROUP BY Interview Questions

By learning effective SQL and practicing, you’ll be able to interpret tables accurately and write queries that generate optimal solutions. Here are a few SQL interview questions requiring the GROUP BY clause to solve. You may encounter these in your upcoming interview:

  1. To finish a class, students must pass four exams (exam ids: 1,2,3, and 4). Given a table exam_scores containing the data about all of the exams that students took, form a new table to track the scores for each student.

Note: Students took each exam only once.

Example:

For the given input:

student_id student_name exam_id score
100 Anna 1 71
100 Anna 2 72
100 Anna 3 73
100 Anna 4 74
101 Brian 1 65

The expected output should be:

student_name exam_1 exam_2 exam_3 exam_4
Anna 71 72 73 74
Brian 65 NULL NULL NULL

Input:

exam_scores table

Column Type
student_id INTEGER
student_name VARCHAR
exam_id INTEGER
score INTEGER

Output:

Column Type
student_name VARCHAR
exam_1 INT
exam_2 INT
exam_3 INT
exam_4 INT
  1. Let’s say you’re working in HR at a major tech company, and they ask you to find employees with a high probability of leaving the company. They tell you these employees perform well but have the lowest pay.

Given two tables, employees and projects, find the five lowest-paid employees who have completed at least three projects.

Note: We consider projects to be completed when they have an end date, which is the same as saying their End_dt is not NULL.

Example:

Input:

employees table

id salary
INTEGER FLOAT

projects table

employee_id project_id Start_dt End_dt
INTEGER INTEGER DATETIME DATETIME

Output:

employee_id
INTEGER
  1. Let’s say you work in air traffic control. You are given the table below containing information on flights between two cities. Write a query to find out how much time, in minutes (rounded down), each plane spent in the air each day.

Note: Both cities are in the same time zone, so you do not need to worry about converting time zones.

flights table

Example:

Input:

Columns Type
id INTEGER
destination_location VARCHAR
source_location VARCHAR
plane_id INTEGER
flight_start DATETIME
flight_end DATETIME

Output:

Columns Type
plane_id INTEGER
calendar_day VARCHAR
time_in_min INTEGER

Please give calendar_day in the format YYYY-MM-DD.

The Bottom Line

Hopefully, you’ll now be able to utilize the inherent grouping capabilities of SQL’s GROUP BY clause to count concurrencies, aggregate group data, filter data, and combine other clauses. To delve deeper into SQL functions and prepare for interview questions, explore our SQL Learning Path and Company Interview Guides. Best of luck with your SQL journey!