Interview Query | GROUP BY and HAVING - Data Analytics Interview

In the beginner level course, we saw how aggregate functions can be used to summarize metrics from a table. However, their real utility comes when we combine them with the GROUP BY clause.

Let’s say we have a sales table with the following information:

employeeID	date	amount
1	2021-01-01	100.00
2	2021-01-01	200.00
1	2021-01-02	150.00
3	2021-01-03	300.00
2	2021-01-03	150.00
4	2021-01-04	400.00

If we wanted to know the total amount sold by each employee with the tools we’ve learned so far, we would need a separate query for each of them.

In this case, this would mean that we would find the total amount sold by each employee separately. For example, to find the amount sold by the first employee, we could use the following query:

SELECT employeeID, SUM(amount) AS total_sold
FROM employees
WHERE employeeID = 1

The WHERE clause filters the sales done by the first employee:

employeeID	date	amount
1	2021-01-01	100.00
1	2021-01-02	150.00

And then, the aggregation function finds the total amount sold for this smaller table. The output of the query would be:

employeeID	total_sold
1	250.00

If the table is too large, this approach is inconvenient, as it would need to find the total amount sold for each employee separately.

SQL allows us to solve this problem with a single query, using the GROUP BY clause. The GROUP BY clause lets us:

Divide our table into smaller tables according to a single column.
Perform an aggregate function over each of them.
Integrate the results into a single table again.

In this case, the query we would use is:

SELECT employeeID, SUM(amount) AS total_sol
FROM employees
GROUP BY employeeID

Our output would be:

employeeID	total_amount_sold
1	250.00
2	350.00
3	300.00
4	400.00

The logic of it works in the following way:

The GROUP BY clause divides our original table into four smaller tables according to their employeeID.

Table_1:

employeeID	date	amount
1	2021-01-01	100.00
1	2021-01-02	150.00

Table_2:

employeeID	date	amount
2	2021-01-01	200.00
2	2021-01-03	150.00

Table_3:

employeeID	date	amount
3	2021-01-03	300.00

Table_4:

employeeID	date	amount
4	2021-01-04	400.00

Then, we select employeeID and SUM(amount) AS total_amount from each of the tables:

Table 1:

employeeID	total_amount_sold
1	250.00

Table 2:

employeeID	total_sold
2	350.00

Table 3:

employeeID	total_sold
3	300.00

Table 4:

employeeID	total_sold
4	400.00

Finally, the results are integrated into a single table:

employeeID	total_sold
1	250.00
2	350.00
3	300.00
4	400.00

In reality, the GROUP BY clause does all those steps at once, and is more efficient than carrying them all out separately.

We can use the GROUP BY clause with any aggregate function we need. We can specify more than one argument in the GROUP BY clause by using a separating comma. The original tables will be divided into smaller groups for which all the columns in the GROUP BY clause have the same values.

For example, GROUP BY location_id, employee_id would divide a table into smaller groups that share the same location_id and employee_id.

HAVING

The HAVING clause is a filter for the groups made by the GROUP BY. It works similarly to the WHERE clause, but it’s applied to groups instead of single rows.

For example, if we wanted to find the total amount sold for employees who performed more than one sale, we could do:

SELECT employeeID, 
	SUM(amount) AS total_sold
FROM employees
GROUP BY employeeID
HAVING COUNT(*) > 1

Here, the GROUP BY clause would divide our table into groups according to their employeeID and then filter each of the groups according to our condition. In this case, it would discard the smaller tables 3 and 4, as they only have one row.

After filtering the groups, it continues just as the GROUP BY clause does. It performs the aggregate functions in the SELECT on each of the groups and integrates the results into a single table.

So far:

Example query

SELECT first_name, SUM(salary)
FROM employees 
	JOIN sales
ON employee.id = sales.employeeID
WHERE salary > 7500 AND last_name != "Williams"
GROUP BY employee.id
HAVING COUNT(*) > 1

Execution order

FROM -> Gets input tables
JOIN … ON -> Merges input tables
WHERE -> Filters rows according to condition
GROUP BY -> Separates into smaller tables
HAVING -> Filters smaller tables according to condition
SELECT -> Selects the columns we need (and integrates smaller tables)
SUM() and aggregate functions or numerical operations performed on output.

Download Facts

Total Spent On Products

0%

Completed

You have 84 sections remaining on this learning path.

Loading pricing options