Structured Query Language (SQL) is a skill that every data analytics professional must have in their arsenal, as it has a wide range of applications in manipulating complex databases. Among other commonly tested SQL functions in interviews, JOINs are one of the most important SQL clauses to learn.
In this article, we will explore an important application of SQL JOINs, namely: SQL JOIN on multiple columns. We’ll briefly review the syntax, real-world use cases, and alternatives to joining on multiple columns. We shall also cover some relevant examples to help you appreciate the concept better.
SQL JOIN clauses are used to combine rows from two or more tables based on a related column(s) between them. The commonly used JOINs are INNER JOIN, LEFT OUTER JOIN, RIGHT OUTER JOIN, FULL OUTER JOIN, CROSS JOIN, and SELF JOIN.
Although JOINs are commonly made on the basis of a single column (called primary key or foreign key based on the table we are speaking in the context of), there are instances when the multiple columns need to be taken into consideration to get the desired result.
Here is the SQL JOIN syntax:
SELECT columns
FROM table1
FULL OUTER JOIN table2
ON table1.column_name = table2.column_name;
The syntax will differ based on the type of join you wish to perform; for instance, you’ll need to change the phrasing to INNER JOIN if that’s the kind of join needed.
Note: if you need a refresher on SQL syntax, keep our ultimate SQL cheat sheet handy.
Joins become more powerful when the business problem necessitates more complex relationships — this is where SQL joins on multiple columns comes into play. Rather than joining tables based on a single common column, you can join them based on two or more columns, which allows for a more nuanced and precise retrieval of related data.
Such joins are essential in situations where the integrity and context of the data must be maintained across dimensions. There are several ways we can achieve this, as we will go over in the next section.
For example, in a database where an employee’s identification involves both a department code and an employee number, you would need a join on both of these columns to accurately match employee records with departmental data. Here, a join on multiple columns ensures the integrity and specificity of the data relationship:
SELECT *
FROM Employee
JOIN Department ON Employee.DeptCode = Department.DeptCode
AND Employee.EmpNumber = Department.EmpNumber;
Multi-column joins in SQL are particularly useful when you have complex relationships that cannot be adequately represented by a single-column join. Here are a few real-world scenarios where this can come up.
Consider a database for a university where you have two tables:
Enrollments
table with records of student enrollments, including StudentID
, CourseID
, and Semester
.Grades
table with the students’ grades, including StudentID
, CourseID
, Semester
, and Grade
.Find out what grade each student received in each course for the Fall 2023 semester.
Here are the tables provided:
Enrollments Table
EnrollmentID | StudentID | CourseID | Semester | EnrollmentDate |
---|---|---|---|---|
1 | 101 | 201 | Fall 2023 | 2023-08-15 |
2 | 102 | 202 | Fall 2023 | 2023-08-16 |
3 | 103 | 203 | Fall 2023 | 2023-08-17 |
4 | 104 | 204 | Fall 2023 | 2023-08-18 |
5 | 105 | 205 | Fall 2023 | 2023-08-19 |
6 | 101 | 206 | Spring 2023 | 2023-01-10 |
7 | 102 | 207 | Spring 2023 | 2023-01-11 |
Grades Table
StudentID | CourseID | Semester | Grade |
---|---|---|---|
1 | 101 | Fall 2023 | A |
1 | 102 | Spring 2023 | B |
2 | 101 | Fall 2023 | A |
3 | 103 | Fall 2023 | C |
2 | 104 | Spring 2023 | B |
3 | 102 | Spring 2023 | B |
4 | 101 | Fall 2023 | B |
5 | 103 | Fall 2023 | A |
Solution
To find out what grade each student received in each course for the “Fall 2023” semester, you would join the two tables on three columns: StudentID
, CourseID
, and Semester
.
Here’s how the SQL query might look:
SELECT
e.StudentID,
e.CourseID,
e.Semester,
g.Grade
FROM
Enrollments e
JOIN
Grades g
ON e.StudentID = g.StudentID
AND e.CourseID = g.CourseID
AND e.Semester = g.Semester
WHERE
e.Semester = 'Fall 2023';
This multi-column join is absolutely essential because if you joined only on StudentID
or CourseID
, you would end up with incorrect matches where a student’s grade from one semester could be incorrectly matched with an enrollment from another semester.
The expected output looks like this:
StudentID | CourseID | Semester | Grade |
---|---|---|---|
101 | 201 | Fall 2023 | A |
102 | 202 | Fall 2023 | B |
103 | 203 | Fall 2023 | A |
104 | 204 | Fall 2023 | C |
105 | 205 | Fall 2023 | B |
Here is a similar interview query you can practice on our built-in SQL compiler: Audio Chat Success
In SQL, multi-column joins typically refer to matching rows based on the equality of multiple column pairs. However, you can have multi-condition joins where you’re joining on the equality of one set of columns and a comparative condition on another. We’ll demonstrate this application in this example.
You’re given two tables: employees
and managers
. Find the names of all employees who joined before their manager.
Solution
Here’s the expected code:
SELECT CONCAT(e.first_name, ' ', e.last_name) AS employee_name
FROM employees e
INNER JOIN managers m ON e.manager_id = m.id AND e.join_date < m.join_date
Note the logical condition e.join_date < m.join_date
which implies that you are looking for employees who were joined the company before their managers. This is an instance of the logical operator <
being used in a SQL multiple-column join condition instead of the more commonly used AND
operator.
The below problem statement demonstrates the converse of the above example, namely that even if you have an AND
operator in your join condition, it may not technically be a SQL multiple join.
You’re given two tables, users
and events
. The events
table holds values of all of the user events in the action column (‘like’, ‘comment’, or ‘post’).
Write a query to get the percentage of users that have never liked or commented. Round to two decimal places.
Here is a sample query to get the needed output:
SELECT
ROUND(
COUNT(DISTINCT CASE WHEN action IS NULL THEN u.id END)/ COUNT(DISTINCT u.id),
2
) AS percent_never
FROM
users u
LEFT JOIN events e
ON u.id = e.user_id
AND e.action IN ('like', 'comment');
Important note: While the given join clause is joining the users
and events
tables based on the user_id, the additional condition action IN ('like', 'comment')
is a filter on the rows returned by the join, not an additional join condition on another column. The AND
in the ON
clause does not combine two columns from each table but rather restricts the join to certain rows based on the action
column’s content.
Now let’s look at an example where multiple columns are used to join multiple tables. Let’s consider a simplified example of a database that might be used for a multi-factor authentication (MFA) system. In an MFA system, users might authenticate through multiple methods, like a password, a particular device type, and biometric data like their date of birth. Write a query to verify that a user has completed all three authentication factors.
Here are the input tables.
Users Table
UserID | Username | LastPasswordUpdate | LastDeviceRegistration | LastBiometricUpdate |
---|---|---|---|---|
1 | john_doe_92 | 1010 | 2010 | 3010 |
2 | jane.smith88 | 1020 | 2020 | 3020 |
3 | mikebrown123 | 1030 | 2030 | 3030 |
4 | lisaray234 | 1040 | 2040 | 3040 |
5 | dave.clark45 | 1050 | 2050 | 3050 |
Passwords Table
AuthID | UserID | PasswordHash | UpdateDate |
---|---|---|---|
1010 | 1 | $2b$12$7q | 2023-01-15 08:35:00 |
1020 | 2 | $2b$12$9I | 2023-02-20 09:20:00 |
1030 | 3 | $2b$12$iX | 2023-03-05 10:05:00 |
1040 | 4 | $2b$12$4t | 2023-04-10 11:15:00 |
1050 | 5 | $2b$12$oH | 2023-05-25 12:45:00 |
Devices Table
AuthID | UserID | DeviceID | RegistrationDate |
---|---|---|---|
2010 | 1 | iPhone12 | 2023-01-25 14:30:00 |
2020 | 2 | Pixel5 | 2023-02-15 15:40:00 |
2030 | 3 | GalaxyS21 | 2023-03-18 16:50:00 |
2040 | 4 | iPad8 | 2023-04-08 17:05:00 |
2050 | 5 | OnePlus9 | 2023-05-20 18:20:00 |
Biometrics Table
AuthID | UserID | BiometricID | LastUpdate |
---|---|---|---|
3010 | 1 | FingerprintA | 2023-01-30 19:00:00 |
3020 | 2 | FaceID_X2 | 2023-02-28 20:10:00 |
3030 | 3 | IrisPatternC | 2023-03-22 21:30:00 |
3040 | 4 | VoicePrintD | 2023-04-12 22:00:00 |
3050 | 5 | RetinaScanE | 2023-05-27 23:45:00 |
Output Table
UserID | Username | PasswordHash | DeviceID | BiometricID |
---|---|---|---|---|
1 | john_doe_92 | $2b$12$7q | iPhone12 | FingerprintA |
2 | jane.smith88 | $2b$12$9l | Pixel5 | FaceID_X2 |
3 | mikebrown123 | $2b$12$iX | GalaxyS21 | IrisPatternC |
4 | lisaray234 | $2b$12$4t | iPad8 | VoicePrintD |
5 | dave.clark45 | $2b$12$oH | OnePlus9 | RetinaScanE |
The output can be used to vet if any users have not completed their MFA, and send them reminders if required.
Here’s the SQL join query to produce the above output:
SELECT
u.UserID,
u.Username,
p.Password,
d.DeviceID,
b.BiometricID
FROM
Users u
JOIN
Passwords p ON u.UserID = p.UserID AND u.LastPasswordUpdate = p.AuthID
JOIN
Devices d ON u.UserID = d.UserID AND u.LastDeviceRegistration = d.AuthID
JOIN
Biometrics b ON u.UserID = b.UserID AND u.LastBiometricUpdate = b.AuthID;
To practice more SQL JOIN interview questions, you can go to our database of popular questions and filter by the tag ‘JOINs’.
The most common way to join tables on multiple columns is to use the AND
operator within your ON
clause. However, there are a few variations that can be used, depending on the problem you want to solve.
A composite key is a key composed of two or more columns. You can join tables based on this key as follows:
SELECT *
FROM table1 AS t1
JOIN table2 AS t2
ON (t1.columnA, t1.columnB) = (t2.columnA, t2.columnB);
Composite key joins are used when a combination of columns is designed to be a unique key in the database schema. Multi-column joins, on the other hand, are more general joins used for other purposes as well, such as general filtering and matching of records.
While this method may be effective in rare cases such as legacy systems, since databases are generally optimized to use indexes on joins, concatenating will mean that the database can no longer use the individual indexes for each column. This can result in a full table scan, which is much slower. This method is usually discouraged.
SELECT *
FROM employees AS e
JOIN salaries AS s
ON CONCAT(e.first_name, ' ', e.last_name) = s.employee_name;
Apart from equality, join conditions can include a range of operators (<, >, <=, >=, <>). You can use these operators to perform a join based on a comparative condition.
SELECT *
FROM orders AS o
JOIN shipment AS s
ON o.customer_id = s.customer_id
AND o.order_date <= s.shipment_date;
Natural joins automatically join all columns with the same names. This method is rarely employed because the join is implicit and can produce unexpected results if the tables have more columns in common than anticipated.
SELECT *
FROM customers
NATURAL JOIN orders;
This article has explored the SQL JOIN on multiple columns, with various ways to tackle this common business problem scenario. You can head over to our blog to read our comprehensive SQL guides or practice questions from our interview database.