Back to Data Science Interview
Data Science Interview

Data Science Interview

91 of 257 Completed

SQL JOINs

SQL was designed to work with large databases, which is why it stores information in a special way:

Our data is divided into different tables so that we don’t have to go through all of it each time we need to query specific data. For example, if we only want information about employees, we’d like to be able to look at just one table, rather than all the other data we hold (such as sales, products, projects, etc).

Our tables are “normalized”, which means that data should not be duplicated across tables. This reduces unnecessary use of memory and limits the risk of data inconsistencies. If we have the same information in two different places, we’d have to update them together. If we ever forgot, we’d have an inconsistency.

Primary and foreign keys

Let’s assume we have two tables, called sales and employees. Let’s say the employees table has the name and salary of each employee, and the sales table shows the amount of each sale and the employee who facilitated it.

If we want to find both the salary and total amount sold for each employee, we would have to look at both tables to figure it out.

However, there is a problem: what if two employees share the same name? There would be no way of distinguishing between sales from one and the other.

To solve this issue, all our rows have a primary key, which is a unique identifier for each row. In the case of our employees, the primary key is a unique id that we assign to each of them:

employees:

id name salary
1 John Smith 50000
2 Jane Doe 60000
3 Bob Johnson 45000

Now, whenever another table references an employee, it will use their id.

The sales table also has a column called id which uniquely identifies each sale. However, each sale must reference an employee. To avoid ambiguity, it references the employees’ id. The column that references to the primary key of another table is called a foreign key.

sales:

id employee_id amount
1 1 2500
2 1 1500
3 2 3000
4 3 4500
5 3 3500

JOINs

Let’s say we now want to know the name of the employee who carried out each sale. The tables are normalized, so each sale only holds the employees’ id.

To do this, we would have to look at the employee_id of each sale, and then search for it in the employees table. We could think of this process as joining the two tables together.

SQL has a JOIN operation that matches rows of two different tables and appends them. We must specify the primary and foreign keys with an ON statement, like this:

SELECT * FROM employees 
JOIN sales
ON employees.id = sales.employee_id

This query tells us that the employees and sales tables should be joined by looking at the column id from the employees table and find a matching employee_id in the sales table. Note that we need to clarify the table each column comes from with the syntax {table_name}.{column_name}. We need to do this because columns might have the same names across different tables.

For each row in one table, it appends its matching row on the other table beside it. The result would be this table:

id name salary id employee_id amount
1 John Smith 50000 1 1 2500
1 John Smith 50000 2 1 1500
2 Jane Doe 60000 3 2 3000
3 Bob Johnson 45000 4 3 4500
3 Bob Johnson 45000 5 3 3500

Note that the first three columns came from the employees table and the last three columns came from the sales table.

After the JOIN, we can select the columns we want. We just need to replace the wildcard (*) for the columns we want to select in our query.

But now, there are two columns with the same name id, so SELECT id would be an ambiguous statement.

If we want to SELECT a column with a name that repeats, we must specify the table it came from before the join. In this case, employees.id would refer to the first column and sales.id would refer to the fourth.

Finally, we can also use the keyword AS to rename tables.

As an example, if we just want the name of the employee who carried out each sale, we could do:

SELECT s.id, e.name
FROM employees AS e
JOIN sales AS s
ON s.id = e.id
id name
1 John Smith
2 John Smith
3 Jane Doe
4 Bob Johnson
5 Bob Johnson

Types of JOINs

In our last example, every sale had a matching employee and vice versa. However, it’s not always the case that a row in a table has a matching row in another. There are four types of JOINs, which all differ in how to handle non-matching values within the tables.

In this section, we’ll use two tables as examples, called prices and names.

prices

itemID price
1 10.99
2 5.99
3 7.50

names

itemID name
1 Pencil
2 Pen
4 Paper

INNER JOIN

An inner join returns all rows that have a matching value in both tables. In this case,

SELECT * FROM prices
INNER JOIN names
ON prices.itemID = names.itemID

would return:

itemID price name
1 10.99 Pencil
2 5.99 Pen

This is because both names and prices have rows with itemID values 1 and 2, so they are kept in the table returned by the Join.

However, names doesn’t have the itemID value 4, and prices doesn’t have the itemID value 3, so both rows are discarded in the table returned.

LEFT JOIN

A left join keeps all rows that appear in the first table. If there is no matching value in the second table, the join will append empty cells. For example:

SELECT * FROM prices
LEFT JOIN names
ON prices.itemID = names.itemID

The first and second rows of the prices table have a matching value in the names table, so the join works as usual for those rows.

The third row of the prices table has no matching values in the names table, so the join will have nothing to append to it. However, the left join keeps all rows that were present in the first table.

As a result, the join will complete the row that has no matching values with empty cells, which are symbolized by the keyword NULL.

The third row of the names table has no matching values in the prices table, so the left join will discard it.

itemID price name
1 10.99 Pencil
2 5.99 Pen
3 7.50 NULL

RIGHT JOIN

The right join works exactly like the left join, but the other way around. It keeps all the values in the second table. If they have no matching values in the first table, the join will append empty cells instead.

For example:

SELECT * FROM prices
RIGHT JOIN names
ON prices.itemID = names.itemID

will return

itemID price name
1 10.99 Pencil
2 5.99 Pen
4 NULL Paper

OUTER JOIN

Finally, the outer join keeps all the rows that appeared on the first and second table. If any of them have no matching values in the other table, it completes the join with empty cells.

For example,

SELECT * FROM prices
OUTER JOIN names
ON prices.itemID = names.itemID

would return us the following table:

itemID price name
1 10.99 Pencil
2 5.99 Pen
3 7.50 NULL
4 NULL Paper

What we learned so far:

Example query

SELECT first_name, salary, sales.employeeID
FROM employees JOIN sales
ON employee.id = sales.employeeID
WHERE salary > 7500 AND last_name != "Williams"

Execution order

  1. FROM -> Gets input tables
  2. JOIN … ON -> Merges input tables
  3. WHERE -> Filters rows according to condition
  4. SELECT -> Selects the columns we need.
  5. Aggregate functions and numerical operations performed on output.
Good job, keep it up!

35%

Completed

You have 166 sections remaining on this learning path.