When handling relational databases, there are a couple of ways to facilitate data querying. While methods like ORMs can be beneficial for abstracting database interactions, SQL remains the best language for direct data manipulation and analysis. Having a strong grasp of SQL foundations can provide a crucial edge in an industry where data-driven decisions are paramount.
How long does it take to learn SQL starting from scratch? In this article, we’ll discuss some of the mathematical background behind the language, how to measure your skill level, and more SQL resources to check out.
The learning curve for SQL is highly dependent on your background. For non-programmers learning SQL as their first language, it may be easier to understand since it’s rather declarative: you describe what you want, not the steps to achieve it.
However, if you come from a programming background that typically introduces a procedural paradigm, getting accustomed to how SQL manages and operates on data can be more difficult. Because the language is based on relational algebra, SQL operations typically work on a set rather than a specific entity.
SQL basics are relatively intuitive, but you may wrestle a bit with the syntax. For example, when I first learned SQL, some things I found myself confused with include:
SELECT
→ FROM
→ WHERE
→ GROUP BY
→ HAVING
→ ORDER BY
→ LIMIT
.
FROM
→ WHERE
→ GROUP BY
→ HAVING
→ SELECT
→ ORDER BY
→ LIMIT
.
I was also initially confused about how GROUP BY
interacts with aggregation functions like COUNT()
, SUM()
, and RANK()
. There are also compound clauses with SQL, like INSERT INTO
, that function as one operation but are made of two words. Not that big of a deal, but it can make things confusing anyway.
SQL, the ubiquitous language of relational databases, is not arbitrarily constructed. Its design reflects principles derived from relational algebra– a rigorous mathematical framework. To truly grasp the nuances of SQL, it’s important to understand some of the underlying foundational operations:
WHERE
clause we use in SQL. This operation helps narrow down the data to match our needs.SELECT
keyword, so we only see the relevant data points.These definitions may seem intimidating, but, in practice, these concepts are relatively easy to understand. While JOINs
are a bit harder, with enough hands-on practice, they become more intuitive to use in a relational context.
For SQL, like many disciplines, mastery isn’t about completion– it’s about depth, adaptability, and continuous engagement. While some may see it as an endpoint, true SQL mastery adapts to the language’s evolving landscape.
SQL learning journeys can be as diverse as the language. Here are a few perspectives:
25-year MS SQL Server User: “I’ve always leaned into the ‘learn as you go’ philosophy. Even after 20 years with MS SQL Server and five years with Access before that, I find myself revisiting books and searching on Google. Commands change, new versions emerge, and continuous learning becomes paramount.”
10-Year SQL Enthusiast: “By the fifth year, I felt I had reached a notable level of proficiency. Yet, the journey of discovery and learning in SQL is unending.”
Jay from Interview Query: “My personal mastery moment in SQL was when I grasped self-joins and could effectively manage 90% of my analytical queries. When faced with challenges like funnel analysis or statistics, I could quickly decipher the necessary data and promptly formulate a SQL query to get my answer.”
Assessing your proficiency in SQL is both a reflection and a guidepost for future learning. Here are some ways to approach this:
While there are no definitive “levels” for measuring SQL proficiency, there are benchmarks that can help you determine your SQL knowledge.
Here, we’ll describe different levels of SQL proficiency under three main categories: beginner, intermediate, and advanced. To measure your level of understanding, try out the questions we’ve created below.
At the beginner level, you should be able to identify basic SQL terminologies and analyze how they work together to create a database architecture.
These questions are mostly theory-based and are designed to test your familiarity with basic SQL concepts. While these are “easy” questions that a beginner should be able to answer, it’ll be more difficult if you’re missing any of the fundamentals, including:
To test your mastery of these topics, try the following questions:
A database is a structured collection of data. While databases exist in various forms, SQL databases, in particular, are digital and are often relational. This means they store data in tables that can relate to one another, which allows for both design simplicity and more complex functions, depending on the use.
A DBMS (Database Management System) is a software system that facilitates the creation, management, and use of databases. A RDBMS (Relational Database Management System) is a type of DBMS that uses a relational model. This means the data is organized into tables with defined relationships, allowing for a more connected view of the data. Unlike a traditional DBMS that stores data in file formats, an RDBMS leverages tabular structures.
A primary key in SQL is a type of self-induced constraint that limits the occurrence of a value within a tablet to only once, with NULL
values not allowed. The primary key constraint combines the NOT NULL
and UNIQUE
constraints.
You can create a table with a single field as a primary key using the following code:
CREATE TABLE Employees (
ID int NOT NULL,
FirstName VARCHAR(255) NOT NULL,
LastName VARCHAR(255),
PRIMARY KEY (ID)
);
To enforce a primary key on two fields, do the following:
CREATE TABLE Employees (
ID int NOT NULL,
FirstName VARCHAR(255) NOT NULL,
LastName VARCHAR(255),
CONSTRAINT PK_Employees PRIMARY KEY (ID, FirstName)
);
At an intermediate level of SQL proficiency, it’s important to be familiar with certain concepts that help create insightful queries, although not necessarily the most in-depth functions and concepts.
These topics serve as building blocks for creating more complicated queries. At this level, you should be able to solve most problems ranging from surface-level to relatively complex issues, even if they’re not necessarily the best and most efficient solution available.
Some intermediate-level concepts include:
To test yourself on these concepts, try the following questions:
Write a query to get the number of customers that were upsold by purchasing additional products.
Note: If the customer purchased two things on the same day, that doesn’t count as an upsell since they were purchased within a similar timeframe.
Input:
transactions
table
Column | Type |
---|---|
id | INTEGER |
user_id | INTEGER |
created_at | DATETIME |
product_id | INTEGER |
quantity | INTEGER |
Output:
Column | Type |
---|---|
num_of_upsold_customers | INTEGER |
UNION
and How Is It Different From UNION ALL
?At its core, UNION
is simply UNION ALL
with more parameters. For example, let’s say you’re given a dataset with two tables with employee information:
id
table
Column | Type |
---|---|
name | VARCHAR |
id | INTEGER |
salary
table
Column | Type |
---|---|
name | VARCHAR |
salary | INTEGER |
To merge the two tables and remove duplicate records, you could use UNION.
Example:
SELECT name, id
from demo
UNION
SELECT name, id
from demo
The result will create a table that merges duplicate records (i.e., ‘name’) and generates a table with only one name column. If you use UNION ALL
, the resulting table would have duplicate records, so ‘name’ would occur twice.
Ultimately, UNION ALL
is faster than UNION but concatenates all the values, while UNION
can detect relations and duplicates.
A PIVOT operation allows you to rotate data from a row-centric to a column-centric model. This transformation often involves data aggregation, allowing for more compact and insightful data presentation.
For a practical pivot operation, you might use a structure like:
SELECT column1, column2
FROM table
PIVOT
(
AggregateFunction(column_to_aggregate)
FOR column_to_be_pivoted IN (list_of_values)
) AS AliasName
The family of SQL constraints is extensive, each serving a specific purpose.
Handling missing values in SQL is always a challenge. Improper handling can result in accidental biases that shift business decisions in the wrong direction.
Before mastering the coding methods to handle missing data, it’s important to understand the following concepts.
To find NULL
values in your tables, you can use the following code:
SELECT column1
FROM table1
WHERE column1 IS NULL;
Use NOT NULL
to utilize reverse logic in handling null values:
SELECT column1
FROM table1
WHERE column1 IS NOT NULL;
Making imputation models for handling NULL
values is notoriously tricky. If you can create custom models for managing NULL values, then you’ve probably reached an advanced level of SQL understanding.
It takes more than constant theoretical learning and practical application to attain SQL mastery. You also need to be familiar with specific nuances of SQL, which will allow you to create queries that are efficient, easy to read, and, most importantly, work without compromising data integrity.
Below is a list of concepts that those who’ve reached advanced levels of SQL proficiency should know:
To test your level of understanding of these topics, try the following questions:
The schema below is for a retail online shopping company consisting of two tables, attribution
and user_sessions
.
conversion
is true
, then the user converted to buying on that session.channel
column represents which advertising platform the user was attributed to for that specific session.user_sessions
table maps many-to-one session visits back to one user.First touch attribution is defined as the channel with which the converted user was associated when they first discovered the website.
Calculate the first touch attribution for each user_id
that is converted.
Example:
Input:
attribution
table
Column | Type |
---|---|
session_id | INTEGER |
channel | VARCHAR |
conversion | BOOLEAN |
user_sessions
table
column | type |
---|---|
session_id | INTEGER |
created_at | DATETIME |
user_id | INTEGER |
Example output:
user_id | channel |
---|---|
123 | |
145 | |
153 | |
172 | organic |
173 |
Given their similar names, there’s a tendency to use OLAP and OLTP interchangeably, but they are systems that can be identified as infrastructures with opposite philosophies. OLAP and OLTP are not mutually exclusive but rather work together to create an efficient database architecture.
OLAP stands for OnLine Analytical Processing, while OLTP represents OnLine Transaction Processing.
OLTP involves processing transactions and recording each detail, with built-in data guards to ensure data integrity. Because OLTP works in a fast-paced system, it needs to process and store information quickly, so heavy processing is discouraged.
An example of a real-life application of OLTP is online shopping, where a database records the user ID, order information, and other relevant data. OLTP quickly generates accurate records and avoids colliding data (i.e., time-sensitive information like in-store item inventory).
On the other hand, OLAP utilizes the information from the OLTP system to generate insights for business decisions. While OLTP is fast, OLAP needs more time to create accurate insights from the data warehouse with its analytics-oriented approach.
Let’s examine each of the conditions first and see how they could be triggered. Given two date ranges, what determines if the subscriptions would overlap?
Let’s set an example with two date ranges: A and B.
Let ConditionA>B demonstrate that DateRange A is completely after DateRange B.
|—- DateRange A ——| |—Date Range B —–| _
ConditionA>B is true if StartA > EndB.
Let ConditionB>A demonstrate that DateRange B is completely after DateRange A.
|—- DateRange A —–| _ _ |—Date Range B —-|
Condition B>A is true if EndA < StartB.
Overlap then exists if neither condition is held. In that, if one range is neither completely after the other nor completely before the other, then they must overlap.
If you’re a data scientist in charge of improving recommendations at a company and you develop an algorithm, how do you know if it performs better than the existing one?
One metric to measure performance is called precision (also called positive predictive value), which has applications in machine learning as well as information retrieval. It is defined as the fraction of relevant instances among the retrieved instances.
In a fast-paced environment where insights can have rippling side effects on business decisions, it’s tempting to time yourself on how fast you can generate queries. While this can be a benchmark for skill, most often, it’s the accuracy that matters, not the speed.
Generally speaking, simple queries (like joining basic tables or pivoting a table) can be generated in around five minutes or less, given that the database is quite limited in size.
However, as a data scientist, it’s common to encounter huge 1000-line queries that take considerable effort to generate. The longer your code is, the higher the risk of generating logical errors within your query. As such, it’s more important to prioritize having a logically sound query over speed to a certain extent.
A lot of SQL material is static and generally hard to follow. For those without programming experience, creating commands instead of using a mouse and a pointer can be challenging to conceptualize.
For easy-to-understand SQL material that adapts to your learning style, Interview Query provides SQL courses that are supported by a huge community. If you’re looking for additional interview questions, check out our extensive question bank that covers SQL and other important data science topics.