Data architects are the backbone of every company’s data infrastructure and organization policies. As one of the most sought-after roles within the industry, data architects define how data is stored, retrieved, and integrated within an organization’s databases.
In this article, we’ll cover an overview of the position, the data architect interview process, and technical interview questions for you can practice with.
Data architects are responsible for designing, building, and managing an organization’s data infrastructure and database systems. Their role can be broken down into the following subsections:
Data architects and data engineers are two positions that are often confused and used interchangeably. However, there’s a clear difference to establish between the roles.
Data architects are responsible for visualizing and designing the blueprint of the company’s data vision. They define how data will be stored, retrieved, and used on other projects in the organization.
On the other hand, data engineers build data systems and apply the information specified by the data architect. While data architects have a strong background in database design and modeling, data engineers have expertise in software engineering and app development.
Data engineers also perform the daily tasks of data cleaning in preparation for data analysts or data scientists. They maintain the infrastructure including data pipelines, processing and storage, and transformation.
The interview process for data architects usually consists of five main stages.
Phone Screening: The first part of this process consists of a phone screen that generally takes 30-45 minutes. During this interview, the recruiter will ask you about your qualifications and educational background, as well as previous work experience.
Technical Interview: In this stage, your technical skills will be evaluated. You’ll need to work through a variety of data-related questions and problems, design data models, and/or complete some programming tests in Python, R, or SQL. For some companies, this could also include a whiteboard session.
Panel Interview: This interview is conducted with a group of interviewers, which could include other data architects or positions that work closely with the role. You’ll likely be asked a mix of behavioral and technical questions to assess your approach to designing data models and your problem-solving abilities.
Case Study Interview: Some companies have an additional case study interview to evaluate your ability to work through a real-world situation to either create a new data model or improve a pre-existing architecture.
Final Interview: In the final stage, you’ll meet with the senior team and leadership to discuss the long-term goals of the company and whether your vision aligns with their values.
Every organization has a unique interview process. Some companies have an interview process with fewer stages, while others will include more assessments to gauge your technical ability.
Behavioral interview questions provide insight into an applicant’s soft skills and how they approach problems in their day-to-day work.
These questions are generally asked at multiple stages of the interview process, so it’s important to be well-prepared early on. In this section, we’ll cover some of the questions you may encounter.
This type of question is meant to assess your initial approach to a problem and how you come up with viable solutions. While there is no singular ideal answer, interviewers will expect you to be able to:
When several teams are working on a large data infrastructure, you may come across conflicting requirements. The interviewers want to see how you would handle this situation.
Depending on the team you’re working with, you should communicate the issue with other members, as well as managers who worked on the specification of project requirements, and see which requirements are more important for the end goal of the project.
Your answer should focus on your communication skills and how you’d propose a solution that respects both requirements but resolves the conflict.
To structure your answer, the STAR approach is a great framework to follow by discussing the overall situation, the specific task you were working on, your action plan, and the results of your solution.
A major part of the data architect role is collaborating and communicating with higher management and stakeholder groups. This helps data architects understand the requirements of their data models and solutions, which will deliver data-driven solutions aligned with the client’s needs.
When interviewing a candidate, interviewers will want to know whether you’re able to communicate more complex details about the project to stakeholders and clients without a technical background. Be sure to practice structuring your answers to be understandable to the average layperson.
Given the employees table, get the largest salary of any employee by department.
Example:
Input:
employees
table
Column | Type |
---|---|
id | INTEGER |
department | VARCHAR |
salary | INTEGER |
Output:
Column | Type |
---|---|
department | VARCHAR |
largest_salary | INTEGER |
Solution
To solve this problem, we’ll first need to group employees by their department, creating a separate group for each department.
To accomplish this, we can use the GROUP BY clause in the SELECT statement. This will look something like the following query:
SELECT
department,
salary
FROM employees
GROUP BY department
This query selects the department and salary columns from the employees table and groups the data by the department column. The results will be grouped by department, with each group containing the department and salary data for all employees in that department.
To finish up the problem, we need to find the largest salary within each group. Which function could we use to accomplish this?
We’re given two tables: a users table with demographic information and the neighborhood they live in, and a neighborhoods table.
Write a query that returns all of the neighborhoods that have 0 users.
Example:
Input:
users
table
Columns | Type |
---|---|
id | INTEGER |
name | VARCHAR |
neighborhood_id | INTEGER |
created_at | DATETIME |
neighborhoods
table
Columns | Type |
---|---|
id | INTEGER |
name | VARCHAR |
city_id | INTEGER |
Output:
Columns | Type |
---|---|
name | VARCHAR |
Solution
Whenever the question asks about finding values with 0 something (users, employees, posts, etc.), the first thought to come to your mind should be a LEFT JOIN. While an inner join finds any values that are in both tables, a left join keeps only the values in the left table.
Our predicament is to find all the neighborhoods without users. To do this, we can use a left join from the neighborhoods table to the users table, which generates an output like this:
neighborhoods.name | users.id |
---|---|
castro | 123 |
castro | 124 |
cole valley | null |
castro heights | 534 |
castro heights | 564 |
How can we then find all the neighborhoods without a singular user?
Suppose your company is looking to cut costs due to an economic downturn. During a coffee break, you hear a rumor that a lot of money goes to employees who don’t do their work, and you decide to find out if the rumor is true.
Given two tables, employees and projects, find the sum of the salaries for all the employees who didn’t complete any of their projects.
Example:
Input:
employees
table
id | salary |
---|---|
INTEGER | FLOAT |
projects
table
employee_id | project_id | Start_dt | End_dt |
---|---|---|---|
INTEGER | INTEGER | DATETIME | DATETIME |
Output:
total_slack_salary
INTEGER
Note: consider a project unfinished if it has no end date (its End_dt is NULL).
Given this, we’ll say an employee didn’t finish any of their projects when:
Solution
Your solution should resemble the following query:
WITH slack_salaries AS (
SELECT e.salary
FROM employees e
INNER JOIN projects p
ON e.id = p.employee_id
GROUP BY e.id
HAVING COUNT(p.End_dt) = 0
)
SELECT sum(salary) AS total_slack_salary
FROM slack_salaries
Let’s break this down to understand the individual parts of the query.
We’re looking for employees that:
The first subquery searches for the table that holds all the salaries we must sum up.
How can we then exclude all the employees with no assigned projects?
Given the employees and departments table, write a query to get the Top 3 highest employee salaries by department. If the department contains less than 3 employees, the Top 2 or the Top 1 highest salaries should be listed (assume that each department has at least 1 employee).
Note: The output should include the full name of the employee in one column, the department name, and the salary. The output should be sorted by department name in ascending order and salary in descending order.
Example:
Input:
employees
table
Column | Type |
---|---|
id | INTEGER |
first_name | VARCHAR |
last_name | VARCHAR |
salary | INTEGER |
department_id | INTEGER |
departments
table
Column | Type |
---|---|
id | INTEGER |
name | VARCHAR |
Output:
Column | Type |
---|---|
employee_name | VARCHAR |
department_name | VARCHAR |
salary | INTEGER |
Solution
To solve this problem, we first need to get the Top 3 salaries by department. We can use the RANK() function to get a ranking number for each row that is based on the column specified in the ORDER BY clause.
We need the ranking to be reset back to 1 for each department. This is where the department_id column should be used in a PARTITION BY clause. This can be put into a subquery (employee_ranks), for easier comprehension. Now we have the employees table, with an additional column to show the ranking of salaries.
Next, let’s work on getting the columns properly formatted. To get the department names, we LEFT JOIN departments to employee_ranks. We use a left join to preserve the rows in the left table.
For the rest of our answer, we need to have the employees’ full names and filter for the Top 3 salaries in each department. How can we do this?
Given an integer N, write a function that returns a list of all of the prime numbers up to N.
Note: Return an empty list when there are no prime numbers less than or equal to N.
Example:
Input:
N = 3
Output:
def prime_numbers(N) -> [2,3]
Solution
Your answer should resemble something like the following:
def prime_numbers(N):
primes = []
if N > 1:
for i in range(2,N+1):
is_prime = True
for j in range(2,i):
if i % j == 0:
is_prime = False
break
if is_prime:
primes.append(i)
return primes
prime_numbers(100)
We can represent prime numbers as p, which is an integer greater than 1 and divisible only by 1 and p. Using the mod operator, this means that for all integers n such that 1 < n < p:
p\mod n \neq 0
The for loop checks whether this condition is satisfied for all integers between two and N-1. If we find an integer where this isn’t the case, how can we stop the for loop and move on to checking the next number?
Given a list of strings of letters from a to z, create a function, sum_alphabet, that returns a list of the alphabet sum of each word in the string.
The alphabet sum is the sum of the ordinal position of each of the string’s letters in the standard English alphabet ordering. So, the letter a will have a value of 1, z will have a value of 26, and so on.
As an example of the alphabet sum of a string, the string “sport” will have an alphabet sum of 19 + 16 + 15 + 18 + 20 = 88.
Example:
Input:
words = ["sport" , "good" , "bad"]
Output:
def sum_alphabet(words) -> [88, 41, 7]
Solution
To solve this problem efficiently, we can use Python’s ord function, which takes a letter and returns its ASCII value. This value is a universal standard that encodes each character as a unique positive integer.
In ASCII, “a” is represented as 097, or 01100001 in binary. However, for this problem, we need “a” equal to 1. How can we change our code to modify the index?
Given two nonempty lists of user_ids and tips, write a function most_tips to find the user that tipped the most.
Example:
Input:
user_ids = [103, 105, 105, 107, 106, 103, 102, 108, 107, 103, 102]
tips = [2, 5, 1, 0, 2, 1, 1, 0, 0, 2, 2]
Output:
def most_tips(user_ids,tips) -> 105
Solution
To solve this problem, we need to loop through each user and their tip, sum up the tips for each user, and then find the user that left the biggest tip.
To do this, we can use Python’s collection package that allows us to sort our dictionary with the function most_common().
Given two dates, write a program to find the number of business days that exist between the date range.
Example:
Input:
date1 = 2021-01-31
date2 = 2021-02-18
Output:
def delta_buss_days(date1,date2) -> 14
Solution
This is a straightforward problem that we can solve using the Pandas date_range function, which we can call to find the time difference between two dates.
The date_range function returns the range of equally-spaced time points. We then need to loop through the days of the week to extract the number of days. Since we’re only looking for business days, how does this change your final answer?
We’re given an ascending string of integers that represents page numbers.
Write a function get_last_page to return the last page number in the string. If the string of integers is not in the correct page order, return the last number in order.
Example 1
Input:
input = '12345'
Output:
output = 5
Example 2
Input:
input = '12345678910111213'
Output:
output = 13
Example 3
Input:
input = '1235678'
Output:
output = 3
Solution
This question is a little trickier, as we can have missing values. We can’t just return the length of the string.
Therefore, to solve this problem, we’ll have to loop through each value iteratively to evaluate each page number. What would be the next step?
? Throughout this section, we’ll be referencing our framework for answering database system design questions, which can be found in our Data Engineering course.
Let’s say you’re tasked with designing a data mart or data warehouse for a new online retailer.
How would you design the system?
Note: Sketch a star schema to explain your design.
Solution
Identifying The Business Process
We can identify some parts of the business process by working backward. The question asks us to design a star schema for our warehouse, and we know that star schemas are more efficient for querying than for storing data due to normalized dimensions tables. Therefore, we can already assume that our data warehouse will be mainly built for analytics and reporting purposes.
For this question, we can also assume that the events that we are storing are exclusively sales data.
Declaring The Granularity
Let’s say we want to consider the sale of each distinct product a separate event. However, we’ll allow the warehouse to consider the purchase of multiple items of the same product in one transaction as a single event.
Identifying Dimensions
For each sale, we’ll need to store its relevant dimensions. As these are online sales, we can omit the WHERE dimension, but other good dimensions to consider include WHO, WHAT, WHEN, and HOW.
Since this is an online retailer, the only WHO in a sale is the buyer. For each buyer, we need to choose which attributes to store, including identifying customer information and other variables for analytics.
What are some variables that would be important to store?
Let’s say you’re setting up the analytics tracking for a web app.
How would you create a schema to represent client click data on the web?
Solution
These types of questions are more architecture based and are generally given to test experience within developing databases, setting up architectures, and in this case, representing client-side tracking in the form of clicks.
What exactly does click data on the web mean? Any form of button clicks, scrolls, or action at all as an interaction with the client interface (in this case, desktop) would be somehow represented in a schema for the end user to query. This does not include client views.
How would you start your design schema?
You’re tasked with designing a database system for Swipe Inc., a company that manages payments for software developers.
Swipe manages a set of APIs so developers can abstract payment processes for their web service and transfer control over to Swipe’s APIs to manage payment security and finance handling (contacting credit card providers, etc.)
Your database system should store a collection of API keys with their attributes, as shown in the example below:
{
"API_key":{
"Transaction_list":{
"User_ID":{
"Name":{
"f_name":
"l_name":
}
"Card_Details":{
"Bank":
"Card_No":
"Expiry":
"CV":
}
}
}
"Curr_balance":
}
}
What are the functional and non-functional requirements for this database?
Hint: brush up on the difference between functional and non-functional requirements in our Database System Design course.
Solution
From the problem, we know that Swipe is a fintech company that handles APIs and stores data in JSON. Based on this information, the database system should:
These are more explicit, functional requirements for our database system based on Swipe’s needs. What are some non-functional requirements that may be important as well?
How would you design a classifier to predict the optimal moment for a commercial break during a video?
Solution
Let’s start out by clarifying the question. How exactly would you determine the ‘optimal moment’ for a commercial break?
Since commercials are designed to build brand awareness and generate demand for a specific product, we could say that optimality is determined by how effective the placement of the video ad is. What are some metrics we could use to measure this?
Let’s say you work as a data architect at Facebook.
Facebook is trying to transition their product from just doing likes on posts and comments to adding the multi-reaction under the like button.
How would you approach this problem? What would the database modifications look like?
Solution
Again, our answer should start by clarifying the scope of the problem. For instance, with the multi-reaction option, do we want to support one reaction per post or multiple reactions per post? What about per user? Asking these questions early on will guide the changes we make in our solution.
After this step, we should be looking into implementation details associated with this task. What are some factors we have to consider?
Let’s say you have analytics data stored in a data lake. An analyst tells you they need hourly, daily, and weekly active user data for a dashboard that refreshes every hour.
How would you build this data pipeline?
Solution
Let’s start off by stating our functional requirements. Our solution must:
What are some non-functional requirements you can identify in this problem? Given your answer, what database would you select?
This guide has hopefully provided some helpful information to support you during the data architect interview process. Other details to focus on during your preparation include: