Based on 2019 statistics, Google processes over 40,000 search queries every second on average, which translates to over 3.5 billion searches per day and 1.2 trillion searches per year worldwide. To Google, this presents endless opportunities to help its customers grow and scale; to data scientists, it presents a treasure trove of information for analysis and interpretation.
Google data scientist interview questions focus primarily on statistics, algorithms, machine learning, and probability. Google’s interview process is designed to assess your ability to perform analysis on large datasets and generate data-driven insights.
Data scientists at Google work across a wide facet of teams, products, and features, from enhancing advertising efficacy to network infrastructure optimization.
The Google data science role is primarily an analytics role focused on metrics and experimentation. This is distinctly different from the machine learning and product analyst roles at Google, which focus more on the engineering and product sides, respectively. The data science role at Google was previously titled quantitative analyst; it was renamed to attract more talent.
Google typically hires individuals with two to three years of industry experience in analytics or related fields. Google does have programs for internships and university graduates in data science, and specifically has more advanced roles for new PhD graduates.
Other relevant qualifications include:
Our data shows that Google tests heavily on Statistics & A/B Testing and Machine Learning during their data scientist interview process.
The data scientist interview process at Google is standardized and similar to that of many other tech companies. The process includes:
The initial screen is a 30-minute phone interview with a recruiter. During this call, you’ll discuss the job and what it’s like to work at Google. The recruiter will also want to learn about your skills, professional experiences, career goals, and, most importantly, if you’re the right fit for Google’s culture.
Google’s data scientist technical screen is video-based (Google Hangouts). You’ll meet with a data scientist and focus on experimental design, statistics, and a probabilistic coding question.
Technical screens also include discussions about your past research and work experience. Be prepared for questions about business issues you’ve faced and your approach to solving them.
The onsite interview for Google data scientist positions includes five one-on-one rounds with a data scientist. These rounds cover computational statistics, probability, product interpretation, metrics and experimentation, modeling, and behavioral questions.
Each interview lasts for approximately 45 minutes, and there’s a lunch break in between.
You should plan to brush up on any technical skills and try as many practice interview questions and mock interviews as possible. A few tips for acing your Google interview include:
Know Your Google Products: Google questions are standardized and rely heavily on situational scenarios with their products. Study Google’s large breadth of products and understand how you would personally improve or test them.
Be Data Driven: Google’s data science interviews assess how well you can provide business-driving insights with data science. Brush up on your knowledge of statistics and probability, given these questions can be some of the hardest to solve.
Embody the Spirit: Google at its core has a collaborative, employee-focused culture that values innovation. Practice responding to behavioral questions with answers that touch on Google’s core values.
Google data science interview questions include both behavioral and technical problems, covering a range of topics. Most Google behavioral questions, for instance, are centered around how well the candidate fits in with Google’s work culture. On the other hand, Google’s technical data science questions span multiple areas, including statistics, machine learning, coding and product sense.
Onsite interviews for Google data science positions are demanding. The panel interview typically consists of five 45-minute interviews with various teams, and you’ll be assessed on your data science knowledge in a range of areas. The most common Google data science interview question topics include:
Behavioral interview questions usually occur during the recruiter screen and throughout the onsite Google interview. These types of questions are designed to assess your ability to think on your feet, whether you are the right culture fit for Google, and your ability to communicate ideas.
In particular, your answers should touch on what Google looks for in data science candidates:
Provide concrete examples of what interests you in a job at Google. You might talk about your love for Google’s data science culture or how the company encourages employees to continuously learn and expand their skills.
Hint: Tell the interviewer why the project was successful. Provide any metrics and positive change you were able to bring about.
Again, researching Google’s products before the interview is an absolute must. Be sure you can talk confidently about the majority of the company’s product offerings, but also have two to three products that you know in depth.
Hint: Questions like these can be intimidating. Don’t be afraid to be honest. But also explain how you apply what you have learned as you approach a new project.
The types of machine learning questions asked in Google data science interviews range from basic definition-based questions about regression models or feature selection, to advanced algorithm questions.
Looking for machine learning resources? Check out Interview Query’s Modeling & Machine Learning and Machine Learning Systems Design courses.
Hint: Does this depend on whether the problem is asking about a regression or a classification model?
Say it’s a regression model. One way we could tackle this problem would be to cluster features based on the response variable by working backwards.
Hint: We can begin to think of the solution in the form of a prefix table (a trie): the prefix, i.e., the input string typed so far, maps to candidate output strings. For an MVP, we could take an input string and output a suggestion string, then layer on fuzzy matching and context matching.
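To make the prefix-table idea concrete, here is a minimal trie sketch in Python; the class and method names are illustrative, not part of the question, and fuzzy/context matching would layer on top of this:

```python
class Trie:
    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, word):
        # Walk character by character, creating nodes as needed.
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def suggest(self, prefix):
        # Walk down to the prefix node, then collect every completion.
        node = self
        for ch in prefix:
            if ch not in node.children:
                return []
            node = node.children[ch]
        results = []

        def collect(n, path):
            if n.is_word:
                results.append(prefix + path)
            for ch, child in sorted(n.children.items()):
                collect(child, path + ch)

        collect(node, '')
        return results

t = Trie()
for w in ['goggle', 'google', 'golang']:
    t.insert(w)
t.suggest('go')  # ['goggle', 'golang', 'google']
```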
Example:

```python
A = 'abcde'
B = 'cdeab'
can_shift(A, B) == True

A = 'abc'
B = 'acb'
can_shift(A, B) == False
```
Hint: This problem is relatively simple if we figure out the underlying algorithm that allows us to easily check for string shifts between strings A and B.
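One clean way to check for a shift: B is a rotation of A exactly when the two strings have equal length and B appears as a substring of A concatenated with itself. A minimal sketch:

```python
def can_shift(a: str, b: str) -> bool:
    # b is a shifted version of a iff the lengths match and b occurs
    # inside a + a, which contains every rotation of a as a substring.
    return len(a) == len(b) and b in a + a

can_shift('abcde', 'cdeab')  # True
can_shift('abc', 'acb')      # False
```

The length check matters: without it, any substring of A + A would pass.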
Statistics and probability are a core focus in Google onsite panel interviews. To best prepare, make sure you have a strong grasp of statistical concepts and know how to solve statistical coding questions in Python.
Hint: In order to decrease our margin of error, we’ll probably have to increase our sample size. But by how much?
80% of raters are careful and they rate an ad as good (60% chance) or bad (40% chance).
20% of raters are lazy and they rate every ad as good (100% chance).
1. Suppose we have 100 raters each rating one ad independently. What’s the expected number of good ads?
2. Now suppose we have 1 rater rating 100 ads. What’s the expected number of good ads?
3. Suppose we have 1 ad, rated as bad. What’s the probability the rater was lazy?
Hint: Keep in mind that in order for the rater to rate an ad, the rater must first be selected. So the event that the rater is selected happens first, then the rating happens. How would you represent this fact arithmetically using basic properties of probability?
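Under the stated setup, the three parts can be worked through directly with the law of total probability and Bayes’ rule. A short Python walkthrough (the variable names are ours; the numbers come from the question):

```python
# Probabilities from the question
p_careful, p_lazy = 0.8, 0.2
p_good_given_careful, p_good_given_lazy = 0.6, 1.0

# 1. 100 raters, one ad each: each rating is "good" with probability
#    P(good) = P(careful)P(good|careful) + P(lazy)P(good|lazy) = 0.68
p_good = p_careful * p_good_given_careful + p_lazy * p_good_given_lazy
expected_100_raters = 100 * p_good  # about 68 good ads

# 2. 1 rater, 100 ads: condition on which rater type was selected
expected_one_rater = (p_careful * 100 * p_good_given_careful
                      + p_lazy * 100 * p_good_given_lazy)  # also about 68

# 3. P(lazy | bad): lazy raters never rate "bad", so Bayes' rule gives 0
p_bad_given_lazy = 1.0 - p_good_given_lazy  # 0.0
p_bad = p_careful * (1 - p_good_given_careful) + p_lazy * p_bad_given_lazy
p_lazy_given_bad = p_bad_given_lazy * p_lazy / p_bad  # 0.0
```

Note that parts 1 and 2 give the same expectation by linearity, even though the variance of the two setups differs.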
There are several assumptions of linear regression, concerning both the dataset and how the model is built. If these assumptions are violated, we fall victim to the phrase “garbage in, garbage out.”
Hint: Think about things that generally have a normal distribution. Are there other things that we might want to measure that might not be similar to those things? Normal distributions generally measure things like size, mass, content, but what about measures like time, random-number generators, or likelihood?
Hint: What sort of probability distribution should we use to model experiments with only two outcomes?
Now let’s say we take the output from the random integer function and place it into another random function as the max value with the same min value N.
What would the distribution of the samples look like?
What would be the expected value?
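The question leaves the exact bounds open, but a quick simulation illustrates the shape of the answer. Assuming the first draw X is uniform on the integers N..M and the second draw Y is uniform on N..X (our reading of the setup), small values are over-represented and, by the tower rule, E[Y] = (N + E[X]) / 2:

```python
import random

# Assumed setup: X ~ Uniform{N..M}, then Y ~ Uniform{N..X}.
N, M, trials = 1, 10, 200_000
random.seed(0)
samples = []
for _ in range(trials):
    x = random.randint(N, M)
    samples.append(random.randint(N, x))

mean = sum(samples) / trials
# The distribution skews toward N; with N=1, M=10,
# E[Y] = (N + E[X]) / 2 = (1 + 5.5) / 2 = 3.25.
```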
Google onsite interviews typically include business and product case study questions. To prepare for case interviews, practice product or business metrics questions, and be prepared to propose solutions, analyze the success of a feature, and measure results.
What data points and metrics would you look at to decide if this is true or not?
Hint: With questions like these, try to rephrase it as a hypothesis. What hypothesis could you draw from the information provided?
Hint: The first step in product case questions is to clarify the question. With this example, you would want some clarity on the type of drop (e.g., time on page, storage, etc.), as well as the timeframe for the usage drop.
How would you differentiate between scrapers and real people?
Hint: Modeling-based theoretical questions are meant to assess whether you can make realistic assumptions and come up with a solution under these assumptions.
Hint: Always ask for clarity. With a question like this, we’d need more information to answer.
How would you assess the validity of the result?
Hint: What is the interviewer leaving out, and how might we rephrase the question for clarity? We could likely rephrase it as: how do you set up and measure an A/B test correctly?
If we break down this question, we’ll find that another way to phrase it is to ask for the probability that at least two of the variables are larger than 3. If we look at the combinations of events that satisfy the condition, they can be divided into two mutually exclusive events.
Since these two mutually exclusive events both satisfy the condition median > 3, the question can be rephrased as P(Median > 3) = P(A) + P(B).
Let’s calculate the probability of event A, in which all three variables are greater than 3. The probability that a single random variable is greater than 3 (but less than 4) is 1/4. So the probability of event A is:

P(A) = (1/4) * (1/4) * (1/4) = 1/64
Event B is that exactly two values are greater than 3 and one value is smaller than 3. We can calculate this the same way as we calculated P(A). The probability of a value being greater than 3 is 1/4, and the probability of a value being less than 3 is 3/4. Since any of the three variables can be the small one, we multiply by 3:

P(B) = 3 * ((3/4) * (1/4) * (1/4)) = 9/64
Therefore, the total probability is P(A) + P(B) = 1/64 + 9/64 = 10/64 = 5/32.
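A quick Monte Carlo check of the result, assuming each variable is Uniform(0, 4), which is consistent with P(X > 3) = 1/4 used above:

```python
import random

# Simulate three Uniform(0, 4) draws and check how often the median > 3.
random.seed(42)
trials = 200_000
hits = 0
for _ in range(trials):
    a, b, c = (random.uniform(0, 4) for _ in range(3))
    if sorted((a, b, c))[1] > 3:  # middle value = median of three
        hits += 1

estimate = hits / trials  # should land close to 10/64 = 0.15625
```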
Check out the Interview Query Statistics course for more practice with statistical concepts and coding.
At Google, data scientists work with vast datasets and are tasked with using code to generate insights and solutions. Typically, statistical coding (with a tool like Python), SQL queries, and algorithmic coding are all covered in Google interviews for data science positions.
Hint: This is a relatively simple problem: we set up our distribution, generate N samples from it, and plot them. In this question, we make use of the SciPy library, which is built for scientific computing.
Tip: Follow the link to find the relevant data on Interview Query for this question.
Your task is to determine the minimum number of time steps required to get from the northwest corner to the southeast corner of the building.
Note: If the path doesn’t lead you to the exit, return -1.
The input is given as a 2D array of direction characters 'N', 'E', 'S', 'W', named `building`. Example:
```
E | E | S | W
N | W | S | N
S | E | E | S
E | N | W | W
```

Expected Output: 6
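One plausible reading of the problem (consistent with the example) is that each cell’s letter is the only direction you may move from that cell, so the walk is deterministic. A sketch under that assumption:

```python
def min_steps(building):
    # Follow the direction written in each cell, counting steps until we
    # reach the southeast corner. A revisited cell means an infinite loop,
    # and stepping off the grid means no exit; both return -1.
    moves = {'N': (-1, 0), 'S': (1, 0), 'E': (0, 1), 'W': (0, -1)}
    rows, cols = len(building), len(building[0])
    r = c = steps = 0
    seen = set()
    while (r, c) != (rows - 1, cols - 1):
        if (r, c) in seen:
            return -1  # cycle: we will never exit
        seen.add((r, c))
        dr, dc = moves[building[r][c]]
        r, c = r + dr, c + dc
        if not (0 <= r < rows and 0 <= c < cols):
            return -1  # walked off the grid
        steps += 1
    return steps

grid = [list('EESW'), list('NWSN'), list('SEES'), list('ENWW')]
min_steps(grid)  # 6
```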
Input:

```python
percentile_threshold = 0.75
n = 6
truncated_dist(n, percentile_threshold)
```

Output:

```python
# with mean of 2 and std deviation of 1
output = [2, 1.1, 2.2, 3, 1.5, 1.3]
```
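The prompt appears to ask for n samples from a normal distribution (mean 2 and standard deviation 1 in the example) truncated at a given percentile. A rejection-sampling sketch using SciPy; the default parameters and seed are our assumptions:

```python
import numpy as np
from scipy import stats

def truncated_dist(n, percentile_threshold, mean=2, std=1, seed=0):
    # Rejection sampling: draw from N(mean, std) and keep only values
    # at or below the requested percentile cutoff.
    rng = np.random.default_rng(seed)
    cutoff = stats.norm.ppf(percentile_threshold, loc=mean, scale=std)
    samples = []
    while len(samples) < n:
        x = rng.normal(mean, std)
        if x <= cutoff:
            samples.append(float(x))
    return samples

samples = truncated_dist(6, 0.75)  # six draws, all below the 75th-percentile cutoff (~2.67)
```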
Given the dataset, write code in Pandas to return the cumulative percentage of students that received scores within the buckets of <50, <75, <90, <100.
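The dataset itself isn’t shown here, so the sketch below assumes one row per student with a numeric `grade` column (the schema and values are hypothetical):

```python
import pandas as pd

# Hypothetical data: one row per student with a test score.
df = pd.DataFrame({'user_id': range(1, 9),
                   'grade': [45, 60, 72, 80, 88, 91, 95, 99]})

def cumulative_pct(df):
    # Cumulative share of students scoring under each cutoff.
    return pd.Series({f'<{cut}': 100 * (df['grade'] < cut).mean()
                      for cut in (50, 75, 90, 100)})

result = cumulative_pct(df)
# <50: 12.5, <75: 37.5, <90: 62.5, <100: 100.0
```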
The attribution table logs a session visit for each row.
If conversion is true, then the user converted to buying on that session.
The channel column represents which advertising platform the user was attributed to for that specific session.
Lastly, the user_sessions table maps many session visits back to one user.
First touch attribution is defined as the channel the converted user was associated with when they first discovered the website.
Calculate the first touch attribution for each user_id that converted.
`attribution` table:

| column | type |
|---|---|
| session_id | integer |
| channel | varchar |
| conversion | boolean |

`user_sessions` table:

| column | type |
|---|---|
| session_id | integer |
| created_at | datetime |
| user_id | integer |
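In an interview this would typically be answered in SQL, but the join-then-take-earliest logic can be sketched in pandas; the toy rows below are our own, following the column descriptions in the question:

```python
import pandas as pd

# Toy data matching the described schemas (values are assumptions).
attribution = pd.DataFrame({
    'session_id': [1, 2, 3, 4],
    'channel': ['organic', 'paid', 'email', 'paid'],
    'conversion': [False, True, False, True]})
user_sessions = pd.DataFrame({
    'session_id': [1, 2, 3, 4],
    'created_at': pd.to_datetime(['2020-01-01', '2020-01-05',
                                  '2020-01-02', '2020-01-03']),
    'user_id': [100, 100, 200, 200]})

# Join sessions to users, find users with any converting session,
# then take the channel of each converted user's earliest session.
sessions = attribution.merge(user_sessions, on='session_id')
converted = sessions.loc[sessions['conversion'], 'user_id'].unique()
first_touch = (sessions[sessions['user_id'].isin(converted)]
               .sort_values('created_at')
               .groupby('user_id', as_index=False)
               .first()[['user_id', 'channel']])
# user 100 -> 'organic', user 200 -> 'email'
```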
See more Google data scientist questions from Interview Query: