The data scientist role at LinkedIn generally focuses on the business side rather than engineering; it functions more like a product analyst or analytics role than it does at many other companies.
LinkedIn’s data science team leverages billions of data points to empower member engagement, business growth, and monetization efforts. With over 500 million members worldwide and a mix of B2B and B2C programs, the data science team has a huge impact in determining product direction.
If you’re preparing for an interview and searching for commonly asked LinkedIn data scientist interview questions, you’ve come to the right place.
The LinkedIn data science interview process is relatively straightforward. Recruiters at LinkedIn like to dogfood their own product, so they will likely send you a message or InMail through LinkedIn to schedule a 30-minute phone screen, during which they'll ask behavioral interview questions and gauge whether the role is a good fit.
The initial technical screen consists of two separate phone interviews, each lasting 30 to 45 minutes.
One interview is technical, testing concepts in SQL and data processing, while the other runs through a product and business case study. Depending on how your loop is structured, either interview could come first. However, you are not guaranteed both interviews if you do poorly on one of them. Both interviewers will be employees on the LinkedIn data science team, and they leave ample time at the end for you to ask questions.
We've gathered this data by parsing thousands of interview experiences sourced from members.
Interviews at LinkedIn vary by role and team, but the data science questions follow a fairly standardized set of topics.
Let's say LinkedIn chat adds a green dot next to a member's name to show that they're currently active. How would you analyze the effectiveness of this new feature?
Hint: While it may be tempting to correlate increased usage of LinkedIn chat as a sign that the green dot is “effective,” you may need a more solid metric that relates this increased usage to the profit-generating aspects of LinkedIn’s business model.
There is no exact right answer for problems like this. Modeling-based theoretical questions are meant to assess whether you can make realistic assumptions and reason toward a solution under those assumptions. The interview will likely follow whatever path the interviewer explores as you state assumptions and draw conclusions.
We're given a dataset of page views from likely scrapers and real users visiting the site. Because a scraper intends to extract data out of the LinkedIn network, it will almost surely generate a lot of page views, and the duration of those views will likely be short, since a robotic scraper processes information much faster than a human (it just needs to download the fetched page and do some simple processing, say, extract URLs that lead to other pages on LinkedIn).
Link traversal patterns would also differ. We'd expect real users to move through the site via links on the page, whereas a scraper can issue requests directly to arbitrary URLs. A real user also tends to visit fewer pages and spend more time on each visit.
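As a rough sketch, these heuristics could be translated into a query. The page_views table, its column names, and the thresholds below are all assumptions for illustration, not part of the original question:

```sql
-- Hypothetical page_views(viewer_id, url, view_duration_seconds, created_at)
-- Flag viewers with unusually many views and very short average durations
SELECT
    viewer_id,
    COUNT(*) AS total_views,
    AVG(view_duration_seconds) AS avg_duration
FROM page_views
GROUP BY viewer_id
HAVING COUNT(*) > 1000                  -- high volume: likely automated
   AND AVG(view_duration_seconds) < 2   -- near-instant visits
```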
Let's say you're a data scientist at LinkedIn working on a product that sends qualified job candidates to companies. The team has launched a new feature that allows candidates to message hiring managers at companies directly during the interview process to get updates on their status.
Due to engineering constraints, the company can't A/B test the feature before launching it.
Measuring supply and demand can be tricky because both are related. Demand often follows high-quality supply and, in turn, supply follows large demand. Dynamic pricing, often referred to as “surge” pricing, is a strategy where businesses set flexible prices for products or services based on current market demands. This approach is common in industries where demand can spike significantly depending on the time of day or seasonality of the industry, such as transportation (like ride-sharing services), hospitality, tourism, entertainment, and retail.
The core principle of dynamic pricing is a company’s ability to capitalize on consumer willingness to pay different amounts at different times. For example, ride-sharing prices may increase during peak traffic hours or in inclement weather, when demand is high and supply is relatively constant.
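As a simple illustration (this formula is an assumption for exposition, not something the question prescribes), a surge price can be modeled as the base price scaled by the demand-to-supply ratio, floored at 1 so the price never drops below base:

$$\text{price} = \text{base price} \times \max\left(1, \frac{\text{demand}}{\text{supply}}\right)$$

When demand is twice the available supply, riders pay a 2x multiplier; when supply is ample, the price stays at base.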
`user_experiences` table

| Column | Type |
|---|---|
| id | integer |
| user_id | integer |
| title | string |
| company | string |
| start_date | datetime |
| end_date | datetime |
| is_current_role | boolean |
Write a query to prove or disprove the following hypothesis: a data scientist who switches jobs more often gets promoted to a manager role faster than a data scientist who stays at one job longer.
The hypothesis is that data scientists who switch jobs more often get promoted faster.
Therefore, in analyzing this dataset, we can test this hypothesis by segmenting the data scientists by how often they have switched jobs.
For example, if we looked at data scientists who have been in the field for 5 years, the hypothesis would be supported if the share of managers increased with the number of job switches:
- Never switched jobs: 10% are managers
- Switched jobs once: 20% are managers
- Switched jobs twice: 30% are managers
- Switched jobs three times: 40% are managers

We could look at this over different buckets of time to see if the correlation stays consistent after 10 years and 15 years in a data science career.
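A query sketch under a couple of loud assumptions: every row in user_experiences is a data science role, a promotion shows up as a title containing "manager", and each additional row for a user counts as one job switch:

```sql
-- Bucket users by number of job switches and compute the share
-- who ever reached a manager title
WITH job_counts AS (
    SELECT
        user_id,
        COUNT(*) - 1 AS num_switches,
        MAX(CASE WHEN LOWER(title) LIKE '%manager%' THEN 1 ELSE 0 END) AS became_manager
    FROM user_experiences
    GROUP BY user_id
)

SELECT
    num_switches,
    AVG(became_manager) AS pct_managers
FROM job_counts
GROUP BY num_switches
ORDER BY num_switches
```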
`job_postings` table

| Column | Type |
|---|---|
| id | integer |
| job_id | integer |
| user_id | integer |
| date_posted | datetime |
First, let’s visualize what the output would look like.
We want the values of two different metrics: the number of users who have posted each of their jobs exactly once versus the number of users who have posted at least one job multiple times. What does that mean exactly?
If a user has 5 jobs and posted each of them only once, they fall into the first group. But if they have 5 jobs and posted 7 times, they must have posted at least one job multiple times.
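A sketch of that logic: count posts per (user, job) pair, then split users on whether any of their jobs was posted more than once:

```sql
WITH posts_per_job AS (
    -- How many times did each user post each job?
    SELECT
        user_id,
        job_id,
        COUNT(*) AS n_posts
    FROM job_postings
    GROUP BY user_id, job_id
)

SELECT
    SUM(CASE WHEN max_posts = 1 THEN 1 ELSE 0 END) AS posted_once,
    SUM(CASE WHEN max_posts > 1 THEN 1 ELSE 0 END) AS posted_multiple_times
FROM (
    SELECT user_id, MAX(n_posts) AS max_posts
    FROM posts_per_job
    GROUP BY user_id
) AS user_flags
```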
`transactions` table

| Column | Type |
|---|---|
| id | integer |
| user_id | integer |
| created_at | datetime |
| product_id | integer |
| quantity | integer |

`products` table

| Column | Type |
|---|---|
| id | integer |
| name | string |
| price | float |
Whenever a question asks about month-over-month, week-over-week, or year-over-year change, note that it can generally be done in two different ways.
One is to use the LAG window function, which is available in most SQL dialects. Another is to do a sneaky self-join.
For both, we'll first sum the transactions and group by month and year. Grouping by the year is technically redundant here because we're only looking at 2019, but it keeps the aggregation correct if the date range ever widens.
```sql
WITH monthly_transactions AS (
    SELECT
        MONTH(created_at) AS month,
        YEAR(created_at) AS year,
        SUM(price * quantity) AS revenue
    FROM transactions AS t
    INNER JOIN products AS p
        ON t.product_id = p.id
    WHERE YEAR(created_at) = 2019
    GROUP BY 1, 2
)

SELECT
    month,
    revenue,
    -- LAG pulls the previous month's revenue onto the current row
    revenue - LAG(revenue) OVER (ORDER BY month) AS mom_change
FROM monthly_transactions
ORDER BY month
```
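The sneaky-join alternative, for dialects without window functions, joins each month in the same CTE to the month before it:

```sql
-- Self-join version: pair each month with the previous month
SELECT
    cur.month,
    cur.revenue,
    cur.revenue - prev.revenue AS mom_change
FROM monthly_transactions AS cur
LEFT JOIN monthly_transactions AS prev
    ON cur.month = prev.month + 1
ORDER BY cur.month
```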
Here we need to flag a user who has either (a query sketch follows the table below):

- posted a job that is more than 180 days old, or
- posted a job that has the same job_id as a previous job posting that is more than 180 days old.
`job_postings` table

| Column | Type |
|---|---|
| id | integer |
| job_id | integer |
| user_id | integer |
| date_posted | datetime |
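Since the full prompt isn't reproduced here, the sketch below covers only the second condition: a self-join that finds postings sharing a job_id with an earlier posting made more than 180 days before (MySQL-style date arithmetic assumed):

```sql
-- Postings whose job_id matches an earlier posting
-- made more than 180 days before
SELECT DISTINCT
    later.user_id,
    later.job_id
FROM job_postings AS earlier
INNER JOIN job_postings AS later
    ON earlier.job_id = later.job_id
    AND later.date_posted > DATE_ADD(earlier.date_posted, INTERVAL 180 DAY)
```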
The conversion_date column is NULL if the user hasn't purchased.

`notification_deliveries` table

| Column | Type |
|---|---|
| notification | varchar |
| user_id | int |
| created_at | datetime |
`users` table

| Column | Type |
|---|---|
| id | int |
| created_at | datetime |
| conversion_date | datetime |
Write a query to get the distribution of total push notifications before a user converts.
If we're looking for the distribution of total push notifications before a user converts, our result should look something like this:
| total_pushes | frequency |
|---|---|
| 0 | 100 |
| 1 | 250 |
| 2 | 300 |
| … | … |
To get there, we have to follow a few logical conditions for the JOIN between users and notification_deliveries:

- Join the two tables on the user_id field.
- Exclude all users that have not converted.
- To count only notifications sent before the purchase, require the conversion_date value to be greater than the created_at value in the deliveries table.
- Use a LEFT JOIN so that users who converted off of zero push notifications still appear.
We can get the count per user and then group by that count to get the overall distribution.
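Putting those conditions together, a sketch using the column names above:

```sql
SELECT
    total_pushes,
    COUNT(*) AS frequency
FROM (
    -- Count pushes delivered to each converted user before conversion
    SELECT
        u.id,
        COUNT(n.user_id) AS total_pushes
    FROM users AS u
    LEFT JOIN notification_deliveries AS n
        ON u.id = n.user_id
        AND n.created_at < u.conversion_date
    WHERE u.conversion_date IS NOT NULL
    GROUP BY u.id
) AS pushes_per_user
GROUP BY total_pushes
ORDER BY total_pushes
```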
Hint: Let's say the deck has 100 cards and you select 3 cards without replacement. Does the answer change? Treat this as a sample-space problem and ignore the distracting details. If you draw three different numbered cards without replacement, they are all unique, so there is effectively a lowest, a middle, and a highest card.
Let's make it easy and assume we drew the numbers 1, 2, and 3. In our scenario, drawing them in the order (1, 2, 3) is the winning outcome. But what's the full range of orders we could draw them in?
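Completing that count: three distinct cards can come out in $3! = 6$ equally likely orders, and exactly one of those orders is increasing, so

$$P(\text{drawn in increasing order}) = \frac{1}{3!} = \frac{1}{6}$$

The deck size never enters the calculation, which is why the answer is the same for a deck of any size.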
Using this information, how would you build a job recommendation feed?
For this problem, we have to understand what our dataset consists of before being able to build a model for recommendations. More importantly, we need to understand what a recommendation feed might look like for the user.
For example, we’re expecting that the user could go to a tab or open up a mobile app and then view a list of recommended jobs sorted by highest recommended at the top.
We can use either an unsupervised or a supervised model. For an unsupervised approach, we could run nearest neighbors or collaborative filtering on features from users and jobs. But if we want more accuracy, we would likely go with a supervised classification algorithm.
If we use a supervised model, we need to frame our training dataset as features and a binary output label (whether the user applied to the job or not).
The expected result is that for each user, we will have user feature data in the form of their profiles and user activity data extracted from questions they have answered. Additionally, we'll have all of the jobs that the user applied to. What we're missing is data on the jobs that the user did not apply to, which we need as negative examples.
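One way to fill that gap is negative sampling: pair users with jobs they never applied to and label those pairs 0. A rough sketch, where all table names (users, job_postings, and especially applications) are assumptions standing in for wherever these events actually live:

```sql
-- Hypothetical negative sampling: jobs the user never applied to
SELECT
    u.id AS user_id,
    j.job_id,
    0 AS label                      -- negative example
FROM users AS u
CROSS JOIN job_postings AS j
WHERE NOT EXISTS (
    SELECT 1
    FROM applications AS a          -- hypothetical apply-events table
    WHERE a.user_id = u.id
      AND a.job_id = j.job_id
)
ORDER BY RAND()                     -- sample a manageable subset
LIMIT 10000
```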