Fraud analytics is the practice of leveraging data science techniques and analysis to assist in detecting potential fraud, either before a transaction is completed or after it has occurred.
Fraud detection systems combine data analytics, statistics, machine learning and artificial intelligence to identify fraud risk factors, predict fraudulent transactions, and schedule and send fraud alerts in real time. These systems can save companies hundreds of millions of dollars each year.
As an example, the insurance provider Highmark Inc. secured an estimated $245 million in savings in 2021 thanks to its fraud analytics and detection system.
Traditionally, simple rule-based systems were used to identify fraud, particularly in the banking, finance and insurance industries. Since those early days, data science has revolutionized fraud analytics, increasing the precision and speed of companies' fraud identification and response. Fraud detection is now used in nearly every industry, from government agencies to retail and media companies.
Some of the most common use cases for fraud analytics include detecting:
When a customer is trying to process a transaction, data can help to determine whether the transaction is legitimate. Relevant data points include device type, location or country of origin, account verification, and whether the requested transaction fits the customer's normal behavior patterns.
This authentication data falls into four categories:
Fraud analytics systems leverage these data types using data science, automated rules, and/or machine learning to determine if a transaction is legitimate. Traditionally, the three most common types of fraud detection systems are rule-based, supervised learning, and unsupervised learning models.
1. Rule-based fraud detection - Rule-based systems have been used for more than twenty years, and are still widely deployed. These systems flag potential fraud when account activity breaks a predefined rule. For example, if the majority of a user's transactions occur in San Francisco, but are suddenly being processed in Germany, a rule-based system would be triggered.
2. Supervised classifications - This type of fraud detection system leverages algorithmic learning. An algorithm is trained to detect fraud based on company-wide historical data and previous instances of fraud. Supervised learning models have greatly improved the precision of fraud detection systems.
3. Unsupervised classifications - Unsupervised models are traditionally used to uncover fraudulent tactics by clustering unlabeled data. Hidden relationships in the data can be detected, and as a result, new or emerging fraud tactics can be discovered.
Rule-based systems offer distinct advantages, as they are easier to deploy, well documented and cost-effective. However, the rule sets for these systems have grown extremely complex and, at the same time, often fail to adapt to hidden threats. As a final handicap, they tend to produce a high number of false positives and/or missed fraudulent transactions, because they cannot look beyond the binary triggers of the rules in place.
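To make the binary-trigger limitation concrete, here is a minimal sketch of the location rule described above. The Transaction class, its fields and the "usual country" logic are illustrative assumptions, not taken from any particular fraud detection product.

```python
from dataclasses import dataclass


@dataclass
class Transaction:
    # Hypothetical record layout, used only for this illustration.
    user_id: str
    country: str
    amount: float


def usual_country(history):
    """Return the country that appears most often in past transactions."""
    counts = {}
    for tx in history:
        counts[tx.country] = counts.get(tx.country, 0) + 1
    return max(counts, key=counts.get) if counts else None


def location_rule(history, new_tx):
    """Binary rule: flag a transaction made outside the user's usual country."""
    home = usual_country(history)
    return home is not None and new_tx.country != home


history = [
    Transaction("u1", "US", 20.0),
    Transaction("u1", "US", 35.5),
    Transaction("u1", "US", 12.0),
]

print(location_rule(history, Transaction("u1", "DE", 40.0)))  # True
print(location_rule(history, Transaction("u1", "US", 40.0)))  # False
```

A rule like this would flag a legitimate customer who is traveling (a false positive) and would miss a fraudulent charge made from the customer's home country (a missed fraud), which is the weakness described above.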
Machine learning models (both supervised and unsupervised) provide the ability to analyze data at scale, and as a result, they have increased precision. Unsupervised models in particular can adapt to and identify changing threats on an ongoing basis, finding associations that human monitors would not connect.
Fraud analysts are data analysts who specialize in identifying suspicious activity in customer transactions. Typically, fraud analysts are responsible for investigating theft and fraud; performing risk management; developing, deploying and maintaining fraud detection systems; and performing fraud market research.
A key responsibility for a fraud analyst is maintaining fraud analytics models and databases, as any downtime can cause security issues for businesses and customers.
Fraud analysts are employed in a wide range of industries. Certain fields tend to employ more of these analysts, specifically finance and banking, insurance, government and retail. Many job roles and titles fall under the umbrella of fraud analyst, including:
Traditionally, fraud analysts have extensive experience in data science, statistics, and machine learning, though some analysts do start in finance and/or accounting. The majority of fraud analytics roles require a bachelor’s degree, as well as 1-4 years of data analytics and/or fraud experience.
For risk analyst roles, fraud analytics case studies are used to test a candidate’s knowledge of fraud detection. Fraud case study questions include specific information about the case, and the candidate must then use the provided information to propose a solution to the problem.
Here is an example of a fraud analytics case study question:
When answering a question like this, you should start with clarifying questions to the interviewer like:
Next, make assumptions about the case study. We can assume that low recall in a fraud scenario would be a disaster. With low recall, the model would produce many false negatives: fraudulent purchases would go under the radar, with consumers not even knowing they were being defrauded. This could cost the bank thousands of dollars in lost revenue, given that it would have to refund the cost to the consumer, plus the potential reputational risk.
Meanwhile, if precision were low, customers would think their accounts were being defrauded all the time. Valid transactions would keep being flagged as fraudulent, and customers would keep receiving text messages until they switched to another bank. That is an annoying situation when a customer knows that the transaction is valid.
Since the question prompts for a text messaging service, it might make sense then to optimize for recall in order to minimize risk and avoid costly fraudulent charges.
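The trade-off between the two metrics can be illustrated with a short calculation. The counts below are hypothetical and chosen only to contrast a cautious model with an aggressive one; they do not come from a real dataset.

```python
def precision_recall(tp, fp, fn):
    """Compute precision and recall from confusion-matrix counts.

    tp: fraudulent transactions correctly flagged
    fp: valid transactions incorrectly flagged
    fn: fraudulent transactions that were missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical scenario: 100 fraudulent transactions in the data.
# A cautious model flags few transactions; an aggressive one flags many.
cautious = precision_recall(tp=50, fp=5, fn=50)
aggressive = precision_recall(tp=95, fp=400, fn=5)

for name, (p, r) in [("cautious", cautious), ("aggressive", aggressive)]:
    print(f"{name}: precision={p:.2f}, recall={r:.2f}")
```

In this sketch the aggressive model catches 95 of the 100 fraudulent transactions but sends many more alerts for valid ones, which is the cost of optimizing for recall.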
Finally, ask yourself this question: What model works well on an imbalanced dataset? Generally, tree-based models come to mind.
Here are more fraud case studies you can use to prepare for fraud and risk analyst interviews. These questions include credit card fraud modeling, platform abuse case studies and anomaly detection cases.
At first glance we would have to do some analysis on the dataset to get a clearer picture:
Hint: The most important step after reviewing these credit card transactions is deciding how to engineer the response variable, that is, how to label which data points are fraudulent transactions. Once we can label fraud with high confidence, we can build a model and extract features.
This type of question gets asked early in interviews to determine your confidence with anomaly detection, which is widely used in fraud detection and risk mitigation.
To answer this question, first, provide a definition of a univariate dataset. Univariate means one variable. For example, travel time in hours from your city to 10 other cities is given in an example list below:
12, 27, 11, 41, 35, 22, 18, 43, 26, 10
This kind of single-column dataset is called a univariate dataset. Anomaly detection is a way to discover unexpected values in datasets. An anomaly is a data point that differs markedly from the rest of the data. For example, in the dataset below, one data point is obviously far higher than the others:
12, 27, 11, 41, 35, 22, 76767676, 18, 43, 26, 10
With that visual, how would you design a function to flag that value?
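One possible answer, sketched here under stated assumptions, uses the modified z-score based on the median absolute deviation (MAD). This is only one of several reasonable approaches. The function name and the 3.5 threshold (a commonly cited cutoff for this score) are choices made for this illustration.

```python
from statistics import median


def flag_anomalies(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold.

    The modified z-score uses the median and the median absolute
    deviation (MAD), so a single extreme value cannot distort the
    baseline the way it distorts a mean and standard deviation.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # Assumption: if at least half the values equal the median,
        # treat any value that differs from it as an anomaly.
        return [v for v in values if v != med]
    return [v for v in values if abs(0.6745 * (v - med) / mad) > threshold]


travel_times = [12, 27, 11, 41, 35, 22, 76767676, 18, 43, 26, 10]
print(flag_anomalies(travel_times))  # [76767676]
```

A robust statistic is used because, in a sample this small, one extreme value inflates the mean and standard deviation so much that its ordinary z-score stays close to a cutoff of 3 and can fall below it.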
Note: SQL questions are common in fraud analytics interviews. These questions test your ability to pull metrics that can help solve fraud detection problems.
In this case study question, you are informed about an ATM robbery at a bank. Some unauthorized withdrawals were made, and you are asked to investigate these transactions.
However, the only information you have to begin with is that there was more than one withdrawal, that they were all performed in 10-second gaps, and that no legitimate transactions were performed between two fraudulent withdrawals.
For the query, you should retrieve all user IDs in ascending order whose transactions have exactly a 10-second gap.
Hint: Since we need to identify users making transactions that occur exactly ten seconds apart, it is useful to first order the transactions by created_at.
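The logic of the hint can be sketched in Python before translating it to SQL. The tuple layout (user_id, created_at) and the sample rows are assumptions made for illustration; only the created_at column name comes from the hint. In SQL, the same comparison of neighboring rows is typically done with window functions such as LAG and LEAD.

```python
from datetime import datetime, timedelta


def users_with_ten_second_gaps(transactions):
    """Return sorted user IDs whose transaction is exactly 10 seconds
    before or after the neighboring transaction in time order.

    transactions: list of (user_id, created_at) tuples (assumed layout).
    """
    ordered = sorted(transactions, key=lambda row: row[1])
    flagged = set()
    # Compare each transaction with the one immediately after it.
    for (prev_user, prev_time), (next_user, next_time) in zip(ordered, ordered[1:]):
        if next_time - prev_time == timedelta(seconds=10):
            flagged.update((prev_user, next_user))
    return sorted(flagged)


# Hypothetical sample data.
start = datetime(2023, 1, 1, 12, 0, 0)
rows = [
    (3, start),
    (1, start + timedelta(seconds=10)),
    (7, start + timedelta(seconds=20)),
    (2, start + timedelta(seconds=45)),
    (5, start + timedelta(seconds=50)),
    (4, start + timedelta(seconds=120)),
]

print(users_with_ten_second_gaps(rows))  # [1, 3, 7]
```

Comparing only adjacent rows relies on the statement that no legitimate transactions occurred between two fraudulent withdrawals.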
More context: You are provided three tables representing forum users.
From here, answer these questions: What metrics would you use to investigate this problem? How would you write a query to represent the percentage of users who are acting fraudulently?
One metric that could shed light on the problem is the number of users who have upvoted the same account multiple times but have never commented.
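As a rough sketch, this metric can be expressed as a percentage of all users. The three inputs below (a list of user IDs, a list of (voter_id, target_user_id) upvote pairs, and a list of commenter IDs) are hypothetical stand-ins and do not reflect the actual schema of the three tables.

```python
from collections import Counter


def suspicious_voter_share(user_ids, upvotes, commenter_ids):
    """Percentage of users who upvoted the same account more than once
    and have never commented.

    user_ids: all user IDs (hypothetical input)
    upvotes: list of (voter_id, target_user_id) pairs (hypothetical input)
    commenter_ids: IDs of users with at least one comment (hypothetical input)
    """
    users = set(user_ids)
    if not users:
        return 0.0
    pair_counts = Counter(upvotes)
    # Voters who upvoted at least one account more than once.
    repeat_voters = {voter for (voter, _target), n in pair_counts.items() if n > 1}
    suspicious = (repeat_voters - set(commenter_ids)) & users
    return 100.0 * len(suspicious) / len(users)


users = [1, 2, 3, 4, 5]
upvotes = [(1, 5), (1, 5), (1, 5), (2, 5), (3, 4), (3, 4), (4, 3)]
commenters = [3, 4]

print(suspicious_voter_share(users, upvotes, commenters))  # 20.0
```

In the sample data, users 1 and 3 both upvoted one account repeatedly, but user 3 has also commented, so only user 1 counts toward the metric.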