Proportion Testing
Proportion testing extends the framework of hypothesis testing to categorical data, such as outcomes modeled by the binomial or multinomial distribution. It is a statistical method used to determine whether there is a statistically significant difference between two or more proportions.
If you have not reviewed hypothesis testing, be sure to read that post first.
Use Case
Proportion refers to how many observations in a sample belong to a given class or category. In the binomial case, the categories are denoted success (coded as 1) and failure (coded as 0).
“Success” here is not a value judgment. The minority class (the class that has fewer observations) is often coded as 1/success.
- For example, if you have 100 mortgages and 10 of those mortgages went into default, default would be coded as 1 or success, even though default is a bad thing.
Proportion testing is used to compare the proportions in two different samples to determine if the proportions are statistically significantly different from each other. It can also be used to compare the proportion in one sample against a known or expected proportion.
- For example, you may have two different machine learning models in production that approve or deny loan applications. Each loan is randomly assigned to one of the models. You may want to compare how well they’ve performed over time by comparing the number of defaults that resulted from loans approved from each model.
Null and Alternative Hypotheses
Like hypothesis testing with continuous distributions, proportion testing requires setting up a null hypothesis and an alternative hypothesis.
H₀: The null hypothesis asserts that there is no significant difference between the proportions.
H₁: The alternative hypothesis asserts that there is a significant difference between the proportions.
To continue with the loan approval example from above, the hypotheses would look like this:
H₀: There is no significant difference between the default rates of loans approved by the two machine learning models.
H₁: Loans approved by model A default at a significantly higher rate than loans approved by model B.
Example Calculation of a Proportion Test
Suppose you work as an analyst for a bank that gives out personal retail (that is, not business) loans. Humans review some of the loan applications while a machine learning model reviews the rest. You’re interested in determining whether the machine learning model is significantly better or worse than the humans at making loans.
To do this, you’ll look at the number of defaults one year after origination for each group of loans.
We’re interested in whether the human loan officers outperform the machine learning model. We hypothesize that the machine learning model, because it’s trained on more data, performs better than the human loan officers. That is, the proportion of defaults with the human loan officers is higher than the proportion of defaults for model-approved loans.
So the hypotheses would look like this:
H₀: p_model = p_human — the default rate of model-approved loans is the same as that of human-approved loans.
H₁: p_model < p_human — model-approved loans default at a significantly lower rate than human-approved loans.
We will set a statistical significance level of 0.05 for this test (review the hypothesis testing post for more information on statistical significance). This will be used in the last step, after we calculate the z-statistic.
The next step is to calculate the test statistic. In this case, we’ll use the z-statistic:

z = (p̂_A − p̂_B) / √( p̂(1 − p̂)(1/n_A + 1/n_B) ),

where:
p̂_A = sample proportion in group A
p̂_B = sample proportion in group B
p̂ = pooled sample proportion for groups A and B = (x_A + x_B) / (n_A + n_B), where x_A and x_B are the counts of successes (here, defaults) in each group
n_A = number of samples in group A
n_B = number of samples in group B
Calculating for this example yields: z = −1.159
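The pooled two-proportion z-statistic above is straightforward to sketch in code. The counts below are made up for illustration; they are not the data from this example:

```python
from math import sqrt

def two_proportion_z(x_a, n_a, x_b, n_b):
    """Pooled two-proportion z-statistic."""
    p_a = x_a / n_a                      # proportion of successes in group A
    p_b = x_b / n_b                      # proportion of successes in group B
    p_pool = (x_a + x_b) / (n_a + n_b)   # pooled sample proportion
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Hypothetical counts: 8 defaults out of 100 model-approved loans (group A),
# 12 defaults out of 100 human-approved loans (group B).
z = two_proportion_z(8, 100, 12, 100)
print(round(z, 3))  # -0.943
```

A negative z here means group A's observed default rate is below group B's, as in the worked example.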
In this case, we’re using a one-tailed z-test. Specifically, it’s a left-tailed test. We can tell it’s a left-tailed test because we’re interested in whether the model performs significantly better than the benchmark human loan officers, so the alternative hypothesis is p_model < p_human, and only sufficiently negative values of z count as evidence against the null hypothesis.
Using a z-statistic table, we look up the critical value for a left-tailed test at our statistical significance level of 0.05, which is a z-statistic of −1.645.
The last step is to compare our calculated z-statistic of −1.159 to the 0.05-significance critical value of −1.645. Because −1.159 is greater than −1.645, it does not fall in the rejection region, and we fail to reject the null hypothesis. This means we cannot conclude that the machine learning model is better, in a statistically significant way, than the human loan officers at approving loans that will not default within a year.
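The same decision can be expressed as a p-value check. This sketch reuses the z-statistic of −1.159 from the example:

```python
from statistics import NormalDist

z = -1.159   # calculated test statistic from the example
alpha = 0.05

# Left-tailed p-value: probability of observing a z this low
# or lower if the null hypothesis were true.
p_value = NormalDist().cdf(z)
reject_null = p_value < alpha
print(round(p_value, 3), reject_null)  # about 0.123, False
```

Because the p-value (about 0.123) exceeds 0.05, the code reaches the same conclusion as the table lookup: fail to reject the null hypothesis.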