Two-sample Tests
The tests we examined in the previous section were one-sample tests, meaning they tested a statistic of a sample against a single value. While such tests have broad applications, they are limited because we might not know at the outset what value to use for $\mu_0$. For example, in A/B tests, we want to compare the means of two samples against each other.
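You might say that we could run a one-sample test and set $\mu_0$ equal to the mean of one of the samples, but this doesn’t work because that mean is itself an estimate: comparing two noisy quantities involves more variance than comparing one noisy quantity against a fixed number. You can see this if you model the process abstractly using random variables. If $X \sim N(\mu_1, \sigma_1^2)$ and $Y \sim N(\mu_2, \sigma_2^2)$ are independent, then $X - Y \sim N(\mu_1 - \mu_2, \sigma_1^2 + \sigma_2^2)$.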
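The variance claim above can be checked numerically. The following is a minimal sketch, using only the Python standard library, that simulates two independent normal variables and compares the variance of their difference with the individual variances. The parameter values (means 10 and 7, standard deviations 2 and 3) and the number of draws are arbitrary choices for illustration, not values from the text.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

N = 50_000  # number of simulated draws (arbitrary)

# Assumed illustrative parameters: X ~ N(10, 2^2), Y ~ N(7, 3^2), independent.
xs = [rng.gauss(10, 2) for _ in range(N)]
ys = [rng.gauss(7, 3) for _ in range(N)]
diffs = [x - y for x, y in zip(xs, ys)]

var_x = statistics.variance(xs)
var_y = statistics.variance(ys)
var_diff = statistics.variance(diffs)
mean_diff = statistics.mean(diffs)

print(f"Var(X)     is close to 4:  {var_x:.2f}")
print(f"Var(Y)     is close to 9:  {var_y:.2f}")
print(f"Var(X - Y) is close to 13: {var_diff:.2f}")
print(f"E[X - Y]   is close to 3:  {mean_diff:.2f}")
```

The difference has a larger variance than either variable on its own, which is why a one-sample test against a "fixed" sample mean would understate the uncertainty.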
Thus, we need to take the increased variance into account somehow. The way in which we do that depends on:
1. Whether the sample sizes are equal
2. Whether the samples have equal (population) variances
3. Whether the samples are independent
This gives us four different versions of the two-sample $t$-test:
1. Independent samples with equal variance and sample size
2. Independent samples with equal variance, but different sample sizes
3. Independent samples with different variances
4. Dependent samples
Independent Two-sample $t$-test With Equal Variances and Sample Sizes
Cheat sheet
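Description: Tests if the mean of one sample ($\bar{x}_1$) is different/less than/more than the mean of another sample ($\bar{x}_2$) when the samples have equal sample size, equal variance, and are independent
Statistic: $\mu$ (mean)
Distribution: $t$ ($\nu = 2n - 2$ degrees of freedom)
Sidedness: Either
Null Hypothesis: $\mu_1 = \mu_2$ (two-sided), $\mu_1 \le \mu_2$ or $\mu_1 \ge \mu_2$ (one-sided)
Alternative Hypothesis: $\mu_1 \ne \mu_2$ (two-sided), $\mu_1 > \mu_2$ or $\mu_1 < \mu_2$ (one-sided)
Test Statistic: $t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{2/n}}$, where $s_p = \sqrt{\frac{s_1^2 + s_2^2}{2}}$, $n$ is the common sample size, and $s_1^2$, $s_2^2$ are the sample variances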
Description
Let’s go over what the title of this test really means:
Independent: This means that the two samples have no relation to each other. In medical terms, there is no “treatment” linking the samples; they simply represent different populations.
Same Sample Size: Both samples have the same number of data points, duh.
Same Variance: This condition has a misleading name. It’s generally impossible to determine if two populations actually have the exact same variance. Because of this, “same variance” is defined as follows: each sample has a variance that is at least half and at most double that of the other. Mathematically, if you use the sample variances $s_1^2$ and $s_2^2$ to estimate the population variances, this can be expressed as: $\frac{1}{2} \le \frac{s_1^2}{s_2^2} \le 2$
This test is useful if you have two different classes or categories of the object you’re studying. This could be something like the total time spent watching content by viewers in two different geographical regions.
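As a rough illustration of the test described in this section, here is a minimal sketch in standard-library Python. It assumes the usual pooled form of the statistic for equal sample sizes, $t = (\bar{x}_1 - \bar{x}_2) / (s_p \sqrt{2/n})$ with $s_p^2 = (s_1^2 + s_2^2)/2$ and $2n - 2$ degrees of freedom. The function names `variances_similar` and `t_equal_n` and the viewing-time numbers are hypothetical, made up for this example.

```python
import math
import statistics


def variances_similar(a, b):
    """Rule of thumb from the text: each sample variance is between
    half and double the other."""
    ratio = statistics.variance(a) / statistics.variance(b)
    return 0.5 <= ratio <= 2.0


def t_equal_n(a, b):
    """Return (t statistic, degrees of freedom) for two independent
    samples of equal size, assuming equal population variances."""
    if len(a) != len(b):
        raise ValueError("this version of the test needs equal sample sizes")
    n = len(a)
    # Pooled standard deviation: with equal n it is the root of the
    # average of the two sample variances.
    s_p = math.sqrt((statistics.variance(a) + statistics.variance(b)) / 2)
    t = (statistics.mean(a) - statistics.mean(b)) / (s_p * math.sqrt(2 / n))
    return t, 2 * n - 2


# Hypothetical data: hours of content watched by viewers in two regions.
region_a = [5.1, 4.8, 6.0, 5.5, 5.9, 4.7, 5.3, 6.1]
region_b = [4.2, 5.2, 4.0, 5.0, 3.9, 4.6, 5.1, 4.3]

similar = variances_similar(region_a, region_b)
t_stat, dof = t_equal_n(region_a, region_b)
print(f"variances similar: {similar}, t = {t_stat:.3f}, df = {dof}")
```

The sketch stops at the statistic and degrees of freedom, because turning them into a p-value needs the $t$ distribution's CDF, which the standard library does not provide. If SciPy is available, `scipy.stats.ttest_ind(region_a, region_b)` performs the pooled-variance test and reports a p-value as well.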
Independent Two-sample $t$-test With Equal Variances and Unequal Sample Sizes
Cheat sheet
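Description: Tests if the mean of one sample ($\bar{x}_1$) with $n_1$ data points is different/less than/more than the mean of another sample ($\bar{x}_2$) with $n_2$ data points when the samples have equal variances and are independent
Statistic: $\mu$ (mean)
Distribution: $t$ ($\nu = n_1 + n_2 - 2$ degrees of freedom)
Sidedness: Either
Null Hypothesis: $\mu_1 = \mu_2$ (two-sided), $\mu_1 \le \mu_2$ or $\mu_1 \ge \mu_2$ (one-sided)
Alternative Hypothesis: $\mu_1 \ne \mu_2$ (two-sided), $\mu_1 > \mu_2$ or $\mu_1 < \mu_2$ (one-sided)
Test Statistic: $t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$, where $s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$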
Description
While this test is more or less the same as the previous test from a mathematical perspective, the complexity of the denominator might cause some confusion.
The pooled variance $s_p^2$ in the denominator is defined so that it remains an unbiased estimator of $\sigma^2$, the population variance common to both samples, no matter the actual sample sizes $n_1$ and $n_2$. In fact, if you set $n_1 = n_2 = n$ and work through the algebra, you’ll see that the test statistic is the same as in the previous case.
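The reduction to the equal-size case can also be checked numerically. The sketch below assumes the standard pooled form, in which the two sample variances are weighted by $n_1 - 1$ and $n_2 - 1$ and the standard error uses $\sqrt{1/n_1 + 1/n_2}$. The names `pooled_sd` and `t_pooled` and all data values are hypothetical.

```python
import math
import statistics


def pooled_sd(a, b):
    """Pooled standard deviation: sample variances weighted by n - 1."""
    n1, n2 = len(a), len(b)
    weighted = (n1 - 1) * statistics.variance(a) + (n2 - 1) * statistics.variance(b)
    return math.sqrt(weighted / (n1 + n2 - 2))


def t_pooled(a, b):
    """Return (t statistic, degrees of freedom) for two independent samples
    with equal population variances and possibly different sizes."""
    n1, n2 = len(a), len(b)
    std_err = pooled_sd(a, b) * math.sqrt(1 / n1 + 1 / n2)
    return (statistics.mean(a) - statistics.mean(b)) / std_err, n1 + n2 - 2


# Hypothetical samples of different sizes.
a = [5.1, 4.8, 6.0, 5.5, 5.9, 4.7, 5.3, 6.1]
b = [4.2, 5.2, 4.0, 5.0, 3.9]
t_unequal, df_unequal = t_pooled(a, b)
print(f"unequal sizes: t = {t_unequal:.3f}, df = {df_unequal}")

# With equal sizes the general formula collapses to the simpler one:
# s_p^2 becomes (s_1^2 + s_2^2) / 2 and sqrt(1/n + 1/n) becomes sqrt(2/n).
c = [4.2, 5.2, 4.0, 5.0, 3.9, 4.6, 5.1, 4.3]
n = len(a)
s_p_simple = math.sqrt((statistics.variance(a) + statistics.variance(c)) / 2)
t_simple = (statistics.mean(a) - statistics.mean(c)) / (s_p_simple * math.sqrt(2 / n))
t_general, df_general = t_pooled(a, c)
print(f"equal sizes: general = {t_general:.6f}, simple = {t_simple:.6f}")
```

Both routes give the same value because, with $n_1 = n_2 = n$, the weights $n - 1$ cancel against the $2n - 2$ in the denominator.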
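Independent Two-sample $t$-test With Unequal Variances
- Description: Tests if the mean of one sample ($\bar{x}_1$) with $n_1$ data points is different/less than/more than the mean of another sample ($\bar{x}_2$) with $n_2$ data points when the samples have unequal variances and are independent
- Statistic: $\mu$ (mean)
- Distribution: $t$ ($\nu$ degrees of freedom) where $\nu = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}$
- Sidedness: Either
- Null Hypothesis: $\mu_1 = \mu_2$ (two-sided), $\mu_1 \le \mu_2$ or $\mu_1 \ge \mu_2$ (one-sided)
- Alternative Hypothesis: $\mu_1 \ne \mu_2$ (two-sided), $\mu_1 > \mu_2$ or $\mu_1 < \mu_2$ (one-sided)
- Test Statistic: $t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$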
Description
This test is used when the variance of one sample is more than double the variance of the other ($s_1^2 > 2s_2^2$ or $s_2^2 > 2s_1^2$).
The complicated form of the degrees of freedom is because it is actually an approximation of the true degrees of freedom of this test. The reasoning behind this is well beyond the scope of this course, and data science in general, but those who are curious can look up the “Behrens–Fisher problem” for details.
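To make the approximate degrees of freedom less abstract, here is a minimal sketch of the unequal-variance test, commonly known as Welch's $t$-test, using the Welch-Satterthwaite approximation for $\nu$. The function name `welch_t` and the data are hypothetical, and the formulas are the standard textbook ones rather than anything specific to this course.

```python
import math
import statistics


def welch_t(a, b):
    """Return (t statistic, approximate degrees of freedom) for two
    independent samples whose variances are not assumed equal."""
    n1, n2 = len(a), len(b)
    # Estimated variance of each sample mean: s^2 / n.
    v1 = statistics.variance(a) / n1
    v2 = statistics.variance(b) / n2
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation; usually not a whole number.
    dof = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, dof


# Hypothetical data: one tightly clustered sample, one widely spread sample.
tight = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2]
spread = [18.0, 23.5, 16.2, 25.1, 21.0, 14.9, 22.7, 19.4]

t_welch, df_welch = welch_t(tight, spread)
print(f"t = {t_welch:.3f}, approximate df = {df_welch:.2f}")
```

The approximate degrees of freedom always fall between $\min(n_1 - 1, n_2 - 1)$ and $n_1 + n_2 - 2$, and they move toward the lower end when one sample dominates the variance, as it does here. If SciPy is available, `scipy.stats.ttest_ind(tight, spread, equal_var=False)` runs the same test and also returns a p-value.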
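Paired (Dependent) $t$-test
Cheat sheet
Description: Tests if the mean difference within the same sample before and after some treatment is different from/less than/more than some number ($\mu_0$)
Statistic: $\mu_d$ (mean of differences between pairs)
Distribution: $t$ ($\nu = n - 1$ degrees of freedom, where $n$ is the number of pairs)
Sidedness: Either
Null Hypothesis: $\mu_d = \mu_0$ (two-sided), $\mu_d \le \mu_0$ or $\mu_d \ge \mu_0$ (one-sided)
Alternative Hypothesis: $\mu_d \ne \mu_0$ (two-sided), $\mu_d > \mu_0$ or $\mu_d < \mu_0$ (one-sided)
Test Statistic: $t = \frac{\bar{d} - \mu_0}{s_d / \sqrt{n}}$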
Description
In this case, there aren’t really two samples. Rather, there is one sample that is measured at two different points in time. This could be the heights of a certain animal, a productivity metric after implementing some new software at a workplace, or blood sugar levels after taking a dose of medicine. We’ll call the pre-treatment data set $x_{\text{pre}}$ and the post-treatment data set $x_{\text{post}}$.
Because of this, we are not interested in the difference between the means of the two “samples”, $\bar{x}_{\text{post}} - \bar{x}_{\text{pre}}$, but rather the mean difference between the data points at the two different points in time, $\bar{d}$, where $d_i = x_{\text{post},i} - x_{\text{pre},i}$. The two numbers are equal, but treating each pair as a single data point changes the variance used in the test.
Likewise, the test statistic uses $s_d$, the standard deviation of the differences of the pairs.
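The paired test amounts to a one-sample $t$-test on the pairwise differences, which the following minimal sketch makes explicit. It assumes the differences are taken as post minus pre and that the hypothesized mean difference defaults to 0. The function name `paired_t` and the blood sugar readings are hypothetical.

```python
import math
import statistics


def paired_t(pre, post, mu0=0.0):
    """Return (t statistic, degrees of freedom) for paired measurements.

    Differences are computed as post - pre (an assumption of this sketch);
    mu0 is the hypothesized mean difference.
    """
    if len(pre) != len(post):
        raise ValueError("paired data needs one 'post' value per 'pre' value")
    diffs = [after - before for before, after in zip(pre, post)]
    n = len(diffs)
    d_bar = statistics.mean(diffs)   # mean of the pairwise differences
    s_d = statistics.stdev(diffs)    # standard deviation of the differences
    t = (d_bar - mu0) / (s_d / math.sqrt(n))
    return t, n - 1


# Hypothetical blood sugar readings for six patients, before and after a dose.
pre_treatment = [142, 150, 138, 160, 155, 147]
post_treatment = [135, 146, 139, 150, 149, 140]

t_paired, df_paired = paired_t(pre_treatment, post_treatment)
print(f"t = {t_paired:.3f}, df = {df_paired}")
```

Because each patient serves as their own control, the patient-to-patient spread in baseline levels drops out of $s_d$, which is what distinguishes this test from the independent-sample versions. If SciPy is available, `scipy.stats.ttest_rel(post_treatment, pre_treatment)` performs the same test and returns a p-value.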