Two-sample Tests

The tests we examined in the previous section were one-sample tests, meaning they tested a statistic of a sample against a single value. While such tests have broad applications, they are limited because we might not know what value to set for $\mu_0$ at the outset. For example, in A/B tests we want to compare the means of two samples against each other.

You might say that we could run a one-sample test and set $\mu_0$ equal to the mean of one of the samples, but this doesn't work because the second sample adds its own variability: the quantity being tested now involves two random means, so its variance is larger than either one alone. You can see this if you model the process abstractly using random variables. If $X_1\sim\mathcal{N}(\mu_1,\sigma_1^2)$ and $X_2\sim\mathcal{N}(\mu_2,\sigma_2^2)$ are independent, then $\mathsf{Var}[X_1-X_2]=\mathsf{Var}[X_1+X_2]=\sigma_1^2+\sigma_2^2$.
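As a quick numerical sanity check of this fact, here is a minimal sketch (assuming NumPy is available) that draws two large independent normal samples and confirms that the variance of their difference is roughly the sum of the individual variances:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two independent normal samples with known variances
x1 = rng.normal(loc=5.0, scale=2.0, size=100_000)   # sigma_1^2 = 4
x2 = rng.normal(loc=5.0, scale=3.0, size=100_000)   # sigma_2^2 = 9

# Variance of the difference is close to 4 + 9 = 13, not to 4 or 9 alone
print(np.var(x1 - x2))
```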

Thus, we need to account for the increased variance somehow. The way in which we do that depends on:

1. Whether the sample sizes are equal
2. Whether the samples have equal (population) variances
3. Whether the samples are independent

This gives us four different versions of the two-sample $t$-test:

1. Independent samples with equal variance and sample size
2. Independent samples with equal variance but different sample sizes
3. Independent samples with different variances
4. Dependent samples

Independent Two-sample $t$-test With Equal Variances and Sample Sizes

Cheat sheet

  • Description: Tests if the mean of one sample ($\mu_1$) is different from/less than/more than the mean of another sample ($\mu_2$) when the samples have equal sample sizes and variances and are independent

  • Statistic: $\mu_1-\mu_2$ (mean)

  • Distribution: $\mathcal{T}(2n-2)$ ($t$)

  • Sidedness: Either

  • Null Hypothesis: $H_0: \mu_1 = \mu_2$ (two-sided), $\mu_1 \geq \mu_2$, $\mu_1 \leq \mu_2$ (one-sided)

  • Alternative Hypothesis: $H_a: \mu_1 \neq \mu_2$ (two-sided), $\mu_1 < \mu_2$, $\mu_1 > \mu_2$ (one-sided)

  • Test Statistic: $$\tau=\frac{\hat{\mu}_1-\hat{\mu}_2}{\sqrt{\frac{s_1^2+s_2^2}{n}}}$$

Description

Let's go over what the title of this test really means.

Independent: This means that the two samples have no relation to each other. In medical terms, there is no "treatment" linking them; they simply represent different populations.

Same Sample Size: Both samples contain the same number of data points.

Same Variance: This condition has a misleading name. It's generally impossible to determine whether two populations have exactly the same variance. Because of this, "same variance" is defined by a rule of thumb: each sample's variance is at least half and at most double the other's. Mathematically, if you use the sample variances to estimate the population variances, this can be expressed as:

$$\frac{1}{2}\leq\frac{s_1^2}{s_2^2}\leq 2$$

This test is useful if you have two different classes or categories of the object you're studying. This could be something like the total time spent watching content by viewers in two different geographical regions.
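Below is a minimal sketch of this test in Python (assuming NumPy and SciPy are available, and using made-up watch-time numbers purely for illustration). It computes $\tau$ with the formula above and checks it against `scipy.stats.ttest_ind` with `equal_var=True`, which should agree when the sample sizes are equal:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical weekly watch-time (hours) for viewers in two regions, equal sizes n
region_a = rng.normal(loc=10.0, scale=2.0, size=50)
region_b = rng.normal(loc=11.0, scale=2.0, size=50)

n = len(region_a)
s1_sq, s2_sq = region_a.var(ddof=1), region_b.var(ddof=1)

# tau = (difference of sample means) / sqrt((s1^2 + s2^2) / n), with 2n - 2 degrees of freedom
tau = (region_a.mean() - region_b.mean()) / np.sqrt((s1_sq + s2_sq) / n)
p_value = 2 * stats.t.sf(abs(tau), df=2 * n - 2)   # two-sided p-value

print(tau, p_value)
print(stats.ttest_ind(region_a, region_b, equal_var=True))  # should match
```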

Independent Two-sample $t$-test With Equal Variances and Unequal Sample Sizes

Cheat sheet

  • Description: Tests if the mean of one sample ($\mu_1$) with $n_1$ data points is different from/less than/more than the mean of another sample ($\mu_2$) with $n_2$ data points when the samples have equal variances and are independent

  • Statistic: $\mu_1-\mu_2$ (mean)

  • Distribution: $\mathcal{T}(n_1+n_2-2)$ ($t$)

  • Sidedness: Either

  • Null Hypothesis: $H_0: \mu_1 = \mu_2$ (two-sided), $\mu_1 \geq \mu_2$, $\mu_1 \leq \mu_2$ (one-sided)

  • Alternative Hypothesis: $H_a: \mu_1 \neq \mu_2$ (two-sided), $\mu_1 < \mu_2$, $\mu_1 > \mu_2$ (one-sided)

  • Test Statistic: $$\tau=\frac{\hat{\mu}_1-\hat{\mu}_2}{\sqrt{\frac{\left(n_1-1\right)s_{1}^2+\left(n_2-1\right)s_{2}^2}{n_1+n_2-2}}\cdot\sqrt{\frac{n_1+n_2}{n_1 n_2}}}$$

Description

While this test is more or less the same as the previous one from a mathematical perspective, the more complex denominator might cause some confusion.

The denominator is defined so that the pooled sample variance remains an unbiased estimator of $\sigma^2$, the population variance common to both samples, no matter the actual sample sizes $n_1$ and $n_2$. In fact, if you set $n_1=n_2$ and do a bit of algebra, you'll see that $\tau$ reduces to the statistic from the previous case.
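The following sketch (again NumPy/SciPy assumed, with made-up data) computes the pooled-variance statistic for unequal sample sizes by hand and compares it with `scipy.stats.ttest_ind(equal_var=True)`, which implements the same pooled test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Samples of different sizes but with (assumed) common population variance
x1 = rng.normal(loc=10.0, scale=2.0, size=40)
x2 = rng.normal(loc=11.0, scale=2.0, size=65)

n1, n2 = len(x1), len(x2)
s1_sq, s2_sq = x1.var(ddof=1), x2.var(ddof=1)

# Pooled variance: unbiased estimate of the common sigma^2
sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)

tau = (x1.mean() - x2.mean()) / np.sqrt(sp_sq * (n1 + n2) / (n1 * n2))
p_value = 2 * stats.t.sf(abs(tau), df=n1 + n2 - 2)

print(tau, p_value)
print(stats.ttest_ind(x1, x2, equal_var=True))  # should match
```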

Independent Two-sample $t$-test With Unequal Variances (Welch's $t$-test)

Cheat sheet

  • Description: Tests if the mean of one sample ($\mu_1$) with $n_1$ data points is different from/less than/more than the mean of another sample ($\mu_2$) with $n_2$ data points when the samples have unequal variances and are independent
  • Statistic: $\mu_1-\mu_2$ (mean)
  • Distribution: $\mathcal{T}(df)$ ($t$) where $$df = \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{\left(s_1^2/n_1\right)^2}{n_1-1} + \frac{\left(s_2^2/n_2\right)^2}{n_2-1}}$$
  • Sidedness: Either
  • Null Hypothesis: $H_0: \mu_1 = \mu_2$ (two-sided), $\mu_1 \geq \mu_2$, $\mu_1 \leq \mu_2$ (one-sided)
  • Alternative Hypothesis: $H_a: \mu_1 \neq \mu_2$ (two-sided), $\mu_1 < \mu_2$, $\mu_1 > \mu_2$ (one-sided)
  • Test Statistic: $$\tau=\frac{\hat{\mu}_1-\hat{\mu}_2}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}}$$

Description

This test is used when the variance of one sample is more than double the variance of the other ($s_1^2 > 2s_2^2$ or $s_2^2 > 2s_1^2$).

The complicated form of the degrees of freedom arises because it is an approximation of the true degrees of freedom of this test. The reasoning behind it is well beyond the scope of this course, and of data science in general, but those who are curious can look up the "Behrens–Fisher problem" for details.
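Here is a minimal sketch of the computation (made-up data, NumPy/SciPy assumed) that builds the approximate degrees of freedom and $\tau$ by hand and compares them with `scipy.stats.ttest_ind(equal_var=False)`, which performs Welch's test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Samples with clearly different variances (s1^2 is far more than double s2^2)
x1 = rng.normal(loc=10.0, scale=5.0, size=40)
x2 = rng.normal(loc=11.0, scale=1.5, size=55)

n1, n2 = len(x1), len(x2)
v1, v2 = x1.var(ddof=1) / n1, x2.var(ddof=1) / n2   # s_i^2 / n_i terms

# Welch-Satterthwaite approximation to the degrees of freedom
df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))

tau = (x1.mean() - x2.mean()) / np.sqrt(v1 + v2)
p_value = 2 * stats.t.sf(abs(tau), df=df)

print(tau, df, p_value)
print(stats.ttest_ind(x1, x2, equal_var=False))  # Welch's test; should match
```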

Paired (Dependent) $t$-test

Cheat Sheet

  • Description: Tests if the mean difference between measurements of the same sample before and after some treatment is equal to/less than/more than some number $\mu_0$

  • Statistic: $\mu_\Delta$ (mean of differences between pairs)

  • Distribution: $\mathcal{T}(n-1)$ ($t$)

  • Sidedness: Either

  • Null Hypothesis: $H_0: \mu_\Delta = \mu_0$ (two-sided), $\mu_\Delta \geq \mu_0$, $\mu_\Delta \leq \mu_0$ (one-sided)

  • Alternative Hypothesis: $H_a: \mu_\Delta \neq \mu_0$ (two-sided), $\mu_\Delta < \mu_0$, $\mu_\Delta > \mu_0$ (one-sided)

  • Test Statistic: $$\tau=\frac{\hat{\mu}_\Delta-\mu_0}{s_\Delta/\sqrt{n}}$$

Description

In this case, there aren't really two samples. Rather, there is one sample that is measured at two different points in time. This could be the heights of a certain animal, a productivity metric after implementing some new software at a workplace, or blood sugar levels after taking a dose of medicine. We'll call the pre-treatment data set $\vec{x}$ and the post-treatment data set $\vec{x}'$.

Because of this, we are not interested in the difference between the means of the two "samples" ($\mu_1-\mu_2$), but rather in the mean difference between the paired data points at the two points in time:

$$\hat{\mu}_\Delta=\frac{1}{n}\sum_{i=1}^n\Delta x_i=\frac{1}{n}\sum_{i=1}^n(x_i-x_i')$$

Likewise, the test statistic uses $s_\Delta$, the standard deviation of the differences of the pairs:

$$s_\Delta=\sqrt{\frac{1}{n-1}\sum_{i=1}^n(\Delta x_i-\hat{\mu}_\Delta)^2}$$
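To tie the pieces together, here is a minimal sketch of the paired test (hypothetical before/after productivity scores, NumPy/SciPy assumed) that computes $\hat{\mu}_\Delta$, $s_\Delta$, and $\tau$ by hand and checks the result against `scipy.stats.ttest_rel`, which tests against $\mu_0 = 0$:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Hypothetical productivity scores for the same 30 employees, before and after a change
before = rng.normal(loc=50.0, scale=8.0, size=30)
after = before + rng.normal(loc=2.0, scale=4.0, size=30)   # simulated treatment effect

diff = before - after            # Delta x_i = x_i - x_i'
n = len(diff)
mu_delta_hat = diff.mean()
s_delta = diff.std(ddof=1)

mu_0 = 0.0                       # H0: no change on average
tau = (mu_delta_hat - mu_0) / (s_delta / np.sqrt(n))
p_value = 2 * stats.t.sf(abs(tau), df=n - 1)

print(tau, p_value)
print(stats.ttest_rel(before, after))  # should match when mu_0 = 0
```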
