Data Science Interview

So far, all of our tests have been parametric, meaning they make assumptions about the distribution of the sample. In the $Z$, $t$, and $F$ tests, we assumed that the sample followed a normal distribution. In the proportions and $\chi^2$ tests, we assumed that the samples followed a binomial or multinomial distribution, respectively.

But sometimes, particularly with small samples, it is not fair to make these assumptions. For this reason, we have non-parametric tests, which do not assume that the sample follows any particular family of distributions.

There are many non-parametric tests; we will go over two of the most popular ones: the $U$ test (also known as the Mann-Whitney $U$ test) and the paired signed-rank test (also known as the Wilcoxon signed-rank test).

Please note that there are ways to calculate $p$-values for the test statistics of non-parametric tests, but we don't describe them here because the calculations are fairly esoteric.

$U$ Test

Cheat Sheet

Description: Tests if the median of one sample ($\vec{x}_1$) is different from/more than/less than the median of another, independent sample ($\vec{x}_2$)
Statistic: $m_1-m_2$ (difference of medians)
Sidedness: Either
Null Hypothesis: $H_0: m_1=m_2$ (two-sided), $m_1\leq m_2,m_1\geq m_2$ (one-sided)
Alternative Hypothesis: $H_a: m_1\neq m_2$ (two-sided), $m_1\gt m_2,m_1\lt m_2$ (one-sided)
Test Statistic: $U=\min(U_1,U_2)$ where $U_1 = \sum_{i=1}^{n_1}\sum_{j=1}^{n_2}S(x_{1,i},x_{2,j}),\quad S(x,y)=\begin{cases}1,\quad\text{if} \ x\gt y\\\\ 0.5,\quad\text{if} \ x=y\\\\ 0,\quad\text{if} \ x\lt y\end{cases}$

$U_2 = \sum_{i=1}^{n_1}\sum_{j=1}^{n_2}S_{-1}(x_{1,i},x_{2,j}),\quad S _{-1}(x,y)=\begin{cases}0,\quad\text{if} \ x\gt y\\\\ 0.5,\quad\text{if} \ x=y\\\\ 1,\quad\text{if} \ x\lt y\end{cases}$
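The double sums above translate almost directly into code. The following is a minimal sketch in plain Python; the function name `u_statistics` and the sample values are made up for illustration and are not part of the course material.

```python
def u_statistics(x1, x2):
    """Return (U1, U2, U) by scoring every pair (x1[i], x2[j])."""
    u1 = 0.0  # accumulates S(x, y): 1 if x > y, 0.5 if tied, 0 otherwise
    u2 = 0.0  # accumulates S_{-1}(x, y): 1 if x < y, 0.5 if tied, 0 otherwise
    for a in x1:
        for b in x2:
            if a > b:
                u1 += 1.0
            elif a < b:
                u2 += 1.0
            else:
                u1 += 0.5
                u2 += 0.5
    return u1, u2, min(u1, u2)


# Hypothetical samples, chosen so that one pair is tied (2.4 vs 2.4).
x1 = [1.1, 2.4, 3.8, 5.0]
x2 = [0.7, 2.0, 2.4]

u1, u2, u = u_statistics(x1, x2)
print(u1, u2, u)  # 9.5 2.5 2.5
```

Because every pair contributes exactly 1 in total to $U_1$ and $U_2$, the two always sum to $n_1n_2$ (here $4\times 3=12$), which is a handy sanity check.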

Description

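The idea behind this test is that $U_1$ and $U_2$ count the pairs in which an observation from one sample exceeds an observation from the other, so $U_1/(n_1n_2)$ is a proxy for $\mathbb{P}(X_1\gt X_2)$ and $U_2/(n_1n_2)$ is a proxy for $\mathbb{P}(X_1\lt X_2)$. In fact, the hypotheses of the $U$-test can be restated as

$H_0:\mathbb{P}(X_1\gt X_2)=\mathbb{P}(X_1\lt X_2)$

$H_a:\mathbb{P}(X_1\gt X_2)\neq\mathbb{P}(X_1\lt X_2)$

This connects back to the medians because the median is just defined as the value $m$ such that $\mathbb{P}(X\lt m)=0.5$.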

As stated before, there is a way to calculate a cdf for $U$ and test it against a significance level $\alpha$ , but it is beyond this course’s scope and better left to software.
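As one example of leaving the $p$-value to software, the sketch below uses `scipy.stats.mannwhitneyu`, SciPy's implementation of the $U$ test. It assumes SciPy is installed; the sample values and the choice of $\alpha=0.05$ are made up for illustration. Note that the statistic SciPy reports is one of $U_1$ or $U_2$ rather than necessarily $\min(U_1,U_2)$, and which one can depend on the SciPy version, so it is worth checking the documentation for the version in use.

```python
from scipy.stats import mannwhitneyu

# Hypothetical samples (illustrative values only).
x1 = [1.1, 2.4, 3.8, 5.0]
x2 = [0.7, 2.0, 2.4]

# alternative="two-sided" corresponds to H_a: m_1 != m_2;
# "greater" and "less" give the one-sided versions.
result = mannwhitneyu(x1, x2, alternative="two-sided")

alpha = 0.05  # assumed significance level
reject_null = result.pvalue < alpha

print("U statistic:", result.statistic)
print("p-value:", result.pvalue)
print("reject H_0:", reject_null)
```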

Paired Signed-rank Test

Cheat Sheet

Description: Tests if the median of a sample at one point in time ($\vec{x}$) is different from/more than/less than the median of the same sample at a different point in time ($\vec{x}'$)
Statistic: $m-m'$ (difference of medians)
Sidedness: Either
Null Hypothesis: $H_0: m=m'$ (two-sided), $m\leq m',m\geq m'$ (one-sided)
Alternative Hypothesis: $H_a: m\neq m'$ (two-sided), $m\gt m',m\lt m'$ (one-sided)
Test Statistic: $W=\sum_{i=1}^n\text{sgn}(\Delta x_i)R_{|\vec{x}-\vec{x}'|}(|\Delta x_i|)$ where $\Delta x_i=x_i-x_i'$ (the ranks are taken over the absolute differences)

Description

The function $R$ in the $W$-statistic is called the rank function. It returns the (1-based) position of $x\in\vec{x}$ when $\vec{x}$ is sorted in ascending order. For example, if $\vec{x}=[5,3,8]$, then sorted that would be $[3,5,8]$, so $R_{\vec{x}}(5)=2$.

The $\text{sgn}$ function (read “sign”) in $W$ takes the “sign” of its input. It is defined as:

$\text{sgn}(x)=\begin{cases}1\ \text{if}\ x\gt 0\\\\ -1\ \text{if} \ x\lt 0\\\\ 0\ \text{if} \ x=0 \end{cases}$

So $W$ contains information about the relative ranks of the differences between the observations at the time of $\vec{x}$ and at the time of $\vec{x}'$.
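To make the rank and sign functions concrete, here is a minimal sketch of the $W$ statistic in plain Python. It follows the usual Wilcoxon convention, which is an assumption on top of the text above: ranks are taken over the absolute differences, zero differences are dropped before ranking, and tied values share their average rank. The function names and the data are hypothetical.

```python
def sign(v):
    """sgn(v): 1 if v > 0, -1 if v < 0, 0 if v == 0."""
    return (v > 0) - (v < 0)


def average_ranks(values):
    """1-based ascending ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def signed_rank_statistic(x, x_prime):
    """W = sum of sgn(d_i) * rank(|d_i|) over the nonzero differences d_i."""
    diffs = [a - b for a, b in zip(x, x_prime) if a != b]  # assumption: drop zeros
    ranks = average_ranks([abs(d) for d in diffs])
    return sum(sign(d) * r for d, r in zip(diffs, ranks))


print(average_ranks([5, 3, 8]))  # [2.0, 1.0, 3.0], so the rank of 5 is 2

# Hypothetical paired measurements (before and after) on the same five subjects.
x = [12, 15, 9, 20, 14]
x_prime = [10, 18, 8, 14, 14]
W = signed_rank_statistic(x, x_prime)
print(W)  # differences 2, -3, 1, 6 -> signed ranks 2, -3, 1, 4 -> W = 4.0
```

A $W$ far from zero in either direction suggests the differences lean consistently one way; a $W$ near zero suggests positive and negative differences of similar size balance out.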
