What is a Student's t-Distribution?

Despite the ubiquity of the t-distribution in classrooms, it is often introduced with only one piece of advice: "use the \( t \)-distribution for small sample sizes and the normal distribution (a \( Z \) distribution) for larger ones". Supposedly, according to every introductory statistics class, the t-distribution works better because of its mythical fatter tails.

Let us deconstruct this ubiquitous distribution a little bit, and hopefully learn what these mythical fat tails do. 

Why would we use the \(t\) or the \(z\) test?

Before we get into the nitty-gritty of how these two distributions are related, we must ask: what sort of data would we be using them for?

The context in which these two distributions come up most frequently is hypothesis testing for normal data. Suppose that we have a sample of \(n\) data points \( x_i \) with a mean of \( \bar{x} \), and we have reason to assume that this data was drawn from a normal distribution (more on this later). Now suppose that we want to test whether the mean of this data is "significantly different" from what we expect. In this situation we would need to do a hypothesis test, and both the \(t\) test and the \(z\) test are candidates.

What does "significantly different" mean in this context? In order to answer this let us get a little more formal.  If we let \( \mu \) be the actual population mean, we are testing the hypothesis that:  \[ \begin{align} H_0&:  \mu = \mu_0 \\ H_a&: \mu  \neq \mu_0 \end{align}\]

Our question then becomes: if \(H_0\) is true, what is the probability that we would get a sample mean at least as extreme as \( \bar{x} \) from a random sample, and is that probability small enough to incite incredulity about the null hypothesis? A test is precisely the calculation of that probability (called a \(p\)-value) and the comparison of it to a set "significance level": the highest probability that would still cause you to doubt the null. (Famously, this is usually \(\alpha = .05 \).)

Let us create a model to test this! If we have assumed that our population is normally distributed and we expect a mean of \( \mu_0 \), then we can model a sample of \( n \) data points as a set of \(n\) independent, identically distributed random variables \( X_i \sim N(\mu_0, \sigma^2) \). Immediately your suspicions should be aroused: what is \( \sigma^2 \)? Let us just assume we know it for now, but this will be a pivotal question later on.

We can model the distribution of sample means under the null as the distribution of the random variable \( \bar{X} = \frac{1}{n} \sum\limits_{i=1}^{n} X_i \). This is called the sampling distribution, and it is a theorem that \( \bar{X} \sim N(\mu_0, \sigma^2/n) \); a proof is shown here. Using this model we can find the probability of observing a sample mean at least as far from \( \mu_0 \) as \( \bar{x} \), often called the \(p\)-value.
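As a quick sanity check (a minimal sketch, not part of the derivation, with \( \mu_0 \), \( \sigma \), and \( n \) chosen arbitrarily for illustration), we can simulate this sampling distribution and confirm that the simulated means have mean \( \mu_0 \) and variance \( \sigma^2/n \):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative values (not from any real data set)
mu0, sigma, n = 10.0, 2.0, 25
n_sims = 100_000

# Draw many samples of size n under the null and record their means
sample_means = rng.normal(mu0, sigma, size=(n_sims, n)).mean(axis=1)

# The simulated means should be close to N(mu0, sigma^2 / n)
print(sample_means.mean())  # ~ mu0 = 10
print(sample_means.var())   # ~ sigma**2 / n = 0.16
```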

It is here that we must stop and evaluate our assumptions, because blindly following this procedure can get us into trouble.

So far we have assumed: 

  1. Our population is normally distributed with mean \( \mu_0 \) 
  2. Our sample consists of independent, identically distributed draws
  3. We know our variance \( \sigma^2 \) 

It is this last assumption that differentiates the \(z\) test from the \(t\) test.

The \(z\) Test 

The \(z\) test has all of the same assumptions as the procedure outlined above; the only difference is the use of the \(z\) statistic.

A normal distribution with \( \mu = 0, \sigma = 1 \) is typically denoted \(Z \sim N(0,1) \). The \(z\) test takes the sampling distribution described above, namely \( \bar{X} \sim N(\mu_0, \sigma^2/n) \), transforms it into a \(Z\) distribution, and then uses this new sampling distribution to find a \(p\)-value. This was mainly done historically so that people could use tables of cumulative probabilities for the known \(Z\) distribution instead of having to calculate the values for the actual sampling distribution, but mathematically there is no difference.

If we let \(Y = \frac{\bar{X} - \mu_0}{\sqrt{\sigma^2/n}} \) we see that the expectation \( \mathbf{E}[Y] = 0\). Moreover, we see that \[ \mathbf{Var}[Y] = \mathbf{E}[Y^2] - \mathbf{E}[Y]^2 = \mathbf{E}[Y^2]\] A standard Gaussian integral shows that this is equal to \( 1 \). Therefore, the \(p\)-value can be computed from the standard normal: \[ P_{\bar{X}}(\bar{x} \mid \mu = \mu_0) = P_{Z}\left( \frac{\bar{x} - \mu_0}{\sqrt{\sigma^2/n}}\right) \] This is a \(z\)-test, and it has all the same assumptions we made above.
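As a concrete example of the mechanics (a sketch with invented numbers, assuming SciPy is available), here is a two-sided \(z\) test in Python:

```python
import numpy as np
from scipy import stats

# Invented numbers, chosen only to illustrate the mechanics
mu0, sigma, n = 10.0, 2.0, 25
x_bar = 10.9

# Standardize the observed sample mean into a z statistic
z = (x_bar - mu0) / (sigma / np.sqrt(n))

# Two-sided p-value from the standard normal survival function
p_value = 2 * stats.norm.sf(abs(z))
print(z, p_value)  # z = 2.25, p ~ 0.024
```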

Now we must face the big question: what is the variance?

Estimating Variance

The variance of a random variable \(X\) is defined as \[ \sigma^2 = \mathbf{E}[ (X - \mathbf{E}[X])^2 ] \] If \(X\) is uniformly distributed over a finite set of \(n\) values \(x_i\) with mean \( \bar{x} \), this works out to \[\frac{1}{n} \sum\limits_{i = 1}^{n} (x_i - \bar{x})^2 \]

If you have taken a statistics class this might shock you: "isn't the variance \( \frac{1}{n-1} \sum\limits_{i = 1}^{n} (x_i - \bar{x})^2 \)?"

This is where the difference between the variance and a variance estimate comes in! See, if we want to estimate the variance of the distribution from which a sample was drawn, it might make sense to use \[\frac{1}{n} \sum\limits_{i = 1}^{n} (x_i - \bar{x})^2 \] as an estimator, but it turns out that this is a biased estimator of the variance! Instead, we often use Bessel's correction, which gives an unbiased estimator, namely the more familiar \[ \frac{1}{n-1} \sum\limits_{i = 1}^{n} (x_i - \bar{x})^2 \] Nevertheless, the square root of this estimator is not an unbiased estimator of the standard deviation, since the square root is not a linear operator.
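A small simulation makes the bias visible (a sketch with arbitrary parameters, using NumPy's `ddof` argument to switch between the \(1/n\) and \(1/(n-1)\) estimators):

```python
import numpy as np

rng = np.random.default_rng(1)

# True variance and a deliberately small sample size (arbitrary choices)
sigma2, n, n_sims = 4.0, 5, 200_000

samples = rng.normal(0.0, np.sqrt(sigma2), size=(n_sims, n))

# ddof=0 divides by n (biased); ddof=1 divides by n-1 (Bessel's correction)
print(samples.var(axis=1, ddof=0).mean())  # ~ sigma2 * (n - 1) / n = 3.2
print(samples.var(axis=1, ddof=1).mean())  # ~ sigma2 = 4.0
```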

We now have an estimate of the standard deviation; the question is: how much does the uncertainty in that estimate cause our sampling distribution to be off?

This is where the \(T\)-distribution comes in: instead of simply plugging in the corrected sample variance, we actually model the distribution of sample variances as well. If we have a sample of \(n\) independent, identically distributed random variables \(Z_i \sim N(0,1) \), then \[ \sum\limits_{i = 1}^n Z_i^2 \sim \chi^2_n \] which is known as the \( \chi^2 \) distribution with \(n\) degrees of freedom. If we instead center each \(Z_i\) at the sample mean \( \bar{Z} \) rather than the true mean, one degree of freedom is lost and \[ \sum\limits_{i = 1}^n (Z_i - \bar{Z})^2 \sim \chi^2_{n-1} \]

This is the distribution that governs the Bessel-corrected sample variance (a quick simulation check follows below). From it we can derive the \(T\) distribution.
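Here is that check as a rough sketch (parameters made up for illustration): the Bessel-corrected sample variance, scaled by \((n-1)/\sigma^2\), should behave like a \( \chi^2_{n-1} \) random variable.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
sigma, n, n_sims = 1.0, 8, 100_000

samples = rng.normal(0.0, sigma, size=(n_sims, n))
s2 = samples.var(axis=1, ddof=1)  # Bessel-corrected sample variance

# (n - 1) * s^2 / sigma^2 should follow a chi-square with n - 1 degrees of freedom
stat = (n - 1) * s2 / sigma**2
print(stat.mean(), stats.chi2(df=n - 1).mean())  # both ~ n - 1 = 7
print(stat.var(),  stats.chi2(df=n - 1).var())   # both ~ 2(n - 1) = 14
```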

The \(T\) Distribution  

The \(T\) distribution arises from asking about the distribution of the \(t\) statistic \[ t = \frac{ \bar{x} - \mu }{\hat{\sigma} / \sqrt{n}} \] This should look familiar: it looks a lot like the \(z\) score! In fact, the \(T\) distribution is just the standardized sampling distribution for a normally distributed population once the effect of estimating the variance is taken into account.

In order to take this into account, we first note that the distribution of the sample mean is \( \bar{X} \sim N(\mu, \sigma^2/n) \). Then, if we shift this distribution by \( -\mu \), we get a centered distribution \( \bar{X} - \mu \sim N(0, \sigma^2/n) \).

Now we need the distribution of \( \hat{\sigma}^2 \). From the previous section we know that \[\chi^2_{n-1} = \sum\limits^n_{i = 1} \left(\frac{X_i - \bar{X}}{\sigma} \right)^2 \] which implies that \[\hat{\sigma}^2 = \frac{\sigma^2}{n-1} \chi^2_{n-1} = \frac{1}{n-1} \sum\limits^n_{i = 1} \left(X_i - \bar{X}\right)^2\]

Then, from Basu's theorem, we know that \( \bar{X} \) and \( \hat{\sigma}^2 \) are independent, so we can substitute in our estimated standard deviation: \[T = \frac{\bar{X}-\mu}{\frac{1}{\sqrt{n}}\cdot\sqrt{\frac{\sigma^2}{n-1} \chi^2_{n-1}}} \] which we can separate out into \[\frac{\bar{X}-\mu}{\frac{\sigma}{\sqrt{n}}}\cdot\frac{1}{\sqrt{\chi^2_{n-1}/(n-1)}} = \frac{Z}{\sqrt{\chi^2_{n-1}/(n-1)}} \] This is the \(T\) distribution with \(n-1\) degrees of freedom! Instead of using a normal distribution as our sampling distribution and doing a \(z\) test, we simply use this distribution with the \(t\) statistic.
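To see the fatter tails in action, here is a rough simulation sketch (parameters invented for illustration): we compute \(t\) statistics from many small normal samples and compare their tail probability with what the \(t\) distribution with \(n-1\) degrees of freedom and the standard normal each predict.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
mu, sigma, n, n_sims = 0.0, 1.0, 6, 200_000

samples = rng.normal(mu, sigma, size=(n_sims, n))
t_stats = (samples.mean(axis=1) - mu) / (samples.std(axis=1, ddof=1) / np.sqrt(n))

# Tail probability beyond |t| = 2: the t distribution with n-1 df matches the
# simulation, while the standard normal understates it (the "fatter tails")
print((np.abs(t_stats) > 2).mean())  # ~ 0.10
print(2 * stats.t(df=n - 1).sf(2))   # ~ 0.102
print(2 * stats.norm.sf(2))          # ~ 0.046
```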

Small vs Large Samples

So, why not always use the \(t\) distribution? Mostly because, as \(n\) gets bigger, it becomes almost exactly the same as using the \(Z\) distribution: the factor \( \sqrt{\chi^2_{n-1}/(n-1)} \) concentrates around \(1\) as the sample size grows, so the \(t\) distribution converges to the standard normal.
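One way to see this convergence (a small sketch, not from the derivation above) is to compare two-sided 5% critical values of the \(t\) distribution with the normal one as the degrees of freedom grow:

```python
from scipy import stats

# Two-sided 5% critical values: the t quantile approaches the normal quantile
for df in (2, 5, 10, 30, 100, 1000):
    print(df, stats.t(df=df).ppf(0.975))
print("normal", stats.norm.ppf(0.975))  # ~ 1.960
```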

Is My Distribution Normal? 

We have talked a lot about the third assumption on our list, namely that we know our variance, but we haven't really discussed the normality assumption. There are many different tests of normality available, one of the most powerful of which is the Shapiro-Wilk test, but there is a problem (maybe a topic for another blog): at large sample sizes, nearly-normal distributions (i.e. symmetric, with small excess kurtosis) get rejected because of tiny deviations from normality. However, there is some solace in the central limit theorem, which tells us that the sampling distribution of the mean converges weakly to a normal distribution for most underlying distributions as the sample size increases. There is still the problem of small sample sizes, which raises the question of how "robust" the \(t\)-test is. This is hard to answer, but for small sample sizes the \(t\) test does not appear to be robust to large skewness; in almost all cases, though, a simulation can help test whether skewness or leptokurtosis significantly hurts the type-I error rate.
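As a rough illustration of both points (hypothetical parameters, and only a sketch of the kind of simulation one might run), here is how one could call the Shapiro-Wilk test in SciPy and estimate the \(t\) test's type-I error rate for a strongly skewed population:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Shapiro-Wilk on a genuinely normal sample vs. a strongly skewed one
w_norm, p_norm = stats.shapiro(rng.normal(size=50))
w_skew, p_skew = stats.shapiro(rng.exponential(size=50))
print(p_norm, p_skew)  # typically large vs. very small

# Rough check of the t test's type-I error rate for skewed data:
# sample from an exponential with mean 1 and test H0: mu = 1 at alpha = 0.05
n, n_sims, rejections = 10, 20_000, 0
for _ in range(n_sims):
    x = rng.exponential(scale=1.0, size=n)
    if stats.ttest_1samp(x, popmean=1.0).pvalue < 0.05:
        rejections += 1
print(rejections / n_sims)  # compare with the nominal 0.05
```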