7. Point Estimation

Before getting into point estimation, a few words on how the rest of the course differs from what we have been doing. So far we have basically worked in a setting where we know what the world looks like - that is, we know the distribution of the data - and we have been working out what the distribution of certain functions of the data would look like. That function has really just been the sample mean, but it is an interesting and useful one.

We will spend the rest of the time going in the reverse direction. Basically we want to know what we can learn about what the world looks like - that is, the distribution or features of the distribution that generated the data - from actually observing some data. We call this 'inference', and we will work through a few of the different things we can do.

There are three questions we will ask using sample means:

  1. Can I learn about $\mu$ from computing sample means? This is 'Point Estimation'.
  2. How do I check whether the sample mean agrees with my theory on $\mu$? This is 'Hypothesis Testing'.
  3. What values for $\mu$ are compatible with my estimated sample mean? This is 'Confidence Intervals'.

In real statistics we do these types of things with all the different statistics we might compute, whether it be the sample median, the sample variance, the histogram of the data, or the empirical cdf we saw in Chapter 2.

From the perspective of point estimation, the object we compute from the data is related to the parameters of the random variable (that we call the model). So we might be interested in the mean of a random variable $X$, and compute a sample mean hoping that its value is close to the actual mean of $X$, so that we have learned something about the random variable (model). Most models are quite complicated of course, and have many parameters, so we compute many estimates to learn about the model. But the ideas of what makes a good point estimate of a parameter are similar regardless of whether we are looking at the mean of a random variable or some other parameter.

In this class we keep it simple and just focus on the sample mean. However, you will be surprised by how many point estimates in statistics are either exactly a sample mean or very similar to one.

Recall from the previous chapter that if we have data $x_i$, $i=1,...,n$ and compute the sample mean $\bar{x}$, we can think of this, under some conditions, as the outcome of a random variable $\bar{X}$ which has mean $\mu$ and variance $\sigma^2/n$, and in large enough samples can be considered approximately normally distributed. Thus we are thinking of it as a draw from a distribution centered on what we are trying to learn about, namely $\mu$.

In the above figure $\bar{X}$ is drawn from a $N(1,2)$ distribution. But we do not know this; we have data and compute $\bar{x}=1.2$, which is where the vertical line is. Is it reasonable then to say that our best guess of the unknown mean is $1.2$? This is the question we are asking here. For point estimation, we are really asking what the sample mean tells us about $\mu$, and also whether computing the sample mean makes any sense as a way to learn about $\mu$. All of the ideas in the rest of this chapter are really about what makes a good estimator. For many problems, though, the sample mean works well.
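
To make this concrete, here is a minimal simulation sketch in Python (the values of $\mu$, $\sigma$ and $n$ are illustrative and not taken from the figure). Each sample gives one $\bar{x}$, and over many samples these values scatter around $\mu$ with spread roughly $\sigma/\sqrt{n}$.

```python
import numpy as np

# A minimal sketch: draw many samples from a population with known mean mu,
# compute the sample mean of each, and see that the sample means scatter
# around mu with spread sigma/sqrt(n). Values are illustrative, not the figure's.
rng = np.random.default_rng(0)
mu, sigma, n, reps = 1.0, 2.0, 50, 10_000

xbars = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)

print("average of the sample means:", xbars.mean())       # close to mu
print("sd of the sample means:     ", xbars.std(ddof=1))   # close to sigma/sqrt(n)
print("sigma/sqrt(n):              ", sigma / np.sqrt(n))
```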

7.1 Unbiasedness

The property of unbiasedness means that on average the statistic gives us the true parameter value. In terms of the figure above, over different samples we would get different values of $\bar{x}$, but on average these values, considered as estimates of $\mu$, equal $\mu$. In any given sample the estimate will not be exactly $\mu$, but on average it is.

More specifically, for any estimator $U(X)$ of a parameter $\theta$ of a distribution, we say that the bias of the estimator is $$ Bias(U(X)) = E(U(X))-\theta. $$ If the bias is zero, then we say that the estimator is unbiased.

It is clear that the sample mean is unbiased, since $E(\bar{X})=\mu$. Now consider our estimator for the variance. We saw earlier that we estimate the variance in the sample as $$ s^2 = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2 $$ and used this as an estimator for $\sigma^2$. Is it unbiased? I will not go through the math here, but the answer is yes, it is indeed unbiased. This explains the weird division by $n-1$ instead of $n$: if we had divided by $n$, the estimator would be biased towards zero. So instead we divide by $n-1$, which results in an unbiased estimator. Some texts refer to this correction as a 'degrees of freedom' correction, a meaningless term. You can call it what you want, but the reason for it is just to obtain unbiasedness.
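
If you want to see the bias without doing the math, a quick simulation works. This is a small sketch (with illustrative parameter values, not ones from the text) comparing the divide-by-$n$ and divide-by-$n-1$ estimators over many samples.

```python
import numpy as np

# A small simulation (illustrative values) of why we divide by n-1: average the
# two variance estimators over many samples and compare with the true sigma^2.
rng = np.random.default_rng(0)
mu, sigma, n, reps = 0.0, 3.0, 10, 100_000

samples = rng.normal(mu, sigma, size=(reps, n))
s2_n   = samples.var(axis=1, ddof=0)   # divide by n
s2_nm1 = samples.var(axis=1, ddof=1)   # divide by n-1

print("true variance sigma^2:          ", sigma**2)        # 9
print("average of the /n estimator:    ", s2_n.mean())     # about (n-1)/n * 9 = 8.1, biased down
print("average of the /(n-1) estimator:", s2_nm1.mean())   # about 9, unbiased
```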

In truth the main source of bias in practice is problems with collecting the data, such as the sample selection bias we saw in Chapter 5. For the more complicated statistics used in real research this remains true, although even in the best of cases having a biased estimator is often not all that bad. The main reason we might like an unbiased estimator is that if we are reporting the results to someone, we want them to understand that we are not pushing the estimate in one direction or another for nefarious purposes.

The main reason we might not care about unbiasedness is that it does not mean that the estimate is an accurate one. Each estimate is a draw from the distribution. If the distribution is very wide, then the estimate could still be unbiased but really far from $\mu$. We want estimates that are close to $\mu$.

7.2 Accuracy of Estimators

Unbiasedness is no guarantee of accuracy. We now turn to two notions. The first is a measure of how accurate, on average, we might expect an estimator to be. The second is the suggestion that the estimator is becoming more accurate (again on average) as the sample size increases.

7.2.1 Mean Square Error

We could consider how far our estimator $U(X)$ is, on average, from $\theta$. For any draw of the estimator, $u(x)$, the error is $u(x)-\theta$. If we just took the average of the errors, we would end up with the bias, which is not a measure of accuracy because negative and positive errors cancel. What we need is a measure where both negative and positive errors count against accuracy when we take the average.

The suggestion then is to take the squared error $(u(x)-\theta)^2$. In expectation this gives the mean squared error $$ MSE(U(X)) = E[(U(X) - \theta)^2]. $$ This looks a bit like a variance, but below we see that it is not. It does share the same issue as the variance, in that its units are the units of the data squared, which is hard to think about. But again we can just take the positive square root (the root mean squared error) and it is in the same units as the underlying data.

Why is it not the variance? Some math shows the answer. \begin{split} MSE(U(X)) &= E[(U(X) - \theta)^2] \\ &= E[((U(X) - E(U(X))) + (E(U(X)) - \theta))^2] \\ &= E[(U(X) - E(U(X)))^2] + (E(U(X)) - \theta)^2 \\ &= Var(U(X)) + [Bias(U(X))]^2 \end{split} Some of the steps might be obscure. When we square out the two terms in the second line, we end up with three terms: the first squared, the second squared, and two times a cross product. Because the expectation of a sum is the sum of the expectations, the two squared pieces appear in the third line. The cross product does not, because in expectation it is zero. This is because one part of the cross product, $E[U(X) - E(U(X))]$, equals zero by definition, and the part it multiplies, $(E(U(X)) - \theta)$, is a constant so it comes outside the expectation operator.

The result is that the MSE is just the variance of the estimator plus the bias squared. We want an estimator that is as accurate as possible, that is, one with a small MSE. We see we can get that by having a smaller variance and/or a smaller bias. This makes sense.
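
The decomposition is easy to check by simulation. In this sketch the estimator $U = 0.9\bar{X}$ is a made-up, deliberately biased estimator of $\mu$, chosen only so that the bias term is visible; the parameter values are illustrative.

```python
import numpy as np

# A sketch checking MSE(U) = Var(U) + Bias(U)^2 by simulation. U = 0.9*xbar is
# a made-up, deliberately biased estimator of mu; parameter values are illustrative.
rng = np.random.default_rng(0)
mu, sigma, n, reps = 2.0, 1.0, 20, 200_000

xbars = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
u = 0.9 * xbars                         # draws of the estimator U

mse  = np.mean((u - mu) ** 2)           # mean squared error, computed directly
var  = u.var()                          # variance of the estimator
bias = u.mean() - mu                    # its bias

print("MSE by simulation:", mse)
print("Var + Bias^2:     ", var + bias**2)   # the two should agree closely
```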

7.2.2 Consistency

The notion of consistency of an estimator is that for large enough samples, we can have as accurate an estimate as we desire. Alternatively, we can think of it as saying that as the sample size becomes infinitely large, we learn the true value of the parameter.

We have seen that for a VSRS with a sample of $n$ observations, $\bar{X} \sim N(\mu,\sigma^2/n)$ approximately, provided $n$ is large enough. Now consider the chance that we obtain a sample mean outside some chosen 'close' bound $\epsilon > 0$ around $\mu$.

We have \begin{split} P[| \bar{X} - \mu| > \epsilon] & = 2 P[ \bar{X} - \mu > \epsilon] \\ &= 2 P \left[ \frac{\bar{X} - \mu }{\sigma/\sqrt{n}} > \frac{\epsilon}{\sigma/\sqrt{n}} \right] \\ & \approx 2P \left[ Z > \frac{\sqrt{n} \epsilon}{\sigma} \right] \end{split}

As $n$ gets large, this is two times the probability that $Z$ is above a larger and larger number, which goes to zero. So the sample mean is consistent for the mean of the random variable we are sampling from. This is really the same idea as the law of large numbers in Chapter 5, so you can go back and see the results there (as well as the gif).
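
Here is a small sketch of consistency in action (with illustrative values of $\mu$, $\sigma$ and $\epsilon$): it estimates $P[|\bar{X}-\mu| > \epsilon]$ by simulation for growing $n$ and compares it with the normal approximation $2P[Z > \sqrt{n}\epsilon/\sigma]$. Both shrink towards zero.

```python
import math
import numpy as np

# A sketch of consistency: estimate P[|Xbar - mu| > eps] by simulation for
# growing n, and compare with the normal approximation 2*P[Z > sqrt(n)*eps/sigma].
# The values of mu, sigma and eps are illustrative choices, not from the text.
# (The population here is normal, so Xbar is exactly normal and the two
# columns should agree up to simulation error.)
rng = np.random.default_rng(0)
mu, sigma, eps, reps = 0.0, 2.0, 0.5, 10_000

for n in (10, 50, 200, 1000):
    xbars = rng.normal(mu, sigma, size=(reps, n)).mean(axis=1)
    simulated = np.mean(np.abs(xbars - mu) > eps)
    z = math.sqrt(n) * eps / sigma
    approx = math.erfc(z / math.sqrt(2))   # equals 2*P[Z > z] for a standard normal Z
    print(f"n={n:5d}  simulated={simulated:.4f}  normal approx={approx:.4f}")
```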

7.3 Summary

Most of what we did in Chapter 2, which we called descriptive statistics, is some form of point estimation. By building random variables as our model, we can think carefully about what we are measuring as an outcome of the random variables. This chapter is really about which properties of a point estimate make something like the sample mean a reasonable choice.

For example, in Chapter 2 we noted that the height of the histogram is just a sample average, estimated as $$ \text{frequency} = \frac{1}{n} \sum_{i=1}^{n} 1(\text{lower bound} \leq x_{i} < \text{upper bound})$$ Since this is a sample mean, the ideas of this chapter are relevant. Is it a good estimator? This depends on what we mean by good. It is unbiased if the data is a random sample, which is good. It is also consistent for the true probability that the data falls in this interval, which is also good.
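
As a small sketch of this point (with made-up data and an arbitrary bin), the bin frequency really is just a sample mean of indicator variables.

```python
import numpy as np

# A sketch: the histogram height (frequency) for a bin is the sample mean of
# the indicators 1(lower <= x_i < upper). The data and bin edges are made up.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=1000)
lower, upper = 0.0, 0.5

indicator = (x >= lower) & (x < upper)   # array of 0/1 values
frequency = indicator.mean()             # a sample mean, so the chapter's ideas apply

# For a N(0,1) population the true P(0 <= X < 0.5) is about 0.19; the frequency
# is an unbiased and consistent estimate of it.
print("estimated frequency for the bin:", frequency)
```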

The sample mean is a reasonable choice of estimator when we want to learn about the mean of a random variable. But the idea of a mean of a random variable is a very broad one, because we can take transformations of random variables to get new ones that give new means to be estimated. What we do see is that sample means from random samples are unbiased for the mean, they are consistent for the mean, and, although we did not show it, they are often the best estimator (smallest MSE) for the mean. These results require assumptions, basically on how often we might see outliers. You recall that outliers sometimes led to the sample median being more robust - this is still true here, and there are real problems where you might want to use the sample median instead of the sample mean. But for most problems it makes little difference, and we can see in the data whether the two measures give very different answers.
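
As a final illustration of the robustness point (with made-up numbers), a single large outlier moves the sample mean a lot but barely moves the sample median.

```python
import numpy as np

# A tiny illustration (made-up numbers): one large outlier moves the sample
# mean a lot but barely moves the sample median.
x = np.array([1.8, 2.1, 2.4, 1.9, 2.0, 2.2])
x_outlier = np.append(x, 40.0)

print("mean:", x.mean(), "median:", np.median(x))                   # without the outlier
print("mean:", x_outlier.mean(), "median:", np.median(x_outlier))   # with the outlier
```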