Sample Mean Derived as Least Squares

The formula for the sample mean makes intuitive sense, but it seems to come from nowhere. On this page we derive the sample mean from first principles, as the single value that keeps the deviations of the data from the estimate as small as possible (in a particular sense).

The idea is that we have data $\{x_1, x_2, \dots, x_n\}$ and we want to choose a single number $\hat{\mu}$ that is as 'close' to the data as possible. But what does 'close' mean here?

We can think of the distance from each observation to our chosen number: for the $i^{th}$ observation it is $x_i - \hat{\mu}$. We want these differences, added up over all observations, to be small. But if we simply add the raw differences, negative differences offset positive ones, so we should instead apply a function to each difference that is always nonnegative before summing.
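As a quick illustration of why the raw differences are a poor measure of closeness, here is a small sketch (in Python with NumPy, which the page does not otherwise assume, and with made-up data): two very different data sets have raw deviations that sum to exactly zero at the same candidate value, while a nonnegative function of the deviations, such as the square, tells them apart.

```python
import numpy as np

# Two very different data sets, both centered at 50.
tight = np.array([49.0, 51.0])
spread = np.array([0.0, 100.0])

mu_hat = 50.0
for data in (tight, spread):
    raw = np.sum(data - mu_hat)              # positive and negative deviations cancel
    squared = np.sum((data - mu_hat) ** 2)   # squaring keeps every deviation positive
    print(raw, squared)
# prints 0.0 for the raw sum in both cases, but 2.0 vs 5000.0 for the squared sum
```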

Consider the sum of the squared differences, i.e. sum up $(x_i - \hat{\mu})^2$ over all the data. We have $$ \sum_{i=1}^n (x_i - \hat{\mu})^2. $$ We know how to find the $\hat{\mu}$ that minimizes this; it is just calculus. Differentiating with respect to $\hat{\mu}$ and setting the result to zero gives the first order condition $$ -2 \sum_{i=1}^n (x_i - \hat{\mu}) = 0, $$ which simplifies to $\sum_{i=1}^n x_i = n \hat{\mu}$. Solving for $\hat{\mu}$ we get $$ \hat{\mu} = \frac{1}{n} \sum_{i=1}^n x_i, $$ which is our sample mean estimator. (The second derivative is $2n > 0$, so this critical point is indeed a minimum.)
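To see the calculus result in action, here is a minimal sketch, assuming Python with NumPy and SciPy (neither is part of the page itself) and an arbitrary made-up data set: it minimizes the sum of squared differences numerically and checks that the minimizer agrees with the sample mean.

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical data; any numbers would do.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Minimize the sum of squared differences over mu_hat numerically.
result = minimize_scalar(lambda mu: np.sum((x - mu) ** 2))

print(result.x)    # ~5.0, the numerical least-squares minimizer
print(np.mean(x))  # 5.0, the sample mean -- they agree
```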

So this is where the formula comes from. If we chose a different function of the differences instead of squaring them, we would get a different estimator of the center. For example, if we instead took the absolute value of the difference, $|x_i - \hat{\mu}|$, then minimizing the sum would lead to the sample median.
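This comparison can also be checked numerically. The sketch below, again assuming NumPy and a made-up data set with one large outlier, evaluates both the squared loss and the absolute loss on a grid of candidate centers: the squared loss is minimized (up to the grid spacing) at the sample mean, and the absolute loss at the sample median.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # hypothetical data with a large outlier

# Evaluate both losses on a fine grid of candidate centers.
grid = np.linspace(x.min(), x.max(), 100001)
sq_loss = ((x[:, None] - grid) ** 2).sum(axis=0)
abs_loss = np.abs(x[:, None] - grid).sum(axis=0)

print(grid[sq_loss.argmin()], np.mean(x))     # ~22.0 vs 22.0 (squared loss -> mean)
print(grid[abs_loss.argmin()], np.median(x))  # ~3.0  vs 3.0  (absolute loss -> median)
```

The outlier also makes the difference between the two estimators visible: the single value 100 pulls the least-squares center up to 22, while the median, and hence the least-absolute-deviations center, stays at 3.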