Sources of Non-Representativeness

In the main text we saw that data should be collected at random from the population. Self selection leads to incorrect conclusions when the decision to self select is not independent of the outcome we are trying to measure. Every data collection exercise faces the problem of obtaining data that is truly representative of the features of the population we want to measure, and this is often not easy. Here we look at some other common sources of trouble that lead to the same mathematical problem we saw in the text.

Survivorship Bias

Suppose, as we discussed in Chapter 2, we are interested in investing in mutual funds and so want to examine the likely returns. As we did in Chapter 2, we can collect information on all the funds that have been around for, say, 5 years and look at their returns over this period. Does this give us a good idea of actual returns over the last 5 years? One problem with this approach is that we are observing the returns of the funds conditional on their not having been closed down. Funds are usually closed down because investors desert them or will not invest in them, and investors avoid funds with bad return records, so the funds that have closed down, and hence are not in our sample, are most likely the ones with low returns. The return conditional on staying in business is therefore likely higher than the return over all the funds we could have invested in 5 years ago. We refer to this as 'survivorship' bias because we observe the data conditional on surviving, or staying in business.

The math for survivorship bias is the same as that presented for the avocado example in the main text, except that instead of our $X$ variable taking two values, it takes on many values. The conditional distribution still exists, and it is likely not equal to the marginal distribution of returns.
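To make this concrete, here is a minimal simulation sketch (not from the text, with invented numbers) that compares the mean return over all funds with the mean conditional on surviving. The closure rule, where the bottom 30% of performers are shut down, and the return parameters are purely illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 10,000 funds, 5 years of annual returns drawn from a
# normal distribution. Funds whose cumulative return falls in the bottom 30%
# are closed down -- a crude stand-in for investors deserting poor performers.
n_funds, n_years = 10_000, 5
annual_returns = rng.normal(loc=0.06, scale=0.15, size=(n_funds, n_years))
cumulative = np.prod(1 + annual_returns, axis=1) - 1

survived = cumulative > np.quantile(cumulative, 0.30)

# Marginal mean: the average over every fund we could have bought 5 years ago.
# Conditional mean: the average over funds still open, i.e. what a naive
# sample of surviving funds would show us today.
print(f"mean return, all funds:      {cumulative.mean():.3f}")
print(f"mean return, survivors only: {cumulative[survived].mean():.3f}")
```

Because closure is tied to poor performance, the conditional mean among survivors comes out noticeably above the mean over all funds, which is exactly the gap between the conditional and marginal distributions described above.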

Intentional Bias

It is not all that hard to build intentional bias into a sample, since a sample is a subset of the possible measurements and the researcher often controls how the data are collected. It can even be done in a nuanced way that is not at all obvious. For example, suppose you are advocating for a new stadium for your city's sporting team. To show support (needed because tax dollars will be spent on it) you could survey the opinions of locals. Obtaining a truly random sample is hard, and serious survey companies work hard to do it. But since you want to show strong support, your job is easier: just randomly select people who happen to be in places where you expect supporters of the local team to be, say areas with lots of sports bars, and not so much the sidewalk outside the local symphony hall. You can always claim you were careful to choose participants randomly, but you have done so in a way that the population you are randomly choosing from is not the population of locals but the population of locals who hang out near sports bars.

Just Plain Bad Sampling

As noted, it is actually difficult to get a really good random sample of a population. A standard faulty approach is known as 'convenience sampling'. This just means sampling from an easy-to-reach subpopulation instead of working hard to obtain subjects who could come from any part of the population you are trying to measure. For example, for a study of the need for a Starbucks in your local strip mall, you might think that standing in the strip mall and taking a random sample of those walking past is a good approach. It is certainly easy, but it misses all those locals who might want a Starbucks yet are not interested in the stores that are currently there, so they never get sampled. This is more of a problem when hard-to-reach demographic groups might influence the study.
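A rough sketch of why this matters, with numbers invented purely for illustration: suppose 40% of locals already visit the strip mall and would use a Starbucks with probability 0.7, while the other 60% rarely visit and would use one with probability 0.4. A convenience sample taken in the mall only reaches the first group, so its estimate of demand reflects that group alone rather than the mixture that describes the town.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population of 50,000 locals (all numbers are assumptions).
n = 50_000
visits_mall = rng.random(n) < 0.40        # 40% regularly visit the strip mall
wants_starbucks = np.where(
    visits_mall,
    rng.random(n) < 0.70,                 # mall visitors: 70% would use a Starbucks
    rng.random(n) < 0.40,                 # non-visitors: 40% would use one
)

# True demand in the whole population vs. a convenience sample of 500 people
# intercepted at the mall (i.e. drawn only from visitors).
true_share = wants_starbucks.mean()
visitors = np.flatnonzero(visits_mall)
sample = rng.choice(visitors, size=500, replace=False)
convenience_share = wants_starbucks[sample].mean()

print(f"true share wanting a Starbucks:        {true_share:.3f}")
print(f"convenience-sample estimate (at mall): {convenience_share:.3f}")
```

Under these assumed numbers the convenience estimate tracks the mall-visitor rate (about 0.7) rather than the population mixture (about 0.52), and interviewing more people at the mall does nothing to close that gap.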