In any statistical study, we need to consider carefully how the data was generated and how this impacts the assumptions we make when drawing inferences from the data. In the previous chapters we have built impressive tools for understanding the sampling variation of sample means. However, we have made some assumptions along the way. The important one is that our data is being generated from random variables with a particular mean and variance. When our interest is the mean of the random variable, it is extraordinarily important that our data really does come from this random variable and not from some other random variable with a different mean. Mathematically, although $EX$ equals some value, that value might not be the quantity we are trying to measure.
It is clear what would happen if we measure the wrong thing: if the law of large numbers held, we would learn all about the mean of $X$, which is not particularly helpful if this is not what we are trying to measure. Unfortunately, this situation occurs often in statistics. Indeed, when statistics goes wrong, the reason is usually this rather than a mathematical error. The next section examines this problem and develops tools to understand how and when it might arise.
So what can go wrong? We made the assumption that the data was a VSRS - basically this means that each observation comes from the same distribution, and that each observation could have taken any value in that distribution with the probabilities the distribution assigns. This leads to a number of considerations.
1. We must be able to answer "Is the sample representative of the population that I care about?" in order to determine how far the results of the analysis can be generalized.
2. Are the data iid draws from this representative population? The desire for a VSRS suggests that we should collect our data in a way that maximizes the chance that the assumptions underlying our statistical analysis are true. If we are in charge of collecting data, then we may want to use techniques that ensure that it is a VSRS.
As we have seen, when the data is a VSRS we are in a good position to understand the randomness in the sample mean and understand what is going on. Outside this framework, the problems are often still solvable, but the techniques are more complicated.
By 'representativeness' we mean that the observations we make, even if they are a random sample from a population with mean $\mu$, are drawn from random variables whose mean is actually the mean we are interested in studying. This is not always the case. Consider the following situations.
Return to the examples and think about them a bit more.
We will work with the last example to show how our tools can be used to understand this problem. The local TV station poll asks locals to vote Yes or No depending on whether the person is for banning imports of avocados (a Yes vote) or for lifting the ban (a No vote). Suppose that at the end of the voting period they find that 60% of the votes are for the ban and 40% against. Is this clear evidence that locals are for a ban?
Suppose that the joint distribution of $X$, which measures whether a person is for or against the ban ($X=1$ is for the ban), and $Y$, which measures whether a person votes ($Y=1$ means the person votes), is given in the following table.
| $X \backslash Y$ | 0 | 1 | $P[X=x]$ |
|---|---|---|---|
| 0 | 0.76 | 0.04 | 0.80 |
| 1 | 0.14 | 0.06 | 0.20 |
| $P[Y=y]$ | 0.90 | 0.10 | 1.00 |
The problem we have with a web poll is that not all of the population votes. Here, $Y$ measures the voting behavior. The marginal distribution of $Y$ shows that 90% of the people do not vote (people have lives, often are not that interested in avocados, and might not watch that channel or surf that channel's website) and only 10% do. Further, looking at the marginal distribution for $X$, we see that only 20% support the ban. So where did our 60% supporting the ban come from?
The poll does not capture everybody in the community, it captures only those who vote on the website. Consider the distribution of $X$ conditional on $Y=1$, i.e. the distribution conditional on the person actually bothering to vote. The conditional distribution (use our usual formulas from Chapter 4) is as follows.
| $X$ | $P[X=x \mid Y=1]$ |
|---|---|
| 0 | $\frac{0.04}{0.1} = 40\%$ |
| 1 | $\frac{0.06}{0.1} = 60\%$ |
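A short simulation makes the selection effect concrete. This is a sketch, not part of the original example: the sample size and random seed are arbitrary choices, and only the four joint probabilities come from the table above.

```python
import random

random.seed(0)

# Joint distribution from the table: P(X = x, Y = y)
# X = 1 means "for the ban", Y = 1 means "votes in the poll"
joint = {(0, 0): 0.76, (0, 1): 0.04, (1, 0): 0.14, (1, 1): 0.06}

# Draw a large sample of (X, Y) pairs from the joint distribution
outcomes = list(joint.keys())
weights = list(joint.values())
sample = random.choices(outcomes, weights=weights, k=100_000)

# Population share supporting the ban: the mean of X over everyone
pop_share = sum(x for x, y in sample) / len(sample)

# What the web poll sees: X conditional on Y = 1 (only those who vote)
voters = [x for x, y in sample if y == 1]
poll_share = sum(voters) / len(voters)

print(f"population support: {pop_share:.2f}")  # close to 0.20
print(f"poll support:       {poll_share:.2f}")  # close to 0.60
```

The poll's 60% is not a computational error; it is exactly the conditional distribution $P[X=x \mid Y=1]$, which differs from the marginal distribution of $X$ because voting and opinion are dependent.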
The chance that the types of people who enter the poll differ from the population at large is high for many questions. In this example, farmers, who are more likely to have a strong economic interest in the question, are almost sure to enter the poll. These days, for many such issues in the media, emails fly around to alert interested parties who want to make sure they have their say. Consumers do not have much at stake on this particular question, so they are less likely to bother voting. In general, most people do not bother with these types of polls - this is known in survey jargon as a low response rate.
This is why tabloid shows and local news call these phone-in polls 'non-scientific polls': they are not well grounded, and their results are meaningless unless we also have information on who calls in, which typically depends on the question.
We refer to this problem as 'self selection': since the respondents to the poll are allowed to select themselves into the sample, we do not get a 'representative' sample. This illustrates one aspect of what we mean by a nonrepresentative sample - the mean of the sample is different from the mean of the target population. Self selection problems are widespread and often lead to misinterpretations of data. For example, suppose you wanted to construct a job training scheme, and to see how well it works you run a small scale experiment. First, you advertise for unemployed people, then put the 50 or so that reply into a training scheme, and then see if they find employment. Suppose that most do, so you scale up the scheme to all unemployed in your city. Would we expect the same good results as in the pilot scheme? No, because of the self selection effect. The 50 or so that first sign up are likely to be the most motivated; when we scale up we get both motivated and unmotivated types, so the distributions are different.
There is one situation in which such a poll will be correct: when the marginal and conditional distributions are the same. This is of course the situation where $X$ and $Y$ are independent, i.e. in words, the chance that someone calls in (or is sampled) is independent of whether they are for or against the posed question. Typically, this is a defining element of a 'scientific' or valid poll. You need to select the participants in a way that ensures this independence. We will see how to do this, but basically any time you allow the participants themselves to decide whether or not to be in a poll, this independence is likely to be questionable.
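We can check this claim directly with the avocado example's marginals. The sketch below assumes independence, builds the joint distribution as the product of the marginals, and verifies that the conditional distribution given $Y=1$ recovers the marginal of $X$:

```python
# Marginals from the avocado example
p_x1 = 0.20  # P(X = 1): share of the population supporting the ban
p_y1 = 0.10  # P(Y = 1): share of the population who vote

# Under independence the joint distribution factors into the
# product of the marginals: P(X = x, Y = y) = P(X = x) * P(Y = y)
joint = {
    (0, 0): (1 - p_x1) * (1 - p_y1),
    (0, 1): (1 - p_x1) * p_y1,
    (1, 0): p_x1 * (1 - p_y1),
    (1, 1): p_x1 * p_y1,
}

# Conditional distribution of X given Y = 1, using the usual formula
# P(X = 1 | Y = 1) = P(X = 1, Y = 1) / P(Y = 1)
p_x1_given_y1 = joint[(1, 1)] / (joint[(0, 1)] + joint[(1, 1)])
print(p_x1_given_y1)  # equals the marginal P(X = 1) = 0.20
```

Under independence the poll of voters would report 20% support, the true population share, rather than the 60% produced by the dependent joint distribution in the table.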
All of the problems with the $X_i$ being what we wanted to measure - or, said differently, with their means being the $\mu$ we want to estimate - are known collectively as 'internal validity'. We say we have internal validity when the study really does measure what we want to measure.
Another problem arises when the target population cannot be sampled from randomly, but a subpopulation can. In this case the math is the same as before: we need the marginal distribution of the population we are interested in to be equal to the marginal distribution of the group we can measure. When a subpopulation is measured and we believe that the results also apply to a larger population, we say that the study extends to the larger population. More generally we refer to this as 'extendability' of the study, i.e. to which larger populations does the study apply. Another name for this is 'external validity'. An example is violence studies. There have been quite a number of studies, all along the same lines, trying to measure the effect of observing violent media on actually being violent. A standard approach is to show cartoons (some violent, some not) to pre-schoolers and see if this leads them to act out more violently in the playground (see Box X for more detail). Suppose you found an effect; would it extend to adults watching violent movies? Another critical example is medical studies. Often drug studies are run on a particular subpopulation, say white males, and then the results are assumed to apply to all people. Do these studies extend this way?
It should be clear from this that we need to carefully consider whether our data is representative of the population we intend to study, and what the limits are to how much we learn about other populations. Our assumptions of the previous chapter go beyond just this, however. We also want the data to come from a random sample. To consider both these questions, it is useful to break up the types of studies (or data) into three groups. These are experiments, surveys, and observational studies.
The different types of studies involve different levels of control over the data that is collected and hence different questions arise.
Experiments are distinguished by the feature that the researcher both directly applies and measures the effect of some `treatment' on the units (or subjects) of the study. So they control both which units receive the treatment and which observations enter the sample.
From the perspective of representativeness, being able to run an experiment is great because the control we have over the acquisition of the data allows us to take measures to ensure that we are measuring what we want to measure. We directly apply the treatment, so we can be more sure that we really are measuring the effect of the treatment. From the perspective of ensuring a random sample, there are methods within the way we conduct the experiment that help ensure that this assumption too holds for our sample.
Some examples will help make this clear. Note that experiments are run in many different fields of study, and whilst the techniques can differ across fields the aim is always the same: to ensure representativeness and to ensure our sampling assumptions hold.
In each of these examples, we have that the researcher imposes the treatment on the subjects. In the drug example, the doctor directly gives the drug, whose effects we want to measure, to the patient. For speed laws, we want to measure the effect different speeds have and in this experiment we directly impose the speed limit ourselves. For Project STAR, the researchers are directly choosing how big the classes are in studying the effect of class size on learning. This is the first step to concluding that these studies are indeed experiments.
The measurement of the subjects in each of the examples will become our data. In each case we control this too. In the drug example the doctor records how well the patient responds. For the speed laws example, we measure the effect on fatalities. For Project STAR, we randomly assign teachers and students and measure the effect of the different class sizes.
The control we have over the study in an experiment is very important for representativeness, i.e. for ensuring that the observations we take are actually coming from the distribution centered on the effect we want to measure. However, such control does not by itself ensure that we conducted the experiment well. There are some commonly applied tools in undertaking experiments that help improve the chances that the data is representative of the effect we are trying to measure.
Double Blind Experiments. A double blind experiment is one where the subjects do not know the extent to which they are treated (this alone is a single blind design) and the researcher also does not know the extent to which the subjects have been treated. The reason for this is best understood in a drug trial where the benefits of the drug are difficult to measure and may be subjective (so we are not measuring whether the patient died, but perhaps how much their pain decreased, etc.). When the subject does not know they are being treated, there is less chance that they report or act differently. A reaction of the subject to being treated is known as the Hawthorne effect; see the box for more information. Similarly, if the researcher knows which subjects received a stronger treatment, this can affect their assessment of the results. If the measurement is completely obvious and not subjective in any way, for example counting fatalities on a highway, the double blind approach is not really necessary.
Compliance. A real problem that can occur in studies is that the subjects do not comply with taking the treatment. If the subjects do not comply (or a large enough group of them does not comply) then clearly we are not measuring the effect we want, and there is a problem with the representativeness of the data. Consider the speed laws example. Here in Southern California, it does not appear from my casual observation that any drivers pay attention to posted speed limits. So an experiment that simply changes the posted speed limits might end up showing no effect, not because there is no effect of traffic speed on fatalities but because there is no effect of posted speed limits on fatalities; we would not be measuring what we wanted to measure. Experimenters take considerable efforts to ensure that the treatment is being observed, using their position in controlling the administering of the treatment, etc.
The control the researcher has over an experiment allows the experimenter to randomize over the treatment levels of the subjects. This is an important tool that does a number of things at once. First, correctly undertaken randomization ensures that we have a random sample. If the subjects are randomly chosen, there is no reason to think that the outcome of one subject will affect the outcome of a second subject, so it is reasonable to expect that the random variables that describe the observations in the sample are indeed independent of each other.
Randomization also has implications for the random variables not only being identical in distribution but also representative. Consider the Project STAR example. Here teachers were randomized to the classes, so we expect that the quality of the teacher is independent of the class size. Students were also randomized to the classes, which means that student ability is independent of class size. This helps with representativeness because we only want to measure the effect of class size, not the effect of teacher quality, on the test scores. Similarly, it avoids accidentally putting the better students in the smaller classes, which would result in test scores capturing both the effect of class size and student ability rather than what we really wanted to measure. In terms of ensuring that the distributions for each student are the same, being independent of teacher quality means that the distribution for a student in a small class will be the same as that for a student in a large class.
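The mechanics of random assignment can be sketched in a few lines. This is an illustration, not Project STAR's actual procedure: the roster size, the "ability" scores, and the two-group split are all invented for the example.

```python
import random

random.seed(1)

# Hypothetical roster: each student has an unobserved "ability" score.
# These numbers are illustrative assumptions, not data from the study.
students = [{"id": i, "ability": random.gauss(0, 1)} for i in range(200)]

# Randomly assign half the students to small classes, half to regular ones
random.shuffle(students)
small, regular = students[:100], students[100:]

# After randomization, average ability should be similar in both groups,
# so a difference in test scores can be attributed to class size alone
mean_small = sum(s["ability"] for s in small) / len(small)
mean_regular = sum(s["ability"] for s in regular) / len(regular)
print(round(mean_small, 2), round(mean_regular, 2))
```

The two group means will not be exactly equal in any one draw, but randomization guarantees they are equal in expectation, which is exactly the independence between assignment and ability that the argument above relies on.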
Part of understanding statistics is to know that studies can fail to measure what we are interested in. When looking at experimental results, regardless of field, a student that understands statistics would ask themselves whether or not the study was done in a way to ensure that the data were representative of what they are trying to measure. Was there compliance? Is there any chance that the results are affected by the way the experiment was administered? The main problem with experiments, given that researchers have a lot of tools available to fix these problems, is extendability. Does a study of market design in an underdeveloped community in one country extend to other communities in other countries with differences in culture? Does an experiment examining violence amongst toddlers extend to adults?
Experiments represent the situation where the researcher has the most control over the study. We turn now to Surveys, where less control is available.
Surveys are distinguished by the features that the researcher can no longer impose a treatment that they want to measure (often the concept of treatment in surveys is hard to think about) but they can choose which observations enter the sample. So they control which observations enter the sample but not the treatment itself.
Because we have control over the observations, we can use randomness. This helps both to ensure a VSRS and to obtain a representative sample. Because we do not impose the treatment directly, we need to be more careful about the representativeness of the sample.
Again, some examples will help make this clear.
Before discussing the main elements of this chapter, a few things to note with surveys. First, although many surveys are precisely what you think they are (someone asking questions and recording the answers), the way we have defined surveys gives them a wider meaning. What characterizes surveys for us is not the question-and-answer format, but the level of control the researcher has over the way the data is generated. For example, consider the speed limit effects on fatalities example from the previous section. Suppose instead of imposing speed limits randomly, we simply randomly chose locations that already have different speed limits and then recorded the number of fatalities. There is no questionnaire here, but this would be a survey. It is a survey because we had control over the subjects (here the particular areas to be examined) but not over imposing the treatment (actually randomly assigning the speed limit). The reason for the wider definition is that the questions we ask depend on the control over the data. A second point to note is that the notion of treatment in a survey is often difficult to think about. In the speed limit survey just described, it is clearly the effect of the speed limit. However, in a political poll there is no obvious notion of a treatment.
Because the treatment is not imposed by the researcher, we have to be much more careful about concluding that our sample is representative of what we are trying to measure. There are a lot of things that can go wrong. We take these in turn.
Non-representative Sampling. This is just a more general form of the sample selection problem we discussed earlier. In any survey we have a target population, from which we want to learn. For example, consider a political poll where we want to know the proportion of voters that will vote Democrat vs Republican, say in the California race. This sounds straightforward, but in practice it is anything but. California is big, with great variation in political beliefs between areas (compare Berkeley to Orange County). So a survey limited to certain areas will fail to be representative. A survey run only in the larger cities will ignore the rural vote. The problems are very large, and polling companies expend a great deal of resources and time to try to ensure that the population being sampled from is similar to the population of voters. Even simple surveys can easily go wrong. Suppose you want to run a survey on your campus to see if students are interested in building a new sports complex at the cost of additional student fees. If you stand at a central location on campus, you still might not get a representative sample because many students might not be coming to class (ones that have outside jobs, for example). How you conduct the survey matters as well. For example, a surveyor might favor one subgroup of students (maybe asking mostly the better looking women, or people in their own demographic because they are more comfortable). Randomization can fix this; for example, one trick is to let ten people go by after finishing with a subject before you approach another subject. The randomness of people walking past you guards against a non random sample.
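The "let ten people pass" trick is simple enough to write down as a rule. The sketch below simulates the passerby stream with numbered ids (an invented stand-in for real people); only the gap of ten comes from the text.

```python
def sample_passersby(stream, gap=10):
    """Approach one subject, then let `gap` people pass before the next.

    `stream` is an iterable of passersby in the order they walk by;
    returns the list of subjects approached. Letting the crowd's own
    randomness determine who arrives next removes the surveyor's
    discretion over whom to ask.
    """
    subjects = []
    skip = 0
    for person in stream:
        if skip == 0:
            subjects.append(person)
            skip = gap
        else:
            skip -= 1
    return subjects

# Simulated stream of 100 passersby, labeled by id
stream = list(range(100))
chosen = sample_passersby(stream)
print(chosen[:3], len(chosen))  # [0, 11, 22] 10
```

The key point is that the rule is fixed before anyone walks by, so which subjects end up in the sample depends only on the (random) order of the crowd, not on the surveyor's preferences.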
Survey Wording. For traditional surveys where a question is asked, the form of the question has been shown to strongly impact the response. For example, the wording can be such that it is clear what the writer of the survey wants the answer to be. Questions like 'Do you agree that ...' suggest that the respondent might be making a mistake by not agreeing. Most of the issues here are even more subtle, with question ordering and unclear questions resulting in poor data. Some discussion and guidelines from the American Association for Public Opinion Research are available here.
Non-Response. Not everyone surveyed responds. Just as in the sample selection problem (the math is the same), if the probability that someone responds is not independent of the views they hold on the survey question, then the distribution of the outcome we are trying to measure conditional on responding will be different from the marginal distribution of the effect we are trying to measure. For example, a survey of how good a service provided to you was might elicit a 100% response rate amongst customers who had a bad experience (because they are happy to have an avenue to complain) but a much smaller response rate amongst people happy with the service (because they do not want to spend extra time; they just want to move on). So this is really just a special case of sample selection. In good surveys where this is likely to be a problem, researchers make efforts to maximize response. Typical approaches are to follow up on non-responders, offer financial incentives for responding, etc. Unfortunately, when survey results are reported, they are often not accompanied by response rates. If the response rate was high, we feel better about representativeness; however, if we do not know the response rate, we have no idea whether the data is representative or not.
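A quick simulation shows how differential response rates distort the measured share. All the rates here are illustrative assumptions invented for the sketch (the text gives only the qualitative pattern of unhappy customers responding more):

```python
import random

random.seed(2)

# Hypothetical service survey: assume 30% of customers had a bad
# experience, unhappy customers always respond, and happy customers
# respond only 10% of the time. None of these are measured values.
n = 100_000
unhappy = [random.random() < 0.30 for _ in range(n)]
responds = [u or random.random() < 0.10 for u in unhappy]

# True share of unhappy customers in the whole customer base
true_share = sum(unhappy) / n

# Share of unhappy customers among those who actually responded
responses = [u for u, r in zip(unhappy, responds) if r]
survey_share = sum(responses) / len(responses)

print(f"true share unhappy:   {true_share:.2f}")    # close to 0.30
print(f"survey share unhappy: {survey_share:.2f}")  # close to 0.81
```

Under these assumed rates the survey reports roughly 81% dissatisfaction ($0.30 / (0.30 + 0.70 \times 0.10) \approx 0.81$) against a true rate of 30%: the same conditional-versus-marginal gap as in the avocado poll.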
In both experiments and surveys, the researcher had enough control to use randomization to try to ensure a random sample, and also to use methods to try to ensure that the data were representative. In the last study type, we do not have this tool. Nonetheless, it is still possible to learn a lot about the world.
Observational studies are distinguished by the features that the researcher can no longer impose a treatment that they want to measure, and they also cannot choose which observations enter the sample. So they control neither the treatment nor which observations enter the sample.
Typically the data is collected for other reasons, and then we use it to try to learn about some feature. Because we have no control over the observations, we cannot use randomness. This means we cannot be sure of either a VSRS or a representative sample. Because we do not impose the treatment directly, we need to be careful about the representativeness of the sample. Here we need to spend a lot of effort convincing ourselves that the study is valid.
Again, some examples will help make this clear.
Unfortunately, it is observational studies that we are mostly concerned with in economics and in the social sciences more generally. This is really why we know so little about the economy whereas in other areas scientists can learn fairly fast (even if it is expensive).
When we read a study that involves statistical analysis, we need to decide whether or not we find the study convincing. To do this we ask questions of representativeness and whether or not the application of the statistics is reasonable (if the data were not a random sample, there exist methods to deal with this, but they are beyond what this course covers). Questions of extendability should also be considered.
Copyright © Graham Elliott