In any statistical study, we need to consider carefully how the data was generated and how this impacts the assumptions we make when drawing inferences from the data. In the previous chapters we have built impressive tools for understanding the sampling variation of sample means. However, we have made some assumptions along the way. The important one is that our data is being generated from random variables with a particular mean and variance. When our interest is the mean of the random variable, it is extraordinarily important that our data really does come from this random variable and not from some other random variable with a different mean. Mathematically, although $EX$ equals some value, that value might not be the quantity we are trying to measure.
It is clear what would happen if we measure the wrong thing: if the law of large numbers held, we would learn all about the mean of $X$, which is not particularly helpful if this is not what we are trying to measure. Unfortunately, this situation occurs often in statistics. Indeed, when statistics goes wrong, the reason is usually this rather than a mathematical error. The next section examines this problem and develops tools to understand how and when it might arise.
So what can go wrong? We made the assumption that the data was a VSRS - basically this means that each observation comes from the same distribution, and that each observation could have taken any value in that distribution with the probabilities the distribution assigns. This leads to a number of considerations.
1. We must be able to answer "Is the sample representative of the population that I care about?" in order to determine how far the results of the analysis can be generalized.
2. Are the data iid draws from this representative population? The desire for a VSRS suggests that we should collect our data in a way that maximizes the chance that the assumptions underlying our statistical analysis are true. If we are in charge of collecting data, then we may want to use techniques that ensure that it is a VSRS.
As we have seen, when the data is a VSRS we are in a good position to understand the randomness in the sample mean and understand what is going on. Outside this framework, the problems are often still solvable, but the techniques are more complicated.
By 'representativeness' we mean that the observations we make, even if they are a random sample from a population with mean $\mu$, are drawn from random variables whose mean is actually the mean we are interested in studying. This is not always the case. Consider the following situations.
Return to the examples and think about them a bit more.
We will work with the last example to show how our tools can be used to understand this problem. The local TV station poll asks locals to vote Yes or No depending on whether the person is for banning imports of avocados (a Yes vote) or for lifting the ban (a No vote). Suppose that at the end of the voting period they find that 60% of the votes are for the ban and 40% against. Is this clear evidence that locals are for a ban?
Suppose that the joint distribution of $X$, which measures whether a person is for or against the ban ($X=1$ is for the ban), and $Y$, which measures whether a person votes ($Y=1$ means the person votes), is given in the following table.
| $X \backslash Y$ | 0 | 1 | $P[X=x]$ |
|---|---|---|---|
| 0 | 0.76 | 0.04 | 0.80 |
| 1 | 0.14 | 0.06 | 0.20 |
| $P[Y=y]$ | 0.90 | 0.10 | 1.00 |
The problem we have with a web poll is that not all of the population votes. Here, $Y$ measures the voting behavior. The marginal distribution of $Y$ shows that 90% of the people do not vote (people have lives, often are not that interested in avocados, and might not watch that channel or surf that channel's website) and only 10% do. Further, looking at the marginal distribution for $X$, we see that only 20% support the ban. So where did our 60% supporting the ban come from?
The poll does not capture everybody in the community, it captures only those who vote on the website. Consider the distribution of $X$ conditional on $Y=1$, i.e. the distribution conditional on the person actually bothering to vote. The conditional distribution (use our usual formulas from Chapter 4) is as follows.
| $X$ | $P[X=x \mid Y=1]$ |
|---|---|
| 0 | $\frac{0.04}{0.1} = 40\%$ |
| 1 | $\frac{0.06}{0.1} = 60\%$ |
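A short simulation makes the selection effect concrete. This is a sketch, not part of the original example: the sample size and random seed are arbitrary choices, and only the four joint probabilities come from the table above.

```python
import random

random.seed(0)

# Joint distribution from the table: P(X = x, Y = y)
# X = 1 means "for the ban", Y = 1 means "votes in the poll"
joint = {(0, 0): 0.76, (0, 1): 0.04, (1, 0): 0.14, (1, 1): 0.06}

# Draw a large sample of (X, Y) pairs from the joint distribution
outcomes = list(joint.keys())
weights = list(joint.values())
sample = random.choices(outcomes, weights=weights, k=100_000)

# Population share supporting the ban: the mean of X over everyone
pop_share = sum(x for x, y in sample) / len(sample)

# What the web poll sees: X conditional on Y = 1 (only those who vote)
voters = [x for x, y in sample if y == 1]
poll_share = sum(voters) / len(voters)

print(f"population support: {pop_share:.2f}")  # close to 0.20
print(f"poll support:       {poll_share:.2f}")  # close to 0.60
```

The poll's 60% is not a computational error; it is exactly the conditional distribution $P[X=x \mid Y=1]$, which differs from the marginal distribution of $X$ because voting and opinion are dependent.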
The chance that the types of people who enter the poll differ from the population at large is high for many questions. In this example, farmers, who are more likely to have a strong economic interest in the question, are almost sure to enter the poll. These days, for many such issues in the media, emails fly around to alert interested parties who want to make sure they have their say. Consumers do not have much at stake on this particular question, so they are less likely to bother voting. In general, most people do not bother with these types of polls - this is known in survey jargon as a low response rate.
This is why tabloid shows and local news call these phone-in polls 'non-scientific polls': they are not well grounded, and their results are meaningless unless we also have information on who calls in, which typically depends on the question.
We refer to this problem as 'self selection': since the respondents to the poll are allowed to select themselves into the sample, we do not get a 'representative' sample. This illustrates one aspect of what we mean by a nonrepresentative sample - the mean of the sample is different from the mean of the target population. Self selection problems are widespread and often lead to misinterpretations of data. For example, suppose you wanted to construct a job training scheme, and to see how well it works you run a small scale experiment. First, you advertise for unemployed people, then put the 50 or so that reply into a training scheme, and then see if they find employment. Suppose that most do, so you scale up the scheme to all unemployed in your city. Would we expect the same good results as in the pilot scheme? No, because of the self selection effect. The 50 or so that first sign up are likely to be the most motivated; when we scale up we get both motivated and unmotivated types, so the distributions are different.
There is one situation in which such a poll will be correct: when the marginal and conditional distributions are the same. This is of course the situation where $X$ and $Y$ are independent, i.e. in words, the chance that someone calls in (or is sampled) is independent of whether they are for or against the posed question. Typically, this is a defining element of a 'scientific' or valid poll. You need to select the participants in a way that ensures this independence. We will see how to do this, but basically any time you allow the participants themselves to decide whether or not to be in a poll, this independence is likely to be questionable.
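We can check this claim directly with the avocado example's marginals. The sketch below assumes independence, builds the joint distribution as the product of the marginals, and verifies that the conditional distribution given $Y=1$ recovers the marginal of $X$:

```python
# Marginals from the avocado example
p_x1 = 0.20  # P(X = 1): share of the population supporting the ban
p_y1 = 0.10  # P(Y = 1): share of the population who vote

# Under independence the joint distribution factors into the
# product of the marginals: P(X = x, Y = y) = P(X = x) * P(Y = y)
joint = {
    (0, 0): (1 - p_x1) * (1 - p_y1),
    (0, 1): (1 - p_x1) * p_y1,
    (1, 0): p_x1 * (1 - p_y1),
    (1, 1): p_x1 * p_y1,
}

# Conditional distribution of X given Y = 1, using the usual formula
# P(X = 1 | Y = 1) = P(X = 1, Y = 1) / P(Y = 1)
p_x1_given_y1 = joint[(1, 1)] / (joint[(0, 1)] + joint[(1, 1)])
print(p_x1_given_y1)  # equals the marginal P(X = 1) = 0.20
```

Under independence the poll of voters would report 20% support, the true population share, rather than the 60% produced by the dependent joint distribution in the table.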
All of the problems with the $X_i$ being what we wanted to measure - or, said differently, with their means being the $\mu$ we want to estimate - are known collectively as 'internal validity'. We say we have internal validity when the study really does measure what we want to measure.
Another problem arises when the target population cannot be sampled from randomly, but a subpopulation can. In this case the math is the same as before: we need the marginal distribution of the population we are interested in to be equal to the marginal distribution of the group we can measure. When a subpopulation is measured and we believe that the results also apply to a larger population, we say that the study extends to the larger population. More generally we refer to this as 'extendability' of the study, i.e. to which larger populations does the study apply. Another name for this is 'external validity'. An example is violence studies. There have been quite a number of studies, all along the same lines, trying to measure the effect of observing violent media on actually being violent. A standard approach is to show cartoons (some violent, some not) to pre-schoolers and see if this leads them to act out more violently in the playground (see Box X for more detail). Suppose you found an effect; would it extend to adults watching violent movies? Another critical example is medical studies. Often drug studies are run on a particular subpopulation, say white males, and then the results are assumed to apply to all people. Do these studies extend this way?
It should be clear from this that we need to carefully consider whether our data is representative of the population we intend to study, and what the limits are to how much we learn about other populations. Our assumptions of the previous chapter go beyond just this, however. We also want the data to come from a random sample. To consider both these questions, it is useful to break up the types of studies (or data) into three groups. These are experiments, surveys, and observational studies.
The different types of studies involve different levels of control over the data that is collected and hence different questions arise.
Experiments are distinguished by the feature that the researcher both directly applies and measures the effect of some `treatment' on the units (or subjects) of the study. So they control both which units receive the treatment and which observations enter the sample.
From the perspective of representativeness, being able to run an experiment is great because the control we have over the acquisition of the data allows us to take measures to ensure that we are measuring what we want to measure. We directly apply the treatment, so we can be more sure that we really are measuring the effect of the treatment. From the perspective of ensuring a random sample, there are methods within the way we conduct the experiment that help ensure that this assumption too holds for our sample.
Some examples will help make this clear. Note that experiments are run in many different fields of study, and whilst the techniques can differ across fields the aim is always the same: to ensure representativeness and to ensure our sampling assumptions hold.
In each of these examples, we have that the researcher imposes the treatment on the subjects. In the drug example, the doctor directly gives the drug, whose effects we want to measure, to the patient. For speed laws, we want to measure the effect different speeds have and in this experiment we directly impose the speed limit ourselves. For Project STAR, the researchers are directly choosing how big the classes are in studying the effect of class size on learning. This is the first step to concluding that these studies are indeed experiments.
The measurement of the subjects in each of the examples will become our data. In each case we control this too. In the drug example the doctor records how well the patient responds. For the speed laws example, we measure the effect on fatalities. For Project STAR, we randomly assign teachers and students and measure the effect of the different class sizes.
The control we have over the study in an experiment is very important for representativeness, i.e. for ensuring that the observations we take are actually coming from the distribution centered on the effect we want to measure. However, such control does not by itself ensure that we conducted the experiment well. There are some commonly applied tools in undertaking experiments that help improve the chances that the data is representative of the effect we are trying to measure.
Double Blind Experiments. A double blind experiment is one where the subjects do not know the extent to which they are treated (this alone is a single blind design) and the researcher also does not know the extent to which the subjects have been treated. The reason for this is best understood in a drug trial where the benefits of the drug are difficult to measure and may be subjective (so we are not measuring whether the patient died, but perhaps how much their pain decreased, etc.). When the subject does not know they are being treated, there is less chance that they report or act differently. A reaction of the subject to being treated is known as the Hawthorne effect; see the box for more information. Similarly, if the researcher knows which subjects received a stronger treatment, this can affect their assessment of the results. If the measurement is completely obvious and not subjective in any way, for example counting fatalities on a highway, the double blind approach is not really necessary.
Compliance. A real problem that can occur in studies is that the subjects do not comply with taking the treatment. If the subjects do not comply (or a large enough group of them does not comply) then clearly we are not measuring the effect we want, and there is a problem with the representativeness of the data. Consider the speed laws example. Here in Southern California, it does not appear from my casual observation that any drivers pay attention to posted speed limits. So an experiment that simply changes the posted speed limits might end up showing no effect, not because there is no effect of traffic speed on fatalities but because there is no effect of posted speed limits on fatalities; we would not be measuring what we wanted to measure. Experimenters take considerable efforts to ensure that the treatment is being observed, using their position in controlling the administering of the treatment, etc.
The control the researcher has over an experiment allows the experimenter to randomize over the treatment levels of the subjects. This is an important tool that does a number of things at once. First, correctly undertaken randomization ensures that we have a random sample. If the subjects are randomly chosen, there is no reason to think that the outcome of one subject will affect the outcome of a second subject, so it is reasonable to expect that the random variables that describe the observations in the sample are indeed independent of each other.
Randomization also has implications for the random variables not only being identical in distribution but also representative. Consider the Project STAR example. Here teachers were randomized to the classes, so we expect that the quality of the teacher is independent of the class size. Students were also randomized to the classes, which means that student ability is independent of class size. This helps with representativeness because we only want to measure the effect of class size, not the effect of teacher quality, on the test scores. Similarly, it avoids accidentally putting the better students in the smaller classes, which would result in test scores capturing both the effect of class size and student ability rather than what we really wanted to measure. In terms of ensuring that the distributions for each student are the same, being independent of teacher quality means that the distribution for a student in a small class will be the same as that for a student in a large class.
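The mechanics of random assignment can be sketched in a few lines. This is an illustration, not Project STAR's actual procedure: the roster size, the "ability" scores, and the two-group split are all invented for the example.

```python
import random

random.seed(1)

# Hypothetical roster: each student has an unobserved "ability" score.
# These numbers are illustrative assumptions, not data from the study.
students = [{"id": i, "ability": random.gauss(0, 1)} for i in range(200)]

# Randomly assign half the students to small classes, half to regular ones
random.shuffle(students)
small, regular = students[:100], students[100:]

# After randomization, average ability should be similar in both groups,
# so a difference in test scores can be attributed to class size alone
mean_small = sum(s["ability"] for s in small) / len(small)
mean_regular = sum(s["ability"] for s in regular) / len(regular)
print(round(mean_small, 2), round(mean_regular, 2))
```

The two group means will not be exactly equal in any one draw, but randomization guarantees they are equal in expectation, which is exactly the independence between assignment and ability that the argument above relies on.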
Part of understanding statistics is to know that studies can fail to measure what we are interested in. When looking at experimental results, regardless of field, a student that understands statistics would ask themselves whether or not the study was done in a way to ensure that the data were representative of what they are trying to measure. Was there compliance? Is there any chance that the results are affected by the way the experiment was administered? The main problem with experiments, given that researchers have a lot of tools available to fix these problems, is extendability. Does a study of market design in an underdeveloped community in one country extend to other communities in other countries with differences in culture? Does an experiment examining violence amongst toddlers extend to adults?
Experiments represent the situation where the researcher has the most control over the study. We turn now to Surveys, where less control is available.
Surveys are distinguished by the features that the researcher can no longer impose a treatment that they want to measure (often the concept of treatment in surveys is hard to think about) but they can choose which observations enter the sample. So they control which observations enter the sample but not the treatment itself.
Because we have control over the observations, we can use randomness. This helps both to ensure a VSRS and to obtain a representative sample. Because we do not impose the treatment directly, we need to be more careful about the representativeness of the sample.
Again, some examples will help make this clear.
Before discussing the main elements of this chapter, a few things to note with surveys. First, although many surveys are precisely what you think they are (someone asking questions and recording the answers), the way we have defined surveys gives them a wider meaning. What characterizes surveys for us is not the question-and-answer format, but the level of control the researcher has over the way the data is generated. For example, consider the speed limit effects on fatalities example from the previous section. Suppose instead of imposing speed limits randomly, we simply randomly chose locations that already have different speed limits and then recorded the number of fatalities. There is no questionnaire here, but this would be a survey. It is a survey because we had control over the subjects (here the particular areas to be examined) but not over imposing the treatment (actually randomly assigning the speed limit). The reason for the wider definition is that the questions we ask depend on the control over the data. A second point to note is that the notion of treatment in a survey is often difficult to think about. In the speed limit survey just described, it is clearly the effect of the speed limit. However, in a political poll there is no obvious notion of a treatment.
Because the treatment is not imposed by the researcher, we have to be much more careful about concluding that our sample is representative of what we are trying to measure. There are a lot of things that can go wrong. We take these in turn.
Non-representative Sampling. This is just a more general form of the sample selection problem we discussed earlier. In any survey we have a target population, from which we want to learn. For example, consider a political poll where we want to know the proportion of voters that will vote Democrat vs Republican, say in the California race. This sounds straightforward, but in practice it is anything but. California is big, with great variation in political beliefs between areas (compare Berkeley to Orange County). So a survey limited to certain areas will fail to be representative. A survey run only in the larger cities will ignore the rural vote. The problems are very large, and polling companies expend a great deal of resources and time to try to ensure that the population being sampled from is similar to the population of voters. Even simple surveys can easily go wrong. Suppose you want to run a survey on your campus to see if students are interested in building a new sports complex at the cost of additional student fees. If you stand at a central location on campus, you still might not get a representative sample because many students might not be coming to class (ones that have outside jobs, for example). How you conduct the survey matters as well. For example, a surveyor might favor one subgroup of students (maybe asking mostly the better looking women, or people in their own demographic because they are more comfortable). Randomization can fix this; for example, one trick is to let ten people go by after finishing with a subject before you approach another subject. The randomness of people walking past you guards against a non random sample.
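The "let ten people pass" trick is simple enough to write down as a rule. The sketch below simulates the passerby stream with numbered ids (an invented stand-in for real people); only the gap of ten comes from the text.

```python
def sample_passersby(stream, gap=10):
    """Approach one subject, then let `gap` people pass before the next.

    `stream` is an iterable of passersby in the order they walk by;
    returns the list of subjects approached. Letting the crowd's own
    randomness determine who arrives next removes the surveyor's
    discretion over whom to ask.
    """
    subjects = []
    skip = 0
    for person in stream:
        if skip == 0:
            subjects.append(person)
            skip = gap
        else:
            skip -= 1
    return subjects

# Simulated stream of 100 passersby, labeled by id
stream = list(range(100))
chosen = sample_passersby(stream)
print(chosen[:3], len(chosen))  # [0, 11, 22] 10
```

The key point is that the rule is fixed before anyone walks by, so which subjects end up in the sample depends only on the (random) order of the crowd, not on the surveyor's preferences.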
Survey Wording. For traditional surveys where a question is asked, the form of the question has been shown to strongly impact the response. For example, the wording can be such that it is clear what the writer of the survey wants the answer to be. Questions like 'Do you agree that ...' suggest that the respondent might be making a mistake by not agreeing. Most of the issues here are even more subtle, with question ordering and unclear questions resulting in poor data. Some discussion and guidelines from the American Association for Public Opinion Research are available here.
Non-Response. Not everyone surveyed responds. Just as in the sample selection problem (the math is the same), if the probability that someone responds is not independent of the views they hold on the survey question, then the distribution of the outcome we are trying to measure conditional on responding will be different from the marginal distribution of the effect we are trying to measure. For example, a survey of how good a service provided to you was might elicit a 100% response rate amongst customers who had a bad experience (because they are happy to have an avenue to complain) but a much smaller response rate amongst people happy with the service (because they do not want to spend extra time; they just want to move on). So this is really just a special case of sample selection. In good surveys where this is likely to be a problem, researchers make efforts to maximize response. Typical approaches are to follow up on non-responders, offer financial incentives for responding, etc. Unfortunately, when survey results are reported, they are often not accompanied by response rates. If the response rate was high, we feel better about representativeness; however, if we do not know the response rate, we have no idea whether the data is representative or not.
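A quick simulation shows how differential response rates distort the measured share. All the rates here are illustrative assumptions invented for the sketch (the text gives only the qualitative pattern of unhappy customers responding more):

```python
import random

random.seed(2)

# Hypothetical service survey: assume 30% of customers had a bad
# experience, unhappy customers always respond, and happy customers
# respond only 10% of the time. None of these are measured values.
n = 100_000
unhappy = [random.random() < 0.30 for _ in range(n)]
responds = [u or random.random() < 0.10 for u in unhappy]

# True share of unhappy customers in the whole customer base
true_share = sum(unhappy) / n

# Share of unhappy customers among those who actually responded
responses = [u for u, r in zip(unhappy, responds) if r]
survey_share = sum(responses) / len(responses)

print(f"true share unhappy:   {true_share:.2f}")    # close to 0.30
print(f"survey share unhappy: {survey_share:.2f}")  # close to 0.81
```

Under these assumed rates the survey reports roughly 81% dissatisfaction ($0.30 / (0.30 + 0.70 \times 0.10) \approx 0.81$) against a true rate of 30%: the same conditional-versus-marginal gap as in the avocado poll.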
In both experiments and surveys, the researcher had enough control to use randomization to try to ensure a random sample, and also to use methods to try to ensure that the data were representative. In the last study type, we do not have this tool. Nonetheless, it is still possible to learn a lot about the world.
Observational studies are distinguished by the features that the researcher can no longer impose a treatment that they want to measure, and they also cannot choose which observations enter the sample. So they control neither the treatment nor which observations enter the sample.
Typically the data is collected for other reasons, and then we use it to try to learn about some feature. Because we have no control over the observations, we cannot use randomness. This means we cannot be sure of either a VSRS or a representative sample. Because we do not impose the treatment directly, we need to be careful about the representativeness of the sample. Here we need to spend a lot of effort convincing ourselves that the study is valid.
Again, some examples will help make this clear.
Unfortunately, it is observational studies that we are mostly concerned with in economics and in the social sciences more generally. This is really why we know so little about the economy whereas in other areas scientists can learn fairly fast (even if it is expensive).
When we read a study that involves statistical analysis, we need to decide whether or not we find the study convincing. To do this we ask questions of representativeness and whether or not the application of the statistics is reasonable (if the data were not a random sample, there exist methods to deal with this, but they are beyond what this course covers). Questions of extendability should also be considered.
Copyright © Graham Elliott