10. Summarizing the ideas in this course

The point of this course was to teach the basic principles of how to use data to think about what we know. Using data to understand science, social or otherwise, is central to judging which conclusions are reasonable and which are not. All scientific work is as empirical as it can be, with the difficulty of obtaining informative data usually being the bottleneck. But learning from data goes beyond the formal work of science to learning in general. We are often subject to others presenting data as a way of convincing us of their points, in the media, politics, and business. A solid understanding of how to think about what data is telling us is by now a basic required life skill.

Statistics, the field of applied mathematics that focuses on evaluating data, relies on probability, which itself relies on set theory (which we did not really cover because of time constraints). So a fairly good understanding of probability theory is needed, for a multitude of reasons. We consider each observed datapoint as a draw from a random variable that describes what data we could have seen. These random variables are related to our theories of the world. Working with such random variables allows us to consider what data can tell us about this world. We convey what we have learned using probabilities.

But one needs to use the mathematical basis for statistics correctly. The primary mistake in understanding what data is telling us is not incorrectly constructing a t statistic, nor is it looking up the wrong distribution. The primary problem that leads users of data astray is that they have not measured what they were hoping to measure. We saw how to think about this in Chapter 6 (where again it was useful to think about conditional distributions to understand the problems). As models become more complicated, as they will in future econometrics or statistical methods classes, the opportunity for this problem becomes greater.

“It is easy to lie with statistics, but easier to lie without them.” - Fred Mosteller (a famous statistician who set up the Harvard Department of Statistics; this was his response to the famous “There are three kinds of lies: lies, damned lies, and statistics.”)

10.1 Basic Overview

The course started with constructing sample statistics like the mean or variance, and pictures like histograms and other methods for summarizing data. The vast majority of the measures we calculated for these were the sample mean of something. The sample mean of the data is straightforward; the height of a histogram bar, however, is a sample mean of a transformation of the data, where we count the number of data points inside a particular range. Values that are zero or one give us proportions, but proportions are simply sample means of this zero-one data. Whilst we only really looked at sample means, these are very much the main type of statistic that is employed in empirical work. Linear regression estimates are weighted averages and hence weighted means.
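To make the point about histograms concrete, here is a minimal sketch showing that the height of a bar (taken here to be a relative frequency) is just the sample mean of a zero-one transformation of the data. The data values and the bin $[3, 5)$ are hypothetical, chosen only for illustration.

```python
# Hypothetical data, chosen only for illustration.
data = [2.1, 3.4, 3.9, 4.2, 4.8, 5.5, 6.0, 7.3]


def sample_mean(values):
    """Plain sample mean: the sum divided by the number of observations."""
    return sum(values) / len(values)


def in_bin(x, low, high):
    """Zero-one transformation: 1 if low <= x < high, otherwise 0."""
    return 1 if low <= x < high else 0


# Height of the histogram bar for the bin [3, 5), as a relative frequency:
# the sample mean of the zero-one transformed data.
indicators = [in_bin(x, 3.0, 5.0) for x in data]
bar_height = sample_mean(indicators)

# The same number computed as a proportion (count divided by sample size).
proportion = sum(indicators) / len(data)

print(sample_mean(data))
print(bar_height, proportion)
```

The last line prints the same value twice, which is the whole point: a proportion is a sample mean of zeros and ones.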

The primary issue faced with statistics from an evaluation perspective, if you are sure that you measured things correctly, is to work out what the data result implies. Suppose you measure the returns from a stock portfolio you think is a good one, and you consider it successful if it yields a return greater than $4\%$. You find it does do better, at $5\%$. All good? Maybe not, because you might just have selected a lucky period for the evaluation. How might we handle this? We now know what we might do here. We could set up a hypothesis test and see if the measured $5\%$ is 'statistically different' from the baseline of $4\%$. Or we could construct a confidence interval that accounts for the statistical uncertainty of the estimate. Constructing these things is easy, but you need to know how to do it.

A question that arises is why we construct the hypothesis test or confidence interval. We now know these are almost the same thing, in that a confidence interval amounts to doing two-sided hypothesis tests for many nulls and collecting the null hypotheses that we fail to reject. So there was really only one trick, with different ways to look at it. But why are they the way they are? This is why we needed the idea of random variables, to build a system that allowed us to construct these things.
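The link between the two can be checked numerically. The sketch below runs a two-sided test at the $5\%$ level for a grid of candidate nulls and keeps those that are not rejected; the kept values line up with the usual interval formula. The estimate and standard error are hypothetical numbers, and the normal critical value is an assumption that relies on the sample being large.

```python
from statistics import NormalDist

# Hypothetical estimate and standard error, for illustration only.
estimate = 5.0
std_error = 0.4

# Two-sided 5% critical value from the standard normal (about 1.96).
crit = NormalDist().inv_cdf(0.975)


def rejects(null_value):
    """Two-sided test of H0: mu = null_value at the 5% level."""
    t_stat = (estimate - null_value) / std_error
    return abs(t_stat) > crit


# Test a fine grid of candidate nulls and keep those we fail to reject.
grid = [i / 1000 for i in range(3000, 7001)]
kept = [m for m in grid if not rejects(m)]

# The usual confidence interval formula, for comparison.
lower_formula = estimate - crit * std_error
upper_formula = estimate + crit * std_error

print(min(kept), max(kept))
print(lower_formula, upper_formula)
```

The two printed ranges agree up to the spacing of the grid.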

We described random variables as our theory, and this is indeed the role they take. But they do a lot more. By considering each observation to be a draw from a random variable, we gain a probability structure for thinking about the observations. Since our sample statistics, really the sample mean, are functions of the data, we saw that we can consider the sample mean as a draw from the same function of the random variables. We had to spend quite a bit of time learning enough about random variables and probability distributions to be able to think about the probability distribution for the sample mean. But it was exactly this that we exploit to get to the point of understanding things like statistical significance. By understanding the distribution of the statistic we want to calculate, we can do hypothesis tests and construct confidence intervals.

We saw through the earlier part of the course what this entails. We needed to understand distributions, first with one variable and then joint distributions where we have two random variables. We needed to understand that functions of random variables are also random variables, first with two random variables and then with $n$ random variables where $n$ was the sample size. In the $n$ random variables case we could only really do it with a lot of assumptions on the data. Happily, the one assumption we did not have to make is to actually know what the whole distribution is for each random variable describing the underlying data. We are happy about this because that would be hard to characterize in a real problem. And for the only distribution we are sure about (the Bernoulli(p)), we still could not do the calculations. But the central limit theorem came to our rescue. As long as there was a mean and a variance, we could apply this theorem to a random sample. These are still assumptions, but ones we can often make in practice.
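A small simulation illustrates what the central limit theorem buys us in the Bernoulli case. The sketch below draws many random samples, standardizes each sample mean, and checks how often it falls within $\pm 1.96$; if the normal approximation is good, this should be close to $95\%$. The values of $p$, $n$, the number of replications, and the seed are all arbitrary choices made for illustration.

```python
import random
from math import sqrt

random.seed(0)  # arbitrary seed so the run is reproducible

p = 0.3      # hypothetical Bernoulli parameter
n = 200      # hypothetical sample size
reps = 4000  # number of simulated samples


def bernoulli_sample_mean(n, p):
    """Sample mean of n independent Bernoulli(p) draws."""
    return sum(1 if random.random() < p else 0 for _ in range(n)) / n


# Standard deviation of the sample mean: sqrt(p(1 - p) / n).
std_error = sqrt(p * (1 - p) / n)

# Standardize each simulated sample mean.
z_values = [(bernoulli_sample_mean(n, p) - p) / std_error for _ in range(reps)]

# Share of standardized means within +/- 1.96.
share_within = sum(1 for z in z_values if abs(z) <= 1.96) / reps
print(share_within)
```

Because the sample mean of zero-one data can only take a finite set of values, the share will not be exactly $0.95$, but it should be close for a sample of this size.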

So you can think of all the work done in learning distributions and working with random variables as an essential understanding of where the calculations we do later come from. Whilst these calculations are simple, and you can take some data and do them without too much trouble, really understanding how to work with data requires understanding why the calculations make sense. Indeed, computers can do the calculations themselves. But the assumptions really matter, and they require thought as to whether or not they hold and what happens if they do not. The biggest issue though is that the data themselves might be misleading. For this too we need to understand random variables. If the mean of the random variable which describes your data is not the mean you think you are estimating, everything in your statistical analysis is wrong. And this is where most of statistics either fails or needs much more sophisticated methods than those we have dealt with in this course. A true understanding of statistics comes when you realize that you have to think through all the steps to be sure you are measuring the correct thing. How you think about this depends on how the data was collected (experiment, survey or observational study), and how the measurements were made.

10.2 Worked Example - Willingness to Pay and Accept

The idea of a ‘willingness to accept’, or WTA, plays an important role in economics. The basic idea is that if you are to be compensated for something, we need to know the value of that item. The WTA is the value that you would accept. For example, in insurance markets this would be the value you would place on something you lost but were insured for. In economics we often consider the ‘value’ of a policy change, so if some individuals lost out in the change, their WTA would be the value of this loss (to be considered against what others gain). You can think of it as the price you would sell the item for.

Despite the importance of the WTA, it is quite hard to measure generally. Instead, we might observe what people are willing to pay (WTP) for things. We could then use the WTP as an estimate of the WTA. But what if these things are different? This would create a problem. It turns out that often these are different, and there is work in behavioral economics and psychology to try and work out why.

One idea to explain the difference is what is known as the endowment effect: basically, we value things that we have more, simply because we have them. So we would be willing to pay less to get an item (we do not have it yet) than we would be willing to sell the item for (we value it more once we have it). This results in the WTA being greater than the WTP.

How might we measure this? If all the items were different, it would not make sense to compare or average the prices asked (WTA) with the prices offered (WTP). So we want to examine this for a particular product. We want the data to be as real as possible, but we want little variation in the people involved. For example, a very rich person might have a different valuation for the same thing than a poorer person (It’s one banana. What could it cost? $10?), which is known as the income effect.

The standard way to measure this is either through a survey or an experiment. Either way we would obtain data on the prices people are willing to pay or accept (to sell) some item. It is not going to be the case that these prices are the same across individuals, so we might think about looking at the mean.

We will look at data from one of the papers in this literature. The above table shows the results from experiments where students (the subjects of the experiment) either had a mug to sell or were given the chance to buy a mug if they wanted. Prices asked by the mug owners are the WTA; prices at which the others were willing to buy the mug are the WTP. The averages are in the table, and the difference in the averages is $2.98. So certainly the average WTA is greater than the average WTP for this sample.

What we learned in the course though was that this difference might not be meaningful, in the sense that it is from one sample. In this sample there is a wide range of possible WTAs and WTPs (the data is in the table). Might another sample give us a different answer? By how much? Enough that the difference could be negative? In this case the results might not be meaningful.

We saw that we can think about this through looking at the standard error of the mean, which gives us an idea of how precise the estimate is. We have a difference in means, which we did not cover in terms of estimating the standard error. I have calculated it to be 0.485 (this is basically the square root of a $\sigma^2/n$ estimate, with a slight change for the difference in means). But we can use it as we have during the class.
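For readers curious about the "slight change", one common formula for the standard error of a difference in means, assuming the two groups are independent samples, adds the two $\sigma^2/n$ terms before taking the square root. The sketch below uses that formula; it is not necessarily the exact calculation behind the 0.485 figure, and the WTA and WTP values are hypothetical, not the experimental data.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical valuations in dollars, NOT the data from the experiment.
wta = [5.0, 6.5, 4.0, 7.0, 5.5, 6.0]
wtp = [2.0, 3.5, 2.5, 1.5, 3.0, 2.5]


def diff_in_means_se(a, b):
    """Standard error of mean(a) - mean(b), assuming independent samples.

    Each term is a sample variance divided by its sample size, i.e. a
    sigma^2 / n estimate for that group.
    """
    return sqrt(variance(a) / len(a) + variance(b) / len(b))


difference = mean(wta) - mean(wtp)
se = diff_in_means_se(wta, wtp)

print(difference)
print(se)
```

Whether independence is reasonable depends on the design; it fits a setup where the sellers and buyers are different people.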

Now what do we do with these numbers? We can do a hypothesis test, or report a confidence interval which gives us some idea of the uncertainty in the estimate.

From a hypothesis testing perspective, we would think about what the null and alternative hypotheses are. It is reasonable to consider the null hypothesis of the difference being zero; this would be the situation where there is no gap between the WTA and WTP. For the alternative, a positive gap is the case of interest, and we do not expect a negative gap. From the perspective of the endowment effect, clearly we imagine there is a positive difference if there is an endowment effect.

So we might test $$ H_0: \mu = 0 \quad \textrm{vs.} \quad H_A: \mu > 0 $$ where $\mu$ measures the difference. Constructing the t test is simple: we have $$ t = \frac{2.98 - 0}{0.485} = 6.14. $$ At the $5\%$ level for a one-tailed test, this is clearly significant (the t statistic is far above the critical value of about 1.645), and so we reject the null hypothesis of no gap in favor of an endowment effect in this experiment. Alternatively we could construct a confidence interval. A $95\%$ confidence interval is $$ [\, 2.98 - 1.96 \times 0.485 ,\; 2.98 + 1.96 \times 0.485 \,] $$ which comes out to $2.03$ to $3.93$.
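The arithmetic above can be checked in a few lines. This sketch takes the difference (2.98) and standard error (0.485) from the text and uses standard normal critical values, which is an assumption in line with the 1.96 used for the interval.

```python
from statistics import NormalDist

difference = 2.98   # difference in average WTA and WTP, from the text
std_error = 0.485   # standard error of the difference, from the text
null_value = 0.0    # H0: no gap between WTA and WTP

# t statistic for H0: mu = 0 against HA: mu > 0.
t_stat = (difference - null_value) / std_error

# Normal critical values (assumed adequate for this sample).
one_sided_crit = NormalDist().inv_cdf(0.95)    # about 1.645
two_sided_crit = NormalDist().inv_cdf(0.975)   # about 1.96

# One-sided test at the 5% level: reject when t is above the critical value.
reject_null = t_stat > one_sided_crit

# Two-sided 95% confidence interval for the difference.
ci = (difference - two_sided_crit * std_error,
      difference + two_sided_crit * std_error)

print(round(t_stat, 2))                 # 6.14
print(reject_null)                      # True
print(round(ci[0], 2), round(ci[1], 2)) # 2.03 3.93
```

Rejecting the null of a zero gap is what counts as evidence for an endowment effect here.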

Both are telling us similar information. Clearly the data, taken at face value, indicate an endowment effect. And the effect seems pretty large, 2-4 dollars for a fairly small value item. But do we take this at face value?

There are a lot of issues here. But we can use the tools we discussed to think about them.

First, is the $\mu$ in the problem that we want to understand really the mean of the data we collected? This is just asking whether we are really measuring what we want to measure. The context here is mugs, so we should first assume that the endowment effect we want to measure is indeed for mugs. The authors really used mugs for the experiment.

Plott and Zeiler examine in their paper whether or not the experiment really does elicit the valuations. They are worried about how well the students understand what they are doing, which calls for training in the market for their mugs. They also wonder about things like the effects of the students not being anonymous to each other. These complaints suggest that we are not measuring the correct thing. This is generally a problem with experiments on subjects like university students, i.e. experiments that are run on subjects that are not a random sample of all possible subjects. It is often hard to do experiments to learn about economics well, because the subjects we would like to use are often unavailable (think about trying to test irrationality of market traders etc.).

We could think of other approaches to getting the data. Smitizsky, Liu and Gneezy (2021) use Amazon Mechanical Turk to find subjects online, on whom they run a survey. Instead of actually trading mugs, they simply ask questions like ‘suppose you have this item, how much would you be willing to sell it for’ etc. Subjects were paid one quarter (yes, 25 cents) for doing the survey. Would this be a better approach?

Plott and Zeiler rerun the experiment with the mugs, but with training and anonymity of the participants. The difference they report is much smaller than for Kahneman et al. (1990). The difference is still positive, but now the hypothesis test of the null above does not reject, and zero is in the confidence interval for the effect (these are not quite the same statement because the hypothesis test is one-sided).

Does this mean that there is no endowment effect? The Plott and Zeiler paper was not saying that; they wanted to show that experimental design for these types of problems matters. And this they showed. To conclude that there is no endowment effect we would need to be able to say that the experiment with mugs extends to many other items. Does it work for valuing public parks? Maybe not. Extending results to other settings is also hard.

This all may seem more negative than it should. Measuring things, especially in the social sciences and environmental sciences, is hard. The hard part is not constructing the t statistic and looking up critical values in the tables; the hard part is working out what to measure and whether that measurement gets you what you want. For this, the construction of the statistical approach, understanding that data can be thought of as coming from random variables and that thinking about it this way helps us understand these important questions, is a big part of what you were supposed to learn in this course.

References

Kahneman, D., J. Knetsch and R.A. Thaler (1990), “Experimental Tests of the Endowment Effect and the Coase Theorem”, Journal of Political Economy, 98, pp. 1325-48.

Plott, C.R. and K. Zeiler (), “The Willingness to Pay-Willingness to Accept Gap, The Endowment Effect, Subject Misconceptions, and Experimental Procedures for Eliciting Valuations”, American Economic Review.

Smitizsky, G., W. Liu and U. Gneezy (2021), “The Endowment Effect: Loss Aversion or a Buy-Sell Discrepancy?”, Journal of Experimental Psychology.