1. Introduction

The human mind is driven to find patterns in the world around it. When we see clouds, we think of what they look like. When we have some bad luck, we seek explanations for it. We are forever seeking explanations of what we observe, coming up with theories to explain why the things that happen around us happen. We are rarely satisfied with these explanations, always seeking to refine them further or to find fault in them. We argue them through, trying to see whether they really explain every facet of what we observe. We then think about how they work, sometimes throwing theories away when they are unable to account for new observations. Theories that are able to explain what happens, on the other hand, are kept around. The result, of course, is the great achievements of human endeavor, through all of the sciences and beyond. We are now able to describe much of what happens in the world and outside it with near certainty. Our understanding of physics allows scientists to accurately predict what happens to objects and space vehicles far outside the range of where humans have ever been - whether in the far reaches of space or in microscopic realms.

Statistics, the subject of this course, is the formal study of how theory and data interact. How do we relate the theories that arise in our mind's eye to the observations of our actual eyes? How should we use observations to learn about how exactly the world works? What types of data do we need to examine our theories? What can go wrong, leading us to misunderstand what the data are telling us? How is it that many different people observe the same thing - say the record temperatures of the 1990s - and come up with wildly different beliefs as to the presence of global warming? This course is designed to give an introduction to how scientists and social scientists alike answer these questions.

1.1 The Scientific Method

Consider how you 'know' anything that you think you know. For example, most of us know that driving home around 5pm today is likely to be painful because of traffic. How do we know this? Well, first off, we have probably tried it before and found out that this is when everyone else seems to be trying to get home too. We call this observation, and we can learn from observation. However, in most cases it is not enough just to learn through observation - we think about why there is traffic rather than just noting that there is traffic. We all (I hope) realize that what is going on is that most people work 9-5 or so and then try to drive home, all these people together creating traffic on the freeways. This is a theory, based on our observation, of what is going on. The great thing about going one step beyond observation to theory is that theory tends to give implications. For example, our theory that traffic is caused by lots of people finishing work at the same time implies that if few people were at work today then the roads would be clear of traffic. So we could suppose that if there were a catastrophic snowstorm and everyone stayed home (and since I am writing this in Southern California, it would not take much of a snowstorm to be both catastrophic and cause everyone to stay home) there would be no traffic this afternoon at 5pm. If we wanted to check our theory, we could go and measure the traffic on such a day when people are not working and see if the theory predicts what we actually observe. This is all casual, but it describes the basic steps of the scientific method well.

Basically the scientific method can be described in the following steps:

  1. We first observe phenomena
  2. We construct conjectures to explain such phenomena (hypotheses or theories). If a conjecture is consistent with what we observed, this is evidence in favour of the hypothesis.
  3. We then examine other implications of the theory.
  4. We test the predictions by experiment or any other way we can. If the predictions of the theory are consistent with what we then observe, this is even stronger support for the hypothesis.

This course is about statistical theory. Statistical theory underpins Steps 1 and 3 above, and a good understanding of statistics, and of how it can go right or wrong, is essential for using the scientific method to actually learn things that are useful, as opposed to deepening your misunderstanding of how the world works. For this reason I would argue that everyone should have a basic understanding of statistics.

Application of these steps has led to great advances in our understanding of the world, advances that we rely on. It may seem that we do not really need all of the steps above. However, to fully appreciate how things work all of the steps are important, and in practice steps 3 and 4 often iterate continually to deepen our understanding.

Consider just using the first two steps, creating all our theories to fit observed facts without thinking about their predictive qualities and whether or not they hold true. A long time ago this was thought to be reasonable - it is called the inductive tradition. The idea was that with enough observation we would be able to learn everything (it sounds a lot like what machine learning adherents say today!). The problem with it (see the box for an example) is that we can only learn correlations in the data, not what caused them. We do not learn the structure of the economy or the world. This approach is also a lot like ex post rationalisation - any theory would only explain what happened last time, not what happens next, even though what happens next is what we really care about.

1.2 Why Formal Statistics?

Do you really require a sophisticated mathematical understanding of statistics? Much of what we know has come from careful science, with scientists in each field spending a huge amount of time thinking about and examining hypotheses for you. In many cases these results form the basis of your understanding of the world and its inherent risks and rewards. If you are unable to understand the basic points of this course then you will have to resort to believing by weight of authority: choosing to accept or reject some understanding based purely on how much you trust the source of the information, rather than reasoning for yourself. Do you believe the Atkins company, which pushes a diet high in fat and protein, or the Centers for Disease Control, and follow the food pyramid in working out what to eat? The evidence and the theory conflict, and you do not want to simply believe what you are told without the skills to ask the right questions of each study and gauge how seriously to take its results. The studies may have been done poorly.

A good example is the beneficial effect of orange juice on colds. This well known effect of vitamin C comes from an old but well publicised study that showed a strong effect, resulting in many of us taking orange juice and tablets at the onset of colds and the like. Modern studies find no effect, but you won't hear that from the vitamin companies. You might be persuaded to still believe in vitamin C because the result is well known, or you could understand why the newer studies give a better idea.

Conflicts of interest also arise, so what should be trustworthy studies can give the wrong results. Perhaps you have heard of the link between child immunization and autism. A 1998 study in the British medical journal The Lancet concluded that there was evidence that the three in one shot (for mumps, measles and rubella) results in higher rates of autism in children. The outcome, of course, was that many children went without immunization, even though going without it is well known to cause problems. The paper was later retracted: the author had a business conflict of interest and had undertaken a few of the steps of the research so as to maximise the chance of this outcome.

But of course many studies give good, accurate results. The problem is that without some understanding you will not be able to think clearly about the information that you receive. Belief by authority of the source is a poor way to access the results of science; a critical mind is required.

“When I call for statistics about the rate of infant mortality, what I want is proof that fewer babies died when I was prime minister than when anyone else was prime minister.” - Sir Winston Churchill (Quoted in Peter Skerry’s Counting on the Census? Race, Group Identity and the Evasion of Politics)

A first thought for a student is 'why so much math?'. Statistics is really applied mathematics. Formalizing the ideas requires writing down methods that let us say exactly what is going on, and then manipulate the observations in a way that gives a constructive view of the results. The first layer of formality is set theory. Set theory is the foundation of logic, and the most useful way we have found to describe the things that can possibly happen. This is even more true for things that are not naturally numbers. Suppose you are working for a company and have to make a decision. You could describe the various outcomes of the decision, e.g. the company goes belly up, nothing much happens, or the company becomes more profitable, among many other possibilities. Set theory allows us to name these outcomes and still, to a point, manipulate them mathematically. But when we get down to it we really need numbers, which is the next step. Once we have numbers we can set down general methods that work in a lot of settings; however, once we are manipulating numbers we are deep in the realms of math.
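To make the company example a little more concrete, here is a minimal sketch in Python of how named outcomes can be handled as sets. The outcome labels, the event names and the payoff numbers are all made up for illustration; they are assumptions, not part of any real analysis.

```python
# Hypothetical outcomes of the company's decision (labels are illustrative).
outcomes = {"goes_under", "no_change", "more_profitable"}

# An event is simply a subset of the possible outcomes.
survives = {"no_change", "more_profitable"}
improves = {"more_profitable"}


def complement(event, universe=outcomes):
    """Return the outcomes in the universe that are not in the event."""
    return universe - event


# Set operations let us build new events from old ones without any numbers.
fails = complement(survives)                      # the company does not survive
survives_without_improving = survives - improves  # survives, but no better off
improves_or_fails = improves | fails              # either extreme happens

# The next step is to attach numbers: an assumed payoff for each outcome.
payoff = {"goes_under": -100, "no_change": 0, "more_profitable": 50}
worst_case_if_survives = min(payoff[o] for o in survives)
```

Nothing here needed the outcomes to be numbers until the last two lines. Unions, differences and complements work on the names alone, and the payoffs only enter once we want to compare outcomes by size.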

But that was a mechanical explanation. For many problems, formal statistics are not all that necessary. This is because there is not much variation in the results - the relationship between the theory and the result is not tenuous but instead is pretty obvious. Think of the hypothesis 'If I don't show up for work today I won't get paid'. Since every time you do not show up you do not get paid, and you only get paid when you do show up, this hypothesis is really easy to test without a great deal of experimentation or formalisation. The same goes for the hypothesis 'If I head north on I5 on Friday at 5pm I will be stuck in a traffic jam'. There is no variation; it will happen almost for sure. But we do not want to limit ourselves to these types of questions. First, the answers are so obvious that we all know them already. Second, anyone can work these out, so you would not be building skills that differentiate you from the masses if you stopped there.

Most interesting questions are about either more subtle effects or effects that are hard to get a good set of data on. Consider predicting stock markets. There are no obvious, big, simple ways to predict stock prices; if there were, we would all get rich this way. But there are potentially subtle effects that are hard to find, yet available to a smart mind, and you could still get rich this way. In other examples, the theories may seem obviously correct but it is hard to get data that would clearly show it. Consider the nature/nurture problem. There are those who believe that we are all born the same, and that all of the differences in performance (how much you are paid, how much you achieve, how many degrees you get and the like) are due to hard work and what we learn after we are born. Others hold that much of this is already hardwired into our brains at birth, and that successful people are simply born with what it takes to become successful. A reasonable hypothesis is that both come into play, but how much is due to brains and how much is due to learning? The hypotheses make sense, but the effects are so entangled that some formalization of the methods is going to be required to understand any attempt to disentangle them.

To summarize, we need formal statistics to avoid mistakes and to be more precise in our understanding.

"When you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind" -Lord Kelvin (Victorian Physicist - Kelvin temperature scale named after him).