Probability Basics

Probability theory is a branch of mathematics that provides a foundation not only for statistics, but for decision making under uncertainty in general. It was invented to understand games of chance (gambling), but it matters for many aspects of how we deal with uncertainty. An understanding of probability will make for much clearer thinking, and will of course help with this course.

This side-bar chapter is for students who need some revision or want to see some of the basics that we rely on in the course. Whilst the course chapters are self contained, using the probability ideas we need as we go, this chapter gives a little background.

Probability theory itself is an extension of set theory, so we will start with that. Set theory gives us a way to describe all of the components of what might happen. Probability theory then measures these sets in a way that makes clear the chances that these things happen. For statistics we do not want to be fully general: we restrict our outcomes to numbers. For example, in set theory when we toss a coin we can describe the possible outcomes as 'Heads' or 'Tails', but in statistics we want to use numbers like zero and one, because numbers make the math with these outcomes easier.

4.1 Set Theory

To build up the ideas of set theory, first let's consider an example problem. We are going to toss two coins and see what happens. Before we toss the coins we do not know what will show up, but we can easily work out what the possibilities are. Denote H as 'Heads' and T as 'Tails'; then the possible outcomes are 'TT' for two tails, 'TH' for a tail then a head, 'HT' for a head then a tail, and 'HH', which should now be obvious in its meaning. So we are keeping track here of the outcome from the first coin and the second one, hence we differentiate between 'HT' and 'TH'. If we did not need to keep track of this, we could set this up a bit more simply, which we always try to do. For example, in the craps example in Chapter 3 we only care about the sum of the values on the two dice, so we do not keep track of which die had what number.

We will use the notation {HH} to denote the set that says 'two heads', the curly brackets meaning 'set'. So far we have the sets {TT}, {TH}, {HT} and {HH}. These sets, taken all together, represent all the possible things that can happen when we toss the two coins (assuming we rule out things like losing one, etc.). The collection of all of the things that can happen, put into a set, is known as the sample space for the problem. Here the sample space, denoted S, is S = {TT, TH, HT, HH}.
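As a quick illustration (not part of the text's notation), here is a sketch in Python that enumerates this sample space by taking all ordered pairs of coin faces:

```python
from itertools import product

# Enumerate the sample space for two coin tosses.
# Each outcome is a string like 'HT': first toss then second toss.
S = {''.join(outcome) for outcome in product('HT', repeat=2)}
print(sorted(S))  # ['HH', 'HT', 'TH', 'TT']
```

Because we keep the order of the tosses, 'HT' and 'TH' come out as distinct outcomes, matching the discussion above.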

There are other sets we could consider, for example the set described by 'Head on first toss' can be written as {HH,HT}. This set has two of the possible basic outcomes; the sample space has all four. We refer to the smallest sets that are the basic outcomes, namely {TT}, {TH}, {HT} and {HH}, as singletons or points of the sample space S. The singletons, along with more complicated sets such as {HH,HT}, are called events.

We can construct all possible events out of the singletons using the set operation union. The union of two events is just the set comprising the outcomes in either of the sets being operated on. More precisely, for two sets A and B, $$ A \cup B = \{\text{all outcomes in either set A or set B}\}.$$ The left hand side of the equation is read as 'A union B'. So for example we have $$ \{HH\} \cup \{HT\} = \{HH,HT\}. $$ As noted, we can make all the possible sets from the singletons: taking the union of all the singletons gives us the sample space S. The rules for unions mean there is no double counting, so $$ \{HH, HT \} \cup \{HT\} = \{HH,HT\}. $$ If you are in, you are in, but only once.

The intersection operator is for situations where we want to see which outcomes are in multiple sets. For two sets A and B we define the intersection as $$ A \cap B = \{\text{all outcomes in both set A and set B}\}.$$ For example we have $$ \{HH, HT \} \cap \{HT\} = \{HT\}. $$ Intersections narrow sets down: intersect enough overlapping sets and we are left with a single singleton (or, if the sets share no outcomes, the empty set).
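Python's built-in sets carry out exactly these operations, so we can sketch the union and intersection examples above (the event names here are just illustrative labels):

```python
# Events as Python sets of outcome strings.
A = {'HH', 'HT'}   # head on the first toss
B = {'HT'}         # head then tail

print(A | B)  # union: no double counting, so still {'HH', 'HT'}
print(A & B)  # intersection: {'HT'}
print({'HH'} & {'TT'})  # disjoint events intersect to the empty set
```

The union of A and B is just A again, since B is contained in A: "if you are in, you are in, but only once."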

We can also have sets of sets, for which we just use double brackets. So we could define the set of sets $ \{ \{HH, HT \}, \{TT\} \}.$ Finally, we could consider the set of all possible sets that can be built from the sample space. This is $$ \begin{equation} \{ \emptyset, \{TT\}, \{TH\}, \{HT\}, \{HH\}, \{TT, TH\}, \{TT, HT\}, \{TT, HH\}, \{TH, HT\}, \{TH, HH\}, \{HT, HH\}, \\ \{TT, TH, HT\}, \{TT, TH, HH\}, \{TT, HT, HH\}, \{TH, HT, HH\}, \{TT, TH, HT, HH\} \} \end{equation} $$ where here $\emptyset$ is the empty set (like zero for numbers), giving sixteen events in all. When we have all of these put together, they make up all the different events we might be interested in that can happen. More formally this is called a $\sigma$-algebra, where an algebra of sets is a collection of sets and a $\sigma$-algebra is a collection of sets that follows some rules.
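For a finite sample space, this collection of all subsets (the power set) can be generated mechanically; the following sketch confirms the count of sixteen events for the two coin toss example:

```python
from itertools import combinations

S = ['TT', 'TH', 'HT', 'HH']

# All subsets of S, from the empty set up to S itself.  For a finite
# sample space, this power set is a sigma-algebra of events.
events = [set(c) for r in range(len(S) + 1) for c in combinations(S, r)]
print(len(events))  # 16: empty set, 4 singletons, 6 pairs, 4 triples, S
```

The count works out as $2^4 = 16$, since each of the four singletons is either in a given event or not.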

With this type of description of what might happen, we can now turn to thinking about the probabilities of the things that can happen. This means putting probabilities on all these sets.

4.2 Probability

Now that through set theory we have defined what might happen, we can turn to considering the chances that each of the possible sets happens. But there are rules on what the probabilities can look like. It should be clear to you that probabilities should not be negative, and that the probability of any of the outcomes happening (i.e. of S happening) should be one. In between, the probabilities must be compatible with each other. The chance of a head on the first toss is the chance of the event $\{ HH, HT \}$, so it must agree with the chances of the singletons that make up that event.

Again consider the two coin toss example. We can think of what probabilities seem sensible here. Since all of the singletons have equal chance, we might set $P(HH)=P(TT)=P(HT)=P(TH)=1/4$. We can certainly do this. But what does it imply for all of the other events? Do they make sense?

First, consider the rule that $$ P(A \cup B) = P(A) + P(B) \text{ when } A \cap B = \emptyset. $$ When $A \cap B = \emptyset$ we say that the events $A,B$ are disjoint. Applying this rule we can get results for many of the other events. We have $$ P( \{HH\} \cup \{HT\}) = P(HH) + P(HT) = 1/2. $$ Since this is the probability of a head on the first toss, and one half is what we would expect for that, the assignment seems reasonable.
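Since the singletons of any event are disjoint, the probability of an event is just the sum of its singleton probabilities. A small sketch, assuming the equal-chance (fair coin) probabilities above:

```python
# Probabilities of the singletons under the fair-coin assumption.
p = {'TT': 0.25, 'TH': 0.25, 'HT': 0.25, 'HH': 0.25}

def prob(event):
    """Probability of an event: sum its disjoint singleton probabilities."""
    return sum(p[outcome] for outcome in event)

print(prob({'HH', 'HT'}))  # head on the first toss -> 0.5
print(prob({'TT', 'TH', 'HT', 'HH'}))  # the sample space S -> 1.0
```

This is exactly the disjoint-addition rule applied repeatedly, one singleton at a time.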

The problem arises when the sets are not disjoint: we then need to deal with the 'overcounting' that comes from singletons appearing in both sets, even though each appears only once in the union. The easiest way to handle this is to break the union into its singletons, i.e. consider $$ P(\{HH, TH, TT \} \cup \{TT, HH \}) = P(\{HH, TH, TT \}) = P(HH)+P(TT)+P(TH) = 3/4. $$ Doing it this way results in probabilities that make sense so long as the probabilities of the singletons are nonnegative and sum to one.
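The same idea in code: form the set union first, so each singleton is counted once, then sum. Naively adding the two event probabilities would overcount the shared outcomes (the probabilities here are the fair-coin assumption again):

```python
p = {'TT': 0.25, 'TH': 0.25, 'HT': 0.25, 'HH': 0.25}

A = {'HH', 'TH', 'TT'}
B = {'TT', 'HH'}   # overlaps A, so P(A) + P(B) would overcount

union = A | B      # take the set union first: {'HH', 'TH', 'TT'}
print(sum(p[s] for s in union))  # 0.75, not P(A) + P(B) = 1.25
```

Since B is contained in A here, the union is just A, and the probability is 3/4 as in the equation above.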

These rules can be put together; they are known as the Kolmogorov axioms. Formally they are

  1. $P(A) \ge 0 $ for all possible events $A \subseteq S.$
  2. $P(S)=1.$
  3. If $A,B$ are disjoint then $P(A \cup B) = P(A) + P(B). $
These three rules alone ensure that we get sensible probabilities.

4.3 How this all relates to statistics

In Chapter 3 we construct random variables and probability distributions. Formally, a random variable is a function from the sample space ($S$ above) to numbers. The way to think about it is that each outcome in the sample space (strictly, we need the full $\sigma$-algebra of events) is assigned a number by the function. We want to work with numbers instead of sets like $\{HH\}$ because numbers are easier to do math with. In reality many of the outcomes we start with are already numbers; if not, it is easy to map them to numbers depending on what we are interested in. For the two coin toss above, for example, we could just use a function that counts the number of heads.

Consider the example of counting heads in the two coin toss example above. We map the set $\{HH\}$ to the number 2, and the probability of seeing this is still $P(\{HH\}) = P(\text{see a 2}) = 1/4$. We denote the outcomes for the numbers by lower case letters from late in the alphabet; here we choose x's. So we have $x=2$ for the outcome $\{HH\}$. Clearly $x=0, 1$ or $2$ in this example. You should be able to see that the probability that $x=0$ is $1/4$, and likewise for $x=2$; since probabilities sum to one, the chance of seeing $x=1$ must be one half.
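This mapping can be sketched directly: define the counting function, then push each singleton's probability through it to get the distribution of x (again assuming the fair-coin probabilities):

```python
from collections import defaultdict

p = {'TT': 0.25, 'TH': 0.25, 'HT': 0.25, 'HH': 0.25}

def X(outcome):
    """Random variable: the number of heads in the outcome."""
    return outcome.count('H')

# Push the singleton probabilities through the map to get P(x).
dist = defaultdict(float)
for outcome, prob in p.items():
    dist[X(outcome)] += prob

print(dict(dist))  # {0: 0.25, 1: 0.5, 2: 0.25}
```

Note that $x=1$ gets probability one half because two disjoint outcomes, 'TH' and 'HT', both map to it.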

This all seems pretty easy, but it gets hard fast. For a situation where x takes on many integer values, it takes a while to work everything out, but it is really just an accounting exercise. In Chapter 4 we will have x's and y's; more sets of measurements make it harder. Where it really becomes difficult is when the values can be any number on the real line, or a subset of it. In that case there are an infinite number of possible outcomes, and this is no longer a simple accounting exercise: we cannot work out the sets and probabilities by hand, it would take forever. We introduced the basic ideas of this in Chapter 3, but to do it properly you really need to take a real analysis class.

In the end, though, all the probability distributions and random variables we employ in the rest of this course are built up from the probability and set theory we lightly introduce above (but in reality with much more rigor). When we are working with random variables, the x's are all disjoint, so probabilities over lots of them come from adding probabilities (when there are countably many values, the discrete case) or from integrating over the x's (the continuous random variable case).