0.11 Probabilities

In a sense, the basics of quantum theory boil down to the combination of two bits of mathematics: linear algebra over the complex numbers, and probability theory. We have just gone over all the linear algebra that we will need, so now let’s tackle the other topic (though we will immediately revisit it in Chapter 1).

Probability theory is a vast and beautiful subject which has undergone many transformations over the centuries. What started as something understood in terms of gambling odds later evolved into the theory of measure spaces, and can nowadays even be expressed in terms of diagrammatic category theory. But for our purposes, we only need the very elementary parts of the subject, so we will stick with the first interpretation: probability tells us the odds of something happening.19

The setup is always the same: we have some process (rolling some dice, flipping a coin, drawing a card, etc.) that has some possible outcomes (getting a 5, heads, or an ace of hearts, etc.) but is realised in some way which means that we cannot be certain which outcome we will see whenever we run the process (or “perform the experiment”).

The first thing to define in any such scenario is the sample space, usually denoted by \Omega, which is the set of all possible outcomes. Next we have the event space \mathcal{F}, which is the set of all events, where an event is a set of outcomes (this might sound confusing at first, but we’ll give some examples shortly). Whenever we run the process, we will get some outcome, and we say that any event that contains that outcome occurred. Finally, we will have a probability function, which assigns to any event a probability, which is a number between 0 and 1. This probability function has to satisfy some axioms, but we’ll come to these later; for now, let us give some examples. Here is a table of the sample spaces for some processes.

| Process | Sample space \Omega |
| --- | --- |
| Rolling a six-sided die | \{1,2,3,4,5,6\} |
| Flipping a coin | \{H,T\} |
| Flipping two distinct coins | \{HH,TH,HT,TT\} |
| Flipping two identical coins | \{HH,TH,TT\} |
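
Since all of these sample spaces are finite sets, we can play with them directly on a computer. Here is a small Python sketch (an illustration of ours, not part of the formalism) that builds the two-coin sample spaces, including the collapsing of HT and TH for identical coins:

```python
from itertools import product

# Sample space for flipping two distinct coins: all ordered pairs.
two_distinct_coins = {a + b for a, b in product("HT", repeat=2)}
print(sorted(two_distinct_coins))  # ['HH', 'HT', 'TH', 'TT']

# For two identical coins, HT and TH are indistinguishable, so we keep a
# single representative of each outcome (here TH, as in the table above).
two_identical_coins = {"".join(sorted(a + b, reverse=True))
                       for a, b in product("HT", repeat=2)}
print(sorted(two_identical_coins))  # ['HH', 'TH', 'TT']
```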

And here’s a table of some (but not all, except in the case of flipping a single coin) of the events corresponding to these sample spaces.

| Process | Example events A\in\mathcal{F} | Interpretation |
| --- | --- | --- |
| Rolling a six-sided die | \{1\} | rolling a 1 |
| | \{1,3,5\} | rolling an odd number |
| | \{1,2,3\} | rolling a number less than 4 |
| | \{2,3\} | rolling a prime number less than 4 |
| Flipping a coin | \{H\} | getting heads |
| | \{T\} | getting tails |
| | \{H,T\} | any outcome at all |
| Flipping two distinct coins | \{HH\} | getting two heads |
| | \{HH,TH,HT\} | getting at least one heads |
| | \{HH,TT\} | getting two the same |
| Flipping two identical coins | \{HH\} | getting two heads |
| | \{HH,TH\} | getting at least one heads |

In the table above we can see that, for example, if we rolled a 1 on a six-sided die then many events occurred: we rolled a 1, but we also rolled an odd number, and a number less than 4; but we did not roll a prime number.

Something else that arises in these examples is the notion of distinguishable outcomes, which we can see when we look at how the sample space of flipping two coins depends on whether or not they are identical. That is, if we have a gold coin and a silver coin then it makes sense to say that HT is different from TH, because the first means that the gold coin was the one that landed on heads, and the second means that it was instead the silver coin. But if we have two identical coins, then how can we distinguish between HT and TH? We would have to be able to point to one of them, say, the one on the left, and say “that’s the coin that’s on the left”, but if we can do this then by definition we can distinguish between them!20 In general, the sample space consists only of distinguishable outcomes, but what counts as distinguishable really depends on the specifics of the experiment that we have set up.

This approach towards probability, where we think of a “scenario” as such a triple (\Omega,\mathcal{F},P), places us firmly within the setting of measure theory. What we have described is a probability measure, and there are many more general types of measure that exist, but they all describe the same sort of idea: we have a set \Omega that we think of as our “space”, some particularly well-behaved collection \mathcal{F} of subsets of \Omega, and a function \mu\colon\mathcal{F}\to\mathbb{R}\sqcup\{\pm\infty\} known as the measure. We think of the measure \mu as telling us how big any element of \mathcal{F} is, be it interpreted as geometric size or, in the example of probability measures, the likelihood. This formalism becomes very useful when we want to do any analysis (which happens as soon as we want to deal with infinite-dimensional spaces, or even just “introduce some geometry”), such as in constructing the Lebesgue integral. Thinking of probability spaces through measure theory allows us to make use of the many powerful tools of real analysis.

Now we can define probability rather succinctly.

In a fair process, where all outcomes are equally likely, the probability P(A) of an event A is the number of desired outcomes divided by the number of total outcomes, i.e. P(A) = \frac{\text{\# of elements of }A}{\text{\# of elements of }\Omega}. In particular, the probability will always be a number between 0 and 1.

Running through some of the above examples of events, we see that this definition of probability agrees with what we might already expect.

| Event | Probability |
| --- | --- |
| Getting heads on a single coin flip | 1/2 |
| Rolling a 6 with a single die | 1/6 |
| Rolling an odd number with a single die | 3/6 = 1/2 |
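
These numbers are simple enough to check by hand, but the counting definition also translates directly into code. A minimal sketch, assuming fair processes throughout (the helper name P is our own):

```python
from fractions import Fraction

def P(event, sample_space):
    """P(A) = (# of elements of A) / (# of elements of Omega)."""
    return Fraction(len(event), len(sample_space))

die = {1, 2, 3, 4, 5, 6}
coin = {"H", "T"}

print(P({"H"}, coin))     # 1/2
print(P({6}, die))        # 1/6
print(P({1, 3, 5}, die))  # 1/2
```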

Flipping a fair coin (or actually, even an unfair one) is a common scenario in discussing probability, because it has just two outcomes, the smallest number you can have without things becoming purely deterministic. Lots of numbers turn up time and time again in probability calculations for binary-outcome events, and most often they are binomial coefficients. These are numbers that can be read directly from the rows of Pascal’s triangle (which, as is often the case in mathematics, is more deserving of being named after a different person: Al-Karaji, or maybe Omar Khayyam), and they satisfy many interesting combinatorial patterns.
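
For instance, although we will not need this formula just yet, it is a standard fact that in n flips of a fair coin there are \binom{n}{k} of the 2^n equally likely outcomes containing exactly k heads. A quick sketch:

```python
from math import comb

def p_exactly_k_heads(n, k):
    # comb(n, k) of the 2**n equally likely outcomes have exactly k heads.
    return comb(n, k) / 2**n

print(p_exactly_k_heads(2, 1))  # 0.5, i.e. HT and TH out of {HH, HT, TH, TT}
print(p_exactly_k_heads(4, 2))  # 0.375, i.e. 6/16
```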

Now let’s look at what happens when we’re interested in more than one event occurring. We might study the possibility of either event A or event B happening, where the “or” here can be exclusive (we want exactly one of them to happen, but not both) or inclusive (we want at least one of them to happen), or we could study the possibility of both event A and event B happening. It turns out that these two notions, which seem somehow opposite, are delicately related.

First of all, let’s consider both events A and B occurring. What does this mean? Well, by our definition of “occurring”, it means that the outcome of the process is an element of A and also of B, which is equivalent to saying that it is an element of their intersection A\cap B. In other words, P(A\text{ and }B) = P(A\cap B). This lets us define two very important terms.

We say that A and B are mutually exclusive if P(A\cap B)=0, and independent if P(A\cap B)=P(A)P(B).

In words, A and B are mutually exclusive if one of them occurring precludes the other from occurring, i.e. if at most one of them can occur, and they are independent if one of them occurring has no effect on the other occurring.

Usually, when we talk of mutually exclusive events we are referring to a single run of an experiment, and for independent events we are referring to multiple runs. For example, “rolling an even number” and “rolling an odd number” are mutually exclusive events when rolling a single die once, but independent events when rolling a single die twice.21 Basically, we should be careful when talking about events and make sure to be precise as to what our sample space is, and how the event is actually realised as a subset of this.
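
Since everything here is a finite set, both definitions can be checked by brute-force enumeration. Here is a sketch of the single-die-rolled-twice example (the event names are ours):

```python
from fractions import Fraction
from itertools import product

# Sample space for rolling a single die twice: ordered pairs of results.
omega = set(product(range(1, 7), repeat=2))

def P(event):
    return Fraction(len(event), len(omega))

first_even = {(a, b) for (a, b) in omega if a % 2 == 0}
second_odd = {(a, b) for (a, b) in omega if b % 2 == 1}

# Independent: P(A and B) = P(A) P(B).
print(P(first_even & second_odd) == P(first_even) * P(second_odd))  # True

# On a single roll, "even" and "odd" are instead mutually exclusive:
print({2, 4, 6} & {1, 3, 5})  # set(), so the intersection has probability 0
```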

We can think of mutually exclusive and independent as extreme ends of a scale: on one side we have events that affect each other so strongly that if one occurs then we know with absolute certainty that the other one did not; on the other we have events that have absolutely no effect on each other whatsoever. One might wonder what the opposite of mutually exclusive might be, and there are two ideas that seem like they might be interesting: events A and B such that P(A\cap B)=1, and events A and B such that, if one occurs, then the other also always occurs. The first of these two putative definitions is not so interesting, because if P(A\cap B)=1 then both A and B always occur, so P(A)=P(B)=1, and (since our outcomes are all equally likely) this means that A=B=\Omega is just the event that “anything happens”. The second is a bit less trivial, but also not so deep: saying that A occurs if and only if B occurs is equivalent to saying that A=B as events.

Now let’s think about either event A or event B occurring, in the inclusive-or sense of the word (we will return to the exclusive one afterwards). This is equivalent to the event given by the union A\cup B occurring. In other words, P(A\text{ or }B) = P(A\cup B).

The relationship between P(A\cup B) and P(A\cap B) is given by the following principle.

Inclusion–exclusion principle. P(A\cup B)=P(A)+P(B)-P(A\cap B).

We will not prove this, but it’s a fun exercise to think about why this must be true.22 In fact, the general inclusion–exclusion principle describes what happens for an arbitrary finite number of events. Using this, we can see why mutually exclusive events are particularly nice: if A and B are mutually exclusive, then P(A\cap B)=0, and so P(A\cup B)=P(A)+P(B), i.e. the probability of (A\text{ or }B) is the sum of the probability of A and the probability of B.
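
As a sanity check, we can verify the principle by enumeration for the die events from the table above:

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}
A = {1, 3, 5}   # rolling an odd number
B = {1, 2, 3}   # rolling a number less than 4

def P(event):
    return Fraction(len(event), len(omega))

print(P(A | B))                # 2/3
print(P(A) + P(B) - P(A & B))  # 1/2 + 1/2 - 1/3 = 2/3, as claimed
```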

In the same way that mutually exclusive events are special in the eyes of the inclusion–exclusion principle, independent events are special in the eyes of conditional probability. Oftentimes we consider events that are not independent, such as drawing cards from a deck without replacing them afterwards: if I take the ace of hearts, then the probability of me drawing a heart the next time has gone down from 13/52 to 12/51, since there is now one fewer heart in the deck. Even worse, the probability of me drawing the ace of hearts again is now 0.

Given two events A and B, we define the conditional probability P(B|A), of B given A, to be the probability that B occurs assuming that A has just previously occurred. Thinking back to the definition of probability, we can calculate this by taking the number of outcomes in A\cap B (our desired outcomes, the outcomes in B that are also outcomes in A) and dividing it by the number of outcomes in A (all possible outcomes, since we are assuming that A has happened), so that P(B|A) = \frac{P(A\cap B)}{P(A)}.
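
In code, the defining formula is a one-liner; here we use it to compute, say, the probability of having rolled a 1 given that we rolled an odd number (a small example of our own):

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}

def P(event):
    return Fraction(len(event), len(omega))

def conditional(B, A):
    """P(B|A) = P(A and B) / P(A), assuming P(A) != 0."""
    return P(A & B) / P(A)

print(conditional({1}, {1, 3, 5}))  # 1/3: one of the three odd outcomes is a 1
```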

Conditional probabilities are the source of many misunderstandings. For example, it’s intuitively obvious that the probability of flipping a coin 100 times and getting heads every single time is very small. So say we’ve flipped a coin 99 times and managed to get heads every single time: are we now more likely to flip tails, because the chance of getting 100 heads in a row must be small? Well, we don’t even need mathematics to answer this: the coin has no way of remembering what happened on the previous flips! In other words, although the probability P(H^{100})=P(H^{99}\text{ and }H) is very small, the (conditional) probability P(H|H^{99}) is still exactly 1/2. Looking at the definition of conditional probability, this makes sense: P(H^{99}) is itself very small, only twice as big as P(H^{100}), so it’s not surprising that when we divide one by the other we get a number that’s not so far away from 1.
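
A quick check with exact arithmetic bears this out:

```python
from fractions import Fraction

p_h99 = Fraction(1, 2) ** 99    # P(H^99): tiny
p_h100 = Fraction(1, 2) ** 100  # P(H^100): exactly half as big

# Since 100 heads in a row implies 99 heads in a row, the event
# (H^99 and H) is just H^100, so P(H | H^99) = P(H^100) / P(H^99).
print(p_h100 / p_h99)  # 1/2
```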

Now we can see how independent events are special: if P(A\cap B)=P(A)P(B) then \begin{aligned} P(B|A) &= \frac{P(A\cap B)}{P(A)} \\&= \frac{P(A)P(B)}{P(A)} \\&= P(B) \end{aligned} and so the probability of B given A is exactly the same as the probability of B without knowing anything about the outcome of A (and similarly for P(A|B)=P(A)).

Finally, we mention what might be called the fundamental theorem of conditional probability.

Bayes’ theorem. Let A and B be events with P(B)\neq0. Then P(A|B) = \frac{P(B|A)P(A)}{P(B)}.
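
Both sides of the equation can again be checked by enumeration. For instance, with two dice (and events chosen by us purely for illustration):

```python
from fractions import Fraction
from itertools import product

omega = set(product(range(1, 7), repeat=2))

def P(event):
    return Fraction(len(event), len(omega))

A = {(a, b) for (a, b) in omega if a == 6}       # first die shows a 6
B = {(a, b) for (a, b) in omega if a + b >= 10}  # total is at least 10

lhs = P(A & B) / P(B)                  # P(A|B), by definition
rhs = (P(A & B) / P(A)) * P(A) / P(B)  # P(B|A) P(A) / P(B)
print(lhs, rhs, lhs == rhs)            # 1/2 1/2 True
```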

You should now be able to answer the following questions:

  1. When you roll a normal six-sided die, what is the set of distinguishable outcomes?
  2. What is the probability of getting a 5?
  3. What is the probability of getting a number (strictly) less than 3?

Now imagine that you have two six-sided dice.

  1. If you roll both dice at the same time, what is the probability of them both landing on a 6?
  2. What is the probability of getting two numbers that add up to 6?

Finally, we give our two six-sided dice to a friend for them to roll in secret.

  1. If they tell us that they rolled two numbers that added up to 6, what is the probability that they rolled a 1?

We mentioned in Section 0.6 that a lot of the structure inherent in our formalism of quantum theory can be encapsulated by the notion of a dagger compact category, and can thus be investigated with a diagrammatic approach. It turns out that parts of probability theory — specifically Markov processes, which describe scenarios where different events can happen with varying probabilities, but where nothing depends on the history of the scenario, only on the here-and-now — are also amenable to such an approach. This leads to the definition of a Markov category.
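
Markov categories themselves are well beyond our scope, but the memorylessness that defines a Markov process is easy to illustrate: the next state is sampled from a distribution that depends only on the current state. A toy sketch (the two-state “weather” model is our own invention):

```python
import random

# Transition probabilities: P(next state | current state).
transitions = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def step(state):
    # The history never enters: only the current state matters.
    options = transitions[state]
    return random.choices(list(options), weights=list(options.values()))[0]

state = "sunny"
for _ in range(5):
    state = step(state)
    print(state)
```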


  19. What we actually mean by “the odds of something happening” and how we should really interpret probabilities “in the real world” is a profound philosophical problem that we shall completely pass over.↩︎

  20. Another possibility would be to distinguish the coins in time instead of space, i.e. to flip one coin first and then the other afterwards. A coin cannot remember what happened the last time it was flipped, so is there really a difference between flipping a single coin twice or two coins once? In the eyes of probability theory, the answer is “no”.↩︎

  21. Exercise. Are the events “rolling an even number” and “rolling an odd number” still independent when we think of rolling two dice simultaneously?↩︎

  22. Drawing a Venn diagram might help.↩︎