MODULE 9: Randomized Algorithms
Probability Theory Basics

The right language and mathematical model for analyzing and studying randomization is probability theory. The goal of this chapter is to remind you of the basic definitions and theorems of discrete probability theory.

1. Basic Definitions and Properties
Definition (Finite probability space, sample space, probability distribution)

A finite probability space is a tuple $(\Omega, \Pr)$, where

  • $\Omega$ is a non-empty finite set called the sample space;

  • $\Pr : \Omega \to [0,1]$ is a function, called the probability distribution, with the property that $\sum_{\ell \in \Omega} \Pr[\ell] = 1$.

The elements of $\Omega$ are called outcomes or samples. If $\Pr[\ell] = p$, then we say that the probability of outcome $\ell$ is $p$.

Remark (Why do probabilities sum to 1?)

The probabilities sum to $1$ by convention. We could have defined things so that they sum to $100$ and think of probabilities as percentages, or so that they sum to some other value. What is important is that a consistent choice is made. Note that the choice of $1$ allows us to think of probabilities as fractions, and it is arguably the most natural choice.

Note (Modeling randomness)

The abstract definition of a finite probability space above helps us mathematically model and reason about situations involving randomness and uncertainty (such situations are often called “random experiments” or just “experiments”). For example, consider the experiment of flipping a single coin. We model this as follows. We let $\Omega = \{\text{Heads}, \text{Tails}\}$ and we define the function $\Pr$ such that $\Pr[\text{Heads}] = 1/2$ and $\Pr[\text{Tails}] = 1/2$. This corresponds to our intuitive understanding that the probability of seeing the outcome Heads is $1/2$ and the probability of seeing the outcome Tails is also $1/2$.

[Figure: sample space for a single coin flip]

If we flip two coins, the sample space would look as follows (H represents “Heads” and T represents “Tails”).

[Figure: sample space for two coin flips]

One can visualize the probability space as a circle or pie with area 1. Each outcome gets a slice of the pie proportional to its probability.

[Figure: probability space drawn as a pie with area 1]

Note (Restriction to finite sample spaces)

In this course, we’ll usually restrict ourselves to finite sample spaces. In cases where we need a countably infinite $\Omega$, the above definition will generalize naturally.

Exercise (Probability space modeling)

How would you model a roll of a single $6$-sided die using Definition (Finite probability space, sample space, probability distribution)? How about a roll of two dice? How about a roll of a die and a coin toss together?

Solution

For the case of a single 6-sided die, we want the model to match our intuitive understanding and real-world experience that the probability of observing each of the possible die rolls $1, 2, \dots, 6$ is equal. We formalize this by defining the sample space $\Omega$ and the probability distribution $\Pr$ as follows:

$$\Omega = \{1,2,3,4,5,6\}, \qquad \Pr[\ell] = \frac{1}{6} \ \text{ for all } \ell \in \Omega.$$

Similarly, to model a roll of two dice, we can let each outcome be an ordered pair representing the roll of each of the two dice:

$$\Omega = \{1,2,3,4,5,6\} \times \{1,2,3,4,5,6\}, \qquad \Pr[\ell] = \left(\frac{1}{6}\right)^2 = \frac{1}{36} \ \text{ for all } \ell \in \Omega.$$

Lastly, to model a roll of a die and a coin toss together, we can let each outcome be an ordered pair where the first element represents the result of the die roll and the second element represents the result of the coin toss:

$$\Omega = \{1,2,3,4,5,6\} \times \{\text{Heads},\text{Tails}\}, \qquad \Pr[\ell] = \frac{1}{6} \cdot \frac{1}{2} = \frac{1}{12} \ \text{ for all } \ell \in \Omega.$$
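As a quick sanity check (a sketch, not part of the notes), the sample spaces above can be built with Cartesian products and the defining property of a probability distribution, that the values sum to $1$, can be verified mechanically. Exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction
from itertools import product

# Sample spaces from the solution above, built as Cartesian products.
die = range(1, 7)
two_dice = list(product(die, die))
die_and_coin = list(product(die, ["Heads", "Tails"]))

# Uniform distributions; Fraction keeps the arithmetic exact.
pr_two_dice = {w: Fraction(1, 36) for w in two_dice}
pr_die_coin = {w: Fraction(1, 12) for w in die_and_coin}

# The defining property of a probability distribution: values sum to 1.
assert sum(pr_two_dice.values()) == 1
assert sum(pr_die_coin.values()) == 1
print(len(two_dice), len(die_and_coin))  # sizes of the sample spaces: 36 12
```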

Note (Modeling through randomized code)

Sometimes, modeling a real-world random experiment as a probability space can be non-trivial or tricky. It helps a lot to have a step in between where you first go from the real-world experiment to computer code/algorithm (that makes calls to random number generators), and then define your probability space based on the computer code. In this course, we allow our programs to have access to the functions $\text{Bernoulli}(p)$ and $\text{RandInt}(n)$. The function $\text{Bernoulli}(p)$ takes a number $0 \leq p \leq 1$ as input and returns $1$ with probability $p$ and $0$ with probability $1-p$. The function $\text{RandInt}(n)$ takes a positive integer $n$ as input and returns a random integer from $1$ to $n$ (i.e., every number from $1$ to $n$ has probability $1/n$). Here is a very simple example of going from a real-world experiment to computer code. The experiment is as follows. You flip a fair coin. If it’s heads, you roll a $3$-sided die. If it is tails, you roll a $4$-sided die. This experiment can be represented as:

    flip = Bernoulli(1/2)
    if flip == 0:
        die = RandInt(3)
    else:
        die = RandInt(4)

If we were to ask “What is the probability that you roll a 3 or higher?”, this would correspond to asking for the probability that, after the above code is executed, the variable named die stores a value that is 3 or higher.
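As a sketch of this idea, the probability tree of the snippet above can be enumerated exactly in Python. The branch probabilities below follow the uniform semantics of $\text{Bernoulli}$ and $\text{RandInt}$ given in the note, with heads leading to the $3$-sided die.

```python
from fractions import Fraction

# Enumerate the probability tree of the coin-then-die experiment exactly.
# Outcomes are (coin, die) pairs with their probabilities.
outcomes = {}
for coin, die_sides in [("Heads", 3), ("Tails", 4)]:
    for die in range(1, die_sides + 1):
        # Pr[branch] = Pr[coin] * Pr[die | coin]
        outcomes[(coin, die)] = Fraction(1, 2) * Fraction(1, die_sides)

assert sum(outcomes.values()) == 1  # a valid probability distribution

# Probability that the variable die ends up 3 or higher.
p = sum(pr for (coin, die), pr in outcomes.items() if die >= 3)
print(p)  # 5/12
```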

One advantage of modeling with randomized code is that if there is any ambiguity in the description of the random (real-world) experiment, it should be resolved in this step of creating the randomized code.

The second advantage is that it allows you to easily imagine a probability tree associated with the randomized code. The probability tree gives you a clear picture of what the sample space is and how to compute the probabilities of the outcomes. The probability tree corresponding to the above code is as follows.

[Figure: probability tree for the code above]

This simple example may not illustrate the usefulness of having a computer code representation of the random experiment, but one can appreciate its value with more sophisticated examples, and we do encourage you to think of random experiments as computer code:

$$\text{real-world experiment} \longrightarrow \text{computer code / probability tree} \longrightarrow \text{probability space } (\Omega, \Pr).$$

Definition (Uniform distribution)

If a probability distribution $\Pr : \Omega \to [0,1]$ is such that $\Pr[\ell] = 1/|\Omega|$ for all $\ell \in \Omega$, then we call it a uniform distribution.

Definition (Event)

Let $(\Omega, \Pr)$ be a probability space. Any subset of outcomes $E \subseteq \Omega$ is called an event. We extend the $\Pr[\cdot]$ notation and write $\Pr[E]$ to denote $\sum_{\ell \in E} \Pr[\ell]$. Using this notation, $\Pr[\varnothing] = 0$ and $\Pr[\Omega] = 1$. We use the notation $\overline{E}$ to denote the event $\Omega \setminus E$.

Example

Continuing the example given in Note (Modeling through randomized code), we can define the event $E = \{(H,3), (T,3), (T,4)\}$. In other words, $E$ can be described as “die roll is $3$ or higher”. The probability of $E$, $\Pr[E]$, is equal to $1/6 + 1/8 + 1/8 = 5/12$.

[Figure: probability tree with the outcomes of E highlighted]

Exercise (Practice with events)
  1. Suppose we roll two $6$-sided dice. How many events are there? Write down the event corresponding to “we roll a double” and determine its probability.

  2. Suppose we roll a $3$-sided die and see the number $d$. We then roll a $d$-sided die. How many different events are there? Write down the event corresponding to “the second roll is a $2$” and determine its probability.

Solution

Part (1): We use the model for rolling two 6-sided dice as in Exercise (Probability space modeling). Since an event is any subset of outcomes $E \subseteq \Omega$, the number of events is the number of such subsets, which is $|\wp(\Omega)| = 2^{|\Omega|} = 2^{36}$ (here $\wp(\Omega)$ denotes the power set of $\Omega$).

The event corresponding to “we roll a double” can be expressed as $E = \big\{(\ell,\ell) : \ell \in \{1,2,3,4,5,6\}\big\}$, which has probability

$$\Pr[E] = \sum_{\ell \in E}\Pr[\ell] = \frac{1}{36} \cdot |E| = \frac{6}{36} = \frac{1}{6}.$$

Part (2): We model the two dice rolls as follows:

$$\Omega = \big\{(a,b) \in \{1,2,3\}^2 : a \geq b\big\}, \qquad \Pr[(a,b)] = \frac{1}{3} \cdot \frac{1}{a} = \frac{1}{3a}.$$

The restriction that $a \geq b$ is imposed because the second die depends on the first roll, and the result of the second roll cannot be larger than that of the first. Note that we could also have used a model where $\Omega = \{1,2,3\}^2$ and assigned a probability of $0$ to the outcomes where $a < b$, but considering outcomes that never occur is typically not very useful.

In the model we originally defined, the number of events is $|\wp(\Omega)| = 2^{|\Omega|} = 2^6 = 64$. Note that the number of events depends on the size of the sample space, so this number can vary depending on the model.

The event corresponding to “the second roll is a 2” is given by $E = \{(a,b) \in \Omega : b = 2\} = \{(2,2),(3,2)\}$, which has probability

$$\Pr[E] = \sum_{\ell \in E}\Pr[\ell] = \Pr[(2,2)] + \Pr[(3,2)] = \frac{1}{3\cdot2} + \frac{1}{3\cdot3} = \frac{5}{18}.$$
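Part (2) can also be checked mechanically. The sketch below builds the sample space of the solution above, counts the events, and sums the probabilities of the outcomes with second roll equal to $2$.

```python
from fractions import Fraction

# Sample space for part (2): roll a 3-sided die (result a),
# then roll an a-sided die (result b).
pr = {}
for a in range(1, 4):
    for b in range(1, a + 1):
        pr[(a, b)] = Fraction(1, 3) * Fraction(1, a)

assert sum(pr.values()) == 1  # a valid probability distribution

# Number of events = number of subsets of the sample space.
num_events = 2 ** len(pr)
# Event "the second roll is a 2".
p_second_is_2 = sum(q for (a, b), q in pr.items() if b == 2)
print(num_events, p_second_is_2)  # 64 5/18
```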

Exercise (Basic facts about probability)

Let $A$ and $B$ be two events. Prove the following.

  • If $A \subseteq B$, then $\Pr[A] \leq \Pr[B]$.

  • $\Pr[\overline{A}] = 1 - \Pr[A]$.

  • $\Pr[A \cup B] = \Pr[A] + \Pr[B] - \Pr[A \cap B]$.

Solution

Part (1): Suppose $A \subseteq B$. Recall that $\Pr$ is a non-negative function (i.e., $\Pr[\ell] \geq 0$ for all $\ell \in \Omega$). Hence,

$$\begin{aligned} \Pr[A] &= \sum_{\ell \in A}\Pr[\ell] && \text{by the definition of event} \\ &\leq \sum_{\ell \in A}\Pr[\ell] + \sum_{\ell \in B \setminus A}\Pr[\ell] && \text{by non-negativity of } \Pr \\ &= \sum_{\ell \in B}\Pr[\ell] \\ &= \Pr[B]. \end{aligned}$$

Part (2): Recall that $\overline{A}$ denotes $\Omega \setminus A$, and that $\sum_{\ell \in \Omega}\Pr[\ell] = 1$. Hence,

$$\begin{aligned} \Pr[\overline{A}] &= \sum_{\ell \in \overline{A}}\Pr[\ell] && \text{by definition of event}\\ &= \sum_{\ell \in \Omega \setminus A}\Pr[\ell] \\ &= \sum_{\ell \in \Omega}\Pr[\ell] - \sum_{\ell \in A}\Pr[\ell] \\ &= 1 - \Pr[A] && \text{by definition of prob. distr.} \end{aligned}$$

Part (3): By partitioning $A \cup B$ into $A \setminus B$, $B \setminus A$, and $A \cap B$, we see that

$$\begin{aligned} \Pr[A \cup B] &= \sum_{\ell \in A \cup B}\Pr[\ell] && \text{by definition of event} \\ &= \sum_{\ell \in A \setminus B}\Pr[\ell] + \sum_{\ell \in B \setminus A}\Pr[\ell] + \sum_{\ell \in A \cap B}\Pr[\ell] \\ &= \left(\sum_{\ell \in A}\Pr[\ell] - \sum_{\ell \in A \cap B}\Pr[\ell]\right) + \left(\sum_{\ell \in B}\Pr[\ell] - \sum_{\ell \in A \cap B}\Pr[\ell]\right) + \sum_{\ell \in A \cap B}\Pr[\ell] \\ &= \sum_{\ell \in A}\Pr[\ell] + \sum_{\ell \in B}\Pr[\ell] - \sum_{\ell \in A \cap B}\Pr[\ell] \\ &= \Pr[A] + \Pr[B] - \Pr[A \cap B] && \text{by definition of event.} \end{aligned}$$

Definition (Disjoint events)

We say that two events $A$ and $B$ are disjoint events if $A \cap B = \varnothing$.

Exercise (Union bound)

Let $A_1, A_2, \ldots, A_n$ be events. Then

$$\Pr[A_1 \cup A_2 \cup \cdots \cup A_n] \leq \Pr[A_1] + \Pr[A_2] + \cdots + \Pr[A_n].$$

Moreover, equality holds if and only if $\Pr[A_i \cap A_j] = 0$ for all $i \neq j$; in particular, equality holds when the $A_i$’s are pairwise disjoint.

Solution

By the last part of Exercise (Basic facts about probability), we can conclude that $\Pr[A \cup B] \leq \Pr[A] + \Pr[B]$. We also notice that equality holds if and only if $\Pr[A \cap B] = 0$; in particular, this is the case when $A$ and $B$ are disjoint (note that $\Pr[A \cap B] = 0$ does not force $A \cap B = \varnothing$ if some outcomes have probability $0$). We will extend this by induction on $n$.

The expression is identical on both sides for the case where $n = 1$, and the case where $n = 2$ is exactly as above. These serve as the base cases for our induction.

For the inductive case, assume the proposition holds true for $n = k$. We seek to show that it also holds for $n = k+1$. Given events $A_1, A_2, \dots, A_k, A_{k+1}$, let

$$A = A_1 \cup A_2 \cup \cdots \cup A_k, \qquad B = A_{k+1}.$$

Then

$$\begin{aligned} \Pr[A_1 \cup A_2 \cup \cdots \cup A_k \cup A_{k+1}] &= \Pr[A \cup B] \\ &\leq \Pr[A] + \Pr[B] && \text{by the above result} \\ &= \Pr[A_1 \cup A_2 \cup \cdots \cup A_k] + \Pr[A_{k+1}] \\ &\leq \left(\Pr[A_1] + \cdots + \Pr[A_k]\right) + \Pr[A_{k+1}] && \text{by IH} \end{aligned}$$

where equality holds if and only if $\Pr[A \cap B] = 0$ and equality holds in the induction hypothesis, i.e., $\Pr[A_i \cap A_j] = 0$ for all $1 \leq i < j \leq k$. Since $A \cap B = \bigcup_{t=1}^{k} (A_t \cap A_{k+1})$, monotonicity and the inequality above give $\Pr[A_t \cap A_{k+1}] \leq \Pr[A \cap B] \leq \sum_{t=1}^{k} \Pr[A_t \cap A_{k+1}]$, so $\Pr[A \cap B] = 0$ if and only if $\Pr[A_t \cap A_{k+1}] = 0$ for all $1 \leq t \leq k$. Combining, equality holds if and only if $\Pr[A_i \cap A_j] = 0$ for all $1 \leq i < j \leq k+1$, which in particular is the case when $A_1, \dots, A_{k+1}$ are pairwise disjoint. This completes the proof.
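The union bound can be illustrated numerically. The sketch below (events chosen for illustration, not from the text) takes three overlapping events on the uniform two-dice space and checks that the probability of their union is at most the sum of their probabilities, with strict inequality because the events overlap.

```python
from fractions import Fraction
from itertools import product

# Uniform space of two 6-sided dice.
omega = list(product(range(1, 7), repeat=2))
pr = Fraction(1, 36)

def prob(event):
    return pr * len(event)

A1 = {w for w in omega if w[0] == 6}           # first die is 6
A2 = {w for w in omega if w[1] == 6}           # second die is 6
A3 = {w for w in omega if w[0] + w[1] == 12}   # sum is 12 (overlaps both)

union = A1 | A2 | A3
assert prob(union) <= prob(A1) + prob(A2) + prob(A3)  # union bound
print(prob(union), prob(A1) + prob(A2) + prob(A3))  # 11/36 13/36
```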

Definition (Conditional probability)

Let $E$ be an event with $\Pr[E] \neq 0$. The conditional probability of outcome $\ell \in \Omega$ given $E$, denoted $\Pr[\ell \mid E]$, is defined as

$$\Pr[\ell \mid E] = \begin{cases} 0 & \text{if } \ell \notin E, \\ \dfrac{\Pr[\ell]}{\Pr[E]} & \text{if } \ell \in E. \end{cases}$$

For an event $A$, the conditional probability of $A$ given $E$, denoted $\Pr[A \mid E]$, is defined as

$$\Pr[A \mid E] = \frac{\Pr[A \cap E]}{\Pr[E]}.$$

Example

Once again continuing the example given in Note (Modeling through randomized code), let $E$ be the event that the die roll is $3$ or higher. Then for each outcome of the sample space, we can calculate its probability given the event $E$.

[Figure: probability tree with the conditional probabilities given E]

For example, $\Pr[(H,1) \mid E] = 0$ and $\Pr[(H,3) \mid E] = 2/5$. We can also calculate the conditional probabilities of events. Let $A$ be the event that the coin toss resulted in Tails. Then $\Pr[A \mid E] = 3/10 + 3/10 = 3/5$.

Note (Intuitive understanding of conditional probability)

Although it may not be immediately obvious, the above definition of conditional probability does correspond to our intuitive understanding of what conditional probability should represent. If we are told that event $E$ has already happened, then we know that the probability of any outcome outside of $E$ should be $0$. Therefore, we can view conditioning on event $E$ as a transformation of our probability space in which we revise the probabilities (i.e., we revise the probability function $\Pr[\cdot]$). In particular, the original probability space $(\Omega, \Pr)$ gets transformed to $(\Omega, \Pr_E)$, where $\Pr_E$ is such that for any $\ell \notin E$, we have $\Pr_E[\ell] = 0$, and for any $\ell \in E$, we have $\Pr_E[\ell] = \Pr[\ell]/\Pr[E]$. The $1/\Pr[E]$ factor here is a necessary normalization factor that ensures the probabilities of all the outcomes sum to $1$ (which is required by Definition (Finite probability space, sample space, probability distribution)). Indeed,

$$\begin{aligned} \sum_{\ell \in \Omega} \Pr_E[\ell] &= \sum_{\ell \notin E} \Pr_E[\ell] + \sum_{\ell \in E} \Pr_E[\ell] \\ &= 0 + \sum_{\ell \in E} \Pr[\ell]/\Pr[E] \\ &= \frac{1}{\Pr[E]} \sum_{\ell \in E} \Pr[\ell] \\ &= 1. \end{aligned}$$

If we are interested in the event “$A$ given $E$” (denoted by $A \mid E$) in the probability space $(\Omega, \Pr)$, then we are interested in the event $A$ in the probability space $(\Omega, \Pr_E)$. That is, $\Pr[A \mid E] = \Pr_E[A]$. Therefore,

$$\Pr[A \mid E] = \Pr_E[A] = \Pr_E[A \cap E] = \frac{\Pr[A \cap E]}{\Pr[E]},$$

where the last equality holds by the definition of $\Pr_E[\cdot]$. We have thus recovered the equality in Definition (Conditional probability).

Conditioning on event $E$ can also be viewed as redefining the sample space $\Omega$ to be $E$ and then renormalizing the probabilities so that $\Pr[\Omega] = \Pr[E] = 1$.

[Figure: conditioning viewed as shrinking the sample space to E]

Exercise (Conditional probability practice)

Suppose we roll a $3$-sided die and see the number $d$. We then roll a $d$-sided die. We are interested in the probability that the first roll was a $1$ given that the second roll was a $1$. First express this probability using the notation of conditional probability and then determine what the probability is.

Solution

We use the model defined in Exercise (Practice with events). The event that the first roll is a 1 is $E_1 = \{(1,1)\}$ and the event that the second roll is a 1 is $E_2 = \{(1,1),(2,1),(3,1)\}$, with corresponding probabilities

$$\Pr[E_1] = \frac{1}{3}, \qquad \Pr[E_2] = \frac{1}{3\cdot1} + \frac{1}{3\cdot2} + \frac{1}{3\cdot3} = \frac{11}{18}.$$

Then the conditional probability we are interested in is

$$\Pr[E_1 \mid E_2] = \frac{\Pr[E_1 \cap E_2]}{\Pr[E_2]} = \frac{\Pr[E_1]}{\Pr[E_2]} = \frac{1/3}{11/18} = \frac{6}{11}.$$
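The calculation above can be reproduced directly from the definition of conditional probability. The sketch below rebuilds the model from the solution and divides $\Pr[E_1 \cap E_2]$ by $\Pr[E_2]$.

```python
from fractions import Fraction

# Exact probabilities for the 3-sided-die-then-d-sided-die model.
pr = {}
for a in range(1, 4):            # first roll (3-sided die)
    for b in range(1, a + 1):    # second roll (a-sided die)
        pr[(a, b)] = Fraction(1, 3 * a)

E1 = {w for w in pr if w[0] == 1}  # first roll is 1
E2 = {w for w in pr if w[1] == 1}  # second roll is 1

def prob(event):
    return sum(pr[w] for w in event)

# Definition of conditional probability: Pr[E1 | E2] = Pr[E1 ∩ E2] / Pr[E2].
p = prob(E1 & E2) / prob(E2)
print(p)  # 6/11
```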

Proposition (Chain rule)

Let $n \geq 2$ and let $A_1, A_2, \ldots, A_n$ be events. Then

$$\Pr[A_1 \cap \cdots \cap A_n] = \Pr[A_1] \cdot \Pr[A_2 \mid A_1] \cdot \Pr[A_3 \mid A_1 \cap A_2] \cdots \Pr[A_n \mid A_1 \cap A_2 \cap \cdots \cap A_{n-1}].$$

Proof

We prove the proposition by induction on $n$. The base case with two events follows directly from the definition of conditional probability. For the inductive step, let $A = A_n$ and $B = A_1 \cap \cdots \cap A_{n-1}$. Then

$$\begin{aligned} \Pr[A_1 \cap \cdots \cap A_n] &= \Pr[A \cap B] \\ &= \Pr[B] \cdot \Pr[A \mid B] \\ &= \Pr[A_1 \cap \cdots \cap A_{n-1}] \cdot \Pr[A_n \mid A_1 \cap \cdots \cap A_{n-1}], \end{aligned}$$

where we used the definition of conditional probability for the second equality. Applying the induction hypothesis to $\Pr[A_1 \cap \cdots \cap A_{n-1}]$ gives the desired result.

Exercise (Practice with chain rule)

Suppose there are 100 students in 15-251 and 5 of the students are Andrew Yang supporters. We pick 3 distinct students from the class uniformly at random. Calculate the probability that none of them are Andrew Yang supporters using Proposition (Chain rule).

Solution

For $i = 1,2,3$, let $A_i$ be the event that the $i$’th student we pick is not an Andrew Yang supporter. Then using the chain rule, the probability that none of them are Andrew Yang supporters is

$$\Pr[A_1 \cap A_2 \cap A_3] = \Pr[A_1] \cdot \Pr[A_2 \mid A_1] \cdot \Pr[A_3 \mid A_1 \cap A_2] = \frac{95}{100} \cdot \frac{94}{99} \cdot \frac{93}{98}.$$
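As a sketch, the chain-rule product can be computed exactly and cross-checked against a counting argument (choose 3 of the 95 non-supporters out of all ways to choose 3 of the 100 students); the two approaches agree.

```python
from fractions import Fraction
from math import comb

# Chain-rule product from the solution above, computed exactly.
p_chain = Fraction(95, 100) * Fraction(94, 99) * Fraction(93, 98)

# Cross-check with counting: choose 3 of the 95 non-supporters.
p_count = Fraction(comb(95, 3), comb(100, 3))

assert p_chain == p_count
print(p_chain)
```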

Definition (Independent events)
  • Let $A$ and $B$ be two events. We say that $A$ and $B$ are independent events if $\Pr[A \cap B] = \Pr[A] \cdot \Pr[B]$. Note that if $\Pr[B] \neq 0$, then this is equivalent to $\Pr[A \mid B] = \Pr[A]$. If $\Pr[A] \neq 0$, it is also equivalent to $\Pr[B \mid A] = \Pr[B]$.

  • Let $A_1, A_2, \ldots, A_n$ be events. We say that $A_1, \ldots, A_n$ are independent if for any subset $S \subseteq \{1,2,\ldots,n\}$,

$$\Pr\left[\bigcap_{i \in S} A_i\right] = \prod_{i \in S} \Pr[A_i].$$

Note (Defining independence through computer code)

Above we have given the definition of independent events as presented in 100% of the textbooks on probability theory. Yet, there is something deeply unsatisfying about this definition. In many situations people want to compute a probability of the form $\Pr[A \cap B]$, and if possible (if the events are independent), would like to use the equality $\Pr[A \cap B] = \Pr[A] \Pr[B]$ to simplify the calculation. In order to do this, they will informally argue that events $A$ and $B$ are independent in the intuitive sense of the word. For example, they argue that if $B$ happens, then this doesn’t affect the probability of $A$ happening (this argument is not done by calculation, but by informal reasoning). Then using this, they justify using the equality $\Pr[A \cap B] = \Pr[A] \Pr[B]$ in their calculations. So really, secretly, people are not using Definition (Independent events) but some other non-formal intuitive definition of independence, and then concluding what the formal definition says, which is $\Pr[A \cap B] = \Pr[A] \Pr[B]$.

To be a bit more explicit, recall that the approach to answering probability-related questions is to go from a real-world experiment we want to analyze to a formal probability space model:

$$\text{real-world experiment} \longrightarrow \text{probability space } (\Omega, \Pr).$$

People often argue the independence of events $A$ and $B$ on the left-hand side in order to use $\Pr[A \cap B] = \Pr[A] \Pr[B]$ on the right-hand side. The left-hand side, however, is not really a formal setting and may have ambiguities. And why does our intuitive notion of independence allow us to conclude $\Pr[A \cap B] = \Pr[A] \Pr[B]$? In these situations, it helps to add the “computer code” step in between:

$$\text{real-world experiment} \longrightarrow \text{computer code} \longrightarrow \text{probability space } (\Omega, \Pr).$$

Computer code has no ambiguities, and we can give a formal definition of independence using it. Suppose you have randomized code modeling the real-world experiment, and suppose that you can divide the code into two separate parts. Suppose $A$ is an event that depends only on the first part of the code, and $B$ is an event that depends only on the second part of the code. If you can prove that the two parts of the code cannot affect each other, then we say that $A$ and $B$ are independent. When $A$ and $B$ are independent in this sense, one can verify that the equality $\Pr[A \cap B] = \Pr[A] \Pr[B]$ indeed holds.

Proposition (Law of total probability)

Let $A_1, A_2, \ldots, A_n, B$ be events such that the $A_i$’s form a partition of the sample space $\Omega$. Then

$$\Pr[B] = \Pr[B \cap A_1] + \Pr[B \cap A_2] + \cdots + \Pr[B \cap A_n].$$

Equivalently (assuming $\Pr[A_i] \neq 0$ for each $i$, so that the conditional probabilities are defined),

$$\Pr[B] = \Pr[A_1]\cdot \Pr[B \mid A_1] + \Pr[A_2]\cdot \Pr[B \mid A_2] + \cdots + \Pr[A_n]\cdot \Pr[B \mid A_n].$$

Exercise (Proof of the law of total probability)

Prove the above proposition.

Solution

We note that

$$B = B \cap \Omega = B \cap (A_1 \cup A_2 \cup \cdots \cup A_n) = (B \cap A_1) \cup (B \cap A_2) \cup \cdots \cup (B \cap A_n),$$

with the $(B \cap A_i)$’s being pairwise disjoint. The first equality, $\Pr[B] = \Pr[B \cap A_1] + \Pr[B \cap A_2] + \cdots + \Pr[B \cap A_n]$, follows from the equality case of Exercise (Union bound). The equivalent statement $\Pr[B] = \Pr[A_1] \cdot \Pr[B \mid A_1] + \Pr[A_2] \cdot \Pr[B \mid A_2] + \cdots + \Pr[A_n] \cdot \Pr[B \mid A_n]$ follows from Definition (Conditional probability).

Exercise (Practice with the law of total probability)

There are 2 bins. Bin 1 contains 6 red balls and 4 blue balls. Bin 2 contains 3 red balls and 7 blue balls. We choose a bin uniformly at random, and then choose one of the balls in that bin uniformly at random. Calculate the probability that the chosen ball is red using Proposition (Law of total probability).

Solution

Let $A_1$ and $A_2$ be the events that we pick the first and second bin respectively. Note that $A_1$ and $A_2$ partition the sample space, and $\Pr[A_1] = \Pr[A_2] = \frac{1}{2}$. Let $B$ be the event that the chosen ball is red. Then using the law of total probability,

$$\Pr[B] = \Pr[A_1] \cdot \Pr[B \mid A_1] + \Pr[A_2] \cdot \Pr[B \mid A_2] = \frac{1}{2} \cdot \frac{6}{10} + \frac{1}{2} \cdot \frac{3}{10} = \frac{9}{20}.$$
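The same computation, sketched as code: sum $\Pr[\text{bin}] \cdot \Pr[\text{red} \mid \text{bin}]$ over the two bins, exactly as the law of total probability prescribes.

```python
from fractions import Fraction

# Ball counts for the two bins in the exercise.
bins = {
    1: {"red": 6, "blue": 4},
    2: {"red": 3, "blue": 7},
}

# Law of total probability: Pr[red] = sum over bins of Pr[bin] * Pr[red | bin].
p_red = Fraction(0)
for counts in bins.values():
    total = sum(counts.values())
    p_red += Fraction(1, 2) * Fraction(counts["red"], total)

print(p_red)  # 9/20
```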

2. Random Variables Basics
Definition (Random variable)

A random variable is a function $\mathbf{X} : \Omega \to \mathbb{R}$, where $\Omega$ is a sample space.

Note (Random variable intuition)

Note that a random variable is just a labeling of the elements in $\Omega$ with some real numbers. One can think of this as a transformation of the original sample space into one that contains only numbers. For example, suppose the original sample space corresponds to flipping two coins. Then we can define a random variable $\mathbf{X}$ which maps an outcome in the sample space to the number of tails in the outcome. Since we are only flipping two coins, the possible outputs of $\mathbf{X}$ are $0$, $1$, and $2$.

[Figure: X mapping two-coin outcomes to the number of tails]

This kind of transformation is often desirable. For example, the transformation allows us to take a weighted average of the elements in the new sample space, where the weights correspond to the probabilities of the elements (if the distribution is uniform, the weighted average is just the regular average). This is called the expectation of the random variable and is formally defined below in Definition (Expected value of a random variable). Without this transformation into real numbers, the concept of an “expected value” (i.e. averaging) would not be possible to define.

Remark (Range of a random variable)

Almost always, the random variables we consider in this course will have a range that is a subset of $\mathbb{N}$.

Definition (Common events through a random variable)

Let $\mathbf{X}$ be a random variable and $x \in \mathbb{R}$ be some real value. We use

$$\begin{aligned} \mathbf{X}=x & \quad \text{to denote the event } \{\ell \in \Omega : \mathbf{X}(\ell) = x\},\\ \mathbf{X} \leq x & \quad \text{to denote the event } \{\ell \in \Omega : \mathbf{X}(\ell) \leq x\},\\ \mathbf{X} \geq x & \quad \text{to denote the event } \{\ell \in \Omega : \mathbf{X}(\ell) \geq x\},\\ \mathbf{X} < x & \quad \text{to denote the event } \{\ell \in \Omega : \mathbf{X}(\ell) < x\},\\ \mathbf{X} > x & \quad \text{to denote the event } \{\ell \in \Omega : \mathbf{X}(\ell) > x\}. \end{aligned}$$

For example, $\Pr[\mathbf{X}=x]$ denotes $\Pr[\{\ell \in \Omega : \mathbf{X}(\ell) = x\}]$. More generally, for $S \subseteq \mathbb{R}$, we use $\mathbf{X} \in S$ to denote the event $\{\ell \in \Omega : \mathbf{X}(\ell) \in S\}$.

Exercise (Practice with random variables)

Suppose we roll two $6$-sided dice. Let $\mathbf{X}$ be the random variable that denotes the sum of the numbers we see. Explicitly write down the input-output pairs for the function $\mathbf{X}$. Calculate $\Pr[\mathbf{X} \geq 7]$.

Solution

We use the model for rolling two 6-sided dice as in Exercise (Probability space modeling). Then

$$\begin{array}{llllll} \mathbf{X}(1,1) = 2, & \mathbf{X}(1,2) = 3, & \mathbf{X}(1,3) = 4, & \mathbf{X}(1,4) = 5, & \mathbf{X}(1,5) = 6, & \mathbf{X}(1,6) = 7, \\ \mathbf{X}(2,1) = 3, & \mathbf{X}(2,2) = 4, & \mathbf{X}(2,3) = 5, & \mathbf{X}(2,4) = 6, & \mathbf{X}(2,5) = 7, & \mathbf{X}(2,6) = 8, \\ \mathbf{X}(3,1) = 4, & \mathbf{X}(3,2) = 5, & \mathbf{X}(3,3) = 6, & \mathbf{X}(3,4) = 7, & \mathbf{X}(3,5) = 8, & \mathbf{X}(3,6) = 9, \\ \mathbf{X}(4,1) = 5, & \mathbf{X}(4,2) = 6, & \mathbf{X}(4,3) = 7, & \mathbf{X}(4,4) = 8, & \mathbf{X}(4,5) = 9, & \mathbf{X}(4,6) = 10, \\ \mathbf{X}(5,1) = 6, & \mathbf{X}(5,2) = 7, & \mathbf{X}(5,3) = 8, & \mathbf{X}(5,4) = 9, & \mathbf{X}(5,5) = 10, & \mathbf{X}(5,6) = 11, \\ \mathbf{X}(6,1) = 7, & \mathbf{X}(6,2) = 8, & \mathbf{X}(6,3) = 9, & \mathbf{X}(6,4) = 10, & \mathbf{X}(6,5) = 11, & \mathbf{X}(6,6) = 12. \end{array}$$

Since the probability distribution is uniform over the outcomes,

$$\Pr[\mathbf{X} \geq 7] = \frac{1}{36} \cdot |\{\ell \in \Omega : \mathbf{X}(\ell) \geq 7\}| = \frac{1}{36} \cdot 21 = \frac{7}{12}.$$
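Rather than tabulating the 36 input-output pairs by hand, the sketch below builds $\mathbf{X}$ as a dictionary (a function on the sample space) and counts the outcomes with $\mathbf{X} \geq 7$.

```python
from fractions import Fraction
from itertools import product

# X(a, b) = a + b on the uniform two-dice space.
omega = list(product(range(1, 7), repeat=2))
X = {w: w[0] + w[1] for w in omega}

# Pr[X >= 7] = (number of outcomes with X >= 7) / 36.
p = Fraction(sum(1 for w in omega if X[w] >= 7), 36)
print(p)  # 7/12
```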

Note (Random variables and randomized code)

Using the randomized code and probability tree point of view, we can simply define a random variable as a numerical variable in some randomized code (more accurately, the variable’s value at the end of the execution of the code). For example, consider the following randomized code.

    S = RandInt(6) + RandInt(6)
    if S == 12:
        I = 1
    else:
        I = 0

Here we have two random variables corresponding to the variables in the code, $\mathbf{S}$ and $\mathbf{I}$. From the probability tree picture, we can see how these two random variables can be viewed as functions from the sample space (the set of leaves) to $\mathbb{R}$, since they map each leaf/outcome to some numerical value.

[Figure: probability tree with the values of S and I at the leaves]

Note (Forgetting the original sample space)

Given some probability space $(\Omega, \Pr)$ and a random variable $\mathbf{X} : \Omega \to \mathbb{R}$, we often forget about the original sample space and consider the sample space to be the range of $\mathbf{X}$, $\operatorname{range}(\mathbf{X}) = \{\mathbf{X}(\ell) : \ell \in \Omega\}$.

Definition (Probability mass function (PMF))

Let $\mathbf{X} : \Omega \to \mathbb{R}$ be a random variable. The probability mass function (PMF) of $\mathbf{X}$ is the function $p_{\mathbf{X}} : \mathbb{R} \to [0,1]$ such that for any $x \in \mathbb{R}$, $p_{\mathbf{X}}(x) = \Pr[\mathbf{X} = x]$.

Note (Defining a random variable through PMF)

Related to the previous remark, we sometimes “define” a random variable by just specifying its probability mass function. In particular, we make no mention of the underlying sample space.

Definition (Expected value of a random variable)

Let $\mathbf{X}$ be a random variable. The expected value of $\mathbf{X}$, denoted $\mathbb{E}[\mathbf{X}]$, is defined as follows:

$$\mathbb{E}[\mathbf{X}] = \sum_{\ell \in \Omega} \Pr[\ell]\cdot \mathbf{X}(\ell).$$

Equivalently,

$$\mathbb{E}[\mathbf{X}] = \sum_{x \in \operatorname{range}(\mathbf{X})} \Pr[\mathbf{X}=x]\cdot x,$$

where $\operatorname{range}(\mathbf{X}) = \{\mathbf{X}(\ell) : \ell \in \Omega\}$.

Exercise (Equivalence of expected value definitions)

Prove that the above two expressions for $\mathbb{E}[\mathbf{X}]$ are equivalent.

Solution

We show a chain of equalities from the RHS to the LHS:

$$\begin{aligned} \sum_{x \in \operatorname{range}(\mathbf{X})}\Pr[\mathbf{X} = x] \cdot x &= \sum_{x \in \operatorname{range}(\mathbf{X})}\Pr[\{\ell \in \Omega : \mathbf{X}(\ell) = x\}] \cdot x \\ &= \sum_{x \in \operatorname{range}(\mathbf{X})}\sum_{\substack{\ell \in \Omega \\ \mathbf{X}(\ell) = x}}\Pr[\ell] \cdot x \\ &= \sum_{x \in \operatorname{range}(\mathbf{X})}\sum_{\substack{\ell \in \Omega \\ \mathbf{X}(\ell) = x}}\Pr[\ell] \cdot \mathbf{X}(\ell) \\ &= \sum_{\ell \in \Omega}\Pr[\ell] \cdot \mathbf{X}(\ell). \end{aligned}$$

Exercise (Practice with expected value)

Suppose we roll two \(6\)-sided dice. Let \(\mathbf{X}\) be the random variable that denotes the sum of the numbers we see. Calculate \(\mathbb E[\mathbf{X}]\).

Solution

We refer to the input-output pairs of \({\mathbf{X}}\) (recall that random variables are functions) in Exercise (Practice with random variables). With a lot of tedious calculations, we can compute \[\mathbb E[{\mathbf{X}}] = \sum_{x \in \operatorname{range}({\mathbf{X}})}\Pr[{\mathbf{X}} = x] \cdot x = \dfrac{1}{36} \cdot 2 + \dfrac{2}{36} \cdot 3 + \dfrac{3}{36} \cdot 4 + \cdots + \dfrac{1}{36} \cdot 12 = 7.\] We will see a less tedious way of performing such calculations in Exercise (Practice with linearity of expectation).
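If the tedium is a concern, the calculation can also be checked mechanically. The sketch below (our own, not from the text) evaluates both equivalent formulas for \(\mathbb E[\mathbf{X}]\) with exact fractions and confirms that each gives \(7\).

```python
from fractions import Fraction
from itertools import product

# Sample space and uniform distribution for two fair 6-sided dice.
omega = list(product(range(1, 7), repeat=2))
pr = {ell: Fraction(1, 36) for ell in omega}
X = {ell: ell[0] + ell[1] for ell in omega}    # X(l) = sum of the two rolls

# First formula: sum over outcomes in Omega.
e1 = sum(pr[ell] * X[ell] for ell in omega)

# Second formula: sum over the range of X.
rng = set(X.values())
e2 = sum(sum(pr[ell] for ell in omega if X[ell] == x) * x for x in rng)
```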

Note (Arithmetic operations on random variables)

Fix some sample space \(\Omega\) and consider random variables on this sample space. Since random variables are functions from \(\Omega\) to \(\mathbb R\), arithmetic operations on random variables are arithmetic operations on functions, and these arithmetic operations produce a new random variable. For example, if \(\mathbf{X}\) and \(\mathbf{Y}\) are random variables, then an expression like \(\mathbf{X} + 2\mathbf{X}\mathbf{Y}\) denotes a new random variable, because it is also a function from \(\Omega\) to \(\mathbb R\).

Proposition (Linearity of expectation)

Let \(\mathbf{X}\) and \(\mathbf{Y}\) be two random variables, and let \(c_1,c_2 \in \mathbb R\) be some constants. Then \[\mathbb E[c_1\mathbf{X}+c_2\mathbf{Y}] = c_1\mathbb E[\mathbf{X}] + c_2\mathbb E[\mathbf{Y}].\]

Proof

Define the random variable \(\mathbf{Z}\) as \(\mathbf{Z} = c_1\mathbf{X} + c_2\mathbf{Y}\). Then using the definition of expected value, we have \[\begin{aligned} \mathbb E[c_1\mathbf{X}+c_2\mathbf{Y}] & = \mathbb E[\mathbf{Z}] \\ & = \sum_{\ell \in \Omega} \Pr[\ell]\cdot \mathbf{Z}(\ell) \\ & = \sum_{\ell \in \Omega} \Pr[\ell]\cdot (c_1\mathbf{X}(\ell) + c_2\mathbf{Y}(\ell)) \\ & = \sum_{\ell \in \Omega} \Pr[\ell]\cdot c_1 \mathbf{X}(\ell) + \Pr[\ell] \cdot c_2 \mathbf{Y}(\ell) \\ & = \left(\sum_{\ell \in \Omega} \Pr[\ell]\cdot c_1 \mathbf{X}(\ell)\right) + \left(\sum_{\ell \in \Omega} \Pr[\ell]\cdot c_2 \mathbf{Y}(\ell)\right) \\ & = c_1 \left(\sum_{\ell \in \Omega} \Pr[\ell]\cdot \mathbf{X}(\ell)\right) + c_2 \left(\sum_{\ell \in \Omega} \Pr[\ell]\cdot \mathbf{Y}(\ell)\right) \\ & = c_1\mathbb E[\mathbf{X}] + c_2\mathbb E[\mathbf{Y}], \end{aligned}\] as desired.

Corollary (Linearity of expectation 2)

Let \(\mathbf{X}_1,\mathbf{X}_2,\ldots,\mathbf{X}_n\) be random variables, and \(c_1,c_2,\ldots,c_n \in \mathbb R\) be some constants. Then \[\mathbb E[c_1\mathbf{X}_1 + c_2\mathbf{X}_2 + \cdots + c_n\mathbf{X}_n] = c_1\mathbb E[\mathbf{X}_1] + c_2\mathbb E[\mathbf{X}_2] + \cdots + c_n\mathbb E[\mathbf{X}_n].\] In particular, when all the \(c_i\)’s are \(1\), we get \[\mathbb E[\mathbf{X}_1 + \mathbf{X}_2 + \cdots + \mathbf{X}_n] = \mathbb E[\mathbf{X}_1] + \mathbb E[\mathbf{X}_2] + \cdots + \mathbb E[\mathbf{X}_n].\]

Exercise (Practice with linearity of expectation)

Suppose we roll three \(10\)-sided dice. Let \(\mathbf{X}\) be the sum of the three values we see. Calculate \(\mathbb E[\mathbf{X}]\).

Solution

Let \({\mathbf{X}}_1, {\mathbf{X}}_2, {\mathbf{X}}_3\) be the values of the rolls of each of the three dice. Note that \({\mathbf{X}}_1, {\mathbf{X}}_2, {\mathbf{X}}_3\) are random variables and that \({\mathbf{X}} = {\mathbf{X}}_1 + {\mathbf{X}}_2 + {\mathbf{X}}_3\). Then since \({\mathbf{X}}_1, {\mathbf{X}}_2, {\mathbf{X}}_3\) are identically distributed, we can compute \[\begin{aligned} \mathbb E[{\mathbf{X}}_1] = \mathbb E[{\mathbf{X}}_2] = \mathbb E[{\mathbf{X}}_3] &= \sum_{x \in \operatorname{range}({\mathbf{X}}_1)}\Pr[{\mathbf{X}}_1 = x] \cdot x \\ &= \sum_{x=1}^{10}\Pr[{\mathbf{X}}_1 = x] \cdot x &\text{since $\operatorname{range}({\mathbf{X}}_1) = \{1,2,\dots,10\}$}\\ &= \sum_{x=1}^{10}\dfrac{1}{10} \cdot x \\ &= \dfrac{11}{2}\end{aligned}\] and by linearity of expectation, \[\mathbb E[{\mathbf{X}}] = \mathbb E[{\mathbf{X}}_1 + {\mathbf{X}}_2 + {\mathbf{X}}_3] = \mathbb E[{\mathbf{X}}_1] + \mathbb E[{\mathbf{X}}_2] + \mathbb E[{\mathbf{X}}_3] = 3 \cdot \dfrac{11}{2} = \dfrac{33}{2}.\]
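The answer \(33/2\) can also be confirmed by brute force over all \(10^3\) outcomes. The sketch below (our own illustration) compares the definition of expectation against the linearity-of-expectation shortcut.

```python
from fractions import Fraction
from itertools import product

# Enumerate all 10^3 equally likely outcomes of three 10-sided dice.
outcomes = list(product(range(1, 11), repeat=3))
p = Fraction(1, len(outcomes))

exact = sum(p * sum(roll) for roll in outcomes)   # E[X] by definition
by_linearity = 3 * Fraction(11, 2)                # 3 * E[X_1]
```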

Exercise (Expectation of the product of random variables)

Define two random variables \(\mathbf{X}\) and \(\mathbf{Y}\) such that \(\mathbb E[\mathbf{X}\mathbf{Y}] \neq \mathbb E[\mathbf{X}] \mathbb E[\mathbf{Y}]\).

Solution

Let \(\mathbf{X}\) be such that \(\Pr[\mathbf{X} = 1] = 1/2\) and \(\Pr[\mathbf{X} = -1] = 1/2\). Define \(\mathbf{Y} = \mathbf{X}\). Note that \[\mathbb E[\mathbf{X}] = \mathbb E[\mathbf{Y}] = \frac{1}{2} \cdot 1 + \frac{1}{2} \cdot (-1) = 0.\] Therefore \(\mathbb E[\mathbf{X}] \mathbb E[\mathbf{Y}] = 0\). Now let \(\mathbf{Z} = \mathbf{X} \mathbf{Y}\). Notice that \(\Pr[\mathbf{Z} = 1] = 1\), and therefore \(\mathbb E[\mathbf{Z}] = 1\).
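This two-outcome example is easy to check numerically; below is a short sketch (our own) that computes \(\mathbb E[\mathbf{X}]\), \(\mathbb E[\mathbf{Y}]\), and \(\mathbb E[\mathbf{X}\mathbf{Y}]\) exactly.

```python
from fractions import Fraction

# Sample space with two outcomes; X takes values +1 and -1, and Y = X.
omega = {+1: Fraction(1, 2), -1: Fraction(1, 2)}

ex = sum(p * x for x, p in omega.items())          # E[X]
ey = ex                                            # Y = X, so E[Y] = E[X]
exy = sum(p * x * x for x, p in omega.items())     # E[XY] = E[X^2]
```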

Definition (Indicator random variable)

Let \(E \subseteq \Omega\) be some event. The indicator random variable with respect to \(E\) is denoted by \(\mathbf{I}_E\) and is defined as \[\mathbf{I}_E(\ell) = \begin{cases} 1 & \quad \text{if $\ell \in E$,}\\ 0 & \quad \text{otherwise.}\\ \end{cases}\]

Proposition (Expectation of an indicator random variable)

Let \(E\) be an event. Then \(\mathbb E[\mathbf{I}_E] = \Pr[E]\).

Proof

By the definition of expected value, \[\begin{aligned} \mathbb E[\mathbf{I}_E] & = \Pr[\mathbf{I}_E = 1] \cdot 1 + \Pr[\mathbf{I}_E = 0] \cdot 0\\ & = \Pr[\mathbf{I}_E = 1] \\ & = \Pr[\{\ell \in \Omega : \mathbf{I}_E(\ell) = 1\}] \\ & = \Pr[\{\ell \in \Omega : \ell \in E \}] \\ & = \Pr[E].\end{aligned}\]

Important Note (Combining linearity of expectation and indicators)

Suppose that you are interested in computing \(\mathbb E[\mathbf{X}]\) for some random variable \(\mathbf{X}\). If you can write \(\mathbf{X}\) as a sum of indicator random variables, i.e., if \(\mathbf{X} = \sum_j \mathbf{I}_{E_j}\) where the \(\mathbf{I}_{E_j}\) are indicator random variables, then by linearity of expectation, \[\mathbb E[\mathbf{X}] = \mathbb E\left[\sum_j \mathbf{I}_{E_j}\right] = \sum_j \mathbb E[\mathbf{I}_{E_j}].\] Furthermore, by Proposition (Expectation of an indicator random variable), we know \(\mathbb E[\mathbf{I}_{E_j}] = \Pr[E_j]\). Therefore \(\mathbb E[\mathbf{X}] = \sum_j \Pr[E_j]\). This often provides an extremely convenient way of computing \(\mathbb E[\mathbf{X}]\). This combination of indicator random variables together with linearity of expectation is one of the most useful tricks in probability theory!

Exercise (Practice with linearity of expectation and indicators)
  1. There are \(n\) balls and \(n\) bins. For each ball, you pick one of the bins uniformly at random and drop the ball in that bin. What is the expected number of balls in bin 1? What is the expected number of empty bins?

  2. Suppose you randomly color the vertices of the complete graph on \(n\) vertices one of \(k\) colors. What is the expected number of paths of length \(c\) (where we assume \(c \geq 3\)) such that no two adjacent vertices on the path have the same color?

Solution

Part (1): Let \({\mathbf{X}}\) be the number of balls in bin 1. For \(j = 1,2,\dots,n\), let \(E_j\) be the event that the \(j\)’th ball is dropped in bin 1. Observe that \({\mathbf{X}} = \sum_{j=1}^{n}{\mathbf{I}}_{E_j}\), and that \(\Pr[E_j] = 1/n\) for all \(j\), since the bin each ball is dropped into is picked uniformly at random. Then by linearity of expectation, \[\mathbb E[{\mathbf{X}}] = \mathbb E\left[\sum_{j=1}^{n}{\mathbf{I}}_{E_j}\right] = \sum_{j=1}^{n}\mathbb E[{\mathbf{I}}_{E_j}] = \sum_{j=1}^{n}\Pr[{E_j}] = \sum_{j=1}^{n}\dfrac{1}{n} = 1.\] Let \({\mathbf{Y}}\) be the number of empty bins. For \(j = 1,2,\dots,n\), let \(F_j\) be the event that bin \(j\) is empty. Observe that \({\mathbf{Y}} = \sum_{j=1}^{n}{\mathbf{I}}_{F_j}\), and that \(\Pr[F_j] = (1 - 1/n)^n\) for all \(j\), since the probability that any one of the \(n\) balls is not dropped in a fixed bin is \(1 - 1/n\), and each ball is dropped independently of the others. Then by linearity of expectation, \[\mathbb E[{\mathbf{Y}}] = \mathbb E\left[\sum_{j=1}^{n}{\mathbf{I}}_{F_j}\right] = \sum_{j=1}^{n}\mathbb E[{\mathbf{I}}_{F_j}] = \sum_{j=1}^{n}\Pr[{F_j}] = \sum_{j=1}^{n}\left(1 - \dfrac{1}{n}\right)^n = n\left(1 - \dfrac{1}{n}\right)^n.\]

Part (2): Let \({\mathbf{X}}\) be the number of paths of length \(c\) such that no two adjacent vertices on the path have the same color. There are a total of \(n(n-1)\cdots(n-c) = \dfrac{n!}{(n-c-1)!}\) paths of length \(c\). Let’s call this value \(N\) and let’s number the paths from \(1\) to \(N\). For \(j = 1,2,\dots,N\), let \(E_j\) be the event that no two adjacent vertices on the \(j\)’th path have the same color. Note that \({\mathbf{X}} = \sum_{j=1}^{N}{\mathbf{I}}_{E_j}\).

We first compute \(\Pr[E_j]\) for some fixed \(j\). Suppose path \(j\) is \(v_0 v_1 \cdots v_c\). Then \(E_j\) occurs if and only if \(v_i\) is colored differently from \(v_{i-1}\) for \(i = 1,2,\dots,c\). For each \(i\), this happens with probability \(1 - 1/k\), and these events are independent of each other. Hence, we can conclude that \(\Pr[E_j] = (1 - 1/k)^c\) for each \(j\). Then by linearity of expectation, \[\mathbb E[{\mathbf{X}}] = \mathbb E\left[\sum_{j=1}^{N}{\mathbf{I}}_{E_j}\right] = \sum_{j=1}^{N}\mathbb E[{\mathbf{I}}_{E_j}] = \sum_{j=1}^{N}\Pr[{E_j}] = \sum_{j=1}^{N}\left(1 - \dfrac{1}{k}\right)^c = \dfrac{n!}{(n-c-1)!}\left(1 - \dfrac{1}{k}\right)^c.\]
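Both parts can be verified by exact enumeration for small parameters. The sketch below (our own; the function names are not from the text) checks the balls-and-bins expectations against the formulas \(1\) and \(n(1-1/n)^n\), and checks the probability \((1-1/k)^c\) that a fixed path is properly colored.

```python
from fractions import Fraction
from itertools import product

def balls_in_bins(n):
    """Enumerate all n^n equally likely placements of n balls into n bins;
    return (E[balls in bin 1], E[empty bins]) as exact fractions."""
    p = Fraction(1, n ** n)
    e_bin1 = e_empty = Fraction(0)
    for placement in product(range(n), repeat=n):
        e_bin1 += p * placement.count(0)                       # bin 1 is index 0
        e_empty += p * sum(1 for b in range(n) if b not in placement)
    return e_bin1, e_empty

def prob_properly_colored(c, k):
    """Pr that a fixed path v_0 ... v_c, colored uniformly at random with
    k colors, has no two adjacent vertices of the same color."""
    good = sum(1 for cols in product(range(k), repeat=c + 1)
               if all(cols[i] != cols[i - 1] for i in range(1, c + 1)))
    return Fraction(good, k ** (c + 1))
```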

Theorem (Markov’s inequality)

Let \(\mathbf{X}\) be a non-negative random variable. Then for any \(c > 0\), \[\Pr[\mathbf{X} \geq c \cdot \mathbb E[\mathbf{X}]] \leq \frac{1}{c}.\] Or equivalently, \[\Pr[\mathbf{X} \geq c] \leq \frac{\mathbb E[\mathbf{X}]}{c}.\]

Remark (Markov’s inequality intuition)

At a high level, Markov’s inequality states that a non-negative random variable is rarely much bigger than its expectation.

For example, if the average score on an exam is \(45\%\), then the fraction of the students who got at least \(90\%\) cannot be very large. And we can put an upper bound on that fraction by considering the scenario that would maximize the fraction of students getting at least \(90\%\). To maximize that fraction, the contribution to the average score of students who got below \(90\%\) should be minimized. For this, we’ll assume that anyone who got below a \(90\%\) actually got a \(0\%\) on the exam. Furthermore, anyone who gets strictly above a \(90\%\) (e.g. anyone who gets a \(100\%\)) would reduce the fraction of students who get at least \(90\%\). So we’ll assume that anyone who got at least \(90\%\) actually got exactly \(90\%\). In this scenario, if \(p\) is the fraction of students who got \(90\%\) on the exam, the average score on the exam would be \(p \cdot 90\). This quantity should be equal to the actual average, \(45\). So \(p\) is equal to \(1/2\). Meaning, if the average score in the exam is \(45\%\), then at most half the class can get a score of at least \(90\%\).

A more succinct way to say the above is that, if \(p\) is the fraction of students who got at least \(90\%\), then the average must be at least \(p \cdot 90\). So \(45 \geq p \cdot 90\), and therefore \(p \leq 1/2\).

Since we are not assuming anything about the random variable other than that it is non-negative with non-zero expectation, the bound that Markov’s inequality gives us is rather weak. There are much stronger bounds for more specific random variables.

Proof

Our goal is to find an upper bound on the probability that the random variable \(\mathbf{X}\) is greater than or equal to \(c \cdot \mathbb E[\mathbf{X}]\). Let \(p\) be the probability of this event. We want to show \(p \leq 1/c\). We will do so as follows. First, we put a lower bound on \(\mathbb E[\mathbf{X}]\). Recall that \[\mathbb E[\mathbf{X}] = \sum_{x \in \text{range}(\mathbf{X})} \Pr[\mathbf{X}=x]\cdot x.\] We divide this sum into two parts, based on whether \(x \geq c \cdot \mathbb E[\mathbf{X}]\) or not: \[\mathbb E[\mathbf{X}] = \left(\sum_{x < c \cdot \mathbb E[\mathbf{X}]} \Pr[\mathbf{X}=x]\cdot x \right) + \left(\sum_{x \geq c \cdot \mathbb E[\mathbf{X}]} \Pr[\mathbf{X}=x]\cdot x \right).\] Using the non-negativity of \(\mathbf{X}\), we know the first sum is at least \(0\). The second sum is at least \(p \cdot c \cdot \mathbb E[\mathbf{X}]\), because every \(x\) appearing in it satisfies \(x \geq c \cdot \mathbb E[\mathbf{X}]\) and the corresponding probabilities sum to \(p\). Therefore, \(\mathbb E[\mathbf{X}] \geq p \cdot c \cdot \mathbb E[\mathbf{X}]\). Rearranging, we get \(p \leq 1/c\), as desired.
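To see the inequality in action, the sketch below (our own) compares the Markov bound with the true probability for the two-dice sum from an earlier exercise, where \(\mathbb E[\mathbf{X}] = 7\).

```python
from fractions import Fraction
from itertools import product

# Two fair 6-sided dice; X = sum of the faces, E[X] = 7.
omega = list(product(range(1, 7), repeat=2))
p = Fraction(1, 36)

ex = sum(p * (a + b) for a, b in omega)                 # E[X]
actual = sum(p for a, b in omega if a + b >= 10)        # Pr[X >= 10] = 1/6
markov_bound = ex / 10                                  # E[X]/c with c = 10
```

The true probability \(1/6\) is well below the Markov bound \(7/10\), illustrating how loose the bound can be.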

Exercise (Practice with Markov’s inequality)

During the Spring 2022 semester, the 15-251 TAs decide to strike because they are not happy with the lack of free food in grading sessions. Without the TA support, the performance of the students in the class drops dramatically. The class average on the first midterm exam is 15%. Using Markov’s inequality, give an upper bound on the fraction of the class that got an A (i.e., at least a 90%) on the exam.

Solution

Let \(\mathbf{X}\) be the exam score of a student chosen uniformly at random. We will optimistically assume that \(\mathbf{X}\) is non-negative. Then \(\mathbb E[{\mathbf{X}}] = 0.15 \neq 0\) and the fraction of the class that got an A is \(\Pr[\mathbf{X} \geq 0.9]\). By Markov’s inequality, \[\Pr[\mathbf{X} \geq 0.9] \leq \dfrac{\mathbb E[\mathbf{X}]}{0.9} = \dfrac{1}{6},\] which is an upper bound on the fraction of the class that got an A.

Definition (Bernoulli random variable)

Let \(0 < p < 1\) be some parameter. If \(\mathbf{X}\) is a random variable with probability mass function \(p_{\mathbf{X}}(1) = p\) and \(p_{\mathbf{X}}(0) = 1-p\), then we say that \(\mathbf{X}\) has a Bernoulli distribution with parameter \(p\) (we also say that \(\mathbf{X}\) is a Bernoulli random variable). We write \(\mathbf{X} \sim \text{Bernoulli}(p)\) to denote this. The parameter \(p\) is often called the success probability.

Note (What does a Bernoulli random variable represent?)

A Bernoulli random variable \(\text{Bernoulli}(p)\) captures a random experiment where we toss a \(p\)-biased coin where the probability of heads is \(p\) (and we assign this the numerical outcome of \(1\)) and the probability of tails is \(1-p\) (and we assign this the numerical outcome of \(0\)).

Note (Expectation of Bernoulli random variable)

Note that \(\mathbb E[\mathbf{X}] = 0 \cdot p_{\mathbf{X}}(0) + 1 \cdot p_{\mathbf{X}}(1) = p_{\mathbf{X}}(1) = p\).

Definition (Binomial random variable)

Let \(\mathbf{X} = \mathbf{X}_1 + \mathbf{X}_2 + \cdots + \mathbf{X}_n\), where the \(\mathbf{X}_i\)’s are independent and for all \(i\), \(\mathbf{X}_i \sim \text{Bernoulli}(p)\). Then we say that \(\mathbf{X}\) has a binomial distribution with parameters \(n\) and \(p\) (we also say that \(\mathbf{X}\) is a binomial random variable). We write \(\mathbf{X} \sim \text{Bin}(n,p)\) to denote this.

Note (What does a Binomial random variable represent?)

A binomial random variable \(\mathbf{X} \sim \text{Bin}(n,p)\) captures a random experiment where we toss a \(p\)-biased coin \(n\) times. We are interested in the probability of seeing \(k\) heads among those \(n\) coin tosses, where \(k\) ranges over \(\{0,1,2,\ldots, n\} = \text{range}(\mathbf{X})\).

Note (Bernoulli is a special case of Binomial)

Note that we can view a Bernoulli random variable as a special kind of binomial random variable where \(n = 1\).

Exercise (Expectation of a Binomial random variable)

Let \(\mathbf{X}\) be a random variable with \(\mathbf{X} \sim \text{Bin}(n,p)\). Determine \(\mathbb E[\mathbf{X}]\) (use linearity of expectation). Also determine \(\mathbf{X}\)’s probability mass function.

Solution

By definition, \(\mathbf{X} = \sum_{j=1}^{n} \mathbf{X}_j\) where the \(\mathbf{X}_i\)’s are independent and for all \(i\), \(\mathbf{X}_i \sim \text{Bernoulli}(p)\). Since \(\mathbb E[\mathbf{X}_i] = p\), by linearity of expectation: \[\mathbb E[\mathbf{X}] = \mathbb E\left[\sum_{j=1}^{n} \mathbf{X}_j\right] = \sum_{j=1}^{n} \mathbb E\left[\mathbf{X}_j\right] = np.\] We now determine the probability mass function of \(\mathbf{X}\). Note that the values \(\mathbf{X}\) can take on are \(\{0,1,2,\ldots, n\}\). Let \(k\) be an arbitrary element of \(\{0,1,2,\ldots, n\}\). Then, \[\begin{aligned} p_{{\mathbf{X}}}(k) & = \Pr[{\mathbf{X}} = k] \\ & = \Pr\left[\sum_{j=1}^{n}{\mathbf{X}}_j = k\right] \\ & = \sum_{\substack{J \subseteq \{1,\ldots, n\}\\ |J| = k}} \Pr\left[\bigcap_{j \in J}({\mathbf{X}}_j = 1) \cap \bigcap_{j \notin J}({\mathbf{X}}_j = 0)\right] \\ & = \sum_{\substack{J \subseteq \{1,\ldots, n\}\\ |J| = k}} \left(\prod_{j \in J}\Pr[{\mathbf{X}}_j = 1] \cdot \prod_{j \notin J}\Pr[{\mathbf{X}}_j = 0]\right) \\ & = \sum_{\substack{J \subseteq \{1,\ldots, n\}\\ |J| = k}} \left(p^{k} \cdot (1-p)^{n-k}\right) \\ & = \binom{n}{k}p^{k}(1-p)^{n-k}.\end{aligned}\]
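The closed form can be checked against a direct enumeration of all outcome sequences. The sketch below (our own, with hypothetical helper names) does this with exact fractions for small \(n\).

```python
from fractions import Fraction
from itertools import product
from math import comb

def binom_pmf(n, p, k):
    """Closed-form PMF of Bin(n, p): C(n,k) p^k (1-p)^(n-k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def pmf_by_enumeration(n, p, k):
    """Sum the probabilities of all length-n Bernoulli(p) outcome
    sequences that contain exactly k ones."""
    total = Fraction(0)
    for bits in product((0, 1), repeat=n):
        if sum(bits) == k:
            total += p**sum(bits) * (1 - p)**(n - sum(bits))
    return total
```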

Exercise (Practice with Binomial random variable)

We toss a fair coin \(5\) times. What is the probability that we see at least \(4\) heads? What is the expected number of heads?

Solution

Let \(\mathbf{X}\) be the number of heads among the \(5\) coin tosses. Then \(\mathbf{X} \sim \text{Bin}(5, 1/2)\), and so the probability we see at least \(4\) heads is \[\Pr[\mathbf{X} \geq 4] = \Pr[\mathbf{X} = 4] + \Pr[\mathbf{X} = 5] = \binom{5}{4}\left(\dfrac{1}{2}\right)^4\left(\dfrac{1}{2}\right)^1 + \binom{5}{5}\left(\dfrac{1}{2}\right)^5\left(\dfrac{1}{2}\right)^0 = \dfrac{3}{16}.\]

The expected number of heads, \(\mathbb E[\mathbf{X}]\), is equal to \(5 \cdot \frac{1}{2} = 2.5\).

Definition (Geometric random variable)

Let \(0 < p < 1\) be some parameter. Let \(\mathbf{X}\) be a random variable with probability mass function \(p_{\mathbf{X}}\) such that for \(n \in \{1,2,\ldots\}\), \(p_{\mathbf{X}}(n) = (1-p)^{n-1}p\). Then we say that \(\mathbf{X}\) has a geometric distribution with parameter \(p\) (we also say that \(\mathbf{X}\) is a geometric random variable). We write \(\mathbf{X} \sim \text{Geometric}(p)\) to denote this.

Note (What does a Geometric random variable represent?)

A geometric random variable \(\text{Geometric}(p)\) captures a random experiment where we successively toss a \(p\)-biased coin until we see heads for the first time, and we stop. We are interested in the probability of making \(n\) coin tosses in total before we stop, where \(n\) ranges over \(\{1,2,\ldots\}\).


Exercise (PMF of a geometric random variable)

Let \(\mathbf{X}\) be a geometric random variable. Verify that \(\sum_{n = 1}^\infty p_{\mathbf{X}}(n) = 1\).

Solution

Recall that for \(|r| < 1\), \(\sum_{n=0}^{\infty}r^n = 1/(1-r)\). Then \[\sum_{n=1}^{\infty}p_{{\mathbf{X}}}(n) = \sum_{n=1}^{\infty}(1-p)^{n-1}p = p \cdot \sum_{n=0}^{\infty}(1-p)^{n} = p \cdot \dfrac{1}{1-(1-p)} = 1.\]
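Numerically, the partial sums of the PMF telescope to \(1 - (1-p)^N\), which tends to \(1\) as \(N\) grows; the sketch below (our own) checks this with exact fractions.

```python
from fractions import Fraction

def geom_partial_sum(p, N):
    """Sum of the Geometric(p) PMF over n = 1..N (exact fractions)."""
    return sum((1 - p) ** (n - 1) * p for n in range(1, N + 1))

p = Fraction(1, 3)
# The partial sum equals 1 - (1-p)^N, which approaches 1 as N grows.
```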

Exercise (Practice with geometric random variable)

Suppose we repeatedly flip a fair coin until we see a heads for the first time. What is the probability that we will flip the coin more than \(5\) times?

Solution

Let \(\mathbf{X}\) be the number of flips until we see a heads for the first time. Then \(\mathbf{X} \sim \text{Geometric}(1/2)\). The question asks us to compute \(\Pr[\mathbf{X} > 5]\). We know that the event \(\mathbf{X} > 5\) is equivalent to getting tails in our first \(5\) flips, which happens with probability \(1/2^5\). Therefore \(\Pr[\mathbf{X} > 5] = 1/2^5\).

Exercise (Expectation of a geometric random variable)

Let \(\mathbf{X}\) be a random variable with \(\mathbf{X} \sim \text{Geometric}(p)\). Determine \(\mathbb E[\mathbf{X}]\).

Solution

By the definition of expectation, we have \(\mathbb E[\mathbf{X}] = \sum_{n=1}^\infty n \cdot (1-p)^{n-1} p\). Using the fact that \(\sum_{n = 0}^\infty (1-p)^n = 1/p\), we get: \[\begin{aligned} \mathbb E[\mathbf{X}] & = \sum_{n=1}^\infty n \cdot (1-p)^{n-1} p \\ & = p \left( \sum_{n=1}^\infty n \cdot (1-p)^{n-1} \right) \\ & = p \left( \sum_{n=1}^\infty (1-p)^{n-1} + \sum_{n=2}^\infty (1-p)^{n-1} + \sum_{n=3}^\infty (1-p)^{n-1} + \cdots \right) \\ & = p \left( \frac{1}{p} + \frac{1-p}{p} + \frac{(1-p)^2}{p} + \cdots \right) \\ & = 1 + (1-p) + (1-p)^2 + \cdots \\ & = \frac{1}{p}.\end{aligned}\]
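The value \(1/p\) can be sanity-checked by computing a long partial sum of the defining series. The sketch below (our own) does so for \(p = 1/4\), where the truncated sum should be very close to \(4\).

```python
from fractions import Fraction

def truncated_expectation(p, N):
    """Partial sum of E[X] = sum_{n>=1} n (1-p)^(n-1) p for Geometric(p)."""
    return sum(n * (1 - p) ** (n - 1) * p for n in range(1, N + 1))

p = Fraction(1, 4)
approx = truncated_expectation(p, 200)   # should be very close to 1/p = 4
```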

Important Note (Some general tips)

Here are some general tips on probability calculations (this is not meant to be an exhaustive list).

  • If you are trying to upper bound \(\Pr[A]\), you can try to find \(B\) with \(A \subseteq B\), and then bound \(\Pr[B]\). Note that if an event \(A\) implies an event \(B\), then this means \(A \subseteq B\). Similarly, if you are trying to lower bound \(\Pr[A]\), you can try to find \(B\) with \(B \subseteq A\), and then bound \(\Pr[B]\).

  • If you are trying to upper bound \(\Pr[A]\), you can try to lower bound \(\Pr[\overline{A}]\) since \(\Pr[A] = 1-\Pr[\overline{A}]\). Similarly, if you are trying to lower bound \(\Pr[A]\), you can try to upper bound \(\Pr[\overline{A}]\).

  • In some situations, the law of total probability can be very useful in calculating (or bounding) \(\Pr[A]\).

  • If you need to calculate \(\Pr[A_1 \cap \cdots \cap A_n]\), try the chain rule. If the events are independent, then this probability is equal to the product \(\Pr[A_1] \cdots \Pr[A_n]\). Note that the event “for all \(i \in \{1,\ldots,n\}\), \(A_i\)” is the same as \(A_1 \cap \cdots \cap A_n\).

  • If you need to upper bound \(\Pr[A_1 \cup \cdots \cup A_n]\), you can try to use the union bound. Note that the event “there exists an \(i \in \{1,\ldots,n\}\) such that \(A_i\)” is the same as \(A_1 \cup \cdots \cup A_n\).

  • When trying to calculate \(\mathbb E[\mathbf{X}]\), try:

    1. directly using the definition of expectation;

    2. writing \(\mathbf{X}\) as a sum of indicator random variables, and then using linearity of expectation.

4
Check Your Understanding
Problem
  1. Describe what a probability tree is.

  2. True or false: If two events \(A\) and \(B\) are independent, then their complements \(\overline{A}\) and \(\overline{B}\) are also independent. (The complement of an event \(A\) is \(\overline{A} = \Omega \backslash A\).)

  3. True or false: If events \(A\) and \(B\) are disjoint, then they are necessarily independent.

  4. True or false: For all events \(A,B\), \(\Pr[A \; | \;B] \le \Pr[A]\).

  5. True or false: For all events \(A,B\), \(\Pr[\overline{A} \; | \;B] = 1- \Pr[A \; | \;B]\).

  6. True or false: For all events \(A,B\), \(\Pr[A \; | \;\overline{B}] = 1 - \Pr[A \; | \;B]\).

  7. True or false: Assume that every time a baby is born, there is \(1/2\) chance that the baby is a boy. A couple has two children. At least one of the children is a boy. The probability that both children are boys is \(1/2\).

  8. What is the union bound?

  9. What is the chain rule?

  10. What is a random variable?

  11. What is an indicator random variable?

  12. What is the expectation of a random variable?

  13. What is linearity of expectation?

  14. When calculating the expectation of a random variable X\mathbf{X}, the strategy of writing X\mathbf{X} as a sum of indicator random variables and then using linearity of expectation is quite powerful. Explain how this strategy is carried out.

  15. True or false: Let \(\mathbf{X}\) be a random variable. If \(\mathbb E[\mathbf{X}] = \mu\), then \(\Pr[\mathbf{X} = \mu] > 0\).

  16. True or false: For any random variable \(\mathbf{X}\), \(\mathbb E[1/\mathbf{X}] = 1/\mathbb E[\mathbf{X}]\).

  17. True or false: For any random variable \(\mathbf{X}\), \(\Pr[\mathbf{X} \geq \mathbb E[\mathbf{X}]] > 0\).

  18. True or false: For any non-negative random variable \(\mathbf{X}\), \(\mathbb E[\mathbf{X}^2] \le \mathbb E[\mathbf{X}]^2\).

  19. True or false: For any random variable \(\mathbf{X}\), \(\mathbb E[-\mathbf{X}^3] = - \mathbb E[\mathbf{X}^3]\).

  20. What is Markov’s inequality?

  21. What is a Bernoulli random variable? Give one example.

  22. What is the expectation of a Bernoulli random variable, and how do you derive it?

  23. What is a Binomial random variable? Give one example.

  24. What is the expectation of a Binomial random variable, and how do you derive it?

  25. What is a Geometric random variable? Give one example.

  26. What is the expectation of a Geometric random variable (justification not required)?

  27. Let \(\mathbf{X}\) and \(\mathbf{Y}\) be random variables. Does the expression \(\mathbb E[\mathbf{X} \; | \;\mathbf{Y}] = 0\) type-check?

  28. Let \(A\) be an event. Does the expression \(\mathbb E[A] = 0\) type-check?


See in chapter
Exercise (Union bound)

Let \(A_1,A_2,\ldots,A_n\) be events. Then \[\Pr[A_1 \cup A_2 \cup \cdots \cup A_n] \leq \Pr[A_1] + \Pr[A_2] + \cdots + \Pr[A_n].\] We get equality if and only if the \(A_i\)’s are pairwise disjoint.

See in chapter
Definition (Conditional probability)

Let \(E\) be an event with \(\Pr[E] \neq 0\). The conditional probability of outcome \(\ell \in \Omega\) given \(E\), denoted \(\Pr[\ell \; | \;E]\), is defined as \[\Pr[\ell \; | \;E] = \begin{cases} 0 &\text{if $\ell \not \in E$}\\ \frac{\Pr[\ell]}{\Pr[E]} &\text{if $\ell \in E$} \end{cases}\] For an event \(A\), the conditional probability of \(A\) given \(E\), denoted \(\Pr[A \; | \;E]\), is defined as \[\begin{aligned} \Pr[A \; | \;E] = \frac{\Pr[A \cap E]}{\Pr[E]}.\end{aligned}\]

See in chapter
Proposition (Law of total probability)

Let \(A_1,A_2,\ldots,A_n,B\) be events such that the \(A_i\)’s form a partition of the sample space \(\Omega\). Then \[\Pr[B] = \Pr[B \cap A_1] + \Pr[B \cap A_2] + \cdots + \Pr[B \cap A_n].\] Equivalently, \[\Pr[B] = \Pr[A_1]\cdot \Pr[B \; | \;A_1] + \Pr[A_2]\cdot \Pr[B \; | \;A_2] + \cdots + \Pr[A_n]\cdot \Pr[B \; | \;A_n].\]

See in chapter
Definition (Expected value of a random variable)

Let \(\mathbf{X}\) be a random variable. The expected value of \(\mathbf{X}\), denoted \(\mathbb E[\mathbf{X}]\), is defined as follows: \[\mathbb E[\mathbf{X}] = \sum_{\ell \in \Omega} \Pr[\ell]\cdot \mathbf{X}(\ell).\] Equivalently, \[\mathbb E[\mathbf{X}] = \sum_{x \in \text{range}(\mathbf{X})} \Pr[\mathbf{X}=x]\cdot x,\] where \(\text{range}(\mathbf{X}) = \{\mathbf{X}(\ell): \ell \in \Omega\}\).

See in chapter
Exercise (Probability space modeling)

How would you model a roll of a single \(6\)-sided die using Definition (Finite probability space, sample space, probability distribution)? How about a roll of two dice? How about a roll of a die and a coin toss together?

See in chapter
Exercise (Practice with random variables)

Suppose we roll two \(6\)-sided dice. Let \(\mathbf{X}\) be the random variable that denotes the sum of the numbers we see. Explicitly write down the input-output pairs for the function \(\mathbf{X}\). Calculate \(\Pr[\mathbf{X} \geq 7]\).

See in chapter
Proposition (Expectation of an indicator random variable)

Let \(E\) be an event. Then \(\mathbb E[\mathbf{I}_E] = \Pr[E]\).

See in chapter