# Bayes Theorem

##### February 3, 2018

Written by Boutros El-Gamil

# 1. Idea

Suppose that one day you meet Jamie, a high school student. You notice that Jamie has a strong body, wide shoulders, and notable muscles.
Now, if I tell you that Jamie is a member of either the wrestling team or the soccer team at his school, and ask you to guess which team Jamie is more likely to belong to, what will be your answer?

Based on this visible impression, you may guess that Jamie is more likely to be on the wrestling team, most probably because his body type and physical shape suggest a wrestler more than a soccer player. I think most people would guess the same. Therefore, I will translate this guess into an average probability of 75% that Jamie is a wrestler.

Now, let me add another piece of information to your knowledge: in Jamie’s school, there are 100 athletic students (including Jamie). Among those, only 15 students are on the wrestling team, and the rest are on the soccer team. Would this extra information change your guess about Jamie’s team?

If you are puzzled by the answer, you can ask Thomas Bayes for help: a British statistician who thought about this sort of question two and a half centuries ago, and came up with his theory: Bayes Theorem.

# 2. Theory

In order to understand the roots of Bayes Theorem, you need to be aware of some concepts from probability theory.

## 2.1 Conditional Probabilities

The first concept is called conditional probability. We can formulate this concept through the question: given that a student is on the wrestling team, what is the probability that he is muscular? Using math notation, we express it as follows:

P(muscular|wrestler) = \frac{\#\ students\ who\ are\ muscular\ \&\ in\ the\ wrestling\ team}{\#\ wrestling\ team\ students}

The general form is:

P(A|B) = \frac{P(A \cap B)}{P(B)}\tag{1}

If we assume there are 10 muscular students in Jamie’s school, 3 of them on the wrestling team and the other 7 on the soccer team, we can calculate the above probability as:

P(muscular|wrestler) = \frac{3}{15} = 0.2

This means that, knowing a student in Jamie’s school is a wrestler, there is a 20% chance that he is muscular, and an 80% chance that he is not.
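This conditional probability is easy to check in a few lines of Python, using the counts from the running example (the variable names are my own):

```python
# Counts taken from the example: 15 wrestling team students,
# 3 of whom are muscular.
n_wrestlers = 15
n_muscular_wrestlers = 3

# Equation (1): P(muscular | wrestler) = #(muscular and wrestler) / #wrestlers
p_muscular_given_wrestler = n_muscular_wrestlers / n_wrestlers
print(p_muscular_given_wrestler)  # 0.2
```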

This result shows us an important property of conditional probabilities: they are not interchangeable. I.e.:

P(muscular|wrestler) \neq P(wrestler|muscular)

As another example, if I told you that my friend Julia comes from Germany, you would probably guess that her first language is German. But if I told you that Julia speaks German fluently, you would think that she may (or may not) come from Germany. She could be from Germany, Austria, or Switzerland, or she might be a foreigner living in a German-speaking country. Therefore, we can write:

\begin{split}
P(Julia\ speaks\ German|Julia\ comes\ from\ Germany) &\neq \\ P(Julia\ comes\ from\ Germany|Julia\ speaks\ German)
\end{split}

## 2.2 Joint Probabilities

The second concept is called joint probability. We can formulate this concept through the question: what is the probability that a student is both a wrestler and muscular? The answer is simply the numerator of the right-hand side of the conditional probability in Equation (1), and is computed as:

P(muscular\ and\ wrestler) = P(wrestler) \times P(muscular|wrestler)

And the general form can be easily inferred from Equation (1):

P(A \cap B) = P(B) \times P(A|B)\tag{2}

Using the above information of Jamie’s example, we can calculate the joint probability as follows:

P(wrestler\ and\ muscular) = \frac{15}{100} \times 0.2 = 0.03

The interpretation is: in Jamie’s school, there is a 3% chance that a student is both a wrestler and muscular. Please note that, unlike conditional probabilities, joint probabilities are interchangeable. I.e.:

P(wrestler\ and\ muscular) = P(muscular\ and\ wrestler)

In words, the probability that a student is a wrestler and muscular is the same as the probability that the student is muscular and a wrestler. In Julia’s example, we can write:

\begin{split}
P(Julia\ speaks\ German\ \&\ Julia\ from\ Germany) &= \\P(Julia\ from\ Germany\ \&\ Julia\ speaks\ German)
\end{split}
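Equation (2) can be verified numerically with the numbers from Jamie’s school:

```python
p_wrestler = 15 / 100               # P(wrestler): 15 of 100 athletic students
p_muscular_given_wrestler = 3 / 15  # P(muscular | wrestler), from Section 2.1

# Equation (2): P(muscular and wrestler) = P(wrestler) * P(muscular | wrestler)
p_joint = p_wrestler * p_muscular_given_wrestler
print(p_joint)  # approximately 0.03
```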

## 2.3 Marginal Probabilities

The third probabilistic concept you need to learn is called marginal probability. This type of probability answers questions like: what is the probability that a student is muscular, regardless of his sports team? We calculate such a probability by summing the probability that a student is muscular and on the wrestling team and the probability that the student is muscular and on the soccer team. We write it as:

P(muscular) = P(muscular\ and\ wrestler) + P(muscular\ and\ footballer)

And the general form is:

P(A) = P(A \cap B) + P(A \cap \neg B)\tag{3}

The negation symbol $$\neg$$ here means $$NOT\ B$$.

In Jamie’s example, we calculate the marginal probability as:

P(muscular) = 0.03 + P(footballer\ and\ muscular)

Using joint probabilities, we calculate $$P(muscular)$$ as:

\begin{split}
P(muscular) &= 0.03 + P(footballer)\times P(muscular|footballer)\\
&= 0.03 + P(footballer)\times\frac{P(muscular \cap footballer)}{P(footballer)} \\
&= 0.03 + \frac{85}{100} \times \frac{7}{85}\\
&= 0.1
\end{split}

Therefore, muscular students represent 10% of athletic students. This outcome can easily be validated in our example, since there are 10 muscular students out of 100 athletic students, which matches the marginal probability.
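The marginalization above can be reproduced in Python by summing the joint probabilities over the two teams:

```python
p_wrestler, p_footballer = 15 / 100, 85 / 100
p_muscular_given_wrestler = 3 / 15    # 3 of the 15 wrestlers are muscular
p_muscular_given_footballer = 7 / 85  # 7 of the 85 footballers are muscular

# Equation (3): sum the joint probabilities over both teams
p_muscular = (p_wrestler * p_muscular_given_wrestler
              + p_footballer * p_muscular_given_footballer)
print(p_muscular)  # approximately 0.1
```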

## 2.4 Bayes Theorem

After covering all the probability theory concepts we need to know, we return to our first question: given that Jamie is muscular, what is the probability that he is a wrestler?

This probability is a conditional probability, $$P(wrestler|muscular)$$, but it is the reverse of the probability we calculated in Section 2.1: $$P(muscular|wrestler)$$. The new probability can be inferred using the above probabilistic concepts as follows. Using the joint probability concept (Section 2.2), we can write the following two joint probabilities:

P(wrestler\ and\ muscular) = P(muscular) \times P(wrestler|muscular)

P(muscular\ and\ wrestler) = P(wrestler) \times P(muscular|wrestler)

Because joint probabilities are interchangeable, the left-hand sides of the above two equations are equal, which makes the right-hand sides equal too. Therefore, we can write:

P(muscular) \times P(wrestler|muscular) = P(wrestler) \times P(muscular|wrestler)

and accordingly:

P(wrestler|muscular) = \frac{P(wrestler) \times P(muscular|wrestler)}{P(muscular)}

Now we have a formula to calculate the conditional probability we want, using the probability concepts explained above. This formula is known as Bayes Theorem, and can be generalized as:

P(A|B) = \frac{P(A)\times P(B|A)}{P(B)}\tag{4}

Using Bayes Theorem, we can substitute each component in the formula with its value:

P(wrestler|muscular) = \frac{P(wrestler) \times P(muscular|wrestler)}{P(muscular)} = \frac{0.15 \times 0.2}{0.1} = 0.3

Therefore, according to Bayes Theorem, if Jamie is muscular, there is a 30% chance that he is on the wrestling team at his school, and a 70% chance that he is on the soccer team. In other words, Bayes Theorem says: if Jamie is muscular, he is more likely to be a footballer than a wrestler.
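The whole calculation fits into a small helper function; this is a sketch of Equation (4) with the values from Jamie’s school plugged in:

```python
def bayes(prior, likelihood, evidence):
    """Equation (4): P(A|B) = P(A) * P(B|A) / P(B)."""
    return prior * likelihood / evidence

# P(wrestler | muscular): prior P(wrestler) = 0.15,
# likelihood P(muscular | wrestler) = 0.2, evidence P(muscular) = 0.1
p_wrestler_given_muscular = bayes(prior=0.15, likelihood=0.2, evidence=0.1)
print(p_wrestler_given_muscular)  # approximately 0.3
```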

Does this answer make sense to you? If not, it will make much more sense with some interpretation.

# 3. Interpretation of Bayes Theorem

Let us list all the information we learned about Jamie, along with our corresponding guesses about his favorite sport:

• Info #1: Jamie is muscular:

Guess #1: $$P(Jamie\ is\ wrestler|Jamie\ is\ muscular) = 75\%$$

• Info #2: 15% of athletic students in Jamie’s school are in the wrestling team:

Guess #2: $$P(Jamie\ is\ wrestler|Jamie\ is\ muscular) = 75\%$$

• Info #3: 30% of the muscular students in Jamie’s school are in wrestling team:

Guess #3: $$P(Jamie\ is\ wrestler|Jamie\ is\ muscular) = 30\%$$

Why has the probability that Jamie is a wrestler dropped from 75% to 30% after learning that only 30% of muscular students are on the wrestling team? To answer this question, we need to understand the Bayesian interpretation of each term in Bayes formula.

The core idea of Bayes Theorem is that it ties the degree of belief in a hypothesis $$H$$ before and after accounting for evidence $$E$$ (e.g. a new observation).

In the above example, we formed our hypothesis that “Jamie is a wrestler“, based on the logical relationship we built in our minds between Jamie’s body shape and his favorite sport. Using this prior knowledge, we set our prior probability $$P(H)$$ to 75% that Jamie is a wrestler. In other words, knowing that “Jamie is muscular“, we believed that Jamie is more likely to be a wrestler.

Later on, we received new data telling us that only 15% of athletic students in Jamie’s school are on the wrestling team. This information decreases the probability that any athletic student is on the wrestling team, but it does not tie the size of the wrestling team to the number of muscular students. Accordingly, we kept our belief that Jamie is more likely to be on the wrestling team.

Afterwards, we received a third observation, regarding the ratio of muscular students on the wrestling team. This new information accounts for both entities in the question: muscular students and the wrestling team. Therefore, Bayes Theorem uses this ratio to update our belief. Based on this new observation, Bayes Theorem updated our prior probability $$P(H)$$ by multiplying it by another component. This new component is defined as $$P(E|H)$$. In words, we describe $$P(E|H)$$ as the probability of the evidence “Jamie is muscular“, given the hypothesis “Jamie is a wrestler“. This component is known as the likelihood.

Finally, Bayes Theorem adds a third component called the probability of evidence, $$P(E)$$. This term is constant, regardless of the hypothesis we use. In our example, $$P(E)$$ represents the probability that “Jamie is muscular“, regardless of whether Jamie is a wrestler or not. $$P(E)$$ acts as a normalization constant in Bayes Theorem. It has several other names, such as average likelihood, marginal likelihood, data probability, model evidence, or global likelihood.

You can think of $$P(E)$$ as the sum, over all possible hypotheses $$H$$, of the prior $$P(H)$$ multiplied by the likelihood $$P(E|H)$$. I.e.:

P(E) = \sum\limits_{H} P(H)\times P(E|H)\tag{5}

Using these interpretations, we re-write Bayes formula as:

P(H|E) = \frac{P(H)\times P(E|H)}{P(E)}\tag{6}

Where:

• $$P(H)$$: prior probability (our guess before obtaining data)
• $$P(E|H)$$: likelihood value (influence of new data on our prior guess)
• $$P(E)$$: probability of evidence (probability of new data over all possible hypotheses)
• $$P(H|E)$$: posterior probability (probability of hypothesis using both prior knowledge and real data)

It is obvious from Equations (5) and (6) that the posteriors of all possible hypotheses $$H$$ given evidence $$E$$ must add up to 1.

Please note also that, since the probability of evidence $$P(E)$$ is independent of any particular hypothesis $$H$$, it can be ignored when comparing values of $$P(H|E)$$ across hypotheses.
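As a final sketch, Equations (5) and (6) can be put together for the two hypotheses in Jamie’s example (the dictionary structure and variable names are my own), confirming that the posteriors add up to 1:

```python
# Hypotheses: which team Jamie is on; evidence: Jamie is muscular.
priors = {"wrestler": 15 / 100, "footballer": 85 / 100}
likelihoods = {"wrestler": 3 / 15, "footballer": 7 / 85}  # P(E|H)

# Equation (5): P(E) = sum over H of P(H) * P(E|H)
p_evidence = sum(priors[h] * likelihoods[h] for h in priors)

# Equation (6): posterior P(H|E) for each hypothesis
posteriors = {h: priors[h] * likelihoods[h] / p_evidence for h in priors}
print(posteriors)  # wrestler: approximately 0.3, footballer: approximately 0.7
```

Because $$P(E)$$ just rescales both numerators by the same constant, dropping it would not change which hypothesis comes out on top.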