Recall Bayes’s theorem from our previous reading:

\[ \Pr(A \mid B) = \frac{\Pr(B \mid A)}{\Pr(B)} \cdot \Pr(A) \]

Observe that we can interpret Bayes’s theorem as a way of expressing
an *updated belief* about the probability of \(A\) given that we learn some new
information \(B\).

- \(\Pr(A)\) is the original probability, called the *prior belief* of \(A\).
- \(\Pr(A \mid B)\) is the updated probability, called the *posterior belief* of \(A\) given that \(B\) occurs.

To obtain the posterior belief, we adjust the prior belief by two quantities:

- \(\Pr(B \mid A)\), the *likelihood*, the probability of the new information occurring given that \(A\) occurs. We multiply by this quantity.
- \(\Pr(B)\), the prior probability of the new information. We divide by this quantity.
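As a quick numeric sanity check, here is a sketch of this update in Python. The particular prior, likelihood, and evidence values are made up purely for illustration:

```python
def posterior(prior, likelihood, evidence):
    """Bayes's theorem: Pr(A | B) = Pr(B | A) / Pr(B) * Pr(A)."""
    return likelihood / evidence * prior

# Made-up numbers: prior belief 0.3, likelihood 0.8, evidence 0.5.
# The new information B is more likely under A than overall,
# so our belief in A rises from 0.30 to 0.48.
print(posterior(0.3, 0.8, 0.5))
```

Because the likelihood exceeds the evidence here, the posterior belief ends up larger than the prior belief.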

We can apply this interpretation to the *classification
problem* to obtain a simple algorithm, the *Naive Bayes*
algorithm. For example, consider the following table of data that
records when we are likely to go to class given that certain
circumstances occur:

Time of Day | Partied Yesterday? | Test Today? | Hungry? | Went to Class? |
---|---|---|---|---|
Early | Yes | No | Yes | No |
Early | No | No | No | No |
Early | No | No | No | Yes |
Late | Yes | Yes | Yes | Yes |
Late | No | No | Yes | No |
Early | Yes | Yes | No | Yes |
Early | No | No | Yes | Yes |
Late | No | No | No | No |
Late | Yes | No | Yes | Yes |
Late | No | Yes | Yes | Yes |

We’ll call an individual row of this table a piece of *data*.
The first four columns:

- “Time of Day”, either *early* or *late*.
- “Partied Yesterday?”, either *yes* or *no*.
- “Test Today?”, either *yes* or *no*.
- “Hungry?”, either *yes* or *no*.

We’ll call these the *features* of the data. To simplify our subsequent
calculations, we’ll let all our features be boolean, *i.e.*,
two-valued, in nature, but we could certainly work with other types of
values as well. We can represent these features as a 4-tuple, *e.g.*,
\((\text{Early}, \text{Yes}, \text{No},
\text{Yes})\) for the first entry of the table.

The final column, “Went to Class?”, is the *class* of the
data. For example, the class of the first entry of the table is “No,”
*i.e.*, when it was early, we partied, there was no test, and we
were hungry, we recorded that we did not go to class.

The table above represents our *training data*. Given this
training data, a classification algorithm is then able to compute the
*most likely class* for a new piece of data. For example, if we
see the data:

- \((\text{Early}, \text{No}, \text{No}, \text{Yes})\)

Is it more likely we’ll go to class or not in this scenario?

# Problem 1: Working Out the Theory

First, let’s see how Bayes’s theorem applies to this situation.
Ultimately, our algorithm is a *classifier* which determines the
class of a new piece of data given the training data it is initialized
with. We’ll let \(A\) be the result of
the classifier, *i.e.*, whether we go to class, and \(B\) will be the data fed to the algorithm,
*i.e.*, the 4-tuple. If we let \(X\) be the 4-tuple and \(c\) be the class, then we have according to
Bayes’s theorem:

\[ \Pr(c \mid X) = \frac{\Pr(X \mid c)}{\Pr(X)} \cdot \Pr(c) \]

Note that \(c\) can be either “Yes” or “No”, so we can calculate the probability of either class for our data \(X\) by instantiating \(c\) appropriately. Ultimately, our classifier computes each of these probabilities, \(\Pr(\text{Yes} \mid X)\) and \(\Pr(\text{No} \mid X)\). Then our classifier assigns \(X\) the class that generates the higher probability.

In terms of notation, the \(\mathop{\mathrm{arg\,max}}\) function captures this idea succinctly. If \(C = \{\, \text{"Yes"}, \text{"No"} \,\}\) is our set of classes, then this computation is:

\[ \mathop{\mathrm{arg\,max}}_{c \in C} \frac{\Pr(X \mid c)}{\Pr(X)} \cdot \Pr(c) \]

In other words, we choose the class \(c\) that maximizes the probability described by Bayes’s equation. Next, we must figure out how to compute each of these probabilities in terms of our training data. But before we do that, let’s try to simplify the formula to avoid computing more than we need. One observation is that computing \(\Pr(X)\) is unnecessary, so we can instead compute:

\[ \mathop{\mathrm{arg\,max}}_{c \in C} \Pr(X \mid c) \Pr(c) \]

In a sentence or two, describe why we don’t need to compute \(\Pr(X)\) in our \(\mathop{\mathrm{arg\,max}}\) calculation.

(*Hint*: think about for each \(c\) what \(\Pr(X)\) represents and whether it depends
on the particular \(c\) in
question.)
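As a concrete illustration of the hint, the following sketch checks that dividing every class’s score by the same constant \(\Pr(X)\) never changes which class wins the arg max. The score values are made up for illustration:

```python
# Made-up unnormalized scores Pr(X | c) * Pr(c) for each class.
scores = {"Yes": 0.05, "No": 0.075}

# By the law of total probability, Pr(X) is the sum of these scores
# over all classes.
pr_x = sum(scores.values())

best_unnormalized = max(scores, key=scores.get)
best_normalized = max(scores, key=lambda c: scores[c] / pr_x)

# Dividing every score by the same positive constant preserves their
# ordering, so both computations pick the same class.
assert best_unnormalized == best_normalized == "No"
```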

# Problem 2: Independence and Computation

The above calculation now demands that we compute two quantities for each class \(c\):

- \(\Pr(c)\): the probability that the class occurs in the training data.
- \(\Pr(X \mid c)\): the likelihood of the data given the class.

We can employ the definition of conditional probability which says that:

\[ \Pr(X \cap c) = \Pr(X \mid c) \cdot \Pr(c) \]

Employed in this fashion, we call this the probability *chain
rule*. Therefore, we are really computing \(\Pr(X \cap c)\), the probability that we
have data \(X\) and class \(c\) at the same time. Recall that \(X\) is really a tuple of sub-events, \(X = (x_1, x_2, x_3, x_4)\). With this in
mind, we can then repeatedly apply the chain rule, observing that
intersection of events is commutative:

\[ \Pr(X \mid c) \cdot \Pr(c) = \Pr(c) \cdot \Pr(x_1 \mid c) \cdot \Pr(x_2 \mid c, x_1) \cdot \Pr(x_3 \mid c, x_2, x_1) \cdot \Pr(x_4 \mid c, x_3, x_2, x_1). \]

While this, in theory, allows us to compute \(\Pr(X \mid c)\), it is onerous to do. We need to know the conditional probability of each feature given the others.
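To see the chain rule in action, here is a small numeric check on a made-up joint distribution over a class and two features. All counts here are hypothetical, chosen only so the identity can be verified numerically:

```python
from collections import Counter

# Hypothetical joint counts over (class, x1, x2); the numbers are made
# up purely to check the chain-rule identity.
counts = Counter({
    ("Yes", "Early", "No"): 3, ("Yes", "Late", "Yes"): 2,
    ("Yes", "Early", "Yes"): 1, ("No", "Early", "Yes"): 1,
    ("No", "Late", "No"): 4, ("No", "Late", "Yes"): 1,
})
total = sum(counts.values())

def pr(pred):
    """Probability of the event of all outcomes satisfying pred."""
    return sum(n for outcome, n in counts.items() if pred(outcome)) / total

for (c, x1, x2) in counts:
    joint = pr(lambda o: o == (c, x1, x2))                 # Pr(c, x1, x2)
    p_c = pr(lambda o: o[0] == c)                          # Pr(c)
    p_x1 = pr(lambda o: o[:2] == (c, x1)) / p_c            # Pr(x1 | c)
    p_x2 = joint / pr(lambda o: o[:2] == (c, x1))          # Pr(x2 | c, x1)
    assert abs(joint - p_c * p_x1 * p_x2) < 1e-12
```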

To simplify the computation, let’s assume that the features \(x_1, \ldots, x_4\) are *independent*
of each other given the class, *i.e.*, once we know \(c\), the probability of
one feature does not depend on the others. This conditional independence
assumption is the “naive” part of Naive Bayes. It leads us to the following computation:

\[ \Pr(X \mid c) = \Pr(x_1 \mid c) \cdot \Pr(x_2 \mid c) \cdot \Pr(x_3 \mid c) \cdot \Pr(x_4 \mid c). \]

\(\Pr(x_1 \mid c)\), for example, is the probability in our training data that it was early (or not) given that we went to class (or not). These probabilities are much easier to compute!

In summary, we now have the following computation to perform:

\[ \mathop{\mathrm{arg\,max}}_{c \in C} \Pr(c) \cdot \Pr(x_1 \mid c) \cdot \Pr(x_2 \mid c) \cdot \Pr(x_3 \mid c) \cdot \Pr(x_4 \mid c). \]

And then we choose the class \(c\) that maximizes this probability.
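For readers who want to check their hand calculations, here is a rough Python sketch of the whole pipeline. The `data` literal transcribes the training table above, and the raw-count estimates mirror the formulas in this section; a production implementation would also have to handle feature values that never occur with a class (which yield zero counts), but we ignore that here:

```python
from collections import Counter

# Training data from the table above: (time, partied, test, hungry) -> class.
data = [
    (("Early", "Yes", "No",  "Yes"), "No"),
    (("Early", "No",  "No",  "No"),  "No"),
    (("Early", "No",  "No",  "No"),  "Yes"),
    (("Late",  "Yes", "Yes", "Yes"), "Yes"),
    (("Late",  "No",  "No",  "Yes"), "No"),
    (("Early", "Yes", "Yes", "No"),  "Yes"),
    (("Early", "No",  "No",  "Yes"), "Yes"),
    (("Late",  "No",  "No",  "No"),  "No"),
    (("Late",  "Yes", "No",  "Yes"), "Yes"),
    (("Late",  "No",  "Yes", "Yes"), "Yes"),
]

class_counts = Counter(c for _, c in data)   # numerators for Pr(c)
feat_counts = Counter()                      # numerators for Pr(x_i | c)
for x, c in data:
    for i, v in enumerate(x):
        feat_counts[(c, i, v)] += 1

def score(x, c):
    """Pr(c) times the product of Pr(x_i | c), estimated from raw counts."""
    s = class_counts[c] / len(data)
    for i, v in enumerate(x):
        s *= feat_counts[(c, i, v)] / class_counts[c]
    return s

def classify(x):
    """Return the class with the highest score for the 4-tuple x."""
    return max(class_counts, key=lambda c: score(x, c))
```

For instance, `classify(("Early", "Yes", "No", "Yes"))` returns `"Yes"`: the score for “Yes” works out to \(\frac{6}{10} \cdot \frac{3}{6} \cdot \frac{3}{6} \cdot \frac{3}{6} \cdot \frac{4}{6} = 0.05\), versus \(\frac{4}{10} \cdot \frac{2}{4} \cdot \frac{1}{4} \cdot \frac{4}{4} \cdot \frac{2}{4} = 0.025\) for “No”.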

Go through the training data and hand-calculate the various probabilities you need to carry out this computation. It’ll be useful to organize these probabilities into various tables that you can look up easily. Once you have done this, determine what class your resulting classifier assigns the following data:

- \((\text{Early}, \text{No}, \text{No}, \text{Yes})\)

Create three other test data points and carry out the classifier on those data as well, showing the computation you performed.