 #### Christopher Hoult

Software engineer, actor, speaker, print designer

# Bayes' Theorem and Machine Learning

Image credit: Joe deSousa

The application of Bayes' Theorem to Naive Bayes Classifiers is laid out pretty well on Wikipedia, but I thought I'd put forward my understanding of how it's used.

First off, we have the theorem itself:

$$P(C|F) = \frac{P(C){\cdotp}P(F|C)}{P(F)}$$

Or, "the probability of event C occurring given event F occurring is the probability of C multiplied by the probability of F given C, all divided by the probability of F" (or, in the case of classification, we might view C as the class, and F as the feature).
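To make the theorem concrete, here is a minimal sketch with made-up numbers. The spam/"offer" scenario and every probability in it are assumptions chosen purely for illustration. P(F) is obtained from the law of total probability over the two classes.

```python
def posterior(prior_c, likelihood_f_given_c, evidence_f):
    """Bayes' Theorem: P(C|F) = P(C) * P(F|C) / P(F)."""
    return prior_c * likelihood_f_given_c / evidence_f


# Hypothetical numbers, invented for this example only:
# C = "email is spam", F = "email contains the word 'offer'".
p_c = 0.2               # P(C): assumed share of emails that are spam
p_f_given_c = 0.5       # P(F|C): assumed share of spam containing "offer"
p_f_given_not_c = 0.05  # P(F|not C): assumed share of non-spam containing "offer"

# Law of total probability: P(F) = P(C)P(F|C) + P(not C)P(F|not C)
p_f = p_c * p_f_given_c + (1 - p_c) * p_f_given_not_c

p_c_given_f = posterior(p_c, p_f_given_c, p_f)
print(f"P(F) = {p_f:.4f}, P(C|F) = {p_c_given_f:.4f}")
```

With these assumed inputs, seeing the feature raises the probability of the class well above its prior of 0.2.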

As we're looking to compare how strongly feature F predicts each candidate class C, and P(F) is the same whichever class we consider, we can treat the denominator as a constant and work with the numerator alone:

$$P(C|F) \propto P(C){\cdotp}P(F|C)$$

So far so good, but this is only applicable to one feature - and even then, to the presence of one feature, not to its absence. We not only need to add in the rest of the features, but we may also view a missing feature as an "event" in this model. The probability of C given the presence of three features F1, F2 and F3 is written:

$$P(C|F_1,F_2,F_3)$$

To include the extra features, we can apply the chain rule as follows:

\begin{align} P(C|F_1,F_2,F_3) & \propto P(C){\cdotp}P(F_1,F_2,F_3|C) \\ & = P(C){\cdotp}P(F_1|C){\cdotp}P(F_2,F_3|C,F_1) \\ & = P(C){\cdotp}P(F_1|C){\cdotp}P(F_2|C,F_1){\cdotp}P(F_3|C,F_1,F_2) \end{align}
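As a sanity check on the chain rule step, the sketch below builds an arbitrary joint distribution P(F1,F2,F3|C) for a single fixed class and confirms that the product of the conditional terms reproduces it. The weights are invented, and the helper names `marginal` and `chain_rule` are hypothetical.

```python
from itertools import product

# Hypothetical P(F1, F2, F3 | C) for one fixed class C, over binary features.
# The weights are arbitrary; dividing by their sum makes a valid distribution.
weights = [1, 2, 3, 4, 5, 6, 7, 8]
total = sum(weights)
joint = {
    combo: w / total
    for combo, w in zip(product([0, 1], repeat=3), weights)
}


def marginal(prefix):
    """P(the first len(prefix) features equal prefix | C)."""
    return sum(p for combo, p in joint.items() if combo[:len(prefix)] == prefix)


def chain_rule(f1, f2, f3):
    """P(F1|C) * P(F2|C,F1) * P(F3|C,F1,F2), built from marginals."""
    p_f1 = marginal((f1,))
    p_f2_given_f1 = marginal((f1, f2)) / marginal((f1,))
    p_f3_given_f1_f2 = marginal((f1, f2, f3)) / marginal((f1, f2))
    return p_f1 * p_f2_given_f1 * p_f3_given_f1_f2


# Largest gap between the chain rule product and the joint itself.
max_error = max(abs(chain_rule(*combo) - joint[combo]) for combo in joint)
print(f"largest discrepancy: {max_error:.2e}")
```

The discrepancy is at floating-point rounding level. The chain rule is an exact identity, and no independence assumption has been used yet.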

We can apply the assumption that the features are conditionally independent given the class, and thus reduce the complexity of the right-hand side such that:

$$P(F_2|C,F_1) \: = \: P(F_2|C)$$ $$P(F_3|C,F_1,F_2) \: = \: P(F_3|C)$$

And thus:

$$P(C|F_1,F_2,F_3) = \frac{P(C){\cdotp}P(F_1|C){\cdotp}P(F_2|C){\cdotp}P(F_3|C)}{P(F_1,F_2,F_3)}$$

You'll note that P(F1,F2,F3) is also unknown. However, it does not depend on C, and as such represents a fixed scaling factor (Z below) dependent only on the F variables - basically, it is the same for every class, so we can ignore it when comparing classes, so long as we're solving with the same input features. Finally, we can say that the probability that class C is applicable given values for our three features is:

$$P(C|F_1,F_2,F_3) = \frac{1}{Z}{\cdotp}P(C){\cdotp}P(F_1|C){\cdotp}P(F_2|C){\cdotp}P(F_3|C)$$
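The final formula can be sketched in a few lines of Python. This is an illustrative toy, not a production classifier: the class names, feature names and probabilities are all invented, and `naive_bayes_posteriors` is a hypothetical helper. It treats an absent feature as an event with probability 1 - P(F|C), and recovers Z by summing the unnormalised scores over every class, which is what P(F1,F2,F3) amounts to under the model.

```python
def naive_bayes_posteriors(priors, likelihoods, observed):
    """Return ({class: P(C|F1..Fn)}, Z) under the naive independence assumption.

    priors:      {class: P(C)}
    likelihoods: {class: {feature: P(feature present | C)}}
    observed:    {feature: True if present, False if absent}
    """
    scores = {}
    for c, prior in priors.items():
        score = prior
        for feature, present in observed.items():
            p = likelihoods[c][feature]
            # An absent feature is an event too, with probability 1 - P(F|C).
            score *= p if present else (1 - p)
        scores[c] = score  # P(C) * product of P(Fi|C): the unnormalised value

    # Z is the same for every class: the sum of the unnormalised scores.
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}, z


# Hypothetical model with two classes and three features.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"offer": 0.8, "free": 0.6, "meeting": 0.1},
    "ham": {"offer": 0.1, "free": 0.2, "meeting": 0.5},
}
observed = {"offer": True, "free": True, "meeting": False}

posteriors, z = naive_bayes_posteriors(priors, likelihoods, observed)
best = max(posteriors, key=posteriors.get)
print(posteriors, "Z =", z, "->", best)
```

Because Z is shared by every class, picking the winner only needs the unnormalised scores. Dividing by Z matters only when the posteriors must be real probabilities that sum to 1. With many features, implementations usually sum log-probabilities instead of multiplying, to avoid numerical underflow.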

PHEW!