The application of **Bayes' Theorem** to Naive Bayes Classifiers is laid out pretty well on
Wikipedia but I thought I'd put forward my
understanding of how it's used.

First off, we have the theorem itself:

$$P(C|F) = \frac{P(C){\cdotp}P(F|C)}{P(F)}$$

Or, "the probability of event *C* occurring given event *F* occurring is the probability of *C* multiplied by the
probability of *F* given *C*, all divided by the probability of *F*" (or, in the case of classification, we might view
*C* as the class, and *F* as the feature).
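To make the theorem concrete, here's a quick numeric sketch in Python. The scenario and all the probabilities are invented for illustration; nothing here comes from a real dataset:

```python
# Hypothetical numbers: suppose 30% of emails are spam (class C),
# 80% of spam contains the word "offer" (feature F), and "offer"
# appears in 35% of all emails.
p_c = 0.30          # P(C)
p_f_given_c = 0.80  # P(F|C)
p_f = 0.35          # P(F)

# Bayes' theorem: P(C|F) = P(C) * P(F|C) / P(F)
p_c_given_f = p_c * p_f_given_c / p_f
print(round(p_c_given_f, 4))  # → 0.6857
```

So seeing "offer" raises our belief that the email is spam from 30% to roughly 69%.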

As we're comparing classes given the same observed feature *F*, the denominator *P(F)* is the same regardless of the class, so we can treat it as equal to 1 for now (it will return later as a scaling factor) and simplify our application of the rule down to:

$$P(C|F) = P(C){\cdotp}P(F|C)$$

So far so good, but this is only applicable to one feature - and even then, to the *presence* of one feature, not to
its lack. We not only need to add in the rest of the features, but we may also view a missing feature as an "event" in
this model. The probability of *C* given the presence of three features *F₁*, *F₂* and *F₃* is written:

$$P(C|F_1,F_2,F_3)$$

To include the extra features, we can apply the chain rule as follows:

$$ \begin{align} P(C|F_1,F_2,F_3) & = P(C){\cdotp}P(F_1,F_2,F_3|C) \\ & = P(C){\cdotp}P(F_1|C){\cdotp}P(F_2,F_3|C,F_1) \\ & = P(C){\cdotp}P(F_1|C){\cdotp}P(F_2|C,F_1){\cdotp}P(F_3|C,F_1,F_2) \end{align} $$
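The telescoping steps above are an identity on the joint probability *P(C,F₁,F₂,F₃)* - they hold for *any* distribution, not just under the naive assumption. Here's a numeric check in Python against a randomly generated joint distribution (a sketch; all values are arbitrary):

```python
import itertools
import random

# Build an arbitrary normalized joint distribution over four binary
# variables (C, F1, F2, F3); the chain rule holds for any such table.
random.seed(0)
outcomes = list(itertools.product([0, 1], repeat=4))
weights = [random.random() for _ in outcomes]
total = sum(weights)
joint = dict(zip(outcomes, [w / total for w in weights]))

def p(c=None, f1=None, f2=None, f3=None):
    """Marginal probability that the specified variables take the given values."""
    wanted = (c, f1, f2, f3)
    return sum(pr for o, pr in joint.items()
               if all(w is None or w == v for w, v in zip(wanted, o)))

# Check the expansion for the outcome (c=1, f1=1, f2=1, f3=1):
lhs = p(c=1, f1=1, f2=1, f3=1)                           # P(C,F1,F2,F3)
rhs = (p(c=1)                                            # P(C)
       * p(c=1, f1=1) / p(c=1)                           # P(F1|C)
       * p(c=1, f1=1, f2=1) / p(c=1, f1=1)               # P(F2|C,F1)
       * p(c=1, f1=1, f2=1, f3=1) / p(c=1, f1=1, f2=1))  # P(F3|C,F1,F2)
assert abs(lhs - rhs) < 1e-12
```

Each conditional is computed as a ratio of marginals, so the product telescopes back down to the joint probability.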

We can apply the assumption that the features are conditionally independent given the class, and thus reduce the complexity of the right-hand side such that:

$$P(F_2|C,F_1) \: = \: P(F_2|C)$$ $$P(F_3|C,F_1,F_2) \: = \: P(F_3|C)$$

And thus:

$$P(C|F_1,F_2,F_3) = \frac{P(C){\cdotp}P(F_1|C){\cdotp}P(F_2|C){\cdotp}P(F_3|C)}{P(F_1,F_2,F_3)}$$

You'll note that *P(F₁,F₂,F₃)* is also unknown. However, it is a fixed value for any given set of feature values, and as
such represents a scaling factor (*Z* below) dependent only on the *F* variables - basically, we can consider this
something of a fixed mystery, so we can substitute any number for it so long as we're solving with the same input
features. Finally, we can say that the probability that class *C* is applicable given values for our three features is:

$$P(C|F_1,F_2,F_3) = \frac{1}{Z}{\cdotp}P(C){\cdotp}P(F_1|C){\cdotp}P(F_2|C){\cdotp}P(F_3|C)$$
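Putting the final formula to work is simple: compute the unnormalized product for every class, then let *Z* be whatever makes the scores sum to 1. Here's a minimal sketch with a two-class spam/ham example; the class names and all probabilities are invented:

```python
# Invented priors and per-feature likelihoods for a toy example.
priors = {"spam": 0.3, "ham": 0.7}   # P(C)
likelihoods = {                      # [P(F1|C), P(F2|C), P(F3|C)]
    "spam": [0.8, 0.6, 0.1],
    "ham":  [0.2, 0.3, 0.4],
}

# Unnormalized scores: P(C) * P(F1|C) * P(F2|C) * P(F3|C)
scores = {}
for c in priors:
    score = priors[c]
    for p_f in likelihoods[c]:
        score *= p_f
    scores[c] = score

# The 1/Z factor: Z is just the sum of the unnormalized scores,
# which forces the posteriors to sum to 1.
z = sum(scores.values())
posteriors = {c: s / z for c, s in scores.items()}
print(posteriors)
```

Note that if all we want is the most probable class, we can skip computing *Z* entirely and just take the class with the largest unnormalized score - exactly the "fixed mystery" point above.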

*PHEW!*