2016 Feb 24

## Naive Bayes Classifier

The goal is to approximate the optimal (Bayes) classifier.

Naive Bayes is a Generative probabilistic classifier. This is in contrast to a Discriminative probabilistic classifier, where the prediction $p(label|features)$ is computed directly; linear and logistic regression are examples of this type of classifier.

In a Generative model classifier, $p(label|features)$ is computed via Bayes' rule:

$p(label|features) \propto p(features|label) \times p(label)$

Here $p(features|label)$ is called the Likelihood, and $p(label)$ is called the Prior.

For example, Socrative is a voting app that can be used from a smartphone or by other means (e.g. a web browser). We are given that people who own a smartphone use Socrative with probability 0.7, and people who don't own a smartphone use Socrative with probability 0.5. We want to classify a person into one of two classes (Phone, NoPhone) based on the single feature 'uses Socrative':

$p(Soc|Phone) = 0.7$

$p(Soc|NoPhone) = 0.5$

### Maximum Likelihood Hypothesis (ML)

Now if we see that someone uses Socrative, then we should conclude that this person is more likely to have a phone, because $p(Soc|Phone) > p(Soc|NoPhone)$.

This is called the Maximum Likelihood (ML) Hypothesis: we select the label (or hypothesis) under which the observed evidence (or feature) has the maximum probability.
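The ML decision for the Socrative example can be sketched in a few lines (the label names and the `likelihood` dictionary are just one way to encode the numbers given above):

```python
# Likelihoods from the example: p(Soc | label)
likelihood = {"Phone": 0.7, "NoPhone": 0.5}

# ML hypothesis: pick the label that assigns the highest
# probability to the observed feature ("uses Socrative").
ml_label = max(likelihood, key=likelihood.get)
print(ml_label)  # Phone, since 0.7 > 0.5
```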

### Maximum a Posteriori Hypothesis (MAP)

If we were given additional information such as $p(Phone) = 0.3$ (This is called a Prior) then the decision making changes to:

$p(Phone|Soc) = p(Phone) \times p(Soc|Phone) / p(Soc)$

$p(NoPhone|Soc) = p(NoPhone) \times p(Soc|NoPhone)/ p(Soc)$

We use Bayes rule to compute the probabilities of each outcome for the feature we observe.

Substituting,

$p(Phone|Soc) = 0.3 * 0.7 / p(Soc) = 0.21 / p(Soc)$

$p(NoPhone|Soc) = (1-0.3) * 0.5 / p(Soc) = 0.35 / p(Soc)$

In this case, we select the label that maximises the probability given the features. Since $p(Phone|Soc) < p(NoPhone|Soc)$, it is more likely that this person doesn't have a phone. This is called the Maximum a Posteriori (MAP) Hypothesis.
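The MAP calculation above can be sketched as follows. Note that $p(Soc)$ is the same for both labels, so it can be dropped when we only need the argmax (the dictionary names are illustrative, not from the original notes):

```python
# Quantities from the example.
likelihood = {"Phone": 0.7, "NoPhone": 0.5}  # p(Soc | label)
prior = {"Phone": 0.3, "NoPhone": 0.7}       # p(label); p(NoPhone) = 1 - 0.3

# Unnormalised posterior: p(label | Soc) ∝ p(Soc | label) * p(label).
posterior = {label: likelihood[label] * prior[label] for label in likelihood}
# Phone: 0.3 * 0.7 = 0.21, NoPhone: 0.7 * 0.5 = 0.35

map_label = max(posterior, key=posterior.get)
print(map_label)  # NoPhone
```

Dividing both values by $p(Soc) = 0.21 + 0.35$ would give proper posterior probabilities, but the decision is unchanged.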

### Probability estimation explosion

Generalising this to $d$ features and $k$ classes, we have to estimate

$p(label_k, feature_1, \ldots, feature_d) = p(feature_1, feature_2, \ldots, feature_d | label_k) \times p(label_k)$

For the $k$ labels we have to estimate $(k-1)$ prior probabilities (they add up to 1).

For each label, we have to estimate the joint distribution of $d$ binary features: $2^d$ feature combinations (hence the 2), minus 1 because the probabilities add up to 1, i.e. $2^d - 1$ probabilities per label.

So in total: $(k-1) + k \times (2^d - 1)$

For example, for 20 binary features and 2 classes the number of probability estimates would be:

$(2-1) + 2 \times (2^{20} - 1) = 1 + 2^{21} - 2 = 2^{21} - 1$
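The count can be checked directly (the function name is just for illustration):

```python
# Number of probabilities to estimate for the full joint model:
# (k - 1) class priors, plus 2**d - 1 probabilities per class
# for the joint distribution of d binary features.
def n_parameters(k, d):
    return (k - 1) + k * (2 ** d - 1)

print(n_parameters(2, 20))  # 2097151, i.e. 2**21 - 1
```

This exponential growth in $d$ is exactly what the Naive Bayes conditional-independence assumption avoids, reducing the per-class cost from $2^d - 1$ to $d$ probabilities.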