Indy's Weblog
2016 May 23

Probability and Classification Basics

Machine Learning Basics

Going over the basics of Probability Theory

Independent Events = The probability of one event happening in no way affects the probability of another event occurring. Example: you roll a die and toss a coin; the probability of the die landing on any face is not affected by getting a head or a tail on the coin, and vice versa.
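A minimal Python sketch (purely illustrative) that enumerates the joint sample space of a die roll and a coin toss and checks the independence property $p(A \cap B) = p(A) \times p(B)$:

```python
from fractions import Fraction
from itertools import product

# Sample space: every (die face, coin side) pair is equally likely.
outcomes = list(product(range(1, 7), ["H", "T"]))

def prob(event):
    """Probability of an event (a predicate over outcomes) under a uniform distribution."""
    hits = [o for o in outcomes if event(o)]
    return Fraction(len(hits), len(outcomes))

p_die_3 = prob(lambda o: o[0] == 3)                  # p(die shows 3) = 1/6
p_heads = prob(lambda o: o[1] == "H")                # p(coin is heads) = 1/2
p_both  = prob(lambda o: o[0] == 3 and o[1] == "H")  # p(3 and heads) = 1/12

# Independence: the joint probability equals the product of the marginals.
assert p_both == p_die_3 * p_heads
```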

Mutually Exclusive Events = Two or more events that cannot happen together. In set notation, $A \cap B = \emptyset$. Example: rolling a die and landing on one face excludes the possibility of landing on another face.

Dependent Events = The probability of one event happening affects the probability of another event happening. Example: you draw a card from a deck, and the probability of any card drawn subsequently is affected by which card was drawn first.
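A small sketch of the card example, assuming two draws without replacement; the probability that the second card is an ace depends on what the first card was, which is exactly what dependence means:

```python
from fractions import Fraction

# Drawing without replacement from a 52-card deck containing 4 aces.
p_first_ace = Fraction(4, 52)

# The second draw's probabilities depend on the outcome of the first draw.
p_second_ace_given_first_ace     = Fraction(3, 51)
p_second_ace_given_first_not_ace = Fraction(4, 51)

# The conditional probabilities differ, so the two draws are dependent events.
assert p_second_ace_given_first_ace != p_second_ace_given_first_not_ace
```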

Conditional Probability = The probability of one event given that another event has occurred. Denoted by $p(A|B)$. Example: the probability of having a cold given that you have a cough.

Joint Probability = Usually denoted $p(A,B)$ = The probability of event A and event B both occurring. Example: event A = drawing a K from a card deck, event B = drawing a red card from the deck. $p(A,B)$ is $p(A \cap B)$, i.e. the probability of drawing a red K = $2\over52$.
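A sketch that builds the deck explicitly and counts the cards that are both a K and red, confirming $p(A \cap B) = \frac{2}{52}$:

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]   # hearts and diamonds are red
deck = list(product(ranks, suits))                  # 52 cards

is_king = lambda card: card[0] == "K"
is_red  = lambda card: card[1] in ("hearts", "diamonds")

# Joint probability p(K, red): count cards that are both kings and red.
p_king_and_red = Fraction(sum(1 for c in deck if is_king(c) and is_red(c)), len(deck))
assert p_king_and_red == Fraction(2, 52)
```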

Marginal Probability = The probability of an event occurring, not conditioned on any other event. Example: the probability of drawing a King from a deck of cards = $4\over52$.

Two rules of probability

We can combine marginal, conditional and joint probabilities using the following two rules.

Sum Rule: A marginal probability can be expressed as the sum of joint probabilities over all outcomes of another event.

$$ p(A) = \sum_B p(A,B) $$
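A sketch of the sum rule over a small, made-up joint distribution of two binary events, marginalising out $B$:

```python
from fractions import Fraction

# Hypothetical joint distribution p(A, B) over two binary events.
joint = {
    (True,  True):  Fraction(1, 8),
    (True,  False): Fraction(3, 8),
    (False, True):  Fraction(1, 4),
    (False, False): Fraction(1, 4),
}

# Sum rule: the marginal p(A = a) is the sum of the joint over all values of B.
def marginal_A(a):
    return sum(p for (a_val, _b), p in joint.items() if a_val == a)

assert marginal_A(True)  == Fraction(1, 2)   # 1/8 + 3/8
assert marginal_A(False) == Fraction(1, 2)   # 1/4 + 1/4
```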

Product Rule: A joint probability can be expressed as the product of the marginal probability of one event and the conditional probability of the other.

$$ p(A,B) = p(A) \times p(B|A) $$ or $$ p(A,B) = p(B) \times p(A|B) $$
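The product rule applied to the earlier card example, sketched in Python; both factorisations give the same joint probability:

```python
from fractions import Fraction

# p(K, red) = p(K) * p(red | K)
p_king = Fraction(4, 52)           # marginal probability of drawing a king
p_red_given_king = Fraction(2, 4)  # half of the four kings are red

p_king_and_red = p_king * p_red_given_king
assert p_king_and_red == Fraction(2, 52)

# Factoring the other way gives the same joint: p(red) * p(K | red)
assert Fraction(26, 52) * Fraction(2, 26) == p_king_and_red
```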

From the product rule we can derive

Bayes Rule

$$ p(X) \times p(Y|X) = p(Y) \times p(X|Y) $$

therefore

$$ p(Y|X) = \frac{p(X|Y) \times p(Y)}{p(X)} $$

This can be described as $$ \text{Posterior} = \frac{\text{Likelihood} \times \text{Prior}}{\text{Evidence}} $$
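A sketch of Bayes rule on the cold/cough example from earlier, with made-up numbers for the prior and the likelihoods; the evidence is computed using the sum and product rules:

```python
from fractions import Fraction

# Hypothetical numbers, chosen only for illustration.
p_cold = Fraction(1, 10)                # prior p(cold)
p_cough_given_cold = Fraction(9, 10)    # likelihood p(cough | cold)
p_cough_given_no_cold = Fraction(1, 5)  # p(cough | no cold)

# Evidence p(cough), via the sum and product rules.
p_cough = p_cough_given_cold * p_cold + p_cough_given_no_cold * (1 - p_cold)

# Bayes rule: posterior p(cold | cough) = p(cough | cold) * p(cold) / p(cough).
p_cold_given_cough = p_cough_given_cold * p_cold / p_cough
assert p_cold_given_cough == Fraction(1, 3)
```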

Maximum Likelihood Hypothesis

When the prior and evidence probabilities are not available, the hypothesis chosen is the $Y$ that maximises the likelihood $p(X|Y)$.

Maximum a Posteriori Hypothesis

When the prior probability is available, the hypothesis chosen is the $Y$ that maximises the posterior $p(Y|X) \propto p(X|Y) \times p(Y)$. The evidence $p(X)$ is common to all hypotheses being compared, so it is not a factor.
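A sketch contrasting the two hypotheses with made-up likelihoods and priors: the ML choice ignores the prior, the MAP choice weights the likelihood by the prior, and the evidence is dropped because it is the same for every hypothesis:

```python
# Hypothetical likelihoods p(X | Y) and priors p(Y) for two candidate classes.
likelihood = {"+": 0.30, "-": 0.45}   # p(X | Y)
prior      = {"+": 0.80, "-": 0.20}   # p(Y)

# Maximum likelihood hypothesis: pick the Y maximising p(X | Y) alone.
y_ml = max(likelihood, key=likelihood.get)                    # "-"

# Maximum a posteriori hypothesis: pick the Y maximising p(X | Y) * p(Y).
# The evidence p(X) is constant across hypotheses, so it is left out of the argmax.
y_map = max(prior, key=lambda y: likelihood[y] * prior[y])    # "+" (0.24 vs 0.09)

assert y_ml == "-" and y_map == "+"
```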

With Bayes rule, we need to estimate $p(X_1,X_2,\ldots,X_n|Y)$ for every possible combination of feature outcomes with the class outcomes. For example, for a binary class $Y$ with 3 binary features $(X_1,X_2,X_3)$:

| Class | Feature Combos |
|-------|----------------|
| $+$   | $2^3 - 1$      |
| $-$   | $2^3 - 1$      |

Plus the estimation of the prior $p(Y)$. In total, $2^4 - 1 = 15$ probability estimates.
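Spelling out the count: each class's conditional distribution over the $2^3$ feature combinations has $2^3 - 1$ free parameters (its entries sum to 1), and the binary prior adds one more:

$$ \underbrace{2 \times (2^3 - 1)}_{p(X_1,X_2,X_3|Y) \text{ per class}} + \underbrace{1}_{p(Y)} = 14 + 1 = 15 = 2^4 - 1 $$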

The Optimal Bayes Classifier therefore requires a number of probability estimations, and correspondingly many samples, that grows exponentially with the number of features.

Naive Bayes Classifier

By assuming that the features are independent given the class, we can reduce the number of probability estimations: $$ p(X_1,X_2,\ldots,X_n|Y) = p(X_1|Y) \times p(X_2|Y) \times \ldots \times p(X_n|Y) $$

Therefore a binary class with 3 binary features needs only $2 \times 3 + 1 = 7$ probability estimations.
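A minimal Naive Bayes sketch over a made-up toy dataset (binary class, 3 binary features); it estimates exactly the $2 \times 3$ conditionals $p(X_i{=}1|Y)$ plus the prior $p(Y)$, and classifies by maximising $p(Y)\prod_i p(X_i|Y)$:

```python
from collections import defaultdict

# Toy training data: ([x1, x2, x3], class). All values are made up for illustration.
data = [
    ([1, 0, 1], "+"), ([1, 1, 1], "+"), ([0, 0, 1], "+"),
    ([0, 1, 0], "-"), ([0, 0, 0], "-"), ([1, 1, 0], "-"),
]
classes = {"+", "-"}
n_features = 3

# Estimate the 2 x 3 = 6 conditionals p(X_i = 1 | Y) and the prior p(Y) from counts.
class_counts = defaultdict(int)
feature_counts = defaultdict(lambda: [0] * n_features)
for x, y in data:
    class_counts[y] += 1
    for i, xi in enumerate(x):
        feature_counts[y][i] += xi

prior = {y: class_counts[y] / len(data) for y in classes}
cond = {y: [feature_counts[y][i] / class_counts[y] for i in range(n_features)]
        for y in classes}

def predict(x):
    """Pick the class maximising p(Y) * prod_i p(X_i | Y) (the naive Bayes decision rule)."""
    def score(y):
        s = prior[y]
        for i, xi in enumerate(x):
            p_one = cond[y][i]                     # p(X_i = 1 | Y = y)
            s *= p_one if xi == 1 else 1 - p_one   # p(X_i = x_i | Y = y)
        return s
    return max(classes, key=score)

print(predict([1, 0, 1]))   # "+" with these toy counts
```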