# Probability and Classification Basics

Machine Learning Basics

Going over the basics of probability theory.

**Independent Events** = The probability of one event occurring in no way affects the probability of another event occurring.
*Example: You roll a die and toss a coin; the probability of landing on any face of the die is not affected by getting a head or a tail on the coin, and vice versa.*

**Mutually Exclusive Events** = Two or more events that cannot happen together. In set notation $A \cap B = \emptyset$
*Example: Rolling a die and landing on one face excludes the possibility of landing on any other face.*

**Dependent Events** = The probability of one event occurring affects the probability of another event occurring.
*Example: You draw a card from a deck without replacing it; the probability of any card drawn subsequently is affected by the outcome of the first draw.*

**Conditional Probability** = The probability of one event given that another (dependent) event has occurred. Denoted by $p(A|B)$
*Example: The probability of having a cold given that you have a cough.*

**Joint Probability** = Usually denoted as $p(A,B)$ = The probability of event A **and** event B occurring.
*Example: Event A = Probability of drawing a K from a card deck. Event B = Probability of drawing a red card from the deck. $p(A,B)$ is $p(A \cap B)$, i.e. the probability of drawing a red K = $\frac{2}{52}$*

**Marginal Probability** = The probability of an event occurring not conditioned on any other event.
*Example: Probability of drawing a King from a deck of cards = $\frac{4}{52}$*
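
As a quick sanity check, here is a minimal Python sketch (just enumerating a standard deck and counting outcomes) that reproduces the joint and marginal probabilities from the card examples above:

```python
from fractions import Fraction
from itertools import product

# Build a standard 52-card deck as (rank, suit) pairs.
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))

red_suits = {"hearts", "diamonds"}

# Marginal probability: p(King), not conditioned on anything else.
p_king = Fraction(sum(1 for r, s in deck if r == "K"), len(deck))

# Joint probability: p(King AND red), both events on the same draw.
p_king_and_red = Fraction(sum(1 for r, s in deck if r == "K" and s in red_suits), len(deck))

print(p_king)          # 1/13  (i.e. 4/52)
print(p_king_and_red)  # 1/26  (i.e. 2/52)
```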

### Two rules of probability

We can combine marginal, conditional and joint probabilities using the following two rules.

**Sum Rule**
A marginal probability can be expressed as the sum of joint probabilities over all possible outcomes of the other event.

$$ p(A) = \sum_B p(A,B) $$

**Product Rule**
A joint probability can be expressed as the product of a marginal probability and a conditional probability.

$$ p(A,B) = p(A) \times p(B|A) \quad \text{or} \quad p(A,B) = p(B) \times p(A|B) $$
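
Both rules can be checked numerically. Below is a small sketch using the same deck, with $A \in \{\text{king}, \text{not king}\}$ and $B \in \{\text{red}, \text{black}\}$; the joint table is read straight off the deck (2 red kings, 2 black kings, 24 red non-kings, 24 black non-kings):

```python
from fractions import Fraction

# Joint distribution p(A, B) over A in {king, not king} and B in {red, black}.
joint = {
    ("king", "red"): Fraction(2, 52),
    ("king", "black"): Fraction(2, 52),
    ("not king", "red"): Fraction(24, 52),
    ("not king", "black"): Fraction(24, 52),
}

# Sum rule: marginal p(A=king) = sum over B of p(A=king, B).
p_king = sum(p for (a, b), p in joint.items() if a == "king")
print(p_king)  # 1/13

# Product rule: p(A=king, B=red) = p(B=red) * p(A=king | B=red).
p_red = sum(p for (a, b), p in joint.items() if b == "red")
p_king_given_red = joint[("king", "red")] / p_red
print(p_red * p_king_given_red == joint[("king", "red")])  # True
```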

From the product rule we can derive

**Bayes Rule**

$$ p(X) \times p(Y|X) = p(Y) \times p(X|Y) $$

$$ p(Y|X) = \frac{p(X|Y) \times p(Y)}{p(X)} $$

This can be described as
$$
Posterior = \frac{Likelihood \times Prior}{Evidence}
$$
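
As a worked illustration of Bayes rule on the cold-given-cough example (the prior, likelihood and evidence values below are purely hypothetical):

```python
from fractions import Fraction

# Hypothetical numbers, for illustration only.
p_cold = Fraction(1, 20)              # prior p(cold)
p_cough_given_cold = Fraction(9, 10)  # likelihood p(cough | cold)
p_cough = Fraction(1, 5)              # evidence p(cough)

# Bayes rule: posterior = likelihood * prior / evidence
p_cold_given_cough = p_cough_given_cold * p_cold / p_cough
print(p_cold_given_cough)  # 9/40
```
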
**Maximum Likelihood Hypothesis**

When the prior and evidence probabilities are not available, the chosen hypothesis is the $Y$ that maximises the likelihood $p(X|Y)$.

**Maximum a Posteriori Hypothesis**

When the prior probability is available, the chosen hypothesis is the $Y$ that maximises $p(Y|X) \propto p(X|Y) \times p(Y)$. The evidence $p(X)$ is common to all hypotheses, so it drops out of the comparison.
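
A small sketch contrasting the two hypotheses (the class names, likelihoods and priors below are invented purely for illustration):

```python
# Hypothetical likelihoods p(X | Y) and priors p(Y) for two classes.
likelihood = {"spam": 0.06, "ham": 0.04}   # p(X | Y)
prior = {"spam": 0.2, "ham": 0.8}          # p(Y)

# Maximum likelihood hypothesis: ignore the prior, pick the Y maximising p(X | Y).
ml_hypothesis = max(likelihood, key=likelihood.get)

# Maximum a posteriori hypothesis: pick the Y maximising p(X | Y) * p(Y);
# the evidence p(X) is the same for every Y, so it can be dropped.
map_hypothesis = max(prior, key=lambda y: likelihood[y] * prior[y])

print(ml_hypothesis)   # spam  (0.06 > 0.04)
print(map_hypothesis)  # ham   (0.04 * 0.8 = 0.032 > 0.06 * 0.2 = 0.012)
```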

To apply Bayes rule directly, we need to estimate $p(X_1, X_2, ..., X_n|Y)$ for every possible combination of feature outcomes with each class outcome. For example, for a binary class $Y$ with 3 binary features $(X_1, X_2, X_3)$:

Class | Feature Combos
---|---
$+$ | $2^3 - 1$
$-$ | $2^3 - 1$

Plus one estimate for the prior $p(Y)$. In total $2 \times (2^3 - 1) + 1 = 2^4 - 1 = 15$ probability estimates.

In general, the Optimal Bayes classifier requires a number of probability estimates (and corresponding training samples) that grows exponentially with the number of features, which quickly becomes impractical.

**Naive Bayes Classifier**

By assuming the features are conditionally independent given the class, we can reduce the number of probability estimates. $$ p(X_1,X_2,...,X_n|Y) = p(X_1|Y) \times p(X_2|Y) \times ... \times p(X_n|Y) $$

Therefore a binary class with 3 binary features needs only $2 \times 3 + 1 = 7$ probability estimates.
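
Below is a minimal counting-based sketch of a Naive Bayes classifier for binary features (the toy dataset is invented for illustration; in practice the counts are usually smoothed, e.g. with Laplace smoothing, to avoid zero probabilities):

```python
def train_naive_bayes(X, y):
    """Estimate p(Y) and p(X_i | Y) from binary feature vectors X and labels y."""
    n = len(y)
    classes = set(y)
    prior = {c: sum(1 for label in y if label == c) / n for c in classes}
    # cond[c][i] = p(X_i = 1 | Y = c), estimated by simple counting.
    cond = {}
    for c in classes:
        rows = [x for x, label in zip(X, y) if label == c]
        cond[c] = [sum(r[i] for r in rows) / len(rows) for i in range(len(X[0]))]
    return prior, cond

def predict(x, prior, cond):
    """Pick the class maximising p(Y) * prod_i p(X_i | Y) under the independence assumption."""
    def score(c):
        s = prior[c]
        for i, xi in enumerate(x):
            p = cond[c][i]
            s *= p if xi == 1 else (1 - p)
        return s
    return max(prior, key=score)

# Toy data: 3 binary features, binary class.
X = [(1, 1, 0), (1, 0, 0), (0, 1, 1), (0, 0, 1)]
y = ["+", "+", "-", "-"]
prior, cond = train_naive_bayes(X, y)
print(predict((1, 1, 0), prior, cond))  # "+"
```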