Indy's Weblog
2016 May 22

Machine Learning Basic Metrics

Basic Intuition and Methods

General methods of model training.

Let's look at a simple case of classification. Here we are trying to classify the instances of a given dataset into $m$ classes, using the feature vectors (of $n$ dimensions) that represent the data instances. In functional notation we can represent this process as follows:

$$ f:X \mapsto y $$

where $f$ is a function that maps $X$ (a feature vector) to $y$ (a label or class).

The first thing to do is to implement this classifier with random guessing. This gives us the baseline that we should aim to improve upon. We could come up with all sorts of fancy things to do inside a classifier, but if it performs no better than random guessing, then we really need to quit bullshitting around and take a serious look at our process.
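As a rough illustration of such a baseline, here is a minimal sketch that guesses one of the classes uniformly at random. The function name `random_guess_accuracy` and the balanced three-class label set are assumptions made up for this example; they are not part of any dataset discussed here.

```python
import random

def random_guess_accuracy(labels, classes, seed=0):
    """Accuracy of a 'classifier' that picks one of the classes uniformly at random."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    guesses = [rng.choice(classes) for _ in labels]
    correct = sum(guess == label for guess, label in zip(guesses, labels))
    return correct / len(labels)

# Hypothetical, perfectly balanced 3-class label set (m = 3).
classes = ["a", "b", "c"]
labels = classes * 1000
baseline = random_guess_accuracy(labels, classes)
print(f"random-guess accuracy: {baseline:.3f}")  # close to 1/m for balanced classes
```

Any model we train on the same labels should beat this number by a clear margin before we trust it.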

The aim is to learn a model from the data (feature vectors) that generalises well enough that we can predict for unseen data. Generalising is the key aspect of machine learning, and a lot of factors work against it. Poor generalisation mostly results from poor quality and unrepresentative training data, and then from poor parameter selection in training. Poor quality data can contain a lot of noise from various sources. If we understand the underlying process that generates the data, then we have a better understanding of what creates the noise. This allows us to model the noise and filter it out of the dataset. On most occasions, however, we have a poor understanding of the process that generates the data, and this forces us to make assumptions about the data and its underlying model (the one we seek to discover).

Our aim is to create a model using the data. This is called training the model. We inspect the data, look at possible correlations and create methods to extract features. On most occasions, the features are reported directly in the dataset itself. For example, if our task is to build a classifier that separates apples from pears, colour and shape would be the obvious features readily available. If the data are provided in the form of images, then these features have to be extracted from the images.

While training a model, we can use various algorithms, for example the Naive Bayes Classifier, Random Decision Forests or Support Vector Machines. Each algorithm has properties that make it better suited to a particular kind of data. Selecting the correct algorithm is crucial for creating a model which performs well. Once the algorithm has been selected, its various parameters have to be tuned so that the model it produces is as good as it can be. It's always a good idea to evaluate our model once we have trained it using a particular set of parameters. For this reason, we have to split the dataset into two parts, one for training the model and another for evaluation. They are usually called the Training set and the Validation set.
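The split itself can be sketched in a few lines. This is only one possible way to do it, assuming a shuffled split with 20% held out; the helper name `train_validation_split` and its parameters are invented for this example.

```python
import random

def train_validation_split(data, validation_fraction=0.2, seed=0):
    """Shuffle a copy of the data and split it into (training, validation) lists."""
    items = list(data)
    random.Random(seed).shuffle(items)  # shuffle so neither part is biased by ordering
    n_val = int(round(len(items) * validation_fraction))
    return items[n_val:], items[:n_val]

# Ten hypothetical data instances, identified here only by their index.
train, val = train_validation_split(range(10))
print(len(train), len(val))
```

The two parts never share an instance, which is what makes the validation score an honest estimate.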

We briefly discussed generalisation above. Given a dataset and the trained model, we must find out whether our model generalises well enough. The more data we test on, the better the indication of whether the model has generalised well. But increasing the Validation set size reduces the Training set size, leading to other problems like overfitting (we will discuss this in time).

This is where Cross Validation comes in. With Cross Validation, we partition the data into two sets, but we do this multiple times. Each time the two sets contain a different selection of data, but it is important to remember that the two sets never overlap. There are many ways to perform Cross Validation, but the central aim and concept remain the same.


Figure 1 - k (=5) Fold Ordered Cross Validation for Training/Validation split of 8:2

So, each time we use a set of parameters for our classification algorithm and train a model, we evaluate it by Cross Validation. In k-fold Cross Validation, we evaluate the model k times and average the metrics. This way we have a good idea of how the trained model will perform with unseen data.
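The ordered k-fold scheme described above can be sketched as follows. All the names here (`k_fold_indices`, `cross_validate`, `train_majority`, `score_accuracy`) are hypothetical helpers written for this illustration, and the "model" is a deliberately trivial majority-class predictor so that the example stays self-contained.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k ordered, non-overlapping folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n))
        yield train_idx, val_idx
        start += size

def cross_validate(data, labels, k, train_fn, score_fn):
    """Train and score once per fold, then average the k scores."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(data), k):
        model = train_fn([data[i] for i in train_idx], [labels[i] for i in train_idx])
        scores.append(score_fn(model, [data[i] for i in val_idx], [labels[i] for i in val_idx]))
    return sum(scores) / len(scores)

def train_majority(xs, ys):
    """Toy 'model': always predict the most common training label."""
    return max(set(ys), key=ys.count)

def score_accuracy(model, xs, ys):
    """Accuracy of the constant prediction `model` on the validation labels."""
    return sum(model == y for y in ys) / len(ys)

data = list(range(10))
labels = ["+"] * 2 + ["-"] * 8
folds = list(k_fold_indices(len(data), 5))  # 5 folds, each an 8:2 split
mean_score = cross_validate(data, labels, 5, train_majority, score_accuracy)
print(folds[0], mean_score)
```

Each instance lands in the validation part exactly once, so every label contributes to the averaged score.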

Model Evaluation Metrics

When evaluating the model we created, we compare its predictions against the labels provided with the data. There are many metrics for performance evaluation, but the most common are the following: Accuracy, Error, Precision, Recall and F1 Score.

The simplest classification method is binary classification. In binary classification, our model predicts whether a given data instance belongs to the class that we define. We define a positive prediction as + and everything else as -. A binary classifier can only detect one class. So in a binary classification task where, for example, we try to classify oranges by their feature descriptions, the training data would look like this:

id, feature-01, feature-02, feature-03, label
 0,        100,        0.2,    "round", "Orange"
 1,        120,        0.3,   "square", "Not Orange"

Our model, while predicting on the validation set, may label a data item correctly or incorrectly. A correctly labelled item is either a prediction of 'orange' when it is actually an orange, or a prediction of 'not orange' when it is actually not an orange. These are called True Positives and True Negatives respectively. Likewise, we can categorise the incorrect responses as well. We can summarise the possible outcomes as shown below.

            Prediction +      Prediction -

Actual +    True Positive     False Negative 
Actual -    False Positive    True Negative

Accuracy is defined as the proportion of correct predictions over all predictions made. After abbreviating the terms above, we can express it succinctly as:

Accuracy = (TP + TN) / (TP + FP + FN + TN)
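To make the table and the formula concrete, here is a small sketch that counts the four outcomes from two label lists and computes Accuracy from them. The function names and the five example instances are made up for illustration.

```python
def confusion_counts(actual, predicted, positive="Orange"):
    """Return (TP, FP, FN, TN), treating `positive` as the + class."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1  # predicted +, actually +
            else:
                fp += 1  # predicted +, actually -
        else:
            if a == positive:
                fn += 1  # predicted -, actually +
            else:
                tn += 1  # predicted -, actually -
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    """Accuracy = (TP + TN) / (TP + FP + FN + TN)."""
    return (tp + tn) / (tp + fp + fn + tn)

# Hypothetical validation labels and model predictions.
actual = ["Orange", "Orange", "Not Orange", "Not Orange", "Not Orange"]
predicted = ["Orange", "Not Orange", "Orange", "Not Orange", "Not Orange"]
tp, fp, fn, tn = confusion_counts(actual, predicted)
acc = accuracy(tp, fp, fn, tn)
print(tp, fp, fn, tn, acc)
```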

Accuracy is intuitive and easy to compute and interpret. But accuracy doesn't give you the whole picture, and it can be misleading when the validation dataset is unbalanced. For example, suppose our validation set contains 20% 'orange' labels and our model outputs 'orange' for a random 50% of the data instances (random guessing). It will catch about half of the oranges (TP = 10%, FN = 10%) and mislabel about half of the rest (FP = 40%, TN = 40%), so the accuracy computation will say 50%. A model that never outputs 'orange' at all would score 80% on the same data without detecting a single orange.

Accuracy = (10% + 40%) / (10% + 40% + 10% + 40%) = 50%
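A quick simulation of this kind of unbalanced validation set shows the effect. This is only a sketch: the set size of 10,000, the fixed seed and the extra "always negative" model are assumptions added for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 10_000
# 20% 'orange' and 80% 'not orange', as in the example above.
labels = ["orange"] * (n // 5) + ["not orange"] * (n - n // 5)

def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)

coin_flip = [rng.choice(["orange", "not orange"]) for _ in labels]  # random guessing
always_negative = ["not orange"] * n  # never predicts 'orange'

random_acc = accuracy(coin_flip, labels)
negative_acc = accuracy(always_negative, labels)
print(f"random guessing: {random_acc:.1%}, always 'not orange': {negative_acc:.1%}")
```

The second model is useless for finding oranges, yet it scores far higher on accuracy than the coin flip.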

Error is the complement of Accuracy, so we can express it as:

Error = 1 - Accuracy

The purpose of these metrics is to summarise the performance. Ideally we would get the full picture from as few metrics as possible, but no single metric gives you the full picture. Therefore we define various other metrics to capture different factors that affect our model's performance. Precision and Recall are two such complementary metrics.

Precision is defined as the proportion of + predictions that are correct.

Precision = TP / (TP + FP)

Recall is defined as the proportion of actual + instances that are correctly predicted as +.

Recall = TP / (TP + FN)
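Both formulas translate directly into code. In the sketch below, returning 0.0 when the denominator is zero is a convention assumed for this example, and the counts are hypothetical.

```python
def precision(tp, fp):
    """TP / (TP + FP); 0.0 if nothing was predicted + (assumed convention)."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """TP / (TP + FN); 0.0 if there are no actual + instances (assumed convention)."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical counts: 40 predicted +, of which 30 correct; 50 actual + in total.
tp, fp, fn = 30, 10, 20
p = precision(tp, fp)  # 30 / 40
r = recall(tp, fn)     # 30 / 50
print(p, r)
```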

Some examples:

Let's consider an unbalanced dataset that contains 6 + labels out of 1000.

1. Case where all 1000 instances are predicted to be +

        P+     P-
A+       6      0   =   6 actually +
A-     994      0   = 994 actually -
      1000      0
         +      -

Accuracy  = (TP+TN) / (TP+FN+FP+TN)
          = (6+0) / (1000)
          = 0.6%
Precision = TP / (TP+FP)
          = 6 / (6+994)
          = 0.6%
Recall    = TP / (TP+FN)
          = 6 / (6+0)
          = 100.0%

2. Case where 8 instances are predicted to be + out of 1000

        P+     P-
A+       5      1   =   6 actually +
A-       3    991   = 994 actually -
         8    992
         +      -

Accuracy  = (TP+TN) / (TP+FN+FP+TN)
          = (5+991) / (1000)
          = 99.6%
Precision = TP / (TP+FP)
          = 5 / (5+3)
          = 62.5%
Recall    = TP / (TP+FN)
          = 5 / (5+1) = 5/6
          = 83.3%
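The arithmetic for both cases can be checked by plugging the four counts from each table into the definitions of Accuracy, Precision and Recall. The helper `metrics` is a name invented for this sketch.

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, Precision and Recall computed from the four confusion counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Counts read off the two tables above.
case_1 = metrics(tp=6, fp=994, fn=0, tn=0)  # everything predicted +
case_2 = metrics(tp=5, fp=3, fn=1, tn=991)  # 8 predicted +
for name, case in (("case 1", case_1), ("case 2", case_2)):
    print(name, {key: f"{value:.1%}" for key, value in case.items()})
```

Case 1 has perfect Recall but almost no Precision, while case 2 trades a little Recall for far better Precision and Accuracy.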

Precision measures how accurate the classifier is when it predicts the positive class, and Recall measures how much of the actual positive class it manages to cover.

Recall makes much more sense when described from an information retrieval standpoint. For example, suppose I ask you to remember 5 male names and 5 female names and then, a week later, ask you to name the female names. Recall is the ability to correctly recall :) the names in question: out of the 5 female names, the fraction you named correctly is the Recall. In the same way, Precision measures how precisely you recalled the names, i.e. out of the k (<= 10) names you recalled, the fraction that were actually female names is the Precision.
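The name-recall analogy maps neatly onto sets. The names below are invented for illustration; only the counts matter.

```python
# The 5 female names you were asked to remember (hypothetical).
female_names = {"Alice", "Beth", "Cara", "Dina", "Eve"}
# The k = 4 names you actually came up with a week later (hypothetical).
recalled = {"Alice", "Beth", "Cara", "Frank"}

hits = female_names & recalled  # names that were both recalled and correct

recall_score = len(hits) / len(female_names)  # 3 of the 5 female names
precision_score = len(hits) / len(recalled)   # 3 of the 4 recalled names
print(recall_score, precision_score)
```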

F1 Score is a combination of Precision and Recall. This combination again tries to minimise the number of metrics that we have to look at to get the full picture. F1 Score is defined as the harmonic mean of Precision and Recall.

$$ F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $$
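The formula is the harmonic mean of the two values, so a low score on either one drags the result down. Here is a minimal sketch applied to the Precision and Recall of the two example cases from earlier; returning 0.0 when both inputs are zero is a convention assumed for this example.

```python
def f1_score(precision, recall):
    """Harmonic mean of Precision and Recall; 0.0 if both are zero (assumed convention)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and Recall from the two example cases.
f1_case_1 = f1_score(6 / 1000, 6 / 6)  # everything predicted +
f1_case_2 = f1_score(5 / 8, 5 / 6)     # 8 predicted +
print(f"case 1: {f1_case_1:.3f}, case 2: {f1_case_2:.3f}")
```

Even with 100% Recall, case 1 gets an F1 Score close to zero because its Precision is so poor.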