
Machine Learning using Class-Imbalanced Data

Contents

  1. Introduction
  2. Examples: Imbalanced Problems in Space Debris
  3. Handling Imbalanced Learning
    1. Evaluation Metrics
    2. Penalising Misclassifications
    3. Balancing Datasets: Undersampling and Oversampling
    4. Stratification
  4. References

1. Introduction

Imbalanced datasets are characterised by a rare (or positive) class that represents only a small portion of the entire population (1 in 1,000, 1 in 10,000, or even fewer). Class imbalance can be intrinsic to the problem, which is naturally imbalanced, or it can result from limitations in data collection, for example for economic or privacy reasons.

The minority (or positive) class is scarce, and so are examples of its characteristic patterns, yet this is exactly the information a trained model needs to discriminate the rare samples from the crowd. Standard classification algorithms that do not take class distribution into account are overwhelmed by the majority class and tend to ignore and misclassify the minority one: there simply aren’t enough examples from which to learn the patterns and properties of the rare class.

In these cases, it is often the minority class that is of the most interest (see Examples: Imbalanced Problems in Space Debris), and therefore, to prevent misclassification, the learning process or dataset itself should be adapted (see Handling Imbalanced Learning).

2. Examples: Imbalanced Problems in Space Debris

There are numerous cases of imbalanced data in the field of space debris in which the minority class is most important:

3. Handling Imbalanced Learning

The problem of class imbalance affects the quality and reliability of results in a machine learning task and should therefore be handled with specific techniques and quality measures. The approaches highlighted here are explained in more detail in [1].

3.1 Evaluation Metrics

Accuracy can be a misleading metric for imbalanced datasets because it treats all errors as equally important. Other metrics are therefore required to assess how good a particular model actually is, such as precision, recall, and the F-beta score.

Recall and precision measures take into account the number of true positives (i.e. the number of items correctly labelled as belonging to the positive class), false positives (i.e. items which were labelled as belonging to the positive class but should not have been), and false negatives (i.e. items which were not labelled as belonging to the positive class but should have been). These elements are often visualised using a confusion matrix.

Normally in class-imbalanced problems we are interested in reducing the number of false negatives, i.e. cases where the model says “negative” but the truth is “positive”, because we don’t want to miss the “rare” events, which are encoded as the positive class. Recall and F2 are good examples of metrics that focus on the false negatives (the F2 score weights recall higher than precision, whereas the F1 score weights them evenly). A “good” model here needs simultaneously high accuracy and high recall and F2 scores, ensuring that it is a good model for the positive class and not merely a good one for the negative.

\[F_{\beta} = (1 + \beta^{2}) \frac{precision \times recall}{(\beta^{2} \times precision) + recall}.\]

Which one is better? There is no universally better metric. It depends on many factors, such as the goal, the context, and the cost function: is it better to correctly classify one more unit of the rare class but, at the same time, increase the false positive errors (e.g. classify a legitimate email as spam), or to misclassify some units of the rare class but decrease the false positive errors?
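To make the trade-off concrete, the sketch below evaluates the F-beta formula above directly from confusion-matrix counts (the counts used here are made-up example numbers):

```python
def fbeta(tp, fp, fn, beta=1.0):
    """F-beta score computed from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

# A model that finds 8 of 12 rare events and raises 2 false alarms:
tp, fp, fn = 8, 2, 4
f1 = fbeta(tp, fp, fn, beta=1.0)  # weights precision and recall evenly
f2 = fbeta(tp, fp, fn, beta=2.0)  # weights recall higher
```

Here recall (2/3) is lower than precision (0.8), so F2 comes out lower than F1: the F2 score punishes the missed rare events more heavily.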

3.2. Penalising Misclassifications

One approach to overcome the problem of class-imbalanced data is to adjust the loss function to penalise misclassification of the minority class more than the majority. This can be achieved by adding “weights” to the classification loss function (e.g. cross entropy). Typically, the weights are adjusted as $w_j = 1/n_j$, where $n_j$ is the population in class j. An example using PyTorch can be seen below:

import torch
from torch.nn import CrossEntropyLoss

# counts of training examples per class, e.g. 900 majority vs. 100 minority
class_size = torch.tensor([900.0, 100.0])
class_weights = 1.0 / class_size  # w_j = 1 / n_j
loss_func = CrossEntropyLoss(weight=class_weights)
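The per-class weights can also be derived directly from the training labels. A minimal NumPy sketch, with illustrative label values of our own choosing:

```python
import numpy as np

def inverse_frequency_weights(y):
    """Return w_j = 1 / n_j for every class j appearing in y."""
    classes, counts = np.unique(y, return_counts=True)
    return classes, 1.0 / counts

# 8 majority examples (class 0) and 2 minority examples (class 1):
y_train = np.array([0] * 8 + [1] * 2)
classes, weights = inverse_frequency_weights(y_train)
# weights -> [0.125, 0.5]: a minority error costs 4x a majority error
```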

3.3. Balancing Datasets: Undersampling and Oversampling

A well-known approach to handling class-imbalance is to make our training dataset balanced. This is often done by undersampling (that is, removing instances from the majority class) or oversampling (that is, providing more instances to the minority class by replication).

Undersampling results in using only as many negative (majority) instances as there are positive (minority) instances in the training phase, thus achieving a 1:1 balance ratio. This solution, however, comes at a cost: when undersampling, we discard a large portion of the data during training and therefore do not learn from the entire collection. To avoid wasting too much data, a very large dataset should be available overall.

When oversampling, we add replicates of the existing minority instances. This may cause a model to memorise the patterns and structures of minority events in the data instead of generalising from them, making this practice very prone to overfitting.
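To make the idea concrete, here is a minimal NumPy sketch of random oversampling: minority rows are duplicated at random until both classes are the same size (the function and variable names are our own):

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate random minority rows until the two classes are balanced."""
    rng = np.random.default_rng(seed)
    labels, counts = np.unique(y, return_counts=True)
    minority = labels[np.argmin(counts)]
    minority_idx = np.flatnonzero(y == minority)
    majority_idx = np.flatnonzero(y != minority)
    # sample minority rows with replacement to make up the difference
    extra = rng.choice(minority_idx,
                       size=len(majority_idx) - len(minority_idx),
                       replace=True)
    keep = np.concatenate([majority_idx, minority_idx, extra])
    return X[keep], y[keep]
```

Random undersampling is the mirror image: instead of adding minority replicates, a random subset of the majority rows of the same size as the minority is kept.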

A variety of out-of-the-box methods are available in the imbalanced-learn toolkit (in Python), such as:

from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler

The undersampling and oversampling processes should only be performed on the training partition: under- or oversampling the validation partition distorts reality and would not reflect the true model performance.

3.4. Stratification

When splitting the original dataset into training and validation (and testing) subsets, we should ensure that the proportion of classes between partitions is the same (rather than just using random sampling). This is especially necessary in class imbalanced problems to ensure that the subsets are representative of each other, and is known in machine learning as stratification.
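A minimal NumPy sketch of a stratified split (scikit-learn offers the same behaviour out of the box via the stratify parameter of train_test_split; the function name and fractions below are our own):

```python
import numpy as np

def stratified_split(y, valid_frac=0.2, seed=0):
    """Split indices so each partition keeps the original class proportions."""
    rng = np.random.default_rng(seed)
    train_idx, valid_idx = [], []
    for label in np.unique(y):
        # shuffle this class's indices, then carve off the validation share
        idx = rng.permutation(np.flatnonzero(y == label))
        n_valid = int(round(valid_frac * len(idx)))
        valid_idx.extend(idx[:n_valid])
        train_idx.extend(idx[n_valid:])
    return np.array(train_idx), np.array(valid_idx)

# 90 majority and 10 minority labels: the 10% minority share is preserved
y = np.array([0] * 90 + [1] * 10)
train_idx, valid_idx = stratified_split(y)
```

Because each class is split separately, both partitions end up with the same 9:1 ratio as the full dataset, whereas a purely random split of such a small minority could easily leave the validation set with no positive examples at all.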

4. References

[1]: A. Ahmadzadeh et al. (2019). Challenges with Extreme Class-Imbalance and Temporal Coherence: A Study on Solar Flare Data. arXiv:1911.09061