Binary Imbalanced Learning: A Practical Approach in R

Introduction and motivation

Binary classification is arguably one of the simplest and most straightforward problems in machine learning: we want to learn a model that predicts whether an instance belongs to a given class or not. It has many practical applications, ranging from email spam detection to medical testing (determining whether a patient has a certain disease).

Slightly more formally, the goal of binary classification is to learn a function f(x) that maps x (a vector of features describing an instance/example) to a predicted binary outcome ŷ (0 or 1). Many classification algorithms, such as logistic regression, naive Bayes and decision trees, actually output the probability that an instance belongs to the positive class, Pr(y = 1 | x), as in the sketch below.
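To make this concrete, here is a minimal sketch in base R on simulated data (all variable names here are illustrative): logistic regression via glm estimates Pr(y = 1 | x), which is then thresholded at 0.5 to obtain a 0/1 prediction.

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(1.5 * x1 - x2))  # labels simulated from a known model

# Logistic regression: models Pr(y = 1 | x)
fit   <- glm(y ~ x1 + x2, family = binomial)
p_hat <- predict(fit, type = "response")   # predicted probabilities
y_hat <- as.integer(p_hat > 0.5)           # binary prediction at threshold 0.5
table(observed = y, predicted = y_hat)
```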

Class imbalance means that the classes are not represented equally in a classification problem, which is quite common in practice: think of fraud detection, prediction of rare adverse drug reactions, or prediction of gene families (e.g. kinases, GPCRs). Failing to account for class imbalance often leads to inaccurate models and decreased predictive performance for many classification algorithms. In this post, I will introduce a couple of practical tips on how to combat class imbalance in binary classification, most of which can easily be adapted to multi-class scenarios.

The ROSE package provides functions to deal with binary classification problems in the presence of imbalanced classes. Artificial balanced samples are generated according to a smoothed bootstrap approach and aid both the estimation and the accuracy evaluation of a binary classifier in the presence of a rare class. The package also provides functions that implement more traditional remedies for class imbalance, as well as different metrics to evaluate accuracy, estimated by holdout, bootstrap or cross-validation methods.

It is worth mentioning package DMwR (Torgo, 2010), which provides a specific function (smote) to aid the estimation of a classifier in the presence of class imbalance, in addition to extensive tools for data mining problems (among others, functions to compute evaluation metrics as well as different accuracy estimators). In addition, package caret (Kuhn, 2014) contains general functions to train and validate regression and classification models, and specifically addresses the issue of class imbalance with some naive functions (downSample and upSample). These limitations motivate the ROSE package (Lunardon et al., 2013), which is intended to provide both standard and more refined tools to enhance the task of binary classification in an imbalanced setting. The package is designed around ROSE (Random Over-Sampling Examples), a smoothed bootstrap-based technique recently proposed by Menardi and Torelli (2014). ROSE helps to relieve the effects of an imbalanced class distribution by aiding both model estimation and model assessment.
Most of the current research on imbalanced classification focuses on improving the model estimation step. The most common remedy involves altering the class distribution to obtain a more balanced sample. Remedies based on balancing the class distribution include various data resampling techniques, such as random oversampling (with replacement) of the rare class and random undersampling (without replacement) of the prevalent class. Under the same hat we can also include methods designed to generate new artificial examples that are 'similar', in a certain sense, to the rare observations. Generating new artificial data that have not been previously observed reduces the risk of overfitting and improves the ability to generalize, which is compromised by oversampling methods that are bound to produce ties in the sample. As will be clarified subsequently, the ROSE technique can rightly be considered as following this route.

Choice of accuracy metrics

In the accuracy evaluation step, the first problem one faces is the choice of the accuracy metric, since standard measures such as overall accuracy may yield misleading results. The evaluation measure has to be addressed in terms of class-independent quantities, such as precision, recall or the F measure. To compute these measures in practice, one has to set a suitable threshold on the probability of belonging to the positive class, above which an example is predicted to be positive. In standard classification problems this threshold is usually set to 0.5, but the same choice is not obvious in imbalanced learning, where the estimated probabilities of the rare class tend to be so low that few or no examples would be labeled as positive. Moreover, moving the threshold to smaller values amounts to assuming a higher misclassification cost for the rare class, which is usually the case. To avoid an arbitrary choice of threshold altogether, a ROC curve can be adopted to measure accuracy, since it plots the true positive rate against the false positive rate as the classification threshold varies.
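As a concrete sketch, the ROSE package itself ships two helpers for exactly this: accuracy.meas (precision, recall and F at a chosen threshold) and roc.curve (ROC curve and its AUC). Here obs and prob are placeholder names for the vector of true classes and the predicted positive-class probabilities from whatever model you fit:

```r
library(ROSE)

# Precision, recall and F at an explicitly chosen threshold
accuracy.meas(response = obs, predicted = prob, threshold = 0.5)

# Threshold-free assessment: plot the ROC curve and report the AUC
roc.curve(response = obs, predicted = prob, plotit = TRUE)
```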

Apart from the choice of an adequate performance metric, a more serious problem in imbalanced learning concerns the estimation method for the selected accuracy measure. Standard practices are the resubstitution method, where the available data are used both to train and to assess the classifier, or, more frequently, the holdout method, which estimates the classifier on a training sample and assesses its accuracy on a test sample. In the presence of class imbalance there are often not enough examples from the rare class for both training and testing, and the scarcity of data leads to accuracy estimates with high variance, which are therefore regarded as unreliable. The resubstitution method, on the other hand, is known to give overoptimistic evaluations of learner accuracy. Alternative estimators of the accuracy measure therefore have to be considered, as discussed below.

The ROSE strategy to deal with class imbalance

ROSE builds on the generation of new artificial examples from the two classes according to a smoothed bootstrap approach.

Code illustrating the ROSE method, with explanation, is enclosed below.
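The following sketch mirrors the example in the package documentation, using the small imbalanced hacide data set shipped with ROSE (binary class cls, features x1 and x2); it also fits the classification tree referred to later as tree.rose:

```r
library(ROSE)
library(rpart)

data(hacide)                # loads hacide.train and hacide.test
table(hacide.train$cls)     # severe imbalance: many 0s, few 1s

# Generate a synthetic, roughly balanced training set by smoothed bootstrap
data.rose <- ROSE(cls ~ ., data = hacide.train, seed = 1)$data
table(data.rose$cls)

# Fit a classification tree on the ROSE sample and evaluate on the test set
tree.rose <- rpart(cls ~ ., data = data.rose)
pred.rose <- predict(tree.rose, newdata = hacide.test)
roc.curve(hacide.test$cls, pred.rose[, 2])   # column 2: Pr(cls = 1)
```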

Consider a training set T_n of size n, whose generic row is the pair (x_i, y_i), i = 1, ..., n. The class labels y_i belong to the set {Y_0, Y_1}, and the x_i are related attributes, supposed to be realizations of a random vector x defined on R^d with an unknown probability density function f(x). Let n_j < n denote the number of units in class Y_j, j = 0, 1. The ROSE procedure for generating one new artificial example consists of the following steps:

1. Select y* = Y_j with probability π_j.

2. Select (x_i, y_i) ∈ T_n, such that y_i = y*, with probability 1/n_j.

3. Sample x* from K_{H_j}(·, x_i), with K_{H_j} a probability distribution centered at x_i and with covariance matrix H_j.

Essentially, we draw from the training set an observation belonging to one of the two classes and generate a new example (x*, y*) in its neighborhood, where the shape of the neighborhood is determined by the shape of the contour sets of K, and its width is governed by H_j. It can be shown that, given the selection of the class label Y_j, the generation of new examples from Y_j according to ROSE corresponds to the generation of data from the kernel density estimate of f(x|Y_j), with kernel K and smoothing matrix H_j (Menardi and Torelli, 2014). The choices of K and H_j may then be addressed by the large specialized literature on kernel density estimation (see, e.g., Bowman and Azzalini, 1997). It is worth noting that, for H_j → 0, ROSE collapses to a standard combination of over- and undersampling.

Repeating steps 1 to 3 m times produces a new synthetic training set T*_m of size m, where the imbalance level is defined by the probabilities π_j (if π_j = 1/2, approximately the same number of examples belong to the two classes). The size m may be set to the original training set size n or chosen in any other way.
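For intuition, here is a minimal, self-contained sketch of steps 1 to 3. This is not the package internals: the real ROSE selects H_j as asymptotically optimal under normality, while this toy version uses a crude diagonal bandwidth built from within-class standard deviations, and it assumes all features are numeric; rose_sketch and its arguments are hypothetical names.

```r
rose_sketch <- function(train, label_col, m, pi1 = 0.5, h = 1) {
  X <- as.matrix(train[, setdiff(names(train), label_col), drop = FALSE])
  y <- as.character(train[[label_col]])
  classes <- sort(unique(y))                               # {Y0, Y1}
  new_X <- matrix(NA_real_, nrow = m, ncol = ncol(X))
  colnames(new_X) <- colnames(X)
  new_y <- character(m)
  for (k in seq_len(m)) {
    y_star <- sample(classes, 1, prob = c(1 - pi1, pi1))   # step 1: draw a class
    idx    <- which(y == y_star)
    i      <- idx[sample.int(length(idx), 1)]              # step 2: draw a seed point
    sds    <- apply(X[idx, , drop = FALSE], 2, sd)         # crude diagonal H_j
    new_X[k, ] <- rnorm(ncol(X), mean = X[i, ], sd = h * sds)  # step 3: perturb
    new_y[k]   <- y_star
  }
  data.frame(new_X, class = factor(new_y))
}

# e.g. synth <- rose_sketch(hacide.train, "cls", m = 1000)
```

Note how h → 0 turns the Gaussian perturbation into an exact copy of the seed point, recovering plain over/undersampling, exactly as stated above for H_j → 0.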

With the standard approach (no rebalancing), we obtained an area under the ROC curve of about 60% in our example.

The first-aid set of remedies provided by the package involves the creation of a new artificial data set by suitably resampling the observations belonging to the two classes. Function ovun.sample embeds some consolidated resampling techniques to perform this task and considers different sampling schemes. It is endowed with the argument method, which takes one value among c("over", "under", "both"); the snippet after the three options below shows each of them in use.

Option "over" performs simple oversampling with replacement from the minority class until either the specified sample size N is reached or the positive examples occur with probability p. Thus, when method = "over", an augmented sample is returned.

Option "under" performs simple undersampling without replacement of the majority class until either the specified sample size N is reached or the positive examples occur with probability p. Hence, when method = "under", a sample of reduced size is returned.

When method = "both" is selected, the minority class is oversampled with replacement and, at the same time, the majority class is undersampled without replacement. In this case, both arguments N and p have to be set to establish the amount of oversampling and undersampling. Essentially, the minority class is oversampled to reach a size determined as a realization of a binomial random variable with size N and probability p, while the majority class is undersampled to reach the total size N.
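A sketch of the three schemes, again assuming the bundled hacide data from the earlier snippet (hacide.train contains 980 majority and 20 minority examples, so N = 40 under "under" keeps all 20 positives plus 20 negatives):

```r
library(ROSE)

# "over": replicate minority examples until the positive share reaches p
data.over  <- ovun.sample(cls ~ ., data = hacide.train,
                          method = "over", p = 0.5, seed = 1)$data

# "under": drop majority examples until the total size N is reached
data.under <- ovun.sample(cls ~ ., data = hacide.train,
                          method = "under", N = 40, seed = 1)$data

# "both": oversample minority and undersample majority; set both N and p
data.both  <- ovun.sample(cls ~ ., data = hacide.train,
                          method = "both", N = 1000, p = 0.5, seed = 1)$data

table(data.over$cls); table(data.under$cls); table(data.both$cls)
```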

Unlike the simple balancing mechanisms provided by ovun.sample, ROSE generation does not produce ties, and it gives the learner the option of enlarging the neighborhoods of the original feature space when generating new observations. The widths of such neighborhoods, governed by the matrices H_0 and H_1, are primarily selected as asymptotically optimal under the assumption that the true conditional densities underlying the data follow a normal distribution (see Menardi and Torelli, 2014, for further details). However, H_0 and H_1 may be scaled by the arguments hmult.majo and hmult.mino, respectively, whose default values are set to 1. Smaller (larger) values of these arguments have the effect of shrinking (inflating) the entries of the corresponding smoothing matrix H_j. Shrinking is a cautious choice when there is a concern that excessively large neighborhoods could blur the boundaries between the regions of the feature space associated with each class.
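For instance (the multipliers below are merely illustrative), shrinking both smoothing matrices keeps the synthetic points closer to the observed ones:

```r
# Scale down H0 (majority) and H1 (minority) relative to their defaults of 1
data.rose.h <- ROSE(cls ~ ., data = hacide.train,
                    hmult.majo = 0.25, hmult.mino = 0.5, seed = 1)$data
```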

Adopting alternative estimators of accuracy

In real data applications, we often cannot benefit from additional data for testing the accuracy of the estimated model (and if we can, we will probably use the additional information to train a more accurate model). Moreover, in imbalanced learning, the scarcity of data causes high-variance estimates of the accuracy measure. It is then often appropriate to adopt alternative methods to estimate model accuracy in place of the standard holdout. Function ROSE.eval comes to this aid by implementing a ROSE version of the holdout, bootstrap or leave-K-out cross-validation estimators of the accuracy of a specified classifier, as measured by a selected metric. Suppose that a test set to evaluate the accuracy of the classifier previously denoted as tree.rose is not available.
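A sketch of the ROSE-based holdout with ROSE.eval, reusing rpart as the learner and the hacide data assumed above; extr.pred tells ROSE.eval how to extract positive-class probabilities from the matrix that predict.rpart returns:

```r
library(ROSE)
library(rpart)

# Train on a ROSE sample, assess (by default, the AUC) on the observed data
ROSE.eval(cls ~ ., data = hacide.train, learner = rpart,
          method.assess = "holdout",
          extr.pred = function(obj) obj[, 2], seed = 1)
```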

As an alternative to the ROSE-based holdout method, we may wish to obtain a ROSE version of the bootstrap accuracy distribution. By selecting method.assess = "BOOT", we fit the specified learner on B ROSE samples and test each of them on the observed data specified in the formula. The optional argument trace shows the progress of the model assessment by printing the number of completed iterations. The bootstrap distribution of the selected accuracy measure is then returned as the output of ROSE.eval.
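A sketch of the bootstrap variant under the same assumptions as above; the acc component of the returned object holds the B accuracy estimates:

```r
boot.auc <- ROSE.eval(cls ~ ., data = hacide.train, learner = rpart,
                      method.assess = "BOOT", B = 100,
                      extr.pred = function(obj) obj[, 2],
                      trace = TRUE, seed = 1)
summary(boot.auc$acc)   # bootstrap distribution of the AUC
```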
