Applied Imbalanced Data Solutions

What Is Imbalanced Data?

Imbalanced data is data in which the classes are not represented equally: one class has many more instances than the other class (or classes).

A lot of real-world datasets and problems don't have an equal number of samples in each class. The most common example of an imbalanced problem is fraud detection, where the vast majority of the data is not fraudulent.

Imbalanced datasets can cause a lot of frustration. In this blog post I'll show you some of the ways I deal with them.

Which dataset are we going to use?

We will use the "Medical Appointment No Shows" dataset from Kaggle. This dataset includes about 110k medical appointments and 14 variables (characteristics) for each appointment. The goal is to predict whether or not the patient is going to show up.

This post's purpose is just to show how to handle imbalanced data, so I won't get into much data cleaning or feature engineering.

We'll load some modules, load the data, and initialize a first basic model.

In [17]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()
np.random.seed(64)  # set a random seed so the random parts below are reproducible
import warnings
warnings.filterwarnings('ignore')  # silence warnings; only done here to keep the post's output clean

From reading the dataset's description on its Kaggle page and from playing with the data earlier, I have some knowledge about the features' types, so I'll load them with the relevant dtypes.

In [2]:
data = pd.read_csv(r'noshowappointments.csv', parse_dates=['AppointmentDay', 'ScheduledDay'],
                   dtype={
                       'Scholarship': bool,
                       'Hipertension': bool,
                       'Diabetes': bool,
                       'Alcoholism': bool
                        },
                   index_col='AppointmentID'
)
In [3]:
data.head()
Out[3]:
PatientId Gender ScheduledDay AppointmentDay Age Neighbourhood Scholarship Hipertension Diabetes Alcoholism Handcap SMS_received No-show
AppointmentID
5642903 2.987250e+13 F 2016-04-29 18:38:08 2016-04-29 62 JARDIM DA PENHA False True False False 0 0 No
5642503 5.589978e+14 M 2016-04-29 16:08:27 2016-04-29 56 JARDIM DA PENHA False False False False 0 0 No
5642549 4.262962e+12 F 2016-04-29 16:19:04 2016-04-29 62 MATA DA PRAIA False False False False 0 0 No
5642828 8.679512e+11 F 2016-04-29 17:29:31 2016-04-29 8 PONTAL DE CAMBURI False False False False 0 0 No
5642494 8.841186e+12 F 2016-04-29 16:07:23 2016-04-29 56 JARDIM DA PENHA False True True False 0 0 No

Now that we've taken a look at the dataset, I would like to know how the class labels are distributed:

In [20]:
sns.countplot(data['No-show']);

This indeed looks like imbalanced data: there are almost 4 times more 'No' instances than 'Yes' ones.

In [10]:
data['IsMale'] = data['Gender'] == 'M'
In [11]:
drop_columns = ['ScheduledDay', 'AppointmentDay', 'PatientId', 'Gender', 'Neighbourhood']
data.drop(drop_columns, inplace=True, axis=1)

As I mentioned before, I'll first initialize a very simple model for demonstration purposes only. So for now I'll drop the date and categorical features, which would have required some feature extraction and engineering work.

Let's see our data now:

In [12]:
data.head()
Out[12]:
Age Scholarship Hipertension Diabetes Alcoholism Handcap SMS_received No-show IsMale
AppointmentID
5642903 62 False True False False 0 0 No False
5642503 56 False False False False 0 0 No True
5642549 62 False False False False 0 0 No False
5642828 8 False False False False 0 0 No False
5642494 56 False True True False 0 0 No False

Modelling

Load more modules

In [13]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
In [14]:
X = data.drop('No-show', axis=1)
y = data['No-show']
x_train, x_test, y_train, y_test = train_test_split(X, y, stratify=y)

Initialize a first simple model

In [18]:
model = LogisticRegression()
model.fit(x_train, y_train)
pred = model.predict(x_test)
print('score on training set:', model.score(x_train, y_train))
print('score on test set:', model.score(x_test, y_test))
score on training set: 0.7980698473973098
score on test set: 0.7980602200347423

At first glance this looks like a very good score for such a simple model and dataset. Let's look at the classification report to see the scores for each class.

In [19]:
print(metrics.classification_report(y_true=y_test, y_pred=pred))
              precision    recall  f1-score   support

          No       0.80      1.00      0.89     22052
         Yes       0.00      0.00      0.00      5580

   micro avg       0.80      0.80      0.80     27632
   macro avg       0.40      0.50      0.44     27632
weighted avg       0.64      0.80      0.71     27632

After this closer look you can see that the model is actually very bad. Although we got almost 80% accuracy, we actually have a model that isn't smart (or good) at all: it always predicts a show-up ('No' means the person did show up to the appointment).

The test data contains 27,632 samples; 5,580 (~20%) of them are 'Yes' and the rest (~80%) are 'No'. The model predicts 'No' every time, which is why the 'No' class has 100% recall but only 80% precision.

This model won't achieve our goal of predicting whether or not the patient is going to show up: it will always say the patient is going to show, no matter what the data is.

This is a great example of why the simple accuracy score is not always a good evaluation metric. In these types of problems, accuracy won't give you a good evaluation of your model.
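To make the point concrete, here's a quick sketch of a baseline that always predicts the majority class: scikit-learn's DummyClassifier with strategy='most_frequent' reaches roughly the same ~80% accuracy while being useless for the minority class.

from sklearn.dummy import DummyClassifier

# A baseline that ignores the features and always predicts the most frequent class ('No').
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(x_train, y_train)

print('baseline accuracy on test set:', baseline.score(x_test, y_test))
# Macro-averaged recall exposes the problem: a majority-class predictor only gets ~0.5.
print('baseline macro recall:', metrics.recall_score(y_test, baseline.predict(x_test), average='macro'))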

There are some possible ways to deal with this imbalanced problem. The two main ways are:

1) Resample your data.
2) Give weights to samples/classes in your data.

1) Resample your data

Resampling your data basically means over-sampling or under-sampling your training data. Over-sampling is the process of adding or duplicating data points of one (or more) classes in order to balance the class distribution of a dataset. Under-sampling is the process of removing data points of one (or more) classes in order to balance the class distribution of a dataset. You can read more about these techniques online. I'll show you the basic ways of oversampling and undersampling, as well as a synthetic way of doing so.

How Do I Choose Between Over- and Under-Sampling?

Oversampling's plus is that it uses all the variety in the data and doesn't lose any samples. Its downside is that it increases the number of samples, which can make some algorithms slower.

Undersampling's plus is that it decreases the number of samples, which can make algorithms run faster. Its downside is that it loses some of the data's variety, which in some cases can hurt the results.

In my opinion, the best way is often to do both over- and under-sampling at the same time. By this I mean undersample the majority class a bit (but without going all the way down to the minority class size), and oversample the minority class a bit (but not too much either).
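As a rough sketch of what I mean, using the same pandas approach as the manual resampling below (the 2:1 ratio and the single duplication of the minority class are arbitrary choices, just for illustration):

# Hypothetical combined resampling; the exact ratios are arbitrary.
y_no_all = y_train[y_train == 'No']
y_yes_all = y_train[y_train == 'Yes']

# Undersample the majority class, but keep it larger than the minority class.
y_no_down = y_no_all.sample(2 * y_yes_all.shape[0])
x_no_down = x_train.loc[y_no_down.index]

# Oversample the minority class a bit (duplicate it once).
x_yes_all = x_train.loc[y_yes_all.index]

combined_X = pd.concat([x_no_down, x_yes_all, x_yes_all])
combined_y = pd.concat([y_no_down, y_yes_all, y_yes_all])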

Oversample

We oversample only the training data and predict on the untouched, imbalanced test data. We basically duplicate some of the minority-class data - simple as that.

In [21]:
y_yes = y_train[y_train == 'Yes']
x_yes = x_train.loc[y_yes.index]

y_no = y_train[y_train == 'No']
x_no = x_train.loc[y_no.index]
In [22]:
oversample_X = pd.concat([x_no, x_yes, x_yes, x_yes, x_yes])
oversample_y =  pd.concat([y_no, y_yes, y_yes, y_yes, y_yes])
In [24]:
model = LogisticRegression()
model.fit(oversample_X, oversample_y)
pred = model.predict(x_test)
print('score on test set:', model.score(x_test, y_test))
print(metrics.classification_report(y_true=y_test, y_pred=pred))
score on test set: 0.6321656050955414
              precision    recall  f1-score   support

          No       0.83      0.67      0.74     22052
         Yes       0.27      0.47      0.34      5580

   micro avg       0.63      0.63      0.63     27632
   macro avg       0.55      0.57      0.54     27632
weighted avg       0.72      0.63      0.66     27632

This is way better. We have lower recall on the 'No' class, but much better scores on the 'Yes' predictions overall.

Undersample

We undersample only the training data and predict on the imbalanced test data. We basically drop some of the majority-class data - again, simple as that.

In [25]:
y_yes = y_train[y_train == 'Yes']
x_yes = x_train.loc[y_yes.index]
In [26]:
y_no = y_train[y_train == 'No']
undersample_y_no = y_no.sample(y_yes.shape[0])

undersample_x_no = x_train.loc[undersample_y_no.index]
In [27]:
undersample_y = pd.concat([undersample_y_no, y_yes])
undersample_X = pd.concat([undersample_x_no, x_yes])
In [28]:
model = LogisticRegression()
model.fit(undersample_X, undersample_y)
pred = model.predict(x_test)
print('score on test set:', model.score(x_test, y_test))
print(metrics.classification_report(y_true=y_test, y_pred=pred))
score on test set: 0.6425883034163289
              precision    recall  f1-score   support

          No       0.83      0.69      0.75     22052
         Yes       0.27      0.46      0.34      5580

   micro avg       0.64      0.64      0.64     27632
   macro avg       0.55      0.57      0.55     27632
weighted avg       0.72      0.64      0.67     27632

In this case, oversampling and undersampling gave almost the same results.

Synthetic Minority Over-sampling Technique (SMOTE)

SMOTE basically generates new, synthetic samples for the minority class. The algorithm is based on the k-nearest-neighbors technique: it creates additional data points along the line segments connecting existing minority samples to their nearest minority-class neighbors.
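To illustrate the core idea, here is a simplified sketch of the interpolation step (not imblearn's actual implementation, and the feature values are made up):

# Simplified sketch of how SMOTE builds one synthetic point.
x_i  = np.array([62., 0., 1., 0.])   # a minority-class sample (made-up feature values)
x_nn = np.array([56., 0., 1., 1.])   # one of its k nearest minority-class neighbors

lam = np.random.rand()               # random position along the segment, in [0, 1)
x_synthetic = x_i + lam * (x_nn - x_i)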

We will use the SMOTE algorithm from the imbalanced-learn (imblearn) module. This module has a lot of interesting methods; we will start with the "SMOTE" method under "over_sampling".

In [29]:
from imblearn.over_sampling import SMOTE

x_train_smote, y_train_smote = SMOTE(ratio='auto', k_neighbors=5, m_neighbors=10,
      out_step=0.5, kind='regular', svm_estimator=None, n_jobs=-1).fit_sample(x_train, y_train)

We used the default parameters of the "SMOTE" method. Of course, changing these parameters can yield better results.

Let's see how the "SMOTE" method changed the distribution of our train data:

In [30]:
from collections import Counter

print('The original class distribution: {},'.format(Counter(y_train)))
print('After SMOTE class distribution:  {}'.format(Counter(y_train_smote)))
The original class distribution: Counter({'No': 66156, 'Yes': 16739}),
After SMOTE class distribution:  Counter({'No': 66156, 'Yes': 66156})
In [31]:
model = LogisticRegression()
model.fit(x_train_smote, y_train_smote)
pred = model.predict(x_test)
print('score on test set:', model.score(x_test, y_test))
print(metrics.classification_report(y_true=y_test, y_pred=pred))
score on test set: 0.6371598147075854
              precision    recall  f1-score   support

          No       0.83      0.68      0.75     22052
         Yes       0.27      0.46      0.34      5580

   micro avg       0.64      0.64      0.64     27632
   macro avg       0.55      0.57      0.54     27632
weighted avg       0.72      0.64      0.67     27632

The default parameters of "SMOTE" gave roughly the same results as the simple oversampling in this case. I'd recommend tuning the "SMOTE" parameters and maybe trying other methods from the imblearn module.
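For example, here is a sketch of a couple of other imblearn options one could try (depending on your imblearn version the resampling method may be called fit_sample or fit_resample, and some parameter names differ between versions):

from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTETomek

# Random undersampling of the majority class via imblearn instead of pandas.
x_rus, y_rus = RandomUnderSampler(random_state=64).fit_sample(x_train, y_train)

# SMOTE oversampling followed by Tomek-link cleaning of borderline samples.
x_smote_tomek, y_smote_tomek = SMOTETomek(random_state=64).fit_sample(x_train, y_train)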

2) Give weights to samples/classes in your data.

Sample & Class Weight

Sample-weights are an array of weights assigned to individual samples. This array tells the classifier which samples should have more influence on the fit.

We give the same weight to every sample from the same class, so in our case it's effectively a class-weight. A higher class-weight/sample-weight means you want to put more emphasis on that class/sample.

Sample-weights and class-weights work by scaling each sample's/class's contribution to the loss function the model optimizes.
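As a rough illustration of what this scaling means (a schematic sketch, not scikit-learn's actual solver code), each sample's loss term is simply multiplied by its weight:

# Schematic weighted log loss: w_i scales sample i's contribution to the total loss.
def weighted_log_loss(y_true, p_pred, weights):
    # y_true: 0/1 labels, p_pred: predicted probabilities of the positive class, weights: sample weights
    eps = 1e-15
    p = np.clip(p_pred, eps, 1 - eps)
    per_sample = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return np.sum(weights * per_sample) / np.sum(weights)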

In [32]:
from sklearn.utils import class_weight
train_weights = class_weight.compute_sample_weight('balanced', y=y_train)

You can compute sample-weights using Sklearn's "utils" sub-module. The function above computes the same sample-weight for every instance of a class, so it is effectively a class-weight. The "balanced" mode uses the values of y to automatically adjust weights inversely proportional to the class frequencies in the input data.
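Concretely, in the "balanced" mode a sample of class c gets the weight n_samples / (n_classes * count(c)). Here is a small sketch that recomputes the weights manually and checks them against compute_sample_weight:

# Manual 'balanced' weights: n_samples / (n_classes * class_count), checked against sklearn.
class_counts = y_train.value_counts()
manual_weights = y_train.map(lambda c: len(y_train) / (len(class_counts) * class_counts[c]))

print(np.allclose(manual_weights.values, train_weights))  # should print True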

The sample-weights array is passed to the 'fit' method.

In [33]:
model = LogisticRegression()
model.fit(x_train, y_train, sample_weight=train_weights)
pred = model.predict(x_test)
print('score on test set:', model.score(x_test, y_test))
print(metrics.classification_report(y_true=y_test, y_pred=pred))
score on test set: 0.640851187029531
              precision    recall  f1-score   support

          No       0.83      0.69      0.75     22052
         Yes       0.27      0.46      0.34      5580

   micro avg       0.64      0.64      0.64     27632
   macro avg       0.55      0.57      0.55     27632
weighted avg       0.72      0.64      0.67     27632

In some of Sklearn's estimators you can pass a "class_weight" parameter instead of computing the weights yourself, as I just showed. Let's see an example of that:

In [34]:
model = LogisticRegression(class_weight='balanced')
model.fit(x_train, y_train)
pred = model.predict(x_test)
print('score on test set:', model.score(x_test, y_test))
print(metrics.classification_report(y_true=y_test, y_pred=pred))
score on test set: 0.640851187029531
              precision    recall  f1-score   support

          No       0.83      0.69      0.75     22052
         Yes       0.27      0.46      0.34      5580

   micro avg       0.64      0.64      0.64     27632
   macro avg       0.55      0.57      0.55     27632
weighted avg       0.72      0.64      0.67     27632

We get the same scores for sample-weights and class-weights. In this case the scores are very similar to the manual over/under sampling as well.

Conclusion

As far as I know, there isn't one "right" way to deal with imbalanced data. I have shown you some ways to deal with the problem, but I'm sure there are a lot more out there - just search for them.

When I encounter an imbalanced data problem, I try some of these methods and see what works best. Sometimes I even combine methods, like over- and under-sampling together.

Hope this helps