In supervised machine learning (ML), the goal is to build an accurate model that, based on previously labeled data, provides predictions for new data.

The number one question when it comes to modeling is: "How can I improve my results?"

There are several basic ways to improve your prediction model:

  1. Hyperparameters optimization
  2. Feature extraction
  3. Selecting another model
  4. Adding more data
  5. Feature selection

In this blog post, I'll walk you through how I used feature selection to improve my model. For the demonstration I'll use the 'Wine' dataset from the UCI ML repository.

Most of the functions are from the sklearn (scikit-learn) module.

For the plotting functions make sure to read about matplotlib and seaborn. Both are great plotting modules with great documentation.

Before we jump into the ML model and prediction we need to understand our data. The process of understanding the data is called EDA - exploratory data analysis.

EDA - Exploratory Data Analysis

UCI kindly gave us some basic information about the data set. I'll quote some of the more important info given: "These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines ... All attributes are continuous ... 1st attribute is class identifier (1-3)"

Based on this, it seems like a classification problem with 3 class labels and 13 numeric attributes, where the goal is to predict the specific cultivar each wine was derived from.

In [1]:
# Loading a few important modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set() #sets a style for the seaborn plots.

In [2]:
# Loading the data from its CSV file,
# and converting the 'label' column to a string so pandas won't infer it as a numeric value
data = pd.read_csv('wine_data_UCI.csv', dtype={'label':str})
data.head() # print the data's top five instances

Out[2]:

label Alcohol Malic_acid Ash Alcalinity_of_ash Magnesium Total_phenols Flavanoids Nonflavanoid_phenols Proanthocyanins Color_intensity Hue OD280/OD315_of_diluted_wines Proline
0 1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065
1 1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050
2 1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
3 1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480
4 1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735

I named the first column 'label'. This is the target attribute - what we are trying to predict. This is a classification problem, so the class label ('label') is not a numeric but a nominal value. That's why I'm telling pandas this column's dtype is 'str'.

In [3]:
data.info() # prints out basic information about the data.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 14 columns):
label                           178 non-null object
Alcohol                         178 non-null float64
Malic_acid                      178 non-null float64
Ash                             178 non-null float64
Alcalinity_of_ash               178 non-null float64
Magnesium                       178 non-null int64
Total_phenols                   178 non-null float64
Flavanoids                      178 non-null float64
Nonflavanoid_phenols            178 non-null float64
Proanthocyanins                 178 non-null float64
Color_intensity                 178 non-null float64
Hue                             178 non-null float64
OD280/OD315_of_diluted_wines    178 non-null float64
Proline                         178 non-null int64
dtypes: float64(11), int64(2), object(1)
memory usage: 19.5+ KB

As we can see, we have 178 entries (instances). As we know from UCI's description of the data, we have 13 numeric attributes and one 'object' type attribute (which is the target column). All the columns of all the rows have data, which is why we see "178 non-null" next to every column description.

In [4]:
print(data['label'].value_counts()) # prints how many times each value appears in the 'label' column.
sns.countplot(data['label']) # plots the counts printed above

2    71
1    59
3    48
Name: label, dtype: int64

Out[4]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f19b11a8b38>

It's important to check the number of instances in each class. There is a difference between the class labels, but it isn't a huge one. If the difference were bigger, we would have an imbalanced problem. That would require a lot of additional work, but that's for another post.
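
If you want exact proportions rather than raw counts, a quick check (my own addition, not part of the original notebook) could look like this:

# share of each class label - a rough way to judge how balanced the problem is
print(data['label'].value_counts(normalize=True))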

In [5]:
# This method prints us some summary statistics for each column in our data.
data.describe()

Out[5]:

Alcohol Malic_acid Ash Alcalinity_of_ash Magnesium Total_phenols Flavanoids Nonflavanoid_phenols Proanthocyanins Color_intensity Hue OD280/OD315_of_diluted_wines Proline
count 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000
mean 13.000618 2.336348 2.366517 19.494944 99.741573 2.295112 2.029270 0.361854 1.590899 5.058090 0.957449 2.611685 746.893258
std 0.811827 1.117146 0.274344 3.339564 14.282484 0.625851 0.998859 0.124453 0.572359 2.318286 0.228572 0.709990 314.907474
min 11.030000 0.740000 1.360000 10.600000 70.000000 0.980000 0.340000 0.130000 0.410000 1.280000 0.480000 1.270000 278.000000
25% 12.362500 1.602500 2.210000 17.200000 88.000000 1.742500 1.205000 0.270000 1.250000 3.220000 0.782500 1.937500 500.500000
50% 13.050000 1.865000 2.360000 19.500000 98.000000 2.355000 2.135000 0.340000 1.555000 4.690000 0.965000 2.780000 673.500000
75% 13.677500 3.082500 2.557500 21.500000 107.000000 2.800000 2.875000 0.437500 1.950000 6.200000 1.120000 3.170000 985.000000
max 14.830000 5.800000 3.230000 30.000000 162.000000 3.880000 5.080000 0.660000 3.580000 13.000000 1.710000 4.000000 1680.000000

This is probably only informative to people who have some experience with statistics. Let's try to plot this information and see if it helps us understand the data.

In [6]:
# box plots are best for plotting summary statistics.
sns.boxplot(data=data)

Out[6]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f197b282048>

Unfortunately this is not a very informative plot because the features are not on the same value range. We can resolve the problem by plotting each column side by side.

In [7]:
data_to_plot = data.iloc[:, 1:]
fig, ax = plt.subplots(ncols=len(data_to_plot.columns))
plt.subplots_adjust(right=3, wspace=1)
for i, col in enumerate(data_to_plot.columns):
    sns.boxplot(y=data_to_plot[col], ax = ax[i])

This is a better way to plot the data.

We can see that we have some outliers (based on the IQR calculation) in almost all the features. These outliers deserve a second look, but we won't deal with them right now.
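
As a side note, here is a rough sketch (my own addition) of how you could count the IQR-based outliers per feature:

features = data.drop('label', axis=1)
q1 = features.quantile(0.25)
q3 = features.quantile(0.75)
iqr = q3 - q1
# a value is flagged as an outlier if it lies more than 1.5*IQR outside the quartiles
outlier_mask = (features < (q1 - 1.5 * iqr)) | (features > (q3 + 1.5 * iqr))
print(outlier_mask.sum().sort_values(ascending=False))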

A pair plot is a great way to see a scatter plot of all the data, two features at a time. A pair plot is good for a small number of features and for a first glance at the columns (features); afterwards, in my opinion, a simple scatter plot with the relevant columns is better.

In [8]:
columns_to_plot = list(data.columns)
columns_to_plot.remove('label')
sns.pairplot(data, hue='label', vars=columns_to_plot) # the hue parameter colors data instances based on their value in the 'label' column.

Out[8]:

<seaborn.axisgrid.PairGrid at 0x7f197b181fd0>

The diagonal of the pair plot, from the top left to the bottom right, shows histograms of the columns.

For me, a good feature combination is one that separates some of the class labels. 'Flavanoids' in row 7 looks like a good feature when combined with the others. Same goes for 'Proline' in the last row.

On the other hand 'Malic_acid' (2nd row) does not look like a good feature at all.

We can have a further look at some features:

In [9]:
sns.lmplot(x='Proline', y='Flavanoids', hue='label', data=data, fit_reg=False)

Out[9]:

<seaborn.axisgrid.FacetGrid at 0x7f197b181da0>

In [10]:
sns.lmplot(x='Hue', y='Flavanoids', hue='label', data=data, fit_reg=False)

Out[10]:

<seaborn.axisgrid.FacetGrid at 0x7f1976dc06d8>

In [11]:
# This is a good feature combination for separating the red ones (label 3)
sns.lmplot(x='Color_intensity', y='Flavanoids', hue='label', data=data, fit_reg=False)

Out[11]:

<seaborn.axisgrid.FacetGrid at 0x7f1976249a58>

In [12]:
sns.boxplot(x=data['label'], y=data['OD280/OD315_of_diluted_wines'])
# this is a very good feature for separating labels 1 and 3

Out[12]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f1973292e80>

Another thing that is good to check is feature correlation. We don't want features that correlate strongly with each other.

We will use the Pearson correlation to compute the pairwise correlation of the columns in our data. It's worth emphasizing that the Pearson correlation only captures linear correlation, but as we saw from the pair plot, our data doesn't seem to correlate in any other way.

In [13]:
plt.figure(figsize=(18,15))
sns.heatmap(data.corr(), annot=True, fmt=".1f")

Out[13]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f1971a06f28>

There is a correlation of about 0.9 between Total_phenols and Flavanoids, which is strong. Typically I would delete one of the correlated features, but for now I won't. Let's plot these features to see the correlation.

In [14]:
sns.lmplot(x='Total_phenols', y ='Flavanoids', data=data, fit_reg=True)

Out[14]:

<seaborn.axisgrid.FacetGrid at 0x7f1973292d68>
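
Instead of eyeballing the heatmap, you can also list the strongly correlated pairs programmatically. A small sketch (the 0.8 threshold is an arbitrary choice of mine):

corr = data.drop('label', axis=1).corr()
# keep only the upper triangle so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
print(pairs[pairs.abs() > 0.8])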

Modelling and Predicting

As I mentioned before there are several ways in which you can improve your ML model.

Today I'll focus on feature selection, a very basic way to improve your model's score.

In [15]:
# initialize a random seed; this helps make the random parts reproducible.
np.random.seed(64)

Let's load the 'train_test_split' function and separate our data into the feature vectors and the target vector. We will split our data into 75% train and 25% test.

The 'stratify' parameter ensures a proportional split of the subgroups: it keeps the ratio between the classes in the train and test sets the same as in the full dataset.

In [16]:
from sklearn.model_selection import train_test_split
X = data.drop('label', axis=1)
y = data['label']
x_train, x_test, y_train, y_test = train_test_split(X, y, stratify=y)
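
If you want to verify that 'stratify' did its job, a quick check (assuming the split above) is to compare the class proportions:

print('full data:\n', y.value_counts(normalize=True))
print('train:\n', y_train.value_counts(normalize=True))
print('test:\n', y_test.value_counts(normalize=True))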

I'll start with a very simple classifier called KNN (k-nearest neighbors).

KNN classifies an object (a new data point) by the majority vote of its k nearest neighbors in the feature space.
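
To make the idea concrete, here is a tiny from-scratch sketch of that vote for a single new point (illustrative only - it is not how sklearn implements it, and it assumes NumPy arrays as input):

from collections import Counter

def knn_predict_one(x_new, X_train, y_train, k=5):
    # Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # majority vote among their labels
    return Counter(y_train[nearest]).most_common(1)[0][0]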

In [17]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

model = KNeighborsClassifier()
model.fit(x_train, y_train)
pred = model.predict(x_test)
print('score on training set:', model.score(x_train, y_train))
print('score on test set:', model.score(x_test, y_test))
print(metrics.classification_report(y_true=y_test, y_pred=pred))

score on training set: 0.8045112781954887
score on test set: 0.6888888888888889
             precision    recall  f1-score   support

          1       0.78      0.93      0.85        15
          2       0.70      0.39      0.50        18
          3       0.59      0.83      0.69        12

avg / total       0.70      0.69      0.67        45

As we can see this is a pretty bad result.

The key question that will help us decide how to improve our model is: is our model overfitting or underfitting?

Before I answer this question, there is something we didn't do that is essential in this case: when we use KNN, or any other algorithm based on distances (like the Euclidean distance in our case), normalization of the data is necessary. I'll be using the mean normalization method, which subtracts the feature's mean and divides by its standard deviation. This basically converts each data point into its Z-score.
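
For clarity, this is essentially what StandardScaler will do for us in the next cell, sketched by hand (the statistics must come from the training set only; note that StandardScaler uses the population standard deviation, so the numbers can differ marginally from pandas' default):

means = x_train.mean()
stds = x_train.std()
x_train_scaled = (x_train - means) / stds
# the test set is scaled with the *training* statistics
x_test_scaled = (x_test - means) / stds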

In [18]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

model = Pipeline([
    ('scaler', StandardScaler()),  # mean normalization
    ('knn', KNeighborsClassifier(n_neighbors=1))
])
model.fit(x_train, y_train)
pred = model.predict(x_test)
print('score on training set:', model.score(x_train, y_train))
print('score on test set:', model.score(x_test, y_test))
print(metrics.classification_report(y_true=y_test, y_pred=pred))

score on training set: 1.0
score on test set: 0.8888888888888888
             precision    recall  f1-score   support

          1       0.79      1.00      0.88        15
          2       1.00      0.72      0.84        18
          3       0.92      1.00      0.96        12

avg / total       0.91      0.89      0.89        45

Great improvement!

I used a "Pipeline" in this part. "Pipeline" is a function in sklearn that combines several other functions and enables us to use other sklearn functions with only one fit command. More on this you can read here and in a future blog post.

So, as I mentioned before, we need to understand whether our model is overfitting or underfitting.

In [19]:
from sklearn.model_selection import learning_curve

def over_underfit_plot(model, X, y):
    plt.figure()
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(model, X, y, cv=5, n_jobs=-1)
    train_scores_mean = np.mean(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    plt.grid()
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score")
    plt.legend(loc="best")
    plt.yticks(sorted(set(np.append(train_scores_mean, test_scores_mean))))

over_underfit_plot(model, x_train, y_train)

As we can see from the plot and from the accuracy scores, we are in an overfitting situation - we have a good score on the training set but a low score on the test set, and adding data improves the model's results. This means the model does poorly on unseen data but very well on the data it was trained on.

In an overfitting situation there are a few things that can be done to improve our model:

  1. Add more data - not possible in this case.
  2. Remove unimportant features in order to make the model less complex - aka. feature selection.
  3. Add regularization (see the sketch below for what this can look like with KNN).
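
For KNN there is no regularization term in the usual sense, but increasing 'n_neighbors' has a similar smoothing effect on the decision boundary. A quick sketch of how one might compare a few values with cross-validation (the candidate values here are arbitrary):

from sklearn.model_selection import cross_val_score

for k in [1, 3, 5, 7, 9]:
    pipe = Pipeline([('scaler', StandardScaler()),
                     ('knn', KNeighborsClassifier(n_neighbors=k))])
    # 5-fold cross-validation accuracy on the training data
    print(k, cross_val_score(pipe, x_train, y_train, cv=5).mean())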

Feature selection

One of the ways to avoid overfitting is by selecting a subset of features from the data. There are a lot of ways to do feature selection. The most basic one, in my opinion, is removing correlated features.

As we checked before, 'Total_phenols' and 'Flavanoids' are strongly correlated features. Let's see what happens if we drop one of them!

In [20]:
X.drop('Total_phenols', axis=1, inplace=True) # delete one of the correlated features
x_train, x_test, y_train, y_test = train_test_split(X, y, stratify=y) # split the data again

# fit the same model again and print the scores
model.fit(x_train, y_train)
pred = model.predict(x_test)
print('score on training set:', model.score(x_train, y_train))
print('score on test set:', model.score(x_test, y_test))
print(metrics.classification_report(y_true=y_test, y_pred=pred))

score on training set: 1.0
score on test set: 0.9111111111111111
             precision    recall  f1-score   support

          1       0.83      1.00      0.91        15
          2       1.00      0.78      0.88        18
          3       0.92      1.00      0.96        12

avg / total       0.92      0.91      0.91        45

As we expected, this step improved our model's score, but we are still overfitting.

Another feature selection method (and my favourite) is using another algorithm's feature importances. Many algorithms rank the importance of the features in the data based on which feature helped the most to distinguish between the target labels. From this ranking we can learn which features were more and less important and select just the ones that contribute the most.

Let's fit the train data in a Random Forest classifier and print the feature importance scores.

Random Forest is an ensemble that fits a number of decision tree classifiers. Ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone - in our case, from a single decision tree.

In [21]:
from sklearn.ensemble import RandomForestClassifier

model_feature_importance = RandomForestClassifier(n_estimators=1000).fit(x_train, y_train).feature_importances_
feature_scores = pd.DataFrame({'score': model_feature_importance}, index=list(x_train.columns)).sort_values('score')
print(feature_scores)
sns.barplot(feature_scores['score'], feature_scores.index)

                                 score
Nonflavanoid_phenols          0.009979
Ash                           0.014576
Proanthocyanins               0.021702
Magnesium                     0.023280
Alcalinity_of_ash             0.026449
Malic_acid                    0.038564
Hue                           0.084430
OD280/OD315_of_diluted_wines  0.114787
Alcohol                       0.116070
Color_intensity               0.152671
Flavanoids                    0.181141
Proline                       0.216350

Out[21]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f196a7c59b0>

As we can see from the plot, some features are more important than others, and 5-7 features stand out. I'll use the feature importance scores to set a threshold for the feature selection step of my model.

"SelectFromModel" is a sklearn function which takes an estimator and a threshold, extracts from the estimator the feature importance scores and returns only the features with a score above the given threshold.

In [22]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

model = Pipeline([
    ('select', SelectFromModel(RandomForestClassifier(n_estimators=1000), threshold=0.06)),
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=1))
])

model.fit(x_train, y_train)
pred = model.predict(x_test)
print('score on training set:', model.score(x_train, y_train))
print('score on test set:', model.score(x_test, y_test))
print(metrics.classification_report(y_true=y_test, y_pred=pred))

score on training set: 1.0
score on test set: 0.9777777777777777
             precision    recall  f1-score   support

          1       0.94      1.00      0.97        15
          2       1.00      0.94      0.97        18
          3       1.00      1.00      1.00        12

avg / total       0.98      0.98      0.98        45

An improvement of 8-9% at already high scores like these is significant and hard to achieve. This is a very good score on its own! From here we can try to improve the score in other ways.
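
Out of curiosity, you can also ask the fitted pipeline which features survived the threshold; a quick sketch using the pipeline above:

selected_mask = model.named_steps['select'].get_support()
print(list(x_train.columns[selected_mask]))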

My point was to show how I improved my score with this very simple KNN model, and how a few simple steps can improve your model and get it to a high score.