In supervised machine learning (ML), the goal is to build an accurate model that, based on previously labeled data, makes predictions for new data.

The number one question when it comes to modeling is: "How can I improve my results?"

There are several basic ways to improve your prediction model:

  1. Hyperparameters optimization
  2. Feature extraction
  3. Selecting another model
  4. Adding more data
  5. Feature selection

In this blog post, I'll walk you through how I used Feature Selection to improve my model. For the demonstration, I'll use the 'Wine' dataset from the UCI ML repository.

Most of the functions used here come from the sklearn (scikit-learn) library.

For the plotting functions, make sure to read about matplotlib and seaborn. Both are great plotting libraries with great documentation.

Before we jump into the ML model and prediction, we need to understand our data. The process of understanding the data is called EDA - exploratory data analysis.

EDA - exploratory data analysis.

UCI kindly gives us some basic information about the dataset. I'll quote some of the more important parts: "These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines ... All attributes are continuous ... 1st attribute is class identifier (1-3)"

Based on this, it looks like a classification problem with 3 class labels and 13 numeric attributes, where the goal is to predict the specific cultivar each wine was derived from.

In [1]:
# Loading a few important modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set() #sets a style for the seaborn plots.

In [2]:
# Loading the data from its CSV file,
# and converting the 'label' column to a string so pandas won't infer it as a numeric value
data = pd.read_csv('wine_data_UCI.csv', dtype={'label':str})
data.head() # print the data's top five instances

Out[2]:

label Alcohol Malic_acid Ash Alcalinity_of_ash Magnesium Total_phenols Flavanoids Nonflavanoid_phenols Proanthocyanins Color_intensity Hue OD280/OD315_of_diluted_wines Proline
0 1 14.23 1.71 2.43 15.6 127 2.80 3.06 0.28 2.29 5.64 1.04 3.92 1065
1 1 13.20 1.78 2.14 11.2 100 2.65 2.76 0.26 1.28 4.38 1.05 3.40 1050
2 1 13.16 2.36 2.67 18.6 101 2.80 3.24 0.30 2.81 5.68 1.03 3.17 1185
3 1 14.37 1.95 2.50 16.8 113 3.85 3.49 0.24 2.18 7.80 0.86 3.45 1480
4 1 13.24 2.59 2.87 21.0 118 2.80 2.69 0.39 1.82 4.32 1.04 2.93 735

I named the first column 'label'. This is the target attribute - what we are trying to predict. Since this is a classification problem, the class label ('label') is not a numeric value but a nominal one; that's why I tell pandas that this column's dtype is 'str'.
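As a side note, if the column had already been loaded as numeric, the same conversion could be done after the fact; a minimal sketch:

# Hypothetical alternative: convert the column after loading,
# in case the CSV was read without the dtype argument.
data['label'] = data['label'].astype(str)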

In [3]:
data.info() # prints out basic information about the data.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 14 columns):
label                           178 non-null object
Alcohol                         178 non-null float64
Malic_acid                      178 non-null float64
Ash                             178 non-null float64
Alcalinity_of_ash               178 non-null float64
Magnesium                       178 non-null int64
Total_phenols                   178 non-null float64
Flavanoids                      178 non-null float64
Nonflavanoid_phenols            178 non-null float64
Proanthocyanins                 178 non-null float64
Color_intensity                 178 non-null float64
Hue                             178 non-null float64
OD280/OD315_of_diluted_wines    178 non-null float64
Proline                         178 non-null int64
dtypes: float64(11), int64(2), object(1)
memory usage: 19.5+ KB

As we can see, we have 178 entries (instances). As we know from UCI's description of the data, there are 13 numeric attributes and one 'object'-type attribute (the target column). Every row has a value in every column, which is why we see "178 non-null" next to each column description.
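To double-check that there really are no missing values, we can also sum the nulls per column; a quick sketch:

# Count missing values per column - should be all zeros,
# matching the "178 non-null" lines above.
print(data.isnull().sum())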

In [4]:
print(data['label'].value_counts()) # prints how many times each value in the 'label' column appears.
sns.countplot(x='label', data=data) # plots the counts printed above.

2    71
1    59
3    48
Name: label, dtype: int64

Out[4]:

[Count plot of the 'label' column, showing the three class counts printed above]

It's important to check the number of instances in each class. There is a difference between the class labels, but it isn't a huge one. If the difference were bigger, we would be facing an imbalanced classification problem. That would require a lot of extra work, but that's a topic for another post.
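To put a number on the balance, we can look at class proportions instead of raw counts; a quick sketch (the proportions follow directly from the counts above):

# Relative frequency of each class label:
# roughly 0.40 / 0.33 / 0.27 - noticeably uneven, but not severely imbalanced.
print(data['label'].value_counts(normalize=True))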

In [5]:
# This method prints summary statistics for each numeric column in our data.
data.describe()

Out[5]:

Alcohol Malic_acid Ash Alcalinity_of_ash Magnesium Total_phenols Flavanoids Nonflavanoid_phenols Proanthocyanins Color_intensity Hue OD280/OD315_of_diluted_wines Proline
count 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000 178.000000
mean 13.000618 2.336348 2.366517 19.494944 99.741573 2.295112 2.029270 0.361854 1.590899 5.058090 0.957449 2.611685 746.893258
std 0.811827 1.117146 0.274344 3.339564 14.282484 0.625851 0.998859 0.124453 0.572359 2.318286 0.228572 0.709990 314.907474
min 11.030000 0.740000 1.360000 10.600000 70.000000 0.980000 0.340000 0.130000 0.410000 1.280000 0.480000 1.270000 278.000000
25% 12.362500 1.602500 2.210000 17.200000 88.000000 1.742500 1.205000 0.270000 1.250000 3.220000 0.782500 1.937500 500.500000
50% 13.050000 1.865000 2.360000 19.500000 98.000000 2.355000 2.135000 0.340000 1.555000 4.690000 0.965000 2.780000 673.500000
75% 13.677500 3.082500 2.557500 21.500000 107.000000 2.800000 2.875000 0.437500 1.950000 6.200000 1.120000 3.170000 985.000000
max 14.830000 5.800000 3.230000 30.000000 162.000000 3.880000 5.080000 0.660000 3.580000 13.000000 1.710000 4.000000 1680.000000

This table is probably only informative to people who have some experience in statistics. Let's plot this information and see if that helps us understand it.

In [6]:
# Box plots are a good way to visualize these summary statistics.
sns.boxplot(data=data)

Out[6]:

[Box plot of all 13 features drawn on one shared axis]

Unfortunately, this is not a very informative plot, because the features are not in the same value range. We can resolve the problem by plotting each column side by side.

In [7]:
data_to_plot = data.iloc[:, 1:] # all the columns except the 'label' column
fig, ax = plt.subplots(ncols=len(data_to_plot.columns)) # one subplot per feature
plt.subplots_adjust(right=3, wspace=1) # widen the figure and add space between subplots
for i, col in enumerate(data_to_plot.columns):
    sns.boxplot(y=data_to_plot[col], ax=ax[i]) # each feature gets its own y-axis scale

This is a better way to plot the data.
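An alternative way to get everything onto a single axis, which I'm not using in this post, would be to standardize the features first. A minimal sketch, assuming scikit-learn's StandardScaler:

from sklearn.preprocessing import StandardScaler

# Standardize the 13 features to zero mean and unit variance,
# so one box plot can show them all on a shared scale.
scaled = pd.DataFrame(StandardScaler().fit_transform(data_to_plot),
                      columns=data_to_plot.columns)
sns.boxplot(data=scaled)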

In these per-feature box plots we can see that we have some outliers (based on the IQR calculation) in almost all the features. These outliers deserve a second look, but we won't deal with them right now; a quick way to count them is sketched below.
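A rough count of those outliers per feature, using the same 1.5 * IQR rule that defines the box plot whiskers:

# Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] for each feature -
# the same rule the box plot whiskers are based on.
q1 = data_to_plot.quantile(0.25)
q3 = data_to_plot.quantile(0.75)
iqr = q3 - q1
print(((data_to_plot < q1 - 1.5 * iqr) | (data_to_plot > q3 + 1.5 * iqr)).sum())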

A pair plot is a great way to see scatter plots of all the data, though of course only two features at a time. It is good for a small number of features and for a first glance at the columns (features); afterwards, in my opinion, a simple scatter plot of the relevant columns is better (see the example sketch after the pair plot).

In [8]:
columns_to_plot = list(data.columns)
columns_to_plot.remove('label') # plot only the 13 features, not the target column
sns.pairplot(data, hue='label', vars=columns_to_plot) # the hue parameter colors data instances based on their value in the 'label' column.

Out[8]:

[Pair plot of all 13 features, colored by 'label']
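As an example of the simple scatter plot mentioned above, here is a minimal sketch for one pair of features. The specific choice of Flavanoids and Proline is mine, just for illustration, and sns.scatterplot assumes seaborn 0.9 or newer:

# Scatter plot of a single pair of features, colored by class label.
sns.scatterplot(x='Flavanoids', y='Proline', hue='label', data=data)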