Classification using Machine Learning with Example – Tutorial

techieswiki

6 years ago


In this post, we are going to discuss how to predict the annual income of an adult. Below is the sample data I am going to use to predict whether a person earns more than 50K (>50K) or at most 50K (<=50K). The value we want to predict is the target column at the end of each row.

| age | workclass | fnlwgt | education | education-num | martial-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  | State-gov | 77516 | Bachelors |  | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 |  |  | United-States | <=50K |
|  | Self-emp-not-inc | 83311 | Bachelors |  | Married-civ-spouse | Exec-managerial | Husband | White | Male |  |  |  | United-States | <=50K |
|  | Private | 215646 | HS-grad |  | Divorced | Handlers-cleaners | Not-in-family | White | Male |  |  |  | United-States | <=50K |
|  | Private | 234721 | 11th |  | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male |  |  |  | United-States | <=50K |
|  | Private | 338409 | Bachelors |  | Married-civ-spouse | Prof-specialty | Wife | Black | Female |  |  |  | Cuba | <=50K |

Look at your data

Before building a machine learning model, we need to inspect the data to find abnormalities and to understand how each column relates to the target variable. In a data analysis project, the most time-consuming part is analysing and cleaning up the data.

Let’s start with cleaning up the data

If we find any abnormalities, we need to fix them. As you might already know, machine learning techniques are applied to numeric values, so any text data has to be converted into numbers. We call the input columns features. There are different methods available for selecting the features to build our model, but it is always good to analyse your data manually and identify the relationship between the feature columns and the target column.

Before cleaning up your data, there are a few things you need to keep in mind:

The data shouldn't contain sequential numbers such as a serial number or some other kind of id.
If null values exist in the dataset, delete those rows.
The dataset should be balanced, i.e. the target classes should occur in roughly equal proportions.
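
The three checks above can be sketched with pandas. This is only a minimal illustration on a tiny made-up frame, not the salary data: the column names `serial_no`, `age`, and `target` and the helper `quick_checks` are hypothetical, and treating "every value is distinct" as a sign of an id column is a rough heuristic that can also flag genuine continuous features.

```python
import pandas as pd

def quick_checks(df, target_col):
    """Return ID-like columns, per-column null counts, and class shares."""
    features = df.drop(columns=[target_col])
    # A column whose values are all distinct behaves like a serial number or id.
    id_like = [c for c in features.columns if features[c].nunique() == len(df)]
    # Number of null values in each column.
    null_counts = df.isnull().sum()
    # Share of each class in the target; very unequal shares suggest imbalance.
    class_share = df[target_col].value_counts(normalize=True)
    return id_like, null_counts, class_share

# Hypothetical toy data, for illustration only.
toy = pd.DataFrame({
    "serial_no": [1, 2, 3, 4],
    "age": [39, 50, None, 50],
    "target": ["<=50K", "<=50K", "<=50K", ">50K"],
})

id_like, null_counts, class_share = quick_checks(toy, "target")
print(id_like)
print(null_counts)
print(class_share)

# Drop the ID-like columns and the rows that contain nulls.
cleaned = toy.drop(columns=id_like).dropna()
print(cleaned)
```

On real data it is worth reading the three results before dropping anything, since deleting rows with nulls can itself change the class balance.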

So, first I am going to check whether my dataset contains any null values.

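import pandas as pd
import numpy as np
import os

# Source file (raw string, so the backslash in the Windows path is not read as an escape)
dataFile = os.path.join(r'E:\PYPrograms', 'Data', 'Dumps', 'salary.csv')
data = pd.read_csv(dataFile)

# Count the null values in each column
print(data.isnull().sum())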

Fig: null value count for each column in the pandas data frame

As the output shows, there are no null values in this dataset.

How to apply Label encoding for TEXT data?

As I said before, we need to convert all text data to numeric values in order to apply machine learning algorithms. For this example, I am going to use the LabelEncoder class from sklearn. Other text transformation modules are available, and you can choose one of them based on your data type. In our example, the text columns hold only labels, i.e. each text value is a single token that denotes a particular category. Below is the code we can use to label the text values in the data set. As you might have noticed, LabelEncoder is applied only to the feature columns X, not to the target y.

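from sklearn.preprocessing import LabelEncoder

# Features are all columns except the last one; copy them so the encoding
# does not write into a slice of the original frame
X = data.iloc[:, 0:-1].copy()
y = data['target']

# Replace every text column with integer category codes
le = LabelEncoder()
for col in X.columns:
    if X[col].dtype == 'object':
        X[col] = le.fit_transform(X[col])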

After executing this code, every text column in X holds integer codes instead of strings.
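
To make the effect of label encoding concrete, here is a small plain-Python sketch of the idea: each distinct text value is replaced by its position in the sorted list of distinct values, which, as far as I know, is the order LabelEncoder uses for its codes. The helper `label_encode` is hypothetical and the sample values are taken from the workclass column shown earlier.

```python
def label_encode(values):
    """Map each distinct value to an integer code (hypothetical helper)."""
    classes = sorted(set(values))  # distinct labels in sorted order
    code_of = {label: i for i, label in enumerate(classes)}
    return [code_of[v] for v in values], classes

workclass = ["State-gov", "Self-emp-not-inc", "Private", "Private", "Private"]
codes, classes = label_encode(workclass)

print(classes)  # the label that each code stands for, by position
print(codes)    # the encoded column
```

Keep in mind that the codes suggest an order (0 < 1 < 2) that the categories do not really have. Tree-based models such as a random forest usually cope with that, but for other algorithms a one-hot style encoding may be the safer choice.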

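Now we need to select the features in the data set that are most strongly related to the target. For this example, I am going to use the RFE (Recursive Feature Elimination) module from sklearn. RFE repeatedly fits an estimator and drops the least important feature until only the requested number of features is left. Here I am using the RandomForestClassifier algorithm both as that estimator and to predict the result.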

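from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn import model_selection as ms

# Split the data set into train and test sets
x_train, x_test, y_train, y_test = ms.train_test_split(X, y, test_size=0.30, random_state=42)

model = RandomForestClassifier(n_estimators=20)
rfe = RFE(model, n_features_to_select=7)  # 7 is the number of features I want to select
fit_rfe = rfe.fit_transform(x_train, y_train)
model.fit(fit_rfe, y_train)

# Keep the same 7 columns in the test set before predicting
rfe_test = rfe.transform(x_test)
resultRaw = model.predict(rfe_test)
acc = accuracy_score(np.array(y_test), resultRaw)
print(acc)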

As you can see, our model was able to predict the result with 83% accuracy.
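
The accuracy printed above is simply the fraction of test rows where the predicted label equals the true label. Because of the earlier point about balanced data, it can help to compare that number with what you would get by always predicting the most frequent class. The sketch below uses made-up labels, not the model's actual output, and both helper functions are hypothetical.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

def majority_baseline(y_true):
    """Accuracy of always predicting the most frequent class."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)

# Made-up labels, for illustration only.
y_true = ["<=50K", "<=50K", "<=50K", ">50K", ">50K"]
y_pred = ["<=50K", "<=50K", "<=50K", ">50K", "<=50K"]

print(accuracy(y_true, y_pred))
print(majority_baseline(y_true))
```

If the model's accuracy is only slightly above the baseline, most of the score comes from the class imbalance rather than from what the model has learned.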

Note: The transform method is applied to the x_test data because the model must receive exactly the features that RFE selected. If we don't apply it, the predict function will fail with a ValueError:

ValueError: Number of features of the model must match the input. Model n_features is 7 and input n_features is 14

Fig: selecting features from the test data set
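
The behaviour described in the note can be reproduced on synthetic data. This is a minimal sketch that assumes a recent scikit-learn version: `make_classification` stands in for the salary data, and the sizes and random seeds are arbitrary choices of mine. It also prints `rfe.support_`, the boolean mask that shows which of the original columns RFE kept.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the salary data: 14 numeric features, binary target.
X, y = make_classification(n_samples=300, n_features=14, n_informative=5,
                           random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

model = RandomForestClassifier(n_estimators=20, random_state=0)
rfe = RFE(model, n_features_to_select=7)
train_selected = rfe.fit_transform(x_train, y_train)
model.fit(train_selected, y_train)

# support_ is a boolean mask over the original columns: True means "kept".
print(rfe.support_)
print(train_selected.shape)

# Predicting on the untransformed test set fails: it has 14 columns,
# while the model was fitted on 7.
try:
    model.predict(x_test)
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print(err)

# Applying the same selection to the test set gives the expected shape.
predictions = model.predict(rfe.transform(x_test))
print(len(predictions))
```

The exact wording of the error message differs between scikit-learn versions, but the cause is the same mismatch in the number of features.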

Here is the complete code for your reference.

import pandas as pd
import numpy as np
import os
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn import model_selection as ms

# Load the data (raw string, so the backslash in the Windows path is not read as an escape)
dataFile = os.path.join(r'E:\PYPrograms', 'Data', 'Dumps', 'salary.csv')
data = pd.read_csv(dataFile)

# Count the null values in each column
print(data.isnull().sum())

# Features are all columns except the last one; the last column is the target
X = data.iloc[:, 0:-1].copy()
y = data['target']

# Label encoding to transform text columns into integer codes
le = LabelEncoder()
for col in X.columns:
    if X[col].dtype == 'object':
        X[col] = le.fit_transform(X[col])

# Split the data set into train and test sets
x_train, x_test, y_train, y_test = ms.train_test_split(X, y, test_size=0.30, random_state=42)

# Select 7 features with RFE, then train the model on the selected features
model = RandomForestClassifier(n_estimators=65)
rfe = RFE(model, n_features_to_select=7)
fit_rfe = rfe.fit_transform(x_train, y_train)
model.fit(fit_rfe, y_train)

# Keep the same 7 columns in the test set, predict, and measure accuracy
rfe_test = rfe.transform(x_test)
resultRaw = model.predict(rfe_test)
acc = accuracy_score(np.array(y_test), resultRaw)
print(acc)

I am new to these machine learning techniques. I thought of sharing what I have learned so far, hoping it helps beginners like me. Let me know your feedback and suggestions in the comment section below.