In this post, we are going to discuss how to predict the annual income of an adult. Below is a sample of the data I am going to use to predict whether a person earns more than 50K (>50K) or not (<=50K). The last column, target, is the value we are going to predict.

age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country target
39  State-gov 77516  Bachelors 13  Never-married  Adm-clerical  Not-in-family  White  Male 2174 0 40  United-States <=50K
50  Self-emp-not-inc 83311  Bachelors 13  Married-civ-spouse  Exec-managerial  Husband  White  Male 0 0 13  United-States  <=50K
38  Private 215646  HS-grad 9  Divorced  Handlers-cleaners  Not-in-family  White  Male 0 0 40  United-States  <=50K
53  Private 234721  11th 7  Married-civ-spouse  Handlers-cleaners  Husband  Black  Male 0 0 40  United-States  <=50K
28  Private 338409  Bachelors 13  Married-civ-spouse  Prof-specialty  Wife  Black  Female 0 0 40  Cuba  <=50K

Look at your data

Before building a machine learning model, we need to inspect the data to find abnormalities and to understand how each column relates to the target variable. In a data analysis project, the most time-consuming part is analysing and cleaning up the data.

Let’s start with cleaning up the data

If we find any abnormalities, we need to fix them. As you might already know, machine learning algorithms work on numeric values, so any text data has to be converted to numbers first. We call the input columns features. There are different methods available to select the features used to build our model, but it is always good to analyse your data manually and identify the relationship between the feature columns and the target column.
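One simple way to do this manual check for a text column is a normalised cross-tabulation against the target. The sketch below uses a tiny hand-made frame: the column names are borrowed from the sample above, but the rows are made up for illustration and are not taken from salary.csv.

```python
import pandas as pd

# Tiny illustrative frame; the values are invented for this example
sample = pd.DataFrame({
    'education': ['Bachelors', 'Bachelors', 'HS-grad', 'HS-grad', 'Masters', 'Masters'],
    'target':    ['<=50K', '>50K', '<=50K', '<=50K', '>50K', '>50K'],
})

# Share of each target class within every education level (each row sums to 1)
share = pd.crosstab(sample['education'], sample['target'], normalize='index')
print(share)
```

If the shares differ a lot between the rows, the column probably tells us something about the target; if every row looks the same, it probably does not.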

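Before cleaning up your data, there are a few things you need to keep in mind:

  • The data shouldn’t contain sequential numbers such as serial numbers or other kinds of IDs, because they carry no information about the target.
  • If null values exist in the dataset, handle them; the simplest option is to delete those rows.
  • Ideally the dataset should be balanced, i.e. the target classes should have a similar number of rows.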
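The first and last points in the list above can be checked with a few lines of pandas. This is only a minimal sketch on a made-up frame; row_id is a hypothetical serial-number column that I added for the example, it is not part of salary.csv.

```python
import pandas as pd

# Illustrative frame; 'row_id' is a hypothetical serial-number column
df = pd.DataFrame({
    'row_id': [1, 2, 3, 4, 5],
    'age':    [39, 50, 39, 53, 28],
    'hours-per-week': [40, 13, 40, 40, 40],
    'target': ['<=50K', '<=50K', '<=50K', '>50K', '<=50K'],
})

# Columns where every value is unique behave like identifiers
id_like = [c for c in df.columns if df[c].nunique() == len(df)]
print(id_like)

# Relative frequency of each target class
balance = df['target'].value_counts(normalize=True)
print(balance)
```

Treat the uniqueness check as a hint only: in a small sample a genuine numeric feature can also happen to have all-unique values.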

So, here I am going to check whether there are any null values in my dataset.

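import pandas as pd
import numpy as np
import os

# Source file (raw string so the backslash in the Windows path is kept literally)
dataFile=os.path.join(r'E:\PYPrograms','Data','Dumps','salary.csv')
data = pd.read_csv(dataFile)
# Number of null values in each column
print(data.isnull().sum())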
Fig: Null value count for each column in the pandas DataFrame

As you can see in the above image, there are no null values in this dataset.
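One caveat: isnull() only counts real NaN values. Some CSV files mark missing entries with a placeholder string instead. Whether salary.csv does this is an assumption on my side, so the sketch below uses a made-up frame with '?' as the placeholder just to show how such entries could be counted.

```python
import pandas as pd

# Illustrative frame; the '?' placeholder is an assumption about how
# missing entries might be written in a CSV like this one
df = pd.DataFrame({
    'workclass': ['Private', ' ?', 'State-gov', 'Private'],
    'occupation': ['Sales', 'Adm-clerical', '?', 'Sales'],
    'age': [39, 50, 38, 53],
})

# Only non-numeric columns can hold a text placeholder
text_cols = [c for c in df.columns if not pd.api.types.is_numeric_dtype(df[c])]

# Strip surrounding spaces before comparing, then count placeholders per column
placeholder_counts = df[text_cols].apply(lambda s: (s.str.strip() == '?').sum())
print(placeholder_counts)
```

If placeholders do show up, one option is to pass na_values='?' (together with skipinitialspace=True when values have leading spaces) to pd.read_csv, so that they are loaded as NaN and counted by isnull().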

How to apply Label encoding for TEXT data?

As I said before, we need to convert all text data to numeric values in order to apply machine learning algorithms. For this example, I am going to use the LabelEncoder class from sklearn. Other text transformation classes are available, and you can choose any of them based on your data type. In our example, the text columns only hold labels, i.e. each text value is a single token that denotes a particular category. Below is the code we can use to encode the text values in the data set. As you might have noticed, LabelEncoder is only applied to the feature columns X; the target y is left as text.

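from sklearn.preprocessing import LabelEncoder

# Features: every column except the last one (copy, so data itself is not modified)
X=data.iloc[:,0:-1].copy()
# Target: the last column
y=data['target']
le=LabelEncoder()
for col in X.columns:
    # Encode only the text columns (dtype 'object')
    if X[col].dtypes=='object':
        X[col]=le.fit_transform(X[col])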

After executing this code, our data will look like below.

Fig: LabelEncoder – sklearn
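To see what LabelEncoder actually does to a column, here is a small sketch on a hand-written list of workclass values. The mapping dictionary is just a helper I build for illustration; it is not something LabelEncoder returns itself.

```python
from sklearn.preprocessing import LabelEncoder

values = ['Private', 'State-gov', 'Private', 'Self-emp-not-inc']

le = LabelEncoder()
codes = le.fit_transform(values)

# classes_ holds the sorted unique labels; a label's code is its index there
mapping = {label: code for code, label in enumerate(le.classes_)}
print(codes)    # [0 2 0 1]
print(mapping)  # {'Private': 0, 'Self-emp-not-inc': 1, 'State-gov': 2}
```

The codes follow the alphabetical order of the labels, so the numbers themselves carry no meaning. The sklearn documentation describes LabelEncoder as intended for the target column and offers OrdinalEncoder and OneHotEncoder for feature columns, so those may be worth a look as alternatives.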

Now we need to select the features that are most strongly related to the target. For this example, I am going to use the RFE (Recursive Feature Elimination) class from sklearn, with RandomForestClassifier as the algorithm used to predict the result.

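from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn import model_selection as ms

# Split data set into train and test sets
x_train,x_test,y_train,y_test = ms.train_test_split(X,y,test_size=0.30,random_state=42)

model = RandomForestClassifier(n_estimators=20)
rfe = RFE(model,n_features_to_select=7) # where 7 is the number of features I want to select
# Fit RFE on the training data and keep only the selected columns
fit_rfe = rfe.fit_transform(x_train,y_train)
# Train the classifier on the reduced training data
model.fit(fit_rfe,y_train)

# Apply the same column selection to the test data
rfe_test = rfe.transform(x_test)
resultRaw=model.predict(rfe_test)
acc = accuracy_score(np.array(y_test),resultRaw)
print(acc)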

As you can see, our model was able to predict the result with 83% accuracy.
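It is also useful to know which features RFE kept. The sketch below shows this on synthetic data generated with make_classification, which stands in for the encoded salary data (so the selected indices mean nothing here); the support_ and ranking_ attributes work the same way on a real fit.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in for the encoded data: 14 features, as in this post
X_demo, y_demo = make_classification(n_samples=200, n_features=14,
                                     n_informative=5, random_state=0)

rfe = RFE(RandomForestClassifier(n_estimators=20, random_state=0),
          n_features_to_select=7)
rfe.fit(X_demo, y_demo)

# support_ is a boolean mask of the kept features; ranking_ is 1 for kept ones
selected = [i for i, keep in enumerate(rfe.support_) if keep]
print(selected)
print(rfe.ranking_)
```

When RFE is fitted on a DataFrame such as x_train, the names of the kept columns can be read with x_train.columns[rfe.support_].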

Note: The transform method is applied to the x_test data because we want exactly the features selected by RFE. If we don’t apply it, the predict function will fail with a ValueError:

ValueError: Number of features of the model must match the input. Model n_features is 7 and input n_features is 14
Fig: selecting features from the test data set
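The same failure can be reproduced in isolation. In this sketch I use synthetic data and simply take the first 7 columns in place of the RFE selection, which is an assumption made to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 14 features in total
X_demo, y_demo = make_classification(n_samples=100, n_features=14, random_state=0)

# Train on the first 7 columns only, mimicking a model fitted on RFE output
model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(X_demo[:, :7], y_demo)

try:
    model.predict(X_demo)  # 14 columns instead of 7
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print(err)

# Passing the same 7 columns works
preds = model.predict(X_demo[:, :7])
```

The exact wording of the error message depends on the sklearn version, but it is a ValueError about the number of features in each case.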

Here is the complete code for your reference.

import pandas as pd
import numpy as np
import os
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn import model_selection as ms

# Source file (raw string so the backslash in the Windows path is kept literally)
dataFile=os.path.join(r'E:\PYPrograms','Data','Dumps','salary.csv')
data = pd.read_csv(dataFile)
# Number of null values in each column
print(data.isnull().sum())
# Features are all columns except the last one; the target is the last column
X=data.iloc[:,0:-1].copy()
y=data['target']

# Label encoding to transform text values to numbers
le=LabelEncoder()
for col in X.columns:
    if X[col].dtypes=='object':
        X[col]=le.fit_transform(X[col])

# Split data set into train and test sets
x_train,x_test,y_train,y_test = ms.train_test_split(X,y,test_size=0.30,random_state=42)
model = RandomForestClassifier(n_estimators=65)
# Keep the 7 best features according to RFE
rfe = RFE(model,n_features_to_select=7)
fit_rfe = rfe.fit_transform(x_train,y_train)
# Train the classifier on the selected features
model.fit(fit_rfe,y_train)
# Select the same features from the test set before predicting
rfe_test = rfe.transform(x_test)
resultRaw=model.predict(rfe_test)
acc = accuracy_score(np.array(y_test),resultRaw)
print(acc)

I am new to machine learning techniques. I thought of sharing what I have learned so far, hoping it helps beginners like me. Let me know your feedback and suggestions in the comment section below.
