In this post, we are going to discuss how to predict the annual income of an adult. Below is the sample data I am going to use to predict whether a person earns >50K or <=50K. The last column, target, is the value we are going to predict.
age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | target |
39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
Look at your data
Before building a machine learning model, we need to inspect the data to find abnormalities and to understand how each column relates to the target variable. In a data analysis project, the most time-consuming part is analysing and cleaning up the data.
Let’s start with cleaning up the data
If any abnormalities are found, we need to fix them. As you might already know, machine learning techniques operate on numeric values, so any text data has to be converted into numbers (integers or floats) first. We call the input columns features. There are different methods available for selecting the features used to build our model, but it is always good to analyse your data manually and identify the relationship between the feature columns and the target column.
Before cleaning up your data, there are a few things you need to keep in mind:
- The data shouldn’t contain sequential numbers such as serial numbers or other kinds of ids.
- If null values exist in the dataset, delete those rows.
- The dataset should be balanced, i.e. each target class should be reasonably well represented.
So, here I am going to check whether there are any null values in my dataset.
import pandas as pd
import numpy as np
import os
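# Source file (raw string so the backslash in the Windows path is not treated as an escape)
dataFile=os.path.join(r'E:\PYPrograms','Data','Dumps','salary.csv')
data = pd.read_csv(dataFile)
# Count the null values in each column
print(data.isnull().sum())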
As you can see in the output, there are no null values in this dataset.
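The three checks listed earlier can be sketched on a tiny made-up frame. This is only an illustration: the `row_id` column and the six rows are assumptions standing in for salary.csv, not the real data.

```python
import pandas as pd

# Made-up rows standing in for salary.csv; 'row_id' is a hypothetical id column.
sample = pd.DataFrame({
    "row_id": [1, 2, 3, 4, 5, 6],
    "age": [39, 50, None, 53, 28, 37],
    "workclass": ["State-gov", "Private", "Private", None, "Private", "Private"],
    "target": ["<=50K", "<=50K", ">50K", "<=50K", "<=50K", ">50K"],
})

# 1. Drop sequential identifiers: they carry no information about the target.
cleaned = sample.drop(columns=["row_id"])

# 2. Drop every row that contains at least one null value.
cleaned = cleaned.dropna()

# 3. Check the class balance as the fraction of rows per target value.
balance = cleaned["target"].value_counts(normalize=True)

print(cleaned.shape)
print(balance)
```

If one class dominates heavily, accuracy alone can look good even when the model mostly predicts the majority class, so it is worth looking at this ratio before training.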
How to apply Label encoding for TEXT data?
As I said before, we need to convert all text data to numeric values in order to apply machine learning algorithms. For this example, I am going to use the LabelEncoder class from sklearn. Other text transformation modules are available, and you can choose any of them based on your data type. In our example, the text columns only hold labels, i.e. every text value is a single token that denotes a particular category. Below is the code we can use to label the text values in the data set. As you might have noticed, LabelEncoder is only applied to the feature columns X.
from sklearn.preprocessing import LabelEncoder
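# Features are every column except the last; copy so the original frame is not modified
X=data.iloc[:,0:-1].copy()
y=data['target']
le=LabelEncoder()
for col in X.columns:
    # Encode only the text (object) columns
    if X[col].dtypes=='object':
        X[col]=le.fit_transform(X[col])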
After executing this code, the text columns of our data hold integer codes instead of strings.
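To see what the encoding does, here is a small sketch on made-up rows. It differs from the loop above in one respect, which is my own assumption about what is useful rather than part of the original code: it keeps one encoder per column in a hypothetical `encoders` dict, so the codes can be turned back into text later.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows; only the column names are borrowed from the sample data.
frame = pd.DataFrame({
    "workclass": ["State-gov", "Private", "Private", "Self-emp-not-inc"],
    "sex": ["Male", "Male", "Female", "Male"],
    "age": [39, 38, 28, 50],
})
original_workclass = frame["workclass"].tolist()

encoders = {}  # hypothetical helper: one fitted encoder per text column
for col in frame.columns:
    # Treat every non-numeric column as a text column.
    if not pd.api.types.is_numeric_dtype(frame[col]):
        enc = LabelEncoder()
        frame[col] = enc.fit_transform(frame[col])
        encoders[col] = enc

# Classes are sorted alphabetically; the code is the position in that list.
print(list(encoders["workclass"].classes_))
print(frame)

# Map the integer codes back to the original labels.
decoded = list(encoders["workclass"].inverse_transform(frame["workclass"]))
print(decoded)
```

Keep in mind that the integer codes follow alphabetical order, not any meaningful ranking of the categories. Tree-based models such as random forests usually cope with that, but other algorithms may read the codes as ordered values.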
Now we need to select the features in the data set that are most strongly related to the target. For this example, I am going to use the RFE (Recursive Feature Elimination) class from sklearn. Here I am using the RandomForestClassifier algorithm to predict the result.
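from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn import model_selection as ms
# Split data set to train and test set
x_train,x_test,y_train,y_test = ms.train_test_split(X,y,test_size=0.30,random_state=42)
model = RandomForestClassifier(n_estimators=20)
rfe = RFE(model,n_features_to_select=7) # where 7 is the number of features i want to select
# Fit RFE on the training data and keep only the selected columns
fit_rfe = rfe.fit_transform(x_train,y_train)
model.fit(fit_rfe,y_train)
# Apply the same column selection to the test data before predicting
rfe_test = rfe.transform(x_test)
resultRaw=model.predict(rfe_test)
acc = accuracy_score(np.array(y_test),resultRaw)
print(acc)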
As you can see, our model was able to predict the result with 83% accuracy.
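It can also help to check which columns RFE actually kept. The sketch below uses synthetic data from `make_classification` with hypothetical column names `f0` to `f13`, because salary.csv is not available here. On the real data you would fit on x_train and y_train instead.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in with 14 feature columns, like the salary data.
features, labels = make_classification(
    n_samples=300, n_features=14, n_informative=5, random_state=42
)
columns = [f"f{i}" for i in range(14)]  # hypothetical column names
frame = pd.DataFrame(features, columns=columns)

selector = RFE(
    RandomForestClassifier(n_estimators=20, random_state=42),
    n_features_to_select=7,
)
selector.fit(frame, labels)

# support_ is a boolean mask over the columns: True means the feature was kept.
selected = frame.columns[selector.support_].tolist()

# ranking_ is 1 for every kept feature; larger numbers were eliminated earlier.
ranking = pd.Series(selector.ranking_, index=columns).sort_values()

print(selected)
print(ranking)
```

Comparing the selected names with your own manual analysis of the data is a quick sanity check on the feature selection.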
Note: The transform method is applied to the x_test data because we want exactly the features selected by RFE. If we didn’t apply it, the predict function would fail with a ValueError like the following (the exact wording depends on the sklearn version):
ValueError: Number of features of the model must match the input. Model n_features is 7 and input n_features is 14
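The mismatch is easy to reproduce. This is a minimal sketch on synthetic data, not the salary data. The names `train_x`, `test_x`, `selector` and `clf` are placeholders of my own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in with 14 features; the first 150 rows are used for training.
features, labels = make_classification(
    n_samples=200, n_features=14, n_informative=5, random_state=0
)
train_x, test_x = features[:150], features[150:]
train_y = labels[:150]

selector = RFE(
    RandomForestClassifier(n_estimators=10, random_state=0),
    n_features_to_select=7,
)
reduced_train = selector.fit_transform(train_x, train_y)  # 7 columns remain

clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(reduced_train, train_y)

# Predicting on the untransformed test set (14 columns) is rejected.
try:
    clf.predict(test_x)
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print("predict failed:", err)

# Applying the same column selection first makes the shapes agree.
predictions = clf.predict(selector.transform(test_x))
print(predictions.shape)
```

The rule of thumb is that whatever transformation is fitted on the training data must be applied, unchanged, to the test data before calling predict.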
Here is the complete code for your reference.
import pandas as pd
import numpy as np
import os
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn import model_selection as ms
# Raw string so the backslash in the Windows path is not treated as an escape
dataFile=os.path.join(r'E:\PYPrograms','Data','Dumps','salary.csv')
data = pd.read_csv(dataFile)
print(data.isnull().sum())
X=data.iloc[:,0:-1].copy()
y=data['target']
# Label encoding to transform text to numeric values
le=LabelEncoder()
for col in X.columns:
    if X[col].dtypes=='object':
        X[col]=le.fit_transform(X[col])
#Split data set to train and test set
x_train,x_test,y_train,y_test = ms.train_test_split(X,y,test_size=0.30,random_state=42)
model = RandomForestClassifier(n_estimators=65)
rfe = RFE(model,n_features_to_select=7)
fit_rfe = rfe.fit_transform(x_train,y_train)
model.fit(fit_rfe,y_train)
rfe_test = rfe.transform(x_test)
resultRaw=model.predict(rfe_test)
acc = accuracy_score(np.array(y_test),resultRaw)
print(acc)
I am new to machine learning techniques. I thought of sharing what I have learned so far, hoping it helps beginners like me. Let me know your feedback and suggestions in the comment section below.