Import libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.metrics import mutual_info_score
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
Data preparation
We will use the lead scoring dataset: https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv
df = pd.read_csv("https://raw.githubusercontent.com/alexeygrigorev/datasets/master/course_lead_scoring.csv")
df.head()

|   | lead_source | industry | number_of_courses_viewed | annual_income | employment_status | location | interaction_count | lead_score | converted |
|---|---|---|---|---|---|---|---|---|---|
| 0 | paid_ads | NaN | 1 | 79450.0 | unemployed | south_america | 4 | 0.94 | 1 |
| 1 | social_media | retail | 1 | 46992.0 | employed | south_america | 1 | 0.80 | 0 |
| 2 | events | healthcare | 5 | 78796.0 | unemployed | australia | 3 | 0.69 | 1 |
| 3 | paid_ads | retail | 2 | 83843.0 | NaN | australia | 1 | 0.87 | 0 |
| 4 | referral | education | 3 | 85012.0 | self_employed | europe | 3 | 0.62 | 1 |
Checking for missing values
df.dtypes

lead_source object
industry object
number_of_courses_viewed int64
annual_income float64
employment_status object
location object
interaction_count int64
lead_score float64
converted int64
dtype: object
df.isnull().sum()

lead_source 128
industry 134
number_of_courses_viewed 0
annual_income 181
employment_status 100
location 63
interaction_count 0
lead_score 0
converted 0
dtype: int64
There are 5 columns with missing values:

- lead_source (object)
- industry (object)
- annual_income (numeric)
- employment_status (object)
- location (object)
For categorical features, I replace missing values with "NA", and for numerical features I replace them with 0.0.
categorical = df.dtypes[df.dtypes == "object"].index.tolist()
categorical

['lead_source', 'industry', 'employment_status', 'location']

numerical = df.dtypes[df.dtypes != "object"].index.tolist()
numerical

['number_of_courses_viewed',
 'annual_income',
 'interaction_count',
 'lead_score',
 'converted']
for col in categorical:
    df[col] = df[col].fillna("NA")

for col in numerical:
    df[col] = df[col].fillna(0.0)
df.isnull().sum()

lead_source 0
industry 0
number_of_courses_viewed 0
annual_income 0
employment_status 0
location 0
interaction_count 0
lead_score 0
converted 0
dtype: int64
The most frequent category in industry
df["industry"].value_counts()industry
retail 203
finance 200
other 198
healthcare 187
education 187
technology 179
manufacturing 174
NA 134
Name: count, dtype: int64
retail is the most frequent category in industry with 203 occurrences.
Correlation matrix for the numerical features
corr_mtx = df[numerical].corr()
corr_mtx.style.background_gradient(cmap='coolwarm')

|   | number_of_courses_viewed | annual_income | interaction_count | lead_score | converted |
|---|---|---|---|---|---|
| number_of_courses_viewed | 1.000000 | 0.009770 | -0.023565 | -0.004879 | 0.435914 |
| annual_income | 0.009770 | 1.000000 | 0.027036 | 0.015610 | 0.053131 |
| interaction_count | -0.023565 | 0.027036 | 1.000000 | 0.009888 | 0.374573 |
| lead_score | -0.004879 | 0.015610 | 0.009888 | 1.000000 | 0.193673 |
| converted | 0.435914 | 0.053131 | 0.374573 | 0.193673 | 1.000000 |
df[['number_of_courses_viewed', 'annual_income', 'interaction_count']].corrwith(df.lead_score).abs()

number_of_courses_viewed 0.004879
annual_income 0.015610
interaction_count 0.009888
dtype: float64
df[['number_of_courses_viewed', 'annual_income', 'lead_score']].corrwith(df.interaction_count).abs()

number_of_courses_viewed 0.023565
annual_income 0.027036
lead_score 0.009888
dtype: float64
Based on the correlation matrix above, the biggest correlation between two features (excluding the target converted) is between annual_income and interaction_count, with a correlation of 0.027036.
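To double-check this programmatically, here is a quick sanity-check sketch on the same dataframe that finds the largest absolute pairwise correlation among the four numerical features, excluding the target converted and the trivial self-correlations:

```python
# largest absolute correlation between two distinct numerical features
feature_corr = df[['number_of_courses_viewed', 'annual_income',
                   'interaction_count', 'lead_score']].corr().abs()
pairs = feature_corr.unstack()
# drop the self-correlations (always 1.0) and show the top pair (listed twice,
# once for each direction, since the matrix is symmetric)
print(pairs[pairs < 1.0].sort_values(ascending=False).head(2))
```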
Splitting the data
Perform the train/validation/test split with Scikit-Learn’s train_test_split() function.
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=42)
len(df_full_train), len(df_test)

(1169, 293)
To get 20% of the data for validation and 60% for training, we need to set the test_size parameter to 0.25 when splitting df_full_train (because 0.25 * 0.8 = 0.2).
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=42)
len(df_train), len(df_val), len(df_test)

(876, 293, 293)
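As a quick sanity check, the three parts should be roughly a 60%/20%/20% split of the full dataset:

```python
# verify the 60/20/20 proportions of the three splits
n = len(df)
print(len(df_train) / n, len(df_val) / n, len(df_test) / n)
# ~0.599, ~0.200, ~0.200
```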
Make sure that the target variable converted is not left in the feature dataframes.
df_train = df_train.reset_index(drop=True)
df_val = df_val.reset_index(drop=True)
df_test = df_test.reset_index(drop=True)
y_train = df_train.converted.values
y_val = df_val.converted.values
y_test = df_test.converted.values
del df_train["converted"]
del df_val["converted"]
del df_test["converted"]

Feature engineering
The biggest mutual information score
Mutual information is a concept from information theory that measures the amount of information obtained about one random variable by observing another random variable. In the context of feature selection for machine learning, mutual information can be used to quantify the dependency between a feature and the target variable.
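As a small illustration (with made-up toy data, not from the dataset): a categorical feature that fully determines the target gets a high mutual information score, while a mostly unrelated feature scores close to zero.

```python
from sklearn.metrics import mutual_info_score

# toy example: an informative vs. an uninformative categorical feature
informative = ['a', 'a', 'b', 'b', 'a', 'b']    # 'a' always converts, 'b' never does
uninformative = ['x', 'y', 'x', 'y', 'x', 'y']  # largely unrelated to the target
target = [1, 1, 0, 0, 1, 0]

print(mutual_info_score(informative, target))    # ~0.69 (strong dependency)
print(mutual_info_score(uninformative, target))  # ~0.06 (weak dependency)
```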
def mutual_info_category_score(series):
    return mutual_info_score(series, y_train)
mi = df_train[categorical].apply(mutual_info_category_score)
mi.sort_values(ascending=False).round(2)

lead_source 0.04
employment_status 0.01
industry 0.01
location 0.00
dtype: float64
The feature with the biggest mutual information score is lead_source, at 0.04.
len(df_train), len(df_val), len(df_test)

(876, 293, 293)
One-hot encoding
df_train[categorical].nunique()

lead_source 6
industry 8
employment_status 5
location 8
dtype: int64
numerical = ['number_of_courses_viewed', 'annual_income', 'interaction_count', 'lead_score']
train_dicts = df_train[categorical + numerical].to_dict(orient='records')
train_dicts[0]

{'lead_source': 'paid_ads',
'industry': 'retail',
'employment_status': 'student',
'location': 'middle_east',
'number_of_courses_viewed': 0,
'annual_income': 58472.0,
'interaction_count': 5,
'lead_score': 0.03}
dv = DictVectorizer(sparse=False)
dv.fit(train_dicts)
dv.get_feature_names_out()

array(['annual_income', 'employment_status=NA',
'employment_status=employed', 'employment_status=self_employed',
'employment_status=student', 'employment_status=unemployed',
'industry=NA', 'industry=education', 'industry=finance',
'industry=healthcare', 'industry=manufacturing', 'industry=other',
'industry=retail', 'industry=technology', 'interaction_count',
'lead_score', 'lead_source=NA', 'lead_source=events',
'lead_source=organic_search', 'lead_source=paid_ads',
'lead_source=referral', 'lead_source=social_media', 'location=NA',
'location=africa', 'location=asia', 'location=australia',
'location=europe', 'location=middle_east',
'location=north_america', 'location=south_america',
'number_of_courses_viewed'], dtype=object)
X_train = dv.transform(train_dicts)
X_train.shape

(876, 31)
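The 31 columns match what we expect from the cardinalities above: 6 + 8 + 5 + 8 = 27 one-hot columns for the categorical features plus the 4 numerical features. A quick check:

```python
# one-hot columns from the categorical features plus the numerical features
print(df_train[categorical].nunique().sum() + len(numerical))  # 27 + 4 = 31
```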
val_dicts = df_val[categorical + numerical].to_dict(orient='records')
X_val = dv.transform(val_dicts)

Logistic Regression
model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
model.fit(X_train, y_train)

LogisticRegression(max_iter=1000, random_state=42, solver='liblinear')
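As an optional look under the hood (not required for the accuracy questions below), we can pair each learned coefficient with its encoded feature name and see which features the model weights most heavily. This is just an exploratory sketch:

```python
# coefficients of the fitted model, indexed by the DictVectorizer feature names
weights = pd.Series(model.coef_[0], index=dv.get_feature_names_out())
print(weights.sort_values(key=abs, ascending=False).head())
```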
Accuracy on the validation dataset
y_pred = model.predict_proba(X_val)[:, 1]
conv_decision = (y_pred >= 0.5)
original_accuracy = (y_val == conv_decision).mean()
print(f"Original accuracy with all features: {original_accuracy:.2f}\n")Original accuracy with all features: 0.70
Original accuracy with all features: 0.70
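The same number can be obtained with scikit-learn's built-in accuracy_score, which is a handy cross-check of the manual computation above:

```python
from sklearn.metrics import accuracy_score

# should match the manually computed accuracy (~0.70)
print(round(accuracy_score(y_val, conv_decision), 2))
```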
Feature elimination based on the accuracy
features = categorical + numerical
for feature in features:
    selected_features = [f for f in features if f != feature]
    train_dicts = df_train[selected_features].to_dict(orient='records')
    X_train = dv.transform(train_dicts)
    val_dicts = df_val[selected_features].to_dict(orient='records')
    X_val = dv.transform(val_dicts)
    # train the model
    model = LogisticRegression(solver='liblinear', C=1.0, max_iter=1000, random_state=42)
    model.fit(X_train, y_train)
    # evaluate the model
    y_pred = model.predict_proba(X_val)[:, 1]
    conv_decision = (y_pred >= 0.5)
    accuracy = (y_val == conv_decision).mean()
    diff = np.abs(accuracy - original_accuracy)
    print(f"{accuracy:.5f} (difference: {diff:.5f}) accuracy with no '{feature}'")

0.70307 (difference: 0.00341) accuracy with no 'lead_source'
0.69966 (difference: 0.00000) accuracy with no 'industry'
0.69625 (difference: 0.00341) accuracy with no 'employment_status'
0.70990 (difference: 0.01024) accuracy with no 'location'
0.55631 (difference: 0.14334) accuracy with no 'number_of_courses_viewed'
0.85324 (difference: 0.15358) accuracy with no 'annual_income'
0.55631 (difference: 0.14334) accuracy with no 'interaction_count'
0.70648 (difference: 0.00683) accuracy with no 'lead_score'
The smallest difference is for industry, which means that dropping the industry feature leaves the accuracy essentially unchanged compared to the original accuracy with all features.
Best regularization parameter
params = [0.01, 0.1, 1, 10, 100]
for C in params:
    train_dicts = df_train[categorical + numerical].to_dict(orient='records')
    X_train = dv.transform(train_dicts)
    model = LogisticRegression(solver='liblinear', C=C, max_iter=1000, random_state=42)
    model.fit(X_train, y_train)
    y_pred = model.predict_proba(X_val)[:, 1]
    conv_decision = (y_pred >= 0.5)
    accuracy = (y_val == conv_decision).mean()
    print(f"{accuracy:.3f} for parameter C={C}")

0.700 for parameter C=0.01
0.706 for parameter C=0.1
0.706 for parameter C=1
0.706 for parameter C=10
0.706 for parameter C=100
Accuracy is 0.706 for C=0.1, 1, 10, and 100; taking the smallest value among the ties, the best regularization parameter is C=0.1.
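Since several values of C tie at the highest accuracy, the tie can be broken programmatically by taking the smallest C among them. A small sketch using the accuracies printed above (stored in a hypothetical dict rather than collected inside the loop):

```python
# accuracies from the loop above, keyed by C
scores = {0.01: 0.700, 0.1: 0.706, 1: 0.706, 10: 0.706, 100: 0.706}

best_accuracy = max(scores.values())
best_C = min(C for C, acc in scores.items() if acc == best_accuracy)
print(best_C)  # 0.1
```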