Performance of Different Machine Learning Models on a Classification Problem

Classifier machine learning models

Knowing the best model to use when solving a business problem can be very handy. However, this depends on the type of problem being solved. In this article we will tackle a classification problem: we consider a marketing campaign of a Portuguese bank and predict which customers will subscribe to a bank term deposit.

This automatically narrows our selection to machine learning models that fit this particular use case. Since there are vast options when it comes to this kind of problem, we will use three models and compare their performance before and after hyperparameter tuning. We will first look at the structure of our dataset, do univariate and bivariate analysis on the data, preprocess it so it is ready for machine learning models, and eventually make predictions with our models. The models we will use are Logistic Regression, XGBoost, and Multilayer Perceptron. We will look briefly at what these models are and how they fit the problem at hand before implementing them.

To get a quick overview of our dataset, we will use df.info(). This way we can see the size of the dataset, the features we have, and their data types. We can also easily tell whether any of the features have missing values.
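A minimal sketch of loading and inspecting the data, assuming the UCI bank-additional-full.csv file, which uses a semicolon separator:

```python
import pandas as pd

# Load the bank marketing dataset (file name and separator are assumptions
# based on the UCI "bank-additional-full.csv" distribution).
df = pd.read_csv("bank-additional-full.csv", sep=";")

# Column names, dtypes, non-null counts and memory usage in one call.
df.info()
```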

From the above analysis, we have 11 categorical features and 21 features in total, with 41188 rows in the dataset. Before we jump to univariate and bivariate analysis, let's have a look at the statistical description of our columns. This can quickly be done with the .describe() function.
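For reference, a quick sketch of the statistical summary; the include="object" call is an optional extra for the categorical columns:

```python
# Count, mean, std and quartiles for the numeric columns.
print(df.describe())

# Counts, unique values and the most frequent category for categorical columns.
print(df.describe(include="object"))
```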

We will look at the distribution of our features at this point. Plotting the features will help us quickly grasp the distribution of the values.
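A minimal sketch of such plots; the exact set of columns plotted here is an illustrative choice, not necessarily the article's original selection:

```python
import matplotlib.pyplot as plt

# Histograms of all numeric features.
df.hist(figsize=(14, 10), bins=30)
plt.tight_layout()
plt.show()

# Bar plots for a few categorical features (illustrative selection).
for col in ["education", "marital", "y"]:
    df[col].value_counts().plot(kind="bar", title=col)
    plt.show()
```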

For the age feature, most of the bank's customers are in their 30s. For the education feature, the high school and university degree categories have the highest frequencies. Married clients outnumber the other marital statuses in the dataset. Overall, the number of clients who did not subscribe to the bank's term deposit is greater than the number who did.

How do the features correlate with each other? Does this affect the client's subscription to the bank's term deposit? We will plot a heat map to quickly grasp the trend in feature correlation.

Heatmap showing correlation of features

The highly concentrated parts of the heatmap denote high correlation between features.
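A sketch of how such a heatmap can be produced, assuming seaborn is available:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix over the numeric features only.
corr = df.select_dtypes(include="number").corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation of numeric features")
plt.show()
```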

Since our main objective is to apply machine learning to the dataset, we will start the preprocessing phase shortly. First, though, note that the box plot shows the presence of outliers in the dataset, which we will have to deal with before modelling.
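A minimal sketch of the box plots referred to above:

```python
import matplotlib.pyplot as plt

# One box plot per numeric feature to make outliers visible.
df.select_dtypes(include="number").plot(
    kind="box", subplots=True, layout=(3, 4), figsize=(14, 8))
plt.tight_layout()
plt.show()
```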

Before jumping into the preprocessing part, let's get some information on one of the features in the dataset. The "duration" feature strongly affects the output variable (e.g., if duration=0 then y='no'). Yet the duration is not known before a call is performed, and once the call ends, y is obviously known. In order to have a realistic predictive model, we will drop this attribute at this point.
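Dropping the column is a one-liner:

```python
# "duration" is only known after the call, so keeping it would leak the label.
df = df.drop(columns=["duration"])
```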

Looking at our dataframe at this point, we have seen the correlation between the features. We have also seen the distribution of the features across the dataset. We will now transform our data into a form that can be used for machine learning. This will involve the following steps:

1. Feature Encoding

We drop the original y column since we have encoded the subscription variable. Having the output variable in the first or the last position is always my norm, so we will rearrange the dataset columns to achieve this.
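A sketch of one way to do this; the column name "subscription" and the use of pd.get_dummies are assumptions, since the article's exact encoding code is not shown:

```python
import pandas as pd

# Encode the target as 0/1 in a new "subscription" column, then drop "y".
df["subscription"] = (df["y"] == "yes").astype(int)
df = df.drop(columns=["y"])

# One-hot encode the remaining categorical features.
df = pd.get_dummies(df, drop_first=True)

# Move the encoded target to the last position.
cols = [c for c in df.columns if c != "subscription"] + ["subscription"]
df = df[cols]
```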

2. Feature Standardization

Features measured on very different scales can dominate one another, hence the need to standardize. Standardization rescales each feature to zero mean and unit variance, which also speeds up the optimization in many algorithms. We will use the StandardScaler class from the sklearn preprocessing module.

At this point we will also separate our independent variables from the dependent variable.
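A minimal sketch, reusing the "subscription" column name assumed above:

```python
from sklearn.preprocessing import StandardScaler

# Independent variables (X) and the target (y).
X = df.drop(columns=["subscription"])
y = df["subscription"]

# Rescale every feature to zero mean and unit variance.
scaler = StandardScaler()
X = scaler.fit_transform(X)
```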

3. Dealing with class imbalance

Having an underrepresented class in the target variable will lead to a biased machine learning model. For example, let's look at a visualization of the target variable.

subscription distribution

We can clearly see that the "yes" values in the subscription feature are greatly underrepresented. This calls for dealing with the class imbalance in the dataset. We will use the RandomOverSampler class, which comes from the imbalanced-learn package. We will need to install this package before using RandomOverSampler.
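A sketch of the oversampling step; the random_state is an illustrative choice:

```python
# Requires: pip install imbalanced-learn
from imblearn.over_sampling import RandomOverSampler

# Duplicate minority-class rows until both classes are equally represented.
ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(X, y)
```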

4. Dimension Reduction

One drawback of using a one-hot encoder is that it creates dummy features, multiplying the total number of features. This forces us to do dimensionality reduction on our independent features, since we do not want to use so many features. The technique we will use for this purpose is Principal Component Analysis, which we will refer to as PCA. This will reduce our dataframe to only ten components. We reduce to ten in this project for demonstration purposes, but you are free to reduce it to an even lower number.
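A minimal sketch of the PCA step on the resampled data:

```python
from sklearn.decomposition import PCA

# Project the one-hot-expanded feature space down to ten components.
pca = PCA(n_components=10)
X_pca = pca.fit_transform(X_res)

# Fraction of the original variance the ten components retain.
print(pca.explained_variance_ratio_.sum())
```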

We will save our preprocessed dataset at this point and head straight to machine learning models.

We will compare the performance of the three machine learning models and decide which one suits the problem at hand best. As a first stepping stone, we will build a dummy classifier to act as our base model, against which we will compare the performance of our outlined models.
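A sketch of the baseline, assuming a simple hold-out split; the article's exact baseline configuration is not shown:

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_pca, y_res, test_size=0.2, random_state=42)

# Always predicts the most frequent class; any real model must beat this.
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)
print(accuracy_score(y_test, dummy.predict(X_test)))
```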

We will train our models using both the KFold and Stratified KFold techniques and compare the two to see which performs better, applying KFold and Stratified KFold to each model before and after hyperparameter tuning.

KFold is a cross-validation technique in which a resampling procedure is used to evaluate machine learning models on a limited data sample. The procedure has a single parameter, k, that refers to the number of groups a given data sample is to be split into; as such, the procedure is often called k-fold cross-validation. When a specific value for k is chosen, it may be used in place of k in the reference to the model, such as k=10 becoming 10-fold cross-validation. Stratified KFold is a variation that preserves the class proportions in every fold, which matters when the target classes are imbalanced.

We will create a function for this task so we can reuse it for the different models we have.
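A sketch of such a helper; the name evaluate_model, the ten splits, and the accuracy metric are illustrative assumptions:

```python
from sklearn.model_selection import KFold, StratifiedKFold, cross_val_score

def evaluate_model(model, X, y, n_splits=10):
    """Print mean accuracy under plain and stratified k-fold CV."""
    folds = {
        "KFold": KFold(n_splits=n_splits, shuffle=True, random_state=42),
        "StratifiedKFold": StratifiedKFold(
            n_splits=n_splits, shuffle=True, random_state=42),
    }
    for name, cv in folds.items():
        scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
        print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```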

Logistic regression is a supervised learning classification algorithm used to predict the probability of a target variable. The nature of the target or dependent variable is dichotomous, which means there are only two possible classes. In simple words, the dependent variable is binary, with data coded as either 1 (stands for success/yes) or 0 (stands for failure/no).

Mathematically, a logistic regression model predicts P(Y=1) as a function of X. It is one of the simplest machine learning algorithms, can be used for various classification problems, and hence suits our use case.

Let's look at the performance metrics using KFold and Stratified KFold.
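A sketch of running logistic regression through the helper defined above:

```python
from sklearn.linear_model import LogisticRegression

# max_iter raised so the solver converges on the standardized data.
evaluate_model(LogisticRegression(max_iter=1000), X_pca, y_res)
```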

Let's look at XGBoost's performance metrics using KFold and Stratified KFold.
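A sketch of the same evaluation for XGBoost, assuming the xgboost package is installed:

```python
from xgboost import XGBClassifier

# Default tree booster; eval_metric set explicitly to silence warnings.
evaluate_model(XGBClassifier(eval_metric="logloss"), X_pca, y_res)
```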

A multilayer perceptron (MLP) is a class of feedforward artificial neural network (ANN). A perceptron is a single-neuron model that was a precursor to larger neural networks. Neural networks are a field that investigates how simple models of biological brains can be used to solve difficult computational tasks, like the predictive modeling tasks we see in machine learning. The goal is not to create realistic models of the brain, but instead to develop robust algorithms and data structures that we can use to model difficult problems.

The power of neural networks comes from their ability to learn the representation in your training data and how to best relate it to the output variable that you want to predict. In this sense, neural networks learn a mapping. Mathematically, they are capable of learning any mapping function and have been proven to be universal approximators.

We will use the Multilayer Perceptron classifier for the problem at hand. Since it is a class of neural network, we will import it from sklearn.

Let's look at MLP's performance metrics using KFold and Stratified KFold.
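A sketch of the MLP evaluation; the single 100-unit hidden layer and iteration cap are illustrative defaults, not the article's confirmed settings:

```python
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(hidden_layer_sizes=(100,), max_iter=500, random_state=42)
evaluate_model(mlp, X_pca, y_res)
```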

MLP ROC Curve

From the three models, we can conclude that XGBoost has the best performance, with an accuracy score of 85%. However, model performance always depends on the dataset you have. Another way to affect model performance is hyperparameter tuning. For example, tuning hyperparameters in XGBoost can be done as follows:
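The original tuning code is not shown here, so the following is a hedged sketch using GridSearchCV with an illustrative parameter grid:

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# Illustrative grid; widen or narrow it depending on your compute budget.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [3, 6],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(XGBClassifier(eval_metric="logloss"),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_pca, y_res)
print(search.best_params_, search.best_score_)
```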

We will look at other machine learning models like SVM and Decision Trees and investigate their performance metrics. We will also perform hyperparameter tuning for our models.

