Simple Logistic Regression Using Python scikit-learn

logistic regression Python cheat sheet (image by author)

What is Logistic Regression?

Don’t let the name logistic regression trick you: it falls under the category of classification algorithms rather than regression algorithms.

Then, what is a classification model? Simply put, the prediction generated by a classification model is a categorical value, e.g. cat or dog, yes or no, true or false. By contrast, a regression model predicts a continuous numeric value.

Logistic regression makes predictions based on the sigmoid function, an S-shaped curve as shown below. Although the function returns probabilities, the final output is a label assigned by comparing the probability against a threshold, which ultimately makes it a classification algorithm.

simple illustration of the sigmoid function (image by author)

In this article, I’ll walk through the following steps to build a simple logistic regression model using Python scikit-learn:

  1. Data Preprocessing
  2. Feature Engineering and EDA
  3. Model Building
  4. Model Evaluation

The data is taken from the Kaggle public dataset “Rain in Australia”. The goal is to predict the binary target variable “RainTomorrow” based on current information, e.g. temperature, humidity, wind speed, etc. Feel free to grab the full code at the end of the article on my website.

1. Data Preprocessing

First, let’s load the libraries and the dataset.

import libraries and dataset (image by author)
df.head() (image by author)
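The setup can be sketched as below. The Kaggle file name (`weatherAUS.csv`) is an assumption, and a tiny stand-in frame is used here so the snippet runs on its own:

```python
import pandas as pd

# With the real Kaggle "Rain in Australia" download, this would be:
# df = pd.read_csv("weatherAUS.csv")   # file name assumed
# A small stand-in frame keeps the sketch self-contained.
df = pd.DataFrame({
    "MinTemp": [13.4, 7.4, 12.9],
    "Humidity3pm": [22.0, 25.0, None],
    "RainTomorrow": ["No", "Yes", None],
})

print(df.head())       # first rows of the raw data
print(df.describe())   # summary statistics of numeric columns
```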

Use df.describe() to get an overview of the raw data.

df.describe() (image by author)

We cannot always expect the provided data to be perfect for further analysis; in fact, that is rarely the case. Data preprocessing is therefore crucial, and handling missing values in particular is an imperative step to ensure the usability of the dataset. We can use the isnull() function to gauge the scope of the missing data. The following code snippet calculates the missing-value percentage per column.

missing values percentage (image by author)
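A minimal version of that snippet, run here on a toy frame standing in for the Kaggle data:

```python
import pandas as pd

# Toy frame standing in for the "Rain in Australia" data.
df = pd.DataFrame({
    "Sunshine": [None, None, 5.0, 8.1],
    "Rainfall": [0.6, None, 1.2, 0.0],
    "RainTomorrow": ["No", "Yes", "No", None],
})

# Missing-value percentage per column, highest first.
missing_pct = (df.isnull().sum() / len(df) * 100).sort_values(ascending=False)
print(missing_pct)
```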

There are four fields with 38% to 48% missing data. I dropped these columns since the values are most likely missing not at random. For example, a large share of the evaporation figures is missing, which may be constrained by the capacity of the measuring instruments. Consequently, days with more extreme evaporation measures may not have been recorded in the first place, so the remaining numbers are already biased. Keeping these fields could therefore contaminate the input data. If you want to distinguish the three common types of missing data, you may find the article “How to Address Missing Data” helpful.

column-wise and row-wise deletion (image by author)

After performing the column-wise deletions, I deleted rows that are missing the label, “RainTomorrow”, using dropna(). To build a machine learning model we need labels to train and test it, so rows with no label do not help much with either process. However, this portion of the dataset can be set aside as the prediction set after the model is implemented. While handling missing data, it is inevitable that the data shape changes, so df.shape is a helpful attribute for keeping track of the data size. After the manipulation above, the data shape changed from 145460 rows, 23 columns to 142193 rows, 19 columns.
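The two deletion steps can be sketched as follows; the dropped column names mirror the high-missing fields in the article, but the toy data is invented:

```python
import pandas as pd

df = pd.DataFrame({
    "Evaporation": [None, None, 4.8, None],
    "Sunshine": [None, 9.7, None, None],
    "MinTemp": [13.4, 7.4, 12.9, 9.2],
    "RainTomorrow": ["No", "Yes", None, "No"],
})

# Column-wise: drop the fields with a large share of missing values.
df = df.drop(columns=["Evaporation", "Sunshine"])

# Row-wise: drop rows with no label, tracking the shape before and after.
print(df.shape)
df = df.dropna(subset=["RainTomorrow"])
print(df.shape)
```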

For the remaining columns, I imputed the categorical variables and the numerical variables separately. The code below classifies the columns into a categorical list and a numerical list, which will also come in handy in the later EDA process.

separate numerical and categorical variables (image by author)
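One way to build the two lists is to split the columns by dtype; the list names num_list and cat_list follow the article’s naming, the toy columns are assumptions:

```python
import pandas as pd

df = pd.DataFrame({
    "MinTemp": [13.4, 7.4, None],
    "WindGustDir": ["W", None, "NE"],
    "RainToday": ["No", "Yes", "No"],
})

# Numeric dtypes go to num_list; object (string) columns are
# treated as categorical and go to cat_list.
num_list = df.select_dtypes(include=["number"]).columns.tolist()
cat_list = df.select_dtypes(include=["object"]).columns.tolist()
print(num_list, cat_list)
```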
  • Numerical Variables: impute missing values with the mean of the variable. Note that combining df.fillna() and df.mean() is enough to transform only the numerical variables.
address numerical missing values (image by author)
  • Categorical Variables: iterate through cat_list and replace missing values with “Unknown”
address categorical missing values (image by author)

2. Feature Engineering and EDA

Coupling these two processes together helps with choosing the appropriate feature engineering techniques based on the distribution and characteristics of the dataset.

In this example, I didn’t go in depth into the exploratory data analysis (EDA) process. If you are interested in learning more, feel free to read my article with a more comprehensive EDA guide.

I automated the univariate analysis with a for loop. When a numerical variable is encountered, a histogram is generated to visualize the distribution; a bar chart is created for each categorical variable instead.

univariate analysis (image by author)
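A sketch of such a loop, assuming matplotlib for the charts (rendered off-screen here so it runs anywhere; the chart_kind dict just records which plot each column received):

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # off-screen backend; no display needed
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "Rainfall": [0.0, 0.6, 3.2, 0.0],
    "WindGustDir": ["W", "NE", "W", "S"],
})

# Histogram for numerical columns, bar chart for categorical ones.
chart_kind = {}
for col in df.columns:
    fig, ax = plt.subplots()
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col].hist(ax=ax)
        chart_kind[col] = "hist"
    else:
        df[col].value_counts().plot(kind="bar", ax=ax)
        chart_kind[col] = "bar"
    ax.set_title(col)
    plt.close(fig)

print(chart_kind)
```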

1) Address Outliers

Now that we have a holistic view of the data distribution, it is much easier to spot outliers. For instance, Rainfall has a heavily right-skewed distribution, indicating that there is at least one significantly extreme record.

To eliminate the outliers, I used quantile(0.9) to limit the dataset to the values that fall within the 90th percentile. As a result, the upper bound of the Rainfall values dropped significantly, from 350 to 6.
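A minimal sketch of the quantile filter, on invented Rainfall values chosen so the one extreme record (350) sits above the 90th percentile:

```python
import pandas as pd

df = pd.DataFrame(
    {"Rainfall": [0.0, 0.2, 0.6, 1.0, 1.8, 2.4, 3.0, 4.0, 6.0, 350.0]}
)

# Keep only rows at or below the 90th percentile of Rainfall,
# discarding the extreme record.
upper = df["Rainfall"].quantile(0.9)
df = df[df["Rainfall"] <= upper]
print(df["Rainfall"].max())
```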

