Mastering Logistic Regression: A Comprehensive Guide to WoE and IV Calculation.
Weight of Evidence (WoE) and Information Value (IV) are widely used in credit scoring to distinguish creditworthy from non-creditworthy individuals. When working through these calculations, we frequently encounter the labels ‘good’ and ‘bad’ customers. In this context, ‘bad customers’ are those who have defaulted on their loans, while ‘good customers’ are those who have met their financial obligations.
To illustrate these concepts, we will use the Titanic - Machine Learning from Disaster dataset, looking at survival broken down by gender. The aim is to make the calculations of WoE and IV concrete and approachable. The table below gives, for each sector, the number of passengers who died and the number who survived, and it is the starting point for everything that follows.
Sector | # Died | # Survived |
---|---|---|
female | 81 | 233 |
male | 468 | 109 |
Total | 549 | 342 |
Targets within each segment:
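What is commonly referred to as ‘good’ is the distribution of the non-occurrences of the target, that is, the share of all non-occurrences that falls in sector ‘i’:
$$\%\,\text{non-occurrences}_i = \frac{\#\,\text{non-occurrences}_i}{\#\,\text{non-occurrences}_{\text{Total}}}$$
Let’s consider the chosen sector as female.
For this problem, a non-occurrence is a passenger who died:
$$\%\,\text{Died}_{\text{female}} = \frac{81}{549} \approx 0.15$$
What is typically described as ‘bad’ is the distribution of the occurrences of the target, that is, the share of all occurrences that falls in sector ‘i’:
$$\%\,\text{occurrences}_i = \frac{\#\,\text{occurrences}_i}{\#\,\text{occurrences}_{\text{Total}}}$$
For this problem, in the female sector, an occurrence is a passenger who survived:
$$\%\,\text{Survived}_{\text{female}} = \frac{233}{342} \approx 0.68$$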
Sector | # Died | # Survived | % Died | % Survived |
---|---|---|---|---|
female | 81 | 233 | 81/549 | 233/342 |
male | 468 | 109 | 468/549 | 109/342 |
Total | 549 | 342 | 1 | 1 |
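
The two shares can be reproduced in a few lines of Python. This is a minimal sketch that assumes the counts from the table are typed in by hand; the names `counts`, `pct_died` and `pct_survived` are illustrative and not part of any library.

```python
# Counts from the table: (passengers who died, passengers who survived).
counts = {"female": (81, 233), "male": (468, 109)}

# Column totals: 549 passengers died and 342 survived.
total_died = sum(died for died, _ in counts.values())
total_survived = sum(survived for _, survived in counts.values())

# Share of each outcome that falls in each sector.
pct_died = {s: died / total_died for s, (died, _) in counts.items()}
pct_survived = {s: surv / total_survived for s, (_, surv) in counts.items()}

for sector in counts:
    print(f"{sector}: % died = {pct_died[sector]:.2f}, "
          f"% survived = {pct_survived[sector]:.2f}")
```

Each dictionary sums to 1 across sectors, which matches the Total row of the table.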
Percentage of population in the study sector:
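The percentage of population in the study sector is the proportion of the total population that falls in a particular sector:
$$\%\,\text{Population}_i = \frac{\#\,\text{Population}_i}{\#\,\text{Population}_{\text{Total}}}$$
Let’s calculate the percentage of population for the chosen sector, which is the female sector in this case:
$$\%\,\text{Population}_{\text{female}} = \frac{81 + 233}{549 + 342} = \frac{314}{891} \approx 0.35$$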
Now, let’s examine the table that presents the statistics:
Sector | # Died | # Survived | % Died | % Survived | % Population |
---|---|---|---|---|---|
female | 81 | 233 | 0.15 | 0.68 | 314/891 |
male | 468 | 109 | 0.85 | 0.32 | 577/891 |
Total | 549 | 342 | 1 | 1 | 1 |
This measure shows how much of the overall population each sector represents: here about 35% of the passengers are female and about 65% are male. Knowing this distribution helps when interpreting the results, because metrics computed on a very small sector are less reliable.
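
As a quick check, the population share can be computed directly from the counts. The sketch below assumes the same hand-typed counts as the table; `pct_population` is an illustrative name.

```python
# Counts from the table: (passengers who died, passengers who survived).
counts = {"female": (81, 233), "male": (468, 109)}

# Everyone in the study, regardless of outcome: 549 + 342 = 891 passengers.
grand_total = sum(died + survived for died, survived in counts.values())

# Share of the whole population that falls in each sector.
pct_population = {
    sector: (died + survived) / grand_total
    for sector, (died, survived) in counts.items()
}

for sector, share in pct_population.items():
    print(f"{sector}: {share:.2f}")
```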
Distribution of the targets within each segment (Distr):
The distribution for sector ‘i’ is the proportion of sector ‘i’ among the non-occurrences of the target divided by the proportion of sector ‘i’ among the occurrences:
$$\text{Distr}_i = \frac{\%\,\text{non-occurrences}_i}{\%\,\text{occurrences}_i}$$
Likewise, the distribution for the female category is the percentage of females among those who died divided by the percentage of females among those who survived:
$$\text{Distr}_{\text{female}} = \frac{\%\,\text{Died}_{\text{female}}}{\%\,\text{Survived}_{\text{female}}} = \frac{81/549}{233/342} \approx 0.22$$
Sector | # Died | # Survived | % Died | % Survived | % Population | Distr |
---|---|---|---|---|---|---|
female | 81 | 233 | 0.15 | 0.68 | 0.35 | 0.15/0.68 |
male | 468 | 109 | 0.85 | 0.32 | 0.65 | 0.85/0.32 |
Total | 549 | 342 | 1 | 1 | 1 | |
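
The ratio can be sketched in Python as follows. The code assumes the convention described above (share among those who died divided by share among those who survived); the variable names are illustrative.

```python
# Counts from the table: (passengers who died, passengers who survived).
counts = {"female": (81, 233), "male": (468, 109)}

total_died = sum(died for died, _ in counts.values())
total_survived = sum(survived for _, survived in counts.values())

# Distr_i = (% of deaths in sector i) / (% of survivors in sector i).
distr = {
    sector: (died / total_died) / (survived / total_survived)
    for sector, (died, survived) in counts.items()
}

for sector, ratio in distr.items():
    print(f"{sector}: Distr = {ratio:.2f}")
```

A ratio below 1 means the sector is under-represented among those who died relative to those who survived, and a ratio above 1 means the opposite.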
Weight of Evidence (WoE):
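It can be calculated using the natural logarithm of the ‘Distr’ for each sector:
$$\text{WoE}_i = \ln(\text{Distr}_i)$$
Let’s consider the female sector as an example. The tables round Distr to 0.22 for display, but the WoE is computed from the unrounded ratio:
$$\text{WoE}_{\text{female}} = \ln\left(\frac{81/549}{233/342}\right) \approx \ln(0.2166) \approx -1.53$$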
Now, let’s examine the table that presents the statistics:
Sector | # Died | # Survived | % Died | % Survived | % Population | Distr | WoE |
---|---|---|---|---|---|---|---|
female | 81 | 233 | 0.15 | 0.68 | 0.35 | 0.22 | ln(0.22) |
male | 468 | 109 | 0.85 | 0.32 | 0.65 | 2.67 | ln(2.67) |
Total | 549 | 342 | 1 | 1 | 1 | | |
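The sign and size of the WoE show how strongly a sector separates the two outcomes. With the convention used here, the negative WoE for females means they make up a smaller share of those who died than of those who survived, while the positive WoE for males means the opposite.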
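
The WoE step is a single call to `math.log` on the unrounded ratio. The sketch below assumes the same counts and the same died-over-survived convention as the tables; `woe` is an illustrative name.

```python
import math

# Counts from the table: (passengers who died, passengers who survived).
counts = {"female": (81, 233), "male": (468, 109)}

total_died = sum(died for died, _ in counts.values())
total_survived = sum(survived for _, survived in counts.values())

# WoE_i = ln(Distr_i), computed from the unrounded ratio.
woe = {
    sector: math.log((died / total_died) / (survived / total_survived))
    for sector, (died, survived) in counts.items()
}

for sector, value in woe.items():
    print(f"{sector}: WoE = {value:.2f}")
```

Taking the logarithm of the rounded value 0.22 instead would give about -1.51, which is why the unrounded ratio is used here.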
Information Value (IV):
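It can be calculated using the following formula, where the contribution of each sector is summed to give the IV of the variable:
$$\text{IV}_i = \left(\%\,\text{non-occurrences}_i - \%\,\text{occurrences}_i\right) \times \text{WoE}_i, \qquad \text{IV} = \sum_i \text{IV}_i$$
Let’s consider the female sector as an example:
$$\text{IV}_{\text{female}} = \left(\frac{81}{549} - \frac{233}{342}\right) \times (-1.53) \approx 0.82$$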
Sector | # Died | # Survived | % Died | % Survived | % Population | Distr | WoE | IV |
---|---|---|---|---|---|---|---|---|
female | 81 | 233 | 0.15 | 0.68 | 0.35 | 0.22 | -1.53 | (0.15 - 0.68) × (-1.53) |
male | 468 | 109 | 0.85 | 0.32 | 0.65 | 2.67 | 0.98 | (0.85 - 0.32) × 0.98 |
Total | 549 | 342 | 1 | 1 | 1 | | | |
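
The per-sector contributions and their sum can be sketched like this. The code assumes the same counts and convention as the tables and works with unrounded intermediate values; `iv_by_sector` and `total_iv` are illustrative names.

```python
import math

# Counts from the table: (passengers who died, passengers who survived).
counts = {"female": (81, 233), "male": (468, 109)}

total_died = sum(died for died, _ in counts.values())
total_survived = sum(survived for _, survived in counts.values())

iv_by_sector = {}
for sector, (died, survived) in counts.items():
    pct_died = died / total_died
    pct_survived = survived / total_survived
    woe = math.log(pct_died / pct_survived)
    # IV_i = (% died - % survived) * WoE_i; both factors share the same sign,
    # so each contribution is non-negative.
    iv_by_sector[sector] = (pct_died - pct_survived) * woe

total_iv = sum(iv_by_sector.values())

for sector, value in iv_by_sector.items():
    print(f"{sector}: IV = {value:.2f}")
print(f"Total IV = {total_iv:.2f}")
```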
If you’re interested in the classification of IV values, you can find it at this link.
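The table with all the calculated metrics looks as follows. Each value is rounded for display, and the total IV is computed from unrounded values, so it comes to 1.34 rather than the 1.35 obtained by adding the rounded rows:
Sector | # Died | # Survived | % Died | % Survived | % Population | Distr | WoE | IV |
---|---|---|---|---|---|---|---|---|
female | 81 | 233 | 0.15 | 0.68 | 0.35 | 0.22 | -1.53 | 0.82 |
male | 468 | 109 | 0.85 | 0.32 | 0.65 | 2.67 | 0.98 | 0.53 |
Total | 549 | 342 | 1 | 1 | 1 | | | 1.34 |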
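
To reproduce the whole table in one pass, the steps can be wrapped in a small helper. This is a sketch rather than a reference implementation: `woe_iv_table` is a hypothetical function name, the input format (sector mapped to a pair of counts) is an assumption, and raising an error on zero counts is one possible design choice among several.

```python
import math


def woe_iv_table(counts):
    """Return (rows, total_iv) for counts of the form
    {sector: (non_occurrences, occurrences)}."""
    total_non = sum(non for non, _ in counts.values())
    total_occ = sum(occ for _, occ in counts.values())
    grand_total = total_non + total_occ

    rows = {}
    for sector, (non, occ) in counts.items():
        if non == 0 or occ == 0:
            # The ratio or its logarithm is undefined for an empty cell;
            # how to handle that case is left to the caller in this sketch.
            raise ValueError(f"sector {sector!r} has a zero count")
        pct_non = non / total_non
        pct_occ = occ / total_occ
        distr = pct_non / pct_occ
        woe = math.log(distr)
        rows[sector] = {
            "pct_non": pct_non,
            "pct_occ": pct_occ,
            "pct_population": (non + occ) / grand_total,
            "distr": distr,
            "woe": woe,
            "iv": (pct_non - pct_occ) * woe,
        }
    total_iv = sum(row["iv"] for row in rows.values())
    return rows, total_iv


# Titanic counts: (passengers who died, passengers who survived).
rows, total_iv = woe_iv_table({"female": (81, 233), "male": (468, 109)})
for sector, row in rows.items():
    print(sector, {name: round(value, 2) for name, value in row.items()})
print("IV =", round(total_iv, 2))
```

Keeping the intermediate columns in the returned rows makes it easy to compare each one against the table by hand.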
For a more detailed explanation of WoE and IV, I have written a separate post that goes deeper into these concepts. You can access it here.
If you need to perform these calculations in Python, I have created another post with the corresponding formulas, which can be accessed at this link.
For additional support, I have compiled supplementary materials related to this post on my GitHub. These resources, accessible in the supporting materials repository, are intended to help with the practical implementation of IV and WoE calculations.
If you have any further questions or need more information, I’m here to help!