Variable screening is an important step in contemporary EDA for predictive modeling: what can we tell about the nature of the relationships between a set of predictors and the dependent variable before entering the modeling phase? Can we infer something about the predictive power of the independent variables before we start rolling them into a predictive model? In this blog post I will discuss two information-theoretic measures that are common in variable screening for binary classification and regression models in the credit risk arena (which in no way prevents them from doing you some good in any other application of predictive modeling as well). I will first introduce the Weight of Evidence (WoE) and Information Value (IV) of a variable with respect to a binary outcome. Then I will illustrate their computation (it’s fairly easy, in fact) with the {Information} package in R.

### Weight of Evidence

Take the common Bayesian hypothesis test (or a Bayes factor, if you prefer):

$$K = \frac{P(D \mid M_1)}{P(D \mid M_2)}$$

and assume your models M1, M2 of the world\* are simply the two possible states of a binary variable Y, while the data are given by discrete distributions of some predictor X (or X stands for a binned continuous distribution); for every category j in X, j = 1, 2, ..., n, take the log:

$$\mathrm{WoE}_j = \log \frac{P(x_j \mid Y = 1)}{P(x_j \mid Y = 0)}$$

and you will get a simple measure of evidence in favor of M1 against M2 that Good has described as Weight of Evidence (WoE). In theory, any monotonic transformation of the odds would do, but the logarithm brings the intuitive advantage of yielding a negative WoE when the odds are less than one and a positive WoE when they are higher than one. In the simplified setting, where the analysis involves a binary dependent Y and a discrete (or binned continuous) predictor X, we are simply inspecting the conditional distribution of X given Y:

$$\mathrm{WoE}_j = \log \frac{f(x_j \mid Y = 1) \,/\, \sum_{k=1}^{n} f(x_k \mid Y = 1)}{f(x_j \mid Y = 0) \,/\, \sum_{k=1}^{n} f(x_k \mid Y = 0)}$$

where f denotes counts.
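As a minimal sketch of WoE computed from counts, here is a base R example on made-up data. The variables `age` and `y`, the simulated relationship between them, and the choice of five quantile-based bins are all assumptions for illustration; the binning is not necessarily identical to what {Information} does.

```r
# Toy data (made up for illustration): a binary outcome y and a numeric predictor.
set.seed(42)
n   <- 1000
age <- round(runif(n, 18, 80))
# Assumed relationship: the probability of y = 1 increases with age.
y   <- rbinom(n, size = 1, prob = plogis(-2 + 0.03 * (age - 18)))

# Bin the continuous predictor into (at most) 5 quantile-based bins.
breaks <- unique(quantile(age, probs = seq(0, 1, length.out = 6)))
ageBin <- cut(age, breaks = breaks, include.lowest = TRUE)

# Counts f(x_j | Y): rows are bins of X, columns are the two outcomes.
counts <- table(ageBin, y)

# Conditional distributions P(x_j | Y): each column divided by its total.
condDist <- prop.table(counts, margin = 2)

# WoE_j = log( P(x_j | Y = 1) / P(x_j | Y = 0) )
woe <- log(condDist[, "1"] / condDist[, "0"])
print(woe)
```

Bins in which y = 1 is over-represented relative to y = 0 get a positive WoE, and the others a negative one. A bin with zero counts for either outcome would produce an infinite WoE, so in practice such bins need merging or smoothing.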

Let’s illustrate the computation of WoE in this setting for a variable from a well-known dataset**. We have one categorical, binary dependent:

In the previous example I have used exactly the same approach to binning X (age, in this case) that is used in the R package {Information}, whose application I want to illustrate later. The table() call provides the conditional distributions, like the ones shown in the table above. The computation of WoE is then straightforward, as exemplified in the last line. However, you want to spare yourself from computing the WoE in this way for many variables in the dataset, and that’s where {Information} in R comes in handy; for the same dataset:

with the respective data frames in infoTables$Tables standing for the variables in the dataset.
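As a rough, self-contained illustration of that call pattern, the sketch below runs create_infotables() on simulated data rather than on the dataset used in this post. The data frame `toy` and its columns are made up, and the sketch assumes that {Information} is installed and that the dependent variable is coded as numeric 0/1.

```r
# Sketch only: simulated data stand in for a real dataset.
if (requireNamespace("Information", quietly = TRUE)) {
  library(Information)

  set.seed(1)
  n <- 2000
  toy <- data.frame(
    age     = round(runif(n, 18, 80)),
    balance = rnorm(n, mean = 1000, sd = 500)
  )
  # Binary dependent, coded as numeric 0/1 (assumed relationship with age).
  toy$y <- rbinom(n, size = 1, prob = plogis(-2 + 0.03 * (toy$age - 18)))

  # parallel = FALSE keeps this small example on a single core.
  infoTables <- create_infotables(data = toy, y = "y", bins = 10,
                                  parallel = FALSE)

  # One data frame per predictor, named after the variable.
  print(names(infoTables$Tables))
  print(infoTables$Tables$age)
}
```

Each element of `infoTables$Tables` lists the bins of one predictor together with their WoE values.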

### Information Value

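A straightforward definition of the Information Value (IV) of a variable is provided in the {Information} package vignette:

$$\mathrm{IV} = \sum_{j=1}^{n} \left( P(x_j \mid Y = 1) - P(x_j \mid Y = 0) \right) \cdot \mathrm{WoE}_j$$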

In effect, this means that we are summing across the individual WoE values (i.e. one for each bin j of X), weighting each by the respective difference between P(xj|Y=1) and P(xj|Y=0). The IV of a variable measures its predictive power, and variables with IV < .05 are generally considered to have low predictive power.
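The weighted sum can be sketched in a few lines of base R. The counts below are invented for illustration, and the names `counts`, `p1`, `p0`, `woe` and `iv` are hypothetical.

```r
# Made-up counts per bin of X for each outcome (illustrative numbers only).
counts <- matrix(
  c(400, 20,
    300, 30,
    200, 50),
  ncol = 2, byrow = TRUE,
  dimnames = list(bin = c("low", "mid", "high"), y = c("0", "1"))
)

p1 <- counts[, "1"] / sum(counts[, "1"])  # P(x_j | Y = 1)
p0 <- counts[, "0"] / sum(counts[, "0"])  # P(x_j | Y = 0)

# Per-bin Weight of Evidence.
woe <- log(p1 / p0)

# IV: WoE values weighted by the differences P(x_j | Y = 1) - P(x_j | Y = 0).
iv <- sum((p1 - p0) * woe)
print(round(woe, 3))
print(iv)
```

Each term of the sum is non-negative, because the difference p1 - p0 and log(p1 / p0) always share the same sign, so the IV only grows as the two conditional distributions move apart.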

Using {Information} in R, for the dataset under consideration:

You may have noted the usage of parallel = T in the create_infotables() call; the {Information} package will try to use all available cores to speed up the computations by default. Besides the basic package functionality that I have illustrated, the package provides a natural way of dealing with uplift models, where the computation of the IVs for the control vs. treatment designs is nicely automated. Cross-validation procedures are also built-in.
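For a quick look at the per-variable IVs without any parallelism, a minimal sketch on simulated data might look as follows. The data frame `toy` and its columns `signal` and `noise` are made up, and the sketch assumes {Information} is installed; setting parallel = FALSE is simply the sequential alternative to the call discussed above.

```r
# Sketch only: two simulated predictors, one related to the outcome, one not.
if (requireNamespace("Information", quietly = TRUE)) {
  library(Information)

  set.seed(2)
  n <- 2000
  toy <- data.frame(
    signal = rnorm(n),  # related to the outcome by construction
    noise  = rnorm(n)   # unrelated to the outcome by construction
  )
  toy$y <- rbinom(n, size = 1, prob = plogis(-1 + 1.2 * toy$signal))

  # parallel = FALSE: run sequentially instead of on multiple cores.
  infoTables <- create_infotables(data = toy, y = "y", bins = 10,
                                  parallel = FALSE)

  # One row per predictor, with its Information Value.
  print(infoTables$Summary)
}
```

In this toy setup `signal` should come out with a clearly higher IV than `noise`, although a pure-noise predictor will rarely have an IV of exactly zero in a finite sample.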

However, now that we know that we have a nice, working package for WoE and IV estimation in R, let’s restrain ourselves from using it to perform automatic feature selection for models like binary logistic regression. While the information-theoretic measures discussed here truly assess the predictive power of a single predictor in binary classification, building a model that encompasses multiple terms is another story. Do not get disappointed if you find that the AICs for the full models are still lower than those for the nested models obtained by feature selection based on the IV values; although they can provide useful guidelines, WoE and IV are not even meant to be used that way (I’ve tried… once with the dataset used in the previous examples, and then with the two {Information} built-in datasets; not too much of a success, as you may have guessed).

*References:*

*\* For parametric models, you would need to integrate over the full parameter space, of course; taking the MLEs would result in obtaining the standard LR test.*

*\*\* The dataset is considered in S. Moro, P. Cortez and P. Rita (2014). A Data-Driven Approach to Predict the Success of Bank Telemarketing. Decision Support Systems, Elsevier, 62:22-31, June 2014. I have obtained it from: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing (N.B. https://archive.ics.uci.edu/ml/machine-learning-databases/00222/, file: bank-additional.zip); a nice description of the dataset is found at: http://www2.1010data.com/documentationcenter/beta/Tutorials/MachineLearningExamples/BankMarketingDataSet.html*

Written by: Smart Cat

June 9, 2021