This part of the project documentation focuses on a problem-oriented approach. You'll tackle common tasks that you might have, with the help of the code provided in this project.

Table of Contents

  1. How To Perform WoE Transformation
  2. How To Explore Variables' Explanatory Power
  3. How To Create a Logit Model
  4. How To Build Model Ensembles
  5. How To Aggregate With Bagging Scheme
  6. How To Aggregate With Stacking Scheme
  7. How To Calibrate Models
  8. How To Transform PD To Score

How To Perform WoE Transformation

To perform the Weight of Evidence (WoE) transformation on the entire dataset, you'll need to use the WoEDataPreparation function within the transform module. For demonstration purposes, we utilized the HELOC dataset sourced from Kaggle.

We recommend downloading the data so you can follow along with the examples.

from combat.transform import *

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

data = pd.read_excel('heloc_dataset.xlsx')
data['default'] = data['RiskPerformance'].apply(lambda x: 1 if x == 'Bad' else 0)
data = data.drop(columns = ['RiskPerformance'])

# separate the target from the features
y = data['default']
x = data.drop(columns = ['default'])

# specify special codes if necessary
special_codes = [-9, -8, -7]

Before proceeding with the WoE Transformation procedure, it's necessary to prepare meta-information about the data being utilized. This involves creating an .xlsx file with three columns:

  1. Variable - the names of variables as found in the HELOC dataset.
  2. dtype - indicating whether each variable is numerical or categorical.
  3. Expec - specifying expectations for coefficients: -1 for a negative coefficient, +1 for a positive coefficient, and 0 for ambiguity.
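For illustration, here is a minimal sketch of how such a metadata file could be built with pandas. The variable names come from the HELOC dataset; the dtype and Expec values shown are assumptions for demonstration only.

import pandas as pd

# hypothetical metadata for three HELOC variables; the Expec values
# below are illustrative assumptions, not recommendations
meta = pd.DataFrame({
    'Variable': ['ExternalRiskEstimate', 'MSinceOldestTradeOpen', 'MaxDelq2PublicRecLast12M'],
    'dtype': ['numerical', 'numerical', 'categorical'],
    'Expec': [-1, -1, 0],
})
meta.to_excel('Sign Expec.xlsx', index=False)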

Once the metadata is prepared, load it, specifying the Variable column as the index_col.

The transform module offers rich functionality for Weight of Evidence (WoE) transformation.

df_sign = pd.read_excel("Sign Expec.xlsx"
                        ,  index_col='Variable' # specify Variable as an index col
                        )

# Perform WoE Transformation for the whole data
final_data = WoEDataPreparation(x_data = x
                                , y_data = y
                                , df_sign = df_sign
                                , special_codes = special_codes
                                , metric = 'woe'
                                , min_n_bins = 1
                                )

# print the dataframe summarizing the WoE transformation status of each variable
print(final_data['status'])

# split data into training and testing sets
x_train, x_test, y_train, y_test = train_test_split(final_data['x_woe'], y, test_size=0.2, random_state=42, shuffle=True)
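For intuition, the WoE metric for a bin compares the share of non-defaults ('goods') to the share of defaults ('bads') that fall into it. A minimal sketch, assuming goods are coded as 0 and bads as 1 (this helper is illustrative, not part of the package):

import numpy as np

def woe_for_bin(y_in_bin, y_all):
    # WoE = ln( share of goods in the bin / share of bads in the bin )
    good_share = (y_in_bin == 0).sum() / (y_all == 0).sum()
    bad_share = (y_in_bin == 1).sum() / (y_all == 1).sum()
    return np.log(good_share / bad_share)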

How To Explore Variables' Explanatory Power

To select features, use the VarExpPower function of the Short_list module, which calculates individual accuracy ratios, performs mean-comparison tests (both parametric and nonparametric), and computes Variance Inflation Factors.

from combat.Short_list import VarExpPower  # import path assumed from the module named above

# variable explanatory power
var_exp = VarExpPower(y_train = y_train
                     , x_train = x_train
                     , y_test = y_test
                     , x_test = x_test
                     )
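For reference, the Variance Inflation Factor part of this screening can be reproduced with statsmodels; the helper below is an illustrative sketch, not the package's implementation.

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

def vif_table(x: pd.DataFrame) -> pd.Series:
    # one VIF per column; values well above 5-10 suggest multicollinearity
    return pd.Series(
        [variance_inflation_factor(x.values, i) for i in range(x.shape[1])],
        index=x.columns,
    )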

After identifying the most valuable variables, remove the remaining ones from the datasets using DeleteVars. The function removes the specified variables from the x_train, x_test, and df_sign dataframes.

from combat.utilities import DeleteVars

vars_to_remove = ['NumTrades60Ever2DerogPubRec', 'NumTrades90Ever2DerogPubRec', 'NumInqLast6M', 'NumInqLast6Mexcl7days']

new_data = DeleteVars(x_train = x_train
                        , x_test = x_test
                        , df_sign = df_sign
                        , vars_to_remove = vars_to_remove
                        )

# assign results of deletion to new items
x_train_new = new_data['x_train_new']
x_test_new = new_data['x_test_new']
df_sign_new = new_data['df_sign_new']

How To Create a Logit Model

The LogitModel class of the models module is a core feature of the COMBAT package. Upon initialization, it calculates a range of useful metrics on both the training and testing sets and enables predictions on external data. It seamlessly integrates features from both sklearn and statsmodels, providing users with a comprehensive set of metrics.

from combat.combat import IsModelValid
from combat.models import LogitModel

model = LogitModel(x_train=x_train_new
                   , y_train=y_train
                   , x_test = x_test_new
                   , y_test = y_test)

# these commands must be run before computing any metrics
model.Model_SK()
model.Model_SM()

# check whether the model meets all the requirements simultaneously
print(
    IsModelValid(model = model
             , coef_expectation = df_sign_new
             , gini_cutoff=0.3
             , p_value = 0.1
             )
)

print(model.Gini_Test())
print(model.Gini_Train())

print(model.Brier_Train())
print(model.Brier_Test())
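For reference, the Gini coefficient reported by these methods is conventionally derived from the ROC AUC. Whether the package computes it exactly this way is an assumption, but the standard credit-risk definition is:

from sklearn.metrics import roc_auc_score

def gini_sketch(y_true, y_proba):
    # Gini = 2 * AUC - 1, so a perfect model scores 1 and a random one 0
    return 2 * roc_auc_score(y_true, y_proba) - 1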

The LogitModel also offers the opportunity to build a regularized model. However, it's important to note that only l1 regularization is available. To construct a model with regularization, specify the penalty and alpha arguments accordingly.


model = LogitModel(x_train=x_train_new
                   , y_train=y_train
                   , x_test = x_test_new
                   , y_test = y_test
                   , penalty = 'l1'
                   , alpha = 0.2)

# these commands must be run before computing any metrics
model.Model_SK()
model.Model_SM()

print(
    IsModelValid(model = model
             , coef_expectation = df_sign_new
             , gini_cutoff=0.3
             , p_value = 0.1
             )
)

print(model.Gini_Test())
print(model.Gini_Train())

print(model.Brier_Train())
print(model.Brier_Test())

How To Build Model Ensembles

The ModelCombination function of the Combat module enables users to build a model ensemble, validating each candidate model against prespecified requirements such as p_value, gini_cutoff, and the coefficient-sign expectations in the df_sign dataframe.

To obtain meta information about all models in the ensemble, you should utilize the ModelMetaInfo function.

The SelectModels function filters an ensemble based on a user-specified metric, such as gini_test.

from combat.combat import *
from combat.utilities import *


# generate 1000 random models and select the models that meet all requirements
model_comb_1 = ModelCombination(y_train = y_train
                              , x_train = x_train_new
                              , y_test = y_test
                              , x_test = x_test_new
                              , max_model_number = 1000
                              , dependent_number = 5
                              , coef_expectation = df_sign_new
                              , gini_cutoff=0.4
                              , p_value = 0.1
                              , intercept = True
                              , penalty = None
                              )

# summarize the results of all the final models
meta = ModelMetaInfo(models_dict = model_comb_1
                     , sort_by = 'gini_test'
                     )
print(meta)

# select models whose Gini on the testing set is higher than 0.52
new_model_comb = SelectModels(models_dict = model_comb_1
                                , meta_data = meta
                                , select_by = 'gini_test'
                                , cutoff  = 0.52
                                )

How To Aggregate With Bagging Scheme

The Combat module enables users to aggregate model results into a single prediction using a scheme similar to bagging. First, determine the weight of each model's prediction in the final prediction by applying WeightsBagging. Once the weights are obtained, use PredictionBagging to generate the final predictions.

from combat.combat import *

# define weights for each model in the ensemble from Gini coefficients on the testing set
weight_aggr = WeightsBagging(models_dict = new_model_comb
                              , metric = 'gini'
                              , check_sample = 'test'
                              )

# once the weights are defined, calculate the final prediction
pred_aggr = PredictionBagging(models_dict = new_model_comb
                                  , weights_dict = weight_aggr
                                  , x_data = x_test
                                  , logprob=False
                                  )
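Conceptually, the aggregation is a weighted average of the individual model predictions, with weights proportional to the chosen metric. A minimal sketch of the idea (the helper below is illustrative, not package API):

import numpy as np

def bagging_aggregate(predictions, ginis):
    # normalize the metric values so the weights sum to 1
    w = np.asarray(ginis, dtype=float) / np.sum(ginis)
    # weighted average across models (axis 0 indexes the models)
    return np.average(np.asarray(predictions), axis=0, weights=w)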

How To Aggregate With Stacking Scheme

The Combat module enables users to aggregate model results into a single prediction using a stacking scheme. First, create a stacking logistic regression model using ModelStacking, then obtain final predictions using PredictionStacking.

from combat.combat import *

# create a final Logistic Regression Model
model_st = ModelStacking(models_dict = new_model_comb
                         , x_data = x_test
                         , y_data = y_test
                         )

# predict using created Logistic Regression Model
pred_stack = PredictionStacking(models_dict = new_model_comb
                                , x_data = x_test
                                , model = model_st
                                )
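Conceptually, stacking treats the base models' predictions as features for a second-level logistic regression. A minimal sketch of the idea (illustrative, not the package's internals):

import numpy as np
from sklearn.linear_model import LogisticRegression

def stacking_sketch(base_preds, y):
    # one column per base model's predicted probabilities
    x_meta = np.column_stack(base_preds)
    meta_model = LogisticRegression().fit(x_meta, y)
    return meta_model.predict_proba(x_meta)[:, 1]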

How To Calibrate Models

The Calibration module provides comprehensive tools for calibration. First, apply ExpectedCalibrationError to calculate the expected calibration error (ECE). If the ECE is too high, calibration should be applied: create a calibration model with CalibrationModel, obtain calibrated predictions with PredictionCalibration, and finally plot the calibration curve with CalibrationCurve.


from combat.calibration import *

# first calculate the Expected Calibration Error
ece = ExpectedCalibrationError(
    labels = np.array([1 if i > y_test.mean() else 0 for i in pred_stack])
    , probabilities = np.array(pred_stack)
    , n_bins = 20
    )

# If ECE is low (< 0.03, for example), there is no need to calibrate
# If ECE is high, calibration needs to be performed
cal_model = CalibrationModel(x_data = x_test, y_data = y_test)

# Predict with created calibration model
cal_prediction = PredictionCalibration(x_data = x_test, model = cal_model)

# calculate ECE again for calibrated prediction
ece_calibration = ExpectedCalibrationError(
    labels = np.array([1 if i > y_test.mean() else 0 for i in cal_prediction[:,1]])
    , probabilities = np.array(cal_prediction[:, 1])
    , n_bins = 20
    )

# plot the calibration curve
CalibrationCurve(y_data = y_test
                 , probabilities = cal_prediction[:, 1]
                 , n_bins = 20
                 , label = 'Calibrated line'
                 )
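For intuition, ECE partitions predictions into probability bins and averages the gap between each bin's mean predicted probability (confidence) and its observed default rate (accuracy), weighted by bin size. A minimal sketch assuming equal-width bins (illustrative, not the package's implementation):

import numpy as np

def ece_sketch(labels, probabilities, n_bins=20):
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    # assign every prediction to an equal-width probability bin
    idx = np.digitize(probabilities, bins[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            confidence = probabilities[mask].mean()
            accuracy = labels[mask].mean()
            ece += mask.sum() / len(probabilities) * abs(accuracy - confidence)
    return ece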

How To Transform PD To Score

Last but not least, transform the final Probability of Default (PD) into a convenient score range. By default, the target score is set to 600. To perform this transformation, use the ScoreCard function of the scorecard module.

from combat.scorecard import *

scorecard = ScoreCard(y_proba = cal_prediction[:,1], log = False)
print(scorecard)
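For reference, a standard scorecard mapping scales the log-odds linearly. The exact parametrization used by ScoreCard is not shown here, so the target_odds and pdo values below are assumptions for illustration:

import numpy as np

def pd_to_score_sketch(pd_proba, target_score=600, target_odds=50, pdo=20):
    # 'pdo' points double the good-to-bad odds; the scale is anchored so
    # that a borrower with target_odds receives exactly target_score
    factor = pdo / np.log(2)
    offset = target_score - factor * np.log(target_odds)
    odds = (1 - pd_proba) / pd_proba
    return offset + factor * np.log(odds)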