Tutorials

This part of the project documentation focuses on a learning-oriented approach. You'll learn how to get started with the code in this project.

Install combat if it is not installed yet

!pip install combat

Import all necessary packages

from combat.models import LogitModel
from combat.short_list import *
from combat.combat import *
from combat.transform import *
from combat.calibration import *
from combat.scorecard import *

import warnings
warnings.filterwarnings("ignore")
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

For demonstration purposes, the HELOC dataset is used. The data can be downloaded from the Kaggle website.

# load raw data (the file is a CSV, so read_csv is used rather than read_excel)
data = pd.read_csv('heloc_dataset_v1.csv'
                    , sep = ","
                    , header = 0)

y = data['target']
x = data.drop(columns = ['target'])


# load the table of expected coefficient signs
df_sign = pd.read_excel('df_sign.xlsx'
                        , index_col = 'variable' # it is crucial to select the `variable` column as `index_col`
                        )

df_sign.xlsx must contain 3 columns:

variable - exactly the same names as in x.columns
dtype - type of the column (either numerical or categorical)
expec - expected coefficient sign (-1 for a negative sign expectation, +1 for a positive sign expectation, 0 if ambiguous)
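The layout described above can be sketched with a small in-memory table. The feature names below are hypothetical, and the exact spellings `numerical` and `categorical` are assumed from the description rather than confirmed against the library. The checks at the end are plain pandas and only illustrate how one might verify the table before passing it on.

```python
import pandas as pd

# Hypothetical feature table, used only for illustration.
x_example = pd.DataFrame({
    "utilization": [0.2, 0.8, 0.5],
    "months_on_book": [12, 48, 30],
    "region": ["north", "south", "north"],
})

# One row per feature: its type and the expected coefficient sign.
df_sign_example = pd.DataFrame(
    {
        "variable": ["utilization", "months_on_book", "region"],
        "dtype": ["numerical", "numerical", "categorical"],
        "expec": [1, -1, 0],
    }
).set_index("variable")

# Sanity checks: every feature is described, and only allowed values appear.
missing = set(x_example.columns) - set(df_sign_example.index)
bad_dtype = set(df_sign_example["dtype"]) - {"numerical", "categorical"}
bad_expec = set(df_sign_example["expec"]) - {-1, 0, 1}
print(missing, bad_dtype, bad_expec)
```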

Perform WoE Transformation. Split data into training and testing sets.

final_data = WoEDataPreparation(
  x_data = x
  , y_data = y
  , df_sign = df_sign
  , special_codes = None
  , metric = 'woe'
)

# review the transformation results
print(final_data['status'])

# review binning tables
print(final_data['bining_tables'])

# split data into training and testing set
x_train, x_test, y_train, y_test = train_test_split(final_data['x_woe'], y, test_size=0.2, random_state=42, shuffle=True)
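For readers unfamiliar with weight of evidence, the following minimal sketch computes WoE per bin on toy data with plain pandas. It does not reproduce what WoEDataPreparation does internally; the bins, the data, and the sign convention (log of the non-event share over the event share) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: a pre-binned feature and a binary target (1 = event).
toy = pd.DataFrame({
    "bin": ["low", "low", "low", "low", "high", "high", "high", "high"],
    "target": [0, 0, 0, 1, 0, 1, 1, 1],
})

# Count events and non-events in each bin.
counts = toy.groupby("bin")["target"].agg(events="sum", total="count")
counts["non_events"] = counts["total"] - counts["events"]

# Share of all non-events and of all events that fall in each bin.
dist_non_event = counts["non_events"] / counts["non_events"].sum()
dist_event = counts["events"] / counts["events"].sum()

# Assumed convention: WoE = ln(share of non-events / share of events).
counts["woe"] = np.log(dist_non_event / dist_event)
print(counts)
```

With this convention a positive WoE marks a bin with relatively fewer events than the sample as a whole.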

Perform exploratory data analysis with the VarExpPower function.


var_exp = VarExpPower(
  y_train = y_train
  , x_train = x_train
  , y_test = y_test
  , x_test = x_test
  )

Drop some features if needed. To drop variables from x_train, x_test, and df_sign, use the DeleteVars function.
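The signature of DeleteVars is not shown here, so the sketch below only illustrates the intended effect with plain pandas: the same variables are removed from both feature tables and from the sign table so that the three stay aligned. All names ending in `_demo` are hypothetical.

```python
import pandas as pd

# Hypothetical stand-ins for x_train, x_test and df_sign.
x_train_demo = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
x_test_demo = pd.DataFrame({"a": [7], "b": [8], "c": [9]})
df_sign_demo = pd.DataFrame(
    {"dtype": ["numerical"] * 3, "expec": [1, -1, 0]},
    index=pd.Index(["a", "b", "c"], name="variable"),
)

to_drop = ["b"]

# Remove the columns from both feature tables and the matching rows
# from the sign table.
x_train_demo = x_train_demo.drop(columns=to_drop)
x_test_demo = x_test_demo.drop(columns=to_drop)
df_sign_demo = df_sign_demo.drop(index=to_drop)

print(list(x_train_demo.columns), list(df_sign_demo.index))
```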

Create a logit model and check its adequacy

model = LogitModel(
  y_train = y_train
  , x_train = x_train
  , y_test = y_test
  , x_test = x_test
)
model.Model_SK() # initialize sklearn logit model
model.Model_SM() # initialize statsmodels logit model

print(model.Gini_Train()) # calculate Gini on the training set
print(model.Gini_Test()) # calculate Gini on the testing set

IsModelValid(
    model = model
    , coef_expectation = df_sign
    , gini_cutoff=0.5
    , p_value = 0.1
    )

Use the IsModelValid function to check whether the model meets all the specified requirements.
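As a point of reference for the Gini values and the gini_cutoff parameter, the Gini coefficient of a scoring model is commonly defined as 2 * AUC - 1. The sketch below uses scikit-learn on toy data; whether the library computes Gini in exactly this way is an assumption, and the `gini` helper is hypothetical.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def gini(y_true, y_score):
    """Gini coefficient derived from the ROC AUC: Gini = 2 * AUC - 1."""
    return 2.0 * roc_auc_score(y_true, y_score) - 1.0


# Toy labels and scores; three of the four (positive, negative) pairs
# are ranked correctly, so AUC = 0.75.
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

g = gini(y_true, y_score)
print(g, g >= 0.4)  # compare against an illustrative cutoff of 0.4
```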

Create an ensemble of models

model_comb = ModelCombination(
      y_train = y_train
      , x_train = x_train
      , y_test = y_test
      , x_test = x_test
      , max_model_number = 1000
      , dependent_number = 5
      , coef_expectation = df_sign
      , gini_cutoff=0.4
      , p_value = 0.1
      , intercept = True
      , penalty = None
)

# get meta information about the created ensemble
meta = ModelMetaInfo(
    models_dict = model_comb
    , sort_by = 'gini_test'
)

# If necessary, filter the models, for example by keeping only those with Gini_Train() > 0.5

The next step is to aggregate the models into a final model. First, however, it is worth checking which aggregation scheme performs best in terms of accuracy.

meta_aggr = AggregationMetaInfo(
  models_dict = model_comb
  , x_train = x_train
  , y_train = y_train
  , x_test = x_test
  , y_test = y_test
  )

Once the meta information about the aggregation schemes is obtained, choose the aggregation scheme to apply.

If you opt for the stacking scheme:


# create stacking model
model_st = ModelStacking(
  models_dict = model_comb
  , x_data = x_test
  , y_data = y_test
  )

# predict once stacking model is obtained
pred_stack = PredictionStacking(
  models_dict = model_comb
  , x_data = x_test
  , model = model_st
)

from sklearn.metrics import roc_auc_score

# evaluate the stacked predictions on the test set
auc_st = roc_auc_score(
  y_true = y_test
  , y_score = pred_stack
  )
print(auc_st)
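To make the idea of stacking concrete, here is a minimal sketch built with scikit-learn only. It is not the implementation behind ModelStacking: the synthetic data, the feature subsets, and the choice of a logistic regression as meta-model are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the WoE-transformed features.
X, y = make_classification(n_samples=600, n_features=6, n_informative=4,
                           n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Base models: one logistic regression per (hypothetical) feature subset.
feature_subsets = [[0, 1, 2], [2, 3, 4], [3, 4, 5]]
base_models = [LogisticRegression().fit(X_tr[:, cols], y_tr)
               for cols in feature_subsets]


def base_scores(X_part):
    """Stack the base models' predicted probabilities column-wise."""
    return np.column_stack([
        m.predict_proba(X_part[:, cols])[:, 1]
        for m, cols in zip(base_models, feature_subsets)
    ])


# Meta-model: learns how to combine the base models' scores.
meta_model = LogisticRegression().fit(base_scores(X_tr), y_tr)
stacked_pd = meta_model.predict_proba(base_scores(X_te))[:, 1]

print(roc_auc_score(y_te, stacked_pd))
```

In practice the meta-model is often fitted on out-of-fold predictions to limit overfitting; that refinement is omitted here for brevity.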

If you opt for the aggregation scheme, which calculates a weighted average of the scores of all models in the ensemble:


# obtain weight of each model in the ensemble
weight_aggr = WeightsBagging(
  models_dict = model_comb
  , metric = 'gini'
  , check_sample = 'train'
  )

# calculate the score given obtained models weights
pred_aggr = PredictionBagging(
  models_dict = model_comb
  , weights_dict = weight_aggr
  , x_data = x_test
  )
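The weighted-average idea can be illustrated with NumPy alone. In this sketch each model's weight is its Gini divided by the sum of all Ginis; that normalization, the model names, and the numbers are assumptions, and WeightsBagging may derive its weights differently.

```python
import numpy as np

# Hypothetical Gini values for three models in an ensemble.
gini_by_model = {"model_a": 0.50, "model_b": 0.30, "model_c": 0.20}

# Normalize so that the weights sum to one.
total_gini = sum(gini_by_model.values())
weights = {name: g / total_gini for name, g in gini_by_model.items()}

# Hypothetical predicted probabilities of each model for two observations.
scores_by_model = {
    "model_a": np.array([0.10, 0.60]),
    "model_b": np.array([0.20, 0.50]),
    "model_c": np.array([0.30, 0.40]),
}

# Weighted average of the model scores, observation by observation.
blended = sum(weights[name] * scores_by_model[name] for name in weights)
print(weights, blended)
```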

Once an aggregation scheme has been chosen and the aggregation is complete, the predictions need to be calibrated.

# calculate expected calibration error
ece = ExpectedCalibrationError(
  labels = np.array([1 if i > y_test.mean() else 0 for i in pred_stack])
  , probabilities = np.array(pred_stack)
  , n_bins = 20
  )

# plot Calibration curve
CalibrationCurve(
  y_data = y_test
  , probabilities = pred_stack
  , n_bins = 20
  , label = 'Current Curve'
  )

# if the ECE is too large, calibration must be applied
calib_model = CalibrationModel(
  x_data = x_test
  , y_data = y_test
  )

# predict the final PD with the calibration model
cal_prediction = PredictionCalibration(
  x_data = x_test
  , model = calib_model
  )
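For orientation, the expected calibration error is usually defined as the weighted average, over probability bins, of the absolute gap between the observed event rate and the mean predicted probability. The sketch below implements that textbook definition with equal-width bins and observed labels; it is not taken from the library, and ExpectedCalibrationError may differ in binning or inputs.

```python
import numpy as np


def expected_calibration_error(labels, probabilities, n_bins=10):
    """Textbook ECE with equal-width bins on [0, 1]."""
    labels = np.asarray(labels, dtype=float)
    probabilities = np.asarray(probabilities, dtype=float)

    # Interior bin edges; digitize maps each probability to 0..n_bins-1.
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.clip(np.digitize(probabilities, edges[1:-1]), 0, n_bins - 1)

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            # Gap between observed event rate and mean predicted probability,
            # weighted by the share of observations in the bin.
            gap = abs(labels[mask].mean() - probabilities[mask].mean())
            ece += mask.mean() * gap
    return ece


# Toy example: predictions are off by 0.1 in both occupied bins.
ece_demo = expected_calibration_error([0, 0, 1, 1], [0.1, 0.1, 0.9, 0.9], n_bins=2)
print(ece_demo)
```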

The final step is to convert the obtained PD into a score.


score = ScoreCard(
  y_proba = cal_prediction
  , log = False
)
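One common way to turn a PD into a score is the "points to double the odds" scaling: the score is a linear function of the log-odds of being good. The sketch below shows that scaling under assumed parameters (base score 600 at odds of 50:1, 20 points to double the odds). It is not necessarily the formula used by ScoreCard, and the `pd_to_score` helper is hypothetical.

```python
import numpy as np


def pd_to_score(pd_values, base_score=600.0, base_odds=50.0, pdo=20.0):
    """Map probabilities of default to scores via log-odds scaling."""
    # Clip to avoid division by zero and log(0) at the extremes.
    p = np.clip(np.asarray(pd_values, dtype=float), 1e-12, 1.0 - 1e-12)

    factor = pdo / np.log(2.0)                       # points per unit of log-odds
    offset = base_score - factor * np.log(base_odds)  # anchors base_odds at base_score

    good_to_bad_odds = (1.0 - p) / p
    return offset + factor * np.log(good_to_bad_odds)


# Odds of 50:1 map to the base score; doubling the odds adds `pdo` points.
scores_demo = pd_to_score([1.0 / 51.0, 1.0 / 101.0])
print(scores_demo)
```

Under this scaling a lower PD always yields a higher score, which makes the output easy to rank and to cut off.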