WoETransform Function

The WoETransform function performs the Weight of Evidence (WoE) transformation on explanatory variables.
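
For readers unfamiliar with the quantity itself, the following is a minimal sketch of how Weight of Evidence and the per-bin Information Value contribution can be computed by hand for bins that are already defined. It is not the implementation behind WoETransform. The helper name woe_table is hypothetical, and the sign convention (logarithm of non-event share over event share) is an assumption; some references use the inverse ratio.

```python
import numpy as np
import pandas as pd


def woe_table(bins: pd.Series, y: pd.Series) -> pd.DataFrame:
    """Per-bin WoE and IV contribution for a binary target (1 = event).

    Hypothetical helper for illustration only.
    """
    counts = pd.crosstab(bins, y).reindex(columns=[0, 1], fill_value=0)
    non_event = counts[0]
    event = counts[1]
    # Share of all non-events (events) that fall into each bin.
    dist_non_event = non_event / non_event.sum()
    dist_event = event / event.sum()
    # Assumed convention: WoE = ln(non-event share / event share).
    # A bin with zero events or zero non-events yields an infinite value.
    woe = np.log(dist_non_event / dist_event)
    iv = (dist_non_event - dist_event) * woe
    return pd.DataFrame(
        {"non_event": non_event, "event": event, "woe": woe, "iv": iv}
    )


bins = pd.Series(["low", "low", "low", "low", "high", "high", "high", "high"])
y = pd.Series([0, 0, 0, 1, 0, 1, 1, 1])
table = woe_table(bins, y)
print(table)
print("Total IV:", table["iv"].sum())
```

In this toy data the "low" bin holds three of the four non-events and one of the four events, so its WoE is ln(3) under the assumed convention, and the "high" bin mirrors it with the opposite sign.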

Parameters:

  • x: pd.Series
    A pandas Series of the explanatory variable.

  • y: pd.Series
    A pandas Series of the target binary variable.

  • mon_constraint: int {-1, 0, 1}
    Numeric type of monotonic constraint.

  • var_name: str
    A variable name.

  • var_type: str {'numerical', 'categorical'}
    A type of explanatory variable.

  • metric: str, {'woe', 'event_rate'}, default="woe"
    The metric used for the transformation.

  • prebinning_method: str, {'cart', 'mdlp', 'quantile', 'uniform', None}, default="cart"
    The pre-binning method.

  • solver: str, {'cp', 'mip', 'ls'}, default="cp"
    The optimizer to solve the optimal binning problem.

  • divergence: str, {'iv', 'js', 'hellinger', 'triangular'}, default="iv"
    The divergence measure in the objective function to be maximized.

  • max_n_prebins: int, default=20
    The maximum number of bins after pre-binning (prebins).

  • min_prebin_size: float, default=0.05
    The minimum fraction of records in each prebin.

  • min_n_bins: int or None, optional, default=None
    The minimum number of bins.

  • max_n_bins: int or None, optional, default=None
    The maximum number of bins.

  • min_bin_size: float or None, optional, default=None
    The minimum fraction of records in each bin.

  • max_bin_size: float or None, optional, default=None
    The maximum fraction of records in each bin.

  • min_bin_n_nonevent: int or None, optional, default=None
    The minimum number of non-event records for each bin.

  • max_bin_n_nonevent: int or None, optional, default=None
    The maximum number of non-event records for each bin.

  • min_bin_n_event: int or None, optional, default=None
    The minimum number of event records for each bin.

  • max_bin_n_event: int or None, optional, default=None
    The maximum number of event records for each bin.

  • min_event_rate_diff: float, default=0
    The minimum event rate difference between consecutive bins.

  • max_pvalue: float or None, optional, default=None
    The maximum p-value among bins.

  • max_pvalue_policy: str, default="consecutive"
    The method to determine bins not satisfying the p-value constraint.

  • gamma: float, default=0
    Regularization strength to reduce the number of dominating bins.

  • outlier_detector: str or None, optional, default=None
    The outlier detection method.

  • outlier_params: dict or None, optional, default=None
    Dictionary of parameters to pass to the outlier detection method.

  • class_weight: dict, "balanced" or None, optional, default=None
    Weights associated with classes.

  • cat_cutoff: float or None, optional, default=None
    Group categories whose fraction of occurrences is below the cutoff value into a single "others" bin.

  • cat_unknown: float, str or None, default=None
    The value assigned to categories that were not observed during training but occur during transform.

  • user_splits: array-like or None, optional, default=None
    The list of pre-binning split points.

  • user_splits_fixed: array-like or None, default=None
    The list of pre-binning split points that must be fixed.

  • special_codes: array-like, dict or None, optional, default=None
    List of special codes.

  • split_digits: int or None, optional, default=None
    The significant digits of the split points.

  • mip_solver: str, {'bop', 'cbc'}, default="bop"
    The mixed-integer programming solver.

  • time_limit: int, default=100
    The maximum time in seconds to run the optimization solver.

  • verbose: bool, default=False
    Enable verbose output.
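
To make the pre-binning parameters more concrete, here is a small sketch contrasting what 'quantile' and 'uniform' splitting mean in general, using plain pandas. It does not reproduce how WoETransform performs pre-binning internally. The synthetic data and the value max_n_prebins = 5 are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed variable (illustrative data only).
rng = np.random.default_rng(0)
x = pd.Series(rng.exponential(scale=10.0, size=1000))

max_n_prebins = 5        # illustrative; the documented default is 20
min_prebin_size = 0.05   # documented default

# Quantile splitting: cut points at empirical quantiles, so each prebin
# receives roughly the same number of records.
quantile_bins = pd.qcut(x, q=max_n_prebins, duplicates="drop")

# Uniform splitting: equal-width intervals over the observed range, so
# record counts per prebin can differ widely on skewed data.
uniform_bins = pd.cut(x, bins=max_n_prebins)

quantile_share = quantile_bins.value_counts(normalize=True).sort_index()
uniform_share = uniform_bins.value_counts(normalize=True).sort_index()

print(quantile_share)
print(uniform_share)
# Count equal-width prebins that fall below the minimum size fraction.
print("Uniform prebins below min_prebin_size:",
      int((uniform_share < min_prebin_size).sum()))
```

On skewed data the equal-width intervals tend to leave the upper prebins nearly empty, which is the situation a minimum prebin size is meant to guard against.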

Returns:

  • final_data: dict
    A dictionary with transformed data, status, binning table, and WoE transformation.

Exceptions:

  • TypeError:
    Raised if the parameter x is not a pandas Series.
    Raised if the parameter y is not a pandas Series.
    Raised if the parameter var_name is not a string.
    Raised if the parameter plot is not a boolean value.
    Raised if the parameter solver is not one of 'cp', 'ls', or 'mip'.
    Raised if the parameter max_n_prebins is not an integer greater than 1.
    Raised if the parameter min_prebin_size is not a float in the range (0, 0.5].
    Raised if the parameter min_n_bins is not a positive integer.
    Raised if the parameter max_n_bins is not a positive integer.
    Raised if the parameter min_bin_size is not a float in the range (0, 0.5].
    Raised if the parameter max_bin_size is not a float in the range (0, 1].
    Raised if the parameter min_bin_n_nonevent is not a positive integer.
    Raised if the parameter max_bin_n_nonevent is not a positive integer.
    Raised if the parameter min_bin_n_event is not a positive integer.
    Raised if the parameter max_bin_n_event is not a positive integer.
    Raised if the parameter min_event_rate_diff is not a float in the range [0, 1].
    Raised if the parameter max_pvalue is not a float in the range (0, 1].
    Raised if the parameter max_pvalue_policy is not one of 'all' or 'consecutive'.
    Raised if the parameter gamma is not a non-negative float.
    Raised if the parameter outlier_detector is provided and not one of 'range' or 'zscore'.
    Raised if the parameter outlier_params is provided and not a dictionary.
    Raised if the parameter class_weight is provided and is neither a dictionary nor the string 'balanced'.
    Raised if the parameter cat_cutoff is provided and not a float in the range (0, 1].
    Raised if the parameter cat_unknown is provided and not a float or a string.
    Raised if the parameter user_splits is provided and not a numpy.ndarray or a list.
    Raised if the parameter user_splits_fixed is provided and any of the following holds:
      • user_splits is None.
      • user_splits_fixed is not a numpy.ndarray or a list.
      • user_splits_fixed is not a list of booleans.
      • Its length does not match the length of user_splits.
    Raised if the parameter special_codes is provided and not a numpy.ndarray, list, or dictionary.

  • ValueError:
    Raised if the parameter mon_constraint is not one of -1, 0, or 1.
    Raised if the parameter var_type is not one of 'categorical' or 'numerical'.
    Raised if the parameter metric is not one of 'woe' or 'event_rate'.
    Raised if the parameter prebinning_method is not one of 'cart', 'mdlp', 'quantile', 'uniform', or None.
    Raised if the parameter divergence is not one of 'iv', 'js', 'hellinger', or 'triangular'.
    Raised if the lengths of user_splits and user_splits_fixed do not match.
    Raised if the parameter outlier_detector is provided but not one of 'range' or 'zscore'.
    Raised if the parameter split_digits is provided but not an integer in the range [0, 8].
    Raised if the parameter max_pvalue_policy is provided but not one of 'all' or 'consecutive'.
    Raised if the parameter special_codes is provided as an empty dictionary; it must contain at least one special code.

  • TypeError/ValueError (Special Cases):
    Raised if the parameter class_weight is provided as a string but not equal to 'balanced'.
    Raised if the parameter user_splits_fixed is provided without user_splits.
    Raised if the parameter user_splits_fixed is not a list of booleans.
    Raised if the lengths of user_splits and user_splits_fixed do not match.

  • ValueError (Inconsistent Parameters):
    Raised if both min_n_bins and max_n_bins are provided, but min_n_bins is greater than max_n_bins.
    Raised if both min_bin_size and max_bin_size are provided, but min_bin_size is greater than max_bin_size.
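
The two consistency rules for lower and upper bounds can be sketched as a standalone check. The helper below, check_bin_bounds, is hypothetical and written only to illustrate the rules; it is not taken from the library, and the error messages are assumptions.

```python
def check_bin_bounds(min_n_bins=None, max_n_bins=None,
                     min_bin_size=None, max_bin_size=None):
    """Hypothetical helper: reject inconsistent lower/upper bounds."""
    # The comparison only applies when both bounds are provided.
    if min_n_bins is not None and max_n_bins is not None:
        if min_n_bins > max_n_bins:
            raise ValueError(
                f"min_n_bins ({min_n_bins}) must be <= max_n_bins ({max_n_bins})"
            )
    if min_bin_size is not None and max_bin_size is not None:
        if min_bin_size > max_bin_size:
            raise ValueError(
                f"min_bin_size ({min_bin_size}) must be <= "
                f"max_bin_size ({max_bin_size})"
            )


# Consistent bounds pass silently.
check_bin_bounds(min_n_bins=2, max_n_bins=5, min_bin_size=0.05, max_bin_size=0.5)

# Inconsistent bounds raise ValueError.
try:
    check_bin_bounds(min_n_bins=6, max_n_bins=4)
except ValueError as exc:
    print(exc)
```

Running a check of this kind before calling the transformation surfaces a misconfiguration early, before any optimization time is spent.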

Example:

import pandas as pd
from combat.transform import WoETransform

# Small illustrative sample; real data should contain many more records
data = {
    'age': [25, 35, 45, 55, 65],
    'income': [50000, 60000, 70000, 80000, 90000],
    'target': [0, 1, 0, 1, 0],  # Binary target variable
}

df = pd.DataFrame(data)

# Define the explanatory and target variables
x = df['age']
y = df['target']

# Perform the WoE transformation. WoETransform is documented above as a
# function, so it is called directly with its arguments.
result = WoETransform(
    x=x,
    y=y,
    mon_constraint=1,
    var_name='age',
    var_type='numerical',
    # Add other parameters as needed
)

# Display the transformed data
print(result['woe_transform'])