WoEDataPreparation Function Documentation

Description

The WoEDataPreparation function prepares weight-of-evidence (WOE) transformed data for predictive modeling. Its parameters control the binning and the transformation, and the function validates them; an incorrectly specified parameter raises one of the exceptions listed under Exceptions.

Parameters

  • x_data: DataFrame containing explanatory variables.
  • y_data: Series containing the target binary variable.
  • df_sign: DataFrame with sign expectations.
  • metric: Metric used for the transformation ('woe' or 'event_rate').
  • divergence: Divergence measure in the objective function to be maximized.
  • prebinning_method: Pre-binning method.
  • max_n_prebins: Maximum number of bins after pre-binning.
  • min_prebin_size: Minimum fraction of records in each prebin.
  • min_n_bins, max_n_bins: Minimum and maximum number of bins.
  • min_bin_size, max_bin_size: Minimum and maximum fraction of records in each bin.
  • min_bin_n_nonevent, max_bin_n_nonevent: Minimum and maximum number of non-event records for each bin.
  • min_bin_n_event, max_bin_n_event: Minimum and maximum number of event records for each bin.
  • min_event_rate_diff: Minimum event rate difference between consecutive bins.
  • max_pvalue: Maximum p-value among bins.
  • max_pvalue_policy: Method to determine bins not satisfying the p-value constraint.
  • gamma: Regularization strength to reduce the number of dominating bins.
  • outlier_detector: Outlier detection method.
  • outlier_params: Parameters for the outlier detection method.
  • class_weight: Weights associated with classes.
  • cat_cutoff: Categories whose fraction of occurrences is below this value are grouped into an "others" bin.
  • cat_unknown: Value assigned to unobserved categories during transform.
  • user_splits, user_splits_fixed: Lists of pre-binning split points and fixed pre-binning split points.
  • special_codes: List of special codes to treat data values separately.
  • split_digits: Significant digits of the split points.
  • mip_solver: Mixed-integer programming solver.
  • time_limit: Maximum time in seconds to run the optimization solver.
  • verbose: Enable verbose output.
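
To make the two metric options concrete, here is a minimal sketch of how WOE and event rate can be computed per bin from event and non-event counts. It is not the function's implementation: the helper name woe_and_event_rate and the counts are made up for illustration, and the sign convention shown (non-event distribution over event distribution) is only one common choice, so the library may use the opposite sign.

```python
import math


def woe_and_event_rate(bins):
    """Compute WOE and event rate for each bin.

    bins: list of (n_nonevent, n_event) tuples, one per bin.
    Hypothetical helper for illustration only.
    """
    total_nonevent = sum(n_nonevent for n_nonevent, _ in bins)
    total_event = sum(n_event for _, n_event in bins)
    rows = []
    for n_nonevent, n_event in bins:
        # Assumes every bin holds at least one event and one non-event;
        # otherwise the ratio below is undefined.
        dist_nonevent = n_nonevent / total_nonevent
        dist_event = n_event / total_event
        rows.append({
            "woe": math.log(dist_nonevent / dist_event),
            "event_rate": n_event / (n_nonevent + n_event),
        })
    return rows


# Three bins with made-up counts
table = woe_and_event_rate([(80, 20), (60, 40), (40, 60)])
for row in table:
    print(row)
```

Under this convention, a bin whose share of non-events equals its share of events has a WOE of zero, and bins with relatively fewer events have a positive WOE.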

Returns

A dictionary containing the following:

  • status: DataFrame with status information for each variable.
  • x_woe: DataFrame with WOE-transformed data.
  • binning_tables: Dictionary containing binning tables for each variable.

Exceptions

  • ValueError
    Raised if x_data is not a pandas DataFrame.
    Raised if y_data is not a pandas Series.
    Raised if the lengths of x_data and y_data differ.
    Raised if plot is not a boolean.
    Raised if metric has an invalid value.
    Raised if divergence has an invalid value.
    Raised if prebinning_method has an invalid value.
    Raised if max_n_prebins has an invalid value.
    Raised if min_prebin_size has an invalid value.
    Raised if min_n_bins or max_n_bins has an invalid value.
    Raised if min_n_bins exceeds max_n_bins.
    Raised if min_bin_size or max_bin_size has an invalid value.
    Raised if min_bin_size exceeds max_bin_size.
    Raised if min_bin_n_nonevent or max_bin_n_nonevent has an invalid value.
    Raised if min_bin_n_nonevent exceeds max_bin_n_nonevent.
    Raised if min_bin_n_event or max_bin_n_event has an invalid value.
    Raised if min_bin_n_event exceeds max_bin_n_event.
    Raised if min_event_rate_diff has an invalid value.
    Raised if max_pvalue has an invalid value.
    Raised if max_pvalue_policy has an invalid value.
    Raised if gamma has an invalid value.
    Raised if outlier_detector has an invalid value.
    Raised if outlier_params is not a dictionary.
    Raised if class_weight has an invalid value.
    Raised if cat_cutoff has an invalid value.
    Raised if cat_unknown has an invalid value.
    Raised if user_splits is not a list or numpy.ndarray.
    Raised if user_splits_fixed is not a list or numpy.ndarray, or if its elements are not boolean.
    Raised if the lengths of user_splits and user_splits_fixed differ.
    Raised if special_codes has an invalid value.
    Raised if split_digits has an invalid value.
    Raised if mip_solver has an invalid value.
    Raised if time_limit has an invalid value.
    Raised if verbose is not a boolean.

  • TypeError
    Raised if class_weight is neither a dictionary, "balanced", nor None.
    Raised if outlier_params is not a dictionary.
    Raised if cat_unknown is neither a float nor a string.
    Raised if special_codes is neither a list, dictionary, nor numpy.ndarray.

  • RuntimeError
    Raised if the optimization solver does not converge within the specified time limit.
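
The kind of checks listed above can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the helper name validate_bin_params, the None defaults, and the accepted ranges (positive integers for bin counts, fractions in (0, 1] for bin sizes) are hypothetical, and only a handful of the parameters are covered. For brevity the sketch accepts any sequence for the split arguments rather than checking for list or numpy.ndarray.

```python
def validate_bin_params(metric="woe", min_n_bins=None, max_n_bins=None,
                        min_bin_size=None, max_bin_size=None,
                        user_splits=None, user_splits_fixed=None):
    """Hypothetical validator that raises ValueError on bad settings."""
    if metric not in ("woe", "event_rate"):
        raise ValueError(f"Invalid value for metric: {metric!r}")

    # Bin counts: positive integers (bool is rejected because it subclasses int).
    for name, value in (("min_n_bins", min_n_bins), ("max_n_bins", max_n_bins)):
        if value is None:
            continue
        if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
            raise ValueError(f"Invalid value for {name}: {value!r}")
    if min_n_bins is not None and max_n_bins is not None and min_n_bins > max_n_bins:
        raise ValueError("min_n_bins exceeds max_n_bins")

    # Bin sizes: fractions of the records; the (0, 1] range is an assumption.
    for name, value in (("min_bin_size", min_bin_size), ("max_bin_size", max_bin_size)):
        if value is not None and not (0.0 < value <= 1.0):
            raise ValueError(f"Invalid value for {name}: {value!r}")
    if min_bin_size is not None and max_bin_size is not None and min_bin_size > max_bin_size:
        raise ValueError("min_bin_size exceeds max_bin_size")

    # Fixed-split flags must be booleans and pair one-to-one with user_splits.
    if user_splits_fixed is not None:
        if user_splits is None or len(user_splits) != len(user_splits_fixed):
            raise ValueError("Length mismatch between user_splits and user_splits_fixed")
        if not all(isinstance(flag, bool) for flag in user_splits_fixed):
            raise ValueError("user_splits_fixed elements must be boolean")


# A consistent configuration passes silently; an inconsistent one raises.
validate_bin_params(min_n_bins=2, max_n_bins=5, min_bin_size=0.05)
try:
    validate_bin_params(min_n_bins=6, max_n_bins=5)
except ValueError as err:
    print(f"Rejected: {err}")
```

Checking each bound on its own before comparing the pair keeps the error message specific to the parameter that is actually wrong.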

Usage

from combat.transform import WoEDataPreparation
import pandas as pd

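# x_data (DataFrame), y_data (Series), and df_sign (DataFrame) are assumed
# to have been loaded already.
# Prepare the WOE-transformed data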
woe_data = WoEDataPreparation(
    x_data=x_data,
    y_data=y_data,
    df_sign=df_sign,
    metric='woe',
    divergence='iv',
    prebinning_method='cart',
    max_n_prebins=20,
    min_prebin_size=0.05,
    # Add other parameters as needed
)

# Access the transformed data and other information
status_df = woe_data['status']
woe_transformed_data = woe_data['x_woe']
binning_tables = woe_data['binning_tables']