WoEDataPreparation Function Documentation
Description
The WoEDataPreparation function prepares WOE-transformed (weight of evidence) data for predictive modeling. Its parameters customize the binning and transformation process, and the function validates them, raising the exceptions listed under Exceptions when one is incorrectly specified.
Parameters
- x_data: DataFrame containing explanatory variables.
- y_data: Series containing the binary target variable.
- df_sign: DataFrame with sign expectations.
- metric: Metric used for the transformation ('woe' or 'event_rate').
- divergence: Divergence measure maximized in the objective function.
- prebinning_method: Pre-binning method.
- max_n_prebins: Maximum number of bins after pre-binning.
- min_prebin_size: Minimum fraction of records in each prebin.
- min_n_bins, max_n_bins: Minimum and maximum number of bins.
- min_bin_size, max_bin_size: Minimum and maximum fraction of records in each bin.
- min_bin_n_nonevent, max_bin_n_nonevent: Minimum and maximum number of non-event records in each bin.
- min_bin_n_event, max_bin_n_event: Minimum and maximum number of event records in each bin.
- min_event_rate_diff: Minimum event rate difference between consecutive bins.
- max_pvalue: Maximum p-value among bins.
- max_pvalue_policy: Method to determine bins not satisfying the p-value constraint.
- gamma: Regularization strength to reduce the number of dominating bins.
- outlier_detector: Outlier detection method.
- outlier_params: Parameters for the outlier detection method.
- class_weight: Weights associated with classes.
- cat_cutoff: Categories whose fraction of occurrences is below this value are grouped into an "others" bin.
- cat_unknown: Value assigned to unobserved categories during transform.
- user_splits, user_splits_fixed: Lists of pre-binning split points and fixed pre-binning split points.
- special_codes: List of special codes whose data values are treated separately.
- split_digits: Significant digits of the split points.
- mip_solver: Mixed-integer programming solver.
- time_limit: Maximum time in seconds to run the optimization solver.
- verbose: Enable verbose output.
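For orientation, the two values accepted by metric can be illustrated on a variable that has already been binned. The sketch below is not the library's implementation: it assumes the common convention WoE = ln(share of non-events in the bin / share of events in the bin), and the helper name woe_by_bin and the sample data are hypothetical. The sign convention used by WoEDataPreparation itself may differ.

```python
import numpy as np
import pandas as pd


def woe_by_bin(bins: pd.Series, y: pd.Series) -> pd.DataFrame:
    """Per-bin event rate and WoE for an already-binned variable (hypothetical helper)."""
    table = (
        pd.DataFrame({"bin": bins, "event": y})
        .groupby("bin")["event"]
        .agg(n_records="count", n_event="sum")
    )
    table["n_nonevent"] = table["n_records"] - table["n_event"]
    # 'event_rate' metric: share of event records within each bin
    table["event_rate"] = table["n_event"] / table["n_records"]
    # 'woe' metric under the assumed convention: ln(non-event share / event share)
    nonevent_share = table["n_nonevent"] / table["n_nonevent"].sum()
    event_share = table["n_event"] / table["n_event"].sum()
    table["woe"] = np.log(nonevent_share / event_share)
    return table


# Made-up data: two bins with four records each
bins = pd.Series(["low", "low", "low", "low", "high", "high", "high", "high"])
y = pd.Series([0, 0, 0, 1, 0, 1, 1, 1])
table = woe_by_bin(bins, y)
print(table)
```

A bin with no events or no non-events would produce an infinite value in this sketch, which is one reason constraints such as min_bin_n_event and min_bin_n_nonevent are useful.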
Returns
A dictionary containing the following:
- status: DataFrame with status information for each variable.
- x_woe: DataFrame with WOE-transformed data.
- binning_tables: Dictionary containing binning tables for each variable.
Exceptions
- ValueError
Raised if x_data is not a pandas DataFrame.
Raised if y_data is not a pandas Series.
Raised if the lengths of x_data and y_data differ.
Raised if the plot parameter is not a boolean.
Raised if metric has an invalid value.
Raised if divergence has an invalid value.
Raised if prebinning_method has an invalid value.
Raised if max_n_prebins has an invalid value.
Raised if min_prebin_size has an invalid value.
Raised if min_n_bins or max_n_bins has an invalid value.
Raised if min_n_bins exceeds max_n_bins.
Raised if min_bin_size or max_bin_size has an invalid value.
Raised if min_bin_size exceeds max_bin_size.
Raised if min_bin_n_nonevent or max_bin_n_nonevent has an invalid value.
Raised if min_bin_n_nonevent exceeds max_bin_n_nonevent.
Raised if min_bin_n_event or max_bin_n_event has an invalid value.
Raised if min_bin_n_event exceeds max_bin_n_event.
Raised if min_event_rate_diff has an invalid value.
Raised if max_pvalue has an invalid value.
Raised if max_pvalue_policy has an invalid value.
Raised if gamma has an invalid value.
Raised if outlier_detector has an invalid value.
Raised if outlier_params is not a dictionary.
Raised if class_weight has an invalid value.
Raised if cat_cutoff has an invalid value.
Raised if cat_unknown has an invalid value.
Raised if user_splits is not a list or numpy.ndarray.
Raised if user_splits_fixed is not a list or numpy.ndarray or its elements are not boolean.
Raised if user_splits and user_splits_fixed differ in length.
Raised if special_codes has an invalid value.
Raised if split_digits has an invalid value.
Raised if mip_solver has an invalid value.
Raised if time_limit has an invalid value.
Raised if verbose is not a boolean.
- TypeError
Raised if class_weight is not a dictionary, "balanced", or None.
Raised if outlier_params is not a dictionary.
Raised if cat_unknown is neither a float nor a string.
Raised if special_codes is not a list, dictionary, or numpy.ndarray.
- RuntimeError
Raised if the optimization solver does not converge within the specified time limit.
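A few of the checks listed above can be sketched as follows. This is an illustration only, not the library's code: the helper name check_inputs, the error messages, and the treatment of None as "no limit" are all assumptions.

```python
import pandas as pd


def check_inputs(x_data, y_data, min_n_bins=None, max_n_bins=None):
    """Hypothetical subset of the documented checks; not the library's code."""
    if not isinstance(x_data, pd.DataFrame):
        raise ValueError("x_data must be a pandas DataFrame.")
    if not isinstance(y_data, pd.Series):
        raise ValueError("y_data must be a pandas Series.")
    if len(x_data) != len(y_data):
        raise ValueError("x_data and y_data must have the same length.")
    # Assumption: None means the bound is not set, so there is nothing to compare
    if min_n_bins is not None and max_n_bins is not None and min_n_bins > max_n_bins:
        raise ValueError("min_n_bins must not exceed max_n_bins.")


# Made-up inputs
x = pd.DataFrame({"age": [25, 40, 31]})
y = pd.Series([0, 1, 0])

check_inputs(x, y, min_n_bins=2, max_n_bins=5)  # valid inputs: no exception

try:
    check_inputs(x, y.iloc[:2])  # lengths differ
except ValueError as exc:
    message = str(exc)
    print(message)
```

Callers can wrap a call to WoEDataPreparation in the same try/except pattern to report a misconfigured parameter before any modeling starts.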
Usage
from combat.transform import WoEDataPreparation
import pandas as pd

# x_data (DataFrame), y_data (Series) and df_sign (DataFrame) are assumed
# to have been loaded beforehand.

# Prepare the WOE-transformed data
woe_data = WoEDataPreparation(
    x_data=x_data,
    y_data=y_data,
    df_sign=df_sign,
    metric='woe',
    divergence='iv',
    prebinning_method='cart',
    max_n_prebins=20,
    min_prebin_size=0.05,
    # Add other parameters as needed
)

# Access the transformed data and other information
status_df = woe_data['status']
woe_transformed_data = woe_data['x_woe']
binning_tables = woe_data['binning_tables']