ballet.eng.external.feature_engine module
class ballet.eng.external.feature_engine.AddMissingIndicator(missing_only=True, variables=None)
Bases: feature_engine.imputation.base_imputer.BaseImputer
The AddMissingIndicator() adds additional binary variables that indicate if data is missing. It will add as many missing indicators as variables indicated by the user.
Binary variables are named with the original variable name plus ‘_na’.
The AddMissingIndicator() works for both numerical and categorical variables. You can pass a list with the variables for which the missing indicators should be added. Alternatively, the imputer will select and add missing indicators to all variables in the training set.
Note: If missing_only=True, the imputer will add missing indicators only to those variables that show missing data during fit. These may be a subset of the variables you indicated.
- Parameters
missing_only (bool, default=True) –
Indicates if missing indicators should be added to variables with missing data or to all variables.
True: indicators will be created only for those variables that showed missing data during fit.
False: indicators will be created for all variables
variables (list, default=None) – The list of variables to be imputed. If None, the imputer will find and select all variables.
-
variables_
¶ List of variables for which the missing indicators will be created.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
Learn the variables for which the missing indicators will be created
-
transform:
Add the missing indicators.
-
fit_transform:
Fit to the data, then transform it.
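A minimal usage sketch, not part of the original docstring; the toy dataframe and column names are illustrative:
import numpy as np
import pandas as pd
from ballet.eng.external.feature_engine import AddMissingIndicator

X = pd.DataFrame({'age': [20, np.nan, 35], 'city': ['NY', 'LA', None]})
indicator = AddMissingIndicator()  # missing_only=True by default
Xt = indicator.fit_transform(X)
# Xt now contains the extra binary columns 'age_na' and 'city_na'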
-
fit
(X, y=None)[source]¶ Learn the variables for which the missing indicators will be created.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training dataset.
y (pandas Series, default=None) – y is not needed in this imputation. You can pass None or y.
- Raises
TypeError – If the input is not a Pandas DataFrame
- Returns
self.variables_ – The list of variables for which missing indicators will be added.
- Return type
list
-
transform
(X)[source]¶ Add the binary missing indicators.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The dataframe to be transformed.
- Returns
X_transformed – The dataframe containing the additional binary variables. Binary variables are named with the original variable name plus ‘_na’.
- Return type
pandas dataframe of shape = [n_samples, n_features]
class ballet.eng.external.feature_engine.ArbitraryDiscretiser(binning_dict, return_object=False, return_boundaries=False)
Bases: feature_engine.base_transformers.BaseNumericalTransformer
The ArbitraryDiscretiser() divides continuous numerical variables into contiguous intervals whose limits are determined arbitrarily by the user.
You need to enter a dictionary with variable names as keys, and a list of the limits of the intervals as values. For example {‘var1’:[0, 10, 100, 1000], ‘var2’:[5, 10, 15, 20]}.
ArbitraryDiscretiser() will then sort var1 values into the intervals 0-10, 10-100, and 100-1000, and var2 into 5-10, 10-15 and 15-20. Similar to pandas.cut.
The ArbitraryDiscretiser() works only with numerical variables. The discretiser will check if the dictionary entered by the user contains variables present in the training set, and if these variables are numerical, before doing any transformation.
Then it transforms the variables, that is, it sorts the values into the intervals.
- Parameters
binning_dict (dict) –
The dictionary with the variable to interval limits pairs. A valid dictionary looks like this:
binning_dict = {‘var1’:[0, 10, 100, 1000], ‘var2’:[5, 10, 15, 20]}
return_object (bool, default=False) –
Whether the discrete variable should be returned cast as numeric or as object. If you would like to proceed with the engineering of the variable as if it was categorical, use True. Otherwise, keep the default of False.
Categorical encoders in Feature-engine work only with variables of type object, thus, if you wish to encode the returned bins, set return_object to True.
return_boundaries (bool, default=False) – Whether the output, that is the bin names / values, should be the interval boundaries. If True, it returns the interval boundaries. If False, it returns integers.
-
binner_dict_
¶ Dictionary with the interval limits per variable.
-
variables_
¶ The variables to discretise.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
This transformer does not learn any parameter.
-
transform:
Sort continuous variable values into the intervals.
-
fit_transform:
Fit to the data, then transform it.
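A minimal usage sketch, not part of the original docstring; the toy dataframe is hypothetical and reuses the limits from the example above:
import pandas as pd
from ballet.eng.external.feature_engine import ArbitraryDiscretiser

X = pd.DataFrame({'var1': [3, 25, 250], 'var2': [6, 12, 18]})
discretiser = ArbitraryDiscretiser(
    binning_dict={'var1': [0, 10, 100, 1000], 'var2': [5, 10, 15, 20]}
)
Xt = discretiser.fit_transform(X)
# each value is replaced by the integer index of its interval (0, 1, 2, ...)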
See also
pandas.cut: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html
-
fit
(X, y=None)[source]¶ This transformer does not learn any parameter.
Check dataframe and variables. Checks that the user entered variables are in the train set and cast as numerical.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training dataset. Can be the entire dataframe, not just the variables to be transformed.
y (None) – y is not needed in this transformer. You can pass y or None.
- Raises
TypeError –
If the input is not a Pandas DataFrame - If any of the user provided variables are not numerical
ValueError –
If there are no numerical variables in the df or the df is empty - If the variable(s) contain null values
- Returns
- Return type
self
-
transform
(X)[source]¶ Sort the variable values into the intervals.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The dataframe to be transformed.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError –
If the variable(s) contain null values - If the dataframe is not of the same size as the one used in fit()
- Returns
X – The transformed data with the discrete variables.
- Return type
pandas dataframe of shape = [n_samples, n_features]
class ballet.eng.external.feature_engine.ArbitraryNumberImputer(arbitrary_number=999, variables=None, imputer_dict=None)
Bases: feature_engine.imputation.base_imputer.BaseImputer
The ArbitraryNumberImputer() replaces missing data in each variable by an arbitrary value determined by the user. It works only with numerical variables.
You can impute all variables with the same number, in which case you need to define the variables to impute in variables and the imputation number in arbitrary_number. You can pass a dictionary of variable and numbers to use for their imputation.
For example, you can impute varA and varB with 99 like this:
transformer = ArbitraryNumberImputer(
    variables=['varA', 'varB'],
    arbitrary_number=99,
)
Xt = transformer.fit_transform(X)
Alternatively, you can impute varA with 1 and varB with 99 like this:
transformer = ArbitraryNumberImputer(
    imputer_dict={'varA': 1, 'varB': 99},
)
Xt = transformer.fit_transform(X)
- Parameters
arbitrary_number (int or float, default=999) – The number to be used to replace missing data.
variables (list, default=None) – The list of variables to be imputed. If None, the imputer will find and select all numerical variables. This parameter is used only if imputer_dict is None.
imputer_dict (dict, default=None) – The dictionary of variables and the arbitrary numbers for their imputation.
-
imputer_dict_
¶ Dictionary with the values to replace NAs in each variable.
-
variables_
¶ The group of variables that will be transformed.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
This transformer does not learn parameters.
-
transform:
Impute missing data.
-
fit_transform:
Fit to the data, then transform it.
See also
feature_engine.imputation.EndTailImputer
-
fit
(X, y=None)[source]¶ This method does not learn any parameter. Checks dataframe and finds numerical variables, or checks that the variables entered by user are numerical.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training dataset.
y (None) – y is not needed in this imputation. You can pass None or y.
- Raises
TypeError –
If the input is not a Pandas DataFrame - If any of the user provided variables are not numerical
ValueError – If there are no numerical variables in the df or the df is empty
- Returns
- Return type
self
-
transform
(X)[source]¶ Replace missing data with the learned parameters.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The data to be transformed.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError – If the dataframe has different number of features than the df used in fit()
- Returns
X – The dataframe without missing values in the selected variables.
- Return type
pandas dataframe of shape = [n_samples, n_features]
class ballet.eng.external.feature_engine.ArbitraryOutlierCapper(max_capping_dict=None, min_capping_dict=None, missing_values='raise')
Bases: feature_engine.outliers.base_outlier.BaseOutlier
The ArbitraryOutlierCapper() caps the maximum or minimum values of a variable at an arbitrary value indicated by the user.
You must provide the maximum or minimum values that will be used to cap each variable in a dictionary {feature:capping value}
- Parameters
max_capping_dict (dictionary, default=None) – Dictionary containing the user specified capping values for the right tail of the distribution of each variable (maximum values).
min_capping_dict (dictionary, default=None) – Dictionary containing user specified capping values for the left tail of the distribution of each variable (minimum values).
missing_values (string, default='raise') – Indicates if missing values should be ignored or raised. If missing_values=’raise’ the transformer will return an error if the training or the datasets to transform contain missing values.
-
right_tail_caps_
¶ Dictionary with the maximum values at which variables will be capped.
-
left_tail_caps_
¶ Dictionary with the minimum values at which variables will be capped.
-
variables_
¶ The group of variables that will be transformed.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
This transformer does not learn any parameter.
-
transform:
Cap the variables.
-
fit_transform:
Fit to the data. Then transform it.
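A minimal usage sketch, not part of the original docstring; the toy dataframe and capping values are illustrative:
import pandas as pd
from ballet.eng.external.feature_engine import ArbitraryOutlierCapper

X = pd.DataFrame({'income': [1200, 50000, 250000]})
capper = ArbitraryOutlierCapper(
    max_capping_dict={'income': 150000},
    min_capping_dict={'income': 10000},
)
Xt = capper.fit_transform(X)
# 1200 is raised to the minimum cap 10000, 250000 is lowered to the maximum cap 150000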
-
fit
(X, y=None)[source]¶ This transformer does not learn any parameter.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples.
y (pandas Series, default=None) – y is not needed in this transformer. You can pass y or None.
- Raises
TypeError – If the input is not a Pandas DataFrame
- Returns
- Return type
self
-
transform
(X)[source]¶ Cap the variable values, that is, censors outliers.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The data to be transformed.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError – If the dataframe is not of same size as that used in fit()
- Returns
X – The dataframe with the capped variables.
- Return type
pandas dataframe of shape = [n_samples, n_features]
class ballet.eng.external.feature_engine.BoxCoxTransformer(variables=None)
Bases: feature_engine.base_transformers.BaseNumericalTransformer
The BoxCoxTransformer() applies the BoxCox transformation to numerical variables.
The Box-Cox transformation is defined as:
\[T(Y) = \frac{Y^{\lambda} - 1}{\lambda} \quad \text{if } \lambda \neq 0\]
\[T(Y) = \log(Y) \quad \text{if } \lambda = 0\]
where Y is the response variable and λ is the transformation parameter. λ varies, typically from -5 to 5. In the transformation, all values of λ are considered and the optimal value for a given variable is selected.
The BoxCox transformation implemented by this transformer is that of SciPy.stats: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.boxcox.html
The BoxCoxTransformer() works only with numerical positive variables (>=0).
A list of variables can be passed as an argument. Alternatively, the transformer will automatically select and transform all numerical variables.
- Parameters
variables (list, default=None) – The list of numerical variables to transform. If None, the transformer will automatically find and select all numerical variables.
-
lambda_dict_
¶ Dictionary with the best BoxCox exponent per variable.
-
variables_
¶ The group of variables that will be transformed.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
Learn the optimal lambda for the BoxCox transformation.
-
transform:
Apply the BoxCox transformation.
-
fit_transform:
Fit to data, then transform it.
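A minimal usage sketch, not part of the original docstring; the toy dataframe is hypothetical:
import pandas as pd
from ballet.eng.external.feature_engine import BoxCoxTransformer

X = pd.DataFrame({'income': [1.5, 2.0, 3.5, 10.0, 25.0]})
bct = BoxCoxTransformer()  # all numerical variables are transformed by default
Xt = bct.fit_transform(X)
print(bct.lambda_dict_)    # the optimal lambda found for 'income'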
References
- 1
Box and Cox. “An Analysis of Transformations”. Read at a Research Meeting, 1964. https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.2517-6161.1964.tb00553.x
-
fit
(X, y=None)[source]¶ Learn the optimal lambda for the BoxCox transformation.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples. Can be the entire dataframe, not just the variables to transform.
y (pandas Series, default=None) – It is not needed in this transformer. You can pass y or None.
- Raises
TypeError –
If the input is not a Pandas DataFrame - If any of the user provided variables are not numerical
ValueError –
If there are no numerical variables in the df or the df is empty
If the variable(s) contain null values
If some variables contain zero values
- Returns
- Return type
self
-
transform
(X)[source]¶ Apply the BoxCox transformation.
- Parameters
X (Pandas DataFrame of shape = [n_samples, n_features]) – The data to be transformed.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError –
If the variable(s) contain null values - If the df has different number of features than the df used in fit() - If some variables contain negative values
- Returns
X – The dataframe with the transformed variables.
- Return type
pandas dataframe
class ballet.eng.external.feature_engine.CategoricalImputer(imputation_method='missing', fill_value='Missing', variables=None, return_object=False, ignore_format=False)
Bases: feature_engine.imputation.base_imputer.BaseImputer
The CategoricalImputer() replaces missing data in categorical variables by an arbitrary value or by the most frequent category.
The CategoricalImputer() imputes by default only categorical variables (type ‘object’ or ‘categorical’). You can pass a list of variables to impute, or alternatively, the encoder will find and encode all categorical variables.
If you want to impute numerical variables with this transformer, there are 2 ways of doing it:
Option 1: Cast your numerical variables as object in the input dataframe, before passing it to the transformer.
Option 2: Set ignore_format=True. Note that if you do this and do not pass the list of variables to impute, the imputer will automatically select and impute all variables in the dataframe.
- Parameters
imputation_method (str, default='missing') – Desired method of imputation. Can be ‘frequent’ for frequent category imputation or ‘missing’ to impute with an arbitrary value.
fill_value (str, int, float, default='Missing') – Only used when imputation_method=’missing’. User-defined value to replace the missing data.
variables (list, default=None) – The list of categorical variables that will be imputed. If None, the imputer will find and transform all variables of type object or categorical by default. You can also make the transformer accept numerical variables, see the parameter ignore_format below.
return_object (bool, default=False) – If working with numerical variables cast as object, decide whether to return the variables as numeric or re-cast them as object. Note that pandas will re-cast them automatically as numeric after the transformation with the mode or with an arbitrary number.
ignore_format (bool, default=False) – Whether the format in which the categorical variables are cast should be ignored. If false, the encoder will automatically select variables of type object or categorical, or check that the variables entered by the user are of type object or categorical. If True, the encoder will select all variables or accept all variables entered by the user, including those cast as numeric.
-
imputer_dict_
¶ Dictionary with most frequent category or arbitrary value per variable.
-
variables_
¶ The group of variables that will be transformed.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
Learn the most frequent category, or assign arbitrary value to variable.
-
transform:
Impute missing data.
-
fit_transform:
Fit to the data, then transform it.
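A minimal usage sketch, not part of the original docstring; the toy dataframe is hypothetical:
import numpy as np
import pandas as pd
from ballet.eng.external.feature_engine import CategoricalImputer

X = pd.DataFrame({'colour': ['blue', 'red', np.nan, 'blue']})
imputer = CategoricalImputer(imputation_method='frequent')
Xt = imputer.fit_transform(X)
# the missing value in 'colour' is replaced by the most frequent category, 'blue'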
-
fit
(X, y=None)[source]¶ Learn the most frequent category if the imputation method is set to frequent.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training dataset.
y (pandas Series, default=None) – y is not needed in this imputation. You can pass None or y.
- Raises
TypeError –
If the input is not a Pandas DataFrame. - If user enters non-categorical variables (unless ignore_format is True)
ValueError – If there are no categorical variables in the df or the df is empty
- Returns
- Return type
self
-
transform
(X)[source]¶ Replace missing data with the learned parameters.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The data to be transformed.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError – If the dataframe has different number of features than the df used in fit()
- Returns
X – The dataframe without missing values in the selected variables.
- Return type
pandas dataframe of shape = [n_samples, n_features]
class ballet.eng.external.feature_engine.CombineWithReferenceFeature(variables_to_combine, reference_variables, operations=['sub'], new_variables_names=None, missing_values='ignore')
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
CombineWithReferenceFeature() applies basic mathematical operations between a group of variables and one or more reference features. It adds one or more additional features to the dataframe with the result of the operations.
In other words, CombineWithReferenceFeature() sums, multiplies, subtracts or divides a group of features to / by a group of reference variables, and returns the result as new variables in the dataframe.
For example, if we have the variables number_payments_first_quarter, number_payments_second_quarter, number_payments_third_quarter, number_payments_fourth_quarter, and total_payments, we can use CombineWithReferenceFeature() to determine the percentage of payments per quarter as follows:
transformer = CombineWithReferenceFeature(
    variables_to_combine=[
        'number_payments_first_quarter',
        'number_payments_second_quarter',
        'number_payments_third_quarter',
        'number_payments_fourth_quarter',
    ],
    reference_variables=['total_payments'],
    operations=['div'],
    new_variables_names=[
        'perc_payments_first_quarter',
        'perc_payments_second_quarter',
        'perc_payments_third_quarter',
        'perc_payments_fourth_quarter',
    ],
)
Xt = transformer.fit_transform(X)
The transformed X, Xt, will contain the additional features indicated in the new_variables_names list plus the original set of variables.
- Parameters
variables_to_combine (list) – The list of numerical variables to be combined with the reference variables.
reference_variables (list) – The list of numerical reference variables that will be added to, multiplied with, or subtracted from the variables_to_combine, or used as denominator for division.
operations (list, default=['sub']) –
The list of basic mathematical operations to be used in transformation.
If None, all of [‘sub’, ‘div’,’add’,’mul’] will be performed. Alternatively, you can enter a list of operations to carry out. Each operation should be a string and must be one of the elements in [‘sub’, ‘div’,’add’, ‘mul’].
Each operation will result in a new variable that will be added to the transformed dataset.
new_variables_names (list, default=None) –
Names of the newly created variables. You can enter a list with the names for the newly created features (recommended). You must enter as many names as new features created by the transformer. The number of new features is the number of operations times the number of reference variables times the number of variables to combine.
Thus, if you want to perform 2 operations, sub and div, combining 4 variables with 2 reference variables, you should enter 2 x 4 x 2 = 16 new variable names.
The order of the names indicated by the user should coincide with the order in which the operations are performed by the transformer. The transformer will first carry out ‘sub’, then ‘div’, then ‘add’ and finally ‘mul’.
If new_variables_names is None, the transformer will assign an arbitrary name to the newly created features.
missing_values (string, default='ignore') – Indicates if missing values should be ignored or raised. If ‘ignore’, the transformer will ignore missing data when transforming the data. If ‘raise’ the transformer will return an error if the training or the datasets to transform contain missing values.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
This transformer does not learn parameters.
-
transform:
Combine the variables with the mathematical operations.
-
fit_transform:
Fit to the data, then transform it.
Notes
Although the transformer in essence allows us to combine any feature with any of the allowed mathematical operations, its use is intended mostly for the creation of new features based on some domain knowledge. Typical examples within the financial sector are:
Ratio between income and debt to create the debt_to_income_ratio.
Subtraction of rent from income to obtain the disposable_income.
-
fit
(X, y=None)[source]¶ This transformer does not learn any parameter. Performs dataframe checks.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples. Can be the entire dataframe, not just the variables to transform.
y (pandas Series, or np.array. Default=None.) – It is not needed in this transformer. You can pass y or None.
- Raises
TypeError –
If the input is not a Pandas DataFrame - If any user provided variables are not numerical
ValueError – If any of the reference variables contain null values and the mathematical operation is ‘div’.
- Returns
- Return type
self
-
transform
(X)[source]¶ Combine the variables with the mathematical operations.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The data to transform.
- Returns
X – The dataframe with the operations results added as columns.
- Return type
Pandas dataframe, shape = [n_samples, n_features + n_operations]
class ballet.eng.external.feature_engine.CountFrequencyEncoder(encoding_method='count', variables=None, ignore_format=False)
Bases: feature_engine.encoding.base_encoder.BaseCategoricalTransformer
The CountFrequencyEncoder() replaces categories by either the count or the percentage of observations per category.
For example in the variable colour, if 10 observations are blue, blue will be replaced by 10. Alternatively, if 10% of the observations are blue, blue will be replaced by 0.1.
The CountFrequencyEncoder() will encode only categorical variables by default (type ‘object’ or ‘categorical’). You can pass a list of variables to encode. Alternatively, the encoder will find and encode all categorical variables (type ‘object’ or ‘categorical’).
With ignore_format=True you have the option to encode numerical variables as well. The procedure is identical, you can either enter the list of variables to encode, or the transformer will automatically select all variables.
The encoder first maps the categories to the counts or frequencies for each variable (fit). The encoder then replaces the categories with those numbers (transform).
- Parameters
encoding_method (str, default='count') –
Desired method of encoding.
’count’: number of observations per category
’frequency’: percentage of observations per category
variables (list, default=None) – The list of categorical variables that will be encoded. If None, the encoder will find and transform all variables of type object or categorical by default. You can also make the transformer accept numerical variables, see the next parameter.
ignore_format (bool, default=False) – Whether the format in which the categorical variables are cast should be ignored. If false, the encoder will automatically select variables of type object or categorical, or check that the variables entered by the user are of type object or categorical. If True, the encoder will select all variables or accept all variables entered by the user, including those cast as numeric.
-
encoder_dict_
¶ Dictionary with the count or frequency per category, per variable.
-
variables_
¶ The group of variables that will be transformed.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
Learn the count or frequency per category, per variable.
-
transform:
Encode the categories to numbers.
-
fit_transform:
Fit to the data, then transform it.
-
inverse_transform:
Encode the numbers into the original categories.
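A minimal usage sketch, not part of the original docstring; the toy dataframe and the values shown in the comment are illustrative:
import pandas as pd
from ballet.eng.external.feature_engine import CountFrequencyEncoder

X = pd.DataFrame({'colour': ['blue', 'blue', 'red', 'grey']})
encoder = CountFrequencyEncoder(encoding_method='frequency')
Xt = encoder.fit_transform(X)
print(encoder.encoder_dict_)  # e.g. {'colour': {'blue': 0.5, 'red': 0.25, 'grey': 0.25}}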
Notes
NAN are introduced when encoding categories that were not present in the training dataset. If this happens, try grouping infrequent categories using the RareLabelEncoder().
See also
feature_engine.encoding.RareLabelEncoder
-
fit
(X, y=None)[source]¶ Learn the counts or frequencies which will be used to replace the categories.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training dataset. Can be the entire dataframe, not just the variables to be transformed.
y (pandas Series, default = None) – y is not needed in this encoder. You can pass y or None.
- Raises
TypeError –
If the input is not a Pandas DataFrame. - If the user enters non-categorical variables (unless ignore_format is True)
ValueError –
If there are no categorical variables in the df or the df is empty - If the variable(s) contain null values
- Returns
- Return type
self
-
inverse_transform
(X)[source]¶ Convert the encoded variable back to the original values.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The transformed dataframe.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError –
If the variable(s) contain null values - If the df has different number of features than the df used in fit()
- Returns
X – The un-transformed dataframe, with the categorical variables containing the original values.
- Return type
pandas dataframe of shape = [n_samples, n_features]
-
transform
(X)[source]¶ Replace categories with the learned parameters.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The dataset to transform.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError –
If the variable(s) contain null values - If the df has different number of features than the df used in fit()
Warning – If after encoding, NAN were introduced.
- Returns
X – The dataframe containing the categories replaced by numbers.
- Return type
pandas dataframe of shape = [n_samples, n_features]
class ballet.eng.external.feature_engine.DecisionTreeDiscretiser(variables=None, cv=3, scoring='neg_mean_squared_error', param_grid=None, regression=True, random_state=None)
Bases: feature_engine.base_transformers.BaseNumericalTransformer
The DecisionTreeDiscretiser() replaces continuous numerical variables by discrete, finite, values estimated by a decision tree.
The method is inspired by the following article from the winners of the KDD 2009 competition: http://www.mtome.com/Publications/CiML/CiML-v3-book.pdf
The DecisionTreeDiscretiser() works only with numerical variables. A list of variables can be passed as an argument. Alternatively, the discretiser will automatically select all numerical variables.
The DecisionTreeDiscretiser() first trains a decision tree for each variable.
The DecisionTreeDiscretiser() then transforms the variables, that is, makes predictions based on the variable values, using the trained decision tree.
- Parameters
variables (list, default=None) – The list of numerical variables to transform. If None, the discretiser will automatically select all numerical variables.
cv (int, default=3) – Desired number of cross-validation fold to be used to fit the decision tree.
scoring (str, default='neg_mean_squared_error') – Desired metric to optimise the performance for the tree. Comes from sklearn.metrics. See DecisionTreeRegressor or DecisionTreeClassifier model evaluation documentation for more options: https://scikit-learn.org/stable/modules/model_evaluation.html
param_grid (dictionary, default=None) –
The list of parameters over which the decision tree should be optimised during the grid search. The param_grid can contain any of the permitted parameters for Scikit-learn’s DecisionTreeRegressor() or DecisionTreeClassifier().
If None, then param_grid = {‘max_depth’: [1, 2, 3, 4]}
regression (boolean, default=True) – Indicates whether the discretiser should train a regression or a classification decision tree.
random_state (int, default=None) – The random_state to initialise the training of the decision tree. It is one of the parameters of the Scikit-learn’s DecisionTreeRegressor() or DecisionTreeClassifier(). For reproducibility it is recommended to set the random_state to an integer.
-
binner_dict_
¶ Dictionary containing the fitted tree per variable.
-
scores_dict_
¶ Dictionary with the score of the best decision tree, over the train set.
-
variables_
¶ The variables to discretise.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
Fit a decision tree per variable.
-
transform:
Replace continuous values by the predictions of the decision tree.
-
fit_transform:
Fit to the data, then transform it.
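A minimal usage sketch, not part of the original docstring; the toy dataframe, target and cv value are illustrative:
import pandas as pd
from ballet.eng.external.feature_engine import DecisionTreeDiscretiser

X = pd.DataFrame({'age': [22, 25, 31, 38, 44, 52, 60, 67]})
y = pd.Series([15.0, 17.0, 22.0, 30.0, 34.0, 41.0, 45.0, 50.0])
discretiser = DecisionTreeDiscretiser(cv=2, regression=True)
Xt = discretiser.fit_transform(X, y)  # the target is required to train the trees
# 'age' values are replaced by the finite set of predictions of the fitted tree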
See also
sklearn.tree.DecisionTreeClassifier, sklearn.tree.DecisionTreeRegressor
References
- 1
Niculescu-Mizil, et al. “Winning the KDD Cup Orange Challenge with Ensemble Selection”. JMLR: Workshop and Conference Proceedings 7: 23-34. KDD 2009 http://proceedings.mlr.press/v7/niculescu09/niculescu09.pdf
-
fit
(X, y)[source]¶ Fit the decision trees. One tree per variable to be transformed.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training dataset. Can be the entire dataframe, not just the variables to be transformed.
y (pandas series.) – Target variable. Required to train the decision tree.
- Raises
TypeError –
If the input is not a Pandas DataFrame - If any of the user provided variables are not numerical
ValueError –
If there are no numerical variables in the df or the df is empty - If the variable(s) contain null values
- Returns
- Return type
self
-
transform
(X)[source]¶ Replace the original variable values with the predictions of the tree. The tree output is finite, i.e., discrete.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The input samples.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError –
If the variable(s) contain null values - If the dataframe is not of the same size as the one used in fit()
- Returns
X_transformed – The dataframe with transformed variables.
- Return type
pandas dataframe of shape = [n_samples, n_features]
class ballet.eng.external.feature_engine.DecisionTreeEncoder(encoding_method='arbitrary', cv=3, scoring='neg_mean_squared_error', param_grid=None, regression=True, random_state=None, variables=None, ignore_format=False)
Bases: feature_engine.encoding.base_encoder.BaseCategoricalTransformer
The DecisionTreeEncoder() encodes categorical variables with predictions of a decision tree.
The encoder first fits a decision tree using a single feature and the target (fit). And then replaces the values of the original feature by the predictions of the tree (transform). The transformer will train a Decision tree per every feature to encode.
The motivation is to try and create monotonic relationships between the categorical variables and the target.
Under the hood, the categorical variable will be first encoded into integers with the OrdinalCategoricalEncoder(). The integers can be assigned arbitrarily to the categories or following the mean value of the target in each category. Then a decision tree will fit the resulting numerical variable to predict the target variable. Finally, the original categorical variable values will be replaced by the predictions of the decision tree.
The DecisionTreeEncoder() will encode only categorical variables by default (type ‘object’ or ‘categorical’). You can pass a list of variables to encode or the encoder will find and encode all categorical variables. But with ignore_format=True you have the option to encode numerical variables as well. In this case, you can either enter the list of variables to encode, or the transformer will automatically select all variables.
- Parameters
encoding_method (str, default='arbitrary') –
The categorical encoding method that will be used to encode the original categories to numerical values.
’ordered’: the categories are numbered in ascending order according to the target mean value per category.
’arbitrary’ : categories are numbered arbitrarily.
cv (int, default=3) – Desired number of cross-validation fold to be used to fit the decision tree.
scoring (str, default='neg_mean_squared_error') – Desired metric to optimise the performance for the decision tree. Comes from sklearn.metrics. See the DecisionTreeRegressor or DecisionTreeClassifier model evaluation documentation for more options: https://scikit-learn.org/stable/modules/model_evaluation.html
param_grid (dictionary, default=None) –
The list of parameters over which the decision tree should be optimised during the grid search. The param_grid can contain any of the permitted parameters for Scikit-learn’s DecisionTreeRegressor() or DecisionTreeClassifier().
If None, then param_grid = {‘max_depth’: [1, 2, 3, 4]}.
regression (boolean, default=True) – Indicates whether the encoder should train a regression or a classification decision tree.
random_state (int, default=None) – The random_state to initialise the training of the decision tree. It is one of the parameters of the Scikit-learn’s DecisionTreeRegressor() or DecisionTreeClassifier(). For reproducibility it is recommended to set the random_state to an integer.
variables (list, default=None) – The list of categorical variables that will be encoded. If None, the encoder will find and transform all variables of type object or categorical by default. You can also make the transformer accept numerical variables, see the next parameter.
ignore_format (bool, default=False) – Whether the format in which the categorical variables are cast should be ignored. If false, the encoder will automatically select variables of type object or categorical, or check that the variables entered by the user are of type object or categorical. If True, the encoder will select all variables or accept all variables entered by the user, including those cast as numeric.
-
encoder_
¶ sklearn Pipeline containing the ordinal encoder and the decision tree.
-
variables_
¶ The group of variables that will be transformed.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
Fit a decision tree per variable.
-
transform:
Replace categorical variable by the predictions of the decision tree.
-
fit_transform:
Fit to the data, then transform it.
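A minimal usage sketch, not part of the original docstring; the toy dataframe, target and cv value are illustrative:
import pandas as pd
from ballet.eng.external.feature_engine import DecisionTreeEncoder

X = pd.DataFrame({'colour': ['blue', 'blue', 'red', 'grey', 'red', 'grey']})
y = pd.Series([1.0, 1.2, 2.9, 0.4, 3.1, 0.2])
encoder = DecisionTreeEncoder(regression=True, cv=2)
Xt = encoder.fit_transform(X, y)
# each category in 'colour' is replaced by the prediction of a tree fit on that feature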
Notes
The authors originally designed this method to work with numerical variables. We can replace numerical variables with the predictions of a decision tree utilising the DecisionTreeDiscretiser().
NAN are introduced when encoding categories that were not present in the training dataset. If this happens, try grouping infrequent categories using the RareLabelEncoder().
See also
sklearn.tree.DecisionTreeRegressor, sklearn.tree.DecisionTreeClassifier, feature_engine.discretisation.DecisionTreeDiscretiser, feature_engine.encoding.RareLabelEncoder, feature_engine.encoding.OrdinalEncoder
References
- 1
Niculescu-Mizil, et al. “Winning the KDD Cup Orange Challenge with Ensemble Selection”. JMLR: Workshop and Conference Proceedings 7: 23-34. KDD 2009 http://proceedings.mlr.press/v7/niculescu09/niculescu09.pdf
-
fit
(X, y=None)[source]¶ Fit a decision tree per variable.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples. Can be the entire dataframe, not just the categorical variables.
y (pandas series.) – The target variable. Required to train the decision tree and for ordered ordinal encoding.
- Raises
TypeError –
If the input is not a Pandas DataFrame. - If the user enters non-categorical variables (unless ignore_format is True)
ValueError –
If there are no categorical variables in the df or the df is empty - If the variable(s) contain null values
- Returns
- Return type
self
-
transform
(X)[source]¶ Replace categorical variable by the predictions of the decision tree.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The input samples.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError –
If the variable(s) contain null values - If dataframe is not of same size as that used in fit()
Warning – If after encoding, NAN were introduced.
- Returns
X – Dataframe with variables encoded with decision tree predictions.
- Return type
pandas dataframe of shape = [n_samples, n_features]
class ballet.eng.external.feature_engine.DropMissingData(missing_only=True, variables=None)
Bases: feature_engine.imputation.base_imputer.BaseImputer
The DropMissingData() will delete rows containing missing values. It provides similar functionality to pandas.DataFrame.dropna().
It works for both numerical and categorical variables. You can enter the list of variables for which missing values should be removed from the dataframe. Alternatively, the imputer will automatically select all variables in the dataframe.
Note: The transformer will first select all variables, or all user-entered variables, and if missing_only=True, it will re-select from the original group only those that show missing data during fit, that is, in the train set.
- Parameters
missing_only (bool, default=True) – If true, missing observations will be dropped only for the variables that have missing data in the train set, during fit. If False, observations with NA will be dropped from all variables indicated by the user.
variables (list, default=None) – The list of variables to be imputed. If None, the imputer will find and select all variables in the dataframe.
-
variables_
¶ List of variables for which the rows with NA will be deleted.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
Learn the variables for which the rows with NA will be deleted
-
transform:
Remove observations with NA
-
fit_transform:
Fit to the data, then transform it.
-
return_na_data:
Returns the dataframe with the rows that contain NA.
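A minimal usage sketch, not part of the original docstring; the toy dataframe is hypothetical:
import numpy as np
import pandas as pd
from ballet.eng.external.feature_engine import DropMissingData

X = pd.DataFrame({'age': [20, np.nan, 35], 'city': ['NY', 'LA', None]})
dropper = DropMissingData()
Xt = dropper.fit_transform(X)        # rows with NA in the selected variables are removed
na_rows = dropper.return_na_data(X)  # the rows that would be dropped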
-
fit
(X, y=None)[source]¶ Learn the variables for which the rows with NA will be deleted.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training dataset.
y (pandas Series, default=None) – y is not needed in this imputation. You can pass None or y.
- Raises
TypeError – If the input is not a Pandas DataFrame
- Returns
- Return type
self
-
return_na_data
(X)[source]¶ Returns the subset of the dataframe which contains the rows with missing values. This method could be useful in production, in case we want to store the observations that will not be fed into the model.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The dataframe to be transformed.
- Raises
TypeError – If the input is not a Pandas DataFrame
- Returns
X – The dataframe containing only the rows with missing values.
- Return type
pandas dataframe of shape = [obs_with_na, features]
-
transform
(X)[source]¶ Remove rows with missing values.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The dataframe to be transformed.
- Returns
X_transformed – The complete case dataframe for the selected variables, of shape [n_samples - rows_with_na, n_features]
- Return type
pandas dataframe
class ballet.eng.external.feature_engine.EndTailImputer(imputation_method='gaussian', tail='right', fold=3, variables=None)
Bases: feature_engine.imputation.base_imputer.BaseImputer
The EndTailImputer() replaces missing data by a value at either tail of the distribution. It works only with numerical variables.
You can indicate the variables to be imputed in a list. Alternatively, the EndTailImputer() will automatically find and select all variables of type numeric.
The imputer first calculates the values at the end of the distribution for each variable (fit). The values at the end of the distribution are determined using the Gaussian limits, the IQR proximity rule limits, or a factor of the maximum value:
- Gaussian limits:
right tail: mean + 3*std
left tail: mean - 3*std
- IQR limits:
right tail: 75th quantile + 3*IQR
left tail: 25th quantile - 3*IQR
where IQR is the inter-quartile range = 75th quantile - 25th quantile
- Maximum value:
right tail: max * 3
left tail: not applicable
You can change the factor that multiplies the std, IQR or the maximum value using the parameter ‘fold’ (we used fold=3 in the examples above).
The imputer then replaces the missing data with the estimated values (transform).
- Parameters
imputation_method (str, default=gaussian) –
Method to be used to find the replacement values. Can take ‘gaussian’, ‘iqr’ or ‘max’.
gaussian: the imputer will use the Gaussian limits to find the values to replace missing data.
iqr: the imputer will use the IQR limits to find the values to replace missing data.
max: the imputer will use the maximum values to replace missing data. Note that if ‘max’ is passed, the parameter ‘tail’ is ignored.
tail (str, default=right) – Indicates if the values to replace missing data should be selected from the right or left tail of the variable distribution. Can take values ‘left’ or ‘right’.
fold (int, default=3) – Factor to multiply the std, the IQR or the Max values. Recommended values are 2 or 3 for Gaussian, or 1.5 or 3 for IQR.
variables (list, default=None) – The list of variables to be imputed. If None, the imputer will find and select all variables of type numeric.
-
imputer_dict_
¶ Dictionary with the values at the end of the distribution per variable.
-
variables_
¶ The group of variables that will be transformed.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
Learn values to replace missing data.
-
transform:
Impute missing data.
-
fit_transform:
Fit to the data, then transform it.
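A minimal usage sketch, not part of the original docstring; the toy dataframe is hypothetical:
import numpy as np
import pandas as pd
from ballet.eng.external.feature_engine import EndTailImputer

X = pd.DataFrame({'income': [10.0, 20.0, 30.0, np.nan, 40.0]})
imputer = EndTailImputer(imputation_method='gaussian', tail='right', fold=3)
Xt = imputer.fit_transform(X)
print(imputer.imputer_dict_)  # {'income': mean + 3 * std of the observed values}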
-
fit
(X, y=None)[source]¶ Learn the values at the end of the variable distribution.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training dataset.
y (pandas Series, default=None) – y is not needed in this imputation. You can pass None or y.
- Raises
TypeError –
If the input is not a Pandas DataFrame - If any of the user provided variables are not numerical
ValueError – If there are no numerical variables in the df or the df is empty
- Returns
- Return type
self
-
transform
(X)[source]¶ Replace missing data with the learned parameters.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The data to be transformed.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError – If the dataframe has different number of features than the df used in fit()
- Returns
X – The dataframe without missing values in the selected variables.
- Return type
pandas dataframe of shape = [n_samples, n_features]
class ballet.eng.external.feature_engine.EqualFrequencyDiscretiser(variables=None, q=10, return_object=False, return_boundaries=False)
Bases: feature_engine.base_transformers.BaseNumericalTransformer
The EqualFrequencyDiscretiser() divides continuous numerical variables into contiguous equal frequency intervals, that is, intervals that contain approximately the same proportion of observations.
The interval limits are determined using pandas.qcut(), in other words, the interval limits are determined by the quantiles. The number of intervals, i.e., the number of quantiles in which the variable should be divided is determined by the user.
The EqualFrequencyDiscretiser() works only with numerical variables. A list of variables can be passed as argument. Alternatively, the discretiser will automatically select and transform all numerical variables.
The EqualFrequencyDiscretiser() first finds the boundaries for the intervals or quantiles for each variable.
Then it transforms the variables, that is, it sorts the values into the intervals.
- Parameters
variables (list, default=None) – The list of numerical variables that will be discretised. If None, the EqualFrequencyDiscretiser() will select all numerical variables.
q (int, default=10) – Desired number of equal frequency intervals / bins. In other words the number of quantiles in which the variables should be divided.
return_object (bool, default=False) –
Whether the discrete variable should be returned cast as numeric or as object. If you would like to proceed with the engineering of the variable as if it was categorical, use True. Otherwise, keep the default of False.
Categorical encoders in Feature-engine work only with variables of type object, thus, if you wish to encode the returned bins, set return_object to True.
return_boundaries (bool, default=False) – Whether the output should be the interval boundaries. If True, it returns the interval boundaries. If False, it returns integers.
-
binner_dict_
¶ Dictionary with the interval limits per variable.
-
variables_
¶ The variables to discretise.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
Find the interval limits.
-
transform:
Sort continuous variable values into the intervals.
-
fit_transform:
Fit to the data, then transform it.
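A minimal usage sketch, not part of the original docstring; the toy dataframe and the choice of q are illustrative:
import pandas as pd
from ballet.eng.external.feature_engine import EqualFrequencyDiscretiser

X = pd.DataFrame({'age': range(20, 70)})
discretiser = EqualFrequencyDiscretiser(q=5)
Xt = discretiser.fit_transform(X)
# values are mapped to bins 0-4, each holding roughly 20% of the observations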
See also
pandas.qcut: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html
References
- 1
Kotsiantis and Pintelas, “Data preprocessing for supervised learning,” International Journal of Computer Science, vol. 1, pp. 111-117, 2006.
- 2
Dong. “Beating Kaggle the easy way”. Master Thesis. https://www.ke.tu-darmstadt.de/lehre/arbeiten/studien/2015/Dong_Ying.pdf
-
fit
(X, y=None)[source]¶ Learn the limits of the equal frequency intervals.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training dataset. Can be the entire dataframe, not just the variables to be transformed.
y (None) – y is not needed in this encoder. You can pass y or None.
- Raises
TypeError –
If the input is not a Pandas DataFrame - If any of the user provided variables are not numerical
ValueError –
If there are no numerical variables in the df or the df is empty - If the variable(s) contain null values
- Returns
- Return type
self
-
transform
(X)[source]¶ Sort the variable values into the intervals.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The data to transform.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError –
If the variable(s) contain null values - If the dataframe is not of the same size as the one used in fit()
- Returns
X – The transformed data with the discrete variables.
- Return type
pandas dataframe of shape = [n_samples, n_features]
class ballet.eng.external.feature_engine.EqualWidthDiscretiser(variables=None, bins=10, return_object=False, return_boundaries=False)
Bases: feature_engine.base_transformers.BaseNumericalTransformer
The EqualWidthDiscretiser() divides continuous numerical variables into intervals of the same width, that is, equidistant intervals. Note that the proportion of observations per interval may vary.
The size of the interval is calculated as:
\[( max(X) - min(X) ) / bins\]
where bins, which is the number of intervals, should be determined by the user.
The interval limits are determined using pandas.cut(). The number of intervals in which the variable should be divided must be indicated by the user.
The EqualWidthDiscretiser() works only with numerical variables. A list of variables can be passed as argument. Alternatively, the discretiser will automatically select all numerical variables.
The EqualWidthDiscretiser() first finds the boundaries for the intervals for each variable. Then, it transforms the variables, that is, sorts the values into the intervals.
- Parameters
variables (list, default=None) – The list of numerical variables to transform. If None, the discretiser will automatically select all numerical type variables.
bins (int, default=10) – Desired number of equal width intervals / bins.
return_object (bool, default=False) –
Whether the discrete variable should be returned cast as numeric or as object. If you would like to proceed with the engineering of the variable as if it was categorical, use True. Otherwise, keep the default of False.
Categorical encoders in Feature-engine work only with variables of type object, thus, if you wish to encode the returned bins, set return_object to True.
return_boundaries (bool, default=False) – Whether the output should be the interval boundaries. If True, it returns the interval boundaries. If False, it returns integers.
-
binner_dict_
¶ Dictionary with the interval limits per variable.
-
variables_
¶ The variables to be discretised.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
Find the interval limits.
-
transform:
Sort continuous variable values into the intervals.
-
fit_transform:
Fit to the data, then transform it.
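A minimal usage sketch, not part of the original docstring; the toy dataframe and bin count are illustrative:
import pandas as pd
from ballet.eng.external.feature_engine import EqualWidthDiscretiser

X = pd.DataFrame({'age': [18, 22, 35, 41, 57, 63, 70]})
discretiser = EqualWidthDiscretiser(bins=4)
Xt = discretiser.fit_transform(X)
# the interval width is (70 - 18) / 4 = 13, and values are mapped to bin indices 0-3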
See also
pandas.cut: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html
References
- 1
Kotsiantis and Pintelas, “Data preprocessing for supervised learning,” International Journal of Computer Science, vol. 1, pp. 111-117, 2006.
- 2
Dong. “Beating Kaggle the easy way”. Master Thesis. https://www.ke.tu-darmstadt.de/lehre/arbeiten/studien/2015/Dong_Ying.pdf
-
fit
(X, y=None)[source]¶ Learn the boundaries of the equal width intervals / bins for each variable.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training dataset. Can be the entire dataframe, not just the variables to be transformed.
y (None) – y is not needed in this encoder. You can pass y or None.
- Raises
TypeError –
If the input is not a Pandas DataFrame - If any of the user provided variables are not numerical
ValueError –
If there are no numerical variables in the df or the df is empty - If the variable(s) contain null values
- Returns
- Return type
self
-
transform
(X)[source]¶ Sort the variable values into the intervals.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The data to transform.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError –
If the variable(s) contain null values - If the dataframe is not of the same size as the one used in fit()
- Returns
X – The transformed data with the discrete variables.
- Return type
pandas dataframe of shape = [n_samples, n_features]
class ballet.eng.external.feature_engine.LogTransformer(variables=None, base='e')
Bases: feature_engine.base_transformers.BaseNumericalTransformer
The LogTransformer() applies the natural logarithm or the base 10 logarithm to numerical variables. The natural logarithm is the logarithm in base e.
The LogTransformer() only works with positive values. If the variable contains a zero or a negative value the transformer will return an error.
A list of variables can be passed as an argument. Alternatively, the transformer will automatically select and transform all variables of type numeric.
- Parameters
variables (list, default=None) – The list of numerical variables to transform. If None, the transformer will find and select all numerical variables.
base (string, default='e') – Indicates if the natural or base 10 logarithm should be applied. Can take values ‘e’ or ‘10’.
-
variables_
¶ The group of variables that will be transformed.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
This transformer does not learn parameters.
-
transform:
Transform the variables using the logarithm.
-
fit_transform:
Fit to data, then transform it.
-
inverse_transform:
Convert the data back to the original representation.
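A minimal usage sketch, not part of the original docstring; the toy dataframe is hypothetical:
import pandas as pd
from ballet.eng.external.feature_engine import LogTransformer

X = pd.DataFrame({'income': [1.0, 10.0, 100.0]})
transformer = LogTransformer(base='10')
Xt = transformer.fit_transform(X)           # 'income' becomes [0.0, 1.0, 2.0]
X_orig = transformer.inverse_transform(Xt)  # recovers the original values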
-
fit
(X, y=None)[source]¶ This transformer does not learn parameters.
Selects the numerical variables and determines whether the logarithm can be applied on the selected variables, i.e., it checks that the variables are positive.
- Parameters
X (Pandas DataFrame of shape = [n_samples, n_features]) – The training input samples. Can be the entire dataframe, not just the variables to transform.
y (pandas Series, default=None) – It is not needed in this transformer. You can pass y or None.
- Raises
TypeError –
If the input is not a Pandas DataFrame - If any of the user provided variables are not numerical
ValueError –
If there are no numerical variables in the df or the df is empty - If the variable(s) contain null values - If some variables contain zero or negative values
- Returns
- Return type
self
-
inverse_transform
(X)[source]¶ Convert the data back to the original representation.
- Parameters
X (Pandas DataFrame of shape = [n_samples, n_features]) – The data to be transformed.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError –
If the variable(s) contain null values - If the df has different number of features than the df used in fit() - If some variables contain zero or negative values
- Returns
X – The dataframe with the transformed variables.
- Return type
pandas dataframe
-
transform
(X)[source]¶ Transform the variables with the logarithm.
- Parameters
X (Pandas DataFrame of shape = [n_samples, n_features]) – The data to be transformed.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError –
If the variable(s) contain null values - If the df has different number of features than the df used in fit() - If some variables contain zero or negative values
- Returns
X – The dataframe with the transformed variables.
- Return type
pandas dataframe
class ballet.eng.external.feature_engine.MathematicalCombination(variables_to_combine, math_operations=None, new_variables_names=None, missing_values='raise')
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
MathematicalCombination() applies basic mathematical operations to multiple features, returning one or more additional features as a result. That is, it sums, multiplies, takes the average, maximum, minimum or standard deviation of a group of variables, and returns the result into new variables.
For example, if we have the variables number_payments_first_quarter, number_payments_second_quarter, number_payments_third_quarter and number_payments_fourth_quarter, we can use MathematicalCombination() to calculate the total number of payments and mean number of payments as follows:
transformer = MathematicalCombination(
    variables_to_combine=[
        'number_payments_first_quarter',
        'number_payments_second_quarter',
        'number_payments_third_quarter',
        'number_payments_fourth_quarter',
    ],
    math_operations=['sum', 'mean'],
    new_variables_names=['total_number_payments', 'mean_number_payments'],
)
Xt = transformer.fit_transform(X)
The transformed X, Xt, will contain the additional features total_number_payments and mean_number_payments, plus the original set of variables.
Attention, if some of the variables to combine have missing data and missing_values = ‘ignore’, the value will be ignored in the computation. To be clear, if variables A, B and C, have values 10, 20 and NA, and we perform the sum, the result will be A + B = 30.
- Parameters
variables_to_combine (list) – The list of numerical variables to be combined.
math_operations (list, default=None) –
The list of basic math operations to be used to create the new features.
If None, all of [‘sum’, ‘prod’, ‘mean’, ‘std’, ‘max’, ‘min’] will be performed over the variables_to_combine. Alternatively, you can enter the list of operations to carry out.
Each operation should be a string and must be one of the elements in [‘sum’, ‘prod’, ‘mean’, ‘std’, ‘max’, ‘min’].
Each operation will result in a new variable that will be added to the transformed dataset.
new_variables_names (list, default=None) –
Names of the newly created variables. You can enter a name or a list of names for the newly created features (recommended). You must enter one name for each mathematical transformation indicated in the math_operations parameter. That is, if you want to perform mean and sum of features, you should enter 2 new variable names. If you perform only mean of features, enter 1 variable name. Alternatively, if you chose to perform all mathematical transformations, enter 6 new variable names.
The name of the variables indicated by the user should coincide with the order in which the mathematical operations are initialised in the transformer. That is, if you set math_operations = [‘mean’, ‘prod’], the first new variable name will be assigned to the mean of the variables and the second variable name to the product of the variables.
If new_variables_names=None, the transformer will assign an arbitrary name to the newly created features, starting with the name of the mathematical operation, followed by the combined variables separated by -.
missing_values (string, default='raise') – Indicates if missing values should be ignored or raised. If ‘raise’, the transformer will return an error if the datasets to fit or transform contain missing values. If ‘ignore’, missing data will be ignored when performing the calculations.
-
combination_dict_
¶ Dictionary containing the mathematical operation to new variable name pairs.
-
math_operations_
¶ List with the mathematical operations to be applied to the variables_to_combine.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
This transformer does not learn parameters.
-
transform:
Combine the variables with the mathematical operations.
-
fit_transform:
Fit to the data, then transform it.
Notes
Although the transformer in essence allows us to combine any feature with any of the allowed mathematical operations, its use is intended mostly for the creation of new features based on some domain knowledge. Typical examples within the financial sector are:
Sum debt across financial products, i.e., credit cards, to obtain the total debt.
Take the average payments to various financial products per month.
Find the minimum payment made in any one month.
In insurance, we can sum the damage to various parts of a car to obtain the total damage.
-
fit
(X, y=None)[source]¶ This transformer does not learn parameters.
Perform dataframe checks. Creates dictionary of operation to new feature name pairs.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples. Can be the entire dataframe, not just the variables to transform.
y (pandas Series, or np.array. Defaults to None.) – It is not needed in this transformer. You can pass y or None.
- Raises
TypeError –
If the input is not a Pandas DataFrame - If any user provided variables in variables_to_combine are not numerical
ValueError – If the variable(s) contain null values when missing_values = raise
- Returns
- Return type
self
-
transform
(X)[source]¶ Combine the variables with the mathematical operations.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The data to transform.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError –
If the variable(s) contain null values when missing_values = raise - If the dataframe is not of the same size as that used in fit()
- Returns
X – The dataframe with the original variables plus the new variables.
- Return type
Pandas dataframe, shape = [n_samples, n_features + n_operations]
-
class
ballet.eng.external.feature_engine.
MeanEncoder
(variables=None, ignore_format=False)[source]¶ Bases:
feature_engine.encoding.base_encoder.BaseCategoricalTransformer
The MeanEncoder() replaces categories by the mean value of the target for each category.
For example in the variable colour, if the mean of the target for blue, red and grey is 0.5, 0.8 and 0.1 respectively, blue is replaced by 0.5, red by 0.8 and grey by 0.1.
The encoder will encode only categorical variables by default (type ‘object’ or ‘categorical’). You can pass a list of variables to encode. Alternatively, the encoder will find and encode all categorical variables (type ‘object’ or ‘categorical’).
With ignore_format=True you have the option to encode numerical variables as well. The procedure is identical, you can either enter the list of variables to encode, or the transformer will automatically select all variables.
The encoder first maps the categories to the numbers for each variable (fit). The encoder then replaces the categories with those numbers (transform).
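A minimal pandas-only sketch of this fit/transform logic (the column name ‘colour’ and the toy data are hypothetical, not from the library):
import pandas as pd

X = pd.DataFrame({'colour': ['blue', 'red', 'blue', 'grey']})
y = pd.Series([1, 1, 0, 0])

# fit: learn the target mean per category
encoder_dict = y.groupby(X['colour']).mean().to_dict()

# transform: replace each category by its learned mean
X_t = X.copy()
X_t['colour'] = X_t['colour'].map(encoder_dict)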
- Parameters
variables (list, default=None) – The list of categorical variables that will be encoded. If None, the encoder will find and transform all variables of type object or categorical by default. You can also make the transformer accept numerical variables, see the next parameter.
ignore_format (bool, default=False) – Whether the format in which the categorical variables are cast should be ignored. If false, the encoder will automatically select variables of type object or categorical, or check that the variables entered by the user are of type object or categorical. If True, the encoder will select all variables or accept all variables entered by the user, including those cast as numeric.
-
encoder_dict_
¶ Dictionary with the target mean value per category per variable.
-
variables_
¶ The group of variables that will be transformed.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
Learn the target mean value per category, per variable.
-
transform:
Encode the categories to numbers.
-
fit_transform:
Fit to the data, then transform it.
-
inverse_transform:
Encode the numbers into the original categories.
Notes
NAN are introduced when encoding categories that were not present in the training dataset. If this happens, try grouping infrequent categories using the RareLabelEncoder().
See also
feature_engine.encoding.RareLabelEncoder
References
- 1
Micci-Barreca D. “A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems”. ACM SIGKDD Explorations Newsletter, 2001. https://dl.acm.org/citation.cfm?id=507538
-
fit
(X, y)[source]¶ Learn the mean value of the target for each category of the variable.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples. Can be the entire dataframe, not just the variables to be encoded.
y (pandas series) – The target.
- Raises
TypeError –
If the input is not a Pandas DataFrame. - If user enters non-categorical variables (unless ignore_format is True)
ValueError –
If there are no categorical variables in the df or the df is empty - If the variable(s) contain null values
- Returns
- Return type
self
-
inverse_transform
(X)[source]¶ Convert the encoded variable back to the original values.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The transformed dataframe.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError –
If the variable(s) contain null values - If the df has different number of features than the df used in fit()
- Returns
X – The un-transformed dataframe, with the categorical variables containing the original values.
- Return type
pandas dataframe of shape = [n_samples, n_features]
-
transform
(X)[source]¶ Replace categories with the learned parameters.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The dataset to transform.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError –
If the variable(s) contain null values - If the df has different number of features than the df used in fit()
Warning – If after encoding, NAN were introduced.
- Returns
X – The dataframe containing the categories replaced by numbers.
- Return type
pandas dataframe of shape = [n_samples, n_features]
-
class
ballet.eng.external.feature_engine.
MeanMedianImputer
(imputation_method='median', variables=None)[source]¶ Bases:
feature_engine.imputation.base_imputer.BaseImputer
The MeanMedianImputer() replaces missing data by the mean or median value of the variable. It works only with numerical variables.
You can pass a list of variables to be imputed. Alternatively, the MeanMedianImputer() will automatically select all variables of type numeric in the training set.
The imputer:
first calculates the mean / median values of the variables (fit).
Then replaces the missing data with the estimated mean / median (transform).
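A minimal pandas-only sketch of the same idea, assuming imputation_method='median' and a hypothetical numerical column 'age':
import pandas as pd

X_train = pd.DataFrame({'age': [20.0, 30.0, None, 40.0]})
X_test = pd.DataFrame({'age': [None, 25.0]})

# fit: learn the median per variable on the training data
imputer_dict = X_train.median().to_dict()   # {'age': 30.0}

# transform: fill missing values with the learned statistics
X_test_t = X_test.fillna(imputer_dict)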
- Parameters
imputation_method (str, default=median) – Desired method of imputation. Can take ‘mean’ or ‘median’.
variables (list, default=None) – The list of variables to be imputed. If None, the imputer will select all variables of type numeric.
-
imputer_dict_
¶ Dictionary with the mean or median values per variable.
-
variables_
¶ The group of variables that will be transformed.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
Learn the mean or median values.
-
transform:
Impute missing data.
-
fit_transform:
Fit to the data, then transform it.
-
fit
(X, y=None)[source]¶ Learn the mean or median values.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training dataset.
y (pandas series or None, default=None) – y is not needed in this imputation. You can pass None or y.
- Raises
TypeError –
If the input is not a Pandas DataFrame - If any of the user provided variables are not numerical
ValueError – If there are no numerical variables in the df or the df is empty
- Returns
- Return type
self
-
transform
(X)[source]¶ Replace missing data with the learned parameters.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The data to be transformed.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError – If the dataframe has different number of features than the df used in fit()
- Returns
X – The dataframe without missing values in the selected variables.
- Return type
pandas dataframe of shape = [n_samples, n_features]
-
class
ballet.eng.external.feature_engine.
OneHotEncoder
(top_categories=None, drop_last=False, drop_last_binary=False, variables=None, ignore_format=False)[source]¶ Bases:
feature_engine.encoding.base_encoder.BaseCategoricalTransformer
One hot encoding consists of replacing the categorical variable by a combination of binary variables which take the value 0 or 1 to indicate if a certain category is present in an observation. The binary variables are also known as dummy variables.
For example, from the categorical variable “Gender” with categories “female” and “male”, we can generate the boolean variable “female”, which takes 1 if the observation is female or 0 otherwise. We can also generate the variable “male”, which takes 1 if the observation is “male” and 0 otherwise.
The encoder can create k binary variables per categorical variable, k being the number of unique categories, or alternatively k-1 to avoid redundant information. This behaviour can be specified using the parameter drop_last.
The encoder has the additional option to generate binary variables only for the top n most popular categories, that is, the categories that are shared by the majority of the observations in the dataset. This behaviour can be specified with the parameter top_categories.
Note
Only when creating binary variables for all categories of the variable can we specify whether to encode into k or k-1 binary variables, where k is the number of unique categories. If we encode only the top n most popular categories, the encoder will create only n binary variables per categorical variable. Observations that do not show any of these popular categories will have 0 in all the binary variables.
The encoder will encode only categorical variables by default (type ‘object’ or ‘categorical’). You can pass a list of variables to encode. Alternatively, the encoder will find and encode all categorical variables (type ‘object’ or ‘categorical’).
With ignore_format=True you have the option to encode numerical variables as well. The procedure is identical, you can either enter the list of variables to encode, or the transformer will automatically select all variables.
The encoder first finds the categories to be encoded for each variable (fit). The encoder then creates one dummy variable per category for each variable (transform).
Note
New categories in the data to transform, that is, those that did not appear in the training set, will be ignored (no binary variable will be created for them). This means that observations with categories not present in the train set, will be encoded as 0 in all the binary variables.
Also Note
The original categorical variables are removed from the returned dataset when we apply the transform() method. In their place, the binary variables are returned.
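A rough pandas-only equivalent using pandas.get_dummies (a sketch only: unlike the encoder, get_dummies is not fitted, so unseen categories are handled differently, and drop_first drops the first rather than the last category). The column name 'gender' is hypothetical:
import pandas as pd

X = pd.DataFrame({'gender': ['female', 'male', 'female']})

k_dummies = pd.get_dummies(X, columns=['gender'])                   # k binary variables
k_minus_1 = pd.get_dummies(X, columns=['gender'], drop_first=True)  # k-1, similar to drop_last=True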
- Parameters
top_categories (int, default=None) – If None, a dummy variable will be created for each category of the variable. Alternatively, we can indicate in top_categories the number of most frequent categories to encode. In this case, dummy variables will be created only for those popular categories and the rest will be ignored, i.e., they will show the value 0 in all the binary variables.
drop_last (boolean, default=False) – Only used if top_categories = None. It indicates whether to create dummy variables for all the categories (k dummies), or if set to True, it will ignore the last binary variable and return k-1 dummies.
drop_last_binary (boolean, default=False) – Whether to return 1 or 2 dummy variables for binary categorical variables. When a categorical variable has only 2 categories, then the second dummy variable created by one hot encoding can be completely redundant. Setting this parameter to True, will ensure that for every binary variable in the dataset, only 1 dummy is created.
variables (list, default=None) – The list of categorical variables that will be encoded. If None, the encoder will find and transform all variables of type object or categorical by default. You can also make the transformer accept numerical variables, see the next parameter.
ignore_format (bool, default=False) – Whether the format in which the categorical variables are cast should be ignored. If false, the encoder will automatically select variables of type object or categorical, or check that the variables entered by the user are of type object or categorical. If True, the encoder will select all variables or accept all variables entered by the user, including those cast as numeric.
-
encoder_dict_
¶ Dictionary with the categories for which dummy variables will be created.
-
variables_
¶ The group of variables that will be transformed.
-
variables_binary_
¶ A list with binary variables identified from the data. That is, variables with only 2 categories.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
Learn the unique categories per variable
-
transform:
Replace the categorical variables by the binary variables.
-
fit_transform:
Fit to the data, then transform it.
Notes
If the variables are intended for linear models, it is recommended to encode into k-1 or top categories. If the variables are intended for tree based algorithms, it is recommended to encode into k or top n categories. If feature selection will be performed, then also encode into k or top n categories. Linear models evaluate all features during fit, while tree based models and many feature selection algorithms evaluate variables or groups of variables separately. Thus, if encoding into k-1, the last variable / category will not be examined.
References
One hot encoding of top categories was described in the following article:
- 1
Niculescu-Mizil, et al. “Winning the KDD Cup Orange Challenge with Ensemble Selection”. JMLR: Workshop and Conference Proceedings 7: 23-34. KDD 2009 http://proceedings.mlr.press/v7/niculescu09/niculescu09.pdf
-
fit
(X, y=None)[source]¶ Learns the unique categories per variable. If top_categories is indicated, it will learn the most popular categories. Alternatively, it learns all unique categories per variable.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples. Can be the entire dataframe, not just the selected variables.
y (pandas series, default=None) – Target. It is not needed in this encoder. You can pass y or None.
- Raises
TypeError –
If the input is not a Pandas DataFrame. - If user enters non-categorical variables (unless ignore_format is True)
ValueError –
If there are no categorical variables in the df or the df is empty - If the variable(s) contain null values
- Returns
- Return type
self
-
transform
(X)[source]¶ Replaces the categorical variables by the binary variables.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The data to transform.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError –
If the variable(s) contain null values. - If dataframe has different number of features than the df used in fit()
- Returns
X – The transformed dataframe. The shape of the dataframe will be different from the original as it includes the dummy variables in place of the original categorical ones.
- Return type
pandas dataframe.
-
class
ballet.eng.external.feature_engine.
OrdinalEncoder
(encoding_method='ordered', variables=None, ignore_format=False)[source]¶ Bases:
feature_engine.encoding.base_encoder.BaseCategoricalTransformer
The OrdinalEncoder() replaces categories by ordinal numbers (0, 1, 2, 3, etc). The numbers can be ordered based on the mean of the target per category, or assigned arbitrarily.
Ordered ordinal encoding: for the variable colour, if the mean of the target for blue, red and grey is 0.5, 0.8 and 0.1 respectively, blue is replaced by 1, red by 2 and grey by 0.
Arbitrary ordinal encoding: the numbers will be assigned arbitrarily to the categories, on a first seen first served basis.
The encoder will encode only categorical variables by default (type ‘object’ or ‘categorical’). You can pass a list of variables to encode. Alternatively, the encoder will find and encode all categorical variables (type ‘object’ or ‘categorical’).
With ignore_format=True you have the option to encode numerical variables as well. The procedure is identical, you can either enter the list of variables to encode, or the transformer will automatically select all variables.
The encoder first maps the categories to the numbers for each variable (fit). The encoder then transforms the categories to the mapped numbers (transform).
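A minimal pandas-only sketch of encoding_method='ordered' (the column name 'colour' and the toy data are hypothetical):
import pandas as pd

X = pd.DataFrame({'colour': ['blue', 'red', 'grey', 'blue']})
y = pd.Series([1, 1, 0, 0])

# fit: rank the categories by the target mean and assign 0, 1, 2, ...
ordered = y.groupby(X['colour']).mean().sort_values().index
encoder_dict = {category: i for i, category in enumerate(ordered)}

# transform: replace the categories with the learned integers
X_t = X.assign(colour=X['colour'].map(encoder_dict))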
- Parameters
encoding_method (str, default='ordered') –
Desired method of encoding.
’ordered’: the categories are numbered in ascending order according to the target mean value per category.
’arbitrary’ : categories are numbered arbitrarily.
variables (list, default=None) – The list of categorical variables that will be encoded. If None, the encoder will find and transform all variables of type object or categorical by default. You can also make the transformer accept numerical variables, see the next parameter.
ignore_format (bool, default=False) – Whether the format in which the categorical variables are cast should be ignored. If false, the encoder will automatically select variables of type object or categorical, or check that the variables entered by the user are of type object or categorical. If True, the encoder will select all variables or accept all variables entered by the user, including those cast as numeric.
-
encoder_dict_
¶ Dictionary with the ordinal number per category, per variable.
-
variables_
¶ The group of variables that will be transformed.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
Find the integer to replace each category in each variable.
-
transform:
Encode the categories to numbers.
-
fit_transform:
Fit to the data, then transform it.
-
inverse_transform:
Encode the numbers into the original categories.
Notes
NAN are introduced when encoding categories that were not present in the training dataset. If this happens, try grouping infrequent categories using the RareLabelEncoder().
See also
feature_engine.encoding.RareLabelEncoder
References
Encoding into integers ordered following target mean was discussed in the following talk at PyData London 2017:
- 1
Galli S. “Machine Learning in Financial Risk Assessment”. https://www.youtube.com/watch?v=KHGGlozsRtA
-
fit
(X, y=None)[source]¶ Learn the numbers to be used to replace the categories in each variable.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples. Can be the entire dataframe, not just the variables to be encoded.
y (pandas series, default=None) – The Target. Can be None if encoding_method = ‘arbitrary’. Otherwise, y needs to be passed when fitting the transformer.
- Raises
TypeError –
If the input is not a Pandas DataFrame. - If user enters non-categorical variables (unless ignore_format is True)
ValueError –
If there are no categorical variables in the df or the df is empty - If the variable(s) contain null values
- Returns
- Return type
self
-
inverse_transform
(X)[source]¶ Convert the encoded variable back to the original values.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The transformed dataframe.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError –
If the variable(s) contain null values - If the df has different number of features than the df used in fit()
- Returns
X – The un-transformed dataframe, with the categorical variables containing the original values.
- Return type
pandas dataframe of shape = [n_samples, n_features]
-
transform
(X)[source]¶ Replace categories with the learned parameters.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The dataset to transform.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError –
If the variable(s) contain null values - If the df has different number of features than the df used in fit()
Warning – If after encoding, NAN were introduced.
- Returns
X – The dataframe containing the categories replaced by numbers.
- Return type
pandas dataframe of shape = [n_samples, n_features]
-
class
ballet.eng.external.feature_engine.
OutlierTrimmer
(capping_method='gaussian', tail='right', fold=3, variables=None, missing_values='raise')[source]¶ Bases:
feature_engine.outliers.winsorizer.Winsorizer
The OutlierTrimmer() removes observations with outliers from the dataset.
It works only with numerical variables. A list of variables can be indicated. Alternatively, the OutlierTrimmer() will select all numerical variables.
The OutlierTrimmer() first calculates the maximum and /or minimum values beyond which a value will be considered an outlier, and thus removed.
Limits are determined using:
a Gaussian approximation
the inter-quantile range proximity rule
percentiles.
Gaussian limits:
right tail: mean + 3* std
left tail: mean - 3* std
IQR limits:
right tail: 75th quantile + 3* IQR
left tail: 25th quantile - 3* IQR
where IQR is the inter-quartile range: 75th quantile - 25th quantile.
percentiles or quantiles:
right tail: 95th percentile
left tail: 5th percentile
You can select how far out to cap the maximum or minimum values with the parameter ‘fold’.
If capping_method=’gaussian’ fold gives the value to multiply the std.
If capping_method=’iqr’ fold is the value to multiply the IQR.
If capping_method=’quantile’, fold is the percentile on each tail that should be censored. For example, if fold=0.05, the limits will be the 5th and 95th percentiles. If fold=0.1, the limits will be the 10th and 90th percentiles.
The transformer first finds the values at one or both tails of the distributions (fit).
The transformer then removes observations with outliers from the dataframe (transform).
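A minimal pandas-only sketch of the Gaussian rule with tail='both' and fold=3 (the column name 'income' and the values are hypothetical):
import pandas as pd

X = pd.DataFrame({'income': [10.0, 12.0, 11.0, 13.0, 300.0]})
fold = 3

# fit: learn the limits mean +/- fold * std
upper = X['income'].mean() + fold * X['income'].std()
lower = X['income'].mean() - fold * X['income'].std()

# transform: drop observations that fall outside the limits
X_t = X[(X['income'] <= upper) & (X['income'] >= lower)]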
- Parameters
capping_method (str, default=gaussian) –
Desired capping method. Can take ‘gaussian’, ‘iqr’ or ‘quantiles’.
’gaussian’: the transformer will find the maximum and / or minimum values to cap the variables using the Gaussian approximation.
’iqr’: the transformer will find the boundaries using the IQR proximity rule.
’quantiles’: the limits are given by the percentiles.
tail (str, default=right) – Whether to cap outliers on the right, left or both tails of the distribution. Can take ‘left’, ‘right’ or ‘both’.
fold (int or float, default=3) –
How far out to place the capping values. The number that will multiply the std or IQR to calculate the capping values. Recommended values are 2 or 3 for the gaussian approximation, or 1.5 or 3 for the IQR proximity rule.
If capping_method=’quantile’, then ‘fold’ indicates the percentile. So if fold=0.05, the limits will be the 95th and 5th percentiles. Note: Outliers will be removed up to a maximum of the 20th percentiles on both sides. Thus, when capping_method=’quantile’, then ‘fold’ takes values between 0 and 0.20.
variables (list, default=None) – The list of variables for which the outliers will be removed. If None, the transformer will find and select all numerical variables.
missing_values (string, default='raise') – Indicates if missing values should be ignored or raised. Sometimes we want to remove outliers in the raw, original data, sometimes, we may want to remove outliers in the already pre-transformed data. If missing_values=’ignore’, the transformer will ignore missing data when learning the capping parameters or transforming the data. If missing_values=’raise’ the transformer will return an error if the training or the datasets to transform contain missing values.
-
right_tail_caps_
¶ Dictionary with the maximum values above which values will be removed.
-
left_tail_caps_
¶ Dictionary with the minimum values below which values will be removed.
-
variables_
¶ The group of variables that will be transformed.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
Find maximum and minimum values.
-
transform:
Remove outliers.
-
fit_transform:
Fit to the data. Then transform it.
-
transform
(X)[source]¶ Remove observations with outliers from the dataframe.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The data to be transformed.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError – If the dataframe is not of same size as that used in fit()
- Returns
X – The dataframe without outlier observations.
- Return type
pandas dataframe of shape = [n_samples, n_features]
-
class
ballet.eng.external.feature_engine.
PRatioEncoder
(encoding_method='ratio', variables=None, ignore_format=False)[source]¶ Bases:
feature_engine.encoding.base_encoder.BaseCategoricalTransformer
The PRatioEncoder() replaces categories by the ratio of the probability of the target = 1 and the probability of the target = 0.
The target probability ratio is given by:
\[p(1) / p(0)\]The log of the target probability ratio is:
\[log( p(1) / p(0) )\]Note
This categorical encoding is exclusive for binary classification.
For example in the variable colour, if the mean of the target = 1 for blue is 0.8 and the mean of the target = 0 is 0.2, blue will be replaced by: 0.8 / 0.2 = 4 if ratio is selected, or log(0.8/0.2) = 1.386 if log_ratio is selected.
Note: the division by 0 is not defined and the log(0) is not defined. Thus, if p(0) = 0 for the ratio encoder, or either p(0) = 0 or p(1) = 0 for log_ratio, in any of the variables, the encoder will return an error.
The encoder will encode only categorical variables by default (type ‘object’ or ‘categorical’). You can pass a list of variables to encode. Alternatively, the encoder will find and encode all categorical variables (type ‘object’ or ‘categorical’).
With ignore_format=True you have the option to encode numerical variables as well. The procedure is identical, you can either enter the list of variables to encode, or the transformer will automatically select all variables.
The encoder first maps the categories to the numbers for each variable (fit). The encoder then transforms the categories into the mapped numbers (transform).
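A minimal pandas-only sketch of the two encoding methods for one variable, reproducing the blue example above (0.8 / 0.2 = 4 and log(4) ≈ 1.386); the toy data are illustrative only:
import numpy as np
import pandas as pd

X = pd.DataFrame({'colour': ['blue'] * 5 + ['red'] * 5})
y = pd.Series([1, 1, 1, 1, 0, 1, 0, 0, 0, 0])

p1 = y.groupby(X['colour']).mean()   # p(target = 1) per category
p0 = 1 - p1                          # p(target = 0) per category

ratio = p1 / p0              # encoding_method='ratio'     -> blue: 4.0
log_ratio = np.log(p1 / p0)  # encoding_method='log_ratio' -> blue: 1.386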
- Parameters
encoding_method (str, default='ratio') –
Desired method of encoding.
’ratio’ : probability ratio
’log_ratio’ : log probability ratio
variables (list, default=None) – The list of categorical variables that will be encoded. If None, the encoder will find and transform all variables of type object or categorical by default. You can also make the transformer accept numerical variables, see the next parameter.
ignore_format (bool, default=False) – Whether the format in which the categorical variables are cast should be ignored. If false, the encoder will automatically select variables of type object or categorical, or check that the variables entered by the user are of type object or categorical. If True, the encoder will select all variables or accept all variables entered by the user, including those cast as numeric.
-
encoder_dict_
¶ Dictionary with the probability ratio per category per variable.
-
variables_
¶ The group of variables that will be transformed.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
Learn probability ratio per category, per variable.
-
transform:
Encode categories into numbers.
-
fit_transform:
Fit to the data, then transform it.
-
inverse_transform:
Encode the numbers into the original categories.
Notes
NAN are introduced when encoding categories that were not present in the training dataset. If this happens, try grouping infrequent categories using the RareLabelEncoder().
See also
feature_engine.encoding.RareLabelEncoder
-
fit
(X, y)[source]¶ Learn the numbers that should be used to replace the categories in each variable. That is the ratio of probability.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples. Can be the entire dataframe, not just the categorical variables.
y (pandas series.) – Target, must be binary.
- Raises
TypeError –
If the input is not the Pandas DataFrame. - If user enters non-categorical variables (unless ignore_format is True)
ValueError –
If there are no categorical variables in df or df is empty - If variable(s) contain null values. - If y is not binary with values 0 and 1. - If p(0) = 0 (for ‘ratio’), or if p(0) = 0 or p(1) = 0 (for ‘log_ratio’).
- Returns
- Return type
self
-
inverse_transform
(X)[source]¶ Convert the encoded variable back to the original values.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The transformed dataframe.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError –
If the variable(s) contain null values - If the df has different number of features than the df used in fit()
- Returns
X – The un-transformed dataframe, with the categorical variables containing the original values.
- Return type
pandas dataframe of shape = [n_samples, n_features]
-
transform
(X)[source]¶ Replace categories with the learned parameters.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The dataset to transform.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError –
If the variable(s) contain null values - If the df has different number of features than the df used in fit()
Warning – If after encoding, NAN were introduced.
- Returns
X – The dataframe containing the categories replaced by numbers.
- Return type
pandas dataframe of shape = [n_samples, n_features]
-
class
ballet.eng.external.feature_engine.
PowerTransformer
(variables=None, exp=0.5)[source]¶ Bases:
feature_engine.base_transformers.BaseNumericalTransformer
The PowerTransformer() applies power or exponential transformations to numerical variables.
The PowerTransformer() works only with numerical variables.
A list of variables can be passed as an argument. Alternatively, the transformer will automatically select and transform all numerical variables.
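The transformation itself is simply X ** exp. A minimal sketch with the default exp=0.5, using a hypothetical numerical column:
import pandas as pd

X = pd.DataFrame({'area': [4.0, 9.0, 16.0]})
exp = 0.5

X_t = X.copy()
X_t['area'] = X_t['area'] ** exp   # -> 2.0, 3.0, 4.0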
- Parameters
variables (list, default=None) – The list of numerical variables to transform. If None, the transformer will automatically find and select all numerical variables.
exp (float or int, default=0.5) – The power (or exponent).
-
variables_
¶ The group of variables that will be transformed.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
This transformer does not learn parameters.
-
transform:
Apply the power transformation to the variables.
-
fit_transform:
Fit to data, then transform it.
-
inverse_transform:
Convert the data back to the original representation.
-
fit
(X, y=None)[source]¶ This transformer does not learn parameters.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples. Can be the entire dataframe, not just the variables to transform.
y (pandas Series, default=None) – It is not needed in this transformer. You can pass y or None.
- Raises
TypeError –
If the input is not a Pandas DataFrame - If any of the user provided variables are not numerical
ValueError –
If there are no numerical variables in the df or the df is empty - If the variable(s) contain null values
- Returns
- Return type
self
-
inverse_transform
(X)[source]¶ Convert the data back to the original representation.
- Parameters
X (Pandas DataFrame of shape = [n_samples, n_features]) – The data to be transformed.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError –
If the variable(s) contain null values - If the df has different number of features than the df used in fit()
- Returns
X – The dataframe with the power transformed variables.
- Return type
pandas Dataframe
-
transform
(X)[source]¶ Apply the power transformation to the variables.
- Parameters
X (Pandas DataFrame of shape = [n_samples, n_features]) – The data to be transformed.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError –
If the variable(s) contain null values - If the df has different number of features than the df used in fit()
- Returns
X – The dataframe with the power transformed variables.
- Return type
pandas Dataframe
-
class
ballet.eng.external.feature_engine.
RandomSampleImputer
(random_state=None, seed='general', seeding_method='add', variables=None)[source]¶ Bases:
feature_engine.imputation.base_imputer.BaseImputer
The RandomSampleImputer() replaces missing data with a random sample extracted from the variables in the training set.
The RandomSampleImputer() works with both numerical and categorical variables.
Note
The random samples used to replace missing values may vary from execution to execution. This may affect the results of your work. Thus, it is advisable to set a seed.
There are 2 ways in which the seed can be set in the RandomSampleImputer():
If seed = ‘general’ then the random_state can be either None or an integer. The seed will be used as the random_state and all observations will be imputed in one go. This is equivalent to pandas.sample(n, random_state=seed) where n is the number of observations with missing data.
If seed = ‘observation’, then the random_state should be a variable name or a list of variable names. The seed will be calculated observation per observation, either by adding or multiplying the seeding variable values, and passed to the random_state. Then, a value will be extracted from the train set using that seed and used to replace the NAN in that particular observation. This is the equivalent of pandas.sample(1, random_state=var1+var2) if the ‘seeding_method’ is set to ‘add’ or pandas.sample(1, random_state=var1*var2) if the ‘seeding_method’ is set to ‘multiply’.
For more details on why this functionality is important refer to the course Feature Engineering for Machine Learning in Udemy: https://www.udemy.com/feature-engineering-for-machine-learning/
Note that if the variables indicated in the random_state list are not numerical, the imputer will return an error. Note also that the variables indicated as seed should not contain missing values.
This estimator stores a copy of the training set when the fit() method is called. Therefore, the object can become quite heavy. Also, it may not be GDPR compliant if your training data set contains Personal Information. Please check if this behaviour is allowed within your organisation.
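A minimal pandas-only sketch of the seed = ‘general’ behaviour described above (the column and values are made up; the real imputer handles multiple variables and both seeding modes):
import pandas as pd

train = pd.Series([1.0, 2.0, 3.0, 4.0], name='age')     # stored at fit time
to_impute = pd.Series([None, 10.0, None], name='age')

n_missing = int(to_impute.isna().sum())
samples = train.sample(n=n_missing, random_state=0, replace=True)

# transform: fill the missing positions with the sampled values
filled = to_impute.copy()
filled.loc[filled.isna()] = samples.to_numpy()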
- Parameters
random_state (int, str or list, default=None) – The random_state can take an integer to set the seed when extracting the random samples. Alternatively, it can take a variable name or a list of variables, which values will be used to determine the seed observation per observation.
seed (str, default='general') –
Indicates whether the seed should be set for each observation with missing values, or if one seed should be used to impute all observations in one go.
general: one seed will be used to impute the entire dataframe. This is equivalent to setting the seed in pandas.sample(random_state).
observation: the seed will be set for each observation using the values of the variables indicated in the random_state for that particular observation.
seeding_method (str, default='add') – If more than one variable are indicated to seed the random sampling per observation, you can choose to combine those values as an addition or a multiplication. Can take the values ‘add’ or ‘multiply’.
variables (list, default=None) – The list of variables to be imputed. If None, the imputer will select all variables in the train set.
-
X_
¶ Copy of the training dataframe from which to extract the random samples.
-
variables_
¶ The group of variables that will be transformed.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
Make a copy of the dataframe
-
transform:
Impute missing data.
-
fit_transform:
Fit to the data, then transform it.
-
fit
(X, y=None)[source]¶ Makes a copy of the train set. Only stores a copy of the variables to impute. This copy is then used to randomly extract the values to fill the missing data during transform.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training dataset. Only a copy of the indicated variables will be stored in the transformer.
y (None) – y is not needed in this imputation. You can pass None or y.
- Raises
TypeError – If the input is not a Pandas DataFrame
- Returns
- Return type
self
-
transform
(X)[source]¶ Replace missing data with random values taken from the train set.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The dataframe to be transformed.
- Raises
TypeError – If the input is not a Pandas DataFrame
- Returns
X – The dataframe without missing values in the transformed variables.
- Return type
pandas dataframe of shape = [n_samples, n_features]
-
class
ballet.eng.external.feature_engine.
RareLabelEncoder
(tol=0.05, n_categories=10, max_n_categories=None, replace_with='Rare', variables=None, ignore_format=False)[source]¶ Bases:
feature_engine.encoding.base_encoder.BaseCategoricalTransformer
The RareLabelEncoder() groups rare / infrequent categories in a new category called “Rare”, or any other name entered by the user.
For example in the variable colour, if the percentage of observations for the categories magenta, cyan and burgundy are < 5 %, all those categories will be replaced by the new label “Rare”.
Note
Infrequent labels can also be grouped under a user defined name, for example ‘Other’. The name to replace infrequent categories is defined with the parameter replace_with.
The encoder will encode only categorical variables by default (type ‘object’ or ‘categorical’). You can pass a list of variables to encode. Alternatively, the encoder will find and encode all categorical variables (type ‘object’ or ‘categorical’).
With ignore_format=True you have the option to encode numerical variables as well. The procedure is identical, you can either enter the list of variables to encode, or the transformer will automatically select all variables.
The encoder first finds the frequent labels for each variable (fit). The encoder then groups the infrequent labels under the new label ‘Rare’ or by another user defined string (transform).
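A minimal pandas-only sketch of the grouping logic, assuming tol=0.05 and ignoring n_categories and max_n_categories (the column 'colour' and the counts are hypothetical):
import pandas as pd

X = pd.DataFrame({'colour': ['blue'] * 90 + ['red'] * 8 + ['cyan', 'magenta']})
tol = 0.05

# fit: find the frequent categories
frequencies = X['colour'].value_counts(normalize=True)
frequent = frequencies[frequencies >= tol].index

# transform: group everything else under 'Rare'
X_t = X.copy()
X_t.loc[~X_t['colour'].isin(frequent), 'colour'] = 'Rare'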
- Parameters
tol (float, default=0.05) – The minimum frequency a label should have to be considered frequent. Categories with frequencies lower than tol will be grouped.
n_categories (int, default=10) – The minimum number of categories a variable should have for the encoder to find frequent labels. If the variable contains fewer categories, all of them will be considered frequent.
max_n_categories (int, default=None) – The maximum number of categories that should be considered frequent. If None, all categories with frequency above the tolerance (tol) will be considered frequent. If you enter 5, only the 5 most frequent categories will be retained and the rest grouped.
replace_with (string, integer or float, default='Rare') – The value that will be used to replace infrequent categories.
variables (list, default=None) – The list of categorical variables that will be encoded. If None, the encoder will find and transform all variables of type object or categorical by default. You can also make the transformer accept numerical variables, see the next parameter.
ignore_format (bool, default=False) – Whether the format in which the categorical variables are cast should be ignored. If false, the encoder will automatically select variables of type object or categorical, or check that the variables entered by the user are of type object or categorical. If True, the encoder will select all variables or accept all variables entered by the user, including those cast as numeric.
-
encoder_dict_
¶ Dictionary with the frequent categories, i.e., those that will be kept, per variable.
-
variables_
¶ The variables that will be transformed.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
Find frequent categories.
-
transform:
Group rare categories
-
fit_transform:
Fit to data, then transform it.
-
fit
(X, y=None)[source]¶ Learn the frequent categories for each variable.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples. Can be the entire dataframe, not just selected variables
y (None) – y is not required. You can pass y or None.
- Raises
TypeError –
If the input is not a Pandas DataFrame. - If user enters non-categorical variables (unless ignore_format is True)
ValueError –
If there are no categorical variables in the df or the df is empty - If the variable(s) contain null values
Warning – If the number of categories in any one variable is less than that indicated in n_categories.
- Returns
- Return type
self
-
transform
(X)[source]¶ Group infrequent categories. Replace infrequent categories by the string ‘Rare’ or any other name provided by the user.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The input samples.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError –
If the variable(s) contain null values - If user enters non-categorical variables (unless ignore_format is True)
- Returns
X – The dataframe where rare categories have been grouped.
- Return type
pandas dataframe of shape = [n_samples, n_features]
-
class
ballet.eng.external.feature_engine.
ReciprocalTransformer
(variables=None)[source]¶ Bases:
feature_engine.base_transformers.BaseNumericalTransformer
The ReciprocalTransformer() applies the reciprocal transformation 1 / x to numerical variables.
The ReciprocalTransformer() only works with numerical variables with non-zero values. If a variable contains the value 0, the transformer will raise an error.
A list of variables can be passed as an argument. Alternatively, the transformer will automatically select and transform all numerical variables.
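The transformation itself is simply 1 / x. A minimal sketch on a hypothetical column with non-zero values:
import pandas as pd

X = pd.DataFrame({'rate': [0.5, 2.0, 4.0]})

X_t = X.copy()
X_t['rate'] = 1 / X_t['rate']   # -> 2.0, 0.5, 0.25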
- Parameters
variables (list, default=None) – The list of numerical variables to transform. If None, the transformer will automatically find and select all numerical variables.
-
variables_
¶ The group of variables that will be transformed.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
This transformer does not learn parameters.
-
transform:
Apply the reciprocal 1 / x transformation.
-
fit_transform:
Fit to data, then transform it.
-
inverse_transform:
Convert the data back to the original representation.
-
fit
(X, y=None)[source]¶ This transformer does not learn parameters.
- Parameters
X (Pandas DataFrame of shape = [n_samples, n_features]) – The training input samples. Can be the entire dataframe, not just the variables to transform.
y (pandas Series, default=None) – It is not needed in this transformer. You can pass y or None.
- Raises
TypeError –
If the input is not a Pandas DataFrame - If any of the user provided variables are not numerical
ValueError –
If there are no numerical variables in the df or the df is empty - If the variable(s) contain null values - If some variables contain zero as values
- Returns
- Return type
self
-
inverse_transform
(X)[source]¶ Convert the data back to the original representation.
- Parameters
X (Pandas DataFrame of shape = [n_samples, n_features]) – The data to be transformed.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError –
If the variable(s) contain null values - If the df has different number of features than the df used in fit() - If some variables contain zero values
- Returns
X – The dataframe with the transformed variables.
- Return type
pandas dataframe
-
transform
(X)[source]¶ Apply the reciprocal 1 / x transformation.
- Parameters
X (Pandas DataFrame of shape = [n_samples, n_features]) – The data to be transformed.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError –
If the variable(s) contain null values - If the df has different number of features than the df used in fit() - If some variables contain zero values
- Returns
X – The dataframe with the transformed variables.
- Return type
pandas dataframe
-
class
ballet.eng.external.feature_engine.
Winsorizer
(capping_method='gaussian', tail='right', fold=3, variables=None, missing_values='raise')[source]¶ Bases:
feature_engine.outliers.base_outlier.BaseOutlier
The Winsorizer() caps maximum and / or minimum values of a variable.
The Winsorizer() works only with numerical variables. A list of variables can be indicated. Alternatively, the Winsorizer() will select all numerical variables in the train set.
The Winsorizer() first calculates the capping values at the end of the distribution. The values are determined using:
a Gaussian approximation,
the inter-quantile range proximity rule (IQR)
percentiles.
Gaussian limits:
right tail: mean + 3* std
left tail: mean - 3* std
IQR limits:
right tail: 75th quantile + 3* IQR
left tail: 25th quantile - 3* IQR
where IQR is the inter-quartile range: 75th quantile - 25th quantile.
percentiles or quantiles:
right tail: 95th percentile
left tail: 5th percentile
You can select how far out to cap the maximum or minimum values with the parameter ‘fold’.
If capping_method=’gaussian’ fold gives the value to multiply the std.
If capping_method=’iqr’ fold is the value to multiply the IQR.
If capping_method=’quantile’, fold is the percentile on each tail that should be censored. For example, if fold=0.05, the limits will be the 5th and 95th percentiles. If fold=0.1, the limits will be the 10th and 90th percentiles.
The transformer first finds the values at one or both tails of the distributions (fit). The transformer then caps the variables (transform).
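A minimal pandas-only sketch of Gaussian capping with the defaults tail='right' and fold=3, using clip(); the column name 'income' and the values are hypothetical:
import pandas as pd

X = pd.DataFrame({'income': [10.0, 12.0, 11.0, 13.0, 300.0]})
fold = 3

# fit: learn the right-tail cap, mean + fold * std
right_cap = X['income'].mean() + fold * X['income'].std()

# transform: censor values above the cap
X_t = X.copy()
X_t['income'] = X_t['income'].clip(upper=right_cap)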
- Parameters
capping_method (str, default=gaussian) –
Desired capping method. Can take ‘gaussian’, ‘iqr’ or ‘quantiles’.
’gaussian’: the transformer will find the maximum and / or minimum values to cap the variables using the Gaussian approximation.
’iqr’: the transformer will find the boundaries using the IQR proximity rule.
’quantiles’: the limits are given by the percentiles.
tail (str, default=right) – Whether to cap outliers on the right, left or both tails of the distribution. Can take ‘left’, ‘right’ or ‘both’.
fold (int or float, default=3) –
How far out to place the capping values. The number that will multiply the std or IQR to calculate the capping values. Recommended values are 2 or 3 for the gaussian approximation, or 1.5 or 3 for the IQR proximity rule.
If capping_method=’quantile’, then ‘fold’ indicates the percentile. So if fold=0.05, the limits will be the 95th and 5th percentiles. Note: Outliers will be removed up to a maximum of the 20th percentiles on both sides. Thus, when capping_method=’quantile’, then ‘fold’ takes values between 0 and 0.20.
variables (list, default=None) – The list of variables for which the outliers will be capped. If None, the transformer will find and select all numerical variables.
missing_values (string, default='raise') – Indicates if missing values should be ignored or raised. Sometimes we want to remove outliers in the raw, original data, sometimes, we may want to remove outliers in the already pre-transformed data. If missing_values=’ignore’, the transformer will ignore missing data when learning the capping parameters or transforming the data. If missing_values=’raise’ the transformer will return an error if the training or the datasets to transform contain missing values.
-
right_tail_caps_
¶ Dictionary with the maximum values at which variables will be capped.
-
left_tail_caps_
¶ Dictionary with the minimum values at which variables will be capped.
-
variables_
¶ The group of variables that will be transformed.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
Learn the values that should be used to replace outliers.
-
transform:
Cap the variables.
-
fit_transform:
Fit to the data. Then transform it.
-
fit
(X, y=None)[source]¶ Learn the values that should be used to replace outliers.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples.
y (pandas Series, default=None) – y is not needed in this transformer. You can pass y or None.
- Raises
TypeError – If the input is not a Pandas DataFrame
- Returns
- Return type
self
-
transform
(X)[source]¶ Cap the variable values, that is, censor outliers.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The data to be transformed.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError – If the dataframe is not of same size as that used in fit()
- Returns
X – The dataframe with the capped variables.
- Return type
pandas dataframe of shape = [n_samples, n_features]
-
class
ballet.eng.external.feature_engine.
WoEEncoder
(variables=None, ignore_format=False)[source]¶ Bases:
feature_engine.encoding.base_encoder.BaseCategoricalTransformer
The WoEEncoder() replaces categories by the weight of evidence (WoE). The WoE was used primarily in the financial sector to create credit risk scorecards.
The encoder will encode only categorical variables by default (type ‘object’ or ‘categorical’). You can pass a list of variables to encode. Alternatively, the encoder will find and encode all categorical variables (type ‘object’ or ‘categorical’).
With ignore_format=True you have the option to encode numerical variables as well. The procedure is identical, you can either enter the list of variables to encode, or the transformer will automatically select all variables.
The encoder first maps the categories to the weight of evidence for each variable (fit). The encoder then transforms the categories into the mapped numbers (transform).
Note
This categorical encoding is exclusive for binary classification.
The weight of evidence is given by:
\[log( p(X=xj|Y = 1) / p(X=xj|Y=0) )\]The WoE is determined as follows:
We calculate the percentage of positive cases in each category with respect to the total of all positive cases. For example, 20 positive cases in category A out of 100 total positive cases equals 20%. Next, we calculate the percentage of negative cases in each category with respect to the total of negative cases, for example 5 negative cases in category A out of a total of 50 negative cases equals 10%. Then we calculate the WoE by dividing the category percentage of positive cases by the category percentage of negative cases and taking the logarithm, so for category A in our example WoE = log(20/10).
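A minimal pandas-only sketch of this calculation, reproducing the category A figures above (20 of 100 positives, 5 of 50 negatives, WoE = log(2)); the data are illustrative only:
import numpy as np
import pandas as pd

colour = pd.Series(['A'] * 25 + ['B'] * 125)
y = pd.Series([1] * 20 + [0] * 5 + [1] * 80 + [0] * 45)

pos = y.groupby(colour).sum() / y.sum()              # share of all positives per category
neg = (1 - y).groupby(colour).sum() / (1 - y).sum()  # share of all negatives per category

woe = np.log(pos / neg)   # category A: log(0.2 / 0.1) = log(2) ≈ 0.693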
Note
If WoE values are negative, negative cases supersede the positive cases.
If WoE values are positive, positive cases supersede the negative cases.
And if WoE is 0, then there is an equal number of positive and negative examples.
Encoding into WoE:
Creates a monotonic relationship between the encoded variable and the target
Returns variables in a similar scale
Note
The log(0) is not defined and the division by 0 is not defined. Thus, if any of the terms in the WoE equation are 0 for a given category, the encoder will return an error. If this happens, try grouping less frequent categories.
- Parameters
variables (list, default=None) – The list of categorical variables that will be encoded. If None, the encoder will find and transform all variables of type object or categorical by default. You can also make the transformer accept numerical variables, see the next parameter.
ignore_format (bool, default=False) – Whether the format in which the categorical variables are cast should be ignored. If false, the encoder will automatically select variables of type object or categorical, or check that the variables entered by the user are of type object or categorical. If True, the encoder will select all variables or accept all variables entered by the user, including those cast as numeric.
-
encoder_dict_
¶ Dictionary with the WoE per variable.
-
variables_
¶ The group of variables that will be transformed.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
Learn the WoE per category, per variable.
-
transform:
Encode the categories to numbers.
-
fit_transform:
Fit to the data, then transform it.
-
inverse_transform:
Encode the numbers into the original categories.
Notes
For details on the calculation of the weight of evidence visit: https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html
In credit scoring, continuous variables are also transformed using the WoE. To do this, first variables are sorted into a discrete number of bins, and then these bins are encoded with the WoE as explained here for categorical variables. You can do this by combining the use of the equal width, equal frequency or arbitrary discretisers.
NAN are introduced when encoding categories that were not present in the training dataset. If this happens, try grouping infrequent categories using the RareLabelEncoder().
See also
feature_engine.encoding.RareLabelEncoder
,feature_engine.discretisation
-
fit
(X, y)[source]¶ Learn the WoE.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples. Can be the entire dataframe, not just the categorical variables.
y (pandas series.) – Target, must be binary.
- Raises
TypeError –
If the input is not the Pandas DataFrame. - If user enters non-categorical variables (unless ignore_format is True)
ValueError –
If there are no categorical variables in df or df is empty - If variable(s) contain null values. - If y is not binary with values 0 and 1. - If p(0) = 0 or p(1) = 0.
- Returns
- Return type
self
-
inverse_transform
(X)[source]¶ Convert the encoded variable back to the original values.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The transformed dataframe.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError –
If the variable(s) contain null values - If the df has different number of features than the df used in fit()
- Returns
X – The un-transformed dataframe, with the categorical variables containing the original values.
- Return type
pandas dataframe of shape = [n_samples, n_features]
-
transform
(X)[source]¶ Replace categories with the learned parameters.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The dataset to transform.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError –
If the variable(s) contain null values - If the df has different number of features than the df used in fit()
Warning – If after encoding, NAN were introduced.
- Returns
X – The dataframe containing the categories replaced by numbers.
- Return type
pandas dataframe of shape = [n_samples, n_features]
-
class
ballet.eng.external.feature_engine.
YeoJohnsonTransformer
(variables=None)[source]¶ Bases:
feature_engine.base_transformers.BaseNumericalTransformer
The YeoJohnsonTransformer() applies the Yeo-Johnson transformation to the numerical variables.
The Yeo-Johnson transformation implemented by this transformer is that of SciPy.stats: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.yeojohnson.html
The YeoJohnsonTransformer() works only with numerical variables.
A list of variables can be passed as an argument. Alternatively, the transformer will automatically select and transform all numerical variables.
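A minimal sketch of the underlying SciPy call on a single hypothetical column; fit finds the optimal lambda, and transform reuses it on new data:
import pandas as pd
from scipy import stats

X = pd.DataFrame({'amount': [1.0, 5.0, 25.0, 125.0]})

# fit: SciPy returns the transformed values and the optimal lambda
transformed, lmbda = stats.yeojohnson(X['amount'])

# transform: apply the learned lambda to new data
X_new = pd.DataFrame({'amount': [2.0, 10.0]})
X_new_t = stats.yeojohnson(X_new['amount'], lmbda=lmbda)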
- Parameters
variables (list, default=None) – The list of numerical variables to transform. If None, the transformer will automatically find and select all numerical variables.
-
lambda_dict_
¶ Dictionary containing the best lambda for the Yeo-Johnson per variable.
-
variables_
¶ The group of variables that will be transformed.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
Learn the optimal lambda for the Yeo-Johnson transformation.
-
transform:
Apply the Yeo-Johnson transformation.
-
fit_transform:
Fit to data, then transform it.
References
- 1
Weisberg S. “Yeo-Johnson Power Transformations”. https://www.stat.umn.edu/arc/yjpower.pdf
-
fit
(X, y=None)[source]¶ Learn the optimal lambda for the Yeo-Johnson transformation.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples. Can be the entire dataframe, not just the variables to transform.
y (pandas Series, default=None) – It is not needed in this transformer. You can pass y or None.
- Raises
TypeError –
If the input is not a Pandas DataFrame - If any of the user provided variables are not numerical
ValueError –
If there are no numerical variables in the df or the df is empty - If the variable(s) contain null values
- Returns
- Return type
self
-
transform
(X)[source]¶ Apply the Yeo-Johnson transformation.
- Parameters
X (Pandas DataFrame of shape = [n_samples, n_features]) – The data to be transformed.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError –
If the variable(s) contain null values - If the df has different number of features than the df used in fit()
- Returns
X – The dataframe with the transformed variables.
- Return type
pandas dataframe