ballet.eng.external package

class ballet.eng.external.AddMissingIndicator(missing_only=True, variables=None)[source]

Bases: feature_engine.imputation.base_imputer.BaseImputer

The AddMissingIndicator() adds additional binary variables that indicate if data is missing. It will add as many missing indicators as variables indicated by the user.

Binary variables are named with the original variable name plus ‘_na’.

The AddMissingIndicator() works for both numerical and categorical variables. You can pass a list with the variables for which the missing indicators should be added. Alternatively, the imputer will select and add missing indicators to all variables in the training set.

Note If missing_only=True, the imputer will add missing indicators only to those variables that show missing data during fit. These may be a subset of the variables you indicated.
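For example, a minimal usage sketch (the toy dataframe and column names below are illustrative only, not from the library docs):

import numpy as np
import pandas as pd
from ballet.eng.external import AddMissingIndicator

X = pd.DataFrame({'age': [20, np.nan, 35], 'city': ['NY', 'LA', np.nan]})

indicator = AddMissingIndicator()  # missing_only=True by default
Xt = indicator.fit_transform(X)
# Xt contains the original columns plus the binary indicators 'age_na' and 'city_na'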

Parameters
  • missing_only (bool, default=True) –

    Indicates if missing indicators should be added to variables with missing data or to all variables.

    True: indicators will be created only for those variables that showed missing data during fit.

    False: indicators will be created for all variables

  • variables (list, default=None) – The list of variables to be imputed. If None, the imputer will find and select all variables.

variables_

List of variables for which the missing indicators will be created.

n_features_in_

The number of features in the train set used in fit.

fit:

Learn the variables for which the missing indicators will be created

transform:

Add the missing indicators.

fit_transform:

Fit to the data, then transform it.

fit(X, y=None)[source]

Learn the variables for which the missing indicators will be created.

Parameters
  • X (pandas dataframe of shape = [n_samples, n_features]) – The training dataset.

  • y (pandas Series, default=None) – y is not needed in this imputation. You can pass None or y.

Raises

TypeError – If the input is not a Pandas DataFrame

Returns

self.variables_ – The list of variables for which missing indicators will be added.

Return type

list

transform(X)[source]

Add the binary missing indicators.

Parameters

X (pandas dataframe of shape = [n_samples, n_features]) – The dataframe to be transformed.

Returns

X_transformed – The dataframe containing the additional binary variables. Binary variables are named with the original variable name plus ‘_na’.

Return type

pandas dataframe of shape = [n_samples, n_features]

class ballet.eng.external.ArbitraryDiscretiser(binning_dict, return_object=False, return_boundaries=False)[source]

Bases: feature_engine.base_transformers.BaseNumericalTransformer

The ArbitraryDiscretiser() divides continuous numerical variables into contiguous intervals whose limits are determined arbitrarily by the user.

You need to enter a dictionary with variable names as keys, and a list of the limits of the intervals as values. For example {‘var1’:[0, 10, 100, 1000], ‘var2’:[5, 10, 15, 20]}.

ArbitraryDiscretiser() will then sort var1 values into the intervals 0-10, 10-100, 100-1000, and var2 into 5-10, 10-15 and 15-20. Similar to pandas.cut.

The ArbitraryDiscretiser() works only with numerical variables. The discretiser will check if the dictionary entered by the user contains variables present in the training set, and if these variables are numerical, before doing any transformation.

Then it transforms the variables, that is, it sorts the values into the intervals.
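For example, a minimal usage sketch (the toy dataframe below is illustrative only):

import pandas as pd
from ballet.eng.external import ArbitraryDiscretiser

X = pd.DataFrame({'var1': [3, 55, 700], 'var2': [6, 12, 18]})

discretiser = ArbitraryDiscretiser(
    binning_dict={'var1': [0, 10, 100, 1000], 'var2': [5, 10, 15, 20]}
)
Xt = discretiser.fit_transform(X)
# with return_boundaries=False (the default), each value is replaced by the
# integer index of the interval it falls into (0, 1, 2, ...)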

Parameters
  • binning_dict (dict) –

    The dictionary with the variable to interval limits pairs. A valid dictionary looks like this:

    binning_dict = {'var1': [0, 10, 100, 1000], 'var2': [5, 10, 15, 20]}

  • return_object (bool, default=False) –

    Whether the discrete variable should be returned cast as numeric or as object. If you would like to proceed with the engineering of the variable as if it was categorical, use True. Alternatively, keep the default to False.

    Categorical encoders in Feature-engine work only with variables of type object, thus, if you wish to encode the returned bins, set return_object to True.

  • return_boundaries (bool, default=False) – Whether the output, that is the bin names / values, should be the interval boundaries. If True, it returns the interval boundaries. If False, it returns integers.

binner_dict_

Dictionary with the interval limits per variable.

variables_

The variables to discretise.

n_features_in_

The number of features in the train set used in fit.

fit:

This transformer does not learn any parameter.

transform:

Sort continuous variable values into the intervals.

fit_transform:

Fit to the data, then transform it.

See also

pandas.cut

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html

fit(X, y=None)[source]

This transformer does not learn any parameter.

Check dataframe and variables. Checks that the user entered variables are in the train set and cast as numerical.

Parameters
  • X (pandas dataframe of shape = [n_samples, n_features]) – The training dataset. Can be the entire dataframe, not just the variables to be transformed.

  • y (None) – y is not needed in this transformer. You can pass y or None.

Raises
  • TypeError

    • If the input is not a Pandas DataFrame - If any of the user provided variables are not numerical

  • ValueError

    • If there are no numerical variables in the df or the df is empty - If the variable(s) contain null values

Returns

Return type

self

transform(X)[source]

Sort the variable values into the intervals.

Parameters

X (pandas dataframe of shape = [n_samples, n_features]) – The dataframe to be transformed.

Raises
  • TypeError – If the input is not a Pandas DataFrame

  • ValueError

    • If the variable(s) contain null values - If the dataframe is not of the same size as the one used in fit()

Returns

X – The transformed data with the discrete variables.

Return type

pandas dataframe of shape = [n_samples, n_features]

class ballet.eng.external.ArbitraryNumberImputer(arbitrary_number=999, variables=None, imputer_dict=None)[source]

Bases: feature_engine.imputation.base_imputer.BaseImputer

The ArbitraryNumberImputer() replaces missing data in each variable by an arbitrary value determined by the user. It works only with numerical variables.

You can impute all variables with the same number, in which case you need to define the variables to impute in variables and the imputation number in arbitrary_number. Alternatively, you can pass a dictionary of variables and the numbers to use for their imputation.

For example, you can impute varA and varB with 99 like this:

transformer = ArbitraryNumberImputer(
        variables = ['varA', 'varB'],
        arbitrary_number = 99
        )

Xt = transformer.fit_transform(X)

Alternatively, you can impute varA with 1 and varB with 99 like this:

transformer = ArbitraryNumberImputer(
        imputer_dict = {'varA': 1, 'varB': 99}
        )

Xt = transformer.fit_transform(X)
Parameters
  • arbitrary_number (int or float, default=999) – The number to be used to replace missing data.

  • variables (list, default=None) – The list of variables to be imputed. If None, the imputer will find and select all numerical variables. This parameter is used only if imputer_dict is None.

  • imputer_dict (dict, default=None) – The dictionary of variables and the arbitrary numbers for their imputation.

imputer_dict_

Dictionary with the values to replace NAs in each variable.

variables_

The group of variables that will be transformed.

n_features_in_

The number of features in the train set used in fit.

fit:

This transformer does not learn parameters.

transform:

Impute missing data.

fit_transform:

Fit to the data, then transform it.

See also

feature_engine.imputation.EndTailImputer

fit(X, y=None)[source]

This method does not learn any parameter. Checks dataframe and finds numerical variables, or checks that the variables entered by user are numerical.

Parameters
  • X (pandas dataframe of shape = [n_samples, n_features]) – The training dataset.

  • y (None) – y is not needed in this imputation. You can pass None or y.

Raises
  • TypeError

    • If the input is not a Pandas DataFrame - If any of the user provided variables are not numerical

  • ValueError – If there are no numerical variables in the df or the df is empty

Returns

Return type

self

transform(X)[source]

Replace missing data with the learned parameters.

Parameters

X (pandas dataframe of shape = [n_samples, n_features]) – The data to be transformed.

Raises
  • TypeError – If the input is not a Pandas DataFrame

  • ValueError – If the dataframe has different number of features than the df used in fit()

Returns

X – The dataframe without missing values in the selected variables.

Return type

pandas dataframe of shape = [n_samples, n_features]

class ballet.eng.external.ArbitraryOutlierCapper(max_capping_dict=None, min_capping_dict=None, missing_values='raise')[source]

Bases: feature_engine.outliers.base_outlier.BaseOutlier

The ArbitraryOutlierCapper() caps the maximum or minimum values of a variable at an arbitrary value indicated by the user.

You must provide the maximum or minimum values that will be used to cap each variable in a dictionary {feature:capping value}
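For example, a minimal usage sketch (the toy data and capping values are illustrative only):

import pandas as pd
from ballet.eng.external import ArbitraryOutlierCapper

X = pd.DataFrame({'income': [1200, 99000, 3500], 'age': [25, 41, 130]})

capper = ArbitraryOutlierCapper(
    max_capping_dict={'income': 50000, 'age': 100},
    min_capping_dict={'income': 0},
)
Xt = capper.fit_transform(X)
# 'income' is capped to the interval [0, 50000] and 'age' at a maximum of 100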

Parameters
  • max_capping_dict (dictionary, default=None) – Dictionary containing the user specified capping values for the right tail of the distribution of each variable (maximum values).

  • min_capping_dict (dictionary, default=None) – Dictionary containing user specified capping values for the left tail of the distribution of each variable (minimum values).

  • missing_values (string, default='raise') – Indicates if missing values should be ignored or raised. If missing_values=’raise’ the transformer will return an error if the training or the datasets to transform contain missing values.

right_tail_caps_

Dictionary with the maximum values at which variables will be capped.

left_tail_caps_

Dictionary with the minimum values at which variables will be capped.

variables_

The group of variables that will be transformed.

n_features_in_

The number of features in the train set used in fit.

fit:

This transformer does not learn any parameter.

transform:

Cap the variables.

fit_transform:

Fit to the data. Then transform it.

fit(X, y=None)[source]

This transformer does not learn any parameter.

Parameters
  • X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples.

  • y (pandas Series, default=None) – y is not needed in this transformer. You can pass y or None.

Raises

TypeError – If the input is not a Pandas DataFrame

Returns

Return type

self

transform(X)[source]

Cap the variable values, that is, censors outliers.

Parameters

X (pandas dataframe of shape = [n_samples, n_features]) – The data to be transformed.

Raises
  • TypeError – If the input is not a Pandas DataFrame

  • ValueError – If the dataframe is not of same size as that used in fit()

Returns

X – The dataframe with the capped variables.

Return type

pandas dataframe of shape = [n_samples, n_features]

class ballet.eng.external.AutoregressiveTransformer(num_lags=5, pred_stride=1)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

fit(X, y=None)[source]
inverse_transform(X)[source]
transform(X, y=None)[source]
class ballet.eng.external.BackwardDifferenceEncoder(verbose=0, cols=None, mapping=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value')[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Backward difference contrast coding for encoding categorical variables.

Parameters
  • verbose (int) – integer indicating verbosity of the output. 0 for none.

  • cols (list) – a list of columns to encode, if None, all string columns will be encoded.

  • drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.

  • return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).

  • handle_unknown (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if indicator is used, an extra column will be added in if the transform matrix has unknown categories. This can cause unexpected changes in dimension in some cases.

  • handle_missing (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if indicator is used, an extra column will be added in if the transform matrix has nan values. This can cause unexpected changes in dimension in some cases.

Example

>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = BackwardDifferenceEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 21 columns):
intercept    506 non-null int64
CRIM         506 non-null float64
ZN           506 non-null float64
INDUS        506 non-null float64
CHAS_0       506 non-null float64
NOX          506 non-null float64
RM           506 non-null float64
AGE          506 non-null float64
DIS          506 non-null float64
RAD_0        506 non-null float64
RAD_1        506 non-null float64
RAD_2        506 non-null float64
RAD_3        506 non-null float64
RAD_4        506 non-null float64
RAD_5        506 non-null float64
RAD_6        506 non-null float64
RAD_7        506 non-null float64
TAX          506 non-null float64
PTRATIO      506 non-null float64
B            506 non-null float64
LSTAT        506 non-null float64
dtypes: float64(20), int64(1)
memory usage: 83.1 KB
None

References

1

Contrast Coding Systems for Categorical Variables, from

https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/

2

Gregory Carey (2003). Coding Categorical Variables, from

http://psych.colorado.edu/~carey/Courses/PSYC5741/handouts/Coding%20Categorical%20Variables%202006-03-03.pdf

static backward_difference_coding(X_in, mapping)[source]
fit(X, y=None, **kwargs)[source]

Fits an ordinal encoder to produce a consistent mapping across applications and optionally finds generally invariant columns to drop consistently.

Parameters
  • X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like, shape = [n_samples]) – Target values.

Returns

self – Returns self.

Return type

encoder

static fit_backward_difference_coding(col, values, handle_missing, handle_unknown)[source]
get_feature_names()[source]

Returns the names of all transformed / added columns.

Returns

feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!

Return type

list

transform(X, override_return_df=False)[source]

Perform the transformation to new categorical data.

Parameters

X (array-like, shape = [n_samples, n_features]) –

Returns

p – Transformed values with encoding applied.

Return type

array, shape = [n_samples, n_numeric + N]

class ballet.eng.external.BaseNEncoder(verbose=0, cols=None, mapping=None, drop_invariant=False, return_df=True, base=2, handle_unknown='value', handle_missing='value')[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Base-N encoder encodes the categories into arrays of their base-N representation. A base of 1 is equivalent to one-hot encoding (not really base-1, but useful), a base of 2 is equivalent to binary encoding. N=number of actual categories is equivalent to vanilla ordinal encoding.

Parameters
  • verbose (int) – integer indicating verbosity of the output. 0 for none.

  • cols (list) – a list of columns to encode, if None, all string columns will be encoded.

  • drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.

  • return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).

  • base (int) – when the downstream model copes well with nonlinearities (like decision tree), use higher base.

  • handle_unknown (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if indicator is used, an extra column will be added in if the transform matrix has unknown categories. This can cause unexpected changes in dimension in some cases.

  • handle_missing (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if indicator is used, an extra column will be added in if the transform matrix has nan values. This can cause unexpected changes in dimension in some cases.

Example

>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = BaseNEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 18 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS_0     506 non-null int64
CHAS_1     506 non-null int64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD_0      506 non-null int64
RAD_1      506 non-null int64
RAD_2      506 non-null int64
RAD_3      506 non-null int64
RAD_4      506 non-null int64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(11), int64(7)
memory usage: 71.3 KB
None
basen_encode(X_in, cols=None)[source]

Basen encoding encodes the integers as basen code with one column per digit.

Parameters
  • X_in (DataFrame) –

  • cols (list-like, default None) – Column names in the DataFrame to be encoded

Returns

dummies

Return type

DataFrame

basen_to_integer(X, cols, base)[source]

Convert basen code as integers.

Parameters
  • X (DataFrame) – encoded data

  • cols (list-like) – Column names in the DataFrame that will be encoded

  • base (int) – The base of transform

Returns

numerical

Return type

DataFrame

calc_required_digits(values)[source]
col_transform(col, digits)[source]

The lambda body to transform the column values

fit(X, y=None, **kwargs)[source]

Fit encoder according to X and y.

Parameters
  • X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like, shape = [n_samples]) – Target values.

Returns

self – Returns self.

Return type

encoder

fit_base_n_encoding(X)[source]
get_feature_names()[source]

Returns the names of all transformed / added columns.

Returns

feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!

Return type

list

inverse_transform(X_in)[source]

Perform the inverse transformation to encoded data.

Parameters

X_in (array-like, shape = [n_samples, n_features]) –

Returns

p

Return type

array, the same size of X_in

static number_to_base(n, b, limit)[source]
transform(X, override_return_df=False)[source]

Perform the transformation to new categorical data.

Parameters

X (array-like, shape = [n_samples, n_features]) –

Returns

p – Transformed values with encoding applied.

Return type

array, shape = [n_samples, n_numeric + N]

class ballet.eng.external.Binarizer(*, threshold=0.0, copy=True)[source]

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Binarize data (set feature values to 0 or 1) according to a threshold.

Values greater than the threshold map to 1, while values less than or equal to the threshold map to 0. With the default threshold of 0, only positive values map to 1.

Binarization is a common operation on text count data where the analyst can decide to only consider the presence or absence of a feature rather than a quantified number of occurrences for instance.

It can also be used as a pre-processing step for estimators that consider boolean random variables (e.g. modelled using the Bernoulli distribution in a Bayesian setting).

Read more in the User Guide.

Parameters
  • threshold (float, default=0.0) – Feature values below or equal to this are replaced by 0, above it by 1. Threshold may not be less than 0 for operations on sparse matrices.

  • copy (bool, default=True) – Set to False to perform inplace binarization and avoid a copy (if the input is already a numpy array or a scipy.sparse CSR matrix).

n_features_in_

Number of features seen during fit.

New in version 0.24.

Type

int

feature_names_in_

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type

ndarray of shape (n_features_in_,)

See also

binarize

Equivalent function without the estimator API.

KBinsDiscretizer

Bin continuous data into intervals.

OneHotEncoder

Encode categorical features as a one-hot numeric array.

Notes

If the input is a sparse matrix, only the non-zero values are subject to update by the Binarizer class.

This estimator is stateless (besides constructor parameters), the fit method does nothing but is useful when used in a pipeline.

Examples

>>> from sklearn.preprocessing import Binarizer
>>> X = [[ 1., -1.,  2.],
...      [ 2.,  0.,  0.],
...      [ 0.,  1., -1.]]
>>> transformer = Binarizer().fit(X)  # fit does nothing.
>>> transformer
Binarizer()
>>> transformer.transform(X)
array([[1., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.]])
fit(X, y=None)[source]

Do nothing and return the estimator unchanged.

This method is just there to implement the usual API and hence work in pipelines.

Parameters
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data.

  • y (None) – Ignored.

Returns

self – Fitted transformer.

Return type

object

transform(X, copy=None)[source]

Binarize each element of X.

Parameters
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data to binarize, element by element. scipy.sparse matrices should be in CSR format to avoid an un-necessary copy.

  • copy (bool) – Copy the input X or not.

Returns

X_tr – Transformed array.

Return type

{ndarray, sparse matrix} of shape (n_samples, n_features)

class ballet.eng.external.BinaryEncoder(verbose=0, cols=None, mapping=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value')[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Binary encoding for categorical variables, similar to onehot, but stores categories as binary bitstrings.

Parameters
  • verbose (int) – integer indicating verbosity of the output. 0 for none.

  • cols (list) – a list of columns to encode, if None, all string columns will be encoded.

  • drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.

  • return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).

  • handle_unknown (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if indicator is used, an extra column will be added in if the transform matrix has unknown categories. This can cause unexpected changes in dimension in some cases.

  • handle_missing (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if indicator is used, an extra column will be added in if the transform matrix has nan values. This can cause unexpected changes in dimension in some cases.

Example

>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = BinaryEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 18 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS_0     506 non-null int64
CHAS_1     506 non-null int64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD_0      506 non-null int64
RAD_1      506 non-null int64
RAD_2      506 non-null int64
RAD_3      506 non-null int64
RAD_4      506 non-null int64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(11), int64(7)
memory usage: 71.3 KB
None
fit(X, y=None, **kwargs)[source]

Fit encoder according to X and y.

Parameters
  • X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like, shape = [n_samples]) – Target values.

Returns

self – Returns self.

Return type

encoder

get_feature_names()[source]

Returns the names of all transformed / added columns.

Returns

feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!

Return type

list

inverse_transform(X_in)[source]

Perform the inverse transformation to encoded data.

Parameters

X_in (array-like, shape = [n_samples, n_features]) –

Returns

p

Return type

array, the same size of X_in

transform(X, override_return_df=False)[source]

Perform the transformation to new categorical data.

Parameters

X (array-like, shape = [n_samples, n_features]) –

Returns

p – Transformed values with encoding applied.

Return type

array, shape = [n_samples, n_numeric + N]

class ballet.eng.external.BoxCoxTransformer(variables=None)[source]

Bases: feature_engine.base_transformers.BaseNumericalTransformer

The BoxCoxTransformer() applies the BoxCox transformation to numerical variables.

The Box-Cox transformation is defined as:

  • T(Y) = (Y^λ − 1) / λ if λ ≠ 0

  • T(Y) = log(Y) if λ = 0

where Y is the response variable and λ is the transformation parameter. λ varies, typically from -5 to 5. In the transformation, all values of λ are considered and the optimal value for a given variable is selected.

The BoxCox transformation implemented by this transformer is that of SciPy.stats: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.boxcox.html

The BoxCoxTransformer() works only with numerical, strictly positive variables (> 0).

A list of variables can be passed as an argument. Alternatively, the transformer will automatically select and transform all numerical variables.
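For example, a minimal usage sketch (the toy data below are illustrative only):

import pandas as pd
from ballet.eng.external import BoxCoxTransformer

X = pd.DataFrame({'income': [1.5, 2.0, 3.7, 10.2], 'age': [21.0, 34.0, 47.0, 62.0]})

transformer = BoxCoxTransformer()  # transforms all numerical variables by default
Xt = transformer.fit_transform(X)
transformer.lambda_dict_  # optimal lambda found per variable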

Parameters

variables (list, default=None) – The list of numerical variables to transform. If None, the transformer will automatically find and select all numerical variables.

lambda_dict_

Dictionary with the best BoxCox exponent per variable.

variables_

The group of variables that will be transformed.

n_features_in_

The number of features in the train set used in fit.

fit:

Learn the optimal lambda for the BoxCox transformation.

transform:

Apply the BoxCox transformation.

fit_transform:

Fit to data, then transform it.

References

1

Box and Cox. “An Analysis of Transformations”. Read at a RESEARCH MEETING, 1964. https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.2517-6161.1964.tb00553.x

fit(X, y=None)[source]

Learn the optimal lambda for the BoxCox transformation.

Parameters
  • X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples. Can be the entire dataframe, not just the variables to transform.

  • y (pandas Series, default=None) – It is not needed in this transformer. You can pass y or None.

Raises

  • TypeError

    • If the input is not a Pandas DataFrame - If any of the user provided variables are not numerical

  • ValueError

    • If there are no numerical variables in the df or the df is empty

    • If the variable(s) contain null values

    • If some variables contain zero values

Returns

Return type

self

transform(X)[source]

Apply the BoxCox transformation.

Parameters

X (Pandas DataFrame of shape = [n_samples, n_features]) – The data to be transformed.

Raises
  • TypeError – If the input is not a Pandas DataFrame

  • ValueError

    • If the variable(s) contain null values - If the df has different number of features than the df used in fit() - If some variables contain negative values

Returns

X – The dataframe with the transformed variables.

Return type

pandas dataframe

class ballet.eng.external.CatBoostEncoder(verbose=0, cols=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value', random_state=None, sigma=None, a=1)[source]

Bases: sklearn.base.BaseEstimator, category_encoders.utils.TransformerWithTargetMixin

CatBoost coding for categorical features.

Supported targets: binomial and continuous. For polynomial target support, see PolynomialWrapper.

This is very similar to leave-one-out encoding, but calculates the values “on-the-fly”. Consequently, the values naturally vary during the training phase and it is not necessary to add random noise.

Beware, the training data have to be randomly permuted. E.g.:

# Random permutation
perm = np.random.permutation(len(X))
X = X.iloc[perm].reset_index(drop=True)
y = y.iloc[perm].reset_index(drop=True)

This is necessary because some data sets are sorted based on the target value and this coder encodes the features on-the-fly in a single pass.

Parameters
  • verbose (int) – integer indicating verbosity of the output. 0 for none.

  • cols (list) – a list of columns to encode, if None, all string columns will be encoded.

  • drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.

  • return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).

  • handle_missing (str) – options are ‘error’, ‘return_nan’ and ‘value’, defaults to ‘value’, which returns the target mean.

  • handle_unknown (str) – options are ‘error’, ‘return_nan’ and ‘value’, defaults to ‘value’, which returns the target mean.

  • sigma (float) – adds normal (Gaussian) distribution noise into training data in order to decrease overfitting (testing data are untouched). sigma gives the standard deviation (spread or “width”) of the normal distribution.

  • a (float) – additive smoothing (it is the same variable as “m” in m-probability estimate). By default set to 1.

Example

>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = CatBoostEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(13)
memory usage: 51.5 KB
None

References

1

Transforming categorical features to numerical features, from

https://tech.yandex.com/catboost/doc/dg/concepts/algorithm-main-stages_cat-to-numberic-docpage/

2

CatBoost: unbiased boosting with categorical features, from

https://arxiv.org/abs/1706.09516

fit(X, y, **kwargs)[source]

Fit encoder according to X and y.

Parameters
  • X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like, shape = [n_samples]) – Target values.

Returns

self – Returns self.

Return type

encoder

get_feature_names()[source]

Returns the names of all transformed / added columns.

Returns

feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!

Return type

list

transform(X, y=None, override_return_df=False)[source]

Perform the transformation to new categorical data.

Parameters
  • X (array-like, shape = [n_samples, n_features]) –

  • y (array-like, shape = [n_samples] when transform by leave one out) – None, when transform without target information (such as transform test set)

Returns

p – Transformed values with encoding applied.

Return type

array, shape = [n_samples, n_numeric + N]

class ballet.eng.external.CategoricalImputer(imputation_method='missing', fill_value='Missing', variables=None, return_object=False, ignore_format=False)[source]

Bases: feature_engine.imputation.base_imputer.BaseImputer

The CategoricalImputer() replaces missing data in categorical variables by an arbitrary value or by the most frequent category.

The CategoricalImputer() imputes by default only categorical variables (type ‘object’ or ‘categorical’). You can pass a list of variables to impute, or alternatively, the imputer will find and impute all categorical variables.

If you want to impute numerical variables with this transformer, there are 2 ways of doing it:

Option 1: Cast your numerical variables as object in the input dataframe, before passing it to the transformer.

Option 2: Set ignore_format=True. Note that if you do this and do not pass the list of variables to impute, the imputer will automatically select and impute all variables in the dataframe.
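For example, a minimal usage sketch (the toy data below are illustrative only):

import numpy as np
import pandas as pd
from ballet.eng.external import CategoricalImputer

X = pd.DataFrame({'colour': ['blue', 'blue', np.nan, 'green']})

imputer = CategoricalImputer(imputation_method='frequent')
Xt = imputer.fit_transform(X)
# the missing value in 'colour' is replaced by the most frequent category, 'blue'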

Parameters
  • imputation_method (str, default='missing') – Desired method of imputation. Can be ‘frequent’ for frequent category imputation or ‘missing’ to impute with an arbitrary value.

  • fill_value (str, int, float, default='Missing') – Only used when imputation_method=’missing’. User-defined value to replace the missing data.

  • variables (list, default=None) – The list of categorical variables that will be imputed. If None, the imputer will find and transform all variables of type object or categorical by default. You can also make the transformer accept numerical variables, see the parameter ignore_format below.

  • return_object (bool, default=False) – If working with numerical variables cast as object, decide whether to return the variables as numeric or re-cast them as object. Note that pandas will re-cast them automatically as numeric after the transformation with the mode or with an arbitrary number.

  • ignore_format (bool, default=False) – Whether the format in which the categorical variables are cast should be ignored. If false, the encoder will automatically select variables of type object or categorical, or check that the variables entered by the user are of type object or categorical. If True, the encoder will select all variables or accept all variables entered by the user, including those cast as numeric.

imputer_dict_

Dictionary with most frequent category or arbitrary value per variable.

variables_

The group of variables that will be transformed.

n_features_in_

The number of features in the train set used in fit.

fit:

Learn the most frequent category, or assign arbitrary value to variable.

transform:

Impute missing data.

fit_transform:

Fit to the data, then transform it.

fit(X, y=None)[source]

Learn the most frequent category if the imputation method is set to frequent.

Parameters
  • X (pandas dataframe of shape = [n_samples, n_features]) – The training dataset.

  • y (pandas Series, default=None) – y is not needed in this imputation. You can pass None or y.

Raises
  • TypeError

    • If the input is not a Pandas DataFrame. - If user enters non-categorical variables (unless ignore_format is True)

  • ValueError – If there are no categorical variables in the df or the df is empty

Returns

Return type

self

transform(X)[source]

Replace missing data with the learned parameters.

Parameters

X (pandas dataframe of shape = [n_samples, n_features]) – The data to be transformed.

Raises
  • TypeError – If the input is not a Pandas DataFrame

  • ValueError – If the dataframe has different number of features than the df used in fit()

Returns

X – The dataframe without missing values in the selected variables.

Return type

pandas dataframe of shape = [n_samples, n_features]

class ballet.eng.external.CombineWithReferenceFeature(variables_to_combine, reference_variables, operations=['sub'], new_variables_names=None, missing_values='ignore')[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

CombineWithReferenceFeature() applies basic mathematical operations between a group of variables and one or more reference features. It adds one or more additional features to the dataframe with the result of the operations.

In other words, CombineWithReferenceFeature() sums, multiplies, subtracts or divides a group of features to / by a group of reference variables, and returns the result as new variables in the dataframe.

For example, if we have the variables number_payments_first_quarter, number_payments_second_quarter, number_payments_third_quarter, number_payments_fourth_quarter, and total_payments, we can use CombineWithReferenceFeature() to determine the percentage of payments per quarter as follows:

transformer = CombineWithReferenceFeature(
    variables_to_combine=[
        'number_payments_first_quarter',
        'number_payments_second_quarter',
        'number_payments_third_quarter',
        'number_payments_fourth_quarter',
    ],

    reference_variables=['total_payments'],

    operations=['div'],

    new_variables_names=[
        'perc_payments_first_quarter',
        'perc_payments_second_quarter',
        'perc_payments_third_quarter',
        'perc_payments_fourth_quarter',
    ]
)

Xt = transformer.fit_transform(X)

The transformed X, Xt, will contain the additional features indicated in the new_variables_names list plus the original set of variables.

Parameters
  • variables_to_combine (list) – The list of numerical variables to be combined with the reference variables.

  • reference_variables (list) – The list of numerical reference variables that will be added to, multiplied with, or subtracted from the variables_to_combine, or used as denominator for division.

  • operations (list, default=['sub']) –

    The list of basic mathematical operations to be used in transformation.

    If None, all of [‘sub’, ‘div’,’add’,’mul’] will be performed. Alternatively, you can enter a list of operations to carry out. Each operation should be a string and must be one of the elements in [‘sub’, ‘div’,’add’, ‘mul’].

    Each operation will result in a new variable that will be added to the transformed dataset.

  • new_variables_names (list, default=None) –

    Names of the newly created variables. You can enter a list with the names for the newly created features (recommended). You must enter as many names as new features created by the transformer. The number of new features is the number of operations times the number of reference variables times the number of variables to combine.

    Thus, if you want to perform 2 operations, sub and div, combining 4 variables with 2 reference variables, you should enter 2 X 4 X 2 new variable names.

    The name of the variables indicated by the user should coincide with the order in which the operations are performed by the transformer. The transformer will first carry out ‘sub’, then ‘div’, then ‘add’ and finally ‘mul’.

    If new_variables_names is None, the transformer will assign an arbitrary name to the newly created features.

  • missing_values (string, default='ignore') – Indicates if missing values should be ignored or raised. If ‘ignore’, the transformer will ignore missing data when transforming the data. If ‘raise’ the transformer will return an error if the training or the datasets to transform contain missing values.

n_features_in_

The number of features in the train set used in fit.

fit:

This transformer does not learn parameters.

transform:

Combine the variables with the mathematical operations.

fit_transform:

Fit to the data, then transform it.

Notes

Although the transformer in essence allows us to combine any feature with any of the allowed mathematical operations, its use is intended mostly for the creation of new features based on some domain knowledge. Typical examples within the financial sector are:

  • Ratio between income and debt to create the debt_to_income_ratio.

  • Subtraction of rent from income to obtain the disposable_income.

fit(X, y=None)[source]

This transformer does not learn any parameter. Performs dataframe checks.

Parameters
  • X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples. Can be the entire dataframe, not just the variables to transform.

  • y (pandas Series, or np.array. Default=None.) – It is not needed in this transformer. You can pass y or None.

Raises
  • TypeError

    • If the input is not a Pandas DataFrame - If any user provided variables are not numerical

  • ValueError – If any of the reference variables contain null values and the mathematical operation is ‘div’.

Returns

Return type

self

transform(X)[source]

Combine the variables with the mathematical operations.

Parameters

X (pandas dataframe of shape = [n_samples, n_features]) – The data to transform.

Returns

X – The dataframe with the operations results added as columns.

Return type

Pandas dataframe, shape = [n_samples, n_features + n_operations]

class ballet.eng.external.CountEncoder(verbose=0, cols=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value', min_group_size=None, combine_min_nan_groups=None, min_group_name=None, normalize=False)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

combine_min_categories(X)[source]

Combine small categories into a single category.

fit(X, y=None, **kwargs)[source]

Fit encoder according to X.

Parameters
  • X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like, shape = [n_samples]) – Target values.

Returns

self – Returns self.

Return type

encoder

get_feature_names()[source]

Returns the names of all transformed / added columns.

Returns

feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!

Return type

list

transform(X, y=None, override_return_df=False)[source]

Perform the transformation to new categorical data.

Parameters
  • X (array-like, shape = [n_samples, n_features]) –

  • y (array-like, shape = [n_samples]) –

Returns

p – Transformed values with encoding applied.

Return type

array, shape = [n_samples, n_numeric + N]

class ballet.eng.external.CountFrequencyEncoder(encoding_method='count', variables=None, ignore_format=False)[source]

Bases: feature_engine.encoding.base_encoder.BaseCategoricalTransformer

The CountFrequencyEncoder() replaces categories by either the count or the percentage of observations per category.

For example in the variable colour, if 10 observations are blue, blue will be replaced by 10. Alternatively, if 10% of the observations are blue, blue will be replaced by 0.1.

The CountFrequencyEncoder() will encode only categorical variables by default (type ‘object’ or ‘categorical’). You can pass a list of variables to encode. Alternatively, the encoder will find and encode all categorical variables (type ‘object’ or ‘categorical’).

With ignore_format=True you have the option to encode numerical variables as well. The procedure is identical, you can either enter the list of variables to encode, or the transformer will automatically select all variables.

The encoder first maps the categories to the counts or frequencies for each variable (fit). The encoder then replaces the categories with those numbers (transform).
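For example, a minimal usage sketch (the toy data below are illustrative only):

import pandas as pd
from ballet.eng.external import CountFrequencyEncoder

X = pd.DataFrame({'colour': ['blue'] * 6 + ['red'] * 3 + ['green']})

encoder = CountFrequencyEncoder(encoding_method='frequency')
Xt = encoder.fit_transform(X)
# 'blue' -> 0.6, 'red' -> 0.3, 'green' -> 0.1; the mapping is stored in encoder.encoder_dict_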

Parameters
  • encoding_method (str, default='count') –

    Desired method of encoding.

    ’count’: number of observations per category

    ’frequency’: percentage of observations per category

  • variables (list, default=None) – The list of categorical variables that will be encoded. If None, the encoder will find and transform all variables of type object or categorical by default. You can also make the transformer accept numerical variables, see the next parameter.

  • ignore_format (bool, default=False) – Whether the format in which the categorical variables are cast should be ignored. If false, the encoder will automatically select variables of type object or categorical, or check that the variables entered by the user are of type object or categorical. If True, the encoder will select all variables or accept all variables entered by the user, including those cast as numeric.

encoder_dict_

Dictionary with the count or frequency per category, per variable.

variables_

The group of variables that will be transformed.

n_features_in_

The number of features in the train set used in fit.

fit:

Learn the count or frequency per category, per variable.

transform:

Encode the categories to numbers.

fit_transform:

Fit to the data, then transform it.

inverse_transform:

Encode the numbers into the original categories.

Notes

NAN are introduced when encoding categories that were not present in the training dataset. If this happens, try grouping infrequent categories using the RareLabelEncoder().

See also

feature_engine.encoding.RareLabelEncoder

fit(X, y=None)[source]

Learn the counts or frequencies which will be used to replace the categories.

Parameters
  • X (pandas dataframe of shape = [n_samples, n_features]) – The training dataset. Can be the entire dataframe, not just the variables to be transformed.

  • y (pandas Series, default = None) – y is not needed in this encoder. You can pass y or None.

Raises
  • TypeError

    • If the input is not a Pandas DataFrame. - If the user enters non-categorical variables (unless ignore_format is True)

  • ValueError

    • If there are no categorical variables in the df or the df is empty - If the variable(s) contain null values

Returns

Return type

self

inverse_transform(X)[source]

Convert the encoded variable back to the original values.

Parameters

X (pandas dataframe of shape = [n_samples, n_features]) – The transformed dataframe.

Raises
  • TypeError – If the input is not a Pandas DataFrame

  • ValueError

    • If the variable(s) contain null values - If the df has different number of features than the df used in fit()

Returns

X – The un-transformed dataframe, with the categorical variables containing the original values.

Return type

pandas dataframe of shape = [n_samples, n_features]

transform(X)[source]

Replace categories with the learned parameters.

Parameters

X (pandas dataframe of shape = [n_samples, n_features]) – The dataset to transform.

Raises
  • TypeError – If the input is not a Pandas DataFrame

  • ValueError

    • If the variable(s) contain null values - If the df has different number of features than the df used in fit()

  • Warning – If after encoding, NAN were introduced.

Returns

X – The dataframe containing the categories replaced by numbers.

Return type

pandas dataframe of shape = [n_samples, n_features]

class ballet.eng.external.DFSTransformer(target_entity=None, agg_primitives=None, trans_primitives=None, allowed_paths=None, max_depth=2, ignore_entities=None, ignore_variables=None, seed_features=None, drop_contains=None, drop_exact=None, where_primitives=None, max_features=- 1, verbose=False)[source]

Bases: sklearn.base.TransformerMixin

Transformer using Scikit-Learn interface for Pipeline uses.

fit(X, y=None)[source]

Wrapper for DFS

Calculates a list of features given a dictionary of entities and a list of relationships. Alternatively, an EntitySet can be passed instead of the entities and relationships.

Parameters
  • X – (ft.Entityset or tuple): Entityset to calculate features on. If a tuple is passed it can take one of these forms: (entityset, cutoff_time_dataframe), (entities, relationships), or ((entities, relationships), cutoff_time_dataframe)

  • y – (iterable): Training targets

See also

synthesis.dfs()

get_params(deep=True)[source]
transform(X)[source]

Wrapper for calculate_feature_matrix

Calculates a feature matrix for the given input data and calculation times.

Parameters

X – (ft.Entityset or tuple): Entityset to calculate features on. If a tuple is passed it can take one of these forms: (entityset, cutoff_time_dataframe), (entities, relationships), or ((entities, relationships), cutoff_time_dataframe)

See also

computational_backends.calculate_feature_matrix()

class ballet.eng.external.DecisionTreeDiscretiser(variables=None, cv=3, scoring='neg_mean_squared_error', param_grid=None, regression=True, random_state=None)[source]

Bases: feature_engine.base_transformers.BaseNumericalTransformer

The DecisionTreeDiscretiser() replaces continuous numerical variables by discrete, finite values estimated by a decision tree.

The method is inspired by the following article from the winners of the KDD 2009 competition: http://www.mtome.com/Publications/CiML/CiML-v3-book.pdf

The DecisionTreeDiscretiser() works only with numerical variables. A list of variables can be passed as an argument. Alternatively, the discretiser will automatically select all numerical variables.

The DecisionTreeDiscretiser() first trains a decision tree for each variable.

The DecisionTreeDiscretiser() then transforms the variables, that is, makes predictions based on the variable values, using the trained decision tree.
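For example, a minimal usage sketch (the toy data below are illustrative only):

import pandas as pd
from ballet.eng.external import DecisionTreeDiscretiser

X = pd.DataFrame({'age': [22, 35, 47, 51, 63, 28]})
y = pd.Series([15000, 32000, 41000, 45000, 38000, 21000])

discretiser = DecisionTreeDiscretiser(cv=2, regression=True)
Xt = discretiser.fit_transform(X, y)
# 'age' values are replaced by the finite set of predictions of the fitted tree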

Parameters
  • variables (list, default=None) – The list of numerical variables to transform. If None, the discretiser will automatically select all numerical variables.

  • cv (int, default=3) – Desired number of cross-validation fold to be used to fit the decision tree.

  • scoring (str, default='neg_mean_squared_error') – Desired metric to optimise the performance for the tree. Comes from sklearn.metrics. See DecisionTreeRegressor or DecisionTreeClassifier model evaluation documentation for more options: https://scikit-learn.org/stable/modules/model_evaluation.html

  • param_grid (dictionary, default=None) –

    The list of parameters over which the decision tree should be optimised during the grid search. The param_grid can contain any of the permitted parameters for Scikit-learn’s DecisionTreeRegressor() or DecisionTreeClassifier().

    If None, then param_grid = {‘max_depth’: [1, 2, 3, 4]}

  • regression (boolean, default=True) – Indicates whether the discretiser should train a regression or a classification decision tree.

  • random_state (int, default=None) – The random_state to initialise the training of the decision tree. It is one of the parameters of the Scikit-learn’s DecisionTreeRegressor() or DecisionTreeClassifier(). For reproducibility it is recommended to set the random_state to an integer.

binner_dict_

Dictionary containing the fitted tree per variable.

scores_dict_

Dictionary with the score of the best decision tree, over the train set.

variables_

The variables to discretise.

n_features_in_

The number of features in the train set used in fit.

fit:

Fit a decision tree per variable.

transform:

Replace continuous values by the predictions of the decision tree.

fit_transform:

Fit to the data, then transform it.

See also

sklearn.tree.DecisionTreeClassifier, sklearn.tree.DecisionTreeRegressor

References

1

Niculescu-Mizil, et al. “Winning the KDD Cup Orange Challenge with Ensemble Selection”. JMLR: Workshop and Conference Proceedings 7: 23-34. KDD 2009 http://proceedings.mlr.press/v7/niculescu09/niculescu09.pdf

fit(X, y)[source]

Fit the decision trees. One tree per variable to be transformed.

Parameters
  • X (pandas dataframe of shape = [n_samples, n_features]) – The training dataset. Can be the entire dataframe, not just the variables to be transformed.

  • y (pandas series.) – Target variable. Required to train the decision tree.

Raises
  • TypeError

    • If the input is not a Pandas DataFrame - If any of the user provided variables are not numerical

  • ValueError

    • If there are no numerical variables in the df or the df is empty - If the variable(s) contain null values

Returns

Return type

self

transform(X)[source]

Replaces the original variable values with the predictions of the tree. The tree outcome is finite, that is, discrete.

Parameters

X (pandas dataframe of shape = [n_samples, n_features]) – The input samples.

Raises
  • TypeError – If the input is not a Pandas DataFrame

  • ValueError

    • If the variable(s) contain null values - If the dataframe is not of the same size as the one used in fit()

Returns

X_transformed – The dataframe with transformed variables.

Return type

pandas dataframe of shape = [n_samples, n_features]

class ballet.eng.external.DecisionTreeEncoder(encoding_method='arbitrary', cv=3, scoring='neg_mean_squared_error', param_grid=None, regression=True, random_state=None, variables=None, ignore_format=False)[source]

Bases: feature_engine.encoding.base_encoder.BaseCategoricalTransformer

The DecisionTreeEncoder() encodes categorical variables with predictions of a decision tree.

The encoder first fits a decision tree using a single feature and the target (fit), and then replaces the values of the original feature by the predictions of the tree (transform). The transformer will train a decision tree for every feature to encode.

The motivation is to try and create monotonic relationships between the categorical variables and the target.

Under the hood, the categorical variable will be first encoded into integers with the OrdinalCategoricalEncoder(). The integers can be assigned arbitrarily to the categories or following the mean value of the target in each category. Then a decision tree will fit the resulting numerical variable to predict the target variable. Finally, the original categorical variable values will be replaced by the predictions of the decision tree.

The DecisionTreeEncoder() will encode only categorical variables by default (type ‘object’ or ‘categorical’). You can pass a list of variables to encode or the encoder will find and encode all categorical variables. But with ignore_format=True you have the option to encode numerical variables as well. In this case, you can either enter the list of variables to encode, or the transformer will automatically select all variables.
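For example, a minimal usage sketch (the toy data below are illustrative only):

import pandas as pd
from ballet.eng.external import DecisionTreeEncoder

X = pd.DataFrame({'colour': ['blue', 'blue', 'red', 'red', 'green', 'green']})
y = pd.Series([1.0, 1.2, 3.1, 2.9, 5.0, 5.2])

encoder = DecisionTreeEncoder(regression=True, cv=2)
Xt = encoder.fit_transform(X, y)
# each category in 'colour' is replaced by the prediction of a decision tree fitted on y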

Parameters
  • encoding_method (str, default='arbitrary') –

    The categorical encoding method that will be used to encode the original categories to numerical values.

    ’ordered’: the categories are numbered in ascending order according to the target mean value per category.

    ’arbitrary’ : categories are numbered arbitrarily.

  • cv (int, default=3) – Desired number of cross-validation fold to be used to fit the decision tree.

  • scoring (str, default='neg_mean_squared_error') – Desired metric to optimise the performance for the decision tree. Comes from sklearn.metrics. See the DecisionTreeRegressor or DecisionTreeClassifier model evaluation documentation for more options: https://scikit-learn.org/stable/modules/model_evaluation.html

  • param_grid (dictionary, default=None) –

    The list of parameters over which the decision tree should be optimised during the grid search. The param_grid can contain any of the permitted parameters for Scikit-learn’s DecisionTreeRegressor() or DecisionTreeClassifier().

    If None, then param_grid = {‘max_depth’: [1, 2, 3, 4]}.

  • regression (boolean, default=True) – Indicates whether the encoder should train a regression or a classification decision tree.

  • random_state (int, default=None) – The random_state to initialise the training of the decision tree. It is one of the parameters of the Scikit-learn’s DecisionTreeRegressor() or DecisionTreeClassifier(). For reproducibility it is recommended to set the random_state to an integer.

  • variables (list, default=None) – The list of categorical variables that will be encoded. If None, the encoder will find and transform all variables of type object or categorical by default. You can also make the transformer accept numerical variables, see the next parameter.

  • ignore_format (bool, default=False) – Whether the format in which the categorical variables are cast should be ignored. If false, the encoder will automatically select variables of type object or categorical, or check that the variables entered by the user are of type object or categorical. If True, the encoder will select all variables or accept all variables entered by the user, including those cast as numeric.

encoder_

sklearn Pipeline containing the ordinal encoder and the decision tree.

variables_

The group of variables that will be transformed.

n_features_in_

The number of features in the train set used in fit.

fit:

Fit a decision tree per variable.

transform:

Replace categorical variable by the predictions of the decision tree.

fit_transform:

Fit to the data, then transform it.

Notes

The authors originally designed this method to work with numerical variables. Numerical variables can be replaced by the predictions of a decision tree using the DecisionTreeDiscretiser().

NaN values are introduced when encoding categories that were not present in the training dataset. If this happens, try grouping infrequent categories using the RareLabelEncoder().

See also

sklearn.tree.DecisionTreeRegressor, sklearn.tree.DecisionTreeClassifier, feature_engine.discretisation.DecisionTreeDiscretiser, feature_engine.encoding.RareLabelEncoder, feature_engine.encoding.OrdinalEncoder

References

1

Niculescu-Mizil, et al. “Winning the KDD Cup Orange Challenge with Ensemble Selection”. JMLR: Workshop and Conference Proceedings 7: 23-34. KDD 2009 http://proceedings.mlr.press/v7/niculescu09/niculescu09.pdf

fit(X, y=None)[source]

Fit a decision tree per variable.

Parameters
  • X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples. Can be the entire dataframe, not just the categorical variables.

  • y (pandas series.) – The target variable. Required to train the decision tree and for ordered ordinal encoding.

Raises
  • TypeError

    • If the input is not a Pandas DataFrame

    • If the user enters non-categorical variables (unless ignore_format is True)

  • ValueError

    • If there are no categorical variables in the df or the df is empty

    • If the variable(s) contain null values

Returns

Return type

self

inverse_transform(X)[source]

inverse_transform is not implemented for this transformer.

transform(X)[source]

Replace categorical variable by the predictions of the decision tree.

Parameters

X (pandas dataframe of shape = [n_samples, n_features]) – The input samples.

Raises
  • TypeError – If the input is not a Pandas DataFrame

  • ValueError

    • If the variable(s) contain null values

    • If the dataframe is not of the same size as the one used in fit()

  • Warning – If NaN values were introduced after encoding.

Returns

X – Dataframe with variables encoded with decision tree predictions.

Return type

pandas dataframe of shape = [n_samples, n_features]

class ballet.eng.external.DifferenceTransformer(period=1)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

fit(X, y=None)[source]
inverse_transform(X)[source]
needs_refit = True
transform(X, y=None, refit=False)[source]
class ballet.eng.external.DropMissingData(missing_only=True, variables=None)[source]

Bases: feature_engine.imputation.base_imputer.BaseImputer

The DropMissingData() will delete rows containing missing values. It provides similar functionality to pandas.dropna().

It works for both numerical and categorical variables. You can enter the list of variables for which missing values should be removed from the dataframe. Alternatively, the imputer will automatically select all variables in the dataframe.

Note The transformer will first select all variables, or all user-entered variables, and if missing_only=True, it will then re-select from that group only those that showed missing data during fit, that is, in the train set.
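
A minimal usage sketch (toy data; column names are illustrative only):

>>> import numpy as np
>>> import pandas as pd
>>> from ballet.eng.external import DropMissingData
>>> X = pd.DataFrame({'age': [25.0, np.nan, 40.0], 'city': ['NY', 'LA', None]})
>>> dmd = DropMissingData()
>>> X_complete = dmd.fit_transform(X)   # only fully observed rows remain
>>> X_na = dmd.return_na_data(X)        # the rows that contain NA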

Parameters
  • missing_only (bool, default=True) – If true, missing observations will be dropped only for the variables that have missing data in the train set, during fit. If False, observations with NA will be dropped from all variables indicated by the user.

  • variables (list, default=None) – The list of variables to be imputed. If None, the imputer will find and select all variables in the dataframe.

variables_

List of variables for which the rows with NA will be deleted.

n_features_in_

The number of features in the train set used in fit.

fit:

Learn the variables for which the rows with NA will be deleted

transform:

Remove observations with NA

fit_transform:

Fit to the data, then transform it.

return_na_data:

Returns the dataframe with the rows that contain NA.

fit(X, y=None)[source]

Learn the variables for which the rows with NA will be deleted.

Parameters
  • X (pandas dataframe of shape = [n_samples, n_features]) – The training dataset.

  • y (pandas Series, default=None) – y is not needed in this imputation. You can pass None or y.

Raises

TypeError – If the input is not a Pandas DataFrame

Returns

Return type

self

return_na_data(X)[source]

Returns the subset of the dataframe which contains the rows with missing values. This method could be useful in production, in case we want to store the observations that will not be fed into the model.

Parameters

X (pandas dataframe of shape = [n_samples, n_features]) – The dataframe to be transformed.

Raises

TypeError – If the input is not a Pandas DataFrame

Returns

X – The dataframe containing only the rows with missing values.

Return type

pandas dataframe of shape = [obs_with_na, features]

transform(X)[source]

Remove rows with missing values.

Parameters

X (pandas dataframe of shape = [n_samples, n_features]) – The dataframe to be transformed.

Returns

X_transformed – The complete case dataframe for the selected variables, of shape [n_samples - rows_with_na, n_features]

Return type

pandas dataframe

class ballet.eng.external.EndTailImputer(imputation_method='gaussian', tail='right', fold=3, variables=None)[source]

Bases: feature_engine.imputation.base_imputer.BaseImputer

The EndTailImputer() replaces missing data by a value at either tail of the distribution. It works only with numerical variables.

You can indicate the variables to be imputed in a list. Alternatively, the EndTailImputer() will automatically find and select all variables of type numeric.

The imputer first calculates the values at the end of the distribution for each variable (fit). The values at the end of the distribution are determined using the Gaussian limits, the IQR proximity rule limits, or a factor of the maximum value:

Gaussian limits:
  • right tail: mean + 3*std

  • left tail: mean - 3*std

IQR limits:
  • right tail: 75th quantile + 3*IQR

  • left tail: 25th quantile - 3*IQR

where IQR is the inter-quartile range = 75th quantile - 25th quantile

Maximum value:
  • right tail: max * 3

  • left tail: not applicable

You can change the factor that multiplies the std, IQR or the maximum value using the parameter ‘fold’ (we used fold=3 in the examples above).

The imputer then replaces the missing data with the estimated values (transform).
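
A minimal sketch using the Gaussian right-tail rule described above (toy data; the variable name is illustrative only):

>>> import numpy as np
>>> import pandas as pd
>>> from ballet.eng.external import EndTailImputer
>>> X = pd.DataFrame({'income': [20.0, 25.0, 30.0, np.nan, 35.0]})
>>> imputer = EndTailImputer(imputation_method='gaussian', tail='right', fold=3)
>>> X_t = imputer.fit_transform(X)
>>> imputer.imputer_dict_   # replacement value learned during fit: mean + 3 * std of 'income'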

Parameters
  • imputation_method (str, default=gaussian) –

    Method to be used to find the replacement values. Can take ‘gaussian’, ‘iqr’ or ‘max’.

    gaussian: the imputer will use the Gaussian limits to find the values to replace missing data.

    iqr: the imputer will use the IQR limits to find the values to replace missing data.

    max: the imputer will use the maximum values to replace missing data. Note that if ‘max’ is passed, the parameter ‘tail’ is ignored.

  • tail (str, default=right) – Indicates if the values to replace missing data should be selected from the right or left tail of the variable distribution. Can take values ‘left’ or ‘right’.

  • fold (int, default=3) – Factor to multiply the std, the IQR or the Max values. Recommended values are 2 or 3 for Gaussian, or 1.5 or 3 for IQR.

  • variables (list, default=None) – The list of variables to be imputed. If None, the imputer will find and select all variables of type numeric.

imputer_dict_

Dictionary with the values at the end of the distribution per variable.

variables_

The group of variables that will be transformed.

n_features_in_

The number of features in the train set used in fit.

fit:

Learn values to replace missing data.

transform:

Impute missing data.

fit_transform:

Fit to the data, then transform it.

fit(X, y=None)[source]

Learn the values at the end of the variable distribution.

Parameters
  • X (pandas dataframe of shape = [n_samples, n_features]) – The training dataset.

  • y (pandas Series, default=None) – y is not needed in this imputation. You can pass None or y.

Raises
  • TypeError

    • If the input is not a Pandas DataFrame

    • If any of the user provided variables are not numerical

  • ValueError – If there are no numerical variables in the df or the df is empty

Returns

Return type

self

transform(X)[source]

Replace missing data with the learned parameters.

Parameters

X (pandas dataframe of shape = [n_samples, n_features]) – The data to be transformed.

Raises
  • TypeError – If the input is not a Pandas DataFrame

  • ValueError – If the dataframe has different number of features than the df used in fit()

Returns

X – The dataframe without missing values in the selected variables.

Return type

pandas dataframe of shape = [n_samples, n_features]

class ballet.eng.external.EqualFrequencyDiscretiser(variables=None, q=10, return_object=False, return_boundaries=False)[source]

Bases: feature_engine.base_transformers.BaseNumericalTransformer

The EqualFrequencyDiscretiser() divides continuous numerical variables into contiguous equal frequency intervals, that is, intervals that contain approximately the same proportion of observations.

The interval limits are determined using pandas.qcut(), in other words, the interval limits are determined by the quantiles. The number of intervals, i.e., the number of quantiles in which the variable should be divided is determined by the user.

The EqualFrequencyDiscretiser() works only with numerical variables. A list of variables can be passed as argument. Alternatively, the discretiser will automatically select and transform all numerical variables.

The EqualFrequencyDiscretiser() first finds the boundaries for the intervals or quantiles for each variable.

Then it transforms the variables, that is, it sorts the values into the intervals.
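
A minimal usage sketch (toy data; the variable name is illustrative only):

>>> import numpy as np
>>> import pandas as pd
>>> from ballet.eng.external import EqualFrequencyDiscretiser
>>> X = pd.DataFrame({'x': np.arange(100.0)})
>>> disc = EqualFrequencyDiscretiser(q=4, variables=['x'])
>>> X_t = disc.fit_transform(X)   # 'x' values replaced by quartile bin numbers 0..3
>>> disc.binner_dict_             # learned quantile boundaries per variable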

Parameters
  • variables (list, default=None) – The list of numerical variables that will be discretised. If None, the EqualFrequencyDiscretiser() will select all numerical variables.

  • q (int, default=10) – Desired number of equal frequency intervals / bins. In other words the number of quantiles in which the variables should be divided.

  • return_object (bool, default=False) –

    Whether the discrete variable should be returned cast as numeric or as object. If you would like to proceed with the engineering of the variable as if it was categorical, use True. Alternatively, keep the default (False).

    Categorical encoders in Feature-engine work only with variables of type object, thus, if you wish to encode the returned bins, set return_object to True.

  • return_boundaries (bool, default=False) – Whether the output should be the interval boundaries. If True, it returns the interval boundaries. If False, it returns integers.

binner_dict_

Dictionary with the interval limits per variable.

variables_

The variables to discretise.

n_features_in_

The number of features in the train set used in fit.

fit:

Find the interval limits.

transform:

Sort continuous variable values into the intervals.

fit_transform:

Fit to the data, then transform it.

See also

pandas.qcut

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html

References

1

Kotsiantis and Pintelas, “Data preprocessing for supervised learning,” International Journal of Computer Science, vol. 1, pp. 111–117, 2006.

2

Dong. “Beating Kaggle the easy way”. Master Thesis. https://www.ke.tu-darmstadt.de/lehre/arbeiten/studien/2015/Dong_Ying.pdf

fit(X, y=None)[source]

Learn the limits of the equal frequency intervals.

Parameters
  • X (pandas dataframe of shape = [n_samples, n_features]) – The training dataset. Can be the entire dataframe, not just the variables to be transformed.

  • y (None) – y is not needed in this encoder. You can pass y or None.

Raises
  • TypeError

    • If the input is not a Pandas DataFrame

    • If any of the user provided variables are not numerical

  • ValueError

    • If there are no numerical variables in the df or the df is empty

    • If the variable(s) contain null values

Returns

Return type

self

transform(X)[source]

Sort the variable values into the intervals.

Parameters

X (pandas dataframe of shape = [n_samples, n_features]) – The data to transform.

Raises
  • TypeError – If the input is not a Pandas DataFrame

  • ValueError

    • If the variable(s) contain null values

    • If the dataframe is not of the same size as the one used in fit()

Returns

X – The transformed data with the discrete variables.

Return type

pandas dataframe of shape = [n_samples, n_features]

class ballet.eng.external.EqualWidthDiscretiser(variables=None, bins=10, return_object=False, return_boundaries=False)[source]

Bases: feature_engine.base_transformers.BaseNumericalTransformer

The EqualWidthDiscretiser() divides continuous numerical variables into intervals of the same width, that is, equidistant intervals. Note that the proportion of observations per interval may vary.

The size of the interval is calculated as:

\[(\max(X) - \min(X)) / \text{bins}\]

where bins, which is the number of intervals, should be determined by the user.

The interval limits are determined using pandas.cut(). The number of intervals in which the variable should be divided must be indicated by the user.

The EqualWidthDiscretiser() works only with numerical variables. A list of variables can be passed as argument. Alternatively, the discretiser will automatically select all numerical variables.

The EqualWidthDiscretiser() first finds the boundaries for the intervals for each variable. Then, it transforms the variables, that is, sorts the values into the intervals.
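
A minimal usage sketch (toy data; the variable name is illustrative only):

>>> import numpy as np
>>> import pandas as pd
>>> from ballet.eng.external import EqualWidthDiscretiser
>>> X = pd.DataFrame({'x': np.arange(100.0)})
>>> disc = EqualWidthDiscretiser(bins=5, variables=['x'])
>>> X_t = disc.fit_transform(X)   # 'x' values replaced by bin numbers 0..4
>>> disc.binner_dict_             # learned equal-width interval limits per variable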

Parameters
  • variables (list, default=None) – The list of numerical variables to transform. If None, the discretiser will automatically select all numerical type variables.

  • bins (int, default=10) – Desired number of equal width intervals / bins.

  • return_object (bool, default=False) –

    Whether the discrete variable should be returned cast as numeric or as object. If you would like to proceed with the engineering of the variable as if it was categorical, use True. Alternatively, keep the default (False).

    Categorical encoders in Feature-engine work only with variables of type object, thus, if you wish to encode the returned bins, set return_object to True.

  • return_boundaries (bool, default=False) – Whether the output should be the interval boundaries. If True, it returns the interval boundaries. If False, it returns integers.

binner_dict_

Dictionary with the interval limits per variable.

variables_

The variables to be discretised.

n_features_in_

The number of features in the train set used in fit.

fit:

Find the interval limits.

transform:

Sort continuous variable values into the intervals.

fit_transform:

Fit to the data, then transform it.

See also

pandas.cut

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html

References

1

Kotsiantis and Pintelas, “Data preprocessing for supervised learning,” International Journal of Computer Science, vol. 1, pp. 111–117, 2006.

2

Dong. “Beating Kaggle the easy way”. Master Thesis. https://www.ke.tu-darmstadt.de/lehre/arbeiten/studien/2015/Dong_Ying.pdf

fit(X, y=None)[source]

Learn the boundaries of the equal width intervals / bins for each variable.

Parameters
  • X (pandas dataframe of shape = [n_samples, n_features]) – The training dataset. Can be the entire dataframe, not just the variables to be transformed.

  • y (None) – y is not needed in this encoder. You can pass y or None.

Raises
  • TypeError

    • If the input is not a Pandas DataFrame

    • If any of the user provided variables are not numerical

  • ValueError

    • If there are no numerical variables in the df or the df is empty

    • If the variable(s) contain null values

Returns

Return type

self

transform(X)[source]

Sort the variable values into the intervals.

Parameters

X (pandas dataframe of shape = [n_samples, n_features]) – The data to transform.

Raises
  • TypeError – If the input is not a Pandas DataFrame

  • ValueError

    • If the variable(s) contain null values

    • If the dataframe is not of the same size as the one used in fit()

Returns

X – The transformed data with the discrete variables.

Return type

pandas dataframe of shape = [n_samples, n_features]

class ballet.eng.external.FeatureAugmenter(default_fc_parameters=None, kind_to_fc_parameters=None, column_id=None, column_sort=None, column_kind=None, column_value=None, timeseries_container=None, chunksize=None, n_jobs=1, show_warnings=False, disable_progressbar=False, impute_function=None, profile=False, profiling_filename='profile.txt', profiling_sorting='cumulative')[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Sklearn-compatible estimator, for calculating and adding many features calculated from a given time series to the data. It is basically a wrapper around extract_features().

The features include basic ones like min, max or median, and advanced features like fourier transformations or statistical tests. For a list of all possible features, see the module feature_calculators. The column name of each added feature contains the name of the function of that module, which was used for the calculation.

For this estimator, two datasets play a crucial role:

  1. the time series container with the timeseries data. This container (for the format see data-formats-label) contains the data which is used for calculating the features. It must be groupable by ids which are used to identify which feature should be attached to which row in the second dataframe.

  2. the input data X, to which the features will be added. Its rows are identified by the index, and each index in X must be present as an id in the time series container.

Imagine the following situation: you want to classify 10 different financial shares and you have their development over the last year as a time series. You would then start by creating features from the meta-information of the shares, e.g. how long they have been on the market, filling up a table with the features of one stock in each row. This is the input array X, with each row identified by, e.g., the stock name as an index.

>>> df = pandas.DataFrame(index=["AAA", "BBB", ...])
>>> # Fill in the information of the stocks
>>> df["started_since_days"] = ... # add a feature

You can then extract all the features from the time development of the shares, by using this estimator. The time series container must include a column of ids, which are the same as the index of X.

>>> time_series = read_in_timeseries() # get the development of the shares
>>> from tsfresh.transformers import FeatureAugmenter
>>> augmenter = FeatureAugmenter(column_id="id")
>>> augmenter.set_timeseries_container(time_series)
>>> df_with_time_series_features = augmenter.transform(df)

The settings for the feature calculation can be controlled with the settings object. If you pass None, the default settings are used. Please refer to ComprehensiveFCParameters for more information.

This estimator does not select the relevant features, but calculates and adds all of them to the DataFrame. See the RelevantFeatureAugmenter for calculating and selecting features.

For a description what the parameters column_id, column_sort, column_kind and column_value mean, please see extraction.

fit(X=None, y=None)[source]

The fit function is not needed for this estimator. It just does nothing and is here for compatibility reasons.

Parameters
  • X (Any) – Unneeded.

  • y (Any) – Unneeded.

Returns

The estimator instance itself

Return type

FeatureAugmenter

set_timeseries_container(timeseries_container)[source]

Set the timeseries, with which the features will be calculated. For a format of the time series container, please refer to extraction. The timeseries must contain the same indices as the later DataFrame, to which the features will be added (the one you will pass to transform()). You can call this function as often as you like, to change the timeseries later (e.g. if you want to extract for different ids).

Parameters

timeseries_container (pandas.DataFrame or dict) – The timeseries as a pandas.DataFrame or a dict. See extraction for the format.

Returns

None

Return type

None

transform(X)[source]

Add the features calculated using the timeseries_container and add them to the corresponding rows in the input pandas.DataFrame X.

To save some computing time, you should only include those time series in the container that you need. You can set the timeseries container with the method set_timeseries_container().

Parameters

X (pandas.DataFrame) – the DataFrame to which the calculated timeseries features will be added. This is not the dataframe with the timeseries itself.

Returns

The input DataFrame, but with added features.

Return type

pandas.DataFrame

class ballet.eng.external.FourierTransformer(period=10, max_order=10, step_size=1)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

fit(X, y=None)[source]
transform(X, y=None)[source]
class ballet.eng.external.FunctionTransformer(func=None, inverse_func=None, *, validate=False, accept_sparse=False, check_inverse=True, kw_args=None, inv_kw_args=None)[source]

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Constructs a transformer from an arbitrary callable.

A FunctionTransformer forwards its X (and optionally y) arguments to a user-defined function or function object and returns the result of this function. This is useful for stateless transformations such as taking the log of frequencies, doing custom scaling, etc.

Note: If a lambda is used as the function, then the resulting transformer will not be pickleable.

New in version 0.17.

Read more in the User Guide.

Parameters
  • func (callable, default=None) – The callable to use for the transformation. This will be passed the same arguments as transform, with args and kwargs forwarded. If func is None, then func will be the identity function.

  • inverse_func (callable, default=None) – The callable to use for the inverse transformation. This will be passed the same arguments as inverse transform, with args and kwargs forwarded. If inverse_func is None, then inverse_func will be the identity function.

  • validate (bool, default=False) –

    Indicate that the input X array should be checked before calling func. The possibilities are:

    • If False, there is no input validation.

    • If True, then X will be converted to a 2-dimensional NumPy array or sparse matrix. If the conversion is not possible an exception is raised.

    Changed in version 0.22: The default of validate changed from True to False.

  • accept_sparse (bool, default=False) – Indicate that func accepts a sparse matrix as input. If validate is False, this has no effect. Otherwise, if accept_sparse is false, sparse matrix inputs will cause an exception to be raised.

  • check_inverse (bool, default=True) –

    Whether to check that func followed by inverse_func leads back to the original inputs. It can be used as a sanity check, raising a warning when the condition is not fulfilled.

    New in version 0.20.

  • kw_args (dict, default=None) –

    Dictionary of additional keyword arguments to pass to func.

    New in version 0.18.

  • inv_kw_args (dict, default=None) –

    Dictionary of additional keyword arguments to pass to inverse_func.

    New in version 0.18.

n_features_in_

Number of features seen during fit. Defined only when validate=True.

New in version 0.24.

Type

int

feature_names_in_

Names of features seen during fit. Defined only when validate=True and X has feature names that are all strings.

New in version 1.0.

Type

ndarray of shape (n_features_in_,)

See also

MaxAbsScaler

Scale each feature by its maximum absolute value.

StandardScaler

Standardize features by removing the mean and scaling to unit variance.

LabelBinarizer

Binarize labels in a one-vs-all fashion.

MultiLabelBinarizer

Transform between iterable of iterables and a multilabel format.

Examples

>>> import numpy as np
>>> from sklearn.preprocessing import FunctionTransformer
>>> transformer = FunctionTransformer(np.log1p)
>>> X = np.array([[0, 1], [2, 3]])
>>> transformer.transform(X)
array([[0.       , 0.6931...],
       [1.0986..., 1.3862...]])
fit(X, y=None)[source]

Fit transformer by checking X.

If validate is True, X will be checked.

Parameters
  • X (array-like, shape (n_samples, n_features)) – Input array.

  • y (Ignored) – Not used, present here for API consistency by convention.

Returns

self – FunctionTransformer class instance.

Return type

object

inverse_transform(X)[source]

Transform X using the inverse function.

Parameters

X (array-like, shape (n_samples, n_features)) – Input array.

Returns

X_out – Transformed input.

Return type

array-like, shape (n_samples, n_features)

transform(X)[source]

Transform X using the forward function.

Parameters

X (array-like, shape (n_samples, n_features)) – Input array.

Returns

X_out – Transformed input.

Return type

array-like, shape (n_samples, n_features)

class ballet.eng.external.GLMMEncoder(verbose=0, cols=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value', random_state=None, randomized=False, sigma=0.05, binomial_target=None)[source]

Bases: sklearn.base.BaseEstimator, category_encoders.utils.TransformerWithTargetMixin

Generalized linear mixed model.

Supported targets: binomial and continuous. For polynomial target support, see PolynomialWrapper.

This is a supervised encoder similar to TargetEncoder or MEstimateEncoder, but there are some advantages:

  1. Solid statistical theory behind the technique. Mixed effects models are a mature branch of statistics.

  2. No hyper-parameters to tune. The amount of shrinkage is automatically determined through the estimation process. In short, the fewer observations a category has and/or the more the outcome varies for a category, the higher the regularization towards “the prior” or “grand mean”.

  3. The technique is applicable for both continuous and binomial targets. If the target is continuous, the encoder returns the regularized difference of the observation’s category from the global mean. If the target is binomial, the encoder returns regularized log odds per category.

In comparison to JamesSteinEstimator, this encoder utilizes generalized linear mixed models from statsmodels library.

Note: This is an alpha implementation. The API of the method may change in the future.

Parameters
  • verbose (int) – integer indicating verbosity of the output. 0 for none.

  • cols (list) – a list of columns to encode, if None, all string columns will be encoded.

  • drop_invariant (bool) – boolean for whether or not to drop encoded columns with 0 variance.

  • return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).

  • handle_missing (str) – options are ‘return_nan’, ‘error’ and ‘value’, defaults to ‘value’, which returns 0.

  • handle_unknown (str) – options are ‘return_nan’, ‘error’ and ‘value’, defaults to ‘value’, which returns 0.

  • randomized (bool,) – adds normal (Gaussian) distribution noise into training data in order to decrease overfitting (testing data are untouched).

  • sigma (float) – standard deviation (spread or “width”) of the normal distribution.

  • binomial_target (bool) – if True, the target must be binomial with values {0, 1} and Binomial mixed model is used. If False, the target must be continuous and Linear mixed model is used. If None (the default), a heuristic is applied to estimate the target type.

Example

>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target > 22.5
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = GLMMEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(13)
memory usage: 51.5 KB
None

References

1

Data Analysis Using Regression and Multilevel/Hierarchical Models, page 253, from

https://faculty.psau.edu.sa/filedownload/doc-12-pdf-a1997d0d31f84d13c1cdc44ac39a8f2c-original.pdf

fit(X, y, **kwargs)[source]

Fit encoder according to X and binary y.

Parameters
  • X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like, shape = [n_samples]) – Binary target values.

Returns

self – Returns self.

Return type

encoder

get_feature_names()[source]

Returns the names of all transformed / added columns.

Returns

feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!

Return type

list

transform(X, y=None, override_return_df=False)[source]

Perform the transformation to new categorical data.

When the data are used for model training, it is important to also pass the target in order to apply leave one out.

Parameters
  • X (array-like, shape = [n_samples, n_features]) –

  • y (array-like, shape = [n_samples] when transform by leave one out) – None, when transform without target information (such as transform test set)

Returns

p – Transformed values with encoding applied.

Return type

array, shape = [n_samples, n_numeric + N]

class ballet.eng.external.GaussianRandomProjection(n_components='auto', *, eps=0.1, random_state=None)[source]

Bases: sklearn.random_projection.BaseRandomProjection

Reduce dimensionality through Gaussian random projection.

The components of the random matrix are drawn from N(0, 1 / n_components).

Read more in the User Guide.

New in version 0.13.

Parameters
  • n_components (int or 'auto', default='auto') –

    Dimensionality of the target projection space.

    n_components can be automatically adjusted according to the number of samples in the dataset and the bound given by the Johnson-Lindenstrauss lemma. In that case the quality of the embedding is controlled by the eps parameter.

    It should be noted that the Johnson-Lindenstrauss lemma can yield very conservative estimates of the required number of components, as it makes no assumption on the structure of the dataset.

  • eps (float, default=0.1) –

    Parameter to control the quality of the embedding according to the Johnson-Lindenstrauss lemma when n_components is set to ‘auto’. The value should be strictly positive.

    Smaller values lead to better embedding and higher number of dimensions (n_components) in the target projection space.

  • random_state (int, RandomState instance or None, default=None) – Controls the pseudo random number generator used to generate the projection matrix at fit time. Pass an int for reproducible output across multiple function calls. See Glossary.

n_components_

Concrete number of components computed when n_components=”auto”.

Type

int

components_

Random matrix used for the projection.

Type

ndarray of shape (n_components, n_features)

n_features_in_

Number of features seen during fit.

New in version 0.24.

Type

int

feature_names_in_

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type

ndarray of shape (n_features_in_,)

See also

SparseRandomProjection

Reduce dimensionality through sparse random projection.

Examples

>>> import numpy as np
>>> from sklearn.random_projection import GaussianRandomProjection
>>> rng = np.random.RandomState(42)
>>> X = rng.rand(100, 10000)
>>> transformer = GaussianRandomProjection(random_state=rng)
>>> X_new = transformer.fit_transform(X)
>>> X_new.shape
(100, 3947)
class ballet.eng.external.HashingEncoder(max_process=0, max_sample=0, verbose=0, n_components=8, cols=None, drop_invariant=False, return_df=True, hash_method='md5')[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

A multivariate hashing implementation with configurable dimensionality/precision.

The advantage of this encoder is that it does not maintain a dictionary of observed categories. Consequently, the encoder does not grow in size and accepts new values during data scoring by design.

It is important to read about how max_process and max_sample work before setting them manually; an inappropriate setting slows down encoding.

The default value of ‘max_process’ is 1 on Windows because multiprocessing might cause issues; see https://github.com/scikit-learn-contrib/categorical-encoding/issues/215 and https://docs.python.org/2/library/multiprocessing.html?highlight=process#windows

Parameters
  • verbose (int) – integer indicating verbosity of the output. 0 for none.

  • cols (list) – a list of columns to encode, if None, all string columns will be encoded.

  • drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.

  • return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).

  • hash_method (str) – which hashing method to use. Any method from hashlib works.

  • max_process (int) – how many processes to use in transform(). Limited in range(1, 64). By default, it uses half of the logical CPUs. For example, 4C4T makes max_process=2, 4C8T makes max_process=4. Set it larger if you have a strong CPU. It is not recommended to set it larger than the count of logical CPUs, as it will actually slow down the encoding.

  • max_sample (int) – how many samples to encode by each process at a time. This setting is useful on low memory machines. By default, max_sample=(all samples num)/(max_process). For example, 4C8T CPU with 100,000 samples makes max_sample=25,000, 6C12T CPU with 100,000 samples makes max_sample=16,666. It is not recommended to set it larger than the default value.

  • n_components (int) – how many bits to use to represent the feature. By default we use 8 bits. For high-cardinality features, consider using up to 32 bits.

Example

>>> from category_encoders.hashing import HashingEncoder
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> y = bunch.target
>>> he = HashingEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> data = he.transform(X)
>>> print(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 19 columns):
col_0      506 non-null int64
col_1      506 non-null int64
col_2      506 non-null int64
col_3      506 non-null int64
col_4      506 non-null int64
col_5      506 non-null int64
col_6      506 non-null int64
col_7      506 non-null int64
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(11), int64(8)
memory usage: 75.2 KB
None

References

1

Feature Hashing for Large Scale Multitask Learning, from

https://alex.smola.org/papers/2009/Weinbergeretal09.pdf

2

Don’t be tricked by the Hashing Trick, from

https://booking.ai/dont-be-tricked-by-the-hashing-trick-192a6aae3087

fit(X, y=None, **kwargs)[source]

Fit encoder according to X and y.

Parameters
  • X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like, shape = [n_samples]) – Target values.

Returns

self – Returns self.

Return type

encoder

get_feature_names()[source]

Returns the names of all transformed / added columns.

Returns

feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!

Return type

list

static hashing_trick(X_in, hashing_method='md5', N=2, cols=None, make_copy=False)[source]

A basic hashing implementation with configurable dimensionality/precision

Performs the hashing trick on a pandas dataframe, X, using the hashing method from hashlib identified by hashing_method. The number of output dimensions (N), and columns to hash (cols) are also configurable.
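
A minimal sketch using the signature documented below (toy data; column names are illustrative only):

>>> import pandas as pd
>>> from ballet.eng.external import HashingEncoder
>>> X = pd.DataFrame({'colour': ['blue', 'red', 'green'], 'size': [1, 2, 3]})
>>> hashed = HashingEncoder.hashing_trick(X, hashing_method='md5', N=4, cols=['colour'])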

Parameters
  • X_in (pandas dataframe) – the dataframe to hash

  • hashing_method (string, optional) – the hashing method from hashlib to use

  • N (int, optional) – the number of output dimensions

  • cols (list, optional) – the columns to hash

  • make_copy (bool, optional) – whether to operate on a copy of X_in

Returns

out – A hashing encoded dataframe.

Return type

dataframe

References

1

Kilian Weinberger; Anirban Dasgupta; John Langford; Alex Smola; Josh Attenberg (2009). Feature Hashing for Large Scale Multitask Learning. Proc. ICML.

static require_data(self, data_lock, new_start, done_index, hashing_parts, cols, process_index)[source]
transform(X, override_return_df=False)[source]

Call _transform() if you want to use a single CPU with all samples.

class ballet.eng.external.HelmertEncoder(verbose=0, cols=None, mapping=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value')[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Helmert contrast coding for encoding categorical features.

Parameters
  • verbose (int) – integer indicating verbosity of the output. 0 for none.

  • cols (list) – a list of columns to encode, if None, all string columns will be encoded.

  • drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.

  • return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).

  • handle_unknown (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if indicator is used, an extra column will be added in if the transform matrix has unknown categories. This can cause unexpected changes in dimension in some cases.

  • handle_missing (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if indicator is used, an extra column will be added in if the transform matrix has nan values. This can cause unexpected changes in dimension in some cases.

Example

>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = HelmertEncoder(cols=['CHAS', 'RAD'], handle_unknown='value', handle_missing='value').fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 21 columns):
intercept    506 non-null int64
CRIM         506 non-null float64
ZN           506 non-null float64
INDUS        506 non-null float64
CHAS_0       506 non-null float64
NOX          506 non-null float64
RM           506 non-null float64
AGE          506 non-null float64
DIS          506 non-null float64
RAD_0        506 non-null float64
RAD_1        506 non-null float64
RAD_2        506 non-null float64
RAD_3        506 non-null float64
RAD_4        506 non-null float64
RAD_5        506 non-null float64
RAD_6        506 non-null float64
RAD_7        506 non-null float64
TAX          506 non-null float64
PTRATIO      506 non-null float64
B            506 non-null float64
LSTAT        506 non-null float64
dtypes: float64(20), int64(1)
memory usage: 83.1 KB
None

References

1

Contrast Coding Systems for Categorical Variables, from

https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/

2

Gregory Carey (2003). Coding Categorical Variables, from

http://psych.colorado.edu/~carey/Courses/PSYC5741/handouts/Coding%20Categorical%20Variables%202006-03-03.pdf

fit(X, y=None, **kwargs)[source]

Fit encoder according to X and y.

Parameters
  • X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like, shape = [n_samples]) – Target values.

Returns

self – Returns self.

Return type

encoder

static fit_helmert_coding(col, values, handle_missing, handle_unknown)[source]
get_feature_names()[source]

Returns the names of all transformed / added columns.

Returns

feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!

Return type

list

static helmert_coding(X_in, mapping)[source]
transform(X, override_return_df=False)[source]

Perform the transformation to new categorical data.

Parameters

X (array-like, shape = [n_samples, n_features]) –

Returns

p – Transformed values with encoding applied.

Return type

array, shape = [n_samples, n_numeric + N]

class ballet.eng.external.HorizonTransformer(horizon=2)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

fit(X, y=None)[source]
fit_transform(X, y=None)[source]

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns

X_new – Transformed array.

Return type

ndarray array of shape (n_samples, n_features_new)

inverse_transform(X, y=None)[source]
needs_refit = True
transform(X, y=None, refit=False)[source]
y_only = True
class ballet.eng.external.IntegratedTransformer(num_lags=1, pred_stride=1)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

fit(X, y=None)[source]
transform(X, y=None)[source]
class ballet.eng.external.JamesSteinEncoder(verbose=0, cols=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value', model='independent', random_state=None, randomized=False, sigma=0.05)[source]

Bases: sklearn.base.BaseEstimator, category_encoders.utils.TransformerWithTargetMixin

James-Stein estimator.

Supported targets: binomial and continuous. For polynomial target support, see PolynomialWrapper.

For feature value i, James-Stein estimator returns a weighted average of:

  1. The mean target value for the observed feature value i.

  2. The mean target value (regardless of the feature value).

This can be written as:

JS_i = (1-B)*mean(y_i) + B*mean(y)

The question is, what should be the weight B? If we put too much weight on the conditional mean value, we will overfit. If we put too much weight on the global mean, we will underfit. The canonical solution in machine learning is to perform cross-validation. However, Charles Stein came up with a closed-form solution to the problem. The intuition is: if the estimate of mean(y_i) is unreliable (y_i has high variance), we should put more weight on mean(y). Stein put it into an equation as:

B = var(y_i) / (var(y_i)+var(y))

The only remaining issue is that we do not know var(y), let alone var(y_i). Hence, we have to estimate the variances. But how can we reliably estimate the variances, when we already struggle with the estimation of the mean values?! There are multiple solutions:

  1. If we have the same count of observations for each feature value i and all y_i are close to each other, we can pretend that all var(y_i) are identical. This is called a pooled model.

  2. If the observation counts are not equal, it makes sense to replace the variances with squared standard errors, which penalize small observation counts:

SE^2 = var(y)/count(y)

This is called an independent model.

The James-Stein estimator has, however, one practical limitation - it was defined only for normal distributions. If you want to apply it to binary classification, which allows only values {0, 1}, it is better to first convert the mean target value from the bounded interval (0, 1) into an unbounded interval by replacing mean(y) with the log-odds ratio:

log-odds_ratio_i = log(mean(y_i)/mean(y_not_i))

This is called the binary model. The estimation of the parameters of this model is, however, tricky and sometimes fails fatally. In these situations, it is better to use the beta model, which generally delivers slightly worse accuracy than the binary model but does not suffer from fatal failures.
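
As an illustration only, the independent-model weighting above can be computed by hand for a single category (toy numbers; this is not part of the encoder's API):

>>> import pandas as pd
>>> df = pd.DataFrame({'colour': ['red'] * 4 + ['blue'] * 96,
...                    'y': [10.0, 12.0, 11.0, 13.0] + [5.0] * 96})
>>> y_i = df.loc[df['colour'] == 'red', 'y']
>>> se2_i = y_i.var() / len(y_i)      # squared standard error of the 'red' category mean
>>> se2 = df['y'].var() / len(df)     # squared standard error of the global mean
>>> B = se2_i / (se2_i + se2)         # shrinkage weight B from the formula above
>>> js_red = (1 - B) * y_i.mean() + B * df['y'].mean()   # JS estimate for 'red'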

Parameters
  • verbose (int) – integer indicating verbosity of the output. 0 for none.

  • cols (list) – a list of columns to encode, if None, all string columns will be encoded.

  • drop_invariant (bool) – boolean for whether or not to drop encoded columns with 0 variance.

  • return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).

  • handle_missing (str) – options are ‘return_nan’, ‘error’ and ‘value’, defaults to ‘value’, which returns the prior probability.

  • handle_unknown (str) – options are ‘return_nan’, ‘error’ and ‘value’, defaults to ‘value’, which returns the prior probability.

  • model (str) – options are ‘pooled’, ‘beta’, ‘binary’ and ‘independent’, defaults to ‘independent’.

  • randomized (bool,) – adds normal (Gaussian) distribution noise into training data in order to decrease overfitting (testing data are untouched).

  • sigma (float) – standard deviation (spread or “width”) of the normal distribution.

Example

>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = JamesSteinEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(13)
memory usage: 51.5 KB
None

References

1

Parametric empirical Bayes inference: Theory and applications, equations 1.19 & 1.20, from

https://www.jstor.org/stable/2287098

2

Empirical Bayes for multiple sample sizes, from

http://chris-said.io/2017/05/03/empirical-bayes-for-multiple-sample-sizes/

3

Shrinkage Estimation of Log-odds Ratios for Comparing Mobility Tables, from

https://journals.sagepub.com/doi/abs/10.1177/0081175015570097

4

Stein’s paradox and group rationality, from

http://www.philos.rug.nl/~romeyn/presentation/2017_romeijn_-_Paris_Stein.pdf

5

Stein’s Paradox in Statistics, from

http://statweb.stanford.edu/~ckirby/brad/other/Article1977.pdf

fit(X, y, **kwargs)[source]

Fit encoder according to X and binary y.

Parameters
  • X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like, shape = [n_samples]) – Binary target values.

Returns

self – Returns self.

Return type

encoder

get_feature_names()[source]

Returns the names of all transformed / added columns.

Returns

feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!

Return type

list

transform(X, y=None, override_return_df=False)[source]

Perform the transformation to new categorical data. When the data are used for model training, it is important to also pass the target in order to apply leave one out.

Parameters
  • X (array-like, shape = [n_samples, n_features]) –

  • y (array-like, shape = [n_samples] when transform by leave one out) – None, when transform without target information (such as transform test set)

Returns

p – Transformed values with encoding applied.

Return type

array, shape = [n_samples, n_numeric + N]

class ballet.eng.external.KBinsDiscretizer(n_bins=5, *, encode='onehot', strategy='quantile', dtype=None)[source]

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Bin continuous data into intervals.

Read more in the User Guide.

New in version 0.20.

Parameters
  • n_bins (int or array-like of shape (n_features,), default=5) – The number of bins to produce. Raises ValueError if n_bins < 2.

  • encode ({'onehot', 'onehot-dense', 'ordinal'}, default='onehot') –

    Method used to encode the transformed result.

    onehot

    Encode the transformed result with one-hot encoding and return a sparse matrix. Ignored features are always stacked to the right.

    onehot-dense

    Encode the transformed result with one-hot encoding and return a dense array. Ignored features are always stacked to the right.

    ordinal

    Return the bin identifier encoded as an integer value.

  • strategy ({'uniform', 'quantile', 'kmeans'}, default='quantile') –

    Strategy used to define the widths of the bins.

    uniform

    All bins in each feature have identical widths.

    quantile

    All bins in each feature have the same number of points.

    kmeans

    Values in each bin have the same nearest center of a 1D k-means cluster.

  • dtype ({np.float32, np.float64}, default=None) –

    The desired data-type for the output. If None, output dtype is consistent with input dtype. Only np.float32 and np.float64 are supported.

    New in version 0.24.

bin_edges_

The edges of each bin. Contain arrays of varying shapes (n_bins_, ) Ignored features will have empty arrays.

Type

ndarray of ndarray of shape (n_features,)

n_bins_

Number of bins per feature. Bins whose width are too small (i.e., <= 1e-8) are removed with a warning.

Type

ndarray of shape (n_features,), dtype=np.int_

n_features_in_

Number of features seen during fit.

New in version 0.24.

Type

int

feature_names_in_

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type

ndarray of shape (n_features_in_,)

See also

Binarizer

Class used to bin values as 0 or 1 based on a parameter threshold.

Notes

In bin edges for feature i, the first and last values are used only for inverse_transform. During transform, bin edges are extended to:

np.concatenate([-np.inf, bin_edges_[i][1:-1], np.inf])

You can combine KBinsDiscretizer with ColumnTransformer if you only want to preprocess part of the features.

KBinsDiscretizer might produce constant features (e.g., when encode = 'onehot' and certain bins do not contain any data). These features can be removed with feature selection algorithms (e.g., VarianceThreshold).

Examples

>>> from sklearn.preprocessing import KBinsDiscretizer
>>> X = [[-2, 1, -4,   -1],
...      [-1, 2, -3, -0.5],
...      [ 0, 3, -2,  0.5],
...      [ 1, 4, -1,    2]]
>>> est = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
>>> est.fit(X)
KBinsDiscretizer(...)
>>> Xt = est.transform(X)
>>> Xt  
array([[ 0., 0., 0., 0.],
       [ 1., 1., 1., 0.],
       [ 2., 2., 2., 1.],
       [ 2., 2., 2., 2.]])

Sometimes it may be useful to convert the data back into the original feature space. The inverse_transform function converts the binned data into the original feature space. Each value will be equal to the mean of the two bin edges.

>>> est.bin_edges_[0]
array([-2., -1.,  0.,  1.])
>>> est.inverse_transform(Xt)
array([[-1.5,  1.5, -3.5, -0.5],
       [-0.5,  2.5, -2.5, -0.5],
       [ 0.5,  3.5, -1.5,  0.5],
       [ 0.5,  3.5, -1.5,  1.5]])
fit(X, y=None)[source]

Fit the estimator.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Data to be discretized.

  • y (None) – Ignored. This parameter exists only for compatibility with Pipeline.

Returns

self – Returns the instance itself.

Return type

object

get_feature_names_out(input_features=None)[source]

Get output feature names.

Parameters

input_features (array-like of str or None, default=None) –

Input features.

  • If input_features is None, then feature_names_in_ is used as feature names in. If feature_names_in_ is not defined, then names are generated: [x0, x1, …, x(n_features_in_)].

  • If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.

Returns

feature_names_out – Transformed feature names.

Return type

ndarray of str objects

inverse_transform(Xt)[source]

Transform discretized data back to original feature space.

Note that this function does not regenerate the original data due to discretization rounding.

Parameters

Xt (array-like of shape (n_samples, n_features)) – Transformed data in the binned space.

Returns

Xinv – Data in the original feature space.

Return type

ndarray, dtype={np.float32, np.float64}

transform(X)[source]

Discretize the data.

Parameters

X (array-like of shape (n_samples, n_features)) – Data to be discretized.

Returns

Xt – Data in the binned space. Will be a sparse matrix if self.encode=’onehot’ and ndarray otherwise.

Return type

{ndarray, sparse matrix}, dtype={np.float32, np.float64}

class ballet.eng.external.KNNImputer(*, missing_values=nan, n_neighbors=5, weights='uniform', metric='nan_euclidean', copy=True, add_indicator=False)[source]

Bases: sklearn.impute._base._BaseImputer

Imputation for completing missing values using k-Nearest Neighbors.

Each sample’s missing values are imputed using the mean value from n_neighbors nearest neighbors found in the training set. Two samples are close if the features that neither is missing are close.

Read more in the User Guide.

New in version 0.22.

Parameters
  • missing_values (int, float, str, np.nan or None, default=np.nan) – The placeholder for the missing values. All occurrences of missing_values will be imputed. For pandas’ dataframes with nullable integer dtypes with missing values, missing_values should be set to np.nan, since pd.NA will be converted to np.nan.

  • n_neighbors (int, default=5) – Number of neighboring samples to use for imputation.

  • weights ({'uniform', 'distance'} or callable, default='uniform') –

    Weight function used in prediction. Possible values:

    • ’uniform’ : uniform weights. All points in each neighborhood are weighted equally.

    • ’distance’ : weight points by the inverse of their distance. in this case, closer neighbors of a query point will have a greater influence than neighbors which are further away.

    • callable : a user-defined function which accepts an array of distances, and returns an array of the same shape containing the weights.

  • metric ({'nan_euclidean'} or callable, default='nan_euclidean') –

    Distance metric for searching neighbors. Possible values:

    • ’nan_euclidean’

    • callable : a user-defined function which conforms to the definition of _pairwise_callable(X, Y, metric, **kwds). The function accepts two arrays, X and Y, and a missing_values keyword in kwds and returns a scalar distance value.

  • copy (bool, default=True) – If True, a copy of X will be created. If False, imputation will be done in-place whenever possible.

  • add_indicator (bool, default=False) – If True, a MissingIndicator transform will stack onto the output of the imputer’s transform. This allows a predictive estimator to account for missingness despite imputation. If a feature has no missing values at fit/train time, the feature won’t appear on the missing indicator even if there are missing values at transform/test time.

indicator_

Indicator used to add binary indicators for missing values. None if add_indicator is False.

Type

MissingIndicator

n_features_in_

Number of features seen during fit.

New in version 0.24.

Type

int

feature_names_in_

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type

ndarray of shape (n_features_in_,)

See also

SimpleImputer

Imputation transformer for completing missing values with simple strategies.

IterativeImputer

Multivariate imputer that estimates each feature from all the others.

References

  • Olga Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor Hastie, Robert Tibshirani, David Botstein and Russ B. Altman, Missing value estimation methods for DNA microarrays, BIOINFORMATICS Vol. 17 no. 6, 2001 Pages 520-525.

Examples

>>> import numpy as np
>>> from sklearn.impute import KNNImputer
>>> X = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]]
>>> imputer = KNNImputer(n_neighbors=2)
>>> imputer.fit_transform(X)
array([[1. , 2. , 4. ],
       [3. , 4. , 3. ],
       [5.5, 6. , 5. ],
       [8. , 8. , 7. ]])
fit(X, y=None)[source]

Fit the imputer on X.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Input data, where n_samples is the number of samples and n_features is the number of features.

  • y (Ignored) – Not used, present here for API consistency by convention.

Returns

self – The fitted KNNImputer class instance.

Return type

object

transform(X)[source]

Impute all missing values in X.

Parameters

X (array-like of shape (n_samples, n_features)) – The input data to complete.

Returns

X – The imputed dataset. n_output_features is the number of features that are not always missing during fit.

Return type

array-like of shape (n_samples, n_output_features)

class ballet.eng.external.LeaveOneOutEncoder(verbose=0, cols=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value', random_state=None, sigma=None)[source]

Bases: sklearn.base.BaseEstimator, category_encoders.utils.TransformerWithTargetMixin

Leave one out coding for categorical features.

This is very similar to target encoding but excludes the current row’s target when calculating the mean target for a level to reduce the effect of outliers.

Parameters
  • verbose (int) – integer indicating verbosity of the output. 0 for none.

  • cols (list) – a list of columns to encode, if None, all string columns will be encoded.

  • drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.

  • return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).

  • handle_missing (str) – options are ‘error’, ‘return_nan’ and ‘value’, defaults to ‘value’, which returns the target mean.

  • handle_unknown (str) – options are ‘error’, ‘return_nan’ and ‘value’, defaults to ‘value’, which returns the target mean.

  • sigma (float) – adds normal (Gaussian) distribution noise into training data in order to decrease overfitting (testing data are untouched). Sigma gives the standard deviation (spread or “width”) of the normal distribution. The optimal value is commonly between 0.05 and 0.6. The default is to not add noise, but that leads to significantly suboptimal results.

Example

>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = LeaveOneOutEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(13)
memory usage: 51.5 KB
None

References

  1. Strategies to encode categorical variables with many categories, from https://www.kaggle.com/c/caterpillar-tube-pricing/discussion/15748#143154.

fit(X, y, **kwargs)[source]

Fit encoder according to X and y.

Parameters
  • X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like, shape = [n_samples]) – Target values.

Returns

self – Returns self.

Return type

encoder

fit_column_map(series, y)[source]
fit_leave_one_out(X_in, y, cols=None)[source]
get_feature_names()[source]

Returns the names of all transformed / added columns.

Returns

feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!

Return type

list

transform(X, y=None, override_return_df=False)[source]

Perform the transformation to new categorical data.

Parameters
  • X (array-like, shape = [n_samples, n_features]) –

  • y (array-like, shape = [n_samples] when transform by leave one out) – None, when transform without target information (such as transform test set)

Returns

p – Transformed values with encoding applied.

Return type

array, shape = [n_samples, n_numeric + N]

transform_leave_one_out(X_in, y, mapping=None)[source]

Leave one out encoding uses a single column of floats to represent the means of the target variables.
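To illustrate the role of y in transform (a minimal sketch with made-up data): pass the target when encoding training data so each row's own target is excluded from its category mean, and omit it for unseen data.

import pandas as pd
from ballet.eng.external import LeaveOneOutEncoder

X_train = pd.DataFrame({'colour': ['blue', 'blue', 'red', 'red']})
y_train = pd.Series([1, 0, 1, 1])
X_test = pd.DataFrame({'colour': ['blue', 'red']})

enc = LeaveOneOutEncoder(cols=['colour']).fit(X_train, y_train)
X_train_enc = enc.transform(X_train, y_train)  # leave-one-out means, row by row
X_test_enc = enc.transform(X_test)             # plain per-category means learned during fit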

class ballet.eng.external.LogTransformer[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

fit(X, y=None)[source]
inverse_transform(X)[source]
needs_refit = False
transform(X, y=None, refit=False)[source]
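
A minimal usage sketch (assumptions: the transform applies an elementwise logarithm, as the name suggests, and inverse_transform reverses it):

import numpy as np
from ballet.eng.external import LogTransformer

X = np.array([[1.0], [10.0], [100.0]])
t = LogTransformer().fit(X)
Xt = t.transform(X)               # assumed: elementwise log of X
X_back = t.inverse_transform(Xt)  # assumed: inverse of the log, recovering X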
class ballet.eng.external.MEstimateEncoder(verbose=0, cols=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value', random_state=None, randomized=False, sigma=0.05, m=1.0)[source]

Bases: sklearn.base.BaseEstimator, category_encoders.utils.TransformerWithTargetMixin

M-probability estimate of likelihood.

Supported targets: binomial and continuous. For polynomial target support, see PolynomialWrapper.

This is a simplified version of target encoder, which goes under names like m-probability estimate or additive smoothing with known incidence rates. In comparison to target encoder, m-probability estimate has only one tunable parameter (m), while target encoder has two tunable parameters (min_samples_leaf and smoothing).

Parameters
  • verbose (int) – integer indicating verbosity of the output. 0 for none.

  • cols (list) – a list of columns to encode, if None, all string columns will be encoded.

  • drop_invariant (bool) – boolean for whether or not to drop encoded columns with 0 variance.

  • return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).

  • handle_missing (str) – options are ‘return_nan’, ‘error’ and ‘value’, defaults to ‘value’, which returns the prior probability.

  • handle_unknown (str) – options are ‘return_nan’, ‘error’ and ‘value’, defaults to ‘value’, which returns the prior probability.

  • randomized (bool,) – adds normal (Gaussian) distribution noise into training data in order to decrease overfitting (testing data are untouched).

  • sigma (float) – standard deviation (spread or “width”) of the normal distribution.

  • m (float) – this is the “m” in the m-probability estimate. A higher value of m results in stronger shrinking. m must be non-negative.

Example

>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target > 22.5
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = MEstimateEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(13)
memory usage: 51.5 KB
None

References

  1. A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems, equation 7, from https://dl.acm.org/citation.cfm?id=507538

  2. On estimating probabilities in tree pruning, equation 1, from https://link.springer.com/chapter/10.1007/BFb0017010

  3. Additive smoothing, from https://en.wikipedia.org/wiki/Additive_smoothing#Generalized_to_the_case_of_known_incidence_rates

fit(X, y, **kwargs)[source]

Fit encoder according to X and binary or continuous y.

Parameters
  • X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like, shape = [n_samples]) – Binary target values.

Returns

self – Returns self.

Return type

encoder

get_feature_names()[source]

Returns the names of all transformed / added columns.

Returns

feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!

Return type

list

transform(X, y=None, override_return_df=False)[source]

Perform the transformation to new categorical data.

When the data are used for model training, it is important to also pass the target in order to apply leave one out.

Parameters
  • X (array-like, shape = [n_samples, n_features]) –

  • y (array-like, shape = [n_samples] when transform by leave one out) – None, when transform without target information (such as transform test set)

Returns

p – Transformed values with encoding applied.

Return type

array, shape = [n_samples, n_numeric + N]

class ballet.eng.external.MathematicalCombination(variables_to_combine, math_operations=None, new_variables_names=None, missing_values='raise')[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

MathematicalCombination() applies basic mathematical operations to multiple features, returning one or more additional features as a result. That is, it sums, multiplies, takes the average, maximum, minimum or standard deviation of a group of variables, and returns the result into new variables.

For example, if we have the variables number_payments_first_quarter, number_payments_second_quarter, number_payments_third_quarter and number_payments_fourth_quarter, we can use MathematicalCombination() to calculate the total number of payments and mean number of payments as follows:

transformer = MathematicalCombination(
    variables_to_combine=[
        'number_payments_first_quarter',
        'number_payments_second_quarter',
        'number_payments_third_quarter',
        'number_payments_fourth_quarter'
    ],
    math_operations=[
        'sum',
        'mean'
    ],
    new_variables_names=[
        'total_number_payments',
        'mean_number_payments'
    ]
)

Xt = transformer.fit_transform(X)

The transformed X, Xt, will contain the additional features total_number_payments and mean_number_payments, plus the original set of variables.

Note: if some of the variables to combine have missing data and missing_values = ‘ignore’, the missing values will be ignored in the computation. For example, if variables A, B and C have values 10, 20 and NA, and we take the sum, the result will be A + B = 30.

Parameters
  • variables_to_combine (list) – The list of numerical variables to be combined.

  • math_operations (list, default=None) –

    The list of basic math operations to be used to create the new features.

    If None, all of [‘sum’, ‘prod’, ‘mean’, ‘std’, ‘max’, ‘min’] will be performed over the variables_to_combine. Alternatively, you can enter the list of operations to carry out.

    Each operation should be a string and must be one of the elements in [‘sum’, ‘prod’, ‘mean’, ‘std’, ‘max’, ‘min’].

    Each operation will result in a new variable that will be added to the transformed dataset.

  • new_variables_names (list, default=None) –

    Names of the newly created variables. You can enter a name or a list of names for the newly created features (recommended). You must enter one name for each mathematical transformation indicated in the math_operations parameter. That is, if you want to perform mean and sum of features, you should enter 2 new variable names. If you perform only mean of features, enter 1 variable name. Alternatively, if you chose to perform all mathematical transformations, enter 6 new variable names.

    The name of the variables indicated by the user should coincide with the order in which the mathematical operations are initialised in the transformer. That is, if you set math_operations = [‘mean’, ‘prod’], the first new variable name will be assigned to the mean of the variables and the second variable name to the product of the variables.

    If new_variables_names = None, the transformer will assign an arbitrary name to the newly created features, starting with the name of the mathematical operation, followed by the variables combined, separated by -.

  • missing_values (string, default='raise') – Indicates if missing values should be ignored or raised. If ‘raise’, the transformer will return an error if the datasets to fit or transform contain missing values. If ‘ignore’, missing data will be ignored when performing the calculations.

combination_dict_

Dictionary containing the mathematical operation to new variable name pairs.

math_operations_

List with the mathematical operations to be applied to the variables_to_combine.

n_features_in_

The number of features in the train set used in fit.

fit:

This transformer does not learn parameters.

transform:

Combine the variables with the mathematical operations.

fit_transform:

Fit to the data, then transform it.

Notes

Although the transformer in essence allows us to combine any feature with any of the allowed mathematical operations, its use is intended mostly for the creation of new features based on some domain knowledge. Typical examples within the financial sector are:

  • Sum debt across financial products, i.e., credit cards, to obtain the total debt.

  • Take the average payments to various financial products per month.

  • Find the minimum payment made in any one month.

In insurance, we can sum the damage to various parts of a car to obtain the total damage.
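
For instance, a minimal sketch with made-up column names, relying on the default naming described above (the exact auto-generated name format is determined by the library):

import pandas as pd
from ballet.eng.external import MathematicalCombination

X = pd.DataFrame({
    'debt_credit_card': [100.0, 200.0],
    'debt_mortgage': [900.0, 800.0],
})
combiner = MathematicalCombination(
    variables_to_combine=['debt_credit_card', 'debt_mortgage'],
    math_operations=['sum'],
)
Xt = combiner.fit_transform(X)
# With new_variables_names=None, the new column is named from the operation and the
# combined variables, e.g. along the lines of 'sum(debt_credit_card-debt_mortgage)'.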

fit(X, y=None)[source]

This transformer does not learn parameters.

Perform dataframe checks. Creates dictionary of operation to new feature name pairs.

Parameters
  • X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples. Can be the entire dataframe, not just the variables to transform.

  • y (pandas Series, or np.array. Defaults to None.) – It is not needed in this transformer. You can pass y or None.

Raises
  • TypeError

    • If the input is not a Pandas DataFrame

    • If any user provided variables in variables_to_combine are not numerical

  • ValueError – If the variable(s) contain null values when missing_values = raise

Returns

Return type

self

transform(X)[source]

Combine the variables with the mathematical operations.

Parameters

X (pandas dataframe of shape = [n_samples, n_features]) – The data to transform.

Raises
  • TypeError – If the input is not a Pandas DataFrame

  • ValueError

    • If the variable(s) contain null values when missing_values = raise

    • If the dataframe is not of the same size as that used in fit()

Returns

X – The dataframe with the original variables plus the new variables.

Return type

Pandas dataframe, shape = [n_samples, n_features + n_operations]

class ballet.eng.external.MaxAbsScaler(*, copy=True)[source]

Bases: sklearn.base._OneToOneFeatureMixin, sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Scale each feature by its maximum absolute value.

This estimator scales and translates each feature individually such that the maximal absolute value of each feature in the training set will be 1.0. It does not shift/center the data, and thus does not destroy any sparsity.

This scaler can also be applied to sparse CSR or CSC matrices.

New in version 0.17.

Parameters

copy (bool, default=True) – Set to False to perform inplace scaling and avoid a copy (if the input is already a numpy array).

scale_

Per feature relative scaling of the data.

New in version 0.17: scale_ attribute.

Type

ndarray of shape (n_features,)

max_abs_

Per feature maximum absolute value.

Type

ndarray of shape (n_features,)

n_features_in_

Number of features seen during fit.

New in version 0.24.

Type

int

feature_names_in_

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type

ndarray of shape (n_features_in_,)

n_samples_seen_

The number of samples processed by the estimator. Will be reset on new calls to fit, but increments across partial_fit calls.

Type

int

See also

maxabs_scale

Equivalent function without the estimator API.

Notes

NaNs are treated as missing values: disregarded in fit, and maintained in transform.

For a comparison of the different scalers, transformers, and normalizers, see examples/preprocessing/plot_all_scaling.py.

Examples

>>> from sklearn.preprocessing import MaxAbsScaler
>>> X = [[ 1., -1.,  2.],
...      [ 2.,  0.,  0.],
...      [ 0.,  1., -1.]]
>>> transformer = MaxAbsScaler().fit(X)
>>> transformer
MaxAbsScaler()
>>> transformer.transform(X)
array([[ 0.5, -1. ,  1. ],
       [ 1. ,  0. ,  0. ],
       [ 0. ,  1. , -0.5]])
fit(X, y=None)[source]

Compute the maximum absolute value to be used for later scaling.

Parameters
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to compute the per-feature minimum and maximum used for later scaling along the features axis.

  • y (None) – Ignored.

Returns

self – Fitted scaler.

Return type

object

inverse_transform(X)[source]

Scale back the data to the original representation.

Parameters

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data that should be transformed back.

Returns

X_tr – Transformed array.

Return type

{ndarray, sparse matrix} of shape (n_samples, n_features)

partial_fit(X, y=None)[source]

Online computation of max absolute value of X for later scaling.

All of X is processed as a single batch. This is intended for cases when fit() is not feasible due to a very large number of n_samples or because X is read from a continuous stream.

Parameters
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to compute the mean and standard deviation used for later scaling along the features axis.

  • y (None) – Ignored.

Returns

self – Fitted scaler.

Return type

object
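
For example, a minimal sketch of incremental fitting over batches (toy data):

import numpy as np
from ballet.eng.external import MaxAbsScaler

scaler = MaxAbsScaler()
# Each call updates the running per-feature maximum absolute value.
for batch in (np.array([[1.0, -2.0]]), np.array([[-4.0, 0.5]])):
    scaler.partial_fit(batch)
scaler.scale_                             # array([4., 2.])
scaler.transform(np.array([[2.0, 1.0]]))  # array([[0.5, 0.5]])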

transform(X)[source]

Scale the data.

Parameters

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data that should be scaled.

Returns

X_tr – Transformed array.

Return type

{ndarray, sparse matrix} of shape (n_samples, n_features)

class ballet.eng.external.MeanEncoder(variables=None, ignore_format=False)[source]

Bases: feature_engine.encoding.base_encoder.BaseCategoricalTransformer

The MeanEncoder() replaces categories by the mean value of the target for each category.

For example in the variable colour, if the mean of the target for blue, red and grey is 0.5, 0.8 and 0.1 respectively, blue is replaced by 0.5, red by 0.8 and grey by 0.1.

The encoder will encode only categorical variables by default (type ‘object’ or ‘categorical’). You can pass a list of variables to encode. Alternatively, the encoder will find and encode all categorical variables (type ‘object’ or ‘categorical’).

With ignore_format=True you have the option to encode numerical variables as well. The procedure is identical, you can either enter the list of variables to encode, or the transformer will automatically select all variables.

The encoder first maps the categories to the numbers for each variable (fit). The encoder then replaces the categories with those numbers (transform).

Parameters
  • variables (list, default=None) – The list of categorical variables that will be encoded. If None, the encoder will find and transform all variables of type object or categorical by default. You can also make the transformer accept numerical variables, see the next parameter.

  • ignore_format (bool, default=False) – Whether the format in which the categorical variables are cast should be ignored. If false, the encoder will automatically select variables of type object or categorical, or check that the variables entered by the user are of type object or categorical. If True, the encoder will select all variables or accept all variables entered by the user, including those cast as numeric.

encoder_dict_

Dictionary with the target mean value per category per variable.

variables_

The group of variables that will be transformed.

n_features_in_

The number of features in the train set used in fit.

fit:

Learn the target mean value per category, per variable.

transform:

Encode the categories to numbers.

fit_transform:

Fit to the data, then transform it.

inverse_transform:

Encode the numbers into the original categories.

Notes

NaN values are introduced when encoding categories that were not present in the training dataset. If this happens, try grouping infrequent categories using the RareLabelEncoder().

See also

feature_engine.encoding.RareLabelEncoder

References

  1. Micci-Barreca D. “A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems”. ACM SIGKDD Explorations Newsletter, 2001. https://dl.acm.org/citation.cfm?id=507538
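
Example (an illustrative sketch with made-up data):

import pandas as pd
from ballet.eng.external import MeanEncoder

X = pd.DataFrame({'colour': ['blue', 'blue', 'red', 'red', 'grey', 'grey']})
y = pd.Series([1, 0, 1, 1, 0, 0])
enc = MeanEncoder().fit(X, y)
Xt = enc.transform(X)   # blue -> 0.5, red -> 1.0, grey -> 0.0
enc.encoder_dict_       # per-variable mapping of category to target mean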

fit(X, y)[source]

Learn the mean value of the target for each category of the variable.

Parameters
  • X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples. Can be the entire dataframe, not just the variables to be encoded.

  • y (pandas series) – The target.

Raises
  • TypeError

    • If the input is not a Pandas DataFrame

    • If the user enters non-categorical variables (unless ignore_format is True)

  • ValueError

    • If there are no categorical variables in the df or the df is empty

    • If the variable(s) contain null values

Returns

Return type

self

inverse_transform(X)[source]

Convert the encoded variable back to the original values.

Parameters

X (pandas dataframe of shape = [n_samples, n_features]) – The transformed dataframe.

Raises
  • TypeError – If the input is not a Pandas DataFrame

  • ValueError

    • If the variable(s) contain null values

    • If the df has a different number of features than the df used in fit()

Returns

X – The un-transformed dataframe, with the categorical variables containing the original values.

Return type

pandas dataframe of shape = [n_samples, n_features]

transform(X)[source]

Replace categories with the learned parameters.

Parameters

X (pandas dataframe of shape = [n_samples, n_features]) – The dataset to transform.

Raises
  • TypeError – If the input is not a Pandas DataFrame

  • ValueError

    • If the variable(s) contain null values

    • If the df has a different number of features than the df used in fit()

  • Warning – If, after encoding, NaN values were introduced.

Returns

X – The dataframe containing the categories replaced by numbers.

Return type

pandas dataframe of shape = [n_samples, n_features]

class ballet.eng.external.MeanMedianImputer(imputation_method='median', variables=None)[source]

Bases: feature_engine.imputation.base_imputer.BaseImputer

The MeanMedianImputer() replaces missing data by the mean or median value of the variable. It works only with numerical variables.

You can pass a list of variables to be imputed. Alternatively, the MeanMedianImputer() will automatically select all variables of type numeric in the training set.

The imputer:

  • first calculates the mean / median values of the variables (fit).

  • Then replaces the missing data with the estimated mean / median (transform).

Parameters
  • imputation_method (str, default=median) – Desired method of imputation. Can take ‘mean’ or ‘median’.

  • variables (list, default=None) – The list of variables to be imputed. If None, the imputer will select all variables of type numeric.

imputer_dict_

Dictionary with the mean or median values per variable.

variables_

The group of variables that will be transformed.

n_features_in_

The number of features in the train set used in fit.

fit:

Learn the mean or median values.

transform:

Impute missing data.

fit_transform:

Fit to the data, then transform it.
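
Example (an illustrative sketch with made-up data):

import numpy as np
import pandas as pd
from ballet.eng.external import MeanMedianImputer

X = pd.DataFrame({'age': [20.0, 30.0, np.nan, 40.0],
                  'income': [1.0, np.nan, 3.0, 5.0]})
imputer = MeanMedianImputer(imputation_method='median').fit(X)
Xt = imputer.transform(X)   # NaN in 'age' -> 30.0, NaN in 'income' -> 3.0
imputer.imputer_dict_       # the learned median per variable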

fit(X, y=None)[source]

Learn the mean or median values.

Parameters
  • X (pandas dataframe of shape = [n_samples, n_features]) – The training dataset.

  • y (pandas series or None, default=None) – y is not needed in this imputation. You can pass None or y.

Raises
  • TypeError

    • If the input is not a Pandas DataFrame

    • If any of the user provided variables are not numerical

  • ValueError – If there are no numerical variables in the df or the df is empty

Returns

Return type

self

transform(X)[source]

Replace missing data with the learned parameters.

Parameters

X (pandas dataframe of shape = [n_samples, n_features]) – The data to be transformed.

Raises
  • TypeError – If the input is not a Pandas DataFrame

  • ValueError – If the dataframe has different number of features than the df used in fit()

Returns

X – The dataframe without missing values in the selected variables.

Return type

pandas dataframe of shape = [n_samples, n_features]

class ballet.eng.external.MinMaxScaler(feature_range=(0, 1), *, copy=True, clip=False)[source]

Bases: sklearn.base._OneToOneFeatureMixin, sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Transform features by scaling each feature to a given range.

This estimator scales and translates each feature individually such that it is in the given range on the training set, e.g. between zero and one.

The transformation is given by:

X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min

where min, max = feature_range.

This transformation is often used as an alternative to zero mean, unit variance scaling.

Read more in the User Guide.

Parameters
  • feature_range (tuple (min, max), default=(0, 1)) – Desired range of transformed data.

  • copy (bool, default=True) – Set to False to perform inplace row normalization and avoid a copy (if the input is already a numpy array).

  • clip (bool, default=False) –

    Set to True to clip transformed values of held-out data to provided feature range.

    New in version 0.24.

min_

Per feature adjustment for minimum. Equivalent to min - X.min(axis=0) * self.scale_

Type

ndarray of shape (n_features,)

scale_

Per feature relative scaling of the data. Equivalent to (max - min) / (X.max(axis=0) - X.min(axis=0))

New in version 0.17: scale_ attribute.

Type

ndarray of shape (n_features,)

data_min_

Per feature minimum seen in the data

New in version 0.17: data_min_

Type

ndarray of shape (n_features,)

data_max_

Per feature maximum seen in the data

New in version 0.17: data_max_

Type

ndarray of shape (n_features,)

data_range_

Per feature range (data_max_ - data_min_) seen in the data

New in version 0.17: data_range_

Type

ndarray of shape (n_features,)

n_features_in_

Number of features seen during fit.

New in version 0.24.

Type

int

n_samples_seen_

The number of samples processed by the estimator. It will be reset on new calls to fit, but increments across partial_fit calls.

Type

int

feature_names_in_

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type

ndarray of shape (n_features_in_,)

See also

minmax_scale

Equivalent function without the estimator API.

Notes

NaNs are treated as missing values: disregarded in fit, and maintained in transform.

For a comparison of the different scalers, transformers, and normalizers, see examples/preprocessing/plot_all_scaling.py.

Examples

>>> from sklearn.preprocessing import MinMaxScaler
>>> data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
>>> scaler = MinMaxScaler()
>>> print(scaler.fit(data))
MinMaxScaler()
>>> print(scaler.data_max_)
[ 1. 18.]
>>> print(scaler.transform(data))
[[0.   0.  ]
 [0.25 0.25]
 [0.5  0.5 ]
 [1.   1.  ]]
>>> print(scaler.transform([[2, 2]]))
[[1.5 0. ]]
fit(X, y=None)[source]

Compute the minimum and maximum to be used for later scaling.

Parameters
  • X (array-like of shape (n_samples, n_features)) – The data used to compute the per-feature minimum and maximum used for later scaling along the features axis.

  • y (None) – Ignored.

Returns

self – Fitted scaler.

Return type

object

inverse_transform(X)[source]

Undo the scaling of X according to feature_range.

Parameters

X (array-like of shape (n_samples, n_features)) – Input data that will be transformed. It cannot be sparse.

Returns

Xt – Transformed data.

Return type

ndarray of shape (n_samples, n_features)

partial_fit(X, y=None)[source]

Online computation of min and max on X for later scaling.

All of X is processed as a single batch. This is intended for cases when fit() is not feasible due to a very large number of n_samples or because X is read from a continuous stream.

Parameters
  • X (array-like of shape (n_samples, n_features)) – The data used to compute the mean and standard deviation used for later scaling along the features axis.

  • y (None) – Ignored.

Returns

self – Fitted scaler.

Return type

object

transform(X)[source]

Scale features of X according to feature_range.

Parameters

X (array-like of shape (n_samples, n_features)) – Input data that will be transformed.

Returns

Xt – Transformed data.

Return type

ndarray of shape (n_samples, n_features)

class ballet.eng.external.MissingIndicator(*, missing_values=nan, features='missing-only', sparse='auto', error_on_new=True)[source]

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Binary indicators for missing values.

Note that this component typically should not be used in a vanilla Pipeline consisting of transformers and a classifier, but rather could be added using a FeatureUnion or ColumnTransformer.

Read more in the User Guide.

New in version 0.20.

Parameters
  • missing_values (int, float, str, np.nan or None, default=np.nan) – The placeholder for the missing values. All occurrences of missing_values will be imputed. For pandas’ dataframes with nullable integer dtypes with missing values, missing_values should be set to np.nan, since pd.NA will be converted to np.nan.

  • features ({'missing-only', 'all'}, default='missing-only') –

    Whether the imputer mask should represent all or a subset of features.

    • If ‘missing-only’ (default), the imputer mask will only represent features containing missing values during fit time.

    • If ‘all’, the imputer mask will represent all features.

  • sparse (bool or 'auto', default='auto') –

    Whether the imputer mask format should be sparse or dense.

    • If ‘auto’ (default), the imputer mask will be of same type as input.

    • If True, the imputer mask will be a sparse matrix.

    • If False, the imputer mask will be a numpy array.

  • error_on_new (bool, default=True) – If True, transform() will raise an error when there are features with missing values that have no missing values in fit(). This is applicable only when features=’missing-only’.

features_

The features indices which will be returned when calling transform(). They are computed during fit(). If features=’all’, features_ is equal to range(n_features).

Type

ndarray of shape (n_missing_features,) or (n_features,)

n_features_in_

Number of features seen during fit.

New in version 0.24.

Type

int

feature_names_in_

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type

ndarray of shape (n_features_in_,)

See also

SimpleImputer

Univariate imputation of missing values.

IterativeImputer

Multivariate imputation of missing values.

Examples

>>> import numpy as np
>>> from sklearn.impute import MissingIndicator
>>> X1 = np.array([[np.nan, 1, 3],
...                [4, 0, np.nan],
...                [8, 1, 0]])
>>> X2 = np.array([[5, 1, np.nan],
...                [np.nan, 2, 3],
...                [2, 4, 0]])
>>> indicator = MissingIndicator()
>>> indicator.fit(X1)
MissingIndicator()
>>> X2_tr = indicator.transform(X2)
>>> X2_tr
array([[False,  True],
       [ True, False],
       [False, False]])
fit(X, y=None)[source]

Fit the transformer on X.

Parameters
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Input data, where n_samples is the number of samples and n_features is the number of features.

  • y (Ignored) – Not used, present for API consistency by convention.

Returns

self – Fitted estimator.

Return type

object

fit_transform(X, y=None)[source]

Generate missing values indicator for X.

Parameters
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The input data to complete.

  • y (Ignored) – Not used, present for API consistency by convention.

Returns

Xt – The missing indicator for input data. The data type of Xt will be boolean.

Return type

{ndarray, sparse matrix} of shape (n_samples, n_features) or (n_samples, n_features_with_missing)

transform(X)[source]

Generate missing values indicator for X.

Parameters

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The input data to complete.

Returns

Xt – The missing indicator for input data. The data type of Xt will be boolean.

Return type

{ndarray, sparse matrix} of shape (n_samples, n_features) or (n_samples, n_features_with_missing)

class ballet.eng.external.Normalizer(norm='l2', *, copy=True)[source]

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Normalize samples individually to unit norm.

Each sample (i.e. each row of the data matrix) with at least one non zero component is rescaled independently of other samples so that its norm (l1, l2 or inf) equals one.

This transformer is able to work both with dense numpy arrays and scipy.sparse matrix (use CSR format if you want to avoid the burden of a copy / conversion).

Scaling inputs to unit norms is a common operation for text classification or clustering for instance. For instance the dot product of two l2-normalized TF-IDF vectors is the cosine similarity of the vectors and is the base similarity metric for the Vector Space Model commonly used by the Information Retrieval community.

Read more in the User Guide.

Parameters
  • norm ({'l1', 'l2', 'max'}, default='l2') – The norm to use to normalize each non zero sample. If norm=’max’ is used, values will be rescaled by the maximum of the absolute values.

  • copy (bool, default=True) – Set to False to perform inplace row normalization and avoid a copy (if the input is already a numpy array or a scipy.sparse CSR matrix).

n_features_in_

Number of features seen during fit.

New in version 0.24.

Type

int

feature_names_in_

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type

ndarray of shape (n_features_in_,)

See also

normalize

Equivalent function without the estimator API.

Notes

This estimator is stateless (besides constructor parameters), the fit method does nothing but is useful when used in a pipeline.

For a comparison of the different scalers, transformers, and normalizers, see examples/preprocessing/plot_all_scaling.py.

Examples

>>> from sklearn.preprocessing import Normalizer
>>> X = [[4, 1, 2, 2],
...      [1, 3, 9, 3],
...      [5, 7, 5, 1]]
>>> transformer = Normalizer().fit(X)  # fit does nothing.
>>> transformer
Normalizer()
>>> transformer.transform(X)
array([[0.8, 0.2, 0.4, 0.4],
       [0.1, 0.3, 0.9, 0.3],
       [0.5, 0.7, 0.5, 0.1]])
fit(X, y=None)[source]

Do nothing and return the estimator unchanged.

This method is just there to implement the usual API and hence work in pipelines.

Parameters
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data to estimate the normalization parameters.

  • y (Ignored) – Not used, present here for API consistency by convention.

Returns

self – Fitted transformer.

Return type

object

transform(X, copy=None)[source]

Scale each non zero row of X to unit norm.

Parameters
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data to normalize, row by row. scipy.sparse matrices should be in CSR format to avoid an un-necessary copy.

  • copy (bool, default=None) – Copy the input X or not.

Returns

X_tr – Transformed array.

Return type

{ndarray, sparse matrix} of shape (n_samples, n_features)

class ballet.eng.external.OneHotEncoder(*, categories='auto', drop=None, sparse=True, dtype=<class 'numpy.float64'>, handle_unknown='error')[source]

Bases: sklearn.preprocessing._encoders._BaseEncoder

Encode categorical features as a one-hot numeric array.

The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features. The features are encoded using a one-hot (aka ‘one-of-K’ or ‘dummy’) encoding scheme. This creates a binary column for each category and returns a sparse matrix or dense array (depending on the sparse parameter)

By default, the encoder derives the categories based on the unique values in each feature. Alternatively, you can also specify the categories manually.

This encoding is needed for feeding categorical data to many scikit-learn estimators, notably linear models and SVMs with the standard kernels.

Note: a one-hot encoding of y labels should use a LabelBinarizer instead.

Read more in the User Guide.

Parameters
  • categories ('auto' or a list of array-like, default='auto') –

    Categories (unique values) per feature:

    • ’auto’ : Determine categories automatically from the training data.

    • list : categories[i] holds the categories expected in the ith column. The passed categories should not mix strings and numeric values within a single feature, and should be sorted in case of numeric values.

    The used categories can be found in the categories_ attribute.

    New in version 0.20.

  • drop ({'first', 'if_binary'} or a array-like of shape (n_features,), default=None) –

    Specifies a methodology to use to drop one of the categories per feature. This is useful in situations where perfectly collinear features cause problems, such as when feeding the resulting data into a neural network or an unregularized regression.

    However, dropping one category breaks the symmetry of the original representation and can therefore induce a bias in downstream models, for instance for penalized linear classification or regression models.

    • None : retain all features (the default).

    • ’first’ : drop the first category in each feature. If only one category is present, the feature will be dropped entirely.

    • ’if_binary’ : drop the first category in each feature with two categories. Features with 1 or more than 2 categories are left intact.

    • array : drop[i] is the category in feature X[:, i] that should be dropped.

    New in version 0.21: The parameter drop was added in 0.21.

    Changed in version 0.23: The option drop=’if_binary’ was added in 0.23.

  • sparse (bool, default=True) – Will return sparse matrix if set True else will return an array.

  • dtype (number type, default=float) – Desired dtype of output.

  • handle_unknown ({'error', 'ignore'}, default='error') – Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.

categories_

The categories of each feature determined during fitting (in order of the features in X and corresponding with the output of transform). This includes the category specified in drop (if any).

Type

list of arrays

drop_idx_
  • drop_idx_[i] is the index in categories_[i] of the category to be dropped for each feature.

  • drop_idx_[i] = None if no category is to be dropped from the feature with index i, e.g. when drop=’if_binary’ and the feature isn’t binary.

  • drop_idx_ = None if all the transformed features will be retained.

Changed in version 0.23: Added the possibility to contain None values.

Type

array of shape (n_features,)

n_features_in_

Number of features seen during fit.

New in version 1.0.

Type

int

feature_names_in_

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type

ndarray of shape (n_features_in_,)

See also

OrdinalEncoder

Performs an ordinal (integer) encoding of the categorical features.

sklearn.feature_extraction.DictVectorizer

Performs a one-hot encoding of dictionary items (also handles string-valued features).

sklearn.feature_extraction.FeatureHasher

Performs an approximate one-hot encoding of dictionary items or strings.

LabelBinarizer

Binarizes labels in a one-vs-all fashion.

MultiLabelBinarizer

Transforms between iterable of iterables and a multilabel format, e.g. a (samples x classes) binary matrix indicating the presence of a class label.

Examples

Given a dataset with two features, we let the encoder find the unique values per feature and transform the data to a binary one-hot encoding.

>>> from sklearn.preprocessing import OneHotEncoder

One can discard categories not seen during fit:

>>> enc = OneHotEncoder(handle_unknown='ignore')
>>> X = [['Male', 1], ['Female', 3], ['Female', 2]]
>>> enc.fit(X)
OneHotEncoder(handle_unknown='ignore')
>>> enc.categories_
[array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]
>>> enc.transform([['Female', 1], ['Male', 4]]).toarray()
array([[1., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0.]])
>>> enc.inverse_transform([[0, 1, 1, 0, 0], [0, 0, 0, 1, 0]])
array([['Male', 1],
       [None, 2]], dtype=object)
>>> enc.get_feature_names_out(['gender', 'group'])
array(['gender_Female', 'gender_Male', 'group_1', 'group_2', 'group_3'], ...)

One can always drop the first column for each feature:

>>> drop_enc = OneHotEncoder(drop='first').fit(X)
>>> drop_enc.categories_
[array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]
>>> drop_enc.transform([['Female', 1], ['Male', 2]]).toarray()
array([[0., 0., 0.],
       [1., 1., 0.]])

Or drop a column for feature only having 2 categories:

>>> drop_binary_enc = OneHotEncoder(drop='if_binary').fit(X)
>>> drop_binary_enc.transform([['Female', 1], ['Male', 2]]).toarray()
array([[0., 1., 0., 0.],
       [1., 0., 1., 0.]])
fit(X, y=None)[source]

Fit OneHotEncoder to X.

Parameters
  • X (array-like of shape (n_samples, n_features)) – The data to determine the categories of each feature.

  • y (None) – Ignored. This parameter exists only for compatibility with Pipeline.

Returns

Fitted encoder.

Return type

self

fit_transform(X, y=None)[source]

Fit OneHotEncoder to X, then transform X.

Equivalent to fit(X).transform(X) but more convenient.

Parameters
  • X (array-like of shape (n_samples, n_features)) – The data to encode.

  • y (None) – Ignored. This parameter exists only for compatibility with Pipeline.

Returns

X_out – Transformed input. If sparse=True, a sparse matrix will be returned.

Return type

{ndarray, sparse matrix} of shape (n_samples, n_encoded_features)

get_feature_names(input_features=None)[source]

DEPRECATED: get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.

Return feature names for output features.

Parameters

input_features (list of str of shape (n_features,)) – String names for input features if available. By default, “x0”, “x1”, … “xn_features” is used.

Returns

output_feature_names – Array of feature names.

Return type

ndarray of shape (n_output_features,)

get_feature_names_out(input_features=None)[source]

Get output feature names for transformation.

Parameters

input_features (array-like of str or None, default=None) –

Input features.

  • If input_features is None, then feature_names_in_ is used as feature names in. If feature_names_in_ is not defined, then names are generated: [x0, x1, …, x(n_features_in_ - 1)].

  • If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.

Returns

feature_names_out – Transformed feature names.

Return type

ndarray of str objects

inverse_transform(X)[source]

Convert the data back to the original representation.

When unknown categories are encountered (all zeros in the one-hot encoding), None is used to represent this category. If the feature with the unknown category has a dropped category, the dropped category will be its inverse.

Parameters

X ({array-like, sparse matrix} of shape (n_samples, n_encoded_features)) – The transformed data.

Returns

X_tr – Inverse transformed array.

Return type

ndarray of shape (n_samples, n_features)

transform(X)[source]

Transform X using one-hot encoding.

Parameters

X (array-like of shape (n_samples, n_features)) – The data to encode.

Returns

X_out – Transformed input. If sparse=True, a sparse matrix will be returned.

Return type

{ndarray, sparse matrix} of shape (n_samples, n_encoded_features)

class ballet.eng.external.OrdinalEncoder(*, categories='auto', dtype=<class 'numpy.float64'>, handle_unknown='error', unknown_value=None)[source]

Bases: sklearn.preprocessing._encoders._BaseEncoder

Encode categorical features as an integer array.

The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features. The features are converted to ordinal integers. This results in a single column of integers (0 to n_categories - 1) per feature.

Read more in the User Guide.

New in version 0.20.

Parameters
  • categories ('auto' or a list of array-like, default='auto') –

    Categories (unique values) per feature:

    • ’auto’ : Determine categories automatically from the training data.

    • list : categories[i] holds the categories expected in the ith column. The passed categories should not mix strings and numeric values, and should be sorted in case of numeric values.

    The used categories can be found in the categories_ attribute.

  • dtype (number type, default np.float64) – Desired dtype of output.

  • handle_unknown ({'error', 'use_encoded_value'}, default='error') –

    When set to ‘error’ an error will be raised in case an unknown categorical feature is present during transform. When set to ‘use_encoded_value’, the encoded value of unknown categories will be set to the value given for the parameter unknown_value. In inverse_transform(), an unknown category will be denoted as None.

    New in version 0.24.

  • unknown_value (int or np.nan, default=None) –

    When the parameter handle_unknown is set to ‘use_encoded_value’, this parameter is required and will set the encoded value of unknown categories. It has to be distinct from the values used to encode any of the categories in fit. If set to np.nan, the dtype parameter must be a float dtype.

    New in version 0.24.

categories_

The categories of each feature determined during fit (in order of the features in X and corresponding with the output of transform). This does not include categories that weren’t seen during fit.

Type

list of arrays

n_features_in_

Number of features seen during fit.

New in version 1.0.

Type

int

feature_names_in_

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type

ndarray of shape (n_features_in_,)

See also

OneHotEncoder

Performs a one-hot encoding of categorical features.

LabelEncoder

Encodes target labels with values between 0 and n_classes-1.

Examples

Given a dataset with two features, we let the encoder find the unique values per feature and transform the data to an ordinal encoding.

>>> from sklearn.preprocessing import OrdinalEncoder
>>> enc = OrdinalEncoder()
>>> X = [['Male', 1], ['Female', 3], ['Female', 2]]
>>> enc.fit(X)
OrdinalEncoder()
>>> enc.categories_
[array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]
>>> enc.transform([['Female', 3], ['Male', 1]])
array([[0., 2.],
       [1., 0.]])
>>> enc.inverse_transform([[1, 0], [0, 1]])
array([['Male', 1],
       ['Female', 2]], dtype=object)
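
When unseen categories are expected at transform time, handle_unknown='use_encoded_value' maps them to a sentinel instead of raising (a short self-contained sketch):

from ballet.eng.external import OrdinalEncoder

X = [['Male', 1], ['Female', 3], ['Female', 2]]
enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1).fit(X)
enc.transform([['Female', 4]])   # 4 was not seen during fit, so it is encoded as -1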
fit(X, y=None)[source]

Fit the OrdinalEncoder to X.

Parameters
  • X (array-like of shape (n_samples, n_features)) – The data to determine the categories of each feature.

  • y (None) – Ignored. This parameter exists only for compatibility with Pipeline.

Returns

self – Fitted encoder.

Return type

object

inverse_transform(X)[source]

Convert the data back to the original representation.

Parameters

X (array-like of shape (n_samples, n_encoded_features)) – The transformed data.

Returns

X_tr – Inverse transformed array.

Return type

ndarray of shape (n_samples, n_features)

transform(X)[source]

Transform X to ordinal codes.

Parameters

X (array-like of shape (n_samples, n_features)) – The data to encode.

Returns

X_out – Transformed input.

Return type

ndarray of shape (n_samples, n_features)

class ballet.eng.external.OutlierTrimmer(capping_method='gaussian', tail='right', fold=3, variables=None, missing_values='raise')[source]

Bases: feature_engine.outliers.winsorizer.Winsorizer

The OutlierTrimmer() removes observations with outliers from the dataset.

It works only with numerical variables. A list of variables can be indicated. Alternatively, the OutlierTrimmer() will select all numerical variables.

The OutlierTrimmer() first calculates the maximum and/or minimum values beyond which a value will be considered an outlier, and thus removed.

Limits are determined using:

  • a Gaussian approximation

  • the inter-quantile range proximity rule

  • percentiles.

Gaussian limits:

  • right tail: mean + 3* std

  • left tail: mean - 3* std

IQR limits:

  • right tail: 75th quantile + 3* IQR

  • left tail: 25th quantile - 3* IQR

where IQR is the inter-quartile range: 75th quantile - 25th quantile.

percentiles or quantiles:

  • right tail: 95th percentile

  • left tail: 5th percentile

You can select how far out to cap the maximum or minimum values with the parameter ‘fold’.

If capping_method=’gaussian’ fold gives the value to multiply the std.

If capping_method=’iqr’ fold is the value to multiply the IQR.

If capping_method=’quantiles’, fold is the percentile on each tail that should be censored. For example, if fold=0.05, the limits will be the 5th and 95th percentiles. If fold=0.1, the limits will be the 10th and 90th percentiles.

The transformer first finds the values at one or both tails of the distributions (fit).

The transformer then removes observations with outliers from the dataframe (transform).

Parameters
  • capping_method (str, default=gaussian) –

    Desired capping method. Can take ‘gaussian’, ‘iqr’ or ‘quantiles’.

    ’gaussian’: the transformer will find the maximum and / or minimum values to cap the variables using the Gaussian approximation.

    ’iqr’: the transformer will find the boundaries using the IQR proximity rule.

    ’quantiles’: the limits are given by the percentiles.

  • tail (str, default=right) – Whether to cap outliers on the right, left or both tails of the distribution. Can take ‘left’, ‘right’ or ‘both’.

  • fold (int or float, default=3) –

    How far out to place the capping values. The number that will multiply the std or IQR to calculate the capping values. Recommended values are 2 or 3 for the gaussian approximation, or 1.5 or 3 for the IQR proximity rule.

    If capping_method=’quantiles’, then ‘fold’ indicates the percentile. So if fold=0.05, the limits will be the 95th and 5th percentiles. Note: outliers will be removed up to a maximum of the 20th percentile on both sides. Thus, when capping_method=’quantiles’, ‘fold’ takes values between 0 and 0.20.

  • variables (list, default=None) – The list of variables for which the outliers will be removed If None, the transformer will find and select all numerical variables.

  • missing_values (string, default='raise') – Indicates if missing values should be ignored or raised. Sometimes we want to remove outliers in the raw, original data, sometimes, we may want to remove outliers in the already pre-transformed data. If missing_values=’ignore’, the transformer will ignore missing data when learning the capping parameters or transforming the data. If missing_values=’raise’ the transformer will return an error if the training or the datasets to transform contain missing values.

right_tail_caps_

Dictionary with the maximum values above which values will be removed.

left_tail_caps_

Dictionary with the minimum values below which values will be removed.

variables_

The group of variables that will be transformed.

n_features_in_

The number of features in the train set used in fit.

fit:

Find maximum and minimum values.

transform:

Remove outliers.

fit_transform:

Fit to the data. Then transform it.
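
Example (an illustrative sketch with made-up data, using the IQR proximity rule):

import pandas as pd
from ballet.eng.external import OutlierTrimmer

X = pd.DataFrame({'x': [1.0, 2.0, 3.0, 2.0, 1.0, 100.0]})
trimmer = OutlierTrimmer(capping_method='iqr', tail='right', fold=1.5).fit(X)
Xt = trimmer.transform(X)   # the row with x == 100.0 lies above the learned cap and is dropped
trimmer.right_tail_caps_    # the learned maximum value per variable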

transform(X)[source]

Remove observations with outliers from the dataframe.

Parameters

X (pandas dataframe of shape = [n_samples, n_features]) – The data to be transformed.

Raises
  • TypeError – If the input is not a Pandas DataFrame

  • ValueError – If the dataframe is not of same size as that used in fit()

Returns

X – The dataframe without outlier observations.

Return type

pandas dataframe of shape = [n_samples, n_features]

class ballet.eng.external.PRatioEncoder(encoding_method='ratio', variables=None, ignore_format=False)[source]

Bases: feature_engine.encoding.base_encoder.BaseCategoricalTransformer

The PRatioEncoder() replaces categories by the ratio of the probability of the target = 1 and the probability of the target = 0.

The target probability ratio is given by:

\[p(1) / p(0)\]

The log of the target probability ratio is:

\[log( p(1) / p(0) )\]

Note

This categorical encoding is intended exclusively for binary classification.

For example in the variable colour, if the mean of the target = 1 for blue is 0.8 and the mean of the target = 0 is 0.2, blue will be replaced by: 0.8 / 0.2 = 4 if ratio is selected, or log(0.8/0.2) = 1.386 if log_ratio is selected.

Note: division by 0 and log(0) are not defined. Thus, if p(0) = 0 for the ratio encoder, or if either p(0) = 0 or p(1) = 0 for log_ratio, in any of the variables, the encoder will return an error.

The encoder will encode only categorical variables by default (type ‘object’ or ‘categorical’). You can pass a list of variables to encode. Alternatively, the encoder will find and encode all categorical variables (type ‘object’ or ‘categorical’).

With ignore_format=True you have the option to encode numerical variables as well. The procedure is identical, you can either enter the list of variables to encode, or the transformer will automatically select all variables.

The encoder first maps the categories to the numbers for each variable (fit). The encoder then transforms the categories into the mapped numbers (transform).

Parameters
  • encoding_method (str, default='ratio') –

    Desired method of encoding.

    ’ratio’ : probability ratio

    ’log_ratio’ : log probability ratio

  • variables (list, default=None) – The list of categorical variables that will be encoded. If None, the encoder will find and transform all variables of type object or categorical by default. You can also make the transformer accept numerical variables, see the next parameter.

  • ignore_format (bool, default=False) – Whether the format in which the categorical variables are cast should be ignored. If false, the encoder will automatically select variables of type object or categorical, or check that the variables entered by the user are of type object or categorical. If True, the encoder will select all variables or accept all variables entered by the user, including those cast as numeric.

encoder_dict_

Dictionary with the probability ratio per category per variable.

variables_

The group of variables that will be transformed.

n_features_in_

The number of features in the train set used in fit.

fit:

Learn probability ratio per category, per variable.

transform:

Encode categories into numbers.

fit_transform:

Fit to the data, then transform it.

inverse_transform:

Encode the numbers into the original categories.

Notes

NAN are introduced when encoding categories that were not present in the training dataset. If this happens, try grouping infrequent categories using the RareLabelEncoder().

See also

feature_engine.encoding.RareLabelEncoder
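
A minimal usage sketch (not part of the upstream docstring); the toy dataframe and the column name 'colour' are purely illustrative:

>>> import pandas as pd
>>> from ballet.eng.external import PRatioEncoder
>>> X = pd.DataFrame({'colour': ['blue'] * 5 + ['red'] * 5})
>>> y = pd.Series([1, 1, 1, 1, 0, 1, 0, 0, 0, 0])
>>> enc = PRatioEncoder(encoding_method='ratio', variables=['colour']).fit(X, y)
>>> X_t = enc.transform(X)  # 'blue' -> p(1)/p(0) = 0.8/0.2 = 4.0, 'red' -> 0.2/0.8 = 0.25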

fit(X, y)[source]

Learn the numbers that should be used to replace the categories in each variable. That is the ratio of probability.

Parameters
  • X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples. Can be the entire dataframe, not just the categorical variables.

  • y (pandas series.) – Target, must be binary.

Raises
  • TypeError

    • If the input is not a Pandas DataFrame.

    • If the user enters non-categorical variables (unless ignore_format is True).

  • ValueError

    • If there are no categorical variables in the df, or the df is empty.

    • If the variable(s) contain null values.

    • If y is not binary with values 0 and 1.

    • If p(0) = 0 (with encoding_method='ratio'), or either p(0) or p(1) is 0 (with encoding_method='log_ratio').

Returns

Return type

self

inverse_transform(X)[source]

Convert the encoded variable back to the original values.

Parameters

X (pandas dataframe of shape = [n_samples, n_features]) – The transformed dataframe.

Raises
  • TypeError – If the input is not a Pandas DataFrame

  • ValueError

    • If the variable(s) contain null values.

    • If the df has a different number of features than the df used in fit().

Returns

X – The un-transformed dataframe, with the categorical variables containing the original values.

Return type

pandas dataframe of shape = [n_samples, n_features]

transform(X)[source]

Replace categories with the learned parameters.

Parameters

X (pandas dataframe of shape = [n_samples, n_features]) – The dataset to transform.

Raises
  • TypeError – If the input is not a Pandas DataFrame

  • ValueError

    • If the variable(s) contain null values.

    • If the df has a different number of features than the df used in fit().

  • Warning – If after encoding, NAN were introduced.

Returns

X – The dataframe containing the categories replaced by numbers.

Return type

pandas dataframe of shape = [n_samples, n_features]

class ballet.eng.external.PolynomialEncoder(verbose=0, cols=None, mapping=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value')[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Polynomial contrast coding for the encoding of categorical features.

Parameters
  • verbose (int) – integer indicating verbosity of the output. 0 for none.

  • cols (list) – a list of columns to encode, if None, all string columns will be encoded.

  • drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.

  • return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).

  • handle_unknown (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if indicator is used, an extra column will be added in if the transform matrix has unknown categories. This can cause unexpected changes in dimension in some cases.

  • handle_missing (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if indicator is used, an extra column will be added in if the transform matrix has nan values. This can cause unexpected changes in dimension in some cases.

Example

>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = PolynomialEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 21 columns):
intercept    506 non-null int64
CRIM         506 non-null float64
ZN           506 non-null float64
INDUS        506 non-null float64
CHAS_0       506 non-null float64
NOX          506 non-null float64
RM           506 non-null float64
AGE          506 non-null float64
DIS          506 non-null float64
RAD_0        506 non-null float64
RAD_1        506 non-null float64
RAD_2        506 non-null float64
RAD_3        506 non-null float64
RAD_4        506 non-null float64
RAD_5        506 non-null float64
RAD_6        506 non-null float64
RAD_7        506 non-null float64
TAX          506 non-null float64
PTRATIO      506 non-null float64
B            506 non-null float64
LSTAT        506 non-null float64
dtypes: float64(20), int64(1)
memory usage: 83.1 KB
None

References

1

Contrast Coding Systems for Categorical Variables, from

https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/

2

Gregory Carey (2003). Coding Categorical Variables, from

http://psych.colorado.edu/~carey/Courses/PSYC5741/handouts/Coding%20Categorical%20Variables%202006-03-03.pdf

fit(X, y=None, **kwargs)[source]

Fit encoder according to X and y.

Parameters
  • X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like, shape = [n_samples]) – Target values.

Returns

self – Returns self.

Return type

encoder

static fit_polynomial_coding(col, values, handle_missing, handle_unknown)[source]
get_feature_names()[source]

Returns the names of all transformed / added columns.

Returns

feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!

Return type

list

static polynomial_coding(X_in, mapping)[source]
transform(X, override_return_df=False)[source]

Perform the transformation to new categorical data.

Parameters

X (array-like, shape = [n_samples, n_features]) –

Returns

p – Transformed values with encoding applied.

Return type

array, shape = [n_samples, n_numeric + N]

class ballet.eng.external.PolynomialFeatures(degree=2, *, interaction_only=False, include_bias=True, order='C')[source]

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Generate polynomial and interaction features.

Generate a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree. For example, if an input sample is two dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2].

Read more in the User Guide.

Parameters
  • degree (int or tuple (min_degree, max_degree), default=2) – If a single int is given, it specifies the maximal degree of the polynomial features. If a tuple (min_degree, max_degree) is passed, then min_degree is the minimum and max_degree is the maximum polynomial degree of the generated features. Note that min_degree=0 and min_degree=1 are equivalent as outputting the degree zero term is determined by include_bias.

  • interaction_only (bool, default=False) –

    If True, only interaction features are produced: features that are products of at most degree distinct input features, i.e. terms with power of 2 or higher of the same input feature are excluded:

    • included: x[0], x[1], x[0] * x[1], etc.

    • excluded: x[0] ** 2, x[0] ** 2 * x[1], etc.

  • include_bias (bool, default=True) – If True (default), then include a bias column, the feature in which all polynomial powers are zero (i.e. a column of ones - acts as an intercept term in a linear model).

  • order ({'C', 'F'}, default='C') –

    Order of output array in the dense case. ‘F’ order is faster to compute, but may slow down subsequent estimators.

    New in version 0.21.

powers_

powers_[i, j] is the exponent of the jth input in the ith output.

Type

ndarray of shape (n_output_features_, n_features_in_)

n_input_features_

The total number of input features.

Deprecated since version 1.0: This attribute is deprecated in 1.0 and will be removed in 1.2. Refer to n_features_in_ instead.

Type

int

n_features_in_

Number of features seen during fit.

New in version 0.24.

Type

int

feature_names_in_

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type

ndarray of shape (n_features_in_,)

n_output_features_

The total number of polynomial output features. The number of output features is computed by iterating over all suitably sized combinations of input features.

Type

int

See also

SplineTransformer

Transformer that generates univariate B-spline bases for features.

Notes

Be aware that the number of features in the output array scales polynomially in the number of features of the input array, and exponentially in the degree. High degrees can cause overfitting.

See examples/linear_model/plot_polynomial_interpolation.py

Examples

>>> import numpy as np
>>> from sklearn.preprocessing import PolynomialFeatures
>>> X = np.arange(6).reshape(3, 2)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5]])
>>> poly = PolynomialFeatures(2)
>>> poly.fit_transform(X)
array([[ 1.,  0.,  1.,  0.,  0.,  1.],
       [ 1.,  2.,  3.,  4.,  6.,  9.],
       [ 1.,  4.,  5., 16., 20., 25.]])
>>> poly = PolynomialFeatures(interaction_only=True)
>>> poly.fit_transform(X)
array([[ 1.,  0.,  1.,  0.],
       [ 1.,  2.,  3.,  6.],
       [ 1.,  4.,  5., 20.]])
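
As a further sketch (not from the upstream docstring), the tuple form of degree described above keeps only terms whose total degree lies in the requested range; here, reusing the X defined above:

>>> poly = PolynomialFeatures(degree=(2, 2), include_bias=False)
>>> poly.fit_transform(X).shape  # only the degree-2 terms x0^2, x0*x1, x1^2 remain
(3, 3)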
fit(X, y=None)[source]

Compute number of output features.

Parameters
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data.

  • y (Ignored) – Not used, present here for API consistency by convention.

Returns

self – Fitted transformer.

Return type

object

get_feature_names(input_features=None)[source]

DEPRECATED: get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.

Return feature names for output features.

Parameters

input_features (list of str of shape (n_features,), default=None) – String names for input features if available. By default, “x0”, “x1”, … “xn_features” is used.

Returns

output_feature_names – Transformed feature names.

Return type

list of str of shape (n_output_features,)

get_feature_names_out(input_features=None)[source]

Get output feature names for transformation.

Parameters

input_features (array-like of str or None, default=None) –

Input features.

  • If input_features is None, then feature_names_in_ is used as feature names in. If feature_names_in_ is not defined, then names are generated: [x0, x1, …, x(n_features_in_)].

  • If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.

Returns

feature_names_out – Transformed feature names.

Return type

ndarray of str objects

property n_input_features_

The attribute n_input_features_ was deprecated in version 1.0 and will be removed in 1.2.

Type

DEPRECATED

property powers_

Exponent for each of the inputs in the output.

transform(X)[source]

Transform data to polynomial features.

Parameters

X ({array-like, sparse matrix} of shape (n_samples, n_features)) –

The data to transform, row by row.

Prefer CSR over CSC for sparse input (for speed), but CSC is required if the degree is 4 or higher. If the degree is less than 4 and the input format is CSC, it will be converted to CSR, have its polynomial features generated, then converted back to CSC.

If the degree is 2 or 3, the method described in “Leveraging Sparsity to Speed Up Polynomial Feature Expansions of CSR Matrices Using K-Simplex Numbers” by Andrew Nystrom and John Hughes is used, which is much faster than the method used on CSC input. For this reason, a CSC input will be converted to CSR, and the output will be converted back to CSC prior to being returned, hence the preference of CSR.

Returns

XP – The matrix of features, where NP is the number of polynomial features generated from the combination of inputs. If a sparse matrix is provided, it will be converted into a sparse csr_matrix.

Return type

{ndarray, sparse matrix} of shape (n_samples, NP)

class ballet.eng.external.PowerTransformer(method='yeo-johnson', *, standardize=True, copy=True)[source]

Bases: sklearn.base._OneToOneFeatureMixin, sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Apply a power transform featurewise to make data more Gaussian-like.

Power transforms are a family of parametric, monotonic transformations that are applied to make data more Gaussian-like. This is useful for modeling issues related to heteroscedasticity (non-constant variance), or other situations where normality is desired.

Currently, PowerTransformer supports the Box-Cox transform and the Yeo-Johnson transform. The optimal parameter for stabilizing variance and minimizing skewness is estimated through maximum likelihood.

Box-Cox requires input data to be strictly positive, while Yeo-Johnson supports both positive or negative data.

By default, zero-mean, unit-variance normalization is applied to the transformed data.

Read more in the User Guide.

New in version 0.20.

Parameters
  • method ({'yeo-johnson', 'box-cox'}, default='yeo-johnson') –

    The power transform method. Available methods are:

    • ’yeo-johnson’ [1]_, works with positive and negative values

    • ’box-cox’ [2]_, only works with strictly positive values

  • standardize (bool, default=True) – Set to True to apply zero-mean, unit-variance normalization to the transformed output.

  • copy (bool, default=True) – Set to False to perform inplace computation during transformation.

lambdas_

The parameters of the power transformation for the selected features.

Type

ndarray of float of shape (n_features,)

n_features_in_

Number of features seen during fit.

New in version 0.24.

Type

int

feature_names_in_

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type

ndarray of shape (n_features_in_,)

See also

power_transform

Equivalent function without the estimator API.

QuantileTransformer

Maps data to a standard normal distribution with the parameter output_distribution=’normal’.

Notes

NaNs are treated as missing values: disregarded in fit, and maintained in transform.

For a comparison of the different scalers, transformers, and normalizers, see examples/preprocessing/plot_all_scaling.py.

References

1

I.K. Yeo and R.A. Johnson, “A new family of power transformations to improve normality or symmetry.” Biometrika, 87(4), pp.954-959, (2000).

2

G.E.P. Box and D.R. Cox, “An Analysis of Transformations”, Journal of the Royal Statistical Society B, 26, 211-252 (1964).

Examples

>>> import numpy as np
>>> from sklearn.preprocessing import PowerTransformer
>>> pt = PowerTransformer()
>>> data = [[1, 2], [3, 2], [4, 5]]
>>> print(pt.fit(data))
PowerTransformer()
>>> print(pt.lambdas_)
[ 1.386... -3.100...]
>>> print(pt.transform(data))
[[-1.316... -0.707...]
 [ 0.209... -0.707...]
 [ 1.106...  1.414...]]
fit(X, y=None)[source]

Estimate the optimal parameter lambda for each feature.

The optimal lambda parameter for minimizing skewness is estimated on each feature independently using maximum likelihood.

Parameters
  • X (array-like of shape (n_samples, n_features)) – The data used to estimate the optimal transformation parameters.

  • y (None) – Ignored.

Returns

self – Fitted transformer.

Return type

object

fit_transform(X, y=None)[source]

Fit PowerTransformer to X, then transform X.

Parameters
  • X (array-like of shape (n_samples, n_features)) – The data used to estimate the optimal transformation parameters and to be transformed using a power transformation.

  • y (Ignored) – Not used, present for API consistency by convention.

Returns

X_new – Transformed data.

Return type

ndarray of shape (n_samples, n_features)

inverse_transform(X)[source]

Apply the inverse power transformation using the fitted lambdas.

The inverse of the Box-Cox transformation is given by:

if lambda_ == 0:
    X = exp(X_trans)
else:
    X = (X_trans * lambda_ + 1) ** (1 / lambda_)

The inverse of the Yeo-Johnson transformation is given by:

if X >= 0 and lambda_ == 0:
    X = exp(X_trans) - 1
elif X >= 0 and lambda_ != 0:
    X = (X_trans * lambda_ + 1) ** (1 / lambda_) - 1
elif X < 0 and lambda_ != 2:
    X = 1 - (-(2 - lambda_) * X_trans + 1) ** (1 / (2 - lambda_))
elif X < 0 and lambda_ == 2:
    X = 1 - exp(-X_trans)

Parameters

X (array-like of shape (n_samples, n_features)) – The transformed data.

Returns

X – The original data.

Return type

ndarray of shape (n_samples, n_features)
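
An illustrative round-trip check (not part of the upstream docstring), showing that inverse_transform recovers the original data up to numerical precision:

>>> import numpy as np
>>> from sklearn.preprocessing import PowerTransformer
>>> pt = PowerTransformer(method='yeo-johnson')
>>> X = np.array([[1.0], [3.0], [4.0]])
>>> X_trans = pt.fit_transform(X)
>>> bool(np.allclose(pt.inverse_transform(X_trans), X))
True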

transform(X)[source]

Apply the power transform to each feature using the fitted lambdas.

Parameters

X (array-like of shape (n_samples, n_features)) – The data to be transformed using a power transformation.

Returns

X_trans – The transformed data.

Return type

ndarray of shape (n_samples, n_features)

class ballet.eng.external.QuantileTransformer(*, n_quantiles=1000, output_distribution='uniform', ignore_implicit_zeros=False, subsample=100000, random_state=None, copy=True)[source]

Bases: sklearn.base._OneToOneFeatureMixin, sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Transform features using quantiles information.

This method transforms the features to follow a uniform or a normal distribution. Therefore, for a given feature, this transformation tends to spread out the most frequent values. It also reduces the impact of (marginal) outliers: this is therefore a robust preprocessing scheme.

The transformation is applied on each feature independently. First an estimate of the cumulative distribution function of a feature is used to map the original values to a uniform distribution. The obtained values are then mapped to the desired output distribution using the associated quantile function. Feature values of new/unseen data that fall below or above the fitted range will be mapped to the bounds of the output distribution. Note that this transform is non-linear. It may distort linear correlations between variables measured at the same scale but renders variables measured at different scales more directly comparable.

Read more in the User Guide.

New in version 0.19.

Parameters
  • n_quantiles (int, default=1000 or n_samples) – Number of quantiles to be computed. It corresponds to the number of landmarks used to discretize the cumulative distribution function. If n_quantiles is larger than the number of samples, n_quantiles is set to the number of samples as a larger number of quantiles does not give a better approximation of the cumulative distribution function estimator.

  • output_distribution ({'uniform', 'normal'}, default='uniform') – Marginal distribution for the transformed data. The choices are ‘uniform’ (default) or ‘normal’.

  • ignore_implicit_zeros (bool, default=False) – Only applies to sparse matrices. If True, the sparse entries of the matrix are discarded to compute the quantile statistics. If False, these entries are treated as zeros.

  • subsample (int, default=1e5) – Maximum number of samples used to estimate the quantiles for computational efficiency. Note that the subsampling procedure may differ for value-identical sparse and dense matrices.

  • random_state (int, RandomState instance or None, default=None) – Determines random number generation for subsampling and smoothing noise. Please see subsample for more details. Pass an int for reproducible results across multiple function calls. See Glossary.

  • copy (bool, default=True) – Set to False to perform inplace transformation and avoid a copy (if the input is already a numpy array).

n_quantiles_

The actual number of quantiles used to discretize the cumulative distribution function.

Type

int

quantiles_

The values corresponding to the quantiles of reference.

Type

ndarray of shape (n_quantiles, n_features)

references_

Quantiles of references.

Type

ndarray of shape (n_quantiles, )

n_features_in_

Number of features seen during fit.

New in version 0.24.

Type

int

feature_names_in_

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type

ndarray of shape (n_features_in_,)

See also

quantile_transform

Equivalent function without the estimator API.

PowerTransformer

Perform mapping to a normal distribution using a power transform.

StandardScaler

Perform standardization that is faster, but less robust to outliers.

RobustScaler

Perform robust standardization that removes the influence of outliers but does not put outliers and inliers on the same scale.

Notes

NaNs are treated as missing values: disregarded in fit, and maintained in transform.

For a comparison of the different scalers, transformers, and normalizers, see examples/preprocessing/plot_all_scaling.py.

Examples

>>> import numpy as np
>>> from sklearn.preprocessing import QuantileTransformer
>>> rng = np.random.RandomState(0)
>>> X = np.sort(rng.normal(loc=0.5, scale=0.25, size=(25, 1)), axis=0)
>>> qt = QuantileTransformer(n_quantiles=10, random_state=0)
>>> qt.fit_transform(X)
array([...])
fit(X, y=None)[source]

Compute the quantiles used for transforming.

Parameters
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to scale along the features axis. If a sparse matrix is provided, it will be converted into a sparse csc_matrix. Additionally, the sparse matrix needs to be nonnegative if ignore_implicit_zeros is False.

  • y (None) – Ignored.

Returns

self – Fitted transformer.

Return type

object

inverse_transform(X)[source]

Back-projection to the original space.

Parameters

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to scale along the features axis. If a sparse matrix is provided, it will be converted into a sparse csc_matrix. Additionally, the sparse matrix needs to be nonnegative if ignore_implicit_zeros is False.

Returns

Xt – The projected data.

Return type

{ndarray, sparse matrix} of (n_samples, n_features)

transform(X)[source]

Feature-wise transformation of the data.

Parameters

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to scale along the features axis. If a sparse matrix is provided, it will be converted into a sparse csc_matrix. Additionally, the sparse matrix needs to be nonnegative if ignore_implicit_zeros is False.

Returns

Xt – The projected data.

Return type

{ndarray, sparse matrix} of shape (n_samples, n_features)

class ballet.eng.external.RandomSampleImputer(random_state=None, seed='general', seeding_method='add', variables=None)[source]

Bases: feature_engine.imputation.base_imputer.BaseImputer

The RandomSampleImputer() replaces missing data with a random sample extracted from the variables in the training set.

The RandomSampleImputer() works with both numerical and categorical variables.

Note

The random samples used to replace missing values may vary from execution to execution. This may affect the results of your work. Thus, it is advisable to set a seed.

There are 2 ways in which the seed can be set in the RandomSampleImputer():

If seed = ‘general’ then the random_state can be either None or an integer. The seed will be used as the random_state and all observations will be imputed in one go. This is equivalent to pandas.sample(n, random_state=seed) where n is the number of observations with missing data.

If seed = ‘observation’, then the random_state should be a variable name or a list of variable names. The seed will be calculated observation per observation, either by adding or multiplying the seeding variable values, and passed to the random_state. Then, a value will be extracted from the train set using that seed and used to replace the NaN in that particular observation. This is the equivalent of pandas.sample(1, random_state=var1+var2) if the ‘seeding_method’ is set to ‘add’ or pandas.sample(1, random_state=var1*var2) if the ‘seeding_method’ is set to ‘multiply’.

For more details on why this functionality is important refer to the course Feature Engineering for Machine Learning in Udemy: https://www.udemy.com/feature-engineering-for-machine-learning/

Note that if the variables indicated in the random_state list are not numerical, the imputer will raise an error. Note also that the variables indicated as seed should not contain missing values.

This estimator stores a copy of the training set when the fit() method is called. Therefore, the object can become quite heavy. Also, it may not be GDPR compliant if your training data set contains Personal Information. Please check if this behaviour is allowed within your organisation.

Parameters
  • random_state (int, str or list, default=None) – The random_state can take an integer to set the seed when extracting the random samples. Alternatively, it can take a variable name or a list of variable names, whose values will be used to determine the seed, observation per observation.

  • seed (str, default='general') –

    Indicates whether the seed should be set for each observation with missing values, or if one seed should be used to impute all observations in one go.

    general: one seed will be used to impute the entire dataframe. This is equivalent to setting the seed in pandas.sample(random_state).

    observation: the seed will be set for each observation using the values of the variables indicated in the random_state for that particular observation.

  • seeding_method (str, default='add') – If more than one variable is indicated to seed the random sampling per observation, you can choose to combine those values by addition or by multiplication. Can take the values ‘add’ or ‘multiply’.

  • variables (list, default=None) – The list of variables to be imputed. If None, the imputer will select all variables in the train set.

X_

Copy of the training dataframe from which to extract the random samples.

variables_

The group of variables that will be transformed.

n_features_in_

The number of features in the train set used in fit.

fit:

Make a copy of the dataframe

transform:

Impute missing data.

fit_transform:

Fit to the data, then transform it.
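
A minimal usage sketch (not part of the upstream docstring); the toy dataframe below is purely illustrative:

>>> import numpy as np
>>> import pandas as pd
>>> from ballet.eng.external import RandomSampleImputer
>>> X = pd.DataFrame({'age': [20.0, 30.0, np.nan, 40.0],
...                   'city': ['a', np.nan, 'b', 'a']})
>>> imputer = RandomSampleImputer(random_state=0, seed='general')
>>> X_t = imputer.fit_transform(X)  # NaNs replaced by values sampled from the stored training data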

fit(X, y=None)[source]

Makes a copy of the train set. Only stores a copy of the variables to impute. This copy is then used to randomly extract the values to fill the missing data during transform.

Parameters
  • X (pandas dataframe of shape = [n_samples, n_features]) – The training dataset. Only a copy of the indicated variables will be stored in the transformer.

  • y (None) – y is not needed in this imputation. You can pass None or y.

Raises

TypeError – If the input is not a Pandas DataFrame

Returns

Return type

self

transform(X)[source]

Replace missing data with random values taken from the train set.

Parameters

X (pandas dataframe of shape = [n_samples, n_features]) – The dataframe to be transformed.

Raises

TypeError – If the input is not a Pandas DataFrame

Returns

X – The dataframe without missing values in the transformed variables.

Return type

pandas dataframe of shape = [n_samples, n_features]

class ballet.eng.external.RareLabelEncoder(tol=0.05, n_categories=10, max_n_categories=None, replace_with='Rare', variables=None, ignore_format=False)[source]

Bases: feature_engine.encoding.base_encoder.BaseCategoricalTransformer

The RareLabelEncoder() groups rare / infrequent categories into a new category called “Rare”, or any other name entered by the user.

For example in the variable colour, if the percentage of observations for the categories magenta, cyan and burgundy are < 5 %, all those categories will be replaced by the new label “Rare”.

Note

Infrequent labels can also be grouped under a user defined name, for example ‘Other’. The name to replace infrequent categories is defined with the parameter replace_with.

The encoder will encode only categorical variables by default (type ‘object’ or ‘categorical’). You can pass a list of variables to encode. Alternatively, the encoder will find and encode all categorical variables (type ‘object’ or ‘categorical’).

With ignore_format=True you have the option to encode numerical variables as well. The procedure is identical, you can either enter the list of variables to encode, or the transformer will automatically select all variables.

The encoder first finds the frequent labels for each variable (fit). The encoder then groups the infrequent labels under the new label ‘Rare’ or by another user defined string (transform).

Parameters
  • tol (float, default=0.05) – The minimum frequency a label should have to be considered frequent. Categories with frequencies lower than tol will be grouped.

  • n_categories (int, default=10) – The minimum number of categories a variable should have for the encoder to find frequent labels. If the variable contains fewer categories, all of them will be considered frequent.

  • max_n_categories (int, default=None) – The maximum number of categories that should be considered frequent. If None, all categories with frequency above the tolerance (tol) will be considered frequent. If you enter 5, only the 5 most frequent categories will be retained and the rest grouped.

  • replace_with (string, integer or float, default='Rare') – The value that will be used to replace infrequent categories.

  • variables (list, default=None) – The list of categorical variables that will be encoded. If None, the encoder will find and transform all variables of type object or categorical by default. You can also make the transformer accept numerical variables, see the next parameter.

  • ignore_format (bool, default=False) – Whether the format in which the categorical variables are cast should be ignored. If false, the encoder will automatically select variables of type object or categorical, or check that the variables entered by the user are of type object or categorical. If True, the encoder will select all variables or accept all variables entered by the user, including those cast as numeric.

encoder_dict_

Dictionary with the frequent categories, i.e., those that will be kept, per variable.

variables_

The variables that will be transformed.

n_features_in_

The number of features in the train set used in fit.

fit:

Find frequent categories.

transform:

Group rare categories

fit_transform:

Fit to data, then transform it.
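
A minimal usage sketch (not part of the upstream docstring); the category frequencies below are chosen so that only 'magenta' falls under the tol threshold:

>>> import pandas as pd
>>> from ballet.eng.external import RareLabelEncoder
>>> X = pd.DataFrame({'colour': ['blue'] * 10 + ['red'] * 9 + ['magenta']})
>>> enc = RareLabelEncoder(tol=0.1, n_categories=2)
>>> X_t = enc.fit_transform(X)
>>> sorted(X_t['colour'].unique())
['Rare', 'blue', 'red']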

fit(X, y=None)[source]

Learn the frequent categories for each variable.

Parameters
  • X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples. Can be the entire dataframe, not just selected variables

  • y (None) – y is not required. You can pass y or None.

Raises
  • TypeError

    • If the input is not a Pandas DataFrame.

    • If the user enters non-categorical variables (unless ignore_format is True).

  • ValueError

    • If there are no categorical variables in the df, or the df is empty.

    • If the variable(s) contain null values.

  • Warning – If the number of categories in any one variable is less than that indicated in n_categories.

Returns

Return type

self

inverse_transform(X)[source]

inverse_transform is not implemented for this transformer yet.

transform(X)[source]

Group infrequent categories. Replace infrequent categories by the string ‘Rare’ or any other name provided by the user.

Parameters

X (pandas dataframe of shape = [n_samples, n_features]) – The input samples.

Raises
  • TypeError – If the input is not a Pandas DataFrame

  • ValueError

    • If the variable(s) contain null values.

    • If the user enters non-categorical variables (unless ignore_format is True).

Returns

X – The dataframe where rare categories have been grouped.

Return type

pandas dataframe of shape = [n_samples, n_features]

class ballet.eng.external.ReciprocalTransformer(variables=None)[source]

Bases: feature_engine.base_transformers.BaseNumericalTransformer

The ReciprocalTransformer() applies the reciprocal transformation 1 / x to numerical variables.

The ReciprocalTransformer() only works with numerical variables with non-zero values. If a variable contains the value 0, the transformer will raise an error.

A list of variables can be passed as an argument. Alternatively, the transformer will automatically select and transform all numerical variables.

Parameters

variables (list, default=None) – The list of numerical variables to transform. If None, the transformer will automatically find and select all numerical variables.

variables_

The group of variables that will be transformed.

n_features_in_

The number of features in the train set used in fit.

fit:

This transformer does not learn parameters.

transform:

Apply the reciprocal 1 / x transformation.

fit_transform:

Fit to data, then transform it.

inverse_transform:

Convert the data back to the original representation.
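
A minimal usage sketch (not part of the upstream docstring), using a single illustrative column of non-zero values:

>>> import pandas as pd
>>> from ballet.eng.external import ReciprocalTransformer
>>> X = pd.DataFrame({'x': [1.0, 2.0, 4.0]})
>>> tr = ReciprocalTransformer()
>>> tr.fit_transform(X)['x'].tolist()
[1.0, 0.5, 0.25]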

fit(X, y=None)[source]

This transformer does not learn parameters.

Parameters
  • X (Pandas DataFrame of shape = [n_samples, n_features]) – The training input samples. Can be the entire dataframe, not just the variables to transform.

  • y (pandas Series, default=None) – It is not needed in this transformer. You can pass y or None.

Raises
  • TypeError

    • If the input is not a Pandas DataFrame.

    • If any of the user provided variables are not numerical.

  • ValueError

    • If there are no numerical variables in the df, or the df is empty.

    • If the variable(s) contain null values.

    • If some variables contain zero as values.

Returns

Return type

self

inverse_transform(X)[source]

Convert the data back to the original representation.

Parameters

X (Pandas DataFrame of shape = [n_samples, n_features]) – The data to be transformed.

Raises
  • TypeError – If the input is not a Pandas DataFrame

  • ValueError

    • If the variable(s) contain null values.

    • If the df has a different number of features than the df used in fit().

    • If some variables contain zero values.

Returns

X – The dataframe with the transformed variables.

Return type

pandas dataframe

transform(X)[source]

Apply the reciprocal 1 / x transformation.

Parameters

X (Pandas DataFrame of shape = [n_samples, n_features]) – The data to be transformed.

Raises
  • TypeError – If the input is not a Pandas DataFrame

  • ValueError

    • If the variable(s) contain null values.

    • If the df has a different number of features than the df used in fit().

    • If some variables contain zero values.

Returns

X – The dataframe with the transformed variables.

Return type

pandas dataframe

class ballet.eng.external.ReversibleImputer(y_only=False)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

fit(X, y=None)[source]
inverse_transform(X)[source]
needs_refit = True
transform(X, y=None, refit=False)[source]
class ballet.eng.external.RobustScaler(*, with_centering=True, with_scaling=True, quantile_range=(25.0, 75.0), copy=True, unit_variance=False)[source]

Bases: sklearn.base._OneToOneFeatureMixin, sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Scale features using statistics that are robust to outliers.

This Scaler removes the median and scales the data according to the quantile range (defaults to IQR: Interquartile Range). The IQR is the range between the 1st quartile (25th quantile) and the 3rd quartile (75th quantile).

Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Median and interquartile range are then stored to be used on later data using the transform() method.

Standardization of a dataset is a common requirement for many machine learning estimators. Typically this is done by removing the mean and scaling to unit variance. However, outliers can often influence the sample mean / variance in a negative way. In such cases, the median and the interquartile range often give better results.

New in version 0.17.

Read more in the User Guide.

Parameters
  • with_centering (bool, default=True) – If True, center the data before scaling. This will cause transform() to raise an exception when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.

  • with_scaling (bool, default=True) – If True, scale the data to interquartile range.

  • quantile_range (tuple (q_min, q_max), 0.0 < q_min < q_max < 100.0, default=(25.0, 75.0)) –

    Quantile range used to calculate scale_. By default this is equal to the IQR, i.e., q_min is the first quantile and q_max is the third quantile.

    New in version 0.18.

  • copy (bool, default=True) – If False, try to avoid a copy and do inplace scaling instead. This is not guaranteed to always work inplace; e.g. if the data is not a NumPy array or scipy.sparse CSR matrix, a copy may still be returned.

  • unit_variance (bool, default=False) –

    If True, scale data so that normally distributed features have a variance of 1. In general, if the difference between the x-values of q_max and q_min for a standard normal distribution is greater than 1, the dataset will be scaled down. If less than 1, the dataset will be scaled up.

    New in version 0.24.

center_

The median value for each feature in the training set.

Type

array of floats

scale_

The (scaled) interquartile range for each feature in the training set.

New in version 0.17: scale_ attribute.

Type

array of floats

n_features_in_

Number of features seen during fit.

New in version 0.24.

Type

int

feature_names_in_

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type

ndarray of shape (n_features_in_,)

See also

robust_scale

Equivalent function without the estimator API.

sklearn.decomposition.PCA

Further removes the linear correlation across features with ‘whiten=True’.

Notes

For a comparison of the different scalers, transformers, and normalizers, see examples/preprocessing/plot_all_scaling.py.

https://en.wikipedia.org/wiki/Median https://en.wikipedia.org/wiki/Interquartile_range

Examples

>>> from sklearn.preprocessing import RobustScaler
>>> X = [[ 1., -2.,  2.],
...      [ -2.,  1.,  3.],
...      [ 4.,  1., -2.]]
>>> transformer = RobustScaler().fit(X)
>>> transformer
RobustScaler()
>>> transformer.transform(X)
array([[ 0. , -2. ,  0. ],
       [-1. ,  0. ,  0.4],
       [ 1. ,  0. , -1.6]])
fit(X, y=None)[source]

Compute the median and quantiles to be used for scaling.

Parameters
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to compute the median and quantiles used for later scaling along the features axis.

  • y (Ignored) – Not used, present here for API consistency by convention.

Returns

self – Fitted scaler.

Return type

object

inverse_transform(X)[source]

Scale back the data to the original representation.

Parameters

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The rescaled data to be transformed back.

Returns

X_tr – Transformed array.

Return type

{ndarray, sparse matrix} of shape (n_samples, n_features)

transform(X)[source]

Center and scale the data.

Parameters

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to scale along the specified axis.

Returns

X_tr – Transformed array.

Return type

{ndarray, sparse matrix} of shape (n_samples, n_features)

class ballet.eng.external.RollingMeanTransformer(window=5)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

fit(X, y=None)[source]
transform(X, y=None)[source]
class ballet.eng.external.SeasonalTransformer(seasonal_period=1, pred_stride=1)[source]

Bases: skits.feature_extraction.AutoregressiveTransformer

class ballet.eng.external.SimpleImputer(*, missing_values=nan, strategy='mean', fill_value=None, verbose=0, copy=True, add_indicator=False)[source]

Bases: sklearn.impute._base._BaseImputer

Imputation transformer for completing missing values.

Read more in the User Guide.

New in version 0.20: SimpleImputer replaces the previous sklearn.preprocessing.Imputer estimator which is now removed.

Parameters
  • missing_values (int, float, str, np.nan or None, default=np.nan) – The placeholder for the missing values. All occurrences of missing_values will be imputed. For pandas’ dataframes with nullable integer dtypes with missing values, missing_values should be set to np.nan, since pd.NA will be converted to np.nan.

  • strategy (str, default='mean') –

    The imputation strategy.

    • If “mean”, then replace missing values using the mean along each column. Can only be used with numeric data.

    • If “median”, then replace missing values using the median along each column. Can only be used with numeric data.

    • If “most_frequent”, then replace missing using the most frequent value along each column. Can be used with strings or numeric data. If there is more than one such value, only the smallest is returned.

    • If “constant”, then replace missing values with fill_value. Can be used with strings or numeric data.

    New in version 0.20: strategy=”constant” for fixed value imputation.

  • fill_value (str or numerical value, default=None) – When strategy == “constant”, fill_value is used to replace all occurrences of missing_values. If left to the default, fill_value will be 0 when imputing numerical data and “missing_value” for strings or object data types.

  • verbose (int, default=0) – Controls the verbosity of the imputer.

  • copy (bool, default=True) –

    If True, a copy of X will be created. If False, imputation will be done in-place whenever possible. Note that, in the following cases, a new copy will always be made, even if copy=False:

    • If X is not an array of floating values;

    • If X is encoded as a CSR matrix;

    • If add_indicator=True.

  • add_indicator (bool, default=False) – If True, a MissingIndicator transform will stack onto output of the imputer’s transform. This allows a predictive estimator to account for missingness despite imputation. If a feature has no missing values at fit/train time, the feature won’t appear on the missing indicator even if there are missing values at transform/test time.

statistics_

The imputation fill value for each feature. Computing statistics can result in np.nan values. During transform(), features corresponding to np.nan statistics will be discarded.

Type

array of shape (n_features,)

indicator_

Indicator used to add binary indicators for missing values. None if add_indicator=False.

Type

MissingIndicator

n_features_in_

Number of features seen during fit.

New in version 0.24.

Type

int

feature_names_in_

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type

ndarray of shape (n_features_in_,)

See also

IterativeImputer

Multivariate imputation of missing values.

Notes

Columns which only contained missing values at fit() are discarded upon transform() if strategy is not “constant”.

Examples

>>> import numpy as np
>>> from sklearn.impute import SimpleImputer
>>> imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
>>> imp_mean.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])
SimpleImputer()
>>> X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
>>> print(imp_mean.transform(X))
[[ 7.   2.   3. ]
 [ 4.   3.5  6. ]
 [10.   3.5  9. ]]
fit(X, y=None)[source]

Fit the imputer on X.

Parameters
  • X ({array-like, sparse matrix}, shape (n_samples, n_features)) – Input data, where n_samples is the number of samples and n_features is the number of features.

  • y (Ignored) – Not used, present here for API consistency by convention.

Returns

self – Fitted estimator.

Return type

object

inverse_transform(X)[source]

Convert the data back to the original representation.

Inverts the transform operation performed on an array. This operation can only be performed after SimpleImputer is instantiated with add_indicator=True.

Note that inverse_transform can only invert the transform in features that have binary indicators for missing values. If a feature has no missing values at fit time, the feature won’t have a binary indicator, and the imputation done at transform time won’t be inverted.

New in version 0.24.

Parameters

X (array-like of shape (n_samples, n_features + n_features_missing_indicator)) – The imputed data to be reverted to original data. It has to be an augmented array of imputed data and the missing indicator mask.

Returns

X_original – The original X with missing values as it was prior to imputation.

Return type

ndarray of shape (n_samples, n_features)
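
An illustrative sketch (not from the upstream docstring) of the add_indicator=True round trip described above:

>>> import numpy as np
>>> from sklearn.impute import SimpleImputer
>>> imp = SimpleImputer(strategy='mean', add_indicator=True)
>>> X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])
>>> X_aug = imp.fit_transform(X)          # imputed values plus two missing-indicator columns
>>> X_back = imp.inverse_transform(X_aug)
>>> int(np.isnan(X_back).sum())           # the original missing entries are restored
2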

transform(X)[source]

Impute all missing values in X.

Parameters

X ({array-like, sparse matrix}, shape (n_samples, n_features)) – The input data to complete.

Returns

X_imputed – X with imputed values.

Return type

{ndarray, sparse matrix} of shape (n_samples, n_features_out)

class ballet.eng.external.SparseRandomProjection(n_components='auto', *, density='auto', eps=0.1, dense_output=False, random_state=None)[source]

Bases: sklearn.random_projection.BaseRandomProjection

Reduce dimensionality through sparse random projection.

Sparse random matrix is an alternative to dense random projection matrix that guarantees similar embedding quality while being much more memory efficient and allowing faster computation of the projected data.

If we note s = 1 / density the components of the random matrix are drawn from:

  • -sqrt(s) / sqrt(n_components) with probability 1 / 2s

  • 0 with probability 1 - 1 / s

  • +sqrt(s) / sqrt(n_components) with probability 1 / 2s

Read more in the User Guide.

New in version 0.13.

Parameters
  • n_components (int or 'auto', default='auto') –

    Dimensionality of the target projection space.

    n_components can be automatically adjusted according to the number of samples in the dataset and the bound given by the Johnson-Lindenstrauss lemma. In that case the quality of the embedding is controlled by the eps parameter.

    It should be noted that the Johnson-Lindenstrauss lemma can yield very conservative estimates of the required number of components, as it makes no assumption on the structure of the dataset.

  • density (float or 'auto', default='auto') –

    Ratio in the range (0, 1] of non-zero component in the random projection matrix.

    If density = ‘auto’, the value is set to the minimum density as recommended by Ping Li et al.: 1 / sqrt(n_features).

    Use density = 1 / 3.0 if you want to reproduce the results from Achlioptas, 2001.

  • eps (float, default=0.1) –

    Parameter to control the quality of the embedding according to the Johnson-Lindenstrauss lemma when n_components is set to ‘auto’. This value should be strictly positive.

    Smaller values lead to better embedding and higher number of dimensions (n_components) in the target projection space.

  • dense_output (bool, default=False) –

    If True, ensure that the output of the random projection is a dense numpy array even if the input and random projection matrix are both sparse. In practice, if the number of components is small the number of zero components in the projected data will be very small and it will be more CPU and memory efficient to use a dense representation.

    If False, the projected data uses a sparse representation if the input is sparse.

  • random_state (int, RandomState instance or None, default=None) – Controls the pseudo random number generator used to generate the projection matrix at fit time. Pass an int for reproducible output across multiple function calls. See Glossary.

n_components_

Concrete number of components computed when n_components=”auto”.

Type

int

components_

Random matrix used for the projection. Sparse matrix will be of CSR format.

Type

sparse matrix of shape (n_components, n_features)

density_

Concrete density computed when density = “auto”.

Type

float in range 0.0 - 1.0

n_features_in_

Number of features seen during fit.

New in version 0.24.

Type

int

feature_names_in_

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type

ndarray of shape (n_features_in_,)

See also

GaussianRandomProjection

Reduce dimensionality through Gaussian random projection.

References

1

Ping Li, T. Hastie and K. W. Church, 2006, “Very Sparse Random Projections”. https://web.stanford.edu/~hastie/Papers/Ping/KDD06_rp.pdf

2

D. Achlioptas, 2001, “Database-friendly random projections”, https://users.soe.ucsc.edu/~optas/papers/jl.pdf

Examples

>>> import numpy as np
>>> from sklearn.random_projection import SparseRandomProjection
>>> rng = np.random.RandomState(42)
>>> X = rng.rand(100, 10000)
>>> transformer = SparseRandomProjection(random_state=rng)
>>> X_new = transformer.fit_transform(X)
>>> X_new.shape
(100, 3947)
>>> # very few components are non-zero
>>> np.mean(transformer.components_ != 0)
0.0100...
class ballet.eng.external.StandardScaler(*, copy=True, with_mean=True, with_std=True)[source]

Bases: sklearn.base._OneToOneFeatureMixin, sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Standardize features by removing the mean and scaling to unit variance.

The standard score of a sample x is calculated as:

z = (x - u) / s

where u is the mean of the training samples or zero if with_mean=False, and s is the standard deviation of the training samples or one if with_std=False.

Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Mean and standard deviation are then stored to be used on later data using transform().

Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).

For instance, many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the L1 and L2 regularizers of linear models) assume that all features are centered around 0 and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.

This scaler can also be applied to sparse CSR or CSC matrices by passing with_mean=False to avoid breaking the sparsity structure of the data.

Read more in the User Guide.

Parameters
  • copy (bool, default=True) – If False, try to avoid a copy and do inplace scaling instead. This is not guaranteed to always work inplace; e.g. if the data is not a NumPy array or scipy.sparse CSR matrix, a copy may still be returned.

  • with_mean (bool, default=True) – If True, center the data before scaling. This does not work (and will raise an exception) when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.

  • with_std (bool, default=True) – If True, scale the data to unit variance (or equivalently, unit standard deviation).

scale_

Per feature relative scaling of the data to achieve zero mean and unit variance. Generally this is calculated using np.sqrt(var_). If a variance is zero, we can’t achieve unit variance, and the data is left as-is, giving a scaling factor of 1. scale_ is equal to None when with_std=False.

New in version 0.17: scale_

Type

ndarray of shape (n_features,) or None

mean_

The mean value for each feature in the training set. Equal to None when with_mean=False.

Type

ndarray of shape (n_features,) or None

var_

The variance for each feature in the training set. Used to compute scale_. Equal to None when with_std=False.

Type

ndarray of shape (n_features,) or None

n_features_in_

Number of features seen during fit.

New in version 0.24.

Type

int

feature_names_in_

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type

ndarray of shape (n_features_in_,)

n_samples_seen_

The number of samples processed by the estimator for each feature. If there are no missing samples, the n_samples_seen will be an integer, otherwise it will be an array of dtype int. If sample_weights are used it will be a float (if no missing data) or an array of dtype float that sums the weights seen so far. Will be reset on new calls to fit, but increments across partial_fit calls.

Type

int or ndarray of shape (n_features,)

See also

scale

Equivalent function without the estimator API.

PCA

Further removes the linear correlation across features with ‘whiten=True’.

Notes

NaNs are treated as missing values: disregarded in fit, and maintained in transform.

We use a biased estimator for the standard deviation, equivalent to numpy.std(x, ddof=0). Note that the choice of ddof is unlikely to affect model performance.

For a comparison of the different scalers, transformers, and normalizers, see examples/preprocessing/plot_all_scaling.py.

Examples

>>> from sklearn.preprocessing import StandardScaler
>>> data = [[0, 0], [0, 0], [1, 1], [1, 1]]
>>> scaler = StandardScaler()
>>> print(scaler.fit(data))
StandardScaler()
>>> print(scaler.mean_)
[0.5 0.5]
>>> print(scaler.transform(data))
[[-1. -1.]
 [-1. -1.]
 [ 1.  1.]
 [ 1.  1.]]
>>> print(scaler.transform([[2, 2]]))
[[3. 3.]]
fit(X, y=None, sample_weight=None)[source]

Compute the mean and std to be used for later scaling.

Parameters
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to compute the mean and standard deviation used for later scaling along the features axis.

  • y (None) – Ignored.

  • sample_weight (array-like of shape (n_samples,), default=None) –

    Individual weights for each sample.

    New in version 0.24: parameter sample_weight support to StandardScaler.

Returns

self – Fitted scaler.

Return type

object

inverse_transform(X, copy=None)[source]

Scale back the data to the original representation.

Parameters
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to scale along the features axis.

  • copy (bool, default=None) – Copy the input X or not.

Returns

X_tr – Transformed array.

Return type

{ndarray, sparse matrix} of shape (n_samples, n_features)

partial_fit(X, y=None, sample_weight=None)[source]

Online computation of mean and std on X for later scaling.

All of X is processed as a single batch. This is intended for cases when fit() is not feasible due to a very large number of samples, or because X is read from a continuous stream.

The algorithm for incremental mean and std is given in Equation 1.5a,b in Chan, Tony F., Gene H. Golub, and Randall J. LeVeque. “Algorithms for computing the sample variance: Analysis and recommendations.” The American Statistician 37.3 (1983): 242-247:

Parameters
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to compute the mean and standard deviation used for later scaling along the features axis.

  • y (None) – Ignored.

  • sample_weight (array-like of shape (n_samples,), default=None) –

    Individual weights for each sample.

    New in version 0.24: parameter sample_weight support to StandardScaler.

Returns

self – Fitted scaler.

Return type

object
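
An illustrative sketch (not from the upstream docstring) of incremental fitting on batches of a stream:

>>> import numpy as np
>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler()
>>> for batch in np.array_split(np.arange(100, dtype=float).reshape(-1, 1), 5):
...     _ = scaler.partial_fit(batch)
>>> print(scaler.n_samples_seen_)
100
>>> print(scaler.mean_)
[49.5]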

transform(X, copy=None)[source]

Perform standardization by centering and scaling.

Parameters
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to scale along the features axis.

  • copy (bool, default=None) – Copy the input X or not.

Returns

X_tr – Transformed array.

Return type

{ndarray, sparse matrix} of shape (n_samples, n_features)

class ballet.eng.external.SumEncoder(verbose=0, cols=None, mapping=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value')[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Sum contrast coding for the encoding of categorical features.

Parameters
  • verbose (int) – integer indicating verbosity of the output. 0 for none.

  • cols (list) – a list of columns to encode, if None, all string columns will be encoded.

  • drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.

  • return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).

  • handle_unknown (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if indicator is used, an extra column will be added in if the transform matrix has unknown categories. This can cause unexpected changes in dimension in some cases.

  • handle_missing (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if indicator is used, an extra column will be added in if the transform matrix has nan values. This can cause unexpected changes in dimension in some cases.

Example

>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = SumEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 21 columns):
intercept    506 non-null int64
CRIM         506 non-null float64
ZN           506 non-null float64
INDUS        506 non-null float64
CHAS_0       506 non-null float64
NOX          506 non-null float64
RM           506 non-null float64
AGE          506 non-null float64
DIS          506 non-null float64
RAD_0        506 non-null float64
RAD_1        506 non-null float64
RAD_2        506 non-null float64
RAD_3        506 non-null float64
RAD_4        506 non-null float64
RAD_5        506 non-null float64
RAD_6        506 non-null float64
RAD_7        506 non-null float64
TAX          506 non-null float64
PTRATIO      506 non-null float64
B            506 non-null float64
LSTAT        506 non-null float64
dtypes: float64(20), int64(1)
memory usage: 83.1 KB
None
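
For intuition, sum (deviation) coding represents a k-level factor with k-1 contrast columns whose codes sum to zero across the levels, so fitted coefficients read as deviations from the grand mean. A hand-rolled sketch of such a mapping for a three-level factor (illustrative only, not the encoder's internal code):

>>> # 'a' and 'b' get their own columns; the reference level 'c' is coded -1 everywhere
>>> mapping = {'a': [1, 0], 'b': [0, 1], 'c': [-1, -1]}
>>> [mapping['a'][i] + mapping['b'][i] + mapping['c'][i] for i in range(2)]
[0, 0]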

References

1. Contrast Coding Systems for Categorical Variables, from https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/

2. Gregory Carey (2003). Coding Categorical Variables, from http://psych.colorado.edu/~carey/Courses/PSYC5741/handouts/Coding%20Categorical%20Variables%202006-03-03.pdf

fit(X, y=None, **kwargs)[source]

Fit encoder according to X and y.

Parameters
  • X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like, shape = [n_samples]) – Target values.

Returns

self – Returns self.

Return type

encoder

static fit_sum_coding(col, values, handle_missing, handle_unknown)[source]
get_feature_names()[source]

Returns the names of all transformed / added columns.

Returns

feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!

Return type

list

static sum_coding(X_in, mapping)[source]
transform(X, override_return_df=False)[source]

Perform the transformation to new categorical data.

Parameters

X (array-like, shape = [n_samples, n_features]) –

Returns

p – Transformed values with encoding applied.

Return type

array, shape = [n_samples, n_numeric + N]

class ballet.eng.external.TargetEncoder(verbose=0, cols=None, drop_invariant=False, return_df=True, handle_missing='value', handle_unknown='value', min_samples_leaf=1, smoothing=1.0)[source]

Bases: sklearn.base.BaseEstimator, category_encoders.utils.TransformerWithTargetMixin

Target encoding for categorical features.

Supported targets: binomial and continuous. For polynomial target support, see PolynomialWrapper.

For a categorical target: features are replaced with a blend of the posterior probability of the target given a particular categorical value and the prior probability of the target over all the training data.

For a continuous target: features are replaced with a blend of the expected value of the target given a particular categorical value and the expected value of the target over all the training data.
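
As a rough sketch of the blending idea only (the sigmoid weighting below is an assumption for illustration and may not match TargetEncoder's exact internal formula), min_samples_leaf and smoothing control how quickly a category's own mean overrides the prior:

>>> import numpy as np
>>> import pandas as pd
>>> X = pd.Series(['a', 'a', 'a', 'b'], name='cat')
>>> y = pd.Series([1.0, 1.0, 0.0, 1.0])
>>> prior = y.mean()                                          # global target mean
>>> stats = y.groupby(X).agg(['mean', 'count'])               # per-category mean and count
>>> weight = 1 / (1 + np.exp(-(stats['count'] - 1) / 1.0))    # illustrative weighting only
>>> blended = prior * (1 - weight) + stats['mean'] * weight
>>> round(float(blended['a']), 3)
0.677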

Parameters
  • verbose (int) – integer indicating verbosity of the output. 0 for none.

  • cols (list) – a list of columns to encode, if None, all string columns will be encoded.

  • drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.

  • return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).

  • handle_missing (str) – options are ‘error’, ‘return_nan’ and ‘value’, defaults to ‘value’, which returns the target mean.

  • handle_unknown (str) – options are ‘error’, ‘return_nan’ and ‘value’, defaults to ‘value’, which returns the target mean.

  • min_samples_leaf (int) – minimum samples to take category average into account.

  • smoothing (float) – smoothing effect to balance categorical average vs prior. Higher value means stronger regularization. The value must be strictly bigger than 0.

Example

>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = TargetEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(13)
memory usage: 51.5 KB
None

References

1. A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems, from https://dl.acm.org/citation.cfm?id=507538

fit(X, y, **kwargs)[source]

Fit encoder according to X and y.

Parameters
  • X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like, shape = [n_samples]) – Target values.

Returns

self – Returns self.

Return type

encoder

fit_target_encoding(X, y)[source]
get_feature_names()[source]

Returns the names of all transformed / added columns.

Returns

feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!

Return type

list

target_encode(X_in)[source]
transform(X, y=None, override_return_df=False)[source]

Perform the transformation to new categorical data.

Parameters
  • X (array-like, shape = [n_samples, n_features]) –

  • y (array-like, shape = [n_samples], required when transforming by leave-one-out) – None when transforming without target information (such as the test set).

Returns

p – Transformed values with encoding applied.

Return type

array, shape = [n_samples, n_numeric + N]

class ballet.eng.external.TrendTransformer(shift=0)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

fit(X, y=None)[source]
transform(X, y=None)[source]
class ballet.eng.external.WOEEncoder(verbose=0, cols=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value', random_state=None, randomized=False, sigma=0.05, regularization=1.0)[source]

Bases: sklearn.base.BaseEstimator, category_encoders.utils.TransformerWithTargetMixin

Weight of Evidence coding for categorical features.

Supported targets: binomial. For polynomial target support, see PolynomialWrapper.

Parameters
  • verbose (int) – integer indicating verbosity of the output. 0 for none.

  • cols (list) – a list of columns to encode, if None, all string columns will be encoded.

  • drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.

  • return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).

  • handle_missing (str) – options are ‘return_nan’, ‘error’ and ‘value’, defaults to ‘value’, which will assume WOE=0.

  • handle_unknown (str) – options are ‘return_nan’, ‘error’ and ‘value’, defaults to ‘value’, which will assume WOE=0.

  • randomized (bool) – adds normal (Gaussian) distribution noise into training data in order to decrease overfitting (testing data are untouched).

  • sigma (float) – standard deviation (spread or “width”) of the normal distribution.

  • regularization (float) – the purpose of regularization is mostly to prevent division by zero. When regularization is 0, you may encounter division by zero.

Example

>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target > 22.5
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = WOEEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(13)
memory usage: 51.5 KB
None

References

1. Weight of Evidence (WOE) and Information Value Explained, from https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html

fit(X, y, **kwargs)[source]

Fit encoder according to X and binary y.

Parameters
  • X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like, shape = [n_samples]) – Binary target values.

Returns

self – Returns self.

Return type

encoder

get_feature_names()[source]

Returns the names of all transformed / added columns.

Returns

feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!

Return type

list

transform(X, y=None, override_return_df=False)[source]

Perform the transformation to new categorical data. When the data are used for model training, it is important to also pass the target in order to apply leave-one-out.

Parameters
  • X (array-like, shape = [n_samples, n_features]) –

  • y (array-like, shape = [n_samples], required when transforming by leave-one-out) – None when transforming without target information (such as the test set).

Returns

p – Transformed values with encoding applied.

Return type

array, shape = [n_samples, n_numeric + N]

class ballet.eng.external.Winsorizer(capping_method='gaussian', tail='right', fold=3, variables=None, missing_values='raise')[source]

Bases: feature_engine.outliers.base_outlier.BaseOutlier

The Winsorizer() caps maximum and / or minimum values of a variable.

The Winsorizer() works only with numerical variables. A list of variables can be indicated. Alternatively, the Winsorizer() will select all numerical variables in the train set.

The Winsorizer() first calculates the capping values at the tails of the distribution. The values are determined using:

  • a Gaussian approximation,

  • the inter-quantile range proximity rule (IQR)

  • percentiles.

Gaussian limits:

  • right tail: mean + 3 * std

  • left tail: mean - 3 * std

IQR limits:

  • right tail: 75th quantile + 3 * IQR

  • left tail: 25th quantile - 3 * IQR

where IQR is the inter-quartile range: 75th quantile - 25th quantile.

percentiles or quantiles:

  • right tail: 95th percentile

  • left tail: 5th percentile

You can select how far out to cap the maximum or minimum values with the parameter ‘fold’.

If capping_method=’gaussian’, fold gives the value to multiply the std.

If capping_method=’iqr’, fold is the value to multiply the IQR.

If capping_method=’quantiles’, fold is the percentile on each tail that should be censored. For example, if fold=0.05, the limits will be the 5th and 95th percentiles. If fold=0.1, the limits will be the 10th and 90th percentiles.

The transformer first finds the values at one or both tails of the distributions (fit). The transformer then caps the variables (transform).

Parameters
  • capping_method (str, default=gaussian) –

    Desired capping method. Can take ‘gaussian’, ‘iqr’ or ‘quantiles’.

    ’gaussian’: the transformer will find the maximum and / or minimum values to cap the variables using the Gaussian approximation.

    ’iqr’: the transformer will find the boundaries using the IQR proximity rule.

    ’quantiles’: the limits are given by the percentiles.

  • tail (str, default=right) – Whether to cap outliers on the right, left or both tails of the distribution. Can take ‘left’, ‘right’ or ‘both’.

  • fold (int or float, default=3) –

    How far out to place the capping values. The number that will multiply the std or IQR to calculate the capping values. Recommended values are 2 or 3 for the gaussian approximation, or 1.5 or 3 for the IQR proximity rule.

    If capping_method=’quantiles’, then ‘fold’ indicates the percentile. So if fold=0.05, the limits will be the 5th and 95th percentiles. Note: outliers will be capped up to a maximum of the 20th percentile on both sides. Thus, when capping_method=’quantiles’, ‘fold’ takes values between 0 and 0.20.

  • variables (list, default=None) – The list of variables for which the outliers will be capped. If None, the transformer will find and select all numerical variables.

  • missing_values (string, default='raise') – Indicates whether missing values should be ignored or should raise an error. Sometimes we want to remove outliers from the raw, original data; other times we may want to remove outliers from data that has already been pre-transformed. If missing_values=’ignore’, the transformer will ignore missing data when learning the capping parameters or transforming the data. If missing_values=’raise’, the transformer will raise an error if the training set or the datasets to transform contain missing values.

right_tail_caps_

Dictionary with the maximum values at which variables will be capped.

left_tail_caps_

Dictionary with the minimum values at which variables will be capped.

variables_

The group of variables that will be transformed.

n_features_in_

The number of features in the train set used in fit.

fit:

Learn the values that should be used to replace outliers.

transform:

Cap the variables.

fit_transform:

Fit to the data. Then transform it.
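
For illustration, a minimal usage sketch with the IQR rule (the column name and toy data are made up):

>>> import pandas as pd
>>> from ballet.eng.external import Winsorizer
>>> X = pd.DataFrame({'x': [1.0, 2.0, 3.0, 2.0, 1.0, 100.0]})
>>> capper = Winsorizer(capping_method='iqr', tail='right', fold=1.5)
>>> Xt = capper.fit_transform(X)
>>> bool(Xt['x'].max() < X['x'].max())   # the extreme value has been capped
True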

fit(X, y=None)[source]

Learn the values that should be used to replace outliers.

Parameters
  • X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples.

  • y (pandas Series, default=None) – y is not needed in this transformer. You can pass y or None.

Raises

TypeError – If the input is not a Pandas DataFrame

Returns

Return type

self

transform(X)[source]

Cap the variable values, that is, censors outliers.

Parameters

X (pandas dataframe of shape = [n_samples, n_features]) – The data to be transformed.

Raises
  • TypeError – If the input is not a Pandas DataFrame

  • ValueError – If the dataframe is not of the same size as that used in fit()

Returns

X – The dataframe with the capped variables.

Return type

pandas dataframe of shape = [n_samples, n_features]

class ballet.eng.external.WoEEncoder(variables=None, ignore_format=False)[source]

Bases: feature_engine.encoding.base_encoder.BaseCategoricalTransformer

The WoEEncoder() replaces categories by the weight of evidence (WoE). The WoE was used primarily in the financial sector to create credit risk scorecards.

The encoder will encode only categorical variables by default (type ‘object’ or ‘categorical’). You can pass a list of variables to encode. Alternatively, the encoder will find and encode all categorical variables (type ‘object’ or ‘categorical’).

With ignore_format=True you have the option to encode numerical variables as well. The procedure is identical, you can either enter the list of variables to encode, or the transformer will automatically select all variables.

The encoder first maps the categories to the weight of evidence for each variable (fit). The encoder then transforms the categories into the mapped numbers (transform).

Note

This categorical encoding is exclusive for binary classification.

The weight of evidence is given by:

\[\log\left(\frac{p(X = x_j \mid Y = 1)}{p(X = x_j \mid Y = 0)}\right)\]

The WoE is determined as follows:

We calculate the percentage of positive cases in each category out of the total positive cases. For example, 20 positive cases in category A out of 100 total positive cases equals 20%. Next, we calculate the percentage of negative cases in each category with respect to the total negative cases; for example, 5 negative cases in category A out of a total of 50 negative cases equals 10%. Then we calculate the WoE by dividing the category percentage of positive cases by the category percentage of negative cases and taking the logarithm, so for category A in our example WoE = log(20% / 10%) = log(2).
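
Worked out in code, the example above is simply (a hand computation, not a call into the encoder):

>>> from math import log
>>> pos_share = 20 / 100   # positives in category A as a share of all positives
>>> neg_share = 5 / 50     # negatives in category A as a share of all negatives
>>> round(log(pos_share / neg_share), 3)
0.693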

Note

  • If the WoE is negative, the category’s share of negative cases exceeds its share of positive cases.

  • If the WoE is positive, the category’s share of positive cases exceeds its share of negative cases.

  • If the WoE is 0, the category contains equal shares of positive and negative cases.

Encoding into WoE:

  • Creates a monotonic relationship between the encoded variable and the target

  • Returns variables in a similar scale

Note

Neither log(0) nor division by 0 is defined. Thus, if any of the terms in the WoE equation is 0 for a given category, the encoder will raise an error. If this happens, try grouping less frequent categories.

Parameters
  • variables (list, default=None) – The list of categorical variables that will be encoded. If None, the encoder will find and transform all variables of type object or categorical by default. You can also make the transformer accept numerical variables, see the next parameter.

  • ignore_format (bool, default=False) – Whether the format in which the categorical variables are cast should be ignored. If false, the encoder will automatically select variables of type object or categorical, or check that the variables entered by the user are of type object or categorical. If True, the encoder will select all variables or accept all variables entered by the user, including those cast as numeric.

encoder_dict_

Dictionary with the WoE per variable.

variables_

The group of variables that will be transformed.

n_features_in_

The number of features in the train set used in fit.

fit:

Learn the WoE per category, per variable.

transform:

Encode the categories to numbers.

fit_transform:

Fit to the data, then transform it.

inverse_transform:

Encode the numbers into the original categories.

Notes

For details on the calculation of the weight of evidence visit: https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html

In credit scoring, continuous variables are also transformed using the WoE. To do this, first variables are sorted into a discrete number of bins, and then these bins are encoded with the WoE as explained here for categorical variables. You can do this by combining the use of the equal width, equal frequency or arbitrary discretisers.
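
For illustration, a sketch of that workflow using the ArbitraryDiscretiser from this package (the bin edges and toy data are made up, and chosen so that every bin contains both classes; otherwise the encoder raises, as noted above):

>>> import pandas as pd
>>> from ballet.eng.external import ArbitraryDiscretiser, WoEEncoder
>>> X = pd.DataFrame({'age': [22, 25, 35, 40, 47, 58]})
>>> y = pd.Series([0, 1, 1, 0, 1, 0])
>>> disc = ArbitraryDiscretiser(binning_dict={'age': [0, 30, 45, 100]}, return_object=True)
>>> woe = WoEEncoder(variables=['age'])
>>> Xt = woe.fit_transform(disc.fit_transform(X), y)
>>> list(Xt.columns)
['age']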

NaN values are introduced when encoding categories that were not present in the training dataset. If this happens, try grouping infrequent categories using the RareLabelEncoder().

See also

feature_engine.encoding.RareLabelEncoder, feature_engine.discretisation

fit(X, y)[source]

Learn the WoE.

Parameters
  • X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples. Can be the entire dataframe, not just the categorical variables.

  • y (pandas series.) – Target, must be binary.

Raises
  • TypeError

    • If the input is not a Pandas DataFrame.

    • If the user enters non-categorical variables (unless ignore_format is True).

  • ValueError

    • If there are no categorical variables in the df or the df is empty.

    • If the variable(s) contain null values.

    • If y is not binary with values 0 and 1.

    • If p(0) = 0 or p(1) = 0.

Returns

Return type

self

inverse_transform(X)[source]

Convert the encoded variable back to the original values.

Parameters

X (pandas dataframe of shape = [n_samples, n_features]) – The transformed dataframe.

Raises
  • TypeError – If the input is not a Pandas DataFrame

  • ValueError

    • If the variable(s) contain null values.

    • If the df has a different number of features than the df used in fit().

Returns

X – The un-transformed dataframe, with the categorical variables containing the original values.

Return type

pandas dataframe of shape = [n_samples, n_features]

transform(X)[source]

Replace categories with the learned parameters.

Parameters

X (pandas dataframe of shape = [n_samples, n_features]) – The dataset to transform.

Raises
  • TypeError – If the input is not a Pandas DataFrame

  • ValueError

    • If the variable(s) contain null values.

    • If the df has a different number of features than the df used in fit().

  • Warning – If after encoding, NAN were introduced.

Returns

X – The dataframe containing the categories replaced by numbers.

Return type

pandas dataframe of shape = [n_samples, n_features]

class ballet.eng.external.YeoJohnsonTransformer(variables=None)[source]

Bases: feature_engine.base_transformers.BaseNumericalTransformer

The YeoJohnsonTransformer() applies the Yeo-Johnson transformation to the numerical variables.

The Yeo-Johnson transformation implemented by this transformer is that of SciPy.stats: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.yeojohnson.html

The YeoJohnsonTransformer() works only with numerical variables.

A list of variables can be passed as an argument. Alternatively, the transformer will automatically select and transform all numerical variables.

Parameters

variables (list, default=None) – The list of numerical variables to transform. If None, the transformer will automatically find and select all numerical variables.

lambda_dict_

Dictionary containing the best lambda for the Yeo-Johnson per variable.

variables_

The group of variables that will be transformed.

n_features_in_

The number of features in the train set used in fit.

fit:

Learn the optimal lambda for the Yeo-Johnson transformation.

transform:

Apply the Yeo-Johnson transformation.

fit_transform:

Fit to data, then transform it.
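
For illustration, a minimal usage sketch (the column name and toy data are made up); the learned lambda per variable is stored in lambda_dict_:

>>> import pandas as pd
>>> from ballet.eng.external import YeoJohnsonTransformer
>>> X = pd.DataFrame({'x': [0.5, 1.0, 2.0, 4.0, 8.0]})
>>> yjt = YeoJohnsonTransformer()
>>> Xt = yjt.fit_transform(X)
>>> sorted(yjt.lambda_dict_)
['x']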

References

1. Weisberg S. “Yeo-Johnson Power Transformations”. https://www.stat.umn.edu/arc/yjpower.pdf

fit(X, y=None)[source]

Learn the optimal lambda for the Yeo-Johnson transformation.

Parameters
  • X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples. Can be the entire dataframe, not just the variables to transform.

  • y (pandas Series, default=None) – It is not needed in this transformer. You can pass y or None.

Raises

TypeError

  • If the input is not a Pandas DataFrame.

  • If any of the user-provided variables are not numerical.

ValueError

  • If there are no numerical variables in the df or the df is empty.

  • If the variable(s) contain null values

Returns

Return type

self

transform(X)[source]

Apply the Yeo-Johnson transformation.

Parameters

X (Pandas DataFrame of shape = [n_samples, n_features]) – The data to be transformed.

Raises
  • TypeError – If the input is not a Pandas DataFrame

  • ValueError

    • If the variable(s) contain null values.

    • If the df has a different number of features than the df used in fit().

Returns

X – The dataframe with the transformed variables.

Return type

pandas dataframe