ballet.eng.external package¶
-
class ballet.eng.external.AddMissingIndicator(missing_only=True, variables=None)[source]¶
Bases: feature_engine.imputation.base_imputer.BaseImputer
The AddMissingIndicator() adds additional binary variables that indicate if data is missing. It will add as many missing indicators as variables indicated by the user.
Binary variables are named with the original variable name plus ‘_na’.
The AddMissingIndicator() works for both numerical and categorical variables. You can pass a list with the variables for which the missing indicators should be added. Alternatively, the imputer will select and add missing indicators to all variables in the training set.
Note: If missing_only=True, the imputer will add missing indicators only to those variables that show missing data during fit. These may be a subset of the variables you indicated.
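A minimal usage sketch (the toy dataframe and its column names are illustrative, not part of the upstream documentation):
import numpy as np
import pandas as pd
from ballet.eng.external import AddMissingIndicator

# toy data with missing values in both columns (illustrative)
X = pd.DataFrame({
    'age': [25, np.nan, 40],
    'city': ['NYC', 'LA', np.nan],
})

indicator = AddMissingIndicator()  # missing_only=True by default
Xt = indicator.fit_transform(X)
# Xt now contains the binary columns 'age_na' and 'city_na'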
- Parameters
missing_only (bool, default=True) –
Indicates if missing indicators should be added to variables with missing data or to all variables.
True: indicators will be created only for those variables that showed missing data during fit.
False: indicators will be created for all variables
variables (list, default=None) – The list of variables to be imputed. If None, the imputer will find and select all variables.
-
variables_
¶ List of variables for which the missing indicators will be created.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
Learn the variables for which the missing indicators will be created
-
transform:
Add the missing indicators.
-
fit_transform:
Fit to the data, then transform it.
-
fit
(X, y=None)[source]¶ Learn the variables for which the missing indicators will be created.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training dataset.
y (pandas Series, default=None) – y is not needed in this imputation. You can pass None or y.
- Raises
TypeError – If the input is not a Pandas DataFrame
- Returns
self.variables_ – The list of variables for which missing indicators will be added.
- Return type
list
-
transform
(X)[source]¶ Add the binary missing indicators.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The dataframe to be transformed.
- Returns
X_transformed – The dataframe containing the additional binary variables. Binary variables are named with the original variable name plus ‘_na’.
- Return type
pandas dataframe of shape = [n_samples, n_features]
-
class ballet.eng.external.ArbitraryDiscretiser(binning_dict, return_object=False, return_boundaries=False)[source]¶
Bases: feature_engine.base_transformers.BaseNumericalTransformer
The ArbitraryDiscretiser() divides continuous numerical variables into contiguous intervals whose limits are determined arbitrarily by the user.
You need to enter a dictionary with variable names as keys, and a list of the limits of the intervals as values. For example {‘var1’:[0, 10, 100, 1000], ‘var2’:[5, 10, 15, 20]}.
ArbitraryDiscretiser() will then sort var1 values into the intervals 0-10, 10-100, 100-1000, and var2 values into 5-10, 10-15 and 15-20, similar to pandas.cut.
The ArbitraryDiscretiser() works only with numerical variables. The discretiser will check if the dictionary entered by the user contains variables present in the training set, and if these variables are numerical, before doing any transformation.
Then it transforms the variables, that is, it sorts the values into the intervals.
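A minimal usage sketch, assuming toy data whose values fall inside the user-defined limits (the dataframe and column names are illustrative):
import pandas as pd
from ballet.eng.external import ArbitraryDiscretiser

X = pd.DataFrame({'var1': [3, 25, 250], 'var2': [6, 12, 18]})  # illustrative

discretiser = ArbitraryDiscretiser(
    binning_dict={'var1': [0, 10, 100, 1000], 'var2': [5, 10, 15, 20]}
)
Xt = discretiser.fit_transform(X)
# by default, each value is replaced by the integer index of its interval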
- Parameters
binning_dict (dict) –
The dictionary with the variable to interval limits pairs. A valid dictionary looks like this:
binning_dict = {‘var1’:[0, 10, 100, 1000], ‘var2’:[5, 10, 15, 20]}
return_object (bool, default=False) –
Whether the discrete variable should be returned cast as numeric or as object. If you would like to proceed with the engineering of the variable as if it was categorical, use True. Alternatively, keep the default as False.
Categorical encoders in Feature-engine work only with variables of type object, thus, if you wish to encode the returned bins, set return_object to True.
return_boundaries (bool, default=False) – Whether the output, that is the bin names / values, should be the interval boundaries. If True, it returns the interval boundaries. If False, it returns integers.
-
binner_dict_
¶ Dictionary with the interval limits per variable.
-
variables_
¶ The variables to discretise.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
This transformer does not learn any parameter.
-
transform:
Sort continuous variable values into the intervals.
-
fit_transform:
Fit to the data, then transform it.
See also
pandas.cut: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html
-
fit
(X, y=None)[source]¶ This transformer does not learn any parameter.
Check dataframe and variables. Checks that the user entered variables are in the train set and cast as numerical.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training dataset. Can be the entire dataframe, not just the variables to be transformed.
y (None) – y is not needed in this transformer. You can pass y or None.
- Raises
TypeError –
If the input is not a Pandas DataFrame - If any of the user provided variables are not numerical
ValueError –
If there are no numerical variables in the df or the df is empty - If the variable(s) contain null values
- Returns
- Return type
self
-
transform
(X)[source]¶ Sort the variable values into the intervals.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The dataframe to be transformed.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError –
If the variable(s) contain null values - If the dataframe is not of the same size as the one used in fit()
- Returns
X – The transformed data with the discrete variables.
- Return type
pandas dataframe of shape = [n_samples, n_features]
-
class ballet.eng.external.ArbitraryNumberImputer(arbitrary_number=999, variables=None, imputer_dict=None)[source]¶
Bases: feature_engine.imputation.base_imputer.BaseImputer
The ArbitraryNumberImputer() replaces missing data in each variable by an arbitrary value determined by the user. It works only with numerical variables.
You can impute all variables with the same number, in which case you need to define the variables to impute in variables and the imputation number in arbitrary_number. Alternatively, you can pass a dictionary of variables and the numbers to use for their imputation.
For example, you can impute varA and varB with 99 like this:
transformer = ArbitraryNumberImputer(
    variables=['varA', 'varB'],
    arbitrary_number=99,
)
Xt = transformer.fit_transform(X)
Alternatively, you can impute varA with 1 and varB with 99 like this:
transformer = ArbitraryNumberImputer(
    imputer_dict={'varA': 1, 'varB': 99},
)
Xt = transformer.fit_transform(X)
- Parameters
arbitrary_number (int or float, default=999) – The number to be used to replace missing data.
variables (list, default=None) – The list of variables to be imputed. If None, the imputer will find and select all numerical variables. This parameter is used only if imputer_dict is None.
imputer_dict (dict, default=None) – The dictionary of variables and the arbitrary numbers for their imputation.
-
imputer_dict_
¶ Dictionary with the values to replace NAs in each variable.
-
variables_
¶ The group of variables that will be transformed.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
This transformer does not learn parameters.
-
transform:
Impute missing data.
-
fit_transform:
Fit to the data, then transform it.
See also
feature_engine.imputation.EndTailImputer
-
fit
(X, y=None)[source]¶ This method does not learn any parameter. Checks dataframe and finds numerical variables, or checks that the variables entered by user are numerical.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training dataset.
y (None) – y is not needed in this imputation. You can pass None or y.
- Raises
TypeError –
If the input is not a Pandas DataFrame - If any of the user provided variables are not numerical
ValueError – If there are no numerical variables in the df or the df is empty
- Returns
- Return type
self
-
transform
(X)[source]¶ Replace missing data with the learned parameters.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The data to be transformed.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError – If the dataframe has different number of features than the df used in fit()
- Returns
X – The dataframe without missing values in the selected variables.
- Return type
pandas dataframe of shape = [n_samples, n_features]
-
class ballet.eng.external.ArbitraryOutlierCapper(max_capping_dict=None, min_capping_dict=None, missing_values='raise')[source]¶
Bases: feature_engine.outliers.base_outlier.BaseOutlier
The ArbitraryOutlierCapper() caps the maximum or minimum values of a variable at an arbitrary value indicated by the user.
You must provide the maximum or minimum values that will be used to cap each variable in a dictionary {feature:capping value}
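A minimal usage sketch (the toy dataframe, column name, and capping value are illustrative):
import pandas as pd
from ballet.eng.external import ArbitraryOutlierCapper

X = pd.DataFrame({'income': [20_000, 55_000, 250_000]})  # illustrative

capper = ArbitraryOutlierCapper(max_capping_dict={'income': 100_000})
Xt = capper.fit_transform(X)
# values above 100_000 are capped at 100_000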
- Parameters
max_capping_dict (dictionary, default=None) – Dictionary containing the user specified capping values for the right tail of the distribution of each variable (maximum values).
min_capping_dict (dictionary, default=None) – Dictionary containing user specified capping values for the left tail of the distribution of each variable (minimum values).
missing_values (string, default='raise') – Indicates if missing values should be ignored or raised. If missing_values=’raise’ the transformer will return an error if the training or the datasets to transform contain missing values.
-
right_tail_caps_
¶ Dictionary with the maximum values at which variables will be capped.
-
left_tail_caps_
¶ Dictionary with the minimum values at which variables will be capped.
-
variables_
¶ The group of variables that will be transformed.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
This transformer does not learn any parameter.
-
transform:
Cap the variables.
-
fit_transform:
Fit to the data. Then transform it.
-
fit
(X, y=None)[source]¶ This transformer does not learn any parameter.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples.
y (pandas Series, default=None) – y is not needed in this transformer. You can pass y or None.
- Raises
TypeError – If the input is not a Pandas DataFrame
- Returns
- Return type
self
-
transform
(X)[source]¶ Cap the variable values, that is, censor outliers.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The data to be transformed.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError – If the dataframe is not of same size as that used in fit()
- Returns
X – The dataframe with the capped variables.
- Return type
pandas dataframe of shape = [n_samples, n_features]
-
class ballet.eng.external.AutoregressiveTransformer(num_lags=5, pred_stride=1)[source]¶
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
-
class ballet.eng.external.BackwardDifferenceEncoder(verbose=0, cols=None, mapping=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value')[source]¶
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Backward difference contrast coding for encoding categorical variables.
- Parameters
verbose (int) – integer indicating verbosity of the output. 0 for none.
cols (list) – a list of columns to encode, if None, all string columns will be encoded.
drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.
return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).
handle_unknown (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if indicator is used, an extra column will be added in if the transform matrix has unknown categories. This can cause unexpected changes in dimension in some cases.
handle_missing (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if indicator is used, an extra column will be added in if the transform matrix has nan values. This can cause unexpected changes in dimension in some cases.
Example
>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = BackwardDifferenceEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 21 columns):
intercept    506 non-null int64
CRIM         506 non-null float64
ZN           506 non-null float64
INDUS        506 non-null float64
CHAS_0       506 non-null float64
NOX          506 non-null float64
RM           506 non-null float64
AGE          506 non-null float64
DIS          506 non-null float64
RAD_0        506 non-null float64
RAD_1        506 non-null float64
RAD_2        506 non-null float64
RAD_3        506 non-null float64
RAD_4        506 non-null float64
RAD_5        506 non-null float64
RAD_6        506 non-null float64
RAD_7        506 non-null float64
TAX          506 non-null float64
PTRATIO      506 non-null float64
B            506 non-null float64
LSTAT        506 non-null float64
dtypes: float64(20), int64(1)
memory usage: 83.1 KB
None
References
- 1
Contrast Coding Systems for Categorical Variables, from
https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/
- 2
Gregory Carey (2003). Coding Categorical Variables, from
-
fit
(X, y=None, **kwargs)[source]¶ Fits an ordinal encoder to produce a consistent mapping across applications and optionally finds generally invariant columns to drop consistently.
- Parameters
X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.
y (array-like, shape = [n_samples]) – Target values.
- Returns
self – Returns self.
- Return type
encoder
-
class ballet.eng.external.BaseNEncoder(verbose=0, cols=None, mapping=None, drop_invariant=False, return_df=True, base=2, handle_unknown='value', handle_missing='value')[source]¶
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Base-N encoder encodes the categories into arrays of their base-N representation. A base of 1 is equivalent to one-hot encoding (not really base-1, but useful), a base of 2 is equivalent to binary encoding. N=number of actual categories is equivalent to vanilla ordinal encoding.
- Parameters
verbose (int) – integer indicating verbosity of the output. 0 for none.
cols (list) – a list of columns to encode, if None, all string columns will be encoded.
drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.
return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).
base (int) – when the downstream model copes well with nonlinearities (like decision tree), use higher base.
handle_unknown (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if indicator is used, an extra column will be added in if the transform matrix has unknown categories. This can cause unexpected changes in dimension in some cases.
handle_missing (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if indicator is used, an extra column will be added in if the transform matrix has nan values. This can cause unexpected changes in dimension in some cases.
Example
>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = BaseNEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 18 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS_0     506 non-null int64
CHAS_1     506 non-null int64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD_0      506 non-null int64
RAD_1      506 non-null int64
RAD_2      506 non-null int64
RAD_3      506 non-null int64
RAD_4      506 non-null int64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(11), int64(7)
memory usage: 71.3 KB
None
-
basen_encode
(X_in, cols=None)[source]¶ Basen encoding encodes the integers as basen code with one column per digit.
- Parameters
X_in (DataFrame) –
cols (list-like, default None) – Column names in the DataFrame to be encoded
- Returns
dummies
- Return type
DataFrame
-
basen_to_integer
(X, cols, base)[source]¶ Convert base-n code to integers.
- Parameters
X (DataFrame) – encoded data
cols (list-like) – Column names in the DataFrame that will be encoded
base (int) – The base of transform
- Returns
numerical
- Return type
DataFrame
-
fit
(X, y=None, **kwargs)[source]¶ Fit encoder according to X and y.
- Parameters
X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.
y (array-like, shape = [n_samples]) – Target values.
- Returns
self – Returns self.
- Return type
encoder
-
get_feature_names
()[source]¶ Returns the names of all transformed / added columns.
- Returns
feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!
- Return type
list
-
class ballet.eng.external.Binarizer(*, threshold=0.0, copy=True)[source]¶
Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator
Binarize data (set feature values to 0 or 1) according to a threshold.
Values greater than the threshold map to 1, while values less than or equal to the threshold map to 0. With the default threshold of 0, only positive values map to 1.
Binarization is a common operation on text count data where the analyst can decide to only consider the presence or absence of a feature rather than a quantified number of occurrences for instance.
It can also be used as a pre-processing step for estimators that consider boolean random variables (e.g. modelled using the Bernoulli distribution in a Bayesian setting).
Read more in the User Guide.
- Parameters
threshold (float, default=0.0) – Feature values below or equal to this are replaced by 0, above it by 1. Threshold may not be less than 0 for operations on sparse matrices.
copy (bool, default=True) – Set to False to perform inplace binarization and avoid a copy (if the input is already a numpy array or a scipy.sparse CSR matrix).
-
n_features_in_
¶ Number of features seen during fit.
New in version 0.24.
- Type
int
-
feature_names_in_
¶ Names of features seen during fit. Defined only when X has feature names that are all strings.
New in version 1.0.
- Type
ndarray of shape (n_features_in_,)
See also
binarize
Equivalent function without the estimator API.
KBinsDiscretizer
Bin continuous data into intervals.
OneHotEncoder
Encode categorical features as a one-hot numeric array.
Notes
If the input is a sparse matrix, only the non-zero values are subject to update by the Binarizer class.
This estimator is stateless (besides constructor parameters), the fit method does nothing but is useful when used in a pipeline.
Examples
>>> from sklearn.preprocessing import Binarizer
>>> X = [[ 1., -1.,  2.],
...      [ 2.,  0.,  0.],
...      [ 0.,  1., -1.]]
>>> transformer = Binarizer().fit(X)  # fit does nothing.
>>> transformer
Binarizer()
>>> transformer.transform(X)
array([[1., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.]])
-
fit
(X, y=None)[source]¶ Do nothing and return the estimator unchanged.
This method is just there to implement the usual API and hence work in pipelines.
- Parameters
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data.
y (None) – Ignored.
- Returns
self – Fitted transformer.
- Return type
object
-
transform
(X, copy=None)[source]¶ Binarize each element of X.
- Parameters
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data to binarize, element by element. scipy.sparse matrices should be in CSR format to avoid an un-necessary copy.
copy (bool) – Copy the input X or not.
- Returns
X_tr – Transformed array.
- Return type
{ndarray, sparse matrix} of shape (n_samples, n_features)
-
class ballet.eng.external.BinaryEncoder(verbose=0, cols=None, mapping=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value')[source]¶
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Binary encoding for categorical variables, similar to onehot, but stores categories as binary bitstrings.
- Parameters
verbose (int) – integer indicating verbosity of the output. 0 for none.
cols (list) – a list of columns to encode, if None, all string columns will be encoded.
drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.
return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).
handle_unknown (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if indicator is used, an extra column will be added in if the transform matrix has unknown categories. This can cause unexpected changes in dimension in some cases.
handle_missing (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if indicator is used, an extra column will be added in if the transform matrix has nan values. This can cause unexpected changes in dimension in some cases.
Example
>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = BinaryEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 18 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS_0     506 non-null int64
CHAS_1     506 non-null int64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD_0      506 non-null int64
RAD_1      506 non-null int64
RAD_2      506 non-null int64
RAD_3      506 non-null int64
RAD_4      506 non-null int64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(11), int64(7)
memory usage: 71.3 KB
None
-
fit
(X, y=None, **kwargs)[source]¶ Fit encoder according to X and y.
- Parameters
X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.
y (array-like, shape = [n_samples]) – Target values.
- Returns
self – Returns self.
- Return type
encoder
-
get_feature_names
()[source]¶ Returns the names of all transformed / added columns.
- Returns
feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!
- Return type
list
-
class ballet.eng.external.BoxCoxTransformer(variables=None)[source]¶
Bases: feature_engine.base_transformers.BaseNumericalTransformer
The BoxCoxTransformer() applies the BoxCox transformation to numerical variables.
The Box-Cox transformation is defined as:
T(Y) = (Y^λ − 1) / λ    if λ != 0
T(Y) = log(Y)           otherwise
where Y is the response variable and λ is the transformation parameter. λ varies, typically from -5 to 5. In the transformation, all values of λ are considered and the optimal value for a given variable is selected.
The BoxCox transformation implemented by this transformer is that of SciPy.stats: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.boxcox.html
The BoxCoxTransformer() works only with numerical, strictly positive variables (> 0).
A list of variables can be passed as an argument. Alternatively, the transformer will automatically select and transform all numerical variables.
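A minimal usage sketch, assuming a strictly positive toy variable (the data and column name are illustrative):
import pandas as pd
from ballet.eng.external import BoxCoxTransformer

X = pd.DataFrame({'income': [1_000.0, 2_500.0, 10_000.0, 50_000.0]})  # illustrative

transformer = BoxCoxTransformer(variables=['income'])
Xt = transformer.fit_transform(X)
print(transformer.lambda_dict_)  # best lambda found per variable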
- Parameters
variables (list, default=None) – The list of numerical variables to transform. If None, the transformer will automatically find and select all numerical variables.
-
lambda_dict_
¶ Dictionary with the best BoxCox exponent per variable.
-
variables_
¶ The group of variables that will be transformed.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
Learn the optimal lambda for the BoxCox transformation.
-
transform:
Apply the BoxCox transformation.
-
fit_transform:
Fit to data, then transform it.
References
- 1
Box and Cox. “An Analysis of Transformations”. Read at a RESEARCH MEETING, 1964. https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.2517-6161.1964.tb00553.x
-
fit
(X, y=None)[source]¶ Learn the optimal lambda for the BoxCox transformation.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples. Can be the entire dataframe, not just the variables to transform.
y (pandas Series, default=None) – It is not needed in this transformer. You can pass y or None.
- Raises
TypeError –
If the input is not a Pandas DataFrame - If any of the user provided variables are not numerical
- ValueError
If there are no numerical variables in the df or the df is empty
If the variable(s) contain null values
If some variables contain zero values
- Returns
- Return type
self
-
transform
(X)[source]¶ Apply the BoxCox transformation.
- Parameters
X (Pandas DataFrame of shape = [n_samples, n_features]) – The data to be transformed.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError –
If the variable(s) contain null values - If the df has different number of features than the df used in fit() - If some variables contain negative values
- Returns
X – The dataframe with the transformed variables.
- Return type
pandas dataframe
-
class ballet.eng.external.CatBoostEncoder(verbose=0, cols=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value', random_state=None, sigma=None, a=1)[source]¶
Bases: sklearn.base.BaseEstimator, category_encoders.utils.TransformerWithTargetMixin
CatBoost coding for categorical features.
Supported targets: binomial and continuous. For polynomial target support, see PolynomialWrapper.
This is very similar to leave-one-out encoding, but calculates the values “on-the-fly”. Consequently, the values naturally vary during the training phase and it is not necessary to add random noise.
Beware, the training data have to be randomly permuted. E.g.:
# Random permutation
perm = np.random.permutation(len(X))
X = X.iloc[perm].reset_index(drop=True)
y = y.iloc[perm].reset_index(drop=True)
This is necessary because some data sets are sorted based on the target value and this coder encodes the features on-the-fly in a single pass.
- Parameters
verbose (int) – integer indicating verbosity of the output. 0 for none.
cols (list) – a list of columns to encode, if None, all string columns will be encoded.
drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.
return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).
handle_missing (str) – options are ‘error’, ‘return_nan’ and ‘value’, defaults to ‘value’, which returns the target mean.
handle_unknown (str) – options are ‘error’, ‘return_nan’ and ‘value’, defaults to ‘value’, which returns the target mean.
sigma (float) – adds normal (Gaussian) distribution noise into training data in order to decrease overfitting (testing data are untouched). sigma gives the standard deviation (spread or “width”) of the normal distribution.
a (float) – additive smoothing (it is the same variable as “m” in m-probability estimate). By default set to 1.
Example
>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = CatBoostEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(13)
memory usage: 51.5 KB
None
References
- 1
Transforming categorical features to numerical features, from
https://tech.yandex.com/catboost/doc/dg/concepts/algorithm-main-stages_cat-to-numberic-docpage/
- 2
CatBoost: unbiased boosting with categorical features, from
https://arxiv.org/abs/1706.09516
-
fit
(X, y, **kwargs)[source]¶ Fit encoder according to X and y.
- Parameters
X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.
y (array-like, shape = [n_samples]) – Target values.
- Returns
self – Returns self.
- Return type
encoder
-
get_feature_names
()[source]¶ Returns the names of all transformed / added columns.
- Returns
feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!
- Return type
list
-
transform
(X, y=None, override_return_df=False)[source]¶ Perform the transformation to new categorical data.
- Parameters
X (array-like, shape = [n_samples, n_features]) –
y (array-like, shape = [n_samples] when transform by leave one out) – None, when transform without target information (such as transform test set)
- Returns
p – Transformed values with encoding applied.
- Return type
array, shape = [n_samples, n_numeric + N]
-
class ballet.eng.external.CategoricalImputer(imputation_method='missing', fill_value='Missing', variables=None, return_object=False, ignore_format=False)[source]¶
Bases: feature_engine.imputation.base_imputer.BaseImputer
The CategoricalImputer() replaces missing data in categorical variables by an arbitrary value or by the most frequent category.
The CategoricalImputer() imputes by default only categorical variables (type ‘object’ or ‘categorical’). You can pass a list of variables to impute, or alternatively, the imputer will find and impute all categorical variables.
If you want to impute numerical variables with this transformer, there are 2 ways of doing it:
Option 1: Cast your numerical variables as object in the input dataframe, before passing it to the transformer.
Option 2: Set ignore_format=True. Note that if you do this and do not pass the list of variables to impute, the imputer will automatically select and impute all variables in the dataframe.
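A minimal usage sketch showing frequent-category imputation (the toy data and column name are illustrative):
import numpy as np
import pandas as pd
from ballet.eng.external import CategoricalImputer

X = pd.DataFrame({'colour': ['blue', np.nan, 'green', 'blue']})  # illustrative

imputer = CategoricalImputer(imputation_method='frequent')
Xt = imputer.fit_transform(X)
# the missing value is replaced by 'blue', the mode learned during fit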
- Parameters
imputation_method (str, default='missing') – Desired method of imputation. Can be ‘frequent’ for frequent category imputation or ‘missing’ to impute with an arbitrary value.
fill_value (str, int, float, default='Missing') – Only used when imputation_method=’missing’. User-defined value to replace the missing data.
variables (list, default=None) – The list of categorical variables that will be imputed. If None, the imputer will find and transform all variables of type object or categorical by default. You can also make the transformer accept numerical variables, see the parameter ignore_format below.
return_object (bool, default=False) – If working with numerical variables cast as object, decide whether to return the variables as numeric or re-cast them as object. Note that pandas will re-cast them automatically as numeric after the transformation with the mode or with an arbitrary number.
ignore_format (bool, default=False) – Whether the format in which the categorical variables are cast should be ignored. If false, the encoder will automatically select variables of type object or categorical, or check that the variables entered by the user are of type object or categorical. If True, the encoder will select all variables or accept all variables entered by the user, including those cast as numeric.
-
imputer_dict_
¶ Dictionary with most frequent category or arbitrary value per variable.
-
variables_
¶ The group of variables that will be transformed.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
Learn the most frequent category, or assign arbitrary value to variable.
-
transform:
Impute missing data.
-
fit_transform:
Fit to the data, then transform it.
-
fit
(X, y=None)[source]¶ Learn the most frequent category if the imputation method is set to frequent.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training dataset.
y (pandas Series, default=None) – y is not needed in this imputation. You can pass None or y.
- Raises
TypeError –
If the input is not a Pandas DataFrame. - If user enters non-categorical variables (unless ignore_format is True)
ValueError – If there are no categorical variables in the df or the df is empty
- Returns
- Return type
self
-
transform
(X)[source]¶ Replace missing data with the learned parameters.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The data to be transformed.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError – If the dataframe has different number of features than the df used in fit()
- Returns
X – The dataframe without missing values in the selected variables.
- Return type
pandas dataframe of shape = [n_samples, n_features]
-
class ballet.eng.external.CombineWithReferenceFeature(variables_to_combine, reference_variables, operations=['sub'], new_variables_names=None, missing_values='ignore')[source]¶
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
CombineWithReferenceFeature() applies basic mathematical operations between a group of variables and one or more reference features. It adds one or more additional features to the dataframe with the result of the operations.
In other words, CombineWithReferenceFeature() sums, multiplies, subtracts or divides a group of features to / by a group of reference variables, and returns the result as new variables in the dataframe.
For example, if we have the variables number_payments_first_quarter, number_payments_second_quarter, number_payments_third_quarter, number_payments_fourth_quarter, and total_payments, we can use CombineWithReferenceFeature() to determine the percentage of payments per quarter as follows:
transformer = CombineWithReferenceFeature(
    variables_to_combine=[
        'number_payments_first_quarter',
        'number_payments_second_quarter',
        'number_payments_third_quarter',
        'number_payments_fourth_quarter',
    ],
    reference_variables=['total_payments'],
    operations=['div'],
    new_variables_names=[
        'perc_payments_first_quarter',
        'perc_payments_second_quarter',
        'perc_payments_third_quarter',
        'perc_payments_fourth_quarter',
    ],
)
Xt = transformer.fit_transform(X)
The transformed X, Xt, will contain the additional features indicated in the new_variables_names list plus the original set of variables.
- Parameters
variables_to_combine (list) – The list of numerical variables to be combined with the reference variables.
reference_variables (list) – The list of numerical reference variables that will be added to, multiplied with, or subtracted from the variables_to_combine, or used as denominator for division.
operations (list, default=['sub']) –
The list of basic mathematical operations to be used in transformation.
If None, all of [‘sub’, ‘div’,’add’,’mul’] will be performed. Alternatively, you can enter a list of operations to carry out. Each operation should be a string and must be one of the elements in [‘sub’, ‘div’,’add’, ‘mul’].
Each operation will result in a new variable that will be added to the transformed dataset.
new_variables_names (list, default=None) –
Names of the newly created variables. You can enter a list with the names for the newly created features (recommended). You must enter as many names as new features created by the transformer. The number of new features is the number of operations times the number of reference variables times the number of variables to combine.
Thus, if you want to perform 2 operations, sub and div, combining 4 variables with 2 reference variables, you should enter 2 x 4 x 2 = 16 new variable names.
The names entered by the user should follow the order in which the operations are performed by the transformer. The transformer will first carry out ‘sub’, then ‘div’, then ‘add’ and finally ‘mul’.
If new_variable_names is None, the transformer will assign an arbitrary name to the newly created features.
missing_values (string, default='ignore') – Indicates if missing values should be ignored or raised. If ‘ignore’, the transformer will ignore missing data when transforming the data. If ‘raise’ the transformer will return an error if the training or the datasets to transform contain missing values.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
This transformer does not learn parameters.
-
transform:
Combine the variables with the mathematical operations.
-
fit_transform:
Fit to the data, then transform it.
Notes
Although the transformer in essence allows us to combine any feature with any of the allowed mathematical operations, its use is intended mostly for the creation of new features based on some domain knowledge. Typical examples within the financial sector are:
Ratio between income and debt to create the debt_to_income_ratio.
Subtraction of rent from income to obtain the disposable_income.
-
fit
(X, y=None)[source]¶ This transformer does not learn any parameter. Performs dataframe checks.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples. Can be the entire dataframe, not just the variables to transform.
y (pandas Series, or np.array. Default=None.) – It is not needed in this transformer. You can pass y or None.
- Raises
TypeError –
If the input is not a Pandas DataFrame - If any user provided variables are not numerical
ValueError – If any of the reference variables contain null values and the mathematical operation is ‘div’.
- Returns
- Return type
self
-
transform
(X)[source]¶ Combine the variables with the mathematical operations.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The data to transform.
- Returns
X – The dataframe with the operations results added as columns.
- Return type
Pandas dataframe, shape = [n_samples, n_features + n_operations]
-
class ballet.eng.external.CountEncoder(verbose=0, cols=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value', min_group_size=None, combine_min_nan_groups=None, min_group_name=None, normalize=False)[source]¶
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
-
fit
(X, y=None, **kwargs)[source]¶ Fit encoder according to X.
- Parameters
X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.
y (array-like, shape = [n_samples]) – Target values.
- Returns
self – Returns self.
- Return type
encoder
-
get_feature_names
()[source]¶ Returns the names of all transformed / added columns.
- Returns
feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!
- Return type
list
-
transform
(X, y=None, override_return_df=False)[source]¶ Perform the transformation to new categorical data.
- Parameters
X (array-like, shape = [n_samples, n_features]) –
y (array-like, shape = [n_samples]) –
- Returns
p – Transformed values with encoding applied.
- Return type
array, shape = [n_samples, n_numeric + N]
-
-
class ballet.eng.external.CountFrequencyEncoder(encoding_method='count', variables=None, ignore_format=False)[source]¶
Bases: feature_engine.encoding.base_encoder.BaseCategoricalTransformer
The CountFrequencyEncoder() replaces categories by either the count or the percentage of observations per category.
For example in the variable colour, if 10 observations are blue, blue will be replaced by 10. Alternatively, if 10% of the observations are blue, blue will be replaced by 0.1.
The CountFrequencyEncoder() will encode only categorical variables by default (type ‘object’ or ‘categorical’). You can pass a list of variables to encode. Alternatively, the encoder will find and encode all categorical variables (type ‘object’ or ‘categorical’).
With ignore_format=True you have the option to encode numerical variables as well. The procedure is identical, you can either enter the list of variables to encode, or the transformer will automatically select all variables.
The encoder first maps the categories to the counts or frequencies for each variable (fit). The encoder then replaces the categories with those numbers (transform).
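A minimal usage sketch showing frequency encoding (the toy data and column name are illustrative):
import pandas as pd
from ballet.eng.external import CountFrequencyEncoder

X = pd.DataFrame({'colour': ['blue'] * 3 + ['red'] * 2 + ['green']})  # illustrative

encoder = CountFrequencyEncoder(encoding_method='frequency')
Xt = encoder.fit_transform(X)
# 'blue' -> 0.5, 'red' -> 0.33..., 'green' -> 0.16...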
- Parameters
encoding_method (str, default='count') –
Desired method of encoding.
’count’: number of observations per category
’frequency’: percentage of observations per category
variables (list, default=None) – The list of categorical variables that will be encoded. If None, the encoder will find and transform all variables of type object or categorical by default. You can also make the transformer accept numerical variables, see the next parameter.
ignore_format (bool, default=False) – Whether the format in which the categorical variables are cast should be ignored. If false, the encoder will automatically select variables of type object or categorical, or check that the variables entered by the user are of type object or categorical. If True, the encoder will select all variables or accept all variables entered by the user, including those cast as numeric.
-
encoder_dict_
¶ Dictionary with the count or frequency per category, per variable.
-
variables_
¶ The group of variables that will be transformed.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
Learn the count or frequency per category, per variable.
-
transform:
Encode the categories to numbers.
-
fit_transform:
Fit to the data, then transform it.
-
inverse_transform:
Encode the numbers into the original categories.
Notes
NAN are introduced when encoding categories that were not present in the training dataset. If this happens, try grouping infrequent categories using the RareLabelEncoder().
See also
feature_engine.encoding.RareLabelEncoder
-
fit
(X, y=None)[source]¶ Learn the counts or frequencies which will be used to replace the categories.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training dataset. Can be the entire dataframe, not just the variables to be transformed.
y (pandas Series, default = None) – y is not needed in this encoder. You can pass y or None.
- Raises
TypeError –
If the input is not a Pandas DataFrame. - If the user enters non-categorical variables (unless ignore_format is True)
ValueError –
If there are no categorical variables in the df or the df is empty - If the variable(s) contain null values
- Returns
- Return type
self
-
inverse_transform
(X)[source]¶ Convert the encoded variable back to the original values.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The transformed dataframe.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError –
If the variable(s) contain null values - If the df has different number of features than the df used in fit()
- Returns
X – The un-transformed dataframe, with the categorical variables containing the original values.
- Return type
pandas dataframe of shape = [n_samples, n_features]
-
transform
(X)[source]¶ Replace categories with the learned parameters.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The dataset to transform.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError –
If the variable(s) contain null values - If the df has different number of features than the df used in fit()
Warning – If after encoding, NAN were introduced.
- Returns
X – The dataframe containing the categories replaced by numbers.
- Return type
pandas dataframe of shape = [n_samples, n_features]
-
class ballet.eng.external.DFSTransformer(target_entity=None, agg_primitives=None, trans_primitives=None, allowed_paths=None, max_depth=2, ignore_entities=None, ignore_variables=None, seed_features=None, drop_contains=None, drop_exact=None, where_primitives=None, max_features=-1, verbose=False)[source]¶
Bases: sklearn.base.TransformerMixin
Transformer with a scikit-learn interface, for use in pipelines.
-
fit
(X, y=None)[source]¶ Wrapper for DFS
Calculates a list of features given a dictionary of entities and a list of relationships. Alternatively, an EntitySet can be passed instead of the entities and relationships.
- Parameters
X – (ft.Entityset or tuple): Entityset to calculate features on. If a tuple is passed it can take one of these forms: (entityset, cutoff_time_dataframe), (entities, relationships), or ((entities, relationships), cutoff_time_dataframe)
y – (iterable): Training targets
See also
synthesis.dfs()
-
transform
(X)[source]¶ Wrapper for calculate_feature_matrix
Calculates a feature matrix for the given input data and calculation times.
- Parameters
X – (ft.Entityset or tuple): Entityset to calculate features on. If a tuple is passed it can take one of these forms: (entityset, cutoff_time_dataframe), (entities, relationships), or ((entities, relationships), cutoff_time_dataframe)
See also
computational_backends.calculate_feature_matrix()
-
-
class ballet.eng.external.DecisionTreeDiscretiser(variables=None, cv=3, scoring='neg_mean_squared_error', param_grid=None, regression=True, random_state=None)[source]¶
Bases: feature_engine.base_transformers.BaseNumericalTransformer
The DecisionTreeDiscretiser() replaces continuous numerical variables by discrete, finite, values estimated by a decision tree.
The method is inspired by the following article from the winners of the KDD 2009 competition: http://www.mtome.com/Publications/CiML/CiML-v3-book.pdf
The DecisionTreeDiscretiser() works only with numerical variables. A list of variables can be passed as an argument. Alternatively, the discretiser will automatically select all numerical variables.
The DecisionTreeDiscretiser() first trains a decision tree for each variable.
The DecisionTreeDiscretiser() then transforms the variables, that is, makes predictions based on the variable values, using the trained decision tree.
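A minimal usage sketch for a regression target (the synthetic data are illustrative):
import numpy as np
import pandas as pd
from ballet.eng.external import DecisionTreeDiscretiser

rng = np.random.default_rng(0)
X = pd.DataFrame({'x': rng.uniform(0, 100, size=200)})   # illustrative feature
y = pd.Series(2 * X['x'] + rng.normal(0, 5, size=200))   # illustrative target

discretiser = DecisionTreeDiscretiser(cv=3, regression=True)
Xt = discretiser.fit_transform(X, y)
# 'x' now holds the finite set of tree predictions instead of the raw values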
- Parameters
variables (list, default=None) – The list of numerical variables to transform. If None, the discretiser will automatically select all numerical variables.
cv (int, default=3) – Desired number of cross-validation folds to be used to fit the decision tree.
scoring (str, default='neg_mean_squared_error') – Desired metric to optimise the performance for the tree. Comes from sklearn.metrics. See DecisionTreeRegressor or DecisionTreeClassifier model evaluation documentation for more options: https://scikit-learn.org/stable/modules/model_evaluation.html
param_grid (dictionary, default=None) –
The list of parameters over which the decision tree should be optimised during the grid search. The param_grid can contain any of the permitted parameters for Scikit-learn’s DecisionTreeRegressor() or DecisionTreeClassifier().
If None, then param_grid = {‘max_depth’: [1, 2, 3, 4]}
regression (boolean, default=True) – Indicates whether the discretiser should train a regression or a classification decision tree.
random_state (int, default=None) – The random_state to initialise the training of the decision tree. It is one of the parameters of the Scikit-learn’s DecisionTreeRegressor() or DecisionTreeClassifier(). For reproducibility it is recommended to set the random_state to an integer.
-
binner_dict_
¶ Dictionary containing the fitted tree per variable.
-
scores_dict_
¶ Dictionary with the score of the best decision tree, over the train set.
-
variables_
¶ The variables to discretise.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
Fit a decision tree per variable.
-
transform:
Replace continuous values by the predictions of the decision tree.
-
fit_transform:
Fit to the data, then transform it.
See also
sklearn.tree.DecisionTreeClassifier
,sklearn.tree.DecisionTreeRegressor
References
- 1
Niculescu-Mizil, et al. “Winning the KDD Cup Orange Challenge with Ensemble Selection”. JMLR: Workshop and Conference Proceedings 7: 23-34. KDD 2009 http://proceedings.mlr.press/v7/niculescu09/niculescu09.pdf
-
fit
(X, y)[source]¶ Fit the decision trees. One tree per variable to be transformed.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training dataset. Can be the entire dataframe, not just the variables to be transformed.
y (pandas series.) – Target variable. Required to train the decision tree.
- Raises
TypeError –
If the input is not a Pandas DataFrame - If any of the user provided variables are not numerical
ValueError –
If there are no numerical variables in the df or the df is empty - If the variable(s) contain null values
- Returns
- Return type
self
-
transform
(X)[source]¶ Replaces the original variable with the predictions of the tree. The tree outcome is finite, i.e., discrete.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The input samples.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError –
If the variable(s) contain null values - If the dataframe is not of the same size as the one used in fit()
- Returns
X_transformed – The dataframe with transformed variables.
- Return type
pandas dataframe of shape = [n_samples, n_features]
-
class ballet.eng.external.DecisionTreeEncoder(encoding_method='arbitrary', cv=3, scoring='neg_mean_squared_error', param_grid=None, regression=True, random_state=None, variables=None, ignore_format=False)[source]¶
Bases: feature_engine.encoding.base_encoder.BaseCategoricalTransformer
The DecisionTreeEncoder() encodes categorical variables with predictions of a decision tree.
The encoder first fits a decision tree using a single feature and the target (fit). And then replaces the values of the original feature by the predictions of the tree (transform). The transformer will train a Decision tree per every feature to encode.
The motivation is to try and create monotonic relationships between the categorical variables and the target.
Under the hood, the categorical variable will be first encoded into integers with the OrdinalCategoricalEncoder(). The integers can be assigned arbitrarily to the categories or following the mean value of the target in each category. Then a decision tree will fit the resulting numerical variable to predict the target variable. Finally, the original categorical variable values will be replaced by the predictions of the decision tree.
The DecisionTreeEncoder() will encode only categorical variables by default (type ‘object’ or ‘categorical’). You can pass a list of variables to encode or the encoder will find and encode all categorical variables. But with ignore_format=True you have the option to encode numerical variables as well. In this case, you can either enter the list of variables to encode, or the transformer will automatically select all variables.
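A minimal usage sketch with a regression target (the toy data and column names are illustrative):
import pandas as pd
from ballet.eng.external import DecisionTreeEncoder

X = pd.DataFrame({'colour': ['blue', 'blue', 'red', 'green', 'red', 'green']})  # illustrative
y = pd.Series([10, 12, 30, 20, 31, 22])                                         # illustrative

encoder = DecisionTreeEncoder(regression=True, cv=2)
Xt = encoder.fit_transform(X, y)
# each category is replaced by the prediction of a tree fitted on that feature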
- Parameters
encoding_method (str, default='arbitrary') –
The categorical encoding method that will be used to encode the original categories to numerical values.
’ordered’: the categories are numbered in ascending order according to the target mean value per category.
’arbitrary’ : categories are numbered arbitrarily.
cv (int, default=3) – Desired number of cross-validation folds to be used to fit the decision tree.
scoring (str, default='neg_mean_squared_error') – Desired metric to optimise the performance for the decision tree. Comes from sklearn.metrics. See the DecisionTreeRegressor or DecisionTreeClassifier model evaluation documentation for more options: https://scikit-learn.org/stable/modules/model_evaluation.html
param_grid (dictionary, default=None) –
The list of parameters over which the decision tree should be optimised during the grid search. The param_grid can contain any of the permitted parameters for Scikit-learn’s DecisionTreeRegressor() or DecisionTreeClassifier().
If None, then param_grid = {‘max_depth’: [1, 2, 3, 4]}.
regression (boolean, default=True) – Indicates whether the encoder should train a regression or a classification decision tree.
random_state (int, default=None) – The random_state to initialise the training of the decision tree. It is one of the parameters of the Scikit-learn’s DecisionTreeRegressor() or DecisionTreeClassifier(). For reproducibility it is recommended to set the random_state to an integer.
variables (list, default=None) – The list of categorical variables that will be encoded. If None, the encoder will find and transform all variables of type object or categorical by default. You can also make the transformer accept numerical variables, see the next parameter.
ignore_format (bool, default=False) – Whether the format in which the categorical variables are cast should be ignored. If false, the encoder will automatically select variables of type object or categorical, or check that the variables entered by the user are of type object or categorical. If True, the encoder will select all variables or accept all variables entered by the user, including those cast as numeric.
-
encoder_
¶ sklearn Pipeline containing the ordinal encoder and the decision tree.
-
variables_
¶ The group of variables that will be transformed.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
Fit a decision tree per variable.
-
transform:
Replace categorical variable by the predictions of the decision tree.
-
fit_transform:
Fit to the data, then transform it.
Notes
The authors originally designed this method to work with numerical variables. We can replace numerical variables by the predictions of a decision tree utilising the DecisionTreeDiscretiser().
NAN are introduced when encoding categories that were not present in the training dataset. If this happens, try grouping infrequent categories using the RareLabelEncoder().
See also
sklearn.ensemble.DecisionTreeRegressor
,sklearn.ensemble.DecisionTreeClassifier
,feature_engine.discretisation.DecisionTreeDiscretiser
,feature_engine.encoding.RareLabelEncoder
,feature_engine.encoding.OrdinalEncoder
References
- 1
Niculescu-Mizil, et al. “Winning the KDD Cup Orange Challenge with Ensemble Selection”. JMLR: Workshop and Conference Proceedings 7: 23-34. KDD 2009 http://proceedings.mlr.press/v7/niculescu09/niculescu09.pdf
-
fit
(X, y=None)[source]¶ Fit a decision tree per variable.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples. Can be the entire dataframe, not just the categorical variables.
y (pandas series.) – The target variable. Required to train the decision tree and for ordered ordinal encoding.
- Raises
TypeError –
If the input is not a Pandas DataFrame. - If the user enters non-categorical variables (unless ignore_format is True)
ValueError –
If there are no categorical variables in the df or the df is empty - If the variable(s) contain null values
- Returns
- Return type
self
-
transform
(X)[source]¶ Replace categorical variable by the predictions of the decision tree.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The input samples.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError –
If the variable(s) contain null values - If dataframe is not of same size as that used in fit()
Warning – If after encoding, NAN were introduced.
- Returns
X – Dataframe with variables encoded with decision tree predictions.
- Return type
pandas dataframe of shape = [n_samples, n_features]
-
class ballet.eng.external.DifferenceTransformer(period=1)[source]¶
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
-
needs_refit = True¶
-
-
class ballet.eng.external.DropMissingData(missing_only=True, variables=None)[source]¶
Bases: feature_engine.imputation.base_imputer.BaseImputer
The DropMissingData() will delete rows containing missing values. It provides similar functionality to pandas.dropna().
It works for both numerical and categorical variables. You can enter the list of variables for which missing values should be removed from the dataframe. Alternatively, the imputer will automatically select all variables in the dataframe.
Note The transformer will first select all variables, or all user-entered variables, and if missing_only=True, it will re-select from that group only those that show missing data during fit, that is, in the train set.
- Parameters
missing_only (bool, default=True) – If true, missing observations will be dropped only for the variables that have missing data in the train set, during fit. If False, observations with NA will be dropped from all variables indicated by the user.
variables (list, default=None) – The list of variables to be imputed. If None, the imputer will find and select all variables in the dataframe.
-
variables_
¶ List of variables for which the rows with NA will be deleted.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
Learn the variables for which the rows with NA will be deleted
-
transform:
Remove observations with NA
-
fit_transform:
Fit to the data, then transform it.
-
return_na_data:
Returns the dataframe with the rows that contain NA.
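A minimal usage sketch (the toy data are illustrative, not part of the original reference):
>>> import numpy as np
>>> import pandas as pd
>>> from ballet.eng.external import DropMissingData
>>> X = pd.DataFrame({"age": [25.0, np.nan, 40.0, 31.0], "city": ["NYC", "LA", None, "SF"]})
>>> imputer = DropMissingData(missing_only=True).fit(X)
>>> X_complete = imputer.transform(X)    # rows containing NA are dropped
>>> na_rows = imputer.return_na_data(X)  # the dropped rows, e.g. to store separately in production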
-
fit
(X, y=None)[source]¶ Learn the variables for which the rows with NA will be deleted.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training dataset.
y (pandas Series, default=None) – y is not needed in this imputation. You can pass None or y.
- Raises
TypeError – If the input is not a Pandas DataFrame
- Returns
- Return type
self
-
return_na_data
(X)[source]¶ Returns the subset of the dataframe which contains the rows with missing values. This method could be useful in production, in case we want to store the observations that will not be fed into the model.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The dataframe to be transformed.
- Raises
TypeError – If the input is not a Pandas DataFrame
- Returns
X – The dataframe containing only the rows with missing values.
- Return type
pandas dataframe of shape = [obs_with_na, features]
-
transform
(X)[source]¶ Remove rows with missing values.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The dataframe to be transformed.
- Returns
X_transformed – The complete case dataframe for the selected variables, of shape [n_samples - rows_with_na, n_features]
- Return type
pandas dataframe
-
class
ballet.eng.external.
EndTailImputer
(imputation_method='gaussian', tail='right', fold=3, variables=None)[source]¶ Bases:
feature_engine.imputation.base_imputer.BaseImputer
The EndTailImputer() replaces missing data by a value at either tail of the distribution. It works only with numerical variables.
You can indicate the variables to be imputed in a list. Alternatively, the EndTailImputer() will automatically find and select all variables of type numeric.
The imputer first calculates the values at the end of the distribution for each variable (fit). The values at the end of the distribution are determined using the Gaussian limits, the IQR proximity rule limits, or a factor of the maximum value:
- Gaussian limits:
right tail: mean + 3*std
left tail: mean - 3*std
- IQR limits:
right tail: 75th quantile + 3*IQR
left tail: 25th quantile - 3*IQR
where IQR is the inter-quartile range = 75th quantile - 25th quantile
- Maximum value:
right tail: max * 3
left tail: not applicable
You can change the factor that multiplies the std, IQR or the maximum value using the parameter ‘fold’ (we used fold=3 in the examples above).
The imputer then replaces the missing data with the estimated values (transform).
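A minimal sketch of the Gaussian right-tail rule described above (toy data, illustrative only):
>>> import numpy as np
>>> import pandas as pd
>>> from ballet.eng.external import EndTailImputer
>>> X = pd.DataFrame({"income": [20.0, 25.0, 30.0, np.nan, 35.0, 40.0]})
>>> imputer = EndTailImputer(imputation_method="gaussian", tail="right", fold=3).fit(X)
>>> print(imputer.imputer_dict_)  # {'income': mean + 3 * std of the observed values}
>>> X_t = imputer.transform(X)    # the NaN is replaced by that right-tail value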
- Parameters
imputation_method (str, default=gaussian) –
Method to be used to find the replacement values. Can take ‘gaussian’, ‘iqr’ or ‘max’.
gaussian: the imputer will use the Gaussian limits to find the values to replace missing data.
iqr: the imputer will use the IQR limits to find the values to replace missing data.
max: the imputer will use the maximum values to replace missing data. Note that if ‘max’ is passed, the parameter ‘tail’ is ignored.
tail (str, default=right) – Indicates if the values to replace missing data should be selected from the right or left tail of the variable distribution. Can take values ‘left’ or ‘right’.
fold (int, default=3) – Factor to multiply the std, the IQR or the Max values. Recommended values are 2 or 3 for Gaussian, or 1.5 or 3 for IQR.
variables (list, default=None) – The list of variables to be imputed. If None, the imputer will find and select all variables of type numeric.
-
imputer_dict_
¶ Dictionary with the values at the end of the distribution per variable.
-
variables_
¶ The group of variables that will be transformed.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
Learn values to replace missing data.
-
transform:
Impute missing data.
-
fit_transform:
Fit to the data, then transform it.
-
fit
(X, y=None)[source]¶ Learn the values at the end of the variable distribution.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training dataset.
y (pandas Series, default=None) – y is not needed in this imputation. You can pass None or y.
- Raises
TypeError –
If the input is not a Pandas DataFrame - If any of the user provided variables are not numerical
ValueError – If there are no numerical variables in the df or the df is empty
- Returns
- Return type
self
-
transform
(X)[source]¶ Replace missing data with the learned parameters.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The data to be transformed.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError – If the dataframe has different number of features than the df used in fit()
- Returns
X – The dataframe without missing values in the selected variables.
- Return type
pandas dataframe of shape = [n_samples, n_features]
-
class
ballet.eng.external.
EqualFrequencyDiscretiser
(variables=None, q=10, return_object=False, return_boundaries=False)[source]¶ Bases:
feature_engine.base_transformers.BaseNumericalTransformer
The EqualFrequencyDiscretiser() divides continuous numerical variables into contiguous equal frequency intervals, that is, intervals that contain approximately the same proportion of observations.
The interval limits are determined using pandas.qcut(), in other words, the interval limits are determined by the quantiles. The number of intervals, i.e., the number of quantiles in which the variable should be divided is determined by the user.
The EqualFrequencyDiscretiser() works only with numerical variables. A list of variables can be passed as argument. Alternatively, the discretiser will automatically select and transform all numerical variables.
The EqualFrequencyDiscretiser() first finds the boundaries for the intervals or quantiles for each variable.
Then it transforms the variables, that is, it sorts the values into the intervals.
- Parameters
variables (list, default=None) – The list of numerical variables that will be discretised. If None, the EqualFrequencyDiscretiser() will select all numerical variables.
q (int, default=10) – Desired number of equal frequency intervals / bins. In other words the number of quantiles in which the variables should be divided.
return_object (bool, default=False) –
Whether the discrete variable should be returned cast as numeric or as object. If you would like to proceed with the engineering of the variable as if it was categorical, use True. Alternatively, keep the default as False.
Categorical encoders in Feature-engine work only with variables of type object, thus, if you wish to encode the returned bins, set return_object to True.
return_boundaries (bool, default=False) – Whether the output should be the interval boundaries. If True, it returns the interval boundaries. If False, it returns integers.
-
binner_dict_
¶ Dictionary with the interval limits per variable.
-
variables_
¶ The variables to discretise.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
Find the interval limits.
-
transform:
Sort continuous variable values into the intervals.
-
fit_transform:
Fit to the data, then transform it.
See also
pandas.qcut
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.qcut.html
References
- 1
Kotsiantis and Pintelas, “Data preprocessing for supervised leaning,” International Journal of Computer Science, vol. 1, pp. 111–117, 2006.
- 2
Dong. “Beating Kaggle the easy way”. Master Thesis. https://www.ke.tu-darmstadt.de/lehre/arbeiten/studien/2015/Dong_Ying.pdf
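A minimal usage sketch (toy data, illustrative only):
>>> import numpy as np
>>> import pandas as pd
>>> from ballet.eng.external import EqualFrequencyDiscretiser
>>> X = pd.DataFrame({"age": np.random.default_rng(0).uniform(18, 90, 200)})
>>> disc = EqualFrequencyDiscretiser(q=4, variables=["age"]).fit(X)
>>> print(disc.binner_dict_["age"])   # quantile-based interval limits
>>> X_t = disc.transform(X)           # integers 0-3: the bin each observation falls into
>>> print(X_t["age"].value_counts())  # roughly 50 observations per bin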
-
fit
(X, y=None)[source]¶ Learn the limits of the equal frequency intervals.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training dataset. Can be the entire dataframe, not just the variables to be transformed.
y (None) – y is not needed in this encoder. You can pass y or None.
- Raises
TypeError –
If the input is not a Pandas DataFrame - If any of the user provided variables are not numerical
ValueError –
If there are no numerical variables in the df or the df is empty - If the variable(s) contain null values
- Returns
- Return type
self
-
transform
(X)[source]¶ Sort the variable values into the intervals.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The data to transform.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError –
If the variable(s) contain null values - If the dataframe is not of the same size as the one used in fit()
- Returns
X – The transformed data with the discrete variables.
- Return type
pandas dataframe of shape = [n_samples, n_features]
-
class
ballet.eng.external.
EqualWidthDiscretiser
(variables=None, bins=10, return_object=False, return_boundaries=False)[source]¶ Bases:
feature_engine.base_transformers.BaseNumericalTransformer
The EqualWidthDiscretiser() divides continuous numerical variables into intervals of the same width, that is, equidistant intervals. Note that the proportion of observations per interval may vary.
The size of the interval is calculated as:
\[( max(X) - min(X) ) / bins\]
where bins, which is the number of intervals, should be determined by the user.
The interval limits are determined using pandas.cut(). The number of intervals in which the variable should be divided must be indicated by the user.
The EqualWidthDiscretiser() works only with numerical variables. A list of variables can be passed as argument. Alternatively, the discretiser will automatically select all numerical variables.
The EqualWidthDiscretiser() first finds the boundaries for the intervals for each variable. Then, it transforms the variables, that is, sorts the values into the intervals.
- Parameters
variables (list, default=None) – The list of numerical variables to transform. If None, the discretiser will automatically select all numerical type variables.
bins (int, default=10) – Desired number of equal width intervals / bins.
return_object (bool, default=False) –
Whether the discrete variable should be returned cast as numeric or as object. If you would like to proceed with the engineering of the variable as if it was categorical, use True. Alternatively, keep the default as False.
Categorical encoders in Feature-engine work only with variables of type object, thus, if you wish to encode the returned bins, set return_object to True.
return_boundaries (bool, default=False) – Whether the output should be the interval boundaries. If True, it returns the interval boundaries. If False, it returns integers.
-
binner_dict_
¶ Dictionary with the interval limits per variable.
-
variables_
¶ The variables to be discretised.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
Find the interval limits.
-
transform:
Sort continuous variable values into the intervals.
-
fit_transform:
Fit to the data, then transform it.
See also
pandas.cut
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.cut.html
References
- 1
Kotsiantis and Pintelas, “Data preprocessing for supervised leaning,” International Journal of Computer Science, vol. 1, pp. 111–117, 2006.
- 2
Dong. “Beating Kaggle the easy way”. Master Thesis. https://www.ke.tu-darmstadt.de/lehre/arbeiten/studien/2015/Dong_Ying.pdf
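A minimal usage sketch (toy data, illustrative only):
>>> import numpy as np
>>> import pandas as pd
>>> from ballet.eng.external import EqualWidthDiscretiser
>>> X = pd.DataFrame({"age": np.random.default_rng(0).uniform(18, 90, 200)})
>>> disc = EqualWidthDiscretiser(bins=5, variables=["age"]).fit(X)
>>> print(disc.binner_dict_["age"])   # equidistant limits of width (max - min) / 5
>>> X_t = disc.transform(X)
>>> print(X_t["age"].value_counts())  # counts per bin may differ, unlike equal-frequency binning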
-
fit
(X, y=None)[source]¶ Learn the boundaries of the equal width intervals / bins for each variable.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training dataset. Can be the entire dataframe, not just the variables to be transformed.
y (None) – y is not needed in this encoder. You can pass y or None.
- Raises
TypeError –
If the input is not a Pandas DataFrame - If any of the user provided variables are not numerical
ValueError –
If there are no numerical variables in the df or the df is empty - If the variable(s) contain null values
- Returns
- Return type
self
-
transform
(X)[source]¶ Sort the variable values into the intervals.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The data to transform.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError –
If the variable(s) contain null values - If the dataframe is not of the same size as the one used in fit()
- Returns
X – The transformed data with the discrete variables.
- Return type
pandas dataframe of shape = [n_samples, n_features]
-
class
ballet.eng.external.
FeatureAugmenter
(default_fc_parameters=None, kind_to_fc_parameters=None, column_id=None, column_sort=None, column_kind=None, column_value=None, timeseries_container=None, chunksize=None, n_jobs=1, show_warnings=False, disable_progressbar=False, impute_function=None, profile=False, profiling_filename='profile.txt', profiling_sorting='cumulative')[source]¶ Bases:
sklearn.base.BaseEstimator
,sklearn.base.TransformerMixin
Sklearn-compatible estimator, for calculating and adding many features calculated from a given time series to the data. It is basically a wrapper around
extract_features()
. The features include basic ones like min, max or median, and advanced features like Fourier transformations or statistical tests. For a list of all possible features, see the module
feature_calculators
. The column name of each added feature contains the name of the function of that module, which was used for the calculation. For this estimator, two datasets play a crucial role:
the time series container with the timeseries data. This container (for the format see data-formats-label) contains the data which is used for calculating the features. It must be groupable by ids which are used to identify which feature should be attached to which row in the second dataframe.
the input data X, where the features will be added to. Its rows are identified by the index, and each index in X must be present as an id in the time series container.
Imagine the following situation: You want to classify 10 different financial shares and you have their development in the last year as a time series. You would then start by creating features from the metainformation of the shares, e.g. how long they have been on the market, filling up a table - the features of one stock in one row. This is the input array X, with each row identified by e.g. the stock name as an index.
>>> df = pandas.DataFrame(index=["AAA", "BBB", ...])
>>> # Fill in the information of the stocks
>>> df["started_since_days"] = ...  # add a feature
You can then extract all the features from the time development of the shares, by using this estimator. The time series container must include a column of ids, which are the same as the index of X.
>>> time_series = read_in_timeseries()  # get the development of the shares
>>> from tsfresh.transformers import FeatureAugmenter
>>> augmenter = FeatureAugmenter(column_id="id")
>>> augmenter.set_timeseries_container(time_series)
>>> df_with_time_series_features = augmenter.transform(df)
The settings for the feature calculation can be controlled with the settings object. If you pass
None
, the default settings are used. Please refer to ComprehensiveFCParameters
for more information. This estimator does not select the relevant features, but calculates and adds all of them to the DataFrame. See the
RelevantFeatureAugmenter
for calculating and selecting features. For a description of what the parameters column_id, column_sort, column_kind and column_value mean, please see
extraction
.-
fit
(X=None, y=None)[source]¶ The fit function is not needed for this estimator. It just does nothing and is here for compatibility reasons.
- Parameters
X (Any) – Unneeded.
y (Any) – Unneeded.
- Returns
The estimator instance itself
- Return type
-
set_timeseries_container
(timeseries_container)[source]¶ Set the timeseries, with which the features will be calculated. For a format of the time series container, please refer to
extraction
. The timeseries must contain the same indices as the later DataFrame, to which the features will be added (the one you will pass to transform()
). You can call this function as often as you like, to change the timeseries later (e.g. if you want to extract for different ids).
- Parameters
timeseries_container (pandas.DataFrame or dict) – The timeseries as a pandas.DataFrame or a dict. See
extraction
for the format.- Returns
None
- Return type
None
-
transform
(X)[source]¶ Add the features calculated using the timeseries_container and add them to the corresponding rows in the input pandas.DataFrame X.
To save some computing time, you should include in the container only those time series that you need. You can set the timeseries container with the method
set_timeseries_container()
.- Parameters
X (pandas.DataFrame) – the DataFrame to which the calculated timeseries features will be added. This is not the dataframe with the timeseries itself.
- Returns
The input DataFrame, but with added features.
- Return type
pandas.DataFrame
-
class
ballet.eng.external.
FourierTransformer
(period=10, max_order=10, step_size=1)[source]¶ Bases:
sklearn.base.BaseEstimator
,sklearn.base.TransformerMixin
-
class
ballet.eng.external.
FunctionTransformer
(func=None, inverse_func=None, *, validate=False, accept_sparse=False, check_inverse=True, kw_args=None, inv_kw_args=None)[source]¶ Bases:
sklearn.base.TransformerMixin
,sklearn.base.BaseEstimator
Constructs a transformer from an arbitrary callable.
A FunctionTransformer forwards its X (and optionally y) arguments to a user-defined function or function object and returns the result of this function. This is useful for stateless transformations such as taking the log of frequencies, doing custom scaling, etc.
Note: If a lambda is used as the function, then the resulting transformer will not be pickleable.
New in version 0.17.
Read more in the User Guide.
- Parameters
func (callable, default=None) – The callable to use for the transformation. This will be passed the same arguments as transform, with args and kwargs forwarded. If func is None, then func will be the identity function.
inverse_func (callable, default=None) – The callable to use for the inverse transformation. This will be passed the same arguments as inverse transform, with args and kwargs forwarded. If inverse_func is None, then inverse_func will be the identity function.
validate (bool, default=False) –
Indicate that the input X array should be checked before calling
func
. The possibilities are:If False, there is no input validation.
If True, then X will be converted to a 2-dimensional NumPy array or sparse matrix. If the conversion is not possible an exception is raised.
Changed in version 0.22: The default of
validate
changed from True to False.accept_sparse (bool, default=False) – Indicate that func accepts a sparse matrix as input. If validate is False, this has no effect. Otherwise, if accept_sparse is false, sparse matrix inputs will cause an exception to be raised.
check_inverse (bool, default=True) –
Whether to check that func followed by inverse_func leads to the original inputs. It can be used for a sanity check, raising a warning when the condition is not fulfilled.
New in version 0.20.
kw_args (dict, default=None) –
Dictionary of additional keyword arguments to pass to func.
New in version 0.18.
inv_kw_args (dict, default=None) –
Dictionary of additional keyword arguments to pass to inverse_func.
New in version 0.18.
-
n_features_in_
¶ Number of features seen during fit. Defined only when validate=True.
New in version 0.24.
- Type
int
-
feature_names_in_
¶ Names of features seen during fit. Defined only when validate=True and X has feature names that are all strings.
New in version 1.0.
- Type
ndarray of shape (n_features_in_,)
See also
MaxAbsScaler
Scale each feature by its maximum absolute value.
StandardScaler
Standardize features by removing the mean and scaling to unit variance.
LabelBinarizer
Binarize labels in a one-vs-all fashion.
MultiLabelBinarizer
Transform between iterable of iterables and a multilabel format.
Examples
>>> import numpy as np
>>> from sklearn.preprocessing import FunctionTransformer
>>> transformer = FunctionTransformer(np.log1p)
>>> X = np.array([[0, 1], [2, 3]])
>>> transformer.transform(X)
array([[0.    , 0.6931...],
       [1.0986..., 1.3862...]])
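A further sketch, not from the original reference, showing inverse_func together with check_inverse (reusing X and np from the example above):
>>> transformer = FunctionTransformer(np.log1p, inverse_func=np.expm1, check_inverse=True)
>>> X_round_trip = transformer.inverse_transform(transformer.transform(X))
>>> np.allclose(X, X_round_trip)
True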
-
fit
(X, y=None)[source]¶ Fit transformer by checking X.
If
validate
isTrue
,X
will be checked.- Parameters
X (array-like, shape (n_samples, n_features)) – Input array.
y (Ignored) – Not used, present here for API consistency by convention.
- Returns
self – FunctionTransformer class instance.
- Return type
object
-
class
ballet.eng.external.
GLMMEncoder
(verbose=0, cols=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value', random_state=None, randomized=False, sigma=0.05, binomial_target=None)[source]¶ Bases:
sklearn.base.BaseEstimator
,category_encoders.utils.TransformerWithTargetMixin
Generalized linear mixed model.
Supported targets: binomial and continuous. For polynomial target support, see PolynomialWrapper.
This is a supervised encoder similar to TargetEncoder or MEstimateEncoder, but it has some advantages:
1) Solid statistical theory behind the technique. Mixed effects models are a mature branch of statistics.
2) No hyper-parameters to tune. The amount of shrinkage is automatically determined through the estimation process. In short, the fewer observations a category has and/or the more the outcome varies for a category, the higher the regularization towards “the prior” or “grand mean”.
3) The technique is applicable to both continuous and binomial targets. If the target is continuous, the encoder returns the regularized difference of the observation’s category from the global mean. If the target is binomial, the encoder returns regularized log odds per category.
In comparison to the JamesSteinEstimator, this encoder utilizes generalized linear mixed models from the statsmodels library.
Note: This is an alpha implementation. The API of the method may change in the future.
- Parameters
verbose (int) – integer indicating verbosity of the output. 0 for none.
cols (list) – a list of columns to encode, if None, all string columns will be encoded.
drop_invariant (bool) – boolean for whether or not to drop encoded columns with 0 variance.
return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).
handle_missing (str) – options are ‘return_nan’, ‘error’ and ‘value’, defaults to ‘value’, which returns 0.
handle_unknown (str) – options are ‘return_nan’, ‘error’ and ‘value’, defaults to ‘value’, which returns 0.
randomized (bool,) – adds normal (Gaussian) distribution noise into training data in order to decrease overfitting (testing data are untouched).
sigma (float) – standard deviation (spread or “width”) of the normal distribution.
binomial_target (bool) – if True, the target must be binomial with values {0, 1} and Binomial mixed model is used. If False, the target must be continuous and Linear mixed model is used. If None (the default), a heuristic is applied to estimate the target type.
Example
>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target > 22.5
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = GLMMEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(13)
memory usage: 51.5 KB
None
References
- 1
Data Analysis Using Regression and Multilevel/Hierarchical Models, page 253, from
https://faculty.psau.edu.sa/filedownload/doc-12-pdf-a1997d0d31f84d13c1cdc44ac39a8f2c-original.pdf
-
fit
(X, y, **kwargs)[source]¶ Fit encoder according to X and binary y.
- Parameters
X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.
y (array-like, shape = [n_samples]) – Binary target values.
- Returns
self – Returns self.
- Return type
encoder
-
get_feature_names
()[source]¶ Returns the names of all transformed / added columns.
- Returns
feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!
- Return type
list
-
transform
(X, y=None, override_return_df=False)[source]¶ Perform the transformation to new categorical data.
When the data are used for model training, it is important to also pass the target in order to apply leave one out.
- Parameters
X (array-like, shape = [n_samples, n_features]) –
y (array-like, shape = [n_samples] when transform by leave one out) – None, when transform without target information (such as transform test set)
- Returns
p – Transformed values with encoding applied.
- Return type
array, shape = [n_samples, n_numeric + N]
-
class
ballet.eng.external.
GaussianRandomProjection
(n_components='auto', *, eps=0.1, random_state=None)[source]¶ Bases:
sklearn.random_projection.BaseRandomProjection
Reduce dimensionality through Gaussian random projection.
The components of the random matrix are drawn from N(0, 1 / n_components).
Read more in the User Guide.
New in version 0.13.
- Parameters
n_components (int or 'auto', default='auto') –
Dimensionality of the target projection space.
n_components can be automatically adjusted according to the number of samples in the dataset and the bound given by the Johnson-Lindenstrauss lemma. In that case the quality of the embedding is controlled by the
eps
parameter. It should be noted that the Johnson-Lindenstrauss lemma can yield very conservative estimates of the required number of components, as it makes no assumption about the structure of the dataset.
eps (float, default=0.1) –
Parameter to control the quality of the embedding according to the Johnson-Lindenstrauss lemma when n_components is set to ‘auto’. The value should be strictly positive.
Smaller values lead to better embedding and higher number of dimensions (n_components) in the target projection space.
random_state (int, RandomState instance or None, default=None) – Controls the pseudo random number generator used to generate the projection matrix at fit time. Pass an int for reproducible output across multiple function calls. See Glossary.
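The eps parameter drives the automatic choice of n_components through the Johnson-Lindenstrauss bound; a small sketch using scikit-learn's johnson_lindenstrauss_min_dim helper (the values in the comments are approximate):
>>> from sklearn.random_projection import johnson_lindenstrauss_min_dim
>>> johnson_lindenstrauss_min_dim(n_samples=1000, eps=0.1)  # roughly 5900 components
>>> johnson_lindenstrauss_min_dim(n_samples=1000, eps=0.5)  # only a few hundred at a looser eps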
-
n_components_
¶ Concrete number of components computed when n_components=”auto”.
- Type
int
-
components_
¶ Random matrix used for the projection.
- Type
ndarray of shape (n_components, n_features)
-
n_features_in_
¶ Number of features seen during fit.
New in version 0.24.
- Type
int
-
feature_names_in_
¶ Names of features seen during fit. Defined only when X has feature names that are all strings.
New in version 1.0.
- Type
ndarray of shape (n_features_in_,)
See also
SparseRandomProjection
Reduce dimensionality through sparse random projection.
Examples
>>> import numpy as np
>>> from sklearn.random_projection import GaussianRandomProjection
>>> rng = np.random.RandomState(42)
>>> X = rng.rand(100, 10000)
>>> transformer = GaussianRandomProjection(random_state=rng)
>>> X_new = transformer.fit_transform(X)
>>> X_new.shape
(100, 3947)
-
class
ballet.eng.external.
HashingEncoder
(max_process=0, max_sample=0, verbose=0, n_components=8, cols=None, drop_invariant=False, return_df=True, hash_method='md5')[source]¶ Bases:
sklearn.base.BaseEstimator
,sklearn.base.TransformerMixin
A multivariate hashing implementation with configurable dimensionality/precision.
The advantage of this encoder is that it does not maintain a dictionary of observed categories. Consequently, the encoder does not grow in size and accepts new values during data scoring by design.
It’s important to read about how max_process & max_sample work before setting them manually; an inappropriate setting slows down encoding.
The default value of ‘max_process’ is 1 on Windows because multiprocessing might cause issues; see https://github.com/scikit-learn-contrib/categorical-encoding/issues/215 and https://docs.python.org/2/library/multiprocessing.html?highlight=process#windows
- Parameters
verbose (int) – integer indicating verbosity of the output. 0 for none.
cols (list) – a list of columns to encode, if None, all string columns will be encoded.
drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.
return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).
hash_method (str) – which hashing method to use. Any method from hashlib works.
max_process (int) – how many processes to use in transform(). Limited in range(1, 64). By default, it uses half of the logical CPUs. For example, 4C4T makes max_process=2, 4C8T makes max_process=4. Set it larger if you have a strong CPU. It is not recommended to set it larger than the count of the logical CPUs, as that will actually slow down the encoding.
max_sample (int) – how many samples to encode by each process at a time. This setting is useful on low memory machines. By default, max_sample=(all samples num)/(max_process). For example, 4C8T CPU with 100,000 samples makes max_sample=25,000, 6C12T CPU with 100,000 samples makes max_sample=16,666. It is not recommended to set it larger than the default value.
n_components (int) – how many bits to use to represent the feature. By default we use 8 bits. For high-cardinality features, consider using up-to 32 bits.
Example
>>> from category_encoders.hashing import HashingEncoder
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> y = bunch.target
>>> he = HashingEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> data = he.transform(X)
>>> print(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 19 columns):
col_0      506 non-null int64
col_1      506 non-null int64
col_2      506 non-null int64
col_3      506 non-null int64
col_4      506 non-null int64
col_5      506 non-null int64
col_6      506 non-null int64
col_7      506 non-null int64
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(11), int64(8)
memory usage: 75.2 KB
None
References
- 1
Feature Hashing for Large Scale Multitask Learning, from
https://alex.smola.org/papers/2009/Weinbergeretal09.pdf
- 2
Don’t be tricked by the Hashing Trick, from https://booking.ai/dont-be-tricked-by-the-hashing-trick-192a6aae3087
-
fit
(X, y=None, **kwargs)[source]¶ Fit encoder according to X and y.
- Parameters
X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.
y (array-like, shape = [n_samples]) – Target values.
- Returns
self – Returns self.
- Return type
encoder
-
get_feature_names
()[source]¶ Returns the names of all transformed / added columns.
- Returns
feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!
- Return type
list
-
static
hashing_trick
(X_in, hashing_method='md5', N=2, cols=None, make_copy=False)[source]¶ A basic hashing implementation with configurable dimensionality/precision
Performs the hashing trick on a pandas dataframe, X, using the hashing method from hashlib identified by hashing_method. The number of output dimensions (N), and columns to hash (cols) are also configurable.
- Parameters
X_in (pandas dataframe) – the dataframe to encode.
hashing_method (string, optional) – the hashing method from hashlib to apply, e.g. ‘md5’.
N (int, optional) – the number of output dimensions (hashed columns) to produce.
cols (list, optional) – the columns to hash.
make_copy (bool, optional) – whether to operate on a copy of the input dataframe.
- Returns
out – A hashing encoded dataframe.
- Return type
dataframe
References
- 1
Kilian Weinberger; Anirban Dasgupta; John Langford; Alex Smola; Josh Attenberg (2009). Feature Hashing for Large Scale Multitask Learning. Proc. ICML.
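A minimal sketch of the static hashing_trick helper (toy data, illustrative only):
>>> import pandas as pd
>>> from ballet.eng.external import HashingEncoder
>>> df = pd.DataFrame({"color": ["red", "blue", "red"]})
>>> out = HashingEncoder.hashing_trick(df, hashing_method="md5", N=4, cols=["color"])
>>> out.shape  # the 'color' column is replaced by 4 hashed columns (col_0 ... col_3)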
-
class
ballet.eng.external.
HelmertEncoder
(verbose=0, cols=None, mapping=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value')[source]¶ Bases:
sklearn.base.BaseEstimator
,sklearn.base.TransformerMixin
Helmert contrast coding for encoding categorical features.
- Parameters
verbose (int) – integer indicating verbosity of the output. 0 for none.
cols (list) – a list of columns to encode, if None, all string columns will be encoded.
drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.
return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).
handle_unknown (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if indicator is used, an extra column will be added in if the transform matrix has unknown categories. This can cause unexpected changes in dimension in some cases.
handle_missing (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if indicator is used, an extra column will be added in if the transform matrix has nan values. This can cause unexpected changes in dimension in some cases.
Example
>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = HelmertEncoder(cols=['CHAS', 'RAD'], handle_unknown='value', handle_missing='value').fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 21 columns):
intercept    506 non-null int64
CRIM         506 non-null float64
ZN           506 non-null float64
INDUS        506 non-null float64
CHAS_0       506 non-null float64
NOX          506 non-null float64
RM           506 non-null float64
AGE          506 non-null float64
DIS          506 non-null float64
RAD_0        506 non-null float64
RAD_1        506 non-null float64
RAD_2        506 non-null float64
RAD_3        506 non-null float64
RAD_4        506 non-null float64
RAD_5        506 non-null float64
RAD_6        506 non-null float64
RAD_7        506 non-null float64
TAX          506 non-null float64
PTRATIO      506 non-null float64
B            506 non-null float64
LSTAT        506 non-null float64
dtypes: float64(20), int64(1)
memory usage: 83.1 KB
None
References
- 1
Contrast Coding Systems for Categorical Variables, from
https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/
- 2
Gregory Carey (2003). Coding Categorical Variables, from
-
fit
(X, y=None, **kwargs)[source]¶ Fit encoder according to X and y.
- Parameters
X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.
y (array-like, shape = [n_samples]) – Target values.
- Returns
self – Returns self.
- Return type
encoder
-
class
ballet.eng.external.
HorizonTransformer
(horizon=2)[source]¶ Bases:
sklearn.base.BaseEstimator
,sklearn.base.TransformerMixin
-
fit_transform
(X, y=None)[source]¶ Fit to data, then transform it.
Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.
- Parameters
X (array-like of shape (n_samples, n_features)) – Input samples.
y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).
**fit_params (dict) – Additional fit parameters.
- Returns
X_new – Transformed array.
- Return type
ndarray array of shape (n_samples, n_features_new)
-
needs_refit
= True¶
-
y_only
= True¶
-
-
class
ballet.eng.external.
IntegratedTransformer
(num_lags=1, pred_stride=1)[source]¶ Bases:
sklearn.base.BaseEstimator
,sklearn.base.TransformerMixin
-
class
ballet.eng.external.
JamesSteinEncoder
(verbose=0, cols=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value', model='independent', random_state=None, randomized=False, sigma=0.05)[source]¶ Bases:
sklearn.base.BaseEstimator
,category_encoders.utils.TransformerWithTargetMixin
James-Stein estimator.
Supported targets: binomial and continuous. For polynomial target support, see PolynomialWrapper.
For feature value i, James-Stein estimator returns a weighted average of:
The mean target value for the observed feature value i.
The mean target value (regardless of the feature value).
This can be written as:
JS_i = (1-B)*mean(y_i) + B*mean(y)
The question is, what should the weight B be? If we put too much weight on the conditional mean value, we will overfit. If we put too much weight on the global mean, we will underfit. The canonical solution in machine learning is to perform cross-validation. However, Charles Stein came up with a closed-form solution to the problem. The intuition is: if the estimate of mean(y_i) is unreliable (y_i has high variance), we should put more weight on mean(y). Stein put it into an equation as:
B = var(y_i) / (var(y_i)+var(y))
The only remaining issue is that we do not know var(y), let alone var(y_i). Hence, we have to estimate the variances. But how can we reliably estimate the variances, when we already struggle with the estimation of the mean values?! There are multiple solutions:
1. If we have the same count of observations for each feature value i and all y_i are close to each other, we can pretend that all var(y_i) are identical. This is called a pooled model.
2. If the observation counts are not equal, it makes sense to replace the variances with squared standard errors, which penalize small observation counts:
SE^2 = var(y)/count(y)
This is called an independent model.
James-Stein estimator has, however, one practical limitation - it was defined only for normal distributions. If you want to apply it for binary classification, which allows only values {0, 1}, it is better to first convert the mean target value from the bound interval <0,1> into an unbounded interval by replacing mean(y) with log-odds ratio:
log-odds_ratio_i = log(mean(y_i)/mean(y_not_i))
This is called the binary model. The estimation of the parameters of this model is, however, tricky and sometimes fails fatally. In these situations, it is better to use the beta model, which generally delivers slightly worse accuracy than the binary model but does not suffer from fatal failures.
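A rough numeric illustration of the shrinkage weight described above, using the independent model with var(y_i) replaced by the squared standard error; this sketches the formulas only, not the encoder's exact internals:
>>> import numpy as np
>>> y = np.array([10.0, 12.0, 30.0, 10.0, 11.0, 9.0])  # all target values
>>> y_i = y[:3]                                         # targets observed for feature value i
>>> se2_i = y_i.var(ddof=1) / len(y_i)                  # squared standard error of mean(y_i)
>>> B = se2_i / (se2_i + y.var(ddof=1))                 # shrinkage weight B
>>> JS_i = (1 - B) * y_i.mean() + B * y.mean()          # encoded value for feature value i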
- Parameters
verbose (int) – integer indicating verbosity of the output. 0 for none.
cols (list) – a list of columns to encode, if None, all string columns will be encoded.
drop_invariant (bool) – boolean for whether or not to drop encoded columns with 0 variance.
return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).
handle_missing (str) – options are ‘return_nan’, ‘error’ and ‘value’, defaults to ‘value’, which returns the prior probability.
handle_unknown (str) – options are ‘return_nan’, ‘error’ and ‘value’, defaults to ‘value’, which returns the prior probability.
model (str) – options are ‘pooled’, ‘beta’, ‘binary’ and ‘independent’, defaults to ‘independent’.
randomized (bool,) – adds normal (Gaussian) distribution noise into training data in order to decrease overfitting (testing data are untouched).
sigma (float) – standard deviation (spread or “width”) of the normal distribution.
Example
>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = JamesSteinEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(13)
memory usage: 51.5 KB
None
References
- 1
Parametric empirical Bayes inference: Theory and applications, equations 1.19 & 1.20, from
https://www.jstor.org/stable/2287098
- 2
Empirical Bayes for multiple sample sizes, from
http://chris-said.io/2017/05/03/empirical-bayes-for-multiple-sample-sizes/
- 3
Shrinkage Estimation of Log-odds Ratios for Comparing Mobility Tables, from
https://journals.sagepub.com/doi/abs/10.1177/0081175015570097
- 4
Stein’s paradox and group rationality, from
http://www.philos.rug.nl/~romeyn/presentation/2017_romeijn_-_Paris_Stein.pdf
- 5
Stein’s Paradox in Statistics, from
http://statweb.stanford.edu/~ckirby/brad/other/Article1977.pdf
-
fit
(X, y, **kwargs)[source]¶ Fit encoder according to X and binary y.
- Parameters
X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.
y (array-like, shape = [n_samples]) – Binary target values.
- Returns
self – Returns self.
- Return type
encoder
-
get_feature_names
()[source]¶ Returns the names of all transformed / added columns.
- Returns
feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!
- Return type
list
-
transform
(X, y=None, override_return_df=False)[source]¶ Perform the transformation to new categorical data. When the data are used for model training, it is important to also pass the target in order to apply leave one out.
- Parameters
X (array-like, shape = [n_samples, n_features]) –
y (array-like, shape = [n_samples] when transform by leave one out) – None, when transform without target information (such as transform test set)
- Returns
p – Transformed values with encoding applied.
- Return type
array, shape = [n_samples, n_numeric + N]
-
class
ballet.eng.external.
KBinsDiscretizer
(n_bins=5, *, encode='onehot', strategy='quantile', dtype=None)[source]¶ Bases:
sklearn.base.TransformerMixin
,sklearn.base.BaseEstimator
Bin continuous data into intervals.
Read more in the User Guide.
New in version 0.20.
- Parameters
n_bins (int or array-like of shape (n_features,), default=5) – The number of bins to produce. Raises ValueError if
n_bins < 2
.encode ({'onehot', 'onehot-dense', 'ordinal'}, default='onehot') –
Method used to encode the transformed result.
- onehot
Encode the transformed result with one-hot encoding and return a sparse matrix. Ignored features are always stacked to the right.
- onehot-dense
Encode the transformed result with one-hot encoding and return a dense array. Ignored features are always stacked to the right.
- ordinal
Return the bin identifier encoded as an integer value.
strategy ({'uniform', 'quantile', 'kmeans'}, default='quantile') –
Strategy used to define the widths of the bins.
- uniform
All bins in each feature have identical widths.
- quantile
All bins in each feature have the same number of points.
- kmeans
Values in each bin have the same nearest center of a 1D k-means cluster.
dtype ({np.float32, np.float64}, default=None) –
The desired data-type for the output. If None, output dtype is consistent with input dtype. Only np.float32 and np.float64 are supported.
New in version 0.24.
-
bin_edges_
¶ The edges of each bin. Contain arrays of varying shapes
(n_bins_, )
Ignored features will have empty arrays.- Type
ndarray of ndarray of shape (n_features,)
-
n_bins_
Number of bins per feature. Bins whose width is too small (i.e., <= 1e-8) are removed with a warning.
- Type
ndarray of shape (n_features,), dtype=np.int_
-
n_features_in_
¶ Number of features seen during fit.
New in version 0.24.
- Type
int
-
feature_names_in_
¶ Names of features seen during fit. Defined only when X has feature names that are all strings.
New in version 1.0.
- Type
ndarray of shape (n_features_in_,)
See also
Binarizer
Class used to bin values as
0
or1
based on a parameterthreshold
.
Notes
In bin edges for feature
i
, the first and last values are used only forinverse_transform
. During transform, bin edges are extended to:np.concatenate([-np.inf, bin_edges_[i][1:-1], np.inf])
You can combine
KBinsDiscretizer
withColumnTransformer
if you only want to preprocess part of the features.KBinsDiscretizer
might produce constant features (e.g., whenencode = 'onehot'
and certain bins do not contain any data). These features can be removed with feature selection algorithms (e.g.,VarianceThreshold
).Examples
>>> from sklearn.preprocessing import KBinsDiscretizer
>>> X = [[-2, 1, -4,   -1],
...      [-1, 2, -3, -0.5],
...      [ 0, 3, -2,  0.5],
...      [ 1, 4, -1,    2]]
>>> est = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
>>> est.fit(X)
KBinsDiscretizer(...)
>>> Xt = est.transform(X)
>>> Xt
array([[ 0., 0., 0., 0.],
       [ 1., 1., 1., 0.],
       [ 2., 2., 2., 1.],
       [ 2., 2., 2., 2.]])
Sometimes it may be useful to convert the data back into the original feature space. The
inverse_transform
function converts the binned data into the original feature space. Each value will be equal to the mean of the two bin edges.
>>> est.bin_edges_[0]
array([-2., -1.,  0.,  1.])
>>> est.inverse_transform(Xt)
array([[-1.5,  1.5, -3.5, -0.5],
       [-0.5,  2.5, -2.5, -0.5],
       [ 0.5,  3.5, -1.5,  0.5],
       [ 0.5,  3.5, -1.5,  1.5]])
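A short sketch of the ColumnTransformer combination mentioned in the notes, reusing X from the example above so that only the first two columns are discretised (not part of the original example):
>>> from sklearn.compose import ColumnTransformer
>>> ct = ColumnTransformer(
...     [("bins", KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="uniform"), [0, 1])],
...     remainder="passthrough")
>>> ct.fit_transform(X).shape  # two binned columns plus two passed-through columns
(4, 4)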
-
fit
(X, y=None)[source]¶ Fit the estimator.
- Parameters
X (array-like of shape (n_samples, n_features)) – Data to be discretized.
y (None) – Ignored. This parameter exists only for compatibility with
Pipeline
.
- Returns
self – Returns the instance itself.
- Return type
object
-
get_feature_names_out
(input_features=None)[source]¶ Get output feature names.
- Parameters
input_features (array-like of str or None, default=None) –
Input features.
If input_features is None, then feature_names_in_ is used as feature names in. If feature_names_in_ is not defined, then names are generated: [x0, x1, …, x(n_features_in_)].
If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.
- Returns
feature_names_out – Transformed feature names.
- Return type
ndarray of str objects
-
inverse_transform
(Xt)[source]¶ Transform discretized data back to original feature space.
Note that this function does not regenerate the original data due to discretization rounding.
- Parameters
Xt (array-like of shape (n_samples, n_features)) – Transformed data in the binned space.
- Returns
Xinv – Data in the original feature space.
- Return type
ndarray, dtype={np.float32, np.float64}
-
transform
(X)[source]¶ Discretize the data.
- Parameters
X (array-like of shape (n_samples, n_features)) – Data to be discretized.
- Returns
Xt – Data in the binned space. Will be a sparse matrix if self.encode=’onehot’ and ndarray otherwise.
- Return type
{ndarray, sparse matrix}, dtype={np.float32, np.float64}
-
class
ballet.eng.external.
KNNImputer
(*, missing_values=nan, n_neighbors=5, weights='uniform', metric='nan_euclidean', copy=True, add_indicator=False)[source]¶ Bases:
sklearn.impute._base._BaseImputer
Imputation for completing missing values using k-Nearest Neighbors.
Each sample’s missing values are imputed using the mean value from n_neighbors nearest neighbors found in the training set. Two samples are close if the features that neither is missing are close.
Read more in the User Guide.
New in version 0.22.
- Parameters
missing_values (int, float, str, np.nan or None, default=np.nan) – The placeholder for the missing values. All occurrences of missing_values will be imputed. For pandas’ dataframes with nullable integer dtypes with missing values, missing_values should be set to np.nan, since pd.NA will be converted to np.nan.
n_neighbors (int, default=5) – Number of neighboring samples to use for imputation.
weights ({'uniform', 'distance'} or callable, default='uniform') –
Weight function used in prediction. Possible values:
’uniform’ : uniform weights. All points in each neighborhood are weighted equally.
’distance’ : weight points by the inverse of their distance. In this case, closer neighbors of a query point will have a greater influence than neighbors which are further away.
callable : a user-defined function which accepts an array of distances, and returns an array of the same shape containing the weights.
metric ({'nan_euclidean'} or callable, default='nan_euclidean') –
Distance metric for searching neighbors. Possible values:
’nan_euclidean’
callable : a user-defined function which conforms to the definition of
_pairwise_callable(X, Y, metric, **kwds)
. The function accepts two arrays, X and Y, and a missing_values keyword in kwds and returns a scalar distance value.
copy (bool, default=True) – If True, a copy of X will be created. If False, imputation will be done in-place whenever possible.
add_indicator (bool, default=False) – If True, a
MissingIndicator
transform will stack onto the output of the imputer’s transform. This allows a predictive estimator to account for missingness despite imputation. If a feature has no missing values at fit/train time, the feature won’t appear on the missing indicator even if there are missing values at transform/test time.
-
indicator_
¶ Indicator used to add binary indicators for missing values.
None
if add_indicator is False.- Type
MissingIndicator
-
n_features_in_
¶ Number of features seen during fit.
New in version 0.24.
- Type
int
-
feature_names_in_
¶ Names of features seen during fit. Defined only when X has feature names that are all strings.
New in version 1.0.
- Type
ndarray of shape (n_features_in_,)
See also
SimpleImputer
Imputation transformer for completing missing values with simple strategies.
IterativeImputer
Multivariate imputer that estimates each feature from all the others.
References
Olga Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor Hastie, Robert Tibshirani, David Botstein and Russ B. Altman, Missing value estimation methods for DNA microarrays, BIOINFORMATICS Vol. 17 no. 6, 2001 Pages 520-525.
Examples
>>> import numpy as np
>>> from sklearn.impute import KNNImputer
>>> X = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]]
>>> imputer = KNNImputer(n_neighbors=2)
>>> imputer.fit_transform(X)
array([[1. , 2. , 4. ],
       [3. , 4. , 3. ],
       [5.5, 6. , 5. ],
       [8. , 8. , 7. ]])
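A further sketch, not part of the original example, showing add_indicator with the same X; columns 0 and 2 contain missing values at fit time, so two indicator columns are appended:
>>> imputer = KNNImputer(n_neighbors=2, add_indicator=True)
>>> imputer.fit_transform(X).shape  # 3 imputed features + 2 missing indicators
(4, 5)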
-
fit
(X, y=None)[source]¶ Fit the imputer on X.
- Parameters
X (array-like shape of (n_samples, n_features)) – Input data, where n_samples is the number of samples and n_features is the number of features.
y (Ignored) – Not used, present here for API consistency by convention.
- Returns
self – The fitted KNNImputer class instance.
- Return type
object
-
transform
(X)[source]¶ Impute all missing values in X.
- Parameters
X (array-like of shape (n_samples, n_features)) – The input data to complete.
- Returns
X – The imputed dataset. n_output_features is the number of features that are not always missing during fit.
- Return type
array-like of shape (n_samples, n_output_features)
-
class
ballet.eng.external.
LeaveOneOutEncoder
(verbose=0, cols=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value', random_state=None, sigma=None)[source]¶ Bases:
sklearn.base.BaseEstimator
,category_encoders.utils.TransformerWithTargetMixin
Leave one out coding for categorical features.
This is very similar to target encoding but excludes the current row’s target when calculating the mean target for a level to reduce the effect of outliers.
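A rough illustration of the leave-one-out mean described above (illustrative only; the encoder additionally handles unknown and missing categories and optional noise):
>>> import pandas as pd
>>> cat = pd.Series(["a", "a", "a", "b"])
>>> y = pd.Series([1.0, 0.0, 1.0, 0.0])
>>> # encoding for row 0 (category 'a', target 1.0): the mean of the *other* 'a' targets
>>> loo_0 = (y[cat == "a"].sum() - y[0]) / ((cat == "a").sum() - 1)  # 0.5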
- Parameters
verbose (int) – integer indicating verbosity of the output. 0 for none.
cols (list) – a list of columns to encode, if None, all string columns will be encoded.
drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.
return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).
handle_missing (str) – options are ‘error’, ‘return_nan’ and ‘value’, defaults to ‘value’, which returns the target mean.
handle_unknown (str) – options are ‘error’, ‘return_nan’ and ‘value’, defaults to ‘value’, which returns the target mean.
sigma (float) – adds normal (Gaussian) distribution noise into training data in order to decrease overfitting (testing data are untouched). Sigma gives the standard deviation (spread or “width”) of the normal distribution. The optimal value is commonly between 0.05 and 0.6. The default is to not add noise, but that leads to significantly suboptimal results.
Example
>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = LeaveOneOutEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(13)
memory usage: 51.5 KB
None
References
- 1
Strategies to encode categorical variables with many categories, from
https://www.kaggle.com/c/caterpillar-tube-pricing/discussion/15748#143154.
-
fit
(X, y, **kwargs)[source]¶ Fit encoder according to X and y.
- Parameters
X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.
y (array-like, shape = [n_samples]) – Target values.
- Returns
self – Returns self.
- Return type
encoder
-
get_feature_names
()[source]¶ Returns the names of all transformed / added columns.
- Returns
feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!
- Return type
list
-
transform
(X, y=None, override_return_df=False)[source]¶ Perform the transformation to new categorical data.
- Parameters
X (array-like, shape = [n_samples, n_features]) –
y (array-like, shape = [n_samples] when transform by leave one out) – None, when transform without target information (such as transform test set)
- Returns
p – Transformed values with encoding applied.
- Return type
array, shape = [n_samples, n_numeric + N]
-
class
ballet.eng.external.
LogTransformer
[source]¶ Bases:
sklearn.base.BaseEstimator
,sklearn.base.TransformerMixin
-
needs_refit
= False¶
-
-
class
ballet.eng.external.
MEstimateEncoder
(verbose=0, cols=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value', random_state=None, randomized=False, sigma=0.05, m=1.0)[source]¶ Bases:
sklearn.base.BaseEstimator
,category_encoders.utils.TransformerWithTargetMixin
M-probability estimate of likelihood.
Supported targets: binomial and continuous. For polynomial target support, see PolynomialWrapper.
This is a simplified version of target encoder, which goes under names like m-probability estimate or additive smoothing with known incidence rates. In comparison to target encoder, m-probability estimate has only one tunable parameter (m), while target encoder has two tunable parameters (min_samples_leaf and smoothing).
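The m-probability estimate is commonly written as an additive-smoothing blend of the category mean and the prior (see reference [1] below for the exact formulation); a rough illustration with hypothetical numbers:
>>> n_i, mean_i, prior, m = 3, 0.9, 0.5, 1.0   # category count, category mean, global mean, m
>>> (n_i * mean_i + m * prior) / (n_i + m)     # about 0.8, pulled from 0.9 toward the prior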
- Parameters
verbose (int) – integer indicating verbosity of the output. 0 for none.
cols (list) – a list of columns to encode, if None, all string columns will be encoded.
drop_invariant (bool) – boolean for whether or not to drop encoded columns with 0 variance.
return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).
handle_missing (str) – options are ‘return_nan’, ‘error’ and ‘value’, defaults to ‘value’, which returns the prior probability.
handle_unknown (str) – options are ‘return_nan’, ‘error’ and ‘value’, defaults to ‘value’, which returns the prior probability.
randomized (bool,) – adds normal (Gaussian) distribution noise into training data in order to decrease overfitting (testing data are untouched).
sigma (float) – standard deviation (spread or “width”) of the normal distribution.
m (float) – this is the “m” in the m-probability estimate. Higher value of m results into stronger shrinking. M is non-negative.
Example
>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target > 22.5
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = MEstimateEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(13)
memory usage: 51.5 KB
None
References
- 1
A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems, equation 7, from
https://dl.acm.org/citation.cfm?id=507538
- 2
On estimating probabilities in tree pruning, equation 1, from
https://link.springer.com/chapter/10.1007/BFb0017010
- 3
Additive smoothing, from
https://en.wikipedia.org/wiki/Additive_smoothing#Generalized_to_the_case_of_known_incidence_rates
-
fit
(X, y, **kwargs)[source]¶ Fit encoder according to X and binary or continuous y.
- Parameters
X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.
y (array-like, shape = [n_samples]) – Binary target values.
- Returns
self – Returns self.
- Return type
encoder
-
get_feature_names
()[source]¶ Returns the names of all transformed / added columns.
- Returns
feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!
- Return type
list
-
transform
(X, y=None, override_return_df=False)[source]¶ Perform the transformation to new categorical data.
When the data are used for model training, it is important to also pass the target in order to apply leave one out.
- Parameters
X (array-like, shape = [n_samples, n_features]) –
y (array-like, shape = [n_samples], used when transforming by leave one out) – None when transforming without target information (such as when transforming a test set)
- Returns
p – Transformed values with encoding applied.
- Return type
array, shape = [n_samples, n_numeric + N]
-
class
ballet.eng.external.
MathematicalCombination
(variables_to_combine, math_operations=None, new_variables_names=None, missing_values='raise')[source]¶ Bases:
sklearn.base.BaseEstimator
,sklearn.base.TransformerMixin
MathematicalCombination() applies basic mathematical operations to multiple features, returning one or more additional features as a result. That is, it sums, multiplies, takes the average, maximum, minimum or standard deviation of a group of variables, and returns the result into new variables.
For example, if we have the variables number_payments_first_quarter, number_payments_second_quarter, number_payments_third_quarter and number_payments_fourth_quarter, we can use MathematicalCombination() to calculate the total number of payments and mean number of payments as follows:
transformer = MathematicalCombination(
    variables_to_combine=[
        'number_payments_first_quarter',
        'number_payments_second_quarter',
        'number_payments_third_quarter',
        'number_payments_fourth_quarter'
    ],
    math_operations=[
        'sum',
        'mean'
    ],
    new_variables_names=[
        'total_number_payments',
        'mean_number_payments'
    ]
)

Xt = transformer.fit_transform(X)
The transformed X, Xt, will contain the additional features total_number_payments and mean_number_payments, plus the original set of variables.
Attention, if some of the variables to combine have missing data and missing_values = ‘ignore’, the value will be ignored in the computation. To be clear, if variables A, B and C, have values 10, 20 and NA, and we perform the sum, the result will be A + B = 30.
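The 'ignore' behaviour described above mirrors how pandas itself skips NaN in row-wise aggregations. The snippet below is only a plain-pandas illustration of that behaviour, not the transformer's internal code:

import numpy as np
import pandas as pd

X = pd.DataFrame({'A': [10], 'B': [20], 'C': [np.nan]})

# NaN is skipped, so the row-wise sum of A=10, B=20, C=NaN is 30
X[['A', 'B', 'C']].sum(axis=1)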
- Parameters
variables_to_combine (list) – The list of numerical variables to be combined.
math_operations (list, default=None) –
The list of basic math operations to be used to create the new features.
If None, all of [‘sum’, ‘prod’, ‘mean’, ‘std’, ‘max’, ‘min’] will be performed over the variables_to_combine. Alternatively, you can enter the list of operations to carry out.
Each operation should be a string and must be one of the elements in [‘sum’, ‘prod’, ‘mean’, ‘std’, ‘max’, ‘min’].
Each operation will result in a new variable that will be added to the transformed dataset.
new_variables_names (list, default=None) –
Names of the newly created variables. You can enter a name or a list of names for the newly created features (recommended). You must enter one name for each mathematical transformation indicated in the math_operations parameter. That is, if you want to perform mean and sum of features, you should enter 2 new variable names. If you perform only mean of features, enter 1 variable name. Alternatively, if you chose to perform all mathematical transformations, enter 6 new variable names.
The order of the names indicated by the user should coincide with the order in which the mathematical operations are set in the transformer. That is, if you set math_operations = [‘mean’, ‘prod’], the first new variable name will be assigned to the mean of the variables and the second variable name to the product of the variables.
If new_variables_names = None, the transformer will assign an arbitrary name to the newly created features, starting with the name of the mathematical operation, followed by the variables combined, separated by -.
missing_values (string, default='raise') – Indicates if missing values should be ignored or raised. If ‘raise’ the transformer will return an error if the the datasets to fit or transform contain missing values. If ‘ignore’, missing data will be ignored when performing the calculations.
-
combination_dict_
¶ Dictionary containing the mathematical operation to new variable name pairs.
-
math_operations_
¶ List with the mathematical operations to be applied to the variables_to_combine.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
This transformer does not learn parameters.
-
transform:
Combine the variables with the mathematical operations.
-
fit_transform:
Fit to the data, then transform it.
Notes
Although the transformer in essence allows us to combine any feature with any of the allowed mathematical operations, its use is intended mostly for the creation of new features based on some domain knowledge. Typical examples within the financial sector are:
Sum debt across financial products, i.e., credit cards, to obtain the total debt.
Take the average payments to various financial products per month.
Find the minimum payment made in any one month.
In insurance, we can sum the damage to various parts of a car to obtain the total damage.
-
fit
(X, y=None)[source]¶ This transformer does not learn parameters.
Perform dataframe checks. Creates dictionary of operation to new feature name pairs.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples. Can be the entire dataframe, not just the variables to transform.
y (pandas Series, or np.array. Defaults to None.) – It is not needed in this transformer. You can pass y or None.
- Raises
TypeError –
If the input is not a Pandas DataFrame - If any user provided variables in variables_to_combine are not numerical
ValueError – If the variable(s) contain null values when missing_values = raise
- Returns
- Return type
self
-
transform
(X)[source]¶ Combine the variables with the mathematical operations.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The data to transform.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError –
If the variable(s) contain null values when missing_values = raise - If the dataframe is not of the same size as that used in fit()
- Returns
X – The dataframe with the original variables plus the new variables.
- Return type
Pandas dataframe, shape = [n_samples, n_features + n_operations]
-
class
ballet.eng.external.
MaxAbsScaler
(*, copy=True)[source]¶ Bases:
sklearn.base._OneToOneFeatureMixin
,sklearn.base.TransformerMixin
,sklearn.base.BaseEstimator
Scale each feature by its maximum absolute value.
This estimator scales and translates each feature individually such that the maximal absolute value of each feature in the training set will be 1.0. It does not shift/center the data, and thus does not destroy any sparsity.
This scaler can also be applied to sparse CSR or CSC matrices.
New in version 0.17.
- Parameters
copy (bool, default=True) – Set to False to perform inplace scaling and avoid a copy (if the input is already a numpy array).
-
scale_
¶ Per feature relative scaling of the data.
New in version 0.17: scale_ attribute.
- Type
ndarray of shape (n_features,)
-
max_abs_
¶ Per feature maximum absolute value.
- Type
ndarray of shape (n_features,)
-
n_features_in_
¶ Number of features seen during fit.
New in version 0.24.
- Type
int
-
feature_names_in_
¶ Names of features seen during fit. Defined only when X has feature names that are all strings.
New in version 1.0.
- Type
ndarray of shape (n_features_in_,)
-
n_samples_seen_
¶ The number of samples processed by the estimator. Will be reset on new calls to fit, but increments across
partial_fit
calls.
- Type
int
See also
maxabs_scale
Equivalent function without the estimator API.
Notes
NaNs are treated as missing values: disregarded in fit, and maintained in transform.
For a comparison of the different scalers, transformers, and normalizers, see examples/preprocessing/plot_all_scaling.py.
Examples
>>> from sklearn.preprocessing import MaxAbsScaler
>>> X = [[ 1., -1.,  2.],
...      [ 2.,  0.,  0.],
...      [ 0.,  1., -1.]]
>>> transformer = MaxAbsScaler().fit(X)
>>> transformer
MaxAbsScaler()
>>> transformer.transform(X)
array([[ 0.5, -1. ,  1. ],
       [ 1. ,  0. ,  0. ],
       [ 0. ,  1. , -0.5]])
-
fit
(X, y=None)[source]¶ Compute the maximum absolute value to be used for later scaling.
- Parameters
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to compute the per-feature minimum and maximum used for later scaling along the features axis.
y (None) – Ignored.
- Returns
self – Fitted scaler.
- Return type
object
-
inverse_transform
(X)[source]¶ Scale back the data to the original representation.
- Parameters
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data that should be transformed back.
- Returns
X_tr – Transformed array.
- Return type
{ndarray, sparse matrix} of shape (n_samples, n_features)
-
partial_fit
(X, y=None)[source]¶ Online computation of max absolute value of X for later scaling.
All of X is processed as a single batch. This is intended for cases when
fit()
is not feasible due to a very large number of samples or because X is read from a continuous stream. A short batch-by-batch sketch follows this method's documentation.
- Parameters
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to compute the per-feature maximum absolute value used for later scaling along the features axis.
y (None) – Ignored.
- Returns
self – Fitted scaler.
- Return type
object
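As a rough sketch of the streaming use case described for partial_fit, the scaler can be updated batch by batch (the data and batch split below are hypothetical):

import numpy as np
from sklearn.preprocessing import MaxAbsScaler

rng = np.random.default_rng(0)
stream = rng.normal(size=(1000, 3))      # stand-in for data arriving in chunks

scaler = MaxAbsScaler()
for batch in np.array_split(stream, 10):
    scaler.partial_fit(batch)            # updates the running max-abs per feature

X_scaled = scaler.transform(rng.normal(size=(5, 3)))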
-
class
ballet.eng.external.
MeanEncoder
(variables=None, ignore_format=False)[source]¶ Bases:
feature_engine.encoding.base_encoder.BaseCategoricalTransformer
The MeanEncoder() replaces categories by the mean value of the target for each category.
For example in the variable colour, if the mean of the target for blue, red and grey is 0.5, 0.8 and 0.1 respectively, blue is replaced by 0.5, red by 0.8 and grey by 0.1.
The encoder will encode only categorical variables by default (type ‘object’ or ‘categorical’). You can pass a list of variables to encode. Alternatively, the encoder will find and encode all categorical variables (type ‘object’ or ‘categorical’).
With ignore_format=True you have the option to encode numerical variables as well. The procedure is identical, you can either enter the list of variables to encode, or the transformer will automatically select all variables.
The encoder first maps the categories to the numbers for each variable (fit). The encoder then replaces the categories with those numbers (transform).
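A minimal sketch of the colour example above, assuming the class is imported from ballet.eng.external as documented here:

import pandas as pd
from ballet.eng.external import MeanEncoder

X = pd.DataFrame({'colour': ['blue', 'blue', 'red', 'grey']})
y = pd.Series([1, 0, 1, 0])

enc = MeanEncoder(variables=['colour'])
Xt = enc.fit_transform(X, y)
# 'blue' is replaced by 0.5, 'red' by 1.0 and 'grey' by 0.0
# (the learned mapping is stored in enc.encoder_dict_)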
- Parameters
variables (list, default=None) – The list of categorical variables that will be encoded. If None, the encoder will find and transform all variables of type object or categorical by default. You can also make the transformer accept numerical variables, see the next parameter.
ignore_format (bool, default=False) – Whether the format in which the categorical variables are cast should be ignored. If false, the encoder will automatically select variables of type object or categorical, or check that the variables entered by the user are of type object or categorical. If True, the encoder will select all variables or accept all variables entered by the user, including those cast as numeric.
-
encoder_dict_
¶ Dictionary with the target mean value per category per variable.
-
variables_
¶ The group of variables that will be transformed.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
Learn the target mean value per category, per variable.
-
transform:
Encode the categories to numbers.
-
fit_transform:
Fit to the data, then transform it.
-
inverse_transform:
Encode the numbers into the original categories.
Notes
NAN are introduced when encoding categories that were not present in the training dataset. If this happens, try grouping infrequent categories using the RareLabelEncoder().
See also
feature_engine.encoding.RareLabelEncoder
References
- 1
Micci-Barreca D. “A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems”. ACM SIGKDD Explorations Newsletter, 2001. https://dl.acm.org/citation.cfm?id=507538
-
fit
(X, y)[source]¶ Learn the mean value of the target for each category of the variable.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples. Can be the entire dataframe, not just the variables to be encoded.
y (pandas series) – The target.
- Raises
TypeError –
If the input is not a Pandas DataFrame. - If user enters non-categorical variables (unless ignore_format is True)
ValueError –
If there are no categorical variables in the df or the df is empty - If the variable(s) contain null values
- Returns
- Return type
self
-
inverse_transform
(X)[source]¶ Convert the encoded variable back to the original values.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The transformed dataframe.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError –
If the variable(s) contain null values - If the df has different number of features than the df used in fit()
- Returns
X – The un-transformed dataframe, with the categorical variables containing the original values.
- Return type
pandas dataframe of shape = [n_samples, n_features]
-
transform
(X)[source]¶ Replace categories with the learned parameters.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The dataset to transform.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError –
If the variable(s) contain null values - If the df has different number of features than the df used in fit()
Warning – If after encoding, NAN were introduced.
- Returns
X – The dataframe containing the categories replaced by numbers.
- Return type
pandas dataframe of shape = [n_samples, n_features]
-
class
ballet.eng.external.
MeanMedianImputer
(imputation_method='median', variables=None)[source]¶ Bases:
feature_engine.imputation.base_imputer.BaseImputer
The MeanMedianImputer() replaces missing data by the mean or median value of the variable. It works only with numerical variables.
You can pass a list of variables to be imputed. Alternatively, the MeanMedianImputer() will automatically select all variables of type numeric in the training set.
The imputer:
first calculates the mean / median values of the variables (fit).
Then replaces the missing data with the estimated mean / median (transform).
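A minimal usage sketch, assuming a small numerical dataframe with missing values:

import numpy as np
import pandas as pd
from ballet.eng.external import MeanMedianImputer

X = pd.DataFrame({'age': [20.0, np.nan, 40.0], 'income': [1.0, 2.0, np.nan]})

imputer = MeanMedianImputer(imputation_method='median')
Xt = imputer.fit_transform(X)
# the learned medians (age: 30.0, income: 1.5) are stored in imputer.imputer_dict_
# and used to fill the missing entries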
- Parameters
imputation_method (str, default=median) – Desired method of imputation. Can take ‘mean’ or ‘median’.
variables (list, default=None) – The list of variables to be imputed. If None, the imputer will select all variables of type numeric.
-
imputer_dict_
¶ Dictionary with the mean or median values per variable.
-
variables_
¶ The group of variables that will be transformed.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
Learn the mean or median values.
-
transform:
Impute missing data.
-
fit_transform:
Fit to the data, then transform it.
-
fit
(X, y=None)[source]¶ Learn the mean or median values.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training dataset.
y (pandas series or None, default=None) – y is not needed in this imputation. You can pass None or y.
- Raises
TypeError –
If the input is not a Pandas DataFrame - If any of the user provided variables are not numerical
ValueError – If there are no numerical variables in the df or the df is empty
- Returns
- Return type
self
-
transform
(X)[source]¶ Replace missing data with the learned parameters.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The data to be transformed.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError – If the dataframe has different number of features than the df used in fit()
- Returns
X – The dataframe without missing values in the selected variables.
- Return type
pandas dataframe of shape = [n_samples, n_features]
-
class
ballet.eng.external.
MinMaxScaler
(feature_range=(0, 1), *, copy=True, clip=False)[source]¶ Bases:
sklearn.base._OneToOneFeatureMixin
,sklearn.base.TransformerMixin
,sklearn.base.BaseEstimator
Transform features by scaling each feature to a given range.
This estimator scales and translates each feature individually such that it is in the given range on the training set, e.g. between zero and one.
The transformation is given by:
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
where min, max = feature_range.
This transformation is often used as an alternative to zero mean, unit variance scaling.
Read more in the User Guide.
- Parameters
feature_range (tuple (min, max), default=(0, 1)) – Desired range of transformed data.
copy (bool, default=True) – Set to False to perform inplace row normalization and avoid a copy (if the input is already a numpy array).
clip (bool, default=False) –
Set to True to clip transformed values of held-out data to provided feature range.
New in version 0.24.
-
min_
¶ Per feature adjustment for minimum. Equivalent to
min - X.min(axis=0) * self.scale_
- Type
ndarray of shape (n_features,)
-
scale_
¶ Per feature relative scaling of the data. Equivalent to
(max - min) / (X.max(axis=0) - X.min(axis=0))
New in version 0.17: scale_ attribute.
- Type
ndarray of shape (n_features,)
-
data_min_
¶ Per feature minimum seen in the data
New in version 0.17: data_min_
- Type
ndarray of shape (n_features,)
-
data_max_
¶ Per feature maximum seen in the data
New in version 0.17: data_max_
- Type
ndarray of shape (n_features,)
-
data_range_
¶ Per feature range
(data_max_ - data_min_)
seen in the data.
New in version 0.17: data_range_
- Type
ndarray of shape (n_features,)
-
n_features_in_
¶ Number of features seen during fit.
New in version 0.24.
- Type
int
-
n_samples_seen_
¶ The number of samples processed by the estimator. It will be reset on new calls to fit, but increments across
partial_fit
calls.
- Type
int
-
feature_names_in_
¶ Names of features seen during fit. Defined only when X has feature names that are all strings.
New in version 1.0.
- Type
ndarray of shape (n_features_in_,)
See also
minmax_scale
Equivalent function without the estimator API.
Notes
NaNs are treated as missing values: disregarded in fit, and maintained in transform.
For a comparison of the different scalers, transformers, and normalizers, see examples/preprocessing/plot_all_scaling.py.
Examples
>>> from sklearn.preprocessing import MinMaxScaler
>>> data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
>>> scaler = MinMaxScaler()
>>> print(scaler.fit(data))
MinMaxScaler()
>>> print(scaler.data_max_)
[ 1. 18.]
>>> print(scaler.transform(data))
[[0.   0.  ]
 [0.25 0.25]
 [0.5  0.5 ]
 [1.   1.  ]]
>>> print(scaler.transform([[2, 2]]))
[[1.5 0. ]]
-
fit
(X, y=None)[source]¶ Compute the minimum and maximum to be used for later scaling.
- Parameters
X (array-like of shape (n_samples, n_features)) – The data used to compute the per-feature minimum and maximum used for later scaling along the features axis.
y (None) – Ignored.
- Returns
self – Fitted scaler.
- Return type
object
-
inverse_transform
(X)[source]¶ Undo the scaling of X according to feature_range.
- Parameters
X (array-like of shape (n_samples, n_features)) – Input data that will be transformed. It cannot be sparse.
- Returns
Xt – Transformed data.
- Return type
ndarray of shape (n_samples, n_features)
-
partial_fit
(X, y=None)[source]¶ Online computation of min and max on X for later scaling.
All of X is processed as a single batch. This is intended for cases when
fit()
is not feasible due to a very large number of samples or because X is read from a continuous stream.
- Parameters
X (array-like of shape (n_samples, n_features)) – The data used to compute the per-feature minimum and maximum used for later scaling along the features axis.
y (None) – Ignored.
- Returns
self – Fitted scaler.
- Return type
object
-
class
ballet.eng.external.
MissingIndicator
(*, missing_values=nan, features='missing-only', sparse='auto', error_on_new=True)[source]¶ Bases:
sklearn.base.TransformerMixin
,sklearn.base.BaseEstimator
Binary indicators for missing values.
Note that this component typically should not be used in a vanilla Pipeline consisting of transformers and a classifier, but rather could be added using a FeatureUnion or ColumnTransformer.
Read more in the User Guide.
New in version 0.20.
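One common pattern consistent with the note above is to combine the indicator with an imputer inside a FeatureUnion, so that the imputed values and the missingness mask are both fed to the estimator. A minimal sketch:

from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.impute import SimpleImputer, MissingIndicator
from sklearn.tree import DecisionTreeClassifier

# imputed features and binary missingness indicators, side by side
transformer = FeatureUnion(
    transformer_list=[
        ('features', SimpleImputer(strategy='mean')),
        ('indicators', MissingIndicator())])
clf = make_pipeline(transformer, DecisionTreeClassifier())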
- Parameters
missing_values (int, float, str, np.nan or None, default=np.nan) – The placeholder for the missing values. All occurrences of missing_values will be imputed. For pandas’ dataframes with nullable integer dtypes with missing values, missing_values should be set to np.nan, since pd.NA will be converted to np.nan.
features ({'missing-only', 'all'}, default='missing-only') –
Whether the imputer mask should represent all or a subset of features.
If ‘missing-only’ (default), the imputer mask will only represent features containing missing values during fit time.
If ‘all’, the imputer mask will represent all features.
sparse (bool or 'auto', default='auto') –
Whether the imputer mask format should be sparse or dense.
If ‘auto’ (default), the imputer mask will be of same type as input.
If True, the imputer mask will be a sparse matrix.
If False, the imputer mask will be a numpy array.
error_on_new (bool, default=True) – If True,
transform()
will raise an error when there are features with missing values that have no missing values infit()
. This is applicable only when features=’missing-only’.
-
features_
¶ The features indices which will be returned when calling
transform()
. They are computed duringfit()
. If features=’all’, features_ is equal to range(n_features).- Type
ndarray of shape (n_missing_features,) or (n_features,)
-
n_features_in_
¶ Number of features seen during fit.
New in version 0.24.
- Type
int
-
feature_names_in_
¶ Names of features seen during fit. Defined only when X has feature names that are all strings.
New in version 1.0.
- Type
ndarray of shape (n_features_in_,)
See also
SimpleImputer
Univariate imputation of missing values.
IterativeImputer
Multivariate imputation of missing values.
Examples
>>> import numpy as np
>>> from sklearn.impute import MissingIndicator
>>> X1 = np.array([[np.nan, 1,  3],
...                [4,      0,  np.nan],
...                [8,      1,  0]])
>>> X2 = np.array([[5,      1,  np.nan],
...                [np.nan, 2,  3],
...                [2,      4,  0]])
>>> indicator = MissingIndicator()
>>> indicator.fit(X1)
MissingIndicator()
>>> X2_tr = indicator.transform(X2)
>>> X2_tr
array([[False,  True],
       [ True, False],
       [False, False]])
-
fit
(X, y=None)[source]¶ Fit the transformer on X.
- Parameters
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Input data, where n_samples is the number of samples and n_features is the number of features.
y (Ignored) – Not used, present for API consistency by convention.
- Returns
self – Fitted estimator.
- Return type
object
-
fit_transform
(X, y=None)[source]¶ Generate missing values indicator for X.
- Parameters
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The input data to complete.
y (Ignored) – Not used, present for API consistency by convention.
- Returns
Xt – The missing indicator for input data. The data type of Xt will be boolean.
- Return type
{ndarray, sparse matrix} of shape (n_samples, n_features) or (n_samples, n_features_with_missing)
-
transform
(X)[source]¶ Generate missing values indicator for X.
- Parameters
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The input data to complete.
- Returns
Xt – The missing indicator for input data. The data type of Xt will be boolean.
- Return type
{ndarray, sparse matrix} of shape (n_samples, n_features) or (n_samples, n_features_with_missing)
-
class
ballet.eng.external.
Normalizer
(norm='l2', *, copy=True)[source]¶ Bases:
sklearn.base.TransformerMixin
,sklearn.base.BaseEstimator
Normalize samples individually to unit norm.
Each sample (i.e. each row of the data matrix) with at least one non zero component is rescaled independently of other samples so that its norm (l1, l2 or inf) equals one.
This transformer is able to work both with dense numpy arrays and scipy.sparse matrix (use CSR format if you want to avoid the burden of a copy / conversion).
Scaling inputs to unit norms is a common operation for text classification or clustering for instance. For instance the dot product of two l2-normalized TF-IDF vectors is the cosine similarity of the vectors and is the base similarity metric for the Vector Space Model commonly used by the Information Retrieval community.
Read more in the User Guide.
- Parameters
norm ({'l1', 'l2', 'max'}, default='l2') – The norm to use to normalize each non zero sample. If norm=’max’ is used, values will be rescaled by the maximum of the absolute values.
copy (bool, default=True) – Set to False to perform inplace row normalization and avoid a copy (if the input is already a numpy array or a scipy.sparse CSR matrix).
-
n_features_in_
¶ Number of features seen during fit.
New in version 0.24.
- Type
int
-
feature_names_in_
¶ Names of features seen during fit. Defined only when X has feature names that are all strings.
New in version 1.0.
- Type
ndarray of shape (n_features_in_,)
See also
normalize
Equivalent function without the estimator API.
Notes
This estimator is stateless (besides constructor parameters), the fit method does nothing but is useful when used in a pipeline.
For a comparison of the different scalers, transformers, and normalizers, see examples/preprocessing/plot_all_scaling.py.
Examples
>>> from sklearn.preprocessing import Normalizer
>>> X = [[4, 1, 2, 2],
...      [1, 3, 9, 3],
...      [5, 7, 5, 1]]
>>> transformer = Normalizer().fit(X)  # fit does nothing.
>>> transformer
Normalizer()
>>> transformer.transform(X)
array([[0.8, 0.2, 0.4, 0.4],
       [0.1, 0.3, 0.9, 0.3],
       [0.5, 0.7, 0.5, 0.1]])
-
fit
(X, y=None)[source]¶ Do nothing and return the estimator unchanged.
This method is just there to implement the usual API and hence work in pipelines.
- Parameters
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data to estimate the normalization parameters.
y (Ignored) – Not used, present here for API consistency by convention.
- Returns
self – Fitted transformer.
- Return type
object
-
transform
(X, copy=None)[source]¶ Scale each non zero row of X to unit norm.
- Parameters
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data to normalize, row by row. scipy.sparse matrices should be in CSR format to avoid an un-necessary copy.
copy (bool, default=None) – Copy the input X or not.
- Returns
X_tr – Transformed array.
- Return type
{ndarray, sparse matrix} of shape (n_samples, n_features)
-
class
ballet.eng.external.
OneHotEncoder
(*, categories='auto', drop=None, sparse=True, dtype=<class 'numpy.float64'>, handle_unknown='error')[source]¶ Bases:
sklearn.preprocessing._encoders._BaseEncoder
Encode categorical features as a one-hot numeric array.
The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features. The features are encoded using a one-hot (aka ‘one-of-K’ or ‘dummy’) encoding scheme. This creates a binary column for each category and returns a sparse matrix or dense array (depending on the
sparse
parameter).
By default, the encoder derives the categories based on the unique values in each feature. Alternatively, you can also specify the categories manually.
This encoding is needed for feeding categorical data to many scikit-learn estimators, notably linear models and SVMs with the standard kernels.
Note: a one-hot encoding of y labels should use a LabelBinarizer instead.
Read more in the User Guide.
- Parameters
categories ('auto' or a list of array-like, default='auto') –
Categories (unique values) per feature:
’auto’ : Determine categories automatically from the training data.
list :
categories[i]
holds the categories expected in the ith column. The passed categories should not mix strings and numeric values within a single feature, and should be sorted in case of numeric values.
The used categories can be found in the
categories_
attribute.
New in version 0.20.
drop ({'first', 'if_binary'} or a array-like of shape (n_features,), default=None) –
Specifies a methodology to use to drop one of the categories per feature. This is useful in situations where perfectly collinear features cause problems, such as when feeding the resulting data into a neural network or an unregularized regression.
However, dropping one category breaks the symmetry of the original representation and can therefore induce a bias in downstream models, for instance for penalized linear classification or regression models.
None : retain all features (the default).
’first’ : drop the first category in each feature. If only one category is present, the feature will be dropped entirely.
’if_binary’ : drop the first category in each feature with two categories. Features with 1 or more than 2 categories are left intact.
array :
drop[i]
is the category in featureX[:, i]
that should be dropped.
New in version 0.21: The parameter drop was added in 0.21.
Changed in version 0.23: The option drop=’if_binary’ was added in 0.23.
sparse (bool, default=True) – Will return sparse matrix if set True else will return an array.
dtype (number type, default=float) – Desired dtype of output.
handle_unknown ({'error', 'ignore'}, default='error') – Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.
-
categories_
¶ The categories of each feature determined during fitting (in order of the features in X and corresponding with the output of
transform
). This includes the category specified indrop
(if any).- Type
list of arrays
-
drop_idx_
drop_idx_[i] is the index in categories_[i] of the category to be dropped for each feature.
drop_idx_[i] = None if no category is to be dropped from the feature with index i, e.g. when drop=’if_binary’ and the feature isn’t binary.
drop_idx_ = None if all the transformed features will be retained.
Changed in version 0.23: Added the possibility to contain None values.
- Type
array of shape (n_features,)
-
n_features_in_
¶ Number of features seen during fit.
New in version 1.0.
- Type
int
-
feature_names_in_
¶ Names of features seen during fit. Defined only when X has feature names that are all strings.
New in version 1.0.
- Type
ndarray of shape (n_features_in_,)
See also
OrdinalEncoder
Performs an ordinal (integer) encoding of the categorical features.
sklearn.feature_extraction.DictVectorizer
Performs a one-hot encoding of dictionary items (also handles string-valued features).
sklearn.feature_extraction.FeatureHasher
Performs an approximate one-hot encoding of dictionary items or strings.
LabelBinarizer
Binarizes labels in a one-vs-all fashion.
MultiLabelBinarizer
Transforms between iterable of iterables and a multilabel format, e.g. a (samples x classes) binary matrix indicating the presence of a class label.
Examples
Given a dataset with two features, we let the encoder find the unique values per feature and transform the data to a binary one-hot encoding.
>>> from sklearn.preprocessing import OneHotEncoder
One can discard categories not seen during fit:
>>> enc = OneHotEncoder(handle_unknown='ignore')
>>> X = [['Male', 1], ['Female', 3], ['Female', 2]]
>>> enc.fit(X)
OneHotEncoder(handle_unknown='ignore')
>>> enc.categories_
[array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]
>>> enc.transform([['Female', 1], ['Male', 4]]).toarray()
array([[1., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0.]])
>>> enc.inverse_transform([[0, 1, 1, 0, 0], [0, 0, 0, 1, 0]])
array([['Male', 1],
       [None, 2]], dtype=object)
>>> enc.get_feature_names_out(['gender', 'group'])
array(['gender_Female', 'gender_Male', 'group_1', 'group_2', 'group_3'], ...)
One can always drop the first column for each feature:
>>> drop_enc = OneHotEncoder(drop='first').fit(X)
>>> drop_enc.categories_
[array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]
>>> drop_enc.transform([['Female', 1], ['Male', 2]]).toarray()
array([[0., 0., 0.],
       [1., 1., 0.]])
Or drop a column for feature only having 2 categories:
>>> drop_binary_enc = OneHotEncoder(drop='if_binary').fit(X)
>>> drop_binary_enc.transform([['Female', 1], ['Male', 2]]).toarray()
array([[0., 1., 0., 0.],
       [1., 0., 1., 0.]])
-
fit
(X, y=None)[source]¶ Fit OneHotEncoder to X.
- Parameters
X (array-like of shape (n_samples, n_features)) – The data to determine the categories of each feature.
y (None) – Ignored. This parameter exists only for compatibility with
Pipeline
.
- Returns
Fitted encoder.
- Return type
self
-
fit_transform
(X, y=None)[source]¶ Fit OneHotEncoder to X, then transform X.
Equivalent to fit(X).transform(X) but more convenient.
- Parameters
X (array-like of shape (n_samples, n_features)) – The data to encode.
y (None) – Ignored. This parameter exists only for compatibility with
Pipeline
.
- Returns
X_out – Transformed input. If sparse=True, a sparse matrix will be returned.
- Return type
{ndarray, sparse matrix} of shape (n_samples, n_encoded_features)
-
get_feature_names
(input_features=None)[source]¶ DEPRECATED: get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.
Return feature names for output features.
- input_features : list of str of shape (n_features,)
String names for input features if available. By default, “x0”, “x1”, … “xn_features” is used.
- output_feature_names : ndarray of shape (n_output_features,)
Array of feature names.
-
get_feature_names_out
(input_features=None)[source]¶ Get output feature names for transformation.
- Parameters
input_features (array-like of str or None, default=None) –
Input features.
If input_features is None, then feature_names_in_ is used as feature names in. If feature_names_in_ is not defined, then names are generated: [x0, x1, …, x(n_features_in_)].
If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.
- Returns
feature_names_out – Transformed feature names.
- Return type
ndarray of str objects
-
inverse_transform
(X)[source]¶ Convert the data back to the original representation.
When unknown categories are encountered (all zeros in the one-hot encoding), None is used to represent this category. If the feature with the unknown category has a dropped category, the dropped category will be its inverse.
- Parameters
X ({array-like, sparse matrix} of shape (n_samples, n_encoded_features)) – The transformed data.
- Returns
X_tr – Inverse transformed array.
- Return type
ndarray of shape (n_samples, n_features)
-
transform
(X)[source]¶ Transform X using one-hot encoding.
- Parameters
X (array-like of shape (n_samples, n_features)) – The data to encode.
- Returns
X_out – Transformed input. If sparse=True, a sparse matrix will be returned.
- Return type
{ndarray, sparse matrix} of shape (n_samples, n_encoded_features)
-
class
ballet.eng.external.
OrdinalEncoder
(*, categories='auto', dtype=<class 'numpy.float64'>, handle_unknown='error', unknown_value=None)[source]¶ Bases:
sklearn.preprocessing._encoders._BaseEncoder
Encode categorical features as an integer array.
The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features. The features are converted to ordinal integers. This results in a single column of integers (0 to n_categories - 1) per feature.
Read more in the User Guide.
New in version 0.20.
- Parameters
categories ('auto' or a list of array-like, default='auto') –
Categories (unique values) per feature:
’auto’ : Determine categories automatically from the training data.
list :
categories[i]
holds the categories expected in the ith column. The passed categories should not mix strings and numeric values, and should be sorted in case of numeric values.
The used categories can be found in the
categories_
attribute.
dtype (number type, default=np.float64) – Desired dtype of output.
handle_unknown ({'error', 'use_encoded_value'}, default='error') –
When set to ‘error’ an error will be raised in case an unknown categorical feature is present during transform. When set to ‘use_encoded_value’, the encoded value of unknown categories will be set to the value given for the parameter unknown_value. In
inverse_transform()
, an unknown category will be denoted as None.
New in version 0.24.
unknown_value (int or np.nan, default=None) –
When the parameter handle_unknown is set to ‘use_encoded_value’, this parameter is required and will set the encoded value of unknown categories. It has to be distinct from the values used to encode any of the categories in fit. If set to np.nan, the dtype parameter must be a float dtype. A minimal sketch follows this parameter list.
New in version 0.24.
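A minimal sketch of handle_unknown='use_encoded_value' with a sentinel of -1 (the categories and values below are hypothetical):

from sklearn.preprocessing import OrdinalEncoder

enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
enc.fit([['low'], ['medium'], ['high']])

# known categories are encoded by their index in the sorted categories;
# unknown ones get the sentinel -1
enc.transform([['medium'], ['very high']])   # -> array([[ 2.], [-1.]])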
-
categories_
¶ The categories of each feature determined during
fit
(in order of the features in X and corresponding with the output oftransform
). This does not include categories that weren’t seen duringfit
.- Type
list of arrays
-
n_features_in_
¶ Number of features seen during fit.
New in version 1.0.
- Type
int
-
feature_names_in_
¶ Names of features seen during fit. Defined only when X has feature names that are all strings.
New in version 1.0.
- Type
ndarray of shape (n_features_in_,)
See also
OneHotEncoder
Performs a one-hot encoding of categorical features.
LabelEncoder
Encodes target labels with values between 0 and
n_classes-1
.
Examples
Given a dataset with two features, we let the encoder find the unique values per feature and transform the data to an ordinal encoding.
>>> from sklearn.preprocessing import OrdinalEncoder
>>> enc = OrdinalEncoder()
>>> X = [['Male', 1], ['Female', 3], ['Female', 2]]
>>> enc.fit(X)
OrdinalEncoder()
>>> enc.categories_
[array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]
>>> enc.transform([['Female', 3], ['Male', 1]])
array([[0., 2.],
       [1., 0.]])
>>> enc.inverse_transform([[1, 0], [0, 1]])
array([['Male', 1],
       ['Female', 2]], dtype=object)
-
fit
(X, y=None)[source]¶ Fit the OrdinalEncoder to X.
- Parameters
X (array-like of shape (n_samples, n_features)) – The data to determine the categories of each feature.
y (None) – Ignored. This parameter exists only for compatibility with
Pipeline
.
- Returns
self – Fitted encoder.
- Return type
object
-
class
ballet.eng.external.
OutlierTrimmer
(capping_method='gaussian', tail='right', fold=3, variables=None, missing_values='raise')[source]¶ Bases:
feature_engine.outliers.winsorizer.Winsorizer
The OutlierTrimmer() removes observations with outliers from the dataset.
It works only with numerical variables. A list of variables can be indicated. Alternatively, the OutlierTrimmer() will select all numerical variables.
The OutlierTrimmer() first calculates the maximum and/or minimum values beyond which a value will be considered an outlier, and thus removed.
Limits are determined using:
a Gaussian approximation
the inter-quantile range proximity rule
percentiles.
Gaussian limits:
right tail: mean + 3* std
left tail: mean - 3* std
IQR limits:
right tail: 75th quantile + 3* IQR
left tail: 25th quantile - 3* IQR
where IQR is the inter-quartile range: 75th quantile - 25th quantile.
percentiles or quantiles:
right tail: 95th percentile
left tail: 5th percentile
You can select how far out to cap the maximum or minimum values with the parameter ‘fold’.
If capping_method=’gaussian’ fold gives the value to multiply the std.
If capping_method=’iqr’ fold is the value to multiply the IQR.
If capping_method=’quantile’, fold is the percentile on each tail that should be censored. For example, if fold=0.05, the limits will be the 5th and 95th percentiles. If fold=0.1, the limits will be the 10th and 90th percentiles.
The transformer first finds the values at one or both tails of the distributions (fit).
The transformer then removes observations with outliers from the dataframe (transform).
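A minimal usage sketch with the IQR proximity rule on a hypothetical dataframe:

import pandas as pd
from ballet.eng.external import OutlierTrimmer

X = pd.DataFrame({'income': [10, 11, 12, 13, 500]})

trimmer = OutlierTrimmer(capping_method='iqr', tail='right', fold=1.5)
Xt = trimmer.fit_transform(X)
# 500 lies beyond the 75th quantile + 1.5 * IQR, so that row is dropped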
- Parameters
capping_method (str, default=gaussian) –
Desired capping method. Can take ‘gaussian’, ‘iqr’ or ‘quantiles’.
’gaussian’: the transformer will find the maximum and / or minimum values to cap the variables using the Gaussian approximation.
’iqr’: the transformer will find the boundaries using the IQR proximity rule.
’quantiles’: the limits are given by the percentiles.
tail (str, default=right) – Whether to cap outliers on the right, left or both tails of the distribution. Can take ‘left’, ‘right’ or ‘both’.
fold (int or float, default=3) –
How far out to place the capping values. The number that will multiply the std or IQR to calculate the capping values. Recommended values, 2 or 3 for the gaussian approximation, or 1.5 or 3 for the IQR proximity rule.
If capping_method=’quantile’, then ‘fold’ indicates the percentile. So if fold=0.05, the limits will be the 95th and 5th percentiles. Note: Outliers will be removed up to a maximum of the 20th percentiles on both sides. Thus, when capping_method=’quantile’, then ‘fold’ takes values between 0 and 0.20.
variables (list, default=None) – The list of variables for which the outliers will be removed If None, the transformer will find and select all numerical variables.
missing_values (string, default='raise') – Indicates if missing values should be ignored or raised. Sometimes we want to remove outliers in the raw, original data, sometimes, we may want to remove outliers in the already pre-transformed data. If missing_values=’ignore’, the transformer will ignore missing data when learning the capping parameters or transforming the data. If missing_values=’raise’ the transformer will return an error if the training or the datasets to transform contain missing values.
-
right_tail_caps_
¶ Dictionary with the maximum values above which values will be removed.
-
left_tail_caps_
¶ Dictionary with the minimum values below which values will be removed.
-
variables_
¶ The group of variables that will be transformed.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
Find maximum and minimum values.
-
transform:
Remove outliers.
-
fit_transform:
Fit to the data. Then transform it.
-
transform
(X)[source]¶ Remove observations with outliers from the dataframe.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The data to be transformed.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError – If the dataframe is not of same size as that used in fit()
- Returns
X – The dataframe without outlier observations.
- Return type
pandas dataframe of shape = [n_samples, n_features]
-
class
ballet.eng.external.
PRatioEncoder
(encoding_method='ratio', variables=None, ignore_format=False)[source]¶ Bases:
feature_engine.encoding.base_encoder.BaseCategoricalTransformer
The PRatioEncoder() replaces categories by the ratio of the probability of the target = 1 and the probability of the target = 0.
The target probability ratio is given by:
\[p(1) / p(0)\]
The log of the target probability ratio is:
\[\log( p(1) / p(0) )\]
Note
This categorical encoding is exclusive for binary classification.
For example in the variable colour, if the mean of the target = 1 for blue is 0.8 and the mean of the target = 0 is 0.2, blue will be replaced by: 0.8 / 0.2 = 4 if ratio is selected, or log(0.8/0.2) = 1.386 if log_ratio is selected.
Note: the division by 0 is not defined and the log(0) is not defined. Thus, if p(0) = 0 for the ratio encoder, or either p(0) = 0 or p(1) = 0 for log_ratio, in any of the variables, the encoder will return an error.
The encoder will encode only categorical variables by default (type ‘object’ or ‘categorical’). You can pass a list of variables to encode. Alternatively, the encoder will find and encode all categorical variables (type ‘object’ or ‘categorical’).
With ignore_format=True you have the option to encode numerical variables as well. The procedure is identical, you can either enter the list of variables to encode, or the transformer will automatically select all variables.
The encoder first maps the categories to the numbers for each variable (fit). The encoder then transforms the categories into the mapped numbers (transform).
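The mapping learned during fit can be illustrated with plain pandas for the colour example above (this is only an illustration of the ratio, not the encoder's internal code):

import pandas as pd

y = pd.Series([1, 1, 1, 1, 0, 1, 0, 0, 0, 0])
colour = pd.Series(['blue'] * 5 + ['grey'] * 5)

p1 = y.groupby(colour).mean()   # probability that target = 1 per category
ratio = p1 / (1 - p1)           # encoding_method='ratio'
# blue -> 0.8 / 0.2 = 4.0, grey -> 0.2 / 0.8 = 0.25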
- Parameters
encoding_method (str, default='ratio') –
Desired method of encoding.
’ratio’ : probability ratio
’log_ratio’ : log probability ratio
variables (list, default=None) – The list of categorical variables that will be encoded. If None, the encoder will find and transform all variables of type object or categorical by default. You can also make the transformer accept numerical variables, see the next parameter.
ignore_format (bool, default=False) – Whether the format in which the categorical variables are cast should be ignored. If false, the encoder will automatically select variables of type object or categorical, or check that the variables entered by the user are of type object or categorical. If True, the encoder will select all variables or accept all variables entered by the user, including those cast as numeric.
-
encoder_dict_
¶ Dictionary with the probability ratio per category per variable.
-
variables_
¶ The group of variables that will be transformed.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
Learn probability ratio per category, per variable.
-
transform:
Encode categories into numbers.
-
fit_transform:
Fit to the data, then transform it.
-
inverse_transform:
Encode the numbers into the original categories.
Notes
NAN are introduced when encoding categories that were not present in the training dataset. If this happens, try grouping infrequent categories using the RareLabelEncoder().
See also
feature_engine.encoding.RareLabelEncoder
-
fit
(X, y)[source]¶ Learn the numbers that should be used to replace the categories in each variable. That is the ratio of probability.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples. Can be the entire dataframe, not just the categorical variables.
y (pandas series.) – Target, must be binary.
- Raises
TypeError –
If the input is not a Pandas DataFrame. - If user enters non-categorical variables (unless ignore_format is True)
ValueError –
If there are no categorical variables in df or df is empty - If variable(s) contain null values. - If y is not binary with values 0 and 1. - If p(0) = 0 when encoding_method=’ratio’, or if either p(0) = 0 or p(1) = 0 when encoding_method=’log_ratio’.
- Returns
- Return type
self
-
inverse_transform
(X)[source]¶ Convert the encoded variable back to the original values.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The transformed dataframe.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError –
If the variable(s) contain null values - If the df has different number of features than the df used in fit()
- Returns
X – The un-transformed dataframe, with the categorical variables containing the original values.
- Return type
pandas dataframe of shape = [n_samples, n_features]
-
transform
(X)[source]¶ Replace categories with the learned parameters.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The dataset to transform.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError –
If the variable(s) contain null values - If the df has different number of features than the df used in fit()
Warning – If after encoding, NAN were introduced.
- Returns
X – The dataframe containing the categories replaced by numbers.
- Return type
pandas dataframe of shape = [n_samples, n_features]
-
class
ballet.eng.external.
PolynomialEncoder
(verbose=0, cols=None, mapping=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value')[source]¶ Bases:
sklearn.base.BaseEstimator
,sklearn.base.TransformerMixin
Polynomial contrast coding for the encoding of categorical features.
- Parameters
verbose (int) – integer indicating verbosity of the output. 0 for none.
cols (list) – a list of columns to encode, if None, all string columns will be encoded.
drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.
return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).
handle_unknown (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if indicator is used, an extra column will be added in if the transform matrix has unknown categories. This can cause unexpected changes in dimension in some cases.
handle_missing (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if indicator is used, an extra column will be added in if the transform matrix has nan values. This can cause unexpected changes in dimension in some cases.
Example
>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = PolynomialEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 21 columns):
intercept    506 non-null int64
CRIM         506 non-null float64
ZN           506 non-null float64
INDUS        506 non-null float64
CHAS_0       506 non-null float64
NOX          506 non-null float64
RM           506 non-null float64
AGE          506 non-null float64
DIS          506 non-null float64
RAD_0        506 non-null float64
RAD_1        506 non-null float64
RAD_2        506 non-null float64
RAD_3        506 non-null float64
RAD_4        506 non-null float64
RAD_5        506 non-null float64
RAD_6        506 non-null float64
RAD_7        506 non-null float64
TAX          506 non-null float64
PTRATIO      506 non-null float64
B            506 non-null float64
LSTAT        506 non-null float64
dtypes: float64(20), int64(1)
memory usage: 83.1 KB
None
References
- 1
Contrast Coding Systems for Categorical Variables, from
https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/
- 2
Gregory Carey (2003). Coding Categorical Variables, from
-
fit
(X, y=None, **kwargs)[source]¶ Fit encoder according to X and y.
- Parameters
X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.
y (array-like, shape = [n_samples]) – Target values.
- Returns
self – Returns self.
- Return type
encoder
-
class
ballet.eng.external.
PolynomialFeatures
(degree=2, *, interaction_only=False, include_bias=True, order='C')[source]¶ Bases:
sklearn.base.TransformerMixin
,sklearn.base.BaseEstimator
Generate polynomial and interaction features.
Generate a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree. For example, if an input sample is two dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2].
Read more in the User Guide.
- Parameters
degree (int or tuple (min_degree, max_degree), default=2) – If a single int is given, it specifies the maximal degree of the polynomial features. If a tuple (min_degree, max_degree) is passed, then min_degree is the minimum and max_degree is the maximum polynomial degree of the generated features. Note that min_degree=0 and min_degree=1 are equivalent as outputting the degree zero term is determined by include_bias.
interaction_only (bool, default=False) –
If True, only interaction features are produced: features that are products of at most degree distinct input features, i.e. terms with power of 2 or higher of the same input feature are excluded:
included: x[0], x[1], x[0] * x[1], etc.
excluded: x[0] ** 2, x[0] ** 2 * x[1], etc.
include_bias (bool, default=True) – If True (default), then include a bias column, the feature in which all polynomial powers are zero (i.e. a column of ones - acts as an intercept term in a linear model).
order ({'C', 'F'}, default='C') –
Order of output array in the dense case. ‘F’ order is faster to compute, but may slow down subsequent estimators.
New in version 0.21.
-
powers_
¶ powers_[i, j] is the exponent of the jth input in the ith output.
- Type
ndarray of shape (n_output_features_, n_features_in_)
-
n_input_features_
¶ The total number of input features.
Deprecated since version 1.0: This attribute is deprecated in 1.0 and will be removed in 1.2. Refer to n_features_in_ instead.
- Type
int
-
n_features_in_
¶ Number of features seen during fit.
New in version 0.24.
- Type
int
-
feature_names_in_
¶ Names of features seen during fit. Defined only when X has feature names that are all strings.
New in version 1.0.
- Type
ndarray of shape (n_features_in_,)
-
n_output_features_
¶ The total number of polynomial output features. The number of output features is computed by iterating over all suitably sized combinations of input features.
- Type
int
See also
SplineTransformer
Transformer that generates univariate B-spline bases for features.
Notes
Be aware that the number of features in the output array scales polynomially in the number of features of the input array, and exponentially in the degree. High degrees can cause overfitting.
See examples/linear_model/plot_polynomial_interpolation.py
Examples
>>> import numpy as np
>>> from sklearn.preprocessing import PolynomialFeatures
>>> X = np.arange(6).reshape(3, 2)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5]])
>>> poly = PolynomialFeatures(2)
>>> poly.fit_transform(X)
array([[ 1.,  0.,  1.,  0.,  0.,  1.],
       [ 1.,  2.,  3.,  4.,  6.,  9.],
       [ 1.,  4.,  5., 16., 20., 25.]])
>>> poly = PolynomialFeatures(interaction_only=True)
>>> poly.fit_transform(X)
array([[ 1.,  0.,  1.,  0.],
       [ 1.,  2.,  3.,  6.],
       [ 1.,  4.,  5., 20.]])
-
fit
(X, y=None)[source]¶ Compute number of output features.
- Parameters
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data.
y (Ignored) – Not used, present here for API consistency by convention.
- Returns
self – Fitted transformer.
- Return type
object
-
get_feature_names
(input_features=None)[source]¶ DEPRECATED: get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.
Return feature names for output features.
- input_features : list of str of shape (n_features,), default=None
String names for input features if available. By default, “x0”, “x1”, … “xn_features” is used.
- output_feature_names : list of str of shape (n_output_features,)
Transformed feature names.
-
get_feature_names_out
(input_features=None)[source]¶ Get output feature names for transformation.
- Parameters
input_features (array-like of str or None, default=None) –
Input features.
If input_features is None, then feature_names_in_ is used as feature names in. If feature_names_in_ is not defined, then names are generated: [x0, x1, …, x(n_features_in_)].
If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.
- Returns
feature_names_out – Transformed feature names.
- Return type
ndarray of str objects
-
property
n_input_features_
¶ The attribute n_input_features_ was deprecated in version 1.0 and will be removed in 1.2.
- Type
DEPRECATED
-
property
powers_
¶ Exponent for each of the inputs in the output.
-
transform
(X)[source]¶ Transform data to polynomial features.
- Parameters
X ({array-like, sparse matrix} of shape (n_samples, n_features)) –
The data to transform, row by row.
Prefer CSR over CSC for sparse input (for speed), but CSC is required if the degree is 4 or higher. If the degree is less than 4 and the input format is CSC, it will be converted to CSR, have its polynomial features generated, then converted back to CSC.
If the degree is 2 or 3, the method described in “Leveraging Sparsity to Speed Up Polynomial Feature Expansions of CSR Matrices Using K-Simplex Numbers” by Andrew Nystrom and John Hughes is used, which is much faster than the method used on CSC input. For this reason, a CSC input will be converted to CSR, and the output will be converted back to CSC prior to being returned, hence the preference of CSR.
- Returns
XP – The matrix of features, where NP is the number of polynomial features generated from the combination of inputs. If a sparse matrix is provided, it will be converted into a sparse csr_matrix.
- Return type
{ndarray, sparse matrix} of shape (n_samples, NP)
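As an illustrative sketch (not part of the original docstring), a degree-2 expansion of CSR input stays sparse and uses the fast expansion method described above; the data reuses the small matrix from the Examples section:

# Hypothetical usage sketch: degree-2 expansion of sparse CSR input.
import numpy as np
from scipy import sparse
from sklearn.preprocessing import PolynomialFeatures

X = sparse.csr_matrix(np.arange(6).reshape(3, 2))   # same data as the example above
poly = PolynomialFeatures(degree=2, include_bias=False)
XP = poly.fit_transform(X)    # remains a sparse CSR matrix for degree <= 3
print(XP.toarray())           # columns: x0, x1, x0^2, x0*x1, x1^2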
-
class
ballet.eng.external.
PowerTransformer
(method='yeo-johnson', *, standardize=True, copy=True)[source]¶ Bases:
sklearn.base._OneToOneFeatureMixin
,sklearn.base.TransformerMixin
,sklearn.base.BaseEstimator
Apply a power transform featurewise to make data more Gaussian-like.
Power transforms are a family of parametric, monotonic transformations that are applied to make data more Gaussian-like. This is useful for modeling issues related to heteroscedasticity (non-constant variance), or other situations where normality is desired.
Currently, PowerTransformer supports the Box-Cox transform and the Yeo-Johnson transform. The optimal parameter for stabilizing variance and minimizing skewness is estimated through maximum likelihood.
Box-Cox requires input data to be strictly positive, while Yeo-Johnson supports both positive or negative data.
By default, zero-mean, unit-variance normalization is applied to the transformed data.
Read more in the User Guide.
New in version 0.20.
- Parameters
method ({'yeo-johnson', 'box-cox'}, default='yeo-johnson') –
The power transform method. Available methods are:
‘yeo-johnson’, which works with positive and negative values.
‘box-cox’, which only works with strictly positive values.
standardize (bool, default=True) – Set to True to apply zero-mean, unit-variance normalization to the transformed output.
copy (bool, default=True) – Set to False to perform inplace computation during transformation.
-
lambdas_
¶ The parameters of the power transformation for the selected features.
- Type
ndarray of float of shape (n_features,)
-
n_features_in_
¶ Number of features seen during fit.
New in version 0.24.
- Type
int
-
feature_names_in_
¶ Names of features seen during fit. Defined only when X has feature names that are all strings.
New in version 1.0.
- Type
ndarray of shape (n_features_in_,)
See also
power_transform
Equivalent function without the estimator API.
QuantileTransformer
Maps data to a standard normal distribution with the parameter output_distribution=’normal’.
Notes
NaNs are treated as missing values: disregarded in fit, and maintained in transform.
For a comparison of the different scalers, transformers, and normalizers, see examples/preprocessing/plot_all_scaling.py.
References
- 1
I.K. Yeo and R.A. Johnson, “A new family of power transformations to improve normality or symmetry.” Biometrika, 87(4), pp.954-959, (2000).
- 2
G.E.P. Box and D.R. Cox, “An Analysis of Transformations”, Journal of the Royal Statistical Society B, 26, 211-252 (1964).
Examples
>>> import numpy as np
>>> from sklearn.preprocessing import PowerTransformer
>>> pt = PowerTransformer()
>>> data = [[1, 2], [3, 2], [4, 5]]
>>> print(pt.fit(data))
PowerTransformer()
>>> print(pt.lambdas_)
[ 1.386... -3.100...]
>>> print(pt.transform(data))
[[-1.316... -0.707...]
 [ 0.209... -0.707...]
 [ 1.106...  1.414...]]
-
fit
(X, y=None)[source]¶ Estimate the optimal parameter lambda for each feature.
The optimal lambda parameter for minimizing skewness is estimated on each feature independently using maximum likelihood.
- Parameters
X (array-like of shape (n_samples, n_features)) – The data used to estimate the optimal transformation parameters.
y (None) – Ignored.
- Returns
self – Fitted transformer.
- Return type
object
-
fit_transform
(X, y=None)[source]¶ Fit PowerTransformer to X, then transform X.
- Parameters
X (array-like of shape (n_samples, n_features)) – The data used to estimate the optimal transformation parameters and to be transformed using a power transformation.
y (Ignored) – Not used, present for API consistency by convention.
- Returns
X_new – Transformed data.
- Return type
ndarray of shape (n_samples, n_features)
-
inverse_transform
(X)[source]¶ Apply the inverse power transformation using the fitted lambdas.
The inverse of the Box-Cox transformation is given by:
if lambda_ == 0:
    X = exp(X_trans)
else:
    X = (X_trans * lambda_ + 1) ** (1 / lambda_)
The inverse of the Yeo-Johnson transformation is given by:
if X >= 0 and lambda_ == 0:
    X = exp(X_trans) - 1
elif X >= 0 and lambda_ != 0:
    X = (X_trans * lambda_ + 1) ** (1 / lambda_) - 1
elif X < 0 and lambda_ != 2:
    X = 1 - (-(2 - lambda_) * X_trans + 1) ** (1 / (2 - lambda_))
elif X < 0 and lambda_ == 2:
    X = 1 - exp(-X_trans)
- Parameters
X (array-like of shape (n_samples, n_features)) – The transformed data.
- Returns
X – The original data.
- Return type
ndarray of shape (n_samples, n_features)
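A minimal round-trip sketch (illustrative, not taken from the original docstring) shows that inverse_transform recovers the original data from the fitted lambdas, up to floating-point error:

# Illustrative round trip: transform and then invert with the fitted lambdas.
import numpy as np
from sklearn.preprocessing import PowerTransformer

X = np.array([[1.0, 2.0], [3.0, 2.0], [4.0, 5.0]])
pt = PowerTransformer(method='yeo-johnson', standardize=True)
X_trans = pt.fit_transform(X)
X_back = pt.inverse_transform(X_trans)   # undoes the standardization and the power transform
print(np.allclose(X, X_back))            # True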
-
transform
(X)[source]¶ Apply the power transform to each feature using the fitted lambdas.
- Parameters
X (array-like of shape (n_samples, n_features)) – The data to be transformed using a power transformation.
- Returns
X_trans – The transformed data.
- Return type
ndarray of shape (n_samples, n_features)
-
class
ballet.eng.external.
QuantileTransformer
(*, n_quantiles=1000, output_distribution='uniform', ignore_implicit_zeros=False, subsample=100000, random_state=None, copy=True)[source]¶ Bases:
sklearn.base._OneToOneFeatureMixin
,sklearn.base.TransformerMixin
,sklearn.base.BaseEstimator
Transform features using quantiles information.
This method transforms the features to follow a uniform or a normal distribution. Therefore, for a given feature, this transformation tends to spread out the most frequent values. It also reduces the impact of (marginal) outliers: this is therefore a robust preprocessing scheme.
The transformation is applied on each feature independently. First an estimate of the cumulative distribution function of a feature is used to map the original values to a uniform distribution. The obtained values are then mapped to the desired output distribution using the associated quantile function. Features values of new/unseen data that fall below or above the fitted range will be mapped to the bounds of the output distribution. Note that this transform is non-linear. It may distort linear correlations between variables measured at the same scale but renders variables measured at different scales more directly comparable.
Read more in the User Guide.
New in version 0.19.
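As a brief illustrative sketch (not from the original docstring), setting output_distribution='normal' maps a skewed feature to an approximately standard normal distribution:

# Illustrative sketch: map a right-skewed feature to a roughly normal one.
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.RandomState(0)
X = rng.lognormal(size=(100, 1))                     # heavily right-skewed data
qt = QuantileTransformer(n_quantiles=100, output_distribution='normal',
                         random_state=0)
X_trans = qt.fit_transform(X)
print(X_trans.mean(), X_trans.std())                 # roughly 0 and 1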
- Parameters
n_quantiles (int, default=1000 or n_samples) – Number of quantiles to be computed. It corresponds to the number of landmarks used to discretize the cumulative distribution function. If n_quantiles is larger than the number of samples, n_quantiles is set to the number of samples as a larger number of quantiles does not give a better approximation of the cumulative distribution function estimator.
output_distribution ({'uniform', 'normal'}, default='uniform') – Marginal distribution for the transformed data. The choices are ‘uniform’ (default) or ‘normal’.
ignore_implicit_zeros (bool, default=False) – Only applies to sparse matrices. If True, the sparse entries of the matrix are discarded to compute the quantile statistics. If False, these entries are treated as zeros.
subsample (int, default=1e5) – Maximum number of samples used to estimate the quantiles for computational efficiency. Note that the subsampling procedure may differ for value-identical sparse and dense matrices.
random_state (int, RandomState instance or None, default=None) – Determines random number generation for subsampling and smoothing noise. Please see subsample for more details. Pass an int for reproducible results across multiple function calls. See Glossary.
copy (bool, default=True) – Set to False to perform inplace transformation and avoid a copy (if the input is already a numpy array).
-
n_quantiles_
¶ The actual number of quantiles used to discretize the cumulative distribution function.
- Type
int
-
quantiles_
¶ The values corresponding to the quantiles of reference.
- Type
ndarray of shape (n_quantiles, n_features)
-
references_
¶ Quantiles of references.
- Type
ndarray of shape (n_quantiles, )
-
n_features_in_
¶ Number of features seen during fit.
New in version 0.24.
- Type
int
-
feature_names_in_
¶ Names of features seen during fit. Defined only when X has feature names that are all strings.
New in version 1.0.
- Type
ndarray of shape (n_features_in_,)
See also
quantile_transform
Equivalent function without the estimator API.
PowerTransformer
Perform mapping to a normal distribution using a power transform.
StandardScaler
Perform standardization that is faster, but less robust to outliers.
RobustScaler
Perform robust standardization that removes the influence of outliers but does not put outliers and inliers on the same scale.
Notes
NaNs are treated as missing values: disregarded in fit, and maintained in transform.
For a comparison of the different scalers, transformers, and normalizers, see examples/preprocessing/plot_all_scaling.py.
Examples
>>> import numpy as np
>>> from sklearn.preprocessing import QuantileTransformer
>>> rng = np.random.RandomState(0)
>>> X = np.sort(rng.normal(loc=0.5, scale=0.25, size=(25, 1)), axis=0)
>>> qt = QuantileTransformer(n_quantiles=10, random_state=0)
>>> qt.fit_transform(X)
array([...])
-
fit
(X, y=None)[source]¶ Compute the quantiles used for transforming.
- Parameters
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to scale along the features axis. If a sparse matrix is provided, it will be converted into a sparse csc_matrix. Additionally, the sparse matrix needs to be nonnegative if ignore_implicit_zeros is False.
y (None) – Ignored.
- Returns
self – Fitted transformer.
- Return type
object
-
inverse_transform
(X)[source]¶ Back-projection to the original space.
- Parameters
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to scale along the features axis. If a sparse matrix is provided, it will be converted into a sparse csc_matrix. Additionally, the sparse matrix needs to be nonnegative if ignore_implicit_zeros is False.
- Returns
Xt – The projected data.
- Return type
{ndarray, sparse matrix} of shape (n_samples, n_features)
-
transform
(X)[source]¶ Feature-wise transformation of the data.
- Parameters
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to scale along the features axis. If a sparse matrix is provided, it will be converted into a sparse csc_matrix. Additionally, the sparse matrix needs to be nonnegative if ignore_implicit_zeros is False.
- Returns
Xt – The projected data.
- Return type
{ndarray, sparse matrix} of shape (n_samples, n_features)
-
class
ballet.eng.external.
RandomSampleImputer
(random_state=None, seed='general', seeding_method='add', variables=None)[source]¶ Bases:
feature_engine.imputation.base_imputer.BaseImputer
The RandomSampleImputer() replaces missing data with a random sample extracted from the variables in the training set.
The RandomSampleImputer() works with both numerical and categorical variables.
Note
The random samples used to replace missing values may vary from execution to execution, which may affect the results of your work. Thus, it is advisable to set a seed.
There are 2 ways in which the seed can be set in the RandomSampleImputer():
If seed = ‘general’ then the random_state can be either None or an integer. The seed will be used as the random_state and all observations will be imputed in one go. This is equivalent to pandas.sample(n, random_state=seed) where n is the number of observations with missing data.
If seed = ‘observation’, then the random_state should be a variable name or a list of variable names. The seed will be calculated observation per observation, either by adding or multiplying the seeding variable values, and passed to the random_state. Then, a value will be extracted from the train set using that seed and used to replace the NaN in that particular observation. This is the equivalent of pandas.sample(1, random_state=var1+var2) if the ‘seeding_method’ is set to ‘add’ or pandas.sample(1, random_state=var1*var2) if the ‘seeding_method’ is set to ‘multiply’.
For more details on why this functionality is important refer to the course Feature Engineering for Machine Learning in Udemy: https://www.udemy.com/feature-engineering-for-machine-learning/
Note, if the variables indicated in the random_state list are not numerical the imputer will return an error. Note also that the variables indicated as seed should not contain missing values.
This estimator stores a copy of the training set when the fit() method is called. Therefore, the object can become quite heavy. Also, it may not be GDPR compliant if your training data set contains Personal Information. Please check if this behaviour is allowed within your organisation.
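The following sketch (illustrative only; the column names are made up and the import path assumes feature_engine >= 1.0) shows the two seeding modes described above:

# Illustrative sketch of the two seeding modes.
import numpy as np
import pandas as pd
from feature_engine.imputation import RandomSampleImputer

X = pd.DataFrame({'age': [20, 30, 25, 40],
                  'income': [1000.0, np.nan, 3000.0, np.nan]})

# seed='general': one seed for the whole dataframe
imp_general = RandomSampleImputer(random_state=0, seed='general')
print(imp_general.fit_transform(X))

# seed='observation': per-row seed derived from the complete, numerical 'age' column
imp_obs = RandomSampleImputer(random_state=['age'], seed='observation',
                              seeding_method='add')
print(imp_obs.fit_transform(X))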
- Parameters
random_state (int, str or list, default=None) – The random_state can take an integer to set the seed when extracting the random samples. Alternatively, it can take a variable name or a list of variables, which values will be used to determine the seed observation per observation.
seed (str, default='general') –
Indicates whether the seed should be set for each observation with missing values, or if one seed should be used to impute all observations in one go.
general: one seed will be used to impute the entire dataframe. This is equivalent to setting the seed in pandas.sample(random_state).
observation: the seed will be set for each observation using the values of the variables indicated in the random_state for that particular observation.
seeding_method (str, default='add') – If more than one variable is indicated to seed the random sampling per observation, you can choose to combine those values by addition or multiplication. Can take the values ‘add’ or ‘multiply’.
variables (list, default=None) – The list of variables to be imputed. If None, the imputer will select all variables in the train set.
-
X_
¶ Copy of the training dataframe from which to extract the random samples.
-
variables_
¶ The group of variables that will be transformed.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
Make a copy of the dataframe
-
transform:
Impute missing data.
-
fit_transform:
Fit to the data, then transform it.
-
fit
(X, y=None)[source]¶ Makes a copy of the train set. Only stores a copy of the variables to impute. This copy is then used to randomly extract the values to fill the missing data during transform.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training dataset. Only a copy of the indicated variables will be stored in the transformer.
y (None) – y is not needed in this imputation. You can pass None or y.
- Raises
TypeError – If the input is not a Pandas DataFrame
- Returns
- Return type
self
-
transform
(X)[source]¶ Replace missing data with random values taken from the train set.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The dataframe to be transformed.
- Raises
TypeError – If the input is not a Pandas DataFrame
- Returns
X – The dataframe without missing values in the transformed variables.
- Return type
pandas dataframe of shape = [n_samples, n_features]
-
class
ballet.eng.external.
RareLabelEncoder
(tol=0.05, n_categories=10, max_n_categories=None, replace_with='Rare', variables=None, ignore_format=False)[source]¶ Bases:
feature_engine.encoding.base_encoder.BaseCategoricalTransformer
The RareLabelEncoder() groups rare / infrequent categories into a new category called “Rare”, or any other name entered by the user.
For example in the variable colour, if the percentage of observations for the categories magenta, cyan and burgundy are < 5 %, all those categories will be replaced by the new label “Rare”.
Note
Infrequent labels can also be grouped under a user defined name, for example ‘Other’. The name to replace infrequent categories is defined with the parameter replace_with.
The encoder will encode only categorical variables by default (type ‘object’ or ‘categorical’). You can pass a list of variables to encode. Alternatively, the encoder will find and encode all categorical variables (type ‘object’ or ‘categorical’).
With ignore_format=True you have the option to encode numerical variables as well. The procedure is identical, you can either enter the list of variables to encode, or the transformer will automatically select all variables.
The encoder first finds the frequent labels for each variable (fit). The encoder then groups the infrequent labels under the new label ‘Rare’ or by another user defined string (transform).
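A small illustrative sketch (not from the original docstring; assumes feature_engine >= 1.0) grouping categories that cover less than 20% of the observations:

# Illustrative sketch: categories below the tol frequency are grouped as 'Rare'.
import pandas as pd
from feature_engine.encoding import RareLabelEncoder

X = pd.DataFrame({'colour': ['blue'] * 6 + ['red'] * 5
                            + ['magenta', 'cyan', 'burgundy']})
encoder = RareLabelEncoder(tol=0.2, n_categories=3, replace_with='Rare')
X_enc = encoder.fit_transform(X)
print(X_enc['colour'].unique())   # ['blue' 'red' 'Rare']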
- Parameters
tol (float, default=0.05) – The minimum frequency a label should have to be considered frequent. Categories with frequencies lower than tol will be grouped.
n_categories (int, default=10) – The minimum number of categories a variable should have for the encoder to find frequent labels. If the variable contains fewer categories, all of them will be considered frequent.
max_n_categories (int, default=None) – The maximum number of categories that should be considered frequent. If None, all categories with frequency above the tolerance (tol) will be considered frequent. If you enter 5, only the 5 most frequent categories will be retained and the rest grouped.
replace_with (string, integer or float, default='Rare') – The value that will be used to replace infrequent categories.
variables (list, default=None) – The list of categorical variables that will be encoded. If None, the encoder will find and transform all variables of type object or categorical by default. You can also make the transformer accept numerical variables, see the next parameter.
ignore_format (bool, default=False) – Whether the format in which the categorical variables are cast should be ignored. If false, the encoder will automatically select variables of type object or categorical, or check that the variables entered by the user are of type object or categorical. If True, the encoder will select all variables or accept all variables entered by the user, including those cast as numeric.
-
encoder_dict_
¶ Dictionary with the frequent categories, i.e., those that will be kept, per variable.
-
variables_
¶ The variables that will be transformed.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
Find frequent categories.
-
transform:
Group rare categories
-
fit_transform:
Fit to data, then transform it.
-
fit
(X, y=None)[source]¶ Learn the frequent categories for each variable.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples. Can be the entire dataframe, not just selected variables
y (None) – y is not required. You can pass y or None.
- Raises
TypeError –
If the input is not a Pandas DataFrame.
If user enters non-categorical variables (unless ignore_format is True).
ValueError –
If there are no categorical variables in the df or the df is empty.
If the variable(s) contain null values.
Warning – If the number of categories in any one variable is less than that indicated in n_categories.
- Returns
- Return type
self
-
transform
(X)[source]¶ Group infrequent categories. Replace infrequent categories by the string ‘Rare’ or any other name provided by the user.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The input samples.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError –
If the variable(s) contain null values.
If user enters non-categorical variables (unless ignore_format is True).
- Returns
X – The dataframe where rare categories have been grouped.
- Return type
pandas dataframe of shape = [n_samples, n_features]
-
class
ballet.eng.external.
ReciprocalTransformer
(variables=None)[source]¶ Bases:
feature_engine.base_transformers.BaseNumericalTransformer
The ReciprocalTransformer() applies the reciprocal transformation 1 / x to numerical variables.
The ReciprocalTransformer() only works with numerical variables with non-zero values. If a variable contains the value 0, the transformer will raise an error.
A list of variables can be passed as an argument. Alternatively, the transformer will automatically select and transform all numerical variables.
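A minimal illustrative sketch (assuming feature_engine >= 1.0 and its feature_engine.transformation module):

# Illustrative sketch: apply 1 / x and recover the original values.
import pandas as pd
from feature_engine.transformation import ReciprocalTransformer

X = pd.DataFrame({'x': [1.0, 2.0, 4.0]})
tf = ReciprocalTransformer()
X_t = tf.fit_transform(X)           # x -> 1 / x: [1.0, 0.5, 0.25]
X_back = tf.inverse_transform(X_t)  # back to [1.0, 2.0, 4.0]
print(X_t['x'].tolist(), X_back['x'].tolist())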
- Parameters
variables (list, default=None) – The list of numerical variables to transform. If None, the transformer will automatically find and select all numerical variables.
-
variables_
¶ The group of variables that will be transformed.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
This transformer does not learn parameters.
-
transform:
Apply the reciprocal 1 / x transformation.
-
fit_transform:
Fit to data, then transform it.
-
inverse_transform:
Convert the data back to the original representation.
-
fit
(X, y=None)[source]¶ This transformer does not learn parameters.
- Parameters
X (Pandas DataFrame of shape = [n_samples, n_features]) – The training input samples. Can be the entire dataframe, not just the variables to transform.
y (pandas Series, default=None) – It is not needed in this transformer. You can pass y or None.
- Raises
TypeError –
If the input is not a Pandas DataFrame.
If any of the user provided variables are not numerical.
ValueError –
If there are no numerical variables in the df or the df is empty.
If the variable(s) contain null values.
If some variables contain zero as values.
- Returns
- Return type
self
-
inverse_transform
(X)[source]¶ Convert the data back to the original representation.
- Parameters
X (Pandas DataFrame of shape = [n_samples, n_features]) – The data to be transformed.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError –
If the variable(s) contain null values.
If the df has a different number of features than the df used in fit().
If some variables contain zero values.
- Returns
X – The dataframe with the transformed variables.
- Return type
pandas dataframe
-
transform
(X)[source]¶ Apply the reciprocal 1 / x transformation.
- Parameters
X (Pandas DataFrame of shape = [n_samples, n_features]) – The data to be transformed.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError –
If the variable(s) contain null values.
If the df has a different number of features than the df used in fit().
If some variables contain zero values.
- Returns
X – The dataframe with the transformed variables.
- Return type
pandas dataframe
-
class
ballet.eng.external.
ReversibleImputer
(y_only=False)[source]¶ Bases:
sklearn.base.BaseEstimator
,sklearn.base.TransformerMixin
-
needs_refit
= True¶
-
-
class
ballet.eng.external.
RobustScaler
(*, with_centering=True, with_scaling=True, quantile_range=(25.0, 75.0), copy=True, unit_variance=False)[source]¶ Bases:
sklearn.base._OneToOneFeatureMixin
,sklearn.base.TransformerMixin
,sklearn.base.BaseEstimator
Scale features using statistics that are robust to outliers.
This Scaler removes the median and scales the data according to the quantile range (defaults to IQR: Interquartile Range). The IQR is the range between the 1st quartile (25th quantile) and the 3rd quartile (75th quantile).
Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Median and interquartile range are then stored to be used on later data using the transform() method.
Standardization of a dataset is a common requirement for many machine learning estimators. Typically this is done by removing the mean and scaling to unit variance. However, outliers can often influence the sample mean / variance in a negative way. In such cases, the median and the interquartile range often give better results.
New in version 0.17.
Read more in the User Guide.
- Parameters
with_centering (bool, default=True) – If True, center the data before scaling. This will cause transform() to raise an exception when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.
with_scaling (bool, default=True) – If True, scale the data to interquartile range.
quantile_range (tuple (q_min, q_max), 0.0 < q_min < q_max < 100.0, default=(25.0, 75.0)) –
Quantile range used to calculate scale_. By default this is equal to the IQR, i.e., q_min is the first quantile and q_max is the third quantile.
New in version 0.18.
copy (bool, default=True) – If False, try to avoid a copy and do inplace scaling instead. This is not guaranteed to always work inplace; e.g. if the data is not a NumPy array or scipy.sparse CSR matrix, a copy may still be returned.
unit_variance (bool, default=False) –
If True, scale data so that normally distributed features have a variance of 1. In general, if the difference between the x-values of q_max and q_min for a standard normal distribution is greater than 1, the dataset will be scaled down. If less than 1, the dataset will be scaled up.
New in version 0.24.
-
center_
¶ The median value for each feature in the training set.
- Type
array of floats
-
scale_
¶ The (scaled) interquartile range for each feature in the training set.
New in version 0.17: scale_ attribute.
- Type
array of floats
-
n_features_in_
¶ Number of features seen during fit.
New in version 0.24.
- Type
int
-
feature_names_in_
¶ Names of features seen during fit. Defined only when X has feature names that are all strings.
New in version 1.0.
- Type
ndarray of shape (n_features_in_,)
See also
robust_scale
Equivalent function without the estimator API.
sklearn.decomposition.PCA
Further removes the linear correlation across features with ‘whiten=True’.
Notes
For a comparison of the different scalers, transformers, and normalizers, see examples/preprocessing/plot_all_scaling.py.
https://en.wikipedia.org/wiki/Median https://en.wikipedia.org/wiki/Interquartile_range
Examples
>>> from sklearn.preprocessing import RobustScaler
>>> X = [[ 1., -2.,  2.],
...      [ -2.,  1.,  3.],
...      [ 4.,  1., -2.]]
>>> transformer = RobustScaler().fit(X)
>>> transformer
RobustScaler()
>>> transformer.transform(X)
array([[ 0. , -2. ,  0. ],
       [-1. ,  0. ,  0.4],
       [ 1. ,  0. , -1.6]])
-
fit
(X, y=None)[source]¶ Compute the median and quantiles to be used for scaling.
- Parameters
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to compute the median and quantiles used for later scaling along the features axis.
y (Ignored) – Not used, present here for API consistency by convention.
- Returns
self – Fitted scaler.
- Return type
object
-
inverse_transform
(X)[source]¶ Scale back the data to the original representation.
- Parameters
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The rescaled data to be transformed back.
- Returns
X_tr – Transformed array.
- Return type
{ndarray, sparse matrix} of shape (n_samples, n_features)
-
class
ballet.eng.external.
RollingMeanTransformer
(window=5)[source]¶ Bases:
sklearn.base.BaseEstimator
,sklearn.base.TransformerMixin
-
class
ballet.eng.external.
SeasonalTransformer
(seasonal_period=1, pred_stride=1)[source]¶ Bases:
skits.feature_extraction.AutoregressiveTransformer
-
class
ballet.eng.external.
SimpleImputer
(*, missing_values=nan, strategy='mean', fill_value=None, verbose=0, copy=True, add_indicator=False)[source]¶ Bases:
sklearn.impute._base._BaseImputer
Imputation transformer for completing missing values.
Read more in the User Guide.
New in version 0.20: SimpleImputer replaces the previous sklearn.preprocessing.Imputer estimator which is now removed.
- Parameters
missing_values (int, float, str, np.nan or None, default=np.nan) – The placeholder for the missing values. All occurrences of missing_values will be imputed. For pandas’ dataframes with nullable integer dtypes with missing values, missing_values should be set to np.nan, since pd.NA will be converted to np.nan.
strategy (str, default='mean') –
The imputation strategy.
If “mean”, then replace missing values using the mean along each column. Can only be used with numeric data.
If “median”, then replace missing values using the median along each column. Can only be used with numeric data.
If “most_frequent”, then replace missing using the most frequent value along each column. Can be used with strings or numeric data. If there is more than one such value, only the smallest is returned.
If “constant”, then replace missing values with fill_value. Can be used with strings or numeric data.
New in version 0.20: strategy=”constant” for fixed value imputation.
fill_value (str or numerical value, default=None) – When strategy == “constant”, fill_value is used to replace all occurrences of missing_values. If left to the default, fill_value will be 0 when imputing numerical data and “missing_value” for strings or object data types.
verbose (int, default=0) – Controls the verbosity of the imputer.
copy (bool, default=True) –
If True, a copy of X will be created. If False, imputation will be done in-place whenever possible. Note that, in the following cases, a new copy will always be made, even if copy=False:
If X is not an array of floating values;
If X is encoded as a CSR matrix;
If add_indicator=True.
add_indicator (bool, default=False) – If True, a MissingIndicator transform will stack onto the output of the imputer’s transform. This allows a predictive estimator to account for missingness despite imputation. If a feature has no missing values at fit/train time, the feature won’t appear on the missing indicator even if there are missing values at transform/test time.
-
statistics_
¶ The imputation fill value for each feature. Computing statistics can result in np.nan values. During transform(), features corresponding to np.nan statistics will be discarded.
- Type
array of shape (n_features,)
-
indicator_
¶ Indicator used to add binary indicators for missing values. None if add_indicator=False.
- Type
MissingIndicator
-
n_features_in_
¶ Number of features seen during fit.
New in version 0.24.
- Type
int
-
feature_names_in_
¶ Names of features seen during fit. Defined only when X has feature names that are all strings.
New in version 1.0.
- Type
ndarray of shape (n_features_in_,)
See also
IterativeImputer
Multivariate imputation of missing values.
Notes
Columns which only contained missing values at fit() are discarded upon transform() if strategy is not “constant”.
Examples
>>> import numpy as np
>>> from sklearn.impute import SimpleImputer
>>> imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
>>> imp_mean.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])
SimpleImputer()
>>> X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
>>> print(imp_mean.transform(X))
[[ 7.   2.   3. ]
 [ 4.   3.5  6. ]
 [10.   3.5  9. ]]
-
fit
(X, y=None)[source]¶ Fit the imputer on X.
- Parameters
X ({array-like, sparse matrix}, shape (n_samples, n_features)) – Input data, where n_samples is the number of samples and n_features is the number of features.
y (Ignored) – Not used, present here for API consistency by convention.
- Returns
self – Fitted estimator.
- Return type
object
-
inverse_transform
(X)[source]¶ Convert the data back to the original representation.
Inverts the transform operation performed on an array. This operation can only be performed after SimpleImputer is instantiated with add_indicator=True.
Note that inverse_transform can only invert the transform in features that have binary indicators for missing values. If a feature has no missing values at fit time, the feature won’t have a binary indicator, and the imputation done at transform time won’t be inverted.
New in version 0.24.
- Parameters
X (array-like of shape (n_samples, n_features + n_features_missing_indicator)) – The imputed data to be reverted to original data. It has to be an augmented array of imputed data and the missing indicator mask.
- Returns
X_original – The original X with missing values as it was prior to imputation.
- Return type
ndarray of shape (n_samples, n_features)
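An illustrative sketch (not from the original docstring) of the add_indicator requirement described above:

# Illustrative sketch: inverse_transform needs the indicator columns produced
# when the imputer is constructed with add_indicator=True.
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[np.nan, 2.0], [4.0, np.nan], [10.0, 5.0]])
imp = SimpleImputer(strategy='mean', add_indicator=True)
X_aug = imp.fit_transform(X)            # imputed values + missing-indicator columns
X_orig = imp.inverse_transform(X_aug)   # NaNs restored where they originally were
print(np.isnan(X_orig))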
-
class
ballet.eng.external.
SparseRandomProjection
(n_components='auto', *, density='auto', eps=0.1, dense_output=False, random_state=None)[source]¶ Bases:
sklearn.random_projection.BaseRandomProjection
Reduce dimensionality through sparse random projection.
Sparse random matrix is an alternative to dense random projection matrix that guarantees similar embedding quality while being much more memory efficient and allowing faster computation of the projected data.
If we note s = 1 / density the components of the random matrix are drawn from:
-sqrt(s) / sqrt(n_components) with probability 1 / 2s
0 with probability 1 - 1 / s
+sqrt(s) / sqrt(n_components) with probability 1 / 2s
Read more in the User Guide.
New in version 0.13.
- Parameters
n_components (int or 'auto', default='auto') –
Dimensionality of the target projection space.
n_components can be automatically adjusted according to the number of samples in the dataset and the bound given by the Johnson-Lindenstrauss lemma. In that case the quality of the embedding is controlled by the eps parameter.
It should be noted that the Johnson-Lindenstrauss lemma can yield very conservative estimates of the required number of components as it makes no assumption on the structure of the dataset.
density (float or 'auto', default='auto') –
Ratio in the range (0, 1] of non-zero component in the random projection matrix.
If density = ‘auto’, the value is set to the minimum density as recommended by Ping Li et al.: 1 / sqrt(n_features).
Use density = 1 / 3.0 if you want to reproduce the results from Achlioptas, 2001.
eps (float, default=0.1) –
Parameter to control the quality of the embedding according to the Johnson-Lindenstrauss lemma when n_components is set to ‘auto’. This value should be strictly positive.
Smaller values lead to better embedding and higher number of dimensions (n_components) in the target projection space.
dense_output (bool, default=False) –
If True, ensure that the output of the random projection is a dense numpy array even if the input and random projection matrix are both sparse. In practice, if the number of components is small the number of zero components in the projected data will be very small and it will be more CPU and memory efficient to use a dense representation.
If False, the projected data uses a sparse representation if the input is sparse.
random_state (int, RandomState instance or None, default=None) – Controls the pseudo random number generator used to generate the projection matrix at fit time. Pass an int for reproducible output across multiple function calls. See Glossary.
-
n_components_
¶ Concrete number of components computed when n_components=”auto”.
- Type
int
-
components_
¶ Random matrix used for the projection. Sparse matrix will be of CSR format.
- Type
sparse matrix of shape (n_components, n_features)
-
density_
¶ Concrete density computed when density = “auto”.
- Type
float in range 0.0 - 1.0
-
n_features_in_
¶ Number of features seen during fit.
New in version 0.24.
- Type
int
-
feature_names_in_
¶ Names of features seen during fit. Defined only when X has feature names that are all strings.
New in version 1.0.
- Type
ndarray of shape (n_features_in_,)
See also
GaussianRandomProjection
Reduce dimensionality through Gaussian random projection.
References
- 1
Ping Li, T. Hastie and K. W. Church, 2006, “Very Sparse Random Projections”. https://web.stanford.edu/~hastie/Papers/Ping/KDD06_rp.pdf
- 2
D. Achlioptas, 2001, “Database-friendly random projections”, https://users.soe.ucsc.edu/~optas/papers/jl.pdf
Examples
>>> import numpy as np
>>> from sklearn.random_projection import SparseRandomProjection
>>> rng = np.random.RandomState(42)
>>> X = rng.rand(100, 10000)
>>> transformer = SparseRandomProjection(random_state=rng)
>>> X_new = transformer.fit_transform(X)
>>> X_new.shape
(100, 3947)
>>> # very few components are non-zero
>>> np.mean(transformer.components_ != 0)
0.0100...
-
class
ballet.eng.external.
StandardScaler
(*, copy=True, with_mean=True, with_std=True)[source]¶ Bases:
sklearn.base._OneToOneFeatureMixin
,sklearn.base.TransformerMixin
,sklearn.base.BaseEstimator
Standardize features by removing the mean and scaling to unit variance.
The standard score of a sample x is calculated as:
z = (x - u) / s
where u is the mean of the training samples or zero if with_mean=False, and s is the standard deviation of the training samples or one if with_std=False.
Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Mean and standard deviation are then stored to be used on later data using transform().
Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).
For instance, many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the L1 and L2 regularizers of linear models) assume that all features are centered around 0 and have variance in the same order. If a feature has a variance that is orders of magnitude larger than others, it might dominate the objective function and make the estimator unable to learn from other features correctly as expected.
This scaler can also be applied to sparse CSR or CSC matrices by passing with_mean=False to avoid breaking the sparsity structure of the data.
Read more in the User Guide.
- Parameters
copy (bool, default=True) – If False, try to avoid a copy and do inplace scaling instead. This is not guaranteed to always work inplace; e.g. if the data is not a NumPy array or scipy.sparse CSR matrix, a copy may still be returned.
with_mean (bool, default=True) – If True, center the data before scaling. This does not work (and will raise an exception) when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.
with_std (bool, default=True) – If True, scale the data to unit variance (or equivalently, unit standard deviation).
-
scale_
¶ Per feature relative scaling of the data to achieve zero mean and unit variance. Generally this is calculated using np.sqrt(var_). If a variance is zero, we can’t achieve unit variance, and the data is left as-is, giving a scaling factor of 1. scale_ is equal to None when with_std=False.
New in version 0.17: scale_
- Type
ndarray of shape (n_features,) or None
-
mean_
¶ The mean value for each feature in the training set. Equal to None when with_mean=False.
- Type
ndarray of shape (n_features,) or None
-
var_
¶ The variance for each feature in the training set. Used to compute scale_. Equal to None when with_std=False.
- Type
ndarray of shape (n_features,) or None
-
n_features_in_
¶ Number of features seen during fit.
New in version 0.24.
- Type
int
-
feature_names_in_
¶ Names of features seen during fit. Defined only when X has feature names that are all strings.
New in version 1.0.
- Type
ndarray of shape (n_features_in_,)
-
n_samples_seen_
¶ The number of samples processed by the estimator for each feature. If there are no missing samples, n_samples_seen will be an integer, otherwise it will be an array of dtype int. If sample_weights are used it will be a float (if no missing data) or an array of dtype float that sums the weights seen so far. Will be reset on new calls to fit, but increments across partial_fit calls.
- Type
int or ndarray of shape (n_features,)
See also
scale
Equivalent function without the estimator API.
PCA
Further removes the linear correlation across features with ‘whiten=True’.
Notes
NaNs are treated as missing values: disregarded in fit, and maintained in transform.
We use a biased estimator for the standard deviation, equivalent to numpy.std(x, ddof=0). Note that the choice of ddof is unlikely to affect model performance.
For a comparison of the different scalers, transformers, and normalizers, see examples/preprocessing/plot_all_scaling.py.
Examples
>>> from sklearn.preprocessing import StandardScaler
>>> data = [[0, 0], [0, 0], [1, 1], [1, 1]]
>>> scaler = StandardScaler()
>>> print(scaler.fit(data))
StandardScaler()
>>> print(scaler.mean_)
[0.5 0.5]
>>> print(scaler.transform(data))
[[-1. -1.]
 [-1. -1.]
 [ 1.  1.]
 [ 1.  1.]]
>>> print(scaler.transform([[2, 2]]))
[[3. 3.]]
-
fit
(X, y=None, sample_weight=None)[source]¶ Compute the mean and std to be used for later scaling.
- Parameters
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to compute the mean and standard deviation used for later scaling along the features axis.
y (None) – Ignored.
sample_weight (array-like of shape (n_samples,), default=None) –
Individual weights for each sample.
New in version 0.24: parameter sample_weight support to StandardScaler.
- Returns
self – Fitted scaler.
- Return type
object
-
inverse_transform
(X, copy=None)[source]¶ Scale back the data to the original representation.
- Parameters
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to scale along the features axis.
copy (bool, default=None) – Copy the input X or not.
- Returns
X_tr – Transformed array.
- Return type
{ndarray, sparse matrix} of shape (n_samples, n_features)
-
partial_fit
(X, y=None, sample_weight=None)[source]¶ Online computation of mean and std on X for later scaling.
All of X is processed as a single batch. This is intended for cases when fit() is not feasible due to a very large number of n_samples or because X is read from a continuous stream.
The algorithm for incremental mean and std is given in Equation 1.5a,b in Chan, Tony F., Gene H. Golub, and Randall J. LeVeque. “Algorithms for computing the sample variance: Analysis and recommendations.” The American Statistician 37.3 (1983): 242-247.
- Parameters
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to compute the mean and standard deviation used for later scaling along the features axis.
y (None) – Ignored.
sample_weight (array-like of shape (n_samples,), default=None) –
Individual weights for each sample.
New in version 0.24: parameter sample_weight support to StandardScaler.
- Returns
self – Fitted scaler.
- Return type
object
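An illustrative sketch (not from the original docstring) of accumulating statistics over mini-batches and checking that they match a single full fit:

# Illustrative sketch: online computation of mean and std over mini-batches.
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.arange(20, dtype=float).reshape(10, 2)

scaler = StandardScaler()
for batch in np.array_split(data, 5):    # e.g. data arriving in chunks
    scaler.partial_fit(batch)            # updates the running statistics

full_fit = StandardScaler().fit(data)
print(np.allclose(scaler.mean_, full_fit.mean_))   # True
print(np.allclose(scaler.var_, full_fit.var_))     # True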
-
transform
(X, copy=None)[source]¶ Perform standardization by centering and scaling.
- Parameters
X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to scale along the features axis.
copy (bool, default=None) – Copy the input X or not.
- Returns
X_tr – Transformed array.
- Return type
{ndarray, sparse matrix} of shape (n_samples, n_features)
-
class
ballet.eng.external.
SumEncoder
(verbose=0, cols=None, mapping=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value')[source]¶ Bases:
sklearn.base.BaseEstimator
,sklearn.base.TransformerMixin
Sum contrast coding for the encoding of categorical features.
- Parameters
verbose (int) – integer indicating verbosity of the output. 0 for none.
cols (list) – a list of columns to encode, if None, all string columns will be encoded.
drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.
return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).
handle_unknown (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if indicator is used, an extra column will be added in if the transform matrix has unknown categories. This can cause unexpected changes in dimension in some cases.
handle_missing (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if indicator is used, an extra column will be added in if the transform matrix has nan values. This can cause unexpected changes in dimension in some cases.
Example
>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = SumEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 21 columns):
intercept    506 non-null int64
CRIM         506 non-null float64
ZN           506 non-null float64
INDUS        506 non-null float64
CHAS_0       506 non-null float64
NOX          506 non-null float64
RM           506 non-null float64
AGE          506 non-null float64
DIS          506 non-null float64
RAD_0        506 non-null float64
RAD_1        506 non-null float64
RAD_2        506 non-null float64
RAD_3        506 non-null float64
RAD_4        506 non-null float64
RAD_5        506 non-null float64
RAD_6        506 non-null float64
RAD_7        506 non-null float64
TAX          506 non-null float64
PTRATIO      506 non-null float64
B            506 non-null float64
LSTAT        506 non-null float64
dtypes: float64(20), int64(1)
memory usage: 83.1 KB
None
References
- 1
Contrast Coding Systems for Categorical Variables, from
https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/
- 2
Gregory Carey (2003). Coding Categorical Variables, from
-
fit
(X, y=None, **kwargs)[source]¶ Fit encoder according to X and y.
- Parameters
X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.
y (array-like, shape = [n_samples]) – Target values.
- Returns
self – Returns self.
- Return type
encoder
-
class
ballet.eng.external.
TargetEncoder
(verbose=0, cols=None, drop_invariant=False, return_df=True, handle_missing='value', handle_unknown='value', min_samples_leaf=1, smoothing=1.0)[source]¶ Bases:
sklearn.base.BaseEstimator
,category_encoders.utils.TransformerWithTargetMixin
Target encoding for categorical features.
Supported targets: binomial and continuous. For polynomial target support, see PolynomialWrapper.
For the case of categorical target: features are replaced with a blend of posterior probability of the target given particular categorical value and the prior probability of the target over all the training data.
For the case of continuous target: features are replaced with a blend of the expected value of the target given particular categorical value and the expected value of the target over all the training data.
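A small illustrative sketch (not from the original docstring; the column and values are made up) of the blending described above:

# Illustrative sketch: each category is replaced by a smoothed blend of its
# target mean and the global target mean.
import pandas as pd
from category_encoders import TargetEncoder

X = pd.DataFrame({'colour': ['red', 'red', 'blue', 'blue', 'blue', 'green']})
y = pd.Series([1, 0, 1, 1, 0, 1])

enc = TargetEncoder(cols=['colour'], min_samples_leaf=1, smoothing=1.0)
X_enc = enc.fit_transform(X, y)
print(X_enc['colour'].round(3).tolist())   # per-category values pulled toward the global mean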
- Parameters
verbose (int) – integer indicating verbosity of the output. 0 for none.
cols (list) – a list of columns to encode, if None, all string columns will be encoded.
drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.
return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).
handle_missing (str) – options are ‘error’, ‘return_nan’ and ‘value’, defaults to ‘value’, which returns the target mean.
handle_unknown (str) – options are ‘error’, ‘return_nan’ and ‘value’, defaults to ‘value’, which returns the target mean.
min_samples_leaf (int) – minimum samples to take category average into account.
smoothing (float) – smoothing effect to balance categorical average vs prior. Higher value means stronger regularization. The value must be strictly bigger than 0.
Example
>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = TargetEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(13)
memory usage: 51.5 KB
None
References
- 1
A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems, from
https://dl.acm.org/citation.cfm?id=507538
-
fit
(X, y, **kwargs)[source]¶ Fit encoder according to X and y.
- Parameters
X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.
y (array-like, shape = [n_samples]) – Target values.
- Returns
self – Returns self.
- Return type
encoder
-
get_feature_names
()[source]¶ Returns the names of all transformed / added columns.
- Returns
feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!
- Return type
list
-
transform
(X, y=None, override_return_df=False)[source]¶ Perform the transformation to new categorical data.
- Parameters
X (array-like, shape = [n_samples, n_features]) –
y (array-like, shape = [n_samples], optional) – Target values, required when transforming by leave-one-out; pass None when transforming without target information (such as a test set).
- Returns
p – Transformed values with encoding applied.
- Return type
array, shape = [n_samples, n_numeric + N]
-
class
ballet.eng.external.
TrendTransformer
(shift=0)[source]¶ Bases:
sklearn.base.BaseEstimator
,sklearn.base.TransformerMixin
-
class
ballet.eng.external.
WOEEncoder
(verbose=0, cols=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value', random_state=None, randomized=False, sigma=0.05, regularization=1.0)[source]¶ Bases:
sklearn.base.BaseEstimator
,category_encoders.utils.TransformerWithTargetMixin
Weight of Evidence coding for categorical features.
Supported targets: binomial. For polynomial target support, see PolynomialWrapper.
- Parameters
verbose (int) – integer indicating verbosity of the output. 0 for none.
cols (list) – a list of columns to encode, if None, all string columns will be encoded.
drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.
return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).
handle_missing (str) – options are ‘return_nan’, ‘error’ and ‘value’, defaults to ‘value’, which will assume WOE=0.
handle_unknown (str) – options are ‘return_nan’, ‘error’ and ‘value’, defaults to ‘value’, which will assume WOE=0.
randomized (bool,) – adds normal (Gaussian) distribution noise into training data in order to decrease overfitting (testing data are untouched).
sigma (float) – standard deviation (spread or “width”) of the normal distribution.
regularization (float) – the purpose of regularization is mostly to prevent division by zero. When regularization is 0, you may encounter division by zero.
Example
>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target > 22.5
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = WOEEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(13)
memory usage: 51.5 KB
None
References
- 1
Weight of Evidence (WOE) and Information Value Explained, from
https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html
-
fit
(X, y, **kwargs)[source]¶ Fit encoder according to X and binary y.
- Parameters
X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.
y (array-like, shape = [n_samples]) – Binary target values.
- Returns
self – Returns self.
- Return type
encoder
-
get_feature_names
()[source]¶ Returns the names of all transformed / added columns.
- Returns
feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!
- Return type
list
-
transform
(X, y=None, override_return_df=False)[source]¶ Perform the transformation to new categorical data. When the data are used for model training, it is important to also pass the target in order to apply leave one out.
- Parameters
X (array-like, shape = [n_samples, n_features]) –
y (array-like, shape = [n_samples], optional) – Target values, required when transforming by leave-one-out; pass None when transforming without target information (such as a test set).
- Returns
p – Transformed values with encoding applied.
- Return type
array, shape = [n_samples, n_numeric + N]
-
class
ballet.eng.external.
Winsorizer
(capping_method='gaussian', tail='right', fold=3, variables=None, missing_values='raise')[source]¶ Bases:
feature_engine.outliers.base_outlier.BaseOutlier
The Winsorizer() caps maximum and / or minimum values of a variable.
The Winsorizer() works only with numerical variables. A list of variables can be indicated. Alternatively, the Winsorizer() will select all numerical variables in the train set.
The Winsorizer() first calculates the capping values at the end of the distribution. The values are determined using:
a Gaussian approximation,
the inter-quantile range proximity rule (IQR)
percentiles.
Gaussian limits:
right tail: mean + 3* std
left tail: mean - 3* std
IQR limits:
right tail: 75th quantile + 3* IQR
left tail: 25th quantile - 3* IQR
where IQR is the inter-quartile range: 75th quantile - 25th quantile.
percentiles or quantiles:
right tail: 95th percentile
left tail: 5th percentile
You can select how far out to cap the maximum or minimum values with the parameter ‘fold’.
If capping_method=’gaussian’ fold gives the value to multiply the std.
If capping_method=’iqr’ fold is the value to multiply the IQR.
If capping_method=’quantiles’, fold is the percentile on each tail that should be censored. For example, if fold=0.05, the limits will be the 5th and 95th percentiles. If fold=0.1, the limits will be the 10th and 90th percentiles.
The transformer first finds the values at one or both tails of the distributions (fit). The transformer then caps the variables (transform).
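An illustrative sketch (not from the original docstring; assumes feature_engine >= 1.0) using the IQR proximity rule on both tails:

# Illustrative sketch: cap both tails at quartile +/- 1.5 * IQR.
import pandas as pd
from feature_engine.outliers import Winsorizer

X = pd.DataFrame({'x': [1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 50.0]})
capper = Winsorizer(capping_method='iqr', tail='both', fold=1.5)
X_capped = capper.fit_transform(X)

print(capper.right_tail_caps_)   # learned upper cap for 'x'
print(X_capped['x'].max())       # the outlier 50.0 has been censored at that cap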
- Parameters
capping_method (str, default=gaussian) –
Desired capping method. Can take ‘gaussian’, ‘iqr’ or ‘quantiles’.
’gaussian’: the transformer will find the maximum and / or minimum values to cap the variables using the Gaussian approximation.
’iqr’: the transformer will find the boundaries using the IQR proximity rule.
’quantiles’: the limits are given by the percentiles.
tail (str, default=right) – Whether to cap outliers on the right, left or both tails of the distribution. Can take ‘left’, ‘right’ or ‘both’.
fold (int or float, default=3) –
How far out to place the capping values. The number that will multiply the std or IQR to calculate the capping values. Recommended values are 2 or 3 for the gaussian approximation, or 1.5 or 3 for the IQR proximity rule.
If capping_method=’quantiles’, then ‘fold’ indicates the percentile. So if fold=0.05, the limits will be the 5th and 95th percentiles. Note: outliers will be capped up to a maximum of the 20th percentile on both sides; thus, when capping_method=’quantiles’, ‘fold’ takes values between 0 and 0.20.
variables (list, default=None) – The list of variables for which the outliers will be capped. If None, the transformer will find and select all numerical variables.
missing_values (string, default='raise') – Indicates whether missing values should be ignored or should raise an error. Sometimes we want to remove outliers in the raw, original data; at other times we may want to remove outliers in data that has already been pre-transformed. If missing_values=’ignore’, the transformer will ignore missing data when learning the capping parameters or transforming the data. If missing_values=’raise’, the transformer will return an error if the training set or the datasets to transform contain missing values.
-
right_tail_caps_
¶ Dictionary with the maximum values at which variables will be capped.
-
left_tail_caps_
¶ Dictionary with the minimum values at which variables will be capped.
-
variables_
¶ The group of variables that will be transformed.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
Learn the values that should be used to replace outliers.
-
transform:
Cap the variables.
-
fit_transform:
Fit to the data. Then transform it.
-
fit
(X, y=None)[source]¶ Learn the values that should be used to replace outliers.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples.
y (pandas Series, default=None) – y is not needed in this transformer. You can pass y or None.
- Raises
TypeError – If the input is not a Pandas DataFrame
- Returns
- Return type
self
-
transform
(X)[source]¶ Cap the variable values, that is, censor outliers.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The data to be transformed.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError – If the dataframe is not of the same size as the one used in fit()
- Returns
X – The dataframe with the capped variables.
- Return type
pandas dataframe of shape = [n_samples, n_features]
-
class
ballet.eng.external.
WoEEncoder
(variables=None, ignore_format=False)[source]¶ Bases:
feature_engine.encoding.base_encoder.BaseCategoricalTransformer
The WoEEncoder() replaces categories by the weight of evidence (WoE). The WoE was used primarily in the financial sector to create credit risk scorecards.
The encoder will encode only categorical variables by default (type ‘object’ or ‘categorical’). You can pass a list of variables to encode. Alternatively, the encoder will find and encode all categorical variables (type ‘object’ or ‘categorical’).
With ignore_format=True you have the option to encode numerical variables as well. The procedure is identical, you can either enter the list of variables to encode, or the transformer will automatically select all variables.
The encoder first maps the categories to the weight of evidence for each variable (fit). The encoder then transforms the categories into the mapped numbers (transform).
Note
This categorical encoding is exclusive for binary classification.
The weight of evidence is given by:
\[\log\left( \frac{p(X=x_j \mid Y=1)}{p(X=x_j \mid Y=0)} \right)\]
The WoE is determined as follows:
We calculate the percentage of positive cases in each category out of the total of all positive cases. For example, 20 positive cases in category A out of 100 total positive cases equals 20%. Next, we calculate the percentage of negative cases in each category with respect to the total negative cases; for example, 5 negative cases in category A out of a total of 50 negative cases equals 10%. Then we calculate the WoE by dividing the category percentage of positive cases by the category percentage of negative cases and taking the logarithm, so for category A in our example WoE = log(20/10).
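Plugging the numbers from this example into the formula above:
\[WoE_{A} = \log\left( \frac{20/100}{5/50} \right) = \log\left( \frac{0.20}{0.10} \right) = \log(2) \approx 0.69\]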
Note
If the WoE value is negative, negative cases outweigh the positive cases.
If the WoE value is positive, positive cases outweigh the negative cases.
And if the WoE is 0, there is an equal number of positive and negative examples.
Encoding into WoE:
Creates a monotonic relationship between the encoded variable and the target
Returns variables in a similar scale
Note
The log(0) is not defined and the division by 0 is not defined. Thus, if any of the terms in the WoE equation are 0 for a given category, the encoder will return an error. If this happens, try grouping less frequent categories.
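A minimal usage sketch (the dataframe, column name, and target values below are illustrative; note that every category must contain both positive and negative cases, otherwise the encoder raises an error as described above):

import pandas as pd
from ballet.eng.external import WoEEncoder

# Illustrative data: one categorical variable and a binary target.
X = pd.DataFrame({"colour": ["blue", "blue", "blue", "red", "red", "red", "green", "green"]})
y = pd.Series([1, 0, 1, 1, 0, 0, 1, 0])

encoder = WoEEncoder(variables=["colour"])
encoder.fit(X, y)             # learns the WoE per category, per variable
X_woe = encoder.transform(X)  # replaces each category with its WoE
print(encoder.encoder_dict_)  # e.g. {'colour': {'blue': ..., 'red': ..., 'green': ...}}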
- Parameters
variables (list, default=None) – The list of categorical variables that will be encoded. If None, the encoder will find and transform all variables of type object or categorical by default. You can also make the transformer accept numerical variables, see the next parameter.
ignore_format (bool, default=False) – Whether the format in which the categorical variables are cast should be ignored. If false, the encoder will automatically select variables of type object or categorical, or check that the variables entered by the user are of type object or categorical. If True, the encoder will select all variables or accept all variables entered by the user, including those cast as numeric.
-
encoder_dict_
¶ Dictionary with the WoE per variable.
-
variables_
¶ The group of variables that will be transformed.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
Learn the WoE per category, per variable.
-
transform:
Encode the categories to numbers.
-
fit_transform:
Fit to the data, then transform it.
-
inverse_transform:
Encode the numbers into the original categories.
Notes
For details on the calculation of the weight of evidence visit: https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html
In credit scoring, continuous variables are also transformed using the WoE. To do this, first variables are sorted into a discrete number of bins, and then these bins are encoded with the WoE as explained here for categorical variables. You can do this by combining the use of the equal width, equal frequency or arbitrary discretisers.
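For example, a minimal sketch of binning a continuous variable and then encoding the bins (this assumes that EqualFrequencyDiscretiser is also exported by this package, and uses return_object=True so that the resulting bins are of object type and accepted by the encoder; the data and column name are illustrative):

import numpy as np
import pandas as pd
from ballet.eng.external import EqualFrequencyDiscretiser, WoEEncoder

# Illustrative data: a continuous variable and a binary target.
rng = np.random.default_rng(0)
X = pd.DataFrame({"age": rng.uniform(18, 90, size=200)})
y = pd.Series(rng.integers(0, 2, size=200))

disc = EqualFrequencyDiscretiser(variables=["age"], q=5, return_object=True)
woe = WoEEncoder(variables=["age"])

X_binned = disc.fit_transform(X)                      # continuous values -> 5 equal-frequency bins
X_encoded = woe.fit(X_binned, y).transform(X_binned)  # bins -> WoE values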
NaN values are introduced when encoding categories that were not present in the training dataset. If this happens, try grouping infrequent categories using the RareLabelEncoder().
See also
feature_engine.encoding.RareLabelEncoder
,feature_engine.discretisation
-
fit
(X, y)[source]¶ Learn the WoE.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples. Can be the entire dataframe, not just the categorical variables.
y (pandas series.) – Target, must be binary.
- Raises
TypeError –
If the input is not a Pandas DataFrame.
If the user enters non-categorical variables (unless ignore_format is True).
ValueError –
If there are no categorical variables in df or df is empty.
If variable(s) contain null values.
If y is not binary with values 0 and 1.
If p(0) = 0 or p(1) = 0.
- Returns
- Return type
self
-
inverse_transform
(X)[source]¶ Convert the encoded variable back to the original values.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The transformed dataframe.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError –
If the variable(s) contain null values.
If the df has a different number of features than the df used in fit().
- Returns
X – The un-transformed dataframe, with the categorical variables containing the original values.
- Return type
pandas dataframe of shape = [n_samples, n_features]
-
transform
(X)[source]¶ Replace categories with the learned parameters.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The dataset to transform.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError –
If the variable(s) contain null values.
If the df has a different number of features than the df used in fit().
Warning – If NaN values were introduced after encoding.
- Returns
X – The dataframe containing the categories replaced by numbers.
- Return type
pandas dataframe of shape = [n_samples, n_features]
-
class
ballet.eng.external.
YeoJohnsonTransformer
(variables=None)[source]¶ Bases:
feature_engine.base_transformers.BaseNumericalTransformer
The YeoJohnsonTransformer() applies the Yeo-Johnson transformation to the numerical variables.
The Yeo-Johnson transformation implemented by this transformer is that of scipy.stats: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.yeojohnson.html
The YeoJohnsonTransformer() works only with numerical variables.
A list of variables can be passed as an argument. Alternatively, the transformer will automatically select and transform all numerical variables.
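A minimal usage sketch (the dataframe and column name below are illustrative):

import numpy as np
import pandas as pd
from ballet.eng.external import YeoJohnsonTransformer

# Illustrative data: a right-skewed numerical variable.
rng = np.random.default_rng(0)
df = pd.DataFrame({"duration": np.exp(rng.normal(0, 1, size=500))})

yjt = YeoJohnsonTransformer(variables=["duration"])
df_t = yjt.fit_transform(df)

# The best lambda found for each transformed variable.
print(yjt.lambda_dict_)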
- Parameters
variables (list, default=None) – The list of numerical variables to transform. If None, the transformer will automatically find and select all numerical variables.
-
lambda_dict_
¶ Dictionary containing the best lambda for the Yeo-Johnson per variable.
-
variables_
¶ The group of variables that will be transformed.
-
n_features_in_
¶ The number of features in the train set used in fit.
-
fit:
Learn the optimal lambda for the Yeo-Johnson transformation.
-
transform:
Apply the Yeo-Johnson transformation.
-
fit_transform:
Fit to data, then transform it.
References
- 1
Weisberg S. “Yeo-Johnson Power Transformations”. https://www.stat.umn.edu/arc/yjpower.pdf
-
fit
(X, y=None)[source]¶ Learn the optimal lambda for the Yeo-Johnson transformation.
- Parameters
X (pandas dataframe of shape = [n_samples, n_features]) – The training input samples. Can be the entire dataframe, not just the variables to transform.
y (pandas Series, default=None) – It is not needed in this transformer. You can pass y or None.
- Raises
TypeError –
If the input is not a Pandas DataFrame.
If any of the user-provided variables are not numerical.
ValueError –
If there are no numerical variables in the df or the df is empty
If the variable(s) contain null values
- Returns
- Return type
self
-
transform
(X)[source]¶ Apply the Yeo-Johnson transformation.
- Parameters
X (Pandas DataFrame of shape = [n_samples, n_features]) – The data to be transformed.
- Raises
TypeError – If the input is not a Pandas DataFrame
ValueError –
If the variable(s) contain null values.
If the df has a different number of features than the df used in fit().
- Returns
X – The dataframe with the transformed variables.
- Return type
pandas dataframe