ballet.eng.external.category_encoders module

class ballet.eng.external.category_encoders.BackwardDifferenceEncoder(verbose=0, cols=None, mapping=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value')[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Backward difference contrast coding for encoding categorical variables.
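
As a rough illustration of what backward difference coding produces, the sketch below builds the contrast matrix for a variable with k ordered levels, following the convention described in the UCLA reference cited in this entry. The helper name backward_difference_matrix is hypothetical and not part of the library; the encoder's own column naming and intercept handling may differ.

```python
from fractions import Fraction


def backward_difference_matrix(k):
    """Return a k x (k - 1) contrast matrix (rows = levels, columns = contrasts).

    Column j compares the mean of level j + 1 with the mean of level j.
    Hypothetical helper for illustration only.
    """
    return [
        [Fraction(-(k - j), k) if i <= j else Fraction(j, k) for j in range(1, k)]
        for i in range(1, k + 1)
    ]


for row in backward_difference_matrix(4):
    print([str(entry) for entry in row])
```

A variable with k levels yields k - 1 contrast columns, which is consistent with the single CHAS_0 column produced for the two-valued CHAS feature in the example output.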

Parameters
  • verbose (int) – integer indicating verbosity of the output. 0 for none.

  • cols (list) – a list of columns to encode, if None, all string columns will be encoded.

  • drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.

  • return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).

  • handle_unknown (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if indicator is used, an extra column will be added in if the transform matrix has unknown categories. This can cause unexpected changes in dimension in some cases.

  • handle_missing (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if indicator is used, an extra column will be added in if the transform matrix has nan values. This can cause unexpected changes in dimension in some cases.

Example

>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = BackwardDifferenceEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 21 columns):
intercept    506 non-null int64
CRIM         506 non-null float64
ZN           506 non-null float64
INDUS        506 non-null float64
CHAS_0       506 non-null float64
NOX          506 non-null float64
RM           506 non-null float64
AGE          506 non-null float64
DIS          506 non-null float64
RAD_0        506 non-null float64
RAD_1        506 non-null float64
RAD_2        506 non-null float64
RAD_3        506 non-null float64
RAD_4        506 non-null float64
RAD_5        506 non-null float64
RAD_6        506 non-null float64
RAD_7        506 non-null float64
TAX          506 non-null float64
PTRATIO      506 non-null float64
B            506 non-null float64
LSTAT        506 non-null float64
dtypes: float64(20), int64(1)
memory usage: 83.1 KB
None

References

1

Contrast Coding Systems for Categorical Variables, from

https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/

2

Gregory Carey (2003). Coding Categorical Variables, from

http://psych.colorado.edu/~carey/Courses/PSYC5741/handouts/Coding%20Categorical%20Variables%202006-03-03.pdf

static backward_difference_coding(X_in, mapping)[source]

fit(X, y=None, **kwargs)[source]

Fits an ordinal encoder to produce a consistent mapping across applications and optionally finds generally invariant columns to drop consistently.

Parameters
  • X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like, shape = [n_samples]) – Target values.

Returns

self – Returns self.

Return type

encoder

static fit_backward_difference_coding(col, values, handle_missing, handle_unknown)[source]

get_feature_names()[source]

Returns the names of all transformed / added columns.

Returns

feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!

Return type

list

transform(X, override_return_df=False)[source]

Perform the transformation to new categorical data.

Parameters

X (array-like, shape = [n_samples, n_features]) –

Returns

p – Transformed values with encoding applied.

Return type

array, shape = [n_samples, n_numeric + N]

class ballet.eng.external.category_encoders.BaseNEncoder(verbose=0, cols=None, mapping=None, drop_invariant=False, return_df=True, base=2, handle_unknown='value', handle_missing='value')[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Base-N encoder encodes the categories into arrays of their base-N representation. A base of 1 is equivalent to one-hot encoding (not really base-1, but useful), a base of 2 is equivalent to binary encoding. N=number of actual categories is equivalent to vanilla ordinal encoding.
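
The idea of writing a category's integer code as base-N digits, one digit per output column, can be sketched as follows. This is a minimal illustration for bases of 2 or more; the helper name to_base_digits is hypothetical, and the library's own handling of base 1 and of column widths is not reproduced here.

```python
def to_base_digits(n, base, width):
    """Return the digits of non-negative integer n in the given base,
    most significant digit first, left-padded with zeros to `width`.

    Hypothetical helper; assumes base >= 2.
    """
    digits = []
    while n:
        digits.append(n % base)
        n //= base
    digits.extend([0] * (width - len(digits)))
    return digits[::-1]


# Ordinal code 5 in base 2 (binary encoding) and in base 3.
print(to_base_digits(5, 2, 4))  # [0, 1, 0, 1]
print(to_base_digits(5, 3, 2))  # [1, 2]
```

A higher base needs fewer digit columns for the same number of categories, at the cost of each column taking more distinct values.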

Parameters
  • verbose (int) – integer indicating verbosity of the output. 0 for none.

  • cols (list) – a list of columns to encode, if None, all string columns will be encoded.

  • drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.

  • return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).

  • base (int) – the base of the encoding (default 2). When the downstream model copes well with nonlinearities (like a decision tree), use a higher base.

  • handle_unknown (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if indicator is used, an extra column will be added in if the transform matrix has unknown categories. This can cause unexpected changes in dimension in some cases.

  • handle_missing (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if indicator is used, an extra column will be added in if the transform matrix has nan values. This can cause unexpected changes in dimension in some cases.

Example

>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = BaseNEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 18 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS_0     506 non-null int64
CHAS_1     506 non-null int64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD_0      506 non-null int64
RAD_1      506 non-null int64
RAD_2      506 non-null int64
RAD_3      506 non-null int64
RAD_4      506 non-null int64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(11), int64(7)
memory usage: 71.3 KB
None

basen_encode(X_in, cols=None)[source]

Base-N encoding represents each integer in base N, with one column per digit.

Parameters
  • X_in (DataFrame) –

  • cols (list-like, default None) – Column names in the DataFrame to be encoded

Returns

dummies

Return type

DataFrame

basen_to_integer(X, cols, base)[source]

Convert base-N code to integers.

Parameters
  • X (DataFrame) – encoded data

  • cols (list-like) – Column names in the DataFrame to be decoded

  • base (int) – The base of transform

Returns

numerical

Return type

DataFrame
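
The conversion from base-N digits back to an integer is the usual positional evaluation. The sketch below shows it for a single row of digit columns; digits_to_int is a hypothetical helper, not the library's implementation, which operates on whole DataFrame columns.

```python
def digits_to_int(digits, base):
    """Evaluate a most-significant-first digit list in the given base
    using Horner's rule. Hypothetical helper for illustration."""
    value = 0
    for digit in digits:
        value = value * base + digit
    return value


print(digits_to_int([0, 1, 0, 1], 2))  # 5
print(digits_to_int([1, 2], 3))        # 5
```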

calc_required_digits(values)[source]

col_transform(col, digits)[source]

The lambda body to transform the column values

fit(X, y=None, **kwargs)[source]

Fit encoder according to X and y.

Parameters
  • X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like, shape = [n_samples]) – Target values.

Returns

self – Returns self.

Return type

encoder

fit_base_n_encoding(X)[source]
get_feature_names()[source]

Returns the names of all transformed / added columns.

Returns

feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!

Return type

list

inverse_transform(X_in)[source]

Perform the inverse transformation to encoded data.

Parameters

X_in (array-like, shape = [n_samples, n_features]) –

Returns

p

Return type

array, the same size of X_in

static number_to_base(n, b, limit)[source]

transform(X, override_return_df=False)[source]

Perform the transformation to new categorical data.

Parameters

X (array-like, shape = [n_samples, n_features]) –

Returns

p – Transformed values with encoding applied.

Return type

array, shape = [n_samples, n_numeric + N]

class ballet.eng.external.category_encoders.BinaryEncoder(verbose=0, cols=None, mapping=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value')[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Binary encoding for categorical variables, similar to onehot, but stores categories as binary bitstrings.

Parameters
  • verbose (int) – integer indicating verbosity of the output. 0 for none.

  • cols (list) – a list of columns to encode, if None, all string columns will be encoded.

  • drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.

  • return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).

  • handle_unknown (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if indicator is used, an extra column will be added in if the transform matrix has unknown categories. This can cause unexpected changes in dimension in some cases.

  • handle_missing (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if indicator is used, an extra column will be added in if the transform matrix has nan values. This can cause unexpected changes in dimension in some cases.

Example

>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = BinaryEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 18 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS_0     506 non-null int64
CHAS_1     506 non-null int64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD_0      506 non-null int64
RAD_1      506 non-null int64
RAD_2      506 non-null int64
RAD_3      506 non-null int64
RAD_4      506 non-null int64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(11), int64(7)
memory usage: 71.3 KB
None

fit(X, y=None, **kwargs)[source]

Fit encoder according to X and y.

Parameters
  • X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like, shape = [n_samples]) – Target values.

Returns

self – Returns self.

Return type

encoder

get_feature_names()[source]

Returns the names of all transformed / added columns.

Returns

feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!

Return type

list

inverse_transform(X_in)[source]

Perform the inverse transformation to encoded data.

Parameters

X_in (array-like, shape = [n_samples, n_features]) –

Returns

p

Return type

array, the same size of X_in

transform(X, override_return_df=False)[source]

Perform the transformation to new categorical data.

Parameters

X (array-like, shape = [n_samples, n_features]) –

Returns

p – Transformed values with encoding applied.

Return type

array, shape = [n_samples, n_numeric + N]

class ballet.eng.external.category_encoders.CatBoostEncoder(verbose=0, cols=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value', random_state=None, sigma=None, a=1)[source]

Bases: sklearn.base.BaseEstimator, category_encoders.utils.TransformerWithTargetMixin

CatBoost coding for categorical features.

Supported targets: binomial and continuous. For polynomial target support, see PolynomialWrapper.

This is very similar to leave-one-out encoding, but calculates the values “on-the-fly”. Consequently, the values naturally vary during the training phase and it is not necessary to add random noise.

Beware, the training data have to be randomly permuted. E.g.:

    # Random permutation (X and y are assumed to be pandas objects)
    import numpy as np
    perm = np.random.permutation(len(X))
    X = X.iloc[perm].reset_index(drop=True)
    y = y.iloc[perm].reset_index(drop=True)

This is necessary because some data sets are sorted based on the target value and this coder encodes the features on-the-fly in a single pass.
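
To make the single-pass, on-the-fly behaviour concrete, here is a minimal pure-Python sketch of an ordered target statistic. It is an illustration under stated assumptions, not the library's code: the function name ordered_target_encode is hypothetical, and the smoothing form (prior equal to the global target mean, weighted by a) is assumed from the description of the a parameter.

```python
from collections import defaultdict


def ordered_target_encode(categories, targets, a=1.0):
    """Encode each row using only the targets of EARLIER rows that share
    its category, smoothed towards the global mean with weight `a`.

    Hypothetical sketch; the exact smoothing formula is an assumption.
    """
    prior = sum(targets) / len(targets)
    sums = defaultdict(float)   # running target sum per category
    counts = defaultdict(int)   # running row count per category
    encoded = []
    for category, y in zip(categories, targets):
        encoded.append((sums[category] + prior * a) / (counts[category] + a))
        # Update the running statistics only after encoding the row.
        sums[category] += y
        counts[category] += 1
    return encoded


print(ordered_target_encode(["a", "a", "b", "a"], [1, 0, 1, 1]))
```

Because each row sees only the rows before it, the result depends on row order; with data sorted by target, early rows of a category would be encoded from a biased history.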

Parameters
  • verbose (int) – integer indicating verbosity of the output. 0 for none.

  • cols (list) – a list of columns to encode, if None, all string columns will be encoded.

  • drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.

  • return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).

  • handle_missing (str) – options are ‘error’, ‘return_nan’ and ‘value’, defaults to ‘value’, which returns the target mean.

  • handle_unknown (str) – options are ‘error’, ‘return_nan’ and ‘value’, defaults to ‘value’, which returns the target mean.

  • sigma (float) – adds normal (Gaussian) distribution noise into training data in order to decrease overfitting (testing data are untouched). sigma gives the standard deviation (spread or “width”) of the normal distribution.

  • a (float) – additive smoothing (it is the same variable as “m” in m-probability estimate). By default set to 1.

Example

>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = CatBoostEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(13)
memory usage: 51.5 KB
None

References

1

Transforming categorical features to numerical features, from

https://tech.yandex.com/catboost/doc/dg/concepts/algorithm-main-stages_cat-to-numberic-docpage/

2

CatBoost: unbiased boosting with categorical features, from

https://arxiv.org/abs/1706.09516

fit(X, y, **kwargs)[source]

Fit encoder according to X and y.

Parameters
  • X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like, shape = [n_samples]) – Target values.

Returns

self – Returns self.

Return type

encoder

get_feature_names()[source]

Returns the names of all transformed / added columns.

Returns

feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!

Return type

list

transform(X, y=None, override_return_df=False)[source]

Perform the transformation to new categorical data.

Parameters
  • X (array-like, shape = [n_samples, n_features]) –

  • y (array-like, shape = [n_samples], or None) – Target values, passed when transforming training data. Use None when transforming without target information (such as when transforming a test set).

Returns

p – Transformed values with encoding applied.

Return type

array, shape = [n_samples, n_numeric + N]

class ballet.eng.external.category_encoders.CountEncoder(verbose=0, cols=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value', min_group_size=None, combine_min_nan_groups=None, min_group_name=None, normalize=False)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

combine_min_categories(X)[source]

Combine small categories into a single category.
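
Count encoding with small-category merging can be sketched in plain Python as below. This is only an illustration: the function name count_encode is hypothetical, and treating min_group_size as an absolute count threshold (categories strictly below it are pooled and share their combined count) is an assumption rather than documented behaviour.

```python
from collections import Counter


def count_encode(values, min_group_size=None, normalize=False):
    """Replace each value by the number of times it occurs.

    Hypothetical sketch. Categories rarer than `min_group_size` are pooled
    into one group whose count is the sum of their counts (assumption).
    """
    counts = Counter(values)
    mapping = dict(counts)
    if min_group_size is not None:
        small = {v for v, c in counts.items() if c < min_group_size}
        pooled = sum(counts[v] for v in small)
        mapping = {v: (pooled if v in small else c) for v, c in counts.items()}
    if normalize:
        total = len(values)
        mapping = {v: c / total for v, c in mapping.items()}
    return [mapping[v] for v in values]


print(count_encode(["x", "x", "y", "z"]))                    # [2, 2, 1, 1]
print(count_encode(["x", "x", "y", "z"], min_group_size=2))  # [2, 2, 2, 2]
```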

fit(X, y=None, **kwargs)[source]

Fit encoder according to X.

Parameters
  • X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like, shape = [n_samples]) – Target values.

Returns

self – Returns self.

Return type

encoder

get_feature_names()[source]

Returns the names of all transformed / added columns.

Returns

feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!

Return type

list

transform(X, y=None, override_return_df=False)[source]

Perform the transformation to new categorical data.

Parameters
  • X (array-like, shape = [n_samples, n_features]) –

  • y (array-like, shape = [n_samples]) –

Returns

p – Transformed values with encoding applied.

Return type

array, shape = [n_samples, n_numeric + N]

class ballet.eng.external.category_encoders.GLMMEncoder(verbose=0, cols=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value', random_state=None, randomized=False, sigma=0.05, binomial_target=None)[source]

Bases: sklearn.base.BaseEstimator, category_encoders.utils.TransformerWithTargetMixin

Generalized linear mixed model.

Supported targets: binomial and continuous. For polynomial target support, see PolynomialWrapper.

This is a supervised encoder similar to TargetEncoder or MEstimateEncoder, but there are some advantages:

  1. Solid statistical theory behind the technique. Mixed effects models are a mature branch of statistics.

  2. No hyper-parameters to tune. The amount of shrinkage is automatically determined through the estimation process. In short, the fewer observations a category has and/or the more the outcome varies for a category, the stronger the regularization towards “the prior” or “grand mean”.

  3. The technique is applicable for both continuous and binomial targets. If the target is continuous, the encoder returns the regularized difference of the observation’s category from the global mean. If the target is binomial, the encoder returns regularized log odds per category.

In comparison to JamesSteinEncoder, this encoder utilizes generalized linear mixed models from the statsmodels library.

Note: This is an alpha implementation. The API of the method may change in the future.

Parameters
  • verbose (int) – integer indicating verbosity of the output. 0 for none.

  • cols (list) – a list of columns to encode, if None, all string columns will be encoded.

  • drop_invariant (bool) – boolean for whether or not to drop encoded columns with 0 variance.

  • return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).

  • handle_missing (str) – options are ‘return_nan’, ‘error’ and ‘value’, defaults to ‘value’, which returns 0.

  • handle_unknown (str) – options are ‘return_nan’, ‘error’ and ‘value’, defaults to ‘value’, which returns 0.

  • randomized (bool) – adds normal (Gaussian) distribution noise into training data in order to decrease overfitting (testing data are untouched).

  • sigma (float) – standard deviation (spread or “width”) of the normal distribution.

  • binomial_target (bool) – if True, the target must be binomial with values {0, 1} and Binomial mixed model is used. If False, the target must be continuous and Linear mixed model is used. If None (the default), a heuristic is applied to estimate the target type.

Example

>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target > 22.5
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = GLMMEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(13)
memory usage: 51.5 KB
None

References

1

Data Analysis Using Regression and Multilevel/Hierarchical Models, page 253, from

https://faculty.psau.edu.sa/filedownload/doc-12-pdf-a1997d0d31f84d13c1cdc44ac39a8f2c-original.pdf

fit(X, y, **kwargs)[source]

Fit encoder according to X and binary y.

Parameters
  • X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like, shape = [n_samples]) – Binary target values.

Returns

self – Returns self.

Return type

encoder

get_feature_names()[source]

Returns the names of all transformed / added columns.

Returns

feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!

Return type

list

transform(X, y=None, override_return_df=False)[source]

Perform the transformation to new categorical data.

When the data are used for model training, it is important to also pass the target in order to apply leave one out.

Parameters
  • X (array-like, shape = [n_samples, n_features]) –

  • y (array-like, shape = [n_samples], or None) – Target values, passed when transforming training data. Use None when transforming without target information (such as when transforming a test set).

Returns

p – Transformed values with encoding applied.

Return type

array, shape = [n_samples, n_numeric + N]

class ballet.eng.external.category_encoders.HashingEncoder(max_process=0, max_sample=0, verbose=0, n_components=8, cols=None, drop_invariant=False, return_df=True, hash_method='md5')[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

A multivariate hashing implementation with configurable dimensionality/precision.

The advantage of this encoder is that it does not maintain a dictionary of observed categories. Consequently, the encoder does not grow in size and accepts new values during data scoring by design.

It is important to read about how max_process and max_sample work before setting them manually; inappropriate settings slow down encoding.

The default value of max_process is 1 on Windows because multiprocessing might cause issues there; see https://github.com/scikit-learn-contrib/categorical-encoding/issues/215 and https://docs.python.org/2/library/multiprocessing.html?highlight=process#windows

Parameters
  • verbose (int) – integer indicating verbosity of the output. 0 for none.

  • cols (list) – a list of columns to encode, if None, all string columns will be encoded.

  • drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.

  • return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).

  • hash_method (str) – which hashing method to use. Any method from hashlib works.

  • max_process (int) – how many processes to use in transform(). Limited to range(1, 64). By default, it uses half of the logical CPUs. For example, 4C4T makes max_process=2, 4C8T makes max_process=4. Set it larger if you have a strong CPU. It is not recommended to set it larger than the number of logical CPUs, as this will actually slow down the encoding.

  • max_sample (int) – how many samples to encode by each process at a time. This setting is useful on low-memory machines. By default, max_sample=(number of samples)/(max_process). For example, a 4C8T CPU with 100,000 samples makes max_sample=25,000, and a 6C12T CPU with 100,000 samples makes max_sample=16,666. It is not recommended to set it larger than the default value.

  • n_components (int) – how many bits to use to represent the feature. By default we use 8 bits. For high-cardinality features, consider using up to 32 bits.
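
The default arithmetic described for max_process and max_sample can be reproduced with a few lines of Python. The helper default_chunking is hypothetical and only mirrors the worked numbers in the parameter descriptions; the library may apply additional limits (for example the Windows default mentioned above).

```python
def default_chunking(n_samples, logical_cpus):
    """Return (max_process, max_sample) following the documented defaults:
    half of the logical CPUs, and an even split of the samples.

    Hypothetical helper for illustration only.
    """
    max_process = max(1, logical_cpus // 2)
    max_sample = n_samples // max_process
    return max_process, max_sample


print(default_chunking(100_000, 8))   # 4C8T  -> (4, 25000)
print(default_chunking(100_000, 12))  # 6C12T -> (6, 16666)
```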

Example

>>> from category_encoders.hashing import HashingEncoder
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> y = bunch.target
>>> he = HashingEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> data = he.transform(X)
>>> print(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 19 columns):
col_0      506 non-null int64
col_1      506 non-null int64
col_2      506 non-null int64
col_3      506 non-null int64
col_4      506 non-null int64
col_5      506 non-null int64
col_6      506 non-null int64
col_7      506 non-null int64
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(11), int64(8)
memory usage: 75.2 KB
None

References

1

Feature Hashing for Large Scale Multitask Learning, from

https://alex.smola.org/papers/2009/Weinbergeretal09.pdf

2

Don’t be tricked by the Hashing Trick, from

https://booking.ai/dont-be-tricked-by-the-hashing-trick-192a6aae3087

fit(X, y=None, **kwargs)[source]

Fit encoder according to X and y.

Parameters
  • X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like, shape = [n_samples]) – Target values.

Returns

self – Returns self.

Return type

encoder

get_feature_names()[source]

Returns the names of all transformed / added columns.

Returns

feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!

Return type

list

static hashing_trick(X_in, hashing_method='md5', N=2, cols=None, make_copy=False)[source]

A basic hashing implementation with configurable dimensionality/precision

Performs the hashing trick on a pandas dataframe, X, using the hashing method from hashlib identified by hashing_method. The number of output dimensions (N), and columns to hash (cols) are also configurable.
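
For a single row, the hashing trick amounts to hashing each value and incrementing the bucket selected by the hash modulo the number of output dimensions. The sketch below uses only hashlib; the function name hash_row is hypothetical, and details such as how values are converted to bytes are assumptions rather than a statement of the library's behaviour.

```python
import hashlib


def hash_row(values, n_components=8, hash_method="md5"):
    """Hash the values of one row into `n_components` count buckets.

    Hypothetical sketch: each value is converted with str() and encoded as
    UTF-8 before hashing (assumption).
    """
    row = [0] * n_components
    for value in values:
        hasher = hashlib.new(hash_method)
        hasher.update(str(value).encode("utf-8"))
        row[int(hasher.hexdigest(), 16) % n_components] += 1
    return row


print(hash_row(["a", 4.0]))
print(hash_row(["a value never seen during fit"]))
```

No dictionary of observed categories is kept, so previously unseen values are handled the same way as known ones; the price is that distinct values can collide in the same bucket.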

Parameters
  • X_in (pandas dataframe) – the DataFrame to encode

  • hashing_method (string, optional) – name of the hashlib hashing method to use (default ‘md5’)

  • N (int, optional) – number of output dimensions (default 2)

  • cols (list, optional) – columns to hash

  • make_copy (bool, optional) – whether to copy X_in before hashing (default False)

Returns

out – A hashing encoded dataframe.

Return type

dataframe

References

1

Kilian Weinberger; Anirban Dasgupta; John Langford; Alex Smola; Josh Attenberg (2009). Feature Hashing for Large Scale Multitask Learning. Proc. ICML.

static require_data(self, data_lock, new_start, done_index, hashing_parts, cols, process_index)[source]

transform(X, override_return_df=False)[source]

Transform the data using multiple processes. To encode all samples on a single CPU, call _transform() instead.

class ballet.eng.external.category_encoders.HelmertEncoder(verbose=0, cols=None, mapping=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value')[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Helmert contrast coding for encoding categorical features.
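
For orientation, the sketch below builds a Helmert contrast matrix in one common convention, in which each level from the second onwards is compared with the levels before it (the unscaled form used by R's contr.helmert). The helper name helmert_matrix is hypothetical, and the library's scaling or ordering of contrasts may differ.

```python
def helmert_matrix(k):
    """Return a k x (k - 1) Helmert contrast matrix (rows = levels).

    Column j contrasts level j + 2 with all earlier levels.
    Hypothetical helper; unscaled convention assumed.
    """
    return [
        [-1 if i <= j else (j + 1 if i == j + 1 else 0) for j in range(k - 1)]
        for i in range(k)
    ]


for row in helmert_matrix(4):
    print(row)
```

Every column sums to zero, and a variable with k levels again produces k - 1 contrast columns.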

Parameters
  • verbose (int) – integer indicating verbosity of the output. 0 for none.

  • cols (list) – a list of columns to encode, if None, all string columns will be encoded.

  • drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.

  • return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).

  • handle_unknown (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if indicator is used, an extra column will be added in if the transform matrix has unknown categories. This can cause unexpected changes in dimension in some cases.

  • handle_missing (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if indicator is used, an extra column will be added in if the transform matrix has nan values. This can cause unexpected changes in dimension in some cases.

Example

>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = HelmertEncoder(cols=['CHAS', 'RAD'], handle_unknown='value', handle_missing='value').fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 21 columns):
intercept    506 non-null int64
CRIM         506 non-null float64
ZN           506 non-null float64
INDUS        506 non-null float64
CHAS_0       506 non-null float64
NOX          506 non-null float64
RM           506 non-null float64
AGE          506 non-null float64
DIS          506 non-null float64
RAD_0        506 non-null float64
RAD_1        506 non-null float64
RAD_2        506 non-null float64
RAD_3        506 non-null float64
RAD_4        506 non-null float64
RAD_5        506 non-null float64
RAD_6        506 non-null float64
RAD_7        506 non-null float64
TAX          506 non-null float64
PTRATIO      506 non-null float64
B            506 non-null float64
LSTAT        506 non-null float64
dtypes: float64(20), int64(1)
memory usage: 83.1 KB
None

References

1

Contrast Coding Systems for Categorical Variables, from

https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/

2

Gregory Carey (2003). Coding Categorical Variables, from

http://psych.colorado.edu/~carey/Courses/PSYC5741/handouts/Coding%20Categorical%20Variables%202006-03-03.pdf

fit(X, y=None, **kwargs)[source]

Fit encoder according to X and y.

Parameters
  • X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like, shape = [n_samples]) – Target values.

Returns

self – Returns self.

Return type

encoder

static fit_helmert_coding(col, values, handle_missing, handle_unknown)[source]

get_feature_names()[source]

Returns the names of all transformed / added columns.

Returns

feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!

Return type

list

static helmert_coding(X_in, mapping)[source]

transform(X, override_return_df=False)[source]

Perform the transformation to new categorical data.

Parameters

X (array-like, shape = [n_samples, n_features]) –

Returns

p – Transformed values with encoding applied.

Return type

array, shape = [n_samples, n_numeric + N]

class ballet.eng.external.category_encoders.JamesSteinEncoder(verbose=0, cols=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value', model='independent', random_state=None, randomized=False, sigma=0.05)[source]

Bases: sklearn.base.BaseEstimator, category_encoders.utils.TransformerWithTargetMixin

James-Stein estimator.

Supported targets: binomial and continuous. For polynomial target support, see PolynomialWrapper.

For feature value i, James-Stein estimator returns a weighted average of:

  1. The mean target value for the observed feature value i.

  2. The mean target value (regardless of the feature value).

This can be written as:

JS_i = (1-B)*mean(y_i) + B*mean(y)

The question is, what should the weight B be? If we put too much weight on the conditional mean value, we will overfit. If we put too much weight on the global mean, we will underfit. The canonical solution in machine learning is to perform cross-validation. However, Charles Stein came up with a closed-form solution to the problem. The intuition is: if the estimate of mean(y_i) is unreliable (y_i has high variance), we should put more weight on mean(y). Stein put it into an equation as:

B = var(y_i) / (var(y_i)+var(y))

The only remaining issue is that we do not know var(y), let alone var(y_i). Hence, we have to estimate the variances. But how can we reliably estimate the variances, when we already struggle with the estimation of the mean values?! There are multiple solutions:

  1. If we have the same count of observations for each feature value i and all y_i are close to each other, we can pretend that all var(y_i) are identical. This is called a pooled model.

  2. If the observation counts are not equal, it makes sense to replace the variances with squared standard errors, which penalize small observation counts:

SE^2 = var(y)/count(y)

This is called an independent model.
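As a rough illustration of the independent model, the sketch below plugs squared standard errors into the formulas for B and JS_i given above, using only the Python standard library. The function name `james_stein_independent`, the use of the population variance, and the toy data are assumptions made for this example; it is not the encoder's actual implementation.

```python
from collections import defaultdict
from statistics import mean, pvariance


def james_stein_independent(categories, targets):
    """Return {category: JS_i} using squared standard errors as variances."""
    groups = defaultdict(list)
    for cat, y in zip(categories, targets):
        groups[cat].append(y)

    global_mean = mean(targets)
    # SE^2 of the global mean: var(y) / count(y)
    global_se2 = pvariance(targets) / len(targets)

    encoded = {}
    for cat, ys_in_cat in groups.items():
        # SE^2 of the conditional mean: var(y_i) / count(y_i)
        se2 = pvariance(ys_in_cat) / len(ys_in_cat)
        denom = se2 + global_se2
        # B = var(y_i) / (var(y_i) + var(y)), with variances replaced by SE^2
        b = se2 / denom if denom > 0 else 0.0
        # JS_i = (1 - B) * mean(y_i) + B * mean(y)
        encoded[cat] = (1 - b) * mean(ys_in_cat) + b * global_mean
    return encoded


cats = ["a", "a", "a", "a", "b", "b"]
ys = [1.0, 2.0, 3.0, 4.0, 10.0, 20.0]
js = james_stein_independent(cats, ys)
```

In this toy data, level "b" has fewer and more scattered observations than level "a", so its estimate is pulled further toward the global mean.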

The James-Stein estimator has, however, one practical limitation: it was defined only for normal distributions. If you want to apply it to binary classification, where the target takes only values in {0, 1}, it is better to first convert the mean target value from the bounded interval [0, 1] into an unbounded interval by replacing mean(y) with the log-odds ratio:

log-odds_ratio_i = log(mean(y_i)/mean(y_not_i))

This is called the binary model. Estimating the parameters of this model is, however, tricky, and it sometimes fails fatally. In these situations, it is better to use the beta model, which generally delivers slightly worse accuracy than the binary model but does not suffer from fatal failures.
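The log-odds ratio, taken literally as written above, can be sketched in a few lines. The helper name `log_odds_ratio` is hypothetical, and returning None when a mean is zero is a choice made for this illustration only; it hints at why parameter estimation for the binary model can break down.

```python
import math


def log_odds_ratio(categories, targets, level):
    """log(mean(y_i) / mean(y_not_i)) for one level, or None if undefined."""
    inside = [y for c, y in zip(categories, targets) if c == level]
    outside = [y for c, y in zip(categories, targets) if c != level]
    if not inside or not outside:
        return None
    mean_in = sum(inside) / len(inside)
    mean_out = sum(outside) / len(outside)
    # The logarithm of the ratio is undefined when either mean is zero.
    if mean_in <= 0 or mean_out <= 0:
        return None
    return math.log(mean_in / mean_out)


cats = ["a", "a", "a", "a", "b", "b", "b", "b"]
ys = [1, 1, 1, 0, 1, 0, 0, 0]
ratio_a = log_odds_ratio(cats, ys, "a")  # log(0.75 / 0.25)
ratio_b = log_odds_ratio(cats, ys, "b")  # log(0.25 / 0.75)
```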

Parameters
  • verbose (int) – integer indicating verbosity of the output. 0 for none.

  • cols (list) – a list of columns to encode, if None, all string columns will be encoded.

  • drop_invariant (bool) – boolean for whether or not to drop encoded columns with 0 variance.

  • return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).

  • handle_missing (str) – options are ‘return_nan’, ‘error’ and ‘value’, defaults to ‘value’, which returns the prior probability.

  • handle_unknown (str) – options are ‘return_nan’, ‘error’ and ‘value’, defaults to ‘value’, which returns the prior probability.

  • model (str) – options are ‘pooled’, ‘beta’, ‘binary’ and ‘independent’, defaults to ‘independent’.

  • randomized (bool) – adds normal (Gaussian) distribution noise into training data in order to decrease overfitting (testing data are untouched).

  • sigma (float) – standard deviation (spread or “width”) of the normal distribution.

Example

>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = JamesSteinEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(13)
memory usage: 51.5 KB
None

References

1

Parametric empirical Bayes inference: Theory and applications, equations 1.19 & 1.20, from

https://www.jstor.org/stable/2287098

2

Empirical Bayes for multiple sample sizes, from

http://chris-said.io/2017/05/03/empirical-bayes-for-multiple-sample-sizes/

3

Shrinkage Estimation of Log-odds Ratios for Comparing Mobility Tables, from

https://journals.sagepub.com/doi/abs/10.1177/0081175015570097

4

Stein’s paradox and group rationality, from

http://www.philos.rug.nl/~romeyn/presentation/2017_romeijn_-_Paris_Stein.pdf

5

Stein’s Paradox in Statistics, from

http://statweb.stanford.edu/~ckirby/brad/other/Article1977.pdf

fit(X, y, **kwargs)[source]

Fit encoder according to X and binary or continuous y.

Parameters
  • X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like, shape = [n_samples]) – Binary or continuous target values.

Returns

self – Returns self.

Return type

encoder

get_feature_names()[source]

Returns the names of all transformed / added columns.

Returns

feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!

Return type

list

transform(X, y=None, override_return_df=False)[source]

Perform the transformation to new categorical data. When the data are used for model training, it is important to also pass the target in order to apply leave one out.

Parameters
  • X (array-like, shape = [n_samples, n_features]) –

  • y (array-like, shape = [n_samples] when transform by leave one out) – None, when transform without target information (such as transform test set)

Returns

p – Transformed values with encoding applied.

Return type

array, shape = [n_samples, n_numeric + N]

class ballet.eng.external.category_encoders.LeaveOneOutEncoder(verbose=0, cols=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value', random_state=None, sigma=None)[source]

Bases: sklearn.base.BaseEstimator, category_encoders.utils.TransformerWithTargetMixin

Leave one out coding for categorical features.

This is very similar to target encoding but excludes the current row’s target when calculating the mean target for a level to reduce the effect of outliers.
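The idea can be sketched in plain Python as follows. The function name `leave_one_out_means` and the fallback to the global mean for a level seen only once are assumptions made for this example, not a description of the encoder's internals.

```python
from collections import defaultdict


def leave_one_out_means(categories, targets):
    """For each row, mean target of the other rows sharing its category."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1

    global_mean = sum(targets) / len(targets)
    encoded = []
    for cat, y in zip(categories, targets):
        if counts[cat] > 1:
            # Remove the current row's target before averaging.
            encoded.append((sums[cat] - y) / (counts[cat] - 1))
        else:
            # Assumption: a level seen only once falls back to the global mean.
            encoded.append(global_mean)
    return encoded


cats = ["a", "a", "a", "b"]
ys = [1.0, 2.0, 6.0, 5.0]
loo = leave_one_out_means(cats, ys)  # [4.0, 3.5, 1.5, 3.5]
```

Note how the row with the outlying target 6.0 is encoded as 1.5: its own value no longer inflates the mean used for that row.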

Parameters
  • verbose (int) – integer indicating verbosity of the output. 0 for none.

  • cols (list) – a list of columns to encode, if None, all string columns will be encoded.

  • drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.

  • return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).

  • handle_missing (str) – options are ‘error’, ‘return_nan’ and ‘value’, defaults to ‘value’, which returns the target mean.

  • handle_unknown (str) – options are ‘error’, ‘return_nan’ and ‘value’, defaults to ‘value’, which returns the target mean.

  • sigma (float) – adds normal (Gaussian) distribution noise into training data in order to decrease overfitting (testing data are untouched). Sigma gives the standard deviation (spread or “width”) of the normal distribution. The optimal value is commonly between 0.05 and 0.6. The default is to not add noise, but that leads to significantly suboptimal results.
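A minimal sketch of how such noise could be applied to already-encoded training values is shown below. It assumes multiplicative noise drawn from a normal distribution centred on 1 with standard deviation sigma; whether the encoder applies the noise in exactly this way is an assumption, and `add_training_noise` is a hypothetical helper.

```python
import random


def add_training_noise(encoded_values, sigma, seed=None):
    """Scale each encoded training value by a draw from N(1, sigma)."""
    if not sigma:
        # sigma=None (or 0) leaves the values untouched.
        return list(encoded_values)
    rng = random.Random(seed)
    return [v * rng.gauss(1.0, sigma) for v in encoded_values]


train_encoded = [4.0, 3.5, 1.5, 3.5]
noisy = add_training_noise(train_encoded, sigma=0.05, seed=0)
unchanged = add_training_noise(train_encoded, sigma=None)
```

Test data would be passed through without noise, in line with the parameter description above.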

Example

>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = LeaveOneOutEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(13)
memory usage: 51.5 KB
None

References

1

Strategies to encode categorical variables with many categories, from

https://www.kaggle.com/c/caterpillar-tube-pricing/discussion/15748#143154.

fit(X, y, **kwargs)[source]

Fit encoder according to X and y.

Parameters
  • X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like, shape = [n_samples]) – Target values.

Returns

self – Returns self.

Return type

encoder

fit_column_map(series, y)[source]
fit_leave_one_out(X_in, y, cols=None)[source]
get_feature_names()[source]

Returns the names of all transformed / added columns.

Returns

feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!

Return type

list

transform(X, y=None, override_return_df=False)[source]

Perform the transformation to new categorical data.

Parameters
  • X (array-like, shape = [n_samples, n_features]) –

  • y (array-like, shape = [n_samples] when transform by leave one out) – None, when transform without target information (such as transform test set)

Returns

p – Transformed values with encoding applied.

Return type

array, shape = [n_samples, n_numeric + N]

transform_leave_one_out(X_in, y, mapping=None)[source]

Leave one out encoding uses a single column of floats to represent the means of the target variables.

class ballet.eng.external.category_encoders.MEstimateEncoder(verbose=0, cols=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value', random_state=None, randomized=False, sigma=0.05, m=1.0)[source]

Bases: sklearn.base.BaseEstimator, category_encoders.utils.TransformerWithTargetMixin

M-probability estimate of likelihood.

Supported targets: binomial and continuous. For polynomial target support, see PolynomialWrapper.

This is a simplified version of target encoder, which goes under names like m-probability estimate or additive smoothing with known incidence rates. In comparison to target encoder, m-probability estimate has only one tunable parameter (m), while target encoder has two tunable parameters (min_samples_leaf and smoothing).
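For illustration, additive smoothing with a known incidence rate can be written as (sum(y_i) + m * prior) / (count(y_i) + m), where the prior is the global mean of the target. The sketch below applies that formula with the standard library; `m_estimate` is a hypothetical helper, not the encoder's code.

```python
from collections import defaultdict


def m_estimate(categories, targets, m=1.0):
    """Return {category: (sum(y_i) + m * prior) / (count(y_i) + m)}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    prior = sum(targets) / len(targets)  # global mean of the target
    return {c: (sums[c] + m * prior) / (counts[c] + m) for c in sums}


cats = ["a", "a", "a", "b"]
ys = [1, 1, 0, 0]
enc_m1 = m_estimate(cats, ys, m=1.0)  # {'a': 0.625, 'b': 0.25}
enc_m0 = m_estimate(cats, ys, m=0.0)  # plain per-category means
```

With m=0 no shrinking takes place; larger values of m pull every level toward the prior of 0.5.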

Parameters
  • verbose (int) – integer indicating verbosity of the output. 0 for none.

  • cols (list) – a list of columns to encode, if None, all string columns will be encoded.

  • drop_invariant (bool) – boolean for whether or not to drop encoded columns with 0 variance.

  • return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).

  • handle_missing (str) – options are ‘return_nan’, ‘error’ and ‘value’, defaults to ‘value’, which returns the prior probability.

  • handle_unknown (str) – options are ‘return_nan’, ‘error’ and ‘value’, defaults to ‘value’, which returns the prior probability.

  • randomized (bool) – adds normal (Gaussian) distribution noise into training data in order to decrease overfitting (testing data are untouched).

  • sigma (float) – standard deviation (spread or “width”) of the normal distribution.

  • m (float) – this is the “m” in the m-probability estimate. A higher value of m results in stronger shrinking. m must be non-negative.

Example

>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target > 22.5
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = MEstimateEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(13)
memory usage: 51.5 KB
None

References

1

A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems, equation 7, from

https://dl.acm.org/citation.cfm?id=507538

2

On estimating probabilities in tree pruning, equation 1, from

https://link.springer.com/chapter/10.1007/BFb0017010

3

Additive smoothing, from

https://en.wikipedia.org/wiki/Additive_smoothing#Generalized_to_the_case_of_known_incidence_rates

fit(X, y, **kwargs)[source]

Fit encoder according to X and binary or continuous y.

Parameters
  • X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like, shape = [n_samples]) – Binary or continuous target values.

Returns

self – Returns self.

Return type

encoder

get_feature_names()[source]

Returns the names of all transformed / added columns.

Returns

feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!

Return type

list

transform(X, y=None, override_return_df=False)[source]

Perform the transformation to new categorical data.

When the data are used for model training, it is important to also pass the target in order to apply leave one out.

Parameters
  • X (array-like, shape = [n_samples, n_features]) –

  • y (array-like, shape = [n_samples] when transform by leave one out) – None, when transform without target information (such as transform test set)

Returns

p – Transformed values with encoding applied.

Return type

array, shape = [n_samples, n_numeric + N]

class ballet.eng.external.category_encoders.OneHotEncoder(verbose=0, cols=None, drop_invariant=False, return_df=True, handle_missing='value', handle_unknown='value', use_cat_names=False)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Onehot (or dummy) coding for categorical features, produces one feature per category, each binary.
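A stripped-down version of the idea, without any handling of unknown or missing values, might look like the sketch below. The helper `one_hot`, its `prefix` argument, and the index-based column names are illustrative assumptions.

```python
def one_hot(values, prefix="col"):
    """Return (column_names, rows) with one binary column per category."""
    categories = []
    for v in values:  # keep categories in order of first appearance
        if v not in categories:
            categories.append(v)
    names = [f"{prefix}_{i + 1}" for i in range(len(categories))]
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return names, rows


values = ["x", "y", "x", "z"]
names, rows = one_hot(values)
```

Every row contains exactly one 1, in the column belonging to that row's category.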

Parameters
  • verbose (int) – integer indicating verbosity of the output. 0 for none.

  • cols (list) – a list of columns to encode, if None, all string columns will be encoded.

  • drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.

  • return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).

  • use_cat_names (bool) – if True, category values will be included in the encoded column names. Since this can result in duplicate column names, duplicates are suffixed with the ‘#’ symbol until a unique name is generated. If False, category indices will be used instead of the category values.

  • handle_unknown (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if ‘indicator’ is used, an extra column will be added if the transform matrix has unknown categories. This can cause unexpected changes in dimension in some cases.

  • handle_missing (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if ‘indicator’ is used, an extra column will be added if the transform matrix has nan values. This can cause unexpected changes in dimension in some cases.

Example

>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = OneHotEncoder(cols=['CHAS', 'RAD'], handle_unknown='indicator').fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 24 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS_1     506 non-null int64
CHAS_2     506 non-null int64
CHAS_-1    506 non-null int64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD_1      506 non-null int64
RAD_2      506 non-null int64
RAD_3      506 non-null int64
RAD_4      506 non-null int64
RAD_5      506 non-null int64
RAD_6      506 non-null int64
RAD_7      506 non-null int64
RAD_8      506 non-null int64
RAD_9      506 non-null int64
RAD_-1     506 non-null int64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(11), int64(13)
memory usage: 95.0 KB
None

References

1

Contrast Coding Systems for Categorical Variables, from

https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/

2

Gregory Carey (2003). Coding Categorical Variables, from

http://psych.colorado.edu/~carey/Courses/PSYC5741/handouts/Coding%20Categorical%20Variables%202006-03-03.pdf

property category_mapping
fit(X, y=None, **kwargs)[source]

Fit encoder according to X and y.

Parameters
  • X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like, shape = [n_samples]) – Target values.

Returns

self – Returns self.

Return type

encoder

generate_mapping()[source]
get_dummies(X_in)[source]

Convert numerical variable into dummy variables

Parameters

X_in (DataFrame) –

Returns

dummies

Return type

DataFrame

get_feature_names()[source]

Returns the names of all transformed / added columns.

Returns

feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!

Return type

list

inverse_transform(X_in)[source]

Perform the inverse transformation to encoded data.

Parameters

X_in (array-like, shape = [n_samples, n_features]) –

Returns

p

Return type

array, the same size of X_in

reverse_dummies(X, mapping)[source]

Convert dummy variable into numerical variables

Parameters
  • X (DataFrame) –

  • mapping (list-like) – Contains mappings of each column to be transformed to its new columns and the value represented

Returns

numerical

Return type

DataFrame

transform(X, override_return_df=False)[source]

Perform the transformation to new categorical data.

Parameters

X (array-like, shape = [n_samples, n_features]) –

Returns

p – Transformed values with encoding applied.

Return type

array, shape = [n_samples, n_numeric + N]

class ballet.eng.external.category_encoders.OrdinalEncoder(verbose=0, mapping=None, cols=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value')[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Encodes categorical features as ordinal, in one ordered feature.

Ordinal encoding uses a single column of integers to represent the classes. An optional mapping dict can be passed in; in this case, we use the knowledge that there is some true order to the classes themselves. Otherwise, the classes are assumed to have no true order and integers are selected at random.

Parameters
  • verbose (int) – integer indicating verbosity of the output. 0 for none.

  • cols (list) – a list of columns to encode, if None, all string columns will be encoded.

  • drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.

  • return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).

  • mapping (list of dicts) –

    a mapping of class to label to use for the encoding, optional. Each dict contains the keys ‘col’ and ‘mapping’. The value of ‘col’ should be the feature name. The value of ‘mapping’ should be a dictionary of ‘original_label’ to ‘encoded_label’. Example mapping:

    [{'col': 'col1', 'mapping': {None: 0, 'a': 1, 'b': 2}}, {'col': 'col2', 'mapping': {None: 0, 'x': 1, 'y': 2}}]

  • handle_unknown (str) – options are ‘error’, ‘return_nan’ and ‘value’, defaults to ‘value’, which will impute the category -1.

  • handle_missing (str) – options are ‘error’, ‘return_nan’ and ‘value’, defaults to ‘value’, which treats nan as a category at fit time, or as -2 at transform time if nan was not a category during fit.
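The behaviour described by the mapping and handle_unknown parameters can be sketched without the library as follows. The helper `ordinal_encode` is hypothetical, and numbering labels in order of first appearance when no mapping is given is a simplification chosen for this example.

```python
def ordinal_encode(values, mapping=None, unknown=-1):
    """Return (codes, mapping); unseen labels get the `unknown` code."""
    if mapping is None:
        # Assumption: number the labels 1, 2, ... in order of first appearance.
        mapping = {}
        for v in values:
            mapping.setdefault(v, len(mapping) + 1)
    return [mapping.get(v, unknown) for v in values], mapping


sizes = ["low", "high", "mid", "low"]
# With a user-supplied mapping, the true order low < mid < high is preserved.
codes, used = ordinal_encode(sizes, mapping={"low": 1, "mid": 2, "high": 3})
# A label missing from the mapping is imputed as -1.
codes_unknown, _ = ordinal_encode(["low", "huge"], mapping=used)
```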

Example

>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = OrdinalEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null int64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null int64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(11), int64(2)
memory usage: 51.5 KB
None

References

1

Contrast Coding Systems for Categorical Variables, from

https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/

2

Gregory Carey (2003). Coding Categorical Variables, from

http://psych.colorado.edu/~carey/Courses/PSYC5741/handouts/Coding%20Categorical%20Variables%202006-03-03.pdf

property category_mapping
fit(X, y=None, **kwargs)[source]

Fit encoder according to X and y.

Parameters
  • X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like, shape = [n_samples]) – Target values.

Returns

self – Returns self.

Return type

encoder

get_feature_names()[source]

Returns the names of all transformed / added columns.

Returns

feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!

Return type

list

inverse_transform(X_in)[source]

Perform the inverse transformation to encoded data. Will attempt best case reconstruction, which means it will return nan for handle_missing and handle_unknown settings that break the bijection. We issue warnings when some of those cases occur.
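To see why the bijection can break, consider the simplified sketch below, in which every unknown category has been encoded as -1. The helper `ordinal_inverse` and the sample mapping are hypothetical and only illustrate the principle.

```python
import math


def ordinal_inverse(codes, mapping):
    """Map integer codes back to labels; unrecoverable codes become nan."""
    inverse = {code: label for label, code in mapping.items()}
    return [inverse.get(code, math.nan) for code in codes]


mapping = {"low": 1, "mid": 2, "high": 3}
# -1 stands for an unknown category, so the original label is lost.
restored = ordinal_inverse([1, 3, -1], mapping)
```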

Parameters

X_in (array-like, shape = [n_samples, n_features]) –

Returns

p

Return type

array, the same size of X_in

static ordinal_encoding(X_in, mapping=None, cols=None, handle_unknown='value', handle_missing='value')[source]

Ordinal encoding uses a single column of integers to represent the classes. An optional mapping dict can be passed in; in this case, we use the knowledge that there is some true order to the classes themselves. Otherwise, the classes are assumed to have no true order and integers are selected at random.

transform(X, override_return_df=False)[source]

Perform the transformation to new categorical data.

Will use the mapping (if available) and the column list (if available, otherwise every column) to encode the data ordinally.

Parameters

X (array-like, shape = [n_samples, n_features]) –

Returns

p – Transformed values with encoding applied.

Return type

array, shape = [n_samples, n_numeric + N]

class ballet.eng.external.category_encoders.PolynomialEncoder(verbose=0, cols=None, mapping=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value')[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Polynomial contrast coding for the encoding of categorical features.

Parameters
  • verbose (int) – integer indicating verbosity of the output. 0 for none.

  • cols (list) – a list of columns to encode, if None, all string columns will be encoded.

  • drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.

  • return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).

  • handle_unknown (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if ‘indicator’ is used, an extra column will be added if the transform matrix has unknown categories. This can cause unexpected changes in dimension in some cases.

  • handle_missing (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if ‘indicator’ is used, an extra column will be added if the transform matrix has nan values. This can cause unexpected changes in dimension in some cases.

Example

>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = PolynomialEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 21 columns):
intercept    506 non-null int64
CRIM         506 non-null float64
ZN           506 non-null float64
INDUS        506 non-null float64
CHAS_0       506 non-null float64
NOX          506 non-null float64
RM           506 non-null float64
AGE          506 non-null float64
DIS          506 non-null float64
RAD_0        506 non-null float64
RAD_1        506 non-null float64
RAD_2        506 non-null float64
RAD_3        506 non-null float64
RAD_4        506 non-null float64
RAD_5        506 non-null float64
RAD_6        506 non-null float64
RAD_7        506 non-null float64
TAX          506 non-null float64
PTRATIO      506 non-null float64
B            506 non-null float64
LSTAT        506 non-null float64
dtypes: float64(20), int64(1)
memory usage: 83.1 KB
None

References

1

Contrast Coding Systems for Categorical Variables, from

https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/

2

Gregory Carey (2003). Coding Categorical Variables, from

http://psych.colorado.edu/~carey/Courses/PSYC5741/handouts/Coding%20Categorical%20Variables%202006-03-03.pdf

fit(X, y=None, **kwargs)[source]

Fit encoder according to X and y.

Parameters
  • X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like, shape = [n_samples]) – Target values.

Returns

self – Returns self.

Return type

encoder

static fit_polynomial_coding(col, values, handle_missing, handle_unknown)[source]
get_feature_names()[source]

Returns the names of all transformed / added columns.

Returns

feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!

Return type

list

static polynomial_coding(X_in, mapping)[source]
transform(X, override_return_df=False)[source]

Perform the transformation to new categorical data.

Parameters

X (array-like, shape = [n_samples, n_features]) –

Returns

p – Transformed values with encoding applied.

Return type

array, shape = [n_samples, n_numeric + N]

class ballet.eng.external.category_encoders.SumEncoder(verbose=0, cols=None, mapping=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value')[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Sum contrast coding for the encoding of categorical features.
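For orientation, the textbook sum (deviation) contrast for k levels uses k - 1 columns: each of the first k - 1 levels gets a 1 in its own column, and the last level gets -1 in every column. The sketch below builds that matrix; `sum_contrast_matrix` is a hypothetical helper, and the encoder's own column order, reference level, and intercept handling may differ.

```python
def sum_contrast_matrix(n_levels):
    """Rows are levels, columns are the n_levels - 1 contrast columns."""
    matrix = []
    for level in range(n_levels):
        if level < n_levels - 1:
            # Indicator for this level's own column.
            row = [1.0 if col == level else 0.0 for col in range(n_levels - 1)]
        else:
            # The last level is coded as -1 in every column.
            row = [-1.0] * (n_levels - 1)
        matrix.append(row)
    return matrix


contrasts = sum_contrast_matrix(3)
# [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
```

Each column sums to zero, which is what lets the coefficients be read as deviations from the grand mean.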

Parameters
  • verbose (int) – integer indicating verbosity of the output. 0 for none.

  • cols (list) – a list of columns to encode, if None, all string columns will be encoded.

  • drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.

  • return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).

  • handle_unknown (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if ‘indicator’ is used, an extra column will be added if the transform matrix has unknown categories. This can cause unexpected changes in dimension in some cases.

  • handle_missing (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if ‘indicator’ is used, an extra column will be added if the transform matrix has nan values. This can cause unexpected changes in dimension in some cases.

Example

>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = SumEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 21 columns):
intercept    506 non-null int64
CRIM         506 non-null float64
ZN           506 non-null float64
INDUS        506 non-null float64
CHAS_0       506 non-null float64
NOX          506 non-null float64
RM           506 non-null float64
AGE          506 non-null float64
DIS          506 non-null float64
RAD_0        506 non-null float64
RAD_1        506 non-null float64
RAD_2        506 non-null float64
RAD_3        506 non-null float64
RAD_4        506 non-null float64
RAD_5        506 non-null float64
RAD_6        506 non-null float64
RAD_7        506 non-null float64
TAX          506 non-null float64
PTRATIO      506 non-null float64
B            506 non-null float64
LSTAT        506 non-null float64
dtypes: float64(20), int64(1)
memory usage: 83.1 KB
None

References

1

Contrast Coding Systems for Categorical Variables, from

https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/

2

Gregory Carey (2003). Coding Categorical Variables, from

http://psych.colorado.edu/~carey/Courses/PSYC5741/handouts/Coding%20Categorical%20Variables%202006-03-03.pdf

fit(X, y=None, **kwargs)[source]

Fit encoder according to X and y.

Parameters
  • X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like, shape = [n_samples]) – Target values.

Returns

self – Returns self.

Return type

encoder

static fit_sum_coding(col, values, handle_missing, handle_unknown)[source]
get_feature_names()[source]

Returns the names of all transformed / added columns.

Returns

feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!

Return type

list

static sum_coding(X_in, mapping)[source]
transform(X, override_return_df=False)[source]

Perform the transformation to new categorical data.

Parameters

X (array-like, shape = [n_samples, n_features]) –

Returns

p – Transformed values with encoding applied.

Return type

array, shape = [n_samples, n_numeric + N]

class ballet.eng.external.category_encoders.TargetEncoder(verbose=0, cols=None, drop_invariant=False, return_df=True, handle_missing='value', handle_unknown='value', min_samples_leaf=1, smoothing=1.0)[source]

Bases: sklearn.base.BaseEstimator, category_encoders.utils.TransformerWithTargetMixin

Target encoding for categorical features.

Supported targets: binomial and continuous. For polynomial target support, see PolynomialWrapper.

For a categorical target, features are replaced with a blend of the posterior probability of the target given a particular categorical value and the prior probability of the target over all the training data.

For a continuous target, features are replaced with a blend of the expected value of the target given a particular categorical value and the expected value of the target over all the training data.

Parameters
  • verbose (int) – integer indicating verbosity of the output. 0 for none.

  • cols (list) – a list of columns to encode. If None, all string columns will be encoded.

  • drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.

  • return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).

  • handle_missing (str) – options are ‘error’, ‘return_nan’ and ‘value’. The default is ‘value’, which returns the target mean.

  • handle_unknown (str) – options are ‘error’, ‘return_nan’ and ‘value’. The default is ‘value’, which returns the target mean.

  • min_samples_leaf (int) – minimum number of samples a category must have before its average is taken into account.

  • smoothing (float) – smoothing effect to balance the categorical average against the prior. A higher value means stronger regularization. The value must be strictly greater than 0.
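
As a rough illustration of how min_samples_leaf and smoothing could interact, here is a minimal sketch in plain Python. It is not the library's implementation: it assumes the blend weight is a sigmoid of (count - min_samples_leaf) / smoothing, and the function name target_encode_sketch and the sample data are hypothetical.

```python
import math
from collections import defaultdict


def target_encode_sketch(categories, targets, min_samples_leaf=1, smoothing=1.0):
    """Return ({category: blended value}, prior). Illustrative only."""
    # Prior: mean of the target over all the training data.
    prior = sum(targets) / len(targets)

    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1

    encoding = {}
    for category, n in counts.items():
        category_mean = sums[category] / n
        # Assumed weighting: a sigmoid that approaches 1 as the category
        # count grows past min_samples_leaf; smoothing controls its slope.
        weight = 1.0 / (1.0 + math.exp(-(n - min_samples_leaf) / smoothing))
        encoding[category] = prior * (1.0 - weight) + category_mean * weight
    return encoding, prior


cats = ["a", "a", "a", "a", "b"]
ys = [1.0, 1.0, 1.0, 1.0, 0.0]
encoding, prior = target_encode_sketch(cats, ys)
smoother, _ = target_encode_sketch(cats, ys, smoothing=10.0)
```

Under these assumptions, category "b" (seen once) is pulled halfway toward the prior, while "a" (seen four times) stays close to its own mean. Raising smoothing moves "a" closer to the prior as well.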

Example

>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = TargetEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(13)
memory usage: 51.5 KB
None

References

1

A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems, from

https://dl.acm.org/citation.cfm?id=507538

fit(X, y, **kwargs)[source]

Fit encoder according to X and y.

Parameters
  • X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like, shape = [n_samples]) – Target values.

Returns

self – Returns self.

Return type

encoder

fit_target_encoding(X, y)[source]
get_feature_names()[source]

Returns the names of all transformed / added columns.

Returns

feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!

Return type

list

target_encode(X_in)[source]
transform(X, y=None, override_return_df=False)[source]

Perform the transformation to new categorical data.

Parameters
  • X (array-like, shape = [n_samples, n_features]) –

  • y (array-like, shape = [n_samples] when transforming by leave one out) – None when transforming without target information (such as when transforming a test set).

Returns

p – Transformed values with encoding applied.

Return type

array, shape = [n_samples, n_numeric + N]

class ballet.eng.external.category_encoders.WOEEncoder(verbose=0, cols=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value', random_state=None, randomized=False, sigma=0.05, regularization=1.0)[source]

Bases: sklearn.base.BaseEstimator, category_encoders.utils.TransformerWithTargetMixin

Weight of Evidence coding for categorical features.

Supported targets: binomial. For polynomial target support, see PolynomialWrapper.

Parameters
  • verbose (int) – integer indicating verbosity of the output. 0 for none.

  • cols (list) – a list of columns to encode. If None, all string columns will be encoded.

  • drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.

  • return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).

  • handle_missing (str) – options are ‘return_nan’, ‘error’ and ‘value’. The default is ‘value’, which assumes WOE=0.

  • handle_unknown (str) – options are ‘return_nan’, ‘error’ and ‘value’. The default is ‘value’, which assumes WOE=0.

  • randomized (bool) – adds normally distributed (Gaussian) noise to the training data in order to decrease overfitting (testing data are untouched).

  • sigma (float) – standard deviation (spread or “width”) of the normal distribution.

  • regularization (float) – the purpose of regularization is mostly to prevent division by zero. When regularization is 0, you may encounter division by zero.
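
The role of regularization can be sketched in plain Python as follows. This is an illustration only: it assumes the regularization term is added to each category's positive and negative counts (and twice to each class total) before the log ratio is taken. The name woe_sketch and the sample data are hypothetical, and the library's exact formula may differ.

```python
import math


def woe_sketch(categories, targets, regularization=1.0):
    """Return {category: weight of evidence} for a binary target. Illustrative only."""
    total_pos = sum(1 for t in targets if t)
    total_neg = len(targets) - total_pos

    # Per-category (positives, negatives) counts.
    stats = {}
    for category, target in zip(categories, targets):
        pos, neg = stats.get(category, (0, 0))
        stats[category] = (pos + 1, neg) if target else (pos, neg + 1)

    woe = {}
    for category, (pos, neg) in stats.items():
        # Assumed smoothing: the regularization term keeps both shares
        # strictly positive, so the ratio and its log stay finite.
        pos_share = (pos + regularization) / (total_pos + 2 * regularization)
        neg_share = (neg + regularization) / (total_neg + 2 * regularization)
        woe[category] = math.log(pos_share / neg_share)
    return woe


cats = ["a", "a", "a", "b", "b", "b"]
ys = [True, True, True, False, False, False]
woe = woe_sketch(cats, ys)

# With regularization=0, a category with no negatives (or no positives)
# produces a zero share, and the computation fails.
try:
    woe_sketch(cats, ys, regularization=0.0)
    failed_without_regularization = False
except (ZeroDivisionError, ValueError):
    failed_without_regularization = True
```

In this sketch, a category containing only positives gets a positive WOE, one containing only negatives gets a negative WOE, and dropping the regularization term makes the perfectly separated case fail.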

Example

>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target > 22.5
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = WOEEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(13)
memory usage: 51.5 KB
None

References

1

Weight of Evidence (WOE) and Information Value Explained, from

https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html

fit(X, y, **kwargs)[source]

Fit encoder according to X and binary y.

Parameters
  • X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like, shape = [n_samples]) – Binary target values.

Returns

self – Returns self.

Return type

encoder

get_feature_names()[source]

Returns the names of all transformed / added columns.

Returns

feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!

Return type

list

transform(X, y=None, override_return_df=False)[source]

Perform the transformation to new categorical data. When the data are used for model training, it is important to also pass the target in order to apply leave one out.

Parameters
  • X (array-like, shape = [n_samples, n_features]) –

  • y (array-like, shape = [n_samples] when transforming by leave one out) – None when transforming without target information (such as when transforming a test set).
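
The distinction between transforming with and without y matters when randomized is enabled, since noise is meant for training data only. Below is a minimal sketch of that idea in plain Python. It assumes the noise is multiplicative and drawn from a normal distribution centered on 1 with standard deviation sigma. The helper add_training_noise and its is_training flag are hypothetical stand-ins for "y was passed to transform".

```python
import random


def add_training_noise(encoded_values, sigma=0.05, seed=None, is_training=True):
    """Perturb encoded values for training data only. Illustrative only."""
    if not is_training:
        # Test data are returned untouched.
        return list(encoded_values)
    rng = random.Random(seed)
    # Assumed scheme: scale each value by a factor drawn from N(1, sigma).
    return [value * rng.gauss(1.0, sigma) for value in encoded_values]


encoded = [1.386, -1.386, 0.0, 0.5]
train_a = add_training_noise(encoded, sigma=0.05, seed=0)
train_b = add_training_noise(encoded, sigma=0.05, seed=0)
test_out = add_training_noise(encoded, sigma=0.05, seed=0, is_training=False)
no_noise = add_training_noise(encoded, sigma=0.0, seed=0)
```

Fixing the seed makes the perturbation reproducible, which corresponds to the purpose of a random_state argument. Setting sigma to 0 leaves the values unchanged.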

Returns

p – Transformed values with encoding applied.

Return type

array, shape = [n_samples, n_numeric + N]