ballet.eng.external.category_encoders module
-
class ballet.eng.external.category_encoders.BackwardDifferenceEncoder(verbose=0, cols=None, mapping=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value')
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Backward difference contrast coding for encoding categorical variables.
- Parameters
verbose (int) – integer indicating verbosity of the output. 0 for none.
cols (list) – a list of columns to encode, if None, all string columns will be encoded.
drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.
return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).
handle_unknown (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if indicator is used, an extra column will be added in if the transform matrix has unknown categories. This can cause unexpected changes in dimension in some cases.
handle_missing (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if indicator is used, an extra column will be added in if the transform matrix has nan values. This can cause unexpected changes in dimension in some cases.
Example
>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = BackwardDifferenceEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 21 columns):
intercept    506 non-null int64
CRIM         506 non-null float64
ZN           506 non-null float64
INDUS        506 non-null float64
CHAS_0       506 non-null float64
NOX          506 non-null float64
RM           506 non-null float64
AGE          506 non-null float64
DIS          506 non-null float64
RAD_0        506 non-null float64
RAD_1        506 non-null float64
RAD_2        506 non-null float64
RAD_3        506 non-null float64
RAD_4        506 non-null float64
RAD_5        506 non-null float64
RAD_6        506 non-null float64
RAD_7        506 non-null float64
TAX          506 non-null float64
PTRATIO      506 non-null float64
B            506 non-null float64
LSTAT        506 non-null float64
dtypes: float64(20), int64(1)
memory usage: 83.1 KB
None
References
- 1
Contrast Coding Systems for Categorical Variables, from
https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/
- 2
Gregory Carey (2003). Coding Categorical Variables.
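To make "backward difference contrast coding" concrete, here is a minimal sketch of the textbook contrast matrix, in which contrast j compares level j+1 with level j. The helper name backward_difference_matrix is hypothetical, and it is an assumption that the encoder uses exactly this scaling. The k - 1 output columns per feature are consistent with the example above, where the nine-level RAD column becomes RAD_0 to RAD_7.

```python
from fractions import Fraction


def backward_difference_matrix(k):
    """Return the k x (k - 1) backward difference contrast matrix as a list of rows.

    Hypothetical helper for illustration; not part of category_encoders.
    """
    rows = []
    for level in range(k):
        row = []
        for j in range(k - 1):
            # Levels up to and including j get a negative weight,
            # the later levels a positive weight; each column sums to zero.
            if level <= j:
                row.append(Fraction(-(k - j - 1), k))
            else:
                row.append(Fraction(j + 1, k))
        rows.append(row)
    return rows


matrix = backward_difference_matrix(4)
for row in matrix:
    print([str(value) for value in row])
```

Exact fractions are used so that the zero-sum property of every column can be checked without floating-point noise.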
-
fit(X, y=None, **kwargs)
Fits an ordinal encoder to produce a consistent mapping across applications and optionally finds generally invariant columns to drop consistently.
- Parameters
X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.
y (array-like, shape = [n_samples]) – Target values.
- Returns
self – Returns self.
- Return type
encoder
-
class ballet.eng.external.category_encoders.BaseNEncoder(verbose=0, cols=None, mapping=None, drop_invariant=False, return_df=True, base=2, handle_unknown='value', handle_missing='value')
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
The Base-N encoder encodes the categories into arrays of their base-N representation. A base of 1 is equivalent to one-hot encoding (not really base-1, but useful); a base of 2 is equivalent to binary encoding. Setting N to the number of actual categories is equivalent to vanilla ordinal encoding.
- Parameters
verbose (int) – integer indicating verbosity of the output. 0 for none.
cols (list) – a list of columns to encode, if None, all string columns will be encoded.
drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.
return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).
base (int) – the base N of the encoding. When the downstream model copes well with nonlinearities (like a decision tree), use a higher base.
handle_unknown (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if indicator is used, an extra column will be added in if the transform matrix has unknown categories. This can cause unexpected changes in dimension in some cases.
handle_missing (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if indicator is used, an extra column will be added in if the transform matrix has nan values. This can cause unexpected changes in dimension in some cases.
Example
>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = BaseNEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 18 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS_0     506 non-null int64
CHAS_1     506 non-null int64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD_0      506 non-null int64
RAD_1      506 non-null int64
RAD_2      506 non-null int64
RAD_3      506 non-null int64
RAD_4      506 non-null int64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(11), int64(7)
memory usage: 71.3 KB
None
-
basen_encode(X_in, cols=None)
Base-N encoding encodes the integers as base-N code with one column per digit.
- Parameters
X_in (DataFrame) –
cols (list-like, default None) – Column names in the DataFrame to be encoded
- Returns
dummies
- Return type
DataFrame
-
basen_to_integer(X, cols, base)
Convert base-N code to integers.
- Parameters
X (DataFrame) – encoded data
cols (list-like) – Column names in the DataFrame that are encoded
base (int) – The base of transform
- Returns
numerical
- Return type
DataFrame
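The two helpers above describe a round trip between ordinal integers and base-N digits. The following is a minimal pure-Python sketch of that idea, one list element per digit column. The function names to_base_n and from_base_n and the sample ordinals are hypothetical and are not the library's implementation.

```python
def to_base_n(value, base, width):
    """Encode a non-negative integer as a fixed-width list of base-N digits."""
    digits = []
    for _ in range(width):
        digits.append(value % base)  # least significant digit first
        value //= base
    return digits[::-1]  # most significant digit first


def from_base_n(digits, base):
    """Inverse of to_base_n: fold the digits back into an integer."""
    value = 0
    for digit in digits:
        value = value * base + digit
    return value


ordinals = [1, 2, 3, 4, 5]  # hypothetical ordinal codes for five categories
encoded = [to_base_n(v, base=2, width=3) for v in ordinals]
decoded = [from_base_n(d, base=2) for d in encoded]

for ordinal, digits in zip(ordinals, encoded):
    print(ordinal, digits)
```

With base=2 this is binary encoding: five categories fit into three digit columns, whereas one-hot encoding would need five.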
-
fit(X, y=None, **kwargs)
Fit encoder according to X and y.
- Parameters
X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.
y (array-like, shape = [n_samples]) – Target values.
- Returns
self – Returns self.
- Return type
encoder
-
get_feature_names()
Returns the names of all transformed / added columns.
- Returns
feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!
- Return type
list
-
class ballet.eng.external.category_encoders.BinaryEncoder(verbose=0, cols=None, mapping=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value')
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Binary encoding for categorical variables, similar to one-hot encoding, but stores categories as binary bitstrings.
- Parameters
verbose (int) – integer indicating verbosity of the output. 0 for none.
cols (list) – a list of columns to encode, if None, all string columns will be encoded.
drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.
return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).
handle_unknown (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if indicator is used, an extra column will be added in if the transform matrix has unknown categories. This can cause unexpected changes in dimension in some cases.
handle_missing (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if indicator is used, an extra column will be added in if the transform matrix has nan values. This can cause unexpected changes in dimension in some cases.
Example
>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = BinaryEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 18 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS_0     506 non-null int64
CHAS_1     506 non-null int64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD_0      506 non-null int64
RAD_1      506 non-null int64
RAD_2      506 non-null int64
RAD_3      506 non-null int64
RAD_4      506 non-null int64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(11), int64(7)
memory usage: 71.3 KB
None
-
fit(X, y=None, **kwargs)
Fit encoder according to X and y.
- Parameters
X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.
y (array-like, shape = [n_samples]) – Target values.
- Returns
self – Returns self.
- Return type
encoder
-
get_feature_names()
Returns the names of all transformed / added columns.
- Returns
feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!
- Return type
list
-
class ballet.eng.external.category_encoders.CatBoostEncoder(verbose=0, cols=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value', random_state=None, sigma=None, a=1)
Bases: sklearn.base.BaseEstimator, category_encoders.utils.TransformerWithTargetMixin
CatBoost coding for categorical features.
Supported targets: binomial and continuous. For polynomial target support, see PolynomialWrapper.
This is very similar to leave-one-out encoding, but calculates the values “on-the-fly”. Consequently, the values naturally vary during the training phase and it is not necessary to add random noise.
Beware: the training data have to be randomly permuted, e.g.:
import numpy as np
# Random permutation
perm = np.random.permutation(len(X))
X = X.iloc[perm].reset_index(drop=True)
y = y.iloc[perm].reset_index(drop=True)
This is necessary because some data sets are sorted based on the target value and this encoder encodes the features on-the-fly in a single pass.
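The single-pass, on-the-fly behaviour described above can be sketched as an ordered target statistic: each row is encoded using only the targets of earlier rows in the same category, smoothed towards the global target mean with weight a. This is an illustrative sketch, not the library's code. The function name ordered_target_encode is hypothetical, and the use of the global mean as the prior is an assumption.

```python
def ordered_target_encode(categories, targets, a=1.0):
    """Encode each row from the targets of earlier rows in the same category."""
    prior = sum(targets) / len(targets)  # assumed prior: global target mean
    running_sum = {}
    running_count = {}
    encoded = []
    for category, target in zip(categories, targets):
        seen_sum = running_sum.get(category, 0.0)
        seen_count = running_count.get(category, 0)
        # The row's own target is not used for its own encoding.
        encoded.append((seen_sum + prior * a) / (seen_count + a))
        # Update the running statistics only after the row is encoded.
        running_sum[category] = seen_sum + target
        running_count[category] = seen_count + 1
    return encoded


cats = ["a", "b", "a", "a", "b"]
ys = [1.0, 0.0, 0.0, 1.0, 1.0]
values = ordered_target_encode(cats, ys)
print(values)
```

Because each value depends on the row order, a data set sorted by target would give systematically skewed encodings, which is why the permutation step matters.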
- Parameters
verbose (int) – integer indicating verbosity of the output. 0 for none.
cols (list) – a list of columns to encode, if None, all string columns will be encoded.
drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.
return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).
handle_missing (str) – options are ‘error’, ‘return_nan’ and ‘value’, defaults to ‘value’, which returns the target mean.
handle_unknown (str) – options are ‘error’, ‘return_nan’ and ‘value’, defaults to ‘value’, which returns the target mean.
sigma (float) – adds normal (Gaussian) distribution noise into training data in order to decrease overfitting (testing data are untouched). sigma gives the standard deviation (spread or “width”) of the normal distribution.
a (float) – additive smoothing (it is the same variable as “m” in m-probability estimate). By default set to 1.
Example
>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = CatBoostEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(13)
memory usage: 51.5 KB
None
References
- 1
Transforming categorical features to numerical features, from
https://tech.yandex.com/catboost/doc/dg/concepts/algorithm-main-stages_cat-to-numberic-docpage/
- 2
CatBoost: unbiased boosting with categorical features, from
https://arxiv.org/abs/1706.09516
-
fit(X, y, **kwargs)
Fit encoder according to X and y.
- Parameters
X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.
y (array-like, shape = [n_samples]) – Target values.
- Returns
self – Returns self.
- Return type
encoder
-
get_feature_names()
Returns the names of all transformed / added columns.
- Returns
feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!
- Return type
list
-
transform(X, y=None, override_return_df=False)
Perform the transformation to new categorical data.
- Parameters
X (array-like, shape = [n_samples, n_features]) –
y (array-like, shape = [n_samples], or None) – target values, passed when transforming by leave one out; None when transforming without target information (such as when transforming a test set).
- Returns
p – Transformed values with encoding applied.
- Return type
array, shape = [n_samples, n_numeric + N]
-
class ballet.eng.external.category_encoders.CountEncoder(verbose=0, cols=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value', min_group_size=None, combine_min_nan_groups=None, min_group_name=None, normalize=False)
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
-
fit(X, y=None, **kwargs)
Fit encoder according to X.
- Parameters
X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.
y (array-like, shape = [n_samples]) – Target values.
- Returns
self – Returns self.
- Return type
encoder
-
get_feature_names()
Returns the names of all transformed / added columns.
- Returns
feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!
- Return type
list
-
transform(X, y=None, override_return_df=False)
Perform the transformation to new categorical data.
- Parameters
X (array-like, shape = [n_samples, n_features]) –
y (array-like, shape = [n_samples]) –
- Returns
p – Transformed values with encoding applied.
- Return type
array, shape = [n_samples, n_numeric + N]
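The listing above gives no description for CountEncoder, so the following is only a sketch of the general count-encoding idea: each category is replaced by the number of times it occurs in the training data. The helper name count_encode is hypothetical. Treating normalize=True as "divide by the number of rows" is an assumption about what the constructor flag of the same name means.

```python
from collections import Counter


def count_encode(values, normalize=False):
    """Replace every category by its count (or relative frequency)."""
    counts = Counter(values)
    total = len(values)
    if normalize:
        # Assumed meaning of normalize: counts scaled to relative frequencies.
        return [counts[v] / total for v in values]
    return [counts[v] for v in values]


colors = ["red", "blue", "red", "green", "red", "blue"]
raw = count_encode(colors)
freq = count_encode(colors, normalize=True)
print(raw)
print(freq)
```

Unlike the contrast and base-N encoders, this keeps a single numeric column per feature, but distinct categories with equal counts become indistinguishable.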
-
class ballet.eng.external.category_encoders.GLMMEncoder(verbose=0, cols=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value', random_state=None, randomized=False, sigma=0.05, binomial_target=None)
Bases: sklearn.base.BaseEstimator, category_encoders.utils.TransformerWithTargetMixin
Generalized linear mixed model.
Supported targets: binomial and continuous. For polynomial target support, see PolynomialWrapper.
This is a supervised encoder similar to TargetEncoder or MEstimateEncoder, but there are some advantages:
1) Solid statistical theory behind the technique. Mixed effects models are a mature branch of statistics.
2) No hyper-parameters to tune. The amount of shrinkage is automatically determined through the estimation process. In short, the fewer observations a category has and/or the more the outcome varies for a category, the stronger the regularization towards “the prior” or “grand mean”.
3) The technique is applicable for both continuous and binomial targets. If the target is continuous, the encoder returns the regularized difference of the observation’s category from the global mean. If the target is binomial, the encoder returns regularized log odds per category.
In comparison to JamesSteinEncoder, this encoder utilizes generalized linear mixed models from the statsmodels library.
Note: This is an alpha implementation. The API of the method may change in the future.
- Parameters
verbose (int) – integer indicating verbosity of the output. 0 for none.
cols (list) – a list of columns to encode, if None, all string columns will be encoded.
drop_invariant (bool) – boolean for whether or not to drop encoded columns with 0 variance.
return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).
handle_missing (str) – options are ‘return_nan’, ‘error’ and ‘value’, defaults to ‘value’, which returns 0.
handle_unknown (str) – options are ‘return_nan’, ‘error’ and ‘value’, defaults to ‘value’, which returns 0.
randomized (bool) – adds normal (Gaussian) distribution noise into training data in order to decrease overfitting (testing data are untouched).
sigma (float) – standard deviation (spread or “width”) of the normal distribution.
binomial_target (bool) – if True, the target must be binomial with values {0, 1} and Binomial mixed model is used. If False, the target must be continuous and Linear mixed model is used. If None (the default), a heuristic is applied to estimate the target type.
Example
>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target > 22.5
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = GLMMEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(13)
memory usage: 51.5 KB
None
References
- 1
Data Analysis Using Regression and Multilevel/Hierarchical Models, page 253, from
https://faculty.psau.edu.sa/filedownload/doc-12-pdf-a1997d0d31f84d13c1cdc44ac39a8f2c-original.pdf
-
fit(X, y, **kwargs)
Fit encoder according to X and binary y.
- Parameters
X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.
y (array-like, shape = [n_samples]) – Binary target values.
- Returns
self – Returns self.
- Return type
encoder
-
get_feature_names()
Returns the names of all transformed / added columns.
- Returns
feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!
- Return type
list
-
transform(X, y=None, override_return_df=False)
Perform the transformation to new categorical data.
When the data are used for model training, it is important to also pass the target in order to apply leave one out.
- Parameters
X (array-like, shape = [n_samples, n_features]) –
y (array-like, shape = [n_samples], or None) – target values, passed when transforming by leave one out; None when transforming without target information (such as when transforming a test set).
- Returns
p – Transformed values with encoding applied.
- Return type
array, shape = [n_samples, n_numeric + N]
-
class ballet.eng.external.category_encoders.HashingEncoder(max_process=0, max_sample=0, verbose=0, n_components=8, cols=None, drop_invariant=False, return_df=True, hash_method='md5')
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
A multivariate hashing implementation with configurable dimensionality/precision.
The advantage of this encoder is that it does not maintain a dictionary of observed categories. Consequently, the encoder does not grow in size and accepts new values during data scoring by design.
It’s important to read about how max_process & max_sample work before setting them manually; inappropriate settings slow down encoding.
The default value of ‘max_process’ is 1 on Windows because multiprocessing might cause issues; see:
https://github.com/scikit-learn-contrib/categorical-encoding/issues/215
https://docs.python.org/2/library/multiprocessing.html?highlight=process#windows
- Parameters
verbose (int) – integer indicating verbosity of the output. 0 for none.
cols (list) – a list of columns to encode, if None, all string columns will be encoded.
drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.
return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).
hash_method (str) – which hashing method to use. Any method from hashlib works.
max_process (int) – how many processes to use in transform(). Limited in range(1, 64). By default, it uses half of the logical CPUs. For example, 4C4T makes max_process=2, 4C8T makes max_process=4. Set it larger if you have a strong CPU. It is not recommended to set it larger than the count of the logical CPUs, as that will actually slow down the encoding.
max_sample (int) – how many samples to encode by each process at a time. This setting is useful on low memory machines. By default, max_sample=(number of samples)/(max_process). For example, 4C8T CPU with 100,000 samples makes max_sample=25,000, 6C12T CPU with 100,000 samples makes max_sample=16,666. It is not recommended to set it larger than the default value.
n_components (int) – how many bits to use to represent the feature. By default we use 8 bits. For high-cardinality features, consider using up to 32 bits.
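The default values quoted for max_process and max_sample follow from simple arithmetic, reproduced in the sketch below. Both helper names are hypothetical. The use of floor division is an assumption inferred from the quoted figure of 16,666, and the clamping to range(1, 64) mentioned above is not modelled.

```python
def default_max_process(logical_cpus):
    """Half of the logical CPUs, but at least one process."""
    return max(1, logical_cpus // 2)


def default_max_sample(n_samples, max_process):
    """Samples handled by each process at a time (floor division assumed)."""
    return n_samples // max_process


# 4C4T, 4C8T and 6C12T correspond to 4, 8 and 12 logical CPUs.
for logical_cpus in (4, 8, 12):
    processes = default_max_process(logical_cpus)
    samples = default_max_sample(100_000, processes)
    print(logical_cpus, processes, samples)
```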
Example
>>> from category_encoders.hashing import HashingEncoder
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> y = bunch.target
>>> he = HashingEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> data = he.transform(X)
>>> print(data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 19 columns):
col_0      506 non-null int64
col_1      506 non-null int64
col_2      506 non-null int64
col_3      506 non-null int64
col_4      506 non-null int64
col_5      506 non-null int64
col_6      506 non-null int64
col_7      506 non-null int64
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(11), int64(8)
memory usage: 75.2 KB
None
References
- 1
Feature Hashing for Large Scale Multitask Learning, from
https://alex.smola.org/papers/2009/Weinbergeretal09.pdf
- 2
Don’t be tricked by the Hashing Trick, from
https://booking.ai/dont-be-tricked-by-the-hashing-trick-192a6aae3087
-
fit(X, y=None, **kwargs)
Fit encoder according to X and y.
- Parameters
X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.
y (array-like, shape = [n_samples]) – Target values.
- Returns
self – Returns self.
- Return type
encoder
-
get_feature_names()
Returns the names of all transformed / added columns.
- Returns
feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!
- Return type
list
-
static hashing_trick(X_in, hashing_method='md5', N=2, cols=None, make_copy=False)
A basic hashing implementation with configurable dimensionality/precision.
Performs the hashing trick on a pandas dataframe, X, using the hashing method from hashlib identified by hashing_method. The number of output dimensions (N), and columns to hash (cols) are also configurable.
- Parameters
X_in (pandas dataframe) – the dataframe to hash.
hashing_method (string, optional) – name of the hashlib hashing method to use.
N (int, optional) – number of output dimensions.
cols (list, optional) – columns to hash.
make_copy (bool, optional) – whether to work on a copy of X_in instead of the input itself.
- Returns
out – A hashing encoded dataframe.
- Return type
dataframe
References
- 1
Kilian Weinberger; Anirban Dasgupta; John Langford; Alex Smola; Josh Attenberg (2009). Feature Hashing for Large Scale Multitask Learning. Proc. ICML.
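The hashing trick described for hashing_trick can be illustrated on a single row: every value of the selected columns is hashed with a hashlib method, and the digest modulo the number of output dimensions selects a bucket whose counter is incremented. This is a minimal sketch under that assumption, not the library's code. The function name hash_row and the sample values are hypothetical.

```python
import hashlib


def hash_row(values, n_components=8, hash_method="md5"):
    """Hash a row of categorical values into n_components count buckets."""
    buckets = [0] * n_components
    for value in values:
        hasher = hashlib.new(hash_method)
        hasher.update(str(value).encode("utf-8"))
        # Interpret the hex digest as an integer and fold it into a bucket.
        index = int(hasher.hexdigest(), 16) % n_components
        buckets[index] += 1
    return buckets


row = hash_row(["0.0", "4.0"], n_components=8)
unseen = hash_row(["never seen before"], n_components=8)
print(row)
print(unseen)
```

No dictionary of observed categories is kept, so a previously unseen value is encoded like any other. The price is that unrelated values can collide in the same bucket, more often when n_components is small.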
-
class ballet.eng.external.category_encoders.HelmertEncoder(verbose=0, cols=None, mapping=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value')
Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
Helmert contrast coding for encoding categorical features.
- Parameters
verbose (int) – integer indicating verbosity of the output. 0 for none.
cols (list) – a list of columns to encode, if None, all string columns will be encoded.
drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.
return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).
handle_unknown (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if indicator is used, an extra column will be added in if the transform matrix has unknown categories. This can cause unexpected changes in dimension in some cases.
handle_missing (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if indicator is used, an extra column will be added in if the transform matrix has nan values. This can cause unexpected changes in dimension in some cases.
Example
>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = HelmertEncoder(cols=['CHAS', 'RAD'], handle_unknown='value', handle_missing='value').fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 21 columns):
intercept    506 non-null int64
CRIM         506 non-null float64
ZN           506 non-null float64
INDUS        506 non-null float64
CHAS_0       506 non-null float64
NOX          506 non-null float64
RM           506 non-null float64
AGE          506 non-null float64
DIS          506 non-null float64
RAD_0        506 non-null float64
RAD_1        506 non-null float64
RAD_2        506 non-null float64
RAD_3        506 non-null float64
RAD_4        506 non-null float64
RAD_5        506 non-null float64
RAD_6        506 non-null float64
RAD_7        506 non-null float64
TAX          506 non-null float64
PTRATIO      506 non-null float64
B            506 non-null float64
LSTAT        506 non-null float64
dtypes: float64(20), int64(1)
memory usage: 83.1 KB
None
References
- 1
Contrast Coding Systems for Categorical Variables, from
https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/
- 2
Gregory Carey (2003). Coding Categorical Variables.
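As an illustration of Helmert contrast coding, the sketch below builds the contrast matrix in the convention of R's contr.helmert, where contrast j sets level j+1 against the levels that precede it. The helper name helmert_matrix is hypothetical, and it is an assumption that the encoder uses this exact orientation and scaling. As in the example above, k levels yield k - 1 columns.

```python
def helmert_matrix(k):
    """Return the k x (k - 1) Helmert contrast matrix as a list of rows."""
    rows = []
    for level in range(k):
        row = []
        for j in range(k - 1):
            if level <= j:
                row.append(-1)      # levels preceding the contrasted level
            elif level == j + 1:
                row.append(j + 1)   # the contrasted level itself
            else:
                row.append(0)       # later levels do not take part
        rows.append(row)
    return rows


helmert = helmert_matrix(4)
for row in helmert:
    print(row)
```

Every column sums to zero and the columns are mutually orthogonal, which is what makes the resulting regression coefficients interpretable as independent comparisons.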
-
fit(X, y=None, **kwargs)
Fit encoder according to X and y.
- Parameters
X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.
y (array-like, shape = [n_samples]) – Target values.
- Returns
self – Returns self.
- Return type
encoder
-
class ballet.eng.external.category_encoders.JamesSteinEncoder(verbose=0, cols=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value', model='independent', random_state=None, randomized=False, sigma=0.05)
Bases: sklearn.base.BaseEstimator, category_encoders.utils.TransformerWithTargetMixin
James-Stein estimator.
Supported targets: binomial and continuous. For polynomial target support, see PolynomialWrapper.
For feature value i, the James-Stein estimator returns a weighted average of:
1. The mean target value for the observed feature value i.
2. The mean target value (regardless of the feature value).
This can be written as:
JS_i = (1-B)*mean(y_i) + B*mean(y)
The question is, what should be the weight B? If we put too much weight on the conditional mean value, we will overfit. If we put too much weight on the global mean, we will underfit. The canonical solution in machine learning is to perform cross-validation. However, Charles Stein came up with a closed-form solution to the problem. The intuition is: if the estimate of mean(y_i) is unreliable (y_i has high variance), we should put more weight on mean(y). Stein put it into an equation as:
B = var(y_i) / (var(y_i)+var(y))
The only remaining issue is that we do not know var(y), let alone var(y_i). Hence, we have to estimate the variances. But how can we reliably estimate the variances, when we already struggle with the estimation of the mean values?! There are multiple solutions:
1. If we have the same count of observations for each feature value i and all y_i are close to each other, we can pretend that all var(y_i) are identical. This is called a pooled model.
2. If the observation counts are not equal, it makes sense to replace the variances with squared standard errors, which penalize small observation counts:
SE^2 = var(y)/count(y)
This is called an independent model.
The James-Stein estimator has, however, one practical limitation: it was defined only for normal distributions. If you want to apply it to binary classification, where the target takes only the values {0, 1}, it is better to first convert the mean target value from the bounded interval [0, 1] into an unbounded interval by replacing mean(y) with the log-odds ratio:
log-odds_ratio_i = log(mean(y_i)/mean(y_not_i))
This is called the binary model. The estimation of the parameters of this model is, however, tricky and sometimes fails fatally. In these situations, it is better to use the beta model, which generally delivers slightly worse accuracy than the binary model but does not suffer from fatal failures.
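The shrinkage formulas above can be transcribed almost literally. The sketch below follows the independent model: both variances in B are replaced by squared standard errors, SE^2 = var(y)/count(y), before JS_i is computed. It is a direct reading of the equations in this section, not the encoder's actual implementation. The function name james_stein_encode and the sample data are hypothetical, and every feature value needs at least two observations so that a sample variance exists.

```python
from statistics import mean, variance


def james_stein_encode(groups):
    """Map each feature value to its shrunken target mean.

    groups: dict mapping a feature value to the list of its target values.
    """
    all_y = [y for ys in groups.values() for y in ys]
    global_mean = mean(all_y)
    se2_global = variance(all_y) / len(all_y)  # squared standard error of y
    encoding = {}
    for value, ys in groups.items():
        se2_i = variance(ys) / len(ys)         # squared standard error of y_i
        b = se2_i / (se2_i + se2_global)       # weight B on the global mean
        encoding[value] = (1 - b) * mean(ys) + b * global_mean
    return encoding


groups = {"a": [10.0, 12.0, 11.0, 13.0], "b": [20.0, 30.0]}
shrunk = james_stein_encode(groups)
print(shrunk)
```

The small, noisy group "b" is pulled strongly towards the global mean of 16, while the larger and tighter group "a" stays close to its own mean of 11.5.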
- Parameters
verbose (int) – integer indicating verbosity of the output. 0 for none.
cols (list) – a list of columns to encode, if None, all string columns will be encoded.
drop_invariant (bool) – boolean for whether or not to drop encoded columns with 0 variance.
return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).
handle_missing (str) – options are ‘return_nan’, ‘error’ and ‘value’, defaults to ‘value’, which returns the prior probability.
handle_unknown (str) – options are ‘return_nan’, ‘error’ and ‘value’, defaults to ‘value’, which returns the prior probability.
model (str) – options are ‘pooled’, ‘beta’, ‘binary’ and ‘independent’, defaults to ‘independent’.
randomized (bool) – adds normal (Gaussian) distribution noise into training data in order to decrease overfitting (testing data are untouched).
sigma (float) – standard deviation (spread or “width”) of the normal distribution.
Example
>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = JamesSteinEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
CRIM       506 non-null float64
ZN         506 non-null float64
INDUS      506 non-null float64
CHAS       506 non-null float64
NOX        506 non-null float64
RM         506 non-null float64
AGE        506 non-null float64
DIS        506 non-null float64
RAD        506 non-null float64
TAX        506 non-null float64
PTRATIO    506 non-null float64
B          506 non-null float64
LSTAT      506 non-null float64
dtypes: float64(13)
memory usage: 51.5 KB
None
References
- 1
Parametric empirical Bayes inference: Theory and applications, equations 1.19 & 1.20, from
https://www.jstor.org/stable/2287098
- 2
Empirical Bayes for multiple sample sizes, from
http://chris-said.io/2017/05/03/empirical-bayes-for-multiple-sample-sizes/
- 3
Shrinkage Estimation of Log-odds Ratios for Comparing Mobility Tables, from
https://journals.sagepub.com/doi/abs/10.1177/0081175015570097
- 4
Stein’s paradox and group rationality, from
http://www.philos.rug.nl/~romeyn/presentation/2017_romeijn_-_Paris_Stein.pdf
- 5
Stein’s Paradox in Statistics, from
http://statweb.stanford.edu/~ckirby/brad/other/Article1977.pdf
-
fit
(X, y, **kwargs)[source]¶ Fit encoder according to X and binary y.
- Parameters
X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.
y (array-like, shape = [n_samples]) – Binary target values.
- Returns
self – Returns self.
- Return type
encoder
-
get_feature_names
()[source]¶ Returns the names of all transformed / added columns.
- Returns
feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!
- Return type
list
-
transform
(X, y=None, override_return_df=False)[source]¶ Perform the transformation to new categorical data. When the data are used for model training, it is important to also pass the target in order to apply leave one out.
- Parameters
X (array-like, shape = [n_samples, n_features]) –
y (array-like, shape = [n_samples] when transform by leave one out) – None, when transform without target information (such as transform test set)
- Returns
p – Transformed values with encoding applied.
- Return type
array, shape = [n_samples, n_numeric + N]
-
class
ballet.eng.external.category_encoders.
LeaveOneOutEncoder
(verbose=0, cols=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value', random_state=None, sigma=None)[source]¶ Bases:
sklearn.base.BaseEstimator
,category_encoders.utils.TransformerWithTargetMixin
Leave one out coding for categorical features.
This is very similar to target encoding but excludes the current row’s target when calculating the mean target for a level to reduce the effect of outliers.
- Parameters
verbose (int) – integer indicating verbosity of the output. 0 for none.
cols (list) – a list of columns to encode, if None, all string columns will be encoded.
drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.
return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).
handle_missing (str) – options are ‘error’, ‘return_nan’ and ‘value’, defaults to ‘value’, which returns the target mean.
handle_unknown (str) – options are ‘error’, ‘return_nan’ and ‘value’, defaults to ‘value’, which returns the target mean.
sigma (float) – adds normal (Gaussian) distribution noise into training data in order to decrease overfitting (testing data are untouched). Sigma gives the standard deviation (spread or “width”) of the normal distribution. The optimal value is commonly between 0.05 and 0.6. The default is to not add noise, but that leads to significantly suboptimal results.
Example
>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = LeaveOneOutEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
CRIM 506 non-null float64
ZN 506 non-null float64
INDUS 506 non-null float64
CHAS 506 non-null float64
NOX 506 non-null float64
RM 506 non-null float64
AGE 506 non-null float64
DIS 506 non-null float64
RAD 506 non-null float64
TAX 506 non-null float64
PTRATIO 506 non-null float64
B 506 non-null float64
LSTAT 506 non-null float64
dtypes: float64(13)
memory usage: 51.5 KB
None
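The leave-one-out idea described above can be illustrated without the library. The following is a minimal pure-Python sketch, not the encoder's actual implementation; the function name `leave_one_out_encode` and the fallback to the global mean for levels seen only once are assumptions made for illustration.

```python
from collections import defaultdict


def leave_one_out_encode(categories, targets):
    """Encode each training row with the mean target of the other rows in its level."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for level, y in zip(categories, targets):
        sums[level] += y
        counts[level] += 1

    global_mean = sum(targets) / len(targets)
    encoded = []
    for level, y in zip(categories, targets):
        if counts[level] > 1:
            # Exclude the current row's own target from the level mean.
            encoded.append((sums[level] - y) / (counts[level] - 1))
        else:
            # Assumption: a level with a single row falls back to the global mean.
            encoded.append(global_mean)
    return encoded


print(leave_one_out_encode(["a", "a", "b", "a"], [1.0, 3.0, 10.0, 5.0]))
```

Because each row's own target is removed before averaging, two rows of the same level generally receive different encoded values during training.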
References
- 1
Strategies to encode categorical variables with many categories, from
https://www.kaggle.com/c/caterpillar-tube-pricing/discussion/15748#143154
-
fit
(X, y, **kwargs)[source]¶ Fit encoder according to X and y.
- Parameters
X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.
y (array-like, shape = [n_samples]) – Target values.
- Returns
self – Returns self.
- Return type
encoder
-
get_feature_names
()[source]¶ Returns the names of all transformed / added columns.
- Returns
feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!
- Return type
list
-
transform
(X, y=None, override_return_df=False)[source]¶ Perform the transformation to new categorical data.
- Parameters
X (array-like, shape = [n_samples, n_features]) –
y (array-like, shape = [n_samples] when transform by leave one out) – None, when transform without target information (such as transform test set)
- Returns
p – Transformed values with encoding applied.
- Return type
array, shape = [n_samples, n_numeric + N]
-
class
ballet.eng.external.category_encoders.
MEstimateEncoder
(verbose=0, cols=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value', random_state=None, randomized=False, sigma=0.05, m=1.0)[source]¶ Bases:
sklearn.base.BaseEstimator
,category_encoders.utils.TransformerWithTargetMixin
M-probability estimate of likelihood.
Supported targets: binomial and continuous. For polynomial target support, see PolynomialWrapper.
This is a simplified version of target encoder, which goes under names like m-probability estimate or additive smoothing with known incidence rates. In comparison to target encoder, m-probability estimate has only one tunable parameter (m), while target encoder has two tunable parameters (min_samples_leaf and smoothing).
- Parameters
verbose (int) – integer indicating verbosity of the output. 0 for none.
cols (list) – a list of columns to encode, if None, all string columns will be encoded.
drop_invariant (bool) – boolean for whether or not to drop encoded columns with 0 variance.
return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).
handle_missing (str) – options are ‘return_nan’, ‘error’ and ‘value’, defaults to ‘value’, which returns the prior probability.
handle_unknown (str) – options are ‘return_nan’, ‘error’ and ‘value’, defaults to ‘value’, which returns the prior probability.
randomized (bool) – adds normal (Gaussian) distribution noise to the training data in order to decrease overfitting (testing data are untouched).
sigma (float) – standard deviation (spread or “width”) of the normal distribution.
m (float) – this is the “m” in the m-probability estimate. A higher value of m results in stronger shrinking. m must be non-negative.
Example
>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target > 22.5
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = MEstimateEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
CRIM 506 non-null float64
ZN 506 non-null float64
INDUS 506 non-null float64
CHAS 506 non-null float64
NOX 506 non-null float64
RM 506 non-null float64
AGE 506 non-null float64
DIS 506 non-null float64
RAD 506 non-null float64
TAX 506 non-null float64
PTRATIO 506 non-null float64
B 506 non-null float64
LSTAT 506 non-null float64
dtypes: float64(13)
memory usage: 51.5 KB
None
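To make the role of m concrete, here is a small pure-Python sketch of additive smoothing with a known incidence rate (the prior). It is an illustration under the assumption that the estimate is (sum of targets in the level + prior * m) / (count of the level + m); the function name `m_estimate_encode` is hypothetical and not part of the library.

```python
def m_estimate_encode(categories, targets, m=1.0):
    """Return a dict mapping each level to its m-probability estimate."""
    prior = sum(targets) / len(targets)  # overall incidence rate
    stats = {}
    for level, y in zip(categories, targets):
        total, count = stats.get(level, (0.0, 0))
        stats[level] = (total + y, count + 1)
    # Larger m pulls every level's estimate closer to the prior.
    return {
        level: (total + prior * m) / (count + m)
        for level, (total, count) in stats.items()
    }


cats = ["a", "a", "b", "b", "b", "c"]
ys = [1, 0, 1, 1, 1, 0]
print(m_estimate_encode(cats, ys, m=1.0))
print(m_estimate_encode(cats, ys, m=0.0))  # m = 0 reduces to the plain level means
```

With m = 0 there is no shrinking, and as m grows all levels converge to the prior.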
References
- 1
A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems, equation 7, from
https://dl.acm.org/citation.cfm?id=507538
- 2
On estimating probabilities in tree pruning, equation 1, from
https://link.springer.com/chapter/10.1007/BFb0017010
- 3
Additive smoothing, from
https://en.wikipedia.org/wiki/Additive_smoothing#Generalized_to_the_case_of_known_incidence_rates
-
fit
(X, y, **kwargs)[source]¶ Fit encoder according to X and binary or continuous y.
- Parameters
X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.
y (array-like, shape = [n_samples]) – Binary or continuous target values.
- Returns
self – Returns self.
- Return type
encoder
-
get_feature_names
()[source]¶ Returns the names of all transformed / added columns.
- Returns
feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!
- Return type
list
-
transform
(X, y=None, override_return_df=False)[source]¶ Perform the transformation to new categorical data.
When the data are used for model training, it is important to also pass the target in order to apply leave one out.
- Parameters
X (array-like, shape = [n_samples, n_features]) –
y (array-like, shape = [n_samples] when transform by leave one out) – None, when transform without target information (such as transform test set)
- Returns
p – Transformed values with encoding applied.
- Return type
array, shape = [n_samples, n_numeric + N]
-
class
ballet.eng.external.category_encoders.
OneHotEncoder
(verbose=0, cols=None, drop_invariant=False, return_df=True, handle_missing='value', handle_unknown='value', use_cat_names=False)[source]¶ Bases:
sklearn.base.BaseEstimator
,sklearn.base.TransformerMixin
One-hot (or dummy) coding for categorical features. Produces one binary feature per category.
- Parameters
verbose (int) – integer indicating verbosity of the output. 0 for none.
cols (list) – a list of columns to encode, if None, all string columns will be encoded.
drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.
return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).
use_cat_names (bool) – if True, category values will be included in the encoded column names. Since this can result in duplicate column names, duplicates are suffixed with ‘#’ symbol until a unique name is generated. If False, category indices will be used instead of the category values.
handle_unknown (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if indicator is used, an extra column will be added if the transform matrix has unknown categories. This can cause unexpected changes in dimension in some cases.
handle_missing (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if indicator is used, an extra column will be added if the transform matrix has nan values. This can cause unexpected changes in dimension in some cases.
Example
>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = OneHotEncoder(cols=['CHAS', 'RAD'], handle_unknown='indicator').fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 24 columns):
CRIM 506 non-null float64
ZN 506 non-null float64
INDUS 506 non-null float64
CHAS_1 506 non-null int64
CHAS_2 506 non-null int64
CHAS_-1 506 non-null int64
NOX 506 non-null float64
RM 506 non-null float64
AGE 506 non-null float64
DIS 506 non-null float64
RAD_1 506 non-null int64
RAD_2 506 non-null int64
RAD_3 506 non-null int64
RAD_4 506 non-null int64
RAD_5 506 non-null int64
RAD_6 506 non-null int64
RAD_7 506 non-null int64
RAD_8 506 non-null int64
RAD_9 506 non-null int64
RAD_-1 506 non-null int64
TAX 506 non-null float64
PTRATIO 506 non-null float64
B 506 non-null float64
LSTAT 506 non-null float64
dtypes: float64(11), int64(13)
memory usage: 95.0 KB
None
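The effect of the ‘indicator’ option on the output dimension can be seen in a small pure-Python sketch. The helper names `fit_one_hot` and `transform_one_hot` are hypothetical, and the behaviour for unseen categories without an indicator (an all-zero row) is an assumption made for illustration rather than a statement about the library's internals.

```python
def fit_one_hot(values):
    """Return the distinct levels in order of first appearance."""
    levels = []
    for value in values:
        if value not in levels:
            levels.append(value)
    return levels


def transform_one_hot(values, levels, indicator=False):
    """One binary column per known level, plus an optional 'unknown' column."""
    rows = []
    for value in values:
        row = [1 if value == level else 0 for level in levels]
        if indicator:
            # Extra trailing column flags categories not seen during fit.
            row.append(0 if value in levels else 1)
        rows.append(row)
    return rows


levels = fit_one_hot(["red", "green", "red"])
print(transform_one_hot(["green", "blue"], levels))
print(transform_one_hot(["green", "blue"], levels, indicator=True))
```

The second call returns rows that are one element wider, which is the change in dimension that the warning in the parameter description refers to.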
References
- 1
Contrast Coding Systems for Categorical Variables, from
https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/
- 2
Gregory Carey (2003). Coding Categorical Variables, from
-
property
category_mapping
¶
-
fit
(X, y=None, **kwargs)[source]¶ Fit encoder according to X and y.
- Parameters
X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.
y (array-like, shape = [n_samples]) – Target values.
- Returns
self – Returns self.
- Return type
encoder
-
get_dummies
(X_in)[source]¶ Convert numerical variables into dummy variables.
- Parameters
X_in (DataFrame) –
- Returns
dummies
- Return type
DataFrame
-
get_feature_names
()[source]¶ Returns the names of all transformed / added columns.
- Returns
feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!
- Return type
list
-
inverse_transform
(X_in)[source]¶ Perform the inverse transformation to encoded data.
- Parameters
X_in (array-like, shape = [n_samples, n_features]) –
- Returns
p
- Return type
array, the same size of X_in
-
class
ballet.eng.external.category_encoders.
OrdinalEncoder
(verbose=0, mapping=None, cols=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value')[source]¶ Bases:
sklearn.base.BaseEstimator
,sklearn.base.TransformerMixin
Encodes categorical features as ordinal, in one ordered feature.
Ordinal encoding uses a single column of integers to represent the classes. An optional mapping dict can be passed in; in this case, we use the knowledge that there is some true order to the classes themselves. Otherwise, the classes are assumed to have no true order and integers are selected at random.
- Parameters
verbose (int) – integer indicating verbosity of the output. 0 for none.
cols (list) – a list of columns to encode, if None, all string columns will be encoded.
drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.
return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).
mapping (list of dicts) –
a mapping of class to label to use for the encoding, optional. The dict contains the keys ‘col’ and ‘mapping’. The value of ‘col’ should be the feature name. The value of ‘mapping’ should be a dictionary of ‘original_label’ to ‘encoded_label’. Example mapping:
[
{'col': 'col1', 'mapping': {None: 0, 'a': 1, 'b': 2}},
{'col': 'col2', 'mapping': {None: 0, 'x': 1, 'y': 2}}
]
handle_unknown (str) – options are ‘error’, ‘return_nan’ and ‘value’, defaults to ‘value’, which will impute the category -1.
handle_missing (str) – options are ‘error’, ‘return_nan’ and ‘value’, defaults to ‘value’, which treats nan as a category at fit time, or as -2 at transform time if nan was not a category during fit.
Example
>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = OrdinalEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
CRIM 506 non-null float64
ZN 506 non-null float64
INDUS 506 non-null float64
CHAS 506 non-null int64
NOX 506 non-null float64
RM 506 non-null float64
AGE 506 non-null float64
DIS 506 non-null float64
RAD 506 non-null int64
TAX 506 non-null float64
PTRATIO 506 non-null float64
B 506 non-null float64
LSTAT 506 non-null float64
dtypes: float64(11), int64(2)
memory usage: 51.5 KB
None
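The default handling of unknown and missing values (-1 and -2) described in the parameters can be sketched in plain Python. This is an illustration only: the helper names are hypothetical, `None` stands in for nan, labels are assumed to start at 1 in order of first appearance, and the training data is assumed to contain no missing values.

```python
def fit_ordinal(values):
    """Assign integer labels 1, 2, 3, ... in order of first appearance."""
    mapping = {}
    for value in values:
        if value is not None and value not in mapping:
            mapping[value] = len(mapping) + 1
    return mapping


def transform_ordinal(values, mapping):
    """Apply the mapping; unknown levels become -1 and missing values -2."""
    encoded = []
    for value in values:
        if value is None:
            encoded.append(-2)  # missing value that was not a category during fit
        else:
            encoded.append(mapping.get(value, -1))  # -1 for unseen categories
    return encoded


mapping = fit_ordinal(["low", "high", "low", "mid"])
print(mapping)
print(transform_ordinal(["mid", "unknown", None, "low"], mapping))
```

Passing an explicit mapping instead of calling `fit_ordinal` corresponds to the case where the true order of the classes is known.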
References
- 1
Contrast Coding Systems for Categorical Variables, from
https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/
- 2
Gregory Carey (2003). Coding Categorical Variables, from
-
property
category_mapping
¶
-
fit
(X, y=None, **kwargs)[source]¶ Fit encoder according to X and y.
- Parameters
X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.
y (array-like, shape = [n_samples]) – Target values.
- Returns
self – Returns self.
- Return type
encoder
-
get_feature_names
()[source]¶ Returns the names of all transformed / added columns.
- Returns
feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!
- Return type
list
-
inverse_transform
(X_in)[source]¶ Perform the inverse transformation to encoded data. Will attempt best case reconstruction, which means it will return nan for handle_missing and handle_unknown settings that break the bijection. We issue warnings when some of those cases occur.
- Parameters
X_in (array-like, shape = [n_samples, n_features]) –
- Returns
p
- Return type
array, the same size of X_in
-
static
ordinal_encoding
(X_in, mapping=None, cols=None, handle_unknown='value', handle_missing='value')[source]¶ Ordinal encoding uses a single column of integers to represent the classes. An optional mapping dict can be passed in; in this case, we use the knowledge that there is some true order to the classes themselves. Otherwise, the classes are assumed to have no true order and integers are selected at random.
-
transform
(X, override_return_df=False)[source]¶ Perform the transformation to new categorical data.
Will use the mapping (if available) and the column list (if available, otherwise every column) to encode the data ordinally.
- Parameters
X (array-like, shape = [n_samples, n_features]) –
- Returns
p – Transformed values with encoding applied.
- Return type
array, shape = [n_samples, n_numeric + N]
-
class
ballet.eng.external.category_encoders.
PolynomialEncoder
(verbose=0, cols=None, mapping=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value')[source]¶ Bases:
sklearn.base.BaseEstimator
,sklearn.base.TransformerMixin
Polynomial contrast coding for the encoding of categorical features.
- Parameters
verbose (int) – integer indicating verbosity of the output. 0 for none.
cols (list) – a list of columns to encode, if None, all string columns will be encoded.
drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.
return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).
handle_unknown (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if indicator is used, an extra column will be added if the transform matrix has unknown categories. This can cause unexpected changes in dimension in some cases.
handle_missing (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if indicator is used, an extra column will be added if the transform matrix has nan values. This can cause unexpected changes in dimension in some cases.
Example
>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = PolynomialEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 21 columns):
intercept 506 non-null int64
CRIM 506 non-null float64
ZN 506 non-null float64
INDUS 506 non-null float64
CHAS_0 506 non-null float64
NOX 506 non-null float64
RM 506 non-null float64
AGE 506 non-null float64
DIS 506 non-null float64
RAD_0 506 non-null float64
RAD_1 506 non-null float64
RAD_2 506 non-null float64
RAD_3 506 non-null float64
RAD_4 506 non-null float64
RAD_5 506 non-null float64
RAD_6 506 non-null float64
RAD_7 506 non-null float64
TAX 506 non-null float64
PTRATIO 506 non-null float64
B 506 non-null float64
LSTAT 506 non-null float64
dtypes: float64(20), int64(1)
memory usage: 83.1 KB
None
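For readers unfamiliar with polynomial contrasts, the sketch below builds an orthonormal polynomial contrast matrix for k equally spaced levels using Gram-Schmidt orthogonalization. It is an illustration of the general technique, not the encoder's own code; the function name `polynomial_contrasts` is hypothetical, and the sign and scaling conventions are assumptions.

```python
import math


def polynomial_contrasts(k):
    """Return a k x (k-1) matrix: rows are levels, columns are linear, quadratic, ... trends."""
    xs = [float(i) for i in range(1, k + 1)]  # equally spaced level scores
    basis = [[1.0 / math.sqrt(k)] * k]  # normalized constant (intercept) column
    contrasts = []
    for degree in range(1, k):
        column = [x ** degree for x in xs]
        # Remove the components along every lower-degree column.
        for b in basis:
            dot = sum(c * bi for c, bi in zip(column, b))
            column = [c - dot * bi for c, bi in zip(column, b)]
        norm = math.sqrt(sum(c * c for c in column))
        column = [c / norm for c in column]
        basis.append(column)
        contrasts.append(column)
    return [[contrasts[d][i] for d in range(k - 1)] for i in range(k)]


for row in polynomial_contrasts(3):
    print([round(value, 4) for value in row])
```

A feature with k levels yields k - 1 contrast columns, which matches the column counts in the example above (one column for CHAS, eight for RAD).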
References
- 1
Contrast Coding Systems for Categorical Variables, from
https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/
- 2
Gregory Carey (2003). Coding Categorical Variables, from
-
fit
(X, y=None, **kwargs)[source]¶ Fit encoder according to X and y.
- Parameters
X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.
y (array-like, shape = [n_samples]) – Target values.
- Returns
self – Returns self.
- Return type
encoder
-
class
ballet.eng.external.category_encoders.
SumEncoder
(verbose=0, cols=None, mapping=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value')[source]¶ Bases:
sklearn.base.BaseEstimator
,sklearn.base.TransformerMixin
Sum contrast coding for the encoding of categorical features.
- Parameters
verbose (int) – integer indicating verbosity of the output. 0 for none.
cols (list) – a list of columns to encode, if None, all string columns will be encoded.
drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.
return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).
handle_unknown (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if indicator is used, an extra column will be added if the transform matrix has unknown categories. This can cause unexpected changes in dimension in some cases.
handle_missing (str) – options are ‘error’, ‘return_nan’, ‘value’, and ‘indicator’. The default is ‘value’. Warning: if indicator is used, an extra column will be added if the transform matrix has nan values. This can cause unexpected changes in dimension in some cases.
Example
>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = SumEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 21 columns):
intercept 506 non-null int64
CRIM 506 non-null float64
ZN 506 non-null float64
INDUS 506 non-null float64
CHAS_0 506 non-null float64
NOX 506 non-null float64
RM 506 non-null float64
AGE 506 non-null float64
DIS 506 non-null float64
RAD_0 506 non-null float64
RAD_1 506 non-null float64
RAD_2 506 non-null float64
RAD_3 506 non-null float64
RAD_4 506 non-null float64
RAD_5 506 non-null float64
RAD_6 506 non-null float64
RAD_7 506 non-null float64
TAX 506 non-null float64
PTRATIO 506 non-null float64
B 506 non-null float64
LSTAT 506 non-null float64
dtypes: float64(20), int64(1)
memory usage: 83.1 KB
None
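As a rough illustration of sum (deviation) contrast coding, the following pure-Python sketch builds the contrast matrix for k levels. The function name `sum_contrasts` is hypothetical, and choosing the last level as the reference row is an assumption; the sketch shows the standard technique rather than the encoder's internals.

```python
def sum_contrasts(k):
    """Return a k x (k-1) sum contrast matrix; the last level is the reference."""
    rows = []
    for i in range(k):
        if i < k - 1:
            # Non-reference level i gets a 1 in its own column.
            rows.append([1 if j == i else 0 for j in range(k - 1)])
        else:
            # Reference level is coded -1 everywhere, so each column sums to zero.
            rows.append([-1] * (k - 1))
    return rows


for row in sum_contrasts(3):
    print(row)
```

Each column sums to zero across levels, so a coefficient fitted on such a column compares one level against the grand mean of all levels.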
References
- 1
Contrast Coding Systems for Categorical Variables, from
https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/
- 2
Gregory Carey (2003). Coding Categorical Variables, from
-
fit
(X, y=None, **kwargs)[source]¶ Fit encoder according to X and y.
- Parameters
X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.
y (array-like, shape = [n_samples]) – Target values.
- Returns
self – Returns self.
- Return type
encoder
-
class
ballet.eng.external.category_encoders.
TargetEncoder
(verbose=0, cols=None, drop_invariant=False, return_df=True, handle_missing='value', handle_unknown='value', min_samples_leaf=1, smoothing=1.0)[source]¶ Bases:
sklearn.base.BaseEstimator
,category_encoders.utils.TransformerWithTargetMixin
Target encoding for categorical features.
Supported targets: binomial and continuous. For polynomial target support, see PolynomialWrapper.
For the case of categorical target: features are replaced with a blend of posterior probability of the target given particular categorical value and the prior probability of the target over all the training data.
For the case of continuous target: features are replaced with a blend of the expected value of the target given particular categorical value and the expected value of the target over all the training data.
- Parameters
verbose (int) – integer indicating verbosity of the output. 0 for none.
cols (list) – a list of columns to encode, if None, all string columns will be encoded.
drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.
return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).
handle_missing (str) – options are ‘error’, ‘return_nan’ and ‘value’, defaults to ‘value’, which returns the target mean.
handle_unknown (str) – options are ‘error’, ‘return_nan’ and ‘value’, defaults to ‘value’, which returns the target mean.
min_samples_leaf (int) – minimum samples to take category average into account.
smoothing (float) – smoothing effect to balance categorical average vs prior. Higher value means stronger regularization. The value must be strictly greater than 0.
Example
>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = TargetEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
CRIM 506 non-null float64
ZN 506 non-null float64
INDUS 506 non-null float64
CHAS 506 non-null float64
NOX 506 non-null float64
RM 506 non-null float64
AGE 506 non-null float64
DIS 506 non-null float64
RAD 506 non-null float64
TAX 506 non-null float64
PTRATIO 506 non-null float64
B 506 non-null float64
LSTAT 506 non-null float64
dtypes: float64(13)
memory usage: 51.5 KB
None
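The blend between the per-category mean and the prior can be sketched as follows. This pure-Python example assumes a sigmoid weight of the form 1 / (1 + exp(-(n - min_samples_leaf) / smoothing)), in the spirit of the reference below; the exact weighting used by the encoder may differ, and the function name `target_encode` is hypothetical.

```python
import math


def target_encode(categories, targets, min_samples_leaf=1, smoothing=1.0):
    """Return a dict mapping each level to a blend of its mean and the global prior."""
    prior = sum(targets) / len(targets)
    stats = {}
    for level, y in zip(categories, targets):
        total, count = stats.get(level, (0.0, 0))
        stats[level] = (total + y, count + 1)

    encoding = {}
    for level, (total, count) in stats.items():
        # Assumed weighting: more samples in a level -> more trust in its own mean.
        weight = 1.0 / (1.0 + math.exp(-(count - min_samples_leaf) / smoothing))
        encoding[level] = prior * (1.0 - weight) + (total / count) * weight
    return encoding


print(target_encode(["a", "b", "b", "b"], [10.0, 2.0, 4.0, 6.0]))
```

Under this assumption, a level with exactly min_samples_leaf rows lands halfway between its own mean and the prior, and a larger smoothing value flattens the transition so that more levels stay close to the prior.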
References
- 1
A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems, from
https://dl.acm.org/citation.cfm?id=507538
-
fit
(X, y, **kwargs)[source]¶ Fit encoder according to X and y.
- Parameters
X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.
y (array-like, shape = [n_samples]) – Target values.
- Returns
self – Returns self.
- Return type
encoder
-
get_feature_names
()[source]¶ Returns the names of all transformed / added columns.
- Returns
feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!
- Return type
list
-
transform
(X, y=None, override_return_df=False)[source]¶ Perform the transformation to new categorical data.
- Parameters
X (array-like, shape = [n_samples, n_features]) –
y (array-like, shape = [n_samples] when transform by leave one out) – None, when transform without target info (such as transform test set)
- Returns
p – Transformed values with encoding applied.
- Return type
array, shape = [n_samples, n_numeric + N]
-
class
ballet.eng.external.category_encoders.
WOEEncoder
(verbose=0, cols=None, drop_invariant=False, return_df=True, handle_unknown='value', handle_missing='value', random_state=None, randomized=False, sigma=0.05, regularization=1.0)[source]¶ Bases:
sklearn.base.BaseEstimator
,category_encoders.utils.TransformerWithTargetMixin
Weight of Evidence coding for categorical features.
Supported targets: binomial. For polynomial target support, see PolynomialWrapper.
- Parameters
verbose (int) – integer indicating verbosity of the output. 0 for none.
cols (list) – a list of columns to encode, if None, all string columns will be encoded.
drop_invariant (bool) – boolean for whether or not to drop columns with 0 variance.
return_df (bool) – boolean for whether to return a pandas DataFrame from transform (otherwise it will be a numpy array).
handle_missing (str) – options are ‘return_nan’, ‘error’ and ‘value’, defaults to ‘value’, which will assume WOE=0.
handle_unknown (str) – options are ‘return_nan’, ‘error’ and ‘value’, defaults to ‘value’, which will assume WOE=0.
randomized (bool) – adds normal (Gaussian) distribution noise to the training data in order to decrease overfitting (testing data are untouched).
sigma (float) – standard deviation (spread or “width”) of the normal distribution.
regularization (float) – the purpose of regularization is mostly to prevent division by zero. When regularization is 0, you may encounter division by zero.
Example
>>> from category_encoders import *
>>> import pandas as pd
>>> from sklearn.datasets import load_boston
>>> bunch = load_boston()
>>> y = bunch.target > 22.5
>>> X = pd.DataFrame(bunch.data, columns=bunch.feature_names)
>>> enc = WOEEncoder(cols=['CHAS', 'RAD']).fit(X, y)
>>> numeric_dataset = enc.transform(X)
>>> print(numeric_dataset.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 13 columns):
CRIM 506 non-null float64
ZN 506 non-null float64
INDUS 506 non-null float64
CHAS 506 non-null float64
NOX 506 non-null float64
RM 506 non-null float64
AGE 506 non-null float64
DIS 506 non-null float64
RAD 506 non-null float64
TAX 506 non-null float64
PTRATIO 506 non-null float64
B 506 non-null float64
LSTAT 506 non-null float64
dtypes: float64(13)
memory usage: 51.5 KB
None
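The following pure-Python sketch shows how a regularized weight of evidence can be computed and why the regularization term avoids division by zero. The exact placement of the regularization term (added to each count, with twice the term added to each total) is an assumption for illustration, and the function name `woe_encode` is hypothetical.

```python
import math


def woe_encode(categories, targets, regularization=1.0):
    """Return a dict mapping each level to ln(share of positives / share of negatives)."""
    total_pos = sum(1 for y in targets if y)
    total_neg = len(targets) - total_pos
    stats = {}
    for level, y in zip(categories, targets):
        pos, neg = stats.get(level, (0, 0))
        stats[level] = (pos + 1, neg) if y else (pos, neg + 1)

    encoding = {}
    for level, (pos, neg) in stats.items():
        # Regularization keeps both shares strictly positive even when a level
        # contains only positives or only negatives.
        pos_share = (pos + regularization) / (total_pos + 2 * regularization)
        neg_share = (neg + regularization) / (total_neg + 2 * regularization)
        encoding[level] = math.log(pos_share / neg_share)
    return encoding


print(woe_encode(["a", "a", "b", "b"], [1, 1, 0, 0]))
```

A level whose positives and negatives occur in the same proportion as in the whole dataset gets a value of 0, which is consistent with unknown and missing values defaulting to WOE=0.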
References
- 1
Weight of Evidence (WOE) and Information Value Explained, from
https://www.listendata.com/2015/03/weight-of-evidence-woe-and-information.html
-
fit
(X, y, **kwargs)[source]¶ Fit encoder according to X and binary y.
- Parameters
X (array-like, shape = [n_samples, n_features]) – Training vectors, where n_samples is the number of samples and n_features is the number of features.
y (array-like, shape = [n_samples]) – Binary target values.
- Returns
self – Returns self.
- Return type
encoder
-
get_feature_names
()[source]¶ Returns the names of all transformed / added columns.
- Returns
feature_names – A list with all feature names transformed or added. Note: potentially dropped features are not included!
- Return type
list
-
transform
(X, y=None, override_return_df=False)[source]¶ Perform the transformation to new categorical data. When the data are used for model training, it is important to also pass the target in order to apply leave one out.
- Parameters
X (array-like, shape = [n_samples, n_features]) –
y (array-like, shape = [n_samples] when transform by leave one out) – None, when transform without target information (such as transform test set)
- Returns
p – Transformed values with encoding applied.
- Return type
array, shape = [n_samples, n_numeric + N]