ballet.eng.external.sklearn module

class ballet.eng.external.sklearn.Binarizer(*, threshold=0.0, copy=True)[source]

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Binarize data (set feature values to 0 or 1) according to a threshold.

Values greater than the threshold map to 1, while values less than or equal to the threshold map to 0. With the default threshold of 0, only positive values map to 1.

Binarization is a common operation on text count data where the analyst can decide to consider only the presence or absence of a feature rather than a quantified number of occurrences.

It can also be used as a pre-processing step for estimators that consider boolean random variables (e.g. modelled using the Bernoulli distribution in a Bayesian setting).

Read more in the User Guide.

Parameters
  • threshold (float, default=0.0) – Feature values below or equal to this are replaced by 0, above it by 1. Threshold may not be less than 0 for operations on sparse matrices.

  • copy (bool, default=True) – Set to False to perform inplace binarization and avoid a copy (if the input is already a numpy array or a scipy.sparse CSR matrix).

n_features_in_

Number of features seen during fit.

New in version 0.24.

Type

int

feature_names_in_

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type

ndarray of shape (n_features_in_,)

See also

binarize

Equivalent function without the estimator API.

KBinsDiscretizer

Bin continuous data into intervals.

OneHotEncoder

Encode categorical features as a one-hot numeric array.

Notes

If the input is a sparse matrix, only the non-zero values are subject to update by the Binarizer class.

This estimator is stateless (besides constructor parameters); the fit method does nothing but is useful when used in a pipeline.

Examples

>>> from sklearn.preprocessing import Binarizer
>>> X = [[ 1., -1.,  2.],
...      [ 2.,  0.,  0.],
...      [ 0.,  1., -1.]]
>>> transformer = Binarizer().fit(X)  # fit does nothing.
>>> transformer
Binarizer()
>>> transformer.transform(X)
array([[1., 0., 1.],
       [1., 0., 0.],
       [0., 1., 0.]])
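
An additional sketch, reusing the X defined above: with a non-default threshold, only values strictly greater than 1.0 map to 1 (the expected output is shown in the comment).

Binarizer(threshold=1.0).transform(X)
# expected:
# array([[0., 0., 1.],
#        [1., 0., 0.],
#        [0., 0., 0.]])
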
fit(X, y=None)[source]

Do nothing and return the estimator unchanged.

This method is just there to implement the usual API and hence work in pipelines.

Parameters
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data.

  • y (None) – Ignored.

Returns

self – Fitted transformer.

Return type

object

transform(X, copy=None)[source]

Binarize each element of X.

Parameters
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data to binarize, element by element. scipy.sparse matrices should be in CSR format to avoid an unnecessary copy.

  • copy (bool) – Copy the input X or not.

Returns

X_tr – Transformed array.

Return type

{ndarray, sparse matrix} of shape (n_samples, n_features)

class ballet.eng.external.sklearn.FunctionTransformer(func=None, inverse_func=None, *, validate=False, accept_sparse=False, check_inverse=True, kw_args=None, inv_kw_args=None)[source]

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Constructs a transformer from an arbitrary callable.

A FunctionTransformer forwards its X (and optionally y) arguments to a user-defined function or function object and returns the result of this function. This is useful for stateless transformations such as taking the log of frequencies, doing custom scaling, etc.

Note: If a lambda is used as the function, then the resulting transformer will not be pickleable.

New in version 0.17.

Read more in the User Guide.

Parameters
  • func (callable, default=None) – The callable to use for the transformation. This will be passed the same arguments as transform, with args and kwargs forwarded. If func is None, then func will be the identity function.

  • inverse_func (callable, default=None) – The callable to use for the inverse transformation. This will be passed the same arguments as inverse transform, with args and kwargs forwarded. If inverse_func is None, then inverse_func will be the identity function.

  • validate (bool, default=False) –

    Indicate that the input X array should be checked before calling func. The possibilities are:

    • If False, there is no input validation.

    • If True, then X will be converted to a 2-dimensional NumPy array or sparse matrix. If the conversion is not possible an exception is raised.

    Changed in version 0.22: The default of validate changed from True to False.

  • accept_sparse (bool, default=False) – Indicate that func accepts a sparse matrix as input. If validate is False, this has no effect. Otherwise, if accept_sparse is false, sparse matrix inputs will cause an exception to be raised.

  • check_inverse (bool, default=True) –

    Whether to check that func followed by inverse_func leads to the original inputs. It can be used for a sanity check, raising a warning when the condition is not fulfilled.

    New in version 0.20.

  • kw_args (dict, default=None) –

    Dictionary of additional keyword arguments to pass to func.

    New in version 0.18.

  • inv_kw_args (dict, default=None) –

    Dictionary of additional keyword arguments to pass to inverse_func.

    New in version 0.18.

n_features_in_

Number of features seen during fit. Defined only when validate=True.

New in version 0.24.

Type

int

feature_names_in_

Names of features seen during fit. Defined only when validate=True and X has feature names that are all strings.

New in version 1.0.

Type

ndarray of shape (n_features_in_,)

See also

MaxAbsScaler

Scale each feature by its maximum absolute value.

StandardScaler

Standardize features by removing the mean and scaling to unit variance.

LabelBinarizer

Binarize labels in a one-vs-all fashion.

MultiLabelBinarizer

Transform between iterable of iterables and a multilabel format.

Examples

>>> import numpy as np
>>> from sklearn.preprocessing import FunctionTransformer
>>> transformer = FunctionTransformer(np.log1p)
>>> X = np.array([[0, 1], [2, 3]])
>>> transformer.transform(X)
array([[0.       , 0.6931...],
       [1.0986..., 1.3862...]])
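
An additional sketch, reusing np and X from above: pairing func with its inverse so that check_inverse (enabled by default) can verify the round trip on a subset of the data during fit.

transformer = FunctionTransformer(func=np.log1p, inverse_func=np.expm1,
                                  check_inverse=True)
X_new = transformer.fit_transform(X)
transformer.inverse_transform(X_new)  # recovers X up to floating-point error
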
fit(X, y=None)[source]

Fit transformer by checking X.

If validate is True, X will be checked.

Parameters
  • X (array-like, shape (n_samples, n_features)) – Input array.

  • y (Ignored) – Not used, present here for API consistency by convention.

Returns

self – FunctionTransformer class instance.

Return type

object

inverse_transform(X)[source]

Transform X using the inverse function.

Parameters

X (array-like, shape (n_samples, n_features)) – Input array.

Returns

X_out – Transformed input.

Return type

array-like, shape (n_samples, n_features)

transform(X)[source]

Transform X using the forward function.

Parameters

X (array-like, shape (n_samples, n_features)) – Input array.

Returns

X_out – Transformed input.

Return type

array-like, shape (n_samples, n_features)

class ballet.eng.external.sklearn.GaussianRandomProjection(n_components='auto', *, eps=0.1, random_state=None)[source]

Bases: sklearn.random_projection.BaseRandomProjection

Reduce dimensionality through Gaussian random projection.

The components of the random matrix are drawn from N(0, 1 / n_components).

Read more in the User Guide.

New in version 0.13.

Parameters
  • n_components (int or 'auto', default='auto') –

    Dimensionality of the target projection space.

    n_components can be automatically adjusted according to the number of samples in the dataset and the bound given by the Johnson-Lindenstrauss lemma. In that case the quality of the embedding is controlled by the eps parameter.

    It should be noted that the Johnson-Lindenstrauss lemma can yield very conservative estimates of the required number of components, as it makes no assumption on the structure of the dataset.

  • eps (float, default=0.1) –

    Parameter to control the quality of the embedding according to the Johnson-Lindenstrauss lemma when n_components is set to ‘auto’. The value should be strictly positive.

    Smaller values lead to better embedding and higher number of dimensions (n_components) in the target projection space.

  • random_state (int, RandomState instance or None, default=None) – Controls the pseudo random number generator used to generate the projection matrix at fit time. Pass an int for reproducible output across multiple function calls. See Glossary.

n_components_

Concrete number of components computed when n_components='auto'.

Type

int

components_

Random matrix used for the projection.

Type

ndarray of shape (n_components, n_features)

n_features_in_

Number of features seen during fit.

New in version 0.24.

Type

int

feature_names_in_

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type

ndarray of shape (n_features_in_,)

See also

SparseRandomProjection

Reduce dimensionality through sparse random projection.

Examples

>>> import numpy as np
>>> from sklearn.random_projection import GaussianRandomProjection
>>> rng = np.random.RandomState(42)
>>> X = rng.rand(100, 10000)
>>> transformer = GaussianRandomProjection(random_state=rng)
>>> X_new = transformer.fit_transform(X)
>>> X_new.shape
(100, 3947)
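
A short illustrative follow-up: with n_components='auto', the target dimensionality is the Johnson-Lindenstrauss bound for the given number of samples and eps, which is where the 3947 above comes from.

from sklearn.random_projection import johnson_lindenstrauss_min_dim
johnson_lindenstrauss_min_dim(n_samples=100, eps=0.1)  # 3947, matching X_new.shape above
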
class ballet.eng.external.sklearn.KBinsDiscretizer(n_bins=5, *, encode='onehot', strategy='quantile', dtype=None)[source]

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Bin continuous data into intervals.

Read more in the User Guide.

New in version 0.20.

Parameters
  • n_bins (int or array-like of shape (n_features,), default=5) – The number of bins to produce. Raises ValueError if n_bins < 2.

  • encode ({'onehot', 'onehot-dense', 'ordinal'}, default='onehot') –

    Method used to encode the transformed result.

    onehot

    Encode the transformed result with one-hot encoding and return a sparse matrix. Ignored features are always stacked to the right.

    onehot-dense

    Encode the transformed result with one-hot encoding and return a dense array. Ignored features are always stacked to the right.

    ordinal

    Return the bin identifier encoded as an integer value.

  • strategy ({'uniform', 'quantile', 'kmeans'}, default='quantile') –

    Strategy used to define the widths of the bins.

    uniform

    All bins in each feature have identical widths.

    quantile

    All bins in each feature have the same number of points.

    kmeans

    Values in each bin have the same nearest center of a 1D k-means cluster.

  • dtype ({np.float32, np.float64}, default=None) –

    The desired data-type for the output. If None, output dtype is consistent with input dtype. Only np.float32 and np.float64 are supported.

    New in version 0.24.

bin_edges_

The edges of each bin. Contains arrays of varying shapes (n_bins_,). Ignored features will have empty arrays.

Type

ndarray of ndarray of shape (n_features,)

n_bins_

Number of bins per feature. Bins whose width is too small (i.e., <= 1e-8) are removed with a warning.

Type

ndarray of shape (n_features,), dtype=np.int_

n_features_in_

Number of features seen during fit.

New in version 0.24.

Type

int

feature_names_in_

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type

ndarray of shape (n_features_in_,)

See also

Binarizer

Class used to bin values as 0 or 1 based on a parameter threshold.

Notes

In bin edges for feature i, the first and last values are used only for inverse_transform. During transform, bin edges are extended to:

np.concatenate([-np.inf, bin_edges_[i][1:-1], np.inf])

You can combine KBinsDiscretizer with ColumnTransformer if you only want to preprocess part of the features.
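
A minimal sketch of that combination, assuming a pandas DataFrame df with illustrative numeric columns "age" and "income" that should be binned while all other columns pass through unchanged:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import KBinsDiscretizer

ct = ColumnTransformer(
    [("binned", KBinsDiscretizer(n_bins=4, encode="ordinal"), ["age", "income"])],
    remainder="passthrough",
)
# ct.fit_transform(df) discretizes only the two listed columns.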

KBinsDiscretizer might produce constant features (e.g., when encode = 'onehot' and certain bins do not contain any data). These features can be removed with feature selection algorithms (e.g., VarianceThreshold).

Examples

>>> from sklearn.preprocessing import KBinsDiscretizer
>>> X = [[-2, 1, -4,   -1],
...      [-1, 2, -3, -0.5],
...      [ 0, 3, -2,  0.5],
...      [ 1, 4, -1,    2]]
>>> est = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy='uniform')
>>> est.fit(X)
KBinsDiscretizer(...)
>>> Xt = est.transform(X)
>>> Xt  
array([[ 0., 0., 0., 0.],
       [ 1., 1., 1., 0.],
       [ 2., 2., 2., 1.],
       [ 2., 2., 2., 2.]])

Sometimes it may be useful to convert the data back into the original feature space. The inverse_transform function converts the binned data into the original feature space. Each value will be equal to the mean of the two bin edges.

>>> est.bin_edges_[0]
array([-2., -1.,  0.,  1.])
>>> est.inverse_transform(Xt)
array([[-1.5,  1.5, -3.5, -0.5],
       [-0.5,  2.5, -2.5, -0.5],
       [ 0.5,  3.5, -1.5,  0.5],
       [ 0.5,  3.5, -1.5,  1.5]])
fit(X, y=None)[source]

Fit the estimator.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Data to be discretized.

  • y (None) – Ignored. This parameter exists only for compatibility with Pipeline.

Returns

self – Returns the instance itself.

Return type

object

get_feature_names_out(input_features=None)[source]

Get output feature names.

Parameters

input_features (array-like of str or None, default=None) –

Input features.

  • If input_features is None, then feature_names_in_ is used as feature names in. If feature_names_in_ is not defined, then names are generated: [x0, x1, …, x(n_features_in_)].

  • If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.

Returns

feature_names_out – Transformed feature names.

Return type

ndarray of str objects

inverse_transform(Xt)[source]

Transform discretized data back to original feature space.

Note that this function does not regenerate the original data due to discretization rounding.

Parameters

Xt (array-like of shape (n_samples, n_features)) – Transformed data in the binned space.

Returns

Xinv – Data in the original feature space.

Return type

ndarray, dtype={np.float32, np.float64}

transform(X)[source]

Discretize the data.

Parameters

X (array-like of shape (n_samples, n_features)) – Data to be discretized.

Returns

Xt – Data in the binned space. Will be a sparse matrix if self.encode=’onehot’ and ndarray otherwise.

Return type

{ndarray, sparse matrix}, dtype={np.float32, np.float64}

class ballet.eng.external.sklearn.KNNImputer(*, missing_values=nan, n_neighbors=5, weights='uniform', metric='nan_euclidean', copy=True, add_indicator=False)[source]

Bases: sklearn.impute._base._BaseImputer

Imputation for completing missing values using k-Nearest Neighbors.

Each sample’s missing values are imputed using the mean value from n_neighbors nearest neighbors found in the training set. Two samples are close if the features that neither is missing are close.

Read more in the User Guide.

New in version 0.22.

Parameters
  • missing_values (int, float, str, np.nan or None, default=np.nan) – The placeholder for the missing values. All occurrences of missing_values will be imputed. For pandas’ dataframes with nullable integer dtypes with missing values, missing_values should be set to np.nan, since pd.NA will be converted to np.nan.

  • n_neighbors (int, default=5) – Number of neighboring samples to use for imputation.

  • weights ({'uniform', 'distance'} or callable, default='uniform') –

    Weight function used in prediction. Possible values:

    • ’uniform’ : uniform weights. All points in each neighborhood are weighted equally.

    • ’distance’ : weight points by the inverse of their distance. In this case, closer neighbors of a query point will have a greater influence than neighbors which are further away.

    • callable : a user-defined function which accepts an array of distances, and returns an array of the same shape containing the weights.

  • metric ({'nan_euclidean'} or callable, default='nan_euclidean') –

    Distance metric for searching neighbors. Possible values:

    • ’nan_euclidean’

    • callable : a user-defined function which conforms to the definition of _pairwise_callable(X, Y, metric, **kwds). The function accepts two arrays, X and Y, and a missing_values keyword in kwds and returns a scalar distance value.

  • copy (bool, default=True) – If True, a copy of X will be created. If False, imputation will be done in-place whenever possible.

  • add_indicator (bool, default=False) – If True, a MissingIndicator transform will stack onto the output of the imputer’s transform. This allows a predictive estimator to account for missingness despite imputation. If a feature has no missing values at fit/train time, the feature won’t appear on the missing indicator even if there are missing values at transform/test time.

indicator_

Indicator used to add binary indicators for missing values. None if add_indicator is False.

Type

MissingIndicator

n_features_in_

Number of features seen during fit.

New in version 0.24.

Type

int

feature_names_in_

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type

ndarray of shape (n_features_in_,)

See also

SimpleImputer

Imputation transformer for completing missing values with simple strategies.

IterativeImputer

Multivariate imputer that estimates each feature from all the others.

References

  • Olga Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor Hastie, Robert Tibshirani, David Botstein and Russ B. Altman, Missing value estimation methods for DNA microarrays, BIOINFORMATICS Vol. 17 no. 6, 2001 Pages 520-525.

Examples

>>> import numpy as np
>>> from sklearn.impute import KNNImputer
>>> X = [[1, 2, np.nan], [3, 4, 3], [np.nan, 6, 5], [8, 8, 7]]
>>> imputer = KNNImputer(n_neighbors=2)
>>> imputer.fit_transform(X)
array([[1. , 2. , 4. ],
       [3. , 4. , 3. ],
       [5.5, 6. , 5. ],
       [8. , 8. , 7. ]])
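
An additional sketch, reusing X from above: with add_indicator=True, indicator columns are appended only for the features that contained missing values during fit (features 0 and 2 here).

imputer = KNNImputer(n_neighbors=2, add_indicator=True)
X_filled = imputer.fit_transform(X)
# X_filled has shape (4, 5): the 3 imputed columns followed by 2 indicator
# columns (0/1) marking where values were missing.
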
fit(X, y=None)[source]

Fit the imputer on X.

Parameters
  • X (array-like shape of (n_samples, n_features)) – Input data, where n_samples is the number of samples and n_features is the number of features.

  • y (Ignored) – Not used, present here for API consistency by convention.

Returns

self – The fitted KNNImputer class instance.

Return type

object

transform(X)[source]

Impute all missing values in X.

Parameters

X (array-like of shape (n_samples, n_features)) – The input data to complete.

Returns

X – The imputed dataset, where n_output_features is the number of features that are not always missing during fit.

Return type

array-like of shape (n_samples, n_output_features)

class ballet.eng.external.sklearn.MaxAbsScaler(*, copy=True)[source]

Bases: sklearn.base._OneToOneFeatureMixin, sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Scale each feature by its maximum absolute value.

This estimator scales and translates each feature individually such that the maximal absolute value of each feature in the training set will be 1.0. It does not shift/center the data, and thus does not destroy any sparsity.

This scaler can also be applied to sparse CSR or CSC matrices.

New in version 0.17.

Parameters

copy (bool, default=True) – Set to False to perform inplace scaling and avoid a copy (if the input is already a numpy array).

scale_

Per feature relative scaling of the data.

New in version 0.17: scale_ attribute.

Type

ndarray of shape (n_features,)

max_abs_

Per feature maximum absolute value.

Type

ndarray of shape (n_features,)

n_features_in_

Number of features seen during fit.

New in version 0.24.

Type

int

feature_names_in_

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type

ndarray of shape (n_features_in_,)

n_samples_seen_

The number of samples processed by the estimator. Will be reset on new calls to fit, but increments across partial_fit calls.

Type

int

See also

maxabs_scale

Equivalent function without the estimator API.

Notes

NaNs are treated as missing values: disregarded in fit, and maintained in transform.

For a comparison of the different scalers, transformers, and normalizers, see examples/preprocessing/plot_all_scaling.py.

Examples

>>> from sklearn.preprocessing import MaxAbsScaler
>>> X = [[ 1., -1.,  2.],
...      [ 2.,  0.,  0.],
...      [ 0.,  1., -1.]]
>>> transformer = MaxAbsScaler().fit(X)
>>> transformer
MaxAbsScaler()
>>> transformer.transform(X)
array([[ 0.5, -1. ,  1. ],
       [ 1. ,  0. ,  0. ],
       [ 0. ,  1. , -0.5]])
fit(X, y=None)[source]

Compute the maximum absolute value to be used for later scaling.

Parameters
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to compute the per-feature minimum and maximum used for later scaling along the features axis.

  • y (None) – Ignored.

Returns

self – Fitted scaler.

Return type

object

inverse_transform(X)[source]

Scale back the data to the original representation.

Parameters

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data that should be transformed back.

Returns

X_tr – Transformed array.

Return type

{ndarray, sparse matrix} of shape (n_samples, n_features)

partial_fit(X, y=None)[source]

Online computation of max absolute value of X for later scaling.

All of X is processed as a single batch. This is intended for cases when fit() is not feasible due to a very large number of samples (n_samples) or because X is read from a continuous stream.

Parameters
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to compute the mean and standard deviation used for later scaling along the features axis.

  • y (None) – Ignored.

Returns

self – Fitted scaler.

Return type

object
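
A minimal sketch of incremental fitting on data that arrives in batches (values are illustrative):

import numpy as np
from sklearn.preprocessing import MaxAbsScaler

scaler = MaxAbsScaler()
for batch in (np.array([[1., -2.], [3., 0.]]), np.array([[0., 4.]])):
    scaler.partial_fit(batch)  # updates the running per-feature max absolute value
scaler.scale_                  # array([3., 4.])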

transform(X)[source]

Scale the data.

Parameters

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data that should be scaled.

Returns

X_tr – Transformed array.

Return type

{ndarray, sparse matrix} of shape (n_samples, n_features)

class ballet.eng.external.sklearn.MinMaxScaler(feature_range=(0, 1), *, copy=True, clip=False)[source]

Bases: sklearn.base._OneToOneFeatureMixin, sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Transform features by scaling each feature to a given range.

This estimator scales and translates each feature individually such that it is in the given range on the training set, e.g. between zero and one.

The transformation is given by:

X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min

where min, max = feature_range.

This transformation is often used as an alternative to zero mean, unit variance scaling.

Read more in the User Guide.

Parameters
  • feature_range (tuple (min, max), default=(0, 1)) – Desired range of transformed data.

  • copy (bool, default=True) – Set to False to perform inplace row normalization and avoid a copy (if the input is already a numpy array).

  • clip (bool, default=False) –

    Set to True to clip transformed values of held-out data to provided feature range.

    New in version 0.24.

min_

Per feature adjustment for minimum. Equivalent to min - X.min(axis=0) * self.scale_

Type

ndarray of shape (n_features,)

scale_

Per feature relative scaling of the data. Equivalent to (max - min) / (X.max(axis=0) - X.min(axis=0))

New in version 0.17: scale_ attribute.

Type

ndarray of shape (n_features,)

data_min_

Per feature minimum seen in the data

New in version 0.17: data_min_

Type

ndarray of shape (n_features,)

data_max_

Per feature maximum seen in the data

New in version 0.17: data_max_

Type

ndarray of shape (n_features,)

data_range_

Per feature range (data_max_ - data_min_) seen in the data

New in version 0.17: data_range_

Type

ndarray of shape (n_features,)

n_features_in_

Number of features seen during fit.

New in version 0.24.

Type

int

n_samples_seen_

The number of samples processed by the estimator. It will be reset on new calls to fit, but increments across partial_fit calls.

Type

int

feature_names_in_

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type

ndarray of shape (n_features_in_,)

See also

minmax_scale

Equivalent function without the estimator API.

Notes

NaNs are treated as missing values: disregarded in fit, and maintained in transform.

For a comparison of the different scalers, transformers, and normalizers, see examples/preprocessing/plot_all_scaling.py.

Examples

>>> from sklearn.preprocessing import MinMaxScaler
>>> data = [[-1, 2], [-0.5, 6], [0, 10], [1, 18]]
>>> scaler = MinMaxScaler()
>>> print(scaler.fit(data))
MinMaxScaler()
>>> print(scaler.data_max_)
[ 1. 18.]
>>> print(scaler.transform(data))
[[0.   0.  ]
 [0.25 0.25]
 [0.5  0.5 ]
 [1.   1.  ]]
>>> print(scaler.transform([[2, 2]]))
[[1.5 0. ]]
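
An additional sketch of clip=True, reusing data from above: held-out values that fall outside the fitted range are clamped to feature_range instead of being extrapolated past it.

clipped = MinMaxScaler(clip=True).fit(data)
clipped.transform([[2, 2]])  # array([[1., 0.]]) rather than [[1.5, 0.]]
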
fit(X, y=None)[source]

Compute the minimum and maximum to be used for later scaling.

Parameters
  • X (array-like of shape (n_samples, n_features)) – The data used to compute the per-feature minimum and maximum used for later scaling along the features axis.

  • y (None) – Ignored.

Returns

self – Fitted scaler.

Return type

object

inverse_transform(X)[source]

Undo the scaling of X according to feature_range.

Parameters

X (array-like of shape (n_samples, n_features)) – Input data that will be transformed. It cannot be sparse.

Returns

Xt – Transformed data.

Return type

ndarray of shape (n_samples, n_features)

partial_fit(X, y=None)[source]

Online computation of min and max on X for later scaling.

All of X is processed as a single batch. This is intended for cases when fit() is not feasible due to a very large number of samples (n_samples) or because X is read from a continuous stream.

Parameters
  • X (array-like of shape (n_samples, n_features)) – The data used to compute the mean and standard deviation used for later scaling along the features axis.

  • y (None) – Ignored.

Returns

self – Fitted scaler.

Return type

object

transform(X)[source]

Scale features of X according to feature_range.

Parameters

X (array-like of shape (n_samples, n_features)) – Input data that will be transformed.

Returns

Xt – Transformed data.

Return type

ndarray of shape (n_samples, n_features)

class ballet.eng.external.sklearn.MissingIndicator(*, missing_values=nan, features='missing-only', sparse='auto', error_on_new=True)[source]

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Binary indicators for missing values.

Note that this component typically should not be used in a vanilla Pipeline consisting of transformers and a classifier, but rather could be added using a FeatureUnion or ColumnTransformer.

Read more in the User Guide.

New in version 0.20.

Parameters
  • missing_values (int, float, str, np.nan or None, default=np.nan) – The placeholder for the missing values. All occurrences of missing_values will be imputed. For pandas’ dataframes with nullable integer dtypes with missing values, missing_values should be set to np.nan, since pd.NA will be converted to np.nan.

  • features ({'missing-only', 'all'}, default='missing-only') –

    Whether the imputer mask should represent all or a subset of features.

    • If ‘missing-only’ (default), the imputer mask will only represent features containing missing values during fit time.

    • If ‘all’, the imputer mask will represent all features.

  • sparse (bool or 'auto', default='auto') –

    Whether the imputer mask format should be sparse or dense.

    • If ‘auto’ (default), the imputer mask will be of same type as input.

    • If True, the imputer mask will be a sparse matrix.

    • If False, the imputer mask will be a numpy array.

  • error_on_new (bool, default=True) – If True, transform() will raise an error when there are features with missing values that have no missing values in fit(). This is applicable only when features=’missing-only’.

features_

The features indices which will be returned when calling transform(). They are computed during fit(). If features=’all’, features_ is equal to range(n_features).

Type

ndarray of shape (n_missing_features,) or (n_features,)

n_features_in_

Number of features seen during fit.

New in version 0.24.

Type

int

feature_names_in_

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type

ndarray of shape (n_features_in_,)

See also

SimpleImputer

Univariate imputation of missing values.

IterativeImputer

Multivariate imputation of missing values.

Examples

>>> import numpy as np
>>> from sklearn.impute import MissingIndicator
>>> X1 = np.array([[np.nan, 1, 3],
...                [4, 0, np.nan],
...                [8, 1, 0]])
>>> X2 = np.array([[5, 1, np.nan],
...                [np.nan, 2, 3],
...                [2, 4, 0]])
>>> indicator = MissingIndicator()
>>> indicator.fit(X1)
MissingIndicator()
>>> X2_tr = indicator.transform(X2)
>>> X2_tr
array([[False,  True],
       [ True, False],
       [False, False]])
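
An additional sketch with features='all', reusing X1 and X2 from above: one indicator column is produced per input feature, not only for the features that had missing values during fit.

indicator_all = MissingIndicator(features='all')
indicator_all.fit(X1)
indicator_all.transform(X2)  # boolean array of shape (3, 3) marking every missing entry of X2
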
fit(X, y=None)[source]

Fit the transformer on X.

Parameters
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – Input data, where n_samples is the number of samples and n_features is the number of features.

  • y (Ignored) – Not used, present for API consistency by convention.

Returns

self – Fitted estimator.

Return type

object

fit_transform(X, y=None)[source]

Generate missing values indicator for X.

Parameters
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The input data to complete.

  • y (Ignored) – Not used, present for API consistency by convention.

Returns

Xt – The missing indicator for input data. The data type of Xt will be boolean.

Return type

{ndarray, sparse matrix} of shape (n_samples, n_features) or (n_samples, n_features_with_missing)

transform(X)[source]

Generate missing values indicator for X.

Parameters

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The input data to complete.

Returns

Xt – The missing indicator for input data. The data type of Xt will be boolean.

Return type

{ndarray, sparse matrix} of shape (n_samples, n_features) or (n_samples, n_features_with_missing)

class ballet.eng.external.sklearn.Normalizer(norm='l2', *, copy=True)[source]

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Normalize samples individually to unit norm.

Each sample (i.e. each row of the data matrix) with at least one non zero component is rescaled independently of other samples so that its norm (l1, l2 or inf) equals one.

This transformer is able to work both with dense numpy arrays and scipy.sparse matrix (use CSR format if you want to avoid the burden of a copy / conversion).

Scaling inputs to unit norms is a common operation for text classification or clustering for instance. For instance the dot product of two l2-normalized TF-IDF vectors is the cosine similarity of the vectors and is the base similarity metric for the Vector Space Model commonly used by the Information Retrieval community.

Read more in the User Guide.

Parameters
  • norm ({'l1', 'l2', 'max'}, default='l2') – The norm to use to normalize each non zero sample. If norm=’max’ is used, values will be rescaled by the maximum of the absolute values.

  • copy (bool, default=True) – Set to False to perform inplace row normalization and avoid a copy (if the input is already a numpy array or a scipy.sparse CSR matrix).

n_features_in_

Number of features seen during fit.

New in version 0.24.

Type

int

feature_names_in_

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type

ndarray of shape (n_features_in_,)

See also

normalize

Equivalent function without the estimator API.

Notes

This estimator is stateless (besides constructor parameters); the fit method does nothing but is useful when used in a pipeline.

For a comparison of the different scalers, transformers, and normalizers, see examples/preprocessing/plot_all_scaling.py.

Examples

>>> from sklearn.preprocessing import Normalizer
>>> X = [[4, 1, 2, 2],
...      [1, 3, 9, 3],
...      [5, 7, 5, 1]]
>>> transformer = Normalizer().fit(X)  # fit does nothing.
>>> transformer
Normalizer()
>>> transformer.transform(X)
array([[0.8, 0.2, 0.4, 0.4],
       [0.1, 0.3, 0.9, 0.3],
       [0.5, 0.7, 0.5, 0.1]])
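
An additional sketch of the cosine-similarity remark above, reusing transformer and X: after l2 normalization, the dot product of two rows equals the cosine similarity of the original rows.

import numpy as np
Xn = transformer.transform(X)
float(Xn[0] @ Xn[1])  # approximately 0.62, the cosine similarity of rows 0 and 1 of X
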
fit(X, y=None)[source]

Do nothing and return the estimator unchanged.

This method is just there to implement the usual API and hence work in pipelines.

Parameters
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data to estimate the normalization parameters.

  • y (Ignored) – Not used, present here for API consistency by convention.

Returns

self – Fitted transformer.

Return type

object

transform(X, copy=None)[source]

Scale each non zero row of X to unit norm.

Parameters
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data to normalize, row by row. scipy.sparse matrices should be in CSR format to avoid an unnecessary copy.

  • copy (bool, default=None) – Copy the input X or not.

Returns

X_tr – Transformed array.

Return type

{ndarray, sparse matrix} of shape (n_samples, n_features)

class ballet.eng.external.sklearn.OneHotEncoder(*, categories='auto', drop=None, sparse=True, dtype=<class 'numpy.float64'>, handle_unknown='error')[source]

Bases: sklearn.preprocessing._encoders._BaseEncoder

Encode categorical features as a one-hot numeric array.

The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features. The features are encoded using a one-hot (aka ‘one-of-K’ or ‘dummy’) encoding scheme. This creates a binary column for each category and returns a sparse matrix or dense array (depending on the sparse parameter)

By default, the encoder derives the categories based on the unique values in each feature. Alternatively, you can also specify the categories manually.

This encoding is needed for feeding categorical data to many scikit-learn estimators, notably linear models and SVMs with the standard kernels.

Note: a one-hot encoding of y labels should use a LabelBinarizer instead.

Read more in the User Guide.

Parameters
  • categories ('auto' or a list of array-like, default='auto') –

    Categories (unique values) per feature:

    • ’auto’ : Determine categories automatically from the training data.

    • list : categories[i] holds the categories expected in the ith column. The passed categories should not mix strings and numeric values within a single feature, and should be sorted in case of numeric values.

    The used categories can be found in the categories_ attribute.

    New in version 0.20.

  • drop ({'first', 'if_binary'} or an array-like of shape (n_features,), default=None) –

    Specifies a methodology to use to drop one of the categories per feature. This is useful in situations where perfectly collinear features cause problems, such as when feeding the resulting data into a neural network or an unregularized regression.

    However, dropping one category breaks the symmetry of the original representation and can therefore induce a bias in downstream models, for instance for penalized linear classification or regression models.

    • None : retain all features (the default).

    • ’first’ : drop the first category in each feature. If only one category is present, the feature will be dropped entirely.

    • ’if_binary’ : drop the first category in each feature with two categories. Features with 1 or more than 2 categories are left intact.

    • array : drop[i] is the category in feature X[:, i] that should be dropped.

    New in version 0.21: The parameter drop was added in 0.21.

    Changed in version 0.23: The option drop=’if_binary’ was added in 0.23.

  • sparse (bool, default=True) – Will return sparse matrix if set True else will return an array.

  • dtype (number type, default=float) – Desired dtype of output.

  • handle_unknown ({'error', 'ignore'}, default='error') – Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None.

categories_

The categories of each feature determined during fitting (in order of the features in X and corresponding with the output of transform). This includes the category specified in drop (if any).

Type

list of arrays

drop_idx_
  • drop_idx_[i] is the index in categories_[i] of the category to be dropped for each feature.

  • drop_idx_[i] = None if no category is to be dropped from the feature with index i, e.g. when drop=’if_binary’ and the feature isn’t binary.

  • drop_idx_ = None if all the transformed features will be retained.

Changed in version 0.23: Added the possibility to contain None values.

Type

array of shape (n_features,)

n_features_in_

Number of features seen during fit.

New in version 1.0.

Type

int

feature_names_in_

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type

ndarray of shape (n_features_in_,)

See also

OrdinalEncoder

Performs an ordinal (integer) encoding of the categorical features.

sklearn.feature_extraction.DictVectorizer

Performs a one-hot encoding of dictionary items (also handles string-valued features).

sklearn.feature_extraction.FeatureHasher

Performs an approximate one-hot encoding of dictionary items or strings.

LabelBinarizer

Binarizes labels in a one-vs-all fashion.

MultiLabelBinarizer

Transforms between iterable of iterables and a multilabel format, e.g. a (samples x classes) binary matrix indicating the presence of a class label.

Examples

Given a dataset with two features, we let the encoder find the unique values per feature and transform the data to a binary one-hot encoding.

>>> from sklearn.preprocessing import OneHotEncoder

One can discard categories not seen during fit:

>>> enc = OneHotEncoder(handle_unknown='ignore')
>>> X = [['Male', 1], ['Female', 3], ['Female', 2]]
>>> enc.fit(X)
OneHotEncoder(handle_unknown='ignore')
>>> enc.categories_
[array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]
>>> enc.transform([['Female', 1], ['Male', 4]]).toarray()
array([[1., 0., 1., 0., 0.],
       [0., 1., 0., 0., 0.]])
>>> enc.inverse_transform([[0, 1, 1, 0, 0], [0, 0, 0, 1, 0]])
array([['Male', 1],
       [None, 2]], dtype=object)
>>> enc.get_feature_names_out(['gender', 'group'])
array(['gender_Female', 'gender_Male', 'group_1', 'group_2', 'group_3'], ...)

One can always drop the first column for each feature:

>>> drop_enc = OneHotEncoder(drop='first').fit(X)
>>> drop_enc.categories_
[array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]
>>> drop_enc.transform([['Female', 1], ['Male', 2]]).toarray()
array([[0., 0., 0.],
       [1., 1., 0.]])

Or drop a column for feature only having 2 categories:

>>> drop_binary_enc = OneHotEncoder(drop='if_binary').fit(X)
>>> drop_binary_enc.transform([['Female', 1], ['Male', 2]]).toarray()
array([[0., 1., 0., 0.],
       [1., 0., 1., 0.]])
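
An additional sketch of the array form of drop, reusing X from above: drop[i] names the category to remove from feature i (here 'Male' from the first feature and 1 from the second); the expected output is shown in the comment.

drop_arr_enc = OneHotEncoder(drop=['Male', 1]).fit(X)
drop_arr_enc.transform([['Female', 1]]).toarray()
# expected: array([[1., 0., 0.]]) -- remaining columns for 'Female', 2 and 3
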
fit(X, y=None)[source]

Fit OneHotEncoder to X.

Parameters
  • X (array-like of shape (n_samples, n_features)) – The data to determine the categories of each feature.

  • y (None) – Ignored. This parameter exists only for compatibility with Pipeline.

Returns

Fitted encoder.

Return type

self

fit_transform(X, y=None)[source]

Fit OneHotEncoder to X, then transform X.

Equivalent to fit(X).transform(X) but more convenient.

Parameters
  • X (array-like of shape (n_samples, n_features)) – The data to encode.

  • y (None) – Ignored. This parameter exists only for compatibility with Pipeline.

Returns

X_out – Transformed input. If sparse=True, a sparse matrix will be returned.

Return type

{ndarray, sparse matrix} of shape (n_samples, n_encoded_features)

get_feature_names(input_features=None)[source]

DEPRECATED: get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.

Return feature names for output features.

Parameters

input_features (list of str of shape (n_features,)) – String names for input features if available. By default, “x0”, “x1”, … “xn_features” is used.

Returns

output_feature_names – Array of feature names.

Return type

ndarray of shape (n_output_features,)

get_feature_names_out(input_features=None)[source]

Get output feature names for transformation.

Parameters

input_features (array-like of str or None, default=None) –

Input features.

  • If input_features is None, then feature_names_in_ is used as feature names in. If feature_names_in_ is not defined, then names are generated: [x0, x1, …, x(n_features_in_)].

  • If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.

Returns

feature_names_out – Transformed feature names.

Return type

ndarray of str objects

inverse_transform(X)[source]

Convert the data back to the original representation.

When unknown categories are encountered (all zeros in the one-hot encoding), None is used to represent this category. If the feature with the unknown category has a dropped category, the dropped category will be its inverse.

Parameters

X ({array-like, sparse matrix} of shape (n_samples, n_encoded_features)) – The transformed data.

Returns

X_tr – Inverse transformed array.

Return type

ndarray of shape (n_samples, n_features)

transform(X)[source]

Transform X using one-hot encoding.

Parameters

X (array-like of shape (n_samples, n_features)) – The data to encode.

Returns

X_out – Transformed input. If sparse=True, a sparse matrix will be returned.

Return type

{ndarray, sparse matrix} of shape (n_samples, n_encoded_features)

class ballet.eng.external.sklearn.OrdinalEncoder(*, categories='auto', dtype=<class 'numpy.float64'>, handle_unknown='error', unknown_value=None)[source]

Bases: sklearn.preprocessing._encoders._BaseEncoder

Encode categorical features as an integer array.

The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features. The features are converted to ordinal integers. This results in a single column of integers (0 to n_categories - 1) per feature.

Read more in the User Guide.

New in version 0.20.

Parameters
  • categories ('auto' or a list of array-like, default='auto') –

    Categories (unique values) per feature:

    • ’auto’ : Determine categories automatically from the training data.

    • list : categories[i] holds the categories expected in the ith column. The passed categories should not mix strings and numeric values, and should be sorted in case of numeric values.

    The used categories can be found in the categories_ attribute.

  • dtype (number type, default np.float64) – Desired dtype of output.

  • handle_unknown ({'error', 'use_encoded_value'}, default='error') –

    When set to ‘error’ an error will be raised in case an unknown categorical feature is present during transform. When set to ‘use_encoded_value’, the encoded value of unknown categories will be set to the value given for the parameter unknown_value. In inverse_transform(), an unknown category will be denoted as None.

    New in version 0.24.

  • unknown_value (int or np.nan, default=None) –

    When the parameter handle_unknown is set to ‘use_encoded_value’, this parameter is required and will set the encoded value of unknown categories. It has to be distinct from the values used to encode any of the categories in fit. If set to np.nan, the dtype parameter must be a float dtype.

    New in version 0.24.

categories_

The categories of each feature determined during fit (in order of the features in X and corresponding with the output of transform). This does not include categories that weren’t seen during fit.

Type

list of arrays

n_features_in_

Number of features seen during fit.

New in version 1.0.

Type

int

feature_names_in_

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type

ndarray of shape (n_features_in_,)

See also

OneHotEncoder

Performs a one-hot encoding of categorical features.

LabelEncoder

Encodes target labels with values between 0 and n_classes-1.

Examples

Given a dataset with two features, we let the encoder find the unique values per feature and transform the data to an ordinal encoding.

>>> from sklearn.preprocessing import OrdinalEncoder
>>> enc = OrdinalEncoder()
>>> X = [['Male', 1], ['Female', 3], ['Female', 2]]
>>> enc.fit(X)
OrdinalEncoder()
>>> enc.categories_
[array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]
>>> enc.transform([['Female', 3], ['Male', 1]])
array([[0., 2.],
       [1., 0.]])
>>> enc.inverse_transform([[1, 0], [0, 1]])
array([['Male', 1],
       ['Female', 2]], dtype=object)
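
An additional sketch of handle_unknown='use_encoded_value', reusing X from above: categories unseen during fit are mapped to the sentinel code -1 instead of raising an error (expected output in the comment).

enc_unk = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
enc_unk.fit(X)
enc_unk.transform([['Female', 5], ['Other', 1]])
# expected: array([[ 0., -1.],
#                  [-1.,  0.]])
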
fit(X, y=None)[source]

Fit the OrdinalEncoder to X.

Parameters
  • X (array-like of shape (n_samples, n_features)) – The data to determine the categories of each feature.

  • y (None) – Ignored. This parameter exists only for compatibility with Pipeline.

Returns

self – Fitted encoder.

Return type

object

inverse_transform(X)[source]

Convert the data back to the original representation.

Parameters

X (array-like of shape (n_samples, n_encoded_features)) – The transformed data.

Returns

X_tr – Inverse transformed array.

Return type

ndarray of shape (n_samples, n_features)

transform(X)[source]

Transform X to ordinal codes.

Parameters

X (array-like of shape (n_samples, n_features)) – The data to encode.

Returns

X_out – Transformed input.

Return type

ndarray of shape (n_samples, n_features)

class ballet.eng.external.sklearn.PolynomialFeatures(degree=2, *, interaction_only=False, include_bias=True, order='C')[source]

Bases: sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Generate polynomial and interaction features.

Generate a new feature matrix consisting of all polynomial combinations of the features with degree less than or equal to the specified degree. For example, if an input sample is two dimensional and of the form [a, b], the degree-2 polynomial features are [1, a, b, a^2, ab, b^2].

Read more in the User Guide.

Parameters
  • degree (int or tuple (min_degree, max_degree), default=2) – If a single int is given, it specifies the maximal degree of the polynomial features. If a tuple (min_degree, max_degree) is passed, then min_degree is the minimum and max_degree is the maximum polynomial degree of the generated features. Note that min_degree=0 and min_degree=1 are equivalent as outputting the degree zero term is determined by include_bias.

  • interaction_only (bool, default=False) –

    If True, only interaction features are produced: features that are products of at most degree distinct input features, i.e. terms with power of 2 or higher of the same input feature are excluded:

    • included: x[0], x[1], x[0] * x[1], etc.

    • excluded: x[0] ** 2, x[0] ** 2 * x[1], etc.

  • include_bias (bool, default=True) – If True (default), then include a bias column, the feature in which all polynomial powers are zero (i.e. a column of ones - acts as an intercept term in a linear model).

  • order ({'C', 'F'}, default='C') –

    Order of output array in the dense case. ‘F’ order is faster to compute, but may slow down subsequent estimators.

    New in version 0.21.

powers_

powers_[i, j] is the exponent of the jth input in the ith output.

Type

ndarray of shape (n_output_features_, n_features_in_)

n_input_features_

The total number of input features.

Deprecated since version 1.0: This attribute is deprecated in 1.0 and will be removed in 1.2. Refer to n_features_in_ instead.

Type

int

n_features_in_

Number of features seen during fit.

New in version 0.24.

Type

int

feature_names_in_

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type

ndarray of shape (n_features_in_,)

n_output_features_

The total number of polynomial output features. The number of output features is computed by iterating over all suitably sized combinations of input features.

Type

int

See also

SplineTransformer

Transformer that generates univariate B-spline bases for features.

Notes

Be aware that the number of features in the output array scales polynomially in the number of features of the input array, and exponentially in the degree. High degrees can cause overfitting.

See examples/linear_model/plot_polynomial_interpolation.py

Examples

>>> import numpy as np
>>> from sklearn.preprocessing import PolynomialFeatures
>>> X = np.arange(6).reshape(3, 2)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5]])
>>> poly = PolynomialFeatures(2)
>>> poly.fit_transform(X)
array([[ 1.,  0.,  1.,  0.,  0.,  1.],
       [ 1.,  2.,  3.,  4.,  6.,  9.],
       [ 1.,  4.,  5., 16., 20., 25.]])
>>> poly = PolynomialFeatures(interaction_only=True)
>>> poly.fit_transform(X)
array([[ 1.,  0.,  1.,  0.],
       [ 1.,  2.,  3.,  6.],
       [ 1.,  4.,  5., 20.]])
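
An additional sketch of the (min_degree, max_degree) form of degree, reusing X from above: only the pure degree-2 terms are generated, skipping the linear ones (expected output in the comment).

poly = PolynomialFeatures(degree=(2, 2), include_bias=False)
poly.fit_transform(X)
# expected columns x0**2, x0*x1, x1**2:
# array([[ 0.,  0.,  1.],
#        [ 4.,  6.,  9.],
#        [16., 20., 25.]])
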
fit(X, y=None)[source]

Compute number of output features.

Parameters
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data.

  • y (Ignored) – Not used, present here for API consistency by convention.

Returns

self – Fitted transformer.

Return type

object

get_feature_names(input_features=None)[source]

DEPRECATED: get_feature_names is deprecated in 1.0 and will be removed in 1.2. Please use get_feature_names_out instead.

Return feature names for output features.

Parameters

input_features (list of str of shape (n_features,), default=None) – String names for input features if available. By default, “x0”, “x1”, … “xn_features” is used.

Returns

output_feature_names – Transformed feature names.

Return type

list of str of shape (n_output_features,)

get_feature_names_out(input_features=None)[source]

Get output feature names for transformation.

Parameters

input_features (array-like of str or None, default=None) –

Input features.

  • If input_features is None, then feature_names_in_ is used as feature names in. If feature_names_in_ is not defined, then names are generated: [x0, x1, …, x(n_features_in_)].

  • If input_features is an array-like, then input_features must match feature_names_in_ if feature_names_in_ is defined.

Returns

feature_names_out – Transformed feature names.

Return type

ndarray of str objects

property n_input_features_

DEPRECATED: The attribute n_input_features_ was deprecated in version 1.0 and will be removed in 1.2.

property powers_

Exponent for each of the inputs in the output.

transform(X)[source]

Transform data to polynomial features.

Parameters

X ({array-like, sparse matrix} of shape (n_samples, n_features)) –

The data to transform, row by row.

Prefer CSR over CSC for sparse input (for speed), but CSC is required if the degree is 4 or higher. If the degree is less than 4 and the input format is CSC, it will be converted to CSR, have its polynomial features generated, then converted back to CSC.

If the degree is 2 or 3, the method described in “Leveraging Sparsity to Speed Up Polynomial Feature Expansions of CSR Matrices Using K-Simplex Numbers” by Andrew Nystrom and John Hughes is used, which is much faster than the method used on CSC input. For this reason, a CSC input will be converted to CSR, and the output will be converted back to CSC prior to being returned, hence the preference of CSR.

Returns

XP – The matrix of features, where NP is the number of polynomial features generated from the combination of inputs. If a sparse matrix is provided, it will be converted into a sparse csr_matrix.

Return type

{ndarray, sparse matrix} of shape (n_samples, NP)

class ballet.eng.external.sklearn.PowerTransformer(method='yeo-johnson', *, standardize=True, copy=True)[source]

Bases: sklearn.base._OneToOneFeatureMixin, sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Apply a power transform featurewise to make data more Gaussian-like.

Power transforms are a family of parametric, monotonic transformations that are applied to make data more Gaussian-like. This is useful for modeling issues related to heteroscedasticity (non-constant variance), or other situations where normality is desired.

Currently, PowerTransformer supports the Box-Cox transform and the Yeo-Johnson transform. The optimal parameter for stabilizing variance and minimizing skewness is estimated through maximum likelihood.

Box-Cox requires input data to be strictly positive, while Yeo-Johnson supports both positive or negative data.

By default, zero-mean, unit-variance normalization is applied to the transformed data.

Read more in the User Guide.

New in version 0.20.

Parameters
  • method ({'yeo-johnson', 'box-cox'}, default='yeo-johnson') –

    The power transform method. Available methods are:

    • ’yeo-johnson’ [1]_, works with positive and negative values

    • ’box-cox’ [2]_, only works with strictly positive values

  • standardize (bool, default=True) – Set to True to apply zero-mean, unit-variance normalization to the transformed output.

  • copy (bool, default=True) – Set to False to perform inplace computation during transformation.

lambdas_

The parameters of the power transformation for the selected features.

Type

ndarray of float of shape (n_features,)

n_features_in_

Number of features seen during fit.

New in version 0.24.

Type

int

feature_names_in_

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type

ndarray of shape (n_features_in_,)

See also

power_transform

Equivalent function without the estimator API.

QuantileTransformer

Maps data to a standard normal distribution with the parameter output_distribution=’normal’.

Notes

NaNs are treated as missing values: disregarded in fit, and maintained in transform.

For a comparison of the different scalers, transformers, and normalizers, see examples/preprocessing/plot_all_scaling.py.

References

1

I.K. Yeo and R.A. Johnson, “A new family of power transformations to improve normality or symmetry.” Biometrika, 87(4), pp.954-959, (2000).

2

G.E.P. Box and D.R. Cox, “An Analysis of Transformations”, Journal of the Royal Statistical Society B, 26, 211-252 (1964).

Examples

>>> import numpy as np
>>> from sklearn.preprocessing import PowerTransformer
>>> pt = PowerTransformer()
>>> data = [[1, 2], [3, 2], [4, 5]]
>>> print(pt.fit(data))
PowerTransformer()
>>> print(pt.lambdas_)
[ 1.386... -3.100...]
>>> print(pt.transform(data))
[[-1.316... -0.707...]
 [ 0.209... -0.707...]
 [ 1.106...  1.414...]]
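
An additional round-trip sketch, reusing pt, data, and np from above: inverse_transform uses the fitted lambdas (and undoes the standardization) to recover the original values up to floating-point error.

X_trans = pt.transform(data)
np.allclose(pt.inverse_transform(X_trans), data)  # True
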
fit(X, y=None)[source]

Estimate the optimal parameter lambda for each feature.

The optimal lambda parameter for minimizing skewness is estimated on each feature independently using maximum likelihood.

Parameters
  • X (array-like of shape (n_samples, n_features)) – The data used to estimate the optimal transformation parameters.

  • y (None) – Ignored.

Returns

self – Fitted transformer.

Return type

object

fit_transform(X, y=None)[source]

Fit PowerTransformer to X, then transform X.

Parameters
  • X (array-like of shape (n_samples, n_features)) – The data used to estimate the optimal transformation parameters and to be transformed using a power transformation.

  • y (Ignored) – Not used, present for API consistency by convention.

Returns

X_new – Transformed data.

Return type

ndarray of shape (n_samples, n_features)

inverse_transform(X)[source]

Apply the inverse power transformation using the fitted lambdas.

The inverse of the Box-Cox transformation is given by:

if lambda_ == 0:
    X = exp(X_trans)
else:
    X = (X_trans * lambda_ + 1) ** (1 / lambda_)

The inverse of the Yeo-Johnson transformation is given by:

if X >= 0 and lambda_ == 0:
    X = exp(X_trans) - 1
elif X >= 0 and lambda_ != 0:
    X = (X_trans * lambda_ + 1) ** (1 / lambda_) - 1
elif X < 0 and lambda_ != 2:
    X = 1 - (-(2 - lambda_) * X_trans + 1) ** (1 / (2 - lambda_))
elif X < 0 and lambda_ == 2:
    X = 1 - exp(-X_trans)
Parameters

X (array-like of shape (n_samples, n_features)) – The transformed data.

Returns

X – The original data.

Return type

ndarray of shape (n_samples, n_features)

transform(X)[source]

Apply the power transform to each feature using the fitted lambdas.

Parameters

X (array-like of shape (n_samples, n_features)) – The data to be transformed using a power transformation.

Returns

X_trans – The transformed data.

Return type

ndarray of shape (n_samples, n_features)

class ballet.eng.external.sklearn.QuantileTransformer(*, n_quantiles=1000, output_distribution='uniform', ignore_implicit_zeros=False, subsample=100000, random_state=None, copy=True)[source]

Bases: sklearn.base._OneToOneFeatureMixin, sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Transform features using quantiles information.

This method transforms the features to follow a uniform or a normal distribution. Therefore, for a given feature, this transformation tends to spread out the most frequent values. It also reduces the impact of (marginal) outliers: this is therefore a robust preprocessing scheme.

The transformation is applied on each feature independently. First an estimate of the cumulative distribution function of a feature is used to map the original values to a uniform distribution. The obtained values are then mapped to the desired output distribution using the associated quantile function. Feature values of new/unseen data that fall below or above the fitted range will be mapped to the bounds of the output distribution. Note that this transform is non-linear. It may distort linear correlations between variables measured at the same scale but renders variables measured at different scales more directly comparable.

Read more in the User Guide.

New in version 0.19.

Parameters
  • n_quantiles (int, default=1000 or n_samples) – Number of quantiles to be computed. It corresponds to the number of landmarks used to discretize the cumulative distribution function. If n_quantiles is larger than the number of samples, n_quantiles is set to the number of samples as a larger number of quantiles does not give a better approximation of the cumulative distribution function estimator.

  • output_distribution ({'uniform', 'normal'}, default='uniform') – Marginal distribution for the transformed data. The choices are ‘uniform’ (default) or ‘normal’.

  • ignore_implicit_zeros (bool, default=False) – Only applies to sparse matrices. If True, the sparse entries of the matrix are discarded to compute the quantile statistics. If False, these entries are treated as zeros.

  • subsample (int, default=1e5) – Maximum number of samples used to estimate the quantiles for computational efficiency. Note that the subsampling procedure may differ for value-identical sparse and dense matrices.

  • random_state (int, RandomState instance or None, default=None) – Determines random number generation for subsampling and smoothing noise. Please see subsample for more details. Pass an int for reproducible results across multiple function calls. See Glossary.

  • copy (bool, default=True) – Set to False to perform inplace transformation and avoid a copy (if the input is already a numpy array).

n_quantiles_

The actual number of quantiles used to discretize the cumulative distribution function.

Type

int

quantiles_

The values corresponding to the quantiles of reference.

Type

ndarray of shape (n_quantiles, n_features)

references_

Quantiles of references.

Type

ndarray of shape (n_quantiles,)

n_features_in_

Number of features seen during fit.

New in version 0.24.

Type

int

feature_names_in_

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type

ndarray of shape (n_features_in_,)

See also

quantile_transform

Equivalent function without the estimator API.

PowerTransformer

Perform mapping to a normal distribution using a power transform.

StandardScaler

Perform standardization that is faster, but less robust to outliers.

RobustScaler

Perform robust standardization that removes the influence of outliers but does not put outliers and inliers on the same scale.

Notes

NaNs are treated as missing values: disregarded in fit, and maintained in transform.

For a comparison of the different scalers, transformers, and normalizers, see examples/preprocessing/plot_all_scaling.py.

Examples

>>> import numpy as np
>>> from sklearn.preprocessing import QuantileTransformer
>>> rng = np.random.RandomState(0)
>>> X = np.sort(rng.normal(loc=0.5, scale=0.25, size=(25, 1)), axis=0)
>>> qt = QuantileTransformer(n_quantiles=10, random_state=0)
>>> qt.fit_transform(X)
array([...])
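
As a further hedged sketch continuing the doctest above (not part of the upstream example), feature values of unseen data that fall far outside the fitted range are mapped to the bounds of the uniform output distribution, i.e. 0 and 1:

>>> out = qt.transform([[-100.0], [100.0]])  # values far below / above the fitted range
>>> np.allclose(out, [[0.0], [1.0]])
True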
fit(X, y=None)[source]

Compute the quantiles used for transforming.

Parameters
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to scale along the features axis. If a sparse matrix is provided, it will be converted into a sparse csc_matrix. Additionally, the sparse matrix needs to be nonnegative if ignore_implicit_zeros is False.

  • y (None) – Ignored.

Returns

self – Fitted transformer.

Return type

object

inverse_transform(X)[source]

Back-projection to the original space.

Parameters

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to scale along the features axis. If a sparse matrix is provided, it will be converted into a sparse csc_matrix. Additionally, the sparse matrix needs to be nonnegative if ignore_implicit_zeros is False.

Returns

Xt – The projected data.

Return type

{ndarray, sparse matrix} of shape (n_samples, n_features)

transform(X)[source]

Feature-wise transformation of the data.

Parameters

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to scale along the features axis. If a sparse matrix is provided, it will be converted into a sparse csc_matrix. Additionally, the sparse matrix needs to be nonnegative if ignore_implicit_zeros is False.

Returns

Xt – The projected data.

Return type

{ndarray, sparse matrix} of shape (n_samples, n_features)

class ballet.eng.external.sklearn.RobustScaler(*, with_centering=True, with_scaling=True, quantile_range=(25.0, 75.0), copy=True, unit_variance=False)[source]

Bases: sklearn.base._OneToOneFeatureMixin, sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Scale features using statistics that are robust to outliers.

This Scaler removes the median and scales the data according to the quantile range (defaults to IQR: Interquartile Range). The IQR is the range between the 1st quartile (25th quantile) and the 3rd quartile (75th quantile).

Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Median and interquartile range are then stored to be used on later data using the transform() method.

Standardization of a dataset is a common requirement for many machine learning estimators. Typically this is done by removing the mean and scaling to unit variance. However, outliers can often influence the sample mean / variance in a negative way. In such cases, the median and the interquartile range often give better results.

New in version 0.17.

Read more in the User Guide.

Parameters
  • with_centering (bool, default=True) – If True, center the data before scaling. This will cause transform() to raise an exception when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.

  • with_scaling (bool, default=True) – If True, scale the data to interquartile range.

  • quantile_range (tuple (q_min, q_max), 0.0 < q_min < q_max < 100.0, default=(25.0, 75.0)) –

    Quantile range used to calculate scale_. By default this is equal to the IQR, i.e., q_min is the first quartile and q_max is the third quartile.

    New in version 0.18.

  • copy (bool, default=True) – If False, try to avoid a copy and do inplace scaling instead. This is not guaranteed to always work inplace; e.g. if the data is not a NumPy array or scipy.sparse CSR matrix, a copy may still be returned.

  • unit_variance (bool, default=False) –

    If True, scale data so that normally distributed features have a variance of 1. In general, if the difference between the x-values of q_max and q_min for a standard normal distribution is greater than 1, the dataset will be scaled down. If less than 1, the dataset will be scaled up.

    New in version 0.24.

center_

The median value for each feature in the training set.

Type

array of floats

scale_

The (scaled) interquartile range for each feature in the training set.

New in version 0.17: scale_ attribute.

Type

array of floats

n_features_in_

Number of features seen during fit.

New in version 0.24.

Type

int

feature_names_in_

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type

ndarray of shape (n_features_in_,)

See also

robust_scale

Equivalent function without the estimator API.

sklearn.decomposition.PCA

Further removes the linear correlation across features with ‘whiten=True’.

Notes

For a comparison of the different scalers, transformers, and normalizers, see examples/preprocessing/plot_all_scaling.py.

https://en.wikipedia.org/wiki/Median https://en.wikipedia.org/wiki/Interquartile_range

Examples

>>> from sklearn.preprocessing import RobustScaler
>>> X = [[ 1., -2.,  2.],
...      [ -2.,  1.,  3.],
...      [ 4.,  1., -2.]]
>>> transformer = RobustScaler().fit(X)
>>> transformer
RobustScaler()
>>> transformer.transform(X)
array([[ 0. , -2. ,  0. ],
       [-1. ,  0. ,  0.4],
       [ 1. ,  0. , -1.6]])
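
As an additional hedged check continuing the example above (not part of the upstream docstring), the fitted center_ and scale_ attributes should match the per-column median and interquartile range computed directly with NumPy:

>>> import numpy as np
>>> np.allclose(transformer.center_, np.median(X, axis=0))  # per-feature median
True
>>> q75, q25 = np.percentile(X, [75, 25], axis=0)
>>> np.allclose(transformer.scale_, q75 - q25)  # per-feature interquartile range
True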
fit(X, y=None)[source]

Compute the median and quantiles to be used for scaling.

Parameters
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to compute the median and quantiles used for later scaling along the features axis.

  • y (Ignored) – Not used, present here for API consistency by convention.

Returns

self – Fitted scaler.

Return type

object

inverse_transform(X)[source]

Scale back the data to the original representation.

Parameters

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The rescaled data to be transformed back.

Returns

X_tr – Transformed array.

Return type

{ndarray, sparse matrix} of shape (n_samples, n_features)

transform(X)[source]

Center and scale the data.

Parameters

X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to scale along the specified axis.

Returns

X_tr – Transformed array.

Return type

{ndarray, sparse matrix} of shape (n_samples, n_features)

class ballet.eng.external.sklearn.SimpleImputer(*, missing_values=nan, strategy='mean', fill_value=None, verbose=0, copy=True, add_indicator=False)[source]

Bases: sklearn.impute._base._BaseImputer

Imputation transformer for completing missing values.

Read more in the User Guide.

New in version 0.20: SimpleImputer replaces the previous sklearn.preprocessing.Imputer estimator which is now removed.

Parameters
  • missing_values (int, float, str, np.nan or None, default=np.nan) – The placeholder for the missing values. All occurrences of missing_values will be imputed. For pandas’ dataframes with nullable integer dtypes with missing values, missing_values should be set to np.nan, since pd.NA will be converted to np.nan.

  • strategy (str, default='mean') –

    The imputation strategy.

    • If “mean”, then replace missing values using the mean along each column. Can only be used with numeric data.

    • If “median”, then replace missing values using the median along each column. Can only be used with numeric data.

    • If “most_frequent”, then replace missing values using the most frequent value along each column. Can be used with strings or numeric data. If there is more than one such value, only the smallest is returned.

    • If “constant”, then replace missing values with fill_value. Can be used with strings or numeric data.

    New in version 0.20: strategy=”constant” for fixed value imputation.

  • fill_value (str or numerical value, default=None) – When strategy == “constant”, fill_value is used to replace all occurrences of missing_values. If left to the default, fill_value will be 0 when imputing numerical data and “missing_value” for strings or object data types.

  • verbose (int, default=0) – Controls the verbosity of the imputer.

  • copy (bool, default=True) –

    If True, a copy of X will be created. If False, imputation will be done in-place whenever possible. Note that, in the following cases, a new copy will always be made, even if copy=False:

    • If X is not an array of floating values;

    • If X is encoded as a CSR matrix;

    • If add_indicator=True.

  • add_indicator (bool, default=False) – If True, a MissingIndicator transform will stack onto output of the imputer’s transform. This allows a predictive estimator to account for missingness despite imputation. If a feature has no missing values at fit/train time, the feature won’t appear on the missing indicator even if there are missing values at transform/test time.

statistics_

The imputation fill value for each feature. Computing statistics can result in np.nan values. During transform(), features corresponding to np.nan statistics will be discarded.

Type

array of shape (n_features,)

indicator_

Indicator used to add binary indicators for missing values. None if add_indicator=False.

Type

MissingIndicator

n_features_in_

Number of features seen during fit.

New in version 0.24.

Type

int

feature_names_in_

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type

ndarray of shape (n_features_in_,)

See also

IterativeImputer

Multivariate imputation of missing values.

Notes

Columns which only contained missing values at fit() are discarded upon transform() if strategy is not “constant”.

Examples

>>> import numpy as np
>>> from sklearn.impute import SimpleImputer
>>> imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
>>> imp_mean.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])
SimpleImputer()
>>> X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
>>> print(imp_mean.transform(X))
[[ 7.   2.   3. ]
 [ 4.   3.5  6. ]
 [10.   3.5  9. ]]
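
As a further hedged illustration continuing the example above (not part of the upstream docstring), strategy='constant' replaces every missing entry with the supplied fill_value:

>>> imp_const = SimpleImputer(strategy='constant', fill_value=0)
>>> print(imp_const.fit_transform([[np.nan, 2], [4, np.nan]]))
[[0. 2.]
 [4. 0.]]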
fit(X, y=None)[source]

Fit the imputer on X.

Parameters
  • X ({array-like, sparse matrix}, shape (n_samples, n_features)) – Input data, where n_samples is the number of samples and n_features is the number of features.

  • y (Ignored) – Not used, present here for API consistency by convention.

Returns

self – Fitted estimator.

Return type

object

inverse_transform(X)[source]

Convert the data back to the original representation.

Inverts the transform operation performed on an array. This operation can only be performed after SimpleImputer is instantiated with add_indicator=True.

Note that inverse_transform can only invert the transform in features that have binary indicators for missing values. If a feature has no missing values at fit time, the feature won’t have a binary indicator, and the imputation done at transform time won’t be inverted.

New in version 0.24.

Parameters

X (array-like of shape (n_samples, n_features + n_features_missing_indicator)) – The imputed data to be reverted to original data. It has to be an augmented array of imputed data and the missing indicator mask.

Returns

X_original – The original X with missing values as it was prior to imputation.

Return type

ndarray of shape (n_samples, n_features)
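
A minimal hedged sketch of the round trip described above (not part of the upstream docstring), assuming the imputer was fitted with add_indicator=True as required:

>>> import numpy as np
>>> from sklearn.impute import SimpleImputer
>>> X = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 4.0]])
>>> imp = SimpleImputer(strategy='mean', add_indicator=True)
>>> X_aug = imp.fit_transform(X)  # imputed features followed by the indicator columns
>>> X_back = imp.inverse_transform(X_aug)
>>> np.allclose(X_back, X, equal_nan=True)  # imputed entries are reverted to NaN
True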

transform(X)[source]

Impute all missing values in X.

Parameters

X ({array-like, sparse matrix}, shape (n_samples, n_features)) – The input data to complete.

Returns

X_imputed – X with imputed values.

Return type

{ndarray, sparse matrix} of shape (n_samples, n_features_out)

class ballet.eng.external.sklearn.SparseRandomProjection(n_components='auto', *, density='auto', eps=0.1, dense_output=False, random_state=None)[source]

Bases: sklearn.random_projection.BaseRandomProjection

Reduce dimensionality through sparse random projection.

Sparse random matrix is an alternative to dense random projection matrix that guarantees similar embedding quality while being much more memory efficient and allowing faster computation of the projected data.

If we denote s = 1 / density, the components of the random matrix are drawn from the following distribution (see the hedged sketch after this list):

  • -sqrt(s) / sqrt(n_components) with probability 1 / 2s

  • 0 with probability 1 - 1 / s

  • +sqrt(s) / sqrt(n_components) with probability 1 / 2s
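
A small hedged check of this recipe (not part of the upstream docstring), assuming components_ is stored as a scipy.sparse CSR matrix as documented below: after fitting, the stored non-zero entries should all equal +/- sqrt(s) / sqrt(n_components) with s = 1 / density_.

>>> import numpy as np
>>> from sklearn.random_projection import SparseRandomProjection
>>> rng = np.random.RandomState(0)
>>> X = rng.rand(50, 2000)
>>> srp = SparseRandomProjection(n_components=100, random_state=0).fit(X)
>>> s = 1.0 / srp.density_
>>> expected = np.sqrt(s) / np.sqrt(srp.n_components_)
>>> # components_.data holds the non-zero values of the CSR projection matrix
>>> np.allclose(np.abs(srp.components_.data), expected)
True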

Read more in the User Guide.

New in version 0.13.

Parameters
  • n_components (int or 'auto', default='auto') –

    Dimensionality of the target projection space.

    n_components can be automatically adjusted according to the number of samples in the dataset and the bound given by the Johnson-Lindenstrauss lemma. In that case the quality of the embedding is controlled by the eps parameter.

    Note that the Johnson-Lindenstrauss lemma can yield very conservative estimates of the required number of components, as it makes no assumption about the structure of the dataset.

  • density (float or 'auto', default='auto') –

    Ratio in the range (0, 1] of non-zero component in the random projection matrix.

    If density = ‘auto’, the value is set to the minimum density as recommended by Ping Li et al.: 1 / sqrt(n_features).

    Use density = 1 / 3.0 if you want to reproduce the results from Achlioptas, 2001.

  • eps (float, default=0.1) –

    Parameter to control the quality of the embedding according to the Johnson-Lindenstrauss lemma when n_components is set to ‘auto’. This value should be strictly positive.

    Smaller values lead to better embedding and higher number of dimensions (n_components) in the target projection space.

  • dense_output (bool, default=False) –

    If True, ensure that the output of the random projection is a dense numpy array even if the input and random projection matrix are both sparse. In practice, if the number of components is small the number of zero components in the projected data will be very small and it will be more CPU and memory efficient to use a dense representation.

    If False, the projected data uses a sparse representation if the input is sparse.

  • random_state (int, RandomState instance or None, default=None) – Controls the pseudo random number generator used to generate the projection matrix at fit time. Pass an int for reproducible output across multiple function calls. See Glossary.

n_components_

Concrete number of components computed when n_components=”auto”.

Type

int

components_

Random matrix used for the projection. Sparse matrix will be of CSR format.

Type

sparse matrix of shape (n_components, n_features)

density_

Concrete density computed when density = “auto”.

Type

float in range 0.0 - 1.0

n_features_in_

Number of features seen during fit.

New in version 0.24.

Type

int

feature_names_in_

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type

ndarray of shape (n_features_in_,)

See also

GaussianRandomProjection

Reduce dimensionality through Gaussian random projection.

References

[1] Ping Li, T. Hastie and K. W. Church, 2006, “Very Sparse Random Projections”. https://web.stanford.edu/~hastie/Papers/Ping/KDD06_rp.pdf

[2] D. Achlioptas, 2001, “Database-friendly random projections”, https://users.soe.ucsc.edu/~optas/papers/jl.pdf

Examples

>>> import numpy as np
>>> from sklearn.random_projection import SparseRandomProjection
>>> rng = np.random.RandomState(42)
>>> X = rng.rand(100, 10000)
>>> transformer = SparseRandomProjection(random_state=rng)
>>> X_new = transformer.fit_transform(X)
>>> X_new.shape
(100, 3947)
>>> # very few components are non-zero
>>> np.mean(transformer.components_ != 0)
0.0100...
class ballet.eng.external.sklearn.StandardScaler(*, copy=True, with_mean=True, with_std=True)[source]

Bases: sklearn.base._OneToOneFeatureMixin, sklearn.base.TransformerMixin, sklearn.base.BaseEstimator

Standardize features by removing the mean and scaling to unit variance.

The standard score of a sample x is calculated as:

z = (x - u) / s

where u is the mean of the training samples or zero if with_mean=False, and s is the standard deviation of the training samples or one if with_std=False.
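
A minimal hedged check of this formula (not part of the upstream docstring), using NumPy's biased standard deviation (ddof=0), which matches the estimator as noted in the Notes section below:

>>> import numpy as np
>>> from sklearn.preprocessing import StandardScaler
>>> X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
>>> scaler = StandardScaler().fit(X)
>>> u, s = X.mean(axis=0), X.std(axis=0)  # per-feature mean and biased std
>>> np.allclose(scaler.transform(X), (X - u) / s)
True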

Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Mean and standard deviation are then stored to be used on later data using transform().

Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).

For instance, many elements used in the objective function of a learning algorithm (such as the RBF kernel of Support Vector Machines or the L1 and L2 regularizers of linear models) assume that all features are centered around 0 and have variance of the same order. If a feature has a variance that is orders of magnitude larger than the others, it might dominate the objective function and make the estimator unable to learn from the other features correctly as expected.

This scaler can also be applied to sparse CSR or CSC matrices by passing with_mean=False to avoid breaking the sparsity structure of the data.
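
A hedged sketch of that sparse use case (not part of the upstream docstring): with with_mean=False only the scaling is applied, so the output stays sparse.

>>> from scipy import sparse
>>> from sklearn.preprocessing import StandardScaler
>>> X_sparse = sparse.csr_matrix([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]])
>>> scaler = StandardScaler(with_mean=False).fit(X_sparse)
>>> sparse.issparse(scaler.transform(X_sparse))  # sparsity structure is preserved
True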

Read more in the User Guide.

Parameters
  • copy (bool, default=True) – If False, try to avoid a copy and do inplace scaling instead. This is not guaranteed to always work inplace; e.g. if the data is not a NumPy array or scipy.sparse CSR matrix, a copy may still be returned.

  • with_mean (bool, default=True) – If True, center the data before scaling. This does not work (and will raise an exception) when attempted on sparse matrices, because centering them entails building a dense matrix which in common use cases is likely to be too large to fit in memory.

  • with_std (bool, default=True) – If True, scale the data to unit variance (or equivalently, unit standard deviation).

scale_

Per feature relative scaling of the data to achieve zero mean and unit variance. Generally this is calculated using np.sqrt(var_). If a variance is zero, we can’t achieve unit variance, and the data is left as-is, giving a scaling factor of 1. scale_ is equal to None when with_std=False.

New in version 0.17: scale_

Type

ndarray of shape (n_features,) or None

mean_

The mean value for each feature in the training set. Equal to None when with_mean=False.

Type

ndarray of shape (n_features,) or None

var_

The variance for each feature in the training set. Used to compute scale_. Equal to None when with_std=False.

Type

ndarray of shape (n_features,) or None

n_features_in_

Number of features seen during fit.

New in version 0.24.

Type

int

feature_names_in_

Names of features seen during fit. Defined only when X has feature names that are all strings.

New in version 1.0.

Type

ndarray of shape (n_features_in_,)

n_samples_seen_

The number of samples processed by the estimator for each feature. If there are no missing samples, the n_samples_seen will be an integer, otherwise it will be an array of dtype int. If sample_weights are used it will be a float (if no missing data) or an array of dtype float that sums the weights seen so far. Will be reset on new calls to fit, but increments across partial_fit calls.

Type

int or ndarray of shape (n_features,)

See also

scale

Equivalent function without the estimator API.

PCA

Further removes the linear correlation across features with ‘whiten=True’.

Notes

NaNs are treated as missing values: disregarded in fit, and maintained in transform.

We use a biased estimator for the standard deviation, equivalent to numpy.std(x, ddof=0). Note that the choice of ddof is unlikely to affect model performance.

For a comparison of the different scalers, transformers, and normalizers, see examples/preprocessing/plot_all_scaling.py.

Examples

>>> from sklearn.preprocessing import StandardScaler
>>> data = [[0, 0], [0, 0], [1, 1], [1, 1]]
>>> scaler = StandardScaler()
>>> print(scaler.fit(data))
StandardScaler()
>>> print(scaler.mean_)
[0.5 0.5]
>>> print(scaler.transform(data))
[[-1. -1.]
 [-1. -1.]
 [ 1.  1.]
 [ 1.  1.]]
>>> print(scaler.transform([[2, 2]]))
[[3. 3.]]
fit(X, y=None, sample_weight=None)[source]

Compute the mean and std to be used for later scaling.

Parameters
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to compute the mean and standard deviation used for later scaling along the features axis.

  • y (None) – Ignored.

  • sample_weight (array-like of shape (n_samples,), default=None) –

    Individual weights for each sample.

    New in version 0.24: parameter sample_weight support to StandardScaler.

Returns

self – Fitted scaler.

Return type

object

inverse_transform(X, copy=None)[source]

Scale back the data to the original representation.

Parameters
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to scale along the features axis.

  • copy (bool, default=None) – Copy the input X or not.

Returns

X_tr – Transformed array.

Return type

{ndarray, sparse matrix} of shape (n_samples, n_features)
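
As a small hedged sketch (not part of the upstream docstring), inverse_transform undoes the centering and scaling, recovering the original data up to floating-point error:

>>> import numpy as np
>>> from sklearn.preprocessing import StandardScaler
>>> X = np.array([[1.0, 4.0], [2.0, 5.0], [3.0, 7.0]])
>>> scaler = StandardScaler().fit(X)
>>> np.allclose(scaler.inverse_transform(scaler.transform(X)), X)
True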

partial_fit(X, y=None, sample_weight=None)[source]

Online computation of mean and std on X for later scaling.

All of X is processed as a single batch. This is intended for cases when fit() is not feasible due to a very large number of samples or because X is read from a continuous stream.

The algorithm for incremental mean and std is given in Equation 1.5a,b of Chan, Tony F., Gene H. Golub, and Randall J. LeVeque. “Algorithms for computing the sample variance: Analysis and recommendations.” The American Statistician 37.3 (1983): 242-247.

Parameters
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to compute the mean and standard deviation used for later scaling along the features axis.

  • y (None) – Ignored.

  • sample_weight (array-like of shape (n_samples,), default=None) –

    Individual weights for each sample.

    New in version 0.24: parameter sample_weight support to StandardScaler.

Returns

self – Fitted scaler.

Return type

object
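
A hedged sketch of the streaming use case described above (not part of the upstream docstring): fitting in chunks with partial_fit should yield the same statistics as a single call to fit on the full data.

>>> import numpy as np
>>> from sklearn.preprocessing import StandardScaler
>>> rng = np.random.RandomState(0)
>>> X = rng.rand(100, 3)
>>> incremental = StandardScaler()
>>> for chunk in np.array_split(X, 5):  # process the data in 5 batches
...     _ = incremental.partial_fit(chunk)
>>> full = StandardScaler().fit(X)
>>> np.allclose(incremental.mean_, full.mean_) and np.allclose(incremental.scale_, full.scale_)
True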

transform(X, copy=None)[source]

Perform standardization by centering and scaling.

Parameters
  • X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The data used to scale along the features axis.

  • copy (bool, default=None) – Copy the input X or not.

Returns

X_tr – Transformed array.

Return type

{ndarray, sparse matrix} of shape (n_samples, n_features)