ballet.eng.external.tsfresh module

class ballet.eng.external.tsfresh.FeatureAugmenter(default_fc_parameters=None, kind_to_fc_parameters=None, column_id=None, column_sort=None, column_kind=None, column_value=None, timeseries_container=None, chunksize=None, n_jobs=1, show_warnings=False, disable_progressbar=False, impute_function=None, profile=False, profiling_filename='profile.txt', profiling_sorting='cumulative')

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Sklearn-compatible estimator that calculates many features from a given time series and adds them to the data. It is essentially a wrapper around extract_features().

The features range from basic ones such as min, max, or median to advanced ones such as Fourier transforms or statistical tests. For a list of all available features, see the module feature_calculators. The column name of each added feature contains the name of the function in that module that was used for the calculation.

For this estimator, two datasets play a crucial role:

  1. the time series container with the time series data. This container (for the format, see the tsfresh documentation on data formats) holds the data used for calculating the features. It must be groupable by ids, which determine which features are attached to which row of the second dataset.

  2. the input data X, to which the features will be added. Its rows are identified by the index, and each index value in X must be present as an id in the time series container.
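The relationship between the two datasets can be sketched without tsfresh at all. The following is an illustrative stand-in, not the library's implementation: it groups a long-format container by id, computes a few basic statistics per id, and attaches them to the matching row of X. The column names (id, time, price), the helper augment, and the decision to raise a ValueError for missing ids are all assumptions made for this sketch; the price__minimum style of naming only imitates the convention that a feature column carries the name of its calculator function.

```python
from collections import defaultdict
from statistics import median

# Time series container in long format: one record per observation.
container = [
    {"id": "AAA", "time": 0, "price": 1.0},
    {"id": "AAA", "time": 1, "price": 3.0},
    {"id": "AAA", "time": 2, "price": 2.0},
    {"id": "BBB", "time": 0, "price": 10.0},
    {"id": "BBB", "time": 1, "price": 8.0},
]

# Input data X: one row of meta features per id, keyed by its index value.
X = {
    "AAA": {"started_since_days": 400},
    "BBB": {"started_since_days": 120},
}


def augment(X, container, column_id="id", column_value="price"):
    """Hypothetical helper: attach per-id summary features to the rows of X."""
    # Group the observations by id.
    series = defaultdict(list)
    for record in container:
        series[record[column_id]].append(record[column_value])

    # Every index value of X needs a matching id in the container.
    missing = set(X) - set(series)
    if missing:
        raise ValueError(f"ids without time series data: {sorted(missing)}")

    # Copy each row of X and add the features computed for its id.
    augmented = {}
    for key, row in X.items():
        values = series[key]
        augmented[key] = {
            **row,
            f"{column_value}__minimum": min(values),
            f"{column_value}__maximum": max(values),
            f"{column_value}__median": median(values),
        }
    return augmented


X_augmented = augment(X, container)
```

Each row of X_augmented keeps its original meta features and gains one column per computed statistic, which mirrors the shape of what transform() returns.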

Imagine the following situation: you want to classify 10 different financial shares, and you have their development over the last year as a time series. You would start by creating features from the meta information of the shares (e.g. how long they have been on the market) and filling a table with them, one stock per row. This is the input array X, with each row identified by an index such as the stock name.

>>> import pandas
>>> df = pandas.DataFrame(index=["AAA", "BBB", ...])
>>> # Fill in the information of the stocks
>>> df["started_since_days"] = ... # add a feature

You can then extract all the features from the time development of the shares by using this estimator. The time series container must include a column of ids that match the index of X.

>>> time_series = read_in_timeseries() # get the development of the shares
>>> from tsfresh.transformers import FeatureAugmenter
>>> augmenter = FeatureAugmenter(column_id="id")
>>> augmenter.set_timeseries_container(time_series)
>>> df_with_time_series_features = augmenter.transform(df)

The settings for the feature calculation can be controlled with the default_fc_parameters and kind_to_fc_parameters arguments. If you pass None, the default settings are used. See ComprehensiveFCParameters for more information.

This estimator does not select the relevant features; it calculates all of them and adds them to the DataFrame. See RelevantFeatureAugmenter for calculating and selecting features.

For a description of what the parameters column_id, column_sort, column_kind and column_value mean, see the documentation of extract_features() in the extraction module.

fit(X=None, y=None)

This estimator does not need to be fitted. The method does nothing and exists only for compatibility reasons.

Parameters
  • X (Any) – Unneeded.

  • y (Any) – Unneeded.

Returns

The estimator instance itself

Return type

FeatureAugmenter

set_timeseries_container(timeseries_container)

Set the time series from which the features will be calculated. For the format of the time series container, see extraction. The time series must contain the same ids as the index of the DataFrame to which the features will later be added (the one you pass to transform()). You can call this function as often as you like to change the time series later (e.g. if you want to extract features for different ids).

Parameters

timeseries_container (pandas.DataFrame or dict) – The timeseries as a pandas.DataFrame or a dict. See extraction for the format.

Returns

None

Return type

None

transform(X)

Calculate the features using the timeseries_container and add them to the corresponding rows of the input pandas.DataFrame X.

To save computing time, include only the time series you need in the container. You can set the time series container with the method set_timeseries_container().

Parameters

X (pandas.DataFrame) – The DataFrame to which the calculated time series features will be added. This is not the DataFrame that contains the time series itself.

Returns

The input DataFrame, but with added features.

Return type

pandas.DataFrame