ballet.discovery module

ballet.discovery.countunique(z, axis=0)
ballet.discovery.discover(features, X_df, y_df, y, input=None, primitive=None, expensive_stats=False)

Discover existing features

Display information about existing features including summary statistics on the development dataset. If the feature extracts multiple feature values, then the summary statistics (e.g. mean, std, nunique) are computed for each feature value and then averaged. If the development dataset cannot be loaded, computation of summary statistics is skipped.
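The averaging rule described above can be sketched in plain Python. This is only an illustration and does not call ballet: the helper name `averaged_stats`, the layout of `values` (one list of observations per extracted feature value), and the use of the sample standard deviation are all assumptions made for the example.

```python
from statistics import mean, stdev


def averaged_stats(values):
    """Compute each statistic per feature value, then average the results.

    `values` is a hypothetical layout: one list of observations for each
    feature value (output column) that the feature extracts.
    """
    return {
        "mean": mean(mean(col) for col in values),
        "std": mean(stdev(col) for col in values),
        "nunique": mean(len(set(col)) for col in values),
    }


# A vector-valued feature that extracts two feature values.
values = [
    [1.0, 2.0, 3.0, 4.0],
    [10.0, 10.0, 20.0, 20.0],
]
stats = averaged_stats(values)
```

For a scalar-valued feature, `values` would hold a single list, so the averaged statistic is just the statistic of that one column.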

The following information is shown:

  • name: the name of the feature

  • description: the description of the feature

  • input: the variables that are used as input to the feature

  • transformer: the transformer/transformer pipeline

  • output: the output columns of the feature (not usually specified)

  • author: the GitHub username of the feature’s author

  • source: the fully-qualified name of the Python module that contains the feature

  • mutual_information: estimated mutual information between the feature (or averaged over feature values) and the target on the development dataset split

  • conditional_mutual_information: estimated conditional mutual information between the feature (or averaged over feature values) and the target conditional on all other features on the development dataset split

  • ninputs: the number of input columns to the feature

  • nvalues: the number of feature values this feature extracts (i.e. 1 for a scalar-valued feature and >1 for a vector-valued feature)

  • ncontinuous: the number of feature values this feature extracts that are continuous-valued

  • ndiscrete: the number of feature values this feature extracts that are discrete-valued

  • mean: mean of the feature on the development dataset split

  • std: standard deviation of the feature (or averaged over feature values) on the development dataset split

  • var: variance of the feature (or averaged over feature values) on the development dataset split

  • min: minimum of the feature on the development dataset split

  • median: median of the feature (or median over feature values) on the development dataset split

  • max: maximum of the feature on the development dataset split

  • nunique: number of unique values of the feature (or averaged over feature values) on the development dataset split
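To give a feel for what the mutual_information column measures, here is a minimal sketch of a plug-in estimate for two discrete variables. It is not taken from ballet, whose estimator may well differ (for example, to handle continuous values). The function name `plugin_mutual_information` and the use of the natural logarithm (so the result is in nats) are assumptions made for illustration.

```python
from collections import Counter
from math import log


def plugin_mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in nats for paired discrete samples.

    Hypothetical helper for illustration only; it uses empirical
    frequencies as probabilities and is not ballet's estimator.
    """
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    # Sum p(x, y) * log(p(x, y) / (p(x) * p(y))) over observed pairs.
    return sum(
        (c / n) * log(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )


# A feature that determines the target exactly vs. one unrelated to it.
target = [0, 0, 1, 1]
mi_informative = plugin_mutual_information([0, 0, 1, 1], target)
mi_uninformative = plugin_mutual_information([0, 1, 0, 1], target)
```

Under these assumptions, a feature that is independent of the target scores zero, while a feature that fully determines a balanced binary target scores log 2.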

The following query operators are supported:

  • input (str): filter to only features that have the given input in their input or list of inputs

  • primitive (str): filter to only features that use the given primitive (i.e. a class with that name) in the transformer/transformer pipeline
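The effect of the two query operators can be sketched as follows. This is not ballet's implementation: the record layout (plain dicts with a `primitives` list of class names), the helper name `select`, and the choice to combine both filters with a logical AND are assumptions made for illustration. The parameter names mirror the `discover` signature, which is why `input` shadows the Python builtin inside the helper.

```python
# Hypothetical feature records; real ballet features are richer objects.
features = [
    {"name": "f1", "input": ["A", "B"],
     "primitives": ["SimpleImputer", "StandardScaler"]},
    {"name": "f2", "input": ["C"],
     "primitives": ["OneHotEncoder"]},
    {"name": "f3", "input": ["A"],
     "primitives": ["StandardScaler"]},
]


def select(features, input=None, primitive=None):
    """Return names of features matching the given query operators."""
    selected = []
    for feature in features:
        # Keep only features that list `input` among their inputs.
        if input is not None and input not in feature["input"]:
            continue
        # Keep only features whose pipeline uses a class named `primitive`.
        if primitive is not None and primitive not in feature["primitives"]:
            continue
        selected.append(feature["name"])
    return selected


by_input = select(features, input="A")
by_primitive = select(features, primitive="StandardScaler")
by_both = select(features, input="A", primitive="SimpleImputer")
```

Leaving both operators as None applies no filter, so every feature is kept.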

For other queries, use normal DataFrame indexing on the returned data frame (called features_df here):

>>> features_df[features_df['author'] == 'jane']
>>> features_df[features_df['name'].str.contains('married')]
>>> features_df[features_df['mutual_information'] > 0.05]
>>> features_df[features_df['input'].apply(
...     lambda input: 'A' in input and 'B' in input)]

Return type

DataFrame

Returns

data frame with features on the row index and columns as described above