Maintainer Guide

Creating a Ballet project

You and your hundred smartest colleagues want to collaborate on a feature engineering project. How will you organize your work? You are in the right place to learn. With the Ballet framework, contributors to your project will write self-contained feature engineering source code. Then, Ballet will take care of the rest: submitting proposed features as pull requests to your GitHub repository, carefully validating the proposed features, and combining all of the accepted features into a single feature engineering pipeline.

In this section, we will describe how the Ballet framework can be leveraged for your project, which we will call myproject.

Prerequisites

Before creating the project, the maintainer must have a training dataset used for developing features and details about the prediction problem they are ultimately trying to solve.

Then, install Ballet on your development machine.

Project instantiation

To instantiate a project, use the ballet quickstart command. You may want to look ahead to see which options are available for this command, such as automatically creating a GitHub repository for the project.

$ ballet quickstart
Generating new ballet project...
full_name [Your Name]: Jane Developer
email [you@example.com]: jane@developer.org
github_owner [jane]: jane_developer
project_name [Predict X]: Predict my thing
project_slug [ballet-predict-my-thing]: ballet-my-project
package_slug [predict_my_thing]: myproject
Select problem_type:
1 - classification
2 - regression
Choose from 1, 2 [1]: 2
Select classification_type:
1 - n/a
2 - binary
3 - multiclass
Choose from 1, 2, 3 [1]: 1
Select classification_scorer:
1 - n/a
2 - accuracy
3 - balanced_accuracy
4 - average_precision
5 - brier_score_loss
6 - f1
7 - f1_micro
8 - f1_macro
9 - f1_weighted
10 - f1_samples
11 - neg_log_loss
12 - precision
13 - precision_micro
14 - precision_macro
15 - precision_weighted
16 - precision_samples
17 - recall
18 - recall_micro
19 - recall_macro
20 - recall_weighted
21 - recall_samples
22 - roc_auc
Choose from 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22 [1]: 1
Select regression_scorer:
1 - n/a
2 - explained_variance
3 - neg_mean_absolute_error
4 - neg_mean_squared_error
5 - neg_mean_squared_log_error
6 - neg_median_absolute_error
7 - r2
Choose from 1, 2, 3, 4, 5, 6, 7 [1]: 5
Select pruning_action:
1 - no_action
2 - make_pull_request
3 - commit_to_master
Choose from 1, 2, 3 [1]: 3
Select auto_merge_accepted_features:
1 - no
2 - yes
Choose from 1, 2 [1]: 2
Select auto_close_rejected_features:
1 - no
2 - yes
Choose from 1, 2 [1]: 2
Generating new ballet project...DONE

This command uses cookiecutter to render a project template using information supplied by the project maintainer. The resulting files are then committed to a new git repository. Note that the specification of a scorer for the not-chosen problem type can be skipped (by selecting n/a).

Let’s see what files we have created:

$ tree -a ballet-my-project -I ".git|__pycache__"
ballet-my-project
├── .cookiecutter_context.json
├── .github
│   └── repolockr.yml
├── .gitignore
├── .travis.yml
├── README.md
├── ballet.yml
├── binder
│   ├── postBuild
│   ├── requirements.txt
│   └── workspace.json
├── notebooks
│   └── Analysis.ipynb
├── requirements-notebook.txt
├── requirements.txt
├── setup.py
├── src
│   └── myproject
│       ├── __init__.py
│       ├── __main__.py
│       ├── api.py
│       ├── features
│       │   ├── __init__.py
│       │   ├── contrib
│       │   │   └── __init__.py
│       │   └── encoder.py
│       └── load_data.py
└── tasks.py

7 directories, 21 files

Importantly, if you keep this project structure intact, Ballet will be able to automatically care for your feature engineering pipeline. A few of these files deserve special mention:

  • ballet.yml: a Ballet configuration file, with details about the prediction problem, the training data, and location of feature engineering source code.

  • .travis.yml: a Travis CI configuration file pre-configured to run a Ballet validation suite.

  • src/myproject/api.py: this is where Ballet will look for functionality implemented by your project, including a function to load training/test data or collected features. Stubs for this functionality are already provided by the template but you can further adapt them.

Project installation

For local development, you can then install your project. This will make your feature engineering pipeline accessible in interactive settings (Python interpreter, Jupyter notebook) and as a command-line tool.

$ cd ballet-my-project
$ conda create -n myproject -y && conda activate myproject  # or your preferred environment tool
(myproject) $ pip install invoke && invoke install

Collaboration via git and GitHub

Under the hood, contributors will collaborate using the powerful functionality provided by git and GitHub. In fact, after the quickstart step, you will already have a git-tracked repository and a git remote set up.

$ git log
commit 87fd82e58ea586337e32f88c5a251d26c47d6910
Author: Jane Developer <jane@developer.org>
Date:   Fri Apr 2 13:28:42 2021 -0400

    Automatically generated files from ballet quickstart


$ git remote -v
origin	git@github.com:jane_developer/ballet-my-project (fetch)
origin	git@github.com:jane_developer/ballet-my-project (push)

Automatic repository creation

The matching remote repository on GitHub must be created. This can be done automatically by the quickstart command by passing the --create-github-repo flag. This causes Ballet to use the GitHub API to create a repository under the account of the github_owner that you specified earlier (in this case, jane_developer), and then push the local repository to GitHub. You must provide a GitHub access token with the appropriate permissions, either by exposing the GITHUB_TOKEN environment variable, or by passing it to the quickstart command using the --github-token option. See more details on these options here.

Manual repository creation

Alternatively, you can manually create the repository on GitHub. Do not initialize the repository with any of the sample files that GitHub offers. Once you have created it, push your local copy.

$ git push --all origin

Enabling continuous integration

Ballet makes use of the continuous integration service Travis CI to validate code that contributors propose and to perform streaming feature definition selection. You must enable Travis CI for your project on GitHub by following these simple directions. You can skip any steps that have to do with customizing the .travis.yml file, as we have already done that for you in the quickstart.

Installing bots

Many Ballet projects use bots to assist maintainers.

1. Ballet bot. Install it here. Ballet bot will automatically merge or close PRs based on the CI test result and the project settings configured in the ballet.yml file.

2. Repolockr. Install it here. Repolockr checks every PR to ensure that “protected” files have not been changed. These are the files listed in the Repolockr config file on the master branch. A contributor might accidentally modify a protected file like ballet.yml, which could break the project or the CI pipeline; Repolockr will detect this and fail a PR that might otherwise pass by accident.
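The behavior of the two bots described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the bots' actual implementation: the function names, the PROTECTED list, and the return values are hypothetical. The auto_merge and auto_close arguments stand for the auto_merge_accepted_features and auto_close_rejected_features settings chosen during quickstart.

```python
from typing import Iterable, List

# Hypothetical list of protected paths; a real project keeps its list in the
# Repolockr config file on the master branch.
PROTECTED = ["ballet.yml", ".travis.yml"]


def changed_protected_files(
    changed: Iterable[str], protected: Iterable[str] = PROTECTED
) -> List[str]:
    """Return the protected paths that appear among a PR's changed files."""
    protected_set = set(protected)
    return sorted(path for path in changed if path in protected_set)


def bot_action(ci_passed: bool, auto_merge: bool, auto_close: bool) -> str:
    """Decide what to do with a PR from the CI result and project settings."""
    if ci_passed and auto_merge:
        return "merge"
    if not ci_passed and auto_close:
        return "close"
    # Otherwise leave the PR for a maintainer to handle manually.
    return "none"


# A PR that adds a feature but also touches ballet.yml would be flagged.
flagged = changed_protected_files(
    ["src/myproject/features/contrib/new_feature.py", "ballet.yml"]
)
```

With both settings enabled, as in the quickstart session earlier, a PR that passes CI would be merged and a PR that fails would be closed; with a setting disabled, the corresponding PRs are left for the maintainer.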

Configuring the project

Ballet allows you to configure many aspects of your project.

Configuration is stored in the ballet.yml file in the project root. More details about project configuration will be added soon.

Here is an incomplete list of configuration options, identified by the dotted keys from a root config object:

  • config.validation.project_structure_validator: fully-qualified name of the class used to validate changes to the project structure

  • config.validation.feature_api_validator: fully-qualified name of the class used to validate the feature API of new features

  • config.validation.feature_accepter: fully-qualified name of the class used to validate the ML performance of new features

  • config.validation.feature_pruner: fully-qualified name of the class used to prune existing features with respect to their ML performance

  • config.validation.split: the name of the data split used for validating contributions. It will be passed as a keyword argument to your load_data function, i.e. load_data(split=split). This split should probably appear under the list at config.data.splits.
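To make the dotted-key notation concrete, here is a minimal sketch that resolves such keys against a nested dictionary and passes the configured split to a load_data function. It assumes the parsed contents of ballet.yml behave like a nested mapping; the get_dotted helper, the example values, and the load_data stub are all hypothetical and are not part of the Ballet API.

```python
from functools import reduce
from typing import Any, Mapping

# Hypothetical stand-in for the parsed contents of ballet.yml; the values
# here are placeholders, not Ballet defaults.
config = {
    "validation": {"split": "train"},
    "data": {"splits": ["train", "test"]},
}


def get_dotted(root: Mapping[str, Any], dotted_key: str) -> Any:
    """Resolve a key such as 'config.validation.split' in a nested mapping."""
    parts = dotted_key.split(".")
    # The leading 'config' names the root object itself, so drop it.
    if parts and parts[0] == "config":
        parts = parts[1:]
    return reduce(lambda node, part: node[part], parts, root)


def load_data(split: str = "train") -> str:
    """Hypothetical load_data stub: returns a label instead of real data."""
    return f"data for split {split!r}"


split = get_dotted(config, "config.validation.split")
# The validation split should be one of the splits listed under data.splits.
assert split in get_dotted(config, "config.data.splits")
result = load_data(split=split)
```

The membership check mirrors the recommendation that the validation split appear in the list at config.data.splits.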

Developing new features

At this point, your feature engineering pipeline contains no features. How will your contributors add more?

Using any of a number of development workflows, contributors write new features and submit them to your project for validation. For more details on the contributor workflow, see Contributor Guide.

Validating features

Suppose the ballet-my-project repository has received a new pull request, which triggers an automatic evaluation:

  1. The PR is examined by the CI service.

  2. The ballet validate command is run, which validates the proposed feature contribution using functionality within the ballet.validation package.

  3. If the feature can be validated successfully, the PR passes, and the proposed feature can be merged into the project.

Pruning features

Once a feature has been accepted and merged into your project’s master branch, it may mean that an older feature has now become “redundant”: the new feature is providing all of the information contained in the old feature, and more.

  1. Each commit to master is examined by the CI service.

  2. The ballet validate command is run and automatically determines whether the commit is a merge commit that comes from merging an accepted feature.

  3. If so, then the set of existing features is pruned to remove redundant features.

  4. Pruned features are automatically deleted from your source repository by an automated service.
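Step 2 above depends on recognizing merge commits. In git, a merge commit is a commit with two or more parents, which can be read from the output of git rev-list --parents -n 1 <commit>. The sketch below parses one such output line; it illustrates the idea only, the is_merge_commit helper and the shortened hashes are hypothetical, and Ballet's own detection logic may differ.

```python
def is_merge_commit(rev_list_line: str) -> bool:
    """Return True if a `git rev-list --parents -n 1 <commit>` line
    lists two or more parents."""
    fields = rev_list_line.split()
    # fields[0] is the commit itself; the remaining fields are its parents.
    return len(fields) - 1 >= 2


# Made-up, shortened hashes for illustration.
ordinary = "87fd82e 1a2b3c4"
merge = "5d6e7f8 87fd82e 9a0b1c2"
```

An ordinary commit has a single parent, so only the second example line would be treated as a merge.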

Updating the framework

If there are updates to the Ballet framework after you have started working on your project, you can access them easily.

First, update the ballet package itself using the usual pip mechanism:

$ pip install --upgrade ballet

Pip will complain that the upgraded version of ballet is incompatible with the version required by the installed project. That is okay, as we will presently update the project itself to work with the new version of ballet.

Next, use the updated version of ballet to incorporate any updates to the “upstream” project template used to create new projects.

$ ballet update-project-template --push

This command will re-render the project template using the saved inputs you provided in the past, and then safely merge the result first into your project-template branch and then into your master branch. Finally, given the --push flag, it will push the updates to origin/master and origin/project-template. The usage of this command is described in more detail here.