Maintainer Guide
Creating a Ballet project
You and your hundred smartest colleagues want to collaborate on a feature engineering project. How will you organize your work? You are in the right place to learn. With the Ballet framework, contributors to your project will write self-contained feature engineering source code. Then, Ballet will take care of the rest: submitting proposed features as pull requests to your GitHub repository, carefully validating the proposed features, and combining all of the accepted features into a single feature engineering pipeline.
In this section, we will describe how the Ballet framework can be leveraged for your project, which we will call myproject.
Prerequisites
Before creating the project, the maintainer must have a training dataset used for developing features and details about the prediction problem they are ultimately trying to solve.
Then, install Ballet on your development machine.
Project instantiation
To instantiate a project, use the ballet quickstart command. (You may want to look ahead to see what options are available for this command, such as automatically creating a GitHub repository for the project.) A sample session looks like this:
$ ballet quickstart
Generating new ballet project...
full_name [Your Name]: Jane Developer
email [you@example.com]: jane@developer.org
github_owner [jane]: jane_developer
project_name [Predict X]: Predict my thing
project_slug [ballet-predict-my-thing]: ballet-my-project
package_slug [predict_my_thing]: myproject
Select problem_type:
1 - classification
2 - regression
Choose from 1, 2 [1]: 2
Select classification_type:
1 - n/a
2 - binary
3 - multiclass
Choose from 1, 2, 3 [1]: 1
Select classification_scorer:
1 - n/a
2 - accuracy
3 - balanced_accuracy
4 - average_precision
5 - brier_score_loss
6 - f1
7 - f1_micro
8 - f1_macro
9 - f1_weighted
10 - f1_samples
11 - neg_log_loss
12 - precision
13 - precision_micro
14 - precision_macro
15 - precision_weighted
16 - precision_samples
17 - recall
18 - recall_micro
19 - recall_macro
20 - recall_weighted
21 - recall_samples
22 - roc_auc
Choose from 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22 [1]: 1
Select regression_scorer:
1 - n/a
2 - explained_variance
3 - neg_mean_absolute_error
4 - neg_mean_squared_error
5 - neg_mean_squared_log_error
6 - neg_median_absolute_error
7 - r2
Choose from 1, 2, 3, 4, 5, 6, 7 [1]: 5
Select pruning_action:
1 - no_action
2 - make_pull_request
3 - commit_to_master
Choose from 1, 2, 3 [1]: 3
Select auto_merge_accepted_features:
1 - no
2 - yes
Choose from 1, 2 [1]: 2
Select auto_close_rejected_features:
1 - no
2 - yes
Choose from 1, 2 [1]: 2
Generating new ballet project...DONE
This command uses cookiecutter to render a project template using information supplied by the project maintainer. The resulting files are then committed to a new git repository. Note that you can skip specifying a scorer for the problem type you did not choose by selecting n/a.
Let’s see what files we have created:
$ tree -a ballet-my-project -I ".git|__pycache__"
ballet-my-project
├── .cookiecutter_context.json
├── .github
│   └── repolockr.yml
├── .gitignore
├── .travis.yml
├── README.md
├── ballet.yml
├── binder
│   ├── postBuild
│   ├── requirements.txt
│   └── workspace.json
├── notebooks
│   └── Analysis.ipynb
├── requirements-notebook.txt
├── requirements.txt
├── setup.py
├── src
│   └── myproject
│       ├── __init__.py
│       ├── __main__.py
│       ├── api.py
│       ├── features
│       │   ├── __init__.py
│       │   ├── contrib
│       │   │   └── __init__.py
│       │   └── encoder.py
│       └── load_data.py
└── tasks.py
7 directories, 21 files
Importantly, as long as you keep this project structure intact, Ballet will be able to automatically care for your feature engineering pipeline. The most important files are:
ballet.yml: a Ballet configuration file, with details about the prediction problem, the training data, and location of feature engineering source code.
.travis.yml: a Travis CI configuration file pre-configured to run a Ballet validation suite.
src/myproject/api.py: this is where Ballet will look for functionality implemented by your project, including a function to load training/test data or collected features. Stubs for this functionality are already provided by the template but you can further adapt them.
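To illustrate the kind of stub you might adapt, here is a minimal sketch of a data-loading function. The file layout (one CSV file per split under a data directory), the target column name, and the input_dir parameter are all assumptions made for this example and are not taken from the Ballet template. Only the split keyword argument comes from this guide.

```python
import csv
from pathlib import Path

# Hypothetical layout: one CSV file per split, e.g. data/train.csv
DATA_DIR = Path("data")
# Hypothetical name of the column holding the prediction target
TARGET_COLUMN = "target"


def load_data(split="train", input_dir=None):
    """Return (rows, targets) for the requested data split.

    Each row is a dict of raw column values without the target column.
    """
    root = Path(input_dir) if input_dir is not None else DATA_DIR
    path = root / f"{split}.csv"
    if not path.exists():
        raise FileNotFoundError(f"no data file for split {split!r}: {path}")

    rows, targets = [], []
    with path.open(newline="") as f:
        for record in csv.DictReader(f):
            # Separate the target value from the raw input columns
            targets.append(record.pop(TARGET_COLUMN))
            rows.append(record)
    return rows, targets
```

A real project would more likely return tabular data structures suited to its modeling library; plain lists and dicts are used here only to keep the sketch dependency-free.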
Project installation
For local development, you can then install your project. This will make your feature engineering pipeline accessible in interactive settings (Python interpreter, Jupyter notebook) and as a command-line tool.
$ cd ballet-my-project
$ conda create -n myproject -y && conda activate myproject # or your preferred environment tool
(myproject) $ pip install invoke && invoke install
Collaboration via git and GitHub
Under the hood, contributors will collaborate using the powerful functionality provided by git and GitHub. In fact, after the quickstart step, you will already have a git-tracked repository and a git remote set up.
$ git log
commit 87fd82e58ea586337e32f88c5a251d26c47d6910
Author: Jane Developer <jane@developer.org>
Date: Fri Apr 2 13:28:42 2021 -0400
Automatically generated files from ballet quickstart
$ git remote -v
origin git@github.com:jane_developer/ballet-my-project (fetch)
origin git@github.com:jane_developer/ballet-my-project (push)
Automatic repository creation
The matching remote repository on GitHub must be created. This can be done automatically by the quickstart command by passing the --create-github-repo flag. This causes Ballet to use the GitHub API to create a repository under the account of the github_owner that you specified earlier (in this case, jane_developer), and then push the local repository to GitHub. You must provide a GitHub access token with the appropriate permissions, either by exposing the GITHUB_TOKEN environment variable, or by passing it to the quickstart command using the --github-token option. See more details on these options here.
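Before running the quickstart with --create-github-repo, it can be useful to confirm that a token is actually available. The following is a small preflight sketch, not part of Ballet: the function name is hypothetical, and the precedence shown (an explicitly passed token wins over the GITHUB_TOKEN environment variable) is an assumption for illustration rather than documented Ballet behavior.

```python
import os


def resolve_github_token(cli_token=None, environ=None):
    """Return the GitHub token to use, or raise if none is available.

    Assumption: a token passed explicitly takes precedence over the
    GITHUB_TOKEN environment variable.
    """
    environ = os.environ if environ is None else environ
    token = cli_token or environ.get("GITHUB_TOKEN")
    if not token:
        raise RuntimeError(
            "No GitHub token found: set GITHUB_TOKEN or pass --github-token"
        )
    return token
```

Calling resolve_github_token() with no arguments checks the current environment, so it can be run in the same shell session in which you plan to run the quickstart.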
Manual repository creation
Alternatively, you can manually create the repository on GitHub. Do not initialize the repository with any of the sample files that GitHub offers. Once the empty repository exists, push your local copy.
$ git push --all origin
Enabling continuous integration
Ballet makes use of the continuous integration service Travis CI to validate code that contributors propose and to perform streaming feature definition selection. You must enable Travis CI for your project on GitHub by following these simple directions. You can skip any steps that have to do with customizing the .travis.yml file, as we have already done that for you in the quickstart.
Installing bots
Many Ballet projects use bots to assist maintainers.
1. Ballet bot. Install it here. Ballet bot will automatically merge or close PRs based on the CI test result and the project settings configured in the ballet.yml file.
2. Repolockr. Install it here. Repolockr checks every PR to ensure that “protected” files have not been changed. These are files listed in the Repolockr config file on the master branch. A contributor might accidentally modify a protected file like ballet.yml, which could break the project or the CI pipeline; Repolockr will detect this and fail a PR that might otherwise pass by accident.
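The idea behind the protected-files check can be sketched in a few lines. This is an illustration of the concept only and not Repolockr's actual implementation; the function name and the example file lists are hypothetical.

```python
def find_protected_changes(changed_files, protected_files):
    """Return the sorted list of protected files that a PR has modified."""
    protected = set(protected_files)
    return sorted(path for path in changed_files if path in protected)


# Hypothetical example: a PR that adds a feature but also edits ballet.yml
changed = ["src/myproject/features/contrib/new_feature.py", "ballet.yml"]
protected = ["ballet.yml", ".travis.yml"]
violations = find_protected_changes(changed, protected)
# A non-empty result would correspond to a failed check on the PR
```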
Configuring the project
Ballet allows you to configure many aspects of your project.
Configuration is stored in the ballet.yml file in the project root. More details about project configuration will be added soon.
Here is an incomplete list of configuration options, identified by the dotted keys from a root config object:
config.validation.project_structure_validator: fully-qualified name of the class used to validate changes to the project structure
config.validation.feature_api_validator: fully-qualified name of the class used to validate the feature API of new features
config.validation.feature_accepter: fully-qualified name of the class used to validate the ML performance of new features
config.validation.feature_pruner: fully-qualified name of the class used to prune existing features with respect to their ML performance
config.validation.split: the name of the data split used for validating contributions. It will be passed as a keyword argument to your load_data function, i.e. load_data(split=split). This split should probably appear under the list at config.data.splits.
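To make the dotted-key notation and the "fully-qualified name of the class" wording concrete, here is a minimal sketch using only the standard library. It does not use Ballet's own configuration loader. The nested dict stands in for a parsed ballet.yml, the helper names are hypothetical, and collections.OrderedDict is used purely as a stand-in for a validator class so that the example runs anywhere.

```python
import importlib

# Hypothetical excerpt of a parsed ballet.yml, shown as a nested dict
config = {
    "validation": {
        "split": "train",
        # Stand-in value; a real project would name a validator class here
        "feature_accepter": "collections.OrderedDict",
    },
    "data": {"splits": ["train", "test"]},
}


def get_option(config, dotted_key):
    """Look up a dotted key such as 'validation.split' in a nested dict."""
    node = config
    for part in dotted_key.split("."):
        node = node[part]
    return node


def import_class(qualified_name):
    """Import a class from its fully-qualified name, e.g. 'pkg.module.Class'."""
    module_name, _, class_name = qualified_name.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


split = get_option(config, "validation.split")
accepter_class = import_class(get_option(config, "validation.feature_accepter"))
```

Here the keys are written relative to the root config object, so config.validation.split in the list above corresponds to the path validation.split in this sketch.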
Developing new features
At this point, your feature engineering pipeline contains no features. How will your contributors add more?
Using any of a number of development workflows, contributors write new features and submit them to your project for validation. For more details on the contributor workflow, see the Contributor Guide.
Validating features
The ballet-my-project repository has received a new pull request, which triggers an automatic evaluation.
1. The PR is examined by the CI service.
2. The ballet validate command is run, which validates the proposed feature contribution using functionality within the ballet.validation package.
3. If the feature can be validated successfully, the PR passes, and the proposed feature can be merged into the project.
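What happens to the PR after validation depends on the auto_merge_accepted_features and auto_close_rejected_features settings chosen during the quickstart. The following sketch shows one way that decision could be expressed. It is an illustration only, not Ballet bot's actual code; the function name and the returned action labels are hypothetical.

```python
def decide_pr_action(validation_passed,
                     auto_merge_accepted_features,
                     auto_close_rejected_features):
    """Map a validation outcome and project settings to an action on the PR."""
    if validation_passed and auto_merge_accepted_features:
        return "merge"
    if not validation_passed and auto_close_rejected_features:
        return "close"
    # Otherwise the PR is left for a maintainer to handle manually
    return "leave_open"


# With both settings enabled, as in the quickstart session in this guide:
action_on_success = decide_pr_action(True, True, True)
action_on_failure = decide_pr_action(False, True, True)
```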
Pruning features
Once a feature has been accepted and merged into your project’s master branch, it may mean that an older feature has now become “redundant”: the new feature is providing all of the information contained in the old feature, and more.
1. Each commit to master is examined by the CI service.
2. The ballet validate command is run and automatically determines whether the commit is a merge commit that comes from merging an accepted feature.
3. If so, then the set of existing features is pruned to remove redundant features.
4. Pruned features are automatically deleted from your source repository by an automated service.
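The merge-commit check in the steps above can be illustrated with plain git: a merge commit is a commit with two or more parents. The sketch below is not how Ballet implements the check, and it does not decide whether the merge came from an accepted feature. The function names are hypothetical; it relies only on the output format of git rev-list --parents, which prints a commit hash followed by the hashes of its parents.

```python
import subprocess


def is_merge_commit(rev_list_line):
    """Return True if a `git rev-list --parents -n 1 <commit>` line
    lists two or more parent hashes after the commit hash."""
    commit, *parents = rev_list_line.split()
    return len(parents) >= 2


def read_parents_line(commit="HEAD"):
    """Ask git for the commit hash and its parents (requires a git repository)."""
    result = subprocess.run(
        ["git", "rev-list", "--parents", "-n", "1", commit],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

Inside a repository, is_merge_commit(read_parents_line("HEAD")) reports whether the latest commit is a merge.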
Updating the framework
If there are updates to the Ballet framework after you have started working on your project, you can access them easily.
First, update the ballet package itself using the usual pip mechanism:
$ pip install --upgrade ballet
Pip will complain that the upgraded version of ballet is incompatible with the version required by the installed project. That is okay, as we will presently update the project itself to work with the new version of ballet.
Next, use the updated version of ballet to incorporate any updates to the “upstream” project template used to create new projects.
$ ballet update-project-template --push
This command will re-render the project template using the saved inputs you have provided in the past and then safely merge it first to your project-template branch and then to your master branch. Finally, given the --push flag, it will push updates to origin/master and origin/project-template. The usage of this command is described in more detail here.