Ballet

A lightweight framework for collaborative, open-source data science

A project from the Data to AI Lab at MIT

About

While the open-source model for software development has led to successful, large-scale collaborations in building software applications, chess engines, and scientific analyses, data science has not benefited from this development paradigm. In part, this is due to the divide between the development processes used by software engineers and those used by data scientists.

Ballet tries to address this disparity. It is a lightweight software framework that supports collaborative data science development by composing a data science pipeline from a collection of modular patches that can be written in parallel. Ballet provides the underlying functionality to support interactive development, test and merge high-quality contributions, and compose the accepted contributions into a single product.

We have deployed Ballet for feature engineering collaborations on tabular survey datasets of public interest. For example, ballet-predict-census-income is a large real-world collaborative project to engineer features from raw individual survey responses to the U.S. Census American Community Survey (ACS) and predict personal income. The resulting project is one of the largest data science collaborations on GitHub and outperforms state-of-the-art tabular AutoML systems and independent data science experts.

Interested in using Ballet for your own project? Get in touch!

How It Works

Create project

A maintainer with a dataset wants to mobilize the power of the data science crowd to solve a predictive modeling task. They use the Ballet CLI to render a new project from a provided template and push it to GitHub. Initially, the project contains a usable but empty feature engineering pipeline, along with an invitation to contribute.

Develop feature definitions

A developer interested in the project is tasked with defining individual Feature objects, the unit of contribution defined by Ballet for this project. They can launch the project in Assemblé, a custom development environment built on Binder and JupyterLab. Ballet’s high-level b client automatically detects the project configuration and supports them in exploring the data, developing candidate feature definitions, and validating them within their messy notebook, surfacing API and ML performance issues right away. Once they are satisfied, they can submit the feature definition alone by selecting the code cell and using Assemblé’s submit button.
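
To make the idea of a feature definition concrete, here is a minimal sketch of a feature as a pairing of input columns with a transformation. It uses only the Python standard library and does not call Ballet itself: the FeatureSketch class, the annual_hours function, and the column names weekly_hours and weeks_worked are all hypothetical, chosen for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A raw table is modeled as a dict mapping column name -> list of values.
Table = Dict[str, List[float]]


@dataclass
class FeatureSketch:
    """Hypothetical stand-in for a feature definition (not Ballet's own class)."""

    input: List[str]                         # raw columns the feature reads
    transformer: Callable[..., List[float]]  # maps those columns to one new column
    name: str                                # human-readable label

    def transform(self, table: Table) -> List[float]:
        # Select the declared input columns and pass them to the transformer.
        columns = [table[col] for col in self.input]
        return self.transformer(*columns)


def annual_hours(weekly_hours: List[float], weeks_worked: List[float]) -> List[float]:
    # Hypothetical feature logic: total hours worked in a year.
    return [h * w for h, w in zip(weekly_hours, weeks_worked)]


feature = FeatureSketch(
    input=["weekly_hours", "weeks_worked"],
    transformer=annual_hours,
    name="Annual hours worked",
)

raw = {"weekly_hours": [40.0, 20.0], "weeks_worked": [50.0, 10.0]}
values = feature.transform(raw)
```

Keeping each feature this small and self-contained is what would let many contributors write definitions in parallel without touching each other's code.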

Structured contributions

The selected code is automatically extracted and processed as a pull request following the project structure imposed by Ballet.

Automatic validation

Ballet runs a battery of unit and statistical tests to validate the submitted feature against both feature API and ML performance criteria.
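
The two kinds of checks can be illustrated with a small standard-library sketch. This is not Ballet's test suite: the function names, the specific API checks, the use of Pearson correlation as the statistical criterion, and the min_abs_corr threshold are all assumptions made for illustration.

```python
import math
from typing import List, Optional, Tuple


def check_feature_api(values: List[Optional[float]], n_rows: int) -> List[str]:
    """Unit-style checks on the feature output; an empty list means it passes."""
    problems = []
    if len(values) != n_rows:
        problems.append("wrong number of rows")
    if any(v is None or (isinstance(v, float) and math.isnan(v)) for v in values):
        problems.append("contains missing values")
    return problems


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def validate_feature(
    values: List[Optional[float]],
    target: List[float],
    min_abs_corr: float = 0.1,  # assumed threshold, for illustration only
) -> Tuple[bool, List[str]]:
    # Stage 1: API checks. Stop early so the statistical check sees clean data.
    problems = check_feature_api(values, len(target))
    if problems:
        return False, problems
    # Stage 2: a simple stand-in for a statistical test of predictive signal.
    if abs(pearson(values, target)) < min_abs_corr:
        return False, ["too little signal about the target"]
    return True, []


target = [10.0, 20.0, 30.0, 45.0]
accepted = validate_feature([1.0, 2.0, 3.0, 4.0], target)
rejected_constant = validate_feature([5.0, 5.0, 5.0, 5.0], target)
rejected_missing = validate_feature([1.0, None, 3.0, 4.0], target)
```

Running the API checks first means a malformed feature is rejected with a specific message before any statistical evaluation is attempted.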

Continuous delivery

Features that validate successfully can be automatically and safely merged by a bot.

Pipeline usage

The framework will now collect and compose this new feature into a feature engineering pipeline that can be used by the community for modeling of their own raw data. Due to continuous delivery, the pipeline can be installed from the default branch and always engineers high-quality features.
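
As a rough sketch of what composing accepted features into one pipeline could look like, the example below applies a list of features to a raw table and returns one engineered column per feature. It is plain Python rather than Ballet's API: compose_pipeline, the tuple layout, and the column and feature names are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# A raw table is modeled as a dict mapping column name -> list of values.
Table = Dict[str, List[float]]
# Hypothetical feature layout: (name, input columns, transformation function).
FeatureSpec = Tuple[str, List[str], Callable[..., List[float]]]


def compose_pipeline(features: List[FeatureSpec]) -> Callable[[Table], Table]:
    """Combine accepted features into one function from raw data to features."""

    def pipeline(raw: Table) -> Table:
        engineered = {}
        for name, inputs, fn in features:
            # Each feature reads only its declared input columns.
            engineered[name] = fn(*(raw[col] for col in inputs))
        return engineered

    return pipeline


accepted_features: List[FeatureSpec] = [
    ("age_squared", ["age"], lambda age: [a * a for a in age]),
    (
        "annual_hours",
        ["weekly_hours", "weeks_worked"],
        lambda hours, weeks: [h * w for h, w in zip(hours, weeks)],
    ),
]

pipeline = compose_pipeline(accepted_features)
raw = {
    "age": [30.0, 40.0],
    "weekly_hours": [40.0, 20.0],
    "weeks_worked": [50.0, 10.0],
}
engineered = pipeline(raw)
```

Because the pipeline is just the list of accepted features, merging a new validated feature extends it without changes to existing ones.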

Learn More

Software

Framework:

Projects:

Papers

Presentations

Team

Micah Smith

Project Lead

Research Affiliate, MIT

Jürgen Cito

Assistant Professor, TU Wien

Kelvin Lu

MEng, MIT

Andrea Ortner

Undergraduate Researcher, TU Wien

Kalyan Veeramachaneni

Principal Investigator, MIT

Contact

Funding

This work is supported in part by NSF Award 1761812.
