Ballista Reboot

In July 2019, I created Ballista as a small proof-of-concept of parallel query execution across a number of Rust executors running in Kubernetes. This PoC generated good discussion on Hacker News, which I felt demonstrated that there is real interest in a platform like this. Unfortunately, it was far from usable for anything real and lacked a well-designed architecture.

Over the past few months, I have had the opportunity to discuss this project with some really smart people in the industry and this has inspired me to reboot the project with a slightly different focus.

Ballista still aims to be a modern distributed compute platform with full support for Rust as a first-class citizen, but the emphasis now is on an architecture that is not tied to a single programming language. The reality is that many developers have existing code, tools, and language preferences for the particular types of workloads they need to run, and it shouldn't be necessary to use a single language end-to-end. I am now of the opinion that designing a distributed compute platform in [insert favorite programming language] is a mistake, even if the language is Rust!

Supporting multiple languages from the start helps to enforce an architecture that is language-agnostic.

The foundational technologies in Ballista are:

Ballista supports a number of query engines that can be deployed as executors, each of which accepts a query plan and executes it:

Ballista provides DataFrame APIs for Rust and Kotlin. The Kotlin DataFrame API can be used from Java, Kotlin, and Scala because Kotlin is 100% interoperable with Java.

The goal is for any client to be able to use any query engine.

I am currently working on two examples that demonstrate the capabilities of this architecture.

Rust Parallel Query Execution

This example is based on the original PoC and demonstrates running a simple HashAggregate query in parallel across a number of Rust executors that are deployed to a Kubernetes cluster. Because Ballista does not yet have any distributed query planning or scheduling capabilities, this example sends the query to each executor in parallel, collects the partial results locally, and then runs a secondary aggregate query to combine them into the final result.
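The two-phase pattern described above can be sketched in plain Rust. This is not Ballista's actual code: the executor fan-out is simulated with threads instead of a Kubernetes deployment, and all names here are illustrative. Each "executor" computes a partial HashAggregate (per-key sum and count) over its own partition, and the client merges the partials with a secondary aggregate to produce per-key averages.

```rust
use std::collections::HashMap;
use std::thread;

/// Partial aggregate: per-key (sum, count) over one partition.
/// This is the work each executor would do in parallel.
fn partial_aggregate(partition: &[(String, f64)]) -> HashMap<String, (f64, u64)> {
    let mut acc: HashMap<String, (f64, u64)> = HashMap::new();
    for (key, value) in partition {
        let entry = acc.entry(key.clone()).or_insert((0.0, 0));
        entry.0 += value;
        entry.1 += 1;
    }
    acc
}

/// Secondary aggregate: merge the partial (sum, count) pairs from every
/// executor, then compute the final per-key average.
fn final_aggregate(partials: Vec<HashMap<String, (f64, u64)>>) -> HashMap<String, f64> {
    let mut merged: HashMap<String, (f64, u64)> = HashMap::new();
    for partial in partials {
        for (key, (sum, count)) in partial {
            let entry = merged.entry(key).or_insert((0.0, 0));
            entry.0 += sum;
            entry.1 += count;
        }
    }
    merged
        .into_iter()
        .map(|(key, (sum, count))| (key, sum / count as f64))
        .collect()
}

fn main() {
    // Two partitions standing in for two executors in the cluster.
    let partitions = vec![
        vec![("a".to_string(), 1.0), ("b".to_string(), 2.0)],
        vec![("a".to_string(), 3.0), ("b".to_string(), 4.0)],
    ];

    // Query each "executor" in parallel.
    let handles: Vec<_> = partitions
        .into_iter()
        .map(|p| thread::spawn(move || partial_aggregate(&p)))
        .collect();
    let partials: Vec<_> = handles.into_iter().map(|h| h.join().unwrap()).collect();

    // Combine locally with the secondary aggregate query.
    let averages = final_aggregate(partials);
    println!("{:?}", averages); // avg per key: a -> 2.0, b -> 3.0
}
```

Note that the partials carry (sum, count) rather than averages: an average of averages would be wrong for uneven partitions, which is why the secondary aggregate operates on the raw accumulators.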

The full source code for this example can be found here.

Rust bindings for Apache Spark

This example demonstrates a Rust client performing a query against an Apache Spark executor. The Spark executor is essentially just a Spark driver that is hosting a Flight server that accepts queries and then executes them against the Spark context. It is conceptually similar to the Spark SQL Thrift Server.
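The shape of that interaction can be sketched as follows. This is a heavily simplified stand-in, not Ballista's actual API: the Flight RPC boundary is reduced to a trait call, and `RecordBatch`, `Executor`, and `MockSparkExecutor` are all hypothetical names. In the real example the client speaks the Arrow Flight protocol over gRPC to the Spark driver, which executes the query against its Spark context.

```rust
/// A minimal stand-in for an Arrow record batch: column names plus rows.
#[derive(Debug, PartialEq)]
struct RecordBatch {
    columns: Vec<String>,
    rows: Vec<Vec<i64>>,
}

/// The role any executor (Spark, a Rust engine, ...) plays behind a
/// Flight server: accept a query and return result batches.
trait Executor {
    fn execute(&self, sql: &str) -> RecordBatch;
}

/// A mock standing in for the Spark driver hosting a Flight server.
struct MockSparkExecutor;

impl Executor for MockSparkExecutor {
    fn execute(&self, _sql: &str) -> RecordBatch {
        // A real implementation would run the query against the Spark
        // context and stream Arrow data back; here we return a canned result.
        RecordBatch {
            columns: vec!["id".to_string(), "value".to_string()],
            rows: vec![vec![1, 100], vec![2, 200]],
        }
    }
}

fn main() {
    // The Rust client does not care which engine sits behind the trait.
    let executor: Box<dyn Executor> = Box::new(MockSparkExecutor);
    let batch = executor.execute("SELECT id, value FROM t");
    println!("{} rows", batch.rows.len());
}
```

The point of the abstraction is the one stated earlier: any client should be able to use any query engine, because the contract between them is a serialized query in and Arrow data out.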

The full source code for this example can be found here.

Future Work

I am planning to work on the following areas over the coming months, roughly in this order.

As I work through these items, I will continue to expand the content in my book How Query Engines Work, which is an introductory guide to the topic. The book explains concepts such as logical query plans versus physical query plans, as well as the query planning, optimization, and execution phases of a query engine. The book is published on the Leanpub platform, which provides free updates as new content becomes available.