DataFusion 2019

Earlier this year I put a lot of time and energy into DataFusion with the goal of creating a platform somewhat like Apache Spark, but implemented in Rust, without all the inefficiencies of the JVM. This was quite the journey, and I learned a lot of positive things from this effort, specifically:

As you can see from the GitHub activity on my personal account, the first six months of 2018 were a period of intense activity, all out-of-hours work in addition to my day job.

![GitHub Activity]({{ "/img/github2018.png" }})

I also learned some other lessons from this work:

I have pretty much stopped working on this project for the past five months, partly due to taking on more responsibilities in my day job, but also because I realized I had taken on too much and the project was no longer fun to work on. It had become like a second job in many ways. I had also coded myself into a corner with some of my design choices, producing code that was hard to maintain.

However, after taking a break to re-evaluate this, I am now getting ready to work on the next iteration of DataFusion but with a different approach and some different goals.

This time around, I am going to start with distributed deployment first, using Kubernetes to run instances of DataFusion workers. Queries will be executed by passing serialized logical or physical query plans to workers and receiving Arrow data back. With this in place it will be possible to implement true parallel/distributed query processing, which was always the goal. As part of this work I plan to contribute the Rust implementation of Arrow RPC / Flight now that Flatbuffers supports Rust, and also contribute towards interop testing to ensure that the Rust implementation of Arrow is compatible with the existing implementations.
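To make the plan-shipping idea concrete, here is a minimal sketch of how a client might serialize a logical query plan into a wire format before sending it to a worker. The `LogicalPlan` enum, its variants, and the text-based encoding are all hypothetical illustrations, not the actual DataFusion API; a real implementation would use a binary format such as protobuf or the Flatbuffers-based Arrow Flight protocol mentioned above.

```rust
// Hypothetical sketch: a tiny logical plan and a serializer for it.
// Names and wire format are illustrative, not the real DataFusion API.

#[derive(Debug, PartialEq)]
enum LogicalPlan {
    Scan { table: String },
    Filter { predicate: String, input: Box<LogicalPlan> },
    Projection { columns: Vec<String>, input: Box<LogicalPlan> },
}

// Serialize the plan into a simple prefix notation that a worker could
// parse back into a plan before executing it and streaming Arrow data back.
fn serialize(plan: &LogicalPlan) -> String {
    match plan {
        LogicalPlan::Scan { table } => format!("scan({})", table),
        LogicalPlan::Filter { predicate, input } => {
            format!("filter({},{})", predicate, serialize(input))
        }
        LogicalPlan::Projection { columns, input } => {
            format!("project([{}],{})", columns.join(";"), serialize(input))
        }
    }
}

fn main() {
    // SELECT id, amount FROM sales WHERE amount > 100
    let plan = LogicalPlan::Projection {
        columns: vec!["id".into(), "amount".into()],
        input: Box::new(LogicalPlan::Filter {
            predicate: "amount > 100".into(),
            input: Box::new(LogicalPlan::Scan { table: "sales".into() }),
        }),
    };
    // The client would send this string to a worker pod over the network.
    println!("{}", serialize(&plan));
    // prints: project([id;amount],filter(amount > 100,scan(sales)))
}
```

The key design point is that only the plan crosses the network, not the data: each worker deserializes the plan, executes it against its local partition, and streams Arrow record batches back to the coordinator.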

The biggest change, though, is that I am no longer going to promote DataFusion as a Spark replacement. I will treat it as a fun hobby project and take my time to build things correctly, one step at a time. I still hope that others will contribute and that one day this might become a useful tool for real-world projects, but primarily this is a way for me to continue learning about distributed data processing, query planning and optimization, and building real-world applications in Rust.

I’m also hoping that this project inspires others to start contributing to data processing tools in Rust in general. I still believe that Rust is ideally suited for distributed data processing as I mentioned in my Rust is for Big Data post at the start of 2018.