Earlier this year I put a lot of time and energy into DataFusion with the goal of creating a platform somewhat like Apache Spark, but implemented in Rust, without all the inefficiencies of the JVM. This was quite the journey, and I learned a lot of positive things from this effort, specifically:
- I greatly increased my skills in the Rust programming language (I’m still no expert, but I would classify myself as at least competent and productive)
- I really understood the benefits of columnar data formats for the first time. I knew the theory, but I didn’t fully appreciate it until I started working with them directly
- I discovered the Apache Arrow project and was able to donate code to form the basis of the Rust implementation of Arrow, and later became a committer on the project
- I proved my initial hypothesis correct: a fairly simple Rust implementation could be significantly faster than Apache Spark. For certain queries, I saw order-of-magnitude improvements in performance.
As you can see from the activity on my personal GitHub account, the first six months of 2018 were a period of intense effort, all out-of-hours work in addition to my day job.
I also learned some other lessons from this work:
- There seems to be a real appetite in the community for something like this, but less of an appetite to contribute towards it
- Trying to build a distributed compute platform alone in my spare time is not a good idea, for many reasons
- To really build something world class, I would need to grow my skills in new areas, such as LLVM and SIMD
- Rather than building my own execution engine in Rust, it would be smarter to create Rust bindings for Gandiva
I have done little work on this project over the past five months, partly due to taking on more responsibilities in my day job, but also partly because I realized I had taken on too much and it was no longer fun to work on. It had become like a second job in many ways. I had also coded myself into a corner with some of my design choices, leaving code that was hard to maintain.
However, after taking a break to re-evaluate this, I am now getting ready to work on the next iteration of DataFusion but with a different approach and some different goals.
This time around, I am going to start with distributed deployment first, using Kubernetes to run instances of DataFusion workers. Queries will be executed by passing serialized logical or physical query plans to workers and receiving Arrow data back. With this in place it will be possible to implement true parallel/distributed query processing, which was always the goal. As part of this work I plan to contribute the Rust implementation of Arrow RPC / Flight now that Flatbuffers supports Rust, and also contribute towards interop testing to ensure that the Rust implementation of Arrow is compatible with the existing implementations.
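To make this concrete, here is a minimal sketch of the kind of logical plan a worker might receive. All of the names here are illustrative, not the actual DataFusion API, and the string "serializer" is a stand-in for a real binary format such as Flatbuffers:

```rust
// Hypothetical sketch of a logical query plan that could be serialized
// and shipped to a DataFusion worker. A real implementation would use
// Flatbuffers or similar, not strings.

#[derive(Debug, Clone, PartialEq)]
enum LogicalPlan {
    // Read a table from storage
    Scan { table: String },
    // Keep only rows matching a predicate
    Filter { predicate: String, input: Box<LogicalPlan> },
    // Select a subset of columns
    Projection { columns: Vec<String>, input: Box<LogicalPlan> },
}

// Toy serializer: flattens the plan tree into a string so it can be
// sent over the wire and reconstructed by a worker.
fn serialize(plan: &LogicalPlan) -> String {
    match plan {
        LogicalPlan::Scan { table } => format!("Scan({})", table),
        LogicalPlan::Filter { predicate, input } => {
            format!("Filter({}) <- {}", predicate, serialize(input))
        }
        LogicalPlan::Projection { columns, input } => {
            format!("Projection({}) <- {}", columns.join(","), serialize(input))
        }
    }
}

fn main() {
    // SELECT id, name FROM users WHERE id > 100
    let plan = LogicalPlan::Projection {
        columns: vec!["id".to_string(), "name".to_string()],
        input: Box::new(LogicalPlan::Filter {
            predicate: "id > 100".to_string(),
            input: Box::new(LogicalPlan::Scan {
                table: "users".to_string(),
            }),
        }),
    };
    println!("{}", serialize(&plan));
}
```

In this model, the coordinator builds and optimizes the plan, serializes it, and each worker deserializes its fragment, executes it, and streams Arrow record batches back.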
The biggest change, though, is that I am no longer going to promote DataFusion as a Spark replacement. I will treat this as a fun hobby project and take my time to build things correctly, one step at a time. I still hope that others will contribute and that one day this might become useful for real-world projects, but primarily this is a way for me to continue learning about distributed data processing, query planning and optimization, and building real-world applications in Rust.
I’m also hoping that this project inspires others to start contributing to data processing tools in Rust in general. I still believe that Rust is ideally suited for distributed data processing, as I argued in my "Rust is for Big Data" post at the start of 2018.
Want to learn more about query engines? Check out my book "How Query Engines Work".