Improving Code Generation in Spark SQL

Big-data systems have gained significant momentum, and Apache Spark is becoming a de-facto standard for modern data analytics. Spark relies on SQL query compilation to optimize the execution performance of analytical workloads on a variety of data sources. Despite its scalable architecture, Spark’s SQL code generation suffers from significant runtime overheads related to data access and de-serialization. Such performance penalty can be significant, especially when applications operate on human-readable data formats such as CSV or JSON.

We developed a new approach to query compilation that overcomes these limitations by relying on run-time profiling and dynamic code generation leveraging the Truffle framework and the Graal compiler. We evaluated our new SQL compiler for Spark on the TPC-H benchmark with textual-form data formats such as CSV or JSON. Our evaluation shows speedups up to 4.4x executing Spark in distributed mode and up to 8.45x executing Spark in local mode. This work has been published in VLDB’20 [1].

Key Publications

[1] Filippo Schiavio, Daniele Bonetta, Walter Binder:
Dynamic Speculative Optimizations for SQL Compilation in Apache Spark. Proc. VLDB Endow. 13(5): 754-767 (2020) [pdf][video][slides]

Other Resources

Video and slides of the talk at VMM’20