Faculty of Informatics – Università della Svizzera italiana (USI)

Language-Agnostic Integrated Queries in GraalVM Languages


Language-integrated query (LINQ) frameworks offer a convenient programming abstraction for processing in-memory collections of data, allowing developers to concisely express declarative queries in a general-purpose programming language. Existing LINQ frameworks rely on the type system of statically typed languages such as C# or Java to perform query compilation and execution. As a consequence of this design, they do not support dynamic languages such as Python, R, or JavaScript. Such languages are, however, very popular among data scientists, who would benefit from LINQ frameworks in data-analytics applications. Beyond data analytics, supporting language-integrated queries in dynamic languages would also be useful in other contexts. For example, JavaScript and Node.js are widely used to implement data-intensive server-side applications.

A language-agnostic query engine could be implemented by following a canonical compiler design approach, i.e., implementing a common front-end query language (e.g., SQL), a common optimizer (e.g., a query planner), and a language-specific backend for each target language. However, such an approach would require substantial engineering effort, since many conceptually similar operations would have to be implemented in every backend. Moreover, integrating a new query operator would require extending all language-specific backends.
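The duplication cost described above can be illustrated with a minimal Python sketch. Everything in it is hypothetical and chosen only for illustration: `plan_query` stands in for the shared front end and optimizer, while `DictBackend` and `AttrBackend` stand in for two language-specific backends whose records are represented differently. It is not taken from any existing engine.

```python
from types import SimpleNamespace


def plan_query(field, threshold):
    """Stand-in for the shared front end and optimizer: returns a tiny logical plan."""
    return [("filter_gt", (field, threshold)), ("count", None)]


class DictBackend:
    """Hypothetical backend for a language whose records behave like dictionaries."""

    def run(self, plan, rows):
        for op, arg in plan:
            if op == "filter_gt":
                field, threshold = arg
                rows = [r for r in rows if r[field] > threshold]
            elif op == "count":
                return len(rows)
            else:
                raise NotImplementedError(op)
        return rows


class AttrBackend:
    """Hypothetical backend for a language whose records expose attributes."""

    def run(self, plan, rows):
        for op, arg in plan:
            if op == "filter_gt":
                field, threshold = arg
                # Same logic as DictBackend; only the field access differs.
                rows = [r for r in rows if getattr(r, field) > threshold]
            elif op == "count":
                return len(rows)
            else:
                raise NotImplementedError(op)
        return rows


plan = plan_query("price", 10)
dict_rows = [{"price": 5}, {"price": 20}, {"price": 30}]
attr_rows = [SimpleNamespace(price=p) for p in (5, 20, 30)]

dict_count = DictBackend().run(plan, dict_rows)
attr_count = AttrBackend().run(plan, attr_rows)
```

Both backends repeat the same filter and count logic, and a new operator (say, a sum) is rejected by each of them until every backend is extended separately.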

The GraalVM project and the Truffle framework, in particular their interoperability libraries, offer a great opportunity to address the need for LINQ frameworks in dynamic languages without incurring the penalties mentioned above. Specifically, query operators can be implemented as Truffle nodes, exploiting the automatic partial evaluation offered by the Graal compiler to generate efficient machine code for a given query. Moreover, thanks to the Truffle interoperability libraries, there is no need to replicate the implementation of similar operations in multiple backends, since entities of languages implemented with Truffle, e.g., objects and functions, can be accessed uniformly through those libraries.
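The benefit of a uniform access layer can be sketched in plain Python. This is only an analogy under stated assumptions: `read_member` is a hypothetical stand-in for an interoperability library, the operator classes are illustrative names rather than Truffle nodes or DynQ code, and no partial evaluation or machine-code generation takes place here.

```python
from collections.abc import Mapping
from types import SimpleNamespace


def read_member(obj, name):
    """Hypothetical interop stand-in: read a field regardless of representation."""
    if isinstance(obj, Mapping):
        return obj[name]
    return getattr(obj, name)


class Scan:
    """Leaf operator: produces the rows of an in-memory collection."""

    def __init__(self, rows):
        self.rows = rows

    def execute(self):
        yield from self.rows


class FilterGt:
    """Keeps only the rows whose field is greater than a threshold."""

    def __init__(self, child, field, threshold):
        self.child = child
        self.field = field
        self.threshold = threshold

    def execute(self):
        for row in self.child.execute():
            if read_member(row, self.field) > self.threshold:
                yield row


class Sum:
    """Root operator: sums one field over all rows produced by its child."""

    def __init__(self, child, field):
        self.child = child
        self.field = field

    def execute(self):
        return sum(read_member(row, self.field) for row in self.child.execute())


def total_above(rows, field, threshold):
    """Builds and runs the operator tree Sum(FilterGt(Scan(rows)))."""
    return Sum(FilterGt(Scan(rows), field, threshold), field).execute()


# Records with two different representations flow through the same operators.
mixed_rows = [{"price": 5}, SimpleNamespace(price=20), {"price": 30}]
total = total_above(mixed_rows, "price", 10)
```

Each operator is written once, and only `read_member` knows how a record is laid out, so supporting another representation or adding another operator does not multiply the implementation work.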

This approach is the key idea behind the design of DynQ, a novel query engine targeting GraalVM languages. DynQ can execute SQL queries that combine data from multiple sources, namely in-memory object collections, files, and external database systems. We have evaluated DynQ with in-memory data-intensive workloads on R and JavaScript. The evaluation on R targeted data-analytics workloads, comparing the performance of DynQ against the R data.table package and the embedded database DuckDB¹ on the TPC-H dataset, using a set of simple queries proposed in stream-fusion-engine² as a micro-benchmark and the TPC-H queries as a macro-benchmark. Our evaluation shows that DynQ outperforms both baselines on most of the queries. For JavaScript, we evaluated DynQ on the same micro-benchmark against hand-optimized JavaScript implementations as well as implementations that leverage Lodash, which is arguably the most efficient streaming library for JavaScript. Moreover, we also evaluated DynQ on existing codebases by reimplementing existing Node.js libraries using DynQ. Our evaluation shows that DynQ outperforms both baselines in all evaluated workloads. A research paper describing this work has been accepted at VLDB’21 [1]. An extended version has been published in The VLDB Journal [2].


Key Publications


[1] Filippo Schiavio, Daniele Bonetta, Walter Binder: Language-Agnostic Integrated Queries in a Managed Polyglot Runtime. Proc. VLDB Endow. 14(8): 1414-1426 (2021)

[2] Filippo Schiavio, Daniele Bonetta, Walter Binder: DynQ: A Dynamic Query Engine with Query-reuse Capabilities Embedded in a Polyglot Runtime. The VLDB Journal (2023)


Software


DynQ is released as open-source software on GitHub.


References


1 M. Raasveldt and H. Mühleisen. “Data Management for Data Science - Towards Embedded Analytics”. CIDR 2020.
2 A. Shaikhha, M. Dashti, and C. Koch. “Push versus Pull-based Loop Fusion in Query Engines”. Journal of Functional Programming 2018.