MaRe: Container-Based Parallel Computing with Data Locality

Published: 2020-04-20

Formatted citation

Capuccini M, Dahlö M, Toor S, and Spjuth O. MaRe: Container-Based Parallel Computing with Data Locality.
GigaScience. 9, 5, giaa042. (2020). DOI: 10.1093/gigascience/giaa042

Abstract

Application containers are emerging as key components in scientific processing, as they can improve the reproducibility and standardization of in-silico analysis. Chaining software tools in processing pipelines is common practice in scientific applications and, as application containers gain momentum, workflow systems are starting to support this emerging technology. Nevertheless, workflow systems fall short when it comes to data-intensive analysis, as they do not provide locality-aware scheduling for parallel workloads. To this end, Big Data cluster-computing frameworks, such as Apache Spark, represent a natural choice. However, even though these frameworks excel at parallelizing code blocks, they do not provide any support for parallelizing containerized tools. Here we introduce MaRe, which extends Apache Spark, providing an easy way to parallelize container-based analytics with transparent management of data locality. MaRe is Docker-compliant, and it can be used as a standalone solution as well as a workflow system add-on. We demonstrate MaRe on two data-intensive applications, in virtual drug screening and in predictive toxicology, showing good scalability. MaRe is generally applicable and available as open source.
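To make the idea of parallelizing a containerized tool over partitioned data more concrete, the following is a minimal Python sketch of the general map/reduce pattern. It is not MaRe's API and it does not use Apache Spark: the function names (`docker_runner`, `split_into_partitions`, `map_partitions`, `reduce_outputs`) are hypothetical, the partitions are processed sequentially on one machine, and data locality is not modeled at all. The sketch only assumes that a Docker CLI is available when the default runner is used.

```python
import subprocess
from functools import reduce
from typing import Callable, List, Sequence

# A runner takes (image, command, input_text) and returns the tool's stdout.
Runner = Callable[[str, str, str], str]


def docker_runner(image: str, command: str, input_text: str) -> str:
    """Run `command` in a fresh container, feeding the input on stdin.

    Assumption: the `docker` CLI is installed and `image` provides `sh`.
    """
    result = subprocess.run(
        ["docker", "run", "--rm", "-i", image, "sh", "-c", command],
        input=input_text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def split_into_partitions(records: Sequence[str], n: int) -> List[List[str]]:
    """Split records into n contiguous, roughly equal partitions."""
    size, extra = divmod(len(records), n)
    partitions, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        partitions.append(list(records[start:end]))
        start = end
    return partitions


def map_partitions(
    partitions: List[List[str]],
    image: str,
    command: str,
    runner: Runner = docker_runner,
) -> List[str]:
    """Apply the containerized command to each partition independently."""
    return [
        runner(image, command, "".join(record + "\n" for record in part))
        for part in partitions
    ]


def reduce_outputs(
    outputs: List[str],
    image: str,
    command: str,
    runner: Runner = docker_runner,
) -> str:
    """Combine partition outputs pairwise with a containerized command."""
    return reduce(lambda a, b: runner(image, command, a + b), outputs)
```

Because the runner is a parameter, the same functions can be exercised with a stand-in function instead of Docker. In a cluster-computing framework such as Apache Spark, the map step would instead be scheduled on the nodes that already hold each partition, which is the locality aspect the abstract refers to and which this sketch leaves out.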