Abstract: | High-throughput methods in various scientific fields have produced massive datasets over the past decade, and Big Data frameworks such as Apache Spark are a natural choice for enabling large-scale analysis. In scientific applications, many tools are highly optimized to simulate, or detect, some phenomenon that occurs in a certain system, and the effort of reimplementing such tools in Spark cannot be sustained by research groups. Application containers are gaining tremendous momentum, as they make it possible to wrap entire software stacks that can then be spun up and torn down on demand, in a matter of seconds. Docker has emerged as the most broadly used containerization tool, and it is the ideal candidate for wrapping scientific application stacks. At Uppsala University (Sweden) we developed EasyMapReduce, a Spark-based utility that runs Docker containers in MapReduce fashion in order to process large-scale distributed datasets. In this talk we will present the challenges that scientists face when running scientific tools over large datasets, and how EasyMapReduce helped us rapidly implement many use cases in our research group. In addition, we will discuss open challenges and future plans for the EasyMapReduce implementation.
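To make the pattern concrete, here is a minimal sketch in Scala of the general idea EasyMapReduce builds on: streaming a distributed dataset's records through a containerized tool using Spark's `RDD.pipe`. This is not EasyMapReduce's actual API; the image name `mytool:latest` and the HDFS paths are hypothetical, and the sketch assumes Docker is installed on every Spark worker.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative sketch (not the EasyMapReduce API): run a containerized
// scientific tool over a distributed dataset by piping each partition's
// records through a Docker container.
object ContainerMapSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("container-map").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical input path; one record per line.
    val records = sc.textFile("hdfs:///data/input")

    // Each worker streams its partition to the container's stdin ("-i")
    // and reads the tool's stdout lines back as the mapped result.
    val mapped = records.pipe(Seq("docker", "run", "--rm", "-i", "mytool:latest"))

    mapped.saveAsTextFile("hdfs:///data/output")
    spark.stop()
  }
}
```

Because the container only sees plain text on stdin/stdout, the scientific tool needs no Spark-specific changes; the reduce stage can then be expressed with ordinary Spark operations on the resulting RDD.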