Large-scale virtual screening on public cloud resources with Apache Spark

Published: 2017-03-06

Formatted citation

Capuccini M, Ahmed L, Schaal W, Laure E, Spjuth O. Large-scale virtual screening on public cloud resources with Apache Spark.
Journal of Cheminformatics. 9, 15. (2017). DOI: 10.1186/s13321-017-0204-4

Abstract

Structure-based virtual screening is an in-silico method to screen a target receptor against a virtual molecular library. Applying docking-based screening to large molecular libraries can be computationally expensive; however, it is a trivially parallelizable task. Most of the available parallel implementations are based on the Message Passing Interface (MPI) and rely on hardware with low failure rates and fast network connections. Google's MapReduce revolutionized large-scale analysis by enabling the processing of massive datasets on commodity hardware and cloud resources, with transparent scalability and fault tolerance provided at the software level. Open-source implementations of MapReduce include Apache Hadoop and the more recent Apache Spark.
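
To illustrate why docking-based screening counts as trivially parallelizable and how it maps onto the MapReduce model, here is a minimal sketch in plain Python. It is not the authors' implementation and does not use Spark: `mock_docking_score`, `map_partition`, `merge_top_k`, `screen`, and the toy SMILES library are all hypothetical names and data introduced for illustration, and the score is a deterministic placeholder rather than a real docking energy. Each partition is scored independently (the map step), and the partial top-k lists are then merged (the reduce step).

```python
import heapq
from functools import reduce


def mock_docking_score(smiles: str) -> float:
    """Hypothetical stand-in for a docking program; lower is treated as better."""
    # Deterministic toy value, NOT a real docking energy.
    return -float(len(smiles) % 7) - 0.1 * smiles.count("N")


def map_partition(molecules, k):
    """Map step: score one partition on its own and keep its k best hits."""
    scored = [(mock_docking_score(m), m) for m in molecules]
    return heapq.nsmallest(k, scored)


def merge_top_k(left, right, k):
    """Reduce step: merge two partial top-k lists into a single top-k list."""
    return heapq.nsmallest(k, left + right)


def screen(library, n_partitions=3, k=2):
    """Split the library, score each partition, then merge the partial results."""
    partitions = [library[i::n_partitions] for i in range(n_partitions)]
    # No partition depends on another, so a cluster could run these in parallel;
    # here they run sequentially to keep the sketch self-contained.
    partials = [map_partition(p, k) for p in partitions]
    return reduce(lambda a, b: merge_top_k(a, b, k), partials)


# Toy library of SMILES strings (illustrative only).
library = ["CCO", "CCN", "c1ccccc1", "CC(=O)O", "NCCN", "CCCCCC"]
top_hits = screen(library)
print(top_hits)
```

Because every molecule is scored without reference to any other, the result does not depend on how the library is partitioned. That independence is what lets a framework re-run a failed partition without restarting the whole screen.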