Large-Scale Virtual Screening with Spark

Authors: Valentin Georgiev, Staffan Arvidsson, Laeeq Ahmed, Marco Capuccini, Salman Toor, Wesley Schaal and Ola Spjuth
Abstract: Structure-based virtual screening (SBVS) is a computational approach widely used in drug discovery to identify drug leads from libraries of chemical structures. Because chemical space is enormous and today's molecular libraries are huge (tens of millions of compounds), virtual screening can be approached as a Big Data analytics problem. We present our recent work on developing SBVS tools in Apache Spark. Our hypothesis is that augmenting molecular docking with machine learning can reduce the computational cost relative to traditional SBVS, which follows a “dock all molecules” strategy. Since traditional machine learning methods output only point predictions, our strategy is to apply Inductive Conformal Prediction (ICP). These algorithms produce prediction regions that contain the true value with a specified probability. We developed libraries for massively parallel SBVS in Spark and for efficient computation of Signatures, both implemented in the Scala programming language. These libraries showed good scalability and ease of use in applications. Initial work on combining these components into cloud-ready pipelines for SBVS indicates good performance, scalability and implementation simplicity.
Venue: Swedish e-Science Academy 2015, Stockholm
Published: 7 Oct, 2015
DOI: 10.13140/RG.2.2.31574.55367
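
The abstract notes that Inductive Conformal Prediction yields prediction regions rather than point predictions. The following is a minimal, generic sketch of that idea for classification, not the authors' Scala/Spark implementation. The labels "high" and "low" (for high- and low-scoring docking candidates), the nonconformity scores, and the function names are all hypothetical and chosen for illustration only. The nonconformity score is assumed to be one minus the model's predicted probability for a label, and calibration scores are kept per label (a label-conditional variant).

```python
from typing import Dict, Sequence, Set


def p_value(calibration_scores: Sequence[float], test_score: float) -> float:
    """Conformal p-value: share of examples (calibration set plus the test
    example itself) that are at least as nonconforming as the test example."""
    n_at_least = sum(1 for s in calibration_scores if s >= test_score)
    return (n_at_least + 1) / (len(calibration_scores) + 1)


def prediction_set(
    calibration_scores_by_label: Dict[str, Sequence[float]],
    test_scores_by_label: Dict[str, float],
    significance: float,
) -> Set[str]:
    """Keep every label whose p-value exceeds the significance level.
    The confidence level is 1 - significance."""
    return {
        label
        for label, score in test_scores_by_label.items()
        if p_value(calibration_scores_by_label[label], score) > significance
    }


# Hypothetical toy data: nonconformity scores from a held-out calibration set.
calibration = {
    "high": [0.1, 0.2, 0.3, 0.4, 0.9],
    "low": [0.05, 0.15, 0.25, 0.6, 0.8],
}
# Hypothetical test molecule: the model gives "high" a probability of 0.85,
# so the assumed nonconformity scores are 0.15 for "high" and 0.85 for "low".
molecule = {"high": 0.15, "low": 0.85}

print(prediction_set(calibration, molecule, significance=0.2))
print(prediction_set(calibration, molecule, significance=0.1))
```

Lowering the significance level (that is, asking for higher confidence) can only enlarge the prediction set. In a screening setting, one plausible use is to dock only molecules whose prediction set contains the hypothetical "high" label, though the exact selection rule used in the poster is not stated in the abstract.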