Abstract: | Datasets used for predictive modeling in biomedical science are becoming larger and are continuously updated. This presents challenges when the aim is to provide updated predictive models based on this data, and requires that preprocessing and modeling are automated and run on large-scale e-infrastructures. I will present the research in my lab on methods for enabling such automation, including scientific workflows and Big Data analytics frameworks (primarily Apache Spark) running on high-performance computing as well as public and private cloud resources. We also work on the modeling lifecycle and have developed means for automated versioning, archiving and publishing of predictive models so that they are readily accessible from e.g. the Bioclipse workbench. The ChEMBL database is a valuable resource to us, and I will present some of our previous and ongoing projects where automation aids in building predictive models on interaction data, e.g. for ligand-based target prediction. I will also present some of our work on interpretable machine-learning models with confidence intervals. |