HTSeq-Hadoop: Extending HTSeq for massively parallel sequencing data analysis using Hadoop

Published: 2014-12-04

Formatted citation

Siretskiy A, Spjuth O. HTSeq-Hadoop: Extending HTSeq for massively parallel sequencing data analysis using Hadoop.
Proceedings - 2014 IEEE 10th International Conference on eScience, eScience 2014. 1, 317-323. (2014). DOI: 10.1109/eScience.2014.27

Abstract

Hadoop is a convenient framework in e-Science, enabling scalable distributed data analysis. In molecular biology, next-generation sequencing produces vast amounts of data and requires flexible frameworks for constructing analysis pipelines. We extend the popular HTSeq package into the Hadoop realm by introducing massively parallel versions of short read quality assessment as well as functionality to count genes mapped by the short reads. We use the Hadoop-streaming library, which allows the components to run on both Hadoop and regular Linux systems, and evaluate their performance in two different execution environments: a single node on a computational cluster and a Hadoop cluster in a private cloud. We compare our implementations with Apache Pig and show improved runtime performance for the methods we developed. We also integrate the components into the graphical platform Cloudgene to simplify user interaction. © 2014 IEEE.
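
The Hadoop-streaming approach mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' code and does not use HTSeq itself. It assumes a hypothetical tab-separated input of the form read_id, gene_id, where "-" marks a read with no assigned gene. The function names and the script name count_genes.py are likewise illustrative. The point is only that a mapper and a reducer that read standard input and write standard output can run unchanged under Hadoop streaming or in an ordinary Linux pipe.

```python
import sys
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'gene_id<TAB>1' for every record that names a gene."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Assumed layout: read_id <TAB> gene_id; '-' means no gene assigned.
        if len(fields) < 2 or fields[1] in ("", "-"):
            continue
        yield f"{fields[1]}\t1"


def reduce_lines(lines):
    """Reducer: sum the counts per gene. Input must be sorted by key,
    which Hadoop's shuffle phase (or `sort` in a shell pipe) guarantees."""
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for gene, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{gene}\t{sum(int(value) for _, value in group)}"


# Run as a script only when a role ('map' or 'reduce') is given.
if __name__ == "__main__" and len(sys.argv) > 1:
    step = {"map": map_lines, "reduce": reduce_lines}[sys.argv[1]]
    for out in step(sys.stdin):
        print(out)
```

On a regular Linux system the two steps could be chained as `python count_genes.py map < assignments.tsv | sort | python count_genes.py reduce`. Under Hadoop streaming, the same two commands would be supplied through the `-mapper` and `-reducer` options, with the framework handling the sort between them.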