Developer Spotlight: Baylor Computational Biologist on Porting Mercury to the Cloud

As users log in to DNAnexus to check out the Mercury pipeline and see how it could be applied to their own data, we sat down with Dr. Narayanan Veeraraghavan from Baylor’s Human Genome Sequencing Center (HGSC) to talk about details of the pipeline, how it was ported to our cloud platform, and upcoming feature additions.

Almost every sample sequenced at the HGSC (~24 terabases/month) is processed by Mercury, the production pipeline for analyzing next-generation sequencing data. The framework takes raw base calls from the sequencing instruments and runs a series of algorithms and tools to generate an annotated VCF file. This process includes steps for mapping to a reference genome with BWA, certain finishing steps for preparing the BAM file for realignment and recalibration with GATK, and variant calling and annotation using Atlas2 and Cassandra, tools developed at the HGSC.
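To make the flow described above concrete, here is a minimal sketch of the stage order as a plain data structure. It is an illustration only, not HGSC's implementation: the `Stage` class, the `describe` helper, and the intermediate file labels ("aligned BAM", "finished BAM", "recalibrated BAM") are hypothetical names chosen for this example, and the split of Atlas2 for calling and Cassandra for annotation is our reading of the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One step of the pipeline: what it reads and what it writes."""
    name: str
    tool: str
    consumes: str
    produces: str


# Stage order as described in the article. Intermediate labels are
# assumptions made for illustration; the finishing tools are not named
# in the article, so that entry is left unspecified.
MERCURY_STAGES = [
    Stage("mapping", "BWA", "raw base calls", "aligned BAM"),
    Stage("finishing", "unspecified", "aligned BAM", "finished BAM"),
    Stage("realignment and recalibration", "GATK",
          "finished BAM", "recalibrated BAM"),
    Stage("variant calling", "Atlas2", "recalibrated BAM", "VCF"),
    Stage("annotation", "Cassandra", "VCF", "annotated VCF"),
]


def describe(stages):
    """Check that each stage consumes what the previous one produces,
    then return the chain of data products as a readable string."""
    for prev, nxt in zip(stages, stages[1:]):
        if prev.produces != nxt.consumes:
            raise ValueError(
                f"{nxt.name!r} expects {nxt.consumes!r} "
                f"but {prev.name!r} produces {prev.produces!r}"
            )
    return " -> ".join([stages[0].consumes] + [s.produces for s in stages])


print(describe(MERCURY_STAGES))
```

Modeling the pipeline as an ordered list of stages with explicit inputs and outputs is one simple way to see where a workflow can be split into independent jobs, which is the kind of restructuring a cloud port involves.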

The pipeline runs well on the genome center’s local computational infrastructure, keeping pace with the production data flow rate. But some large projects, such as Baylor’s participation in the CHARGE consortium, can impose a substantial load on that infrastructure, inconveniencing researchers who use the local cluster for their own projects. To address such spikes in compute requirements and to explore scaling of compute infrastructure for future demands in next-gen sequencing, HGSC looked for a technology that would keep local operations running as usual while allowing the center to perform ambitious large-scale analysis. Enter cloud computing.

Veeraraghavan, who is a lead scientific programmer at the HGSC, was in charge of exploring the feasibility of operating on the cloud at scale. His immediate task was to take the existing Mercury pipeline and get it to run in the cloud. This was no small task, given Mercury’s complex, multi-component workflow and the fact that it was optimized to run on local compute infrastructure. Porting it meant reimagining the pipeline for a cloud environment and optimizing it for massively parallel computing.

Working closely with Andrew Carroll, a scientist at DNAnexus, Veeraraghavan says, “We optimized every component of the pipeline as we ported it to DNAnexus.” Becoming acquainted with the DNAnexus platform and getting the first implementation of Mercury up and running took just three weeks, far less time than Veeraraghavan was expecting given his previous efforts to port code to new environments. The DNAnexus team helped make sure that the new Mercury pipeline took advantage of smart ways to parallelize tasks and handle data. “That translates to faster turnaround time and lower cost,” Veeraraghavan says. “The most important part of my experience was the fantastic user support. The DNAnexus team was receptive to feature suggestions and quick to implement them. Working with DNAnexus was a real partnership.”

The CHARGE project data analysis was chosen as one of a few pilot projects to understand computing and collaborating on the cloud using DNAnexus; the successful effort wound up being the largest known genomic analysis conducted on Amazon Web Services, the cloud provider used by DNAnexus.

According to Veeraraghavan, new features are coming soon to the Mercury pipeline, including an RNA-seq workflow as well as tools for variant calling at the population level and on low depth-of-coverage whole genome sequences. These will be freely available to the public.

Other features that we helped HGSC enable include LIMS communication and a direct .bcl-to-cloud implementation in which the raw base calls from the sequencing instruments are uploaded directly to the cloud in real time, making the process of data analysis automatic, integrated, and seamless. These features, specific to HGSC, were made possible by our science team, which aims for a painless and straightforward integration for all DNAnexus customers.

Veeraraghavan, who has extensive programming expertise, says that DNAnexus is a good option for beginners and advanced users alike. For people like him, there’s virtually unlimited configurability via the command-line interface, metadata features, and APIs. “It can also be used by scientists who just want to pick and choose a few apps, stitch them together graphically into a workflow, and get going with their science without having to bother with installation and configuration,” he says. “It is extremely easy for a person with zero programming background to use the Mercury pipeline on DNAnexus. You can bring your data into a secure environment and use our tools without any hardware or software setup. This is the next generation of genomic analysis.”