Developer Spotlight: Baylor Computational Biologist on Porting Mercury to the Cloud

As users log in to DNAnexus to check out the Mercury pipeline and see how it could be applied to their own data, we sat down with Dr. Narayanan Veeraraghavan from Baylor’s Human Genome Sequencing Center (HGSC) to talk about details of the pipeline, how it was ported to our cloud platform, and upcoming feature additions.

Almost every sample sequenced at the HGSC (~24 terabases/month) is processed by Mercury, the production pipeline for analyzing next-generation sequencing data. The framework takes raw base calls from the sequencing instruments and runs a series of algorithms and tools to generate an annotated VCF file. This process includes steps for mapping to a reference genome with BWA, certain finishing steps for preparing the BAM file for realignment and recalibration with GATK, and variant calling and annotation using Atlas2 and Cassandra, tools developed at the HGSC.
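For readers who think in code, the stage order described above can be sketched as a small data structure. This is only an illustration, not the actual Mercury implementation: the `Stage` class, the `MERCURY_STAGES` list, and the `describe` helper are hypothetical names, the intermediate file types are assumptions, and pairing Atlas2 with variant calling and Cassandra with annotation is our reading of the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One step of the pipeline (hypothetical model, for illustration only)."""
    name: str
    tool: str
    output: str


# Order follows the prose description; intermediate output types are assumptions.
MERCURY_STAGES = [
    Stage("mapping", "BWA", "BAM"),
    Stage("finishing", "unspecified", "BAM"),
    Stage("realignment and recalibration", "GATK", "BAM"),
    Stage("variant calling", "Atlas2", "VCF"),
    Stage("annotation", "Cassandra", "annotated VCF"),
]


def describe(stages, source="raw base calls"):
    """Return one line per stage showing what goes in and what comes out."""
    lines = []
    current = source
    for stage in stages:
        lines.append(f"{current} -> [{stage.name}: {stage.tool}] -> {stage.output}")
        current = stage.output
    return lines


if __name__ == "__main__":
    for line in describe(MERCURY_STAGES):
        print(line)
```

Running the script prints one line per stage, starting from the raw base calls and ending at the annotated VCF file.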

The pipeline runs well on the genome center’s local computational infrastructure, keeping pace with the production data flow rate. But some large projects, such as Baylor’s participation in the CHARGE consortium, can impose substantial load on that infrastructure, inconveniencing researchers who use the local cluster for their own projects. To address such spikes in compute requirements and to explore scaling of compute infrastructure for future demands in next-gen sequencing, HGSC looked for a technology that would keep local operations running as usual while allowing the center to perform ambitious large-scale analysis. Enter cloud computing.

Veeraraghavan, who is a lead scientific programmer at the HGSC, was in charge of exploring the feasibility of operating on the cloud at scale. His immediate task was to take the existing Mercury pipeline and get it to run in the cloud. This was no small task, given Mercury’s complex, multi-component workflow and the fact that it was optimized to run on local compute infrastructure. Porting it meant reimagining the pipeline for a cloud environment and optimizing it for massively parallel computing.

Working closely with Andrew Carroll, a scientist at DNAnexus, Veeraraghavan says, “We optimized every component of the pipeline as we ported it to DNAnexus.” Becoming acquainted with the DNAnexus platform and getting the first implementation of Mercury up and running took just three weeks, far less time than Veeraraghavan was expecting given his previous efforts to port code to new environments. The DNAnexus team helped make sure that the new Mercury pipeline took advantage of smart ways to parallelize tasks and handle data. “That translates to faster turnaround time and lower cost,” Veeraraghavan says. “The most important part of my experience was the fantastic user support. The DNAnexus team was receptive to feature suggestions and quick to implement them. Working with DNAnexus was a real partnership.”

The CHARGE project data analysis was chosen as one of a few pilot projects to understand computing and collaborating on the cloud using DNAnexus; the successful effort wound up being the largest known genomic analysis conducted on Amazon Web Services, the cloud provider used by DNAnexus.

According to Veeraraghavan, new features are coming soon to the Mercury pipeline, including an RNA-seq workflow as well as tools for variant calling at the population level and on low depth-of-coverage whole genome sequences. These will be freely available to the public.

Other features that we helped HGSC enable include LIMS communication and a direct .bcl-to-cloud implementation in which the raw base calls from the sequencing instruments are uploaded directly to the cloud in real time, making the process of data analysis automatic, integrated, and seamless. These features, specific to HGSC, were made possible by our science team, which aims for a painless and straightforward integration for all DNAnexus customers.

Veeraraghavan, who has extensive programming expertise, says that DNAnexus is a good option for beginners and advanced users alike. For people like him, there’s virtually unlimited configurability via the command line interface, metadata features, and APIs. “It can also be used by scientists who just want to pick and choose a few apps, stitch them together graphically into a workflow, and get going with their science without having to bother with installation and configuration,” he says. “It is extremely easy for a person with zero programming background to use the Mercury pipeline on DNAnexus. You can bring your data into a secure environment and use our tools without any hardware or software setup. This is the next generation of genomic analysis.”

Developer Spotlight: A De Novo Assembler Named Ray

We recently launched the DNAnexus developer program, and to our delight one user was able to contribute a valuable new app in less than a day. Sébastien Boisvert, a doctoral student at the Université Laval in Québec, Canada, converted a software application he had previously written for short-read de novo assembly to an app for the DNAnexus community.

Boisvert is the mind behind Ray, a scalable genome assembler built specifically for next-gen, short-read sequence data and related applications, such as metagenomics. Ray was first reported in 2010 in the Journal of Computational Biology. Written in C++, it is an MPI-based parallel tool that uses a single executable, eliminating the need to write Perl scripts. Ray is sequencing platform-agnostic, so it can be used with data from any short-read sequencer.

Today, Ray is primarily used by bioinformaticians who have ongoing access to a supercomputer. The software’s peer-to-peer design makes it ideal to run on systems with hundreds or thousands of nodes — which also makes it just right for a cloud computing environment. When Boisvert heard that DNAnexus was opening its doors to developer-contributed apps, he immediately looked into how to submit Ray so even more users could have access to the tool. From his perspective, cloud computing offers a more instantaneous experience with massively parallel computing to people who don’t readily have supercomputer access, and also provides the type of infrastructure management that allows users to focus on what they want to compute, rather than how to manage queries and coding.

Boisvert remarked that the DNAnexus documentation for contributing an app was straightforward and that the interface in particular was easy to use. Writing the wrapper to convert the software code into an app took less than a day. He worked with the Developer Program support team at DNAnexus to make sure everything was working properly, and now Ray is available for any DNAnexus user to add to an analysis pipeline — and it’s free. (Check out Boisvert’s own blog about cloud computing options, where he notes that it’s fun to start an app in DNAnexus!)

As our developer program continues to grow, we look forward to working with more contributors to get their great apps into our platform so they can be broadly available to our growing community of users. If you’re interested in learning more about our Developer Program, please visit https://dnanexus.com/developers.