Faster BAM Sorting with SAMtools and RocksDB

Brief introduction for non-experts: fully sequencing a person’s genome with current technology produces roughly 1.5 billion data fragments totaling 500 GiB (uncompressed). BAM is a compressed file format for this data used by various downstream analyses, including clinical interpretation algorithms. Such analyses usually require the fragments to be sorted according to their estimated position of origin in the human genome.

We’ve been hacking on a fork of samtools providing a new command, samtools rocksort, capable of sorting large BAMs significantly faster than samtools sort on modern server hardware. It leverages the flexible external sorting and compaction algorithms in RocksDB, Facebook’s recently open-sourced embedded storage engine (itself a descendant of Google’s LevelDB).

How to sort a large BAM
To sort a BAM file larger than available memory, samtools divides the data into several batches, each of which it sorts in memory and writes to a temporary disk file. It then merges the temporary files to form the final BAM file. Because the files are individually sorted, the merging can be done in a “streaming” fashion using a heap, requiring only a small portion of each temporary file in memory at any time.

[Diagram: sorted temporary files being merged into the final sorted BAM file]

If the total number of alignment records is N, and we sort up to R of them in RAM to create each temporary file, then we create N/R temporary files. The algorithmic running time is O(N log R + N log(N/R)) = O(N log N), where the first term covers the in-memory sorts and the log(N/R) factor comes from the heap used in the merge phase. Newer versions of samtools can also parallelize in-memory sorting and output compression across multiple cores, which yields a nice wallclock speedup; however, the merge phase still has an O(N log(N/R)) critical path that is essentially unparallelized.
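
To make the merge phase concrete, here is a toy C++ sketch (not samtools code) of a heap-based k-way merge, with small sorted integer vectors standing in for the temporary files; it shows where the log(N/R) factor in the merge comes from.

#include <cstdio>
#include <queue>
#include <tuple>
#include <vector>

int main() {
    // Stand-ins for the N/R sorted temporary files.
    std::vector<std::vector<int>> runs = {{1, 4, 9}, {2, 3, 8}, {5, 6, 7}};

    // Heap entries: (next value, run index, position within that run).
    using Entry = std::tuple<int, size_t, size_t>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    for (size_t r = 0; r < runs.size(); ++r)
        if (!runs[r].empty()) heap.emplace(runs[r][0], r, 0);

    // Each of the N records passes through a heap holding one entry per run,
    // which is where the O(N log(N/R)) merge cost comes from.
    while (!heap.empty()) {
        auto [v, r, i] = heap.top();
        heap.pop();
        std::printf("%d ", v);  // emit the next record of the sorted output
        if (i + 1 < runs[r].size()) heap.emplace(runs[r][i + 1], r, i + 1);
    }
    std::printf("\n");
    return 0;
}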

Sorting with RocksDB
At a high level, samtools rocksort works in much the same way, but RocksDB does almost all of the hard work. We insert each alignment record into a temporary RocksDB ordered key-value database, with a key encoding the reference coordinates. As a matter of course, RocksDB runs background threads to sort batches of them in memory (using a comparator function we provide) and write each batch out to a file. Once we finish loading all the BAM data, we ask RocksDB for a database iterator, which we simply loop over and emit the sorted BAM file. The RocksDB iterator automatically heap-merges from the sorted disk files as needed.
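
As an illustration, here is a hypothetical, stripped-down C++ sketch of the idea rather than the actual bam_rocksort.c code. It writes each record into a temporary RocksDB database under a key encoding its reference coordinates, then iterates to read the records back in sorted order. For simplicity the key is laid out big-endian so RocksDB's default bytewise comparator already yields coordinate order, whereas rocksort supplies its own comparator; the database path, key layout, and placeholder record values are assumptions made for the example.

#include <cstdint>
#include <cstdio>
#include <string>

#include "rocksdb/db.h"
#include "rocksdb/options.h"

// Build a key that sorts by (reference id, position) under bytewise comparison.
static std::string make_key(int32_t tid, int32_t pos, uint64_t serial) {
    std::string k;
    uint64_t parts[2] = {
        ((uint64_t)(uint32_t)tid << 32) | (uint32_t)pos,
        serial /* tie-breaker to keep duplicate coordinates distinct */};
    for (uint64_t p : parts)
        for (int shift = 56; shift >= 0; shift -= 8)
            k.push_back((char)((p >> shift) & 0xff));
    return k;
}

int main() {
    rocksdb::Options options;
    options.create_if_missing = true;
    options.compression = rocksdb::kSnappyCompression;  // temp files compressed with Snappy

    rocksdb::DB* db = nullptr;
    rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rocksort.demo", &db);
    if (!s.ok()) { std::fprintf(stderr, "%s\n", s.ToString().c_str()); return 1; }

    // In rocksort the value would be the raw in-memory bam1_t record blitted
    // as-is; here we just store placeholder strings.
    uint64_t serial = 0;
    db->Put(rocksdb::WriteOptions(), make_key(0, 5000, serial++), "record A");
    db->Put(rocksdb::WriteOptions(), make_key(0, 100, serial++), "record B");
    db->Put(rocksdb::WriteOptions(), make_key(1, 42, serial++), "record C");

    // The iterator heap-merges RocksDB's sorted files behind the scenes.
    rocksdb::Iterator* it = db->NewIterator(rocksdb::ReadOptions());
    for (it->SeekToFirst(); it->Valid(); it->Next())
        std::printf("%s\n", it->value().ToString().c_str());

    delete it;
    delete db;
    return 0;
}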

Just entrusting the external sort to RocksDB in this way is enough to get a decent speed improvement of 25% or so over the original sort implementation. That’s largely because the binary key-value interface allows us to simply blit each in-memory bam1_t record directly into and out of RocksDB, which compresses temporary disk files using Snappy. In contrast, samtools sort writes its temporary files in the actual BAM format, requiring encoding/decoding and slower zlib compression. Very nice and convenient — but no fundamental algorithmic difference so far.

Here’s where RocksDB brings important new capabilities to the table. There’s a problem with the single-pass merge described above when the number of intermediate files, N/R, is large. Merging the sorted intermediate files in limited memory requires constantly reading little bits from all those files, incurring a lot of disk seeks on rotating drives. In fact, at some point, samtools sort performance becomes effectively bound to disk seeking; we’ll see an example of this below.

In this scenario, samtools rocksort can sort the same data in much less time, using no more memory, by invoking RocksDB’s background compaction capabilities. With a few extra lines of code we configure RocksDB so that, while we’re still in the process of loading the BAM data, it runs additional background threads to merge batches of existing sorted temporary files into fewer, larger, sorted files. Just like the final merge, each background compaction requires only a modest amount of working memory.

[Diagram: how samtools rocksort sorts with RocksDB]

By expending this background effort we greatly reduce the number of files in the final merge, and thus the seek load for datasets many times larger than provisioned RAM. The cost of shortening the merge critical path in this way is that each alignment record is processed and written to temporary files multiple times, instead of just once. This is an instance of a classic database engineering tradeoff where we accept write amplification in order to reduce read amplification. An explicit command-line option to samtools rocksort is needed to activate background compaction, because the write amplification can be counterproductive when the merge is not bound to disk seeks.
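
The sketch below shows the flavor of the RocksDB configuration involved. The option names are real RocksDB options, but the values and the choice of universal compaction are illustrative assumptions; the exact settings rocksort uses may differ.

#include "rocksdb/db.h"
#include "rocksdb/options.h"

rocksdb::Options rocksort_style_options(bool background_compaction) {
    rocksdb::Options options;
    options.create_if_missing = true;
    options.compression = rocksdb::kSnappyCompression;
    options.write_buffer_size = (size_t)640 << 20;  // roughly the -m batch size

    if (background_compaction) {
        // Let background threads repeatedly merge existing sorted files into
        // fewer, larger ones, shortening the final merge at the cost of
        // writing each record more than once (write amplification).
        options.IncreaseParallelism(8);  // background flush/compaction threads
        options.compaction_style = rocksdb::kCompactionStyleUniversal;
    } else {
        // Without the size hint, skip compaction: every record is written to
        // a temporary file exactly once and merged in a single pass.
        options.disable_auto_compactions = true;
    }
    return options;
}

Passing an Options object like this to rocksdb::DB::Open is all that changes; RocksDB's background threads then handle the flushing and compaction while records are still being inserted.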

Benchmarks
We benchmarked samtools rocksort and samtools sort on Amazon EC2 instances to explore different hardware configurations and parameter settings. The build of samtools used in these benchmarks includes performance enhancements to its parallel compression code we made as an aside while developing rocksort. That change significantly benefits both algorithms, so the speed of rocksort compared to previous versions of samtools is even better than illustrated below.

Warmup: sorting a 20 GiB BAM file on rotating drives
First we took a BAM file from the 1000 Genomes Project representing a low-coverage, whole-genome sequence. We shuffled the 20 GiB BAM file and sorted it on a c1.xlarge instance, which has eight virtual CPUs, four rotating drives for scratch space (here configured in RAID0), and only 7 GiB of RAM. The limited RAM on this instance type provides an interesting approximation of the fairly common strategy of running BAM sorting concurrently with (streaming input from) another memory-hungry process, such as an aligner.

The exact invocations were:

cat wgs_lo.shuffled.bam | pigz -dc | samtools sort -@ 8 -m 640M - wgs_lo.sort
cat wgs_lo.shuffled.bam | pigz -dc | samtools rocksort -@ 8 -m 640M - wgs_lo.rocksort

(Streaming the input to samtools from pigz is a trick to effectively use a separate worker thread for decompression despite samtools lacking that capability.)

Here’s a chart of the results (data collected at 30-second intervals by dstat), with samtools rocksort in blue and samtools sort in black:

[Chart: dstat time series (CPU, memory, disk) for samtools rocksort (blue) and samtools sort (black), 20 GiB BAM on c1.xlarge]

The CPU usage chart clearly reflects the two distinct phases (splitting and merging) of both algorithms. The existing sort algorithm has erratic CPU usage in the split phase, as it alternates between reading the input data and writing compressed temporary files. RocksDB does a better job parallelizing the corresponding steps, and has less compression work to do thanks to Snappy, so it shows more consistent and less total CPU usage. In the merge phase, rocksort is able to drive somewhat higher CPU utilization for output compression, reflecting the shorter merge critical path. Its memory consumption is substantially lower on average (but more variable), and it drives a higher rate of disk activity.

Overall, samtools sort took 34m to sort this dataset, and samtools rocksort shaved off about seven minutes. That’s a 25% speed increase – not bad at all!

The future: 135 GiB BAM on solid-state drives
The 20 GiB dataset in the previous test is pretty well representative of large datasets to which samtools sort is currently applied, but it seems safe to expect this size to grow along with the ever-increasing throughput of modern sequencing instruments. Let’s next consider producing a 135 GiB BAM file — about the size of a deep human whole genome sequence (WGS) — in one sort operation. We synthesized a dataset of this size by merging and shuffling several of the low-coverage BAMs from the 1000 Genomes Project.

First we benchmarked this on one of the powerful new i2.2xlarge instances, which notably comes equipped with two solid-state drives (SSDs), also configured here in RAID0. SSDs largely obviate “seek time” concerns, and they’re becoming increasingly available both from cloud providers and in HPC clusters. We also set it up with a fairly generous 32 GiB of RAM.

cat wgs_hi.shuffled.bam | pigz -dc | samtools sort -@ 8 -m 4G - wgs_hi.sort
cat wgs_hi.shuffled.bam | pigz -dc | samtools rocksort -@ 8 -m 4G - wgs_hi.rocksort

[Chart: dstat time series for the 135 GiB sort on the SSD-backed i2.2xlarge instance]

In this test samtools rocksort (2h 7m) was 34% faster than samtools sort (2h 50m). The difference is more modest if we only consider the merge phase, though, which is what really matters when the data are streaming from an aligner. SSDs are awesome! (Though still costly to use just for scratch space.)

The extreme case: 135 GiB BAM on rotating drives
Lastly, let’s look at a case where background compaction allows samtools rocksort to really blow the doors off. We took the deep WGS BAM from the last test back to the c1.xlarge instance with four rotating drives in RAID0 and limited RAM. To activate background compaction, we supply rocksort with an estimate of the total uncompressed size of the BAM dataset, in this case 512 GiB; the README file contains guidelines for coming up with this estimate.

cat wgs_hi.shuffled.bam | pigz -dc | samtools sort -@ 8 -m 640M - wgs_hi.sort
cat wgs_hi.shuffled.bam | pigz -dc | samtools rocksort -@ 8 -m 640M -s 512G - wgs_hi.rocksort

[Chart: dstat time series for the 135 GiB sort on the c1.xlarge instance with rotating drives]

In this test, samtools sort produces well over 1,000 temporary disk files, and then tries to concurrently stream all of them in the subsequent merge phase. That's far too much effectively random I/O for rotating disks, even in a RAID array, and leads to very poor system utilization. In contrast, samtools rocksort makes much more efficient use of the disks, as the background compaction effectively leads to a highly concurrent, two-pass external merge sort. It finished about six times faster (3h 46m vs. 22h 37m).

What’s next?
There are still some obvious ways to further speed up samtools BAM sorting, which would largely close any remaining speed gap with commercial tools. Parallel CPU utilization during output compression, while significantly improved by our patches, could still be increased — though at the cost of further complicating the Pthreads-based code, which is no one’s idea of fun. At the other end, input decompression could also be sped up. Indeed, the decompression rate with pigz falls well short of the insertion rate RocksDB is capable of sustaining, and BAM’s block compression format admits more parallelization than pigz (a general-purpose program) takes advantage of. There’s actually some experimental code for this out there already, which would probably benefit both rocksort and the vanilla sort algorithm. That said, input decompression is not a factor in the common case where the data are slowly streamed from an aligner.
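
To see why BAM decompression parallelizes so readily, note that a BAM file is a sequence of independent BGZF (gzip) blocks of at most 64 KiB each, so whole blocks can be inflated on separate threads. Here is a toy C++/zlib sketch of that idea; it is an illustration only, not the experimental code mentioned above, and it assumes htslib-style block headers with the BSIZE field in its customary position and does no error recovery.

#include <cstdint>
#include <cstdio>
#include <future>
#include <stdexcept>
#include <string>
#include <vector>
#include <zlib.h>

// Inflate one complete BGZF block (each block is a self-contained gzip member).
static std::string inflate_block(std::vector<uint8_t> block) {
    std::string out(0x10000, '\0');          // uncompressed BGZF payload is <= 64 KiB
    z_stream zs{};
    if (inflateInit2(&zs, 15 + 16) != Z_OK)  // 15+16: expect a gzip wrapper
        throw std::runtime_error("inflateInit2 failed");
    zs.next_in = block.data();
    zs.avail_in = (uInt)block.size();
    zs.next_out = (Bytef*)&out[0];
    zs.avail_out = (uInt)out.size();
    int rc = inflate(&zs, Z_FINISH);
    size_t n = zs.total_out;
    inflateEnd(&zs);
    if (rc != Z_STREAM_END) throw std::runtime_error("bad BGZF block");
    out.resize(n);
    return out;
}

// Read one raw BGZF block. BSIZE (total block length minus one) sits at bytes
// 16-17 of the block when the BC extra subfield comes first, as in blocks
// written by samtools/htslib.
static bool read_block(std::FILE* fp, std::vector<uint8_t>& block) {
    uint8_t hdr[18];
    if (std::fread(hdr, 1, sizeof hdr, fp) != sizeof hdr) return false;
    size_t bsize = (size_t)(hdr[16] | (hdr[17] << 8)) + 1;
    block.assign(hdr, hdr + sizeof hdr);
    block.resize(bsize);
    return std::fread(block.data() + sizeof hdr, 1, bsize - sizeof hdr, fp) ==
           bsize - sizeof hdr;
}

int main(int argc, char** argv) {
    if (argc != 2) { std::fprintf(stderr, "usage: %s in.bam\n", argv[0]); return 1; }
    std::FILE* fp = std::fopen(argv[1], "rb");
    if (!fp) { std::perror("fopen"); return 1; }

    size_t total = 0;
    bool more = true;
    while (more) {
        // Inflate a bounded batch of blocks concurrently, one task per block.
        std::vector<std::future<std::string>> batch;
        for (int i = 0; i < 8; ++i) {
            std::vector<uint8_t> block;
            if (!(more = read_block(fp, block))) break;
            batch.push_back(std::async(std::launch::async, inflate_block, std::move(block)));
        }
        for (auto& f : batch) total += f.get().size();  // consume in input order
    }
    std::fclose(fp);
    std::fprintf(stderr, "decompressed %zu bytes\n", total);
    return 0;
}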

Both of those potential optimizations reflect an important point worth emphasizing in closing: rocksort is not a rewrite of samtools. It adds one independent command in bam_rocksort.c, comparable in SLOC to the existing bam_sort.c. The implementation reuses numerous BAM processing functions from samtools, and thus will benefit from any future improvements therein. And while rocksort executes a highly concurrent, I/O-heavy algorithm, almost all of the hard work is courtesy of crack engineers from Facebook and Google (at least one of whom is said to build his code before shipping it, but only to check for compiler and linker bugs!). Importantly therefore, rocksort is maintainable.

You can give rocksort a try by building our fork of samtools.

Developer Spotlight: Baylor Computational Biologist on Porting Mercury to the Cloud

As users log in to DNAnexus to check out the Mercury pipeline and see how it could be applied to their own data, we sat down with Dr. Narayanan Veeraraghavan from Baylor’s Human Genome Sequencing Center (HGSC) to talk about details of the pipeline, how it was ported to our cloud platform, and upcoming feature additions.

Almost every sample sequenced at the HGSC (~24 terabases/month) is processed by Mercury, the production pipeline for analyzing next-generation sequencing data. The framework takes raw base calls from the sequencing instruments and runs a series of algorithms and tools to generate an annotated VCF file. This process includes steps for mapping to a reference genome with BWA, certain finishing steps for preparing the BAM file for realignment and recalibration with GATK, and variant calling and annotation using Atlas2 and Cassandra, tools developed at the HGSC.

The pipeline runs well on the genome center’s local computational infrastructure, matching up with the production data flow rate. But some large projects — such as Baylor’s participation in the CHARGE consortium — can impose substantial load on that infrastructure, inconveniencing researchers who use the local cluster for their own projects. To address such spikes in compute requirements and to explore scaling of compute infrastructure for future demands in next-gen sequencing, HGSC looked for a technology that would keep the business going as usual locally while allowing the center to perform ambitious large-scale analysis. Enter cloud computing.

Veeraraghavan, who is a lead scientific programmer at the HGSC, was in charge of exploring the feasibility of operating on the cloud at scale. His immediate task was to take the existing Mercury pipeline and get it to run in the cloud. This was no small task, given Mercury’s complex, multi-component workflow and the fact that it was optimized to run on local compute infrastructure. Porting it meant reimagining the pipeline for a cloud environment and optimizing it for massively parallel computing.

Working closely with Andrew Carroll, a scientist at DNAnexus, Veeraraghavan says, “We optimized every component of the pipeline as we ported it to DNAnexus.” Becoming acquainted with the DNAnexus platform and getting the first implementation of Mercury up and running took just three weeks, far less time than Veeraraghavan was expecting given his previous efforts to port code to new environments. The DNAnexus team helped make sure that the new Mercury pipeline took advantage of smart ways to parallelize tasks and handle data. “That translates to faster turnaround time and lower cost,” Veeraraghavan says. “The most important part of my experience was the fantastic user support. The DNAnexus team was receptive to feature suggestions and quick to implement them. Working with DNAnexus was a real partnership.”

The CHARGE project data analysis was chosen as one of a few pilot projects to understand computing and collaborating on the cloud using DNAnexus; the successful effort wound up being the largest known genomic analysis conducted on Amazon Web Services, the cloud provider used by DNAnexus.

According to Veeraraghavan, new features are coming soon to the Mercury pipeline, including an RNA-seq workflow as well as tools for variant calling at the population level and on low depth-of-coverage whole genome sequences. These will be freely available to the public.

Other features that we helped HGSC enable include LIMS communication and a direct .bcl-to-cloud implementation in which raw base calls from the sequencing instruments are uploaded to the cloud in real time, making data analysis automatic, integrated, and seamless. These features, specific to HGSC, were made possible by our science team, which aims for a painless and straightforward integration for all DNAnexus customers.

Veeraraghavan, who has extensive programming expertise, says that DNAnexus is a good option for beginners and advanced users alike. For people like him, there’s virtually unlimited configurability via the command line interface, meta-data features, and APIs. “It can also be used by scientists who just want to pick and choose a few apps, stitch them together graphically into a workflow, and get going with their science without having to bother with installation and configuration,” he says. “It is extremely easy for a person with zero programming background to use the Mercury pipeline on DNAnexus. You can bring your data into a secure environment and use our tools without any hardware or software setup. This is the next generation of genomic analysis.”

Apps vs. Applets in DNAnexus

One of our main design goals for the new DNAnexus platform was to make it possible for expert bioinformaticians to run all of their genome analyses in the cloud — eliminating the need to copy enormous data sets between multiple compute and storage silos. Many features of the platform were built with this in mind: the extensive API and SDK, the scriptable command-line interface, the built-in genome browser, and the integration with R and other high-level environments, to name a few.

In this blog post, we’ll look at what may be the most important DNAnexus feature enabling bioinformaticians to move their work into the cloud: not only can you select from a library of popular analysis tools (and publish new ones), but you can also upload your own private programs to run on the platform. We call these two types of executables apps and applets, respectively, and conceptually they reflect the mix of off-the-shelf and custom code involved in any end-to-end bioinformatics analysis. You can experiment with apps and applets in DNAnexus firsthand by signing up for a trial account, complete with $125 in compute and storage credits.

Apps represent general-purpose tools expected to be of interest to a wide audience on the platform, striving for compatibility, ease of use, and robustness. They’re published in a dedicated section of the website, and typically include extensive metadata and documentation. Let’s take a look at the app representing the Velvet de novo assembler:

[Screenshot: the Velvet de novo assembler app page on DNAnexus]

The app page includes details about the tool, links to source repositories, citation links generated from publication DOIs, and detailed documentation of the inputs and outputs. You’ll also notice the page has a separate “versions” tab. When a new version of an app is published, the old version is automatically archived and remains available.

Overall, an app like this should be easily usable by any bioinformatician on the platform, even if they’re not already familiar with Velvet. The documentation and input/output specification are presented in a standardized format (try dx run velvet -h), and there’s no chain of compilation dependencies to deal with. Of course, from the development side, polishing an app to this point takes significant time and effort.

Applets are lighter-weight executables that can be used as scripts for project-specific analyses or ad hoc data manipulations, proprietary analysis pipelines, or development/testing versions of apps. Unlike apps, they reside inside your projects alongside data objects — completely private unless you choose to share a project with others. At the same time, they have full access to the platform’s API and parallelization infrastructure; an applet could very well use 1,000 cores!

Suppose, for example, we wrote a script to perform some custom quality filtration on sequencing reads before providing them to the de novo assembler. Using the dx-app-wizard utility in the SDK, we could quickly package this as an applet and upload it to the platform. There’s no need to prepare documentation and metadata, nor to wait for any kind of approval or review by DNAnexus. The result is an applet in our project along with our data:

[Screenshot: the applet stored in a project alongside its data]

It’s easy to run apps and applets, and to mix them together in workflows. Here’s what we see when we click “Run” in the project:

[Screenshot: the “Run” dialog, listing project applets above platform-wide apps]

Applets in the project are shown at the top, followed by a list of platform-wide apps. Similarly, the command-line interface ‘dx run’ can launch both apps and applets.

Apps and applets both benefit from the platform’s built-in reproducibility infrastructure. Every time an executable — app or applet — is uploaded into the system, it receives a permanent, unique ID. This ID is recorded in every job launched with the executable and, by extension, in every file and data object produced by any job. Thus, any data produced on the platform can be traced to the exact executable, inputs, and parameter settings used to create it. While apps and applets both have this permanent ID, apps also have a semantic version number in the form xx.yy.zz, to ease association with version numbers of upstream open-source projects.

One final point of interest: the ability of any user to upload arbitrary executables into our platform obviously raises intriguing security considerations. You can be sure, however, that the platform is built with the security and access control infrastructure necessary to make this feasible. We’ll delve into that topic in a future blog post.