Faster BAM Sorting with SAMtools and RocksDB

Brief introduction for non-experts: fully sequencing a person’s genome with current technology produces roughly 1.5 billion data fragments totaling 500 GiB (uncompressed). BAM is a compressed file format for this data used by various downstream analyses, including clinical interpretation algorithms. Such analyses usually require the fragments to be sorted according to their estimated position of origin in the human genome.

We’ve been hacking on a fork of samtools providing a new command, samtools rocksort, capable of sorting large BAMs significantly faster than samtools sort on modern server hardware. It leverages the flexible external sorting and compaction algorithms in RocksDB, Facebook’s recently open-sourced embedded storage engine (itself a descendant of Google’s LevelDB).

How to sort a large BAM
To sort a BAM file larger than available memory, samtools divides the data into several batches, each of which it sorts in memory and writes to a temporary disk file. It then merges the temporary files to form the final BAM file. Because the files are individually sorted, the merging can be done in a “streaming” fashion using a heap, requiring only a small portion of each temporary file in memory at any time.

sorted BAM file

If the total number of alignment records is N, and we sort up to R in RAM to create one temporary file, then we create N/R temporary files. The algorithmic running time is O((N/R) log R + N log(N/R)) = O(N log N), where the factor log(N/R) is incurred in the heap used in the merge phase. Newer versions of samtools can also parallelize in-memory sorting and output compression across multiple cores, which yields a nice wallclock speedup; however, the merge phase still has an O(N log(N/R)) critical path which is essentially unparallelized.

Sorting with RocksDB
At a high level, samtools rocksort works in much the same way, but RocksDB does almost all of the hard work. We insert each alignment record into a temporary RocksDB ordered key-value database, with a key encoding the reference coordinates. As a matter of course, RocksDB runs background threads to sort batches of them in memory (using a comparator function we provide) and write each batch out to a file. Once we finish loading all the BAM data, we ask RocksDB for a database iterator, which we simply loop over and emit the sorted BAM file. The RocksDB iterator automatically heap-merges from the sorted disk files as needed.

Just entrusting the external sort to RocksDB in this way is enough to get a decent speed improvement of 25% or so over the original sort implementation. That’s largely because the binary key-value interface allows us to simply blit each in-memory bam1_t record directly into and out of RocksDB, which compresses temporary disk files using Snappy. In contrast, samtools sort writes its temporary files in the actual BAM format, requiring encoding/decoding and slower zlib compression. Very nice and convenient — but no fundamental algorithmic difference so far.

Here’s where RocksDB brings important new capabilities to the table. There’s a problem with the single-pass merge described above when the number of intermediate files, N/R, is large. Merging the sorted intermediate files in limited memory requires constantly reading little bits from all those files, incurring a lot of disk seeks on rotating drives. In fact, at some point, samtools sort performance becomes effectively bound to disk seeking; we’ll see an example of this below.

In this scenario, samtools rocksort can sort the same data in much less time, using no more memory, by invoking RocksDB’s background compaction capabilities. With a few extra lines of code we configure RocksDB so that, while we’re still in the process of loading the BAM data, it runs additional background threads to merge batches of existing sorted temporary files into fewer, larger, sorted files. Just like the final merge, each background compaction requires only a modest amount of working memory.

How to sort with RocksDB

By expending this background effort we greatly reduce the number of files in the final merge, and thus the seek load for datasets many times larger than provisioned RAM. The cost of shortening the merge critical path in this way is that each alignment record is processed and written to temporary files multiple times, instead of just once. This is an instance of a classic database engineering tradeoff where we accept write amplification in order to reduce read amplification. An explicit command-line option to samtools rocksort is needed to activate background compaction, because the write amplification can be counterproductive when the merge is not bound to disk seeks.

We benchmarked samtools rocksort and samtools sort on Amazon EC2 instances to explore different hardware configurations and parameter settings. The build of samtools used in these benchmarks includes performance enhancements to its parallel compression code we made as an aside while developing rocksort. That change significantly benefits both algorithms, so the speed of rocksort compared to previous versions of samtools is even better than illustrated below.

Warmup: sorting a 20 GiB BAM file on rotating drives
First we took a BAM file from the 1000 Genomes Project representing a low-coverage, whole-genome sequence. We shuffled the 20 GiB BAM file and sorted it on a c1.xlarge instance, which has eight virtual CPUs, four rotating drives for scratch space (here configured in RAID0, and only 7 GiB of RAM. The limited RAM on this instance type provides an interesting approximation of the fairly common strategy of running BAM sorting concurrently with (streaming input from) another memory-hungry process, such as an aligner.

The exact invocations were:

cat wgs_lo.shuffled.bam | pigz -dc | samtools sort -@ 8 -m 640M – wgs_lo.sort
cat wgs_lo.shuffled.bam | pigz -dc | samtools rocksort -@ 8 -m 640M – wgs_lo.rocksort

(Streaming the input to samtools from pigz is a trick to effectively use a separate worker thread for decompression despite samtools lacking that capability.)

Here’s a chart of the results (data collected at 30-second intervals by dstat), with samtools rocksort in blue and samtools sort in black:

chart samstools sort

The CPU usage chart clearly reflects the two distinct phases (splitting and merging) of both algorithms. The existing sort algorithm has erratic CPU usage in the split phase, as it alternates between reading the input data and writing compressed temporary files. RocksDB does a better job parallelizing the corresponding steps, and has less compression work to do thanks to Snappy, so it shows more consistent and less total CPU usage. In the merge phase, rocksort is able to drive somewhat higher CPU utilization for output compression, reflecting the shorter merge critical path. Its memory consumption is substantially lower on average (but more variable), and it drives a higher rate of disk activity.

Overall, samtools sort took 34m to sort this dataset, and samtools rocksort shaved off about seven minutes. That’s a 25% speed increase – not bad at all!

The future: 135 GiB BAM on solid-state drives
The 20 GiB dataset in the previous test is pretty well representative of large datasets to which samtools sort is currently applied, but it seems safe to expect this size to grow along with the ever-increasing throughput of modern sequencing instruments. Let’s next consider producing a 135 GiB BAM file — about the size of a deep human whole genome sequence (WGS) — in one sort operation. We synthesized a dataset of this size by merging and shuffling several of the low-coverage BAMs from the 1000 Genomes Project.

First we benchmarked this on one of the powerful new i2.2xlarge instances, which notably comes equipped with two solid-state drives (SSDs), also configured here in RAID0. SSDs largely obviate “seek time” concerns, and they’re becoming increasingly available both from cloud providers and in HPC clusters. We also set it up with a fairly generous 32 GiB of RAM.

cat wgs_hi.shuffled.bam | pigz -dc | samtools sort -@ 8 -m 4G – wgs_hi.sort
cat wgs_hi.shuffled.bam | pigz -dc | samtools rocksort -@ 8 -m 4G – wgs_hi.rocksort

chart samstools sort

In this test samtools rocksort (2h 7m) was 34% faster than samtools sort (2h 50m). The difference is more modest if we only consider the merge phase, though, which is what really matters when the data are streaming from an aligner. SSDs are awesome! (Though still costly to use just for scratch space.)

The extreme case: 135 GiB BAM on rotating drives
Lastly, let’s look at a case where background compaction allows samtools rocksort to really blow the doors off. We took the deep WGS BAM from the last test back to the c1.xlarge instance with four rotating drives in RAID0 and limited RAM. To activate background compaction, we supply rocksort with an estimate of the total uncompressed size of the BAM dataset, in this case 512 GiB; the README file contains guidelines for coming up with this estimate.

cat wgs_hi.shuffled.bam | pigz -dc | samtools sort -@ 8 -m 640M – wgs_hi.sort
cat wgs_hi.shuffled.bam | pigz -dc | samtools rocksort -@ 8 -m 640M -s 512G – wgs_hi.rocksort

extreme case samstools sort

In this test, samtools sort produces well over 1,000 temporary disk files, and then tries to concurrently stream all of them in the subsequent merge phase. That’s far too much effectively random I/O for rotating disks, even in a RAID array, and leads to very poor system utilization. In contrast, samtools rocksort makes much more efficient use of the disks, as the background compaction effectively leads to a highly concurrent, two-pass external merge sort. It finished five times faster (3h 46m vs. 22h 37m).

What’s next?
There are still some obvious ways to further speed up samtools BAM sorting, which would largely close any remaining speed gap with commercial tools. Parallel CPU utilization during output compression, while significantly improved by our patches, could still be increased — though at the cost of further complicating the Pthreads-based code, which is no one’s idea of fun. At the other end, input decompression could also be sped up. Indeed, the decompression rate with pigz falls well short of the insertion rate RocksDB is capable of sustaining, and BAM’s block compression format admits more parallelization than pigz (a general-purpose program) takes advantage of. There’s actually some experimental code for this out there already, which would probably benefit both rocksort and the vanilla sort algorithm. That stated, input decompression is not a factor in the common case where the data are slowly streamed from an aligner.

Both of those potential optimizations reflect an important point worth emphasizing in closing: rocksort is not a rewrite of samtools. It adds one independent command in bam_rocksort.c, comparable in SLOC to the existing bam_sort.c. The implementation reuses numerous BAM processing functions from samtools, and thus will benefit from any future improvements therein. And while rocksort executes a highly concurrent, I/O-heavy algorithm, almost all of the hard work is courtesy of crack engineers from Facebook and Google (at least one of whom is said to build his code before shipping it, but only to check for compiler and linker bugs!). Importantly therefore, rocksort is maintainable.

You can give rocksort a try by building our fork of samtools.

Importing CGHub data into DNAnexus – quickly!

DNAnexus was founded on the premise that the future of genome informatics resides in the cloud. At the time it was a radical notion; four years later, that inevitable trend is widely recognized among experts in the field. And yet for most, the methods and practical realities of moving into the cloud still remain mysterious. Frequently, one of the first questions to arise is: “How would I get data into the cloud?” It’s an understandable concern for anyone accustomed to e-mailing files around, downloading from FTP sites, or worst of all, shipping hard drives!

Remember that a significant and growing fraction of all day-to-day Internet traffic flows through the cloud. In that light, genome data sets are more than manageable. In fact, a modern high-throughput sequencing instrument produces data at less than 10 Mbps (averaged over its run time, and with ordinary compression techniques). The price of an Internet link with enough throughput for a whole fleet of such instruments is just a tiny fraction of the other costs to operate them.

So streaming newly-sequenced data into the cloud is clearly no sweat. What about all the massive data sets already generated? At DNAnexus, we have the experience to know that this is no problem, either. We’ll discuss an example here.

An enterprise bioinformatics group recently approached us about analyzing RNA-seq data from the Cancer Cell Line Encyclopedia (CCLE). This data set is substantial, bigger than 10 TB, and freely available through CGHub, the cancer genomics data repository operated by UCSC’s genome bioinformatics center. Because we have multiple users who have expressed interest in working with cancer genomics data on our platform, our bioinformatics team decided to lend assistance in developing a capability to import data from CGHub.

CGHub provides file transfers using a special program called GeneTorrent. We began by writing a simple app to wrap GeneTorrent using our SDK (available to any platform user). Given a CGHub analysis ID, the app downloads the associated BAM and BAI files, double-checks their MD5 integrity hashes, and outputs them to the user’s project. Our users were able to incorporate this app easily into their own analysis workflow, which they continued to develop using our SDK, with minimal guidance from us.

We got involved again when it came time to import all ~800 CCLE RNA-seq samples for analysis with the final pipeline. Corresponding with CGHub’s team, we learned that the best transfer speeds would be obtained by running numerous GeneTorrent downloads in parallel, using multicore servers to support the CPU-intensive protocol. Following this advice, we launched jobs on our platform to run the transfers 64 at a time in parallel, each using a quad-core cloud instance. (Since our app spends some time double-checking the file integrity and writing it back to cloud storage, somewhat fewer than 64 would actually be running GeneTorrent at steady state.)

Using this strategy, we completed the transfer of 10.68 TB in fewer than seven hours, for an average sustained throughput of about 4 Gbps. The transfers were going trans-continentally, most of the distance via Internet2; as far as we know, there were no bottlenecks internal to the DNAnexus platform during this process. Here’s a screenshot of our project with all of the BAM files:
CCLE RNA-seq samples

How many institutions in the world have infrastructure that can readily bring to bear both a multi-gigabit route to CGHub and the hundreds of compute cores needed to fully utilize it? Perhaps a few dozen, or fewer. But any DNAnexus user has exactly that; in fact, throughout this entire effort, we relied on features of the platform that are readily accessible to all regular users. And infrastructure is just the beginning: the platform is also secure and compliant to clinical standards (as well as dbGaP security practices), and the revolution really starts with seamless sharing of data and analysis tools without regard to institutional boundaries.

Launched just this past spring, we now find the new DNAnexus platform reaching amazing milestones practically every week. In fact, we’re already deploying more processing power, as far as we know, than any of the dedicated clusters at the major genome centers — well over 20,000 compute cores at times, according to user demand. For more on that, watch this space in a few weeks; we’ll be making some big announcements at ASHG 2013 about how we’re realizing truly mega-scale genomics in the cloud.

Apps vs. Applets in DNAnexus

One of our main design goals for the new DNAnexus platform was to make it possible for expert bioinformaticians to run all of their genome analyses in the cloud — eliminating the need to copy enormous data sets between multiple compute and storage silos. Many features of the platform were built with this in mind: the extensive API and SDK, the scriptable command-line interface, the built-in genome browser, and the integration with R and other high-level environments, to name a few.

In this blog post, we’ll look at what may be the most important DNAnexus feature enabling bioinformaticians to move their work into the cloud: not only can you select from a library of popular analysis tools (and publish new ones), but you can also upload your own private programs to run on the platform. We call these two types of executables, apps and applets, respectively, and conceptually they reflect the mix of off-the-shelf and custom code involved in any end-to-end bioinformatics analysis. You can experiment with apps and applets in DNAnexus firsthand by signing up for a trial account, complete with $125 in compute and storage credits.

Apps represent general-purpose tools expected to be of interest to a wide audience on the platform, striving for compatibility, ease of use, and robustness. They’re published in a dedicated section of the website, and typically include extensive metadata and documentation. Let’s take a look at the app representing the Velvet de novo assembler:

DNAnexus apps

The app page includes details about the tool, links to source repositories, citation links generated from publication DOIs, and detailed documentation of the inputs and outputs. You’ll also notice the page has a separate “versions” tab. When a new version of an app is published, the old version is automatically archived and remains available.

Overall, an app like this should be easily usable by any bioinformatician on the platform, even if they’re not already familiar with Velvet. The documentation and input/output specification is presented in a standardized format (try dx run velvet -h) and there’s no chain of compilation dependencies to deal with. Of course, from the development side, polishing the platform app to this point takes significant time and effort.

Applets are lighter-weight executables that can be used as scripts for project-specific analyses or ad hoc data manipulations, proprietary analysis pipelines, or development/testing versions of apps. Unlike apps, they reside inside your projects alongside data objects — completely private unless you choose to share a project with others. At the same time, they have full access to the platform’s API and parallelization infrastructure; an applet could very well use 1,000 cores!

Suppose, for example, we wrote a script to perform some custom quality filtration on sequencing reads before providing them to the de novo assembler. Using the dx-app-wizard utility in the SDK, we could quickly package this as an applet and upload it to the platform. There’s no need to prepare documentation and metadata, nor to wait for any kind of approval or review by DNAnexus. The result is an applet in our project along with our data:

DNAnexus applets

It’s easy to run apps and applets, and to mix them together in workflows. Here’s what we see when we click “Run” in the project:

run apps

Applets in the project are shown at the top, followed by a list of platform-wide apps. Similarly, the command-line interface ‘dx run’ can launch both apps and applets.

Apps and applets both benefit from the platform’s built-in reproducibility infrastructure. Every time an executable — app or applet — is uploaded into the system, it receives a permanent, unique ID. This ID is recorded in every job launched with the executable and by extension every file and data object produced by any job. Thus, any data produced on the platform can be traced to the exact executable, inputs, and parameter settings used to create it. While apps and applets both have this permanent ID, apps also have a semantic version number in the form xx.yy.zz, to ease association with version numbers of upstream open-source projects.

One final point of interest: the ability of any user to upload arbitrary executables into our platform obviously raises intriguing security considerations. You can be sure, however, that the platform is built with the security and access control infrastructure necessary to make this feasible. We’ll delve into that topic in a future blog post.