Faster BAM Sorting with SAMtools and RocksDB

Brief introduction for non-experts: fully sequencing a person’s genome with current technology produces roughly 1.5 billion data fragments totaling 500 GiB (uncompressed). BAM is a compressed file format for this data used by various downstream analyses, including clinical interpretation algorithms. Such analyses usually require the fragments to be sorted according to their estimated position of origin in the human genome.

We’ve been hacking on a fork of samtools providing a new command, samtools rocksort, capable of sorting large BAMs significantly faster than samtools sort on modern server hardware. It leverages the flexible external sorting and compaction algorithms in RocksDB, Facebook’s recently open-sourced embedded storage engine (itself a descendant of Google’s LevelDB).

How to sort a large BAM
To sort a BAM file larger than available memory, samtools divides the data into several batches, each of which it sorts in memory and writes to a temporary disk file. It then merges the temporary files to form the final BAM file. Because the files are individually sorted, the merging can be done in a “streaming” fashion using a heap, requiring only a small portion of each temporary file in memory at any time.

[Diagram: sorted temporary files are heap-merged into the final sorted BAM file]

If the total number of alignment records is N, and we sort up to R of them in RAM to create each temporary file, then we create N/R temporary files. The algorithmic running time is O(N log R + N log(N/R)) = O(N log N), where the N log(N/R) term is incurred by the heap used in the merge phase. Newer versions of samtools can also parallelize in-memory sorting and output compression across multiple cores, which yields a nice wallclock speedup; however, the merge phase still has an O(N log(N/R)) critical path which is essentially unparallelized.
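For readers who want to see the shape of this algorithm in code, here is a minimal Python sketch (not samtools code; samtools implements the same idea in C over bam1_t records and compressed temporary BAM files). The input is split into sorted runs of at most R records, and heapq.merge performs the streaming heap merge:

import heapq
import pickle
import tempfile

def write_run(records, run_files):
    # Sort one in-memory batch and spill it to a temporary file.
    records.sort()
    f = tempfile.TemporaryFile()
    for rec in records:
        pickle.dump(rec, f)
    f.seek(0)
    run_files.append(f)

def read_run(f):
    # Stream records back from one sorted run file.
    while True:
        try:
            yield pickle.load(f)
        except EOFError:
            return

def external_sort(records, R):
    # Phase 1: split the input into N/R sorted runs on disk.
    run_files, batch = [], []
    for rec in records:
        batch.append(rec)
        if len(batch) == R:
            write_run(batch, run_files)
            batch = []
    if batch:
        write_run(batch, run_files)
    # Phase 2: heap-merge all runs, holding only one record per run in memory.
    return heapq.merge(*(read_run(f) for f in run_files))

Each run costs O(R log R) to sort, and the merge maintains a heap over the N/R runs, which is where the O(N log R + N log(N/R)) bound above comes from.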

Sorting with RocksDB
At a high level, samtools rocksort works in much the same way, but RocksDB does almost all of the hard work. We insert each alignment record into a temporary RocksDB ordered key-value database, with a key encoding the reference coordinates. As a matter of course, RocksDB runs background threads to sort batches of these records in memory (using a comparator function we provide) and write each batch out to a file. Once we finish loading all the BAM data, we ask RocksDB for a database iterator, which we simply loop over to emit the sorted BAM file. The RocksDB iterator automatically heap-merges from the sorted disk files as needed.
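As an illustration of what "a key encoding the reference coordinates" can look like (a sketch of the idea, not rocksort's actual key format), one option is to pack the reference ID and position big-endian, so that plain bytewise comparison of keys matches genomic coordinate order. rocksort instead registers a comparator function with RocksDB, but either way the key determines the sort order:

import struct

def coordinate_key(tid, pos, serial):
    # tid = reference sequence ID, pos = 0-based position, serial = a
    # running counter that keeps records with identical coordinates distinct.
    # Big-endian packing makes bytewise order equal numeric order.
    # (Unmapped reads, which samtools flags with tid = -1, would need
    # special handling that is omitted here.)
    return struct.pack(">IIQ", tid, pos, serial)

keys = [coordinate_key(1, 500, 2), coordinate_key(0, 10000, 1), coordinate_key(1, 499, 3)]
assert sorted(keys) == [coordinate_key(0, 10000, 1),
                        coordinate_key(1, 499, 3),
                        coordinate_key(1, 500, 2)]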

Just entrusting the external sort to RocksDB in this way is enough to get a decent speed improvement of 25% or so over the original sort implementation. That’s largely because the binary key-value interface allows us to simply blit each in-memory bam1_t record directly into and out of RocksDB, which compresses temporary disk files using Snappy. In contrast, samtools sort writes its temporary files in the actual BAM format, requiring encoding/decoding and slower zlib compression. Very nice and convenient — but no fundamental algorithmic difference so far.

Here’s where RocksDB brings important new capabilities to the table. There’s a problem with the single-pass merge described above when the number of intermediate files, N/R, is large. Merging the sorted intermediate files in limited memory requires constantly reading little bits from all those files, incurring a lot of disk seeks on rotating drives. In fact, at some point, samtools sort performance becomes effectively bound to disk seeking; we’ll see an example of this below.

In this scenario, samtools rocksort can sort the same data in much less time, using no more memory, by invoking RocksDB’s background compaction capabilities. With a few extra lines of code we configure RocksDB so that, while we’re still in the process of loading the BAM data, it runs additional background threads to merge batches of existing sorted temporary files into fewer, larger, sorted files. Just like the final merge, each background compaction requires only a modest amount of working memory.

[Diagram: how to sort with RocksDB]

By expending this background effort we greatly reduce the number of files in the final merge, and thus the seek load for datasets many times larger than provisioned RAM. The cost of shortening the merge critical path in this way is that each alignment record is processed and written to temporary files multiple times, instead of just once. This is an instance of a classic database engineering tradeoff where we accept write amplification in order to reduce read amplification. An explicit command-line option to samtools rocksort is needed to activate background compaction, because the write amplification can be counterproductive when the merge is not bound to disk seeks.
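Conceptually, compaction turns the single big merge into a multi-pass merge. The sketch below (plain Python again, with an arbitrary fan-in of 10 and sequential passes rather than RocksDB's concurrent background threads) shows the essential idea: pre-merge groups of runs into fewer, larger runs so that the final merge reads from only a handful of files:

import heapq
import pickle
import tempfile

def spill(sorted_records):
    # Write an already-sorted stream of records out as one run file.
    f = tempfile.TemporaryFile()
    for rec in sorted_records:
        pickle.dump(rec, f)
    f.seek(0)
    return f

def replay(f):
    # Stream records back from a run file.
    while True:
        try:
            yield pickle.load(f)
        except EOFError:
            return

def compact(run_files, fan_in=10):
    # Merge groups of runs into fewer, larger runs. Every record is
    # rewritten once per pass: write amplification traded for a much
    # smaller fan-in (and far fewer seeks) in the final merge.
    while len(run_files) > fan_in:
        group, run_files = run_files[:fan_in], run_files[fan_in:]
        run_files.append(spill(heapq.merge(*(replay(f) for f in group))))
    return run_files

def final_merge(run_files):
    # The streaming merge that feeds the sorted output.
    return heapq.merge(*(replay(f) for f in compact(run_files)))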

Benchmarks
We benchmarked samtools rocksort and samtools sort on Amazon EC2 instances to explore different hardware configurations and parameter settings. The build of samtools used in these benchmarks includes performance enhancements we made to its parallel compression code as an aside while developing rocksort. That change significantly benefits both algorithms, so the speed of rocksort compared to previous versions of samtools is even better than illustrated below.

Warmup: sorting a 20 GiB BAM file on rotating drives
First we took a BAM file from the 1000 Genomes Project representing a low-coverage, whole-genome sequence. We shuffled the 20 GiB BAM file and sorted it on a c1.xlarge instance, which has eight virtual CPUs, four rotating drives for scratch space (here configured in RAID0), and only 7 GiB of RAM. The limited RAM on this instance type provides an interesting approximation of the fairly common strategy of running BAM sorting concurrently with (streaming input from) another memory-hungry process, such as an aligner.

The exact invocations were:

cat wgs_lo.shuffled.bam | pigz -dc | samtools sort -@ 8 -m 640M - wgs_lo.sort
cat wgs_lo.shuffled.bam | pigz -dc | samtools rocksort -@ 8 -m 640M - wgs_lo.rocksort

(Streaming the input to samtools from pigz is a trick to effectively use a separate worker thread for decompression despite samtools lacking that capability.)

Here’s a chart of the results (data collected at 30-second intervals by dstat), with samtools rocksort in blue and samtools sort in black:

[Chart: CPU, memory, and disk activity for samtools rocksort (blue) and samtools sort (black) on the 20 GiB dataset]

The CPU usage chart clearly reflects the two distinct phases (splitting and merging) of both algorithms. The existing sort algorithm has erratic CPU usage in the split phase, as it alternates between reading the input data and writing compressed temporary files. RocksDB does a better job parallelizing the corresponding steps, and has less compression work to do thanks to Snappy, so it shows more consistent and less total CPU usage. In the merge phase, rocksort is able to drive somewhat higher CPU utilization for output compression, reflecting the shorter merge critical path. Its memory consumption is substantially lower on average (but more variable), and it drives a higher rate of disk activity.

Overall, samtools sort took 34m to sort this dataset, and samtools rocksort shaved off about seven minutes. That’s a 25% speed increase – not bad at all!

The future: 135 GiB BAM on solid-state drives
The 20 GiB dataset in the previous test is pretty well representative of large datasets to which samtools sort is currently applied, but it seems safe to expect this size to grow along with the ever-increasing throughput of modern sequencing instruments. Let’s next consider producing a 135 GiB BAM file — about the size of a deep human whole genome sequence (WGS) — in one sort operation. We synthesized a dataset of this size by merging and shuffling several of the low-coverage BAMs from the 1000 Genomes Project.

First we benchmarked this on one of the powerful new i2.2xlarge instances, which notably comes equipped with two solid-state drives (SSDs), also configured here in RAID0. SSDs largely obviate “seek time” concerns, and they’re becoming increasingly available both from cloud providers and in HPC clusters. We also set it up with a fairly generous 32 GiB of RAM.

cat wgs_hi.shuffled.bam | pigz -dc | samtools sort -@ 8 -m 4G - wgs_hi.sort
cat wgs_hi.shuffled.bam | pigz -dc | samtools rocksort -@ 8 -m 4G - wgs_hi.rocksort

[Chart: CPU, memory, and disk activity for samtools rocksort and samtools sort on the 135 GiB dataset, i2.2xlarge instance]

In this test samtools rocksort (2h 7m) was 34% faster than samtools sort (2h 50m). The difference is more modest if we only consider the merge phase, though, which is what really matters when the data are streaming from an aligner. SSDs are awesome! (Though still costly to use just for scratch space.)

The extreme case: 135 GiB BAM on rotating drives
Lastly, let’s look at a case where background compaction allows samtools rocksort to really blow the doors off. We took the deep WGS BAM from the last test back to the c1.xlarge instance with four rotating drives in RAID0 and limited RAM. To activate background compaction, we supply rocksort with an estimate of the total uncompressed size of the BAM dataset, in this case 512 GiB; the README file contains guidelines for coming up with this estimate.

cat wgs_hi.shuffled.bam | pigz -dc | samtools sort -@ 8 -m 640M - wgs_hi.sort
cat wgs_hi.shuffled.bam | pigz -dc | samtools rocksort -@ 8 -m 640M -s 512G - wgs_hi.rocksort

[Chart: CPU, memory, and disk activity for samtools rocksort and samtools sort on the 135 GiB dataset, c1.xlarge instance with rotating drives]

In this test, samtools sort produces well over 1,000 temporary disk files, and then tries to concurrently stream all of them in the subsequent merge phase. That’s far too much effectively random I/O for rotating disks, even in a RAID array, and leads to very poor system utilization. In contrast, samtools rocksort makes much more efficient use of the disks, as the background compaction effectively leads to a highly concurrent, two-pass external merge sort. It finished in about one-sixth of the time (3h 46m vs. 22h 37m).

What’s next?
There are still some obvious ways to further speed up samtools BAM sorting, which would largely close any remaining speed gap with commercial tools. Parallel CPU utilization during output compression, while significantly improved by our patches, could still be increased — though at the cost of further complicating the Pthreads-based code, which is no one’s idea of fun. At the other end, input decompression could also be sped up. Indeed, the decompression rate with pigz falls well short of the insertion rate RocksDB is capable of sustaining, and BAM’s block compression format admits more parallelization than pigz (a general-purpose program) takes advantage of. There’s actually some experimental code for this out there already, which would probably benefit both rocksort and the vanilla sort algorithm. That said, input decompression is not a factor in the common case where the data are slowly streamed from an aligner.

Both of those potential optimizations reflect an important point worth emphasizing in closing: rocksort is not a rewrite of samtools. It adds one independent command in bam_rocksort.c, comparable in SLOC to the existing bam_sort.c. The implementation reuses numerous BAM processing functions from samtools, and thus will benefit from any future improvements therein. And while rocksort executes a highly concurrent, I/O-heavy algorithm, almost all of the hard work is courtesy of crack engineers from Facebook and Google (at least one of whom is said to build his code before shipping it, but only to check for compiler and linker bugs!). Importantly therefore, rocksort is maintainable.

You can give rocksort a try by building our fork of samtools.

Developer Spotlight: Baylor Computational Biologist on Porting Mercury to the Cloud

As users log in to DNAnexus to check out the Mercury pipeline and see how it could be applied to their own data, we sat down with Dr. Narayanan Veeraraghavan from Baylor’s Human Genome Sequencing Center (HGSC) to talk about details of the pipeline, how it was ported to our cloud platform, and upcoming feature additions.

Almost every sample sequenced at the HGSC (~24 terabases/month) is processed by Mercury, the production pipeline for analyzing next-generation sequencing data. The framework takes raw base calls from the sequencing instruments and runs a series of algorithms and tools to generate an annotated VCF file. This process includes steps for mapping to a reference genome with BWA, certain finishing steps for preparing the BAM file for realignment and recalibration with GATK, and variant calling and annotation using Atlas2 and Cassandra, tools developed at the HGSC.

The pipeline runs well on the genome center’s local computational infrastructure, matching up with the production data flow rate. But some large projects — such as Baylor’s participation in the CHARGE consortium — can impose substantial load on that infrastructure, inconveniencing researchers who use the local cluster for their own projects. To address such spikes in compute requirements and to explore scaling of compute infrastructure for future demands in next-gen sequencing, HGSC looked for a technology that would keep the business going as usual locally while allowing the center to perform ambitious large-scale analysis. Enter cloud computing.

Veeraraghavan, who is a lead scientific programmer at the HGSC, was in charge of exploring the feasibility of operating on the cloud at scale. His immediate task was to take the existing Mercury pipeline and get it to run in the cloud. This was no small task, given Mercury’s complex, multi-component workflow and the fact that it was optimized to run on local compute infrastructure. Porting it meant reimagining the pipeline for a cloud environment and optimizing it for massively parallel computing.

Working closely with Andrew Carroll, a scientist at DNAnexus, Veeraraghavan says, “We optimized every component of the pipeline as we ported it to DNAnexus.” Becoming acquainted with the DNAnexus platform and getting the first implementation of Mercury up and running took just three weeks, far less time than Veeraraghavan was expecting given his previous efforts to port code to new environments. The DNAnexus team helped make sure that the new Mercury pipeline took advantage of smart ways to parallelize tasks and handle data. “That translates to faster turnaround time and lower cost,” Veeraraghavan says. “The most important part of my experience was the fantastic user support. The DNAnexus team was receptive to feature suggestions and quick to implement them. Working with DNAnexus was a real partnership.”

The CHARGE project data analysis was chosen as one of a few pilot projects to understand computing and collaborating on the cloud using DNAnexus; the successful effort wound up being the largest known genomic analysis conducted on Amazon Web Services, the cloud provider used by DNAnexus.

According to Veeraraghavan, new features are coming soon to the Mercury pipeline, including an RNA-seq workflow as well as tools for variant calling at the population level and on low depth-of-coverage whole genome sequences. These will be freely available to the public.

Other features that we helped HGSC enable include LIMS communication and a direct .bcl-to-cloud implementation in which the raw base calls from the sequencing instruments are uploaded directly to the cloud in real time, making the process of data analysis automatic, integrated, and seamless. These features, specific to HGSC, were made possible by our science team, which aims for a painless and straightforward integration for all DNAnexus customers.

Veeraraghavan, who has extensive programming expertise, says that DNAnexus is a good option for beginners and advanced users alike. For people like him, there’s virtually unlimited configurability via the command line interface, meta-data features, and APIs. “It can also be used by scientists who just want to pick and choose a few apps, stitch them together graphically into a workflow, and get going with their science without having to bother with installation and configuration,” he says. “It is extremely easy for a person with zero programming background to use the Mercury pipeline on DNAnexus. You can bring your data into a secure environment and use our tools without any hardware or software setup. This is the next generation of genomic analysis.”

Cache Money: Keeping Your Product Fresh With Minimal Customer Pain

Background

The DNAnexus platform’s front-end is driven by a technology stack called “Membrane”. Membrane is a front-end views framework that allows internal developers to build isolated components, which can be combined throughout the website to create complex interactions. Below are two screenshots from two different components, the Data Tree and the Data List, respectively.

Data Tree – Used to show the folder hierarchy in a project, allowing users to expand/collapse folders, select a folder, etc.

[Screenshot: Data Tree component]

Data List – Used to show a list of objects/folders, with support for sorting, etc. Typically used to show the items in a folder, or search results.

[Screenshot: Data List component]

At the time of this writing, Membrane has 111 components, each having a JavaScript, HTML, and CSS file. That’s a total of 333 files for the components, in addition to bootstrapping resources and third-party libraries.

The problem / more background

The Membrane team wants to push out new features quickly and often, with minimal customer pain. In most releases only a few components have been updated, and we want users to download only those updated components while keeping their existing copies of the components that haven’t changed. This is where browser cache management comes in. Web browsers have a web cache where they store content such as HTML, JavaScript, CSS, images, etc. When you visit the website again in the future this content may be served from the cache, which removes the need to fetch it over the network. The conditions under which content is served from the cache are subject to much configuration, such as client-side cache settings and server-side response headers like Expires and Cache-Control (with its max-age directive), among others.

Utilizing the web cache can greatly speed up the loading of your product, so we want to use the cache as much as possible. Telling clients that these assets never expire is a simple way to ensure the cache is used, but at the expense of clients seeing a stale product for as long as those entries exist in their cache. What we want is a solution that uses a browser cache entry for as long as possible, while quickly invalidating it when there is a newer version of the asset.

The solution

Let’s start by coming up with a mechanism for versioning our assets. Instead of assigning explicit version numbers, we compute an MD5 checksum of each asset’s content. During each release, if the MD5 checksum of an asset is the same as it was during the prior release, we know that it did not change. For example, the MD5 checksum for the Data List JavaScript is currently c2a92513c4604b92255f620950ecb93c.

We will now define a new release asset called the manifest. The manifest contains a mapping from an asset path, such as /data/list/view.js, to the MD5 checksum for that asset (c2a92513c4604b92255f620950ecb93c). The manifest is loaded during bootstrapping and is always consulted when fetching an asset, providing the latest MD5 checksum for that asset. To prevent caching of the manifest itself, we use the HTTP header “Cache-Control: no-cache”. Every time a user loads the page we fetch the manifest from the server, and we also periodically ping the server during the user session to detect manifest changes and notify users when there are updates.
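As a sketch, a build-time script to produce such a manifest could look like the following (the directory layout and JSON output format are our assumptions, not necessarily Membrane’s actual build tooling):

import hashlib
import json
from pathlib import Path

def build_manifest(asset_root):
    # Map each asset path (relative to the asset root) to the MD5
    # checksum of its current content.
    manifest = {}
    for path in sorted(Path(asset_root).rglob("*")):
        if path.is_file():
            digest = hashlib.md5(path.read_bytes()).hexdigest()
            manifest["/" + path.relative_to(asset_root).as_posix()] = digest
    return manifest

if __name__ == "__main__":
    # e.g. {"/data/list/view.js": "c2a92513c4604b92255f620950ecb93c", ...}
    print(json.dumps(build_manifest("assets"), indent=2))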

The next part of the solution involves how the manifest information is used to manage the browser cache. Earlier I mentioned that the browser may use the cache to fulfill requests for previously accessed content. The browser determines whether or not it has already seen an asset by comparing the URL of the asset with entries in the cache. Therefore, if the page requests http://foo.com/image.png, the browser will first check whether it has a cache entry with that URL. Knowing how the browser keys cache entries gives us the ability to force a client update by simply changing the URL of a particular asset.

We’ve already come up with a way to create a unique version of each asset: the MD5 checksum. We will now use this in the URL to address not only a particular asset but also the version of that asset. Our path for the Data List JavaScript will now be /asset/c2a92513c4604b92255f620950ecb93c/data/list/view.js. Now that our asset paths have a version in them, we can update our nginx server to include headers which tell the browser to cache these files for a very long period of time (e.g. 2 years). If the Data List JavaScript is updated in a subsequent release, the MD5 checksum changes, and the resulting path to that asset changes as well, which removes the chance of the client using a stale cache entry.
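A small helper shows the scheme (the /asset/<md5>/<path> layout comes from the post; the function itself is only an illustrative sketch):

def versioned_url(asset_path, manifest):
    # Prefix the asset path with its current checksum so the URL changes
    # whenever the content changes, e.g.
    # /data/list/view.js -> /asset/c2a9.../data/list/view.js
    return "/asset/" + manifest[asset_path] + asset_path

# Because the URL is unique to the content, the server can attach a
# far-future lifetime to these responses, e.g.
# "Cache-Control: public, max-age=63072000" (two years), while the
# manifest itself stays at "Cache-Control: no-cache".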

The final piece of our solution is to make deployments simple. We don’t want to keep these MD5 checksums on our server’s file system, so we use an nginx rewrite rule to strip the MD5 checksum from the path when resolving the asset on disk. So in reality we don’t have assets on disk with MD5 checksums in their paths; rather, the MD5 checksum is used only in the URL to fetch the asset, as a mechanism for managing the user’s browser cache.
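The post doesn’t reproduce the actual nginx rule, but the rewrite it describes is roughly equivalent to the following path transformation (shown here as a Python sketch rather than nginx configuration):

import re

# /asset/<32 hex chars>/<real path>  ->  /<real path>
ASSET_URL = re.compile(r"^/asset/[0-9a-f]{32}(/.*)$")

def on_disk_path(request_path):
    # Return the path used to resolve the asset on disk, with the version
    # segment removed; non-asset paths pass through unchanged.
    m = ASSET_URL.match(request_path)
    return m.group(1) if m else request_path

assert on_disk_path("/asset/c2a92513c4604b92255f620950ecb93c/data/list/view.js") == "/data/list/view.js"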

Closing thoughts

Cache management is a tricky problem, and if you search the web you will find a variety of solutions, each with its own set of pros and cons. Cache validation is also a complex topic which involves more mechanisms than I’ve gone into here, such as ETags.

After evaluating existing solutions both internally and externally, we’ve come up with a fairly simple solution that allows us to update quickly and often, while leveraging the user’s cache as much as possible to provide fast load times and remove the potential for users seeing a stale version of a component.

For additional information or if you have any questions, please feel free to email evan@dnanexus.com.