Faster BAM Sorting with SAMtools and RocksDB

Brief introduction for non-experts: fully sequencing a person’s genome with current technology produces roughly 1.5 billion data fragments totaling 500 GiB (uncompressed). BAM is a compressed file format for this data used by various downstream analyses, including clinical interpretation algorithms. Such analyses usually require the fragments to be sorted according to their estimated position of origin in the human genome.

We’ve been hacking on a fork of samtools providing a new command, samtools rocksort, capable of sorting large BAMs significantly faster than samtools sort on modern server hardware. It leverages the flexible external sorting and compaction algorithms in RocksDB, Facebook’s recently open-sourced embedded storage engine (itself a descendant of Google’s LevelDB).

How to sort a large BAM
To sort a BAM file larger than available memory, samtools divides the data into several batches, each of which it sorts in memory and writes to a temporary disk file. It then merges the temporary files to form the final BAM file. Because the files are individually sorted, the merging can be done in a “streaming” fashion using a heap, requiring only a small portion of each temporary file in memory at any time.

[Diagram: external merge sort of alignment records into the sorted BAM file]

If the total number of alignment records is N, and we sort up to R records in RAM to create each temporary file, then we create about N/R temporary files. The algorithmic running time is O((N/R) · R log R + N log(N/R)) = O(N log R + N log(N/R)) = O(N log N), where the N log(N/R) term is incurred by the heap used in the merge phase. Newer versions of samtools can also parallelize in-memory sorting and output compression across multiple cores, which yields a nice wallclock speedup; however, the merge phase still has an O(N log(N/R)) critical path which is essentially unparallelized.
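To make the merge phase concrete, here is a minimal C++ sketch (not samtools code) of streaming k-way merging with a min-heap: the heap holds one candidate element per sorted run, and every pop/push costs O(log(number of runs)), which is where the N log(N/R) term comes from.

#include <cstddef>
#include <functional>
#include <iostream>
#include <queue>
#include <utility>
#include <vector>

int main() {
  // Stand-ins for the N/R individually sorted temporary files.
  std::vector<std::vector<int>> runs = {{1, 4, 9}, {2, 3, 8}, {5, 6, 7}};
  std::vector<std::size_t> cursor(runs.size(), 0);

  // Min-heap of (next value, run index); only one element per run is resident.
  using Item = std::pair<int, std::size_t>;
  std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
  for (std::size_t r = 0; r < runs.size(); ++r)
    if (!runs[r].empty()) heap.push({runs[r][0], r});

  while (!heap.empty()) {
    auto [value, r] = heap.top();
    heap.pop();
    std::cout << value << ' ';  // emit the next record of the sorted output
    if (++cursor[r] < runs[r].size())
      heap.push({runs[r][cursor[r]], r});  // refill from the run just consumed
  }
  std::cout << '\n';  // prints: 1 2 3 4 5 6 7 8 9
  return 0;
}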

Sorting with RocksDB
At a high level, samtools rocksort works in much the same way, but RocksDB does almost all of the hard work. We insert each alignment record into a temporary RocksDB ordered key-value database, with a key encoding the record's reference coordinates. As we load the data, RocksDB runs background threads to sort batches of these records in memory (using a comparator function we provide) and write each batch out to a file. Once we finish loading all the BAM data, we ask RocksDB for a database iterator, which we simply loop over to emit the sorted BAM file. The RocksDB iterator automatically heap-merges from the sorted disk files as needed.
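A rough C++ sketch of this flow is below. The key layout, comparator, and names here are assumptions for illustration; the actual implementation lives in bam_rocksort.c and works with samtools' C structures. One detail worth noting: RocksDB stores one value per distinct key, so the sketch appends a running counter to the coordinate key to keep records mapping to the same position from colliding.

#include <cstdint>
#include <cstring>
#include <string>

#include <rocksdb/comparator.h>
#include <rocksdb/db.h>
#include <rocksdb/options.h>

// Hypothetical key: reference id, position, and a unique sequence number.
struct CoordKey { int32_t tid; int32_t pos; uint64_t seq; };

// Comparator supplied to RocksDB so it keeps keys in genome-coordinate order.
class CoordComparator : public rocksdb::Comparator {
 public:
  int Compare(const rocksdb::Slice& a, const rocksdb::Slice& b) const override {
    CoordKey ka, kb;
    std::memcpy(&ka, a.data(), sizeof ka);
    std::memcpy(&kb, b.data(), sizeof kb);
    if (ka.tid != kb.tid) return ka.tid < kb.tid ? -1 : 1;
    if (ka.pos != kb.pos) return ka.pos < kb.pos ? -1 : 1;
    if (ka.seq != kb.seq) return ka.seq < kb.seq ? -1 : 1;
    return 0;
  }
  const char* Name() const override { return "CoordComparator"; }
  void FindShortestSeparator(std::string*, const rocksdb::Slice&) const override {}
  void FindShortSuccessor(std::string*) const override {}
};

int main() {
  CoordComparator cmp;
  rocksdb::Options opts;
  opts.create_if_missing = true;
  opts.comparator = &cmp;  // RocksDB sorts and merges with this ordering

  rocksdb::DB* db = nullptr;
  rocksdb::Status st = rocksdb::DB::Open(opts, "/tmp/bam_sort_tmp", &db);
  if (!st.ok()) return 1;

  // Load phase: one Put per alignment record. A single dummy record stands in
  // for the loop over the input BAM stream.
  uint64_t n = 0;
  CoordKey k{0 /*tid*/, 12345 /*pos*/, n++};
  std::string record = "raw alignment record bytes would go here";
  db->Put(rocksdb::WriteOptions(),
          rocksdb::Slice(reinterpret_cast<const char*>(&k), sizeof k),
          rocksdb::Slice(record));

  // Merge phase: iterate in key order; RocksDB heap-merges its sorted files.
  rocksdb::Iterator* it = db->NewIterator(rocksdb::ReadOptions());
  for (it->SeekToFirst(); it->Valid(); it->Next()) {
    // it->value() holds the record bytes; write them to the sorted BAM here.
  }
  delete it;
  delete db;
  return 0;
}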

Just entrusting the external sort to RocksDB in this way is enough to get a decent speed improvement of 25% or so over the original sort implementation. That’s largely because the binary key-value interface allows us to simply blit each in-memory bam1_t record directly into and out of RocksDB, which compresses temporary disk files using Snappy. In contrast, samtools sort writes its temporary files in the actual BAM format, requiring encoding/decoding and slower zlib compression. Very nice and convenient — but no fundamental algorithmic difference so far.
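For illustration, these two ingredients amount to very little code. The compression option below is RocksDB's; the helper function is a hypothetical stand-in for however the record buffer is obtained.

#include <cstddef>

#include <rocksdb/options.h>
#include <rocksdb/slice.h>

// Compress RocksDB's temporary files with Snappy rather than zlib.
void use_snappy(rocksdb::Options& opts) {
  opts.compression = rocksdb::kSnappyCompression;
}

// The "blit": wrap the record's existing in-memory bytes in a Slice. RocksDB
// copies them verbatim, so no BAM encode/decode happens on the way in or out.
rocksdb::Slice record_value(const char* data, std::size_t len) {
  return rocksdb::Slice(data, len);
}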

Here’s where RocksDB brings important new capabilities to the table. There’s a problem with the single-pass merge described above when the number of intermediate files, N/R, is large. Merging the sorted intermediate files in limited memory requires constantly reading little bits from all those files, incurring a lot of disk seeks on rotating drives. In fact, at some point, samtools sort performance becomes effectively bound to disk seeking; we’ll see an example of this below.

In this scenario, samtools rocksort can sort the same data in much less time, using no more memory, by invoking RocksDB’s background compaction capabilities. With a few extra lines of code we configure RocksDB so that, while we’re still in the process of loading the BAM data, it runs additional background threads to merge batches of existing sorted temporary files into fewer, larger, sorted files. Just like the final merge, each background compaction requires only a modest amount of working memory.
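A sketch of what those extra lines might look like is below; the parameter values are assumptions chosen for illustration, not the ones rocksort actually derives from its command line.

#include <rocksdb/options.h>

void enable_background_compaction(rocksdb::Options& opts) {
  // Let several compaction threads run while the BAM data is still loading.
  opts.IncreaseParallelism(4);
  opts.max_background_compactions = 4;

  // Merge level-0 files (the freshly written sorted batches) early, so the
  // final iterator sees a few large sorted files instead of hundreds of
  // small ones.
  opts.level0_file_num_compaction_trigger = 8;

  // Larger compaction outputs mean fewer files in the final merge.
  opts.target_file_size_base = 512ull << 20;  // 512 MiB
}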

[Diagram: how to sort with RocksDB]

By expending this background effort we greatly reduce the number of files in the final merge, and thus the seek load for datasets many times larger than provisioned RAM. The cost of shortening the merge critical path in this way is that each alignment record is processed and written to temporary files multiple times, instead of just once. This is an instance of a classic database engineering tradeoff where we accept write amplification in order to reduce read amplification. An explicit command-line option to samtools rocksort is needed to activate background compaction, because the write amplification can be counterproductive when the merge is not bound to disk seeks.

Benchmarks
We benchmarked samtools rocksort and samtools sort on Amazon EC2 instances to explore different hardware configurations and parameter settings. The build of samtools used in these benchmarks includes performance enhancements to its parallel compression code we made as an aside while developing rocksort. That change significantly benefits both algorithms, so the speed of rocksort compared to previous versions of samtools is even better than illustrated below.

Warmup: sorting a 20 GiB BAM file on rotating drives
First we took a BAM file from the 1000 Genomes Project representing a low-coverage, whole-genome sequence. We shuffled the 20 GiB BAM file and sorted it on a c1.xlarge instance, which has eight virtual CPUs, four rotating drives for scratch space (here configured in RAID0), and only 7 GiB of RAM. The limited RAM on this instance type provides an interesting approximation of the fairly common strategy of running BAM sorting concurrently with (streaming input from) another memory-hungry process, such as an aligner.

The exact invocations were:

cat wgs_lo.shuffled.bam | pigz -dc | samtools sort -@ 8 -m 640M - wgs_lo.sort
cat wgs_lo.shuffled.bam | pigz -dc | samtools rocksort -@ 8 -m 640M - wgs_lo.rocksort

(Streaming the input to samtools from pigz is a trick to effectively use a separate worker thread for decompression despite samtools lacking that capability.)

Here’s a chart of the results (data collected at 30-second intervals by dstat), with samtools rocksort in blue and samtools sort in black:

[Chart: CPU, memory, and disk usage over time for samtools rocksort (blue) and samtools sort (black)]

The CPU usage chart clearly reflects the two distinct phases (splitting and merging) of both algorithms. The existing sort algorithm has erratic CPU usage in the split phase, as it alternates between reading the input data and writing compressed temporary files. RocksDB does a better job parallelizing the corresponding steps, and has less compression work to do thanks to Snappy, so it shows more consistent and less total CPU usage. In the merge phase, rocksort is able to drive somewhat higher CPU utilization for output compression, reflecting the shorter merge critical path. Its memory consumption is substantially lower on average (but more variable), and it drives a higher rate of disk activity.

Overall, samtools sort took 34m to sort this dataset, and samtools rocksort shaved off about seven minutes. That’s a 25% speed increase – not bad at all!

The future: 135 GiB BAM on solid-state drives
The 20 GiB dataset in the previous test is pretty well representative of large datasets to which samtools sort is currently applied, but it seems safe to expect this size to grow along with the ever-increasing throughput of modern sequencing instruments. Let’s next consider producing a 135 GiB BAM file — about the size of a deep human whole genome sequence (WGS) — in one sort operation. We synthesized a dataset of this size by merging and shuffling several of the low-coverage BAMs from the 1000 Genomes Project.

First we benchmarked this on one of the powerful new i2.2xlarge instances, which notably comes equipped with two solid-state drives (SSDs), also configured here in RAID0. SSDs largely obviate “seek time” concerns, and they’re becoming increasingly available both from cloud providers and in HPC clusters. We also gave the sorts a fairly generous 32 GiB memory budget (eight threads at -m 4G).

cat wgs_hi.shuffled.bam | pigz -dc | samtools sort -@ 8 -m 4G - wgs_hi.sort
cat wgs_hi.shuffled.bam | pigz -dc | samtools rocksort -@ 8 -m 4G - wgs_hi.rocksort

[Chart: resource utilization for samtools rocksort and samtools sort on the i2.2xlarge instance]

In this test samtools rocksort (2h 7m) was 34% faster than samtools sort (2h 50m). The difference is more modest if we only consider the merge phase, though, which is what really matters when the data are streaming from an aligner. SSDs are awesome! (Though still costly to use just for scratch space.)

The extreme case: 135 GiB BAM on rotating drives
Lastly, let’s look at a case where background compaction allows samtools rocksort to really blow the doors off. We took the deep WGS BAM from the last test back to the c1.xlarge instance with four rotating drives in RAID0 and limited RAM. To activate background compaction, we supply rocksort with an estimate of the total uncompressed size of the BAM dataset, in this case 512 GiB; the README file contains guidelines for coming up with this estimate.

cat wgs_hi.shuffled.bam | pigz -dc | samtools sort -@ 8 -m 640M - wgs_hi.sort
cat wgs_hi.shuffled.bam | pigz -dc | samtools rocksort -@ 8 -m 640M -s 512G - wgs_hi.rocksort

[Chart: resource utilization for samtools rocksort and samtools sort, 135 GiB BAM on the c1.xlarge instance with rotating drives]

In this test, samtools sort produces well over 1,000 temporary disk files, and then tries to concurrently stream all of them in the subsequent merge phase. That’s far too much effectively random I/O for rotating disks, even in a RAID array, and leads to very poor system utilization. In contrast, samtools rocksort makes much more efficient use of the disks, as the background compaction effectively leads to a highly concurrent, two-pass external merge sort. It finished five times faster (3h 46m vs. 22h 37m).

What’s next?
There are still some obvious ways to further speed up samtools BAM sorting, which would largely close any remaining speed gap with commercial tools. Parallel CPU utilization during output compression, while significantly improved by our patches, could still be increased — though at the cost of further complicating the Pthreads-based code, which is no one’s idea of fun. At the other end, input decompression could also be sped up. Indeed, the decompression rate with pigz falls well short of the insertion rate RocksDB is capable of sustaining, and BAM’s block compression format admits more parallelization than pigz (a general-purpose program) takes advantage of. There’s actually some experimental code for this out there already, which would probably benefit both rocksort and the vanilla sort algorithm. That said, input decompression is not a factor in the common case where the data are slowly streamed from an aligner.

Both of those potential optimizations reflect an important point worth emphasizing in closing: rocksort is not a rewrite of samtools. It adds one independent command in bam_rocksort.c, comparable in SLOC to the existing bam_sort.c. The implementation reuses numerous BAM processing functions from samtools, and thus will benefit from any future improvements therein. And while rocksort executes a highly concurrent, I/O-heavy algorithm, almost all of the hard work is courtesy of crack engineers from Facebook and Google (at least one of whom is said to build his code before shipping it, but only to check for compiler and linker bugs!). Importantly therefore, rocksort is maintainable.

You can give rocksort a try by building our fork of samtools.