Security Advisory: Response to Heartbleed Vulnerability

On April 7, 2014, a serious vulnerability known as Heartbleed (CVE-2014-0160) was disclosed in the OpenSSL cryptography library, affecting many popular software packages and Internet services. The vulnerability could potentially be exploited to steal sensitive data such as encryption keys and user passwords. We have no evidence that any DNAnexus customer data or credentials were compromised using this vulnerability. However, out of an abundance of caution, we have taken the steps described below and will continue to implement additional security measures in response to this event.

At DNAnexus, the security of our clients is our top priority. As soon as the vulnerability was disclosed, we started identifying services on our platform that were affected. All such services were patched to eliminate the vulnerability within 8 hours after it was initially disclosed. After this initial response, we started a thorough analysis of how our systems and the security of our clients could have been affected.

At this time, we have no reason to believe any customer data or credentials were compromised using this vulnerability. Moreover, none of our services that handle genomic data were directly vulnerable. However, services that handle credential information were affected. The nature of this attack makes it hard to detect, and therefore we have decided to take the following precautions:

  • We have replaced our affected SSL certificates to eliminate any risk in case our private SSL keys were compromised.
  • Existing browser-based login sessions initiated before the patch date have been terminated, so you will need to log in again the next time you use the platform.
  • We have triggered early expiration of DNAnexus passwords set before the patch date, so the next time you log in to the platform, you will be prompted to reset your password.
  • The next time you log in, you will also see a security alert advising you to update any API keys that you may have issued on the platform.

To minimize the risk of your account being compromised by attacks such as this one, we also recommend turning on Two-Factor Authentication (2FA) on the DNAnexus platform, or cycling it (turning it off and back on) if it was already enabled. Follow these steps:

  • Log in to https://platform.dnanexus.com/
  • Click your name in the upper right and select the “Profile” menu item
  • Select the Account tab and click Security
  • If 2FA was previously on, turn it off using your current password and a 2FA Code
  • Turn on Two-Factor Authentication and link your account and authenticator application
  • Verify access using your current password and a Two-Factor Authentication Code, being sure to save your backup codes before pressing “Continue”

We welcome customer feedback – if you have any questions or comments about our security practices, please reach us at support@dnanexus.com.


Faster BAM Sorting with SAMtools and RocksDB

Brief introduction for non-experts: fully sequencing a person’s genome with current technology produces roughly 1.5 billion data fragments totaling 500 GiB (uncompressed). BAM is a compressed file format for this data used by various downstream analyses, including clinical interpretation algorithms. Such analyses usually require the fragments to be sorted according to their estimated position of origin in the human genome.

We’ve been hacking on a fork of samtools providing a new command, samtools rocksort, capable of sorting large BAMs significantly faster than samtools sort on modern server hardware. It leverages the flexible external sorting and compaction algorithms in RocksDB, Facebook’s recently open-sourced embedded storage engine (itself a descendant of Google’s LevelDB).

How to sort a large BAM
To sort a BAM file larger than available memory, samtools divides the data into several batches, each of which it sorts in memory and writes to a temporary disk file. It then merges the temporary files to form the final BAM file. Because the files are individually sorted, the merging can be done in a “streaming” fashion using a heap, requiring only a small portion of each temporary file in memory at any time.

[Figure: temporary sorted files being merged into the sorted BAM file]

If the total number of alignment records is N, and we sort up to R of them in RAM to create each temporary file, then we create N/R temporary files. The algorithmic running time is O(N log R + N log(N/R)) = O(N log N), where the N log R term comes from the in-memory sorts and the N log(N/R) term comes from the heap used in the merge phase. Newer versions of samtools can also parallelize in-memory sorting and output compression across multiple cores, which yields a nice wallclock speedup; however, the merge phase still has an O(N log(N/R)) critical path which is essentially unparallelized.
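To make the merge phase concrete, here is a minimal C++ sketch (illustrative only, not samtools code) of a heap-based k-way merge of pre-sorted runs; each output record involves one heap pop and push over a heap of up to N/R entries, which is where the merge term above comes from. The Run type is a stand-in for reading records back from a temporary file.

#include <cstdint>
#include <functional>
#include <optional>
#include <queue>
#include <utility>
#include <vector>

// Stand-in for a sorted temporary file: yields records (here just 64-bit
// sort keys) in ascending order.
struct Run {
    std::vector<uint64_t> records;  // already sorted
    size_t next_index = 0;
    std::optional<uint64_t> next() {
        if (next_index == records.size()) return std::nullopt;
        return records[next_index++];
    }
};

// K-way merge: keep one candidate record per run in a min-heap, so the
// smallest remaining record across all runs is always at the top.
std::vector<uint64_t> merge_runs(std::vector<Run>& runs) {
    using Entry = std::pair<uint64_t, size_t>;  // (record key, run index)
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;
    for (size_t i = 0; i < runs.size(); ++i)
        if (auto r = runs[i].next()) heap.push({*r, i});

    std::vector<uint64_t> out;
    while (!heap.empty()) {
        auto [key, i] = heap.top();  // smallest remaining record
        heap.pop();                  // O(log k), k = number of runs = N/R
        out.push_back(key);          // in samtools this would be written to the output BAM
        if (auto r = runs[i].next()) heap.push({*r, i});  // refill from that run
    }
    return out;
}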

Sorting with RocksDB
At a high level, samtools rocksort works in much the same way, but RocksDB does almost all of the hard work. We insert each alignment record into a temporary RocksDB ordered key-value database, with a key encoding the record’s reference coordinates. As a matter of course, RocksDB runs background threads to sort batches of these records in memory (using a comparator function we provide) and write each batch out to a file. Once we finish loading all the BAM data, we ask RocksDB for a database iterator, which we simply loop over to emit the sorted BAM file. The RocksDB iterator automatically heap-merges from the sorted disk files as needed.
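As a rough sketch of that load-then-iterate flow, using the standard RocksDB C++ API with toy keys and values (the real bam_rocksort.c stores serialized BAM records and supplies its own coordinate comparator):

#include <cstdio>
#include <string>
#include "rocksdb/db.h"
#include "rocksdb/options.h"

int main() {
    rocksdb::Options options;
    options.create_if_missing = true;
    // In rocksort, options.comparator would order keys by reference coordinates.

    rocksdb::DB* db = nullptr;
    rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/rocksort-sketch", &db);
    if (!s.ok()) { std::fprintf(stderr, "%s\n", s.ToString().c_str()); return 1; }

    // Load phase: insert records in arbitrary order. RocksDB's background
    // threads sort batches in memory and flush each batch to a sorted file.
    for (int i = 100000; i > 0; --i) {
        char key[16];
        std::snprintf(key, sizeof key, "%08d", i);        // stand-in for (tid, pos)
        db->Put(rocksdb::WriteOptions(), key, "record");  // stand-in for the BAM record bytes
    }

    // Merge phase: a single iterator heap-merges the sorted files on demand,
    // yielding records in key order; rocksort writes each one to the output BAM.
    rocksdb::Iterator* it = db->NewIterator(rocksdb::ReadOptions());
    for (it->SeekToFirst(); it->Valid(); it->Next()) {
        // consume it->key() / it->value()
    }
    delete it;
    delete db;
    return 0;
}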

Just entrusting the external sort to RocksDB in this way is enough to get a decent speed improvement of 25% or so over the original sort implementation. That’s largely because the binary key-value interface allows us to simply blit each in-memory bam1_t record directly into and out of RocksDB, which compresses temporary disk files using Snappy. In contrast, samtools sort writes its temporary files in the actual BAM format, requiring encoding/decoding and slower zlib compression. Very nice and convenient — but no fundamental algorithmic difference so far.
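That binary interface treats keys and values as opaque byte arrays; the sort order comes entirely from the user-supplied comparator mentioned above. For illustration, a comparator over a hypothetical key layout (a 4-byte reference id followed by a 4-byte position; the actual layout in bam_rocksort.c may differ) could look like this:

#include <cstdint>
#include <cstring>
#include <string>
#include "rocksdb/comparator.h"
#include "rocksdb/slice.h"

// Hypothetical 8-byte key: 4-byte reference id followed by 4-byte position.
class CoordinateComparator : public rocksdb::Comparator {
public:
    const char* Name() const override { return "rocksort-sketch.CoordinateComparator"; }

    int Compare(const rocksdb::Slice& a, const rocksdb::Slice& b) const override {
        int32_t tid_a, pos_a, tid_b, pos_b;
        std::memcpy(&tid_a, a.data(), 4);  std::memcpy(&pos_a, a.data() + 4, 4);
        std::memcpy(&tid_b, b.data(), 4);  std::memcpy(&pos_b, b.data() + 4, 4);
        if (tid_a != tid_b) return tid_a < tid_b ? -1 : 1;  // sort by reference first
        if (pos_a != pos_b) return pos_a < pos_b ? -1 : 1;  // then by position
        return 0;
    }

    // Hooks that let RocksDB shorten index keys; no-ops are valid (if suboptimal).
    void FindShortestSeparator(std::string*, const rocksdb::Slice&) const override {}
    void FindShortSuccessor(std::string*) const override {}
};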

Here’s where RocksDB brings important new capabilities to the table. There’s a problem with the single-pass merge described above when the number of intermediate files, N/R, is large. Merging the sorted intermediate files in limited memory requires constantly reading little bits from all those files, incurring a lot of disk seeks on rotating drives. In fact, at some point, samtools sort performance becomes effectively bound to disk seeking; we’ll see an example of this below.

In this scenario, samtools rocksort can sort the same data in much less time, using no more memory, by invoking RocksDB’s background compaction capabilities. With a few extra lines of code we configure RocksDB so that, while we’re still in the process of loading the BAM data, it runs additional background threads to merge batches of existing sorted temporary files into fewer, larger, sorted files. Just like the final merge, each background compaction requires only a modest amount of working memory.

[Figure: how to sort with RocksDB]

By expending this background effort we greatly reduce the number of files in the final merge, and thus the seek load for datasets many times larger than provisioned RAM. The cost of shortening the merge critical path in this way is that each alignment record is processed and written to temporary files multiple times, instead of just once. This is an instance of a classic database engineering tradeoff where we accept write amplification in order to reduce read amplification. An explicit command-line option to samtools rocksort is needed to activate background compaction, because the write amplification can be counterproductive when the merge is not bound to disk seeks.
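To give a sense of what that configuration involves, here is a sketch using standard RocksDB options; the specific knobs and values are illustrative and are not taken from the rocksort source:

#include "rocksdb/options.h"

// Illustrative tuning only; the real rocksort settings may differ.
rocksdb::Options make_rocksort_options(bool background_compaction) {
    rocksdb::Options options;
    options.create_if_missing = true;
    options.compression = rocksdb::kSnappyCompression;  // cheap compression for temp files
    options.write_buffer_size = 640u << 20;             // in-memory batch size (cf. -m 640M)

    if (background_compaction) {
        // Merge small sorted files into fewer, larger ones while the BAM is
        // still loading, trading write amplification for a shorter final merge.
        options.disable_auto_compactions = false;
        options.max_background_compactions = 4;          // threads doing the merging
        options.max_background_flushes = 2;               // threads writing sorted batches
        options.level0_file_num_compaction_trigger = 8;   // start merging after this many files
    } else {
        // Default mode: no compactions, just one final merge over all files.
        options.disable_auto_compactions = true;
    }
    return options;
}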

Benchmarks
We benchmarked samtools rocksort and samtools sort on Amazon EC2 instances to explore different hardware configurations and parameter settings. The build of samtools used in these benchmarks includes performance enhancements we made to its parallel compression code as an aside while developing rocksort. That change significantly benefits both algorithms, so the speed of rocksort compared to previous versions of samtools is even better than illustrated below.

Warmup: sorting a 20 GiB BAM file on rotating drives
First we took a BAM file from the 1000 Genomes Project representing a low-coverage, whole-genome sequence. We shuffled the 20 GiB BAM file and sorted it on a c1.xlarge instance, which has eight virtual CPUs, four rotating drives for scratch space (configured here in RAID0), and only 7 GiB of RAM. The limited RAM on this instance type provides an interesting approximation of the fairly common strategy of running BAM sorting concurrently with (streaming input from) another memory-hungry process, such as an aligner.

The exact invocations were:

cat wgs_lo.shuffled.bam | pigz -dc | samtools sort -@ 8 -m 640M - wgs_lo.sort
cat wgs_lo.shuffled.bam | pigz -dc | samtools rocksort -@ 8 -m 640M - wgs_lo.rocksort

(Streaming the input to samtools from pigz is a trick to effectively use a separate worker thread for decompression despite samtools lacking that capability.)

Here’s a chart of the results (data collected at 30-second intervals by dstat), with samtools rocksort in blue and samtools sort in black:

[Chart: CPU, memory, and disk activity during the 20 GiB sort, with samtools rocksort in blue and samtools sort in black]

The CPU usage chart clearly reflects the two distinct phases (splitting and merging) of both algorithms. The existing sort algorithm has erratic CPU usage in the split phase, as it alternates between reading the input data and writing compressed temporary files. RocksDB does a better job parallelizing the corresponding steps, and has less compression work to do thanks to Snappy, so it shows more consistent and less total CPU usage. In the merge phase, rocksort is able to drive somewhat higher CPU utilization for output compression, reflecting the shorter merge critical path. Its memory consumption is substantially lower on average (but more variable), and it drives a higher rate of disk activity.

Overall, samtools sort took 34m to sort this dataset, and samtools rocksort shaved off about seven minutes. That’s a 25% speed increase – not bad at all!

The future: 135 GiB BAM on solid-state drives
The 20 GiB dataset in the previous test is pretty well representative of large datasets to which samtools sort is currently applied, but it seems safe to expect this size to grow along with the ever-increasing throughput of modern sequencing instruments. Let’s next consider producing a 135 GiB BAM file — about the size of a deep human whole genome sequence (WGS) — in one sort operation. We synthesized a dataset of this size by merging and shuffling several of the low-coverage BAMs from the 1000 Genomes Project.

First we benchmarked this on one of the powerful new i2.2xlarge instances, which notably comes equipped with two solid-state drives (SSDs), also configured here in RAID0. SSDs largely obviate “seek time” concerns, and they’re becoming increasingly available both from cloud providers and in HPC clusters. We also set it up with a fairly generous 32 GiB of RAM.

cat wgs_hi.shuffled.bam | pigz -dc | samtools sort -@ 8 -m 4G - wgs_hi.sort
cat wgs_hi.shuffled.bam | pigz -dc | samtools rocksort -@ 8 -m 4G - wgs_hi.rocksort

[Chart: samtools rocksort vs. samtools sort on the 135 GiB BAM, i2.2xlarge with SSDs]

In this test samtools rocksort (2h 7m) was 34% faster than samtools sort (2h 50m). The difference is more modest if we only consider the merge phase, though, which is what really matters when the data are streaming from an aligner. SSDs are awesome! (Though still costly to use just for scratch space.)

The extreme case: 135 GiB BAM on rotating drives
Lastly, let’s look at a case where background compaction allows samtools rocksort to really blow the doors off. We took the deep WGS BAM from the last test back to the c1.xlarge instance with four rotating drives in RAID0 and limited RAM. To activate background compaction, we supply rocksort with an estimate of the total uncompressed size of the BAM dataset, in this case 512 GiB; the README file contains guidelines for coming up with this estimate.

cat wgs_hi.shuffled.bam | pigz -dc | samtools sort -@ 8 -m 640M - wgs_hi.sort
cat wgs_hi.shuffled.bam | pigz -dc | samtools rocksort -@ 8 -m 640M -s 512G - wgs_hi.rocksort

[Chart: samtools rocksort (with background compaction) vs. samtools sort on the 135 GiB BAM, c1.xlarge with rotating drives]

In this test, samtools sort produces well over 1,000 temporary disk files, and then tries to concurrently stream all of them in the subsequent merge phase. That’s far too much effectively random I/O for rotating disks, even in a RAID array, and leads to very poor system utilization. In contrast, samtools rocksort makes much more efficient use of the disks, as the background compaction effectively leads to a highly concurrent, two-pass external merge sort. It finished about six times as fast (3h 46m vs. 22h 37m).

What’s next?
There are still some obvious ways to further speed up samtools BAM sorting, which would largely close any remaining speed gap with commercial tools. Parallel CPU utilization during output compression, while significantly improved by our patches, could still be increased — though at the cost of further complicating the Pthreads-based code, which is no one’s idea of fun. At the other end, input decompression could also be sped up. Indeed, the decompression rate with pigz falls well short of the insertion rate RocksDB is capable of sustaining, and BAM’s block compression format admits more parallelization than pigz (a general-purpose program) takes advantage of. There’s actually some experimental code for this out there already, which would probably benefit both rocksort and the vanilla sort algorithm. That said, input decompression is not a factor in the common case where the data are slowly streamed from an aligner.

Both of those potential optimizations reflect an important point worth emphasizing in closing: rocksort is not a rewrite of samtools. It adds one independent command in bam_rocksort.c, comparable in SLOC to the existing bam_sort.c. The implementation reuses numerous BAM processing functions from samtools, and thus will benefit from any future improvements therein. And while rocksort executes a highly concurrent, I/O-heavy algorithm, almost all of the hard work is courtesy of crack engineers from Facebook and Google (at least one of whom is said to build his code before shipping it, but only to check for compiler and linker bugs!). Importantly therefore, rocksort is maintainable.

You can give rocksort a try by building our fork of samtools.

Developer Spotlight: Baylor Computational Biologist on Porting Mercury to the Cloud

As users log in to DNAnexus to check out the Mercury pipeline and see how it could be applied to their own data, we sat down with Dr. Narayanan Veeraraghavan from Baylor’s Human Genome Sequencing Center (HGSC) to talk about details of the pipeline, how it was ported to our cloud platform, and upcoming feature additions.

Almost every sample sequenced at the HGSC (~24 terabases/month) is processed by Mercury, the production pipeline for analyzing next-generation sequencing data. The framework takes raw base calls from the sequencing instruments and runs a series of algorithms and tools to generate an annotated VCF file. This process includes steps for mapping to a reference genome with BWA, certain finishing steps for preparing the BAM file for realignment and recalibration with GATK, and variant calling and annotation using Atlas2 and Cassandra, tools developed at the HGSC.

The pipeline runs well on the genome center’s local computational infrastructure, matching up with the production data flow rate. But some large projects — such as Baylor’s participation in the CHARGE consortium — can impose substantial load on that infrastructure, inconveniencing researchers who use the local cluster for their own projects. To address such spikes in compute requirements and to explore scaling of compute infrastructure for future demands in next-gen sequencing, HGSC looked for a technology that would keep the business going as usual locally while allowing the center to perform ambitious large-scale analysis. Enter cloud computing.

Veeraraghavan, who is a lead scientific programmer at the HGSC, was in charge of exploring the feasibility of operating on the cloud at scale. His immediate task was to take the existing Mercury pipeline and get it to run in the cloud. This was no small task, given Mercury’s complex, multi-component workflow and the fact that it was optimized to run on local compute infrastructure. Porting it meant reimagining the pipeline for a cloud environment and optimizing it for massively parallel computing.

Working closely with Andrew Carroll, a scientist at DNAnexus, Veeraraghavan says, “We optimized every component of the pipeline as we ported it to DNAnexus.” Becoming acquainted with the DNAnexus platform and getting the first implementation of Mercury up and running took just three weeks, far less time than Veeraraghavan was expecting given his previous efforts to port code to new environments. The DNAnexus team helped make sure that the new Mercury pipeline took advantage of smart ways to parallelize tasks and handle data. “That translates to faster turnaround time and lower cost,” Veeraraghavan says. “The most important part of my experience was the fantastic user support. The DNAnexus team was receptive to feature suggestions and quick to implement them. Working with DNAnexus was a real partnership.”

The CHARGE project data analysis was chosen as one of a few pilot projects to understand computing and collaborating on the cloud using DNAnexus; the successful effort wound up being the largest known genomic analysis conducted on Amazon Web Services, the cloud provider used by DNAnexus.

According to Veeraraghavan, new features are coming soon to the Mercury pipeline, including an RNA-seq workflow as well as tools for variant calling at the population level and on low depth-of-coverage whole genome sequences. These will be freely available to the public.

Other features that we helped HGSC enable include LIMS communication and a direct .bcl-to-cloud implementation in which the raw base calls from the sequencing instruments are uploaded directly to the cloud in real time, making the process of data analysis automatic, integrated, and seamless. These features, specific to HGSC, were made possible by our science team, which aims for a painless and straightforward integration for all DNAnexus customers.

Veeraraghavan, who has extensive programming expertise, says that DNAnexus is a good option for beginners and advanced users alike. For people like him, there’s virtually unlimited configurability via the command line interface, meta-data features, and APIs. “It can also be used by scientists who just want to pick and choose a few apps, stitch them together graphically into a workflow, and get going with their science without having to bother with installation and configuration,” he says. “It is extremely easy for a person with zero programming background to use the Mercury pipeline on DNAnexus. You can bring your data into a secure environment and use our tools without any hardware or software setup. This is the next generation of genomic analysis.”