Importing CGHub data into DNAnexus – quickly!

DNAnexus was founded on the premise that the future of genome informatics resides in the cloud. At the time it was a radical notion; four years later, that inevitable trend is widely recognized among experts in the field. And yet for most, the methods and practical realities of moving into the cloud remain mysterious. Frequently, one of the first questions to arise is: “How would I get data into the cloud?” It’s an understandable concern for anyone accustomed to e-mailing files around, downloading from FTP sites, or, worst of all, shipping hard drives!

Remember that a significant and growing fraction of all day-to-day Internet traffic flows through the cloud. In that light, genome data sets are more than manageable. In fact, a modern high-throughput sequencing instrument produces data at less than 10 Mbps (averaged over its run time, and with ordinary compression techniques). The price of an Internet link with enough throughput for a whole fleet of such instruments is just a tiny fraction of the other costs to operate them.
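To put the 10 Mbps figure in perspective, here is a small back-of-the-envelope sketch in Python. The 1 Gbps link is a hypothetical example chosen for illustration, not a number from our own setup, and the helper names are ours.

```python
def daily_volume_gb(rate_mbps: float) -> float:
    """Gigabytes (decimal) produced in 24 hours at a steady rate in megabits per second."""
    seconds_per_day = 24 * 60 * 60
    bits = rate_mbps * 1e6 * seconds_per_day
    return bits / 8 / 1e9


def instruments_per_link(link_mbps: float, per_instrument_mbps: float = 10.0) -> int:
    """How many instruments a link could carry, ignoring protocol overhead."""
    return int(link_mbps // per_instrument_mbps)


# 10 Mbps is the upper bound quoted above for one instrument.
print(daily_volume_gb(10))
# Hypothetical 1 Gbps (1000 Mbps) link.
print(instruments_per_link(1000))
```

At the quoted upper bound, one instrument generates on the order of a hundred gigabytes a day, and a single gigabit link has room for a whole fleet of them.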

So streaming newly-sequenced data into the cloud is clearly no sweat. What about all the massive data sets already generated? At DNAnexus, we have the experience to know that this is no problem, either. We’ll discuss an example here.

An enterprise bioinformatics group recently approached us about analyzing RNA-seq data from the Cancer Cell Line Encyclopedia (CCLE). This data set is substantial, bigger than 10 TB, and freely available through CGHub, the cancer genomics data repository operated by UCSC’s genome bioinformatics center. Because multiple users have expressed interest in working with cancer genomics data on our platform, our bioinformatics team decided to help develop a capability to import data from CGHub.

CGHub provides file transfers using a special program called GeneTorrent. We began by writing a simple app to wrap GeneTorrent using our SDK (available to any platform user). Given a CGHub analysis ID, the app downloads the associated BAM and BAI files, double-checks their MD5 integrity hashes, and outputs them to the user’s project. Our users were able to incorporate this app easily into their own analysis workflow, which they continued to develop using our SDK, with minimal guidance from us.

We got involved again when it came time to import all ~800 CCLE RNA-seq samples for analysis with the final pipeline. Corresponding with CGHub’s team, we learned that the best transfer speeds would be obtained by running numerous GeneTorrent downloads in parallel, using multicore servers to support the CPU-intensive protocol. Following this advice, we launched jobs on our platform to run the transfers 64 at a time in parallel, each using a quad-core cloud instance. (Since our app spends some time double-checking each file’s integrity and writing the files back to cloud storage, somewhat fewer than 64 would actually be running GeneTorrent at steady state.)
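The "64 at a time" pattern is ordinary bounded parallelism. The sketch below shows the idea on a single machine with Python's standard thread pool; it is not how the platform schedules jobs (there, each transfer ran as its own job on its own instance). The run_transfers name and the fetch_one callable are hypothetical placeholders: fetch_one stands in for whatever actually downloads and verifies one analysis ID.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_transfers(analysis_ids, fetch_one, max_parallel=64):
    """Run fetch_one(analysis_id) for every ID, at most max_parallel at a time.

    Returns (results, failures): two dicts keyed by analysis ID.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # Submit everything up front; the pool itself enforces the concurrency cap.
        futures = {pool.submit(fetch_one, aid): aid for aid in analysis_ids}
        for future in as_completed(futures):
            aid = futures[future]
            try:
                results[aid] = future.result()
            except Exception as exc:
                # Keep going so one failed transfer does not abort the batch.
                failures[aid] = exc
    return results, failures
```

Collecting failures separately makes it easy to retry only the IDs that did not come through.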

Using this strategy, we completed the transfer of 10.68 TB in less than seven hours, for an average sustained throughput of about 4 Gbps. The transfers crossed the continent, most of the distance via Internet2; as far as we know, there were no bottlenecks internal to the DNAnexus platform during this process. Here’s a screenshot of our project with all of the BAM files:
[Screenshot: CCLE RNA-seq samples]
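As a rough check on the throughput figure, the arithmetic can be sketched as follows. The exact duration is not stated beyond being under seven hours, so the durations tried here are illustrative assumptions, as is the reading of TB as decimal terabytes (10^12 bytes).

```python
def average_gbps(terabytes: float, hours: float) -> float:
    """Average throughput in gigabits per second for a transfer of the given size."""
    bits = terabytes * 1e12 * 8      # decimal terabytes to bits (assumption)
    seconds = hours * 3600
    return bits / seconds / 1e9


# Try a few plausible durations under the seven-hour bound.
for hours in (6.0, 6.5, 7.0):
    print(f"{hours:.1f} h -> {average_gbps(10.68, hours):.2f} Gbps")
```

Under these assumptions, a full seven hours works out to roughly 3.4 Gbps and six hours to just under 4 Gbps.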

How many institutions in the world have infrastructure that can readily bring to bear both a multi-gigabit route to CGHub and the hundreds of compute cores needed to fully utilize it? Perhaps a few dozen, or fewer. But any DNAnexus user has exactly that; in fact, throughout this entire effort, we relied on features of the platform that are readily accessible to all regular users. And infrastructure is just the beginning: the platform is also secure and compliant with clinical standards (as well as dbGaP security practices), and the revolution really starts with seamless sharing of data and analysis tools without regard to institutional boundaries.

The new DNAnexus platform launched just this past spring, and we now find it reaching amazing milestones practically every week. In fact, we’re already deploying more processing power, as far as we know, than any of the dedicated clusters at the major genome centers: well over 20,000 compute cores at times, according to user demand. For more on that, watch this space in a few weeks; we’ll be making some big announcements at ASHG 2013 about how we’re realizing truly mega-scale genomics in the cloud.