Importing CGHub data into DNAnexus – quickly!

DNAnexus was founded on the premise that the future of genome informatics resides in the cloud. At the time it was a radical notion; four years later, that inevitable trend is widely recognized among experts in the field. And yet for most, the methods and practical realities of moving into the cloud still remain mysterious. Frequently, one of the first questions to arise is: “How would I get data into the cloud?” It’s an understandable concern for anyone accustomed to e-mailing files around, downloading from FTP sites, or worst of all, shipping hard drives!

Remember that a significant and growing fraction of all day-to-day Internet traffic flows through the cloud. In that light, genome data sets are more than manageable. In fact, a modern high-throughput sequencing instrument produces data at less than 10 Mbps (averaged over its run time, and with ordinary compression techniques). The price of an Internet link with enough throughput for a whole fleet of such instruments is just a tiny fraction of the other costs to operate them.

So streaming newly-sequenced data into the cloud is clearly no sweat. What about all the massive data sets already generated? At DNAnexus, we have the experience to know that this is no problem, either. We’ll discuss an example here.

An enterprise bioinformatics group recently approached us about analyzing RNA-seq data from the Cancer Cell Line Encyclopedia (CCLE). This data set is substantial, at more than 10 TB, and freely available through CGHub, the cancer genomics data repository operated by UCSC’s genome bioinformatics center. Because multiple users have expressed interest in working with cancer genomics data on our platform, our bioinformatics team decided to help develop a capability to import data from CGHub.

CGHub provides file transfers using a special program called GeneTorrent. We began by writing a simple app to wrap GeneTorrent using our SDK (available to any platform user). Given a CGHub analysis ID, the app downloads the associated BAM and BAI files, double-checks their MD5 integrity hashes, and outputs them to the user’s project. Our users were able to incorporate this app easily into their own analysis workflow, which they continued to develop using our SDK, with minimal guidance from us.
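
To make the shape of that app concrete, here is a rough standalone sketch of the same steps in Node.js. It is illustrative only: the real app runs on the DNAnexus platform rather than as a local script, the gtdownload flags shown (-c for the credential file, -d for the analysis ID) are based on our reading of the GeneTorrent client’s documentation, and the function names and the expectedMd5ByFile structure are hypothetical.

```javascript
// Sketch of the wrapper's core steps: download by analysis ID, then
// verify each file's MD5 before handing it onward. Hypothetical,
// not the actual DNAnexus app.
const { execFileSync } = require('child_process');
const crypto = require('crypto');
const fs = require('fs');

// Stream a file through an MD5 hash so multi-gigabyte BAMs
// never have to fit in memory.
function md5sum(path) {
  return new Promise((resolve, reject) => {
    const hash = crypto.createHash('md5');
    fs.createReadStream(path)
      .on('data', (chunk) => hash.update(chunk))
      .on('end', () => resolve(hash.digest('hex')))
      .on('error', reject);
  });
}

async function fetchAnalysis(analysisId, credentialFile, expectedMd5ByFile) {
  // gtdownload places the downloaded files in a directory named
  // after the analysis ID.
  execFileSync('gtdownload', ['-c', credentialFile, '-d', analysisId]);
  for (const [name, expected] of Object.entries(expectedMd5ByFile)) {
    const actual = await md5sum(`${analysisId}/${name}`);
    if (actual !== expected) {
      throw new Error(`MD5 mismatch for ${name}: got ${actual}`);
    }
  }
  // In the real app, the verified BAM and BAI files are then uploaded
  // into the user's project as the job's output.
}
```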

We got involved again when it came time to import all ~800 CCLE RNA-seq samples for analysis with the final pipeline. Corresponding with CGHub’s team, we learned that the best transfer speeds would be obtained by running numerous GeneTorrent downloads in parallel, using multicore servers to support the CPU-intensive protocol. Following this advice, we launched jobs on our platform to run the transfers 64 at a time in parallel, each using a quad-core cloud instance. (Since our app spends some time double-checking each file’s integrity and writing the files back to cloud storage, somewhat fewer than 64 would actually be running GeneTorrent at steady state.)

Using this strategy, we completed the transfer of 10.68 TB in less than seven hours, for an average sustained throughput of about 4 Gbps. The transfers crossed the continent, with most of the distance covered via Internet2; as far as we know, there were no bottlenecks internal to the DNAnexus platform during this process. Here’s a screenshot of our project with all of the BAM files:
[Screenshot: CCLE RNA-seq samples]

How many institutions in the world have infrastructure that can readily bring to bear both a multi-gigabit route to CGHub and the hundreds of compute cores needed to fully utilize it? Perhaps a few dozen, or fewer. But any DNAnexus user has exactly that; in fact, throughout this entire effort, we relied on features of the platform that are readily accessible to all regular users. And infrastructure is just the beginning: the platform is also secure and compliant with clinical standards (as well as dbGaP security practices), and the revolution really starts with seamless sharing of data and analysis tools without regard to institutional boundaries.

Launched just this past spring, the new DNAnexus platform is reaching amazing milestones practically every week. In fact, as far as we know, we’re already deploying more processing power than any of the dedicated clusters at the major genome centers — well over 20,000 compute cores at times, depending on user demand. For more on that, watch this space in a few weeks; we’ll be making some big announcements at ASHG 2013 about how we’re realizing truly mega-scale genomics in the cloud.

Dev Talks: Genomics Applications in the Cloud with the DNAnexus Platform

The next public talk in our 2013 series will be at the University of Toronto’s TCAG New Technologies Seminar. DNAnexus Sr. Software Engineer Andrey Kislyuk will give a brief demo of developing applications on the platform, as well as of its scientific collaboration, publishing, and reproducibility features.

Title: Building Genomics Applications in the Cloud with the DNAnexus Platform
When: Thursday, September 26, 2013
Time: 10:30–11:30 AM and 2:00–3:00 PM
Building: MaRS Toronto Medical Discovery Tower, 101 College St.
Room: 14-203

Abstract: The DNAnexus platform-as-a-service (PaaS) was designed to eliminate the common challenges and costs that enterprises face when building clinically compliant analysis pipelines for next-generation sequencing (NGS) data. The DNAnexus platform provides a configurable API-based infrastructure that enables research labs to efficiently move their analysis pipelines into the cloud, using their own algorithms with industry-recognized tools and resources to create customized workflows in a secure and compliant environment.

DNAnexus is available to give on-site talks and demos to public and private institutions. Contact us for details at developers@dnanexus.com.

Keeping the Genome Browser Responsive

The new DNAnexus genome browser is very powerful, and lets users visualize a wide variety of data — and with the power of apps and applets, users can even customize what gets displayed in the browser by generating new, personalized spans tracks.

As web developers know, this flexibility in a JavaScript-based environment can come at a cost to user experience. We have explored many different approaches to this challenge, and the new DNAnexus genome browser demonstrates that it is indeed possible to run a resource-hungry application in any popular web browser while keeping the interface responsive and wasting very little of the user’s time.

In this blog post, we’ll go through techniques we considered and how we ultimately solved the problem.


The Problem

First, let’s go through a little background to get everyone on the same page. The tracks in the genome browser are all generated using JavaScript, a versatile language that most web browsers use to grant interactivity to web pages. This allows us to take advantage of the user’s computer to render and interact with the genome browser tracks, rather than doing computation and rendering on the servers and presenting a static image to the user.

However, browser-based JavaScript has a number of problems in this context. First among these is that in most cases it’s still slow compared to many other languages. Monumental and very successful efforts have been made in recent years to make it more efficient, but it is still rare that a piece of code written in JavaScript will perform at the same level as a similar algorithm written in other languages such as C or even Perl.

Also, in most cases, JavaScript only has one thread. This means that, with a handful of exceptions, your browser will execute the scripts given to it in order, and execution of one script will delay execution of any other script until the first has finished. In an age when even home computers have many cores between which processing can be divided, this can result in slower perceived performance.

Finally, and most notably, in many web browsers the thread that executes JavaScript and the thread responsible for updating the user interface and responding to user events (such as clicking, typing, and mouse movement) are the same thread. This means that while JavaScript is executing, the user can’t interact with the web page in any way (including closing the browser tab in many cases).

Since the genome browser has the potential to process tens of thousands of elements, this could result in the user waiting for tens of seconds, unable to do anything, while the browser renders the slew of data coming into it. Unless fixed, this would make the genome browser range from merely aggravating to downright unusable.
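
In code, the problem looks something like the following naive loop (renderFeature is a hypothetical stand-in for the real per-element drawing work a track requires). Nothing else, including repainting and input handling, can happen until the loop finishes:

```javascript
// Naive approach: process every element in one synchronous pass.
// With tens of thousands of features, the page freezes until done.
function renderTrack(features) {
  for (var i = 0; i < features.length; i++) {
    renderFeature(features[i]); // hypothetical: draw one element
  }
}
```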

One solution: web workers

One of the most anticipated features of the nascent HTML5 specification is web workers, a new method for offloading JavaScript processing into a background process. This solves most of the above-mentioned problems quite handily: it allows a semblance of multi-threading, and it doesn’t interfere with the user interface. However, support in modern browsers is spotty, with some fairly recent browsers still lacking it. Additionally, workers run in a restricted JavaScript execution environment, with no access to the document object model (DOM) or certain other resources.
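
A minimal sketch of that approach, under the assumption that the heavy per-element computation (computeLayout below, hypothetical, like drawResults) is pure and needs no DOM access:

```javascript
// main.js -- offload the heavy loop to a background thread.
var worker = new Worker('process-features.js');
worker.onmessage = function (e) {
  drawResults(e.data); // hypothetical: paint the precomputed layout
};
worker.postMessage(features); // data is copied (structured clone)

// process-features.js -- runs off the UI thread; no DOM access here.
onmessage = function (e) {
  var results = [];
  for (var i = 0; i < e.data.length; i++) {
    results.push(computeLayout(e.data[i])); // pure computation only
  }
  postMessage(results);
};
```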

If you have the luxury of only supporting browsers that have web workers enabled, this is probably still the best option. However, the new DNAnexus genome browser needs to run on IE 9, which rules out this option for us.

Another solution: self-monitoring timeouts

A slightly more complicated solution is to launch processing chunks in timeouts, which are JavaScript’s way of delaying code execution until after a set period of time. This allows other scripts to execute in the meantime, and more crucially allows the UI to update itself on a fairly regular basis. It is also compatible with almost all browsers, and allows access to the entire execution environment and the document.

The key to this approach is giving any potentially long-running algorithm a sense of how much user time it has taken up and forcing it to yield execution at set time intervals. Thus, if a function detects that it’s been running for too long, it can pause execution for a short time until other pieces of code have had their turn.

In practice, this requires some fine-tuning. The two crucial variables are how long to let a function execute before yielding, and how much time to yield for. The latter is relatively simple: a short wait time is better, since we’re really concerned with allowing the UI to update. Most browsers safely go as low as 10ms for a timeout interval, and it’s recommended to stay somewhere near that range. Note that setting a timeout for 10ms does not guarantee that execution will resume exactly 10ms later; it could be 12, 15, or even 50 milliseconds before execution resumes. This is one potential pitfall of this solution.

The other variable, time before yielding, is much trickier and will probably end up getting fine-tuned according to your application. There are two separate issues pulling the value in opposite directions: on one hand, you want your function to execute for as long as possible between periods of inactivity so that it remains as efficient as possible. On the other hand, you also want to preserve maximum interactivity from the user’s point of view, which encourages short periods of processing followed by long idle periods.

In his book Designing with the Mind in Mind, Jeff Johnson compiles several studies into a convenient chart that lists crucial time intervals for human perception. While 5ms is listed as the shortest perceptible display time for visual stimuli and would be ideal from a UX point of view, that’s obviously a little too aggressive for our scripts — it would triple the execution time if a script ran for 5ms and then delayed for 10ms! The next important increment is 100ms, which is the maximum amount of time for continuity — that is, if a user presses a button and the computer waits more than 100ms to render that button as being pressed, the impression that the user caused that button to be pressed is broken. This seems like a good upper bound for execution time, but keep in mind that you may have to adjust this timing downward depending on how often you check the function run time. If each item takes several milliseconds to process, you may have to yield every 95ms instead of 100ms. (We ultimately decided to be conservative and go with an 85ms execution interval.)
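
Putting those numbers together, here is a minimal sketch of the self-monitoring approach, using the ~85ms execution budget and ~10ms yield discussed above (processInChunks and its arguments are our illustrative names, not the genome browser’s actual code):

```javascript
var EXECUTION_BUDGET_MS = 85; // run at most this long per chunk
var YIELD_MS = 10;            // then give the UI thread a turn

// Process items in budgeted chunks so the page stays responsive.
function processInChunks(items, processItem, onDone) {
  var i = 0;
  function chunk() {
    var start = Date.now();
    while (i < items.length) {
      processItem(items[i++]);
      if (Date.now() - start >= EXECUTION_BUDGET_MS) {
        // Budget spent: yield briefly so the browser can repaint
        // and handle input, then pick up where we left off.
        setTimeout(chunk, YIELD_MS);
        return;
      }
    }
    onDone();
  }
  chunk();
}
```

Compared with the naive synchronous loop shown earlier, the only change visible to the caller is that completion is signaled through a callback rather than a return value.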

The end result

Either of these solutions will yield a responsive interface that still allows for large amounts of data to be processed. We opted for the second solution due to browser support issues, keeping in mind key perceptual phenomena and best UI practices. While there are several promising technologies on the horizon, such as web workers in the near term and WebCL in the long term, JavaScript in browser environments in 2013 still presents some weaknesses. These are not insurmountable, however, and the new DNAnexus genome browser demonstrates that it is possible to have a usable, computation-heavy application running in the web browser today.