Cache Money: Keeping Your Product Fresh With Minimal Customer Pain

Background

The DNAnexus platform’s front-end is driven by a technology stack called “Membrane”. Membrane is a front-end views framework that lets internal developers build isolated components which can be used throughout the website to build complex interactions. Below are screenshots of two different components, the Data Tree and the Data List, respectively.

Data Tree – Used to show the folder hierarchy in a project, allowing users to expand/collapse folders, select a folder, etc.

[Screenshot: the Data Tree component]

Data List – Used to show a list of objects/folders, with support for sorting, etc. Typically used to show the items in a folder, or search results.

[Screenshot: the Data List component]

At the time of this writing, Membrane has 111 components, each with a JavaScript, an HTML, and a CSS file. That’s a total of 333 files for the components alone, in addition to bootstrapping resources and third-party libraries.

The problem / more background

The Membrane team wants to push out new features quickly and often, with minimal customer pain. In most releases only a few components have been updated, so we want users to download only the components that have changed, and keep their existing copies of the components that haven’t. This is where browser cache management comes in. Web browsers maintain a cache where they store web content such as HTML, JavaScript, CSS, and images. When you visit the website again in the future, this content may be served from the cache, which removes the need to fetch it over the network. The conditions under which content is served from the cache are subject to much configuration, such as client-side cache settings and server-side response headers like “Expires” and “Cache-Control” (with directives such as “max-age”), among others.

Utilizing the web cache can greatly speed up the loading of your product, so we know we want to use the cache as much as possible. Configuring your server to tell clients that these assets never expire is a simple way to ensure the cache is used, but at the expense of clients seeing a stale product for as long as those entries exist in their cache. What we want is a solution that uses a browser cache entry as long as possible, while quickly invalidating it when there is a newer version of the asset.

The solution

Let’s start by coming up with a mechanism for versioning our assets. Instead of assigning explicit version numbers, we compute an MD5 checksum of each asset’s content. During each release, if the MD5 checksum of an asset is the same as it was during the prior release, we know that it did not change. For example, the MD5 checksum for the Data List JavaScript is currently c2a92513c4604b92255f620950ecb93c.

We will now define a new release asset called the manifest. The manifest maps each asset path, such as /data/list/view.js, to the MD5 checksum for that asset (c2a92513c4604b92255f620950ecb93c). The manifest is loaded during bootstrapping and is always consulted when fetching an asset, so it provides the latest MD5 checksum for that asset. To prevent caching of the manifest itself, we serve it with the HTTP header “Cache-Control: no-cache”, so every time a user loads the page we fetch the manifest from the server. We also periodically ping the server during the user’s session to detect manifest changes and notify users that updates are available.
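As a hedged sketch of the idea, the manifest is just a path-to-checksum mapping, and the periodic ping boils down to diffing a freshly fetched manifest against the one the page booted with (the second entry’s checksum and the function names below are hypothetical, not our actual code):

```javascript
// Sketch: the manifest maps asset paths to MD5 checksums.
const bootManifest = {
  '/data/list/view.js': 'c2a92513c4604b92255f620950ecb93c',
  '/data/tree/view.js': '0123456789abcdef0123456789abcdef' // hypothetical entry
};

function changedAssets(oldManifest, newManifest) {
  // Assets whose checksum differs (or which are new) have an update available.
  return Object.keys(newManifest).filter(
    (path) => oldManifest[path] !== newManifest[path]
  );
}
```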

The next part of the solution involves how the manifest information is used to manage the browser cache. Earlier I mentioned that the browser may use the cache to fulfill requests for previously accessed content. The browser determines whether or not it has already seen an asset by comparing the asset’s URL with entries in the cache. So if the page requests http://foo.com/image.png, the browser will first check whether it has a cache entry for that URL. Knowing how the browser keys cache entries gives us the ability to force a client update simply by changing the URL of a particular asset.

We’ve already come up with a unique version identifier for each asset: the MD5 checksum. We will now use it in the URL to address not only a particular asset but also a particular version of that asset. Our path for the Data List JavaScript becomes /asset/c2a92513c4604b92255f620950ecb93c/data/list/view.js. Now that our asset paths carry a version, we can update our nginx server to send headers telling the browser to cache these files for a very long period of time (e.g. 2 years). If the Data List JavaScript is updated in a subsequent release, its MD5 checksum changes, the resulting asset path changes with it, and the client can never be left serving a stale cache entry.
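Building the versioned URL from the manifest is then a one-liner. A sketch (the helper name is illustrative):

```javascript
// Sketch: combine the checksum from the manifest with the asset path, so
// every new version of an asset gets a brand-new, long-cacheable URL.
function versionedUrl(manifest, assetPath) {
  return '/asset/' + manifest[assetPath] + assetPath;
}

// versionedUrl({'/data/list/view.js': 'c2a92513c4604b92255f620950ecb93c'},
//              '/data/list/view.js')
// → '/asset/c2a92513c4604b92255f620950ecb93c/data/list/view.js'
```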

The final piece of our solution is keeping deployments simple. We don’t want to keep these MD5 checksums in our server’s file system layout, so we use an nginx rewrite rule to strip the checksum from the path when resolving the asset on disk. So in reality there are no assets on disk with MD5 checksums in their paths; rather, the checksum appears only in the URL used to fetch the asset, as a mechanism for managing the user’s browser cache.
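Such a rewrite might look something like the following nginx fragment; the location, asset root, and expiry here are illustrative only, not our actual configuration:

```nginx
# Illustrative only: strip the 32-hex-digit checksum segment from the URL,
# then serve the underlying file with far-future cache headers.
location ~ "^/asset/[0-9a-f]{32}/" {
    rewrite "^/asset/[0-9a-f]{32}(/.*)$" $1 break;
    root /srv/membrane/assets;   # hypothetical asset root
    expires 2y;
    add_header Cache-Control "public";
}
```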

Closing thoughts

Cache management is a tricky problem, and if you search the web you will find a variety of solutions, each having its own set of pros and cons. Cache validation is also a complex topic, involving more mechanisms than I’ve gone into here, such as ETags.

After evaluating existing solutions both internally and externally, we’ve come up with a fairly simple solution that allows us to update quickly and often, while leveraging the user’s cache as much as possible to provide fast load times and eliminate the possibility of users seeing a stale version of a component.

For additional information or if you have any questions, please feel free to email evan@dnanexus.com.

Importing CGHub data into DNAnexus – quickly!

DNAnexus was founded on the premise that the future of genome informatics resides in the cloud. At the time it was a radical notion; four years later, that inevitable trend is widely recognized among experts in the field. And yet for most, the methods and practical realities of moving into the cloud still remain mysterious. Frequently, one of the first questions to arise is: “How would I get data into the cloud?” It’s an understandable concern for anyone accustomed to e-mailing files around, downloading from FTP sites, or worst of all, shipping hard drives!

Remember that a significant and growing fraction of all day-to-day Internet traffic flows through the cloud. In that light, genome data sets are more than manageable. In fact, a modern high-throughput sequencing instrument produces data at less than 10 Mbps (averaged over its run time, and with ordinary compression techniques). The price of an Internet link with enough throughput for a whole fleet of such instruments is just a tiny fraction of the other costs to operate them.

So streaming newly-sequenced data into the cloud is clearly no sweat. What about all the massive data sets already generated? At DNAnexus, we have the experience to know that this is no problem, either. We’ll discuss an example here.

An enterprise bioinformatics group recently approached us about analyzing RNA-seq data from the Cancer Cell Line Encyclopedia (CCLE). This data set is substantial, bigger than 10 TB, and freely available through CGHub, the cancer genomics data repository operated by UCSC’s genome bioinformatics center. Because we have multiple users who have expressed interest in working with cancer genomics data on our platform, our bioinformatics team decided to lend assistance in developing a capability to import data from CGHub.

CGHub provides file transfers using a special program called GeneTorrent. We began by writing a simple app to wrap GeneTorrent using our SDK (available to any platform user). Given a CGHub analysis ID, the app downloads the associated BAM and BAI files, double-checks their MD5 integrity hashes, and outputs them to the user’s project. Our users were able to incorporate this app easily into their own analysis workflow, which they continued to develop using our SDK, with minimal guidance from us.

We got involved again when it came time to import all ~800 CCLE RNA-seq samples for analysis with the final pipeline. Corresponding with CGHub’s team, we learned that the best transfer speeds would be obtained by running numerous GeneTorrent downloads in parallel, using multicore servers to support the CPU-intensive protocol. Following this advice, we launched jobs on our platform to run the transfers 64 at a time in parallel, each using a quad-core cloud instance. (Since our app spends some time double-checking the file integrity and writing it back to cloud storage, somewhat fewer than 64 would actually be running GeneTorrent at steady state.)

Using this strategy, we completed the transfer of 10.68 TB in less than seven hours, for an average sustained throughput of about 4 Gbps. The transfers ran transcontinentally, most of the distance via Internet2; as far as we know, there were no bottlenecks internal to the DNAnexus platform during this process. Here’s a screenshot of our project with all of the BAM files:
[Screenshot: CCLE RNA-seq samples in a DNAnexus project]
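As a back-of-the-envelope check on those figures (a sketch only; the exact duration beyond “less than seven hours” is not stated):

```javascript
// Rough throughput check for the CCLE transfer: 10.68 TB in under 7 hours.
function throughputGbps(terabytes, hours) {
  const bits = terabytes * 1e12 * 8;     // decimal terabytes -> bits
  return bits / (hours * 3600) / 1e9;    // -> gigabits per second
}

// At exactly 7 hours this works out to ~3.4 Gbps; a transfer closer to
// 6 hours gives ~4 Gbps, consistent with the sustained rate quoted above.
```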

How many institutions in the world have infrastructure that can readily bring to bear both a multi-gigabit route to CGHub and the hundreds of compute cores needed to fully utilize it? Perhaps a few dozen, or fewer. But any DNAnexus user has exactly that; in fact, throughout this entire effort, we relied on features of the platform that are readily accessible to all regular users. And infrastructure is just the beginning: the platform is also secure and compliant with clinical standards (as well as dbGaP security practices), and the revolution really starts with seamless sharing of data and analysis tools without regard to institutional boundaries.

Launched just this past spring, the new DNAnexus platform is reaching amazing milestones practically every week. In fact, we’re already deploying more processing power, as far as we know, than any of the dedicated clusters at the major genome centers: well over 20,000 compute cores at times, according to user demand. For more on that, watch this space in a few weeks; we’ll be making some big announcements at ASHG 2013 about how we’re realizing truly mega-scale genomics in the cloud.

Dev Talks: Genomics Applications in the Cloud with the DNAnexus Platform

The next public talk in our 2013 series will be at the University of Toronto’s TCAG New Technologies Seminar. DNAnexus Sr. Software Engineer Andrey Kislyuk will conduct a brief demo of developing applications on the platform, as well as of its scientific collaboration, publishing, and reproducibility features.

Title: Building Genomics Applications in the Cloud with the DNAnexus Platform
When: Thursday, September 26, 2013
Time: 10:30–11:30 AM and 2:00–3:00 PM
Building: MaRS Toronto Medical Discovery Tower, 101 College St.
Room: 14-203

Abstract: The DNAnexus platform-as-a-service (PaaS) was designed to eliminate the common challenges and costs that enterprises face when building clinically compliant analysis pipelines for next-generation sequencing (NGS) data. The DNAnexus platform provides a configurable API-based infrastructure that enables research labs to efficiently move their analysis pipelines into the cloud, using their own algorithms with industry-recognized tools and resources to create customized workflows in a secure and compliant environment.

DNAnexus is available to give on-site talks and demos to public and private institutions. Contact us for details at developers@dnanexus.com.