Nov 17 2009

3rd Gen Sequencing company PacBio reveals commercial specs

Tag: Technology | Goerlitz @ 23:14

In today’s edition of In Sequence, editor Julia Karow revealed a number of performance specifications for PacBio’s first commercial single-molecule real-time DNA sequencer, due to be released during the second half of 2010.

PacBio CEO, Hugh Martin, states:
Our pricing strategy, at this time, is that we are going to probably fix a $100 per-run cost. Over time, that price is going to remain the same, but the amount of sequence that you will get for that $100 will go up tremendously. The minimum run time is between 10 and 15 minutes, which users can adjust, depending on whether they want to maximize their throughput or read length. An entire experiment, from the time that you start sample prep to when you have your data, can be completed in less than 12 hours.

Click to read the entire article at In Sequence (requires subscription)


Nov 04 2009

New benchmarks of our upcoming de novo assembler

Tag: Development, Technology, Updates | Goerlitz @ 09:27

We have had our upcoming de novo assembly algorithm in internal Alpha for some time now, and have just released an updated technical note with benchmarks from the Alpha 2 version, which is the version we’re currently running.

Impressive benchmarks
These new benchmarks are quite impressive: de novo assembly of a data set with 38-fold coverage of the human genome completed in only 7 hours on a single computer, while also improving the overall quality!

Click to download the PDF-file with benchmarks on the Alpha 2 version of our upcoming de novo assembly algorithm

Interested in trying this de novo assembler yourself?
Hopefully we can have the beta version out for public testing before Christmas - stay tuned! If you’re interested in more information, you can sign up here and get an invitation once it’s released: Click to sign up for a beta trial of our de novo assembler


Nov 03 2009

High-performance computing technologies for genomics

Tag: Technology | Goerlitz @ 18:14

Everyone working with genomics and high-throughput sequencing data analysis knows that you need serious number crunching capabilities to analyze the vast amounts of data coming off the Next Generation Sequencing instruments. This tendency is only likely to be reinforced by the Single Molecule 3rd generation sequencers, which start rolling out in 2010.

FPGA
A few years ago, FPGA (Field Programmable Gate Array) technology was seen by many as an ideal way to handle a lot of the data overhead in bioinformatics. However, because FPGAs are relatively hard to program and optimize, there have never been any major breakthroughs for FPGA technology within genomics.

GPU
Currently, GPU (Graphics Processing Unit) technology has moved into the spotlight as the premier technology expected to solve the computational challenges that follow the roll-out of NGS instruments. This is partly because GPUs and the popular CUDA architecture offer a rather cheap shortcut to teraflop computing territory. However, the genomics community has yet to see implementations of known algorithms that truly exploit the computing power at hand.

Cloud Computing
Cloud Computing is another way of handling the data analysis challenges, by renting access to large setups with many CPU cores on a need-to-use basis. While Cloud Computing surely is a relatively cheap and interesting alternative to setting up and running your own cluster, it generally isn’t ideal when talking about pure performance, because of the huge overhead of distributing the data and calculations across the rented nodes. And with the huge data sets in genomics you may also have to physically ship your hard drive to your Cloud Computing vendor as opposed to uploading data through an internet connection – hardly the fastest option.

The neverending scheme
The current scheme to obtain adequate computing power seems like a road well traveled: stacking up more hardware until you reach the teraflop goal you aimed for. While this strategy does work, it’s certainly neither environmentally friendly nor economically sensible!

Software Acceleration
There is another - less popular - strategy that gives a good return on investment while also delivering top performance: using threading on multiple cores with SIMD (Single Instruction, Multiple Data). Unlike the other technologies mentioned above, SIMD is based on software acceleration, utilizing the SSE instruction sets that are built into all current x86 CPUs from Intel and AMD. Many IT guys will frown upon the phrase software acceleration, but that is a mistake - read why below!

SIMD
SIMD acceleration can be compared to tuning a car engine: a regular family car may produce around 100 kilowatts of power, while the same engine can be tuned to produce maybe 200-300 kilowatts in a high-performance version of the same car. With CPUs and SIMD technology, however, the increase in computing power is substantially higher than what you see in the car industry. This essentially means that the vast majority of life science software uses only a fraction of the computational power that is actually available on the CPU.
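To make the idea a bit more concrete, here is a minimal sketch in C of what SIMD looks like in practice. It is only an illustration and not code from any of our products: the function names count_matches_scalar and count_matches_sse2 are made up for this example, and the task (counting identical positions between two equally long DNA strings) was chosen simply because it is easy to follow. The scalar version compares one base per loop iteration, while the SSE2 version compares 16 bases with a single instruction.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics; SSE2 is part of every x86-64 CPU */
#include <stddef.h>

/* Scalar baseline: compares one base per loop iteration.
   (Hypothetical helper name, for illustration only.) */
size_t count_matches_scalar(const char *a, const char *b, size_t n)
{
    size_t matches = 0;
    for (size_t i = 0; i < n; i++)
        matches += (a[i] == b[i]);
    return matches;
}

/* SSE2 version: compares 16 bases per instruction.
   (Hypothetical helper name, for illustration only.) */
size_t count_matches_sse2(const char *a, const char *b, size_t n)
{
    assert(a != NULL && b != NULL);

    const __m128i one = _mm_set1_epi8(1);
    const __m128i zero = _mm_setzero_si128();
    size_t matches = 0;
    size_t i = 0;

    for (; i + 16 <= n; i += 16) {
        /* Unaligned loads of 16 bytes from each sequence. */
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));

        /* Each byte becomes 0xFF where the bases are equal, 0x00 otherwise. */
        __m128i eq = _mm_cmpeq_epi8(va, vb);

        /* Turn 0xFF into 1, then sum the bytes of each 8-byte half. */
        __m128i ones = _mm_and_si128(eq, one);
        __m128i sums = _mm_sad_epu8(ones, zero);

        /* The two partial sums sit in the low 16 bits of each 64-bit lane. */
        matches += (size_t)_mm_cvtsi128_si32(sums);
        matches += (size_t)_mm_extract_epi16(sums, 4);
    }

    /* Handle the remaining (fewer than 16) bases one at a time. */
    for (; i < n; i++)
        matches += (a[i] == b[i]);

    return matches;
}
```

Calling count_matches_sse2(x, y, n) should return the same count as count_matches_scalar(x, y, n) for any two buffers of length n. How much faster the SIMD version actually runs depends on the compiler, the CPU and the data, so treat this as a sketch of the principle rather than a benchmark.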

So why not use the power you already have at hand, before you start adding more CPUs? That also seems to be the reasoning behind Sean Eddy’s team implementing SIMD acceleration for the upcoming HMMER 3 package.

But how does software acceleration stack up against hardware?
The obvious question is, of course, whether software acceleration like SIMD is as fast as hardware acceleration. There is no doubt that it is cheaper, but is it equal to adding more hardware? Actually, in a lot of cases it’s better!

We have benchmarked our CLC Genomics Machine against some classic bioinformatics algorithms accelerated by FPGA technology, and generally our performance is a lot better. And when looking into genomics, we just released a new technical note benchmarking our upcoming de novo assembler (also SIMD accelerated) against ABySS, with some staggering results on our part. Click here to download the PDF with benchmarks - you won’t be disappointed!

Coming soon…
When the next generation of high-performance computing technology arrives soon - the more flexible GPGPU chips (General-Purpose computing on Graphics Processing Units), which also support SIMD - a strategy of using SIMD acceleration seems even more obvious. Intel’s upcoming GPGPU, codenamed Larrabee, is rumored to feature around 80 SIMD-enabled CPU cores on a single board that can be installed in any (new) computer, which in turn will provide a huge boost in performance.

Will this be the path to real-time data analysis of 3rd generation sequencing data? Time will tell…