Mar 11
Evaluating the Illumina/Solexa Genome Analyzer for whole genome re-sequencing
Does Solexa have problems amplifying A/T rich regions? I just read a really interesting paper by Hillier et al. in Nature Methods which claims that this might be the case.
The paper is entitled “Whole-genome sequencing and variant discovery in C. elegans” and reports the use of Solexa technology to re-sequence two C. elegans specimens for variant discovery. The paper demonstrates that Solexa reads can be used to re-assemble the C. elegans genome, especially when paired-end information is used.
However, it points to a general lack of coverage in A/T rich regions (see figure 2 of the supplementary material), which leaves a number of zero-size gaps in the assembly: places where reads sit shoulder to shoulder but simply do not overlap. Having found these problematic A/T rich regions, the authors went back and looked across the whole genome, where they found a general correlation between A/T content and read coverage. This correlation was stronger when examining a 200 bp window than when examining a 32 bp window. 200 bp corresponds to the size of the fragments that are amplified during the cluster generation step prior to sequencing, and 32 bp corresponds to the number of cycles in the actual sequencing-by-synthesis procedure, i.e. the read length. This finding led Hillier et al. to conclude that failure to amplify A/T rich regions during cluster generation is the cause of the low coverage (other explanations for the bias, such as hairpin formation, were also explored but discarded).
Unfortunately, the authors did not pursue a chemical explanation for the phenomenon and did not investigate other Solexa datasets for a similar trend. It is therefore premature to say whether this is a general property of the Solexa technology, but it definitely warrants the attention of people like us who design assembly and variant detection algorithms.

March 11th, 2008 at 17:43
We are seeing the same thing with some bacterial genomes…
The problem isn’t due to different prep methods, and it isn’t machine specific.
March 12th, 2008 at 09:20
There is a comment on this over at the Evolgen blog http://scienceblogs.com/evolgen/2008/03/not_all_nextgen_sequencing_tec.php
March 12th, 2008 at 10:03
Check this website for papers dealing with Next-Gen Sequencing: http://www.genomeweb.com/newspics/InSequence_Papers_Feb08.htm
You probably need to register to get access, but it's free.
March 12th, 2008 at 16:01
To expand on my previous post:
We have resequenced a previously sequenced bacterial genome in order to figure out how to make de novo sequencing work with “next generation” sequencing technologies. We noticed that parts of the genome weren’t covered by this resequencing (short gaps rather than whole genes, and we know that the sequence is there), but I don’t know if these are exactly the AT rich regions. This first sequencing pass was 20x, so we tried adding another 20x of coverage, done on a different machine by different people using different DNA preps, and still we couldn’t capture the missing pieces.
March 12th, 2008 at 20:30
inhumataq, thanks for sharing your observations. It would be really great if you could do a simple plot of the read coverage as a function of the A/T content of the reference genome.
May 5th, 2008 at 10:45
In fact, this is a sample prep issue. Different libraries can show a biased GC selection, and low complexity libraries can deplete certain templates, resulting in duplicates. You have to ensure you have good representation in your libraries, and remove the thermal gel melting step (which is where the AT/GC selection occurs). There should be revised protocols bouncing around.