The Genome Reference Consortium (GRC) has recently been formed. The goal of the group, as stated on its website, is “to correct the small number of regions in the reference that are currently misrepresented, to close as many remaining gaps as possible and to produce alternative assemblies of structurally variant loci when necessary”. GenomeWeb has a bit more background.
From the perspective of a company that designs assembly algorithms, this is of course exciting news. Reference assembly depends heavily on the quality of the reference genome, and any improvement on this side will make our lives easier. An open question is how to represent information such as structural variation. As of now, assembly is performed against a single reference sequence, which is clearly too simple a format to describe complex structural variation. We will follow the work of the consortium closely. They have their work cut out for them, with all the new structural variation that is continuously being discovered in the human genome.
To follow up on a previous post, researchers funded by the Human Microbiome Project have now launched the Human Oral Microbiome Database (HOMD). HOMD is intended as a service to researchers who are investigating the role of microbes in human health and disease, with particular emphasis on the oral environment. It is anticipated that the database can serve as a model for the gut, skin, and vaginal databases of the Human Microbiome Project. GenomeWeb has more.
An old Danish proverb says that when the manger is empty, the horses bite each other. This idea has now been put to use by Anthony Sinskey’s research group at MIT, as described in a report by the MIT Technology Review. Sinskey’s group had previously produced the genome sequence of the soil-dwelling bacterium Rhodococcus fascians. Looking at the genome, they were surprised to find that this organism, not known for its antibiotic-producing powers, harbored a number of genes involved in the metabolism of antibiotic-like compounds. However, none of these genes seemed to be expressed when the bacterium was grown in the lab. To bring out the worst in the bacterium, the group decided to grow it in competition with a Streptomyces bacterium. After selection experiments, one strain of Rhodococcus was shown to excrete a novel antibiotic compound, dubbed rhodostreptomycin, which belongs to the same class of antibiotics as streptomycin, a tuberculosis drug.
The exact molecular mechanisms responsible for the new compound are still being worked out, but one fascinating preliminary finding is that the selected Rhodococcus strain seems to have assimilated a large chunk of DNA from the competing Streptomyces strain.
This is a fascinating example of the new picture of bacterial genomics that is emerging as a result of improved sequencing technology - for an introduction, I recommend this review by Raskin et al.
Applied Biosystems have released data to the public from the genome sequencing of a Yoruba Nigerian HapMap sample. In their press release, AB claim that the data were generated using only 7 runs of the SOLiD system, at a total sequencing cost of less than $60,000.
The data covered the genome 12-fold, and paired-end information provided a physical coverage of 100-fold, i.e. the coverage stemming from the inserted but unsequenced part of paired-end reads. Millions of SNPs and a large number of structural variations were identified from the data.
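The difference between the two coverage figures can be made concrete with a small sketch. It follows the definition used above, where physical coverage counts only the inserted but unsequenced part of each pair (other definitions count the whole fragment). The function names and the toy numbers are assumptions chosen for illustration. They are not taken from the AB dataset, whose read and insert lengths are not given here.

```python
def sequence_coverage(num_pairs, read_length, genome_size):
    """Average number of sequenced bases covering each genome position."""
    return num_pairs * 2 * read_length / genome_size


def physical_coverage(num_pairs, read_length, fragment_length, genome_size):
    """Average coverage from the unsequenced middle part of each read pair."""
    unsequenced = fragment_length - 2 * read_length
    return num_pairs * unsequenced / genome_size


# Hypothetical toy values, chosen only to make the arithmetic easy to follow.
toy_sequence = sequence_coverage(
    num_pairs=10_000, read_length=25, genome_size=1_000_000
)
toy_physical = physical_coverage(
    num_pairs=10_000, read_length=25, fragment_length=1_000, genome_size=1_000_000
)
print(toy_sequence, toy_physical)
```

With short reads and long inserts, the physical coverage is many times the sequence coverage, which is why paired-end data are so useful for detecting structural variation.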
As an amusing aside, AB gave these funny facts about the dataset:
- If all 36 billion bases were spread out at 1 millimeter apart, they would extend 36,000 kilometers, or more than 4,000 times the height of Mt. Everest, which at 8,848 meters above sea level, is the highest mountain on Earth.
- If all 36 billion bases were spread along the Great Wall of China at 1 millimeter apart, this would equate to spanning the 5,000 kilometer wall more than 7 times.
- If a person were to proofread the 36 billion bases in this dataset at one letter per second, 24 hours per day, it would take 1,200 years to read the entire dataset.
- If each base represented one individual in the world population, the dataset would account for more than 5 times the entire world population of 6.8 billion people.
- This dataset, at 36 billion bases of DNA sequence, is equivalent to 360 times all of the 100 million visible stars in the Earth’s galaxy.
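For the curious, the arithmetic behind these facts is easy to check. The sketch below simply redoes the divisions using the figures quoted in the list. The variable names are my own, and the 365-day year is an assumption.

```python
BASES = 36_000_000_000  # dataset size quoted by AB

# Bases spaced 1 mm apart: convert millimetres to kilometres.
length_km = BASES / 1_000_000
everest_multiples = length_km / 8.848      # Mt. Everest: 8,848 m = 8.848 km
great_wall_multiples = length_km / 5_000   # wall length as quoted: 5,000 km

# Proofreading at one base per second, around the clock.
seconds_per_year = 60 * 60 * 24 * 365      # assumes a 365-day year
proofreading_years = BASES / seconds_per_year

world_population_multiples = BASES / 6_800_000_000
star_multiples = BASES / 100_000_000

# Genome size implied by the 12-fold coverage mentioned above.
implied_genome_size = BASES / 12

print(f"length: {length_km:,.0f} km")
print(f"Everest heights: {everest_multiples:,.0f}")
print(f"Great Wall lengths: {great_wall_multiples:.1f}")
print(f"proofreading years: {proofreading_years:,.0f}")
print(f"world populations: {world_population_multiples:.1f}")
print(f"star multiples: {star_multiples:.0f}")
print(f"implied genome size: {implied_genome_size:,.0f} bases")
```

The numbers hold up, although the proofreading figure comes out closer to 1,140 years, so the 1,200 is a generous rounding. The implied genome size of 3 billion bases is consistent with the 12-fold coverage of a human genome.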
Does Solexa have problems with amplifying A/T-rich regions? I just read a really interesting paper by Hillier et al. in Nature Methods, which claims that this might be the case.
The paper is entitled “Whole-genome sequencing and variant discovery in C. elegans” and reports the use of Solexa technology to re-sequence two C. elegans specimens for variant discovery. The paper demonstrates that the Solexa technology can be used for re-assembly of the C. elegans genome, especially when paired-end information is used.
However, it points to a general lack of coverage in A/T-rich regions (see figure 2 of the supplementary material), which leaves a number of zero-size gaps in the assembly - places where reads sit shoulder to shoulder but simply do not overlap. Having found these problematic A/T-rich regions, the authors went back and looked across the whole genome, where they found a general correlation between A/T content and read coverage. This correlation was stronger when examined in 200 bp windows than in 32 bp windows. The 200 bp window corresponds to the size of the amplicons that are amplified during the cluster generation step prior to sequencing, and the 32 bp window corresponds to the number of cycles in the actual sequencing-by-synthesis procedure. This finding led Hillier et al. to conclude that failure to amplify A/T-rich regions during cluster generation is the cause of the low coverage (other reasons for the bias, such as hairpin formation, were also explored but discarded).
Unfortunately, the authors did not pursue a chemical explanation for the phenomenon and did not investigate other Solexa datasets for a similar trend. It is therefore premature to say whether this is a general phenomenon of the Solexa technology, but it definitely warrants the attention of people like us who design assembly and variant-detection algorithms.
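To give an idea of what such a window analysis might look like, here is a minimal sketch that correlates per-window A/T fraction with per-window mean read coverage. It is not the method used by Hillier et al., whose exact procedure is not described here. The function names, the use of non-overlapping windows, the Pearson correlation and the toy data are all my own assumptions.

```python
from math import sqrt


def at_fraction(seq):
    """Fraction of A/T bases in a sequence (case-insensitive)."""
    seq = seq.upper()
    return sum(base in "AT" for base in seq) / len(seq)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists.

    Raises ZeroDivisionError if either list has zero variance.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def at_coverage_correlation(genome, coverage, window):
    """Correlate A/T fraction with mean coverage over non-overlapping windows.

    `coverage` holds one read-depth value per base of `genome`.
    A trailing partial window is dropped.
    """
    at_values, cov_values = [], []
    for start in range(0, len(genome) - window + 1, window):
        end = start + window
        at_values.append(at_fraction(genome[start:end]))
        cov_values.append(sum(coverage[start:end]) / window)
    return pearson(at_values, cov_values)


# Made-up toy data: coverage drops as the A/T fraction rises.
toy_genome = "GCGC" + "ATAT" + "GCAT"
toy_coverage = [10] * 4 + [2] * 4 + [6] * 4
toy_r = at_coverage_correlation(toy_genome, toy_coverage, window=4)
print(toy_r)
```

On real data, one would call `at_coverage_correlation` once with `window=200` and once with `window=32` and compare the strength of the two correlations, which is the comparison the paper's argument rests on.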