Wednesday 16 January 2008

In Praise of the Pedestrian

Happy New Year. My posting lapse was because things got busy before Christmas. They aren't really less busy now, but I'm attempting to be more organised.

Ray Bradbury is my favourite author. He's had some competition from Neil Gaiman and, more recently, Philip Reeve, but he still retains the top spot. My favourite novel is Fahrenheit 451, which Bradbury based on a short story called The Pedestrian. What Wikipedia fails to record is Bradbury's anecdote about the incident that inspired the original story. I am recounting this from memory, as it's from an introduction to an earlier edition of Fahrenheit 451 that I no longer have.

The author was out on a late night stroll along the street, perhaps enjoying the clemency of the weather or a view of the stars, when a police patrol car pulled up beside him. The policeman shouted across to him, asking what he was doing. "Walking" was the reply. Unable to believe that there could be any reason for simply walking that wasn't nefarious, or to envisage any benefit that could be gained from the activity, the policeman continued to question him. Bradbury was eventually able to talk his way out of being arrested and headed home, probably indignantly, to compose The Pedestrian.

Currently I am doing some work with an emeritus Professor in the department, Ted Maden. He has retired, but still has an office and carries on researching things that take his interest, albeit without funding or a salary. He has discovered a number of pseudogenes in the human genome sequence. These are remnant genes that no longer function, but they can tell you something of the evolutionary history of the active version of the gene.

Ted learnt his molecular biology before computers were used to analyse DNA and protein sequences, and certainly before whole genomes of any organism were sequenced. I am helping him by pulling further sequences out of published genomes, as I have some familiarity with how to extract these from databases.

One of the first steps in comparing different versions of a gene is to produce an alignment, where you line up in columns all the bases along the length of the DNA sequence against the equivalent positions in the other versions. Sometimes you will have to add gaps to a sequence to compensate where bases are missing or additional bases have been inserted. Sometimes sections of the gene have evolved quickly and changed so much that you can't really align two sequences against each other with any certainty, and you have to make informed judgements as to which base should go where. There are computer programs that can do this relatively quickly, such as ClustalX or ARB. They can handle a large number of sequences at once and generally do a good job of aligning, though I always check and edit alignments by hand.
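To make the idea of lining up columns and inserting gaps concrete, here is a minimal sketch of a textbook pairwise global alignment (the Needleman-Wunsch approach). It is not how ClustalX or ARB work internally, and the function name align_pair and the scores for match, mismatch and gap are illustrative assumptions rather than recommended values.

```python
def align_pair(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align two sequences; return (aligned_a, aligned_b, score).

    Scoring values are arbitrary illustrative assumptions.
    """
    rows, cols = len(a) + 1, len(b) + 1

    def sub(i, j):
        # Score for placing a[i-1] and b[j-1] in the same column.
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align the two bases
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Walk back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[-1][-1]


if __name__ == "__main__":
    top, bottom, total = align_pair("GATTACA", "GATACA")
    print(top)
    print(bottom)
    print("score:", total)
```

Where two placements of a gap score equally, the sketch simply picks one, which is exactly the kind of spot where checking by eye earns its keep.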

Ted produces all his alignments by hand on ruled sheets of A3 paper, writing out each of the thousands of bases, checking and double-checking the sequence and the alignment. The wadges of completed alignment are held together by bulldog-clips and stored in map-drawers. He then annotates them and collates details of mutations that have occurred between the different sequences in the alignments, noting the locations, types and frequencies.

All the things that Ted does are possible using modern bioinformatics. I could apply an algorithm to count CpG mutations and locate restriction sites, but what I can never do is obtain the deep understanding of the sequences that Ted has gained by aligning things his way. He spots sequence features that I wouldn't know to screen for, and corrects sections of my alignments that have completely foxed the computer alignment algorithm. There is simply no substitute for quality.
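As a rough illustration of the kind of automated screening mentioned above, the sketch below finds restriction sites by plain string search and counts CpG changes between two already-aligned sequences. The function names are hypothetical, and the definition of a "CpG mutation" used here (a CG in the reference sitting opposite TG or CA in the other sequence) is an assumption made for the example, not necessarily the definition used in this work.

```python
def find_sites(seq, site):
    """Return 0-based start positions of every occurrence of site in seq."""
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)  # step by one to allow overlaps
    return positions


def count_cpg_changes(ref, other):
    """Count CG dinucleotides in ref that sit opposite TG or CA in other.

    Both sequences must come from the same alignment, so they have equal
    length. A CG split by a gap column in ref is not detected by this
    simple version.
    """
    if len(ref) != len(other):
        raise ValueError("aligned sequences must have equal length")
    count = 0
    for i in range(len(ref) - 1):
        if ref[i:i + 2] == "CG" and other[i:i + 2] in ("TG", "CA"):
            count += 1
    return count


if __name__ == "__main__":
    # GAATTC is the EcoRI recognition site.
    print(find_sites("GGAATTCTTGAATTC", "GAATTC"))
    print(count_cpg_changes("ACGTTCGA", "ATGTTCAA"))
```

A count like this only answers the question it was written to ask; it will not flag a feature nobody thought to look for.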

Sequencing technology is getting faster and cheaper. Complete genome sequences are published with increasing regularity. The NCBI currently lists 623 complete microbial genomes, with 915 in progress. The first microbial genome sequence to be released (excepting some small viral genomes) was that of Haemophilus influenzae in July 1995, and it took many months to produce. Now it could be resequenced in a time-scale more appropriately measured in hours. However, annotation of the resulting sequence, identifying the genes found, has not sped up at a similar rate to our sequencing ability, and our understanding of how these organisms work based on the sequence generated has increased at a slower rate still. When a new bacterial genome is sequenced, the number of genes that are unidentifiable is significant. I've heard that, on average, 30% of the genes in a new genome can actually be identified; the rest are marked as hypothetical.

My point is that we are getting faster and faster at producing larger and larger amounts of data, but not making similar gains in understanding the data we're producing. Ted's approaches might seem preposterous when set against work comparing multiple entire genomes to one another, but such comparisons can only skim the surface of the information that's there.

In The Pedestrian the protagonist ends up being taken to the Psychiatric Center for Research on Regressive Tendencies. Ted's hoping to publish our findings, and hopefully his methodology will be received well. Otherwise, next post from Bedlam.