Thursday 1 May 2008

Open Source Sequencing and Colossal Squid

The Polonator has a really cheesy name (Arnie dressed as a giant bee?), but it's a really important step in molecular biology. It's a DNA sequencing machine; there are lots of these around at the moment, as people have managed to make systems that beat the traditional dideoxy Sanger sequencing method (which was used to sequence the human genome) in speed and in the bulk cost of sequencing. Examples include 454's pyrosequencer and the dubiously capitalised SOLiD system. Both produce substantially shorter reads (single sequences of DNA) than the ABI 3700s used to sequence the human genome (though I've heard rumours that a new 454 machine that can do 500 bp reads is on the cards), but they do millions at a time.

If you want to run one of these systems, you purchase the machine for hundreds of thousands from the manufacturer, get trained to run it by the manufacturer and buy the consumables for the system from the manufacturer. You don't necessarily know what reagents are in these consumables, and therefore any trouble-shooting you do involves the manufacturer too. This is because these machines are expensive to make, not many people buy them and they become obsolete very quickly. It ties labs to the machine, probably long past the time when it's cutting edge.

In some ways it's silly to keep these reagents secret. I worked for AstraZeneca for a year as an undergraduate (which was great experience - companies are much stricter about lab protocol than universities and you pick up good habits!) and they purchased a machine for typing bacteria, the RiboPrinter Microbial Characterisation System, which I was involved in validating for use. We bought their kits and never knew what was in them, but another student and I, frustrated that we had to write a report for university that would contain no science at all because we didn't know what was in these kits, sat down and attempted to work out what they contained. Judging by the comments from the sales reps from DuPont, who make the system, we were pretty close in our estimation. If two undergraduates can work out what's in the various solutions (which is a long way from making a working machine, but at least would mean we weren't reliant on buying inevitably more expensive proprietary reagents), what's the point in the secrecy anyway?

The Polonator is different in that the system is open source. They give you the full specs of the machine and full details of the reagents required: what's in them and at what concentrations. You can get the machine, write your own protocols for it and share them with other users. According to the article in Technology Review (via Digg), users are already collaborating to improve the chemistry.

Surely this is how science should be done? The philosophy is that scientists share their work and have it peer-reviewed: before grants are awarded, after papers are written, and when results are discussed by the community at conferences and after publication. This should allow everyone to build on each other's work and mean that resources aren't wasted - molecular biological research is not cheap to do. In reality this isn't what happens. Labs have their own protocols for performing experiments that might be shared amongst collaborators, but more rarely with the whole scientific community unless it's a ground-breaking achievement. Minor improvements to protocols that could save time and money are shared, perhaps, by word of mouth. Negative results are generally not reported, meaning that identical experiments may be performed again, perhaps many times, in other labs, wasting resources all over again.

There are good reasons why some information shouldn't be shared, perhaps to allow a patent to be taken out on the intellectual property and prevent someone nicking the idea for their own gain. But I believe that in the majority of cases the benefits of sharing results outweigh the risks. There's not a lot of truly open source science going on out there currently. Another example is the OpenWetWare project, which aims to make sharing of information and protocols amongst bioscientists easier - to the extent that they encourage lab notebooks to be kept online on their wiki. It takes brave souls to get involved initially, posting up their failed and frustrating experiments along with the successes, but once more people are involved I reckon the benefits will speak for themselves.

Hopefully more projects like this will be initiated and we can have a truly open, more productive scientific community.

On a completely different topic, a group of scientists at the Museum of New Zealand Te Papa Tongarewa have been investigating three Giant Squid (Architeuthis dux) and one Colossal Squid (Mesonychoteuthis hamiltoni - it's bigger; I reckon they're reserving the name Ginormous Squid for the next, bigger one they find) that they've had on ice. The blog is fascinating, with pictures of the biggest eye in nature and some clips from the dissection. I like squid, and own a T-shirt to prove it - linky.

Thursday 13 March 2008

Venter at TED

TED2008 finished at the beginning of the month and there's been a flurry of new videos of the talks (generally about 15 minutes in length) added to the site. One of the new talks is by Craig Venter - I've posted before about his last TED talk here - and like last time I find his new talk fascinating, but it leaves me in two minds.

The science is very impressive and his delivery is understated, so that when he says things like "replacing the entire petrochemical industry" or "producing fuel from sequestered CO2 within 18 months" it seems all the more so. On the other hand, does he actually answer the question at the end about using the technology he's creating to produce bioweapons? He gives reasons why it is currently unlikely, but is it actually impossible? That's probably unreasonable of me - you can use all sorts of technologies to make weapons that are also used to produce entirely innocuous and useful products.

And the product he's talking about here is diverting CO2 waste-streams from manufacturing and making fuel directly from them using synthetically constructed microorganisms. Though this doesn't actually remove CO2 from the atmosphere, it does remove the emission - perhaps permanently, if the CO2 is recycled again when the new fuel is burnt. Climate change undeniably needs radical solutions and this certainly is one.

I'm trying to look past the presentation to the heart of his talk, to find out what he's actually saying and what I think about it; meanwhile, see for yourself below.

Wednesday 16 January 2008

In Praise of the Pedestrian

Happy New Year. My posting lapse was because things got busy before Christmas. They aren't really any less busy now, but I'm attempting to be more organised.

Ray Bradbury is my favourite author. He's had some competition from Neil Gaiman and, more recently, Philip Reeve, but he still retains the top spot. My favourite novel is Fahrenheit 451, which Bradbury based on a short story called The Pedestrian. What Wikipedia fails to record is Bradbury's anecdote about the incident that inspired the original story - I am recounting this from memory, as it comes from an introduction to an earlier edition of Fahrenheit 451 that I no longer have.

The author was out on a late-night stroll along the street, perhaps enjoying the clemency of the weather or a view of the stars, when a police patrol car pulled up beside him. The policeman shouted across to him, asking what he was doing. "Walking" was the reply. Not believing that there could ever be a reason for simply walking that wasn't nefarious, and unable to envisage any benefit that could be gained from the activity, the policeman continued to question him. Bradbury was eventually able to convince him not to make an arrest and headed home, probably indignantly, to compose The Pedestrian.

Currently I am doing some work with an emeritus professor in the department, Ted Maden. He has retired, but still has an office and carries on researching things that take his interest, albeit without funding or a salary. He has discovered a number of pseudogenes in the human genome sequence; these are remnant genes that no longer function, but that can tell you something of the evolutionary history of the active version of the gene.

Ted learnt his molecular biology before computers were used to analyse DNA and protein sequences, and certainly before the whole genome of any organism was sequenced. I am helping him by obtaining further sequences from genome sequences, as I have some familiarity with how to extract these from the databases.
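As a rough illustration of what that database work looks like, here's a minimal sketch using Biopython's Entrez utilities (an assumption on my part - any tool that can pull records from GenBank would do, and the accession number below is just a placeholder, not one of the sequences we're actually working with):

    # Fetch a nucleotide record from GenBank in FASTA format.
    # Assumes Biopython is installed; the accession used is a placeholder.
    from Bio import Entrez, SeqIO

    Entrez.email = "you@example.com"  # NCBI asks for a contact address

    def fetch_sequence(accession):
        """Download a single nucleotide record and parse it as FASTA."""
        handle = Entrez.efetch(db="nucleotide", id=accession,
                               rettype="fasta", retmode="text")
        record = SeqIO.read(handle, "fasta")
        handle.close()
        return record

    record = fetch_sequence("SOME_ACCESSION")  # substitute a real accession
    print(record.id, len(record.seq), "bp")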

One of the first steps in comparing different versions of a gene is to produce an alignment, where you line up in columns all the base pairs along the length of the DNA sequence against the equivalent positions in the other versions. Sometimes you have to add gaps to a sequence to compensate where base pairs are missing or additional base pairs have been added. Sometimes sections of the gene have evolved quickly and changed so much that you can't really align two sequences against each other with any certainty, and you have to make informed judgements as to which base should go where. There are computer programs, such as ClustalX or ARB, that can do this relatively quickly; they can handle a large number of sequences at once and generally do a good job of aligning, though I always check and edit alignments by hand.
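To show what "adding gaps" means in practice, here's a toy pairwise alignment using Biopython's pairwise2 module (my choice of tool, not part of our workflow; the sequences are invented fragments, and a real multiple alignment of whole genes would go through ClustalX or ARB as above):

    # Align two short made-up fragments so the shared bases line up in
    # columns and gaps are inserted where bases are missing from one.
    from Bio import pairwise2
    from Bio.pairwise2 import format_alignment

    seq_a = "ACCGTTAGCTA"   # invented fragment
    seq_b = "ACGTTGCTA"     # invented fragment with a couple of bases missing

    # globalxx: global alignment scoring 1 per match, no mismatch or gap penalties
    alignments = pairwise2.align.globalxx(seq_a, seq_b)
    print(format_alignment(*alignments[0]))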

Ted produces all his alignments by hand on ruled sheets of A3 paper, writing out each of the thousands of bases, checking and double-checking the sequence and the alignment. The wadges of completed alignment are held together by bulldog-clips and stored in map-drawers. He then annotates them and collates details of mutations that have occurred between the different sequences in the alignments, noting the locations, types and frequencies.

All the things that Ted does are possible using modern bioinformatics. I could apply an algorithm to count CpG mutations and locate restriction sites, but what I can never do is obtain the deep understanding of the sequences that Ted has gained by aligning things his way. He spots sequence features that I wouldn't know to screen for, and corrects sections of my alignments that have completely foxed the computer alignment algorithm. There is simply no substitute for quality.
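For what it's worth, the kind of automated screen I mean is easy to sketch in plain Python (the sequence here is invented, and EcoRI is just one example of a restriction site you might search for):

    # Count CpG dinucleotides (the sites where CpG-type mutations occur)
    # and locate EcoRI (GAATTC) recognition sites in a DNA sequence.
    def count_cpg(seq):
        """Count CpG dinucleotides in a DNA sequence."""
        seq = seq.upper()
        return sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == "CG")

    def find_sites(seq, motif="GAATTC"):
        """Return the 0-based start positions of a restriction-site motif."""
        seq, motif = seq.upper(), motif.upper()
        positions, start = [], seq.find(motif)
        while start != -1:
            positions.append(start)
            start = seq.find(motif, start + 1)
        return positions

    example = "ATGCGAATTCGGCGCGAATTCAT"   # invented sequence
    print("CpG count:", count_cpg(example))        # 4
    print("EcoRI sites at:", find_sites(example))  # [4, 15]

A script like that gives you numbers in seconds, but it won't notice the oddities that Ted picks up by eye.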

Sequencing technology is getting faster and cheaper. Complete genome sequences are published with increasing regularity. The NCBI currently lists 623 complete microbial genomes, with 915 in progress (link). The first microbial genome sequence to be released (excepting some small viral genomes) was that of Haemophilus influenzae in July 1995, and it took many months to produce. Now it could be resequenced on a time-scale more appropriately measured in hours. However, annotation of the resulting sequence - identifying the genes it contains - has not kept pace with our sequencing ability, and our understanding of how these organisms work, based on the sequence generated, has grown more slowly still. When a new bacterial genome is sequenced, the number of genes that can't be identified is significant: I've heard that on average only 30% of the genes on a new genome can actually be identified, with the rest marked as hypothetical.

My point is that we are getting faster and faster at producing larger and larger amounts of data, but we are not making similar gains in understanding the data we're producing. Ted's approach might seem preposterous compared with work that compares multiple entire genomes to one another, but those studies can only skim the surface of the information that's there.

In The Pedestrian the protagonist ends up being taken to the Psychiatric Center for Research on Regressive Tendencies. Ted's hoping to publish our findings; hopefully his methodology will be well received. Otherwise, next post from Bedlam.