An interesting article appears in the Nov 2007 issue of Genome Technology which talks about how a simple bioinformatics solution using binary search trees helped solve a huge bottlenecking problem at the Biological Data Management and Technology Center at the Lawrence Berkeley National Laboratory. The time to process a task was incredibly reduced from 30 day to 12 hours....
[November 2007] By Matthew Dublin
Sometimes a fresh perspective on an old problem can yield new solutions. Department of Energy graduate fellow Ben Campbell Smith from Harvard University recently demonstrated this concept during a summer stint at the Biological Data Management and Technology Center at the Lawrence Berkeley National Laboratory.
Upon arriving at his assigned lab, Smith was tasked with solving a bottleneck problem that prohibits the centers researchers from updating their Integrated Microbial Genomes system. The IMG system combines microbial data from the Joint Genome Institute, the University of California, San Francisco, and LBNLs Life Sciences and Physical Biosciences divisions in order to facilitate comparative analyses. Whenever a microbes genome is sequenced, the data is sent to a supercomputer at the Pacific Northwest National Laboratory in Washington, which uses Blast to conduct all-by-all searches for matches among the roughly 3.2 million microbial sequences already in the database. The challenge is in formatting those results so that specific comparisons between two genomes can be located quickly and easily by researchers.
Before Smiths arrival, the center utilized an algorithm written with Python that read the output line-by-line and then wrote the results to separate files based on the two taxons. But with datasets usually comprising 20 billion lines, the algorithm would often take 30 days to complete the task. With Smiths new algorithm, these same jobs now take about 12 hours to complete.
Smith says that the algorithm he devised, which he wrote using the C++ programming language, is based on run-of-the-mill computer science solutions. A lot of the basic components, like using binary search trees and stuff like that, are very well established computer science techniques, but to be honest, I knew nothing about this kind of bioinformatics work, says Smith. So I came there and they said heres the project, and basically I thought, this sounds like a lot of other sorting kind of stuff people do, and this is how they normally do it.
While Smith says he enjoyed his foray into bioinformatics code tweaking, his heart belongs to high-energy particle physics. His PhD work focuses on searching for the Higgs boson particle, a yet-to-be identified particle that, if discovered, may help solidify our current understanding of electrons and quarks, Smith says. But he is still involved with the LBNL project and aims to ensure that the new algorithm will work seamlessly.
One of the kinks has to do with memory limitations on the current system, but he is determined to make it work without the need for new hardware. Right now, Im finishing it up, but one of the problems is that the trees no longer fit in the RAM of the computers they have, he says. But I think its one of those things where we could be more clever and get it to work with existing computers.