Award
 » Home » Bioinformatics » Software Developers » Sequence Search Solution
 
Solutions Search! The Customized Life Science Search Engine
Search Site
Search Suppliers
Search Internet
Search over 6000 life science websites specifically selected by our expert scientist moderators.

Other Topics
9/9/2008 12:51 PM
What is the most common b ...
9/2/2008 08:27 PM
how to find a motif, whic ...
8/24/2008 12:16 PM
How to start? (PDB Viewer ...
8/4/2008 11:04 PM
Transcription Factor moti ...
7/9/2008 07:31 PM
What is your programming ...
9/3/2007 08:27 PM
Bioinformatic Software De ...
9/3/2007 08:44 PM
Database Poll
3/26/2007 08:47 PM
Scibuntu (ubuntu linux di ...
3/14/2007 03:28 PM
Transcription Factor moti ...
2/14/2007 06:23 AM
Bioinformaticians looking ...
8/9/2006 09:39 PM
extremely fast Smith-wate ...
4/20/2006 02:38 PM
Database Poll, What DBMS ...
4/19/2006 01:19 PM
Inference engine for comp ...
2/22/2006 10:56 PM
about the coupling of GPC ...
12/7/2005 01:37 PM
Software Development
6/21/2005 03:27 PM
Error in using SeqIOTools
6/14/2005 04:13 PM
PERL
1/13/2005 10:32 PM
HT-GO-FAT
Subscribet to topic
bottom of page RSS Feed Topic Feed
 Sequence Search Solution [View Printable]
Tony Rook

Moderator

See
Similar
Scientists



View Blogs


Group: Member
Posts: 582
Joined: Nov 03, 2005







 Send a personal messsage to Tony Rook Reply with a quote from this post Go to the top of the page


An interesting article appears in the Nov 2007 issue of Genome Technology which talks about how a simple bioinformatics solution using binary search trees helped solve a huge bottlenecking problem at the Biological Data Management and Technology Center at the Lawrence Berkeley National Laboratory. The time to process a task was incredibly reduced from 30 day to 12 hours....



Algorithm Whiz Kid Boosts Sequence Searches

[November 2007] By Matthew Dublin

Sometimes a fresh perspective on an old problem can yield new solutions. Department of Energy graduate fellow Ben Campbell Smith from Harvard University recently demonstrated this concept during a summer stint at the Biological Data Management and Technology Center at the Lawrence Berkeley National Laboratory.

Upon arriving at his assigned lab, Smith was tasked with solving a bottleneck problem that prohibits the centers researchers from updating their Integrated Microbial Genomes system. The IMG system combines microbial data from the Joint Genome Institute, the University of California, San Francisco, and LBNLs Life Sciences and Physical Biosciences divisions in order to facilitate comparative analyses. Whenever a microbes genome is sequenced, the data is sent to a supercomputer at the Pacific Northwest National Laboratory in Washington, which uses Blast to conduct all-by-all searches for matches among the roughly 3.2 million microbial sequences already in the database. The challenge is in formatting those results so that specific comparisons between two genomes can be located quickly and easily by researchers.

Before Smiths arrival, the center utilized an algorithm written with Python that read the output line-by-line and then wrote the results to separate files based on the two taxons. But with datasets usually comprising 20 billion lines, the algorithm would often take 30 days to complete the task. With Smiths new algorithm, these same jobs now take about 12 hours to complete.

Smith says that the algorithm he devised, which he wrote using the C++ programming language, is based on run-of-the-mill computer science solutions. A lot of the basic components, like using binary search trees and stuff like that, are very well established computer science techniques, but to be honest, I knew nothing about this kind of bioinformatics work, says Smith. So I came there and they said heres the project, and basically I thought, this sounds like a lot of other sorting kind of stuff people do, and this is how they normally do it.

While Smith says he enjoyed his foray into bioinformatics code tweaking, his heart belongs to high-energy particle physics. His PhD work focuses on searching for the Higgs boson particle, a yet-to-be identified particle that, if discovered, may help solidify our current understanding of electrons and quarks, Smith says. But he is still involved with the LBNL project and aims to ensure that the new algorithm will work seamlessly.

One of the kinks has to do with memory limitations on the current system, but he is determined to make it work without the need for new hardware. Right now, Im finishing it up, but one of the problems is that the trees no longer fit in the RAM of the computers they have, he says. But I think its one of those things where we could be more clever and get it to work with existing computers.
.........................
Tony Rook

 Posted Nov 15, 2007, 18:53 PM
top of page

Forum Jump