Transcription Factor motif (PERL)

ABC

I want to use Perl to find transcription factor DNA binding sites, for instance CACTTGAN. I only have a basic ability to write Perl but can follow a script fairly well. Thanks.

ABC

ryan_m

This would be a rather crude way to search for TFBSs. There are many 'scanner' tools available that use PWMs or PSSMs to find putative sites (and score them). I have used TRANSFAC's "match" tool as well as "patser" with some success. If you can get hold of a PSSM for your transcription factor, I would suggest a scanner rather than a strict regex. Failing that, you could probably do what you need in a few lines of Perl. Try what I have put below (keep in mind I haven't tested it).

#!/usr/bin/perl
# Input is a FASTA file of upstream sequences; 1 kb and 5 kb upstream sequences
# are available pre-extracted from the UCSC Genome Browser download page.
use strict;
use warnings;
use Bio::SeqIO;

# The N in the motif CACTTGAN means "any base"
my $pattern = "CACTTGA[ACGT]";

my $io = Bio::SeqIO->new(-file => "upstream_1kb.fa", -format => 'fasta');

while (my $seq_obj = $io->next_seq) {
    my $id = $seq_obj->display_id;

    # forward (+) strand
    my $full_sequence = uc($seq_obj->seq);
    if ($full_sequence =~ /$pattern/) {
        # this sequence matches, do something with it
        print "$id has match on (+) strand\n";
    }

    # reverse (-) strand: search the reverse complement
    my $rc_seq = uc($seq_obj->revcom->seq);
    if ($rc_seq =~ /$pattern/) {
        # this sequence matches, do something with it
        print "$id has match on (-) strand\n";
    }
}
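
If BioPerl is not installed, the same two-strand search can be sketched in core Perl. This is only a minimal sketch under stated assumptions: the example sequence is made up, and the helper names `revcom` and `find_sites` are hypothetical, not part of any library. It reverse-complements the sequence with `tr///` and wraps the pattern in a lookahead so that overlapping sites are reported as well, along with their offsets.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: reverse-complement an uppercase DNA string (A/C/G/T/N).
sub revcom {
    my ($seq) = @_;
    my $rc = reverse $seq;      # scalar context: reverses the string
    $rc =~ tr/ACGTN/TGCAN/;     # complement each base
    return $rc;
}

# Hypothetical helper: 0-based start of every match, overlapping ones included.
sub find_sites {
    my ($seq, $pattern) = @_;
    my @starts;
    # The zero-width lookahead lets the regex engine retry at the next base.
    while ($seq =~ /(?=$pattern)/g) {
        push @starts, $-[0];    # $-[0] is the start offset of the last match
    }
    return @starts;
}

my $pattern = "CACTTGA[ACGT]";
my $seq     = uc "ttCACTTGATggTCAAGTGaa";   # made-up example sequence

for my $start (find_sites($seq, $pattern)) {
    print "(+) strand match at offset $start\n";
}
for my $start (find_sites(revcom($seq), $pattern)) {
    print "(-) strand match at offset $start of the reverse complement\n";
}
```

Offsets on the (-) strand are relative to the reverse-complemented string; to map one back to the forward strand, subtract the offset and the site length from the sequence length.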

surferchic

Ya..

Why do you want to use a script? Are you after TFBSs that aren't already in the databases (e.g. TRANSFAC or ORegAnno)?

-SC

ABC

I'm not partial to Perl at all; I know there are easier ways. I am learning to find known TF binding sites first so that I can go on to design scripts that find unknown ones. So ultimately I'd like to use Perl to search both strands (forward and reverse complement). Thanks.

ABC

ABC

Are you familiar with the $& variable? I was told this may help.

ryan_m

ABC wrote:

Are you familiar with the $& variable? I was told this may help.

After a successful regex match, $& holds the portion of the sequence that matched. So in the code I supplied, you could store each $& in an array if you want to know the actual sequence of the matching sites. $` and $' give you the left and right flanking sequences as well (in other words, $` . $& . $' is your original sequence). However, considering what you say about discovering novel sites, I think you are looking at this in an overly simplistic way. How does knowing how to match known regular expressions lead to a way of finding novel ones? The identification of novel sites generally relies on some sort of alignment of the promoter regions of co-regulated genes (using Gibbs sampling, for example).
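
To make the three match variables concrete, here is a small sketch on made-up sequences (the variable names are just for illustration). It splits one sequence into left flank, matched site and right flank, then collects every matched site from a second sequence into an array. Note that on older Perl versions (before 5.20), using $& anywhere in a program slows down all regex matching, so capture groups are often preferred there.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $pattern  = "CACTTGA[ACGT]";
my $sequence = "GGCACTTGATCC";          # made-up example sequence

my ($left, $site, $right) = ('', '', '');
if ($sequence =~ /$pattern/) {
    # $` = text before the match, $& = the match, $' = text after the match
    ($left, $site, $right) = ($`, $&, $');
    print "left flank:   $left\n";
    print "matched site: $site\n";
    print "right flank:  $right\n";
    # Concatenating the three pieces gives back the original sequence.
    print "reassembled OK\n" if $left . $site . $right eq $sequence;
}

# In list context, a /g match on a pattern without capture groups
# returns every (non-overlapping) matched substring.
my @sites = "CACTTGAATTCACTTGAG" =~ /$pattern/g;
print "found site: $_\n" for @sites;
```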

Ryan

ABC

I'll be using my output to compare against results obtained with Patser and MEME, for example, and then seeing which program returned sequences that are actually known to bind TFs. I will be using Gibbs sampling etc. for the novel sites; I was just trying to simplify what I need right now (a Perl script). Thanks for the interest.