Kumar,
You should look for a way to solve this with dictionaries or sets. If you look for each element of Lseq in each element of refseq, that is 33,155,825,985 lookups. That is a lot!
Sets have a fast test for membership, so look for ways to use them!
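A small sketch of the difference, using only the handful of codes quoted in your message (the real Lseq has 30673 entries). The variable names in_list, in_set and missing are just for illustration:

```python
# The few codes shown in the question; the real list is much longer.
Lseq = ['NM_025164', 'NM_025164', 'NM_012384', 'NM_006380', 'NM_007032', 'NM_014332']

# Building the set also drops the duplicate NM_025164.
Lseq_set = set(Lseq)

# Both tests give the same answer, but the list test compares against
# each element in turn, while the set test is a single hash lookup.
in_list = 'NM_014332' in Lseq
in_set = 'NM_014332' in Lseq_set
missing = 'NM_013259' in Lseq_set
```

With a list of 30673 codes checked against more than a million lines, that per-lookup difference is what makes the set worthwhile.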
In this case, the target string in refseq seems to be pretty easy to pick out with a regular expression. So you can do this:
- put Lseq into a set
- iterate through refseq looking for lines starting with >
- use a regular expression search to pick out the NM_ code
- look up the code in the Lseq set
- if it is found, print out the sequence
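The extraction step on its own can be sketched like this, applied to the first header line from your refseq. The pattern assumes every code is followed by ".1", which holds for the lines shown but may not hold for the whole file:

```python
import re

# Assumes the NM code is always followed by ".1", as in the sample headers.
refseq_re = re.compile(r'(NM_\d+)\.1')

header = '>gi|10047089|ref|NM_014332.1| Homo sapiens small muscle protein, X-linked (SMPX), mRNA'

m = refseq_re.search(header)
# group(1) is the part in parentheses, without the ".1" suffix.
nm_code = m.group(1) if m else None

# A plain sequence line has no NM code, so search() returns None.
no_match = refseq_re.search('GTTCTCAATACCGGGAGAGGCACAGAGC')
```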
Here is an implementation that uses this idea. To print the entire group, I keep a flag that turns printing on and off.
Kumar, please take the time to understand the basics about sets, dictionaries and iteration. You ask similar questions over and over and you don't seem to learn from the answers. You repeatedly make the same mistakes with indexing. If you don't understand the solutions, ask more questions. But please learn enough so you don't keep asking the same questions and making the same mistakes.
Kent
import re
# Regular expression to pick out the NM_xxxxxx codes from refseq
refseq_re = re.compile(r'(NM_\d+)\.1')
Lseq = ['NM_025164', 'NM_025164', 'NM_012384', 'NM_006380', 'NM_007032','NM_014332']
refseq = ['>gi|10047089|ref|NM_014332.1| Homo sapiens small muscle protein, X-linked (SMPX), mRNA', 'GTTCTCAATACCGGGAGAGGCACAGAGCTATTTCAGCCACATGAAAAGCATCGGAATTGAGATCGCAGCT', 'CAGAGGACACCGGGCGCCCCTTCCACCTTCCAAGGAGCTTTGTATTCTTGCATCTGGCTGCCTGGGACTT', 'CCCTTAGGCAGTAAACAAATACATAAAGCAGGGATAAGACTGCATGAATATGTCGAAACAGCCAGTTTCC', 'AATGTTAGAGCCATCCAGGCAAATATCAATATTCCAATGGGAGCCTTTCGGCCAGGAGCAGGTCAACCCC', 'CCAGAAGAAAAGAATGTACTCCTGAAGTGGAGGAGGGTGTTCCTCCCACCTCGGATGAGGAGAAGAAGCC', 'AATTCCAGGAGCGAAGAAACTTCCAGGACCTGCAGTCAATCTATCGGAAATCCAGAATATTAAAAGTGAA', 'CTAAAATATGTCCCCAAAGCTGAACAGTAGTAGGAAGAAAAAAGGATTGATGTGAAGAAATAAAGAGGCA', 'GAAGATGGATTCAATAGCTCACTAAAATTTTATATATTTGTATGATGATTGTGAACCTCCTGAATGCCTG', 'AGACTCTAGCAGAAATGGCCTGTTTGTACATTTATATCTCTTCCTTCTAGTTGGCTGTATTTCTTACTTT', 'ATCTTCATTTTTGGCACCTCACAGAACAAATTAGCCCATAAATTCAACACCTGGAGGGTGTGGTTTTGAG', 'GAGGGATATGATTTTATGGAGAATGATATGGCAATGTGCCTAACGATTTTGATGAAAAGTTTCCCAAGCT', 'ACTTCCTACAGTATTTTGGTCAATATTTGGAATGCGTTTTAGTTCTTCACCTTTTAAATTATGTCACTAA', 'ACTTTGTATGAGTTCAAATAAATATTTGACTAAATGTAAAATGTGA', '>gi|10047091|ref|NM_013259.1| Homo sapiens neuronal protein (NP25), mRNA', 'TGTGCTGCTATTGTGTGGATGCCGCGCGTGTCTTCTCTTCTTTCCAGAGATGGCTAACAGGGGCCCGAGC', 'TATGGCTTAAGCCGAGAGGTGCAGGAGAAGATCGAGCAGAAGTATGATGCGGACCTGGAGAACAAGCTGG', 'TGGACTGGATCATCCTGCAGTGCGCCGAGGACATAGAGCACCCGCCCCCCGGCAGGGCCCATTTTCAGAA', 'ATGGTTAATGGACGGGACGGTCCTGTGCAAGCTGATAAATAGTTTATACCCACCAGGACAAGAGCCCATA', 'CCCAAGATCTCAGAGTCAAAGATGGCTTTTAAGCAGATGGAGCAAATCTCCCAGTTCCTAAAAGCTGCGG', 'AGACCTATGGTGTCAGAACCACCGACATCTTTCAGACGGTGGATCTATGGGAAGGGAAGGACATGGCAGC', 'TGTGCAGAGGACCCTGATGGCTTTAGGCAGCGTTGCAGTCACCAAGGATGATGGCTGCTATCGGGGAGAG', 'CCATCCTGGTTTCACAGGAAAGCCCAGCAGAATCGGAGAGGCTTTTCCGAGGAGCAGCTTCGCCAGGGAC', 'AGAACGTAATAGGCCTGCAGATGGGCAGCAACAAGGGAGCCTCCCAGGCGGGCATGACAGGGTACGGGAT', 'GCCCAGGCAGATCATGTTAGGACGCGGCATCCTGCCCCTGGTAGAGAGGACGAATGTTCCACACCATGGT']
# Turn Lseq into a set so we can search it efficiently
Lseq_set = set(Lseq)
printing = False # Flag whether to print lines or not (are we in a target sequence?)
for l in refseq:
    if l.startswith('>'):
        printing = False # Reset at start of a group

        # Look for an interesting group
        # First extract the code
        m = refseq_re.search(l)
        if m:
            nm_code = m.group(1) # This is the NM code
            if nm_code in Lseq_set:
                # We found a good one; turn on printing
                printing = True
        else:
            # This is an error - a line starting with > didn't match the re
            print "**** Error - couldn't match line:"
            print "****", l

    if printing:
        print l
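If collecting the lines is more useful than printing them, the same flag idea can be wrapped in a function. This is only a sketch under some assumptions: the name extract_groups is made up, the sequence lines in sample are shortened for illustration, and the looser pattern (any version number after the dot, not just .1) is a guess about what other headers in the file might look like. It avoids print so it runs the same under Python 2 and Python 3.

```python
import re

# Assumption: accept any version suffix (.1, .2, ...), not just .1
nm_re = re.compile(r'(NM_\d+)\.\d+')

def extract_groups(lines, wanted):
    """Return the lines of every group whose header code is in wanted."""
    wanted = set(wanted)      # fast membership test
    result = []
    keeping = False           # are we inside a wanted group?
    for line in lines:
        if line.startswith('>'):
            # A header line starts a new group; decide whether to keep it.
            m = nm_re.search(line)
            keeping = bool(m) and m.group(1) in wanted
        if keeping:
            result.append(line)
    return result

# Shortened sample data for illustration only.
sample = [
    '>gi|10047089|ref|NM_014332.1| Homo sapiens small muscle protein, X-linked (SMPX), mRNA',
    'GTTCTCAATACC',
    '>gi|10047091|ref|NM_013259.1| Homo sapiens neuronal protein (NP25), mRNA',
    'TGTGCTGCTATT',
]
found = extract_groups(sample, ['NM_014332'])
```

Unlike the script above, this version silently skips a header that does not match the pattern instead of reporting it as an error.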
kumar s wrote:
Dear group:
I have two lists:
1. Lseq:
len(Lseq)
30673
Lseq[20:25]
['NM_025164', 'NM_025164', 'NM_012384', 'NM_006380', 'NM_007032','NM_014332']
2. refseq:
len(refseq)
1080945
refseq[0:25]
['>gi|10047089|ref|NM_014332.1| Homo sapiens small muscle protein, X-linked (SMPX), mRNA', 'GTTCTCAATACCGGGAGAGGCACAGAGCTATTTCAGCCACATGAAAAGCATCGGAATTGAGATCGCAGCT', 'CAGAGGACACCGGGCGCCCCTTCCACCTTCCAAGGAGCTTTGTATTCTTGCATCTGGCTGCCTGGGACTT', 'CCCTTAGGCAGTAAACAAATACATAAAGCAGGGATAAGACTGCATGAATATGTCGAAACAGCCAGTTTCC', 'AATGTTAGAGCCATCCAGGCAAATATCAATATTCCAATGGGAGCCTTTCGGCCAGGAGCAGGTCAACCCC', 'CCAGAAGAAAAGAATGTACTCCTGAAGTGGAGGAGGGTGTTCCTCCCACCTCGGATGAGGAGAAGAAGCC', 'AATTCCAGGAGCGAAGAAACTTCCAGGACCTGCAGTCAATCTATCGGAAATCCAGAATATTAAAAGTGAA', 'CTAAAATATGTCCCCAAAGCTGAACAGTAGTAGGAAGAAAAAAGGATTGATGTGAAGAAATAAAGAGGCA', 'GAAGATGGATTCAATAGCTCACTAAAATTTTATATATTTGTATGATGATTGTGAACCTCCTGAATGCCTG', 'AGACTCTAGCAGAAATGGCCTGTTTGTACATTTATATCTCTTCCTTCTAGTTGGCTGTATTTCTTACTTT', 'ATCTTCATTTTTGGCACCTCACAGAACAAATTAGCCCATAAATTCAACACCTGGAGGGTGTGGTTTTGAG', 'GAGGGATATGATTTTATGGAGAATGATATGGCAATGTGCCTAACGATTTTGATGAAAAGTTTCCCAAGCT', 'ACTTCCTACAGTATTTTGGTCAATATTTGGAATGCGTTTTAGTTCTTCACCTTTTAAATTATGTCACTAA', 'ACTTTGTATGAGTTCAAATAAATATTTGACTAAATGTAAAATGTGA', '>gi|10047091|ref|NM_013259.1| Homo sapiens neuronal protein (NP25), mRNA', 'TGTGCTGCTATTGTGTGGATGCCGCGCGTGTCTTCTCTTCTTTCCAGAGATGGCTAACAGGGGCCCGAGC', 'TATGGCTTAAGCCGAGAGGTGCAGGAGAAGATCGAGCAGAAGTATGATGCGGACCTGGAGAACAAGCTGG', 'TGGACTGGATCATCCTGCAGTGCGCCGAGGACATAGAGCACCCGCCCCCCGGCAGGGCCCATTTTCAGAA', 'ATGGTTAATGGACGGGACGGTCCTGTGCAAGCTGATAAATAGTTTATACCCACCAGGACAAGAGCCCATA', 'CCCAAGATCTCAGAGTCAAAGATGGCTTTTAAGCAGATGGAGCAAATCTCCCAGTTCCTAAAAGCTGCGG', 'AGACCTATGGTGTCAGAACCACCGACATCTTTCAGACGGTGGATCTATGGGAAGGGAAGGACATGGCAGC', 'TGTGCAGAGGACCCTGATGGCTTTAGGCAGCGTTGCAGTCACCAAGGATGATGGCTGCTATCGGGGAGAG', 'CCATCCTGGTTTCACAGGAAAGCCCAGCAGAATCGGAGAGGCTTTTCCGAGGAGCAGCTTCGCCAGGGAC', 'AGAACGTAATAGGCCTGCAGATGGGCAGCAACAAGGGAGCCTCCCAGGCGGGCATGACAGGGTACGGGAT', 'GCCCAGGCAGATCATGTTAGGACGCGGCATCCTGCCCCTGGTAGAGAGGACGAATGTTCCACACCATGGT']
If Lseq[i] is present in refseq[k], then I am
interested in printing starting from refseq[k] until
the element that starts with '>' sign.
my Lseq has NM_014332 element and this is also present in second list refseq. I want to print starting from element where NM_014332 is present until next element that starts with '>' sign.
In this case, it would be: '>gi|10047089|ref|NM_014332.1| Homo sapiens small muscle protein, X-linked (SMPX), mRNA', 'GTTCTCAATACCGGGAGAGGCACAGAGCTATTTCAGCCACATGAAAAGCATCGGAATTGAGATCGCAGCT', 'CAGAGGACACCGGGCGCCCCTTCCACCTTCCAAGGAGCTTTGTATTCTTGCATCTGGCTGCCTGGGACTT', 'CCCTTAGGCAGTAAACAAATACATAAAGCAGGGATAAGACTGCATGAATATGTCGAAACAGCCAGTTTCC', 'AATGTTAGAGCCATCCAGGCAAATATCAATATTCCAATGGGAGCCTTTCGGCCAGGAGCAGGTCAACCCC', 'CCAGAAGAAAAGAATGTACTCCTGAAGTGGAGGAGGGTGTTCCTCCCACCTCGGATGAGGAGAAGAAGCC', 'AATTCCAGGAGCGAAGAAACTTCCAGGACCTGCAGTCAATCTATCGGAAATCCAGAATATTAAAAGTGAA', 'CTAAAATATGTCCCCAAAGCTGAACAGTAGTAGGAAGAAAAAAGGATTGATGTGAAGAAATAAAGAGGCA', 'GAAGATGGATTCAATAGCTCACTAAAATTTTATATATTTGTATGATGATTGTGAACCTCCTGAATGCCTG', 'AGACTCTAGCAGAAATGGCCTGTTTGTACATTTATATCTCTTCCTTCTAGTTGGCTGTATTTCTTACTTT', 'ATCTTCATTTTTGGCACCTCACAGAACAAATTAGCCCATAAATTCAACACCTGGAGGGTGTGGTTTTGAG', 'GAGGGATATGATTTTATGGAGAATGATATGGCAATGTGCCTAACGATTTTGATGAAAAGTTTCCCAAGCT', 'ACTTCCTACAGTATTTTGGTCAATATTTGGAATGCGTTTTAGTTCTTCACCTTTTAAATTATGTCACTAA', 'ACTTTGTATGAGTTCAAATAAATATTTGACTAAATGTAAAATGTGA'
I could not think of any smart way to do this, although I have tried like this:
for ele1 in Lseq:
    for ele2 in refseq:
        if ele1 in ele2:
            k = ele2
            s = refseq[ele2].startswith('>')
            print k,s

Traceback (most recent call last):
  File "<pyshell#261>", line 5, in -toplevel-
    s = refseq[ele2].startswith('>')
TypeError: list indices must be integers
I do not know how to dictate to python to select lines
between two > symbols.
Could any one help me thanks.
K
_______________________________________________
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor