[Tutor] Re: sorting a 2 gb file- i shrunk it and turned it around

Scott Melnyk Fri, 28 Jan 2005 05:53:47 -0800

Hello again!

Thank you everyone for all the help.  The most important step in
dealing with BLAST results is getting rid of all the extraneous
information.  The standard result file gives you everything you could
ask for when comparing DNA sequences; at this step, however, I am
looking for specific data.  Once I have my relevant matches I can go
back and pull out the info for just those matches from the file.


Following Kent's advice, I came up with the code below.  (I should set
up defs and a modular structure for reusability; that will be next.)

import sys   #my system complained when I had them on 1 line as sys, sets, re
import sets  #not sure why but that's another issue (2.4 on win XP)
import re

result=re.compile(r"^(hg17.+)5'.+")     #the important part of the line is in the ( )
query=re.compile(r'^.+(ENST\d+\.\d)+')  #matches the line up to the important transcriptID
                                        #and groups the ID info

TFILE = open(sys.argv[1], 'r' )         #file to read from
WFILE=open(sys.argv[2], 'w')         # file to write to

results={}

for line in TFILE:
    isQueryLine= query.match(line)
    isResultLine= result.match(line)
    if isQueryLine:
        current_query = isQueryLine.group(1)
    if isResultLine:
        key = isResultLine.group(1)
        # start an empty list the first time a key is seen, then append to it
        results.setdefault(key, []).append(current_query)



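# Now go through results looking for entries with more than one query
for key, queries in results.iteritems():
  if len(queries) > 1:
    print >> WFILE                # blank line between entries
    print >> WFILE, key
    for transcript in queries:    # renamed so it does not shadow the compiled regex 'query'
      print >> WFILE, transcript

TFILE.close()
WFILE.close()   # make sure everything is flushed to disk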
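
For anyone puzzled by the setdefault() line in the loop above, here is a
minimal sketch of what it does on its own.  The keys and IDs are made-up
placeholders, not real data from my file.

```python
# Hypothetical (key, transcript) pairs standing in for the parsed lines.
pairs = [('chainA', 'ENST1'), ('chainB', 'ENST2'), ('chainA', 'ENST3')]

results = {}
for key, transcript in pairs:
    # setdefault returns the list already stored under key, or stores and
    # returns a new empty list if the key has not been seen yet.
    results.setdefault(key, []).append(transcript)

# results is now {'chainA': ['ENST1', 'ENST3'], 'chainB': ['ENST2']}
```

So each chain ends up with the list of every transcript that hit it, in
the order the lines were read.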

I am almost there.  The program seemed to run well with minimal swapping
and finished in 5 minutes.  My output file is only 45 MB.

A sample of the output file is:


hg17_chainMm5_chr15 range=chr7:148238502-148239073
ENST00000339563.1
ENST00000342196.1
ENST00000339563.1
ENST00000344055.1

hg17_chainMm5_chr13 range=chr5:42927967-42928726
ENST00000279800.3
ENST00000309556.3

hg17_chainMm5_chr6 range=chr1:155548627-155549517
ENST00000321157.3
ENST00000256324.4

hg17_chainMm5_chr13 range=chr1:81386270-81386967
ENST00000011649.3
ENST00000348636.1

hg17_chainMm5_chr19 range=chr11:56050656-56051559
ENST00000341231.1
ENST00000341231.1
ENST00000331792.1
ENST00000341231.1
ENST00000341231.1
ENST00000331792.1

hg17_chainMm5_chr9 range=chr11:123561223-123562097
ENST00000341493.1
ENST00000318666.4
ENST00000341493.1

I can see where any of the chains appear more than once, which is good.
I am looking for situations like the first example, where
ENST00000339563.1 is both the first and third entry on the list, or like
the fifth example.
The next step is to cut out the ENST lines that only show up once and
wind up with just the places where a chain matches a given transcript
(by its ENST00000... ID) at least twice.  In the final example, for
instance, I only want the first and third lines, which tell me the chain
hits that transcript twice.
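
A rough sketch of how that next step might look, assuming the results
dictionary is still in memory as key -> list of transcript IDs.  The
function names repeated_ids and filter_results are just ones I made up
for illustration, and the sample data is copied from the output above.

```python
def repeated_ids(queries):
    """Return the IDs that occur at least twice, in first-seen order."""
    counts = {}
    for q in queries:
        counts[q] = counts.get(q, 0) + 1
    repeats = []
    for q in queries:
        if counts[q] > 1 and q not in repeats:
            repeats.append(q)
    return repeats

def filter_results(results):
    """Keep only the chains that hit some transcript more than once."""
    filtered = {}
    for key in results:
        repeats = repeated_ids(results[key])
        if repeats:                      # drop chains with no repeated ID
            filtered[key] = repeats
    return filtered

sample = {
    'hg17_chainMm5_chr15 range=chr7:148238502-148239073': [
        'ENST00000339563.1', 'ENST00000342196.1',
        'ENST00000339563.1', 'ENST00000344055.1'],
    'hg17_chainMm5_chr13 range=chr5:42927967-42928726': [
        'ENST00000279800.3', 'ENST00000309556.3'],
}
filtered = filter_results(sample)
# Only the chr15 chain survives, with ENST00000339563.1 as its one repeated ID.
```

This lists each repeated ID once per chain; if the number of hits matters
it could return the counts instead.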

Back to it and other things.  Thanks for all the help so far,
Scott
_______________________________________________
Tutor maillist  -  Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
