Hello again! Thank you everyone for all the help. The most important step in dealing with Blast results is getting rid of all the extraneous information. The standard result file gives you everything you could ask for when comparing DNA sequences, however at this step I am looking for specific data. Once I have my relevant matches I can go back and pull out the info for just those matches from the file.
Following Kent's advice I came up with the following (I should set up defs and a modular system for renewability, that will be next) import sys #my system complained when I had them on 1 line as sys, sets, re import sets #not sure why but that's another issue (2.4 on win XP) import re result=re.compile('^(hg17.+)5\'.+') #the import part of the line is in the ( ) query=re.compile('^.+(ENST\d+\.\d)+') #matches the line up to the important transcriptID #and groups the ID info TFILE = open(sys.argv[1], 'r' ) #file to read from WFILE=open(sys.argv[2], 'w') # file to write to results={} for line in TFILE: isQueryLine= query.match(line) isResultLine= result.match(line) if isQueryLine: current_query = isQueryLine.group(1) if isResultLine: key = isResultLine.group(1) results.setdefault(key, []).append(current_query) # see explanation below # Now go through results looking for entries with more than one query for key, queries in results.iteritems(): if len(queries) > 1: print >> WFILE print >> WFILE, key for query in queries: print >> WFILE, query I am almost there the program seemed to run well will minimal swapping and finished in 5 minutes. My output file is only 45 mb. A sample of the output file is: hg17_chainMm5_chr15 range=chr7:148238502-148239073 ENST00000339563.1 ENST00000342196.1 ENST00000339563.1 ENST00000344055.1 hg17_chainMm5_chr13 range=chr5:42927967-42928726 ENST00000279800.3 ENST00000309556.3 hg17_chainMm5_chr6 range=chr1:155548627-155549517 ENST00000321157.3 ENST00000256324.4 hg17_chainMm5_chr13 range=chr1:81386270-81386967 ENST00000011649.3 ENST00000348636.1 hg17_chainMm5_chr19 range=chr11:56050656-56051559 ENST00000341231.1 ENST00000341231.1 ENST00000331792.1 ENST00000341231.1 ENST00000341231.1 ENST00000331792.1 hg17_chainMm5_chr9 range=chr11:123561223-123562097 ENST00000341493.1 ENST00000318666.4 ENST00000341493.1 I can see where any of the chains appear more than once, which is good and I am looking for situations like first example where ENST00000339563.1 is the first and third on the list or the fifth example. Next step is to cut out the ENST lines that only show up once and wind up with just the places where there are matches at least twice to a given transcript (using the ENST00000...) ids. Like in the final example I only want the first and third so I know it is twice in that transcript. Back to it and other things. Thanks for all the help so far, Scott _______________________________________________ Tutor maillist - Tutor@python.org http://mail.python.org/mailman/listinfo/tutor