On Thu, 16 Dec 2004, Scott Melnyk wrote:
> I recently suffered a loss of programming files (and I had been > putting off my backups...) Hi Scott, [Side note that's not really related to Python: if you don't use a version control system to manage your software yet, please learn to use one. There's a very good one called Subversion: http://subversion.tigris.org/ A good version control system is invaluable to programmers, since many programs are not really written once, but are rather maintained and revised for a long time.] Rich Krauter already caught the bug that was occurring: the intersection() method of sets produces a brand new set, rather than do a mutation on the old set. Here are some more comments on the programs you've shown us: > A sample of the file format is: > > >ENSE00001384652.1|ENSG00000166157.5|ENST00000359693.1 > assembly=NCBI35|chr=21|strand=reverse|bases 10012801 to 10012624|exons > plus upstream and downstream regions for exon > GCGGCCGTTCAAGGCAGGGGGCGGGGCGTCTCCGAGCGGCGGGGCCAAGGGAGGGCACAACAGCTGCTACCTGAACAGTTTCTGACCCAACAGTTACCCAGCGCCGGACTCGCTGCGCCCCGGCGGCTCTAGGGACCCCCGGCGCCTACACTTAGCTCCGCGCCCGAGGTGAGCCCAG Ah, ok, so this is a FASTA file. (For others on the list, see: http://ngfnblast.gbf.de/docs/fasta.html for a description of the BLAST format.) > ENSExxxxxxx is an exon id tag followed by a ENSGxxxxxgene id tag then > a ENSTxxxxxxx transcript id tag followed by information about the > location of exon. [text cut] Ok, so it sounds like your program will mostly pay attention to each description line in the FASTA file. > In order to visually understand the data better I made a script to > organize it into sections with Gene ID, then Trancript ID followed by > the different Exon IDs like so: [lots of text cut] There's one big assumption in the code to OrganizeData.py that may need to be explicitely stated: the code appears to assume that the sequences are already sorted and grouped in order, since the code maintains a variable named 'sOldGene' and maintains some state on the last FASTA sequence that has been seen. Is your data already sorted? As a long term suggestion: you may find it easier to write the extracted data out as something that will be easy to work with for the next stages of your pipeline. Human readability is important too, of course, so there's a balance necessary between machine and numan convenience. If possible, you may want to make every record a single line, rather than have a record spread across several lines. Your program does do this in some part, but it also adds other output, like announcements to signal the start of new genes. That can complicates later stages of your analysis. Eric Raymond has summarized the rules-of-thumb that the Unix utitilies try to follow: http://www.faqs.org/docs/artu/ch05s02.html#id2907428 As a concrete counterexample, the 'BLAST' utility that bioinformaticians use has a default output that's very human readable, but so ad-hoc that programs that try to use BLAST output often have to resort to fragile, hand-crafted BLAST parsers. The situation's a lot better now, since newer versions of BLAST finally support a structured format. So I'd strongly recommend dropping the "NEW GENE", "end of gene group", and "NEW TRANSCRIPT" lines out of your output. And if you really want to keep them, you can write a separate program that adds those notes back into the output of OrganizeData.py for a human reader. If you drop those decorations out, then the other parts of your pipeline can be simplified, since you can assume that each line of the input file is data. The other stages of your analysis pipeline appear to try to ignore those lines anyway, so why put them in? *grin* _______________________________________________ Tutor maillist - [EMAIL PROTECTED] http://mail.python.org/mailman/listinfo/tutor