Re: What strategy for random accession of records in massive FASTA file?
On 14 Jan 2005 12:30:57 -0800, Paul Rubin <http://[EMAIL PROTECTED]> wrote:

> Mmap lets you treat a disk file as an array, so you can randomly access the bytes in the file without having to do seek operations. Cool! Just say a[234]='x' and you've changed byte 234 of the file to the letter x.

However... however... suppose an element located more or less in the middle of the array occupies more space after changing it, say 2 bytes instead of 1. Will flush() need to rewrite half of the mmapped file just to add that one byte? flush() definitely makes updating less of an issue; I'm just curious about the cost of writing small changes scattered all over the place back to the large file.

--
I have come to kick ass, chew bubble gum and do the following:

from __future__ import py3k

And it doesn't work.
--
http://mail.python.org/mailman/listinfo/python-list
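For what it's worth, the behaviour being asked about is easy to check. A minimal sketch (modern Python 3 shown; the throwaway file is invented for the example) of in-place editing through mmap — note that a slice assignment must preserve length, so the "element grows to 2 bytes" case simply isn't allowed by mmap at all:

```python
import mmap
import os
import tempfile

# Make a throwaway file to edit in place.
fd, name = tempfile.mkstemp()
os.write(fd, b"hello world")
os.close(fd)

with open(name, "r+b") as f:
    mm = mmap.mmap(f.fileno(), 0)
    mm[0:1] = b"H"          # like a[234] = 'x': replace one byte in place
    grew = True
    try:
        mm[1:2] = b"ey"     # try to grow one byte into two...
    except IndexError:
        grew = False        # ...mmap refuses: the mapping is fixed-size
    mm.flush()              # write the dirty pages back to the file
    mm.close()

with open(name, "rb") as f:
    result = f.read()
os.remove(name)
print(result)
```

So flush() only writes back dirty pages; growing the file in the middle would need an explicit rewrite of everything after the insertion point, outside of mmap.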
Re: What strategy for random accession of records in massive FASTA file?
Jeff Shannon wrote:

> Chris Lasher wrote:
>
>>> And besides, for long-term archiving purposes, I'd expect that zip et al on a character-stream would provide significantly better compression than a 4:1 packed format, and that zipping the packed format wouldn't be all that much more efficient than zipping the character stream.
>>
>> This 105MB FASTA file is 8.3 MB gzip-ed. And a 4:1 packed-format file would be ~26MB. It'd be interesting to see how that packed-format file would compress, but I don't care enough to write a script to convert the FASTA file into a packed-format file to experiment with... ;)
>
> Short version, then, is that yes, size concerns (such as they may be) are outweighed by speed and conceptual simplicity (i.e. avoiding a huge mess of bit-masking every time a single base needs to be examined, or a human-(semi-)readable display is needed). (Plus, if this format might be used for RNA sequences as well as DNA sequences, you've got at least a fifth base to represent, which means you need at least three bits per base, which means only two bases per byte (or else base-encodings split across byte boundaries). That gets ugly real fast.)
>
> Jeff Shannon
> Technician/Programmer
> Credit International

Hello,

Just to clear up a few things on the topic: if the file denotes DNA sequences, there are five basic identifiers, AGCT and X (where X means 'dunno!'). If the file denotes RNA sequences, you will still only need five basic identifiers; the issue is that the T is replaced by a U.

One very good way I have found to parse large files of this nature (I've done it with many a use case) is to write a SAX-style parser for the file. You can then register a content handler, receive events from the parser, and do whatever you like with them. If you write the parser carefully, you stream the file and discard old lines from memory, so you have a scalable solution rather than keeping everything in memory.
As an aside, I would seriously consider parsing your files and putting this information in a small local db. It's really not much work to do, and the 'pure' Python thing is a misnomer: whichever persistence mechanism you use (file, DB, etching it on the floor with a small robot accepting Logo commands, etc.) is unlikely to be pure Python. The advantage of putting it in a DB will show up later, when you have fast and powerful retrieval capability.

Cheers,

Neil

--
Neil Benn
Senior Automation Engineer
Cenix BioScience
BioInnovations Zentrum
Tatzberg 47
D-01307 Dresden
Germany

Tel : +49 (0)351 4173 154
e-mail : [EMAIL PROTECTED]
Cenix Website : http://www.cenix-bioscience.com
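The event-driven, SAX-style streaming approach described above might look something like this (a sketch; the handler and function names are invented for the example, this is not the xml.sax API itself):

```python
import os
import tempfile

class FastaHandler:
    """Content handler: override these callbacks to do whatever you like."""
    def start_record(self, header): pass
    def sequence_line(self, line): pass
    def end_record(self): pass

def parse_fasta(path, handler):
    """Stream the file line by line, firing events; keeps nothing in memory."""
    in_record = False
    with open(path) as f:
        for line in f:                       # iterates lazily, never reads all
            line = line.rstrip("\n")
            if line.startswith(">"):
                if in_record:
                    handler.end_record()
                handler.start_record(line[1:])
                in_record = True
            elif line:
                handler.sequence_line(line)
        if in_record:
            handler.end_record()

class _Collector(FastaHandler):
    """Example handler that just accumulates records into a dict."""
    def __init__(self):
        self.records = {}
        self._name = None
    def start_record(self, header):
        self._name = header
        self.records[header] = ""
    def sequence_line(self, line):
        self.records[self._name] += line

# quick demo on a throwaway two-record file
fd, demo = tempfile.mkstemp()
os.write(fd, b">CW127_A01\nTGCAGTCGAA\nCGAGAACGGT\n>CW127_A02\nGGGG\n")
os.close(fd)
collector = _Collector()
parse_fasta(demo, collector)
os.remove(demo)
```

Because the parser only ever holds the current line, memory use stays flat no matter how large the file grows.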
Re: What strategy for random accession of records in massive FASTA file?
Bengt Richter wrote:

> On 12 Jan 2005 14:46:07 -0800, Chris Lasher <[EMAIL PROTECTED]> wrote:
[...]
> Others have probably solved your basic problem, or pointed the way. I'm just curious. Given that the information content is 2 bits per character that is taking up 8 bits of storage, there must be a good reason for storing and/or transmitting them this way? I.e., it is easy to think up a count-prefixed compressed format packing 4:1 in subsequent data bytes (except for the last byte, which may have fewer than four 2-bit codes). I'm wondering how the data is actually used once records are retrieved. (But I'm too lazy to explore the biopython.org link.)

Revealingly honest.

Of course, adopting an encoding that only used two bits per base would make it impossible to use the re module to search for patterns in them, for example. So the work of continuously translating between representations might militate against more efficient representations. Or, of course, it might not :-)

it's-only-storage-ly y'rs - steve

--
Steve Holden               http://www.holdenweb.com/
Python Web Programming    http://pydish.holdenweb.com/
Holden Web LLC      +1 703 861 4237  +1 800 494 3119
Re: What strategy for random accession of records in massive FASTA file?
Jeff Shannon wrote:

> Chris Lasher wrote:
>
>>> And besides, for long-term archiving purposes, I'd expect that zip et al on a character-stream would provide significantly better compression than a 4:1 packed format, and that zipping the packed format wouldn't be all that much more efficient than zipping the character stream.
>>
>> This 105MB FASTA file is 8.3 MB gzip-ed. And a 4:1 packed-format file would be ~26MB. It'd be interesting to see how that packed-format file would compress, but I don't care enough to write a script to convert the FASTA file into a packed-format file to experiment with... ;)

If your compression algorithm's any good, then both, when compressed, should be approximately equal in size, since the size should be determined by the information content rather than the representation.

> Short version, then, is that yes, size concerns (such as they may be) are outweighed by speed and conceptual simplicity (i.e. avoiding a huge mess of bit-masking every time a single base needs to be examined, or a human-(semi-)readable display is needed). (Plus, if this format might be used for RNA sequences as well as DNA sequences, you've got at least a fifth base to represent, which means you need at least three bits per base, which means only two bases per byte (or else base-encodings split across byte boundaries). That gets ugly real fast.)

Right!

regards
Steve

--
Steve Holden               http://www.holdenweb.com/
Python Web Programming    http://pydish.holdenweb.com/
Holden Web LLC      +1 703 861 4237  +1 800 494 3119
Re: What strategy for random accession of records in massive FASTA file?
In article [EMAIL PROTECTED], Chris Lasher <[EMAIL PROTECTED]> wrote:

> Hello,
> I have a rather large (100+ MB) FASTA file from which I need to access records in a random order. The FASTA format is a standard format for storing molecular biological sequences. Each record contains a header line for describing the sequence that begins with a '>' (right-angle bracket) followed by lines that contain the actual sequence data. Three example FASTA records are below:
>
> >CW127_A01
> TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
> TATAGTTAATCTGCCCTTTAGAGATAACAGTTGGAAACGACTGCTAATAATA
> GCATTAAACAT
> >CW127_A02
> TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
> TATAGTTAATCTGCCCTTTAGAGATAACAGTTGGAAACGACTGCTAATAATA
> GCATTAAACATTCCGCCTAGTACGGTCGCAAGATTCTCAAAGGAATAGACGG
> >CW127_A03
> TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
> TATAGTTAATCTGCCCTTTAGAGATAACAGTTGGAAACGACTGCTAATAATA
> GCATTAAACATTCCGCCTGGG
> ...
>
> Since the file I'm working with contains tens of thousands of these records, I believe I need to find a way to hash this file such that I can retrieve the respective sequence more quickly than I could by parsing through the file request-by-request.

First, before embarking on any major project, take a look at http://www.biopython.org/ to at least familiarize yourself with what other people have done in the field.

The easiest thing I think would be to use the gdbm module. You can write a simple parser to parse the FASTA file (or, I would imagine, find one already written on biopython), and then store the data in a gdbm map, using the tag lines as the keys and the sequences as the values. Even for a Python neophyte, this should be a pretty simple project. The most complex part might be getting the gdbm module built if your copy of Python doesn't already have it, but gdbm is so convenient, it's worth the effort.
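The gdbm idea above, sketched in today's terms with Python 3's generic dbm module (which prefers dbm.gnu where it is available; the file name and truncated records here are invented for the example):

```python
# Sketch of the "tag lines as keys, sequences as values" approach.
import dbm
import os
import tempfile

# stand-in for the output of a real FASTA parser
records = {
    "CW127_A01": "TGCAGTCGAACGAGAACGGT",
    "CW127_A02": "TGCAGTCGAACGAGAACGGTCCTT",
}

path = os.path.join(tempfile.mkdtemp(), "sequences.db")

# build phase: run the parser once and store every record
with dbm.open(path, "c") as db:
    for tag, sequence in records.items():
        db[tag] = sequence

# lookup phase: random access by tag, no re-parsing of the FASTA file
with dbm.open(path, "r") as db:
    seq = db["CW127_A02"].decode()   # dbm stores and returns bytes
```

The dbm file persists between runs, so the (one-off) parsing cost is paid only when the sequences are updated.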
Re: What strategy for random accession of records in massive FASTA file?
> Before you get too carried away, how often do you want to do this and how grunty is the box you will be running on?

Oops, I should have specified this. The script will only need to be run once every three or four months, when the sequences are updated. I'll be running it on boxes that are 3GHz/100GB RAM, but others may not be so fortunate, and so I'd like to keep that in mind.

> BTW, you need to clarify "don't have access to an RDBMS" ... surely this can only be due to someone stopping them from installing good free software freely available on the Internet.

I understand your and others' sentiment on this. I agree, the open-source database systems are wonderful. However, keeping it Python-only saves me hassle by only having to assist in instances where others need help downloading and installing Python. I suppose if I keep it in Python, I can even use py2exe to generate an executable that wouldn't even require them to install Python. A solution using interaction with a database is much sexier but, for the purposes of the script, seems unnecessary. However, I certainly appreciate the suggestions.

> My guess is that you don't need anything much fancier than the effbot's index method -- which by now you have probably found works straight out of the box and is more than fast enough for your needs.

I think Mr. Lundh's code will work very well for these purposes. Thanks very much to him for posting it. You'll have full credit for that part of the code.

Thanks very much to all who replied!
Chris
Re: What strategy for random accession of records in massive FASTA file?
Chris Lasher wrote:

>> Given that the information content is 2 bits per character that is taking up 8 bits of storage, there must be a good reason for storing and/or transmitting them this way? I.e., it is easy to think up a count-prefixed compressed format packing 4:1 in subsequent data bytes (except for the last byte, which may have fewer than four 2-bit codes).
>
> My guess for the inefficiency in storage size is because it is human-readable, and because most in-silico molecular biology is just a bunch of fancy string algorithms. This is my limited view of these things, at least.

Yeah, that pretty much matches my guess (not that I'm involved in anything related to computational molecular biology or genetics). Given the current technology, the cost of the extra storage size is presumably lower than the cost of translating into/out of a packed format. Heck, hard drives cost less than $1/GB now.

And besides, for long-term archiving purposes, I'd expect that zip et al on a character-stream would provide significantly better compression than a 4:1 packed format, and that zipping the packed format wouldn't be all that much more efficient than zipping the character stream.

Jeff Shannon
Technician/Programmer
Credit International
Re: What strategy for random accession of records in massive FASTA file?
Chris Lasher wrote:

>> And besides, for long-term archiving purposes, I'd expect that zip et al on a character-stream would provide significantly better compression than a 4:1 packed format, and that zipping the packed format wouldn't be all that much more efficient than zipping the character stream.
>
> This 105MB FASTA file is 8.3 MB gzip-ed. And a 4:1 packed-format file would be ~26MB. It'd be interesting to see how that packed-format file would compress, but I don't care enough to write a script to convert the FASTA file into a packed-format file to experiment with... ;)

Short version, then, is that yes, size concerns (such as they may be) are outweighed by speed and conceptual simplicity (i.e. avoiding a huge mess of bit-masking every time a single base needs to be examined, or a human-(semi-)readable display is needed). (Plus, if this format might be used for RNA sequences as well as DNA sequences, you've got at least a fifth base to represent, which means you need at least three bits per base, which means only two bases per byte (or else base-encodings split across byte boundaries). That gets ugly real fast.)

Jeff Shannon
Technician/Programmer
Credit International
Re: What strategy for random accession of records in massive FASTA file?
Jeff Shannon wrote:

> (Plus, if this format might be used for RNA sequences as well as DNA sequences, you've got at least a fifth base to represent, which means you need at least three bits per base, which means only two bases per byte (or else base-encodings split across byte boundaries). That gets ugly real fast.)

Not to mention all the IUPAC symbols for incompletely specified bases (e.g. R = A or G).

http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html

--
Robert Kern
[EMAIL PROTECTED]

"In the fields of hell where the grass grows high
 Are the graves of dreams allowed to die."
  -- Richard Harter
Re: What strategy for random accession of records in massive FASTA file?
Chris Lasher wrote:

> Since the file I'm working with contains tens of thousands of these records, I believe I need to find a way to hash this file such that I can retrieve the respective sequence more quickly than I could by parsing through the file request-by-request. However, I'm very new to Python and am still very low on the learning curve for programming and algorithms in general; while I'm certain there are ubiquitous algorithms for this type of problem, I don't know what they are or where to look for them. So I turn to the gurus and accost you for help once again. :-) If you could help me figure out how to code a solution that won't be a resource whore, I'd be _very_ grateful. (I'd prefer to keep it in Python only, even though I know interaction with a relational database would provide the fastest method--the group I'm trying to write this for does not have access to a RDBMS.)

keeping an index in memory might be reasonable. the following class creates an index file by scanning the FASTA file, and uses the marshal module to save it to disk. if the index file already exists, it's used as is. to regenerate the index, just remove the index file, and run the program again.

    import os, marshal

    class FASTA:

        def __init__(self, file):
            self.file = open(file)
            self.checkindex()

        def __getitem__(self, key):
            try:
                pos = self.index[key]
            except KeyError:
                raise IndexError("no such item")
            else:
                f = self.file
                f.seek(pos)
                header = f.readline()
                assert header == ">" + key + "\n"
                data = []
                while 1:
                    line = f.readline()
                    if not line or line[0] == ">":
                        break
                    data.append(line)
                return data

        def checkindex(self):
            indexfile = self.file.name + ".index"
            try:
                self.index = marshal.load(open(indexfile, "rb"))
            except IOError:
                print "building index..."
                index = {}
                # scan the file
                f = self.file
                f.seek(0)
                while 1:
                    pos = f.tell()
                    line = f.readline()
                    if not line:
                        break
                    if line[0] == ">":
                        # save offset to header line
                        header = line[1:].strip()
                        index[header] = pos
                # save index to disk
                f = open(indexfile, "wb")
                marshal.dump(index, f)
                f.close()
                self.index = index

    db = FASTA("myfastafile.dat")
    print db["CW127_A02"]
    ['TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAG...

tweak as necessary.

</F>
Re: What strategy for random accession of records in massive FASTA file?
Don't fight it, lite it! You should parse the fasta and put it into a database: http://www.sqlite.org/index.html Then index by name and it will be superfast.

James
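For reference, the SQLite route might be sketched like this with the sqlite3 module that now ships in Python's standard library (in 2005 this meant installing the pysqlite extension; the table layout and records are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # use a file path to persist between runs
conn.execute("CREATE TABLE seq (name TEXT PRIMARY KEY, sequence TEXT)")
conn.executemany(
    "INSERT INTO seq VALUES (?, ?)",
    [("CW127_A01", "TGCAGTCGAACGAGAACGGT"),
     ("CW127_A02", "TGCAGTCGAACGAGAACGGTCCTT")],
)
conn.commit()

# the PRIMARY KEY is backed by an index, so lookup by name is fast
row = conn.execute(
    "SELECT sequence FROM seq WHERE name = ?", ("CW127_A02",)
).fetchone()
```

An SQLite database is a single ordinary file, so this stays as portable as the flat-file approaches while giving real indexed queries.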
What strategy for random accession of records in massive FASTA file?
Hello,

I have a rather large (100+ MB) FASTA file from which I need to access records in a random order. The FASTA format is a standard format for storing molecular biological sequences. Each record contains a header line for describing the sequence that begins with a '>' (right-angle bracket) followed by lines that contain the actual sequence data. Three example FASTA records are below:

>CW127_A01
TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
TATAGTTAATCTGCCCTTTAGAGATAACAGTTGGAAACGACTGCTAATAATA
GCATTAAACAT
>CW127_A02
TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
TATAGTTAATCTGCCCTTTAGAGATAACAGTTGGAAACGACTGCTAATAATA
GCATTAAACATTCCGCCTAGTACGGTCGCAAGATTCTCAAAGGAATAGACGG
>CW127_A03
TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
TATAGTTAATCTGCCCTTTAGAGATAACAGTTGGAAACGACTGCTAATAATA
GCATTAAACATTCCGCCTGGG
...

Since the file I'm working with contains tens of thousands of these records, I believe I need to find a way to hash this file such that I can retrieve the respective sequence more quickly than I could by parsing through the file request-by-request. However, I'm very new to Python and am still very low on the learning curve for programming and algorithms in general; while I'm certain there are ubiquitous algorithms for this type of problem, I don't know what they are or where to look for them. So I turn to the gurus and accost you for help once again. :-) If you could help me figure out how to code a solution that won't be a resource whore, I'd be _very_ grateful. (I'd prefer to keep it in Python only, even though I know interaction with a relational database would provide the fastest method--the group I'm trying to write this for does not have access to an RDBMS.)

Thanks very much in advance,
Chris
Re: What strategy for random accession of records in massive FASTA file?
> If you could help me figure out how to code a solution that won't be a resource whore, I'd be _very_ grateful. (I'd prefer to keep it in Python only, even though I know interaction with a relational database would provide the fastest method--the group I'm trying to write this for does not have access to a RDBMS.)

You don't need an RDBMS; I'd put it in a DBM or CDB myself.
Re: What strategy for random accession of records in massive FASTA file?
You don't say how this will be used, but here goes:

1) Read the records and put them into a dictionary with the key being the sequence name (from the header) and the data being the sequence data. Use shelve to store the dictionary for subsequent runs (if load time is excessive).

2) Take a look at Gadfly (gadfly.sourceforge.net). It provides you with a Python SQL-like database and may be a better solution if the data is basically static and you do lots of processing.

All depends on how you use the data.

Regards,
Larry Bates
Syscon, Inc.

Chris Lasher wrote:

> Hello,
> I have a rather large (100+ MB) FASTA file from which I need to access records in a random order. The FASTA format is a standard format for storing molecular biological sequences. Each record contains a header line for describing the sequence that begins with a '>' (right-angle bracket) followed by lines that contain the actual sequence data. Three example FASTA records are below:
>
> >CW127_A01
> TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
> TATAGTTAATCTGCCCTTTAGAGATAACAGTTGGAAACGACTGCTAATAATA
> GCATTAAACAT
> >CW127_A02
> TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
> TATAGTTAATCTGCCCTTTAGAGATAACAGTTGGAAACGACTGCTAATAATA
> GCATTAAACATTCCGCCTAGTACGGTCGCAAGATTCTCAAAGGAATAGACGG
> >CW127_A03
> TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
> TATAGTTAATCTGCCCTTTAGAGATAACAGTTGGAAACGACTGCTAATAATA
> GCATTAAACATTCCGCCTGGG
> ...
>
> Since the file I'm working with contains tens of thousands of these records, I believe I need to find a way to hash this file such that I can retrieve the respective sequence more quickly than I could by parsing through the file request-by-request. However, I'm very new to Python and am still very low on the learning curve for programming and algorithms in general; while I'm certain there are ubiquitous algorithms for this type of problem, I don't know what they are or where to look for them. So I turn to the gurus and accost you for help once again. :-) If you could help me figure out how to code a solution that won't be a resource whore, I'd be _very_ grateful. (I'd prefer to keep it in Python only, even though I know interaction with a relational database would provide the fastest method--the group I'm trying to write this for does not have access to a RDBMS.)
>
> Thanks very much in advance,
> Chris
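Option 1 above might be sketched as follows (modern Python 3 shown; the shelf path and truncated records are invented for the example):

```python
# A shelve-backed dictionary: built once, reopened cheaply on later runs.
import os
import shelve
import tempfile

path = os.path.join(tempfile.mkdtemp(), "fasta_shelf")

# first run: parse the FASTA file and fill the shelf
with shelve.open(path) as db:
    db["CW127_A01"] = "TGCAGTCGAACGAGAACGGT"
    db["CW127_A02"] = "TGCAGTCGAACGAGAACGGTCCTT"

# subsequent runs: no parsing, just open the shelf and look up by name
with shelve.open(path) as db:
    seq = db["CW127_A01"]
```

shelve pickles the values into a dbm file under the hood, so only the records actually requested are loaded into memory.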
Re: What strategy for random accession of records in massive FASTA file?
"Batista, Facundo" <[EMAIL PROTECTED]> wrote in message news:[EMAIL PROTECTED]

> If you want to keep the memory usage low, you can parse the file once and store in a list the byte position where each record starts and ends. Then access the list randomly and read each record with seek() and read().

Or if you want to access by sequence name rather than number, use a dict instead.

tjr
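The offset-index idea above can be sketched in a few lines (a toy version; the file contents and helper name are invented for the example):

```python
import os
import tempfile

# a tiny FASTA file to index
path = os.path.join(tempfile.mkdtemp(), "toy.fasta")
with open(path, "w") as f:
    f.write(">CW127_A01\nTGCAGTCGAACGAGAACGGT\n"
            ">CW127_A02\nTGCAGTCGAACGAGAACGGTCCTT\n")

# one pass: map each record name to the byte offset of its header line
offsets = {}
with open(path) as f:
    while True:
        pos = f.tell()
        line = f.readline()
        if not line:
            break
        if line.startswith(">"):
            offsets[line[1:].strip()] = pos

def get_sequence(path, name):
    """Random access: seek() to the stored offset, read to the next header."""
    with open(path) as f:
        f.seek(offsets[name])
        f.readline()                 # skip the header line itself
        parts = []
        for line in f:
            if line.startswith(">"):
                break
            parts.append(line.strip())
        return "".join(parts)
```

The dict holds only names and integers, so even a file with hundreds of thousands of records costs a few megabytes of memory at most.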
Re: What strategy for random accession of records in massive FASTA file?
In article [EMAIL PROTECTED], Chris Lasher wrote:

> Hello,
> I have a rather large (100+ MB) FASTA file from which I need to access records in a random order. The FASTA format is a standard format for storing molecular biological sequences. Each record contains a header line for describing the sequence that begins with a '>' (right-angle bracket) followed by lines that contain the actual sequence data. Three example FASTA records are below:

Use biopython. They have dictionary-style classes which wrap FASTA files using indexes.

http://www.biopython.org

Dave
Re: What strategy for random accession of records in massive FASTA file?
Chris Lasher wrote:

> Hello,
> I have a rather large (100+ MB) FASTA file from which I need to access records in a random order. The FASTA format is a standard format for storing molecular biological sequences. Each record contains a header line for describing the sequence that begins with a '>' (right-angle bracket) followed by lines that contain the actual sequence data. Three example FASTA records are below:
>
> >CW127_A01
> TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
> TATAGTTAATCTGCCCTTTAGAGATAACAGTTGGAAACGACTGCTAATAATA
> GCATTAAACAT
[snip]
> Since the file I'm working with contains tens of thousands of these records, I believe I need to find a way to hash this file such that I can retrieve the respective sequence more quickly than I could by parsing through the file request-by-request. However, I'm very new to Python and am still very low on the learning curve for programming and algorithms in general; while I'm certain there are ubiquitous algorithms for this type of problem, I don't know what they are or where to look for them. So I turn to the gurus and accost you for help once again. :-) If you could help me figure out how to code a solution that won't be a resource whore, I'd be _very_ grateful. (I'd prefer to keep it in Python only, even though I know interaction with a relational database would provide the fastest method--the group I'm trying to write this for does not have access to a RDBMS.)
>
> Thanks very much in advance,
> Chris

Before you get too carried away, how often do you want to do this and how grunty is the box you will be running on? Will the data be on a server? If the server is on a WAN or at the other end of a radio link between buildings, you definitely need an index so that you can access the data randomly!

By way of example, to read all of a 157MB file into memory from a local (i.e. not networked) disk using readlines() takes less than 4 seconds on a 1.4GHz Athlon processor (see below). The average new corporate desktop box is about twice as fast as that. Note that Windows Task Manager showed 100% CPU utilisation for both read() and readlines().

My guess is that you don't need anything much fancier than the effbot's index method -- which by now you have probably found works straight out of the box and is more than fast enough for your needs.

BTW, you need to clarify "don't have access to an RDBMS" ... surely this can only be due to someone stopping them from installing good free software freely available on the Internet.

HTH,
John

C:\junk>python -m timeit -n 1 -r 6 "print len(file('bigfile.csv').read())"
157581595
157581595
157581595
157581595
157581595
157581595
1 loops, best of 6: 3.3e+006 usec per loop

C:\junk>python -m timeit -n 1 -r 6 "print len(file('bigfile.csv').readlines())"
1118870
1118870
1118870
1118870
1118870
1118870
1 loops, best of 6: 3.57e+006 usec per loop
Re: What strategy for random accession of records in massive FASTA file?
On 12 Jan 2005 14:46:07 -0800, Chris Lasher <[EMAIL PROTECTED]> wrote:

> Hello,
> I have a rather large (100+ MB) FASTA file from which I need to access records in a random order. The FASTA format is a standard format for storing molecular biological sequences. Each record contains a header line for describing the sequence that begins with a '>' (right-angle bracket) followed by lines that contain the actual sequence data. Three example FASTA records are below:

Others have probably solved your basic problem, or pointed the way. I'm just curious. Given that the information content is 2 bits per character that is taking up 8 bits of storage, there must be a good reason for storing and/or transmitting them this way? I.e., it is easy to think up a count-prefixed compressed format packing 4:1 in subsequent data bytes (except for the last byte, which may have fewer than four 2-bit codes). I'm wondering how the data is actually used once records are retrieved. (But I'm too lazy to explore the biopython.org link.)

> >CW127_A01
> TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
> TATAGTTAATCTGCCCTTTAGAGATAACAGTTGGAAACGACTGCTAATAATA
> GCATTAAACAT
> >CW127_A02
> TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
> TATAGTTAATCTGCCCTTTAGAGATAACAGTTGGAAACGACTGCTAATAATA
> GCATTAAACATTCCGCCTAGTACGGTCGCAAGATTCTCAAAGGAATAGACGG
> >CW127_A03
> TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
> TATAGTTAATCTGCCCTTTAGAGATAACAGTTGGAAACGACTGCTAATAATA
> GCATTAAACATTCCGCCTGGG
> ...

Regards,
Bengt Richter
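The count-prefixed 4:1 packing described in the thread is easy to sketch (a toy version, ACGT only, ignoring the X/N and IUPAC ambiguity codes mentioned elsewhere in the thread; the length prefix makes a short final byte unambiguous):

```python
# 2 bits per base: pack 4 bases into each byte, keep an explicit length.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASE = "ACGT"

def pack(seq):
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        b = 0
        for ch in chunk:
            b = (b << 2) | CODE[ch]
        b <<= 2 * (4 - len(chunk))        # pad a short final byte with zeros
        out.append(b)
    return len(seq), bytes(out)           # count prefix + packed bytes

def unpack(n, data):
    bases = []
    for b in data:
        for shift in (6, 4, 2, 0):        # four 2-bit codes per byte
            bases.append(BASE[(b >> shift) & 3])
    return "".join(bases[:n])             # the count drops the padding

n, packed = pack("TGCAGTCGAA")            # 10 bases -> 3 bytes
```

As the thread notes, the 4:1 saving mostly evaporates after gzip, and the packed form can no longer be searched with the re module, which is a large part of why nobody bothers.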