Re: What strategy for random accession of records in massive FASTA file?

2005-01-15 Thread Bulba!
On 14 Jan 2005 12:30:57 -0800, Paul Rubin
http://[EMAIL PROTECTED] wrote:

Mmap lets you treat a disk file as an array, so you can randomly
access the bytes in the file without having to do seek operations

Cool!

Just say a[234]='x' and you've changed byte 234 of the file to the
letter x.  

However... however... suppose an element located more or less
in the middle of the array occupies more space after changing it,
say 2 bytes instead of 1. Will flush() need to rewrite half of the
mmapped file just to add that one byte?

flush() definitely makes updating less of an issue; I'm just
curious about the cost of writing small changes scattered all
over the place back to a large file.
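[For what it's worth, a mmapped region has a fixed length: a slice assignment must be the same size as the slice it replaces, so you can overwrite bytes in place but never grow an element, and flush() only writes back the dirty pages. A small sketch in today's Python; the temporary file and its contents are made up for illustration:]

```python
import mmap, os, tempfile

# Create a small file to demonstrate that mmap edits replace bytes
# in place; the mapping has a fixed length, so you cannot grow an
# element from one byte to two -- only overwrite existing bytes.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"0123456789")

with open(path, "r+b") as f:
    m = mmap.mmap(f.fileno(), 0)
    m[3:4] = b"x"        # overwrite byte 3 in place, same size
    m.flush()            # write the dirty page(s) back to disk
    m.close()

with open(path, "rb") as f:
    contents = f.read()
print(contents)          # b'012x456789' -- same length as before
os.remove(path)
```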



--
I have come to kick ass, chew bubble gum and do the following:

from __future__ import py3k

And it doesn't work.
--
http://mail.python.org/mailman/listinfo/python-list


Re: What strategy for random accession of records in massive FASTA file?

2005-01-14 Thread Neil Benn
Jeff Shannon wrote:
Chris Lasher wrote:
And besides, for long-term archiving purposes, I'd expect that zip et
al on a character-stream would provide significantly better
compression than a 4:1 packed format, and that zipping the packed
format wouldn't be all that much more efficient than zipping the
character stream.

This 105MB FASTA file is 8.3 MB gzip-ed.

And a 4:1 packed-format file would be ~26MB.  It'd be interesting to 
see how that packed-format file would compress, but I don't care 
enough to write a script to convert the FASTA file into a 
packed-format file to experiment with... ;)

Short version, then, is that yes, size concerns (such as they may be) 
are outweighed by speed and conceptual simplicity (i.e. avoiding a 
huge mess of bit-masking every time a single base needs to be 
examined, or a human-(semi-)readable display is needed).

(Plus, if this format might be used for RNA sequences as well as DNA 
sequences, you've got at least a fifth base to represent, which means 
you need at least three bits per base, which means only two bases per 
byte (or else base-encodings split across byte-boundaries). That
gets ugly real fast.)

Jeff Shannon
Technician/Programmer
Credit International
Hello,
Just to clear up a few things on the topic :
   If the file denotes DNA sequences, there are five basic identifiers:
A, G, C, T and X (where X means 'dunno!').
   If the file denotes RNA sequences, you will still only need five
basic identifiers; the issue is that the T is replaced by a U.

   One very good way I have found to parse large files of this nature
(I've done it with many a use case) is to write a SAX-style parser for
the file: you register a content handler, receive events from the
parser, and do whatever you like with them. Basically, by using the
SAX framework to read the files - if you write the parser carefully -
you stream the files and discard old lines from memory, therefore
you have a scalable solution (rather than keeping everything in memory).
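[The event-style streaming idea above can be sketched in a few lines. This is an illustration, not code from any parsing library; parse_fasta and handle_record are made-up names, and the handler sees one complete record at a time so memory use stays flat regardless of file size:]

```python
from io import StringIO

def parse_fasta(stream, handle_record):
    # Fire handle_record(name, sequence) once per FASTA record,
    # keeping only the current record in memory.
    header, chunks = None, []
    for line in stream:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                handle_record(header, "".join(chunks))
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        handle_record(header, "".join(chunks))

records = {}
parse_fasta(StringIO(">A\nACGT\nTTAA\n>B\nGGCC\n"),
            lambda name, seq: records.__setitem__(name, seq))
print(records)  # {'A': 'ACGTTTAA', 'B': 'GGCC'}
```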

   As an aside, I would seriously consider parsing your files and 
putting this information in a small local db - it's really not much work 
to do and the 'pure' python thing is a misnomer, whichever persistence 
mechanism you use (file,DB,etching it on the floor with a small robot 
accepting logo commands,etc) is unlikely to be pure python.

   The advantage of putting it in a DB will show up later when you have 
fast and powerful retrieval capability.

Cheers,
Neil
--
Neil Benn
Senior Automation Engineer
Cenix BioScience
BioInnovations Zentrum
Tatzberg 47
D-01307
Dresden
Germany
Tel : +49 (0)351 4173 154
e-mail : [EMAIL PROTECTED]
Cenix Website : http://www.cenix-bioscience.com


Re: What strategy for random accession of records in massive FASTA file?

2005-01-14 Thread Steve Holden
Bengt Richter wrote:
On 12 Jan 2005 14:46:07 -0800, Chris Lasher [EMAIL PROTECTED] wrote:
[...]
Others have probably solved your basic problem, or pointed
the way. I'm just curious.
Given that the information content is 2 bits per character
that is taking up 8 bits of storage, there must be a good reason
for storing and/or transmitting them this way? I.e., it is easy
to think up a count-prefixed compressed format packing 4:1 in
subsequent data bytes (except for the last byte, which may have
fewer than 4 2-bit codes).
I'm wondering how the data is actually used once records are
retrieved. (but I'm too lazy to explore the biopython.org link).
Revealingly honest.
Of course, adopting an encoding that only used two bits per base would 
make it impossible to use the re module to search for patterns in them, 
for example. So the work of continuously translating between 
representations might militate against more efficient representations. 
Or, of course, it might not :-)

it's-only-storage-ly y'rs  - steve
--
Steve Holden   http://www.holdenweb.com/
Python Web Programming  http://pydish.holdenweb.com/
Holden Web LLC  +1 703 861 4237  +1 800 494 3119


Re: What strategy for random accession of records in massive FASTA file?

2005-01-14 Thread Steve Holden
Jeff Shannon wrote:
Chris Lasher wrote:
And besides, for long-term archiving purposes, I'd expect that zip et
al on a character-stream would provide significantly better
compression than a 4:1 packed format, and that zipping the packed
format wouldn't be all that much more efficient than zipping the
character stream.

This 105MB FASTA file is 8.3 MB gzip-ed.

And a 4:1 packed-format file would be ~26MB.  It'd be interesting to see 
how that packed-format file would compress, but I don't care enough to 
write a script to convert the FASTA file into a packed-format file to 
experiment with... ;)

If your compression algorithm's any good then both, when compressed, 
should be approximately equal in size, since the size should be 
determined by the information content rather than the representation.

Short version, then, is that yes, size concerns (such as they may be) 
are outweighed by speed and conceptual simplicity (i.e. avoiding a huge 
mess of bit-masking every time a single base needs to be examined, or a 
human-(semi-)readable display is needed).

(Plus, if this format might be used for RNA sequences as well as DNA 
sequences, you've got at least a fifth base to represent, which means 
you need at least three bits per base, which means only two bases per 
byte (or else base-encodings split across byte-boundaries). That gets
ugly real fast.)

Right!
regards
 Steve
--
Steve Holden   http://www.holdenweb.com/
Python Web Programming  http://pydish.holdenweb.com/
Holden Web LLC  +1 703 861 4237  +1 800 494 3119


Re: What strategy for random accession of records in massive FASTA file?

2005-01-14 Thread Roy Smith
In article [EMAIL PROTECTED],
 Chris Lasher [EMAIL PROTECTED] wrote:

 Hello,
 I have a rather large (100+ MB) FASTA file from which I need to
 access records in a random order. The FASTA format is a standard format
 for storing molecular biological sequences. Each record contains a
 header line for describing the sequence that begins with a '>'
 (right-angle bracket) followed by lines that contain the actual
 sequence data. Three example FASTA records are below:
 
 >CW127_A01
 TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
 TATAGTTAATCTGCCCTTTAGAGATAACAGTTGGAAACGACTGCTAATAATA
 GCATTAAACAT
 >CW127_A02
 TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
 TATAGTTAATCTGCCCTTTAGAGATAACAGTTGGAAACGACTGCTAATAATA
 GCATTAAACATTCCGCCTAGTACGGTCGCAAGATTCTCAAAGGAATAGACGG
 >CW127_A03
 TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
 TATAGTTAATCTGCCCTTTAGAGATAACAGTTGGAAACGACTGCTAATAATA
 GCATTAAACATTCCGCCTGGG
 ...
 
 Since the file I'm working with contains tens of thousands of these
 records, I believe I need to find a way to hash this file such that I
 can retrieve the respective sequence more quickly than I could by
 parsing through the file request-by-request.

First, before embarking on any major project, take a look at 
http://www.biopython.org/ to at least familiarize yourself with what 
other people have done in the field.

The easiest thing I think would be to use the gdbm module.  You can 
write a simple parser to parse the FASTA file (or, I would imagine, find 
one already written on biopython), and then store the data in a gdbm 
map, using the tag lines as the keys and the sequences as the values.  

Even for a Python neophyte, this should be a pretty simple project.  The 
most complex part might be getting the gdbm module built if your copy of 
Python doesn't already have it, but gdbm is so convenient that it's worth 
the effort.
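[The gdbm map described above looks roughly like this in today's stdlib terms. The sketch uses the generic dbm package (which prefers the gdbm/dbm.gnu backend when it is built) so it runs anywhere; the key and value are taken from the thread's example records, and the file path is made up:]

```python
import dbm, os, tempfile

# Tag lines become keys, sequences become values.
path = os.path.join(tempfile.mkdtemp(), "fasta_db")
db = dbm.open(path, "c")                  # create the map
db["CW127_A01"] = "TGCAGTCGAACGAGAACGG"   # header -> sequence
db.close()

db = dbm.open(path, "r")                  # reopen read-only
seq = db[b"CW127_A01"].decode()           # dbm returns bytes
db.close()
print(seq)
```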


Re: What strategy for random accession of records in massive FASTA file?

2005-01-13 Thread Chris Lasher
Before you get too carried away, how often do you want to do this and
how grunty is the box you will be running on?

Oops, I should have specified this. The script will only need to be run
once every three or four months, when the sequences are updated. I'll
be running it on boxes that are 3GHz/100GB Ram, but others may not be
so fortunate, and so I'd like to keep that in mind.

BTW, you need to clarify "don't have access to an RDBMS" ... surely
this can only be due to someone stopping them from installing good
free software freely available on the Internet.

I understand your and others' sentiment on this. I agree, the
open-source database systems are wonderful. However, keeping it
Python-only saves me hassle by only having to assist in instances where
others need help downloading and installing Python. I suppose if I keep
it in Python, I can even use Py2exe to generate an executable that
wouldn't even require them to install Python. A solution using
interaction with a database is much sexier, but, for the purposes of
the script, seems unnecessary. However, I certainly appreciate the
suggestions.

My guess is that you don't need anything much fancier than the
effbot's index method -- which by now you have probably found works
straight out of the box and is more than fast enough for your needs.

I think Mr. Lundh's code will work very well for these purposes, and
he'll have full credit for that part of the code. Many thanks to him
for posting it, and thanks very much to all who replied!

Chris



Re: What strategy for random accession of records in massive FASTA file?

2005-01-13 Thread Jeff Shannon
Chris Lasher wrote:
Given that the information content is 2 bits per character
that is taking up 8 bits of storage, there must be a good reason
for storing and/or transmitting them this way? I.e., it is easy
to think up a count-prefixed compressed format packing 4:1 in
subsequent data bytes (except for the last byte, which may have
fewer than 4 2-bit codes).
My guess for the inefficiency in storage size is because it is
human-readable, and because most in-silico molecular biology is just a
bunch of fancy string algorithms. This is my limited view of these
things at least.
Yeah, that pretty much matches my guess (not that I'm involved in 
anything related to computational molecular biology or genetics). 
Given the current technology, the cost of the extra storage size is 
presumably lower than the cost of translating into/out of a packed 
format.  Heck, hard drives cost less than $1/GB now.

And besides, for long-term archiving purposes, I'd expect that zip et 
al on a character-stream would provide significantly better 
compression than a 4:1 packed format, and that zipping the packed 
format wouldn't be all that much more efficient than zipping the 
character stream.

Jeff Shannon
Technician/Programmer
Credit International


Re: What strategy for random accession of records in massive FASTA file?

2005-01-13 Thread Jeff Shannon
Chris Lasher wrote:
And besides, for long-term archiving purposes, I'd expect that zip et
al on a character-stream would provide significantly better
compression than a 4:1 packed format, and that zipping the packed
format wouldn't be all that much more efficient than zipping the
character stream.
This 105MB FASTA file is 8.3 MB gzip-ed.
And a 4:1 packed-format file would be ~26MB.  It'd be interesting to 
see how that packed-format file would compress, but I don't care 
enough to write a script to convert the FASTA file into a 
packed-format file to experiment with... ;)

Short version, then, is that yes, size concerns (such as they may be) 
are outweighed by speed and conceptual simplicity (i.e. avoiding a 
huge mess of bit-masking every time a single base needs to be 
examined, or a human-(semi-)readable display is needed).

(Plus, if this format might be used for RNA sequences as well as DNA 
sequences, you've got at least a fifth base to represent, which means 
you need at least three bits per base, which means only two bases per 
byte (or else base-encodings split across byte-boundaries). That
gets ugly real fast.)

Jeff Shannon
Technician/Programmer
Credit International


Re: What strategy for random accession of records in massive FASTA file?

2005-01-13 Thread Robert Kern
Jeff Shannon wrote:
(Plus, if this format might be used for RNA sequences as well as DNA 
sequences, you've got at least a fifth base to represent, which means 
you need at least three bits per base, which means only two bases per 
byte (or else base-encodings split across byte-boundaries). That gets
ugly real fast.)
Not to mention all the IUPAC symbols for incompletely specified bases 
(e.g. R = A or G).

http://www.chem.qmul.ac.uk/iubmb/misc/naseq.html
--
Robert Kern
[EMAIL PROTECTED]
In the fields of hell where the grass grows high
 Are the graves of dreams allowed to die.
  -- Richard Harter


Re: What strategy for random accession of records in massive FASTA file?

2005-01-12 Thread Fredrik Lundh
Chris Lasher wrote:

 Since the file I'm working with contains tens of thousands of these
 records, I believe I need to find a way to hash this file such that I
 can retrieve the respective sequence more quickly than I could by
 parsing through the file request-by-request. However, I'm very new to
 Python and am still very low on the learning curve for programming and
 algorithms in general; while I'm certain there are ubiquitous
 algorithms for this type of problem, I don't know what they are or
 where to look for them. So I turn to the gurus and accost you for help
 once again. :-) If you could help me figure out how to code a solution
 that won't be a resource whore, I'd be _very_ grateful. (I'd prefer to
 keep it in Python only, even though I know interaction with a
 relational database would provide the fastest method--the group I'm
 trying to write this for does not have access to a RDBMS.)

keeping an index in memory might be reasonable.  the following class
creates an index file by scanning the FASTA file, and uses the marshal
module to save it to disk.  if the index file already exists, it's used as is.
to regenerate the index, just remove the index file, and run the program
again.

import os, marshal

class FASTA:

    def __init__(self, file):
        self.file = open(file)
        self.checkindex()

    def __getitem__(self, key):
        try:
            pos = self.index[key]
        except KeyError:
            raise IndexError("no such item")
        else:
            f = self.file
            f.seek(pos)
            header = f.readline()
            assert header == ">" + key + "\n"
            data = []
            while 1:
                line = f.readline()
                if not line or line[0] == ">":
                    break
                data.append(line)
            return data

    def checkindex(self):
        indexfile = self.file.name + ".index"

        try:
            self.index = marshal.load(open(indexfile, "rb"))
        except IOError:
            print "building index..."

            index = {}

            # scan the file
            f = self.file
            f.seek(0)
            while 1:
                pos = f.tell()
                line = f.readline()
                if not line:
                    break
                if line[0] == ">":
                    # save offset to header line
                    header = line[1:].strip()
                    index[header] = pos

            # save index to disk
            f = open(indexfile, "wb")
            marshal.dump(index, f)
            f.close()

            self.index = index

db = FASTA("myfastafile.dat")
print db["CW127_A02"]

['TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAG...

tweak as necessary.

/F 





Re: What strategy for random accession of records in massive FASTA file?

2005-01-12 Thread James Stroud
Don't fight it, lite it!

You should parse the fasta and put it into a database:

http://www.sqlite.org/index.html

Then index by name and it will be superfast.
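[A minimal sketch of the sqlite route suggested above, using Python's built-in sqlite3 bindings. The table and column names are made up; indexing by name here means making the record name the primary key, so lookups hit its index:]

```python
import sqlite3

con = sqlite3.connect(":memory:")  # a file path in real use
con.execute("CREATE TABLE fasta (name TEXT PRIMARY KEY, seq TEXT)")
con.execute("INSERT INTO fasta VALUES (?, ?)",
            ("CW127_A01", "TGCAGTCGAACGAG"))
con.commit()

# Retrieval by name uses the primary-key index.
row = con.execute("SELECT seq FROM fasta WHERE name = ?",
                  ("CW127_A01",)).fetchone()
print(row[0])
```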

James



What strategy for random accession of records in massive FASTA file?

2005-01-12 Thread Chris Lasher
Hello,
I have a rather large (100+ MB) FASTA file from which I need to
access records in a random order. The FASTA format is a standard format
for storing molecular biological sequences. Each record contains a
header line for describing the sequence that begins with a '>'
(right-angle bracket) followed by lines that contain the actual
sequence data. Three example FASTA records are below:

>CW127_A01
TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
TATAGTTAATCTGCCCTTTAGAGATAACAGTTGGAAACGACTGCTAATAATA
GCATTAAACAT
>CW127_A02
TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
TATAGTTAATCTGCCCTTTAGAGATAACAGTTGGAAACGACTGCTAATAATA
GCATTAAACATTCCGCCTAGTACGGTCGCAAGATTCTCAAAGGAATAGACGG
>CW127_A03
TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
TATAGTTAATCTGCCCTTTAGAGATAACAGTTGGAAACGACTGCTAATAATA
GCATTAAACATTCCGCCTGGG
...

Since the file I'm working with contains tens of thousands of these
records, I believe I need to find a way to hash this file such that I
can retrieve the respective sequence more quickly than I could by
parsing through the file request-by-request. However, I'm very new to
Python and am still very low on the learning curve for programming and
algorithms in general; while I'm certain there are ubiquitous
algorithms for this type of problem, I don't know what they are or
where to look for them. So I turn to the gurus and accost you for help
once again. :-) If you could help me figure out how to code a solution
that won't be a resource whore, I'd be _very_ grateful. (I'd prefer to
keep it in Python only, even though I know interaction with a
relational database would provide the fastest method--the group I'm
trying to write this for does not have access to a RDBMS.)
Thanks very much in advance,
Chris



Re: What strategy for random accession of records in massive FASTA file?

2005-01-12 Thread John Lenton
 If you could help me figure out how to code a solution
 that won't be a resource whore, I'd be _very_ grateful. (I'd prefer
to
 keep it in Python only, even though I know interaction with a
 relational database would provide the fastest method--the group I'm
 trying to write this for does not have access to a RDBMS.)
You don't need an RDBMS; I'd put it in a DBM or CDB myself.



Re: What strategy for random accession of records in massive FASTA file?

2005-01-12 Thread Larry Bates
You don't say how this will be used, but here goes:
1) Read the records and put into dictionary with key
of sequence (from header) and data being the sequence
data.  Use shelve to store the dictionary for subsequent
runs (if load time is excessive).
2) Take a look at Gadfly (gadfly.sourceforge.net).  It
provides you with Python SQL-like database and may be
better solution if data is basically static and you
do lots of processing.
All depends on how you use the data.
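[Option 1 above might look like the following sketch: parse once into a shelve file (a persistent dict), so later runs reopen it instead of re-parsing. The key/value pair is taken from the thread's example records; the file path is made up:]

```python
import os, shelve, tempfile

# First run: store header -> sequence data in a shelve file.
path = os.path.join(tempfile.mkdtemp(), "fasta_shelf")
db = shelve.open(path)
db["CW127_A01"] = "TGCAGTCGAACGAG"
db.close()

# Subsequent runs: reopen the shelf, no parsing needed.
db = shelve.open(path)
seq = db["CW127_A01"]
db.close()
print(seq)
```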
Regards,
Larry Bates
Syscon, Inc.
Chris Lasher wrote:
Hello,
I have a rather large (100+ MB) FASTA file from which I need to
access records in a random order. The FASTA format is a standard format
for storing molecular biological sequences. Each record contains a
header line for describing the sequence that begins with a '>'
(right-angle bracket) followed by lines that contain the actual
sequence data. Three example FASTA records are below:

>CW127_A01
TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
TATAGTTAATCTGCCCTTTAGAGATAACAGTTGGAAACGACTGCTAATAATA
GCATTAAACAT
>CW127_A02
TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
TATAGTTAATCTGCCCTTTAGAGATAACAGTTGGAAACGACTGCTAATAATA
GCATTAAACATTCCGCCTAGTACGGTCGCAAGATTCTCAAAGGAATAGACGG
>CW127_A03
TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
TATAGTTAATCTGCCCTTTAGAGATAACAGTTGGAAACGACTGCTAATAATA
GCATTAAACATTCCGCCTGGG
...
Since the file I'm working with contains tens of thousands of these
records, I believe I need to find a way to hash this file such that I
can retrieve the respective sequence more quickly than I could by
parsing through the file request-by-request. However, I'm very new to
Python and am still very low on the learning curve for programming and
algorithms in general; while I'm certain there are ubiquitous
algorithms for this type of problem, I don't know what they are or
where to look for them. So I turn to the gurus and accost you for help
once again. :-) If you could help me figure out how to code a solution
that won't be a resource whore, I'd be _very_ grateful. (I'd prefer to
keep it in Python only, even though I know interaction with a
relational database would provide the fastest method--the group I'm
trying to write this for does not have access to a RDBMS.)
Thanks very much in advance,
Chris


Re: What strategy for random accession of records in massive FASTA file?

2005-01-12 Thread Terry Reedy
Batista, Facundo [EMAIL PROTECTED] wrote in message 
news:[EMAIL PROTECTED]
If you want to keep the memory usage low, you can parse the file once and
store in a list the byte position where the record starts and ends. Then 
access the list randomly and read each record with seek() and read().
--
Or if you want to access by sequence name rather than number, use a dict 
instead.
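[The seek()/read() approach with a dict keyed by record name can be sketched like this (illustrative only; a StringIO stands in for the real file, so the offsets are character offsets):]

```python
import io

data = ">A\nACGT\n>B\nGGCC\n"
f = io.StringIO(data)

# One pass: record (start, end) byte positions for each record.
offsets = {}
name, start = None, None
pos = f.tell()
line = f.readline()
while line:
    if line.startswith(">"):
        if name is not None:
            offsets[name] = (start, pos)
        name, start = line[1:].strip(), pos
    pos = f.tell()
    line = f.readline()
if name is not None:
    offsets[name] = (start, pos)

# Random access by name: seek() to the start, read() the extent.
s, e = offsets["B"]
f.seek(s)
rec = f.read(e - s)
print(rec)  # '>B\nGGCC\n'
```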

tjr





Re: What strategy for random accession of records in massive FASTA file?

2005-01-12 Thread David E. Konerding DSD staff
In article [EMAIL PROTECTED], Chris Lasher wrote:
 Hello,
 I have a rather large (100+ MB) FASTA file from which I need to
 access records in a random order. The FASTA format is a standard format
 for storing molecular biological sequences. Each record contains a
 header line for describing the sequence that begins with a '>'
 (right-angle bracket) followed by lines that contain the actual
 sequence data. Three example FASTA records are below:

Use biopython.  They have dictionary-style classes which wrap FASTA files using 
indexes.

http://www.biopython.org

Dave


Re: What strategy for random accession of records in massive FASTA file?

2005-01-12 Thread John Machin

Chris Lasher wrote:
 Hello,
 I have a rather large (100+ MB) FASTA file from which I need to
 access records in a random order. The FASTA format is a standard
format
 for storing molecular biological sequences. Each record contains a
 header line for describing the sequence that begins with a '>'
 (right-angle bracket) followed by lines that contain the actual
 sequence data. Three example FASTA records are below:

 >CW127_A01
 TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
 TATAGTTAATCTGCCCTTTAGAGATAACAGTTGGAAACGACTGCTAATAATA
 GCATTAAACAT
[snip]
 Since the file I'm working with contains tens of thousands of these
 records, I believe I need to find a way to hash this file such that I
 can retrieve the respective sequence more quickly than I could by
 parsing through the file request-by-request. However, I'm very new to
 Python and am still very low on the learning curve for programming
and
 algorithms in general; while I'm certain there are ubiquitous
 algorithms for this type of problem, I don't know what they are or
 where to look for them. So I turn to the gurus and accost you for
help
 once again. :-) If you could help me figure out how to code a
solution
 that won't be a resource whore, I'd be _very_ grateful. (I'd prefer
to
 keep it in Python only, even though I know interaction with a
 relational database would provide the fastest method--the group I'm
 trying to write this for does not have access to a RDBMS.)
 Thanks very much in advance,
 Chris

Before you get too carried away, how often do you want to do this and
how grunty is the box you will be running on? Will the data be on a
server? If the server is on a WAN or at the other end of a radio link
between buildings, you definitely need an index so that you can access
the data randomly!

By way of example, to read all of a 157MB file into memory from a local
(i.e. not networked) disk using readlines() takes less than 4 seconds
on a 1.4GHz Athlon processor (see below). The average new corporate
desktop box is about twice as fast as that. Note that Windows Task
Manager showed 100% CPU utilisation for both read() and readlines().

My guess is that you don't need anything much fancier than the effbot's
index method -- which by now you have probably found works straight out
of the box and is more than fast enough for your needs.

BTW, you need to clarify "don't have access to an RDBMS" ... surely
this can only be due to someone stopping them from installing good free
software freely available on the Internet.

HTH,
John

C:\junk>python -m timeit -n 1 -r 6 "print len(file('bigfile.csv').read())"
157581595
157581595
157581595
157581595
157581595
157581595
1 loops, best of 6: 3.3e+006 usec per loop

C:\junk>python -m timeit -n 1 -r 6 "print len(file('bigfile.csv').readlines())"
1118870
1118870
1118870
1118870
1118870
1118870
1 loops, best of 6: 3.57e+006 usec per loop



Re: What strategy for random accession of records in massive FASTA file?

2005-01-12 Thread Bengt Richter
On 12 Jan 2005 14:46:07 -0800, Chris Lasher [EMAIL PROTECTED] wrote:

Hello,
I have a rather large (100+ MB) FASTA file from which I need to
access records in a random order. The FASTA format is a standard format
for storing molecular biological sequences. Each record contains a
header line for describing the sequence that begins with a '>'
(right-angle bracket) followed by lines that contain the actual
sequence data. Three example FASTA records are below:

Others have probably solved your basic problem, or pointed
the way. I'm just curious.

Given that the information content is 2 bits per character
that is taking up 8 bits of storage, there must be a good reason
for storing and/or transmitting them this way? I.e., it is easy
to think up a count-prefixed compressed format packing 4:1 in
subsequent data bytes (except for the last byte, which may have
fewer than 4 2-bit codes).
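[A count-prefixed 2-bit packing of that kind is easy to sketch. This is purely illustrative, not any standard format: a toy one-byte length prefix, then four bases per byte, with the last byte possibly carrying fewer codes:]

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def pack(seq):
    # Toy format: 1 count byte, then 2-bit codes packed 4 per byte.
    out = bytearray([len(seq) & 0xFF])
    for i in range(0, len(seq), 4):
        byte = 0
        for j, base in enumerate(seq[i:i + 4]):
            byte |= CODE[base] << (2 * j)
        out.append(byte)
    return bytes(out)

packed = pack("TGCAGTCGAACGAGAA")   # 16 bases
print(len(packed))                  # 1 count byte + 4 data bytes
```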

I'm wondering how the data is actually used once records are
retrieved. (but I'm too lazy to explore the biopython.org link).

CW127_A01
TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
TATAGTTAATCTGCCCTTTAGAGATAACAGTTGGAAACGACTGCTAATAATA
GCATTAAACAT
CW127_A02
TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
TATAGTTAATCTGCCCTTTAGAGATAACAGTTGGAAACGACTGCTAATAATA
GCATTAAACATTCCGCCTAGTACGGTCGCAAGATTCTCAAAGGAATAGACGG
CW127_A03
TGCAGTCGAACGAGAACGGTCCTTCGGGATGTCAGCTAAGTGGCGGACGGGTGAGTAATG
TATAGTTAATCTGCCCTTTAGAGATAACAGTTGGAAACGACTGCTAATAATA
GCATTAAACATTCCGCCTGGG
...


Regards,
Bengt Richter