kumar s wrote:
> hi,
>
> thank you. this is not a homework question.
>
> I have a very huge file of fasta sequence.
>
> I want to create a dictionary where 'GeneName' as key
> and sequence of ATGC characters as value
>
>
> biglist = dat.split('\t')
> ['GeneName xxxxxxxx','yyyyyyyy','ATTAAGGCCAA'.......]
>
> Now I want to select ''GeneName xxxxxxxx' into listA
> and 'ATTAAGGCCAA' into listB
>
> so I want to select 0,3,6,9 elements into listA
> and 2,5,8,11 and so on elements into listB
>
> then I can do dict(zip(listA,listB))
>
> however, the very loops concept is getting blanked out
> in my brain when I want to do this:
>
> for j in range(len(biglist)):
> from here .. I cannot think anything..
>
> may be it is just mental block.. thats the reason I
> seek help on forum.

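The slicing approach for the list described above can be sketched as follows. This is only an illustration: it assumes the fields really do repeat in groups of three (name, middle field, sequence), and the values in biglist are made up.

```python
# Hypothetical sample data: fields repeat as name, middle field, sequence.
biglist = [
    'GeneName aaa', 'yyyy', 'ATTAAGGCCAA',
    'GeneName bbb', 'zzzz', 'GGCCTTAA',
]

# Extended slicing: start index, no stop index, step of 3.
listA = biglist[0::3]   # elements 0, 3, 6, ... (the names)
listB = biglist[2::3]   # elements 2, 5, 8, ... (the sequences)

# Pair each name with its sequence.
genes = dict(zip(listA, listB))
```

No explicit loop over range(len(biglist)) is needed; the step in the slice does the index arithmetic.
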
Lloyd has pointed you to slicing as the answer to your immediate
question. However, for the larger question of reading FASTA files, you
might want to look at CoreBio, a new library of Python modules for
computational biology that looks pretty good.
http://code.google.com/p/corebio/

CoreBio has built-in support for reading FASTA files into Seq objects.
For example:

In [1]: import corebio.seq_io

In [2]: f=open(r'F:\Bio\BIOE48~1\KENTJO~1\SEQUEN~2\fasta\GI5082~1.FAS')

In [3]: seqs = corebio.seq_io.read(f)

seqs is now a list of Seq objects, one for each sequence in the original
file. In this case there is only one sequence, but it will work for your
file also.

In [4]: for seq in seqs:
   ...:     print seq.name
   ...:     print seq
   ...:
   ...:
gi|50826|emb|CAA28242.1|
MIRTLLLSALVAGALSCGYPTYEVEDDVSRVVGGQEATPNTWPWQVSLQVLSSGRWRHNCGGSLVANNWVLTAAHCLSNYQTYRVLLGAHSLSNPGAGSAAVQVSKLVVHQRWNSQNVGNGYDIALIKLASPVTLSKNIQTACLPPAGTI
LPRNYVCYVTGWGLLQTNGNSPDTLRQGRLLVVDYATCSSASWWGSSVKSSMVCAGGDGVTSSCNGDSGGPLNCRASNGQWQVHGIVSFGSSLGCNYPRKPSVFTRVSNYIDWINSVMARN

In your case, you want a dict whose keys are the sequence names up to
the first tab and whose values are the actual sequences. Something like
this should work:

d = dict( (seq.name.split('\t')[0], seq) for seq in seqs)

The Seq class is a string subclass, so putting the seq in the dict is
what you want.

There is also an iterator to read sequences one at a time. This might be
a little faster and more memory efficient because it doesn't have to
create the big list of all sequences. Something like this (untested):

from corebio.seq_io.fasta_io import iterseq
f = open(...)
d = dict( (seq.name.split('\t')[0], seq) for seq in iterseq(f))

Kent
_______________________________________________
Tutor maillist - Tutor@python.org
http://mail.python.org/mailman/listinfo/tutor
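
If CoreBio is not available, the same one-record-at-a-time idea can be sketched with the standard library alone. This is not CoreBio's implementation: iter_fasta and fasta_dict are hypothetical helpers, the sample records are made up, and the code assumes the conventional FASTA layout in which each record starts with a '>' header line followed by one or more sequence lines.

```python
def iter_fasta(lines):
    """Yield (header, sequence) pairs one record at a time.

    Hypothetical helper, not part of CoreBio.
    """
    header = None
    chunks = []
    for line in lines:
        line = line.rstrip('\r\n')
        if line.startswith('>'):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, ''.join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None and line.strip():
            # Sequence data may be wrapped over several lines.
            chunks.append(line.strip())
    if header is not None:
        yield header, ''.join(chunks)


def fasta_dict(lines):
    """Map each header, cut at its first tab, to its sequence."""
    return dict((header.split('\t')[0], seq)
                for header, seq in iter_fasta(lines))


# Made-up sample records; any iterable of lines works, including an open file.
sample = [
    '>GeneName aaa\tsome description\n',
    'ATTAAG\n',
    'GCCAA\n',
    '>GeneName bbb\n',
    'GGCC\n',
]
d = fasta_dict(sample)
```

Because iter_fasta is a generator, only one record is held in memory while the dict is being built, which is the same benefit described for iterseq above.
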