I would do this by making a dictionary mapping sequence to header for each data set. Then make a set that contains the keys common to both data sets. Finally use the dictionaries again to look up the headers.
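The code in this reply is written in Python 2 syntax (print statements, iterkeys). For anyone on Python 3, the same three steps might look roughly like the sketch below. The function name common_headers is made up for illustration. The sketch assumes every record is exactly one header line followed by one sequence line, and that no sequence repeats inside a single data set (a repeat would overwrite the earlier header in the dict).

```python
def common_headers(a_lines, b_lines):
    """Return (header_a, header_b) pairs for sequences present in both inputs.

    Assumes each input alternates header, sequence, header, sequence, ...
    """
    # Step 1: map sequence -> header for each data set.
    a_dict = dict(zip(a_lines[1::2], a_lines[::2]))
    b_dict = dict(zip(b_lines[1::2], b_lines[::2]))
    # Step 2: dict key views support set operations, so & gives shared sequences.
    common = a_dict.keys() & b_dict.keys()
    # Step 3: look the headers up again; sort so the output order is stable.
    return [(a_dict[seq], b_dict[seq]) for seq in sorted(common)]


a = ">a1 TTAATTGGAACA >a2 AGGACAAGGATA >a3 TTAAGGAACAAA".split()
b = ">b1 TTAATTGGAACA >b2 AGGTCAAGGATA >b3 AAGGCCAATTAA".split()

for header_a, header_b in common_headers(a, b):
    print(f"{header_a}\t{header_b}")
```

With real files, the two lists could come from something like open('a1.txt').read().split(), as long as the files follow the two-line layout described above.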

a = '''>a1
TTAATTGGAACA
>a2
AGGACAAGGATA
>a3
TTAAGGAACAAA'''.split()

# Make a dict mapping sequence to header for the 'a' data set
ak = a[1::2]
av = a[::2]
a_dict = dict(zip(ak,av))
print a_dict

b = '''>b1
TTAATTGGAACA
>b2
AGGTCAAGGATA
>b3
AAGGCCAATTAA'''.split()

# Make a dict mapping sequence to header for the 'b' data set
bk = b[1::2]
bv = b[::2]
b_dict = dict(zip(bk,bv))
print b_dict

# Make a set that contains the keys common to both dicts
common_keys = set(a_dict.iterkeys())
common_keys.intersection_update(b_dict.iterkeys())
print common_keys

# For each common key, print the corresponding headers
for common in common_keys:
    print '%s\t%s' % (a_dict[common], b_dict[common])

Kent

Srinivas Iyyer wrote:
> dear group,
>
> I have two files in a text format and look this way:
>
> File a1.txt:
> >a1
> TTAATTGGAACA
> >a2
> AGGACAAGGATA
> >a3
> TTAAGGAACAAA
>
> File b1.txt:
> >b1
> TTAATTGGAACA
> >b2
> AGGTCAAGGATA
> >b3
> AAGGCCAATTAA
>
> I want to check if there are common elements based on
> ATGC sequences. a1 and b1 are identical sequences and
> I want to select them and print the headers (starting
> with > symbol).
>
> a1 '\t' b1
>
> Here:
> >XXXXX is called header and the line followed by >line
> is sequence. In bioinformatics, this is called a FASTA
> format. What I am doing here is, I am matching the
> sequences (these are always 25 mers in this instance)
> and if they match, I am asking python to write the
> header +'\t'+ header
>
> ak = a[1::2]
> av = a[::2]
> seq_dict = dict(zip(ak,av))
>
> **************************************
> >>> seq_dict
> {'TTAAGGAACAAA': '>a3', 'AGGACAAGGATA': '>a2',
> 'TTAATTGGAACA': '>a1'}
> **************************************
>
> bv = b[1::2]
>
> ***************************************
> >>> bv
> ['TTAATTGGAACA', 'AGGTCAAGGATA', 'AAGGCCAATTAA']
>
> >>> for i in bv:
>         if seq_dict.has_key(i):
>                 print seq_dict[i]
>
> >a1
>
> ***************************************
>
> Here a1 is the only common element.
>
> However, I am having difficulty printing that b1 is
> identical to a1
>
> how do i take b and do this search. It was easy for me
> to take the sequence part by doing
>
> b[1::2]. however, I want to print b1 header has same
> sequence as a1
>
> a1 +'\t'+b1
>
> Is there anyway i can do this. This is very simple and
> due to my brain block, I am unable to get it out.
> Can any one please help me out.
>
> Thanks
>
> _______________________________________________
> Tutor maillist - Tutor@python.org
> http://mail.python.org/mailman/listinfo/tutor