@Ricardo Aráoz Thanks for your response. Before I saw it I had posted the question on Stack Overflow (see the link below). I like your solution better than the re solution posted there. It looks like this task may take longer than I thought. I guess the re solution might take more than 10 days. The string being searched is 80 million characters long. But obviously I can stop once I find a match and then just move on to the next sequence. You might want to post this answer on Stack Overflow. I like the more interactive form of a mailing list, but there seems to be a very broad audience on Stack Overflow.
Thanks again,
http://stackoverflow.com/questions/2420412/search-for-string-allowing-for-one-mismatches-in-any-location-of-the-string-pyth

Vincent Davis
720-301-3003
vinc...@vincentdavis.net
my blog <http://vincentdavis.net> | LinkedIn <http://www.linkedin.com/in/vincentdavis>

2010/3/10 Ricardo Aráoz <ricar...@gmail.com>

> Vincent Davis wrote:
>
> I have never used difflib or anything similar and have a few questions.
> I am working with DNA sequences of length 25. I have a list of 230,000 and
> need to look for each sequence in the entire genome (toxoplasma parasite). I
> am not sure how large the genome is, but more than 230,000 sequences.
> There are programs that do this, and really fast, and they even do partial
> matches, but not quite what I need. So I am looking to build a custom
> solution.
> I need to look for each of my sequences of 25 characters, for example
> (AGCCTCCCATGATTGAACAGATCAT).
> The genome is formatted as a continuous string
> (CATGGGAGGCTTGCGGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTTGCGGAG.........)
>
> I don't care where or how many times, only if it exists. This is simple I
> think, str.find('AGCCTCCCATGATTGAACAGATCAT')
>
> But I also want to find a close match, defined as only wrong at 1 location,
> and I want to record the location. I am not sure how to do this. The only
> thing I can think of is using a wildcard and performing the search with a
> wildcard in each position, i.e. 25 times.
> For example
> AGCCTCCCATGATTGAACAGATCAT
> AGCCTCCCATGATAGAACAGATCAT
> close match with a mismatch at position 13
>
>
> also :
>
> import fnmatch
>
> sequence = 'AGGCTTGCGGAGCCTGAGGGCGGAG'
> # One pattern per position: '?' stands for the single allowed mismatch.
> seqList = ['*' + sequence[0:i] + '?' + sequence[i+1:] + '*'
>            for i in range(len(sequence))]
>
> genome = 'CATGGGAGGCTTGCGGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTTGCGGAG........'
> if any(fnmatch.fnmatch(genome, i) for i in seqList):
>     print('It matches')
>
> Which might be better if the sequence is fixed and the genome changes
> inside a loop.
>
> HTH
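A minimal sketch of one alternative to trying 25 wildcard patterns, assuming Python 3 and a genome that fits in memory as a single string. The idea: if a 25-character sequence occurs with at most one mismatch, at least one of its two halves must occur exactly, so str.find on each half gives candidate offsets that can then be checked character by character. The name find_with_one_mismatch and its return format are made up for illustration; this is neither the re solution nor the fnmatch solution discussed in this thread, and it has not been timed against an 80-million-character genome.

```python
def find_with_one_mismatch(genome, probe):
    """Return (offset, mismatch_index) for the leftmost hit, or None.

    offset is the 0-based start of the hit in genome. mismatch_index is
    None for an exact match, otherwise the 0-based position in probe
    where the single mismatch occurs. Assumes len(probe) >= 2.
    """
    n = len(probe)
    half = n // 2
    # With at most one mismatch, at least one of the two halves of the
    # probe must appear unchanged, so search for each half exactly.
    pieces = [(0, probe[:half]), (half, probe[half:])]
    best = None
    for shift, piece in pieces:
        start = genome.find(piece)
        while start != -1:
            offset = start - shift  # where the full probe would begin
            if 0 <= offset <= len(genome) - n:
                window = genome[offset:offset + n]
                diffs = [i for i in range(n) if window[i] != probe[i]]
                if len(diffs) <= 1:
                    hit = (offset, diffs[0] if diffs else None)
                    if best is None or hit[0] < best[0]:
                        best = hit
                    # Later occurrences of this piece give larger offsets.
                    break
            start = genome.find(piece, start + 1)
    return best


if __name__ == "__main__":
    genome = "CATGGGAGGCTTGCGGAGCCTGAGGGCGGAGCCTGAGGTGGGAGGCTTGCGGAG"
    sequence = "AGGCTTGCGGAGCCTGAGGGCGGAG"
    print(find_with_one_mismatch(genome, sequence))
```

Since only existence matters here, the loop could return on the first hit instead of keeping the leftmost one; the version above keeps the leftmost hit so the recorded location is deterministic.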
_______________________________________________
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor