Re: newb: comapring two strings
manstey wrote: Is there a clever way to see if two strings of the same length vary by only one character, and what the character is in both strings. E.g. str1=yaqtil str2=yaqtel they differ at str1[4] and the difference is ('i','e') But if there was str1=yiqtol and str2=yaqtel, I am not interested. can anyone suggest a simple way to do this? My next problem is, I have a list of 300,000+ words and I want to find every pair of such strings. I thought I would first sort on length of string, but how do I iterate through the following: Not sure if it can handle 30 words, but here is something to play with. import sys def find_similars(words, lookup=None, dupes=None): if lookup is None: lookup = {} if dupes is None: dupes = set() for word in words: low_word = word.lower() if low_word not in dupes: dupes.add(low_word) last = None for i, c in enumerate(low_word): if c == last: continue key = low_word[:i], low_word[i+1:] if key in lookup: lookup[key].append(word) else: lookup[key] = [word] last = c return (group for group in lookup.itervalues() if len(group) 1) def main(): import optparse parser = optparse.OptionParser() parser.usage += infile[s] parser.add_option(-n, --limit, type=int, help=process at most LIMIT words) options, args = parser.parse_args() if args: words = (w.strip() for fn in args for w in open(fn)) else: words = (w.strip() for w in sys.stdin) if options.limit is not None: from itertools import islice words = islice(words, max_len) for group in find_similars(words): print .join(sorted(group)) if __name__ == __main__: main() Peter -- http://mail.python.org/mailman/listinfo/python-list
Re: newb: comapring two strings
Use the levenshtein distance. Given the constraint that the two strings are the same length, I'm assuming (as other posters appear to have done) that vary by only one character precludes insertion and deletion operations. In that case, the problem can be solved in O(n) time by a simple loop which counts the number of differences and notes the position of the first (if any) difference. Any full-distance Levenshtein method that does no pre-processing must take O(n**2) time. It is possible to take advantage of the fact that the OP doesn't care about what the distance is exactly if it is not 1; 0 is just as bad as 2, 3, 4, etc. In this case it is possible to achieve O(n) time [ref. Ukkonen]. However (a) one still needs to track where the difference arose (b) we are talking about extreme overkill for the OP's problem. -- http://mail.python.org/mailman/listinfo/python-list
Re: newb: comapring two strings
John Machin [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED] In that case, the problem can be solved in O(n) time by a simple loop which counts the number of differences and notes the position of the first (if any) difference. Any full-distance Levenshtein method that does no pre-processing must take O(n**2) time. It is possible to take advantage of the fact that the OP doesn't care about what the distance is exactly if it is not 1; 0 is just as bad as 2, 3, 4, etc. Here is a class that implements 4 different approaches to this comparison problem, with a configurable number of maximum mismatches (where a and b are the strings being compared): 1. use sum(map(lambda (x,y): x!=y, itertools.izip(a,b))) to count mismatches - no short-circuiting 2. use sum(map(abs, itertools.starmap(cmp, itertools.izip(a,b to count mismatches - no short-circuiting 3. use explicit for loop to compare each tuple returned from itertools.izip(a,b) - short-circuits when number of mismatches exceeds allowable number 4. use for loop over itertools.takewhile to count mismatches - short-circuits when number of mismatches exceeds allowable number Also note the short-circuiting in the initializer - if the strings being compared are shorter in length then the number of allowed mismatches, than they will always match, no comparisons required. (In the OP's specific case, any two one-character strings will match within one character). Of course this does nothing to address the larger issue of the program design - I just wanted to learn a little more about itertools. :) -- Paul from itertools import izip,starmap,takewhile class offByNoMoreThanOneCharacter(object): def __init__(self,a,b,maxMismatches=1): len_a = len(a) self.val = None if len_a != len(b): self.val = False elif len_a = maxMismatches: self.val = True else: self.a = a self.b = b self.maxMismatches = maxMismatches def eval1(self): # one-liner using map with lambda - sum counts mismatches self.val = sum(map(lambda (x,y): x!=y, izip(self.a,self.b))) = self.maxMismatches def eval2(self): # one-liner using map with abs and cmp - sum counts mismatches self.val = sum(map(abs, starmap(cmp, izip(self.a,self.b = self.maxMismatches def eval3(self): # explicit traversal, with short-circuit when too many mismatches found mismatches = 0 for (ac,bc) in izip(self.a,self.b): if ac != bc: mismatches += 1 if mismatches self.maxMismatches: self.val = False break else: self.val = True def eval4(self): # traversal using takewhile, also short-circuits when too many mismatches found numMismatches = 0 maxMismatches = self.maxMismatches stillOk = lambda (s,t)=(None,None) : numMismatches = maxMismatches for (ac,bc) in takewhile(stillOk, izip(self.a,self.b)): numMismatches += (ac != bc) self.val = stillOk() # special instance method to return boolean-ness of this object def __nonzero__(self): if self.val is None: # change this line to use whichever eval method you want to test self.eval1() return self.val def test(a,b): print a,b print bool(offByNoMoreThanOneCharacter(a,b)) print test(abc,abd) test(abc,axd) test(yaqtil,yaqtel) test(yiqtol,yaqtel) -- http://mail.python.org/mailman/listinfo/python-list
Re: newb: comapring two strings
I think a way to solve the problem may be: 1) create a little Python script to separate the original words in many files, each one containing only words of the same length. Every filename can contain the relative word length. 2) write a little C program with two nested loops, that scans all the pairs of the words of a single file, looking for single char differences using a scanning of the chars. Probably there are many possible tricks to speed up this search (like separating the words in subgroups, of using some assembly-derived tricks, etc) but maybe you don't need them. As C data structure you may simply use a single block (because you know the length of a single word, so the files produced by Python can be without spaces and returns). I have suggested C because if the words are all of the same length then you have 3^2 = 90 000 000 000 pairs to test. If you want, before creating the C program you can create a little Python+Psyco prototype that uses array.array of chars (because sometimes Psyco uses them quite quickly). Bye, bearophile -- http://mail.python.org/mailman/listinfo/python-list
Re: newb: comapring two strings
I have suggested C because if the words are all of the same length then you have 3^2 = 90 000 000 000 pairs to test. Sorry, you have (n*(n-1))/2 pairs to test (~ 45 000 000 000). bearophile -- http://mail.python.org/mailman/listinfo/python-list
Re: newb: comapring two strings
[EMAIL PROTECTED] writes: I have suggested C because if the words are all of the same length then you have 3^2 = 90 000 000 000 pairs to test. Sorry, you have (n*(n-1))/2 pairs to test (~ 45 000 000 000). Still terrible. Use a better algorithm! -- http://mail.python.org/mailman/listinfo/python-list
Re: newb: comapring two strings
Paul Rubin wrote: [EMAIL PROTECTED] writes: I have suggested C because if the words are all of the same length then you have 3^2 = 90 000 000 000 pairs to test. Sorry, you have (n*(n-1))/2 pairs to test (~ 45 000 000 000). Still terrible. Use a better algorithm! To put all this in perspective, here's a very similar real-world problem: You have a customer database with 300,000 records. Some of them are duplicated, because there are differences in recording the customers' names, like a one-keystroke typo. The task is to locate pairs of rows which are likely to be duplicates. 45 billion comaprisons[1] may be ecxessive. And here's a clue for a better algorithm: Knuth's TAOCP vol 3 [1973 edition] chapter 6 (searching) third page as if the words were spelled backwards. If you find yourself reading about soundex, you've definitely gone too far :-) [1]: comapring in the subject: is this a game of 'Spot the deliberate mistake? Is the OP counting a transposition of adjacent characters as varying by one character? Cheers -- http://mail.python.org/mailman/listinfo/python-list
Re: newb: comapring two strings
Paul RubinStill terrible. Use a better algorithm! I agree, it's O(n^2), but if you need to run this program just 1 time, and the program is in C, and you don't want to use much time to think and code a better algorithm (or you aren't able to do it) then maybe that naive solution can be enough, (if the running time is around 30-90 minutes). Total time to solve a one-shot problem is often the sum of thinking time + programming time + running time :-) Paul RubinUse this to generate all the variants of all the words in the dictionary, and write those out into a file, each line containing a variant plus the original word. Then use a sorting utility like the unix sort program to sort the file. Those programs work efficiently even if the file is too large to fit in memory. Then read the sorted file to find equivalence classes on the variants. If the words are all 5 character long, then I think you are creating: (len(lowercase)-1) * 5 * 3 = 375 strings of len 5+5 (plus enter if you save it into a file), this means a file of about 37 MB. All those trings may even fit in memory if you have lot of it. Your solution looks nice :-) (It reminds me a little the word signature trick used to find anagrams). Bye, bearophile -- http://mail.python.org/mailman/listinfo/python-list
newb: comapring two strings
Hi, Is there a clever way to see if two strings of the same length vary by only one character, and what the character is in both strings. E.g. str1=yaqtil str2=yaqtel they differ at str1[4] and the difference is ('i','e') But if there was str1=yiqtol and str2=yaqtel, I am not interested. can anyone suggest a simple way to do this? My next problem is, I have a list of 300,000+ words and I want to find every pair of such strings. I thought I would first sort on length of string, but how do I iterate through the following: str1 str2 str3 str4 str5 so that I compare str1 str2, str1 str3, str 1 str4, str1 str5, str2 str3, str3 str4, str3 str5, str4 str5. Thanks in advance, Matthew -- http://mail.python.org/mailman/listinfo/python-list
Re: newb: comapring two strings
Is there a clever way to see if two strings of the same length vary by only one character, and what the character is in both strings. E.g. str1=yaqtil str2=yaqtel they differ at str1[4] and the difference is ('i','e') But if there was str1=yiqtol and str2=yaqtel, I am not interested. can anyone suggest a simple way to do this? Use the levenshtein distance. http://en.wikisource.org/wiki/Levenshtein_distance My next problem is, I have a list of 300,000+ words and I want to find every pair of such strings. I thought I would first sort on length of string, but how do I iterate through the following: str1 str2 str3 str4 str5 so that I compare str1 str2, str1 str3, str 1 str4, str1 str5, str2 str3, str3 str4, str3 str5, str4 str5. decorate-sort-undecorate is the idion for this l = list of strings l = [(len(w), w) for w in l] l.sort() l = [w for _, w in l] Diez -- http://mail.python.org/mailman/listinfo/python-list
Re: newb: comapring two strings
manstey wrote: Hi, Is there a clever way to see if two strings of the same length vary by only one character, and what the character is in both strings. E.g. str1=yaqtil str2=yaqtel they differ at str1[4] and the difference is ('i','e') something like this maybe? str1='yaqtil' str2='yaqtel' set(enumerate(str1)) ^ set(enumerate(str2)) set([(4, 'e'), (4, 'i')]) -- - Justin -- http://mail.python.org/mailman/listinfo/python-list
Re: newb: comapring two strings
manstey wrote: Hi, Is there a clever way to see if two strings of the same length vary by only one character, and what the character is in both strings. E.g. str1=yaqtil str2=yaqtel they differ at str1[4] and the difference is ('i','e') But if there was str1=yiqtol and str2=yaqtel, I am not interested. can anyone suggest a simple way to do this? My next problem is, I have a list of 300,000+ words and I want to find every pair of such strings. I thought I would first sort on length of string, but how do I iterate through the following: str1 str2 str3 str4 str5 so that I compare str1 str2, str1 str3, str 1 str4, str1 str5, str2 str3, str3 str4, str3 str5, str4 str5. If your strings are pretty short you can do it like this even without sorting by length first: def fuzzy_keys(s): for pos in range(len(s)): yield s[0:pos]+chr(0)+s[pos+1:] def fuzzy_insert(d, s): for fuzzy_key in fuzzy_keys(s): if fuzzy_key in d: strings = d[fuzzy_key] if type(strings) is list: strings += s else: d[fuzzy_key] = [strings, s] else: d[fuzzy_key] = s def gather_fuzzy_matches(d): for strings in d.itervalues(): if type(strings) is list: yield strings acc = {} fuzzy_insert(acc, yaqtel) fuzzy_insert(acc, yaqtil) fuzzy_insert(acc, oaqtil) print list(gather_fuzzy_matches(acc)) prints [['yaqtil', 'oaqtil'], ['yaqtel', 'yaqtil']] -- http://mail.python.org/mailman/listinfo/python-list
Re: newb: comapring two strings
manstey wrote: Hi, Is there a clever way to see if two strings of the same length vary by only one character, and what the character is in both strings. You want zip. def diffbyonlyone(string1, string2): diffcount = 0 for c1, c2 in zip(string1, string2): if c1 != c2: diffcount += 1 if diffcount 1: return False return diffcount == 1 print diffbyonlyone(yaqtil,yaqtel) # True print diffbyonlyone(yiqtol,yaqtel) # False If your strings are long, it might be faster/more memory efficient to use itertools.izip instead. My next problem is, I have a list of 300,000+ words and I want to find every pair of such strings. I thought I would first sort on length of string, but how do I iterate through the following: str1 str2 str3 str4 str5 so that I compare str1 str2, str1 str3, str 1 str4, str1 str5, str2 str3, str3 str4, str3 str5, str4 str5. for index1 in xrange(len(words)): for index2 in xrange(index1+1,len(words)): if diffbyonlyone(words[index1], words[index2]): print words[index1] + -- + words[index2] ...but by all means run that only on sets of words that you have already identified, pursuant to some criteria like word length, to be likely matches. Do the math; that's a lot of comparisons! -- http://mail.python.org/mailman/listinfo/python-list