Re: newb: comapring two strings

2006-05-19 Thread Peter Otten
manstey wrote:

 Is there a clever way to see if two strings of the same length vary by
 only one character, and what the character is in both strings.
 
 E.g. str1=yaqtil str2=yaqtel
 
 they differ at str1[4] and the difference is ('i','e')
 
 But if there was str1=yiqtol and str2=yaqtel, I am not interested.
 
 can anyone suggest a simple way to do this?
 
 My next problem is, I have a list of 300,000+ words and I want to find
 every pair of such strings. I thought I would first sort on length of
 string, but how do I iterate through the following:

Not sure if it can handle 30 words, but here is something to play with.

import sys

def find_similars(words, lookup=None, dupes=None):
if lookup is None:
lookup = {}
if dupes is None:
dupes = set()
for word in words:
low_word = word.lower()
if low_word not in dupes:
dupes.add(low_word)
last = None
for i, c in enumerate(low_word):
if c == last: continue
key = low_word[:i], low_word[i+1:]
if key in lookup:
lookup[key].append(word)
else:
lookup[key] = [word]
last = c
return (group for group in lookup.itervalues() if len(group)  1)

def main():
import optparse
parser = optparse.OptionParser()
parser.usage +=  infile[s]
parser.add_option(-n, --limit, type=int, help=process at most
LIMIT words)
options, args = parser.parse_args()
if args:
words = (w.strip() for fn in args for w in open(fn))
else:
words = (w.strip() for w in sys.stdin)
if options.limit is not None:
from itertools import islice
words = islice(words, max_len)

for group in find_similars(words):
print  .join(sorted(group))

if __name__ == __main__:
main()
 
Peter
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: newb: comapring two strings

2006-05-19 Thread John Machin
 Use the levenshtein distance.

Given the constraint that the two strings are the same length, I'm
assuming (as other posters appear to have done) that vary by only one
character precludes insertion and deletion operations.

In that case, the problem can be solved in O(n) time by a simple loop
which counts the number of differences and notes the position of the
first (if any) difference. Any full-distance Levenshtein method that
does no pre-processing must take O(n**2) time. It is possible to take
advantage of the fact that the OP doesn't care about what the distance
is exactly if it is not 1; 0 is just as bad as 2, 3, 4, etc. In this
case it is possible to achieve O(n) time [ref. Ukkonen]. However (a)
one still needs to track where the difference arose (b) we are talking
about extreme overkill for the OP's problem.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: newb: comapring two strings

2006-05-19 Thread Paul McGuire
John Machin [EMAIL PROTECTED] wrote in message
news:[EMAIL PROTECTED]

 In that case, the problem can be solved in O(n) time by a simple loop
 which counts the number of differences and notes the position of the
 first (if any) difference. Any full-distance Levenshtein method that
 does no pre-processing must take O(n**2) time. It is possible to take
 advantage of the fact that the OP doesn't care about what the distance
 is exactly if it is not 1; 0 is just as bad as 2, 3, 4, etc.


Here is a class that implements 4 different approaches to this comparison
problem, with a configurable number of maximum mismatches (where a and b are
the strings being compared):

1. use sum(map(lambda (x,y): x!=y, itertools.izip(a,b))) to count
mismatches - no short-circuiting
2. use sum(map(abs, itertools.starmap(cmp, itertools.izip(a,b to count
mismatches - no short-circuiting
3. use explicit for loop to compare each tuple returned from
itertools.izip(a,b) - short-circuits when number of mismatches exceeds
allowable number
4. use for loop over itertools.takewhile to count mismatches -
short-circuits when number of mismatches exceeds allowable number

Also note the short-circuiting in the initializer - if the strings being
compared are shorter in length then the number of allowed mismatches, than
they will always match, no comparisons required.  (In the OP's specific
case, any two one-character strings will match within one character).

Of course this does nothing to address the larger issue of the program
design - I just wanted to learn a little more about itertools. :)

-- Paul


from itertools import izip,starmap,takewhile

class offByNoMoreThanOneCharacter(object):
def __init__(self,a,b,maxMismatches=1):
len_a = len(a)
self.val = None
if len_a != len(b):
self.val = False
elif len_a = maxMismatches:
self.val = True
else:
self.a = a
self.b = b
self.maxMismatches = maxMismatches

def eval1(self):
# one-liner using map with lambda - sum counts mismatches
self.val = sum(map(lambda (x,y): x!=y, izip(self.a,self.b))) =
self.maxMismatches

def eval2(self):
# one-liner using map with abs and cmp - sum counts mismatches
self.val = sum(map(abs, starmap(cmp, izip(self.a,self.b =
self.maxMismatches

def eval3(self):
# explicit traversal, with short-circuit when too many mismatches
found
mismatches = 0
for (ac,bc) in izip(self.a,self.b):
if ac != bc:
mismatches += 1
if mismatches  self.maxMismatches:
self.val = False
break
else:
self.val = True

def eval4(self):
# traversal using takewhile, also short-circuits when too many
mismatches found
numMismatches = 0
maxMismatches = self.maxMismatches
stillOk = lambda (s,t)=(None,None) : numMismatches = maxMismatches
for (ac,bc) in takewhile(stillOk, izip(self.a,self.b)):
numMismatches += (ac != bc)
self.val = stillOk()


# special instance method to return boolean-ness of this object
def __nonzero__(self):
if self.val is None:
# change this line to use whichever eval method you want to test
self.eval1()
return self.val


def test(a,b):
print a,b
print bool(offByNoMoreThanOneCharacter(a,b))
print

test(abc,abd)
test(abc,axd)
test(yaqtil,yaqtel)
test(yiqtol,yaqtel)


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: newb: comapring two strings

2006-05-19 Thread bearophileHUGS
I think a way to solve the problem may be:
1) create a little Python script to separate the original words in many
files, each one containing only words of the same length. Every
filename can contain the relative word length.
2) write a little C program with two nested loops, that scans all the
pairs of the words of a single file, looking for single char
differences using a scanning of the chars. Probably there are many
possible tricks to speed up this search (like separating the words in
subgroups, of using some assembly-derived tricks, etc) but maybe you
don't need them. As C data structure you may simply use a single block
(because you know the length of a single word, so the files produced by
Python can be without spaces and returns).

I have suggested C because if the words are all of the same length then
you have 3^2 = 90 000 000 000 pairs to test.
If you want, before creating the C program you can create a little
Python+Psyco prototype that uses array.array of chars (because
sometimes Psyco uses them quite quickly).

Bye,
bearophile

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: newb: comapring two strings

2006-05-19 Thread bearophileHUGS
 I have suggested C because if the words are all of the same length then
 you have 3^2 = 90 000 000 000 pairs to test.

Sorry, you have (n*(n-1))/2  pairs to test (~ 45 000 000 000).

bearophile

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: newb: comapring two strings

2006-05-19 Thread Paul Rubin
[EMAIL PROTECTED] writes:
  I have suggested C because if the words are all of the same length then
  you have 3^2 = 90 000 000 000 pairs to test.
 
 Sorry, you have (n*(n-1))/2  pairs to test (~ 45 000 000 000).

Still terrible.  Use a better algorithm!
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: newb: comapring two strings

2006-05-19 Thread John Machin
Paul Rubin wrote:
 [EMAIL PROTECTED] writes:
 I have suggested C because if the words are all of the same length then
 you have 3^2 = 90 000 000 000 pairs to test.

 Sorry, you have (n*(n-1))/2  pairs to test (~ 45 000 000 000).

 Still terrible.  Use a better algorithm!

To put all this in perspective, here's a very similar real-world
problem:

You have a customer database with 300,000 records. Some of them are
duplicated, because there are differences in recording the customers'
names, like a one-keystroke typo. The task is to locate pairs of rows
which are likely to be duplicates. 45 billion comaprisons[1] may be
ecxessive.

And here's a clue for a better algorithm: Knuth's TAOCP vol 3 [1973
edition] chapter 6 (searching) third page as if the words were spelled
backwards. If you find yourself reading about soundex, you've
definitely gone too far :-)

[1]: comapring in the subject: is this a game of 'Spot the deliberate
mistake? Is the OP counting a transposition of adjacent characters as
varying by one character?

Cheers

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: newb: comapring two strings

2006-05-19 Thread bearophileHUGS
Paul RubinStill terrible.  Use a better algorithm!

I agree, it's O(n^2), but if you need to run this program just 1 time,
and the program is in C, and you don't want to use much time to think
and code a better algorithm (or you aren't able to do it) then maybe
that naive solution can be enough, (if the running time is around 30-90
minutes).
Total time to solve a one-shot problem is often the sum of thinking
time + programming time + running time :-)


Paul RubinUse this to generate all the variants of all the words in
the dictionary, and write those out into a file, each line containing a
variant plus the original word.  Then use a sorting utility like the
unix sort program to sort the file.  Those programs work efficiently
even if the file is too large to fit in memory.  Then read the sorted
file to find equivalence classes on the variants.

If the words are all 5 character long, then I think you are creating:
(len(lowercase)-1) * 5 * 3 = 375
strings of len 5+5 (plus enter if you save it into a file), this means
a file of about 37 MB. All those trings may even fit in memory if you
have lot of it.
Your solution looks nice :-) (It reminds me a little the word signature
trick used to find anagrams).

Bye,
bearophile

-- 
http://mail.python.org/mailman/listinfo/python-list


newb: comapring two strings

2006-05-18 Thread manstey
Hi,

Is there a clever way to see if two strings of the same length vary by
only one character, and what the character is in both strings.

E.g. str1=yaqtil str2=yaqtel

they differ at str1[4] and the difference is ('i','e')

But if there was str1=yiqtol and str2=yaqtel, I am not interested.

can anyone suggest a simple way to do this?

My next problem is, I have a list of 300,000+ words and I want to find
every pair of such strings. I thought I would first sort on length of
string, but how do I iterate through the following:

str1
str2
str3
str4
str5

so that I compare str1  str2, str1  str3, str 1  str4, str1  str5,
str2  str3, str3  str4, str3  str5, str4  str5.

Thanks in advance,
Matthew

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: newb: comapring two strings

2006-05-18 Thread Diez B. Roggisch
 Is there a clever way to see if two strings of the same length vary by
 only one character, and what the character is in both strings.
 
 E.g. str1=yaqtil str2=yaqtel
 
 they differ at str1[4] and the difference is ('i','e')
 
 But if there was str1=yiqtol and str2=yaqtel, I am not interested.
 
 can anyone suggest a simple way to do this?

Use the levenshtein distance.
http://en.wikisource.org/wiki/Levenshtein_distance


 My next problem is, I have a list of 300,000+ words and I want to find
 every pair of such strings. I thought I would first sort on length of
 string, but how do I iterate through the following:
 
 str1
 str2
 str3
 str4
 str5
 
 so that I compare str1  str2, str1  str3, str 1  str4, str1  str5,
 str2  str3, str3  str4, str3  str5, str4  str5.

decorate-sort-undecorate is the idion for this

l = list of strings

l = [(len(w), w) for w in l]
l.sort()
l = [w for _, w in l]


Diez
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: newb: comapring two strings

2006-05-18 Thread Justin Azoff
manstey wrote:
 Hi,

 Is there a clever way to see if two strings of the same length vary by
 only one character, and what the character is in both strings.

 E.g. str1=yaqtil str2=yaqtel

 they differ at str1[4] and the difference is ('i','e')

something like this maybe?

 str1='yaqtil'
 str2='yaqtel'
 set(enumerate(str1)) ^ set(enumerate(str2))
set([(4, 'e'), (4, 'i')])


-- 
- Justin

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: newb: comapring two strings

2006-05-18 Thread Serge Orlov
manstey wrote:
 Hi,

 Is there a clever way to see if two strings of the same length vary by
 only one character, and what the character is in both strings.

 E.g. str1=yaqtil str2=yaqtel

 they differ at str1[4] and the difference is ('i','e')

 But if there was str1=yiqtol and str2=yaqtel, I am not interested.

 can anyone suggest a simple way to do this?

 My next problem is, I have a list of 300,000+ words and I want to find
 every pair of such strings. I thought I would first sort on length of
 string, but how do I iterate through the following:

 str1
 str2
 str3
 str4
 str5

 so that I compare str1  str2, str1  str3, str 1  str4, str1  str5,
 str2  str3, str3  str4, str3  str5, str4  str5.

If your strings are pretty short you can do it like this even without
sorting by length first:

def fuzzy_keys(s):
for pos in range(len(s)):
yield s[0:pos]+chr(0)+s[pos+1:]

def fuzzy_insert(d, s):
for fuzzy_key in fuzzy_keys(s):
if fuzzy_key in d:
strings = d[fuzzy_key]
if type(strings) is list:
strings += s
else:
d[fuzzy_key] = [strings, s]
else:
d[fuzzy_key] = s

def gather_fuzzy_matches(d):
for strings in d.itervalues():
if type(strings) is list:
yield strings

acc = {}
fuzzy_insert(acc, yaqtel)
fuzzy_insert(acc, yaqtil)
fuzzy_insert(acc, oaqtil)
print list(gather_fuzzy_matches(acc))

prints

[['yaqtil', 'oaqtil'], ['yaqtel', 'yaqtil']]

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: newb: comapring two strings

2006-05-18 Thread johnzenger

manstey wrote:
 Hi,

 Is there a clever way to see if two strings of the same length vary by
 only one character, and what the character is in both strings.

You want zip.

def diffbyonlyone(string1, string2):
diffcount = 0
for c1, c2 in zip(string1, string2):
if c1 != c2:
diffcount += 1
if diffcount  1:
return False
return diffcount == 1

print diffbyonlyone(yaqtil,yaqtel) # True
print diffbyonlyone(yiqtol,yaqtel) # False

If your strings are long, it might be faster/more memory efficient to
use itertools.izip instead.

 My next problem is, I have a list of 300,000+ words and I want to find
 every pair of such strings. I thought I would first sort on length of
 string, but how do I iterate through the following:

 str1
 str2
 str3
 str4
 str5

 so that I compare str1  str2, str1  str3, str 1  str4, str1  str5,
 str2  str3, str3  str4, str3  str5, str4  str5.

for index1 in xrange(len(words)):
for index2 in xrange(index1+1,len(words)):
if diffbyonlyone(words[index1], words[index2]):
print words[index1] +  --  + words[index2]

...but by all means run that only on sets of words that you have
already identified, pursuant to some criteria like word length, to be
likely matches.  Do the math; that's a lot of comparisons!

-- 
http://mail.python.org/mailman/listinfo/python-list