Re: Counting elements in a list wildcard
Edward Elliott wrote:
> John Machin wrote:
>> On 25/04/2006 6:26 PM, Iain King wrote:
>>> iain = re.compile("(Ia(i)?n|Eoin)")
>>> steven = re.compile("Ste(v|ph|f)(e|a)n")
>>
>> IMHO, the amount of hand-crafting that goes into a *general-purpose* phonetic matching algorithm is already bordering on overkill. Your method using REs would not appear to scale well at all.
>
> Also compare the readability of regular expressions in this case to a simple list:
> ["Steven", "Stephen", "Stefan", "Stephan", ...]

Somehow I'm the advocate for REs here, which: erg. But you have some mighty convenient ellipses there... compare:

    steven = re.compile("Ste(v|ph|f|ff)(e|a)n")
    steven = ["Steven", "Stephen", "Stefen", "Steffen", "Stevan", "Stephan", "Stefan", "Steffan"]

I know which I'd rather type. 'Course, if you can use a ready-built list of names...

Iain
--
http://mail.python.org/mailman/listinfo/python-list
Re: Counting elements in a list wildcard
Iain King wrote:
> steven = re.compile("Ste(v|ph|f|ff)(e|a)n")
> steven = ["Steven", "Stephen", "Stefen", "Steffen", "Stevan", "Stephan", "Stefan", "Steffan"]
>
> I know which I'd rather type. 'Course, if you can use a ready-built list of names...

Oh I agree, I'd rather *type* the former, but I'd rather *read* the latter. :)
Re: Counting elements in a list wildcard
Iain King wrote:
> steven = re.compile("Ste(v|ph|f|ff)(e|a)n")

Also you can expand the RE a bit to improve readability:

    re.compile("(Stev|Steph|Stef|Steff)(en|an)")
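A quick way to convince yourself that the expanded pattern accepts the same spellings as the compact one is to run both over the eight names listed earlier in the thread. This is only a minimal sketch: the helper name matched is made up here, and fullmatch assumes Python 3.4 or later.

```python
import re

# The eight spellings from the list-versus-RE comparison in this thread.
names = ["Steven", "Stephen", "Stefen", "Steffen",
         "Stevan", "Stephan", "Stefan", "Steffan"]

compact = re.compile(r"Ste(v|ph|f|ff)(e|a)n")
expanded = re.compile(r"(Stev|Steph|Stef|Steff)(en|an)")

def matched(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [c for c in candidates if pattern.fullmatch(c)]

# True when both patterns accept every name in the list.
same = matched(compact, names) == matched(expanded, names) == names
print(same)
```

Both forms describe the same set of strings, so which one to use is purely a readability call.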
Re: Counting elements in a list wildcard
hawkesed wrote:
> If I have a list, say of names. And I want to count all the people named, say, Susie, but I don't care exactly how they spell it (ie, Susy, Susi, Susie all work.) how would I do this? Set up a regular expression inside the count? Is there a wildcard variable I can use?
> Here is the code for the non-fuzzy way:
> lstNames.count("Susie")
> Any ideas? Is this something you wouldn't expect count to do?
> Thanks y'all from a newbie.
> Ed

Dare I suggest using REs? This looks like something they'd be good for:

    import re

    def countMatches(names, namePattern):
        # Count the names that the compiled pattern matches from the start.
        count = 0
        for name in names:
            if namePattern.match(name):
                count += 1
        return count

    susie = re.compile("Su(s|z)(i|ie|y)")
    print(countMatches(["John", "Suzy", "Peter", "Steven", "Susie", "Susi"], susie))

some other patterns:

    iain = re.compile("(Ia(i)?n|Eoin)")
    steven = re.compile("Ste(v|ph|f)(e|a)n")
    john = re.compile("Jo(h)?n")

Iain
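One thing to watch with the RE approach: match() only anchors at the start of the string, so a longer name that merely begins with a Susie spelling is counted too. A minimal sketch of a stricter variant follows, assuming Python 3.4 or later for fullmatch; the name count_matches and the extra test name "Susiette" are made up for illustration.

```python
import re

def count_matches(names, pattern):
    """Count the names that the compiled pattern matches in full."""
    return sum(1 for name in names if pattern.fullmatch(name))

susie = re.compile(r"Su(s|z)(i|ie|y)")
names = ["John", "Suzy", "Peter", "Steven", "Susie", "Susi", "Susiette"]

total = count_matches(names, susie)                          # whole-name matches only
prefix_total = sum(1 for n in names if susie.match(n))       # also counts "Susiette"
print(total, prefix_total)
```

On older Pythons, appending "$" to the pattern and keeping match() has much the same effect.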
Re: Counting elements in a list wildcard
On 25/04/2006 3:15 PM, Edward Elliott wrote:
> Phoneme matching seems overly complex and might grab things like Tsu-zi.

It might *only* if somebody had a rush of blood to the head and devised yet another phonetic key algorithm. Tsuzi does *not* give the same result as any of Suzi, Suzie, Susi, and Susie when pushed through any of the following: Soundex, NYSIIS, Metaphone, Dolby, and Caverphone. None of them throw away the 'T' sound.
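For anyone who wants to try the Soundex part of that claim, here is a minimal sketch of the common American Soundex rules as I understand them (keep the first letter, code the remaining consonants, collapse adjacent duplicates, pad to four characters). The names CODES and soundex are just for this example, and the other four algorithms are not covered.

```python
# Digit codes for consonants; vowels, y, h and w carry no code.
CODES = {}
for digit, letters in enumerate(["bfpv", "cgjkqsxz", "dt", "l", "mn", "r"], start=1):
    for letter in letters:
        CODES[letter] = str(digit)

def soundex(name):
    """Return a four-character Soundex key for name (sketch, not a reference implementation)."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    key = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        # Emit a digit only when it differs from the previous letter's code.
        if code and code != previous:
            key += code
        # h and w do not break up a run of identical codes; vowels do.
        if ch not in "hw":
            previous = code
    return (key + "000")[:4]

for name in ["Suzi", "Suzie", "Susi", "Susie", "Tsuzi"]:
    print(name, soundex(name))
```

Because the key always starts with the first letter of the name, Tsuzi cannot collide with any of the S spellings under this scheme.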
Re: Counting elements in a list wildcard
On 25/04/2006 6:26 PM, Iain King wrote:
> hawkesed wrote:
>> If I have a list, say of names. And I want to count all the people named, say, Susie, but I don't care exactly how they spell it (ie, Susy, Susi, Susie all work.) how would I do this? Set up a regular expression inside the count? Is there a wildcard variable I can use?
>> Here is the code for the non-fuzzy way:
>> lstNames.count("Susie")
>> Any ideas? Is this something you wouldn't expect count to do?
>> Thanks y'all from a newbie.
>> Ed
>
> Dare I suggest using REs? This looks like something they'd be good for:
>
>     import re
>
>     def countMatches(names, namePattern):
>         count = 0
>         for name in names:
>             if namePattern.match(name):
>                 count += 1
>         return count
>
>     susie = re.compile("Su(s|z)(i|ie|y)")
>     print(countMatches(["John", "Suzy", "Peter", "Steven", "Susie", "Susi"], susie))
>
> some other patterns:
>
>     iain = re.compile("(Ia(i)?n|Eoin)")
>     steven = re.compile("Ste(v|ph|f)(e|a)n")

What about Steffan, Etienne, Esteban, István, ... ?

>     john = re.compile("Jo(h)?n")

IMHO, the amount of hand-crafting that goes into a *general-purpose* phonetic matching algorithm is already bordering on overkill. Your method using REs would not appear to scale well at all.
Re: Counting elements in a list wildcard
John Machin wrote:
> On 25/04/2006 6:26 PM, Iain King wrote:
>> hawkesed wrote:
>>> If I have a list, say of names. And I want to count all the people named, say, Susie, but I don't care exactly how they spell it (ie, Susy, Susi, Susie all work.) how would I do this? Set up a regular expression inside the count? Is there a wildcard variable I can use?
>>> Here is the code for the non-fuzzy way:
>>> lstNames.count("Susie")
>>> Any ideas? Is this something you wouldn't expect count to do?
>>> Thanks y'all from a newbie.
>>
>> <snip>
>>
>> steven = re.compile("Ste(v|ph|f)(e|a)n")
>
> What about Steffan, Etienne, Esteban, István, ... ?

well, obviously these could be included: (Ste(v|ph|f)(e|a)n|Steffan|Etienne|Esteban), but the OP never said he wanted to translate anything into another language. He just wanted to catch variable spellings.

>> john = re.compile("Jo(h)?n")
>
> IMHO, the amount of hand-crafting that goes into a *general-purpose* phonetic matching algorithm is already bordering on overkill. Your method using REs would not appear to scale well at all.

Iain
Re: Counting elements in a list wildcard
On 25/04/2006 8:51 PM, Iain King wrote:
> John Machin wrote:
>> On 25/04/2006 6:26 PM, Iain King wrote:
>>> hawkesed wrote:
>>>> If I have a list, say of names. And I want to count all the people named, say, Susie, but I don't care exactly how they spell it (ie, Susy, Susi, Susie all work.) how would I do this? Set up a regular expression inside the count? Is there a wildcard variable I can use?
>>>> Here is the code for the non-fuzzy way:
>>>> lstNames.count("Susie")
>>>> Any ideas? Is this something you wouldn't expect count to do?
>>>> Thanks y'all from a newbie.
>>>
>>> <snip>
>>>
>>> steven = re.compile("Ste(v|ph|f)(e|a)n")
>>
>> What about Steffan, Etienne, Esteban, István, ... ?
>
> well, obviously these could be included: (Ste(v|ph|f)(e|a)n|Steffan|Etienne|Esteban), but the OP never said he wanted to translate anything into another language.

Neither did I. But if you have to cope with a practical situation like where the birth certificate says István and the job application says Steven and the foreman calls him Steve, you won't be stuffing about with hand-crafted REs, one per popular given name. Could be worse: the punter could have looked up a dictionary and changed his surname from Kovács to Smith; believe me -- it happens. Oh and if you cast your net as wide as the Pacific islands, chuck in Sitiveni. That's enough examples. We won't go near Benjamin :-)
Re: Counting elements in a list wildcard
John Machin wrote:
> On 25/04/2006 6:26 PM, Iain King wrote:
>> iain = re.compile("(Ia(i)?n|Eoin)")
>> steven = re.compile("Ste(v|ph|f)(e|a)n")
>
> IMHO, the amount of hand-crafting that goes into a *general-purpose* phonetic matching algorithm is already bordering on overkill. Your method using REs would not appear to scale well at all.

Also compare the readability of regular expressions in this case to a simple list:

    ["Steven", "Stephen", "Stefan", "Stephan", ...]
Re: Counting elements in a list wildcard
John Machin wrote:
> On 25/04/2006 3:15 PM, Edward Elliott wrote:
>> Phoneme matching seems overly complex and might grab things like Tsu-zi.
>
> It might *only* if somebody had a rush of blood to the head and devised yet another phonetic key algorithm. Tsuzi does *not* give the same result as any of Suzi, Suzie, Susi, and Susie when pushed through any of the following: Soundex, NYSIIS, Metaphone, Dolby, and Caverphone. None of them throw away the 'T' sound.

Spelling isn't phonetic. The 't' character doesn't necessarily affect pronunciation. Or it may affect pronunciation in a way the soundex doesn't understand (think tonal languages). Latinizing foreign languages raises all sorts of problems. A soundex is only as good as its pronunciation database. It may work well in many situations, but it isn't fool-proof.
Counting elements in a list wildcard
If I have a list, say of names. And I want to count all the people named, say, Susie, but I don't care exactly how they spell it (ie, Susy, Susi, Susie all work.) how would I do this? Set up a regular expression inside the count? Is there a wildcard variable I can use?

Here is the code for the non-fuzzy way:

    lstNames.count("Susie")

Any ideas? Is this something you wouldn't expect count to do?
Thanks y'all from a newbie.
Ed
RE: Counting elements in a list wildcard
> On Behalf Of hawkesed
> If I have a list, say of names. And I want to count all the people named, say, Susie, but I don't care exactly how they spell it (ie, Susy, Susi, Susie all work.) how would I do this? Set up a regular expression inside the count? Is there a wildcard variable I can use?
> Here is the code for the non-fuzzy way:
> lstNames.count("Susie")
> Any ideas? Is this something you wouldn't expect count to do?
> Thanks y'all from a newbie.

If there are specific spellings you want to allow, you could just create a list of them and see if your Suzy is in there:

    >>> possible_suzys = [ 'Susy', 'Susi', 'Susie' ]
    >>> my_strings = ['Bob', 'Sally', 'Susi', 'Dick', 'Jane' ]
    >>> for line in my_strings:
    ...     if line in possible_suzys: print(line)
    ...
    Susi

I think a general solution to this problem is to use edit (also called Levenshtein) distance. There is an implementation in Python at this Wiki:
http://en.wikisource.org/wiki/Levenshtein_distance

You could use this distance function, and normalize for string length using the following score function:

    def score( a, b ):
        """Calculates the similarity score of the two strings based on edit distance."""
        high_len = max( len(a), len(b) )
        return float( high_len - distance( a, b ) ) / float( high_len )

    >>> for line in my_strings:
    ...     if score( line, 'Susie' ) > .75: print(line)
    ...
    Susi

--
Regards,
Ryan Ginstrom
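The score function relies on a distance function that isn't shown in the message. In case the wiki page is unavailable, here is a minimal sketch of the textbook dynamic-programming Levenshtein distance (not the code from that page), paired with the same normalisation. The guard for two empty strings is an addition of this sketch.

```python
def distance(a, b):
    """Levenshtein distance, computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]

def score(a, b):
    """Similarity in [0, 1]: 1.0 for identical strings, 0.0 for nothing in common."""
    high_len = max(len(a), len(b))
    if high_len == 0:
        return 1.0  # two empty strings; avoids dividing by zero
    return float(high_len - distance(a, b)) / float(high_len)

my_strings = ['Bob', 'Sally', 'Susi', 'Dick', 'Jane']
close = [s for s in my_strings if score(s, 'Susie') > .75]
print(close)
```

Note that the threshold matters a lot with names this short: 'Susy' is two edits from 'Susie', which gives a score of 0.6 and so falls below .75.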
Re: Counting elements in a list wildcard
Ryan Ginstrom [EMAIL PROTECTED] writes:
> If there are specific spellings you want to allow, you could just create a list of them and see if your Suzy is in there:
>
> >>> possible_suzys = [ 'Susy', 'Susi', 'Susie' ]
> >>> my_strings = ['Bob', 'Sally', 'Susi', 'Dick', 'Jane' ]
> >>> for line in my_strings:
> ...     if line in possible_suzys: print(line)
> ...
> Susi

If you wanted to do something later, rather than only during the scan over the list, getting a list of suzies would probably be more useful:

    >>> possible_suzys = [ 'Susy', 'Susi', 'Susie' ]
    >>> my_strings = ['Bob', 'Sally', 'Susi', 'Dick', 'Susy', 'Jane' ]
    >>> found_suzys = [s for s in my_strings if s in possible_suzys]
    >>> found_suzys
    ['Susi', 'Susy']

--
 \     "The number of UNIX installations has grown to 10, with more |
  `\     expected."  -- Unix Programmer's Manual, 2nd Ed., 12-Jun-1972 |
_o__)                                                                 |
Ben Finney
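Since the original question was about counting, the same membership test gives the count directly. A minimal sketch, using a set for the allowed spellings (a choice made here so lookups stay cheap if the list of spellings grows); the name suzy_count is made up.

```python
possible_suzys = {'Susy', 'Susi', 'Susie'}
my_strings = ['Bob', 'Sally', 'Susi', 'Dick', 'Susy', 'Jane']

# Count every entry that is one of the accepted spellings.
suzy_count = sum(1 for s in my_strings if s in possible_suzys)
print(suzy_count)
```

This is the fuzzy counterpart of lstNames.count("Susie"): exact matching against several spellings instead of one.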
Re: Counting elements in a list wildcard
hawkesed wrote:
> If I have a list, say of names. And I want to count all the people named, say, Susie, but I don't care exactly how they spell it (ie, Susy, Susi, Susie all work.) how would I do this? Set up a regular expression inside the count? Is there a wildcard variable I can use?
> Here is the code for the non-fuzzy way:
> lstNames.count("Susie")
> Any ideas? Is this something you wouldn't expect count to do?
> Thanks y'all from a newbie.
> Ed

You might want to check out the SoundEx and MetaPhone algorithms which provide approximations of the sound of a word based on spelling (assuming English pronunciations).

Apparently a soundex module used to be built into Python but was removed in 2.0. You can find several implementations on the 'net, for example:

http://orca.mojam.com/~skip/python/soundex.py
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52213

MetaPhone is generally considered better than SoundEx for sounds-like matching, although it's considerably more complex (IIRC, although it's been a long time since I wrote an implementation of either in any language).

A Python MetaPhone implementation (there must be more than this one?):

http://joelspeters.com/awesomecode/

Another algorithm that might interest isn't based on sounds-like but instead computes the number of transforms necessary to get from one word to another: the Levenshtein distance. A C based implementation (with Python interface) is available:

http://trific.ath.cx/resources/python/levenshtein/

Whichever algorithm you go with, you'll wind up with some sort of similar function which could be applied in a similar manner to Ben's example (I've just mocked up the following -- it's not an actual session):

    >>> import soundex
    >>> import metaphone
    >>> import levenshtein
    >>> my_strings = ['Bob', 'Sally', 'Susi', 'Dick', 'Susy', 'Jane' ]
    >>> found_suzys = [s for s in my_strings if soundex.sounds_similar(s, 'Susy')]
    >>> found_suzys = [s for s in my_strings if metaphone.sounds_similar(s, 'Susy')]
    >>> found_suzys = [s for s in my_strings if levenshtein.distance(s, 'Susy') < 4]
    >>> found_suzys
    ['Susi', 'Susy']

(one hopes anyway!)

HTH,
Dave.
--
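The point that any of these algorithms ends up as an interchangeable "is this similar?" test can be sketched with a small helper that takes the comparison as an argument. Everything named here is hypothetical: find_similar and skeleton are made up, and same_skeleton is a deliberately crude stand-in (drop vowels and y, treat z as s), not Soundex or Metaphone.

```python
def find_similar(names, target, is_similar):
    """Return the names that the supplied comparison considers similar to target."""
    return [name for name in names if is_similar(name, target)]

def skeleton(name):
    """Crude consonant skeleton: lowercase, z treated as s, vowels and y dropped."""
    lowered = name.lower().replace("z", "s")
    return "".join(ch for ch in lowered if ch not in "aeiouy")

def same_skeleton(a, b):
    """Toy comparison used only to exercise find_similar."""
    return skeleton(a) == skeleton(b)

my_strings = ['Bob', 'Sally', 'Susi', 'Dick', 'Susy', 'Jane']
found_suzys = find_similar(my_strings, 'Susy', same_skeleton)
print(found_suzys)
```

Swapping in a real phonetic key or an edit-distance threshold only means passing a different function as is_similar.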
Re: Counting elements in a list wildcard
Dave Hughes wrote:
> Another algorithm that might interest isn't based on sounds-like but instead computes the number of transforms necessary to get from one word to another: the Levenshtein distance. A C based implementation (with Python interface) is available:

I don't know what algorithm it uses, but the difflib module looks similar. I've had good results using the get_close_matches function to locate similarly-named mp3 files.

However I don't think "close enough" is well suited for this application. The sequences are short and non-distinct. Difference matching needs longer sequences to be effective. Phoneme matching seems overly complex and might grab things like Tsu-zi.

I'd just use a list of alternate spellings like Ben suggested.
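For reference, this is roughly what the difflib route looks like on the sample list from earlier in the thread. A minimal sketch using only the standard library; the variable names and the stricter cutoff of 0.75 are choices made for this example (get_close_matches defaults to n=3 and cutoff=0.6).

```python
import difflib

names = ['Bob', 'Sally', 'Susi', 'Dick', 'Susy', 'Jane']

# Default cutoff (0.6): results come back best match first.
close_default = difflib.get_close_matches('Susie', names)

# A stricter cutoff drops the weaker match.
close_strict = difflib.get_close_matches('Susie', names, n=5, cutoff=0.75)

print(close_default, close_strict)
```

With names this short a single letter moves the ratio a long way ('Susi' scores about 0.89 against 'Susie', 'Susy' about 0.67), which illustrates the concern about short, non-distinct sequences.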