Re: Counting elements in a list wildcard

2006-04-26 Thread Iain King

Edward Elliott wrote:
> John Machin wrote:
> > On 25/04/2006 6:26 PM, Iain King wrote:
> > > iain = re.compile("(Ia(i)?n|Eoin)")
> > > steven = re.compile("Ste(v|ph|f)(e|a)n")
> >
> > IMHO, the amount of hand-crafting that goes into a *general-purpose*
> > phonetic matching algorithm is already bordering on overkill. Your
> > method using REs would not appear to scale well at all.
>
> Also compare the readability of regular expressions in this case to a simple
> list:
> ["Steven", "Stephen", "Stefan", "Stephan", ...]

Somehow I'm the advocate for REs here, which: erg. But you have some
mighty convenient ellipses there...
compare:

steven = re.compile("Ste(v|ph|f|ff)(e|a)n")
steven = ["Steven", "Stephen", "Stefen", "Steffen", "Stevan",
          "Stephan", "Stefan", "Steffan"]

I know which I'd rather type.  'Course, if you can use a ready-built
list of names...

Iain

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Counting elements in a list wildcard

2006-04-26 Thread Edward Elliott
Iain King wrote:
> steven = re.compile("Ste(v|ph|f|ff)(e|a)n")
> steven = ["Steven", "Stephen", "Stefen", "Steffen", "Stevan",
>           "Stephan", "Stefan", "Steffan"]
>
> I know which I'd rather type.  'Course, if you can use a ready-built
> list of names...

Oh I agree, I'd rather *type* the former, but I'd rather *read* the
latter. :)
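
One way to get both, sketched under assumptions: keep the readable list
as the source of truth and build the pattern from it. The helper name
pattern_from_names is made up for this illustration; the spellings are
the ones listed earlier in the thread.

```python
import re

# Spellings taken from the thread; extend the list as needed.
STEVEN_SPELLINGS = ["Steven", "Stephen", "Stefen", "Steffen",
                    "Stevan", "Stephan", "Stefan", "Steffan"]

def pattern_from_names(names):
    """Compile a pattern that matches exactly one of the given names.

    Hypothetical helper: re.escape guards against regex metacharacters
    in a name, and \\Z stops "Stevenson" from matching as "Steven".
    """
    alternatives = "|".join(re.escape(name) for name in names)
    return re.compile(r"(?:%s)\Z" % alternatives)

steven = pattern_from_names(STEVEN_SPELLINGS)
print(bool(steven.match("Steffan")))    # True
print(bool(steven.match("Stevenson")))  # False
```

The list stays easy to read and review, and the RE is never typed by
hand.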


Re: Counting elements in a list wildcard

2006-04-26 Thread Edward Elliott
Iain King wrote:
> steven = re.compile("Ste(v|ph|f|ff)(e|a)n")

Also you can expand the RE a bit to improve readability:

re.compile("(Stev|Steph|Stef|Steff)(en|an)")
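
Readability can be pushed a little further with the re.VERBOSE flag,
which ignores whitespace inside the pattern and allows comments. This is
only a sketch of that idea; the trailing \Z is an addition not present
in the pattern above, included so that longer names are rejected.

```python
import re

# re.VERBOSE ignores whitespace in the pattern and allows # comments.
steven = re.compile(r"""
    (Stev | Steph | Stef | Steff)   # first part of the name
    (en | an)                       # ending
    \Z                              # assumption: nothing may follow
    """, re.VERBOSE)

for name in ["Stephan", "Steffen", "Steve", "Stevenson"]:
    print(name, bool(steven.match(name)))
```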




Re: Counting elements in a list wildcard

2006-04-25 Thread Iain King

hawkesed wrote:
> If I have a list, say of names. And I want to count all the people
> named, say, Susie, but I don't care exactly how they spell it (ie,
> Susy, Susi, Susie all work.) how would I do this? Set up a regular
> expression inside the count? Is there a wildcard variable I can use?
> Here is the code for the non-fuzzy way:
> lstNames.count("Susie")
> Any ideas? Is this something you wouldn't expect count to do?
> Thanks y'all from a newbie.
> Ed

Dare I suggest using REs?  This looks like something they'd be good
for:

import re

def countMatches(names, namePattern):
    count = 0
    for name in names:
        if namePattern.match(name):
            count += 1
    return count

susie = re.compile("Su(s|z)(i|ie|y)")

print(countMatches(["John", "Suzy", "Peter", "Steven", "Susie",
                    "Susi"], susie))


some other patterns:

iain = re.compile("(Ia(i)?n|Eoin)")
steven = re.compile("Ste(v|ph|f)(e|a)n")
john = re.compile("Jo(h)?n")
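
One caveat with the function above: match() only anchors at the start of
the string, so a longer name that merely begins with a Susie spelling is
counted too. A minimal variant, assuming Python 3.4 or later (for
fullmatch) and assuming case should be ignored; the name count_matches
and the test name "Susieanne" are made up for this illustration.

```python
import re

def count_matches(names, name_pattern):
    """Count the names that the pattern matches in full."""
    return sum(1 for name in names if name_pattern.fullmatch(name))

# Same alternatives as the susie pattern above, case-insensitive.
susie = re.compile(r"Su(s|z)(i|ie|y)", re.IGNORECASE)

names = ["John", "Suzy", "Peter", "Steven", "Susie", "Susi", "Susieanne"]
print(count_matches(names, susie))  # 3: Suzy, Susie, Susi
```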


Iain



Re: Counting elements in a list wildcard

2006-04-25 Thread John Machin
On 25/04/2006 3:15 PM, Edward Elliott wrote:
> Phoneme matching seems overly complex and might
> grab things like Tsu-zi.

It might *only* if somebody had a rush of blood to the head and devised
yet another phonetic key algorithm. Tsuzi does *not* give the same
result as any of Suzi, Suzie, Susi, and Susie when pushed through any of
the following: Soundex, NYSIIS, Metaphone, Dolby, and Caverphone. None
of them throw away the 'T' sound.
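
To make the point concrete for one of those algorithms, here is a
minimal sketch of the common American Soundex rules (first letter kept,
consonants mapped to digits, adjacent duplicates collapsed, H and W
transparent, key padded to four characters). It is an illustration
written for this thread, not any particular library's implementation,
and it says nothing about the other four algorithms.

```python
# Letter groups and their Soundex digits; vowels, H, W and Y get no digit.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _letter in _letters:
        _CODES[_letter] = _digit

def soundex(name):
    """Return a four-character Soundex key for name ('' if no letters)."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    key = letters[0]
    previous = _CODES.get(letters[0], "")
    for c in letters[1:]:
        code = _CODES.get(c, "")
        if code and code != previous:
            key += code
        if c not in "HW":          # H and W do not separate duplicates
            previous = code
    return (key + "000")[:4]

for name in ["Suzi", "Suzie", "Susi", "Susie", "Tsuzi"]:
    print(name, soundex(name))
```

The four Susie spellings all produce S200, while Tsuzi produces T220,
because the key always starts with the first letter of the name.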



Re: Counting elements in a list wildcard

2006-04-25 Thread John Machin
On 25/04/2006 6:26 PM, Iain King wrote:
> hawkesed wrote:
> > If I have a list, say of names. And I want to count all the people
> > named, say, Susie, but I don't care exactly how they spell it (ie,
> > Susy, Susi, Susie all work.) how would I do this? Set up a regular
> > expression inside the count? Is there a wildcard variable I can use?
> > Here is the code for the non-fuzzy way:
> > lstNames.count("Susie")
> > Any ideas? Is this something you wouldn't expect count to do?
> > Thanks y'all from a newbie.
> > Ed
>
> Dare I suggest using REs?  This looks like something they'd be good
> for:
>
> import re
>
> def countMatches(names, namePattern):
>     count = 0
>     for name in names:
>         if namePattern.match(name):
>             count += 1
>     return count
>
> susie = re.compile("Su(s|z)(i|ie|y)")
>
> print(countMatches(["John", "Suzy", "Peter", "Steven", "Susie",
>                     "Susi"], susie))
>
>
> some other patterns:
>
> iain = re.compile("(Ia(i)?n|Eoin)")
> steven = re.compile("Ste(v|ph|f)(e|a)n")

What about Steffan, Etienne, Esteban, István, ... ?

> john = re.compile("Jo(h)?n")
>

IMHO, the amount of hand-crafting that goes into a *general-purpose* 
phonetic matching algorithm is already bordering on overkill. Your 
method using REs would not appear to scale well at all.


Re: Counting elements in a list wildcard

2006-04-25 Thread Iain King

John Machin wrote:
> On 25/04/2006 6:26 PM, Iain King wrote:
> > hawkesed wrote:
> > > If I have a list, say of names. And I want to count all the people
> > > named, say, Susie, but I don't care exactly how they spell it (ie,
> > > Susy, Susi, Susie all work.) how would I do this? Set up a regular
> > > expression inside the count? Is there a wildcard variable I can use?
> > > Here is the code for the non-fuzzy way:
> > > lstNames.count("Susie")
> > > Any ideas? Is this something you wouldn't expect count to do?
> > > Thanks y'all from a newbie.

<snip>

> > steven = re.compile("Ste(v|ph|f)(e|a)n")
>
> What about Steffan, Etienne, Esteban, István, ... ?


Well, obviously these could be included:
(Ste(v|ph|f)(e|a)n|Steffan|Etienne|Esteban), but the OP never said he
wanted to translate anything into another language.  He just wanted to
catch variant spellings.

> > john = re.compile("Jo(h)?n")
> >
>
> IMHO, the amount of hand-crafting that goes into a *general-purpose*
> phonetic matching algorithm is already bordering on overkill. Your
> method using REs would not appear to scale well at all.

Iain



Re: Counting elements in a list wildcard

2006-04-25 Thread John Machin
On 25/04/2006 8:51 PM, Iain King wrote:
> John Machin wrote:
> > On 25/04/2006 6:26 PM, Iain King wrote:
> > > hawkesed wrote:
> > > > If I have a list, say of names. And I want to count all the people
> > > > named, say, Susie, but I don't care exactly how they spell it (ie,
> > > > Susy, Susi, Susie all work.) how would I do this? Set up a regular
> > > > expression inside the count? Is there a wildcard variable I can use?
> > > > Here is the code for the non-fuzzy way:
> > > > lstNames.count("Susie")
> > > > Any ideas? Is this something you wouldn't expect count to do?
> > > > Thanks y'all from a newbie.
>
> <snip>
>
> > > steven = re.compile("Ste(v|ph|f)(e|a)n")
> > What about Steffan, Etienne, Esteban, István, ... ?
>
>
> well, obviously these could be included:
> (Ste(v|ph|f)(e|a)n|Steffan|Etienne|Esteban), but the OP never said he
> wanted to translate anything into another language.

Neither did I. But if you have to cope with a practical situation like 
where the birth certificate says István and the job application says 
Steven and the foreman calls him Steve, you won't be stuffing about with 
hand-crafted REs, one per popular given name. Could be worse: the punter 
could have looked up a dictionary and changed his surname from Kovács to 
Smith; believe me -- it happens.

Oh and if you cast your net as wide as the Pacific islands, chuck in 
Sitiveni. That's enough examples. We won't go near Benjamin :-)
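
For cases like these, where no spelling rule connects the variants, one
option is a plain lookup table that maps every known variant to a
canonical form. The sketch below is an illustration only: the table
holds just the Steven variants mentioned in this thread, and the names
CANONICAL, canonical_name and count_canonical are made up. A real
system would need a far larger, curated table.

```python
# Variant -> canonical name; every entry here comes from the thread.
CANONICAL = {
    "steven": "Steven", "stephen": "Steven", "stefan": "Steven",
    "stephan": "Steven", "steve": "Steven", "istván": "Steven",
    "sitiveni": "Steven",
}

def canonical_name(name):
    """Return the canonical form of a name, or the name itself if unknown."""
    return CANONICAL.get(name.strip().lower(), name)

def count_canonical(names, target):
    """Count the names whose canonical form equals target."""
    return sum(1 for n in names if canonical_name(n) == target)

people = ["István", "Steve", "Bob", "Steven", "Sitiveni"]
print(count_canonical(people, "Steven"))  # 4
```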



Re: Counting elements in a list wildcard

2006-04-25 Thread Edward Elliott
John Machin wrote:
> On 25/04/2006 6:26 PM, Iain King wrote:
> > iain = re.compile("(Ia(i)?n|Eoin)")
> > steven = re.compile("Ste(v|ph|f)(e|a)n")
>
> IMHO, the amount of hand-crafting that goes into a *general-purpose*
> phonetic matching algorithm is already bordering on overkill. Your
> method using REs would not appear to scale well at all.

Also compare the readability of regular expressions in this case to a simple
list:
["Steven", "Stephen", "Stefan", "Stephan", ...]


Re: Counting elements in a list wildcard

2006-04-25 Thread Edward Elliott
John Machin wrote:
> On 25/04/2006 3:15 PM, Edward Elliott wrote:
> > Phoneme matching seems overly complex and might
> > grab things like Tsu-zi.
>
> It might *only* if somebody had a rush of blood to the head and devised
> yet another phonetic key algorithm. Tsuzi does *not* give the same
> result as any of Suzi, Suzie, Susi, and Susie when pushed through any of
> the following; Soundex, NYSIIS, Metaphone, Dolby, and Caverphone. None
> of them throw away the 'T' sound.

Spelling isn't phonetic.  The 't' character doesn't necessarily affect
pronunciation.  Or it may affect pronunciation in a way the soundex
doesn't understand (think tonal languages).  Latinizing foreign languages
raises all sorts of problems.

A soundex is only as good as its pronunciation database.  It may work well
in many situations, but it isn't fool-proof.



Counting elements in a list wildcard

2006-04-24 Thread hawkesed
If I have a list, say of names. And I want to count all the people
named, say, Susie, but I don't care exactly how they spell it (ie,
Susy, Susi, Susie all work.) how would I do this? Set up a regular
expression inside the count? Is there a wildcard variable I can use?
Here is the code for the non-fuzzy way:
lstNames.count("Susie")
Any ideas? Is this something you wouldn't expect count to do?
Thanks y'all from a newbie.
Ed



RE: Counting elements in a list wildcard

2006-04-24 Thread Ryan Ginstrom
 
On Behalf Of hawkesed
> If I have a list, say of names. And I want to count all the people
> named, say, Susie, but I don't care exactly how they spell it (ie,
> Susy, Susi, Susie all work.) how would I do this? Set up a regular
> expression inside the count? Is there a wildcard variable I can use?
> Here is the code for the non-fuzzy way:
> lstNames.count("Susie")
> Any ideas? Is this something you wouldn't expect count to do?
> Thanks y'all from a newbie.

If there are specific spellings you want to allow, you could just create a
list of them and see if your Suzy is in there:

>>> possible_suzys = [ 'Susy', 'Susi', 'Susie' ]
>>> my_strings = ['Bob', 'Sally', 'Susi', 'Dick', 'Jane' ]
>>> for line in my_strings:
...     if line in possible_suzys: print(line)
...
Susi


I think a general solution to this problem is to use edit (also called
Levenshtein) distance. There is an implementation in Python at this Wiki:
http://en.wikisource.org/wiki/Levenshtein_distance

You could use this distance function, and normalize for string length using
the following score function:

def score( a, b ):
    """Calculates the similarity score of the two strings based on edit
    distance."""
    high_len = max( len(a), len(b) )
    return float( high_len - distance( a, b ) ) / float( high_len )

>>> for line in my_strings:
...     if score( line, 'Susie' ) > .75: print(line)
...
Susi
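
The score function above relies on a distance function from the linked
page, which is not reproduced here. As a stand-in, this is a minimal
sketch of the standard dynamic-programming Levenshtein distance; it is
not the code from that wiki. The guard for two empty strings is an
addition.

```python
def distance(a, b):
    """Levenshtein distance, computed one row of the table at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                # delete ca
                current[j - 1] + 1,             # insert cb
                previous[j - 1] + (ca != cb)))  # substitute (free if equal)
        previous = current
    return previous[-1]

def score(a, b):
    """Similarity in [0, 1]: 1.0 for identical strings."""
    high_len = max(len(a), len(b))
    if high_len == 0:
        return 1.0  # assumption: two empty strings count as identical
    return (high_len - distance(a, b)) / float(high_len)

my_strings = ['Bob', 'Sally', 'Susi', 'Dick', 'Jane']
print([s for s in my_strings if score(s, 'Susie') > .75])  # ['Susi']
```

'Susi' is one insertion away from 'Susie', so its score is
(5 - 1) / 5 = 0.8, which clears the 0.75 threshold.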

--
Regards,
Ryan Ginstrom



Re: Counting elements in a list wildcard

2006-04-24 Thread Ben Finney
Ryan Ginstrom writes:

> If there are specific spellings you want to allow, you could just
> create a list of them and see if your Suzy is in there:
>
> >>> possible_suzys = [ 'Susy', 'Susi', 'Susie' ]
> >>> my_strings = ['Bob', 'Sally', 'Susi', 'Dick', 'Jane' ]
> >>> for line in my_strings:
> ...     if line in possible_suzys: print(line)
> ...
> Susi

If you wanted to do something later, rather than only during the scan
over the list, getting a list of suzies would probably be more useful:

>>> possible_suzys = [ 'Susy', 'Susi', 'Susie' ]
>>> my_strings = ['Bob', 'Sally', 'Susi', 'Dick', 'Susy', 'Jane' ]
>>> found_suzys = [s for s in my_strings if s in possible_suzys]
>>> found_suzys
['Susi', 'Susy']
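
Since the original question was about counting, the same membership
test gives the count directly. A small sketch; storing the spellings in
a set is a choice made here (membership tests on a set do not scan the
whole collection), not something the posts above require.

```python
possible_suzys = {'Susy', 'Susi', 'Susie'}   # set instead of list
my_strings = ['Bob', 'Sally', 'Susi', 'Dick', 'Susy', 'Jane']

# Count the matches without building an intermediate list.
suzy_count = sum(1 for s in my_strings if s in possible_suzys)
print(suzy_count)  # 2
```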

-- 
 \The number of UNIX installations has grown to 10, with more |
  `\ expected.  -- Unix Programmer's Manual, 2nd Ed., 12-Jun-1972 |
_o__)  |
Ben Finney



Re: Counting elements in a list wildcard

2006-04-24 Thread Dave Hughes
hawkesed wrote:

> If I have a list, say of names. And I want to count all the people
> named, say, Susie, but I don't care exactly how they spell it (ie,
> Susy, Susi, Susie all work.) how would I do this? Set up a regular
> expression inside the count? Is there a wildcard variable I can use?
> Here is the code for the non-fuzzy way:
> lstNames.count("Susie")
> Any ideas? Is this something you wouldn't expect count to do?
> Thanks y'all from a newbie.
> Ed

You might want to check out the SoundEx and MetaPhone algorithms which
provide approximations of the sound of a word based on spelling
(assuming English pronunciations).

Apparently a soundex module used to be built into Python but was
removed in 2.0. You can find several implementations on the 'net, for
example:

http://orca.mojam.com/~skip/python/soundex.py
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/52213

MetaPhone is generally considered better than SoundEx for sounds-like
matching, although it's considerably more complex (IIRC, although it's
been a long time since I wrote an implementation of either in any
language). A Python MetaPhone implementation (there must be more than
this one?):

http://joelspeters.com/awesomecode/

Another algorithm that might interest isn't based on sounds-like but
instead computes the number of transforms necessary to get from one
word to another: the Levenshtein distance. A C based implementation
(with Python interface) is available:

http://trific.ath.cx/resources/python/levenshtein/

Whichever algorithm you go with, you'll wind up with some sort of
similar function which could be applied in a similar manner to Ben's
example (I've just mocked up the following -- it's not an actual
session):

>>> import soundex
>>> import metaphone
>>> import levenshtein
>>> my_strings = ['Bob', 'Sally', 'Susi', 'Dick', 'Susy', 'Jane' ]
>>> found_suzys = [s for s in my_strings if
...     soundex.sounds_similar(s, 'Susy')]
>>> found_suzys = [s for s in my_strings if
...     metaphone.sounds_similar(s, 'Susy')]
>>> found_suzys = [s for s in my_strings if levenshtein.distance(s,
...     'Susy') < 4]
>>> found_suzys
['Susi', 'Susy'] (one hopes anyway!)


HTH,

Dave.
-- 



Re: Counting elements in a list wildcard

2006-04-24 Thread Edward Elliott
Dave Hughes wrote:
> Another algorithm that might interest isn't based on sounds-like but
> instead computes the number of transforms necessary to get from one
> word to another: the Levenshtein distance. A C based implementation
> (with Python interface) is available:

I don't know what algorithm it uses, but the difflib module looks similar. 
I've had good results using the get_close_matches function to locate
similarly-named mp3 files.

However I don't think close enough is well suited for this application. 
The sequences are short and non-distinct.  Difference matching needs longer
sequences to be effective.  Phoneme matching seems overly complex and might
grab things like Tsu-zi.  I'd just use a list of alternate spellings like
Ben suggested.
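
A small sketch of the difflib route, using the standard library's
get_close_matches with its default cutoff of 0.6 spelled out. The name
list is the one from Ben's example; the variable names are made up.

```python
import difflib

names = ['Bob', 'Sally', 'Susi', 'Dick', 'Susy', 'Jane']

# n caps the number of results; cutoff is the minimum similarity ratio.
close = difflib.get_close_matches('Susie', names, n=len(names), cutoff=0.6)
print(close)

# A plausible spelling that falls below the default cutoff.
missed = difflib.get_close_matches('Susie', ['Suzy'], cutoff=0.6)
print(missed)
```

With these inputs 'Susi' and 'Susy' are returned, but 'Suzy' is not:
it shares only the leading "Su" with 'Susie', which illustrates the
concern about short sequences.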

