I can't remember the federal agency that tracks names, but they have a phonetic 
system that is useful when you're doing genealogy research. Their system breaks 
the words down a bit and eliminates doubles, accented characters, and some 
other things as well, IIRC.

Wish I could remember the name of the place. Maybe the National Archives? 

Leam

--------------------------------------------
Leam Hall       
Linux/Unix/Network Administrator
Contractor for Smartronix  ( CMMI Level 3 ISO 9001:2000 FS 91000 )
Com:    229.639.6028
Cell:   704.607.6747
Email:  leam.hall....@usmc.mil
--------------------------------------------  


-----Original Message-----
From: talk-boun...@lists.nyphp.org [mailto:talk-boun...@lists.nyphp.org] On 
Behalf Of Brent Baisley
Sent: Monday, November 01, 2010 17:01
To: NYPHP Talk
Subject: Re: [nyphp-talk] Squashing accented characters

If you are using MySQL on the backend, you can make your table UTF8, and your 
indexes will then use the utf8_general_ci collation by default. That collation 
treats accented and unaccented forms of a letter as equal when it compares and 
indexes the data. So if you search for Dusseldorf or Düsseldorf, both will come 
up with the same set of records. Then you don't have to do anything on the PHP 
side.
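
To illustrate the idea of letting the database collation do the matching, here is a rough sketch in Python. It uses an in-memory SQLite database with a custom collation as a stand-in, since it is self-contained; it is not MySQL and does not reproduce utf8_general_ci exactly. The names `noaccent_ci`, `places`, and `find_place` are made up for this example.

```python
import sqlite3
import unicodedata

def strip_accents(s):
    # Decompose characters (e.g. "ü" -> "u" + combining diaeresis),
    # then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def noaccent_ci(a, b):
    # Hypothetical collation: compare strings ignoring accents and case.
    # Must return a negative, zero, or positive integer.
    ka, kb = strip_accents(a).casefold(), strip_accents(b).casefold()
    return (ka > kb) - (ka < kb)

conn = sqlite3.connect(":memory:")
conn.create_collation("noaccent_ci", noaccent_ci)

# The column carries the collation, so comparisons and the index both use it.
conn.execute("CREATE TABLE places (name TEXT COLLATE noaccent_ci)")
conn.execute("CREATE INDEX idx_places_name ON places (name)")
conn.executemany(
    "INSERT INTO places (name) VALUES (?)",
    [("Düsseldorf",), ("Berlin",)],
)

def find_place(term):
    # No squashing in application code: the collation handles the match.
    rows = conn.execute("SELECT name FROM places WHERE name = ?", (term,))
    return [r[0] for r in rows]
```

With this setup, `find_place("Dusseldorf")` and `find_place("Düsseldorf")` return the same row, which is the behavior described above for the MySQL collation.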

Regards,
Brent

On Oct 22, 2010, at 2:50 PM, Paul A Houle wrote:


        For my site at
        
        http://ookaboo.com/
        
        I'm running into the problem that people are searching for "Dusseldorf"
        but the name of the place is "Düsseldorf", so they don't find it.
        
        It seems to me a good answer to this is to have some function that
        squashes accented characters down to unaccented forms. I'd index the
        unaccented forms and also squash down queries so they'd always match up.
        I definitely need to do both ISO-Latin-1 and Latin Extended-A, because
        fate has given me a lot of place names that have the Polish dark L in
        them (ł <http://fileformat.info/info/unicode/char/0142/> ). It also
        seems like there are a lot of characters in Latin Extended-B that would
        map plausibly to unaccented characters.
        
        I can see how to write something like this: I'd need to parse out the
        Unicode code points from UTF-8 and run them through a lookup table. But
        it's a lot of details, and I wonder if anybody has already written a PHP
        function to do this.
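
A minimal sketch of the squash-and-index idea described above, written in Python for illustration rather than PHP. It relies on Unicode decomposition for most accented letters and falls back to a small lookup table for characters such as ł that have no decomposition. The names `EXTRA_MAP`, `squash_accents`, and `search_key` are hypothetical, and the table is deliberately incomplete.

```python
import unicodedata

# Characters with no Unicode decomposition need explicit entries.
# Illustrative only: a real table would cover far more of
# Latin Extended-A and Latin Extended-B.
EXTRA_MAP = str.maketrans({
    "ł": "l", "Ł": "L",
    "ø": "o", "Ø": "O",
    "đ": "d", "Đ": "D",
    "æ": "ae", "Æ": "AE",
    "ß": "ss",
})

def squash_accents(text):
    # Step 1: apply the explicit lookup table.
    text = text.translate(EXTRA_MAP)
    # Step 2: decompose the rest (e.g. "ü" -> "u" + combining mark)
    # and drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def search_key(text):
    # Apply the same squashing to indexed names and to queries,
    # and fold case so the two always line up.
    return squash_accents(text).casefold()
```

Indexing `search_key(name)` and looking up `search_key(query)` makes "Dusseldorf" and "Düsseldorf" land on the same key. The explicit table is where the "lot of details" ends up, since each character without a decomposition needs its own entry.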
        
        _______________________________________________
        New York PHP Users Group Community Talk Mailing List
        http://lists.nyphp.org/mailman/listinfo/talk
        
        http://www.nyphp.org/Show-Participation


