On 17/06/2009 1:19 AM, Christophe Leske wrote:

>>> So far ,  so good, but my client also expects ANY simplification of a
>>> character to be recognized:
>>> Cote d'azur for instance  should return "Côte d'azur"
>>> or the Sao Paulo issue - how can a search for "Sao Paulo" return "Sào
>>> Paulo" in the result set?
>>>     
>> How are these examples different from previous ones?
>>   
> I am sorry, but I find this to be quite obvious?
> Here, the problematic char is to be found in the *result set*, not in 
> the query itself.

It's NOT different. You need to map BOTH your database values and your 
query values into the same space and then compare them. Don't fall into 
the trap of assuming that your database is correctly accented.

> 
> How do you educate SQlite to return me "Sào Paulo" if only "Sao Paulo" 
> is being entered?
> How do I know which character to substitute with a placeholder?
> 
> Is it
> S%o Paulo to look for?
> Or Sa% Paulo?
> Or Sao P%ulo?
> 
> I can't know this beforehand. These are just examples, i need a generic 
> solution if possivble.
> 
> All i can see so far is to build a table of all special characters ever 
> used in the 24000 names of cities which make problems and remap them 
> accordingly.

That's exactly what you need. And you're not the first person with this 
problem. See for example 
http://mail.python.org/pipermail/python-list/2008-July/669592.html

The technique discussed there starts off with using the unicodedata 
database and finding dynamically (and caching) Unicode characters that 
can be decomposed into a basic latin letter plus one or more accents, 
backed up by a table of cases not found by that technique. Great for 
likers of clever code who have lots of CPU and disk space (unicodeddata 
is huge!) to spare.

I have developed a table which maps most latin-decorated Unicode 
characters into the non-decorated basic form. Sometimes 2 ASCII 
characters will be produced (e.g. latin capital letter thorn -> "Th") 
but latin small letter u with diaeresis -> "u" -- not "ue" which is 
German-specific.

I can let you have a copy if you are interested. What is your 
implementation language? C/C++?

BTW someone mentioned smashing everything into lowercase for comparison 
purposes at some stage -- I'd suggest uppercase especially if you have a 
few of the good old eszett in your data :-)

BTW2: The only sane usage of soundex IMHO is as a strawman when 
proposing phonetic matching algorithms like NYSIIS and [Double 
]Metaphone :-)

HTH,
John
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to