Re: [sqlite] Question about searches

2009-06-17 Thread Olaf Schmidt

"Christophe Leske"  schrieb im
Newsbeitrag news:4a37b811.4030...@multimedial.de...

>> You write your own comparison function that would consider
>> these two strings equal. See sqlite3_create_function,
>> sqlite3_create_collation.
>
> this problem pertains not only to Zürich, but to 24000 other
> entries, so
> ...
> How do you educate SQlite to return me
> "Sào Paulo" if only "Sao Paulo" is being entered?

As already advised above, the best thing is, to write a small
collation-extension for sqlite, which can be registered then
dynamically, before you start working with your data.

An implementation of such a small collation-routine is
pretty easy, if you use the right system-api-call, to compare
your strings - and under windows that is CompareStringW.
http://msdn.microsoft.com/en-us/library/ms647476.aspx

Before passing WStringPointers to that function, you will
have to convert your UTF8 into UTF16 beforehand -
either *after* entering your Collation-Callback with UTF8-
Data using the "officiall" MultiByteToWideChar-API,
using codepage 65001.

Alternatively simply define the correct SQLite-constant
whilst registering your new collation within the engine,
what your string-parameters (passed into your Callback)
are expected to be ... - for Win-WChars use:
SQLITE_UTF16
or
SQLITE_UTF16LE

After that you should be able, to implement this collation-
callback with only a few lines of code.

CompareStringW offers some nice Flags, which will be
useful for you as I see it:
E.g. with a combination of:
NORM_IGNORENONSPACE | NORM_IGNORESYMBOLS |
NORM_IGNORECASE

...and an LCID of 1033 (us-en, which should be available on each system)

...the following comparisons are all evaluated to be identical:
"a" = "Ä"
"Sao Paulo" = "Sào Paulo"
"Cote d azur" = "Côte d'azur"

and with LCID 1031 you will additionally get:
"SS" = "ß"
"ae" = "Ä"

Not that much control as with a selfdefined mapping
of course, but a whole lot more "tolerant" than what
you currently have.

Olaf




___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Question about searches

2009-06-16 Thread Roger Binns
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

John Machin wrote:
> I have developed a table which maps most latin-decorated Unicode 
> characters into the non-decorated basic form. 

This is a fascinating article by Sean Burke (a linguist) about
converting all Unicode characters into US-ASCII.  The conversion is
primarily based on sound, so in theory running soundex on the result
could be somewhat useful.

  http://interglacial.com/~sburke/tpj/as_html/tpj22.html

You can find his tables at this link encoded as perl data structures.


http://cpansearch.perl.org/src/SBURKE/Text-Unidecode-0.04/lib/Text/Unidecode/

Roger
-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.9 (GNU/Linux)

iEYEARECAAYFAko4c4wACgkQmOOfHg372QQwUwCglqxQzZSGjHHoL13/L8Kw6NrX
46wAn3q12ugcrBryawTwpV8bjs/nYlZe
=XPU9
-END PGP SIGNATURE-
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Question about searches

2009-06-16 Thread John Machin
On 17/06/2009 1:19 AM, Christophe Leske wrote:

>>> So far ,  so good, but my client also expects ANY simplification of a
>>> character to be recognized:
>>> Cote d'azur for instance  should return "Côte d'azur"
>>> or the Sao Paulo issue - how can a search for "Sao Paulo" return "Sào
>>> Paulo" in the result set?
>>> 
>> How are these examples different from previous ones?
>>   
> I am sorry, but I find this to be quite obvious?
> Here, the problematic char is to be found in the *result set*, not in 
> the query itself.

It's NOT different. You need to map BOTH your database values and your 
query values into the same space and then compare them. Don't fall into 
the trap of assuming that your database is correctly accented.

> 
> How do you educate SQlite to return me "Sào Paulo" if only "Sao Paulo" 
> is being entered?
> How do I know which character to substitute with a placeholder?
> 
> Is it
> S%o Paulo to look for?
> Or Sa% Paulo?
> Or Sao P%ulo?
> 
> I can't know this beforehand. These are just examples, i need a generic 
> solution if possivble.
> 
> All i can see so far is to build a table of all special characters ever 
> used in the 24000 names of cities which make problems and remap them 
> accordingly.

That's exactly what you need. And you're not the first person with this 
problem. See for example 
http://mail.python.org/pipermail/python-list/2008-July/669592.html

The technique discussed there starts off with using the unicodedata 
database and finding dynamically (and caching) Unicode characters that 
can be decomposed into a basic latin letter plus one or more accents, 
backed up by a table of cases not found by that technique. Great for 
likers of clever code who have lots of CPU and disk space (unicodeddata 
is huge!) to spare.

I have developed a table which maps most latin-decorated Unicode 
characters into the non-decorated basic form. Sometimes 2 ASCII 
characters will be produced (e.g. latin capital letter thorn -> "Th") 
but latin small letter u with diaeresis -> "u" -- not "ue" which is 
German-specific.

I can let you have a copy if you are interested. What is your 
implementation language? C/C++?

BTW someone mentioned smashing everything into lowercase for comparison 
purposes at some stage -- I'd suggest uppercase especially if you have a 
few of the good old eszett in your data :-)

BTW2: The only sane usage of soundex IMHO is as a strawman when 
proposing phonetic matching algorithms like NYSIIS and [Double 
]Metaphone :-)

HTH,
John
___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Question about searches

2009-06-16 Thread Simon Slavin

On 16 Jun 2009, at 4:46pm, Swithun Crowe wrote:

> How about having an extra column for each column that you want to  
> search
> in? In the extra column, have a plain lowercase ASCII version of the  
> word.
> So, for 'Sào Paulo', have 'sao paulo'. You would need to write a small
> program to convert the characters. When you want to search for  
> something,
> convert your search query into something without accents, and search  
> in
> the extra column.


That would be a good solution, but it would require some intelligence  
in the users.  Instead of writing a small conversion program you could  
write your own encoding function as an SQLite extension.  More  
puzzling to start with but far more convenient to use in the long run.



Might I suggest that if you do either of those you look into doing  
soundex as part of your handling of accented characters ?  Even  
countries that do not use accented characters find soundex encoding  
very useful.



Including soundex as part of your hashing function gets rid of the  
'Zürich' problem: all three of 'Zürich', 'Zuerich', and 'Zurich' all  
render the same value using soundex, so searching for any one of them  
would return all records which contained any of the three versions.

I do not know if the SOUNDEX() function handles accented characters in  
this way.  SQLite apparently supports its built-in SOUNDEX() function  
only if built with a particular switch.

Simon.
-- 
  http://www.hearsay.demon.co.uk | I'd expect if a computer was involved
 | it all would have been much worse.
No Buffy for you.|-- John "West" McKenna
Leave quickly now. -- Anya   |  THE FRENCH WAS THERE

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Question about searches

2009-06-16 Thread Igor Tandetnik
Christophe Leske  wrote:
>> You write your own comparison function that would consider these two
>> strings equal. See sqlite3_create_function, sqlite3_create_collation.
>>
> this problem pertains not only to Zürich, but to 24000 other entries,
> so
> I guess that this is no option for me.
> And again, I am using the sqlite3 command line exe and can't compile a
> custom version.

Well, I guess you could store two versions of every city name - the 
"correct" version with accents and all, and the "for search" version in 
ASCII characters only. Perhaps you would need more than one - say both 
Zurich and Zuerich for Zürich - then create a separate table with common 
spellings that would be in one-to-many relationship with the table of 
cities.

Igor Tandetnik 



___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Question about searches

2009-06-16 Thread Christophe Leske

> You write your own comparison function that would consider these two 
> strings equal. See sqlite3_create_function, sqlite3_create_collation.
>   
Well,

this problem pertains not only to Zürich, but to 24000 other entries, so 
I guess that this is no option for me.
And again, I am using the sqlite3 command line exe and can't compile a 
custom version.


> Why would you ever want two % in a row? A % matches zero or more of 
> arbitrary characters. You might be thinking of an underscore _.
>   
OK. Thanks for the hint, I was under the wrong assumption that % matches 
one character exactly, whereas this seems to be "_".

>> So far ,  so good, but my client also expects ANY simplification of a
>> character to be recognized:
>> Cote d'azur for instance  should return "Côte d'azur"
>> or the Sao Paulo issue - how can a search for "Sao Paulo" return "Sào
>> Paulo" in the result set?
>> 
> How are these examples different from previous ones?
>   
I am sorry, but I find this to be quite obvious?
Here, the problematic char is to be found in the *result set*, not in 
the query itself.

How do you educate SQlite to return me "Sào Paulo" if only "Sao Paulo" 
is being entered?
How do I know which character to substitute with a placeholder?

Is it
S%o Paulo to look for?
Or Sa% Paulo?
Or Sao P%ulo?

I can't know this beforehand. These are just examples, i need a generic 
solution if possivble.

All i can see so far is to build a table of all special characters ever 
used in the 24000 names of cities which make problems and remap them 
accordingly.



-- 
Christophe Leske

www.multimedial.de - i...@multimedial.de
http://www.linkedin.com/in/multimedial
Lessingstr. 5 - 40227 Duesseldorf - Germany
+49(0)180102 - 06 60 02 96 // +49(0)177 249 70 31

This e-mail may contain confidential information. If you are not the intended 
recipient, 
it is appreciated that you notify the sender and delete your copy. Thank you.

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


Re: [sqlite] Question about searches

2009-06-16 Thread Igor Tandetnik
Christophe Leske  wrote:
> - how can SQlite be instructed to return search results which include
> a
> special character in it?
> E.g. you search literally for "Zurich" on an englisch system and
> expect
> "Zürich" to be in the result set.

You write your own comparison function that would consider these two 
strings equal. See sqlite3_create_function, sqlite3_create_collation.

> The next problem is that educated users might know that german Umlaute
> can be written out. The rules are simple:
>
> ä becomes "ae"
> ö becomes "oe"
> ü becomes "ue"
>
> So how would I go about filtering an educated user which looks for
> "Zuerich" and expects "Zürich" in the result set?

The same way.

> Best find on my behalf so far is to build a filter which replaces any
> occurence of "ae", "oe", "ue" with two placeholders ("%")

Why would you ever want two % in a row? A % matches zero or more of 
arbitrary characters. You might be thinking of an underscore _.

> So far ,  so good, but my client also expects ANY simplification of a
> character to be recognized:
> Cote d'azur for instance  should return "Côte d'azur"
> or the Sao Paulo issue - how can a search for "Sao Paulo" return "Sào
> Paulo" in the result set?

How are these examples different from previous ones?

Igor Tandetnik 



___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users


[sqlite] Question about searches

2009-06-16 Thread Christophe Leske
Hi there,

i have written an application which runs under german and englisch 
versions of Windows.
It includes a city databases which is ought to be searchable, yet there 
are a couple of issues which are of more logical nature...

My shell application surrounding the sqlite database only supports the 
ANSI charset (no Unicode), yet the underlying SQlite database has been 
created with UTF-8 support.

So far so good.

I got complaints from my client here that excentric city names like "Sào 
Paulo" cannot be found.
Further digging revealed two potential problems:

- how can SQlite be instructed to return search results which include a 
special character in it?
E.g. you search literally for "Zurich" on an englisch system and expect 
"Zürich" to be in the result set.

The next problem is that educated users might know that german Umlaute 
can be written out. The rules are simple:

ä becomes "ae"
ö becomes "oe"
ü becomes "ue"

So how would I go about filtering an educated user which looks for 
"Zuerich" and expects "Zürich" in the result set?

Best find on my behalf so far is to build a filter which replaces any 
occurence of "ae", "oe", "ue" with two placeholders ("%") which would 
effectively lead to a search of the type

select * from cities where name like "Z%%rich"


So far ,  so good, but my client also expects ANY simplification of a 
character to be recognized:
Cote d'azur for instance  should return "Côte d'azur"
or the Sao Paulo issue - how can a search for "Sao Paulo" return "Sào 
Paulo" in the result set?

Please note that I am using the normal command line sqlite3.exe application. I 
already started looking into soundex() yet unfortunately, it does not seem to 
be compiled into the normal command like executable. I also doubt that it would 
help?

Any help much appreciated, 

Christophe Leske

www.multimedial.de - i...@multimedial.de
http://www.linkedin.com/in/multimedial
Lessingstr. 5 - 40227 Duesseldorf - Germany
+49(0)180102 - 06 60 02 96 // +49(0)177 249 70 31

This email may contain confidential information. If you are not the intended 
recipient, it would be appreciated that you delete it and notify the sender. 
Thank you.

___
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users