I apologize if double-post, the first one didn't make it to the list.
---------------------------------------------------------------------

Hi,

>Hi all, I am interested in seeing "fuzzy searching" in SQLite, for
>lack of a better term. This type of search would return more results
>than LIKE currently does today. The search would return matches based
>on expanded criteria, each one may be considered a separate RFE for
>LIKE or for another specialized function (FLIKE, maybe).
>
>1) Case insensitivity

Yes

>2) ASCII-equivalent searching

Yes

>3) Diacritic-elimination searching

Yes

>4) Punctuation-elimination searching

Not out of the box.


I've written a small SQLite extension for dealing with Unicode text: 
unifuzz.  It contains, among other [I believe] useful functions, an 
unaccent() scalar function which will do right that: map accented 
characters to their basic form (unaccented) equivalent.  It also 
handles a few special cases like the German Eszet 'ß' and a number of 
ligatures.  All Unicode functions rely on Unicode v5.1 tries internal 
to the extension.

You'll also find a set of Unicode-aware set of functions: upper, lower, 
title, proper.

There are as well a set of Unicode-aware locale-independant collations: 
nocase and names and a collation usefull to sort numerically strings 
with numeric prefix (built-in collations will sort lexicographically, 
which is a pain in this case).  Being able to deal with text as 
locale-INdependant was a need for me as I manipulate data from 
customers in more that 27 countries now.  Of course it's far from 
perfect since doing things "perfectly" means first selecting a single 
locale, which would ruin operations applied to text from another 
country/language.  The numeric-prefix collation recognizes all known 
zero_to_nine representations and maps them all to "our" 0-9.

All this is detailed at length in headers/comments of the code.

Finally, there is a fuzzy search function which works internally with 
unaccented versions of the strings supplied.  It returns the 
Damerau-Levenshtein distance between its two arguments, and is 
(uncreatively) named 'typos'.

There is one limitation with this extension: the collation functions 
rely on invoking a Windows call (from kernel32.dll), which makes them 
currently non-portable across systems outside the Windows world.  The 
extension can probably be easily built without collation for compiling 
on non-Windows systems.  I didn't have the need yet to build for 64 
bits but I don't believe this should pose unduly complex issues.  It 
should also be quite easy to port the offending function to unix like OSes.

Full source is included and contains detailed explanations: read 
them!  Last note: all this comes without guaranty of any kind.  I still 
consider these extensions as beta, but I've been using them daily on 
our production base for months, like several other users.  Of course, 
it can't pretend be as perfect as ICU, but it's way smaller (170Kb vs 
many Mb) and much, much faster.  Aother advantage is that it doesn't 
require that you select a specific locale to register collations.  Not 
only does this simplify the code but it's also a prerequisite when you 
need to deal with data from several languages at once.

The source contains detailed explanations.  The binaries are x86 32bit 
ready for use.

If you have any question or would like to obtain a download link, feel 
free to drop me a mail.  I also would appreciate it if you can report 
how this extension behave w.r.t. your language(s).

Cheers,

_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to