I apologize if double-post, the first one didn't make it to the list. ---------------------------------------------------------------------
Hi, >Hi all, I am interested in seeing "fuzzy searching" in SQLite, for >lack of a better term. This type of search would return more results >than LIKE currently does today. The search would return matches based >on expanded criteria, each one may be considered a separate RFE for >LIKE or for another specialized function (FLIKE, maybe). > >1) Case insensitivity Yes >2) ASCII-equivalent searching Yes >3) Diacritic-elimination searching Yes >4) Punctuation-elimination searching Not out of the box. I've written a small SQLite extension for dealing with Unicode text: unifuzz. It contains, among other [I believe] useful functions, an unaccent() scalar function which will do right that: map accented characters to their basic form (unaccented) equivalent. It also handles a few special cases like the German Eszet 'ß' and a number of ligatures. All Unicode functions rely on Unicode v5.1 tries internal to the extension. You'll also find a set of Unicode-aware set of functions: upper, lower, title, proper. There are as well a set of Unicode-aware locale-independant collations: nocase and names and a collation usefull to sort numerically strings with numeric prefix (built-in collations will sort lexicographically, which is a pain in this case). Being able to deal with text as locale-INdependant was a need for me as I manipulate data from customers in more that 27 countries now. Of course it's far from perfect since doing things "perfectly" means first selecting a single locale, which would ruin operations applied to text from another country/language. The numeric-prefix collation recognizes all known zero_to_nine representations and maps them all to "our" 0-9. All this is detailed at length in headers/comments of the code. Finally, there is a fuzzy search function which works internally with unaccented versions of the strings supplied. It returns the Damerau-Levenshtein distance between its two arguments, and is (uncreatively) named 'typos'. There is one limitation with this extension: the collation functions rely on invoking a Windows call (from kernel32.dll), which makes them currently non-portable across systems outside the Windows world. The extension can probably be easily built without collation for compiling on non-Windows systems. I didn't have the need yet to build for 64 bits but I don't believe this should pose unduly complex issues. It should also be quite easy to port the offending function to unix like OSes. Full source is included and contains detailed explanations: read them! Last note: all this comes without guaranty of any kind. I still consider these extensions as beta, but I've been using them daily on our production base for months, like several other users. Of course, it can't pretend be as perfect as ICU, but it's way smaller (170Kb vs many Mb) and much, much faster. Aother advantage is that it doesn't require that you select a specific locale to register collations. Not only does this simplify the code but it's also a prerequisite when you need to deal with data from several languages at once. The source contains detailed explanations. The binaries are x86 32bit ready for use. If you have any question or would like to obtain a download link, feel free to drop me a mail. I also would appreciate it if you can report how this extension behave w.r.t. your language(s). Cheers, _______________________________________________ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users