Re: [sqlite] Advice needed for fuzzy search

Simon Slavin Thu, 02 Jul 2009 09:13:47 -0700

On 2 Jul 2009, at 2:01pm, Jean-Christophe Deschamps wrote:

> I need to deal with codepoints that would expand to several individual
> characters.  Examples are ligatures or fractions.  I've never seen
> ligatures used in French, nor in any European language, when it comes
> to user input.  I believe such ligatures are more a typesetting or
> word-processing finesse which is beyond most users' care / knowledge.
>
> But if I ever encounter some, how should I deal with them?  If I leave
> them alone, then for instance ligature 'fi' would not compare to the
> letter sequence 'f' 'i'.  If I expand them, then ligature 'fi' would
> get to 'f' 'i' but if the corresponding char in the second string is
> 'g' then it would count for two errors instead of one.


You /do/ need to expand ligatures, especially since some sources will
already have them expanded.  You may have to consider an edit distance
of 2 to be near enough for a match.  I assume ... I hope ... you have
access to a Unicode library with functions that can do things like
expand ligatures.
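
A minimal sketch of the trade-off discussed above, assuming Python and its standard unicodedata module (neither is mentioned in the thread). NFKD compatibility normalization stands in for "a library that can expand ligatures"; the helper names expand and levenshtein are hypothetical. Note that NFKD also splits accented letters into a base letter plus a combining mark, which may or may not be wanted.

```python
import unicodedata


def expand(text):
    """Expand compatibility characters such as the 'fi' ligature (U+FB01)
    into their plain sequences using NFKD normalization (an assumption:
    the thread does not name a specific expansion method)."""
    return unicodedata.normalize("NFKD", text)


def levenshtein(a, b):
    """Classic edit distance: insertions, deletions, substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


ligature_word = "\ufb01n"  # ligature 'fi' followed by 'n'

# Left alone, the ligature never matches the sequence 'f' 'i'.
unexpanded_vs_plain = levenshtein(ligature_word, "fin")

# Expanded, it matches exactly ...
expanded_vs_plain = levenshtein(expand(ligature_word), "fin")

# ... but one wrong character opposite the ligature now costs two edits.
expanded_vs_typo = levenshtein(expand(ligature_word), "gn")
```

With these inputs the unexpanded comparison is off by two, the expanded one matches exactly, and the expanded comparison against "gn" costs two edits, which is why a threshold of 2 may be needed.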

I've not come across any good standard way of dealing with this
problem.  You are at the leading edge of technology!  What we need is
a new version of Soundex written to deal with Unicode instead of
ASCII.  The best-known code along those lines is the Perl module
Text::Unidecode, which provides a function called unidecode.  Reading
about it may help you decide how to proceed:

http://search.cpan.org/~sburke/Text-Unidecode-0.04/lib/Text/Unidecode.pm

http://interglacial.com/~sburke/tpj/as_html/tpj22.html
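
For a rough feel of what ASCII folding does before fuzzy matching, here is a small sketch in Python using only the standard unicodedata module. It is not Text::Unidecode and does not reproduce its transliteration tables; the function name fold_to_ascii is hypothetical. It decomposes characters, drops combining marks, and discards anything that is still not ASCII.

```python
import unicodedata


def fold_to_ascii(text):
    """Crude ASCII folding: NFKD-decompose, drop combining marks, then
    discard whatever is still non-ASCII. A stand-in for illustration,
    not an implementation of Text::Unidecode."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.encode("ascii", "ignore").decode("ascii")


folded = fold_to_ascii("\u00c9l\u00e8ve \ufb01anc\u00e9")  # "Élève ﬁancé"
lossy = fold_to_ascii("c\u0153ur")                          # "cœur"
```

The second call shows the limit of this shortcut: 'œ' has no Unicode decomposition, so it is simply dropped, whereas a table-driven transliterator can map such characters to a sensible ASCII sequence.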

Simon.
_______________________________________________
sqlite-users mailing list
[email protected]
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users