On Tue, Dec 22, 2009 at 07:49:24AM -0500, Tim Romano wrote:
> On 12/22/2009 5:31 AM, Sylvain Pointeau wrote:
> > It cannot be done in the application layer...
> >    
> You are wrong about that. I have written a full-text search application 
> for ancient Germanic texts in which, for example, there were two 
> dozen ways to spell the word for modern English 'sister': spelling had 
> not yet been regularized and instead reflected regional dialect 
> pronunciations and regional scriptorium conventions.  No ICU collation 
> could handle that crazy quilt, so it had to be done in the application 
> layer.
> 
> It is done in the application layer by *normalizing* the data on its 
> way into the database; then, of course, you must also normalize the 
> search terms as the user supplies them.  So, for example, if you are 
> importing a-umlaut you store 'ae', and if you are given a-umlaut as a 
> search term by the user, you search for 'ae'.  Normalization of 
> graphemes is analogous to Unicode decomposition of composite characters.

Indeed.
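
A minimal sketch of that scheme, assuming Python's sqlite3 module.  The
fold table, the "words" table and the helper names are all made up for
illustration; a real project would supply its own mappings:

```python
import sqlite3
import unicodedata

# Assumed fold table (illustrative only, not a complete German mapping).
FOLD = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}

def fold_graphemes(text):
    """Compose, lower-case, then replace each mapped grapheme."""
    composed = unicodedata.normalize("NFC", text).lower()
    return "".join(FOLD.get(ch, ch) for ch in composed)

conn = sqlite3.connect(":memory:")
# Keep the original spelling next to the folded search key.
conn.execute("CREATE TABLE words (original TEXT, folded TEXT)")

def add_word(word):
    # Normalize on the way into the database.
    conn.execute("INSERT INTO words VALUES (?, ?)",
                 (word, fold_graphemes(word)))

def search(term):
    # Normalize the user's search term the same way before matching.
    rows = conn.execute("SELECT original FROM words WHERE folded = ?",
                        (fold_graphemes(term),))
    return [row[0] for row in rows]

add_word("Mädchen")
add_word("Maedchen")
```

With this in place, search("MÄDCHEN") and search("maedchen") both find
both stored spellings, because ingest and lookup go through the same
function.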

> However, if SQLite can flip an ICU German collation into Full 
> Normalization mode this could be done in the database.

Sure, but you lose control in the process.  Suppose you eventually need
to add support for non-German locales yet you've been using a collation
that performs these conversions -- oops.  Or suppose there are
well-known words of foreign origin to which this sort of normalization
must not be applied: a generic toolkit like ICU, which works
character-by-character or codepoint-by-codepoint, will not know about
them -- why should it?  And so on.
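
To illustrate the kind of control I mean, here is a rough sketch in
Python of an application-layer normalizer with an exception list.  The
fold table, the KEEP_AS_IS set and the example word in it are
hypothetical; the point is only that the application can carry
knowledge that a per-character collation cannot:

```python
# Assumed fold table (illustrative only).
FOLD = {"ä": "ae", "ö": "oe", "ü": "ue"}

# Hypothetical exception list: words of foreign origin whose spelling
# must not be folded.  The entry below is only an example.
KEEP_AS_IS = {"jyväskylä"}

def normalize_term(term):
    """Fold graphemes unless the whole word is a known exception."""
    lowered = term.lower()
    if lowered in KEEP_AS_IS:
        return lowered  # word-level knowledge overrides the fold table
    return "".join(FOLD.get(ch, ch) for ch in lowered)
```

Here normalize_term("Mädchen") gives "maedchen", while
normalize_term("Jyväskylä") is left as "jyväskylä".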

> P.S. I recently asked for a lightweight raw "reverse-string" 
> (codepoint-by-codepoint) function to be added to the SQLite core 
> (because I don't have access to its UDF mechanism in Adobe 
> Flex/FlashBuilder) and do agree that there are often good reasons for 
> wanting something to be done in the database layer, provided it does not 
> slow the database down for everyone else.
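
For reference, a codepoint-by-codepoint reverse is tiny where the UDF
mechanism is available.  A sketch using Python's sqlite3 module; the
SQL function name "reverse" is my choice here, not something SQLite
ships:

```python
import sqlite3

def reverse_codepoints(s):
    """Reverse a string one code point at a time (NULL stays NULL)."""
    if s is None:
        return None
    # Python strings are sequences of code points, so slicing suffices.
    # Note that combining marks end up detached from their base letters.
    return s[::-1]

conn = sqlite3.connect(":memory:")
# Register the function under the (assumed) SQL name "reverse".
conn.create_function("reverse", 1, reverse_codepoints)
```

After that, SELECT reverse('abc') yields 'cba'.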

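Yes.  I believe that databases need to support Unicode behavior that is
normalization-insensitive and normalization-preserving, at least as an
option: comparisons ignore the normalization form, while stored strings
keep the form in which they arrived.  (Most input methods produce
pre-composed output, so often one can get away with not normalizing at
all.)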
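
A rough sketch of what normalization-insensitive but
normalization-preserving lookup could look like, using a custom
collation registered through Python's sqlite3 module.  The collation
name "unorm" and the "names" table are assumptions for the example;
this is not something SQLite provides out of the box:

```python
import sqlite3
import unicodedata

def nfc_compare(a, b):
    """Compare two strings by their NFC forms; return -1, 0 or 1."""
    a = unicodedata.normalize("NFC", a)
    b = unicodedata.normalize("NFC", b)
    return (a > b) - (a < b)

conn = sqlite3.connect(":memory:")
conn.create_collation("unorm", nfc_compare)  # assumed collation name
conn.execute("CREATE TABLE names (name TEXT)")
# Stored exactly as supplied (pre-composed e-acute); nothing is rewritten.
conn.execute("INSERT INTO names VALUES (?)", ("caf\u00e9",))

def find(name):
    # Only the comparison normalizes; the stored value is returned as-is.
    rows = conn.execute(
        "SELECT name FROM names WHERE name = ? COLLATE unorm", (name,))
    return [row[0] for row in rows]
```

A lookup with the decomposed spelling, find("cafe\u0301"), matches the
row and hands back the pre-composed string that was originally stored.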

Nico
-- 
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
