OK then, I will do it on the application side. But why doesn't SQLite provide,
along with ICU, simple functions to normalize a string according to a specified
locale? Then I would be able to implement this in the application layer.

I think that trying to re-implement the ä=ae, ü=ue rules myself is a waste of
time; at worst I would use ICU underneath ... I want to rely on something well
tested :-)
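
As a rough illustration of doing this at the application layer, here is a minimal sketch using Python's sqlite3 module and unicodedata; the hand-written German map and the fold_de function name are assumptions made for the example only, not anything SQLite or ICU actually provides:

import sqlite3
import unicodedata

# Illustrative German grapheme map; a real solution would defer to a
# well-tested library such as ICU rather than hand-written rules.
GERMAN_MAP = {"ä": "ae", "ö": "oe", "ü": "ue",
              "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"}

def fold_german(text):
    if text is None:
        return None
    out = []
    for ch in text:
        if ch in GERMAN_MAP:
            out.append(GERMAN_MAP[ch])
        else:
            # Fall back to stripping combining marks (e.g. é -> e).
            decomposed = unicodedata.normalize("NFKD", ch)
            out.append("".join(c for c in decomposed
                               if not unicodedata.combining(c)))
    return "".join(out).lower()

conn = sqlite3.connect(":memory:")
# Expose the folding to SQL so queries can call fold_de(column) directly.
conn.create_function("fold_de", 1, fold_german)

conn.execute("CREATE TABLE words(w TEXT)")
conn.execute("INSERT INTO words VALUES ('Müller')")
print(conn.execute(
    "SELECT w FROM words WHERE fold_de(w) = fold_de('Mueller')").fetchone())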

May I ask for some functions inside SQLite to provide this normalization etc.?
That would be really great :-)

Best regards,
Sylvain

On Tue, Dec 22, 2009 at 5:44 PM, Nicolas Williams
<nicolas.willi...@sun.com> wrote:

> On Tue, Dec 22, 2009 at 07:49:24AM -0500, Tim Romano wrote:
> > On 12/22/2009 5:31 AM, Sylvain Pointeau wrote:
> > > It cannot be done in the application layer...
> > >
> > You are wrong about that. I have written a full-text search application
> > to go against ancient Germanic texts where, for example, there were two
> > dozen ways to spell the word for modern English 'sister' --spelling had
> > not yet been regularized but reflected regional dialect pronunciations
> > and regional scriptorium conventions.  There is no way an ICU collation
> > could handle that crazy quilt and it had to be done in the application
> > layer.
> >
> > It is done in the application layer  by *normalizing* the data on its
> > way into the database and then, of course, you must also normalize the
> > search terms as the user supplies them.  So, for example, if you are
> > importing a-umlaut you store 'ae' and if you are given a-umlaut as a
> > search term by the user, you search for 'ae'.  Normalization of
> > graphemes is analogous to Unicode decomposition of composite characters.
>
> Indeed.
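
A minimal sketch of that normalize-on-the-way-in, normalize-the-search-term flow, assuming Python's sqlite3 module and a deliberately tiny hand-written grapheme map (illustrative only, not the poster's actual application):

import sqlite3

# Tiny illustrative grapheme map; a real application would use a fuller set
# of rules (or ICU) and probably strip combining marks as well.
FOLD = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}

def fold(s):
    return "".join(FOLD.get(ch, ch) for ch in s.lower())

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE texts(original TEXT, folded TEXT)")

def insert_text(s):
    # Normalize on the way into the database, keeping the original too.
    conn.execute("INSERT INTO texts VALUES (?, ?)", (s, fold(s)))

def search(term):
    # Normalize the user's search term in exactly the same way.
    pattern = "%" + fold(term) + "%"
    return [row[0] for row in
            conn.execute("SELECT original FROM texts WHERE folded LIKE ?",
                         (pattern,))]

insert_text("die Mädchen und ihre Schwester")
print(search("Maedchen"))   # matches even though the stored text has ä
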
>
> > However, if SQLite can flip an ICU German collation into Full
> > Normalization mode this could be done in the database.
>
> Sure, but you lose control in the process.  Suppose you eventually need
> to add support for non-German locales yet you've been using a collation
> that performs these conversions -- oops.  Or suppose there are
> well-known words of foreign origin to which this sort of normalization
> must not be applied: a generic toolkit like ICU, that works
> character-by-character or codepoint-by-codepoint, will not know about
> them -- why should it?  And so on.
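
For the exceptions point, a sketch of the kind of word-level hook an application can keep for itself; the exception entries are pure placeholders, and a character-by-character collation applied inside the database offers nothing comparable:

# Word-level exception list consulted before any character-level folding.
EXCEPTIONS = {"placeholder-loanword", "another-loanword"}   # hypothetical entries

def fold_word(word, fold):
    # `fold` is any character-level folding function, e.g. fold_german above.
    if word.lower() in EXCEPTIONS:
        return word.lower()          # leave well-known foreign words untouched
    return fold(word)
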
>
> > P.S. I recently asked for a lightweight raw "reverse-string"
> > (codepoint-by-codepoint) function to be added to the SQLite core
> > (because I don't have access to its UDF mechanism in Adobe
> > Flex/FlashBuilder) and do agree that there are often good reasons for
> > wanting something to be done in the database layer, provided it does not
> > slow the database down for everyone else.
>
> Yes.  I believe that databases need to support Unicode normalization-
> insensitive/preserving behavior, at least as an option (most input
> methods produce pre-composed output, so often one can get away with not
> normalizing at all).
>
> Nico
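
Two small follow-ups to the points above, both only sketches assuming Python's sqlite3 module.

First, where a UDF mechanism is available, the "reverse-string" function mentioned above is tiny; Python 3 strings are sequences of code points, so slicing reverses them code point by code point:

import sqlite3

conn = sqlite3.connect(":memory:")
# Register a codepoint-by-codepoint reverse as a user-defined SQL function.
conn.create_function("reverse", 1,
                     lambda s: s[::-1] if s is not None else None)
print(conn.execute("SELECT reverse('Straße')").fetchone()[0])   # eßartS

Second, to make the pre-composed/decomposed point concrete: the two spellings of "café" below differ code point by code point, so a plain comparison fails unless both sides are normalized (NFC here) before they reach the database:

import sqlite3
import unicodedata

precomposed = "caf\u00e9"      # 'é' as one code point (U+00E9)
decomposed = "cafe\u0301"      # 'e' followed by combining acute (U+0301)
print(precomposed == decomposed)                         # False

nfc = lambda s: unicodedata.normalize("NFC", s)
print(nfc(precomposed) == nfc(decomposed))               # True

conn = sqlite3.connect(":memory:")
conn.create_function("nfc", 1, lambda s: nfc(s) if s is not None else None)
# Normalization-insensitive comparison done at the application layer.
print(conn.execute("SELECT nfc(?) = nfc(?)",
                   (precomposed, decomposed)).fetchone()[0])    # 1
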
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
