On Tue, Dec 22, 2009 at 07:49:24AM -0500, Tim Romano wrote:
> On 12/22/2009 5:31 AM, Sylvain Pointeau wrote:
> > It cannot be done in the application layer...
> >
> You are wrong about that. I have written a full-text search application
> to go against ancient Germanic texts where, for example, there were two
> dozen ways to spell the word for modern English 'sister' --spelling had
> not yet been regularized but reflected regional dialect pronunciations
> and regional scriptorium conventions. There is no way an ICU collation
> could handle that crazy quilt and it had to be done in the application
> layer.
>
> It is done in the application layer by *normalizing* the data on its
> way into the database and then, of course, you must also normalize the
> search terms as the user supplies them. So, for example, if you are
> importing a-umlaut you store 'ae' and if you are given a-umlaut as a
> search term by the user, you search for 'ae'. Normalization of
> graphemes is analogous to Unicode decomposition of composite characters.
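A minimal sketch of the application-layer approach described in the quoted text, assuming Python and its standard sqlite3 module. The folding table FOLD, the table name "words", and the helpers normalize(), build_index() and search() are all hypothetical names made up for illustration; a real application for historical texts would need a far larger, hand-curated mapping. The same normalize() function is applied both when storing and when searching, which is the whole point.

```python
import sqlite3
import unicodedata

# Hypothetical folding table (an assumption for illustration only).
FOLD = {
    "\u00e4": "ae",  # a-umlaut
    "\u00f6": "oe",  # o-umlaut
    "\u00fc": "ue",  # u-umlaut
    "\u00df": "ss",  # sharp s
}


def normalize(text):
    """Fold text to the form that is stored and searched."""
    # Compose first, so 'a' + combining diaeresis is treated like a-umlaut.
    composed = unicodedata.normalize("NFC", text).lower()
    return "".join(FOLD.get(ch, ch) for ch in composed)


def build_index(words):
    """Store each word together with its normalized form."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE words (original TEXT, norm TEXT)")
    conn.execute("CREATE INDEX words_norm ON words (norm)")
    conn.executemany(
        "INSERT INTO words (original, norm) VALUES (?, ?)",
        [(w, normalize(w)) for w in words],
    )
    return conn


def search(conn, term):
    """Normalize the user's search term the same way, then match on norm."""
    rows = conn.execute(
        "SELECT original FROM words WHERE norm = ?", (normalize(term),)
    )
    return sorted(row[0] for row in rows)
```

Because the original spelling is kept in its own column, the database only ever compares plain normalized strings, and no special collation is needed on the SQLite side.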

Indeed.

> However, if SQLite can flip an ICU German collation into Full
> Normalization mode this could be done in the database.

Sure, but you lose control in the process. Suppose you eventually need
to add support for non-German locales yet you've been using a collation
that performs these conversions -- oops. Or suppose there are
well-known words of foreign origin to which this sort of normalization
must not be applied: a generic toolkit like ICU, that works
character-by-character or codepoint-by-codepoint, will not know about
them -- why should it? And so on.

> P.S. I recently asked for a lightweight raw "reverse-string"
> (codepoint-by-codepoint) function to be added to the SQLite core
> (because I don't have access to its UDF mechanism in Adobe
> Flex/FlashBuilder) and do agree that there are often good reasons for
> wanting something to be done in the database layer, provided it does not
> slow the database down for everyone else.

Yes. I believe that databases need to support Unicode
normalization-insensitive/preserving behavior, at least as an option
(most input methods produce pre-composed output, so often one can get
away with not normalizing at all).

Nico
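To illustrate what normalization-insensitive but normalization-preserving behavior could look like, here is a rough sketch, assuming Python's sqlite3 module and its create_collation() hook. The collation name "nfc", the table "t" and the helpers nfc_compare() and demo() are hypothetical. The stored string is left exactly as it was inserted; only the comparison treats composed and decomposed forms as equal. This is not something SQLite provides out of the box.

```python
import sqlite3
import unicodedata


def nfc_compare(a, b):
    """Compare two strings after NFC normalization; return -1, 0 or 1."""
    na = unicodedata.normalize("NFC", a)
    nb = unicodedata.normalize("NFC", b)
    return (na > nb) - (na < nb)


def demo():
    conn = sqlite3.connect(":memory:")
    # Register the hypothetical "nfc" collation on this connection only.
    conn.create_collation("nfc", nfc_compare)
    conn.execute("CREATE TABLE t (s TEXT COLLATE nfc)")
    # Store the decomposed form: 'a' followed by a combining diaeresis.
    conn.execute("INSERT INTO t (s) VALUES (?)", ("a\u0308",))
    # Search with the pre-composed form (a-umlaut as a single codepoint).
    rows = conn.execute("SELECT s FROM t WHERE s = ?", ("\u00e4",))
    return [row[0] for row in rows]
```

The row is found even though the stored and searched strings differ codepoint by codepoint, and the value that comes back is still the decomposed form that was inserted. Note that a collation registered this way exists only on the connection that registered it, so every client of the database would have to register the same function.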