On Tue, Nov 17, 2009 at 09:31:46PM -0500, Tim Romano wrote: > <quote> but if ORDER BY is > relying on an index for ordering, then flip() can have negative > effects.</quote> > > > Substr() could have negative effects on ordering too. That is a red > herring. Flip() is merely a function that reverses the order of > codepoints "as found" without knowing anything about what those > codepoints, individually or in combination, might signify in a writing > system. If I want to write those codepoints to a column that's my concern.
In Unicode there's codepoints, characters, and glyphs. Codepoints are single 21-bit values. Characters are either single codepoints or combinations of codepoints. Glyphs are either single characters or combinations of characters that are displayed as single programatically-constructed glyphs. SQLite3 knows about none of that. Nor about normalization forms. Therefore any functions like substr() and flip() that work at the codepoint level (or worse, at the byte level, but fortunately substr() is UTF-8/16 aware) can break semantics for your strings. > What if I wanted to have a column that consisted of codepoints from all > over the Unicode range: a codepoint from Greek next to a codepoint from > Swahili next to a codepoint from Hungarian? Shouldn't I be able to say > to a database: this column contains codepoints (characters) and > collation is not relevant, sort the column using the numeric value of > the codepoints? Yes, I think so. I'm not sure why you'd want that, but yes, it ought to be possible, and right now SQLite3 lets you do that because it is not aware of characters and glyphs -- SQLite3 is aware of only codepoints. But if you load the ICU extensions that might change! Ideally there should be a way to indicate a variety of Unicode-related behaviors: - normalization form for use in index keys - normalization-insensitive string comparison operators - whether to normalize values in tables and, if so, with what form (by column, obviously) - if you normalize strings in index keys but not in tables then you get normalization-insensitive-but-normalization-preserving behavior, which is really, really convenient - collation options, such as language - whether to honor language tags embedded in the UTF-8/16 strings - multiple text types? (string of codepoints, of characters, or glyphs) - a whole range of Unicode-aware functions like substr() (and flip(), and like(), and regex(), and glob(), ...), with options for character and glyph counting instead of codepoint counting - codesets (for non-Unicode data), with automatic codeset conversions similar to type conversions - to have automatic conversions I think would require an extensible text type system That's... a lot of functionality. I'm not sure how much of it needs to be implemented with help from the SQLite3 core, versus extensions. It'd be nice if all of it could be implemented via extensions, but I don't think that's possible right now. Nico -- _______________________________________________ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users