Re: [sqlite] Unicode support

Nicolas Williams Fri, 20 Nov 2009 09:55:19 -0800

On Tue, Nov 17, 2009 at 09:31:46PM -0500, Tim Romano wrote:
> <quote> but if ORDER BY is
> relying on an index for ordering, then flip() can have negative 
> effects.</quote>
> 
> 
> Substr() could have negative effects on ordering too.  That is a red 
> herring.  Flip() is merely a function that  reverses the order of 
> codepoints "as found" without knowing anything about what those 
> codepoints, individually or in combination, might signify in a writing 
> system.  If I want to write those codepoints to a column that's my concern.


In Unicode there's codepoints, characters, and glyphs.  Codepoints are
single 21-bit values.  Characters are either single codepoints or
combinations of codepoints.  Glyphs are either single characters or
combinations of characters that are displayed as single
programatically-constructed glyphs.  SQLite3 knows about none of that.
Nor about normalization forms.

Therefore any functions like substr() and flip() that work at the
codepoint level (or worse, at the byte level, but fortunately substr()
is UTF-8/16 aware) can break semantics for your strings.

> What if I wanted to have a column that consisted of codepoints from all 
> over the Unicode range: a codepoint from Greek next to a codepoint from 
> Swahili next to a codepoint from Hungarian?  Shouldn't I be able to say 
> to a database:  this column contains codepoints (characters) and 
> collation is not relevant, sort the column using the numeric value of 
> the codepoints? 

Yes, I think so.  I'm not sure why you'd want that, but yes, it ought to
be possible, and right now SQLite3 lets you do that because it is not
aware of characters and glyphs -- SQLite3 is aware of only codepoints.
But if you load the ICU extensions that might change!

Ideally there should be a way to indicate a variety of Unicode-related
behaviors:

 - normalization form for use in index keys
 - normalization-insensitive string comparison operators
 - whether to normalize values in tables and, if so, with what form (by
   column, obviously)
    - if you normalize strings in index keys but not in tables then you
      get normalization-insensitive-but-normalization-preserving
      behavior, which is really, really convenient

 - collation options, such as language
    - whether to honor language tags embedded in the UTF-8/16 strings

 - multiple text types? (string of codepoints, of characters, or glyphs)

 - a whole range of Unicode-aware functions like substr() (and flip(),
   and like(), and regex(), and glob(), ...), with options for character
   and glyph counting instead of codepoint counting

 - codesets (for non-Unicode data), with automatic codeset conversions
   similar to type conversions
     - to have automatic conversions I think would require an extensible
       text type system

That's... a lot of functionality.  I'm not sure how much of it needs to
be implemented with help from the SQLite3 core, versus extensions.  It'd
be nice if all of it could be implemented via extensions, but I don't
think that's possible right now.

Nico
-- 
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Re: [sqlite] Unicode support

Reply via email to