Re: [sqlite] Lightweight solution for Unicode-savvy collation?

Simon Slavin Fri, 28 Jul 2017 13:15:06 -0700

On 28 Jul 2017, at 12:47am, Jens Alfke <[email protected]> wrote:

> The project I work on needs the ability to do Unicode-savvy string collation, 
> which SQLite doesn’t provide. But we’re somewhat sensitive to code size, so 
> we don’t want to just drop in the hugeness that is ICU. We’ve looked at a 
> couple of other Unicode/UTF-8 libraries (like utf8rewind), and while they do 
> case folding they don’t do collation.
> 
> We can’t be the first SQLite client to have this need. Anyone know of any 
> good solution?


The SQLite devs would like this to exist, too.  But it doesn’t and I don’t 
think it can.

Even just distinguishing upper- and lower-case Unicode characters requires a 
great deal of data.  The cases aren’t laid out in a logical or consistent 
manner.  Sometimes you get a run of LOWER case characters followed by the run 
of their UPPER case equivalents.  Other times they alternate.  And sometimes 
the order is swapped.  So just doing the Unicode equivalent of NOCASE for all 
scripts is difficult.
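
To make the layout problem concrete, here is a minimal sketch in Python that prints the distance between a few upper-case letters and their lower-case forms.  It relies only on Python's built-in str.lower(); the helper name case_offset and the choice of sample characters are mine, picked for illustration.

```python
def case_offset(upper_char: str) -> int:
    """Code point distance from an upper-case letter to its lower-case form."""
    return ord(upper_char.lower()) - ord(upper_char)

# Hand-picked samples (an illustrative choice, not an exhaustive survey):
#   'A'     ASCII: a run of upper case, with the lower-case run 32 places later
#   U+0100  Latin Extended-A: cases alternate, upper case on even code points
#   U+0139  same block, still alternating, but upper case now on odd code points
#   U+0178  the lower-case form (U+00FF) sits *before* the upper-case one
#   U+0391  Greek: run-based again, offset 32
samples = ["A", "\u0100", "\u0139", "\u0178", "\u0391"]

for ch in samples:
    lower = ch.lower()
    print(f"U+{ord(ch):04X} -> U+{ord(lower):04X}  offset {case_offset(ch):+d}")
```

No single arithmetic rule covers even these five characters, which is why real implementations fall back on lookup tables.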

Add to that the problem that different languages have different sort rules for 
the same characters.  So a German speaker considers the characters 'o' and 
'ö' to be two versions of the same character (like upper and lower case) whereas 
a Turkish speaker pronounces them differently and considers them two separate 
letters.  Because of this it’s impossible to have something like COLLATE 
UNICODE-NOCASE .  You’d have to have COLLATE UNICODE-NOCASE-TURKISH and COLLATE 
UNICODE-NOCASE-GERMAN .  Or perhaps if you moved your database to an 
organisation which spoke a different language you’d change a PRAGMA setting and 
that would trigger SQLite to recreate all indexes based on COLLATE 
UNICODE-NOCASE-DEFAULT.
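
For what it’s worth, the per-language approach can be sketched with the custom-collation hook that SQLite already exposes (sqlite3_create_collation in C, Connection.create_collation in Python’s sqlite3 module).  The collation names NOCASE_DE and NOCASE_TR and the two key functions below are hypothetical toys of my own; they cover a handful of letters and are nowhere near real German or Turkish collation.

```python
import sqlite3

def german_key(s: str) -> str:
    # Toy rule (assumption): fold case, then treat umlauts as their base vowels.
    return s.casefold().translate(str.maketrans("äöü", "aou"))

# Toy rule (assumption): rank letters by their position in the Turkish alphabet.
TURKISH_ALPHABET = "abcçdefgğhıijklmnoöprsştuüvyz"
TURKISH_RANK = {ch: i for i, ch in enumerate(TURKISH_ALPHABET)}

def turkish_key(s: str) -> list:
    # Turkish case pairs are I/ı and İ/i, so map those before lower-casing.
    lowered = s.replace("I", "ı").replace("İ", "i").lower()
    # Anything outside the alphabet sorts after it, in code point order.
    return [TURKISH_RANK.get(ch, len(TURKISH_ALPHABET) + ord(ch)) for ch in lowered]

def make_collation(key):
    """Wrap a key function as the (a, b) -> negative/zero/positive callback SQLite expects."""
    def collate(a, b):
        ka, kb = key(a), key(b)
        return (ka > kb) - (ka < kb)
    return collate

conn = sqlite3.connect(":memory:")
conn.create_collation("NOCASE_DE", make_collation(german_key))
conn.create_collation("NOCASE_TR", make_collation(turkish_key))
conn.execute("CREATE TABLE words (w TEXT)")
conn.executemany("INSERT INTO words VALUES (?)", [("öl",), ("oz",), ("Ot",)])

def sorted_words(collation: str) -> list:
    # The collation name is interpolated only because it comes from this script.
    rows = conn.execute(f"SELECT w FROM words ORDER BY w COLLATE {collation}")
    return [row[0] for row in rows]

print(sorted_words("NOCASE_DE"))  # ö treated as o
print(sorted_words("NOCASE_TR"))  # ö sorts after every o
```

The same three rows come back in a different order depending on the collation, and any index built with one collation is useless for the other.  Scaling these toy tables up to every script and language is exactly where the size comes from.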

Because of these and numerous other considerations, even minimal code that 
purports to do Unicode-savvy string collation needs to be either a lot of code, 
or some code and a huge amount of data.  And I hate that fact as much as you 
do.  But that’s why ICU is as big as it is.

Simon.

Refs:
        Unicode® Technical Standard #10: UNICODE COLLATION ALGORITHM
        <http://www.unicode.org/reports/tr10/>
_______________________________________________
sqlite-users mailing list
[email protected]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
