Re: [sqlite] Unicode collation

Nuno Lucas Thu, 28 Jun 2007 08:27:52 -0700

On 6/28/07, Jiri Hajek <[EMAIL PROTECTED]> wrote:

> My idea is to implement the UCA collation in SQLite (with the usual
> OMIT_* #ifdef's), using the DUCET table as base, and if people need
> the tailoring part for localized sorting, have it be optional by
> having a "sqlite_collation_data" table with the needed locale data
> included on the database.


It would certainly be great if this were implemented. Note, however,
that it doesn't fully solve the issues described in this thread: if
you create a DB with one SQLite version and then open it with a newer
version in which some DUCET elements were added or modified, the
indexes of that DB would no longer be compatible. This can be resolved
in several ways, e.g. by storing all DUCET data in a special table in
the SQLite database, as suggested. It's just a matter of choosing a
well-balanced solution...


There are 2 problems: UCA changes and DUCET (and/or other locale data)
changes. DUCET and locale data are stored in a database table, so they
can only change by user intervention, meaning it's the user's fault if
that is done without rebuilding the affected index(es).
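
To make the table-driven idea concrete, here is a minimal sketch using
Python's sqlite3 module. The table name "collation_data", its columns and
the single-level weights are assumptions made for illustration (real
UCA/DUCET data has several weight levels). Note that SQLite reserves
names starting with "sqlite_", so ordinary SQL cannot create a table
called "sqlite_collation_data"; that would need support in the core.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table holding per-locale weights (one level only, for brevity).
conn.execute(
    "CREATE TABLE collation_data ("
    " locale TEXT, ch TEXT, weight INTEGER, PRIMARY KEY (locale, ch))"
)
conn.executemany(
    "INSERT INTO collation_data VALUES (?, ?, ?)",
    [("demo", "a", 1), ("demo", "ä", 2), ("demo", "b", 3), ("demo", "z", 4)],
)

def load_collation(conn, locale):
    """Build a comparison function from the weights stored for `locale`."""
    weights = dict(
        conn.execute(
            "SELECT ch, weight FROM collation_data WHERE locale = ?", (locale,)
        )
    )

    def key(s):
        # Characters without a stored weight sort after the known ones,
        # ordered by code point.
        return [(0, weights[c]) if c in weights else (1, ord(c)) for c in s]

    def compare(a, b):
        ka, kb = key(a), key(b)
        return (ka > kb) - (ka < kb)

    return compare

conn.create_collation("demo", load_collation(conn, "demo"))
conn.execute("CREATE TABLE words (w TEXT)")
conn.executemany("INSERT INTO words VALUES (?)", [("b",), ("ä",), ("z",), ("a",)])
rows = [r[0] for r in conn.execute("SELECT w FROM words ORDER BY w COLLATE demo")]
```

Because the weights live in the database file, the ordering travels with
the data; changing rows in the weight table without rebuilding the
indexes that use the collation is exactly the user-intervention case
described above.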

UCA changes are more problematic, but they are less frequent (it seems
there was a minor change between Unicode 4.0 and 4.1, though).

I don't see any good solution for this other than having an extra
field in the database file (or in the collation data tables) with the
UCA version, and warning the user when the UCA version in use differs
from the one the database was created with (or last used with, as by
default no collation data is needed).
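
A rough sketch of such a version stamp, again with Python's sqlite3
module. The table "collation_meta", the key 'uca_version' and the
version strings are hypothetical; nothing like this exists in SQLite
itself.

```python
import sqlite3

UCA_VERSION = "4.1.0"  # assumption: the UCA version built into the library

def check_uca_version(conn, library_version=UCA_VERSION):
    """Return True if the file's stored UCA version matches the library's.

    A file with no stamp yet is stamped with the current version.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS collation_meta ("
        " name TEXT PRIMARY KEY, value TEXT)"
    )
    row = conn.execute(
        "SELECT value FROM collation_meta WHERE name = 'uca_version'"
    ).fetchone()
    if row is None:
        conn.execute(
            "INSERT INTO collation_meta VALUES ('uca_version', ?)",
            (library_version,),
        )
        conn.commit()
        return True
    return row[0] == library_version
```

A caller would warn the user (or rebuild the affected indexes) when this
returns False, rather than silently using indexes built under different
rules.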

It's very probable that not many users are actually bothered by this
(the algorithm is probably not changing much over time, and probably
not in incompatible ways for most locales).

Maybe we can just make sure "PRAGMA integrity_check" notices if
there is an inconsistency and that VACUUM will fix this.
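
For reference, SQLite already has a REINDEX command intended for the case
where a collation's definition changes. The sketch below simulates a
rules change by re-registering a collation under the same name; the
collation name "demo" and both comparison functions are made up for the
example. Whether the integrity check flags the stale index depends on
the data, so the sketch makes no claim about the "before" result.

```python
import sqlite3

def nocase(a, b):
    # "Old" rules: case-insensitive comparison.
    a, b = a.lower(), b.lower()
    return (a > b) - (a < b)

def codepoint(a, b):
    # "New" rules: plain code point comparison.
    return (a > b) - (a < b)

conn = sqlite3.connect(":memory:")
conn.create_collation("demo", nocase)
conn.execute("CREATE TABLE words (w TEXT)")
conn.execute("CREATE INDEX words_w ON words (w COLLATE demo)")
conn.executemany("INSERT INTO words VALUES (?)", [("a",), ("B",), ("c",)])
conn.commit()

# Same collation name, different ordering: the index is now stale.
conn.create_collation("demo", codepoint)

def integrity(conn):
    return [r[0] for r in conn.execute("PRAGMA integrity_check")]

before = integrity(conn)      # may or may not report the stale index
conn.execute("REINDEX demo")  # rebuild every index using this collation
after = integrity(conn)
ordered = [
    r[0] for r in conn.execute("SELECT w FROM words ORDER BY w COLLATE demo")
]
```

After the REINDEX, the index agrees with the new rules again and the
integrity check reports "ok".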

Btw, even if this is implemented, there is still a need to
standardize the new collation names. E.g. should the new
language-neutral collation be called Unicode or DUCET? And how about
language-specific collations? After some thought, I'd suggest
something like UNIL_en_AU (where UNIL means Unicode linguistic, i.e.
some characters are properly ignored, giving for example the ordering
'con', 'coop', 'co-op') and UNIS_en_AU (where UNIS means Unicode
strings, i.e. special characters aren't ignored, so that the above
words would be ordered as 'co-op', 'con', 'coop').
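
The two orderings can be reproduced with a toy sketch in Python's sqlite3
module. This is not UCA: the set of ignored characters and the
tie-break rule (shorter original string first) are assumptions chosen
only to match the orderings given above, and the collation names are the
proposed ones, not names SQLite knows about.

```python
import sqlite3

IGNORED = "-'"  # assumption: punctuation skipped by the "linguistic" variant

def unis_compare(a, b):
    # "Unicode strings": every character counts, compared by code point.
    return (a > b) - (a < b)

def unil_compare(a, b):
    # "Unicode linguistic": compare with ignored characters removed;
    # ties go to the shorter original, then to code point order.
    ka = ("".join(c for c in a if c not in IGNORED), len(a), a)
    kb = ("".join(c for c in b if c not in IGNORED), len(b), b)
    return (ka > kb) - (ka < kb)

conn = sqlite3.connect(":memory:")
conn.create_collation("UNIL_en_AU", unil_compare)
conn.create_collation("UNIS_en_AU", unis_compare)
conn.execute("CREATE TABLE words (w TEXT)")
conn.executemany("INSERT INTO words VALUES (?)", [("coop",), ("co-op",), ("con",)])

def sorted_words(collation):
    # Collation names cannot be bound as parameters; pass trusted names only.
    query = f"SELECT w FROM words ORDER BY w COLLATE {collation}"
    return [r[0] for r in conn.execute(query)]

linguistic = sorted_words("UNIL_en_AU")
strings = sorted_words("UNIS_en_AU")
```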


I don't find this particularly important, because the collation name
has to be stored in the tables anyway, so on embedded devices it can
be called "DEFAULT" and the "default" tables can hold data only for
the "fr_FR" locale (with the DUCET base embedded).

I'm thinking there will be a "reference" database with all locale
data, and it's up to users to use it "as is" or build their own (maybe
just rename locales).

It seems more natural to me to use the standard C locale names (the
usual "pt_PT" and "pt_BR" for Portuguese/Portugal culture and
Portuguese/Brazil culture), but I'm open to suggestions when that
problem arises, and I'm sure there are already standards we can follow
in relation to that.

Well, I will probably only have time to actually put words into code
next weekend, so I will say something when I have source code to show.


Best regards,
~Nuno Lucas


Jiri

-----------------------------------------------------------------------------
To unsubscribe, send email to [EMAIL PROTECTED]
-----------------------------------------------------------------------------

