On Mon, Nov 30, 2009 at 08:15:58AM +0000, Simon Slavin wrote:
> On 30 Nov 2009, at 1:58am, Igor Tandetnik wrote:
> > Note that Unicode collation is not as simple as you might think. Did
> > you know that in Estonian, 'y' sorts between 'i' and 'j'? Or that in
> > German phonebook sort, 'oe' sorts as if it were a single letter
> > between 'o' and 'p'? Basically, your simplistic approach would only
> > work for plain unaccented Latin letters and English collation rules.
> 
> I spent a lot of time annoyed about this, and ended up deciding that
> the only way to do Unicode sorting correctly is to store the language
> (or collation method) with each piece of Unicode text.  Of course,
> this still gives you the problem of working out which order two pieces
> of text go in if they are in two different languages.  Perhaps you
> also need a 'default language' marker for the entire column.

Consider a column that contains a person's last name.  Q: do proper
names have a language?  A: No, since people can be from all over and
even within a single country may have last names of various radically
different origins.

Now consider a column that contains a person's self-description.  Q:
what language will it be in?  A: If you don't specify it, it could be in
one (or more!) language(s) of the person's choice.  You might not need
to sort by such a column though...

IMO you'll have two types of text to sort: a) generic text (e.g., proper
names), b) localized text (e.g., message catalogs).  For (a) you'll want
to pick a collation, _any_ collation.  For (b) you'll know the language
of the relevant text and can then sort with a specific collation.  If
you think there is a third class of text then you'll be best off forcing
it into (a) or (b), IMO, as statements like this make no sense:

    CREATE TABLE foo (txt, lang);
    ...
    -- this makes no sense:
    SELECT txt FROM foo ORDER BY txt COLLATE lang; -- where lang is a
                                                   -- column of foo;

Unfortunately you can't parametrize collations in COLLATE clauses in
SQLite3!  I.e., this gives an error:

sqlite> CREATE TABLE toy(a);
sqlite> EXPLAIN SELECT a FROM toy ORDER BY a COLLATE :a ASC;
SQL error: near ":a": syntax error

Therefore for (b) you'll have to compile statements for each collation
that you want to use.  (Say you have per-language tables: you'll need
separate SELECT statements for each table.  Say you have a single table
with a text column and another column indicating language: you'll need
to compile a statement for each language since the collation cannot be
parametrized.)

Making collation functions parametrizable might be a useful extension,
so you could then write:

sqlite> CREATE TABLE toy(txt, lang);
sqlite> explain SELECT txt FROM toy WHERE lang = :a ORDER BY txt COLLATE :b ASC;

where :a specifies some language and :b specifies a collation for that
language.

Note too that Unicode has codepoints for specifying the language that
the subsequent text is written in.  Such codepoints could be used for
deriving a collation from some text.  But again, I don't think this will
prove useful, both, for the reasons given above (SELECT ...  ORDER BY
... COLLATE lang_of(...) makes no sense) and also because users won't
know how to ensure that such language tags are embedded in the text that
they write.

Nico
-- 
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to