On Mon, Nov 30, 2009 at 08:15:58AM +0000, Simon Slavin wrote: > On 30 Nov 2009, at 1:58am, Igor Tandetnik wrote: > > Note that Unicode collation is not as simple as you might think. Did > > you know that in Estonian, 'y' sorts between 'i' and 'j'? Or that in > > German phonebook sort, 'oe' sorts as if it were a single letter > > between 'o' and 'p'? Basically, your simplistic approach would only > > work for plain unaccented Latin letters and English collation rules. > > I spent a lot of time annoyed about this, and ended up deciding that > the only way to do Unicode sorting correctly is to store the language > (or collation method) with each piece of Unicode text. Of course, > this still gives you the problem of working out which order two pieces > of text go in if they are in two different languages. Perhaps you > also need a 'default language' marker for the entire column.
Consider a column that contains a person's last name. Q: do proper names have a language? A: No, since people can be from all over and even within a single country may have last names of various radically different origins. Now consider a column that contains a person's self-description. Q: what language will it be in? A: If you don't specify it, it could be in one (or more!) language(s) of the person's choice. You might not need to sort by such a column though... IMO you'll have two types of text to sort: a) generic text (e.g., proper names), b) localized text (e.g., message catalogs). For (a) you'll want to pick a collation, _any_ collation. For (b) you'll know the language of the relevant text and can then sort with a specific collation. If you think there is a third class of text then you'll be best off forcing it into (a) or (b), IMO, as statements like this make no sense: CREATE TABLE foo (txt, lang); ... -- this makes no sense: SELECT txt FROM foo ORDER BY txt COLLATE lang; -- where lang is a -- column of foo; Unfortunately you can't parametrize collations in COLLATE clauses in SQLite3! I.e., this gives an error: sqlite> CREATE TABLE toy(a); sqlite> EXPLAIN SELECT a FROM toy ORDER BY a COLLATE :a ASC; SQL error: near ":a": syntax error Therefore for (b) you'll have to compile statements for each collation that you want to use. (Say you have per-language tables: you'll need separate SELECT statements for each table. Say you have a single table with a text column and another column indicating language: you'll need to compile a statement for each language since the collation cannot be parametrized.) Making collation functions parametrizable might be a useful extension, so you could then write: sqlite> CREATE TABLE toy(txt, lang); sqlite> explain SELECT txt FROM toy WHERE lang = :a ORDER BY txt COLLATE :b ASC; where :a specifies some language and :b specifies a collation for that language. Note too that Unicode has codepoints for specifying the language that the subsequent text is written in. Such codepoints could be used for deriving a collation from some text. But again, I don't think this will prove useful, both, for the reasons given above (SELECT ... ORDER BY ... COLLATE lang_of(...) makes no sense) and also because users won't know how to ensure that such language tags are embedded in the text that they write. Nico -- _______________________________________________ sqlite-users mailing list sqlite-users@sqlite.org http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users