On 17 August 2010 13:06, Nikola Smolenski <[email protected]> wrote:
> On Tuesday 17 August 2010 20:37:44 Aryeh Gregor wrote:
>> The code is currently enabled in trunk and is still awaiting review.
>> It's basically complete, but there are some issues left:
>>
>> * What sortkey algorithm to use?  Currently it just ASCII uppercases
>> the words, which is okay for a proof-of-concept but doesn't actually
>> solve bug 164.
>
> For some time now, I have been thinking about a stupidly simple solution:
>
> php -r 'for($i = 0; $i < 65536; $i++) { echo pack("nx", $i); echo "\n"; }'|
> iconv -f ucs-2be -t utf8 | sort | php -r 'foreach(file("php://stdin") as $v)
> { echo var_export(substr($v, 0, -1)) . " => \"" . str_pad(base_convert($i,
> 10, 36), 4, 0, STR_PAD_LEFT) . "\",\n"; $i++; }'
>
> This, more or less, should:
>
> - Print every Unicode (UCS-2 only) character on its own line
> - Sort that according to the current locale
> - Print a PHP array to replace each Unicode character (UTF-8 encoded) with
> appropriate base36 number
>
> If a UTF-8 string is encoded with this array, the resulting strings should
> sort exactly the same as in the locale through mere ASCII sorting. Or am I
> missing something big? (Except contextual sensitivity, but it occurs
> relatively rarely and this should still be better than what we have now.)
>

You are missing most of it :). In many cases a single "letter" is made
up of multiple code points (of which there are considerably more than
65536, by the way): think of Hungarian gy. Then there are all kinds of
conventions for sorting accents: in French you sort á after a, but
only if the rest of the word is spelt the same (i.e. ab < áb < ac).
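
To make the accent point concrete, here is a minimal Python sketch. It
is not ICU and not anything from MediaWiki; the names two_level_key and
per_char_key are made up for illustration, and the four-digit decimal
codes stand in for the base36 codes in the quoted proposal. It compares
a per-character substitution table against a toy two-level key that
looks at base letters first and uses accents only to break ties:

```python
import unicodedata

def two_level_key(word):
    """Toy two-level key: base letters first, accents only as a tie-breaker."""
    primary, secondary = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and secondary:
            secondary[-1] += ch      # attach the accent to the preceding base letter
        else:
            primary.append(ch)
            secondary.append("")
    return ("".join(primary), tuple(secondary))

def per_char_key(word, table):
    """Quoted proposal, simplified: substitute each character independently."""
    return "".join(table[ch] for ch in word)

words = ["ac", "\u00e1b", "ab"]      # "\u00e1" is á

# Rank the single characters in a sensible order (a, á, b, c), the way
# sorting one character per line would, then give each a fixed-width code.
alphabet = sorted({ch for w in words for ch in w}, key=two_level_key)
table = {ch: format(rank, "04d") for rank, ch in enumerate(alphabet)}

naive_order = sorted(words, key=lambda w: per_char_key(w, table))
layered_order = sorted(words, key=two_level_key)

print(naive_order)    # ['ab', 'ac', 'áb']
print(layered_order)  # ['ab', 'áb', 'ac']
```

With the per-character table, á outranks a at the first position, so áb
lands after ac. The two-level key only consults the accent once the base
letters tie, which gives ab < áb < ac. A real collator has more levels
and also handles multi-character letters such as gy, which this sketch
does not attempt.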

There is ICU, which is available to PHP (in some versions) through the
Collator class: http://docs.php.net/manual/en/class.collator.php. Using
its sort keys should be "good enough" for now, I imagine. There are
languages on Wiktionary that ICU won't cover yet (just because they are
ludicrously obscure), but it's probably best to start with something
manageable.

Conrad

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
