[Bug 45596] Set $wgCategoryCollation to 'uca-hu' on Hungarian Wikipedia and rebuild category sort keys

bugzilla-daemon Sat, 09 Mar 2013 10:56:46 -0800

https://bugzilla.wikimedia.org/show_bug.cgi?id=45596


--- Comment #13 from Bartosz Dziewoński ---
(In reply to comment #12)
> - will it be harder to change the rules on the fly, if they turn out to
> be imperfect? I understand changing the collation is difficult because
> one has to reindex the whole table, but I suppose changing the first
> letters would be simpler.

Real changes to the collation will require running the update script again,
which might take a couple of hours for hu.wiki (according to Reedy's
testing, it took about 20 hours for the 3.2 million pages on pl.wikipedia).
Category sorting might be slightly borked during this time, and all category
pages will have to be purged afterwards (action=purge or just wait till the
caches expire).

Changing the first letters later won't break the collation, since it's
entirely handled by an external library (ICU); it'll require a purge to
appear on-wiki, though.
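
To make the purge step concrete, here is a minimal sketch that builds an
action=purge URL for a single category page. The base URL and the category
title are assumptions for illustration only; it merely constructs the URL and
does not send any request.

```python
from urllib.parse import urlencode

def purge_url(base, title):
    """Build an index.php URL that asks MediaWiki to purge one page.

    `base` and `title` are supplied by the caller; the values used below
    are illustrative assumptions, not taken from the bug report.
    """
    return base + "?" + urlencode({"title": title, "action": "purge"})

# Hypothetical category page on hu.wiki.
print(purge_url("https://hu.wikipedia.org/w/index.php", "Kategória:Példa"))
```

Purging this way touches one page at a time, so for a whole wiki simply
waiting for the caches to expire may be the more practical option.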


> - by the way, should we also check the collation itself? I have mostly
> collected input on the first letter grouping until now.

Please do, but I'm pretty much certain it's correct; it's handled by the ICU
library, which is a battle-tested and mature piece of software.


> - will it be possible to create custom groups? (e.g. someone suggested
> using a "Numbers" group, having separate groups for all digits looks a
> bit silly)

This isn't supported right now, but at first glance it looks possible; it
would likely depend on whether creating the group would require a different
sorting order. However, IMO this particular change should be done for all projects
at once, if desired, and should wait for the natural number sorting to be
implemented first (bug 6948) and for multiple collation support (bug 44667;
the chinese-collation branch includes this).
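
As a rough sketch of what a custom "Numbers" group could look like (this is
hypothetical Python, not how MediaWiki implements first-letter headings, and
the function names are made up for illustration): all digits are folded into
one heading, everything else keeps its first character. It works without
touching the sort order only because digits already sort next to each other;
a group whose members are not contiguous in the collation would need a
different sorting order.

```python
def first_letter_group(sort_key):
    """Return the heading a category member would be listed under.

    Hypothetical helper: ASCII digits share one "Numbers" heading,
    everything else is grouped by its uppercased first character.
    """
    if not sort_key:
        return ""
    first = sort_key[0]
    if first in "0123456789":
        return "Numbers"
    return first.upper()

def group_members(sort_keys):
    """Group already-sortable keys under their headings, preserving order."""
    groups = {}
    for key in sorted(sort_keys):  # plain code-point sort, not a UCA collation
        groups.setdefault(first_letter_group(key), []).append(key)
    return groups

print(group_members(["2001", "Alma", "7 up", "Béla"]))
```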


> - what is the logic for non-Hungarian characters? Accented latin
> characters seem to be ordered as if the accents were stripped, which is
> good, but it would be nice to see the rules spelled out somewhere.

Yes, that's exactly what happens, and similarly for accented variants of
letters in other alphabets; I thought I mentioned that somewhere, apologies.
The default sorting rules are the ones [[Unicode Collation Algorithm]] uses;
they are appropriately tailored for each language-specific collation.
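
To illustrate "ordered as if the accents were stripped", here is a minimal
Python approximation. It is not the ICU implementation and ignores
language-specific tailoring: it decomposes each string, drops the combining
marks and compares what is left, which roughly mimics a primary-level
comparison. The real algorithm uses weight tables and falls back to accents
and case to break ties. The function name and sample titles are made up.

```python
import unicodedata

def primary_key(text):
    """Approximate a primary-level sort key (assumption: accents and case
    are ignored entirely; real UCA uses them as tie-breakers)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold()

titles = ["Zebra", "Écho", "eagle", "Élan", "Eber"]
print(sorted(titles, key=primary_key))
# → ['eagle', 'Eber', 'Écho', 'Élan', 'Zebra']
```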

The default "first-letters" list includes the full basic Latin, Greek and
Cyrillic alphabets and I think all printable ASCII characters, as well as a
lot of letters from other alphabets and a whole lot of Unicode symbols. It
is generated by MediaWiki based on the data about which letters have
primary-level weight in UCA, but I'm not sure what the exact behavior is;
you can see the generation script at
/maintenance/language/generateCollationData.php in the mediawiki/core
repository, and the pregenerated list at /serialized/first-letters-root.ser.
I doubt that's relevant, though. :)
