https://bugzilla.wikimedia.org/show_bug.cgi?id=46330

--- Comment #4 from Bartosz Dziewoński <[email protected]> ---
(In reply to comment #2)
> It would also be nice to know which standard the ICU implementation
> is supposed to comply with (my guess: SFS-EN 13710). There are a couple of
> slightly different standards.

I have no idea, to be honest. Wikimedia wikis are currently running ICU 4.8
(per bug 46036); that's all the information I can give you :)

The data used to "partition" the sorted list into headers is probably not
standardised at all, and is derived in some way from primary-level collation
data. For details you should probably look at the code that generates it,
maintenance/language/generateCollationData.php.
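
To illustrate the general idea only (this is not the MediaWiki code, and I am
assuming the header is simply the last header letter that sorts at or before
the title), here is a minimal Python sketch. The alphabet ORDER, the PRIMARY
folding table and the function names are all made up for the example; real
data would come from ICU.

```python
from bisect import bisect_right

# Hypothetical primary-level order for a Finnish-like alphabet (toy data, not ICU).
ORDER = "ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ"

# Hypothetical folding of letters that differ from a base letter only below
# the primary level (toy data, not ICU).
PRIMARY = {"É": "E", "Š": "S", "Ž": "Z"}


def sort_key(text):
    """Toy primary-level sort key: one integer per letter.

    Raises ValueError for characters missing from ORDER (digits, spaces, ...).
    """
    return [ORDER.index(PRIMARY.get(ch, ch)) for ch in text.upper()]


def header_for(title, headers):
    """Return the header under which `title` would be listed.

    `headers` must already be in collation order. The header chosen is the
    last one whose key sorts at or before the key of the title.
    """
    keys = [sort_key(h) for h in headers]
    i = bisect_right(keys, sort_key(title)) - 1
    return headers[max(i, 0)]


headers = list(ORDER)
print(header_for("Éclair", headers))
print(header_for("Äiti", headers))
```

With this scheme, any extra character that ends up in the header list gets its
own section and captures every title sorting after it, which is why the list
needs per-language adjustments.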


(In reply to comment #3)
> I wonder if there is some fundamental flaw with the grouping of letters under
> these one-letter headers?

I don't think there's such a "fundamental flaw" in it; the list is generated
using generalised data that's reasonably correct for most languages, and thus
needs such modifications for some specific ones. For example, no modifications
were needed for Portuguese, and Polish only required adding the appropriate
letters with diacritics.

You and Swedes are just unlucky, I suppose :) It's interesting how those
characters are sorted among Latin letters in Finnish, and at the end of the
Latin alphabet in Polish or Portuguese.

I automatically created a category with all two-letter combinations of ASCII
letters + Å, Ä, Ö:
http://users.v-lo.krakow.pl/~matmarex/testwiki-fi/index.php?title=Luokka:Autotest

It seems like we need to exclude these four characters: Ǥ, Ŋ, Ŧ, Ʒ. I'll
submit a patch to do this later today.
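
For anyone who wants to repeat the test, a list of such two-letter titles can
be produced along these lines. This is only a sketch in Python, not the script
actually used; the names ALPHABET and two_letter_titles are invented here, and
I am assuming uppercase letters.

```python
from itertools import product

# 26 ASCII letters plus Å, Ä, Ö (uppercase is an assumption of this sketch).
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZÅÄÖ"


def two_letter_titles(alphabet=ALPHABET):
    """Return every ordered two-letter combination of the given letters."""
    return [first + second for first, second in product(alphabet, repeat=2)]


titles = two_letter_titles()
print(len(titles))  # 29 * 29 combinations
```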

_______________________________________________
Wikibugs-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l
