On Mon, May 11, 2009 at 3:29 PM, Lars Aronsson <[email protected]> wrote:
> There is a way to avoid all such problems, namely by a more
> aggressive use of DEFAULTSORT that removes from sorting all upper
> case letters (except the initial one), all whitespace and all
> commas. It would mean almost every article needs a DEFAULTSORT.
> In the examples above:
>
> {{DEFAULTSORT:Walesjimmy}}
> {{DEFAULTSORT:Europeancourtofauditors}}
> {{DEFAULTSORT:Europeanunionmission}}
> {{DEFAULTSORT:Europeanquarterofbrussels}}
> {{DEFAULTSORT:Moonillusion}}

This would be a good thing to do in the software. We could implement
the framework reasonably easily, if anyone cares to, and then let each
language do its thing. A basic English implementation like this would
be easy enough.

Of course, any change to the sortkey beyond the first will require
that all existing sort keys be changed by a batch job -- otherwise
sorting will be a mess. Every change to the sortkey algorithm would
either require that all pages be reparsed (very expensive), or that a
special conversion script be defined to account for that exact change.
Unless it's minor enough that the inconsistency is acceptable, I guess.

On Tue, May 12, 2009 at 7:18 AM, Petr Kadlec <[email protected]> wrote:
> Well, not really. Bug 164 would be fixed almost completely for
> Czech-language wikis by using database features designed for exactly
> this problem. [1] But, I guess you know the situation.
> ...
> [1] http://dev.mysql.com/doc/refman/4.1/en/charset-collation-effect.html

Note the version. Wikimedia uses MySQL 4.0, which doesn't contain any
charsets or collations other than binary. If we used a higher version,
utf8 might be an option: that would use a Unicode collation, I guess,
which should at least be okay for most languages, if not perfect. (But
MySQL's utf8 has other downsides, like being variable-width and not
supporting Unicode outside the BMP.)

> If Swedish sorting rules are simple enough that removing all
> whitespace and punctuation and converting to lower case would solve
> most of the problems, I would say that such feature would not be too
> difficult to implement right into MediaWiki (into LanguageSv.php),
> writing those DEFAULTSORT codes explicitly into every article would be
> nonsense, IMHO. (So, go ahead with it, I won't stop you or anything,
> I'm just trying to say that this is not really a solution for Czech
> language.)

There's no reason this couldn't be implemented for Czech as well in
the software, in principle.
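
To make the idea of a per-language framework a bit more concrete, here
is a rough sketch in Python. It is not MediaWiki code (MediaWiki is
PHP), and the names english_sortkey, NORMALIZERS and make_sortkey are
made up for illustration. The English rule is only one reading of the
DEFAULTSORT examples quoted above: keep the first character as it is,
lowercase the rest, and drop whitespace and commas.

```python
from typing import Callable, Dict

# Hypothetical type for a per-language sortkey normalizer.
SortkeyFn = Callable[[str], str]


def english_sortkey(text: str) -> str:
    """Assumed English rule: drop whitespace and commas, then
    lowercase everything except the first character."""
    stripped = "".join(ch for ch in text if not ch.isspace() and ch != ",")
    return stripped[:1] + stripped[1:].lower()


# Hypothetical registry: each language plugs in its own rule.
NORMALIZERS: Dict[str, SortkeyFn] = {
    "en": english_sortkey,
}


def make_sortkey(text: str, lang: str) -> str:
    """Apply the language's normalizer, or leave the key unchanged
    when no rule is registered for that language."""
    normalizer = NORMALIZERS.get(lang)
    return normalizer(text) if normalizer else text


if __name__ == "__main__":
    titles = ["European Union Mission", "European Court of Auditors"]
    # Sort by the normalized key rather than by the raw title.
    for title in sorted(titles, key=lambda t: make_sortkey(t, "en")):
        print(make_sortkey(title, "en"), "<-", title)
```

A language with no registered rule falls through unchanged, so wikis
would keep their current ordering until someone writes a rule for
them. The stored keys would still have to be regenerated whenever a
rule changes, as noted above.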

Ideally we'd use something based on Unicode collation as a baseline,
with optional customizations per language:

http://unicode.org/reports/tr10/
