Re: [Wikitech-l] More aggressive DEFAULTSORT

Aryeh Gregor Tue, 12 May 2009 08:40:07 -0700

On Mon, May 11, 2009 at 3:29 PM, Lars Aronsson <[email protected]> wrote:
> There is a way to avoid all such problems, namely by a more
> aggressive use of DEFAULTSORT that removes from sorting all upper
> case letters (except the initial one), all whitespace and all
> commas.  It would mean almost every article needs a DEFAULTSORT.
> In the examples above:
>
>  {{DEFAULTSORT:Walesjimmy}}
>  {{DEFAULTSORT:Europeancourtofauditors}}
>  {{DEFAULTSORT:Europeanunionmission}}
>  {{DEFAULTSORT:Europeanquarterofbrussels}}
>  {{DEFAULTSORT:Moonillusion}}

This would be a good thing to do in the software.  We could implement
the framework reasonably easily, if anyone cares to, and then let each
language do its thing.  A basic English implementation like this would
be easy enough.
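
As a rough sketch of what such a basic English rule might look like (in
Python rather than MediaWiki's PHP, and with the hypothetical function
name aggressive_sortkey), assuming the rule is exactly the one Lars
describes: drop whitespace and commas, keep the initial character as
is, and lower-case everything after it.

```python
def aggressive_sortkey(title: str) -> str:
    """Build a sort key from a title (illustrative only, not MediaWiki code).

    Whitespace and commas are removed, the first remaining character is
    kept unchanged, and the rest is lower-cased.
    """
    stripped = "".join(ch for ch in title if not ch.isspace() and ch != ",")
    # Slicing keeps this safe for an empty string.
    return stripped[:1] + stripped[1:].lower()


if __name__ == "__main__":
    for title in ("Wales, Jimmy", "European Court of Auditors", "Moon illusion"):
        print(title, "->", aggressive_sortkey(title))
```

Other punctuation (hyphens, parentheses, apostrophes) is left alone
here; whether it should be stripped too is a per-language decision.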

Of course, any change to the sortkey beyond the first will require
that all existing sort keys be changed by a batch job -- otherwise
sorting will be a mess.  Every change to the sortkey algorithm would
either require that all pages be reparsed (very expensive), or that a
special conversion script be defined to account for that exact change.
Unless it's minor enough that the inconsistency is acceptable, I
guess.
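
One way to picture the conversion-script option is to tag each stored
sort key with the version of the algorithm that produced it and have a
batch job rewrite only the stale ones. The sketch below is in Python
and is purely illustrative: the row layout, the version numbers and
the names sortkey_v2 and upgrade_sortkeys are all made up, and it
assumes the new key can be derived from the stored title alone, which
would not hold for pages with a hand-written DEFAULTSORT.

```python
CURRENT_VERSION = 2  # hypothetical version number of the sortkey algorithm


def sortkey_v2(title):
    """Hypothetical current algorithm: strip whitespace/commas, lower-case the tail."""
    stripped = "".join(ch for ch in title if not ch.isspace() and ch != ",")
    return stripped[:1] + stripped[1:].lower()


def upgrade_sortkeys(rows):
    """Recompute keys written by an older algorithm; return how many rows changed.

    Each row is a dict with "title", "sortkey" and "version" entries
    (an invented layout, not the real categorylinks schema).
    """
    updated = 0
    for row in rows:
        if row["version"] < CURRENT_VERSION:
            row["sortkey"] = sortkey_v2(row["title"])
            row["version"] = CURRENT_VERSION
            updated += 1
    return updated
```

Rows already at the current version are skipped, so the job can be
interrupted and rerun without redoing finished work.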

On Tue, May 12, 2009 at 7:18 AM, Petr Kadlec <[email protected]> wrote:
> Well, not really. Bug 164 would be fixed almost completely for
> Czech-language wikis by using database features designed for exactly
> this problem. [1] But, I guess you know the situation.
> ...
> [1] http://dev.mysql.com/doc/refman/4.1/en/charset-collation-effect.html

Note the version.  Wikimedia uses MySQL 4.0, which doesn't contain any
charsets or collations other than binary.  If we used a higher
version, utf8 might be an option: that would use a Unicode collation,
I guess, which should at least be okay for most languages, if not
perfect.  (But MySQL's utf8 has other downsides, like being
variable-width and not supporting Unicode outside the BMP.)

> If Swedish sorting rules are simple enough that removing all
> whitespace and punctuation and converting to lower case would solve
> most of the problems, I would say that such a feature would not be too
> difficult to implement right in MediaWiki (in LanguageSv.php);
> writing those DEFAULTSORT codes explicitly into every article would be
> nonsense, IMHO. (So, go ahead with it, I won’t stop you or anything,
> I’m just trying to say that this is not really a solution for the Czech
> language.)

There's no reason this couldn't be implemented for Czech as well in
the software, in principle.  Ideally we'd use something based on
Unicode collation as a baseline, with optional customizations per
language:

http://unicode.org/reports/tr10/
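
To illustrate the "baseline plus per-language customization" idea, here
is a much simplified sketch in Python. It is not an implementation of
the Unicode Collation Algorithm: it only ranks letters by their
position in a per-language alphabet, treats multi-character letters
such as Czech "ch" as one unit, and folds other accented characters to
their base letter. The names make_sort_key and CZECH_ALPHABET are
hypothetical, and the alphabet is a simplification of the real Czech
rules.

```python
import unicodedata

# Simplified Czech alphabet (an assumption for this sketch): "ch" is one
# letter sorted after "h"; č, ř, š, ž are separate letters after c, r, s, z.
CZECH_ALPHABET = [
    "a", "b", "c", "č", "d", "e", "f", "g", "h", "ch", "i", "j", "k",
    "l", "m", "n", "o", "p", "q", "r", "ř", "s", "š", "t", "u", "v",
    "w", "x", "y", "z", "ž",
]


def make_sort_key(alphabet):
    """Return a key function that orders strings by the given alphabet."""
    rank = {letter: i for i, letter in enumerate(alphabet)}
    longest = max(len(letter) for letter in alphabet)

    def fold(ch):
        # Letters of the alphabet are kept; other characters lose their diacritics.
        if ch in rank:
            return ch
        decomposed = unicodedata.normalize("NFD", ch)
        return "".join(c for c in decomposed if not unicodedata.combining(c))

    def key(title):
        text = "".join(fold(ch) for ch in unicodedata.normalize("NFC", title.lower()))
        weights = []
        i = 0
        while i < len(text):
            # Greedy longest match, so "ch" wins over "c" followed by "h".
            for size in range(longest, 0, -1):
                chunk = text[i:i + size]
                if len(chunk) == size and chunk in rank:
                    weights.append(rank[chunk])
                    i += size
                    break
            else:
                # Unknown characters sort after the alphabet, by code point.
                weights.append(len(alphabet) + ord(text[i]))
                i += 1
        return tuple(weights)

    return key


czech_key = make_sort_key(CZECH_ALPHABET)
```

With this key, sorted(["chata", "cihla", "hrad", "ibis"], key=czech_key)
places "chata" after "hrad", which a plain binary sort does not. A real
implementation would also need secondary and tertiary weights (accents,
case), which this sketch discards.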

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
