https://bugzilla.wikimedia.org/show_bug.cgi?id=164





--- Comment #131 from Andrew Dunbar <[email protected]>  2009-11-19 08:11:38 UTC ---
> Well, there are people who know the architecture and the programming language
> and can presumably do this very quickly (at least the basic step)

Why do you presume it can be done quickly? It doesn't seem so to me and I
regard this bug to be of huge importance.

> if they realize how much of a priority it is. The first step is just to
> apply a few extra functions to category sort keys with the intention of
> converting them into collation keys.

Which functions would those be? What makes you think category sort keys can
be converted to collation keys?

For instance, category sort keys are human-readable and human-editable, but
collation keys are typically binary and not human-readable. Category sort
keys are included directly in the text of a category link such as
[[Category:foo|bar]].

Due to their binary nature, collation keys cannot appear here. So you would
need to decide whether to remove all category sort keys or make category sort
keys interact with collation keys that would be added elsewhere.

For collation keys to replace category sort keys you would need to establish
that category sort keys have no legitimate uses other than forcing alphabetic
order in cases where the current order results in nonalphabetic sequences.
I can assure you that people do use category sort keys for other purposes and
some might be vociferously upset if these were removed without discussion.

For collation keys to interact with category sort keys, you need to generate
and maintain, in the database, collation keys for each page title and for each
category sort key, since collation keys must be of the same nature to be
compared and hence sorted.
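
As a rough illustration of "keys of the same nature", here is a minimal Python
sketch. It is NOT the UCA: toy_collation_key is a made-up function that only
imitates the idea of layered weights (letters, then accents, then case), and
the page list is invented. The point is that the page title and the category
sort key both go through the same function, and that the result is bytes
rather than editable text.

```python
import unicodedata

def toy_collation_key(text: str) -> bytes:
    """Hypothetical toy key with three levels. This is not the UCA."""
    primary, secondary, tertiary = [], [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.combining(ch):
            # Accents only matter at the second level.
            secondary.append(ord(ch))
        else:
            # First level: case-folded base characters.
            primary.extend(ord(c) for c in ch.casefold())
            secondary.append(1)                        # "no accent" weight
            tertiary.append(2 if ch.isupper() else 1)  # lowercase first
    # A zero weight separates the levels, so a shorter string sorts first.
    weights = primary + [0] + secondary + [0] + tertiary
    return b"".join(w.to_bytes(3, "big") for w in weights)

# (page title, category sort key or None); invented sample data.
# "\u00c9" is a precomposed capital E with acute accent.
pages = [
    ("The Beatles", "Beatles, The"),
    ("Zebra", None),
    ("\u00c9mile Zola", None),
    ("apple", None),
]

def key_for(title, sort_key):
    # The same key function is used for titles and for sort keys.
    return toy_collation_key(sort_key if sort_key is not None else title)

ordered = [title for title, _ in sorted(pages, key=lambda p: key_for(*p))]
plain = sorted(["Zebra", "apple", "\u00c9mile"])  # raw code point order
```

Sorting by the toy key gives apple, The Beatles, Émile Zola, Zebra, whereas
plain code point order puts "Zebra" before "apple" and the accented title last.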

Now Unicode does specify a "Unicode Collation Algorithm" (UCA) which we could
and probably should use. It is language agnostic but provides for "tailoring"
for individual languages.

The UCA definitely generates binary keys. Not printable. Not human readable.

UCA keys can be very long. I use them in an offline tool for the English
Wiktionary and initially set their maximum length to 1024, 4 times the maximum
length of a page title. We already had about 10 pages for which 1024 was too
short so I had to set it to 2048! Many people might not like all page titles
and category sort keys to now require 9x their current amount of space in the
database.
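
One way to read the 9x figure, sketched as arithmetic in Python. The title
limit of 256 bytes is an assumption derived from "1024, 4 times the maximum
length of a page title"; the reading that the key is stored alongside the
original title is also an assumption.

```python
# Assumption: maximum title length implied by "1024 is 4 times the maximum".
TITLE_MAX = 1024 // 4            # 256 bytes

KEY_MAX = 2048                   # key length limit after the increase

key_ratio = KEY_MAX // TITLE_MAX                   # key alone vs. title
total_ratio = (TITLE_MAX + KEY_MAX) // TITLE_MAX   # title plus key vs. title

print(key_ratio, total_ratio)
```

Under those assumptions the key alone is 8 times the title limit, and title
plus key together come to 9 times the current space.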

UCA does allow for various types of sort key compression, however, in which
case we would need to choose one to use, since it will not be possible to mix
and match them.

PHP currently seems to have no implementation of UCA. We would need to create
it from scratch, or find a way to use one in C.

For multilingual wikis such as Commons and all of the Wiktionaries, just having
one collation language will not work, since users of each language will expect
things to be in the correct order for their language.

For the Wiktionaries this means each category needs a way to declare which
language collation to use and each page needs to declare which subset of
possible language collation keys to generate for that page.
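
A minimal Python sketch of what that could look like. Everything here is
hypothetical: the alphabets stand in for real UCA tailorings, and the names
ALPHABETS, PAGE_LANGUAGES, CATEGORY_LANGUAGE and toy_key are invented for
illustration, as are the sample pages and categories.

```python
import unicodedata

# Toy stand-ins for tailorings. Swedish treats a-ring, a-umlaut and o-umlaut
# as separate letters after z; the toy English rule folds them to a and o.
ALPHABETS = {
    "en": "abcdefghijklmnopqrstuvwxyz",
    "sv": "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6",
}

def toy_key(text: str, lang: str) -> bytes:
    """Hypothetical per-language key: one byte per character. Not the UCA."""
    alphabet = ALPHABETS[lang]
    weights = []
    for ch in unicodedata.normalize("NFC", text).casefold():
        if ch not in alphabet:
            # Not a letter of this language: fall back to the base letter.
            ch = unicodedata.normalize("NFD", ch)[0]
        idx = alphabet.find(ch)
        # Characters still unknown all share one weight after the alphabet.
        weights.append(idx + 1 if idx >= 0 else len(alphabet) + 1)
    return bytes(weights)

# Each page declares which language keys to generate for it.
PAGE_LANGUAGES = {
    "ångström": ["en", "sv"],
    "zebra": ["en", "sv"],
    "apple": ["en"],
    "öl": ["sv"],
    "zoo": ["en", "sv"],
}

# Each category declares which language collation it uses.
CATEGORY_LANGUAGE = {"English nouns": "en", "Swedish nouns": "sv"}

# One stored key per (page, language) pair, as a database table might hold.
KEYS = {
    (title, lang): toy_key(title, lang)
    for title, langs in PAGE_LANGUAGES.items()
    for lang in langs
}

def list_category(category: str) -> list:
    lang = CATEGORY_LANGUAGE[category]
    members = [t for t, langs in PAGE_LANGUAGES.items() if lang in langs]
    return sorted(members, key=lambda t: KEYS[(t, lang)])

english = list_category("English nouns")
swedish = list_category("Swedish nouns")
```

With these toy rules the English category lists ångström before apple, while
the Swedish category lists ångström after zoo. Note that a page in several
languages needs one stored key per language.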

For Commons I'm not sure what the requirements would be, but they may differ
from those of the Wiktionaries.

These new fields will need support in the database schema. The ones requiring
multiple language collations will require more drastic database changes, quite
different from what we now have.

> Once that's in place, people can work on actually writing such functions for
> their particular languages. Later, when those functions are written, they can
> be used additionally to generate a proper alphabetically ordered table of
> pages for use in the contents listings. Or some similar workflow

UCA tailoring would make the particular language collations very easy as long
as we have a decent implementation of UCA that easily works with tailoring.

> - but there needs to be a plan of action, and that can't be effected by just
> anyone, only by the devs who are in charge (no use pretending that everyone's
> equal - only certain devs actually have the power to make anything happen).

Not true. Anyone with commit access can add such code. Myself for instance.
My understanding is that there are not technically any devs in charge at the
moment since Brion stepped down, though there certainly are a few, such as Tim,
who are acknowledged to have a greater understanding of the entire codebase
and hence greater trust. You definitely want those people to check such
changes, and would expect them to revert any premature commits.

