[Bug 164] Support collation by a certain locale (sorting order of characters)

2011-09-18 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

Huji huji.h...@gmail.com changed:

   What|Removed |Added

 CC||reza.ene...@gmail.com

--- Comment #237 from Huji huji.h...@gmail.com 2011-09-18 17:40:08 UTC ---
*** Bug 30287 has been marked as a duplicate of this bug. ***

-- 
Configure bugmail: https://bugzilla.wikimedia.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are on the CC list for the bug.

___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 164] Support collation by a certain locale (sorting order of characters)

2011-09-06 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

Brion Vibber br...@wikimedia.org changed:

   What|Removed |Added

 Blocks|29788   |



[Bug 164] Support collation by a certain locale (sorting order of characters)

2011-09-03 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

Philippe Verdy verd...@wanadoo.fr changed:

   What|Removed |Added

 Status|RESOLVED|REOPENED
 Resolution|FIXED   |

--- Comment #234 from Philippe Verdy verd...@wanadoo.fr 2011-09-03 16:52:30 
UTC ---
Don't close this bug as long as it blocks other bugs that are not closed:

   bug 3969 -- UTF-8 compatibility (tracking)
   bug 6948 -- natural number sorting
   bug 29788 -- sort Swedish letters correctly (at end of alphabet)

These bugs effectively depend on UCA (or UCA improvements).

My opinion is that bug 3969 should be detached, but the other two should be
reattached to one of the 4 new bugs on which they depend (most probably bug
30673).

And the new bugs should still be linked to this bug 164, which would be turned
into a tracking bug only (and should not be closed as long as the 4 new bugs
are open).



[Bug 164] Support collation by a certain locale (sorting order of characters)

2011-09-03 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

Phillip Patriakeas dragonlordofxant...@gmail.com changed:

   What|Removed |Added

 CC||dragonlordofxanther@gmail.c
   ||om

--- Comment #235 from Phillip Patriakeas dragonlordofxant...@gmail.com 
2011-09-03 17:10:54 UTC ---
(In reply to comment #234)
> Don't close this bug as long as it blocks other bugs that are not closed

I thought it worked the other way around; bugs that are blocked by open bugs
can't/shouldn't be closed...?



[Bug 164] Support collation by a certain locale (sorting order of characters)

2011-09-03 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

Siebrand s.mazel...@xs4all.nl changed:

   What|Removed |Added

 Status|REOPENED|RESOLVED
 Resolution||FIXED

--- Comment #236 from Siebrand s.mazel...@xs4all.nl 2011-09-03 17:22:30 UTC 
---
RESOLVED / FIXED per hexmode. Please leave this closed. We discussed this at
length during the triage session.



[Bug 164] Support collation by a certain locale (sorting order of characters)

2011-08-31 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

Mark A. Hershberger m...@everybody.org changed:

   What|Removed |Added

 Status|REOPENED|RESOLVED
 Resolution||FIXED

--- Comment #233 from Mark A. Hershberger m...@everybody.org 2011-09-01 
03:11:07 UTC ---
This bug has been split into 4 separate bugs to track the different issues
raised:

bug 30672 -- improve sorting on other pages than category pages (tracking)
bug 30673 -- Support locale-specific, or tailored, sorting (tracking)
bug 30674 -- Support better client-side sorting (tracking)
bug 30675 -- Use allkeys_CLDR.txt - the CLDR tailored DUCET instead of
allkeys.txt



[Bug 164] Support collation by a certain locale (sorting order of characters)

2011-08-28 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #231 from Bawolff bawolff...@gmail.com 2011-08-28 23:35:55 UTC ---
Re the JS sortable table stuff (this bug is about too many things...): well,
client-side js sorting would be nice, but since these tables aren't dynamic,
wouldn't it be easier to add a sortkey as a data attribute for each cell of
the table on the server side, and then just sort on that client-side? (The
only issues with this are that it might be complicated to implement in the
parser [the table code in the parser is scary imho, but I'm sure some other
people find it less scary], and language preferences; but I think that just
using the content language [or the page's language, I suppose, now that we
have that] would be sufficient.)

Thoughts?
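
The data-attribute idea above can be sketched as follows; this is a
hypothetical illustration (the sortkey format and the function name are made
up, not MediaWiki code), with table rows modelled as plain arrays instead of
DOM nodes:

```javascript
// Each cell carries its visible text plus a server-computed sortkey.
// The client then needs no collation logic at all: a plain binary compare
// of the opaque sortkey strings reproduces the server's collation order.
function sortRowsByColumn( rows, columnIndex ) {
    return rows.slice().sort( function ( a, b ) {
        var ka = a[columnIndex].sortkey;
        var kb = b[columnIndex].sortkey;
        return ka < kb ? -1 : ka > kb ? 1 : 0;
    } );
}

// 'Öl' gets a key that sorts between 'ab' and 'Oslo' because the server
// already applied the (here invented) collation when emitting the page.
var rows = [
    [ { text: 'Öl',   sortkey: 'OL' } ],
    [ { text: 'Oslo', sortkey: 'OSLO' } ],
    [ { text: 'ab',   sortkey: 'AB' } ]
];
var sorted = sortRowsByColumn( rows, 0 );
// sorted order of the first column: 'ab', 'Öl', 'Oslo'
```

In a real page the sortkeys would live in a data attribute on each cell and
the compare would run inside the sortable-table click handler.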



[Bug 164] Support collation by a certain locale (sorting order of characters)

2011-08-28 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #232 from Philippe Verdy verd...@wanadoo.fr 2011-08-29 01:29:45 
UTC ---
If table contents are static, it's true that their collation weights could be
delivered as well with the visible table.

In a past message above, I had already proposed that the function that is
already needed to sort categories, and that computes the UCA collation keys,
be exposed as a ParserFunction. This would then obviously allow any wiki page
to compose web tables containing both the visible data and the associated
collation key.

Many very complex templates that are used to generate sortable
pseudo-collation keys would no longer be needed, or would be largely
simplified. In fact, I even proposed initially to expose this function before
modifying the schema of categories: there was no immediate urgency to modify
this schema, as long as we could experiment with assigning sort keys when
categorizing a page, using that ParserFunction.

And probably we would have been able to experiment with a better schema,
including one that would have allowed multiple collation keys (generated from
different locale parameters), something that has been forgotten completely
(and that will require a new change in the database schema in the future, to
support multiple collations according to users' localisation preferences and
expectations).

Anyway, my work on JS implementation of UCA is already well advanced, and in
fact more advanced now on some points that MySQL still does not support
correctly, notably contractions and expansions.

And very soon I'll add support for contexts (see UTR#35 to see what I mean by
those terms), because it's a trivial use of contractions and expansions,
combined.

It can also be implemented as a separate function, to generate shorter keys.
I'm still not very concerned by choosing the most compact format for
representing collation keys, but I have already worked a lot on how to
automatically compact the sets of collation weights for each level,
suppressing all the unnecessary gaps that are shown in the standard DUCET
table (which are only provided for convenience with very basic
implementations of UCA not using tailorings, or only using very basic
tailorings and no contractions/expansions).

For now, the most critical algorithm (the one that takes most time when
computing collation keys, or when directly comparing two strings) is computing
the NFD: I can save time when comparing pairs of strings (for example by
processing only the beginning of the strings until a collation difference is
found, which does not always require the strings to be fully converted to
NFD), but not when computing a full collation key (unless I also add a
constraint limiting the length of the generated collation keys, which allows
stopping early). I have looked at several implementations of Unicode
normalizers in Javascript; I'm not satisfied with them, as they are clearly
not optimal.

In fact both the NFD transform and the generation of collation weights for
each level are linked: if we sort only on the primary collation level, we lose
too much time computing the NFD transform, which provides too many unnecessary
details that will finally be ignored. This means I'm also trying to perform a
reduced NFD transform that removes these details. Such a transform is
effectively what the Unicode standard calls a collation mapping (something
more powerful than just a case mapping, or even a case folding).
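
For comparison, the primary-level-only behaviour described above is roughly
what engines now expose through the built-in Intl.Collator API (which did not
exist when this comment was written); a minimal sketch, not the author's
implementation:

```javascript
// Compare at (roughly) the primary collation level only: sensitivity
// "base" makes the collator ignore case and accent differences, so the
// secondary/tertiary detail discussed above never influences the result.
var primary = new Intl.Collator( 'en', { sensitivity: 'base' } );

primary.compare( 'resume', 'résumé' );   // 0: equal at the primary level
primary.compare( 'a', 'b' ) < 0;         // true: base letters still ordered
```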

Such a collation mapping would be extremely useful for implementing the
first-letter classification in categories, or even to provide finer
classifications in very populated categories (for example allowing two
letters). This need is in fact exactly equivalent to searching for the
smallest string whose collation key contains only two collation weights per
level, with the collation set to only one level; this too can be largely
optimized, so that the NFD transform removes all collation-ignorable details
that would never be needed to compute the collation key.

All this is an interesting subject of research (and even ICU does not provide
such a general mechanism...).

I will also be innovative in how to provide a very compact representation of
tailorings. I also have ideas on how to represent, store, query and cache the
precomputed lookup tables for tailorings. And on a standard interface that
will allow pluggable implementations in native code (if possible and needed),
or for use with other languages, but also with other non-UCA based collations
(including dummy collations, such as binary sorting, or sorting by standard
case folding, or by standard case mappings), or complex collations requiring a
lot more mapping data than the DUCET and a set of collation orders and
contractions/expansions (for example for sorting Han ideographs on
radical/stroke, or for sorting based on romanisations and other
transliterations, which are all language-dependent; but for which 

[Bug 164] Support collation by a certain locale (sorting order of characters)

2011-08-27 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #229 from Andrew Dunbar hippytr...@gmail.com 2011-08-27 10:03:28 
UTC ---
@Philippe Verdy: I think work on the bigger Unicode problems for JavaScript is
pretty important. I tried to file a bug or feature request somewhere a couple
of years ago, and the sentiment from the js community was that Unicode was too
big and unneeded, would bloat js and be too slow. I couldn't convince whoever
my audience was that you just need it, and that in any case the js
implementations should be hooking into Unicode APIs provided by their host OS
or browser, which must have them anyway.

I have implemented some character/script property stuff from Unidata in js and
for my purposes found it very fast indeed to do everything with binary search.
And all the js implementations are much faster now.

I was doing stuff like looking up the script of every character in an HTML
page.
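
That binary-search lookup can be sketched like this (the three ranges below
are illustrative, not a complete table; a real one would be generated from
Unicode's Scripts.txt):

```javascript
// Script ranges kept as a sorted array of [start, end, scriptName]
// triples, searched by binary search over the code point.
var SCRIPT_RANGES = [
    [ 0x0041, 0x005A, 'Latin' ],
    [ 0x0391, 0x03A9, 'Greek' ],
    [ 0x0410, 0x044F, 'Cyrillic' ]
];

function scriptOf( codePoint ) {
    var lo = 0, hi = SCRIPT_RANGES.length - 1;
    while ( lo <= hi ) {
        var mid = ( lo + hi ) >> 1;
        var range = SCRIPT_RANGES[mid];
        if ( codePoint < range[0] ) {
            hi = mid - 1;          // below this range: search left half
        } else if ( codePoint > range[1] ) {
            lo = mid + 1;          // above this range: search right half
        } else {
            return range[2];       // inside the range
        }
    }
    return 'Unknown';
}

scriptOf( 0x0394 ); // 'Greek' (U+0394 GREEK CAPITAL LETTER DELTA)
```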

I've also implemented internationalized sorting including UCA of page titles in
en.wiktionary using the toolserver as a back end. All work was done on a first
generation netbook (-: You have my support. Is there somewhere I can watch your
project perhaps?



[Bug 164] Support collation by a certain locale (sorting order of characters)

2011-08-27 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #230 from Philippe Verdy verd...@wanadoo.fr 2011-08-27 11:06:18 
UTC ---
Until recently, it was too slow to initialize a large table like the DUCET. But
now that Javascript starts being used for 2D/3D rendering with interactive
speed, demonstrating complex animations, I don't think this is a performance
issue. The only issue is the code size that must be loaded once (that's why I
spoke about caching the DUCET and loading it by parts, on demand). But note
that the DUCET is not as large as the full Unicode. You can easily exclude the
Hangul syllables, and the sinograms, and can autobuild some parts of it for
blocks of symbols. The core of the implementation is already written for the
DUCET-based collation, I need more work to instantiate tailorings. But I've not
determined the best way to handle on-demand loading of the DUCET. In addition,
if I can finalize it, I will need a way to regenerate with some automatic tool
the representation of the DUCET for each release of Unicode. For now I just use
a static table, created with lots of manual operations: a code generator will
be needed (just like in all existing full implementations of UCA).



[Bug 164] Support collation by a certain locale (sorting order of characters)

2011-08-26 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #228 from Philippe Verdy verd...@wanadoo.fr 2011-08-26 21:49:23 
UTC ---
I am currently working on a Javascript implementation of UCA (i.e. with full
support of the DUCET, at least 4 collation levels, and support for
language-based tailorings). This is a very difficult task if we need
performance. The trick is to use efficient lookup tables, but the difficulty is
being able to update them (for tailorings); also performance will highly depend
on how the DUCET is represented in Javascript, so that it can be initialized
sufficiently fast that you won't notice the init time when a collator object
gets instantiated.

I'm still not sure that performance will be good enough (even with today's
much faster Javascript implementations), and maybe some optional cooperation
with a server-side script will be needed; alternatively, a client-side native
plugin interfaced with Javascript (such as Flash) could limit the number of
JSON requests to a server needed to provide the tailored lookup tables. How to
detect and use client-side plugins in an open way is another open question.

I really hope that someday, the Javascript/ECMAScript standard will specify a
common standard interface to collators. I've already looked for help from
authors of jQuery, where such common interface would become part of its tools
library (even if this has nothing in common with working in the HTML DOM, that
is the main focus of jQuery).

For now there exists absolutely no reliable implementation of UCA in
Javascript working only on the client side (but there do already exist some
Javascript libraries that in fact use JSON or XML requests to a server, in
order to build or retrieve the necessary lookup tables).

For practical reasons, a server-side JSON/XML request could be optimized to
reduce the volume of data exchanged (for example by providing only some parts
of the collation elements, these data being then requested on demand even after
the collator has already been -partly- initialized, and then cached in the
collator object). But this is a complication in the design that I don't want
to investigate too early, before I get a good picture of the performance
effectively needed in practice.

Maybe it will be enough to initialize only the part of the DUCET needed for a
particular language and its specific tailoring, sorting all the many other
characters with default weights (not necessarily from the large DUCET).

My implementation will provide at least 3 interfaced functions:
- instantiating a collator for a given language & sort order
- computing collation keys from strings, for more efficient sorts of large
tables (even if we use a QuickSort, the same strings are compared multiple
times with other strings, so it may help to compute their collation keys only
once); in addition, it may not be necessary to compute the full collation key,
but only keys up to the level that's sufficient to perform a specific compare;
collation keys can then be computed partly, on demand, and cached, instead of
being fully computed for the full string at all collation levels; in addition,
not all table sorts may require the maximum collation level supported, so
collation keys don't necessarily have to be long (in memory);
- the tradeoff for precomputing keys instead of comparing strings on the fly
is highly dependent on the table size: for a small number of rows to be
sorted, you gain little with precomputed keys; you just waste memory and get
lower performance. So the interface will also include a 3rd function:
comparing strings using this collator, without necessarily having to compute
the collation keys.

I estimate that the tradeoff limit between precomputed collation keys and
direct compares is for tables of about 100 rows, for which it may be helpful to
provide a very fast response time when sorting them, but I may be wrong,
because these small tables will not require precomputing and storing a lot of
collation keys (so this first step before the QuickSort loop may be
insignificant). The real limit could be memory use in very large tables for
precomputing, storing or caching the collation keys; but such very large tables
will probably be very uncommon in tables generated from Wiki pages (most of
these tables would not include more than 500 rows, and storing 500 collation
keys is not really a problem in today's browsers, except possibly in
smartphones with limited memory resources and slow processors, compared to
laptops, notebooks and tablets).
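
The three interfaced functions listed above can be sketched using the
built-in Intl.Collator as a stand-in for the UCA implementation under
discussion (Intl.Collator did not exist in 2011 browsers, and it exposes no
raw collation keys, so the "precomputed key" below is just a decorated
string/index pair rather than a real binary key):

```javascript
function makeCollator( locale ) {
    // (1) instantiate a collator for a given language & sort order
    var collator = new Intl.Collator( locale );
    return {
        // (2) precompute-and-sort, for large tables: decorate each string
        // once, sort the decorations, then strip them (a real UCA
        // implementation would store binary collation keys instead).
        sortLarge: function ( strings ) {
            return strings
                .map( function ( s, i ) { return { s: s, i: i }; } )
                .sort( function ( a, b ) {
                    // fall back to the index to keep the sort stable
                    return collator.compare( a.s, b.s ) || ( a.i - b.i );
                } )
                .map( function ( e ) { return e.s; } );
        },
        // (3) direct compare, cheaper for small tables:
        compare: function ( a, b ) { return collator.compare( a, b ); }
    };
}

var sv = makeCollator( 'sv' );                 // Swedish tailoring
var svSorted = sv.sortLarge( [ 'ö', 'z', 'a' ] );
// Swedish sorts ö at the end of the alphabet, after z
```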



[Bug 164] Support collation by a certain locale (sorting order of characters)

2011-08-25 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #227 from Purodha Blissenbach 
bugzilla.wikime...@publi.purodha.net 2011-08-25 16:02:59 UTC ---
(In reply to comment #224)
> Don't forget sortable table sorting.

Sortable tables are sorted client-side. This might offer additional
flexibility. If clients were enabled to choose collations for sortable
tables at the time of actual sorting, we could offer them options
beyond those available in server-side database sorts.



[Bug 164] Support collation by a certain locale (sorting order of characters)

2011-08-24 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

Brion Vibber br...@wikimedia.org changed:

   What|Removed |Added

 Status|RESOLVED|REOPENED
 Resolution|FIXED   |

--- Comment #224 from Brion Vibber br...@wikimedia.org 2011-08-24 20:25:53 
UTC ---
I'm reopening this based on notes from bug 29788 etc.

* The fixes applied for this bug apply only for category member sorting, but
the problem exists in many other places as mentioned in the summary, such as
[[Special:Allpages]], [[Special:ListUsers]], etc

* There doesn't appear to be any straightforward way to set the collation to a
locale-specific one!

The only available options are uppercase and uca-default -- sorting in a
locale-specific way (say, Swedish per bug 29788) appears to require writing
additional code.

* For some reason, site content language doesn't appear to be taken into
account in any way; you must set $wgCategoryCollation manually.

This seems very strange to me, and is an obvious additional burden on site
administrators to figure out how to code up (!) and then activate a
language-specific collation.



[Bug 164] Support collation by a certain locale (sorting order of characters)

2011-08-24 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

Brion Vibber br...@wikimedia.org changed:

   What|Removed |Added

 Blocks||29788



[Bug 164] Support collation by a certain locale (sorting order of characters)

2011-08-24 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

Siebrand s.mazel...@xs4all.nl changed:

   What|Removed |Added

 CC||s.mazel...@xs4all.nl

--- Comment #225 from Siebrand s.mazel...@xs4all.nl 2011-08-24 20:55:21 UTC 
---
(In reply to comment #224)
> I'm reopening this based on notes from bug 29788 etc.
>
> * The fixes applied for this bug apply only for category member sorting, but
> the problem exists in many other places

Don't forget sortable table sorting.

I'm not sure if this bug with 224 (225 now) comments is the best bug to keep
tracking this. How about creating a tracking bug, and slicing this issue into
smaller, more manageable issues?



[Bug 164] Support collation by a certain locale (sorting order of characters)

2011-08-24 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

Bawolff bawolff...@gmail.com changed:

   What|Removed |Added

 CC||bawolff...@gmail.com

--- Comment #226 from Bawolff bawolff...@gmail.com 2011-08-24 22:40:39 UTC ---
(In reply to comment #224)
> I'm reopening this based on notes from bug 29788 etc.
>
> * The fixes applied for this bug apply only for category member sorting, but
> the problem exists in many other places as mentioned in the summary, such as
> [[Special:Allpages]], [[Special:ListUsers]], etc

Related: bug 24574. I suppose we would have to add a sortkey field to various
other tables.

> * There doesn't appear to be any straightforward way to set the collation to a
> locale-specific one!
>
> The only available options are uppercase and uca-default -- sorting in a
> locale-specific way (say, Swedish per bug 29788) appears to require writing
> additional code
>
> * For some reason, site content language doesn't appear to be taken into
> account in any way; you must set $wgCategoryCollation manually.

I think that was partially out of concern for PHP installs without the intl
extension installed (so UCA collation would not work). Since changing the
collation requires running a maintenance script, if we just checked that the
relevant class existed to determine which collation to use, and then someone
upgraded their PHP so intl was suddenly present, things would become screwed
up. (Or that's what I assumed the reasoning was. I could be wrong.)

 
> This seems very strange to me, and is an obvious additional burden on site
> administrators to figure out how to code up (!) and then activate a
> language-specific collation.

Perhaps make $wgCategoryCollation consider anything past uca- to be the
locale. For example, the value uca-foo would use IcuCollation( 'foo' ) as the
Collation class.
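
The naming scheme suggested above could be parsed as sketched below (a
hypothetical helper, written here in Javascript rather than PHP for brevity;
only the config value names mirror MediaWiki):

```javascript
// Map a $wgCategoryCollation-style value to a collation class + locale:
// anything after the "uca-" prefix is treated as the ICU locale name.
function parseCollationName( name ) {
    if ( name.indexOf( 'uca-' ) === 0 ) {
        // "uca-default" keeps the root locale; "uca-sv" etc. pick a tailoring
        return { class: 'IcuCollation', locale: name.substring( 4 ) };
    }
    return { class: 'UppercaseCollation' };
}

parseCollationName( 'uca-foo' );   // { class: 'IcuCollation', locale: 'foo' }
parseCollationName( 'uppercase' ); // { class: 'UppercaseCollation' }
```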



[Bug 164] Support collation by a certain locale (sorting order of characters)

2011-07-20 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #218 from Philippe Verdy verd...@wanadoo.fr 2011-07-20 08:54:53 
UTC ---
OK, I immediately found a bug in the binary search function (which is
definitely not binary, has slower than expected convergence, and may even
stall testing the same index, due to incorrect handling of the min/max indices
after comparing):

function findLowerBound( $valueCallback, $valueCount,
        $comparisonCallback, $target ) {
    $min = 0;
    $max = $valueCount - 1;
    do {
        $mid = $min + ( ( $max - $min ) >> 1 );
        $item = call_user_func( $valueCallback, $mid );
        $comparison = call_user_func( $comparisonCallback, $target, $item );
        if ( $comparison > 0 ) {
            $min = $mid;
        } elseif ( $comparison == 0 ) {
            $min = $mid;
            break;
        } else {
            $max = $mid;
        }
    } while ( $min < $max - 1 );

    if ( $min == 0 && $max == 0 && $comparison < 0 ) {
        // Before the first item
        return false;
    } else {
        return $min;
    }
}

The author should have read how TESTED binary searches are implemented. This
should really be:

function findLowerBound( $valueCallback, $valueCount, $comparisonCallback,
        $target ) {
    $min = 0;
    $max = $valueCount - 1;
    while ( $min <= $max ) {
        $mid = ( $min + $max ) >> 1;
        $item = call_user_func( $valueCallback, $mid );
        $comparison = call_user_func( $comparisonCallback, $item, $target );
        if ( $comparison < 0 ) {
            $min = $mid + 1;
        } elseif ( $comparison > 0 ) {
            $max = $mid - 1;
        } else {
            return $mid; // or: $max = $mid; break;
        }
    }
    // $target not found; now $max < $min (more exactly, $max == $min - 1).
    // $target is to be inserted between them: $max is the highest lower bound.

    if ( $max < 0 ) {
        // Before the first item
        return false; // Don't test with a simple if !
    } else {
        return $max; // May return 0 (lower bound on first item)
    }
}

Note that the final test to return false is unnecessary and causes confusion
in usage (if you had written this in C/C++ instead of PHP, you would not be
able to make the distinction between false (no lower bound found) and 0 (a
valid lower bound)). Your code should not depend on those PHP hacks, which
consist in returning values of distinct types (and with PHP's implicit
promotion on basic types, it's really dangerous to do that)!

It's just simpler to let it return -1 (the value already in $max when we are
before the first item), and then let the caller test for a negative result,
instead of comparing it with false.
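
For the record, the corrected lower-bound search transcribed to Javascript,
with the return-negative convention argued for above (a sketch over a plain
array rather than the callback interface, not the MediaWiki code):

```javascript
// Returns the index of the highest element <= target, or -1 if target
// sorts before every element.
function findLowerBound( values, target, compare ) {
    var min = 0, max = values.length - 1;
    while ( min <= max ) {
        var mid = ( min + max ) >> 1;
        var c = compare( values[mid], target );
        if ( c < 0 ) {
            min = mid + 1;    // values[mid] < target: search right half
        } else if ( c > 0 ) {
            max = mid - 1;    // values[mid] > target: search left half
        } else {
            return mid;       // exact match
        }
    }
    // Here max === min - 1; max is the highest lower bound, or -1.
    return max;
}

var nums = [ 10, 20, 30 ];
var cmp = function ( a, b ) { return a - b; };
findLowerBound( nums, 25, cmp ); // 1  (20 is the highest element <= 25)
findLowerBound( nums, 5, cmp );  // -1 (target sorts before the first item)
```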



[Bug 164] Support collation by a certain locale (sorting order of characters)

2011-07-20 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #219 from Philippe Verdy verd...@wanadoo.fr 2011-07-20 11:20:08 
UTC ---
I note the following comment in r80443:
* Changed Title::getCategorySortkey() to separate its parts with a line break
instead of a null character. All collations supported by the intl extension
ignore the null character, i.e. 'ab' == 'a\0b'. It would have required a lot of
hacking to make it work.

If you had read my comments above, I had already spoken long ago about the
choice of the separator to use between the prefix and the rest of the full
page name: it had to be a non-ignorable character that is clearly lower than
any character usable in the full pagename.

I had also stated which character to use, namely the TAB character, which is
the first non-ignorable one (i.e. with non-all-zeroes weights). See my Comment
#190 for the rationale (one year ago, long before your recent fix to use LF;
before that you had tried NULL, which was wrong, and even no separator at all,
which was also wrong).

You could have avoided these bugs, as I had analyzed it long before.

OK, LF (U+000A) is OK (because it is not allowed in full page names), even if
it is not the first one.

---

One word about performance and space: you have put too many fields in table
categorylinks. Only one essential thing is needed there for referential
constraints and counting: the UNIQUE index on (cl_from, cl_to), which is
architectural. As this index is unique, it should share the same store as the
table itself. But SQL engines will never use more than one index per table to
read it, because that requires additional random-access cursors to access the
table. As the current unique index uses fewer fields than the table, it
implicitly creates a random-access reference from this index to the table
(basically, using internal rowids stored in the index, instead of reading
directly from the table).

To avoid this problem, you have to make sure that the table is formatted so
that its first fields match exactly the same order as the unique index. This is
the case for the first unique index created just after that. So the table and
the index share the same store, however the number of rows in the index per
storage page will be limited (the table can use up to 1KB per row, this does
not give a very compact and fast search in the index, as it will load too many
storage pages in memory, even when using a very selective search).

You may optimize searches (at the price of storage space), by indicating that
the unique index should still use a separate store. Normally this is the role
of the specifier PRIMARY which indicates to not use a separate store (it is
still UNIQUE). For now you don't have any PRIMARY index, only a UNIQUE index
which only uses a separate index store.

Now come the two main kinds of requests performed by MediaWiki:

(1) Getting the list of categories in which a page is currently listed
(including when saving a page, because this list must be flushed and replaced
by the categories found when parsing the page). Ideally, this requires at
least an index on cl_from. The UNIQUE index on (cl_from, cl_to) works
perfectly there because it is in a separate store. Other attributes of the
table will not even be read unless the SQL query asks for them (in which case
both the index and the table store are read: the index is read very rapidly
and joined to the table using the internal table row id stored in the index
alongside the two keys). In all the MediaWiki code I have seen, this does not
occur (even when rendering a page: to display the list of categories, it just
needs this list, but gets it from parsing the page's wiki code before saving
it, and this gets saved in the page cache).

(2) Getting the list of pages in a category. There you need collation as
well, because this list is sorted. Ideally, this should use an index and table
that contain just the necessary fields, with the SQL ORDER BY clause using
those same fields, in the same order (and either all ASCENDING or all
DESCENDING). This is the second index:

CREATE INDEX /*i*/cl_sortkey ON /*_*/categorylinks (cl_to, cl_type, cl_sortkey,
cl_from);

Obviously, it cannot use the same index store as the table. The cl_type field
is a recent introduction. It supposes that category views will be paged by
page type, in the order defined by cl_type, which is:
ENUM ('page', 'subcat', 'file').

I wonder if this is correct and if it's the expected order. But in any case,
this design means that the current code will actually perform three separate
requests, each paged separately: one to get the list of normal pages, another
for the list of subcategories, and another for the list of files. This will be
inefficient, and hard to manage for paging, unless these three types can
effectively be paged separately in the MediaWiki HTML UI (which is not the
case, as of today).
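The two access patterns above can be sketched end to end with a toy schema. This is purely illustrative: SQLite via Python stands in for MySQL/InnoDB, several real categorylinks columns are omitted, and the sample rows are invented.

```python
import sqlite3

# Abbreviated stand-in for MediaWiki's categorylinks table (illustration only;
# production MediaWiki targets MySQL/InnoDB and the table has more columns).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE categorylinks (
  cl_from INTEGER NOT NULL,            -- page id of the member page
  cl_to TEXT NOT NULL,                 -- category name
  cl_sortkey TEXT NOT NULL,            -- binary collation sortkey
  cl_type TEXT NOT NULL DEFAULT 'page',
  UNIQUE (cl_from, cl_to)
);
CREATE INDEX cl_sortkey ON categorylinks (cl_to, cl_type, cl_sortkey, cl_from);
""")
conn.executemany(
    "INSERT INTO categorylinks VALUES (?, ?, ?, ?)",
    [(1, "Physics", "EINSTEIN", "page"),
     (2, "Physics", "BOHR", "page"),
     (3, "Physics", "HISTORY OF PHYSICS", "subcat"),
     (1, "People", "EINSTEIN, ALBERT", "page")])

# (1) categories a page belongs to: served by the UNIQUE (cl_from, cl_to) index
cats = [r[0] for r in conn.execute(
    "SELECT cl_to FROM categorylinks WHERE cl_from = ? ORDER BY cl_to", (1,))]

# (2) members of a category, one cl_type at a time, in sortkey order:
# served by the (cl_to, cl_type, cl_sortkey, cl_from) index
members = [r[0] for r in conn.execute(
    """SELECT cl_from FROM categorylinks
       WHERE cl_to = ? AND cl_type = ?
       ORDER BY cl_sortkey, cl_from""", ("Physics", "page"))]
```

Query (2) touches only indexed columns, which is what makes a covering read possible; query (1) is equally well served by the unique key alone.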

[Bug 164] Support collation by a certain locale (sorting order of characters)

2011-07-20 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #220 from Tim Starling tstarl...@wikimedia.org 2011-07-20 
12:14:54 UTC ---
(In reply to comment #218)
 if ( $comparison < 0 ) {
 $min = $mid + 1;
 } elseif ( $comparison > 0 ) {
 $max = $mid - 1;

The function is not a classical binary search, which only determines equality;
instead it finds the lower bound of a range. When the test item sorts below the
target ($comparison < 0), it can't be ruled out as a prospective lower bound.
That's why you need $min = $mid, not $min = $mid + 1.

Maybe $max = $mid - 1 is possible in the $comparison > 0 case, with your
suggested alteration to the test for the before-the-start case. But the
existing code is tested and does work, and the improvement in convergence rate
would be small.
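The lower-bound variant Tim describes can be sketched as follows. This is illustrative Python, not the actual MediaWiki PHP: the real function takes callbacks rather than a list and returns false (here None) for the before-the-start case.

```python
def find_lower_bound(items, target):
    """Index of the last element <= target in a sorted list, or None if
    target sorts before items[0] (MediaWiki returns false there)."""
    if not items:
        return None
    min_i, max_i = 0, len(items)
    # Loop exits once the range is down to 1 or 0 elements, slightly
    # earlier than a textbook search, to avoid an infinite loop.
    while min_i < max_i - 1:
        mid = min_i + ((max_i - min_i) >> 1)
        if items[mid] <= target:
            # The test item sorts at or below the target, so it remains a
            # prospective lower bound: keep it (min = mid, NOT mid + 1).
            min_i = mid
        else:
            # The test item sorts above the target: safe to exclude it.
            max_i = mid
    return min_i if items[min_i] <= target else None
```

Because the left endpoint is never offset past the midpoint, the range still shrinks on every iteration while max - min > 1, which is exactly why the early loop termination is needed.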

 OK, I immediately found a bug in the binary search function (which is
 definitely not binary, has slower than expected convergence, and may even
 "stale" on testing the same index, due to incorrect handling of the min/max
 indices after comparing):

I assume by "stale" you mean an infinite loop. This does not occur because the
midpoint is always different from either of the two endpoints until the
difference $max - $min is reduced to 1 or 0, at which point the loop exits.

In a textbook binary search, offsetting the midpoint by 1 allows the loop to
continue until $min > $max. But as I said, we can't offset the midpoint in the
$comparison < 0 case, so a slightly earlier loop termination is required to
avoid an infinite loop.

 Note that the final test to return false is unnecessary and causes confusion
 on usage (if you had written this in C/C++ instead of PHP, you would not be
 able to make the distinction between false (no lower bound found) and 0
 (valid lower bound)). Your code should not depend on those PHP hacks, which
 consist in returning values of distinct types (but with PHP's implicit
 promotion on basic types, it's really dangerous to do that)!

 It's just simpler to let it return -1 (the content already in $max when we
 are before the first item), and then let the caller test the negative sign of
 the result, instead of comparing it with false.

It's conventional for PHP functions to return false to indicate an error or
other unusual situation, see for instance http://php.net/array_search . It's
not at all dangerous. If I wrote the function in C, I would indeed have used
-1, and I would have called it an ugly hack to work around C's lack of weak
typing.

If you're aware of some actual bug in findLowerBound(), you should file a
separate bug report, or note it on
http://www.mediawiki.org/wiki/Special:Code/MediaWiki/80443

-- 
Configure bugmail: https://bugzilla.wikimedia.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are on the CC list for the bug.

___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 164] Support collation by a certain locale (sorting order of characters)

2011-07-20 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

Domas Mituzas domas.mitu...@gmail.com changed:

   What|Removed |Added

 CC||domas.mitu...@gmail.com

--- Comment #221 from Domas Mituzas domas.mitu...@gmail.com 2011-07-20 
12:40:40 UTC ---
@Philippe, InnoDB treats the first UNIQUE index as a clustered PRIMARY key,
which stores all data with it (I assume you don't come from an InnoDB
background; please correct me otherwise ;-)

cl_sortkey structure allows having dataset split by type (e.g. subcategories
first, files last, whatever), and still maintain decent sorted order. cl_from
is included for cases where we need to page over multiple entries with same
sortkey.

do note, as secondary keys would include primary key values anyway, it is not
much of a cost. 

Back in the day, both major reads used to be covering ones, without random
lookups; I'm not sure whether there's a regression nowadays with all the new
collation code, though.



[Bug 164] Support collation by a certain locale (sorting order of characters)

2011-07-20 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #222 from Philippe Verdy verd...@wanadoo.fr 2011-07-20 13:18:31 
UTC ---
You should know that a binary search for equality or for lower bound is
exactly the same algorithm. (If not convinced, look at the source of a C++
standard library, which uses the same function for both: finding exact matches
or finding a lower bound for inserting a new item.)

The difference only occurs at the end of the loop, in what you return when an
exact equality is not found. In both cases, the binary search loop will
terminate with min = max + 1 exactly when no exact match is found, so that max
is the lower bound you want.

My suggestion fixes a few things:
- it uses a while()... instead of a do...while(), which is incorrect for the
case where the search space operates with count < 2;
- it has faster convergence;
- it contains no special cases to handle (your code will be erroneous in those
cases);
- even when searching with count=0, it starts with min=0 and max=-1, which
terminates the loop immediately, with the same condition for "not found"
(max < min, and max is the lower bound);
- it works with count=1 without creating an infinite loop like in your code
when the target is higher than the only element [0] to check, because it
ensures that the loop will ALWAYS reduce the search space at each iteration.
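For comparison, the textbook variant Philippe argues for can be sketched like this (illustrative Python, not the code under review): a plain equality search with inclusive bounds, whose exit state min == max + 1 makes max the lower bound, with -1 meaning "before the first item".

```python
def lower_bound_inclusive(items, target):
    """Textbook binary search over a sorted list with inclusive bounds.
    Returns the index of an exact match, otherwise the index of the last
    element < target, or -1 when target sorts before the first element."""
    min_i, max_i = 0, len(items) - 1   # count == 0 gives max_i == -1: loop never runs
    while min_i <= max_i:              # the range strictly shrinks every iteration
        mid = (min_i + max_i) >> 1
        if items[mid] < target:
            min_i = mid + 1
        elif items[mid] > target:
            max_i = mid - 1
        else:
            return mid                 # exact match
    return max_i                       # exit state: min_i == max_i + 1
```

Note how the same loop serves both the equality search and the lower bound; only the fall-through return value differs.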



[Bug 164] Support collation by a certain locale (sorting order of characters)

2011-07-20 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #223 from Philippe Verdy verd...@wanadoo.fr 2011-07-20 13:35:21 
UTC ---
 The function is not a classical binary search, which only determines
 equality; instead it finds the lower bound of a range. When the test item
 sorts below the target ($comparison < 0), it can't be ruled out as a
 prospective lower bound.

And here you're completely wrong: in that case the comparison says that the
mid item is certainly not the lower bound. Don't forget that the item at max
(and below) will also be tested: max will decrease as well, up to the point
that it will either find an exact match, or will fall 1 position below min.

 That's why you need $min = $mid, not $min = $mid + 1.

That's why you don't need that!



[Bug 164] Support collation by a certain locale (sorting order of characters)

2011-07-18 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #217 from Aryeh Gregor simetrical+wikib...@gmail.com 2011-07-18 
18:25:02 UTC ---
The interesting code is here:

http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/Collation.php?view=markup

That's the part that does the actual collation, which is really just handed off
to the intl extension.  The part that actually hooks into category pages is in
CategoryPage.php, but it's relatively boring.  I don't think there are any
parser functions or anything else exposed to users right now.

If you have questions or comments, I strongly encourage you to post to
wikitech-l, not this bug.  Bugzilla is used for bug tracking, not discussion,
and this bug is fixed.  If you have further feature requests, like the addition
of new parser functions that expose the sortkey functionality, file new bugs. 
I am not going to answer any questions that are asked here.



[Bug 164] Support collation by a certain locale (sorting order of characters)

2011-07-17 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #215 from Aryeh Gregor simetrical+wikib...@gmail.com 2011-07-17 
14:12:58 UTC ---
Maybe.  It depends.  Given the number of comments here and the number of people
being spammed on every post here, though, I'd strongly suggest that people open
new bugs for each place they want to have improved collation, rather than
continuing to use this bug.  You can mark them as depending on this one so
interested parties can find them.



[Bug 164] Support collation by a certain locale (sorting order of characters)

2011-07-17 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #216 from Philippe Verdy verd...@wanadoo.fr 2011-07-17 16:50:16 
UTC ---
Aryeh, can you point me to a link showing where your implementation was added
to MW? I'd like to review it and locate possible bugs or caveats.
Also because the announced integration does not give documentation details
about how we can use it (wiki code syntax in pages, possible support templates,
integration points, server settings, possible user preferences, possible
parser functions for generating helper data in the HTML so that collation will
also work on the browser side, because there's still no support anywhere for
collation in JavaScript, and because most JavaScript code found to handle
special sorts is bogus and does not implement true UCA, except for some
HTML/JavaScript applications using Ajax requests to perform the sort on the
server side).

---

Note: jQuery still does not have any utility function to support collation. It
provides a .sort() method that takes a callback function as a parameter to
perform the actual compare(), but inefficiently, as it still cannot use
precomputed collation keys from actual table data; the user-supplied compare
function would recompute collation keys many times from the same data cell
while performing the sort.

I have asked jQuery to extend its interface to allow precomputing collation
keys in a first pass using a user-supplied callback, and then only supply
these precomputed collation keys when comparing cells during the actual sort,
but this is still a dead letter. I have also asked jQuery to start a project
for implementing a $.CollatorFactory object from which collators can be
prepared and instantiated, plus a common interface for the collators returned
by the factory.
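The precompute-then-sort pattern being requested here is the classic decorate-sort-undecorate idiom; a Python sketch, where toy_collation_key is a crude stand-in for a real collator's sort-key function (such as ICU's), and the row data is invented:

```python
def toy_collation_key(s):
    # Stand-in for a real collator's sort-key function (e.g. ICU):
    # case-insensitive, with a crude fold of a few accented letters.
    folds = str.maketrans("éèêë", "eeee")
    return s.lower().translate(folds)

rows = ["Étienne", "echo", "Zebra", "alpha"]

# A comparator-style sort recomputes keys O(n log n) times; decorating
# each row once computes each key only once, O(n) times in total.
decorated = [(toy_collation_key(r), r) for r in rows]
decorated.sort(key=lambda pair: pair[0])
sorted_rows = [r for _, r in decorated]
```

This is exactly the split being asked of jQuery: one user-supplied callback to build the keys up front, then a cheap key comparison during the sort itself.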

jQuery should then supply at least a very basic CollatorFactory (a stub
implementing only binary sort). Then a separate subproject could start
creating a more complete CollatorFactory that fully supports multilevel UCA,
integrates the DUCET table, and supports locale-specific tailorings of the
DUCET, before integrating a collection of tailorings (based on CLDR data, but
not necessarily in the same format, and not necessarily allowing dynamic
instantiation of any kind of tailoring using the CLDR syntax for them).

Note that the collection of tailorings (there are many locales in the CLDR)
need not always be fully loaded into the browser's JavaScript: a server-side
Ajax request could just as well provide the necessary data to the browser
(most probably as a JSON structure), allowing tailored collators to be
instantiated rapidly without having to parse the CLDR syntax.

The DUCET table dataset (most probably large) could likewise be loaded to the
browser in a delayed way through Ajax, only on demand (and possibly only a
part of it, under the control of the CollatorFactory: specific collators would
just ask the CollatorFactory to preload some segments of it, to save page load
time and volume, and only when the JavaScript starts running some collation
function that requires this data to have been initialized and loaded).

I'd like to start my own project for that, and integrate it into jQuery, in a
much better way than anything I've seen on the web (all the implementations
I've seen for collation support in JavaScript are simply bogus, or at least
very defective). Unfortunately, JavaScript itself (or the most recent
ECMAScript specification) has completely neglected the subject of collation,
so it will take many years before correct browser-side sorting becomes
standard in browsers: there's not even any ongoing project for that, or even
some preliminary specs, whether at ECMA, W3C, Apache, jQuery... or even at
Oracle, Microsoft, Apple, Mozilla and Opera.

Such support, on the other hand, is already present in Adobe Flash and in
Microsoft's Silverlight (based on the .NET library), and has been for many
years in Java. This means that ALL web applications that want correct
browser-side collation still need to run with browser plugins, and not with
pure JavaScript over the HTML/CSS DOM API. Given the ongoing efforts to
deprecate these plugins in favor of HTML5, I don't understand why no such
effort has ever been initiated.



[Bug 164] Support collation by a certain locale (sorting order of characters)

2011-07-16 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #214 from Niklas Laxström niklas.laxst...@gmail.com 2011-07-16 
11:51:15 UTC ---
There are lots of places where pages are listed - don't they also require or
benefit from such an index?



[Bug 164] Support collation by a certain locale (sorting order of characters)

2011-07-15 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

Lejonel lejo...@telia.com changed:

   What|Removed |Added

 CC||lejo...@telia.com

--- Comment #212 from Lejonel lejo...@telia.com 2011-07-15 19:53:46 UTC ---
(In reply to comment #211)
 On wikitech-l, Roan writes 
 
  This support is now in MW with the category collation code Aryeh
  wrote, and developers can now proceed to write language-specific
  collations if they want to.

So other collations are possible for categories. But this bug is also about
sorting in other places. Is there any support for sorting lists in special
pages? For example duplicate bug 353 is about Danish letters in
Special:Allpages.



[Bug 164] Support collation by a certain locale (sorting order of characters)

2011-07-15 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #213 from Aryeh Gregor simetrical+wikib...@gmail.com 2011-07-15 
20:12:10 UTC ---
Getting sorting working in other places would require work, but we could use
the same infrastructure.  For Special:Allpages, we'd need to add an extra
column and index to the page table, which seems a bit excessive for such a
minor special page.



[Bug 164] Support collation by a certain locale (sorting order of characters)

2011-03-18 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

Mark A. Hershberger m...@everybody.org changed:

   What|Removed |Added

 Status|ASSIGNED|RESOLVED
 CC||m...@everybody.org
 Resolution||FIXED

--- Comment #211 from Mark A. Hershberger m...@everybody.org 2011-03-19 
04:03:09 UTC ---
On wikitech-l, Roan writes 

 This support is now in MW with the category collation code Aryeh
 wrote, and developers can now proceed to write language-specific
 collations if they want to.

See
http://permalink.gmane.org/gmane.science.linguistics.wikipedia.technical/52572



[Bug 164] Support collation by a certain locale (sorting order of characters)

2011-03-09 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

Roan Kattouw roan.katt...@gmail.com changed:

   What|Removed |Added

 CC||mybugs.m...@gmail.com

--- Comment #210 from Roan Kattouw roan.katt...@gmail.com 2011-03-09 13:58:51 
UTC ---
*** Bug 24142 has been marked as a duplicate of this bug. ***



[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-09-10 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

Chad H. innocentkil...@gmail.com changed:

   What|Removed |Added

 CC||pinar.bugzi...@gmail.com

--- Comment #209 from Chad H. innocentkil...@gmail.com 2010-09-10 17:53:07 
UTC ---
*** Bug 25129 has been marked as a duplicate of this bug. ***



[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-08-26 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

Aryeh Gregor simetrical+wikib...@gmail.com changed:

   What|Removed |Added

 CC||mis...@gmail.com

--- Comment #208 from Aryeh Gregor simetrical+wikib...@gmail.com 2010-08-27 
00:27:48 UTC ---
*** Bug 24953 has been marked as a duplicate of this bug. ***



[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-08-04 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #205 from Philippe Verdy verd...@wanadoo.fr 2010-08-04 17:09:44 
UTC ---
(In reply to comment #204)
 If people want to put crazy stuff in sortkeys that changes based on who's
 viewing it, we can't stop them.  Curly braces are evaluated at an earlier 
 stage
 than category links, so we can't make them behave differently based on whether
 they're being used in a sortkey, I don't think.  You could also put
 {{CURRENTTIME}} in sortkeys, or any other silly thing like that.

That's because the MediaWiki parser still operates at the wrong level, and
always performs text substitutions while ignoring the context of use. Not only
is this bad, it is also very inefficient, because each processing level
converts a full page into another full page that needs to be reparsed again at
the next level.

A much better scheme, based on a true grammatical analyser using a
finite-state automaton, would really help define the state in which the parser
(or its ParserFunctions extensions) operates, without ever having to create
huge complete page buffers between each level (which requires costly appending
operations with many memory reallocations).

In other words, when you start parsing "[[" you enter a "start-of-link" state,
which then parses "category:" until it has found a colon, in which case the
whole prefix is case-folded and you go into the "in-category" state; then the
parser scans up to the next "{" or "|" or "]". It can then correctly process
all the text using such rules.
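As an illustration only, and grossly simplified compared to real wikitext parsing, the stateful scan described above can be sketched like this; the function name and the regular expression are inventions for this sketch:

```python
import re

def scan_category_links(wikitext):
    """Toy scan for [[Category:...|sortkey]] links: on "[[" enter a link
    state, case-fold the prefix up to ":", and if it is "category" capture
    the target and the optional "|sortkey" up to "]]".  Real wikitext
    (nested braces, escapes, etc.) is far more involved."""
    links = []
    for m in re.finditer(r"\[\[([^:\]|{]+):([^\]|{]*)(?:\|([^\]]*))?\]\]",
                         wikitext):
        prefix = m.group(1).strip().lower()   # case-folded prefix state
        if prefix == "category":
            # (category name, sortkey or None)
            links.append((m.group(2).strip(), m.group(3)))
    return links
```

A regular expression is itself a finite-state automaton, so this mirrors the proposed approach in miniature: the "state" is where the match currently sits inside the link syntax.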

I have long suggested that the MediaWiki syntax should be reformulated using a
LALR(1) formal syntax, from which a finite-state automaton can be
automatically built to cover all the state information, and then ported to PHP
(Yacc, Bison or, even better, PCTCS could do that without difficulty). Then,
instead of calling parsing functions that process all the text and return the
converted text for processing at the next level, it would work in a much
simpler (and faster) processing loop, calling much simpler generation methods
and altering its state in a much cleaner and faster way (no more need to
append small lexical items to various buffers; the atoms would be passed from
level to level using a chained implementation).

This would also significantly speed up the expansion of templates, and would
allow the parser to make distinctions when {{int:}} is encountered in the
context of a [[category:...]] (which would temporarily force the UI language
to the CONTENTLANGUAGE, until the final "]]" token is encountered, restoring
the UI language within the parser's internal state). This would also have
significant performance benefits (for example, no more need to convert and
expand all the parameters of {{#if:...}} or {{#switch:...}}; they would only
be converted lazily, when they are really needed for generating the effective
output).

Yes, this comment is going too far out of the topic of this bug; it is
architectural. PHP has all the tools needed to support the construction of
tree-like data: instead of passing only strings to parser generation
functions, you would pass an ordered array whose elements are the parsed
parameters, each being either a string or a subparsed array containing strings
and other arrays. The builtin functions would then ask the MediaWiki parser to
evaluate/expand ONLY the parameters they need, and MediaWiki would still be
able to call the expansions of ONLY those parameters, recursively. Most of the
items in the wikicode would then be atomic and processed lazily, only when
they are effectively needed.

The crazy things like {{CURRENTTIME}} or {{time:...}} or {{int:...}} found
within [[category:...]], whose result depends on time or on user preferences,
would be easily avoided. This would also greatly simplify the management of
whitespace (whether it needs to be trimmed and/or compressed depends on the
builtin expansion function called by the parser).



[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-08-04 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #206 from Aryeh Gregor simetrical+wikib...@gmail.com 2010-08-04 
17:14:25 UTC ---
(In reply to comment #205)
 That's because the Mediawiki parser still operates on the wrong level, and
 performs text subtitutions always without ingoring the context of use. Not 
 only
 this is bad, but this is also very inefficient, because each processing level
 is converting a fullpage to another fullpage that needs to be reparsed again 
 at
 the next level.

People have tried to fix this before, and failed miserably.  It is not
practical in the immediate future, if ever.  Everyone likes to say how great it
would be to write a real parser, but nobody ever seems interested in
producing working code that actually replicates the current parser to a good
enough level to maintain acceptable backward compatibility.  If you're
interested in that, please give it a try and get in touch with us when you have
a working implementation.  Otherwise, it's not worth discussing pointlessly yet
again.



[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-08-04 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #207 from Philippe Verdy verd...@wanadoo.fr 2010-08-04 17:41:52 
UTC ---
(In reply to comment #206)
Top-down analyzers are very easy to write, provided that you allow parsing to
be done via TWO generation points instead of one: the first converts and
accumulates the tokens in an array as long as they cannot be processed as a
whole; this requires a first parsing loop. Then the final step asks for the
expansion of parsed tokens lazily, in a basic processing loop which eats the
tokens, replacing them by their expansion, until there remains only one token,
a basic string that can be used as an expanded parameter.

I may try working on it; it's not so complicated to formulate, even without
using Yacc/Bison/PCTCS. I have many ideas on how it would work: basically,
each non-terminal token in the formal syntax becomes TWO functions. One is
used when parsing individual tokens, until the last one is reached; then a
second function is used to expand the token itself, and it is called later
(after the parse returns, BY the caller, and ONLY when the caller needs the
expansion in its OWN expansion function).

For example, the MediaWiki parser would become TWO functions:
mediawiki_parse() (which returns an array of tokens) and mediawiki_expand(),
which is stored as the first token in the array. All ..._expand() functions
would have two parameters: one for the context in which they are to be
evaluated (containing for example time properties, cache properties, language
properties ...), and the array itself, to be expanded up to the end until it
produces some simple text. The main concept is that tokens expand lazily, but
with full control by the non-terminal parsed token, which knows its context of
expansion.

Basically, the first step for each non-terminal token in the formal grammar
performs a bottom-up conversion from a top-down analysis (performed
recursively); this first step builds an AST (stored as a basic array of
children). The second step performs a *conditional* top-down expansion of the
AST array into flat text, ignoring nodes that don't need to be expanded
because their effective expansion is not needed for the expansion of the
current AST node. Both steps are recursive, but run in separate simple
processing loops.
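The two-phase scheme described above can be sketched as follows. This is illustrative Python: the node layout, the builtin names and the #if semantics are assumptions made for the sketch, not MediaWiki's actual implementation.

```python
def expand(node, context):
    """Phase two: lazily expand an already-parsed AST node.  Nodes are
    either plain strings or (builtin_name, [child nodes]) tuples; only
    the children a builtin actually needs are ever expanded."""
    if isinstance(node, str):
        return node                     # atomic token: nothing to do
    name, args = node
    if name == "#if":
        cond, then_branch, else_branch = args
        # Expand the condition, then ONLY the branch that is taken;
        # the other branch is never touched.
        if expand(cond, context).strip():
            return expand(then_branch, context)
        return expand(else_branch, context)
    if name == "int":
        # {{int:...}}-style message lookup against a context table.
        return context["messages"][expand(args[0], context)]
    raise ValueError("unknown builtin: " + name)

# The then-branch references a message that does not exist, yet no error
# occurs, because the false condition means it is never expanded.
ast = ("#if", ["", ("int", ["never-expanded"]), "fallback"])
result = expand(ast, {"messages": {}})
```

The laziness is the whole point: the (int, ...) node would raise a lookup error if expanded, and the fact that it does not demonstrates that unneeded branches are skipped.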

OK, this is still off topic, but I don't know for now where to post these
concepts. I'll stop discussing this here.



[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-08-03 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #204 from Aryeh Gregor simetrical+wikib...@gmail.com 2010-08-03 
17:37:45 UTC ---
(In reply to comment #200)
 Which SQL backends (and dialects) do you support in MediaWiki ? This may help
 me understanding some development constraints. I know you support MySQL, but
 are there simpler backends like BerkeleyDB, dBase files, or flat text files 
 for
 small projects supported only via some ODBC driver with a PHP interface ?

We primarily support MySQL.  We also have at least theoretical support for
PostgreSQL, SQLite, MSSQL, Oracle, and DB2.  However, if this doesn't work for
all DB backends, it's not a big deal, as long as it works with MySQL.

(In reply to comment #201)
 For those of us who are less technical, but keen to see this bug resolved, is
 there somewhere where we can submit alphabets, so that they may be added for
 collation? I.e. where we submit the sorting order of specific languages for 
 use
 with {{sort:en}} (or whatever the mark-up will be)?

We'll be using some kind of pre-rolled algorithm from CLDR or such, so you
don't need to do this.  If the deployed algorithm turns out to be deficient,
and it's not a bug on our side, you should probably contact CLDR, not us.

(In reply to comment #203)
 May I even suggest that the name of the new column does not even use
 "sortkey", i.e. cl_sort_prefix instead of cl_sortkey_prefix.

 Why? Because this column should NEVER need to convert that prefix itself to
 a binary sortkey; it should just remain the human-readable prefix that was
 specified in {{DEFAULTSORT:sort_prefix}} or in [[Category:...|sort_prefix]].

Yes, this stores the plaintext, unconverted prefix.  I don't think the name
needs to be changed, though.  Both names work well enough.

 This will avoid confusion, but it will also allow simpler recomputing of 
 actual
 sortkeys for various locales, without having to parse the page again for each
 locale:

Yes, that's intended to work.
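The point of storing the plaintext prefix can be sketched as follows. Illustrative Python only: the collation functions are toy stand-ins for real collators (a real deployment would use ICU sort keys), and none of these names exist in MediaWiki.

```python
# Toy per-collation key functions standing in for real collators;
# purely illustrative.
collations = {
    "uppercase": lambda prefix: prefix.upper(),
    "identity":  lambda prefix: prefix,
}

# cl_sortkey_prefix keeps the plaintext from {{DEFAULTSORT:...}} or
# [[Category:...|prefix]]; hypothetical sample data.
stored_prefixes = {"Zürich": "Zurich", "Åland": "Aland"}

def recompute(collation):
    """Regenerate every page's sortkey under a given collation, straight
    from the stored prefixes: no page needs to be reparsed."""
    key = collations[collation]
    return {page: key(prefix) for page, prefix in stored_prefixes.items()}
```

Switching a wiki to a different collation then reduces to one pass over the stored prefixes, which is exactly why the column holds the unconverted text rather than a baked binary key.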

 Note that when evaluating and expanding the parameters of
 {{DEFAULTSORT:sort_prefix}} and of [[Category:...|sort_prefix]] in the wiki
 code of a page (or when transcluding templates), the result text should NEVER
 depend on the UI language (so if {{int:lang}} is used in that parameter, it
 should always evaluate as if it were {{CONTENTLANGUAGE}}).

If people want to put crazy stuff in sortkeys that changes based on who's
viewing it, we can't stop them.  Curly braces are evaluated at an earlier stage
than category links, so we can't make them behave differently based on whether
they're being used in a sortkey, I don't think.  You could also put
{{CURRENTTIME}} in sortkeys, or any other silly thing like that.

-- 
Configure bugmail: https://bugzilla.wikimedia.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are on the CC list for the bug.

___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-07-28 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #202 from Philippe Verdy verd...@wanadoo.fr 2010-07-28 22:06:54 
UTC ---
I would suggest instead making those proposals in the CLDR project.
It already contains lots of data for many languages, related to their expected
sort order (in fact more generally about their collation).

Go to http://www.unicode.org/cldr and visit the existing resources browser,
and the demo of ICU, which uses the same data.

Our collator should ultimately use the CLDR tailorings that are already defined,
with very minor changes (if they are really needed).



[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-07-28 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #203 from Philippe Verdy verd...@wanadoo.fr 2010-07-28 22:18:49 
UTC ---
May I even suggest that the name of the new column does not even use sortkey

i.e. cl_sort_prefix instead of cl_sortkey_prefix

Why? Because this column should NEVER need to convert that prefix itself to a
binary sortkey; it should just still be the human-readable prefix that was
specified in {{DEFAULTSORT:sort_prefix}} or in [[Category:...|sort_prefix]].

This will avoid confusion, but it will also allow simpler recomputing of actual
sortkeys for various locales, without having to parse the page again for each
locale:

Note that when evaluating and expanding the parameters of
{{DEFAULTSORT:sort_prefix}} and of [[Category:...|sort_prefix]] in the wiki
code of a page (or when transcluding templates), the result text should NEVER
depend on the UI language (so if {{int:lang}} is used in that parameter, it
should always evaluate as if it were {{CONTENTLANGUAGE}}).



[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-07-26 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #190 from Philippe Verdy verd...@wanadoo.fr 2010-07-26 18:38:10 
UTC ---
Yes, Language::firstLetterforList(s) maps more or less to COLLATIONMAP, but
COLLATIONMAP is a more generic concept which reflects what is defined in the
Unicode standard annexes, which speak about various mappings (including
collation mappings, but also case mappings).

One bad thing about the name Language::firstLetterforList(s) is that it implies
that this should only be the first letter. In fact, for many locales, the
significant unit is not the letter, but the collation element (for example
digraphs like ch or trigraphs like c’h).

For some categories, it would also be convenient to be able to use longer
substrings, containing more than one grapheme cluster. In Wiktionary, for lists
of terms belonging to a language, or in lists of people's names, a category may
need to be indexed and anchored with section headings containing the first 2 or
3 grapheme clusters, because the first grapheme is not discriminating enough and
will remain identical in all columns of the displayed list on one page, and
even sometimes over several or many successive pages: the first-letter heading
does not help, and is just unneeded visual pollution.

For other categories that have very little content, too many headings are
frequently added that also appear as pollution. Being able to suppress all of
them, by specifying 0 grapheme clusters for that category, will better help
readers locate the wanted item.
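The heading behavior described in the two paragraphs above can be sketched in Python (illustrative only, not MediaWiki code; the tailoring table and function name are invented for the example): the heading unit is a collation element, which for some locales is a digraph or trigraph rather than a single letter, and 0 elements suppresses headings entirely.

```python
# Toy illustration of mapping a page name to its category heading: the
# heading is the first *collation element*, which for some locales is a
# digraph ("ch") or trigraph ("c'h"), not a single letter.

# Hypothetical tailoring: multi-character collation elements, longest first.
TAILORED_ELEMENTS = ["c'h", "ch"]

def first_collation_element(text, max_elements=1):
    """Return the first `max_elements` collation elements of `text`,
    uppercased for use as a section heading; 0 disables headings."""
    if max_elements == 0:
        return ""
    out = []
    i = 0
    while i < len(text) and len(out) < max_elements:
        for elem in TAILORED_ELEMENTS:          # longest match wins
            if text[i:i + len(elem)].lower() == elem:
                out.append(text[i:i + len(elem)])
                i += len(elem)
                break
        else:                                   # plain single-letter element
            out.append(text[i])
            i += 1
    return "".join(out).upper()

print(first_collation_element("chwarae"))    # Welsh word: heading "CH", not "C"
print(first_collation_element("c'hoar"))     # Breton word: heading "C'H"
print(first_collation_element("dog", 0))     # headings suppressed: ""
```

Passing a larger `max_elements` yields the multi-cluster headings suggested for crowded categories.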

The collation map also has several levels of implementations, which match
exactly the same levels as collation levels used to generate sort keys.



About sort keys now:

As sort keys are intended to be opaque binary objects, they do not qualify for
direct use in a parser function without being serialized to safe Unicode text,
even if that text means nothing to the reader. That's why I proposed a Base-36
mapping to plain ASCII which will still sort correctly in binary order, for use
in sortable tables, but it could use any arbitrary base that sorts correctly
under binary ordering and that uses ONLY valid Unicode characters.

The chosen base should be easy to compute, but none of the standard Base-64
variants qualify (there's no guarantee about the last two characters of any
Base-64 variant). We could use Base-62 (using all 10 digits and the 26 pairs
of Basic Latin letters), or Base-32 (simpler to compute but generating longer
text). The real intent is not to make the string visible in
pages, but to help in the generation of accurate sort keys in sortable columns.
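The order-preserving property argued for above can be demonstrated with a short sketch (not MediaWiki code): any base alphabet whose characters ascend in ASCII order, such as the "extended hex" Base-32 alphabet of RFC 4648 (digits 0-9 followed by A-V), makes the encoded strings compare byte-wise in the same order as the raw binary keys.

```python
# Sketch of an order-preserving Base-32 encoding for opaque binary sort
# keys: the alphabet '0'..'9','A'..'V' is in ascending ASCII order, so
# encoded strings sort exactly like the raw keys they encode.

ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUV"

def base32_sortable(data: bytes) -> str:
    """Encode bytes 5 bits at a time, most significant bits first."""
    bits = "".join(f"{b:08b}" for b in data)
    bits += "0" * (-len(bits) % 5)            # pad to a multiple of 5 bits
    return "".join(ALPHABET[int(bits[i:i+5], 2)]
                   for i in range(0, len(bits), 5))

# Binary order of the raw keys is preserved by the encoded strings:
keys = [b"\x01\x02", b"\x01\x03", b"\x7f", b"\x80\x00"]
encoded = [base32_sortable(k) for k in keys]
assert sorted(encoded) == [base32_sortable(k) for k in sorted(keys)]
```

A Base-64 alphabet would not work here, because its last two characters vary between variants and are not guaranteed to sort above the alphanumerics.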

For now these sort keys are generated by templates as invisible text spans
(using a CSS style=display:none attribute), but ideally, the templates used
in sortable tables that generate custom sortkeys should put them in some HTML
attribute that can be specified on table cells, and that the JavaScript will
use directly. In my opinion, these opaque strings should be as compact as
possible, but still safe for inclusion in pages, and directly usable by
simple JavaScript without requiring any complex reimplementation of UCA in
the JavaScript code.


Why do I think that exposing the functions as parser functions will be useful?
Because it allows the implementation to be tested extensively on lots of
cases, but only within a limited set of pages, long before the schema is
developed, finalized and finally deployed.

In other words, it will not block the development of the schema update, as long
as we agree about what are the essential functions to support, i.e. their
interface that will be exposed (partly) in parser functions.

Also because I'm convinced that the exposed parser functions will not have their
syntax changed, and that what they return will be well known:

- The {{COLLATIONMAP:}} function is very well described and will effectively
return human-readable text. Its formal implementation should follow the
standard Unicode definitions.

You can expose it at least in a test wiki where you'll be able to track very
easily the result and progress of its implementation (just create a page
containing test words in various languages, arranged in a Wikitable).

- The {{SORTKEY:}} function can be exposed as well (and tested on the same test
page for various languages, using the same list of words). Its result will be
opaque to humans, and compact. It will be easy to assert that it generates the
expected sort order by using it in a sortable wikitable.


Both functions will be deployable rapidly, even on wikis that won't want to
apply the schema change (so they will continue to use a single collation order
for ALL their categories, and will anyway be able to sort specific categories
using another supplied locale matching the category name).

If you think about it, changing the SQL schema may in the end be rejected by
lots of people. Exposing the parser functions will provide a convenient
alternative that can be 

[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-07-26 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #191 from Philippe Verdy verd...@wanadoo.fr 2010-07-26 19:07:34 
UTC ---
In all this discussion it appears that the development can be done as two
separate projects, developed independently.

You can continue to develop the SQL schema extension, provided that:

- you call a PHP function/method, developed separately in a Collator class,
to compute the collation sortkeys for each locale and specified collation
level;

- you call a PHP function/method, developed separately in a Collator class,
to compute the collation mappings for each locale, specified collation level
and maximum number of grapheme clusters, in order to generate the headings on
the fly in the category list;

- you think about an HTTP query parameter that will allow changing the locale
(this parameter exists, it's uselang, but another one will eventually be
needed to specify sort options like -x-lc or -x-uc; the -x- separator
will be used implicitly, so the HTTP query may contain uselang=fr&opt=lc).
These parameters will be used later when you'll be able to store multiple
sortkeys.
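The query-parameter handling proposed in the last point can be sketched like this (Python, illustrative only; the parameter names uselang/opt come from the comment above, the helper names are hypothetical):

```python
# Sketch of the hypothetical uselang/opt query parameters described above:
# sort options ride in a BCP 47 private-use ("-x-") subtag.

def locale_from_query(uselang, opt=None):
    """Compose e.g. uselang=fr&opt=lc into the locale string 'fr-x-lc'."""
    return uselang if not opt else f"{uselang}-x-{opt}"

def split_locale(locale):
    """Split 'fr-x-lc' back into ('fr', 'lc'); the option may be absent."""
    base, _, opt = locale.partition("-x-")
    return (base, opt or None)

assert locale_from_query("fr", "lc") == "fr-x-lc"
assert split_locale("fr-x-lc") == ("fr", "lc")
assert split_locale("zh-hant") == ("zh-hant", None)
```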


Separately, another project will implement the Collator class, which will load
the DUCET table, load the tailorings defined per locale (or locale + "-x-" +
option), and prepare the table of collation elements and their associated
collation weights for each level. Then it will implement the methods:

- Collator::Collator(locale, level=1), which instantiates a Collator object for
the specified locale string (language + optional options) and the collation
level (it does not need to prepare the full table of collation weights, just
those needed for the specified level). The default level will be 1.

- Collator::sortkey(text), which returns the opaque collation key (a binary
array of bytes) for the specified Collator instance.

- Collator::map(text, maxelements=''), which returns the human-readable text
after applying the collation mapping associated with the Collator instance. It
will stop after returning at most 'maxelements' collation elements in the
remapped string, or will process all the text if maxelements is null or not
specified. For category headings, you'll pass maxelements=1 by default (unless
another value was specified when preparing a specific locale with the extra
option of {{SORTAS:locale|maxelements}} in the visited category page).
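A toy sketch of this three-method interface, in Python rather than PHP and with invented weights (a hypothetical tailoring where "ch" sorts as a single element between "c" and "d"; a real implementation would load the DUCET and the CLDR tailorings):

```python
# Minimal sketch of the proposed Collator interface: constructor with
# (locale, level), sortkey(), and map(). Weights here are toy values.

class Collator:
    # Hypothetical tailoring: "ch" is one collation element sorting
    # between "c" and "d" (traditional Spanish-style).
    WEIGHTS = {"a": 1, "b": 2, "c": 3, "ch": 4, "d": 5, "e": 6}

    def __init__(self, locale, level=1):
        self.locale = locale
        self.level = level

    def _elements(self, text):
        """Split text into collation elements, longest match first."""
        i, elems = 0, []
        while i < len(text):
            if text[i:i + 2].lower() in self.WEIGHTS:
                elems.append(text[i:i + 2])
                i += 2
            else:
                elems.append(text[i])
                i += 1
        return elems

    def sortkey(self, text):
        """Opaque binary key: one primary-weight byte per element
        (unknown elements sort last)."""
        return bytes(self.WEIGHTS.get(e.lower(), 255)
                     for e in self._elements(text))

    def map(self, text, maxelements=None):
        """Human-readable prefix of `maxelements` collation elements."""
        elems = self._elements(text)
        return "".join(elems if maxelements is None else elems[:maxelements])

c = Collator("es-x-trad")
assert sorted(["da", "ca", "cha"], key=c.sortkey) == ["ca", "cha", "da"]
assert c.map("chess", 1) == "ch"
```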


Separately, another simple project will use the Collator class to expose them in
parser functions. In fact this third project will be very easy and fast to
complete, even if the Collator class is not fully developed with all its
options.

A very basic Collator class can be used to develop the parser function
extension that will expose:
- {{SORTKEY:text|locale|level}}
   defaults: locale={{CONTENTLANGUAGE}}|level=1
- {{COLLATIONMAP:text|locale|level|maxelements}}
   defaults: locale={{CONTENTLANGUAGE}}|level=1|maxelements=<!--empty/null-->

With it you can immediately deploy it on a test wiki server, where you'll build
test lists in a sortable wikitable, and you can continue building the
Collator class to support various locales and all the needed options.



[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-07-26 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #192 from Philippe Verdy verd...@wanadoo.fr 2010-07-26 19:20:35 
UTC ---
Note that this is the minimum functional interface for the Collator class.
Other methods will certainly be needed to cache the already prepared collators,
or to cache the preloaded tailoring tables and the common DUCET.

In other words, the Collator::Collator() constructor is in fact probably using
the service of a CollatorFactory class that will cache the needed shared data.
This will save a lot of overhead and will be important for the performance.

This factory should NOT be used directly by the SQL schema extension, which MUST
only use the Collator::getCollator(locale, level) static function; that function
will use the CollatorFactory to instantiate the Collator object through its
constructor.

The CollatorFactory will use ICU, most probably, via its already existing
PHP-extension interface module. But a basic PHP-only implementation is still
possible without using ICU immediately, or if MediaWiki must be kept compatible
with existing PHP servers that can't load such native extensions (using DLLs or
shared libraries or requiring to recompile/relink PHP itself) in their
installation.

The CollatorFactory should NOT prepare all supported locales. Only those that
are requested **on demand**.
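The on-demand behavior described here is a plain memoizing factory; a minimal Python sketch (class names follow the comment above; everything else is illustrative, and the stub Collator stands in for a real one that would load tailorings):

```python
# Sketch of the on-demand CollatorFactory: collators are built once per
# (normalized locale, level) pair and reused on later requests.

class Collator:                      # stub; real one would load tailorings
    def __init__(self, locale, level):
        self.locale, self.level = locale, level

class CollatorFactory:
    def __init__(self):
        self._cache = {}             # (locale, level) -> Collator

    def get(self, locale, level=1):
        key = (locale.lower(), level)   # BCP 47 codes are case-insensitive
        if key not in self._cache:
            self._cache[key] = Collator(key[0], level)
        return self._cache[key]

factory = CollatorFactory()
a = factory.get("fr-x-lc")
b = factory.get("FR-X-LC")           # same effective locale: same instance
assert a is b
assert a.locale == "fr-x-lc"
```

Only the locales actually requested ever get an entry in the cache, which is the "on demand" property the comment insists on.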



[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-07-26 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #193 from Philippe Verdy verd...@wanadoo.fr 2010-07-26 19:41:57 
UTC ---
Example of the interface for you:

<?php

require('.../CollatorFactory.php');
$wgCollatorFactory = new CollatorFactory(); // this is the only global PHP variable

...

// prepare a Collator object
// (note that $locale is NOT case-sensitive in BCP 47)

$locale = $wgContentLanguage; // use {{CONTENTLANGUAGE}} by default
if (isset($options) && $options !== '')
  $locale .= '-x-' . $options; // append sort options if any (like 'uc', 'lc')
$level = 1; // use collation level 1 by default

$collator = $wgCollatorFactory->get($locale, $level);

...

// use the collator object to compute opaque sortkeys (opaque binary string)
// to store in categories:

$sortkey = $collator->sortKey($text);

// optionally map it to plain text, if the database can't store and
// sort VARBINARY(N) fields, but only VARCHAR(N) fields:

$sortkey = base32($sortkey);

...

// use the collator object to compute a heading on the fly from a page name:

$collationElements = 1; // by default generate only 1 collation element

$heading = $collator->map($pageName, $collationElements);



[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-07-26 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #194 from Aryeh Gregor simetrical+wikib...@gmail.com 2010-07-26 
19:49:15 UTC ---
(In reply to comment #190)
 Yes, Language::firstLetterforList(s) maps more or less to COLLATIONMAP, but
 COLLATIONMAP is a more generic concept which reflects what is defined in the
 Unicode standard annexes, which speak about various mappings (including
 collation mappings, but also case mappings)

Which particular Unicode standards are relevant here?  It would probably be a
good idea for me to take a look at them.

 For some categories, it would also be convenient to be able to use longer
 substrings, containing more than one grapheme cluster. In Wiktionary, for lists
 of terms belonging to a language, or in lists of people's names, a category may
 need to be indexed and anchored with section headings containing the first 2 or
 3 grapheme clusters, because the first grapheme is not discriminating enough and
 will remain identical in all columns of the displayed list on one page, and
 even sometimes over several or many successive pages: the first-letter heading
 does not help, and is just unneeded visual pollution.
 
 For other categories that have very little content, too many headings are
 frequently added that also appear as pollution. Being able to suppress all of
 them, by specifying 0 grapheme clusters for that category, will better help
 readers locate the wanted item.

This is all beyond the scope of what I'm currently doing, but it should be
possible to add on without too much trouble as a separate project.

 Why do I think that exposing the functions as parser functions will be useful?
 Because it allows the implementation to be tested extensively on lots of
 cases, but only within a limited set of pages, long before the schema is
 developed, finalized and finally deployed.

This presupposes that the sortkey generation algorithm is settled upon before
the database changes are.  In fact, it's exactly the opposite: I've been
contracted to do the backend changes, and other people will figure out how
exactly to generate the sortkeys later.  Really, we could do the changes in
either order.

 Both functions will be deployable rapidly, even on wikis that won't want to
 apply the schema change (so they will continue to use a single collation order
 for ALL their categories, and will anyway be able to sort specific categories
 using another supplied locale matching the category name).
 
 If you think about it, changing the SQL schema may be rejected at end by lots
 of people.

The schema change will be part of the core software and will not be an optional
update.  Anyone who doesn't want to accept it will likely have to stick with
1.16 forever, because we're not going to support the old schema.

 Exposing the parser functions will provide a convenient alternative
 that can be deployed much more simply, and with MUCH LESS risk, using the
 existing facilities offered by [[category:...|sortkey]] and
 {{DEFAULTSORT:sortkey}}, except that their parameter will be computed using the
 exposed {{SORTKEY:}} function:
 
   {{DEFAULTSORT:{{SORTKEY:text|locale|level}}}}
 
 or:
 
   [[category:...|{{SORTKEY:text|locale|level}}]]
 
 both being generalizable through helper templates.

It's not particularly less risky.  It does encourage each wiki to hack around
the problem locally instead of pushing for a proper fix in the software.  You
shouldn't have to do DEFAULTSORT on any pages where the right sorting is
human-detectable -- it should only have to be for things like Abraham Lincoln
being sorted as Lincoln, Abraham, which the software can't do automatically.

 (Note also that section headings (first letter) will have to be translated
 to correctly report the first letter of the Pinyin romanization, because the
 page names listed will continue to display their distinctive ideographs ! The
 only way to do that is to use the collation mapping exposed by
 {{COLLATIONMAP:}})

Surely the most sensible idea is just to disable the section headings
altogether for CJK?

 My opinion is that the same category should be sortable using different
 locales, and that's why they should support multiple sortkeys per indexed page,
 one for each specified locale. Some wikis will only sort on the
 {{CONTENTLANGUAGE}} by default, but the Chinese Wiktionary will benefit from
 sorting all categories automatically using at least the default zh locale
 (an alias for zh-hans), plus the zh-hant locale for traditional
 radical/stroke order, zh-latn for the Pinyin order, and zh-bpmf for the
 Bopomofo order.
 
 The exact locale to which zh corresponds will be a user preference, but one
 will be able to navigate by clicking the automatically generated links that
 will allow them to specify the other collation orders supported specifically by
 the category or by default throughout the wiki project.

This is doable if it's desired, as long as the number of locales is very
limited (like 

[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-07-26 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #195 from Aryeh Gregor simetrical+wikib...@gmail.com 2010-07-26 
19:50:19 UTC ---
By the way, I'm keeping track of progress here:

http://www.mediawiki.org/wiki/User:Simetrical/Collation



[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-07-26 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #196 from Philippe Verdy verd...@wanadoo.fr 2010-07-26 20:37:46 
UTC ---
Note that the CollatorFactory may fail to locate the specified locale for which
a collator is being requested. Additionally, several locales may share exactly
the same collator.

In all cases, the CollatorFactory will return a valid Collator object, whose
locale can be identified: the Collator object returned by

$collator = $wgCollatorFactory->get($locale, $level);

will have a property containing the effective locale code (normalized)
and another property containing the collation level from which it was
effectively built. You should be able to access them simply with something like:

$effectiveLocale = $collator->locale();
$effectiveLevel = $collator->level();

after just getting the collator instance from the factory.

This may be useful to avoid storing duplicate equivalent binary sortkeys, or
simply to determine which effective locale to use in SQL select queries (to
retrieve the sorted list of pagenames ordered by a specified locale), when the
SQL schema will be able to store several sortkeys for the same page in the same
category.

The factory will also instantiate a collator for an effective locale and an
effective collation level only once, caching it in an internal array for
repeated use.

This will save the complex preparation of tables, and will avoid building
tables for all supported languages (for example on Commons, where lots of
languages may be desirable, each one with possibly several sort options, or
supported conversions to other scripts or script variants).

The factory, however, should probably load the DUCET table associated with
the CLDR root locale completely and immediately when it is first
instantiated and stored in the global variable (there's probably no need to
test this each time, because of lazy initialization with null member fields);
and it should most probably build the default collator (for
$locale=$wgContentLanguage and $collationlevel=1) immediately, storing it in
the first position of its member array of already prepared Collator instances.

But you may think the opposite, in order to speed up server startup by some
(milli)seconds, or to reduce the initial CPU/memory stress on the PHP garbage
collector. However, I'm not convinced that the server will be ready faster, and
the extra tests performed at each use of the $wgCollatorFactory->get()
method may impact performance at runtime...

Note also that ICU uses the same approach of a CollatorFactory to build and
cache reusable Collator instances, because it's a proven good design pattern
for implementing and using collators.

A Collator object may also be used to compare two texts without even generating
their sortkeys, or without mapping them, so it may help to include this method
in the Collator interface:

$collator->compare($text1, $text2);

which will return an integer (in other words, a Collator also implements the
Comparator interface), by parsing $text1 and $text2 collation element by
collation element up to the end at level 1, comparing their collation weights
only at this level, before restarting with the next level. When the collator
was instantiated at level 1, the successive collation elements need not be
stored, but for higher levels, it helps if they are parsed only once and kept
in an indexed array that allows faster lookup, for the next levels, in the
table of collation weights for those levels.
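The level-by-level comparison described above can be sketched as follows (Python, with a made-up two-level weight table; real weights would come from the DUCET and its tailorings): all collation elements are compared at level 1 before any level-2 weight is considered, which is what makes a primary difference beat any accent difference.

```python
# Sketch of a level-by-level compare(): level 1 (primary weights) decides
# before level 2 (here, toy accent weights) is ever consulted.

# Hypothetical two-level weights per element: (primary, secondary).
WEIGHTS = {"a": (1, 1), "á": (1, 2), "b": (2, 1)}

def compare(t1, t2, level=2):
    """Return -1, 0 or 1, comparing whole weight sequences level by level."""
    for lvl in range(level):
        w1 = [WEIGHTS[ch][lvl] for ch in t1]
        w2 = [WEIGHTS[ch][lvl] for ch in t2]
        if w1 != w2:
            return -1 if w1 < w2 else 1
    return 0

# A primary difference anywhere beats any secondary difference:
assert compare("áa", "ab") == -1      # 'a' vs 'b' differ at level 1
assert compare("a", "á") == -1        # equal at level 1, accents decide
assert compare("a", "á", level=1) == 0
```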



[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-07-26 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #197 from Philippe Verdy verd...@wanadoo.fr 2010-07-26 21:01:11 
UTC ---
I note that you have started to define the Collator class within the Language
class.

Well, I'm not sure that this should be the same class: a language has several
collators, even if a collator is for only one language (including a virtual
language like root, which is the base of all localizations including English,
with lots of common properties that are shared by the multiple languages derived
from root).

Also, I note that the collator you define is a stub. Great for your development
of the SQL schema, but someone else must work on implementing the
CollatorFactory, using ICU, or a full PHP implementation, or only a stub,
depending on the PHP script path you'll require() to load this factory class.

Technically, collators also share lots of properties across several languages.
This is why it should be separated from the Language class.

May be the Language class can query the CollatorFactory about the list of
collators (i.e. their complete locale code, including options) that are
supported by that language, using an interface method of the loaded global
CollatorFactory instance.

This way, a stub factory, building a single Collator instance will still work
immediately.

Someone with expertise in Unicode collations will help you build the Factory if
you want a PHP-only implementation.

Someone else with ICU expertise could also work on its existing integration in
PHP (this needs testing in the MediaWiki installation scripts for deployments)
and use it in the CollatorFactory and in the Collator class.

The most tricky development is the Collator::map() function, because Unicode
does not describe the algorithm completely.

I know the principles and could offer help if you give some pointers where it
would be beneficial.

But really, to test a complete implementation of the CollatorFactory, you'll
need to be able to expose it in the two optional built-in parser functions that
I described, as these can clearly be developed separately (even if you start
with a stub for the factory) and tested with the very basic parser functions
installed on a test server.

So I maintain my position for the non-risky ParserFunctions (notably also
because it will be simpler for an existing non-Wikimedia installation of
MediaWiki to install the simple functions only, without upgrading to the new
schema immediately, knowing that the factory can also be a basic stub as well
on these installations).



[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-07-26 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #198 from Aryeh Gregor simetrical+wikib...@gmail.com 2010-07-26 
21:43:21 UTC ---
(In reply to comment #187)
 Drop this tacking immediately in the PHP code: the cl_sortkey is only intended
 to store the clear-text sortkey hint specified in {{DEFAULTSORT:sortkey}} or
 [[category:...|sortkey]] to override the sort order, and this clear-text
 sortkey is not really a true sortkey but a sort hint to override the default
 sort order.
 
 For example, in people's names whose page name is FirstNames LastName but that
 we want to sort as if they were LastName, FirstNames, by indicating only
 {{DEFAULTSORT:LastName !}} (it should not be needed to include the FirstNames in
 the wiki text, as this sort hint will not be unique, and the group of pages
 using the same hint will still need to sort within this group using their
 natural order). Even in that case, there's no need to bogusly tack the cl_from
 field onto the stored field.

How do you propose this be implemented?  We would need some character that
sorts before all characters that can legitimately occur in a sort key.

(In reply to comment #197)
 I note that you have started to define the Collator class as the Language
 class.

This is completely provisional and can easily be changed later.  As far as I
understand, I'm not the one who will work on it.

 But really, to test a complete implementation of the CollatorFactory, you'll
 need to be able to expose it in the two optional built-in parser functions that
 I described, as these can clearly be developed separately (even if you start
 with a stub for the factory) and tested with the very basic parser functions
 installed on a test server.
 
 So I maintain my position for the non-risky parser functions (notably also
 because it will be simpler for an existing non-Wikimedia installation of
 MediaWiki to install the simple functions only, without upgrading to the new
 schema immediately, knowing that the factory can also be a basic stub as well
 on these installations).

Upgrading the schema will be handled by running maintenance/update.php, which
must be run on every update anyway.  We typically have a number of schema
changes in every release, and those always must be applied when deploying the
new code.  So this is not an issue.



[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-07-26 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #199 from Philippe Verdy verd...@wanadoo.fr 2010-07-26 22:23:47 
UTC ---
 (Note also that section headings (first letter) will have to be translated
 to correctly report the first letter of the Pinyin romanization, because the
 page names listed will continue to display their distinctive ideographs! The
 only way to do that is to use the collation mapping exposed by
 {{COLLATIONMAP:}})

 Surely the most sensible idea is just to disable the section headings
 altogether for CJK?

No need to do that.

The Collator instance returned by the factory for $locale=zh-latn (which
sorts using Pinyin) just has to return an empty string from its map() method, as
long as it is a stub which can't safely map ideographs to the Latin initial
consonants of the Pinyin syllable.

Note that the syntax of the Latin Pinyin mapping should be quite similar to the
syntax used in Minnan, as the Minnan syllables have more or less the same
structure as the Pinyin syllables. For ideographs that are not supported in
the Pinyin romanization, there will be no substitution, so they will sort at the
end and will not start with a Latin letter.

Here also, it is possible to infer a common heading for all of them, such as
their radical component, or just the first ideographic radical encoded in the
DUCET with a non-zero primary weight, or even the first stroke.

The first CJK stroke is:
31C0  ; [*10E0.0020.0002.31C0] # CJK STROKE T

But I should look at the exact sequence in the zh tailoring of the DUCET, in
the CLDR database.

There's a demo page for zh-Hans collation here (in the ICU project, which is
used by Unicode's CLDR project as a reference implementation):

http://demo.icu-project.org/icu-bin/locexp?d_=fr_=zh_Hans

The interesting part is the data for Rules, which orders the exemplar
sinograms; the data for Set just shows them in numeric codepoint
order, or as ranges of characters with ascending numeric code point values.
But in both cases they just concentrate on the most common basic subset used in
GB2312; this is not the complete set defined in GB18030 and Unicode.

For actual transforms from Sinograms to Latin (Pinyin) there's this demo page:

http://demo.icu-project.org/icu-bin/translit

To see how the DUCET orders ideographs, look at:

http://unicode.org/charts/collation/

The first sinogram (non-stroke) defined with a non-zero primary weight in the
DUCET sequence is U+4E00 (一), and it seems that it provides a very convenient
heading for every sinogram that we can't sort or convert to Pinyin.

Note that in all cases all the sinograms are at the end of the DUCET, just
before the unsupported/reserved/unassigned characters.

The zh-Hans collation just moves all the other letters, starting in the Variable
subset, upward (in fact it just moves upward the letters starting at Latin), to
fit the sinograms before them.

The zh-Hant collation is like zh-Hans but swaps some positions, according
to the characters' expected radical, but also because they differ in their
stroke count.

Only about 31,000 sinograms have known transcriptions to Pinyin (this number is
growing); all the others will then appear under the remaining heading group
starting with U+4E00 (一), except those in the CJK-Extension blocks, which
should be listed under the heading U+3400 (㐀).

Most non-extension CJK characters now have a Pinyin transcription; most CJK
extensions don't (but they are also the rarest characters used, so there should
not be a lot of pages indexed there)...

 For example in people's names whose page name is FirstNames LastName but that
 we want to sort as if they were LastName, FirstNames, by indicating only
 {{DEFAULTSORT:LastName !}} (it should not be needed to include the FirstNames in
 the wiki text, as this sort hint will not be unique and the group of pages
 using the same hint will still need to sort within this group using their
 natural order). Even in that case, there's no need to bogusly tack the cl_from
 field onto the stored field.

 How do you propose this be implemented?  We would need some character that
 sorts before all characters that can legitimately occur in a sort key.

This is exactly what UCA defines as a tailoring. We have such a character
available that we don't use in pagenames or in sortkey prefixes: control
characters.

Note that control characters are present in the DUCET, and they are NOT ALL
ignorable. The first of them is TAB (U+0009); it is the first character that
we DON'T USE, and it has a primary weight in the DUCET and is not ignorable
or null.

All the characters that have null weights are listed within the NULL group,
which is immediately followed by the ignorable characters.

TAB has a primary weight of 0201 in the DUCET (it comes far later, after all
the ignorables).

The first ignorable character has a primary weight 0021, so the primary weight
0001 is free to serve as the separator.

All we then have to do is tailor the primary weight of TAB, assigning it the
weight 0001.
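The tailoring described above can be sketched in miniature. This is a toy model: the per-character weights below are made up for illustration, NOT real DUCET values. The only point demonstrated is that giving the separator the lowest non-zero primary weight makes any sortkey prefix ending at the separator sort before every longer string sharing that prefix.

```python
# Toy sketch of the tailoring described above: TAB gets the lowest non-zero
# primary weight (0x0001). All other weights here are made up, NOT real
# DUCET values.
SEPARATOR_WEIGHT = 0x0001

def primary_weights(text: str) -> list[int]:
    """Map a string to its toy primary-weight sequence."""
    return [SEPARATOR_WEIGHT if ch == "\t" else 0x0200 + ord(ch)
            for ch in text]

# "Smith<TAB>John" sorts before "Smithson": at the position after "Smith",
# the separator's weight 0x0001 is lower than the primary weight of 's'.
assert primary_weights("Smith\tJohn") < primary_weights("Smithson")
```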

[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-07-26 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #200 from Philippe Verdy verd...@wanadoo.fr 2010-07-26 22:44:37 
UTC ---
Note that if your collator stub does not really compute true sort keys
(compacted in binary format and representing collation weights), it may return
spaces (U+0020), which are represented by the byte 0x20 in the UTF-8 encoding.
The separator used must be lower than this encoded character, so the VARCHAR(N)
database field has to accept this character.

If it does not, you may offset all the UTF-8 bytes returned by your stub by 1
(because UTF-8 never uses the byte value 0xFF), so that you'll be able to
use the byte value 0x20 for the separator (the UTF-8 encoded SPACE will be
stored in the sortkey field as 0x21 after it has been offset).

This does not matter because sortkeys are opaque, so this is simple to do in
the stub. The stored sort key field DOES NOT have to use the UTF-8 encoding
explicitly; it must just accept binary encoded bytes that can fit a non-Unicode
character. If the database wants you to specify a charset for this binary
sortable field, use ISO-8859-1, as long as the database will not alter the
binary order of ISO-8859-1 when computing ORDER BY clauses.
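A minimal sketch of the offset trick, assuming the text being keyed contains no control characters (so every shifted byte of printable text lands above 0x20):

```python
# Sketch of the byte-offset trick described above. UTF-8 never produces the
# byte 0xFF, so adding 1 to every byte cannot overflow, and the byte value
# 0x20 becomes free to use as a separator that sorts below every shifted
# text byte (assuming the text contains no control characters).
def opaque_sortkey(prefix: str, title: str) -> bytes:
    def shift(s: str) -> bytes:
        return bytes(b + 1 for b in s.encode("utf-8"))
    return shift(prefix) + b"\x20" + shift(title)

# SPACE (0x20) is stored as 0x21 after the offset:
assert opaque_sortkey(" ", "") == b"\x21\x20"
# A key whose prefix ends at the separator sorts before any key with a
# longer prefix, whatever follows the separator:
assert opaque_sortkey("ab", "zzz") < opaque_sortkey("abc", "aaa")
```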

I really hope that we will always have the possibility of using VARBINARY(N)
fields allowing compact random byte values, because it will be much more
efficient for our use of opaque binary sortable sort keys (that are just
streams of bytes).

Be also careful about the sign of bytes (i.e. their value range), otherwise
bytes 0x80..0xFF will sort before 0x00..0x7F in the ORDER BY clause. This
should be handled correctly if the field is declared to use the ISO 8859-1
encoding with its binary order.

Which SQL backends (and dialects) do you support in MediaWiki? This may help
me understand some development constraints. I know you support MySQL, but
are there simpler backends like BerkeleyDB, dBase files, or flat text files for
small projects, supported only via some ODBC driver with a PHP interface?



[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-07-26 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #201 from V85 v85.wikipe...@gmail.com 2010-07-26 23:02:33 UTC ---
For those of us who are less technical, but keen to see this bug resolved, is
there somewhere where we can submit alphabets, so that they may be added for
collation? I.e. where we submit the sorting order of specific languages for use
with {{sort:en}} (or whatever the mark-up will be)?



[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-07-25 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #189 from Aryeh Gregor simetrical+wikib...@gmail.com 2010-07-25 
17:22:32 UTC ---
(In reply to comment #186)
 That's a bad assumption: even the highest quality collations will need to be
 updated from time to time:
 - Unicode evolves and new characters get encoded (new versions are published
 about every 1-2 years after synchronization and final balloting at both ISO WG2
 and the UTC).
 - The content of the Unicode DUCET is NOT stable: characters are inserted in
 the sequence, so that the full list of collation weights needs to be offset
 where the new characters get inserted.
 - Collations for languages get corrected. We should be able to upgrade these
 rules when the CLDR project produces new tailorings (CLDR updates are published
 separately, about every 1-2 years).

It's not critical for most wikis to use the very latest collations, though. 
The English Wikipedia, for example, will do fine with even very out-of-date
collations, since the article names overwhelmingly use characters that have
been in Unicode for an extremely long time.  The same is true for most other
large wikis; on the other hand, changing collations on small wikis will be
fast in any event.

However, Tim Starling independently suggested a cl_collation column, and my
initial implementation does have one.  The current one doesn't allow the same
row to be stored multiple times with different collations, so sorting might
still be wrong as the collation is updated, but this might not be a big problem
in practice.  If it is, we can go with your idea of extending the primary key
to (cl_collation, cl_from, cl_to) instead of (cl_from, cl_to).
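A rough sketch of the extended primary key being discussed, using SQLite in memory as a stand-in for the real backend. The table shape follows the comment above; the collation identifiers and sortkey bytes are fake placeholders.

```python
import sqlite3

# Hypothetical categorylinks table keyed on (cl_collation, cl_from, cl_to):
# the same page/category pair can coexist under two collation versions while
# a migration is in flight.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE categorylinks (
        cl_collation TEXT    NOT NULL,  -- collation version identifier
        cl_from      INTEGER NOT NULL,  -- page ID
        cl_to        TEXT    NOT NULL,  -- category name
        cl_sortkey   BLOB    NOT NULL,  -- opaque binary sortkey
        PRIMARY KEY (cl_collation, cl_from, cl_to)
    )
""")
conn.executemany(
    "INSERT INTO categorylinks VALUES (?, ?, ?, ?)",
    [
        ("uca-old", 10, "People", b"\x45\x20"),
        ("uca-new", 10, "People", b"\x44\x21"),  # same row, newer collation
    ],
)

# Listing a category filters on one collation and sorts by the opaque key:
rows = conn.execute(
    "SELECT cl_from FROM categorylinks"
    " WHERE cl_to = ? AND cl_collation = ? ORDER BY cl_sortkey",
    ("People", "uca-new"),
).fetchall()
assert rows == [(10,)]
```

Once every row has been re-keyed under the new collation, the old rows can be deleted with a single DELETE on cl_collation.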

 Anyway you did not reply to the idea of first developing the parser functions
 and testing them. Developing the SQL schema extension should not be attempted
 before at least the first function {{SORTKEY:text|locale|level}} has been fully
 developed and tested on specific pages (it can be tested easily in tables).

 The second function {{COLLATIONMAP:text|locale|level|clusters}} is not needed
 immediately to develop the schema, but will be useful to restore the
 functionality of headings. Headings don't need to be stored as they can be
 computed on the fly, directly by reading sequentially the sorted result set
 from the SQL query:

I tried to figure out what you were talking about here, and eventually figured
out that you just want me to expose Language::convertToSortkey() and
Language::firstLetterForLists() (as they're currently called) as parser
functions, for the benefit of lists.  That might be useful, and should be easy
to do, although it's not part of my assigned job here.

 You can compute headings from the returned page names, or from the existing
 stored cl_sortkey which should be used now ONLY to store the plain-text
 specified in articles with {{DEFAULTSORT:sortkey}} and
 [[category:...|sortkey]].

I've introduced cl_raw_sortkey to store that, while retaining cl_sortkey for
actual sorting.  Based on your feedback, it might make more sense to rename
cl_raw_sortkey to cl_sortkey_prefix, put {{defaultsort:}} into it, and make it
the empty string by default.

(In reply to comment #187)
 This does not require any change in the schema. This can be done immediately by
 MediaWiki developers and will not influence the development of all the
 corrections needed for bug #164 itself.

Correct.  I'll probably do it anyway in the course of the work I'm doing,
though, since I'll be rewriting CategoryPage.php's sort logic in any case.

 For example in people's names whose page name is FirstNames LastName but 
 that
 we want to sort as if they were LastName, FirstNames by indicating only
 {{DEFAULTSORT:LastName !}} (it should not needed to include the FirstNames in
 the wiki text, as this sort hint will not be unique and the group of pages
 using the same hint will still need to sort within this group using their
 natural order).

I can append the page title to cl_raw_sortkey before computing cl_sortkey. 
That shouldn't be a problem.  As noted above, maybe renaming it to
cl_sortkey_prefix would make more sense.
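A minimal sketch of the cl_sortkey_prefix idea, under stated assumptions: str.casefold() stands in for a real UCA collator, and the "\n" join between prefix and title is an illustrative choice, not a confirmed detail of the implementation.

```python
# Sketch: cl_sortkey_prefix holds only the human-specified hint (empty by
# default); the page title is appended before the collation key is computed,
# so pages sharing a prefix still sort by title within the group.
# str.casefold() stands in for a real UCA collator here.
def compute_sortkey(sortkey_prefix: str, page_title: str) -> str:
    raw = f"{sortkey_prefix}\n{page_title}" if sortkey_prefix else page_title
    return raw.casefold()

# Two pages with the same {{DEFAULTSORT:Verdy}} hint still order by title:
a = compute_sortkey("Verdy", "Philippe Verdy")
b = compute_sortkey("Verdy", "Philippe Verdy (writer)")
assert a < b
```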

(In reply to comment #188)
 An interface function is DEFINITELY NOT an implementation detail. My
 comment #180 _correctly_ describes and summarizes what is really wanted.

Unfortunately, it's hard for me to understand what you're saying.  However, I
think I've got it now, at least mostly.  I don't think parser functions are
essential, here, and they're not strictly part of this bug.  They can be added
after the initial implementation is finished.

 It correctly explains the dependencies and why any change in the SQL schema can
 be and should be delayed. In fact I consider step (1) in my plan to have higher
 priority than all the rest, and it does not imply any immediate change in the
 schema to support it.

Writing those functions is not part of my job.  I expect the i18n people will
handle that.  I'm just doing a prototype, 

[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-07-24 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #187 from Philippe Verdy verd...@wanadoo.fr 2010-07-24 06:17:18 
UTC ---
(In reply to comment #185)
 Oh, I didn't get what he was saying.  Yes, obviously we should use the actual
 cl_from field, not tack it onto cl_sortkey (is that what we do these days?). 
 I'm going to have to make a lot of changes to the category pager anyway to
 support paging by multiple things, so I can do that while I'm at it.

This does not require any change in the schema. This can be done immediately by
MediaWiki developers and will not influence the development of all the
corrections needed for bug #164 itself.

The unique index just has to include the cl_from field (whatever it contains,
it is already the unique ID of the page, and it is already stored). Only the
SQL queries and the HTTP requests have to be able to specify this cl_from field
separately.

Tacking it onto the sortkey was a bad decision made by lazy developers before
they realized that storing a separate cl_from field was necessary. It just
polluted the indexes and had the bad effect of wasting precious space needed
for correctly sorting some pages.

Drop this tacking immediately in the PHP code: cl_sortkey is only intended
to store the clear-text sortkey hint specified in {{DEFAULTSORT:sortkey}} or
[[category:...|sortkey]] to override the sort order, and this clear-text
sortkey is not really a true sortkey but a sort hint that overrides the default
sort order.

For example, in people's names whose page name is FirstNames LastName but that
we want to sort as if they were LastName, FirstNames, by indicating only
{{DEFAULTSORT:LastName !}} (it should not be needed to include the FirstNames in
the wiki text, as this sort hint will not be unique and the group of pages
using the same hint will still need to sort within this group using their
natural order). Even in that case, there's no need to bogusly tack the cl_from
field onto the stored field.



[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-07-24 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #188 from Philippe Verdy verd...@wanadoo.fr 2010-07-24 07:00:47 
UTC ---
(In reply to comment #183)
 Okay, look, if you make posts this long I'm not going to be able to read 
 them. 
 You have to state your points briefly and clearly, or else I'm just not going
 to have time to read through carefully and figure out exactly what you're
 trying to say.  Walls of text are really not helpful.
 
 Furthermore, you're focusing too narrowly on implementation details, like the
 exact interface a function should implement.

An interface function is DEFINITELY NOT an implementation detail. My
comment #180 _correctly_ describes and summarizes what is really wanted.

It's not focused on how it will be implemented (using ICU or not, how to
integrate it in PHP if it's used, otherwise how to represent the DUCET and how
to represent the language tailorings...), and it explains why the development
can be planned in several clearly independent steps that can be tested
separately.

It correctly explains the dependencies and why any change in the SQL schema can
be and should be delayed. In fact I consider step (1) in my plan to have higher
priority than all the rest, and it does not imply any immediate change in the
schema to support it.

There will be some time needed only to assert that the implemented collations
are effectively correct for various languages: this will be verifiable by users
looking at lists of terms in some wiki page, using a sortable wikitable whose
rows use the new {{SORTKEY:text|locale|level}} parser function.

If this is validated by users, then two independent development phases can
be started (in parallel):

- to develop the new schema for stored sortkeys, based only on the internal PHP
functions implementing {{SORTKEY:text|locale|level}}. The schema should NOT be
deployed before the collations have been tested extensively by users and their
internal data structures reflect the wanted collation order and tailorings.

- to develop the {{COLLATIONMAP:text|locale|level|clusters}} parser function
(for later inclusion in the generation of correct column headings in the
lists displayed by category pages, because for now these headings are almost
useless for anything other than English, or in heavily populated categories).

The second function will be separately integrable in the display of category
pages (before or after the schema has been updated for stored sort keys), but
can be tested separately as well. And anyway, it will already be a very useful
general clear-text function for use in wiki templates that are trying to equate
equivalent strings (for example in a {{#switch:}}, even better than
{{LC:text}}) and will be very convenient for some languages like Arabic, Hebrew
and Russian that have optional diacritics in their orthographies (such as
cantillation marks, dots for vowels, and tonic accents) or that may contain
compatibility ligatures or display forms (e.g. narrow/wide kanas and square
clusters of kanas in Japanese, compatibility jamos in Korean).

The name of this second function could be abbreviated as {{CMAP:}} (it does not
matter, but it should be implemented based on its functional description by
Unicode in one of its UAXes), or it could be localized as well, like other magic
wiki keywords. It should probably use {{CONTENTLANGUAGE}} as the default
value for the second parameter, locale (because {{int:lang}} may cause
rendering problems in some pages that do not want to depend on the user's
preferred language). By default the level parameter will be 1 (it could be up
to 3; any higher value will return the same text without performing any mapping,
because higher collation levels have to contain all the collation differences),
and the default clusters will be unlimited.



[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-07-23 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #186 from Philippe Verdy verd...@wanadoo.fr 2010-07-24 05:53:30 
UTC ---
(In reply to comment #183)
 Upgrading the collation can be done in-place.  The worst case is that
 categories sort weirdly for a few hours.  Also, we would only realistically
 have to change the collation often on smaller wikis, since the largest wikis
 should have high-quality collations that cover almost all their pages to begin
 with.  I don't think we need to adjust the schema to prepare for this.

That's a bad assumption: even the highest quality collations will need to be
updated from time to time:
- Unicode evolves and new characters get encoded (new versions are published
about every 1-2 years after synchronization and final balloting at both ISO WG2
and the UTC).
- The content of the Unicode DUCET is NOT stable: characters are inserted in
the sequence, so that the full list of collation weights needs to be offset
where the new characters get inserted.
- Collations for languages get corrected. We should be able to upgrade these
rules when the CLDR project produces new tailorings (CLDR updates are published
separately, about every 1-2 years).

These corrections may be rare (every few months), but when they occur, an
upgrade could take many hours and could force the site to go offline while
recomputing sortkeys, or NO correction will be possible. Upgrading in place
is effectively what I proposed, but how will you track which pages need to be
reindexed?

A collation ID in the stored index can really help determine which collation
rule was used to generate the stored sortkey; in addition, it will allow
support for multiple collations. This is the means by which the in place
recomputing can safely be done.

Note: truncating the sortkeys will ALWAYS be needed, just because the database
column will still have a length limit. Truncating is not so bad anyway,
because:
- the compact binary sequence of primary collation weights comes at the
beginning of the sort key; further length is used to store the compacted
sequence of secondary collation weights, then the sequence of ternary
collation weights.
- if truncation occurs, the effect will be that only the smallest differences
will not be represented.

But if you accept storing only non-truncated sort keys, you'll still have the
case where some pages have a long name, plus the case where someone has
indicated for that page a very long {{DEFAULTSORT:sortkey}} or very long text
in the second parameter of [[category:...|sortkey]]. To avoid this:
- page names already have a length limit. This also limits the length of sort
keys computed only from them;
- we should already truncate the string given in {{DEFAULTSORT:sortkey}} or
[[category:...|sortkey]] so that the concatenation of this string and the
page name can be used to compute the binary sortkey.

If you can accept arbitrary lengths, then go with it, but it will be unsafe and
your schema will not be able to put that in a sortable column (you'll only be
able to put it in a referenced BLOB, just like the text of articles, and
databases can never sort external BLOBs).
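The truncation argument can be illustrated directly on opaque byte keys (placeholder bytes, not real collation weights): cutting every key at the same length can only merge keys that agree on the kept prefix, never reverse the order of keys that differ within it.

```python
# Sketch of why truncating binary sortkeys is relatively safe: truncation can
# merge keys that agree on their first N bytes, but it can never reverse the
# order of keys that differ within the kept prefix.
def truncate(key: bytes, limit: int = 4) -> bytes:
    return key[:limit]

full = [b"\x10\x20\x30\x40\x50",  # differs from the next key only at byte 5
        b"\x10\x20\x30\x40\x60",
        b"\x10\x21"]              # differs from both within the first 4 bytes
cut = [truncate(k) for k in full]

# Keys differing within the kept prefix keep their relative order...
assert cut[0] < cut[2]
# ...while keys differing only past the limit collapse to a tie (the
# "smallest differences" mentioned above are the ones lost):
assert cut[0] == cut[1]
```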

Anyway you did not reply to the idea of first developing the parser functions
and testing them. Developing the SQL schema extension should not be attempted
before at least the first function {{SORTKEY:text|locale|level}} has been fully
developed and tested on specific pages (it can be tested easily in tables).

And with just this function, it should be possible on specific wikis to use it
immediately to sort specific categories (for example by using templates using
that function).

The second function {{COLLATIONMAP:text|locale|level|clusters}} is not needed
immediately to develop the schema, but will be useful to restore the
functionality of headings. Headings don't need to be stored as they can be
computed on the fly, directly by reading sequentially the sorted result set
from the SQL query:

You can compute headings from the returned page names, or from the existing
stored cl_sortkey which should be used now ONLY to store the plain-text
specified in articles with {{DEFAULTSORT:sortkey}} and
[[category:...|sortkey]].
The existing cl_sortkey is just a forced hint; it does not make the sort
order unique. Otherwise it should remain completely empty with the new schema.
It will always be locale-neutral and will take precedence over the page name:
to sort the pages effectively, the content of cl_sortkey and the page name
should always be concatenated internally to compute the binary sortkey for
various locales.



[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-07-22 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #172 from Philippe Verdy verd...@wanadoo.fr 2010-07-22 06:00:34 
UTC ---
 But if the SQL engine does not have such support, this must be implemented in
 the PHP code and collation keys can be stored in a new data column (the extra
 data column can be added or filled conditionally: if the SQL engine supports
 the needed collations, this column can remain NULL to save storage space).
 
If you sort this stuff in PHP, you need to grab the entire list before you can
reliably sort it. Doing that for [[Category:Living people]] has no chance of
staying within the memory limit.

And this was false, because you assume that the generation of sort keys has to
be done on database queries for listing the content of a category, when instead
this generation of sortkeys can be done safely only on database inserts/updates
for each separate page.

What I wanted to say is that the computed sortkeys will have to be stored. But
several sort keys for the same page in the same category are possible (one for
each collation locale indicated by the target category).

There will be no memory limit, but of course this will have a storage cost, as
the stored sort keys will have to be queried along with the list of pages to
display. 

The good question to ask is: where do we store these sortkeys? Currently we
have an SQL relation table containing a unique index on (categoryID, sortkey,
pageID), and this is enough to perform the join with the table of pages.
However, there's still only one sortkey per page and per category.

That sortkey is needlessly using the pageID within the generated sortkey (this
is visible when crossing a page limit and navigating through pages), so in
fact the unique index is on (categoryID, augmented sortkey). Conceptually
wrong and bogus (I think this was just a fast patch made when there were
uniqueness problems and multiple pages could be specified with the same
sortkey).

What is preventing you from changing the relation table containing the list of
pages in categories, to use instead a unique index on:
 (categoryID, sortkey, pageID, localeID)
where the localeID is one of the locales supported by the project, which
specifies the language for which a collation is being performed, and a
collation variant (for example, in Chinese, sort by radical/strokes with
locale=zh-Hans, or sort by Pinyin with locale=zh-Latn)?

The generation of the content of the sortkey column is the only major problem
requiring a design decision. This is where it should not even depend on the SQL
engine, and where it can be implemented within PHP, using the PHP extension
that allows using ICU functions. That string does not have to be extremely long
and does not have to be human-readable.

It can safely be stored with a reasonable length limit, so ICU-generated
sortkeys are still safe if they get truncated. Notably because the unique index
on:
 (categoryID, sortkey, pageID, localeID)
is also unique on its restriction:
 (categoryID, pageID, localeID)
And the sortkey generated by ICU, even if it's a string of binary bytes, can
still safely be stored in a table index that does not support blobs but wants
only VARCHAR(n) types, by serializing the binary sortkey to a safe encoding
(the most basic that will work is hexadecimal) that does not even require
support for Unicode or UCA collation. Just use an ASCII-only column to store
the computed binary sortkey serialized as an ASCII-only string.
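The hexadecimal serialization mentioned above is order-preserving, which is easy to check: lowercase hex maps each byte to two fixed-width characters ('0'-'9' then 'a'-'f', both ascending in ASCII), so sorting the hex strings gives the same order as sorting the raw byte keys. A sketch with placeholder keys:

```python
import binascii

# Placeholder binary sortkeys (not real collation output), including bytes
# above 0x7F to exercise the full byte range.
keys = [b"\x7f\x01", b"\x00\xff", b"\x80", b"\x00\x01"]
hex_keys = [binascii.hexlify(k).decode("ascii") for k in keys]

# The permutation that sorts the raw byte keys is the same permutation that
# sorts their ASCII hex serializations:
raw_order = sorted(range(len(keys)), key=lambda i: keys[i])
hex_order = sorted(range(len(keys)), key=lambda i: hex_keys[i])
assert raw_order == hex_order
```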

But if the database engine supports strings of bytes, just don't serialize the
blob, use the supported SQL type that can store it directly, for example
VARBINARY(n), if it remains sortable in binary order.

With this design, you are completely independent of the SQL engine; it will
work identically on MySQL, PostgreSQL, or others. And you'll have solved the
problems of collation with multiple locales according to their rules, and
possibly according to visitors' preferences.

Note: above, the localeID is not directly a VARCHAR(n) containing en or
zh-Hans. It is an arbitrary unique numeric identifier that in fact maps to a
collation rule within a locale, and this collation rule may need to be updated
from time to time: when upgrading a collation, you'll generate additional keys
with a new localeID. And when this is done, the table of supported collations
will indicate which localeID is the current one to use, and you'll be able to
easily perform the cleanup of old sortkeys that were computed with the old
rule.

It's not a complicated design, and it offers stability guarantees and supports
as well the possibility of upgrading the collations.

The possibility of offering multiple sortkeys for the same page in the same
category comes as a bonus, and you can assign localeID=0 to store the
user-specified sortkey that has been assigned in a page using the second
parameter of the [[category:pagename|sortkey]] link, or the parameter of
{{DEFAULTSORT:sortkey}} if this parameter was missing (this 

[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-07-22 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #173 from Roan Kattouw roan.katt...@gmail.com 2010-07-22 07:15:01 
UTC ---
(In reply to comment #172)
 And this was false, because you assume that the generation of sort keys has to
 be done on database queries for listing the content of a category, when instead
 this generation of sortkeys can be done safely only on database inserts/updates
 for each separate page.
 
 What I wanted to say is that the computed sortkeys will have to be stored. But
 several sort keys for the same page in the same category are possible (one for
 each collation locale indicated by the target category).
 
[etc., etc., etc.]

I suggest you read comment #171 and the wikitech-l thread linked from there.
That thread is an implementation proposal by Aryeh, who, unlike pretty much
everyone else commenting on this bug (myself included), is actually gonna start
implementing it; he suggested he might start today. This bug already has 173
comments, and posting more long comments about hypothetical implementations
while not replying to the implementation plan of the one person actually being
serious about getting this done is not useful and might just push this bug to
200 comments without getting anywhere. If we're gonna discuss anything, let's
discuss the current implementation plan: it is the only relevant plan to
discuss at this point.



[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-07-22 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #174 from Philippe Verdy verd...@wanadoo.fr 2010-07-22 09:13:30 
UTC ---
 If we're gonna discuss anything, let's discuss the current implementation
 plan: it is the only relevant plan to discuss at this point.

This is EXACTLY what I was discussing: proposing an implementation design
which also considers the fact that collations will need to evolve over
time (for example, the UCS repertoire is evolving, so the DUCET table is
modified, and collation rules are being corrected for some languages in the
CLDR project): each change will generate a new internal localeID to support
it, and there will possibly be different keys during the transition, even if
(finally) an old collation rule will be deleted (with its old sortkeys) after
the new sortkeys have been fully recomputed.

So this is clearly not blah blah. And this is certainly relevant given the fact
that you're considering implementing some or all of the suggestions (and you'll
have to test your solutions, including their performance impact).

I propose a simple model that can be very space-efficient, and that also avoids
reparsing all pages if ever a collation rule is changed, or if additional
collation rules are added to a category to support multiple sort orders
(notably within Chinese categories, which could support different orders).

My proposal does not even depend on the backend SQL server's capabilities: all
it has to support is a binary order on ASCII-only VARCHAR(n) columns to store
the computed and truncated sortkeys, which will be generated by the PHP
front-end (using ICU) followed by an ASCII-only serialization. This means that
the simplest ORDER BY queries to retrieve correctly ordered lists of pages
will work independently of the backend.

The function used in PHP to generate the binary-ordered sortkey (that will
finally be effectively stored) should also be accessible in MediaWiki as a
builtin parser function, that will take two parameters: the locale code, and
the text.

For example, as {{SORTKEY:text|locale}}, where the ''locale'' parameter is
optional and defaults to the {{CONTENTLANGUAGE}} of the project.

This builtin parser function could also be used to correctly sort the sortable
MediaWiki tables inserted in articles, by using correct sortkeys generated by
this function, provided the generated sortkey is readable and effectively
serialized as ASCII-only. It does not necessarily have to be truncated by this
function, even though it will be truncated when the same function result is
used to store sortkeys in the schema.

This parser function should even be the first development to be carried out,
before the category-page indexes are changed, because it can be applied and
tested immediately in existing categories (for example by using categorizing
templates in Wiktionary), without even upgrading the SQL schema.

-- 
Configure bugmail: https://bugzilla.wikimedia.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are on the CC list for the bug.

___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-07-22 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #175 from Roan Kattouw roan.katt...@gmail.com 2010-07-22 09:59:32 
UTC ---
(In reply to comment #174)
  If we're gonna discuss anything, let's
 discuss the current implementation plan: it is the only relevant plan to
 discuss at this point.
 
 This is EXACTLY what I was discussing: proposing an implementation design,
AFAICT you're proposing your own design without mentioning where it differs
from or coincides with Aryeh's proposal. From briefly reading your latest
comment, it looks like a lot of what you're proposing is already in Aryeh's
proposal, such as the binary-ordered sortkey thing. Other things, such as the
attention you're paying to the possibility that collation rules may change, are
not covered in Aryeh's proposal, so those are valuable additions.

So you're right, it's certainly not blah blah, but it would help a lot if you
limited your comments to the differences between your and Aryeh's proposal,
rather than leaving it up to the reader to find and filter the parts the two
proposals have in common.

-- 
Configure bugmail: https://bugzilla.wikimedia.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are on the CC list for the bug.

___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-07-22 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #176 from Philippe Verdy verd...@wanadoo.fr 2010-07-22 10:04:53 
UTC ---
Anyway, Aryeh's proposal is not found or documented at the location you
indicate. It just says that he will try to work on it today, and it still asks
for solutions and poses questions that are left unanswered in his comment on
the wikitech-l list.

In a similar spirit, the generation of the section heading for categories could
also be a different builtin parser function such as:

: {{COLLATIONMAP:text|locale|level|clusters}}

with similar parameters, and clusters=1 by default (more below). It will
return a non-opaque string that can be displayed in category pages, and could
even become a valid anchor for searching starting at some text position in the
list (ordered using the specified locale). It should be only a level-1
collation mapping by default (only level 1 will be considered for displaying
headings); however, some categories could later specify another default
collation level for such mapping.

Note that collation-based mappings are a VERY different concept from
collation-based sortkeys (read the Unicode UCA specification): sortkeys are
opaque and intended to generate a full order of texts; mappings are readable
but only intended to produce a partial order.

Another optional parameter could be given after the collation level, to
indicate how many locale grapheme clusters should be included in the heading.
By default, headings in categories should just use 1 grapheme cluster from the
text given as the first parameter of
{{COLLATIONMAP:text|locale|level|clusters}} (more below).

In a category you could *LATER* specify additional sort orders (and collation
mappings at the same time) using a syntax like:
{{SORTORDER:locale|level}} (the default collation level will be 1 for
categories).

Example 1: in a Chinese category page, you could specify:
; {{SORTORDER:zh-hans}}
: to indicate that pages in that category will be also available using the
radical/stroke order of sinograms. 
; {{SORTORDER:zh-latn}}
: to indicate that pages in that category will be also available using the
Pinyin Latin order.

Example 2: in a Korean category, where the primary order is based on the
decomposition into jamos, it will be enough to specify:
; {{SORTORDER:ko}}
: (for the default South Korean order of jamos)
; {{SORTORDER:ko-kp}}
: (for the default North Korean order of jamos)

Indicating a collation level different from 1 could generate subheadings for
level 2, but I think it should only display them at the specified level, all
using the same heading from the level-2 collation mapping.

Indicating clusters=2 (or more) in the builtin parser function
{{COLLATIONMAP:}} may be used to generate more precise headings (for example,
in heavily populated English or French categories, use 2 grapheme clusters to
generate headings on the first 2 letters, but keep the collation level at 1).
By default the builtin parser function will not limit the number of grapheme
clusters (so it will remap all the text), but a category could still specify
this maximum number of clusters to consider for headings.

By default a category will consider either:
* clusters=1 : section headings will be generated only for the first cluster
(this is what is currently used in categories), or
* clusters=0 : section headings will not be generated at all.
(This default could be a per-project default.)

The generation of section headings (using the same PHP function used by
{{COLLATIONMAP:}}) does not require any modification of the schema. Headings
can be computed and generated on the fly, from the retrieved list of pages.
Headings should not even influence the sort order, they are just convenient
groupings for display.
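The on-the-fly grouping described above can be sketched as follows (a minimal Python sketch; the crude accent-stripping `heading` helper is a hypothetical stand-in for a real {{COLLATIONMAP:}} call with clusters=1):

```python
import itertools
import unicodedata

def heading(page_name: str) -> str:
    # Crude level-1 map of the first grapheme: decompose, take the base
    # letter, uppercase it.  A real implementation would ask ICU.
    first = unicodedata.normalize("NFD", page_name)[0]
    return first.upper()

# Page names as retrieved from the SQL result set, already in collated order.
pages = ["Ábel", "Adam", "ánh", "Bela"]

# Headings are computed on the fly while rendering; nothing is stored.
groups = {h: list(g) for h, g in itertools.groupby(pages, key=heading)}
assert groups == {"A": ["Ábel", "Adam", "ánh"], "B": ["Bela"]}
```

Since the headings are derived from the retrieved rows, changing the heading rule never requires touching the stored sortkeys, which is the point being made above.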

The {{COLLATIONMAP:}} function described here would in fact join the list of
builtin string functions, as it falls in the same category as other mapping
functions like {{UC:}} or {{LC:}}. This is just another mapping.

The generation of sort keys (using the same PHP function used by
{{SORTKEY:}}), however, requires active parsing to store them in the schema,
so it can only be done later. This development should happen *later* (when the
SQL schema for category indexes is adjusted to support multiple/upgradable
collation orders).

So yes, a builtin parser function such as
{{COLLATIONMAP:text|locale|level|clusters}} should first be implemented and
tested separately before being used to group items in category headings, but
it can be implemented independently of the development and deployment of the
SQL schema for indexing categories with {{SORTKEY:}}, and integrated later in
the code that presents the list of pages to the users.

-- 
Configure bugmail: https://bugzilla.wikimedia.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are on the CC list for the bug.

___
Wikibugs-l mailing list

[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-07-22 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #177 from Roan Kattouw roan.katt...@gmail.com 2010-07-22 10:17:54 
UTC ---
(In reply to comment #176)
 Anyway, the Aryeh's proposal is not found or documented at the location you
 indicate. It just says that he will try to work on it today and still asks for
 solutions and asks for questions that are unanswered in his comment on the
 wikitech-l list.
 
He does document a fair bit of what he wants to do (e.g. separating files,
categories and regular pages with sortkey prefixes). It's not a complete
proposal and he is asking for input, but he hasn't left everything open either.

Also, why keep the discussion in two places? Why not join the discussion on
wikitech-l?

-- 
Configure bugmail: https://bugzilla.wikimedia.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are on the CC list for the bug.

___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-07-22 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #178 from Philippe Verdy verd...@wanadoo.fr 2010-07-22 10:48:39 
UTC ---
Because his specification is really incomplete, and he said that Bug#164 was
useless (despite the fact that I had described my solution extensively in
Bug#164 long before Aryeh started working on it).

And yes, before ever attempting to change the schema, I support the prior
development and extensive testing of builtin parser functions backed by PHP
functions which will later be shared to support the updated SQL schema.

This development alone will present significant difficulties:
* notably integrating ICU into a PHP module installation, or
* rewriting the collation algorithms entirely in PHP;
* having to support the DUCET updates caused by new Unicode versions or
corrections;
* having to support multiple collation orders via per-locale tailorings
(coming from CLDR or from other sources).

The need to support upgraded collation orders is also an important decision
factor for the schema, if sortkeys are stored in a SQL backend; that's why I
bring it up very early:
* collations supported by SQL backends have very strong limitations, and any
upgrade would require shutting down the servers for hours or days to perform
the upgrade of collated indexes;
* they also lack a full ISO 10646 level 3 implementation for the support of
supplementary planes.

All this is something that can be avoided completely by using ICU and not
depending on SQL backends for their support of many more collation locales that
we need in our international projects:

* the schema just needs to be able to store multiple sortkeys, so that newer
sortkeys (computed with the new rules) can be progressively computed in the
background by a bot or server script, or by upgrades occurring on the fly when
processing articles.
* older sortkeys that were using an older generation rule can be deleted in a
simple DELETE operation after the new collation rule for a corrected locale
has been made the default one, or can be deleted one by one each time a
new-generation sortkey is recomputed and has been inserted (there's not even a
need to perform the two successive operations in a transaction if the first
INSERT with the new rule has been successful).

Because we now have multiple sortkeys per indexed page in a category, we can
conveniently support multiple sortkeys for different locales and offer a good
experience for users who will want alternate sort orders (notably Chinese
users who will want presentations in radical/stroke order, or in pinyin
order).



Another note about how to serialize the opaque sortkeys:
the builtin function {{SORTKEY:text|locale|level}} described above will not
limit the length of the generated binary sortkey; however, it should serialize
it as valid Unicode text that can be used in tables.

A convenient serialization of bytes to characters that will sort correctly is
Base-36 using the alphabet [0-9A-Z] (no padding necessary), or Base-32 (which
avoids modular arithmetic but will serialize into longer strings).
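The property such a serialization must preserve can be sketched like this (hexadecimal is used instead of Base-36 purely for brevity; like Base-32 it avoids modular arithmetic, and any fixed-width encoding whose digit alphabet is in ascending ASCII order works):

```python
def serialize_sortkey(raw: bytes) -> str:
    # Hex with alphabet [0-9a-f]: fixed width (2 digits per byte) and digits
    # in ascending ASCII order, so lexicographic order of the ASCII strings
    # equals byte-wise order of the raw binary sortkeys.
    return raw.hex()

a = bytes([0x01, 0xFF])
b = bytes([0x02, 0x00])
assert a < b                                        # byte-wise order...
assert serialize_sortkey(a) < serialize_sortkey(b)  # ...survives encoding
```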

If sortkeys are to be stored and retrieved in the SQL schema, and sorted by
the SQL clause ORDER BY ... sortkey ..., then:

- either the SQL backend allows storing and sorting binary sequences of bytes
as VARBINARY(N): then no extra serialization is needed; store that opaque sort
key directly, after truncation to the max length value (N) indicated in the
SQL type of the sortkey table column.

- or the SQL backend does not support sortable binary sequences of arbitrary
bytes, but can only sort VARCHAR(N): then use a similar Base-32 or Base-36
conversion to create compatible sortkeys, and store the converted string after
truncating it to the max length value (N) indicated in the SQL type of the
sortkey table column.

- in both cases, the stored sortkeys will NEVER be exposed to users; their
sole purpose is to make the SQL ORDER BY clause work properly.

To start listing a category from a given arbitrary Unicode text, use the
start= HTTP query parameter and internally compute the sortkey associated
with it to generate the value used in the SQL clause WHERE sortkey >= 'value'.

- Section headings in categories will never need to be stored; they are
generated on the fly by reading the page names retrieved in the SQL result
set, using the {{COLLATIONMAP:}} function with the locale specified in the
uselang= HTTP query parameter and the specified (or default) clusters=
parameter (whose default will be 1 or 0 as indicated above). They will be
directly readable by users and do not require decoding anything from the
stored sortkey.

- the readable collation mappings and the opaque sortkeys should be coherent
for the same locale, but they are clearly different: pagenames that are
collation-mapped should sort in the same natural order as the section headings
generated from them, so there is absolutely no need to generate sort keys from
collation-mapped headings computed on the fly.

-- 

[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-07-22 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #179 from Roan Kattouw roan.katt...@gmail.com 2010-07-22 11:16:18 
UTC ---
(In reply to comment #178)
 All this is something that can be avoided completely by using ICU and not
 depending on SQL backends for their support of many more collation locales
This is exactly what Aryeh is proposing. I think everyone agrees that it's
better to use binary sorting with munged sortkeys rather than SQL backends'
half-baked collation support, so you don't need to argue that.

 - Section headings in categories will never need to be stored, they are
 generated on the fly by reading the page names retrieved in the SQL result set
 using the {{COLLATIONMAP:}} function, with the specified locale in the
 uselang= HTTP query parameters, and the specified (or default) clusters=
 parameter (whose default will be 1 or 0 as indicated above). They will be
 diretly readable by users and do not require decoding anything from the stored
 sortkey.
 
This is not that simple, as was pointed out on wikitech-l: what if you've got a
language where á sorts the same as a (that is, a and á are the same for sorting
purposes), then your sorted result could look like:

Áa
Áb
Ac
Ád
Ae
Af
Ág
...

We humans understand that the proper heading for this is A, not Á, but how
will the software know that? Even if we store the original, unmunged sortkeys
(note that the sortkey is not necessarily equal to the page name: [[Albert
Einstein]] sorts as Einstein, Albert) so we can differentiate A from Á, we
can't just take the first or even the most frequent letter: neither is accurate
in this case.

-- 
Configure bugmail: https://bugzilla.wikimedia.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-07-22 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #180 from Philippe Verdy verd...@wanadoo.fr 2010-07-22 11:42:55 
UTC ---
Your issue ***IS*** addressed in my proposal:

*Both* {{COLLATIONMAP:Áa}} and {{COLLATIONMAP:Aa}} will be unambiguously aa
in the primary collation level for that locale. They only differ at the
secondary collation level (with accents).
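That primary-level collapse can be illustrated crudely (NFD decomposition with combining marks stripped is only an approximation of a real UCA level-1 mapping, which would come from ICU and the DUCET):

```python
import unicodedata

def primary_map(text: str) -> str:
    # Approximate a level-1 collation map: drop accents and case, the
    # distinctions that only matter at secondary/tertiary strength.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.lower()

# Áa and Aa map to the same primary-level string, as the comment says.
assert primary_map("Áa") == primary_map("Aa") == "aa"
```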

You did not understand that a collation-mapping is DIFFERENT from an opaque
sort key, even if both are built using the same collation rules for the same
locale.

The case of Albert Einstein sorting as Einstein, Albert will pass through
the standard generation of the sortkey from the string Einstein, Albert,
explicitly given in MediaWiki source code as the parameter of the
{{DEFAULTSORT:Einstein, Albert}} special function or as the second parameter
of the [[category:...|Einstein, Albert]] link.



So here is a development and deployment map:


1) Develop two PHP functions that will compute:

  function sortkey($text, $locale, $level=1)
- it will return an opaque array of byte values
- $locale may be given a default value from the project's content language,
but this is not specified here but part of its integration in MediaWiki
- $level may take the default value of 1.
- the algorithm involves parsing steps: transform the $text parameter into
normalized form, then parse it by collation element, then locate each
collation element in the tailored collation table, which is indexed by
collation element value and returns an array of collation weights, one for
each level.
- it packs all the collation weights into the returned opaque array of byte
values, by appending all non-zero collation weights for each collation element
at the same collation level before appending the collation weights for the
higher successive levels.

  function collationmap($text, $locale, $level=1, $clusters=null)
- it will return a remapped text using the same $locale and $level
parameters
- it will use the same DUCET table and the same per-locale tailorings
- the same parsing steps are performed
- but instead of packing the collation weights, it scans the collation
table in the reverse order, by locating the first collation element (a small
Unicode string, often limited to a single character) that has the same
collation weights up to the specified level. When this smallest collation
element is found, it appends it to the result string.

  function base36($bytes)
- it packs the opaque binary array of bytes into plain ASCII that has a safe
binary order and can safely be stored in a VARCHAR(N) table field, or be
returned by a MediaWiki function.

  This module should use ICU and implement the locale tailorings; it should be
able to support a full DUCET table, and allow lookups from a collation element
to an array of collation weights, as well as the reverse (ordered) lookup from
a collation weight to a collation element (for the function collationmap()).
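The weight packing described for sortkey() can be sketched with invented weights (three characters only, and made-up numbers; a real implementation would read the DUCET plus CLDR tailorings through ICU):

```python
# Invented (primary, secondary) collation weights.  'a' and 'á' share a
# primary weight and differ only at the secondary (accent) level.
WEIGHTS = {"a": (1, 1), "á": (1, 2), "b": (2, 1)}

def toy_sortkey(text: str, level: int = 2) -> bytes:
    key = bytearray()
    for lvl in range(level):
        # All level-1 weights first, then all level-2 weights, and so on.
        for ch in text:
            key.append(WEIGHTS[ch][lvl])
        key.append(0)  # level separator: shorter strings sort first
    return bytes(key)

# Tied at level 1; the secondary weights break the tie.
assert toy_sortkey("aa", level=1) == toy_sortkey("áa", level=1)
assert toy_sortkey("aa") < toy_sortkey("áa") < toy_sortkey("ba")
```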

2) Integrate these functions in a MediaWiki extension for builtin parser
functions.

  {{SORTKEY:text|locale|level}}
- this will return base36(sortkey($text, $locale, $level))
- by default take $level=1
- by default take $locale={{CONTENTLANGUAGE}} (or {{int:lang}} ?)
- it can be used as a much better implementation of Wikipedia templates
like [[Template:Sort]]

  {{COLLATIONMAP:text|locale|level|clusters}}
- this will return collationmap($text, $locale, $level, $clusters)
- it can be used to simulate the generation of headings in categories, but
as well within mediawiki tables
- by default take $clusters=null (no limitation of length)
- by default take $level=1
- by default take $locale={{CONTENTLANGUAGE}} (or {{int:lang}} ?)


3) build a function for mapping category sortkeys into stored sort keys; this
will depend on the SQL backend capabilities and on the schema's constraint
length for the sortkey data columns:

   function sqlsortkey($text, $locale, $level)
 - it will return either :
substring(sortkey($text, $locale, $level), 0, $maxlength)
 - or :
substring(base36(sortkey($text, $locale, $level)), 0, $maxlength)
 - the choice will depend on the support of VARBINARY(N) and its
sortability in the SQL engine, or of only VARCHAR(N)
 - the sortkey will not have to be UTF-8, and will not need any support of
the same locales for collation tailoring in the SQL backend.
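The truncation step can be sketched as follows (a hypothetical maxlength of 8; hex again stands in for the Base-36 serialization):

```python
def sql_sortkey(raw: bytes, maxlength: int = 8) -> str:
    # Serialize to order-preserving ASCII, then truncate to the VARCHAR(N)
    # column width, mirroring substring(base36(sortkey(...)), 0, $maxlength).
    return raw.hex()[:maxlength]

# Order is kept for keys that differ within the stored prefix...
assert sql_sortkey(b"\x01\x02") < sql_sortkey(b"\x01\x03")
# ...but distinct long keys can collide once truncated.
assert sql_sortkey(b"\x01\x02\x03\x04\x05") == sql_sortkey(b"\x01\x02\x03\x04\x06")
```

This collision risk is why a unique tie-breaking column is still needed alongside a truncated sortkey.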

4) update the schema to support the description of supported internal
collation ids.
   - add a mapping function from the human-readable $locale parameter to the
collation id associated with the current version of the collation rule
applicable to that locale.
   - support this mapping with a simple collations relational table

5) build a new category index based on:
  (categoryid, collationid, sortkey, pageid)
  - categoryid and pageid are standard MediaWiki page ids (in any modification
version).
  - collationid will come from the previous 

[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-07-22 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #182 from Philippe Verdy verd...@wanadoo.fr 2010-07-22 13:18:28 
UTC ---
Note that my development plan will NOT imply an immediate change of the SQL
schema. At first, if working only on the first 2 steps, no schema change is
necessary to effectively test the two functions.

Notably, you'll be able to create test pages containing lists of words
formatted as table rows using a template. You'll be able to show the words, in
one column, then the opaque (base-36 converted) sort keys in another column,
and then the visible collation-mapped string in a third column.

Then you'll include this list within a page adding the table headers and using
the existing sortable table class to see how they collate when sorting by the
second column, and you'll also be able to sort again on the third column, to
see that the collation-mapped string will not alter this sort order.

If all this passes, this can be deployed and immediately used in Wiktionary
(notably the French Wiktionary) for sorting categories containing huge lists
of words, or in the English Wiktionary for use in the template that generates
sort keys of the form Last name, First name for people.

This will be already a considerable progress, even if there will be no
immediate support for multiple sort orders (in Chinese).

The change of schema (where the {{defaultsort:sortkey}} parameter or the
second parameter of the [[category:page|sortkey]] link will be used
effectively, so that we no longer have to pass a sortkey parameter in pages
and templates) can be delayed a lot.

But the two developed functions should be thoroughly tested before we even
think about reusing them for sorting pages in categories and presenting the
lists with headings. (The first one, {{SORTKEY:text|locale|level}}, is a bit
simpler to implement than the second one, {{COLLATIONMAP:text|locale|level}},
which is trickier because it requires special handling for collation elements
that are mapped to more than one collation weight per level.)

There is a separate modification to develop for grouping category contents by
namespace. This should be developed and deployed first before even trying to
change the schema for sort keys.

In fact, the first integration for displaying categories will be to use
{{COLLATIONMAP:}} internally for displaying the headings in category pages.
This won't require any change as well for the SQL schema.

Finally, the use of {{SORTKEY:}} and the addition of new keywords for use in
category pages, allowing a category to support multiple collations, will
happen only at the last step, because this is the most sensitive one and the
only step that may finally involve a change of schema. There is more than
enough work to do in building all the first integration steps, as extensive
tests in various locales will be needed, so that all wikis can report their
problems with the implemented collations and their supported tailorings.

-- 
Configure bugmail: https://bugzilla.wikimedia.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are on the CC list for the bug.

___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-07-22 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #183 from Aryeh Gregor simetrical+wikib...@gmail.com 2010-07-22 
17:57:42 UTC ---
Okay, look, if you make posts this long I'm not going to be able to read them. 
You have to state your points briefly and clearly, or else I'm just not going
to have time to read through carefully and figure out exactly what you're
trying to say.  Walls of text are really not helpful.

Furthermore, you're focusing too narrowly on implementation details, like the
exact interface a function should implement.  Please try to state your concerns
primarily in terms of use-cases or problems -- scenarios that the
implementation will need to handle.  Secondarily, it's fine if you propose
specific solutions, but I can't evaluate your solutions unless you state
clearly what problems they're trying to solve.

I'll comment on a few things you said.  If I miss anything important, please
restate it *briefly*, and I can address that too.

(In reply to comment #172)
 What I wanted to say is that the computed sortkeys will have to be stored. But
 several sort keys for the same page in the same category are possible (one for
 each collation locale indicated by the target category).

If we store separate sortkeys per category link per locale, it would take
vastly too much space, since there are likely to be dozens of locales. 
categorylinks is already a big table; we can't justify making it ten times
bigger for this.  Each category can have one locale and only one locale.

 That sortkey is needlessly using the pageID within the generated sortkey
 (this is visible when crossing a page limit and navigating through pages), so
 in fact the unique index is on (categoryID, augmented sortkey). Conceptually
 wrong and bogus (I think this was just a quick patch when there were
 uniqueness problems and multiple pages could be specified with the same
 sortkey).

It is not needless.  We need a unique key to sort by, or else paging doesn't
work if many pages have the same sort key (e.g., if it was manually assigned). 
See bug 4912.
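The paging problem can be sketched as follows (hypothetical rows; paging by the compound key (sortkey, page_id), as an ORDER BY cl_sortkey, cl_from would, gives a unique resume point even when sortkeys repeat):

```python
# (sortkey, page_id) rows; several pages share a manually assigned key.
rows = [("smith", 3), ("smith", 1), ("smith", 2), ("taylor", 4)]
rows.sort()  # compound sort: sortkey first, page_id as tie-breaker

# Resuming after the last row already shown: strictly greater tuples only.
last_shown = ("smith", 2)
next_page = [r for r in rows if r > last_shown]
assert next_page == [("smith", 3), ("taylor", 4)]
```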

 The generation of the content of the sortkey column is the only major
 problem requiring a design decision. This is where it should not even depend
 on the SQL engine, and where it can be implemented within PHP, using the PHP
 extension that allows using ICU functions. That string does not have to be
 extremely long and does not have to be human-readable.

Yes, we will certainly use ICU or something.  GerardM has said he much prefers
CLDR, so maybe we'll use that instead.

 It can safely be stored with a reasonable length limit. So ICU-generated
 sortkeys are still safe if they get truncated, notably because the unique
 index on:
  (categoryID, sortkey, pageID, localeID)
 is also unique on its restriction:
  (categoryID, pageID, localeID)
 And the sortkey generated by ICU, even if it's a string of binary bytes, can
 still safely be stored in a table index that does not support blobs but only
 VARCHAR(n) types, by serializing the binary sortkey to a safe encoding (the
 most basic that will work is hexadecimal) that does not even require support
 of Unicode or UCA collation. Just use an ASCII-only column to store the
 computed binary sortkey serialized as an ASCII-only string.
 
 But if the database engine supports strings of bytes, just don't serialize the
 blob, use the supported SQL type that can store it directly, for example
 VARBINARY(n), if it remains sortable in binary order.
 
 With this design, you are completely independent of the SQL engine; it will
 work identically on MySQL, PostgreSQL, or others. And you'll have solved the
 problems of collating with multiple locales according to their rules, and
 possibly according to visitors' preferences.

This was always the intended design.  (But note that truncating these sort keys
might not always be fully safe: it could cause tons of collisions, depending on
the nature of the sort key generation algorithm.  These would be resolved
arbitrarily, by page_id, which is confusing to users.)

 It's not a complicated design; it offers stability guarantees and supports
 the possibility of upgrading the collations as well.

Upgrading the collation can be done in-place.  The worst case is that
categories sort weirdly for a few hours.  Also, we would only realistically
have to change the collation often on smaller wikis, since the largest wikis
should have high-quality collations that cover almost all their pages to begin
with.  I don't think we need to adjust the schema to prepare for this.

-- 
Configure bugmail: https://bugzilla.wikimedia.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are on the CC list for the bug.

___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-07-22 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #184 from Roan Kattouw roan.katt...@gmail.com 2010-07-22 18:40:19 
UTC ---
(In reply to comment #183)
 It is not needless.  We need a unique key to sort by, or else paging doesn't
 work if many pages have the same sort key (e.g., if it was manually 
 assigned). 
 See bug 4912.
 
It is needless in the sense that the cl_from field is right there and is
indexed. We should just be using that field rather than tacking its contents
onto the sortkey because we're too lazy to rewrite IndexPager so it can handle
paging by more than one field (which is not very hard).

-- 
Configure bugmail: https://bugzilla.wikimedia.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are on the CC list for the bug.

___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-07-22 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #185 from Aryeh Gregor simetrical+wikib...@gmail.com 2010-07-22 
18:47:23 UTC ---
Oh, I didn't get what he was saying.  Yes, obviously we should use the actual
cl_from field, not tack it onto cl_sortkey (is that what we do these days?). 
I'm going to have to make a lot of changes to the category pager anyway to
support paging by multiple things, so I can do that while I'm at it.

-- 
Configure bugmail: https://bugzilla.wikimedia.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are on the CC list for the bug.

___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-07-21 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

Aryeh Gregor simetrical+wikib...@gmail.com changed:

   What|Removed |Added

 Status|NEW |ASSIGNED
 AssignedTo|wikibug...@lists.wikimedia. |simetrical+wikib...@gmail.c
   |org |om

--- Comment #171 from Aryeh Gregor simetrical+wikib...@gmail.com 2010-07-21 
21:03:40 UTC ---
I'll be working on this bug.  I hope to have a solution coded up within a
couple of weeks.  I wrote a post to wikitech-l about it, and encourage people
to respond there rather than here (since this involves several bugs):

http://lists.wikimedia.org/pipermail/wikitech-l/2010-July/048399.html

I'll also note that the enormous amount of bickering and repetition here meant
that I wound up skimming it, so if I missed your suggestion, please repeat it
on the wikitech-l thread.



[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-06-01 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

Tim Starling tstarl...@wikimedia.org changed:

   What|Removed |Added

 CC||tstarl...@wikimedia.org

--- Comment #170 from Tim Starling tstarl...@wikimedia.org 2010-06-02 
00:48:00 UTC ---
I think using ICU sort keys, possibly with a pure PHP port to support non-WM
installations, would be a good way to go. It would support all the languages we
want to support, which is important.



[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-05-14 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

Liangent liang...@gmail.com changed:

   What|Removed |Added

 CC||liang...@gmail.com

--- Comment #169 from Liangent liang...@gmail.com 2010-05-14 15:33:03 UTC ---
[[mw:Extension:CategoryMultisort]] and
[[mw:Extension:CategoryMultisortChinese]], written by me, are expected to
resolve the problem on Chinese Wikipedia.



[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-03-14 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #168 from Aryeh Gregor simetrical+wikib...@gmail.com 2010-03-14 
16:26:48 UTC ---
Changing cl_sortkey to utf8 *is* a solution.  It needs a sysadmin to do it on
Wikimedia (if we want to do it), and we'd probably also want some core code to
filter out non-BMP characters before they get to the sortkey.
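The filtering step could look something like this sketch (the function name and the choice of replacement character are illustrative assumptions, not core code):

```python
import re

# Match any character outside the Basic Multilingual Plane (U+10000 and up),
# which a MySQL utf8 (3-byte) column cannot store.
NON_BMP = re.compile(r"[\U00010000-\U0010FFFF]")

def strip_non_bmp(text):
    # Replace each non-BMP character with U+FFFD before building a sortkey.
    return NON_BMP.sub("\uFFFD", text)
```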



[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-03-12 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #161 from Aryeh Gregor simetrical+wikib...@gmail.com 2010-03-12 
17:12:17 UTC ---
utf8 support in MySQL supports a number of common languages, including Danish,
Swedish, French, Hungarian, and Estonian:

http://dev.mysql.com/doc/refman/5.1/en/charset-unicode-sets.html



[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-03-12 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #162 from Huji huji.h...@gmail.com 2010-03-12 17:48:42 UTC ---
(In reply to comment #160)

Actually, I think the issues you raised with collations and languages have
nothing to do with MediaWiki; if anything, MySQL is to be blamed. All we want
here is for MediaWiki to use the full extent of the collation features offered
by MySQL. Perhaps not all languages will benefit from it, but at least some of
them do, and that's reason enough for a change like this.



[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-03-12 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #163 from T. Gries m...@tgries.de 2010-03-12 21:47:56 UTC ---
(In reply to comments #159 and #160)
 While the suggested patch clearly dramatically improves the page title
 sorting, it doesn't make it 100% correct for all European languages though
 it may make it so for some, and I think it would be an improvement for all
 but the Nordic languages. Danes, Norwegians, and Swedes expect Ö to be
 treated as a separate letter that sorts after Z.

Yes, I know this, too.

 The suggested patch may be very appropriate for individual Wikimedia
 installations, but is not so as part of the proper fix.

Thanks for pointing that out. I must admit that you are _fully_ right; I also
know of the collation differences. We developers should think collaboratively
to find a satisfying solution for the collation problem.

This could be done in Berlin during the Wikimedia Developer Workshop 2010, held
on April 14.-16. in Berlin, Germany, as part of the Wikimedia Conference 2010.
http://techblog.wikimedia.org/2010/03/registration-open-for-the-developer-workshop-in-berlin/



[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-03-12 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #164 from Aryeh Gregor simetrical+wikib...@gmail.com 2010-03-12 
22:06:57 UTC ---
(In reply to comment #163)
 Thanks for pointing that out. I must admit that you are _fully_ right; I
 also know of the collation differences. We developers should think
 collaboratively to find a satisfying solution for the collation problem.

The options I know of are:

1) Use MySQL collation support, just using utf8 everywhere.  This will mess up
equality comparisons in ways we almost certainly don't want, doesn't support
characters outside the BMP, etc.

2) Roll our own sortkeys.  Probably requires a lot more work, and a lot less
efficient (need to keep extra column+index around in all relevant tables). 
Also might be tricky, since the default Unicode algorithm gives very long sort
keys, which we'd have to squish somehow (AFAIK).

3) Use some utf8 collation for cl_sortkey, but keep everything else binary.  We
could also add an extra page_sortkey using utf8 collation, which is no more
expensive than (2).  This sounds like the best option to me offhand.  It would
be as easy to implement as anything here, and wouldn't have such bad side
effects.  It would somewhat mess up pages with non-BMP characters in the name,
though.



[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-03-12 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #165 from T. Gries m...@tgries.de 2010-03-12 22:18:10 UTC ---
(In reply to comment #164)
 (In reply to comment #163)
 The options I know of are:
 1) Use MySQL collation support, just using utf8 everywhere.  This will mess up
 equality comparisons in ways we almost certainly don't want, doesn't support
 characters outside the BMP, etc.
 
 2) Roll our own sortkeys.  Probably requires a lot more work, and a lot less
 efficient (need to keep extra column+index around in all relevant tables). 
 Also might be tricky, since the default Unicode algorithm gives very long sort
 keys, which we'd have to squish somehow (AFAIK).
 
 3) Use some utf8 collation for cl_sortkey, but keep everything else binary.  
 We
 could also add an extra page_sortkey using utf8 collation, which is no more
 expensive than (2).  This sounds like the best option to me offhand.  It would
 be as easy to implement as anything here, and wouldn't have such bad side
 effects.  It would somewhat mess up pages with non-BMP characters in the name,
 though.

Can you start writing an extension which outputs the list of pages
http://en.wikipedia.org/wiki/Special:AllPages in a user-definable collation
such as utf8_general_ci, utf8_unicode_ci, or utf8_swedish_ci? This would be a
good starting point in my view.



[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-03-12 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #166 from Aryeh Gregor simetrical+wikib...@gmail.com 2010-03-12 
22:30:03 UTC ---
This shouldn't be an extension, we may as well have a page_sortkey added in
core.  Makes more sense there.  However, categories are lower-hanging fruit,
since we just need to change cl_sortkey's collation -- the column already
exists.  Plus, categories are more visible.  This needs a sysadmin either way,
however, not just developers.



[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-03-12 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #167 from T. Gries m...@tgries.de 2010-03-12 22:38:48 UTC ---
(In reply to comment #166)
 This shouldn't be an extension, we may as well have a page_sortkey added in
 core.  Makes more sense there.  However, categories are lower-hanging fruit,
 since we just need to change cl_sortkey's collation -- the column already
 exists.  Plus, categories are more visible.  This needs a sysadmin either way,
 however, not just developers.

Sounds reasonable, OK - but who comes up with a sustainable solution for the
core code and this bugzilla?



[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-03-11 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

--- Comment #160 from Stu Derby stude...@yahoo.com 2010-03-12 01:32:28 UTC ---
Re: comment 159 

While the suggested patch clearly dramatically improves the page title sorting,
it doesn't make it 100% correct for all European languages though it may make
it so for some, and I think it would be an improvement for all but the Nordic
languages.

The different European languages have different and conflicting collation
sequences; no single order will work for all of them.

For example:

German speakers expect Ö and O to collate as equivalents; Danes,
Norwegians, and Swedes expect Ö to be treated as a separate letter that sorts
after Z.

French has an unusual collation rule; the last accent on a word is more
significant than the first, leading to this collation:
cote < côte < coté < côté. This is unique to French, so far as I know.

Hungarian sorts the combination dzs after the combination dz (i.e.
foodzsbar should sort after foodzbar in Hungarian).

In Estonian, Z sorts as an equivalent to S.

Etc. etc. etc.
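The French rule can be made concrete with a toy sort key (an illustration of the principle only, not a real UCA implementation): base letters are compared first, and accents are then compared starting from the end of the word.

```python
import unicodedata

def french_key(word):
    # Toy sort key for the French "backward accents" rule:
    # primary level = base letters, secondary level = accents
    # compared from the END of the word.
    nfd = unicodedata.normalize("NFD", word)
    letters = []  # list of (base char, accent codepoint or 0)
    for ch in nfd:
        if unicodedata.combining(ch):
            letters[-1] = (letters[-1][0], ord(ch))
        else:
            letters.append((ch, 0))
    base = tuple(b for b, _ in letters)
    accents_from_end = tuple(a for _, a in reversed(letters))
    return (base, accents_from_end)

words = ["côté", "coté", "cote", "côte"]
print(sorted(words, key=french_key))  # cote, côte, coté, côté
```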

The suggested patch may be very appropriate for individual Wikimedia
installations, but is not so as part of the proper fix.



[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-03-07 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

T. Gries m...@tgries.de changed:

   What|Removed |Added

 CC||m...@tgries.de

--- Comment #159 from T. Gries m...@tgries.de 2010-03-07 11:15:15 UTC ---
When setting up a new(!) wiki, and when using MySQL 4.1/5.0 UTF-8 (not: binary)
a correct page title sort order can be achieved when applying this patch

in maintenance/tables.sql change 
  page_title varchar(255) binary NOT NULL,
to
  page_title varchar(255) NOT NULL,

before you run the installer.

The column page_title will then become of encoding type

  character set utf8 collate utf8_general_ci

and the sorting order of page titles is correct for (at least) European
languages. You then get page titles in Special:Pages listed in the order
Anton, Äpfel, École supérieure, Zulu - and no longer Anton,
Zulu, Äpfel, École supérieure as is currently the case on all WMF wikis.

This tip is only valid for fresh installations: you cannot change this
important column when updating from an existing database, not even when
mysqldumping and importing. Attention: the tip has not yet been checked for
unwanted side effects.

For collations see 
[1] http://www.collation-charts.org/mysql60/ and 
[2]
http://www.collation-charts.org/mysql60/mysql604.utf8_general_ci.european.html



[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-02-27 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

Le Chat cat...@vp.pl changed:

   What|Removed |Added

 Status|ASSIGNED|NEW

--- Comment #158 from Le Chat cat...@vp.pl 2010-02-27 16:37:31 UTC ---
As I understand it, this bug was wrongly assigned to someone who is not working
on it (at least, not in its full generality), so I'm boldly de-assigning it, in
the hope that someone will finally take it up.



[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-02-27 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164

Le Chat cat...@vp.pl changed:

   What|Removed |Added

 AssignedTo|philip@gmail.com|wikibug...@lists.wikimedia.
   ||org



[Bug 164] Support collation by a certain locale (sorting order of characters)

2010-01-19 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164


Chris chriskwonle...@msn.com changed:

   What|Removed |Added

 CC||chriskwonle...@msn.com




--- Comment #157 from Chris chriskwonle...@msn.com  2010-01-19 16:35:07 UTC 
---
I'm not sure if I'm posting in the right category, but I figured I should post
it here before making a new one.
There's a problem for sorting with Korean page names.
If the page is 가, it will sort under ㄱ if you don't specify where to sort
it (which is what it should do).

However, if I manually sort it to like this: [[Category:name|ㄱ]] it won't
sort under the same ㄱ as before. Instead it will make another ㄱ to sort
under.
So if a user manually sorts a page under ㄱ, and another user makes a page
that starts with the letter ㄱ which MediaWiki sorts automatically, they are
not sorted under the same letter even though they should be. My MediaWiki is
the latest version, by the way (1.15.1).

In case you're confused:

ㄱ
-가 (manually sorted page)

ㄴ
-(page)

.
.
.

ㅈ
-(page)

End of the Korean alphabet - there should be no more Korean letters, but pages
that weren't sorted manually will appear after:

ㄱ (appears at the end; an identical letter as before, which should not be
recreated by MediaWiki)
- 기 (a page that was sorted by MediaWiki; this should be sorted in the same
section as page 가)

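The mismatch happens because a precomposed syllable such as 가 (U+AC00) and a compatibility jamo such as ㄱ (U+3131) are far apart in code-point order, so a binary sort never groups them. A sketch of how a first-letter heading could be normalized (illustrative only, not MediaWiki code):

```python
# The 19 leading consonants (choseong) in syllable-block order,
# written as compatibility jamo like the manual ㄱ sortkey above.
CHOSEONG = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"

def heading_for(char):
    code = ord(char) - 0xAC00
    if 0 <= code < 11172:             # precomposed Hangul syllable block
        return CHOSEONG[code // 588]  # 588 = 21 vowels * 28 finals
    return char                       # already a jamo (or not Hangul)
```

With this normalization, both an automatic 가 and a manual ㄱ sortkey would file under the same ㄱ heading.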



[Bug 164] Support collation by a certain locale (sorting order of characters)

2009-11-27 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164





--- Comment #156 from Andrew Dunbar hippytr...@gmail.com  2009-11-27 21:50:21 
UTC ---
Apparently non-BMP UTF-8 support was in MySQL 6.0, which was around as alpha
for a while and whose features are all due to be added into upcoming 5.x
versions:
http://lists.mysql.com/packagers/418




[Bug 164] Support collation by a certain locale (sorting order of characters)

2009-11-26 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164





--- Comment #155 from Andrew Dunbar hippytr...@gmail.com  2009-11-27 02:07:39 
UTC ---
Here is another related MySQL bug to watch:
http://bugs.mysql.com/bug.php?id=25666




[Bug 164] Support collation by a certain locale (sorting order of characters)

2009-11-25 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164





--- Comment #154 from Philippe Verdy verd...@wanadoo.fr  2009-11-25 23:06:01 
UTC ---
That's another good reason why collation should be supported directly within
the MediaWiki software, which already depends completely on PHP, so that it
should really use the best integration possible via the dedicated ICU
integration module for PHP.
This also means that it will be much better to simply store the computed
collation keys directly within the database schema, unless the database adapter
for PHP already supports ICU (and there's a commitment from the database vendor
to integrate ICU as part of its design).
The Todo:ICU in PostgreSQL will be fine when it is effectively implemented and
this integration becomes fully supported.
But anyway, the first thing to do is to press the PHP developers to commit to
full support and integration of ICU within PHP, and if this is still not the
case, to make sure that the ICU integration module for PHP comes from a viable
project (otherwise MediaWiki will need to develop an adaptation layer for
collation that supports transparently changing to another PHP integration
module, or a later integration within PHP core itself).

The second thing to look for (and that is still missing) is support for an
ICU-like project (or port) for JavaScript (for integration on the client side,
in the browser), here also with a JavaScript-written adapter layer that allows
replacement of the JavaScript-written collator by some future API supported
natively by browsers (because it will perform much better).

The best integration tools (for client-side collation) that I have seen, using
JavaScript, fully depend on AJAX (i.e. with collaboration with
serverside-scripts that can provide precomputed collation data, or that can
compute the collation keys from the client-provided texts): some interesting
demos use JSON requests or XML requests though AJAX, but this adds some delays
and increases the number of HTTP requests needed to sort lots of client-side
data (for example when sorting the rendered HTML table columns, which currently
just uses the Javascript localeCompare function which seems to use only the
DUCET or some locale-neutral collation, without taking into account the actual
locale).

It would be much better if all major browser engines (for IE, Mozilla for
Firefox, WebKit for Safari/Chrome/KHTML) decided to extend the very poor
support of Unicode and locales within JavaScript/ECMAScript strings, using ICU
as a base foundation or at least for the services API that it can implement
(even if those browsers use different integration strategies): they should
still support the same collation rules, with the same syntax in a similar
language, such as the languages already documented in the Unicode standard,
including the possibility to use the collation tailoring data already coming
from the CLDR project, and the possibility for these implementations to still
support user-specified tailorings (so without hardcoding them in a way that
would completely depend on the implemented Unicode version and the limited list
of locales already supported by CLDR).

There are two standard languages defined for collation tailorings: one is
XML-based but extremely verbose (it is probably easier to use from a DOM-based
object view, and most probably more efficient at run-time); another equivalent
one is much more compact, more readable, and much easier to specify by users or
in scripts. Both syntaxes are automatically and easily convertible between each
other, with equivalent compilation times and complexities, but the compact form
is easier to transmit in small scripts over HTTP (including through AJAX), and
much faster to parse, as it won't depend on a resource-hungry XML parser (due
to its required and complex conformance rules). For JavaScript clients, a
JSON-based syntax for collation tailorings may also be even more efficient
without requiring additional complex code written in JavaScript itself.




[Bug 164] Support collation by a certain locale (sorting order of characters)

2009-11-24 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164


Anthony bugzi...@inbox.org changed:

   What|Removed |Added

 CC||bugzi...@inbox.org




--- Comment #152 from Anthony bugzi...@inbox.org  2009-11-25 05:42:19 UTC ---
Peter, for PostgreSQL you can ignore all the hacks and workarounds and just use
the built in collation.

psql (8.4.1)
Type help for help.

a=# create table category (title varchar(255) not null primary key);
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index category_pkey
for table category
CREATE TABLE
a=# insert into category values ('Apple');
INSERT 0 1
a=# insert into category values ('Banana');
INSERT 0 1
a=# insert into category values ('aaa');
INSERT 0 1
a=# insert into category values ('banana');
INSERT 0 1
a=# insert into category values ('aáa');
INSERT 0 1
a=# insert into category values ('apple');
INSERT 0 1
a=# insert into category values ('Apple');
ERROR:  duplicate key value violates unique constraint category_pkey
a=#  insert into category values ('APPLE');
INSERT 0 1
a=# insert into category values ('aab');
INSERT 0 1
a=# select * from category order by title;
 title
-------
 aaa
 aáa
 aab
 apple
 Apple
 APPLE
 banana
 Banana
(8 rows)

The easiest solution would probably be to drop support for MySQL.  Not that
that's ever going to happen.




[Bug 164] Support collation by a certain locale (sorting order of characters)

2009-11-24 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164





--- Comment #153 from Andrew Dunbar hippytr...@gmail.com  2009-11-25 07:57:50 
UTC ---
MySQL can do the collation that Anthony mentions for PostgreSQL above in #152
and much more besides.

PostgreSQL has all the problems of MySQL, and more, when it comes to
MediaWiki's collation needs. It depends on the underlying OS, some of which
don't do UTF-8 collation at all, and as yet it does not make use of ICU. See
the page here:
http://wiki.postgresql.org/wiki/Todo:ICU

I can't find any information on whether PostgreSQL supports Unicode beyond the
Basic Multilingual Plane.




[Bug 164] Support collation by a certain locale (sorting order of characters)

2009-11-20 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164





--- Comment #149 from Andrew Dunbar hippytr...@gmail.com  2009-11-20 11:05:53 
UTC ---
I've added some feedback on the MySQL bug. I urge anyone here who cares about
this bug to also pay attention to that bug. Perhaps by working on their code,
perhaps voting for the bug, perhaps providing further helpful feedback. Perhaps
they will raise its importance if they see it is important for MediaWiki and
Wikipedia - a top 5 website. I believe MediaWiki and MySQL have worked together
in the past.

http://forge.mysql.com/worklog/task.php?id=1213




[Bug 164] Support collation by a certain locale (sorting order of characters)

2009-11-20 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164





--- Comment #150 from Peter Gervai (grin) g...@grin.hu  2009-11-20 15:09:45 
UTC ---
Well, I didn't want to chime in (you seem to be able to shout at each other
without my assistance :-)), but please do not forget that the solution for
MEDIAWIKI (as opposed to Wikipedia) ought not to be based on MySQL-specific
features.

If there are two approaches, and one of them is database backend agnostic then
that one should be preferred.

Of course I didn't want to mention that I believe code speed (resource hunger)
is a real factor here, since this is accessed frequently.

By no means do I want to speak against helping MySQL, since it's a nice open
source project. I just use PostgreSQL, that's all. :-)




[Bug 164] Support collation by a certain locale (sorting order of characters)

2009-11-20 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164





--- Comment #151 from Philippe Verdy verd...@wanadoo.fr  2009-11-20 15:43:56 
UTC ---
Anyway, the backend compatibility layer that maps the SQL into the SQL dialect
spoken by the SQL engine can still adapt itself: if there is true collation
support in the SQL backend, it can be used instead of the client-side (PHP)
implementation in the MediaWiki code itself, if it provides additional
performance benefits and stronger security. If there's no real difference, then
we can simply abandon the SQL-based collation (even if it's supported) and use
the PHP implementation everywhere.

This will simplify the design of the SQL-backend adaptation layer within the
MediaWiki code.

This will allow easier integration of other backends, such as Postgres
(suggested in Comment #150), or Oracle, Sybase, Informix, MSSQL, or other
backends available on Windows servers through ODBC, or others accessible
through JDBC via a PHP-to-Java adaptation layer... if the PHP code used in
MediaWiki can also run from within a Java-based JSP application server or
.Net-based ASP application server, or from within other newer VMs that are
already being developed to support both the Java and .Net environments
simultaneously. There are very exentive research to make all these VMs
compatible with each other (and to ease the migration with mutual
compatibility, and with transparent deployment facilities in heterogeneous
environments like computing grids and cloud computing).

Note that at some point in the future, even the WMF projects could benefit a
lot from such facilities, when their servers are virtualized to scale better
with reduced costs, because we all know that the WMF projects need more and
more money each year to support their ever-growing databases and audience, and
the increased need for better internationalization. It is highly strategic (and
should also be discussed on the WMF Strategy wiki) to reduce the technical
dependencies (also because there's now a demonstrated need for open projects to
work together, such as within the Open Alliance, which has lots of content to
share with the WMF, thanks to their compatible licensing schemes).




[Bug 164] Support collation by a certain locale (sorting order of characters)

2009-11-19 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164





--- Comment #131 from Andrew Dunbar hippytr...@gmail.com  2009-11-19 08:11:38 
UTC ---
 Well, there are people who know the architecture and the programming language
 and can presumably do this very quickly (at least the basic step)

Why do you presume it can be done quickly? It doesn't seem so to me, and I
regard this bug as hugely important.

 if they realize how much of a priority it is. The first step is just to
 apply a few extra functions to category sort keys with the intention of
 converting them into collation keys.

Which functions would those be? What makes you think category sort keys can
be converted to collation keys?

For instance, category sort keys are human-readable and human-editable, but
collation keys are typically binary and not human-readable. Category sort
keys are included directly in the text of a category link such as
[[Category:foo|bar]].

Due to their binary nature, collation keys cannot appear there. So you would
need to decide whether to remove all category sort keys or make category sort
keys interact with collation keys that would be added elsewhere.

For collation keys to replace category sort keys, you would need to establish
that category sort keys have no legitimate uses other than forcing alphabetic
order in cases where the current order results in nonalphabetic sequences.
I can assure you that people do use category sort keys for other purposes, and
some might be vociferously upset if these were removed without discussion.

For collation keys to interact with category sort keys, you would need to
generate and maintain, in the database, collation keys for each page title and
for each category sort key, since keys must be of the same nature to be
comparable and hence sortable.

Now, Unicode does specify a Unicode Collation Algorithm (UCA), which we could
and probably should use. It is language-agnostic but provides for tailoring
for individual languages.

The UCA definitely generates binary keys. Not printable. Not human readable.

UCA keys can be very long. I use them in an offline tool for the English
Wiktionary and initially set their maximum length to 1024, four times the
maximum length of a page title. We already had about 10 pages for which 1024
was too short, so I had to set it to 2048! Many people might not like all page
titles and category sort keys now requiring 9x their current amount of space
in the database.

UCA does allow for various types of sort key compression however. In which case
we would need to choose one to use since it will not be possible to mix and
match them.

PHP currently seems to have no implementation of UCA. We would need to create
it from scratch, or find a way to use one in C.
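To make the binary, multi-level nature of UCA keys concrete, here is a toy
sketch in Python. This is not the real algorithm: the two-level weight table
is invented for illustration, covers only a handful of letters, and bears no
relation to the actual DUCET.

```python
# Toy UCA-style sort key: each character maps to a (primary, secondary)
# weight pair; the key concatenates all primaries, a level separator,
# then all secondaries.  Weights below are invented for illustration.
WEIGHTS = {
    'a': (0x10, 1), 'A': (0x10, 2),  # case is only a secondary difference
    'b': (0x11, 1),
    'c': (0x12, 1),
}

def sort_key(s: str) -> bytes:
    primaries, secondaries = [], []
    for ch in s:  # toy table: raises KeyError for uncovered characters
        p, sec = WEIGHTS[ch]
        primaries.append(p)
        secondaries.append(sec)
    # 0x00 separates the primary level from the secondary level
    return bytes(primaries) + b'\x00' + bytes(secondaries)

# "a" < "A" < "b": case breaks ties only at the secondary level
print(sorted(['b', 'A', 'a'], key=sort_key))  # → ['a', 'A', 'b']
```

Note that even this two-level toy key is more than twice the length of the
source string; real UCA keys carry at least three levels (base letter,
accents, case), which is why they grow to several times the source length, as
noted above.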

For multilingual wikis such as Commons and all of the Wiktionaries, having
just one collation language will not work, since users of each language will
expect things to be in the correct order for their language.

For the Wiktionaries this means each category needs a way to declare which
language collation to use, and each page needs to declare which subset of the
possible language collation keys to generate for that page.

For Commons I'm not sure what the requirements would be, but they may differ
from those of the Wiktionaries.

These new fields will need support in the database schema. The ones requiring
multiple language collations will require more drastic database changes, quite
different from what we have now.

 Once that's in place, people can work on actually writing such functions for
 their particular languages. Later, when those functions are written, they can
 be used additionally to generate a proper alphabetically ordered table of
 pages for use in the contents listings. Or some similar workflow

UCA tailoring would make the particular language collations very easy, as long
as we have a decent implementation of UCA that works easily with tailoring.

 - but there needs to be a plan of action, and that can't be effected by just
 anyone, only by the devs who are in charge (no use pretending that everyone's
 equal - only certain devs actually have the power to make anything happen).

Not true. Anyone with commit access can add such code; myself, for instance.
My understanding is that no dev is technically in charge at the moment, since
Brion stepped down, though there certainly are a few, such as Tim, who are
acknowledged to have a greater understanding of the entire codebase and hence
greater trust; you definitely want those people to check such changes, and
would expect them to revert any premature commits.




[Bug 164] Support collation by a certain locale (sorting order of characters)

2009-11-19 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164





--- Comment #132 from Philip Tzou philip@gmail.com  2009-11-19 08:27:39 
UTC ---
I'm busy with my studies and can't help with this until I have passed my exam
next January. I'm sorry to make you guys unhappy, though the Chinese Wikipedia
also needs such functions. :)




[Bug 164] Support collation by a certain locale (sorting order of characters)

2009-11-19 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164





--- Comment #133 from Le Chat cat...@vp.pl  2009-11-19 12:53:22 UTC ---
We seem to be overcomplicating things now. Of course category sort keys and
binary collation keys are different beasts. We just need a couple of stub
functions (we, or rather individual projects, can worry about what they should
do later), F1 and F2, such that F1(T) means the collation key derived from
page title T, and F2(K) means the collation key derived from human-readable
category sort key K (the two functions will be similar, but potentially not
identical). Then apply them at the appropriate places in the code generating
the category listings. There will also need to be a function that retrieves
the first letter back from a binary key, as has been noted.

All these functions will be language- and project-dependent; no one is in a
position to write all of them, but once the basic (and trivial) framework is
in place, people who know the various languages will be able to work on
writing them. If a project thinks it needs the option of multiple sort
algorithms, that is pretty simple to implement as well, though starting with
just one algorithm per project seems perfectly satisfactory: all the main
public-facing projects are language-specific anyway, so even if the English
Wiktionary contains Czech words, we would still expect them to be sorted in
the English order.
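The stub framework described here can be sketched in a few lines (Python
stands in for PHP for brevity; the casefold-based bodies are placeholders that
individual projects would replace with language-specific logic, and the names
F1/F2 are taken from the comment above):

```python
def F1(title: str) -> bytes:
    """Collation key derived from a page title.  Placeholder body:
    a real project would substitute a language-specific UCA routine."""
    return title.casefold().encode('utf-8')

def F2(sortkey: str) -> bytes:
    """Collation key derived from a human-readable category sort key.
    Similar to F1 but potentially not identical."""
    return sortkey.casefold().encode('utf-8')

def category_order(entries):
    """entries: list of (title, explicit_sortkey_or_None) pairs, as in
    [[Category:foo|bar]]; the explicit sort key wins when present."""
    return sorted(entries,
                  key=lambda e: F2(e[1]) if e[1] is not None else F1(e[0]))

pages = [('Zebra', None), ('ache', None), ('The Car', 'Car')]
print([t for t, _ in category_order(pages)])  # → ['ache', 'The Car', 'Zebra']
```

Once such stubs are wired into the category-listing code, swapping in a real
per-language implementation of F1/F2 requires no further structural changes.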




[Bug 164] Support collation by a certain locale (sorting order of characters)

2009-11-19 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164





--- Comment #134 from Philippe Verdy verd...@wanadoo.fr  2009-11-19 13:51:53 
UTC ---
Collation keys can coexist with custom sort keys.
Custom sort keys are useful and will continue to be useful, to tweak the
default collation (which for now is simply a binary ordering of codepoints).

Today, basically, the categories are sorted by [sort key, full page name]
(possibly truncated to a reasonable size, with a unique identifier appended).

What we want is to be able to sort by [collationkey(sort key + full page
name), sort key, full page name] (also with possible truncation of the whole).

This will preserve the existing tweaks made in pages when they reference
categories, and will help sort all the rest.
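That ordering, collation key first with the raw sort key and page name as
tie-breakers, can be sketched as follows; the `collation_key` body is a
stand-in, since a real implementation would call ICU or an equivalent UCA
library:

```python
def collation_key(text: str) -> bytes:
    # Stand-in: a real implementation would return a UCA/ICU sort key.
    # casefold() at least makes the default order case-insensitive.
    return text.casefold().encode('utf-8')

def category_sort_tuple(sortkey: str, pagename: str):
    # Sort by [collation_key(sort key + full page name), sort key, page name],
    # preserving existing per-page sort-key tweaks as tie-breakers.
    return (collation_key(sortkey + pagename), sortkey, pagename)

rows = [('', 'Ödön'), ('', 'alpha'), ('Beta', 'The Beta Page')]
rows.sort(key=lambda r: category_sort_tuple(*r))
print([page for _, page in rows])
```

With the byte-wise stand-in, 'Ödön' still sorts after the ASCII titles; a
real collation key would place it according to the locale's tailoring.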

Ideally, the collation keys should be generated on the fly by the SQL engine
itself (because that would allow alternate sort orders, according to user
locale preferences or to web query parameters set by GUI buttons, especially
for Chinese where several sort orders are common: sort by Pinyin, sort by
radical/strokes, sort by traditional dictionary orders), as part of its
supported ORDER BY clause for getting the list of article names to display in
categories.

But if the SQL engine does not have such support, this must be implemented in
the PHP code, and collation keys can be stored in a new data column (the extra
column can be added or filled conditionally: if the SQL engine supports the
needed collations, this column can remain NULL to save storage space).

If the SQL engine does not support dynamic collations, then the alternate
(user locale-based) sort orders will not be easy to implement, because of the
cost they would incur on the SQL client side (in PHP) for heavily populated
categories, where support for true locale-based collation orders is most
wanted, unless the database can store multiple collation keys (for distinct
specific locales). Supporting the storage of multiple collation keys for
different locales can severely impact server performance, as it would require
an extra join to a separate 1:N table storing the collation keys indexed by
(pageid, locale), instead of storing these keys in the same SQL table used for
the category index.

Additionally, the stored collation keys will sometimes need to be updated
(when the CLDR data for locale-tailored collations is updated, or when a new
Unicode version adds characters). Updating a large volume of stored collation
keys requires a lot of work, and this can impact the availability of the wiki
project, unless the data model includes a versioning system that allows at
least two versions of the same locale's keys to coexist for some time: switch
from one version to the next, then clean up the old collation keys once
everything has been updated to the new tailoring.




[Bug 164] Support collation by a certain locale (sorting order of characters)

2009-11-19 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164





--- Comment #135 from Philippe Verdy verd...@wanadoo.fr  2009-11-19 14:03:26 
UTC ---
Finally, the proposed patch for Chinese is not sufficient: it supports only a
single collation, and so is not suitable. Supporting other locales would
require a lot of work.

I really suggest that such a patch either use the SQL-supported collations,
where they exist for well-known locales, or compute the collation keys using a
strong, well-supported design: most probably ICU for C (already available as
an extension module for PHP and Perl) or ICU for Java (which can be integrated
into some SQL engine backends). ICU natively includes support for CLDR locales
(and their widely discussed, tuned standard tailorings), and also allows easy
integration of MediaWiki-supplied tailorings (using the LDML syntax) for the
locales not yet covered by CLDR but needed for the many languages that
MediaWiki sites support.




[Bug 164] Support collation by a certain locale (sorting order of characters)

2009-11-19 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164





--- Comment #136 from Le Chat cat...@vp.pl  2009-11-19 14:13:17 UTC ---
I'm assuming the SQL engine doesn't support this (if it did, it would have
been turned on years ago, right?). But I don't understand the emphasis on
multiple locales: the main need is just for ''one'' sort algorithm per
database (any more would be a luxury). The human-readable-to-binary key
algorithm just needs to be applied once, when a page is added to a category;
I don't see any major performance issues resulting. (Of course there would be
inconsistencies for a time after the algorithms are rewritten, but we wouldn't
expect tweaks to those algorithms to have a major impact on the sort order.)




[Bug 164] Support collation by a certain locale (sorting order of characters)

2009-11-19 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164





--- Comment #137 from Roan Kattouw roan.katt...@gmail.com  2009-11-19 
14:17:38 UTC ---
(In reply to comment #134)
 Tdoday, basically, the categories are sorted by [sort key, full page name].
 (possibly truncated to a reasonable size using unique identifier).
 
This is not true. We're sorting by (sort key, pageID) where the sortkey is
whatever the page sets as sortkey (or the page name if not set), and pageID is
a unique number identifying the page (roughly proportional to creation time).

 Ideally, the collation keys should be generated on the fly by the SQL engine
 itself (because it would allow alternate sort orders, according to user
 locale preferences or according to web query parameters set by GUI buttons,
 especially for Chinese where several sort orders are common: sort by Pinyin,
 sort by radical/strokes, sort by traditional dictionary orders), as part of
 its supported ORDER BY clause for getting the list of article names to
 display in categories.
 
SQL engines probably support this, but using it would result in an unacceptably
inefficient query. Only queries sorting by the actual values of fields are
efficient, and only if there is an index on those fields.

 But if the SQL engine does not have such support, this must be implemented in
 the PHP code and collation keys can be stored in a new datacolumn (the extra
 data column can be added or filled conditionnally : if the SQL engine supports
 the needed collations, this column can remain NULL to save storage space).
 
If you sort this stuff in PHP, you need to grab the entire list before you can
reliably sort it. Doing that for [[Category:Living people]] has no chance of
staying within the memory limit.

 If the SQL engine does not support dynamic collations, then the alternate
 (user locale-based) sort orders will not be easy to implement because of the
 cost that they would require on the SQL client side (in PHP) for heavily
 populated categories, where the support for true locale-based collation
 orders is the most wanted, unless the database can store multiple collation
 keys (for distinct specific locales): supporting the storage of multiple
 collation keys for different locales can severely impact the server
 performance, as it would require an extra join to a separate 1:N table to
 store the collation keys indexed by (pageid, locale), instead of storing
 these keys in the same SQL table used for storing the category index.
 
For supporting multiple collations, a separate table sounds a lot more sane in
terms of read performance, as it would suffer less from the problems mentioned
above.

 Additionally, the stored collation keys will sometimes need to be updated
 (when the CLDR data for locale-tailored collations is updated, or when a new
 Unicode version adds characters): updating a large volume of stored
 collation keys will require a lot of work, and this can impact the
 availability of the wiki project, unless the data model includes a
 versioning system that allows at least two versions for the same locale to
 coexist for some time, and then allows switching from one version to the
 next before cleaning up the old collation keys.
 
Expensive writes are better than expensive reads.




[Bug 164] Support collation by a certain locale (sorting order of characters)

2009-11-19 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164





--- Comment #138 from Le Chat cat...@vp.pl  2009-11-19 15:27:02 UTC ---
 If you sort this stuff in PHP, you need to grab the entire list before you
 can reliably sort it. Doing that for [[Category:Living people]] has no
 chance of staying within the memory limit.

Not if you compute the collation key when you add a page to a category, and
index it based on that key. (Just as you currently index it based on the
human-readable sort key.)




[Bug 164] Support collation by a certain locale (sorting order of characters)

2009-11-19 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=164





--- Comment #139 from Philippe Verdy verd...@wanadoo.fr  2009-11-19 15:51:05 
UTC ---
#88, #100, #101: Does UCA explicitly specify the binary representation of the
generated key?

Yes and no:
- Yes: there is an interface to retrieve the computed sort key.
- But no: it is not stable across versions of Unicode (which can add
characters to the DUCET), of ICU (which can change the representation), or of
CLDR (which can also modify the locale-dependent tailorings).

In other words: precomputed (stored) collations will need to be versioned too,
and any upgrade will require recomputing the collation keys. That is a real
problem on large wikis if this is implemented in the native SQL engine,
because the database index will have to be FULLY rescanned and updated, which
may mean considerable downtime.

This will be much less problematic if the collation keys are not part of the
SQL engine itself but are stored in a separate table that can hold multiple
keys for distinct collations, because that table can also store successive
versions. In that case, recomputing the collation keys can be deferred: once a
category is finished, the collation version can be updated in the category,
the collation keys for the previous version deleted, and the next category
handled. This implies NO downtime on the server, as new collations can be
added on the fly (and separately for each category).

Consequence: add to the MediaWiki data model a table of supported collations
(those that can be specified in the site-wide default) with their supported
versions. Attach a unique id to each versioned collation, and create a
separate category-collation table containing the collation keys.

Conceptually (some details omitted) :

CREATE TABLE collations (
  coll_id INTEGER NOT NULL,           -- primary key for this table
  coll_name VARCHAR(32) NOT NULL,     -- e.g. root, fr, en-US, en-GB,
                                      -- zh-Latn, zh-Bopo,
                                      -- zh-Hans.radical-strokes
  coll_ispreferred SMALLINT NOT NULL, -- non-zero for the preferred version
  coll_version VARCHAR(32) NOT NULL,  -- unique description of the version
  PRIMARY KEY (coll_id),
  UNIQUE KEY (coll_name, coll_version),  -- strong constraint
  INDEX (coll_name, coll_ispreferred)    -- optional, for fast retrieval of
                                         -- the preferred version
);

CREATE TABLE categorysort (
  cl_id INTEGER NOT NULL,   -- in fact the primary key of the categorylinks
                            -- table
  coll_id INTEGER NOT NULL, -- primary key of the collations table
  cs_key VARBINARY(255),    -- binary value, computed from ICU's
                            -- Collator::getSortKey()
  PRIMARY KEY (cl_id, coll_id)
);

Then no need to change the categorylinks table, which will continue to store
the full pagename, and the custom sortkey.
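This pair of tables can be exercised end to end with SQLite and a stand-in
key function (real keys would come from ICU; the schema below is a simplified
sketch of the one above, with a minimal categorylinks table added for the
join):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript('''
    CREATE TABLE collations (
        coll_id      INTEGER PRIMARY KEY,
        coll_name    TEXT NOT NULL,
        coll_version TEXT NOT NULL
    );
    CREATE TABLE categorylinks (
        cl_id      INTEGER PRIMARY KEY,
        cl_page    TEXT NOT NULL,
        cl_sortkey TEXT NOT NULL
    );
    CREATE TABLE categorysort (
        cl_id   INTEGER NOT NULL,
        coll_id INTEGER NOT NULL,
        cs_key  BLOB NOT NULL,
        PRIMARY KEY (cl_id, coll_id)
    );
''')

def collation_key(text):
    # Stand-in for ICU Collator::getSortKey(); real keys are opaque binary.
    return text.casefold().encode('utf-8')

conn.execute("INSERT INTO collations VALUES (1, 'root', 'v1')")
for cl_id, page in enumerate(['Zebra', 'apple', 'Ölberg'], start=1):
    conn.execute('INSERT INTO categorylinks VALUES (?, ?, ?)',
                 (cl_id, page, page))
    conn.execute('INSERT INTO categorysort VALUES (?, 1, ?)',
                 (cl_id, collation_key(page)))

# The category listing orders by the precomputed binary key, with the
# human-readable sortkey as tie-breaker, via one extra join.
rows = conn.execute('''
    SELECT cl.cl_page
    FROM categorylinks cl
    JOIN categorysort cs ON cs.cl_id = cl.cl_id AND cs.coll_id = 1
    ORDER BY cs.cs_key, cl.cl_sortkey
''').fetchall()
print([r[0] for r in rows])
```

Because the keys are precomputed at write time and indexed, the ORDER BY
walks an index rather than sorting in the client, which is the whole point of
storing them.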



To support the firstChar() conceptual API, each entry in the collations table
above would also need another table containing, for each primary weight value,
the lowest string (in collation order) that uses that weight at the primary
level:

CREATE TABLE collations_headings (
  coll_id INTEGER NOT NULL,        -- primary key of the collations table
  ch_weight INTEGER NOT NULL,      -- primary collation weight value
  ch_cluster VARCHAR(32) NOT NULL, -- single default grapheme cluster from
                                   -- the first string starting with this
                                   -- primary weight
  PRIMARY KEY (coll_id, ch_cluster)
);

One problem is that ICU does not currently contain such a list of default
grapheme clusters for every primary weight value in each collation. Is there a
way to generate it anyway? Note that some primary weights may only occur in
multi-character strings.

E.g. ch is a default grapheme cluster in the Breton collation
(LC_COLLATE=br). It makes no sense in Breton to mix c and ch together
under the same heading, given that they sort separately. The same thing occurs
in many languages (e.g. Spanish, Swedish...).

However, this probably does not occur in the DUCET (I did not verify this
assertion), so maybe we can avoid storing all possible Unicode characters to
map them to their default primary heading; instead, we just have to store the
graphemes specified in the exemplar set for the language (or other graphemes
explicitly present in the locale-specific tailoring rules).

The other graphemes usable as a first char can then be generated automatically
from the first Unicode character of the MediaWiki custom cl_sortkey (which
defaults to the full page name, including the namespace name localized to the
default locale of the wiki), according to the DUCET, which is shared as the
base of all locales.
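The firstChar() idea, mapping a title to the heading grapheme that shares its
primary weight, can be sketched with an invented weight table; every name and
weight below is hypothetical, but note how a Breton-style ch contraction
gets its own heading, distinct from plain c:

```python
# Invented primary weights; real ones would come from the locale tailoring.
PRIMARY = {'a': 0x10, 'b': 0x11, 'c': 0x12, 'ch': 0x128, 'd': 0x13}
# In-memory analogue of the collations_headings table: the default
# grapheme cluster for each primary weight value.
HEADINGS = {weight: grapheme for grapheme, weight in PRIMARY.items()}

def first_heading(title: str) -> str:
    t = title.casefold()
    # Longest match first, so the 'ch' contraction wins over plain 'c'.
    for length in (2, 1):
        g = t[:length]
        if g in PRIMARY:
            return HEADINGS[PRIMARY[g]]
    return t[:1]  # fall back to the first character (DUCET-style default)

print(first_heading('Chouette'))  # → 'ch'
print(first_heading('Cidre'))     # → 'c'
```

The table lookup plays the role of the ch_weight/ch_cluster columns: the
category page would group entries under whatever cluster their title's first
primary weight maps to.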




  1   2   >