On Fri, May 15, 2009 at 4:22 AM, Tisza Gergő <[email protected]> wrote:
> Would it be very expensive to have a separate (namespace, title, sortkey)
> table, and join on that for queries that need sorting?

You would have to scan the *entire* table you're joining from (which
may be hundreds of millions of rows).  Not a possibility.

On Fri, May 15, 2009 at 5:47 AM, Tisza Gergő <[email protected]> wrote:
> Coding the first or second type of collation rule seems relatively simple, and
> already a huge gain. (Also, RFC 3454 might be worth checking out as it has
> language-independent rules for more than diacritics.)

I agree.
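
For illustration only, here is a minimal sketch of what a simple,
language-independent rule could look like.  I'm assuming the "first type"
of rule means roughly "ignore case and diacritics"; the helper name
simple_sortkey is made up for this example and is not anything that exists
in MediaWiki.  It only uses Python's standard unicodedata module.

```python
import unicodedata


def simple_sortkey(title: str) -> str:
    """Hypothetical helper: build a sort key that ignores case and diacritics."""
    # NFD splits precomposed letters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", title)
    # Drop the combining marks (characters with a non-zero combining class).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a more aggressive lower() intended for caseless matching.
    return stripped.casefold()


# "\u00c9lan" is "Élan" and "\u00c1bel" is "Ábel", written as escapes so the
# example does not depend on how this message gets encoded.
titles = ["Zebra", "\u00c9lan", "eclair", "\u00c1bel"]
sorted_titles = sorted(titles, key=simple_sortkey)
```

With a plain code-point sort the accented titles would land after "Zebra";
with the key above they sort next to their unaccented neighbours.  Letters
that have no decomposition (ø, ð and so on) are left alone, so this is
nowhere near a full collation.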

> You can have a separate raw_sortkey column if that's a large concern.

That would still mean an UPDATE of many millions of rows.  Plus you'd
add another column to a table that's already very large --
categorylinks has ~40,000,000 rows on enwiki, so that's an extra 40
million varchar(255) values clogging up the buffer pool even though
they're never going to be used except for the occasional update.

> Anyway,
> this is the same for any solution that does not rely on MySQL collation: when
> the rules change, you need to update the relevant column in the database.

Correct.  In fact, when MySQL's rules change you also have to rebuild
the index, AFAIK.

> What are the chances that we get decent MySQL collation in the close future
> (say, next few years)?

If we don't upgrade, I'd say about 0%.  :)  Even if we do, there are
still the uniqueness problems, and the non-BMP problem.  So not very
good, I'd say, for our purposes.  (That's not to say MySQL collation
isn't decent for other purposes.)
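
To make the non-BMP point concrete, here is a small sketch.  I'm assuming
the problem meant here is that MySQL's utf8 character set stores at most
three bytes per character, which covers only the Basic Multilingual Plane.
The function names are made up for the example.

```python
def is_non_bmp(ch: str) -> bool:
    """True if the character lies outside the Basic Multilingual Plane."""
    return ord(ch) > 0xFFFF


def utf8_length(ch: str) -> int:
    """Number of bytes the character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


# U+4E2D is a common CJK ideograph (inside the BMP);
# U+10348 is GOTHIC LETTER HWAIR (outside the BMP).
bmp_char = "\u4e2d"
non_bmp_char = "\U00010348"

for ch in (bmp_char, non_bmp_char):
    print(f"U+{ord(ch):04X}: non-BMP={is_non_bmp(ch)}, UTF-8 bytes={utf8_length(ch)}")
```

Anything that needs four bytes in UTF-8 does not fit in a three-byte
character set, so titles containing such characters cannot be stored or
collated correctly there.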

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
