On 23/07/10 02:34, Aryeh Gregor wrote:
> On Thu, Jul 22, 2010 at 3:01 AM, Tim Starling <[email protected]> wrote:
>> This restriction is enforced by Title::isValidMoveOperation().
> 
> Any objections to changing this so files can't be moved over non-files
> or vice versa?

No objection. That's mostly how it is already, except when the file
doesn't exist but the description page does.

>> Since we won't be sorting on the plain text form anymore, we could use
>> some tricks to save space. For instance, if the sort key is the same
>> as the article title, we could store NULL instead of another copy of
>> the article title. That should save 95% or so.
> 
> It doesn't seem like it would save nearly that much.  On the Welsh
> Wikipedia (small enough database to be manageable), I get the
> following:

Welsh is not really what I had in mind when I said to get statistics. On
the English Wikipedia (db38):

mysql> show table status like 'categorylinks'\G
*************************** 1. row ***************************
           Name: categorylinks
         Engine: InnoDB
        Version: 10
     Row_format: Compact
           Rows: 38875439
 Avg_row_length: 161
    Data_length: 6271123456
Max_data_length: 0
   Index_length: 7946960896
      Data_free: 7340032
 Auto_increment: NULL
    Create_time: 2010-05-24 11:29:52
    Update_time: NULL
     Check_time: NULL
      Collation: binary
       Checksum: NULL
 Create_options:
        Comment:
1 row in set (0.15 sec)

SELECT
   count(*),
   sum(length(cl_sortkey)) as raw_length,
   sum( if(REPLACE(cl_sortkey, ' ', '_') = page_title,
      0, length(cl_sortkey) ) ) as compact_length
FROM categorylinks,page
WHERE
   cl_from=page_id and
   page_namespace=0 and
   page_id % 10 = 0\G

*************************** 1. row ***************************
      count(*): 1957629
    raw_length: 34177525
compact_length: 14857665
1 row in set (19 min 26.05 sec)
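
For anyone who wants the compact_length rule spelled out, here is a
minimal sketch of the store-NULL-when-redundant idea in Python. The
function names are made up for illustration and are not MediaWiki
code; the comparison mirrors the REPLACE() in the query above.

```python
from typing import Optional


def compact_sortkey(page_title: str, sortkey: str) -> Optional[str]:
    """Return None when the sort key only repeats the title, else the key.

    Titles are stored with underscores and sort keys with spaces, so the
    key is normalised the same way as REPLACE(cl_sortkey, ' ', '_').
    """
    if sortkey.replace(" ", "_") == page_title:
        return None
    return sortkey


def expand_sortkey(page_title: str, stored: Optional[str]) -> str:
    """Recover a usable sort key from the stored (possibly NULL) value.

    Assumption: a key that contained a literal underscore and still
    matched the title comes back with a space instead, which this
    sketch treats as equivalent.
    """
    if stored is None:
        return page_title.replace("_", " ")
    return stored
```

Only keys that differ from the title, such as "Einstein, Albert" for
the page Albert_Einstein, would take up space in this scheme.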

So we're looking at 17 bytes per row for raw text, and 8 bytes per row
for compacted text, plus 1 byte per row for the length byte. Overall,
assuming the lengths are the same across all namespaces, it would be
approximately 680 MB in the raw form for the English Wikipedia, and
presumably several times that for all wikis. Our English Wikipedia
core DB servers have between 700 GB and 2 TB of storage space, with
~450 GB currently in use. So the impact of adding an extra 1 GB or so
would be minimal.
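
The arithmetic behind those figures, as a quick Python check. The
constants are copied from the output above; the only assumption is the
one already stated, that the sampled main-namespace averages hold for
every row in the table. The length byte is left out, as in the 680 MB
figure.

```python
# Figures from the sampled query (page_namespace=0, page_id % 10 = 0).
SAMPLE_ROWS = 1_957_629
RAW_LENGTH = 34_177_525
COMPACT_LENGTH = 14_857_665

# Approximate row count from SHOW TABLE STATUS for categorylinks.
TOTAL_ROWS = 38_875_439

raw_per_row = RAW_LENGTH / SAMPLE_ROWS          # rounds to 17 bytes
compact_per_row = COMPACT_LENGTH / SAMPLE_ROWS  # rounds to 8 bytes


def estimate_mb(bytes_per_row: float, rows: int = TOTAL_ROWS) -> float:
    """Scale a per-row average up to the whole table, in megabytes (10^6 bytes)."""
    return bytes_per_row * rows / 1e6


raw_mb = estimate_mb(raw_per_row)
compact_mb = estimate_mb(compact_per_row)

if __name__ == "__main__":
    print(f"raw: {raw_per_row:.1f} bytes/row, {raw_mb:.0f} MB total")
    print(f"compact: {compact_per_row:.1f} bytes/row, {compact_mb:.0f} MB total")
```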

No doubt Domas will complain anyway, but without developers adding new
features, I figure his volunteer DBA work would get very boring.

> It's still not at all clear to me that saving a raw copy in the
> database is worth it.  If we really need sectioning by first letter on
> category pages, we could save the first letter instead, and leave that
> NULL when it's the same as the first letter of the page title (all of
> this for some locale-specific definition of "first letter").  But I
> don't know if we need that.

Truncating after the first letter would only save about 260MB for the
entire English Wikipedia. And it would limit the applications. For
instance, it would prevent fast updates of the collation algorithm.
Instead we would have to reparse the pages. That could take weeks,
even with a dozen servers dedicated to the task.
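
To illustrate what a fast update means here, a rough Python sketch:
with the raw sort key kept in the table, switching collation is a pass
over the stored strings and no wikitext is parsed. The two collation
functions are stand-ins invented for the example, not real MediaWiki
or CLDR collations.

```python
from typing import Callable, Dict, List

# page id -> raw sort key exactly as the editor wrote it
RawKeys = Dict[int, str]
Collation = Callable[[str], bytes]


def collate_v1(raw: str) -> bytes:
    """Hypothetical old collation: plain binary order of the UTF-8 bytes."""
    return raw.encode("utf-8")


def collate_v2(raw: str) -> bytes:
    """Hypothetical new collation: case-insensitive order."""
    return raw.casefold().encode("utf-8")


def rebuild_sort_index(raw_keys: RawKeys, collate: Collation) -> Dict[int, bytes]:
    """Derive binary sort keys from the stored raw keys, without reparsing pages."""
    return {page_id: collate(raw) for page_id, raw in raw_keys.items()}


def sorted_pages(index: Dict[int, bytes]) -> List[int]:
    """Page ids in the order a category listing would show them."""
    return sorted(index, key=lambda page_id: index[page_id])
```

If only the first letter were stored, rebuild_sort_index() would have
nothing to work from, and the raw keys would have to be recovered by
reparsing every page.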

> This whole problem arises for sortkey changes generally.  It will be
> just as much of a problem when going to a new sortkey type (based on
> CLDR or whatever).  The only way to avoid it is to create a new
> column, populate it while maintaining both columns at once, start
> using the new column once it's fully populated, and then drop the old
> column.  That seems excessive.  

If we're going to have multiple locale-specific collation algorithms
(and that seems likely), then it may make sense to add a collation ID
foreign key to the categorylinks table, to track updates. Sensible
sorting behaviour mid-way through an update is probably not feasible,
but we can at least make it possible to track the problem.
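
A small sketch of how such tracking could work, using an in-memory
SQLite table from Python so that it runs anywhere. The cl_collation
column and the collation IDs are hypothetical; nothing like this
exists in the schema yet.

```python
import sqlite3

CURRENT_COLLATION = 2  # hypothetical ID of the collation being rolled out

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE categorylinks (
        cl_from      INTEGER NOT NULL,
        cl_to        TEXT    NOT NULL,
        cl_sortkey   BLOB    NOT NULL,
        cl_collation INTEGER NOT NULL  -- hypothetical: which collation built cl_sortkey
    )
    """
)
conn.executemany(
    "INSERT INTO categorylinks VALUES (?, ?, ?, ?)",
    [
        (10, "Physicists", b"einstein, albert", 2),  # already converted
        (20, "Physicists", b"Bohr, Niels", 1),       # still on the old collation
        (30, "Chemists", b"curie, marie", 2),
    ],
)


def stale_rows(db: sqlite3.Connection, current: int):
    """List (cl_from, cl_to) pairs whose sort key predates the current collation."""
    cur = db.execute(
        "SELECT cl_from, cl_to FROM categorylinks "
        "WHERE cl_collation <> ? ORDER BY cl_from",
        (current,),
    )
    return cur.fetchall()


def categories_in_transition(db: sqlite3.Connection, current: int):
    """Categories that currently mix old and new sort keys."""
    cur = db.execute(
        "SELECT cl_to FROM categorylinks GROUP BY cl_to "
        "HAVING MIN(cl_collation) <> ? OR MAX(cl_collation) <> ? "
        "ORDER BY cl_to",
        (current, current),
    )
    return [row[0] for row in cur.fetchall()]
```

An update script could work through stale_rows() in batches, and a
category page could use the second query to tell whether its sort
order is temporarily unreliable.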

> On Thu, Jul 22, 2010 at 5:34 AM, David Gerard <[email protected]> wrote:
>> Please don't remove the feature where the first letter of the sort key
>> is displayed in the rendered category page, and if necessary add what
>> it takes to keep it.
>>
>> There are scripts where this will be a hard problem, but it's still
>> much-used and much-loved in those where it isn't.
> 
> Is it?  What use does it serve?  We don't have it for any other type
> of list.  We have zillions of types of page lists, and category pages
> are the only ones that have the first letter displayed.  It makes the
> columns uneven, and is completely crazy for some scripts (like CJK,
> AFAICT).

We have zillions of lists, but category pages are by far the most
visible and heavily used. That's why so much work has been done on
making them look nice, and why so many people are complaining about
category sorting instead of [[Special:DeadendPages]] sorting.

The CJK issue could be fixed by making the feature optional. The
uneven column issue is fixable using the CSS 3 multi-column layout
feature, which is available in more recent versions of the major
browsers.

-- Tim Starling


_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
