Hi,

(I don’t post here often and I’m not a MW developer, but I try to
follow along; correct me if I’m wrong.)

I see a couple of things about page titles<ref> that must be handled
carefully and deliberately. Currently there is a distinction between
page_id and page title: the page_id is preserved when the title of the
page changes (during a move), so there is currently no canonical page
title associated with a revision, only a page_id. In other words, I
think it is theoretically impossible to retrieve the original page
title of a given past revision (this could be discussed in another
thread), and I also have some doubts about retrieving the original
page_id of a revision in very rare cases (with a succession of
deletions, undeletions and moves of some revisions), but I’m not sure
of that.
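
To make this concrete, here is a small Python sketch against the public
API (standard action=query parameters): whatever revision id you ask
for, the title you get back is the page’s *current* title, because the
revision row only stores a page_id.

import requests

API = "https://en.wikipedia.org/w/api.php"

def title_for_revision(rev_id):
    """Return the title the API reports for a revision: always the
    page's current title, even for revisions saved before a move."""
    r = requests.get(API, params={
        "action": "query", "format": "json",
        "revids": rev_id, "prop": "revisions",
    }).json()
    page = next(iter(r["query"]["pages"].values()))
    return page["title"]  # current title; the one at save time is lost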

So introducing a page_title into the revisions (your §1) adds
interesting new information if you define it as the title as of the
date the revision was saved; in that case page_id->title and page_title
can differ, and the same goes for the namespace. But this information
is not currently available in the database. That raises the question of
what to write for existing revisions in the dumps: use the current page
title associated with the current page_id? If you put the current
page_title associated with the current page_id into each revision, the
page_title will change across dumps every time a move is done. I don’t
find that semantically correct, but at the very least it should be
clearly documented. This matches the current behaviour, but today the
page_title sits outside of the revision, so you implicitly accept that
behaviour, and there it is semantically correct.
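
To illustrate the choice, a rough Python sketch (the tag names are from
your §1; the rev dict and its fields are invented for the example):

from xml.sax.saxutils import escape

def revision_element(rev, semantics="as-of-save"):
    # Which title goes into <page_title> is exactly the decision the
    # schema documentation has to spell out.
    if semantics == "as-of-save":
        # stable across dumps, but not available in today's database
        title = rev["title_at_save_time"]
    else:
        # derivable from the current page table, but it changes in the
        # next dump every time the page is moved
        title = rev["current_title"]
    return (f"<revision><page_id>{rev['page_id']}</page_id>"
            f"<page_title>{escape(title)}</page_title>"
            "...</revision>")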

In your §2 there is a similar issue for the redirect: currently a
redirect points to a title, not a page_id (if you move the target page,
the redirect ends up pointing at the new page created at the old title,
not at the page it originally targeted).
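
And if the dump stored a target page_id for redirects instead, it would
have to be resolved from the target title at dump time, along these
lines (standard API parameters; the stored id would then go stale on
the next move of the target):

import requests

API = "https://en.wikipedia.org/w/api.php"

def redirect_target_page_id(redirect_title):
    """Resolve a redirect's target title to a page_id, as of right now."""
    r = requests.get(API, params={
        "action": "query", "format": "json",
        "titles": redirect_title, "redirects": 1,
    }).json()
    page = next(iter(r["query"]["pages"].values()))
    return page.get("pageid")  # None if the target does not exist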

<ref>: Two years ago I tried to work on an extension to restore an old
revision, ideally pixel for pixel, but I think it’s not (currently)
possible, mainly because of this page title problem. There are other
problems, but this is the main one. Others include retrieving the old
versions of the templates (related to the title problem), the colour of
links and categories, the version of an image, external resources like
site CSS/JS, the status of deleted revisions (displayed or not), and
finer things like user preferences and rights, and ultimately
differences due to changes of MW configuration or MW version, etc. (I
don’t consider a change of version of the user’s browser :) I didn’t
publish it at the time (Sumana was not here to tell me to publish it ;)
but I found it again on my computer, and I’ll try to publish it and
explain it on mw.org.
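
For what it’s worth, one of those sub-problems (old versions of the
templates) is at least approachable through the API: the sketch below
fetches the last revision of a page before a given timestamp (standard
prop=revisions parameters), though the title problem described above of
course applies to the template itself too.

import requests

API = "https://en.wikipedia.org/w/api.php"

def template_text_as_of(template_title, timestamp):
    """Fetch the latest revision of a page saved before `timestamp`
    (e.g. "2011-08-18T00:00:00Z")."""
    r = requests.get(API, params={
        "action": "query", "format": "json",
        "titles": template_title, "prop": "revisions",
        "rvprop": "content", "rvlimit": 1,
        "rvstart": timestamp, "rvdir": "older",
    }).json()
    page = next(iter(r["query"]["pages"].values()))
    return page["revisions"][0]["*"]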

Sébastien

Thu, 18 Aug 2011 13:30:18 -0400, Diederik van Liere <dvanli...@gmail.com>  
wrote:
> Hi!
>
> Over the last year, I have been using the Wikipedia XML dumps
> extensively. I used it to conduct the Editor Trends Study [0], and the
> Summer Research Fellows [1] and I have used it over the last three
> months during the Summer of Research. I am proposing some changes to
> the current XML schema based on those experiences.
>
> The current XML schema presents a number of challenges, both for the
> people who create the dump files and for the people who consume them.
> Challenges include:
>
> 1) The nested structure of the schema, where a single <page> tag
> contains multiple <revision> tags, makes it very hard to develop an
> incremental dump utility.
> 2) A lot of post-processing is required.
> 3) By storing the entire text for each revision, the dump files are
> getting so large that they become unmanageable for most people.
>
>
> 1. Denormalization of the schema
> Instead of having a <page> tag with multiple <revision> tags, I
> propose to just have <revision> tags. Each <revision> tag would
> include a <page_id>, <page_title>, <page_namespace> and
> <page_redirect> tag. This denormalization would make it much easier to
> build an incremental dump utility. You only need to keep track of the
> final revision of each article at the moment of dump creation, and
> then you can create a new incremental dump continuing from the last
> dump. It would also be easier to restore a dump process that crashed.
> Finally, tools like Hadoop would have a much easier time handling this
> XML schema than the current one.
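
For illustration, the incremental logic this flat structure would
enable could look roughly like the following (the db accessor and the
state handling are invented; revision_element is the sketch from
earlier in this mail):

def incremental_dump(db, out, state_file="last_rev_id.txt"):
    # With a flat stream of self-contained <revision> elements, an
    # incremental dump is just "emit everything newer than the last
    # dumped rev_id".
    try:
        last_rev_id = int(open(state_file).read())
    except FileNotFoundError:
        last_rev_id = 0  # first run: full dump
    for rev in db.revisions_after(last_rev_id):  # hypothetical accessor
        out.write(revision_element(rev))
        last_rev_id = max(last_rev_id, rev["rev_id"])
    open(state_file, "w").write(str(last_rev_id))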
>
>
> 2. Post-processing of data
> Currently, a significant amount of time is required for
> post-processing the data. Some examples include:
> * The title includes the namespace, so excluding pages from a
> particular namespace requires deriving a separate namespace
> variable. In particular, focusing on the main namespace is tricky
> because that can only be done by checking that a page does not
> belong to any other namespace (see bug
> https://bugzilla.wikimedia.org/show_bug.cgi?id=27775).
> * The <redirect> tag is currently either True or False; more useful
> would be the article_id of the page to which a page is redirected.
> * Revisions within a <page> are sorted by revision_id, but they should
> be sorted by timestamp. The current ordering makes it even harder to
> generate diffs between two revisions (see bug
> https://bugzilla.wikimedia.org/show_bug.cgi?id=27112)
> * Some useful variables in the MySQL database are not yet exposed in
> the XML files. Examples include:
>       - Length of revision (part of MediaWiki 1.17)
>       - Namespace of article
>
>
> 3. Smaller dump sizes
> The dump files continue to grow as the text of each revision is stored
> in the XML file. Currently, the uncompressed XML dump files of the
> English Wikipedia are about 5.5 TB in size, and this will only continue
> to grow. An alternative would be to replace the <text> tag with
> <text_added> and <text_removed> tags. A page can still be
> reconstructed by patching multiple <text_added> and <text_removed>
> tags. We can provide a simple script / tool that would reconstruct the
> full text of an article up to a particular date / revision id. This
> has two advantages:
> 1) The dump files will be significantly smaller
> 2) It will be easier and faster to analyze the types of edits: who is
> adding a template, who is wikifying an edit, who is fixing spelling
> and grammar mistakes.
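
To make the reconstruction idea concrete, here is a toy version in
Python, assuming each revision's changes were encoded as
(offset, removed_text, added_text) hunks with offsets relative to the
text before that revision; the proposal does not fix an encoding, so
this is purely illustrative:

def reconstruct(hunks_per_revision):
    """Replay per-revision hunks in timestamp order to rebuild the
    full text of the last revision."""
    text = ""
    for hunks in hunks_per_revision:
        # apply hunks back-to-front so earlier offsets stay valid
        for offset, removed, added in sorted(hunks, reverse=True):
            assert text[offset:offset + len(removed)] == removed
            text = text[:offset] + added + text[offset + len(removed):]
    return text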
>
>
> 4. Downsides
> This suggestion is obviously not backwards compatible and it might
> break some tools out there. I think that the upsides (incremental
> backups, Hadoop-ready and smaller sizes) outweigh the downside of
> being backwards incompatible. The current way of dump generation
> cannot continue forever.
>
> [0] http://strategy.wikimedia.org/wiki/Editor_Trends_Study,
> http://strategy.wikimedia.org/wiki/March_2011_Update
> [1] http://blog.wikimedia.org/2011/06/01/summerofresearchannouncement/
>
> I would love to hear your thoughts and comments!
>
> Best,
> Diederik
