The problem, as I understand it (and Brion may come by to correct me),
is essentially that the current dump process is designed in a way that
can't be sustained given the size of enwiki.  It really needs to be
re-engineered, which means that developer time is needed to create a
new approach to dumping.

The main target for improvement is almost certainly parallelizing the
process, so that instead of a single monolithic dump process there
would be a lot of little processes working in parallel.  That would also
ensure that if a single process gets stuck and dies, the entire dump
doesn't need to start over.
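
Just to make that concrete, here is a very rough sketch of the sort of
thing I mean.  This is purely illustrative and not the actual MediaWiki
dump code: the chunk size, the max page id, and the run_existing_dumper()
helper are all made up, and a real implementation would hook each
page-id range into the existing dump scripts instead.

import multiprocessing

CHUNK_SIZE = 100000       # pages per chunk -- arbitrary illustrative value
MAX_PAGE_ID = 25000000    # rough placeholder for enwiki's highest page id

def run_existing_dumper(start, end, outfile):
    # Hypothetical stand-in for the real dumper, e.g. a call out to the
    # existing dump machinery restricted to this page-id range.
    pass

def dump_chunk(chunk):
    # Dump one page-id range into its own output file, so a crash here
    # only loses this chunk rather than the whole multi-month run.
    start, end = chunk
    outfile = "pages-meta-history.%09d-%09d.xml.bz2" % (start, end)
    try:
        run_existing_dumper(start, end, outfile)
        return chunk, "ok"
    except Exception as err:
        return chunk, "failed: %s" % err

def main():
    chunks = [(i, min(i + CHUNK_SIZE - 1, MAX_PAGE_ID))
              for i in range(1, MAX_PAGE_ID + 1, CHUNK_SIZE)]
    pool = multiprocessing.Pool(processes=8)   # worker count is tunable
    for chunk, status in pool.imap_unordered(dump_chunk, chunks):
        if status != "ok":
            # Only this range needs to be requeued, not the entire dump.
            print("chunk %s %s" % (chunk, status))
    pool.close()
    pool.join()

if __name__ == "__main__":
    main()

The point is simply that each chunk is an independent, restartable unit
of work, and the finished pieces can be stitched together (or offered as
a set of files) afterwards.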


By way of observation, dewiki's full history dumps in 26 hours
with 96% prefetched (i.e. loaded from previous dumps).  That suggests
that even starting from scratch (prefetch = 0%) it should dump in ~25
days under the current process.  enwiki is perhaps 3-6 times larger
than dewiki depending on how you do the accounting, which implies
dumping the whole thing from scratch would take ~5 months if the
process scaled linearly.  Of course it doesn't scale linearly, and we
end up with a prediction for completion that is currently 10 months
away (which amounts to a 13-month total execution).  And of course, if
there is any serious error in the next ten months the entire process
could die with no result.
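
To spell out the arithmetic behind those numbers (the one assumption
being that prefetched revisions cost essentially nothing, so the 26
hours were spent almost entirely on the 4% that wasn't prefetched):

dewiki_hours = 26.0
prefetch = 0.96

# dewiki with no prefetch at all:
from_scratch_days = dewiki_hours / (1.0 - prefetch) / 24.0
print("dewiki from scratch: ~%.0f days" % from_scratch_days)
# ~27 days, i.e. the ~25-day ballpark above

# enwiki is perhaps 3-6x the size of dewiki; if the process scaled linearly:
for ratio in (3, 6):
    months = from_scratch_days * ratio / 30.0
    print("enwiki at %dx dewiki: ~%.1f months" % (ratio, months))
# roughly 3 to 5.5 months -- versus the ~13 months the current run is
# actually on track for, which is the non-linearity I mean.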


Whether we want to let the current process continue to try and finish
or not, I would seriously suggest someone look into redumping the rest
of the enwiki files (i.e. logs, current pages, etc.).  I am also among
the people who care about having reasonably fresh dumps, and it really
is a problem that the other dumps (e.g. stubs-meta-history) are frozen
while we wait to see if the full history dump can run to completion.

-Robert Rohde


On Tue, Jan 27, 2009 at 11:24 AM, Christian Storm <st...@iparadigms.com> wrote:
>>> On 1/4/09 6:20 AM, yegg at alum.mit.edu wrote:
>>> The current enwiki database dump
>>> (http://download.wikimedia.org/enwiki/20081008/) has been crawling
>>> along since 10/15/2008.
>> The current dump system is not sustainable on very large wikis and
>> is being replaced. You'll hear about it when we have the new one in
>> place. :)
>> -- brion
>
> Following up on this thread:  
> http://lists.wikimedia.org/pipermail/wikitech-l/2009-January/040841.html
>
> Brion,
>
> Can you offer any general timeline estimates (weeks, months, 1/2
> year)?  Are there any alternatives to retrieving the article data
> beyond directly crawling the site?  I know this is verboten, but we
> are in dire need of retrieving this data and don't know of any
> alternatives.  The current estimate of end of year is too long for
> us to wait.  Unfortunately, Wikipedia is a favored source for
> students to plagiarize from, which makes out-of-date content a real
> issue.
>
> Is there any way to help this process along?  We can donate disk
> drives, developer time, ...?  There is another possibility that we
> could offer, but I would need to talk with someone at the Wikimedia
> Foundation offline.  Is there anyone I could contact?
>
> Thanks for any information and/or direction you can give.
>
> Christian
>
>

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
