Thanks, I'm trying this. It consumes a phenomenal amount of memory,
though - I keep getting a "Killed" message from Ubuntu, even with a
20 GB swap file. I'll keep trying with an even bigger one.
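
In case the bottleneck is the parser holding the whole document in memory
(just a guess on my part about what the script does), a streaming pass along
these lines keeps memory roughly flat - it reads the .bz2 directly and
discards each <page> element once it has been handled:

    # Rough sketch: stream a pages-articles .xml.bz2 dump with bounded memory.
    # handle_page() is a placeholder, and the namespace URI must match the
    # dump's export schema version (0.6 for the current dumps).
    import bz2
    import xml.etree.cElementTree as etree

    NS = '{http://www.mediawiki.org/xml/export-0.6/}'

    def handle_page(title, text):
        pass  # placeholder: write to disk, insert into a database, etc.

    def stream_dump(path):
        f = bz2.BZ2File(path)
        context = etree.iterparse(f, events=('start', 'end'))
        _, root = next(context)  # the <mediawiki> root element
        for event, elem in context:
            if event == 'end' and elem.tag == NS + 'page':
                title = elem.findtext(NS + 'title')
                text = elem.findtext(NS + 'revision/' + NS + 'text')
                handle_page(title, text)
                root.clear()  # drop finished pages so memory stays bounded

    stream_dump('enwiki-latest-pages-articles.xml.bz2')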

I'll also give mwdumper another go.

Steve

On Wed, Jun 13, 2012 at 3:03 PM, Adam Wight <s...@ludd.net> wrote:
> I ran into this problem recently.  A Python script is available at
> https://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/Offline/mwimport.py
> that will convert .xml.bz2 dumps into flat fast-import files, which can be
> loaded into most databases.  Sorry, this tool is still alpha quality.
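>
> (Once you have the flat file, something along these lines should load it
> into MySQL. The table and column names below are only a guess on my part -
> check them against what the script actually emits.)
>
>     # Rough sketch: bulk-load a tab-delimited flat file into MySQL.
>     # The 'page_dump' table and its columns are hypothetical placeholders.
>     import MySQLdb
>
>     conn = MySQLdb.connect(host='localhost', user='wiki', passwd='secret',
>                            db='wikidump', local_infile=1)
>     cur = conn.cursor()
>     cur.execute("""
>         LOAD DATA LOCAL INFILE 'pages.tsv'
>         INTO TABLE page_dump
>         FIELDS TERMINATED BY '\\t'
>         LINES TERMINATED BY '\\n'
>         (page_id, page_title, page_text)
>     """)
>     conn.commit()
>     conn.close()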
>
> Feel free to contact me with problems.
>
> -Adam Wight
>
> j...@sahnwaldt.de:
>> mwdumper seems to work for recent dumps:
>> http://lists.wikimedia.org/pipermail/mediawiki-l/2012-May/039347.html
>>
>> On Tue, Jun 12, 2012 at 11:19 PM, Steve Bennett <stevag...@gmail.com> wrote:
>> > Hi all,
>> >  I've been tasked with setting up a local copy of the English
>> > Wikipedia for researchers - sort of like another Toolserver. I'm not
>> > having much luck, so I'm wondering whether anyone has done this recently,
>> > and what approach they used. We only really need the current article text
>> > - history and meta pages aren't needed.
>> >
>> > Things I have tried:
>> > 1) Downloading and importing the SQL dumps
>> >
>> > No good because they don't contain article text
>> >
>> > 2) Downloading and importing other SQL "research dumps" (e.g.
>> > ftp://ftp.rediris.es/mirror/WKP_research)
>> >
>> > No good because they're years out of date
>> >
>> > 3) Using WikiXRay on the enwiki-latest-pages-meta-history?.xml-.....xml
>> > files
>> >
>> > No good because they decompress to an astronomically large size: I got
>> > about halfway through decompressing them and was already over 7 TB.
>> >
>> > Also, WikiXRay appears to be old and out of date (although, interestingly,
>> > its author Felipe Ortega committed to the Gitorious repository [1] on
>> > Monday for the first time in over a year).
>> >
>> > 4) Using MWDumper (http://www.mediawiki.org/wiki/Manual:MWDumper)
>> >
>> > No good because it's old and out of date: it only supports export
>> > schema version 0.3, while the current dumps are 0.6 (a quick check of
>> > the version a dump declares is sketched below, after step 5).
>> >
>> > 5) Using importDump.php on a latest-pages-articles.xml dump [2]
>> >
>> > No good because it just spews out 7.6 GB of this output:
>> >
>> > PHP Warning:  xml_parse(): Unable to call handler in_() in
>> > /usr/share/mediawiki/includes/Import.php on line 437
>> > PHP Warning:  xml_parse(): Unable to call handler out_() in
>> > /usr/share/mediawiki/includes/Import.php on line 437
>> > PHP Warning:  xml_parse(): Unable to call handler in_() in
>> > /usr/share/mediawiki/includes/Import.php on line 437
>> > PHP Warning:  xml_parse(): Unable to call handler in_() in
>> > /usr/share/mediawiki/includes/Import.php on line 437
>> > ...
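>> >
>> > For what it's worth, I don't know whether these warnings come from the
>> > schema version mismatch mentioned in step 4, but a quick way to check
>> > which export schema version a dump declares is to read the version
>> > attribute on the root <mediawiki> element, e.g.:
>> >
>> >     # Print the export schema version declared on the root <mediawiki>
>> >     # element of a .xml.bz2 dump, reading only the start of the stream.
>> >     import bz2
>> >     import xml.etree.cElementTree as etree
>> >
>> >     def dump_schema_version(path):
>> >         f = bz2.BZ2File(path)
>> >         for _, elem in etree.iterparse(f, events=('start',)):
>> >             # The first start event is the <mediawiki> root element.
>> >             return elem.get('version')
>> >
>> >     print dump_schema_version('enwiki-latest-pages-articles.xml.bz2')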
>> >
>> >
>> > So, any suggestions for approaches that might work? Or suggestions for
>> > fixing the errors in step 5?
>> >
>> > Steve
>> >
>> >
>> > [1] http://gitorious.org/wikixray
>> > [2] 
>> > http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
