Thanks, I'm trying this. It consumes phenomenal amounts of memory, though - I keep getting a "Killed" message from Ubuntu (the kernel's OOM killer stepping in), even with a 20 GB swap file. Will keep trying with an even bigger one.
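One thing I may try next: a plain streaming parse, which ought to keep memory
roughly flat no matter how big the dump is. A minimal, untested sketch
(Python standard library only; the namespace URI assumes the 0.6 export
schema that comes up below):

    import bz2
    import xml.etree.ElementTree as ET

    # Namespace of the current (0.6) dump schema.
    NS = '{http://www.mediawiki.org/xml/export-0.6/}'

    def iter_pages(path):
        """Yield (title, text) pairs without holding the whole dump in RAM."""
        with bz2.BZ2File(path) as f:
            context = ET.iterparse(f, events=('start', 'end'))
            _, root = next(context)  # the <mediawiki> root element
            for event, elem in context:
                if event == 'end' and elem.tag == NS + 'page':
                    title = elem.findtext(NS + 'title')
                    text = elem.findtext(NS + 'revision/' + NS + 'text')
                    yield title, text
                    root.clear()  # discard finished pages to bound memory

    # Example: count pages in the articles dump from [2].
    n = sum(1 for _ in iter_pages('enwiki-latest-pages-articles.xml.bz2'))
    print('%d pages' % n)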
I'll also give mwdumper another go.

Steve

On Wed, Jun 13, 2012 at 3:03 PM, Adam Wight <s...@ludd.net> wrote:
> I ran into this problem recently. A Python script is available at
> https://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/Offline/mwimport.py
> that will convert .xml.bz2 dumps into flat fast-import files, which can be
> loaded into most databases. Sorry, this tool is still alpha quality.
>
> Feel free to contact me with problems.
>
> -Adam Wight
>
> j...@sahnwaldt.de:
>> mwdumper seems to work for recent dumps:
>> http://lists.wikimedia.org/pipermail/mediawiki-l/2012-May/039347.html
>>
>> On Tue, Jun 12, 2012 at 11:19 PM, Steve Bennett <stevag...@gmail.com> wrote:
>> > Hi all,
>> > I've been tasked with setting up a local copy of the English
>> > Wikipedia for researchers - sort of like another Toolserver. I'm not
>> > having much luck, and wondered if anyone has done this recently, and
>> > what approach they used? We only really need the current article text
>> > - history and meta pages aren't needed.
>> >
>> > Things I have tried:
>> >
>> > 1) Downloading and mounting the SQL dumps
>> >
>> > No good because they don't contain article text.
>> >
>> > 2) Downloading and mounting other SQL "research dumps" (e.g.
>> > ftp://ftp.rediris.es/mirror/WKP_research)
>> >
>> > No good because they're years out of date.
>> >
>> > 3) Using WikiXRay on the enwiki-latest-pages-meta-history?.xml-.....xml
>> > files
>> >
>> > No good because they decompress to an astronomical size: I got about
>> > halfway through decompressing them and was over 7 TB.
>> >
>> > Also, WikiXRay appears to be old and out of date (although,
>> > interestingly, its author Felipe Ortega committed to the gitorious
>> > repository [1] on Monday for the first time in over a year).
>> >
>> > 4) Using MWDumper (http://www.mediawiki.org/wiki/Manual:MWDumper)
>> >
>> > No good because it's old and out of date: it only supports export
>> > version 0.3, and the current dumps are 0.6.
>> >
>> > 5) Using importDump.php on a latest-pages-articles.xml dump [2]
>> >
>> > No good because it just spews out 7.6 GB of this output:
>> >
>> > PHP Warning:  xml_parse(): Unable to call handler in_() in
>> > /usr/share/mediawiki/includes/Import.php on line 437
>> > PHP Warning:  xml_parse(): Unable to call handler out_() in
>> > /usr/share/mediawiki/includes/Import.php on line 437
>> > PHP Warning:  xml_parse(): Unable to call handler in_() in
>> > /usr/share/mediawiki/includes/Import.php on line 437
>> > PHP Warning:  xml_parse(): Unable to call handler in_() in
>> > /usr/share/mediawiki/includes/Import.php on line 437
>> > ...
>> >
>> > So, any suggestions for approaches that might work? Or suggestions for
>> > fixing the errors in step 5?
>> >
>> > Steve
>> >
>> > [1] http://gitorious.org/wikixray
>> > [2] http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
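A later note on the flat fast-import files mentioned above: at their
simplest they are just one tab-separated row per page, which the database
can bulk-load in a single pass. A toy, untested sketch of that idea (the
pages.tsv name and the two-column title/text layout are placeholders here,
not necessarily what mwimport.py actually emits):

    import bz2
    import xml.etree.ElementTree as ET

    NS = '{http://www.mediawiki.org/xml/export-0.6/}'

    def escape(s):
        # Escape a field for MySQL LOAD DATA INFILE's default format,
        # where \N stands for NULL and tab/newline delimit fields/rows.
        if s is None:
            return '\\N'
        return s.replace('\\', '\\\\').replace('\t', '\\t').replace('\n', '\\n')

    with bz2.BZ2File('enwiki-latest-pages-articles.xml.bz2') as f, \
         open('pages.tsv', 'w', encoding='utf-8') as out:
        context = ET.iterparse(f, events=('start', 'end'))
        _, root = next(context)
        for event, elem in context:
            if event == 'end' and elem.tag == NS + 'page':
                title = elem.findtext(NS + 'title')
                text = elem.findtext(NS + 'revision/' + NS + 'text')
                out.write(escape(title) + '\t' + escape(text) + '\n')
                root.clear()  # keep memory flat while streaming

MySQL can then ingest pages.tsv with a single LOAD DATA LOCAL INFILE
statement into whatever table you have set up, since the escaping above
matches its default field and line terminators.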