mwdumper seems to work for recent dumps: http://lists.wikimedia.org/pipermail/mediawiki-l/2012-May/039347.html
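
If only the current article text is needed, another route that skips the import step is to stream the pages-articles dump directly. The following is a minimal sketch, assuming Python 3 and only the standard library. The names local, iter_pages, open_dump and SAMPLE are made up for this example and are not part of mwdumper or MediaWiki, and the sample document is a cut-down stand-in for the real export format:

```python
import bz2
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a pages-articles dump (assumption: real dumps carry more fields).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.6/">
  <siteinfo><sitename>Wikipedia</sitename></siteinfo>
  <page><title>Foo</title><revision><text>Hello ''world''</text></revision></page>
  <page><title>Bar</title><revision><text /></revision></page>
</mediawiki>"""


def local(tag):
    """Drop the '{namespace}' prefix so the code does not depend on the export version."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text) for each <page> element read from a binary XML stream."""
    context = ET.iterparse(stream, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root element
    title = text = None
    for event, elem in context:
        if event != "end":
            continue
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title = text = None
            root.clear()  # discard finished pages so memory use stays flat


def open_dump(path):
    """Open a .xml.bz2 dump for streaming without decompressing it to disk."""
    return bz2.open(path, "rb")


if __name__ == "__main__":
    for page_title, page_text in iter_pages(io.BytesIO(SAMPLE)):
        print(page_title, len(page_text))
```

For a real dump, pass open_dump("enwiki-latest-pages-articles.xml.bz2") to iter_pages instead of the in-memory sample. Because the file is read as a stream, it never has to be fully decompressed on disk.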
On Tue, Jun 12, 2012 at 11:19 PM, Steve Bennett <[email protected]> wrote:
> Hi all,
> I've been tasked with setting up a local copy of the English
> Wikipedia for researchers - sort of like another Toolserver. I'm not
> having much luck, and wondered if anyone has done this recently, and
> what approach they used? We only really need the current article text
> - history and meta pages aren't needed.
>
> Things I have tried:
> 1) Downloading and mounting the SQL dumps
>
> No good because they don't contain article text
>
> 2) Downloading and mounting other SQL "research dumps" (eg
> ftp://ftp.rediris.es/mirror/WKP_research)
>
> No good because they're years out of date
>
> 3) Using WikiXRay on the enwiki-latest-pages-meta-history?.xml-.....xml files
>
> No good because they decompress to astronomically large. I got about
> halfway through decompressing them and was over 7Tb.
>
> Also, WikiXRay appears to be old and out of date (although
> interestingly its author Felipe Ortega has just committed to the
> gitorious repository[1] on Monday for the first time in over a year)
>
> 4) Using MWDumper (http://www.mediawiki.org/wiki/Manual:MWDumper)
>
> No good because it's old and out of date: it only supports export
> version 0.3, and the current dumps are 0.6
>
> 5) Using importDump.php on a latest-pages-articles.xml dump [2]
>
> No good because it just spews out 7.6Gb of this output:
>
> PHP Warning: xml_parse(): Unable to call handler in_() in
> /usr/share/mediawiki/includes/Import.php on line 437
> PHP Warning: xml_parse(): Unable to call handler out_() in
> /usr/share/mediawiki/includes/Import.php on line 437
> PHP Warning: xml_parse(): Unable to call handler in_() in
> /usr/share/mediawiki/includes/Import.php on line 437
> PHP Warning: xml_parse(): Unable to call handler in_() in
> /usr/share/mediawiki/includes/Import.php on line 437
> ...
>
>
> So, any suggestions for approaches that might work? Or suggestions for
> fixing the errors in step 5?
>
> Steve
>
>
> [1] http://gitorious.org/wikixray
> [2]
> http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
>
> _______________________________________________
> Wikitech-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
