Hi,

     I have been importing the English Wikipedia XML dumps every few 
months (the last time was in June), using xml2sql, which has always 
worked for me. Now I have attempted the import on the latest dump, 
enwiki-20090920-pages-articles.xml, and on the earlier dump, 
enwiki-20090810-pages-articles.xml. Both fail with this error:

 >$  xml2sql enwiki-20090920-pages-articles.xml
unexpected element <redirect>
xml2sql: parsing aborted at line 33 pos 16.
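
One way to see what xml2sql is choking on is to print the lines around line 33 of the dump. A minimal Python sketch, assuming the dump is UTF-8 text; the function name show_lines and the context parameter are made up for illustration and are not part of xml2sql or any MediaWiki tool:

```python
from itertools import islice


def show_lines(path, line_no, context=3):
    """Return (number, text) pairs around a 1-based line number of a text file."""
    start = max(1, line_no - context)
    stop = line_no + context
    with open(path, encoding="utf-8", errors="replace") as fh:
        # islice streams the file, so the multi-gigabyte dump is never
        # loaded into memory; reading stops right after the last line wanted.
        window = islice(fh, start - 1, stop)
        return [(start + i, line.rstrip("\n")) for i, line in enumerate(window)]
```

Calling show_lines("enwiki-20090920-pages-articles.xml", 33) and printing the pairs should show the <redirect> element in context, together with the page it belongs to.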

So then I tried mwdumper, and after about 1.4 million pages it fails:
……
1,423,000 pages (957.283/sec), 1,423,000 revs (957.283/sec)
1,424,000 pages (957.465/sec), 1,424,000 revs (957.465/sec)
Exception in thread "main" java.lang.IllegalArgumentException: Invalid contributor
         at org.mediawiki.importer.XmlDumpReader.closeContributor(Unknown Source)
         at org.mediawiki.importer.XmlDumpReader.endElement(Unknown Source)
         at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
         at org.apache.xerces.parsers.AbstractXMLDocumentParser.emptyElement(Unknown Source)
         at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanStartElement(Unknown Source)
         at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
         at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
         at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
         at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
         at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
         at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
         at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
         at javax.xml.parsers.SAXParser.parse(SAXParser.java:395)
         at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
         at org.mediawiki.importer.XmlDumpReader.readDump(Unknown Source)
         at org.mediawiki.dumper.Dumper.main(Unknown Source)
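
The trace passes through emptyElement before closeContributor, so my guess (and it is only a guess) is that the dump contains a <contributor> element with no child elements. A rough Python sketch for locating such elements follows; the function names are hypothetical, and it assumes the usual page/title/revision/contributor nesting of the dump:

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def pages_with_empty_contributor(source, limit=5):
    """Return (page title, contributor attributes) for childless <contributor> elements."""
    hits = []
    title = None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            # <title> closes before the revisions of the same page
            title = elem.text
        elif name == "contributor" and len(elem) == 0:
            hits.append((title, dict(elem.attrib)))
            if len(hits) >= limit:
                break
        elif name == "page":
            # drop the finished page's children to keep memory use down
            elem.clear()
    return hits
```

Here source can be a file name or an open binary file. If the guess is right, the first hit should be somewhere after the 1,424,000th page.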


I also tried importDump.php (MediaWiki 1.14.0) and get errors like these:
…
Warning: xml_parse(): Unable to call handler in_() in 
/var/www/includes/Import.php on line 437
Warning: xml_parse(): Unable to call handler in_() in 
/var/www/includes/Import.php on line 437
Warning: xml_parse(): Unable to call handler out_() in 
/var/www/includes/Import.php on line 437
….
(Sorry, I don't know where these errors start; it gets through a few 
thousand pages before failing, by which point I had stopped watching the output.)

Has the format of the XML files changed? I could swear that as of 
May or June I had xml2sql working. I know that I might need to 
upgrade MediaWiki to 1.15, but importDump.php usually does not work 
for the English Wikipedia anyway.
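
To check whether the format changed, one option is to count the element names that appear in the first pages of an old dump and of a new dump, then compare the two results. A minimal Python sketch, where the function name element_counts and the max_pages parameter are made up for illustration:

```python
import xml.etree.ElementTree as ET
from collections import Counter


def element_counts(source, max_pages=1000):
    """Count element names (namespace stripped) seen in the first max_pages pages."""
    counts = Counter()
    pages = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = elem.tag.rsplit("}", 1)[-1]
        counts[name] += 1
        if name == "page":
            pages += 1
            # free the finished page so the scan stays cheap on a big dump
            elem.clear()
            if pages >= max_pages:
                break
    return counts
```

Any name that shows up only in the counts for the newer dump (for example redirect, if it really is new) would be a candidate for what xml2sql does not recognise.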

I would be grateful for any ideas.
Thanks guys,
O. O.

P.S. http://download.wikimedia.org/tools/ does not have the source of 
MWDumper. I thought this was open source?


_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
