Hi,

Thanks for responding. Let me try to be a bit clearer.

I am primarily interested in extracting which image is linked from the
infobox of an article (if the article page has an infobox).
Initially I thought of parsing the XML for this, but after looking
around a bit, I felt it might be easier and faster to load the
Wikipedia data into a database, so that I can play around with the
data a lot more.
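
To make the goal concrete, here is a minimal Python sketch of the
XML-parsing route I first considered. It is only an illustration under
some assumptions: that the infobox template starts with "{{Infobox", that
the image is given in a parameter named "image", and that taking the first
such parameter after the infobox start is good enough (it does not track
where the template ends). The function names are made up for this example.

```python
import re
import xml.etree.ElementTree as ET

# Assumed parameter layout: a line such as "| image = Foo.jpg".
# "image_size" and similar names do not match because "=" must follow "image".
IMAGE_PARAM = re.compile(r"^\s*\|\s*image\s*=[ \t]*([^|\n}]+)",
                         re.IGNORECASE | re.MULTILINE)


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def infobox_image(wikitext):
    """Return the image named after the first '{{Infobox', or None."""
    start = wikitext.lower().find("{{infobox")
    if start == -1:
        return None
    match = IMAGE_PARAM.search(wikitext, start)
    if not match:
        return None
    value = match.group(1).strip()
    # Unwrap values written as [[File:Foo.jpg|200px]] or [[Image:Foo.jpg]].
    value = re.sub(r"^\[\[\s*(?:File|Image)\s*:\s*", "", value,
                   flags=re.IGNORECASE)
    return value.rstrip("]").strip() or None


def iter_infobox_images(source):
    """Stream a pages-articles dump and yield (title, image) pairs.

    'source' is a file name or a binary file object.
    """
    title = None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            image = infobox_image(elem.text or "")
            if image:
                yield title, image
        elif name == "page":
            # Free the page's children so memory use stays modest.
            elem.clear()
```

Something like iter_infobox_images("enwiki-20090618-pages-articles.xml")
would then walk the whole dump without needing a database at all.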

I am working on my lab machine, where some web applications are already
running. Since the MediaWiki installation instructions mention that I need
to change some PHP settings, I was a little wary about it. Also, I don't
have root access to the lab machines, but I can ask my lab admin to do
things for me when I need something.

My understanding is that I would have to import the data even if I
installed MediaWiki, and that MediaWiki itself is primarily for those who
want to view the data in wiki format. So I decided to go with only the
database. I didn't use importDump.php, as
http://meta.wikimedia.org/wiki/Data_dumps describes it as very slow and
not advisable for large dumps. I wouldn't mind installing MediaWiki if
that would help me import the data more easily.

I created the database using the database layout in
http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/maintenance/tables.sql?view=markup

This time I downloaded a different version of the pages-articles.xml.bz2
dump, from http://download.wikimedia.org/enwiki/20090618/, and tried
importing it using mwdumper.jar.

$ java -jar ../../lib/mwdumper.jar --format=sql:1.5 \
    enwiki-20090618-pages-articles.xml \
  | mysql -f -u root --default-character-set=utf-8 wikipedia


When I issued the above command, the import crashed after a while with
the following error message:

1,427,000 pages (705.771/sec), 1,427,000 revs (705.771/sec)
1,428,000 pages (705.879/sec), 1,428,000 revs (705.879/sec)
Exception in thread "main" java.lang.IllegalArgumentException: Invalid contributor
        at org.mediawiki.importer.XmlDumpReader.closeContributor(Unknown Source)
        at org.mediawiki.importer.XmlDumpReader.endElement(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
        at org.apache.xerces.parsers.AbstractXMLDocumentParser.emptyElement(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanStartElement(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
        at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
        at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
        at org.mediawiki.importer.XmlDumpReader.readDump(Unknown Source)
        at org.mediawiki.dumper.Dumper.main(Unknown Source)
ERROR 1064 (42000) at line 16355: You have an error in your SQL syntax;
check the manual that corresponds to your MySQL server version for the
right syntax to use near ''\'\'\'[[Rutherfordium]]\'\'\' (\'\'\'Rf\'\'\')
has no stable isotopes. A standa' at line 1
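
The trace goes through emptyElement just before closeContributor, which
makes me suspect the parser hit an empty <contributor/> element; that is
only my guess, not something I have confirmed. If so, the MySQL error may
simply be the INSERT statement that was cut off when mwdumper died. A
minimal Python sketch to check the guess by listing revisions whose
contributor has neither a username nor an ip (the function names are made
up for this example):

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def find_empty_contributors(source):
    """Yield (page title, revision id) for suspicious revisions.

    A revision is reported when its <contributor> element contains
    neither <username> nor <ip>. 'source' is a file name or a binary
    file object.
    """
    title = None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "revision":
            rev_id = None
            empty = False
            for child in elem:
                child_name = local_name(child.tag)
                if child_name == "id":
                    rev_id = child.text
                elif child_name == "contributor":
                    kinds = {local_name(c.tag) for c in child}
                    empty = not ({"username", "ip"} & kinds)
            if empty:
                yield title, rev_id
        elif name == "page":
            # Free the page's children so memory use stays modest.
            elem.clear()
```

Running it over enwiki-20090618-pages-articles.xml and printing the first
few results should show whether such a revision sits shortly after page
1,428,000, where the import stopped.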

I also tried the same with mwimport.pl; it crashed with a similar error
saying "invalid contributor".

Any help or suggestions for a successful import would be much appreciated!

Sorry for the long message.

Thanks
Srini

> [email protected] wrote:
>> Hi All,
>>
>> I have been trying to upload one of the latest version of the XML dumps,
>> pages-articles.xml.bz2 from
>> http://download.wikimedia.org/enwiki/20090604/. I dont want the front end
>> and other things that comes with wikimedia installations, so i thought i
>> would just create the database and upload the dump.
>
> What exactly you don't want?
> I don't see what's the unneeded bloat of a mediawiki install. The
> created main page? The user account?
>
>
> _______________________________________________
> Wikitech-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>

