[email protected] wrote:
> Hi,
>
> Thanks for responding. Let me try to be a little bit more clear.
>
> I am primarily interested in extracting what image is linked from the
> infobox of an article (if there is an infobox in the article page).
> Initially I thought of parsing the XML for this info, but after
> looking around a bit, I felt it might be easier and faster to get the
> Wikipedia data loaded into a database, so that I can play around with
> the data a lot more.
>
> I am working on my lab machine, where some web applications are already
> running. Since the MediaWiki installation instructions mentioned that I
> need to change some PHP settings, I was a little wary about it. Also I
> don't have root access to the lab machines, but I can ask my lab admin
> to do stuff for me when I want something.
You don't need to change PHP settings. Unless you have a really esoteric
PHP config, MediaWiki will work fine.

> My understanding is that I should import the data even if I install
> MediaWiki, and that it is primarily for those who want to view the data
> in a wiki format. So I decided to go only with the database. I didn't
> use importDump.php, as it was said to be very slow and not advisable for
> large dumps in http://meta.wikimedia.org/wiki/Data_dumps. I wouldn't
> mind installing MediaWiki if that would help me import the data easily.

If you just want to manually parse the wikitext of the articles, don't
import into a db. Feed your program directly from the XML; it will be way
faster. On the other hand, if you want MediaWiki to do something with it,
you'll need a MediaWiki install.

> I created the database using the database layout in
> http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/maintenance/tables.sql?view=markup
>
> This time I downloaded a different version of the pages-articles.xml.bz2
> dump from http://download.wikimedia.org/enwiki/20090618/ and tried
> importing it using mwdumper.jar:
>
> $ java -jar ../../lib/mwdumper.jar --format=sql:1.5 \
>     enwiki-20090618-pages-articles.xml | mysql -f -u root \
>     --default-character-set=utf-8 wikipedia
>
> When I issued the above command, the importing process crashes after a
> while with the following error message:
>
> 1,427,000 pages (705.771/sec), 1,427,000 revs (705.771/sec)
> 1,428,000 pages (705.879/sec), 1,428,000 revs (705.879/sec)
> Exception in thread "main" java.lang.IllegalArgumentException: Invalid
> contributor
>
> I also tried the same with mwimport.pl; it crashed with a similar error
> saying "invalid contributor".

You're right. It's bug 18328. They don't support rev_deleted.

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
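The "feed your program directly from the XML" advice above can be sketched roughly as follows. This is a minimal illustration, not a full solution: it assumes the export-0.4 XML namespace (check the `xmlns` attribute of the `<mediawiki>` root element in your dump and adjust `NS` to match), and it greps the whole wikitext for a `| image =` parameter with a naive regex rather than actually parsing `{{Infobox ...}}` template boundaries, so it will produce false positives on other templates that also take an `image` parameter.

```python
# Sketch: stream a pages-articles dump and pull out infobox-style image
# parameters without importing anything into a database.
import bz2
import re
import xml.etree.ElementTree as ET

# Assumption: export-0.4 schema. Check the xmlns on <mediawiki> in your dump.
NS = "{http://www.mediawiki.org/xml/export-0.4/}"

# Naive: matches any "| image = ..." line, not only ones inside {{Infobox}}.
IMAGE_RE = re.compile(r"^\s*\|\s*image\s*=\s*(.+?)\s*$", re.MULTILINE)

def infobox_images(dump_path):
    """Yield (title, image) pairs for pages whose wikitext contains an
    infobox-style '| image =' parameter."""
    with bz2.open(dump_path) as f:
        # iterparse streams the file, so the multi-GB dump never has to
        # fit in memory as a whole.
        for event, elem in ET.iterparse(f):
            if elem.tag == NS + "page":
                title = elem.findtext(NS + "title")
                text = elem.findtext(NS + "revision/" + NS + "text") or ""
                m = IMAGE_RE.search(text)
                if m:
                    yield title, m.group(1)
                elem.clear()  # drop the processed page to keep memory flat
```

For the original goal (only the first image of the article's infobox), a more robust approach would locate the `{{Infobox` occurrence and scan its parameters, but the streaming skeleton stays the same.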
