https://bugzilla.wikimedia.org/show_bug.cgi?id=18651
--- Comment #6 from Maxim Iorsh <[email protected]> 2009-07-16 21:39:51 UTC ---

This script should be run as

./HeWiktionary_2_CulmusDic.pl hewiktionary-pages-articles.xml > hewiktionary-culmus.xml

where hewiktionary-pages-articles.xml is the dump in question. It will produce a number of reports of the form

...
Bad word in heading: כלכלן
Bad word in heading: זבד
Bad word in heading: חג
...

I made an effort to ensure that these reports point to actual dump errors with high probability. If the first report you check does not reveal an error, try a few more.

The example is from the 20090713 dump (http://download.wikimedia.org/hewiktionary/20090713/, file http://download.wikimedia.org/hewiktionary/20090713/hewiktionary-20090713-pages-articles.xml.bz2).

Take any report and check the entry in the pages-articles.xml file that corresponds to the page with that name. E.g. for "כלכלן" look for "<title>כלכלן</title>". You will find an XML entry for the page http://he.wiktionary.org/wiki/כלכלן, but the contents of the entry have nothing to do with the actual contents of the wiki page. I guess that the contents of the <text xml:space="preserve"> element come from http://he.wiktionary.org/wiki/דינמיט.

The inner workings of the script are probably of no interest to you. It parses wiki pages and complains when a page seems too inconsistent with the usual Hebrew Wiktionary page template. It should mainly serve as a dump error detector.
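
The manual check described above (find the <title> of a reported page and look at its <text>) could also be scripted. The following is only a rough sketch, not part of HeWiktionary_2_CulmusDic.pl: the function names find_page_text and local_name are made up for illustration, and it assumes the usual pages-articles layout of <page> elements containing a <title> and a <revision>/<text>.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def find_page_text(dump, wanted_title):
    """Return the <text> of the first <page> whose <title> matches, else None.

    `dump` is a path to, or a binary file object for, a pages-articles XML
    dump. (Hypothetical helper; assumes <page> holds <title> and <text>.)
    """
    for _event, elem in ET.iterparse(dump, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        title = None
        text = None
        # Walk the finished <page> subtree and pick up title and text.
        for child in elem.iter():
            name = local_name(child.tag)
            if name == "title" and title is None:
                title = child.text
            elif name == "text" and text is None:
                text = child.text or ""
        if title == wanted_title:
            return text
        elem.clear()  # drop the page's children; dumps can be large
    return None
```

For example, calling find_page_text on the decompressed dump (or on a file object from bz2.open(..., "rb")) with the title "כלכלן" returns the text stored in the dump for that title, which can then be compared by eye with the live page at http://he.wiktionary.org/wiki/כלכלן.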
