https://bugzilla.wikimedia.org/show_bug.cgi?id=18651





--- Comment #6 from Maxim Iorsh <[email protected]>  2009-07-16 21:39:51 UTC ---
This script should be run as

 ./HeWiktionary_2_CulmusDic.pl hewiktionary-pages-articles.xml > hewiktionary-culmus.xml

where hewiktionary-pages-articles.xml is the dump in question. It will produce
a number of reports of the form

 ...
 Bad word in heading: כלכלן
 Bad word in heading: זבד
 Bad word in heading: חג
 ...

I made an effort to ensure that these reports refer to actual dump errors with
high probability. If the first report you check does not show an error, try a
few more. The example is from the 20090713 dump
(http://download.wikimedia.org/hewiktionary/20090713/, file
http://download.wikimedia.org/hewiktionary/20090713/hewiktionary-20090713-pages-articles.xml.bz2).
Take any report and check the entry in the pages-articles.xml file that
corresponds to the page with that name.

E.g. for "כלכלן" look for "<title>כלכלן</title>". You will find an
XML entry for the page http://he.wiktionary.org/wiki/כלכלן, but the
contents of the entry have nothing to do with the actual contents of the wiki
page. My guess is that the contents of the <text xml:space="preserve"> element
come from http://he.wiktionary.org/wiki/דינמיט.
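
The manual check described above can also be scripted. The following is only a
minimal sketch in Python, not part of HeWiktionary_2_CulmusDic.pl: the names
find_page_text and local_name are hypothetical, and it assumes the dump uses
the usual <page>/<title>/<text> nesting. It streams the XML and returns the
<text> contents stored under a given title, so they can be compared by eye
with the live wiki page.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def find_page_text(source, wanted_title):
    """Return the <text> of the first <page> titled wanted_title, else None.

    `source` is a filename or a binary file object holding the XML dump.
    (Hypothetical helper for illustration; assumes <page>/<title>/<text>.)
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue  # wait until a whole <page> element has been read
        title = None
        text = None
        for node in elem.iter():
            name = local_name(node.tag)
            if name == "title":
                title = node.text
            elif name == "text":
                text = node.text or ""
        if title == wanted_title:
            return text
        elem.clear()  # drop the children of pages we do not need
    return None
```

Since the dump is distributed as .bz2, a file object from
bz2.open("hewiktionary-20090713-pages-articles.xml.bz2", "rb") can be passed
as `source` together with the title from a report, e.g. "כלכלן".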

The inner workings of the script are probably of no interest to you. It parses
wiki pages and complains when a page seems too inconsistent with the usual
Hebrew Wiktionary page template. It should mainly serve as a dump error
detector.


