Hello,

I need some help. I have to classify the wikilinks in a Wikipedia article by
their relative position in the article (ideally on the rendered page). For each
wikilink I would like to know its position in the text (numbered in ascending
order within each section), whether it is in an infobox, and whether it is in a
navbox. I need this classification for one specific revision of every article
in the main namespace (namespace 0) of the English Wikipedia.

I first tried parsing the wikitext, but expanding the templates is a problem:
when a template is transcluded with parameters and/or contains conditionals, it
is hard to know what exactly gets rendered. I also tried some of the parsers
listed at https://www.mediawiki.org/wiki/Alternative_parsers that claim to
handle templates, but they did not work out, mainly because of the same
problems I had when parsing the wikitext myself.

Now I am considering parsing the HTML of each article instead. I tried the
MediaWiki API (https://www.mediawiki.org/wiki/API:Parsing_wikitext) to retrieve
the HTML of an article and parse it myself, but the API is very slow for old
revisions of an article, so this would take forever. My question has two parts:
1. What is the fastest way to get the HTML of an article for a specific
revision, or what is the best tool for setting up a local copy of Wikipedia?
(I am currently experimenting with XOWA and WikiTaxi.)
2. Is anybody aware of a parser for Wikipedia HTML that can report, e.g., the
position of each link, or classify the links by their position in the text
(within each section) and by whether they are in an infobox or in a navbox?
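
To make question 2 concrete, here is a rough Python sketch of the kind of
output I am after. It is only an illustration under assumptions: that infoboxes
and navboxes carry the CSS classes "infobox" and "navbox" in the rendered HTML,
that wikilinks are <a> elements whose href starts with "/wiki/", that sections
start at h2-h6 headings, and that section edit links sit inside an element with
class "mw-editsection". The names LinkClassifier and classify_links are made up
for this example; only the Python standard library is used.

```python
from html.parser import HTMLParser
from urllib.parse import unquote

# Elements that never have a closing tag; they must not go on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}
# CSS classes we track (assumed markers, see the text above).
MARKERS = ("infobox", "navbox", "mw-editsection")
HEADINGS = {"h2", "h3", "h4", "h5", "h6"}


class LinkClassifier(HTMLParser):
    """Collects wikilinks with section, per-section position and box flags."""

    def __init__(self):
        super().__init__()
        self.stack = []                        # (tag, markers opened by it)
        self.depth = dict.fromkeys(MARKERS, 0)  # nesting depth per marker
        self.section = "(lead)"                # text before the first heading
        self.position = 0                      # link counter within section
        self.heading = None                    # text pieces while in a heading
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        opened = [m for m in MARKERS if m in classes]
        for marker in opened:
            self.depth[marker] += 1
        self.stack.append((tag, opened))
        if tag in HEADINGS:
            self.heading = []
        elif tag == "a" and not self.depth["mw-editsection"]:
            href = attrs.get("href") or ""
            if href.startswith("/wiki/"):
                self.position += 1
                self.links.append({
                    "target": unquote(href[len("/wiki/"):]),
                    "section": self.section,
                    "position": self.position,
                    "in_infobox": bool(self.depth["infobox"]),
                    "in_navbox": bool(self.depth["navbox"]),
                })

    def handle_endtag(self, tag):
        if tag in VOID or not any(t == tag for t, _ in self.stack):
            return
        # Pop up to and including the matching open tag, so that elements
        # left unclosed inside it do not corrupt the depth counters.
        while self.stack:
            open_tag, opened = self.stack.pop()
            for marker in opened:
                self.depth[marker] -= 1
            if open_tag == tag:
                break
        if tag in HEADINGS and self.heading is not None:
            self.section = "".join(self.heading).strip()
            self.position = 0                  # restart numbering per section
            self.heading = None

    def handle_data(self, data):
        # Keep heading text, but skip the "[edit]" link text.
        if self.heading is not None and not self.depth["mw-editsection"]:
            self.heading.append(data)


def classify_links(html):
    """Return one dict per wikilink found in the given HTML string."""
    parser = LinkClassifier()
    parser.feed(html)
    parser.close()
    return parser.links
```

Calling classify_links(html) on the HTML of one revision would give one record
per wikilink. This sketch counts links inside boxes in the same per-section
sequence as links in running text, and it would treat any h2-h6 heading
(including a table-of-contents heading) as a section start.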

If you think there is a better way to classify the links by their position
than parsing the HTML of an article, please let me know.

Cheers, Dimi



GESIS - Leibniz Institute for the Social Sciences
GESIS Cologne
da|ra - Registration Agency for Social and Economic Data
Unter Sachsenhausen 6-8
D- 50667 Cologne
Tel: +49 221 47694 512

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
