VisualEditor uses Parsoid to convert wiki markup to HTML: https://m.mediawiki.org/wiki/Parsoid
It should then be possible to strip the HTML with a standard library.
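As a rough sketch of the "strip the HTML with a standard library" step, Python's built-in html.parser can pull out just the text nodes. The sample HTML below is illustrative, not actual Parsoid output:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text content of an HTML document, ignoring tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for each run of text between tags.
        self.parts.append(data)

    def text(self):
        return "".join(self.parts)

def strip_html(html):
    extractor = TextExtractor()
    extractor.feed(html)
    return extractor.text()

# Hypothetical snippet standing in for Parsoid HTML output:
sample = "<p>Wikipedia is a <b>free</b> encyclopedia.</p>"
print(strip_html(sample))  # Wikipedia is a free encyclopedia.
```

Real Parsoid output carries extra metadata (data-mw attributes, reference markup, etc.), so for serious NLP work you would likely want to drop some elements entirely rather than keep all text nodes.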
There are some alternative parsers listed here, but I have no idea how well any of them perform/scale: https://m.mediawiki.org/wiki/Alternative_parsers

Would love to hear if anyone has a better answer. Obviously a plain text dump or even an HTML dump could save a good amount of processing.

Cheers,
Scott

On Mon, Feb 22, 2016, 15:18 Bruno Goncalves <bgoncal...@gmail.com> wrote:
> Hi,
>
> I was wondering if there is any place where I can find text-only versions
> of Wikipedia (without markup, etc.) suitable for NLP tasks? I've been able
> to find a couple of old ones for the English Wikipedia, but I would like
> to analyze different languages (Mandarin, Arabic, etc.).
>
> Of course, any pointers to software that I can use to convert the usual
> XML dumps to text would be great as well.
>
> Best,
>
> Bruno
>
> *******************************************
> Bruno Miguel Tavares Gonçalves, PhD
> Homepage: www.bgoncalves.com
> Email: bgoncal...@gmail.com
> *******************************************
> _______________________________________________
> Wiki-research-l mailing list
> Wiki-research-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

--
Dr. Scott Hale
Data Scientist
Oxford Internet Institute
University of Oxford
http://www.scotthale.net/