VisualEditor uses Parsoid to convert wikitext markup to HTML, so one option
is to render pages to HTML with Parsoid and then strip the HTML with a
standard library.
https://m.mediawiki.org/wiki/Parsoid
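
For that second step, something like this (a rough, untested sketch; it
assumes you already have the Parsoid HTML for a page, e.g. from an HTML
dump) should work with just the Python standard library:

from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    # Collects text content, skipping <script>/<style> blocks.
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def strip_html(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(" ".join(parser.parts).split())

# e.g. strip_html("<p>Hello <b>world</b></p>") -> "Hello world"

For most NLP uses you would still want to drop tables, references, and
navigation elements, so treat this as a starting point only.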

There are also some alternative parsers listed here, but I have no idea how
well any of them perform or scale.
https://m.mediawiki.org/wiki/Alternative_parsers

Would love to hear if anyone has a better answer. Obviously a plain-text
dump, or even an HTML dump, would save a good amount of processing.

Cheers,
Scott


On Mon, Feb 22, 2016, 15:18 Bruno Goncalves <bgoncal...@gmail.com> wrote:

> Hi,
>
> I was wondering if there is any place where I can find text-only versions
> of Wikipedia (without markup, etc.) suitable for NLP tasks? I've been
> able to find a couple of old ones for the English Wikipedia, but I would
> like to analyze different languages (Mandarin, Arabic, etc.).
>
> Of course, any pointers to software that I can use to convert the usual
> XML dumps to text would be great as well.
>
> Best,
>
> Bruno
>
> *******************************************
> Bruno Miguel Tavares Gonçalves, PhD
> Homepage: www.bgoncalves.com
> Email: bgoncal...@gmail.com
> *******************************************
-- 
Dr. Scott Hale
Data Scientist
Oxford Internet Institute
University of Oxford
http://www.scotthale.net/
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
