Thanks for the suggestions. I'll take a look.

There used to be official HTML dumps at
https://dumps.wikimedia.org/other/static_html_dumps/, but they haven't been
updated in almost a decade :) HTML or plain-text dumps would be a boon for
the NLP world.
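
For the "strip the HTML with a standard library" step Scott mentions, a
minimal sketch might look like the one below. It assumes the HTML has
already been obtained (from Parsoid or an old static dump) and uses only
Python's built-in html.parser; the names TextExtractor and html_to_text are
made up for illustration and are not part of Parsoid or any Wikimedia tool.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; for us.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML fragment (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse the runs of whitespace left behind by removed tags.
    return " ".join("".join(parser.parts).split())


if __name__ == "__main__":
    print(html_to_text("<p>Hello <b>world</b>!</p><style>p {}</style>"))
```

This only drops tags and decodes entities. Wikipedia-specific cleanup
(infoboxes, reference lists, navigation boxes) would still need its own
filtering on top of it.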

Best,

B



*******************************************
Bruno Miguel Tavares Gonçalves, PhD
Homepage: www.bgoncalves.com
Email: bgoncal...@gmail.com
*******************************************

On Mon, Feb 22, 2016 at 11:10 AM, Scott Hale <computermacgy...@gmail.com>
wrote:

> Visual Editor uses Parsoid to convert markup to HTML. It would then be
> possible to strip the HTML with a standard library.
> https://m.mediawiki.org/wiki/Parsoid
>
> There are some alternative parsers listed here, but I have no idea how
> well any of them perform or scale.
> https://m.mediawiki.org/wiki/Alternative_parsers
>
> Would love to hear if anyone has a better answer. Obviously a plain text
> dump or even an HTML dump could save a good amount of processing.
>
> Cheers,
> Scott
>
>
> On Mon, Feb 22, 2016, 15:18 Bruno Goncalves <bgoncal...@gmail.com> wrote:
>
>> Hi,
>>
>> I was wondering if there is any place where I can find text-only versions
>> (without markup, etc.) of Wikipedia suitable for NLP tasks? I've been
>> able to find a couple of old ones for the English Wikipedia, but I would
>> like to analyze different languages (Mandarin, Arabic, etc.).
>>
>> Of course, any pointers to software that I can use to convert the usual
>> XML dumps to text would be great as well.
>>
>> Best,
>>
>> Bruno
>>
>> *******************************************
>> Bruno Miguel Tavares Gonçalves, PhD
>> Homepage: www.bgoncalves.com
>> Email: bgoncal...@gmail.com
>> *******************************************
>> _______________________________________________
>> Wiki-research-l mailing list
>> Wiki-research-l@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>>
> --
> Dr. Scott Hale
> Data Scientist
> Oxford Internet Institute
> University of Oxford
> http://www.scotthale.net/
>
