On 3 May 2018 at 19:54, Aidan Hogan <[email protected]> wrote:
> Hi all,
>
> I am wondering what the fastest/best way is to get a local dump of English
> Wikipedia in HTML? We are looking just for the current versions (no edit
> history) of articles for the purposes of a research project.
>
> We have been exploring using bliki [1] to do the conversion of the source
> markup in the Wikipedia dumps to HTML, but the latest version seems to take
> on average several seconds per article (even after the most common
> templates have been downloaded and stored locally). This means it would take
> several months to convert the dump.
>
> We also considered using Nutch to crawl Wikipedia, but with a reasonable
> crawl delay (5 seconds) it would take several months to get a copy of every
> article in HTML (or at least the "reachable" ones).
>
> Hence we are a bit stuck right now and not sure how to proceed. Any help,
> pointers or advice would be greatly appreciated!
>
> Best,
> Aidan
>
> [1] https://bitbucket.org/axelclk/info.bliki.wiki/wiki/Home
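The per-article step Aidan describes is, at its core, a single bliki call per page. A minimal Java sketch, assuming the WikiModel.toHtml convenience method shown on the bliki project page linked in [1] (names should be checked against the version actually in use):

    import info.bliki.wiki.model.WikiModel;

    public class WikitextToHtml {
        public static void main(String[] args) {
            String wikitext = "'''Hello''' [[World]], rendered from wiki markup.";
            // One-shot conversion of a wikitext string to an HTML fragment.
            // A bulk conversion over the dump would instead configure and
            // reuse a WikiModel instance (image/link base URLs, a local
            // template resolver) rather than call the static helper per page.
            String html = WikiModel.toHtml(wikitext);
            System.out.println(html);
        }
    }

Template expansion is the part that needs local data (or network round trips), which is presumably where much of the per-article cost reported above comes from.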
Just in case you have not thought of it, how about taking the XML dump and
converting it to the format you are looking for?

Ref https://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia

Fae
--
[email protected]
https://commons.wikimedia.org/wiki/User:Fae

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
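Fae's suggestion amounts to walking the pages-articles dump, which is one XML stream of <page> elements each carrying the wikitext of the current revision, and handing every page to a converter. A rough Java sketch using the standard StAX parser; it assumes an already-decompressed dump file, since the .bz2 file from dumps.wikimedia.org would need something like Apache Commons Compress in front of it:

    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;
    import java.io.FileInputStream;

    public class DumpPages {
        public static void main(String[] args) throws Exception {
            // args[0]: path to a decompressed pages-articles XML dump
            XMLStreamReader xml = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new FileInputStream(args[0]));
            String element = null;              // local name of the element we are inside
            StringBuilder title = new StringBuilder();
            StringBuilder text = new StringBuilder();
            while (xml.hasNext()) {
                int event = xml.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    element = xml.getLocalName();
                    if ("page".equals(element)) {   // new article: reset buffers
                        title.setLength(0);
                        text.setLength(0);
                    }
                } else if (event == XMLStreamConstants.CHARACTERS) {
                    if ("title".equals(element)) title.append(xml.getText());
                    if ("text".equals(element)) text.append(xml.getText());
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    if ("page".equals(xml.getLocalName())) {
                        // 'text' now holds the wikitext of the page's current
                        // revision; this is where the wikitext-to-HTML
                        // conversion (e.g. the bliki call above) would go.
                        System.out.println(title + "\t" + text.length() + " chars of wikitext");
                    }
                    element = null;
                }
            }
            xml.close();
        }
    }

Parsing the dump this way is cheap; the bottleneck Aidan reports is the wikitext-to-HTML step itself, so the pipeline above only becomes practical if that per-article conversion time can be brought down.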
