Hi all,

Many thanks for all the pointers! In the end we wrote a small client to grab documents from RESTBase (https://www.mediawiki.org/wiki/RESTBase) as suggested by Neil. The HTML looks perfect, and with the generous 200 requests/second limit (which we could not even manage to reach with our local machine), it only took a couple of days to grab all current English Wikipedia articles.
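For anyone wanting to follow the same route, a minimal sketch of such a client might look like the following. The helper names, the User-Agent string, and the rate-limiter design are hypothetical illustrations, not our actual code; only the public `/api/rest_v1/page/html/{title}` endpoint and the 200 requests/second courtesy limit come from the discussion above.

```python
# Sketch of a RESTBase client for grabbing current article HTML.
# Assumptions: build_url/fetch_html/RateLimiter are illustrative names;
# the endpoint is the public Wikimedia REST API (/api/rest_v1/page/html/{title}).
import time
import urllib.parse
import urllib.request

REST_HTML = "https://en.wikipedia.org/api/rest_v1/page/html/"

def build_url(title: str) -> str:
    """Turn an article title into its RESTBase HTML URL."""
    # Article titles use underscores; percent-encode everything else.
    return REST_HTML + urllib.parse.quote(title.replace(" ", "_"), safe="")

class RateLimiter:
    """Crude client-side limiter to stay under N requests/second."""
    def __init__(self, per_second: float):
        self.min_interval = 1.0 / per_second
        self.last = 0.0

    def wait(self) -> None:
        now = time.monotonic()
        sleep_for = self.last + self.min_interval - now
        if sleep_for > 0:
            time.sleep(sleep_for)
        self.last = time.monotonic()

def fetch_html(title: str, limiter: RateLimiter) -> str:
    """Fetch the current HTML of one article, respecting the limiter."""
    limiter.wait()
    req = urllib.request.Request(
        build_url(title),
        # A descriptive User-Agent with contact info is good API etiquette.
        headers={"User-Agent": "research-crawler/0.1 (contact: you@example.org)"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")

if __name__ == "__main__":
    limiter = RateLimiter(per_second=200)  # the limit mentioned above
    html = fetch_html("Albert Einstein", limiter)
    print(len(html))
```

In practice one would drive `fetch_html` from a list of all article titles (e.g. from a dump of page titles) and add retry handling, but the above captures the core loop: encode the title, throttle, fetch.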

@Kaartic, many thanks for the offers of help with extracting HTML from ZIM! We also investigated this option in parallel, converting ZIM to HTML using Zimreader-Java [1]; it looked promising, but we had some issues extracting links. We did not try the mwoffliner tool you mentioned since we got what we needed through RESTBase in the end. In any case, we appreciate the offers of help. :)

Best,
Aidan

[1] https://github.com/openzim/zimreader-java

On 08-05-2018 9:34, Kaartic Sivaraam wrote:
On Tuesday 08 May 2018 05:53 PM, Kaartic Sivaraam wrote:
On Friday 04 May 2018 03:49 AM, Bartosz Dziewoński wrote:
On 2018-05-03 20:54, Aidan Hogan wrote:
I am wondering what is the fastest/best way to get a local dump of
English Wikipedia in HTML? We are looking just for the current
versions (no edit history) of articles for the purposes of a research
project.

The Kiwix project provides HTML dumps of Wikipedia for offline reading:
http://www.kiwix.org/downloads/


In case you need pure HTML and not the ZIM file format, you could check
out mwoffliner[1], ...

Note that the HTML is (of course) not the same as the one you see
when visiting Wikipedia. For example, the sidebar links are not present,
and neither is the ToC.



_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l