On 3 May 2018 at 19:54, Aidan Hogan <[email protected]> wrote:
> Hi all,
>
> I am wondering what the fastest/best way is to get a local dump of English
> Wikipedia in HTML? We are looking just for the current versions (no edit
> history) of articles for the purposes of a research project.
>
> We have been exploring using bliki [1] to do the conversion of the source
> markup in the Wikipedia dumps to HTML, but the latest version seems to take
> on average several seconds per article (even after the most common
> templates have been downloaded and stored locally). This means it would take
> several months to convert the dump.
>
> We also considered using Nutch to crawl Wikipedia, but with a reasonable
> crawl delay (5 seconds) it would take several months to get a copy of every
> article in HTML (or at least the "reachable" ones).
>
> Hence we are a bit stuck right now and not sure how to proceed. Any help,
> pointers or advice would be greatly appreciated!
>
> Best,
> Aidan
>
> [1] https://bitbucket.org/axelclk/info.bliki.wiki/wiki/Home
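The per-article step Aidan describes is, at its core, a single bliki call per page. A minimal Java sketch, assuming the WikiModel.toHtml convenience method shown on the bliki project page linked in [1] (names should be checked against the version actually in use):

    import info.bliki.wiki.model.WikiModel;

    public class WikitextToHtml {
        public static void main(String[] args) {
            String wikitext = "'''Hello''' [[World]], rendered from wiki markup.";
            // One-shot conversion of a wikitext string to an HTML fragment.
            // A bulk conversion over the dump would instead configure and
            // reuse a WikiModel instance (image/link base URLs, a local
            // template resolver) rather than call the static helper per page.
            String html = WikiModel.toHtml(wikitext);
            System.out.println(html);
        }
    }

Template expansion is the part that needs local data (or network round trips), which is presumably where much of the per-article cost reported above comes from.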
Just in case you have not thought of it, how about taking the XML dump and
converting it to the format you are looking for?

Ref https://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia

Fae
--
[email protected]
https://commons.wikimedia.org/wiki/User:Fae

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
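Fae's suggestion amounts to walking the pages-articles dump, which is one XML stream of <page> elements each carrying the wikitext of the current revision, and handing every page to a converter. A rough Java sketch using the standard StAX parser; it assumes an already-decompressed dump file, since the .bz2 file from dumps.wikimedia.org would need something like Apache Commons Compress in front of it:

    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;
    import java.io.FileInputStream;

    public class DumpPages {
        public static void main(String[] args) throws Exception {
            // args[0]: path to a decompressed pages-articles XML dump
            XMLStreamReader xml = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new FileInputStream(args[0]));
            String element = null;              // local name of the element we are inside
            StringBuilder title = new StringBuilder();
            StringBuilder text = new StringBuilder();
            while (xml.hasNext()) {
                int event = xml.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    element = xml.getLocalName();
                    if ("page".equals(element)) {   // new article: reset buffers
                        title.setLength(0);
                        text.setLength(0);
                    }
                } else if (event == XMLStreamConstants.CHARACTERS) {
                    if ("title".equals(element)) title.append(xml.getText());
                    if ("text".equals(element)) text.append(xml.getText());
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    if ("page".equals(xml.getLocalName())) {
                        // 'text' now holds the wikitext of the page's current
                        // revision; this is where the wikitext-to-HTML
                        // conversion (e.g. the bliki call above) would go.
                        System.out.println(title + "\t" + text.length() + " chars of wikitext");
                    }
                    element = null;
                }
            }
            xml.close();
        }
    }

Parsing the dump this way is cheap; the bottleneck Aidan reports is the wikitext-to-HTML step itself, so the pipeline above only becomes practical if that per-article conversion time can be brought down.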
