Hey Aidan! I would suggest checking out RESTBase (https://www.mediawiki.org/wiki/RESTBase), which offers an API for retrieving HTML versions of Wikipedia pages. It's maintained by the Wikimedia Foundation and used by a number of production Wikimedia services, so you can rely on it.
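
For example, here is a rough sketch of pulling the rendered HTML for a few articles with Python and the requests library (the User-Agent string and the title list are placeholders to replace with your own details; the sleep is just one way of staying well under the rate limit quoted below):

    import time
    import requests

    # Put contact details in the User-Agent so Wikimedia can reach you
    # (see the API etiquette below); replace with your own email or URL.
    HEADERS = {"User-Agent": "html-dump-research/0.1 ([email protected])"}

    def fetch_html(title):
        # Current revision of the page, with templates already expanded to HTML.
        url = ("https://en.wikipedia.org/api/rest_v1/page/html/"
               + requests.utils.quote(title, safe=""))
        resp = requests.get(url, headers=HEADERS, timeout=30)
        resp.raise_for_status()
        return resp.text

    for title in ["Chile", "Santiago"]:  # your list of article titles
        html = fetch_html(title)
        # ... write html to disk ...
        time.sleep(0.05)  # ~20 requests/s, comfortably under the 200 req/s limit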
I don't believe there are any prepared dumps of this HTML, but you should be able to iterate through the RESTBase API, as long as you follow the rules (from https://en.wikipedia.org/api/rest_v1/):

- *Limit your clients to no more than 200 requests/s to this API. Each API endpoint's documentation may detail more specific usage limits.*
- *Set a unique User-Agent or Api-User-Agent header that allows us to contact you quickly. Email addresses or URLs of contact pages work well.*

On Thu, 3 May 2018 at 14:26, Aidan Hogan <[email protected]> wrote:

> Hi Fae,
>
> On 03-05-2018 16:18, Fæ wrote:
> > On 3 May 2018 at 19:54, Aidan Hogan <[email protected]> wrote:
> >> Hi all,
> >>
> >> I am wondering what is the fastest/best way to get a local dump of English
> >> Wikipedia in HTML? We are looking just for the current versions (no edit
> >> history) of articles for the purposes of a research project.
> >>
> >> We have been exploring using bliki [1] to do the conversion of the source
> >> markup in the Wikipedia dumps to HTML, but the latest version seems to take
> >> on average several seconds per article (including after the most common
> >> templates have been downloaded and stored locally). This means it would take
> >> several months to convert the dump.
> >>
> >> We also considered using Nutch to crawl Wikipedia, but with a reasonable
> >> crawl delay (5 seconds) it would take several months to get a copy of every
> >> article in HTML (or at least the "reachable" ones).
> >>
> >> Hence we are a bit stuck right now and not sure how to proceed. Any help,
> >> pointers or advice would be greatly appreciated!!
> >>
> >> Best,
> >> Aidan
> >>
> >> [1] https://bitbucket.org/axelclk/info.bliki.wiki/wiki/Home
> >
> > Just in case you have not thought of it, how about taking the XML dump
> > and converting it to the format you are looking for?
> >
> > Ref https://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia
>
> Thanks for the pointer! We are currently attempting to do something like
> that with bliki. The issue is that we are interested in the
> semi-structured HTML elements (like lists, tables, etc.) which are often
> generated through external templates with complex structures. Often, from
> the invocation of a template in an article, we cannot even tell whether it
> will generate a table, a list, a box, etc. E.g., it might say "Weather
> box" in the markup, which gets converted to a table.
>
> Although bliki can help us to interpret and expand those templates, each
> page takes quite a long time, meaning months of computation time to get the
> semi-structured data we want from the dump. Due to these templates, we
> have not had much success yet with this route of taking the XML dump and
> converting it to HTML (or even parsing it directly); hence we're still
> looking for other options. :)
>
> Cheers,
> Aidan

--
Neil Patel Quinn <https://meta.wikimedia.org/wiki/User:Neil_P._Quinn-WMF>
(he/him/his)
product analyst, Wikimedia Foundation

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
