Hey Aidan!

I would suggest checking out RESTBase (
https://www.mediawiki.org/wiki/RESTBase), which offers an API for
retrieving HTML versions of Wikipedia pages. It's maintained by the
Wikimedia Foundation and used by a number of production Wikimedia services,
so you can rely on it.

I don't believe there are any prepared dumps of this HTML, but you should
be able to iterate through the RESTBase API, as long as you follow the
rules (from https://en.wikipedia.org/api/rest_v1/):

   - *Limit your clients to no more than 200 requests/s to this API. Each
   API endpoint's documentation may detail more specific usage limits.*
   - *Set a unique User-Agent or Api-User-Agent header that allows us to
   contact you quickly. Email addresses or URLs of contact pages work well.*
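If it helps, here is a minimal sketch of how iterating over that endpoint
might look in Python, using only the standard library. The User-Agent
string and example title are placeholders you would replace with your own;
in practice you would stream page titles from the enwiki dump rather than
hard-code them.

```python
import urllib.parse
import urllib.request

# Base of the RESTBase HTML endpoint for English Wikipedia.
API = "https://en.wikipedia.org/api/rest_v1/page/html/"

# A contactable User-Agent, as the usage rules above require.
# (Placeholder value -- put your own contact address here.)
HEADERS = {"User-Agent": "my-research-crawler/0.1 (mailto:[email protected])"}

def rest_html_url(title: str) -> str:
    """Build the RESTBase HTML URL for a page title.

    Spaces become underscores, and everything else (including slashes)
    is percent-encoded, as RESTBase expects.
    """
    return API + urllib.parse.quote(title.replace(" ", "_"), safe="")

def fetch_html(title: str) -> str:
    """Fetch the current HTML of one article."""
    req = urllib.request.Request(rest_html_url(title), headers=HEADERS)
    with urllib.request.urlopen(req) as resp:
        return resp.read().decode("utf-8")
```

You would then loop over titles, calling fetch_html() with a small sleep
between batches to stay well under the 200 requests/s limit (e.g.
time.sleep(0.01) per request keeps you around 100 req/s, and running a few
such workers in parallel still leaves plenty of headroom).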



On Thu, 3 May 2018 at 14:26, Aidan Hogan <[email protected]> wrote:

> Hi Fae,
>
> On 03-05-2018 16:18, Fæ wrote:
> > On 3 May 2018 at 19:54, Aidan Hogan <[email protected]> wrote:
> >> Hi all,
> >>
> >> I am wondering what is the fastest/best way to get a local dump of
> English
> >> Wikipedia in HTML? We are looking just for the current versions (no edit
> >> history) of articles for the purposes of a research project.
> >>
> >> We have been exploring using bliki [1] to do the conversion of the
> source
> >> markup in the Wikipedia dumps to HTML, but the latest version seems to
> take
> >> on average several seconds per article (including after the most common
> >> templates have been downloaded and stored locally). This means it would
> take
> >> several months to convert the dump.
> >>
> >> We also considered using Nutch to crawl Wikipedia, but with a reasonable
> >> crawl delay (5 seconds) it would take several months to get a copy of every
> >> article in HTML (or at least the "reachable" ones).
> >>
> >> Hence we are a bit stuck right now and not sure how to proceed. Any
> help,
> >> pointers or advice would be greatly appreciated!!
> >>
> >> Best,
> >> Aidan
> >>
> >> [1] https://bitbucket.org/axelclk/info.bliki.wiki/wiki/Home
> >
> > Just in case you have not thought of it, how about taking the XML dump
> > and converting it to the format you are looking for?
> >
> > Ref
> https://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia
> >
>
> Thanks for the pointer! We are currently attempting to do something like
> that with bliki. The issue is that we are interested in the
> semi-structured HTML elements (like lists, tables, etc.) which are often
> generated through external templates with complex structures. Often from
> the invocation of a template in an article, we cannot even tell if it
> will generate a table, a list, a box, etc. E.g., it might say "Weather
> box" in the markup, which gets converted to a table.
>
> Although bliki can help us to interpret and expand those templates, each
> page takes quite a long time, meaning months of computation to get the
> semi-structured data we want from the dump. Due to these templates, we
> have not had much success yet with this route of taking the XML dump and
> converting it to HTML (or even parsing it directly); hence we're still
> looking for other options. :)
>
> Cheers,
> Aidan
>
> _______________________________________________
> Wikitech-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l



-- 
Neil Patel Quinn <https://meta.wikimedia.org/wiki/User:Neil_P._Quinn-WMF>
(he/him/his)
product analyst, Wikimedia Foundation