Re: [Wikitech-l] Getting a local dump of Wikipedia in HTML
Hi all,

Many thanks for all the pointers! In the end we wrote a small client to grab documents from RESTBase (https://www.mediawiki.org/wiki/RESTBase), as suggested by Neil. The HTML looks perfect, and with the generous 200 requests/second limit (which we could not even manage to reach from our local machine), it only took a couple of days to grab all current English Wikipedia articles.

@Kaartic, many thanks for the offers of help with extracting HTML from ZIM! We also investigated this option in parallel, converting ZIM to HTML using Zimreader-Java [1], and indeed it looked promising, but we had some issues with extracting links. We did not try the mwoffliner tool you mentioned since we got what we needed through RESTBase in the end. In any case, we appreciate the offers of help. :)

Best,
Aidan

[1] https://github.com/openzim/zimreader-java

On 08-05-2018 9:34, Kaartic Sivaraam wrote:
> On Tuesday 08 May 2018 05:53 PM, Kaartic Sivaraam wrote:
>> On Friday 04 May 2018 03:49 AM, Bartosz Dziewoński wrote:
>>> On 2018-05-03 20:54, Aidan Hogan wrote:
>>>> I am wondering what is the fastest/best way to get a local dump of English Wikipedia in HTML? We are looking just for the current versions (no edit history) of articles for the purposes of a research project.
>>> The Kiwix project provides HTML dumps of Wikipedia for offline reading: http://www.kiwix.org/downloads/
>> In case you need pure HTML and not the ZIM file format, you could check out mwoffliner[1], ...
> Note that the HTML is (of course) not the same as the one you see when visiting Wikipedia. For example, the sidebar links would not be present, and neither would the ToC.

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
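The "small client" described above can be sketched along the following lines. This is a minimal illustration rather than the authors' actual code: it assumes the standard RESTBase Page Content HTML endpoint (/api/rest_v1/page/html/{title}), and the pacing value is arbitrary, chosen only to stay well under the published 200 requests/second limit.

```python
# Minimal sketch of a RESTBase HTML fetcher (illustrative, not the
# authors' client). Assumes the standard Page Content endpoint.
import time
import urllib.parse
import urllib.request

REST_BASE = "https://en.wikipedia.org/api/rest_v1/page/html/"

def article_url(title):
    """Build the RESTBase HTML URL for an article title."""
    # Titles use underscores for spaces and must be percent-encoded.
    return REST_BASE + urllib.parse.quote(title.replace(" ", "_"), safe="")

def fetch_html(title, delay=0.01):
    """Fetch the Parsoid HTML for one article, pausing between requests."""
    with urllib.request.urlopen(article_url(title)) as resp:
        html = resp.read().decode("utf-8")
    time.sleep(delay)  # illustrative pacing; keeps well under the rate limit
    return html
```

For a full dump, one would iterate over a list of article titles (e.g., from the all-titles dump file) and write each response to disk.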
Re: [Wikitech-l] Getting a local dump of Wikipedia in HTML
Hi Fae,

On 03-05-2018 16:18, Fæ wrote:
> On 3 May 2018 at 19:54, Aidan Hogan <aho...@dcc.uchile.cl> wrote:
>> Hi all, I am wondering what is the fastest/best way to get a local dump of English Wikipedia in HTML? We are looking just for the current versions (no edit history) of articles for the purposes of a research project. We have been exploring using bliki [1] to do the conversion of the source markup in the Wikipedia dumps to HTML, but the latest version seems to take on average several seconds per article (including after the most common templates have been downloaded and stored locally). This means it would take several months to convert the dump. We also considered using Nutch to crawl Wikipedia, but with a reasonable crawl delay (5 seconds) it would take several months to get a copy of every article in HTML (or at least the "reachable" ones). Hence we are a bit stuck right now and not sure how to proceed. Any help, pointers or advice would be greatly appreciated!! Best, Aidan [1] https://bitbucket.org/axelclk/info.bliki.wiki/wiki/Home
> Just in case you have not thought of it, how about taking the XML dump and converting it to the format you are looking for? Ref https://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia

Thanks for the pointer! We are currently attempting to do something like that with bliki. The issue is that we are interested in the semi-structured HTML elements (like lists, tables, etc.), which are often generated through external templates with complex structures. Often, from the invocation of a template in an article, we cannot even tell whether it will generate a table, a list, a box, etc. For example, the markup might just say "Weather box", which gets converted to a table. Although bliki can help us to interpret and expand those templates, each page takes quite a long time, meaning months of computation to get the semi-structured data we want from the dump.
Due to these templates, we have not had much success yet with this route of taking the XML dump and converting it to HTML (or even parsing it directly); hence we're still looking for other options. :)

Cheers,
Aidan
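For reference, the template-expansion bottleneck described above can also be pushed server-side: the MediaWiki action API can expand the templates in a snippet of wikitext. A minimal sketch, assuming the standard action=expandtemplates module on the English Wikipedia endpoint (this is an alternative worth noting, not something the thread itself reports trying):

```python
# Sketch: expand templates in a wikitext snippet via the MediaWiki
# action API, instead of expanding them locally with bliki.
# Error handling and API etiquette (User-Agent, maxlag) are omitted.
import json
import urllib.parse
import urllib.request

API = "https://en.wikipedia.org/w/api.php"

def expand_request_url(wikitext):
    """Build an API URL that expands the templates in the given wikitext."""
    params = {
        "action": "expandtemplates",
        "text": wikitext,
        "prop": "wikitext",
        "format": "json",
    }
    return API + "?" + urllib.parse.urlencode(params)

def expand(wikitext):
    """Ask the wiki to expand templates server-side."""
    with urllib.request.urlopen(expand_request_url(wikitext)) as resp:
        data = json.load(resp)
    return data["expandtemplates"]["wikitext"]
```

This trades local CPU time for network round trips, so for a full dump it runs into the same rate/volume concerns as crawling; it is mainly useful for spot-checking what a template such as "Weather box" actually produces.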
[Wikitech-l] Getting a local dump of Wikipedia in HTML
Hi all,

I am wondering what is the fastest/best way to get a local dump of English Wikipedia in HTML? We are looking just for the current versions (no edit history) of articles for the purposes of a research project.

We have been exploring using bliki [1] to do the conversion of the source markup in the Wikipedia dumps to HTML, but the latest version seems to take on average several seconds per article (including after the most common templates have been downloaded and stored locally). This means it would take several months to convert the dump. We also considered using Nutch to crawl Wikipedia, but with a reasonable crawl delay (5 seconds) it would take several months to get a copy of every article in HTML (or at least the "reachable" ones).

Hence we are a bit stuck right now and not sure how to proceed. Any help, pointers or advice would be greatly appreciated!!

Best,
Aidan

[1] https://bitbucket.org/axelclk/info.bliki.wiki/wiki/Home
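The two "several months" estimates above can be sanity-checked with simple arithmetic. The article count used below is an assumption (roughly five million current English Wikipedia articles at the time), not a figure from the message:

```python
# Back-of-envelope check of the conversion/crawl time estimates.
# ARTICLES is an assumed round figure, not from the original message.
ARTICLES = 5_000_000

def days_needed(seconds_per_article):
    """Total days to process every article at a given per-article cost."""
    return ARTICLES * seconds_per_article / 86400  # 86,400 seconds per day

# bliki conversion at ~2 s/article  -> ~116 days, i.e. several months
# polite crawl at 5 s per request   -> ~289 days, most of a year
```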
[Wikitech-l] Downloading Wikimedia/Wikipedia image content in bulk
Hi all,

[Sorry, I'm new to the list, so I'm not sure if a similar discussion has happened before or if the questions appear naive.]

I am working with a masters student and other colleagues on Wikimedia image data. The idea is to combine the metadata and some descriptors computed from the content of the images in Wikimedia with the structured data of DBpedia/Wikidata to (hopefully) create a semantic search service over these images. The goal would ultimately be to enable queries such as "give me images of cathedrals in Europe", "give me images where an Iraqi politician met an American politician", or "give me pairs of similar images where the first image is a Spanish national football player and the second image is of somebody else". These queries would be executed based on the combination of structured data from DBpedia/Wikidata and standard image descriptors (used, e.g., for searching for similar images). The goal is ambitious, but from our side nothing looks infeasible. If you are interested, a sketch of some of the more technical details of our idea is given in this short workshop paper: http://aidanhogan.com/docs/imgpedia_amw2015.pdf

In any case, for this project, we would need to get the metadata and the image content itself for as many of the Wikimedia images linked from Wikipedia as possible. So our questions would be:

* How many images are we talking about in Wikimedia (considering the most recent version, for example)?
* How many are linked from Wikipedia (e.g., English, any language)?
* What overall on-disk size would those images be?
* What would be the best way to access/download those images in bulk?
* How could we get the metadata as well?

Any answers or hints on where to look would be great. From our own searches, it seems the number of Wikimedia images is around 23 million, and those used on Wikipedia (all languages) number around 6 million, so we're talking about a ball-park of maybe 10 terabytes of raw image content?
We know we can extract a list of relevant Wikimedia images from the Wikipedia dump. In terms of getting image content and metadata in bulk, crawling is not a great option for obvious reasons, so the possible options we found mentioned on the Web were:

1. The following mirror for rsyncing image data: http://ftpmirror.your.org/pub/wikimedia/images/
2. The Allimages API to get some metadata for images (but not the content): https://www.mediawiki.org/wiki/API:Allimages

So the idea we are looking at right now is to get the images from 1. and then try to match them with the metadata from 2. Would this make the most sense?

Also, the only documentation for 1. we could find was: https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Media

Is there more of a description of how the folder structure is organised and how, e.g., to figure out the URL of each image?

Any hints or feedback would be great.

Best/thanks,
Aidan

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
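On the folder-structure question: Wikimedia media files are conventionally stored under hashed directories derived from the MD5 of the file name (spaces replaced by underscores), with the first hex digit as the top-level directory and the first two hex digits as the subdirectory. A sketch assuming that standard layout (the helper name is ours):

```python
# Sketch: derive a file's subdirectory path in the mirrored image tree,
# assuming the standard MD5-based layout used for Wikimedia uploads:
#   <first hex digit>/<first two hex digits>/<file name>
import hashlib

def image_path(filename):
    """Return the hashed subdirectory path for a media file name."""
    name = filename.replace(" ", "_")
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return f"{digest[0]}/{digest[:2]}/{name}"
```

Combining this with the file names returned by the Allimages API would let one locate each rsynced file and match it to its metadata.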