Re: [Wikitech-l] Getting a local dump of Wikipedia in HTML

2018-05-13 Thread Aidan Hogan

Hi all,

Many thanks for all the pointers! In the end we wrote a small client to 
grab documents from RESTBase (https://www.mediawiki.org/wiki/RESTBase) 
as suggested by Neil. The HTML looks perfect, and with the generous 200 
requests/second limit (which we could not even manage to reach with our 
local machine), it only took a couple of days to grab all current 
English Wikipedia articles.
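
For reference, here is a minimal sketch of the kind of client we mean; it is 
not our actual code. The endpoint is the Wikimedia REST API path for Parsoid 
HTML, while the titles file, output directory and User-Agent string are just 
placeholders:

    import os
    import time
    import urllib.parse
    import urllib.request

    BASE = "https://en.wikipedia.org/api/rest_v1/page/html/"
    MAX_PER_SECOND = 200  # the rate limit mentioned above

    def fetch_html(title):
        # Titles use underscores and must be percent-encoded in the path.
        url = BASE + urllib.parse.quote(title.replace(" ", "_"), safe="")
        req = urllib.request.Request(
            url,
            headers={"User-Agent": "research-crawler/0.1 (contact: you@example.org)"})
        with urllib.request.urlopen(req) as resp:
            return resp.read().decode("utf-8")

    os.makedirs("html", exist_ok=True)
    with open("titles.txt") as f:          # one article title per line (placeholder)
        for i, line in enumerate(f):
            title = line.strip()
            if not title:
                continue
            with open("html/%08d.html" % i, "w", encoding="utf-8") as out:
                out.write(fetch_html(title))
            time.sleep(1.0 / MAX_PER_SECOND)  # crude client-side throttle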


@Kaartic, many thanks for the offers of help with extracting HTML from 
ZIM! We also investigated this option in parallel, converting ZIM to 
HTML using Zimreader-Java [1], and indeed it looked promising, but we 
had some issues with extracting links. We did not try the mwoffliner 
tool you mentioned since we got what we needed through RESTBase in the 
end. In any case, we appreciate the offers of help. :)
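
(For anyone going the ZIM route instead, a rough sketch using the 
python-libzim bindings follows; we did not run this ourselves, and the file 
name and the in-ZIM path are assumptions that depend on the particular 
archive.)

    from libzim.reader import Archive

    zim = Archive("wikipedia_en_all_nopic.zim")  # placeholder file name
    entry = zim.get_entry_by_path("A/Chile")     # path scheme varies between ZIM versions
    html = bytes(entry.get_item().content).decode("utf-8")
    print(html[:200])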


Best,
Aidan

[1] https://github.com/openzim/zimreader-java

On 08-05-2018 9:34, Kaartic Sivaraam wrote:

On Tuesday 08 May 2018 05:53 PM, Kaartic Sivaraam wrote:

On Friday 04 May 2018 03:49 AM, Bartosz Dziewoński wrote:

On 2018-05-03 20:54, Aidan Hogan wrote:

I am wondering what is the fastest/best way to get a local dump of
English Wikipedia in HTML? We are looking just for the current
versions (no edit history) of articles for the purposes of a research
project.


The Kiwix project provides HTML dumps of Wikipedia for offline reading:
http://www.kiwix.org/downloads/



In case you need pure HTML and not the ZIM file format, you could check
out mwoffliner[1], ...


Note that the HTML is (of course) not the same as what you see when
visiting Wikipedia. For example, the sidebar links and the ToC are not
present.




___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: [Wikitech-l] Getting a local dump of Wikipedia in HTML

2018-05-03 Thread Aidan Hogan

Hi Fae,

On 03-05-2018 16:18, Fæ wrote:

On 3 May 2018 at 19:54, Aidan Hogan <aho...@dcc.uchile.cl> wrote:

Hi all,

I am wondering what is the fastest/best way to get a local dump of English
Wikipedia in HTML? We are looking just for the current versions (no edit
history) of articles for the purposes of a research project.

We have been exploring using bliki [1] to convert the source markup in the
Wikipedia dumps to HTML, but the latest version seems to take on average
several seconds per article (even after the most common templates have been
downloaded and stored locally). This means it would take several months to
convert the dump.
We also considered using Nutch to crawl Wikipedia, but with a reasonable
crawl delay (5 seconds) it would take several months to get a copy of every
article in HTML (or at least the "reachable" ones).

Hence we are a bit stuck right now and not sure how to proceed. Any help,
pointers or advice would be greatly appreciated!!

Best,
Aidan

[1] https://bitbucket.org/axelclk/info.bliki.wiki/wiki/Home


Just in case you have not thought of it, how about taking the XML dump
and converting it to the format you are looking for?

Ref 
https://en.wikipedia.org/wiki/Wikipedia:Database_download#English-language_Wikipedia



Thanks for the pointer! We are currently attempting to do something like 
that with bliki. The issue is that we are interested in the 
semi-structured HTML elements (lists, tables, etc.), which are often 
generated through external templates with complex structures. Often, from 
the invocation of a template in an article, we cannot even tell whether it 
will generate a table, a list, a box, etc. For example, the markup might 
just say "Weather box", which gets converted to a table.


Although bliki can help us to interpret and expand those templates, each 
page takes quite a long time to process, meaning months of computation time 
to get the semi-structured data we want from the dump. Due to these 
templates, we have not had much success yet with this route of taking the 
XML dump and converting it to HTML (or even parsing it directly); hence 
we're still looking for other options. :)


Cheers,
Aidan

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] Getting a local dump of Wikipedia in HTML

2018-05-03 Thread Aidan Hogan

Hi all,

I am wondering what is the fastest/best way to get a local dump of 
English Wikipedia in HTML? We are looking just for the current versions 
(no edit history) of articles for the purposes of a research project.


We have been exploring using bliki [1] to convert the source markup in the 
Wikipedia dumps to HTML, but the latest version seems to take on average 
several seconds per article (even after the most common templates have been 
downloaded and stored locally). This means it would take several months to 
convert the dump.


We also considered using Nutch to crawl Wikipedia, but with a reasonable 
crawl delay (5 seconds) it would take several months to get a copy of every 
article in HTML (or at least the "reachable" ones).
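
As a rough back-of-envelope check (assuming around 5.5 million current 
English articles; the exact count does not change the conclusion):

    # Rough timing estimates for the two approaches described above.
    articles = 5_500_000                 # approximate article count (assumption)
    bliki_seconds_per_article = 3        # "several seconds per article"
    crawl_delay_seconds = 5              # polite crawl delay

    print(articles * bliki_seconds_per_article / 86_400, "days to convert with bliki")  # ~190
    print(articles * crawl_delay_seconds / 86_400, "days to crawl")                     # ~320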


Hence we are a bit stuck right now and not sure how to proceed. Any 
help, pointers or advice would be greatly appreciated!!


Best,
Aidan

[1] https://bitbucket.org/axelclk/info.bliki.wiki/wiki/Home

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

[Wikitech-l] Downloading Wikimedia/Wikipedia image content in bulk

2015-10-07 Thread Aidan Hogan

Hi all,

[Sorry, I'm new to the list, so I'm not sure whether a similar discussion 
has happened before or whether these questions appear naive.]


I am working with a master's student and other colleagues on Wikimedia 
image data. The idea is to combine the meta-data and some descriptors 
computed from the content of the images in Wikimedia with the structured 
data of DBpedia/Wikidata to (hopefully) create a semantic search service 
over these images.



The goal would ultimately be to enable queries such as "give me images 
of cathedrals in Europe" or "give me images where an Iraqi politician 
met an American politician" or "give me pairs of similar images where 
the first image is a Spanish national football player and the second 
image is of somebody else". These queries are executed based on the 
combination of structured data from DBpedia/Wikidata, and standard image 
descriptors (used, e.g., for searching for similar images).


The goal is ambitious, but from our side nothing looks infeasible. If 
you are interested, a sketch of some of the more technical details of 
our idea is given in this short workshop paper:


http://aidanhogan.com/docs/imgpedia_amw2015.pdf


In any case, for this project, we would need to get the meta-data and 
the image content itself for as many of the Wikimedia images linked from 
Wikipedia as possible. So our questions would be:


* How many images are we talking about in Wikimedia (considering the most 
recent version, for example)?

* How many are linked from Wikipedia (e.g., English, any language)?
* What overall on-disk size would those images be?
* What would be the best way to access/download those images in bulk?
* How could we get the meta-data as well?

Any answers or hints on where to look would be great.


From our own searches, it seems the number of Wikimedia images is 
around 23 million and the number used on Wikipedia (all languages) is 
around 6 million, so we're talking about a ballpark of maybe 10 terabytes 
of raw image content? We know we can extract a list of the relevant 
Wikidata images from the Wikipedia dump. In terms of getting the image 
content and meta-data in bulk, crawling is not a great option for obvious 
reasons; the possible options we found mentioned on the Web were:


1. The following mirror for rsynching image data: 
http://ftpmirror.your.org/pub/wikimedia/images/


2. The All Images API to get some meta-data for images (but not the 
content). https://www.mediawiki.org/wiki/API:Allimages
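
For 2., a minimal sketch of walking list=allimages on Commons to collect 
per-image meta-data; the aiprop selection here is just an example:

    import requests

    API = "https://commons.wikimedia.org/w/api.php"
    params = {
        "action": "query",
        "list": "allimages",
        "aiprop": "url|size|mime",
        "ailimit": "500",
        "format": "json",
    }
    while True:
        data = requests.get(API, params=params, timeout=30).json()
        for img in data["query"]["allimages"]:
            print(img["name"], img["url"], img["size"], img["mime"])
        if "continue" not in data:
            break
        params.update(data["continue"])  # carries aicontinue into the next batch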


So the idea we are looking at right now is to get the images from 1. and 
then try to match them with the meta-data from 2. Would this make the most 
sense? Also, the only documentation for 1. we could find was:



https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Media

Is there a more detailed description of how the folder structure is 
organised and how, e.g., to figure out the URL of each image?
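
(Our working assumption, which we would like to confirm, is that original 
files under upload.wikimedia.org follow a hash-based layout, 
/wikipedia/commons/<x>/<xy>/<File_name>, where <x> is the first hex digit 
and <xy> the first two hex digits of the MD5 of the file name with spaces 
replaced by underscores, and that the mirror in 1. uses the same tree. A 
small sketch:)

    import hashlib
    import urllib.parse

    def commons_url(file_name):
        # Assumed layout: /<first hex digit>/<first two hex digits>/<File_name>
        name = file_name.replace(" ", "_")
        digest = hashlib.md5(name.encode("utf-8")).hexdigest()
        return ("https://upload.wikimedia.org/wikipedia/commons/%s/%s/%s"
                % (digest[0], digest[:2], urllib.parse.quote(name)))

    print(commons_url("Example.jpg"))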


Any hints or feedback would be great.

Best/thanks,
Aidan

___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l