Thanks for helping to distribute Wikipedia more broadly, Mihai. Do give Kiwix for Android [1] a shot, as it does something very similar to your app. Perhaps you can even collaborate on the project.
--tomasz

[1] - https://play.google.com/store/apps/details?id=org.kiwix.kiwixmobile

On Tue, Sep 24, 2013 at 12:24 AM, Mihai Chintoanu
<mihai.chinto...@skobbler.com> wrote:
> Hello,
>
> Thank you to all who have taken the time to answer.
>
> As more people have asked, here are some details about the project. We want
> to build a feature in our smartphone app that allows users to read Wikipedia
> articles, and we want to make the articles and their images available to them
> offline, which is why we first have to download this content from Wikipedia
> and Wikimedia. We have installed a Wikipedia mirror locally and extracted the
> desired article texts through the API. For the images, we first thought about
> getting the image dump tarballs. However, the articles (and consequently the
> images) are spread over several language domains, so this approach would have
> been both inefficient and too space-consuming.
>
> I'll look into the rsync approach.
>
> Once again, many thanks for all your suggestions.
> Mihai
>
> -----Original Message-----
> From: wikitech-l-boun...@lists.wikimedia.org
> [mailto:wikitech-l-boun...@lists.wikimedia.org] On Behalf Of Jeremy Baron
> Sent: 23 September 2013 17:12
> To: Wikimedia developers; Wikipedia Xmldatadumps-l
> Subject: Re: [Wikitech-l] Bulk download
>
> On Sep 23, 2013 9:25 AM, "Mihai Chintoanu" <mihai.chinto...@skobbler.com>
> wrote:
>> I have a list of about 1.8 million images which I have to download from
>> commons.wikimedia.org. Is there any simple way to do this which doesn't
>> involve an individual HTTP hit for each image?
>
> You mean full-size originals, not thumbs scaled to a certain size, right?
>
> You should rsync from a mirror[0] (rsync allows specifying a list of files
> to copy) and then fill in the missing images from upload.wikimedia.org.
> For upload.wikimedia.org, I'd say you should throttle yourself to one cache
> miss per second (you can check the headers on a response to see if it was a
> hit or a miss, and back off when you get a miss), and you shouldn't use more
> than one or two simultaneous HTTP connections. In any case, make sure you
> have an accurate UA string with contact info (email address) so ops can
> contact you if there's an issue.
>
> At the moment there's only one mirror and it's ~6-12 months out of date, so
> there may be a substantial amount to fill in. And of course you should be
> getting checksums from somewhere (the API?) and verifying them. If your
> images are all missing from the mirror, then it should take around 40 days
> at 0.5 img/sec, but I guess you could probably do it in less than 10 days if
> you have a fast enough pipe. (It depends on whether you get a lot of misses
> or hits.)
>
> See also [1], but not all of that applies because upload.wikimedia.org isn't
> MediaWiki, so e.g. there is no maxlag param.
>
> -Jeremy
>
> [0] https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Media
> [1] https://www.mediawiki.org/wiki/API:Etiquette

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
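For anyone following the same route, here is a minimal sketch of the "fill in the
missing files" step Jeremy describes: single connection, descriptive User-Agent,
and backing off whenever a response was a cache miss. It assumes Python 3 with the
`requests` library, a plain-text file of upload.wikimedia.org URLs, and that the
cache status is reported in the X-Cache response header; the file names, contact
address, and delay values are illustrative, not prescribed by the thread.

    # Sketch of the throttled fill-in downloader described above.
    # Assumptions: Python 3, the `requests` library, one URL per line in the
    # input file, and cache status exposed via the X-Cache response header.
    import time
    from pathlib import Path

    import requests

    # Use real contact details here so ops can reach you if there's an issue.
    USER_AGENT = "example-image-fetcher/0.1 (contact: you@example.com)"
    MISS_DELAY = 1.0   # aim for no more than ~1 cache miss per second
    HIT_DELAY = 0.1    # small pause even on cache hits

    def fetch_all(url_list_file: str, out_dir: str) -> None:
        session = requests.Session()          # one connection, reused
        session.headers["User-Agent"] = USER_AGENT
        out = Path(out_dir)
        out.mkdir(parents=True, exist_ok=True)

        for url in Path(url_list_file).read_text().splitlines():
            url = url.strip()
            if not url:
                continue
            resp = session.get(url, timeout=60)
            resp.raise_for_status()
            (out / url.rsplit("/", 1)[-1]).write_bytes(resp.content)

            # Back off harder when the request was a cache miss.
            cache_status = resp.headers.get("X-Cache", "").lower()
            time.sleep(MISS_DELAY if "miss" in cache_status else HIT_DELAY)

    if __name__ == "__main__":
        fetch_all("missing_images.txt", "images")

In practice you would also verify each downloaded file against the checksums
Jeremy mentions (e.g. the SHA-1 values available from the imageinfo API) and
retry on transient errors, neither of which this sketch attempts.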