Thanks for helping to distribute Wikipedia more broadly, Mihai. Do give Kiwix for Android [1] a shot, as it does something very similar to your app. Perhaps you can even collaborate on the project.
--tomasz

[1] - https://play.google.com/store/apps/details?id=org.kiwix.kiwixmobile

On Tue, Sep 24, 2013 at 12:24 AM, Mihai Chintoanu
<mihai.chinto...@skobbler.com> wrote:
> Hello,
>
> Thank you to all who have taken the time to answer.
>
> As more people have asked, here are some details about the project. We want
> to build a feature in our smartphone app that allows users to read Wikipedia
> articles, and we want to make the articles and their images available to them
> offline, which is why we first have to download this content from Wikipedia
> and Wikimedia. We have installed a Wikipedia mirror locally and extracted the
> desired article texts through the API. For the images, we first thought about
> getting the image dump tarballs. However, the articles (and consequently the
> images) are spread over several language domains, so this approach would have
> been both inefficient and too space-consuming.
>
> I'll look into the rsync approach.
>
> Once again, many thanks for all your suggestions.
> Mihai
>
> -----Original Message-----
> From: wikitech-l-boun...@lists.wikimedia.org
> [mailto:wikitech-l-boun...@lists.wikimedia.org] On Behalf Of Jeremy Baron
> Sent: 23 September 2013 17:12
> To: Wikimedia developers; Wikipedia Xmldatadumps-l
> Subject: Re: [Wikitech-l] Bulk download
>
> On Sep 23, 2013 9:25 AM, "Mihai Chintoanu" <mihai.chinto...@skobbler.com>
> wrote:
>> I have a list of about 1.8 million images which I have to download from
>> commons.wikimedia.org. Is there any simple way to do this which doesn't
>> involve an individual HTTP hit for each image?
>
> You mean full-size originals, not thumbs scaled to a certain size, right?
>
> You should rsync from a mirror[0] (rsync allows specifying a list of files
> to copy) and then fill in the missing images from upload.wikimedia.org.
> For upload.wikimedia.org, I'd say you should throttle yourself to one cache
> miss per second (you can check the headers on a response to see if it was a
> hit or a miss, and back off when you get a miss), and you shouldn't use more
> than one or two simultaneous HTTP connections. In any case, make sure you
> have an accurate UA string with contact info (email address) so ops can
> contact you if there's an issue.
>
> At the moment there's only one mirror and it's ~6-12 months out of date, so
> there may be a substantial amount to fill in. And of course you should be
> getting checksums from somewhere (the API?) and verifying them. If your
> images are all missing from the mirror, then it should take around 40 days
> at 0.5 img/sec, but I guess you could probably do it in less than 10 days if
> you have a fast enough pipe. (It depends on whether you get a lot of misses
> or hits.)
>
> See also [1], but not all of that applies because upload.wikimedia.org isn't
> MediaWiki, so e.g. there is no maxlag param.
>
> -Jeremy
>
> [0] https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_XML_dumps#Media
> [1] https://www.mediawiki.org/wiki/API:Etiquette

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
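For anyone following the same route, here is a minimal sketch of the "fill in the
missing files" step Jeremy describes: single connection, descriptive User-Agent,
and backing off whenever a response was a cache miss. It assumes Python 3 with the
`requests` library, a plain-text file of upload.wikimedia.org URLs, and that the
cache status is reported in the X-Cache response header; the file names, contact
address, and delay values are illustrative, not prescribed by the thread.

    # Sketch of the throttled fill-in downloader described above.
    # Assumptions: Python 3, the `requests` library, one URL per line in the
    # input file, and cache status exposed via the X-Cache response header.
    import time
    from pathlib import Path

    import requests

    # Use real contact details here so ops can reach you if there's an issue.
    USER_AGENT = "example-image-fetcher/0.1 (contact: you@example.com)"
    MISS_DELAY = 1.0   # aim for no more than ~1 cache miss per second
    HIT_DELAY = 0.1    # small pause even on cache hits

    def fetch_all(url_list_file: str, out_dir: str) -> None:
        session = requests.Session()          # one connection, reused
        session.headers["User-Agent"] = USER_AGENT
        out = Path(out_dir)
        out.mkdir(parents=True, exist_ok=True)

        for url in Path(url_list_file).read_text().splitlines():
            url = url.strip()
            if not url:
                continue
            resp = session.get(url, timeout=60)
            resp.raise_for_status()
            (out / url.rsplit("/", 1)[-1]).write_bytes(resp.content)

            # Back off harder when the request was a cache miss.
            cache_status = resp.headers.get("X-Cache", "").lower()
            time.sleep(MISS_DELAY if "miss" in cache_status else HIT_DELAY)

    if __name__ == "__main__":
        fetch_all("missing_images.txt", "images")

In practice you would also verify each downloaded file against the checksums
Jeremy mentions (e.g. the SHA-1 values available from the imageinfo API) and
retry on transient errors, neither of which this sketch attempts.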