On Sat, Jul 18, 2009 at 3:20 PM, David Gerard <[email protected]> wrote:
> 2009/7/18 Alexandre Dulaunoy <[email protected]>:
>
>> I was wondering if it would be possible to allow web robots to access
>> http://upload.wikimedia.org/wikipedia/commons/ to gather and mirror
>> the media files. As this is pure HTTP, the mirroring could benefit from
>> the caching mechanisms of HTTP objects (instead of having a large dump
>> containing all the media files, which is more difficult to cache/update).
>
> I see lots of files on upload.wikimedia.org on Google Image Search
> already. Is that actually forbidden by our robots.txt?
>
> It'd actually be better if Google properly indexed text pages whose
> names end in .jpg or whatever ... but they're aware we'd like that, so
> it's up to them.
But the current directory listing (of the upload dir) is disallowed, for example:

http://upload.wikimedia.org/wikipedia/commons/8/8c/

Of course, a bot would still be able to get the media files by following
the links from the other pages, but this is not very handy/effective for
making an exact mirror of just the current media files repository.

Would it be possible to enable directory listing of
http://upload.wikimedia.org/wikipedia/commons and its subdirectories?

Thanks for the feedback,

--
-- Alexandre Dulaunoy (adulau) -- http://www.foo.be/
-- http://www.foo.be/cgi-bin/wiki.pl/Diary
-- "Knowledge can create problems, it is not through ignorance
-- that we can solve them" Isaac Asimov

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
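[Editor's note: a mirror bot of the kind discussed above would normally test each URL against robots.txt before fetching it. The following is a minimal sketch using Python's standard urllib.robotparser; the robots.txt excerpt here is hypothetical, for illustration only, and does not reflect Wikimedia's actual rules.]

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt excerpt (illustrative only): directory
# listings under /listing/ are blocked, media file paths are open.
ROBOTS_TXT = """\
User-agent: *
Disallow: /listing/
"""

rp = RobotFileParser()
# parse() accepts the file's content as a list of lines, so a bot can
# feed it the body of a previously fetched robots.txt.
rp.parse(ROBOTS_TXT.splitlines())

# A polite mirror bot checks each candidate URL before requesting it.
blocked = rp.can_fetch("*", "http://upload.wikimedia.org/listing/8/8c/")
allowed = rp.can_fetch("*", "http://upload.wikimedia.org/wikipedia/commons/8/8c/Example.jpg")
print(blocked, allowed)  # False True
```

In real use the bot would fetch robots.txt once per host (e.g. via `rp.set_url(...)` and `rp.read()`) and cache the parsed rules for the crawl.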
