On Sat, Jul 18, 2009 at 3:20 PM, David Gerard<[email protected]> wrote:
> 2009/7/18 Alexandre Dulaunoy <[email protected]>:
>
>> I was wondering if it would be possible to allow web robots to access
>> http://upload.wikimedia.org/wikipedia/commons/ to gather and mirror
>> the media files. As this is plain HTTP, the mirroring could benefit from
>> HTTP's object-caching mechanisms (instead of a large dump containing all
>> the media files, which is harder to cache and update).
>
>
> I see lots of files on upload.wikimedia.org on Google Image Search
> already. Is that actually forbidden by our robots.txt?
>
> It'd actually be better if Google properly indexed text pages whose
> name ends in .jpg or whatever ... but they're aware we'd like that, so
> it's up to them.

But directory listings under the upload directory are currently disallowed, for example:

http://upload.wikimedia.org/wikipedia/commons/8/8c/
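For what it's worth, the effect of such a Disallow rule can be checked locally with Python's urllib.robotparser. The rule set below is a hypothetical sketch for illustration, not the actual robots.txt served by upload.wikimedia.org:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules mimicking a Disallow on the commons tree;
# the real robots.txt on upload.wikimedia.org may differ.
rules = """\
User-agent: *
Disallow: /wikipedia/commons/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)

# A well-behaved crawler would skip the directory listing:
print(rp.can_fetch("MirrorBot",
                   "http://upload.wikimedia.org/wikipedia/commons/8/8c/"))  # False
```

Under these rules the check returns False, which is why a compliant mirror bot never sees the listing at all.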

Of course, a bot can still reach the media files by
following links from other pages, but that is neither
handy nor efficient for making an exact mirror of just
the current media file repository.

Would it be possible to enable directory listings for
http://upload.wikimedia.org/wikipedia/commons
and its subdirectories?
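As a rough sketch of the incremental update that HTTP-level mirroring allows: compare the server's Last-Modified date against the mirrored copy's and re-fetch only when it changed. The function name and header values below are illustrative, not an existing tool:

```python
from email.utils import parsedate_to_datetime

def needs_refetch(remote_last_modified, local_last_modified):
    """Return True when the server's copy is newer than the mirror's,
    given HTTP-date strings such as Last-Modified header values."""
    return (parsedate_to_datetime(remote_last_modified)
            > parsedate_to_datetime(local_last_modified))

# Server reports a newer timestamp than the mirrored file: re-fetch it.
print(needs_refetch("Sat, 18 Jul 2009 15:20:00 GMT",
                    "Fri, 17 Jul 2009 00:00:00 GMT"))  # True
```

In practice the same comparison happens server-side via a conditional GET (If-Modified-Since), so an unchanged file costs only a 304 Not Modified response instead of a full download.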

Thanks for the feedback,


-- 
--                   Alexandre Dulaunoy (adulau) -- http://www.foo.be/
--                             http://www.foo.be/cgi-bin/wiki.pl/Diary
--         "Knowledge can create problems, it is not through ignorance
--                                that we can solve them" Isaac Asimov

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
