https://bugzilla.wikimedia.org/show_bug.cgi?id=62468
--- Comment #4 from Nathan Larson <[email protected]> ---
I suspect most MediaWiki installations have robots.txt set up as
recommended at [[mw:Manual:Robots.txt#With_short_URLs]], with:

    User-agent: *
    Disallow: /w/

See, for example:

* https://web.archive.org/web/20140310075905/http://en.wikipedia.org/w/index.php?title=Main_Page&action=edit
* https://web.archive.org/web/20140310075905/http://en.wikipedia.org/w/index.php?title=Main_Page&action=raw

So the Internet Archive couldn't retrieve action=raw even if it wanted
to. In fact, if I were to set up a script to download it, might I not be
in violation of robots.txt, which would make my script an ill-behaved
bot? I'm not sure my moral fiber can handle an ethical breach of that
magnitude. (A compliant script would check robots.txt before fetching;
see the first sketch below.)

However, some sites do allow indexing of their edit and raw pages, e.g.:

* https://web.archive.org/web/20131204083339/https://encyclopediadramatica.es/index.php?title=Wikipedia&action=edit
* https://web.archive.org/web/20131204083339/https://encyclopediadramatica.es/index.php?title=Wikipedia&action=raw
* https://web.archive.org/web/20131012144928/http://rationalwiki.org/w/index.php?title=Wikipedia&action=edit
* https://web.archive.org/web/20131012144928/http://rationalwiki.org/w/index.php?title=Wikipedia&action=raw

Dramatica and RationalWiki use all kinds of secret sauces, though, so
who knows what's going on there. Normally, edit pages carry

    <meta name="robots" content="noindex,nofollow" />

but that's not the case on Dramatica or RationalWiki edit pages (the
second sketch below shows a quick way to check). Is there some config
setting or extension that changes the robot policy on edit pages? Also,
I wonder whether they had to ask the Internet Archive to archive those
pages, or whether the Internet Archive did it on its own initiative.
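
For what it's worth, the compliance check is easy to automate. Here is a
minimal sketch using Python's standard-library robotparser, assuming the
wiki serves robots.txt at the usual top-level location:

    from urllib import robotparser

    # Load Wikipedia's robots.txt (the same file the Internet Archive
    # would consult before crawling).
    rp = robotparser.RobotFileParser("https://en.wikipedia.org/robots.txt")
    rp.read()

    # The action=raw URL from the examples above. With "Disallow: /w/"
    # in place, can_fetch() should return False for any /w/ URL.
    url = "https://en.wikipedia.org/w/index.php?title=Main_Page&action=raw"
    if rp.can_fetch("*", url):
        print("robots.txt permits fetching", url)
    else:
        print("robots.txt disallows", url, "- a well-behaved bot skips it")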
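
And as for which wikis noindex their edit pages, that can be checked
directly. A rough sketch, again standard library only; matching on raw
HTML is crude and a real check would use an HTML parser, but it answers
the question for a given URL:

    from urllib.request import Request, urlopen

    # One of the RationalWiki edit pages discussed above. The User-Agent
    # name here is made up for this illustration.
    url = "http://rationalwiki.org/w/index.php?title=Wikipedia&action=edit"
    req = Request(url, headers={"User-Agent": "robot-policy-check/0.1"})
    html = urlopen(req).read().decode("utf-8", errors="replace")

    # MediaWiki normally emits the noindex,nofollow robots meta tag on
    # edit pages; its absence is what lets these pages get archived.
    if 'name="robots"' in html and "noindex" in html:
        print("edit page is marked noindex")
    else:
        print("no noindex robots meta tag found")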
