[Bug 62468] Add option to have the Internet Archiver (and/or other robots) retrieve raw wikitext of all pages

2014-03-11 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=62468

--- Comment #8 from Nathan Larson  ---
(In reply to Nemo from comment #7)
> Did you include dofollow links to action=raw URLs in your skin?

I added the following line to MediaWiki:Sidebar:

**{{fullurl:{{FULLPAGENAMEE}}|action=raw}}|View raw wikitext

As a backup, I also added a sidebar link to Special:WikiWikitext (per the
instructions at [[mw:Extension:ViewWikitext]]). Of course, most people won't
want that on their sidebar. I started a page on this at
[[mw:Manual:Internet Archive]].
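
For anyone trying the same thing, here is roughly where such a line sits in a
stock MediaWiki:Sidebar (the "navigation" section and its first three entries
below are just the usual defaults; only the last line is the addition, and
whether the parser functions expand there may depend on the MediaWiki version):

* navigation
** mainpage|mainpage-description
** recentchanges-url|recentchanges
** randompage-url|randompage
** {{fullurl:{{FULLPAGENAMEE}}|action=raw}}|View raw wikitext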



[Bug 62468] Add option to have the Internet Archiver (and/or other robots) retrieve raw wikitext of all pages

2014-03-11 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=62468

--- Comment #7 from Nemo  ---
(In reply to Nathan Larson from comment #6)
> So, I guess a few months from now, I'll see whether the archive of my wiki
> for 12 March 2014 and thereafter has the raw pages. If not, that's a bug, I
> think.

Did you include dofollow links to action=raw URLs in your skin?



[Bug 62468] Add option to have the Internet Archiver (and/or other robots) retrieve raw wikitext of all pages

2014-03-11 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=62468

--- Comment #6 from Nathan Larson  ---
(In reply to Nemo from comment #5)
> IA doesn't crawl on request.
> On the question of whether "Allow" directives do or should take precedence
> over other directives, please see (and reply to)
> https://archive.org/post/1004436/googles-robotstxt-rules-interpreted-too-strictly-by-wayback-machine

I might reply to that as more information becomes available. Today I set my
site's robots.txt to:

User-agent: *
Disallow: /w/

User-agent: ia_archiver
Allow: /*&action=raw

So, I guess a few months from now, I'll see whether the archive of my wiki for
12 March 2014 and thereafter has the raw pages. If not, that's a bug, I think.
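
One caveat (my reading, pending what the ia_archiver people say): most
robots.txt parsers apply only the single most specific matching User-agent
group, so the ia_archiver group above, having no Disallow of its own, might
open the whole site to ia_archiver rather than just action=raw. If the intent
is "keep /w/ disallowed for ia_archiver too, except raw wikitext", something
like this states it explicitly, assuming Allow is given precedence at all
(which is exactly what the archive.org thread above is about):

User-agent: ia_archiver
Disallow: /w/
Allow: /*&action=raw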



[Bug 62468] Add option to have the Internet Archiver (and/or other robots) retrieve raw wikitext of all pages

2014-03-11 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=62468

--- Comment #5 from Nemo  ---
IA doesn't crawl on request.
On the question of whether "Allow" directives do or should take precedence over
other directives, please see (and reply to)
https://archive.org/post/1004436/googles-robotstxt-rules-interpreted-too-strictly-by-wayback-machine



[Bug 62468] Add option to have the Internet Archiver (and/or other robots) retrieve raw wikitext of all pages

2014-03-10 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=62468

--- Comment #4 from Nathan Larson  ---
I suspect most MediaWiki installations have robots.txt set up as recommended at
[[mw:Manual:Robots.txt#With_short_URLs]], with

User-agent: *
Disallow: /w/

See for example:

* https://web.archive.org/web/20140310075905/http://en.wikipedia.org/w/index.php?title=Main_Page&action=edit
* https://web.archive.org/web/20140310075905/http://en.wikipedia.org/w/index.php?title=Main_Page&action=raw

So, they couldn't retrieve action=raw even if they wanted to. In fact, if I
were to set up a script to download it, might I not be in violation of
robots.txt, which would make my script an ill-behaving bot? I'm not sure my
moral fiber can handle an ethical breach of that magnitude. However, some sites
do allow indexing of their edit and raw pages, e.g.

https://web.archive.org/web/20131204083339/https://encyclopediadramatica.es/index.php?title=Wikipedia&action=edit
https://web.archive.org/web/20131204083339/https://encyclopediadramatica.es/index.php?title=Wikipedia&action=raw
https://web.archive.org/web/20131012144928/http://rationalwiki.org/w/index.php?title=Wikipedia&action=edit
https://web.archive.org/web/20131012144928/http://rationalwiki.org/w/index.php?title=Wikipedia&action=raw

Dramatica and RationalWiki use all kinds of secret sauces, though, so who knows
what's going on there. Normally, edit pages have a
<meta name="robots" content="noindex,nofollow"/> tag, but that's not the case
with the Dramatica or RationalWiki edit pages. Is there some config setting or
extension that changes the robot policy on edit pages? Also, I wonder whether
they had to tell the Internet Archive to archive those pages, or whether the
Internet Archive did it on its own initiative.
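
(For anyone who wants to check this on their own wiki: the robot policy an edit
page advertises is visible in its HTML head, e.g. with curl and grep; the URL
below is a placeholder for any wiki's action=edit URL.)

curl -s 'https://example.org/w/index.php?title=Main_Page&action=edit' | grep -io '<meta name="robots"[^>]*>'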



[Bug 62468] Add option to have the Internet Archiver (and/or other robots) retrieve raw wikitext of all pages

2014-03-10 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=62468

Nemo  changed:

           What|Removed       |Added
 ----------------------------------------------
         Status|NEW           |UNCONFIRMED
 Ever confirmed|1             |0

--- Comment #3 from Nemo  ---
It makes little sense to "archive" wikitext via action=edit; there is action=raw
for that. But the IA crawler won't follow action=raw links (there are none),
and, as you say, there is no indication that fetching action=edit would work.
I propose two things:
1) install Heritrix and check whether it can fetch action=edit: if not, file a
bug and see what they say; if yes, ask the IA folks on the "FAQ" forum and see
what they say;
2) just download the data yourself and upload it to archive.org: you only need
wget with --warc-file and then upload the result in your favorite way (a sketch
follows below).
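
For 2), a minimal sketch, assuming you already have a file of action=raw URLs,
one per line (titles.txt and the WARC name are placeholders):

wget --input-file=titles.txt --warc-file=mywiki-wikitext --wait=1

wget then writes the fetched responses to mywiki-wikitext.warc.gz, which can be
uploaded to archive.org as-is.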



[Bug 62468] Add option to have the Internet Archiver (and/or other robots) retrieve raw wikitext of all pages

2014-03-09 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=62468

--- Comment #2 from Nathan Larson  ---
Theoretically, you could put something in your robots.txt allowing the Internet
Archiver to index the edit pages:
https://www.mediawiki.org/wiki/Robots.txt#Allow_indexing_of_edit_pages_by_the_Internet_Archiver

I'm not sure how well the particular implementation suggested there works,
though; from what I can tell, it doesn't. Also, most archived wiki pages I've
seen haven't had an "Edit" link.
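
(For reference, the general shape of such a rule, as a sketch rather than a
quote from that page; whether ia_archiver honors Allow lines and "*" wildcards
at all is the open question:

User-agent: ia_archiver
Allow: /w/index.php?title=*&action=edit

If it doesn't, that would explain why the suggested implementation doesn't seem
to work.)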



[Bug 62468] Add option to have the Internet Archiver (and/or other robots) retrieve raw wikitext of all pages

2014-03-09 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=62468

Nathan Larson  changed:

           What|Removed       |Added
 ----------------------------------------------
       Priority|Unprioritized |Lowest



[Bug 62468] Add option to have the Internet Archiver (and/or other robots) retrieve raw wikitext of all pages

2014-03-09 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=62468

--- Comment #1 from Nathan Larson  ---
I was going to say, there should also be an option to let the Archiver access
Special:AllPages (or a variant of it), so that all the pages can be browsed
easily. Currently, when browsing archived pages, it's often necessary to find
the page one is looking for by going from link to link, category to category,
etc.
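
(For concreteness, a sketch of what letting ia_archiver into Special:AllPages
might look like on the robots.txt side, assuming a typical short-URL setup with
/wiki/ for page views and /w/index.php for everything else; the paths are
illustrative, not tested:

User-agent: ia_archiver
Allow: /wiki/Special:AllPages
Allow: /w/index.php?title=Special:AllPages*

The second Allow line is for the paginated continuation links.)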

-- 
You are receiving this mail because:
You are the assignee for the bug.
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l