[Bug 62468] Add option to have the Internet Archiver (and/or other robots) retrieve raw wikitext of all pages
https://bugzilla.wikimedia.org/show_bug.cgi?id=62468

--- Comment #8 from Nathan Larson ---
(In reply to Nemo from comment #7)
> Did you include dofollow links to action=raw URLs in your skin?

I put this in MediaWiki:Sidebar:

  **{{fullurl:{{FULLPAGENAMEE}}|action=raw}}|View raw wikitext

As a backup, I also added a sidebar link to Special:WikiWikitext (per the
instructions at [[mw:Extension:ViewWikitext]]), just to be sure. Of course,
most people won't want that on their sidebar.

I started a page on this at [[mw:Manual:Internet Archive]].
[Bug 62468] Add option to have the Internet Archiver (and/or other robots) retrieve raw wikitext of all pages
https://bugzilla.wikimedia.org/show_bug.cgi?id=62468

--- Comment #7 from Nemo ---
(In reply to Nathan Larson from comment #6)
> So, I guess a few months from now, I'll see whether the archive of my wiki
> for 12 March 2014 and thereafter has the raw pages. If not, that's a bug,
> I think.

Did you include dofollow links to action=raw URLs in your skin?
[Bug 62468] Add option to have the Internet Archiver (and/or other robots) retrieve raw wikitext of all pages
https://bugzilla.wikimedia.org/show_bug.cgi?id=62468

--- Comment #6 from Nathan Larson ---
(In reply to Nemo from comment #5)
> IA doesn't crawl on request.
> On what "Allow" directives and other directives do or should take
> precedence, please see (and reply) on
> https://archive.org/post/1004436/googles-robotstxt-rules-interpreted-too-strictly-by-wayback-machine

I might reply to that, as more information becomes available. Today, I set
my site's robots.txt to say:

  User-agent: *
  Disallow: /w/

  User-agent: ia_archiver
  Allow: /*&action=raw

So, I guess a few months from now, I'll see whether the archive of my wiki
for 12 March 2014 and thereafter has the raw pages. If not, that's a bug, I
think.
[Bug 62468] Add option to have the Internet Archiver (and/or other robots) retrieve raw wikitext of all pages
https://bugzilla.wikimedia.org/show_bug.cgi?id=62468

--- Comment #5 from Nemo ---
IA doesn't crawl on request.

On the question of when "Allow" directives and other directives do or should
take precedence, please see (and reply on)
https://archive.org/post/1004436/googles-robotstxt-rules-interpreted-too-strictly-by-wayback-machine
[Bug 62468] Add option to have the Internet Archiver (and/or other robots) retrieve raw wikitext of all pages
https://bugzilla.wikimedia.org/show_bug.cgi?id=62468

--- Comment #4 from Nathan Larson ---
I suspect most MediaWiki installations have robots.txt set up as recommended
at [[mw:Manual:Robots.txt#With_short_URLs]], with:

  User-agent: *
  Disallow: /w/

See for example:
* https://web.archive.org/web/20140310075905/http://en.wikipedia.org/w/index.php?title=Main_Page&action=edit
* https://web.archive.org/web/20140310075905/http://en.wikipedia.org/w/index.php?title=Main_Page&action=raw

So the Internet Archive couldn't retrieve action=raw even if it wanted to.
In fact, if I were to set up a script to download those pages, might I not
be in violation of robots.txt, which would make my script an ill-behaved
bot? I'm not sure my moral fiber can handle an ethical breach of that
magnitude.

However, some sites do allow indexing of their edit and raw pages, e.g.:
* https://web.archive.org/web/20131204083339/https://encyclopediadramatica.es/index.php?title=Wikipedia&action=edit
* https://web.archive.org/web/20131204083339/https://encyclopediadramatica.es/index.php?title=Wikipedia&action=raw
* https://web.archive.org/web/20131012144928/http://rationalwiki.org/w/index.php?title=Wikipedia&action=edit
* https://web.archive.org/web/20131012144928/http://rationalwiki.org/w/index.php?title=Wikipedia&action=raw

Dramatica and RationalWiki use all kinds of secret sauces, though, so who
knows what's going on there. Normally, edit pages carry a
<meta name="robots" content="noindex,nofollow"/> tag, but that's not the
case with Dramatica or RationalWiki edit pages. Is there some config setting
or extension that changes the robot policy on edit pages? Also, I wonder if
they had to tell the Internet Archive to archive those pages, or if the
Internet Archive just did it on its own initiative.
[Bug 62468] Add option to have the Internet Archiver (and/or other robots) retrieve raw wikitext of all pages
https://bugzilla.wikimedia.org/show_bug.cgi?id=62468

Nemo changed:

           What    |Removed        |Added
 ------------------------------------------------
         Status    |NEW            |UNCONFIRMED
 Ever confirmed    |1              |0

--- Comment #3 from Nemo ---
It makes little sense to "archive" wikitext via action=edit; there is
action=raw for that. But the IA crawler won't follow action=raw links (there
are none), and as you say there is no indication that fetching action=edit
would work. I propose two things:
1) Install Heritrix and check whether it can fetch action=edit. If not, file
a bug and see what they say; if yes, ask the IA folks on the "FAQ" forum and
see what they say.
2) Just download such data yourself and upload it to archive.org: you only
need wget --warc, and then upload in your favorite way (see the sketch
below).
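A minimal sketch of option 2, assuming a urls.txt file listing the
action=raw URLs to preserve (the file and WARC names here are only
placeholders):

  # urls.txt holds one action=raw URL per line, e.g.
  #   https://example.org/w/index.php?title=Main_Page&action=raw
  # Fetch them politely and record the traffic into a WARC file:
  wget --input-file=urls.txt \
       --warc-file=mywiki-raw-wikitext \
       --wait=1 \
       --output-document=/dev/null
  # Produces mywiki-raw-wikitext.warc.gz, ready to upload to archive.org.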
[Bug 62468] Add option to have the Internet Archiver (and/or other robots) retrieve raw wikitext of all pages
https://bugzilla.wikimedia.org/show_bug.cgi?id=62468

--- Comment #2 from Nathan Larson ---
Theoretically, you could put something in your robots.txt allowing the
Internet Archiver to index the edit pages:
https://www.mediawiki.org/wiki/Robots.txt#Allow_indexing_of_edit_pages_by_the_Internet_Archiver

I'm not sure how well the particular implementation suggested there works,
though; from what I can tell, it doesn't. Also, most archived wiki pages
I've seen haven't had an "Edit" link.
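For reference, a directive along these lines is the kind of thing that
manual section describes; whether ia_archiver actually honors wildcard
Allow rules this way is precisely what's in doubt:

  User-agent: ia_archiver
  Allow: /*&action=edit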
[Bug 62468] Add option to have the Internet Archiver (and/or other robots) retrieve raw wikitext of all pages
https://bugzilla.wikimedia.org/show_bug.cgi?id=62468

Nathan Larson changed:

           What    |Removed        |Added
 ------------------------------------------------
       Priority    |Unprioritized  |Lowest
[Bug 62468] Add option to have the Internet Archiver (and/or other robots) retrieve raw wikitext of all pages
https://bugzilla.wikimedia.org/show_bug.cgi?id=62468

--- Comment #1 from Nathan Larson ---
I was going to say, there should also be an option to let the Archiver
access Special:AllPages or a variant of it, so that all the pages can be
easily browsed. Currently, when browsing archived pages, it often seems
necessary to find the page one is looking for by going from link to link,
category to category, etc.
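In robots.txt terms, that would presumably amount to something like the
following, assuming short URLs under /wiki/ (the path is only illustrative
and varies per wiki):

  User-agent: ia_archiver
  Allow: /wiki/Special:AllPages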