Re: [Wikimedia-l] Internet archive and automatic retroactive robots.txt (was Re: Internet archive and strategy survey (was Re: 24 TB for User:Dispenser on Tool Labs please))

2014-07-07 Thread David Gerard
On 7 July 2014 21:08, Brad Jorsch (Anomie) wrote:
> On Mon, Jul 7, 2014 at 1:49 PM, Federico Leva (Nemo) wrote:
>> Brad Jorsch (Anomie), 07/07/2014 17:37:

>> > And the robots.txt for the new
>> > version of the site denies everything, likely because the new owners
>> > don't want the redirects or other old content showing up in Google
>> > searches. But this has the unfortunate side effect that IA removes all
>> > the old content from public access.

>> This is not correct. If you can reproduce it, please file a bug; that's
>> not how it's supposed, or documented, to work. I've collected some links
>> with further information at
>> https://archive.org/post/1019415/retroactive-robotstxt-removal-of-past-crawls-aka-oakland-archive-policy
>> (also reposting an elaborated version of my message from this morning).

> I'm confused. You say this is not correct, but then you link to a post of
> your own that does not refute it and that has many links to people
> confirming it.



Indeed, I was about to note the same. Nemo, there are multiple links
from that page confirming that IA does retroactive takedowns. When you
say they don't, do you have anything to back up that claim?


- d.


Re: [Wikimedia-l] Internet archive and automatic retroactive robots.txt (was Re: Internet archive and strategy survey (was Re: 24 TB for User:Dispenser on Tool Labs please))

2014-07-07 Thread Brad Jorsch (Anomie)
On Mon, Jul 7, 2014 at 1:49 PM, Federico Leva (Nemo) wrote:

> Brad Jorsch (Anomie), 07/07/2014 17:37:
> > And the robots.txt for the new
> > version of the site denies everything, likely because the new owners
> > don't want the redirects or other old content showing up in Google
> > searches. But this has the unfortunate side effect that IA removes all
> > the old content from public access.
>
> This is not correct. If you can reproduce it, please file a bug; that's
> not how it's supposed, or documented, to work. I've collected some links
> with further information at
> https://archive.org/post/1019415/retroactive-robotstxt-removal-of-past-crawls-aka-oakland-archive-policy
> (also reposting an elaborated version of my message from this morning).
>



I'm confused. You say this is not correct, but then you link to a post of
your own that does not refute it and that has many links to people
confirming it.

Re: [Wikimedia-l] Internet archive and automatic retroactive robots.txt (was Re: Internet archive and strategy survey (was Re: 24 TB for User:Dispenser on Tool Labs please))

2014-07-07 Thread Federico Leva (Nemo)
Brad Jorsch (Anomie), 07/07/2014 17:37:
> And the robots.txt for the new
> version of the site denies everything, likely because the new owners don't
> want the redirects or other old content showing up in Google searches. But
> this has the unfortunate side effect that IA removes all the old content
> from public access.

This is not correct. If you can reproduce it, please file a bug; that's
not how it's supposed, or documented, to work. I've collected some links
with further information at
https://archive.org/post/1019415/retroactive-robotstxt-removal-of-past-crawls-aka-oakland-archive-policy
(also reposting an elaborated version of my message from this morning).

Nemo


[Wikimedia-l] Internet archive and automatic retroactive robots.txt (was Re: Internet archive and strategy survey (was Re: 24 TB for User:Dispenser on Tool Labs please))

2014-07-07 Thread Brad Jorsch (Anomie)
On Mon, Jul 7, 2014 at 5:21 AM, James Salsman wrote:

> Kevin Gorman wrote:
>
> > Regarding the IA: they have a significant interest in working with the
> > Wikimedia projects, a lot more experience than the Wikimedia projects
> > have caching absolutely tremendous quantities of data, a willingness
> > to handle a degree of legal risk that would be inappropriate for the
> > Wikimedia projects to take on
>
> Because they censor things retroactively when requested by new domain
> owners' robots.txt,




This point shouldn't get lost among the various other issues of more
dubious veracity or applicability raised in the original message.

I've seen cases where a change of domain ownership or a major corporate
restructuring results in a domain being completely reorganized or even
redirected wholesale to some other domain. And the robots.txt for the new
version of the site denies everything, likely because the new owners don't
want the redirects or other old content showing up in Google searches. But
this has the unfortunate side effect that IA removes all the old content
from public access.
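
For reference, the blanket robots.txt described above is just the standard
two-line deny-all rule, which tells every compliant crawler to stay away
from the entire site:

    User-agent: *
    Disallow: /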

I really wish that IA would reconsider their policy of *automatically*
retroactively honoring robots.txt.
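
For anyone who wants to check whether a given URL's snapshots are still
being served, the Wayback Machine has a public Availability API. Below is
a minimal sketch in Python; the target URL is a placeholder, not a
specific affected site, and this is only a rough check, since the API
doesn't distinguish "never crawled" from "excluded by robots.txt":

    import json
    import urllib.parse
    import urllib.request

    def wayback_snapshot(url):
        """Return the closest publicly served Wayback snapshot, or None."""
        api = ("https://archive.org/wayback/available?url="
               + urllib.parse.quote(url, safe=""))
        with urllib.request.urlopen(api) as resp:
            data = json.load(resp)
        # An empty "archived_snapshots" object means nothing is publicly
        # served for this URL: either it was never captured, or access has
        # been blocked (e.g. by a retroactively honored robots.txt).
        return data.get("archived_snapshots", {}).get("closest")

    # Placeholder URL; substitute a page from the domain in question.
    print(wayback_snapshot("http://example.com/old-page"))

If a domain is known to have been crawled in the past but this returns
None after a deny-all robots.txt appears, that is the retroactive removal
under discussion.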