Re: [rust-dev] robots.txt prevents Archive.org from storing old documentation

2014-07-14 Thread richo

On 10/07/14 17:15, Ivan Kozik wrote:

> On Thu, Jul 10, 2014 at 3:49 PM, Jonas Wielicki
> j.wieli...@sotecware.net wrote:
>
>> While this is a good thing /all/ software projects should be doing imo,
>> one could still explicitly allow Archive.org by prepending:
>>
>> User-agent: ia_archiver
>> Disallow:
>>
>> ?
>
> It looks like documentation for the old versions e.g.
> http://doc.rust-lang.org/0.9/ is a 404 anyway.



Which is kinda the point of letting archive.org keep a copy for posterity.

Not that I have a bunch of sway, but I'm +1 on not letting old docs be
searchable, and also +1 on making an exception for archive.org.
___
Rust-dev mailing list
Rust-dev@mozilla.org
https://mail.mozilla.org/listinfo/rust-dev


Re: [rust-dev] robots.txt prevents Archive.org from storing old documentation

2014-07-14 Thread Brian Anderson

Can somebody file an issue describing exactly what we should do and cc me?

On 07/14/2014 01:13 AM, richo wrote:

> On 10/07/14 17:15, Ivan Kozik wrote:
>
>> On Thu, Jul 10, 2014 at 3:49 PM, Jonas Wielicki
>> j.wieli...@sotecware.net wrote:
>>
>>> While this is a good thing /all/ software projects should be doing imo,
>>> one could still explicitly allow Archive.org by prepending:
>>>
>>> User-agent: ia_archiver
>>> Disallow:
>>>
>>> ?
>>
>> It looks like documentation for the old versions e.g.
>> http://doc.rust-lang.org/0.9/ is a 404 anyway.
>
> Which is kinda the point of letting archive.org keep a copy for posterity.
>
> Not that I have a bunch of sway, but I'm +1 on not letting old docs be
> searchable, and also +1 on making an exception for archive.org.


Re: [rust-dev] robots.txt prevents Archive.org from storing old documentation

2014-07-14 Thread Chris Morgan
On Tue, Jul 15, 2014 at 4:16 AM, Brian Anderson bander...@mozilla.com wrote:
> Can somebody file an issue describing exactly what we should do and cc me?

Nothing. Absolutely nothing.

robots.txt rules do not apply to historical data; if archive.org has
archived something, the introduction of a new Disallow rule will not
remove the contents of a previous scan.

It therefore has three months in which to make a scan of a release
before that release is marked obsolete with the introduction of a
Disallow directive.

This is right and proper. Special casing a specific user agent is not
the right thing to do. The contents won’t be changing after the
release, anyway, so allowing archive.org to continue scanning it is a
complete waste of effort.


Re: [rust-dev] robots.txt prevents Archive.org from storing old documentation

2014-07-14 Thread Isaac Dupree
On 07/14/2014 09:56 PM, Chris Morgan wrote:
> On Tue, Jul 15, 2014 at 4:16 AM, Brian Anderson bander...@mozilla.com wrote:
>> Can somebody file an issue describing exactly what we should do and cc me?
>
> Nothing. Absolutely nothing.
>
> robots.txt rules do not apply to historical data; if archive.org has
> archived something, the introduction of a new Disallow rule will not
> remove the contents of a previous scan.

Although that is the robots.txt standard, archive.org does retroactively
apply robots.txt Disallow rules to already-archived content.
https://archive.org/about/exclude.php

> It therefore has three months in which to make a scan of a release
> before that release is marked obsolete with the introduction of a
> Disallow directive.
>
> This is right and proper. Special casing a specific user agent is not
> the right thing to do. The contents won’t be changing after the
> release, anyway, so allowing archive.org to continue scanning it is a
> complete waste of effort.

It's my understanding that archive.org doesn't have the funding to
reliably crawl everything on the Web promptly.  I agree with the
principle that "special casing a specific user agent is not the right
thing to do," but I also support the Internet Archive's mission.

Another option is an `X-Robots-Tag: noindex` HTTP header, which is more
robust at banning indexing[1], and it allows archiving (whereas
`X-Robots-Tag: noindex, noarchive` would disallow it).  It's likely less
robust from the perspective of keeping our website serving that header
consistently long-term, though.  For HTML files, the equivalent directive
can also go in a robots meta tag in the head.

-Isaac

[1] (Google can still list a robots.txt-disallowed page as a search
result if many sites it trusts link to that page)
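[Editor's note: a minimal sketch of the header approach Isaac describes, as an
nginx config fragment. The location pattern and the choice of nginx are
assumptions for illustration, not part of the actual rust-lang.org setup.]

```nginx
# Hypothetical nginx sketch: send X-Robots-Tag for the old doc trees so
# search engines drop them from results, while crawlers (including
# archive.org) can still fetch and store the pages themselves.
location ~ ^/0\.(3|4|5|6|7|8|9|10)/ {
    add_header X-Robots-Tag "noindex";
}
```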



Re: [rust-dev] robots.txt prevents Archive.org from storing old documentation

2014-07-14 Thread Evan G
It's not about special-casing a user agent; it's about archiving duplicate
copies of old documents. Right now, everything is crawled off of the
current docs, and none of the archived docs are allowed. With this change,
the IA would store multiple copies of old documentation: once as the old
entry for docs.rust-lang.org/ and once as the new entry for
docs.rust-lang.org/0.9/.

At least that's how I'm understanding the situation. Also, if you're really
interested, all you have to do is `git checkout 0.9` and run rustdoc.
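[Editor's note: a sketch of that procedure. The repository URL and the doc
build target are assumptions; the exact build steps for 0.x-era releases may
differ.]

```shell
# Rebuild the 0.9 docs locally instead of relying on the hosted copies.
git clone https://github.com/rust-lang/rust.git
cd rust
git checkout 0.9     # check out the 0.9 release tag
./configure          # configure the build (assumed step for this era)
make docs            # assumed doc-building target invoking rustdoc
```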


On Mon, Jul 14, 2014 at 9:30 PM, Isaac Dupree 
m...@isaac.cedarswampstudios.org wrote:

> On 07/14/2014 09:56 PM, Chris Morgan wrote:
>> On Tue, Jul 15, 2014 at 4:16 AM, Brian Anderson bander...@mozilla.com
>> wrote:
>>> Can somebody file an issue describing exactly what we should do and cc
>>> me?
>>
>> Nothing. Absolutely nothing.
>>
>> robots.txt rules do not apply to historical data; if archive.org has
>> archived something, the introduction of a new Disallow rule will not
>> remove the contents of a previous scan.
>
> Although that is the robots.txt standard, archive.org does retroactively
> apply robots.txt Disallow rules to already-archived content.
> https://archive.org/about/exclude.php
>
>> It therefore has three months in which to make a scan of a release
>> before that release is marked obsolete with the introduction of a
>> Disallow directive.
>>
>> This is right and proper. Special casing a specific user agent is not
>> the right thing to do. The contents won’t be changing after the
>> release, anyway, so allowing archive.org to continue scanning it is a
>> complete waste of effort.
>
> It's my understanding that archive.org doesn't have the funding to
> reliably crawl everything on the Web promptly.  I agree with the
> principle that "special casing a specific user agent is not the right
> thing to do," but I also support the Internet Archive's mission.
>
> Another option is an `X-Robots-Tag: noindex` HTTP header, which is more
> robust at banning indexing[1], and it allows archiving (whereas
> `X-Robots-Tag: noindex, noarchive` would disallow it).  It's likely less
> robust from the perspective of keeping our website serving that header
> consistently long-term, though.  For HTML files, the equivalent directive
> can also go in a robots meta tag in the head.
>
> -Isaac
>
> [1] (Google can still list a robots.txt-disallowed page as a search
> result if many sites it trusts link to that page)



[rust-dev] robots.txt prevents Archive.org from storing old documentation

2014-07-10 Thread Gioele Barabucci

Hi,

the current robots.txt on docs.rust-lang.org prevents Archive.org from 
storing copies of the old documentation. I think having the old 
documentation archived would be a good thing. BTW, all the documentation 
before 0.10 seems gone and this is a shame.


Could you please allow the Archive.org bot to index the site?

For the record:

$ curl http://doc.rust-lang.org/robots.txt
User-agent: *
Disallow: /0.3/
Disallow: /0.4/
Disallow: /0.5/
Disallow: /0.6/
Disallow: /0.7/
Disallow: /0.8/
Disallow: /0.9/
Disallow: /0.10/

--
Gioele Barabucci gio...@svario.it


Re: [rust-dev] robots.txt prevents Archive.org from storing old documentation

2014-07-10 Thread Daniel Micay
On 10/07/14 03:46 AM, Gioele Barabucci wrote:
> Hi,
>
> the current robots.txt on docs.rust-lang.org prevents Archive.org from
> storing copies of the old documentation. I think having the old
> documentation archived would be a good thing. BTW, all the documentation
> before 0.10 seems gone and this is a shame.
>
> Could you please allow the Archive.org bot to index the site?
>
> For the record:
>
> $ curl http://doc.rust-lang.org/robots.txt
> User-agent: *
> Disallow: /0.3/
> Disallow: /0.4/
> Disallow: /0.5/
> Disallow: /0.6/
> Disallow: /0.7/
> Disallow: /0.8/
> Disallow: /0.9/
> Disallow: /0.10/

The old documentation is all available from the Git repository. The
robots.txt rule is there to reverse the trend of searches being filled
with out of date documentation.





Re: [rust-dev] robots.txt prevents Archive.org from storing old documentation

2014-07-10 Thread Jonas Wielicki
On 10.07.2014 16:56, Daniel Micay wrote:
> On 10/07/14 03:46 AM, Gioele Barabucci wrote:
>> Hi,
>>
>> the current robots.txt on docs.rust-lang.org prevents Archive.org from
>> storing copies of the old documentation. I think having the old
>> documentation archived would be a good thing. BTW, all the documentation
>> before 0.10 seems gone and this is a shame.
>>
>> Could you please allow the Archive.org bot to index the site?
>>
>> For the record:
>>
>> $ curl http://doc.rust-lang.org/robots.txt
>> User-agent: *
>> Disallow: /0.3/
>> Disallow: /0.4/
>> Disallow: /0.5/
>> Disallow: /0.6/
>> Disallow: /0.7/
>> Disallow: /0.8/
>> Disallow: /0.9/
>> Disallow: /0.10/
>
> The old documentation is all available from the Git repository. The
> robots.txt rule is there to reverse the trend of searches being filled
> with out of date documentation.

While this is a good thing /all/ software projects should be doing imo,
one could still explicitly allow Archive.org by prepending:

User-agent: ia_archiver
Disallow:

?
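[Editor's note: concretely, the robots.txt with that exception prepended would
look something like this. A sketch; the Disallow list mirrors the one quoted
above.]

```text
# Let the Internet Archive crawl everything (empty Disallow = allow all).
User-agent: ia_archiver
Disallow:

# Keep the obsolete doc trees out of search engine results.
User-agent: *
Disallow: /0.3/
Disallow: /0.4/
Disallow: /0.5/
Disallow: /0.6/
Disallow: /0.7/
Disallow: /0.8/
Disallow: /0.9/
Disallow: /0.10/
```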

 
 
 


Re: [rust-dev] robots.txt prevents Archive.org from storing old documentation

2014-07-10 Thread Ivan Kozik
On Thu, Jul 10, 2014 at 3:49 PM, Jonas Wielicki
j.wieli...@sotecware.net wrote:
> While this is a good thing /all/ software projects should be doing imo,
> one could still explicitly allow Archive.org by prepending:
>
> User-agent: ia_archiver
> Disallow:
>
> ?

It looks like documentation for the old versions e.g.
http://doc.rust-lang.org/0.9/ is a 404 anyway.

Ivan