Re: [rust-dev] robots.txt prevents Archive.org from storing old documentation
On 10/07/14 17:15 +0000, Ivan Kozik wrote:
> On Thu, Jul 10, 2014 at 3:49 PM, Jonas Wielicki
> j.wieli...@sotecware.net wrote:
>> While this is a good thing /all/ software projects should be doing
>> imo, one could still explicitly allow Archive.org by prepending:
>>
>> User-agent: ia_archiver
>> Disallow:
>>
>> ?
>
> It looks like documentation for the old versions e.g.
> http://doc.rust-lang.org/0.9/ is a 404 anyway.

Which is kinda the point of letting archive.org keep a copy for posterity.

Not that I have a bunch of sway, but I'm +1 on not letting old docs be
searchable, and also +1 on making an exception for archive.org
Re: [rust-dev] robots.txt prevents Archive.org from storing old documentation
Can somebody file an issue describing exactly what we should do and cc me?

On 07/14/2014 01:13 AM, richo wrote:
> On 10/07/14 17:15 +0000, Ivan Kozik wrote:
>> On Thu, Jul 10, 2014 at 3:49 PM, Jonas Wielicki
>> j.wieli...@sotecware.net wrote:
>>> While this is a good thing /all/ software projects should be doing
>>> imo, one could still explicitly allow Archive.org by prepending:
>>>
>>> User-agent: ia_archiver
>>> Disallow:
>>>
>>> ?
>>
>> It looks like documentation for the old versions e.g.
>> http://doc.rust-lang.org/0.9/ is a 404 anyway.
>
> Which is kinda the point of letting archive.org keep a copy for
> posterity.
>
> Not that I have a bunch of sway, but I'm +1 on not letting old docs be
> searchable, and also +1 on making an exception for archive.org
Re: [rust-dev] robots.txt prevents Archive.org from storing old documentation
On Tue, Jul 15, 2014 at 4:16 AM, Brian Anderson bander...@mozilla.com wrote:
> Can somebody file an issue describing exactly what we should do and cc me?

Nothing. Absolutely nothing.

robots.txt rules do not apply to historical data; if archive.org has
archived something, the introduction of a new Disallow rule will not
remove the contents of a previous scan. It therefore has three months in
which to make a scan of a release before that release is marked obsolete
with the introduction of a Disallow directive. This is right and proper.

Special casing a specific user agent is not the right thing to do. The
contents won’t be changing after the release, anyway, so allowing
archive.org to continue scanning it is a complete waste of effort.
Re: [rust-dev] robots.txt prevents Archive.org from storing old documentation
On 07/14/2014 09:56 PM, Chris Morgan wrote:
> On Tue, Jul 15, 2014 at 4:16 AM, Brian Anderson bander...@mozilla.com wrote:
>> Can somebody file an issue describing exactly what we should do and cc me?
>
> Nothing. Absolutely nothing.
>
> robots.txt rules do not apply to historical data; if archive.org has
> archived something, the introduction of a new Disallow rule will not
> remove the contents of a previous scan.

Although that is the robots.txt standard, archive.org does retroactively
apply robots.txt Disallow rules to already-archived content.
https://archive.org/about/exclude.php

> It therefore has three months in which to make a scan of a release
> before that release is marked obsolete with the introduction of a
> Disallow directive. This is right and proper. Special casing a specific
> user agent is not the right thing to do. The contents won’t be changing
> after the release, anyway, so allowing archive.org to continue scanning
> it is a complete waste of effort.

It's my understanding that archive.org doesn't have the funding to
reliably crawl everything on the Web promptly. I agree with the principle
that "special casing a specific user agent is not the right thing to do",
but I also support the Internet Archive's mission.

Another option is an `X-Robots-Tag: noindex` HTTP header, which is more
robust at banning indexing[1] while still allowing archiving (whereas
`X-Robots-Tag: noindex, noarchive` would disallow it). It's likely less
robust from the perspective of keeping our website serving that header
consistently long-term, though. For HTML files, the equivalent robots
directives can also go in a meta tag in the head.

-Isaac

[1] (Google can still list a robots.txt-disallowed page as a search
result if many sites it trusts link to that page)
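To make that alternative concrete, a minimal sketch of the two variants
Isaac describes follows. The server-side form assumes an nginx front end,
which is purely an assumption (the thread never says what actually serves
doc.rust-lang.org); the version paths are taken from the robots.txt quoted
later in the thread:

    # hypothetical nginx config: mark old release docs as noindex
    # (still archivable, since noarchive is deliberately omitted)
    location ~ ^/0\.(3|4|5|6|7|8|9|10)/ {
        add_header X-Robots-Tag "noindex";
    }

The per-page form, which the doc generator would have to emit into the
head of each old HTML page, is:

    <meta name="robots" content="noindex">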
Re: [rust-dev] robots.txt prevents Archive.org from storing old documentation
It's not about special casing a user agent, it's about archiving
duplicate copies of old documents.

Right now, everything is crawled off of the current docs, and none of the
archived docs are allowed. With this change, the IA would store multiple
copies of old documentation: once as the old entry for docs.rust-lang.org/
and once as the new entry for docs.rust-lang.org/0.9/

At least that's how I'm understanding the situation.

Also, if you're really interested, all you have to do is a git checkout
0.9 and run rustdoc.

On Mon, Jul 14, 2014 at 9:30 PM, Isaac Dupree
m...@isaac.cedarswampstudios.org wrote:
> On 07/14/2014 09:56 PM, Chris Morgan wrote:
>> Nothing. Absolutely nothing.
>>
>> robots.txt rules do not apply to historical data; if archive.org has
>> archived something, the introduction of a new Disallow rule will not
>> remove the contents of a previous scan.
>
> Although that is the robots.txt standard, archive.org does retroactively
> apply robots.txt Disallow rules to already-archived content.
> https://archive.org/about/exclude.php
>
>> It therefore has three months in which to make a scan of a release
>> before that release is marked obsolete with the introduction of a
>> Disallow directive. This is right and proper. Special casing a specific
>> user agent is not the right thing to do. The contents won’t be changing
>> after the release, anyway, so allowing archive.org to continue scanning
>> it is a complete waste of effort.
>
> It's my understanding that archive.org doesn't have the funding to
> reliably crawl everything on the Web promptly. I agree with the principle
> that "special casing a specific user agent is not the right thing to do",
> but I also support the Internet Archive's mission.
>
> Another option is an `X-Robots-Tag: noindex` HTTP header, which is more
> robust at banning indexing[1] while still allowing archiving (whereas
> `X-Robots-Tag: noindex, noarchive` would disallow it). It's likely less
> robust from the perspective of keeping our website serving that header
> consistently long-term, though. For HTML files, the equivalent robots
> directives can also go in a meta tag in the head.
>
> -Isaac
>
> [1] (Google can still list a robots.txt-disallowed page as a search
> result if many sites it trusts link to that page)
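For anyone who does want the 0.9 docs locally, that suggestion amounts to
roughly the following sketch (hypothetical commands: the repository URL
and the exact build target for a 0.9-era tree are assumptions, not steps
confirmed in this thread):

    $ git clone https://github.com/rust-lang/rust
    $ cd rust
    $ git checkout 0.9
    $ ./configure
    $ make docs    # assumed target; otherwise run rustdoc on the sources directly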
[rust-dev] robots.txt prevents Archive.org from storing old documentation
Hi,

the current robots.txt on docs.rust-lang.org prevents Archive.org from
storing copies of the old documentation. I think having the old
documentation archived would be a good thing. BTW, all the documentation
before 0.10 seems to be gone, and this is a shame.

Could you please allow the Archive.org bot to index the site? For the
record:

$ curl http://doc.rust-lang.org/robots.txt
User-agent: *
Disallow: /0.3/
Disallow: /0.4/
Disallow: /0.5/
Disallow: /0.6/
Disallow: /0.7/
Disallow: /0.8/
Disallow: /0.9/
Disallow: /0.10/

--
Gioele Barabucci gio...@svario.it
Re: [rust-dev] robots.txt prevents Archive.org from storing old documentation
On 10/07/14 03:46 AM, Gioele Barabucci wrote:
> Hi,
>
> the current robots.txt on docs.rust-lang.org prevents Archive.org from
> storing copies of the old documentation. I think having the old
> documentation archived would be a good thing. BTW, all the documentation
> before 0.10 seems to be gone, and this is a shame.
>
> Could you please allow the Archive.org bot to index the site? For the
> record:
>
> $ curl http://doc.rust-lang.org/robots.txt
> User-agent: *
> Disallow: /0.3/
> Disallow: /0.4/
> Disallow: /0.5/
> Disallow: /0.6/
> Disallow: /0.7/
> Disallow: /0.8/
> Disallow: /0.9/
> Disallow: /0.10/

The old documentation is all available from the Git repository. The
robots.txt rule is there to reverse the trend of searches being filled
with out-of-date documentation.
Re: [rust-dev] robots.txt prevents Archive.org from storing old documentation
On 10.07.2014 16:56, Daniel Micay wrote:
> On 10/07/14 03:46 AM, Gioele Barabucci wrote:
>> Hi,
>>
>> the current robots.txt on docs.rust-lang.org prevents Archive.org from
>> storing copies of the old documentation. I think having the old
>> documentation archived would be a good thing. BTW, all the documentation
>> before 0.10 seems to be gone, and this is a shame.
>>
>> Could you please allow the Archive.org bot to index the site? For the
>> record:
>>
>> $ curl http://doc.rust-lang.org/robots.txt
>> User-agent: *
>> Disallow: /0.3/
>> Disallow: /0.4/
>> Disallow: /0.5/
>> Disallow: /0.6/
>> Disallow: /0.7/
>> Disallow: /0.8/
>> Disallow: /0.9/
>> Disallow: /0.10/
>
> The old documentation is all available from the Git repository. The
> robots.txt rule is there to reverse the trend of searches being filled
> with out-of-date documentation.

While this is a good thing /all/ software projects should be doing imo,
one could still explicitly allow Archive.org by prepending:

User-agent: ia_archiver
Disallow:

?
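Spelled out against the file quoted above, the prepended version would
look like this; by robots.txt convention a crawler obeys the most
specific User-agent group that matches it, so ia_archiver would follow
its own empty (allow-everything) group while every other bot keeps
obeying the Disallow rules:

    User-agent: ia_archiver
    Disallow:

    User-agent: *
    Disallow: /0.3/
    Disallow: /0.4/
    Disallow: /0.5/
    Disallow: /0.6/
    Disallow: /0.7/
    Disallow: /0.8/
    Disallow: /0.9/
    Disallow: /0.10/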
Re: [rust-dev] robots.txt prevents Archive.org from storing old documentation
On Thu, Jul 10, 2014 at 3:49 PM, Jonas Wielicki
j.wieli...@sotecware.net wrote:
> While this is a good thing /all/ software projects should be doing imo,
> one could still explicitly allow Archive.org by prepending:
>
> User-agent: ia_archiver
> Disallow:
>
> ?

It looks like documentation for the old versions e.g.
http://doc.rust-lang.org/0.9/ is a 404 anyway.

Ivan