[Bug 61132] robots.txt should let search engines to index tools.wmflabs.org

2014-03-25 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=61132

Marc A. Pelletier m...@uberbox.org changed:

   What|Removed |Added

 Status|REOPENED|RESOLVED
 Resolution|--- |WONTFIX

--- Comment #14 from Marc A. Pelletier m...@uberbox.org ---
Closing as WONTFIX for the general case.  Individual tool owners are welcome to
request whitelisting of their tool, so long as they have properly validated
that a bot spidering it cannot cause issues.

In particular, tools which return pages with dynamic content that is or may be
expensive on the database to generate, and which contain further internal
links, generally send spiders into a loop and consume a great deal of
resources, impacting all other tools.

-- 
You are receiving this mail because:
You are on the CC list for the bug.
___
Wikibugs-l mailing list
Wikibugs-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikibugs-l


[Bug 61132] robots.txt should let search engines to index tools.wmflabs.org

2014-03-25 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=61132

--- Comment #15 from Nemo federicol...@tiscali.it ---
Meh. Ok, will host my stuff elsewhere. I'd like it to be found and used. :)



[Bug 61132] robots.txt should let search engines to index tools.wmflabs.org

2014-02-10 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=61132

Nemo federicol...@tiscali.it changed:

   What|Removed |Added

   See Also||https://bugzilla.wikimedia.org/show_bug.cgi?id=61133



[Bug 61132] robots.txt should let search engines to index tools.wmflabs.org

2014-02-10 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=61132

--- Comment #1 from Tim Landscheidt t...@tim-landscheidt.de ---
Do you mean the tools themselves (e.g. https://tools.wmflabs.org/wikilint/) or
the index (just https://tools.wmflabs.org/)?

The first is a WONTFIX; for the second I haven't found a solution yet.  Do you
have an idea?



[Bug 61132] robots.txt should let search engines to index tools.wmflabs.org

2014-02-10 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=61132

--- Comment #2 from Nemo federicol...@tiscali.it ---
Why would the first be a WONTFIX?
For the second see the docs,

Allow: /$

is supposed to work (at least with Google).



[Bug 61132] robots.txt should let search engines to index tools.wmflabs.org

2014-02-10 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=61132

Nemo federicol...@tiscali.it changed:

   What|Removed |Added

   Keywords||code-update-regression



[Bug 61132] robots.txt should let search engines to index tools.wmflabs.org

2014-02-10 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=61132

Nemo federicol...@tiscali.it changed:

   What|Removed |Added

 Blocks||58791



[Bug 61132] robots.txt should let search engines to index tools.wmflabs.org

2014-02-10 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=61132

--- Comment #3 from Tim Landscheidt t...@tim-landscheidt.de ---
(In reply to comment #2)
> Why would the first be a WONTFIX?

Because there are tools that are linked from every wiki page, and any spider
accessing them brings the house down.  As tools are created and updated without
any review by admins, and wiki edits are not monitored either, blacklisting
them after the meltdown doesn't work.

So unlimited spider access is not possible.

> For the second see the docs,

Unfortunately, there is no specification for robots.txt; that's the core of the
problem.

> Allow: /$
>
> is supposed to work (at least with Google).

According to [[de:Robots Exclusion Standard]] with Googlebot, Yahoo! Slurp and
msnbot.  And the other spiders?  Will they read it in the same way or as /? 
How do we whitelist /?Rules?
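
The concern about other spiders can be made concrete with a minimal sketch. It compares two readings of the same rules: one that treats a trailing `$` as an end-of-path anchor (the extension attributed above to Googlebot, Yahoo! Slurp and msnbot) and one that treats every rule as a literal path prefix. The function names and the "longest matching rule wins" precedence are assumptions made for illustration, not the behaviour of any particular crawler.

```python
import re

def extended_match(pattern: str, path: str) -> bool:
    """Treat '*' as a wildcard and a trailing '$' as an end-of-path anchor."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal pieces and join them with '.*' where '*' appeared.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    regex = "^" + regex + ("$" if anchored else "")
    return re.search(regex, path) is not None

def prefix_match(pattern: str, path: str) -> bool:
    """Treat the rule as a literal prefix, with no extensions at all."""
    return path.startswith(pattern)

def allowed(rules, path, match) -> bool:
    """Assumed precedence: the longest matching rule decides; no match allows."""
    best = None
    for allow, pattern in rules:
        if match(pattern, path):
            if best is None or len(pattern) > len(best[1]):
                best = (allow, pattern)
    return True if best is None else best[0]

# "Allow: /$" followed by "Disallow: /", as (is_allow, pattern) pairs.
RULES = [(True, "/$"), (False, "/")]

for path in ["/", "/wikilint/", "/?Rules"]:
    print(path,
          "extended:", allowed(RULES, path, extended_match),
          "prefix-only:", allowed(RULES, path, prefix_match))
```

Under these assumptions the root page is fetchable only by a spider that understands `$`; a prefix-only spider sees everything as disallowed, and `/?Rules` is blocked under both readings.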



[Bug 61132] robots.txt should let search engines to index tools.wmflabs.org

2014-02-10 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=61132

--- Comment #4 from Nemo federicol...@tiscali.it ---
(In reply to comment #3)
> (In reply to comment #2)
> > Why would the first be a WONTFIX?
>
> Because there are tools that are linked from every wiki page

Blacklist them, then?

https://toolserver.org/robots.txt has:

User-agent: *
Disallow: /~magnus/geo/geohack.php
Disallow: /~daniel/WikiSense
Disallow: /~geohack/
Disallow: /~enwp10/
Disallow: /~cbm/cgi-bin/
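
As a rough check of how a blacklist like this behaves, the rules can be fed to Python's standard `urllib.robotparser`. This is only a sketch: the user agent `ExampleBot` and the path `/~example/tool` are made up for illustration, and other crawlers may parse the file differently.

```python
from urllib.robotparser import RobotFileParser

# The Toolserver rules quoted above, copied verbatim.
TOOLSERVER_RULES = """\
User-agent: *
Disallow: /~magnus/geo/geohack.php
Disallow: /~daniel/WikiSense
Disallow: /~geohack/
Disallow: /~enwp10/
Disallow: /~cbm/cgi-bin/
"""

def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

parser = build_parser(TOOLSERVER_RULES)

# A blacklisted tool is refused; anything not listed stays crawlable.
print(parser.can_fetch("ExampleBot", "https://toolserver.org/~geohack/"))
print(parser.can_fetch("ExampleBot", "https://toolserver.org/~example/tool"))
```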

> and any spider
> accessing them brings the house down.  As tools are created and updated
> without any review by admins and wiki edits are not monitored as well,
> blacklisting them after the meltdown doesn't work.
>
> So unlimited spider access is not possible.

Nobody said unlimited. This works on Toolserver, it's not inherently
impossible. It's unfortunate that migration implies such usability regressions,
because then tool developers will try to postpone migration as long as possible
and we'll have little time.

 
> > For the second see the docs,
>
> Unfortunately, there is no specification for robots.txt; that's the core of
> the problem.

Not really, there is a specification but everyone has extensions. I meant
Google's, as I said.

> msnbot.  And the other spiders?  Will they read it in the same way or as
> /?

You'll find out with experience.

> How do we whitelist /?Rules?

By mentioning it specifically, no?
However, while I can understand blocking everything except the root page,
whitelisting individual pages is rather crazy and I don't see how /?Rules would
be more interesting than most other pages. It would be a horrible waste of time
to go hunting for them; you might as well just snail-mail printouts of web
pages on demand.



[Bug 61132] robots.txt should let search engines to index tools.wmflabs.org

2014-02-10 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=61132

--- Comment #5 from Tim Landscheidt t...@tim-landscheidt.de ---
(In reply to comment #4)
> [...]
>
> > and any spider
> > accessing them brings the house down.  As tools are created and updated
> > without any review by admins and wiki edits are not monitored as well,
> > blacklisting them after the meltdown doesn't work.
> >
> > So unlimited spider access is not possible.
>
> Nobody said unlimited. This works on Toolserver, it's not inherently
> impossible. It's unfortunate that migration implies such usability
> regressions, because then tool developers will try to postpone migration as
> long as possible and we'll have little time.

I haven't met a tool developer who postpones migration because of robots.txt
(or cares about that at all, because their tools are linked from Wikipedia).
No one has even asked to change robots.txt.  Who are they?

If tool developers guarantee that a specific tool is resistant to spiders, we
can whitelist that (even automated à la ~/.description).
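
A minimal sketch of what such automated whitelisting could look like, assuming the set of opted-in tools has already been collected somehow (for instance from a per-tool marker file). The function name and the tool names are hypothetical, and `Allow` is itself an extension to the original robots.txt convention, so spiders that ignore it would simply see everything as disallowed.

```python
def build_robots_txt(whitelisted_tools) -> str:
    """Generate a robots.txt that blocks everything except opted-in tools.

    Allow lines are emitted before the catch-all Disallow so that parsers
    which apply the first matching rule still honour the whitelist.
    """
    lines = ["User-agent: *"]
    for tool in sorted(whitelisted_tools):
        lines.append(f"Allow: /{tool}/")
    lines.append("Disallow: /")
    return "\n".join(lines) + "\n"

# Hypothetical tool names, for illustration only.
print(build_robots_txt({"example-tool", "another-tool"}))
```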

> [...]
>
> > msnbot.  And the other spiders?  Will they read it in the same way or as
> > /?
>
> You'll find out with experience.
>
> [...]

Why would we take that risk with only marginal benefit gained?  Experience
means a lot of people yelling.



[Bug 61132] robots.txt should let search engines to index tools.wmflabs.org

2014-02-10 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=61132

--- Comment #6 from Nemo federicol...@tiscali.it ---
(In reply to comment #5)
> I haven't met a tool developer who postpones migration because of robots.txt

Why would you meet them? People unaware of this obscure dark corner of the
internet called Tool Labs, hidden from the rest of the WWW, will never reach
us.



[Bug 61132] robots.txt should let search engines to index tools.wmflabs.org

2014-02-10 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=61132

Tim Landscheidt t...@tim-landscheidt.de changed:

   What|Removed |Added

 Status|NEW |RESOLVED
 Resolution|--- |WONTFIX

--- Comment #7 from Tim Landscheidt t...@tim-landscheidt.de ---
(In reply to comment #6)
> > I haven't met a tool developer who postpones migration because of robots.txt
>
> Why would you meet them? People unaware of this obscure dark corner of the
> internet called tool labs, hidden from the rest of the WWW, will never arrive
> to us.

That's why I asked you: Who postpones migration to Labs because of robots.txt?



[Bug 61132] robots.txt should let search engines to index tools.wmflabs.org

2014-02-10 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=61132

Tim Landscheidt t...@tim-landscheidt.de changed:

   What|Removed |Added

 Status|RESOLVED|REOPENED
 Resolution|WONTFIX |---

--- Comment #8 from Tim Landscheidt t...@tim-landscheidt.de ---
Sorry, that was too fast.



[Bug 61132] robots.txt should let search engines to index tools.wmflabs.org

2014-02-10 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=61132

--- Comment #9 from Nemo federicol...@tiscali.it ---
(In reply to comment #7)
> That's why I asked you: Who postpones migration to Labs because of
> robots.txt?

Sorry, it's not my job to go ask dozens or hundreds of tools owners why they've
not yet migrated their tools.

Missed this:

(In reply to comment #5)
> Why would we take that risk with only marginal benefit gained? [...]

Ah, right, marginal benefit. I had forgotten that Tool Labs was only built as a
monument to computer science; having people finding and using tools and pages
useful for them is just an accessory, a marginal benefit.



[Bug 61132] robots.txt should let search engines to index tools.wmflabs.org

2014-02-10 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=61132

--- Comment #10 from Tim Landscheidt t...@tim-landscheidt.de ---
(In reply to comment #9)
> (In reply to comment #7)
> > That's why I asked you: Who postpones migration to Labs because of
> > robots.txt?
>
> Sorry, it's not my job to go ask dozens or hundreds of tools owners why
> they've not yet migrated their tools.

Then why do you claim that it is related to robots.txt?

> Missed this:
>
> (In reply to comment #5)
> > Why would we take that risk with only marginal benefit gained? [...]
>
> Ah, right, marginal benefit. I had forgotten that Tool Labs was only built
> as a monument to computer science; having people finding and using tools and
> pages useful for them is just an accessory, a marginal benefit.

This bug isn't about people finding and using tools and pages useful for
them, but about robots.txt.  If you want to increase the visibility of the
tools available at Tools, you can very easily set up a mirror at a more
prominent wiki.  The code for https://tools.wmflabs.org/ is at
http://git.wikimedia.org/blob/labs%2Ftoollabs.git/master/www%2Fcontent%2Flist.php.



[Bug 61132] robots.txt should let search engines to index tools.wmflabs.org

2014-02-10 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=61132

Jarry1250 jarry1...@gmail.com changed:

   What|Removed |Added

 CC||jarry1...@gmail.com

--- Comment #11 from Jarry1250 jarry1...@gmail.com ---
I need robots.txt-esque access for my tool,
http://tools.wmflabs.org/wmukevents, which is a calendar feed. For users to be
able to add it to their Google calendars, the Google Calendar Bot must be able
to access it. Unfortunately, Google Calendar Bot uses the same user agent as
the regular Google spider.

That said, I mentioned this to Coren a while back, he twiddled some levers
(can't recall precisely what) and now it WORKSFORME, so perhaps I've
misremembered the problem on some level.



[Bug 61132] robots.txt should let search engines to index tools.wmflabs.org

2014-02-10 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=61132

--- Comment #12 from Merlijn van Deen valhall...@arctus.nl ---
> Ah, right, marginal benefit. I had forgotten that Tool Labs was only built
> as a monument to computer science; having people finding and using tools and
> pages useful for them is just an accessory, a marginal benefit.

Google is smart enough to do its job even without robots.txt:

https://encrypted.google.com/search?q=gerrit%20patch%20uploader



[Bug 61132] robots.txt should let search engines to index tools.wmflabs.org

2014-02-10 Thread bugzilla-daemon
https://bugzilla.wikimedia.org/show_bug.cgi?id=61132

--- Comment #13 from Merlijn van Deen valhall...@arctus.nl ---
Sorry, that should have read 'Google is smart enough to do its job even when
blocked by robots.txt'
