On Sun, Oct 11, 2009 at 6:03 PM, Aryeh Gregor
<[email protected]> wrote:
> On Sun, Oct 11, 2009 at 3:28 PM, Erik Zachte <[email protected]> 
> wrote:
>> Any idea why there are so many TCP_DENIED/403? Are these really failures?
>
> Certain types of requests are blocked at the Squid level for various
> reasons.  For instance, try wgetting Wikipedia; you'll get a 403
> because the default UA headers for such things are blocked.  (You're
> supposed to use a custom UA header, preferably with contact info, to
> make your script distinctive and easily blockable by itself if there's
> a problem.)  Similarly, try something like this:
>
> http://en.wikipedia.org/&
>
> I assume this kind of thing is what causes those responses.

Actually, wget isn't blocked for either page views or action=edit, based
on a test a minute ago.
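
For anyone who wants to repeat that kind of test with the descriptive UA
header Aryeh describes, here is a rough sketch using Python's standard
library; the bot name and contact address are only placeholders, not
anything that actually exists:

import urllib.error
import urllib.request

# Identify the script explicitly so operators can contact it, or block it
# on its own if it misbehaves.  The UA string below is just an example.
req = urllib.request.Request(
    "http://en.wikipedia.org/wiki/Main_Page",
    headers={"User-Agent": "ExampleStatsBot/0.1 (contact: [email protected])"},
)
try:
    with urllib.request.urlopen(req) as resp:
        print(resp.status, resp.reason)   # 200 OK if the request gets through
except urllib.error.HTTPError as err:
    print(err.code, err.reason)           # a Squid-level block shows up as a 403 here
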

> On Sun, Oct 11, 2009 at 8:12 PM, Robert Rohde <[email protected]> wrote:
>> However, a logical guess would
>> be if the Squid is configured to reject action=edit requests from
>> search engine spiders and similar non-human processes.  Since such
>> things are not easily incorporated into robots.txt, blocking at the
>> squid layer would be a good option for stopping such traffic from
>> hitting the main servers.  That would be my guess.  I suspect others
>> can give a more concrete answer.
>
> Those things are all blocked in robots.txt:
>
> User-agent: *
> Disallow: /w/
>
> That's part of why we use long URLs for everything but page views, so
> that they can be neatly blocked from spiders.

Excellent point, though I wouldn't be surprised to find that some
disrespectful spiders and bots are also blocked at the Squid level.
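
As a side note, that split between short page-view URLs and long /w/ URLs
is easy to check against the rule above with Python's standard robotparser;
a rough sketch (the example URLs are mine, and the exact results depend on
whatever the live robots.txt says):

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser("http://en.wikipedia.org/robots.txt")
rp.read()  # fetch and parse the live robots.txt

# Short /wiki/ URLs (plain page views) are left open to well-behaved crawlers...
print(rp.can_fetch("*", "http://en.wikipedia.org/wiki/Main_Page"))

# ...while long /w/ URLs such as action=edit fall under "Disallow: /w/".
print(rp.can_fetch("*", "http://en.wikipedia.org/w/index.php?title=Main_Page&action=edit"))
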

-Robert Rohde

