On Sun, Oct 11, 2009 at 6:03 PM, Aryeh Gregor <[email protected]> wrote:
> On Sun, Oct 11, 2009 at 3:28 PM, Erik Zachte <[email protected]> wrote:
>> Any idea why there are so many TCP_DENIED/403? Are these really failures?
>
> Certain types of requests are blocked at the Squid level for various
> reasons. For instance, try wgetting Wikipedia; you'll get a 403
> because the default UA headers for such things are blocked. (You're
> supposed to use a custom UA header, preferably with contact info, to
> make your script distinctive and easily blockable by itself if there's
> a problem.) Similarly, try something like this:
>
> http://en.wikipedia.org/&
>
> I assume this kind of thing is what causes those responses.

Actually, wget isn't blocked for either page views or action=edit, based on a test a minute ago.
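Still, for anyone whose script does trip a UA block, sending a descriptive User-Agent is trivial. A minimal sketch in Python (the bot name and contact address below are hypothetical placeholders, not anything registered with ops):

    import urllib.request

    # Hypothetical descriptive User-Agent; substitute your own tool name
    # and a real contact address so ops can reach you if there's a problem.
    headers = {
        "User-Agent": "ExampleBot/0.1 (http://example.org/bot; [email protected])"
    }
    req = urllib.request.Request("http://en.wikipedia.org/wiki/Main_Page",
                                 headers=headers)
    with urllib.request.urlopen(req) as resp:
        # Expect a 200 here rather than the 403 a blocked default UA gets.
        print(resp.getcode())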
> On Sun, Oct 11, 2009 at 8:12 PM, Robert Rohde <[email protected]> wrote:
>> However, a logical guess would be if the Squid is configured to
>> reject action=edit requests from search engine spiders and similar
>> non-human processes. Since such things are not easily incorporated
>> into robots.txt, blocking at the Squid layer would be a good option
>> for stopping such traffic from hitting the main servers. That would
>> be my guess. I suspect others can give a more concrete answer.
>
> Those things are all blocked in robots.txt:
>
> User-agent: *
> Disallow: /w/
>
> That's part of why we use long URLs for everything but page views, so
> that they can be neatly blocked from spiders.

Excellent point, though I wouldn't be surprised to find that some disrespectful spiders and bots are also blocked at the Squid level.

-Robert Rohde
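P.S. That robots.txt rule is easy to check with Python's stdlib robot parser; a quick sketch, with the expected results (given the Disallow: /w/ rule above) noted in comments rather than verified output:

    import urllib.robotparser

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("http://en.wikipedia.org/robots.txt")
    rp.read()

    # Page views live under /wiki/ and stay crawlable; long-URL actions
    # like action=edit live under /w/, which the rule above disallows.
    print(rp.can_fetch("*", "http://en.wikipedia.org/wiki/Main_Page"))
    # expected: True
    print(rp.can_fetch("*",
        "http://en.wikipedia.org/w/index.php?title=Main_Page&action=edit"))
    # expected: False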
