Are your 404 errors part of a Denial of Service attack? That would be a good reason to set up a blocking mechanism. Though, as others have pointed out, Apache handles 404 errors very efficiently, so you should make sure that this is really an issue. Measure the bandwidth, disk, and CPU resources used for servicing 404 pages as a fraction of all requests. If it is the 404 errors that are causing you problems, then certainly block the hosts that are causing them.
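If you want a quick way to gauge that, here is a rough sketch in Python, assuming Apache's common/combined log format and a log at /var/log/apache2/access.log (both assumptions; adjust for your setup). It reports the fraction of requests that were 404s and the clients generating the most of them:

    import re
    from collections import Counter

    LOG = "/var/log/apache2/access.log"   # assumption: point this at your real log

    # client-IP ... [timestamp] "request" status ...  (common/combined log format)
    line_re = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3})')

    total = 0
    per_client_404 = Counter()
    with open(LOG) as log:
        for line in log:
            m = line_re.match(line)
            if not m:
                continue
            total += 1
            if m.group(2) == "404":
                per_client_404[m.group(1)] += 1

    count = sum(per_client_404.values())
    print("%d of %d requests were 404s (%.1f%%)" % (count, total, 100.0 * count / max(total, 1)))
    for client, n in per_client_404.most_common(10):
        print("%6d  %s" % (n, client))

If the percentage is tiny and no single client dominates, the 404s probably aren't worth blocking over.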
Personally, I find most 404 errors on my web sites are due to broken links. whatexit.org has plenty of them right now (d'oh!).

Real spiders don't guess random URLs to crawl the web. That would be a waste of time. In a world with billions (trillions?) of web sites, spiders aren't looking to make more work for themselves. They look for links on pages and crawl those links. They do re-crawl pages they've seen before to look for updates, so if you remove a web page there is a good chance you will see occasional 404 errors for it. That's a good thing. You want the spider to see that the page has really gone away so that it stops listing it in its search results.

There are two ways to directly influence spiders:

Negative hints: A robots.txt file (http://www.robotstxt.org) can be used to indicate which URLs you don't want crawled. All major web search engines obey robots.txt, and the ones that don't quickly get banned or realize that it is in their best interest to pay attention to it. This is particularly important for "infinite" web sites, like a calendar with a "next month" button that can be clicked until the year 9999. You probably don't want a spider clicking "next month" for days on end, and the spiders have better sites to search anyway. (There's a bare-bones example at the end of this message.)

Positive hints: There is a standard called "XML Sitemaps" (http://www.xml-sitemaps.com/) which lets you publish a list of your site's URLs so that spiders can be more efficient about crawling your site. More importantly, search engines use the sitemap to display more info to your users. (That's why http://www.google.com/search?q=robots.txt shows the menu structure of robotstxt.org right in the search results.) There's a minimal sitemap at the end of this message, too.

There is one time that spiders *do* request non-existent pages on purpose: they're testing to make sure your 404 mechanism is working. Some web servers do something stupid with non-existent pages... like redirecting the user to a non-error page. If you look at a custom 404 page like http://www.gocomics.com/does-not-exist you'll see that even though the page is bright, colorful, and even useful, it still returns an HTTP 404 at the protocol level. Some sites serve a similar friendly page but return a non-error code in the HTTP protocol. To a spider that can be as dangerous as the "infinite calendar" example above. (The Apache snippet at the end shows the right and wrong way to set this up.)

You also brought up the issue of brute-force password attacks. Your password system shouldn't give a 404 error when the user has entered the wrong password; 401 or other errors are more appropriate (http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html). I agree that blocking that kind of attack is a good thing, but it should be unrelated to 404s. (And if you enter the wrong password on http://login.yahoo.com, you'll see that you don't get any HTTP error... the error is part of the UI, not the protocol.)

Not all 404s are bad. A user who accidentally mistypes a URL should get a 404 error, not be punished. Maybe their finger slipped, eh? Or maybe you have a broken link on your web site and they are clicking "reload" thinking that will find it. So if you block users that generate 404 errors, make sure it is because they are using a seriously large amount of resources, not because they have bad typing skills.

Tom
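P.S. Since I pointed at a couple of standards without showing what they look like, here are bare-bones sketches; every path and hostname in them is a made-up placeholder. First, a robots.txt for the "infinite calendar" case, telling all well-behaved spiders to stay out of the calendar URLs:

    User-agent: *
    Disallow: /calendar/

Put it at the top level of the site (http://www.example.com/robots.txt); that's the only place spiders look for it.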
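Second, a minimal XML Sitemap listing a single URL (real ones list every page you want crawled, and can include optional details like last-modified dates):

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url>
        <loc>http://www.example.com/</loc>
      </url>
    </urlset>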
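Third, the Apache side of the custom-404-page point. With a local path, Apache serves your friendly page but keeps the 404 status; hand it a full URL and it sends a redirect instead, so the spider never sees the 404:

    # good: friendly page, still a 404 at the protocol level
    ErrorDocument 404 /errors/not-found.html

    # bad: a full URL makes Apache issue a redirect instead of a 404
    # ErrorDocument 404 http://www.example.com/not-found.html

(/errors/not-found.html is just an example path.)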
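Finally, for the password case: if you're using HTTP authentication, a failed login should come back at the protocol level as something like

    HTTP/1.1 401 Unauthorized
    WWW-Authenticate: Basic realm="Members Only"

rather than a 404. (The realm string is just an example.) Form-based logins, like Yahoo's, report the failure in the page itself and return a normal, non-error response, which is also fine; the point is only that 404 is the wrong signal.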
