Sure. Let the world learn from my mistakes. But first, a bit of background: I work for a company that supplies internet access to schools in the US. To comply with CIPA (http://www.ala.org/cipa/) we have to provide internet filtering (via an "8e6 Technologies" R3000 filtering appliance), hence the central proxy servers mentioned earlier. As an additional filtering measure, I've created a form where the customers can (on the customer premise equipment) add other sites they would like blocked. This form allows for blocking just a particular page on a site, a subfolder of a site, or a whole domain. All they put in is a URL and a choice of whether to block the site or the domain. The regular expressions are generated automatically, added to the filterurls file, and Squid is reloaded.
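On the Squid side the plumbing is nothing fancy; it looks roughly like the following (the ACL name and file path here are placeholders for illustration, not my exact config):

  # squid.conf: an ACL built from the file the form appends to
  acl customer_blocks url_regex -i "/etc/squid/filterurls"
  http_access deny customer_blocks

  # after the form updates the file, tell Squid to re-read its config
  squid -k reconfigure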
Here are some example types, requested URLs, and the resultant regexes:

  Site:   http://gamesondemand.yahoo.com
          http://(.*@)?(www\.)?gamesondemand.yahoo.com/.*

  Site:   http://www.bpssoft.com/powertools/library.htm
          http://(.*@)?(www\.)?bpssoft.com/powertools/library.htm

  Site:   http://www2.photeus.com:8090/~ewot/
          (http(s)?://)?(.*@)?(www\.)?www2.photeus.com(:.*)?/~ewot

  Domain: http://mail.com
          (http(s)?://)?(.*@)?(.*\.)?mail.com(/.*)?

Now you may be looking at those regular expressions and asking yourself, "What the hell was he thinking?" I don't blame you. In retrospect, these regexes are overkill. We had a problem with our filtering service (8e6 Technologies XStop on IIS) at one point where a site that would normally be blocked (say http://www.playboy.com/, for example) would pass the filtering service if HTTP authentication was used (http://[EMAIL PROTECTED]). To compensate, I gave the customers the power to block sites on a case-by-case basis and made sure those blocks would cover this situation. Obviously (again, in retrospect) I was being a bit too specific. Then again, I created this function over two years ago, and my customers have only just started really using it, which is what was causing the problems. Go figure.

With only one or two sites being blocked this way, and as little traffic as most of my sites consume, Squid was okay with my incompetence (inexperience? naivete?). Once more sites are blocked, matching these complex regexes gets to be overwhelming.

I'm still working on rewriting the regexes for the above requests. As it stands now, I'm blocking any domain that has a Site block associated with it (i.e. all of bpssoft.com is being blocked at the example site). Here's the proposed solution for site blocking (using url_regex):

  (www\.)?gamesondemand\.yahoo\.com/
  (www\.)?bpssoft\.com/powertools/library\.htm
  (www\.)?www2\.photeus\.com(:[0-9]+)?/~ewot/

This is not as exact, since any URL containing one of these strings (such as a Netcraft query) will be blocked, but sadly filtering is not an exact science, and at least they can surf. Blocked (and allowed) domains have already been moved to a new acl using dstdom_regex:

  (.*\.mail\.com|mail\.com)$

which gives more exact results (and quickly), but can't be used to match just a page or a subfolder. If someone has suggestions on how to make these more granular while maintaining efficiency, I'm all ears.

Chris

-----Original Message-----
From: Dave Holland [mailto:[EMAIL PROTECTED]]
Sent: Thursday, November 04, 2004 3:10 AM
To: [EMAIL PROTECTED]
Subject: Re: [squid-users] Sporadic high CPU usage, no traffic

On Tue, Nov 02, 2004 at 10:56:28AM -0900, Chris Robertson wrote:
> before, neat). I was using two url_regex acls, and the regular expressions
> I was using seem to be the problem. Removing those two lines dropped CPU
> usage from a low of 50% to a HIGH of 10%. Yikes. Off to optimize them.

It would be interesting to see those url_regex lines, if you're willing to share them?

thanks,
Dave

-- 
** Dave Holland ** Systems Support -- Special Projects Team **
** 01223 494965 ** Sanger Institute, Hinxton, Cambridge, UK **
 "Always remember: you're unique. Just like everybody else."
