Yet more data: Version 2.5.STABLE7 configure options: --bindir=/home/squid2/bin --sbindir=/home/squid2/bin --libexecdir=/home/squid2/bin --datadir=/home/squid2/etc --sysconfdir=/etc/squid --localstatedir=/home/squid2 --mandir=/usr/man --enable-ssl --enable-err-languages=English --disable-ident-lookups --enable-storeio=ufs,null --enable-snmp --disable-poll
Changed the cache_dir line to read : cache_dir null /dev/null Now the access.log (with full debugging) at high CPU usage shows: 2004/11/01 11:32:55| cbdataUnlock: 0x83f0560 2004/11/01 11:32:55| cbdataUnlock: Freeing 0x83f0560 2004/11/01 11:32:55| cbdataFree: 0x84dae78 2004/11/01 11:32:55| cbdataFree: Freeing 0x84dae78 2004/11/01 11:32:55| cbdataUnlock: 0x84dfa88 2004/11/01 11:32:55| cbdataUnlock: 0x88830e8 2004/11/01 11:32:55| cbdataFree: 0x83ef270 2004/11/01 11:32:55| cbdataFree: 0x83ef270 has 1 locks, not freeing 2004/11/01 11:32:55| cbdataUnlock: 0x83ef270 2004/11/01 11:32:55| cbdataUnlock: Freeing 0x83ef270 2004/11/01 11:33:08| eventRun: RUN ID 1342 2004/11/01 11:33:08| eventRun: Running 'idnsCheckQueue', id 1340 2004/11/01 11:33:08| comm_select: 54+0 FDs ready at 1099341188 2004/11/01 11:33:08| comm_select: FD 25 ready for reading 2004/11/01 11:33:08| httpReadReply: FD 25: len 0. 2004/11/01 11:33:08| fwdComplete: http://ps.galenaalaska.org/ status 302 2004/11/01 11:33:08| fwdReforward: http://ps.galenaalaska.org/? 2004/11/01 11:33:08| fwdReforward: No, ENTRY_FWD_HDR_WAIT isn't set 2004/11/01 11:33:08| fwdComplete: not re-forwarding status 302 This is really killing me. Is it the eventRun that is causing the stall, is it breaking the stall, or is it completely unrelated? Thanks for your time, Chris -----Original Message----- From: Chris Robertson [mailto:[EMAIL PROTECTED] Sent: Friday, October 29, 2004 4:40 PM To: '[EMAIL PROTECTED]' Subject: RE: [squid-users] Sporadic high CPU usage, no traffic More data, same problem. I've made distclean, reconfigured (Squid Cache: Version 2.5.STABLE7 configure options: --bindir=/home/squid2/bin --sbindir=/home/squid2/bin --libexecdir=/home/squid2/bin --datadir=/home/squid2/etc --sysconfdir=/etc/squid --localstatedir=/home/squid2 --mandir=/usr/man --enable-err-languages=English --disable-ident-lookups --enable-storeio=diskd,ufs --enable-snmp --disable-poll) to eliminate the possibility of it being poor thread support in glibc, or polling in Linux 2.2 and still see the same problems. I've tried both diskd and ufs as cache_dir types. I picked a fairly low traffic (2-4 req/sec & 10-15 KB/sec) client and added debug_options ALL,9 to the conf file, and watched the CPU usage. At around 16:09:25 AKST October 29, 2004) I noticed a CPU spike. Here is an excerpt from the access.log: 1099094958.155 2259 10.200.6.29 TCP_MISS/302 510 GET http://a.tribalfusion.com/j.ad? - ROUNDROBIN_PARENT/proxy2.schoolaccess.net text/html 1099094958.155 862 10.200.6.29 TCP_MISS/302 0 GET http://media72.fastclick.net/w/safepop.cgi? - ROUNDROBIN_PARENT/proxy2.schoolaccess.net text/plain 1099094958.762 1468 10.200.6.6 TCP_MISS/200 42328 GET http://scrapaddictinc.com/phpBB2/viewtopic.php? - ROUNDROBIN_PARENT/proxy1.schoolaccess.net text/html 1099094958.882 750 10.200.6.29 TCP_MISS/200 14386 GET http://a1444.g.akamai.net/7/1444/1104/1096471980/www.burstnet.com/gifs/oreck _air8_728x90_092904.gif - ROUNDROBIN_PARENT/proxy1.schoolaccess.net image/gif 1099094958.907 1043 10.200.22.39 TCP_MISS/200 4483 GET http://xml.eshop.msn.com/xmlbuddy/eShopOffer.aspx? - ROUNDROBIN_PARENT/proxy1.schoolaccess.net text/html 1099094958.921 446 10.200.6.29 TCP_MISS/200 1582 GET http://a.tribalfusion.com/jr.ad? - ROUNDROBIN_PARENT/proxy2.schoolaccess.net application/x-javascript 1099094960.116 1961 10.200.6.29 TCP_MISS/200 568 GET http://te.burstnet.com/tte/blank.gif? - ROUNDROBIN_PARENT/proxy2.schoolaccess.net image/gif 1099094961.816 1788 10.200.22.39 TCP_MISS/200 2960 GET http://di.image.eshop.msn.com/img/sinv/2773/2/08f341e.jpg - ROUNDROBIN_PARENT/proxy2.schoolaccess.net image/jpeg 1099094961.820 2898 10.200.6.29 TCP_MISS/200 6788 GET http://view.atdmt.com/TLA/iview/trblfask0070000058tla/direct;wi.120;hi.600/0 1/733113640? - ROUNDROBIN_PARENT/proxy1.schoolaccess.net text/html 1099094961.923 1806 10.200.6.6 TCP_MISS/200 1454 GET http://scrapaddictinc.com/phpAdsNew/adjs.php? - ROUNDROBIN_PARENT/proxy2.schoolaccess.net application/x-javascript 1099094962.015 10314 10.200.6.5 TCP_MISS/000 348 CONNECT login.icq.com:443 - ROUNDROBIN_PARENT/proxy2.schoolaccess.net - 1099094987.107 26991 10.200.6.6 TCP_MISS/000 0 GET http://www.scrapbookingtop50.com/cgi-bin/counter.cgi? - ROUNDROBIN_PARENT/proxy1.schoolaccess.net - 1099094987.107 98887 10.200.22.20 TCP_MISS/200 1025228 GET http://motionslow.espn.go.com/persistent/ad_ebay_clock.wm9_548k.motion_persi st.gif - ROUNDROBIN_PARENT/proxy1.schoolaccess.net image/gif 1099094987.107 68887 10.200.22.39 TCP_MISS/000 0 POST http://ads.centralmedia.ws/requestimpression.aspx? - ROUNDROBIN_PARENT/proxy1.schoolaccess.net - 1099094987.358 251 10.200.22.50 TCP_IMS_HIT/304 265 GET http://hp.msn.com/c/home/bullets/000000.gif - NONE/- image/gif 1099094987.358 251 10.200.22.50 TCP_IMS_HIT/304 265 GET http://hp.msn.com/c/home/bullets/000000.gif - NONE/- image/gif 1099094987.358 251 10.200.22.50 TCP_IMS_HIT/304 265 GET http://hp.msn.com/c/home/bullets/000000.gif - NONE/- image/gif 1099094987.358 251 10.200.22.50 TCP_IMS_HIT/304 265 GET http://hp.msn.com/c/home/bullets/000000.gif - NONE/- image/gif 1099094987.744 27717 10.200.22.39 TCP_MISS/200 6326 GET http://context3.kanoodle.com/cgi-bin/context.cgi? - ROUNDROBIN_PARENT/proxy1.schoolaccess.net text/html 1099094987.789 25701 10.200.6.29 TCP_MISS/200 20432 GET http://spe.atdmt.com/ds/TBTLAASKJASK/coding_creative/120x600_top10_pics_1022 .swf? - ROUNDROBIN_PARENT/proxy1.schoolaccess.net application/x-shockwave-flash 1099094987.941 833 10.200.22.126 TCP_MISS/200 6743 GET http://www.cbs.state.or.us/ins/docs/consumer/tomake.htm - ROUNDROBIN_PARENT/proxy2.schoolaccess.net text/html Notice the 25 second gap in traffic at 1099094962.015 Here's an excerpt from the cache.log: 2004/10/29 16:09:22| cbdataFree: 0x8ad9aa8 2004/10/29 16:09:22| cbdataFree: Freeing 0x8ad9aa8 2004/10/29 16:09:22| comm_select: FD 42 ready for writing 2004/10/29 16:09:22| sslWriteServer: FD 42, 232 bytes to write 2004/10/29 16:09:22| sslWriteServer: FD 42, 232 bytes written 2004/10/29 16:09:22| cbdataLock: 0x962d620 2004/10/29 16:09:22| cbdataValid: 0x962d620 2004/10/29 16:09:22| commSetSelect: FD 40 type 1 2004/10/29 16:09:22| commSetSelect: FD 42 type 1 2004/10/29 16:09:22| cbdataUnlock: 0x962d620 2004/10/29 16:09:47| eventRun: RUN ID 9626 2004/10/29 16:09:47| eventRun: Running 'MaintainSwapSpace', id 9623 2004/10/29 16:09:47| storeMaintainSwapSpace: f=0.000000, max_scan=100, max_remove=10 2004/10/29 16:09:47| cbdataFree: 0x8a11610 2004/10/29 16:09:47| cbdataFree: Freeing 0x8a11610 2004/10/29 16:09:47| storeUfsDirMaintain: /home/squid2/cache removed 0/10 f=0.000 max_scan=100 2004/10/29 16:09:47| eventAdd: Adding 'MaintainSwapSpace', in 1.000000 seconds 2004/10/29 16:09:47| comm_select: 16+0 FDs ready at 1099094987 2004/10/29 16:09:47| checkTimeouts: FD 34 Expired 2004/10/29 16:09:47| checkTimeouts: FD 34: Call timeout handler Again, there is a 25 second gap. Looking for at the first thing to be logged after the gap, I see Running 'MaintainSwapSpace'. Searching the cache.log for this string shows it usually runs without a hitch, but not always. Prior to this particular gap, there is another: 2004/10/29 16:08:00| httpPconnTransferDone: FD 62 2004/10/29 16:08:00| commSetTimeout: FD 62 timeout 900 2004/10/29 16:08:00| commSetSelect: FD 62 type 1 2004/10/29 16:08:00| comm_select: FD 54 ready for writing 2004/10/29 16:08:00| commHandleWrite: FD 54: off 0, sz 353. 2004/10/29 16:08:00| commHandleWrite: write() returns 353 2004/10/29 16:08:00| cbdataValid: 0x8db6550 2004/10/29 16:08:00| httpSendComplete: FD 54: size 353: errflag 0. 2004/10/29 16:08:00| commSetTimeout: FD 54 timeout 900 2004/10/29 16:08:00| cbdataUnlock: 0x8db6550 2004/10/29 16:08:05| eventRun: RUN ID 9524 2004/10/29 16:08:05| eventRun: Running 'MaintainSwapSpace', id 9523 2004/10/29 16:08:05| storeMaintainSwapSpace: f=0.000000, max_scan=100, max_remove=10 2004/10/29 16:08:05| cbdataFree: 0x8a11610 2004/10/29 16:08:05| cbdataFree: Freeing 0x8a11610 2004/10/29 16:08:05| storeUfsDirMaintain: /home/squid2/cache removed 0/10 f=0.000 max_scan=100 2004/10/29 16:08:05| eventAdd: Adding 'MaintainSwapSpace', in 1.000000 seconds 2004/10/29 16:08:05| comm_select: 15+0 FDs ready at 1099094885 2004/10/29 16:08:05| comm_select: FD 23 ready for reading 2004/10/29 16:08:05| clientReadRequest: FD 23: reading request... Is this just coincidence? Am I way off base here? Would it help to make these logs accessible? Chris -----Original Message----- From: Chris Robertson [mailto:[EMAIL PROTECTED] Sent: Thursday, October 28, 2004 3:57 PM To: '[EMAIL PROTECTED]' Subject: [squid-users] Sporadic high CPU usage, no traffic I'm at a loss. Here's the situation. We have a couple hundred squid clients forced (using never-direct allow all) to run through a parent proxy array. On occasion (no discernable pattern) the squid process on the clients uses all the available CPU cycles, and tailing the access log shows that nothing is being served. Attempting to use the cachemgr.cgi (from a different box) or the squidclient (on the client proxy) returns no results while the CPU usage is running wild. Once the CPU usage slows down, results are returned. Running a squid -k debug (for the usual 2 seconds) and tailing the cache.log seems to show acls being processed. Often in conjunction with these CPU spikes I'll see "TCP connection to [parent-proxy] failed", but not always. Sometimes these connection failed messages appear in the cache log without a CPU spike. There are two round-robin parent proxies, which peer (using ICP) and a third "master" parent (also peering with ICP) which is running two squid processes (one per CPU) due to some sites not handling "sessions" changing IPs. All three are on the same network switch. They are being monitored with MRTG (http://mrtg.schoolaccess.net/squid/) and never seem to slow down. The connection failed messages don't favor one server over the other. Most of the clients are accessing the internet over satellite, but some are on terrestrial links. The problem does not occur more often with either connection method. I thought that I had found the problem when I read about the half_closed_clients bug, but after patching, compiling and installing SQUID2.5-STABLE7 on a couple of sites, the problem persists (though perhaps less often). Disabling half-closed-clients in the squid.conf, unsurprisingly had no noticeable effect. The version I replaced for testing is SQUID2.5-STABLE4. STABLE7 was compiled with --bindir=/home/squid2/bin --sbindir=/home/squid2/bin --libexecdir=/home/squid2/bin --datadir=/home/squid2/etc --sysconfdir=/etc/squid --localstatedir=/home/squid2 --mandir=/usr/man --enable-ssl --enable-err-languages=English --disable-ident-lookups --with-pthreads --enable-storeio=ufs,aufs,diskd --enable-snmp --enable-async-io --with-aio STABLE4 was compiled with --bindir=/home/squid2/bin --sbindir=/home/squid2/bin --libexecdir=/home/squid2/bin --datadir=/home/squid2/etc --sysconfdir=/etc/squid --localstatedir=/home/squid2 --mandir=/usr/man --enable-xmalloc-statistics --enable-useragent-log --enable-referer-log --enable-err-languages=English --disable-ident-lookups --with-pthreads --enable-storeio=ufs,aufs,diskd In the interest of preventing this message from becoming too long, I have posted the conf files at http://mrtg.schoolaccess.net/squidconf/. The -sparse files have all comments and blank lines stripped out. What further tools can I use (on a Linux 2.2 kernel) to figure out what squid is doing when the CPU usage spikes? What information did I leave out that might be relevant?
