Hi all, I have a problem that I am currently trying to diagnose. Please correct me if you think this should be on the -users list instead, but I think it's a bug and I need further help diagnosing it.
We run an array of 30 squid servers sitting behind a load balancer. From time to time the number of "available file descriptors" gradually drops from 14K (the level under normal load) down to zero, but only on one particular squid at a time. Each squid is configured with a maximum of 16K file descriptors. Load across the 30 squid servers remains constant, at around 200-300 req/sec per server, and the number of available file descriptors on the remaining servers stays at around 14K.

When this fault affects a particular server, it starts acting strangely: it simply stops processing requests for particular web site(s), while other web sites can still be accessed fine. For example, google.com and yahoo.com might work OK, but bbc.co.uk fails - no error message comes back from squid or anything, the browser just hangs. Browsing the same website through an unaffected squid server (or directly) still works fine.

squidclient mgr:filedescriptors shows a whole heap of file descriptors (i.e. 14,000), the vast majority in the "half-closed" state and a few in the "Reading next request" state. Running 'netstat -an', however, shows significantly fewer actual TCP sockets (a few hundred), and of those, the majority are in the CLOSE_WAIT state.

I am concerned that, for whichever reason, squid stops processing requests for a particular website and then fails to detect when clients give up, incorrectly putting the FD into the "half-closed" state. This leads to the situation where the client has closed the socket but squid still thinks the socket is open (there is a small sketch of what I mean near the end of this mail).

Dropping the squid server out of service on the load balancer to stop actual traffic, and then running "squid -k debug", produces the following messages for lots of different FDs (I presume it is for all FDs that squid thinks are active):

2008/06/23 10:57:55| comm_call_handlers(): got fd=11314 read_event=19 write_event=18 F->read_handler=0x8070890 F->write_handler=(nil)
2008/06/23 10:57:55| comm_call_handlers(): Calling read handler on fd=11314
2008/06/23 10:57:55| clientReadRequest: FD 11314: reading request...
2008/06/23 10:57:55| clientReadRequest: FD 11314 closed?
2008/06/23 10:57:55| cbdataLock: 0x655b16c0
2008/06/23 10:57:55| cbdataValid: 0x655b16c0
2008/06/23 10:57:55| cbdataUnlock: 0x655b16c0
2008/06/23 10:57:55| commSetSelect: FD 11314 type 1
2008/06/23 10:57:55| commSetEvents(fd=11314)

I've looked through the source code but cannot find what read_event=19 means, i.e. what kind of event is triggering the comm_call_handlers() call (I've put a guess at decoding these numbers near the end of this mail). With no traffic going through the device, it consumes 14% CPU processing these events.

I could set "half_closed_clients off"; however, even at the start of the decline in file descriptors (i.e. while there are still file descriptors available) there are problems browsing certain websites, so I think this would just mask a symptom of the problem rather than fix it.

A simple restart of squid fixes the issue, but only for a while. Our support guys are currently having to restart squid on various devices about 5-10 times a day to try to minimise the impact on our customers.

We are running squid-2.6-STABLE18 on Linux (CentOS 5), using the epoll() event notification interface.
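To illustrate what I mean by squid failing to notice that a client has gone: once the client closes its end, our side of the connection sits in CLOSE_WAIT (which matches what netstat shows) and any read on that socket should return 0 (EOF). A minimal sketch of the distinction I have in mind - this is just my own illustration of the POSIX behaviour, not squid code, and peer_has_closed is a name I made up:

/* My own illustration (not squid code): telling a non-blocking socket whose
 * peer has already closed (local state CLOSE_WAIT) apart from an idle one. */
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

/* 1 = peer closed (EOF pending), 0 = still open but idle, -1 = real error.
 * MSG_PEEK leaves any pending data in the socket buffer. */
static int peer_has_closed(int fd)
{
    char c;
    ssize_t n = recv(fd, &c, 1, MSG_PEEK | MSG_DONTWAIT);
    if (n == 0)
        return 1;       /* EOF: client sent FIN, our end is in CLOSE_WAIT */
    if (n > 0)
        return 0;       /* data is waiting, connection is alive */
    if (errno == EAGAIN || errno == EWOULDBLOCK)
        return 0;       /* no data yet, connection still fully open */
    return -1;          /* ECONNRESET or similar */
}

My suspicion is that on the affected FDs the EOF is never acted upon, so the descriptor stays "half-closed" forever.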
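On the off-chance it helps, here is how I would decode the read_event/write_event numbers *if* they are raw epoll(7) event bitmasks passed through by comm_epoll.c - I have not verified that assumption against the source, so treat this as a guess:

/* Decode read_event/write_event from "squid -k debug", assuming (unverified)
 * that they are raw epoll(7) event bitmasks. */
#include <stdio.h>
#include <sys/epoll.h>

static void decode(const char *name, unsigned ev)
{
    printf("%s=%u:%s%s%s%s%s\n", name, ev,
           (ev & EPOLLIN)  ? " EPOLLIN"  : "",
           (ev & EPOLLPRI) ? " EPOLLPRI" : "",
           (ev & EPOLLOUT) ? " EPOLLOUT" : "",
           (ev & EPOLLERR) ? " EPOLLERR" : "",
           (ev & EPOLLHUP) ? " EPOLLHUP" : "");
}

int main(void)
{
    decode("read_event", 19);   /* -> EPOLLIN EPOLLPRI EPOLLHUP */
    decode("write_event", 18);  /* -> EPOLLPRI EPOLLHUP */
    return 0;
}

If that reading is correct, EPOLLHUP being set would suggest the kernel already considers these connections hung up, which would fit with the clients being long gone - but again, I have not confirmed that this is what comm_call_handlers() is actually being given.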
Any help will be much appreciated as I've run out of things to check. Please let me know if you need any further information.

Many thanks in advance.
--
Bradley Kite