Thanks Amos for the explanation. Apologies for the lack of clarity.

FYI - we have ICAP connection set up to be a 'critical' service.
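For reference, the relevant part of our squid.conf looks roughly like this (a sketch only - the service name, host and port are placeholders, not our actual values):

```
# ICAP enabled, with a REQMOD service marked critical (bypass=0),
# so squid will NOT skip the service when it is unreachable.
icap_enable on
icap_service service_req reqmod_precache bypass=0 icap://icap.example.com:1344/reqmod
adaptation_access service_req allow all
```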

Do you know if squid's ICAP functionality changed between 3.0 & 3.1? We were 
not seeing some of these issues previously. For instance, if the ICAP server 
went down previously, then after the ICAP timeout (icap_io_timeout) squid 
clients would simply receive 500 responses for all queued connections (as seen 
in the squid access logs), effectively limiting the number of connections that 
could queue up. Are you saying that *all* client connections will now be 
queued - even after the timeout? If the ICAP server went down for a long 
period while connections kept being made to squid, the number of queued 
connections would grow very large. This could easily make squid unresponsive 
for a long time, or permanently - would this be correct?

Note - we are still seeing very high file descriptor usage in squid, even 3 
hours after the restarts. The file descriptor count is currently around 3.2k 
and has never dropped much below that. Would it take this long to rebuild the 
journal?

I'm also noticing that the number of 'TIME_WAIT' connections between squid & 
the ICAP server is very high - above 2k - and has been for the last 30 
minutes. Is this anything to worry about? The number of 'ESTABLISHED' 
connections never goes above 50.
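For context, this is roughly how I'm tallying connection states (a sketch: the sample lines below stand in for real `netstat -an` output filtered on the ICAP port, with 1344 assumed as an example):

```shell
# Tally TCP states from netstat-style lines; on the real box this
# would be: netstat -an | grep '\.1344 ' | awk '{ ... }'
printf '%s\n' \
  '10.0.0.1.54321  10.0.0.2.1344  TIME_WAIT' \
  '10.0.0.1.54322  10.0.0.2.1344  TIME_WAIT' \
  '10.0.0.1.54323  10.0.0.2.1344  ESTABLISHED' |
awk '{ states[$NF]++ } END { for (s in states) print s, states[s] }'
```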

Queue Congestion Errors:
2011/11/06 08:03:07| squidaio_queue_request: WARNING - Queue congestion

Thanks,
Justin


-----Original Message-----
From: Amos Jeffries [mailto:[email protected]]
Sent: Sunday, November 06, 2011 6:22 PM
To: [email protected]
Subject: Re: [squid-users] Squid Crashed - then ran out of file descriptors 
after restart

On 6/11/2011 9:15 p.m., Justin Lawler wrote:
> Hi,
>
> We're running squid 3.1.16 on solaris on a sparc box.
>
> We're running it against an ICAP server, and were testing some scenarios when 
> ICAP server went down, how squid would handle it. After freezing the ICAP 
> server, squid seemed to have big problems.

For reference, the expected behaviour is this:

** if squid is configured to allow bypass of the ICAP service
   --> No noticeable problems. Possibly faster response times for clients.

** if squid is configured not to bypass on failures (ie a critical ICAP
service)
   --> New connections continue to be accepted.
   --> All traffic needing ICAP halts waiting for recovery; RAM and FD
consumption rises until available resources are exhausted.
   --> On ICAP recovery the held traffic gets sent to it and service resumes
as the results come back.
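The two modes above correspond to the bypass flag on the service line in squid.conf. A minimal sketch (service name and URL are illustrative only):

```
# bypass=1: on ICAP failure, traffic skips the service and continues.
# bypass=0: service is critical; traffic is held until ICAP recovers.
icap_service service_resp respmod_precache bypass=1 icap://icap.example.com:1344/respmod
```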

>
> Once it was back up again, it kept on sending OPTION requests to the server - 
> but squid itself became completely unresponsive. It wouldn't accept any 
> further requests, you couldn't use squidclient against it or doing a squid 
> reconfigure, and was not responding to 'squid -k shutdown', so had to be 
> manually killed with a 'kill -9'.

This description is not very clear. You seem to use "it" to refer to several 
different things in the first sentence of paragraph 2.

Apparently:
  * "it" comes back up again. ... apparently refering to ICAP?  >>>>> JL (yes)
  * "it" sends OPTION requests ... apparently referring to Squid now? or to 
some unmentioned backend part of the ICAP service?  >>>>> JL (referring to 
squid)
  * squid itself is unresponsive .... waiting for queued requets to get through 
ICAP and the network fetch stages perhapse? noting that ICAP may be slowed as 
it faces teh spike or waiting traffic from Squid.


>
> We then restarted the squid instance, and it started to go crazy, file 
> descriptors reaching the limit (4096 - previously it never went above 
> 1k during long

"kill -9" causes Squid to terminate before savign teh cache index or closing 
the journal properly. Thus on restart the journal is discovered corrupt and a 
"DIRTY" rebuild is begun. Scanning the entire disk cache object by object to 
rebuild the index and journa contents. This can consume a lot of FD, for a 
period of time proportional to the size of your disk cache(s).

Also, clients can hit Squid with a lot of connections that accumulated during 
the outage, each of which has to be processed in full - including all lookups 
and tests - immediately. This startup spike is normal right after a 
start/restart or reconfigure, when all the active running state is erased and 
has to be rebuilt.

The lag problems and resource/queue overloads can be expected to drop away 
relatively quickly as the normal running state is rebuilt from the new 
traffic. The FD consumption from the cache scan will disappear abruptly when 
that process completes.

> stability test runs), and a load of 'Queue Congestion' errors in the logs. 
> Tried to restart it again, and it seemed to behave better then, but still the 
> number of file descriptors is very big (above 3k).

Any particular queue mentioned?


Amos