Re: Child Died

Kristian Lyngstol Mon, 07 Sep 2009 02:58:59 -0700

On Wed, Sep 02, 2009 at 10:12:00AM -0400, maillis...@gmail.com wrote:
> I just started my first instance of varnish in production. Within 12 hours,
> there were alerts from our monitoring system that Varnish was taking 90% of
> the cpu. Right after that, I find these messages in /var/log/messages,
> several times over a 2 minute period:


Did you check syslog for assert errors too?

> varnishd[12461]: Child (20086) not responding to ping, killing it.
> 
> The child restarted, and the stats and cache all disappeared.
> 
> This is a machine with 8 gigs of ram and a pair of slightly older quad core
> xeons. The storage method is file with a 50 gig limit. At its peak, the
> machine is serving around 40 requests a second, about 5000k a second. The
> configs are the defaults.
> 
> What should my first steps be to troubleshoot this? Is there a likely
> culprit?

The first thing I'd do is check syslog for assert errors. If it's being
killed in the same place every time, something must be wrong (... ).
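
A rough sketch of that check, assuming syslog ends up in /var/log/messages
as in your report. The function name and the two patterns are only my
illustration, so adjust them to what your syslog actually contains.

```shell
# Hypothetical helper: print varnishd syslog lines that mention an assert
# or a child that stopped answering pings. The log path is an argument
# because it differs between distributions.
varnish_child_events() {
    logfile="$1"
    grep 'varnishd' "$logfile" | grep -i -E 'assert|not responding to ping'
}

# Example (path taken from the original report):
# varnish_child_events /var/log/messages
```

If the assert lines all name the same function and line number, that is
worth including when you write back to the list.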

Secondly, I'd check the value of cli_timeout. This default has changed over
time, but a very busy varnish can be slow to reply to pings from the
management thread, and thus get killed needlessly. You can check it with
the telnet interface or «varnishadm -T localhost:yourmanagementport
param.show cli_timeout». The new default is 10s, which should be enough,
though it still might be too low for extremely busy threads.
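
For reference, a small sketch of checking and raising the timeout. The
wrapper function names, the address localhost:6082 and the value 20 are
placeholders of mine, not recommendations; param.show and param.set are
the regular CLI commands.

```shell
# Hypothetical wrappers around varnishadm; $1 is the management address
# (host:port) that varnishd was started with via -T.
show_cli_timeout() {
    varnishadm -T "$1" param.show cli_timeout
}

# $2 is the new timeout in seconds. This changes the running instance
# only; the value is lost when varnishd itself is restarted.
set_cli_timeout() {
    varnishadm -T "$1" param.set cli_timeout "$2"
}

# Example (address and value are assumptions):
# show_cli_timeout localhost:6082
# set_cli_timeout localhost:6082 20
```

To keep the setting across restarts, the same parameter can be given on
the varnishd command line with -p cli_timeout=20.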

You may also want to send the output of varnishstat -1 (taken after varnish
has had a chance to warm up) and any custom VCL to the list.


-- 
Kristian Lyngstøl
Redpill Linpro AS
Tlf: +47 21544179
Mob: +47 99014497


_______________________________________________
varnish-misc mailing list
varnish-misc@projects.linpro.no
http://projects.linpro.no/mailman/listinfo/varnish-misc
