On Wed, Sep 02, 2009 at 10:12:00AM -0400, maillis...@gmail.com wrote:
> I just started my first instance of varnish in production. Within 12 hours,
> there were alerts from our monitoring system that Varnish was taking 90% of
> the cpu. Right after that, I find these messages in /var/log/messages,
> several times over a 2 minute period:

Did you check syslog for assert errors too?

> varnishd[12461]: Child (20086) not responding to ping, killing it.
> 
> The child restarted, and the stats and cache all disappeared.
> 
> This is a machine with 8 gigs of ram and a pair of slightly older quad core
> xeons. The storage method is file with a 50 gig limit. At its peak, the
> machine is serving around 40 requests a second, about 5000k a second. The
> configs are the defaults.
> 
> What should my first steps be to troubleshoot this? Is there a likely
> culprit?

The first thing I'd do is check syslog for assert errors. If it's being
killed in the same place each time, something must be wrong (... ).
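
A rough sketch of that check, assuming syslog goes to /var/log/messages as
in your report and that the panic lines contain the text "Assert error"
(the exact wording may differ between versions, so adjust the pattern).
The function name is just something I made up for illustration:

```shell
# Sketch: list varnishd assert and ping-timeout lines from a syslog file.
# The default path is the one from the original report; pass another
# file as $1 if your syslog lands elsewhere.
varnish_child_events() {
    logfile="${1:-/var/log/messages}"
    # Assumed message wording: "Assert error" and "not responding to ping".
    grep -E 'varnishd\[[0-9]+\]: .*(Assert error|not responding to ping)' "$logfile"
}
```

Running «varnish_child_events» with no argument reads /var/log/messages.
If the assert lines keep naming the same function, that is worth reporting.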

Secondly, I'd check the value of cli_timeout. This default has changed over
time, but a very busy varnish can be slow to reply to pings from the
management thread, and thus get killed needlessly. You can check it with
the telnet interface or «varnishadm -T localhost:yourmanagementport
param.show cli_timeout». The new default is 10s, which should be enough,
though it still might be too low for extremely busy threads.
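
Something along these lines should do for checking and raising it. Treat it
as a sketch: localhost:6082 is only a placeholder for whatever address you
gave varnishd with -T, the function names are mine, and 20 seconds in the
usage note is an arbitrary example value, not a recommendation:

```shell
# Placeholder management address; replace with your own -T setting.
VARNISH_MGMT="${VARNISH_MGMT:-localhost:6082}"

# Print the current cli_timeout value.
show_cli_timeout() {
    varnishadm -T "$VARNISH_MGMT" param.show cli_timeout
}

# Set cli_timeout on the running instance; $1 is the new value in seconds.
raise_cli_timeout() {
    varnishadm -T "$VARNISH_MGMT" param.set cli_timeout "$1"
}
```

For example, «raise_cli_timeout 20» changes the running instance only. To
keep the value across restarts, pass it on the varnishd command line with
«-p cli_timeout=20».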

You may also want to supply a varnishstat -1 (after varnish has had a
chance to warm up) and any custom VCL to the list.


-- 
Kristian Lyngstøl
Redpill Linpro AS
Tlf: +47 21544179
Mob: +47 99014497


_______________________________________________
varnish-misc mailing list
varnish-misc@projects.linpro.no
http://projects.linpro.no/mailman/listinfo/varnish-misc