On Wed, Sep 02, 2009 at 10:12:00AM -0400, maillis...@gmail.com wrote:
> I just started my first instance of varnish in production. Within 12 hours,
> there were alerts from our monitoring system that Varnish was taking 90% of
> the cpu. Right after that, I find these messages in /var/log/messages,
> several times over a 2 minute period:

Did you check syslog for assert errors too?

> varnishd[12461]: Child (20086) not responding to ping, killing it.
>
> The child restarted, and the stats and cache all disappeared.
>
> This is a machine with 8 gigs of ram and a pair of slightly older quad core
> xeons. The storage method is file with a 50 gig limit. At its peak, the
> machine is serving around 40 requests a second, about 5000k a second. The
> configs are the defaults.
>
> What should my first steps be to troubleshoot this? Is there a likely
> culprit?

The first thing I'd do is check syslog for assert errors. If it's being
killed in the same place every time, something must be wrong (... ).

Secondly, I'd check the value of cli_timeout. The default has changed over
time, but a very busy varnish can be slow to reply to pings from the
management process, and thus get killed needlessly. You can check it with
the telnet interface or
«varnishadm -T localhost:yourmanagementport param.show cli_timeout».
The new default is 10s, which should be enough, though it might still be
too low for extremely busy threads.

You may also want to send the output of varnishstat -1 (taken after varnish
has had a chance to warm up) and any custom VCL to the list.

-- 
Kristian Lyngstøl
Redpill Linpro AS
Tlf: +47 21544179
Mob: +47 99014497
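
For reference, the checks suggested above could be scripted roughly like this. It is only a sketch: the helper name varnish_crash_lines is made up for illustration, the "Assert error" pattern is an assumption about how the panic text appears in your syslog, and the management address and the value 20 in the commented commands are placeholders, not recommendations.

```shell
#!/bin/sh
# Sketch only: the helper name and the "Assert error" pattern are assumptions.

# Print log lines that mention a killed child or an assert error.
# $1: log file to scan, e.g. /var/log/messages
varnish_crash_lines() {
    grep -E 'not responding to ping|Assert error' "$1"
}

# Typical use on the cache host (commented out, needs a running varnishd;
# replace yourmanagementport with the real management port):
#   varnish_crash_lines /var/log/messages
#   varnishadm -T localhost:yourmanagementport param.show cli_timeout
#   varnishadm -T localhost:yourmanagementport param.set cli_timeout 20
#   varnishstat -1
```

If the same assert shows up before every kill, that points to a bug rather than load. If there are only the ping messages, raising cli_timeout is the cheaper thing to try first.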
_______________________________________________
varnish-misc mailing list
varnish-misc@projects.linpro.no
http://projects.linpro.no/mailman/listinfo/varnish-misc