Hello all, I am a varnish-cache user and most of the time everything goes smoothly, but I have encountered 2 serious problems with it. I have two deployments of varnish on different web clusters: - number1 web cluster - number2 web cluster Both have pretty much the same vcl.
The first problem was last week with number1 web cluster. /var/log/messages: May 22 11:13:03 kernel: varnishd[21817] general protection rip:3497e00b79 rsp:62ed6970 error:0 May 22 11:13:32 kernel: varnishd[21958] general protection rip:3497e00b79 rsp:721f5970 error:0 May 22 11:14:48 kernel: varnishd[22115] general protection rip:3497e00b79 rsp:42fe2970 error:0 May 22 11:15:21 kernel: varnishd[22476] general protection rip:3497e00b79 rsp:7264e970 error:0 May 22 11:15:32 kernel: varnishd[22575] general protection rip:3497e00b79 rsp:5b135970 error:0 May 22 11:16:06 kernel: varnishd[22611]: segfault at 0000000000000010 rip 0000003497e00b6a rsp 0000000045d40970 error 4 May 22 11:16:21 kernel: varnishd[22746] general protection rip:3497e00b79 rsp:4c5bd970 error:0 May 22 11:16:33 kernel: varnishd[22876] general protection rip:3497e00b79 rsp:5b6ad970 error:0 May 22 11:16:39 kernel: varnishd[22889] general protection rip:3497e00b79 rsp:485a5970 error:0 May 22 11:16:51 kernel: varnishd[22969] general protection rip:3497e00b79 rsp:5d4c9970 error:0 When trying to access any website hosted on number1 cluster, I would get Connection refused. Based on what the log says, it seems that I get these protection faults everyday. The second problem was also on the same day, with number2 web cluster. /var/log/messages: May 22 16:17:16 varnishd[2451]: Child (25232) not responding to ping, killing it. May 22 16:17:18 varnishd[2451]: Child (25232) not responding to ping, killing it. May 22 16:17:18 varnishd[2451]: Child (25232) died signal=3 May 22 16:17:18 varnishd[2451]: child (22499) Started May 22 16:17:23 varnishd[2451]: Pushing vcls failed: CLI communication error May 22 16:17:30 varnishd[2451]: Child (22499) said Closed fds: 4 5 6 11 12 14 15 May 22 16:17:30 varnishd[2451]: Child (22499) said Child starts May 22 16:17:34 varnishd[2451]: Child (22499) said managed to mmap 2147483648 bytes of 2147483648 May 22 16:17:34 varnishd[2451]: Child (22499) said Ready May 22 16:17:36 varnishd[2451]: Child (22499) ended Here also I got Connection refused. What I noticed in the second log was that the hours were stuck somehow. It failed at exactly 19:18, and I restarted varnish around at 19:24, but the logs matching the restart were also "in-the-past". May 22 16:23:05 varnishd[2451]: Manager got SIGINT May 22 16:23:07 varnishd[23228]: child (23229) Started May 22 16:23:07 varnishd[23228]: Child (23229) said Closed fds: 4 5 6 10 11 13 14 May 22 16:23:07 varnishd[23228]: Child (23229) said Child starts May 22 16:23:07 varnishd[23228]: Child (23229) said managed to mmap 2147483648 bytes of 2147483648 May 22 16:23:07 varnishd[23228]: Child (23229) said Ready Everything else was logged at the correct time (I am using ntp), so I am dismissing the fact that the server was not in sync. Also, it is very strange that two completely unrelated varnish clusters would fail in the same day, at 9 hours distance. Using CentOs 5.4, 2.6.18-164.15.1.el5 and varnishd (varnish-2.0.5) on both machines. Any help on debugging these problems is kindly appreciated! _______________________________________________ varnish-misc mailing list [email protected] http://lists.varnish-cache.org/mailman/listinfo/varnish-misc
