Hi, can you look at "varnishstat -1 | grep g_bytes" and see if it matches the memory you are seeing?
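
Something like this puts the two numbers side by side (the ps invocation is just one way to get the child's footprint; adjust to your setup):

    # storage bytes Varnish has handed out, per stevedore
    varnishstat -1 | grep g_bytes

    # resident memory of the varnishd processes, for comparison
    ps -C varnishd -o pid,rss,vsz,cmd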

--
Guillaume Quintard

On Wed, Jun 28, 2017 at 3:20 PM, Stefano Baldo <[email protected]> wrote:

> Hi Guillaume.
>
> I increased the cli_timeout yesterday to 900 sec (15 min) and it restarted anyway, which seems to indicate that the thread is really stalled.
>
> This was 1 minute after the last restart:
>
> MAIN.n_object      3908216    .    object structs made
> SMF.s0.g_alloc     7794510    .    Allocations outstanding
>
> I've just changed the I/O scheduler to noop to see what happens.
>
> One interesting thing I've found is about the memory usage.
>
> In the 1st minute of use:
> MemTotal:       3865572 kB
> MemFree:         120768 kB
> MemAvailable:   2300268 kB
>
> 1 minute before a restart:
> MemTotal:       3865572 kB
> MemFree:          82480 kB
> MemAvailable:     68316 kB
>
> It seems like the system is possibly running out of memory.
>
> When calling varnishd, I'm specifying only "-s file,..." as storage. I see in some examples that it is common to use "-s file" AND "-s malloc" together. Should I be passing "-s malloc" as well to somehow try to limit the memory usage by varnishd?
>
> Best,
> Stefano
>
> On Wed, Jun 28, 2017 at 4:12 AM, Guillaume Quintard <[email protected]> wrote:
>
>> Sadly, nothing suspicious here. You can still try:
>> - bumping the cli_timeout
>> - changing your disk scheduler
>> - changing the advice option of the file storage
>>
>> I'm still convinced this is due to Varnish getting stuck waiting for the disk because of the file storage fragmentation.
>>
>> Maybe you could look at SMF.*.g_alloc and compare it to the number of objects. Ideally, we would have a 1:1 relation between objects and allocations. If that number drops prior to a restart, that would be a good clue.
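>>
>> Something like this, run periodically, makes that comparison easy to track (SMF.s0 being the storage instance from your logs; adjust if yours is named differently):
>>
>>     varnishstat -1 -f MAIN.n_object -f SMF.s0.g_alloc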
>>
>> --
>> Guillaume Quintard
>>
>> On Tue, Jun 27, 2017 at 11:07 PM, Stefano Baldo <[email protected]> wrote:
>>
>>> Hi Guillaume.
>>>
>>> It keeps restarting.
>>> Would you mind taking a quick look at the following VCL file to check if you find anything suspicious?
>>>
>>> Thank you very much.
>>>
>>> Best,
>>> Stefano
>>>
>>> vcl 4.0;
>>>
>>> import std;
>>>
>>> backend default {
>>>     .host = "sites-web-server-lb";
>>>     .port = "80";
>>> }
>>>
>>> include "/etc/varnish/bad_bot_detection.vcl";
>>>
>>> sub vcl_recv {
>>>     call bad_bot_detection;
>>>
>>>     if (req.url == "/nocache" || req.url == "/version") {
>>>         return(pass);
>>>     }
>>>
>>>     unset req.http.Cookie;
>>>     if (req.method == "PURGE") {
>>>         ban("obj.http.x-host == " + req.http.host + " && obj.http.x-user-agent !~ Googlebot");
>>>         return(synth(750));
>>>     }
>>>
>>>     set req.url = regsuball(req.url, "(?<!(http:|https))\/+", "/");
>>> }
>>>
>>> sub vcl_synth {
>>>     if (resp.status == 750) {
>>>         set resp.status = 200;
>>>         synthetic("PURGED => " + req.url);
>>>         return(deliver);
>>>     } elsif (resp.status == 501) {
>>>         set resp.status = 200;
>>>         set resp.http.Content-Type = "text/html; charset=utf-8";
>>>         synthetic(std.fileread("/etc/varnish/pages/invalid_domain.html"));
>>>         return(deliver);
>>>     }
>>> }
>>>
>>> sub vcl_backend_response {
>>>     unset beresp.http.Set-Cookie;
>>>     set beresp.http.x-host = bereq.http.host;
>>>     set beresp.http.x-user-agent = bereq.http.user-agent;
>>>
>>>     if (bereq.url == "/themes/basic/assets/theme.min.css"
>>>         || bereq.url == "/api/events/PAGEVIEW"
>>>         || bereq.url ~ "^\/assets\/img\/") {
>>>         set beresp.http.Cache-Control = "max-age=0";
>>>     } else {
>>>         unset beresp.http.Cache-Control;
>>>     }
>>>
>>>     if (beresp.status == 200 ||
>>>         beresp.status == 301 ||
>>>         beresp.status == 302 ||
>>>         beresp.status == 404) {
>>>         if (bereq.url ~ "\&ordenar=aleatorio$") {
>>>             set beresp.http.X-TTL = "1d";
>>>             set beresp.ttl = 1d;
>>>         } else {
>>>             set beresp.http.X-TTL = "1w";
>>>             set beresp.ttl = 1w;
>>>         }
>>>     }
>>>
>>>     if (bereq.url !~ "\.(jpeg|jpg|png|gif|gz|tgz|bz2|tbz|mp3|ogg|swf|flv)$") {
>>>         set beresp.do_gzip = true;
>>>     }
>>> }
>>>
>>> sub vcl_pipe {
>>>     set bereq.http.connection = "close";
>>>     return (pipe);
>>> }
>>>
>>> sub vcl_deliver {
>>>     unset resp.http.x-host;
>>>     unset resp.http.x-user-agent;
>>> }
>>>
>>> sub vcl_backend_error {
>>>     if (beresp.status == 502 || beresp.status == 503 || beresp.status == 504) {
>>>         set beresp.status = 200;
>>>         set beresp.http.Content-Type = "text/html; charset=utf-8";
>>>         synthetic(std.fileread("/etc/varnish/pages/maintenance.html"));
>>>         return (deliver);
>>>     }
>>> }
>>>
>>> sub vcl_hash {
>>>     if (req.http.User-Agent ~ "Google Page Speed") {
>>>         hash_data("Google Page Speed");
>>>     } elsif (req.http.User-Agent ~ "Googlebot") {
>>>         hash_data("Googlebot");
>>>     }
>>> }
>>>
>>> sub vcl_deliver {
>>>     if (resp.status == 501) {
>>>         return (synth(resp.status));
>>>     }
>>>     if (obj.hits > 0) {
>>>         set resp.http.X-Cache = "hit";
>>>     } else {
>>>         set resp.http.X-Cache = "miss";
>>>     }
>>> }
>>>
>>> On Mon, Jun 26, 2017 at 3:47 PM, Guillaume Quintard <[email protected]> wrote:
>>>
>>>> Nice! It may have been the cause, time will tell. Can you report back in a few days to let us know?
>>>>
>>>> --
>>>> Guillaume Quintard
>>>>
>>>> On Jun 26, 2017 20:21, "Stefano Baldo" <[email protected]> wrote:
>>>>
>>>>> Hi Guillaume.
>>>>>
>>>>> I think things will start going better now after changing the bans.
>>>>> This is what my last varnishstat looked like moments before a crash, regarding the bans:
>>>>>
>>>>> MAIN.bans                41336     .      Count of bans
>>>>> MAIN.bans_completed      37967     .      Number of bans marked 'completed'
>>>>> MAIN.bans_obj                0     .      Number of bans using obj.*
>>>>> MAIN.bans_req            41335     .      Number of bans using req.*
>>>>> MAIN.bans_added          41336     0.68   Bans added
>>>>> MAIN.bans_deleted            0     0.00   Bans deleted
>>>>>
>>>>> And this is how it looks now:
>>>>>
>>>>> MAIN.bans                    2     .      Count of bans
>>>>> MAIN.bans_completed          1     .      Number of bans marked 'completed'
>>>>> MAIN.bans_obj                2     .      Number of bans using obj.*
>>>>> MAIN.bans_req                0     .      Number of bans using req.*
>>>>> MAIN.bans_added           2016     0.69   Bans added
>>>>> MAIN.bans_deleted         2014     0.69   Bans deleted
>>>>>
>>>>> Before the changes, bans were never deleted!
>>>>> Now the bans are added and quickly deleted after a minute or even a couple of seconds.
>>>>>
>>>>> Maybe this was the cause of the problem? It seems like Varnish was having a large number of bans to manage and test against.
>>>>> I will let it ride now. Let's see if the problem persists or it's gone! :-)
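>>>>>
>>>>> I'll keep an eye on the counters with something like this (same fields as above):
>>>>>
>>>>>     varnishstat -1 -f MAIN.bans -f MAIN.bans_completed -f MAIN.bans_added -f MAIN.bans_deleted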
>>>>>
>>>>> Best,
>>>>> Stefano
>>>>>
>>>>> On Mon, Jun 26, 2017 at 3:10 PM, Guillaume Quintard <[email protected]> wrote:
>>>>>
>>>>>> Looking good!
>>>>>>
>>>>>> --
>>>>>> Guillaume Quintard
>>>>>>
>>>>>> On Mon, Jun 26, 2017 at 7:06 PM, Stefano Baldo <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Guillaume,
>>>>>>>
>>>>>>> Can the following be considered "ban lurker friendly"?
>>>>>>>
>>>>>>> sub vcl_backend_response {
>>>>>>>     set beresp.http.x-url = bereq.http.host + bereq.url;
>>>>>>>     set beresp.http.x-user-agent = bereq.http.user-agent;
>>>>>>> }
>>>>>>>
>>>>>>> sub vcl_recv {
>>>>>>>     if (req.method == "PURGE") {
>>>>>>>         ban("obj.http.x-url == " + req.http.host + req.url + " && obj.http.x-user-agent !~ Googlebot");
>>>>>>>         return(synth(750));
>>>>>>>     }
>>>>>>> }
>>>>>>>
>>>>>>> sub vcl_deliver {
>>>>>>>     unset resp.http.x-url;
>>>>>>>     unset resp.http.x-user-agent;
>>>>>>> }
>>>>>>>
>>>>>>> Best,
>>>>>>> Stefano
>>>>>>>
>>>>>>> On Mon, Jun 26, 2017 at 12:43 PM, Guillaume Quintard <[email protected]> wrote:
>>>>>>>
>>>>>>>> Not lurker friendly at all indeed. You'll need to avoid req.* expressions. The easiest way is to stash the host, user-agent and url in beresp.http.* and ban against those (unset them in vcl_deliver).
>>>>>>>>
>>>>>>>> I don't think you need to expand the VSL at all.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Guillaume Quintard
>>>>>>>>
>>>>>>>> On Jun 26, 2017 16:51, "Stefano Baldo" <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Hi Guillaume.
>>>>>>>>
>>>>>>>> Thanks for answering.
>>>>>>>>
>>>>>>>> I'm using an SSD disk. I've changed from ext4 to ext2 to increase performance, but it's still restarting.
>>>>>>>> Also, I checked the I/O performance of the disk and there is no sign of it being overloaded.
>>>>>>>>
>>>>>>>> I've changed /var/lib/varnish to a tmpfs and increased its 80m default size, passing "-l 200m,20m" to varnishd and using "nodev,nosuid,noatime,size=256M 0 0" for the tmpfs mount. There was a problem here: after a couple of hours Varnish died and I received a "no space left on device" message - deleting /var/lib/varnish solved the problem and Varnish was up again, but it's weird because there was free memory on the host to be used by the tmpfs directory, so I don't know what could have happened. I will stop increasing the /var/lib/varnish size for now.
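>>>>>>>>
>>>>>>>> For reference, the tmpfs entry (as it would sit in /etc/fstab) is essentially:
>>>>>>>>
>>>>>>>>     tmpfs /var/lib/varnish tmpfs nodev,nosuid,noatime,size=256M 0 0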
>>>>>>>>
>>>>>>>> Anyway, I am worried about the bans. You asked me if the bans are lurker friendly. Well, I don't think so. My bans are created this way:
>>>>>>>>
>>>>>>>> ban("req.http.host == " + req.http.host + " && req.url ~ " + req.url + " && req.http.User-Agent !~ Googlebot");
>>>>>>>>
>>>>>>>> Are they lurker friendly? I was taking a quick look at the documentation and it looks like they're not.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Stefano
>>>>>>>>
>>>>>>>> On Fri, Jun 23, 2017 at 11:30 AM, Guillaume Quintard <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Stefano,
>>>>>>>>>
>>>>>>>>> Let's cover the usual suspects: I/Os. I think here Varnish gets stuck trying to push/pull data and can't make time to reply to the CLI. I'd recommend monitoring the disk activity (bandwidth and iops) to confirm.
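>>>>>>>>>
>>>>>>>>> For example, something like this (iostat comes from the sysstat package; pick any interval), keeping an eye on the await and %util columns while Varnish is under load:
>>>>>>>>>
>>>>>>>>>     iostat -xm 5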
>>>>>>>>>
>>>>>>>>> After some time, the file storage is terrible on a hard drive (SSDs take a bit more time to degrade) because of fragmentation. One solution to help the disks cope is to overprovision them if they're SSDs, and you can try the different advice options in the file storage definition on the command line (last parameter, after granularity).
>>>>>>>>>
>>>>>>>>> Is your /var/lib/varnish mounted on tmpfs? That could help too.
>>>>>>>>>
>>>>>>>>> 40K bans is a lot, are they ban-lurker friendly?
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Guillaume Quintard
>>>>>>>>>
>>>>>>>>> On Fri, Jun 23, 2017 at 4:01 PM, Stefano Baldo <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hello.
>>>>>>>>>>
>>>>>>>>>> I am having a critical problem with Varnish Cache in production for over a month and any help will be appreciated.
>>>>>>>>>> The problem is that the Varnish child process is recurrently being restarted after 10~20h of use, with the following message:
>>>>>>>>>>
>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) not responding to CLI, killed it.
>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Unexpected reply from ping: 400 CLI communication error
>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) died signal=9
>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child cleanup complete
>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) Started
>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said Child starts
>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said SMF.s0 mmap'ed 483183820800 bytes of 483183820800
>>>>>>>>>>
>>>>>>>>>> The following link is the varnishstat output just 1 minute before a restart:
>>>>>>>>>>
>>>>>>>>>> https://pastebin.com/g0g5RVTs
>>>>>>>>>>
>>>>>>>>>> Environment:
>>>>>>>>>>
>>>>>>>>>> varnish-5.1.2 revision 6ece695
>>>>>>>>>> Debian 8.7 - Debian GNU/Linux 8 (3.16.0)
>>>>>>>>>> Installed using pre-built package from official repo at packagecloud.io
>>>>>>>>>> CPU 2x2.9 GHz
>>>>>>>>>> Mem 3.69 GiB
>>>>>>>>>> Running inside a Docker container
>>>>>>>>>> NFILES=131072
>>>>>>>>>> MEMLOCK=82000
>>>>>>>>>>
>>>>>>>>>> Additional info:
>>>>>>>>>>
>>>>>>>>>> - I need to cache a large number of objects and the cache should last for almost a week, so I have set up a 450G storage space (storage flag shown below the list); I don't know if this is a problem;
>>>>>>>>>> - I use ban a lot. There were about 40k bans in the system just before the last crash. I really don't know if this is too much or may have anything to do with it;
>>>>>>>>>> - No registered CPU spikes (almost always around 30%);
>>>>>>>>>> - No panic is reported; the only info I can retrieve is from syslog;
>>>>>>>>>> - During all the time, even moments before the crashes, everything is okay and requests are being answered very fast.
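>>>>>>>>>>
>>>>>>>>>> For reference, the storage flag is essentially this (path illustrative, size as described above):
>>>>>>>>>>
>>>>>>>>>>     -s file,/var/lib/varnish/varnish_storage.bin,450G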
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Stefano Baldo
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> varnish-misc mailing list
>>>>>>>>>> [email protected]
>>>>>>>>>> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc

_______________________________________________
varnish-misc mailing list
[email protected]
https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc
