Nice! It may have been the cause; time will tell. Can you report back in a
few days to let us know?

--
Guillaume Quintard
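In the meantime, a quick way to keep an eye on the ban counters from the
shell (a rough sketch; it assumes the stock varnishstat/varnishadm shipped
with 5.1 and the default instance name):

    # Show only the ban-related counters (glob match on the field name).
    varnishstat -1 -f "MAIN.bans*"

    # Rough count of entries currently on the ban list (the first line or
    # two of the output are headers, so subtract those).
    varnishadm ban.list | wc -l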
On Jun 26, 2017 20:21, "Stefano Baldo" <[email protected]> wrote:

> Hi Guillaume.
>
> I think things will start going better now after changing the bans.
> This is how my last varnishstat looked, moments before a crash,
> regarding the bans:
>
> MAIN.bans                41336         .    Count of bans
> MAIN.bans_completed      37967         .    Number of bans marked 'completed'
> MAIN.bans_obj                0         .    Number of bans using obj.*
> MAIN.bans_req            41335         .    Number of bans using req.*
> MAIN.bans_added          41336      0.68    Bans added
> MAIN.bans_deleted            0      0.00    Bans deleted
>
> And this is how it looks now:
>
> MAIN.bans                    2         .    Count of bans
> MAIN.bans_completed          1         .    Number of bans marked 'completed'
> MAIN.bans_obj                2         .    Number of bans using obj.*
> MAIN.bans_req                0         .    Number of bans using req.*
> MAIN.bans_added           2016      0.69    Bans added
> MAIN.bans_deleted         2014      0.69    Bans deleted
>
> Before the changes, bans were never deleted!
> Now the bans are added and quickly deleted after a minute or even a
> couple of seconds.
>
> Maybe this was the cause of the problem? It seems like Varnish was
> having a large number of bans to manage and test against.
> I will let it ride now. Let's see if the problem persists or it's gone! :-)
>
> Best,
> Stefano
>
>
> On Mon, Jun 26, 2017 at 3:10 PM, Guillaume Quintard
> <[email protected]> wrote:
>
>> Looking good!
>>
>> --
>> Guillaume Quintard
>>
>> On Mon, Jun 26, 2017 at 7:06 PM, Stefano Baldo <[email protected]>
>> wrote:
>>
>>> Hi Guillaume,
>>>
>>> Can the following be considered "ban lurker friendly"?
>>>
>>> sub vcl_backend_response {
>>>     set beresp.http.x-url = bereq.http.host + bereq.url;
>>>     set beresp.http.x-user-agent = bereq.http.user-agent;
>>> }
>>>
>>> sub vcl_recv {
>>>     if (req.method == "PURGE") {
>>>         ban("obj.http.x-url == " + req.http.host + req.url +
>>>             " && obj.http.x-user-agent !~ Googlebot");
>>>         return (synth(750));
>>>     }
>>> }
>>>
>>> sub vcl_deliver {
>>>     unset resp.http.x-url;
>>>     unset resp.http.x-user-agent;
>>> }
>>>
>>> Best,
>>> Stefano
>>>
>>>
>>> On Mon, Jun 26, 2017 at 12:43 PM, Guillaume Quintard
>>> <[email protected]> wrote:
>>>
>>>> Not lurker friendly at all, indeed. You'll need to avoid req.*
>>>> expressions. The easiest way is to stash the host, user-agent and url
>>>> in beresp.http.* and ban against those (unset them in vcl_deliver).
>>>>
>>>> I don't think you need to expand the VSL at all.
>>>>
>>>> --
>>>> Guillaume Quintard
>>>>
>>>> On Jun 26, 2017 16:51, "Stefano Baldo" <[email protected]> wrote:
>>>>
>>>> Hi Guillaume.
>>>>
>>>> Thanks for answering.
>>>>
>>>> I'm using an SSD disk. I've changed from ext4 to ext2 to increase
>>>> performance, but it still keeps restarting.
>>>> Also, I checked the I/O performance of the disk and there is no sign
>>>> of overload.
>>>>
>>>> I've changed /var/lib/varnish to a tmpfs and increased its 80m default
>>>> size by passing "-l 200m,20m" to varnishd and using
>>>> "nodev,nosuid,noatime,size=256M 0 0" for the tmpfs mount. There was a
>>>> problem here: after a couple of hours Varnish died and I received a
>>>> "no space left on device" message - deleting /var/lib/varnish solved
>>>> the problem and Varnish was up again, but it's weird because there was
>>>> free memory on the host to be used by the tmpfs directory, so I don't
>>>> know what could have happened. For now I will stop increasing the
>>>> /var/lib/varnish size.
>>>>
>>>> Anyway, I am worried about the bans. You asked me if the bans are
>>>> lurker friendly. Well, I don't think so.
>>>> My bans are created this way:
>>>>
>>>> ban("req.http.host == " + req.http.host + " && req.url ~ " + req.url +
>>>>     " && req.http.User-Agent !~ Googlebot");
>>>>
>>>> Are they lurker friendly? I took a quick look at the documentation and
>>>> it looks like they're not.
>>>>
>>>> Best,
>>>> Stefano
>>>>
>>>>
>>>> On Fri, Jun 23, 2017 at 11:30 AM, Guillaume Quintard
>>>> <[email protected]> wrote:
>>>>
>>>>> Hi Stefano,
>>>>>
>>>>> Let's cover the usual suspects: I/Os. I think Varnish gets stuck here
>>>>> trying to push/pull data and can't make time to reply to the CLI. I'd
>>>>> recommend monitoring the disk activity (bandwidth and iops) to
>>>>> confirm.
>>>>>
>>>>> After some time, the file storage is terrible on a hard drive (SSDs
>>>>> take a bit more time to degrade) because of fragmentation. One
>>>>> solution to help the disks cope is to overprovision them if they're
>>>>> SSDs, and you can try different advice values in the file storage
>>>>> definition on the command line (last parameter, after granularity).
>>>>>
>>>>> Is your /var/lib/varnish mounted on tmpfs? That could help too.
>>>>>
>>>>> 40K bans is a lot, are they ban-lurker friendly?
>>>>>
>>>>> --
>>>>> Guillaume Quintard
>>>>>
>>>>> On Fri, Jun 23, 2017 at 4:01 PM, Stefano Baldo
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Hello.
>>>>>>
>>>>>> I am having a critical problem with Varnish Cache in production for
>>>>>> over a month, and any help will be appreciated.
>>>>>> The problem is that the Varnish child process is recurrently being
>>>>>> restarted after 10~20h of use, with the following messages:
>>>>>>
>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) not
>>>>>>   responding to CLI, killed it.
>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Unexpected reply from
>>>>>>   ping: 400 CLI communication error
>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) died
>>>>>>   signal=9
>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child cleanup complete
>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) Started
>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said
>>>>>>   Child starts
>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said
>>>>>>   SMF.s0 mmap'ed 483183820800 bytes of 483183820800
>>>>>>
>>>>>> The following link is the varnishstat output just 1 minute before a
>>>>>> restart:
>>>>>>
>>>>>> https://pastebin.com/g0g5RVTs
>>>>>>
>>>>>> Environment:
>>>>>>
>>>>>> varnish-5.1.2 revision 6ece695
>>>>>> Debian 8.7 - Debian GNU/Linux 8 (3.16.0)
>>>>>> Installed using the pre-built package from the official repo at
>>>>>> packagecloud.io
>>>>>> CPU 2x2.9 GHz
>>>>>> Mem 3.69 GiB
>>>>>> Running inside a Docker container
>>>>>> NFILES=131072
>>>>>> MEMLOCK=82000
>>>>>>
>>>>>> Additional info:
>>>>>>
>>>>>> - I need to cache a large number of objects and the cache should
>>>>>>   last for almost a week, so I have set up a 450G storage space;
>>>>>>   I don't know if this is a problem;
>>>>>> - I use ban a lot. There were about 40k bans in the system just
>>>>>>   before the last crash. I really don't know if this is too much or
>>>>>>   may have anything to do with it;
>>>>>> - No registered CPU spikes (almost always around 30%);
>>>>>> - No panic is reported; the only info I can retrieve is from syslog;
>>>>>> - All the time, even moments before the crashes, everything is okay
>>>>>>   and requests are being answered very fast.
>>>>>> Best,
>>>>>> Stefano Baldo
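One gap in the VCL quoted above: the PURGE branch ends with
return (synth(750)), but no matching vcl_synth handler appears anywhere in
the thread. A minimal sketch of what that branch could look like is below;
the 200 status and the body text are assumptions for illustration, not
something taken from the original mails.

    sub vcl_synth {
        if (resp.status == 750) {
            # Turn the internal 750 marker into a normal reply for the
            # client that issued the PURGE request.
            set resp.status = 200;
            set resp.http.Content-Type = "text/plain; charset=utf-8";
            synthetic("Ban added");
            return (deliver);
        }
    }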
_______________________________________________
varnish-misc mailing list
[email protected]
https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc
