Hi Guillaume.

I increased the cli_timeout yesterday to 900 sec (15 min) and it restarted
anyway, which seems to indicate that the thread is really stalled.

This was 1 minute after the last restart:

MAIN.n_object     3908216     .   object structs made
SMF.s0.g_alloc    7794510     .   Allocations outstanding

(That's about two allocations per object, so not the 1:1 relation you
described.)

I've just changed the I/O scheduler to noop to see what happens.
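[Editor's note: a minimal sketch of the two tweaks mentioned above. It
assumes varnishadm can reach the running instance and that sda is the
device backing the file storage; adjust both to your setup.]

  # raise cli_timeout to 900 s on the running instance
  varnishadm param.set cli_timeout 900
  # or persistently, on the varnishd command line: -p cli_timeout=900

  # switch the I/O scheduler to noop; the active one shows in brackets
  echo noop > /sys/block/sda/queue/scheduler
  cat /sys/block/sda/queue/scheduler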
One interesting thing I've found is about the memory usage.

In the 1st minute of use:

MemTotal:       3865572 kB
MemFree:         120768 kB
MemAvailable:   2300268 kB

1 minute before a restart:

MemTotal:       3865572 kB
MemFree:          82480 kB
MemAvailable:     68316 kB

It seems like the system is possibly running out of memory.

When calling varnishd, I'm specifying only "-s file,..." as storage. I see
in some examples that it's common to use "-s file" AND "-s malloc"
together. Should I be passing "-s malloc" as well to somehow try to limit
the memory usage by varnishd?

Best,
Stefano
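[Editor's note: a sketch of what that could look like. The store names
"mem" and "disk", the sizes, the path, and the routing rule are made-up
illustrations, and per-object routing via beresp.storage assumes Varnish
5.0+. Note that "-s malloc" defines a second, separate store; it does not
cap the memory used by the file store.]

  varnishd ... \
      -s mem=malloc,512m \
      -s disk=file,/var/lib/varnish/storage.bin,450G

and then objects can be routed per response in VCL:

  sub vcl_backend_response {
      # hypothetical rule: images go to disk, everything else to memory
      if (bereq.url ~ "^/assets/img/") {
          set beresp.storage = storage.disk;
      } else {
          set beresp.storage = storage.mem;
      }
  }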
On Wed, Jun 28, 2017 at 4:12 AM, Guillaume Quintard
<[email protected]> wrote:

> Sadly, nothing suspicious here. You can still try:
> - bumping the cli_timeout
> - changing your disk scheduler
> - changing the advice option of the file storage
>
> I'm still convinced this is due to Varnish getting stuck waiting for the
> disk because of the file storage fragmentation.
>
> Maybe you could look at SMF.*.g_alloc and compare it to the number of
> objects. Ideally, we would have a 1:1 relation between objects and
> allocations. If that number drops prior to a restart, that would be a
> good clue.
>
> --
> Guillaume Quintard
>
> On Tue, Jun 27, 2017 at 11:07 PM, Stefano Baldo <[email protected]>
> wrote:
>
>> Hi Guillaume.
>>
>> It keeps restarting.
>> Would you mind taking a quick look at the following VCL file to check
>> if you find anything suspicious?
>>
>> Thank you very much.
>>
>> Best,
>> Stefano
>>
>> vcl 4.0;
>>
>> import std;
>>
>> backend default {
>>   .host = "sites-web-server-lb";
>>   .port = "80";
>> }
>>
>> include "/etc/varnish/bad_bot_detection.vcl";
>>
>> sub vcl_recv {
>>   call bad_bot_detection;
>>
>>   if (req.url == "/nocache" || req.url == "/version") {
>>     return(pass);
>>   }
>>
>>   unset req.http.Cookie;
>>   if (req.method == "PURGE") {
>>     ban("obj.http.x-host == " + req.http.host + " && obj.http.x-user-agent !~ Googlebot");
>>     return(synth(750));
>>   }
>>
>>   set req.url = regsuball(req.url, "(?<!(http:|https))\/+", "/");
>> }
>>
>> sub vcl_synth {
>>   if (resp.status == 750) {
>>     set resp.status = 200;
>>     synthetic("PURGED => " + req.url);
>>     return(deliver);
>>   } elsif (resp.status == 501) {
>>     set resp.status = 200;
>>     set resp.http.Content-Type = "text/html; charset=utf-8";
>>     synthetic(std.fileread("/etc/varnish/pages/invalid_domain.html"));
>>     return(deliver);
>>   }
>> }
>>
>> sub vcl_backend_response {
>>   unset beresp.http.Set-Cookie;
>>   set beresp.http.x-host = bereq.http.host;
>>   set beresp.http.x-user-agent = bereq.http.user-agent;
>>
>>   if (bereq.url == "/themes/basic/assets/theme.min.css"
>>       || bereq.url == "/api/events/PAGEVIEW"
>>       || bereq.url ~ "^\/assets\/img\/") {
>>     set beresp.http.Cache-Control = "max-age=0";
>>   } else {
>>     unset beresp.http.Cache-Control;
>>   }
>>
>>   if (beresp.status == 200 ||
>>       beresp.status == 301 ||
>>       beresp.status == 302 ||
>>       beresp.status == 404) {
>>     if (bereq.url ~ "\&ordenar=aleatorio$") {
>>       set beresp.http.X-TTL = "1d";
>>       set beresp.ttl = 1d;
>>     } else {
>>       set beresp.http.X-TTL = "1w";
>>       set beresp.ttl = 1w;
>>     }
>>   }
>>
>>   if (bereq.url !~ "\.(jpeg|jpg|png|gif|gz|tgz|bz2|tbz|mp3|ogg|swf|flv)$") {
>>     set beresp.do_gzip = true;
>>   }
>> }
>>
>> sub vcl_pipe {
>>   set bereq.http.connection = "close";
>>   return (pipe);
>> }
>>
>> sub vcl_deliver {
>>   unset resp.http.x-host;
>>   unset resp.http.x-user-agent;
>> }
>>
>> sub vcl_backend_error {
>>   if (beresp.status == 502 || beresp.status == 503 || beresp.status == 504) {
>>     set beresp.status = 200;
>>     set beresp.http.Content-Type = "text/html; charset=utf-8";
>>     synthetic(std.fileread("/etc/varnish/pages/maintenance.html"));
>>     return (deliver);
>>   }
>> }
>>
>> sub vcl_hash {
>>   if (req.http.User-Agent ~ "Google Page Speed") {
>>     hash_data("Google Page Speed");
>>   } elsif (req.http.User-Agent ~ "Googlebot") {
>>     hash_data("Googlebot");
>>   }
>> }
>>
>> sub vcl_deliver {
>>   if (resp.status == 501) {
>>     return (synth(resp.status));
>>   }
>>   if (obj.hits > 0) {
>>     set resp.http.X-Cache = "hit";
>>   } else {
>>     set resp.http.X-Cache = "miss";
>>   }
>> }
>>
>> On Mon, Jun 26, 2017 at 3:47 PM, Guillaume Quintard
>> <[email protected]> wrote:
>>
>>> Nice! It may have been the cause, time will tell. Can you report back
>>> in a few days to let us know?
>>>
>>> --
>>> Guillaume Quintard
>>>
>>> On Jun 26, 2017 20:21, "Stefano Baldo" <[email protected]> wrote:
>>>
>>>> Hi Guillaume.
>>>>
>>>> I think things will start going better now after changing the bans.
>>>> This is what my last varnishstat looked like moments before a crash,
>>>> regarding the bans:
>>>>
>>>> MAIN.bans             41336     .    Count of bans
>>>> MAIN.bans_completed   37967     .    Number of bans marked 'completed'
>>>> MAIN.bans_obj             0     .    Number of bans using obj.*
>>>> MAIN.bans_req         41335     .    Number of bans using req.*
>>>> MAIN.bans_added       41336  0.68    Bans added
>>>> MAIN.bans_deleted         0  0.00    Bans deleted
>>>>
>>>> And this is how it looks now:
>>>>
>>>> MAIN.bans                 2     .    Count of bans
>>>> MAIN.bans_completed       1     .    Number of bans marked 'completed'
>>>> MAIN.bans_obj             2     .    Number of bans using obj.*
>>>> MAIN.bans_req             0     .    Number of bans using req.*
>>>> MAIN.bans_added        2016  0.69    Bans added
>>>> MAIN.bans_deleted      2014  0.69    Bans deleted
>>>>
>>>> Before the changes, bans were never deleted!
>>>> Now the bans are added and quickly deleted after a minute or even a
>>>> couple of seconds.
>>>>
>>>> Maybe this was the cause of the problem? It seems like Varnish had a
>>>> large number of bans to manage and test objects against.
>>>> I will let it ride now. Let's see if the problem persists or it's
>>>> gone! :-)
>>>>
>>>> Best,
>>>> Stefano
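[Editor's note: a small sketch for keeping an eye on this from the shell;
counter names are as in Varnish 4.1/5.x.]

  # list the bans currently held, with their state
  varnishadm ban.list

  # one-shot dump of the ban counters, including the lurker's progress
  varnishstat -1 -f 'MAIN.bans*'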
>>>>
>>>> On Mon, Jun 26, 2017 at 3:10 PM, Guillaume Quintard
>>>> <[email protected]> wrote:
>>>>
>>>>> Looking good!
>>>>>
>>>>> --
>>>>> Guillaume Quintard
>>>>>
>>>>> On Mon, Jun 26, 2017 at 7:06 PM, Stefano Baldo <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi Guillaume,
>>>>>>
>>>>>> Can the following be considered "ban lurker friendly"?
>>>>>>
>>>>>> sub vcl_backend_response {
>>>>>>   set beresp.http.x-url = bereq.http.host + bereq.url;
>>>>>>   set beresp.http.x-user-agent = bereq.http.user-agent;
>>>>>> }
>>>>>>
>>>>>> sub vcl_recv {
>>>>>>   if (req.method == "PURGE") {
>>>>>>     ban("obj.http.x-url == " + req.http.host + req.url + " && obj.http.x-user-agent !~ Googlebot");
>>>>>>     return(synth(750));
>>>>>>   }
>>>>>> }
>>>>>>
>>>>>> sub vcl_deliver {
>>>>>>   unset resp.http.x-url;
>>>>>>   unset resp.http.x-user-agent;
>>>>>> }
>>>>>>
>>>>>> Best,
>>>>>> Stefano
>>>>>>
>>>>>> On Mon, Jun 26, 2017 at 12:43 PM, Guillaume Quintard
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>>> Not lurker friendly at all indeed. You'll need to avoid req.*
>>>>>>> expressions. The easiest way is to stash the host, user-agent and
>>>>>>> url in beresp.http.* and ban against those (unset them in
>>>>>>> vcl_deliver).
>>>>>>>
>>>>>>> I don't think you need to expand the VSL at all.
>>>>>>>
>>>>>>> --
>>>>>>> Guillaume Quintard
>>>>>>>
>>>>>>> On Jun 26, 2017 16:51, "Stefano Baldo" <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi Guillaume.
>>>>>>>
>>>>>>> Thanks for answering.
>>>>>>>
>>>>>>> I'm using an SSD disk. I've changed from ext4 to ext2 to increase
>>>>>>> performance but it still keeps restarting.
>>>>>>> Also, I checked the I/O performance for the disk and there is no
>>>>>>> sign of overload.
>>>>>>>
>>>>>>> I've changed /var/lib/varnish to a tmpfs and increased its 80m
>>>>>>> default size, passing "-l 200m,20m" to varnishd and using
>>>>>>> "nodev,nosuid,noatime,size=256M 0 0" for the tmpfs mount. There was
>>>>>>> a problem here. After a couple of hours varnish died and I received
>>>>>>> a "no space left on device" message - deleting /var/lib/varnish
>>>>>>> solved the problem and varnish was up again, but it's weird because
>>>>>>> there was free memory on the host to be used with the tmpfs
>>>>>>> directory, so I don't know what could have happened. I will stop
>>>>>>> increasing the /var/lib/varnish size for now.
>>>>>>>
>>>>>>> Anyway, I am worried about the bans. You asked me if the bans are
>>>>>>> lurker friendly. Well, I don't think so. My bans are created this
>>>>>>> way:
>>>>>>>
>>>>>>> ban("req.http.host == " + req.http.host + " && req.url ~ " + req.url + " && req.http.User-Agent !~ Googlebot");
>>>>>>>
>>>>>>> Are they lurker friendly? I was taking a quick look at the
>>>>>>> documentation and it looks like they're not.
>>>>>>>
>>>>>>> Best,
>>>>>>> Stefano
>>>>>>>
>>>>>>> On Fri, Jun 23, 2017 at 11:30 AM, Guillaume Quintard
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Stefano,
>>>>>>>>
>>>>>>>> Let's cover the usual suspects: I/Os. I think here Varnish gets
>>>>>>>> stuck trying to push/pull data and can't make time to reply to
>>>>>>>> the CLI. I'd recommend monitoring the disk activity (bandwidth
>>>>>>>> and iops) to confirm.
>>>>>>>>
>>>>>>>> After some time, the file storage is terrible on a hard drive
>>>>>>>> (SSDs take a bit more time to degrade) because of fragmentation.
>>>>>>>> One solution to help the disks cope is to overprovision them if
>>>>>>>> they're SSDs, and you can try a different advice option in the
>>>>>>>> file storage definition on the command line (last parameter,
>>>>>>>> after granularity).
>>>>>>>>
>>>>>>>> Is your /var/lib/varnish mounted on tmpfs? That could help too.
>>>>>>>>
>>>>>>>> 40K bans is a lot, are they ban-lurker friendly?
>>>>>>>>
>>>>>>>> --
>>>>>>>> Guillaume Quintard
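[Editor's note: two concrete ways to act on this, as a sketch. iostat
comes from the sysstat package; the path, granularity and advice values
below are placeholders to show the syntax.]

  # per-device bandwidth and IOPS, refreshed every second
  iostat -dxm 1

  # file storage with explicit granularity and advice (e.g. "random")
  varnishd ... -s file,/var/lib/varnish/storage.bin,450G,4k,random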
>>>>>>>>
>>>>>>>> On Fri, Jun 23, 2017 at 4:01 PM, Stefano Baldo
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hello.
>>>>>>>>>
>>>>>>>>> I've been having a critical problem with Varnish Cache in
>>>>>>>>> production for over a month, and any help will be appreciated.
>>>>>>>>> The problem is that the Varnish child process is recurrently
>>>>>>>>> restarted after 10~20h of use, with the following message:
>>>>>>>>>
>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) not responding to CLI, killed it.
>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Unexpected reply from ping: 400 CLI communication error
>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) died signal=9
>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child cleanup complete
>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) Started
>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said Child starts
>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said SMF.s0 mmap'ed 483183820800 bytes of 483183820800
>>>>>>>>>
>>>>>>>>> The following link is the varnishstat output just 1 minute
>>>>>>>>> before a restart:
>>>>>>>>>
>>>>>>>>> https://pastebin.com/g0g5RVTs
>>>>>>>>>
>>>>>>>>> Environment:
>>>>>>>>>
>>>>>>>>> varnish-5.1.2 revision 6ece695
>>>>>>>>> Debian 8.7 - Debian GNU/Linux 8 (3.16.0)
>>>>>>>>> Installed using pre-built package from official repo at packagecloud.io
>>>>>>>>> CPU 2x2.9 GHz
>>>>>>>>> Mem 3.69 GiB
>>>>>>>>> Running inside a Docker container
>>>>>>>>> NFILES=131072
>>>>>>>>> MEMLOCK=82000
>>>>>>>>>
>>>>>>>>> Additional info:
>>>>>>>>>
>>>>>>>>> - I need to cache a large number of objects and the cache should
>>>>>>>>> last for almost a week, so I have set up a 450G storage space. I
>>>>>>>>> don't know if this is a problem;
>>>>>>>>> - I use ban a lot. There were about 40k bans in the system just
>>>>>>>>> before the last crash. I really don't know if this is too much
>>>>>>>>> or may have anything to do with it;
>>>>>>>>> - No registered CPU spikes (almost always around 30%);
>>>>>>>>> - No panic is reported; the only info I can retrieve is from
>>>>>>>>> syslog;
>>>>>>>>> - All the time, even moments before the crashes, everything is
>>>>>>>>> okay and requests are answered very fast.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Stefano Baldo
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> varnish-misc mailing list
>>>>>>>>> [email protected]
>>>>>>>>> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc
_______________________________________________
varnish-misc mailing list
[email protected]
https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc
