Hi, can you look at "varnishstat -1 | grep g_bytes" and see if it matches the memory you are seeing?
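
Something like this puts the two numbers side by side (the ps invocation is just one way to get the child's footprint; adjust to your setup):

    # storage bytes Varnish has handed out, per stevedore
    varnishstat -1 | grep g_bytes

    # resident memory of the varnishd processes, for comparison
    ps -C varnishd -o pid,rss,vsz,cmd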

--
Guillaume Quintard

On Wed, Jun 28, 2017 at 3:20 PM, Stefano Baldo <[email protected]> wrote:

> Hi Guillaume.
>
> I increased the cli_timeout yesterday to 900 sec (15 min) and it restarted anyway, which seems to indicate that the thread is really stalled.
>
> This was 1 minute after the last restart:
>
> MAIN.n_object      3908216    .    object structs made
> SMF.s0.g_alloc     7794510    .    Allocations outstanding
>
> I've just changed the I/O scheduler to noop to see what happens.
>
> One interesting thing I've found is about the memory usage.
>
> In the 1st minute of use:
> MemTotal:       3865572 kB
> MemFree:         120768 kB
> MemAvailable:   2300268 kB
>
> 1 minute before a restart:
> MemTotal:       3865572 kB
> MemFree:          82480 kB
> MemAvailable:     68316 kB
>
> It seems like the system is possibly running out of memory.
>
> When calling varnishd, I'm specifying only "-s file,..." as storage. I see in some examples that it is common to use "-s file" AND "-s malloc" together. Should I be passing "-s malloc" as well to somehow try to limit the memory usage by varnishd?
>
> Best,
> Stefano
>
> On Wed, Jun 28, 2017 at 4:12 AM, Guillaume Quintard <[email protected]> wrote:
>
>> Sadly, nothing suspicious here. You can still try:
>> - bumping the cli_timeout
>> - changing your disk scheduler
>> - changing the advice option of the file storage
>>
>> I'm still convinced this is due to Varnish getting stuck waiting for the disk because of the file storage fragmentation.
>>
>> Maybe you could look at SMF.*.g_alloc and compare it to the number of objects. Ideally, we would have a 1:1 relation between objects and allocations. If that number drops prior to a restart, that would be a good clue.
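>>
>> Something like this, run periodically, makes that comparison easy to track (SMF.s0 being the storage instance from your logs; adjust if yours is named differently):
>>
>>     varnishstat -1 -f MAIN.n_object -f SMF.s0.g_alloc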
>>
>> --
>> Guillaume Quintard
>>
>> On Tue, Jun 27, 2017 at 11:07 PM, Stefano Baldo <[email protected]> wrote:
>>
>>> Hi Guillaume.
>>>
>>> It keeps restarting.
>>> Would you mind taking a quick look at the following VCL file to check if you find anything suspicious?
>>>
>>> Thank you very much.
>>>
>>> Best,
>>> Stefano
>>>
>>> vcl 4.0;
>>>
>>> import std;
>>>
>>> backend default {
>>>     .host = "sites-web-server-lb";
>>>     .port = "80";
>>> }
>>>
>>> include "/etc/varnish/bad_bot_detection.vcl";
>>>
>>> sub vcl_recv {
>>>     call bad_bot_detection;
>>>
>>>     if (req.url == "/nocache" || req.url == "/version") {
>>>         return(pass);
>>>     }
>>>
>>>     unset req.http.Cookie;
>>>     if (req.method == "PURGE") {
>>>         ban("obj.http.x-host == " + req.http.host + " && obj.http.x-user-agent !~ Googlebot");
>>>         return(synth(750));
>>>     }
>>>
>>>     set req.url = regsuball(req.url, "(?<!(http:|https))\/+", "/");
>>> }
>>>
>>> sub vcl_synth {
>>>     if (resp.status == 750) {
>>>         set resp.status = 200;
>>>         synthetic("PURGED => " + req.url);
>>>         return(deliver);
>>>     } elsif (resp.status == 501) {
>>>         set resp.status = 200;
>>>         set resp.http.Content-Type = "text/html; charset=utf-8";
>>>         synthetic(std.fileread("/etc/varnish/pages/invalid_domain.html"));
>>>         return(deliver);
>>>     }
>>> }
>>>
>>> sub vcl_backend_response {
>>>     unset beresp.http.Set-Cookie;
>>>     set beresp.http.x-host = bereq.http.host;
>>>     set beresp.http.x-user-agent = bereq.http.user-agent;
>>>
>>>     if (bereq.url == "/themes/basic/assets/theme.min.css"
>>>         || bereq.url == "/api/events/PAGEVIEW"
>>>         || bereq.url ~ "^\/assets\/img\/") {
>>>         set beresp.http.Cache-Control = "max-age=0";
>>>     } else {
>>>         unset beresp.http.Cache-Control;
>>>     }
>>>
>>>     if (beresp.status == 200 ||
>>>         beresp.status == 301 ||
>>>         beresp.status == 302 ||
>>>         beresp.status == 404) {
>>>         if (bereq.url ~ "\&ordenar=aleatorio$") {
>>>             set beresp.http.X-TTL = "1d";
>>>             set beresp.ttl = 1d;
>>>         } else {
>>>             set beresp.http.X-TTL = "1w";
>>>             set beresp.ttl = 1w;
>>>         }
>>>     }
>>>
>>>     if (bereq.url !~ "\.(jpeg|jpg|png|gif|gz|tgz|bz2|tbz|mp3|ogg|swf|flv)$") {
>>>         set beresp.do_gzip = true;
>>>     }
>>> }
>>>
>>> sub vcl_pipe {
>>>     set bereq.http.connection = "close";
>>>     return (pipe);
>>> }
>>>
>>> sub vcl_deliver {
>>>     unset resp.http.x-host;
>>>     unset resp.http.x-user-agent;
>>> }
>>>
>>> sub vcl_backend_error {
>>>     if (beresp.status == 502 || beresp.status == 503 || beresp.status == 504) {
>>>         set beresp.status = 200;
>>>         set beresp.http.Content-Type = "text/html; charset=utf-8";
>>>         synthetic(std.fileread("/etc/varnish/pages/maintenance.html"));
>>>         return (deliver);
>>>     }
>>> }
>>>
>>> sub vcl_hash {
>>>     if (req.http.User-Agent ~ "Google Page Speed") {
>>>         hash_data("Google Page Speed");
>>>     } elsif (req.http.User-Agent ~ "Googlebot") {
>>>         hash_data("Googlebot");
>>>     }
>>> }
>>>
>>> sub vcl_deliver {
>>>     if (resp.status == 501) {
>>>         return (synth(resp.status));
>>>     }
>>>     if (obj.hits > 0) {
>>>         set resp.http.X-Cache = "hit";
>>>     } else {
>>>         set resp.http.X-Cache = "miss";
>>>     }
>>> }
>>>
>>> On Mon, Jun 26, 2017 at 3:47 PM, Guillaume Quintard <[email protected]> wrote:
>>>
>>>> Nice! It may have been the cause, time will tell. Can you report back in a few days to let us know?
>>>>
>>>> --
>>>> Guillaume Quintard
>>>>
>>>> On Jun 26, 2017 20:21, "Stefano Baldo" <[email protected]> wrote:
>>>>
>>>>> Hi Guillaume.
>>>>>
>>>>> I think things will start going better now after changing the bans.
>>>>> This is what my last varnishstat looked like moments before a crash, regarding the bans:
>>>>>
>>>>> MAIN.bans                41336     .      Count of bans
>>>>> MAIN.bans_completed      37967     .      Number of bans marked 'completed'
>>>>> MAIN.bans_obj                0     .      Number of bans using obj.*
>>>>> MAIN.bans_req            41335     .      Number of bans using req.*
>>>>> MAIN.bans_added          41336     0.68   Bans added
>>>>> MAIN.bans_deleted            0     0.00   Bans deleted
>>>>>
>>>>> And this is how it looks now:
>>>>>
>>>>> MAIN.bans                    2     .      Count of bans
>>>>> MAIN.bans_completed          1     .      Number of bans marked 'completed'
>>>>> MAIN.bans_obj                2     .      Number of bans using obj.*
>>>>> MAIN.bans_req                0     .      Number of bans using req.*
>>>>> MAIN.bans_added           2016     0.69   Bans added
>>>>> MAIN.bans_deleted         2014     0.69   Bans deleted
>>>>>
>>>>> Before the changes, bans were never deleted!
>>>>> Now the bans are added and quickly deleted after a minute or even a couple of seconds.
>>>>>
>>>>> Maybe this was the cause of the problem? It seems like Varnish was having a large number of bans to manage and test against.
>>>>> I will let it ride now. Let's see if the problem persists or it's gone! :-)
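>>>>>
>>>>> I'll keep an eye on the counters with something like this (same fields as above):
>>>>>
>>>>>     varnishstat -1 -f MAIN.bans -f MAIN.bans_completed -f MAIN.bans_added -f MAIN.bans_deleted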
>>>>>
>>>>> Best,
>>>>> Stefano
>>>>>
>>>>> On Mon, Jun 26, 2017 at 3:10 PM, Guillaume Quintard <[email protected]> wrote:
>>>>>
>>>>>> Looking good!
>>>>>>
>>>>>> --
>>>>>> Guillaume Quintard
>>>>>>
>>>>>> On Mon, Jun 26, 2017 at 7:06 PM, Stefano Baldo <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Guillaume,
>>>>>>>
>>>>>>> Can the following be considered "ban lurker friendly"?
>>>>>>>
>>>>>>> sub vcl_backend_response {
>>>>>>>     set beresp.http.x-url = bereq.http.host + bereq.url;
>>>>>>>     set beresp.http.x-user-agent = bereq.http.user-agent;
>>>>>>> }
>>>>>>>
>>>>>>> sub vcl_recv {
>>>>>>>     if (req.method == "PURGE") {
>>>>>>>         ban("obj.http.x-url == " + req.http.host + req.url + " && obj.http.x-user-agent !~ Googlebot");
>>>>>>>         return(synth(750));
>>>>>>>     }
>>>>>>> }
>>>>>>>
>>>>>>> sub vcl_deliver {
>>>>>>>     unset resp.http.x-url;
>>>>>>>     unset resp.http.x-user-agent;
>>>>>>> }
>>>>>>>
>>>>>>> Best,
>>>>>>> Stefano
>>>>>>>
>>>>>>> On Mon, Jun 26, 2017 at 12:43 PM, Guillaume Quintard <[email protected]> wrote:
>>>>>>>
>>>>>>>> Not lurker friendly at all indeed. You'll need to avoid req.* expressions. The easiest way is to stash the host, user-agent and url in beresp.http.* and ban against those (unset them in vcl_deliver).
>>>>>>>>
>>>>>>>> I don't think you need to expand the VSL at all.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Guillaume Quintard
>>>>>>>>
>>>>>>>> On Jun 26, 2017 16:51, "Stefano Baldo" <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Hi Guillaume.
>>>>>>>>
>>>>>>>> Thanks for answering.
>>>>>>>>
>>>>>>>> I'm using an SSD disk. I've changed from ext4 to ext2 to increase performance, but it's still restarting.
>>>>>>>> Also, I checked the I/O performance of the disk and there is no sign of it being overloaded.
>>>>>>>>
>>>>>>>> I've changed /var/lib/varnish to a tmpfs and increased its 80m default size, passing "-l 200m,20m" to varnishd and using "nodev,nosuid,noatime,size=256M 0 0" for the tmpfs mount. There was a problem here: after a couple of hours Varnish died and I received a "no space left on device" message - deleting /var/lib/varnish solved the problem and Varnish was up again, but it's weird because there was free memory on the host to be used by the tmpfs directory, so I don't know what could have happened. I will stop increasing the /var/lib/varnish size for now.
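>>>>>>>>
>>>>>>>> For reference, the tmpfs entry (as it would sit in /etc/fstab) is essentially:
>>>>>>>>
>>>>>>>>     tmpfs /var/lib/varnish tmpfs nodev,nosuid,noatime,size=256M 0 0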
>>>>>>>>
>>>>>>>> Anyway, I am worried about the bans. You asked me if the bans are lurker friendly. Well, I don't think so. My bans are created this way:
>>>>>>>>
>>>>>>>> ban("req.http.host == " + req.http.host + " && req.url ~ " + req.url + " && req.http.User-Agent !~ Googlebot");
>>>>>>>>
>>>>>>>> Are they lurker friendly? I was taking a quick look at the documentation and it looks like they're not.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Stefano
>>>>>>>>
>>>>>>>> On Fri, Jun 23, 2017 at 11:30 AM, Guillaume Quintard <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Stefano,
>>>>>>>>>
>>>>>>>>> Let's cover the usual suspects: I/Os. I think here Varnish gets stuck trying to push/pull data and can't make time to reply to the CLI. I'd recommend monitoring the disk activity (bandwidth and iops) to confirm.
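>>>>>>>>>
>>>>>>>>> For example, something like this (iostat comes from the sysstat package; pick any interval), keeping an eye on the await and %util columns while Varnish is under load:
>>>>>>>>>
>>>>>>>>>     iostat -xm 5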
>>>>>>>>>
>>>>>>>>> After some time, the file storage is terrible on a hard drive (SSDs take a bit more time to degrade) because of fragmentation. One solution to help the disks cope is to overprovision them if they're SSDs, and you can try the different advice options in the file storage definition on the command line (last parameter, after granularity).
>>>>>>>>>
>>>>>>>>> Is your /var/lib/varnish mounted on tmpfs? That could help too.
>>>>>>>>>
>>>>>>>>> 40K bans is a lot, are they ban-lurker friendly?
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Guillaume Quintard
>>>>>>>>>
>>>>>>>>> On Fri, Jun 23, 2017 at 4:01 PM, Stefano Baldo <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hello.
>>>>>>>>>>
>>>>>>>>>> I am having a critical problem with Varnish Cache in production for over a month and any help will be appreciated.
>>>>>>>>>> The problem is that the Varnish child process is recurrently being restarted after 10~20h of use, with the following message:
>>>>>>>>>>
>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) not responding to CLI, killed it.
>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Unexpected reply from ping: 400 CLI communication error
>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) died signal=9
>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child cleanup complete
>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) Started
>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said Child starts
>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said SMF.s0 mmap'ed 483183820800 bytes of 483183820800
>>>>>>>>>>
>>>>>>>>>> The following link is the varnishstat output just 1 minute before a restart:
>>>>>>>>>>
>>>>>>>>>> https://pastebin.com/g0g5RVTs
>>>>>>>>>>
>>>>>>>>>> Environment:
>>>>>>>>>>
>>>>>>>>>> varnish-5.1.2 revision 6ece695
>>>>>>>>>> Debian 8.7 - Debian GNU/Linux 8 (3.16.0)
>>>>>>>>>> Installed using pre-built package from official repo at packagecloud.io
>>>>>>>>>> CPU 2x2.9 GHz
>>>>>>>>>> Mem 3.69 GiB
>>>>>>>>>> Running inside a Docker container
>>>>>>>>>> NFILES=131072
>>>>>>>>>> MEMLOCK=82000
>>>>>>>>>>
>>>>>>>>>> Additional info:
>>>>>>>>>>
>>>>>>>>>> - I need to cache a large number of objects and the cache should last for almost a week, so I have set up a 450G storage space (storage flag shown below the list); I don't know if this is a problem;
>>>>>>>>>> - I use ban a lot. There were about 40k bans in the system just before the last crash. I really don't know if this is too much or may have anything to do with it;
>>>>>>>>>> - No registered CPU spikes (almost always around 30%);
>>>>>>>>>> - No panic is reported; the only info I can retrieve is from syslog;
>>>>>>>>>> - During all the time, even moments before the crashes, everything is okay and requests are being answered very fast.
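>>>>>>>>>>
>>>>>>>>>> For reference, the storage flag is essentially this (path illustrative, size as described above):
>>>>>>>>>>
>>>>>>>>>>     -s file,/var/lib/varnish/varnish_storage.bin,450G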
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Stefano Baldo
>>>>>>>>>>
>>>>>>>>>> _______________________________________________
>>>>>>>>>> varnish-misc mailing list
>>>>>>>>>> [email protected]
>>>>>>>>>> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc

_______________________________________________
varnish-misc mailing list
[email protected]
https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc
