Transient is not limited, I suppose, so the g_space == 0 is normal. You can try running on bare metal; I'm not sure there will be a difference.
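For what it's worth, if you did want to put a lid on it, Transient can be capped by declaring a malloc storage named "Transient" on the varnishd command line. A minimal sketch — the 256m is only an example, and the file path/size just mirror a typical setup:

    # illustrative only: cap the transient storage at 256 MB
    varnishd ... \
        -s file,/var/lib/varnish/varnish_storage.bin,450G \
        -s Transient=malloc,256m

Keep in mind that once a capped Transient fills up, allocations for short-lived and uncacheable objects can start failing, so leave some headroom.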
--
Guillaume Quintard

On Wed, Jun 28, 2017 at 3:54 PM, Stefano Baldo <[email protected]> wrote:

> Also, we are running varnish inside a docker container.
> The storage disk is attached to the same host, and mounted into the container via a docker volume.
>
> Do you think it's worth a try to run varnish directly on the host, avoiding docker? I don't see how this could be a problem, but I don't know what to do anymore.
>
> Best,
>
> On Wed, Jun 28, 2017 at 10:43 AM, Guillaume Quintard <[email protected]> wrote:
>
>> Yeah, I was wondering about Transient, but it seems under control.
>>
>> Apart from moving away from file storage, I have nothing at the moment :-/
>>
>> --
>> Guillaume Quintard
>>
>> On Wed, Jun 28, 2017 at 3:39 PM, Stefano Baldo <[email protected]> wrote:
>>
>>> Hi.
>>>
>>> root@2c6c325b279f:/# varnishstat -1 | grep g_bytes
>>> SMA.Transient.g_bytes          519022          .   Bytes outstanding
>>> SMF.s0.g_bytes            23662845952          .   Bytes outstanding
>>>
>>> You mean g_bytes from SMA.Transient? I have set no malloc storage.
>>>
>>> On Wed, Jun 28, 2017 at 10:26 AM, Guillaume Quintard <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> Can you look at "varnishstat -1 | grep g_bytes" and see if it matches the memory you are seeing?
>>>>
>>>> --
>>>> Guillaume Quintard
>>>>
>>>> On Wed, Jun 28, 2017 at 3:20 PM, Stefano Baldo <[email protected]> wrote:
>>>>
>>>>> Hi Guillaume.
>>>>>
>>>>> I increased the cli_timeout yesterday to 900sec (15min) and it restarted anyway, which seems to indicate that the thread is really stalled.
>>>>>
>>>>> This was 1 minute after the last restart:
>>>>>
>>>>> MAIN.n_object        3908216          .   object structs made
>>>>> SMF.s0.g_alloc       7794510          .   Allocations outstanding
>>>>>
>>>>> I've just changed the I/O scheduler to noop to see what happens.
>>>>>
>>>>> One interesting thing I've found is about the memory usage.
>>>>>
>>>>> In the 1st minute of use:
>>>>> MemTotal:        3865572 kB
>>>>> MemFree:          120768 kB
>>>>> MemAvailable:    2300268 kB
>>>>>
>>>>> 1 minute before a restart:
>>>>> MemTotal:        3865572 kB
>>>>> MemFree:           82480 kB
>>>>> MemAvailable:      68316 kB
>>>>>
>>>>> It seems like the system is possibly running out of memory.
>>>>>
>>>>> When calling varnishd, I'm specifying only "-s file,..." as storage. I see in some examples that it is common to use "-s file" AND "-s malloc" together. Should I be passing "-s malloc" as well to somehow try to limit varnishd's memory usage?
>>>>>
>>>>> Best,
>>>>> Stefano
>>>>>
>>>>> On Wed, Jun 28, 2017 at 4:12 AM, Guillaume Quintard <[email protected]> wrote:
>>>>>
>>>>>> Sadly, nothing suspicious here. You can still try:
>>>>>> - bumping the cli_timeout
>>>>>> - changing your disk scheduler
>>>>>> - changing the advice option of the file storage
>>>>>>
>>>>>> I'm still convinced this is due to Varnish getting stuck waiting for the disk because of file storage fragmentation.
>>>>>>
>>>>>> Maybe you could look at SMF.*.g_alloc and compare it to the number of objects. Ideally, we would have a 1:1 relation between objects and allocations. If that number drops prior to a restart, that would be a good clue.
>>>>>>
>>>>>> --
>>>>>> Guillaume Quintard
>>>>>>
>>>>>> On Tue, Jun 27, 2017 at 11:07 PM, Stefano Baldo <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Guillaume.
>>>>>>>
>>>>>>> It keeps restarting.
>>>>>>> Would you mind taking a quick look at the following VCL file to check if you find anything suspicious?
>>>>>>>
>>>>>>> Thank you very much.
>>>>>>>
>>>>>>> Best,
>>>>>>> Stefano
>>>>>>>
>>>>>>> vcl 4.0;
>>>>>>>
>>>>>>> import std;
>>>>>>>
>>>>>>> backend default {
>>>>>>>   .host = "sites-web-server-lb";
>>>>>>>   .port = "80";
>>>>>>> }
>>>>>>>
>>>>>>> include "/etc/varnish/bad_bot_detection.vcl";
>>>>>>>
>>>>>>> sub vcl_recv {
>>>>>>>   call bad_bot_detection;
>>>>>>>
>>>>>>>   if (req.url == "/nocache" || req.url == "/version") {
>>>>>>>     return(pass);
>>>>>>>   }
>>>>>>>
>>>>>>>   unset req.http.Cookie;
>>>>>>>   if (req.method == "PURGE") {
>>>>>>>     ban("obj.http.x-host == " + req.http.host + " && obj.http.x-user-agent !~ Googlebot");
>>>>>>>     return(synth(750));
>>>>>>>   }
>>>>>>>
>>>>>>>   set req.url = regsuball(req.url, "(?<!(http:|https))\/+", "/");
>>>>>>> }
>>>>>>>
>>>>>>> sub vcl_synth {
>>>>>>>   if (resp.status == 750) {
>>>>>>>     set resp.status = 200;
>>>>>>>     synthetic("PURGED => " + req.url);
>>>>>>>     return(deliver);
>>>>>>>   } elsif (resp.status == 501) {
>>>>>>>     set resp.status = 200;
>>>>>>>     set resp.http.Content-Type = "text/html; charset=utf-8";
>>>>>>>     synthetic(std.fileread("/etc/varnish/pages/invalid_domain.html"));
>>>>>>>     return(deliver);
>>>>>>>   }
>>>>>>> }
>>>>>>>
>>>>>>> sub vcl_backend_response {
>>>>>>>   unset beresp.http.Set-Cookie;
>>>>>>>   set beresp.http.x-host = bereq.http.host;
>>>>>>>   set beresp.http.x-user-agent = bereq.http.user-agent;
>>>>>>>
>>>>>>>   if (bereq.url == "/themes/basic/assets/theme.min.css"
>>>>>>>       || bereq.url == "/api/events/PAGEVIEW"
>>>>>>>       || bereq.url ~ "^\/assets\/img\/") {
>>>>>>>     set beresp.http.Cache-Control = "max-age=0";
>>>>>>>   } else {
>>>>>>>     unset beresp.http.Cache-Control;
>>>>>>>   }
>>>>>>>
>>>>>>>   if (beresp.status == 200 ||
>>>>>>>       beresp.status == 301 ||
>>>>>>>       beresp.status == 302 ||
>>>>>>>       beresp.status == 404) {
>>>>>>>     if (bereq.url ~ "\&ordenar=aleatorio$") {
>>>>>>>       set beresp.http.X-TTL = "1d";
>>>>>>>       set beresp.ttl = 1d;
>>>>>>>     } else {
>>>>>>>       set beresp.http.X-TTL = "1w";
>>>>>>>       set beresp.ttl = 1w;
>>>>>>>     }
>>>>>>>   }
>>>>>>>
>>>>>>>   if (bereq.url !~ "\.(jpeg|jpg|png|gif|gz|tgz|bz2|tbz|mp3|ogg|swf|flv)$") {
>>>>>>>     set beresp.do_gzip = true;
>>>>>>>   }
>>>>>>> }
>>>>>>>
>>>>>>> sub vcl_pipe {
>>>>>>>   set bereq.http.connection = "close";
>>>>>>>   return (pipe);
>>>>>>> }
>>>>>>>
>>>>>>> sub vcl_deliver {
>>>>>>>   unset resp.http.x-host;
>>>>>>>   unset resp.http.x-user-agent;
>>>>>>> }
>>>>>>>
>>>>>>> sub vcl_backend_error {
>>>>>>>   if (beresp.status == 502 || beresp.status == 503 || beresp.status == 504) {
>>>>>>>     set beresp.status = 200;
>>>>>>>     set beresp.http.Content-Type = "text/html; charset=utf-8";
>>>>>>>     synthetic(std.fileread("/etc/varnish/pages/maintenance.html"));
>>>>>>>     return (deliver);
>>>>>>>   }
>>>>>>> }
>>>>>>>
>>>>>>> sub vcl_hash {
>>>>>>>   if (req.http.User-Agent ~ "Google Page Speed") {
>>>>>>>     hash_data("Google Page Speed");
>>>>>>>   } elsif (req.http.User-Agent ~ "Googlebot") {
>>>>>>>     hash_data("Googlebot");
>>>>>>>   }
>>>>>>> }
>>>>>>>
>>>>>>> sub vcl_deliver {
>>>>>>>   if (resp.status == 501) {
>>>>>>>     return (synth(resp.status));
>>>>>>>   }
>>>>>>>   if (obj.hits > 0) {
>>>>>>>     set resp.http.X-Cache = "hit";
>>>>>>>   } else {
>>>>>>>     set resp.http.X-Cache = "miss";
>>>>>>>   }
>>>>>>> }
>>>>>>>
>>>>>>> On Mon, Jun 26, 2017 at 3:47 PM, Guillaume Quintard <[email protected]> wrote:
>>>>>>>
>>>>>>>> Nice! It may have been the cause, time will tell. Can you report back in a few days to let us know?
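In the meantime, an easy way to keep an eye on whether the ban lurker keeps up (counter names as reported by varnishstat on 5.x; the 10-second interval is arbitrary):

    # list the bans currently held; completed bans are flagged in the output
    varnishadm ban.list

    # watch the ban counters: bans_deleted should keep climbing,
    # and bans / bans_req should stay low
    watch -n 10 'varnishstat -1 | grep MAIN.bans'

If MAIN.bans_req stays at 0 and MAIN.bans stays small, the lurker is doing its job.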
>>>>>>>> --
>>>>>>>> Guillaume Quintard
>>>>>>>>
>>>>>>>> On Jun 26, 2017 20:21, "Stefano Baldo" <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Guillaume.
>>>>>>>>>
>>>>>>>>> I think things will start going better now after changing the bans.
>>>>>>>>> This is what my last varnishstat looked like moments before a crash, regarding the bans:
>>>>>>>>>
>>>>>>>>> MAIN.bans                41336          .   Count of bans
>>>>>>>>> MAIN.bans_completed      37967          .   Number of bans marked 'completed'
>>>>>>>>> MAIN.bans_obj                0          .   Number of bans using obj.*
>>>>>>>>> MAIN.bans_req            41335          .   Number of bans using req.*
>>>>>>>>> MAIN.bans_added          41336         0.68  Bans added
>>>>>>>>> MAIN.bans_deleted            0         0.00  Bans deleted
>>>>>>>>>
>>>>>>>>> And this is what it looks like now:
>>>>>>>>>
>>>>>>>>> MAIN.bans                    2          .   Count of bans
>>>>>>>>> MAIN.bans_completed          1          .   Number of bans marked 'completed'
>>>>>>>>> MAIN.bans_obj                2          .   Number of bans using obj.*
>>>>>>>>> MAIN.bans_req                0          .   Number of bans using req.*
>>>>>>>>> MAIN.bans_added           2016         0.69  Bans added
>>>>>>>>> MAIN.bans_deleted         2014         0.69  Bans deleted
>>>>>>>>>
>>>>>>>>> Before the changes, bans were never deleted!
>>>>>>>>> Now the bans are added and quickly deleted after a minute or even a couple of seconds.
>>>>>>>>>
>>>>>>>>> Maybe this was the cause of the problem? It seems like varnish was having to manage and test against a large number of bans.
>>>>>>>>> I will let it ride now. Let's see if the problem persists or if it's gone! :-)
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Stefano
>>>>>>>>>
>>>>>>>>> On Mon, Jun 26, 2017 at 3:10 PM, Guillaume Quintard <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Looking good!
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Guillaume Quintard
>>>>>>>>>>
>>>>>>>>>> On Mon, Jun 26, 2017 at 7:06 PM, Stefano Baldo <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Guillaume,
>>>>>>>>>>>
>>>>>>>>>>> Can the following be considered "ban lurker friendly"?
>>>>>>>>>>>
>>>>>>>>>>> sub vcl_backend_response {
>>>>>>>>>>>   set beresp.http.x-url = bereq.http.host + bereq.url;
>>>>>>>>>>>   set beresp.http.x-user-agent = bereq.http.user-agent;
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> sub vcl_recv {
>>>>>>>>>>>   if (req.method == "PURGE") {
>>>>>>>>>>>     ban("obj.http.x-url == " + req.http.host + req.url + " && obj.http.x-user-agent !~ Googlebot");
>>>>>>>>>>>     return(synth(750));
>>>>>>>>>>>   }
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> sub vcl_deliver {
>>>>>>>>>>>   unset resp.http.x-url;
>>>>>>>>>>>   unset resp.http.x-user-agent;
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Stefano
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jun 26, 2017 at 12:43 PM, Guillaume Quintard <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Not lurker friendly at all indeed. You'll need to avoid req.* expressions. The easiest way is to stash the host, user-agent and url in beresp.http.* and ban against those (unset them in vcl_deliver).
>>>>>>>>>>>>
>>>>>>>>>>>> I don't think you need to expand the VSL at all.
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Guillaume Quintard
>>>>>>>>>>>>
>>>>>>>>>>>> On Jun 26, 2017 16:51, "Stefano Baldo" <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Guillaume.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for answering.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm using an SSD disk. I've changed from ext4 to ext2 to increase performance, but it still keeps restarting.
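For the record, iostat from the sysstat package is a simple way to watch per-device bandwidth and IOPS while this is going on; the device name below is only an example, use whatever actually backs the storage file:

    # extended per-device statistics in MB/s, refreshed every 5 seconds
    iostat -dxm 5
    # or restrict it to a single device, e.g.
    iostat -dxm 5 sda

The %util and await columns around the time of a restart are the interesting ones; if they spike, the stuck-on-disk-I/O theory gains weight.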
>>>>>>>>>>>> Also, I checked the I/O performance for the disk and there is no sign of overload.
>>>>>>>>>>>>
>>>>>>>>>>>> I've changed /var/lib/varnish to a tmpfs and increased its default 80m size, passing "-l 200m,20m" to varnishd and using "nodev,nosuid,noatime,size=256M 0 0" for the tmpfs mount. There was a problem here: after a couple of hours varnish died and I received a "no space left on device" message. Deleting /var/lib/varnish solved the problem and varnish was up again, but it's weird because there was free memory on the host to be used by the tmpfs directory, so I don't know what could have happened. I will stop increasing the /var/lib/varnish size.
>>>>>>>>>>>>
>>>>>>>>>>>> Anyway, I am worried about the bans. You asked me if the bans are lurker friendly. Well, I don't think so. My bans are created this way:
>>>>>>>>>>>>
>>>>>>>>>>>> ban("req.http.host == " + req.http.host + " && req.url ~ " + req.url + " && req.http.User-Agent !~ Googlebot");
>>>>>>>>>>>>
>>>>>>>>>>>> Are they lurker friendly? I took a quick look at the documentation and it looks like they're not.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Stefano
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Jun 23, 2017 at 11:30 AM, Guillaume Quintard <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Stefano,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Let's cover the usual suspects: I/Os. I think here Varnish gets stuck trying to push/pull data and can't make time to reply to the CLI. I'd recommend monitoring the disk activity (bandwidth and iops) to confirm.
>>>>>>>>>>>>>
>>>>>>>>>>>>> After some time, the file storage is terrible on a hard drive (SSDs take a bit more time to degrade) because of fragmentation. One solution to help the disks cope is to overprovision them if they're SSDs, and you can try a different advice value in the file storage definition on the command line (last parameter, after granularity).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is your /var/lib/varnish mounted on tmpfs? That could help too.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 40K bans is a lot; are they ban-lurker friendly?
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Guillaume Quintard
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Jun 23, 2017 at 4:01 PM, Stefano Baldo <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hello.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am having a critical problem with Varnish Cache in production for over a month and any help will be appreciated.
>>>>>>>>>>>>>> The problem is that the Varnish child process is recurrently restarted after 10~20h of use, with the following message:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) not responding to CLI, killed it.
>>>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Unexpected reply from ping: 400 CLI communication error
>>>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) died signal=9
>>>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child cleanup complete
>>>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) Started
>>>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said Child starts
>>>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said SMF.s0 mmap'ed 483183820800 bytes of 483183820800
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The following link is the varnishstat output just 1 minute before a restart:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://pastebin.com/g0g5RVTs
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Environment:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> varnish-5.1.2 revision 6ece695
>>>>>>>>>>>>>> Debian 8.7 - Debian GNU/Linux 8 (3.16.0)
>>>>>>>>>>>>>> Installed using pre-built package from official repo at packagecloud.io
>>>>>>>>>>>>>> CPU 2x2.9 GHz
>>>>>>>>>>>>>> Mem 3.69 GiB
>>>>>>>>>>>>>> Running inside a Docker container
>>>>>>>>>>>>>> NFILES=131072
>>>>>>>>>>>>>> MEMLOCK=82000
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Additional info:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - I need to cache a large number of objects and the cache should last for almost a week, so I have set up a 450G storage space; I don't know if this is a problem;
>>>>>>>>>>>>>> - I use ban a lot. There were about 40k bans in the system just before the last crash. I really don't know if this is too much or whether it has anything to do with the problem;
>>>>>>>>>>>>>> - No registered CPU spikes (almost always around 30%);
>>>>>>>>>>>>>> - No panic is reported; the only info I can retrieve is from syslog;
>>>>>>>>>>>>>> - All the time, even moments before the crashes, everything is okay and requests are being answered very fast.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> Stefano Baldo
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>> varnish-misc mailing list
>>>>>>>>>>>>>> [email protected]
>>>>>>>>>>>>>> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc
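Coming back to the advice option mentioned earlier: it is the last positional argument of the file storage definition, right after granularity. A sketch of what that looks like; the path, size and granularity here are illustrative, and the accepted advice values should be double-checked in varnishd(1):

    # file storage arguments: path, size, granularity (bytes), advice
    varnishd ... -s file,/var/lib/varnish/varnish_storage.bin,450G,4096,random

Whether a different hint actually helps depends on the access pattern, so it is something to test rather than assume.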
