Transient is using memory. One other option I forgot to mention:

- Move transient allocations to file: -s Transient=file,/tmp,1G
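For example (the main storage path below is just a placeholder; the sizes are the ones already mentioned in this thread), the storage arguments could look something like:

    varnishd ... \
        -s file,/data/varnish_storage.bin,450G \
        -s Transient=file,/tmp,1G

That way shortlived and uncacheable objects land in the Transient file instead of growing in unbounded malloc memory.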
--
Reza Naghibi
Varnish Software

On Wed, Jun 28, 2017 at 9:39 AM, Stefano Baldo <[email protected]> wrote:

> Hi.
>
> root@2c6c325b279f:/# varnishstat -1 | grep g_bytes
> SMA.Transient.g_bytes          519022          .   Bytes outstanding
> SMF.s0.g_bytes            23662845952          .   Bytes outstanding
>
> You mean g_bytes from SMA.Transient? I have set no malloc storage.
>
> On Wed, Jun 28, 2017 at 10:26 AM, Guillaume Quintard <[email protected]> wrote:
>
>> Hi,
>>
>> can you look at "varnishstat -1 | grep g_bytes" and see if it matches the memory you are seeing?
>>
>> --
>> Guillaume Quintard
>>
>> On Wed, Jun 28, 2017 at 3:20 PM, Stefano Baldo <[email protected]> wrote:
>>
>>> Hi Guillaume.
>>>
>>> I increased the cli_timeout yesterday to 900sec (15min) and it restarted anyway, which seems to indicate that the thread is really stalled.
>>>
>>> This was 1 minute after the last restart:
>>>
>>> MAIN.n_object        3908216          .   object structs made
>>> SMF.s0.g_alloc       7794510          .   Allocations outstanding
>>>
>>> I've just changed the I/O scheduler to noop to see what happens.
>>>
>>> One interesting thing I've found is about the memory usage.
>>>
>>> In the 1st minute of use:
>>> MemTotal:       3865572 kB
>>> MemFree:         120768 kB
>>> MemAvailable:   2300268 kB
>>>
>>> 1 minute before a restart:
>>> MemTotal:       3865572 kB
>>> MemFree:          82480 kB
>>> MemAvailable:     68316 kB
>>>
>>> It seems like the system is possibly running out of memory.
>>>
>>> When calling varnishd, I'm specifying only "-s file,..." as storage. I see in some examples that it is common to use "-s file" AND "-s malloc" together. Should I be passing "-s malloc" as well to somehow try to limit the memory usage by varnishd?
>>>
>>> Best,
>>> Stefano
>>>
>>> On Wed, Jun 28, 2017 at 4:12 AM, Guillaume Quintard <[email protected]> wrote:
>>>
>>>> Sadly, nothing suspicious here, you can still try:
>>>> - bumping the cli_timeout
>>>> - changing your disk scheduler
>>>> - changing the advice option of the file storage
>>>>
>>>> I'm still convinced this is due to Varnish getting stuck waiting for the disk because of the file storage fragmentation.
>>>>
>>>> Maybe you could look at SMF.*.g_alloc and compare it to the number of objects. Ideally, we would have a 1:1 relation between objects and allocations. If that number drops prior to a restart, that would be a good clue.
>>>>
>>>> --
>>>> Guillaume Quintard
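For reference, those three knobs could be set roughly like this (the disk name and the granularity/advice values below are only examples, not necessarily what Stefano is running):

    # raise the CLI timeout, at startup or at runtime
    varnishd -p cli_timeout=900 ...
    varnishadm param.set cli_timeout 900

    # change the I/O scheduler of the disk backing the file storage
    echo noop > /sys/block/sda/queue/scheduler

    # advice is the last file storage parameter, after granularity
    varnishd -s file,/data/varnish_storage.bin,450G,4096,random ...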
>>>> On Tue, Jun 27, 2017 at 11:07 PM, Stefano Baldo <[email protected]> wrote:
>>>>
>>>>> Hi Guillaume.
>>>>>
>>>>> It keeps restarting.
>>>>> Would you mind taking a quick look at the following VCL file to check if you find anything suspicious?
>>>>>
>>>>> Thank you very much.
>>>>>
>>>>> Best,
>>>>> Stefano
>>>>>
>>>>> vcl 4.0;
>>>>>
>>>>> import std;
>>>>>
>>>>> backend default {
>>>>>   .host = "sites-web-server-lb";
>>>>>   .port = "80";
>>>>> }
>>>>>
>>>>> include "/etc/varnish/bad_bot_detection.vcl";
>>>>>
>>>>> sub vcl_recv {
>>>>>   call bad_bot_detection;
>>>>>
>>>>>   if (req.url == "/nocache" || req.url == "/version") {
>>>>>     return(pass);
>>>>>   }
>>>>>
>>>>>   unset req.http.Cookie;
>>>>>   if (req.method == "PURGE") {
>>>>>     ban("obj.http.x-host == " + req.http.host + " && obj.http.x-user-agent !~ Googlebot");
>>>>>     return(synth(750));
>>>>>   }
>>>>>
>>>>>   set req.url = regsuball(req.url, "(?<!(http:|https))\/+", "/");
>>>>> }
>>>>>
>>>>> sub vcl_synth {
>>>>>   if (resp.status == 750) {
>>>>>     set resp.status = 200;
>>>>>     synthetic("PURGED => " + req.url);
>>>>>     return(deliver);
>>>>>   } elsif (resp.status == 501) {
>>>>>     set resp.status = 200;
>>>>>     set resp.http.Content-Type = "text/html; charset=utf-8";
>>>>>     synthetic(std.fileread("/etc/varnish/pages/invalid_domain.html"));
>>>>>     return(deliver);
>>>>>   }
>>>>> }
>>>>>
>>>>> sub vcl_backend_response {
>>>>>   unset beresp.http.Set-Cookie;
>>>>>   set beresp.http.x-host = bereq.http.host;
>>>>>   set beresp.http.x-user-agent = bereq.http.user-agent;
>>>>>
>>>>>   if (bereq.url == "/themes/basic/assets/theme.min.css"
>>>>>       || bereq.url == "/api/events/PAGEVIEW"
>>>>>       || bereq.url ~ "^\/assets\/img\/") {
>>>>>     set beresp.http.Cache-Control = "max-age=0";
>>>>>   } else {
>>>>>     unset beresp.http.Cache-Control;
>>>>>   }
>>>>>
>>>>>   if (beresp.status == 200 ||
>>>>>       beresp.status == 301 ||
>>>>>       beresp.status == 302 ||
>>>>>       beresp.status == 404) {
>>>>>     if (bereq.url ~ "\&ordenar=aleatorio$") {
>>>>>       set beresp.http.X-TTL = "1d";
>>>>>       set beresp.ttl = 1d;
>>>>>     } else {
>>>>>       set beresp.http.X-TTL = "1w";
>>>>>       set beresp.ttl = 1w;
>>>>>     }
>>>>>   }
>>>>>
>>>>>   if (bereq.url !~ "\.(jpeg|jpg|png|gif|gz|tgz|bz2|tbz|mp3|ogg|swf|flv)$") {
>>>>>     set beresp.do_gzip = true;
>>>>>   }
>>>>> }
>>>>>
>>>>> sub vcl_pipe {
>>>>>   set bereq.http.connection = "close";
>>>>>   return (pipe);
>>>>> }
>>>>>
>>>>> sub vcl_deliver {
>>>>>   unset resp.http.x-host;
>>>>>   unset resp.http.x-user-agent;
>>>>> }
>>>>>
>>>>> sub vcl_backend_error {
>>>>>   if (beresp.status == 502 || beresp.status == 503 || beresp.status == 504) {
>>>>>     set beresp.status = 200;
>>>>>     set beresp.http.Content-Type = "text/html; charset=utf-8";
>>>>>     synthetic(std.fileread("/etc/varnish/pages/maintenance.html"));
>>>>>     return (deliver);
>>>>>   }
>>>>> }
>>>>>
>>>>> sub vcl_hash {
>>>>>   if (req.http.User-Agent ~ "Google Page Speed") {
>>>>>     hash_data("Google Page Speed");
>>>>>   } elsif (req.http.User-Agent ~ "Googlebot") {
>>>>>     hash_data("Googlebot");
>>>>>   }
>>>>> }
>>>>>
>>>>> sub vcl_deliver {
>>>>>   if (resp.status == 501) {
>>>>>     return (synth(resp.status));
>>>>>   }
>>>>>   if (obj.hits > 0) {
>>>>>     set resp.http.X-Cache = "hit";
>>>>>   } else {
>>>>>     set resp.http.X-Cache = "miss";
>>>>>   }
>>>>> }
>>>>>
>>>>> On Mon, Jun 26, 2017 at 3:47 PM, Guillaume Quintard <[email protected]> wrote:
>>>>>
>>>>>> Nice! It may have been the cause, time will tell. Can you report back in a few days to let us know?
>>>>>>
>>>>>> --
>>>>>> Guillaume Quintard
>>>>>>
>>>>>> On Jun 26, 2017 20:21, "Stefano Baldo" <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Guillaume.
>>>>>>>
>>>>>>> I think things will start going better now after changing the bans.
>>>>>>> This is how my last varnishstat looked moments before a crash, regarding the bans:
>>>>>>>
>>>>>>> MAIN.bans                41336          .    Count of bans
>>>>>>> MAIN.bans_completed      37967          .    Number of bans marked 'completed'
>>>>>>> MAIN.bans_obj                0          .    Number of bans using obj.*
>>>>>>> MAIN.bans_req            41335          .    Number of bans using req.*
>>>>>>> MAIN.bans_added          41336         0.68  Bans added
>>>>>>> MAIN.bans_deleted            0         0.00  Bans deleted
>>>>>>>
>>>>>>> And this is how it looks now:
>>>>>>>
>>>>>>> MAIN.bans                    2          .    Count of bans
>>>>>>> MAIN.bans_completed          1          .    Number of bans marked 'completed'
>>>>>>> MAIN.bans_obj                2          .    Number of bans using obj.*
>>>>>>> MAIN.bans_req                0          .    Number of bans using req.*
>>>>>>> MAIN.bans_added           2016         0.69  Bans added
>>>>>>> MAIN.bans_deleted         2014         0.69  Bans deleted
>>>>>>>
>>>>>>> Before the changes, bans were never deleted!
>>>>>>> Now the bans are added and quickly deleted after a minute or even a couple of seconds.
>>>>>>>
>>>>>>> Maybe this was the cause of the problem? It seems like varnish had a large number of bans to manage and test against.
>>>>>>> I will let it ride now. Let's see if the problem persists or it's gone! :-)
>>>>>>>
>>>>>>> Best,
>>>>>>> Stefano
>>>>>>>
>>>>>>> On Mon, Jun 26, 2017 at 3:10 PM, Guillaume Quintard <[email protected]> wrote:
>>>>>>>
>>>>>>>> Looking good!
>>>>>>>>
>>>>>>>> --
>>>>>>>> Guillaume Quintard
>>>>>>>>
>>>>>>>> On Mon, Jun 26, 2017 at 7:06 PM, Stefano Baldo <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Guillaume,
>>>>>>>>>
>>>>>>>>> Can the following be considered "ban lurker friendly"?
>>>>>>>>>
>>>>>>>>> sub vcl_backend_response {
>>>>>>>>>   set beresp.http.x-url = bereq.http.host + bereq.url;
>>>>>>>>>   set beresp.http.x-user-agent = bereq.http.user-agent;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> sub vcl_recv {
>>>>>>>>>   if (req.method == "PURGE") {
>>>>>>>>>     ban("obj.http.x-url == " + req.http.host + req.url + " && obj.http.x-user-agent !~ Googlebot");
>>>>>>>>>     return(synth(750));
>>>>>>>>>   }
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> sub vcl_deliver {
>>>>>>>>>   unset resp.http.x-url;
>>>>>>>>>   unset resp.http.x-user-agent;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Stefano
>>>>>>>>>
>>>>>>>>> On Mon, Jun 26, 2017 at 12:43 PM, Guillaume Quintard <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Not lurker friendly at all indeed. You'll need to avoid req.* expressions. The easiest way is to stash the host, user-agent and url in beresp.http.* and ban against those (unset them in vcl_deliver).
>>>>>>>>>>
>>>>>>>>>> I don't think you need to expand the VSL at all.
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Guillaume Quintard
>>>>>>>>>>
>>>>>>>>>> On Jun 26, 2017 16:51, "Stefano Baldo" <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Guillaume.
>>>>>>>>>>
>>>>>>>>>> Thanks for answering.
>>>>>>>>>>
>>>>>>>>>> I'm using an SSD disk. I've changed from ext4 to ext2 to increase performance but it still keeps restarting.
>>>>>>>>>> Also, I checked the I/O performance for the disk and there is no sign of overload.
>>>>>>>>>>
>>>>>>>>>> I've changed /var/lib/varnish to a tmpfs and increased its 80m default size, passing "-l 200m,20m" to varnishd and using "nodev,nosuid,noatime,size=256M 0 0" for the tmpfs mount. There was a problem here. After a couple of hours varnish died and I received a "no space left on device" message - deleting /var/lib/varnish solved the problem and varnish was up again, but it's weird because there was free memory on the host to be used with the tmpfs directory, so I don't know what could have happened. I will try to stop increasing the /var/lib/varnish size.
>>>>>>>>>>
>>>>>>>>>> Anyway, I am worried about the bans. You asked me if the bans are lurker friendly. Well, I don't think so. My bans are created this way:
>>>>>>>>>>
>>>>>>>>>> ban("req.http.host == " + req.http.host + " && req.url ~ " + req.url + " && req.http.User-Agent !~ Googlebot");
>>>>>>>>>>
>>>>>>>>>> Are they lurker friendly? I took a quick look at the documentation and it looks like they're not.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Stefano
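For anyone reproducing the tmpfs setup above, the mount and the enlarged shared memory log would look roughly like this (the 256M size is the value from Stefano's message; it simply has to leave headroom for the VSL/VSM segments requested with -l):

    # /etc/fstab
    tmpfs  /var/lib/varnish  tmpfs  nodev,nosuid,noatime,size=256M  0  0

    # varnishd started with the bigger shared memory log
    varnishd -l 200m,20m ...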
>>>>>>>>>> On Fri, Jun 23, 2017 at 11:30 AM, Guillaume Quintard <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Stefano,
>>>>>>>>>>>
>>>>>>>>>>> Let's cover the usual suspects: I/Os. I think here Varnish gets stuck trying to push/pull data and can't make time to reply to the CLI. I'd recommend monitoring the disk activity (bandwidth and iops) to confirm.
>>>>>>>>>>>
>>>>>>>>>>> After some time, the file storage is terrible on a hard drive (SSDs take a bit more time to degrade) because of fragmentation. One solution to help the disks cope is to overprovision them if they're SSDs, and you can try different advices in the file storage definition on the command line (last parameter, after granularity).
>>>>>>>>>>>
>>>>>>>>>>> Is your /var/lib/varnish mount on tmpfs? That could help too.
>>>>>>>>>>>
>>>>>>>>>>> 40K bans is a lot, are they ban-lurker friendly?
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Guillaume Quintard
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Jun 23, 2017 at 4:01 PM, Stefano Baldo <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello.
>>>>>>>>>>>>
>>>>>>>>>>>> I am having a critical problem with Varnish Cache in production for over a month and any help will be appreciated.
>>>>>>>>>>>> The problem is that the Varnish child process is recurrently being restarted after 10~20h of use, with the following message:
>>>>>>>>>>>>
>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) not responding to CLI, killed it.
>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Unexpected reply from ping: 400 CLI communication error
>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) died signal=9
>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child cleanup complete
>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) Started
>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said Child starts
>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said SMF.s0 mmap'ed 483183820800 bytes of 483183820800
>>>>>>>>>>>>
>>>>>>>>>>>> The following link is the varnishstat output just 1 minute before a restart:
>>>>>>>>>>>>
>>>>>>>>>>>> https://pastebin.com/g0g5RVTs
>>>>>>>>>>>>
>>>>>>>>>>>> Environment:
>>>>>>>>>>>>
>>>>>>>>>>>> varnish-5.1.2 revision 6ece695
>>>>>>>>>>>> Debian 8.7 - Debian GNU/Linux 8 (3.16.0)
>>>>>>>>>>>> Installed using the pre-built package from the official repo at packagecloud.io
>>>>>>>>>>>> CPU 2x2.9 GHz
>>>>>>>>>>>> Mem 3.69 GiB
>>>>>>>>>>>> Running inside a Docker container
>>>>>>>>>>>> NFILES=131072
>>>>>>>>>>>> MEMLOCK=82000
>>>>>>>>>>>>
>>>>>>>>>>>> Additional info:
>>>>>>>>>>>>
>>>>>>>>>>>> - I need to cache a large number of objects and the cache should last for almost a week, so I have set up a 450G storage space; I don't know if this is a problem;
>>>>>>>>>>>> - I use ban a lot. There were about 40k bans in the system just before the last crash. I really don't know if this is too much or may have anything to do with it;
>>>>>>>>>>>> - No registered CPU spikes (almost always around 30%);
>>>>>>>>>>>> - No panic is reported; the only info I can retrieve is from syslog;
>>>>>>>>>>>> - During all this time, even moments before the crashes, everything is okay and requests are being responded to very fast.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Stefano Baldo
_______________________________________________
varnish-misc mailing list
[email protected]
https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc
