Hi.

root@2c6c325b279f:/# varnishstat -1 | grep g_bytes
SMA.Transient.g_bytes          519022          .   Bytes outstanding
SMF.s0.g_bytes            23662845952          .   Bytes outstanding
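SMA.Transient above is Varnish's transient storage, which is unlimited by default even when "-s file" is the only storage configured, so it is one place memory can grow besides the file storage's in-core pages. If the intent is to cap it, as discussed further down the thread, the invocation would look something like this sketch (the VCL and storage paths are hypothetical; only the 450G size and the 900 s cli_timeout come from the thread):

    varnishd -f /etc/varnish/default.vcl \
             -s file,/var/lib/varnish/storage.bin,450G \
             -s Transient=malloc,256m \
             -p cli_timeout=900

Naming a malloc storage "Transient" overrides the default unbounded transient arena.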
You mean g_bytes from SMA.Transient? I have set no malloc storage.

On Wed, Jun 28, 2017 at 10:26 AM, Guillaume Quintard <[email protected]> wrote:

> Hi,
>
> Can you look at "varnishstat -1 | grep g_bytes" and see if it matches
> the memory you are seeing?
>
> --
> Guillaume Quintard
>
> On Wed, Jun 28, 2017 at 3:20 PM, Stefano Baldo <[email protected]> wrote:
>
>> Hi Guillaume.
>>
>> I increased the cli_timeout yesterday to 900 sec (15 min) and it restarted
>> anyway, which seems to indicate that the thread is really stalled.
>>
>> This was 1 minute after the last restart:
>>
>> MAIN.n_object      3908216  .  object structs made
>> SMF.s0.g_alloc     7794510  .  Allocations outstanding
>>
>> I've just changed the I/O scheduler to noop to see what happens.
>>
>> One interesting thing I've found is about the memory usage.
>>
>> In the 1st minute of use:
>> MemTotal:       3865572 kB
>> MemFree:         120768 kB
>> MemAvailable:   2300268 kB
>>
>> 1 minute before a restart:
>> MemTotal:       3865572 kB
>> MemFree:          82480 kB
>> MemAvailable:     68316 kB
>>
>> It seems like the system is possibly running out of memory.
>>
>> When calling varnishd, I'm specifying only "-s file,..." as storage. I
>> see in some examples that it is common to use "-s file" AND "-s malloc"
>> together. Should I be passing "-s malloc" as well to somehow try to limit
>> the memory usage by varnishd?
>>
>> Best,
>> Stefano
>>
>> On Wed, Jun 28, 2017 at 4:12 AM, Guillaume Quintard <[email protected]> wrote:
>>
>>> Sadly, nothing suspicious here. You can still try:
>>> - bumping the cli_timeout
>>> - changing your disk scheduler
>>> - changing the advice option of the file storage
>>>
>>> I'm still convinced this is due to Varnish getting stuck waiting for the
>>> disk because of file storage fragmentation.
>>>
>>> Maybe you could look at SMF.*.g_alloc and compare it to the number of
>>> objects. Ideally, we would have a 1:1 relation between objects and
>>> allocations. If that number drops prior to a restart, that would be a
>>> good clue.
>>>
>>> --
>>> Guillaume Quintard
>>>
>>> On Tue, Jun 27, 2017 at 11:07 PM, Stefano Baldo <[email protected]> wrote:
>>>
>>>> Hi Guillaume.
>>>>
>>>> It keeps restarting.
>>>> Would you mind taking a quick look at the following VCL file to check
>>>> if you find anything suspicious?
>>>>
>>>> Thank you very much.
>>>>
>>>> Best,
>>>> Stefano
>>>>
>>>> vcl 4.0;
>>>>
>>>> import std;
>>>>
>>>> backend default {
>>>>     .host = "sites-web-server-lb";
>>>>     .port = "80";
>>>> }
>>>>
>>>> include "/etc/varnish/bad_bot_detection.vcl";
>>>>
>>>> sub vcl_recv {
>>>>     call bad_bot_detection;
>>>>
>>>>     if (req.url == "/nocache" || req.url == "/version") {
>>>>         return(pass);
>>>>     }
>>>>
>>>>     unset req.http.Cookie;
>>>>     if (req.method == "PURGE") {
>>>>         ban("obj.http.x-host == " + req.http.host +
>>>>             " && obj.http.x-user-agent !~ Googlebot");
>>>>         return(synth(750));
>>>>     }
>>>>
>>>>     set req.url = regsuball(req.url, "(?<!(http:|https))\/+", "/");
>>>> }
>>>>
>>>> sub vcl_synth {
>>>>     if (resp.status == 750) {
>>>>         set resp.status = 200;
>>>>         synthetic("PURGED => " + req.url);
>>>>         return(deliver);
>>>>     } elsif (resp.status == 501) {
>>>>         set resp.status = 200;
>>>>         set resp.http.Content-Type = "text/html; charset=utf-8";
>>>>         synthetic(std.fileread("/etc/varnish/pages/invalid_domain.html"));
>>>>         return(deliver);
>>>>     }
>>>> }
>>>>
>>>> sub vcl_backend_response {
>>>>     unset beresp.http.Set-Cookie;
>>>>     set beresp.http.x-host = bereq.http.host;
>>>>     set beresp.http.x-user-agent = bereq.http.user-agent;
>>>>
>>>>     if (bereq.url == "/themes/basic/assets/theme.min.css"
>>>>         || bereq.url == "/api/events/PAGEVIEW"
>>>>         || bereq.url ~ "^\/assets\/img\/") {
>>>>         set beresp.http.Cache-Control = "max-age=0";
>>>>     } else {
>>>>         unset beresp.http.Cache-Control;
>>>>     }
>>>>
>>>>     if (beresp.status == 200 ||
>>>>         beresp.status == 301 ||
>>>>         beresp.status == 302 ||
>>>>         beresp.status == 404) {
>>>>         if (bereq.url ~ "\&ordenar=aleatorio$") {
>>>>             set beresp.http.X-TTL = "1d";
>>>>             set beresp.ttl = 1d;
>>>>         } else {
>>>>             set beresp.http.X-TTL = "1w";
>>>>             set beresp.ttl = 1w;
>>>>         }
>>>>     }
>>>>
>>>>     if (bereq.url !~ "\.(jpeg|jpg|png|gif|gz|tgz|bz2|tbz|mp3|ogg|swf|flv)$") {
>>>>         set beresp.do_gzip = true;
>>>>     }
>>>> }
>>>>
>>>> sub vcl_pipe {
>>>>     set bereq.http.connection = "close";
>>>>     return (pipe);
>>>> }
>>>>
>>>> sub vcl_deliver {
>>>>     unset resp.http.x-host;
>>>>     unset resp.http.x-user-agent;
>>>> }
>>>>
>>>> sub vcl_backend_error {
>>>>     if (beresp.status == 502 || beresp.status == 503 || beresp.status == 504) {
>>>>         set beresp.status = 200;
>>>>         set beresp.http.Content-Type = "text/html; charset=utf-8";
>>>>         synthetic(std.fileread("/etc/varnish/pages/maintenance.html"));
>>>>         return (deliver);
>>>>     }
>>>> }
>>>>
>>>> sub vcl_hash {
>>>>     if (req.http.User-Agent ~ "Google Page Speed") {
>>>>         hash_data("Google Page Speed");
>>>>     } elsif (req.http.User-Agent ~ "Googlebot") {
>>>>         hash_data("Googlebot");
>>>>     }
>>>> }
>>>>
>>>> sub vcl_deliver {
>>>>     if (resp.status == 501) {
>>>>         return (synth(resp.status));
>>>>     }
>>>>     if (obj.hits > 0) {
>>>>         set resp.http.X-Cache = "hit";
>>>>     } else {
>>>>         set resp.http.X-Cache = "miss";
>>>>     }
>>>> }
>>>>
>>>> On Mon, Jun 26, 2017 at 3:47 PM, Guillaume Quintard <[email protected]> wrote:
>>>>
>>>>> Nice! It may have been the cause; time will tell. Can you report back
>>>>> in a few days to let us know?
>>>>>
>>>>> --
>>>>> Guillaume Quintard
>>>>>
>>>>> On Jun 26, 2017 20:21, "Stefano Baldo" <[email protected]> wrote:
>>>>>
>>>>>> Hi Guillaume.
>>>>>>
>>>>>> I think things will start going better now after changing the bans.
>>>>>> This is what my last varnishstat looked like moments before a crash,
>>>>>> regarding the bans:
>>>>>>
>>>>>> MAIN.bans              41336     .     Count of bans
>>>>>> MAIN.bans_completed    37967     .     Number of bans marked 'completed'
>>>>>> MAIN.bans_obj              0     .     Number of bans using obj.*
>>>>>> MAIN.bans_req          41335     .     Number of bans using req.*
>>>>>> MAIN.bans_added        41336  0.68     Bans added
>>>>>> MAIN.bans_deleted          0  0.00     Bans deleted
>>>>>>
>>>>>> And this is how it looks now:
>>>>>>
>>>>>> MAIN.bans                  2     .     Count of bans
>>>>>> MAIN.bans_completed        1     .     Number of bans marked 'completed'
>>>>>> MAIN.bans_obj              2     .     Number of bans using obj.*
>>>>>> MAIN.bans_req              0     .     Number of bans using req.*
>>>>>> MAIN.bans_added         2016  0.69     Bans added
>>>>>> MAIN.bans_deleted       2014  0.69     Bans deleted
>>>>>>
>>>>>> Before the changes, bans were never deleted!
>>>>>> Now the bans are added and quickly deleted after a minute or even a
>>>>>> couple of seconds.
>>>>>>
>>>>>> Maybe this was the cause of the problem? It seems like Varnish was
>>>>>> left with a large number of bans to manage and test against.
>>>>>> I will let it ride now. Let's see if the problem persists or it's
>>>>>> gone! :-)
>>>>>>
>>>>>> Best,
>>>>>> Stefano
>>>>>>
>>>>>> On Mon, Jun 26, 2017 at 3:10 PM, Guillaume Quintard <[email protected]> wrote:
>>>>>>
>>>>>>> Looking good!
>>>>>>>
>>>>>>> --
>>>>>>> Guillaume Quintard
>>>>>>>
>>>>>>> On Mon, Jun 26, 2017 at 7:06 PM, Stefano Baldo <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Guillaume,
>>>>>>>>
>>>>>>>> Can the following be considered "ban lurker friendly"?
>>>>>>>>
>>>>>>>> sub vcl_backend_response {
>>>>>>>>     set beresp.http.x-url = bereq.http.host + bereq.url;
>>>>>>>>     set beresp.http.x-user-agent = bereq.http.user-agent;
>>>>>>>> }
>>>>>>>>
>>>>>>>> sub vcl_recv {
>>>>>>>>     if (req.method == "PURGE") {
>>>>>>>>         ban("obj.http.x-url == " + req.http.host + req.url +
>>>>>>>>             " && obj.http.x-user-agent !~ Googlebot");
>>>>>>>>         return(synth(750));
>>>>>>>>     }
>>>>>>>> }
>>>>>>>>
>>>>>>>> sub vcl_deliver {
>>>>>>>>     unset resp.http.x-url;
>>>>>>>>     unset resp.http.x-user-agent;
>>>>>>>> }
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Stefano
>>>>>>>>
>>>>>>>> On Mon, Jun 26, 2017 at 12:43 PM, Guillaume Quintard <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Not lurker friendly at all, indeed. You'll need to avoid req.*
>>>>>>>>> expressions. The easiest way is to stash the host, user-agent and
>>>>>>>>> url in beresp.http.* and ban against those (unset them in
>>>>>>>>> vcl_deliver).
>>>>>>>>>
>>>>>>>>> I don't think you need to expand the VSL at all.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Guillaume Quintard
>>>>>>>>>
>>>>>>>>> On Jun 26, 2017 16:51, "Stefano Baldo" <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Hi Guillaume.
>>>>>>>>>
>>>>>>>>> Thanks for answering.
>>>>>>>>>
>>>>>>>>> I'm using an SSD disk. I've changed from ext4 to ext2 to increase
>>>>>>>>> performance, but it still keeps restarting.
>>>>>>>>> Also, I checked the I/O performance of the disk and there is no
>>>>>>>>> sign of overload.
>>>>>>>>>
>>>>>>>>> I've changed /var/lib/varnish to a tmpfs and increased its 80m
>>>>>>>>> default size, passing "-l 200m,20m" to varnishd and using
>>>>>>>>> "nodev,nosuid,noatime,size=256M 0 0" for the tmpfs mount. There
>>>>>>>>> was a problem here. After a couple of hours varnish died and I
>>>>>>>>> received a "no space left on device" message - deleting
>>>>>>>>> /var/lib/varnish solved the problem and varnish was up again, but
>>>>>>>>> it's weird because there was free memory on the host to be used
>>>>>>>>> with the tmpfs directory, so I don't know what could have
>>>>>>>>> happened. I will try to stop increasing the /var/lib/varnish size.
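As an aside on the tmpfs setup quoted above: those mount options correspond to an /etc/fstab entry along the lines of the sketch below (only the options and the 256M size come from the message; the mount point is Varnish's usual working directory). Note that "-l 200m,20m" reserves 220M of shared memory log by itself, which leaves little headroom in a 256M tmpfs and may be what produced the "no space left on device" error.

    tmpfs  /var/lib/varnish  tmpfs  nodev,nosuid,noatime,size=256M  0  0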
>>>>>>>>>
>>>>>>>>> Anyway, I am worried about the bans. You asked me if the bans are
>>>>>>>>> lurker friendly. Well, I don't think so. My bans are created this
>>>>>>>>> way:
>>>>>>>>>
>>>>>>>>> ban("req.http.host == " + req.http.host + " && req.url ~ " +
>>>>>>>>>     req.url + " && req.http.User-Agent !~ Googlebot");
>>>>>>>>>
>>>>>>>>> Are they lurker friendly? I was taking a quick look at the
>>>>>>>>> documentation and it looks like they're not.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Stefano
>>>>>>>>>
>>>>>>>>> On Fri, Jun 23, 2017 at 11:30 AM, Guillaume Quintard <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Stefano,
>>>>>>>>>>
>>>>>>>>>> Let's cover the usual suspects: I/Os. I think here Varnish gets
>>>>>>>>>> stuck trying to push/pull data and can't make time to reply to
>>>>>>>>>> the CLI. I'd recommend monitoring the disk activity (bandwidth
>>>>>>>>>> and iops) to confirm.
>>>>>>>>>>
>>>>>>>>>> After some time, the file storage is terrible on a hard drive
>>>>>>>>>> (SSDs take a bit more time to degrade) because of fragmentation.
>>>>>>>>>> One solution to help the disks cope is to overprovision them if
>>>>>>>>>> they're SSDs, and you can try different advice values in the
>>>>>>>>>> file storage definition on the command line (last parameter,
>>>>>>>>>> after granularity).
>>>>>>>>>>
>>>>>>>>>> Is your /var/lib/varnish mounted on tmpfs? That could help too.
>>>>>>>>>>
>>>>>>>>>> 40K bans is a lot. Are they ban-lurker friendly?
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Guillaume Quintard
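The advice option mentioned above is the optional last argument of the file storage definition, after granularity. A sketch of the full syntax (the path is hypothetical and the 4k granularity is only illustrative; the advice values Varnish accepts here are "normal" and "random", which it passes to madvise(2)):

    -s file,/var/lib/varnish/storage.bin,450G,4k,random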
>>>>>>>>>>
>>>>>>>>>> On Fri, Jun 23, 2017 at 4:01 PM, Stefano Baldo <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello.
>>>>>>>>>>>
>>>>>>>>>>> I am having a critical problem with Varnish Cache in production
>>>>>>>>>>> for over a month, and any help will be appreciated.
>>>>>>>>>>> The problem is that the Varnish child process is recurrently
>>>>>>>>>>> being restarted after 10~20h of use, with the following message:
>>>>>>>>>>>
>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) not responding to CLI, killed it.
>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Unexpected reply from ping: 400 CLI communication error
>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) died signal=9
>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child cleanup complete
>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) Started
>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said Child starts
>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said SMF.s0 mmap'ed 483183820800 bytes of 483183820800
>>>>>>>>>>>
>>>>>>>>>>> The following link is the varnishstat output just 1 minute
>>>>>>>>>>> before a restart:
>>>>>>>>>>>
>>>>>>>>>>> https://pastebin.com/g0g5RVTs
>>>>>>>>>>>
>>>>>>>>>>> Environment:
>>>>>>>>>>>
>>>>>>>>>>> varnish-5.1.2 revision 6ece695
>>>>>>>>>>> Debian 8.7 - Debian GNU/Linux 8 (3.16.0)
>>>>>>>>>>> Installed using the pre-built package from the official repo at packagecloud.io
>>>>>>>>>>> CPU 2x2.9 GHz
>>>>>>>>>>> Mem 3.69 GiB
>>>>>>>>>>> Running inside a Docker container
>>>>>>>>>>> NFILES=131072
>>>>>>>>>>> MEMLOCK=82000
>>>>>>>>>>>
>>>>>>>>>>> Additional info:
>>>>>>>>>>>
>>>>>>>>>>> - I need to cache a large number of objects and the cache should
>>>>>>>>>>> last for almost a week, so I have set up a 450G storage space; I
>>>>>>>>>>> don't know if this is a problem;
>>>>>>>>>>> - I use ban a lot. There were about 40k bans in the system just
>>>>>>>>>>> before the last crash. I really don't know if this is too much
>>>>>>>>>>> or may have anything to do with it;
>>>>>>>>>>> - No registered CPU spikes (almost always around 30%);
>>>>>>>>>>> - No panic is reported; the only info I can retrieve is from
>>>>>>>>>>> syslog;
>>>>>>>>>>> - All the while, even moments before the crashes, everything is
>>>>>>>>>>> okay and requests are being responded to very fast.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Stefano Baldo
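All of the counters used to diagnose this problem come from varnishstat, and the ban list itself can be inspected with varnishadm; gathered in one place (standard tooling, nothing beyond what the thread already uses):

    varnishstat -1 | grep g_bytes   # bytes outstanding per storage (SMA/SMF)
    varnishstat -1 | grep g_alloc   # SMF allocations, to compare against MAIN.n_object
    varnishstat -1 | grep bans      # ban counters; a growing bans_req means lurker-unfriendly bans
    varnishadm ban.list             # the live ban list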
_______________________________________________
varnish-misc mailing list
[email protected]
https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc
