Hi Guillaume and Reza.

This time Varnish restarted, but it left some more info in syslog. It seems like the system is running out of memory.
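For a quick cross-check, Guillaume's earlier "varnishstat -1 | grep g_bytes" suggestion can be compared against what the kernel reports. A minimal sketch, assuming the worker process is named "cache-main" as it appears in the oom-killer output below:

    # Bytes Varnish has actually allocated in its storage backends (SMF/SMA)
    varnishstat -1 | grep g_bytes

    # Resident and virtual size of the worker process, plus overall memory
    ps -o pid,vsz,rss,comm -C cache-main
    free -m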
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.297487] pool_herder invoked oom-killer: gfp_mask=0x2000d0, order=2, oom_score_adj=0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.300992] pool_herder cpuset=/ mems_allowed=0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.303157] CPU: 1 PID: 16214 Comm: pool_herder Tainted: G C O 3.16.0-4-amd64 #1 Debian 3.16.36-1+deb8u2
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] Hardware name: Xen HVM domU, BIOS 4.2.amazon 02/16/2017
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] 0000000000000000 ffffffff815123b5 ffff8800eb3652f0 0000000000000000
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] ffffffff8150ff8d 0000000000000000 ffffffff810d6e3f 0000000000000000
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] ffffffff81516d2e 0000000000000200 ffffffff810689d3 ffffffff810c43e4
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] Call Trace:
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff815123b5>] ? dump_stack+0x5d/0x78
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff8150ff8d>] ? dump_header+0x76/0x1e8
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff810d6e3f>] ? smp_call_function_single+0x5f/0xa0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff81516d2e>] ? mutex_lock+0xe/0x2a
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff810689d3>] ? put_online_cpus+0x23/0x80
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff810c43e4>] ? rcu_oom_notify+0xc4/0xe0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff81153d1c>] ? do_try_to_free_pages+0x4ac/0x520
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff811427dd>] ? oom_kill_process+0x21d/0x370
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff8114239d>] ? find_lock_task_mm+0x3d/0x90
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff81142f43>] ? out_of_memory+0x473/0x4b0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff81148e0f>] ? __alloc_pages_nodemask+0x9ef/0xb50
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff81065c86>] ? copy_process.part.25+0x116/0x1c50
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff81058301>] ? __do_page_fault+0x1d1/0x4f0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff81067990>] ? do_fork+0xe0/0x3d0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff815188f9>] ? stub_clone+0x69/0x90
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff8151858d>] ? system_call_fast_compare_end+0x10/0x15
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.367638] Mem-Info:
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.368962] Node 0 DMA per-cpu:
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.370768] CPU 0: hi: 0, btch: 1 usd: 0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.373249] CPU 1: hi: 0, btch: 1 usd: 0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.375652] Node 0 DMA32 per-cpu:
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.377508] CPU 0: hi: 186, btch: 31 usd: 29
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.379898] CPU 1: hi: 186, btch: 31 usd: 0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.382318] active_anon:846474 inactive_anon:1913 isolated_anon:0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.382318] active_file:408 inactive_file:415 isolated_file:32
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.382318] unevictable:20736 dirty:27 writeback:0 unstable:0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.382318] free:16797 slab_reclaimable:15276 slab_unreclaimable:10521
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.382318] mapped:22002 shmem:22935 pagetables:30362 bounce:0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.382318] free_cma:0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.397242] Node 0 DMA free:15192kB min:184kB low:228kB high:276kB active_anon:416kB inactive_anon:60kB active_file:0kB inactive_file:0kB unevictable:20kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:20kB dirty:0kB writeback:0kB mapped:20kB shmem:80kB slab_reclaimable:32kB slab_unreclaimable:0kB kernel_stack:112kB pagetables:20kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.416338] lowmem_reserve[]: 0 3757 3757 3757
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.419030] Node 0 DMA32 free:50120kB min:44868kB low:56084kB high:67300kB active_anon:3386780kB inactive_anon:7592kB active_file:1732kB inactive_file:2060kB unevictable:82924kB isolated(anon):0kB isolated(file):128kB present:3915776kB managed:3849676kB mlocked:82924kB dirty:108kB writeback:0kB mapped:88432kB shmem:91660kB slab_reclaimable:61072kB slab_unreclaimable:42184kB kernel_stack:27248kB pagetables:121428kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.440095] lowmem_reserve[]: 0 0 0 0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.442202] Node 0 DMA: 22*4kB (UEM) 6*8kB (EM) 1*16kB (E) 2*32kB (UM) 2*64kB (UE) 2*128kB (EM) 3*256kB (UEM) 1*512kB (E) 3*1024kB (UEM) 3*2048kB (EMR) 1*4096kB (M) = 15192kB
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.451936] Node 0 DMA32: 4031*4kB (EM) 2729*8kB (EM) 324*16kB (EM) 1*32kB (R) 1*64kB (R) 0*128kB 0*256kB 1*512kB (R) 1*1024kB (R) 1*2048kB (R) 0*4096kB = 46820kB
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.460240] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.464122] 24240 total pagecache pages
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.466048] 0 pages in swap cache
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.467672] Swap cache stats: add 0, delete 0, find 0/0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.470159] Free swap = 0kB
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.471513] Total swap = 0kB
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.472980] 982941 pages RAM
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.474380] 0 pages HighMem/MovableOnly
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.476190] 16525 pages reserved
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.477772] 0 pages hwpoisoned
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.479189] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.482698] [  163]     0   163    10419     1295      21        0             0 systemd-journal
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.486646] [  165]     0   165    10202      136      21        0         -1000 systemd-udevd
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.490598] [  294]     0   294     6351     1729      14        0             0 dhclient
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.494457] [  319]     0   319     6869       62      18        0             0 cron
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.498260] [  321]     0   321     4964       67      14        0             0 systemd-logind
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.502346] [  326]   105   326    10558      101      25        0          -900 dbus-daemon
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.506315] [  342]     0   342    65721      228      31        0             0 rsyslogd
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.510222] [  343]     0   343    88199     2108      61        0          -500 dockerd
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.514022] [  350]   106   350    18280      181      36        0             0 zabbix_agentd
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.518040] [  351]   106   351    18280      475      36        0             0 zabbix_agentd
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.522041] [  352]   106   352    18280      187      36        0             0 zabbix_agentd
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.526025] [  353]   106   353    18280      187      36        0             0 zabbix_agentd
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.530067] [  354]   106   354    18280      187      36        0             0 zabbix_agentd
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.534033] [  355]   106   355    18280      190      36        0             0 zabbix_agentd
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.538001] [  358]     0   358    66390     1826      32        0             0 fail2ban-server
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.541972] [  400]     0   400    35984      444      24        0          -500 docker-containe
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.545879] [  568]     0   568    13796      168      30        0         -1000 sshd
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.549733] [  576]     0   576     3604       41      12        0             0 agetty
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.553569] [  577]     0   577     3559       38      12        0             0 agetty
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.557322] [16201]     0 16201    29695    20707      60        0             0 varnishd
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.561103] [16209]   108 16209 118909802   822425   29398        0             0 cache-main
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.565002] [27352]     0 27352    20131      214      42        0             0 sshd
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.568682] [27354]  1000 27354    20165      211      41        0             0 sshd
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.572307] [27355]  1000 27355     5487      146      17        0             0 bash
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.575920] [27360]     0 27360    11211      107      26        0             0 sudo
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.579593] [27361]     0 27361    11584       97      27        0             0 su
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.583155] [27362]     0 27362     5481      142      15        0             0 bash
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.586782] [27749]     0 27749    20131      214      41        0             0 sshd
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.590428] [27751]  1000 27751    20164      211      39        0             0 sshd
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.593979] [27752]  1000 27752     5487      147      15        0             0 bash
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.597488] [28762]     0 28762    26528      132      17        0             0 varnishstat
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.601239] [28764]     0 28764    11211      106      26        0             0 sudo
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.604737] [28765]     0 28765    11584       97      26        0             0 su
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.608602] [28766]     0 28766     5481      141      15        0             0 bash
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.612288] [28768]     0 28768    26528      220      18        0             0 varnishstat
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.616189] Out of memory: Kill process 16209 (cache-main) score 880 or sacrifice child
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.620106] Killed process 16209 (cache-main) total-vm:475639208kB, anon-rss:3289700kB, file-rss:0kB
Jun 29 13:11:01 ip-172-25-2-8 varnishd[16201]: Child (16209) died signal=9
Jun 29 13:11:01 ip-172-25-2-8 varnishd[16201]: Child cleanup complete
Jun 29 13:11:01 ip-172-25-2-8 varnishd[16201]: Child (30313) Started
Jun 29 13:11:01 ip-172-25-2-8 varnishd[16201]: Child (30313) said Child starts
Jun 29 13:11:01 ip-172-25-2-8 varnishd[16201]: Child (30313) said SMF.s0 mmap'ed 483183820800 bytes of 483183820800

Best,
Stefano

On Wed, Jun 28, 2017 at 11:33 AM, Reza Naghibi <[email protected]> wrote:
> Assuming the problem is running out of memory, you will need to do some > memory tuning, especially given the number of threads you are using and > your access patterns. Your options: > > - Add more memory to the system > - Reduce thread_pool_max > - Reduce jemalloc's thread cache (MALLOC_CONF="lg_tcache_max:10") > - Use some of the tuning params in here: https://info.varnish- > software.com/blog/understanding-varnish-cache-memory-usage > > <https://info.varnish-software.com/blog/understanding-varnish-cache-memory-usage> > > > -- > Reza Naghibi > Varnish Software > > On Wed, Jun 28, 2017 at 9:26 AM, Guillaume Quintard < > [email protected]> wrote: > >> Hi, >> >> can you look that "varnishstat -1 | grep g_bytes" and see if if matches >> the memory you are seeing? >> >> -- >> Guillaume Quintard >> >> On Wed, Jun 28, 2017 at 3:20 PM, Stefano Baldo <[email protected]> >> wrote: >> >>> Hi Guillaume. >>> >>> I increased the cli_timeout yesterday to 900sec (15min) and it restarted >>> anyway, which seems to indicate that the thread is really stalled. >>> >>> This was 1 minute after the last restart: >>> >>> MAIN.n_object 3908216 . object structs made >>> SMF.s0.g_alloc 7794510 . Allocations outstanding >>> >>> I've just changed the I/O Scheduler to noop to see what happens. >>> >>> One interest thing I've found is about the memory usage.
>>> >>> In the 1st minute of use: >>> MemTotal: 3865572 kB >>> MemFree: 120768 kB >>> MemAvailable: 2300268 kB >>> >>> 1 minute before a restart: >>> MemTotal: 3865572 kB >>> MemFree: 82480 kB >>> MemAvailable: 68316 kB >>> >>> It seems like the system is possibly running out of memory. >>> >>> When calling varnishd, I'm specifying only "-s file,..." as storage. I >>> see in some examples that is common to use "-s file" AND "-s malloc" >>> together. Should I be passing "-s malloc" as well to somehow try to limit >>> the memory usage by varnishd? >>> >>> Best, >>> Stefano >>> >>> >>> On Wed, Jun 28, 2017 at 4:12 AM, Guillaume Quintard < >>> [email protected]> wrote: >>> >>>> Sadly, nothing suspicious here, you can still try: >>>> - bumping the cli_timeout >>>> - changing your disk scheduler >>>> - changing the advice option of the file storage >>>> >>>> I'm still convinced this is due to Varnish getting stuck waiting for >>>> the disk because of the file storage fragmentation. >>>> >>>> Maybe you could look at SMF.*.g_alloc and compare it to the number of >>>> objects. Ideally, we would have a 1:1 relation between objects and >>>> allocations. If that number drops prior to a restart, that would be a good >>>> clue. >>>> >>>> >>>> -- >>>> Guillaume Quintard >>>> >>>> On Tue, Jun 27, 2017 at 11:07 PM, Stefano Baldo <[email protected] >>>> > wrote: >>>> >>>>> Hi Guillaume. >>>>> >>>>> It keeps restarting. >>>>> Would you mind taking a quick look in the following VCL file to check >>>>> if you find anything suspicious? >>>>> >>>>> Thank you very much. >>>>> >>>>> Best, >>>>> Stefano >>>>> >>>>> vcl 4.0; >>>>> >>>>> import std; >>>>> >>>>> backend default { >>>>> .host = "sites-web-server-lb"; >>>>> .port = "80"; >>>>> } >>>>> >>>>> include "/etc/varnish/bad_bot_detection.vcl"; >>>>> >>>>> sub vcl_recv { >>>>> call bad_bot_detection; >>>>> >>>>> if (req.url == "/nocache" || req.url == "/version") { >>>>> return(pass); >>>>> } >>>>> >>>>> unset req.http.Cookie; >>>>> if (req.method == "PURGE") { >>>>> ban("obj.http.x-host == " + req.http.host + " && >>>>> obj.http.x-user-agent !~ Googlebot"); >>>>> return(synth(750)); >>>>> } >>>>> >>>>> set req.url = regsuball(req.url, "(?<!(http:|https))\/+", "/"); >>>>> } >>>>> >>>>> sub vcl_synth { >>>>> if (resp.status == 750) { >>>>> set resp.status = 200; >>>>> synthetic("PURGED => " + req.url); >>>>> return(deliver); >>>>> } elsif (resp.status == 501) { >>>>> set resp.status = 200; >>>>> set resp.http.Content-Type = "text/html; charset=utf-8"; >>>>> synthetic(std.fileread("/etc/varnish/pages/invalid_domain.html")); >>>>> return(deliver); >>>>> } >>>>> } >>>>> >>>>> sub vcl_backend_response { >>>>> unset beresp.http.Set-Cookie; >>>>> set beresp.http.x-host = bereq.http.host; >>>>> set beresp.http.x-user-agent = bereq.http.user-agent; >>>>> >>>>> if (bereq.url == "/themes/basic/assets/theme.min.css" >>>>> || bereq.url == "/api/events/PAGEVIEW" >>>>> || bereq.url ~ "^\/assets\/img\/") { >>>>> set beresp.http.Cache-Control = "max-age=0"; >>>>> } else { >>>>> unset beresp.http.Cache-Control; >>>>> } >>>>> >>>>> if (beresp.status == 200 || >>>>> beresp.status == 301 || >>>>> beresp.status == 302 || >>>>> beresp.status == 404) { >>>>> if (bereq.url ~ "\&ordenar=aleatorio$") { >>>>> set beresp.http.X-TTL = "1d"; >>>>> set beresp.ttl = 1d; >>>>> } else { >>>>> set beresp.http.X-TTL = "1w"; >>>>> set beresp.ttl = 1w; >>>>> } >>>>> } >>>>> >>>>> if (bereq.url !~ "\.(jpeg|jpg|png|gif|gz|tgz|bz2|tbz|mp3|ogg|swf|flv)$") >>>>> { >>>>> set beresp.do_gzip = true; >>>>> } 
>>>>> } >>>>> >>>>> sub vcl_pipe { >>>>> set bereq.http.connection = "close"; >>>>> return (pipe); >>>>> } >>>>> >>>>> sub vcl_deliver { >>>>> unset resp.http.x-host; >>>>> unset resp.http.x-user-agent; >>>>> } >>>>> >>>>> sub vcl_backend_error { >>>>> if (beresp.status == 502 || beresp.status == 503 || beresp.status == >>>>> 504) { >>>>> set beresp.status = 200; >>>>> set beresp.http.Content-Type = "text/html; charset=utf-8"; >>>>> synthetic(std.fileread("/etc/varnish/pages/maintenance.html")); >>>>> return (deliver); >>>>> } >>>>> } >>>>> >>>>> sub vcl_hash { >>>>> if (req.http.User-Agent ~ "Google Page Speed") { >>>>> hash_data("Google Page Speed"); >>>>> } elsif (req.http.User-Agent ~ "Googlebot") { >>>>> hash_data("Googlebot"); >>>>> } >>>>> } >>>>> >>>>> sub vcl_deliver { >>>>> if (resp.status == 501) { >>>>> return (synth(resp.status)); >>>>> } >>>>> if (obj.hits > 0) { >>>>> set resp.http.X-Cache = "hit"; >>>>> } else { >>>>> set resp.http.X-Cache = "miss"; >>>>> } >>>>> } >>>>> >>>>> >>>>> On Mon, Jun 26, 2017 at 3:47 PM, Guillaume Quintard < >>>>> [email protected]> wrote: >>>>> >>>>>> Nice! It may have been the cause, time will tell.can you report back >>>>>> in a few days to let us know? >>>>>> -- >>>>>> Guillaume Quintard >>>>>> >>>>>> On Jun 26, 2017 20:21, "Stefano Baldo" <[email protected]> >>>>>> wrote: >>>>>> >>>>>>> Hi Guillaume. >>>>>>> >>>>>>> I think things will start to going better now after changing the >>>>>>> bans. >>>>>>> This is how my last varnishstat looked like moments before a crash >>>>>>> regarding the bans: >>>>>>> >>>>>>> MAIN.bans 41336 . Count of bans >>>>>>> MAIN.bans_completed 37967 . Number of bans >>>>>>> marked 'completed' >>>>>>> MAIN.bans_obj 0 . Number of bans >>>>>>> using obj.* >>>>>>> MAIN.bans_req 41335 . Number of bans >>>>>>> using req.* >>>>>>> MAIN.bans_added 41336 0.68 Bans added >>>>>>> MAIN.bans_deleted 0 0.00 Bans deleted >>>>>>> >>>>>>> And this is how it looks like now: >>>>>>> >>>>>>> MAIN.bans 2 . Count of bans >>>>>>> MAIN.bans_completed 1 . Number of bans >>>>>>> marked 'completed' >>>>>>> MAIN.bans_obj 2 . Number of bans >>>>>>> using obj.* >>>>>>> MAIN.bans_req 0 . Number of bans >>>>>>> using req.* >>>>>>> MAIN.bans_added 2016 0.69 Bans added >>>>>>> MAIN.bans_deleted 2014 0.69 Bans deleted >>>>>>> >>>>>>> Before the changes, bans were never deleted! >>>>>>> Now the bans are added and quickly deleted after a minute or even a >>>>>>> couple of seconds. >>>>>>> >>>>>>> May this was the cause of the problem? It seems like varnish was >>>>>>> having a large number of bans to manage and test against. >>>>>>> I will let it ride now. Let's see if the problem persists or it's >>>>>>> gone! :-) >>>>>>> >>>>>>> Best, >>>>>>> Stefano >>>>>>> >>>>>>> >>>>>>> On Mon, Jun 26, 2017 at 3:10 PM, Guillaume Quintard < >>>>>>> [email protected]> wrote: >>>>>>> >>>>>>>> Looking good! >>>>>>>> >>>>>>>> -- >>>>>>>> Guillaume Quintard >>>>>>>> >>>>>>>> On Mon, Jun 26, 2017 at 7:06 PM, Stefano Baldo < >>>>>>>> [email protected]> wrote: >>>>>>>> >>>>>>>>> Hi Guillaume, >>>>>>>>> >>>>>>>>> Can the following be considered "ban lurker friendly"? 
>>>>>>>>> >>>>>>>>> sub vcl_backend_response { >>>>>>>>> set beresp.http.x-url = bereq.http.host + bereq.url; >>>>>>>>> set beresp.http.x-user-agent = bereq.http.user-agent; >>>>>>>>> } >>>>>>>>> >>>>>>>>> sub vcl_recv { >>>>>>>>> if (req.method == "PURGE") { >>>>>>>>> ban("obj.http.x-url == " + req.http.host + req.url + " && >>>>>>>>> obj.http.x-user-agent !~ Googlebot"); >>>>>>>>> return(synth(750)); >>>>>>>>> } >>>>>>>>> } >>>>>>>>> >>>>>>>>> sub vcl_deliver { >>>>>>>>> unset resp.http.x-url; >>>>>>>>> unset resp.http.x-user-agent; >>>>>>>>> } >>>>>>>>> >>>>>>>>> Best, >>>>>>>>> Stefano >>>>>>>>> >>>>>>>>> >>>>>>>>> On Mon, Jun 26, 2017 at 12:43 PM, Guillaume Quintard < >>>>>>>>> [email protected]> wrote: >>>>>>>>> >>>>>>>>>> Not lurker friendly at all indeed. You'll need to avoid req.* >>>>>>>>>> expression. Easiest way is to stash the host, user-agent and url in >>>>>>>>>> beresp.http.* and ban against those (unset them in vcl_deliver). >>>>>>>>>> >>>>>>>>>> I don't think you need to expand the VSL at all. >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Guillaume Quintard >>>>>>>>>> >>>>>>>>>> On Jun 26, 2017 16:51, "Stefano Baldo" <[email protected]> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>> Hi Guillaume. >>>>>>>>>> >>>>>>>>>> Thanks for answering. >>>>>>>>>> >>>>>>>>>> I'm using a SSD disk. I've changed from ext4 to ext2 to increase >>>>>>>>>> performance but it stills restarting. >>>>>>>>>> Also, I checked the I/O performance for the disk and there is no >>>>>>>>>> signal of overhead. >>>>>>>>>> >>>>>>>>>> I've changed the /var/lib/varnish to a tmpfs and increased its >>>>>>>>>> 80m default size passing "-l 200m,20m" to varnishd and using >>>>>>>>>> "nodev,nosuid,noatime,size=256M 0 0" for the tmpfs mount. There >>>>>>>>>> was a problem here. After a couple of hours varnish died and I >>>>>>>>>> received a >>>>>>>>>> "no space left on device" message - deleting the /var/lib/varnish >>>>>>>>>> solved >>>>>>>>>> the problem and varnish was up again, but it's weird because there >>>>>>>>>> was free >>>>>>>>>> memory on the host to be used with the tmpfs directory, so I don't >>>>>>>>>> know >>>>>>>>>> what could have happened. I will try to stop increasing the >>>>>>>>>> /var/lib/varnish size. >>>>>>>>>> >>>>>>>>>> Anyway, I am worried about the bans. You asked me if the bans are >>>>>>>>>> lurker friedly. Well, I don't think so. My bans are created this way: >>>>>>>>>> >>>>>>>>>> ban("req.http.host == " + req.http.host + " && req.url ~ " + >>>>>>>>>> req.url + " && req.http.User-Agent !~ Googlebot"); >>>>>>>>>> >>>>>>>>>> Are they lurker friendly? I was taking a quick look and the >>>>>>>>>> documentation and it looks like they're not. >>>>>>>>>> >>>>>>>>>> Best, >>>>>>>>>> Stefano >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Fri, Jun 23, 2017 at 11:30 AM, Guillaume Quintard < >>>>>>>>>> [email protected]> wrote: >>>>>>>>>> >>>>>>>>>>> Hi Stefano, >>>>>>>>>>> >>>>>>>>>>> Let's cover the usual suspects: I/Os. I think here Varnish gets >>>>>>>>>>> stuck trying to push/pull data and can't make time to reply to the >>>>>>>>>>> CLI. I'd >>>>>>>>>>> recommend monitoring the disk activity (bandwidth and iops) to >>>>>>>>>>> confirm. >>>>>>>>>>> >>>>>>>>>>> After some time, the file storage is terrible on a hard drive >>>>>>>>>>> (SSDs take a bit more time to degrade) because of fragmentation. 
One >>>>>>>>>>> solution to help the disks cope is to overprovision themif they're >>>>>>>>>>> SSDs, >>>>>>>>>>> and you can try different advices in the file storage definition in >>>>>>>>>>> the >>>>>>>>>>> command line (last parameter, after granularity). >>>>>>>>>>> >>>>>>>>>>> Is your /var/lib/varnish mount on tmpfs? That could help too. >>>>>>>>>>> >>>>>>>>>>> 40K bans is a lot, are they ban-lurker friendly? >>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> Guillaume Quintard >>>>>>>>>>> >>>>>>>>>>> On Fri, Jun 23, 2017 at 4:01 PM, Stefano Baldo < >>>>>>>>>>> [email protected]> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hello. >>>>>>>>>>>> >>>>>>>>>>>> I am having a critical problem with Varnish Cache in production >>>>>>>>>>>> for over a month and any help will be appreciated. >>>>>>>>>>>> The problem is that Varnish child process is recurrently being >>>>>>>>>>>> restarted after 10~20h of use, with the following message: >>>>>>>>>>>> >>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) not >>>>>>>>>>>> responding to CLI, killed it. >>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Unexpected reply >>>>>>>>>>>> from ping: 400 CLI communication error >>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) >>>>>>>>>>>> died signal=9 >>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child cleanup >>>>>>>>>>>> complete >>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) >>>>>>>>>>>> Started >>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) >>>>>>>>>>>> said Child starts >>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) >>>>>>>>>>>> said SMF.s0 mmap'ed 483183820800 bytes of 483183820800 >>>>>>>>>>>> >>>>>>>>>>>> The following link is the varnishstat output just 1 minute >>>>>>>>>>>> before a restart: >>>>>>>>>>>> >>>>>>>>>>>> https://pastebin.com/g0g5RVTs >>>>>>>>>>>> >>>>>>>>>>>> Environment: >>>>>>>>>>>> >>>>>>>>>>>> varnish-5.1.2 revision 6ece695 >>>>>>>>>>>> Debian 8.7 - Debian GNU/Linux 8 (3.16.0) >>>>>>>>>>>> Installed using pre-built package from official repo at >>>>>>>>>>>> packagecloud.io >>>>>>>>>>>> CPU 2x2.9 GHz >>>>>>>>>>>> Mem 3.69 GiB >>>>>>>>>>>> Running inside a Docker container >>>>>>>>>>>> NFILES=131072 >>>>>>>>>>>> MEMLOCK=82000 >>>>>>>>>>>> >>>>>>>>>>>> Additional info: >>>>>>>>>>>> >>>>>>>>>>>> - I need to cache a large number of objets and the cache should >>>>>>>>>>>> last for almost a week, so I have set up a 450G storage space, I >>>>>>>>>>>> don't know >>>>>>>>>>>> if this is a problem; >>>>>>>>>>>> - I use ban a lot. There was about 40k bans in the system just >>>>>>>>>>>> before the last crash. I really don't know if this is too much or >>>>>>>>>>>> may have >>>>>>>>>>>> anything to do with it; >>>>>>>>>>>> - No registered CPU spikes (almost always by 30%); >>>>>>>>>>>> - No panic is reported, the only info I can retrieve is from >>>>>>>>>>>> syslog; >>>>>>>>>>>> - During all the time, event moments before the crashes, >>>>>>>>>>>> everything is okay and requests are being responded very fast. 
>>>>>>>>>>>> >>>>>>>>>>>> Best, >>>>>>>>>>>> Stefano Baldo >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>> varnish-misc mailing list >>>>>>>>>>>> [email protected] >>>>>>>>>>>> https://www.varnish-cache.org/lists/mailman/listinfo/varnish >>>>>>>>>>>> -misc >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>> >>>> >>> >> >> _______________________________________________ >> varnish-misc mailing list >> [email protected] >> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc >> > >
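For reference, below is a sketch of how the tuning options Reza listed could be combined with the values already mentioned in this thread (450G file storage, "-l 200m,20m", cli_timeout raised to 900). The listen address, VCL path and storage file path are placeholders, and thread_pool_max=500 is only an example figure, not something tested on this setup:

    # Shrink jemalloc's per-thread cache before starting varnishd
    export MALLOC_CONF="lg_tcache_max:10"

    # -a/-f and the storage path are placeholders; the storage size, -l and
    # cli_timeout values are the ones from this thread, and
    # thread_pool_max=500 is an arbitrary example.
    varnishd -a :80 \
             -f /etc/varnish/default.vcl \
             -s file,/var/lib/varnish/storage.bin,450G \
             -l 200m,20m \
             -p cli_timeout=900 \
             -p thread_pool_max=500

After a change like this, "varnishstat -1 | grep g_bytes" shows how much of the storage is actually allocated, and "varnishadm ban.list" (together with the MAIN.bans_deleted counter quoted above) can confirm that bans keep being completed and removed.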
_______________________________________________
varnish-misc mailing list
[email protected]
https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc
