Anything else using RAM on that machine? The best idea I have right now is to monitor Varnish memory usage in parallel with the *.g_bytes counters.
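A minimal sketch of that side-by-side check, assuming the cache child is the process named cache-main (as in the OOM report below) and that a plain shell loop is acceptable; the 60-second interval is arbitrary:

    #!/bin/sh
    # Log the child's resident set size next to the sum of all
    # *.g_bytes counters, once a minute, so the two can be graphed
    # against each other.
    PID=$(pgrep -o cache-main)
    while sleep 60; do
        rss=$(awk '/^VmRSS/ {print $2}' "/proc/$PID/status")
        g=$(varnishstat -1 | awk '/g_bytes/ {sum += $2} END {print sum / 1024}')
        echo "$(date -u '+%F %T') VmRSS=${rss}kB g_bytes=${g}kB"
    done

If the gap between VmRSS and the g_bytes sum keeps growing, the extra memory is going somewhere other than the storage counters.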
--
Guillaume Quintard

On Thu, Jun 29, 2017 at 7:09 PM, Stefano Baldo <[email protected]> wrote:

> Hi Guillaume and Reza.
>
> This time varnish restarted but it left some more info in the syslog.
> It seems like the system is running out of memory.
>
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.297487] pool_herder invoked oom-killer: gfp_mask=0x2000d0, order=2, oom_score_adj=0
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.300992] pool_herder cpuset=/ mems_allowed=0
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.303157] CPU: 1 PID: 16214 Comm: pool_herder Tainted: G C O 3.16.0-4-amd64 #1 Debian 3.16.36-1+deb8u2
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] Hardware name: Xen HVM domU, BIOS 4.2.amazon 02/16/2017
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] 0000000000000000 ffffffff815123b5 ffff8800eb3652f0 0000000000000000
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] ffffffff8150ff8d 0000000000000000 ffffffff810d6e3f 0000000000000000
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] ffffffff81516d2e 0000000000000200 ffffffff810689d3 ffffffff810c43e4
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] Call Trace:
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff815123b5>] ? dump_stack+0x5d/0x78
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff8150ff8d>] ? dump_header+0x76/0x1e8
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff810d6e3f>] ? smp_call_function_single+0x5f/0xa0
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff81516d2e>] ? mutex_lock+0xe/0x2a
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff810689d3>] ? put_online_cpus+0x23/0x80
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff810c43e4>] ? rcu_oom_notify+0xc4/0xe0
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff81153d1c>] ? do_try_to_free_pages+0x4ac/0x520
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff811427dd>] ? oom_kill_process+0x21d/0x370
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff8114239d>] ? find_lock_task_mm+0x3d/0x90
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff81142f43>] ? out_of_memory+0x473/0x4b0
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff81148e0f>] ? __alloc_pages_nodemask+0x9ef/0xb50
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff81065c86>] ? copy_process.part.25+0x116/0x1c50
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff81058301>] ? __do_page_fault+0x1d1/0x4f0
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff81067990>] ? do_fork+0xe0/0x3d0
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff815188f9>] ? stub_clone+0x69/0x90
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff8151858d>] ? system_call_fast_compare_end+0x10/0x15
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.367638] Mem-Info:
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.368962] Node 0 DMA per-cpu:
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.370768] CPU 0: hi: 0, btch: 1 usd: 0
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.373249] CPU 1: hi: 0, btch: 1 usd: 0
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.375652] Node 0 DMA32 per-cpu:
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.377508] CPU 0: hi: 186, btch: 31 usd: 29
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.379898] CPU 1: hi: 186, btch: 31 usd: 0
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.382318] active_anon:846474 inactive_anon:1913 isolated_anon:0
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.382318] active_file:408 inactive_file:415 isolated_file:32
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.382318] unevictable:20736 dirty:27 writeback:0 unstable:0
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.382318] free:16797 slab_reclaimable:15276 slab_unreclaimable:10521
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.382318] mapped:22002 shmem:22935 pagetables:30362 bounce:0
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.382318] free_cma:0
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.397242] Node 0 DMA free:15192kB min:184kB low:228kB high:276kB active_anon:416kB inactive_anon:60kB active_file:0kB inactive_file:0kB unevictable:20kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:20kB dirty:0kB writeback:0kB mapped:20kB shmem:80kB slab_reclaimable:32kB slab_unreclaimable:0kB kernel_stack:112kB pagetables:20kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.416338] lowmem_reserve[]: 0 3757 3757 3757
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.419030] Node 0 DMA32 free:50120kB min:44868kB low:56084kB high:67300kB active_anon:3386780kB inactive_anon:7592kB active_file:1732kB inactive_file:2060kB unevictable:82924kB isolated(anon):0kB isolated(file):128kB present:3915776kB managed:3849676kB mlocked:82924kB dirty:108kB writeback:0kB mapped:88432kB shmem:91660kB slab_reclaimable:61072kB slab_unreclaimable:42184kB kernel_stack:27248kB pagetables:121428kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.440095] lowmem_reserve[]: 0 0 0 0
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.442202] Node 0 DMA: 22*4kB (UEM) 6*8kB (EM) 1*16kB (E) 2*32kB (UM) 2*64kB (UE) 2*128kB (EM) 3*256kB (UEM) 1*512kB (E) 3*1024kB (UEM) 3*2048kB (EMR) 1*4096kB (M) = 15192kB
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.451936] Node 0 DMA32: 4031*4kB (EM) 2729*8kB (EM) 324*16kB (EM) 1*32kB (R) 1*64kB (R) 0*128kB 0*256kB 1*512kB (R) 1*1024kB (R) 1*2048kB (R) 0*4096kB = 46820kB
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.460240] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.464122] 24240 total pagecache pages
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.466048] 0 pages in swap cache
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.467672] Swap cache stats: add 0, delete 0, find 0/0
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.470159] Free swap = 0kB
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.471513] Total swap = 0kB
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.472980] 982941 pages RAM
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.474380] 0 pages HighMem/MovableOnly
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.476190] 16525 pages reserved
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.477772] 0 pages hwpoisoned
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.479189] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.482698] [ 163] 0 163 10419 1295 21 0 0 systemd-journal
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.486646] [ 165] 0 165 10202 136 21 0 -1000 systemd-udevd
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.490598] [ 294] 0 294 6351 1729 14 0 0 dhclient
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.494457] [ 319] 0 319 6869 62 18 0 0 cron
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.498260] [ 321] 0 321 4964 67 14 0 0 systemd-logind
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.502346] [ 326] 105 326 10558 101 25 0 -900 dbus-daemon
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.506315] [ 342] 0 342 65721 228 31 0 0 rsyslogd
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.510222] [ 343] 0 343 88199 2108 61 0 -500 dockerd
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.514022] [ 350] 106 350 18280 181 36 0 0 zabbix_agentd
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.518040] [ 351] 106 351 18280 475 36 0 0 zabbix_agentd
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.522041] [ 352] 106 352 18280 187 36 0 0 zabbix_agentd
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.526025] [ 353] 106 353 18280 187 36 0 0 zabbix_agentd
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.530067] [ 354] 106 354 18280 187 36 0 0 zabbix_agentd
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.534033] [ 355] 106 355 18280 190 36 0 0 zabbix_agentd
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.538001] [ 358] 0 358 66390 1826 32 0 0 fail2ban-server
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.541972] [ 400] 0 400 35984 444 24 0 -500 docker-containe
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.545879] [ 568] 0 568 13796 168 30 0 -1000 sshd
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.549733] [ 576] 0 576 3604 41 12 0 0 agetty
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.553569] [ 577] 0 577 3559 38 12 0 0 agetty
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.557322] [16201] 0 16201 29695 20707 60 0 0 varnishd
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.561103] [16209] 108 16209 118909802 822425 29398 0 0 cache-main
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.565002] [27352] 0 27352 20131 214 42 0 0 sshd
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.568682] [27354] 1000 27354 20165 211 41 0 0 sshd
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.572307] [27355] 1000 27355 5487 146 17 0 0 bash
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.575920] [27360] 0 27360 11211 107 26 0 0 sudo
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.579593] [27361] 0 27361 11584 97 27 0 0 su
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.583155] [27362] 0 27362 5481 142 15 0 0 bash
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.586782] [27749] 0 27749 20131 214 41 0 0 sshd
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.590428] [27751] 1000 27751 20164 211 39 0 0 sshd
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.593979] [27752] 1000 27752 5487 147 15 0 0 bash
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.597488] [28762] 0 28762 26528 132 17 0 0 varnishstat
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.601239] [28764] 0 28764 11211 106 26 0 0 sudo
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.604737] [28765] 0 28765 11584 97 26 0 0 su
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.608602] [28766] 0 28766 5481 141 15 0 0 bash
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.612288] [28768] 0 28768 26528 220 18 0 0 varnishstat
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.616189] Out of memory: Kill process 16209 (cache-main) score 880 or sacrifice child
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.620106] Killed process 16209 (cache-main) total-vm:475639208kB, anon-rss:3289700kB, file-rss:0kB
> Jun 29 13:11:01 ip-172-25-2-8 varnishd[16201]: Child (16209) died signal=9
> Jun 29 13:11:01 ip-172-25-2-8 varnishd[16201]: Child cleanup complete
> Jun 29 13:11:01 ip-172-25-2-8 varnishd[16201]: Child (30313) Started
> Jun 29 13:11:01 ip-172-25-2-8 varnishd[16201]: Child (30313) said Child starts
> Jun 29 13:11:01 ip-172-25-2-8 varnishd[16201]: Child (30313) said SMF.s0 mmap'ed 483183820800 bytes of 483183820800
>
> Best,
> Stefano
>
> On Wed, Jun 28, 2017 at 11:33 AM, Reza Naghibi <[email protected]> wrote:
>
>> Assuming the problem is running out of memory, you will need to do some memory tuning, especially given the number of threads you are using and your access patterns. Your options:
>>
>> - Add more memory to the system
>> - Reduce thread_pool_max
>> - Reduce jemalloc's thread cache (MALLOC_CONF="lg_tcache_max:10")
>> - Use some of the tuning params in here: https://info.varnish-software.com/blog/understanding-varnish-cache-memory-usage
>>
>> --
>> Reza Naghibi
>> Varnish Software
>>
>> On Wed, Jun 28, 2017 at 9:26 AM, Guillaume Quintard <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> Can you look at "varnishstat -1 | grep g_bytes" and see if it matches the memory you are seeing?
>>>
>>> --
>>> Guillaume Quintard
>>>
>>> On Wed, Jun 28, 2017 at 3:20 PM, Stefano Baldo <[email protected]> wrote:
>>>
>>>> Hi Guillaume.
>>>>
>>>> I increased the cli_timeout yesterday to 900sec (15min) and it restarted anyway, which seems to indicate that the thread is really stalled.
>>>>
>>>> This was 1 minute after the last restart:
>>>>
>>>> MAIN.n_object     3908216   .   object structs made
>>>> SMF.s0.g_alloc    7794510   .   Allocations outstanding
>>>>
>>>> I've just changed the I/O scheduler to noop to see what happens.
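Those two counters are enough to watch the objects-to-allocations relation discussed further down the thread (ideally 1:1; the snapshot above is already at roughly 2:1). A minimal sketch, assuming the varnishstat -1 field layout shown above:

    # Fragmentation check: allocations outstanding per object struct.
    # A ratio drifting well above 1.0 suggests file-storage fragmentation.
    varnishstat -1 | awk '
        /^MAIN\.n_object /   { objects = $2 }
        /^SMF\.s0\.g_alloc / { allocs = $2 }
        END { if (objects > 0) printf "allocs per object: %.2f\n", allocs / objects }'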
>>>> One interesting thing I've found is about the memory usage.
>>>>
>>>> In the 1st minute of use:
>>>> MemTotal:     3865572 kB
>>>> MemFree:       120768 kB
>>>> MemAvailable: 2300268 kB
>>>>
>>>> 1 minute before a restart:
>>>> MemTotal:     3865572 kB
>>>> MemFree:        82480 kB
>>>> MemAvailable:   68316 kB
>>>>
>>>> It seems like the system is possibly running out of memory.
>>>>
>>>> When calling varnishd, I'm specifying only "-s file,..." as storage. I see in some examples that it is common to use "-s file" AND "-s malloc" together. Should I be passing "-s malloc" as well to somehow try to limit the memory usage by varnishd?
>>>>
>>>> Best,
>>>> Stefano
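For reference, a hedged sketch of what such an invocation could look like, combining the file storage with a bounded malloc storage and the tuning knobs Reza listed above; the paths, sizes, and parameter values here are illustrative assumptions, not recommendations:

    # Illustrative only: bulk cache on file storage, plus a separately
    # capped malloc storage, with jemalloc's thread cache and the
    # thread pool reduced per Reza's suggestions.
    MALLOC_CONF="lg_tcache_max:10" varnishd \
        -a :80 \
        -f /etc/varnish/default.vcl \
        -s file,/var/lib/varnish/storage.bin,450G \
        -s malloc,256m \
        -p thread_pool_max=1000

Note that "-s malloc,256m" caps only that storage's object memory; the child's total footprint (thread stacks, workspaces, jemalloc overhead) still comes on top of it, which is why the OOM killer can fire even with a small malloc cap.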
>>>>
>>>> On Wed, Jun 28, 2017 at 4:12 AM, Guillaume Quintard <[email protected]> wrote:
>>>>
>>>>> Sadly, nothing suspicious here, you can still try:
>>>>> - bumping the cli_timeout
>>>>> - changing your disk scheduler
>>>>> - changing the advice option of the file storage
>>>>>
>>>>> I'm still convinced this is due to Varnish getting stuck waiting for the disk because of the file storage fragmentation.
>>>>>
>>>>> Maybe you could look at SMF.*.g_alloc and compare it to the number of objects. Ideally, we would have a 1:1 relation between objects and allocations. If that number drops prior to a restart, that would be a good clue.
>>>>>
>>>>> --
>>>>> Guillaume Quintard
>>>>>
>>>>> On Tue, Jun 27, 2017 at 11:07 PM, Stefano Baldo <[email protected]> wrote:
>>>>>
>>>>>> Hi Guillaume.
>>>>>>
>>>>>> It keeps restarting. Would you mind taking a quick look at the following VCL file to check if you find anything suspicious?
>>>>>>
>>>>>> Thank you very much.
>>>>>>
>>>>>> Best,
>>>>>> Stefano
>>>>>>
>>>>>> vcl 4.0;
>>>>>>
>>>>>> import std;
>>>>>>
>>>>>> backend default {
>>>>>>   .host = "sites-web-server-lb";
>>>>>>   .port = "80";
>>>>>> }
>>>>>>
>>>>>> include "/etc/varnish/bad_bot_detection.vcl";
>>>>>>
>>>>>> sub vcl_recv {
>>>>>>   call bad_bot_detection;
>>>>>>
>>>>>>   if (req.url == "/nocache" || req.url == "/version") {
>>>>>>     return(pass);
>>>>>>   }
>>>>>>
>>>>>>   unset req.http.Cookie;
>>>>>>   if (req.method == "PURGE") {
>>>>>>     ban("obj.http.x-host == " + req.http.host + " && obj.http.x-user-agent !~ Googlebot");
>>>>>>     return(synth(750));
>>>>>>   }
>>>>>>
>>>>>>   set req.url = regsuball(req.url, "(?<!(http:|https))\/+", "/");
>>>>>> }
>>>>>>
>>>>>> sub vcl_synth {
>>>>>>   if (resp.status == 750) {
>>>>>>     set resp.status = 200;
>>>>>>     synthetic("PURGED => " + req.url);
>>>>>>     return(deliver);
>>>>>>   } elsif (resp.status == 501) {
>>>>>>     set resp.status = 200;
>>>>>>     set resp.http.Content-Type = "text/html; charset=utf-8";
>>>>>>     synthetic(std.fileread("/etc/varnish/pages/invalid_domain.html"));
>>>>>>     return(deliver);
>>>>>>   }
>>>>>> }
>>>>>>
>>>>>> sub vcl_backend_response {
>>>>>>   unset beresp.http.Set-Cookie;
>>>>>>   set beresp.http.x-host = bereq.http.host;
>>>>>>   set beresp.http.x-user-agent = bereq.http.user-agent;
>>>>>>
>>>>>>   if (bereq.url == "/themes/basic/assets/theme.min.css"
>>>>>>       || bereq.url == "/api/events/PAGEVIEW"
>>>>>>       || bereq.url ~ "^\/assets\/img\/") {
>>>>>>     set beresp.http.Cache-Control = "max-age=0";
>>>>>>   } else {
>>>>>>     unset beresp.http.Cache-Control;
>>>>>>   }
>>>>>>
>>>>>>   if (beresp.status == 200 ||
>>>>>>       beresp.status == 301 ||
>>>>>>       beresp.status == 302 ||
>>>>>>       beresp.status == 404) {
>>>>>>     if (bereq.url ~ "\&ordenar=aleatorio$") {
>>>>>>       set beresp.http.X-TTL = "1d";
>>>>>>       set beresp.ttl = 1d;
>>>>>>     } else {
>>>>>>       set beresp.http.X-TTL = "1w";
>>>>>>       set beresp.ttl = 1w;
>>>>>>     }
>>>>>>   }
>>>>>>
>>>>>>   if (bereq.url !~ "\.(jpeg|jpg|png|gif|gz|tgz|bz2|tbz|mp3|ogg|swf|flv)$") {
>>>>>>     set beresp.do_gzip = true;
>>>>>>   }
>>>>>> }
>>>>>>
>>>>>> sub vcl_pipe {
>>>>>>   set bereq.http.connection = "close";
>>>>>>   return (pipe);
>>>>>> }
>>>>>>
>>>>>> sub vcl_deliver {
>>>>>>   unset resp.http.x-host;
>>>>>>   unset resp.http.x-user-agent;
>>>>>> }
>>>>>>
>>>>>> sub vcl_backend_error {
>>>>>>   if (beresp.status == 502 || beresp.status == 503 || beresp.status == 504) {
>>>>>>     set beresp.status = 200;
>>>>>>     set beresp.http.Content-Type = "text/html; charset=utf-8";
>>>>>>     synthetic(std.fileread("/etc/varnish/pages/maintenance.html"));
>>>>>>     return (deliver);
>>>>>>   }
>>>>>> }
>>>>>>
>>>>>> sub vcl_hash {
>>>>>>   if (req.http.User-Agent ~ "Google Page Speed") {
>>>>>>     hash_data("Google Page Speed");
>>>>>>   } elsif (req.http.User-Agent ~ "Googlebot") {
>>>>>>     hash_data("Googlebot");
>>>>>>   }
>>>>>> }
>>>>>>
>>>>>> sub vcl_deliver {
>>>>>>   if (resp.status == 501) {
>>>>>>     return (synth(resp.status));
>>>>>>   }
>>>>>>   if (obj.hits > 0) {
>>>>>>     set resp.http.X-Cache = "hit";
>>>>>>   } else {
>>>>>>     set resp.http.X-Cache = "miss";
>>>>>>   }
>>>>>> }
>>>>>>
>>>>>> On Mon, Jun 26, 2017 at 3:47 PM, Guillaume Quintard <[email protected]> wrote:
>>>>>>
>>>>>>> Nice! It may have been the cause, time will tell. Can you report back in a few days to let us know?
>>>>>>>
>>>>>>> --
>>>>>>> Guillaume Quintard
>>>>>>>
>>>>>>> On Jun 26, 2017 20:21, "Stefano Baldo" <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Guillaume.
>>>>>>>>
>>>>>>>> I think things will start going better now after changing the bans.
>>>>>>>> This is what the ban counters in my last varnishstat looked like moments before a crash:
>>>>>>>>
>>>>>>>> MAIN.bans               41336     .  Count of bans
>>>>>>>> MAIN.bans_completed     37967     .  Number of bans marked 'completed'
>>>>>>>> MAIN.bans_obj               0     .  Number of bans using obj.*
>>>>>>>> MAIN.bans_req           41335     .  Number of bans using req.*
>>>>>>>> MAIN.bans_added         41336  0.68  Bans added
>>>>>>>> MAIN.bans_deleted           0  0.00  Bans deleted
>>>>>>>>
>>>>>>>> And this is how they look now:
>>>>>>>>
>>>>>>>> MAIN.bans                   2     .  Count of bans
>>>>>>>> MAIN.bans_completed         1     .  Number of bans marked 'completed'
>>>>>>>> MAIN.bans_obj               2     .  Number of bans using obj.*
>>>>>>>> MAIN.bans_req               0     .  Number of bans using req.*
>>>>>>>> MAIN.bans_added          2016  0.69  Bans added
>>>>>>>> MAIN.bans_deleted        2014  0.69  Bans deleted
>>>>>>>>
>>>>>>>> Before the changes, bans were never deleted!
>>>>>>>> Now the bans are added and quickly deleted after a minute or even a couple of seconds.
>>>>>>>>
>>>>>>>> Maybe this was the cause of the problem? It seems like varnish had a large number of bans to manage and test against.
>>>>>>>> I will let it ride now. Let's see if the problem persists or it's gone! :-)
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Stefano
>>>>>>>>
>>>>>>>> On Mon, Jun 26, 2017 at 3:10 PM, Guillaume Quintard <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Looking good!
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Guillaume Quintard
>>>>>>>>>
>>>>>>>>> On Mon, Jun 26, 2017 at 7:06 PM, Stefano Baldo <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Guillaume,
>>>>>>>>>>
>>>>>>>>>> Can the following be considered "ban lurker friendly"?
>>>>>>>>>>
>>>>>>>>>> sub vcl_backend_response {
>>>>>>>>>>   set beresp.http.x-url = bereq.http.host + bereq.url;
>>>>>>>>>>   set beresp.http.x-user-agent = bereq.http.user-agent;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> sub vcl_recv {
>>>>>>>>>>   if (req.method == "PURGE") {
>>>>>>>>>>     ban("obj.http.x-url == " + req.http.host + req.url + " && obj.http.x-user-agent !~ Googlebot");
>>>>>>>>>>     return(synth(750));
>>>>>>>>>>   }
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> sub vcl_deliver {
>>>>>>>>>>   unset resp.http.x-url;
>>>>>>>>>>   unset resp.http.x-user-agent;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Stefano
>>>>>>>>>>
>>>>>>>>>> On Mon, Jun 26, 2017 at 12:43 PM, Guillaume Quintard <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Not lurker friendly at all indeed. You'll need to avoid req.* expressions. The easiest way is to stash the host, user-agent and url in beresp.http.* and ban against those (unset them in vcl_deliver).
>>>>>>>>>>>
>>>>>>>>>>> I don't think you need to expand the VSL at all.
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Guillaume Quintard
>>>>>>>>>>>
>>>>>>>>>>> On Jun 26, 2017 16:51, "Stefano Baldo" <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Guillaume.
>>>>>>>>>>>
>>>>>>>>>>> Thanks for answering.
>>>>>>>>>>>
>>>>>>>>>>> I'm using an SSD disk. I've changed from ext4 to ext2 to increase performance but it still keeps restarting. Also, I checked the I/O performance for the disk and there is no sign of overload.
>>>>>>>>>>>
>>>>>>>>>>> I've changed /var/lib/varnish to a tmpfs and increased its 80m default size, passing "-l 200m,20m" to varnishd and using "nodev,nosuid,noatime,size=256M 0 0" for the tmpfs mount. There was a problem here. After a couple of hours varnish died and I received a "no space left on device" message - deleting /var/lib/varnish solved the problem and varnish was up again, but it's weird because there was free memory on the host to be used with the tmpfs directory, so I don't know what could have happened. I will try to stop increasing the /var/lib/varnish size.
>>>>>>>>>>>
>>>>>>>>>>> Anyway, I am worried about the bans. You asked me if the bans are lurker friendly. Well, I don't think so. My bans are created this way:
>>>>>>>>>>>
>>>>>>>>>>> ban("req.http.host == " + req.http.host + " && req.url ~ " + req.url + " && req.http.User-Agent !~ Googlebot");
>>>>>>>>>>>
>>>>>>>>>>> Are they lurker friendly? I was taking a quick look at the documentation and it looks like they're not.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Stefano
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Jun 23, 2017 at 11:30 AM, Guillaume Quintard <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Stefano,
>>>>>>>>>>>>
>>>>>>>>>>>> Let's cover the usual suspects: I/Os. I think here Varnish gets stuck trying to push/pull data and can't make time to reply to the CLI. I'd recommend monitoring the disk activity (bandwidth and iops) to confirm.
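A minimal way to gather that evidence, assuming the sysstat package is installed and the cache disk is /dev/xvda (an assumption for this Xen guest; substitute the real device):

    # Extended device stats every 10 seconds: r/s and w/s (IOPS),
    # rMB/s and wMB/s (bandwidth), await and %util. Sustained spikes
    # just before a child restart would support the I/O theory.
    iostat -dxm /dev/xvda 10 >> /var/log/varnish-disk.log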
>>>>>>>>>>>>
>>>>>>>>>>>> After some time, the file storage is terrible on a hard drive (SSDs take a bit more time to degrade) because of fragmentation. One solution to help the disks cope is to overprovision them if they're SSDs, and you can try different advice options in the file storage definition on the command line (the last parameter, after granularity).
>>>>>>>>>>>>
>>>>>>>>>>>> Is your /var/lib/varnish mounted on tmpfs? That could help too.
>>>>>>>>>>>>
>>>>>>>>>>>> 40K bans is a lot, are they ban-lurker friendly?
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Guillaume Quintard
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Jun 23, 2017 at 4:01 PM, Stefano Baldo <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hello.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am having a critical problem with Varnish Cache in production for over a month and any help will be appreciated. The problem is that the Varnish child process is recurrently being restarted after 10~20h of use, with the following message:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) not responding to CLI, killed it.
>>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Unexpected reply from ping: 400 CLI communication error
>>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) died signal=9
>>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child cleanup complete
>>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) Started
>>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said Child starts
>>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said SMF.s0 mmap'ed 483183820800 bytes of 483183820800
>>>>>>>>>>>>>
>>>>>>>>>>>>> The following link is the varnishstat output just 1 minute before a restart:
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://pastebin.com/g0g5RVTs
>>>>>>>>>>>>>
>>>>>>>>>>>>> Environment:
>>>>>>>>>>>>>
>>>>>>>>>>>>> varnish-5.1.2 revision 6ece695
>>>>>>>>>>>>> Debian 8.7 - Debian GNU/Linux 8 (3.16.0)
>>>>>>>>>>>>> Installed using pre-built package from official repo at packagecloud.io
>>>>>>>>>>>>> CPU 2x2.9 GHz
>>>>>>>>>>>>> Mem 3.69 GiB
>>>>>>>>>>>>> Running inside a Docker container
>>>>>>>>>>>>> NFILES=131072
>>>>>>>>>>>>> MEMLOCK=82000
>>>>>>>>>>>>>
>>>>>>>>>>>>> Additional info:
>>>>>>>>>>>>>
>>>>>>>>>>>>> - I need to cache a large number of objects and the cache should last for almost a week, so I have set up a 450G storage space; I don't know if this is a problem;
>>>>>>>>>>>>> - I use bans a lot. There were about 40k bans in the system just before the last crash. I really don't know if this is too much or may have anything to do with it (see the check sketched after this message);
>>>>>>>>>>>>> - No registered CPU spikes (almost always around 30%);
>>>>>>>>>>>>> - No panic is reported; the only info I can retrieve is from syslog;
>>>>>>>>>>>>> - All along, even moments before the crashes, everything is okay and requests are answered very fast.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Stefano Baldo
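The ban check referenced above can be done from the shell; a small sketch, assuming varnishadm can reach the running instance (ban.list output can be long with 40k bans, hence the head):

    # Watch whether the ban lurker is keeping up: bans_deleted should
    # grow and MAIN.bans should stay small once bans are lurker-friendly.
    varnishstat -1 | grep -E '^MAIN\.bans'
    # Inspect the ban expressions themselves; any req.* test marks a ban
    # the lurker cannot evaluate in the background.
    varnishadm ban.list | head -n 20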
