Hi Paolo,

Yes, this _was_ correct... however, I now also see more systems with exactly the same symptoms.

On the most problematic system, I yesterday reverted the config from ZeroMQ back to the circular buffer (no recompile). So far, no out-of-memory observed... but the "Missing data detected" messages are back in return. I'll keep an eye on it; if it keeps on running another 24h without issue, then we might have found a culprit.

The system was handling around 200K pps according to the SNMP graphs of the switch. (It's a fiber with an optical splitter, so the switch gets the same traffic as the pmacct system.) I could not observe a significant increase or decrease in the 1-minute interval graphs. I'll see if I can graph memory/CPU of the systems and check if there would be an indication.

Are there any other things I could try / troubleshoot in terms of ZeroMQ? ZeroMQ was configured with the 'large' profile, but not explicitly for both plugins... I think it will then take the global option for each plugin and set both plugins to 'large'? (One plugin is for IPv4 traffic and one for IPv6 traffic... the latter could, unfortunately, probably be changed to very small.) I could share the config privately if needed/wanted.

Regards,

Wouter

-----Original Message-----
From: pmacct-discussion [mailto:pmacct-discussion-boun...@pmacct.net] On behalf of Paolo Lucente
Sent: Tuesday, November 7, 2017 01:27
To: pmacct-discussion@pmacct.net
Subject: Re: [pmacct-discussion] oom-killer with pmacct 1.7.0

Hi Wouter,

If I understand correctly, you are using 1.7.0 with ZMQ on all of your boxes, but only 2 of them present an issue while the others run fine. Is that correct?

Please, yes, either recompile without ZeroMQ or revert to the home-grown circular buffer for those two boxes, for the sake of testing, and see if the culprit is in that area. Keep me posted.

Also, do you have a sense of whether those two boxes are the busiest? And how many pps do they receive? Do you by any chance monitor these boxes via SNMP, so as to see, for example, whether the memory consumption is directly related to a traffic spike?
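[Editor's note: regarding the per-plugin question above, a hedged sketch of what overriding the global profile per plugin could look like, following pmacct's `directive[plugin_name]` convention; the plugin names `ipv4`/`ipv6` and the `micro` choice are assumptions, not the poster's actual config:]

```
! Global default applies to every plugin unless overridden:
plugin_pipe_zmq: true
plugin_pipe_zmq_profile: large
!
! Assuming plugins were declared e.g. as: plugins: mysql[ipv4], mysql[ipv6]
! the low-traffic IPv6 plugin could be given a smaller profile individually:
plugin_pipe_zmq_profile[ipv6]: micro
```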
Paolo

On Mon, Nov 06, 2017 at 01:32:32PM +0000, Wouter de Jong wrote:
> Hi,
>
> As we had (quite) some "missing data" messages, I was excited to see that 1.7.0 makes ZeroMQ config very easy, since we were still running with the home-grown circular buffer.
>
> So last week, I upgraded our systems from pmacct 1.6.1 to 1.7.0.
>
> Since then I experience issues on 2 out of 7 systems: oom-killer on one of the pmacct MySQL plugins.
>
> [Mon Nov 6 09:54:38 2017] pmacctd invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
> [Mon Nov 6 09:54:38 2017] pmacctd cpuset=/ mems_allowed=0-1
> [Mon Nov 6 09:54:38 2017] CPU: 15 PID: 44049 Comm: pmacctd Tainted: G OE ------------ 3.10.0-693.5.2.el7.x86_64 #1
> [Mon Nov 6 09:54:38 2017] Hardware name: Dell Inc. PowerEdge R630, BIOS 2.4.3 01/17/2017
> [Mon Nov 6 09:54:38 2017] ffff88103dd4dee0 0000000019a0de94 ffff880036adf5f0 ffffffff816a3e51
> [Mon Nov 6 09:54:38 2017] ffff880036adf680 ffffffff8169f246 ffff880036adf688 ffffffff812b7d1b
> [Mon Nov 6 09:54:38 2017] ffff88203c336e68 0000000000000202 ffffffff00000202 fffeefff00000000
> [Mon Nov 6 09:54:38 2017] Call Trace:
> [Mon Nov 6 09:54:38 2017] [<ffffffff816a3e51>] dump_stack+0x19/0x1b
> [Mon Nov 6 09:54:38 2017] [<ffffffff8169f246>] dump_header+0x90/0x229
> [Mon Nov 6 09:54:38 2017] [<ffffffff812b7d1b>] ? cred_has_capability+0x6b/0x120
> [Mon Nov 6 09:54:38 2017] [<ffffffff811863a4>] oom_kill_process+0x254/0x3d0
> [Mon Nov 6 09:54:38 2017] [<ffffffff812b7efe>] ? selinux_capable+0x2e/0x40
> [Mon Nov 6 09:54:38 2017] [<ffffffff81186be6>] out_of_memory+0x4b6/0x4f0
> [Mon Nov 6 09:54:38 2017] [<ffffffff8169fd4a>] __alloc_pages_slowpath+0x5d6/0x724
> [Mon Nov 6 09:54:38 2017] [<ffffffff8118cdb5>] __alloc_pages_nodemask+0x405/0x420
> [Mon Nov 6 09:54:38 2017] [<ffffffff811d40a5>] alloc_pages_vma+0xb5/0x200
> [Mon Nov 6 09:54:38 2017] [<ffffffff811b2350>] handle_mm_fault+0xb60/0xfa0
> [Mon Nov 6 09:54:38 2017] [<ffffffff810c8f28>] ? __enqueue_entity+0x78/0x80
> [Mon Nov 6 09:54:38 2017] [<ffffffff816b0074>] __do_page_fault+0x154/0x450
> [Mon Nov 6 09:54:38 2017] [<ffffffff816b03a5>] do_page_fault+0x35/0x90
> [Mon Nov 6 09:54:38 2017] [<ffffffff816ac5c8>] page_fault+0x28/0x30
> [Mon Nov 6 09:54:38 2017] [<ffffffff81330379>] ? copy_user_enhanced_fast_string+0x9/0x20
> [Mon Nov 6 09:54:38 2017] [<ffffffff81336a4a>] ? memcpy_toiovec+0x4a/0x90
> [Mon Nov 6 09:54:38 2017] [<ffffffff815796e8>] skb_copy_datagram_iovec+0x128/0x280
> [Mon Nov 6 09:54:38 2017] [<ffffffff815d88aa>] tcp_recvmsg+0x24a/0xb50
> [Mon Nov 6 09:54:38 2017] [<ffffffff81606aea>] inet_recvmsg+0x7a/0xa0
> [Mon Nov 6 09:54:38 2017] [<ffffffff8156a88f>] sock_recvmsg+0xbf/0x100
> [Mon Nov 6 09:54:38 2017] [<ffffffff815da029>] ? tcp_poll+0x219/0x230
> [Mon Nov 6 09:54:38 2017] [<ffffffff8124b859>] ? ep_scan_ready_list.isra.7+0x1b9/0x1f0
> [Mon Nov 6 09:54:38 2017] [<ffffffff8156aa08>] SYSC_recvfrom+0xe8/0x160
> [Mon Nov 6 09:54:38 2017] [<ffffffff8156b2fe>] SyS_recvfrom+0xe/0x10
> [Mon Nov 6 09:54:38 2017] [<ffffffff816b5089>] system_call_fastpath+0x16/0x1b
> [Mon Nov 6 09:54:38 2017] Mem-Info:
> [Mon Nov 6 09:54:38 2017] active_anon:31102734 inactive_anon:1375631 isolated_anon:64
>  active_file:61 inactive_file:0 isolated_file:0
>  unevictable:0 dirty:0 writeback:200 unstable:0
>  slab_reclaimable:10481 slab_unreclaimable:34483
>  mapped:10650 shmem:9528 pagetables:66634 bounce:0
>  free:88657 free_pcp:30 free_cma:0
> [Mon Nov 6 09:54:38 2017] Node 0 DMA free:15864kB min:8kB low:8kB high:12kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15980kB managed:15896kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
> [Mon Nov 6 09:54:38 2017] lowmem_reserve[]: 0 1690 64141 64141
> [Mon Nov 6 09:54:38 2017] Node 0 DMA32 free:250920kB min:1184kB low:1480kB high:1776kB active_anon:1096172kB inactive_anon:365444kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1985264kB managed:1733112kB mlocked:0kB dirty:0kB writeback:0kB mapped:484kB shmem:488kB slab_reclaimable:456kB slab_unreclaimable:3256kB kernel_stack:224kB pagetables:2600kB unstable:0kB bounce:0kB free_pcp:120kB local_pcp:120kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
> [Mon Nov 6 09:54:38 2017] lowmem_reserve[]: 0 0 62451 62451
> [Mon Nov 6 09:54:38 2017] Node 0 Normal free:42772kB min:43740kB low:54672kB high:65608kB active_anon:60618136kB inactive_anon:2525396kB active_file:264kB inactive_file:0kB unevictable:0kB isolated(anon):128kB isolated(file):0kB present:65011712kB managed:63949968kB mlocked:0kB dirty:0kB writeback:376kB mapped:21788kB shmem:21608kB slab_reclaimable:20352kB slab_unreclaimable:65620kB kernel_stack:5856kB pagetables:129356kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:403 all_unreclaimable? yes
> [Mon Nov 6 09:54:38 2017] lowmem_reserve[]: 0 0 0 0
> [Mon Nov 6 09:54:38 2017] Node 1 Normal free:45072kB min:45172kB low:56464kB high:67756kB active_anon:62696628kB inactive_anon:2611684kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):128kB isolated(file):0kB present:67108864kB managed:66046872kB mlocked:0kB dirty:0kB writeback:424kB mapped:20328kB shmem:16016kB slab_reclaimable:21116kB slab_unreclaimable:69024kB kernel_stack:4976kB pagetables:134580kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:1984 all_unreclaimable? yes
> [Mon Nov 6 09:54:38 2017] lowmem_reserve[]: 0 0 0 0
> [Mon Nov 6 09:54:38 2017] Node 0 DMA: 0*4kB 1*8kB (U) 1*16kB (U) 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15864kB
> [Mon Nov 6 09:54:38 2017] Node 0 DMA32: 220*4kB (UE) 253*8kB (UE) 182*16kB (UEM) 64*32kB (UEM) 76*64kB (UEM) 45*128kB (UEM) 38*256kB (UEM) 39*512kB (UEM) 32*1024kB (UE) 29*2048kB (U) 27*4096kB (UM) = 250936kB
> [Mon Nov 6 09:54:38 2017] Node 0 Normal: 2520*4kB (UE) 1980*8kB (UEM) 788*16kB (UEM) 82*32kB (UEM) 24*64kB (UEM) 8*128kB (UEM) 1*256kB (M) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 43968kB
> [Mon Nov 6 09:54:38 2017] Node 1 Normal: 2140*4kB (UE) 4637*8kB (UM) 82*16kB (UM) 1*32kB (M) 2*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 47128kB
> [Mon Nov 6 09:54:38 2017] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
> [Mon Nov 6 09:54:38 2017] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
> [Mon Nov 6 09:54:38 2017] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
> [Mon Nov 6 09:54:38 2017] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
> [Mon Nov 6 09:54:38 2017] 13517 total pagecache pages
> [Mon Nov 6 09:54:38 2017] 4044 pages in swap cache
> [Mon Nov 6 09:54:38 2017] Swap cache stats: add 4249745, delete 4245701, find 35957085/35970537
> [Mon Nov 6 09:54:38 2017] Free swap = 0kB
> [Mon Nov 6 09:54:38 2017] Total swap = 4194300kB
> [Mon Nov 6 09:54:38 2017] 33530455 pages RAM
> [Mon Nov 6 09:54:38 2017] 0 pages HighMem/MovableOnly
> [Mon Nov 6 09:54:38 2017] 593993 pages reserved
> [Mon Nov 6 09:54:38 2017] [ pid ]   uid  tgid  total_vm      rss nr_ptes swapents oom_score_adj name
> [Mon Nov 6 09:54:38 2017] [  771]     0   771     17483     7670      41       51             0 systemd-journal
> [Mon Nov 6 09:54:38 2017] [  802]     0   802     11959       25      24      735         -1000 systemd-udevd
> [Mon Nov 6 09:54:38 2017] [ 2444]     0  2444     13863       10      26      101         -1000 auditd
> [Mon Nov 6 09:54:38 2017] [ 2463]     0  2463      5468       89      15       80             0 irqbalance
> [Mon Nov 6 09:54:38 2017] [ 2467]    81  2467      8153       62      18       49          -900 dbus-daemon
> [Mon Nov 6 09:54:38 2017] [ 2482]     0  2482      6051       43      17       32             0 systemd-logind
> [Mon Nov 6 09:54:38 2017] [ 2483]   998  2483    133561       96      58     1532             0 polkitd
> [Mon Nov 6 09:54:38 2017] [ 2484]     0  2484     75472     3998      66      835             0 rsyslogd
> [Mon Nov 6 09:54:38 2017] [ 2554]     0  2554     31558       26      18      132             0 crond
> [Mon Nov 6 09:54:38 2017] [ 2578]     0  2578     27511        1      10       31             0 agetty
> [Mon Nov 6 09:54:38 2017] [ 2585]   997  2585     25108       30      20       62             0 chronyd
> [Mon Nov 6 09:54:38 2017] [ 3055]     0  3055     26499       13      55      232         -1000 sshd
> [Mon Nov 6 09:54:38 2017] [ 3057]     0  3057    140598      106      88     2614             0 tuned
> [Mon Nov 6 09:54:38 2017] [ 3524]     0  3524     22504        4      44      275             0 master
> [Mon Nov 6 09:54:38 2017] [ 3526]    89  3526     22547       14      45      260             0 qmgr
> [Mon Nov 6 09:54:38 2017] [ 3730]     0  3730    247949      257      67     4361             0 dsm_sa_datamgrd
> [Mon Nov 6 09:54:38 2017] [ 3803]     0  3803     75246       92      40      126             0 dsm_sa_eventmgr
> [Mon Nov 6 09:54:38 2017] [ 3828]     0  3828    111461      494      51      879             0 dsm_sa_snmpd
> [Mon Nov 6 09:54:38 2017] [ 3834]     0  3834    180364        6      59     4326             0 dsm_sa_datamgrd
> [Mon Nov 6 09:54:38 2017] [ 3877]     0  3877    158222       21      41      672             0 dsm_om_shrsvcd
> [Mon Nov 6 09:54:38 2017] [44029]     0 44029     96018     3183      46      547             0 pmacctd
> [Mon Nov 6 09:54:38 2017] [44030]     0 44030   3861827  3769206    7400      287             0 pmacctd
> [Mon Nov 6 09:54:38 2017] [44037]     0 44037  29678700 28615469   57953   997921             0 pmacctd
> [Mon Nov 6 09:54:38 2017] [44038]     0 44038    112024    72059     184     4966             0 pmacctd
> [Mon Nov 6 09:54:38 2017] [44045]     0 44045     46356     2918      49     6122             0 pmacctd
> [Mon Nov 6 09:54:38 2017] [44046]     0 44046     46389     3147      49     5705             0 pmacctd
> [Mon Nov 6 09:54:38 2017] [58219]    89 58219     22530      272      44        0             0 pickup
> [Mon Nov 6 09:54:38 2017] [59874]     0 59874     47222     3116      51     6044             0 pmacctd
> [Mon Nov 6 09:54:38 2017] [59875]     0 59875     47225     3665      53     5494             0 pmacctd
> [Mon Nov 6 09:54:38 2017] Out of memory: Kill process 44037 (pmacctd) score 846 or sacrifice child
> [Mon Nov 6 09:54:38 2017] Killed process 44037 (pmacctd) total-vm:118714800kB, anon-rss:114461768kB, file-rss:104kB, shmem-rss:4kB
>
> All 7 systems are identical in terms of config; they only receive different traffic and have a slightly different HW config (CPU, R630 vs R720, 64G - 128G memory).
>
> Each system runs CentOS 7.4.1708 64-bit, fully updated, with a dual-port Intel X520 10G NIC, and runs one pmacctd instance for 10G NIC1 and one for 10G NIC2. Per instance, traffic is split out over an IPv4 MySQL plugin and an IPv6 MySQL plugin.
>
> Data is stored on an external MySQL (/ Percona) server.
>
> As CentOS 7.x EPEL comes with ZeroMQ 4.1, and pmacct likes >= 4.2, I installed ZeroMQ 4.2.2 from the ZeroMQ yum repository.
>
> Eager as I was, I also installed PF_RING 7.0.0 (non-ZC to start with) in the same change, from the ntop repository.
>
> After some time running, I observed the oom-killer issue on the two machines.
>
> I suspected PF_RING at first; I was running with the following config:
>
> options pf_ring enable_tx_capture=0 quick_mode=1
>
> Then I reduced that on those 2 machines to:
>
> options pf_ring enable_tx_capture=0
>
> That seems to work _so_ far on 1 of them, but on the other... no change.
>
> Then I removed PF_RING completely from that system, recompiled pmacct, and made sure pmacctd was now linked against libpcap.so.1 again, and no longer against libpfring.so.1.
>
> This morning, another crash... so it does not seem (fully) related to PF_RING or its config.
>
> So the only other change from 1.6.1 to 1.7.0 on these machines was that pmacct is now compiled with the additional option "--enable-zmq", and in the config I replaced plugin_buffer_size & plugin_pipe_size with:
>
> plugin_pipe_zmq: true
> plugin_pipe_zmq_profile: large
>
> I will probably recompile once more without ZeroMQ, revert the config change, and see how that goes.
>
> But it would be nice to get to a stable system with all features enabled, so if anyone has good hints on what to check, etc...
>
> Any help/insight is appreciated :)
>
> Regards,
>
> Wouter
>
> _______________________________________________
> pmacct-discussion mailing list
> http://www.pmacct.net/#mailinglists

_______________________________________________
pmacct-discussion mailing list
http://www.pmacct.net/#mailinglists
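[Editor's note: since correlating per-process memory with traffic comes up several times in this thread, here is a minimal sketch of how pmacctd RSS could be sampled for graphing. This is a hypothetical helper, not an existing pmacct tool; the CSV path and one-minute cadence are assumptions.]

```shell
#!/bin/sh
# Append one timestamped CSV row per running pmacctd process:
# epoch,pid,rss_kb,comm. Intended to be run from cron every minute,
# producing a file that can be fed to a graphing tool later.
ps -o pid=,rss=,comm= -C pmacctd |
  awk -v ts="$(date +%s)" '{ printf "%s,%s,%s,%s\n", ts, $1, $2, $3 }' \
  >> /var/tmp/pmacctd_rss.csv
```

The `ps -C pmacctd -o pid=,rss=,comm=` selection keeps the output header-free, so every sampled line lands in the CSV as data.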