Hi Paolo,

Yes, this _was_ correct.... however, I now also see more systems with exactly 
the same symptoms...

On the most problematic system, I reverted the config yesterday back from 
ZeroMQ to the circular buffer 
(no recompile)

So far, no out-of-memory observed...  but the "Missing data detected" messages 
are back in return.
I'll keep an eye on it; if it keeps on running another 24h without issue, then 
we might have found a culprit...

System was around 200K pps according to the SNMP graphs of the switch.
(it's a fiber with an optical splitter, so the switch gets the same traffic as 
the pmacct system)

I could not observe a significant increase or decrease in the 1 minute interval 
graphs.

I'll see if I can graph memory/cpu of the systems and check if there would be 
an indication.
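To get a quick first look before wiring this into proper graphing, I could use a minimal sketch like the following (assuming Linux /proc and procps-style tools; the function names are just examples):

```shell
#!/bin/sh
# rss_kb PID — print the resident set size (kB) of one process, from /proc.
rss_kb() {
    awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}

# One timestamped sample per pmacctd process; run this from cron every minute
# (or a loop) to build a series to graph alongside the switch pps graphs.
log_pmacctd_rss() {
    for pid in $(pidof pmacctd); do
        printf '%s pid=%s rss_kb=%s\n' "$(date +%s)" "$pid" "$(rss_kb "$pid")"
    done
}
```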

Are there any other things I could try / troubleshoot in terms of ZeroMQ?
ZeroMQ was configured with the 'large' profile, but not explicitly for both 
plugins...


I think it will then take the global option for each plugin, and set both 
plugins to 'large'?
(One plugin is for IPv4 traffic and one for IPv6 traffic... the IPv6 one 
could unfortunately probably be set to a much smaller profile)
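For what it's worth, a minimal sketch of what setting the profile explicitly per plugin might look like — the plugin names here are made up for illustration, not taken from the actual config:

```
! sketch only: plugin names are hypothetical
plugins: mysql[mysql_v4], mysql[mysql_v6]
plugin_pipe_zmq[mysql_v4]: true
plugin_pipe_zmq_profile[mysql_v4]: large
plugin_pipe_zmq[mysql_v6]: true
plugin_pipe_zmq_profile[mysql_v6]: small
```

(If I read CONFIG-KEYS correctly, the available profiles are micro, small, medium, large and xlarge, and with no per-plugin key the global value should indeed apply to both plugins.)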

I could share config privately if needed/wanted.


Regards,

Wouter

-----Original Message-----
From: pmacct-discussion [mailto:pmacct-discussion-boun...@pmacct.net] On 
Behalf Of Paolo Lucente
Sent: Tuesday 7 November 2017 01:27
To: pmacct-discussion@pmacct.net
Subject: Re: [pmacct-discussion] oom-killer with pmacct 1.7.0


Hi Wouter,

If I understand correctly you are using 1.7.0 with ZMQ on all of your
boxes but only 2 of them present an issue, while the others run fine, is
that correct? Please, yes, either recompile without ZeroMQ or revert to
the home-grown circular buffer for those two boxes - for the sake of
testing, to see if the culprit is in that area. Keep me posted.

Also, do you have a sense of whether those two boxes are the busiest? And how
many pps do they receive? Do you by any chance monitor these boxes via SNMP,
so as to see, for example, if the memory consumption is directly related to
a traffic spike? 

Paolo 

On Mon, Nov 06, 2017 at 01:32:32PM +0000, Wouter de Jong wrote:
> Hi,
> 
> As we had (quite) some "missing data" messages, I was excited to see 1.7.0 
> makes ZeroMQ config very easy, 
> since we were still running with the home-grown circular buffer.
> 
> 
> So last week, I've upgraded our systems from pmacct 1.6.1 to 1.7.0 
> 
> Since then I experience issues on 2 out of 7 systems : oom-killer on one of 
> the pmacct MySQL plugins
> 
> [Mon Nov  6 09:54:38 2017] pmacctd invoked oom-killer: gfp_mask=0x280da, 
> order=0, oom_score_adj=0
> [Mon Nov  6 09:54:38 2017] pmacctd cpuset=/ mems_allowed=0-1
> [Mon Nov  6 09:54:38 2017] CPU: 15 PID: 44049 Comm: pmacctd Tainted: G        
>    OE  ------------   3.10.0-693.5.2.el7.x86_64 #1
> [Mon Nov  6 09:54:38 2017] Hardware name: Dell Inc. PowerEdge R630, BIOS 
> 2.4.3 01/17/2017
> [Mon Nov  6 09:54:38 2017]  ffff88103dd4dee0 0000000019a0de94 
> ffff880036adf5f0 ffffffff816a3e51
> [Mon Nov  6 09:54:38 2017]  ffff880036adf680 ffffffff8169f246 
> ffff880036adf688 ffffffff812b7d1b
> [Mon Nov  6 09:54:38 2017]  ffff88203c336e68 0000000000000202 
> ffffffff00000202 fffeefff00000000
> [Mon Nov  6 09:54:38 2017] Call Trace:
> [Mon Nov  6 09:54:38 2017]  [<ffffffff816a3e51>] dump_stack+0x19/0x1b
> [Mon Nov  6 09:54:38 2017]  [<ffffffff8169f246>] dump_header+0x90/0x229
> [Mon Nov  6 09:54:38 2017]  [<ffffffff812b7d1b>] ? 
> cred_has_capability+0x6b/0x120
> [Mon Nov  6 09:54:38 2017]  [<ffffffff811863a4>] oom_kill_process+0x254/0x3d0
> [Mon Nov  6 09:54:38 2017]  [<ffffffff812b7efe>] ? selinux_capable+0x2e/0x40
> [Mon Nov  6 09:54:38 2017]  [<ffffffff81186be6>] out_of_memory+0x4b6/0x4f0
> [Mon Nov  6 09:54:38 2017]  [<ffffffff8169fd4a>] 
> __alloc_pages_slowpath+0x5d6/0x724
> [Mon Nov  6 09:54:38 2017]  [<ffffffff8118cdb5>] 
> __alloc_pages_nodemask+0x405/0x420
> [Mon Nov  6 09:54:38 2017]  [<ffffffff811d40a5>] alloc_pages_vma+0xb5/0x200
> [Mon Nov  6 09:54:38 2017]  [<ffffffff811b2350>] handle_mm_fault+0xb60/0xfa0
> [Mon Nov  6 09:54:38 2017]  [<ffffffff810c8f28>] ? __enqueue_entity+0x78/0x80
> [Mon Nov  6 09:54:38 2017]  [<ffffffff816b0074>] __do_page_fault+0x154/0x450
> [Mon Nov  6 09:54:38 2017]  [<ffffffff816b03a5>] do_page_fault+0x35/0x90
> [Mon Nov  6 09:54:38 2017]  [<ffffffff816ac5c8>] page_fault+0x28/0x30
> [Mon Nov  6 09:54:38 2017]  [<ffffffff81330379>] ? 
> copy_user_enhanced_fast_string+0x9/0x20
> [Mon Nov  6 09:54:38 2017]  [<ffffffff81336a4a>] ? memcpy_toiovec+0x4a/0x90
> [Mon Nov  6 09:54:38 2017]  [<ffffffff815796e8>] 
> skb_copy_datagram_iovec+0x128/0x280
> [Mon Nov  6 09:54:38 2017]  [<ffffffff815d88aa>] tcp_recvmsg+0x24a/0xb50
> [Mon Nov  6 09:54:38 2017]  [<ffffffff81606aea>] inet_recvmsg+0x7a/0xa0
> [Mon Nov  6 09:54:38 2017]  [<ffffffff8156a88f>] sock_recvmsg+0xbf/0x100
> [Mon Nov  6 09:54:38 2017]  [<ffffffff815da029>] ? tcp_poll+0x219/0x230
> [Mon Nov  6 09:54:38 2017]  [<ffffffff8124b859>] ? 
> ep_scan_ready_list.isra.7+0x1b9/0x1f0
> [Mon Nov  6 09:54:38 2017]  [<ffffffff8156aa08>] SYSC_recvfrom+0xe8/0x160
> [Mon Nov  6 09:54:38 2017]  [<ffffffff8156b2fe>] SyS_recvfrom+0xe/0x10
> [Mon Nov  6 09:54:38 2017]  [<ffffffff816b5089>] 
> system_call_fastpath+0x16/0x1b
> [Mon Nov  6 09:54:38 2017] Mem-Info:
> [Mon Nov  6 09:54:38 2017] active_anon:31102734 inactive_anon:1375631 
> isolated_anon:64
>  active_file:61 inactive_file:0 isolated_file:0
>  unevictable:0 dirty:0 writeback:200 unstable:0
>  slab_reclaimable:10481 slab_unreclaimable:34483
>  mapped:10650 shmem:9528 pagetables:66634 bounce:0
>  free:88657 free_pcp:30 free_cma:0
> [Mon Nov  6 09:54:38 2017] Node 0 DMA free:15864kB min:8kB low:8kB high:12kB 
> active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB 
> unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15980kB 
> managed:15896kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB 
> slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:0kB 
> unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB 
> writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
> [Mon Nov  6 09:54:38 2017] lowmem_reserve[]: 0 1690 64141 64141
> [Mon Nov  6 09:54:38 2017] Node 0 DMA32 free:250920kB min:1184kB low:1480kB 
> high:1776kB active_anon:1096172kB inactive_anon:365444kB active_file:0kB 
> inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB 
> present:1985264kB managed:1733112kB mlocked:0kB dirty:0kB writeback:0kB 
> mapped:484kB shmem:488kB slab_reclaimable:456kB slab_unreclaimable:3256kB 
> kernel_stack:224kB pagetables:2600kB unstable:0kB bounce:0kB free_pcp:120kB 
> local_pcp:120kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 
> all_unreclaimable? yes
> [Mon Nov  6 09:54:38 2017] lowmem_reserve[]: 0 0 62451 62451
> [Mon Nov  6 09:54:38 2017] Node 0 Normal free:42772kB min:43740kB low:54672kB 
> high:65608kB active_anon:60618136kB inactive_anon:2525396kB active_file:264kB 
> inactive_file:0kB unevictable:0kB isolated(anon):128kB isolated(file):0kB 
> present:65011712kB managed:63949968kB mlocked:0kB dirty:0kB writeback:376kB 
> mapped:21788kB shmem:21608kB slab_reclaimable:20352kB 
> slab_unreclaimable:65620kB kernel_stack:5856kB pagetables:129356kB 
> unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB 
> writeback_tmp:0kB pages_scanned:403 all_unreclaimable? yes
> [Mon Nov  6 09:54:38 2017] lowmem_reserve[]: 0 0 0 0
> [Mon Nov  6 09:54:38 2017] Node 1 Normal free:45072kB min:45172kB low:56464kB 
> high:67756kB active_anon:62696628kB inactive_anon:2611684kB active_file:0kB 
> inactive_file:0kB unevictable:0kB isolated(anon):128kB isolated(file):0kB 
> present:67108864kB managed:66046872kB mlocked:0kB dirty:0kB writeback:424kB 
> mapped:20328kB shmem:16016kB slab_reclaimable:21116kB 
> slab_unreclaimable:69024kB kernel_stack:4976kB pagetables:134580kB 
> unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB 
> writeback_tmp:0kB pages_scanned:1984 all_unreclaimable? yes
> [Mon Nov  6 09:54:38 2017] lowmem_reserve[]: 0 0 0 0
> [Mon Nov  6 09:54:38 2017] Node 0 DMA: 0*4kB 1*8kB (U) 1*16kB (U) 1*32kB (U) 
> 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB 
> (M) = 15864kB
> [Mon Nov  6 09:54:38 2017] Node 0 DMA32: 220*4kB (UE) 253*8kB (UE) 182*16kB 
> (UEM) 64*32kB (UEM) 76*64kB (UEM) 45*128kB (UEM) 38*256kB (UEM) 39*512kB 
> (UEM) 32*1024kB (UE) 29*2048kB (U) 27*4096kB (UM) = 250936kB
> [Mon Nov  6 09:54:38 2017] Node 0 Normal: 2520*4kB (UE) 1980*8kB (UEM) 
> 788*16kB (UEM) 82*32kB (UEM) 24*64kB (UEM) 8*128kB (UEM) 1*256kB (M) 0*512kB 
> 0*1024kB 0*2048kB 0*4096kB = 43968kB
> [Mon Nov  6 09:54:38 2017] Node 1 Normal: 2140*4kB (UE) 4637*8kB (UM) 82*16kB 
> (UM) 1*32kB (M) 2*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB 
> = 47128kB
> [Mon Nov  6 09:54:38 2017] Node 0 hugepages_total=0 hugepages_free=0 
> hugepages_surp=0 hugepages_size=1048576kB
> [Mon Nov  6 09:54:38 2017] Node 0 hugepages_total=0 hugepages_free=0 
> hugepages_surp=0 hugepages_size=2048kB
> [Mon Nov  6 09:54:38 2017] Node 1 hugepages_total=0 hugepages_free=0 
> hugepages_surp=0 hugepages_size=1048576kB
> [Mon Nov  6 09:54:38 2017] Node 1 hugepages_total=0 hugepages_free=0 
> hugepages_surp=0 hugepages_size=2048kB
> [Mon Nov  6 09:54:38 2017] 13517 total pagecache pages
> [Mon Nov  6 09:54:38 2017] 4044 pages in swap cache
> [Mon Nov  6 09:54:38 2017] Swap cache stats: add 4249745, delete 4245701, 
> find 35957085/35970537
> [Mon Nov  6 09:54:38 2017] Free swap  = 0kB
> [Mon Nov  6 09:54:38 2017] Total swap = 4194300kB
> [Mon Nov  6 09:54:38 2017] 33530455 pages RAM
> [Mon Nov  6 09:54:38 2017] 0 pages HighMem/MovableOnly
> [Mon Nov  6 09:54:38 2017] 593993 pages reserved
> [Mon Nov  6 09:54:38 2017] [ pid ]   uid  tgid total_vm      rss nr_ptes 
> swapents oom_score_adj name
> [Mon Nov  6 09:54:38 2017] [  771]     0   771    17483     7670      41      
>  51             0 systemd-journal
> [Mon Nov  6 09:54:38 2017] [  802]     0   802    11959       25      24      
> 735         -1000 systemd-udevd
> [Mon Nov  6 09:54:38 2017] [ 2444]     0  2444    13863       10      26      
> 101         -1000 auditd
> [Mon Nov  6 09:54:38 2017] [ 2463]     0  2463     5468       89      15      
>  80             0 irqbalance
> [Mon Nov  6 09:54:38 2017] [ 2467]    81  2467     8153       62      18      
>  49          -900 dbus-daemon
> [Mon Nov  6 09:54:38 2017] [ 2482]     0  2482     6051       43      17      
>  32             0 systemd-logind
> [Mon Nov  6 09:54:38 2017] [ 2483]   998  2483   133561       96      58     
> 1532             0 polkitd
> [Mon Nov  6 09:54:38 2017] [ 2484]     0  2484    75472     3998      66      
> 835             0 rsyslogd
> [Mon Nov  6 09:54:38 2017] [ 2554]     0  2554    31558       26      18      
> 132             0 crond
> [Mon Nov  6 09:54:38 2017] [ 2578]     0  2578    27511        1      10      
>  31             0 agetty
> [Mon Nov  6 09:54:38 2017] [ 2585]   997  2585    25108       30      20      
>  62             0 chronyd
> [Mon Nov  6 09:54:38 2017] [ 3055]     0  3055    26499       13      55      
> 232         -1000 sshd
> [Mon Nov  6 09:54:38 2017] [ 3057]     0  3057   140598      106      88     
> 2614             0 tuned
> [Mon Nov  6 09:54:38 2017] [ 3524]     0  3524    22504        4      44      
> 275             0 master
> [Mon Nov  6 09:54:38 2017] [ 3526]    89  3526    22547       14      45      
> 260             0 qmgr
> [Mon Nov  6 09:54:38 2017] [ 3730]     0  3730   247949      257      67     
> 4361             0 dsm_sa_datamgrd
> [Mon Nov  6 09:54:38 2017] [ 3803]     0  3803    75246       92      40      
> 126             0 dsm_sa_eventmgr
> [Mon Nov  6 09:54:38 2017] [ 3828]     0  3828   111461      494      51      
> 879             0 dsm_sa_snmpd
> [Mon Nov  6 09:54:38 2017] [ 3834]     0  3834   180364        6      59     
> 4326             0 dsm_sa_datamgrd
> [Mon Nov  6 09:54:38 2017] [ 3877]     0  3877   158222       21      41      
> 672             0 dsm_om_shrsvcd
> [Mon Nov  6 09:54:38 2017] [44029]     0 44029    96018     3183      46      
> 547             0 pmacctd
> [Mon Nov  6 09:54:38 2017] [44030]     0 44030  3861827  3769206    7400      
> 287             0 pmacctd
> [Mon Nov  6 09:54:38 2017] [44037]     0 44037 29678700 28615469   57953   
> 997921             0 pmacctd
> [Mon Nov  6 09:54:38 2017] [44038]     0 44038   112024    72059     184     
> 4966             0 pmacctd
> [Mon Nov  6 09:54:38 2017] [44045]     0 44045    46356     2918      49     
> 6122             0 pmacctd
> [Mon Nov  6 09:54:38 2017] [44046]     0 44046    46389     3147      49     
> 5705             0 pmacctd
> [Mon Nov  6 09:54:38 2017] [58219]    89 58219    22530      272      44      
>   0             0 pickup
> [Mon Nov  6 09:54:38 2017] [59874]     0 59874    47222     3116      51     
> 6044             0 pmacctd
> [Mon Nov  6 09:54:38 2017] [59875]     0 59875    47225     3665      53     
> 5494             0 pmacctd
> [Mon Nov  6 09:54:38 2017] Out of memory: Kill process 44037 (pmacctd) score 
> 846 or sacrifice child
> [Mon Nov  6 09:54:38 2017] Killed process 44037 (pmacctd) 
> total-vm:118714800kB, anon-rss:114461768kB, file-rss:104kB, shmem-rss:4kB
> 
> All 7 systems are identical in terms of config; they only receive different 
> traffic and have slightly different HW configs
> (CPU, R630 vs R720, 64G - 128G memory)
> 
> Each system runs CentOS 7.4.1708 64-bit, fully updated, dual-port Intel X520 
> 10G NIC 
> and runs a pmacctd instance for 10G NIC1, and one for 10G NIC2
> Per instance, traffic is split out over an IPv4 MySQL plugin and an IPv6 
> MySQL plugin.
> 
> Data is stored to an external MySQL (/ Percona) server
> 
> As CentOS 7.x EPEL comes with ZeroMQ 4.1, and pmacct likes >= 4.2, 
> I installed ZeroMQ 4.2.2 from the ZeroMQ yum repository.
> 
> Eager as I was, I installed PF_RING 7.0.0 (non-ZC to start with) as well, in 
> the same change, from the ntop repository
> 
> After some time running, I observed the oom-killer issue on the two machines.
> 
> 
> I suspected PF_RING at first, was running with the following config :
> 
> options pf_ring enable_tx_capture=0 quick_mode=1
> 
> Then I reduced that on those 2 machines to :
> 
> options pf_ring enable_tx_capture=0
> 
> 
> Seems to work _so far_ on 1 of them, but on the other... no change.
> 
> 
> Then I removed PF_RING completely from that system, recompiled pmacct, and 
> made sure pmacctd 
> was now linked to libpcap.so.1 again, and no longer against libpfring.so.1
> 
> 
> This morning, another crash... so it does not seem (fully) related to 
> PF_RING or its config
> 
> 
> 
> So the only other change from 1.6.1 to 1.7.0 on these machines was that 
> pmacct is now compiled with the additional option 
> "--enable-zmq"
> 
> And in the config I replaced plugin_buffer_size & plugin_pipe_size with :
> 
> plugin_pipe_zmq: true
> plugin_pipe_zmq_profile: large
> 
> 
> 
> I will probably recompile once more without ZeroMQ, and revert the config 
> change, and see how that goes.
> 
> 
> But it would be nice to get to a stable system with all features enabled, so 
> if anyone has good hints, what to check, etc..
> 
> Any help/insight is appreciated :)
> 
> 
> Regards,
> 
> Wouter
> 
> _______________________________________________
> pmacct-discussion mailing list
> http://www.pmacct.net/#mailinglists
