All the CPU cards on the system are running the same load. I saw a similar issue about 6 weeks back, but that time it was on all cards; this time it is on one card only. At the time of the failure there was very light traffic (handshakes only).
I saw the following in the log; I am not sure whether it contributes to the issue or not:

[686808.930065] ixgbe 0000:01:00.0 p19p2: initiating reset due to tx timeout
[686810.062026] INFO: rcu_sched detected stalls on CPUs/tasks:
[686813.257936] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s!

// details ...

[686797.890210] [<ffffffff816dea1f>] ret_from_fork+0x3f/0x70
[686797.895692] [<ffffffff8109cc40>] ? flush_kthread_worker+0x90/0x90
[686797.901951] Code: 00 eb 02 89 c6 f7 c6 00 ff ff ff 75 41 83 fe 01 89 ca 89 f0 41 0f 44 d0 f0 0f b1 17 39 f0 75 e3 83 fa 01 75 04 eb 0d f3 90 8b 07 <84> c0 75 f8 66 c7 07 01 00 5d c3 8b 37 81 fe 00 01 00 00 75 b6
[686798.930348] ixgbe 0000:01:00.0 p19p2: initiating reset due to tx timeout
[686803.938207] ixgbe 0000:01:00.0 p19p2: initiating reset due to tx timeout
[686808.930065] ixgbe 0000:01:00.0 p19p2: initiating reset due to tx timeout
[686810.062026] INFO: rcu_sched detected stalls on CPUs/tasks:
[686810.067613] 2-...: (2 GPs behind) idle=a27/1/0 softirq=70531357/70531357 fqs=4305315
[686810.075606] (detected by 1, t=13200382 jiffies, g=173829751, c=173829750, q=25641590)
[686810.083624] ffff880351a83e68 0000000000000018 ffffffff81591bf1 ffff880351a83ec8
[686810.091163] 00000002005932b8 0000000100000006 ffff880351a84000 ffffffff81d1ce20
[686810.098697] ffff880351a84000 ffff88035fc5d300 ffffffff81cb2c00 ffff880351a83eb8
[686810.106233] Call Trace:
[686810.108767] [<ffffffff81591bf1>] ? cpuidle_enter_state+0x91/0x200
[686810.115026] [<ffffffff81591d97>] ? cpuidle_enter+0x17/0x20
[686810.120673] [<ffffffff810bcdc7>] ? call_cpuidle+0x37/0x60
[686810.126234] [<ffffffff81591d73>] ? cpuidle_select+0x13/0x20
[686810.131978] [<ffffffff810bd001>] ? cpu_startup_entry+0x211/0x2d0
[686810.138156] [<ffffffff8103b213>] ? start_secondary+0x103/0x130
[686813.257936] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [swapper/3:0]

On Thu, May 19, 2016 at 3:06 PM, Erik Hugne <erik.hu...@gmail.com> wrote:
> On Thu, May 19, 2016 at 10:34:05AM -0400, GUNA wrote:
>> One of the cards in my system is dead and was rebooted to recover it.
>> The system is running on kernel 4.4.0 + some recent TIPC patches.
>> Your earliest feedback on the issue would be appreciated.
>>
> At first I thought this might be a spinlock contention problem.
>
> CPU2 is receiving TIPC traffic on a socket, and is trying to grab a
> spinlock in tipc_sk_rcv context (probably sk->sk_lock.slock).
> The first argument to spin_trylock_bh() is passed in RDI: ffffffffa01546cc
>
> CPU3 is sending TIPC data; tipc_node_xmit()->tipc_sk_rcv() indicates
> that it's traffic between sockets on the same machine.
> And I think it is the same socket as on CPU2, because we see the same
> address in RDI: ffffffffa01546cc
>
> But this made me unsure:
> [686798.930348] ixgbe 0000:01:00.0 p19p2: initiating reset due to tx timeout
> Is it contributing to the problem, or is it a side effect of the
> spinlock contention?
>
> Driver (or HW) bugs _are_ fatal for a network stack, but why would lock
> contention in a network stack cause NIC TX timeouts?
>
> Do all cards in your system have similar workloads?
> Do you see this on multiple cards?
>
> //E
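To make the contention pattern Erik describes concrete, below is a minimal
userspace sketch: pthread spinlocks stand in for sk->sk_lock.slock, and the
names (contend, rcv, xmit, the deliberately stuck "holder") are illustrative
assumptions, not kernel code. Incidentally, the tail of the Code: line above
(f3 90 8b 07 <84> c0 75 f8) appears to decode to a pause/load/test/branch
spin-wait loop, which would be consistent with CPU3 stuck waiting on a
spinlock.

/* Minimal userspace analogue of the suspected contention; build with
 * gcc -pthread. All names are illustrative, not kernel code. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_spinlock_t sk_lock;    /* stands in for sk->sk_lock.slock */
static volatile int spins[2];         /* retry counters, one per path    */

/* Both the receive path (CPU2) and the local xmit path (CPU3) retry
 * trylock on the same per-socket lock. If the holder never runs,
 * neither path makes progress, and in the kernel a watchdog would
 * eventually flag this as a soft lockup. */
static void *contend(void *arg)
{
        int id = *(int *)arg;

        while (pthread_spin_trylock(&sk_lock) != 0)
                spins[id]++;          /* busy retry, no backoff */

        pthread_spin_unlock(&sk_lock);
        return NULL;
}

int main(void)
{
        pthread_t rcv, xmit;
        int ids[2] = { 0, 1 };

        pthread_spin_init(&sk_lock, PTHREAD_PROCESS_PRIVATE);
        pthread_spin_lock(&sk_lock);  /* holder is "stuck": never unlocks */

        pthread_create(&rcv, NULL, contend, &ids[0]);
        pthread_create(&xmit, NULL, contend, &ids[1]);

        sleep(2);                     /* crude stand-in for the watchdog window */
        printf("rcv spun %d times, xmit spun %d times\n", spins[0], spins[1]);
        return 0;                     /* exiting tears down the spinning threads */
}

In the kernel the equivalent spinning happens with bottom halves disabled,
so a CPU stuck like this cannot service softirqs either. That would be one
plausible (though unconfirmed) link from the lock contention to the ixgbe
TX timeouts, rather than the other way around.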