> -----Original Message-----
> From: GUNA [mailto:gbala...@gmail.com]
> Sent: Friday, 20 May, 2016 11:04
> To: Erik Hugne
> Cc: Richard Alpe; Ying Xue; Parthasarathy Bhuvaragan; Jon Maloy; tipc-
> discuss...@lists.sourceforge.net
> Subject: Re: [tipc-discussion] tipc_sk_rcv: Kernel panic on one of the card 
> on 4.4.0
> 
> Thanks Erik for your quick analysis.
> If it is not a known issue, is there an expert available to
> investigate further why this lockup happens? Otherwise, let me know
> the patch or fix information.
> 
> // Guna
> 
> On Fri, May 20, 2016 at 1:19 AM, Erik Hugne <erik.hu...@gmail.com> wrote:
> > A little more awake now. Didn't see this yesterday.
> > Look at the trace from CPU2 in Guna's initial mail.
> >
> > TIPC is recursing into the receive loop a second time, and freezes when it
> > tries to take slock a second time. This is done in a timer CB, and the softirq
> > lockup detector kicks in after ~20s.

This happens because the socket timer does not set the "owned_by_user" flag
like ordinary users do; it just grabs slock as a spinlock.
I wonder if we could let tipc_sk_timeout() use lock_sock()/release_sock() here,
although it is not in user context?
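
For anyone not deep in the socket locking scheme, here is a minimal, hypothetical
sketch of the two styles contrasted above (it is NOT the real tipc_sk_timeout();
the function name my_sk_timeout() is invented). A timer callback runs in softirq
context and normally takes only the spinlock half of the socket lock via
bh_lock_sock(), while process context uses lock_sock(), which also sets
sk->sk_lock.owned (the "owned_by_user" flag) but may sleep, so it cannot simply
be called from a timer:

/*
 * Hypothetical sketch only -- my_sk_timeout() is an invented name and
 * not TIPC code.  It shows the common softirq-side pattern: spin on
 * slock, check the ownership flag that lock_sock() sets, and defer if
 * a process context currently owns the socket.
 */
#include <net/sock.h>

static void my_sk_timeout(unsigned long data)
{
	struct sock *sk = (struct sock *)data;

	bh_lock_sock(sk);		/* spins on sk->sk_lock.slock only */
	if (sock_owned_by_user(sk)) {
		/* lock_sock() set sk->sk_lock.owned in process context;
		 * back off and retry shortly instead of touching socket
		 * state now. */
		sk_reset_timer(sk, &sk->sk_timer, jiffies + (HZ / 20));
	} else {
		/* Safe to inspect/modify socket state here, but anything
		 * that can loop back into the receive path on this same
		 * socket should be deferred until after bh_unlock_sock(). */
	}
	bh_unlock_sock(sk);
	sock_put(sk);
}

/*
 * lock_sock()/release_sock() would set the ownership flag, but
 * lock_sock() may sleep, which is why it cannot be used directly from
 * a timer callback running in softirq context.
 */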

This is most certainly one of the background probe messages that is sent once 
an hour between connected sockets, to verify that the peer socket is still 
there.
In this case the peer socket has disappeared, but then it remains to understand 
why the sender has not already received an abort message when that happened. 
Was the connection never set up completely? Is it just bad timing, so that the abort
message is already waiting in the sender's backlog queue? Or is the abort
mechanism broken somehow?
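
As a rough illustration of that probe mechanism (and nothing more -- the
structure, helpers and names below are invented for this sketch and are not
TIPC's), the decision such a timer has to make looks roughly like this:

#include <stdbool.h>
#include <stdio.h>

/* Invented connection state -- not TIPC's real data structures. */
struct conn {
	bool connected;
	bool probe_unanswered;		/* previous probe got no reply       */
	unsigned long probe_interval;	/* roughly one hour in TIPC's case   */
};

/* Stand-ins for "send a message to the peer" and "rearm the timer". */
static void send_probe(struct conn *c)  { (void)c; printf("probe sent\n"); }
static void send_abort(struct conn *c)  { (void)c; printf("abort sent\n"); }
static void rearm_timer(struct conn *c) { (void)c; printf("timer rearmed\n"); }

/* Fires once per probe interval on an otherwise idle connection. */
static void conn_probe_timer(struct conn *c)
{
	if (!c->connected)
		return;

	if (c->probe_unanswered) {
		/* The peer never answered the previous probe: assume it is
		 * gone and tear the connection down with an abort. */
		send_abort(c);
		c->connected = false;
		return;
	}

	/* Ask the peer to prove it is still there; a reply arriving before
	 * the next expiry would clear probe_unanswered again. */
	c->probe_unanswered = true;
	send_probe(c);
	rearm_timer(c);
}

int main(void)
{
	struct conn c = { .connected = true, .probe_unanswered = false,
			  .probe_interval = 3600 };

	conn_probe_timer(&c);	/* first expiry: sends a probe          */
	conn_probe_timer(&c);	/* no reply arrived: aborts connection  */
	return 0;
}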

///jon



> >
> > //E
> >
> > [686797.257426]  <IRQ>
> >
> > [686797.257426]  [<ffffffff816de340>] _raw_spin_trylock_bh+0x40/0x50
> >
> > [686797.257430]  [<ffffffffa01546cc>] tipc_sk_rcv+0xbc/0x490 [tipc]
> >
> > [686797.257432]  [<ffffffff8162ecde>] ? tcp_rcv_established+0x40e/0x760
> >
> > [686797.257435]  [<ffffffffa014f58f>] tipc_node_xmit+0x11f/0x150 [tipc]
> >
> > [686797.257437]  [<ffffffff810b5353>] ? find_busiest_group+0x153/0x980
> >
> > [686797.257441]  [<ffffffffa014f5f7>] tipc_node_xmit_skb+0x37/0x60 [tipc]
> >
> > [686797.257444]  [<ffffffffa0151cb9>] tipc_sk_respond+0x99/0xc0 [tipc]
> >
> > [686797.257447]  [<ffffffffa0152e6d>] filter_rcv+0x4cd/0x550 [tipc]
> >
> > [686797.257451]  [<ffffffffa01548ed>] tipc_sk_rcv+0x2dd/0x490 [tipc]
> >
> > [686797.257454]  [<ffffffffa014f58f>] tipc_node_xmit+0x11f/0x150 [tipc]
> >
> > [686797.257458]  [<ffffffffa0152690>] ? tipc_recv_stream+0x370/0x370 [tipc]
> >
> > [686797.257461]  [<ffffffffa014f5f7>] tipc_node_xmit_skb+0x37/0x60 [tipc]
> >
> > [686797.257464]  [<ffffffffa0152770>] tipc_sk_timeout+0xe0/0x180 [tipc]
> >
> > On May 19, 2016 21:37, "GUNA" <gbala...@gmail.com> wrote:
> >
> > All the CPU cards on the system are running the same load. I saw a similar
> > issue about 6 weeks back, but this time it is seen on only one card, compared
> > to all cards last time. At the time, there was very light traffic
> > (handshake).
> >
> > I had seen the following as part of the log; I am not sure whether it
> > contributes to the issue or not:
> >
> > [686808.930065] ixgbe 0000:01:00.0 p19p2: initiating reset due to tx timeout
> > [686810.062026] INFO: rcu_sched detected stalls on CPUs/tasks:
> > [686813.257936] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s!
> >
> > // details
> > ...
> > [686797.890210]  [<ffffffff816dea1f>] ret_from_fork+0x3f/0x70
> > [686797.895692]  [<ffffffff8109cc40>] ? flush_kthread_worker+0x90/0x90
> > [686797.901951] Code: 00 eb 02 89 c6 f7 c6 00 ff ff ff 75 41 83 fe 01
> > 89 ca 89 f0 41 0f 44 d0 f0 0f b1 17 39 f0 75 e3 83 fa 01 75 04 eb 0d
> > f3 90 8b 07 <84> c0 75 f8 66 c7 07 01 00 5d c3 8b 37 81 fe 00 01 00 00
> > 75 b6
> > [686798.930348] ixgbe 0000:01:00.0 p19p2: initiating reset due to tx timeout
> > [686803.938207] ixgbe 0000:01:00.0 p19p2: initiating reset due to tx timeout
> > [686808.930065] ixgbe 0000:01:00.0 p19p2: initiating reset due to tx timeout
> > [686810.062026] INFO: rcu_sched detected stalls on CPUs/tasks:
> > [686810.067613]         2-...: (2 GPs behind) idle=a27/1/0
> > softirq=70531357/70531357 fqs=4305315
> > [686810.075606]         (detected by 1, t=13200382 jiffies,
> > g=173829751, c=173829750, q=25641590)
> > [686810.083624]  ffff880351a83e68 0000000000000018 ffffffff81591bf1
> > ffff880351a83ec8
> > [686810.091163]  00000002005932b8 0000000100000006 ffff880351a84000
> > ffffffff81d1ce20
> > [686810.098697]  ffff880351a84000 ffff88035fc5d300 ffffffff81cb2c00
> > ffff880351a83eb8
> > [686810.106233] Call Trace:
> > [686810.108767]  [<ffffffff81591bf1>] ? cpuidle_enter_state+0x91/0x200
> > [686810.115026]  [<ffffffff81591d97>] ? cpuidle_enter+0x17/0x20
> > [686810.120673]  [<ffffffff810bcdc7>] ? call_cpuidle+0x37/0x60
> > [686810.126234]  [<ffffffff81591d73>] ? cpuidle_select+0x13/0x20
> > [686810.131978]  [<ffffffff810bd001>] ? cpu_startup_entry+0x211/0x2d0
> > [686810.138156]  [<ffffffff8103b213>] ? start_secondary+0x103/0x130
> > [686813.257936] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s!
> > [swapper/3:0]
> >
> > On Thu, May 19, 2016 at 3:06 PM, Erik Hugne <erik.hu...@gmail.com> wrote:
> >> On Thu, May 19, 2016 at 10:34:05AM -0400, GUNA wrote:
> >>> One of the cards in my system is dead and was rebooted to recover it.
> >>> The system is running on Kernel 4.4.0 plus some of the latest TIPC patches.
> >>> Your earliest feedback on the issue would be appreciated.
> >>>
> >> At first I thought this might be a spinlock contention problem.
> >>
> >> CPU2 is receiving TIPC traffic on a socket, and is trying to grab a
> >> spinlock in tipc_sk_rcv context (probably sk->sk_lock.slock)
> >> First argument to spin_trylock_bh() is passed in RDI: ffffffffa01546cc
> >>
> >> CPU3 is sending TIPC data, tipc_node_xmit()->tipc_sk_rcv() indicates
> >> that it's traffic between sockets on the same machine.
> >> And I think this is the same socket as on CPU2, because we see the same
> >> address in RDI: ffffffffa01546cc
> >>
> >> But this made me unsure:
> >> [686798.930348] ixgbe 0000:01:00.0 p19p2: initiating reset due to tx timeout
> >> Is it contributing to the problem, or is it a side effect of a spinlock
> >> contention?
> >>
> >> Driver (or HW) bugs _are_ fatal for a network stack, but why would lock
> >> contention in a network stack cause NIC TX timeouts?
> >>
> >> Do all the cards in your system have similar workloads?
> >> Do you see this on multiple cards?
> >>
> >> //E