Hi,

I have been using the patches Partha provided for the nametable soft
lockup, which I had tested.  That lockup was seen when testing on an SMP system.

Unfortunately I have come across another nametable soft lockup:

<0>NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [AIS listener:1591]
<6>Modules linked in: tipc jitterentropy_rng echainiv drbg platform_driver(O) ipifwd(PO)
<6>CPU: 0 PID: 1591 Comm: AIS listener Tainted: P           O
<6>task: ae393600 ti: ae286000 task.ti: ae286000
<6>NIP: 806952bc LR: c160bfe0 CTR: 80695280
<6>REGS: ae287b40 TRAP: 0901   Tainted: P           O
<6>MSR: 00029002 <CE,EE,ME>  CR: 48002484  XER: 00000000
<6>
<6>GPR00: c160a64c ae287bf0 ae393600 a20f18ac 00000000 00000000 ae064fbc 00000030
<6>GPR08: 01001006 00000001 00000001 00000006 80695280
<6>NIP [806952bc] _raw_spin_lock_bh+0x3c/0x70
<6>LR [c160bfe0] tipc_nametbl_unsubscribe+0x50/0x120 [tipc]
<6>Call Trace:
<6>[ae287c10] [c160a64c] tipc_named_reinit+0x33c/0x8a0 [tipc]
<6>[ae287c30] [c160ad44] tipc_subscrp_report_overlap+0xc4/0xe0 [tipc]
<6>[ae287c70] [c160b30c] tipc_topsrv_stop+0x45c/0x4f0 [tipc]
<6>[ae287ca0] [c160b838] tipc_nametbl_remove_publ+0x58/0x110 [tipc]
<6>[ae287cd0] [c160bcf8] tipc_nametbl_withdraw+0x68/0x140 [tipc]
<6>[ae287d00] [c1613cd4] tipc_nl_node_dump_link+0x1904/0x45d0 [tipc]
<6>[ae287d30] [c16148e8] tipc_nl_node_dump_link+0x2518/0x45d0 [tipc]
<6>[ae287d70] [804f5a40] sock_release+0x30/0xf0
<6>[ae287d80] [804f5b14] sock_close+0x14/0x30
<6>[ae287d90] [80105844] __fput+0x94/0x200
<6>[ae287db0] [8003dca4] task_work_run+0xd4/0x100
<6>[ae287dd0] [80023620] do_exit+0x280/0x980
<6>[ae287e10] [80024c48] do_group_exit+0x48/0xb0
<6>[ae287e30] [80030344] get_signal+0x244/0x4f0
<6>[ae287e80] [80007734] do_signal+0x34/0x1c0
<6>[ae287f30] [800079a8] do_notify_resume+0x68/0x80
<6>[ae287f40] [8000fa1c] do_user_signal+0x74/0xc4


I have gone through the code and I think I have found a place where there
is a potential soft lockup.
The call chain is:
tipc_nametbl_stop() Grabs nametbl_lock
   tipc_purge_publications()
      tipc_nameseq_remove_publ()
         tipc_subscrp_report_overlap()
            tipc_subscrp_put() Calls kref_put(); the count can reach 0 here
                               if a different CPU has already put its reference
               tipc_subscrp_kref_release()
                  tipc_nametbl_unsubscribe()
                     << lockup occurs as it grabs the
                         nametbl_lock again >>
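
To make the suspected pattern concrete, here is a minimal userspace sketch. It
is not TIPC code: every toy_* name is hypothetical, and the "lock" only records
that it is held instead of spinning, so the self-deadlock shows up as a counter
rather than a hang. It assumes the release path needs the same non-recursive
lock that the purge path already holds.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a non-recursive spinlock (hypothetical, not kernel API). */
struct toy_lock {
    bool held;
    int self_deadlocks;   /* times the holder tried to take the lock again */
};

static bool toy_lock_acquire(struct toy_lock *l)
{
    if (l->held) {
        /* A real spinlock would spin here forever: the soft lockup. */
        l->self_deadlocks++;
        return false;
    }
    l->held = true;
    return true;
}

static void toy_lock_release(struct toy_lock *l)
{
    assert(l->held);
    l->held = false;
}

/* Toy subscription with a plain reference count. */
struct toy_sub {
    int refs;
    bool unsubscribed;
    struct toy_lock *table_lock;
};

/* Release callback: plays the role of tipc_nametbl_unsubscribe(). */
static void toy_sub_release(struct toy_sub *s)
{
    if (!toy_lock_acquire(s->table_lock))
        return;           /* lock already held by this same call chain */
    s->unsubscribed = true;
    toy_lock_release(s->table_lock);
}

static void toy_sub_put(struct toy_sub *s)
{
    assert(s->refs > 0);
    if (--s->refs == 0)
        toy_sub_release(s);
}

/* Plays the role of tipc_nametbl_stop(): report overlap under the lock.
 * other_cpu_put simulates another CPU dropping its reference meanwhile.
 * Returns the number of self-deadlocks detected so far. */
static int toy_purge(struct toy_sub *s, bool other_cpu_put)
{
    bool locked = toy_lock_acquire(s->table_lock);
    assert(locked);
    s->refs++;            /* the "get" before reporting */
    if (other_cpu_put)
        s->refs--;        /* concurrent put from elsewhere */
    toy_sub_put(s);       /* may be the last put, still under the lock */
    toy_lock_release(s->table_lock);
    return s->table_lock->self_deadlocks;
}
```

Calling toy_purge() on a subscription that starts with one reference reports
no self-deadlock when other_cpu_put is false, and one when it is true, because
the final put then runs the release callback while the lock is still held.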


Another possible issue is in tipc_subscrp_report_overlap(): there are 2
early returns after the tipc_subscrp_get() and before the tipc_subscrp_put().
Could this leave the kref count incorrect?
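
As an illustration of that concern only (a hypothetical userspace sketch, not
the actual tipc_subscrp_report_overlap() source; all toy_* and report_* names
are made up), an early return between a get and its put leaves one reference
too many, while routing every exit through a single label keeps the count
balanced:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy reference counter (hypothetical, stands in for a kref). */
struct toy_ref {
    int count;
};

static void toy_get(struct toy_ref *r)
{
    r->count++;
}

static void toy_put(struct toy_ref *r)
{
    assert(r->count > 0);
    r->count--;
}

/* Unbalanced variant: the early return skips the put. */
static void report_leaky(struct toy_ref *r, bool overlaps)
{
    toy_get(r);
    if (!overlaps)
        return;           /* reference taken above is never dropped */
    /* ... deliver the event ... */
    toy_put(r);
}

/* Balanced variant: every exit path goes through the put. */
static void report_balanced(struct toy_ref *r, bool overlaps)
{
    toy_get(r);
    if (!overlaps)
        goto out;
    /* ... deliver the event ... */
out:
    toy_put(r);
}
```

With a counter that starts at 1, report_leaky(&r, false) leaves it at 2,
whereas report_balanced(&r, false) leaves it unchanged.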

JT
_______________________________________________
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion
