Hi Peter,
This is a known bug, fixed in commit d2f394dc4816 ("tipc: fix random link
resets while adding a second bearer") from Partha Bhuvaragan, which is
present in Linux 4.8.
Is there any possibility for you to upgrade your kernel?
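
If it helps, a quick way to check which nodes are still on a pre-4.8
kernel would be something along these lines (just a sketch, assuming
Python is available on the nodes; it will not notice a fix that has been
backported into a distro kernel):

    # Rough check: is the running kernel at least 4.8, where the fix
    # above landed? (Sketch only; distro kernels may carry backports.)
    import platform

    release = platform.release()              # e.g. "4.4.0-57-generic"
    major, minor = (int(x) for x in release.split(".")[:2])
    if (major, minor) >= (4, 8):
        print("kernel %s should already contain the fix" % release)
    else:
        print("kernel %s predates 4.8 - consider upgrading" % release)
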
BR
///jon
> -----Original Message-----
> From: Butler, Peter [mailto:[email protected]]
> Sent: Friday, 09 December, 2016 09:31
> To: [email protected]
> Subject: [tipc-discussion] reproducible link failure scenario
>
> I have a reproducible failure scenario that results in the following kernel
> messages being printed in succession (along with the associated link failing):
>
> Dec 8 12:10:33 [SEQ 617259] lab236slot6 kernel: [44856.752261] Retransmission failure on link <1.1.6:p19p1-1.1.8:p19p1>
> Dec 8 12:10:33 [SEQ 617260] lab236slot6 kernel: [44856.758633] Resetting link Link <1.1.6:p19p1-1.1.8:p19p1> state e
> Dec 8 12:10:33 [SEQ 617261] lab236slot6 kernel: [44856.758635] XMTQ: 3 [2-4], BKLGQ: 0, SNDNX: 5, RCVNX: 4
> Dec 8 12:10:33 [SEQ 617262] lab236slot6 kernel: [44856.758637] Failed msg: usr 10, typ 0, len 1540, err 0
> Dec 8 12:10:33 [SEQ 617263] lab236slot6 kernel: [44856.758638] sqno 2, prev: 1001006, src: 1001006
>
> The issue occurs within 30 seconds after any node in the cluster is
> rebooted. There are two 10Gb Ethernet fabrics in the cluster, so every
> node has two links to every other node. When the failure occurs, it is
> only ever one of the two links that fails (although it appears to be
> random which of the two it will be on a boot-to-boot basis).
>
> Important: links only fail to a common node in the mesh. While all nodes
> in the mesh are running the same kernel (Linux 4.4.0), the common node is
> the only one that is also running DRBD. Actually, there are two nodes
> running DRBD, but at any given time only one of the two is the 'active'
> DRBD manager, so to speak, as they use a shared IP for the DRBD
> functionality, much akin to the HA-Linux heartbeat. Again, the failure
> only ever occurs on TIPC links to the active DRBD node, as the other one
> is invisible (insofar as DRBD is concerned) as a stand-by.
>
> So it would appear (on the surface at least) that there is some conflict
> between DRBD and TIPC when both run within the mesh.
>
> This failure scenario is 100% reproducible and only takes the time of a
> reboot + 30 seconds to trigger. It should be noted that the issue is only
> triggered if a node is rebooted after the DRBD node is already up and
> running. In other words, if the DRBD node is rebooted *after* all other
> nodes are up and running, the link failures to the other nodes do not
> occur (unless, of course, one or more of those nodes is then subsequently
> rebooted, in which case those nodes will experience a link failure once
> up and running).
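>
> A crude way to spot the failure automatically after a reboot would be
> something like the sketch below (just an illustration; it simply polls
> dmesg for a minute or so, looking for the retransmission failure message
> shown above):
>
>     # Poll dmesg for up to 60 seconds and flag the TIPC
>     # "Retransmission failure on link" message shown above (sketch).
>     import subprocess, time
>
>     deadline = time.time() + 60
>     while time.time() < deadline:
>         dmesg = subprocess.check_output(["dmesg"]).decode(errors="replace")
>         if "Retransmission failure on link" in dmesg:
>             print("TIPC link reset detected")
>             break
>         time.sleep(5)
>     else:
>         print("no retransmission failure seen within the window")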
>
> One other useful piece of information: when the TIPC link fails (again,
> it is only ever one of the two TIPC links to a node that fails), it can
> be recovered by manually 'bouncing' the bearer on the DRBD card (i.e.
> disabling the bearer followed by enabling it again). However, the
> interesting point here is that if the link on fabric A is the one that
> failed, it is the fabric B bearer that must be 'bounced' to fix the link
> on fabric A. Sounds like something to do with the DRBD shared address
> scheme...
>
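> The bounce itself is just a disable/enable of the bearer; with the
> iproute2 'tipc' tool it would look roughly like the sketch below (device
> name taken from the link name in the log above; adjust to whichever
> fabric's bearer needs bouncing):
>
>     # Bounce a TIPC Ethernet bearer by disabling and re-enabling it
>     # via the iproute2 'tipc' tool (sketch; device name is an example).
>     import subprocess, time
>
>     device = "p19p1"   # interface carrying the bearer to bounce
>     subprocess.check_call(["tipc", "bearer", "disable",
>                            "media", "eth", "device", device])
>     time.sleep(1)
>     subprocess.check_call(["tipc", "bearer", "enable",
>                            "media", "eth", "device", device])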