Hello,

I am using a 19-node TIPC configuration, whereby each card (node) in the mesh has two Ethernet interfaces connected to two disjoint subnets served by switch0 and switch1, respectively. TIPC is set to use two bearers on each card. 16 of these cards are using TIPC 4.4.0 (with a few patches backported from later releases as per Jon Maloy, Parthasarathy Bhuvaragan, and Ying Xue). (The other 3 cards are using a much older 1.7-based TIPC, but are not actually involved in the testing pertaining to the issue detailed below.)
There are applications on several of the (4.4.0-based) cards which are collectively sending/receiving about 500 TIPC msg/s (i.e. in total, not each). When I reboot switch0, I often get strange behaviour soon after the switch comes back into service.

To be clear, there are no issues that appear to stem from the loss of connectivity on the switch0 Ethernet fabric: while that switch is rebooting (or powered off, or otherwise unavailable) the applications behave fine by using the Ethernet fabric associated with switch1. However, shortly after switch0 returns to service, one or more of the cards in the TIPC mesh will often then experience problems on the switch0 fabric. Specifically, the sendto() calls (on the cards in question) will fail.

By default, we are using a blocking sendto() call, and the associated process is being put to sleep by the kernel at this line in socket.c:

    static int tipc_wait_for_sndmsg(struct socket *sock, long *timeo_p)
    {
        struct sock *sk = sock->sk;
        struct tipc_sock *tsk = tipc_sk(sk);
        DEFINE_WAIT(wait);
        int done;

        do {
            int err = sock_error(sk);
            if (err)
                return err;
            if (sock->state == SS_DISCONNECTING)
                return -EPIPE;
            if (!*timeo_p)
                return -EAGAIN;
            if (signal_pending(current))
                return sock_intr_errno(*timeo_p);

            prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
            done = sk_wait_event(sk, timeo_p, !tsk->link_cong);   <---------------------
            finish_wait(sk_sleep(sk), &wait);
        } while (!done);
        return 0;
    }

Once in this state the process never recovers, and at the very least needs to be killed off and restarted, or the card rebooted.

When changing this to a non-blocking sendto() call, the process is no longer put to sleep, but will forever fail the sendto() calls with -EAGAIN, and once again at the very least needs to be killed off and restarted, or the card rebooted.

The TIPC traffic in question is connectionless, on a SOCK_RDM socket, and with destination-droppable set to false.
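For anyone trying to reproduce this without wedging the sending process: the quoted loop returns -EAGAIN once *timeo_p reaches zero, so a finite send timeout should bound the sleep. The following is only a sketch under that assumption (that SO_SNDTIMEO is what feeds timeo_p on a TIPC socket); it has not been verified on the setup described here. The helper names set_send_timeout() and send_bounded() are hypothetical, and nothing in the code is TIPC-specific. It does not address the apparent stuck link_cong flag; it only lets the application notice the stall and react.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>

/* Result codes for send_bounded() (hypothetical names). */
enum { SEND_OK = 0, SEND_TIMED_OUT = 1, SEND_FAILED = -1 };

/*
 * Hypothetical helper: limit how long a blocking send on fd may sleep.
 * Assumption: the socket's send path honours SO_SNDTIMEO.
 */
static int set_send_timeout(int fd, long timeout_ms)
{
    struct timeval tv;

    assert(timeout_ms >= 0);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;
    return setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv));
}

/*
 * Hypothetical helper: blocking sendto() that reports a timeout
 * separately from other errors, so the caller can log the stall or
 * recreate the socket instead of sleeping indefinitely.
 * Note: a retry after EINTR starts a fresh timeout period.
 */
static int send_bounded(int fd, const void *buf, size_t len,
                        const struct sockaddr *dst, socklen_t dstlen)
{
    ssize_t n;

    do {
        n = sendto(fd, buf, len, 0, dst, dstlen);
    } while (n < 0 && errno == EINTR);

    if (n >= 0)
        return SEND_OK;
    if (errno == EAGAIN || errno == EWOULDBLOCK)
        return SEND_TIMED_OUT;
    return SEND_FAILED;
}
```

In use, the application would call set_send_timeout() once after creating the SOCK_RDM socket, then treat a run of SEND_TIMED_OUT results as the signature of the stall described above.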
Note that the hardware setup I am using is essentially identical to that used by Andrew Booth in his recent post "TIPC issue: connection stalls when switch for bearer 0 recovers". Both issues are almost certainly related, if not identical, although in each of our cases the problem was detected using different application-level software.

Could it be that TIPC is erroneously flagging the link as being congested and thus preventing any further traffic on it? (Just speculating, based on the line of code shown above.)

Peter Butler