[tipc-discussion] TIPC connection stalling due to invalid congestion status when bearer 0 recovers

Butler, Peter Fri, 21 Jul 2017 13:37:51 -0700

Hello,

I am using a 19-node TIPC configuration, whereby each card (node) in the mesh 
has two Ethernet interfaces connected to two disjoint subnets served by switch0 
and switch1, respectively. TIPC is set to use two bearers on each card.  16 of 
these cards are using TIPC 4.4.0 (with a few patches backported from later 
releases as per Jon Maloy, Parthasarathy Bhuvaragan, and Ying Xue).  (The 
other 3 cards are using a much older 1.7-based TIPC, but are not actually 
involved in the testing pertaining to the issue detailed below.)


There are applications on several of the (4.4.0-based) cards which are 
collectively sending/receiving about 500 TIPC msg/s (i.e. in total, not each). 

When I reboot switch0, I often get strange behaviour soon after the switch 
comes back into service.  To be clear, there are no issues that appear to stem 
from the loss of connectivity on the switch0 Ethernet fabric: while that switch 
is rebooting (or powered off, or otherwise unavailable) the applications behave 
fine by using the Ethernet fabric associated with switch1.  However, shortly 
after switch0 returns to service, one or more of the cards in the TIPC mesh 
will often then experience problems on the switch0 fabric.

Specifically, the sendto() calls (on the cards in question) stop completing.  
By default, we are using a blocking sendto() call, and the associated process 
is put to sleep by the kernel at the marked line of tipc_wait_for_sndmsg() in 
socket.c:

static int tipc_wait_for_sndmsg(struct socket *sock, long *timeo_p)
{
   struct sock *sk = sock->sk;
   struct tipc_sock *tsk = tipc_sk(sk);
   DEFINE_WAIT(wait);
   int done;

   do {
      int err = sock_error(sk);
      if (err)
         return err;
      if (sock->state == SS_DISCONNECTING)
         return -EPIPE;
      if (!*timeo_p)
         return -EAGAIN;
      if (signal_pending(current))
         return sock_intr_errno(*timeo_p);

      prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
      done = sk_wait_event(sk, timeo_p, !tsk->link_cong);   /* <--- process sleeps here */
      finish_wait(sk_sleep(sk), &wait);
   } while (!done);
   return 0;
}

Once in this state the process never recovers, and at the very least needs to 
be killed off and restarted, or the card rebooted.

When changing this to a non-blocking sendto() call, the process is no longer 
put to sleep, but will forever fail the sendto() calls with -EAGAIN, and once 
again at the very least needs to be killed off and restarted, or the card 
rebooted.
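
For illustration, here is a minimal sketch of how a sender could bound its 
non-blocking retries, so that a stall like this is at least detected and 
reported instead of retried forever.  The helper name send_with_deadline(), 
the 10 ms retry interval, and the -ETIMEDOUT return convention are assumptions 
made for this example; they are not part of TIPC or of the application 
described here.  The helper works on any datagram-style socket, so it does not 
depend on the TIPC headers.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>

/* Monotonic clock reading in milliseconds. */
static long long now_ms(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (long long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/*
 * Hypothetical helper (not from the original post): retry a non-blocking
 * sendto() until it succeeds or timeout_ms has elapsed.
 *
 * Returns 0 on success, -ETIMEDOUT if every attempt up to the deadline
 * failed with EAGAIN/EWOULDBLOCK, or -errno for any other error.
 */
static int send_with_deadline(int fd, const void *buf, size_t len,
                              const struct sockaddr *dst, socklen_t dstlen,
                              int timeout_ms)
{
    const struct timespec pause = { 0, 10 * 1000 * 1000 }; /* 10 ms, arbitrary */
    long long deadline = now_ms() + timeout_ms;

    assert(buf != NULL);

    for (;;) {
        if (sendto(fd, buf, len, MSG_DONTWAIT, dst, dstlen) >= 0)
            return 0;
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            return -errno;          /* a real error, not congestion */
        if (now_ms() >= deadline)
            return -ETIMEDOUT;      /* still would-block: report the stall */
        nanosleep(&pause, NULL);
    }
}
```

A caller that gets -ETIMEDOUT back could log the event and the destination 
address, which would help correlate the stall with the switch0 recovery.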

The TIPC traffic in question is connectionless, on a SOCK_RDM socket, and with 
destination-droppable set to false.
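
For readers who want to reproduce the socket setup described above, the 
following is a rough sketch and not the actual application code.  It assumes 
the Linux <linux/tipc.h> header; the function names and the name type/instance 
values passed in by the caller are placeholders chosen for this example.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/tipc.h>

/*
 * Build a TIPC name address for use as a sendto() destination.
 * The type and instance values are chosen by the caller; name types
 * below TIPC_RESERVED_TYPES are reserved, hence the assert.
 */
static struct sockaddr_tipc make_name_addr(unsigned int type,
                                           unsigned int instance)
{
    struct sockaddr_tipc a;

    assert(type >= TIPC_RESERVED_TYPES);

    memset(&a, 0, sizeof a);
    a.family = AF_TIPC;
    a.addrtype = TIPC_ADDR_NAME;
    a.addr.name.name.type = type;
    a.addr.name.name.instance = instance;
    a.addr.name.domain = 0;         /* 0 = no restriction on lookup domain */
    return a;
}

/*
 * Open a connectionless SOCK_RDM socket and clear the dest-droppable
 * option.  Returns the socket descriptor, or -1 on failure (for example
 * when the kernel has no TIPC support loaded).
 */
static int open_rdm_nondroppable(void)
{
    int droppable = 0;              /* false: do not silently drop */
    int fd = socket(AF_TIPC, SOCK_RDM, 0);

    if (fd < 0)
        return -1;
    if (setsockopt(fd, SOL_TIPC, TIPC_DEST_DROPPABLE,
                   &droppable, sizeof droppable) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

The address returned by make_name_addr() would be passed, cast to 
struct sockaddr *, as the destination argument of sendto() on the descriptor 
returned by open_rdm_nondroppable().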

Note that the hardware setup I am using is essentially identical to that used 
by Andrew Booth in his recent post "TIPC issue: connection stalls when switch 
for bearer 0 recovers" - both issues are almost certainly related, if not 
identical, although in each case the problem was detected using different 
application-level software.

Could it be that TIPC is erroneously flagging the link as being congested and 
thus preventing any further traffic on it?  (Just speculating, based on the 
line of code shown above.)

Peter Butler


