Hi Ying,

In answer to your questions:
(1) When the first failed bearer recovers, I usually see one or two nodes
in the TIPC mesh experience the problem.

(2) To my knowledge, we are sending almost exclusively unicast messages.

-----Original Message-----
From: Ying Xue [mailto:ying....@windriver.com]
Sent: July-24-17 9:00 AM
To: Butler, Peter <pbut...@sonusnet.com>; tipc-discussion@lists.sourceforge.net
Cc: Jon Maloy <jon.ma...@ericsson.com>; Parthasarathy Bhuvaragan <parthasarathy.bhuvara...@ericsson.com>; LUU Duc Canh <canh.d....@dektech.com.au>
Subject: Re: TIPC connection stalling due to invalid congestion status when bearer 0 recovers

Hi Peter,

Thank you for the thorough description of the issue you hit. My first
impression is that this is a TIPC bug, and the same one that Andrew Booth
reported. I suspect a bug in link failover is behind it, particularly
since we overhauled the link state machine at one point.

In addition, I have two questions:

1. When the first failed bearer recovers, does the issue happen
consistently?

2. Have you checked whether the issue occurs when you send unicast
messages rather than broadcast messages?

Thanks,
Ying

On 07/22/2017 04:20 AM, Butler, Peter wrote:
> Hello,
>
> I am using a 19-node TIPC configuration, whereby each card (node) in
> the mesh has two Ethernet interfaces connected to two disjoint subnets
> served by switch0 and switch1, respectively. TIPC is set to use two
> bearers on each card. 16 of these cards are using TIPC 4.4.0 (with a
> few patches backported from later releases as per Jon Maloy,
> Parthasarathy Bhuvaragan, and Ying Xue). (The other 3 cards are using
> a much older 1.7-based TIPC, but are not actually involved in the
> testing pertaining to the issue detailed below.)
>
> There are applications on several of the (4.4.0-based) cards which are
> collectively sending/receiving about 500 TIPC msg/s (i.e. in total,
> not per card).
>
> When I reboot switch0, I often get strange behaviour soon after the
> switch comes back into service. To be clear, there are no issues that
> appear to stem from the loss of connectivity on the switch0 Ethernet
> fabric: while that switch is rebooting (or powered off, or otherwise
> unavailable) the applications behave fine by using the Ethernet fabric
> associated with switch1. However, shortly after switch0 returns to
> service, one or more of the cards in the TIPC mesh will often then
> experience problems on the switch0 fabric.
>
> Specifically, the sendto() calls (on the cards in question) will fail.
> By default, we are using a blocking sendto() call, and the associated
> process is put to sleep by the kernel at the line marked below in
> socket.c:
>
> static int tipc_wait_for_sndmsg(struct socket *sock, long *timeo_p)
> {
>         struct sock *sk = sock->sk;
>         struct tipc_sock *tsk = tipc_sk(sk);
>         DEFINE_WAIT(wait);
>         int done;
>
>         do {
>                 int err = sock_error(sk);
>                 if (err)
>                         return err;
>                 if (sock->state == SS_DISCONNECTING)
>                         return -EPIPE;
>                 if (!*timeo_p)
>                         return -EAGAIN;
>                 if (signal_pending(current))
>                         return sock_intr_errno(*timeo_p);
>
>                 prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
>                 done = sk_wait_event(sk, timeo_p, !tsk->link_cong);  <---------------------
>                 finish_wait(sk_sleep(sk), &wait);
>         } while (!done);
>         return 0;
> }
>
> Once in this state the process never recovers, and at the very least
> needs to be killed off and restarted, or the card rebooted.
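>
> For reference, here is a minimal user-space sketch of the blocking send
> that ends up parked in tipc_wait_for_sndmsg(). The service type 18888
> and the payload are hypothetical; the sketch only assumes the standard
> AF_TIPC SOCK_RDM API:
>
> #include <stdio.h>
> #include <string.h>
> #include <unistd.h>
> #include <sys/socket.h>
> #include <linux/tipc.h>
>
> int main(void)
> {
>         struct sockaddr_tipc dst;
>         char buf[] = "payload";
>         int sd = socket(AF_TIPC, SOCK_RDM, 0);
>
>         if (sd < 0) {
>                 perror("socket");
>                 return 1;
>         }
>
>         memset(&dst, 0, sizeof(dst));
>         dst.family = AF_TIPC;
>         dst.addrtype = TIPC_ADDR_NAME;
>         dst.addr.name.name.type = 18888;   /* hypothetical service type */
>         dst.addr.name.name.instance = 0;
>         dst.addr.name.domain = 0;          /* cluster-wide lookup */
>
>         /* Blocking send: on an affected node this call never returns,
>          * apparently because tipc_wait_for_sndmsg() keeps sleeping on
>          * a tsk->link_cong flag that is never cleared. */
>         if (sendto(sd, buf, sizeof(buf), 0,
>                    (struct sockaddr *)&dst, sizeof(dst)) < 0)
>                 perror("sendto");
>
>         close(sd);
>         return 0;
> }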
>
> When changing this to a non-blocking sendto() call, the process is no
> longer put to sleep, but will forever fail the sendto() calls with
> -EAGAIN, and once again at the very least needs to be killed off and
> restarted, or the card rebooted.
>
> The TIPC traffic in question is connectionless, on a SOCK_RDM socket,
> and with destination-droppable set to false. (A sketch of this socket
> setup follows at the end of this message.)
>
> Note that the hardware setup I am using is essentially identical to
> that used by Andrew Booth in his recent post "TIPC issue: connection
> stalls when switch for bearer 0 recovers" - both issues are almost
> certainly related, if not identical, although in each of our cases the
> problem was detected using different application-level software.
>
> Could it be that TIPC is erroneously flagging the link as being
> congested and thus preventing any further traffic on it? (Just
> speculating, based on the line of code shown above.)
>
> Peter Butler
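>
> As promised above, here is a sketch of the socket setup, combined with
> the non-blocking variant of the test. The service type is again
> hypothetical; the sketch assumes the standard AF_TIPC API and the
> TIPC_DEST_DROPPABLE socket option from linux/tipc.h:
>
> #include <errno.h>
> #include <fcntl.h>
> #include <stdio.h>
> #include <string.h>
> #include <unistd.h>
> #include <sys/socket.h>
> #include <linux/tipc.h>
>
> #ifndef SOL_TIPC
> #define SOL_TIPC 271               /* TIPC socket option level */
> #endif
>
> int main(void)
> {
>         struct sockaddr_tipc dst;
>         char buf[] = "payload";
>         int off = 0;
>         int sd = socket(AF_TIPC, SOCK_RDM, 0);
>
>         if (sd < 0) {
>                 perror("socket");
>                 return 1;
>         }
>
>         /* destination-droppable = false, as in the failing setup */
>         if (setsockopt(sd, SOL_TIPC, TIPC_DEST_DROPPABLE,
>                        &off, sizeof(off)) < 0)
>                 perror("setsockopt(TIPC_DEST_DROPPABLE)");
>
>         /* non-blocking variant of the same test */
>         fcntl(sd, F_SETFL, fcntl(sd, F_GETFL, 0) | O_NONBLOCK);
>
>         memset(&dst, 0, sizeof(dst));
>         dst.family = AF_TIPC;
>         dst.addrtype = TIPC_ADDR_NAME;
>         dst.addr.name.name.type = 18888;   /* hypothetical service type */
>         dst.addr.name.name.instance = 0;
>
>         for (;;) {
>                 if (sendto(sd, buf, sizeof(buf), 0,
>                            (struct sockaddr *)&dst, sizeof(dst)) >= 0)
>                         break;             /* send went through */
>                 if (errno != EAGAIN) {
>                         perror("sendto");  /* a real error */
>                         break;
>                 }
>                 /* On an affected node this branch repeats forever:
>                  * the stale congestion status never clears. */
>                 usleep(100000);
>         }
>
>         close(sd);
>         return 0;
> }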