Hi Partha

I can try your patch.  Is it just the one change to socket.c shown at that 
link?  I ask because that patch is named [PATCH net v1 1/6], and I am not sure 
whether the other parts (2/6, 3/6, 4/6, 5/6, 6/6) are also required for this 
particular issue.

Peter

-----Original Message-----
From: Parthasarathy Bhuvaragan [mailto:[email protected]] 
Sent: July-24-17 8:58 AM
To: Butler, Peter <[email protected]>; [email protected]
Cc: Jon Maloy <[email protected]>; Ying Xue <[email protected]>; LUU 
Duc Canh <[email protected]>
Subject: Re: TIPC connection stalling due to invalid congestion status when 
bearer 0 recovers

Hi Peter,

Have you looked through this?
https://sourceforge.net/p/tipc/mailman/message/35809792/

The symptoms you describe are identical to mine; it's worth trying my patch on 
your system.

I need to address comments from Jon.M before pushing it to net-next.

regards
Partha

On 07/21/2017 10:20 PM, Butler, Peter wrote:
> Hello,
> 
> I am using a 19-node TIPC configuration, whereby each card (node) in 
> the mesh has two Ethernet interfaces connected to two disjoint subnets 
> served by switch0 and switch1, respectively. TIPC is set to use two 
> bearers on each card.  16 of these cards are using TIPC 4.4.0 (with a 
> few patches backported from later releases as per Jon Maloy, 
> Parthasarathy Bhuvaragan, and Ying Xue).  (The other 3 cards are using 
> a much older 1.7-based TIPC, but are not actually involved in the 
> testing pertaining to the issue detailed below.)
> 
> There are applications on several of the (4.4.0-based) cards which are 
> collectively sending/receiving about 500 TIPC msg/s (i.e. in total, not each).
> 
> When I reboot switch0, I often get strange behaviour soon after the switch 
> comes back into service.  To be clear, there are no issues that appear to 
> stem from the loss of connectivity on the switch0 Ethernet fabric: while that 
> switch is rebooting (or powered off, or otherwise unavailable) the 
> applications behave fine by using the Ethernet fabric associated with 
> switch1.  However, shortly after switch0 returns to service, one or more of 
> the cards in the TIPC mesh will often then experience problems on the switch0 
> fabric.
> 
> Specifically, the sendto() calls (on the cards in question) will fail.  By 
> default, we are using a blocking sendto() call, and the associated process is 
> being put to sleep by the kernel at this line in socket.c:
> 
> static int tipc_wait_for_sndmsg(struct socket *sock, long *timeo_p) {
>     struct sock *sk = sock->sk;
>     struct tipc_sock *tsk = tipc_sk(sk);
>     DEFINE_WAIT(wait);
>     int done;
> 
>     do {
>        int err = sock_error(sk);
>        if (err)
>           return err;
>        if (sock->state == SS_DISCONNECTING)
>           return -EPIPE;
>        if (!*timeo_p)
>           return -EAGAIN;
>        if (signal_pending(current))
>           return sock_intr_errno(*timeo_p);
> 
>        prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
>        done = sk_wait_event(sk, timeo_p, !tsk->link_cong);  /* <--- sleeps here */
>        finish_wait(sk_sleep(sk), &wait);
>     } while (!done);
>     return 0;
> }
> 
> Once in this state the process never recovers, and at the very least needs to 
> be killed off and restarted, or the card rebooted.
> 
> When changing this to a non-blocking sendto() call, the process is no longer 
> put to sleep, but will forever fail the sendto() calls with -EAGAIN, and once 
> again at the very least needs to be killed off and restarted, or the card 
> rebooted.
> 
> The TIPC traffic in question is connectionless, on a SOCK_RDM socket, and 
> with destination-droppable set to false.
> 
> Note that the hardware setup I am using is essentially identical to that used 
> by Andrew Booth in his recent post "TIPC issue: connection stalls when switch 
> for bearer 0 recovers". Both issues are almost certainly related, if not 
> identical, although in each of our cases the problem was detected using 
> different application-level software.
> 
> Could it be that TIPC is erroneously flagging the link as being 
> congested and thus preventing any further traffic on it?  (Just 
> speculating, based on the line of code shown above.)
> 
> Peter Butler
> 
> 


_______________________________________________
tipc-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/tipc-discussion
