Hi Jon, My comments are below.
JT On Wed, Sep 7, 2016 at 12:48 AM, Jon Maloy <jon.ma...@ericsson.com> wrote: > Hi John, > > See below. > > > > ///jon > > > > > > *From:* John THompson [mailto:thompa....@gmail.com] > *Sent:* Tuesday, 06 September, 2016 00:14 > *To:* Jon Maloy <ma...@donjonn.com> > *Cc:* Jon Maloy <jon.ma...@ericsson.com>; tipc-discussion@lists. > sourceforge.net > *Subject:* Re: [tipc-discussion] BC rcv link acked stuck after receiving > a named with a BC ACK of 0 > > > > Hi Jon, > > > > The packet I see the error happening on is when receiving a usr 11 > > (NAME_DISTRIBUTOR) over the unicast link. > > The reception of this packet is happening interleaved with processing > > a packet (or packets) on the BC link that has brought the peer up. > > The BC link packet processing has the tipc_bcast_lock and the unicast > > pkt processing cannot get the bcast lock for a while. > > When it can get the lock it processes the BC ack == 0 from the > NAME_DISTRIBUTOR > > packet and sets the acked field on the BC link to 0. > > > > The debug / call trace below is me trying to show from the debug I > captured what happens. > > If I add debug for each pkt the problem doesn't reproduce. > > > > > > tipc_rcv 1.1.5:vcs_mgmt-1.1.18:vcs_mgmt bc ack rcv 0 uc seq 3 ack 0 user > 11 type 0 > > + calls tipc_bcast_ack_rcv > > tipc_rcv > > + tipc_bcast_ack_rcv > > + tipc_link_bc_ack_rcv broadcast-link-5-18 bc ack 53574 - can't ack as > link not up 1 or peer not up 1 > > > > What kind of packet was this? > This is a LINK_PROTOCOL STATE msg received on the unicast link Dbg of the packet: broadcast-link-5-18 bc ack 0 orig_l 1.1.5:vcs_mgmt-1.1.18:vcs_mgmt (hdr usr 7 type 0 bc ack 0 seq 32771 ack 0)- can't ack as link not up 0 or peer not up 1 > > tipc_rcv > > + tipc_bcast_ack_rcv > > + tipc_link_bc_ack_rcv broadcast-link-5-18 bc ack 53574 - can't ack > as link not up 0 or peer not up 1 > > > > And this? > This is a LINK_PROTOCOL STATE msg received on the unicast link, with ack == 1, therefore this packet will be the first pkt that sets bc_peer_is_up to true. Dbg of the packet: broadcast-link-5-18 bc ack 43402 orig_l 1.1.5:vcs_mgmt-1.1.18:vcs_mgmt (hdr usr 7 type 0 bc ack 43402 seq 32771 ack 1)- can't ack as link not up 0 or peer not up 1 > > > === > > Somewhere at this point bc_peer_is_up gets set > > === > > tipc_rcv > > + tipc_bcast_ack_rcv > > + tipc_link_bc_ack_rcv broadcast-link-5-18 bc ack - acked (53574) less > than it was previously (53574) > > tipc_rcv > > + tipc_bcast_ack_rcv > > + tipc_link_bc_ack_rcv broadcast-link-5-18 bc ack - acked (53574) less > than it was previously (53574) > > > > + from tipc_rcv on unicast link > > + tipc_bcast_ack_rcv Going to set BC ACK outside window, new 0 old > 53574 win 200 > > - dump_stack > > CPU: 2 PID: 19 Comm: ksoftirqd/2 Tainted: P O 4.4.6-at1 #3 > > Call Trace: > > [a3093a80] [806943b0] dump_stack+0x84/0xb0 (unreliable) > > [a3093a90] [c1507314] tipc_link_bc_ack_rcv+0x244/0x250 [tipc] > > [a3093ab0] [c1501b04] tipc_bcast_ack_rcv+0x74/0xd0 [tipc] > > [a3093ae0] [c1511a08] tipc_rcv+0x468/0xa30 [tipc] > > [a3093b80] [c150218c] tipc_bcast_stop+0xfc/0x7b0 [tipc] > > [a3093b90] [8050d6a8] __netif_receive_skb_core+0x468/0xa10 > > [a3093c30] [80510b6c] netif_receive_skb_internal+0x3c/0xe0 > > [a3093c60] [8064b2b8] br_handle_frame_finish+0x1d8/0x4d0 > > [a3093cd0] [8064b7a0] br_handle_frame+0x1f0/0x330 > > [a3093d20] [8050d738] __netif_receive_skb_core+0x4f8/0xa10 > > [a3093dc0] [805119f0] process_backlog+0x90/0x140 > > [a3093df0] [8051103c] net_rx_action+0x15c/0x320 > > [a3093e50] [8002594c] __do_softirq+0x13c/0x250 > > [a3093eb0] [80025ab0] run_ksoftirqd+0x50/0x80 > > [a3093ec0] [800434c4] smpboot_thread_fn+0x1e4/0x1f0 > > [a3093ef0] [8003fb38] kthread+0xc8/0xe0 > > [a3093f40] [8000eed8] ret_from_kernel_thread+0x5c/0x64 > > > > I am going to send in a patch that adds checking for a valid BC ack (being > within the window size) to > > tipc_link_bc_ack_rcv. > > > > Not sure that is a good idea. Even if #0 happens to be within a valid > range it is still invalid, and may lead to an inadvertent release of > packets which are not ready to be released yet. I’ll try to take a > closer look at this today. > I thought that #0 was a valid value when the seqno wraps around? There doesn't appear to be any special handling for the wrap around case. I agree with your statement that it might lead to inadvertent release of packets that aren't ready to be released. Therefore I won't send it in. JT > > ///jon > > > > Cheers, > > JT > > > > On Wed, Aug 31, 2016 at 10:57 PM, John THompson <thompa....@gmail.com> > wrote: > > Hi Jon, > > > > I have verified that the patch is included in my build. > > 2d18ac4ba7454a426047 (“ tipc: extend broadcast link initialization > criteria”) > > > > I am trying to verify which packets are received when the problem occurs > but I am having trouble getting the information out of my system at the > moment. > > > > I will keep trying. > > Thanks, > > JT > > > > > > On Tue, Aug 30, 2016 at 6:20 PM, Jon Maloy <ma...@donjonn.com> wrote: > > > > On 08/29/2016 06:48 PM, Jon Maloy wrote: > > Hi John, > Sorry for my late answer; I was on vacation for a few days. > It seems I gave you the wrong commit reference in my previous mail. The > one I really meant was > 2d18ac4ba7454a426047 (“ tipc: extend broadcast link initialization > criteria”) > > This one explains why the first packets sometimes get an invalid ack > number, but also remedies it, and I simply cannot see how an invalid ack #0 > can ever be accepted when this patch is applied. > I see no reason why this patch shouldn’t also be present in you code, but > just to make sure, can you confirm this? > > I am right now wondering if a retransmission is the problem: > 1: we receive pkt #2 which contains ack #1, so we set bc_peer_is_up to > true. > > Since only LINK_PROTO/STATE messages can cause bc_peer_is_up to go true, > the likely sequence is rather > 1: We receive a STATE message with unicast ack #1. This message should > also contain a valid, with high probability non-zero, bc_ack. bc_peer_is_up > is set to true. > 2: We receive unicast pkt#1 (BCAST init or NAMED) which contains the > invalid unicast ack #0. This one is now accepted. > > I believe this may happen, because STATE messages, contrary to data > packets, are sent as TC_PRIO_CONTROL, and may sometimes bypass data > messages, but I cannot see it happening as often and consistently as you > seem to be observing it. Another possibility is that bc_ack in the received > STATE message also is an invalid zero, although I cannot see how this can > happen either. > > Regards > ///jon > > 2: we receive pkt #1 retransmitted with ack #0. This now gets accepted, > and we are in trouble. > > I’ll try to figure out a solution to this, but it may be possible for you > to verify this first. > > BR > ///jon > > > > From: John THompson [mailto:thompa....@gmail.com] > Sent: Wednesday, 24 August, 2016 16:22 > To: Jon Maloy <jon.ma...@ericsson.com> > Cc: tipc-discussion@lists.sourceforge.net > Subject: Re: [tipc-discussion] BC rcv link acked stuck after receiving a > named with a BC ACK of 0 > > Hi Jon, > > To clarify my previous email regarding the behaviour observed, > > What happens over time: > + remove bc peer > ... > some time until peer rejoins > ... > + add bc peer > + tipc_link_bc_ack_rcv > link is up = false, node is up = false > (this gets called a number of times until both the link and node are up) > > + tipc_link_bc_ack_rcv > l->acked set to valid ack > ... > + tipc_rcv - usr 5 or 11, bc_ack = 0 > + tipc_bcast_ack_rcv > + tipc_link_bc_ack_rcv > sets l->acked to 0 > > Regards, > JT > > On Thu, Aug 25, 2016 at 8:06 AM, John THompson <thompa....@gmail.com > <mailto:thompa....@gmail.com>> wrote: > Hi Jon, > > It is a similar problem in terms of what happens to the bc link. I do > have that patch applied. > > I have added debug through the remove bc peer and various other functions > and the setting of the acked field to 0 is occurring when processing a > packet from named (msg user 11) or BCAST protocol (msg user 5). > > Thanks, > JT > > On Wed, Aug 24, 2016 at 10:23 PM, Jon Maloy <jon.ma...@ericsson.com< > mailto:jon.ma...@ericsson.com>> wrote: > Hi John, > This sounds a lot like the problem I tried to fix in > a71eb720355c2 ("tipc: ensure correct broadcast send buffer release when > peer is lost") > So, either that patch is not present in your kernel (if it is 4.7 it is > supposed to be) or my solution somehow hasn't solved the problem. > Can you confirm that the patch is there? > > BR > ///jon > > -----Original Message----- > From: John THompson [mailto:thompa....@gmail.com<mailto: > thompa....@gmail.com>] > Sent: Tuesday, 23 August, 2016 20:21 > > To: tipc-discussion@lists.sourceforge.net<mailto:tipc- > discuss...@lists.sourceforge.net> > Subject: [tipc-discussion] BC rcv link acked stuck after receiving a named > with a BC > ACK of 0 > > Hi, > > I am running TIPC 2.0 on Linux 4.7 on a cluster of Freescale QorIQ P2040 > and Marvell Armada-XP processors. There are 10 nodes in all. > When 2 of the nodes are removed, then rejoin the cluster we sometimes see > behaviour where the TIPC BC link gets stuck and eventually the backlog gets > full. the 2 nodes that are joining have already connected together. > > The problem occurs when the BC link sndnxt value is greater than 32k on one > of the nodes (call it NODE1) and 2 nodes begin to join. > When NODE1 detects the joining nodes, at some early point after they have > joined, NODE1 receives a NAMED publication with a BC ack of 0. NODE1 > immediately sets its BC acked to 0 and tries to ack packets off the > transmq. No packets get removed as the new ack value doesn't match any of > the packets that need to be acked. > > The problem doesn't recover because in tipc_link_bc_ack_rcv it ensures that > the new acked value is more than the old acked value. When the values are > greater than 32k apart this means that 0 can indeed be greater than > 40,000. So when new packets are processed the new BC ack value is > considered less than the stored one (0). > > This results in the BC transmq getting full and the backlog getting full, > thereby preventing communication over the BC link between nodes. > > I am persisting in trying to work out why the NAMED publication has a BC > ack of 0, which I think is the root cause of the problem. > > I think that tipc_link_bc_ack_rcv needs an extra check to ensure that an > invalid BC ack value cannot be set. I am defining invalid as being an > acked value that is greater than the current BC acked value + the link > window. > > Thanks, > John > ------------------------------------------------------------ > ------------------ > _______________________________________________ > tipc-discussion mailing list > > tipc-discussion@lists.sourceforge.net<mailto:tipc- > discuss...@lists.sourceforge.net> > https://lists.sourceforge.net/lists/listinfo/tipc-discussion > > > ------------------------------------------------------------ > ------------------ > _______________________________________________ > tipc-discussion mailing list > tipc-discussion@lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/tipc-discussion > > > > > > > ------------------------------------------------------------------------------ _______________________________________________ tipc-discussion mailing list tipc-discussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/tipc-discussion