Hi Jon, The packet I see the error happening on is when receiving a usr 11 (NAME_DISTRIBUTOR) over the unicast link. The reception of this packet is happening interleaved with processing a packet (or packets) on the BC link that has brought the peer up. The BC link packet processing has the tipc_bcast_lock and the unicast pkt processing cannot get the bcast lock for a while. When it can get the lock it processes the BC ack == 0 from the NAME_DISTRIBUTOR packet and sets the acked field on the BC link to 0.
The debug / call trace below is me trying to show from the debug I captured what happens. If I add debug for each pkt the problem doesn't reproduce. tipc_rcv 1.1.5:vcs_mgmt-1.1.18:vcs_mgmt bc ack rcv 0 uc seq 3 ack 0 user 11 type 0 + calls tipc_bcast_ack_rcv tipc_rcv + tipc_bcast_ack_rcv + tipc_link_bc_ack_rcv broadcast-link-5-18 bc ack 53574 - can't ack as link not up 1 or peer not up 1 tipc_rcv + tipc_bcast_ack_rcv + tipc_link_bc_ack_rcv broadcast-link-5-18 bc ack 53574 - can't ack as link not up 0 or peer not up 1 === Somewhere at this point bc_peer_is_up gets set === tipc_rcv + tipc_bcast_ack_rcv + tipc_link_bc_ack_rcv broadcast-link-5-18 bc ack - acked (53574) less than it was previously (53574) tipc_rcv + tipc_bcast_ack_rcv + tipc_link_bc_ack_rcv broadcast-link-5-18 bc ack - acked (53574) less than it was previously (53574) + from tipc_rcv on unicast link + tipc_bcast_ack_rcv Going to set BC ACK outside window, new 0 old 53574 win 200 - dump_stack CPU: 2 PID: 19 Comm: ksoftirqd/2 Tainted: P O 4.4.6-at1 #3 Call Trace: [a3093a80] [806943b0] dump_stack+0x84/0xb0 (unreliable) [a3093a90] [c1507314] tipc_link_bc_ack_rcv+0x244/0x250 [tipc] [a3093ab0] [c1501b04] tipc_bcast_ack_rcv+0x74/0xd0 [tipc] [a3093ae0] [c1511a08] tipc_rcv+0x468/0xa30 [tipc] [a3093b80] [c150218c] tipc_bcast_stop+0xfc/0x7b0 [tipc] [a3093b90] [8050d6a8] __netif_receive_skb_core+0x468/0xa10 [a3093c30] [80510b6c] netif_receive_skb_internal+0x3c/0xe0 [a3093c60] [8064b2b8] br_handle_frame_finish+0x1d8/0x4d0 [a3093cd0] [8064b7a0] br_handle_frame+0x1f0/0x330 [a3093d20] [8050d738] __netif_receive_skb_core+0x4f8/0xa10 [a3093dc0] [805119f0] process_backlog+0x90/0x140 [a3093df0] [8051103c] net_rx_action+0x15c/0x320 [a3093e50] [8002594c] __do_softirq+0x13c/0x250 [a3093eb0] [80025ab0] run_ksoftirqd+0x50/0x80 [a3093ec0] [800434c4] smpboot_thread_fn+0x1e4/0x1f0 [a3093ef0] [8003fb38] kthread+0xc8/0xe0 [a3093f40] [8000eed8] ret_from_kernel_thread+0x5c/0x64 I am going to send in a patch that adds checking for a valid BC ack (being within the window size) to tipc_link_bc_ack_rcv. Cheers, JT On Wed, Aug 31, 2016 at 10:57 PM, John THompson <thompa....@gmail.com> wrote: > Hi Jon, > > I have verified that the patch is included in my build. > 2d18ac4ba7454a426047 (“ tipc: extend broadcast link initialization > criteria”) > > I am trying to verify which packets are received when the problem occurs > but I am having trouble getting the information out of my system at the > moment. > > I will keep trying. > Thanks, > JT > > > On Tue, Aug 30, 2016 at 6:20 PM, Jon Maloy <ma...@donjonn.com> wrote: > >> >> >> On 08/29/2016 06:48 PM, Jon Maloy wrote: >> >>> Hi John, >>> Sorry for my late answer; I was on vacation for a few days. >>> It seems I gave you the wrong commit reference in my previous mail. The >>> one I really meant was >>> 2d18ac4ba7454a426047 (“ tipc: extend broadcast link initialization >>> criteria”) >>> >>> This one explains why the first packets sometimes get an invalid ack >>> number, but also remedies it, and I simply cannot see how an invalid ack #0 >>> can ever be accepted when this patch is applied. >>> I see no reason why this patch shouldn’t also be present in you code, >>> but just to make sure, can you confirm this? >>> >>> I am right now wondering if a retransmission is the problem: >>> 1: we receive pkt #2 which contains ack #1, so we set bc_peer_is_up to >>> true. >>> >> Since only LINK_PROTO/STATE messages can cause bc_peer_is_up to go true, >> the likely sequence is rather >> 1: We receive a STATE message with unicast ack #1. This message should >> also contain a valid, with high probability non-zero, bc_ack. bc_peer_is_up >> is set to true. >> 2: We receive unicast pkt#1 (BCAST init or NAMED) which contains the >> invalid unicast ack #0. This one is now accepted. >> >> I believe this may happen, because STATE messages, contrary to data >> packets, are sent as TC_PRIO_CONTROL, and may sometimes bypass data >> messages, but I cannot see it happening as often and consistently as you >> seem to be observing it. Another possibility is that bc_ack in the received >> STATE message also is an invalid zero, although I cannot see how this can >> happen either. >> >> Regards >> ///jon >> >> 2: we receive pkt #1 retransmitted with ack #0. This now gets accepted, >>> and we are in trouble. >>> >>> I’ll try to figure out a solution to this, but it may be possible for >>> you to verify this first. >>> >>> BR >>> ///jon >>> >>> >>> >>> From: John THompson [mailto:thompa....@gmail.com] >>> Sent: Wednesday, 24 August, 2016 16:22 >>> To: Jon Maloy <jon.ma...@ericsson.com> >>> Cc: tipc-discussion@lists.sourceforge.net >>> Subject: Re: [tipc-discussion] BC rcv link acked stuck after receiving a >>> named with a BC ACK of 0 >>> >>> Hi Jon, >>> >>> To clarify my previous email regarding the behaviour observed, >>> >>> What happens over time: >>> + remove bc peer >>> ... >>> some time until peer rejoins >>> ... >>> + add bc peer >>> + tipc_link_bc_ack_rcv >>> link is up = false, node is up = false >>> (this gets called a number of times until both the link and node are >>> up) >>> >>> + tipc_link_bc_ack_rcv >>> l->acked set to valid ack >>> ... >>> + tipc_rcv - usr 5 or 11, bc_ack = 0 >>> + tipc_bcast_ack_rcv >>> + tipc_link_bc_ack_rcv >>> sets l->acked to 0 >>> >>> Regards, >>> JT >>> >>> >>> On Thu, Aug 25, 2016 at 8:06 AM, John THompson <thompa....@gmail.com >>> <mailto:thompa....@gmail.com>> wrote: >>> Hi Jon, >>> >>> It is a similar problem in terms of what happens to the bc link. I do >>> have that patch applied. >>> >>> I have added debug through the remove bc peer and various other >>> functions and the setting of the acked field to 0 is occurring when >>> processing a packet from named (msg user 11) or BCAST protocol (msg user 5). >>> >>> Thanks, >>> JT >>> >>> On Wed, Aug 24, 2016 at 10:23 PM, Jon Maloy <jon.ma...@ericsson.com >>> <mailto:jon.ma...@ericsson.com>> wrote: >>> Hi John, >>> This sounds a lot like the problem I tried to fix in >>> a71eb720355c2 ("tipc: ensure correct broadcast send buffer release when >>> peer is lost") >>> So, either that patch is not present in your kernel (if it is 4.7 it is >>> supposed to be) or my solution somehow hasn't solved the problem. >>> Can you confirm that the patch is there? >>> >>> BR >>> ///jon >>> >>> -----Original Message----- >>>> From: John THompson [mailto:thompa....@gmail.com<mailto: >>>> thompa....@gmail.com>] >>>> Sent: Tuesday, 23 August, 2016 20:21 >>>> To: tipc-discussion@lists.sourceforge.net<mailto:tipc-discussion >>>> @lists.sourceforge.net> >>>> Subject: [tipc-discussion] BC rcv link acked stuck after receiving a >>>> named with a BC >>>> ACK of 0 >>>> >>>> Hi, >>>> >>>> I am running TIPC 2.0 on Linux 4.7 on a cluster of Freescale QorIQ P2040 >>>> and Marvell Armada-XP processors. There are 10 nodes in all. >>>> When 2 of the nodes are removed, then rejoin the cluster we sometimes >>>> see >>>> behaviour where the TIPC BC link gets stuck and eventually the backlog >>>> gets >>>> full. the 2 nodes that are joining have already connected together. >>>> >>>> The problem occurs when the BC link sndnxt value is greater than 32k on >>>> one >>>> of the nodes (call it NODE1) and 2 nodes begin to join. >>>> When NODE1 detects the joining nodes, at some early point after they >>>> have >>>> joined, NODE1 receives a NAMED publication with a BC ack of 0. NODE1 >>>> immediately sets its BC acked to 0 and tries to ack packets off the >>>> transmq. No packets get removed as the new ack value doesn't match any >>>> of >>>> the packets that need to be acked. >>>> >>>> The problem doesn't recover because in tipc_link_bc_ack_rcv it ensures >>>> that >>>> the new acked value is more than the old acked value. When the values >>>> are >>>> greater than 32k apart this means that 0 can indeed be greater than >>>> 40,000. So when new packets are processed the new BC ack value is >>>> considered less than the stored one (0). >>>> >>>> This results in the BC transmq getting full and the backlog getting >>>> full, >>>> thereby preventing communication over the BC link between nodes. >>>> >>>> I am persisting in trying to work out why the NAMED publication has a BC >>>> ack of 0, which I think is the root cause of the problem. >>>> >>>> I think that tipc_link_bc_ack_rcv needs an extra check to ensure that an >>>> invalid BC ack value cannot be set. I am defining invalid as being an >>>> acked value that is greater than the current BC acked value + the link >>>> window. >>>> >>>> Thanks, >>>> John >>>> ------------------------------------------------------------ >>>> ------------------ >>>> _______________________________________________ >>>> tipc-discussion mailing list >>>> tipc-discussion@lists.sourceforge.net<mailto:tipc-discussion >>>> @lists.sourceforge.net> >>>> https://lists.sourceforge.net/lists/listinfo/tipc-discussion >>>> >>> >>> ------------------------------------------------------------ >>> ------------------ >>> _______________________________________________ >>> tipc-discussion mailing list >>> tipc-discussion@lists.sourceforge.net >>> https://lists.sourceforge.net/lists/listinfo/tipc-discussion >>> >> >> > ------------------------------------------------------------------------------ _______________________________________________ tipc-discussion mailing list tipc-discussion@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/tipc-discussion