Hi Jon,

The packet I see the error happening on is when receiving a usr 11
(NAME_DISTRIBUTOR) over the unicast link.
The reception of this packet is happening interleaved with processing
a packet (or packets) on the BC link that has brought the peer up.
The BC link packet processing has the tipc_bcast_lock and the unicast
pkt processing cannot get the bcast lock for a while.
When it can get the lock it processes the BC ack == 0 from the
NAME_DISTRIBUTOR
packet and sets the acked field on the BC link to 0.

The debug / call trace below is me trying to show from the debug I captured
what happens.
If I add debug for each pkt the problem doesn't reproduce.


tipc_rcv 1.1.5:vcs_mgmt-1.1.18:vcs_mgmt bc ack rcv 0 uc seq 3 ack 0 user 11
type 0
  + calls tipc_bcast_ack_rcv
tipc_rcv
  + tipc_bcast_ack_rcv
    + tipc_link_bc_ack_rcv broadcast-link-5-18 bc ack 53574 - can't ack as
link not up 1 or peer not up 1
tipc_rcv
  + tipc_bcast_ack_rcv
    + tipc_link_bc_ack_rcv broadcast-link-5-18 bc ack 53574 - can't ack as
link not up 0 or peer not up 1
===
Somewhere at this point bc_peer_is_up gets set
===
tipc_rcv
  + tipc_bcast_ack_rcv
    + tipc_link_bc_ack_rcv broadcast-link-5-18 bc ack - acked (53574) less
than it was previously (53574)
tipc_rcv
  + tipc_bcast_ack_rcv
    + tipc_link_bc_ack_rcv broadcast-link-5-18 bc ack - acked (53574) less
than it was previously (53574)

  + from tipc_rcv on unicast link
    + tipc_bcast_ack_rcv Going to set BC ACK outside window, new 0 old
53574 win 200
      - dump_stack
CPU: 2 PID: 19 Comm: ksoftirqd/2 Tainted: P           O    4.4.6-at1 #3
Call Trace:
[a3093a80] [806943b0] dump_stack+0x84/0xb0 (unreliable)
[a3093a90] [c1507314] tipc_link_bc_ack_rcv+0x244/0x250 [tipc]
[a3093ab0] [c1501b04] tipc_bcast_ack_rcv+0x74/0xd0 [tipc]
[a3093ae0] [c1511a08] tipc_rcv+0x468/0xa30 [tipc]
[a3093b80] [c150218c] tipc_bcast_stop+0xfc/0x7b0 [tipc]
[a3093b90] [8050d6a8] __netif_receive_skb_core+0x468/0xa10
[a3093c30] [80510b6c] netif_receive_skb_internal+0x3c/0xe0
[a3093c60] [8064b2b8] br_handle_frame_finish+0x1d8/0x4d0
[a3093cd0] [8064b7a0] br_handle_frame+0x1f0/0x330
[a3093d20] [8050d738] __netif_receive_skb_core+0x4f8/0xa10
[a3093dc0] [805119f0] process_backlog+0x90/0x140
[a3093df0] [8051103c] net_rx_action+0x15c/0x320
[a3093e50] [8002594c] __do_softirq+0x13c/0x250
[a3093eb0] [80025ab0] run_ksoftirqd+0x50/0x80
[a3093ec0] [800434c4] smpboot_thread_fn+0x1e4/0x1f0
[a3093ef0] [8003fb38] kthread+0xc8/0xe0
[a3093f40] [8000eed8] ret_from_kernel_thread+0x5c/0x64

I am going to send in a patch that adds checking for a valid BC ack (being
within the window size) to
tipc_link_bc_ack_rcv.

Cheers,
JT

On Wed, Aug 31, 2016 at 10:57 PM, John THompson <thompa....@gmail.com>
wrote:

> Hi Jon,
>
> I have verified that the patch is included in my build.
> 2d18ac4ba7454a426047 (“ tipc: extend broadcast link initialization
> criteria”)
>
> I am trying to verify which packets are received when the problem occurs
> but I am having trouble getting the information out of my system at the
> moment.
>
> I will keep trying.
> Thanks,
> JT
>
>
> On Tue, Aug 30, 2016 at 6:20 PM, Jon Maloy <ma...@donjonn.com> wrote:
>
>>
>>
>> On 08/29/2016 06:48 PM, Jon Maloy wrote:
>>
>>> Hi John,
>>> Sorry for my late answer; I was on vacation for a few days.
>>> It seems I gave you the wrong commit reference in my previous mail. The
>>> one I really meant was
>>> 2d18ac4ba7454a426047 (“ tipc: extend broadcast link initialization
>>> criteria”)
>>>
>>> This one explains why the first packets sometimes get an invalid ack
>>> number, but also remedies it, and I simply cannot see how an invalid ack #0
>>> can ever be accepted when this patch is applied.
>>> I see no reason why this patch shouldn’t also be present in you code,
>>> but just to make sure, can you confirm this?
>>>
>>> I am right now wondering if a retransmission is the problem:
>>> 1: we receive pkt #2 which contains ack #1, so we set bc_peer_is_up to
>>> true.
>>>
>> Since only LINK_PROTO/STATE messages can cause bc_peer_is_up to go true,
>> the likely sequence is rather
>> 1: We receive a STATE message with unicast ack #1. This message should
>> also contain a valid, with high probability non-zero, bc_ack. bc_peer_is_up
>> is set to true.
>> 2: We receive unicast pkt#1 (BCAST init or NAMED) which contains the
>> invalid unicast ack #0. This one is now accepted.
>>
>> I believe this may happen, because STATE messages, contrary to data
>> packets, are sent as TC_PRIO_CONTROL, and may sometimes bypass data
>> messages, but I cannot see it happening as often and consistently as you
>> seem to be observing it. Another possibility is that bc_ack in the received
>> STATE message also is an invalid zero, although I cannot see how this can
>> happen either.
>>
>> Regards
>> ///jon
>>
>> 2: we receive pkt #1 retransmitted with ack #0. This now gets accepted,
>>> and we are in trouble.
>>>
>>> I’ll try to figure out a solution to this, but it may be possible for
>>> you to verify this first.
>>>
>>> BR
>>> ///jon
>>>
>>>
>>>
>>> From: John THompson [mailto:thompa....@gmail.com]
>>> Sent: Wednesday, 24 August, 2016 16:22
>>> To: Jon Maloy <jon.ma...@ericsson.com>
>>> Cc: tipc-discussion@lists.sourceforge.net
>>> Subject: Re: [tipc-discussion] BC rcv link acked stuck after receiving a
>>> named with a BC ACK of 0
>>>
>>> Hi Jon,
>>>
>>> To clarify my previous email regarding the behaviour observed,
>>>
>>> What happens over time:
>>> + remove bc peer
>>> ...
>>> some time until peer rejoins
>>> ...
>>> + add bc peer
>>> + tipc_link_bc_ack_rcv
>>>    link is up = false, node is up = false
>>>    (this gets called a number of times until both the link and node are
>>> up)
>>>
>>> + tipc_link_bc_ack_rcv
>>>    l->acked set to valid ack
>>> ...
>>> + tipc_rcv - usr 5 or 11, bc_ack = 0
>>>    + tipc_bcast_ack_rcv
>>>      + tipc_link_bc_ack_rcv
>>>        sets l->acked to 0
>>>
>>> Regards,
>>> JT
>>>
>>>
>>> On Thu, Aug 25, 2016 at 8:06 AM, John THompson <thompa....@gmail.com
>>> <mailto:thompa....@gmail.com>> wrote:
>>> Hi Jon,
>>>
>>> It is a similar problem in terms of what happens to the bc link.  I do
>>> have that patch applied.
>>>
>>> I have added debug through the remove bc peer and various other
>>> functions and the setting of the acked field to 0 is occurring when
>>> processing a packet from named (msg user 11) or BCAST protocol (msg user 5).
>>>
>>> Thanks,
>>> JT
>>>
>>> On Wed, Aug 24, 2016 at 10:23 PM, Jon Maloy <jon.ma...@ericsson.com
>>> <mailto:jon.ma...@ericsson.com>> wrote:
>>> Hi John,
>>> This sounds a lot like the problem I tried to fix in
>>> a71eb720355c2 ("tipc: ensure correct broadcast send buffer release when
>>> peer is lost")
>>> So, either that patch is not present in your kernel (if it is 4.7 it is
>>> supposed to be) or my solution somehow hasn't solved the problem.
>>> Can you confirm that the patch is there?
>>>
>>> BR
>>> ///jon
>>>
>>> -----Original Message-----
>>>> From: John THompson [mailto:thompa....@gmail.com<mailto:
>>>> thompa....@gmail.com>]
>>>> Sent: Tuesday, 23 August, 2016 20:21
>>>> To: tipc-discussion@lists.sourceforge.net<mailto:tipc-discussion
>>>> @lists.sourceforge.net>
>>>> Subject: [tipc-discussion] BC rcv link acked stuck after receiving a
>>>> named with a BC
>>>> ACK of 0
>>>>
>>>> Hi,
>>>>
>>>> I am running TIPC 2.0 on Linux 4.7 on a cluster of Freescale QorIQ P2040
>>>> and Marvell Armada-XP processors.  There are 10 nodes in all.
>>>> When 2 of the nodes are removed, then rejoin the cluster we sometimes
>>>> see
>>>> behaviour where the TIPC BC link gets stuck and eventually the backlog
>>>> gets
>>>> full.  the 2 nodes that are joining have already connected together.
>>>>
>>>> The problem occurs when the BC link sndnxt value is greater than 32k on
>>>> one
>>>> of the nodes (call it NODE1) and 2 nodes begin to join.
>>>> When NODE1 detects the joining nodes, at some early point after they
>>>> have
>>>> joined, NODE1 receives a NAMED publication with a BC ack of 0.  NODE1
>>>> immediately sets its BC acked to 0 and tries to ack packets off the
>>>> transmq.  No packets get removed as the new ack value doesn't match any
>>>> of
>>>> the packets that need to be acked.
>>>>
>>>> The problem doesn't recover because in tipc_link_bc_ack_rcv it ensures
>>>> that
>>>> the new acked value is more than the old acked value.  When the values
>>>> are
>>>> greater than 32k apart this means that 0 can indeed be greater than
>>>> 40,000.  So when new packets are processed the new BC ack value is
>>>> considered less than the stored one (0).
>>>>
>>>> This results in the BC transmq getting full and the backlog getting
>>>> full,
>>>> thereby preventing communication over the BC link between nodes.
>>>>
>>>> I am persisting in trying to work out why the NAMED publication has a BC
>>>> ack of 0, which I think is the root cause of the problem.
>>>>
>>>> I think that tipc_link_bc_ack_rcv needs an extra check to ensure that an
>>>> invalid BC ack value cannot be set.  I am defining invalid as being an
>>>> acked value that is greater than the current BC acked value + the link
>>>> window.
>>>>
>>>> Thanks,
>>>> John
>>>> ------------------------------------------------------------
>>>> ------------------
>>>> _______________________________________________
>>>> tipc-discussion mailing list
>>>> tipc-discussion@lists.sourceforge.net<mailto:tipc-discussion
>>>> @lists.sourceforge.net>
>>>> https://lists.sourceforge.net/lists/listinfo/tipc-discussion
>>>>
>>>
>>> ------------------------------------------------------------
>>> ------------------
>>> _______________________________________________
>>> tipc-discussion mailing list
>>> tipc-discussion@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/tipc-discussion
>>>
>>
>>
>
------------------------------------------------------------------------------
_______________________________________________
tipc-discussion mailing list
tipc-discussion@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/tipc-discussion

Reply via email to