Hi Jon,

I have verified that the following patch is included in my build:
2d18ac4ba7454a426047 (“tipc: extend broadcast link initialization
criteria”)

I am trying to verify which packets are received when the problem occurs
but I am having trouble getting the information out of my system at the
moment.

I will keep trying.
Thanks,
JT


On Tue, Aug 30, 2016 at 6:20 PM, Jon Maloy <ma...@donjonn.com> wrote:

>
>
> On 08/29/2016 06:48 PM, Jon Maloy wrote:
>
>> Hi John,
>> Sorry for my late answer; I was on vacation for a few days.
>> It seems I gave you the wrong commit reference in my previous mail. The
>> one I really meant was
>> 2d18ac4ba7454a426047 (“tipc: extend broadcast link initialization
>> criteria”)
>>
>> This one explains why the first packets sometimes get an invalid ack
>> number, but also remedies it, and I simply cannot see how an invalid ack #0
>> can ever be accepted when this patch is applied.
>> I see no reason why this patch shouldn’t also be present in your code, but
>> just to make sure, can you confirm this?
>>
>> I am right now wondering if a retransmission is the problem:
>> 1: we receive pkt #2 which contains ack #1, so we set bc_peer_is_up to
>> true.
>>
> Since only LINK_PROTO/STATE messages can cause bc_peer_is_up to go true,
> the likely sequence is rather
> 1: We receive a STATE message with unicast ack #1. This message should
> also contain a valid, and most likely non-zero, bc_ack. bc_peer_is_up
> is set to true.
> 2: We receive unicast pkt#1 (BCAST init or NAMED) which contains the
> invalid unicast ack #0. This one is now accepted.
>
> I believe this may happen, because STATE messages, contrary to data
> packets, are sent as TC_PRIO_CONTROL, and may sometimes bypass data
> messages, but I cannot see it happening as often and consistently as you
> seem to be observing it. Another possibility is that bc_ack in the received
> STATE message also is an invalid zero, although I cannot see how this can
> happen either.
>
> Regards
> ///jon
>
>> 2: we receive pkt #1 retransmitted with ack #0. This now gets accepted,
>> and we are in trouble.
>>
>> I’ll try to figure out a solution to this, but it may be possible for you
>> to verify this first.
>>
>> BR
>> ///jon
>>
>>
>>
>> From: John THompson [mailto:thompa....@gmail.com]
>> Sent: Wednesday, 24 August, 2016 16:22
>> To: Jon Maloy <jon.ma...@ericsson.com>
>> Cc: tipc-discussion@lists.sourceforge.net
>> Subject: Re: [tipc-discussion] BC rcv link acked stuck after receiving a
>> named with a BC ACK of 0
>>
>> Hi Jon,
>>
>> To clarify my previous email regarding the behaviour observed,
>>
>> What happens over time:
>> + remove bc peer
>> ...
>> some time until peer rejoins
>> ...
>> + add bc peer
>> + tipc_link_bc_ack_rcv
>>    link is up = false, node is up = false
>>    (this gets called a number of times until both the link and node are
>> up)
>>
>> + tipc_link_bc_ack_rcv
>>    l->acked set to valid ack
>> ...
>> + tipc_rcv - usr 5 or 11, bc_ack = 0
>>    + tipc_bcast_ack_rcv
>>      + tipc_link_bc_ack_rcv
>>        sets l->acked to 0
>>
>> Regards,
>> JT
>>
>>
>> On Thu, Aug 25, 2016 at 8:06 AM, John THompson <thompa....@gmail.com> wrote:
>> Hi Jon,
>>
>> It is a similar problem in terms of what happens to the bc link.  I do
>> have that patch applied.
>>
>> I have added debug through the remove bc peer and various other functions
>> and the setting of the acked field to 0 is occurring when processing a
>> packet from named (msg user 11) or BCAST protocol (msg user 5).
>>
>> Thanks,
>> JT
>>
>> On Wed, Aug 24, 2016 at 10:23 PM, Jon Maloy <jon.ma...@ericsson.com> wrote:
>> Hi John,
>> This sounds a lot like the problem I tried to fix in
>> a71eb720355c2 ("tipc: ensure correct broadcast send buffer release when
>> peer is lost")
>> So, either that patch is not present in your kernel (if it is 4.7 it is
>> supposed to be) or my solution somehow hasn't solved the problem.
>> Can you confirm that the patch is there?
>>
>> BR
>> ///jon
>>
>> -----Original Message-----
>>> From: John THompson [mailto:thompa....@gmail.com]
>>> Sent: Tuesday, 23 August, 2016 20:21
>>> To: tipc-discussion@lists.sourceforge.net
>>> Subject: [tipc-discussion] BC rcv link acked stuck after receiving a
>>> named with a BC ACK of 0
>>>
>>> Hi,
>>>
>>> I am running TIPC 2.0 on Linux 4.7 on a cluster of Freescale QorIQ P2040
>>> and Marvell Armada-XP processors.  There are 10 nodes in all.
>>> When 2 of the nodes are removed and then rejoin the cluster, we sometimes
>>> see behaviour where the TIPC BC link gets stuck and eventually the
>>> backlog gets full.  The 2 nodes that are joining have already connected
>>> to each other.
>>>
>>> The problem occurs when the BC link sndnxt value is greater than 32k on
>>> one of the nodes (call it NODE1) and 2 nodes begin to join.
>>> When NODE1 detects the joining nodes, at some early point after they have
>>> joined, NODE1 receives a NAMED publication with a BC ack of 0.  NODE1
>>> immediately sets its BC acked to 0 and tries to ack packets off the
>>> transmq.  No packets get removed as the new ack value doesn't match any
>>> of the packets that need to be acked.
>>>
>>> The problem doesn't recover because tipc_link_bc_ack_rcv only accepts a
>>> new acked value that is more than the old acked value.  When the two
>>> values are more than 32k apart, the comparison wraps, so 0 can indeed be
>>> considered greater than 40,000.  Once acked has been set to 0, the BC ack
>>> values in subsequent packets are considered less than the stored one (0)
>>> and are ignored.
>>>
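To make the wrap-around concrete, here is a minimal standalone C sketch. It assumes 16-bit sequence numbers compared modulo 65536, where "newer" means at most 32767 steps ahead. The names seq_less_eq, seq_more, ack_update and replay_stuck_ack are illustrative stand-ins, not the kernel's functions, and the update rule is a simplification of what is described above.

```c
#include <assert.h>
#include <stdint.h>

/* "left <= right" in wrapping 16-bit arithmetic: true when right is
 * between 0 and 32767 steps ahead of left. (Assumed comparison rule.) */
static int seq_less_eq(uint16_t left, uint16_t right)
{
	return (uint16_t)(right - left) < 32768u;
}

static int seq_more(uint16_t left, uint16_t right)
{
	return !seq_less_eq(left, right);
}

/* Simplified stand-in for the acked update: accept a new ack only if it
 * compares as "more" than the stored one. Returns 1 if accepted. */
static int ack_update(uint16_t *acked, uint16_t new_ack)
{
	if (!seq_more(new_ack, *acked))
		return 0;
	*acked = new_ack;
	return 1;
}

/* Replays the reported sequence: stored ack above 32k, then a bogus 0. */
static void replay_stuck_ack(void)
{
	uint16_t acked = 40000;

	/* 0 is 25536 steps "ahead" of 40000, so it is accepted. */
	assert(ack_update(&acked, 0) == 1);
	assert(acked == 0);

	/* A genuine ack of 40001 is now 40001 steps ahead of 0, which is
	 * beyond the 32767 limit, so it is treated as old and dropped. */
	assert(ack_update(&acked, 40001) == 0);
	assert(acked == 0);
}
```

Calling replay_stuck_ack() passes all of its assertions, which matches the stuck state described: the stored value stays at 0 until genuine acks come within 32767 steps of it again.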
>>> This results in the BC transmq getting full and the backlog getting full,
>>> thereby preventing communication over the BC link between nodes.
>>>
>>> I am persisting in trying to work out why the NAMED publication has a BC
>>> ack of 0, which I think is the root cause of the problem.
>>>
>>> I think that tipc_link_bc_ack_rcv needs an extra check to ensure that an
>>> invalid BC ack value cannot be set.  I am defining invalid as being an
>>> acked value that is greater than the current BC acked value + the link
>>> window.
>>>
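The extra check proposed above could look roughly like the following standalone C sketch. It is only one reading of the proposal: bc_ack_in_window, guarded_ack_update and check_guard are hypothetical names, the window size of 50 is an arbitrary value chosen for the example, and none of this is taken from the kernel source.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 when new_ack is ahead of the stored ack by 1..window steps
 * in wrapping 16-bit arithmetic; anything else is treated as invalid. */
static int bc_ack_in_window(uint16_t acked, uint16_t new_ack, uint16_t window)
{
	uint16_t ahead = (uint16_t)(new_ack - acked);

	return ahead != 0 && ahead <= window;
}

/* Hypothetical guarded update: ignore acks that fall outside the window. */
static int guarded_ack_update(uint16_t *acked, uint16_t new_ack,
			      uint16_t window)
{
	if (!bc_ack_in_window(*acked, new_ack, window))
		return 0;
	*acked = new_ack;
	return 1;
}

/* Self-check with an example window of 50 packets (arbitrary value). */
static void check_guard(void)
{
	uint16_t acked = 40000;

	assert(guarded_ack_update(&acked, 0, 50) == 0);     /* bogus 0 ignored */
	assert(acked == 40000);
	assert(guarded_ack_update(&acked, 40010, 50) == 1); /* genuine ack */
	assert(acked == 40010);
}
```

With such a guard the bogus ack of 0 leaves the stored value untouched, so later genuine acks are still accepted. Whether the link window is the right bound is an open question; it is used here only because that is how the invalid case is defined above.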
>>> Thanks,
>>> John
>>> ------------------------------------------------------------
>>> ------------------
>>> _______________________________________________
>>> tipc-discussion mailing list
>>> tipc-discussion@lists.sourceforge.net<mailto:tipc-discussion
>>> @lists.sourceforge.net>
>>> https://lists.sourceforge.net/lists/listinfo/tipc-discussion
>>>
>>
>>
>
>
