> -----Original Message-----
> From: Hamish Martin [mailto:[email protected]]
> Sent: Sunday, 01 May, 2016 16:44
> To: Jon Maloy; Jon Maloy
> Cc: [email protected]; Ying Xue; Xue Ying
> ([email protected]); Richard Alpe; Parthasarathy Bhuvaragan
> Subject: Re: [PATCH] tipc: Only process unicast on intended node
> 
> Hi Jon,
> 
> The broadcast transmitters were definitely the "steady" ones that
> weren't being rebooted. I can't rule out the other nodes having the
> issue too, but since it was so hard to find I followed the reproduction
> path I could see most easily. In that case it was a "steady" node that
> was transmitting to a new node and the "steady" node processed the ack
> from some other node as though it came from the new node.

Do you have a wireshark dump? Matching MAC address to TIPC node would show 
where these messages really came from, i.e. whether it is dest_node or prev_node 
that is wrong.
Also, did you really start new machines after you had removed the old ones, or 
did you just re-use the same machine/VM with a new tipc node id?

> 
> Also, I found this on v4.4.6 so I recommend, given the seriousness of
> the symptoms, that it be put in the 4.4 and 4.5 branches too.

That is why I posted this to 'net'. Then, it will be applied as far back as 
possible, but also "forward" into the ongoing 4.6 cycle in the near future.

///jon

> 
> Thanks,
> Hamish M.
> 
> On 04/29/2016 11:33 PM, Jon Maloy wrote:
> >
> >
> > On 04/28/2016 10:20 PM, Hamish Martin wrote:
> >> Hi Jon,
> >>
> >> Yes it was very difficult to track! Unfortunately I don't know why they
> >> get onto the wrong link in the first place.
> >>
> >> I agree the root problem would be good to find, but I am limited in both
> >> my understanding of TIPC and my ability to track it further.
> >> To reproduce this I had a bunch of nodes (7 all up). I effectively turn
> >> off two of them at a time and then turn them back on. When they are back
> >> in the cluster I restart another different two. Three of the nodes
> >> remain up at all times. It is on one of those three nodes that I
> >> eventually see the problem. Perhaps someone more experienced could try
> >> something like that and see if they can trigger the issue.
> >>
> >> In short, this is an ambulance at the bottom of the cliff rather than
> >> your desired fence at the top. I am happy to help test any theories or
> >> assist in further describing the test that shows it for us.
> >
> > I understand. This test is legitimate to do even when we later find
> > the root problem.
> > A further question: which nodes were the broadcast transmitters? All?
> > Any of the new ones? Any of the "steady" ones?
> >
> > ///jon
> >
> >>
> >> Thanks,
> >> Hamish Martin.
> >>
> >>
> >> On 04/29/2016 02:09 PM, Jon Maloy wrote:
> >>> (Removed netdev from list, added some others).
> >>>
> >>> This is interesting, and it must have been hard to track.  But I
> >>> would really like to know the real reason why this happens, so we
> >>> can catch the root problem. Broadcast ACK messages are just ordinary
> >>> STATE messages, and should have a correct destination address. Did
> >>> you find out where these messages really came from, and why they
> >>> have wrong destination addresses?
> >>>
> >>> ///jon
> >>>
> >>>> -----Original Message-----
> >>>> From: Hamish Martin [mailto:[email protected]]
> >>>> Sent: Thursday, 28 April, 2016 21:35
> >>>> To: Jon Maloy; [email protected]
> >>>> Cc: Hamish Martin
> >>>> Subject: [PATCH] tipc: Only process unicast on intended node
> >>>>
> >>>> We have observed complete lock up of broadcast-link transmission due
> >>>> to unacknowledged packets never being removed from the 'transmq'
> >>>> queue. This is traced to nodes having their ack field set beyond the
> >>>> sequence number of packets that have actually been transmitted to
> >>>> them.
> >>>> Consider an example where node 1 has sent 10 packets to node 2 on a
> >>>> link and node 3 has sent 20 packets to node 2 on another link. We see
> >>>> examples of an ack from node 2 destined for node 3 being treated as
> >>>> an ack from node 2 at node 1. This leads to the ack on the node 1 to
> >>>> node 2 link being increased to 20 even though we have only sent 10
> >>>> packets. When node 1 does get around to sending further packets, none
> >>>> of the packets with sequence numbers less than 21 are actually
> >>>> removed from the transmq.
> >>>> To resolve this we reinstate some code lost in commit d999297c3dbb
> >>>> ("tipc: reduce locking scope during packet reception") which ensures
> >>>> that only messages destined for the receiving node are processed by
> >>>> that node. This prevents the sequence numbers from getting out of
> >>>> sync and resolves the packet leakage, thereby resolving the
> >>>> broadcast-link transmission lock-ups we observed.
> >>>>
> >>>> Signed-off-by: Hamish Martin <[email protected]>
> >>>> Reviewed-by: Chris Packham <[email protected]>
> >>>> Reviewed-by: John Thompson <[email protected]>
> >>>> ---
> >>>>    net/tipc/node.c | 5 +++++
> >>>>    1 file changed, 5 insertions(+)
> >>>>
> >>>> diff --git a/net/tipc/node.c b/net/tipc/node.c
> >>>> index ace178fd3850..e5dda495d4b6 100644
> >>>> --- a/net/tipc/node.c
> >>>> +++ b/net/tipc/node.c
> >>>> @@ -1460,6 +1460,11 @@ void tipc_rcv(struct net *net, struct sk_buff *skb, struct tipc_bearer *b)
> >>>>                return tipc_node_bc_rcv(net, skb, bearer_id);
> >>>>        }
> >>>>
> >>>> +    /* Discard unicast link messages destined for another node */
> >>>> +    if (unlikely(!msg_short(hdr) &&
> >>>> +             (msg_destnode(hdr) != tipc_own_addr(net))))
> >>>> +        goto discard;
> >>>> +
> >>>>        /* Locate neighboring node that sent packet */
> >>>>        n = tipc_node_find(net, msg_prevnode(hdr));
> >>>>        if (unlikely(!n))
> >>>> --
> >>>> 2.8.1
> >
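
The transmq leak described in the commit message above can be sketched with a 
toy model. To be clear, this is only an illustration: the Link class, on_ack() 
logic and rcv() guard below are simplified assumptions made for the sketch, 
not the kernel's actual data structures or API.

```python
# Toy model (assumed names, not kernel code) of the transmq leak.
from collections import deque

class Link:
    """Simplified stand-in for one TIPC unicast link endpoint."""
    def __init__(self):
        self.next_seqno = 1
        self.last_ack = 0           # highest seqno acknowledged so far
        self.transmq = deque()      # packets sent but not yet acknowledged

    def send(self, n):
        for _ in range(n):
            self.transmq.append(self.next_seqno)
            self.next_seqno += 1

    def on_ack(self, acked):
        # An ack that does not advance last_ack is treated as stale and
        # releases nothing -- this is what turns one misapplied ack into
        # a permanent leak.
        if acked <= self.last_ack:
            return
        self.last_ack = acked
        while self.transmq and self.transmq[0] <= acked:
            self.transmq.popleft()

def rcv(link, own_addr, destnode, acked):
    """Receive path with the patch applied: discard unicast messages
    destined for another node before they touch the link state."""
    if destnode != own_addr:
        return                      # the reinstated destination check
    link.on_ack(acked)

# Without the check: node 1 has sent 10 packets to node 2 ...
buggy = Link()
buggy.send(10)                      # seqnos 1..10 queued
buggy.on_ack(20)                    # node 2's ack *to node 3* misapplied
buggy.send(10)                      # seqnos 11..20 queued
buggy.on_ack(15)                    # genuine ack from node 2: stale, ignored
assert list(buggy.transmq) == list(range(11, 21))   # leaked forever

# With the check: the misdirected ack never reaches the link.
fixed = Link()
fixed.send(10)
rcv(fixed, own_addr=1, destnode=3, acked=20)        # discarded
rcv(fixed, own_addr=1, destnode=1, acked=5)         # processed normally
assert list(fixed.transmq) == [6, 7, 8, 9, 10]
```

In the buggy case the genuine ack of 15 releases nothing because last_ack was 
already pushed to 20, so seqnos 11..20 sit in transmq indefinitely, matching 
the lock-up Hamish describes.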

_______________________________________________
tipc-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/tipc-discussion
