Jon/Allan,
I removed my change and implemented Jon's suggested change and it fixed the
problem I was seeing. Next, I will retest with Allan's version of the
change.
Richard.
On 2/8/08, Stephens, Allan <[EMAIL PROTECTED]> wrote:
>
> Hi Jon/Richard:
>
> It looks like Jon is correct in observing that this problem was actually
> introduced in TIPC 1.7. It appears that the check he identified was
> first updated (by him) to allow for the handling of routed messages, and
> then mysteriously dropped (by me) during the update which first
> introduced the link_recv_buf_validate() routine.
>
> At the time the check was removed it looked like the following:
>
> /* Discard non-routeable messages destined for another node */
>
> if (unlikely(!msg_isdata(msg) &&
>              (msg_destnode(msg) != tipc_own_addr))) {
>         if (msg_user(msg) != CONN_MANAGER)
>                 goto cont;
> }
>
> However, I think that this code is no longer sufficient, since it
> doesn't allow TIPC to route message fragments. I think the updated
> check would need to be:
>
> /* Discard non-routeable messages destined for another node */
>
> if (unlikely(!msg_isdata(msg) &&
>              (msg_destnode(msg) != tipc_own_addr))) {
>         if ((msg_user(msg) != CONN_MANAGER) &&
>             (msg_user(msg) != MSG_FRAGMENTER))
>                 goto cont;
> }
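To make the difference between the two checks concrete, here is a minimal standalone sketch of both predicates. It is not the TIPC source: the struct, the sketch_ names, and the enum values are hypothetical stand-ins for msg_user(), msg_destnode(), and msg_isdata(), and msg_isdata() is assumed here to mean simply "the message user is a data user".

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical message users; the real TIPC constants have other values. */
enum sketch_user {
    SK_USER_DATA,
    SK_USER_LINK_PROTOCOL,
    SK_USER_CONN_MANAGER,
    SK_USER_MSG_FRAGMENTER
};

/* Hypothetical stand-in for the header fields the check looks at. */
struct sketch_msg {
    enum sketch_user user;   /* what msg_user() would return */
    uint32_t destnode;       /* what msg_destnode() would return */
};

/* Assumed meaning of msg_isdata(): the message carries application data. */
bool sketch_isdata(const struct sketch_msg *m)
{
    return m->user == SK_USER_DATA;
}

/* The check as quoted at the time it was removed: returns true when the
 * message would be discarded ("goto cont"). */
bool sketch_discard_old(const struct sketch_msg *m, uint32_t own_addr)
{
    if (!sketch_isdata(m) && m->destnode != own_addr)
        return m->user != SK_USER_CONN_MANAGER;
    return false;
}

/* The proposed update: fragments for another node are also let through. */
bool sketch_discard_updated(const struct sketch_msg *m, uint32_t own_addr)
{
    if (!sketch_isdata(m) && m->destnode != own_addr)
        return m->user != SK_USER_CONN_MANAGER &&
               m->user != SK_USER_MSG_FRAGMENTER;
    return false;
}
```

With these stand-ins, a fragment addressed to another node is discarded by the old check but passed on by the updated one, while a link protocol message (such as a RESET) addressed to another node is discarded by both.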
>
> Jon, let me know if you agree with this analysis. Richard, feel free to
> try this fix out, too, once you've completed your previous testing. The
> fix will need fuller verification to ensure that it is really correct
> (and doesn't break anything that is currently working), but I'm totally
> swamped at the moment and don't have time to do it right now.
>
> Regards,
> Al
>
> -----Original Message-----
> From: Jon Maloy [mailto:[EMAIL PROTECTED]
> Sent: Thursday, February 07, 2008 7:32 PM
> To: Stephens, Allan
> Cc: Richard Lopez; [email protected]
> Subject: Re: [tipc-discussion] Stale Links
>
> Richard and Allan,
> See my comment below.
>
> Regards
> ///jon
>
> Stephens, Allan wrote:
> > Hi Richard:
> >
> > This certainly sounds like a bug in TIPC. If so, I think it's a long
> > standing one. TIPC has been updated a number of times over the last
> > couple of years to handle issues relating to link failure and the
> > replacement of network hardware, and I believe it works fine if:
> > a) you replace the MAC address used by a node with a new MAC address,
> > or b) you reuse a MAC address on another node after first doing a);
> > however, I'm not convinced that we've covered the case where a MAC
> > address is reused on another node without first doing a).
> >
> > In answer to your first question, the reason that TIPC doesn't delete
> > a link endpoint when connectivity is lost is that the link endpoint
> > is usually needed again within a short while because connectivity is
> > restored. (For example, consider the case where a node crashes and
> > then reboots, or the case where someone disconnects a cable and then
> > reconnects it.) Leaving the link endpoint around allows TIPC to
> > restore communication quickly, without incurring the significant delay
> > that can occur in some cases with TIPC's auto-discovery mechanism.
> >
> > In answer to your second question, there is no current plan to fix
> > this issue ... but then we didn't know the problem existed until
> > today! :-) The fix you propose doesn't strike me as being the correct
> > thing to do, because it interferes with the ability of TIPC to
> > immediately restore a failed link. (However, it does appear to solve
> > your problem.) It would probably be preferable to introduce some sort
> > of timeout mechanism into the link code that would defer this kind of
> > MAC address "erasing" for a reasonable period of time. However, I'm
> > not sure what constitutes "reasonable".
> I agree with that; a timeout mechanism to remove dangling links would at
> least save some memory.
> But the problem we have at hand here seems to be a bug introduced in
> tipc-1.7, and it can easily be fixed with a simple test.
>
> The scenario I see is the following:
>
> 1) Link <1.1.1>-<1.1.3> gets established.
> 2) The old link object <1.1.1>-<1.1.2> continues to send periodic
>    RESET messages towards the non-existing destination <1.1.2>/MAC_B.
> 3) Node <1.1.3> receives these messages because the MAC address matches,
>    but is supposed to throw them away in tipc_recv_msg(), because their
>    destination <1.1.2> doesn't match the node's current address. There
>    is such a check at the beginning of tipc_recv_msg() in tipc-1.6
>    (linux-2.6.23), but in tipc-1.7 this check seems to have disappeared.
> 4) The received RESET messages cause link <1.1.1>-<1.1.3> to wobble up
>    and down, and not much traffic will go through. tipc-config -l will
>    show the link as "up" (most of the time), but occasionally as "down".
>
>
> You can easily verify this by:
> 1) Looking into syslog. You will see "Established link" and "Lost link"
>    messages for this link at regular intervals.
> 2) Adding the following lines at line 1839 in tipc_link.c
>        if (unlikely(!msg_short(msg) &&
>                     (msg_destnode(msg) != tipc_own_addr)))
>                goto cont;
>    and recompiling. If it works now, my theory is confirmed.
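The effect of this test on the scenario above can be sketched in a few lines of standalone C. This is only a model under stated assumptions: struct sk_frame and the sk_ functions are hypothetical, the 24-byte short header size is assumed, and msg_short() is taken to mean "the header is the short form, which has no destination node field".

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed size of a short header, which carries no destination node. */
#define SK_SHORT_H_SIZE 24u

/* Hypothetical model of a frame that has already passed the MAC filter. */
struct sk_frame {
    uint32_t hdr_size;   /* header size in bytes */
    uint32_t destnode;   /* destination node; only valid for long headers */
    bool is_reset;       /* true if the frame is a link RESET message */
};

/* Model of the suggested test: drop long-header messages whose
 * destination node is not this node. */
bool sk_discard(const struct sk_frame *f, uint32_t own_addr)
{
    return f->hdr_size != SK_SHORT_H_SIZE && f->destnode != own_addr;
}

/* Count the RESET messages that would reach the link state machine,
 * with or without the test in place. */
unsigned sk_resets_seen(const struct sk_frame *frames, size_t n,
                        uint32_t own_addr, bool use_check)
{
    unsigned resets = 0;

    for (size_t i = 0; i < n; i++) {
        if (use_check && sk_discard(&frames[i], own_addr))
            continue;               /* corresponds to "goto cont" */
        if (frames[i].is_reset)
            resets++;
    }
    return resets;
}
```

In this model, a RESET addressed to <1.1.2> that arrives at node <1.1.3> (same MAC) is counted when the test is absent and skipped when it is present, which is the behaviour difference described between tipc-1.7 and tipc-1.6.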
> >
> > I've copied Jon Maloy on this email as he has a lot more history with
> > TIPC's link code than I do, and understands better why things are
> > coded the way they are. I'd be interested in hearing his thoughts on
> > this matter ...
> >
> > Regards,
> > Al
> >
> > ------------------------------------------------------------------------
> > *From:* [EMAIL PROTECTED]
> > [mailto:[EMAIL PROTECTED] *On Behalf Of
> > *Richard Lopez
> > *Sent:* Thursday, February 07, 2008 10:44 AM
> > *To:* [email protected]
> > *Subject:* [tipc-discussion] Stale Links
> >
> > Configuration:
> > - Linux Kernel 2.6.20
> > - TIPC 1.7.5
> >
> > I am using TIPC in a chassis-based environment. Each card in the
> > chassis runs TIPC and the node address for a card is based on the slot
> > it is plugged into. For example, a card in slot 3 would have a node
> > address of <1.1.3>.
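For readers unfamiliar with the <Z.C.N> notation, the slot-based addressing described above could be sketched as follows. The 8/12/12-bit zone/cluster/node packing follows the classic TIPC address layout; the helper names and the fixed zone and cluster of 1 are assumptions made for this illustration, not details taken from the configuration described here.

```c
#include <stdint.h>

/* Pack a <zone.cluster.node> triple into a 32-bit TIPC-style address:
 * 8 bits of zone, 12 bits of cluster, 12 bits of node. */
uint32_t sk_tipc_addr(uint32_t zone, uint32_t cluster, uint32_t node)
{
    return (zone << 24) | (cluster << 12) | node;
}

/* Hypothetical helper: derive a card's node address from its slot number,
 * assuming every card lives in zone 1, cluster 1. */
uint32_t sk_addr_for_slot(uint32_t slot)
{
    return sk_tipc_addr(1, 1, slot);
}
```

Under these assumptions, moving a card from slot 2 to slot 3 changes its address from sk_addr_for_slot(2) to sk_addr_for_slot(3) while its MAC address stays the same.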
> >
> > The problem I observe occurs when I start to move around cards that
> > are plugged into the chassis. The TIPC port associated with the card
> > that is moved never gets created. Here is a simple example:
> > Card A in slot 1 has node address <1.1.1> and MAC address MAC_A.
> > Card B in slot 2 has node address <1.1.2> and MAC address MAC_B.
> >
> > If I move Card B to slot 3 (node address is now <1.1.3> and MAC
> > address is still MAC_B) the link shows "up" when I run tipc-config -l
> > on Card A and Card B, but the TIPC port never gets created
> > (tipc-config -p) and applications using TIPC cannot send messages.
> >
> > I believe the problem is caused by the fact that the original link
> > between <1.1.1> and <1.1.2> never gets deleted. When I perform a
> > tipc-config -l on Card A I see a link between <1.1.1> and <1.1.2>
> > "down", and a link between <1.1.1> and <1.1.3> "up". After instrumenting
> > the code, it appears that the original link still has MAC_B as the media
> > address, and this causes problems when Card B comes up with the same
> > MAC but on a different link and node address.
> >
> > Question 1: There seems to be a tipc_link_delete function but it does
> > not seem to be used when communication is lost with a peer. Why are
> > links not deleted?
> >
> > I was able to work around this issue by clearing the media address
> > before resetting the link when a peer is not responding; see below. I
> > saw something similar done in tipc_discover.c, along with this comment:
> > * TODO: It might be better to delete these "stale" link endpoints,
> > * but this could be tricky [see tipc_link_delete()].
> >
> > Question 2: Is there a plan to clean up stale links in a future
> > release? If not, is the change shown below an appropriate fix?
> >
> > Change:
> > ===================================================================
> > --- linux-2.6.20.7/net/tipc/tipc_link.c (original copy)
> > +++ linux-2.6.20.7/net/tipc/tipc_link.c (working copy)
> > @@ -836,6 +836,12 @@
> > l_ptr->fsm_msg_cnt);
> > warn("Resetting link <%s>, peer not responding\n",
> > l_ptr->name);
> > + memset(&l_ptr->media_addr, 0, sizeof(struct tipc_media_addr));
> > tipc_link_reset(l_ptr);
> > l_ptr->state = RESET_UNKNOWN;
> > l_ptr->fsm_msg_cnt = 0;
> >
> >
> > Richard Lopez
>
>
--
Richard Lopez
Cyan
(707) 338-9678
[EMAIL PROTECTED]
_______________________________________________
tipc-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/tipc-discussion