Jon/Allan,

Initial testing shows that Allan's patch works. I would like to get a little
more test time on it before declaring it "good". I will give an update next
week.
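
For anyone following the thread, here is a minimal user-space sketch of the
check I am testing (Allan's version, quoted below). It is not the TIPC
source: struct msg_model, enum model_user, should_discard() and
model_selfcheck() are hypothetical stand-ins for the kernel's msg_isdata(),
msg_user(), msg_destnode() and tipc_own_addr, and treating "data" as a single
message class is a simplifying assumption. It only models the decision
itself, so the logic can be exercised without rebuilding the module.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical message classes standing in for TIPC's msg_user() values.
 * The numeric values are illustrative, not the kernel's. */
enum model_user {
    MODEL_DATA,           /* ordinary payload message */
    MODEL_LINK_PROTOCOL,  /* e.g. the periodic link RESET messages */
    MODEL_CONN_MANAGER,
    MODEL_MSG_FRAGMENTER
};

/* Only the two header fields that the check looks at. */
struct msg_model {
    enum model_user user;
    uint32_t destnode;
};

/* Returns true when the message would be dropped (the "goto cont" path). */
bool should_discard(const struct msg_model *m, uint32_t own_addr)
{
    bool is_data = (m->user == MODEL_DATA);

    if (!is_data && m->destnode != own_addr) {
        /* Connection manager messages and fragments may still be routed. */
        if (m->user != MODEL_CONN_MANAGER &&
            m->user != MODEL_MSG_FRAGMENTER)
            return true;
    }
    return false;
}

/* Replays the scenario from this thread: node 3 receives a RESET that was
 * addressed to node 2. Small integers stand in for <1.1.2> and <1.1.3>. */
void model_selfcheck(void)
{
    const uint32_t node2 = 2, node3 = 3;
    struct msg_model stale_reset = { MODEL_LINK_PROTOCOL, node2 };
    struct msg_model own_reset   = { MODEL_LINK_PROTOCOL, node3 };
    struct msg_model routed_frag = { MODEL_MSG_FRAGMENTER, node2 };

    assert(should_discard(&stale_reset, node3));   /* stale RESET dropped */
    assert(!should_discard(&own_reset, node3));    /* own traffic kept */
    assert(!should_discard(&routed_frag, node3));  /* fragment may be routed */
}
```

Calling model_selfcheck() from a small main() runs the three cases. The
stale RESET is the one that matters for my problem: with the check in place
it is dropped before it can reset the <1.1.1>-<1.1.3> link.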

Richard

On 2/8/08, Jon Maloy <[EMAIL PROTECTED]> wrote:
>
>  I looked into Allan's patch, and it seems correct to me. The one I sent
> was picked directly from linux-2.6.23 (tipc-1.6), which is not prepared
> for packet routing at all.
> So, if Richard can confirm that Allan's patch is working, that is the one
> we go for.
>
> ///jon
>
>
>  ------------------------------
> *From:* Richard Lopez [mailto:[EMAIL PROTECTED]
> *Sent:* February 8, 2008 1:40 PM
> *To:* Stephens, Allan
> *Cc:* Jon Maloy; [email protected]
> *Subject:* Re: [tipc-discussion] Stale Links
>
> Jon/Allan,
>
> I removed my change and implemented Jon's suggested change and it fixed
> the problem I was seeing. Next, I will retest with Allan's version of the
> change.
>
> Richard.
>
> On 2/8/08, Stephens, Allan <[EMAIL PROTECTED]> wrote:
> >
> > Hi Jon/Richard:
> >
> > It looks like Jon is correct in observing that this problem was actually
> > introduced in TIPC 1.7.  It appears that the check he identified was
> > first updated (by him) to allow for the handling of routed messages, and
> > then mysteriously dropped (by me) during the update which first
> > introduced the link_recv_buf_validate() routine.
> >
> > At the time the check was removed it looked like the following:
> >
> >         /* Discard non-routeable messages destined for another node */
> >
> >         if (unlikely(!msg_isdata(msg) &&
> >                      (msg_destnode(msg) != tipc_own_addr))) {
> >                 if (msg_user(msg) != CONN_MANAGER)
> >                         goto cont;
> >         }
> >
> > However, I think that this code is no longer sufficient, since it
> > doesn't allow TIPC to route message fragments.  I think the updated
> > check would need to be:
> >
> >         /* Discard non-routeable messages destined for another node */
> >
> >         if (unlikely(!msg_isdata(msg) &&
> >                      (msg_destnode(msg) != tipc_own_addr))) {
> >                 if ((msg_user(msg) != CONN_MANAGER) &&
> >                     (msg_user(msg) != MSG_FRAGMENTER))
> >                         goto cont;
> >         }
> >
> > Jon, let me know if you agree with this analysis.  Richard, feel free to
> > try this fix out, too, once you've completed your previous testing.  The
> > fix will need fuller verification to ensure that it is really correct
> > (and doesn't break anything that is currently working), but I'm totally
> > swamped at the moment and don't have time to do it right now.
> >
> > Regards,
> > Al
> >
> > -----Original Message-----
> > From: Jon Maloy [mailto:[EMAIL PROTECTED]
> > Sent: Thursday, February 07, 2008 7:32 PM
> > To: Stephens, Allan
> > Cc: Richard Lopez; [email protected]
> > Subject: Re: [tipc-discussion] Stale Links
> >
> > Richard and Allan,
> > See my comment below.
> >
> > Regards
> > ///jon
> >
> > Stephens, Allan wrote:
> > > Hi Richard:
> > >
> > > This certainly sounds like a bug in TIPC.  If so, I think it's a long
> > > standing one.  TIPC has been updated a number of times over the last
> > > couple of years to handle issues relating to link failure and the
> > > replacement of network hardware, and I believe it works fine if:
> > > a) you replace the MAC address used by a node with a new MAC address,
> > > or b) if you reuse a MAC address on another node after first doing a);
> > > however, I'm not convinced that we've covered the case where a MAC
> > > address is reused on another node without first doing a).
> > >
> > > In answer to your first question, the reason that TIPC doesn't delete
> > > a link endpoint when connectivity is lost is because the link endpoint
> > > is usually needed again within a short while because connectivity is
> > > restored.  (For example, consider the case where a node crashes and
> > > then reboots, or the case where someone disconnects a cable and then
> > > reconnects it.)  Leaving the link endpoint around allows TIPC to
> > > restore communication quickly, without incurring the significant delay
> > > that can occur in some cases with TIPC's auto-discovery mechanism.
> > >
> > > In answer to your second question, there is no current plan to fix
> > > this issue ... but then we didn't know the problem existed until
> > > today! :-)  The fix you propose doesn't strike me as being the correct
> > > thing to do, because it interferes with the ability of TIPC to
> > > immediately restore a failed link.  (However, it does appear to solve
> > > your problem.)  It would probably be preferable to introduce some sort
> > > of timeout mechanism into the link code that would defer this kind of
> > > MAC address "erasing" for a reasonable period of time.  However, I'm
> > > not sure what constitutes "reasonable".
> > I agree with that, a timeout mechanism to remove dangling links would at
> > least save some memory.
> > But the problem we have at hand here seems to be a bug introduced in
> > tipc-1.7, and can easily be fixed with a simple test.
> >
> > The scenario I see is the following:
> >
> > 1) Link <1.1.1>-<1.1.3> gets established.
> > 2) The old link object for <1.1.1>-<1.1.2> continues to send periodic
> >    RESET messages towards the non-existent destination <1.1.2>/MAC_B.
> > 3) Node <1.1.3> receives these messages because the MAC address matches,
> >    but is supposed to throw them away in tipc_recv_msg(), because their
> >    destination <1.1.2> doesn't match the node's current address. There is
> >    such a check at the beginning of tipc_recv_msg() in tipc-1.6
> >    (linux-2.6.23), but in tipc-1.7 this check seems to have disappeared.
> > 4) The received RESET messages cause link <1.1.1>-<1.1.3> to wobble up
> >    and down, and not much traffic will go through. tipc-config -l will
> >    show the link as "up" (most of the time), but occasionally as "down".
> >
> >
> > You can easily verify this by:
> > 1) Looking into syslog. You will see "Established link" and "Lost link"
> >    messages for this link at regular intervals.
> > 2) Adding the following lines at line 1839 in tipc_link.c
> >         if (unlikely(!msg_short(msg) &&
> >                      (msg_destnode(msg) != tipc_own_addr)))
> >                 goto cont;
> >    and recompiling. If it works now, my theory is confirmed.
> > >
> > > I've copied Jon Maloy on this email as he has a lot more history with
> > > TIPC's link code than I do, and understands better why things are
> > > coded the way they are.  I'd be interested in hearing his thoughts on
> > > this matter ...
> > >
> > > Regards,
> > > Al
> > >
> > > ----------------------------------------------------------------------
> > > --
> > > *From:* [EMAIL PROTECTED]
> > > [mailto:[EMAIL PROTECTED] *On Behalf Of
> > > *Richard Lopez
> > > *Sent:* Thursday, February 07, 2008 10:44 AM
> > > *To:* [email protected]
> > > *Subject:* [tipc-discussion] Stale Links
> > >
> > > Configuration:
> > > - Linux Kernel 2.6.20
> > > - TIPC 1.7.5
> > >
> > > I am using TIPC in a chassis based environment. Each card in the
> > > chassis runs TIPC and the Node address for a card is based on the slot
> >
> > > it is plugged. For example, a card in slot 3 would have a node address
> >
> > > of <1.1.3>.
> > >
> > > > The problem I observe occurs when I start to move around cards that
> > > are plugged into the chassis. The TIPC port associated with the card
> > > that is moved never gets created. Here is a simple example:
> > > > Card A in slot 1 has node address <1.1.1> and MAC address MAC_A
> > > > Card B in slot 2 has node address <1.1.2> and MAC address MAC_B
> > >
> > > If I move Card B to slot 3 (node address is now <1.1.3> and MAC
> > > address is still MAC_B) the link shows "up" when I run tipc-config -l
> > > on Card A and Card B, but the TIPC port never gets created
> > > (tipc-config -p) and applications using TIPC cannot send messages.
> > >
> > > I believe the problem is caused by the fact that the original link
> > > between <1.1.1> and <1.1.2> never gets deleted. When I perform a
> > > tipc-config -l on Card A I see a link between <1.1.1> and <1.1.2>
> > > "down", and a link between <1.1.1> and <1.1.3> "up". Instrumenting the
> >
> > > code it appears that the original link still has MAC_B as the media
> > > address and this causes problems when Card B comes up with the same
> > > MAC but on a different link and Node address.
> > >
> > > Question 1: There seems to be a tipc_link_delete function but it does
> > > not seem to be used when communication is lost with a peer. Why are
> > > links not deleted?
> > >
> > > > I was able to work around this issue by clearing the media address
> > > > before resetting the link when a peer is not responding, see below. I
> > > > saw something similar done in tipc_discover.c, along with this comment:
> > > >      * TODO: It might be better to delete these "stale" link endpoints,
> > > >      * but this could be tricky [see tipc_link_delete()].
> > >
> > > > Question 2: Is there a plan to clean up stale links in a future
> > > release? If not, is the change shown below an appropriate fix?
> > >
> > > Change:
> > > ===================================================================
> > > --- linux-2.6.20.7/net/tipc/tipc_link.c (original copy)
> > > +++ linux-2.6.20.7/net/tipc/tipc_link.c (working copy)
> > > @@ -836,6 +836,12 @@
> > >                                          l_ptr->fsm_msg_cnt);
> > > >                                 warn("Resetting link <%s>, peer not responding\n",
> > > >                                      l_ptr->name);
> > > > +                memset(&l_ptr->media_addr, 0, sizeof(struct tipc_media_addr));
> > >                                 tipc_link_reset(l_ptr);
> > >                                 l_ptr->state = RESET_UNKNOWN;
> > >                                 l_ptr->fsm_msg_cnt = 0;
> > >
> > >
> > > Richard Lopez
> >
> >
>
>
> --
> Richard Lopez
> Cyan
> (707) 338-9678
> [EMAIL PROTECTED]
>



-- 
Richard Lopez
Cyan
(707) 338-9678
[EMAIL PROTECTED]
_______________________________________________
tipc-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/tipc-discussion