Allan,

The patch looks good from my side. Thanks again for the help.

Richard

On 2/8/08, Stephens, Allan <[EMAIL PROTECTED]> wrote:
>
>  Thanks, Richard.  I was able to squeeze in a few minutes today to do some
> sanity testing, and was able to confirm that it fixed the problem you
> encountered.  I also showed that the TIPC test suite ran correctly, thereby
> proving that the fix doesn't interfere with the forwarding of fragmented
> messages.
>
> Once I get your confirmation, I'll post the patch on the TIPC website.
>
> -- Al
>
>  ------------------------------
> *From:* Richard Lopez [mailto:[EMAIL PROTECTED]
> *Sent:* Friday, February 08, 2008 4:44 PM
> *To:* Jon Maloy
> *Cc:* Stephens, Allan; [email protected]
> *Subject:* Re: [tipc-discussion] Stale Links
>
> Jon/Allan,
>
> Initial testing shows that Allan's patch works. I would like to get a
> little more test time on it before declaring it "good". I will give an
> update next week.
>
> Richard
>
> On 2/8/08, Jon Maloy <[EMAIL PROTECTED]> wrote:
> >
> >  I looked into Allan's patch, and it seems correct to me. The one I sent
> > was picked directly from linux-2.6.23 (tipc-1.6), which is not prepared
> > for packet routing at all.
> > So, if Richard can confirm that Allan's patch is working, that is the
> > one we go for.
> >
> > ///jon
> >
> >
> >  ------------------------------
> > *From:* Richard Lopez [mailto:[EMAIL PROTECTED]
> > *Sent:* February 8, 2008 1:40 PM
> > *To:* Stephens, Allan
> > *Cc:* Jon Maloy; [email protected]
> > *Subject:* Re: [tipc-discussion] Stale Links
> >
> >  Jon/Allan,
> >
> > I removed my change and implemented Jon's suggested change and it fixed
> > the problem I was seeing. Next, I will retest with Allan's version of the
> > change.
> >
> > Richard.
> >
> > On 2/8/08, Stephens, Allan <[EMAIL PROTECTED]> wrote:
> > >
> > > Hi Jon/Richard:
> > >
> > > > It looks like Jon is correct in observing that this problem was
> > > > actually introduced in TIPC 1.7.  It appears that the check he
> > > > identified was first updated (by him) to allow for the handling of
> > > > routed messages, and then mysteriously dropped (by me) during the
> > > > update which first introduced the link_recv_buf_validate() routine.
> > >
> > > At the time the check was removed it looked like the following:
> > >
> > >         /* Discard non-routeable messages destined for another node */
> > >
> > >         if (unlikely(!msg_isdata(msg) && (msg_destnode(msg) !=
> > > tipc_own_addr))) {
> > >                 if (msg_user(msg) != CONN_MANAGER)
> > >                         goto cont;
> > >         }
> > >
> > > However, I think that this code is no longer sufficient, since it
> > > > doesn't allow TIPC to route message fragments.  I think the updated
> > > check would need to be:
> > >
> > >         /* Discard non-routeable messages destined for another node */
> > >
> > >         if (unlikely(!msg_isdata(msg) && (msg_destnode(msg) !=
> > > tipc_own_addr))) {
> > >                 if ((msg_user(msg) != CONN_MANAGER) && (msg_user(msg)
> > > !=
> > > MSG_FRAGMENTER))
> > >                         goto cont;
> > >         }
> > >
> > > > Jon, let me know if you agree with this analysis.  Richard, feel free
> > > > to try this fix out, too, once you've completed your previous
> > > > testing.  The fix will need fuller verification to ensure that it is
> > > > really correct (and doesn't break anything that is currently working),
> > > > but I'm totally swamped at the moment and don't have time to do it
> > > > right now.
> > >
> > > Regards,
> > > Al
> > >
> > > -----Original Message-----
> > > From: Jon Maloy [mailto:[EMAIL PROTECTED]
> > > Sent: Thursday, February 07, 2008 7:32 PM
> > > To: Stephens, Allan
> > > Cc: Richard Lopez; [email protected]
> > > Subject: Re: [tipc-discussion] Stale Links
> > >
> > > Richard and Allan,
> > > See my comment below.
> > >
> > > Regards
> > > ///jon
> > >
> > > Stephens, Allan wrote:
> > > > Hi Richard:
> > > >
> > > > > This certainly sounds like a bug in TIPC.  If so, I think it's a
> > > > > long-standing one.  TIPC has been updated a number of times over the
> > > > > last couple of years to handle issues relating to link failure and
> > > > > the replacement of network hardware, and I believe it works fine if:
> > > > > a) you replace the MAC address used by a node with a new MAC address,
> > > > > or b) you reuse a MAC address on another node after first doing a);
> > > > > however, I'm not convinced that we've covered the case where a MAC
> > > > > address is reused on another node without first doing a).
> > > >
> > > > > In answer to your first question, the reason that TIPC doesn't
> > > > > delete a link endpoint when connectivity is lost is that the link
> > > > > endpoint is usually needed again within a short while because
> > > > > connectivity is restored.  (For example, consider the case where a
> > > > > node crashes and then reboots, or the case where someone disconnects
> > > > > a cable and then reconnects it.)  Leaving the link endpoint around
> > > > > allows TIPC to restore communication quickly, without incurring the
> > > > > significant delay that can occur in some cases with TIPC's
> > > > > auto-discovery mechanism.
> > > >
> > > > > In answer to your second question, there is no current plan to fix
> > > > > this issue ... but then we didn't know the problem existed until
> > > > > today! :-)  The fix you propose doesn't strike me as being the
> > > > > correct thing to do, because it interferes with the ability of TIPC
> > > > > to immediately restore a failed link.  (However, it does appear to
> > > > > solve your problem.)  It would probably be preferable to introduce
> > > > > some sort of timeout mechanism into the link code that would defer
> > > > > this kind of MAC address "erasing" for a reasonable period of time.
> > > > > However, I'm not sure what constitutes "reasonable".
> > > > I agree with that; a timeout mechanism to remove dangling links would
> > > > at least save some memory.
> > > > But the problem we have at hand here seems to be a bug introduced in
> > > > tipc-1.7, and it can easily be fixed with a simple test.
> > >
> > > The scenario I see is the following:
> > >
> > > > 1) Link <1.1.1>-<1.1.3> gets established.
> > > > 2) The old link object for <1.1.1>-<1.1.2> continues to send periodic
> > > >    RESET messages towards the non-existent destination <1.1.2>/MAC_B.
> > > > 3) Node <1.1.3> receives these messages because the MAC address
> > > >    matches, but is supposed to throw them away in tipc_recv_msg(),
> > > >    because their destination <1.1.2> doesn't match the node's current
> > > >    address. There is such a check at the beginning of tipc_recv_msg()
> > > >    in tipc-1.6 (linux-2.6.23), but in tipc-1.7 this check seems to
> > > >    have disappeared.
> > > > 4) The received RESET messages cause link <1.1.1>-<1.1.3> to wobble up
> > > >    and down, and not much traffic will go through. tipc-config -l will
> > > >    show the link as "up" (most of the time), but occasionally as
> > > >    "down".
> > >
> > >
> > > > You can easily verify this by
> > > > 1: looking into syslog. You will see "Established link" and "Lost
> > > >    link" messages for this link at regular intervals.
> > > > 2: adding the following lines at line 1839 in tipc_link.c
> > > >         if (unlikely(!msg_short(msg) &&
> > > >                  (msg_destnode(msg) != tipc_own_addr)))
> > > >             goto cont;
> > > >    and recompiling. If it works now, my theory is confirmed.
> > > >
> > > > > I've copied Jon Maloy on this email as he has a lot more history
> > > > > with TIPC's link code than I do, and understands better why things
> > > > > are coded the way they are.  I'd be interested in hearing his
> > > > > thoughts on this matter ...
> > > >
> > > > Regards,
> > > > Al
> > > >
> > > >
> > > ----------------------------------------------------------------------
> > > > --
> > > > *From:* [EMAIL PROTECTED]
> > > > [mailto:[EMAIL PROTECTED] *On Behalf Of
> > > > *Richard Lopez
> > > > *Sent:* Thursday, February 07, 2008 10:44 AM
> > > > *To:* [email protected]
> > > > *Subject:* [tipc-discussion] Stale Links
> > > >
> > > > Configuration:
> > > > - Linux Kernel 2.6.20
> > > > - TIPC 1.7.5
> > > >
> > > > > I am using TIPC in a chassis-based environment. Each card in the
> > > > > chassis runs TIPC, and the node address for a card is based on the
> > > > > slot it is plugged into. For example, a card in slot 3 would have a
> > > > > node address of <1.1.3>.
> > > >
> > > > > The problem I observe occurs when I start to move around cards that
> > > > > are plugged into the chassis. The TIPC port associated with the card
> > > > > that is moved never gets created. Here is a simple example:
> > > > > Card A in slot 1 has node address <1.1.1> and MAC address MAC_A
> > > > > Card B in slot 2 has node address <1.1.2> and MAC address MAC_B
> > > >
> > > > > If I move Card B to slot 3 (node address is now <1.1.3> and MAC
> > > > > address is still MAC_B) the link shows "up" when I run tipc-config -l
> > > > > on Card A and Card B, but the TIPC port never gets created
> > > > > (tipc-config -p) and applications using TIPC cannot send messages.
> > > >
> > > > I believe the problem is caused by the fact that the original link
> > > > between <1.1.1> and <1.1.2> never gets deleted. When I perform a
> > > > > tipc-config -l on Card A I see a link between <1.1.1> and <1.1.2>
> > > > > "down", and a link between <1.1.1> and <1.1.3> "up". Instrumenting
> > > > > the code, it appears that the original link still has MAC_B as the
> > > > > media address, and this causes problems when Card B comes up with the
> > > > > same MAC but on a different link and node address.
> > > >
> > > > > Question 1: There seems to be a tipc_link_delete function but it
> > > > > does not seem to be used when communication is lost with a peer. Why
> > > > > are links not deleted?
> > > >
> > > > > I was able to work around this issue by clearing the media address
> > > > > before resetting the link when a peer is not responding; see below.
> > > > > I saw something similar done in tipc_discover.c, along with this
> > > > > comment:
> > > > >      * TODO: It might be better to delete these "stale" link endpoints,
> > > > >      * but this could be tricky [see tipc_link_delete()].
> > > >
> > > > > Question 2: Is there a plan to clean up stale links in a future
> > > > > release? If not, is the change shown below an appropriate fix?
> > > >
> > > > Change:
> > > > ===================================================================
> > > > --- linux-2.6.20.7/net/tipc/tipc_link.c (original copy)
> > > > +++ linux-2.6.20.7/net/tipc/tipc_link.c (working copy)
> > > > @@ -836,6 +836,12 @@
> > > >                                          l_ptr->fsm_msg_cnt);
> > > >                                 warn("Resetting link <%s>, peer not
> > > > responding\n",
> > > >                                      l_ptr->name);
> > > > +                memset(&l_ptr->media_addr, 0, sizeof(struct
> > > > tipc_media_addr));
> > > >                                 tipc_link_reset(l_ptr);
> > > >                                 l_ptr->state = RESET_UNKNOWN;
> > > >                                 l_ptr->fsm_msg_cnt = 0;
> > > >
> > > >
> > > > Richard Lopez
> > >
> > >
> >
> >
> > --
> > Richard Lopez
> > Cyan
> > (707) 338-9678
> > [EMAIL PROTECTED]
> >
>
>
>
> --
> Richard Lopez
> Cyan
> (707) 338-9678
> [EMAIL PROTECTED]
>



-- 
Richard Lopez
Cyan
(707) 338-9678
[EMAIL PROTECTED]