I looked into Allan's patch, and it seems correct to me. The one I sent was
picked directly from linux-2.6.23 (tipc-1.6), which is not prepared for
packet routing at all.
So, if Richard can confirm that Allan's patch works, that is the one we go
for.
 
///jon
 

________________________________

From: Richard Lopez [mailto:[EMAIL PROTECTED] 
Sent: February 8, 2008 1:40 PM
To: Stephens, Allan
Cc: Jon Maloy; [email protected]
Subject: Re: [tipc-discussion] Stale Links


Jon/Allan,

I removed my change and implemented Jon's suggested change and it fixed
the problem I was seeing. Next, I will retest with Allan's version of
the change.

Richard.


On 2/8/08, Stephens, Allan <[EMAIL PROTECTED]> wrote: 

        Hi Jon/Richard:
        
        It looks like Jon is correct in observing that this problem was
        actually introduced in TIPC 1.7.  It appears that the check he
        identified was first updated (by him) to allow for the handling of
        routed messages, and then mysteriously dropped (by me) during the
        update which first introduced the link_recv_buf_validate() routine.
        
        At the time the check was removed it looked like the following:
        
                /* Discard non-routeable messages destined for another node */
        
                if (unlikely(!msg_isdata(msg) &&
                             (msg_destnode(msg) != tipc_own_addr))) {
                        if (msg_user(msg) != CONN_MANAGER)
                                goto cont;
                }
        
        However, I think that this code is no longer sufficient, since it
        doesn't allow TIPC to route message fragments.  I think the updated
        check would need to be:
        
                /* Discard non-routeable messages destined for another node */
        
                if (unlikely(!msg_isdata(msg) &&
                             (msg_destnode(msg) != tipc_own_addr))) {
                        if ((msg_user(msg) != CONN_MANAGER) &&
                            (msg_user(msg) != MSG_FRAGMENTER))
                                goto cont;
                }
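
To make the difference between the variants concrete, here is a minimal standalone sketch of the three behaviours discussed in this thread: no check at all, the check as it stood when it was removed, and the updated check. It is not the kernel code. struct sketch_msg, the enum values, sketch_isdata() and SKETCH_ADDR() are hypothetical stand-ins for the real message header accessors, and the sketch assumes that fragments and link protocol messages (such as RESET) are not classified as data messages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Pack <Z.C.N> into 32 bits (8/12/12), for illustration only. */
#define SKETCH_ADDR(z, c, n) \
        (((uint32_t)(z) << 24) | ((uint32_t)(c) << 12) | (uint32_t)(n))

/* Hypothetical message classes; placeholders, not real protocol values. */
enum sketch_user {
        SKETCH_DATA,
        SKETCH_LINK_PROTOCOL,   /* e.g. the periodic RESET messages */
        SKETCH_CONN_MANAGER,
        SKETCH_MSG_FRAGMENTER
};

struct sketch_msg {
        enum sketch_user user;
        uint32_t destnode;      /* destination node address */
};

/* Assumption: only SKETCH_DATA counts as a data message. */
static inline bool sketch_isdata(const struct sketch_msg *m)
{
        assert(m != NULL);
        return m->user == SKETCH_DATA;
}

/* Behaviour reported for TIPC 1.7: the check is gone, nothing is discarded. */
static inline bool discard_no_check(const struct sketch_msg *m, uint32_t own)
{
        (void)m;
        (void)own;
        return false;
}

/* The check as it looked when it was removed. */
static inline bool discard_old_check(const struct sketch_msg *m, uint32_t own)
{
        if (!sketch_isdata(m) && m->destnode != own)
                return m->user != SKETCH_CONN_MANAGER;
        return false;
}

/* The updated check: fragments may also be routed onwards. */
static inline bool discard_updated_check(const struct sketch_msg *m,
                                         uint32_t own)
{
        if (!sketch_isdata(m) && m->destnode != own)
                return m->user != SKETCH_CONN_MANAGER &&
                       m->user != SKETCH_MSG_FRAGMENTER;
        return false;
}
```

With an own address of <1.1.3>, a RESET addressed to <1.1.2> passes through discard_no_check() but is dropped by both checks, while a fragment addressed to <1.1.2> is dropped by the old check and kept by the updated one.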
        
        Jon, let me know if you agree with this analysis.  Richard, feel free
        to try this fix out, too, once you've completed your previous testing.
        The fix will need fuller verification to ensure that it is really
        correct (and doesn't break anything that is currently working), but
        I'm totally swamped at the moment and don't have time to do it right
        now.
        
        Regards,
        Al
        
        -----Original Message-----
        From: Jon Maloy [mailto:[EMAIL PROTECTED]
        Sent: Thursday, February 07, 2008 7:32 PM
        To: Stephens, Allan
        Cc: Richard Lopez; [email protected]
        Subject: Re: [tipc-discussion] Stale Links
        
        Richard and Allan,
        See my comment below.
        
        Regards
        ///jon
        
        Stephens, Allan wrote:
        > Hi Richard:
        >
        > This certainly sounds like a bug in TIPC.  If so, I think it's a
        > long-standing one.  TIPC has been updated a number of times over the
        > last couple of years to handle issues relating to link failure and
        > the replacement of network hardware, and I believe it works fine if:
        > a) you replace the MAC address used by a node with a new MAC address,
        > or b) you reuse a MAC address on another node after first doing a);
        > however, I'm not convinced that we've covered the case where a MAC
        > address is reused on another node without first doing a).
        >
        > In answer to your first question, the reason that TIPC doesn't delete
        > a link endpoint when connectivity is lost is because the link
        > endpoint is usually needed again within a short while because
        > connectivity is restored.  (For example, consider the case where a
        > node crashes and then reboots, or the case where someone disconnects
        > a cable and then reconnects it.)  Leaving the link endpoint around
        > allows TIPC to restore communication quickly, without incurring the
        > significant delay that can occur in some cases with TIPC's
        > auto-discovery mechanism.
        >
        > In answer to your second question, there is no current plan to fix
        > this issue ... but then we didn't know the problem existed until
        > today! :-)  The fix you propose doesn't strike me as being the
        > correct thing to do, because it interferes with the ability of TIPC
        > to immediately restore a failed link.  (However, it does appear to
        > solve your problem.)  It would probably be preferable to introduce
        > some sort of timeout mechanism into the link code that would defer
        > this kind of MAC address "erasing" for a reasonable period of time.
        > However, I'm not sure what constitutes "reasonable".
        I agree with that; a timeout mechanism to remove dangling links would
        at least save some memory.
        But the problem we have at hand here seems to be a bug introduced in
        tipc-1.7, and it can easily be fixed with a simple test.
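
The timeout idea could look roughly like the following sketch: remember when the link was lost, and only erase the media address once a grace period has passed without the link coming back. All names here (struct sketch_link, sketch_link_lost(), sketch_link_restored(), sketch_link_expire(), the 20-byte address size) are hypothetical and not part of the TIPC link code. The caller is assumed to supply a monotonic millisecond clock, and what counts as a "reasonable" grace period is left open, as in the discussion above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_MEDIA_ADDR_LEN 20        /* arbitrary size for the sketch */

struct sketch_link {
        uint8_t media_addr[SKETCH_MEDIA_ADDR_LEN];
        bool up;
        uint64_t lost_at_ms;            /* valid only while !up */
};

/* Record the moment connectivity was lost; keep the media address. */
static inline void sketch_link_lost(struct sketch_link *l, uint64_t now_ms)
{
        assert(l != NULL);
        l->up = false;
        l->lost_at_ms = now_ms;
}

/* Connectivity came back in time: the endpoint is reused as-is. */
static inline void sketch_link_restored(struct sketch_link *l)
{
        assert(l != NULL);
        l->up = true;
}

/*
 * Called periodically. Erases the media address only if the link has
 * been down for at least grace_ms. Returns true if it was erased.
 */
static inline bool sketch_link_expire(struct sketch_link *l, uint64_t now_ms,
                                      uint64_t grace_ms)
{
        assert(l != NULL);
        if (l->up || now_ms - l->lost_at_ms < grace_ms)
                return false;
        memset(l->media_addr, 0, sizeof(l->media_addr));
        return true;
}
```

Compared with clearing the address immediately on reset, this keeps the fast-restore behaviour for short outages (reboot, cable pulled and reconnected) and only gives up the stale address after the grace period.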
        
        The scenario I see is the following:
        
        1) Link <1.1.1>-<1.1.3> gets established.
        2) The old link object for <1.1.1>-<1.1.2> continues to send periodic
           RESET messages towards the non-existing destination
           <1.1.2>/MAC_B.
        3) Node <1.1.3> receives these messages because the MAC address
           matches, but is supposed to throw them away in tipc_recv_msg(),
           because their destination <1.1.2> doesn't match the node's current
           address. There is such a check in the beginning of tipc_recv_msg()
           in tipc-1.6 (linux-2.6.23), but in tipc-1.7 this check seems to
           have disappeared.
        4) The received RESET messages cause link <1.1.1>-<1.1.3> to wobble
           up and down, and not much traffic will go through. tipc-config -l
           will show the link as "up" (most of the time), but occasionally
           as "down".
        
        
        You can easily verify this by:
        1) Looking into syslog. You will see "Established link" and
           "Lost link" messages for this link at regular intervals.
        2) Adding the following lines at line 1839 in tipc_link.c
                if (unlikely(!msg_short(msg) &&
                             (msg_destnode(msg) != tipc_own_addr)))
                    goto cont;
           and recompiling. If it works now, my theory is confirmed.
        >
        > I've copied Jon Maloy on this email as he has a lot more history with
        > TIPC's link code than I do, and understands better why things are
        > coded the way they are.  I'd be interested in hearing his thoughts on
        > this matter ...
        >
        > Regards,
        > Al
        >
        > ------------------------------------------------------------------------
        > *From:* [EMAIL PROTECTED]
        > [mailto:[EMAIL PROTECTED] *On Behalf Of* Richard Lopez
        > *Sent:* Thursday, February 07, 2008 10:44 AM
        > *To:* [email protected]
        > *Subject:* [tipc-discussion] Stale Links
        >
        > Configuration:
        > - Linux Kernel 2.6.20
        > - TIPC 1.7.5
        >
        > I am using TIPC in a chassis-based environment. Each card in the
        > chassis runs TIPC, and the node address for a card is based on the
        > slot it is plugged into. For example, a card in slot 3 would have a
        > node address of <1.1.3>.
        >
        > The problem I observe occurs when I start to move around cards that
        > are plugged into the chassis. The TIPC port associated with the card
        > that is moved never gets created. Here is a simple example:
        > Card A in slot 1 has node address <1.1.1> and MAC address MAC_A
        > Card B in slot 2 has node address <1.1.2> and MAC address MAC_B
        >
        > If I move Card B to slot 3 (node address is now <1.1.3> and MAC
        > address is still MAC_B) the link shows "up" when I run tipc-config -l
        > on Card A and Card B, but the TIPC port never gets created
        > (tipc-config -p) and applications using TIPC cannot send messages.
        >
        > I believe the problem is caused by the fact that the original link
        > between <1.1.1> and <1.1.2> never gets deleted. When I perform a
        > tipc-config -l on Card A I see a link between <1.1.1> and <1.1.2>
        > "down", and a link between <1.1.1> and <1.1.3> "up". After
        > instrumenting the code, it appears that the original link still has
        > MAC_B as the media address, and this causes problems when Card B
        > comes up with the same MAC but on a different link and node address.
        >
        > Question 1: There seems to be a tipc_link_delete function but it does
        > not seem to be used when communication is lost with a peer. Why are
        > links not deleted?
        >
        > I was able to work around this issue by clearing the media address
        > before resetting the link when a peer is not responding, see below. I
        > saw something similar done in tipc_discover.c, along with this
        > comment:
        >      * TODO: It might be better to delete these "stale" link endpoints,
        >      * but this could be tricky [see tipc_link_delete()].
        >
        > Question 2: Is there a plan to clean up stale links in a future
        > release? If not, is the change shown below an appropriate fix?
        >
        > Change:
        > ===================================================================
        > --- linux-2.6.20.7/net/tipc/tipc_link.c (original copy)
        > +++ linux-2.6.20.7/net/tipc/tipc_link.c (working copy)
        > @@ -836,6 +836,12 @@
        >                                          l_ptr->fsm_msg_cnt);
        >                                 warn("Resetting link <%s>, peer not responding\n",
        >                                      l_ptr->name);
        > +                memset(&l_ptr->media_addr, 0, sizeof(struct tipc_media_addr));
        >                                 tipc_link_reset(l_ptr);
        >                                 l_ptr->state = RESET_UNKNOWN;
        >                                 l_ptr->fsm_msg_cnt = 0;
        >
        >
        > Richard Lopez
        
        




-- 
Richard Lopez
Cyan
(707) 338-9678
[EMAIL PROTECTED] 