Thanks, Richard.  I was able to squeeze in a few minutes today to do
some sanity testing, and was able to confirm that it fixed the problem
you encountered.  I also showed that the TIPC test suite ran correctly,
thereby proving that the fix doesn't interfere with the forwarding of
fragmented messages.
 
Once I get your confirmation, I'll post the patch on the TIPC website.
 
-- Al

________________________________

From: Richard Lopez [mailto:[EMAIL PROTECTED] 
Sent: Friday, February 08, 2008 4:44 PM
To: Jon Maloy
Cc: Stephens, Allan; [email protected]
Subject: Re: [tipc-discussion] Stale Links


Jon/Allan,

Initial testing shows that Allan's patch works. I would like to get a
little more test time on it before declaring it "good". I will give an
update next week.

Richard


On 2/8/08, Jon Maloy <[EMAIL PROTECTED]> wrote: 

        I looked into Allan's patch, and it seems correct to me. The one I
        sent was picked directly from linux-2.6.23 (tipc-1.6), which is not
        prepared for packet routing at all.
        So, if Richard can confirm that Allan's patch is working, that is
        the one we go for.
         
        ///jon
         

________________________________

        From: Richard Lopez [mailto:[EMAIL PROTECTED] 
        Sent: February 8, 2008 1:40 PM
        To: Stephens, Allan
        Cc: Jon Maloy; [email protected] 
        
        Subject: Re: [tipc-discussion] Stale Links
        

        
        Jon/Allan,
        
        I removed my change and implemented Jon's suggested change and it
        fixed the problem I was seeing. Next, I will retest with Allan's
        version of the change.
        
        Richard.
        
        
        On 2/8/08, Stephens, Allan <[EMAIL PROTECTED]> wrote:


                Hi Jon/Richard:
                
                It looks like Jon is correct in observing that this problem
                was actually introduced in TIPC 1.7.  It appears that the
                check he identified was first updated (by him) to allow for
                the handling of routed messages, and then mysteriously
                dropped (by me) during the update which first introduced the
                link_recv_buf_validate() routine.
                
                At the time the check was removed it looked like the
                following:
                
                        /* Discard non-routeable messages destined for another node */
                
                        if (unlikely(!msg_isdata(msg) &&
                                     (msg_destnode(msg) != tipc_own_addr))) {
                                if (msg_user(msg) != CONN_MANAGER)
                                        goto cont;
                        }
                
                However, I think that this code is no longer sufficient,
                since it doesn't allow TIPC to route message fragments.  I
                think the updated check would need to be:
                
                        /* Discard non-routeable messages destined for another node */
                
                        if (unlikely(!msg_isdata(msg) &&
                                     (msg_destnode(msg) != tipc_own_addr))) {
                                if ((msg_user(msg) != CONN_MANAGER) &&
                                    (msg_user(msg) != MSG_FRAGMENTER))
                                        goto cont;
                        }
                
                Jon, let me know if you agree with this analysis.  Richard,
                feel free to try this fix out, too, once you've completed
                your previous testing.  The fix will need fuller verification
                to ensure that it is really correct (and doesn't break
                anything that is currently working), but I'm totally swamped
                at the moment and don't have time to do it right now.
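
To make the difference between the two checks above concrete, here is a minimal user-space sketch (not kernel code). The struct, the enum and the function names are hypothetical stand-ins for the TIPC message accessors quoted above, and the enum values are arbitrary; only the boolean logic is taken from the two snippets.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration only; these are NOT the real
 * TIPC definitions, and the numeric values are arbitrary. */
enum model_user {
    MODEL_DATA,            /* a message for which msg_isdata() would be true */
    MODEL_LINK_PROTOCOL,   /* e.g. a link RESET message */
    MODEL_CONN_MANAGER,
    MODEL_MSG_FRAGMENTER
};

struct model_msg {
    enum model_user user;  /* plays the role of msg_user(msg) */
    uint32_t destnode;     /* plays the role of msg_destnode(msg) */
};

/* The check as it looked when it was removed: of the non-data messages
 * addressed to another node, only CONN_MANAGER messages are kept. */
static bool discard_old(const struct model_msg *m, uint32_t own_addr)
{
    if (m->user != MODEL_DATA && m->destnode != own_addr)
        return m->user != MODEL_CONN_MANAGER;
    return false;
}

/* The proposed check: message fragments addressed to another node are
 * kept as well, so that they can still be routed. */
static bool discard_new(const struct model_msg *m, uint32_t own_addr)
{
    if (m->user != MODEL_DATA && m->destnode != own_addr)
        return m->user != MODEL_CONN_MANAGER &&
               m->user != MODEL_MSG_FRAGMENTER;
    return false;
}
```

In this model, with own_addr set to node 3, a MODEL_MSG_FRAGMENTER message addressed to node 2 is discarded by discard_old() but kept by discard_new(), while a MODEL_LINK_PROTOCOL message addressed to node 2 is discarded by both.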
                
                Regards,
                Al
                
                -----Original Message-----
                From: Jon Maloy [mailto:[EMAIL PROTECTED]
                Sent: Thursday, February 07, 2008 7:32 PM
                To: Stephens, Allan
                Cc: Richard Lopez; [email protected]
                Subject: Re: [tipc-discussion] Stale Links
                
                Richard and Allan,
                See my comment below.
                
                Regards
                ///jon
                
                Stephens, Allan wrote:
                > Hi Richard:
                >
                > This certainly sounds like a bug in TIPC.  If so, I think it's
                > a long standing one.  TIPC has been updated a number of times
                > over the last couple of years to handle issues relating to link
                > failure and the replacement of network hardware, and I believe
                > it works fine if: a) you replace the MAC address used by a node
                > with a new MAC address, or b) if you reuse a MAC address on
                > another node after first doing a); however, I'm not convinced
                > that we've covered the case where a MAC address is reused on
                > another node without first doing a).
                >
                > In answer to your first question, the reason that TIPC doesn't
                > delete a link endpoint when connectivity is lost is because the
                > link endpoint is usually needed again within a short while
                > because connectivity is restored.  (For example, consider the
                > case where a node crashes and then reboots, or the case where
                > someone disconnects a cable and then reconnects it.)  Leaving
                > the link endpoint around allows TIPC to restore communication
                > quickly, without incurring the significant delay that can occur
                > in some cases with TIPC's auto-discovery mechanism.
                >
                > In answer to your second question, there is no current plan to
                > fix this issue ... but then we didn't know the problem existed
                > until today! :-)  The fix you propose doesn't strike me as being
                > the correct thing to do, because it interferes with the ability
                > of TIPC to immediately restore a failed link.  (However, it does
                > appear to solve your problem.)  It would probably be preferable
                > to introduce some sort of timeout mechanism into the link code
                > that would defer this kind of MAC address "erasing" for a
                > reasonable period of time.  However, I'm not sure what
                > constitutes "reasonable".
                I agree with that; a timeout mechanism to remove dangling
                links would at least save some memory.
                But the problem we have at hand here seems to be a bug
                introduced in tipc-1.7, and can easily be fixed with a simple
                test.
                
                The scenario I see is the following:
                
                1) Link <1.1.1>-<1.1.3> gets established.
                2) The old link object in <1.1.1>-<1.1.2> continues to send
                   periodic RESET messages towards the non-existing
                   destination <1.1.2>/MAC_B.
                3) Node <1.1.3> receives these messages because the MAC
                   address matches, but is supposed to throw them away in
                   tipc_recv_msg(), because their destination <1.1.2> doesn't
                   match the node's current address. There is such a check in
                   the beginning of tipc_recv_msg() in tipc-1.6
                   (linux-2.6.23), but in tipc-1.7 this check seems to have
                   disappeared.
                4) The received RESET messages cause link <1.1.1>-<1.1.3> to
                   wobble up and down, and not much traffic will go through.
                   tipc-config -l will show the link as "up" (most of the
                   time), but occasionally as "down".
                
                
                You can easily verify this by
                1: looking into syslog. You will see "Established link" and
                   "Lost link" messages for this link at regular intervals.
                2: adding the following lines at line 1839 in tipc_link.c
                        if (unlikely(!msg_short(msg) &&
                                 (msg_destnode(msg) != tipc_own_addr)))
                            goto cont;
                   and recompiling. If it works now, my theory is confirmed.
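
As a rough illustration of the scenario above, the following user-space sketch (not kernel code) models the suggested test. The struct, the function names and the address-packing helper are hypothetical; the helper assumes a <zone.cluster.node> address packed into 8, 12 and 12 bits, and the predicate simply mirrors the condition in the three lines quoted above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: packs <zone.cluster.node> into one 32-bit value,
 * assuming an 8/12/12 bit split. */
static uint32_t model_addr(uint32_t zone, uint32_t cluster, uint32_t node)
{
    return (zone << 24) | (cluster << 12) | node;
}

/* Hypothetical model of the two header properties the test looks at. */
struct rx_msg {
    bool short_hdr;     /* plays the role of msg_short(msg) */
    uint32_t destnode;  /* plays the role of msg_destnode(msg) */
};

/* Mirrors the suggested test: a message that is not "short" and whose
 * destination node differs from our own address is thrown away. */
static bool rx_should_drop(const struct rx_msg *m, uint32_t own_addr)
{
    return !m->short_hdr && m->destnode != own_addr;
}
```

In this model, a RESET message still addressed to <1.1.2> that arrives at the card now known as <1.1.3> makes rx_should_drop() return true, so it never reaches the link state machine of <1.1.1>-<1.1.3>; a message addressed to <1.1.3> itself is accepted.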
                >
                > I've copied Jon Maloy on this email as he has a lot more
                > history with TIPC's link code than I do, and understands better
                > why things are coded the way they are.  I'd be interested in
                > hearing his thoughts on this matter ...
                >
                > Regards,
                > Al
                >
                >
                > ------------------------------------------------------------------------
                > *From:* [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
                > *On Behalf Of* Richard Lopez
                > *Sent:* Thursday, February 07, 2008 10:44 AM
                > *To:* [email protected]
                > *Subject:* [tipc-discussion] Stale Links
                >
                > Configuration:
                > - Linux Kernel 2.6.20
                > - TIPC 1.7.5
                >
                > I am using TIPC in a chassis based environment. Each card in
                > the chassis runs TIPC and the Node address for a card is based
                > on the slot it is plugged into. For example, a card in slot 3
                > would have a node address of <1.1.3>.
                >
                > The problem I observe occurs when I start to move around cards
                > that are plugged into the chassis. The TIPC port associated with
                > the card that is moved never gets created. Here is a simple
                > example:
                > Card A in slot 1 has node address <1.1.1> and MAC address MAC_A
                > Card B in slot 2 has node address <1.1.2> and MAC address MAC_B
                >
                > If I move Card B to slot 3 (node address is now <1.1.3> and MAC
                > address is still MAC_B) the link shows "up" when I run
                > tipc-config -l on Card A and Card B, but the TIPC port never
                > gets created (tipc-config -p) and applications using TIPC cannot
                > send messages.
                >
                > I believe the problem is caused by the fact that the original
                > link between <1.1.1> and <1.1.2> never gets deleted. When I
                > perform a tipc-config -l on Card A I see a link between <1.1.1>
                > and <1.1.2> "down", and a link between <1.1.1> and <1.1.3> "up".
                > Instrumenting the code, it appears that the original link still
                > has MAC_B as the media address and this causes problems when
                > Card B comes up with the same MAC but on a different link and
                > Node address.
                >
                > Question 1: There seems to be a tipc_link_delete function but
                > it does not seem to be used when communication is lost with a
                > peer. Why are links not deleted?
                >
                > I was able to work around this issue by clearing the media
                > address before resetting the link when a peer is not responding,
                > see below. I saw something similar done in tipc_discover.c and
                > this comment.
                >      * TODO: It might be better to delete these "stale" link endpoints,
                >      * but this could be tricky [see tipc_link_delete()].
                >
                > Question 2: Is there a plan to clean up stale links in a future
                > release? If not, is the change shown below an appropriate fix?
                >
                > Change:
                > ===================================================================
                > --- linux-2.6.20.7/net/tipc/tipc_link.c (original copy)
                > +++ linux-2.6.20.7/net/tipc/tipc_link.c (working copy)
                > @@ -836,6 +836,12 @@
                >                                       l_ptr->fsm_msg_cnt);
                >                                 warn("Resetting link <%s>, peer not responding\n",
                >                                      l_ptr->name);
                > +                               memset(&l_ptr->media_addr, 0, sizeof(struct tipc_media_addr));
                >                                 tipc_link_reset(l_ptr);
                >                                 l_ptr->state = RESET_UNKNOWN;
                >                                 l_ptr->fsm_msg_cnt = 0;
                >
                >
                > Richard Lopez
                
                




        -- 
        Richard Lopez
        Cyan
        (707) 338-9678
        [EMAIL PROTECTED] 



