Richard and Allan,
See my comment below.

Regards
///jon

Stephens, Allan wrote:
> Hi Richard:
>  
> This certainly sounds like a bug in TIPC.  If so, I think it's a long 
> standing one.  TIPC has been updated a number of times over the last 
> couple of years to handle issues relating to link failure and the 
> replacement of network hardware, and I believe it works fine if: 
> a) you replace the MAC address used by a node with a new MAC address, 
> or b) if you reuse a MAC address on another node after first doing a); 
> however, I'm not convinced that we've covered the case where a MAC 
> address is reused on another node without first doing a).
>  
> In answer to your first question, the reason that TIPC doesn't delete 
> a link endpoint when connectivity is lost is because the link endpoint 
> is usually needed again within a short while because connectivity is 
> restored.  (For example, consider the case where a node crashes and 
> then reboots, or the case where someone disconnects a cable and then 
> reconnects it.)  Leaving the link endpoint around allows TIPC to 
> restore communication quickly, without incurring the significant delay 
> that can occur in some cases with TIPC's auto-discovery mechanism.
>  
> In answer to your second question, there is no current plan to fix 
> this issue ... but then we didn't know the problem existed until 
> today! :-)  The fix you propose doesn't strike me as being the correct 
> thing to do, because it interferes with the ability of TIPC to 
> immediately restore a failed link.  (However, it does appear to solve 
> your problem.)  It would probably be preferable to introduce some sort 
> of timeout mechanism into the link code that would defer this kind of 
> MAC address "erasing" for a reasonable period of time.  However, I'm 
> not sure what constitutes "reasonable".
I agree with that; a timeout mechanism to remove dangling links would at
least save some memory.
But the problem at hand seems to be a bug introduced in tipc-1.7, and it
can easily be fixed with a simple test.
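
To make the timeout idea discussed above a bit more concrete, here is a minimal, self-contained sketch in plain C. It is not TIPC code: `struct model_link`, `model_link_down()`, `model_link_expire()` and the grace period parameter are all hypothetical names chosen for illustration. The only point is that the media address is kept for a grace period after the link goes down, so that a quick reconnect is still possible, and is cleared only once that period has elapsed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MODEL_MEDIA_ADDR_LEN 6   /* assumption: Ethernet-sized media address */

/* Hypothetical, stripped-down link endpoint (not the real TIPC struct). */
struct model_link {
    bool     up;                               /* link currently established? */
    uint64_t down_since_ms;                    /* time the link went down */
    uint8_t  media_addr[MODEL_MEDIA_ADDR_LEN]; /* peer's media (MAC) address */
};

/* Record the moment the link is lost; the media address is kept for now. */
static void model_link_down(struct model_link *l, uint64_t now_ms)
{
    assert(l != NULL);
    if (l->up) {
        l->up = false;
        l->down_since_ms = now_ms;
    }
}

/*
 * Called periodically. Returns true (and clears the media address) only
 * if the link is down and has been down for at least grace_ms.
 */
static bool model_link_expire(struct model_link *l, uint64_t now_ms,
                              uint64_t grace_ms)
{
    assert(l != NULL);
    if (l->up || now_ms - l->down_since_ms < grace_ms)
        return false;
    memset(l->media_addr, 0, sizeof(l->media_addr));
    return true;
}
```

What a "reasonable" grace period would be is left open here, just as it is in the discussion above.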

The scenario I see is the following:

1) Link <1.1.1>-<1.1.3> gets established.
2) The old link object for <1.1.1>-<1.1.2> continues to send periodic
    RESET messages towards the non-existing destination <1.1.2>/MAC_B.
3) Node <1.1.3> receives these messages because the MAC address matches,
    but is supposed to throw them away in tipc_recv_msg(), because their
    destination <1.1.2> doesn't match the node's current address. There is
    such a check at the beginning of tipc_recv_msg() in tipc-1.6
    (linux-2.6.23), but in tipc-1.7 this check seems to have disappeared.
4) The received RESET messages cause link <1.1.1>-<1.1.3> to wobble up
    and down, and not much traffic will go through. tipc-config -l will
    show the link as "up" (most of the time), but occasionally as "down".
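
The scenario above can be modelled outside the kernel with a few lines of C. This is only an illustrative sketch: `make_addr()`, `struct model_msg` and `accept_msg()` are hypothetical names, not TIPC functions. The address packing (8-bit zone, 12-bit cluster, 12-bit node) follows the usual TIPC <Z.C.N> layout, and the filter mirrors the destination check that appears to be missing in tipc-1.7.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Pack <zone.cluster.node> into 32 bits: 8-bit zone, 12-bit cluster,
 * 12-bit node. */
static uint32_t make_addr(unsigned zone, unsigned cluster, unsigned node)
{
    assert(zone < 256 && cluster < 4096 && node < 4096);
    return ((uint32_t)zone << 24) | ((uint32_t)cluster << 12) | (uint32_t)node;
}

/* Hypothetical, stripped-down view of a received message. */
struct model_msg {
    bool     is_short;   /* short-header messages are exempt from the check */
    uint32_t dest_node;  /* destination node; only used when !is_short */
};

/*
 * Returns true if the receiving node should process the message, false if
 * it should be discarded because it is addressed to some other node.
 */
static bool accept_msg(const struct model_msg *m, uint32_t own_addr)
{
    if (!m->is_short && m->dest_node != own_addr)
        return false;
    return true;
}
```

With this filter in place, a RESET sent by the stale <1.1.1>-<1.1.2> endpoint (destination <1.1.2>) is rejected by a node whose own address is <1.1.3>, even though it arrives at MAC_B, while messages addressed to <1.1.3> are still accepted.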


You can easily verify this by:
1) Looking into syslog. You will see "Established link" and "Lost link"
    messages for this link at regular intervals.
2) Adding the following lines at line 1839 in tipc_link.c
        /* Discard messages that are not addressed to this node */
        if (unlikely(!msg_short(msg) &&
                     (msg_destnode(msg) != tipc_own_addr)))
            goto cont;
    and recompiling. If it works now, my theory is confirmed.
>  
> I've copied Jon Maloy on this email as he has a lot more history with 
> TIPC's link code than I do, and understands better why things are 
> coded the way they are.  I'd be interested in hearing his thoughts on 
> this matter ...
>  
> Regards,
> Al
>
> ------------------------------------------------------------------------
> *From:* [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] *On Behalf Of 
> *Richard Lopez
> *Sent:* Thursday, February 07, 2008 10:44 AM
> *To:* [email protected]
> *Subject:* [tipc-discussion] Stale Links
>
> Configuration:
> - Linux Kernel 2.6.20
> - TIPC 1.7.5
>
> I am using TIPC in a chassis-based environment. Each card in the 
> chassis runs TIPC, and the node address for a card is based on the slot 
> it is plugged into. For example, a card in slot 3 would have a node 
> address of <1.1.3>.
>
> The problem I observe occurs when I start to move around cards that 
> are plugged into the chassis. The TIPC port associated with the card 
> that is moved never gets created. Here is a simple example:
> Card A in slot 1 has node address <1.1.1> and MAC address MAC_A
> Card B in slot 2 has node address <1.1.2> and MAC address MAC_B
>
> If I move Card B to slot 3 (node address is now <1.1.3> and MAC 
> address is still MAC_B) the link shows "up" when I run tipc-config -l 
> on Card A and Card B, but the TIPC port never gets created 
> (tipc-config -p) and applications using TIPC cannot send messages.
>
> I believe the problem is caused by the fact that the original link 
> between <1.1.1> and <1.1.2> never gets deleted. When I perform a 
> tipc-config -l on Card A I see a link between <1.1.1> and <1.1.2> 
> "down", and a link between <1.1.1> and <1.1.3> "up". Instrumenting the 
> code it appears that the original link still has MAC_B as the media 
> address and this causes problems when Card B comes up with the same 
> MAC but on a different link and Node address.
>
> Question 1: There seems to be a tipc_link_delete function but it does 
> not seem to be used when communication is lost with a peer. Why are 
> links not deleted?
>
> I was able to work around this issue by clearing the media address 
> before resetting the link when a peer is not responding, see below. I 
> saw something similar done in tipc_discover.c, along with this comment:
>      * TODO: It might be better to delete these "stale" link endpoints,
>      * but this could be tricky [see tipc_link_delete()].
>
> Question 2: Is there a plan to clean up stale links in a future 
> release? If not, is the change shown below an appropriate fix?
>
> Change:
> ===================================================================
> --- linux-2.6.20.7/net/tipc/tipc_link.c (original copy)
> +++ linux-2.6.20.7/net/tipc/tipc_link.c (working copy)
> @@ -836,6 +836,12 @@
>                                          l_ptr->fsm_msg_cnt);
>                                 warn("Resetting link <%s>, peer not responding\n",
>                                      l_ptr->name);
> +                memset(&l_ptr->media_addr, 0, sizeof(struct tipc_media_addr));
>                                 tipc_link_reset(l_ptr);
>                                 l_ptr->state = RESET_UNKNOWN;
>                                 l_ptr->fsm_msg_cnt = 0;
>
>
> Richard Lopez


_______________________________________________
tipc-discussion mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/tipc-discussion