Richard and Allan,
See my comment below.
Regards
///jon
Stephens, Allan wrote:
> Hi Richard:
>
> This certainly sounds like a bug in TIPC. If so, I think it's a long
> standing one. TIPC has been updated a number of times over the last
> couple of years to handle issues relating to link failure and the
> replacement of network hardware, and I believe it works fine if:
> a) you replace the MAC address used by a node with a new MAC address,
> or b) if you reuse a MAC address on another node after first doing a);
> however, I'm not convinced that we've covered the case where a MAC
> address is reused on another node without first doing a).
>
> In answer to your first question, the reason that TIPC doesn't delete
> a link endpoint when connectivity is lost is because the link endpoint
> is usually needed again within a short while because connectivity is
> restored. (For example, consider the case where a node crashes and
> then reboots, or the case where someone disconnects a cable and then
> reconnects it.) Leaving the link endpoint around allows TIPC to
> restore communication quickly, without incurring the significant delay
> that can occur in some cases with TIPC's auto-discovery mechanism.
>
> In answer to your second question, there is no current plan to fix
> this issue ... but then we didn't know the problem existed until
> today! :-) The fix you propose doesn't strike me as being the correct
> thing to do, because it interferes with the ability of TIPC to
> immediately restore a failed link. (However, it does appear to solve
> your problem.) It would probably be preferable to introduce some sort
> of timeout mechanism into the link code that would defer this kind of
> MAC address "erasing" for a reasonable period of time. However, I'm
> not sure what constitutes "reasonable".
I agree with that; a timeout mechanism to remove dangling links would at
least save some memory.
But the problem we have at hand here seems to be a bug introduced in
tipc-1.7, and it can easily be fixed with a simple test.
The scenario I see is the following:
1) Link <1.1.1>-<1.1.3> gets established.
2) The old link object for <1.1.1>-<1.1.2> continues to send periodic
   RESET messages towards the non-existing destination <1.1.2>/MAC_B.
3) Node <1.1.3> receives these messages because the MAC address matches,
   but is supposed to throw them away in tipc_recv_msg(), because their
   destination <1.1.2> doesn't match the node's current address. There is
   such a check at the beginning of tipc_recv_msg() in tipc-1.6
   (linux-2.6.23), but in tipc-1.7 this check seems to have disappeared.
4) The received RESET messages cause link <1.1.1>-<1.1.3> to wobble up
   and down, and not much traffic will go through. tipc-config -l will
   show the link as "up" (most of the time), but occasionally as "down".
You can easily verify this by:
1) Looking into syslog. You will see "Established link" and "Lost link"
   messages for this link at regular intervals.
2) Adding the following lines at line 1839 in tipc_link.c

       if (unlikely(!msg_short(msg) &&
                    (msg_destnode(msg) != tipc_own_addr)))
               goto cont;

   and recompiling. If it works now, my theory is confirmed.
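
For anyone who wants to see the effect of that test without rebuilding a
kernel, here is a small user-space sketch of the same filter. It is not
TIPC code: struct sketch_msg, make_addr() and accept_msg() are
hypothetical stand-ins I made up for illustration, and the <Z.C.N>
packing (8-bit zone, 12-bit cluster, 12-bit node) is an assumption about
the address layout. Only the condition itself mirrors the lines above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: pack <Z.C.N> into 32 bits.
 * Assumed layout: 8-bit zone, 12-bit cluster, 12-bit node. */
static uint32_t make_addr(unsigned zone, unsigned cluster, unsigned node)
{
    return ((uint32_t)zone << 24) | ((uint32_t)cluster << 12) | (uint32_t)node;
}

/* Simplified stand-in for a message header; NOT the real struct tipc_msg. */
struct sketch_msg {
    bool     is_short;  /* short headers carry no destination node */
    uint32_t destnode;  /* destination node address, if present */
};

/* Same condition as the proposed test: drop any non-short message whose
 * destination node is not this node's current address. */
static bool accept_msg(const struct sketch_msg *m, uint32_t own_addr)
{
    assert(m != NULL);
    if (!m->is_short && m->destnode != own_addr)
        return false;   /* corresponds to "goto cont" (discard) */
    return true;
}
```

With own_addr set to <1.1.3>, a RESET message still addressed to <1.1.2>
is rejected even though it arrived on the matching MAC address, while
messages addressed to <1.1.3> and short-header messages pass through.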
>
> I've copied Jon Maloy on this email as he has a lot more history with
> TIPC's link code than I do, and understands better why things are
> coded the way they are. I'd be interested in hearing his thoughts on
> this matter ...
>
> Regards,
> Al
>
> ------------------------------------------------------------------------
> *From:* [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] *On Behalf Of
> *Richard Lopez
> *Sent:* Thursday, February 07, 2008 10:44 AM
> *To:* [email protected]
> *Subject:* [tipc-discussion] Stale Links
>
> Configuration:
> - Linux Kernel 2.6.20
> - TIPC 1.7.5
>
> I am using TIPC in a chassis based environment. Each card in the
> chassis runs TIPC and the Node address for a card is based on the slot
> it is plugged into. For example, a card in slot 3 would have a node address
> of <1.1.3>.
>
> The problem I observe occurs when I start to move around cards that
> are plugged into the chassis. The TIPC port associated with the card
> that is moved never gets created. Here is a simple example:
> Card A in slot 1 has node address <1.1.1> and MAC address MAC_A
> Card B in slot 2 has node address <1.1.2> and MAC address MAC_B
>
> If I move Card B to slot 3 (node address is now <1.1.3> and MAC
> address is still MAC_B) the link shows "up" when I run tipc-config -l
> on Card A and Card B, but the TIPC port never gets created
> (tipc-config -p) and applications using TIPC cannot send messages.
>
> I believe the problem is caused by the fact that the original link
> between <1.1.1> and <1.1.2> never gets deleted. When I perform a
> tipc-config -l on Card A I see a link between <1.1.1> and <1.1.2>
> "down", and a link between <1.1.1> and <1.1.3> "up". Instrumenting the
> code it appears that the original link still has MAC_B as the media
> address and this causes problems when Card B comes up with the same
> MAC but on a different link and Node address.
>
> Question 1: There seems to be a tipc_link_delete function but it does
> not seem to be used when communication is lost with a peer. Why are
> links not deleted?
>
> I was able to work around this issue by clearing the media address
> before resetting the link when a peer is not responding, see below. I
> saw something similar done in tipc_discover.c and this comment.
> * TODO: It might be better to delete these "stale" link endpoints,
> * but this could be tricky [see tipc_link_delete()].
>
> Question 2: Is there a plan to clean up stale links in a future
> release? If not, is the change shown below an appropriate fix?
>
> Change:
> ===================================================================
> --- linux-2.6.20.7/net/tipc/tipc_link.c (original copy)
> +++ linux-2.6.20.7/net/tipc/tipc_link.c (working copy)
> @@ -836,6 +836,12 @@
> l_ptr->fsm_msg_cnt);
> warn("Resetting link <%s>, peer not responding\n",
> l_ptr->name);
> + memset(&l_ptr->media_addr, 0, sizeof(struct tipc_media_addr));
> tipc_link_reset(l_ptr);
> l_ptr->state = RESET_UNKNOWN;
> l_ptr->fsm_msg_cnt = 0;
>
>
> Richard Lopez