Hi Richard:
This certainly sounds like a bug in TIPC. If so, I think it's a long
standing one. TIPC has been updated a number of times over the last
couple of years to handle issues relating to link failure and the
replacement of network hardware, and I believe it works fine if you a)
replace the MAC address used by a node with a new MAC address, or b)
reuse a MAC address on another node after first doing a). However,
I'm not convinced that we've covered the case where a MAC address is
reused on another node without first doing a).
In answer to your first question, TIPC doesn't delete a link endpoint
when connectivity is lost because the endpoint is usually needed again
within a short while, once connectivity is restored. (For example,
consider the case where a node crashes and then
reboots, or the case where someone disconnects a cable and then
reconnects it.) Leaving the link endpoint around allows TIPC to restore
communication quickly, without incurring the significant delay that can
occur in some cases with TIPC's auto-discovery mechanism.
In answer to your second question, there is no current plan to fix this
issue ... but then we didn't know the problem existed until today! :-)
The fix you propose doesn't strike me as being the correct thing to do,
because it interferes with the ability of TIPC to immediately restore a
failed link. (However, it does appear to solve your problem.) It would
probably be preferable to introduce some sort of timeout mechanism into
the link code that would defer this kind of MAC address "erasing" for a
reasonable period of time. However, I'm not sure what constitutes
"reasonable".
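To make the timeout idea a little more concrete, here is a rough
userspace sketch in C. It is not TIPC code: the struct, the function
names, and the 30-second grace period are all made up for illustration,
and a real change would have to hook into the existing link timer and
state machine. The only point is that the media address survives a short
outage and is erased only once the link has been down for longer than
the grace period.

```c
#include <stdbool.h>
#include <string.h>
#include <time.h>

/* Hypothetical stand-ins for TIPC's real structures (names are illustrative). */
struct media_addr {
    unsigned char value[20];
};

struct link_endpoint {
    struct media_addr media_addr;
    bool is_up;
    time_t down_since;   /* meaningful only while is_up is false */
};

/* Assumed grace period; what counts as "reasonable" is an open question. */
#define STALE_LINK_TIMEOUT_SECS 30

/* Peer stopped responding: remember when, but keep the media address. */
void link_mark_down(struct link_endpoint *l, time_t now)
{
    l->is_up = false;
    l->down_since = now;
}

/* Peer came back within the grace period: the address is still there. */
void link_mark_up(struct link_endpoint *l)
{
    l->is_up = true;
}

/*
 * Meant to be called periodically. Erases the media address only if the
 * link has been down for at least the grace period. Returns true when
 * the address was erased by this call.
 */
bool link_expire_media_addr(struct link_endpoint *l, time_t now)
{
    if (l->is_up)
        return false;
    if (difftime(now, l->down_since) < STALE_LINK_TIMEOUT_SECS)
        return false;
    memset(&l->media_addr, 0, sizeof(l->media_addr));
    return true;
}
```

With something along these lines, a node that reboots or a cable that is
reconnected within the grace period would still get the fast link
restoration, while an endpoint that stays down would eventually stop
claiming the old MAC address.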
I've copied Jon Maloy on this email as he has a lot more history with
TIPC's link code than I do, and understands better why things are coded
the way they are. I'd be interested in hearing his thoughts on this
matter ...
Regards,
Al
________________________________
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of
Richard Lopez
Sent: Thursday, February 07, 2008 10:44 AM
To: [email protected]
Subject: [tipc-discussion] Stale Links
Configuration:
- Linux Kernel 2.6.20
- TIPC 1.7.5
I am using TIPC in a chassis based environment. Each card in the chassis
runs TIPC, and the node address for a card is based on the slot it is
plugged into. For example, a card in slot 3 would have a node address of
<1.1.3>.
The problem I observe occurs when I start to move around cards that are
plugged into the chassis. The TIPC port associated with the card that is
moved never gets created. Here is a simple example:
Card A in slot 1 has node address <1.1.1> and MAC address MAC_A
Card B in slot 2 has node address <1.1.2> and MAC address MAC_B
If I move Card B to slot 3 (node address is now <1.1.3> and MAC address
is still MAC_B) the link shows "up" when I run tipc-config -l on Card A
and Card B, but the TIPC port never gets created (tipc-config -p) and
applications using TIPC cannot send messages.
I believe the problem is caused by the fact that the original link
between <1.1.1> and <1.1.2> never gets deleted. When I perform a
tipc-config -l on Card A, I see a link between <1.1.1> and <1.1.2>
"down", and a link between <1.1.1> and <1.1.3> "up". After instrumenting
the code, it appears that the original link still has MAC_B as the media
address and this causes problems when Card B comes up with the same MAC
but on a different link and Node address.
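The state described above can be modeled in a few lines of C. This is
only an illustrative sketch: struct link_view and find_stale_duplicate()
are hypothetical names, not part of TIPC, and the model says nothing
about how TIPC itself looks up links. It simply flags a "down" link
whose media address is also in use by an "up" link to a different peer.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAC_LEN 6

/* Hypothetical, simplified view of a link endpoint; not TIPC's real struct. */
struct link_view {
    const char *peer;              /* peer node address, e.g. "<1.1.2>" */
    unsigned char mac[MAC_LEN];    /* media address recorded for the peer */
    bool is_up;
};

/*
 * Return the first "down" link whose MAC address is also held by an
 * "up" link to another peer, or NULL if there is no such link.
 */
const struct link_view *find_stale_duplicate(const struct link_view *links,
                                             size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (links[i].is_up)
            continue;
        for (size_t j = 0; j < n; j++) {
            if (j != i && links[j].is_up &&
                memcmp(links[i].mac, links[j].mac, MAC_LEN) == 0)
                return &links[i];
        }
    }
    return NULL;
}
```

Fed the two links seen on Card A (a "down" link to <1.1.2> and an "up"
link to <1.1.3>, both recorded with MAC_B), this would return the
<1.1.2> entry as the stale one.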
Question 1: There seems to be a tipc_link_delete function but it does
not seem to be used when communication is lost with a peer. Why are
links not deleted?
I was able to work around this issue by clearing the media address before
resetting the link when a peer is not responding (see below). I saw
something similar done in tipc_discover.c, along with this comment:
* TODO: It might be better to delete these "stale" link endpoints,
* but this could be tricky [see tipc_link_delete()].
Question 2: Is there a plan to clean up stale links in a future release?
If not, is the change shown below an appropriate fix?
Change:
===================================================================
--- linux-2.6.20.7/net/tipc/tipc_link.c (original copy)
+++ linux-2.6.20.7/net/tipc/tipc_link.c (working copy)
@@ -836,6 +836,12 @@
                     l_ptr->fsm_msg_cnt);
                 warn("Resetting link <%s>, peer not responding\n",
                      l_ptr->name);
+                memset(&l_ptr->media_addr, 0, sizeof(struct tipc_media_addr));
                 tipc_link_reset(l_ptr);
                 l_ptr->state = RESET_UNKNOWN;
                 l_ptr->fsm_msg_cnt = 0;
Richard Lopez