The attached patch fixes (or hacks around) some problems I see with
certain scenarios related to removing interfaces dynamically when
multicast is enabled. It's possible this problem is because of my other
fea changes, but I think it's a problem with the base logic.
There is likely more that one issue here, but I think this use case
described
below would probably reproduce this on un-patched xorp configured with
multicast.
My case is with my patched xorp, and the patch is on top of my other
patches.
Here is the scenario:
Suppose a xorp instance has 3 interfaces, br0, eth0, eth1, and suppose
eth0 and eth1
are usb NICs that can dissappear when removed from the machine (I'm using
different virtual interfaces, but the logic is similar.)
I have a control program that notices (or causes) interface removal, and
when it does,
it sends cli commands to xorpsh to remove the config related to that
interface.
Now, xorp is all fine with 3 interfaces.
...
eth0 is removed
my monitor program initiates xorpsh commands to remove eth0 config info
eth1 is removed
xorp notices eth0 is removed and cleans up fea logic.
xorp notices eth1 is removed and cleans up fea logic.
xorpsh commands above get to the 'commit' stage
monitor program queues additional commands to remove eth1 config info
The commit fails, because the commit tries to set info for eth1, and the
device is no
longer in the kernel. Since this commit fails, the next one to remove
eth1 will as well
(because eth0 is still in the config, and still not in the OS).
The scenario above can happen in several different orders, but the basic
problem is that
xorp commit cannot deal with interfaces being removed from the OS at
arbitrary times.
The attached patch ignores failure cases related to interfaces being
removed. I think
it would be better if we could return a warning message to the xorpsh
without
failing the commit, but that is beyond the scope of what I have time to
accomplish
currently.
Please consider applying any part of this you see as useful, and/or
coming up with something
better.
Thanks,
Ben
--
Ben Greear <[EMAIL PROTECTED]>
Candela Technologies Inc http://www.candelatech.com
diff --git a/fea/data_plane/ifconfig/ifconfig_set.cc b/fea/data_plane/ifconfig/ifconfig_set.cc
index 5693373..9362d3c 100644
--- a/fea/data_plane/ifconfig/ifconfig_set.cc
+++ b/fea/data_plane/ifconfig/ifconfig_set.cc
@@ -140,9 +140,11 @@ IfConfigSet::push_config(const IfTree& iftree)
// XXX: ignore deleted interfaces that are not recognized
continue;
}
- error_reporter.interface_error(config_iface.ifname(),
- "interface not recognized");
- break;
+ // Maybe it was already deleted from xorp due to OS removal. We should
+ // just ignore this one
+ //error_reporter.interface_error(config_iface.ifname(),
+ // "interface not recognized");
+ //break;
}
//
@@ -244,6 +246,7 @@ IfConfigSet::push_config(const IfTree& iftree)
push_vif_creation(system_ifp, system_vifp, config_iface,
config_vif);
+
}
}
@@ -648,8 +651,24 @@ IfConfigSet::push_vif_address(const IfTreeInterface* system_ifp,
config_iface, config_vif, config_addr,
error_msg)
!= XORP_OK) {
- error_msg = c_format("Failed to add address: %s",
- error_msg.c_str());
+ if (strstr(error_msg.c_str(), "No such device")) {
+ XLOG_ERROR("Failed to configure address because of device not found: %s",
+ error_msg.c_str());
+ // We can't pass this back as an error to the cli because the device is gone,
+ // and if we fail this set, the whole commit will fail. This commit could be
+ // deleting *another* device, with a (very near) future set of xorpsh commands
+ // coming in to delete *this* interface in question.
+ // This is a hack...we really need a way to pass warnings back to the
+ // cli but NOT fail the commit if the logical state is correct but the
+ // (transient, unpredictable, asynchronously discovered) state of the actual
+ // OS network devices is currently out of state.
+ // --Ben
+ error_msg = "";
+ }
+ else {
+ error_msg = c_format("Failed to add address, not device-no-found error: %s",
+ error_msg.c_str());
+ }
}
} else {
//
@@ -720,8 +739,24 @@ IfConfigSet::push_vif_address(const IfTreeInterface* system_ifp,
config_iface, config_vif, config_addr,
error_msg)
!= XORP_OK) {
- error_msg = c_format("Failed to configure address: %s",
- error_msg.c_str());
+ if (strstr(error_msg.c_str(), "No such device")) {
+ XLOG_ERROR("Failed to configure address because of device not found: %s",
+ error_msg.c_str());
+ // We can't pass this back as an error to the cli because the device is gone,
+ // and if we fail this set, the whole commit will fail. This commit could be
+ // deleting *another* device, with a (very near) future set of xorpsh commands
+ // coming in to delete *this* interface in question.
+ // This is a hack...we really need a way to pass warnings back to the
+ // cli but NOT fail the commit if the logical state is correct but the
+ // (transient, unpredictable, asynchronously discovered) state of the actual
+ // OS network devices is currently out of state.
+ // --Ben
+ error_msg = "";
+ }
+ else {
+ error_msg = c_format("Failed to configure address, not device-no-found error: %s",
+ error_msg.c_str());
+ }
}
} else {
//
diff --git a/fea/data_plane/ifconfig/ifconfig_set_netlink_socket.cc b/fea/data_plane/ifconfig/ifconfig_set_netlink_socket.cc
index 0880b5d..2381d6b 100644
--- a/fea/data_plane/ifconfig/ifconfig_set_netlink_socket.cc
+++ b/fea/data_plane/ifconfig/ifconfig_set_netlink_socket.cc
@@ -961,7 +961,7 @@ IfConfigSetNetlinkSocket::add_addr(const string& ifname,
if (ns.sendto(&buffer, nlh->nlmsg_len, 0,
reinterpret_cast<struct sockaddr*>(&snl), sizeof(snl))
!= (ssize_t)nlh->nlmsg_len) {
- error_msg = c_format("Cannot add address '%s' "
+ error_msg = c_format("IfConfigSetNetlinkSocket::add_addr: sendto: Cannot add address '%s' "
"on interface '%s' vif '%s': %s",
addr.str().c_str(),
ifname.c_str(), vifname.c_str(), strerror(errno));
@@ -970,7 +970,7 @@ IfConfigSetNetlinkSocket::add_addr(const string& ifname,
if (NlmUtils::check_netlink_request(_ns_reader, ns, nlh->nlmsg_seq,
last_errno, error_msg)
!= XORP_OK) {
- error_msg = c_format("Cannot add address '%s' "
+ error_msg = c_format("IfConfigSetNetlinkSocket::add_addr: check_nl_req: Cannot add address '%s' "
"on interface '%s' vif '%s': %s",
addr.str().c_str(),
ifname.c_str(), vifname.c_str(),
diff --git a/fea/mfea_node.cc b/fea/mfea_node.cc
index 04027d7..a216414 100644
--- a/fea/mfea_node.cc
+++ b/fea/mfea_node.cc
@@ -1125,7 +1125,7 @@ MfeaNode::enable_vif(const string& vif_name, string& error_msg)
{
MfeaVif *mfea_vif = vif_find_by_name(vif_name);
if (mfea_vif == NULL) {
- error_msg = c_format("Cannot enable vif %s: no such vif",
+ error_msg = c_format("MfeaNode: Cannot enable vif %s: no such vif",
vif_name.c_str());
XLOG_ERROR("%s", error_msg.c_str());
return (XORP_ERROR);
@@ -1153,7 +1153,9 @@ MfeaNode::disable_vif(const string& vif_name, string& error_msg)
error_msg = c_format("Cannot disable vif %s: no such vif",
vif_name.c_str());
XLOG_ERROR("%s", error_msg.c_str());
- return (XORP_ERROR);
+ // If it's gone, it's as disabled as it can be.
+ //return (XORP_ERROR);
+ return XORP_OK;
}
mfea_vif->disable();
@@ -1211,7 +1213,10 @@ MfeaNode::stop_vif(const string& vif_name, string& error_msg)
error_msg = c_format("Cannot stop vif %s: no such vif",
vif_name.c_str());
XLOG_ERROR("%s", error_msg.c_str());
- return (XORP_ERROR);
+ // If it doesn't exist, it's as stopped as it's going to get. Returning
+ // error will cause entire commit to fail.
+ //return (XORP_ERROR);
+ return XORP_OK;
}
if (mfea_vif->stop(error_msg) != XORP_OK) {
diff --git a/fea/xrl_mfea_node.cc b/fea/xrl_mfea_node.cc
index 79a1afd..b834f21 100644
--- a/fea/xrl_mfea_node.cc
+++ b/fea/xrl_mfea_node.cc
@@ -1407,8 +1407,10 @@ XrlMfeaNode::mfea_0_1_enable_vif(
else
ret_code = MfeaNode::disable_vif(vif_name, error_msg);
- if (ret_code != XORP_OK)
- return XrlCmdError::COMMAND_FAILED(error_msg);
+ if (ret_code != XORP_OK) {
+ XLOG_ERROR("Mfea, enable/disable vif failed. Allowing commit to succeed anyway since this is likely a race with a deleted interface, error: %s\n", error_msg.c_str());
+ //return XrlCmdError::COMMAND_FAILED(error_msg);
+ }
return XrlCmdError::OKAY();
}
diff --git a/mld6igmp/mld6igmp_node.cc b/mld6igmp/mld6igmp_node.cc
index bd4610d..d0b70b6 100644
--- a/mld6igmp/mld6igmp_node.cc
+++ b/mld6igmp/mld6igmp_node.cc
@@ -1028,7 +1028,7 @@ Mld6igmpNode::enable_vif(const string& vif_name, string& error_msg)
{
Mld6igmpVif *mld6igmp_vif = vif_find_by_name(vif_name);
if (mld6igmp_vif == NULL) {
- error_msg = c_format("Cannot enable vif %s: no such vif",
+ error_msg = c_format("Mld6igmpNode: Cannot enable vif %s: no such vif",
vif_name.c_str());
XLOG_ERROR("%s", error_msg.c_str());
return (XORP_ERROR);
@@ -1056,7 +1056,9 @@ Mld6igmpNode::disable_vif(const string& vif_name, string& error_msg)
error_msg = c_format("Cannot disable vif %s: no such vif",
vif_name.c_str());
XLOG_ERROR("%s", error_msg.c_str());
- return (XORP_ERROR);
+ // It's as disabled as it will get, don't fail commit.
+ error_msg = "";
+ return XORP_OK;
}
mld6igmp_vif->disable();
diff --git a/pim/pim_node.cc b/pim/pim_node.cc
index 7ea543c..e18feee 100644
--- a/pim/pim_node.cc
+++ b/pim/pim_node.cc
@@ -1155,7 +1155,7 @@ PimNode::enable_vif(const string& vif_name, string& error_msg)
{
PimVif *pim_vif = vif_find_by_name(vif_name);
if (pim_vif == NULL) {
- error_msg = c_format("Cannot enable vif %s: no such vif",
+ error_msg = c_format("PimNode: Cannot enable vif %s: no such vif",
vif_name.c_str());
XLOG_ERROR("%s", error_msg.c_str());
return (XORP_ERROR);
@@ -1183,7 +1183,9 @@ PimNode::disable_vif(const string& vif_name, string& error_msg)
error_msg = c_format("Cannot disable vif %s: no such vif",
vif_name.c_str());
XLOG_ERROR("%s", error_msg.c_str());
- return (XORP_ERROR);
+ // It's as disabled as it's going to get..don't fail the commit.
+ error_msg = "";
+ return XORP_OK;
}
pim_vif->disable();
diff --git a/pim/xrl_pim_node.cc b/pim/xrl_pim_node.cc
index 45da17e..6cb38cd 100644
--- a/pim/xrl_pim_node.cc
+++ b/pim/xrl_pim_node.cc
@@ -1322,8 +1322,9 @@ XrlPimNode::mfea_client_send_register_unregister_protocol_cb(
//
// If a command failed because the other side rejected it, this is
// fatal.
+ // Shouldn't be fatal, network device can suddenly dissappear, for instance.
//
- XLOG_FATAL("Cannot %s protocol with the MFEA: %s",
+ XLOG_ERROR("Cannot %s protocol with the MFEA: %s",
entry->operation_name(), xrl_error.str().c_str());
break;
_______________________________________________
Xorp-hackers mailing list
[email protected]
http://mailman.ICSI.Berkeley.EDU/mailman/listinfo/xorp-hackers