Re: [PATCH v2 net] bonding: fix 802.3ad state sent to partner when unbinding slave
Toni Peltonen wrote: >Previously when unbinding a slave the 802.3ad implementation only told >partner that the port is not suitable for aggregation by setting the port >aggregation state from aggregatable to individual. This is not enough. If the >physical layer still stays up and we only unbinded this port from the bond >there >is nothing in the aggregation status alone to prevent the partner from sending >traffic towards us. To ensure that the partner doesn't consider this >port at all anymore we should also disable collecting and distributing to >signal that this actor is going away. Also clear AD_STATE_SYNCHRONIZATION to >ensure partner exits collecting + distributing state. > >I have tested this behaviour againts Arista EOS switches with mlx5 cards >(physical link stays up even when interface is down) and simulated >the same situation virtually Linux <-> Linux with two network namespaces >running two veth device pairs. In both cases setting aggregation to >individual doesn't alone prevent traffic from being to sent towards this >port given that the link stays up in partners end. Partner still keeps >it's end in collecting + distributing state and continues until timeout is >reached. In most cases this means we are losing the traffic partner sends >towards our port while we wait for timeout. This is most visible with slow >periodic time (LACP rate slow). > >Other open source implementations like Open VSwitch and libreswitch, and >vendor implementations like Arista EOS, seem to disable collecting + >distributing to when doing similar port disabling/detaching/removing change. >With this patch kernel implementation would behave the same way and ensure >partner doesn't consider our actor viable anymore. > >Signed-off-by: Toni Peltonen Signed-off-by: Jay Vosburgh >--- >v2 changes: >* Fix typo in commit message >* Also clear AD_STATE_SYNCHRONIZATION >--- > drivers/net/bonding/bond_3ad.c | 3 +++ > 1 file changed, 3 insertions(+) > >diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c >index f43fb2f958a5..93dfcef8afc4 100644 >--- a/drivers/net/bonding/bond_3ad.c >+++ b/drivers/net/bonding/bond_3ad.c >@@ -2086,6 +2086,9 @@ void bond_3ad_unbind_slave(struct slave *slave) > aggregator->aggregator_identifier); > > /* Tell the partner that this port is not suitable for aggregation */ >+ port->actor_oper_port_state &= ~AD_STATE_SYNCHRONIZATION; >+ port->actor_oper_port_state &= ~AD_STATE_COLLECTING; >+ port->actor_oper_port_state &= ~AD_STATE_DISTRIBUTING; > port->actor_oper_port_state &= ~AD_STATE_AGGREGATION; > __update_lacpdu_from_port(port); > ad_lacpdu_send(port); >-- >2.19.0 >
Re: [PATCH net] bonding: fix 802.3ad state sent to partner when unbinding slave
Toni Peltonen wrote: >Previously when unbinding a slave the 802.3ad implementation only told >partner that the port is not suitable for aggregation by setting the port >aggregation state from aggregatable to individual. This is not enough. If the >physical layer still stays up and we only unbinded this port from the bond >there >is nothing in the aggregation status alone to prevent the partner from sending >traffic towards us. To ensure that the partner doesn't consider this >port at all anymore we should also disable collecting and distributing to >signal that this actor is going away. > >I have tested this behaviour againts Arista EOS switches with mlx5 cards >(physical link stays up when even when interface is down) and simulated >the same situation virtually Linux <-> Linux with two network namespaces >running two veth device pairs. In both cases setting aggregation to >individual doesn't alone prevent traffic from being to sent towards this >port given that the link stays up in partners end. Partner still keeps >it's end in collecting + distributing state and continues until timeout is >reached. In most cases this means we are losing the traffic partner sends >towards our port while we wait for timeout. This is most visible with slow >periodic time (LAPC rate slow). "LAPC" -> "LACP" >Other open source implementations like Open VSwitch and libreswitch, and >vendor implementations like Arista EOS, seem to disable collecting + >distributing to when doing similar port disabling/detaching/removing change. >With this patch kernel implementation would behave the same way and ensure >partner doesn't consider our actor viable anymore. After re-reading the relevant bits of 802.1AX (particularly 5.4.9 on recordPDU and update_Selected) I'm going to suggest also clearing AD_STATE_SYNCHRONIZATION, based on: Partner_Oper_Port_State.Synchronization is also set to TRUE if the value of Actor_State.Aggregation in the received PDU is set to FALSE (i.e., indicates an Individual link), Actor_State.Synchronization in the received PDU is set to TRUE, and LACP will actively maintain the link. Per the above, learing _SYNC in the LACPDU should un-sync the port, inducing the Mux state machine (figure 5-15) to exit C_D state and go to ATTACHED state (disabling Coll/Dist). But, either way, as this is a hint to get the link partner to stop using the port, this looks reasonable to me. Acked-by: Jay Vosburgh -J >Signed-off-by: Toni Peltonen >--- > drivers/net/bonding/bond_3ad.c | 2 ++ > 1 file changed, 2 insertions(+) > >diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c >index f43fb2f958a5..6776c33753dc 100644 >--- a/drivers/net/bonding/bond_3ad.c >+++ b/drivers/net/bonding/bond_3ad.c >@@ -2086,6 +2086,8 @@ void bond_3ad_unbind_slave(struct slave *slave) > aggregator->aggregator_identifier); > > /* Tell the partner that this port is not suitable for aggregation */ >+ port->actor_oper_port_state &= ~AD_STATE_COLLECTING; >+ port->actor_oper_port_state &= ~AD_STATE_DISTRIBUTING; > port->actor_oper_port_state &= ~AD_STATE_AGGREGATION; > __update_lacpdu_from_port(port); > ad_lacpdu_send(port); >-- >2.19.0 --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH] bonding: fix length of actor system
Tobias Jungel wrote: >The attribute IFLA_BOND_AD_ACTOR_SYSTEM is sent to user space having the >length of sizeof(bond->params.ad_actor_system) which is 8 byte. This >patch aligns the length to ETH_ALEN to have the same MAC address exposed >as using sysfs. > >fixes f87fda00b6ed2 > >Signed-off-by: Tobias Jungel The patch looks fine to me, but the "fixes" line is not formatted properly. Please format it according to Documentation/process/submitting-patches.rst and resubmit your patch as V2. -J >--- > drivers/net/bonding/bond_netlink.c | 3 +-- > 1 file changed, 1 insertion(+), 2 deletions(-) > >diff --git a/drivers/net/bonding/bond_netlink.c >b/drivers/net/bonding/bond_netlink.c >index 9697977b80f0..6b9ad8673218 100644 >--- a/drivers/net/bonding/bond_netlink.c >+++ b/drivers/net/bonding/bond_netlink.c >@@ -638,8 +638,7 @@ static int bond_fill_info(struct sk_buff *skb, > goto nla_put_failure; > > if (nla_put(skb, IFLA_BOND_AD_ACTOR_SYSTEM, >- sizeof(bond->params.ad_actor_system), >- >params.ad_actor_system)) >+ ETH_ALEN, >params.ad_actor_system)) > goto nla_put_failure; > } > if (!bond_3ad_get_active_agg_info(bond, )) { > --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH net-next] net/ipv6: Block IPv6 addrconf on team ports
Chas Williams <3ch...@gmail.com> wrote: >On 10/25/2018 05:59 PM, Jay Vosburgh wrote: >> Chas Williams <3ch...@gmail.com> wrote: >> >>> netif_is_lag_port should be used to identify link aggregation ports. >>> For this to work, we need to reorganize the bonding and team drivers >>> so that the necessary flags are set before dev_open is called. >>> >>> commit 31e77c93e432 ("sched/fair: Update blocked load when newly idle") >>> made this decision originally based on the IFF_SLAVE flag which isn't >>> used by the team driver. Note, we do need to retain the IFF_SLAVE >>> check for the eql driver. >> >> Is 31e77c93e432 the correct commit reference? I don't see >> anything in there about IFF_SLAVE or bonding; it's a patch to the >> process scheduler. > >No, that's wrong. It should be c2edacf80e155. > >> And, as Jiri said, the subject doesn't mention bonding. > >The behavior of bonding wasn't changed. The intent of the patch >is to add team slaves to the interfaces that don't get automatic >IPv6 addresses. The body discusses why bonding had to change as >well. Sure, but the bonding code has changed, and the current presentation makes it harder for reviewers to follow (or perhaps even notice). >I was under the impression that the subject needs to kept short. >If there a better way to phrase what I want to do? I'd suggest splitting this into three patches: A first patch that adds the new IPv6 functionality, then one patch each for team and bonding to take advantage of that new functionality. Each of the three would then be very straightforward, change just one thing, and should be clearer all around. -J >>> Signed-off-by: Chas Williams <3ch...@gmail.com> >>> --- >>> drivers/net/bonding/bond_main.c | 4 ++-- >>> drivers/net/team/team.c | 7 +-- >>> net/ipv6/addrconf.c | 2 +- >>> 3 files changed, 8 insertions(+), 5 deletions(-) >>> >>> diff --git a/drivers/net/bonding/bond_main.c >>> b/drivers/net/bonding/bond_main.c >>> index ffa37adb7681..5cdad164332b 100644 >>> --- a/drivers/net/bonding/bond_main.c >>> +++ b/drivers/net/bonding/bond_main.c >>> @@ -1536,6 +1536,7 @@ int bond_enslave(struct net_device *bond_dev, struct >>> net_device *slave_dev, >>> >>> /* set slave flag before open to prevent IPv6 addrconf */ >>> slave_dev->flags |= IFF_SLAVE; >>> + slave_dev->priv_flags |= IFF_BONDING; >>> >>> /* open the slave since the application closed it */ >>> res = dev_open(slave_dev); >>> @@ -1544,7 +1545,6 @@ int bond_enslave(struct net_device *bond_dev, struct >>> net_device *slave_dev, >>> goto err_restore_mac; >>> } >>> >>> - slave_dev->priv_flags |= IFF_BONDING; >>> /* initialize slave stats */ >>> dev_get_stats(new_slave->dev, _slave->slave_stats); >>> >>> @@ -1804,10 +1804,10 @@ int bond_enslave(struct net_device *bond_dev, >>> struct net_device *slave_dev, >>> slave_disable_netpoll(new_slave); >>> >>> err_close: >>> - slave_dev->priv_flags &= ~IFF_BONDING; >>> dev_close(slave_dev); >>> >>> err_restore_mac: >>> + slave_dev->priv_flags &= ~IFF_BONDING; >>> slave_dev->flags &= ~IFF_SLAVE; >>> if (!bond->params.fail_over_mac || >>> BOND_MODE(bond) != BOND_MODE_ACTIVEBACKUP) { >>> diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c >>> index db633ae9f784..8fc7d57e9f6d 100644 >>> --- a/drivers/net/team/team.c >>> +++ b/drivers/net/team/team.c >>> @@ -1128,14 +1128,12 @@ static int team_upper_dev_link(struct team *team, >>> struct team_port *port, >>>_upper_info, extack); >>> if (err) >>> return err; >>> - port->dev->priv_flags |= IFF_TEAM_PORT; >>> return 0; >>> } >>> >>> static void team_upper_dev_unlink(struct team *team, struct team_port *port) >>> { >>> netdev_upper_dev_unlink(port->dev, team->dev); >>> - port->dev->priv_flags &= ~IFF_TEAM_PORT; >>> } >>> >>> static void __team_port_change_port_added(struct team_port *port, bool >>> linkup); >>> @@ -1214,6 +1212,9 @@ static int team_port_add(struct team *team, struct >>> net_device *port_dev, >>> goto err_po
Re: [PATCH net-next] net/ipv6: Block IPv6 addrconf on team ports
Chas Williams <3ch...@gmail.com> wrote: >netif_is_lag_port should be used to identify link aggregation ports. >For this to work, we need to reorganize the bonding and team drivers >so that the necessary flags are set before dev_open is called. > >commit 31e77c93e432 ("sched/fair: Update blocked load when newly idle") >made this decision originally based on the IFF_SLAVE flag which isn't >used by the team driver. Note, we do need to retain the IFF_SLAVE >check for the eql driver. Is 31e77c93e432 the correct commit reference? I don't see anything in there about IFF_SLAVE or bonding; it's a patch to the process scheduler. And, as Jiri said, the subject doesn't mention bonding. >Signed-off-by: Chas Williams <3ch...@gmail.com> >--- > drivers/net/bonding/bond_main.c | 4 ++-- > drivers/net/team/team.c | 7 +-- > net/ipv6/addrconf.c | 2 +- > 3 files changed, 8 insertions(+), 5 deletions(-) > >diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c >index ffa37adb7681..5cdad164332b 100644 >--- a/drivers/net/bonding/bond_main.c >+++ b/drivers/net/bonding/bond_main.c >@@ -1536,6 +1536,7 @@ int bond_enslave(struct net_device *bond_dev, struct >net_device *slave_dev, > > /* set slave flag before open to prevent IPv6 addrconf */ > slave_dev->flags |= IFF_SLAVE; >+ slave_dev->priv_flags |= IFF_BONDING; > > /* open the slave since the application closed it */ > res = dev_open(slave_dev); >@@ -1544,7 +1545,6 @@ int bond_enslave(struct net_device *bond_dev, struct >net_device *slave_dev, > goto err_restore_mac; > } > >- slave_dev->priv_flags |= IFF_BONDING; > /* initialize slave stats */ > dev_get_stats(new_slave->dev, _slave->slave_stats); > >@@ -1804,10 +1804,10 @@ int bond_enslave(struct net_device *bond_dev, struct >net_device *slave_dev, > slave_disable_netpoll(new_slave); > > err_close: >- slave_dev->priv_flags &= ~IFF_BONDING; > dev_close(slave_dev); > > err_restore_mac: >+ slave_dev->priv_flags &= ~IFF_BONDING; > slave_dev->flags &= ~IFF_SLAVE; > if (!bond->params.fail_over_mac || > BOND_MODE(bond) != BOND_MODE_ACTIVEBACKUP) { >diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c >index db633ae9f784..8fc7d57e9f6d 100644 >--- a/drivers/net/team/team.c >+++ b/drivers/net/team/team.c >@@ -1128,14 +1128,12 @@ static int team_upper_dev_link(struct team *team, >struct team_port *port, > _upper_info, extack); > if (err) > return err; >- port->dev->priv_flags |= IFF_TEAM_PORT; > return 0; > } > > static void team_upper_dev_unlink(struct team *team, struct team_port *port) > { > netdev_upper_dev_unlink(port->dev, team->dev); >- port->dev->priv_flags &= ~IFF_TEAM_PORT; > } > > static void __team_port_change_port_added(struct team_port *port, bool > linkup); >@@ -1214,6 +1212,9 @@ static int team_port_add(struct team *team, struct >net_device *port_dev, > goto err_port_enter; > } > >+ /* set slave flag before open to prevent IPv6 addrconf */ >+ port->dev->priv_flags |= IFF_TEAM_PORT; >+ > err = dev_open(port_dev); > if (err) { > netdev_dbg(dev, "Device %s opening failed\n", >@@ -1292,6 +1293,7 @@ static int team_port_add(struct team *team, struct >net_device *port_dev, > dev_close(port_dev); > > err_dev_open: >+ port->dev->priv_flags &= ~IFF_TEAM_PORT; > team_port_leave(team, port); > team_port_set_orig_dev_addr(port); > >@@ -1328,6 +1330,7 @@ static int team_port_del(struct team *team, struct >net_device *port_dev) > dev_uc_unsync(port_dev, dev); > dev_mc_unsync(port_dev, dev); > dev_close(port_dev); >+ port->dev->priv_flags &= ~IFF_TEAM_PORT; > team_port_leave(team, port); > > __team_option_inst_mark_removed_port(team, port); >diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c >index 45b84dd5c4eb..121f863022ed 100644 >--- a/net/ipv6/addrconf.c >+++ b/net/ipv6/addrconf.c >@@ -3482,7 +3482,7 @@ static int addrconf_notify(struct notifier_block *this, >unsigned long event, > > case NETDEV_UP: > case NETDEV_CHANGE: >- if (dev->flags & IFF_SLAVE) >+ if (netif_is_lag_port(dev) || dev->flags & IFF_SLAVE) Note that netvsc_vf_join() also uses IFF_SLAVE in order skip IPv6 addrconf for netvsc devices; I don't believe its usage will pass netif_is_lag_port(). It looks like the above will work, but your commit message mentions eql as the reason for retaining the IFF_SLAVE test, and eql isn't the only user of IFF_SLAVE in this manner. -J > break; > > if (idev && idev->cnf.disable_ipv6) >-- >2.14.4 > --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH net 02/15] bonding: use netpoll_poll_dev() helper
Eric Dumazet wrote: >We want to allow NAPI drivers to no longer provide >ndo_poll_controller() method, as it has been proven problematic. > >team driver must not look at its presence, but instead call >netpoll_poll_dev() which factorize the needed actions. This patch is for bonding, not team; otherwise LGTM. Acked-by: Jay Vosburgh -J >Signed-off-by: Eric Dumazet >Cc: Jay Vosburgh >Cc: Veaceslav Falico >Cc: Andy Gospodarek >--- > drivers/net/bonding/bond_main.c | 11 ++- > 1 file changed, 2 insertions(+), 9 deletions(-) > >diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c >index >a764a83f99dabe54585dbad7dba40b6601177c03..0d87e11e7f1d84537fe43d95249b1bd3a2ce291d > 100644 >--- a/drivers/net/bonding/bond_main.c >+++ b/drivers/net/bonding/bond_main.c >@@ -971,16 +971,13 @@ static void bond_poll_controller(struct net_device >*bond_dev) > struct slave *slave = NULL; > struct list_head *iter; > struct ad_info ad_info; >- struct netpoll_info *ni; >- const struct net_device_ops *ops; > > if (BOND_MODE(bond) == BOND_MODE_8023AD) > if (bond_3ad_get_active_agg_info(bond, _info)) > return; > > bond_for_each_slave_rcu(bond, slave, iter) { >- ops = slave->dev->netdev_ops; >- if (!bond_slave_is_up(slave) || !ops->ndo_poll_controller) >+ if (!bond_slave_is_up(slave)) > continue; > > if (BOND_MODE(bond) == BOND_MODE_8023AD) { >@@ -992,11 +989,7 @@ static void bond_poll_controller(struct net_device >*bond_dev) > continue; > } > >- ni = rcu_dereference_bh(slave->dev->npinfo); >- if (down_trylock(>dev_lock)) >- continue; >- ops->ndo_poll_controller(slave->dev); >- up(>dev_lock); >+ netpoll_poll_dev(slave->dev); > } > } > >-- >2.19.0.444.g18242da7ef-goog >
Re: [PATCH next] bonding: pass link-local packets to bonding master also.
Mahesh Bandewar wrote: >From: Mahesh Bandewar > >Commit b89f04c61efe ("bonding: deliver link-local packets with >skb->dev set to link that packets arrived on") changed the behavior >of how link-local-multicast packets are processed. The change in >the behavior broke some legacy use cases where these packets are >expected to arrive on bonding master device also. > >This patch passes the packet to the stack with the link it arrived >on as well as passes to the bonding-master device to preserve the >legacy use case. Michal, can you test this? I'm travelling this week and won't be able to run the patch. Mahesh, will this confuse LLDP, et al, daemons that, e.g., bind to every possible interface and now see the same LLDP PDU (identical Chassis ID, Port ID, et al, TLVs) on multiple interfaces? Thanks, -J >Reported-by: Michal Soltys >Signed-off-by: Mahesh Bandewar >--- > drivers/net/bonding/bond_main.c | 17 +++-- > 1 file changed, 15 insertions(+), 2 deletions(-) > >diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c >index 9a2ea3c1f949..1d3b7d8448f2 100644 >--- a/drivers/net/bonding/bond_main.c >+++ b/drivers/net/bonding/bond_main.c >@@ -1177,9 +1177,22 @@ static rx_handler_result_t bond_handle_frame(struct >sk_buff **pskb) > } > } > >- /* don't change skb->dev for link-local packets */ >- if (is_link_local_ether_addr(eth_hdr(skb)->h_dest)) >+ /* Link-local multicast packets should be passed to the >+ * stack on the link they arrive as well as pass them to the >+ * bond-master device. These packets are mostly usable when >+ * stack receives it with the link on which they arrive >+ * (e.g. LLDP) but there may be some legacy behavior that >+ * expects these packets to appear on bonding master too. >+ */ >+ if (is_link_local_ether_addr(eth_hdr(skb)->h_dest)) { >+ struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC); >+ >+ if (nskb) { >+ nskb->dev = bond->dev; >+ netif_rx(nskb); >+ } > return RX_HANDLER_PASS; >+ } > if (bond_should_deliver_exact_match(skb, slave, bond)) > return RX_HANDLER_EXACT; > >-- >2.18.0.203.gfac676dfb9-goog --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [BUG] bonded interfaces drop bpdu (stp) frames
Mahesh Bandewar (महेश बंडेवार) wrote: >On Thu, Jul 12, 2018 at 11:03 AM, Jay Vosburgh > wrote: >> Michal Soltys wrote: >> >>>On 07/12/2018 04:51 PM, Jay Vosburgh wrote: >>>> Mahesh Bandewar (महेश बंडेवार) wrote: >>>> >>>>> On Wed, Jul 11, 2018 at 3:23 PM, Michal Soltys wrote: >>>>>> >>>>>> Hi, >>>>>> >>>>>> As weird as that sounds, this is what I observed today after bumping >>>>>> kernel version. I have a setup where 2 bonds are attached to linux >>>>>> bridge and physically are connected to two switches doing MSTP (and >>>>>> linux bridge is just passing them). >>>>>> >>>>>> Initially I suspected some changes related to bridge code - but quick >>>>>> peek at the code showed nothing suspicious - and the part of it that >>>>>> explicitly passes stp frames if stp is not enabled has seen little >>>>>> changes (e.g. per-port group_fwd_mask added recently). Furthermore - if >>>>>> regular non-bonded interfaces are attached everything works fine. >>>>>> >>>>>> Just to be sure I detached the bond (802.3ad mode) and checked it with >>>>>> simple tcpdump (ether proto \\stp) - and indeed no hello packets were >>>>>> there (with them being present just fine on active enslaved interface, >>>>>> or on the bond device in earlier kernels). >>>>>> >>>>>> If time permits I'll bisect tommorow to pinpoint the commit, but from >>>>>> quick todays test - 4.9.x is working fine, while 4.16.16 (tested on >>>>>> debian) and 4.17.3 (tested on archlinux) are failing. >>>>>> >>>>>> Unless this is already a known issue (or you have any suggestions what >>>>>> could be responsible). >>>>>> >>>>> I believe these are link-local-multicast messages and sometime back a >>>>> change went into to not pass those frames to the bonding master. This >>>>> could be the side effect of that. >>>> >>>> Mahesh, I suspect you're thinking of: >>>> >>>> commit b89f04c61efe3b7756434d693b9203cc0cce002e >>>> Author: Chonggang Li >>>> Date: Sun Apr 16 12:02:18 2017 -0700 >>>> >>>> bonding: deliver link-local packets with skb->dev set to link that >>>> packets arrived on >>>> >>>> Michal, are you able to revert this patch and test? >>>> >>>> -J >>>> >>>> --- >>>> -Jay Vosburgh, jay.vosbu...@canonical.com >>>> >>> >>> >>>Just tested - yes, reverting that patch solves the issues. >> >> Chonggang, >> >> Reading the changelog in your commit referenced above, I'm not >> entirely sure what actual problem it is fixing. Could you elaborate? >> >> As the patch appears to cause a regression, it needs to be >> either fixed or reverted. >> >> Mahesh, you signed-off on it as well, perhaps you also have some >> context? >> > >I think the original idea behind it was to pass the LLDPDUs to the >stack on the interface that they came on since this is considered to >be link-local traffic and passing to bond-master would loose it's >"linklocal-ness". This is true for LLDP and if you change the skb->dev >of the packet, then you don't know which slave link it came on in >(from LLDP consumer's perspective). > >I don't know much about STP but trunking two links and aggregating >this link info through bond-master seems wrong. Just like LLDP, you >are losing info specific to a link and the decision derived from that >info could be wrong. > >Having said that, we determine "linklocal-ness" by looking at L2 and >bondmaster shares this with lts slaves. So it does seem fair to pass >those frames to the bonding-master but at the same time link-local >traffic is supposed to be limited to the physical link (LLDP/STP/LACP >etc). Your thoughts? I agree the whole thing sounds kind of weird, but I'm curious as to what Michal's actual use case is; he presumably has some practical use for this, since he noticed that the behavior changed. Michal, you mentioned MSTP and using 802.3ad (LACP) mode; how does that combination work rationally given that the bond might send and receive traffic across multiple slaves? Or does the switch side bundle the ports together into a single logical interface for MSTP purposes? On the TX side, I think the bond will likely balance all STP frames to just one slave. As for a resolution, presuming that Michal has some reasonable use case, I'm thinking along the lines of reverting the new (leave frame attached to slave) behavior for the general case and adding a special case for LLDP and friends to get the new behavior. I'd like to avoid adding any new options to bonding. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [BUG] bonded interfaces drop bpdu (stp) frames
Michal Soltys wrote: >On 07/12/2018 04:51 PM, Jay Vosburgh wrote: >> Mahesh Bandewar (महेश बंडेवार) wrote: >> >>> On Wed, Jul 11, 2018 at 3:23 PM, Michal Soltys wrote: >>>> >>>> Hi, >>>> >>>> As weird as that sounds, this is what I observed today after bumping >>>> kernel version. I have a setup where 2 bonds are attached to linux >>>> bridge and physically are connected to two switches doing MSTP (and >>>> linux bridge is just passing them). >>>> >>>> Initially I suspected some changes related to bridge code - but quick >>>> peek at the code showed nothing suspicious - and the part of it that >>>> explicitly passes stp frames if stp is not enabled has seen little >>>> changes (e.g. per-port group_fwd_mask added recently). Furthermore - if >>>> regular non-bonded interfaces are attached everything works fine. >>>> >>>> Just to be sure I detached the bond (802.3ad mode) and checked it with >>>> simple tcpdump (ether proto \\stp) - and indeed no hello packets were >>>> there (with them being present just fine on active enslaved interface, >>>> or on the bond device in earlier kernels). >>>> >>>> If time permits I'll bisect tommorow to pinpoint the commit, but from >>>> quick todays test - 4.9.x is working fine, while 4.16.16 (tested on >>>> debian) and 4.17.3 (tested on archlinux) are failing. >>>> >>>> Unless this is already a known issue (or you have any suggestions what >>>> could be responsible). >>>> >>> I believe these are link-local-multicast messages and sometime back a >>> change went into to not pass those frames to the bonding master. This >>> could be the side effect of that. >> >> Mahesh, I suspect you're thinking of: >> >> commit b89f04c61efe3b7756434d693b9203cc0cce002e >> Author: Chonggang Li >> Date: Sun Apr 16 12:02:18 2017 -0700 >> >> bonding: deliver link-local packets with skb->dev set to link that >> packets arrived on >> >> Michal, are you able to revert this patch and test? >> >> -J >> >> --- >> -Jay Vosburgh, jay.vosbu...@canonical.com >> > > >Just tested - yes, reverting that patch solves the issues. Chonggang, Reading the changelog in your commit referenced above, I'm not entirely sure what actual problem it is fixing. Could you elaborate? As the patch appears to cause a regression, it needs to be either fixed or reverted. Mahesh, you signed-off on it as well, perhaps you also have some context? -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [BUG] bonded interfaces drop bpdu (stp) frames
Mahesh Bandewar (महेश बंडेवार) wrote: >On Wed, Jul 11, 2018 at 3:23 PM, Michal Soltys wrote: >> >> Hi, >> >> As weird as that sounds, this is what I observed today after bumping >> kernel version. I have a setup where 2 bonds are attached to linux >> bridge and physically are connected to two switches doing MSTP (and >> linux bridge is just passing them). >> >> Initially I suspected some changes related to bridge code - but quick >> peek at the code showed nothing suspicious - and the part of it that >> explicitly passes stp frames if stp is not enabled has seen little >> changes (e.g. per-port group_fwd_mask added recently). Furthermore - if >> regular non-bonded interfaces are attached everything works fine. >> >> Just to be sure I detached the bond (802.3ad mode) and checked it with >> simple tcpdump (ether proto \\stp) - and indeed no hello packets were >> there (with them being present just fine on active enslaved interface, >> or on the bond device in earlier kernels). >> >> If time permits I'll bisect tommorow to pinpoint the commit, but from >> quick todays test - 4.9.x is working fine, while 4.16.16 (tested on >> debian) and 4.17.3 (tested on archlinux) are failing. >> >> Unless this is already a known issue (or you have any suggestions what >> could be responsible). >> >I believe these are link-local-multicast messages and sometime back a >change went into to not pass those frames to the bonding master. This >could be the side effect of that. Mahesh, I suspect you're thinking of: commit b89f04c61efe3b7756434d693b9203cc0cce002e Author: Chonggang Li Date: Sun Apr 16 12:02:18 2017 -0700 bonding: deliver link-local packets with skb->dev set to link that packets arrived on Michal, are you able to revert this patch and test? -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH net-next v2 4/4] bonding: allow carrier and link status to determine link state
Debabrata Banerjee <dbane...@akamai.com> wrote: >In a mixed environment it may be difficult to tell if your hardware >support carrier, if it does not it can always report true. With a new >use_carrier option of 2, we can check both carrier and link status >sequentially, instead of one or the other Your reply in the prior discussion suggests to me that there is a bug somewhere else (perhaps in bonding, perhaps in the network driver), and this is just papering over it. As I said, I don't believe that an additional hack to bonding is the appropriate way to resolve this. -J >Signed-off-by: Debabrata Banerjee <dbane...@akamai.com> >--- > Documentation/networking/bonding.txt | 4 ++-- > drivers/net/bonding/bond_main.c | 12 > drivers/net/bonding/bond_options.c | 7 --- > 3 files changed, 14 insertions(+), 9 deletions(-) > >diff --git a/Documentation/networking/bonding.txt >b/Documentation/networking/bonding.txt >index 9ba04c0bab8d..f063730e7e73 100644 >--- a/Documentation/networking/bonding.txt >+++ b/Documentation/networking/bonding.txt >@@ -828,8 +828,8 @@ use_carrier > MII / ETHTOOL ioctl method to determine the link state. > > A value of 1 enables the use of netif_carrier_ok(), a value of >- 0 will use the deprecated MII / ETHTOOL ioctls. The default >- value is 1. >+ 0 will use the deprecated MII / ETHTOOL ioctls. A value of 2 >+ will check both. The default value is 1. > > xmit_hash_policy > >diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c >index f7f8a49cb32b..7e9652c4b35c 100644 >--- a/drivers/net/bonding/bond_main.c >+++ b/drivers/net/bonding/bond_main.c >@@ -132,7 +132,7 @@ MODULE_PARM_DESC(downdelay, "Delay before considering link >down, " > "in milliseconds"); > module_param(use_carrier, int, 0); > MODULE_PARM_DESC(use_carrier, "Use netif_carrier_ok (vs MII ioctls) in > miimon; " >-"0 for off, 1 for on (default)"); >+"0 for off, 1 for on (default), 2 for carrier >then legacy checks"); > module_param(mode, charp, 0); > MODULE_PARM_DESC(mode, "Mode of operation; 0 for balance-rr, " > "1 for active-backup, 2 for balance-xor, " >@@ -434,12 +434,16 @@ static int bond_check_dev_link(struct bonding *bond, > int (*ioctl)(struct net_device *, struct ifreq *, int); > struct ifreq ifr; > struct mii_ioctl_data *mii; >+ bool carrier = true; > > if (!reporting && !netif_running(slave_dev)) > return 0; > > if (bond->params.use_carrier) >- return netif_carrier_ok(slave_dev) ? BMSR_LSTATUS : 0; >+ carrier = netif_carrier_ok(slave_dev) ? BMSR_LSTATUS : 0; >+ >+ if (!carrier) >+ return carrier; > > /* Try to get link status using Ethtool first. */ > if (slave_dev->ethtool_ops->get_link) >@@ -4399,8 +4403,8 @@ static int bond_check_params(struct bond_params *params) > downdelay = 0; > } > >- if ((use_carrier != 0) && (use_carrier != 1)) { >- pr_warn("Warning: use_carrier module parameter (%d), not of >valid value (0/1), so it was set to 1\n", >+ if (use_carrier < 0 || use_carrier > 2) { >+ pr_warn("Warning: use_carrier module parameter (%d), not of >valid value (0-2), so it was set to 1\n", > use_carrier); > use_carrier = 1; > } >diff --git a/drivers/net/bonding/bond_options.c >b/drivers/net/bonding/bond_options.c >index 8a945c9341d6..dba6cef05134 100644 >--- a/drivers/net/bonding/bond_options.c >+++ b/drivers/net/bonding/bond_options.c >@@ -164,9 +164,10 @@ static const struct bond_opt_value >bond_primary_reselect_tbl[] = { > }; > > static const struct bond_opt_value bond_use_carrier_tbl[] = { >- { "off", 0, 0}, >- { "on", 1, BOND_VALFLAG_DEFAULT}, >- { NULL, -1, 0} >+ { "off", 0, 0}, >+ { "on", 1, BOND_VALFLAG_DEFAULT}, >+ { "both", 2, 0}, >+ { NULL, -1, 0} > }; > > static const struct bond_opt_value bond_all_slaves_active_tbl[] = { >-- >2.17.0 --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH net-next 4/4] bonding: allow carrier and link status to determine link state
Debabrata Banerjee <dbane...@akamai.com> wrote: >In a mixed environment it may be difficult to tell if your hardware >support carrier, if it does not it can always report true. With a new >use_carrier option of 2, we can check both carrier and link status >sequentially, instead of one or the other What do you mean by "mixed environment," and under what circumstances are you seeing an actual benefit from doing the MII / ethtool test in addition to the standard netif_carrier_ok test? The use_carrier option was meant for backwards compatibility with old-in-2005 device drivers, so this seem counterintuitive to me. I don't recall seeing any devices lacking netif_carrier support for some time. At this point, I would tend to argue that a new device driver that does not implement netif_carrier support should be fixed, and not have another hack added to bonding to work around it. -J >Signed-off-by: Debabrata Banerjee <dbane...@akamai.com> >--- > Documentation/networking/bonding.txt | 4 ++-- > drivers/net/bonding/bond_main.c | 12 > drivers/net/bonding/bond_options.c | 7 --- > 3 files changed, 14 insertions(+), 9 deletions(-) > >diff --git a/Documentation/networking/bonding.txt >b/Documentation/networking/bonding.txt >index 9ba04c0bab8d..f063730e7e73 100644 >--- a/Documentation/networking/bonding.txt >+++ b/Documentation/networking/bonding.txt >@@ -828,8 +828,8 @@ use_carrier > MII / ETHTOOL ioctl method to determine the link state. > > A value of 1 enables the use of netif_carrier_ok(), a value of >- 0 will use the deprecated MII / ETHTOOL ioctls. The default >- value is 1. >+ 0 will use the deprecated MII / ETHTOOL ioctls. A value of 2 >+ will check both. The default value is 1. > > xmit_hash_policy > >diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c >index f7f8a49cb32b..7e9652c4b35c 100644 >--- a/drivers/net/bonding/bond_main.c >+++ b/drivers/net/bonding/bond_main.c >@@ -132,7 +132,7 @@ MODULE_PARM_DESC(downdelay, "Delay before considering link >down, " > "in milliseconds"); > module_param(use_carrier, int, 0); > MODULE_PARM_DESC(use_carrier, "Use netif_carrier_ok (vs MII ioctls) in > miimon; " >-"0 for off, 1 for on (default)"); >+"0 for off, 1 for on (default), 2 for carrier >then legacy checks"); > module_param(mode, charp, 0); > MODULE_PARM_DESC(mode, "Mode of operation; 0 for balance-rr, " > "1 for active-backup, 2 for balance-xor, " >@@ -434,12 +434,16 @@ static int bond_check_dev_link(struct bonding *bond, > int (*ioctl)(struct net_device *, struct ifreq *, int); > struct ifreq ifr; > struct mii_ioctl_data *mii; >+ bool carrier = true; > > if (!reporting && !netif_running(slave_dev)) > return 0; > > if (bond->params.use_carrier) >- return netif_carrier_ok(slave_dev) ? BMSR_LSTATUS : 0; >+ carrier = netif_carrier_ok(slave_dev) ? BMSR_LSTATUS : 0; >+ >+ if (!carrier) >+ return carrier; > > /* Try to get link status using Ethtool first. */ > if (slave_dev->ethtool_ops->get_link) >@@ -4399,8 +4403,8 @@ static int bond_check_params(struct bond_params *params) > downdelay = 0; > } > >- if ((use_carrier != 0) && (use_carrier != 1)) { >- pr_warn("Warning: use_carrier module parameter (%d), not of >valid value (0/1), so it was set to 1\n", >+ if (use_carrier < 0 || use_carrier > 2) { >+ pr_warn("Warning: use_carrier module parameter (%d), not of >valid value (0-2), so it was set to 1\n", > use_carrier); > use_carrier = 1; > } >diff --git a/drivers/net/bonding/bond_options.c >b/drivers/net/bonding/bond_options.c >index 8a945c9341d6..dba6cef05134 100644 >--- a/drivers/net/bonding/bond_options.c >+++ b/drivers/net/bonding/bond_options.c >@@ -164,9 +164,10 @@ static const struct bond_opt_value >bond_primary_reselect_tbl[] = { > }; > > static const struct bond_opt_value bond_use_carrier_tbl[] = { >- { "off", 0, 0}, >- { "on", 1, BOND_VALFLAG_DEFAULT}, >- { NULL, -1, 0} >+ { "off", 0, 0}, >+ { "on", 1, BOND_VALFLAG_DEFAULT}, >+ { "both", 2, 0}, >+ { NULL, -1, 0} > }; > > static const struct bond_opt_value bond_all_slaves_active_tbl[] = { >-- >2.17.0 > --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH net-next 3/4] bonding: allow use of tx hashing in balance-alb
Debabrata Banerjeewrote: >The rx load balancing provided by balance-alb is not mutually >exclusive with using hashing for tx selection, and should provide a decent >speed increase because this eliminates spinlocks and cache contention. > >Signed-off-by: Debabrata Banerjee >--- > drivers/net/bonding/bond_alb.c | 20 ++-- > drivers/net/bonding/bond_main.c| 25 +++-- > drivers/net/bonding/bond_options.c | 2 +- > include/net/bonding.h | 10 +- > 4 files changed, 43 insertions(+), 14 deletions(-) > >diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c >index 180e50f7806f..6228635880d5 100644 >--- a/drivers/net/bonding/bond_alb.c >+++ b/drivers/net/bonding/bond_alb.c >@@ -1478,8 +1478,24 @@ int bond_alb_xmit(struct sk_buff *skb, struct >net_device *bond_dev) > } > > if (do_tx_balance) { >- hash_index = _simple_hash(hash_start, hash_size); >- tx_slave = tlb_choose_channel(bond, hash_index, skb->len); >+ if (bond->params.tlb_dynamic_lb) { >+ hash_index = _simple_hash(hash_start, hash_size); >+ tx_slave = tlb_choose_channel(bond, hash_index, >skb->len); >+ } else { >+ /* >+ * do_tx_balance means we are free to select the >tx_slave >+ * So we do exactly what tlb would do for hash selection >+ */ >+ >+ struct bond_up_slave *slaves; >+ unsigned int count; >+ >+ slaves = rcu_dereference(bond->slave_arr); >+ count = slaves ? READ_ONCE(slaves->count) : 0; >+ if (likely(count)) >+ tx_slave = slaves->arr[bond_xmit_hash(bond, >skb) % >+ count]; >+ } > } > > return bond_do_alb_xmit(skb, bond, tx_slave); >diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c >index 1f1e97b26f95..f7f8a49cb32b 100644 >--- a/drivers/net/bonding/bond_main.c >+++ b/drivers/net/bonding/bond_main.c >@@ -159,7 +159,7 @@ module_param(min_links, int, 0); > MODULE_PARM_DESC(min_links, "Minimum number of available links before turning > on carrier"); > > module_param(xmit_hash_policy, charp, 0); >-MODULE_PARM_DESC(xmit_hash_policy, "balance-xor and 802.3ad hashing method; " >+MODULE_PARM_DESC(xmit_hash_policy, "balance-alb, balance-tlb, balance-xor, >802.3ad hashing method; " > "0 for layer 2 (default), 1 for layer 3+4, " > "2 for layer 2+3, 3 for encap layer 2+3, " > "4 for encap layer 3+4"); >@@ -1735,7 +1735,7 @@ int bond_enslave(struct net_device *bond_dev, struct >net_device *slave_dev, > unblock_netpoll_tx(); > } > >- if (bond_mode_uses_xmit_hash(bond)) >+ if (bond_mode_can_use_xmit_hash(bond)) > bond_update_slave_arr(bond, NULL); > > bond->nest_level = dev_get_nest_level(bond_dev); >@@ -1870,7 +1870,7 @@ static int __bond_release_one(struct net_device >*bond_dev, > if (BOND_MODE(bond) == BOND_MODE_8023AD) > bond_3ad_unbind_slave(slave); > >- if (bond_mode_uses_xmit_hash(bond)) >+ if (bond_mode_can_use_xmit_hash(bond)) > bond_update_slave_arr(bond, slave); > > netdev_info(bond_dev, "Releasing %s interface %s\n", >@@ -3102,7 +3102,7 @@ static int bond_slave_netdev_event(unsigned long event, >* events. If these (miimon/arpmon) parameters are configured >* then array gets refreshed twice and that should be fine! >*/ >- if (bond_mode_uses_xmit_hash(bond)) >+ if (bond_mode_can_use_xmit_hash(bond)) > bond_update_slave_arr(bond, NULL); > break; > case NETDEV_CHANGEMTU: >@@ -3322,7 +3322,7 @@ static int bond_open(struct net_device *bond_dev) >*/ > if (bond_alb_initialize(bond, (BOND_MODE(bond) == > BOND_MODE_ALB))) > return -ENOMEM; >- if (bond->params.tlb_dynamic_lb) >+ if (bond->params.tlb_dynamic_lb || BOND_MODE(bond) == >BOND_MODE_ALB) > queue_delayed_work(bond->wq, >alb_work, 0); > } > >@@ -3341,7 +3341,7 @@ static int bond_open(struct net_device *bond_dev) > bond_3ad_initiate_agg_selection(bond, 1); > } > >- if (bond_mode_uses_xmit_hash(bond)) >+ if (bond_mode_can_use_xmit_hash(bond)) > bond_update_slave_arr(bond, NULL); > > return 0; >@@ -3892,7 +3892,7 @@ static void bond_slave_arr_handler(struct work_struct >*work) > * to determine the slave interface - > * (a) BOND_MODE_8023AD > * (b) BOND_MODE_XOR >- * (c)
Re: [PATCH net-next 2/4] bonding: use common mac addr checks
Banerjee, Debabrata <dbane...@akamai.com> wrote: >> From: Jay Vosburgh [mailto:jay.vosbu...@canonical.com] >> Debabrata Banerjee <dbane...@akamai.com> wrote: > >> >- if >> (!ether_addr_equal_64bits(rx_hash_table[index].mac_dst, >> >-mac_bcast) && >> >- >> !is_zero_ether_addr(rx_hash_table[index].mac_dst)) { >> >+ if >> (is_valid_ether_addr(rx_hash_table[index].mac_dst)) { >> >> This change and the similar ones below will now fail non-broadcast >> multicast Ethernet addresses, where the prior code would not. Is this an >> intentional change? > >Yes I don't see how it makes sense to use multicast addresses at all, but I >may be missing something. It's also illegal according to rfc1812 3.3.2, but >obviously this balancing mode is trying to be very clever. We probably >shouldn't violate the rfc anyway. Fair enough, but I think it would be good to call this out in the change log just in case it does somehow cause a regression. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH net-next 2/4] bonding: use common mac addr checks
Debabrata Banerjeewrote: >Replace homegrown mac addr checks with faster defs from etherdevice.h > >Signed-off-by: Debabrata Banerjee >--- > drivers/net/bonding/bond_alb.c | 28 +--- > 1 file changed, 9 insertions(+), 19 deletions(-) > >diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c >index c2f6c58e4e6a..180e50f7806f 100644 >--- a/drivers/net/bonding/bond_alb.c >+++ b/drivers/net/bonding/bond_alb.c >@@ -40,11 +40,6 @@ > #include > #include > >- >- >-static const u8 mac_bcast[ETH_ALEN + 2] __long_aligned = { >- 0xff, 0xff, 0xff, 0xff, 0xff, 0xff >-}; > static const u8 mac_v6_allmcast[ETH_ALEN + 2] __long_aligned = { > 0x33, 0x33, 0x00, 0x00, 0x00, 0x01 > }; >@@ -420,9 +415,7 @@ static void rlb_clear_slave(struct bonding *bond, struct >slave *slave) > > if (assigned_slave) { > rx_hash_table[index].slave = assigned_slave; >- if >(!ether_addr_equal_64bits(rx_hash_table[index].mac_dst, >- mac_bcast) && >- >!is_zero_ether_addr(rx_hash_table[index].mac_dst)) { >+ if >(is_valid_ether_addr(rx_hash_table[index].mac_dst)) { This change and the similar ones below will now fail non-broadcast multicast Ethernet addresses, where the prior code would not. Is this an intentional change? -J > bond_info->rx_hashtbl[index].ntt = 1; > bond_info->rx_ntt = 1; > /* A slave has been removed from the >@@ -525,8 +518,7 @@ static void rlb_req_update_slave_clients(struct bonding >*bond, struct slave *sla > client_info = &(bond_info->rx_hashtbl[hash_index]); > > if ((client_info->slave == slave) && >- !ether_addr_equal_64bits(client_info->mac_dst, mac_bcast) && >- !is_zero_ether_addr(client_info->mac_dst)) { >+ is_valid_ether_addr(client_info->mac_dst)) { > client_info->ntt = 1; > ntt = 1; > } >@@ -567,8 +559,7 @@ static void rlb_req_update_subnet_clients(struct bonding >*bond, __be32 src_ip) > if ((client_info->ip_src == src_ip) && > !ether_addr_equal_64bits(client_info->slave->dev->dev_addr, >bond->dev->dev_addr) && >- !ether_addr_equal_64bits(client_info->mac_dst, mac_bcast) && >- !is_zero_ether_addr(client_info->mac_dst)) { >+ is_valid_ether_addr(client_info->mac_dst)) { > client_info->ntt = 1; > bond_info->rx_ntt = 1; > } >@@ -596,7 +587,7 @@ static struct slave *rlb_choose_channel(struct sk_buff >*skb, struct bonding *bon > if ((client_info->ip_src == arp->ip_src) && > (client_info->ip_dst == arp->ip_dst)) { > /* the entry is already assigned to this client */ >- if (!ether_addr_equal_64bits(arp->mac_dst, mac_bcast)) { >+ if (!is_broadcast_ether_addr(arp->mac_dst)) { > /* update mac address from arp */ > ether_addr_copy(client_info->mac_dst, > arp->mac_dst); > } >@@ -644,8 +635,7 @@ static struct slave *rlb_choose_channel(struct sk_buff >*skb, struct bonding *bon > ether_addr_copy(client_info->mac_src, arp->mac_src); > client_info->slave = assigned_slave; > >- if (!ether_addr_equal_64bits(client_info->mac_dst, mac_bcast) && >- !is_zero_ether_addr(client_info->mac_dst)) { >+ if (is_valid_ether_addr(client_info->mac_dst)) { > client_info->ntt = 1; > bond->alb_info.rx_ntt = 1; > } else { >@@ -1418,9 +1408,9 @@ int bond_alb_xmit(struct sk_buff *skb, struct net_device >*bond_dev) > case ETH_P_IP: { > const struct iphdr *iph = ip_hdr(skb); > >- if (ether_addr_equal_64bits(eth_data->h_dest, mac_bcast) || >- (iph->daddr == ip_bcast) || >- (iph->protocol == IPPROTO_IGMP)) { >+ if (is_broadcast_ether_addr(eth_data->h_dest) || >+ iph->daddr == ip_bcast || >+ iph->protocol == IPPROTO_IGMP) { > do_tx_balance = false; > break; > } >@@ -1432,7 +1422,7 @@ int bond_alb_xmit(struct sk_buff *skb, struct net_device >*bond_dev) > /* IPv6 doesn't really use broadcast mac address, but leave >* that here just in case. >*/ >- if
[PATCH v2 net-next] virtio-net: Fix operstate for virtio when no VIRTIO_NET_F_STATUS
The operstate update logic will leave an interface in the default UNKNOWN operstate if the interface carrier state never changes from the default carrier up state set at creation. This includes the case of an explicit call to netif_carrier_on, as the carrier on to on transition has no effect on operstate. This affects virtio-net for the case that the virtio peer does not support VIRTIO_NET_F_STATUS (the feature that provides carrier state updates). Without this feature, the virtio specification states that "the link should be assumed active," so, logically, the operstate should be UP instead of UNKNOWN. This has impact on user space applications that use the operstate to make availability decisions for the interface. Resolve this by changing the virtio probe logic slightly to call netif_carrier_off for both the "with" and "without" VIRTIO_NET_F_STATUS cases, and then the existing call to netif_carrier_on for the "without" case will cause an operstate transition. Cc: "Michael S. Tsirkin" <m...@redhat.com> Cc: Jason Wang <jasow...@redhat.com> Cc: Ben Hutchings <b...@decadent.org.uk> Signed-off-by: Jay Vosburgh <jay.vosbu...@canonical.com> --- I considered resolving this by changing linkwatch_init_dev to unconditionally call rfc2863_policy, as that would always set operstate for all interfaces. This would not have any impact on most cases (as most drivers call netif_carrier_off during probe), except for the loopback device, which currently has an operstate of UNKNOWN (because it never does any carrier state transitions). This change would add a round trip on the dev_base_lock for every loopback device creation, which could have a negative impact when creating many loopback devices, e.g., when concurrently creating large numbers of containers. drivers/net/virtio_net.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 23374603e4d9..7b187ec7411e 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -2857,8 +2857,8 @@ static int virtnet_probe(struct virtio_device *vdev) /* Assume link up if device can't report link status, otherwise get link status from config. */ + netif_carrier_off(dev); if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_STATUS)) { - netif_carrier_off(dev); schedule_work(>config_work); } else { vi->status = VIRTIO_NET_S_LINK_UP; -- 2.14.1
Re: [PATCH net] virtio-net: Fix operstate for virtio when no VIRTIO_NET_F_STATUS
Michael S. Tsirkin <m...@redhat.com> wrote: >On Thu, Mar 22, 2018 at 12:02:10PM +0000, Jay Vosburgh wrote: >> Michael S. Tsirkin <m...@redhat.com> wrote: >> >> >On Thu, Mar 22, 2018 at 09:05:52AM +, Jay Vosburgh wrote: >> >> The operstate update logic will leave an interface in the >> >> default UNKNOWN operstate if the interface carrier state never changes >> >> from the default carrier up state set at creation. This includes the >> >> case of an explicit call to netif_carrier_on, as the carrier on to on >> >> transition has no effect on operstate. >> >> >> >> This affects virtio-net for the case that the virtio peer does >> >> not support VIRTIO_NET_F_STATUS (the feature that provides carrier state >> >> updates). Without this feature, the virtio specification states that >> >> "the link should be assumed active," so, logically, the operstate should >> >> be UP instead of UNKNOWN. This has impact on user space applications >> >> that use the operstate to make availability decisions for the interface. >> >> >> >> Resolve this by changing the virtio probe logic slightly to call >> >> netif_carrier_off for both the "with" and "without" VIRTIO_NET_F_STATUS >> >> cases, and then the existing call to netif_carrier_on for the "without" >> >> case will cause an operstate transition. >> >> >> >> Cc: "Michael S. Tsirkin" <m...@redhat.com> >> >> Cc: Jason Wang <jasow...@redhat.com> >> >> Cc: Ben Hutchings <b...@decadent.org.uk> >> >> Fixes: 167c25e4c550 ("virtio-net: init link state correctly") >> > >> >I'd say that's an abuse of this notation. openstate was UNKNOWN >> >even before that fix. >> >> I went back to the commit that added the dependency on >> VIRTIO_NET_F_STATUS (and that this patch would thus apply on top of). >> If that's an issue, I can resubmit without it. >> >> -J > >The patch can be trivially backported to any version that has virtio. > >The issue was present since the original version of virtio. >VIRTIO_NET_F_STATUS fixed it for new devices. >So the tag is incorrectly blaming a partial fix for not being a full >one. > >Also, I think it's more appropriate for net-next - it's a >minor ABI change (previously presence of VIRTIO_NET_F_STATUS >could be detected by looking at operstate, now it can't). >Hopefully this makes more apps work than it breaks. > >So yes, pls repost without Fixes and with net-next unless >davem can make the change himself. Reposting with requested changes. -J >> >> Signed-off-by: Jay Vosburgh <jay.vosbu...@canonical.com> >> > >> >Acked-by: Michael S. Tsirkin <m...@redhat.com> >> > >> > >> >> --- >> >> >> >> I considered resolving this by changing linkwatch_init_dev to >> >> unconditionally call rfc2863_policy, as that would always set operstate >> >> for all interfaces. >> >> >> >> This would not have any impact on most cases (as most drivers >> >> call netif_carrier_off during probe), except for the loopback device, >> >> which currently has an operstate of UNKNOWN (because it never does any >> >> carrier state transitions). This change would add a round trip on the >> >> dev_base_lock for every loopback device creation, which could have a >> >> negative impact when creating many loopback devices, e.g., when >> >> concurrently creating large numbers of containers. >> >> >> >> >> >> drivers/net/virtio_net.c | 2 +- >> >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> >> >> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c >> >> index 23374603e4d9..7b187ec7411e 100644 >> >> --- a/drivers/net/virtio_net.c >> >> +++ b/drivers/net/virtio_net.c >> >> @@ -2857,8 +2857,8 @@ static int virtnet_probe(struct virtio_device *vdev) >> >> >> >> /* Assume link up if device can't report link status, >> >> otherwise get link status from config. */ >> >> + netif_carrier_off(dev); >> >> if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_STATUS)) { >> >> - netif_carrier_off(dev); >> >> schedule_work(>config_work); >> >> } else { >> >> vi->status = VIRTIO_NET_S_LINK_UP; >> >> -- >> >> 2.14.1
Re: [PATCH net] virtio-net: Fix operstate for virtio when no VIRTIO_NET_F_STATUS
Michael S. Tsirkin <m...@redhat.com> wrote: >On Thu, Mar 22, 2018 at 09:05:52AM +0000, Jay Vosburgh wrote: >> The operstate update logic will leave an interface in the >> default UNKNOWN operstate if the interface carrier state never changes >> from the default carrier up state set at creation. This includes the >> case of an explicit call to netif_carrier_on, as the carrier on to on >> transition has no effect on operstate. >> >> This affects virtio-net for the case that the virtio peer does >> not support VIRTIO_NET_F_STATUS (the feature that provides carrier state >> updates). Without this feature, the virtio specification states that >> "the link should be assumed active," so, logically, the operstate should >> be UP instead of UNKNOWN. This has impact on user space applications >> that use the operstate to make availability decisions for the interface. >> >> Resolve this by changing the virtio probe logic slightly to call >> netif_carrier_off for both the "with" and "without" VIRTIO_NET_F_STATUS >> cases, and then the existing call to netif_carrier_on for the "without" >> case will cause an operstate transition. >> >> Cc: "Michael S. Tsirkin" <m...@redhat.com> >> Cc: Jason Wang <jasow...@redhat.com> >> Cc: Ben Hutchings <b...@decadent.org.uk> >> Fixes: 167c25e4c550 ("virtio-net: init link state correctly") > >I'd say that's an abuse of this notation. openstate was UNKNOWN >even before that fix. I went back to the commit that added the dependency on VIRTIO_NET_F_STATUS (and that this patch would thus apply on top of). If that's an issue, I can resubmit without it. -J >> Signed-off-by: Jay Vosburgh <jay.vosbu...@canonical.com> > >Acked-by: Michael S. Tsirkin <m...@redhat.com> > > >> --- >> >> I considered resolving this by changing linkwatch_init_dev to >> unconditionally call rfc2863_policy, as that would always set operstate >> for all interfaces. >> >> This would not have any impact on most cases (as most drivers >> call netif_carrier_off during probe), except for the loopback device, >> which currently has an operstate of UNKNOWN (because it never does any >> carrier state transitions). This change would add a round trip on the >> dev_base_lock for every loopback device creation, which could have a >> negative impact when creating many loopback devices, e.g., when >> concurrently creating large numbers of containers. >> >> >> drivers/net/virtio_net.c | 2 +- >> 1 file changed, 1 insertion(+), 1 deletion(-) >> >> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c >> index 23374603e4d9..7b187ec7411e 100644 >> --- a/drivers/net/virtio_net.c >> +++ b/drivers/net/virtio_net.c >> @@ -2857,8 +2857,8 @@ static int virtnet_probe(struct virtio_device *vdev) >> >> /* Assume link up if device can't report link status, >> otherwise get link status from config. */ >> +netif_carrier_off(dev); >> if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_STATUS)) { >> -netif_carrier_off(dev); >> schedule_work(>config_work); >> } else { >> vi->status = VIRTIO_NET_S_LINK_UP; >> -- >> 2.14.1
[PATCH net] virtio-net: Fix operstate for virtio when no VIRTIO_NET_F_STATUS
The operstate update logic will leave an interface in the default UNKNOWN operstate if the interface carrier state never changes from the default carrier up state set at creation. This includes the case of an explicit call to netif_carrier_on, as the carrier on to on transition has no effect on operstate. This affects virtio-net for the case that the virtio peer does not support VIRTIO_NET_F_STATUS (the feature that provides carrier state updates). Without this feature, the virtio specification states that "the link should be assumed active," so, logically, the operstate should be UP instead of UNKNOWN. This has impact on user space applications that use the operstate to make availability decisions for the interface. Resolve this by changing the virtio probe logic slightly to call netif_carrier_off for both the "with" and "without" VIRTIO_NET_F_STATUS cases, and then the existing call to netif_carrier_on for the "without" case will cause an operstate transition. Cc: "Michael S. Tsirkin" <m...@redhat.com> Cc: Jason Wang <jasow...@redhat.com> Cc: Ben Hutchings <b...@decadent.org.uk> Fixes: 167c25e4c550 ("virtio-net: init link state correctly") Signed-off-by: Jay Vosburgh <jay.vosbu...@canonical.com> --- I considered resolving this by changing linkwatch_init_dev to unconditionally call rfc2863_policy, as that would always set operstate for all interfaces. This would not have any impact on most cases (as most drivers call netif_carrier_off during probe), except for the loopback device, which currently has an operstate of UNKNOWN (because it never does any carrier state transitions). This change would add a round trip on the dev_base_lock for every loopback device creation, which could have a negative impact when creating many loopback devices, e.g., when concurrently creating large numbers of containers. drivers/net/virtio_net.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 23374603e4d9..7b187ec7411e 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -2857,8 +2857,8 @@ static int virtnet_probe(struct virtio_device *vdev) /* Assume link up if device can't report link status, otherwise get link status from config. */ + netif_carrier_off(dev); if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_STATUS)) { - netif_carrier_off(dev); schedule_work(>config_work); } else { vi->status = VIRTIO_NET_S_LINK_UP; -- 2.14.1
[PATCH net] bonding: fix slave stuck in BOND_LINK_FAIL state
The bonding miimon logic has a flaw, in that a failure of the rtnl_trylock can cause a slave to become permanently stuck in BOND_LINK_FAIL state. The sequence of events to cause this is as follows: 1) bond_miimon_inspect finds that a slave's link is down, and so calls bond_propose_link_state, setting slave->new_link_state to BOND_LINK_FAIL, then sets slave->new_link to BOND_LINK_DOWN and returns non-zero. 2) In bond_mii_monitor, the rtnl_trylock fails, and the timer is rescheduled. No change is committed. 3) bond_miimon_inspect is called again, but this time the slave from step 1 has recovered. slave->new_link is reset to NOCHANGE, and, as slave->link was never changed, the switch enters the BOND_LINK_UP case, and does nothing. The pending BOND_LINK_FAIL state from step 1 remains pending, as new_link_state is not reset. 4) The state from step 3 persists until another slave changes link state and causes bond_miimon_inspect to return non-zero. At this point, the BOND_LINK_FAIL state change on the slave from steps 1-3 is committed, and the slave will remain stuck in BOND_LINK_FAIL state even though it is actually link up. The remedy for this is to initialize new_link_state on each entry to bond_miimon_inspect, as is already done with new_link. Reported-by: Alex Sidorenko <alexandre.sidore...@hpe.com> Reviewed-by: Jarod Wilson <ja...@redhat.com> Signed-off-by: Jay Vosburgh <jay.vosbu...@canonical.com> Fixes: fb9eb899a6dc ("bonding: handle link transition from FAIL to UP correctly") --- drivers/net/bonding/bond_main.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index c99dc59d729b..167434e952da 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -2042,6 +2042,7 @@ static int bond_miimon_inspect(struct bonding *bond) bond_for_each_slave_rcu(bond, slave, iter) { slave->new_link = BOND_LINK_NOCHANGE; + slave->link_new_state = slave->link; link_state = bond_check_dev_link(bond, slave->dev, 0); -- 2.14.1
Re: Bond recovery from BOND_LINK_FAIL state not working
Jarod Wilson <ja...@redhat.com> wrote: >On 2017-11-02 9:11 PM, Jay Vosburgh wrote: [...] >> diff --git a/drivers/net/bonding/bond_main.c >> b/drivers/net/bonding/bond_main.c >> index 18b58e1376f1..6f89f9981a6c 100644 >> --- a/drivers/net/bonding/bond_main.c >> +++ b/drivers/net/bonding/bond_main.c >> @@ -2046,6 +2046,7 @@ static int bond_miimon_inspect(struct bonding *bond) >> bond_for_each_slave_rcu(bond, slave, iter) { >> slave->new_link = BOND_LINK_NOCHANGE; >> +slave->link_new_state = slave->link; >> link_state = bond_check_dev_link(bond, slave->dev, 0); >> >> >> Alex / Jarod, could you check my logic, and would you be able to >> test this patch if my analysis appears sound? > >This patch looks good, the original reproducing setup successfully >recovers after the original active slave goes down, even with >NetworkManager in the mix. > >Reviewed-by: Jarod Wilson <ja...@redhat.com> Thanks, I'll get a formal patch submission out later today. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: Bond recovery from BOND_LINK_FAIL state not working
ernel: bond0: bond_find_best_slave: slave->link: 2, >up: false, slave->delay: 0 >Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_find_best_slave: slave->link: 1, >up: true, slave->delay: 0 >Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_find_best_slave: no bestslave >found, bond failure imminent >Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_select_active_slave: best_slave >!= curr_active_slave >Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_select_active_slave: best_slave >is NULL, this is probably bad >Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_change_active_slave: old_active: >ens3f0 >Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_change_active_slave: new_active: >NULL >Oct 31 09:09:25 SYDC1LNX kernel: bond0: Removing MAC from old_active >Oct 31 09:09:25 SYDC1LNX kernel: bond0: ALB and TLB modes should call >bond_alb_handle_active_change >Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_set_carrier: turning carrier off >Oct 31 09:09:25 SYDC1LNX kernel: bond0: now running without any active >interface! > >Even though link state of ens3f1 has changed, we have > BOND_LINK_NOCHANGE >Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_commit: working on slave >ens3f1 >Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_commit: BOND_LINK_NOCHANGE >on slave ens3f1 > >Another invocation of bond_mii_monitor >Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f0 current link >state: 2 >Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f1 current link >state: 1 >Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered >BOND_LINK_DOWN case on slave ens3f0 > >Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered >BOND_LINK_FAIL case on slave ens3f1 > > We see "up again on ens3f1" and execute slave->new_link = BOND_LINK_UP; (my > initial suggestion) >Oct 31 09:09:25 SYDC1LNX kernel: bond0: link status up again after 0 ms for >interface ens3f1 >Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect returned non-zero >Oct 31 09:09:25 SYDC1LNX kernel: bond0: committing link state 2->2 for >interface ens3f0, 0 Mbps full duplex >Oct 31 09:09:25 SYDC1LNX kernel: bond0: slave->should_notify_link for >interface ens3f0 now: 0 >Oct 31 09:09:25 SYDC1LNX kernel: bond0: committing link state 1->0 for >interface ens3f1, 2 Mbps full duplex >Oct 31 09:09:25 SYDC1LNX kernel: bond0: slave->should_notify_link for >interface ens3f1 now: 0 > > As we nudged slave->new_link, bond_miimon_commit() now works >Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_commit: working on slave >ens3f0 >Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_commit: BOND_LINK_NOCHANGE >on slave ens3f0 >Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_commit: working on slave >ens3f1 >Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_commit: BOND_LINK_UP >Oct 31 09:09:25 SYDC1LNX kernel: bond0: link status definitely up for >interface ens3f1, 2 Mbps full duplex >Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_find_best_slave: slave->link: 2, >up: false, slave->delay: 0 >Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_find_best_slave: slave ens3f1 >BOND_LINK_UP >Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_select_active_slave: best_slave >!= curr_active_slave >Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_change_active_slave: old_active: >NULL >Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_change_active_slave: new_active: >ens3f1 >Oct 31 09:09:25 SYDC1LNX kernel: bond0: Non-BOND_LINK_BACK case >Oct 31 09:09:25 SYDC1LNX kernel: bond0: making interface ens3f1 the new active >one >Oct 31 09:09:25 SYDC1LNX kernel: bond0: Setting MAC on new_active >Oct 31 09:09:25 SYDC1LNX kernel: bond0: ALB and TLB modes should call >bond_alb_handle_active_change >Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_set_carrier: turning carrier on >Oct 31 09:09:25 SYDC1LNX kernel: bond0: first active interface up! > >Oct 31 09:09:26 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f0 current link >state: 2 >Oct 31 09:09:26 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f1 current link >state: 0 >Oct 31 09:09:26 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered >BOND_LINK_DOWN case on slave ens3f0 >Oct 31 09:09:26 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered >BOND_LINK_UP case on slave ens3f1 >... > >Alex > >On 11/02/2017 09:11 PM, Jay Vosburgh wrote: >> Alex Sidorenko <alexandre.sidore...@hpe.com> wrote: >>> On 11/02/2017 12:51 AM, Jay Vosburgh wrote: >>>> Jarod Wilson <ja...@redhat.com> wrote: >>>> >>>>> On 2017-11-01 8:35 PM, Jay Vosburgh wrote: >>>>>> Jay Vosburgh <
Re: Bond recovery from BOND_LINK_FAIL state not working
Alex Sidorenko <alexandre.sidore...@hpe.com> wrote: >On 11/02/2017 12:51 AM, Jay Vosburgh wrote: >> Jarod Wilson <ja...@redhat.com> wrote: >> >>> On 2017-11-01 8:35 PM, Jay Vosburgh wrote: >>>> Jay Vosburgh <jay.vosbu...@canonical.com> wrote: >>>> >>>>> Alex Sidorenko <alexandre.sidore...@hpe.com> wrote: >>>>> >>>>>> The problem has been found while trying to deploy RHEL7 on HPE Synergy >>>>>> platform, it is seen both in customer's environment and in HPE test lab. >>>>>> >>>>>> There are several bonds configured in TLB mode and miimon=100, all other >>>>>> options are default. Slaves are connected to VirtualConnect >>>>>> modules. Rebooting a VC module should bring one bond slave (ens3f0) down >>>>>> temporarily, but not another one (ens3f1). But what we see is >>>>>> >>>>>> Oct 24 10:37:12 SYDC1LNX kernel: bond0: link status up again after 0 ms >>>>>> for interface ens3f1 >>>> >>>>In net-next, I don't see a path in the code that will lead to >>>> this message, as it would apparently require entering >>>> bond_miimon_inspect in state BOND_LINK_FAIL but with downdelay set to 0. >>>> If downdelay is 0, the code will transition to BOND_LINK_DOWN and not >>>> remain in _FAIL state. >>> >>> The kernel in question is laden with a fair bit of additional debug spew, >>> as we were going back and forth, trying to isolate where things were going >>> wrong. That was indeed from the BOND_LINK_FAIL state in >>> bond_miimon_inspect, inside the if (link_state) clause though, so after >>> commit++, there's a continue, which ... does what now? Doesn't it take us >>> back to the top of the bond_for_each_slave_rcu() loop, so we bypass the >>> next few lines of code that would have led to a transition to >>> BOND_LINK_DOWN? >> >> Just to confirm: your downdelay is 0, correct? > >Correct. > >> >> And, do you get any other log messages other than "link status >> up again after 0 ms"? > >Yes, here are some messages (from an early instrumentation): [...] >That is, we never see ens3f1 going to BOND_LINK_DOWN and it continues >staying in BOND_LINK_NOCHANGE/BOND_LINK_FAIL > > >> >> To answer your question, yes, the "if (link_state) {" block in >> the BOND_LINK_FAIL case of bond_miimon_inspect ends in continue, but >> this path is nominally for the downdelay logic. If downdelay is active >> and the link recovers before the delay expires, the link should never >> have moved to BOND_LINK_DOWN. The commit++ causes bond_miimon_inspect >> to return nonzero, causing in turn the bond_propose_link_state change to >> BOND_LINK_FAIL state to be committed. This path deliberately does not >> set slave->new_link, as downdelay is purposely delaying the transition >> to BOND_LINK_DOWN. >> >> If downdelay is 0, the slave->link should not persist in >> BOND_LINK_FAIL state; it should set new_link = BOND_LINK_DOWN which will >> cause a transition in bond_miimon_commit. The bond_propose_link_state >> call to set BOND_LINK_FAIL in the BOND_LINK_UP case will be committed in >> bond_mii_monitor prior to calling bond_miimon_commit, which will in turn >> do the transition to _DOWN state. In this case, the BOND_LINK_FAIL case >> "if (link_state) {" block should never be entered. > >I totally agree with your description of transition logic, and this is why >we were puzzled by how this can occur until we noticed NetworkManager >messages around this time and decided to run a test without it. >Without NM, everything works as expected. After that, adding more >instrumentation, we have found that we do not propose BOND_LINK_FAIL inside >bond_miimon_inspect() but elsewhere (NetworkManager?). I think I see the flaw in the logic. 1) bond_miimon_inspect finds link_state = 0, then makes a call to bond_propose_link_state(BOND_LINK_FAIL), setting link_new_state to BOND_LINK_FAIL. _inspect then sets slave->new_link = BOND_LINK_DOWN and returns non-zero. 2) bond_mii_monitor rtnl_trylock fails, it reschedules. 3) bond_mii_monitor runs again, and calls bond_miimon_inspect. 4) the slave's link has recovered, so link_state != 0. slave->link is still BOND_LINK_UP. The slave's link_new_state remains set to BOND_LINK_FAIL, but new_link is reset to NOCHANGE. bond_miimon_inspect returns 0, so nothing is committed. 5) step 4 can repeat indefi
Re: Bond recovery from BOND_LINK_FAIL state not working
Jarod Wilson <ja...@redhat.com> wrote: >On 2017-11-01 8:35 PM, Jay Vosburgh wrote: >> Jay Vosburgh <jay.vosbu...@canonical.com> wrote: >> >>> Alex Sidorenko <alexandre.sidore...@hpe.com> wrote: >>> >>>> The problem has been found while trying to deploy RHEL7 on HPE Synergy >>>> platform, it is seen both in customer's environment and in HPE test lab. >>>> >>>> There are several bonds configured in TLB mode and miimon=100, all other >>>> options are default. Slaves are connected to VirtualConnect >>>> modules. Rebooting a VC module should bring one bond slave (ens3f0) down >>>> temporarily, but not another one (ens3f1). But what we see is >>>> >>>> Oct 24 10:37:12 SYDC1LNX kernel: bond0: link status up again after 0 ms >>>> for interface ens3f1 >> >> In net-next, I don't see a path in the code that will lead to >> this message, as it would apparently require entering >> bond_miimon_inspect in state BOND_LINK_FAIL but with downdelay set to 0. >> If downdelay is 0, the code will transition to BOND_LINK_DOWN and not >> remain in _FAIL state. > >The kernel in question is laden with a fair bit of additional debug spew, >as we were going back and forth, trying to isolate where things were going >wrong. That was indeed from the BOND_LINK_FAIL state in >bond_miimon_inspect, inside the if (link_state) clause though, so after >commit++, there's a continue, which ... does what now? Doesn't it take us >back to the top of the bond_for_each_slave_rcu() loop, so we bypass the >next few lines of code that would have led to a transition to >BOND_LINK_DOWN? Just to confirm: your downdelay is 0, correct? And, do you get any other log messages other than "link status up again after 0 ms"? To answer your question, yes, the "if (link_state) {" block in the BOND_LINK_FAIL case of bond_miimon_inspect ends in continue, but this path is nominally for the downdelay logic. If downdelay is active and the link recovers before the delay expires, the link should never have moved to BOND_LINK_DOWN. The commit++ causes bond_miimon_inspect to return nonzero, causing in turn the bond_propose_link_state change to BOND_LINK_FAIL state to be committed. This path deliberately does not set slave->new_link, as downdelay is purposely delaying the transition to BOND_LINK_DOWN. If downdelay is 0, the slave->link should not persist in BOND_LINK_FAIL state; it should set new_link = BOND_LINK_DOWN which will cause a transition in bond_miimon_commit. The bond_propose_link_state call to set BOND_LINK_FAIL in the BOND_LINK_UP case will be committed in bond_mii_monitor prior to calling bond_miimon_commit, which will in turn do the transition to _DOWN state. In this case, the BOND_LINK_FAIL case "if (link_state) {" block should never be entered. I'm a little leery of adding the state transition you suggest without understanding how this situation arose, as it shouldn't get into this condition in the first place. -J >... >>> Your patch does not apply to net-next, so I'm not absolutely >>> sure where this is, but presuming that this is in the BOND_LINK_FAIL >>> case of the switch, it looks like both BOND_LINK_FAIL and BOND_LINK_BACK >>> will have the issue that if the link recovers or fails, respectively, >>> within the delay window (for down/updelay > 0) it won't set a >>> slave->new_link. >>> >>> Looks like this got lost somewhere along the line, as originally >>> the transition back to UP (or DOWN) happened immediately, and that has >>> been lost somewhere. >>> >>> I'll have to dig out when that broke, but I'll see about a test >>> patch this afternoon. >> >> The case I was concerned with was moved around; the proposed >> state is committed in bond_mii_monitor. But to commit to _FAIL state, >> the downdelay would have to be > 0. I'm not seeing any errors in >> net-next; can you reproduce your erroneous behavior on net-next? > >I can try to get a net-next-ish kernel into their hands, but the bonding >driver we're working with here is quite close to current net-next already, >so I'm fairly confident the same thing will happen. > >-- >Jarod Wilson >ja...@redhat.com --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: Bond recovery from BOND_LINK_FAIL state not working
Jay Vosburgh <jay.vosbu...@canonical.com> wrote: >Alex Sidorenko <alexandre.sidore...@hpe.com> wrote: > >>The problem has been found while trying to deploy RHEL7 on HPE Synergy >>platform, it is seen both in customer's environment and in HPE test lab. >> >>There are several bonds configured in TLB mode and miimon=100, all other >>options are default. Slaves are connected to VirtualConnect >>modules. Rebooting a VC module should bring one bond slave (ens3f0) down >>temporarily, but not another one (ens3f1). But what we see is >> >>Oct 24 10:37:12 SYDC1LNX kernel: bond0: link status up again after 0 ms for >>interface ens3f1 In net-next, I don't see a path in the code that will lead to this message, as it would apparently require entering bond_miimon_inspect in state BOND_LINK_FAIL but with downdelay set to 0. If downdelay is 0, the code will transition to BOND_LINK_DOWN and not remain in _FAIL state. >>and it never recovers. When VC reboot is complete, everything goes back to >>normal again. >> >>Redhat has backported all recent upstream commits and instrumented the >>bonding driver. We have found the following (when VC goes down) >> >>In bond_miimon_inspect() the first slave goes to >> bond_propose_link_state(slave, BOND_LINK_FAIL); >> and >> slave->new_link = BOND_LINK_DOWN; >> >>The second slave is still >> slave->link = BOND_LINK_UP; >> and >>slave->new_link = BOND_LINK_NOCHANGE; >> >>This is as expected. But in bond_miimon_commit() we see that _both_ slaves >>are in BOND_LINK_FAIL. That is, something changes the state of the second >>slave from another thread. We suspect the NetworkManager, as the problem >>is there _only_ when bonds are controlled by it, if we set >>NM_CONTROLLED=no everything starts working normally. >> >>While we still do not understand how NM affects bond state, I think that >>bonding driver needs to be made reliable enough to recover even from this >>state. > > You're suggesting that the bonding driver be able to distinguish >"false" down states set by network manager from a genuine link failure? >Am I misunderstanding? > >>At this moment when we enter bond_miimon_inspect() with >>slave->link = BOND_LINK_FAIL and are in the following code >> >>/*FALLTHRU*/ >>case BOND_LINK_BACK: >>if (!link_state) { >>bond_propose_link_state(slave, >> BOND_LINK_DOWN); >>netdev_info(bond->dev, "link status down >>again after %d ms for interface %s\n", >>(bond->params.updelay - >> slave->delay) * >>bond->params.miimon, >>slave->dev->name); >> >>commit++; >>continue; >>} >> > > Looking at bonding in the current net-next, if a slave enters >bond_miimon_inspect with slave->link == BOND_LINK_FAIL, it will not >proceed to the BOND_LINK_BACK block of the switch; BOND_LINK_FAIL does >not fall through that far. > > Did you actually mean the BOND_LINK_FAIL block of the switch? >I'll assume so for the rest of my reply. > >>we propose a new state and do 'commit++', but we do not change >>slave->new_link from BOND_LINK_NOCHANGE. As a result, bond_miimon_commit() >>will not process this slave. >> >>The following patch fixes the issue: >> >> >>If we enter bond_miimon_inspect() with slave_link=BOND_LINK_FAIL >>and recover, we do bond_propose_link_state(slave, BOND_LINK_UP); >>but do not change slave->new_link, so it is left in >>BOND_LINK_NOCHANGE. As a result, bond_miimon_commit() will not >>process that slave and it never recovers. We need to set >>slave->new_link = BOND_LINK_UP to make bond_miimon_commit() work > > What is your downdelay set to? In principle, >bond_miimon_inspect should not enter with a slave in BOND_LINK_FAIL >state if downdelay is 0. > >> drivers/net/bonding/bond_main.c | 1 + >> 1 file changed, 1 insertion(+) >> >>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c >>index c99dc59..07aa7ba 100644 >>--- a/drivers/net/bonding/bond_main.c >>+++ b/drivers/net/bonding/bond_main.c >>@@ -2072,6 +2072,7 @@ static int bond_miimon_inspect(struct bonding *bond) >
Re: Bond recovery from BOND_LINK_FAIL state not working
ls, respectively, within the delay window (for down/updelay > 0) it won't set a slave->new_link. Looks like this got lost somewhere along the line, as originally the transition back to UP (or DOWN) happened immediately, and that has been lost somewhere. I'll have to dig out when that broke, but I'll see about a test patch this afternoon. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: Regression: Bug 196547 - Since 4.12 - bonding module not working with wireless drivers
Dan Williams <d...@redhat.com> wrote: [...] >You'll probably say "aim for the 75% case" or something like that, >which is fine, but then you're depending on your 75% case to be (a) >single AP, (b) never move (eg, only bond wifi + ethernet), (c) little >radio interference. I'm not sure I'd buy that. If I've put words in >your mouth, forgive me. The primary use case that I'm aware of for bonding with wireless devices is in active-backup mode, paired with an ethernet adapter set as the bonding "primary" device. I think this is the case that absolutely should just work. I don't think bonding cares (or should care) about (a) - (c) for this use. Your point (b) suggests that there are use cases other than the above; I'm unfamiliar with any use other than wifi + ethernet, can you elaborate? -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH] Convert BUG_ON to WARN_ON in bond_options.c
Michael J Dilmore <michael.j.dilm...@gmail.com> wrote: >On 21/06/17 22:56, David Miller wrote: > >> From: Michael D <michael.j.dilm...@gmail.com> >> Date: Wed, 21 Jun 2017 22:41:07 +0100 >> >>> I don't think you can stop it being dereferenced... you just need to >>> prevent an attacker from exploiting the null pointer dereference >>> vulnerability right? And this is done by returning the function right >>> away? >> What's all of this about an "attacker"? >> >> If there is a bug, we dererence a NULL pointer, and we should >> fix that bug. >> >> The BUG_ON() helps us see where the problem is while at the >> same time stopping the kernel before the NULL deref happens. >Ok this is starting to make sense now - went a bit off track but think my >general thinking is ok - i.e. if we return the function with an error code >before the dereference then this basically does the same thing as BUG_ON >but without crashing the kernel. > >Something like: > >if (WARN_ON(!new_active_slave) { >netdev_dbg("Can't add new active slave - pointer null"); >return ERROR_CODE >} In general, yes, but in this case, the condition should be impossible to hit, so BUG_ON seems appropriate. If bond_slave_get_rtnl/rcu() returns NULL for an actual bonding slave, other code paths (bond_fill_slave_info, bond_handle_frame) will likely crash before getting to this one. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH] Convert BUG_ON to WARN_ON in bond_options.c
David Miller <da...@davemloft.net> wrote: >From: Michael D <michael.j.dilm...@gmail.com> >Date: Wed, 21 Jun 2017 22:41:07 +0100 > >> I don't think you can stop it being dereferenced... you just need to >> prevent an attacker from exploiting the null pointer dereference >> vulnerability right? And this is done by returning the function right >> away? > >What's all of this about an "attacker"? > >If there is a bug, we dererence a NULL pointer, and we should >fix that bug. > >The BUG_ON() helps us see where the problem is while at the >same time stopping the kernel before the NULL deref happens. Looking at the code more carefully than I did earlier, the only way the BUG_ON will hit is if the rx_handler_data is NULL for a bonding slave when this code executes. This should be impossible, as there doesn't appear to be any way to get into bond_option_active_slave_set for a slave prior to bond_enslave registering the rx_handler for that slave, as these operations are mutexed by RTNL. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH] Convert BUG_ON to WARN_ON in bond_options.c
Michael J Dilmore <michael.j.dilm...@gmail.com> wrote: >The function below contains a BUG_ON where no active slave is detected. The >patch >converts this to a WARN_ON to avoid crashing the kernel. > >Signed-off-by: Michael J Dilmore <michael.j.dilm...@gmail.com> >--- > drivers/net/bonding/bond_options.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > >diff --git a/drivers/net/bonding/bond_options.c >b/drivers/net/bonding/bond_options.c >index 1bcbb89..c4b4791 100644 >--- a/drivers/net/bonding/bond_options.c >+++ b/drivers/net/bonding/bond_options.c >@@ -778,7 +778,7 @@ static int bond_option_active_slave_set(struct bonding >*bond, > struct slave *old_active = > rtnl_dereference(bond->curr_active_slave); > struct slave *new_active = bond_slave_get_rtnl(slave_dev); > >- BUG_ON(!new_active); >+ WARN_ON(!new_active); This is a reasonable idea in principle, but will require additional changes to prevent dereferencing new_active if it is NULL (which would happen just below this point in the code). -J > if (new_active == old_active) { > /* do nothing */ >-- >2.7.4 > --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH] Convert multiple netdev_info messages to netdev_dbg
Joe Perches <j...@perches.com> wrote: >On Thu, 2017-06-15 at 19:14 +0100, Michael J Dilmore wrote: >> Multiple netdev_info messages clutter kernel output. Also add netdev_dbg for >> packets per slave. >[] >> diff --git a/drivers/net/bonding/bond_options.c >> b/drivers/net/bonding/bond_options.c >[] >> @@ -9,6 +9,8 @@ >> * (at your option) any later version. >> */ >> >> +#define DEBUG 1 > >Is defining DEBUG really worthwhile. I don't believe so, since if CONFIG_DYNAMIC_DEBUG is not enabled, having #define DEBUG will enable all of the netdev_dbg messages unconditionally, which is the opposite of the stated purpose of the patch. If DYNAMIC_DEBUG is enabled, having DEBUG doesn't do anything that I can see. -J >As well, it's almost always just >#define DEBUG >without any level value unless the >level value is used in the code. > >> + >> #include >> #include >> #include >> @@ -719,13 +721,13 @@ static int bond_option_mode_set(struct bonding *bond, >> const struct bond_opt_value *newval) >> { >> if (!bond_mode_uses_arp(newval->value) && bond->params.arp_interval) { >> -netdev_info(bond->dev, "%s mode is incompatible with arp >> monitoring, start mii monitoring\n", >> +netdev_dbg(bond->dev, "%s mode is incompatible with arp >> monitoring, start mii monitoring\n", >> newval->string); > >Please realign any multiple line arguments to the >open parenthesis at the same time. > >> /* disable arp monitoring */ >> bond->params.arp_interval = 0; >> /* set miimon to default value */ >> bond->params.miimon = BOND_DEFAULT_MIIMON; >> -netdev_info(bond->dev, "Setting MII monitoring interval to >> %d\n", >> +netdev_dbg(bond->dev, "Setting MII monitoring interval to %d\n", >> bond->params.miimon); > >etc... > --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: bond link state mismatch, rtnl_trylock() vs rtnl_lock()
Nithin Sujir <nsu...@tintri.com> wrote: >Hi, >We're encountering a problem in 4.4 LTS where, rarely, the bond link state >is not updated when the slave link changes. > >I've traced the issue to the arp monitor unable to get the rtnl lock. The >sequence resulting in failure is as below. > >bond_loadbalance_arp_mon() periodically called, if slave link is _down_, >it checks if the slave is sending/receiving packets. If it is, it sets >flags to be processed later down the function for bond link >update. However, it sets the slave->link right away. > >if (slave->link != BOND_LINK_UP) { >if (bond_time_in_interval(bond, trans_start, 1) && >bond_time_in_interval(bond, slave->last_rx, >1)) { > >slave->link = BOND_LINK_UP; >slave_state_changed = 1; > > >Later down the function, it tries to get the rtnl_lock. If it doesn't get >it, it rearms and returns. > >if (do_failover || slave_state_changed) { >if (!rtnl_trylock()) >goto re_arm; <-- returns here > >if (slave_state_changed) { >bond_slave_state_change(bond); > >This is the problem. The next time this function is called, the >slave->link is already marked UP. And we will never update the bond link >state to UP. This looks like an ARP monitor version of commit de77ecd4ef02ca783f7762e04e92b3d0964be66b Author: Mahesh Bandewar <mahe...@google.com> Date: Mon Mar 27 11:37:33 2017 -0700 bonding: improve link-status update in mii-monitoring and probably needs a similar fix (possibly for both the loadbalance and active-backup ARP monitor cases). >Changing the rtnl_trylock() -> rtnl_lock() _does_ fix the issue. > >Is this the right way to fix it? If it is, I can submit this formally. It's not the right way, unfortunately. The reason for the rtnl_trylock is that there's a possible race against bond_close() -> bond_work_cancel_all() trying to cancel the arp_work workqueue item while it's running. bond_close is called with RTNL held, so if it has RTNL and is waiting for the work function to complete, an rtnl_lock call here will deadlock. Some of the trylock calls in bonding are commented to this effect, but not this one. -J >What are the guidelines around using rtnl_lock() vs rtnl_trylock()? Some >places are using rtnl_lock() and other rtnl_trylock(). Sorry, I couldn't >find much via a google search or in Documentation/. > >Thanks, >Nithin. > > > >diff --git a/drivers/net/bonding/bond_main.c >b/drivers/net/bonding/bond_main.c >index 5dca77e..1f60503 100644 >--- a/drivers/net/bonding/bond_main.c >+++ b/drivers/net/bonding/bond_main.c >@@ -2614,8 +2614,7 @@ static void bond_loadbalance_arp_mon(struct >work_struct *work) >rcu_read_unlock(); > >if (do_failover || slave_state_changed) { >- if (!rtnl_trylock()) >- goto re_arm; >+ rtnl_lock(); > >if (slave_state_changed) { >bond_slave_state_change(bond); > > --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH net] bonding: avoid defaulting hard_header_len to ETH_HLEN on slave removal
Paolo Abeni <pab...@redhat.com> wrote: >On slave list updates, the bonding driver computes its hard_header_len >as the maximum of all enslaved devices's hard_header_len. >If the slave list is empty, e.g. on last enslaved device removal, >ETH_HLEN is used. > >Since the bonding header_ops are set only when the first enslaved >device is attached, the above can lead to header_ops->create() >being called with the wrong skb headroom in place. > >If bond0 is configured on top of ipoib devices, with the >following commands: > >ifup bond0 >for slave in $BOND_SLAVES_LIST; do > ip link set dev $slave nomaster >done >ping -c 1 > >we will obtain a skb_under_panic() with a similar call trace: > skb_push+0x3d/0x40 > push_pseudo_header+0x17/0x30 [ib_ipoib] > ipoib_hard_header+0x4e/0x80 [ib_ipoib] > arp_create+0x12f/0x220 > arp_send_dst.part.19+0x28/0x50 > arp_solicit+0x115/0x290 > neigh_probe+0x4d/0x70 > __neigh_event_send+0xa7/0x230 > neigh_resolve_output+0x12e/0x1c0 > ip_finish_output2+0x14b/0x390 > ip_finish_output+0x136/0x1e0 > ip_output+0x76/0xe0 > ip_local_out+0x35/0x40 > ip_send_skb+0x19/0x40 > ip_push_pending_frames+0x33/0x40 > raw_sendmsg+0x7d3/0xb50 > inet_sendmsg+0x31/0xb0 > sock_sendmsg+0x38/0x50 > SYSC_sendto+0x102/0x190 > SyS_sendto+0xe/0x10 > do_syscall_64+0x67/0x180 > entry_SYSCALL64_slow_path+0x25/0x25 > >This change addresses the issue avoiding updating the bonding device >hard_header_len when the slaves list become empty, forbidding to >shrink it below the value used by header_ops->create(). > >The bug is there since commit 54ef31371407 ("[PATCH] bonding: Handle large >hard_header_len") but the panic can be triggered only since >commit fc791b633515 ("IB/ipoib: move back IB LL address into the hard >header"). > >Reported-by: Norbert P <n...@physik.uzh.ch> >Fixes: 54ef31371407 ("[PATCH] bonding: Handle large hard_header_len") >Fixes: fc791b633515 ("IB/ipoib: move back IB LL address into the hard header") >Signed-off-by: Marcelo Ricardo Leitner <marcelo.leit...@gmail.com> >Signed-off-by: Paolo Abeni <pab...@redhat.com> Signed-off-by: Jay Vosburgh <jay.vosbu...@canonical.com> > drivers/net/bonding/bond_main.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > >diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c >index 8a4ba8b..34481c9 100644 >--- a/drivers/net/bonding/bond_main.c >+++ b/drivers/net/bonding/bond_main.c >@@ -1104,11 +1104,11 @@ static void bond_compute_features(struct bonding *bond) > gso_max_size = min(gso_max_size, slave->dev->gso_max_size); > gso_max_segs = min(gso_max_segs, slave->dev->gso_max_segs); > } >+ bond_dev->hard_header_len = max_hard_header_len; > > done: > bond_dev->vlan_features = vlan_features; > bond_dev->hw_enc_features = enc_features | NETIF_F_GSO_ENCAP_ALL; >- bond_dev->hard_header_len = max_hard_header_len; > bond_dev->gso_max_segs = gso_max_segs; > netif_set_gso_max_size(bond_dev, gso_max_size); > >-- >2.9.3 >
Re: [PATCH net-next v2] bonding: deliver link-local packets with skb->dev set to link that packets arrived on
Chonggang Li <chonggan...@google.com> wrote: >Bonding driver changes the skb->dev to the bonding-master before >passing the packet to stack for further processing. This, however >does not make sense for the link-local packets and it looses "the >link info" once its skb->dev is changed to bonding-master. This >patch changes this behavior for link-local packets by not changing >the skb->dev to the bonding-master and maintaining it as it is, >i.e. the link on which the packet arrived. Minor nit: "looses" should be "loses". Other than that: Signed-off-by: Jay Vosburgh <jay.vosbu...@canonical.com> >Signed-off-by: Chonggang Li <chonggan...@google.com> >Signed-off-by: Mahesh Bandewar <mahe...@google.com> >Signed-off-by: Maciej Żenczykowski <m...@google.com> >--- >Changes in v2: > - Make the commit message more clearer. > > drivers/net/bonding/bond_main.c | 3 +++ > 1 file changed, 3 insertions(+) > >diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c >index 01e4a69af421..6bd3b50faf48 100644 >--- a/drivers/net/bonding/bond_main.c >+++ b/drivers/net/bonding/bond_main.c >@@ -1176,6 +1176,9 @@ static rx_handler_result_t bond_handle_frame(struct >sk_buff **pskb) > } > } > >+ /* don't change skb->dev for link-local packets */ >+ if (is_link_local_ether_addr(eth_hdr(skb)->h_dest)) >+ return RX_HANDLER_PASS; > if (bond_should_deliver_exact_match(skb, slave, bond)) > return RX_HANDLER_EXACT; > >-- >2.12.2.762.g0e3151a226-goog >
Re: [PATCH net-next] bonding: do not pass link-local packets to master-interface
Mahesh Bandewar (महेश बंडेवार) wrote: >On Fri, Apr 14, 2017 at 12:30 PM, Jay Vosburgh ><jay.vosbu...@canonical.com> wrote: >> >> >> Chonggang Li <chonggan...@google.com> wrote: >> >> >Previously link-local packets excluding LACP (which are handled by >> >the recv_probe) received on bond slave interfaces are delivered to >> >stack with bond-master device with RX_HANDLER_ANOTHER, however all >> >link-local packets are link specific and should be delivered with >> >exact same link/dev. >> >> In what case is the current behavior a problem (my guess would >> be something related to LLDP)? What if, e.g., the bond is a bridge >> port, will STP frames no longer propagate to the bridge? >> >When a link-local frame made appear to arrive on bond-master, it >looses value since 'the info' about link on which it arrived is lost. This information should really be in the commit message, along with something about the LLDP issue being solved. >So idea behind this is to pass the frame to the stack with the link on >which it arrived without involving the bonding-master. The same will >happen for the STP frames too. Do you see problem with this approach? My look through the STP code suggests not, but I'm far from an expert there. I'm just concerned that this change will cause some obscure regression in something that depends on the current behavior. >> Also, I think the description would be better if it mentioned >> specifically that the patch is changing how skb->dev is set for link >> local frames (bond->dev vs receiving interface), e.g., >> >> "[...] however all link-local packets are link specific and >> should be delivered with skb->dev set to the original device." >> >yes, makes sense. > >> >Signed-off-by: Chonggang Li <chonggan...@google.com> >> >Signed-off-by: Mahesh Bandewar <mahe...@google.com> >> >Signed-off-by: Maciej Żenczykowski <m...@google.com> >> >--- >> > drivers/net/bonding/bond_main.c | 3 +++ >> > 1 file changed, 3 insertions(+) >> > >> >diff --git a/drivers/net/bonding/bond_main.c >> >b/drivers/net/bonding/bond_main.c >> >index 01e4a69af421..aeca3d8541b9 100644 >> >--- a/drivers/net/bonding/bond_main.c >> >+++ b/drivers/net/bonding/bond_main.c >> >@@ -1176,6 +1176,9 @@ static rx_handler_result_t bond_handle_frame(struct >> >sk_buff **pskb) >> > } >> > } >> > >> >+ /* link-local packets should not be passed to master interface */ >> >+ if (is_link_local_ether_addr(eth_hdr(skb)->h_dest)) >> >+ return RX_HANDLER_PASS; >> >> Since this returns _PASS and not _EXACT, the packet will go >> through the ptype_base packet handlers, so is the comment strictly >> correct? >> >The stack does not have all link-local specific packet-type handlers >registered by default so returning _EXACT would find nothing and >packet will be dropped, right? So returning _PASS will deliver packet >to the stack with skb->dev as the link on which packet arrived and >stack can take whatever (default) action it has to take. Fair enough; I do think the comment would be better phrased as something like "don't change skb->dev of link local frames"; on first reading, I thought it meant the frames would be dropped. -J >> > if (bond_should_deliver_exact_match(skb, slave, bond)) >> > return RX_HANDLER_EXACT; >> > >> >-- >> >2.12.2.762.g0e3151a226-goog >> > >> --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH net-next] bonding: do not pass link-local packets to master-interface
Chonggang Li <chonggan...@google.com> wrote: >Previously link-local packets excluding LACP (which are handled by >the recv_probe) received on bond slave interfaces are delivered to >stack with bond-master device with RX_HANDLER_ANOTHER, however all >link-local packets are link specific and should be delivered with >exact same link/dev. In what case is the current behavior a problem (my guess would be something related to LLDP)? What if, e.g., the bond is a bridge port, will STP frames no longer propagate to the bridge? Also, I think the description would be better if it mentioned specifically that the patch is changing how skb->dev is set for link local frames (bond->dev vs receiving interface), e.g., "[...] however all link-local packets are link specific and should be delivered with skb->dev set to the original device." >Signed-off-by: Chonggang Li <chonggan...@google.com> >Signed-off-by: Mahesh Bandewar <mahe...@google.com> >Signed-off-by: Maciej Żenczykowski <m...@google.com> >--- > drivers/net/bonding/bond_main.c | 3 +++ > 1 file changed, 3 insertions(+) > >diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c >index 01e4a69af421..aeca3d8541b9 100644 >--- a/drivers/net/bonding/bond_main.c >+++ b/drivers/net/bonding/bond_main.c >@@ -1176,6 +1176,9 @@ static rx_handler_result_t bond_handle_frame(struct >sk_buff **pskb) > } > } > >+ /* link-local packets should not be passed to master interface */ >+ if (is_link_local_ether_addr(eth_hdr(skb)->h_dest)) >+ return RX_HANDLER_PASS; Since this returns _PASS and not _EXACT, the packet will go through the ptype_base packet handlers, so is the comment strictly correct? -J > if (bond_should_deliver_exact_match(skb, slave, bond)) > return RX_HANDLER_EXACT; > >-- >2.12.2.762.g0e3151a226-goog > --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: bond procfs hw addr prints
Jarod Wilson <ja...@redhat.com> wrote: >I've got a bug report for someone using a Intel OPA devices in a bond, and >it appears these devices have a hardware address length of 20, opposed to >the typical 6 on ethernet. When they dump /proc/net/bonding/bondX, it only >prints the first 6 of the address, per %pM and mac_address_string(), while >sysfs for the interface does print the right thing, since it uses >sysfs_print_mac(), which takes a length argument. This (20 octet MAC length) is true for any Infiniband device. >So the question is... What's the best route to take here? Expand %pM to >support variable length hardware addresses? Use sysfs_* in procfs? >Reinvent the wheel? Nothing I've tinkered with just yet feels very clean, >on top of not actually working yet. :) sysfs_format_mac (not _print_mac) uses "%*phC", len, addr in its format string. Perhaps that format would be a better choice than %pM for this case? -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH next 0/5] bonding: winter cleanup
Mahesh Bandewar <mah...@bandewar.net> wrote: >From: Mahesh Bandewar <mahe...@google.com> > >Few cleanup patches that I have accumulated over some time now. > >(a) First two patches are basically to move the work-queue initialization >from every ndo_open / bond_open operation to once at the beginning while >port creation. Work-queue initialization is an unnecessary operation >for every 'ifup' operation. However we have some mode-specific work-queues >and mode can change anytime after port creation. So the second patch is >to ensure the correct work-handler is called based on the mode. > >(b) Third patch is simple and straightforward that removes hard-coded value >that was added into the initial commit and replaces it with the default >value configured. > >(c) The final patch in the series removes the unimplemented "port-moved" state >from the LACP state machine. This state is defined but never set so >removing from the state machine logic makes code little cleaner. > >(d) Reduce scope of some global variables to local. For all patches in the series: Signed-off-by: Jay Vosburgh <jay.vosbu...@canonical.com> >Note: None of these patches are making any functional changes. > >Mahesh Bandewar (5): > bonding: restructure arp-monitor > bonding: initialize work-queues during creation of bond > bonding: remove hardcoded value > bonding: remove "port-moved" state that was never implemented > bonding: reduce scope of some global variables > > drivers/net/bonding/bond_3ad.c | 11 +++-- > drivers/net/bonding/bond_main.c | 53 ++++++--- > 2 files changed, 37 insertions(+), 27 deletions(-) > >-- >2.12.0.246.ga2ecc84866-goog --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [Patch net] bonding: use ETH_MAX_MTU as max mtu
Cong Wang <xiyou.wangc...@gmail.com> wrote: >This restores the ability of setting bond device's mtu to 9000. > >Fixes: 91572088e3fd ("net: use core MTU range checking in core net infra") >Reported-by: daz...@gmail.com >Reported-by: Brad Campbell <lists2...@fnarfbargle.com> >Cc: Jarod Wilson <ja...@redhat.com> >Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com> >--- > drivers/net/bonding/bond_main.c | 1 + > 1 file changed, 1 insertion(+) > >diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c >index 6321f12..8a4ba8b 100644 >--- a/drivers/net/bonding/bond_main.c >+++ b/drivers/net/bonding/bond_main.c >@@ -4179,6 +4179,7 @@ void bond_setup(struct net_device *bond_dev) > > /* Initialize the device entry points */ > ether_setup(bond_dev); >+ bond_dev->max_mtu = ETH_MAX_MTU; > bond_dev->netdev_ops = _netdev_ops; > bond_dev->ethtool_ops = _ethtool_ops; Signed-off-by: Jay Vosburgh <jay.vosbu...@canonical.com> -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH net-next] net: Remove usage of net_device last_rx member
Tobias Klauser <tklau...@distanz.ch> wrote: >The network stack no longer uses the last_rx member of struct net_device >since the bonding driver switched to use its own private last_rx in >commit 9f242738376d ("bonding: use last_arp_rx in slave_last_rx()"). > >However, some drivers still (ab)use the field for their own purposes and >some driver just update it without actually using it. > >Previously, there was an accompanying comment for the last_rx member >added in commit 4dc89133f49b ("net: add a comment on netdev->last_rx") >which asked drivers not to update is, unless really needed. However, >this commend was removed in commit f8ff080dacec ("bonding: remove >useless updating of slave->dev->last_rx"), so some drivers added later >on still did update last_rx. > >Remove all usage of last_rx and switch three drivers (sky2, atp and >smc91c92_cs) which actually read and write it to use their own private >copy in netdev_priv. > >Compile-tested with allyesconfig and allmodconfig on x86 and arm. Reviewed-by: Jay Vosburgh <jay.vosbu...@canonical.com>
Re: [PATCH] net: bonding: use new api ethtool_{get|set}_link_ksettings
Philippe Reynes <trem...@gmail.com> wrote: >The ethtool api {get|set}_settings is deprecated. >We move this driver to new api {get|set}_link_ksettings. This is just an API change, i.e., no change to functionality? -J >Signed-off-by: Philippe Reynes <trem...@gmail.com> >--- > drivers/net/bonding/bond_main.c | 16 > 1 files changed, 8 insertions(+), 8 deletions(-) > >diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c >index c9944d8..5708f17 100644 >--- a/drivers/net/bonding/bond_main.c >+++ b/drivers/net/bonding/bond_main.c >@@ -4080,16 +4080,16 @@ static netdev_tx_t bond_start_xmit(struct sk_buff >*skb, struct net_device *dev) > return ret; > } > >-static int bond_ethtool_get_settings(struct net_device *bond_dev, >- struct ethtool_cmd *ecmd) >+static int bond_ethtool_get_link_ksettings(struct net_device *bond_dev, >+ struct ethtool_link_ksettings *cmd) > { > struct bonding *bond = netdev_priv(bond_dev); > unsigned long speed = 0; > struct list_head *iter; > struct slave *slave; > >- ecmd->duplex = DUPLEX_UNKNOWN; >- ecmd->port = PORT_OTHER; >+ cmd->base.duplex = DUPLEX_UNKNOWN; >+ cmd->base.port = PORT_OTHER; > > /* Since bond_slave_can_tx returns false for all inactive or down > slaves, we >* do not need to check mode. Though link speed might not represent >@@ -4100,12 +4100,12 @@ static int bond_ethtool_get_settings(struct net_device >*bond_dev, > if (bond_slave_can_tx(slave)) { > if (slave->speed != SPEED_UNKNOWN) > speed += slave->speed; >- if (ecmd->duplex == DUPLEX_UNKNOWN && >+ if (cmd->base.duplex == DUPLEX_UNKNOWN && > slave->duplex != DUPLEX_UNKNOWN) >- ecmd->duplex = slave->duplex; >+ cmd->base.duplex = slave->duplex; > } > } >- ethtool_cmd_speed_set(ecmd, speed ? : SPEED_UNKNOWN); >+ cmd->base.speed = speed ? : SPEED_UNKNOWN; > > return 0; > } >@@ -4121,8 +4121,8 @@ static void bond_ethtool_get_drvinfo(struct net_device >*bond_dev, > > static const struct ethtool_ops bond_ethtool_ops = { > .get_drvinfo= bond_ethtool_get_drvinfo, >- .get_settings = bond_ethtool_get_settings, > .get_link = ethtool_op_get_link, >+ .get_link_ksettings = bond_ethtool_get_link_ksettings, > }; > > static const struct net_device_ops bond_netdev_ops = { >-- >1.7.4.4 --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH] bonding: Prevent deletion of a bond, or the last slave from a bond, with active usage.
= command + 2; >+ else >+ ifname = command + 1; >+ > if ((strlen(command) <= 1) || > !dev_valid_name(ifname)) > goto err_no_cmd; >@@ -130,6 +135,32 @@ static ssize_t bonding_store_bonds(struct class *cls, > res = -ENODEV; > } > rtnl_unlock(); >+ } else if ((command[0] == '?') && (command[1] == '-')) { >+ struct net_device *bond_dev; >+ >+ rtnl_lock(); >+ bond_dev = bond_get_by_name(bn, ifname); >+ >+ if (bond_dev) { >+ struct in_device *in_dev; >+ struct bonding *bond = netdev_priv(bond_dev); >+ >+ in_dev = __in_dev_get_rtnl(bond_dev); >+ >+ if (((in_dev->ifa_list) != NULL) && >+ (bond->slave_cnt > 0)) { >+ pr_err("%s is in use. Unconfigure IP %pI4 >before deletion.\n", >+ ifname, _dev->ifa_list->ifa_local); >+ rtnl_unlock(); >+ return -EBUSY; >+ } >+ pr_info("%s is being deleted...\n", ifname); >+ unregister_netdevice(bond_dev); >+ } else { >+ pr_err("unable to delete non-existent %s\n", ifname); >+ res = -ENODEV; >+ } >+ rtnl_unlock(); > } else > goto err_no_cmd; > >@@ -139,7 +170,7 @@ static ssize_t bonding_store_bonds(struct class *cls, > return res; > > err_no_cmd: >- pr_err("no command found in bonding_masters - use +ifname or >-ifname\n"); >+ pr_err("no command found in bonding_masters - use +ifname or -ifname or >?-ifname\n"); > return -EPERM; > } > >-- >1.8.3.1 > --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH] bonding: Allow tun-interfaces as slaves
Jörn Engel <jo...@purestorage.com> wrote: >On Wed, Aug 10, 2016 at 02:26:49PM -0700, Jörn Engel wrote: >> >> Having to set one more parameter is a bit annoying. It would have to be >> documented in a prominent place and people would still often miss it. >> So I wonder if we can make the interface a little nicer. >> >> Options: >> - If there are no slaves yet and the first slave added is tun, we trust >> the users to know what they are doing. Automatically set >> bond->params.fail_over_mac = BOND_FOM_KEEPMAC >> Maybe do a printk to inform the user in case of a mistake. I don't think this is feasible, as I don't see a reliable way to test for a slave being a tun device (ARPHRD_NONE is not just tun, and we cannot check the ops as they are not statically built into the kernel). I'm also not sure that heuristics are the proper way to enable this functionality in general. >> - If we get an error and the slave device is tun, do a printk giving the >> user enough information to find this parameter. This could probably be done as a change the existing logic, e.g., diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 1f276fa30ba6..019c1a689aae 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -1443,6 +1443,9 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev) res = -EOPNOTSUPP; goto err_undo_flags; } + } else if (BOND_MODE(bond) != BOND_MODE_ACTIVEBACKUP && + bond->params.fail_over_mac != BOND_FOM_KEEPMAC) { + netdev_err(bond_dev, "The slave device specified does not support setting the MAC address, but fail_over_mac is not set to keepmac\n"); } } I haven't tested this, and I'm not sure it will get all corner cases correct, but this should basically cover it. -J >> I'm leaning towards the former, but you probably know a reason why I am >> wrong again. > >Patch below is an implementation of the former. Not sure if something >like this is worth considering. > >Jörn > >-- >To announce that there must be no criticism of the President, or that we >are to stand by the President, right or wrong, is not only unpatriotic >and servile, but is morally treasonable to the American public. >-- Theodore Roosevelt, Kansas City Star, 1918 > > >diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c >index 1f276fa30ba6..306909a44fab 100644 >--- a/drivers/net/bonding/bond_main.c >+++ b/drivers/net/bonding/bond_main.c >@@ -1482,8 +1482,9 @@ int bond_enslave(struct net_device *bond_dev, struct >net_device *slave_dev) >*/ > ether_addr_copy(new_slave->perm_hwaddr, slave_dev->dev_addr); > >- if (!bond->params.fail_over_mac || >- BOND_MODE(bond) != BOND_MODE_ACTIVEBACKUP) { >+ if (bond_dev->type != ARPHRD_NONE && >+ (!bond->params.fail_over_mac || >+ BOND_MODE(bond) != BOND_MODE_ACTIVEBACKUP)) { > /* Set slave to master's mac address. The application already >* set the master's mac address to that of the first slave >*/ >-- >2.1.4 > --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH] bonding: Allow tun-interfaces as slaves
Ding Tianhong <dingtianh...@huawei.com> wrote: >On 2016/8/10 7:51, Jay Vosburgh wrote: >> Jörn Engel <jo...@purestorage.com> wrote: >> >>> On Tue, Aug 09, 2016 at 12:06:36PM -0700, David Miller wrote: >>>>> On Tue, Aug 09, 2016 at 09:28:45PM +0800, Ding Tianhong wrote: >>>>> >>>>> Simply not checking errors when setting the mac address solves the >>>>> problem for me. No new features needed. >>>> >>>> But it only works in certain modes. >>>> >>>> So the best we can do is enforce the MAC address setting in the >>>> modes that absolutely require it. We cannot ignore the MAC >>>> address setting unilaterally. >>> >>> Something like this? >>> >>> [PATCH] bonding: Allow tun-interfaces as slaves in balance-rr mode >>> >>> Up until 00503b6f702e (part of 3.14-rc1), the bonding driver could be >>> used to enslave tun-interfaces. 00503b6f702e broke that behaviour, >>> afaics as an unintended side-effect. >>> >>> For the purpose of bond-over-tun in balance-rr mode, simply ignoring the >>> error from dev_set_mac_address() is good enough. >>> >>> Signed-off-by: Joern Engel <jo...@purestorage.com> >>> --- >>> drivers/net/bonding/bond_main.c | 3 ++- >>> 1 file changed, 2 insertions(+), 1 deletion(-) >>> >>> diff --git a/drivers/net/bonding/bond_main.c >>> b/drivers/net/bonding/bond_main.c >>> index 1f276fa30ba6..2f686bfe4304 100644 >>> --- a/drivers/net/bonding/bond_main.c >>> +++ b/drivers/net/bonding/bond_main.c >>> @@ -1490,7 +1490,8 @@ int bond_enslave(struct net_device *bond_dev, struct >>> net_device *slave_dev) >>> memcpy(addr.sa_data, bond_dev->dev_addr, bond_dev->addr_len); >>> addr.sa_family = slave_dev->type; >>> res = dev_set_mac_address(slave_dev, ); >>> - if (res) { >>> + /* round-robin mode works fine without a mac address */ >>> + if (res && BOND_MODE(bond) != BOND_MODE_ROUNDROBIN) { >> >> This will cause balance-rr to add the slave to the bond if any >> device's dev_set_mac_address call fails. >> >> If a bond of regular Ethernet devices is connected to a static >> link aggregation (Etherchannel channel group), a set_mac failure would >> result in that slave having a different MAC address than the bond, which >> in turn would cause traffic inbound from the switch to that slave to be >> dropped (as the destination MAC would not pass the device MAC filters). >> >> The failure check for the set_mac call serves a legitimate >> purpose, and I don't believe we should bypass it without making the >> bypass an option that is explicitly enabled for those special cases that >> need it. >> >> E.g., something like the following (which I have not tested); >> this would also need documentation and iproute2 updates to go with it. >> This would be enabled with "fail_over_mac=keepmac". >> >> diff --git a/drivers/net/bonding/bond_main.c >> b/drivers/net/bonding/bond_main.c >> index 1f276fa30ba6..d2283fc23b16 100644 >> --- a/drivers/net/bonding/bond_main.c >> +++ b/drivers/net/bonding/bond_main.c >> @@ -1483,7 +1483,8 @@ int bond_enslave(struct net_device *bond_dev, struct >> net_device *slave_dev) >> ether_addr_copy(new_slave->perm_hwaddr, slave_dev->dev_addr); >> >> if (!bond->params.fail_over_mac || >> -BOND_MODE(bond) != BOND_MODE_ACTIVEBACKUP) { >> +(BOND_MODE(bond) != BOND_MODE_ACTIVEBACKUP && >> + bond->params.fail_over_mac != BOND_FOM_KEEPMAC)) { >> /* Set slave to master's mac address. The application already >> * set the master's mac address to that of the first slave >> */ >> diff --git a/drivers/net/bonding/bond_options.c >> b/drivers/net/bonding/bond_options.c >> index 577e57cad1dc..f9653fe4d622 100644 >> --- a/drivers/net/bonding/bond_options.c >> +++ b/drivers/net/bonding/bond_options.c >> @@ -125,6 +125,7 @@ static const struct bond_opt_value >> bond_fail_over_mac_tbl[] = { >> { "none", BOND_FOM_NONE, BOND_VALFLAG_DEFAULT}, >> { "active", BOND_FOM_ACTIVE, 0}, >> { "follow", BOND_FOM_FOLLOW, 0}, >> +{ "keepmac", BOND_FOM_KEEPMAC, 0}, >> { NULL, -1, 0}, >> }; >> >>
Re: [PATCH] bonding: Allow tun-interfaces as slaves
Jörn Engel <jo...@purestorage.com> wrote: >On Tue, Aug 09, 2016 at 12:06:36PM -0700, David Miller wrote: >> > On Tue, Aug 09, 2016 at 09:28:45PM +0800, Ding Tianhong wrote: >> > >> > Simply not checking errors when setting the mac address solves the >> > problem for me. No new features needed. >> >> But it only works in certain modes. >> >> So the best we can do is enforce the MAC address setting in the >> modes that absolutely require it. We cannot ignore the MAC >> address setting unilaterally. > >Something like this? > >[PATCH] bonding: Allow tun-interfaces as slaves in balance-rr mode > >Up until 00503b6f702e (part of 3.14-rc1), the bonding driver could be >used to enslave tun-interfaces. 00503b6f702e broke that behaviour, >afaics as an unintended side-effect. > >For the purpose of bond-over-tun in balance-rr mode, simply ignoring the >error from dev_set_mac_address() is good enough. > >Signed-off-by: Joern Engel <jo...@purestorage.com> >--- > drivers/net/bonding/bond_main.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > >diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c >index 1f276fa30ba6..2f686bfe4304 100644 >--- a/drivers/net/bonding/bond_main.c >+++ b/drivers/net/bonding/bond_main.c >@@ -1490,7 +1490,8 @@ int bond_enslave(struct net_device *bond_dev, struct >net_device *slave_dev) > memcpy(addr.sa_data, bond_dev->dev_addr, bond_dev->addr_len); > addr.sa_family = slave_dev->type; > res = dev_set_mac_address(slave_dev, ); >- if (res) { >+ /* round-robin mode works fine without a mac address */ >+ if (res && BOND_MODE(bond) != BOND_MODE_ROUNDROBIN) { This will cause balance-rr to add the slave to the bond if any device's dev_set_mac_address call fails. If a bond of regular Ethernet devices is connected to a static link aggregation (Etherchannel channel group), a set_mac failure would result in that slave having a different MAC address than the bond, which in turn would cause traffic inbound from the switch to that slave to be dropped (as the destination MAC would not pass the device MAC filters). The failure check for the set_mac call serves a legitimate purpose, and I don't believe we should bypass it without making the bypass an option that is explicitly enabled for those special cases that need it. E.g., something like the following (which I have not tested); this would also need documentation and iproute2 updates to go with it. This would be enabled with "fail_over_mac=keepmac". diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 1f276fa30ba6..d2283fc23b16 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -1483,7 +1483,8 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev) ether_addr_copy(new_slave->perm_hwaddr, slave_dev->dev_addr); if (!bond->params.fail_over_mac || - BOND_MODE(bond) != BOND_MODE_ACTIVEBACKUP) { + (BOND_MODE(bond) != BOND_MODE_ACTIVEBACKUP && +bond->params.fail_over_mac != BOND_FOM_KEEPMAC)) { /* Set slave to master's mac address. The application already * set the master's mac address to that of the first slave */ diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c index 577e57cad1dc..f9653fe4d622 100644 --- a/drivers/net/bonding/bond_options.c +++ b/drivers/net/bonding/bond_options.c @@ -125,6 +125,7 @@ static const struct bond_opt_value bond_fail_over_mac_tbl[] = { { "none", BOND_FOM_NONE, BOND_VALFLAG_DEFAULT}, { "active", BOND_FOM_ACTIVE, 0}, { "follow", BOND_FOM_FOLLOW, 0}, + { "keepmac", BOND_FOM_KEEPMAC, 0}, { NULL, -1, 0}, }; diff --git a/include/net/bonding.h b/include/net/bonding.h index 6360c259da6d..ec3442b3aa83 100644 --- a/include/net/bonding.h +++ b/include/net/bonding.h @@ -420,6 +420,7 @@ static inline bool bond_slave_can_tx(struct slave *slave) #define BOND_FOM_NONE 0 #define BOND_FOM_ACTIVE1 #define BOND_FOM_FOLLOW2 +#define BOND_FOM_KEEPMAC 3 #define BOND_ARP_TARGETS_ANY 0 #define BOND_ARP_TARGETS_ALL 1 -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH] bonding: Allow tun-interfaces as slaves
Jörn Engel <jo...@purestorage.com> wrote: >Hello Tianhong! > >On Tue, Aug 09, 2016 at 10:18:41AM +0800, Ding Tianhong wrote: >> >> I don't understand your problem clearly, can you explain more about how the >> 00503b6f702e break tun-interfaces >> and we will try to fix it. > >Here is a trivial testcase: >openvpn --mktun --dev tun0 >echo +tun0 > /sys/class/net/bond0/bonding/slaves > >Worked fine before your patch, no longer works after your patch. Works >again after my patch. Could you describe your use case a bit further? Are you bonding together multiple VPN tunnels? This may be a regression, but since the patch that nominally introduced it was 2 years ago, the impact appears to be very narrow. >> and more, dev_set_mac_address will change the salver's mac address, some nic >> don't support to change the mac address and >> could not work as bond slave, so we need to check the return value, I don't >> think this patch has any effective improvement. > >Using bonding in balance-rr mode, there doesn't seem to be a need to >change the mac address. I suppose you might care in other modes, but I >don't. The balance-rr mode (as well as the -xor mode) is designed to interoperate with a Cisco Etherchannel-style static link aggregation, which requires all members to have the same MAC address for proper function. Now, the above notwithstanding, I don't have an issue if you want to bond together a couple of tun devices and can make it work. However, for the standard balance-rr case, the enslavement must fail if the call to set the slave MAC fails, as permitting the slave into the bond after set-MAC fails can result in a bond that silently loses packets. My tentative suggestion is that we extened fail_over_mac to cover additional modes, as a sort of "I really know what I'm doing" flag, and allow the enslavement to succeed when it is set. This would require setting an additional bonding option for this situation, but that doesn't seem to be a undue burden as this looks to be a niche use case. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH net V2] net/bonding: Enforce active-backup policy for IPoIB bonds
Saeed Mahameed <sae...@mellanox.com> wrote: >From: Mark Bloch <ma...@mellanox.com> > >When using an IPoIB bond currently only active-backup mode is a valid >use case and this commit strengthens it. > >Since commit 2ab82852a270 ("net/bonding: Enable bonding to enslave >netdevices not supporting set_mac_address()") was introduced till >4.7-rc1, IPoIB didn't support the set_mac_address ndo, and hence the >fail over mac policy always applied to IPoIB bonds. > >With the introduction of commit 492a7e67ff83 ("IB/IPoIB: Allow setting >the device address"), that doesn't hold and practically IPoIB bonds are >broken as of that. To fix it, lets go to fail over mac if the device >doesn't support the ndo OR this is IPoIB device. > >As a by-product, this commit also prevents a stack corruption which >occurred when trying to copy 20 bytes (IPoIB) device address >to a sockaddr struct that has only 16 bytes of storage. > >Signed-off-by: Mark Bloch <ma...@mellanox.com> >Signed-off-by: Or Gerlitz <ogerl...@mellanox.com> >Signed-off-by: Saeed Mahameed <sae...@mellanox.com> Signed-off-by: Jay Vosburgh <jay.vosbu...@canonical.com> > >Changes from v0: > - Set res to -EOPNOTSUPP before jumping to err_undo_flags. > > drivers/net/bonding/bond_main.c | 11 ++- > 1 file changed, 10 insertions(+), 1 deletion(-) > >diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c >index a2afa3b..4d79819 100644 >--- a/drivers/net/bonding/bond_main.c >+++ b/drivers/net/bonding/bond_main.c >@@ -1422,7 +1422,16 @@ int bond_enslave(struct net_device *bond_dev, struct >net_device *slave_dev) > return -EINVAL; > } > >- if (slave_ops->ndo_set_mac_address == NULL) { >+ if (slave_dev->type == ARPHRD_INFINIBAND && >+ BOND_MODE(bond) != BOND_MODE_ACTIVEBACKUP) { >+ netdev_warn(bond_dev, "Type (%d) supports only active-backup >mode\n", >+ slave_dev->type); >+ res = -EOPNOTSUPP; >+ goto err_undo_flags; >+ } >+ >+ if (!slave_ops->ndo_set_mac_address || >+ slave_dev->type == ARPHRD_INFINIBAND) { > netdev_warn(bond_dev, "The slave device specified does not > support setting the MAC address\n"); > if (BOND_MODE(bond) == BOND_MODE_ACTIVEBACKUP && > bond->params.fail_over_mac != BOND_FOM_ACTIVE) { >-- >2.8.0 >
Re: [PATCH net] net/bonding: Enforce active-backup policy for IPoIB bonds
Andy Gospodarek <go...@cumulusnetworks.com> wrote: >On Wed, Jul 20, 2016 at 05:44:20PM +0300, Saeed Mahameed wrote: >> From: Mark Bloch <ma...@mellanox.com> >> >> When using an IPoIB bond currently only active-backup mode is a valid >> use case and this commit strengthens it. >> >> Since commit 2ab82852a270 ("net/bonding: Enable bonding to enslave >> netdevices not supporting set_mac_address()") was introduced till >> 4.7-rc1, IPoIB didn't support the set_mac_address ndo, and hence >> the fail over mac policy always applied to IPoIB bonds. >> >> With the introduction of commit 492a7e67ff83 ("IB/IPoIB: Allow setting >> the device address"), that doesn't hold and practically IPoIB bonds are >> broken as of that. To fix it, lets go to fail over mac if the device >> doesn't support the ndo OR this is IPoIB device. >> >> As a by-product, this commit also prevents a stack corruption which >> occurred when trying to copy 20 bytes (IPoIB) device address >> to a sockaddr struct that has only 16 bytes of storage. >> >> Signed-off-by: Mark Bloch <ma...@mellanox.com> >> Signed-off-by: Or Gerlitz <ogerl...@mellanox.com> >> Signed-off-by: Saeed Mahameed <sae...@mellanox.com> >> --- >> drivers/net/bonding/bond_main.c | 10 +- >> 1 file changed, 9 insertions(+), 1 deletion(-) >> >> diff --git a/drivers/net/bonding/bond_main.c >> b/drivers/net/bonding/bond_main.c >> index a2afa3b..ccd4003 100644 >> --- a/drivers/net/bonding/bond_main.c >> +++ b/drivers/net/bonding/bond_main.c >> @@ -1422,7 +1422,15 @@ int bond_enslave(struct net_device *bond_dev, struct >> net_device *slave_dev) >> return -EINVAL; >> } >> >> -if (slave_ops->ndo_set_mac_address == NULL) { >> +if (slave_dev->type == ARPHRD_INFINIBAND && >> +BOND_MODE(bond) != BOND_MODE_ACTIVEBACKUP) { >> +netdev_warn(bond_dev, "Type (%d) supports only active-backup >> mode\n", >> +slave_dev->type); > >Seems like we should propagate the failure back to the caller. >Something like this? > > return -EOPNOTSUPP; Agreed, although I think it needs to be res = -EOPNOTSUPP; goto err_undo_flags; The code has to undo the the ARPHRD_INFINIBAND setting in bond_dev->type and other stuff done by bond_setup_by_slave. If we don't switch the fields back to the ARPHRD_ETHER related values, then we could have an invalid dereference on the header_ops pointer if something unloads the IB module. -J >> +goto err_undo_flags; >> +} >> + >> +if (!slave_ops->ndo_set_mac_address || >> + slave_dev->type == ARPHRD_INFINIBAND) { >> netdev_warn(bond_dev, "The slave device specified does not >> support setting the MAC address\n"); >> if (BOND_MODE(bond) == BOND_MODE_ACTIVEBACKUP && >> bond->params.fail_over_mac != BOND_FOM_ACTIVE) { > >The rest of the patch seems logical, so I'm fine with it. --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH] bonding:Fix perfomance drop if one bonding device in configuration is repeatedly restarted
Nicholas Krause <xerofo...@gmail.com> wrote: >This fixes a issue with the following test caseL > >1. Created a alb bonding(bond0) with three of 1Gb interfaces >(eth0, eth1, eth2) under ipv4, set the bond0 ip address as >192.168.10.101, run command "iperf -s" as the server(x86_64). > >2. Run command "iperf -c 192.168.10.101 -t 1 -i 1" >on three linux client with CentOS6.5 x86_64 > >3. Run the following commands on server repeatly, >"ifconfig eth1 down; sleep 20; ifconfig eth1 up". >monitor the bonding performance with program " >nmon_x86_64_centos6", after some time, the >performance droped from 3000 to 2000 Mbps, >and could not go up to 3000Mbps any more, >it seems that the bonding doesn't rebalance >again. >This is due to use incorrectly checking in the function, >rlb_choose_channel if the ethernet address is not equal >to 64 bits before updating our mac address from arp. >However this is incorrect as the ethernet header length >needs to be 64bits in order to update our mac address >from arp without the above test case having a perfomance >regression. Fix it by checking for the ethernet header >having 64 bits and not. NAK. I don't understand what this is trying to say, but the change below does not look valid. >Signed-off-by: Nicholas Krause <xerofo...@gmail.com> >--- > drivers/net/bonding/bond_alb.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > >diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c >index 551f0f8..d5ae8a5 100644 >--- a/drivers/net/bonding/bond_alb.c >+++ b/drivers/net/bonding/bond_alb.c >@@ -593,7 +593,7 @@ static struct slave *rlb_choose_channel(struct sk_buff >*skb, struct bonding *bon > if ((client_info->ip_src == arp->ip_src) && > (client_info->ip_dst == arp->ip_dst)) { > /* the entry is already assigned to this client */ >- if (!ether_addr_equal_64bits(arp->mac_dst, mac_bcast)) { >+ if (ether_addr_equal_64bits(arp->mac_dst, mac_bcast)) { > /* update mac address from arp */ > ether_addr_copy(client_info->mac_dst, > arp->mac_dst); Regardless of the description above, this test is insuring that the code doesn't use the broadcast MAC address for the client. Changing it seems like a bad thing to do, as it would cause traffic for the client to be sent to the ethernet broadcast address. This might, in fact, make the described test run better, but it has nothing to do with rebalancing and would negatively impact other hosts on the network. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH net] bonding: fix 802.3ad aggregator reselection
Veli-Matti Lintu <veli-matti.li...@opinsys.fi> wrote: >2016-07-06 0:20 GMT+03:00 Jay Vosburgh <jay.vosbu...@canonical.com>: >> Veli-Matti Lintu <veli-matti.li...@opinsys.fi> wrote: >> >>>2016-06-30 14:15 GMT+03:00 Veli-Matti Lintu <veli-matti.li...@opinsys.fi>: >>>> 2016-06-29 18:59 GMT+03:00 Jay Vosburgh <jay.vosbu...@canonical.com>: >>>>> Veli-Matti Lintu <veli-matti.li...@opinsys.fi> wrote: >>> >>>>> I tried this locally, but don't see any failure (at the end, the >>>>> "Switch A" agg is still active with the single port). I am starting >>>>> with just two ports in each aggregator (instead of three), so that may >>>>> be relevant. >>>> >>>> When the connection problem occurs, /proc/net/bonding/bond0 always >>>> shows the aggregator that has a link up active. Dumpcap sees at least >>>> broadcast traffic on the port, but I haven't done extensive analysis >>>> on that yet. All TCP connections are cut until the bond is up again >>>> when more ports are enabled on the switch. ping doesn't work either >>>> way. >>> >>>I did some further testing on this and it looks like I can get this >>>working by enabling the ports in the new aggregator the same way as >>>the ports in old aggregator are disabled in ad_agg_selection_logic(). >> >> I tested with this some as well, using 6 ports total across two >> switches, and was still not able to reproduce the issue. How are you >> configuring the bond in the first place? It may be that there is some >> dependency on the ordering of the slaves within the bond and how they >> are disabled. > >The switch configuration should be pretty basic - port group that has >LACP as type. L4 based hashing is configured on the switches. The bond >is running as untagged on the ports. "show run" shows them as such: > >trunk 20-22 trk4 lacp >trunk-load-balance L4-based > >Both switches have similar configuration. I'm not an expert in switch >configurations, so I do not know if there's anything else interesting >in there. > >The servers are running Ubuntu 16.04 and systemd-networkd is used to >configure the interfaces. Using /etc/network/interfaces failed quite >often to setup all interfaces at boot time, so we switched over to >systemd-networkd. > >> Also, I am taking the ports down by physically unplugging the >> cable from the switch. If you're doing it differently, that might be >> relevant. > >The servers are in remote location, so I'm disabling the ports in >switch configuration. Unfortunately I do not have a similar setup that >I could access physically at the moment. > >>>Normally the ports seem to get enabled from ad_mux_machine() in "case >>>AD_MUX_COLLECTING_DISTRIBUTING", but something different happens there >>>as the port does get enabled, but no traffic passes through. So far I >>>haven't been able to figure out what happens. When the connection is >>>lost, dumpcap sees traffic on the only active port in the bond, but it >>>seems like nothing catches it. If I disable and re-enable the same >>>port, traffic start flowing again normally. >> >> Looking at the debug log you provided, the step that fails >> appears to correspond to this portion: > >... > >> The "Periodic Machine" 3->4 then 4->3 then "Sent LACPDU" looks >> normal (3->4 is the periodic timer expiring, state 4 sets port->ntt, >> then the next iteration moves back to state 3), and includes both send >> and receive of LACPDUs on port 2 (which is enp5s0f0). >> >> I don't see a transition to COLL/DIST state for port 2 in the >> log, so presumably it takes place prior to the beginning. This is the >> place that would call __enable_port. >> >> What you're describing sounds consistent with the slave not >> being set to active, which would cause bond_handle_frame -> >> bond_should_deliver_exact_match to return RX_HANDLER_EXACT. >> >> This leads me to wonder if the port in question is in this >> incorrect state from the beginning, but it only manifests once it >> becomes the only active port. >> >> Can you instrument __enable_port to see when it is called for >> the bad port, and if it actually calls bond_set_slave_active_flags for >> the port in question (without your additional patch)? > >Thanks for the pointers. Adding some extra debug in __enable_port >hel
Re: [PATCH net] bonding: fix 802.3ad aggregator reselection
Veli-Matti Lintu <veli-matti.li...@opinsys.fi> wrote: >2016-06-30 14:15 GMT+03:00 Veli-Matti Lintu <veli-matti.li...@opinsys.fi>: >> 2016-06-29 18:59 GMT+03:00 Jay Vosburgh <jay.vosbu...@canonical.com>: >>> Veli-Matti Lintu <veli-matti.li...@opinsys.fi> wrote: > >>> I tried this locally, but don't see any failure (at the end, the >>> "Switch A" agg is still active with the single port). I am starting >>> with just two ports in each aggregator (instead of three), so that may >>> be relevant. >> >> When the connection problem occurs, /proc/net/bonding/bond0 always >> shows the aggregator that has a link up active. Dumpcap sees at least >> broadcast traffic on the port, but I haven't done extensive analysis >> on that yet. All TCP connections are cut until the bond is up again >> when more ports are enabled on the switch. ping doesn't work either >> way. > >I did some further testing on this and it looks like I can get this >working by enabling the ports in the new aggregator the same way as >the ports in old aggregator are disabled in ad_agg_selection_logic(). I tested with this some as well, using 6 ports total across two switches, and was still not able to reproduce the issue. How are you configuring the bond in the first place? It may be that there is some dependency on the ordering of the slaves within the bond and how they are disabled. Also, I am taking the ports down by physically unplugging the cable from the switch. If you're doing it differently, that might be relevant. >Normally the ports seem to get enabled from ad_mux_machine() in "case >AD_MUX_COLLECTING_DISTRIBUTING", but something different happens there >as the port does get enabled, but no traffic passes through. So far I >haven't been able to figure out what happens. When the connection is >lost, dumpcap sees traffic on the only active port in the bond, but it >seems like nothing catches it. If I disable and re-enable the same >port, traffic start flowing again normally. Looking at the debug log you provided, the step that fails appears to correspond to this portion: [ 367.811419] bond0: link status definitely down for interface enp5s0f1, disabl ing it [ 367.811425] bond0: best Agg=2; P=3; a k=9; p k=57; Ind=0; Act=0 [ 367.811429] bond0: best ports 8830113f6638 slave 8830113cfe00 enp5s0f 0 [ 367.811432] bond0: Agg=1; P=3; a k=9; p k=57; Ind=0; Act=0 [ 367.811434] bond0: Agg=2; P=3; a k=9; p k=57; Ind=0; Act=0 [ 367.811437] bond0: Agg=3; P=0; a k=0; p k=0; Ind=0; Act=0 [ 367.811439] bond0: Agg=4; P=0; a k=0; p k=0; Ind=0; Act=0 [ 367.811441] bond0: Agg=5; P=0; a k=0; p k=0; Ind=0; Act=0 [ 367.811444] bond0: Agg=6; P=0; a k=0; p k=0; Ind=0; Act=0 [ 367.811446] bond0: LAG 2 chosen as the active LAG [ 367.811448] bond0: Agg=2; P=3; a k=9; p k=57; Ind=0; Act=1 [ 367.811451] bond0: Port 1 changed link status to DOWN [ 367.811455] bond0: first active interface up! [ 367.811461] Rx Machine: Port=1 (enp5s0f1), Last State=6, Curr State=2 [ 367.811495] ixgbe :05:00.0 enp5s0f0: event: 1b [ 367.811497] ixgbe :05:00.0 enp5s0f0: IFF_SLAVE [ 367.811519] ixgbe :05:00.1 enp5s0f1: event: 19 [ 367.811522] ixgbe :05:00.1 enp5s0f1: IFF_SLAVE [ 367.811525] ixgbe :05:00.1 enp5s0f1: event: 19 [ 367.811528] ixgbe :05:00.1 enp5s0f1: IFF_SLAVE [ 386.809542] Periodic Machine: Port=2, Last State=3, Curr State=4 [ 386.909543] Periodic Machine: Port=2, Last State=4, Curr State=3 [ 387.009530] update lacpdu: enp5s0f0, actor port state 3d [ 387.009541] Sent LACPDU on port 2 [ 387.571372] bond0: Received LACPDU on port 2 slave enp5s0f0 [ 387.571379] Rx Machine: Port=2 (enp5s0f0), Last State=6, Curr State=6 [ 387.571381] enp5s0f0 partner sync=1 [ 416.810767] Periodic Machine: Port=2, Last State=3, Curr State=4 [ 416.910786] Periodic Machine: Port=2, Last State=4, Curr State=3 [ 417.010749] update lacpdu: enp5s0f0, actor port state 3d [ 417.010761] Sent LACPDU on port 2 [ 417.569137] bond0: Received LACPDU on port 2 slave enp5s0f0 [ 417.569156] Rx Machine: Port=2 (enp5s0f0), Last State=6, Curr State=6 [... repeats this cycle for a while ] [ 537.614050] enp5s0f0 partner sync=1 [ 566.816851] Periodic Machine: Port=2, Last State=3, Curr State=4 [ 566.916843] Periodic Machine: Port=2, Last State=4, Curr State=3 [ 567.016829] update lacpdu: enp5s0f0, actor port state 3d [ 567.016839] Sent LACPDU on port 2 [ 567.558379] bond0: Received LACPDU on port 2 slave enp5s0f0 [ 567.558399] Rx Machine: Port=2 (enp5s0f0), Last State=6, Curr State=6 [ 567.558403] enp5s0f0 partner sync=1 [ 572.925434] igb :81:00.0 ens5f0: igb: ens5f0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX [ 572.925862] igb :81:00.0 ens5f0: event: 4 [ 572.925865] igb :81:00.0 ens5f0: IFF_SLAVE [ 572.925
Re: [PATCH net] bonding: fix 802.3ad aggregator reselection
Veli-Matti Lintu <veli-matti.li...@opinsys.fi> wrote: [...] >I have understood that only miimon is available with 802.3ad: > >bonding.txt: > >802.3ad: >... >Finally, the 802.3ad mode mandates the use of the MII monitor, >therefore, the ARP monitor is not available in this mode. > >Is there a way to enable ARP monitor somehow? No, there isn't. ARP monitor is not available in 802.3ad mode. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH net] bonding: fix 802.3ad aggregator reselection
Veli-Matti Lintu <veli-matti.li...@opinsys.fi> wrote: [...] >Thanks for the patch. I have been now testing it and the reselection >seems to be working now in most cases, but I hit one case that seems >to consistently fail in my test environment. > >I've been doing most of testing with ad_select=count and this happens >with it. I haven't yet done extensive testing with >ad_select=stable/bandwidth. > >The sequence to trigger the failure seems to be: > > Switch A (Agg ID 2) Switch B (Agg ID 1) >enp5s0f0 ens5f0 ens6f0enp5s0f1 ens5f1 ens6f1 >X X - X - - Connection works >(Agg ID 2 active) >X - - X - - Connection works >(Agg ID 1 active) >X - - - - - No connection (Agg >ID 2 active) I tried this locally, but don't see any failure (at the end, the "Switch A" agg is still active with the single port). I am starting with just two ports in each aggregator (instead of three), so that may be relevant. Can you enable dynamic debug for bonding and run your test again, and then send me the debug output (this will appear in the kernel log, e.g., from dmesg)? You can enable this via # echo 'module bonding =p' > /sys/kernel/debug/dynamic_debug/control before running the test. The contents of /proc/net/bonding/bond0 (read as root, otherwise the LACP internal state isn't included) from each step would also be helpful. The output will likely be large, so I'd suggest sending it to me directly off-list if it's too big. >I'm also wondering why link down event causes change of aggregator >when the active aggregator has the same number of active links than >the new aggregator. This shouldn't happen. If the active aggregator is just as good as some other aggregator choice, it should stay with the current active. I suspect that both of these are edge cases arising from the aggregators now including link down ports, which previously never happened. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
[PATCH net] bonding: fix 802.3ad aggregator reselection
Since commit 7bb11dc9f59d ("bonding: unify all places where actor-oper key needs to be updated."), the logic in bonding to handle selection between multiple aggregators has not functioned. This affects only configurations wherein the bonding slaves connect to two discrete aggregators (e.g., two independent switches, each with LACP enabled), thus creating two separate aggregation groups within a single bond. The cause is a change in 7bb11dc9f59d to no longer set AD_PORT_BEGIN on a port after a link state change, which would cause the port to be reselected for attachment to an aggregator as if were newly added to the bond. We cannot restore the prior behavior, as it contradicts IEEE 802.1AX 5.4.12, which requires ports that "become inoperable" (lose carrier, setting port_enabled=false as per 802.1AX 5.4.7) to remain selected (i.e., assigned to the aggregator). As the port now remains selected, the aggregator selection logic is not invoked. A side effect of this change is that aggregators in bonding will now contain ports that are link down. The aggregator selection logic does not currently handle this situation correctly, causing incorrect aggregator selection. This patch makes two changes to repair the aggregator selection logic in bonding to function as documented and within the confines of the standard: First, the aggregator selection and related logic now utilizes the number of active ports per aggregator, not the number of selected ports (as some selected ports may be down). The ad_select "bandwidth" and "count" options only consider ports that are link up. Second, on any carrier state change of any slave, the aggregator selection logic is explicitly called to insure the correct aggregator is active. Reported-by: Veli-Matti Lintu <veli-matti.li...@opinsys.fi> Fixes: 7bb11dc9f59d ("bonding: unify all places where actor-oper key needs to be updated.") Signed-off-by: Jay Vosburgh <jay.vosbu...@canonical.com> --- drivers/net/bonding/bond_3ad.c | 64 +- 1 file changed, 45 insertions(+), 19 deletions(-) diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c index b9304a295f86..ca81f46ea1aa 100644 --- a/drivers/net/bonding/bond_3ad.c +++ b/drivers/net/bonding/bond_3ad.c @@ -657,6 +657,20 @@ static void __set_agg_ports_ready(struct aggregator *aggregator, int val) } } +static int __agg_active_ports(struct aggregator *agg) +{ + struct port *port; + int active = 0; + + for (port = agg->lag_ports; port; +port = port->next_port_in_aggregator) { + if (port->is_enabled) + active++; + } + + return active; +} + /** * __get_agg_bandwidth - get the total bandwidth of an aggregator * @aggregator: the aggregator we're looking at @@ -664,39 +678,40 @@ static void __set_agg_ports_ready(struct aggregator *aggregator, int val) */ static u32 __get_agg_bandwidth(struct aggregator *aggregator) { + int nports = __agg_active_ports(aggregator); u32 bandwidth = 0; - if (aggregator->num_of_ports) { + if (nports) { switch (__get_link_speed(aggregator->lag_ports)) { case AD_LINK_SPEED_1MBPS: - bandwidth = aggregator->num_of_ports; + bandwidth = nports; break; case AD_LINK_SPEED_10MBPS: - bandwidth = aggregator->num_of_ports * 10; + bandwidth = nports * 10; break; case AD_LINK_SPEED_100MBPS: - bandwidth = aggregator->num_of_ports * 100; + bandwidth = nports * 100; break; case AD_LINK_SPEED_1000MBPS: - bandwidth = aggregator->num_of_ports * 1000; + bandwidth = nports * 1000; break; case AD_LINK_SPEED_2500MBPS: - bandwidth = aggregator->num_of_ports * 2500; + bandwidth = nports * 2500; break; case AD_LINK_SPEED_1MBPS: - bandwidth = aggregator->num_of_ports * 1; + bandwidth = nports * 1; break; case AD_LINK_SPEED_2MBPS: - bandwidth = aggregator->num_of_ports * 2; + bandwidth = nports * 2; break; case AD_LINK_SPEED_4MBPS: - bandwidth = aggregator->num_of_ports * 4; + bandwidth = nports * 4; break; case AD_LINK_SPEED_56000MBPS: -
Re: 802.3ad bonding aggregator reselection
;num_of_ports > best->num_of_ports) + if (__agg_active_ports(curr) > __agg_active_ports(best)) return curr; - if (curr->num_of_ports < best->num_of_ports) + if (__agg_active_ports(curr) < __agg_active_ports(best)) return best; /*FALLTHROUGH*/ @@ -1561,8 +1576,14 @@ static int agg_device_up(const struct aggregator *agg) if (!port) return 0; - return netif_running(port->slave->dev) && - netif_carrier_ok(port->slave->dev); + for (port = agg->lag_ports; port; +port = port->next_port_in_aggregator) { + if (netif_running(port->slave->dev) && + netif_carrier_ok(port->slave->dev)) + return 1; + } + + return 0; } /** @@ -1610,7 +1631,7 @@ static void ad_agg_selection_logic(struct aggregator *agg, agg->is_active = 0; - if (agg->num_of_ports && agg_device_up(agg)) + if (__agg_active_ports(agg) && agg_device_up(agg)) best = ad_agg_selection_test(best, agg); } @@ -1622,7 +1643,7 @@ static void ad_agg_selection_logic(struct aggregator *agg, * answering partner. */ if (active && active->lag_ports && - active->lag_ports->is_enabled && + __agg_active_ports(active) && (__agg_has_partner(active) || (!__agg_has_partner(active) && !__agg_has_partner(best { @@ -2432,7 +2453,9 @@ void bond_3ad_adapter_speed_duplex_changed(struct slave *slave) */ void bond_3ad_handle_link_change(struct slave *slave, char link) { + struct aggregator *agg; struct port *port; + bool dummy; port = &(SLAVE_AD_INFO(slave)->port); @@ -2459,6 +2482,9 @@ void bond_3ad_handle_link_change(struct slave *slave, char link) port->is_enabled = false; ad_update_actor_keys(port, true); } + agg = __get_first_agg(port); + ad_agg_selection_logic(agg, ); + netdev_dbg(slave->bond->dev, "Port %d changed link status to %s\n", port->actor_port_number, link == BOND_LINK_UP ? "UP" : "DOWN"); @@ -2499,7 +2525,7 @@ int bond_3ad_set_carrier(struct bonding *bond) active = __get_active_agg(&(SLAVE_AD_INFO(first_slave)->aggregator)); if (active) { /* are enough slaves available to consider link up? */ - if (active->num_of_ports < bond->params.min_links) { + if (__agg_active_ports(active) < bond->params.min_links) { if (netif_carrier_ok(bond->dev)) { netif_carrier_off(bond->dev); goto out; -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: 802.3ad bonding aggregator reselection
if (__get_agg_selection_mode(port) != BOND_AD_STABLE) + port->port->sm_vars &= ~AD_PORT_SELECTED; } netdev_dbg(slave->bond->dev, "Port %d changed link status to %s\n", port->actor_port_number, I'll test this locally and will submit a formal patch with an update to bonding.txt tomorrow (if it works). -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: 802.3ad bonding aggregator reselection
Veli-Matti Lintu <veli-matti.li...@opinsys.fi> wrote: >2016-06-20 17:11 GMT+03:00 zhuyj <zyjzyj2...@gmail.com>: >> 5. Switch Configuration >> === >> >> For this section, "switch" refers to whatever system the >> bonded devices are directly connected to (i.e., where the other end of >> the cable plugs into). This may be an actual dedicated switch device, >> or it may be another regular system (e.g., another computer running >> Linux), >> >> The active-backup, balance-tlb and balance-alb modes do not >> require any specific configuration of the switch. >> >> The 802.3ad mode requires that the switch have the appropriate >> ports configured as an 802.3ad aggregation. The precise method used >> to configure this varies from switch to switch, but, for example, a >> Cisco 3550 series switch requires that the appropriate ports first be >> grouped together in a single etherchannel instance, then that >> etherchannel is set to mode "lacp" to enable 802.3ad (instead of >> standard EtherChannel). > >The ports are configured in switch settings (HP Procurve 2530-48G) in >same trunk group (TrkX) and trunk group type is set as LACP. >/proc/net/bonding/bond0 also shows that the three ports belong to same >aggregator and bandwidth tests also support this. In my understanding >Procurve's trunk group is pretty much the same as etherchannel in >Cisco's terminology. The bonded link comes always up properly, but >handling of links going down is the problem. Are there known >differences between different vendors there? I did the original LACP reselection testing on a Cisco switch, but I have an HP 2530 now; I'll test it later today or tomorrow and see if it behaves properly, and whether your proposed patch is needed. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH v2 net] bonding: don't use stale speed and duplex information
zhuyj <zyjzyj2...@gmail.com> wrote: >On 02/25/2016 09:33 PM, Jay Vosburgh wrote: >> zhuyj <zyjzyj2...@gmail.com> wrote: >> [...] >>> I delved into the source code and Emil's tests. I think that the problem >>> that this patch expects to fix occurs very unusually. >>> >>> Do you agree with me? >>> >>> If so, maybe the following patch can reduce the performance loss. >>> Please comment on it. Thanks a lot. >>> >>> >>> diff --git a/drivers/net/bonding/bond_main.c >>> b/drivers/net/bonding/bond_main.c >>> index b7f1a99..c4c511a 100644 >>> --- a/drivers/net/bonding/bond_main.c >>> +++ b/drivers/net/bonding/bond_main.c >>> @@ -2129,7 +2129,9 @@ static void bond_miimon_commit(struct bonding *bond) >>> continue; >>> >>> case BOND_LINK_UP: >>> - bond_update_speed_duplex(slave); >>> + if (slave->speed == SPEED_UNKNOWN) >>> + bond_update_speed_duplex(slave); >>> + >>> bond_set_slave_link_state(slave, BOND_LINK_UP, >>> BOND_SLAVE_NOTIFY_NOW); >>> slave->last_link_up = jiffies; >> I don't believe the speed is necessarily SPEED_UNKNOWN coming in >> here. If the race occurs at a time later than the initial enslavement, >> speed may already be set (and the race manifests if the new speed >> changes, i.e., the link changes from 1 Gb/sec to 10 Gb/sec), so I don't >> think this is functionally correct. >Hi, Jay > >Thanks for your reply. > >IMHO, "If the race occurs at a time later than the initial enslavement, >speed may already be set (and the race manifests if the new speed >changes, i.e., the link changes from 1 Gb/sec to 10 Gb/sec)", from my test, >this will not happen because the previous source code make the speed >correct. How, exactly, will "the previous source code make the speed correct"? >This "bond_update_speed_duplex" repeats to get the correct speed. > >That is, this patch is to fix the error in initial enslavement. The >mentioned scenario will not occur. I see nothing in the code that limits the race to happening only at enslavement time. If the bond_mii_monitor call executes between the device going link up and the arrival of the NETDEV_CHANGE or NETDEV_UP callback, the stored speed and duplex are stale. The stale speed value is not guaranteed to be SPEED_UNKNOWN, so your patch is not functionally correct. -J >Even though the performance impact is minimal, if we can avoid this >performance >impact, why not ? > >Best Regards! >Zhu Yanjun > >> >> Also, the call to bond_miimon_commit itself is already gated by >> bond_miimon_inspect finding a link state change. The performance impact >> here should be minimal. >> >> -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH v2 net] bonding: don't use stale speed and duplex information
zhuyj <zyjzyj2...@gmail.com> wrote: [...] >I delved into the source code and Emil's tests. I think that the problem >that this patch expects to fix occurs very unusually. > >Do you agree with me? > >If so, maybe the following patch can reduce the performance loss. >Please comment on it. Thanks a lot. > > >diff --git a/drivers/net/bonding/bond_main.c >b/drivers/net/bonding/bond_main.c >index b7f1a99..c4c511a 100644 >--- a/drivers/net/bonding/bond_main.c >+++ b/drivers/net/bonding/bond_main.c >@@ -2129,7 +2129,9 @@ static void bond_miimon_commit(struct bonding *bond) >continue; > >case BOND_LINK_UP: >- bond_update_speed_duplex(slave); >+ if (slave->speed == SPEED_UNKNOWN) >+ bond_update_speed_duplex(slave); >+ >bond_set_slave_link_state(slave, BOND_LINK_UP, >BOND_SLAVE_NOTIFY_NOW); >slave->last_link_up = jiffies; I don't believe the speed is necessarily SPEED_UNKNOWN coming in here. If the race occurs at a time later than the initial enslavement, speed may already be set (and the race manifests if the new speed changes, i.e., the link changes from 1 Gb/sec to 10 Gb/sec), so I don't think this is functionally correct. Also, the call to bond_miimon_commit itself is already gated by bond_miimon_inspect finding a link state change. The performance impact here should be minimal. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH v2 net] bonding: don't use stale speed and duplex information
David Miller <da...@davemloft.net> wrote: [...] >> This was done historically in bonding, but the call to >> bond_update_speed_duplex was removed in commit 876254ae2758 ("bonding: >> don't call update_speed_duplex() under spinlocks"), as it might sleep >> under lock. Later, the locking was changed to only hold RTNL, and so >> after commit 876254ae2758 ("bonding: don't call update_speed_duplex() >> under spinlocks") this call is again safe. >> >> Tested-by: "Tantilov, Emil S" <emil.s.tanti...@intel.com> >> Cc: Veaceslav Falico <vfal...@gmail.com> >> Cc: dingtianhong <dingtianh...@huawei.com> >> Fixes: 876254ae2758 ("bonding: don't call update_speed_duplex() under >> spinlocks") >> Signed-off-by: Jay Vosburgh <jay.vosbu...@canonical.com> > >Applied, thanks Jay. Rereading the above, I just noticed that I put the wrong commit into the fixes tag (and the "Later, the locking was changed" text); the correct fixes tag should be: Fixes: 4cb4f97b7e36 ("bonding: rebuild the lock use for bond_mii_monitor()") Kernels between 876254ae2758 and 4cb4f97b7e36 should not have this patch applied, as it might sleep under lock. Sorry for the error, -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: Bonding MAC failover behavior with VLAN interfaces
John Valko <jova...@cisco.com> wrote: >Hi all, > >I have observed the following on CentOS 6 using the >2.6.32-573.8.1.el6.x86_64 kernel as well as Ubuntu 15.10 with >4.2.0-25-generic.. I have not yet tried with a vanilla kernel but with the >large time/distro gap it didn't seem likely to be caused by distro >changes. [ ...example of VLAN-over-bond's MAC not changing when bonding fail_over_mac=active has a failover... ] >So, herein lies my confusion. I expect that the VLAN interfaces should >also pick up the new MAC address, but they don't. It seems like a bug to >me, but I don't want to be presumptuous so if in fact this is expected >behavior, how do you recommend we approach making it switch when the bond >fails over? Right now, the MAC must be manually set for each vlan >interface. [...] >I can see it set the mac of the bond to the new active slave MAC, but I >don't see any indication of looping over vlan interfaces or anything, if >that is expected... or would it be expected that the 8021q module receives >an event that should make it change? Your observation is correct: the fail_over_mac functionality does not propagate beyond the bonding master device itself. The presumption at the VLAN level is that the underlying device can support multiple MAC addresses; this same effect (different MACs between VLAN and its lower device) occurs if the device logically below a VLAN interface changes its MAC. This is handled by vlan_sync_address, which is called via the NETDEV_CHANGEADDR notifier (which happens when the bonding master changes its MAC due to f_o_m=active). The original need for f_o_m=active was IPoIB devices that cannot change their MAC address. Those don't support VLANs, either, so propagation to VLANs was not a consideration. In your "real" SR-IOV sitation, is there something that necessitates the use of fail_over_mac=active (I don't recall that SR-IOV itself prohibits a VF from changing its MAC address), or is this being done for other reasons? -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
[PATCH v2 net] bonding: don't use stale speed and duplex information
There is presently a race condition between the bonding periodic link monitor and the updating of a slave's speed and duplex. The former occurs on a periodic basis, and the latter in response to a driver's calling of netif_carrier_on. It is possible for the periodic monitor to run between the driver call of netif_carrier_on and the receipt of the NETDEV_CHANGE event that causes bonding to update the slave's speed and duplex. This manifests most notably as a report that a slave is up and "0 Mbps full duplex" after enslavement, but in principle could report an incorrect speed and duplex after any link up event if the device comes up with a different speed or duplex. This affects the 802.3ad aggregator selection, as the speed and duplex are selection criteria. This is fixed by updating the speed and duplex in the periodic monitor, prior to using that information. This was done historically in bonding, but the call to bond_update_speed_duplex was removed in commit 876254ae2758 ("bonding: don't call update_speed_duplex() under spinlocks"), as it might sleep under lock. Later, the locking was changed to only hold RTNL, and so after commit 876254ae2758 ("bonding: don't call update_speed_duplex() under spinlocks") this call is again safe. Tested-by: "Tantilov, Emil S" <emil.s.tanti...@intel.com> Cc: Veaceslav Falico <vfal...@gmail.com> Cc: dingtianhong <dingtianh...@huawei.com> Fixes: 876254ae2758 ("bonding: don't call update_speed_duplex() under spinlocks") Signed-off-by: Jay Vosburgh <jay.vosbu...@canonical.com> --- v2: Correct Veaceslav's email address Note: The "Fixes" commit is the commit that makes this operation safe again, not the commit that originally introduced the race. I don't see any simple way to resolve this bug between these two commits. drivers/net/bonding/bond_main.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 56b560558884..cabaeb61333d 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -2127,6 +2127,7 @@ static void bond_miimon_commit(struct bonding *bond) continue; case BOND_LINK_UP: + bond_update_speed_duplex(slave); bond_set_slave_link_state(slave, BOND_LINK_UP, BOND_SLAVE_NOTIFY_NOW); slave->last_link_up = jiffies; -- 1.9.1
[PATCH net] bonding: don't use stale speed and duplex information
There is presently a race condition between the bonding periodic link monitor and the updating of a slave's speed and duplex. The former occurs on a periodic basis, and the latter in response to a driver's calling of netif_carrier_on. It is possible for the periodic monitor to run between the driver call of netif_carrier_on and the receipt of the NETDEV_CHANGE event that causes bonding to update the slave's speed and duplex. This manifests most notably as a report that a slave is up and "0 Mbps full duplex" after enslavement, but in principle could report an incorrect speed and duplex after any link up event if the device comes up with a different speed or duplex. This affects the 802.3ad aggregator selection, as the speed and duplex are selection criteria. This is fixed by updating the speed and duplex in the periodic monitor, prior to using that information. This was done historically in bonding, but the call to bond_update_speed_duplex was removed in commit 876254ae2758 ("bonding: don't call update_speed_duplex() under spinlocks"), as it might sleep under lock. Later, the locking was changed to only hold RTNL, and so after commit 876254ae2758 ("bonding: don't call update_speed_duplex() under spinlocks") this call is again safe. Tested-by: "Tantilov, Emil S" <emil.s.tanti...@intel.com> Cc: Veaceslav Falico <vfal...@redhat.com> Cc: dingtianhong <dingtianh...@huawei.com> Fixes: 876254ae2758 ("bonding: don't call update_speed_duplex() under spinlocks") Signed-off-by: Jay Vosburgh <jay.vosbu...@canonical.com> --- Note: The "Fixes" commit is the commit that makes this operation safe again, not the commit that originally introduced the race. I don't see any simple way to resolve this bug between these two commits. drivers/net/bonding/bond_main.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 56b560558884..cabaeb61333d 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -2127,6 +2127,7 @@ static void bond_miimon_commit(struct bonding *bond) continue; case BOND_LINK_UP: + bond_update_speed_duplex(slave); bond_set_slave_link_state(slave, BOND_LINK_UP, BOND_SLAVE_NOTIFY_NOW); slave->last_link_up = jiffies; -- 1.9.1
Re: [PATCH net v2] bonding: Fix ARP monitor validation
David Miller <da...@davemloft.net> wrote: [...] >> This patch changes the logic in bond_arp_rcv to additionally >> accept an ARP reply for validation on any slave if there is a current ARP >> slave and it sent an ARP probe during the previous arp_interval. >> >> Fixes: aeea64ac717a ("bonding: don't trust arp requests unless active slave >> really works") >> Cc: Veaceslav Falico <vfal...@gmail.com> >> Cc: Andy Gospodarek <go...@cumulusnetworks.com> >> Signed-off-by: Jay Vosburgh <jay.vosbu...@canonical.com> > >I'm going to wait until we get some feedback from Uwe if you don't mind >Jay, it would be nice to know if it solves their problem too. Sounds good to me. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: bonding reports interface up with 0 Mbps
Jay Vosburgh <jay.vosbu...@canonical.com> wrote: [...] > Thinking about the trace again... Emil: what happens in the >trace before this? Is there ever a call to the ixgbe_get_settings? >Does a NETDEV_UP or NETDEV_CHANGE event ever hit the bond_netdev_event >function? Emil kindly sent me the trace offline, and I think I see what's going on. It looks like the sequence of events is: bond_enslave -> bond_update_speed_duplex (device is down, thus DUPLEX/SPEED_UNKNOWN) [ do rest of enslavement, start miimon periodic work ] [ time passes, device goes carrier up ] ixgbe_service_task: eth1: NIC Link is Up 10 Gbps -> netif_carrier_on (arranges for NETDEV_CHANGE notifier out of line) [ a few microseconds later ] bond_mii_monitor -> bond_check_dev_link (now is carrier up) bond_miimon_commit -> (emits "0 Mbps full duplex" message) bond_lower_state_changed -> bond_netdev_event (NETDEV_CHANGELOWERSTATE, is ignored) bond_3ad_handle_link_change (sees DUPLEX/SPEED_UNKNOWN) [ a few microseconds later, in response to ixgbe's netif_carrier_on ] notifier_call_chain -> bond_netdev_event NETDEV_CHANGE -> bond_update_speed_duplex (sees correct SPEED_1/FULL) -> bond_3ad_adapter_speed_duplex_changed (updates 802.3ad) Basically, the race is that the periodic bond_mii_monitor is squeezing in between the link going up and bonding's update of the speed and duplex in response to the NETDEV_CHANGE triggered by the driver's netif_carrier_on call. bonding ends up using the stale duplex and speed information obtained at enslavement time. I think that, nowadays, the initial speed and duplex will pretty much always be UNKNOWN, at least for real Ethernet devices, because it will take longer to autoneg than the time between the dev_open and bond_update_speed_duplex calls in bond_enslave. Adding a case to bond_netdev_event for CHANGELOWERSTATE works because it's a synchronous call from bonding. For purposes of fixing this, it's more or less equivalent to calling bond_update_speed_duplex from bond_miimon_commit (which is part of a test patch I posted earlier today). If the above analysis is correct, then I would expect this patch to make the problem go away: diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 56b560558884..cabaeb61333d 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -2127,6 +2127,7 @@ static void bond_miimon_commit(struct bonding *bond) continue; case BOND_LINK_UP: + bond_update_speed_duplex(slave); bond_set_slave_link_state(slave, BOND_LINK_UP, BOND_SLAVE_NOTIFY_NOW); slave->last_link_up = jiffies; Emil, can you give just the above a test? I don't see in the trace that there's evidence that ixgbe's link is rapidly flapping, so I don't think it's necessary to do more than the above. Now, separately, bonding really should obey the NETDEV_CHANGE / NETDEV_UP events instead of polling for carrier state, but if the above patch works it's a simple fix that is easily backported, which the CHANGELOWERSTATE method isn't, and the new way (notifier driven) can be net-next material. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: bonding reports interface up with 0 Mbps
Tantilov, Emil S <emil.s.tanti...@intel.com> wrote: [...] >Sure, I'll give this a try, but I'm not sure this check applies in this case >as you can see from the trace link is up and carrier is on. From code inspection, I see another possible race, although I'm not sure if it's relevant for this case. During enslavement, the speed and duplex are initially set, and then later the link state is checked, but if link is up, the code presumes the speed and duplex are valid, which they may not be. I think this patch would narrow (but not totally eliminate) this race: diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 56b560558884..b8b8a24f92d1 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -1591,6 +1591,7 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev) /* check for initial state */ if (bond->params.miimon) { if (bond_check_dev_link(bond, slave_dev, 0) == BMSR_LSTATUS) { + bond_update_speed_duplex(new_slave); if (bond->params.updelay) { bond_set_slave_link_state(new_slave, BOND_LINK_BACK, I'm not sure it's going to be really possible to completely close all of these races, as the device can change its link state (and thus what it reports for speed and duplex) asynchronously. Even in a NETDEV_UP or NETDEV_CHANGE notifier callback, the link could go down between the time of the netif_carrier_on call that triggers the notifier and when the callback runs. But, if the link is really flapping here, bonding should get a notifier for each flap (up or down). Once it settles down with carrier up then the speed and duplex should be valid. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: bonding reports interface up with 0 Mbps
Tantilov, Emil S <emil.s.tanti...@intel.com> wrote: >We are seeing an occasional issue where the bonding driver may report >interface up with 0 Mbps: >bond0: link status definitely up for interface eth0, 0 Mbps full duplex > >So far in all the failed traces I have collected this happens on >NETDEV_CHANGELOWERSTATE event: > ><...>-20533 [000] 81811.041241: ixgbe_service_task: eth1: NIC Link is Up >10 Gbps, Flow Control: RX/TX ><...>-20533 [000] 81811.041257: ixgbe_check_vf_rate_limit ><-ixgbe_service_task ><...>-20533 [000] 81811.041272: ixgbe_ping_all_vfs <-ixgbe_service_task >kworker/u48:0-7503 [010] 81811.041345: ixgbe_get_stats64 <-dev_get_stats >kworker/u48:0-7503 [010] 81811.041393: bond_netdev_event: eth1: event: 1b >kworker/u48:0-7503 [010] 81811.041394: bond_netdev_event: eth1: IFF_SLAVE >kworker/u48:0-7503 [010] 81811.041395: bond_netdev_event: eth1: >slave->speed = ><...>-20533 [000] 81811.041407: ixgbe_ptp_overflow_check ><-ixgbe_service_task >kworker/u48:0-7503 [010] 81811.041407: bond_mii_monitor: bond0: link >status definitely up for interface eth1, 0 Mbps full duplex Thinking about the trace again... Emil: what happens in the trace before this? Is there ever a call to the ixgbe_get_settings? Does a NETDEV_UP or NETDEV_CHANGE event ever hit the bond_netdev_event function? Could you describe your test that reproduces this? I'd like to see if I can set it up locally. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH net v2] bonding: Fix ARP monitor validation
Jarod Wilson <ja...@redhat.com> wrote: [...] >This sounds suspiciously like the same problem Uwe was encountering[*] and >attempting to solve. Uwe, can you give this patch a try? > >[*] = http://marc.info/?l=linux-netdev=144416122705850=2 Agreed; my apologies for not Cc'ing you, Uwe. FWIW, the "Sometimes, the issue sorts itself out after a while" Uwe reported is likely due to something causing a switch flood (TCN or clearing of the MAC table); in that case, the ARP reply is sent to all ports, so the "expected" curr_arp_slave receives the ARP reply. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: bonding reports interface up with 0 Mbps
Tantilov, Emil S <emil.s.tanti...@intel.com> wrote: >We are seeing an occasional issue where the bonding driver may report >interface up with 0 Mbps: >bond0: link status definitely up for interface eth0, 0 Mbps full duplex > >So far in all the failed traces I have collected this happens on >NETDEV_CHANGELOWERSTATE event: > ><...>-20533 [000] 81811.041241: ixgbe_service_task: eth1: NIC Link is Up >10 Gbps, Flow Control: RX/TX ><...>-20533 [000] 81811.041257: ixgbe_check_vf_rate_limit ><-ixgbe_service_task ><...>-20533 [000] 81811.041272: ixgbe_ping_all_vfs <-ixgbe_service_task >kworker/u48:0-7503 [010] 81811.041345: ixgbe_get_stats64 <-dev_get_stats >kworker/u48:0-7503 [010] 81811.041393: bond_netdev_event: eth1: event: 1b >kworker/u48:0-7503 [010] 81811.041394: bond_netdev_event: eth1: IFF_SLAVE >kworker/u48:0-7503 [010] 81811.041395: bond_netdev_event: eth1: >slave->speed = ><...>-20533 [000] 81811.041407: ixgbe_ptp_overflow_check ><-ixgbe_service_task >kworker/u48:0-7503 [010] 81811.041407: bond_mii_monitor: bond0: link >status definitely up for interface eth1, 0 Mbps full duplex From looking at the code that prints this, the "full" duplex is probably actually DUPLEX_UNKNOWN, but the netdev_info uses the expression slave->duplex ? "full" : "half", so DUPLEX_UNKNOWN at 0xff would print "full." This is what ixgbe_get_settings returns for speed and duplex if it is called when carrier is off. >As a proof of concept I added NETDEV_CHANGELOWERSTATE in >bond_slave_netdev_event() along with NETDEV_UP/CHANGE: > >diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c >index 56b5605..a9dac4c 100644 >--- a/drivers/net/bonding/bond_main.c >+++ b/drivers/net/bonding/bond_main.c >@@ -3014,6 +3014,7 @@ static int bond_slave_netdev_event(unsigned long event, > break; > case NETDEV_UP: > case NETDEV_CHANGE: >+ case NETDEV_CHANGELOWERSTATE: > bond_update_speed_duplex(slave); > if (BOND_MODE(bond) == BOND_MODE_8023AD) > bond_3ad_adapter_speed_duplex_changed(slave); > >With this change I have not seen 0 Mbps reported by the bonding driver (around >12 hour test up to this point >vs. 2-3 hours otherwise). Although I suppose it could also be some sort of >race/timing issue with bond_mii_monitor(). This change as a fix seems kind of odd, since CHANGELOWERSTATE is generated by bonding itself. Perhaps the net effect is to add a delay and then update the speed and duplex, masking the actual problem. Emil, if I recall correctly, the test patch I send that uses the notifiers directly instead of miimon (specify miimon=0 and have bonding respond to the notifiers) handled everything properly, right? If so I can split that up and submit it properly; it seems more like a feature than a straightforward bug fix, so I'm not sure it's appropriate for net. As a possibly less complex alternative for the miimon > 0 case, could you try the following: diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 56b560558884..ac8921e65f26 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -2120,6 +2120,7 @@ static void bond_miimon_commit(struct bonding *bond) { struct list_head *iter; struct slave *slave, *primary; + int link_state; bond_for_each_slave(bond, slave, iter) { switch (slave->new_link) { @@ -2127,6 +2128,10 @@ static void bond_miimon_commit(struct bonding *bond) continue; case BOND_LINK_UP: + link_state = bond_check_dev_link(bond, slave->dev, 0); + if (!link_state) + continue; + bond_update_speed_duplex(slave); bond_set_slave_link_state(slave, BOND_LINK_UP, BOND_SLAVE_NOTIFY_NOW); slave->last_link_up = jiffies; This will make bonding recheck the link state and update the speed and duplex after it acquires RTNL to commit a link change. This probably still has a race, since the change of carrier state in the device is not mutexed by anything bonding can acquire (so it can always change as soon as it's checked). Thanks, -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH net-next] bonding: 3ad: apply ad_actor settings changes immediately
Nikolay Aleksandrov <niko...@cumulusnetworks.com> wrote: >On 02/03/2016 08:05 PM, Jay Vosburgh wrote: >> Nikolay Aleksandrov <ra...@blackwall.org> wrote: >> >>> From: Nikolay Aleksandrov <niko...@cumulusnetworks.com> >>> >>> Currently the bonding allows to set ad_actor_system and prio while the >>> bond device is down, but these are actually applied only if there aren't >>> any slaves yet (applied to bond device when first slave shows up, and to >>> slaves at 3ad bind time). After this patch changes are applied immediately >>> and the new values can be used/seen after the bond's upped so it's not >>> necessary anymore to release all and enslave again to see the changes. >>> >>> CC: Jay Vosburgh <j.vosbu...@gmail.com> >>> CC: Veaceslav Falico <vfal...@gmail.com> >>> CC: Andy Gospodarek <go...@cumulusnetworks.com> >>> Signed-off-by: Nikolay Aleksandrov <niko...@cumulusnetworks.com> >> >> Looks good to me. >> >> Signed-off-by: Jay Vosburgh <jay.vosbu...@canonical.com> >> >> -J >> > >Thanks Jay. Do you think it makes sense to allow setting these while the >bond is up ? I don't see any serious problems. Actually, I was thinking about that when I read the patch. Changing these while live will trigger reselection of the aggregator(s), since selection information in the LACPDUs will change. It should just work itself out when the parter system sees the change. E.g., if bonding received a LACPDU with a changed _prio or _system, __update_selected will clear AD_PORT_SELECTED if it doesn't match, and that will cascade through the state machine (in ad_mux_machine, COLLECTING_DISTRIBUTING state will revert to MUX_ATTACHED, which will clear COLL_DIST and go from there). I think it may converge more quickly if bonding set ->ntt on the ports after the change to immediately send new LACPDUs (rather than waiting for the LACP timeout), but that's an optimization. If the _prio or _system are set to the same values they had previously, this is harmless as the partner system shouldn't reselect if nothing in the LACPDU changed. There's a similar optimization in bond_3ad_unbind_slave to send a LACPDU with AGGREGATION cleared to speed up the LACP processing on the partner system. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH net-next] bonding: 3ad: apply ad_actor settings changes immediately
Nikolay Aleksandrov <ra...@blackwall.org> wrote: >From: Nikolay Aleksandrov <niko...@cumulusnetworks.com> > >Currently the bonding allows to set ad_actor_system and prio while the >bond device is down, but these are actually applied only if there aren't >any slaves yet (applied to bond device when first slave shows up, and to >slaves at 3ad bind time). After this patch changes are applied immediately >and the new values can be used/seen after the bond's upped so it's not >necessary anymore to release all and enslave again to see the changes. > >CC: Jay Vosburgh <j.vosbu...@gmail.com> >CC: Veaceslav Falico <vfal...@gmail.com> >CC: Andy Gospodarek <go...@cumulusnetworks.com> >Signed-off-by: Nikolay Aleksandrov <niko...@cumulusnetworks.com> Looks good to me. Signed-off-by: Jay Vosburgh <jay.vosbu...@canonical.com> -J >--- > drivers/net/bonding/bond_3ad.c | 40 +++--- > drivers/net/bonding/bond_options.c | 4 > include/net/bond_3ad.h | 1 + > 3 files changed, 42 insertions(+), 3 deletions(-) > >diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c >index 4cbb8b27a891..ee94056dbb2e 100644 >--- a/drivers/net/bonding/bond_3ad.c >+++ b/drivers/net/bonding/bond_3ad.c >@@ -357,6 +357,14 @@ static u8 __get_duplex(struct port *port) > return retval; > } > >+static void __ad_actor_update_port(struct port *port) >+{ >+ const struct bonding *bond = bond_get_bond_by_slave(port->slave); >+ >+ port->actor_system = BOND_AD_INFO(bond).system.sys_mac_addr; >+ port->actor_system_priority = BOND_AD_INFO(bond).system.sys_priority; >+} >+ > /* Conversions */ > > /** >@@ -1963,9 +1971,7 @@ void bond_3ad_bind_slave(struct slave *slave) > port->actor_admin_port_key = bond->params.ad_user_port_key << 6; > ad_update_actor_keys(port, false); > /* actor system is the bond's system */ >- port->actor_system = BOND_AD_INFO(bond).system.sys_mac_addr; >- port->actor_system_priority = >- BOND_AD_INFO(bond).system.sys_priority; >+ __ad_actor_update_port(port); > /* tx timer(to verify that no more than MAX_TX_IN_SECOND >* lacpdu's are sent in one second) >*/ >@@ -2148,6 +2154,34 @@ out: > } > > /** >+ * bond_3ad_update_ad_actor_settings - reflect change of actor settings to >ports >+ * @bond: bonding struct to work on >+ * >+ * If an ad_actor setting gets changed we need to update the individual port >+ * settings so the bond device will use the new values when it gets upped. >+ */ >+void bond_3ad_update_ad_actor_settings(struct bonding *bond) >+{ >+ struct list_head *iter; >+ struct slave *slave; >+ >+ ASSERT_RTNL(); >+ >+ BOND_AD_INFO(bond).system.sys_priority = bond->params.ad_actor_sys_prio; >+ if (is_zero_ether_addr(bond->params.ad_actor_system)) >+ BOND_AD_INFO(bond).system.sys_mac_addr = >+ *((struct mac_addr *)bond->dev->dev_addr); >+ else >+ BOND_AD_INFO(bond).system.sys_mac_addr = >+ *((struct mac_addr *)bond->params.ad_actor_system); >+ >+ spin_lock_bh(>mode_lock); >+ bond_for_each_slave(bond, slave, iter) >+ __ad_actor_update_port(&(SLAVE_AD_INFO(slave)->port)); >+ spin_unlock_bh(>mode_lock); >+} >+ >+/** > * bond_3ad_state_machine_handler - handle state machines timeout > * @bond: bonding struct to work on > * >diff --git a/drivers/net/bonding/bond_options.c >b/drivers/net/bonding/bond_options.c >index 55e93b6b6d21..ed0bdae64f5e 100644 >--- a/drivers/net/bonding/bond_options.c >+++ b/drivers/net/bonding/bond_options.c >@@ -1392,6 +1392,8 @@ static int bond_option_ad_actor_sys_prio_set(struct >bonding *bond, > newval->value); > > bond->params.ad_actor_sys_prio = newval->value; >+ bond_3ad_update_ad_actor_settings(bond); >+ > return 0; > } > >@@ -1418,6 +1420,8 @@ static int bond_option_ad_actor_system_set(struct >bonding *bond, > > netdev_info(bond->dev, "Setting ad_actor_system to %pM\n", mac); > ether_addr_copy(bond->params.ad_actor_system, mac); >+ bond_3ad_update_ad_actor_settings(bond); >+ > return 0; > > err: >diff --git a/include/net/bond_3ad.h b/include/net/bond_3ad.h >index f1fbc3b11962..f358ad5e4214 100644 >--- a/include/net/bond_3ad.h >+++ b/include/net/bond_3ad.h >@@ -306,5 +306,6 @@ int bond_3ad_lacpdu_recv(const struct sk_buff *skb, struct >bonding *bond, >struct slave *slave); > int bond_3ad_set_carrier(struct bonding *bond); > void bond_3ad_update_lacp_rate(struct bonding *bond); >+void bond_3ad_update_ad_actor_settings(struct bonding *bond); > #endif /* _NET_BOND_3AD_H */ > >-- >2.4.3 >
[PATCH net v2] bonding: Fix ARP monitor validation
The current logic in bond_arp_rcv will accept an incoming ARP for validation if (a) the receiving slave is either "active" (which includes the currently active slave, or the current ARP slave) or, (b) there is a currently active slave, and it has received an ARP since it became active. For case (b), the receiving slave isn't the currently active slave, and is receiving the original broadcast ARP request, not an ARP reply from the target. This logic can fail if there is no currently active slave. In this situation, the ARP probe logic cycles through all slaves, assigning each in turn as the "current_arp_slave" for one arp_interval, then setting that one as "active," and sending an ARP probe from that slave. The current logic expects the ARP reply to arrive on the sending current_arp_slave, however, due to switch FDB updating delays, the reply may be directed to another slave. This can arise if the bonding slaves and switch are working, but the ARP target is not responding. When the ARP target recovers, a condition may result wherein the ARP target host replies faster than the switch can update its forwarding table, causing each ARP reply to be sent to the previous current_arp_slave. This will never pass the logic in bond_arp_rcv, as neither of the above conditions (a) or (b) are met. Some experimentation on a LAN shows ARP reply round trips in the 200 usec range, but my available switches never update their FDB in less than 4000 usec. This patch changes the logic in bond_arp_rcv to additionally accept an ARP reply for validation on any slave if there is a current ARP slave and it sent an ARP probe during the previous arp_interval. Fixes: aeea64ac717a ("bonding: don't trust arp requests unless active slave really works") Cc: Veaceslav Falico <vfal...@gmail.com> Cc: Andy Gospodarek <go...@cumulusnetworks.com> Signed-off-by: Jay Vosburgh <jay.vosbu...@canonical.com> --- v2: more detail in log and comment; no code change. drivers/net/bonding/bond_main.c | 39 --- 1 file changed, 28 insertions(+), 11 deletions(-) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 56b560558884..65a4107749df 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -214,6 +214,8 @@ static void bond_uninit(struct net_device *bond_dev); static struct rtnl_link_stats64 *bond_get_stats(struct net_device *bond_dev, struct rtnl_link_stats64 *stats); static void bond_slave_arr_handler(struct work_struct *work); +static bool bond_time_in_interval(struct bonding *bond, unsigned long last_act, + int mod); /* General routines -*/ @@ -2459,7 +2461,7 @@ int bond_arp_rcv(const struct sk_buff *skb, struct bonding *bond, struct slave *slave) { struct arphdr *arp = (struct arphdr *)skb->data; - struct slave *curr_active_slave; + struct slave *curr_active_slave, *curr_arp_slave; unsigned char *arp_ptr; __be32 sip, tip; int alen, is_arp = skb->protocol == __cpu_to_be16(ETH_P_ARP); @@ -2506,26 +2508,41 @@ int bond_arp_rcv(const struct sk_buff *skb, struct bonding *bond, , ); curr_active_slave = rcu_dereference(bond->curr_active_slave); + curr_arp_slave = rcu_dereference(bond->current_arp_slave); - /* Backup slaves won't see the ARP reply, but do come through -* here for each ARP probe (so we swap the sip/tip to validate -* the probe). In a "redundant switch, common router" type of -* configuration, the ARP probe will (hopefully) travel from -* the active, through one switch, the router, then the other -* switch before reaching the backup. + /* We 'trust' the received ARP enough to validate it if: +* +* (a) the slave receiving the ARP is active (which includes the +* current ARP slave, if any), or +* +* (b) the receiving slave isn't active, but there is a currently +* active slave and it received valid arp reply(s) after it became +* the currently active slave, or +* +* (c) there is an ARP slave that sent an ARP during the prior ARP +* interval, and we receive an ARP reply on any slave. We accept +* these because switch FDB update delays may deliver the ARP +* reply to a slave other than the sender of the ARP request. * -* We 'trust' the arp requests if there is an active slave and -* it received valid arp reply(s) after it became active. This -* is done to avoid endless looping when we can't reach the +* Note: for (b), backup slaves are receiving the broadcast ARP +* request, no
Re: bonding (IEEE 802.3ad) not working with qemu/virtio
Nikolay Aleksandrov <niko...@cumulusnetworks.com> wrote: >On 01/25/2016 05:24 PM, Bjørnar Ness wrote: >> As subject says, 802.3ad bonding is not working with virtio network model. >> >> The only errors I see is: >> >> No 802.3ad response from the link partner for any adapters in the bond. >> >> Dumping the network traffic shows that no LACP packets are sent from the >> host running with virtio driver, changing to for example e1000 solves >> this problem >> with no configuration changes. >> >> Is this a known problem? >> >[Including bonding maintainers for comments] > >Hi, >Here's a workaround patch for virtio_net devices that "cheats" the >duplex test (which is the actual problem). I've tested this locally >and it works for me. >I'd let the others comment on the implementation, there're other signs >that can be used to distinguish a virtio_net device so I'm open to suggestions. >Also feedback if this is at all acceptable would be appreciated. Should virtio instead provide an arbitrary speed and full duplex to ethtool, as veth does? Creating a magic whitelist of devices deep inside the 802.3ad implementation seems less desirable. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
[PATCH net] bonding: Fix ARP monitor validation
The current logic in bond_arp_rcv will accept an incoming ARP for validation if (a) the receiving slave is either "active" (here meaning the slave is functioning, not that it is the currently active slave) or, (b) the currently active slave recently sent an ARP probe. This can fail if there is no currently active slave. In this case, the ARP probe logic cycles through all slaves, assigning each in turn as the "current_arp_slave," setting that one as "active," and sending an ARP probe from that slave. The current logic expects the ARP reply to arrive on this slave, however, due to switch FDB updating delays, the reply may be directed to another slave. This leads to a pathological condition wherein the ARP target host replies faster than the switch can update its forwarding table, causing each ARP reply to be sent to the previous current_arp_slave. This will never pass the logic in bond_arp_rcv. Some experimentation on a LAN shows ARP reply round trips in the 200 usec range, but my available switches never update their FDB in less than 4000 usec. This patch changes the logic to accept an ARP reply for validation on any slave if there is a current ARP slave and it has recently sent an ARP probe. Fixes: aeea64ac717a ("bonding: don't trust arp requests unless active slave really works") Signed-off-by: Jay Vosburgh <jay.vosbu...@canonical.com> --- drivers/net/bonding/bond_main.c | 19 ++- 1 file changed, 14 insertions(+), 5 deletions(-) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 56b560558884..5a9c43b12999 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -214,6 +214,8 @@ static void bond_uninit(struct net_device *bond_dev); static struct rtnl_link_stats64 *bond_get_stats(struct net_device *bond_dev, struct rtnl_link_stats64 *stats); static void bond_slave_arr_handler(struct work_struct *work); +static bool bond_time_in_interval(struct bonding *bond, unsigned long last_act, + int mod); /* General routines -*/ @@ -2459,7 +2461,7 @@ int bond_arp_rcv(const struct sk_buff *skb, struct bonding *bond, struct slave *slave) { struct arphdr *arp = (struct arphdr *)skb->data; - struct slave *curr_active_slave; + struct slave *curr_active_slave, *curr_arp_slave; unsigned char *arp_ptr; __be32 sip, tip; int alen, is_arp = skb->protocol == __cpu_to_be16(ETH_P_ARP); @@ -2506,6 +2508,7 @@ int bond_arp_rcv(const struct sk_buff *skb, struct bonding *bond, , ); curr_active_slave = rcu_dereference(bond->curr_active_slave); + curr_arp_slave = rcu_dereference(bond->current_arp_slave); /* Backup slaves won't see the ARP reply, but do come through * here for each ARP probe (so we swap the sip/tip to validate @@ -2514,10 +2517,12 @@ int bond_arp_rcv(const struct sk_buff *skb, struct bonding *bond, * the active, through one switch, the router, then the other * switch before reaching the backup. * -* We 'trust' the arp requests if there is an active slave and -* it received valid arp reply(s) after it became active. This -* is done to avoid endless looping when we can't reach the -* arp_ip_target and fool ourselves with our own arp requests. +* We 'trust' the arp requests if (a) there is an active slave and +* it received valid arp reply(s) after it became active, or (b) +* there is an ARP slave, and we receive a recent ARP reply on any +* slave. This is done to avoid endless looping when we can't +* reach the arp_ip_target and fool ourselves with our own arp +* requests. */ if (bond_is_active_slave(slave)) @@ -2526,6 +2531,10 @@ int bond_arp_rcv(const struct sk_buff *skb, struct bonding *bond, time_after(slave_last_rx(bond, curr_active_slave), curr_active_slave->last_link_up)) bond_validate_arp(bond, slave, tip, sip); + else if (curr_arp_slave && (arp->ar_op == htons(ARPOP_REPLY)) && +bond_time_in_interval(bond, + dev_trans_start(curr_arp_slave->dev), 1)) + bond_validate_arp(bond, slave, sip, tip); out_unlock: if (arp != (struct arphdr *)skb->data) -- 1.9.1
Re: [PATCH 1/1] bonding: Use notifiers for slave link state detection
zhuyj <zyjzyj2...@gmail.com> wrote: >On 01/26/2016 08:43 AM, Jay Vosburgh wrote: >> <zyjzyj2...@gmail.com> wrote: >> >>> From: Zhu Yanjun <zyjzyj2...@gmail.com> >>> >>> Bonding will utilize notifier callbacks to detect slave >>> link state changes. It is intended to be used with miimon >>> set to zero, and does not support the updelay or downdelay >>> options to bonding. >>> >>> Because of link flap from the slave interface, if the notifier >>> is NETDEV_UP while the actual link state is down, it is not >>> necessary to continue. >>> >>> Signed-off-by: Jay Vosburgh <jay.vosbu...@canonical.com> >> I haven't signed off on this patch. >> >> I've just started some testing, but as before immediately get an >> RCU warning; it looks to be coming from bond_miimon_inspect_slave(); >> >> [ 316.473050] bond1: Enslaving eth1 as a backup interface with an up link >> [ 316.473059] >> [ 316.473806] === >> [ 316.475630] [ INFO: suspicious RCU usage. ] >> [ 316.477519] 4.4.0+ #38 Not tainted >> [ 316.479094] --- >> [ 316.480765] drivers/net/bonding/bond_main.c:2024 suspicious >> rcu_dereference_check() usage! >> >> This is presumably because the "case NETDEV_DOWN" call to >> bond_miimon_inspect_slave does not hold RCU. It does hold RTNL, though, >> which should be safe for this usage (RTNL mutexes changes to the active >> slave). The appended patch on top of the original makes the warning go >> away. >> >> I'm still testing the patch and have no comment about its >> functionality as yet. >> >> diff --git a/drivers/net/bonding/bond_main.c >> b/drivers/net/bonding/bond_main.c >> index 9f67948..e3faee9 100644 >> --- a/drivers/net/bonding/bond_main.c >> +++ b/drivers/net/bonding/bond_main.c >> @@ -2014,14 +2014,14 @@ static int bond_slave_info_query(struct net_device >> *bond_dev, struct ifslave *in >> /* Monitoring >> ---*/ >> -/* called with rcu_read_lock() */ >> +/* called with rcu_read_lock() or RTNL */ >> static int bond_miimon_inspect_slave(struct bonding *bond, struct slave >> *slave, >> unsigned long event) >> { >> int link_state; >> bool ignore_updelay; >> - ignore_updelay = !rcu_dereference(bond->curr_active_slave); >> +ignore_updelay = !rcu_dereference_rtnl(bond->curr_active_slave); > >Thanks a lot. >Because kernel v4.4 needs this kind of patch, I backport this patch from >net-next to kernel v4.4. > >If it is not appropriate, I will revert this patch. I don't understand what you mean here. I've tested the patch (with my above modification), and while I seem to be hitting an unrelated bug in the ARP monitor, I believe this patch will misbehave when the ARP monitor is running. For example, if arp_interval=1000 and miimon=0, the link state notifier callback will change a slave to up should a notifier event take place. So, hypothetically, if a slave is "down" according to the ARP monitor (but actually carrier up), and then experience a carrier down then up transition, the slave would be set to "up" even though the ARP monitor believes it to be down. I'm not able to induce the speedy link flap events, so I'm not sure about this portion of the patch: + /* Because of link flap from the slave interface, it is possilbe that +* the notifiler is NETDEV_UP while the actual link state is down. If +* so, it is not necessary to contiune. +*/ + switch (event) { + case NETDEV_UP: + if (!link_state) + return 0; + break; + + case NETDEV_DOWN: + if (link_state) + return 0; + break; + } + Unless I misunderstood, Emil's comments elsewhere suggest that the current ixgbe driver won't cause those, though. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH 1/1] bonding: Use notifiers for slave link state detection
<zyjzyj2...@gmail.com> wrote: >From: Zhu Yanjun <zyjzyj2...@gmail.com> > >Bonding will utilize notifier callbacks to detect slave >link state changes. It is intended to be used with miimon >set to zero, and does not support the updelay or downdelay >options to bonding. > >Because of link flap from the slave interface, if the notifier >is NETDEV_UP while the actual link state is down, it is not >necessary to continue. > >Signed-off-by: Jay Vosburgh <jay.vosbu...@canonical.com> I haven't signed off on this patch. I've just started some testing, but as before immediately get an RCU warning; it looks to be coming from bond_miimon_inspect_slave(); [ 316.473050] bond1: Enslaving eth1 as a backup interface with an up link [ 316.473059] [ 316.473806] === [ 316.475630] [ INFO: suspicious RCU usage. ] [ 316.477519] 4.4.0+ #38 Not tainted [ 316.479094] --- [ 316.480765] drivers/net/bonding/bond_main.c:2024 suspicious rcu_dereference_check() usage! This is presumably because the "case NETDEV_DOWN" call to bond_miimon_inspect_slave does not hold RCU. It does hold RTNL, though, which should be safe for this usage (RTNL mutexes changes to the active slave). The appended patch on top of the original makes the warning go away. I'm still testing the patch and have no comment about its functionality as yet. diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 9f67948..e3faee9 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -2014,14 +2014,14 @@ static int bond_slave_info_query(struct net_device *bond_dev, struct ifslave *in /* Monitoring ---*/ -/* called with rcu_read_lock() */ +/* called with rcu_read_lock() or RTNL */ static int bond_miimon_inspect_slave(struct bonding *bond, struct slave *slave, unsigned long event) { int link_state; bool ignore_updelay; - ignore_updelay = !rcu_dereference(bond->curr_active_slave); + ignore_updelay = !rcu_dereference_rtnl(bond->curr_active_slave); slave->new_link = BOND_LINK_NOCHANGE; -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [RFC PATCH net] net/core: don't increment rx_dropped on inactive slaves
Jarod Wilson <ja...@redhat.com> wrote: >The network core tries to keep track of dropped packets, but some packets >you wouldn't really call dropped, so much as intentionally ignored, under >certain circumstances. One such case is that of bonding and team device >slaves that are currently inactive. Their respective rx_handler functions >return RX_HANDLER_EXACT (the only places in the kernel that return that), >which ends up tracking into the network core's __netif_receive_skb_core() >function's drop path, with no pt_prev set. On a noisy network, this can >result in a very rapidly incrementing rx_dropped counter, not only on the >inactive slave(s), but also on the master device, such as the following: [...] >In this scenario, p5p1, p5p2 and p7p1 are all inactive slaves in an >active-backup bond0, and you can see that all three have high drop counts, >with the master bond0 showing a tally of all three. > >I know that this was previously discussed some here: > >http://www.spinics.net/lists/netdev/msg226341.html > >It seems additional counters never came to fruition, but honestly, for >this particular case, I'm not even sure they're warranted, I'd be inclined >to say just silently drop these packets without incrementing a counter. At >least, that's probably what would make someone who has complained loudly >about this issue happy, as they have monitoring tools that are squaking >loudly at any increments to rx_dropped. I don't think the kernel should silently drop packets; there should be a counter somewhere. If a packet is being thrown away deliberately, it should not just vanish into the screaming void of space. Someday someone will try and track down where that packet is being dropped. I've had that same conversation with customers who insist on accounting for every packet drop (from the "any drop is an error" mindset), so I understand the issue. Thinking about the prior discussion, the rx_drop_inactive is still a good idea, but I'd actually today get good use from a "rx_drop_unforwardable" (or an equivalent but shorter name) counter that counts every time a packet is dropped due to is_skb_forwardable() returning false. __dev_forward_skb does this (and hits rx_dropped), as does the bridge (and does not count it). -J >CC: "David S. Miller" <da...@davemloft.net> >CC: Eric Dumazet <eduma...@google.com> >CC: Jiri Pirko <j...@mellanox.com> >CC: Daniel Borkmann <dan...@iogearbox.net> >CC: Tom Herbert <t...@herbertland.com> >CC: Jay Vosburgh <j.vosbu...@gmail.com> >CC: Veaceslav Falico <vfal...@gmail.com> >CC: Andy Gospodarek <go...@cumulusnetworks.com> >CC: netdev@vger.kernel.org >Signed-off-by: Jarod Wilson <ja...@redhat.com> >--- > net/core/dev.c | 3 +++ > 1 file changed, 3 insertions(+) > >diff --git a/net/core/dev.c b/net/core/dev.c >index 8cba3d8..1354c7b 100644 >--- a/net/core/dev.c >+++ b/net/core/dev.c >@@ -4153,8 +4153,11 @@ ncls: > else > ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev); > } else { >+ if (deliver_exact) >+ goto inactive; /* bond or team inactive slave */ > drop: > atomic_long_inc(>dev->rx_dropped); >+inactive: > kfree_skb(skb); > /* Jamal, now you will not able to escape explaining >* me how you were going to use this. :-) >-- >1.8.3.1 > --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH] net: take care of bonding in build_skb_flow_key (v3)
Wengang Wang <wen.gang.w...@oracle.com> wrote: [...] >For ipip, yes seems update_pmtu is called in line for each call of >queue_xmit. Do you know if it's a good configuration for ipip + bonding? Yes, it is. >Other's comment and suggestion? I agree with Sabrina Dubroca <s...@queasysnail.net>'s suggestions from yesterday. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com
Re: [PATCH 1/1] bonding: restrict up state in 802.3ad mode
<zyjzyj2...@gmail.com> wrote: >From: Zhu Yanjun <zyjzyj2...@gmail.com> > >In 802.3ad mode, the speed and duplex is needed. But in some NIC, >there is a time span between NIC up state and getting speed and duplex. >As such, sometimes a slave in 802.3ad mode is in up state without >speed and duplex. This will make bonding in 802.3ad mode can not >work well. >To make bonding driver be compatible with more NICs, it is >necessary to restrict the up state in 802.3ad mode. What device is this? It seems a bit odd that an Ethernet device can be carrier up but not have the duplex and speed available. Also, what are the option settings for bonding? Specifically, is "use_carrier" set to 0? The default setting is 1. In general, though, bonding expects a speed or duplex change to be announced via a NETDEV_UPDATE or NETDEV_UP notifier, which would propagate to the 802.3ad logic. If the device here is going carrier up prior to having speed or duplex available, then maybe it should call netdev_state_change() when the duplex and speed are available, or delay calling netif_carrier_on(). >Signed-off-by: Zhu Yanjun <zyjzyj2...@gmail.com> >--- > drivers/net/bonding/bond_main.c | 19 +++ > 1 file changed, 19 insertions(+) > >diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c >index 9e0f8a7..0a80fb3 100644 >--- a/drivers/net/bonding/bond_main.c >+++ b/drivers/net/bonding/bond_main.c >@@ -1991,6 +1991,25 @@ static int bond_miimon_inspect(struct bonding *bond) > > link_state = bond_check_dev_link(bond, slave->dev, 0); > >+ /* Since some NIC has time span between netif_running and >+ * getting speed and duples. That is, after a NIC is up >(netif_running), >+ * there is a time span before this NIC is negotiated with >speed and duplex. >+ * During this time span, the slave in 802.3ad is configured >without speed >+ * and duplex. This 802.3ad bonding will not work because it >needs slave's speed >+ * and duplex to generate key field. >+ * As such, we restrict up in 802.3ad mode to: netif_running && >peed != SPEED_UNKNOWN && >+ * duplex != DUPLEX_UNKNOWN >+ */ >+ if ((BMSR_LSTATUS == link_state) && >+ (BOND_MODE(bond) == BOND_MODE_8023AD)) { >+ bond_update_speed_duplex(slave); >+ if ((slave->speed == SPEED_UNKNOWN) || >+ (slave->duplex == DUPLEX_UNKNOWN)) { >+ link_state = 0; >+ netdev_info(bond->dev, "In 802.3ad mode, it is >not enough to up without speed and duplex"); >+ } >+ } Also, as a functional note on this patch, the above looks like it will spam the log repeatedly every miimon interval for as long as the "carrier up but no speed/duplex" situation persists. -J > switch (slave->link) { > case BOND_LINK_UP: > if (link_state) >-- >1.7.9.5 > --- -Jay Vosburgh, jay.vosbu...@canonical.com -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/1] bonding: delay up state without speed and duplex in 802.3ad mode
<zyjzyj2...@gmail.com> wrote: >From: yzhu1 <yz...@windriver.com> > >In 802.3ad mode, the speed and duplex is needed. But in some NICs, >there is a time span between NIC up state and getting speed and duplex. >As such, sometimes a slave in 802.3ad mode is in up state without >speed and duplex. This will make bonding in 802.3ad mode can not >work well. > >To make bonding driver robust and compatible with more NICs, it is >necessary to delay the up state without speed and duplex in 802.3ad >mode. You misunderstood my comment. What I meant is that the device driver for the network device should change to either delay carrier up or issue a second notifier when speed and duplex are available. If the driver doesn't handle duplex and speed notification properly, it will likely have trouble with more than just bonding. You also didn't mention the identity of the network device that requires this special handling. Is the driver part of the linux kernel? -J >Signed-off-by: yzhu1 <yz...@windriver.com> >--- > drivers/net/bonding/bond_main.c | 34 ++ > 1 file changed, 34 insertions(+) > >diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c >index 9e0f8a7..a1d8708 100644 >--- a/drivers/net/bonding/bond_main.c >+++ b/drivers/net/bonding/bond_main.c >@@ -419,6 +419,35 @@ const char *bond_slave_link_status(s8 link) > } > } > >+/* This function is to check the speed and duplex of a NIC. >+ * Since the speed and duplex of a slave device are very >+ * important to the bonding in the 802.3ad mode. As such, >+ * it is necessary to check the speed and duplex of a slave >+ * device in 802.3ad mode. >+ * >+ * speed != SPEED_UNKNOWN and duplex == DUPLEX_FULL : 1 >+ * others : 0 >+ */ >+static int __check_speed_duplex(struct net_device *netdev) >+{ >+ struct ethtool_cmd ecmd; >+ u32 slave_speed = SPEED_UNKNOWN; >+ int res; >+ >+ res = __ethtool_get_settings(netdev, ); >+ if (res < 0) >+ return 0; >+ >+ slave_speed = ethtool_cmd_speed(); >+ if (slave_speed == 0 || slave_speed == ((__u32) -1)) >+ return 0; >+ >+ if (DUPLEX_FULL != ecmd.duplex) >+ return 0; >+ >+ return 1; >+} >+ > /* if supports MII link status reporting, check its link status. > * > * We either do MII/ETHTOOL ioctls, or check netif_carrier_ok(), >@@ -445,6 +474,11 @@ static int bond_check_dev_link(struct bonding *bond, > if (!reporting && !netif_running(slave_dev)) > return 0; > >+ /* Check the speed and duplex of the slave device in 802.3ad mode. */ >+ if ((BOND_MODE(bond) == BOND_MODE_8023AD) && >+ !__check_speed_duplex(slave_dev)) >+ return 0; >+ > if (bond->params.use_carrier) > return netif_carrier_ok(slave_dev) ? BMSR_LSTATUS : 0; > >-- >1.7.9.5 > --- -Jay Vosburgh, jay.vosbu...@canonical.com -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH net] bonding: fix panic on non-ARPHRD_ETHER enslave failure
Since commit 7d5cd2ce529b, when bond_enslave fails on devices that are not ARPHRD_ETHER, if needed, it resets the bonding device back to ARPHRD_ETHER by calling ether_setup. Unfortunately, ether_setup clobbers dev->flags, clearing IFF_UP if the bond device is up, leaving it in a quasi-down state without having actually gone through dev_close. For bonding, if any periodic work queue items are active (miimon, arp_interval, etc), those will remain running, as they are stopped by bond_close. At this point, if the bonding module is unloaded or the bond is deleted, the system will panic when the work function is called. This panic is resolved by calling dev_close on the bond itself prior to calling ether_setup. Cc: Nikolay Aleksandrov <niko...@cumulusnetworks.com> Signed-off-by: Jay Vosburgh <jay.vosbu...@canonical.com> Fixes: 7d5cd2ce5292 ("bonding: correctly handle bonding type change on enslave failure") --- drivers/net/bonding/bond_main.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index b4351ca..9e0f8a7 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -1749,6 +1749,7 @@ err_undo_flags: slave_dev->dev_addr)) eth_hw_addr_random(bond_dev); if (bond_dev->type != ARPHRD_ETHER) { + dev_close(bond_dev); ether_setup(bond_dev); bond_dev->flags |= IFF_MASTER; bond_dev->priv_flags &= ~IFF_TX_SKB_SHARING; -- 1.9.1 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Regression in bonding driver - devices without set_mac
Toby Corkindale <t...@wintrmute.net> wrote: [...] >The thing is, I've been running PPP devices in balance-rr mode for >quite a while. It worked fine, and I'd like to continue having that >functionality. These changes to the kernel have broken that >functionality, by requiring the MAC address support unconditionally. > >I guess using PPP devices is a special case, but it would be great if >that special case could continue to be supported. Could you describe your configuration / use case a bit more? The nominal reason for the MAC change in balance-rr or balance-xor modes is that those modes are designed to interoperate with Etherchannel (static link aggregation) type of systems, which expect all ports participating in the aggregation to use the same MAC address. Relaxing the restriction is likely a matter of causing fail_over_mac to be obeyed in addtional modes other than active-backup, which is actually what's catching the ppp case right now. The check for the set_mac function passes, but later, the test for "should the slave's MAC be changed" is if (!bond->params.fail_over_mac || BOND_MODE(bond) != BOND_MODE_ACTIVEBACKUP) { [ change the slave MAC ] } so this block is entered for anything other than active-backup mode with fail_over_mac enabled. This block was changed for commit 00503b6f702eaf23e7257d6287da72805d7d014c Author: dingtianhong <dingtianh...@huawei.com> Date: Sat Jan 25 13:00:29 2014 +0800 bonding: fail_over_mac should only affect AB mode at enslave and removal pro cessing So that's probably when you saw the behavior change. Are you able to test a patch? I believe the following would change the behavior, although some of the release logic may not be correct (I've not tested this). diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index b4351caf8e01..5f3632544d52 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -1451,8 +1451,7 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev) */ ether_addr_copy(new_slave->perm_hwaddr, slave_dev->dev_addr); - if (!bond->params.fail_over_mac || - BOND_MODE(bond) != BOND_MODE_ACTIVEBACKUP) { + if (!bond->params.fail_over_mac) { /* Set slave to master's mac address. The application already * set the master's mac address to that of the first slave */ @@ -1725,8 +1724,7 @@ err_close: dev_close(slave_dev); err_restore_mac: - if (!bond->params.fail_over_mac || - BOND_MODE(bond) != BOND_MODE_ACTIVEBACKUP) { + if (!bond->params.fail_over_mac) { /* XXX TODO - fom follow mode needs to change master's * MAC if this slave's MAC is in use by the bond, or at * least print a warning. @@ -1823,8 +1821,7 @@ static int __bond_release_one(struct net_device *bond_dev, RCU_INIT_POINTER(bond->current_arp_slave, NULL); - if (!all && (!bond->params.fail_over_mac || -BOND_MODE(bond) != BOND_MODE_ACTIVEBACKUP)) { + if (!all && (!bond->params.fail_over_mac)) { if (ether_addr_equal_64bits(bond_dev->dev_addr, slave->perm_hwaddr) && bond_has_slaves(bond)) netdev_warn(bond_dev, "the permanent HWaddr of %s - %pM - is still in use by %s - set the HWaddr of %s to a different address to avoid conflicts\n", @@ -1905,8 +1902,7 @@ static int __bond_release_one(struct net_device *bond_dev, /* close slave before restoring its mac address */ dev_close(slave_dev); - if (bond->params.fail_over_mac != BOND_FOM_ACTIVE || - BOND_MODE(bond) != BOND_MODE_ACTIVEBACKUP) { + if (bond->params.fail_over_mac != BOND_FOM_ACTIVE) { /* restore original ("permanent") mac address */ ether_addr_copy(addr.sa_data, slave->perm_hwaddr); addr.sa_family = slave_dev->type; -J --- -Jay Vosburgh, jay.vosbu...@canonical.com -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Regression in bonding driver - devices without set_mac
Toby Corkindale <t...@wintrmute.net> wrote: >I originally reported this in Dec 2014, as: >https://bugzilla.kernel.org/show_bug.cgi?id=89161 > >The bug is still present in the 4.2 Linux kernel. I had some time to look into this today, and there is a related code path that panics the kernel, so I'm working on a patch to resolve things, and will be posting that shortly. >If you look at the code, here: >https://github.com/torvalds/linux/commit/f54424412b6b2f64cae4d7c39d981ca14ce0052c > >Then you see that the intention is for devices without set_mac support >to be supported. Although in older code they DID work, and that >ability died sometime near this commit. And has never returned since. Unfortunately, I believe the referenced commit is in error. The change log states: [...] It's wrong because we only require ndo_set_mac_address in case bonding is in active-backup mode and FOM isn't FOM_ACTIVE. This assertion is incorrect; ndo_set_mac_address is necessary for all modes with the only exception being active-backup with fail_over_mac enabled. All of the load balance modes (everything except active-backup) will change the MAC of the slave devices at some point, and thus require ndo_set_mac_address support. PPP devices should function in active-backup mode, and should enable fail_over_mac automatically (if the PPP device is the first slave) if f_o_m is not already enabled. The prior discussion on this patch can be found here: http://www.spinics.net/lists/netdev/msg289311.html >The code path in current kernels does allow devices through that block >of code, but it fails somewhere else -- the devices are not >successfully added as slaves, but the only error printed is the >warning about not supporting MAC address setting. What is happening in the current code is that, for load balance modes, e.g., balance-rr, the test for ndo_set_mac_address support passes, but, later, bond_enslave attempts to actually set the MAC of the new slave, and this fails (because there is no ndo_set_mac). >Is there anything I can do to try and sort this regression out, with you? As I said, I do not believe this is a regression. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH net-next] ipv6: recreate ipv6 link-local addresses when increasing MTU over IPV6_MIN_MTU
Hannes Frederic Sowa <han...@stressinduktion.org> wrote: >Hello Alex, > >On Mon, Oct 26, 2015, at 16:52, Alexander Duyck wrote: >> Seems like this code isn't quite correct. You are calling ipv6_add_dev >> for slave devices, and if I understand things correctly I don't believe >> that was happening before and may be an unintended side effect. > >Ah, btw., autoconf and ipv6 operation on IFF_SLAVE devices is actually >desired nowadays and don't think we can change this. See also: ><https://patchwork.ozlabs.org/patch/531196/> IPv6 addrconf on IFF_SLAVE devices was disabled for bonding slaves in commit c2edacf80e15 because it caused issues with snooping switches. This is also referenced in https://bugzilla.redhat.com/show_bug.cgi?id=236750 Won't re-enabling autoconf on IFF_SLAVE devices cause that issue to return? -J --- -Jay Vosburgh, jay.vosbu...@canonical.com -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4] net/bonding: send arp in interval if no active slave
Jarod Wilson <ja...@redhat.com> wrote: >Jarod Wilson wrote: >... >> As Andy already stated I'm not a fan of such workarounds either but it's >> necessary sometimes so if this is going to be actually considered then a >> few things need to be fixed. Please make this a proper bonding option >> which can be changed at runtime and not only via a module parameter. > >Is there any particular userspace tool that would need some updating, or >is adding the sysfs knobs sufficient here? I think I've got all the sysfs >stuff thrown together now, but still need to test. Most (all?) bonding options should be configurable via iproute (netlink) now. > >>> Now, I saw that you've only tested with 500 ms, can't this be fixed by >>> using >>> a different interval ? This seems like a very specific problem to have a >>> whole new option for. >> >> ...I'll wait until we've heard confirmation from Uwe that intervals >> other than 500ms don't fix things. > >Okay, so I believe the "only tested with 500ms" was in reference to >testing with Uwe's initial patch. I do have supporting evidence in a >bugzilla report that shows upwards of 5000ms still experience the problem >here. I did set up some switches and attempt to reproduce this yesterday; I daisy-chained three switches (two Cisco and an HP) together and connected the bonded interfaces to the "end" switches. I tried various ARP targets (the switch, hosts on various points of the switch) and varying arp_intervals and was unable to reproduce the problem. As I understand it, the working theory is something like this: - host with two bonded interfaces, A and B. For active-backup mode, the interfaces have been assigned the same MAC address. - switch has MAC for B in its forwarding table - bonding goes from down to up, and thinks all its slaves are down, and starts the "curr_arp_slave" search for an active arp_ip_target. In this case, it starts with A, and sends an ARP from A. As an aside, I'm not 100% clear on what exactly is going on in the "bonding goes from down to up" transition; this seems to be key in reproducing the issue. - switch sees source mac coming from port A, starts to update its forwarding table - meanwhile, switch forwards ARP request, and receives ARP reply, which it forwards to port B. Bonding drops this, as the slave is inactive. - switch finishes updating forwarding table, MAC is now assigned to port A. - bonding now tries sending on port B, and the cycle repeats. If this is what's taking place, then the arp_interval itself is irrelevant, the race is between the switch table update and the generation of the ARP reply. Also, presuming the above is what's going on, we could modify the ARP "curr_arp_slave" logic a bit to resolve this without requiring any magic knobs. For example, we could change the "drop on inactive" logic to recognise the "curr_arp_slave" search and accept the unicast ARP reply, and perhaps make that receiving slave the next curr_arp_slave automatically. I also wonder if the fail_over_mac option would affect this behavior, as it would cause the slaves to keep their MAC address for the duration, so the switch would not see the MAC move from port to port. Another thought would be to have the curr_arp_slave cycle through the slaves in random order, but that could create non-deterministic results even when things are working correctly. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] net/bonding: send arp in interval if no active slave
Uwe Koziolek <uwe.kozio...@redknee.com> wrote: >On Tue, Sep 01, 2015 at 05:41 PM +0200, Andy Gospodarek wrote: >> On Mon, Aug 17, 2015 at 10:51:27PM +0200, Uwe Koziolek wrote: >>> On Mon, Aug 17, 2015 at 09:14PM +0200, Jay Vosburgh wrote: >>>> Uwe Koziolek <uwe.kozio...@redknee.com> wrote: >>>> >>>>> On2015-08-17 07:12 PM,Jarod Wilson wrote: >>>>>> On 2015-08-17 12:55 PM, Veaceslav Falico wrote: >>>>>>> On Mon, Aug 17, 2015 at 12:23:03PM -0400, Jarod Wilson wrote: >>>>>>>> From: Uwe Koziolek <uwe.kozio...@redknee.com> >>>>>>>> >>>>>>>> With some very finicky switch hardware, active backup bonding can get >>>>>>>> into >>>>>>>> a situation where we play ping-pong between interfaces, trying to get >>>>>>>> one >>>>>>>> to come up as the active slave. There seems to be an issue with the >>>>>>>> switch's arp replies either taking too long, or simply getting lost, >>>>>>>> so we >>>>>>>> wind up unable to get any interface up and active. Sometimes, the issue >>>>>>>> sorts itself out after a while, sometimes it doesn't. >>>>>>>> >>>>>>>> Testing with num_grat_arp has proven fruitless, but sending an >>>>>>>> additional >>>>>>>> arp on curr_arp_slave if we're still in the arp_interval timeslice in >>>>>>>> bond_ab_arp_probe(), has shown to produce 100% reliability in testing >>>>>>>> with >>>>>>>> this hardware combination. >>>>>>> Sorry, I don't understand the logic of why it works, and what exactly >>>>>>> are >>>>>>> we fixiing here. >>>>>>> >>>>>>> It also breaks completely the logic for link state management in case >>>>>>> of no >>>>>>> current active slave for 2*arp_interval. >>>>>>> >>>>>>> Could you please elaborate what exactly is fixed here, and how it >>>>>>> works? :) >>>>>> I can either duplicate some information from the bug, or Uwe can, to >>>>>> illustrate the exact nature of the problem. >>>>>> >>>>>>> p.s. num_grat_arp maybe could help? >>>>>> That was my thought as well, but as I understand it, that route was >>>>>> explored, and it didn't help any. I don't actually have a reproducer >>>>>> setup of my own, unfortunately, so I'm kind of caught in the middle >>>>>> here... >>>>>> >>>>>> Uwe, can you perhaps further enlighten us as to what num_grat_arp >>>>>> settings were tried that didn't help? I'm still of the mind that if >>>>>> num_grat_arp *didn't* help, we probably need to do something keyed off >>>>>> num_grat_arp. >>>>> The bonding slaves are connected to high available switches, each of the >>>>> slaves is connected to a different switch. If the bond is starting, only >>>>> the selected slave sends one arp-request. If a matching arp_response was >>>>> received, this slave and the bond is going into state up, sending the >>>>> gratitious arps... >>>>> But if you got no arp reply the next slave was selected. >>>>> With most of the newer switches, not overloaded, or with other software >>>>> bugs, or with a single switch configuration, you would get a arp response >>>>> on the first arp request. >>>>> But in case of high availability configuration with non perfect switches >>>>> like HP ProCurve 54xx, also with some Cisco models, you may not get a >>>>> response on the first arp request. >>>>> >>>>> I have seen network snoops, there the switches are not responding to the >>>>> first arp request on slave 1, the second arp request was sent on slave 2 >>>>> but the response was received on slave one, and all following arp >>>>> requests are anwsered on the wrong slave for a longer time. >>>>Could you elaborate on the exact "high availability >>>> configuration" here, including the model(s) of switch(es) involved? >>>> >>>>Is this some kind of race between the switch or switches >>>&
Re: [PATCH net] bonding: fix bond dev flags after convert to arphrd_ether
Nikolay Aleksandrov niko...@cumulusnetworks.com wrote: [...] Restarting this thread because there’s actually a bug here, what you described with the bonding destruction is true when the slaves are all destroyed but it isn’t true if they’re just released, if you take a look at bond_slave_netdev_event() the bond destruction happens only on NETDEV_UNREGISTER and I just hit this bug by enslaving a non-ARPHRD_ETHER device, releasing it and enslaving a ARPHRD_ETHER device so ether_setup() path in bond_enslave is hit and IFF_MASTER gets dropped: 17: bond0: BROADCAST,MULTICAST,MASTER,UP,LOWER_UP mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 link/fddi 9a:33:c5:30:ff:a6 brd ff:ff:ff:ff:ff:ff (release non-ARPHRD_ETHER slave) (enslave ARPHRD_ETHER device) 17: bond0: BROADCAST,MULTICAST,LOWER_UP mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 link/ether 08:00:27:3c:13:57 brd ff:ff:ff:ff:ff:ff Notice the master flag is gone and of course on unload we get: [57981.545547] [ cut here ] [57981.545567] WARNING: CPU: 0 PID: 13792 at fs/proc/generic.c:575 remove_proc_entry+0x17e/0x190() [57981.545572] remove_proc_entry: removing non-empty directory 'net/bonding', leaking at least 'bond0' [...] We need to convert it back to ARPHRD_ETHER if releasing the last slave, because we can’t destroy it (in some paths bond-dev is used after bond_release()). Basically we should make the case that if the bonding doesn’t have any slaves then it’s always an ARPHRD_ETHER device. Thoughts ? I agree that it would be cleaner for bond_dev-type to switch back on release of last slave. The options code (caller of bond_option_slaves_set) and bond_uninit() both reference the bond or dev after calling bond_release(), and would need changing if any release could destroy the bond itself. However, for the type change, there's the potentially tricky case of a nested non-ARPHRD_ETHER bond, e.g., bond0 - bond1 - ib0. This isn't a typical use case that I'm aware of, but I believe it's supported by the code. If ib0, the last slave, is released, bond1 will want to change to ARPHRD_ETHER, but bond0 is ARPHRD_INFINIBAND. I suspect bonding will have to notice the NETDEV_PRE_TYPE_CHANGE and _POST_ notifiers and take appropriate action (i.e., cascade the type change upwards). There might be similar issues with other devices stacked on top of the IB - Ether type-changing bond; I'm not sure how many of those there may be, though, since many things won't stack over IB devices (or an IB-flavor bond). If the type change works, then I don't think we would still need the release and destroy logic. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] net/bonding: send arp in interval if no active slave
Uwe Koziolek uwe.kozio...@redknee.com wrote: On2015-08-17 07:12 PM,Jarod Wilson wrote: On 2015-08-17 12:55 PM, Veaceslav Falico wrote: On Mon, Aug 17, 2015 at 12:23:03PM -0400, Jarod Wilson wrote: From: Uwe Koziolek uwe.kozio...@redknee.com With some very finicky switch hardware, active backup bonding can get into a situation where we play ping-pong between interfaces, trying to get one to come up as the active slave. There seems to be an issue with the switch's arp replies either taking too long, or simply getting lost, so we wind up unable to get any interface up and active. Sometimes, the issue sorts itself out after a while, sometimes it doesn't. Testing with num_grat_arp has proven fruitless, but sending an additional arp on curr_arp_slave if we're still in the arp_interval timeslice in bond_ab_arp_probe(), has shown to produce 100% reliability in testing with this hardware combination. Sorry, I don't understand the logic of why it works, and what exactly are we fixiing here. It also breaks completely the logic for link state management in case of no current active slave for 2*arp_interval. Could you please elaborate what exactly is fixed here, and how it works? :) I can either duplicate some information from the bug, or Uwe can, to illustrate the exact nature of the problem. p.s. num_grat_arp maybe could help? That was my thought as well, but as I understand it, that route was explored, and it didn't help any. I don't actually have a reproducer setup of my own, unfortunately, so I'm kind of caught in the middle here... Uwe, can you perhaps further enlighten us as to what num_grat_arp settings were tried that didn't help? I'm still of the mind that if num_grat_arp *didn't* help, we probably need to do something keyed off num_grat_arp. The bonding slaves are connected to high available switches, each of the slaves is connected to a different switch. If the bond is starting, only the selected slave sends one arp-request. If a matching arp_response was received, this slave and the bond is going into state up, sending the gratitious arps... But if you got no arp reply the next slave was selected. With most of the newer switches, not overloaded, or with other software bugs, or with a single switch configuration, you would get a arp response on the first arp request. But in case of high availability configuration with non perfect switches like HP ProCurve 54xx, also with some Cisco models, you may not get a response on the first arp request. I have seen network snoops, there the switches are not responding to the first arp request on slave 1, the second arp request was sent on slave 2 but the response was received on slave one, and all following arp requests are anwsered on the wrong slave for a longer time. Could you elaborate on the exact high availability configuration here, including the model(s) of switch(es) involved? Is this some kind of race between the switch or switches updating the forwarding tables and the bond flip flopping between the slaves? E.g., source MAC from ARP sent on slave 1 is used to populate the forwarding table, but (for whatever reason) there is no reply. ARP on slave 2 is sent (using the same source MAC, unless you set fail_over_mac), but forwarding tables still send that MAC to slave 1, so reply is sent there. The proposed change sents up to 3 arp requests on a down bond using the same slave, delayed by arp_interval. Using problematic switches i have seen the the arp response on the right slave at latest on the second arp request. So the bond is going into state up. How does it works: The bonds in up state are handled on the beginning of bond_ab_arp_probe procedure, the other part of this procedure is handling the slave change. The proposed change is bypassing the slave change for 2 additional calls of bond_ab_arp_probe. Now the retries are not only for an up bond available, they are also implemented for a down bond. Does this delay failover or bringup on switches that are not problematic? I.e., if arp_interval is, say, 1000 (1 second), will this impact failover / recovery times? -J The num_grat_arp has no chance to solve the problem. The num_grat_arp is only used, if a different slave is going active. But in our case, the bonding slaves are not going into the state active for a longer time. [jarod: manufacturing of changelog] CC: Jay Vosburgh j.vosbu...@gmail.com CC: Veaceslav Falico vfal...@gmail.com CC: Andy Gospodarek go...@cumulusnetworks.com CC: netdev@vger.kernel.org Signed-off-by: Uwe Koziolek uwe.kozio...@redknee.com Signed-off-by: Jarod Wilson ja...@redhat.com --- drivers/net/bonding/bond_main.c | 5 + 1 file changed, 5 insertions(+) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 0c627b4..60b9483 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -2794,6 +2794,11 @@ static bool
3.14 stable request for net: gso: use feature flag argument in all protocol gso handlers
For 3.14 -stable, please consider commit: commit 1e16aa3ddf863c6b9f37eddf52503230a62dedb3 Author: Florian Westphal f...@strlen.de Date: Mon Oct 20 13:49:16 2014 +0200 net: gso: use feature flag argument in all protocol gso handlers We have observed kernel panics when an openvswitch bridge is populated with virtual devices (veth, for example) that have expansive feature sets that include NETIF_F_GSO_GRE. The failure occurs when foreign GRE encapsulated traffic (explicitly not including the initial packets of a connection) arrives at the system (likely via a switch flood event). The packets are GRO accumulated, and passed to the OVS receive processing. As the connection is not in the OVS kernel datapath table, the call path is: ovs_dp_upcall - queue_gso_packets - __skb_gso_segment(skb, NETIF_F_SG, false) Without the patch cited above, __skb_gso_segment returns NULL, as the features from the device (including _GSO_GRE) are used in place of the _SG feature supplied to the call. The kernel panics on a subsequent dereference of the NULL pointer in queue_userspace_packet(). A backport to 3.14.50 is below. -J Subject: [PATCH 3.14-stable] net: gso: use feature flag argument in all protocol gso handlers From: Florian Westphal f...@strlen.de skb_gso_segment() has a 'features' argument representing offload features available to the output path. A few handlers, e.g. GRE, instead re-fetch the features of skb-dev and use those instead of the provided ones when handing encapsulation/tunnels. Depending on dev-hw_enc_features of the output device skb_gso_segment() can then return NULL even when the caller has disabled all GSO feature bits, as segmentation of inner header thinks device will take care of segmentation. This e.g. affects the tbf scheduler, which will silently drop GRE-encap GSO skbs that did not fit the remaining token quota as the segmentation does not work when device supports corresponding hw offload capabilities. Cc: Pravin B Shelar pshe...@nicira.com Signed-off-by: Florian Westphal f...@strlen.de Signed-off-by: David S. Miller da...@davemloft.net [jay.vosburgh: backported to 3.14. ] Signed-off-by: Jay Vosburgh jay.vosbu...@canonical.com --- net/ipv4/af_inet.c | 2 +- net/ipv4/gre_offload.c | 2 +- net/ipv4/udp.c | 2 +- net/ipv6/ip6_offload.c | 2 +- net/mpls/mpls_gso.c| 2 +- 5 files changed, 5 insertions(+), 5 deletions(-) diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c index 951fe55..f4c804d 100644 --- a/net/ipv4/af_inet.c +++ b/net/ipv4/af_inet.c @@ -1291,7 +1291,7 @@ static struct sk_buff *inet_gso_segment(struct sk_buff *skb, encap = SKB_GSO_CB(skb)-encap_level 0; if (encap) - features = skb-dev-hw_enc_features netif_skb_features(skb); + features = skb-dev-hw_enc_features; SKB_GSO_CB(skb)-encap_level += ihl; skb_reset_transport_header(skb); diff --git a/net/ipv4/gre_offload.c b/net/ipv4/gre_offload.c index 278836f..0da513e 100644 --- a/net/ipv4/gre_offload.c +++ b/net/ipv4/gre_offload.c @@ -69,7 +69,7 @@ static struct sk_buff *gre_gso_segment(struct sk_buff *skb, skb-mac_len = skb_inner_network_offset(skb); /* segment inner packet. */ - enc_features = skb-dev-hw_enc_features netif_skb_features(skb); + enc_features = skb-dev-hw_enc_features features; segs = skb_mac_gso_segment(skb, enc_features); if (!segs || IS_ERR(segs)) { skb_gso_error_unwind(skb, protocol, ghl, mac_offset, mac_len); diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c index 21a3a9e..22b2a83 100644 --- a/net/ipv4/udp.c +++ b/net/ipv4/udp.c @@ -2510,7 +2510,7 @@ struct sk_buff *skb_udp_tunnel_segment(struct sk_buff *skb, skb-protocol = htons(ETH_P_TEB); /* segment inner packet. */ - enc_features = skb-dev-hw_enc_features netif_skb_features(skb); + enc_features = skb-dev-hw_enc_features features; segs = skb_mac_gso_segment(skb, enc_features); if (!segs || IS_ERR(segs)) { skb_gso_error_unwind(skb, protocol, tnl_hlen, mac_offset, diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c index b2f0915..dc46eba 100644 --- a/net/ipv6/ip6_offload.c +++ b/net/ipv6/ip6_offload.c @@ -112,7 +112,7 @@ static struct sk_buff *ipv6_gso_segment(struct sk_buff *skb, encap = SKB_GSO_CB(skb)-encap_level 0; if (encap) - features = skb-dev-hw_enc_features netif_skb_features(skb); + features = skb-dev-hw_enc_features; SKB_GSO_CB(skb)-encap_level += sizeof(*ipv6h); ipv6h = ipv6_hdr(skb); diff --git a/net/mpls/mpls_gso.c b/net/mpls/mpls_gso.c index 851cd88..0c970cb 100644 --- a/net/mpls/mpls_gso.c +++ b/net/mpls/mpls_gso.c @@ -47,7 +47,7 @@ static struct sk_buff *mpls_gso_segment(struct sk_buff *skb, __skb_push(skb, skb-mac_len); /* Segment inner packet
Re: [PATCH net] bonding: fix bond dev flags after convert to arphrd_ether
Nikolay Aleksandrov ra...@blackwall.org wrote: From: Nikolay Aleksandrov niko...@cumulusnetworks.com If a bonding device enslaves devices != arphrd_ether it'll change types and if later these devices are released, it can enslave an arphrd_ether device and switch back calling ether_setup() which resets dev-flags to IFF_BROADCAST|IFF_MULTICAST and clears IFF_MASTER which then could lead to many different bugs. This bug seems to have been there since the introduction of ether_setup() in bond_enslave(). I thought the hack-around for the non-ethernet device problem was that, for non-ARPHRD_ETHER devices, the bonding master device is automatically destroyed when the last slave is released. This (well, this sort of thing) originally came up with IPoIB devices needing different ops that would disappear if the ipoib module was unloaded, so the bond master was deliberately unregistered when the last slave was released. Looking at the code, at first glance this appears to still be the case: bond_slave_netdev_event calls bond_release_and_destroy for != ARPHRD_ETHER, which in turn should unregister the bond itself if there are no slaves left. Is this no longer working as intended? -J Signed-off-by: Nikolay Aleksandrov niko...@cumulusnetworks.com Fixes: e36b9d16c6a6 (bonding: clean muticast addresses when device changes type) --- drivers/net/bonding/bond_main.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 317a49480475..8ba119896e55 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -1368,6 +1368,7 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev) bond_setup_by_slave(bond_dev, slave_dev); else { ether_setup(bond_dev); + bond_dev-flags |= IFF_MASTER; bond_dev-priv_flags = ~IFF_TX_SKB_SHARING; } -- 1.9.3 --- -Jay Vosburgh, jay.vosbu...@canonical.com -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Stable request for gso feature flag and error handling fixes
Please consider commit commit 1e16aa3ddf863c6b9f37eddf52503230a62dedb3 Author: Florian Westphal f...@strlen.de Date: Mon Oct 20 13:49:16 2014 +0200 net: gso: use feature flag argument in all protocol gso handlers and, at your discretion, the related commit commit 330966e501ffe282d7184fde4518d5e0c24bc7f8 Author: Florian Westphal f...@strlen.de Date: Mon Oct 20 13:49:17 2014 +0200 net: make skb_gso_segment error handling more robust for -stable kernels prior to 3.18 back to 3.10. We have observed kernel panics when an openvswitch bridge is populated with virtual devices (veth, for example) that have expansive feature sets that include NETIF_F_GSO_GRE. The failure occurs when foreign GRE encapsulated traffic (explicitly not including the initial packets of a connection) arrives at the system (likely via a switch flood event). The packets are GRO accumulated, and passed to the OVS receive processing. As the connection is not in the OVS kernel datapath table, the call path is: ovs_dp_upcall - queue_gso_packets - __skb_gso_segment(skb, NETIF_F_SG, false) Without the first patch cited above, __skb_gso_segment returns NULL, as the features from the device (including GSO_GRE) are used in place of the _SG feature supplied to the call. Without the second patch cited above, the kernel panics when it later dereferences the NULL skb pointer in queue_userspace_packet. Strictly speaking, with the first place applied the panic is avoided (as the NULL return does not occur), but including the second patch may still be prudent. Thanks, -J --- -Jay Vosburgh, jay.vosbu...@canonical.com -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Stable request for gso feature flag and error handling fixes
David Miller da...@davemloft.net wrote: From: Jay Vosburgh jay.vosbu...@canonical.com Date: Tue, 07 Jul 2015 17:38:50 -0700 Please consider commit When you ask me to consider commits for -stable you have to tell me what -stable releases you want me to submit them for. Currently I am only doing -stable submissions for 4.1.x, 3.18.x and 3.14.x I did say: for -stable kernels prior to 3.18 back to 3.10. So, this would be just for 3.14.x. My apologies if I buried that too far into the message. Are the other -stable tree maintainers picking up patches after you've submitted to 4.1/3.18/3.14, or is it necessary to make separate requests? -J --- -Jay Vosburgh, jay.vosbu...@canonical.com -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] bonding: primary_reselect with failure is not working properly
GMAIL ranamazh...@gmail.com wrote: Hi Jay, On Friday 03 July 2015 02:12 AM, Jay Vosburgh wrote: [ added netdev to cc ] Mazhar Rana ranamazh...@gmail.com wrote: When primary_reselect is set to failure, primary interface should not become active until current active slave is up. But if we set first I think you mean until current active slave is down here, not up. Yes, It should be up, grammatical mistake down, right? [...] Below is the updated version of your patch. Any Comments or suggestions ? [...] static struct slave *bond_find_best_slave(struct bonding *bond) { - struct slave *slave, *bestslave = NULL, *primary; + struct slave *slave = NULL, *bestslave = NULL, *primary; struct list_head *iter; int mintime = bond-params.updelay; primary = rtnl_dereference(bond-primary_slave); - if (primary primary-link == BOND_LINK_UP - bond_should_change_active(bond)) - return primary; + if (primary primary-link == BOND_LINK_UP) + slave = bond_choose_primary_or_current(bond); + + if (slave) + return slave; bond_for_each_slave(bond, slave, iter) { if (slave-link == BOND_LINK_UP) I think this will misbehave in the case that curr is up and available, but primary is NULL (this can happen when the primary option is cleared). In this case, the above code will not call bond_choose_primary_or_current, and will then run the loop to find a new curr, which may select a different slave unnecessarily. How does the following look? I prefer to make the call to choose_primary_or_current unconditional, and have it decide if the search loop should be run. In this version, _choose_ tests curr if prim is not suitable. Compile tested only. Thoughts? -J diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 19eb990..1e35e25 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -689,40 +689,57 @@ out: } -static bool bond_should_change_active(struct bonding *bond) +static struct slave *bond_choose_primary_or_current(struct bonding *bond) { struct slave *prim = rtnl_dereference(bond-primary_slave); struct slave *curr = rtnl_dereference(bond-curr_active_slave); - if (!prim || !curr || curr-link != BOND_LINK_UP) - return true; + if (!prim || !prim-link == BOND_LINK_UP) { + if (!curr || !curr-link == BOND_LINK_UP) + return NULL; + return curr; + } + if (bond-force_primary) { bond-force_primary = false; - return true; + return prim; + } + + if (!curr || curr-link != BOND_LINK_UP) + return prim; + + /* At this point, prim and curr are both up */ + switch (bond-params.primary_reselect) { + case BOND_PRI_RESELECT_ALWAYS: + return prim; + case BOND_PRI_RESELECT_BETTER: + if (prim-speed curr-speed) + return curr; + if (prim-speed == curr-speed prim-duplex = curr-duplex) + return curr; + return prim; + case BOND_PRI_RESELECT_FAILURE: + return curr; + default: + netdev_err(bond-dev, impossible primary_reselect %d\n, + bond-params.primary_reselect); + return curr; } - if (bond-params.primary_reselect == BOND_PRI_RESELECT_BETTER - (prim-speed curr-speed || -(prim-speed == curr-speed prim-duplex = curr-duplex))) - return false; - if (bond-params.primary_reselect == BOND_PRI_RESELECT_FAILURE) - return false; - return true; } /** - * find_best_interface - select the best available slave to be the active one + * bond_find_best_slave - select the best available slave to be the active one * @bond: our bonding struct */ static struct slave *bond_find_best_slave(struct bonding *bond) { - struct slave *slave, *bestslave = NULL, *primary; + struct slave *slave, *bestslave = NULL; struct list_head *iter; int mintime = bond-params.updelay; - primary = rtnl_dereference(bond-primary_slave); - if (primary primary-link == BOND_LINK_UP - bond_should_change_active(bond)) - return primary; + slave = bond_choose_primary_or_current(bond); + if (slave) + return slave; bond_for_each_slave(bond, slave, iter) { if (slave-link == BOND_LINK_UP) --- -Jay Vosburgh, jay.vosbu...@canonical.com -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] bonding: primary_reselect with failure is not working properly
= bond_choose_primary_or_current(bond); + if (slave) + return slave; bond_for_each_slave(bond, slave, iter) { if (slave-link == BOND_LINK_UP) --- -Jay Vosburgh, jay.vosbu...@canonical.com -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Issue with LACP mode in linux bonding driver
Ajith Adapa adapa.aj...@gmail.com wrote: On 26 June 2015 at 07:45, Jay Vosburgh jay.vosbu...@canonical.com wrote: echo 'module bonding =p' /sys/kernel/debug/dynamic_debug/control Hi, thanks for the reply. I tried this out a bit here, and could reproduce the problem on 3.13, but not on 4.0.0. A bit of checking suggests that this problem is fixed by the following commit: commit 63b46242f707849a1df10b70e026281bfa40e849 Author: Wilson Kok w...@cumulusnetworks.com Date: Mon Jan 26 01:16:59 2015 -0500 bonding: fix incorrect lacp mux state when agg not active which looks to have first appeared in the 4.0 kernel. I did a quick backport of that to 3.13 (leaving out the style and pr_debug changes), and it appears to resolve the problem. Ajith: can you test this patch? If this resolves the problem for you, we can request this patch for -stable to get it into the older kernels. diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c index dc0c56a..8c62f90 100644 --- a/drivers/net/bonding/bond_3ad.c +++ b/drivers/net/bonding/bond_3ad.c @@ -890,19 +890,23 @@ static void ad_mux_machine(struct port *port) case AD_MUX_ATTACHED: // check also if agg_select_timer expired(so the edable port will take place only after this timer) if ((port-sm_vars AD_PORT_SELECTED) (port-partner_oper.port_state AD_STATE_SYNCHRONIZATION) !__check_agg_selection_timer(port)) { - port-sm_mux_state = AD_MUX_COLLECTING_DISTRIBUTING;// next state + if (port-aggregator-is_active) + port-sm_mux_state = AD_MUX_COLLECTING_DISTRIBUTING;// next state } else if (!(port-sm_vars AD_PORT_SELECTED) || (port-sm_vars AD_PORT_STANDBY)) {// if UNSELECTED or STANDBY port-sm_vars = ~AD_PORT_READY_N; // in order to withhold the selection logic to check all ports READY_N value // every callback cycle to update ready variable, we check READY_N and update READY here __set_agg_ports_ready(port-aggregator, __agg_ports_are_ready(port-aggregator)); port-sm_mux_state = AD_MUX_DETACHED;// next state + } else if (port-aggregator-is_active) { + port-actor_oper_port_state |= + AD_STATE_SYNCHRONIZATION; } break; case AD_MUX_COLLECTING_DISTRIBUTING: if (!(port-sm_vars AD_PORT_SELECTED) || (port-sm_vars AD_PORT_STANDBY) || - !(port-partner_oper.port_state AD_STATE_SYNCHRONIZATION) - ) { + !(port-partner_oper.port_state AD_STATE_SYNCHRONIZATION) || + !(port-actor_oper_port_state AD_STATE_SYNCHRONIZATION)) { port-sm_mux_state = AD_MUX_ATTACHED;// next state } else { @@ -941,7 +945,12 @@ static void ad_mux_machine(struct port *port) break; case AD_MUX_ATTACHED: __attach_bond_to_agg(port); - port-actor_oper_port_state |= AD_STATE_SYNCHRONIZATION; + if (port-aggregator-is_active) + port-actor_oper_port_state |= + AD_STATE_SYNCHRONIZATION; + else + port-actor_oper_port_state = + ~AD_STATE_SYNCHRONIZATION; port-actor_oper_port_state = ~AD_STATE_COLLECTING; port-actor_oper_port_state = ~AD_STATE_DISTRIBUTING; ad_disable_collecting_distributing(port); @@ -950,6 +959,7 @@ static void ad_mux_machine(struct port *port) case AD_MUX_COLLECTING_DISTRIBUTING: port-actor_oper_port_state |= AD_STATE_COLLECTING; port-actor_oper_port_state |= AD_STATE_DISTRIBUTING; + port-actor_oper_port_state |= AD_STATE_SYNCHRONIZATION; ad_enable_collecting_distributing(port); port-ntt = true; break; @@ -1350,6 +1360,9 @@ static void ad_port_selection_logic(struct port *port) aggregator = __get_first_agg(port); ad_agg_selection_logic(aggregator); + + if (!port-aggregator-is_active) + port-actor_oper_port_state = ~AD_STATE_SYNCHRONIZATION; } /* -J --- -Jay Vosburgh, jay.vosbu...@canonical.com -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord
Re: Issue with LACP mode in linux bonding driver
Ajith Adapa adapa.aj...@gmail.com wrote: Hi, Sorry for direct mail. Since the question is more specific about supporting LACP standard I decided to communicate directly with the MAINTAINERS. My issue is related to multiaggregation support in LACP. I saw your message this morning, but didn't have an opportunity to look into it today. Linux Flavour: Centos7. Setup topology: Back to back connected Linux server and a L2 switch with 2 interfaces eth0 and eth1 (on both sides). On Switch I have mapped eth0 to po1 and eth1 to po2. On Linux server I have created a single bond interface with both interfaces eth0 and eth1. On switch both po1 and po2 has same system-id but the Actor key is different i.e. on PO1 it is 16385 and on PO2 it is 32768. As per the information available regarding bond0 on Linux server which is given below Active aggregator ID is 1 which is mapped to eth0. But we have observed that eth1 on Linux server is also sending LACPDUS with Collecting/Distributing bit set as 1. Which will result in single bond interface on Linux server is splitted into multiple port-channels on Switch causing duplication of frames on Linux server. I'd suggest enabling the dynamic_debug for the bonding driver and observe the state machine activity within the 802.3ad code. This is described in the Documentation/dynamic-debug-howto.txt that is part of the kernel source; off the top of my head, I think you'll need something like: echo 'module bonding =p' /sys/kernel/debug/dynamic_debug/control This should put the bonding LACP state machine activity into the system log. If a port on a non-active aggregator is actually in collecting / distributing state, that is probably bad, as I'd only expect that to be true for ports in the active aggregator. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] Experimental new bonding driver mode=batman
Michal Kubecek mkube...@suse.cz wrote: On Tue, May 19, 2015 at 04:29:45AM -0400, Patrick Simmons wrote: On 05/19/2015 03:49 AM, Michal Kubecek wrote: On Tue, May 19, 2015 at 02:09:43AM -0400, Patrick Simmons wrote: I've written a new mode for the kernel bonding driver. It's similar to the round-robin mode, but it keeps statistics on TCP resends so as to favor slave devices with more bandwidth when choosing where to send packets. I've tested it on two laptops using WiFi/Ethernet and it seems to work okay, but it should be considered experimental. A description of how is the mode supposed to work would be definitely helpful. Rationale: It's helpful for cases where the slave devices have significantly different or varying bandwidth. The reason I wrote it is to bond powerline networking and wireless networking adapters into a single interface for use with connecting to a MythTV server. Neither of these systems is particularly reliable with bandwidth, but mode=batman can adaptively figure out which network has more available bandwidth at any given moment. This is better than mode=round-robin which always balances everything 50/50. Thank you. But I rather meant some basic description of the algorithm used to achieve this goal. Both should be IMHO part of the commit message. Agreed; the concept sounds interesting, but without a detailed description of how it works it is difficult to evaluate its value. Regarding your analysis, I appreciate your comments, and I know it's rough, but I'm sorry to say I'm not really interested in doing much to improve its polish past where it is. If it fails some way when I try to deploy it, then I'll fix that, and maybe I'll play around with the balancing heuristics, but the code quality is what it is unless someone else wants to improve it. I would fix the indentation if that would make it acceptable for you to merge it, but not much more. My argument for merging it is basically it doesn't do anything unless you pass mode=batman, so what's the harm?. So, if you guys decide you don't want to merge it because of the global spinlock etc., that's cool and I understand, but I thought I should at least post to this list so you and any other potentially interested people know it exists. Oh, and, if you're not going to merge it, please let me know so I can know post the patch to GitHub or somewhere. And, if you could include a note in the comments at the top of bond_main.c or somewhere pointing people to the patch, I'd very much appreciate that. I don't want anyone else to have to endure hours of kernel rebuilds with KASAN enabled if they want this functionality :) Well, it's not my call, I'm not a bonding maintainer. But I believe at least some of the objections would be shared by them. Of course, it's up to you if you want to dedicate your time to improving the code to be acceptable for mainline or rather maintain it out of tree (which may end up taking even more time in the long term). Well, I am a bonding maintainer, and I can say that the patch in its current state is not suitable for inclusion. At a minimum, there are many coding style issues, commented out debug statements, etc, along with design issues (e.g., the batman mode handling in bond_handle_frame is unconditional and takes place for all modes, not just the new batman mode). If you (Patrick) or someone else wishes to contribute this to mainline, I'd suggest that the first step is to read and follow the instructions in Documentation/SubmittingPatches in the kernel source code. It is also not feasible to add pointers in the kernel source code to out-of-tree patches; sorry. -J --- -Jay Vosburgh, jay.vosbu...@canonical.com -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] bonding: simplify code and get rid of warning
Stephen Hemminger [EMAIL PROTECTED] wrote: Get rid of warning and simplify code that looks up vlan tag. No need to get tag, then copy it. Also no need for a local status variable. Granted, the current code is suboptimal, but I don't see any warnings compiling this (gcc 4.1.2). What are you getting? -J --- -Jay Vosburgh, IBM Linux Technology Center, [EMAIL PROTECTED] Signed-off-by: Stephen Hemminger [EMAIL PROTECTED] --- Patch against current 2.6.25 version. --- a/drivers/net/bonding/bond_alb.c 2008-02-18 20:58:53.0 -0800 +++ b/drivers/net/bonding/bond_alb.c 2008-02-18 21:01:10.0 -0800 @@ -678,12 +678,8 @@ static struct slave *rlb_choose_channel( } if (!list_empty(bond-vlan_list)) { - unsigned short vlan_id; - int res = vlan_get_tag(skb, vlan_id); - if (!res) { + if (!vlan_get_tag(skb, client_info-vlan_id)) client_info-tag = 1; - client_info-vlan_id = vlan_id; - } } if (!client_info-assigned) { -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Bugme-new] [Bug 9937] New: Bug in bonding driver - Kernel oops whenever driver is loaded with max_bonds parameter
Andrew Morton [EMAIL PROTECTED] wrote: Problem Description: Kernel oops whenever bonding driver with max_bonds=2 (or 2) is loaded ... I believe this is fixed by the following (from linux-2.6): From: Jay Vosburgh [EMAIL PROTECTED] Date: Tue, 29 Jan 2008 18:07:45 -0800 Subject: [PATCH] bonding: fix NULL pointer deref in startup processing Fix the are we creating a duplicate check to not compare the name if the name is NULL (meaning that the system should select a name). Bug reported by Benny Amorsen [EMAIL PROTECTED]. Signed-off-by: Jay Vosburgh [EMAIL PROTECTED] Signed-off-by: Jeff Garzik [EMAIL PROTECTED] Signed-off-by: David S. Miller [EMAIL PROTECTED] --- drivers/net/bonding/bond_main.c | 16 +--- 1 files changed, 9 insertions(+), 7 deletions(-) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 65c7eba..81b4574 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -4896,14 +4896,16 @@ int bond_create(char *name, struct bond_params *params, struct bonding **newbond down_write(bonding_rwsem); /* Check to see if the bond already exists. */ - list_for_each_entry_safe(bond, nxt, bond_dev_list, bond_list) - if (strnicmp(bond-dev-name, name, IFNAMSIZ) == 0) { - printk(KERN_ERR DRV_NAME + if (name) { + list_for_each_entry_safe(bond, nxt, bond_dev_list, bond_list) + if (strnicmp(bond-dev-name, name, IFNAMSIZ) == 0) { + printk(KERN_ERR DRV_NAME : cannot add bond %s; it already exists\n, - name); - res = -EPERM; - goto out_rtnl; - } + name); + res = -EPERM; + goto out_rtnl; + } + } bond_dev = alloc_netdev(sizeof(struct bonding), name ? name : , ether_setup); -- 1.5.2.4 -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Null pointer dereference when bringing up bonding device on kernel-2.6.24-2.fc9.i686
Siim Põder [EMAIL PROTECTED] wrote: Jay Vosburgh wrote: Benny Amorsen [EMAIL PROTECTED] wrote: https://bugzilla.redhat.com/show_bug.cgi?id=430391 I know what this is, I'll fix it. do you know when this happend, so we would know which kernel is ok to use (not to start trying blindly)? It was something I changed up during the 2.6.24-rc development somewhere. Also, I posted the fix for this a couple days ago, it's commit 5eb71eec3616b0a62e63197016576a74da240c6b in netdev-2.6#upstream. -J --- -Jay Vosburgh, IBM Linux Technology Center, [EMAIL PROTECTED] -- To unsubscribe from this list: send the line unsubscribe netdev in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html