from:"Jay Vosburgh"

Re: [PATCH v2 net] bonding: fix 802.3ad state sent to partner when unbinding slave

2018-11-27 Thread Jay Vosburgh

Toni Peltonen  wrote:

>Previously when unbinding a slave the 802.3ad implementation only told
>partner that the port is not suitable for aggregation by setting the port
>aggregation state from aggregatable to individual. This is not enough. If the
>physical layer still stays up and we only unbinded this port from the bond 
>there
>is nothing in the aggregation status alone to prevent the partner from sending
>traffic towards us. To ensure that the partner doesn't consider this
>port at all anymore we should also disable collecting and distributing to
>signal that this actor is going away. Also clear AD_STATE_SYNCHRONIZATION to
>ensure partner exits collecting + distributing state.
>
>I have tested this behaviour againts Arista EOS switches with mlx5 cards
>(physical link stays up even when interface is down) and simulated
>the same situation virtually Linux <-> Linux with two network namespaces
>running two veth device pairs. In both cases setting aggregation to
>individual doesn't alone prevent traffic from being to sent towards this
>port given that the link stays up in partners end. Partner still keeps
>it's end in collecting + distributing state and continues until timeout is
>reached. In most cases this means we are losing the traffic partner sends
>towards our port while we wait for timeout. This is most visible with slow
>periodic time (LACP rate slow).
>
>Other open source implementations like Open VSwitch and libreswitch, and
>vendor implementations like Arista EOS, seem to disable collecting +
>distributing to when doing similar port disabling/detaching/removing change.
>With this patch kernel implementation would behave the same way and ensure
>partner doesn't consider our actor viable anymore.
>
>Signed-off-by: Toni Peltonen 

Signed-off-by: Jay Vosburgh 

>---
>v2 changes:
>* Fix typo in commit message
>* Also clear AD_STATE_SYNCHRONIZATION
>---
> drivers/net/bonding/bond_3ad.c | 3 +++
> 1 file changed, 3 insertions(+)
>
>diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
>index f43fb2f958a5..93dfcef8afc4 100644
>--- a/drivers/net/bonding/bond_3ad.c
>+++ b/drivers/net/bonding/bond_3ad.c
>@@ -2086,6 +2086,9 @@ void bond_3ad_unbind_slave(struct slave *slave)
>  aggregator->aggregator_identifier);
> 
>   /* Tell the partner that this port is not suitable for aggregation */
>+  port->actor_oper_port_state &= ~AD_STATE_SYNCHRONIZATION;
>+  port->actor_oper_port_state &= ~AD_STATE_COLLECTING;
>+  port->actor_oper_port_state &= ~AD_STATE_DISTRIBUTING;
>   port->actor_oper_port_state &= ~AD_STATE_AGGREGATION;
>   __update_lacpdu_from_port(port);
>   ad_lacpdu_send(port);
>-- 
>2.19.0
>

Re: [PATCH net] bonding: fix 802.3ad state sent to partner when unbinding slave

2018-11-26 Thread Jay Vosburgh

Toni Peltonen  wrote:

>Previously when unbinding a slave the 802.3ad implementation only told
>partner that the port is not suitable for aggregation by setting the port
>aggregation state from aggregatable to individual. This is not enough. If the
>physical layer still stays up and we only unbinded this port from the bond 
>there
>is nothing in the aggregation status alone to prevent the partner from sending
>traffic towards us. To ensure that the partner doesn't consider this
>port at all anymore we should also disable collecting and distributing to
>signal that this actor is going away.
>
>I have tested this behaviour againts Arista EOS switches with mlx5 cards
>(physical link stays up when even when interface is down) and simulated
>the same situation virtually Linux <-> Linux with two network namespaces
>running two veth device pairs. In both cases setting aggregation to
>individual doesn't alone prevent traffic from being to sent towards this
>port given that the link stays up in partners end. Partner still keeps
>it's end in collecting + distributing state and continues until timeout is
>reached. In most cases this means we are losing the traffic partner sends
>towards our port while we wait for timeout. This is most visible with slow
>periodic time (LAPC rate slow).

"LAPC" -> "LACP"

>Other open source implementations like Open VSwitch and libreswitch, and
>vendor implementations like Arista EOS, seem to disable collecting +
>distributing to when doing similar port disabling/detaching/removing change.
>With this patch kernel implementation would behave the same way and ensure
>partner doesn't consider our actor viable anymore.

After re-reading the relevant bits of 802.1AX (particularly
5.4.9 on recordPDU and update_Selected) I'm going to suggest also
clearing AD_STATE_SYNCHRONIZATION, based on:

Partner_Oper_Port_State.Synchronization is also set to TRUE if
the value of Actor_State.Aggregation in the received PDU is set
to FALSE (i.e., indicates an Individual link),
Actor_State.Synchronization in the received PDU is set to TRUE,
and LACP will actively maintain the link.

Per the above, learing _SYNC in the LACPDU should un-sync the
port, inducing the Mux state machine (figure 5-15) to exit C_D state and
go to ATTACHED state (disabling Coll/Dist).

But, either way, as this is a hint to get the link partner to
stop using the port, this looks reasonable to me.  

Acked-by: Jay Vosburgh 

-J  

>Signed-off-by: Toni Peltonen 
>---
> drivers/net/bonding/bond_3ad.c | 2 ++
> 1 file changed, 2 insertions(+)
>
>diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
>index f43fb2f958a5..6776c33753dc 100644
>--- a/drivers/net/bonding/bond_3ad.c
>+++ b/drivers/net/bonding/bond_3ad.c
>@@ -2086,6 +2086,8 @@ void bond_3ad_unbind_slave(struct slave *slave)
>  aggregator->aggregator_identifier);
> 
>   /* Tell the partner that this port is not suitable for aggregation */
>+  port->actor_oper_port_state &= ~AD_STATE_COLLECTING;
>+  port->actor_oper_port_state &= ~AD_STATE_DISTRIBUTING;
>   port->actor_oper_port_state &= ~AD_STATE_AGGREGATION;
>   __update_lacpdu_from_port(port);
>   ad_lacpdu_send(port);
>-- 
>2.19.0

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH] bonding: fix length of actor system

2018-10-27 Thread Jay Vosburgh

Tobias Jungel  wrote:

>The attribute IFLA_BOND_AD_ACTOR_SYSTEM is sent to user space having the
>length of sizeof(bond->params.ad_actor_system) which is 8 byte. This
>patch aligns the length to ETH_ALEN to have the same MAC address exposed
>as using sysfs.
>
>fixes f87fda00b6ed2
>
>Signed-off-by: Tobias Jungel 

The patch looks fine to me, but the "fixes" line is not
formatted properly.  Please format it according to

Documentation/process/submitting-patches.rst

and resubmit your patch as V2.

-J

>---
> drivers/net/bonding/bond_netlink.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
>diff --git a/drivers/net/bonding/bond_netlink.c 
>b/drivers/net/bonding/bond_netlink.c
>index 9697977b80f0..6b9ad8673218 100644
>--- a/drivers/net/bonding/bond_netlink.c
>+++ b/drivers/net/bonding/bond_netlink.c
>@@ -638,8 +638,7 @@ static int bond_fill_info(struct sk_buff *skb,
>   goto nla_put_failure;
> 
>   if (nla_put(skb, IFLA_BOND_AD_ACTOR_SYSTEM,
>-  sizeof(bond->params.ad_actor_system),
>-  >params.ad_actor_system))
>+  ETH_ALEN, >params.ad_actor_system))
>   goto nla_put_failure;
>   }
>   if (!bond_3ad_get_active_agg_info(bond, )) {
>

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH net-next] net/ipv6: Block IPv6 addrconf on team ports

2018-10-25 Thread Jay Vosburgh

Chas Williams <3ch...@gmail.com> wrote:

>On 10/25/2018 05:59 PM, Jay Vosburgh wrote:
>> Chas Williams <3ch...@gmail.com> wrote:
>>
>>> netif_is_lag_port should be used to identify link aggregation ports.
>>> For this to work, we need to reorganize the bonding and team drivers
>>> so that the necessary flags are set before dev_open is called.
>>>
>>> commit 31e77c93e432 ("sched/fair: Update blocked load when newly idle")
>>> made this decision originally based on the IFF_SLAVE flag which isn't
>>> used by the team driver.  Note, we do need to retain the IFF_SLAVE
>>> check for the eql driver.
>>
>>  Is 31e77c93e432 the correct commit reference?  I don't see
>> anything in there about IFF_SLAVE or bonding; it's a patch to the
>> process scheduler.
>
>No, that's wrong.  It should be c2edacf80e155.
>
>>  And, as Jiri said, the subject doesn't mention bonding.
>
>The behavior of bonding wasn't changed.  The intent of the patch
>is to add team slaves to the interfaces that don't get automatic
>IPv6 addresses.  The body discusses why bonding had to change as
>well.

Sure, but the bonding code has changed, and the current
presentation makes it harder for reviewers to follow (or perhaps even
notice).

>I was under the impression that the subject needs to kept short.
>If there a better way to phrase what I want to do?

I'd suggest splitting this into three patches: A first patch
that adds the new IPv6 functionality, then one patch each for team and
bonding to take advantage of that new functionality.  Each of the three
would then be very straightforward, change just one thing, and should be
clearer all around.

-J

>>> Signed-off-by: Chas Williams <3ch...@gmail.com>
>>> ---
>>> drivers/net/bonding/bond_main.c | 4 ++--
>>> drivers/net/team/team.c | 7 +--
>>> net/ipv6/addrconf.c | 2 +-
>>> 3 files changed, 8 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/net/bonding/bond_main.c 
>>> b/drivers/net/bonding/bond_main.c
>>> index ffa37adb7681..5cdad164332b 100644
>>> --- a/drivers/net/bonding/bond_main.c
>>> +++ b/drivers/net/bonding/bond_main.c
>>> @@ -1536,6 +1536,7 @@ int bond_enslave(struct net_device *bond_dev, struct 
>>> net_device *slave_dev,
>>>
>>> /* set slave flag before open to prevent IPv6 addrconf */
>>> slave_dev->flags |= IFF_SLAVE;
>>> +   slave_dev->priv_flags |= IFF_BONDING;
>>>
>>> /* open the slave since the application closed it */
>>> res = dev_open(slave_dev);
>>> @@ -1544,7 +1545,6 @@ int bond_enslave(struct net_device *bond_dev, struct 
>>> net_device *slave_dev,
>>> goto err_restore_mac;
>>> }
>>>
>>> -   slave_dev->priv_flags |= IFF_BONDING;
>>> /* initialize slave stats */
>>> dev_get_stats(new_slave->dev, _slave->slave_stats);
>>>
>>> @@ -1804,10 +1804,10 @@ int bond_enslave(struct net_device *bond_dev, 
>>> struct net_device *slave_dev,
>>> slave_disable_netpoll(new_slave);
>>>
>>> err_close:
>>> -   slave_dev->priv_flags &= ~IFF_BONDING;
>>> dev_close(slave_dev);
>>>
>>> err_restore_mac:
>>> +   slave_dev->priv_flags &= ~IFF_BONDING;
>>> slave_dev->flags &= ~IFF_SLAVE;
>>> if (!bond->params.fail_over_mac ||
>>> BOND_MODE(bond) != BOND_MODE_ACTIVEBACKUP) {
>>> diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
>>> index db633ae9f784..8fc7d57e9f6d 100644
>>> --- a/drivers/net/team/team.c
>>> +++ b/drivers/net/team/team.c
>>> @@ -1128,14 +1128,12 @@ static int team_upper_dev_link(struct team *team, 
>>> struct team_port *port,
>>>_upper_info, extack);
>>> if (err)
>>> return err;
>>> -   port->dev->priv_flags |= IFF_TEAM_PORT;
>>> return 0;
>>> }
>>>
>>> static void team_upper_dev_unlink(struct team *team, struct team_port *port)
>>> {
>>> netdev_upper_dev_unlink(port->dev, team->dev);
>>> -   port->dev->priv_flags &= ~IFF_TEAM_PORT;
>>> }
>>>
>>> static void __team_port_change_port_added(struct team_port *port, bool 
>>> linkup);
>>> @@ -1214,6 +1212,9 @@ static int team_port_add(struct team *team, struct 
>>> net_device *port_dev,
>>> goto err_po

Re: [PATCH net-next] net/ipv6: Block IPv6 addrconf on team ports

2018-10-25 Thread Jay Vosburgh

Chas Williams <3ch...@gmail.com> wrote:

>netif_is_lag_port should be used to identify link aggregation ports.
>For this to work, we need to reorganize the bonding and team drivers
>so that the necessary flags are set before dev_open is called.
>
>commit 31e77c93e432 ("sched/fair: Update blocked load when newly idle")
>made this decision originally based on the IFF_SLAVE flag which isn't
>used by the team driver.  Note, we do need to retain the IFF_SLAVE
>check for the eql driver.

Is 31e77c93e432 the correct commit reference?  I don't see
anything in there about IFF_SLAVE or bonding; it's a patch to the
process scheduler.

And, as Jiri said, the subject doesn't mention bonding.

>Signed-off-by: Chas Williams <3ch...@gmail.com>
>---
> drivers/net/bonding/bond_main.c | 4 ++--
> drivers/net/team/team.c | 7 +--
> net/ipv6/addrconf.c | 2 +-
> 3 files changed, 8 insertions(+), 5 deletions(-)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index ffa37adb7681..5cdad164332b 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -1536,6 +1536,7 @@ int bond_enslave(struct net_device *bond_dev, struct 
>net_device *slave_dev,
> 
>   /* set slave flag before open to prevent IPv6 addrconf */
>   slave_dev->flags |= IFF_SLAVE;
>+  slave_dev->priv_flags |= IFF_BONDING;
> 
>   /* open the slave since the application closed it */
>   res = dev_open(slave_dev);
>@@ -1544,7 +1545,6 @@ int bond_enslave(struct net_device *bond_dev, struct 
>net_device *slave_dev,
>   goto err_restore_mac;
>   }
> 
>-  slave_dev->priv_flags |= IFF_BONDING;
>   /* initialize slave stats */
>   dev_get_stats(new_slave->dev, _slave->slave_stats);
> 
>@@ -1804,10 +1804,10 @@ int bond_enslave(struct net_device *bond_dev, struct 
>net_device *slave_dev,
>   slave_disable_netpoll(new_slave);
> 
> err_close:
>-  slave_dev->priv_flags &= ~IFF_BONDING;
>   dev_close(slave_dev);
> 
> err_restore_mac:
>+  slave_dev->priv_flags &= ~IFF_BONDING;
>   slave_dev->flags &= ~IFF_SLAVE;
>   if (!bond->params.fail_over_mac ||
>   BOND_MODE(bond) != BOND_MODE_ACTIVEBACKUP) {
>diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
>index db633ae9f784..8fc7d57e9f6d 100644
>--- a/drivers/net/team/team.c
>+++ b/drivers/net/team/team.c
>@@ -1128,14 +1128,12 @@ static int team_upper_dev_link(struct team *team, 
>struct team_port *port,
>  _upper_info, extack);
>   if (err)
>   return err;
>-  port->dev->priv_flags |= IFF_TEAM_PORT;
>   return 0;
> }
> 
> static void team_upper_dev_unlink(struct team *team, struct team_port *port)
> {
>   netdev_upper_dev_unlink(port->dev, team->dev);
>-  port->dev->priv_flags &= ~IFF_TEAM_PORT;
> }
> 
> static void __team_port_change_port_added(struct team_port *port, bool 
> linkup);
>@@ -1214,6 +1212,9 @@ static int team_port_add(struct team *team, struct 
>net_device *port_dev,
>   goto err_port_enter;
>   }
> 
>+  /* set slave flag before open to prevent IPv6 addrconf */
>+  port->dev->priv_flags |= IFF_TEAM_PORT;
>+
>   err = dev_open(port_dev);
>   if (err) {
>   netdev_dbg(dev, "Device %s opening failed\n",
>@@ -1292,6 +1293,7 @@ static int team_port_add(struct team *team, struct 
>net_device *port_dev,
>   dev_close(port_dev);
> 
> err_dev_open:
>+  port->dev->priv_flags &= ~IFF_TEAM_PORT;
>   team_port_leave(team, port);
>   team_port_set_orig_dev_addr(port);
> 
>@@ -1328,6 +1330,7 @@ static int team_port_del(struct team *team, struct 
>net_device *port_dev)
>   dev_uc_unsync(port_dev, dev);
>   dev_mc_unsync(port_dev, dev);
>   dev_close(port_dev);
>+  port->dev->priv_flags &= ~IFF_TEAM_PORT;
>   team_port_leave(team, port);
> 
>   __team_option_inst_mark_removed_port(team, port);
>diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
>index 45b84dd5c4eb..121f863022ed 100644
>--- a/net/ipv6/addrconf.c
>+++ b/net/ipv6/addrconf.c
>@@ -3482,7 +3482,7 @@ static int addrconf_notify(struct notifier_block *this, 
>unsigned long event,
> 
>   case NETDEV_UP:
>   case NETDEV_CHANGE:
>-  if (dev->flags & IFF_SLAVE)
>+  if (netif_is_lag_port(dev) || dev->flags & IFF_SLAVE)

    Note that netvsc_vf_join() also uses IFF_SLAVE in order skip
IPv6 addrconf for netvsc devices; I don't believe its usage will pass
netif_is_lag_port().  It looks like the above will work, but your commit
message mentions eql as the reason for retaining the IFF_SLAVE test, and
eql isn't the only user of IFF_SLAVE in this manner.

-J

>   break;
> 
>   if (idev && idev->cnf.disable_ipv6)
>-- 
>2.14.4
>

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH net 02/15] bonding: use netpoll_poll_dev() helper

2018-09-22 Thread Jay Vosburgh

Eric Dumazet  wrote:

>We want to allow NAPI drivers to no longer provide
>ndo_poll_controller() method, as it has been proven problematic.
>
>team driver must not look at its presence, but instead call
>netpoll_poll_dev() which factorize the needed actions.

This patch is for bonding, not team; otherwise LGTM.

Acked-by: Jay Vosburgh 

-J


>Signed-off-by: Eric Dumazet 
>Cc: Jay Vosburgh 
>Cc: Veaceslav Falico 
>Cc: Andy Gospodarek 
>---
> drivers/net/bonding/bond_main.c | 11 ++-
> 1 file changed, 2 insertions(+), 9 deletions(-)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index 
>a764a83f99dabe54585dbad7dba40b6601177c03..0d87e11e7f1d84537fe43d95249b1bd3a2ce291d
> 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -971,16 +971,13 @@ static void bond_poll_controller(struct net_device 
>*bond_dev)
>   struct slave *slave = NULL;
>   struct list_head *iter;
>   struct ad_info ad_info;
>-  struct netpoll_info *ni;
>-  const struct net_device_ops *ops;
> 
>   if (BOND_MODE(bond) == BOND_MODE_8023AD)
>   if (bond_3ad_get_active_agg_info(bond, _info))
>   return;
> 
>   bond_for_each_slave_rcu(bond, slave, iter) {
>-  ops = slave->dev->netdev_ops;
>-  if (!bond_slave_is_up(slave) || !ops->ndo_poll_controller)
>+  if (!bond_slave_is_up(slave))
>   continue;
> 
>   if (BOND_MODE(bond) == BOND_MODE_8023AD) {
>@@ -992,11 +989,7 @@ static void bond_poll_controller(struct net_device 
>*bond_dev)
>   continue;
>   }
> 
>-  ni = rcu_dereference_bh(slave->dev->npinfo);
>-  if (down_trylock(>dev_lock))
>-  continue;
>-  ops->ndo_poll_controller(slave->dev);
>-  up(>dev_lock);
>+  netpoll_poll_dev(slave->dev);
>   }
> }
> 
>-- 
>2.19.0.444.g18242da7ef-goog
>

Re: [PATCH next] bonding: pass link-local packets to bonding master also.

2018-07-16 Thread Jay Vosburgh

Mahesh Bandewar  wrote:

>From: Mahesh Bandewar 
>
>Commit b89f04c61efe ("bonding: deliver link-local packets with
>skb->dev set to link that packets arrived on") changed the behavior
>of how link-local-multicast packets are processed. The change in
>the behavior broke some legacy use cases where these packets are
>expected to arrive on bonding master device also.
>
>This patch passes the packet to the stack with the link it arrived
>on as well as passes to the bonding-master device to preserve the
>legacy use case.

Michal, can you test this?  I'm travelling this week and won't
be able to run the patch.

Mahesh, will this confuse LLDP, et al, daemons that, e.g., bind
to every possible interface and now see the same LLDP PDU (identical
Chassis ID, Port ID, et al, TLVs) on multiple interfaces?

Thanks,

-J

>Reported-by: Michal Soltys 
>Signed-off-by: Mahesh Bandewar 
>---
> drivers/net/bonding/bond_main.c | 17 +++--
> 1 file changed, 15 insertions(+), 2 deletions(-)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index 9a2ea3c1f949..1d3b7d8448f2 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -1177,9 +1177,22 @@ static rx_handler_result_t bond_handle_frame(struct 
>sk_buff **pskb)
>   }
>   }
> 
>-  /* don't change skb->dev for link-local packets */
>-  if (is_link_local_ether_addr(eth_hdr(skb)->h_dest))
>+  /* Link-local multicast packets should be passed to the
>+   * stack on the link they arrive as well as pass them to the
>+   * bond-master device. These packets are mostly usable when
>+   * stack receives it with the link on which they arrive
>+   * (e.g. LLDP) but there may be some legacy behavior that
>+   * expects these packets to appear on bonding master too.
>+   */
>+  if (is_link_local_ether_addr(eth_hdr(skb)->h_dest)) {
>+  struct sk_buff *nskb = skb_clone(skb, GFP_ATOMIC);
>+
>+  if (nskb) {
>+  nskb->dev = bond->dev;
>+  netif_rx(nskb);
>+  }
>   return RX_HANDLER_PASS;
>+      }
>   if (bond_should_deliver_exact_match(skb, slave, bond))
>   return RX_HANDLER_EXACT;
> 
>-- 
>2.18.0.203.gfac676dfb9-goog

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [BUG] bonded interfaces drop bpdu (stp) frames

2018-07-12 Thread Jay Vosburgh

Mahesh Bandewar (महेश बंडेवार) wrote:

>On Thu, Jul 12, 2018 at 11:03 AM, Jay Vosburgh
> wrote:
>> Michal Soltys  wrote:
>>
>>>On 07/12/2018 04:51 PM, Jay Vosburgh wrote:
>>>> Mahesh Bandewar (महेश बंडेवार) wrote:
>>>>
>>>>> On Wed, Jul 11, 2018 at 3:23 PM, Michal Soltys  wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> As weird as that sounds, this is what I observed today after bumping
>>>>>> kernel version. I have a setup where 2 bonds are attached to linux
>>>>>> bridge and physically are connected to two switches doing MSTP (and
>>>>>> linux bridge is just passing them).
>>>>>>
>>>>>> Initially I suspected some changes related to bridge code - but quick
>>>>>> peek at the code showed nothing suspicious - and the part of it that
>>>>>> explicitly passes stp frames if stp is not enabled has seen little
>>>>>> changes (e.g. per-port group_fwd_mask added recently). Furthermore - if
>>>>>> regular non-bonded interfaces are attached everything works fine.
>>>>>>
>>>>>> Just to be sure I detached the bond (802.3ad mode) and checked it with
>>>>>> simple tcpdump (ether proto \\stp) - and indeed no hello packets were
>>>>>> there (with them being present just fine on active enslaved interface,
>>>>>> or on the bond device in earlier kernels).
>>>>>>
>>>>>> If time permits I'll bisect tommorow to pinpoint the commit, but from
>>>>>> quick todays test - 4.9.x is working fine, while 4.16.16 (tested on
>>>>>> debian) and 4.17.3 (tested on archlinux) are failing.
>>>>>>
>>>>>> Unless this is already a known issue (or you have any suggestions what
>>>>>> could be responsible).
>>>>>>
>>>>> I believe these are link-local-multicast messages and sometime back a
>>>>> change went into to not pass those frames to the bonding master. This
>>>>> could be the side effect of that.
>>>>
>>>>  Mahesh, I suspect you're thinking of:
>>>>
>>>> commit b89f04c61efe3b7756434d693b9203cc0cce002e
>>>> Author: Chonggang Li 
>>>> Date:   Sun Apr 16 12:02:18 2017 -0700
>>>>
>>>>  bonding: deliver link-local packets with skb->dev set to link that 
>>>> packets arrived on
>>>>
>>>>  Michal, are you able to revert this patch and test?
>>>>
>>>>  -J
>>>>
>>>> ---
>>>>  -Jay Vosburgh, jay.vosbu...@canonical.com
>>>>
>>>
>>>
>>>Just tested - yes, reverting that patch solves the issues.
>>
>> Chonggang,
>>
>> Reading the changelog in your commit referenced above, I'm not
>> entirely sure what actual problem it is fixing.  Could you elaborate?
>>
>> As the patch appears to cause a regression, it needs to be
>> either fixed or reverted.
>>
>> Mahesh, you signed-off on it as well, perhaps you also have some
>> context?
>>
>
>I think the original idea behind it was to pass the LLDPDUs to the
>stack on the interface that they came on since this is considered to
>be link-local traffic and passing to bond-master would loose it's
>"linklocal-ness". This is true for LLDP and if you change the skb->dev
>of the packet, then you don't know which slave link it came on in
>(from LLDP consumer's perspective).
>
>I don't know much about STP but trunking two links and aggregating
>this link info through bond-master seems wrong. Just like LLDP, you
>are losing info specific to a link and the decision derived from that
>info could be wrong.
>
>Having said that, we determine "linklocal-ness" by looking at L2 and
>bondmaster shares this with lts slaves. So it does seem fair to pass
>those frames to the bonding-master but at the same time link-local
>traffic is supposed to be limited to the physical link (LLDP/STP/LACP
>etc). Your thoughts?

I agree the whole thing sounds kind of weird, but I'm curious as
to what Michal's actual use case is; he presumably has some practical
use for this, since he noticed that the behavior changed.

Michal, you mentioned MSTP and using 802.3ad (LACP) mode; how
does that combination work rationally given that the bond might send and
receive traffic across multiple slaves?  Or does the switch side bundle
the ports together into a single logical interface for MSTP purposes?
On the TX side, I think the bond will likely balance all STP frames to
just one slave.

As for a resolution, presuming that Michal has some reasonable
use case, I'm thinking along the lines of reverting the new (leave frame
attached to slave) behavior for the general case and adding a special
case for LLDP and friends to get the new behavior.  I'd like to avoid
adding any new options to bonding.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [BUG] bonded interfaces drop bpdu (stp) frames

2018-07-12 Thread Jay Vosburgh

Michal Soltys  wrote:

>On 07/12/2018 04:51 PM, Jay Vosburgh wrote:
>> Mahesh Bandewar (महेश बंडेवार) wrote:
>>
>>> On Wed, Jul 11, 2018 at 3:23 PM, Michal Soltys  wrote:
>>>>
>>>> Hi,
>>>>
>>>> As weird as that sounds, this is what I observed today after bumping
>>>> kernel version. I have a setup where 2 bonds are attached to linux
>>>> bridge and physically are connected to two switches doing MSTP (and
>>>> linux bridge is just passing them).
>>>>
>>>> Initially I suspected some changes related to bridge code - but quick
>>>> peek at the code showed nothing suspicious - and the part of it that
>>>> explicitly passes stp frames if stp is not enabled has seen little
>>>> changes (e.g. per-port group_fwd_mask added recently). Furthermore - if
>>>> regular non-bonded interfaces are attached everything works fine.
>>>>
>>>> Just to be sure I detached the bond (802.3ad mode) and checked it with
>>>> simple tcpdump (ether proto \\stp) - and indeed no hello packets were
>>>> there (with them being present just fine on active enslaved interface,
>>>> or on the bond device in earlier kernels).
>>>>
>>>> If time permits I'll bisect tommorow to pinpoint the commit, but from
>>>> quick todays test - 4.9.x is working fine, while 4.16.16 (tested on
>>>> debian) and 4.17.3 (tested on archlinux) are failing.
>>>>
>>>> Unless this is already a known issue (or you have any suggestions what
>>>> could be responsible).
>>>>
>>> I believe these are link-local-multicast messages and sometime back a
>>> change went into to not pass those frames to the bonding master. This
>>> could be the side effect of that.
>>
>>  Mahesh, I suspect you're thinking of:
>>
>> commit b89f04c61efe3b7756434d693b9203cc0cce002e
>> Author: Chonggang Li 
>> Date:   Sun Apr 16 12:02:18 2017 -0700
>>
>>  bonding: deliver link-local packets with skb->dev set to link that 
>> packets arrived on
>>
>>  Michal, are you able to revert this patch and test?
>>
>>  -J
>>
>> ---
>>  -Jay Vosburgh, jay.vosbu...@canonical.com
>>
>
>
>Just tested - yes, reverting that patch solves the issues.

Chonggang,

Reading the changelog in your commit referenced above, I'm not
entirely sure what actual problem it is fixing.  Could you elaborate?

As the patch appears to cause a regression, it needs to be
either fixed or reverted.

Mahesh, you signed-off on it as well, perhaps you also have some
context?

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [BUG] bonded interfaces drop bpdu (stp) frames

2018-07-12 Thread Jay Vosburgh

Mahesh Bandewar (महेश बंडेवार) wrote:

>On Wed, Jul 11, 2018 at 3:23 PM, Michal Soltys  wrote:
>>
>> Hi,
>>
>> As weird as that sounds, this is what I observed today after bumping
>> kernel version. I have a setup where 2 bonds are attached to linux
>> bridge and physically are connected to two switches doing MSTP (and
>> linux bridge is just passing them).
>>
>> Initially I suspected some changes related to bridge code - but quick
>> peek at the code showed nothing suspicious - and the part of it that
>> explicitly passes stp frames if stp is not enabled has seen little
>> changes (e.g. per-port group_fwd_mask added recently). Furthermore - if
>> regular non-bonded interfaces are attached everything works fine.
>>
>> Just to be sure I detached the bond (802.3ad mode) and checked it with
>> simple tcpdump (ether proto \\stp) - and indeed no hello packets were
>> there (with them being present just fine on active enslaved interface,
>> or on the bond device in earlier kernels).
>>
>> If time permits I'll bisect tommorow to pinpoint the commit, but from
>> quick todays test - 4.9.x is working fine, while 4.16.16 (tested on
>> debian) and 4.17.3 (tested on archlinux) are failing.
>>
>> Unless this is already a known issue (or you have any suggestions what
>> could be responsible).
>>
>I believe these are link-local-multicast messages and sometime back a
>change went into to not pass those frames to the bonding master. This
>could be the side effect of that.

Mahesh, I suspect you're thinking of:

commit b89f04c61efe3b7756434d693b9203cc0cce002e
Author: Chonggang Li 
Date:   Sun Apr 16 12:02:18 2017 -0700

bonding: deliver link-local packets with skb->dev set to link that packets 
arrived on

Michal, are you able to revert this patch and test?

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH net-next v2 4/4] bonding: allow carrier and link status to determine link state

2018-05-16 Thread Jay Vosburgh

Debabrata Banerjee <dbane...@akamai.com> wrote:

>In a mixed environment it may be difficult to tell if your hardware
>support carrier, if it does not it can always report true. With a new
>use_carrier option of 2, we can check both carrier and link status
>sequentially, instead of one or the other

Your reply in the prior discussion suggests to me that there is
a bug somewhere else (perhaps in bonding, perhaps in the network
driver), and this is just papering over it.  As I said, I don't believe
that an additional hack to bonding is the appropriate way to resolve
this.

-J

>Signed-off-by: Debabrata Banerjee <dbane...@akamai.com>
>---
> Documentation/networking/bonding.txt |  4 ++--
> drivers/net/bonding/bond_main.c  | 12 
> drivers/net/bonding/bond_options.c   |  7 ---
> 3 files changed, 14 insertions(+), 9 deletions(-)
>
>diff --git a/Documentation/networking/bonding.txt 
>b/Documentation/networking/bonding.txt
>index 9ba04c0bab8d..f063730e7e73 100644
>--- a/Documentation/networking/bonding.txt
>+++ b/Documentation/networking/bonding.txt
>@@ -828,8 +828,8 @@ use_carrier
>   MII / ETHTOOL ioctl method to determine the link state.
> 
>   A value of 1 enables the use of netif_carrier_ok(), a value of
>-  0 will use the deprecated MII / ETHTOOL ioctls.  The default
>-  value is 1.
>+  0 will use the deprecated MII / ETHTOOL ioctls. A value of 2
>+  will check both.  The default value is 1.
> 
> xmit_hash_policy
> 
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index f7f8a49cb32b..7e9652c4b35c 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -132,7 +132,7 @@ MODULE_PARM_DESC(downdelay, "Delay before considering link 
>down, "
>   "in milliseconds");
> module_param(use_carrier, int, 0);
> MODULE_PARM_DESC(use_carrier, "Use netif_carrier_ok (vs MII ioctls) in 
> miimon; "
>-"0 for off, 1 for on (default)");
>+"0 for off, 1 for on (default), 2 for carrier 
>then legacy checks");
> module_param(mode, charp, 0);
> MODULE_PARM_DESC(mode, "Mode of operation; 0 for balance-rr, "
>  "1 for active-backup, 2 for balance-xor, "
>@@ -434,12 +434,16 @@ static int bond_check_dev_link(struct bonding *bond,
>   int (*ioctl)(struct net_device *, struct ifreq *, int);
>   struct ifreq ifr;
>   struct mii_ioctl_data *mii;
>+  bool carrier = true;
> 
>   if (!reporting && !netif_running(slave_dev))
>   return 0;
> 
>   if (bond->params.use_carrier)
>-  return netif_carrier_ok(slave_dev) ? BMSR_LSTATUS : 0;
>+  carrier = netif_carrier_ok(slave_dev) ? BMSR_LSTATUS : 0;
>+
>+  if (!carrier)
>+  return carrier;
> 
>   /* Try to get link status using Ethtool first. */
>   if (slave_dev->ethtool_ops->get_link)
>@@ -4399,8 +4403,8 @@ static int bond_check_params(struct bond_params *params)
>   downdelay = 0;
>   }
> 
>-  if ((use_carrier != 0) && (use_carrier != 1)) {
>-  pr_warn("Warning: use_carrier module parameter (%d), not of 
>valid value (0/1), so it was set to 1\n",
>+  if (use_carrier < 0 || use_carrier > 2) {
>+  pr_warn("Warning: use_carrier module parameter (%d), not of 
>valid value (0-2), so it was set to 1\n",
>   use_carrier);
>   use_carrier = 1;
>   }
>diff --git a/drivers/net/bonding/bond_options.c 
>b/drivers/net/bonding/bond_options.c
>index 8a945c9341d6..dba6cef05134 100644
>--- a/drivers/net/bonding/bond_options.c
>+++ b/drivers/net/bonding/bond_options.c
>@@ -164,9 +164,10 @@ static const struct bond_opt_value 
>bond_primary_reselect_tbl[] = {
> };
> 
> static const struct bond_opt_value bond_use_carrier_tbl[] = {
>-  { "off", 0,  0},
>-  { "on",  1,  BOND_VALFLAG_DEFAULT},
>-  { NULL,  -1, 0}
>+  { "off",  0,  0},
>+  { "on",   1,  BOND_VALFLAG_DEFAULT},
>+  { "both", 2,  0},
>+  { NULL,  -1,  0}
> };
> 
> static const struct bond_opt_value bond_all_slaves_active_tbl[] = {
>-- 
>2.17.0

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH net-next 4/4] bonding: allow carrier and link status to determine link state

2018-05-11 Thread Jay Vosburgh

Debabrata Banerjee <dbane...@akamai.com> wrote:

>In a mixed environment it may be difficult to tell if your hardware
>support carrier, if it does not it can always report true. With a new
>use_carrier option of 2, we can check both carrier and link status
>sequentially, instead of one or the other

What do you mean by "mixed environment," and under what
circumstances are you seeing an actual benefit from doing the MII /
ethtool test in addition to the standard netif_carrier_ok test?

The use_carrier option was meant for backwards compatibility
with old-in-2005 device drivers, so this seem counterintuitive to me.  I
don't recall seeing any devices lacking netif_carrier support for some
time.  At this point, I would tend to argue that a new device driver
that does not implement netif_carrier support should be fixed, and not
have another hack added to bonding to work around it.

-J

>Signed-off-by: Debabrata Banerjee <dbane...@akamai.com>
>---
> Documentation/networking/bonding.txt |  4 ++--
> drivers/net/bonding/bond_main.c  | 12 
> drivers/net/bonding/bond_options.c   |  7 ---
> 3 files changed, 14 insertions(+), 9 deletions(-)
>
>diff --git a/Documentation/networking/bonding.txt 
>b/Documentation/networking/bonding.txt
>index 9ba04c0bab8d..f063730e7e73 100644
>--- a/Documentation/networking/bonding.txt
>+++ b/Documentation/networking/bonding.txt
>@@ -828,8 +828,8 @@ use_carrier
>   MII / ETHTOOL ioctl method to determine the link state.
> 
>   A value of 1 enables the use of netif_carrier_ok(), a value of
>-  0 will use the deprecated MII / ETHTOOL ioctls.  The default
>-  value is 1.
>+  0 will use the deprecated MII / ETHTOOL ioctls. A value of 2
>+  will check both.  The default value is 1.
> 
> xmit_hash_policy
> 
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index f7f8a49cb32b..7e9652c4b35c 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -132,7 +132,7 @@ MODULE_PARM_DESC(downdelay, "Delay before considering link 
>down, "
>   "in milliseconds");
> module_param(use_carrier, int, 0);
> MODULE_PARM_DESC(use_carrier, "Use netif_carrier_ok (vs MII ioctls) in 
> miimon; "
>-"0 for off, 1 for on (default)");
>+"0 for off, 1 for on (default), 2 for carrier 
>then legacy checks");
> module_param(mode, charp, 0);
> MODULE_PARM_DESC(mode, "Mode of operation; 0 for balance-rr, "
>  "1 for active-backup, 2 for balance-xor, "
>@@ -434,12 +434,16 @@ static int bond_check_dev_link(struct bonding *bond,
>   int (*ioctl)(struct net_device *, struct ifreq *, int);
>   struct ifreq ifr;
>   struct mii_ioctl_data *mii;
>+  bool carrier = true;
> 
>   if (!reporting && !netif_running(slave_dev))
>   return 0;
> 
>   if (bond->params.use_carrier)
>-  return netif_carrier_ok(slave_dev) ? BMSR_LSTATUS : 0;
>+  carrier = netif_carrier_ok(slave_dev) ? BMSR_LSTATUS : 0;
>+
>+  if (!carrier)
>+  return carrier;
> 
>   /* Try to get link status using Ethtool first. */
>   if (slave_dev->ethtool_ops->get_link)
>@@ -4399,8 +4403,8 @@ static int bond_check_params(struct bond_params *params)
>   downdelay = 0;
>   }
> 
>-  if ((use_carrier != 0) && (use_carrier != 1)) {
>-  pr_warn("Warning: use_carrier module parameter (%d), not of 
>valid value (0/1), so it was set to 1\n",
>+  if (use_carrier < 0 || use_carrier > 2) {
>+  pr_warn("Warning: use_carrier module parameter (%d), not of 
>valid value (0-2), so it was set to 1\n",
>   use_carrier);
>   use_carrier = 1;
>   }
>diff --git a/drivers/net/bonding/bond_options.c 
>b/drivers/net/bonding/bond_options.c
>index 8a945c9341d6..dba6cef05134 100644
>--- a/drivers/net/bonding/bond_options.c
>+++ b/drivers/net/bonding/bond_options.c
>@@ -164,9 +164,10 @@ static const struct bond_opt_value 
>bond_primary_reselect_tbl[] = {
> };
> 
> static const struct bond_opt_value bond_use_carrier_tbl[] = {
>-  { "off", 0,  0},
>-  { "on",  1,  BOND_VALFLAG_DEFAULT},
>-  { NULL,  -1, 0}
>+  { "off",  0,  0},
>+  { "on",   1,  BOND_VALFLAG_DEFAULT},
>+  { "both", 2,  0},
>+  { NULL,  -1,  0}
> };
> 
> static const struct bond_opt_value bond_all_slaves_active_tbl[] = {
>-- 
>2.17.0
>

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH net-next 3/4] bonding: allow use of tx hashing in balance-alb

2018-05-11 Thread Jay Vosburgh

Debabrata Banerjee  wrote:

>The rx load balancing provided by balance-alb is not mutually
>exclusive with using hashing for tx selection, and should provide a decent
>speed increase because this eliminates spinlocks and cache contention.
>
>Signed-off-by: Debabrata Banerjee 
>---
> drivers/net/bonding/bond_alb.c | 20 ++--
> drivers/net/bonding/bond_main.c| 25 +++--
> drivers/net/bonding/bond_options.c |  2 +-
> include/net/bonding.h  | 10 +-
> 4 files changed, 43 insertions(+), 14 deletions(-)
>
>diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
>index 180e50f7806f..6228635880d5 100644
>--- a/drivers/net/bonding/bond_alb.c
>+++ b/drivers/net/bonding/bond_alb.c
>@@ -1478,8 +1478,24 @@ int bond_alb_xmit(struct sk_buff *skb, struct 
>net_device *bond_dev)
>   }
> 
>   if (do_tx_balance) {
>-  hash_index = _simple_hash(hash_start, hash_size);
>-  tx_slave = tlb_choose_channel(bond, hash_index, skb->len);
>+  if (bond->params.tlb_dynamic_lb) {
>+  hash_index = _simple_hash(hash_start, hash_size);
>+  tx_slave = tlb_choose_channel(bond, hash_index, 
>skb->len);
>+  } else {
>+  /*
>+   * do_tx_balance means we are free to select the 
>tx_slave
>+   * So we do exactly what tlb would do for hash selection
>+   */
>+
>+  struct bond_up_slave *slaves;
>+  unsigned int count;
>+
>+  slaves = rcu_dereference(bond->slave_arr);
>+  count = slaves ? READ_ONCE(slaves->count) : 0;
>+  if (likely(count))
>+  tx_slave = slaves->arr[bond_xmit_hash(bond, 
>skb) %
>+ count];
>+  }
>   }
> 
>   return bond_do_alb_xmit(skb, bond, tx_slave);
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index 1f1e97b26f95..f7f8a49cb32b 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -159,7 +159,7 @@ module_param(min_links, int, 0);
> MODULE_PARM_DESC(min_links, "Minimum number of available links before turning 
> on carrier");
> 
> module_param(xmit_hash_policy, charp, 0);
>-MODULE_PARM_DESC(xmit_hash_policy, "balance-xor and 802.3ad hashing method; "
>+MODULE_PARM_DESC(xmit_hash_policy, "balance-alb, balance-tlb, balance-xor, 
>802.3ad hashing method; "
>  "0 for layer 2 (default), 1 for layer 3+4, "
>  "2 for layer 2+3, 3 for encap layer 2+3, "
>  "4 for encap layer 3+4");
>@@ -1735,7 +1735,7 @@ int bond_enslave(struct net_device *bond_dev, struct 
>net_device *slave_dev,
>   unblock_netpoll_tx();
>   }
> 
>-  if (bond_mode_uses_xmit_hash(bond))
>+  if (bond_mode_can_use_xmit_hash(bond))
>   bond_update_slave_arr(bond, NULL);
> 
>   bond->nest_level = dev_get_nest_level(bond_dev);
>@@ -1870,7 +1870,7 @@ static int __bond_release_one(struct net_device 
>*bond_dev,
>   if (BOND_MODE(bond) == BOND_MODE_8023AD)
>   bond_3ad_unbind_slave(slave);
> 
>-  if (bond_mode_uses_xmit_hash(bond))
>+  if (bond_mode_can_use_xmit_hash(bond))
>   bond_update_slave_arr(bond, slave);
> 
>   netdev_info(bond_dev, "Releasing %s interface %s\n",
>@@ -3102,7 +3102,7 @@ static int bond_slave_netdev_event(unsigned long event,
>* events. If these (miimon/arpmon) parameters are configured
>* then array gets refreshed twice and that should be fine!
>*/
>-  if (bond_mode_uses_xmit_hash(bond))
>+  if (bond_mode_can_use_xmit_hash(bond))
>   bond_update_slave_arr(bond, NULL);
>   break;
>   case NETDEV_CHANGEMTU:
>@@ -3322,7 +3322,7 @@ static int bond_open(struct net_device *bond_dev)
>*/
>   if (bond_alb_initialize(bond, (BOND_MODE(bond) == 
> BOND_MODE_ALB)))
>   return -ENOMEM;
>-  if (bond->params.tlb_dynamic_lb)
>+  if (bond->params.tlb_dynamic_lb || BOND_MODE(bond) == 
>BOND_MODE_ALB)
>   queue_delayed_work(bond->wq, >alb_work, 0);
>   }
> 
>@@ -3341,7 +3341,7 @@ static int bond_open(struct net_device *bond_dev)
>   bond_3ad_initiate_agg_selection(bond, 1);
>   }
> 
>-  if (bond_mode_uses_xmit_hash(bond))
>+  if (bond_mode_can_use_xmit_hash(bond))
>   bond_update_slave_arr(bond, NULL);
> 
>   return 0;
>@@ -3892,7 +3892,7 @@ static void bond_slave_arr_handler(struct work_struct 
>*work)
>  * to determine the slave interface -
>  * (a) BOND_MODE_8023AD
>  * (b) BOND_MODE_XOR
>- * (c)

Re: [PATCH net-next 2/4] bonding: use common mac addr checks

2018-05-11 Thread Jay Vosburgh

Banerjee, Debabrata <dbane...@akamai.com> wrote:

>> From: Jay Vosburgh [mailto:jay.vosbu...@canonical.com]
>> Debabrata Banerjee <dbane...@akamai.com> wrote:
>
>> >-   if
>> (!ether_addr_equal_64bits(rx_hash_table[index].mac_dst,
>> >-mac_bcast) &&
>> >-
>> !is_zero_ether_addr(rx_hash_table[index].mac_dst)) {
>> >+   if
>> (is_valid_ether_addr(rx_hash_table[index].mac_dst)) {
>> 
>>  This change and the similar ones below will now fail non-broadcast
>> multicast Ethernet addresses, where the prior code would not.  Is this an
>> intentional change?
>
>Yes I don't see how it makes sense to use multicast addresses at all, but I 
>may be missing something. It's also illegal according to rfc1812 3.3.2, but 
>obviously this balancing mode is trying to be very clever. We probably 
>shouldn't violate the rfc anyway.

Fair enough, but I think it would be good to call this out in
the change log just in case it does somehow cause a regression.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH net-next 2/4] bonding: use common mac addr checks

2018-05-11 Thread Jay Vosburgh

Debabrata Banerjee  wrote:

>Replace homegrown mac addr checks with faster defs from etherdevice.h
>
>Signed-off-by: Debabrata Banerjee 
>---
> drivers/net/bonding/bond_alb.c | 28 +---
> 1 file changed, 9 insertions(+), 19 deletions(-)
>
>diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
>index c2f6c58e4e6a..180e50f7806f 100644
>--- a/drivers/net/bonding/bond_alb.c
>+++ b/drivers/net/bonding/bond_alb.c
>@@ -40,11 +40,6 @@
> #include 
> #include 
> 
>-
>-
>-static const u8 mac_bcast[ETH_ALEN + 2] __long_aligned = {
>-  0xff, 0xff, 0xff, 0xff, 0xff, 0xff
>-};
> static const u8 mac_v6_allmcast[ETH_ALEN + 2] __long_aligned = {
>   0x33, 0x33, 0x00, 0x00, 0x00, 0x01
> };
>@@ -420,9 +415,7 @@ static void rlb_clear_slave(struct bonding *bond, struct 
>slave *slave)
> 
>   if (assigned_slave) {
>   rx_hash_table[index].slave = assigned_slave;
>-  if 
>(!ether_addr_equal_64bits(rx_hash_table[index].mac_dst,
>-   mac_bcast) &&
>-  
>!is_zero_ether_addr(rx_hash_table[index].mac_dst)) {
>+  if 
>(is_valid_ether_addr(rx_hash_table[index].mac_dst)) {

This change and the similar ones below will now fail
non-broadcast multicast Ethernet addresses, where the prior code would
not.  Is this an intentional change?

-J

>   bond_info->rx_hashtbl[index].ntt = 1;
>   bond_info->rx_ntt = 1;
>   /* A slave has been removed from the
>@@ -525,8 +518,7 @@ static void rlb_req_update_slave_clients(struct bonding 
>*bond, struct slave *sla
>   client_info = &(bond_info->rx_hashtbl[hash_index]);
> 
>   if ((client_info->slave == slave) &&
>-  !ether_addr_equal_64bits(client_info->mac_dst, mac_bcast) &&
>-  !is_zero_ether_addr(client_info->mac_dst)) {
>+  is_valid_ether_addr(client_info->mac_dst)) {
>   client_info->ntt = 1;
>   ntt = 1;
>   }
>@@ -567,8 +559,7 @@ static void rlb_req_update_subnet_clients(struct bonding 
>*bond, __be32 src_ip)
>   if ((client_info->ip_src == src_ip) &&
>   !ether_addr_equal_64bits(client_info->slave->dev->dev_addr,
>bond->dev->dev_addr) &&
>-  !ether_addr_equal_64bits(client_info->mac_dst, mac_bcast) &&
>-  !is_zero_ether_addr(client_info->mac_dst)) {
>+  is_valid_ether_addr(client_info->mac_dst)) {
>   client_info->ntt = 1;
>   bond_info->rx_ntt = 1;
>   }
>@@ -596,7 +587,7 @@ static struct slave *rlb_choose_channel(struct sk_buff 
>*skb, struct bonding *bon
>   if ((client_info->ip_src == arp->ip_src) &&
>   (client_info->ip_dst == arp->ip_dst)) {
>   /* the entry is already assigned to this client */
>-  if (!ether_addr_equal_64bits(arp->mac_dst, mac_bcast)) {
>+  if (!is_broadcast_ether_addr(arp->mac_dst)) {
>   /* update mac address from arp */
>   ether_addr_copy(client_info->mac_dst, 
> arp->mac_dst);
>   }
>@@ -644,8 +635,7 @@ static struct slave *rlb_choose_channel(struct sk_buff 
>*skb, struct bonding *bon
>   ether_addr_copy(client_info->mac_src, arp->mac_src);
>   client_info->slave = assigned_slave;
> 
>-  if (!ether_addr_equal_64bits(client_info->mac_dst, mac_bcast) &&
>-  !is_zero_ether_addr(client_info->mac_dst)) {
>+  if (is_valid_ether_addr(client_info->mac_dst)) {
>   client_info->ntt = 1;
>   bond->alb_info.rx_ntt = 1;
>   } else {
>@@ -1418,9 +1408,9 @@ int bond_alb_xmit(struct sk_buff *skb, struct net_device 
>*bond_dev)
>   case ETH_P_IP: {
>   const struct iphdr *iph = ip_hdr(skb);
> 
>-  if (ether_addr_equal_64bits(eth_data->h_dest, mac_bcast) ||
>-  (iph->daddr == ip_bcast) ||
>-  (iph->protocol == IPPROTO_IGMP)) {
>+  if (is_broadcast_ether_addr(eth_data->h_dest) ||
>+  iph->daddr == ip_bcast ||
>+  iph->protocol == IPPROTO_IGMP) {
>   do_tx_balance = false;
>   break;
>   }
>@@ -1432,7 +1422,7 @@ int bond_alb_xmit(struct sk_buff *skb, struct net_device 
>*bond_dev)
>   /* IPv6 doesn't really use broadcast mac address, but leave
>* that here just in case.
>*/
>-  if

[PATCH v2 net-next] virtio-net: Fix operstate for virtio when no VIRTIO_NET_F_STATUS

2018-03-22 Thread Jay Vosburgh

The operstate update logic will leave an interface in the
default UNKNOWN operstate if the interface carrier state never changes
from the default carrier up state set at creation.  This includes the
case of an explicit call to netif_carrier_on, as the carrier on to on
transition has no effect on operstate.

This affects virtio-net for the case that the virtio peer does
not support VIRTIO_NET_F_STATUS (the feature that provides carrier state
updates).  Without this feature, the virtio specification states that
"the link should be assumed active," so, logically, the operstate should
be UP instead of UNKNOWN.  This has impact on user space applications
that use the operstate to make availability decisions for the interface.

Resolve this by changing the virtio probe logic slightly to call
netif_carrier_off for both the "with" and "without" VIRTIO_NET_F_STATUS
cases, and then the existing call to netif_carrier_on for the "without"
case will cause an operstate transition.

Cc: "Michael S. Tsirkin" <m...@redhat.com>
Cc: Jason Wang <jasow...@redhat.com>
Cc: Ben Hutchings <b...@decadent.org.uk>
Signed-off-by: Jay Vosburgh <jay.vosbu...@canonical.com>

---

I considered resolving this by changing linkwatch_init_dev to
unconditionally call rfc2863_policy, as that would always set operstate
for all interfaces.

This would not have any impact on most cases (as most drivers
call netif_carrier_off during probe), except for the loopback device,
which currently has an operstate of UNKNOWN (because it never does any
carrier state transitions).  This change would add a round trip on the
dev_base_lock for every loopback device creation, which could have a
negative impact when creating many loopback devices, e.g., when
concurrently creating large numbers of containers.


 drivers/net/virtio_net.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 23374603e4d9..7b187ec7411e 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2857,8 +2857,8 @@ static int virtnet_probe(struct virtio_device *vdev)
 
/* Assume link up if device can't report link status,
   otherwise get link status from config. */
+   netif_carrier_off(dev);
if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_STATUS)) {
-   netif_carrier_off(dev);
schedule_work(>config_work);
} else {
vi->status = VIRTIO_NET_S_LINK_UP;
-- 
2.14.1

Re: [PATCH net] virtio-net: Fix operstate for virtio when no VIRTIO_NET_F_STATUS

2018-03-22 Thread Jay Vosburgh

Michael S. Tsirkin <m...@redhat.com> wrote:

>On Thu, Mar 22, 2018 at 12:02:10PM +0000, Jay Vosburgh wrote:
>> Michael S. Tsirkin <m...@redhat.com> wrote:
>> 
>> >On Thu, Mar 22, 2018 at 09:05:52AM +, Jay Vosburgh wrote:
>> >>   The operstate update logic will leave an interface in the
>> >> default UNKNOWN operstate if the interface carrier state never changes
>> >> from the default carrier up state set at creation.  This includes the
>> >> case of an explicit call to netif_carrier_on, as the carrier on to on
>> >> transition has no effect on operstate.
>> >> 
>> >>   This affects virtio-net for the case that the virtio peer does
>> >> not support VIRTIO_NET_F_STATUS (the feature that provides carrier state
>> >> updates).  Without this feature, the virtio specification states that
>> >> "the link should be assumed active," so, logically, the operstate should
>> >> be UP instead of UNKNOWN.  This has impact on user space applications
>> >> that use the operstate to make availability decisions for the interface.
>> >> 
>> >>   Resolve this by changing the virtio probe logic slightly to call
>> >> netif_carrier_off for both the "with" and "without" VIRTIO_NET_F_STATUS
>> >> cases, and then the existing call to netif_carrier_on for the "without"
>> >> case will cause an operstate transition.
>> >> 
>> >> Cc: "Michael S. Tsirkin" <m...@redhat.com>
>> >> Cc: Jason Wang <jasow...@redhat.com>
>> >> Cc: Ben Hutchings <b...@decadent.org.uk>
>> >> Fixes: 167c25e4c550 ("virtio-net: init link state correctly")
>> >
>> >I'd say that's an abuse of this notation. openstate was UNKNOWN
>> >even before that fix.
>> 
>>  I went back to the commit that added the dependency on
>> VIRTIO_NET_F_STATUS (and that this patch would thus apply on top of).
>> If that's an issue, I can resubmit without it.
>> 
>>  -J
>
>The patch can be trivially backported to any version that has virtio.
>
>The issue was present since the original version of virtio.
>VIRTIO_NET_F_STATUS fixed it for new devices.
>So the tag is incorrectly blaming a partial fix for not being a full
>one.
>
>Also, I think it's more appropriate for net-next - it's a
>minor ABI change (previously presence of VIRTIO_NET_F_STATUS
>could be detected by looking at operstate, now it can't).
>Hopefully this makes more apps work than it breaks.
>
>So yes, pls repost without Fixes and with net-next unless
>davem can make the change himself.

Reposting with requested changes.

-J

>> >> Signed-off-by: Jay Vosburgh <jay.vosbu...@canonical.com>
>> >
>> >Acked-by: Michael S. Tsirkin <m...@redhat.com>
>> >
>> >
>> >> ---
>> >> 
>> >>   I considered resolving this by changing linkwatch_init_dev to
>> >> unconditionally call rfc2863_policy, as that would always set operstate
>> >> for all interfaces.
>> >> 
>> >>   This would not have any impact on most cases (as most drivers
>> >> call netif_carrier_off during probe), except for the loopback device,
>> >> which currently has an operstate of UNKNOWN (because it never does any
>> >> carrier state transitions).  This change would add a round trip on the
>> >> dev_base_lock for every loopback device creation, which could have a
>> >> negative impact when creating many loopback devices, e.g., when
>> >> concurrently creating large numbers of containers.
>> >> 
>> >> 
>> >>  drivers/net/virtio_net.c | 2 +-
>> >>  1 file changed, 1 insertion(+), 1 deletion(-)
>> >> 
>> >> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> >> index 23374603e4d9..7b187ec7411e 100644
>> >> --- a/drivers/net/virtio_net.c
>> >> +++ b/drivers/net/virtio_net.c
>> >> @@ -2857,8 +2857,8 @@ static int virtnet_probe(struct virtio_device *vdev)
>> >>  
>> >>   /* Assume link up if device can't report link status,
>> >>  otherwise get link status from config. */
>> >> + netif_carrier_off(dev);
>> >>   if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_STATUS)) {
>> >> - netif_carrier_off(dev);
>> >>   schedule_work(>config_work);
>> >>   } else {
>> >>   vi->status = VIRTIO_NET_S_LINK_UP;
>> >> -- 
>> >> 2.14.1

Re: [PATCH net] virtio-net: Fix operstate for virtio when no VIRTIO_NET_F_STATUS

2018-03-22 Thread Jay Vosburgh

Michael S. Tsirkin <m...@redhat.com> wrote:

>On Thu, Mar 22, 2018 at 09:05:52AM +0000, Jay Vosburgh wrote:
>>  The operstate update logic will leave an interface in the
>> default UNKNOWN operstate if the interface carrier state never changes
>> from the default carrier up state set at creation.  This includes the
>> case of an explicit call to netif_carrier_on, as the carrier on to on
>> transition has no effect on operstate.
>> 
>>  This affects virtio-net for the case that the virtio peer does
>> not support VIRTIO_NET_F_STATUS (the feature that provides carrier state
>> updates).  Without this feature, the virtio specification states that
>> "the link should be assumed active," so, logically, the operstate should
>> be UP instead of UNKNOWN.  This has impact on user space applications
>> that use the operstate to make availability decisions for the interface.
>> 
>>  Resolve this by changing the virtio probe logic slightly to call
>> netif_carrier_off for both the "with" and "without" VIRTIO_NET_F_STATUS
>> cases, and then the existing call to netif_carrier_on for the "without"
>> case will cause an operstate transition.
>> 
>> Cc: "Michael S. Tsirkin" <m...@redhat.com>
>> Cc: Jason Wang <jasow...@redhat.com>
>> Cc: Ben Hutchings <b...@decadent.org.uk>
>> Fixes: 167c25e4c550 ("virtio-net: init link state correctly")
>
>I'd say that's an abuse of this notation. openstate was UNKNOWN
>even before that fix.

I went back to the commit that added the dependency on
VIRTIO_NET_F_STATUS (and that this patch would thus apply on top of).
If that's an issue, I can resubmit without it.

-J

>> Signed-off-by: Jay Vosburgh <jay.vosbu...@canonical.com>
>
>Acked-by: Michael S. Tsirkin <m...@redhat.com>
>
>
>> ---
>> 
>>  I considered resolving this by changing linkwatch_init_dev to
>> unconditionally call rfc2863_policy, as that would always set operstate
>> for all interfaces.
>> 
>>  This would not have any impact on most cases (as most drivers
>> call netif_carrier_off during probe), except for the loopback device,
>> which currently has an operstate of UNKNOWN (because it never does any
>> carrier state transitions).  This change would add a round trip on the
>> dev_base_lock for every loopback device creation, which could have a
>> negative impact when creating many loopback devices, e.g., when
>> concurrently creating large numbers of containers.
>> 
>> 
>>  drivers/net/virtio_net.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>> index 23374603e4d9..7b187ec7411e 100644
>> --- a/drivers/net/virtio_net.c
>> +++ b/drivers/net/virtio_net.c
>> @@ -2857,8 +2857,8 @@ static int virtnet_probe(struct virtio_device *vdev)
>>  
>>  /* Assume link up if device can't report link status,
>> otherwise get link status from config. */
>> +netif_carrier_off(dev);
>>  if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_STATUS)) {
>> -netif_carrier_off(dev);
>>  schedule_work(>config_work);
>>  } else {
>>  vi->status = VIRTIO_NET_S_LINK_UP;
>> -- 
>> 2.14.1

[PATCH net] virtio-net: Fix operstate for virtio when no VIRTIO_NET_F_STATUS

2018-03-22 Thread Jay Vosburgh

The operstate update logic will leave an interface in the
default UNKNOWN operstate if the interface carrier state never changes
from the default carrier up state set at creation.  This includes the
case of an explicit call to netif_carrier_on, as the carrier on to on
transition has no effect on operstate.

This affects virtio-net for the case that the virtio peer does
not support VIRTIO_NET_F_STATUS (the feature that provides carrier state
updates).  Without this feature, the virtio specification states that
"the link should be assumed active," so, logically, the operstate should
be UP instead of UNKNOWN.  This has impact on user space applications
that use the operstate to make availability decisions for the interface.

Resolve this by changing the virtio probe logic slightly to call
netif_carrier_off for both the "with" and "without" VIRTIO_NET_F_STATUS
cases, and then the existing call to netif_carrier_on for the "without"
case will cause an operstate transition.

Cc: "Michael S. Tsirkin" <m...@redhat.com>
Cc: Jason Wang <jasow...@redhat.com>
Cc: Ben Hutchings <b...@decadent.org.uk>
Fixes: 167c25e4c550 ("virtio-net: init link state correctly")
Signed-off-by: Jay Vosburgh <jay.vosbu...@canonical.com>

---

I considered resolving this by changing linkwatch_init_dev to
unconditionally call rfc2863_policy, as that would always set operstate
for all interfaces.

This would not have any impact on most cases (as most drivers
call netif_carrier_off during probe), except for the loopback device,
which currently has an operstate of UNKNOWN (because it never does any
carrier state transitions).  This change would add a round trip on the
dev_base_lock for every loopback device creation, which could have a
negative impact when creating many loopback devices, e.g., when
concurrently creating large numbers of containers.


 drivers/net/virtio_net.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 23374603e4d9..7b187ec7411e 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2857,8 +2857,8 @@ static int virtnet_probe(struct virtio_device *vdev)
 
/* Assume link up if device can't report link status,
   otherwise get link status from config. */
+   netif_carrier_off(dev);
if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_STATUS)) {
-   netif_carrier_off(dev);
schedule_work(>config_work);
} else {
vi->status = VIRTIO_NET_S_LINK_UP;
-- 
2.14.1

[PATCH net] bonding: fix slave stuck in BOND_LINK_FAIL state

2017-11-07 Thread Jay Vosburgh

The bonding miimon logic has a flaw, in that a failure of the
rtnl_trylock can cause a slave to become permanently stuck in
BOND_LINK_FAIL state.

The sequence of events to cause this is as follows:

1) bond_miimon_inspect finds that a slave's link is down, and so
calls bond_propose_link_state, setting slave->new_link_state to
BOND_LINK_FAIL, then sets slave->new_link to BOND_LINK_DOWN and returns
non-zero.

2) In bond_mii_monitor, the rtnl_trylock fails, and the timer is
rescheduled.  No change is committed.

3) bond_miimon_inspect is called again, but this time the slave
from step 1 has recovered.  slave->new_link is reset to NOCHANGE, and, as
slave->link was never changed, the switch enters the BOND_LINK_UP case,
and does nothing.  The pending BOND_LINK_FAIL state from step 1 remains
pending, as new_link_state is not reset.

4) The state from step 3 persists until another slave changes link
state and causes bond_miimon_inspect to return non-zero.  At this point,
the BOND_LINK_FAIL state change on the slave from steps 1-3 is committed,
and the slave will remain stuck in BOND_LINK_FAIL state even though it
is actually link up.

The remedy for this is to initialize new_link_state on each entry
to bond_miimon_inspect, as is already done with new_link.

Reported-by: Alex Sidorenko <alexandre.sidore...@hpe.com>
Reviewed-by: Jarod Wilson <ja...@redhat.com>
Signed-off-by: Jay Vosburgh <jay.vosbu...@canonical.com>
Fixes: fb9eb899a6dc ("bonding: handle link transition from FAIL to UP 
correctly")
---
 drivers/net/bonding/bond_main.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index c99dc59d729b..167434e952da 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -2042,6 +2042,7 @@ static int bond_miimon_inspect(struct bonding *bond)
 
bond_for_each_slave_rcu(bond, slave, iter) {
slave->new_link = BOND_LINK_NOCHANGE;
+   slave->link_new_state = slave->link;
 
link_state = bond_check_dev_link(bond, slave->dev, 0);
 
-- 
2.14.1

Re: Bond recovery from BOND_LINK_FAIL state not working

2017-11-06 Thread Jay Vosburgh

Jarod Wilson <ja...@redhat.com> wrote:

>On 2017-11-02 9:11 PM, Jay Vosburgh wrote:
[...]
>> diff --git a/drivers/net/bonding/bond_main.c 
>> b/drivers/net/bonding/bond_main.c
>> index 18b58e1376f1..6f89f9981a6c 100644
>> --- a/drivers/net/bonding/bond_main.c
>> +++ b/drivers/net/bonding/bond_main.c
>> @@ -2046,6 +2046,7 @@ static int bond_miimon_inspect(struct bonding *bond)
>>  bond_for_each_slave_rcu(bond, slave, iter) {
>>  slave->new_link = BOND_LINK_NOCHANGE;
>> +slave->link_new_state = slave->link;
>>  link_state = bond_check_dev_link(bond, slave->dev, 0);
>>   
>>
>>  Alex / Jarod, could you check my logic, and would you be able to
>> test this patch if my analysis appears sound?
>
>This patch looks good, the original reproducing setup successfully
>recovers after the original active slave goes down, even with
>NetworkManager in the mix.
>
>Reviewed-by: Jarod Wilson <ja...@redhat.com>

Thanks, I'll get a formal patch submission out later today.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: Bond recovery from BOND_LINK_FAIL state not working

2017-11-03 Thread Jay Vosburgh

ernel: bond0: bond_find_best_slave: slave->link: 2, 
>up: false, slave->delay: 0
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_find_best_slave: slave->link: 1, 
>up: true, slave->delay: 0
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_find_best_slave: no bestslave 
>found, bond failure imminent
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_select_active_slave: best_slave 
>!= curr_active_slave
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_select_active_slave: best_slave 
>is NULL, this is probably bad
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_change_active_slave: old_active: 
>ens3f0
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_change_active_slave: new_active: 
>NULL
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: Removing MAC from old_active
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: ALB and TLB modes should call 
>bond_alb_handle_active_change
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_set_carrier: turning carrier off
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: now running without any active 
>interface!
>
>Even though link state of ens3f1 has changed, we have 
> BOND_LINK_NOCHANGE
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_commit: working on slave 
>ens3f1
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_commit: BOND_LINK_NOCHANGE 
>on slave ens3f1
>
>Another invocation of bond_mii_monitor
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f0 current link 
>state: 2
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f1 current link 
>state: 1
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered 
>BOND_LINK_DOWN case on slave ens3f0
>
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered 
>BOND_LINK_FAIL case on slave ens3f1
>
>   We see "up again on ens3f1" and execute slave->new_link = BOND_LINK_UP; (my 
> initial suggestion)
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: link status up again after 0 ms for 
>interface ens3f1
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_inspect returned non-zero
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: committing link state 2->2 for 
>interface ens3f0, 0 Mbps full duplex
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: slave->should_notify_link for 
>interface ens3f0 now: 0
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: committing link state 1->0 for 
>interface ens3f1, 2 Mbps full duplex
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: slave->should_notify_link for 
>interface ens3f1 now: 0
>
>   As we nudged slave->new_link, bond_miimon_commit() now works
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_commit: working on slave 
>ens3f0
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_commit: BOND_LINK_NOCHANGE 
>on slave ens3f0
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_commit: working on slave 
>ens3f1
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_miimon_commit: BOND_LINK_UP
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: link status definitely up for 
>interface ens3f1, 2 Mbps full duplex
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_find_best_slave: slave->link: 2, 
>up: false, slave->delay: 0
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_find_best_slave: slave ens3f1 
>BOND_LINK_UP
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_select_active_slave: best_slave 
>!= curr_active_slave
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_change_active_slave: old_active: 
>NULL
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_change_active_slave: new_active: 
>ens3f1
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: Non-BOND_LINK_BACK case
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: making interface ens3f1 the new active 
>one
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: Setting MAC on new_active
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: ALB and TLB modes should call 
>bond_alb_handle_active_change
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: bond_set_carrier: turning carrier on
>Oct 31 09:09:25 SYDC1LNX kernel: bond0: first active interface up!
>
>Oct 31 09:09:26 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f0 current link 
>state: 2
>Oct 31 09:09:26 SYDC1LNX kernel: bond0: bond_mii_monitor: ens3f1 current link 
>state: 0
>Oct 31 09:09:26 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered 
>BOND_LINK_DOWN case on slave ens3f0
>Oct 31 09:09:26 SYDC1LNX kernel: bond0: bond_miimon_inspect: entered 
>BOND_LINK_UP case on slave ens3f1
>...
>
>Alex
>
>On 11/02/2017 09:11 PM, Jay Vosburgh wrote:
>> Alex Sidorenko <alexandre.sidore...@hpe.com> wrote:
>>> On 11/02/2017 12:51 AM, Jay Vosburgh wrote:
>>>> Jarod Wilson <ja...@redhat.com> wrote:
>>>>
>>>>> On 2017-11-01 8:35 PM, Jay Vosburgh wrote:
>>>>>> Jay Vosburgh <

Re: Bond recovery from BOND_LINK_FAIL state not working

2017-11-02 Thread Jay Vosburgh

Alex Sidorenko <alexandre.sidore...@hpe.com> wrote:
>On 11/02/2017 12:51 AM, Jay Vosburgh wrote:
>> Jarod Wilson <ja...@redhat.com> wrote:
>>
>>> On 2017-11-01 8:35 PM, Jay Vosburgh wrote:
>>>> Jay Vosburgh <jay.vosbu...@canonical.com> wrote:
>>>>
>>>>> Alex Sidorenko <alexandre.sidore...@hpe.com> wrote:
>>>>>
>>>>>> The problem has been found while trying to deploy RHEL7 on HPE Synergy
>>>>>> platform, it is seen both in customer's environment and in HPE test lab.
>>>>>>
>>>>>> There are several bonds configured in TLB mode and miimon=100, all other
>>>>>> options are default. Slaves are connected to VirtualConnect
>>>>>> modules. Rebooting a VC module should bring one bond slave (ens3f0) down
>>>>>> temporarily, but not another one (ens3f1). But what we see is
>>>>>>
>>>>>> Oct 24 10:37:12 SYDC1LNX kernel: bond0: link status up again after 0 ms 
>>>>>> for interface ens3f1
>>>>
>>>>In net-next, I don't see a path in the code that will lead to
>>>> this message, as it would apparently require entering
>>>> bond_miimon_inspect in state BOND_LINK_FAIL but with downdelay set to 0.
>>>> If downdelay is 0, the code will transition to BOND_LINK_DOWN and not
>>>> remain in _FAIL state.
>>>
>>> The kernel in question is laden with a fair bit of additional debug spew,
>>> as we were going back and forth, trying to isolate where things were going
>>> wrong.  That was indeed from the BOND_LINK_FAIL state in
>>> bond_miimon_inspect, inside the if (link_state) clause though, so after
>>> commit++, there's a continue, which ... does what now? Doesn't it take us
>>> back to the top of the bond_for_each_slave_rcu() loop, so we bypass the
>>> next few lines of code that would have led to a transition to
>>> BOND_LINK_DOWN?
>>
>>  Just to confirm: your downdelay is 0, correct?
>
>Correct.
>
>>
>>  And, do you get any other log messages other than "link status
>> up again after 0 ms"?
>
>Yes, here are some messages (from an early instrumentation):
[...]
>That is, we never see ens3f1 going to BOND_LINK_DOWN and it continues
>staying in BOND_LINK_NOCHANGE/BOND_LINK_FAIL
>
>
>>
>>  To answer your question, yes, the "if (link_state) {" block in
>> the BOND_LINK_FAIL case of bond_miimon_inspect ends in continue, but
>> this path is nominally for the downdelay logic.  If downdelay is active
>> and the link recovers before the delay expires, the link should never
>> have moved to BOND_LINK_DOWN.  The commit++ causes bond_miimon_inspect
>> to return nonzero, causing in turn the bond_propose_link_state change to
>> BOND_LINK_FAIL state to be committed.  This path deliberately does not
>> set slave->new_link, as downdelay is purposely delaying the transition
>> to BOND_LINK_DOWN.
>>
>>  If downdelay is 0, the slave->link should not persist in
>> BOND_LINK_FAIL state; it should set new_link = BOND_LINK_DOWN which will
>> cause a transition in bond_miimon_commit.  The bond_propose_link_state
>> call to set BOND_LINK_FAIL in the BOND_LINK_UP case will be committed in
>> bond_mii_monitor prior to calling bond_miimon_commit, which will in turn
>> do the transition to _DOWN state.  In this case, the BOND_LINK_FAIL case
>> "if (link_state) {" block should never be entered.
>
>I totally agree with your description of transition logic, and this is why
>we were puzzled by how this can occur until we noticed NetworkManager
>messages around this time and decided to run a test without it.
>Without NM, everything works as expected. After that, adding more
>instrumentation, we have found that we do not propose BOND_LINK_FAIL inside
>bond_miimon_inspect() but elsewhere (NetworkManager?).

I think I see the flaw in the logic.

1) bond_miimon_inspect finds link_state = 0, then makes a call
to bond_propose_link_state(BOND_LINK_FAIL), setting link_new_state to
BOND_LINK_FAIL.  _inspect then sets slave->new_link = BOND_LINK_DOWN and
returns non-zero.

2) bond_mii_monitor rtnl_trylock fails, it reschedules.

3) bond_mii_monitor runs again, and calls bond_miimon_inspect.

4) the slave's link has recovered, so link_state != 0.
slave->link is still BOND_LINK_UP.  The slave's link_new_state remains
set to BOND_LINK_FAIL, but new_link is reset to NOCHANGE.
bond_miimon_inspect returns 0, so nothing is committed.

5) step 4 can repeat indefi

Re: Bond recovery from BOND_LINK_FAIL state not working

2017-11-01 Thread Jay Vosburgh

Jarod Wilson <ja...@redhat.com> wrote:

>On 2017-11-01 8:35 PM, Jay Vosburgh wrote:
>> Jay Vosburgh <jay.vosbu...@canonical.com> wrote:
>>
>>> Alex Sidorenko <alexandre.sidore...@hpe.com> wrote:
>>>
>>>> The problem has been found while trying to deploy RHEL7 on HPE Synergy
>>>> platform, it is seen both in customer's environment and in HPE test lab.
>>>>
>>>> There are several bonds configured in TLB mode and miimon=100, all other
>>>> options are default. Slaves are connected to VirtualConnect
>>>> modules. Rebooting a VC module should bring one bond slave (ens3f0) down
>>>> temporarily, but not another one (ens3f1). But what we see is
>>>>
>>>> Oct 24 10:37:12 SYDC1LNX kernel: bond0: link status up again after 0 ms 
>>>> for interface ens3f1
>>
>>  In net-next, I don't see a path in the code that will lead to
>> this message, as it would apparently require entering
>> bond_miimon_inspect in state BOND_LINK_FAIL but with downdelay set to 0.
>> If downdelay is 0, the code will transition to BOND_LINK_DOWN and not
>> remain in _FAIL state.
>
>The kernel in question is laden with a fair bit of additional debug spew,
>as we were going back and forth, trying to isolate where things were going
>wrong.  That was indeed from the BOND_LINK_FAIL state in
>bond_miimon_inspect, inside the if (link_state) clause though, so after
>commit++, there's a continue, which ... does what now? Doesn't it take us
>back to the top of the bond_for_each_slave_rcu() loop, so we bypass the
>next few lines of code that would have led to a transition to
>BOND_LINK_DOWN?

Just to confirm: your downdelay is 0, correct?

And, do you get any other log messages other than "link status
up again after 0 ms"?

To answer your question, yes, the "if (link_state) {" block in
the BOND_LINK_FAIL case of bond_miimon_inspect ends in continue, but
this path is nominally for the downdelay logic.  If downdelay is active
and the link recovers before the delay expires, the link should never
have moved to BOND_LINK_DOWN.  The commit++ causes bond_miimon_inspect
to return nonzero, causing in turn the bond_propose_link_state change to
BOND_LINK_FAIL state to be committed.  This path deliberately does not
set slave->new_link, as downdelay is purposely delaying the transition
to BOND_LINK_DOWN.

If downdelay is 0, the slave->link should not persist in
BOND_LINK_FAIL state; it should set new_link = BOND_LINK_DOWN which will
cause a transition in bond_miimon_commit.  The bond_propose_link_state
call to set BOND_LINK_FAIL in the BOND_LINK_UP case will be committed in
bond_mii_monitor prior to calling bond_miimon_commit, which will in turn
do the transition to _DOWN state.  In this case, the BOND_LINK_FAIL case
"if (link_state) {" block should never be entered.

I'm a little leery of adding the state transition you suggest
without understanding how this situation arose, as it shouldn't get into
this condition in the first place.

-J

>...
>>> Your patch does not apply to net-next, so I'm not absolutely
>>> sure where this is, but presuming that this is in the BOND_LINK_FAIL
>>> case of the switch, it looks like both BOND_LINK_FAIL and BOND_LINK_BACK
>>> will have the issue that if the link recovers or fails, respectively,
>>> within the delay window (for down/updelay > 0) it won't set a
>>> slave->new_link.
>>>
>>> Looks like this got lost somewhere along the line, as originally
>>> the transition back to UP (or DOWN) happened immediately, and that has
>>> been lost somewhere.
>>>
>>> I'll have to dig out when that broke, but I'll see about a test
>>> patch this afternoon.
>>
>>  The case I was concerned with was moved around; the proposed
>> state is committed in bond_mii_monitor.  But to commit to _FAIL state,
>> the downdelay would have to be > 0.  I'm not seeing any errors in
>> net-next; can you reproduce your erroneous behavior on net-next?
>
>I can try to get a net-next-ish kernel into their hands, but the bonding
>driver we're working with here is quite close to current net-next already,
>so I'm fairly confident the same thing will happen.
>
>-- 
>Jarod Wilson
>ja...@redhat.com

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: Bond recovery from BOND_LINK_FAIL state not working

2017-11-01 Thread Jay Vosburgh

Jay Vosburgh <jay.vosbu...@canonical.com> wrote:

>Alex Sidorenko <alexandre.sidore...@hpe.com> wrote:
>
>>The problem has been found while trying to deploy RHEL7 on HPE Synergy
>>platform, it is seen both in customer's environment and in HPE test lab.
>>
>>There are several bonds configured in TLB mode and miimon=100, all other
>>options are default. Slaves are connected to VirtualConnect
>>modules. Rebooting a VC module should bring one bond slave (ens3f0) down
>>temporarily, but not another one (ens3f1). But what we see is
>>
>>Oct 24 10:37:12 SYDC1LNX kernel: bond0: link status up again after 0 ms for 
>>interface ens3f1

In net-next, I don't see a path in the code that will lead to
this message, as it would apparently require entering
bond_miimon_inspect in state BOND_LINK_FAIL but with downdelay set to 0.
If downdelay is 0, the code will transition to BOND_LINK_DOWN and not
remain in _FAIL state.

>>and it never recovers. When VC reboot is complete, everything goes back to 
>>normal again.
>>
>>Redhat has backported all recent upstream commits and instrumented the
>>bonding driver. We have found the following (when VC goes down)
>>
>>In bond_miimon_inspect() the first slave goes to
>>  bond_propose_link_state(slave, BOND_LINK_FAIL);
>>  and
>>  slave->new_link = BOND_LINK_DOWN;
>>
>>The second slave is still
>>  slave->link = BOND_LINK_UP;
>>  and
>>slave->new_link = BOND_LINK_NOCHANGE;
>>
>>This is as expected. But in bond_miimon_commit() we see that _both_ slaves
>>are in BOND_LINK_FAIL.  That is, something changes the state of the second
>>slave from another thread. We suspect the NetworkManager, as the problem
>>is there _only_ when bonds are controlled by it, if we set
>>NM_CONTROLLED=no everything starts working normally.
>>
>>While we still do not understand how NM affects bond state, I think that
>>bonding driver needs to be made reliable enough to recover even from this
>>state.
>
>   You're suggesting that the bonding driver be able to distinguish
>"false" down states set by network manager from a genuine link failure?
>Am I misunderstanding?
>
>>At this moment when we enter bond_miimon_inspect() with
>>slave->link = BOND_LINK_FAIL and are in the following code
>>
>>/*FALLTHRU*/
>>case BOND_LINK_BACK:
>>if (!link_state) {
>>bond_propose_link_state(slave, 
>> BOND_LINK_DOWN);
>>netdev_info(bond->dev, "link status down
>>again after %d ms for interface %s\n",
>>(bond->params.updelay - 
>> slave->delay) *
>>bond->params.miimon,
>>slave->dev->name);
>>
>>commit++;
>>continue;
>>}
>>
>
>   Looking at bonding in the current net-next, if a slave enters
>bond_miimon_inspect with slave->link == BOND_LINK_FAIL, it will not
>proceed to the BOND_LINK_BACK block of the switch; BOND_LINK_FAIL does
>not fall through that far.
>
>   Did you actually mean the BOND_LINK_FAIL block of the switch?
>I'll assume so for the rest of my reply.
>
>>we propose a new state and do 'commit++', but we do not change
>>slave->new_link from BOND_LINK_NOCHANGE. As a result, bond_miimon_commit()
>>will not process this slave.
>>
>>The following patch fixes the issue:
>>
>>
>>If we enter bond_miimon_inspect() with slave_link=BOND_LINK_FAIL
>>and recover, we do bond_propose_link_state(slave, BOND_LINK_UP);
>>but do not change slave->new_link, so it is left in
>>BOND_LINK_NOCHANGE. As a result, bond_miimon_commit() will not
>>process that slave and it never recovers. We need to set
>>slave->new_link = BOND_LINK_UP to make bond_miimon_commit() work
>
>   What is your downdelay set to?  In principle,
>bond_miimon_inspect should not enter with a slave in BOND_LINK_FAIL
>state if downdelay is 0.
>
>> drivers/net/bonding/bond_main.c | 1 +
>> 1 file changed, 1 insertion(+)
>>
>>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>>index c99dc59..07aa7ba 100644
>>--- a/drivers/net/bonding/bond_main.c
>>+++ b/drivers/net/bonding/bond_main.c
>>@@ -2072,6 +2072,7 @@ static int bond_miimon_inspect(struct bonding *bond)
>

Re: Bond recovery from BOND_LINK_FAIL state not working

2017-11-01 Thread Jay Vosburgh

ls, respectively,
within the delay window (for down/updelay > 0) it won't set a
slave->new_link.

Looks like this got lost somewhere along the line, as originally
the transition back to UP (or DOWN) happened immediately, and that has
been lost somewhere.

I'll have to dig out when that broke, but I'll see about a test
patch this afternoon.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: Regression: Bug 196547 - Since 4.12 - bonding module not working with wireless drivers

2017-08-16 Thread Jay Vosburgh

Dan Williams <d...@redhat.com> wrote:
[...]
>You'll probably say "aim for the 75% case" or something like that,
>which is fine, but then you're depending on your 75% case to be (a)
>single AP, (b) never move (eg, only bond wifi + ethernet), (c) little
>radio interference.  I'm not sure I'd buy that.  If I've put words in
>your mouth, forgive me.

The primary use case that I'm aware of for bonding with wireless
devices is in active-backup mode, paired with an ethernet adapter set as
the bonding "primary" device.  I think this is the case that absolutely
should just work.  I don't think bonding cares (or should care) about
(a) - (c) for this use.

Your point (b) suggests that there are use cases other than the
above; I'm unfamiliar with any use other than wifi + ethernet, can you
elaborate?

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH] Convert BUG_ON to WARN_ON in bond_options.c

2017-06-21 Thread Jay Vosburgh

Michael J Dilmore <michael.j.dilm...@gmail.com> wrote:

>On 21/06/17 22:56, David Miller wrote:
>
>> From: Michael D <michael.j.dilm...@gmail.com>
>> Date: Wed, 21 Jun 2017 22:41:07 +0100
>>
>>> I don't think you can stop it being dereferenced... you just need to
>>> prevent an attacker from exploiting the null pointer dereference
>>> vulnerability right? And this is done by returning the function right
>>> away?
>> What's all of this about an "attacker"?
>>
>> If there is a bug, we dererence a NULL pointer, and we should
>> fix that bug.
>>
>> The BUG_ON() helps us see where the problem is while at the
>> same time stopping the kernel before the NULL deref happens.
>Ok this is starting to make sense now - went a bit off track but think my
>general thinking is ok - i.e. if we return the function with an error code
>before the dereference then this basically does the same thing as BUG_ON
>but without crashing the kernel.
>
>Something like:
>
>if (WARN_ON(!new_active_slave) {
>netdev_dbg("Can't add new active slave - pointer null");
>return ERROR_CODE
>}

In general, yes, but in this case, the condition should be
impossible to hit, so BUG_ON seems appropriate.

If bond_slave_get_rtnl/rcu() returns NULL for an actual bonding
slave, other code paths (bond_fill_slave_info, bond_handle_frame) will
likely crash before getting to this one.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH] Convert BUG_ON to WARN_ON in bond_options.c

2017-06-21 Thread Jay Vosburgh

David Miller <da...@davemloft.net> wrote:

>From: Michael D <michael.j.dilm...@gmail.com>
>Date: Wed, 21 Jun 2017 22:41:07 +0100
>
>> I don't think you can stop it being dereferenced... you just need to
>> prevent an attacker from exploiting the null pointer dereference
>> vulnerability right? And this is done by returning the function right
>> away?
>
>What's all of this about an "attacker"?
>
>If there is a bug, we dererence a NULL pointer, and we should
>fix that bug.
>
>The BUG_ON() helps us see where the problem is while at the
>same time stopping the kernel before the NULL deref happens.

Looking at the code more carefully than I did earlier, the only
way the BUG_ON will hit is if the rx_handler_data is NULL for a bonding
slave when this code executes.

This should be impossible, as there doesn't appear to be any way
to get into bond_option_active_slave_set for a slave prior to
bond_enslave registering the rx_handler for that slave, as these
operations are mutexed by RTNL.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH] Convert BUG_ON to WARN_ON in bond_options.c

2017-06-21 Thread Jay Vosburgh

Michael J Dilmore <michael.j.dilm...@gmail.com> wrote:

>The function below contains a BUG_ON where no active slave is detected. The 
>patch
>converts this to a WARN_ON to avoid crashing the kernel.
>
>Signed-off-by: Michael J Dilmore <michael.j.dilm...@gmail.com>
>---
> drivers/net/bonding/bond_options.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/drivers/net/bonding/bond_options.c 
>b/drivers/net/bonding/bond_options.c
>index 1bcbb89..c4b4791 100644
>--- a/drivers/net/bonding/bond_options.c
>+++ b/drivers/net/bonding/bond_options.c
>@@ -778,7 +778,7 @@ static int bond_option_active_slave_set(struct bonding 
>*bond,
>   struct slave *old_active = 
> rtnl_dereference(bond->curr_active_slave);
>   struct slave *new_active = bond_slave_get_rtnl(slave_dev);
> 
>-  BUG_ON(!new_active);
>+  WARN_ON(!new_active);

This is a reasonable idea in principle, but will require
additional changes to prevent dereferencing new_active if it is NULL
(which would happen just below this point in the code).

-J

>   if (new_active == old_active) {
>   /* do nothing */
>-- 
>2.7.4
>

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH] Convert multiple netdev_info messages to netdev_dbg

2017-06-15 Thread Jay Vosburgh

Joe Perches <j...@perches.com> wrote:

>On Thu, 2017-06-15 at 19:14 +0100, Michael J Dilmore wrote:
>> Multiple netdev_info messages clutter kernel output. Also add netdev_dbg for 
>> packets per slave.
>[]
>> diff --git a/drivers/net/bonding/bond_options.c 
>> b/drivers/net/bonding/bond_options.c
>[]
>> @@ -9,6 +9,8 @@
>>   * (at your option) any later version.
>>   */
>>   
>> +#define DEBUG 1
>
>Is defining DEBUG really worthwhile.

I don't believe so, since if CONFIG_DYNAMIC_DEBUG is not
enabled, having #define DEBUG will enable all of the netdev_dbg messages
unconditionally, which is the opposite of the stated purpose of the
patch.  If DYNAMIC_DEBUG is enabled, having DEBUG doesn't do anything
that I can see.

-J

>As well, it's almost always just
>#define DEBUG
>without any level value unless the
>level value is used in the code.
>
>> +
>>  #include 
>>  #include 
>>  #include 
>> @@ -719,13 +721,13 @@ static int bond_option_mode_set(struct bonding *bond,
>>  const struct bond_opt_value *newval)
>>  {
>>  if (!bond_mode_uses_arp(newval->value) && bond->params.arp_interval) {
>> -netdev_info(bond->dev, "%s mode is incompatible with arp 
>> monitoring, start mii monitoring\n",
>> +netdev_dbg(bond->dev, "%s mode is incompatible with arp 
>> monitoring, start mii monitoring\n",
>>  newval->string);
>
>Please realign any multiple line arguments to the
>open parenthesis at the same time.
>
>>  /* disable arp monitoring */
>>  bond->params.arp_interval = 0;
>>  /* set miimon to default value */
>>  bond->params.miimon = BOND_DEFAULT_MIIMON;
>> -netdev_info(bond->dev, "Setting MII monitoring interval to 
>> %d\n",
>> +netdev_dbg(bond->dev, "Setting MII monitoring interval to %d\n",
>>  bond->params.miimon);
>
>etc...
>

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: bond link state mismatch, rtnl_trylock() vs rtnl_lock()

2017-05-23 Thread Jay Vosburgh

Nithin Sujir <nsu...@tintri.com> wrote:

>Hi,
>We're encountering a problem in 4.4 LTS where, rarely, the bond link state
>is not updated when the slave link changes.
>
>I've traced the issue to the arp monitor unable to get the rtnl lock. The
>sequence resulting in failure is as below.
>
>bond_loadbalance_arp_mon() periodically called, if slave link is _down_,
>it checks if the slave is sending/receiving packets. If it is, it sets
>flags to be processed later down the function for bond link
>update. However, it sets the slave->link right away.
>
>if (slave->link != BOND_LINK_UP) {
>if (bond_time_in_interval(bond, trans_start, 1) &&
>bond_time_in_interval(bond, slave->last_rx,
>1)) {
>
>slave->link  = BOND_LINK_UP;
>slave_state_changed = 1;
>
>
>Later down the function, it tries to get the rtnl_lock. If it doesn't get
>it, it rearms and returns.
>
>if (do_failover || slave_state_changed) {
>if (!rtnl_trylock())
>goto re_arm; <-- returns here
>
>if (slave_state_changed) {
>bond_slave_state_change(bond);
>
>This is the problem. The next time this function is called, the
>slave->link is already marked UP. And we will never update the bond link
>state to UP.

This looks like an ARP monitor version of

commit de77ecd4ef02ca783f7762e04e92b3d0964be66b
Author: Mahesh Bandewar <mahe...@google.com>
Date:   Mon Mar 27 11:37:33 2017 -0700

bonding: improve link-status update in mii-monitoring

and probably needs a similar fix (possibly for both the
loadbalance and active-backup ARP monitor cases).

>Changing the rtnl_trylock() -> rtnl_lock() _does_ fix the issue.
>
>Is this the right way to fix it? If it is, I can submit this formally.

It's not the right way, unfortunately.

The reason for the rtnl_trylock is that there's a possible race
against bond_close() -> bond_work_cancel_all() trying to cancel the
arp_work workqueue item while it's running.  bond_close is called with
RTNL held, so if it has RTNL and is waiting for the work function to
complete, an rtnl_lock call here will deadlock.  Some of the trylock
calls in bonding are commented to this effect, but not this one.

-J

>What are the guidelines around using rtnl_lock() vs rtnl_trylock()? Some
>places are using rtnl_lock() and other rtnl_trylock(). Sorry, I couldn't
>find much via a google search or in Documentation/.
>
>Thanks,
>Nithin.
>
>
>
>diff --git a/drivers/net/bonding/bond_main.c
>b/drivers/net/bonding/bond_main.c
>index 5dca77e..1f60503 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -2614,8 +2614,7 @@ static void bond_loadbalance_arp_mon(struct
>work_struct *work)
>rcu_read_unlock();
>
>if (do_failover || slave_state_changed) {
>-       if (!rtnl_trylock())
>-   goto re_arm;
>+   rtnl_lock();
>
>if (slave_state_changed) {
>bond_slave_state_change(bond);
>
>

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH net] bonding: avoid defaulting hard_header_len to ETH_HLEN on slave removal

2017-04-27 Thread Jay Vosburgh

Paolo Abeni <pab...@redhat.com> wrote:

>On slave list updates, the bonding driver computes its hard_header_len
>as the maximum of all enslaved devices's hard_header_len.
>If the slave list is empty, e.g. on last enslaved device removal,
>ETH_HLEN is used.
>
>Since the bonding header_ops are set only when the first enslaved
>device is attached, the above can lead to header_ops->create()
>being called with the wrong skb headroom in place.
>
>If bond0 is configured on top of ipoib devices, with the
>following commands:
>
>ifup bond0
>for slave in $BOND_SLAVES_LIST; do
>   ip link set dev $slave nomaster
>done
>ping -c 1 
>
>we will obtain a skb_under_panic() with a similar call trace:
>   skb_push+0x3d/0x40
>   push_pseudo_header+0x17/0x30 [ib_ipoib]
>   ipoib_hard_header+0x4e/0x80 [ib_ipoib]
>   arp_create+0x12f/0x220
>   arp_send_dst.part.19+0x28/0x50
>   arp_solicit+0x115/0x290
>   neigh_probe+0x4d/0x70
>   __neigh_event_send+0xa7/0x230
>   neigh_resolve_output+0x12e/0x1c0
>   ip_finish_output2+0x14b/0x390
>   ip_finish_output+0x136/0x1e0
>   ip_output+0x76/0xe0
>   ip_local_out+0x35/0x40
>   ip_send_skb+0x19/0x40
>   ip_push_pending_frames+0x33/0x40
>   raw_sendmsg+0x7d3/0xb50
>   inet_sendmsg+0x31/0xb0
>   sock_sendmsg+0x38/0x50
>   SYSC_sendto+0x102/0x190
>   SyS_sendto+0xe/0x10
>   do_syscall_64+0x67/0x180
>   entry_SYSCALL64_slow_path+0x25/0x25
>
>This change addresses the issue avoiding updating the bonding device
>hard_header_len when the slaves list become empty, forbidding to
>shrink it below the value used by header_ops->create().
>
>The bug is there since commit 54ef31371407 ("[PATCH] bonding: Handle large
>hard_header_len") but the panic can be triggered only since
>commit fc791b633515 ("IB/ipoib: move back IB LL address into the hard
>header").
>
>Reported-by: Norbert P <n...@physik.uzh.ch>
>Fixes: 54ef31371407 ("[PATCH] bonding: Handle large hard_header_len")
>Fixes: fc791b633515 ("IB/ipoib: move back IB LL address into the hard header")
>Signed-off-by: Marcelo Ricardo Leitner <marcelo.leit...@gmail.com>
>Signed-off-by: Paolo Abeni <pab...@redhat.com>

Signed-off-by: Jay Vosburgh <jay.vosbu...@canonical.com>


> drivers/net/bonding/bond_main.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index 8a4ba8b..34481c9 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -1104,11 +1104,11 @@ static void bond_compute_features(struct bonding *bond)
>   gso_max_size = min(gso_max_size, slave->dev->gso_max_size);
>   gso_max_segs = min(gso_max_segs, slave->dev->gso_max_segs);
>   }
>+  bond_dev->hard_header_len = max_hard_header_len;
> 
> done:
>   bond_dev->vlan_features = vlan_features;
>   bond_dev->hw_enc_features = enc_features | NETIF_F_GSO_ENCAP_ALL;
>-  bond_dev->hard_header_len = max_hard_header_len;
>   bond_dev->gso_max_segs = gso_max_segs;
>   netif_set_gso_max_size(bond_dev, gso_max_size);
> 
>-- 
>2.9.3
>

Re: [PATCH net-next v2] bonding: deliver link-local packets with skb->dev set to link that packets arrived on

2017-04-17 Thread Jay Vosburgh

Chonggang Li <chonggan...@google.com> wrote:

>Bonding driver changes the skb->dev to the bonding-master before
>passing the packet to stack for further processing. This, however
>does not make sense for the link-local packets and it looses "the
>link info" once its skb->dev is changed to bonding-master.  This
>patch changes this behavior for link-local packets by not changing
>the skb->dev to the bonding-master and maintaining it as it is,
>i.e. the link on which the packet arrived.

Minor nit: "looses" should be "loses".  Other than that:

Signed-off-by: Jay Vosburgh <jay.vosbu...@canonical.com>


>Signed-off-by: Chonggang Li <chonggan...@google.com>
>Signed-off-by: Mahesh Bandewar <mahe...@google.com>
>Signed-off-by: Maciej Żenczykowski <m...@google.com>
>---
>Changes in v2:
>  - Make the commit message more clearer.
>
> drivers/net/bonding/bond_main.c | 3 +++
> 1 file changed, 3 insertions(+)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index 01e4a69af421..6bd3b50faf48 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -1176,6 +1176,9 @@ static rx_handler_result_t bond_handle_frame(struct 
>sk_buff **pskb)
>   }
>   }
> 
>+  /* don't change skb->dev for link-local packets */
>+  if (is_link_local_ether_addr(eth_hdr(skb)->h_dest))
>+  return RX_HANDLER_PASS;
>   if (bond_should_deliver_exact_match(skb, slave, bond))
>   return RX_HANDLER_EXACT;
> 
>-- 
>2.12.2.762.g0e3151a226-goog
>

Re: [PATCH net-next] bonding: do not pass link-local packets to master-interface

2017-04-14 Thread Jay Vosburgh

Mahesh Bandewar (महेश बंडेवार) wrote:

>On Fri, Apr 14, 2017 at 12:30 PM, Jay Vosburgh
><jay.vosbu...@canonical.com> wrote:
>>
>>
>> Chonggang Li <chonggan...@google.com> wrote:
>>
>> >Previously link-local packets excluding LACP (which are handled by
>> >the recv_probe) received on bond slave interfaces are delivered to
>> >stack with bond-master device with RX_HANDLER_ANOTHER, however all
>> >link-local packets are link specific and should be delivered with
>> >exact same link/dev.
>>
>> In what case is the current behavior a problem (my guess would
>> be something related to LLDP)?  What if, e.g., the bond is a bridge
>> port, will STP frames no longer propagate to the bridge?
>>
>When a link-local frame made appear to arrive on bond-master, it
>looses value since 'the info' about link on which it arrived is lost.

This information should really be in the commit message, along
with something about the LLDP issue being solved.

>So idea behind this is to pass the frame to the stack with the link on
>which it arrived without involving the bonding-master. The same will
>happen for the STP frames too. Do you see problem with this approach?

My look through the STP code suggests not, but I'm far from an
expert there.  I'm just concerned that this change will cause some
obscure regression in something that depends on the current behavior.

>> Also, I think the description would be better if it mentioned
>> specifically that the patch is changing how skb->dev is set for link
>> local frames (bond->dev vs receiving interface), e.g.,
>>
>> "[...] however all link-local packets are link specific and
>> should be delivered with skb->dev set to the original device."
>>
>yes, makes sense.
>
>> >Signed-off-by: Chonggang Li <chonggan...@google.com>
>> >Signed-off-by: Mahesh Bandewar <mahe...@google.com>
>> >Signed-off-by: Maciej Żenczykowski <m...@google.com>
>> >---
>> > drivers/net/bonding/bond_main.c | 3 +++
>> > 1 file changed, 3 insertions(+)
>> >
>> >diff --git a/drivers/net/bonding/bond_main.c 
>> >b/drivers/net/bonding/bond_main.c
>> >index 01e4a69af421..aeca3d8541b9 100644
>> >--- a/drivers/net/bonding/bond_main.c
>> >+++ b/drivers/net/bonding/bond_main.c
>> >@@ -1176,6 +1176,9 @@ static rx_handler_result_t bond_handle_frame(struct 
>> >sk_buff **pskb)
>> >   }
>> >   }
>> >
>> >+  /* link-local packets should not be passed to master interface */
>> >+  if (is_link_local_ether_addr(eth_hdr(skb)->h_dest))
>> >+  return RX_HANDLER_PASS;
>>
>> Since this returns _PASS and not _EXACT, the packet will go
>> through the ptype_base packet handlers, so is the comment strictly
>> correct?
>>
>The stack does not have all link-local specific packet-type handlers
>registered by default so returning _EXACT would find nothing and
>packet will be dropped, right? So returning _PASS will deliver packet
>to the stack with skb->dev as the link on which packet arrived and
>stack can take whatever (default) action it has to take.

Fair enough; I do think the comment would be better phrased as
something like "don't change skb->dev of link local frames"; on first
reading, I thought it meant the frames would be dropped.

-J

>> >   if (bond_should_deliver_exact_match(skb, slave, bond))
>> >   return RX_HANDLER_EXACT;
>> >
>> >--
>> >2.12.2.762.g0e3151a226-goog
>> >
>>
---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH net-next] bonding: do not pass link-local packets to master-interface

2017-04-14 Thread Jay Vosburgh


Chonggang Li <chonggan...@google.com> wrote:

>Previously link-local packets excluding LACP (which are handled by
>the recv_probe) received on bond slave interfaces are delivered to
>stack with bond-master device with RX_HANDLER_ANOTHER, however all
>link-local packets are link specific and should be delivered with
>exact same link/dev.

In what case is the current behavior a problem (my guess would
be something related to LLDP)?  What if, e.g., the bond is a bridge
port, will STP frames no longer propagate to the bridge?

Also, I think the description would be better if it mentioned
specifically that the patch is changing how skb->dev is set for link
local frames (bond->dev vs receiving interface), e.g.,

"[...] however all link-local packets are link specific and
should be delivered with skb->dev set to the original device."

>Signed-off-by: Chonggang Li <chonggan...@google.com>
>Signed-off-by: Mahesh Bandewar <mahe...@google.com>
>Signed-off-by: Maciej Żenczykowski <m...@google.com>
>---
> drivers/net/bonding/bond_main.c | 3 +++
> 1 file changed, 3 insertions(+)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index 01e4a69af421..aeca3d8541b9 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -1176,6 +1176,9 @@ static rx_handler_result_t bond_handle_frame(struct 
>sk_buff **pskb)
>   }
>   }
> 
>+  /* link-local packets should not be passed to master interface */
>+  if (is_link_local_ether_addr(eth_hdr(skb)->h_dest))
>+  return RX_HANDLER_PASS;

Since this returns _PASS and not _EXACT, the packet will go
through the ptype_base packet handlers, so is the comment strictly
correct?

-J

>   if (bond_should_deliver_exact_match(skb, slave, bond))
>   return RX_HANDLER_EXACT;
> 
>-- 
>2.12.2.762.g0e3151a226-goog
>

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: bond procfs hw addr prints

2017-03-13 Thread Jay Vosburgh

Jarod Wilson <ja...@redhat.com> wrote:

>I've got a bug report for someone using a Intel OPA devices in a bond, and
>it appears these devices have a hardware address length of 20, opposed to
>the typical 6 on ethernet. When they dump /proc/net/bonding/bondX, it only
>prints the first 6 of the address, per %pM and mac_address_string(), while
>sysfs for the interface does print the right thing, since it uses
>sysfs_print_mac(), which takes a length argument.

This (20 octet MAC length) is true for any Infiniband device.

>So the question is... What's the best route to take here? Expand %pM to
>support variable length hardware addresses? Use sysfs_* in procfs?
>Reinvent the wheel? Nothing I've tinkered with just yet feels very clean,
>on top of not actually working yet. :)

sysfs_format_mac (not _print_mac) uses "%*phC", len, addr in its
format string.  Perhaps that format would be a better choice than %pM
for this case?

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH next 0/5] bonding: winter cleanup

2017-03-08 Thread Jay Vosburgh

Mahesh Bandewar <mah...@bandewar.net> wrote:

>From: Mahesh Bandewar <mahe...@google.com>
>
>Few cleanup patches that I have accumulated over some time now.
>
>(a) First two patches are basically to move the work-queue initialization
>from every ndo_open / bond_open operation to once at the beginning while
>port creation. Work-queue initialization is an unnecessary operation
>for every 'ifup' operation. However we have some mode-specific work-queues
>and mode can change anytime after port creation. So the second patch is
>to ensure the correct work-handler is called based on the mode.
>
>(b) Third patch is simple and straightforward that removes hard-coded value
>that was added into the initial commit and replaces it with the default
>value configured.
>
>(c) The final patch in the series removes the unimplemented "port-moved" state
>from the LACP state machine. This state is defined but never set so
>removing from the state machine logic makes code little cleaner.
>
>(d) Reduce scope of some global variables to local.

For all patches in the series:

Signed-off-by: Jay Vosburgh <jay.vosbu...@canonical.com>

>Note: None of these patches are making any functional changes.
>
>Mahesh Bandewar (5):
>  bonding: restructure arp-monitor
>  bonding: initialize work-queues during creation of bond
>  bonding: remove hardcoded value
>  bonding: remove "port-moved" state that was never implemented
>  bonding: reduce scope of some global variables
>
> drivers/net/bonding/bond_3ad.c  | 11 +++--
> drivers/net/bonding/bond_main.c | 53 ++++++---
> 2 files changed, 37 insertions(+), 27 deletions(-)
>
>-- 
>2.12.0.246.ga2ecc84866-goog

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [Patch net] bonding: use ETH_MAX_MTU as max mtu

2017-03-02 Thread Jay Vosburgh

Cong Wang <xiyou.wangc...@gmail.com> wrote:

>This restores the ability of setting bond device's mtu to 9000.
>
>Fixes: 91572088e3fd ("net: use core MTU range checking in core net infra")
>Reported-by: daz...@gmail.com
>Reported-by: Brad Campbell <lists2...@fnarfbargle.com>
>Cc: Jarod Wilson <ja...@redhat.com>
>Signed-off-by: Cong Wang <xiyou.wangc...@gmail.com>
>---
> drivers/net/bonding/bond_main.c | 1 +
> 1 file changed, 1 insertion(+)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index 6321f12..8a4ba8b 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -4179,6 +4179,7 @@ void bond_setup(struct net_device *bond_dev)
> 
>   /* Initialize the device entry points */
>   ether_setup(bond_dev);
>+  bond_dev->max_mtu = ETH_MAX_MTU;
>   bond_dev->netdev_ops = _netdev_ops;
>   bond_dev->ethtool_ops = _ethtool_ops;

Signed-off-by: Jay Vosburgh <jay.vosbu...@canonical.com>

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH net-next] net: Remove usage of net_device last_rx member

2017-01-18 Thread Jay Vosburgh

Tobias Klauser <tklau...@distanz.ch> wrote:

>The network stack no longer uses the last_rx member of struct net_device
>since the bonding driver switched to use its own private last_rx in
>commit 9f242738376d ("bonding: use last_arp_rx in slave_last_rx()").
>
>However, some drivers still (ab)use the field for their own purposes and
>some driver just update it without actually using it.
>
>Previously, there was an accompanying comment for the last_rx member
>added in commit 4dc89133f49b ("net: add a comment on netdev->last_rx")
>which asked drivers not to update is, unless really needed. However,
>this commend was removed in commit f8ff080dacec ("bonding: remove
>useless updating of slave->dev->last_rx"), so some drivers added later
>on still did update last_rx.
>
>Remove all usage of last_rx and switch three drivers (sky2, atp and
>smc91c92_cs) which actually read and write it to use their own private
>copy in netdev_priv.
>
>Compile-tested with allyesconfig and allmodconfig on x86 and arm.

Reviewed-by: Jay Vosburgh <jay.vosbu...@canonical.com>

Re: [PATCH] net: bonding: use new api ethtool_{get|set}_link_ksettings

2016-10-26 Thread Jay Vosburgh

Philippe Reynes <trem...@gmail.com> wrote:

>The ethtool api {get|set}_settings is deprecated.
>We move this driver to new api {get|set}_link_ksettings.

This is just an API change, i.e., no change to functionality?

-J

>Signed-off-by: Philippe Reynes <trem...@gmail.com>
>---
> drivers/net/bonding/bond_main.c |   16 
> 1 files changed, 8 insertions(+), 8 deletions(-)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index c9944d8..5708f17 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -4080,16 +4080,16 @@ static netdev_tx_t bond_start_xmit(struct sk_buff 
>*skb, struct net_device *dev)
>   return ret;
> }
> 
>-static int bond_ethtool_get_settings(struct net_device *bond_dev,
>-   struct ethtool_cmd *ecmd)
>+static int bond_ethtool_get_link_ksettings(struct net_device *bond_dev,
>+ struct ethtool_link_ksettings *cmd)
> {
>   struct bonding *bond = netdev_priv(bond_dev);
>   unsigned long speed = 0;
>   struct list_head *iter;
>   struct slave *slave;
> 
>-  ecmd->duplex = DUPLEX_UNKNOWN;
>-  ecmd->port = PORT_OTHER;
>+  cmd->base.duplex = DUPLEX_UNKNOWN;
>+  cmd->base.port = PORT_OTHER;
> 
>   /* Since bond_slave_can_tx returns false for all inactive or down 
> slaves, we
>* do not need to check mode.  Though link speed might not represent
>@@ -4100,12 +4100,12 @@ static int bond_ethtool_get_settings(struct net_device 
>*bond_dev,
>   if (bond_slave_can_tx(slave)) {
>   if (slave->speed != SPEED_UNKNOWN)
>   speed += slave->speed;
>-  if (ecmd->duplex == DUPLEX_UNKNOWN &&
>+  if (cmd->base.duplex == DUPLEX_UNKNOWN &&
>   slave->duplex != DUPLEX_UNKNOWN)
>-  ecmd->duplex = slave->duplex;
>+  cmd->base.duplex = slave->duplex;
>   }
>   }
>-  ethtool_cmd_speed_set(ecmd, speed ? : SPEED_UNKNOWN);
>+  cmd->base.speed = speed ? : SPEED_UNKNOWN;
> 
>   return 0;
> }
>@@ -4121,8 +4121,8 @@ static void bond_ethtool_get_drvinfo(struct net_device 
>*bond_dev,
> 
> static const struct ethtool_ops bond_ethtool_ops = {
>   .get_drvinfo= bond_ethtool_get_drvinfo,
>-  .get_settings   = bond_ethtool_get_settings,
>   .get_link   = ethtool_op_get_link,
>+  .get_link_ksettings = bond_ethtool_get_link_ksettings,
> };
> 
> static const struct net_device_ops bond_netdev_ops = {
>-- 
>1.7.4.4

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH] bonding: Prevent deletion of a bond, or the last slave from a bond, with active usage.

2016-09-06 Thread Jay Vosburgh

= command + 2;
>+  else
>+  ifname = command + 1;
>+
>   if ((strlen(command) <= 1) ||
>   !dev_valid_name(ifname))
>   goto err_no_cmd;
>@@ -130,6 +135,32 @@ static ssize_t bonding_store_bonds(struct class *cls,
>   res = -ENODEV;
>   }
>   rtnl_unlock();
>+  } else if ((command[0] == '?') && (command[1] == '-')) {
>+  struct net_device *bond_dev;
>+
>+  rtnl_lock();
>+  bond_dev = bond_get_by_name(bn, ifname);
>+
>+  if (bond_dev) {
>+  struct in_device *in_dev;
>+  struct bonding *bond = netdev_priv(bond_dev);
>+
>+  in_dev = __in_dev_get_rtnl(bond_dev);
>+
>+  if (((in_dev->ifa_list) != NULL) &&
>+  (bond->slave_cnt > 0)) {
>+  pr_err("%s is in use. Unconfigure IP %pI4 
>before deletion.\n",
>+ ifname, _dev->ifa_list->ifa_local);
>+  rtnl_unlock();
>+  return -EBUSY;
>+  }
>+  pr_info("%s is being deleted...\n", ifname);
>+      unregister_netdevice(bond_dev);
>+  } else {
>+  pr_err("unable to delete non-existent %s\n", ifname);
>+  res = -ENODEV;
>+  }
>+  rtnl_unlock();
>   } else
>   goto err_no_cmd;
> 
>@@ -139,7 +170,7 @@ static ssize_t bonding_store_bonds(struct class *cls,
>   return res;
> 
> err_no_cmd:
>-  pr_err("no command found in bonding_masters - use +ifname or 
>-ifname\n");
>+  pr_err("no command found in bonding_masters - use +ifname or -ifname or 
>?-ifname\n");
>   return -EPERM;
> }
> 
>-- 
>1.8.3.1
>

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH] bonding: Allow tun-interfaces as slaves

2016-08-10 Thread Jay Vosburgh

Jörn Engel <jo...@purestorage.com> wrote:

>On Wed, Aug 10, 2016 at 02:26:49PM -0700, Jörn Engel wrote:
>> 
>> Having to set one more parameter is a bit annoying.  It would have to be
>> documented in a prominent place and people would still often miss it.
>> So I wonder if we can make the interface a little nicer.
>> 
>> Options:
>> - If there are no slaves yet and the first slave added is tun, we trust
>>   the users to know what they are doing.  Automatically set
>>   bond->params.fail_over_mac = BOND_FOM_KEEPMAC
>>   Maybe do a printk to inform the user in case of a mistake.

I don't think this is feasible, as I don't see a reliable way to
test for a slave being a tun device (ARPHRD_NONE is not just tun, and we
cannot check the ops as they are not statically built into the kernel).
I'm also not sure that heuristics are the proper way to enable this
functionality in general.

>> - If we get an error and the slave device is tun, do a printk giving the
>>   user enough information to find this parameter.

This could probably be done as a change the existing logic, e.g.,

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 1f276fa30ba6..019c1a689aae 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1443,6 +1443,9 @@ int bond_enslave(struct net_device *bond_dev, struct 
net_device *slave_dev)
res = -EOPNOTSUPP;
goto err_undo_flags;
}
+   } else if (BOND_MODE(bond) != BOND_MODE_ACTIVEBACKUP &&
+  bond->params.fail_over_mac != BOND_FOM_KEEPMAC) {
+   netdev_err(bond_dev, "The slave device 
specified does not support setting the MAC address, but fail_over_mac is not 
set to keepmac\n");
}
}
 
I haven't tested this, and I'm not sure it will get all corner
cases correct, but this should basically cover it.

-J

>> I'm leaning towards the former, but you probably know a reason why I am
>> wrong again.
>
>Patch below is an implementation of the former.  Not sure if something
>like this is worth considering.
>
>Jörn
>
>--
>To announce that there must be no criticism of the President, or that we
>are to stand by the President, right or wrong, is not only unpatriotic
>and servile, but is morally treasonable to the American public.
>-- Theodore Roosevelt, Kansas City Star, 1918
>
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index 1f276fa30ba6..306909a44fab 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -1482,8 +1482,9 @@ int bond_enslave(struct net_device *bond_dev, struct 
>net_device *slave_dev)
>*/
>   ether_addr_copy(new_slave->perm_hwaddr, slave_dev->dev_addr);
> 
>-  if (!bond->params.fail_over_mac ||
>-  BOND_MODE(bond) != BOND_MODE_ACTIVEBACKUP) {
>+  if (bond_dev->type != ARPHRD_NONE &&
>+  (!bond->params.fail_over_mac ||
>+   BOND_MODE(bond) != BOND_MODE_ACTIVEBACKUP)) {
>   /* Set slave to master's mac address.  The application already
>* set the master's mac address to that of the first slave
>*/
>-- 
>2.1.4
>

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH] bonding: Allow tun-interfaces as slaves

2016-08-10 Thread Jay Vosburgh

Ding Tianhong <dingtianh...@huawei.com> wrote:

>On 2016/8/10 7:51, Jay Vosburgh wrote:
>> Jörn Engel <jo...@purestorage.com> wrote:
>> 
>>> On Tue, Aug 09, 2016 at 12:06:36PM -0700, David Miller wrote:
>>>>> On Tue, Aug 09, 2016 at 09:28:45PM +0800, Ding Tianhong wrote:
>>>>>
>>>>> Simply not checking errors when setting the mac address solves the
>>>>> problem for me.  No new features needed.
>>>>
>>>> But it only works in certain modes.
>>>>
>>>> So the best we can do is enforce the MAC address setting in the
>>>> modes that absolutely require it.  We cannot ignore the MAC
>>>> address setting unilaterally.
>>>
>>> Something like this?
>>>
>>> [PATCH] bonding: Allow tun-interfaces as slaves in balance-rr mode
>>>
>>> Up until 00503b6f702e (part of 3.14-rc1), the bonding driver could be
>>> used to enslave tun-interfaces.  00503b6f702e broke that behaviour,
>>> afaics as an unintended side-effect.
>>>
>>> For the purpose of bond-over-tun in balance-rr mode, simply ignoring the
>>> error from dev_set_mac_address() is good enough.
>>>
>>> Signed-off-by: Joern Engel <jo...@purestorage.com>
>>> ---
>>> drivers/net/bonding/bond_main.c | 3 ++-
>>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/net/bonding/bond_main.c 
>>> b/drivers/net/bonding/bond_main.c
>>> index 1f276fa30ba6..2f686bfe4304 100644
>>> --- a/drivers/net/bonding/bond_main.c
>>> +++ b/drivers/net/bonding/bond_main.c
>>> @@ -1490,7 +1490,8 @@ int bond_enslave(struct net_device *bond_dev, struct 
>>> net_device *slave_dev)
>>> memcpy(addr.sa_data, bond_dev->dev_addr, bond_dev->addr_len);
>>> addr.sa_family = slave_dev->type;
>>> res = dev_set_mac_address(slave_dev, );
>>> -   if (res) {
>>> +   /* round-robin mode works fine without a mac address */
>>> +   if (res && BOND_MODE(bond) != BOND_MODE_ROUNDROBIN) {
>> 
>>  This will cause balance-rr to add the slave to the bond if any
>> device's dev_set_mac_address call fails.
>> 
>>  If a bond of regular Ethernet devices is connected to a static
>> link aggregation (Etherchannel channel group), a set_mac failure would
>> result in that slave having a different MAC address than the bond, which
>> in turn would cause traffic inbound from the switch to that slave to be
>> dropped (as the destination MAC would not pass the device MAC filters).
>> 
>>  The failure check for the set_mac call serves a legitimate
>> purpose, and I don't believe we should bypass it without making the
>> bypass an option that is explicitly enabled for those special cases that
>> need it.
>> 
>>  E.g., something like the following (which I have not tested);
>> this would also need documentation and iproute2 updates to go with it.
>> This would be enabled with "fail_over_mac=keepmac".
>> 
>> diff --git a/drivers/net/bonding/bond_main.c 
>> b/drivers/net/bonding/bond_main.c
>> index 1f276fa30ba6..d2283fc23b16 100644
>> --- a/drivers/net/bonding/bond_main.c
>> +++ b/drivers/net/bonding/bond_main.c
>> @@ -1483,7 +1483,8 @@ int bond_enslave(struct net_device *bond_dev, struct 
>> net_device *slave_dev)
>>  ether_addr_copy(new_slave->perm_hwaddr, slave_dev->dev_addr);
>>  
>>  if (!bond->params.fail_over_mac ||
>> -BOND_MODE(bond) != BOND_MODE_ACTIVEBACKUP) {
>> +(BOND_MODE(bond) != BOND_MODE_ACTIVEBACKUP &&
>> + bond->params.fail_over_mac != BOND_FOM_KEEPMAC)) {
>>  /* Set slave to master's mac address.  The application already
>>   * set the master's mac address to that of the first slave
>>   */
>> diff --git a/drivers/net/bonding/bond_options.c 
>> b/drivers/net/bonding/bond_options.c
>> index 577e57cad1dc..f9653fe4d622 100644
>> --- a/drivers/net/bonding/bond_options.c
>> +++ b/drivers/net/bonding/bond_options.c
>> @@ -125,6 +125,7 @@ static const struct bond_opt_value 
>> bond_fail_over_mac_tbl[] = {
>>  { "none",   BOND_FOM_NONE,   BOND_VALFLAG_DEFAULT},
>>  { "active", BOND_FOM_ACTIVE, 0},
>>  { "follow", BOND_FOM_FOLLOW, 0},
>> +{ "keepmac", BOND_FOM_KEEPMAC, 0},
>>  { NULL, -1,  0},
>>  };
>>  
>>

Re: [PATCH] bonding: Allow tun-interfaces as slaves

2016-08-09 Thread Jay Vosburgh

Jörn Engel <jo...@purestorage.com> wrote:

>On Tue, Aug 09, 2016 at 12:06:36PM -0700, David Miller wrote:
>> > On Tue, Aug 09, 2016 at 09:28:45PM +0800, Ding Tianhong wrote:
>> > 
>> > Simply not checking errors when setting the mac address solves the
>> > problem for me.  No new features needed.
>> 
>> But it only works in certain modes.
>> 
>> So the best we can do is enforce the MAC address setting in the
>> modes that absolutely require it.  We cannot ignore the MAC
>> address setting unilaterally.
>
>Something like this?
>
>[PATCH] bonding: Allow tun-interfaces as slaves in balance-rr mode
>
>Up until 00503b6f702e (part of 3.14-rc1), the bonding driver could be
>used to enslave tun-interfaces.  00503b6f702e broke that behaviour,
>afaics as an unintended side-effect.
>
>For the purpose of bond-over-tun in balance-rr mode, simply ignoring the
>error from dev_set_mac_address() is good enough.
>
>Signed-off-by: Joern Engel <jo...@purestorage.com>
>---
> drivers/net/bonding/bond_main.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index 1f276fa30ba6..2f686bfe4304 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -1490,7 +1490,8 @@ int bond_enslave(struct net_device *bond_dev, struct 
>net_device *slave_dev)
>   memcpy(addr.sa_data, bond_dev->dev_addr, bond_dev->addr_len);
>   addr.sa_family = slave_dev->type;
>   res = dev_set_mac_address(slave_dev, );
>-  if (res) {
>+  /* round-robin mode works fine without a mac address */
>+  if (res && BOND_MODE(bond) != BOND_MODE_ROUNDROBIN) {

This will cause balance-rr to add the slave to the bond if any
device's dev_set_mac_address call fails.

If a bond of regular Ethernet devices is connected to a static
link aggregation (Etherchannel channel group), a set_mac failure would
result in that slave having a different MAC address than the bond, which
in turn would cause traffic inbound from the switch to that slave to be
dropped (as the destination MAC would not pass the device MAC filters).

The failure check for the set_mac call serves a legitimate
purpose, and I don't believe we should bypass it without making the
bypass an option that is explicitly enabled for those special cases that
need it.

E.g., something like the following (which I have not tested);
this would also need documentation and iproute2 updates to go with it.
This would be enabled with "fail_over_mac=keepmac".

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 1f276fa30ba6..d2283fc23b16 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1483,7 +1483,8 @@ int bond_enslave(struct net_device *bond_dev, struct 
net_device *slave_dev)
ether_addr_copy(new_slave->perm_hwaddr, slave_dev->dev_addr);
 
if (!bond->params.fail_over_mac ||
-   BOND_MODE(bond) != BOND_MODE_ACTIVEBACKUP) {
+   (BOND_MODE(bond) != BOND_MODE_ACTIVEBACKUP &&
+bond->params.fail_over_mac != BOND_FOM_KEEPMAC)) {
/* Set slave to master's mac address.  The application already
 * set the master's mac address to that of the first slave
 */
diff --git a/drivers/net/bonding/bond_options.c 
b/drivers/net/bonding/bond_options.c
index 577e57cad1dc..f9653fe4d622 100644
--- a/drivers/net/bonding/bond_options.c
+++ b/drivers/net/bonding/bond_options.c
@@ -125,6 +125,7 @@ static const struct bond_opt_value bond_fail_over_mac_tbl[] 
= {
{ "none",   BOND_FOM_NONE,   BOND_VALFLAG_DEFAULT},
{ "active", BOND_FOM_ACTIVE, 0},
{ "follow", BOND_FOM_FOLLOW, 0},
+   { "keepmac", BOND_FOM_KEEPMAC, 0},
{ NULL, -1,  0},
 };
 
diff --git a/include/net/bonding.h b/include/net/bonding.h
index 6360c259da6d..ec3442b3aa83 100644
--- a/include/net/bonding.h
+++ b/include/net/bonding.h
@@ -420,6 +420,7 @@ static inline bool bond_slave_can_tx(struct slave *slave)
 #define BOND_FOM_NONE      0
 #define BOND_FOM_ACTIVE1
 #define BOND_FOM_FOLLOW2
+#define BOND_FOM_KEEPMAC   3
 
 #define BOND_ARP_TARGETS_ANY   0
 #define BOND_ARP_TARGETS_ALL   1


-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH] bonding: Allow tun-interfaces as slaves

2016-08-09 Thread Jay Vosburgh

Jörn Engel <jo...@purestorage.com> wrote:

>Hello Tianhong!
>
>On Tue, Aug 09, 2016 at 10:18:41AM +0800, Ding Tianhong wrote:
>> 
>> I don't understand your problem clearly, can you explain more about how the 
>> 00503b6f702e break tun-interfaces
>> and we will try to fix it.
>
>Here is a trivial testcase:
>openvpn --mktun --dev tun0
>echo +tun0 > /sys/class/net/bond0/bonding/slaves
>
>Worked fine before your patch, no longer works after your patch.  Works
>again after my patch.

Could you describe your use case a bit further?  Are you bonding
together multiple VPN tunnels?

This may be a regression, but since the patch that nominally
introduced it was 2 years ago, the impact appears to be very narrow.

>> and more, dev_set_mac_address will change the salver's mac address, some nic 
>> don't support to change the mac address and
>> could not work as bond slave, so we need to check the return value, I don't 
>> think this patch has any effective improvement.
>
>Using bonding in balance-rr mode, there doesn't seem to be a need to
>change the mac address.  I suppose you might care in other modes, but I
>don't.

The balance-rr mode (as well as the -xor mode) is designed to
interoperate with a Cisco Etherchannel-style static link aggregation,
which requires all members to have the same MAC address for proper
function.

Now, the above notwithstanding, I don't have an issue if you
want to bond together a couple of tun devices and can make it work.
However, for the standard balance-rr case, the enslavement must fail if
the call to set the slave MAC fails, as permitting the slave into the
bond after set-MAC fails can result in a bond that silently loses
packets.

My tentative suggestion is that we extened fail_over_mac to
cover additional modes, as a sort of "I really know what I'm doing"
flag, and allow the enslavement to succeed when it is set.  This would
require setting an additional bonding option for this situation, but
that doesn't seem to be a undue burden as this looks to be a niche use
case.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH net V2] net/bonding: Enforce active-backup policy for IPoIB bonds

2016-07-21 Thread Jay Vosburgh

Saeed Mahameed <sae...@mellanox.com> wrote:

>From: Mark Bloch <ma...@mellanox.com>
>
>When using an IPoIB bond currently only active-backup mode is a valid
>use case and this commit strengthens it.
>
>Since commit 2ab82852a270 ("net/bonding: Enable bonding to enslave
>netdevices not supporting set_mac_address()") was introduced till
>4.7-rc1, IPoIB didn't support the set_mac_address ndo, and hence the
>fail over mac policy always applied to IPoIB bonds.
>
>With the introduction of commit 492a7e67ff83 ("IB/IPoIB: Allow setting
>the device address"), that doesn't hold and practically IPoIB bonds are
>broken as of that. To fix it, lets go to fail over mac if the device
>doesn't support the ndo OR this is IPoIB device.
>
>As a by-product, this commit also prevents a stack corruption which
>occurred when trying to copy 20 bytes (IPoIB) device address
>to a sockaddr struct that has only 16 bytes of storage.
>
>Signed-off-by: Mark Bloch <ma...@mellanox.com>
>Signed-off-by: Or Gerlitz <ogerl...@mellanox.com>
>Signed-off-by: Saeed Mahameed <sae...@mellanox.com>

Signed-off-by: Jay Vosburgh <jay.vosbu...@canonical.com>


>
>Changes from v0:
> - Set res to -EOPNOTSUPP before jumping to err_undo_flags.
>
> drivers/net/bonding/bond_main.c | 11 ++-
> 1 file changed, 10 insertions(+), 1 deletion(-)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index a2afa3b..4d79819 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -1422,7 +1422,16 @@ int bond_enslave(struct net_device *bond_dev, struct 
>net_device *slave_dev)
>   return -EINVAL;
>   }
> 
>-  if (slave_ops->ndo_set_mac_address == NULL) {
>+  if (slave_dev->type == ARPHRD_INFINIBAND &&
>+  BOND_MODE(bond) != BOND_MODE_ACTIVEBACKUP) {
>+  netdev_warn(bond_dev, "Type (%d) supports only active-backup 
>mode\n",
>+  slave_dev->type);
>+  res = -EOPNOTSUPP;
>+  goto err_undo_flags;
>+  }
>+
>+  if (!slave_ops->ndo_set_mac_address ||
>+  slave_dev->type == ARPHRD_INFINIBAND) {
>   netdev_warn(bond_dev, "The slave device specified does not 
> support setting the MAC address\n");
>   if (BOND_MODE(bond) == BOND_MODE_ACTIVEBACKUP &&
>   bond->params.fail_over_mac != BOND_FOM_ACTIVE) {
>-- 
>2.8.0
>

Re: [PATCH net] net/bonding: Enforce active-backup policy for IPoIB bonds

2016-07-20 Thread Jay Vosburgh

Andy Gospodarek <go...@cumulusnetworks.com> wrote:

>On Wed, Jul 20, 2016 at 05:44:20PM +0300, Saeed Mahameed wrote:
>> From: Mark Bloch <ma...@mellanox.com>
>> 
>> When using an IPoIB bond currently only active-backup mode is a valid
>> use case and this commit strengthens it.
>> 
>> Since commit 2ab82852a270 ("net/bonding: Enable bonding to enslave
>> netdevices not supporting set_mac_address()") was introduced till
>> 4.7-rc1, IPoIB didn't support the set_mac_address ndo, and hence
>> the fail over mac policy always applied to IPoIB bonds.
>> 
>> With the introduction of commit 492a7e67ff83 ("IB/IPoIB: Allow setting
>> the device address"), that doesn't hold and practically IPoIB bonds are
>> broken as of that. To fix it, lets go to fail over mac if the device
>> doesn't support the ndo OR this is IPoIB device.
>> 
>> As a by-product, this commit also prevents a stack corruption which
>> occurred when trying to copy 20 bytes (IPoIB) device address
>> to a sockaddr struct that has only 16 bytes of storage.
>> 
>> Signed-off-by: Mark Bloch <ma...@mellanox.com>
>> Signed-off-by: Or Gerlitz <ogerl...@mellanox.com>
>> Signed-off-by: Saeed Mahameed <sae...@mellanox.com>
>> ---
>>  drivers/net/bonding/bond_main.c | 10 +-
>>  1 file changed, 9 insertions(+), 1 deletion(-)
>> 
>> diff --git a/drivers/net/bonding/bond_main.c 
>> b/drivers/net/bonding/bond_main.c
>> index a2afa3b..ccd4003 100644
>> --- a/drivers/net/bonding/bond_main.c
>> +++ b/drivers/net/bonding/bond_main.c
>> @@ -1422,7 +1422,15 @@ int bond_enslave(struct net_device *bond_dev, struct 
>> net_device *slave_dev)
>>  return -EINVAL;
>>  }
>>  
>> -if (slave_ops->ndo_set_mac_address == NULL) {
>> +if (slave_dev->type == ARPHRD_INFINIBAND &&
>> +BOND_MODE(bond) != BOND_MODE_ACTIVEBACKUP) {
>> +netdev_warn(bond_dev, "Type (%d) supports only active-backup 
>> mode\n",
>> +slave_dev->type);
>
>Seems like we should propagate the failure back to the caller.
>Something like this?
>
>   return -EOPNOTSUPP;

Agreed, although I think it needs to be

res = -EOPNOTSUPP;
goto err_undo_flags;

The code has to undo the the ARPHRD_INFINIBAND setting in
bond_dev->type and other stuff done by bond_setup_by_slave.  If we don't
switch the fields back to the ARPHRD_ETHER related values, then we could
have an invalid dereference on the header_ops pointer if something
unloads the IB module.

-J

>> +goto err_undo_flags;
>> +}
>> +
>> +if (!slave_ops->ndo_set_mac_address ||
>> +    slave_dev->type == ARPHRD_INFINIBAND) {
>>  netdev_warn(bond_dev, "The slave device specified does not 
>> support setting the MAC address\n");
>>  if (BOND_MODE(bond) == BOND_MODE_ACTIVEBACKUP &&
>>  bond->params.fail_over_mac != BOND_FOM_ACTIVE) {
>
>The rest of the patch seems logical, so I'm fine with it.

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH] bonding:Fix perfomance drop if one bonding device in configuration is repeatedly restarted

2016-07-14 Thread Jay Vosburgh

Nicholas Krause <xerofo...@gmail.com> wrote:

>This fixes a issue with the following test caseL
>
>1. Created a alb bonding(bond0) with three of 1Gb interfaces
>(eth0, eth1, eth2) under ipv4, set the bond0 ip address as
>192.168.10.101, run command "iperf -s" as the server(x86_64).
>
>2. Run command "iperf -c 192.168.10.101 -t 1 -i 1"
>on three linux client with CentOS6.5 x86_64
>
>3. Run the following commands on server repeatly,
>"ifconfig eth1 down; sleep 20; ifconfig eth1 up".
>monitor the bonding performance with program "
>nmon_x86_64_centos6", after some time, the
>performance droped from 3000 to 2000 Mbps,
>and could not go up to 3000Mbps any more,
>it seems that the bonding doesn't rebalance
>again.
>This is due to use incorrectly checking in the function,
>rlb_choose_channel if the ethernet address is not equal
>to 64 bits before updating our mac address from arp.
>However this is incorrect as the ethernet header length
>needs to be 64bits in order to update our mac address
>from arp without the above test case having a perfomance
>regression. Fix it by checking for the ethernet header
>having 64 bits and not.

NAK.

I don't understand what this is trying to say, but the change
below does not look valid.

>Signed-off-by: Nicholas Krause <xerofo...@gmail.com>
>---
> drivers/net/bonding/bond_alb.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/drivers/net/bonding/bond_alb.c b/drivers/net/bonding/bond_alb.c
>index 551f0f8..d5ae8a5 100644
>--- a/drivers/net/bonding/bond_alb.c
>+++ b/drivers/net/bonding/bond_alb.c
>@@ -593,7 +593,7 @@ static struct slave *rlb_choose_channel(struct sk_buff 
>*skb, struct bonding *bon
>   if ((client_info->ip_src == arp->ip_src) &&
>   (client_info->ip_dst == arp->ip_dst)) {
>   /* the entry is already assigned to this client */
>-  if (!ether_addr_equal_64bits(arp->mac_dst, mac_bcast)) {
>+  if (ether_addr_equal_64bits(arp->mac_dst, mac_bcast)) {
>   /* update mac address from arp */
>   ether_addr_copy(client_info->mac_dst, 
> arp->mac_dst);

Regardless of the description above, this test is insuring that
the code doesn't use the broadcast MAC address for the client.  Changing
it seems like a bad thing to do, as it would cause traffic for the
client to be sent to the ethernet broadcast address.

This might, in fact, make the described test run better, but it
has nothing to do with rebalancing and would negatively impact other
hosts on the network.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH net] bonding: fix 802.3ad aggregator reselection

2016-07-07 Thread Jay Vosburgh

Veli-Matti Lintu <veli-matti.li...@opinsys.fi> wrote:

>2016-07-06 0:20 GMT+03:00 Jay Vosburgh <jay.vosbu...@canonical.com>:
>> Veli-Matti Lintu <veli-matti.li...@opinsys.fi> wrote:
>>
>>>2016-06-30 14:15 GMT+03:00 Veli-Matti Lintu <veli-matti.li...@opinsys.fi>:
>>>> 2016-06-29 18:59 GMT+03:00 Jay Vosburgh <jay.vosbu...@canonical.com>:
>>>>> Veli-Matti Lintu <veli-matti.li...@opinsys.fi> wrote:
>>>
>>>>> I tried this locally, but don't see any failure (at the end, the
>>>>> "Switch A" agg is still active with the single port).  I am starting
>>>>> with just two ports in each aggregator (instead of three), so that may
>>>>> be relevant.
>>>>
>>>> When the connection problem occurs, /proc/net/bonding/bond0 always
>>>> shows the aggregator that has a link up active. Dumpcap sees at least
>>>> broadcast traffic on the port, but I haven't done extensive analysis
>>>> on that yet. All TCP connections are cut until the bond is up again
>>>> when more ports are enabled on the switch. ping doesn't work either
>>>> way.
>>>
>>>I did some further testing on this and it looks like I can get this
>>>working by enabling the ports in the new aggregator the same way as
>>>the ports in old aggregator are disabled in ad_agg_selection_logic().
>>
>> I tested with this some as well, using 6 ports total across two
>> switches, and was still not able to reproduce the issue.  How are you
>> configuring the bond in the first place?  It may be that there is some
>> dependency on the ordering of the slaves within the bond and how they
>> are disabled.
>
>The switch configuration should be pretty basic - port group that has
>LACP as type. L4 based hashing is configured on the switches. The bond
>is running as untagged on the ports. "show run" shows them as such:
>
>trunk 20-22 trk4 lacp
>trunk-load-balance L4-based
>
>Both switches have similar configuration. I'm not an expert in switch
>configurations, so I do not know if there's anything else interesting
>in there.
>
>The servers are running Ubuntu 16.04 and systemd-networkd is used to
>configure the interfaces. Using /etc/network/interfaces failed quite
>often to setup all interfaces at boot time, so we switched over to
>systemd-networkd.
>
>> Also, I am taking the ports down by physically unplugging the
>> cable from the switch.  If you're doing it differently, that might be
>> relevant.
>
>The servers are in remote location, so I'm disabling the ports in
>switch configuration. Unfortunately I do not have a similar setup that
>I could access physically at the moment.
>
>>>Normally the ports seem to get enabled from ad_mux_machine() in "case
>>>AD_MUX_COLLECTING_DISTRIBUTING", but something different happens there
>>>as the port does get enabled, but no traffic passes through. So far I
>>>haven't been able to figure out what happens. When the connection is
>>>lost, dumpcap sees traffic on the only active port in the bond, but it
>>>seems like nothing catches it. If I disable and re-enable the same
>>>port, traffic start flowing again normally.
>>
>> Looking at the debug log you provided, the step that fails
>> appears to correspond to this portion:
>
>...
>
>> The "Periodic Machine" 3->4 then 4->3 then "Sent LACPDU" looks
>> normal (3->4 is the periodic timer expiring, state 4 sets port->ntt,
>> then the next iteration moves back to state 3), and includes both send
>> and receive of LACPDUs on port 2 (which is enp5s0f0).
>>
>> I don't see a transition to COLL/DIST state for port 2 in the
>> log, so presumably it takes place prior to the beginning.  This is the
>> place that would call __enable_port.
>>
>> What you're describing sounds consistent with the slave not
>> being set to active, which would cause bond_handle_frame ->
>> bond_should_deliver_exact_match to return RX_HANDLER_EXACT.
>>
>> This leads me to wonder if the port in question is in this
>> incorrect state from the beginning, but it only manifests once it
>> becomes the only active port.
>>
>> Can you instrument __enable_port to see when it is called for
>> the bad port, and if it actually calls bond_set_slave_active_flags for
>> the port in question (without your additional patch)?
>
>Thanks for the pointers. Adding some extra debug in __enable_port
>hel

Re: [PATCH net] bonding: fix 802.3ad aggregator reselection

2016-07-05 Thread Jay Vosburgh

Veli-Matti Lintu <veli-matti.li...@opinsys.fi> wrote:

>2016-06-30 14:15 GMT+03:00 Veli-Matti Lintu <veli-matti.li...@opinsys.fi>:
>> 2016-06-29 18:59 GMT+03:00 Jay Vosburgh <jay.vosbu...@canonical.com>:
>>> Veli-Matti Lintu <veli-matti.li...@opinsys.fi> wrote:
>
>>> I tried this locally, but don't see any failure (at the end, the
>>> "Switch A" agg is still active with the single port).  I am starting
>>> with just two ports in each aggregator (instead of three), so that may
>>> be relevant.
>>
>> When the connection problem occurs, /proc/net/bonding/bond0 always
>> shows the aggregator that has a link up active. Dumpcap sees at least
>> broadcast traffic on the port, but I haven't done extensive analysis
>> on that yet. All TCP connections are cut until the bond is up again
>> when more ports are enabled on the switch. ping doesn't work either
>> way.
>
>I did some further testing on this and it looks like I can get this
>working by enabling the ports in the new aggregator the same way as
>the ports in old aggregator are disabled in ad_agg_selection_logic().

I tested with this some as well, using 6 ports total across two
switches, and was still not able to reproduce the issue.  How are you
configuring the bond in the first place?  It may be that there is some
dependency on the ordering of the slaves within the bond and how they
are disabled.

Also, I am taking the ports down by physically unplugging the
cable from the switch.  If you're doing it differently, that might be
relevant.

>Normally the ports seem to get enabled from ad_mux_machine() in "case
>AD_MUX_COLLECTING_DISTRIBUTING", but something different happens there
>as the port does get enabled, but no traffic passes through. So far I
>haven't been able to figure out what happens. When the connection is
>lost, dumpcap sees traffic on the only active port in the bond, but it
>seems like nothing catches it. If I disable and re-enable the same
>port, traffic start flowing again normally.

Looking at the debug log you provided, the step that fails
appears to correspond to this portion:

[  367.811419] bond0: link status definitely down for interface enp5s0f1, disabl
ing it
[  367.811425] bond0: best Agg=2; P=3; a k=9; p k=57; Ind=0; Act=0
[  367.811429] bond0: best ports 8830113f6638 slave 8830113cfe00 enp5s0f
0
[  367.811432] bond0: Agg=1; P=3; a k=9; p k=57; Ind=0; Act=0
[  367.811434] bond0: Agg=2; P=3; a k=9; p k=57; Ind=0; Act=0
[  367.811437] bond0: Agg=3; P=0; a k=0; p k=0; Ind=0; Act=0
[  367.811439] bond0: Agg=4; P=0; a k=0; p k=0; Ind=0; Act=0
[  367.811441] bond0: Agg=5; P=0; a k=0; p k=0; Ind=0; Act=0
[  367.811444] bond0: Agg=6; P=0; a k=0; p k=0; Ind=0; Act=0
[  367.811446] bond0: LAG 2 chosen as the active LAG
[  367.811448] bond0: Agg=2; P=3; a k=9; p k=57; Ind=0; Act=1
[  367.811451] bond0: Port 1 changed link status to DOWN
[  367.811455] bond0: first active interface up!
[  367.811461] Rx Machine: Port=1 (enp5s0f1), Last State=6, Curr State=2
[  367.811495] ixgbe :05:00.0 enp5s0f0: event: 1b
[  367.811497] ixgbe :05:00.0 enp5s0f0: IFF_SLAVE
[  367.811519] ixgbe :05:00.1 enp5s0f1: event: 19
[  367.811522] ixgbe :05:00.1 enp5s0f1: IFF_SLAVE
[  367.811525] ixgbe :05:00.1 enp5s0f1: event: 19
[  367.811528] ixgbe :05:00.1 enp5s0f1: IFF_SLAVE
[  386.809542] Periodic Machine: Port=2, Last State=3, Curr State=4
[  386.909543] Periodic Machine: Port=2, Last State=4, Curr State=3
[  387.009530] update lacpdu: enp5s0f0, actor port state 3d
[  387.009541] Sent LACPDU on port 2
[  387.571372] bond0: Received LACPDU on port 2 slave enp5s0f0
[  387.571379] Rx Machine: Port=2 (enp5s0f0), Last State=6, Curr State=6
[  387.571381] enp5s0f0 partner sync=1
[  416.810767] Periodic Machine: Port=2, Last State=3, Curr State=4
[  416.910786] Periodic Machine: Port=2, Last State=4, Curr State=3
[  417.010749] update lacpdu: enp5s0f0, actor port state 3d
[  417.010761] Sent LACPDU on port 2
[  417.569137] bond0: Received LACPDU on port 2 slave enp5s0f0
[  417.569156] Rx Machine: Port=2 (enp5s0f0), Last State=6, Curr State=6
[... repeats this cycle for a while ]
[  537.614050] enp5s0f0 partner sync=1
[  566.816851] Periodic Machine: Port=2, Last State=3, Curr State=4
[  566.916843] Periodic Machine: Port=2, Last State=4, Curr State=3
[  567.016829] update lacpdu: enp5s0f0, actor port state 3d
[  567.016839] Sent LACPDU on port 2
[  567.558379] bond0: Received LACPDU on port 2 slave enp5s0f0
[  567.558399] Rx Machine: Port=2 (enp5s0f0), Last State=6, Curr State=6
[  567.558403] enp5s0f0 partner sync=1
[  572.925434] igb :81:00.0 ens5f0: igb: ens5f0 NIC Link is Up 1000 Mbps 
Full Duplex, Flow Control: RX
[  572.925862] igb :81:00.0 ens5f0: event: 4
[  572.925865] igb :81:00.0 ens5f0: IFF_SLAVE
[  572.925

Re: [PATCH net] bonding: fix 802.3ad aggregator reselection

2016-07-05 Thread Jay Vosburgh

Veli-Matti Lintu <veli-matti.li...@opinsys.fi> wrote:
[...]
>I have understood that only miimon is available with 802.3ad:
>
>bonding.txt:
>
>802.3ad:
>...
>Finally, the 802.3ad mode mandates the use of the MII monitor,
>therefore, the ARP monitor is not available in this mode.
>
>Is there a way to enable ARP monitor somehow?

No, there isn't.  ARP monitor is not available in 802.3ad mode.

    -J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH net] bonding: fix 802.3ad aggregator reselection

2016-06-29 Thread Jay Vosburgh

Veli-Matti Lintu <veli-matti.li...@opinsys.fi> wrote:
[...]
>Thanks for the patch. I have been now testing it and the reselection
>seems to be working now in most cases, but I hit one case that seems
>to consistently fail in my test environment.
>
>I've been doing most of testing with ad_select=count and this happens
>with it. I haven't yet done extensive testing with
>ad_select=stable/bandwidth.
>
>The sequence to trigger the failure seems to be:
>
>  Switch A (Agg ID 2)   Switch B (Agg ID 1)
>enp5s0f0 ens5f0 ens6f0enp5s0f1 ens5f1 ens6f1
>X   X  -   X  -   - Connection works
>(Agg ID 2 active)
>X   -  -   X  -   - Connection works
>(Agg ID 1 active)
>X   -  -   -  -   - No connection (Agg
>ID 2 active)

I tried this locally, but don't see any failure (at the end, the
"Switch A" agg is still active with the single port).  I am starting
with just two ports in each aggregator (instead of three), so that may
be relevant.

Can you enable dynamic debug for bonding and run your test
again, and then send me the debug output (this will appear in the kernel
log, e.g., from dmesg)?  You can enable this via

# echo 'module bonding =p' > /sys/kernel/debug/dynamic_debug/control

before running the test.  The contents of
/proc/net/bonding/bond0 (read as root, otherwise the LACP internal state
isn't included) from each step would also be helpful.  The output will
likely be large, so I'd suggest sending it to me directly off-list if
it's too big.

>I'm also wondering why link down event causes change of aggregator
>when the active aggregator has the same number of active links than
>the new aggregator.

This shouldn't happen.  If the active aggregator is just as good
as some other aggregator choice, it should stay with the current active.

I suspect that both of these are edge cases arising from the
aggregators now including link down ports, which previously never
happened.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

[PATCH net] bonding: fix 802.3ad aggregator reselection

2016-06-23 Thread Jay Vosburgh


Since commit 7bb11dc9f59d ("bonding: unify all places where
actor-oper key needs to be updated."), the logic in bonding to handle
selection between multiple aggregators has not functioned.

This affects only configurations wherein the bonding slaves
connect to two discrete aggregators (e.g., two independent switches, each
with LACP enabled), thus creating two separate aggregation groups within a
single bond.

The cause is a change in 7bb11dc9f59d to no longer set
AD_PORT_BEGIN on a port after a link state change, which would cause the
port to be reselected for attachment to an aggregator as if were newly
added to the bond.  We cannot restore the prior behavior, as it
contradicts IEEE 802.1AX 5.4.12, which requires ports that "become
inoperable" (lose carrier, setting port_enabled=false as per 802.1AX
5.4.7) to remain selected (i.e., assigned to the aggregator).  As the port
now remains selected, the aggregator selection logic is not invoked.

A side effect of this change is that aggregators in bonding will
now contain ports that are link down.  The aggregator selection logic
does not currently handle this situation correctly, causing incorrect
aggregator selection.

This patch makes two changes to repair the aggregator selection
logic in bonding to function as documented and within the confines of the
standard:

First, the aggregator selection and related logic now utilizes the
number of active ports per aggregator, not the number of selected ports
(as some selected ports may be down).  The ad_select "bandwidth" and
"count" options only consider ports that are link up.

Second, on any carrier state change of any slave, the aggregator
selection logic is explicitly called to insure the correct aggregator is
active.

Reported-by: Veli-Matti Lintu <veli-matti.li...@opinsys.fi>
Fixes: 7bb11dc9f59d ("bonding: unify all places where actor-oper key needs to 
be updated.")
Signed-off-by: Jay Vosburgh <jay.vosbu...@canonical.com>

---
 drivers/net/bonding/bond_3ad.c | 64 +-
 1 file changed, 45 insertions(+), 19 deletions(-)

diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
index b9304a295f86..ca81f46ea1aa 100644
--- a/drivers/net/bonding/bond_3ad.c
+++ b/drivers/net/bonding/bond_3ad.c
@@ -657,6 +657,20 @@ static void __set_agg_ports_ready(struct aggregator 
*aggregator, int val)
}
 }
 
+static int __agg_active_ports(struct aggregator *agg)
+{
+   struct port *port;
+   int active = 0;
+
+   for (port = agg->lag_ports; port;
+port = port->next_port_in_aggregator) {
+   if (port->is_enabled)
+   active++;
+   }
+
+   return active;
+}
+
 /**
  * __get_agg_bandwidth - get the total bandwidth of an aggregator
  * @aggregator: the aggregator we're looking at
@@ -664,39 +678,40 @@ static void __set_agg_ports_ready(struct aggregator 
*aggregator, int val)
  */
 static u32 __get_agg_bandwidth(struct aggregator *aggregator)
 {
+   int nports = __agg_active_ports(aggregator);
u32 bandwidth = 0;
 
-   if (aggregator->num_of_ports) {
+   if (nports) {
switch (__get_link_speed(aggregator->lag_ports)) {
case AD_LINK_SPEED_1MBPS:
-   bandwidth = aggregator->num_of_ports;
+   bandwidth = nports;
break;
case AD_LINK_SPEED_10MBPS:
-   bandwidth = aggregator->num_of_ports * 10;
+   bandwidth = nports * 10;
break;
case AD_LINK_SPEED_100MBPS:
-   bandwidth = aggregator->num_of_ports * 100;
+   bandwidth = nports * 100;
break;
case AD_LINK_SPEED_1000MBPS:
-   bandwidth = aggregator->num_of_ports * 1000;
+   bandwidth = nports * 1000;
break;
case AD_LINK_SPEED_2500MBPS:
-   bandwidth = aggregator->num_of_ports * 2500;
+   bandwidth = nports * 2500;
break;
case AD_LINK_SPEED_1MBPS:
-   bandwidth = aggregator->num_of_ports * 1;
+   bandwidth = nports * 1;
break;
case AD_LINK_SPEED_2MBPS:
-   bandwidth = aggregator->num_of_ports * 2;
+   bandwidth = nports * 2;
break;
case AD_LINK_SPEED_4MBPS:
-   bandwidth = aggregator->num_of_ports * 4;
+   bandwidth = nports * 4;
break;
case AD_LINK_SPEED_56000MBPS:
-

Re: 802.3ad bonding aggregator reselection

2016-06-22 Thread Jay Vosburgh

;num_of_ports > best->num_of_ports)
+   if (__agg_active_ports(curr) > __agg_active_ports(best))
return curr;
 
-   if (curr->num_of_ports < best->num_of_ports)
+   if (__agg_active_ports(curr) < __agg_active_ports(best))
return best;
 
/*FALLTHROUGH*/
@@ -1561,8 +1576,14 @@ static int agg_device_up(const struct aggregator *agg)
if (!port)
return 0;
 
-   return netif_running(port->slave->dev) &&
-  netif_carrier_ok(port->slave->dev);
+   for (port = agg->lag_ports; port;
+port = port->next_port_in_aggregator) {
+   if (netif_running(port->slave->dev) &&
+   netif_carrier_ok(port->slave->dev))
+   return 1;
+   }
+
+   return 0;
 }
 
 /**
@@ -1610,7 +1631,7 @@ static void ad_agg_selection_logic(struct aggregator *agg,
 
agg->is_active = 0;
 
-   if (agg->num_of_ports && agg_device_up(agg))
+   if (__agg_active_ports(agg) && agg_device_up(agg))
best = ad_agg_selection_test(best, agg);
}
 
@@ -1622,7 +1643,7 @@ static void ad_agg_selection_logic(struct aggregator *agg,
 * answering partner.
 */
if (active && active->lag_ports &&
-   active->lag_ports->is_enabled &&
+   __agg_active_ports(active) &&
(__agg_has_partner(active) ||
 (!__agg_has_partner(active) &&
 !__agg_has_partner(best {
@@ -2432,7 +2453,9 @@ void bond_3ad_adapter_speed_duplex_changed(struct slave 
*slave)
  */
 void bond_3ad_handle_link_change(struct slave *slave, char link)
 {
+   struct aggregator *agg;
struct port *port;
+   bool dummy;
 
port = &(SLAVE_AD_INFO(slave)->port);
 
@@ -2459,6 +2482,9 @@ void bond_3ad_handle_link_change(struct slave *slave, 
char link)
port->is_enabled = false;
ad_update_actor_keys(port, true);
}
+   agg = __get_first_agg(port);
+   ad_agg_selection_logic(agg, );
+
netdev_dbg(slave->bond->dev, "Port %d changed link status to %s\n",
   port->actor_port_number,
   link == BOND_LINK_UP ? "UP" : "DOWN");
@@ -2499,7 +2525,7 @@ int bond_3ad_set_carrier(struct bonding *bond)
active = __get_active_agg(&(SLAVE_AD_INFO(first_slave)->aggregator));
if (active) {
/* are enough slaves available to consider link up? */
-   if (active->num_of_ports < bond->params.min_links) {
+   if (__agg_active_ports(active) < bond->params.min_links) {
if (netif_carrier_ok(bond->dev)) {
netif_carrier_off(bond->dev);
goto out;


-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: 802.3ad bonding aggregator reselection

2016-06-21 Thread Jay Vosburgh

 if (__get_agg_selection_mode(port) != BOND_AD_STABLE)
+   port->port->sm_vars &= ~AD_PORT_SELECTED;
}
netdev_dbg(slave->bond->dev, "Port %d changed link status to %s\n",
   port->actor_port_number,

I'll test this locally and will submit a formal patch with an
update to bonding.txt tomorrow (if it works).

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: 802.3ad bonding aggregator reselection

2016-06-21 Thread Jay Vosburgh

Veli-Matti Lintu <veli-matti.li...@opinsys.fi> wrote:

>2016-06-20 17:11 GMT+03:00 zhuyj <zyjzyj2...@gmail.com>:
>> 5. Switch Configuration
>> ===
>>
>> For this section, "switch" refers to whatever system the
>> bonded devices are directly connected to (i.e., where the other end of
>> the cable plugs into).  This may be an actual dedicated switch device,
>> or it may be another regular system (e.g., another computer running
>> Linux),
>>
>> The active-backup, balance-tlb and balance-alb modes do not
>> require any specific configuration of the switch.
>>
>> The 802.3ad mode requires that the switch have the appropriate
>> ports configured as an 802.3ad aggregation.  The precise method used
>> to configure this varies from switch to switch, but, for example, a
>> Cisco 3550 series switch requires that the appropriate ports first be
>> grouped together in a single etherchannel instance, then that
>> etherchannel is set to mode "lacp" to enable 802.3ad (instead of
>> standard EtherChannel).
>
>The ports are configured in switch settings (HP Procurve 2530-48G) in
>same trunk group (TrkX) and trunk group type is set as LACP.
>/proc/net/bonding/bond0 also shows that the three ports belong to same
>aggregator and bandwidth tests also support this. In my understanding
>Procurve's trunk group is pretty much the same as etherchannel in
>Cisco's terminology. The bonded link comes always up properly, but
>handling of links going down is the problem. Are there known
>differences between different vendors there?

I did the original LACP reselection testing on a Cisco switch,
but I have an HP 2530 now; I'll test it later today or tomorrow and see
if it behaves properly, and whether your proposed patch is needed.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH v2 net] bonding: don't use stale speed and duplex information

2016-02-28 Thread Jay Vosburgh

zhuyj <zyjzyj2...@gmail.com> wrote:

>On 02/25/2016 09:33 PM, Jay Vosburgh wrote:
>> zhuyj <zyjzyj2...@gmail.com> wrote:
>> [...]
>>> I delved into the source code and Emil's tests. I think that the problem
>>> that this patch expects to fix occurs very unusually.
>>>
>>> Do you agree with me?
>>>
>>> If so, maybe the following patch can reduce the performance loss.
>>> Please comment on it. Thanks a lot.
>>>
>>>
>>> diff --git a/drivers/net/bonding/bond_main.c
>>> b/drivers/net/bonding/bond_main.c
>>> index b7f1a99..c4c511a 100644
>>> --- a/drivers/net/bonding/bond_main.c
>>> +++ b/drivers/net/bonding/bond_main.c
>>> @@ -2129,7 +2129,9 @@ static void bond_miimon_commit(struct bonding *bond)
>>> continue;
>>>
>>> case BOND_LINK_UP:
>>> -   bond_update_speed_duplex(slave);
>>> +   if (slave->speed == SPEED_UNKNOWN)
>>> +   bond_update_speed_duplex(slave);
>>> +
>>> bond_set_slave_link_state(slave, BOND_LINK_UP,
>>> BOND_SLAVE_NOTIFY_NOW);
>>> slave->last_link_up = jiffies;
>>  I don't believe the speed is necessarily SPEED_UNKNOWN coming in
>> here.  If the race occurs at a time later than the initial enslavement,
>> speed may already be set (and the race manifests if the new speed
>> changes, i.e., the link changes from 1 Gb/sec to 10 Gb/sec), so I don't
>> think this is functionally correct.
>Hi, Jay
>
>Thanks for your reply.
>
>IMHO, "If the race occurs at a time later than the initial enslavement,
>speed may already be set (and the race manifests if the new speed
>changes, i.e., the link changes from 1 Gb/sec to 10 Gb/sec)", from my test,
>this will not happen because the previous source code make the speed
>correct.

How, exactly, will "the previous source code make the speed
correct"?

>This "bond_update_speed_duplex" repeats to get the correct speed.
>
>That is, this patch is to fix the error in initial enslavement. The
>mentioned scenario will not occur.

I see nothing in the code that limits the race to happening only
at enslavement time.

If the bond_mii_monitor call executes between the device going
link up and the arrival of the NETDEV_CHANGE or NETDEV_UP callback, the
stored speed and duplex are stale.  The stale speed value is not
guaranteed to be SPEED_UNKNOWN, so your patch is not functionally
correct.

    -J

>Even though the performance impact is minimal, if we can avoid this
>performance
>impact, why not ?
>
>Best Regards!
>Zhu Yanjun
>
>>
>>  Also, the call to bond_miimon_commit itself is already gated by
>> bond_miimon_inspect finding a link state change.  The performance impact
>> here should be minimal.
>>
>>  -J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH v2 net] bonding: don't use stale speed and duplex information

2016-02-25 Thread Jay Vosburgh

zhuyj <zyjzyj2...@gmail.com> wrote:
[...]
>I delved into the source code and Emil's tests. I think that the problem
>that this patch expects to fix occurs very unusually.
>
>Do you agree with me?
>
>If so, maybe the following patch can reduce the performance loss.
>Please comment on it. Thanks a lot.
>
>
>diff --git a/drivers/net/bonding/bond_main.c
>b/drivers/net/bonding/bond_main.c
>index b7f1a99..c4c511a 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -2129,7 +2129,9 @@ static void bond_miimon_commit(struct bonding *bond)
>continue;
>
>case BOND_LINK_UP:
>-   bond_update_speed_duplex(slave);
>+   if (slave->speed == SPEED_UNKNOWN)
>+   bond_update_speed_duplex(slave);
>+
>bond_set_slave_link_state(slave, BOND_LINK_UP,
>BOND_SLAVE_NOTIFY_NOW);
>slave->last_link_up = jiffies;

I don't believe the speed is necessarily SPEED_UNKNOWN coming in
here.  If the race occurs at a time later than the initial enslavement,
speed may already be set (and the race manifests if the new speed
changes, i.e., the link changes from 1 Gb/sec to 10 Gb/sec), so I don't
think this is functionally correct.

Also, the call to bond_miimon_commit itself is already gated by
bond_miimon_inspect finding a link state change.  The performance impact
here should be minimal.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH v2 net] bonding: don't use stale speed and duplex information

2016-02-18 Thread Jay Vosburgh

David Miller <da...@davemloft.net> wrote:
[...]
>>  This was done historically in bonding, but the call to
>> bond_update_speed_duplex was removed in commit 876254ae2758 ("bonding:
>> don't call update_speed_duplex() under spinlocks"), as it might sleep
>> under lock.  Later, the locking was changed to only hold RTNL, and so
>> after commit 876254ae2758 ("bonding: don't call update_speed_duplex()
>> under spinlocks") this call is again safe.
>> 
>> Tested-by: "Tantilov, Emil S" <emil.s.tanti...@intel.com>
>> Cc: Veaceslav Falico <vfal...@gmail.com>
>> Cc: dingtianhong <dingtianh...@huawei.com>
>> Fixes: 876254ae2758 ("bonding: don't call update_speed_duplex() under 
>> spinlocks")
>> Signed-off-by: Jay Vosburgh <jay.vosbu...@canonical.com>
>
>Applied, thanks Jay.

Rereading the above, I just noticed that I put the wrong commit
into the fixes tag (and the "Later, the locking was changed" text); the
correct fixes tag should be:

Fixes: 4cb4f97b7e36 ("bonding: rebuild the lock use for bond_mii_monitor()")

        Kernels between 876254ae2758 and 4cb4f97b7e36 should not have
this patch applied, as it might sleep under lock.

Sorry for the error,

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: Bonding MAC failover behavior with VLAN interfaces

2016-02-12 Thread Jay Vosburgh

John Valko <jova...@cisco.com> wrote:

>Hi all,
>
>I have observed the following on CentOS 6 using the
>2.6.32-573.8.1.el6.x86_64 kernel as well as Ubuntu 15.10 with
>4.2.0-25-generic.. I have not yet tried with a vanilla kernel but with the
>large time/distro gap it didn't seem likely to be caused by distro
>changes.

[ ...example of VLAN-over-bond's MAC not changing when bonding
fail_over_mac=active has a failover... ]

>So, herein lies my confusion.  I expect that the VLAN interfaces should
>also pick up the new MAC address, but they don't.  It seems like a bug to
>me, but I don't want to be presumptuous so if in fact this is expected
>behavior, how do you recommend we approach making it switch when the bond
>fails over?  Right now, the MAC must be manually set for each vlan
>interface.
[...]
>I can see it set the mac of the bond to the new active slave MAC, but I
>don't see any indication of looping over vlan interfaces or anything, if
>that is expected... or would it be expected that the 8021q module receives
>an event that should make it change?

Your observation is correct: the fail_over_mac functionality
does not propagate beyond the bonding master device itself.  The
presumption at the VLAN level is that the underlying device can support
multiple MAC addresses; this same effect (different MACs between VLAN
and its lower device) occurs if the device logically below a VLAN
interface changes its MAC.  This is handled by vlan_sync_address, which
is called via the NETDEV_CHANGEADDR notifier (which happens when the
bonding master changes its MAC due to f_o_m=active).

The original need for f_o_m=active was IPoIB devices that cannot
change their MAC address.  Those don't support VLANs, either, so
propagation to VLANs was not a consideration.

In your "real" SR-IOV sitation, is there something that
necessitates the use of fail_over_mac=active (I don't recall that SR-IOV
itself prohibits a VF from changing its MAC address), or is this being
done for other reasons?

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

[PATCH v2 net] bonding: don't use stale speed and duplex information

2016-02-08 Thread Jay Vosburgh


There is presently a race condition between the bonding periodic
link monitor and the updating of a slave's speed and duplex.  The former
occurs on a periodic basis, and the latter in response to a driver's
calling of netif_carrier_on.

It is possible for the periodic monitor to run between the
driver call of netif_carrier_on and the receipt of the NETDEV_CHANGE
event that causes bonding to update the slave's speed and duplex.  This
manifests most notably as a report that a slave is up and "0 Mbps full
duplex" after enslavement, but in principle could report an incorrect
speed and duplex after any link up event if the device comes up with a
different speed or duplex.  This affects the 802.3ad aggregator
selection, as the speed and duplex are selection criteria.

This is fixed by updating the speed and duplex in the periodic
monitor, prior to using that information.

This was done historically in bonding, but the call to
bond_update_speed_duplex was removed in commit 876254ae2758 ("bonding:
don't call update_speed_duplex() under spinlocks"), as it might sleep
under lock.  Later, the locking was changed to only hold RTNL, and so
after commit 876254ae2758 ("bonding: don't call update_speed_duplex()
under spinlocks") this call is again safe.

Tested-by: "Tantilov, Emil S" <emil.s.tanti...@intel.com>
Cc: Veaceslav Falico <vfal...@gmail.com>
Cc: dingtianhong <dingtianh...@huawei.com>
Fixes: 876254ae2758 ("bonding: don't call update_speed_duplex() under 
spinlocks")
Signed-off-by: Jay Vosburgh <jay.vosbu...@canonical.com>

---

v2: Correct Veaceslav's email address

Note: The "Fixes" commit is the commit that makes this operation safe
again, not the commit that originally introduced the race.  I don't see
any simple way to resolve this bug between these two commits.

 drivers/net/bonding/bond_main.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 56b560558884..cabaeb61333d 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -2127,6 +2127,7 @@ static void bond_miimon_commit(struct bonding *bond)
continue;
 
case BOND_LINK_UP:
+   bond_update_speed_duplex(slave);
bond_set_slave_link_state(slave, BOND_LINK_UP,
  BOND_SLAVE_NOTIFY_NOW);
slave->last_link_up = jiffies;
-- 
1.9.1

[PATCH net] bonding: don't use stale speed and duplex information

2016-02-08 Thread Jay Vosburgh


There is presently a race condition between the bonding periodic
link monitor and the updating of a slave's speed and duplex.  The former
occurs on a periodic basis, and the latter in response to a driver's
calling of netif_carrier_on.

It is possible for the periodic monitor to run between the
driver call of netif_carrier_on and the receipt of the NETDEV_CHANGE
event that causes bonding to update the slave's speed and duplex.  This
manifests most notably as a report that a slave is up and "0 Mbps full
duplex" after enslavement, but in principle could report an incorrect
speed and duplex after any link up event if the device comes up with a
different speed or duplex.  This affects the 802.3ad aggregator
selection, as the speed and duplex are selection criteria.

This is fixed by updating the speed and duplex in the periodic
monitor, prior to using that information.

This was done historically in bonding, but the call to
bond_update_speed_duplex was removed in commit 876254ae2758 ("bonding:
don't call update_speed_duplex() under spinlocks"), as it might sleep
under lock.  Later, the locking was changed to only hold RTNL, and so
after commit 876254ae2758 ("bonding: don't call update_speed_duplex()
under spinlocks") this call is again safe.

Tested-by: "Tantilov, Emil S" <emil.s.tanti...@intel.com>
Cc: Veaceslav Falico <vfal...@redhat.com>
Cc: dingtianhong <dingtianh...@huawei.com>
Fixes: 876254ae2758 ("bonding: don't call update_speed_duplex() under 
spinlocks")
Signed-off-by: Jay Vosburgh <jay.vosbu...@canonical.com>

---

Note: The "Fixes" commit is the commit that makes this operation safe
again, not the commit that originally introduced the race.  I don't see
any simple way to resolve this bug between these two commits.

 drivers/net/bonding/bond_main.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 56b560558884..cabaeb61333d 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -2127,6 +2127,7 @@ static void bond_miimon_commit(struct bonding *bond)
continue;
 
case BOND_LINK_UP:
+   bond_update_speed_duplex(slave);
bond_set_slave_link_state(slave, BOND_LINK_UP,
  BOND_SLAVE_NOTIFY_NOW);
slave->last_link_up = jiffies;
-- 
1.9.1

Re: [PATCH net v2] bonding: Fix ARP monitor validation

2016-02-07 Thread Jay Vosburgh


David Miller <da...@davemloft.net> wrote:
[...]
>>  This patch changes the logic in bond_arp_rcv to additionally
>> accept an ARP reply for validation on any slave if there is a current ARP
>> slave and it sent an ARP probe during the previous arp_interval.
>> 
>> Fixes: aeea64ac717a ("bonding: don't trust arp requests unless active slave 
>> really works")
>> Cc: Veaceslav Falico <vfal...@gmail.com>
>> Cc: Andy Gospodarek <go...@cumulusnetworks.com>
>> Signed-off-by: Jay Vosburgh <jay.vosbu...@canonical.com>
>
>I'm going to wait until we get some feedback from Uwe if you don't mind
>Jay, it would be nice to know if it solves their problem too.

Sounds good to me.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: bonding reports interface up with 0 Mbps

2016-02-04 Thread Jay Vosburgh

Jay Vosburgh <jay.vosbu...@canonical.com> wrote:
[...]
>   Thinking about the trace again... Emil: what happens in the
>trace before this?  Is there ever a call to the ixgbe_get_settings?
>Does a NETDEV_UP or NETDEV_CHANGE event ever hit the bond_netdev_event
>function?

Emil kindly sent me the trace offline, and I think I see what's
going on.  It looks like the sequence of events is:

bond_enslave ->
bond_update_speed_duplex (device is down, thus DUPLEX/SPEED_UNKNOWN)
[ do rest of enslavement, start miimon periodic work ]

[ time passes, device goes carrier up ]

ixgbe_service_task: eth1: NIC Link is Up 10 Gbps ->
netif_carrier_on (arranges for NETDEV_CHANGE notifier out of line)

[ a few microseconds later ]

bond_mii_monitor ->
bond_check_dev_link (now is carrier up)
bond_miimon_commit ->   (emits "0 Mbps full duplex" message)
bond_lower_state_changed ->
bond_netdev_event (NETDEV_CHANGELOWERSTATE, is ignored)
bond_3ad_handle_link_change (sees DUPLEX/SPEED_UNKNOWN)

[ a few microseconds later, in response to ixgbe's netif_carrier_on ]

notifier_call_chain ->
bond_netdev_event NETDEV_CHANGE ->
bond_update_speed_duplex (sees correct SPEED_1/FULL) ->
bond_3ad_adapter_speed_duplex_changed (updates 802.3ad)

Basically, the race is that the periodic bond_mii_monitor is
squeezing in between the link going up and bonding's update of the speed
and duplex in response to the NETDEV_CHANGE triggered by the driver's
netif_carrier_on call.  bonding ends up using the stale duplex and speed
information obtained at enslavement time.

I think that, nowadays, the initial speed and duplex will pretty
much always be UNKNOWN, at least for real Ethernet devices, because it
will take longer to autoneg than the time between the dev_open and
bond_update_speed_duplex calls in bond_enslave.

Adding a case to bond_netdev_event for CHANGELOWERSTATE works
because it's a synchronous call from bonding.  For purposes of fixing
this, it's more or less equivalent to calling bond_update_speed_duplex
from bond_miimon_commit (which is part of a test patch I posted earlier
today).

If the above analysis is correct, then I would expect this patch
to make the problem go away:

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 56b560558884..cabaeb61333d 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -2127,6 +2127,7 @@ static void bond_miimon_commit(struct bonding *bond)
continue;

case BOND_LINK_UP:
+   bond_update_speed_duplex(slave);
bond_set_slave_link_state(slave, BOND_LINK_UP,
  BOND_SLAVE_NOTIFY_NOW);
slave->last_link_up = jiffies;

Emil, can you give just the above a test?

I don't see in the trace that there's evidence that ixgbe's link
is rapidly flapping, so I don't think it's necessary to do more than the
above.

Now, separately, bonding really should obey the NETDEV_CHANGE /
NETDEV_UP events instead of polling for carrier state, but if the above
patch works it's a simple fix that is easily backported, which the
CHANGELOWERSTATE method isn't, and the new way (notifier driven) can be
net-next material.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: bonding reports interface up with 0 Mbps

2016-02-04 Thread Jay Vosburgh

Tantilov, Emil S <emil.s.tanti...@intel.com> wrote:
[...]
>Sure, I'll give this a try, but I'm not sure this check applies in this case
>as you can see from the trace link is up and carrier is on.

From code inspection, I see another possible race, although I'm
not sure if it's relevant for this case.  During enslavement, the speed
and duplex are initially set, and then later the link state is checked,
but if link is up, the code presumes the speed and duplex are valid,
which they may not be.

I think this patch would narrow (but not totally eliminate) this
race:

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 56b560558884..b8b8a24f92d1 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1591,6 +1591,7 @@ int bond_enslave(struct net_device *bond_dev, struct 
net_device *slave_dev)
/* check for initial state */
if (bond->params.miimon) {
if (bond_check_dev_link(bond, slave_dev, 0) == BMSR_LSTATUS) {
+   bond_update_speed_duplex(new_slave);
if (bond->params.updelay) {
bond_set_slave_link_state(new_slave,
  BOND_LINK_BACK,

I'm not sure it's going to be really possible to completely
close all of these races, as the device can change its link state (and
thus what it reports for speed and duplex) asynchronously.  Even in a
NETDEV_UP or NETDEV_CHANGE notifier callback, the link could go down
between the time of the netif_carrier_on call that triggers the notifier
and when the callback runs.

But, if the link is really flapping here, bonding should get a
notifier for each flap (up or down).  Once it settles down with carrier
up then the speed and duplex should be valid.

    -J

---
    -Jay Vosburgh, jay.vosbu...@canonical.com

Re: bonding reports interface up with 0 Mbps

2016-02-04 Thread Jay Vosburgh

Tantilov, Emil S <emil.s.tanti...@intel.com> wrote:

>We are seeing an occasional issue where the bonding driver may report 
>interface up with 0 Mbps:
>bond0: link status definitely up for interface eth0, 0 Mbps full duplex
>
>So far in all the failed traces I have collected this happens on 
>NETDEV_CHANGELOWERSTATE event:
>
><...>-20533 [000]  81811.041241: ixgbe_service_task: eth1: NIC Link is Up 
>10 Gbps, Flow Control: RX/TX
><...>-20533 [000]  81811.041257: ixgbe_check_vf_rate_limit 
><-ixgbe_service_task
><...>-20533 [000]  81811.041272: ixgbe_ping_all_vfs <-ixgbe_service_task
>kworker/u48:0-7503  [010]  81811.041345: ixgbe_get_stats64 <-dev_get_stats
>kworker/u48:0-7503  [010]  81811.041393: bond_netdev_event: eth1: event: 1b
>kworker/u48:0-7503  [010]  81811.041394: bond_netdev_event: eth1: IFF_SLAVE
>kworker/u48:0-7503  [010]  81811.041395: bond_netdev_event: eth1: 
>slave->speed = 
><...>-20533 [000]  81811.041407: ixgbe_ptp_overflow_check 
><-ixgbe_service_task
>kworker/u48:0-7503  [010]  81811.041407: bond_mii_monitor: bond0: link 
>status definitely up for interface eth1, 0 Mbps full duplex

Thinking about the trace again... Emil: what happens in the
trace before this?  Is there ever a call to the ixgbe_get_settings?
Does a NETDEV_UP or NETDEV_CHANGE event ever hit the bond_netdev_event
function?

Could you describe your test that reproduces this?  I'd like to
see if I can set it up locally.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH net v2] bonding: Fix ARP monitor validation

2016-02-03 Thread Jay Vosburgh

Jarod Wilson <ja...@redhat.com> wrote:
[...]
>This sounds suspiciously like the same problem Uwe was encountering[*] and
>attempting to solve. Uwe, can you give this patch a try?
>
>[*] = http://marc.info/?l=linux-netdev=144416122705850=2

Agreed; my apologies for not Cc'ing you, Uwe.

FWIW, the "Sometimes, the issue sorts itself out after a while"
Uwe reported is likely due to something causing a switch flood (TCN or
clearing of the MAC table); in that case, the ARP reply is sent to all
ports, so the "expected" curr_arp_slave receives the ARP reply.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: bonding reports interface up with 0 Mbps

2016-02-03 Thread Jay Vosburgh

Tantilov, Emil S <emil.s.tanti...@intel.com> wrote:

>We are seeing an occasional issue where the bonding driver may report 
>interface up with 0 Mbps:
>bond0: link status definitely up for interface eth0, 0 Mbps full duplex
>
>So far in all the failed traces I have collected this happens on 
>NETDEV_CHANGELOWERSTATE event:
>
><...>-20533 [000]  81811.041241: ixgbe_service_task: eth1: NIC Link is Up 
>10 Gbps, Flow Control: RX/TX
><...>-20533 [000]  81811.041257: ixgbe_check_vf_rate_limit 
><-ixgbe_service_task
><...>-20533 [000]  81811.041272: ixgbe_ping_all_vfs <-ixgbe_service_task
>kworker/u48:0-7503  [010]  81811.041345: ixgbe_get_stats64 <-dev_get_stats
>kworker/u48:0-7503  [010]  81811.041393: bond_netdev_event: eth1: event: 1b
>kworker/u48:0-7503  [010]  81811.041394: bond_netdev_event: eth1: IFF_SLAVE
>kworker/u48:0-7503  [010]  81811.041395: bond_netdev_event: eth1: 
>slave->speed = 
><...>-20533 [000]  81811.041407: ixgbe_ptp_overflow_check 
><-ixgbe_service_task
>kworker/u48:0-7503  [010]  81811.041407: bond_mii_monitor: bond0: link 
>status definitely up for interface eth1, 0 Mbps full duplex

From looking at the code that prints this, the "full" duplex is
probably actually DUPLEX_UNKNOWN, but the netdev_info uses the
expression slave->duplex ? "full" : "half", so DUPLEX_UNKNOWN at 0xff
would print "full."

This is what ixgbe_get_settings returns for speed and duplex if
it is called when carrier is off.

>As a proof of concept I added NETDEV_CHANGELOWERSTATE in 
>bond_slave_netdev_event() along with NETDEV_UP/CHANGE:
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index 56b5605..a9dac4c 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -3014,6 +3014,7 @@ static int bond_slave_netdev_event(unsigned long event,
>   break;
>   case NETDEV_UP:
>   case NETDEV_CHANGE:
>+  case NETDEV_CHANGELOWERSTATE:
>   bond_update_speed_duplex(slave);
>   if (BOND_MODE(bond) == BOND_MODE_8023AD)
>   bond_3ad_adapter_speed_duplex_changed(slave);
>
>With this change I have not seen 0 Mbps reported by the bonding driver (around 
>12 hour test up to this point
>vs. 2-3 hours otherwise). Although I suppose it could also be some sort of 
>race/timing issue with bond_mii_monitor().

This change as a fix seems kind of odd, since CHANGELOWERSTATE
is generated by bonding itself.  Perhaps the net effect is to add a
delay and then update the speed and duplex, masking the actual problem.

Emil, if I recall correctly, the test patch I send that uses the
notifiers directly instead of miimon (specify miimon=0 and have bonding
respond to the notifiers) handled everything properly, right?  If so I
can split that up and submit it properly; it seems more like a feature
than a straightforward bug fix, so I'm not sure it's appropriate for
net.

As a possibly less complex alternative for the miimon > 0 case,
could you try the following:

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 56b560558884..ac8921e65f26 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -2120,6 +2120,7 @@ static void bond_miimon_commit(struct bonding *bond)
 {
struct list_head *iter;
struct slave *slave, *primary;
+   int link_state;

bond_for_each_slave(bond, slave, iter) {
switch (slave->new_link) {
@@ -2127,6 +2128,10 @@ static void bond_miimon_commit(struct bonding *bond)
continue;

case BOND_LINK_UP:
+   link_state = bond_check_dev_link(bond, slave->dev, 0);
+   if (!link_state)
+   continue;
+   bond_update_speed_duplex(slave);
bond_set_slave_link_state(slave, BOND_LINK_UP,
  BOND_SLAVE_NOTIFY_NOW);
slave->last_link_up = jiffies;

This will make bonding recheck the link state and update the
speed and duplex after it acquires RTNL to commit a link change.  This
probably still has a race, since the change of carrier state in the
device is not mutexed by anything bonding can acquire (so it can always
change as soon as it's checked).

Thanks,

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH net-next] bonding: 3ad: apply ad_actor settings changes immediately

2016-02-03 Thread Jay Vosburgh

Nikolay Aleksandrov <niko...@cumulusnetworks.com> wrote:

>On 02/03/2016 08:05 PM, Jay Vosburgh wrote:
>> Nikolay Aleksandrov <ra...@blackwall.org> wrote:
>> 
>>> From: Nikolay Aleksandrov <niko...@cumulusnetworks.com>
>>>
>>> Currently the bonding allows to set ad_actor_system and prio while the
>>> bond device is down, but these are actually applied only if there aren't
>>> any slaves yet (applied to bond device when first slave shows up, and to
>>> slaves at 3ad bind time). After this patch changes are applied immediately
>>> and the new values can be used/seen after the bond's upped so it's not
>>> necessary anymore to release all and enslave again to see the changes.
>>>
>>> CC: Jay Vosburgh <j.vosbu...@gmail.com>
>>> CC: Veaceslav Falico <vfal...@gmail.com>
>>> CC: Andy Gospodarek <go...@cumulusnetworks.com>
>>> Signed-off-by: Nikolay Aleksandrov <niko...@cumulusnetworks.com>
>> 
>>  Looks good to me.
>> 
>> Signed-off-by: Jay Vosburgh <jay.vosbu...@canonical.com>
>> 
>>  -J
>> 
>
>Thanks Jay. Do you think it makes sense to allow setting these while the
>bond is up ? I don't see any serious problems.

Actually, I was thinking about that when I read the patch.

Changing these while live will trigger reselection of the
aggregator(s), since selection information in the LACPDUs will change.
It should just work itself out when the parter system sees the change.

E.g., if bonding received a LACPDU with a changed _prio or
_system, __update_selected will clear AD_PORT_SELECTED if it doesn't
match, and that will cascade through the state machine (in
ad_mux_machine, COLLECTING_DISTRIBUTING state will revert to
MUX_ATTACHED, which will clear COLL_DIST and go from there).

I think it may converge more quickly if bonding set ->ntt on the
ports after the change to immediately send new LACPDUs (rather than
waiting for the LACP timeout), but that's an optimization.  If the _prio
or _system are set to the same values they had previously, this is
harmless as the partner system shouldn't reselect if nothing in the
LACPDU changed.

There's a similar optimization in bond_3ad_unbind_slave to send
a LACPDU with AGGREGATION cleared to speed up the LACP processing on the
partner system.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH net-next] bonding: 3ad: apply ad_actor settings changes immediately

2016-02-03 Thread Jay Vosburgh

Nikolay Aleksandrov <ra...@blackwall.org> wrote:

>From: Nikolay Aleksandrov <niko...@cumulusnetworks.com>
>
>Currently the bonding allows to set ad_actor_system and prio while the
>bond device is down, but these are actually applied only if there aren't
>any slaves yet (applied to bond device when first slave shows up, and to
>slaves at 3ad bind time). After this patch changes are applied immediately
>and the new values can be used/seen after the bond's upped so it's not
>necessary anymore to release all and enslave again to see the changes.
>
>CC: Jay Vosburgh <j.vosbu...@gmail.com>
>CC: Veaceslav Falico <vfal...@gmail.com>
>CC: Andy Gospodarek <go...@cumulusnetworks.com>
>Signed-off-by: Nikolay Aleksandrov <niko...@cumulusnetworks.com>

Looks good to me.

Signed-off-by: Jay Vosburgh <jay.vosbu...@canonical.com>

-J

>---
> drivers/net/bonding/bond_3ad.c | 40 +++---
> drivers/net/bonding/bond_options.c |  4 
> include/net/bond_3ad.h |  1 +
> 3 files changed, 42 insertions(+), 3 deletions(-)
>
>diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
>index 4cbb8b27a891..ee94056dbb2e 100644
>--- a/drivers/net/bonding/bond_3ad.c
>+++ b/drivers/net/bonding/bond_3ad.c
>@@ -357,6 +357,14 @@ static u8 __get_duplex(struct port *port)
>   return retval;
> }
> 
>+static void __ad_actor_update_port(struct port *port)
>+{
>+  const struct bonding *bond = bond_get_bond_by_slave(port->slave);
>+
>+  port->actor_system = BOND_AD_INFO(bond).system.sys_mac_addr;
>+  port->actor_system_priority = BOND_AD_INFO(bond).system.sys_priority;
>+}
>+
> /* Conversions */
> 
> /**
>@@ -1963,9 +1971,7 @@ void bond_3ad_bind_slave(struct slave *slave)
>   port->actor_admin_port_key = bond->params.ad_user_port_key << 6;
>   ad_update_actor_keys(port, false);
>   /* actor system is the bond's system */
>-  port->actor_system = BOND_AD_INFO(bond).system.sys_mac_addr;
>-  port->actor_system_priority =
>-  BOND_AD_INFO(bond).system.sys_priority;
>+  __ad_actor_update_port(port);
>   /* tx timer(to verify that no more than MAX_TX_IN_SECOND
>* lacpdu's are sent in one second)
>*/
>@@ -2148,6 +2154,34 @@ out:
> }
> 
> /**
>+ * bond_3ad_update_ad_actor_settings - reflect change of actor settings to 
>ports
>+ * @bond: bonding struct to work on
>+ *
>+ * If an ad_actor setting gets changed we need to update the individual port
>+ * settings so the bond device will use the new values when it gets upped.
>+ */
>+void bond_3ad_update_ad_actor_settings(struct bonding *bond)
>+{
>+  struct list_head *iter;
>+  struct slave *slave;
>+
>+  ASSERT_RTNL();
>+
>+  BOND_AD_INFO(bond).system.sys_priority = bond->params.ad_actor_sys_prio;
>+  if (is_zero_ether_addr(bond->params.ad_actor_system))
>+  BOND_AD_INFO(bond).system.sys_mac_addr =
>+  *((struct mac_addr *)bond->dev->dev_addr);
>+  else
>+  BOND_AD_INFO(bond).system.sys_mac_addr =
>+  *((struct mac_addr *)bond->params.ad_actor_system);
>+
>+  spin_lock_bh(>mode_lock);
>+  bond_for_each_slave(bond, slave, iter)
>+  __ad_actor_update_port(&(SLAVE_AD_INFO(slave)->port));
>+  spin_unlock_bh(>mode_lock);
>+}
>+
>+/**
>  * bond_3ad_state_machine_handler - handle state machines timeout
>  * @bond: bonding struct to work on
>  *
>diff --git a/drivers/net/bonding/bond_options.c 
>b/drivers/net/bonding/bond_options.c
>index 55e93b6b6d21..ed0bdae64f5e 100644
>--- a/drivers/net/bonding/bond_options.c
>+++ b/drivers/net/bonding/bond_options.c
>@@ -1392,6 +1392,8 @@ static int bond_option_ad_actor_sys_prio_set(struct 
>bonding *bond,
>   newval->value);
> 
>   bond->params.ad_actor_sys_prio = newval->value;
>+  bond_3ad_update_ad_actor_settings(bond);
>+
>   return 0;
> }
> 
>@@ -1418,6 +1420,8 @@ static int bond_option_ad_actor_system_set(struct 
>bonding *bond,
> 
>   netdev_info(bond->dev, "Setting ad_actor_system to %pM\n", mac);
>   ether_addr_copy(bond->params.ad_actor_system, mac);
>+  bond_3ad_update_ad_actor_settings(bond);
>+
>   return 0;
> 
> err:
>diff --git a/include/net/bond_3ad.h b/include/net/bond_3ad.h
>index f1fbc3b11962..f358ad5e4214 100644
>--- a/include/net/bond_3ad.h
>+++ b/include/net/bond_3ad.h
>@@ -306,5 +306,6 @@ int bond_3ad_lacpdu_recv(const struct sk_buff *skb, struct 
>bonding *bond,
>struct slave *slave);
> int bond_3ad_set_carrier(struct bonding *bond);
> void bond_3ad_update_lacp_rate(struct bonding *bond);
>+void bond_3ad_update_ad_actor_settings(struct bonding *bond);
> #endif /* _NET_BOND_3AD_H */
> 
>-- 
>2.4.3
>

[PATCH net v2] bonding: Fix ARP monitor validation

2016-02-02 Thread Jay Vosburgh


The current logic in bond_arp_rcv will accept an incoming ARP for
validation if (a) the receiving slave is either "active" (which includes
the currently active slave, or the current ARP slave) or, (b) there is a
currently active slave, and it has received an ARP since it became active.
For case (b), the receiving slave isn't the currently active slave, and is
receiving the original broadcast ARP request, not an ARP reply from the
target.

This logic can fail if there is no currently active slave.  In
this situation, the ARP probe logic cycles through all slaves, assigning
each in turn as the "current_arp_slave" for one arp_interval, then setting
that one as "active," and sending an ARP probe from that slave.  The
current logic expects the ARP reply to arrive on the sending
current_arp_slave, however, due to switch FDB updating delays, the reply
may be directed to another slave.

This can arise if the bonding slaves and switch are working, but
the ARP target is not responding.  When the ARP target recovers, a
condition may result wherein the ARP target host replies faster than the
switch can update its forwarding table, causing each ARP reply to be sent
to the previous current_arp_slave.  This will never pass the logic in
bond_arp_rcv, as neither of the above conditions (a) or (b) are met.

Some experimentation on a LAN shows ARP reply round trips in the
200 usec range, but my available switches never update their FDB in less
than 4000 usec.

This patch changes the logic in bond_arp_rcv to additionally
accept an ARP reply for validation on any slave if there is a current ARP
slave and it sent an ARP probe during the previous arp_interval.

Fixes: aeea64ac717a ("bonding: don't trust arp requests unless active slave 
really works")
Cc: Veaceslav Falico <vfal...@gmail.com>
Cc: Andy Gospodarek <go...@cumulusnetworks.com>
Signed-off-by: Jay Vosburgh <jay.vosbu...@canonical.com>

---
v2: more detail in log and comment; no code change.

 drivers/net/bonding/bond_main.c | 39 ---
 1 file changed, 28 insertions(+), 11 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 56b560558884..65a4107749df 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -214,6 +214,8 @@ static void bond_uninit(struct net_device *bond_dev);
 static struct rtnl_link_stats64 *bond_get_stats(struct net_device *bond_dev,
struct rtnl_link_stats64 
*stats);
 static void bond_slave_arr_handler(struct work_struct *work);
+static bool bond_time_in_interval(struct bonding *bond, unsigned long last_act,
+ int mod);
 
 /* General routines -*/
 
@@ -2459,7 +2461,7 @@ int bond_arp_rcv(const struct sk_buff *skb, struct 
bonding *bond,
 struct slave *slave)
 {
struct arphdr *arp = (struct arphdr *)skb->data;
-   struct slave *curr_active_slave;
+   struct slave *curr_active_slave, *curr_arp_slave;
unsigned char *arp_ptr;
__be32 sip, tip;
int alen, is_arp = skb->protocol == __cpu_to_be16(ETH_P_ARP);
@@ -2506,26 +2508,41 @@ int bond_arp_rcv(const struct sk_buff *skb, struct 
bonding *bond,
 , );
 
curr_active_slave = rcu_dereference(bond->curr_active_slave);
+   curr_arp_slave = rcu_dereference(bond->current_arp_slave);
 
-   /* Backup slaves won't see the ARP reply, but do come through
-* here for each ARP probe (so we swap the sip/tip to validate
-* the probe).  In a "redundant switch, common router" type of
-* configuration, the ARP probe will (hopefully) travel from
-* the active, through one switch, the router, then the other
-* switch before reaching the backup.
+   /* We 'trust' the received ARP enough to validate it if:
+*
+* (a) the slave receiving the ARP is active (which includes the
+* current ARP slave, if any), or
+*
+* (b) the receiving slave isn't active, but there is a currently
+* active slave and it received valid arp reply(s) after it became
+* the currently active slave, or
+*
+* (c) there is an ARP slave that sent an ARP during the prior ARP
+* interval, and we receive an ARP reply on any slave.  We accept
+* these because switch FDB update delays may deliver the ARP
+* reply to a slave other than the sender of the ARP request.
 *
-* We 'trust' the arp requests if there is an active slave and
-* it received valid arp reply(s) after it became active. This
-* is done to avoid endless looping when we can't reach the
+* Note: for (b), backup slaves are receiving the broadcast ARP
+* request, no

Re: bonding (IEEE 802.3ad) not working with qemu/virtio

2016-01-29 Thread Jay Vosburgh

Nikolay Aleksandrov <niko...@cumulusnetworks.com> wrote:

>On 01/25/2016 05:24 PM, Bjørnar Ness wrote:
>> As subject says, 802.3ad bonding is not working with virtio network model.
>> 
>> The only errors I see is:
>> 
>> No 802.3ad response from the link partner for any adapters in the bond.
>> 
>> Dumping the network traffic shows that no LACP packets are sent from the
>> host running with virtio driver, changing to for example e1000 solves
>> this problem
>> with no configuration changes.
>> 
>> Is this a known problem?
>> 
>[Including bonding maintainers for comments]
>
>Hi,
>Here's a workaround patch for virtio_net devices that "cheats" the
>duplex test (which is the actual problem). I've tested this locally
>and it works for me.
>I'd let the others comment on the implementation, there're other signs
>that can be used to distinguish a virtio_net device so I'm open to suggestions.
>Also feedback if this is at all acceptable would be appreciated.

Should virtio instead provide an arbitrary speed and full duplex
to ethtool, as veth does?

Creating a magic whitelist of devices deep inside the 802.3ad
implementation seems less desirable.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

[PATCH net] bonding: Fix ARP monitor validation

2016-01-28 Thread Jay Vosburgh


The current logic in bond_arp_rcv will accept an incoming ARP
for validation if (a) the receiving slave is either "active" (here
meaning the slave is functioning, not that it is the currently active
slave) or, (b) the currently active slave recently sent an ARP probe.

This can fail if there is no currently active slave.  In this
case, the ARP probe logic cycles through all slaves, assigning each in
turn as the "current_arp_slave," setting that one as "active," and
sending an ARP probe from that slave.  The current logic expects the ARP
reply to arrive on this slave, however, due to switch FDB updating
delays, the reply may be directed to another slave.

This leads to a pathological condition wherein the ARP target host
replies faster than the switch can update its forwarding table, causing
each ARP reply to be sent to the previous current_arp_slave.  This will
never pass the logic in bond_arp_rcv.

Some experimentation on a LAN shows ARP reply round trips in the
200 usec range, but my available switches never update their FDB in less
than 4000 usec.

This patch changes the logic to accept an ARP reply for validation
on any slave if there is a current ARP slave and it has recently sent an
ARP probe.

Fixes: aeea64ac717a ("bonding: don't trust arp requests unless active slave 
really works")
Signed-off-by: Jay Vosburgh <jay.vosbu...@canonical.com>

---
 drivers/net/bonding/bond_main.c | 19 ++-
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 56b560558884..5a9c43b12999 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -214,6 +214,8 @@ static void bond_uninit(struct net_device *bond_dev);
 static struct rtnl_link_stats64 *bond_get_stats(struct net_device *bond_dev,
struct rtnl_link_stats64 
*stats);
 static void bond_slave_arr_handler(struct work_struct *work);
+static bool bond_time_in_interval(struct bonding *bond, unsigned long last_act,
+ int mod);
 
 /* General routines -*/
 
@@ -2459,7 +2461,7 @@ int bond_arp_rcv(const struct sk_buff *skb, struct 
bonding *bond,
 struct slave *slave)
 {
struct arphdr *arp = (struct arphdr *)skb->data;
-   struct slave *curr_active_slave;
+   struct slave *curr_active_slave, *curr_arp_slave;
unsigned char *arp_ptr;
__be32 sip, tip;
int alen, is_arp = skb->protocol == __cpu_to_be16(ETH_P_ARP);
@@ -2506,6 +2508,7 @@ int bond_arp_rcv(const struct sk_buff *skb, struct 
bonding *bond,
 , );
 
curr_active_slave = rcu_dereference(bond->curr_active_slave);
+   curr_arp_slave = rcu_dereference(bond->current_arp_slave);
 
/* Backup slaves won't see the ARP reply, but do come through
 * here for each ARP probe (so we swap the sip/tip to validate
@@ -2514,10 +2517,12 @@ int bond_arp_rcv(const struct sk_buff *skb, struct 
bonding *bond,
 * the active, through one switch, the router, then the other
 * switch before reaching the backup.
 *
-* We 'trust' the arp requests if there is an active slave and
-* it received valid arp reply(s) after it became active. This
-* is done to avoid endless looping when we can't reach the
-* arp_ip_target and fool ourselves with our own arp requests.
+* We 'trust' the arp requests if (a) there is an active slave and
+* it received valid arp reply(s) after it became active, or (b)
+* there is an ARP slave, and we receive a recent ARP reply on any
+* slave. This is done to avoid endless looping when we can't
+* reach the arp_ip_target and fool ourselves with our own arp
+* requests.
 */
 
if (bond_is_active_slave(slave))
@@ -2526,6 +2531,10 @@ int bond_arp_rcv(const struct sk_buff *skb, struct 
bonding *bond,
 time_after(slave_last_rx(bond, curr_active_slave),
curr_active_slave->last_link_up))
bond_validate_arp(bond, slave, tip, sip);
+   else if (curr_arp_slave && (arp->ar_op == htons(ARPOP_REPLY)) &&
+bond_time_in_interval(bond,
+  dev_trans_start(curr_arp_slave->dev), 1))
+   bond_validate_arp(bond, slave, sip, tip);
 
 out_unlock:
if (arp != (struct arphdr *)skb->data)
-- 
1.9.1

Re: [PATCH 1/1] bonding: Use notifiers for slave link state detection

2016-01-25 Thread Jay Vosburgh

zhuyj <zyjzyj2...@gmail.com> wrote:

>On 01/26/2016 08:43 AM, Jay Vosburgh wrote:
>> <zyjzyj2...@gmail.com> wrote:
>>
>>> From: Zhu Yanjun <zyjzyj2...@gmail.com>
>>>
>>> Bonding will utilize notifier callbacks to detect slave
>>> link state changes. It is intended to be used with miimon
>>> set to zero, and does not support the updelay or downdelay
>>> options to bonding.
>>>
>>> Because of link flap from the slave interface, if the notifier
>>> is NETDEV_UP while the actual link state is down, it is not
>>> necessary to continue.
>>>
>>> Signed-off-by: Jay Vosburgh <jay.vosbu...@canonical.com>
>>  I haven't signed off on this patch.
>>
>>  I've just started some testing, but as before immediately get an
>> RCU warning; it looks to be coming from bond_miimon_inspect_slave();
>>
>> [  316.473050] bond1: Enslaving eth1 as a backup interface with an up link
>> [  316.473059]
>> [  316.473806] ===
>> [  316.475630] [ INFO: suspicious RCU usage. ]
>> [  316.477519] 4.4.0+ #38 Not tainted
>> [  316.479094] ---
>> [  316.480765] drivers/net/bonding/bond_main.c:2024 suspicious 
>> rcu_dereference_check() usage!
>>
>>  This is presumably because the "case NETDEV_DOWN" call to
>> bond_miimon_inspect_slave does not hold RCU.  It does hold RTNL, though,
>> which should be safe for this usage (RTNL mutexes changes to the active
>> slave).  The appended patch on top of the original makes the warning go
>> away.
>>
>>  I'm still testing the patch and have no comment about its
>> functionality as yet.
>>
>> diff --git a/drivers/net/bonding/bond_main.c 
>> b/drivers/net/bonding/bond_main.c
>> index 9f67948..e3faee9 100644
>> --- a/drivers/net/bonding/bond_main.c
>> +++ b/drivers/net/bonding/bond_main.c
>> @@ -2014,14 +2014,14 @@ static int bond_slave_info_query(struct net_device 
>> *bond_dev, struct ifslave *in
>> /* Monitoring
>> ---*/
>>   -/* called with rcu_read_lock() */
>> +/* called with rcu_read_lock() or RTNL */
>>   static int bond_miimon_inspect_slave(struct bonding *bond, struct slave 
>> *slave,
>>   unsigned long event)
>>   {
>>  int link_state;
>>  bool ignore_updelay;
>>   -  ignore_updelay = !rcu_dereference(bond->curr_active_slave);
>> +ignore_updelay = !rcu_dereference_rtnl(bond->curr_active_slave);
>
>Thanks a lot.
>Because kernel v4.4 needs this kind of patch, I backport this patch from
>net-next to kernel v4.4.
>
>If it is not appropriate, I will revert this patch.

I don't understand what you mean here.

I've tested the patch (with my above modification), and while I
seem to be hitting an unrelated bug in the ARP monitor, I believe this
patch will misbehave when the ARP monitor is running.

For example, if arp_interval=1000 and miimon=0, the link state
notifier callback will change a slave to up should a notifier event take
place.  So, hypothetically, if a slave is "down" according to the ARP
monitor (but actually carrier up), and then experience a carrier down
then up transition, the slave would be set to "up" even though the ARP
monitor believes it to be down.

I'm not able to induce the speedy link flap events, so I'm not
sure about this portion of the patch:

+   /* Because of link flap from the slave interface, it is possilbe that
+* the notifiler is NETDEV_UP while the actual link state is down. If
+* so, it is not necessary to contiune.
+*/
+   switch (event) {
+   case NETDEV_UP:
+   if (!link_state)
+   return 0;
+   break;
+
+   case NETDEV_DOWN:
+   if (link_state)
+   return 0;
+   break;
+   }
+

Unless I misunderstood, Emil's comments elsewhere suggest that
the current ixgbe driver won't cause those, though.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH 1/1] bonding: Use notifiers for slave link state detection

2016-01-25 Thread Jay Vosburgh

<zyjzyj2...@gmail.com> wrote:

>From: Zhu Yanjun <zyjzyj2...@gmail.com>
>
>Bonding will utilize notifier callbacks to detect slave
>link state changes. It is intended to be used with miimon
>set to zero, and does not support the updelay or downdelay
>options to bonding.
>
>Because of link flap from the slave interface, if the notifier
>is NETDEV_UP while the actual link state is down, it is not
>necessary to continue.
>
>Signed-off-by: Jay Vosburgh <jay.vosbu...@canonical.com>

I haven't signed off on this patch.

I've just started some testing, but as before immediately get an
RCU warning; it looks to be coming from bond_miimon_inspect_slave();

[  316.473050] bond1: Enslaving eth1 as a backup interface with an up link
[  316.473059] 
[  316.473806] ===
[  316.475630] [ INFO: suspicious RCU usage. ]
[  316.477519] 4.4.0+ #38 Not tainted
[  316.479094] ---
[  316.480765] drivers/net/bonding/bond_main.c:2024 suspicious 
rcu_dereference_check() usage!

This is presumably because the "case NETDEV_DOWN" call to
bond_miimon_inspect_slave does not hold RCU.  It does hold RTNL, though,
which should be safe for this usage (RTNL mutexes changes to the active
slave).  The appended patch on top of the original makes the warning go
away.

I'm still testing the patch and have no comment about its
functionality as yet.

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 9f67948..e3faee9 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -2014,14 +2014,14 @@ static int bond_slave_info_query(struct net_device 
*bond_dev, struct ifslave *in
 
 /* Monitoring ---*/
 
-/* called with rcu_read_lock() */
+/* called with rcu_read_lock() or RTNL */
 static int bond_miimon_inspect_slave(struct bonding *bond, struct slave *slave,
 unsigned long event)
 {
int link_state;
bool ignore_updelay;
 
-   ignore_updelay = !rcu_dereference(bond->curr_active_slave);
+   ignore_updelay = !rcu_dereference_rtnl(bond->curr_active_slave);
 
    slave->new_link = BOND_LINK_NOCHANGE;
 

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [RFC PATCH net] net/core: don't increment rx_dropped on inactive slaves

2016-01-22 Thread Jay Vosburgh

Jarod Wilson <ja...@redhat.com> wrote:

>The network core tries to keep track of dropped packets, but some packets
>you wouldn't really call dropped, so much as intentionally ignored, under
>certain circumstances. One such case is that of bonding and team device
>slaves that are currently inactive. Their respective rx_handler functions
>return RX_HANDLER_EXACT (the only places in the kernel that return that),
>which ends up tracking into the network core's __netif_receive_skb_core()
>function's drop path, with no pt_prev set. On a noisy network, this can
>result in a very rapidly incrementing rx_dropped counter, not only on the
>inactive slave(s), but also on the master device, such as the following:
[...]
>In this scenario, p5p1, p5p2 and p7p1 are all inactive slaves in an
>active-backup bond0, and you can see that all three have high drop counts,
>with the master bond0 showing a tally of all three.
>
>I know that this was previously discussed some here:
>
>http://www.spinics.net/lists/netdev/msg226341.html
>
>It seems additional counters never came to fruition, but honestly, for
>this particular case, I'm not even sure they're warranted, I'd be inclined
>to say just silently drop these packets without incrementing a counter. At
>least, that's probably what would make someone who has complained loudly
>about this issue happy, as they have monitoring tools that are squaking
>loudly at any increments to rx_dropped.

I don't think the kernel should silently drop packets; there
should be a counter somewhere.  If a packet is being thrown away
deliberately, it should not just vanish into the screaming void of
space.  Someday someone will try and track down where that packet is
being dropped.

I've had that same conversation with customers who insist on
accounting for every packet drop (from the "any drop is an error"
mindset), so I understand the issue.

Thinking about the prior discussion, the rx_drop_inactive is
still a good idea, but I'd actually today get good use from a
"rx_drop_unforwardable" (or an equivalent but shorter name) counter that
counts every time a packet is dropped due to is_skb_forwardable()
returning false.  __dev_forward_skb does this (and hits rx_dropped), as
does the bridge (and does not count it).

-J

>CC: "David S. Miller" <da...@davemloft.net>
>CC: Eric Dumazet <eduma...@google.com>
>CC: Jiri Pirko <j...@mellanox.com>
>CC: Daniel Borkmann <dan...@iogearbox.net>
>CC: Tom Herbert <t...@herbertland.com>
>CC: Jay Vosburgh <j.vosbu...@gmail.com>
>CC: Veaceslav Falico <vfal...@gmail.com>
>CC: Andy Gospodarek <go...@cumulusnetworks.com>
>CC: netdev@vger.kernel.org
>Signed-off-by: Jarod Wilson <ja...@redhat.com>
>---
> net/core/dev.c | 3 +++
> 1 file changed, 3 insertions(+)
>
>diff --git a/net/core/dev.c b/net/core/dev.c
>index 8cba3d8..1354c7b 100644
>--- a/net/core/dev.c
>+++ b/net/core/dev.c
>@@ -4153,8 +4153,11 @@ ncls:
>   else
>   ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
>   } else {
>+  if (deliver_exact)
>+  goto inactive; /* bond or team inactive slave */
> drop:
>       atomic_long_inc(>dev->rx_dropped);
>+inactive:
>   kfree_skb(skb);
>   /* Jamal, now you will not able to escape explaining
>* me how you were going to use this. :-)
>-- 
>1.8.3.1
>

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH] net: take care of bonding in build_skb_flow_key (v3)

2016-01-20 Thread Jay Vosburgh

Wengang Wang <wen.gang.w...@oracle.com> wrote:
[...]
>For ipip,  yes seems update_pmtu is called in line for each call of
>queue_xmit.  Do you know if it's a good configuration for ipip + bonding?

Yes, it is.

>Other's comment and suggestion?

I agree with Sabrina Dubroca <s...@queasysnail.net>'s suggestions
from yesterday.

    -J

---
-Jay Vosburgh, jay.vosbu...@canonical.com

Re: [PATCH 1/1] bonding: restrict up state in 802.3ad mode

2015-12-17 Thread Jay Vosburgh

<zyjzyj2...@gmail.com> wrote:

>From: Zhu Yanjun <zyjzyj2...@gmail.com>
>
>In 802.3ad mode, the speed and duplex is needed. But in some NIC,
>there is a time span between NIC up state and getting speed and duplex.
>As such, sometimes a slave in 802.3ad mode is in up state without
>speed and duplex. This will make bonding in 802.3ad mode can not
>work well. 
>To make bonding driver be compatible with more NICs, it is
>necessary to restrict the up state in 802.3ad mode.

What device is this?  It seems a bit odd that an Ethernet device
can be carrier up but not have the duplex and speed available.

Also, what are the option settings for bonding?  Specifically,
is "use_carrier" set to 0?  The default setting is 1.

In general, though, bonding expects a speed or duplex change to
be announced via a NETDEV_UPDATE or NETDEV_UP notifier, which would
propagate to the 802.3ad logic.

If the device here is going carrier up prior to having speed or
duplex available, then maybe it should call netdev_state_change() when
the duplex and speed are available, or delay calling netif_carrier_on().

>Signed-off-by: Zhu Yanjun <zyjzyj2...@gmail.com>
>---
> drivers/net/bonding/bond_main.c |   19 +++
> 1 file changed, 19 insertions(+)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index 9e0f8a7..0a80fb3 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -1991,6 +1991,25 @@ static int bond_miimon_inspect(struct bonding *bond)
> 
>   link_state = bond_check_dev_link(bond, slave->dev, 0);
> 
>+  /* Since some NIC has time span between netif_running and
>+   * getting speed and duples. That is, after a NIC is up 
>(netif_running),
>+   * there is a time span before this NIC is negotiated with 
>speed and duplex.
>+   * During this time span, the slave in 802.3ad is configured 
>without speed
>+   * and duplex. This 802.3ad bonding will not work because it 
>needs slave's speed
>+   * and duplex to generate key field.
>+   * As such, we restrict up in 802.3ad mode to: netif_running && 
>peed != SPEED_UNKNOWN &&
>+   * duplex != DUPLEX_UNKNOWN
>+   */
>+  if ((BMSR_LSTATUS == link_state) &&
>+  (BOND_MODE(bond) == BOND_MODE_8023AD)) {
>+  bond_update_speed_duplex(slave);
>+  if ((slave->speed == SPEED_UNKNOWN) ||
>+  (slave->duplex == DUPLEX_UNKNOWN)) {
>+  link_state = 0;
>+  netdev_info(bond->dev, "In 802.3ad mode, it is 
>not enough to up without speed and duplex");
>+  }
>+  }

Also, as a functional note on this patch, the above looks like
it will spam the log repeatedly every miimon interval for as long as the
"carrier up but no speed/duplex" situation persists.

-J

>   switch (slave->link) {
>   case BOND_LINK_UP:
>   if (link_state)
>-- 
>1.7.9.5
>

---
-Jay Vosburgh, jay.vosbu...@canonical.com
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 1/1] bonding: delay up state without speed and duplex in 802.3ad mode

2015-12-17 Thread Jay Vosburgh

<zyjzyj2...@gmail.com> wrote:

>From: yzhu1 <yz...@windriver.com>
>
>In 802.3ad mode, the speed and duplex is needed. But in some NICs,
>there is a time span between NIC up state and getting speed and duplex.
>As such, sometimes a slave in 802.3ad mode is in up state without
>speed and duplex. This will make bonding in 802.3ad mode can not
>work well.
>
>To make bonding driver robust and compatible with more NICs, it is
>necessary to delay the up state without speed and duplex in 802.3ad
>mode.

You misunderstood my comment.  What I meant is that the device
driver for the network device should change to either delay carrier up
or issue a second notifier when speed and duplex are available.  If the
driver doesn't handle duplex and speed notification properly, it will
likely have trouble with more than just bonding.

You also didn't mention the identity of the network device that
requires this special handling.  Is the driver part of the linux kernel?

-J

>Signed-off-by: yzhu1 <yz...@windriver.com>
>---
> drivers/net/bonding/bond_main.c |   34 ++
> 1 file changed, 34 insertions(+)
>
>diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>index 9e0f8a7..a1d8708 100644
>--- a/drivers/net/bonding/bond_main.c
>+++ b/drivers/net/bonding/bond_main.c
>@@ -419,6 +419,35 @@ const char *bond_slave_link_status(s8 link)
>   }
> }
> 
>+/* This function is to check the speed and duplex of a NIC.
>+ * Since the speed and duplex of a slave device are very
>+ * important to the bonding in the 802.3ad mode. As such,
>+ * it is necessary to check the speed and duplex of a slave
>+ * device in 802.3ad mode.
>+ *
>+ * speed != SPEED_UNKNOWN and duplex == DUPLEX_FULL  :  1
>+ *   others  :  0
>+ */
>+static int __check_speed_duplex(struct net_device *netdev)
>+{
>+  struct ethtool_cmd ecmd;
>+  u32 slave_speed = SPEED_UNKNOWN;
>+  int res;
>+
>+  res = __ethtool_get_settings(netdev, );
>+  if (res < 0)
>+  return 0;
>+
>+  slave_speed = ethtool_cmd_speed();
>+  if (slave_speed == 0 || slave_speed == ((__u32) -1))
>+  return 0;
>+
>+  if (DUPLEX_FULL != ecmd.duplex)
>+  return 0;
>+
>+  return 1;
>+}
>+
> /* if  supports MII link status reporting, check its link status.
>  *
>  * We either do MII/ETHTOOL ioctls, or check netif_carrier_ok(),
>@@ -445,6 +474,11 @@ static int bond_check_dev_link(struct bonding *bond,
>   if (!reporting && !netif_running(slave_dev))
>   return 0;
> 
>+  /* Check the speed and duplex of the slave device in 802.3ad mode. */
>+  if ((BOND_MODE(bond) == BOND_MODE_8023AD) &&
>+ !__check_speed_duplex(slave_dev))
>+  return 0;
>+
>   if (bond->params.use_carrier)
>   return netif_carrier_ok(slave_dev) ? BMSR_LSTATUS : 0;
> 
>-- 
>1.7.9.5
>

---
-Jay Vosburgh, jay.vosbu...@canonical.com
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH net] bonding: fix panic on non-ARPHRD_ETHER enslave failure

2015-11-06 Thread Jay Vosburgh


Since commit 7d5cd2ce529b, when bond_enslave fails on devices that
are not ARPHRD_ETHER, if needed, it resets the bonding device back to
ARPHRD_ETHER by calling ether_setup.

Unfortunately, ether_setup clobbers dev->flags, clearing IFF_UP
if the bond device is up, leaving it in a quasi-down state without
having actually gone through dev_close.  For bonding, if any periodic
work queue items are active (miimon, arp_interval, etc), those will
remain running, as they are stopped by bond_close.  At this point, if
the bonding module is unloaded or the bond is deleted, the system will
panic when the work function is called.

This panic is resolved by calling dev_close on the bond itself
prior to calling ether_setup.

Cc: Nikolay Aleksandrov <niko...@cumulusnetworks.com>
Signed-off-by: Jay Vosburgh <jay.vosbu...@canonical.com>
Fixes: 7d5cd2ce5292 ("bonding: correctly handle bonding type change on enslave 
failure")

---
 drivers/net/bonding/bond_main.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index b4351ca..9e0f8a7 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1749,6 +1749,7 @@ err_undo_flags:
slave_dev->dev_addr))
eth_hw_addr_random(bond_dev);
if (bond_dev->type != ARPHRD_ETHER) {
+   dev_close(bond_dev);
ether_setup(bond_dev);
bond_dev->flags |= IFF_MASTER;
bond_dev->priv_flags &= ~IFF_TX_SKB_SHARING;
-- 
1.9.1

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Regression in bonding driver - devices without set_mac

2015-11-06 Thread Jay Vosburgh

Toby Corkindale <t...@wintrmute.net> wrote:
[...]
>The thing is, I've been running PPP devices in balance-rr mode for
>quite a while. It worked fine, and I'd like to continue having that
>functionality.  These changes to the kernel have broken that
>functionality, by requiring the MAC address support unconditionally.
>
>I guess using PPP devices is a special case, but it would be great if
>that special case could continue to be supported.

Could you describe your configuration / use case a bit more?

The nominal reason for the MAC change in balance-rr or
balance-xor modes is that those modes are designed to interoperate with
Etherchannel (static link aggregation) type of systems, which expect all
ports participating in the aggregation to use the same MAC address.

Relaxing the restriction is likely a matter of causing
fail_over_mac to be obeyed in addtional modes other than active-backup,
which is actually what's catching the ppp case right now.  The check for
the set_mac function passes, but later, the test for "should the slave's
MAC be changed" is

if (!bond->params.fail_over_mac ||
BOND_MODE(bond) != BOND_MODE_ACTIVEBACKUP) {
[ change the slave MAC ]
}

so this block is entered for anything other than active-backup
mode with fail_over_mac enabled.  This block was changed for

commit 00503b6f702eaf23e7257d6287da72805d7d014c
Author: dingtianhong <dingtianh...@huawei.com>
Date:   Sat Jan 25 13:00:29 2014 +0800

bonding: fail_over_mac should only affect AB mode at enslave and removal pro
cessing

So that's probably when you saw the behavior change.

Are you able to test a patch?  I believe the following would
change the behavior, although some of the release logic may not be
correct (I've not tested this).

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index b4351caf8e01..5f3632544d52 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1451,8 +1451,7 @@ int bond_enslave(struct net_device *bond_dev, struct 
net_device *slave_dev)
 */
ether_addr_copy(new_slave->perm_hwaddr, slave_dev->dev_addr);
 
-   if (!bond->params.fail_over_mac ||
-   BOND_MODE(bond) != BOND_MODE_ACTIVEBACKUP) {
+   if (!bond->params.fail_over_mac) {
/* Set slave to master's mac address.  The application already
 * set the master's mac address to that of the first slave
 */
@@ -1725,8 +1724,7 @@ err_close:
dev_close(slave_dev);
 
 err_restore_mac:
-   if (!bond->params.fail_over_mac ||
-   BOND_MODE(bond) != BOND_MODE_ACTIVEBACKUP) {
+   if (!bond->params.fail_over_mac) {
/* XXX TODO - fom follow mode needs to change master's
 * MAC if this slave's MAC is in use by the bond, or at
 * least print a warning.
@@ -1823,8 +1821,7 @@ static int __bond_release_one(struct net_device *bond_dev,
 
RCU_INIT_POINTER(bond->current_arp_slave, NULL);
 
-   if (!all && (!bond->params.fail_over_mac ||
-BOND_MODE(bond) != BOND_MODE_ACTIVEBACKUP)) {
+   if (!all && (!bond->params.fail_over_mac)) {
if (ether_addr_equal_64bits(bond_dev->dev_addr, 
slave->perm_hwaddr) &&
bond_has_slaves(bond))
netdev_warn(bond_dev, "the permanent HWaddr of %s - %pM 
- is still in use by %s - set the HWaddr of %s to a different address to avoid 
conflicts\n",
@@ -1905,8 +1902,7 @@ static int __bond_release_one(struct net_device *bond_dev,
/* close slave before restoring its mac address */
dev_close(slave_dev);
 
-   if (bond->params.fail_over_mac != BOND_FOM_ACTIVE ||
-   BOND_MODE(bond) != BOND_MODE_ACTIVEBACKUP) {
+   if (bond->params.fail_over_mac != BOND_FOM_ACTIVE) {
/* restore original ("permanent") mac address */
    ether_addr_copy(addr.sa_data, slave->perm_hwaddr);
addr.sa_family = slave_dev->type;


-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Regression in bonding driver - devices without set_mac

2015-11-06 Thread Jay Vosburgh

Toby Corkindale <t...@wintrmute.net> wrote:

>I originally reported this in Dec 2014, as:
>https://bugzilla.kernel.org/show_bug.cgi?id=89161
>
>The bug is still present in the 4.2 Linux kernel.

I had some time to look into this today, and there is a related
code path that panics the kernel, so I'm working on a patch to resolve
things, and will be posting that shortly.

>If you look at the code, here:
>https://github.com/torvalds/linux/commit/f54424412b6b2f64cae4d7c39d981ca14ce0052c
>
>Then you see that the intention is for devices without set_mac support
>to be supported. Although in older code they DID work, and that
>ability died sometime near this commit. And has never returned since.

Unfortunately, I believe the referenced commit is in error.  The
change log states:

[...] It's wrong because we
only require ndo_set_mac_address in case bonding is in active-backup mode
and FOM isn't FOM_ACTIVE.

This assertion is incorrect; ndo_set_mac_address is necessary
for all modes with the only exception being active-backup with
fail_over_mac enabled.

All of the load balance modes (everything except active-backup)
will change the MAC of the slave devices at some point, and thus require
ndo_set_mac_address support.

PPP devices should function in active-backup mode, and should
enable fail_over_mac automatically (if the PPP device is the first
slave) if f_o_m is not already enabled.

The prior discussion on this patch can be found here:

http://www.spinics.net/lists/netdev/msg289311.html

>The code path in current kernels does allow devices through that block
>of code, but it fails somewhere else -- the devices are not
>successfully added as slaves, but the only error printed is the
>warning about not supporting MAC address setting.

What is happening in the current code is that, for load balance
modes, e.g., balance-rr, the test for ndo_set_mac_address support
passes, but, later, bond_enslave attempts to actually set the MAC of the
new slave, and this fails (because there is no ndo_set_mac).

>Is there anything I can do to try and sort this regression out, with you?

As I said, I do not believe this is a regression.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH net-next] ipv6: recreate ipv6 link-local addresses when increasing MTU over IPV6_MIN_MTU

2015-10-26 Thread Jay Vosburgh

Hannes Frederic Sowa <han...@stressinduktion.org> wrote:

>Hello Alex,
>
>On Mon, Oct 26, 2015, at 16:52, Alexander Duyck wrote:
>> Seems like this code isn't quite correct.  You are calling ipv6_add_dev 
>> for slave devices, and if I understand things correctly I don't believe 
>> that was happening before and may be an unintended side effect.
>
>Ah, btw., autoconf and ipv6 operation on IFF_SLAVE devices is actually
>desired nowadays and don't think we can change this. See also:
><https://patchwork.ozlabs.org/patch/531196/>

IPv6 addrconf on IFF_SLAVE devices was disabled for bonding
slaves in commit c2edacf80e15 because it caused issues with snooping
switches.

This is also referenced in

https://bugzilla.redhat.com/show_bug.cgi?id=236750

Won't re-enabling autoconf on IFF_SLAVE devices cause that issue
to return?

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v4] net/bonding: send arp in interval if no active slave

2015-10-09 Thread Jay Vosburgh

Jarod Wilson <ja...@redhat.com> wrote:

>Jarod Wilson wrote:
>...
>> As Andy already stated I'm not a fan of such workarounds either but it's
>> necessary sometimes so if this is going to be actually considered then a
>> few things need to be fixed. Please make this a proper bonding option
>> which can be changed at runtime and not only via a module parameter.
>
>Is there any particular userspace tool that would need some updating, or
>is adding the sysfs knobs sufficient here? I think I've got all the sysfs
>stuff thrown together now, but still need to test.

Most (all?) bonding options should be configurable via iproute
(netlink) now.

>
>>> Now, I saw that you've only tested with 500 ms, can't this be fixed by
>>> using
>>> a different interval ? This seems like a very specific problem to have a
>>> whole new option for.
>>
>> ...I'll wait until we've heard confirmation from Uwe that intervals
>> other than 500ms don't fix things.
>
>Okay, so I believe the "only tested with 500ms" was in reference to
>testing with Uwe's initial patch. I do have supporting evidence in a
>bugzilla report that shows upwards of 5000ms still experience the problem
>here.

I did set up some switches and attempt to reproduce this
yesterday; I daisy-chained three switches (two Cisco and an HP) together
and connected the bonded interfaces to the "end" switches.  I tried
various ARP targets (the switch, hosts on various points of the switch)
and varying arp_intervals and was unable to reproduce the problem.

As I understand it, the working theory is something like this:

- host with two bonded interfaces, A and B.  For active-backup
mode, the interfaces have been assigned the same MAC address.

- switch has MAC for B in its forwarding table

- bonding goes from down to up, and thinks all its slaves are
down, and starts the "curr_arp_slave" search for an active
arp_ip_target.  In this case, it starts with A, and sends an ARP from A.

As an aside, I'm not 100% clear on what exactly is going on in
the "bonding goes from down to up" transition; this seems to be key in
reproducing the issue.

- switch sees source mac coming from port A, starts to update
its forwarding table

- meanwhile, switch forwards ARP request, and receives ARP
reply, which it forwards to port B.  Bonding drops this, as the slave is
inactive.

- switch finishes updating forwarding table, MAC is now assigned
to port A.

- bonding now tries sending on port B, and the cycle repeats.

If this is what's taking place, then the arp_interval itself is
irrelevant, the race is between the switch table update and the
generation of the ARP reply.

Also, presuming the above is what's going on, we could modify
the ARP "curr_arp_slave" logic a bit to resolve this without requiring
any magic knobs.

For example, we could change the "drop on inactive" logic to
recognise the "curr_arp_slave" search and accept the unicast ARP reply,
and perhaps make that receiving slave the next curr_arp_slave
automatically.

I also wonder if the fail_over_mac option would affect this
behavior, as it would cause the slaves to keep their MAC address for the
duration, so the switch would not see the MAC move from port to port.

Another thought would be to have the curr_arp_slave cycle
through the slaves in random order, but that could create
non-deterministic results even when things are working correctly.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] net/bonding: send arp in interval if no active slave

2015-09-03 Thread Jay Vosburgh

Uwe Koziolek <uwe.kozio...@redknee.com> wrote:

>On Tue, Sep 01, 2015 at 05:41 PM +0200, Andy Gospodarek wrote:
>> On Mon, Aug 17, 2015 at 10:51:27PM +0200, Uwe Koziolek wrote:
>>> On Mon, Aug 17, 2015 at 09:14PM +0200, Jay Vosburgh wrote:
>>>> Uwe Koziolek <uwe.kozio...@redknee.com> wrote:
>>>>
>>>>> On2015-08-17 07:12 PM,Jarod Wilson wrote:
>>>>>> On 2015-08-17 12:55 PM, Veaceslav Falico wrote:
>>>>>>> On Mon, Aug 17, 2015 at 12:23:03PM -0400, Jarod Wilson wrote:
>>>>>>>> From: Uwe Koziolek <uwe.kozio...@redknee.com>
>>>>>>>>
>>>>>>>> With some very finicky switch hardware, active backup bonding can get
>>>>>>>> into
>>>>>>>> a situation where we play ping-pong between interfaces, trying to get
>>>>>>>> one
>>>>>>>> to come up as the active slave. There seems to be an issue with the
>>>>>>>> switch's arp replies either taking too long, or simply getting lost,
>>>>>>>> so we
>>>>>>>> wind up unable to get any interface up and active. Sometimes, the issue
>>>>>>>> sorts itself out after a while, sometimes it doesn't.
>>>>>>>>
>>>>>>>> Testing with num_grat_arp has proven fruitless, but sending an
>>>>>>>> additional
>>>>>>>> arp on curr_arp_slave if we're still in the arp_interval timeslice in
>>>>>>>> bond_ab_arp_probe(), has shown to produce 100% reliability in testing
>>>>>>>> with
>>>>>>>> this hardware combination.
>>>>>>> Sorry, I don't understand the logic of why it works, and what exactly
>>>>>>> are
>>>>>>> we fixiing here.
>>>>>>>
>>>>>>> It also breaks completely the logic for link state management in case
>>>>>>> of no
>>>>>>> current active slave for 2*arp_interval.
>>>>>>>
>>>>>>> Could you please elaborate what exactly is fixed here, and how it
>>>>>>> works? :)
>>>>>> I can either duplicate some information from the bug, or Uwe can, to
>>>>>> illustrate the exact nature of the problem.
>>>>>>
>>>>>>> p.s. num_grat_arp maybe could help?
>>>>>> That was my thought as well, but as I understand it, that route was
>>>>>> explored, and it didn't help any. I don't actually have a reproducer
>>>>>> setup of my own, unfortunately, so I'm kind of caught in the middle
>>>>>> here...
>>>>>>
>>>>>> Uwe, can you perhaps further enlighten us as to what num_grat_arp
>>>>>> settings were tried that didn't help? I'm still of the mind that if
>>>>>> num_grat_arp *didn't* help, we probably need to do something keyed off
>>>>>> num_grat_arp.
>>>>> The bonding slaves are connected to high available switches, each of the
>>>>> slaves is connected to a different switch. If the bond is starting, only
>>>>> the selected slave sends one arp-request. If a matching arp_response was
>>>>> received, this slave and the bond is going into state up, sending the
>>>>> gratitious arps...
>>>>> But if you got no arp reply the next slave was selected.
>>>>> With most of the newer switches, not overloaded, or with other software
>>>>> bugs, or with a single switch configuration, you would get a arp response
>>>>> on the first arp request.
>>>>> But in case of high availability configuration with non perfect switches
>>>>> like HP ProCurve 54xx, also with some Cisco models, you may not get a
>>>>> response on the first arp request.
>>>>>
>>>>> I have seen network snoops, there the switches are not responding to the
>>>>> first arp request on slave 1, the second arp request was sent on slave 2
>>>>> but the response was received on slave one,  and all following arp
>>>>> requests are anwsered on the wrong slave for a longer time.
>>>>Could you elaborate on the exact "high availability
>>>> configuration" here, including the model(s) of switch(es) involved?
>>>>
>>>>Is this some kind of race between the switch or switches
>>>&

Re: [PATCH net] bonding: fix bond dev flags after convert to arphrd_ether

2015-08-27 Thread Jay Vosburgh

Nikolay Aleksandrov niko...@cumulusnetworks.com wrote:
[...]
Restarting this thread because there’s actually a bug here, what you described 
with
the bonding destruction is true when the slaves are all destroyed but it isn’t 
true if they’re
just released, if you take a look at bond_slave_netdev_event() the bond 
destruction happens
only on NETDEV_UNREGISTER and I just hit this bug by enslaving a 
non-ARPHRD_ETHER
device, releasing it and enslaving a ARPHRD_ETHER device so ether_setup() path 
in bond_enslave
is hit and IFF_MASTER gets dropped:
17: bond0: BROADCAST,MULTICAST,MASTER,UP,LOWER_UP mtu 1500 qdisc noqueue 
state UP mode DEFAULT group default qlen 1000
link/fddi 9a:33:c5:30:ff:a6 brd ff:ff:ff:ff:ff:ff
(release non-ARPHRD_ETHER slave)
(enslave ARPHRD_ETHER device)
17: bond0: BROADCAST,MULTICAST,LOWER_UP mtu 1500 qdisc noqueue state UP mode 
DEFAULT group default qlen 1000
link/ether 08:00:27:3c:13:57 brd ff:ff:ff:ff:ff:ff

Notice the master flag is gone and of course on unload we get:
[57981.545547] [ cut here ]
[57981.545567] WARNING: CPU: 0 PID: 13792 at fs/proc/generic.c:575 
remove_proc_entry+0x17e/0x190()
[57981.545572] remove_proc_entry: removing non-empty directory 'net/bonding', 
leaking at least 'bond0'
[...]
We need to convert it back to ARPHRD_ETHER if releasing the last slave, because
we can’t destroy it (in some paths bond-dev is used after bond_release()).
Basically we should make the case that if the bonding doesn’t have any slaves 
then it’s
always an ARPHRD_ETHER device.

Thoughts ?

I agree that it would be cleaner for bond_dev-type to switch
back on release of last slave.  The options code (caller of
bond_option_slaves_set) and bond_uninit() both reference the bond or dev
after calling bond_release(), and would need changing if any release
could destroy the bond itself.

However, for the type change, there's the potentially tricky
case of a nested non-ARPHRD_ETHER bond, e.g., bond0 - bond1 - ib0.
This isn't a typical use case that I'm aware of, but I believe it's
supported by the code.

If ib0, the last slave, is released, bond1 will want to change
to ARPHRD_ETHER, but bond0 is ARPHRD_INFINIBAND.  I suspect bonding will
have to notice the NETDEV_PRE_TYPE_CHANGE and _POST_ notifiers and take
appropriate action (i.e., cascade the type change upwards).

There might be similar issues with other devices stacked on top
of the IB - Ether type-changing bond; I'm not sure how many of those
there may be, though, since many things won't stack over IB devices (or
an IB-flavor bond).

If the type change works, then I don't think we would still need
the release and destroy logic.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] net/bonding: send arp in interval if no active slave

2015-08-17 Thread Jay Vosburgh

Uwe Koziolek uwe.kozio...@redknee.com wrote:

On2015-08-17 07:12 PM,Jarod Wilson wrote:
 On 2015-08-17 12:55 PM, Veaceslav Falico wrote:
 On Mon, Aug 17, 2015 at 12:23:03PM -0400, Jarod Wilson wrote:
 From: Uwe Koziolek uwe.kozio...@redknee.com

 With some very finicky switch hardware, active backup bonding can get
 into
 a situation where we play ping-pong between interfaces, trying to get
 one
 to come up as the active slave. There seems to be an issue with the
 switch's arp replies either taking too long, or simply getting lost,
 so we
 wind up unable to get any interface up and active. Sometimes, the issue
 sorts itself out after a while, sometimes it doesn't.

 Testing with num_grat_arp has proven fruitless, but sending an
 additional
 arp on curr_arp_slave if we're still in the arp_interval timeslice in
 bond_ab_arp_probe(), has shown to produce 100% reliability in testing
 with
 this hardware combination.

 Sorry, I don't understand the logic of why it works, and what exactly
 are
 we fixiing here.

 It also breaks completely the logic for link state management in case
 of no
 current active slave for 2*arp_interval.

 Could you please elaborate what exactly is fixed here, and how it
 works? :)

 I can either duplicate some information from the bug, or Uwe can, to
 illustrate the exact nature of the problem.

 p.s. num_grat_arp maybe could help?

 That was my thought as well, but as I understand it, that route was
 explored, and it didn't help any. I don't actually have a reproducer
 setup of my own, unfortunately, so I'm kind of caught in the middle
 here...

 Uwe, can you perhaps further enlighten us as to what num_grat_arp
 settings were tried that didn't help? I'm still of the mind that if
 num_grat_arp *didn't* help, we probably need to do something keyed off
 num_grat_arp.
The bonding slaves are connected to high available switches, each of the
slaves is connected to a different switch. If the bond is starting, only
the selected slave sends one arp-request. If a matching arp_response was
received, this slave and the bond is going into state up, sending the
gratitious arps...
But if you got no arp reply the next slave was selected.
With most of the newer switches, not overloaded, or with other software
bugs, or with a single switch configuration, you would get a arp response
on the first arp request.
But in case of high availability configuration with non perfect switches
like HP ProCurve 54xx, also with some Cisco models, you may not get a
response on the first arp request.

I have seen network snoops, there the switches are not responding to the
first arp request on slave 1, the second arp request was sent on slave 2
but the response was received on slave one,  and all following arp
requests are anwsered on the wrong slave for a longer time.

Could you elaborate on the exact high availability
configuration here, including the model(s) of switch(es) involved?

Is this some kind of race between the switch or switches
updating the forwarding tables and the bond flip flopping between the
slaves?  E.g., source MAC from ARP sent on slave 1 is used to populate
the forwarding table, but (for whatever reason) there is no reply.  ARP
on slave 2 is sent (using the same source MAC, unless you set
fail_over_mac), but forwarding tables still send that MAC to slave 1, so
reply is sent there.

The proposed change sents up to 3 arp requests on a down bond using the
same slave, delayed by arp_interval.
Using problematic switches i have seen the the arp response on the right
slave at latest on the second arp request. So the bond is going into state
up.

How does it works:
The bonds in up state are handled on the beginning of bond_ab_arp_probe
procedure, the other part of this procedure is handling the slave change.
The proposed change is bypassing the slave change for 2 additional calls
of bond_ab_arp_probe.
Now the retries are not only for an up bond available, they are also
implemented for a down bond.

Does this delay failover or bringup on switches that are not
problematic?  I.e., if arp_interval is, say, 1000 (1 second), will
this impact failover / recovery times?

-J

The num_grat_arp has no chance to solve the problem. The num_grat_arp is
only used, if a different slave is going active.
But in our case, the bonding slaves are not going into the state active
for a longer time.

 [jarod: manufacturing of changelog]
 CC: Jay Vosburgh j.vosbu...@gmail.com
 CC: Veaceslav Falico vfal...@gmail.com
 CC: Andy Gospodarek go...@cumulusnetworks.com
 CC: netdev@vger.kernel.org
 Signed-off-by: Uwe Koziolek uwe.kozio...@redknee.com
 Signed-off-by: Jarod Wilson ja...@redhat.com
 ---
 drivers/net/bonding/bond_main.c | 5 +
 1 file changed, 5 insertions(+)

 diff --git a/drivers/net/bonding/bond_main.c
 b/drivers/net/bonding/bond_main.c
 index 0c627b4..60b9483 100644
 --- a/drivers/net/bonding/bond_main.c
 +++ b/drivers/net/bonding/bond_main.c
 @@ -2794,6 +2794,11 @@ static bool

3.14 stable request for net: gso: use feature flag argument in all protocol gso handlers

2015-08-13 Thread Jay Vosburgh


For 3.14 -stable, please consider commit:

commit 1e16aa3ddf863c6b9f37eddf52503230a62dedb3
Author: Florian Westphal f...@strlen.de
Date:   Mon Oct 20 13:49:16 2014 +0200

net: gso: use feature flag argument in all protocol gso handlers


We have observed kernel panics when an openvswitch bridge is
populated with virtual devices (veth, for example) that have expansive
feature sets that include NETIF_F_GSO_GRE.

The failure occurs when foreign GRE encapsulated traffic
(explicitly not including the initial packets of a connection) arrives at
the system (likely via a switch flood event).  The packets are GRO
accumulated, and passed to the OVS receive processing.  As the connection
is not in the OVS kernel datapath table, the call path is:

ovs_dp_upcall -
queue_gso_packets -
__skb_gso_segment(skb, NETIF_F_SG, false)

Without the patch cited above, __skb_gso_segment returns NULL,
as the features from the device (including _GSO_GRE) are used in place
of the _SG feature supplied to the call.  The kernel panics on a
subsequent dereference of the NULL pointer in queue_userspace_packet().

A backport to 3.14.50 is below.

-J

Subject: [PATCH 3.14-stable] net: gso: use feature flag argument in all 
protocol gso handlers

From: Florian Westphal f...@strlen.de

skb_gso_segment() has a 'features' argument representing offload features
available to the output path.

A few handlers, e.g. GRE, instead re-fetch the features of skb-dev and use
those instead of the provided ones when handing encapsulation/tunnels.

Depending on dev-hw_enc_features of the output device skb_gso_segment() can
then return NULL even when the caller has disabled all GSO feature bits,
as segmentation of inner header thinks device will take care of segmentation.

This e.g. affects the tbf scheduler, which will silently drop GRE-encap GSO skbs
that did not fit the remaining token quota as the segmentation does not work
when device supports corresponding hw offload capabilities.

Cc: Pravin B Shelar pshe...@nicira.com
Signed-off-by: Florian Westphal f...@strlen.de
Signed-off-by: David S. Miller da...@davemloft.net
[jay.vosburgh: backported to 3.14. ]
Signed-off-by: Jay Vosburgh jay.vosbu...@canonical.com

---
 net/ipv4/af_inet.c | 2 +-
 net/ipv4/gre_offload.c | 2 +-
 net/ipv4/udp.c | 2 +-
 net/ipv6/ip6_offload.c | 2 +-
 net/mpls/mpls_gso.c| 2 +-
 5 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 951fe55..f4c804d 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1291,7 +1291,7 @@ static struct sk_buff *inet_gso_segment(struct sk_buff 
*skb,
 
encap = SKB_GSO_CB(skb)-encap_level  0;
if (encap)
-   features = skb-dev-hw_enc_features  netif_skb_features(skb);
+   features = skb-dev-hw_enc_features;
SKB_GSO_CB(skb)-encap_level += ihl;
 
skb_reset_transport_header(skb);
diff --git a/net/ipv4/gre_offload.c b/net/ipv4/gre_offload.c
index 278836f..0da513e 100644
--- a/net/ipv4/gre_offload.c
+++ b/net/ipv4/gre_offload.c
@@ -69,7 +69,7 @@ static struct sk_buff *gre_gso_segment(struct sk_buff *skb,
skb-mac_len = skb_inner_network_offset(skb);
 
/* segment inner packet. */
-   enc_features = skb-dev-hw_enc_features  netif_skb_features(skb);
+   enc_features = skb-dev-hw_enc_features  features;
segs = skb_mac_gso_segment(skb, enc_features);
if (!segs || IS_ERR(segs)) {
skb_gso_error_unwind(skb, protocol, ghl, mac_offset, mac_len);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 21a3a9e..22b2a83 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -2510,7 +2510,7 @@ struct sk_buff *skb_udp_tunnel_segment(struct sk_buff 
*skb,
skb-protocol = htons(ETH_P_TEB);
 
/* segment inner packet. */
-   enc_features = skb-dev-hw_enc_features  netif_skb_features(skb);
+   enc_features = skb-dev-hw_enc_features  features;
segs = skb_mac_gso_segment(skb, enc_features);
if (!segs || IS_ERR(segs)) {
skb_gso_error_unwind(skb, protocol, tnl_hlen, mac_offset,
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index b2f0915..dc46eba 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -112,7 +112,7 @@ static struct sk_buff *ipv6_gso_segment(struct sk_buff *skb,
 
encap = SKB_GSO_CB(skb)-encap_level  0;
if (encap)
-   features = skb-dev-hw_enc_features  netif_skb_features(skb);
+   features = skb-dev-hw_enc_features;
SKB_GSO_CB(skb)-encap_level += sizeof(*ipv6h);
 
ipv6h = ipv6_hdr(skb);
diff --git a/net/mpls/mpls_gso.c b/net/mpls/mpls_gso.c
index 851cd88..0c970cb 100644
--- a/net/mpls/mpls_gso.c
+++ b/net/mpls/mpls_gso.c
@@ -47,7 +47,7 @@ static struct sk_buff *mpls_gso_segment(struct sk_buff *skb,
__skb_push(skb, skb-mac_len);
 
/* Segment inner packet

Re: [PATCH net] bonding: fix bond dev flags after convert to arphrd_ether

2015-07-15 Thread Jay Vosburgh

Nikolay Aleksandrov ra...@blackwall.org wrote:

From: Nikolay Aleksandrov niko...@cumulusnetworks.com

If a bonding device enslaves devices != arphrd_ether it'll change types
and if later these devices are released, it can enslave an arphrd_ether
device and switch back calling ether_setup() which resets dev-flags to
IFF_BROADCAST|IFF_MULTICAST and clears IFF_MASTER which then could lead
to many different bugs. This bug seems to have been there since the
introduction of ether_setup() in bond_enslave().

I thought the hack-around for the non-ethernet device problem
was that, for non-ARPHRD_ETHER devices, the bonding master device is
automatically destroyed when the last slave is released.

This (well, this sort of thing) originally came up with IPoIB
devices needing different ops that would disappear if the ipoib module
was unloaded, so the bond master was deliberately unregistered when the
last slave was released.

Looking at the code, at first glance this appears to still be
the case: bond_slave_netdev_event calls bond_release_and_destroy for !=
ARPHRD_ETHER, which in turn should unregister the bond itself if there
are no slaves left.  Is this no longer working as intended?

-J


Signed-off-by: Nikolay Aleksandrov niko...@cumulusnetworks.com
Fixes: e36b9d16c6a6 (bonding: clean muticast addresses when device changes 
type)
---
 drivers/net/bonding/bond_main.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 317a49480475..8ba119896e55 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1368,6 +1368,7 @@ int bond_enslave(struct net_device *bond_dev, struct 
net_device *slave_dev)
   bond_setup_by_slave(bond_dev, slave_dev);
   else {
   ether_setup(bond_dev);
+  bond_dev-flags |= IFF_MASTER;
   bond_dev-priv_flags = ~IFF_TX_SKB_SHARING;
   }
 
-- 
1.9.3


---
-Jay Vosburgh, jay.vosbu...@canonical.com
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Stable request for gso feature flag and error handling fixes

2015-07-07 Thread Jay Vosburgh


Please consider commit

commit 1e16aa3ddf863c6b9f37eddf52503230a62dedb3
Author: Florian Westphal f...@strlen.de
Date:   Mon Oct 20 13:49:16 2014 +0200

net: gso: use feature flag argument in all protocol gso handlers

and, at your discretion, the related commit

commit 330966e501ffe282d7184fde4518d5e0c24bc7f8
Author: Florian Westphal f...@strlen.de
Date:   Mon Oct 20 13:49:17 2014 +0200

net: make skb_gso_segment error handling more robust

for -stable kernels prior to 3.18 back to 3.10.

We have observed kernel panics when an openvswitch bridge is
populated with virtual devices (veth, for example) that have expansive
feature sets that include NETIF_F_GSO_GRE.

The failure occurs when foreign GRE encapsulated traffic
(explicitly not including the initial packets of a connection) arrives
at the system (likely via a switch flood event).  The packets are GRO
accumulated, and passed to the OVS receive processing.  As the
connection is not in the OVS kernel datapath table, the call path is:

ovs_dp_upcall -
 queue_gso_packets -
__skb_gso_segment(skb, NETIF_F_SG, false)

Without the first patch cited above, __skb_gso_segment returns
NULL, as the features from the device (including GSO_GRE) are used in
place of the _SG feature supplied to the call.

Without the second patch cited above, the kernel panics when it
later dereferences the NULL skb pointer in queue_userspace_packet.

Strictly speaking, with the first place applied the panic is
avoided (as the NULL return does not occur), but including the second
patch may still be prudent.

Thanks,

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Stable request for gso feature flag and error handling fixes

2015-07-07 Thread Jay Vosburgh

David Miller da...@davemloft.net wrote:

From: Jay Vosburgh jay.vosbu...@canonical.com
Date: Tue, 07 Jul 2015 17:38:50 -0700

  Please consider commit

When you ask me to consider commits for -stable you have to tell
me what -stable releases you want me to submit them for.

Currently I am only doing -stable submissions for 4.1.x, 3.18.x
and 3.14.x

I did say:

   for -stable kernels prior to 3.18 back to 3.10.

So, this would be just for 3.14.x.  My apologies if I buried
that too far into the message.

Are the other -stable tree maintainers picking up patches after
you've submitted to 4.1/3.18/3.14, or is it necessary to make separate
requests?

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v2] bonding: primary_reselect with failure is not working properly

2015-07-03 Thread Jay Vosburgh

GMAIL ranamazh...@gmail.com wrote:

Hi Jay,

On Friday 03 July 2015 02:12 AM, Jay Vosburgh wrote:

  [ added netdev to cc ]

 Mazhar Rana ranamazh...@gmail.com wrote:

 When primary_reselect is set to failure, primary interface should
 not become active until current active slave is up. But if we set first
  I think you mean until current active slave is down here, not
 up.

Yes, It should be up, grammatical mistake

down, right?

[...]
Below is the updated version of your patch. Any Comments or suggestions ?
[...]
 static struct slave *bond_find_best_slave(struct bonding *bond)
 {
-  struct slave *slave, *bestslave = NULL, *primary;
+  struct slave *slave = NULL, *bestslave = NULL, *primary;
   struct list_head *iter;
   int mintime = bond-params.updelay;
   primary = rtnl_dereference(bond-primary_slave);
-  if (primary  primary-link == BOND_LINK_UP 
-  bond_should_change_active(bond))
-  return primary;
+  if (primary  primary-link == BOND_LINK_UP)
+  slave = bond_choose_primary_or_current(bond);
+
+  if (slave)
+  return slave;
   bond_for_each_slave(bond, slave, iter) {
   if (slave-link == BOND_LINK_UP)

I think this will misbehave in the case that curr is up and
available, but primary is NULL (this can happen when the primary option
is cleared).  In this case, the above code will not call
bond_choose_primary_or_current, and will then run the loop to find a new
curr, which may select a different slave unnecessarily.

How does the following look?  I prefer to make the call to
choose_primary_or_current unconditional, and have it decide if the
search loop should be run.  In this version, _choose_ tests curr if prim
is not suitable.  Compile tested only.

Thoughts?

-J

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 19eb990..1e35e25 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -689,40 +689,57 @@ out:
 
 }
 
-static bool bond_should_change_active(struct bonding *bond)
+static struct slave *bond_choose_primary_or_current(struct bonding *bond)
 {
struct slave *prim = rtnl_dereference(bond-primary_slave);
struct slave *curr = rtnl_dereference(bond-curr_active_slave);
 
-   if (!prim || !curr || curr-link != BOND_LINK_UP)
-   return true;
+   if (!prim || !prim-link == BOND_LINK_UP) {
+   if (!curr || !curr-link == BOND_LINK_UP)
+   return NULL;
+   return curr;
+   }
+
if (bond-force_primary) {
bond-force_primary = false;
-   return true;
+   return prim;
+   }
+
+   if (!curr || curr-link != BOND_LINK_UP)
+   return prim;
+
+   /* At this point, prim and curr are both up */
+   switch (bond-params.primary_reselect) {
+   case BOND_PRI_RESELECT_ALWAYS:
+   return prim;
+   case BOND_PRI_RESELECT_BETTER:
+   if (prim-speed  curr-speed)
+   return curr;
+   if (prim-speed == curr-speed  prim-duplex = curr-duplex)
+   return curr;
+   return prim;
+   case BOND_PRI_RESELECT_FAILURE:
+   return curr;
+   default:
+   netdev_err(bond-dev, impossible primary_reselect %d\n,
+  bond-params.primary_reselect);
+   return curr;
}
-   if (bond-params.primary_reselect == BOND_PRI_RESELECT_BETTER 
-   (prim-speed  curr-speed ||
-(prim-speed == curr-speed  prim-duplex = curr-duplex)))
-   return false;
-   if (bond-params.primary_reselect == BOND_PRI_RESELECT_FAILURE)
-   return false;
-   return true;
 }
 
 /**
- * find_best_interface - select the best available slave to be the active one
+ * bond_find_best_slave - select the best available slave to be the active one
  * @bond: our bonding struct
  */
 static struct slave *bond_find_best_slave(struct bonding *bond)
 {
-   struct slave *slave, *bestslave = NULL, *primary;
+   struct slave *slave, *bestslave = NULL;
struct list_head *iter;
int mintime = bond-params.updelay;
 
-   primary = rtnl_dereference(bond-primary_slave);
-   if (primary  primary-link == BOND_LINK_UP 
-   bond_should_change_active(bond))
-   return primary;
+   slave = bond_choose_primary_or_current(bond);
+   if (slave)
+   return slave;
 
bond_for_each_slave(bond, slave, iter) {
if (slave-link == BOND_LINK_UP)

---
-Jay Vosburgh, jay.vosbu...@canonical.com
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v2] bonding: primary_reselect with failure is not working properly

2015-07-02 Thread Jay Vosburgh

 = bond_choose_primary_or_current(bond);
+   if (slave)
+   return slave;
 
bond_for_each_slave(bond, slave, iter) {
if (slave-link == BOND_LINK_UP)

---
-Jay Vosburgh, jay.vosbu...@canonical.com
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Issue with LACP mode in linux bonding driver

2015-06-26 Thread Jay Vosburgh

Ajith Adapa adapa.aj...@gmail.com wrote:

On 26 June 2015 at 07:45, Jay Vosburgh jay.vosbu...@canonical.com wrote:
 echo 'module bonding =p'  /sys/kernel/debug/dynamic_debug/control
Hi,

thanks for the reply.

I tried this out a bit here, and could reproduce the problem on
3.13, but not on 4.0.0.  A bit of checking suggests that this problem is
fixed by the following commit:

commit 63b46242f707849a1df10b70e026281bfa40e849
Author: Wilson Kok w...@cumulusnetworks.com
Date:   Mon Jan 26 01:16:59 2015 -0500

bonding: fix incorrect lacp mux state when agg not active

which looks to have first appeared in the 4.0 kernel.  I did a
quick backport of that to 3.13 (leaving out the style and pr_debug
changes), and it appears to resolve the problem.

Ajith: can you test this patch?  If this resolves the problem
for you, we can request this patch for -stable to get it into the older
kernels.

diff --git a/drivers/net/bonding/bond_3ad.c b/drivers/net/bonding/bond_3ad.c
index dc0c56a..8c62f90 100644
--- a/drivers/net/bonding/bond_3ad.c
+++ b/drivers/net/bonding/bond_3ad.c
@@ -890,19 +890,23 @@ static void ad_mux_machine(struct port *port)
case AD_MUX_ATTACHED:
// check also if agg_select_timer expired(so the edable 
port will take place only after this timer)
if ((port-sm_vars  AD_PORT_SELECTED)  
(port-partner_oper.port_state  AD_STATE_SYNCHRONIZATION)  
!__check_agg_selection_timer(port)) {
-   port-sm_mux_state = 
AD_MUX_COLLECTING_DISTRIBUTING;// next state
+   if (port-aggregator-is_active)
+   port-sm_mux_state = 
AD_MUX_COLLECTING_DISTRIBUTING;// next state
} else if (!(port-sm_vars  AD_PORT_SELECTED) || 
(port-sm_vars  AD_PORT_STANDBY)) {// if UNSELECTED or STANDBY
port-sm_vars = ~AD_PORT_READY_N;
// in order to withhold the selection logic to 
check all ports READY_N value
// every callback cycle to update ready 
variable, we check READY_N and update READY here
__set_agg_ports_ready(port-aggregator, 
__agg_ports_are_ready(port-aggregator));
port-sm_mux_state = AD_MUX_DETACHED;// next 
state
+   } else if (port-aggregator-is_active) {
+   port-actor_oper_port_state |=
+   AD_STATE_SYNCHRONIZATION;
}
break;
case AD_MUX_COLLECTING_DISTRIBUTING:
if (!(port-sm_vars  AD_PORT_SELECTED) || 
(port-sm_vars  AD_PORT_STANDBY) ||
-   !(port-partner_oper.port_state  
AD_STATE_SYNCHRONIZATION)
-  ) {
+   !(port-partner_oper.port_state  
AD_STATE_SYNCHRONIZATION) ||
+   !(port-actor_oper_port_state  
AD_STATE_SYNCHRONIZATION)) {
port-sm_mux_state = AD_MUX_ATTACHED;// next 
state
 
} else {
@@ -941,7 +945,12 @@ static void ad_mux_machine(struct port *port)
break;
case AD_MUX_ATTACHED:
__attach_bond_to_agg(port);
-   port-actor_oper_port_state |= AD_STATE_SYNCHRONIZATION;
+   if (port-aggregator-is_active)
+   port-actor_oper_port_state |=
+   AD_STATE_SYNCHRONIZATION;
+   else
+   port-actor_oper_port_state =
+   ~AD_STATE_SYNCHRONIZATION;
port-actor_oper_port_state = ~AD_STATE_COLLECTING;
port-actor_oper_port_state = ~AD_STATE_DISTRIBUTING;
ad_disable_collecting_distributing(port);
@@ -950,6 +959,7 @@ static void ad_mux_machine(struct port *port)
case AD_MUX_COLLECTING_DISTRIBUTING:
port-actor_oper_port_state |= AD_STATE_COLLECTING;
port-actor_oper_port_state |= AD_STATE_DISTRIBUTING;
+   port-actor_oper_port_state |= AD_STATE_SYNCHRONIZATION;
ad_enable_collecting_distributing(port);
port-ntt = true;
break;
@@ -1350,6 +1360,9 @@ static void ad_port_selection_logic(struct port *port)
 
aggregator = __get_first_agg(port);
ad_agg_selection_logic(aggregator);
+
+   if (!port-aggregator-is_active)
+   port-actor_oper_port_state = ~AD_STATE_SYNCHRONIZATION;
 }
 
 /*


-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord

Re: Issue with LACP mode in linux bonding driver

2015-06-25 Thread Jay Vosburgh

Ajith Adapa adapa.aj...@gmail.com wrote:

Hi,

Sorry for direct mail. Since the question is more specific about
supporting LACP standard I decided to communicate directly with the
MAINTAINERS. My issue is related to multiaggregation support in LACP.

I saw your message this morning, but didn't have an opportunity
to look into it today.

Linux Flavour: Centos7.
Setup topology: Back to back connected Linux server and a L2 switch
with 2 interfaces eth0 and eth1 (on both sides).

On Switch I have mapped eth0 to po1 and eth1 to po2. On Linux server I
have created a single bond interface with both interfaces eth0 and
eth1.

On switch both po1 and po2 has same system-id but the Actor key is
different i.e. on PO1 it is 16385 and on PO2 it is 32768. As per the
information available regarding bond0 on Linux server which is given below
Active aggregator ID is 1 which is mapped to eth0.

But we have observed that eth1 on Linux server is also sending LACPDUS
with Collecting/Distributing bit set as 1. Which will result in single
bond interface on Linux server is splitted into multiple port-channels
on Switch causing duplication of frames on Linux server.

I'd suggest enabling the dynamic_debug for the bonding driver
and observe the state machine activity within the 802.3ad code.  This is
described in the Documentation/dynamic-debug-howto.txt that is part of
the kernel source; off the top of my head, I think you'll need something
like:

echo 'module bonding =p'  /sys/kernel/debug/dynamic_debug/control

This should put the bonding LACP state machine activity into the
system log.

If a port on a non-active aggregator is actually in collecting /
distributing state, that is probably bad, as I'd only expect that to be
true for ports in the active aggregator.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Experimental new bonding driver mode=batman

2015-05-20 Thread Jay Vosburgh

Michal Kubecek mkube...@suse.cz wrote:

On Tue, May 19, 2015 at 04:29:45AM -0400, Patrick Simmons wrote:
 On 05/19/2015 03:49 AM, Michal Kubecek wrote:
 On Tue, May 19, 2015 at 02:09:43AM -0400, Patrick Simmons wrote:
 
 I've written a new mode for the kernel bonding driver.  It's similar to
 the round-robin mode, but it keeps statistics on TCP resends so as to
 favor slave devices with more bandwidth when choosing where to send
 packets.  I've tested it on two laptops using WiFi/Ethernet and it seems
 to work okay, but it should be considered experimental.
 
 A description of how is the mode supposed to work would be definitely
 helpful.
 
 
 Rationale: It's helpful for cases where the slave devices have
 significantly different or varying bandwidth.  The reason I wrote it
 is to bond powerline networking and wireless networking adapters
 into a single interface for use with connecting to a MythTV server.
 Neither of these systems is particularly reliable with bandwidth,
 but mode=batman can adaptively figure out which network has more
 available bandwidth at any given moment.  This is better than
 mode=round-robin which always balances everything 50/50.

Thank you. But I rather meant some basic description of the algorithm
used to achieve this goal. Both should be IMHO part of the commit
message.

Agreed; the concept sounds interesting, but without a detailed
description of how it works it is difficult to evaluate its value.

 Regarding your analysis, I appreciate your comments, and I know it's
 rough, but I'm sorry to say I'm not really interested in doing much
 to improve its polish past where it is.  If it fails some way when I
 try to deploy it, then I'll fix that, and maybe I'll play around
 with the balancing heuristics, but the code quality is what it is
 unless someone else wants to improve it.  I would fix the
 indentation if that would make it acceptable for you to merge it,
 but not much more.  My argument for merging it is basically it
 doesn't do anything unless you pass mode=batman, so what's the
 harm?.
 
 So, if you guys decide you don't want to merge it because of the
 global spinlock etc., that's cool and I understand, but I thought I
 should at least post to this list so you and any other potentially
 interested people know it exists.  Oh, and, if you're not going to
 merge it, please let me know so I can know post the patch to GitHub
 or somewhere.  And, if you could include a note in the comments at
 the top of bond_main.c or somewhere pointing people to the patch,
 I'd very much appreciate that. I don't want anyone else to have to
 endure hours of kernel rebuilds with KASAN enabled if they want this
 functionality :)

Well, it's not my call, I'm not a bonding maintainer. But I believe at
least some of the objections would be shared by them. Of course, it's up
to you if you want to dedicate your time to improving the code to be
acceptable for mainline or rather maintain it out of tree (which may end
up taking even more time in the long term).

Well, I am a bonding maintainer, and I can say that the patch in
its current state is not suitable for inclusion.

At a minimum, there are many coding style issues, commented out
debug statements, etc, along with design issues (e.g., the batman mode
handling in bond_handle_frame is unconditional and takes place for all
modes, not just the new batman mode).

If you (Patrick) or someone else wishes to contribute this to
mainline, I'd suggest that the first step is to read and follow the
instructions in Documentation/SubmittingPatches in the kernel source
code.

It is also not feasible to add pointers in the kernel source
code to out-of-tree patches; sorry.

-J

---
-Jay Vosburgh, jay.vosbu...@canonical.com
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] bonding: simplify code and get rid of warning

2008-02-20 Thread Jay Vosburgh

Stephen Hemminger [EMAIL PROTECTED] wrote:

Get rid of warning and simplify code that looks up vlan tag.
No need to get tag, then copy it. Also no need for a local status
variable.

Granted, the current code is suboptimal, but I don't see any
warnings compiling this (gcc 4.1.2).  What are you getting?

-J

---
-Jay Vosburgh, IBM Linux Technology Center, [EMAIL PROTECTED]


Signed-off-by: Stephen Hemminger [EMAIL PROTECTED]
---
Patch against current 2.6.25 version.

--- a/drivers/net/bonding/bond_alb.c   2008-02-18 20:58:53.0 -0800
+++ b/drivers/net/bonding/bond_alb.c   2008-02-18 21:01:10.0 -0800
@@ -678,12 +678,8 @@ static struct slave *rlb_choose_channel(
   }

   if (!list_empty(bond-vlan_list)) {
-  unsigned short vlan_id;
-  int res = vlan_get_tag(skb, vlan_id);
-  if (!res) {
+  if (!vlan_get_tag(skb, client_info-vlan_id))
   client_info-tag = 1;
-  client_info-vlan_id = vlan_id;
-  }
   }

   if (!client_info-assigned) {
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [Bugme-new] [Bug 9937] New: Bug in bonding driver - Kernel oops whenever driver is loaded with max_bonds parameter

2008-02-11 Thread Jay Vosburgh

Andrew Morton [EMAIL PROTECTED] wrote:

 Problem Description: Kernel oops whenever bonding driver with max_bonds=2 
 (or 
 2) is loaded ...

I believe this is fixed by the following (from linux-2.6):

From: Jay Vosburgh [EMAIL PROTECTED]
Date: Tue, 29 Jan 2008 18:07:45 -0800
Subject: [PATCH] bonding: fix NULL pointer deref in startup processing

Fix the are we creating a duplicate check to not compare
the name if the name is NULL (meaning that the system should select
a name).  Bug reported by Benny Amorsen [EMAIL PROTECTED].

Signed-off-by: Jay Vosburgh [EMAIL PROTECTED]
Signed-off-by: Jeff Garzik [EMAIL PROTECTED]
Signed-off-by: David S. Miller [EMAIL PROTECTED]
---
 drivers/net/bonding/bond_main.c |   16 +---
 1 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 65c7eba..81b4574 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -4896,14 +4896,16 @@ int bond_create(char *name, struct bond_params *params, 
struct bonding **newbond
down_write(bonding_rwsem);
 
/* Check to see if the bond already exists. */
-   list_for_each_entry_safe(bond, nxt, bond_dev_list, bond_list)
-   if (strnicmp(bond-dev-name, name, IFNAMSIZ) == 0) {
-   printk(KERN_ERR DRV_NAME
+   if (name) {
+   list_for_each_entry_safe(bond, nxt, bond_dev_list, bond_list)
+   if (strnicmp(bond-dev-name, name, IFNAMSIZ) == 0) {
+   printk(KERN_ERR DRV_NAME
   : cannot add bond %s; it already exists\n,
-  name);
-   res = -EPERM;
-   goto out_rtnl;
-   }
+  name);
+   res = -EPERM;
+   goto out_rtnl;
+   }
+   }
 
bond_dev = alloc_netdev(sizeof(struct bonding), name ? name : ,
ether_setup);
-- 
1.5.2.4

--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: Null pointer dereference when bringing up bonding device on kernel-2.6.24-2.fc9.i686

2008-01-31 Thread Jay Vosburgh

Siim Põder [EMAIL PROTECTED] wrote:

Jay Vosburgh wrote:
 Benny Amorsen [EMAIL PROTECTED] wrote:
 
 https://bugzilla.redhat.com/show_bug.cgi?id=430391
 
  I know what this is, I'll fix it.

do you know when this happend, so we would know which kernel is ok to
use (not to start trying blindly)?

It was something I changed up during the 2.6.24-rc development
somewhere.  Also, I posted the fix for this a couple days ago, it's
commit 5eb71eec3616b0a62e63197016576a74da240c6b in netdev-2.6#upstream.

-J

---
-Jay Vosburgh, IBM Linux Technology Center, [EMAIL PROTECTED]
--
To unsubscribe from this list: send the line unsubscribe netdev in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

1 2 3 4 >

1 - 100 of 338 matches

Mail list logo