Re: [PATCH net-next V3 0/6] switch to use tx skb array in tun

2016-06-29 Thread Michael S. Tsirkin
On Thu, Jun 30, 2016 at 11:52:53AM +0800, Jason Wang wrote:
> Hi all:
> 
> This series tries to switch to use skb array in tun. This is used to
> eliminate the spinlock contention between producer and consumer. The
> conversion was straightforward: just introdce a tx skb array and use
> it instead of sk_receive_queue.
> 
> A minor issue is to keep the tx_queue_len behaviour, since tun used to
> use it for the length of sk_receive_queue. This is done through:
> 
> - add the ability to resize multiple rings at once to avoid handling
>   partial resize failure for mutiple rings.
> - add the support for zero length ring.
> - introduce a notifier which was triggered when tx_queue_len was
>   changed for a netdev.
> - resize all queues during the tx_queue_len changing.
> 
> Tests shows about 15% improvement on guest rx pps:
> 
> Before: ~130pps
> After : ~150pps

Series:

Acked-by: Michael S. Tsirkin 


> Changes from V2:
> - add multiple rings resizing support for ptr_ring/skb_array
> - add zero length ring support
> - introdce a NETDEV_CHANGE_TX_QUEUE_LEN
> - drop new flags
> 
> Changes from V1:
> - switch to use skb array instead of a customized circular buffer
> - add non-blocking support
> - rename .peek to .peek_len
> - drop lockless peeking since test show very minor improvement
> 
> Jason Wang (5):
>   ptr_ring: support zero length ring
>   skb_array: minor tweak
>   skb_array: add wrappers for resizing
>   net: introduce NETDEV_CHANGE_TX_QUEUE_LEN
>   tun: switch to use skb array for tx
> 
> Michael S. Tsirkin (1):
>   ptr_ring: support resizing multiple queues
> 
>  drivers/net/tun.c| 138 
> ---
>  drivers/vhost/net.c  |  16 -
>  include/linux/net.h  |   1 +
>  include/linux/netdevice.h|   1 +
>  include/linux/ptr_ring.h |  77 ++
>  include/linux/skb_array.h|  13 +++-
>  net/core/net-sysfs.c |  15 -
>  tools/virtio/ringtest/ptr_ring.c |   5 ++
>  8 files changed, 243 insertions(+), 23 deletions(-)
> 
> -- 
> 2.7.4


Re: [PATCH net-next V3 5/6] net: introduce NETDEV_CHANGE_TX_QUEUE_LEN

2016-06-29 Thread Jason Wang



On 2016年06月30日 12:56, John Fastabend wrote:

On 16-06-29 08:52 PM, Jason Wang wrote:

This patch introduces a new event - NETDEV_CHANGE_TX_QUEUE_LEN, this
will be triggered when tx_queue_len. It could be used by net device
who want to do some processing at that time. An example is tun who may
want to resize tx array when tx_queue_len is changed.

Signed-off-by: Jason Wang 
---
  include/linux/netdevice.h |  1 +
  net/core/net-sysfs.c  | 15 ++-
  2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e84d9d2..7dc2ec7 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2237,6 +2237,7 @@ struct netdev_lag_lower_state_info {
  #define NETDEV_PRECHANGEUPPER 0x001A
  #define NETDEV_CHANGELOWERSTATE   0x001B
  #define NETDEV_UDP_TUNNEL_PUSH_INFO   0x001C
+#define NETDEV_CHANGE_TX_QUEUE_LEN 0x001E
  
  int register_netdevice_notifier(struct notifier_block *nb);

  int unregister_netdevice_notifier(struct notifier_block *nb);
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 7a0b616..6e4f347 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -322,7 +322,20 @@ NETDEVICE_SHOW_RW(flags, fmt_hex);
  
  static int change_tx_queue_len(struct net_device *dev, unsigned long new_len)

  {
-   dev->tx_queue_len = new_len;
+   int res, orig_len = dev->tx_queue_len;
+
+   if (new_len != orig_len) {
+   dev->tx_queue_len = new_len;
+   res = call_netdevice_notifiers(NETDEV_CHANGE_TX_QUEUE_LEN, dev);
+   res = notifier_to_errno(res);
+   if (res) {
+   netdev_err(dev,
+  "refused to change device tx_queue_len\n");
+   dev->tx_queue_len = orig_len;
+   return -EFAULT;
+   }
+   }
+
return 0;
  }
  


Acked-by: John Fastabend 

Great timing I was just looking into this because I need it for the
qdisc side.

It looks like this covers the sysfs change but the tx_queue_len can
also be changed via rtnetlink as well. So we need another patch for
that path right?

 if (tb[IFLA_TXQLEN]) {
 unsigned long value = nla_get_u32(tb[IFLA_TXQLEN]);

 if (dev->tx_queue_len ^ value)
 status |= DO_SETLINK_NOTIFY;

 dev->tx_queue_len = value;
 }

Thanks,
John



Right, will do this in next version.

Thanks


Re: [PATCH V2 4/4] net-next: mediatek: add support for IRQ grouping

2016-06-29 Thread kbuild test robot
Hi,

[auto build test ERROR on net/master]
[also build test ERROR on next-20160629]
[cannot apply to net-next/master v4.7-rc5]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/John-Crispin/net-next-mediatek-IRQ-cleanups-fixes-and-grouping/20160629-194341
config: arm-allmodconfig (attached as .config)
compiler: arm-linux-gnueabi-gcc (Debian 5.3.1-8) 5.3.1 20160205
reproduce:
wget 
https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross
 -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
make.cross ARCH=arm 

All errors (new ones prefixed by >>):

   drivers/net/ethernet/mediatek/mtk_eth_soc.c: In function 
'mtk_poll_controller':
>> drivers/net/ethernet/mediatek/mtk_eth_soc.c:1300:2: error: implicit 
>> declaration of function 'mtk_handle_irq' 
>> [-Werror=implicit-function-declaration]
 mtk_handle_irq(dev->irq[0], dev);
 ^
>> drivers/net/ethernet/mediatek/mtk_eth_soc.c:1300:25: error: subscripted 
>> value is neither array nor pointer nor vector
 mtk_handle_irq(dev->irq[0], dev);
^
   cc1: some warnings being treated as errors

vim +/mtk_handle_irq +1300 drivers/net/ethernet/mediatek/mtk_eth_soc.c

  1294  {
  1295  struct mtk_mac *mac = netdev_priv(dev);
  1296  struct mtk_eth *eth = mac->hw;
  1297  u32 int_mask = MTK_TX_DONE_INT | MTK_RX_DONE_INT;
  1298  
  1299  mtk_irq_disable(eth, int_mask);
> 1300  mtk_handle_irq(dev->irq[0], dev);
  1301  mtk_irq_enable(eth, int_mask);
  1302  }
  1303  #endif

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: Binary data


Re: [PATCH net-next V3 5/6] net: introduce NETDEV_CHANGE_TX_QUEUE_LEN

2016-06-29 Thread John Fastabend
On 16-06-29 08:52 PM, Jason Wang wrote:
> This patch introduces a new event - NETDEV_CHANGE_TX_QUEUE_LEN, this
> will be triggered when tx_queue_len. It could be used by net device
> who want to do some processing at that time. An example is tun who may
> want to resize tx array when tx_queue_len is changed.
> 
> Signed-off-by: Jason Wang 
> ---
>  include/linux/netdevice.h |  1 +
>  net/core/net-sysfs.c  | 15 ++-
>  2 files changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index e84d9d2..7dc2ec7 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -2237,6 +2237,7 @@ struct netdev_lag_lower_state_info {
>  #define NETDEV_PRECHANGEUPPER0x001A
>  #define NETDEV_CHANGELOWERSTATE  0x001B
>  #define NETDEV_UDP_TUNNEL_PUSH_INFO  0x001C
> +#define NETDEV_CHANGE_TX_QUEUE_LEN   0x001E
>  
>  int register_netdevice_notifier(struct notifier_block *nb);
>  int unregister_netdevice_notifier(struct notifier_block *nb);
> diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> index 7a0b616..6e4f347 100644
> --- a/net/core/net-sysfs.c
> +++ b/net/core/net-sysfs.c
> @@ -322,7 +322,20 @@ NETDEVICE_SHOW_RW(flags, fmt_hex);
>  
>  static int change_tx_queue_len(struct net_device *dev, unsigned long new_len)
>  {
> - dev->tx_queue_len = new_len;
> + int res, orig_len = dev->tx_queue_len;
> +
> + if (new_len != orig_len) {
> + dev->tx_queue_len = new_len;
> + res = call_netdevice_notifiers(NETDEV_CHANGE_TX_QUEUE_LEN, dev);
> + res = notifier_to_errno(res);
> + if (res) {
> + netdev_err(dev,
> +"refused to change device tx_queue_len\n");
> + dev->tx_queue_len = orig_len;
> + return -EFAULT;
> + }
> + }
> +
>   return 0;
>  }
>  
> 

Acked-by: John Fastabend 

Great timing I was just looking into this because I need it for the
qdisc side.

It looks like this covers the sysfs change but the tx_queue_len can
also be changed via rtnetlink as well. So we need another patch for
that path right?

if (tb[IFLA_TXQLEN]) {
unsigned long value = nla_get_u32(tb[IFLA_TXQLEN]);

if (dev->tx_queue_len ^ value)
status |= DO_SETLINK_NOTIFY;

dev->tx_queue_len = value;
}

Thanks,
John



[PATCH net-next V3 3/6] ptr_ring: support resizing multiple queues

2016-06-29 Thread Jason Wang
From: "Michael S. Tsirkin" 

Sometimes, we need support resizing multiple queues at once. This is
because it was not easy to recover to recover from a partial failure
of multiple queues resizing.

Signed-off-by: Michael S. Tsirkin 
Signed-off-by: Jason Wang 
---
 include/linux/ptr_ring.h | 71 +++-
 tools/virtio/ringtest/ptr_ring.c |  5 +++
 2 files changed, 67 insertions(+), 9 deletions(-)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index d78b8b8..2052011 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -349,20 +349,14 @@ static inline int ptr_ring_init(struct ptr_ring *r, int 
size, gfp_t gfp)
return 0;
 }
 
-static inline int ptr_ring_resize(struct ptr_ring *r, int size, gfp_t gfp,
- void (*destroy)(void *))
+static inline void **__ptr_ring_swap_queue(struct ptr_ring *r, void **queue,
+  int size, gfp_t gfp,
+  void (*destroy)(void *))
 {
-   unsigned long flags;
int producer = 0;
-   void **queue = __ptr_ring_init_queue_alloc(size, gfp);
void **old;
void *ptr;
 
-   if (!queue)
-   return -ENOMEM;
-
-   spin_lock_irqsave(&(r)->producer_lock, flags);
-
while ((ptr = ptr_ring_consume(r)))
if (producer < size)
queue[producer++] = ptr;
@@ -375,6 +369,23 @@ static inline int ptr_ring_resize(struct ptr_ring *r, int 
size, gfp_t gfp,
old = r->queue;
r->queue = queue;
 
+   return old;
+}
+
+static inline int ptr_ring_resize(struct ptr_ring *r, int size, gfp_t gfp,
+ void (*destroy)(void *))
+{
+   unsigned long flags;
+   void **queue = __ptr_ring_init_queue_alloc(size, gfp);
+   void **old;
+
+   if (!queue)
+   return -ENOMEM;
+
+   spin_lock_irqsave(&(r)->producer_lock, flags);
+
+   old = __ptr_ring_swap_queue(r, queue, size, gfp, destroy);
+
spin_unlock_irqrestore(&(r)->producer_lock, flags);
 
kfree(old);
@@ -382,6 +393,48 @@ static inline int ptr_ring_resize(struct ptr_ring *r, int 
size, gfp_t gfp,
return 0;
 }
 
+static inline int ptr_ring_resize_multiple(struct ptr_ring **rings, int nrings,
+  int size,
+  gfp_t gfp, void (*destroy)(void *))
+{
+   unsigned long flags;
+   void ***queues;
+   int i;
+
+   queues = kmalloc(nrings * sizeof *queues, gfp);
+   if (!queues)
+   goto noqueues;
+
+   for (i = 0; i < nrings; ++i) {
+   queues[i] = __ptr_ring_init_queue_alloc(size, gfp);
+   if (!queues[i])
+   goto nomem;
+   }
+
+   for (i = 0; i < nrings; ++i) {
+   spin_lock_irqsave(&(rings[i])->producer_lock, flags);
+   queues[i] = __ptr_ring_swap_queue(rings[i], queues[i],
+ size, gfp, destroy);
+   spin_unlock_irqrestore(&(rings[i])->producer_lock, flags);
+   }
+
+   for (i = 0; i < nrings; ++i)
+   kfree(queues[i]);
+
+   kfree(queues);
+
+   return 0;
+
+nomem:
+   while (--i >= 0)
+   kfree(queues[i]);
+
+   kfree(queues);
+
+noqueues:
+   return -ENOMEM;
+}
+
 static inline void ptr_ring_cleanup(struct ptr_ring *r, void (*destroy)(void 
*))
 {
void *ptr;
diff --git a/tools/virtio/ringtest/ptr_ring.c b/tools/virtio/ringtest/ptr_ring.c
index 74abd74..68e4f9f 100644
--- a/tools/virtio/ringtest/ptr_ring.c
+++ b/tools/virtio/ringtest/ptr_ring.c
@@ -17,6 +17,11 @@
 typedef pthread_spinlock_t  spinlock_t;
 
 typedef int gfp_t;
+static void *kmalloc(unsigned size, gfp_t gfp)
+{
+   return memalign(64, size);
+}
+
 static void *kzalloc(unsigned size, gfp_t gfp)
 {
void *p = memalign(64, size);
-- 
2.7.4



Re: [PATCH net-next 08/16] net/devlink: Add E-Switch mode control

2016-06-29 Thread John Fastabend
On 16-06-29 08:35 PM, John Fastabend wrote:
> On 16-06-29 03:09 PM, John Fastabend wrote:
>> On 16-06-29 02:33 PM, Or Gerlitz wrote:
>>> On Wed, Jun 29, 2016 at 7:35 PM, John Fastabend
>>>  wrote:
 On 16-06-29 07:48 AM, Or Gerlitz wrote:
> On 6/28/2016 10:31 PM, John Fastabend wrote:
>> On 16-06-28 12:12 PM, Jiri Pirko wrote:
>>>
>>> Why?! Please, leave legacy be legacy. Use the new mode for
>>> implementing new features. Don't make things any more complicated :(
>>>
>>> [...]
>> Maybe I'm reading to much into the devlink flag names and if instead
>> you use a switch like the following,
>>>
>>VF representer : enable/disable the creation VF netdev's to represent
>> the virtual functions on the PF
>>>
>> Much less complicated then magic switching between forwarding logic IMO
>> and you don't whack a default configuration that an entire stack (e.g.
>> libvirt) has been built to use.
>>>
> Re letting the user to observe/modify the rules added by the
> driver/firmware while legacy mode. Even if possible with bridge/fdb, it
> will be really pragmatical and doesn't make sense to get that donefor
> the TC subsystem. So this isn't a well defined solution and anyway, as
> you said, legacy mode enhancements is a different exercise. Personally,
> I agree with Jiri, that we should legacy be legacyand focus on adding
> the new model.
>>>
 The ixgbe driver already supports bridge and tc commands without the VF
 representer.  Adding the VF representer to these drivers just extends
 the existing support so we have an identifier for VFs and now the
 redirect action works and the fdb commands can specify the VF netdevs.
 I don't see this as a problem because we already do it today with
 'ip' and bridge tools.
>>>
>>> To be precise, for both ixgbe and mlx5, the existing tc support
>>> (u32/ixgbe, flower/mlx5) is not for switching functionality but rather
>>> for NIC-ish one, e.g drop, mark, etc. Indeed in ixgbe you added
>>> redirect to VF, but this is only for south --> north (wire --> VF)
>>> traffic, w.o the VF rep you can't do the other way around.
>>>
>>
>> Correct which is why we need the VF rep. So we are completely in
>> sync there.
>>
>>> Just to clarify, to what exact bridge command support did you refer for 
>>> ixgbe?
>>
>> 'bridge fdb' commands are supported today on the PF. But its the
>> same story as above we need the VF rep to also use it on the
>> VF representer
>>
>> Also 'bridge link' command for veb/vepa modes is supported and the
>> other link attributes could be supported with additional driver
>> support. No need for core changes here. But again yes only on the
>> PF so again we need the VF reps.
>>
>>>
>>> The forwarding done in the legacy mode is not well defined, and
>>> different across vendors, adding there the VF reps will not make it
>>> any better b/c some steering rules will be set by tc/bridge offloads
>>> while other rules will be put by the driver.
>>> I don't see how this takes us to better place.
>>
>> In legacy mode or any other mode you are defining some default policy
>> and rules.
>>
>> In the legacy mode we use mac/vlan assigned l2 forwarding entries in the
>> hardware fdb which are seen when you query 'ip link' and 'bridge fdb'
>> today. And similarly can be modified today using 'ip link' and 'bridge
>> fdb' at least on the intel devices. Its not undefined in any way with
>> a quick query of the tools we can learn exactly what the configuration
>> is and even change it. This works fairly well with existing controllers
>> and stacks.
>>
>> The limitations are 'ip' only supports a single MAC address per VF and
>> 'tc' doesn't work on VF ports because when the VF is assigned to a VM
>> or namespace we lose visibility of it. Providing a VF rep for this
>> solves both of those problems.
>>
>> In this new mode the default policy is to create a default miss rule
>> and implement no l2 forwarding rules. Unfortunately not all hardware
>> in use supports this default miss rule case but would still benefit
>> from having a VF rep. So we shouldn't make this a stipulation for
>> enabling VF reps. It also changes a default policy that has been in
>> place for years without IMO at least any compelling reason. It will
>> be easy enough to change the default l2 policy to a flow based model
>> with a few bridge/tc commands.
>>
>>>
 We are also slightly in disagreement about what the default should be
 with VF netdevs. I think the default should be the same L2 mac/vlan
 switch behavior and see no reason to change it by default just because
 we added VF netdevs. The infrastructure libvirt/openstack/etc are built
 around this default today. But I guess nothing in this series specifies
 what the defaults of any given driver will be. VF netdevs are still
 useful even on older hardware that only supports mac/vlan forwarding to

[PATCH net-next V3 2/6] skb_array: minor tweak

2016-06-29 Thread Jason Wang
Signed-off-by: Michael S. Tsirkin 
Signed-off-by: Jason Wang 
---
 include/linux/skb_array.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/skb_array.h b/include/linux/skb_array.h
index 678bfbf..2dd0d1e 100644
--- a/include/linux/skb_array.h
+++ b/include/linux/skb_array.h
@@ -151,12 +151,12 @@ static inline int skb_array_init(struct skb_array *a, int 
size, gfp_t gfp)
return ptr_ring_init(>ring, size, gfp);
 }
 
-void __skb_array_destroy_skb(void *ptr)
+static void __skb_array_destroy_skb(void *ptr)
 {
kfree_skb(ptr);
 }
 
-int skb_array_resize(struct skb_array *a, int size, gfp_t gfp)
+static inline int skb_array_resize(struct skb_array *a, int size, gfp_t gfp)
 {
return ptr_ring_resize(>ring, size, gfp, __skb_array_destroy_skb);
 }
-- 
2.7.4



[PATCH net-next V3 5/6] net: introduce NETDEV_CHANGE_TX_QUEUE_LEN

2016-06-29 Thread Jason Wang
This patch introduces a new event - NETDEV_CHANGE_TX_QUEUE_LEN, this
will be triggered when tx_queue_len. It could be used by net device
who want to do some processing at that time. An example is tun who may
want to resize tx array when tx_queue_len is changed.

Signed-off-by: Jason Wang 
---
 include/linux/netdevice.h |  1 +
 net/core/net-sysfs.c  | 15 ++-
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e84d9d2..7dc2ec7 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2237,6 +2237,7 @@ struct netdev_lag_lower_state_info {
 #define NETDEV_PRECHANGEUPPER  0x001A
 #define NETDEV_CHANGELOWERSTATE0x001B
 #define NETDEV_UDP_TUNNEL_PUSH_INFO0x001C
+#define NETDEV_CHANGE_TX_QUEUE_LEN 0x001E
 
 int register_netdevice_notifier(struct notifier_block *nb);
 int unregister_netdevice_notifier(struct notifier_block *nb);
diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index 7a0b616..6e4f347 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -322,7 +322,20 @@ NETDEVICE_SHOW_RW(flags, fmt_hex);
 
 static int change_tx_queue_len(struct net_device *dev, unsigned long new_len)
 {
-   dev->tx_queue_len = new_len;
+   int res, orig_len = dev->tx_queue_len;
+
+   if (new_len != orig_len) {
+   dev->tx_queue_len = new_len;
+   res = call_netdevice_notifiers(NETDEV_CHANGE_TX_QUEUE_LEN, dev);
+   res = notifier_to_errno(res);
+   if (res) {
+   netdev_err(dev,
+  "refused to change device tx_queue_len\n");
+   dev->tx_queue_len = orig_len;
+   return -EFAULT;
+   }
+   }
+
return 0;
 }
 
-- 
2.7.4



[PATCH net-next V3 1/6] ptr_ring: support zero length ring

2016-06-29 Thread Jason Wang
Sometimes, we need zero length ring. But current code will crash since
we don't do any check before accessing the ring. This patch fixes this.

Signed-off-by: Jason Wang 
---
 include/linux/ptr_ring.h | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index 562a65e..d78b8b8 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -102,7 +102,7 @@ static inline bool ptr_ring_full_bh(struct ptr_ring *r)
  */
 static inline int __ptr_ring_produce(struct ptr_ring *r, void *ptr)
 {
-   if (r->queue[r->producer])
+   if (unlikely(!r->size) || r->queue[r->producer])
return -ENOSPC;
 
r->queue[r->producer++] = ptr;
@@ -164,7 +164,9 @@ static inline int ptr_ring_produce_bh(struct ptr_ring *r, 
void *ptr)
  */
 static inline void *__ptr_ring_peek(struct ptr_ring *r)
 {
-   return r->queue[r->consumer];
+   if (likely(r->size))
+   return r->queue[r->consumer];
+   return NULL;
 }
 
 /* Note: callers invoking this in a loop must use a compiler barrier,
-- 
2.7.4



[PATCH net-next V3 4/6] skb_array: add wrappers for resizing

2016-06-29 Thread Jason Wang
Signed-off-by: Michael S. Tsirkin 
Signed-off-by: Jason Wang 
---
 include/linux/skb_array.h | 9 +
 1 file changed, 9 insertions(+)

diff --git a/include/linux/skb_array.h b/include/linux/skb_array.h
index 2dd0d1e..f4dfade 100644
--- a/include/linux/skb_array.h
+++ b/include/linux/skb_array.h
@@ -161,6 +161,15 @@ static inline int skb_array_resize(struct skb_array *a, 
int size, gfp_t gfp)
return ptr_ring_resize(>ring, size, gfp, __skb_array_destroy_skb);
 }
 
+static inline int skb_array_resize_multiple(struct skb_array **rings,
+   int nrings, int size, gfp_t gfp)
+{
+   BUILD_BUG_ON(offsetof(struct skb_array, ring));
+   return ptr_ring_resize_multiple((struct ptr_ring **)rings,
+   nrings, size, gfp,
+   __skb_array_destroy_skb);
+}
+
 static inline void skb_array_cleanup(struct skb_array *a)
 {
ptr_ring_cleanup(>ring, __skb_array_destroy_skb);
-- 
2.7.4



[PATCH net-next V3 0/6] switch to use tx skb array in tun

2016-06-29 Thread Jason Wang
Hi all:

This series tries to switch to use skb array in tun. This is used to
eliminate the spinlock contention between producer and consumer. The
conversion was straightforward: just introdce a tx skb array and use
it instead of sk_receive_queue.

A minor issue is to keep the tx_queue_len behaviour, since tun used to
use it for the length of sk_receive_queue. This is done through:

- add the ability to resize multiple rings at once to avoid handling
  partial resize failure for mutiple rings.
- add the support for zero length ring.
- introduce a notifier which was triggered when tx_queue_len was
  changed for a netdev.
- resize all queues during the tx_queue_len changing.

Tests shows about 15% improvement on guest rx pps:

Before: ~130pps
After : ~150pps

Changes from V2:
- add multiple rings resizing support for ptr_ring/skb_array
- add zero length ring support
- introdce a NETDEV_CHANGE_TX_QUEUE_LEN
- drop new flags

Changes from V1:
- switch to use skb array instead of a customized circular buffer
- add non-blocking support
- rename .peek to .peek_len
- drop lockless peeking since test show very minor improvement

Jason Wang (5):
  ptr_ring: support zero length ring
  skb_array: minor tweak
  skb_array: add wrappers for resizing
  net: introduce NETDEV_CHANGE_TX_QUEUE_LEN
  tun: switch to use skb array for tx

Michael S. Tsirkin (1):
  ptr_ring: support resizing multiple queues

 drivers/net/tun.c| 138 ---
 drivers/vhost/net.c  |  16 -
 include/linux/net.h  |   1 +
 include/linux/netdevice.h|   1 +
 include/linux/ptr_ring.h |  77 ++
 include/linux/skb_array.h|  13 +++-
 net/core/net-sysfs.c |  15 -
 tools/virtio/ringtest/ptr_ring.c |   5 ++
 8 files changed, 243 insertions(+), 23 deletions(-)

-- 
2.7.4



[PATCH net-next V3 6/6] tun: switch to use skb array for tx

2016-06-29 Thread Jason Wang
We used to queue tx packets in sk_receive_queue, this is less
efficient since it requires spinlocks to synchronize between producer
and consumer.

This patch tries to address this by:

- switch from sk_receive_queue to a skb_array, and resize it when
  tx_queue_len was changed.
- introduce a new proto_ops peek_len which was used for peeking the
  skb length.
- implement a tun version of peek_len for vhost_net to use and convert
  vhost_net to use peek_len if possible.

Pktgen test shows about 15.3% improvement on guest receiving pps for small
buffers:

Before: ~130pps
After : ~150pps

Signed-off-by: Jason Wang 
---
 drivers/net/tun.c   | 138 +---
 drivers/vhost/net.c |  16 +-
 include/linux/net.h |   1 +
 3 files changed, 146 insertions(+), 9 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 4884802..3be69ea 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -71,6 +71,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -167,6 +168,7 @@ struct tun_file {
};
struct list_head next;
struct tun_struct *detached;
+   struct skb_array tx_array;
 };
 
 struct tun_flow_entry {
@@ -515,7 +517,11 @@ static struct tun_struct *tun_enable_queue(struct tun_file 
*tfile)
 
 static void tun_queue_purge(struct tun_file *tfile)
 {
-   skb_queue_purge(>sk.sk_receive_queue);
+   struct sk_buff *skb;
+
+   while ((skb = skb_array_consume(>tx_array)) != NULL)
+   kfree_skb(skb);
+
skb_queue_purge(>sk.sk_error_queue);
 }
 
@@ -560,6 +566,8 @@ static void __tun_detach(struct tun_file *tfile, bool clean)
tun->dev->reg_state == NETREG_REGISTERED)
unregister_netdevice(tun->dev);
}
+   if (tun)
+   skb_array_cleanup(>tx_array);
sock_put(>sk);
}
 }
@@ -613,6 +621,7 @@ static void tun_detach_all(struct net_device *dev)
 static int tun_attach(struct tun_struct *tun, struct file *file, bool 
skip_filter)
 {
struct tun_file *tfile = file->private_data;
+   struct net_device *dev = tun->dev;
int err;
 
err = security_tun_dev_attach(tfile->socket.sk, tun->security);
@@ -642,6 +651,13 @@ static int tun_attach(struct tun_struct *tun, struct file 
*file, bool skip_filte
if (!err)
goto out;
}
+
+   if (!tfile->detached &&
+   skb_array_init(>tx_array, dev->tx_queue_len, GFP_KERNEL)) {
+   err = -ENOMEM;
+   goto out;
+   }
+
tfile->queue_index = tun->numqueues;
tfile->socket.sk->sk_shutdown &= ~RCV_SHUTDOWN;
rcu_assign_pointer(tfile->tun, tun);
@@ -891,8 +907,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct 
net_device *dev)
 
nf_reset(skb);
 
-   /* Enqueue packet */
-   skb_queue_tail(>socket.sk->sk_receive_queue, skb);
+   if (skb_array_produce(>tx_array, skb))
+   goto drop;
 
/* Notify and wake up reader process */
if (tfile->flags & TUN_FASYNC)
@@ -1107,7 +1123,7 @@ static unsigned int tun_chr_poll(struct file *file, 
poll_table *wait)
 
poll_wait(file, sk_sleep(sk), wait);
 
-   if (!skb_queue_empty(>sk_receive_queue))
+   if (!skb_array_empty(>tx_array))
mask |= POLLIN | POLLRDNORM;
 
if (sock_writeable(sk) ||
@@ -1426,22 +1442,61 @@ done:
return total;
 }
 
+static struct sk_buff *tun_ring_recv(struct tun_file *tfile, int noblock,
+int *err)
+{
+   DECLARE_WAITQUEUE(wait, current);
+   struct sk_buff *skb = NULL;
+
+   skb = skb_array_consume(>tx_array);
+   if (skb)
+   goto out;
+   if (noblock) {
+   *err = -EAGAIN;
+   goto out;
+   }
+
+   add_wait_queue(>wq.wait, );
+   current->state = TASK_INTERRUPTIBLE;
+
+   while (1) {
+   skb = skb_array_consume(>tx_array);
+   if (skb)
+   break;
+   if (signal_pending(current)) {
+   *err = -ERESTARTSYS;
+   break;
+   }
+   if (tfile->socket.sk->sk_shutdown & RCV_SHUTDOWN) {
+   *err = -EFAULT;
+   break;
+   }
+
+   schedule();
+   };
+
+   current->state = TASK_RUNNING;
+   remove_wait_queue(>wq.wait, );
+
+out:
+   return skb;
+}
+
 static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
   struct iov_iter *to,
   int noblock)
 {
struct sk_buff *skb;
ssize_t ret;
-   int peeked, err, off = 0;
+   int err;
 
tun_debug(KERN_INFO, tun, "tun_do_read\n");
 
if (!iov_iter_count(to))
return 0;
 
-   /* Read 

Re: [PATCH v2 net-next] tcp: md5: use kmalloc() backed scratch areas

2016-06-29 Thread Herbert Xu
On Wed, Jun 29, 2016 at 03:39:37PM -0700, Andy Lutomirski wrote:
>
> I don't care about TCP MD5 performance at all.  Ease of maintenance is
> nice, though, and maybe there are other places in the kernel where
> performance does matter.

TCP MD5 is using ahash because it touches SG lists.  Touching
SG lists is a pretty reliable indicator for hashing a large amount
of data.  In fact TCP MD5 hashes the entire packet content so we're
talking about ~1000 bytes, just like IPsec.

Therefore it is completely pointless to use something like shash
for it as shash is only meant to be used for those hashing less
one block of data or less.

If you're aware of any other user in the kernel that is using
ahash and is hashing a small amount of data in aggregate (not
per update) please let me know.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


Re: [PATCH net-next 08/16] net/devlink: Add E-Switch mode control

2016-06-29 Thread John Fastabend
On 16-06-29 03:09 PM, John Fastabend wrote:
> On 16-06-29 02:33 PM, Or Gerlitz wrote:
>> On Wed, Jun 29, 2016 at 7:35 PM, John Fastabend
>>  wrote:
>>> On 16-06-29 07:48 AM, Or Gerlitz wrote:
 On 6/28/2016 10:31 PM, John Fastabend wrote:
> On 16-06-28 12:12 PM, Jiri Pirko wrote:
>>
>> Why?! Please, leave legacy be legacy. Use the new mode for
>> implementing new features. Don't make things any more complicated :(
>>
>> [...]
> Maybe I'm reading to much into the devlink flag names and if instead
> you use a switch like the following,
>>
>VF representer : enable/disable the creation VF netdev's to represent
> the virtual functions on the PF
>>
> Much less complicated then magic switching between forwarding logic IMO
> and you don't whack a default configuration that an entire stack (e.g.
> libvirt) has been built to use.
>>
 Re letting the user to observe/modify the rules added by the
 driver/firmware while legacy mode. Even if possible with bridge/fdb, it
 will be really pragmatical and doesn't make sense to get that donefor
 the TC subsystem. So this isn't a well defined solution and anyway, as
 you said, legacy mode enhancements is a different exercise. Personally,
 I agree with Jiri, that we should legacy be legacyand focus on adding
 the new model.
>>
>>> The ixgbe driver already supports bridge and tc commands without the VF
>>> representer.  Adding the VF representer to these drivers just extends
>>> the existing support so we have an identifier for VFs and now the
>>> redirect action works and the fdb commands can specify the VF netdevs.
>>> I don't see this as a problem because we already do it today with
>>> 'ip' and bridge tools.
>>
>> To be precise, for both ixgbe and mlx5, the existing tc support
>> (u32/ixgbe, flower/mlx5) is not for switching functionality but rather
>> for NIC-ish one, e.g drop, mark, etc. Indeed in ixgbe you added
>> redirect to VF, but this is only for south --> north (wire --> VF)
>> traffic, w.o the VF rep you can't do the other way around.
>>
> 
> Correct which is why we need the VF rep. So we are completely in
> sync there.
> 
>> Just to clarify, to what exact bridge command support did you refer for 
>> ixgbe?
> 
> 'bridge fdb' commands are supported today on the PF. But its the
> same story as above we need the VF rep to also use it on the
> VF representer
> 
> Also 'bridge link' command for veb/vepa modes is supported and the
> other link attributes could be supported with additional driver
> support. No need for core changes here. But again yes only on the
> PF so again we need the VF reps.
> 
>>
>> The forwarding done in the legacy mode is not well defined, and
>> different across vendors, adding there the VF reps will not make it
>> any better b/c some steering rules will be set by tc/bridge offloads
>> while other rules will be put by the driver.
>> I don't see how this takes us to better place.
> 
> In legacy mode or any other mode you are defining some default policy
> and rules.
> 
> In the legacy mode we use mac/vlan assigned l2 forwarding entries in the
> hardware fdb which are seen when you query 'ip link' and 'bridge fdb'
> today. And similarly can be modified today using 'ip link' and 'bridge
> fdb' at least on the intel devices. Its not undefined in any way with
> a quick query of the tools we can learn exactly what the configuration
> is and even change it. This works fairly well with existing controllers
> and stacks.
> 
> The limitations are 'ip' only supports a single MAC address per VF and
> 'tc' doesn't work on VF ports because when the VF is assigned to a VM
> or namespace we lose visibility of it. Providing a VF rep for this
> solves both of those problems.
> 
> In this new mode the default policy is to create a default miss rule
> and implement no l2 forwarding rules. Unfortunately not all hardware
> in use supports this default miss rule case but would still benefit
> from having a VF rep. So we shouldn't make this a stipulation for
> enabling VF reps. It also changes a default policy that has been in
> place for years without IMO at least any compelling reason. It will
> be easy enough to change the default l2 policy to a flow based model
> with a few bridge/tc commands.
> 
>>
>>> We are also slightly in disagreement about what the default should be
>>> with VF netdevs. I think the default should be the same L2 mac/vlan
>>> switch behavior and see no reason to change it by default just because
>>> we added VF netdevs. The infrastructure libvirt/openstack/etc are built
>>> around this default today. But I guess nothing in this series specifies
>>> what the defaults of any given driver will be. VF netdevs are still
>>> useful even on older hardware that only supports mac/vlan forwarding to
>>> expose statistics and send/receive control frames such as lldp.
>>
>> Again, this is not about default engineering... and using the 

Re: [PATCH net-next V2] tun: introduce tx skb ring

2016-06-29 Thread Jason Wang



On 2016年06月28日 15:09, Michael S. Tsirkin wrote:

On Thu, Jun 23, 2016 at 01:14:07PM +0800, Jason Wang wrote:


On 2016年06月23日 02:18, Michael S. Tsirkin wrote:

On Fri, Jun 17, 2016 at 03:41:20AM +0300, Michael S. Tsirkin wrote:

Would it help to have ptr_ring_resize that gets an array of
rings and resizes them both to same length?

OK, here it is. Untested so far, and no skb wrapper.
Pls let me know whether this is what you had in mind.

Exactly what I want.

Thanks

Ok and this for skb_array

-->
skb_array: add wrappers for resizing

Signed-off-by: Michael S. Tsirkin 

--

diff --git a/include/linux/skb_array.h b/include/linux/skb_array.h
index c900708..7e01c1f 100644
--- a/include/linux/skb_array.h
+++ b/include/linux/skb_array.h
@@ -151,16 +151,24 @@ static inline int skb_array_init(struct skb_array *a, int 
size, gfp_t gfp)
return ptr_ring_init(>ring, size, 0, gfp);
  }
  
-void __skb_array_destroy_skb(void *ptr)

+static void __skb_array_destroy_skb(void *ptr)
  {
kfree_skb(ptr);
  }
  
-int skb_array_resize(struct skb_array *a, int size, gfp_t gfp)

+static inline int skb_array_resize(struct skb_array *a, int size, gfp_t gfp)
  {
return ptr_ring_resize(>ring, size, gfp, __skb_array_destroy_skb);
  }


Will split up the above tweak into another patch when reposting.

  
+static inline int skb_raay_resize_multiple(struct skb_array **rings, int nrings,


I think you mean 'skb_array_resize' here.


+  int size, gfp_t gfp)
+{
+   BUILD_BUG_ON(offsetof(struct skb_array, ring));
+   ptr_ring_resize_multiple((struct ptr_ring **)rings, nrings, size, gfp,
+__skb_array_destroy_skb);


This should be return ptr_ring_resize_multiple(...


+}
+
  static inline void skb_array_cleanup(struct skb_array *a)
  {
ptr_ring_cleanup(>ring, __skb_array_destroy_skb);


With this, looks like there's no need for a new flag. Will repost the 
series with those two patches.


Thanks


Re: [PATCH net-next 9/9] net: hns: get reset registers from DT

2016-06-29 Thread Yankejian (Hackim Yim)
On 2016/6/29 17:11, David Miller wrote:
> From: Yisen Zhuang 
> Date: Mon, 27 Jun 2016 17:54:15 +0800
>
>> @@ -361,9 +371,10 @@ static int hns_mdio_reset(struct mii_bus *bus)
>>  return -ENODEV;
>>  }
>>  
>> +sc_reg = _dev->sc_reg;
>>  /* 1. reset req, and read reset st check */
>> -ret = mdio_sc_cfg_reg_write(mdio_dev, MDIO_SC_RESET_REQ, 0x1,
>> -MDIO_SC_RESET_ST, 0x1,
>> +ret = mdio_sc_cfg_reg_write(mdio_dev, sc_reg->mdio_reset_req, 0x1,
>> +sc_reg->mdio_reset_st, 0x1,
>>  MDIO_CHECK_SET_ST);
>>  if (ret) {
>>  dev_err(>dev, "MDIO reset fail\n");
> What in the world are you doing to the indentation here?
>
> Please read your patches before actually sending them, such things
> will be quite obvious by simple visual inspection.
>
> .
>
Hi David,
i am sorry for my carelessness. i will pay more attention next time.
Thanks for pointing it our

-- 
MBR,
Yankejian (Hackim Yim)




linux-next: manual merge of the net-next tree with Linus' tree

2016-06-29 Thread Stephen Rothwell
Hi all,

Today's linux-next merge of the net-next tree got a conflict in:

  drivers/net/phy/fixed_phy.c

between commit:

  69fc58a57e56 ("net: phy: Manage fixed PHY address space using IDA")

from Linus' tree and commit:

  bf7afb29d545 ("phy: improve safety of fixed-phy MII register reading")

from the net-next tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc drivers/net/phy/fixed_phy.c
index 9ec7f7353434,b376ada83598..
--- a/drivers/net/phy/fixed_phy.c
+++ b/drivers/net/phy/fixed_phy.c
@@@ -23,9 -23,9 +23,10 @@@
  #include 
  #include 
  #include 
+ #include 
 +#include 
  
- #define MII_REGS_NUM 29
+ #include "swphy.h"
  
  struct fixed_mdio_bus {
struct mii_bus *mii_bus;


[net-next PATCH 0/5] HFSC patches, part 1

2016-06-29 Thread Michal Soltys
Hi,

It's revised version of part of the patches I submitted really, really long
time ago (back then I asked Patrick to ignore them as I found some issues
shortly after submitting).

Anyway this is the first set with very simple fixes/changes though some of them
relatively subtle (I tried to do very exhaustive commit messages explaining what
and why with those).

The patches are against net-next tree.

The second set will be heavier - or rather with more complex explanations, 
among those I have:

- a fix to subtle issue introduced in
  http://permalink.gmane.org/gmane.linux.kernel.commits.2-4/8281
  along with simplifying related stuff
- update times to 96 bits (which allows to "just" use 32 bit shifts and
  improves curve definition accuracy at more extreme low/high speeds)
- add curve "merging" instead of just selecting in convex case (computations
  mirror those from concave intersection)

But these are eventually for later.

Michal Soltys (5):
  net/sched/sch_hfsc.c: handle corner cases where head may change invalidating 
calculated deadline
  net/sched/sch_hfsc.c: add unlikely() in qdisc_peek_len()
  net/sched/sch_hfsc.c: remove leftover dlist and droplist
  net/sched/sch_hfsc.c: go passive after vt update
  net/sched/sch_hfsc.c: anchor virtual curve at proper vt in hfsc_change_fsc()

 net/sched/sch_hfsc.c | 54 ++--
 1 file changed, 27 insertions(+), 27 deletions(-)

-- 
2.1.3



[net-next PATCH 5/5] net/sched/sch_hfsc.c: anchor virtual curve at proper vt in hfsc_change_fsc()

2016-06-29 Thread Michal Soltys
cl->cl_vt alone is relative only to the current backlog period, while
the curve operates on cumulative virtual time. This patch adds missing
cl->cl_vtoff.

Signed-off-by: Michal Soltys 
---
 net/sched/sch_hfsc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sched/sch_hfsc.c b/net/sched/sch_hfsc.c
index 4eef857..dff92ea 100644
--- a/net/sched/sch_hfsc.c
+++ b/net/sched/sch_hfsc.c
@@ -940,7 +940,7 @@ static void
 hfsc_change_fsc(struct hfsc_class *cl, struct tc_service_curve *fsc)
 {
sc2isc(fsc, >cl_fsc);
-   rtsc_init(>cl_virtual, >cl_fsc, cl->cl_vt, cl->cl_total);
+   rtsc_init(>cl_virtual, >cl_fsc, cl->cl_vtoff + cl->cl_vt, 
cl->cl_total);
cl->cl_flags |= HFSC_FSC;
 }
 
-- 
2.1.3



[net-next PATCH 2/5] net/sched/sch_hfsc.c: add unlikely() in qdisc_peek_len()

2016-06-29 Thread Michal Soltys
The condition can only succeed on wrong configurations.

Signed-off-by: Michal Soltys 
---
 net/sched/sch_hfsc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sched/sch_hfsc.c b/net/sched/sch_hfsc.c
index 6d6df6b..e2244bb 100644
--- a/net/sched/sch_hfsc.c
+++ b/net/sched/sch_hfsc.c
@@ -882,7 +882,7 @@ qdisc_peek_len(struct Qdisc *sch)
unsigned int len;
 
skb = sch->ops->peek(sch);
-   if (skb == NULL) {
+   if (unlikely(skb == NULL)) {
qdisc_warn_nonwc("qdisc_peek_len", sch);
return 0;
}
-- 
2.1.3



[net-next PATCH 1/5] net/sched/sch_hfsc.c: handle corner cases where head may change invalidating calculated deadline

2016-06-29 Thread Michal Soltys
Realtime scheduling implemented in HFSC uses head of the queue to make
the decision about which packet to schedule next. But in case of any
head drop, the deadline calculated for the previous head is not
necessarily correct for the next head (unless both packets have the same
length).

Thanks to peek() function used during dequeue - which internally is a
dequeue operation - hfsc is almost safe from this issue, as peek()
dequeues and isolates the head storing it temporarily until the real
dequeue happens.

But there is one exception: if after the class activation a drop happens
before the first dequeue operation, there's never a chance to do the
peek().

Adding peek() call in enqueue - if this is the first packet in a new
backlog period AND the scheduler has realtime curve defined - fixes that
one corner case. The 1st hfsc_dequeue() will use that peeked packet,
similarly as every subsequent hfsc_dequeue() call uses packet peeked by
the previous call.

Signed-off-by: Michal Soltys 
---
 net/sched/sch_hfsc.c | 11 ++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/net/sched/sch_hfsc.c b/net/sched/sch_hfsc.c
index 8cb5eff..6d6df6b 100644
--- a/net/sched/sch_hfsc.c
+++ b/net/sched/sch_hfsc.c
@@ -1594,8 +1594,17 @@ hfsc_enqueue(struct sk_buff *skb, struct Qdisc *sch, 
struct sk_buff **to_free)
return err;
}
 
-   if (cl->qdisc->q.qlen == 1)
+   if (cl->qdisc->q.qlen == 1) {
set_active(cl, qdisc_pkt_len(skb));
+   /*
+* If this is the first packet, isolate the head so an eventual
+* head drop before the first dequeue operation has no chance
+* to invalidate the deadline.
+*/
+   if (cl->cl_flags & HFSC_RSC)
+   cl->qdisc->ops->peek(cl->qdisc);
+
+   }
 
qdisc_qstats_backlog_inc(sch, skb);
sch->q.qlen++;
-- 
2.1.3



[net-next PATCH 3/5] net/sched/sch_hfsc.c: remove leftover dlist and droplist

2016-06-29 Thread Michal Soltys
This is update to:
commit a09ceb0e08140a ("sched: remove qdisc->drop")

That commit removed qdisc->drop, but left alone dlist and droplist
that no longer serve any meaningful purpose.

Signed-off-by: Michal Soltys 
---
 net/sched/sch_hfsc.c | 8 
 1 file changed, 8 deletions(-)

diff --git a/net/sched/sch_hfsc.c b/net/sched/sch_hfsc.c
index e2244bb..df07f06 100644
--- a/net/sched/sch_hfsc.c
+++ b/net/sched/sch_hfsc.c
@@ -130,7 +130,6 @@ struct hfsc_class {
struct rb_node vt_node; /* parent's vt_tree member */
struct rb_root cf_tree; /* active children sorted by cl_f */
struct rb_node cf_node; /* parent's cf_heap member */
-   struct list_head dlist; /* drop list member */
 
u64 cl_total;   /* total work in bytes */
u64 cl_cumul;   /* cumulative work in bytes done by
@@ -177,8 +176,6 @@ struct hfsc_sched {
struct hfsc_class root; /* root class */
struct Qdisc_class_hash clhash; /* class hash */
struct rb_root eligible;/* eligible tree */
-   struct list_head droplist;  /* active leaf class list (for
-  dropping) */
struct qdisc_watchdog watchdog; /* watchdog timer */
 };
 
@@ -858,7 +855,6 @@ set_active(struct hfsc_class *cl, unsigned int len)
if (cl->cl_flags & HFSC_FSC)
init_vf(cl, len);
 
-   list_add_tail(>dlist, >sched->droplist);
 }
 
 static void
@@ -867,8 +863,6 @@ set_passive(struct hfsc_class *cl)
if (cl->cl_flags & HFSC_RSC)
eltree_remove(cl);
 
-   list_del(>dlist);
-
/*
 * vttree is now handled in update_vf() so that update_vf(cl, 0, 0)
 * needs to be called explicitly to remove a class from vttree.
@@ -1443,7 +1437,6 @@ hfsc_init_qdisc(struct Qdisc *sch, struct nlattr *opt)
if (err < 0)
return err;
q->eligible = RB_ROOT;
-   INIT_LIST_HEAD(>droplist);
 
q->root.cl_common.classid = sch->handle;
q->root.refcnt  = 1;
@@ -1527,7 +1520,6 @@ hfsc_reset_qdisc(struct Qdisc *sch)
hfsc_reset_class(cl);
}
q->eligible = RB_ROOT;
-   INIT_LIST_HEAD(>droplist);
qdisc_watchdog_cancel(>watchdog);
sch->qstats.backlog = 0;
sch->q.qlen = 0;
-- 
2.1.3



[net-next PATCH 4/5] net/sched/sch_hfsc.c: go passive after vt update

2016-06-29 Thread Michal Soltys
When a class is going passive, it should update its cl_vt first
to be consistent with the last dequeue operation.

Otherwise its cl_vt will be one packet behind and parent's cvtmax might
not be updated as well.

One possible side effect is if some class goes passive and subsequently
goes active /without/ its parent going passive - with cl_vt lagging one
packet behind - comparison made in init_vf() will be affected (same
period).

Signed-off-by: Michal Soltys 
---
 net/sched/sch_hfsc.c | 31 +++
 1 file changed, 15 insertions(+), 16 deletions(-)

diff --git a/net/sched/sch_hfsc.c b/net/sched/sch_hfsc.c
index df07f06..4eef857 100644
--- a/net/sched/sch_hfsc.c
+++ b/net/sched/sch_hfsc.c
@@ -778,6 +778,20 @@ update_vf(struct hfsc_class *cl, unsigned int len, u64 
cur_time)
else
go_passive = 0;
 
+   /* update vt */
+   cl->cl_vt = rtsc_y2x(>cl_virtual, cl->cl_total)
+   - cl->cl_vtoff + cl->cl_vtadj;
+
+   /*
+* if vt of the class is smaller than cvtmin,
+* the class was skipped in the past due to non-fit.
+* if so, we need to adjust vtadj.
+*/
+   if (cl->cl_vt < cl->cl_parent->cl_cvtmin) {
+   cl->cl_vtadj += cl->cl_parent->cl_cvtmin - cl->cl_vt;
+   cl->cl_vt = cl->cl_parent->cl_cvtmin;
+   }
+
if (go_passive) {
/* no more active child, going passive */
 
@@ -794,25 +808,10 @@ update_vf(struct hfsc_class *cl, unsigned int len, u64 
cur_time)
continue;
}
 
-   /*
-* update vt and f
-*/
-   cl->cl_vt = rtsc_y2x(>cl_virtual, cl->cl_total)
-   - cl->cl_vtoff + cl->cl_vtadj;
-
-   /*
-* if vt of the class is smaller than cvtmin,
-* the class was skipped in the past due to non-fit.
-* if so, we need to adjust vtadj.
-*/
-   if (cl->cl_vt < cl->cl_parent->cl_cvtmin) {
-   cl->cl_vtadj += cl->cl_parent->cl_cvtmin - cl->cl_vt;
-   cl->cl_vt = cl->cl_parent->cl_cvtmin;
-   }
-
/* update the vt tree */
vttree_update(cl);
 
+   /* update f */
if (cl->cl_flags & HFSC_USC) {
cl->cl_myf = cl->cl_myfadj + rtsc_y2x(>cl_ulimit,
  cl->cl_total);
-- 
2.1.3



Re: [iproute PATCH 1/2] ip-address: Support filtering by slave type, too

2016-06-29 Thread Stephen Hemminger
On Tue, 28 Jun 2016 15:07:16 +0200
Phil Sutter  wrote:

> +static int match_link_kind(struct rtattr **tb, char *kind, bool slave)

const char *kind ??

> +{
> + if (!tb[IFLA_LINKINFO])
> + return -1;
> +
> + return strcmp(parse_link_kind(tb[IFLA_LINKINFO], slave), kind);
> +}
> +


[PATCH 0/1] qlcnic: add wmb() call in transmit data path.

2016-06-29 Thread Sony Chacko
0001-qlcnic-add-wmb-call-in-transmit-data-path.patch

This patch adds a wmb() call in the Tx data path to ensure
writes are completed before hardware fetches updated Tx descriptors.

Please apply to net.

Thanks,
Sony



[PATCH 1/1] qlcnic: add wmb() call in transmit data path.

2016-06-29 Thread Sony Chacko
Call wmb() to ensure writes are complete before
hardware fetches updated Tx descriptors.

Signed-off-by: Sony Chacko 
---
 drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c |2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c 
b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c
index 607bb7d..87c642d 100644
--- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c
+++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c
@@ -772,6 +772,8 @@ netdev_tx_t qlcnic_xmit_frame(struct sk_buff *skb, struct 
net_device *netdev)
tx_ring->tx_stats.tx_bytes += skb->len;
tx_ring->tx_stats.xmit_called++;
 
+   /* Ensure writes are complete before HW fetches Tx descriptors */
+   wmb();
qlcnic_update_cmd_producer(tx_ring);
 
return NETDEV_TX_OK;
-- 
1.7.1



Re: [PATCH v2 net-next] tcp: md5: use kmalloc() backed scratch areas

2016-06-29 Thread Andy Lutomirski
On Wed, Jun 29, 2016 at 2:44 PM, Eric Dumazet  wrote:
> On Wed, 2016-06-29 at 09:41 -0700, Andy Lutomirski wrote:
>
>> Overall, it looks like there's overhead of something like 50ns for
>> each ahash invocation vs the shash equivalent.  It's not huge, but
>> it's there.  (This is cache-hot.  I bet it's considerably worse if
>> cache-cold, because ahash will require a lot more code cache lines as
>> well as the extra cache lines involved in the scatterlist and whatever
>> arch stuff is needed to map back and forth between virtual and
>> physical addresses.
>
> I am kind of mystified seeing someone caring about TCP MD5, other than
> just making sure it wont crash the host when it needs to be used ;)
>
> The real useful work would be to use a jump label so that we can avoid
> spending cycles for non TCP MD5 sessions, when a host never had to use
> any MD5 negotiation.
>
>
>

I don't care about TCP MD5 performance at all.  Ease of maintenance is
nice, though, and maybe there are other places in the kernel where
performance does matter.

--Andy


Re: [PATCH net-next 08/16] net/devlink: Add E-Switch mode control

2016-06-29 Thread John Fastabend
On 16-06-29 02:33 PM, Or Gerlitz wrote:
> On Wed, Jun 29, 2016 at 7:35 PM, John Fastabend
>  wrote:
>> On 16-06-29 07:48 AM, Or Gerlitz wrote:
>>> On 6/28/2016 10:31 PM, John Fastabend wrote:
 On 16-06-28 12:12 PM, Jiri Pirko wrote:
> 
> Why?! Please, leave legacy be legacy. Use the new mode for
> implementing new features. Don't make things any more complicated :(
> 
> [...]
 Maybe I'm reading to much into the devlink flag names and if instead
 you use a switch like the following,
> 
VF representer : enable/disable the creation VF netdev's to represent
 the virtual functions on the PF
> 
 Much less complicated then magic switching between forwarding logic IMO
 and you don't whack a default configuration that an entire stack (e.g.
 libvirt) has been built to use.
> 
>>> Re letting the user to observe/modify the rules added by the
>>> driver/firmware while legacy mode. Even if possible with bridge/fdb, it
>>> will be really pragmatical and doesn't make sense to get that donefor
>>> the TC subsystem. So this isn't a well defined solution and anyway, as
>>> you said, legacy mode enhancements is a different exercise. Personally,
>>> I agree with Jiri, that we should legacy be legacyand focus on adding
>>> the new model.
> 
>> The ixgbe driver already supports bridge and tc commands without the VF
>> representer.  Adding the VF representer to these drivers just extends
>> the existing support so we have an identifier for VFs and now the
>> redirect action works and the fdb commands can specify the VF netdevs.
>> I don't see this as a problem because we already do it today with
>> 'ip' and bridge tools.
> 
> To be precise, for both ixgbe and mlx5, the existing tc support
> (u32/ixgbe, flower/mlx5) is not for switching functionality but rather
> for NIC-ish one, e.g drop, mark, etc. Indeed in ixgbe you added
> redirect to VF, but this is only for south --> north (wire --> VF)
> traffic, w.o the VF rep you can't do the other way around.
> 

Correct which is why we need the VF rep. So we are completely in
sync there.

> Just to clarify, to what exact bridge command support did you refer for ixgbe?

'bridge fdb' commands are supported today on the PF. But its the
same story as above we need the VF rep to also use it on the
VF representer

Also 'bridge link' command for veb/vepa modes is supported and the
other link attributes could be supported with additional driver
support. No need for core changes here. But again yes only on the
PF so again we need the VF reps.

> 
> The forwarding done in the legacy mode is not well defined, and
> different across vendors, adding there the VF reps will not make it
> any better b/c some steering rules will be set by tc/bridge offloads
> while other rules will be put by the driver.
> I don't see how this takes us to better place.

In legacy mode or any other mode you are defining some default policy
and rules.

In the legacy mode we use mac/vlan assigned l2 forwarding entries in the
hardware fdb which are seen when you query 'ip link' and 'bridge fdb'
today. And similarly can be modified today using 'ip link' and 'bridge
fdb' at least on the intel devices. Its not undefined in any way with
a quick query of the tools we can learn exactly what the configuration
is and even change it. This works fairly well with existing controllers
and stacks.

The limitations are 'ip' only supports a single MAC address per VF and
'tc' doesn't work on VF ports because when the VF is assigned to a VM
or namespace we lose visibility of it. Providing a VF rep for this
solves both of those problems.

In this new mode the default policy is to create a default miss rule
and implement no l2 forwarding rules. Unfortunately not all hardware
in use supports this default miss rule case but would still benefit
from having a VF rep. So we shouldn't make this a stipulation for
enabling VF reps. It also changes a default policy that has been in
place for years without IMO at least any compelling reason. It will
be easy enough to change the default l2 policy to a flow based model
with a few bridge/tc commands.

> 
>> We are also slightly in disagreement about what the default should be
>> with VF netdevs. I think the default should be the same L2 mac/vlan
>> switch behavior and see no reason to change it by default just because
>> we added VF netdevs. The infrastructure libvirt/openstack/etc are built
>> around this default today. But I guess nothing in this series specifies
>> what the defaults of any given driver will be. VF netdevs are still
>> useful even on older hardware that only supports mac/vlan forwarding to
>> expose statistics and send/receive control frames such as lldp.
> 
> Again, this is not about default engineering... and using the VF reps
> (not VF netdevs) in legacy mode only make it more cryptic to my
> opinion. I agree some changes would be needed in openstack to support
> the new model, but this is how 

Re: [PATCH v2 net-next] tcp: md5: use kmalloc() backed scratch areas

2016-06-29 Thread Eric Dumazet
On Wed, 2016-06-29 at 09:41 -0700, Andy Lutomirski wrote:

> Overall, it looks like there's overhead of something like 50ns for
> each ahash invocation vs the shash equivalent.  It's not huge, but
> it's there.  (This is cache-hot.  I bet it's considerably worse if
> cache-cold, because ahash will require a lot more code cache lines as
> well as the extra cache lines involved in the scatterlist and whatever
> arch stuff is needed to map back and forth between virtual and
> physical addresses.

I am kind of mystified seeing someone caring about TCP MD5, other than
just making sure it wont crash the host when it needs to be used ;)

The real useful work would be to use a jump label so that we can avoid
spending cycles for non TCP MD5 sessions, when a host never had to use
any MD5 negotiation.





Re: [net] e1000e: keep VLAN interfaces functional after rxvlan off

2016-06-29 Thread Jarod Wilson

David Miller wrote:

From: Jeff Kirsher
Date: Tue, 28 Jun 2016 20:41:31 -0700


From: Jarod Wilson

I've got a bug report about an e1000e interface, where a VLAN interface is
set up on top of it:

$ ip link add link ens1f0 name ens1f0.99 type vlan id 99
$ ip link set ens1f0 up
$ ip link set ens1f0.99 up
$ ip addr add 192.168.99.92 dev ens1f0.99

At this point, I can ping another host on vlan 99, ip 192.168.99.91.
However, if I do the following:

$ ethtool -K ens1f0 rxvlan off

Then no traffic passes on ens1f0.99. It comes back if I toggle rxvlan on
again. I'm not sure if this is actually intended behavior, or if there's a
lack of software VLAN stripping fallback, or what, but things continue to
work if I simply don't call e1000e_vlan_strip_disable() if there are
active VLANs (plagiarizing a function from the e1000 driver here) on the
interface.

Also slipped a related-ish fix to the kerneldoc text for
e1000e_vlan_strip_disable here...

Signed-off-by: Jarod Wilson
Tested-by: Aaron Brown
Signed-off-by: Jeff Kirsher


Applied, thanks.


Hm.. There was actually a v2 patch that followed this one that fixed the 
problem slightly differently and slightly better, I think.


http://marc.info/?l=intel-wired-lan=146551652424417=2

--
Jarod Wilson
ja...@redhat.com




Re: [PATCH] ipv4: Fix ip_skb_dst_mtu to use the sk passed by ip_finish_output

2016-06-29 Thread Hannes Frederic Sowa
On Wed, Jun 29, 2016, at 20:47, Shmulik Ladkani wrote:
> ip_skb_dst_mtu uses skb->sk, assuming it is an AF_INET socket (e.g. it
> calls ip_sk_use_pmtu which casts sk as an inet_sk).
> 
> However, in the case of UDP tunneling, the skb->sk is not necessarily an
> inet socket (could be AF_PACKET socket, or AF_UNSPEC if arriving from
> tun/tap).
> 
> OTOH, the sk passed as an argument throughout IP stack's output path is
> the one which is of PMTU interest:
>  - In case of local sockets, sk is same as skb->sk;
>  - In case of a udp tunnel, sk is the tunneling socket.
> 
> Fix, by passing ip_finish_output's sk to ip_skb_dst_mtu.
> This augments 7026b1ddb6 'netfilter: Pass socket pointer down through
> okfn().'
> 
> Signed-off-by: Shmulik Ladkani 
> ---
> 
> Discovered while debugging other issue in adjacent code area.
> 
> My setup seems to be handling this well; However I'd appreciate a
> Tested-by or Reviewed-by as one can only cover relatively few of the
> possible code flows leading to ip_skb_dst_mtu.

Well spotted.

It all looks good, the socket's pmtudisc field is initialized
accordingly and I suspect people want to have vxlan to use the path mtu
when transmitting packets, even during forwarding.

Maybe this can be tweaked with an additional patch, if people want to
change this behavior.

But as your patch doesn't change any of this,

Reviewed-by: Hannes Frederic Sowa 

Thanks!


[net-next 06/15] ixgbe: Error handler for duplicate filter locations in hardware for cls_u32 offloads

2016-06-29 Thread Jeff Kirsher
From: Amritha Nambiar 

For u32 classifier filters, avoid overwriting existing filter
in a hardware location without removing it first, to clean up
inconsistencies due to duplicate values for filter location.

Verified with the following filters:

Create child hash tables:
handle 1: u32 divisor 1
handle 2: u32 divisor 1

Link to the child hash table from parent hash table:
handle 800:0:11 u32 ht 800: link 1: \
offset at 0 mask 0f00 shift 6 plus 0 eat \
match ip protocol 6 ff match ip dst 15.0.0.1/32

handle 800:0:12 u32 ht 800: link 2: \
offset at 0 mask 0f00 shift 6 plus 0 eat \
match ip protocol 17 ff match ip dst 16.0.0.1/32

Add filter into child hash table:
handle 1:0:3 u32 ht 1: \
match tcp src 22  action drop

Add another filter to the same location:
handle 2:0:3 u32 ht 2: \
match tcp src 33  action drop

Signed-off-by: Amritha Nambiar 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 41 ---
 1 file changed, 25 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 75e6855..fd5a761 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -8315,8 +8315,11 @@ static int ixgbe_delete_clsu32(struct ixgbe_adapter 
*adapter,
/* Clear this filter in the link data it is associated with */
if (uhtid != 0x800) {
jump = adapter->jump_tables[uhtid];
-   if (jump)
-   clear_bit(loc - 1, jump->child_loc_map);
+   if (!jump)
+   return -EINVAL;
+   if (!test_bit(loc - 1, jump->child_loc_map))
+   return -EINVAL;
+   clear_bit(loc - 1, jump->child_loc_map);
}
 
/* Check if the filter being deleted is a link */
@@ -8606,7 +8609,7 @@ static int ixgbe_configure_clsu32(struct ixgbe_adapter 
*adapter,
mask = kzalloc(sizeof(*mask), GFP_KERNEL);
if (!mask) {
err = -ENOMEM;
-   goto err_out;
+   goto free_input;
}
jump->input = input;
jump->mask = mask;
@@ -8629,7 +8632,7 @@ static int ixgbe_configure_clsu32(struct ixgbe_adapter 
*adapter,
mask = kzalloc(sizeof(*mask), GFP_KERNEL);
if (!mask) {
err = -ENOMEM;
-   goto err_out;
+   goto free_input;
}
 
if ((uhtid != 0x800) && (adapter->jump_tables[uhtid])) {
@@ -8639,6 +8642,20 @@ static int ixgbe_configure_clsu32(struct ixgbe_adapter 
*adapter,
if ((adapter->jump_tables[uhtid])->mask)
memcpy(mask, (adapter->jump_tables[uhtid])->mask,
   sizeof(*mask));
+
+   /* Lookup in all child hash tables if this location is already
+* filled with a filter
+*/
+   for (i = 1; i < IXGBE_MAX_LINK_HANDLE; i++) {
+   struct ixgbe_jump_table *link = adapter->jump_tables[i];
+
+   if (link && (test_bit(loc - 1, link->child_loc_map))) {
+   e_err(drv, "Filter exists in location: %x\n",
+ loc);
+   err = -EINVAL;
+   goto err_out;
+   }
+   }
}
err = ixgbe_clsu32_build_input(input, mask, cls, field_ptr, NULL);
if (err)
@@ -8670,25 +8687,17 @@ static int ixgbe_configure_clsu32(struct ixgbe_adapter 
*adapter,
ixgbe_update_ethtool_fdir_entry(adapter, input, input->sw_idx);
spin_unlock(>fdir_perfect_lock);
 
-   if ((uhtid != 0x800) && (adapter->jump_tables[uhtid])) {
-   struct ixgbe_jump_table *link = adapter->jump_tables[uhtid];
+   if ((uhtid != 0x800) && (adapter->jump_tables[uhtid]))
+   set_bit(loc - 1, (adapter->jump_tables[uhtid])->child_loc_map);
 
-   if (test_bit(loc - 1, link->child_loc_map)) {
-   e_err(drv, "Filter: %x exists in hash table: %x\n",
- loc, uhtid);
-   err = -EINVAL;
-   goto free_mask;
-   }
-   set_bit(loc - 1, link->child_loc_map);
-   }
kfree(mask);
return err;
 err_out_w_lock:
spin_unlock(>fdir_perfect_lock);
 err_out:
-   kfree(input);
-free_mask:
kfree(mask);
+free_input:
+   kfree(input);
 free_jump:
kfree(jump);
return err;
-- 

[net-next 05/15] ixgbe: Fix deleting link filters for cls_u32 offloads

2016-06-29 Thread Jeff Kirsher
From: Amritha Nambiar 

On deleting filters which are links to a child hash table, the filters
in the child hash table must be cleared from the hardware if there
is no link between the parent and child hash table.

Verified with the following filters:

Create a child hash table:
handle 1: u32 divisor 1

Link to the child hash table from parent hash table:
handle 800:0:10 u32 ht 800: link 1: \
offset at 0 mask 0f00 shift 6 plus 0 eat \
match ip protocol 6 ff match ip dst 15.0.0.1/32

Add filters into child hash table:
handle 1:0:2 u32 ht 1: \
match tcp src 22  action drop
handle 1:0:3 u32 ht 1: \
match tcp src 33  action drop

Delete link filter from parent hash table:
handle 800:0:10 u32

Signed-off-by: Amritha Nambiar 
Acked-by: Sridhar Samudrala 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  | 75 +++---
 drivers/net/ethernet/intel/ixgbe/ixgbe_model.h |  4 ++
 2 files changed, 72 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 468fa9d..75e6855 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -8300,14 +8300,50 @@ int ixgbe_setup_tc(struct net_device *dev, u8 tc)
 static int ixgbe_delete_clsu32(struct ixgbe_adapter *adapter,
   struct tc_cls_u32_offload *cls)
 {
+   u32 hdl = cls->knode.handle;
u32 uhtid = TC_U32_USERHTID(cls->knode.handle);
-   u32 loc;
-   int err;
+   u32 loc = cls->knode.handle & 0xf;
+   int err = 0, i, j;
+   struct ixgbe_jump_table *jump = NULL;
+
+   if (loc > IXGBE_MAX_HW_ENTRIES)
+   return -EINVAL;
 
if ((uhtid != 0x800) && (uhtid >= IXGBE_MAX_LINK_HANDLE))
return -EINVAL;
 
-   loc = cls->knode.handle & 0xf;
+   /* Clear this filter in the link data it is associated with */
+   if (uhtid != 0x800) {
+   jump = adapter->jump_tables[uhtid];
+   if (jump)
+   clear_bit(loc - 1, jump->child_loc_map);
+   }
+
+   /* Check if the filter being deleted is a link */
+   for (i = 1; i < IXGBE_MAX_LINK_HANDLE; i++) {
+   jump = adapter->jump_tables[i];
+   if (jump && jump->link_hdl == hdl) {
+   /* Delete filters in the hardware in the child hash
+* table associated with this link
+*/
+   for (j = 0; j < IXGBE_MAX_HW_ENTRIES; j++) {
+   if (!test_bit(j, jump->child_loc_map))
+   continue;
+   spin_lock(>fdir_perfect_lock);
+   err = ixgbe_update_ethtool_fdir_entry(adapter,
+ NULL,
+ j + 1);
+   spin_unlock(>fdir_perfect_lock);
+   clear_bit(j, jump->child_loc_map);
+   }
+   /* Remove resources for this link */
+   kfree(jump->input);
+   kfree(jump->mask);
+   kfree(jump);
+   adapter->jump_tables[i] = NULL;
+   return err;
+   }
+   }
 
spin_lock(>fdir_perfect_lock);
err = ixgbe_update_ethtool_fdir_entry(adapter, NULL, loc);
@@ -8541,6 +8577,18 @@ static int ixgbe_configure_clsu32(struct ixgbe_adapter 
*adapter,
if (!test_bit(link_uhtid - 1, >tables))
return err;
 
+   /* Multiple filters as links to the same hash table are not
+* supported. To add a new filter with the same next header
+* but different match/jump conditions, create a new hash table
+* and link to it.
+*/
+   if (adapter->jump_tables[link_uhtid] &&
+   (adapter->jump_tables[link_uhtid])->link_hdl) {
+   e_err(drv, "Link filter exists for link: %x\n",
+ link_uhtid);
+   return err;
+   }
+
for (i = 0; nexthdr[i].jump; i++) {
if (nexthdr[i].o != cls->knode.sel->offoff ||
nexthdr[i].s != cls->knode.sel->offshift ||
@@ -8558,10 +8606,12 @@ static int ixgbe_configure_clsu32(struct ixgbe_adapter 
*adapter,
mask = kzalloc(sizeof(*mask), GFP_KERNEL);
if (!mask) {
  

[net-next 01/15] fm10k: don't use BIT() macro where the value isn't a bitmask

2016-06-29 Thread Jeff Kirsher
From: Jacob Keller 

The FM10K_MAX_DATA_PER_TXD is really just using a bitshift as a power of
2 operation in an efficient manner. We shouldn't represent this as a BIT()
because that obscures the intention of the operation.

Signed-off-by: Jacob Keller 
Tested-by: Krishneil Singh 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/fm10k/fm10k.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k.h 
b/drivers/net/ethernet/intel/fm10k/fm10k.h
index fcf106e..e98b86b 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k.h
+++ b/drivers/net/ethernet/intel/fm10k/fm10k.h
@@ -406,7 +406,7 @@ static inline u16 fm10k_desc_unused(struct fm10k_ring *ring)
 (&(((union fm10k_rx_desc *)((R)->desc))[i]))
 
 #define FM10K_MAX_TXD_PWR  14
-#define FM10K_MAX_DATA_PER_TXD BIT(FM10K_MAX_TXD_PWR)
+#define FM10K_MAX_DATA_PER_TXD (1u << FM10K_MAX_TXD_PWR)
 
 /* Tx Descriptors needed, worst case */
 #define TXD_USE_COUNT(S)   DIV_ROUND_UP((S), FM10K_MAX_DATA_PER_TXD)
-- 
2.5.5



[net-next 02/15] fm10k: Align Rx buffers to 512B blocks

2016-06-29 Thread Jeff Kirsher
From: Alexander Duyck 

While reviewing the i40e driver changes to support page based receive I
realized that I had overlooked the fact that the fm10k hardware required a
512 byte alignment for Rx buffers.  This patch is meant to address that by
changing the alignment for Rx buffers to 512 bytes instead of allowing it
to be L1 cache aligned.

Signed-off-by: Alexander Duyck 
Tested-by: Krishneil Singh 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/fm10k/fm10k_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c 
b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
index 0e166e9..53d02416 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
@@ -272,7 +272,7 @@ static bool fm10k_add_rx_frag(struct fm10k_rx_buffer 
*rx_buffer,
 #if (PAGE_SIZE < 8192)
unsigned int truesize = FM10K_RX_BUFSZ;
 #else
-   unsigned int truesize = SKB_DATA_ALIGN(size);
+   unsigned int truesize = ALIGN(size, 512);
 #endif
unsigned int pull_len;
 
-- 
2.5.5



Re: [PATCH net-next 08/16] net/devlink: Add E-Switch mode control

2016-06-29 Thread Or Gerlitz
On Wed, Jun 29, 2016 at 7:35 PM, John Fastabend
 wrote:
> On 16-06-29 07:48 AM, Or Gerlitz wrote:
>> On 6/28/2016 10:31 PM, John Fastabend wrote:
>>> On 16-06-28 12:12 PM, Jiri Pirko wrote:

 Why?! Please, leave legacy be legacy. Use the new mode for
 implementing new features. Don't make things any more complicated :(

[...]
>>> Maybe I'm reading to much into the devlink flag names and if instead
>>> you use a switch like the following,

>>>VF representer : enable/disable the creation VF netdev's to represent
>>> the virtual functions on the PF

>>> Much less complicated then magic switching between forwarding logic IMO
>>> and you don't whack a default configuration that an entire stack (e.g.
>>> libvirt) has been built to use.

>> Re letting the user to observe/modify the rules added by the
>> driver/firmware while legacy mode. Even if possible with bridge/fdb, it
>> will be really pragmatical and doesn't make sense to get that donefor
>> the TC subsystem. So this isn't a well defined solution and anyway, as
>> you said, legacy mode enhancements is a different exercise. Personally,
>> I agree with Jiri, that we should legacy be legacyand focus on adding
>> the new model.

> The ixgbe driver already supports bridge and tc commands without the VF
> representer.  Adding the VF representer to these drivers just extends
> the existing support so we have an identifier for VFs and now the
> redirect action works and the fdb commands can specify the VF netdevs.
> I don't see this as a problem because we already do it today with
> 'ip' and bridge tools.

To be precise, for both ixgbe and mlx5, the existing tc support
(u32/ixgbe, flower/mlx5) is not for switching functionality but rather
for NIC-ish one, e.g drop, mark, etc. Indeed in ixgbe you added
redirect to VF, but this is only for south --> north (wire --> VF)
traffic, w.o the VF rep you can't do the other way around.

Just to clarify, to what exact bridge command support did you refer for ixgbe?

The forwarding done in the legacy mode is not well defined, and
different across vendors, adding there the VF reps will not make it
any better b/c some steering rules will be set by tc/bridge offloads
while other rules will be put by the driver.
I don't see how this takes us to better place.

> We are also slightly in disagreement about what the default should be
> with VF netdevs. I think the default should be the same L2 mac/vlan
> switch behavior and see no reason to change it by default just because
> we added VF netdevs. The infrastructure libvirt/openstack/etc are built
> around this default today. But I guess nothing in this series specifies
> what the defaults of any given driver will be. VF netdevs are still
> useful even on older hardware that only supports mac/vlan forwarding to
> expose statistics and send/receive control frames such as lldp.

Again, this is not about default engineering... and using the VF reps
(not VF netdevs) in legacy mode only make it more cryptic to my
opinion. I agree some changes would be needed in openstack to support
the new model, but this is how progress is made... you can't always
make all layer above you unchanged. Note that the VF reps behave the
same as tap devices (v-switch doing xmit on tap --> recv in VM, VM
sends --> recv on tap into the v-switch), so the change in open-stack
would not be that big.

[...]

> Why I think the VF representer is a per port ethtool flag and not a
> devlink option is my use case might be to assign a PF into a VM or
> namespace where I don't want VF netdevs.

again, we think the correct place to set how the eswitch is managed is
through eswitch manager PCI devices and not net devices and hence
ethtool is not the way to go.

Also, how do you want your e-switch to be managed in this case?


[net-next 09/15] igb: re-use igb_ptp_reset in igb_ptp_init

2016-06-29 Thread Jeff Kirsher
From: Jacob Keller 

Modify igb_ptp_init to take advantage of igb_ptp_reset, and remove
duplicated work that was occurring in both igb_ptp_reset and
igb_ptp_init.

In total, resetting the TSAUXC register, and resetting the system time
both happen in igb_ptp_reset already. igb_ptp_reset now also takes care
of starting the delayed work item for overflow checks, as well.

Signed-off-by: Jacob Keller 
Tested-by: Aaron Brown 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/igb/igb_main.c |  3 +-
 drivers/net/ethernet/intel/igb/igb_ptp.c  | 48 +++
 2 files changed, 18 insertions(+), 33 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index ef3d642..750a5d8 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -2027,7 +2027,8 @@ void igb_reset(struct igb_adapter *adapter)
wr32(E1000_VET, ETHERNET_IEEE_VLAN_TYPE);
 
/* Re-enable PTP, where applicable. */
-   igb_ptp_reset(adapter);
+   if (adapter->ptp_flags & IGB_PTP_ENABLED)
+   igb_ptp_reset(adapter);
 
igb_get_phy_info(hw);
 }
diff --git a/drivers/net/ethernet/intel/igb/igb_ptp.c 
b/drivers/net/ethernet/intel/igb/igb_ptp.c
index 88d367d..907c225 100644
--- a/drivers/net/ethernet/intel/igb/igb_ptp.c
+++ b/drivers/net/ethernet/intel/igb/igb_ptp.c
@@ -1043,6 +1043,13 @@ int igb_ptp_set_ts_config(struct net_device *netdev, 
struct ifreq *ifr)
-EFAULT : 0;
 }
 
+/**
+ * igb_ptp_init - Initialize PTP functionality
+ * @adapter: Board private structure
+ *
+ * This function is called at device probe to initialize the PTP
+ * functionality.
+ */
 void igb_ptp_init(struct igb_adapter *adapter)
 {
struct e1000_hw *hw = >hw;
@@ -1065,8 +1072,6 @@ void igb_ptp_init(struct igb_adapter *adapter)
adapter->cc.mask = CYCLECOUNTER_MASK(64);
adapter->cc.mult = 1;
adapter->cc.shift = IGB_82576_TSYNC_SHIFT;
-   /* Dial the nominal frequency. */
-   wr32(E1000_TIMINCA, INCPERIOD_82576 | INCVALUE_82576);
adapter->ptp_flags |= IGB_PTP_OVERFLOW_CHECK;
break;
case e1000_82580:
@@ -1086,8 +1091,6 @@ void igb_ptp_init(struct igb_adapter *adapter)
adapter->cc.mask = CYCLECOUNTER_MASK(IGB_NBITS_82580);
adapter->cc.mult = 1;
adapter->cc.shift = 0;
-   /* Enable the timer functions by clearing bit 31. */
-   wr32(E1000_TSAUXC, 0x0);
adapter->ptp_flags |= IGB_PTP_OVERFLOW_CHECK;
break;
case e1000_i210:
@@ -1113,46 +1116,24 @@ void igb_ptp_init(struct igb_adapter *adapter)
adapter->ptp_caps.settime64 = igb_ptp_settime_i210;
adapter->ptp_caps.enable = igb_ptp_feature_enable_i210;
adapter->ptp_caps.verify = igb_ptp_verify_pin;
-   /* Enable the timer functions by clearing bit 31. */
-   wr32(E1000_TSAUXC, 0x0);
break;
default:
adapter->ptp_clock = NULL;
return;
}
 
-   wrfl();
-
spin_lock_init(>tmreg_lock);
INIT_WORK(>ptp_tx_work, igb_ptp_tx_work);
 
-   /* Initialize the clock and overflow work for devices that need it. */
-   if ((hw->mac.type == e1000_i210) || (hw->mac.type == e1000_i211)) {
-   struct timespec64 ts = ktime_to_timespec64(ktime_get_real());
-
-   igb_ptp_settime_i210(>ptp_caps, );
-   } else {
-   timecounter_init(>tc, >cc,
-ktime_to_ns(ktime_get_real()));
-   }
-
-   if (adapter->ptp_flags & IGB_PTP_OVERFLOW_CHECK) {
+   if (adapter->ptp_flags & IGB_PTP_OVERFLOW_CHECK)
INIT_DELAYED_WORK(>ptp_overflow_work,
  igb_ptp_overflow_check);
 
-   schedule_delayed_work(>ptp_overflow_work,
- IGB_SYSTIM_OVERFLOW_PERIOD);
-   }
-
-   /* Initialize the time sync interrupts for devices that support it. */
-   if (hw->mac.type >= e1000_82580) {
-   wr32(E1000_TSIM, TSYNC_INTERRUPTS);
-   wr32(E1000_IMS, E1000_IMS_TS);
-   }
-
adapter->tstamp_config.rx_filter = HWTSTAMP_FILTER_NONE;
adapter->tstamp_config.tx_type = HWTSTAMP_TX_OFF;
 
+   igb_ptp_reset(adapter);
+
adapter->ptp_clock = ptp_clock_register(>ptp_caps,
>pdev->dev);
if (IS_ERR(adapter->ptp_clock)) {
@@ -1205,9 +1186,6 @@ void igb_ptp_reset(struct igb_adapter *adapter)
struct e1000_hw *hw = >hw;
unsigned long flags;
 
-   if (!(adapter->ptp_flags & IGB_PTP_ENABLED))
-   return;
-
/* reset 

[net-next 08/15] igb: introduce IGB_PTP_OVERFLOW_CHECK flag

2016-06-29 Thread Jeff Kirsher
From: Jacob Keller 

Don't continue to use complex MAC type checks for handling various cases
where we have overflow check code. Make this code more obvious by
introducing a flag which is enabled for hardware that needs these
checks.

Signed-off-by: Jacob Keller 
Tested-by: Aaron Brown 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/igb/igb.h |  1 +
 drivers/net/ethernet/intel/igb/igb_ptp.c | 21 -
 2 files changed, 9 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb.h 
b/drivers/net/ethernet/intel/igb/igb.h
index 1e18a9e..38daaab 100644
--- a/drivers/net/ethernet/intel/igb/igb.h
+++ b/drivers/net/ethernet/intel/igb/igb.h
@@ -477,6 +477,7 @@ struct igb_adapter {
 
 /* flags controlling PTP/1588 function */
 #define IGB_PTP_ENABLEDBIT(0)
+#define IGB_PTP_OVERFLOW_CHECK BIT(1)
 
 #define IGB_FLAG_HAS_MSI   BIT(0)
 #define IGB_FLAG_DCA_ENABLED   BIT(1)
diff --git a/drivers/net/ethernet/intel/igb/igb_ptp.c 
b/drivers/net/ethernet/intel/igb/igb_ptp.c
index 504102b..88d367d 100644
--- a/drivers/net/ethernet/intel/igb/igb_ptp.c
+++ b/drivers/net/ethernet/intel/igb/igb_ptp.c
@@ -1067,6 +1067,7 @@ void igb_ptp_init(struct igb_adapter *adapter)
adapter->cc.shift = IGB_82576_TSYNC_SHIFT;
/* Dial the nominal frequency. */
wr32(E1000_TIMINCA, INCPERIOD_82576 | INCVALUE_82576);
+   adapter->ptp_flags |= IGB_PTP_OVERFLOW_CHECK;
break;
case e1000_82580:
case e1000_i354:
@@ -1087,6 +1088,7 @@ void igb_ptp_init(struct igb_adapter *adapter)
adapter->cc.shift = 0;
/* Enable the timer functions by clearing bit 31. */
wr32(E1000_TSAUXC, 0x0);
+   adapter->ptp_flags |= IGB_PTP_OVERFLOW_CHECK;
break;
case e1000_i210:
case e1000_i211:
@@ -1132,7 +1134,9 @@ void igb_ptp_init(struct igb_adapter *adapter)
} else {
timecounter_init(>tc, >cc,
 ktime_to_ns(ktime_get_real()));
+   }
 
+   if (adapter->ptp_flags & IGB_PTP_OVERFLOW_CHECK) {
INIT_DELAYED_WORK(>ptp_overflow_work,
  igb_ptp_overflow_check);
 
@@ -1169,20 +1173,11 @@ void igb_ptp_init(struct igb_adapter *adapter)
  **/
 void igb_ptp_stop(struct igb_adapter *adapter)
 {
-   switch (adapter->hw.mac.type) {
-   case e1000_82576:
-   case e1000_82580:
-   case e1000_i354:
-   case e1000_i350:
-   cancel_delayed_work_sync(>ptp_overflow_work);
-   break;
-   case e1000_i210:
-   case e1000_i211:
-   /* No delayed work to cancel. */
-   break;
-   default:
+   if (!(adapter->ptp_flags & IGB_PTP_ENABLED))
return;
-   }
+
+   if (adapter->ptp_flags & IGB_PTP_OVERFLOW_CHECK)
+   cancel_delayed_work_sync(>ptp_overflow_work);
 
cancel_work_sync(>ptp_tx_work);
if (adapter->ptp_tx_skb) {
-- 
2.5.5



[net-next 07/15] igb: introduce ptp_flags variable and use it to replace IGB_FLAG_PTP

2016-06-29 Thread Jeff Kirsher
From: Jacob Keller 

Upcoming patches will introduce new PTP specific flags. To avoid
cluttering the normal flags variable, introduce PTP specific "ptp_flags"
variable for this purpose, and move IGB_FLAG_PTP to become
IGB_PTP_ENABLED.

Signed-off-by: Jacob Keller 
Tested-by: Aaron Brown 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/igb/igb.h | 5 -
 drivers/net/ethernet/intel/igb/igb_ptp.c | 7 ---
 2 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb.h 
b/drivers/net/ethernet/intel/igb/igb.h
index b9609af..1e18a9e 100644
--- a/drivers/net/ethernet/intel/igb/igb.h
+++ b/drivers/net/ethernet/intel/igb/igb.h
@@ -445,6 +445,7 @@ struct igb_adapter {
unsigned long ptp_tx_start;
unsigned long last_rx_ptp_check;
unsigned long last_rx_timestamp;
+   unsigned int ptp_flags;
spinlock_t tmreg_lock;
struct cyclecounter cc;
struct timecounter tc;
@@ -474,12 +475,14 @@ struct igb_adapter {
u16 eee_advert;
 };
 
+/* flags controlling PTP/1588 function */
+#define IGB_PTP_ENABLEDBIT(0)
+
 #define IGB_FLAG_HAS_MSI   BIT(0)
 #define IGB_FLAG_DCA_ENABLED   BIT(1)
 #define IGB_FLAG_QUAD_PORT_A   BIT(2)
 #define IGB_FLAG_QUEUE_PAIRS   BIT(3)
 #define IGB_FLAG_DMAC  BIT(4)
-#define IGB_FLAG_PTP   BIT(5)
 #define IGB_FLAG_RSS_FIELD_IPV4_UDPBIT(6)
 #define IGB_FLAG_RSS_FIELD_IPV6_UDPBIT(7)
 #define IGB_FLAG_WOL_SUPPORTED BIT(8)
diff --git a/drivers/net/ethernet/intel/igb/igb_ptp.c 
b/drivers/net/ethernet/intel/igb/igb_ptp.c
index f097c5a..504102b 100644
--- a/drivers/net/ethernet/intel/igb/igb_ptp.c
+++ b/drivers/net/ethernet/intel/igb/igb_ptp.c
@@ -684,6 +684,7 @@ void igb_ptp_rx_hang(struct igb_adapter *adapter)
u32 tsyncrxctl = rd32(E1000_TSYNCRXCTL);
unsigned long rx_event;
 
+   /* Other hardware uses per-packet timestamps */
if (hw->mac.type != e1000_82576)
return;
 
@@ -1156,7 +1157,7 @@ void igb_ptp_init(struct igb_adapter *adapter)
} else {
dev_info(>pdev->dev, "added PHC on %s\n",
 adapter->netdev->name);
-   adapter->flags |= IGB_FLAG_PTP;
+   adapter->ptp_flags |= IGB_PTP_ENABLED;
}
 }
 
@@ -1194,7 +1195,7 @@ void igb_ptp_stop(struct igb_adapter *adapter)
ptp_clock_unregister(adapter->ptp_clock);
dev_info(>pdev->dev, "removed PHC on %s\n",
 adapter->netdev->name);
-   adapter->flags &= ~IGB_FLAG_PTP;
+   adapter->ptp_flags &= ~IGB_PTP_ENABLED;
}
 }
 
@@ -1209,7 +1210,7 @@ void igb_ptp_reset(struct igb_adapter *adapter)
struct e1000_hw *hw = >hw;
unsigned long flags;
 
-   if (!(adapter->flags & IGB_FLAG_PTP))
+   if (!(adapter->ptp_flags & IGB_PTP_ENABLED))
return;
 
/* reset the tstamp_config */
-- 
2.5.5



[net-next 04/15] e1000e: prevent division by zero if TIMINCA is zero

2016-06-29 Thread Jeff Kirsher
From: Denys Vlasenko 

Users report that under VMWare, er32(TIMINCA) returns zero.
This causes division by zero at init time as follows:

 ==>   incvalue = er32(TIMINCA) & E1000_TIMINCA_INCVALUE_MASK;
   for (i = 0; i < E1000_MAX_82574_SYSTIM_REREADS; i++) {
   /* latch SYSTIMH on read of SYSTIML */
   systim_next = (cycle_t)er32(SYSTIML);
   systim_next |= (cycle_t)er32(SYSTIMH) << 32;

   time_delta = systim_next - systim;
   temp = time_delta;
 > rem = do_div(temp, incvalue);

This change makes kernel survive this, and users report that
NIC does work after this change.

Since on real hardware incvalue is never zero, this should not affect
real hardware use case.

Signed-off-by: Denys Vlasenko 
Tested-by: Aaron Brown 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/e1000e/netdev.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c 
b/drivers/net/ethernet/intel/e1000e/netdev.c
index 75e6089..9d5bab8 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -4352,7 +4352,8 @@ static cycle_t e1000e_cyclecounter_read(const struct 
cyclecounter *cc)
 
time_delta = systim_next - systim;
temp = time_delta;
-   rem = do_div(temp, incvalue);
+   /* VMWare users have seen incvalue of zero, don't div / 
0 */
+   rem = incvalue ? do_div(temp, incvalue) : (time_delta 
!= 0);
 
systim = systim_next;
 
-- 
2.5.5



[net-next 12/15] fm10k: Remove create_workqueue

2016-06-29 Thread Jeff Kirsher
From: Bhaktipriya Shridhar 

alloc_workqueue replaces deprecated create_workqueue().

A dedicated workqueue has been used since the workitem (viz
fm10k_service_task, which manages and runs other subtasks) is involved in
normal device operation and requires forward progress under memory
pressure.

create_workqueue has been replaced with alloc_workqueue with max_active
as 0 since there is no need for throttling the number of active work
items.

Since network devices may be used in memory reclaim path,
WQ_MEM_RECLAIM has been set to guarantee forward progress.

flush_workqueue is unnecessary since destroy_workqueue() itself calls
drain_workqueue() which flushes repeatedly till the workqueue
becomes empty. Hence the call to flush_workqueue() has been dropped.

Signed-off-by: Bhaktipriya Shridhar 
Acked-by: Tejun Heo 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/fm10k/fm10k_main.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c 
b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
index 53d02416..a9ccc1e 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
@@ -56,7 +56,7 @@ static int __init fm10k_init_module(void)
pr_info("%s\n", fm10k_copyright);
 
/* create driver workqueue */
-   fm10k_workqueue = create_workqueue("fm10k");
+   fm10k_workqueue = alloc_workqueue("fm10k", WQ_MEM_RECLAIM, 0);
 
fm10k_dbg_init();
 
@@ -77,7 +77,6 @@ static void __exit fm10k_exit_module(void)
fm10k_dbg_exit();
 
/* destroy driver workqueue */
-   flush_workqueue(fm10k_workqueue);
destroy_workqueue(fm10k_workqueue);
 }
 module_exit(fm10k_exit_module);
-- 
2.5.5



[net-next 13/15] ixgbe: Correct reporting of timestamping for x550

2016-06-29 Thread Jeff Kirsher
From: Tony Nguyen 

Update ixgbe_ethtool_get_ts_info() to show that x550 supports hardware
timestamping of all packets.

Reported-by: Guy Harris 
Signed-off-by: Tony Nguyen 
Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
index 59b771b..8a84507 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
@@ -2991,10 +2991,15 @@ static int ixgbe_get_ts_info(struct net_device *dev,
 {
struct ixgbe_adapter *adapter = netdev_priv(dev);
 
+   /* we always support timestamping disabled */
+   info->rx_filters = BIT(HWTSTAMP_FILTER_NONE);
+
switch (adapter->hw.mac.type) {
case ixgbe_mac_X550:
case ixgbe_mac_X550EM_x:
case ixgbe_mac_x550em_a:
+   info->rx_filters |= BIT(HWTSTAMP_FILTER_ALL);
+   /* fallthrough */
case ixgbe_mac_X540:
case ixgbe_mac_82599EB:
info->so_timestamping =
@@ -3014,8 +3019,7 @@ static int ixgbe_get_ts_info(struct net_device *dev,
BIT(HWTSTAMP_TX_OFF) |
BIT(HWTSTAMP_TX_ON);
 
-   info->rx_filters =
-   BIT(HWTSTAMP_FILTER_NONE) |
+   info->rx_filters |=
BIT(HWTSTAMP_FILTER_PTP_V1_L4_SYNC) |
BIT(HWTSTAMP_FILTER_PTP_V1_L4_DELAY_REQ) |
BIT(HWTSTAMP_FILTER_PTP_V2_EVENT);
-- 
2.5.5



[net-next 00/15][pull request] Intel Wired LAN Driver Updates 2016-06-29

2016-06-29 Thread Jeff Kirsher
This series contains updates and fixes to e1000e, igb, ixgbe and fm10k.  A
true smorgasbord of changes.

Jake cleans up some obscurity by not using the BIT() macro on bitshift
operation and also fixed the calculated index when looping through the
indir array.  Fixes the issue with igb's workqueue item for overflow
check from causing a surprise remove event.  The ptp_flags variable is
added to simplify the work of writing several complex MAC type checks
in the PTP code while fixing the workqueue.

Alex Duyck fixes the receive buffers alignment which should not be L1
cache aligned, but to 512 bytes instead.

Denys Vlasenko prevents a division by zero which was reported under
VMWare for e1000e.

Amritha fixes an issue where filters in a child hash table must be
cleared from the hardware before delete the filter links in ixgbe.

Bhaktipriya Shridhar simply replaces the deprecated create_workqueue()
with alloc_workqueue() for fm10k.

Tony corrects ixgbe ethtool reporting to show x550 supports hardware
timestamping of all packets.

Emil fixes an issue where MAC-VLANs on the VF fail to pass traffic due
to spoofed packets.

Andrew Lunn increases performance on some systems where syncing a buffer
for DMA is expensive.  So rather than sync the whole 2K receive buffer,
only synchronize the length of the frame.

The following are changes since commit 6f30e8b022c8e3a722928ddb1a2ae0be852fcc0e:
  Merge branch '40GbE' of 
git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 1GbE

Alexander Duyck (1):
  fm10k: Align Rx buffers to 512B blocks

Amritha Nambiar (2):
  ixgbe: Fix deleting link filters for cls_u32 offloads
  ixgbe: Error handler for duplicate filter locations in hardware for
cls_u32 offloads

Andrew Lunn (1):
  igb: Only DMA sync frame length

Bhaktipriya Shridhar (1):
  fm10k: Remove create_workqueue

Denys Vlasenko (1):
  e1000e: prevent division by zero if TIMINCA is zero

Emil Tantilov (1):
  ixgbe: fix spoofed packets with macvlans

Jacob Keller (7):
  fm10k: don't use BIT() macro where the value isn't a bitmask
  fm10k: fix incorrect index calculation in fm10k_write_reta
  igb: introduce ptp_flags variable and use it to replace IGB_FLAG_PTP
  igb: introduce IGB_PTP_OVERFLOW_CHECK flag
  igb: re-use igb_ptp_reset in igb_ptp_init
  igb: implement igb_ptp_suspend
  igb: call igb_ptp_suspend during suspend/resume cycle

Tony Nguyen (1):
  ixgbe: Correct reporting of timestamping for x550

 drivers/net/ethernet/intel/e1000e/netdev.c   |  3 +-
 drivers/net/ethernet/intel/fm10k/fm10k.h |  2 +-
 drivers/net/ethernet/intel/fm10k/fm10k_ethtool.c |  5 +-
 drivers/net/ethernet/intel/fm10k/fm10k_main.c|  5 +-
 drivers/net/ethernet/intel/igb/igb.h |  7 +-
 drivers/net/ethernet/intel/igb/igb_main.c| 12 ++--
 drivers/net/ethernet/intel/igb/igb_ptp.c | 92 +++-
 drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c |  8 ++-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c| 76 +++-
 drivers/net/ethernet/intel/ixgbe/ixgbe_model.h   |  4 ++
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c   |  1 +
 11 files changed, 148 insertions(+), 67 deletions(-)

-- 
2.5.5



[net-next 03/15] fm10k: fix incorrect index calculation in fm10k_write_reta

2016-06-29 Thread Jeff Kirsher
From: Jacob Keller 

The index calculated when looping through the indir array passed to
fm10k_write_reta was incorrectly calculated as the first part i needs to
be multiplied by 4.

Fixes: 0cfea7a65738 ("fm10k: fix possible null pointer deref after kcalloc", 
2016-04-13)
Signed-off-by: Jacob Keller 
Tested-by: Krishneil Singh 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/fm10k/fm10k_ethtool.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_ethtool.c 
b/drivers/net/ethernet/intel/fm10k/fm10k_ethtool.c
index 9c0d875..9b51954 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_ethtool.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_ethtool.c
@@ -983,9 +983,10 @@ void fm10k_write_reta(struct fm10k_intfc *interface, const 
u32 *indir)
/* generate a new table if we weren't given one */
for (j = 0; j < 4; j++) {
if (indir)
-   n = indir[i + j];
+   n = indir[4 * i + j];
else
-   n = ethtool_rxfh_indir_default(i + j, rss_i);
+   n = ethtool_rxfh_indir_default(4 * i + j,
+  rss_i);
 
table[j] = n;
}
-- 
2.5.5



[net-next 14/15] ixgbe: fix spoofed packets with macvlans

2016-06-29 Thread Jeff Kirsher
From: Emil Tantilov 

When setting spoofing, both VLAN and MAC need to be set together.
This change resolves an issue where MAC-VLANs on the VF fail to pass
traffic due to spoofed packets.

Signed-off-by: Emil Tantilov 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
index c5caacd..8618599 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c
@@ -954,6 +954,7 @@ static int ixgbe_set_vf_macvlan_msg(struct ixgbe_adapter 
*adapter,
struct ixgbe_hw *hw = >hw;
 
hw->mac.ops.set_mac_anti_spoofing(hw, false, vf);
+   hw->mac.ops.set_vlan_anti_spoofing(hw, false, vf);
}
}
 
-- 
2.5.5



[net-next 11/15] igb: call igb_ptp_suspend during suspend/resume cycle

2016-06-29 Thread Jeff Kirsher
From: Jacob Keller 

Properly stop the extra workqueue items and ensure that we resume
cleanly. This is better than using igb_ptp_init and igb_ptp_stop since
these functions destroy the PHC device, which will cause other problems
if we do so. Since igb_ptp_reset now re-schedules the work-queue item we
don't need an equivalent igb_ptp_resume in the resume workflow.

Signed-off-by: Jacob Keller 
Tested-by: Aaron Brown 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/igb/igb_main.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index 750a5d8..a15f826 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -7528,6 +7528,8 @@ static int __igb_shutdown(struct pci_dev *pdev, bool 
*enable_wake,
if (netif_running(netdev))
__igb_close(netdev, true);
 
+   igb_ptp_suspend(adapter);
+
igb_clear_interrupt_scheme(adapter);
 
 #ifdef CONFIG_PM
-- 
2.5.5



[net-next 15/15] igb: Only DMA sync frame length

2016-06-29 Thread Jeff Kirsher
From: Andrew Lunn 

On some platforms, syncing a buffer for DMA is expensive. Rather than
sync the whole 2K receive buffer, only synchronise the length of the
frame, which will typically be the MTU, or a much smaller TCP ACK.

For an IMX6Q, this gives around 6% increased TCP receive performance,
which is cache operations bound and reduces CPU load for TCP transmit.

Signed-off-by: Andrew Lunn 
Tested-by: Aaron Brown 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/igb/igb_main.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index a15f826..9bcba42 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -6856,12 +6856,12 @@ static bool igb_can_reuse_rx_page(struct igb_rx_buffer 
*rx_buffer,
  **/
 static bool igb_add_rx_frag(struct igb_ring *rx_ring,
struct igb_rx_buffer *rx_buffer,
+   unsigned int size,
union e1000_adv_rx_desc *rx_desc,
struct sk_buff *skb)
 {
struct page *page = rx_buffer->page;
unsigned char *va = page_address(page) + rx_buffer->page_offset;
-   unsigned int size = le16_to_cpu(rx_desc->wb.upper.length);
 #if (PAGE_SIZE < 8192)
unsigned int truesize = IGB_RX_BUFSZ;
 #else
@@ -6913,6 +6913,7 @@ static struct sk_buff *igb_fetch_rx_buffer(struct 
igb_ring *rx_ring,
   union e1000_adv_rx_desc *rx_desc,
   struct sk_buff *skb)
 {
+   unsigned int size = le16_to_cpu(rx_desc->wb.upper.length);
struct igb_rx_buffer *rx_buffer;
struct page *page;
 
@@ -6948,11 +6949,11 @@ static struct sk_buff *igb_fetch_rx_buffer(struct 
igb_ring *rx_ring,
dma_sync_single_range_for_cpu(rx_ring->dev,
  rx_buffer->dma,
  rx_buffer->page_offset,
- IGB_RX_BUFSZ,
+ size,
  DMA_FROM_DEVICE);
 
/* pull page into skb */
-   if (igb_add_rx_frag(rx_ring, rx_buffer, rx_desc, skb)) {
+   if (igb_add_rx_frag(rx_ring, rx_buffer, size, rx_desc, skb)) {
/* hand second half of page back to the ring */
igb_reuse_rx_page(rx_ring, rx_buffer);
} else {
-- 
2.5.5



[net-next 10/15] igb: implement igb_ptp_suspend

2016-06-29 Thread Jeff Kirsher
From: Jacob Keller 

Make igb_ptp_stop take advantage of this new function to reduce code
duplication.

Signed-off-by: Jacob Keller 
Tested-by: Aaron Brown 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/igb/igb.h |  1 +
 drivers/net/ethernet/intel/igb/igb_ptp.c | 22 +-
 2 files changed, 18 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/igb.h 
b/drivers/net/ethernet/intel/igb/igb.h
index 38daaab..5387b3a 100644
--- a/drivers/net/ethernet/intel/igb/igb.h
+++ b/drivers/net/ethernet/intel/igb/igb.h
@@ -550,6 +550,7 @@ void igb_set_fw_version(struct igb_adapter *);
 void igb_ptp_init(struct igb_adapter *adapter);
 void igb_ptp_stop(struct igb_adapter *adapter);
 void igb_ptp_reset(struct igb_adapter *adapter);
+void igb_ptp_suspend(struct igb_adapter *adapter);
 void igb_ptp_rx_hang(struct igb_adapter *adapter);
 void igb_ptp_rx_rgtstamp(struct igb_q_vector *q_vector, struct sk_buff *skb);
 void igb_ptp_rx_pktstamp(struct igb_q_vector *q_vector, unsigned char *va,
diff --git a/drivers/net/ethernet/intel/igb/igb_ptp.c 
b/drivers/net/ethernet/intel/igb/igb_ptp.c
index 907c225..e61b647 100644
--- a/drivers/net/ethernet/intel/igb/igb_ptp.c
+++ b/drivers/net/ethernet/intel/igb/igb_ptp.c
@@ -1147,12 +1147,13 @@ void igb_ptp_init(struct igb_adapter *adapter)
 }
 
 /**
- * igb_ptp_stop - Disable PTP device and stop the overflow check.
- * @adapter: Board private structure.
+ * igb_ptp_suspend - Disable PTP work items and prepare for suspend
+ * @adapter: Board private structure
  *
- * This function stops the PTP support and cancels the delayed work.
- **/
-void igb_ptp_stop(struct igb_adapter *adapter)
+ * This function stops the overflow check work and PTP Tx timestamp work, and
+ * will prepare the device for OS suspend.
+ */
+void igb_ptp_suspend(struct igb_adapter *adapter)
 {
if (!(adapter->ptp_flags & IGB_PTP_ENABLED))
return;
@@ -1166,6 +1167,17 @@ void igb_ptp_stop(struct igb_adapter *adapter)
adapter->ptp_tx_skb = NULL;
clear_bit_unlock(__IGB_PTP_TX_IN_PROGRESS, >state);
}
+}
+
+/**
+ * igb_ptp_stop - Disable PTP device and stop the overflow check.
+ * @adapter: Board private structure.
+ *
+ * This function stops the PTP support and cancels the delayed work.
+ **/
+void igb_ptp_stop(struct igb_adapter *adapter)
+{
+   igb_ptp_suspend(adapter);
 
if (adapter->ptp_clock) {
ptp_clock_unregister(adapter->ptp_clock);
-- 
2.5.5



[PATCH net-next 0/3] nfp: few code improvements

2016-06-29 Thread Jakub Kicinski
Hi!

Three small patches for net-next.  First and second patches
improve the code quality by spelling things correctly and
removing unused parameters.  Third patch hooks-in standard
kernel implementation of .get_link() in ethtool ops.

Jakub Kicinski (3):
  nfp: correct name of control BAR define
  nfp: remove unused parameter from nfp_net_write_mac_addr()
  nfp: implement ethtool .get_link() callback

 drivers/net/ethernet/netronome/nfp/nfp_net.h |  2 +-
 drivers/net/ethernet/netronome/nfp/nfp_net_common.c  | 13 +++--
 drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c |  1 +
 drivers/net/ethernet/netronome/nfp/nfp_netvf_main.c  |  4 ++--
 4 files changed, 11 insertions(+), 9 deletions(-)

-- 
1.9.1



[PATCH net-next 2/3] nfp: remove unused parameter from nfp_net_write_mac_addr()

2016-06-29 Thread Jakub Kicinski
nfp_net_write_mac_addr() always writes to the BAR the current
device address taken from netdev struct.  The address given
as parameter is actually ignored.  Since all callers pass
netdev->dev_addr simply remove the parameter.  

While at it improve the function's kdoc a bit.

Signed-off-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 13 +++--
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c 
b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index 2195ed3053da..df29cea0ad01 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -1845,13 +1845,14 @@ void nfp_net_coalesce_write_cfg(struct nfp_net *nn)
 }
 
 /**
- * nfp_net_write_mac_addr() - Write mac address to device registers
+ * nfp_net_write_mac_addr() - Write mac address to the device control BAR
  * @nn:  NFP Net device to reconfigure
- * @mac: Six-byte MAC address to be written
  *
- * We do a bit of byte swapping dance because firmware is LE.
+ * Writes the MAC address from the netdev to the device control BAR.  Does not
+ * perform the required reconfig.  We do a bit of byte swapping dance because
+ * firmware is LE.
  */
-static void nfp_net_write_mac_addr(struct nfp_net *nn, const u8 *mac)
+static void nfp_net_write_mac_addr(struct nfp_net *nn)
 {
nn_writel(nn, NFP_NET_CFG_MACADDR + 0,
  get_unaligned_be32(nn->netdev->dev_addr));
@@ -1952,7 +1953,7 @@ static int __nfp_net_set_config_and_enable(struct nfp_net 
*nn)
nn_writeq(nn, NFP_NET_CFG_RXRS_ENABLE, nn->num_rx_rings == 64 ?
  0xULL : ((u64)1 << nn->num_rx_rings) - 1);
 
-   nfp_net_write_mac_addr(nn, nn->netdev->dev_addr);
+   nfp_net_write_mac_addr(nn);
 
nn_writel(nn, NFP_NET_CFG_MTU, nn->netdev->mtu);
nn_writel(nn, NFP_NET_CFG_FLBUFSZ, nn->fl_bufsz);
@@ -2739,7 +2740,7 @@ int nfp_net_netdev_init(struct net_device *netdev)
nn->cap = nn_readl(nn, NFP_NET_CFG_CAP);
nn->max_mtu = nn_readl(nn, NFP_NET_CFG_MAX_MTU);
 
-   nfp_net_write_mac_addr(nn, nn->netdev->dev_addr);
+   nfp_net_write_mac_addr(nn);
 
/* Set default MTU and Freelist buffer size */
if (nn->max_mtu < NFP_NET_DEFAULT_MTU)
-- 
1.9.1



[PATCH net-next 3/3] nfp: implement ethtool .get_link() callback

2016-06-29 Thread Jakub Kicinski
Point the ethtool .get_link() callback to the standard
ethtool_op_get_link() implementation.

Signed-off-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c 
b/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c
index ccfef1f17627..7d7933d00b8f 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c
@@ -605,6 +605,7 @@ static int nfp_net_set_coalesce(struct net_device *netdev,
 
 static const struct ethtool_ops nfp_net_ethtool_ops = {
.get_drvinfo= nfp_net_get_drvinfo,
+   .get_link   = ethtool_op_get_link,
.get_ringparam  = nfp_net_get_ringparam,
.set_ringparam  = nfp_net_set_ringparam,
.get_strings= nfp_net_get_strings,
-- 
1.9.1



[PATCH net-next 1/3] nfp: correct name of control BAR define

2016-06-29 Thread Jakub Kicinski
Spell abbreviation of control as ctrl not crtl.

Signed-off-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/nfp_net.h| 2 +-
 drivers/net/ethernet/netronome/nfp/nfp_netvf_main.c | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net.h 
b/drivers/net/ethernet/netronome/nfp/nfp_net.h
index e744acc18ef4..690635660195 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net.h
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net.h
@@ -63,7 +63,7 @@
 #define NFP_NET_POLL_TIMEOUT   5
 
 /* Bar allocation */
-#define NFP_NET_CRTL_BAR   0
+#define NFP_NET_CTRL_BAR   0
 #define NFP_NET_Q0_BAR 2
 #define NFP_NET_Q1_BAR 4   /* OBSOLETE */
 
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_netvf_main.c 
b/drivers/net/ethernet/netronome/nfp/nfp_netvf_main.c
index e2b22b8a20f1..37abef016a0a 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_netvf_main.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_netvf_main.c
@@ -124,11 +124,11 @@ static int nfp_netvf_pci_probe(struct pci_dev *pdev,
 * first NFP_NET_CFG_BAR_SZ of the BAR.  This keeps the code
 * the identical for PF and VF drivers.
 */
-   ctrl_bar = ioremap_nocache(pci_resource_start(pdev, NFP_NET_CRTL_BAR),
+   ctrl_bar = ioremap_nocache(pci_resource_start(pdev, NFP_NET_CTRL_BAR),
   NFP_NET_CFG_BAR_SZ);
if (!ctrl_bar) {
dev_err(>dev,
-   "Failed to map resource %d\n", NFP_NET_CRTL_BAR);
+   "Failed to map resource %d\n", NFP_NET_CTRL_BAR);
err = -EIO;
goto err_pci_regions;
}
-- 
1.9.1



Re: Question on i40e PCIe relaxed ordering (RO)

2016-06-29 Thread Jeff Kirsher
On Wed, 2016-06-29 at 13:18 -0700, Greg wrote:
> On Wed, 2016-06-29 at 13:05 -0700, tndave wrote:
> > Hi,
> > 
> > Running iperf tcp test on 2 sparc systems with i40e connected back to
> > back, I see huge number of 'port.rx_dropped' (on iperf server). Based
> on
> > past experience with ixgbe, this could very well because of PCIe RO
> > (relaxed ordering) not enabled.
> > 
> > I am trying to confirm RO is enabled. i40e datasheet mentioned RO
> > settings in 3 different sections:
> > 
> > 1. section 10.2.2.2.38 PCIe Global Config 2 - GLPCI_CNF2 register
> > contains global status fields of PCIe configuration. The bit 0 of the
> > register is "RO_DIS". If this bit is set to 1 RO is disabled.
> > 
> > RO_DIS in my setup is 0 imply RO is not disabled.
> > 
> > 2. section 12.3.5.5 Device Control Register (0xA8; RW) has bit 4
> > that enable/disable RO. This is pcie cap register.
> > 
> > In my i40e pcie config space value at offset 0xA8 is "2110". i.e 4th
> bit
> > set to 1 imply RO is enabled.
> > 
> > 3. section 3.1.2.7.2 mentions some relaxed ordering rules
> > e.g. "The GLLAN_RCTL.RXDESCRDROEN bit (loaded from NVM) enables relaxed
> > ordering for Rx descriptor reads"
> > 
> > However, GLLAN_RCTL register definition has not bit like RXDESCRDROEN.
> > Same goes for GLLAN_TCTL.TXDESCRDROEN.
> > 
> > Am I missing anything? please advise.
> 
> I would try posting this question to the e1000 developer list over at
> Source Forge.  The Intel customer support folks used to monitor that
> list closely when I was there, hopefully they still are.

We are supposed to be monitoring both lists, but just in case I have added
e1000-devel list...

signature.asc
Description: This is a digitally signed message part


Re: [RFC 3/7] net: ethernet: bgmac: move BCMA MDIO Phy code into a separate file

2016-06-29 Thread Jon Mason
On Wed, Jun 29, 2016 at 4:15 PM, Andrew Lunn  wrote:
> On Wed, Jun 29, 2016 at 04:08:20PM -0400, Jon Mason wrote:
>> On Wed, Jun 29, 2016 at 2:46 PM, Andrew Lunn  wrote:
>> > On Wed, Jun 29, 2016 at 11:35:28AM -0700, Florian Fainelli wrote:
>> >> On 06/29/2016 07:13 AM, Andrew Lunn wrote:
>> >> > Hi Jon
>> >> >
>> >> > I know you are just refactoring code, but at some point it would be
>> >> > good to take a closer look at this MDIO bus driver.
>> >
>> >> And, to re-iterate all of your points are valid, but this is premature
>> >
>> > We agree then :-)
>> >
>> >Andrew
>>
>> I also agree with all of your points, but hope this is not something
>> that would prevent this patch series from being acceptable.
>
> No, it is acceptable as is.
>
> However, it would be nice the clean up the mess, especially if you are
> planning on using this code with new platforms, not just legacy stuff
> which is bit rotting.

I don't believe we'll be using the BCMA infrastructure going forward.
We'll be doing our own MDIO Phys, adding them to the device tree, and
referencing them through the DT/OF method.  This is the current plan
for NSP, NS2, Cygnus, and other upcoming platforms.  If someone can
find an elegant way of splitting off the Phy code from the BCMA core,
then this can be better than it is now.  But, it will take someone
with docs, time, and hardware to give this code the audit and cleanup
it really needs.  BTW, my last comment isn't intended as a slight
against the authors of this code.  It is unfortunate that they were
put in a position of having to reverse engineer drivers for this HW.

Thanks,
Jon

>
>   Andrew


Re: Question on i40e PCIe relaxed ordering (RO)

2016-06-29 Thread Greg
On Wed, 2016-06-29 at 13:05 -0700, tndave wrote:
> Hi,
> 
> Running iperf tcp test on 2 sparc systems with i40e connected back to
> back, I see huge number of 'port.rx_dropped' (on iperf server). Based on
> past experience with ixgbe, this could very well because of PCIe RO
> (relaxed ordering) not enabled.
> 
> I am trying to confirm RO is enabled. i40e datasheet mentioned RO
> settings in 3 different sections:
> 
> 1. section 10.2.2.2.38 PCIe Global Config 2 - GLPCI_CNF2 register
> contains global status fields of PCIe configuration. The bit 0 of the
> register is "RO_DIS". If this bit is set to 1 RO is disabled.
> 
> RO_DIS in my setup is 0 imply RO is not disabled.
> 
> 2. section 12.3.5.5 Device Control Register (0xA8; RW) has bit 4
> that enable/disable RO. This is pcie cap register.
> 
> In my i40e pcie config space value at offset 0xA8 is "2110". i.e 4th bit
> set to 1 imply RO is enabled.
> 
> 3. section 3.1.2.7.2 mentions some relaxed ordering rules
> e.g. "The GLLAN_RCTL.RXDESCRDROEN bit (loaded from NVM) enables relaxed
> ordering for Rx descriptor reads"
> 
> However, GLLAN_RCTL register definition has not bit like RXDESCRDROEN.
> Same goes for GLLAN_TCTL.TXDESCRDROEN.
> 
> Am I missing anything? please advise.

I would try posting this question to the e1000 developer list over at
Source Forge.  The Intel customer support folks used to monitor that
list closely when I was there, hopefully they still are.

Regards,

- Greg

> 
> Thanks.
> 
> -Tushar
> 
> (Ref:http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xl710-10-40-controller-datasheet.pdf)
> 
> 
> 




Re: [PATCH net-next v3] tcp: use RFC6298 compliant TCP RTO calculation

2016-06-29 Thread Daniel Metz

  BD  R Y
--
mean TCPRecovLat  3s  -7%   +39% +38%
mean TCPRecovLat252s +1%   -11%  -11%


This is indeed very interesting and somewhat unexpected. Do you have any 
clue why Y is as bad as R and so much worse than B? By my understanding 
I would have expected Y to be similar to B. At least tests on the Mean 
Response Waiting Time of sender limited flows show hardly any difference 
to B (as expected).


Also, is a potential longer time in TCPRecovLat such a bad thing 
considering your information on HTTP response performance?


Re: [PATCH] [v6] net: emac: emac gigabit ethernet controller driver

2016-06-29 Thread Timur Tabi

Arnd Bergmann wrote:


Sure, but in theory, my for-loop is correct, right?  Wouldn't there be
some value in setting a 36-bit or 40-bit DMA mask if it works?  We have
a platform where memory starts at a 40-bit address, so some devices have
a 44-bit address bus.  If a 64-bit mask doesn't work, then a 32-bit mask
certainly wont.


The question is whether it makes any difference to the driver: what decision
does the driver make differently if it has set a 44-bit mask?



We don't have ZONE_DMA44 (and it's very unlikely that we will introduce
it), so neither the driver nor the network stack can actually act on the
fact that a 64-bit mask failed but a 44-bit mask succeeded.


A 44-bit mask would have access to much larger regions of memory than a 
32-bit mask.  Maybe one day we will have a version of dma_alloc_coherent 
that uses the DMA mask to specify an actual range of physical addresses 
to allocate from.  In this situation, a mask of 44 would be much better 
than a mask of 32.


It just seems wrong for driver to hard-code 64-bit and 32-bit masks and 
pretend like those are the only two sizes.  Very few 64-bit SOCs have a 
full 64-bit address bus.  Most have actually around 36-48 bits.  If 
dma_set_mask(64) fails on these systems, then every driver will fall 
back to 32 bits, even though they can technically access all of DDR.


Maybe we need a function like this:

int dma_find_best_mask(struct device *dev, unsigned int max)
{
struct dma_map_ops *ops = get_dma_ops(dev);
unsigned int mask;
int ret;

if (!ops || !ops->dma_supported)
return max;

for (mask = max; mask >= 32; mask--) {
ret = ops->dma_supported(dev, mask);
if (ret < 0)
return ret;

if (ret > 0)
return mask;
}

return -ENOMEM;
}

You pass a mask size that you know is the maximum that the device could 
support (in my case, 64).  It then returns a number that should be 
passed to dma_set_mask().



Platforms will also allow allow the driver to set a mask that
is larger than what the bus supports, as long as all RAM is
reachable by the bus.


And that check (like all others) is made in the dma_set_mask call?


Yes. Regarding the system you mentioned: I understand that it has
no memory in ZONE_DMA32 and no IOMMU (btw, that also means 32-bit
PCI devices are fundamentally broken), and the bus can only address
the lower 44 bit of address space.


Actually, we have an IOMMU, and the size of the address space depends on 
the SOC.  Some SOCs have 32 bits, some have over 40.  The EMAC is the 
same on all of these SOCs.



Does this system support more than 15TB of RAM? If not, then there
is no problem, and if it  does, I think your best way out is not
to disable SWIOTLB. That will be a slower compared to a system
with an IOMMU or the fictional ZONE_DMA44, but it should work
correctly. In either case (less than 15TB without swiotlb, or
more than 15TB with swiotlb), the driver should just ask for
a 64-bit mask and that will succeed.


Some SOCs can support over 15TB.  They all have IOMMUs.

Apparently, dma_set_mask(64) won't always succeed.  If the SOC has fewer 
DMA address lines than total memory, then a mask of 64 will fail.


I don't want to implement a solution that just happens to work on the 
EMAC.  I'm trying to come up with a generic solution that can work in 
any driver on any SOC, whether you use device tree or ACPI.


--
Qualcomm Innovation Center, Inc.
The Qualcomm Innovation Center, Inc. is a member of the Code Aurora
Forum, a Linux Foundation collaborative project.


Re: [RFC 3/7] net: ethernet: bgmac: move BCMA MDIO Phy code into a separate file

2016-06-29 Thread Andrew Lunn
On Wed, Jun 29, 2016 at 04:08:20PM -0400, Jon Mason wrote:
> On Wed, Jun 29, 2016 at 2:46 PM, Andrew Lunn  wrote:
> > On Wed, Jun 29, 2016 at 11:35:28AM -0700, Florian Fainelli wrote:
> >> On 06/29/2016 07:13 AM, Andrew Lunn wrote:
> >> > Hi Jon
> >> >
> >> > I know you are just refactoring code, but at some point it would be
> >> > good to take a closer look at this MDIO bus driver.
> >
> >> And, to re-iterate all of your points are valid, but this is premature
> >
> > We agree then :-)
> >
> >Andrew
> 
> I also agree with all of your points, but hope this is not something
> that would prevent this patch series from being acceptable.

No, it is acceptable as is.

However, it would be nice the clean up the mess, especially if you are
planning on using this code with new platforms, not just legacy stuff
which is bit rotting.

  Andrew


Re: [RFC 1/7] net: ethernet: bgmac: change bgmac_* prints to dev_* prints

2016-06-29 Thread Jon Mason
On Tue, Jun 28, 2016 at 3:43 PM, Joe Perches  wrote:
> On Tue, 2016-06-28 at 15:34 -0400, Jon Mason wrote:
>> The bgmac_* print wrappers call dev_* prints with the dev pointer from
>> the bcma core.  In anticipation of removing the bcma requirement for
>> this driver, these must be changed to not reference that struct.  So,
>> simply change all of the bgmac_* prints to their dev_* counterparts.  In
>> some cases netdev_* prints are more appropriate, so change those as
>> well.
>
> Thanks, but...
>
>> diff --git a/drivers/net/ethernet/broadcom/bgmac.c 
>> b/drivers/net/ethernet/broadcom/bgmac.c
> []
>> @@ -1515,7 +1516,7 @@ static void bgmac_get_drvinfo(struct net_device 
>> *net_dev,
>> struct ethtool_drvinfo *info)
>>  {
>>   strlcpy(info->driver, KBUILD_MODNAME, sizeof(info->driver));
>> - strlcpy(info->bus_info, "BCMA", sizeof(info->bus_info));
>> + strlcpy(info->bus_info, "AXI", sizeof(info->bus_info));
>>  }
>
> Unrelated change?

This is a needed change, but unrelated to this patch.  I'll move it to
the 5/7 patch, where it more logically fits.

Thanks,
Jon

>


Re: [RFC 3/7] net: ethernet: bgmac: move BCMA MDIO Phy code into a separate file

2016-06-29 Thread Jon Mason
On Wed, Jun 29, 2016 at 2:46 PM, Andrew Lunn  wrote:
> On Wed, Jun 29, 2016 at 11:35:28AM -0700, Florian Fainelli wrote:
>> On 06/29/2016 07:13 AM, Andrew Lunn wrote:
>> > Hi Jon
>> >
>> > I know you are just refactoring code, but at some point it would be
>> > good to take a closer look at this MDIO bus driver.
>
>> And, to re-iterate all of your points are valid, but this is premature
>
> We agree then :-)
>
>Andrew

I also agree with all of your points, but hope this is not something
that would prevent this patch series from being acceptable.  It could
be quite an effort to try and properly document the code and what it
is doing, get HW to verify the assumptions from the documentation (as
the MIPS HW was done by a different group, no hardware is available,
etc), relocate the code to the drivers/net/phy and add a standard
interface.  From your original comment, I believe that you agree that
this additional effort is not necessary at this point in time.

Thanks,
Jon


Question on i40e PCIe relaxed ordering (RO)

2016-06-29 Thread tndave

Hi,

Running iperf tcp test on 2 sparc systems with i40e connected back to
back, I see huge number of 'port.rx_dropped' (on iperf server). Based on
past experience with ixgbe, this could very well because of PCIe RO
(relaxed ordering) not enabled.

I am trying to confirm RO is enabled. i40e datasheet mentioned RO
settings in 3 different sections:

1. section 10.2.2.2.38 PCIe Global Config 2 - GLPCI_CNF2 register
contains global status fields of PCIe configuration. The bit 0 of the
register is "RO_DIS". If this bit is set to 1 RO is disabled.

RO_DIS in my setup is 0 imply RO is not disabled.

2. section 12.3.5.5 Device Control Register (0xA8; RW) has bit 4
that enable/disable RO. This is pcie cap register.

In my i40e pcie config space value at offset 0xA8 is "2110". i.e 4th bit
set to 1 imply RO is enabled.

3. section 3.1.2.7.2 mentions some relaxed ordering rules
e.g. "The GLLAN_RCTL.RXDESCRDROEN bit (loaded from NVM) enables relaxed
ordering for Rx descriptor reads"

However, GLLAN_RCTL register definition has not bit like RXDESCRDROEN.
Same goes for GLLAN_TCTL.TXDESCRDROEN.

Am I missing anything? please advise.

Thanks.

-Tushar

(Ref:http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xl710-10-40-controller-datasheet.pdf)





[net-next PATCH v2 2/2] net: samples: pktgen mode samples/tests for qdisc layer

2016-06-29 Thread John Fastabend
This adds samples for pktgen to use with new mode to inject pkts into
the qdisc layer. This also doubles as nice test cases to test any
patches against qdisc layer.

Signed-off-by: John Fastabend 
---
 .../pktgen/pktgen_bench_xmit_mode_queue_xmit.sh|   70 
 1 file changed, 70 insertions(+)
 create mode 100755 samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh

diff --git a/samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh 
b/samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh
new file mode 100755
index 000..eee06cc
--- /dev/null
+++ b/samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh
@@ -0,0 +1,70 @@
+#!/bin/bash
+#
+# Benchmark script:
+#  - developed for benchmarking egress qdisc path, derived from
+#ingress benchmark script.
+#
+# Script for injecting packets into egress qdisc path of the stack
+# with pktgen "xmit_mode queue_xmit".
+#
+basedir=`dirname $0`
+source ${basedir}/functions.sh
+root_check_run_with_sudo "$@"
+
+# Parameter parsing via include
+source ${basedir}/parameters.sh
+# Using invalid DST_MAC will cause the packets to get dropped in
+# ip_rcv() which is part of the test
+[ -z "$DEST_IP" ] && DEST_IP="198.18.0.42"
+[ -z "$DST_MAC" ] && DST_MAC="90:e2:ba:ff:ff:ff"
+
+# Burst greater than 1 are invalid but allow users to specify it and
+# get an error instead of silently ignoring it.
+[ -z "$BURST" ] && BURST=1
+
+# Base Config
+DELAY="0"# Zero means max speed
+COUNT="1000" # Zero means indefinitely
+
+# General cleanup everything since last run
+pg_ctrl "reset"
+
+# Threads are specified with parameter -t value in $THREADS
+for ((thread = 0; thread < $THREADS; thread++)); do
+# The device name is extended with @name, using thread number to
+# make then unique, but any name will do.
+dev=${DEV}@${thread}
+
+# Add remove all other devices and add_device $dev to thread
+pg_thread $thread "rem_device_all"
+pg_thread $thread "add_device" $dev
+
+# Base config of dev
+pg_set $dev "flag QUEUE_MAP_CPU"
+pg_set $dev "count $COUNT"
+pg_set $dev "pkt_size $PKT_SIZE"
+pg_set $dev "delay $DELAY"
+pg_set $dev "flag NO_TIMESTAMP"
+
+# Destination
+pg_set $dev "dst_mac $DST_MAC"
+pg_set $dev "dst $DEST_IP"
+
+# Inject packet into RX path of stack
+pg_set $dev "xmit_mode queue_xmit"
+
+# Burst allow us to avoid measuring SKB alloc/free overhead
+pg_set $dev "burst $BURST"
+done
+
+# start_run
+echo "Running... ctrl^C to stop" >&2
+pg_ctrl "start"
+echo "Done" >&2
+
+# Print results
+for ((thread = 0; thread < $THREADS; thread++)); do
+dev=${DEV}@${thread}
+echo "Device: $dev"
+cat /proc/net/pktgen/$dev | grep -A2 "Result:"
+done



[net-next PATCH v2 1/2] net: pktgen: support injecting packets for qdisc testing

2016-06-29 Thread John Fastabend
Add another xmit_mode to pktgen to allow testing xmit functionality
of qdiscs. The new mode "queue_xmit" injects packets at
__dev_queue_xmit() so that qdisc is called.

Signed-off-by: John Fastabend 
---
 net/core/pktgen.c |   42 --
 1 file changed, 40 insertions(+), 2 deletions(-)

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index f74ab9c..4b3d467 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -213,6 +213,7 @@
 /* Xmit modes */
 #define M_START_XMIT   0   /* Default normal TX */
 #define M_NETIF_RECEIVE1   /* Inject packets into stack */
+#define M_QUEUE_XMIT   2   /* Inject packet into qdisc */
 
 /* If lock -- protects updating of if_list */
 #define   if_lock(t)   spin_lock(&(t->if_lock));
@@ -626,6 +627,8 @@ static int pktgen_if_show(struct seq_file *seq, void *v)
 
if (pkt_dev->xmit_mode == M_NETIF_RECEIVE)
seq_puts(seq, " xmit_mode: netif_receive\n");
+   else if (pkt_dev->xmit_mode == M_QUEUE_XMIT)
+   seq_puts(seq, " xmit_mode: xmit_queue\n");
 
seq_puts(seq, " Flags: ");
 
@@ -1142,8 +1145,10 @@ static ssize_t pktgen_if_write(struct file *file,
return len;
 
i += len;
-   if ((value > 1) && (pkt_dev->xmit_mode == M_START_XMIT) &&
-   (!(pkt_dev->odev->priv_flags & IFF_TX_SKB_SHARING)))
+   if ((value > 1) &&
+   ((pkt_dev->xmit_mode == M_QUEUE_XMIT) ||
+((pkt_dev->xmit_mode == M_START_XMIT) &&
+(!(pkt_dev->odev->priv_flags & IFF_TX_SKB_SHARING)
return -ENOTSUPP;
pkt_dev->burst = value < 1 ? 1 : value;
sprintf(pg_result, "OK: burst=%d", pkt_dev->burst);
@@ -1198,6 +1203,9 @@ static ssize_t pktgen_if_write(struct file *file,
 * at module loading time
 */
pkt_dev->clone_skb = 0;
+   } else if (strcmp(f, "queue_xmit") == 0) {
+   pkt_dev->xmit_mode = M_QUEUE_XMIT;
+   pkt_dev->last_ok = 1;
} else {
sprintf(pg_result,
"xmit_mode -:%s:- unknown\nAvailable modes: %s",
@@ -3434,6 +3442,36 @@ static void pktgen_xmit(struct pktgen_dev *pkt_dev)
 #endif
} while (--burst > 0);
goto out; /* Skips xmit_mode M_START_XMIT */
+   } else if (pkt_dev->xmit_mode == M_QUEUE_XMIT) {
+   local_bh_disable();
+   atomic_add(burst, _dev->skb->users);
+
+   ret = dev_queue_xmit(pkt_dev->skb);
+   switch (ret) {
+   case NET_XMIT_SUCCESS:
+   pkt_dev->sofar++;
+   pkt_dev->seq_num++;
+   pkt_dev->tx_bytes += pkt_dev->last_pkt_size;
+   break;
+   case NET_XMIT_DROP:
+   case NET_XMIT_CN:
+   /* These are all valid return codes for a qdisc but
+* indicate packets are being dropped or will likely
+* be dropped soon.
+*/
+   case NETDEV_TX_BUSY:
+   /* qdisc may call dev_hard_start_xmit directly in cases
+* where no queues exist e.g. loopback device, virtual
+* devices, etc. In this case we need to handle
+* NETDEV_TX_ codes.
+*/
+   default:
+   pkt_dev->errors++;
+   net_info_ratelimited("%s xmit error: %d\n",
+pkt_dev->odevname, ret);
+   break;
+   }
+   goto out;
}
 
txq = skb_get_tx_queue(odev, pkt_dev->skb);



Re: [PATCH 0/2] brcmfmac: support removing AP interfaces with new fw

2016-06-29 Thread Rafał Miłecki
On 29 June 2016 at 21:54, Rafał Miłecki  wrote:
> This is the latest patchset needed to get brcmfmac working reasonably well
> with BCM4366.
>
> Both patches were already sent as V2 RFC (10 days ago), there were no more
> comments since then and this is the same code as in V2 RFC. I was mostly 
> waiting
> for accepting
> brcmfmac: fix lockup when removing P2P interface after event timeout
> before sending these 2 (because of dependency).
>
> Rafał Miłecki (2):
>   brcmfmac: delete interface directly in code that sent fw request
>   brcmfmac: support removing AP interfaces with "interface_remove"
>
>  .../broadcom/brcm80211/brcmfmac/cfg80211.c | 39 
> +-
>  .../wireless/broadcom/brcm80211/brcmfmac/fweh.c| 10 --
>  .../net/wireless/broadcom/brcm80211/brcmfmac/p2p.c |  3 +-
>  3 files changed, 47 insertions(+), 5 deletions(-)

Oops, I just realized get_maintainer.pl didn't pick any ppl/ML for
this cover letter. Added them now.

-- 
Rafał


[PATCH 2/2] brcmfmac: support removing AP interfaces with "interface_remove"

2016-06-29 Thread Rafał Miłecki
New firmwares (e.g. 10.10.69.36 for BCM4366) support "interface_remove"
for removing interfaces. Try to use this method on cfg80211 request. In
case of older firmwares (e.g. 7.35.177.56 for BCM43602 as I tested) this
will just result in firmware rejecting command and this won't change any
behavior.

Signed-off-by: Rafał Miłecki 
---
 .../broadcom/brcm80211/brcmfmac/cfg80211.c | 39 +-
 1 file changed, 38 insertions(+), 1 deletion(-)

diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c 
b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c
index a5db953..6e6066a 100644
--- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c
+++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c
@@ -785,12 +785,48 @@ s32 brcmf_notify_escan_complete(struct 
brcmf_cfg80211_info *cfg,
return err;
 }
 
+static int brcmf_cfg80211_del_ap_iface(struct wiphy *wiphy,
+  struct wireless_dev *wdev)
+{
+   struct brcmf_cfg80211_info *cfg = wiphy_priv(wiphy);
+   struct net_device *ndev = wdev->netdev;
+   struct brcmf_if *ifp = netdev_priv(ndev);
+   int ret;
+   int err;
+
+   brcmf_cfg80211_arm_vif_event(cfg, ifp->vif);
+
+   err = brcmf_fil_bsscfg_data_set(ifp, "interface_remove", NULL, 0);
+   if (err) {
+   brcmf_err("interface_remove failed %d\n", err);
+   goto err_unarm;
+   }
+
+   /* wait for firmware event */
+   ret = brcmf_cfg80211_wait_vif_event(cfg, BRCMF_E_IF_DEL,
+   BRCMF_VIF_EVENT_TIMEOUT);
+   if (!ret) {
+   brcmf_err("timeout occurred\n");
+   err = -EIO;
+   goto err_unarm;
+   }
+
+   brcmf_remove_interface(ifp, true);
+
+err_unarm:
+   brcmf_cfg80211_arm_vif_event(cfg, NULL);
+   return err;
+}
+
 static
 int brcmf_cfg80211_del_iface(struct wiphy *wiphy, struct wireless_dev *wdev)
 {
struct brcmf_cfg80211_info *cfg = wiphy_priv(wiphy);
struct net_device *ndev = wdev->netdev;
 
+   if (ndev && ndev == cfg_to_ndev(cfg))
+   return -ENOTSUPP;
+
/* vif event pending in firmware */
if (brcmf_cfg80211_vif_event_armed(cfg))
return -EBUSY;
@@ -807,12 +843,13 @@ int brcmf_cfg80211_del_iface(struct wiphy *wiphy, struct 
wireless_dev *wdev)
switch (wdev->iftype) {
case NL80211_IFTYPE_ADHOC:
case NL80211_IFTYPE_STATION:
-   case NL80211_IFTYPE_AP:
case NL80211_IFTYPE_AP_VLAN:
case NL80211_IFTYPE_WDS:
case NL80211_IFTYPE_MONITOR:
case NL80211_IFTYPE_MESH_POINT:
return -EOPNOTSUPP;
+   case NL80211_IFTYPE_AP:
+   return brcmf_cfg80211_del_ap_iface(wiphy, wdev);
case NL80211_IFTYPE_P2P_CLIENT:
case NL80211_IFTYPE_P2P_GO:
case NL80211_IFTYPE_P2P_DEVICE:
-- 
1.8.4.5



[PATCH 1/2] brcmfmac: delete interface directly in code that sent fw request

2016-06-29 Thread Rafał Miłecki
So far when receiving event about in-firmware-interface removal our
event worker was notifying listener and afterwards it was removing Linux
interface.

First of all it was resulting in slightly unexpected order. The listener
(del_virtual_intf callback) was (usually) returning with success before
we even called unregister_netdev(ice).

Please note this couldn't be simply fixed by changing order of calls in
brcmf_fweh_handle_if_event as unregistering interface earlier could free
struct brcmf_if.

Another problem of current implementation are possible lockups. Focus on
the time slot between calling event handler and removing Linux
interface. During that time original caller may leave (unlocking rtnl
semaphore) *and* another call to the same code may be done (locking it
again). If that happens our event handler will stuck at removing Linux
interface, it won't handle another event and will block process holding
rtnl lock.

This can be simply solved by unregistering interface in a proper
callback, right after receiving confirmation event from firmware. This
only required modifying worker to don't unregister on its own if there
is someone waiting for the event.

Signed-off-by: Rafał Miłecki 
---
 drivers/net/wireless/broadcom/brcm80211/brcmfmac/fweh.c | 10 --
 drivers/net/wireless/broadcom/brcm80211/brcmfmac/p2p.c  |  3 +--
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/fweh.c 
b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/fweh.c
index 9da7a4c..79c081f 100644
--- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/fweh.c
+++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/fweh.c
@@ -18,6 +18,7 @@
 #include "brcmu_wifi.h"
 #include "brcmu_utils.h"
 
+#include "cfg80211.h"
 #include "core.h"
 #include "debug.h"
 #include "tracepoint.h"
@@ -182,8 +183,13 @@ static void brcmf_fweh_handle_if_event(struct brcmf_pub 
*drvr,
 
err = brcmf_fweh_call_event_handler(ifp, emsg->event_code, emsg, data);
 
-   if (ifp && ifevent->action == BRCMF_E_IF_DEL)
-   brcmf_remove_interface(ifp, false);
+   if (ifp && ifevent->action == BRCMF_E_IF_DEL) {
+   bool armed = brcmf_cfg80211_vif_event_armed(drvr->config);
+
+   /* Default handling in case no-one waits for this event */
+   if (!armed)
+   brcmf_remove_interface(ifp, false);
+   }
 }
 
 /**
diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/p2p.c 
b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/p2p.c
index f6241fd..66f942f 100644
--- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/p2p.c
+++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/p2p.c
@@ -2288,8 +2288,7 @@ int brcmf_p2p_del_vif(struct wiphy *wiphy, struct 
wireless_dev *wdev)
else
err = 0;
}
-   if (err)
-   brcmf_remove_interface(vif->ifp, true);
+   brcmf_remove_interface(vif->ifp, true);
 
brcmf_cfg80211_arm_vif_event(cfg, NULL);
if (vif->wdev.iftype != NL80211_IFTYPE_P2P_DEVICE)
-- 
1.8.4.5



Re: [PATCH] [v6] net: emac: emac gigabit ethernet controller driver

2016-06-29 Thread Arnd Bergmann
On Wednesday, June 29, 2016 10:46:23 AM CEST Timur Tabi wrote:
> Arnd Bergmann wrote:
> > Usually drivers try 64-bit mask and 32-bit masks, and the 32 bit
> > mask is practically guaranteed to succeed.
> 
> Sure, but in theory, my for-loop is correct, right?  Wouldn't there be 
> some value in setting a 36-bit or 40-bit DMA mask if it works?  We have 
> a platform where memory starts at a 40-bit address, so some devices have 
> a 44-bit address bus.  If a 64-bit mask doesn't work, then a 32-bit mask 
> certainly wont.

The question is whether it makes any difference to the driver: what decision
does the driver make differently if it has set a 44-bit mask?

We don't have ZONE_DMA44 (and it's very unlikely that we will introduce
it), so neither the driver nor the network stack can actually act on the
fact that a 64-bit mask failed but a 44-bit mask succeeded.

> > Platforms will also allow allow the driver to set a mask that
> > is larger than what the bus supports, as long as all RAM is
> > reachable by the bus.
> 
> And that check (like all others) is made in the dma_set_mask call?

Yes. Regarding the system you mentioned: I understand that it has
no memory in ZONE_DMA32 and no IOMMU (btw, that also means 32-bit
PCI devices are fundamentally broken), and the bus can only address
the lower 44 bit of address space.

Does this system support more than 15TB of RAM? If not, then there
is no problem, and if it  does, I think your best way out is not
to disable SWIOTLB. That will be a slower compared to a system
with an IOMMU or the fictional ZONE_DMA44, but it should work
correctly. In either case (less than 15TB without swiotlb, or
more than 15TB with swiotlb), the driver should just ask for
a 64-bit mask and that will succeed.

Arnd


[net-next PATCH 2/2] net: samples: pktgen mode samples/tests for qdisc layer

2016-06-29 Thread John Fastabend
This adds samples for pktgen to use with new mode to inject pkts into
the qdisc layer. This also doubles as nice test cases to test any
patches against qdisc layer.

Signed-off-by: John Fastabend 
---
 .../pktgen/pktgen_bench_xmit_mode_queue_xmit.sh|   67 
 1 file changed, 67 insertions(+)
 create mode 100755 samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh

diff --git a/samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh 
b/samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh
new file mode 100755
index 000..f1aaef0
--- /dev/null
+++ b/samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh
@@ -0,0 +1,67 @@
+#!/bin/bash
+#
+# Benchmark script:
+#  - developed for benchmarking egress qdisc path, derived from
+#ingress benchmark script.
+#
+# Script for injecting packets into egress qdisc path of the stack
+# with pktgen "xmit_mode queue_xmit".
+#
+basedir=`dirname $0`
+source ${basedir}/functions.sh
+root_check_run_with_sudo "$@"
+
+# Parameter parsing via include
+source ${basedir}/parameters.sh
+# Using invalid DST_MAC will cause the packets to get dropped in
+# ip_rcv() which is part of the test
+[ -z "$DEST_IP" ] && DEST_IP="198.18.0.42"
+[ -z "$DST_MAC" ] && DST_MAC="90:e2:ba:ff:ff:ff"
+[ -z "$BURST" ] && BURST=1024
+
+# Base Config
+DELAY="0"# Zero means max speed
+COUNT="1000" # Zero means indefinitely
+
+# General cleanup everything since last run
+pg_ctrl "reset"
+
+# Threads are specified with parameter -t value in $THREADS
+for ((thread = 0; thread < $THREADS; thread++)); do
+# The device name is extended with @name, using thread number to
+# make then unique, but any name will do.
+dev=${DEV}@${thread}
+
+# Add remove all other devices and add_device $dev to thread
+pg_thread $thread "rem_device_all"
+pg_thread $thread "add_device" $dev
+
+# Base config of dev
+pg_set $dev "flag QUEUE_MAP_CPU"
+pg_set $dev "count $COUNT"
+pg_set $dev "pkt_size $PKT_SIZE"
+pg_set $dev "delay $DELAY"
+pg_set $dev "flag NO_TIMESTAMP"
+
+# Destination
+pg_set $dev "dst_mac $DST_MAC"
+pg_set $dev "dst $DEST_IP"
+
+# Inject packet into RX path of stack
+pg_set $dev "xmit_mode queue_xmit"
+
+# Burst allow us to avoid measuring SKB alloc/free overhead
+pg_set $dev "burst $BURST"
+done
+
+# start_run
+echo "Running... ctrl^C to stop" >&2
+pg_ctrl "start"
+echo "Done" >&2
+
+# Print results
+for ((thread = 0; thread < $THREADS; thread++)); do
+dev=${DEV}@${thread}
+echo "Device: $dev"
+cat /proc/net/pktgen/$dev | grep -A2 "Result:"
+done



[net-next PATCH 1/2] net: pktgen: support injecting packets for qdisc testing

2016-06-29 Thread John Fastabend
Add another xmit_mode to pktgen to allow testing xmit functionality
of qdiscs. The new mode "queue_xmit" injects packets at
__dev_queue_xmit() so that qdisc is called.

Signed-off-by: John Fastabend 
---
 net/core/pktgen.c |   42 --
 1 file changed, 40 insertions(+), 2 deletions(-)

diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index f74ab9c..4b3d467 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -213,6 +213,7 @@
 /* Xmit modes */
 #define M_START_XMIT   0   /* Default normal TX */
 #define M_NETIF_RECEIVE1   /* Inject packets into stack */
+#define M_QUEUE_XMIT   2   /* Inject packet into qdisc */
 
 /* If lock -- protects updating of if_list */
 #define   if_lock(t)   spin_lock(&(t->if_lock));
@@ -626,6 +627,8 @@ static int pktgen_if_show(struct seq_file *seq, void *v)
 
if (pkt_dev->xmit_mode == M_NETIF_RECEIVE)
seq_puts(seq, " xmit_mode: netif_receive\n");
+   else if (pkt_dev->xmit_mode == M_QUEUE_XMIT)
+   seq_puts(seq, " xmit_mode: xmit_queue\n");
 
seq_puts(seq, " Flags: ");
 
@@ -1142,8 +1145,10 @@ static ssize_t pktgen_if_write(struct file *file,
return len;
 
i += len;
-   if ((value > 1) && (pkt_dev->xmit_mode == M_START_XMIT) &&
-   (!(pkt_dev->odev->priv_flags & IFF_TX_SKB_SHARING)))
+   if ((value > 1) &&
+   ((pkt_dev->xmit_mode == M_QUEUE_XMIT) ||
+((pkt_dev->xmit_mode == M_START_XMIT) &&
+(!(pkt_dev->odev->priv_flags & IFF_TX_SKB_SHARING)
return -ENOTSUPP;
pkt_dev->burst = value < 1 ? 1 : value;
sprintf(pg_result, "OK: burst=%d", pkt_dev->burst);
@@ -1198,6 +1203,9 @@ static ssize_t pktgen_if_write(struct file *file,
 * at module loading time
 */
pkt_dev->clone_skb = 0;
+   } else if (strcmp(f, "queue_xmit") == 0) {
+   pkt_dev->xmit_mode = M_QUEUE_XMIT;
+   pkt_dev->last_ok = 1;
} else {
sprintf(pg_result,
"xmit_mode -:%s:- unknown\nAvailable modes: %s",
@@ -3434,6 +3442,36 @@ static void pktgen_xmit(struct pktgen_dev *pkt_dev)
 #endif
} while (--burst > 0);
goto out; /* Skips xmit_mode M_START_XMIT */
+   } else if (pkt_dev->xmit_mode == M_QUEUE_XMIT) {
+   local_bh_disable();
+   atomic_add(burst, _dev->skb->users);
+
+   ret = dev_queue_xmit(pkt_dev->skb);
+   switch (ret) {
+   case NET_XMIT_SUCCESS:
+   pkt_dev->sofar++;
+   pkt_dev->seq_num++;
+   pkt_dev->tx_bytes += pkt_dev->last_pkt_size;
+   break;
+   case NET_XMIT_DROP:
+   case NET_XMIT_CN:
+   /* These are all valid return codes for a qdisc but
+* indicate packets are being dropped or will likely
+* be dropped soon.
+*/
+   case NETDEV_TX_BUSY:
+   /* qdisc may call dev_hard_start_xmit directly in cases
+* where no queues exist e.g. loopback device, virtual
+* devices, etc. In this case we need to handle
+* NETDEV_TX_ codes.
+*/
+   default:
+   pkt_dev->errors++;
+   net_info_ratelimited("%s xmit error: %d\n",
+pkt_dev->odevname, ret);
+   break;
+   }
+   goto out;
}
 
txq = skb_get_tx_queue(odev, pkt_dev->skb);



Re: [PATCH v4] fib_rules: Added NLM_F_EXCL support to fib_nl_newrule

2016-06-29 Thread David Ahern

On 6/29/16 1:22 AM, Mateusz Bajorski wrote:

When adding rule with NLM_F_EXCL flag then check if the same rule exist.
If yes then exit with -EEXIST.

This is already implemented in iproute2:
if (cmd == RTM_NEWRULE) {
req.n.nlmsg_flags |= NLM_F_CREATE|NLM_F_EXCL;
req.r.rtm_type = RTN_UNICAST;
}

Tested ipv4 and ipv6 with net-next linux on qemu x86

expected behavior after patch:
localhost ~ # ip rule
0:from all lookup local
32766:from all lookup main
32767:from all lookup default
localhost ~ # ip rule add from 10.46.177.97 lookup 104 pref 1005
localhost ~ # ip rule add from 10.46.177.97 lookup 104 pref 1005
RTNETLINK answers: File exists
localhost ~ # ip rule
0:from all lookup local
1005:from 10.46.177.97 lookup 104
32766:from all lookup main
32767:from all lookup default

There was already topic regarding this but I don't see any changes
merged and problem still occurs.
https://lkml.kernel.org/r/1135778809.5944.7.camel+%28%29+localhost+%21+localdomain

Signed-off-by: Mateusz Bajorski 
---
Changes in v2: section moved to new place where new rule is already built
Changes in v3: compare moved to helper, added fr_net compare
Changes in v4: added l3mdev compare

 net/core/fib_rules.c | 49 +
 1 file changed, 49 insertions(+)



Acked-by: David Ahern 




[PATCH iproute2] bridge: man: fix STP LISTENING description

2016-06-29 Thread Vivien Didelot
Correct the unclear and poorly conjugated STP LISTENING documentation.

Signed-off-by: Vivien Didelot 
---
 man/man8/bridge.8 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/man/man8/bridge.8 b/man/man8/bridge.8
index ac42118..8a39673 100644
--- a/man/man8/bridge.8
+++ b/man/man8/bridge.8
@@ -235,7 +235,7 @@ error.
 
 .B 1
 - STP LISTENING state. Only valid if STP is enabled on the bridge. In this
-state the port for list for STP BPDUs and drop all other traffic.
+state the port listens for STP BPDUs and drops all other traffic frames.
 .sp
 
 .B 2
-- 
2.9.0



[PATCH iproute2] bridge: man: fix BPUD typo

2016-06-29 Thread Vivien Didelot
s/BPUD/BPDU/ in guard description.

Signed-off-by: Vivien Didelot 
---
 man/man8/bridge.8 | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/man/man8/bridge.8 b/man/man8/bridge.8
index 8a39673..ef4f83e 100644
--- a/man/man8/bridge.8
+++ b/man/man8/bridge.8
@@ -256,7 +256,7 @@ STP BPDUs.
 
 .TP
 .BR "guard on " or " guard off "
-Controls whether STP BPUDs will be processed by the bridge port. By default,
+Controls whether STP BPDUs will be processed by the bridge port. By default,
 the flag is turned off allowed BPDU processing. Turning this flag on will
 cause the port to stop processing STP BPDUs.
 
-- 
2.9.0



net.git commit 3bb549ae4c51

2016-06-29 Thread Sowmini Varadhan
Hi,

I'm getting ready with the next installment of changes to rds-tcp
for mprds, and I noticed that commit 3bb549ae4c51 ("RDS: TCP:
rds_tcp_accept_one() should transition socket from RESETTING to UP")
has not made its way to net-next.  I also noticed that (even though this
is a one-liner), this patch generates a conflict when applied as-is
to net-next.

My planned changes will cause more turbulence around this piece 
of code, so they depend on this patch. What is the advised approach
here? Should I just port this patch over to net-next as part
of my patch-set, and make a note about the port, or something else?

--Sowmini



[PATCH net v3] usbnet: Stop RX Q on MTU change

2016-06-29 Thread soohoon . lee


When MTU is changed unlink_urbs() flushes RX Q but mean while usbnet_bh()
can fill up the Q at the same time.
Depends on which HCD is down there unlink takes long time then the flush
never ends.

Signed-off-by: Soohoon Lee 
Reviewed-by: Kimball Murray 

---

 drivers/net/usb/usbnet.c |   10 +++---
 1 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
index 61ba464..6086a01 100644
--- a/drivers/net/usb/usbnet.c
+++ b/drivers/net/usb/usbnet.c
@@ -395,8 +395,11 @@ int usbnet_change_mtu (struct net_device *net, int new_mtu)
dev->hard_mtu = net->mtu + net->hard_header_len;
if (dev->rx_urb_size == old_hard_mtu) {
dev->rx_urb_size = dev->hard_mtu;
-   if (dev->rx_urb_size > old_rx_urb_size)
+   if (dev->rx_urb_size > old_rx_urb_size) {
+   usbnet_pause_rx(dev);
usbnet_unlink_rx_urbs(dev);
+   usbnet_resume_rx(dev);
+   }
}
 
/* max qlen depend on hard_mtu and rx_urb_size */
@@ -1508,8 +1511,9 @@ static void usbnet_bh (unsigned long param)
} else if (netif_running (dev->net) &&
   netif_device_present (dev->net) &&
   netif_carrier_ok(dev->net) &&
-  !timer_pending (>delay) &&
-  !test_bit (EVENT_RX_HALT, >flags)) {
+  !timer_pending(>delay) &&
+  !test_bit(EVENT_RX_PAUSED, >flags) &&
+  !test_bit(EVENT_RX_HALT, >flags)) {
int temp = dev->rxq.qlen;
 
if (temp < RX_QLEN(dev)) {


Re: [RFC 0/7] net: ethernet: bgmac: Add platform device support

2016-06-29 Thread Florian Fainelli
On 06/28/2016 12:34 PM, Jon Mason wrote:
> I'm sending out this RFC to see if this is the direction the maintainers
> would like to go to add support for other, non-bcma iProc SoC's to the
> bgmac driver.  Specifically, we are interested in adding support for the
> NSP, Cygnus, and NS2 families (with more possible down the road).
> 
> To support non-bcma enabled SoCs, we need to add the standard device
> tree "platform device" support.  Unfortunately, this driver is very
> tighly coupled with the bcma bus and much unwinding is needed.  I tried
> to break this up into a number of patches to make it more obvious what
> was being done to add platform device support.  I was able to verify
> that the bcma code still works using a 53012K board (NS SoC), and that
> the platform code works using a 58625K board (NSP SoC).
> 
> It is worth noting that the phy logic present in the driver needs to be
> moved to drivers/phy.  However, I was not able to fully decouple that
> code from the bgmac driver.  I was able to move it into a separate C
> file, with only 2 function calls needed to create and destroy the mii
> bus.  Someone with more knowledge of this and HW to test it needs to do
> it properly.  This would natually dovetail into creating an interface
> which the NSP bgmac can use for the external MDIO Phy to properly
> connect (instead of using the fixed phy).

This looks very good to me, and I just tested this with a BCM58625 w/
b53-srab on top of that, and everything works nicely:

Tested-by: Florian Fainelli 
-- 
Florian


Re: [RFC 5/7] net: ethernet: bgmac: Add platform device support

2016-06-29 Thread Florian Fainelli
On 06/28/2016 12:34 PM, Jon Mason wrote:
> The bcma portion of the driver has been split off into a bcma specific
> driver.  This has been mirrored for the platform driver.  The last
> references to the bcma core struct have been changed into a generic
> function call.  These function calls are wrappers to either the original
> bcma code or new platform functions that access the same areas via MMIO.
> This necessitated adding function pointers for both platform and bcma to
> hide which backend is being used from the generic bgmac code.
> 
> Signed-off-by: Jon Mason 
> ---

[snip]

> +static int bgmac_probe(struct platform_device *pdev)
> +{
> + struct device_node *np = pdev->dev.of_node;
> + struct bgmac *bgmac;
> + struct resource regs;
> + const u8 *mac_addr;
> + int rc;
> +
> + bgmac = kzalloc(sizeof(*bgmac), GFP_KERNEL);

You could utilize devm_kzalloc() here which simplifies the error path.

> + if (!bgmac)
> + return -ENOMEM;
> +
> + platform_set_drvdata(pdev, bgmac);
> +
> + /* Set the features of the 4707 family */
> + bgmac->feature_flags |= BGMAC_FEAT_CLKCTLST;
> + bgmac->feature_flags |= BGMAC_FEAT_NO_RESET;
> + bgmac->feature_flags |= BGMAC_FEAT_FORCE_SPEED_2500;
> + bgmac->feature_flags |= BGMAC_FEAT_CMDCFG_SR_REV4;
> + bgmac->feature_flags |= BGMAC_FEAT_TX_MASK_SETUP;
> + bgmac->feature_flags |= BGMAC_FEAT_RX_MASK_SETUP;
> +
> + bgmac->dev = >dev;
> + bgmac->dma_dev = >dev;
> +
> + mac_addr = of_get_mac_address(np);
> + if (mac_addr)
> + ether_addr_copy(bgmac->mac_addr, mac_addr);
> + else
> + dev_warn(>dev, "MAC address not present in device 
> tree\n");
> +
> + bgmac->irq = platform_get_irq(pdev, 0);
> + if (bgmac->irq < 0) {
> + rc = bgmac->irq;
> + dev_err(>dev, "Unable to obtain IRQ\n");
> + goto err;
> + }
> +
> + rc = of_address_to_resource(np, 0, );
> + if (rc < 0) {
> + dev_err(>dev, "Unable to obtain base resource\n");
> + goto err;
> + }

Here you could fetch the resource using a traditional
platform_get_resource(pdev, IORESOURCE_MEM, 0) and...

> +
> + bgmac->plat.base = ioremap(regs.start, resource_size());
> + if (!bgmac->plat.base) {
> + dev_err(>dev, "Unable to map base resource\n");
> + rc = -ENOMEM;
> + goto err;
> + }

... here do a devm_ioremap_resource(), which also does a
request_mem_region, so this shows up nicely in /proc/iomem.

> +
> + rc = of_address_to_resource(np, 1, );
> + if (rc < 0) {
> + dev_err(>dev, "Unable to obtain idm resource\n");
> + goto err1;
> + }
> +
> + bgmac->plat.idm_base = ioremap(regs.start, resource_size());
> + if (!bgmac->plat.base) {
> + dev_err(>dev, "Unable to map idm resource\n");
> + rc = -ENOMEM;
> + goto err1;
> + }

Same here.

> +
> + bgmac->read = platform_bgmac_read;
> + bgmac->write = platform_bgmac_write;
> + bgmac->idm_read = platform_bgmac_idm_read;
> + bgmac->idm_write = platform_bgmac_idm_write;
> + bgmac->clk_enabled = platform_bgmac_clk_enabled;
> + bgmac->clk_enable = platform_bgmac_clk_enable;
> + bgmac->cco_ctl_maskset = platform_bgmac_cco_ctl_maskset;
> + bgmac->get_bus_clock = platform_bgmac_get_bus_clock;
> + bgmac->cmn_maskset32 = platform_bgmac_cmn_maskset32;
> +
> + rc = bgmac_enet_probe(bgmac);
> + if (rc)
> + goto err2;
> +
> + return 0;
> +
> +err2:
> + iounmap(bgmac->plat.idm_base);
> +err1:
> + iounmap(bgmac->plat.base);
> +err:
> + kfree(bgmac);

And with devm_* helpers none of that is needed now.
-- 
Florian


[PATCH] ipv4: Fix ip_skb_dst_mtu to use the sk passed by ip_finish_output

2016-06-29 Thread Shmulik Ladkani
ip_skb_dst_mtu uses skb->sk, assuming it is an AF_INET socket (e.g. it
calls ip_sk_use_pmtu which casts sk as an inet_sk).

However, in the case of UDP tunneling, the skb->sk is not necessarily an
inet socket (could be AF_PACKET socket, or AF_UNSPEC if arriving from
tun/tap).

OTOH, the sk passed as an argument throughout IP stack's output path is
the one which is of PMTU interest:
 - In case of local sockets, sk is same as skb->sk;
 - In case of a udp tunnel, sk is the tunneling socket.

Fix, by passing ip_finish_output's sk to ip_skb_dst_mtu.
This augments 7026b1ddb6 'netfilter: Pass socket pointer down through okfn().'

Signed-off-by: Shmulik Ladkani 
---

Discovered while debugging other issue in adjacent code area.

My setup seems to be handling this well; However I'd appreciate a
Tested-by or Reviewed-by as one can only cover relatively few of the
possible code flows leading to ip_skb_dst_mtu.

 include/net/ip.h| 5 ++---
 net/bridge/br_netfilter_hooks.c | 2 +-
 net/ipv4/ip_output.c| 4 ++--
 3 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/include/net/ip.h b/include/net/ip.h
index 37165fba37..08f36cd2b8 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -313,10 +313,9 @@ static inline unsigned int ip_dst_mtu_maybe_forward(const 
struct dst_entry *dst,
return min(dst->dev->mtu, IP_MAX_MTU);
 }
 
-static inline unsigned int ip_skb_dst_mtu(const struct sk_buff *skb)
+static inline unsigned int ip_skb_dst_mtu(struct sock *sk,
+ const struct sk_buff *skb)
 {
-   struct sock *sk = skb->sk;
-
if (!sk || !sk_fullsock(sk) || ip_sk_use_pmtu(sk)) {
bool forwarding = IPCB(skb)->flags & IPSKB_FORWARDED;
 
diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c
index 2d25979273..77e7f69bf8 100644
--- a/net/bridge/br_netfilter_hooks.c
+++ b/net/bridge/br_netfilter_hooks.c
@@ -700,7 +700,7 @@ static int
 br_nf_ip_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
  int (*output)(struct net *, struct sock *, struct sk_buff *))
 {
-   unsigned int mtu = ip_skb_dst_mtu(skb);
+   unsigned int mtu = ip_skb_dst_mtu(sk, skb);
struct iphdr *iph = ip_hdr(skb);
 
if (unlikely(((iph->frag_off & htons(IP_DF)) && !skb->ignore_df) ||
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index cbac493c91..e23f141c9b 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -271,7 +271,7 @@ static int ip_finish_output(struct net *net, struct sock 
*sk, struct sk_buff *sk
return dst_output(net, sk, skb);
}
 #endif
-   mtu = ip_skb_dst_mtu(skb);
+   mtu = ip_skb_dst_mtu(sk, skb);
if (skb_is_gso(skb))
return ip_finish_output_gso(net, sk, skb, mtu);
 
@@ -541,7 +541,7 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct 
sk_buff *skb,
 
iph = ip_hdr(skb);
 
-   mtu = ip_skb_dst_mtu(skb);
+   mtu = ip_skb_dst_mtu(sk, skb);
if (IPCB(skb)->frag_max_size && IPCB(skb)->frag_max_size < mtu)
mtu = IPCB(skb)->frag_max_size;
 
-- 
2.7.4



Re: [RFC 3/7] net: ethernet: bgmac: move BCMA MDIO Phy code into a separate file

2016-06-29 Thread Andrew Lunn
On Wed, Jun 29, 2016 at 11:35:28AM -0700, Florian Fainelli wrote:
> On 06/29/2016 07:13 AM, Andrew Lunn wrote:
> > Hi Jon
> > 
> > I know you are just refactoring code, but at some point it would be
> > good to take a closer look at this MDIO bus driver.
 
> And, to re-iterate all of your points are valid, but this is premature

We agree then :-)

   Andrew


[PATCH net v2.3] usbnet: Stop RX Q on MTU change

2016-06-29 Thread soohoon . lee


When MTU is changed unlink_urbs() flushes RX Q but mean while usbnet_bh() can 
fill up the Q at the same time.
Depends on which HCD is down there unlink takes long time then the flush never 
ends.

Signed-off-by: Soohoon Lee 
Reviewed-by: Kimball Murray 

---
 drivers/net/usb/usbnet.c |6 +-
  1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
index 61ba464..ce72dd0 100644
--- a/drivers/net/usb/usbnet.c
+++ b/drivers/net/usb/usbnet.c
@@ -395,8 +395,11 @@ int usbnet_change_mtu (struct net_device *net, int new_mtu)
dev->hard_mtu = net->mtu + net->hard_header_len;
if (dev->rx_urb_size == old_hard_mtu) {
dev->rx_urb_size = dev->hard_mtu;
-   if (dev->rx_urb_size > old_rx_urb_size)
+   if (dev->rx_urb_size > old_rx_urb_size) {
+   usbnet_pause_rx(dev);
usbnet_unlink_rx_urbs(dev);
+   usbnet_resume_rx(dev);
+   }
}
 
/* max qlen depend on hard_mtu and rx_urb_size */
@@ -1509,6 +1512,7 @@ static void usbnet_bh (unsigned long param)
   netif_device_present (dev->net) &&
   netif_carrier_ok(dev->net) &&
   !timer_pending (>delay) &&
+  !test_bit (EVENT_RX_PAUSED, >flags) &&
   !test_bit (EVENT_RX_HALT, >flags)) {
int temp = dev->rxq.qlen;
 


Re: [RFC 6/7] dt-bindings: net: bgmac: add bindings documentation for bgmac

2016-06-29 Thread Florian Fainelli
On 06/28/2016 12:34 PM, Jon Mason wrote:
> Signed-off-by: Jon Mason 
> ---
>  .../devicetree/bindings/net/brcm,bgmac-enet.txt | 21 
> +
>  1 file changed, 21 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/net/brcm,bgmac-enet.txt
> 
> diff --git a/Documentation/devicetree/bindings/net/brcm,bgmac-enet.txt 
> b/Documentation/devicetree/bindings/net/brcm,bgmac-enet.txt
> new file mode 100644
> index 000..efd36d5
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/net/brcm,bgmac-enet.txt
> @@ -0,0 +1,21 @@
> +Broadcom GMAC Ethernet Controller Device Tree Bindings
> +-
> +
> +Required properties:
> + - compatible:   "brcm,bgmac-enet"

You might want something a bit more explicit like "brcm,bgmac-nsp" for
instance and also define a compatible string for the Northstar family maybe?

> + - reg:  Address and length of the GMAC registers,
> + Address and length of the GMAC IDM registers
> + - interrupts:   Interrupt number
> +
> +Optional properties:
> +- mac-address:   mac address to be assigned to the device
> +
> +Examples:
> +
> +gmac0: enet@18022000 {
> + compatible = "brcm,bgmac-enet";
> + reg = <0x18022000 0x1000>,
> +   <0x1811 0x1000>;
> + interrupts = ;
> + status = "disabled";
> +};
> 


-- 
Florian


Re: [RFC 3/7] net: ethernet: bgmac: move BCMA MDIO Phy code into a separate file

2016-06-29 Thread Florian Fainelli
On 06/29/2016 07:13 AM, Andrew Lunn wrote:
> Hi Jon
> 
> I know you are just refactoring code, but at some point it would be
> good to take a closer look at this MDIO bus driver.
> 
> The MDIO bus driver should be generic, allowing access to all 32
> addresses on the bus, if that makes sense. You could for example have
> a B53 switch hanging off the MDIO bus The basic read/write
> functions seem to allow this.

That's the case on MIPS-based SoCs actually.

> 
>> +/* http://bcm-v4.sipsolutions.net/mac-gbit/gmac/chipphyinit */
>> +static void bcma_mdio_phy_init(struct bcma_mdio *bcma_mdio)
>> +{
>> +struct bcma_chipinfo *ci = _mdio->core->bus->chipinfo;
>> +u8 i;
>> +
>> +if (ci->id == BCMA_CHIP_ID_BCM5356) {
>> +for (i = 0; i < 5; i++) {
>> +bcma_mdio_phy_write(bcma_mdio, i, 0x1f, 0x008b);
>> +bcma_mdio_phy_write(bcma_mdio, i, 0x15, 0x0100);
>> +bcma_mdio_phy_write(bcma_mdio, i, 0x1f, 0x000f);
>> +bcma_mdio_phy_write(bcma_mdio, i, 0x12, 0x2aaa);
>> +bcma_mdio_phy_write(bcma_mdio, i, 0x1f, 0x000b);
> 
> It would be good to document what these writes are doing. Also, could
> they be moved into the phy driver? Does the PHY implement an ID in
> registers 2 and 3 that can be used to determine these are needed?
> 
> Also, why is it iterating over 5 PHYs? Is this actually a switch with
> 5 ports?

Yes, this is applying workarounds to the integrated PHYs, the plan is to
move that to a proper PHY driver so we can get rid of that.

> 
>> +}
>> +}
>> +if ((ci->id == BCMA_CHIP_ID_BCM5357 && ci->pkg != 10) ||
>> +(ci->id == BCMA_CHIP_ID_BCM4749 && ci->pkg != 10) ||
>> +(ci->id == BCMA_CHIP_ID_BCM53572 && ci->pkg != 9)) {
>> +struct bcma_drv_cc *cc = _mdio->core->bus->drv_cc;
>> +
>> +bcma_chipco_chipctl_maskset(cc, 2, ~0xc000, 0);
>> +bcma_chipco_chipctl_maskset(cc, 4, ~0x8000, 0);
>> +for (i = 0; i < 5; i++) {
>> +bcma_mdio_phy_write(bcma_mdio, i, 0x1f, 0x000f);
>> +bcma_mdio_phy_write(bcma_mdio, i, 0x16, 0x5284);
>> +bcma_mdio_phy_write(bcma_mdio, i, 0x1f, 0x000b);
>> +bcma_mdio_phy_write(bcma_mdio, i, 0x17, 0x0010);
>> +bcma_mdio_phy_write(bcma_mdio, i, 0x1f, 0x000f);
>> +bcma_mdio_phy_write(bcma_mdio, i, 0x16, 0x5296);
>> +bcma_mdio_phy_write(bcma_mdio, i, 0x17, 0x1073);
>> +bcma_mdio_phy_write(bcma_mdio, i, 0x17, 0x9073);
>> +bcma_mdio_phy_write(bcma_mdio, i, 0x16, 0x52b6);
>> +bcma_mdio_phy_write(bcma_mdio, i, 0x17, 0x9273);
>> +bcma_mdio_phy_write(bcma_mdio, i, 0x1f, 0x000b);
> 
> Same comment here.
> 
>> +/* http://bcm-v4.sipsolutions.net/mac-gbit/gmac/chipphyreset */
>> +static int bcma_mdio_phy_reset(struct mii_bus *bus)
>> +{
>> +struct bcma_mdio *bcma_mdio = bus->priv;
>> +u8 phyaddr = bcma_mdio->phyaddr;
>> +
>> +if (bcma_mdio->phyaddr == BGMAC_PHY_NOREGS)
>> +return 0;
>> +
>> +bcma_mdio_phy_write(bcma_mdio, phyaddr, MII_BMCR, BMCR_RESET);
>> +udelay(100);
>> +if (bcma_mdio_phy_read(bcma_mdio, phyaddr, MII_BMCR) & BMCR_RESET)
>> +dev_err(_mdio->core->dev, "PHY reset failed\n");
>> +bcma_mdio_phy_init(bcma_mdio);
> 
> This appears to be resetting one PHY. What about the others? Why not
> put this in the PHY driver?

Probably should be there, keep in mind this driver was mostly reverse
engineered, from an internal code base which usually did not plan on
utilizing existing APIs.

> 
>> +struct mii_bus *bcma_mdio_mii_register(struct bcma_device *core, u8 phyaddr)
>> +{
>> +struct bcma_mdio *bcma_mdio;
>> +struct mii_bus *mii_bus;
>> +int err;
>> +
>> +bcma_mdio = kzalloc(sizeof(*bcma_mdio), GFP_KERNEL);
>> +if (!bcma_mdio)
>> +return ERR_PTR(-ENOMEM);
>> +
>> +mii_bus = mdiobus_alloc();
>> +if (!mii_bus) {
>> +err = -ENOMEM;
>> +goto err;
>> +}
>> +
>> +mii_bus->name = "bcma_mdio mii bus";
>> +sprintf(mii_bus->id, "%s-%d-%d", "bcma_mdio", core->bus->num,
>> +core->core_unit);
>> +mii_bus->priv = bcma_mdio;
>> +mii_bus->read = bcma_mdio_mii_read;
>> +mii_bus->write = bcma_mdio_mii_write;
>> +mii_bus->reset = bcma_mdio_phy_reset;
>> +mii_bus->parent = >dev;
>> +mii_bus->phy_mask = ~(1 << phyaddr);
> 
> Is there a need to limit this to just the one PHY?

Usually this is only relevant for an external PHY, and Jon is just
moving code around without making functional changes to existing BCMA
code, so all of that can be changed at a later time.

> 
>> +
>> +bcma_mdio->core = core;
>> +bcma_mdio->phyaddr = phyaddr;
>> +
>> +err = mdiobus_register(mii_bus);
> 
> Something which appears to be missing from the later 

[PATCH v4 iproute2 2/6] ip link/addr: Add support for vrf keyword

2016-06-29 Thread David Ahern
Add vrf keyword to 'ip link' and 'ip addr' commands (common list code).

Allows:
1. Adding a link to a VRF
   $ ip link set NAME vrf NAME

   Removing a link from a VRF still uses 'ip link set NAME nomaster'

2. Showing links associated with a VRF:
   $ ip link show vrf NAME

3. List addresses associated with links in a VRF
   $ ip -br addr show vrf red

Signed-off-by: David Ahern 
---
 ip/ipaddress.c   | 12 +++-
 ip/iplink.c  | 15 +--
 man/man8/ip-address.8.in |  6 ++
 man/man8/ip-link.8.in| 10 ++
 4 files changed, 40 insertions(+), 3 deletions(-)

diff --git a/ip/ipaddress.c b/ip/ipaddress.c
index d4d649505e15..54896c726e07 100644
--- a/ip/ipaddress.c
+++ b/ip/ipaddress.c
@@ -80,7 +80,7 @@ static void usage(void)
fprintf(stderr, "[ to PREFIX ] [ FLAG-LIST 
] [ label LABEL ] [up]\n");
fprintf(stderr, "   ip address [ show [ dev IFNAME ] [ scope 
SCOPE-ID ] [ master DEVICE ]\n");
fprintf(stderr, " [ type TYPE ] [ to PREFIX ] [ 
FLAG-LIST ]\n");
-   fprintf(stderr, " [ label LABEL ] [up] ]\n");
+   fprintf(stderr, " [ label LABEL ] [up] [ vrf 
NAME ] ]\n");
fprintf(stderr, "   ip address {showdump|restore}\n");
fprintf(stderr, "IFADDR := PREFIX | ADDR peer PREFIX\n");
fprintf(stderr, "  [ broadcast ADDR ] [ anycast ADDR ]\n");
@@ -1620,6 +1620,16 @@ static int ipaddr_list_flush_or_save(int argc, char 
**argv, int action)
if (!ifindex)
invarg("Device does not exist\n", *argv);
filter.master = ifindex;
+   } else if (strcmp(*argv, "vrf") == 0) {
+   int ifindex;
+
+   NEXT_ARG();
+   ifindex = ll_name_to_index(*argv);
+   if (!ifindex)
+   invarg("Not a valid VRF name\n", *argv);
+   if (!name_is_vrf(*argv))
+   invarg("Not a valid VRF name\n", *argv);
+   filter.master = ifindex;
} else if (strcmp(*argv, "type") == 0) {
int soff;
 
diff --git a/ip/iplink.c b/ip/iplink.c
index b1f8a37922f5..f2a2e13cf0c5 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -82,11 +82,11 @@ void iplink_usage(void)
fprintf(stderr, "  [ query_rss { on | 
off} ]\n");
fprintf(stderr, "  [ state { auto | 
enable | disable} ] ]\n");
fprintf(stderr, "  [ trust { on | off} 
] ]\n");
-   fprintf(stderr, " [ master DEVICE ]\n");
+   fprintf(stderr, " [ master DEVICE ][ vrf NAME 
]\n");
fprintf(stderr, " [ nomaster ]\n");
fprintf(stderr, " [ addrgenmode { eui64 | none 
| stable_secret | random } ]\n");
fprintf(stderr, " [ protodown { on | off } 
]\n");
-   fprintf(stderr, "   ip link show [ DEVICE | group GROUP ] [up] 
[master DEV] [type TYPE]\n");
+   fprintf(stderr, "   ip link show [ DEVICE | group GROUP ] [up] 
[master DEV] [vrf NAME] [type TYPE]\n");
 
if (iplink_have_newlink()) {
fprintf(stderr, "   ip link help [ TYPE ]\n");
@@ -603,6 +603,17 @@ int iplink_parse(int argc, char **argv, struct iplink_req 
*req,
invarg("Device does not exist\n", *argv);
addattr_l(>n, sizeof(*req), IFLA_MASTER,
  , 4);
+   } else if (strcmp(*argv, "vrf") == 0) {
+   int ifindex;
+
+   NEXT_ARG();
+   ifindex = ll_name_to_index(*argv);
+   if (!ifindex)
+   invarg("Not a valid VRF name\n", *argv);
+   if (!name_is_vrf(*argv))
+   invarg("Not a valid VRF name\n", *argv);
+   addattr_l(>n, sizeof(*req), IFLA_MASTER,
+ , sizeof(ifindex));
} else if (matches(*argv, "nomaster") == 0) {
int ifindex = 0;
 
diff --git a/man/man8/ip-address.8.in b/man/man8/ip-address.8.in
index 3cbe4181f7e3..7cb927173133 100644
--- a/man/man8/ip-address.8.in
+++ b/man/man8/ip-address.8.in
@@ -45,6 +45,8 @@ ip-address \- protocol address management
 .IR DEVICE " ] [ "
 .B  type
 .IR TYPE " ] [ "
+.B vrf
+.IR NAME " ] [ "
 .BR up " ] ]"
 
 .ti -8
@@ -280,6 +282,10 @@ is a usual shell style pattern.
 only list interfaces enslaved to this master device.
 
 .TP
+.BI vrf " NAME "
+only list interfaces enslaved to this vrf.
+
+.TP
 .BI type " TYPE"
 only list interfaces of the 

[PATCH v4 iproute2 4/6] ip route: Change type mask to bitmask

2016-06-29 Thread David Ahern
Allow option to select multiple route types to show or exlude
specific route types.

Signed-off-by: David Ahern 
---
 ip/iproute.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/ip/iproute.c b/ip/iproute.c
index 8224d7ffa94b..aae693d17be8 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -113,7 +113,7 @@ static struct
int flushe;
int protocol, protocolmask;
int scope, scopemask;
-   int type, typemask;
+   __u64 typemask;
int tos, tosmask;
int iif, iifmask;
int oif, oifmask;
@@ -178,7 +178,8 @@ static int filter_nlmsg(struct nlmsghdr *n, struct rtattr 
**tb, int host_len)
return 0;
if ((filter.scope^r->rtm_scope))
return 0;
-   if ((filter.type^r->rtm_type))
+
+   if (filter.typemask && !(filter.typemask & (1 << r->rtm_type)))
return 0;
if ((filter.tos^r->rtm_tos))
return 0;
@@ -365,7 +366,8 @@ int print_route(const struct sockaddr_nl *who, struct 
nlmsghdr *n, void *arg)
 
if (n->nlmsg_type == RTM_DELROUTE)
fprintf(fp, "Deleted ");
-   if ((r->rtm_type != RTN_UNICAST || show_details > 0) && !filter.type)
+   if ((r->rtm_type != RTN_UNICAST || show_details > 0) &&
+   (!filter.typemask || (filter.typemask & (1 << r->rtm_type
fprintf(fp, "%s ", rtnl_rtntype_n2a(r->rtm_type, b1, 
sizeof(b1)));
 
if (tb[RTA_DST]) {
@@ -1433,10 +1435,9 @@ static int iproute_list_flush_or_save(int argc, char 
**argv, int action)
int type;
 
NEXT_ARG();
-   filter.typemask = -1;
if (rtnl_rtntype_a2n(, *argv))
invarg("node type value is invalid\n", *argv);
-   filter.type = type;
+   filter.typemask = (1<

[PATCH v4 iproute2 0/6] Add support for vrf keyword

2016-06-29 Thread David Ahern
Currently the syntax for VRF related commands is rather kludgy and
inconsistent from one subcommand to another. This set adds support
for the VRF keyword to the link, address, neigh, and route commands
to improve the user experience listing data associated with vrfs,
modifying routes or doing a route lookup.

v4
- Stephen's comments about patch 1 also apply to patch 5, so updated it
v3
- addressed comments from Stephen
- added documentation updates to patches 2, 3 and 6

v2
- rebased to top of tree
- all checkpatch warnings are usage lines. The change in these
  patches is consistent with existing code for usage lines

David Ahern (6):
  ip vrf: Add name_is_vrf
  ip link/addr: Add support for vrf keyword
  ip neigh: Add support for keyword
  ip route: Change type mask to bitmask
  ip vrf: Add ipvrf_get_table
  ip route: Add support for vrf keyword

 ip/ip_common.h   |   3 ++
 ip/ipaddress.c   |  12 -
 ip/iplink.c  |  15 ++-
 ip/iplink_vrf.c  | 113 +++
 ip/ipneigh.c |  14 +-
 ip/iproute.c |  43 +++---
 man/man8/ip-address.8.in |   6 +++
 man/man8/ip-link.8.in|  10 +
 man/man8/ip-neighbour.8  |   8 +++-
 man/man8/ip-route.8.in   |  19 +++-
 10 files changed, 230 insertions(+), 13 deletions(-)

-- 
2.1.4



[PATCH v4 iproute2 6/6] ip route: Add support for vrf keyword

2016-06-29 Thread David Ahern
Add vrf keyword to 'ip route' commands. Allows:
1. Users can list routes by VRF name:
   $ ip route show vrf NAME

   VRF tables have all routes including local and broadcast routes.
   The VRF keyword filters LOCAL and BROADCAST routes; to see all
   routes the table option can be used. Or to see local routes only
   for a VRF:
   $ ip route show vrf NAME type local

2. Add or delete a route for a VRF:
   $ ip route {add|delete} vrf NAME 

3. Do a route lookup for a VRF:
   $ ip route get vrf NAME ADDRESS

Signed-off-by: David Ahern 
---
 ip/iproute.c   | 32 ++--
 man/man8/ip-route.8.in | 19 ++-
 2 files changed, 48 insertions(+), 3 deletions(-)

diff --git a/ip/iproute.c b/ip/iproute.c
index aae693d17be8..bd661c16cb46 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -67,10 +67,10 @@ static void usage(void)
fprintf(stderr, "   ip route showdump\n");
fprintf(stderr, "   ip route get ADDRESS [ from ADDRESS iif STRING 
]\n");
fprintf(stderr, "[ oif STRING ] [ tos TOS 
]\n");
-   fprintf(stderr, "[ mark NUMBER ]\n");
+   fprintf(stderr, "[ mark NUMBER ] [ vrf NAME 
]\n");
fprintf(stderr, "   ip route { add | del | change | append | 
replace } ROUTE\n");
fprintf(stderr, "SELECTOR := [ root PREFIX ] [ match PREFIX ] [ exact 
PREFIX ]\n");
-   fprintf(stderr, "[ table TABLE_ID ] [ proto RTPROTO ]\n");
+   fprintf(stderr, "[ table TABLE_ID ] [ vrf NAME ] [ proto 
RTPROTO ]\n");
fprintf(stderr, "[ type TYPE ] [ scope SCOPE ]\n");
fprintf(stderr, "ROUTE := NODE_SPEC [ INFO_SPEC ]\n");
fprintf(stderr, "NODE_SPEC := [ TYPE ] PREFIX [ tos TOS ]\n");
@@ -1141,6 +1141,20 @@ static int iproute_modify(int cmd, unsigned int flags, 
int argc, char **argv)
addattr32(, sizeof(req), RTA_TABLE, tid);
}
table_ok = 1;
+   } else if (matches(*argv, "vrf") == 0) {
+   __u32 tid;
+
+   NEXT_ARG();
+   tid = ipvrf_get_table(*argv);
+   if (tid == 0)
+   invarg("Invalid VRF\n", *argv);
+   if (tid < 256)
+   req.r.rtm_table = tid;
+   else {
+   req.r.rtm_table = RT_TABLE_UNSPEC;
+   addattr32(, sizeof(req), RTA_TABLE, tid);
+   }
+   table_ok = 1;
} else if (strcmp(*argv, "dev") == 0 ||
   strcmp(*argv, "oif") == 0) {
NEXT_ARG();
@@ -1395,6 +1409,15 @@ static int iproute_list_flush_or_save(int argc, char 
**argv, int action)
}
} else
filter.tb = tid;
+   } else if (matches(*argv, "vrf") == 0) {
+   __u32 tid;
+
+   NEXT_ARG();
+   tid = ipvrf_get_table(*argv);
+   if (tid == 0)
+   invarg("Invalid VRF\n", *argv);
+   filter.tb = tid;
+   filter.typemask = ~(1 << RTN_LOCAL | 1<

[PATCH v4 iproute2 3/6] ip neigh: Add support for keyword

2016-06-29 Thread David Ahern
Add vrf keyword to 'ip neigh' commands. Allows listing neighbor
entries for all links associated with a given VRF.

Signed-off-by: David Ahern 
---
 ip/ipneigh.c| 14 +-
 man/man8/ip-neighbour.8 |  8 +++-
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/ip/ipneigh.c b/ip/ipneigh.c
index 4ddb747e2086..3e444712645f 100644
--- a/ip/ipneigh.c
+++ b/ip/ipneigh.c
@@ -48,7 +48,8 @@ static void usage(void)
 {
fprintf(stderr, "Usage: ip neigh { add | del | change | replace }\n"
"{ ADDR [ lladdr LLADDR ] [ nud STATE ] 
| proxy ADDR } [ dev DEV ]\n");
-   fprintf(stderr, "   ip neigh { show | flush } [ proxy ] [ to PREFIX 
] [ dev DEV ] [ nud STATE ]\n\n");
+   fprintf(stderr, "   ip neigh { show | flush } [ proxy ] [ to PREFIX 
] [ dev DEV ] [ nud STATE ]\n");
+   fprintf(stderr, " [ vrf NAME ]\n\n");
fprintf(stderr, "STATE := { permanent | noarp | stale | reachable | 
none |\n"
"   incomplete | delay | probe | failed }\n");
exit(-1);
@@ -385,6 +386,17 @@ static int do_show_or_flush(int argc, char **argv, int 
flush)
invarg("Device does not exist\n", *argv);
addattr32(, sizeof(req), NDA_MASTER, ifindex);
filter.master = ifindex;
+   } else if (strcmp(*argv, "vrf") == 0) {
+   int ifindex;
+
+   NEXT_ARG();
+   ifindex = ll_name_to_index(*argv);
+   if (!ifindex)
+   invarg("Not a valid VRF name\n", *argv);
+   if (!name_is_vrf(*argv))
+   invarg("Not a valid VRF name\n", *argv);
+   addattr32(, sizeof(req), NDA_MASTER, ifindex);
+   filter.master = ifindex;
} else if (strcmp(*argv, "unused") == 0) {
filter.unused_only = 1;
} else if (strcmp(*argv, "nud") == 0) {
diff --git a/man/man8/ip-neighbour.8 b/man/man8/ip-neighbour.8
index b292e1814495..bbfe8e72a09c 100644
--- a/man/man8/ip-neighbour.8
+++ b/man/man8/ip-neighbour.8
@@ -31,7 +31,9 @@ ip-neighbour \- neighbour/arp tables management.
 .B  dev
 .IR DEV " ] [ "
 .B  nud
-.IR STATE " ]"
+.IR STATE " ] [ "
+.B  vrf
+.IR NAME " ] "
 
 .ti -8
 .IR STATE " := {"
@@ -164,6 +166,10 @@ the prefix selecting the neighbours to list.
 only list the neighbours attached to this device.
 
 .TP
+.BI vrf " NAME"
+only list the neighbours for given VRF.
+
+.TP
 .BI proxy
 list neighbour proxies.
 
-- 
2.1.4



[PATCH v4 iproute2 5/6] ip vrf: Add ipvrf_get_table

2016-06-29 Thread David Ahern
Add ipvrf_get_table to lookup table id for device name. Returns 0
on any error or if name is not a VRF device.

Signed-off-by: David Ahern 
---
 ip/ip_common.h  |  1 +
 ip/iplink_vrf.c | 63 +
 2 files changed, 64 insertions(+)

diff --git a/ip/ip_common.h b/ip/ip_common.h
index b747ea47122d..c8188122ab5d 100644
--- a/ip/ip_common.h
+++ b/ip/ip_common.h
@@ -90,6 +90,7 @@ struct link_util *get_link_slave_kind(const char *slave_kind);
 
 void br_dump_bridge_id(const struct ifla_bridge_id *id, char *buf, size_t len);
 
+__u32 ipvrf_get_table(const char *name);
 bool name_is_vrf(const char *name);
 
 #ifndefINFINITY_LIFE_TIME
diff --git a/ip/iplink_vrf.c b/ip/iplink_vrf.c
index 166f566e6d77..015b41e1e7b0 100644
--- a/ip/iplink_vrf.c
+++ b/ip/iplink_vrf.c
@@ -97,6 +97,69 @@ struct link_util vrf_slave_link_util = {
.slave  = true,
 };
 
+/* returns table id if name is a VRF device */
+__u32 ipvrf_get_table(const char *name)
+{
+   struct {
+   struct nlmsghdr n;
+   struct ifinfomsgi;
+   charbuf[1024];
+   } req = {
+   .n = {
+   .nlmsg_len   = NLMSG_LENGTH(sizeof(struct ifinfomsg)),
+   .nlmsg_flags = NLM_F_REQUEST,
+   .nlmsg_type  = RTM_GETLINK,
+   },
+   .i = {
+   .ifi_family  = preferred_family,
+   },
+   };
+   struct {
+   struct nlmsghdr n;
+   char buf[8192];
+   } answer;
+   struct rtattr *tb[IFLA_MAX+1];
+   struct rtattr *li[IFLA_INFO_MAX+1];
+   struct rtattr *vrf_attr[IFLA_VRF_MAX + 1];
+   struct ifinfomsg *ifi;
+   __u32 tb_id = 0;
+   int len;
+
+   addattr_l(, sizeof(req), IFLA_IFNAME, name, strlen(name) + 1);
+
+   if (rtnl_talk(, , , sizeof(answer)) < 0)
+   return 0;
+
+   ifi = NLMSG_DATA();
+   len = answer.n.nlmsg_len - NLMSG_LENGTH(sizeof(*ifi));
+   if (len < 0) {
+   fprintf(stderr, "BUG: Invalid response to link query.\n");
+   return 0;
+   }
+
+   parse_rtattr(tb, IFLA_MAX, IFLA_RTA(ifi), len);
+
+   if (!tb[IFLA_LINKINFO])
+   return 0;
+
+   parse_rtattr_nested(li, IFLA_INFO_MAX, tb[IFLA_LINKINFO]);
+
+   if (!li[IFLA_INFO_KIND] || !li[IFLA_INFO_DATA])
+   return 0;
+
+   if (strcmp(RTA_DATA(li[IFLA_INFO_KIND]), "vrf"))
+   return 0;
+
+   parse_rtattr_nested(vrf_attr, IFLA_VRF_MAX, li[IFLA_INFO_DATA]);
+   if (vrf_attr[IFLA_VRF_TABLE])
+   tb_id = rta_getattr_u32(vrf_attr[IFLA_VRF_TABLE]);
+
+   if (!tb_id)
+   fprintf(stderr, "BUG: VRF %s is missing table id\n", name);
+
+   return tb_id;
+}
+
 bool name_is_vrf(const char *name)
 {
struct {
-- 
2.1.4



[PATCH v4 iproute2 1/6] ip vrf: Add name_is_vrf

2016-06-29 Thread David Ahern
Add name_is_vrf function to determine if given name corresponds to a
VRF device.

Signed-off-by: David Ahern 
---
 ip/ip_common.h  |  2 ++
 ip/iplink_vrf.c | 50 ++
 2 files changed, 52 insertions(+)

diff --git a/ip/ip_common.h b/ip/ip_common.h
index e8da9e034b15..b747ea47122d 100644
--- a/ip/ip_common.h
+++ b/ip/ip_common.h
@@ -90,6 +90,8 @@ struct link_util *get_link_slave_kind(const char *slave_kind);
 
 void br_dump_bridge_id(const struct ifla_bridge_id *id, char *buf, size_t len);
 
+bool name_is_vrf(const char *name);
+
 #ifndefINFINITY_LIFE_TIME
 #define INFINITY_LIFE_TIME  0xU
 #endif
diff --git a/ip/iplink_vrf.c b/ip/iplink_vrf.c
index e3c7b4652da5..166f566e6d77 100644
--- a/ip/iplink_vrf.c
+++ b/ip/iplink_vrf.c
@@ -96,3 +96,53 @@ struct link_util vrf_slave_link_util = {
.print_opt  = vrf_slave_print_opt,
.slave  = true,
 };
+
+bool name_is_vrf(const char *name)
+{
+   struct {
+   struct nlmsghdr n;
+   struct ifinfomsgi;
+   charbuf[1024];
+   } req = {
+   .n = {
+   .nlmsg_len   = NLMSG_LENGTH(sizeof(struct ifinfomsg)),
+   .nlmsg_flags = NLM_F_REQUEST,
+   .nlmsg_type  = RTM_GETLINK,
+   },
+   .i = {
+   .ifi_family  = preferred_family,
+   },
+   };
+   struct {
+   struct nlmsghdr n;
+   char buf[8192];
+   } answer;
+   struct rtattr *tb[IFLA_MAX+1];
+   struct rtattr *li[IFLA_INFO_MAX+1];
+   struct ifinfomsg *ifi;
+   int len;
+
+   addattr_l(, sizeof(req), IFLA_IFNAME, name, strlen(name) + 1);
+
+   if (rtnl_talk(, , , sizeof(answer)) < 0)
+   return false;
+
+   ifi = NLMSG_DATA();
+   len = answer.n.nlmsg_len - NLMSG_LENGTH(sizeof(*ifi));
+   if (len < 0) {
+   fprintf(stderr, "BUG: Invalid response to link query.\n");
+   return false;
+   }
+
+   parse_rtattr(tb, IFLA_MAX, IFLA_RTA(ifi), len);
+
+   if (!tb[IFLA_LINKINFO])
+   return false;
+
+   parse_rtattr_nested(li, IFLA_INFO_MAX, tb[IFLA_LINKINFO]);
+
+   if (!li[IFLA_INFO_KIND])
+   return false;
+
+   return strcmp(RTA_DATA(li[IFLA_INFO_KIND]), "vrf") == 0;
+}
-- 
2.1.4



[PATCH v3 iproute2 1/6] ip vrf: Add name_is_vrf

2016-06-29 Thread David Ahern
Add name_is_vrf function to determine if given name corresponds to a
VRF device.

Signed-off-by: David Ahern 
---
 ip/ip_common.h  |  2 ++
 ip/iplink_vrf.c | 50 ++
 2 files changed, 52 insertions(+)

diff --git a/ip/ip_common.h b/ip/ip_common.h
index e8da9e034b15..b747ea47122d 100644
--- a/ip/ip_common.h
+++ b/ip/ip_common.h
@@ -90,6 +90,8 @@ struct link_util *get_link_slave_kind(const char *slave_kind);
 
 void br_dump_bridge_id(const struct ifla_bridge_id *id, char *buf, size_t len);
 
+bool name_is_vrf(const char *name);
+
 #ifndefINFINITY_LIFE_TIME
 #define INFINITY_LIFE_TIME  0xU
 #endif
diff --git a/ip/iplink_vrf.c b/ip/iplink_vrf.c
index e3c7b4652da5..166f566e6d77 100644
--- a/ip/iplink_vrf.c
+++ b/ip/iplink_vrf.c
@@ -96,3 +96,53 @@ struct link_util vrf_slave_link_util = {
.print_opt  = vrf_slave_print_opt,
.slave  = true,
 };
+
+bool name_is_vrf(const char *name)
+{
+   struct {
+   struct nlmsghdr n;
+   struct ifinfomsgi;
+   charbuf[1024];
+   } req = {
+   .n = {
+   .nlmsg_len   = NLMSG_LENGTH(sizeof(struct ifinfomsg)),
+   .nlmsg_flags = NLM_F_REQUEST,
+   .nlmsg_type  = RTM_GETLINK,
+   },
+   .i = {
+   .ifi_family  = preferred_family,
+   },
+   };
+   struct {
+   struct nlmsghdr n;
+   char buf[8192];
+   } answer;
+   struct rtattr *tb[IFLA_MAX+1];
+   struct rtattr *li[IFLA_INFO_MAX+1];
+   struct ifinfomsg *ifi;
+   int len;
+
+   addattr_l(, sizeof(req), IFLA_IFNAME, name, strlen(name) + 1);
+
+   if (rtnl_talk(, , , sizeof(answer)) < 0)
+   return false;
+
+   ifi = NLMSG_DATA();
+   len = answer.n.nlmsg_len - NLMSG_LENGTH(sizeof(*ifi));
+   if (len < 0) {
+   fprintf(stderr, "BUG: Invalid response to link query.\n");
+   return false;
+   }
+
+   parse_rtattr(tb, IFLA_MAX, IFLA_RTA(ifi), len);
+
+   if (!tb[IFLA_LINKINFO])
+   return false;
+
+   parse_rtattr_nested(li, IFLA_INFO_MAX, tb[IFLA_LINKINFO]);
+
+   if (!li[IFLA_INFO_KIND])
+   return false;
+
+   return strcmp(RTA_DATA(li[IFLA_INFO_KIND]), "vrf") == 0;
+}
-- 
2.1.4



[PATCH v3 iproute2 2/6] ip link/addr: Add support for vrf keyword

2016-06-29 Thread David Ahern
Add vrf keyword to 'ip link' and 'ip addr' commands (common list code).

Allows:
1. Adding a link to a VRF
   $ ip link set NAME vrf NAME

   Removing a link from a VRF still uses 'ip link set NAME nomaster'

2. Showing links associated with a VRF:
   $ ip link show vrf NAME

3. List addresses associated with links in a VRF
   $ ip -br addr show vrf red

Signed-off-by: David Ahern 
---
 ip/ipaddress.c   | 12 +++-
 ip/iplink.c  | 15 +--
 man/man8/ip-address.8.in |  6 ++
 man/man8/ip-link.8.in| 10 ++
 4 files changed, 40 insertions(+), 3 deletions(-)

diff --git a/ip/ipaddress.c b/ip/ipaddress.c
index d4d649505e15..54896c726e07 100644
--- a/ip/ipaddress.c
+++ b/ip/ipaddress.c
@@ -80,7 +80,7 @@ static void usage(void)
fprintf(stderr, "[ to PREFIX ] [ FLAG-LIST 
] [ label LABEL ] [up]\n");
fprintf(stderr, "   ip address [ show [ dev IFNAME ] [ scope 
SCOPE-ID ] [ master DEVICE ]\n");
fprintf(stderr, " [ type TYPE ] [ to PREFIX ] [ 
FLAG-LIST ]\n");
-   fprintf(stderr, " [ label LABEL ] [up] ]\n");
+   fprintf(stderr, " [ label LABEL ] [up] [ vrf 
NAME ] ]\n");
fprintf(stderr, "   ip address {showdump|restore}\n");
fprintf(stderr, "IFADDR := PREFIX | ADDR peer PREFIX\n");
fprintf(stderr, "  [ broadcast ADDR ] [ anycast ADDR ]\n");
@@ -1620,6 +1620,16 @@ static int ipaddr_list_flush_or_save(int argc, char 
**argv, int action)
if (!ifindex)
invarg("Device does not exist\n", *argv);
filter.master = ifindex;
+   } else if (strcmp(*argv, "vrf") == 0) {
+   int ifindex;
+
+   NEXT_ARG();
+   ifindex = ll_name_to_index(*argv);
+   if (!ifindex)
+   invarg("Not a valid VRF name\n", *argv);
+   if (!name_is_vrf(*argv))
+   invarg("Not a valid VRF name\n", *argv);
+   filter.master = ifindex;
} else if (strcmp(*argv, "type") == 0) {
int soff;
 
diff --git a/ip/iplink.c b/ip/iplink.c
index b1f8a37922f5..f2a2e13cf0c5 100644
--- a/ip/iplink.c
+++ b/ip/iplink.c
@@ -82,11 +82,11 @@ void iplink_usage(void)
fprintf(stderr, "  [ query_rss { on | 
off} ]\n");
fprintf(stderr, "  [ state { auto | 
enable | disable} ] ]\n");
fprintf(stderr, "  [ trust { on | off} 
] ]\n");
-   fprintf(stderr, " [ master DEVICE ]\n");
+   fprintf(stderr, " [ master DEVICE ][ vrf NAME 
]\n");
fprintf(stderr, " [ nomaster ]\n");
fprintf(stderr, " [ addrgenmode { eui64 | none 
| stable_secret | random } ]\n");
fprintf(stderr, " [ protodown { on | off } 
]\n");
-   fprintf(stderr, "   ip link show [ DEVICE | group GROUP ] [up] 
[master DEV] [type TYPE]\n");
+   fprintf(stderr, "   ip link show [ DEVICE | group GROUP ] [up] 
[master DEV] [vrf NAME] [type TYPE]\n");
 
if (iplink_have_newlink()) {
fprintf(stderr, "   ip link help [ TYPE ]\n");
@@ -603,6 +603,17 @@ int iplink_parse(int argc, char **argv, struct iplink_req 
*req,
invarg("Device does not exist\n", *argv);
addattr_l(>n, sizeof(*req), IFLA_MASTER,
  , 4);
+   } else if (strcmp(*argv, "vrf") == 0) {
+   int ifindex;
+
+   NEXT_ARG();
+   ifindex = ll_name_to_index(*argv);
+   if (!ifindex)
+   invarg("Not a valid VRF name\n", *argv);
+   if (!name_is_vrf(*argv))
+   invarg("Not a valid VRF name\n", *argv);
+   addattr_l(>n, sizeof(*req), IFLA_MASTER,
+ , sizeof(ifindex));
} else if (matches(*argv, "nomaster") == 0) {
int ifindex = 0;
 
diff --git a/man/man8/ip-address.8.in b/man/man8/ip-address.8.in
index 3cbe4181f7e3..7cb927173133 100644
--- a/man/man8/ip-address.8.in
+++ b/man/man8/ip-address.8.in
@@ -45,6 +45,8 @@ ip-address \- protocol address management
 .IR DEVICE " ] [ "
 .B  type
 .IR TYPE " ] [ "
+.B vrf
+.IR NAME " ] [ "
 .BR up " ] ]"
 
 .ti -8
@@ -280,6 +282,10 @@ is a usual shell style pattern.
 only list interfaces enslaved to this master device.
 
 .TP
+.BI vrf " NAME "
+only list interfaces enslaved to this vrf.
+
+.TP
 .BI type " TYPE"
 only list interfaces of the 

[PATCH v3 iproute2 0/6] Add support for vrf keyword

2016-06-29 Thread David Ahern
Currently the syntax for VRF related commands is rather kludgy and
inconsistent from one subcommand to another. This set adds support
for the VRF keyword to the link, address, neigh, and route commands
to improve the user experience listing data associated with vrfs,
modifying routes or doing a route lookup.

v3
- addressed comments from Stephen
- added documentation updates to patches 2, 3 and 6

v2
- rebased to top of tree
- all checkpatch warnings are usage lines. The change in these
  patches is consistent with existing code for usage lines

David Ahern (6):
  ip vrf: Add name_is_vrf
  ip link/addr: Add support for vrf keyword
  ip neigh: Add support for keyword
  ip route: Change type mask to bitmask
  ip vrf: Add ipvrf_get_table
  ip route: Add support for vrf keyword

 ip/ip_common.h   |   3 ++
 ip/ipaddress.c   |  12 -
 ip/iplink.c  |  15 +-
 ip/iplink_vrf.c  | 116 +++
 ip/ipneigh.c |  14 +-
 ip/iproute.c |  43 +++---
 man/man8/ip-address.8.in |   6 +++
 man/man8/ip-link.8.in|  10 
 man/man8/ip-neighbour.8  |   8 +++-
 man/man8/ip-route.8.in   |  19 +++-
 10 files changed, 233 insertions(+), 13 deletions(-)

-- 
2.1.4



[PATCH v3 iproute2 4/6] ip route: Change type mask to bitmask

2016-06-29 Thread David Ahern
Allow option to select multiple route types to show or exlude
specific route types.

Signed-off-by: David Ahern 
---
 ip/iproute.c | 11 ++-
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/ip/iproute.c b/ip/iproute.c
index 8224d7ffa94b..aae693d17be8 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -113,7 +113,7 @@ static struct
int flushe;
int protocol, protocolmask;
int scope, scopemask;
-   int type, typemask;
+   __u64 typemask;
int tos, tosmask;
int iif, iifmask;
int oif, oifmask;
@@ -178,7 +178,8 @@ static int filter_nlmsg(struct nlmsghdr *n, struct rtattr 
**tb, int host_len)
return 0;
if ((filter.scope^r->rtm_scope))
return 0;
-   if ((filter.type^r->rtm_type))
+
+   if (filter.typemask && !(filter.typemask & (1 << r->rtm_type)))
return 0;
if ((filter.tos^r->rtm_tos))
return 0;
@@ -365,7 +366,8 @@ int print_route(const struct sockaddr_nl *who, struct 
nlmsghdr *n, void *arg)
 
if (n->nlmsg_type == RTM_DELROUTE)
fprintf(fp, "Deleted ");
-   if ((r->rtm_type != RTN_UNICAST || show_details > 0) && !filter.type)
+   if ((r->rtm_type != RTN_UNICAST || show_details > 0) &&
+   (!filter.typemask || (filter.typemask & (1 << r->rtm_type
fprintf(fp, "%s ", rtnl_rtntype_n2a(r->rtm_type, b1, 
sizeof(b1)));
 
if (tb[RTA_DST]) {
@@ -1433,10 +1435,9 @@ static int iproute_list_flush_or_save(int argc, char 
**argv, int action)
int type;
 
NEXT_ARG();
-   filter.typemask = -1;
if (rtnl_rtntype_a2n(, *argv))
invarg("node type value is invalid\n", *argv);
-   filter.type = type;
+   filter.typemask = (1<

[PATCH v3 iproute2 5/6] ip vrf: Add ipvrf_get_table

2016-06-29 Thread David Ahern
Add ipvrf_get_table to lookup table id for device name. Returns 0
on any error or if name is not a VRF device.

Signed-off-by: David Ahern 
---
 ip/ip_common.h  |  1 +
 ip/iplink_vrf.c | 66 +
 2 files changed, 67 insertions(+)

diff --git a/ip/ip_common.h b/ip/ip_common.h
index b747ea47122d..8faed6174f06 100644
--- a/ip/ip_common.h
+++ b/ip/ip_common.h
@@ -90,6 +90,7 @@ struct link_util *get_link_slave_kind(const char *slave_kind);
 
 void br_dump_bridge_id(const struct ifla_bridge_id *id, char *buf, size_t len);
 
+__u32 ipvrf_get_table(char *name);
 bool name_is_vrf(const char *name);
 
 #ifndefINFINITY_LIFE_TIME
diff --git a/ip/iplink_vrf.c b/ip/iplink_vrf.c
index 166f566e6d77..b4013773537f 100644
--- a/ip/iplink_vrf.c
+++ b/ip/iplink_vrf.c
@@ -97,6 +97,72 @@ struct link_util vrf_slave_link_util = {
.slave  = true,
 };
 
+/* returns table id if name is a VRF device */
+__u32 ipvrf_get_table(char *name)
+{
+   struct {
+   struct nlmsghdr n;
+   struct ifinfomsgi;
+   charbuf[1024];
+   } req = {
+   .n = {
+   .nlmsg_len   = NLMSG_LENGTH(sizeof(struct ifinfomsg)),
+   .nlmsg_flags = NLM_F_REQUEST,
+   .nlmsg_type  = RTM_GETLINK,
+   },
+   .i = {
+   .ifi_family  = preferred_family,
+   },
+   };
+   struct {
+   struct nlmsghdr n;
+   char buf[8192];
+   } answer;
+   struct rtattr *tb[IFLA_MAX+1];
+   struct rtattr *li[IFLA_INFO_MAX+1];
+   struct rtattr *vrf_attr[IFLA_VRF_MAX + 1];
+   struct ifinfomsg *ifi;
+   __u32 tb_id = 0;
+   int len;
+
+   addattr_l(, sizeof(req), IFLA_IFNAME, name, strlen(name) + 1);
+
+   if (rtnl_talk(, , , sizeof(answer)) < 0)
+   goto err;
+
+   ifi = NLMSG_DATA();
+   len = answer.n.nlmsg_len - NLMSG_LENGTH(sizeof(*ifi));
+   if (len < 0) {
+   fprintf(stderr, "BUG: Invalid response to link query.\n");
+   goto err;
+   }
+
+   parse_rtattr(tb, IFLA_MAX, IFLA_RTA(ifi), len);
+
+   if (!tb[IFLA_LINKINFO])
+   goto err;
+
+   parse_rtattr_nested(li, IFLA_INFO_MAX, tb[IFLA_LINKINFO]);
+
+   if (!li[IFLA_INFO_KIND] || !li[IFLA_INFO_DATA])
+   goto err;
+
+   if (strcmp(RTA_DATA(li[IFLA_INFO_KIND]), "vrf"))
+   goto err;
+
+   parse_rtattr_nested(vrf_attr, IFLA_VRF_MAX, li[IFLA_INFO_DATA]);
+   if (vrf_attr[IFLA_VRF_TABLE])
+   tb_id = rta_getattr_u32(vrf_attr[IFLA_VRF_TABLE]);
+
+   if (!tb_id)
+   fprintf(stderr, "BUG: VRF %s is missing table id\n", name);
+
+   return tb_id;
+
+err:
+   return 0;
+}
+
 bool name_is_vrf(const char *name)
 {
struct {
-- 
2.1.4



[PATCH v3 iproute2 6/6] ip route: Add support for vrf keyword

2016-06-29 Thread David Ahern
Add vrf keyword to 'ip route' commands. Allows:
1. Users can list routes by VRF name:
   $ ip route show vrf NAME

   VRF tables have all routes including local and broadcast routes.
   The VRF keyword filters LOCAL and BROADCAST routes; to see all
   routes the table option can be used. Or to see local routes only
   for a VRF:
   $ ip route show vrf NAME type local

2. Add or delete a route for a VRF:
   $ ip route {add|delete} vrf NAME 

3. Do a route lookup for a VRF:
   $ ip route get vrf NAME ADDRESS

Signed-off-by: David Ahern 
---
 ip/iproute.c   | 32 ++--
 man/man8/ip-route.8.in | 19 ++-
 2 files changed, 48 insertions(+), 3 deletions(-)

diff --git a/ip/iproute.c b/ip/iproute.c
index aae693d17be8..bd661c16cb46 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -67,10 +67,10 @@ static void usage(void)
fprintf(stderr, "   ip route showdump\n");
fprintf(stderr, "   ip route get ADDRESS [ from ADDRESS iif STRING 
]\n");
fprintf(stderr, "[ oif STRING ] [ tos TOS 
]\n");
-   fprintf(stderr, "[ mark NUMBER ]\n");
+   fprintf(stderr, "[ mark NUMBER ] [ vrf NAME 
]\n");
fprintf(stderr, "   ip route { add | del | change | append | 
replace } ROUTE\n");
fprintf(stderr, "SELECTOR := [ root PREFIX ] [ match PREFIX ] [ exact 
PREFIX ]\n");
-   fprintf(stderr, "[ table TABLE_ID ] [ proto RTPROTO ]\n");
+   fprintf(stderr, "[ table TABLE_ID ] [ vrf NAME ] [ proto 
RTPROTO ]\n");
fprintf(stderr, "[ type TYPE ] [ scope SCOPE ]\n");
fprintf(stderr, "ROUTE := NODE_SPEC [ INFO_SPEC ]\n");
fprintf(stderr, "NODE_SPEC := [ TYPE ] PREFIX [ tos TOS ]\n");
@@ -1141,6 +1141,20 @@ static int iproute_modify(int cmd, unsigned int flags, 
int argc, char **argv)
addattr32(, sizeof(req), RTA_TABLE, tid);
}
table_ok = 1;
+   } else if (matches(*argv, "vrf") == 0) {
+   __u32 tid;
+
+   NEXT_ARG();
+   tid = ipvrf_get_table(*argv);
+   if (tid == 0)
+   invarg("Invalid VRF\n", *argv);
+   if (tid < 256)
+   req.r.rtm_table = tid;
+   else {
+   req.r.rtm_table = RT_TABLE_UNSPEC;
+   addattr32(, sizeof(req), RTA_TABLE, tid);
+   }
+   table_ok = 1;
} else if (strcmp(*argv, "dev") == 0 ||
   strcmp(*argv, "oif") == 0) {
NEXT_ARG();
@@ -1395,6 +1409,15 @@ static int iproute_list_flush_or_save(int argc, char 
**argv, int action)
}
} else
filter.tb = tid;
+   } else if (matches(*argv, "vrf") == 0) {
+   __u32 tid;
+
+   NEXT_ARG();
+   tid = ipvrf_get_table(*argv);
+   if (tid == 0)
+   invarg("Invalid VRF\n", *argv);
+   filter.tb = tid;
+   filter.typemask = ~(1 << RTN_LOCAL | 1<

[PATCH v3 iproute2 3/6] ip neigh: Add support for keyword

2016-06-29 Thread David Ahern
Add vrf keyword to 'ip neigh' commands. Allows listing neighbor
entries for all links associated with a given VRF.

Signed-off-by: David Ahern 
---
 ip/ipneigh.c| 14 +-
 man/man8/ip-neighbour.8 |  8 +++-
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/ip/ipneigh.c b/ip/ipneigh.c
index 4ddb747e2086..3e444712645f 100644
--- a/ip/ipneigh.c
+++ b/ip/ipneigh.c
@@ -48,7 +48,8 @@ static void usage(void)
 {
fprintf(stderr, "Usage: ip neigh { add | del | change | replace }\n"
"{ ADDR [ lladdr LLADDR ] [ nud STATE ] 
| proxy ADDR } [ dev DEV ]\n");
-   fprintf(stderr, "   ip neigh { show | flush } [ proxy ] [ to PREFIX 
] [ dev DEV ] [ nud STATE ]\n\n");
+   fprintf(stderr, "   ip neigh { show | flush } [ proxy ] [ to PREFIX 
] [ dev DEV ] [ nud STATE ]\n");
+   fprintf(stderr, " [ vrf NAME ]\n\n");
fprintf(stderr, "STATE := { permanent | noarp | stale | reachable | 
none |\n"
"   incomplete | delay | probe | failed }\n");
exit(-1);
@@ -385,6 +386,17 @@ static int do_show_or_flush(int argc, char **argv, int 
flush)
invarg("Device does not exist\n", *argv);
addattr32(, sizeof(req), NDA_MASTER, ifindex);
filter.master = ifindex;
+   } else if (strcmp(*argv, "vrf") == 0) {
+   int ifindex;
+
+   NEXT_ARG();
+   ifindex = ll_name_to_index(*argv);
+   if (!ifindex)
+   invarg("Not a valid VRF name\n", *argv);
+   if (!name_is_vrf(*argv))
+   invarg("Not a valid VRF name\n", *argv);
+   addattr32(, sizeof(req), NDA_MASTER, ifindex);
+   filter.master = ifindex;
} else if (strcmp(*argv, "unused") == 0) {
filter.unused_only = 1;
} else if (strcmp(*argv, "nud") == 0) {
diff --git a/man/man8/ip-neighbour.8 b/man/man8/ip-neighbour.8
index b292e1814495..bbfe8e72a09c 100644
--- a/man/man8/ip-neighbour.8
+++ b/man/man8/ip-neighbour.8
@@ -31,7 +31,9 @@ ip-neighbour \- neighbour/arp tables management.
 .B  dev
 .IR DEV " ] [ "
 .B  nud
-.IR STATE " ]"
+.IR STATE " ] [ "
+.B  vrf
+.IR NAME " ] "
 
 .ti -8
 .IR STATE " := {"
@@ -164,6 +166,10 @@ the prefix selecting the neighbours to list.
 only list the neighbours attached to this device.
 
 .TP
+.BI vrf " NAME"
+only list the neighbours for given VRF.
+
+.TP
 .BI proxy
 list neighbour proxies.
 
-- 
2.1.4



Re: [PATCH v2 iproute2 0/6] Add support for vrf keyword

2016-06-29 Thread David Ahern

On 6/29/16 9:07 AM, Stephen Hemminger wrote:

On Mon, 27 Jun 2016 11:50:55 -0700
David Ahern  wrote:


Currently the syntax for VRF related commands is rather kludgy and
inconsistent from one subcommand to another. This set adds support
for the VRF keyword to the link, address, neigh, and route commands
to improve the user experience listing data associated with vrfs,
modifying routes or doing a route lookup.

v2
- rebased to top of tree
- all checkpatch warnings are usage lines. The change in these
  patches is consistent with existing code for usage lines


Does this break current user scripts?


I don't see how it can. Existing syntax is not touched so if a user 
wants to run:


$ ip link show master red
$ ip ro sh table 1001

it still works. Using the vrf keyword just makes for a more natural syntax:

$ ip link show vrf red
$ ip ro sh vrf red

in this case I don't have to lookup what table device red is associated 
with ip learns it and shows that table.




It seems this method will cause lots of additional netlink requests
to check if device is a vrf. Won't this impact users with 1000's of devices?



It only adds 1 GETLINK request if the vrf keyword is used. The lookup 
verifies the name is actually a VRF. It's really no different than the 
ll_name_to_index lookups used when processing the command line 
(ll_name_to_index relies on a cache or an ioctl, if_nametoindex).


Re: [PATCH iproute2 1/6] ip vrf: Add name_is_vrf

2016-06-29 Thread David Ahern

On 6/29/16 9:06 AM, Stephen Hemminger wrote:

On Mon, 27 Jun 2016 11:50:56 -0700
David Ahern  wrote:


diff --git a/ip/ip_common.h b/ip/ip_common.h
index e8da9e034b15..410eb135774a 100644
--- a/ip/ip_common.h
+++ b/ip/ip_common.h
@@ -90,6 +90,8 @@ struct link_util *get_link_slave_kind(const char *slave_kind);

 void br_dump_bridge_id(const struct ifla_bridge_id *id, char *buf, size_t len);

+bool name_is_vrf(char *name);
+
 #ifndefINFINITY_LIFE_TIME
 #define INFINITY_LIFE_TIME  0xU
 #endif
diff --git a/ip/iplink_vrf.c b/ip/iplink_vrf.c
index e3c7b4652da5..abd43c08423e 100644
--- a/ip/iplink_vrf.c
+++ b/ip/iplink_vrf.c
@@ -96,3 +96,56 @@ struct link_util vrf_slave_link_util = {
.print_opt  = vrf_slave_print_opt,
.slave  = true,
 };
+
+bool name_is_vrf(char *name)


Why not?
bool name_is_vrf(const char *name)


sure.




+{
+   struct {
+   struct nlmsghdr n;
+   struct ifinfomsgi;
+   charbuf[1024];
+   } req = {
+   .n = {
+   .nlmsg_len   = NLMSG_LENGTH(sizeof(struct ifinfomsg)),
+   .nlmsg_flags = NLM_F_REQUEST,
+   .nlmsg_type  = RTM_GETLINK,
+   },
+   .i = {
+   .ifi_family  = preferred_family,
+   },
+   };
+   struct {
+   struct nlmsghdr n;
+   char buf[8192];
+   } answer;
+   struct rtattr *tb[IFLA_MAX+1];
+   struct rtattr *li[IFLA_INFO_MAX+1];
+   struct ifinfomsg *ifi;
+   int len;
+
+   addattr_l(, sizeof(req), IFLA_IFNAME, name, strlen(name) + 1);
+
+   if (rtnl_talk(, , , sizeof(answer)) < 0)
+   goto err;

Just return false instead of all these goto's?
Also you might want to give some indication of error.


Any failure and the user gets the message "not a VRF". The reason can be 
the device is not a VRF or the device does not exist but that's the same 
thing in this case:


$ ./ip link show vrf foo
Error: argument "foo" is wrong: Not a valid VRF name

$ ./ip link show vrf eth1
Error: argument "eth1" is wrong: Not a valid VRF name




Re: [PATCH v2 net-next] tcp: md5: use kmalloc() backed scratch areas

2016-06-29 Thread Andy Lutomirski
On Wed, Jun 29, 2016 at 8:38 AM, Herbert Xu  wrote:
> On Wed, Jun 29, 2016 at 08:26:43AM -0700, Andy Lutomirski wrote:
>>
>> Two reasons:
>>
>> 1. Code like tcp md5 would be simpler if it could pass a scatterlist
>> to hash the skb but use a virtual address for the header.
>
> True.  But I bet we can make it simpler in other ways without
> creating special interfaces for it.  Look at how we do IPsec
> encryption/hashing, this is what TCP md5 should look like.  But
> nobody in their right mind would bother doing this because this
> is dead code.
>
>> 2. The actual calling sequence and the amount of indirection is much
>> less for shash, so hashing short buffer is probably *much* faster.
>
> Really?
>
> Have you measured the speed difference between the ahash and shash
> interfaces? Are you sure that this would matter when compared
> against the speed of hashing a single MD5 block?
>

It's non-negligible.  If I had SHA-NI, it would probably be a bigger
relative difference.

[0.535717] SHA1 ahash: warming up
[0.737471] SHA1 ahash: 50 loops, 386 ns/loop
[0.737937] SHA1 shash: warming up
[0.903131] SHA1 shash: 50 loops, 323 ns/loop
[1.055409] SHA1 shash_digest: 50 loops, 303 ns/loop

[0.574410] SHA1 ahash: warming up
[0.796927] SHA1 ahash: 50 loops, 414 ns/loop
[0.797695] SHA1 shash: warming up
[0.959772] SHA1 shash: 50 loops, 316 ns/loop
[1.100886] SHA1 shash_digest: 50 loops, 280 ns/loop

Here's MD5:

[0.504708] MD5 ahash: warming up
[0.688155] MD5 ahash: 50 loops, 344 ns/loop
[0.688971] MD5 shash: warming up
[0.834953] MD5 shash: 50 loops, 285 ns/loop
[0.982021] MD5 shash_digest: 50 loops, 292 ns/loop

[0.570780] MD5 ahash: warming up
[0.754325] MD5 ahash: 50 loops, 357 ns/loop
[0.754807] MD5 shash: warming up
[0.906861] MD5 shash: 50 loops, 297 ns/loop
[1.059057] MD5 shash_digest: 50 loops, 303 ns/loop

If you throw in sub-one-block requests, it'll be worse, because those
requests don't do real work.

Overall, it looks like there's overhead of something like 50ns for
each ahash invocation vs the shash equivalent.  It's not huge, but
it's there.  (This is cache-hot.  I bet it's considerably worse if
cache-cold, because ahash will require a lot more code cache lines as
well as the extra cache lines involved in the scatterlist and whatever
arch stuff is needed to map back and forth between virtual and
physical addresses.

Here's the code:

static void test_ahash(void)
{
struct crypto_ahash *hash;
struct ahash_request *req;
struct scatterlist sg_in;
char *buf, *out;
int i;
u64 end, start;
unsigned loops = 50;

hash = crypto_alloc_ahash("sha1", 0, CRYPTO_ALG_ASYNC);
if (IS_ERR(hash))
return;

req = ahash_request_alloc(hash, GFP_KERNEL);
if (!req)
return;

ahash_request_set_callback(req, 0, NULL, NULL);

buf = kmalloc(64, GFP_KERNEL);
memset(buf, 0, 64);
out = kmalloc(20, GFP_KERNEL);

pr_err("SHA1 ahash: warming up\n");

for (i = 0; i < 1; i++) {
crypto_ahash_init(req);
sg_init_one(_in, buf, 64);
ahash_request_set_crypt(req, _in, NULL, 64);
if (crypto_ahash_update(req))
return;
ahash_request_set_crypt(req, NULL, out, 0);
if (crypto_ahash_final(req))
return;
}

start = ktime_get_ns();
for (i = 0; i < loops; i++) {
crypto_ahash_init(req);
sg_init_one(_in, buf, 64);
ahash_request_set_crypt(req, _in, NULL, 64);
if (crypto_ahash_update(req))
return;
ahash_request_set_crypt(req, NULL, out, 0);
if (crypto_ahash_final(req))
return;
}
end = ktime_get_ns();

pr_err("SHA1 ahash: %u loops, %llu ns/loop\n", loops, (unsigned
long long)((end-start)/loops));
}

static void test_shash(void)
{
struct crypto_shash *hash;
struct shash_desc *desc;
char *buf, *out;
int i;
u64 end, start;
unsigned loops = 50;

hash = crypto_alloc_shash("sha1", 0, CRYPTO_ALG_ASYNC);
if (IS_ERR(hash))
return;

desc = kmalloc(crypto_shash_descsize(hash), GFP_KERNEL);
if (!desc)
return;
desc->tfm = hash;
desc->flags = 0;

/*
 * Hmm.  This interface isn't conducive to the use of hardware that
 * needs state beyond what lives in a single memory buffer -- how would
 * the shash implementation know when to free the state?
 */

buf = kmalloc(64, GFP_KERNEL);
memset(buf, 0, 64);
out = kmalloc(20, GFP_KERNEL);

pr_err("SHA1 shash: warming up\n");

for (i = 0; i < 1; i++) {
crypto_shash_init(desc);
if (crypto_shash_update(desc, buf, 64))
return;
if (crypto_shash_final(desc, out))
return;
}

start = ktime_get_ns();
for (i = 0; i < loops; i++) {
/*
 * For fairness, this 

Re: [PATCH net-next 08/16] net/devlink: Add E-Switch mode control

2016-06-29 Thread John Fastabend
On 16-06-29 07:48 AM, Or Gerlitz wrote:
> On 6/28/2016 10:31 PM, John Fastabend wrote:
>> On 16-06-28 12:12 PM, Jiri Pirko wrote:
>>>
>>> Why?! Please, leave legacy be legacy. Use the new mode for
>>> implementing new features. Don't make things any more complicated :(
>>>
>> OK so how I read this is there are two things going on that are being
>> conflated together. Creating VF netdev's is linked to the PCIe
>> subsystems and brings VFs into the netdev model. This is a good thing
>> but doesn't need to be a global nic policy it can be per port hence
>> the ethtool flag vs devlink discussion. I don't actually have a use case
>> to have one port with VF netdevs and another without it so I'm not too
>> particular on this. Logically it looks like a per port setting because
>> the hardware has no issues with making one physical function create
>> a netdev for each of its VFs and the other one run without these
>> netdevs. This is why I called it out.
>>
>> How this relates to bridge, tc, etc. is now you have a identifier to
>> configure instead of using strange 'ip link set ... vf#' commands. This
>> is great. But I see no reason the hardware has to make changes to
>> the existing tables or any of this. Before we used 'bridge fdb' and 'ip
>> link' now we can use bridge tools more effectively and can deprecate
>> the overloaded use of ip. But again I see no reason to thrash the
>> forwarding state of the switch because we happen to be adding VFs.
>> Having a set of fdb rules to forward MAC/Vlan pairs (as we do now)
>> seems like a perfectly reasonable default. Add with this patch now
>> when I run 'fdb show' I can see the defaults.
>>
>> Maybe I'm reading to much into the devlink flag names and if instead
>> you use a switch like the following,
>>
>>VF representer : enable/disable the creation VF netdev's to represent
>> the virtual functions on the PF
>>
>>
>> Much less complicated then magic switching between forwarding logic IMO
>> and you don't whack a default configuration that an entire stack (e.g.
>> libvirt) has been built to use.
> 
> 
> John,
> 
> I'll try to address here the core questions and arguments you brought.
> 

thanks. Also just to reiterate I really like the series just a few
details here.

> Re letting the user to observe/modify the rules added by the
> driver/firmware while legacy mode. Even if possible with bridge/fdb, it
> will be really pragmatical and doesn't make sense to get that donefor
> the TC subsystem. So this isn't a well defined solution and anyway, as
> you said, legacy mode enhancements is a different exercise. Personally,
> I agree with Jiri, that we should legacy be legacyand focus on adding
> the new model.
> 

The ixgbe driver already supports bridge and tc commands without the VF
representer.  Adding the VF representer to these drivers just extends
the existing support so we have an identifier for VFs and now the
redirect action works and the fdb commands can specify the VF netdevs.
I don't see this as a problem because we already do it today with
'ip' and bridge tools.

We are also slightly in disagreement about what the default should be
with VF netdevs. I think the default should be the same L2 mac/vlan
switch behavior and see no reason to change it by default just because
we added VF netdevs. The infrastructure libvirt/openstack/etc are built
around this default today. But I guess nothing in this series specifies
what the defaults of any given driver will be. VF netdevs are still
useful even on older hardware that only supports mac/vlan forwarding to
expose statistics and send/receive control frames such as lldp.

> The new model has few building blocks, and by all means, have the VF
> representors is not the full story, which is not magic but rather the
> following:
> 
> 1. VF (vport) representors netdevices + the needed mechanics
> (send-to-vport rules that makes xmit on VF rep --> recv on VF)
> 

We all agree on this. For me this should be its own knob VF netdevs or
no VF netdevs.

There is also my point that this is really a port attribute of the PCIe
configuration not a switch attribute.

> 2. handling HW data-patch misses --> send to CPU or drop

Yep need this also but we have a standard way to configure this already
with bridge and 'tc' so why have a toggle for it? Also you don't know
in the driver where I want to send missed packets. In some use cases I
have the VM is managing the system and in these cases I want to send
missed packets to a VF.

In ixgbe we get this for free (with the vf identifier netdevs) because
we have 'tc' and 'bridge' already hooked up. With 'tc' you can define
a wild card match with low priority and with 'bridge' model you can
setup the flood ports to do this.

> 
> 3. ability to offload SW rules (tc/bridge/etc) using VF representors and
> ingress qdiscs / bridge fdb rules / switchdev fdb rule, etc
> 
> The knob we suggested says that the system is put into a state where
> 1,2,3 are needed to make it full 

Re: [net 2/2] ixgbevf: ixgbevf_write/read_posted_mbx should use IXGBE_ERR_MBX to initialize ret_val

2016-06-29 Thread Greg
On Wed, 2016-06-29 at 09:30 -0700, Jeff Kirsher wrote:
> From: Xin Long 
> 
> Now ixgbevf_write/read_posted_mbx use -IXGBE_ERR_MBX as the initiative
> return value, but it's incorrect, cause in ixgbevf_vlan_rx_add_vid(),
> it use err == IXGBE_ERR_MBX, the err returned from mac.ops.set_vfta,
> and in ixgbevf_set_vfta_vf, it return from write/read_posted. so we
> should initialize err with IXGBE_ERR_MBX, instead of -IXGBE_ERR_MBX.

I believe "initiative" should be "initial", "cause" should be "because"
and "it use err" should be "it uses err".  And I won't be a grammar nazi
about all the commas.

- Greg

> 
> With this fix, the other functions that called it also can work well,
> cause they only care about if err is 0 or not.
> 
> Signed-off-by: Xin Long 
> Tested-by: Andrew Bowers 
> Signed-off-by: Jeff Kirsher 
> ---
>  drivers/net/ethernet/intel/ixgbevf/mbx.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/intel/ixgbevf/mbx.c 
> b/drivers/net/ethernet/intel/ixgbevf/mbx.c
> index 61a80da..2819abc 100644
> --- a/drivers/net/ethernet/intel/ixgbevf/mbx.c
> +++ b/drivers/net/ethernet/intel/ixgbevf/mbx.c
> @@ -85,7 +85,7 @@ static s32 ixgbevf_poll_for_ack(struct ixgbe_hw *hw)
>  static s32 ixgbevf_read_posted_mbx(struct ixgbe_hw *hw, u32 *msg, u16 size)
>  {
>   struct ixgbe_mbx_info *mbx = >mbx;
> - s32 ret_val = -IXGBE_ERR_MBX;
> + s32 ret_val = IXGBE_ERR_MBX;
>  
>   if (!mbx->ops.read)
>   goto out;
> @@ -111,7 +111,7 @@ out:
>  static s32 ixgbevf_write_posted_mbx(struct ixgbe_hw *hw, u32 *msg, u16 size)
>  {
>   struct ixgbe_mbx_info *mbx = >mbx;
> - s32 ret_val = -IXGBE_ERR_MBX;
> + s32 ret_val = IXGBE_ERR_MBX;
>  
>   /* exit if either we can't write or there isn't a defined timeout */
>   if (!mbx->ops.write || !mbx->timeout)




  1   2   3   >