Re: [PATCH net-next V3 0/6] switch to use tx skb array in tun
On Thu, Jun 30, 2016 at 11:52:53AM +0800, Jason Wang wrote: > Hi all: > > This series tries to switch to use skb array in tun. This is used to > eliminate the spinlock contention between producer and consumer. The > conversion was straightforward: just introdce a tx skb array and use > it instead of sk_receive_queue. > > A minor issue is to keep the tx_queue_len behaviour, since tun used to > use it for the length of sk_receive_queue. This is done through: > > - add the ability to resize multiple rings at once to avoid handling > partial resize failure for mutiple rings. > - add the support for zero length ring. > - introduce a notifier which was triggered when tx_queue_len was > changed for a netdev. > - resize all queues during the tx_queue_len changing. > > Tests shows about 15% improvement on guest rx pps: > > Before: ~130pps > After : ~150pps Series: Acked-by: Michael S. Tsirkin> Changes from V2: > - add multiple rings resizing support for ptr_ring/skb_array > - add zero length ring support > - introdce a NETDEV_CHANGE_TX_QUEUE_LEN > - drop new flags > > Changes from V1: > - switch to use skb array instead of a customized circular buffer > - add non-blocking support > - rename .peek to .peek_len > - drop lockless peeking since test show very minor improvement > > Jason Wang (5): > ptr_ring: support zero length ring > skb_array: minor tweak > skb_array: add wrappers for resizing > net: introduce NETDEV_CHANGE_TX_QUEUE_LEN > tun: switch to use skb array for tx > > Michael S. Tsirkin (1): > ptr_ring: support resizing multiple queues > > drivers/net/tun.c| 138 > --- > drivers/vhost/net.c | 16 - > include/linux/net.h | 1 + > include/linux/netdevice.h| 1 + > include/linux/ptr_ring.h | 77 ++ > include/linux/skb_array.h| 13 +++- > net/core/net-sysfs.c | 15 - > tools/virtio/ringtest/ptr_ring.c | 5 ++ > 8 files changed, 243 insertions(+), 23 deletions(-) > > -- > 2.7.4
Re: [PATCH net-next V3 5/6] net: introduce NETDEV_CHANGE_TX_QUEUE_LEN
On 2016年06月30日 12:56, John Fastabend wrote: On 16-06-29 08:52 PM, Jason Wang wrote: This patch introduces a new event - NETDEV_CHANGE_TX_QUEUE_LEN, this will be triggered when tx_queue_len. It could be used by net device who want to do some processing at that time. An example is tun who may want to resize tx array when tx_queue_len is changed. Signed-off-by: Jason Wang--- include/linux/netdevice.h | 1 + net/core/net-sysfs.c | 15 ++- 2 files changed, 15 insertions(+), 1 deletion(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index e84d9d2..7dc2ec7 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -2237,6 +2237,7 @@ struct netdev_lag_lower_state_info { #define NETDEV_PRECHANGEUPPER 0x001A #define NETDEV_CHANGELOWERSTATE 0x001B #define NETDEV_UDP_TUNNEL_PUSH_INFO 0x001C +#define NETDEV_CHANGE_TX_QUEUE_LEN 0x001E int register_netdevice_notifier(struct notifier_block *nb); int unregister_netdevice_notifier(struct notifier_block *nb); diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c index 7a0b616..6e4f347 100644 --- a/net/core/net-sysfs.c +++ b/net/core/net-sysfs.c @@ -322,7 +322,20 @@ NETDEVICE_SHOW_RW(flags, fmt_hex); static int change_tx_queue_len(struct net_device *dev, unsigned long new_len) { - dev->tx_queue_len = new_len; + int res, orig_len = dev->tx_queue_len; + + if (new_len != orig_len) { + dev->tx_queue_len = new_len; + res = call_netdevice_notifiers(NETDEV_CHANGE_TX_QUEUE_LEN, dev); + res = notifier_to_errno(res); + if (res) { + netdev_err(dev, + "refused to change device tx_queue_len\n"); + dev->tx_queue_len = orig_len; + return -EFAULT; + } + } + return 0; } Acked-by: John Fastabend Great timing I was just looking into this because I need it for the qdisc side. It looks like this covers the sysfs change but the tx_queue_len can also be changed via rtnetlink as well. So we need another patch for that path right? if (tb[IFLA_TXQLEN]) { unsigned long value = nla_get_u32(tb[IFLA_TXQLEN]); if (dev->tx_queue_len ^ value) status |= DO_SETLINK_NOTIFY; dev->tx_queue_len = value; } Thanks, John Right, will do this in next version. Thanks
Re: [PATCH V2 4/4] net-next: mediatek: add support for IRQ grouping
Hi, [auto build test ERROR on net/master] [also build test ERROR on next-20160629] [cannot apply to net-next/master v4.7-rc5] [if your patch is applied to the wrong git tree, please drop us a note to help improve the system] url: https://github.com/0day-ci/linux/commits/John-Crispin/net-next-mediatek-IRQ-cleanups-fixes-and-grouping/20160629-194341 config: arm-allmodconfig (attached as .config) compiler: arm-linux-gnueabi-gcc (Debian 5.3.1-8) 5.3.1 20160205 reproduce: wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross chmod +x ~/bin/make.cross # save the attached .config to linux build tree make.cross ARCH=arm All errors (new ones prefixed by >>): drivers/net/ethernet/mediatek/mtk_eth_soc.c: In function 'mtk_poll_controller': >> drivers/net/ethernet/mediatek/mtk_eth_soc.c:1300:2: error: implicit >> declaration of function 'mtk_handle_irq' >> [-Werror=implicit-function-declaration] mtk_handle_irq(dev->irq[0], dev); ^ >> drivers/net/ethernet/mediatek/mtk_eth_soc.c:1300:25: error: subscripted >> value is neither array nor pointer nor vector mtk_handle_irq(dev->irq[0], dev); ^ cc1: some warnings being treated as errors vim +/mtk_handle_irq +1300 drivers/net/ethernet/mediatek/mtk_eth_soc.c 1294 { 1295 struct mtk_mac *mac = netdev_priv(dev); 1296 struct mtk_eth *eth = mac->hw; 1297 u32 int_mask = MTK_TX_DONE_INT | MTK_RX_DONE_INT; 1298 1299 mtk_irq_disable(eth, int_mask); > 1300 mtk_handle_irq(dev->irq[0], dev); 1301 mtk_irq_enable(eth, int_mask); 1302 } 1303 #endif --- 0-DAY kernel test infrastructureOpen Source Technology Center https://lists.01.org/pipermail/kbuild-all Intel Corporation .config.gz Description: Binary data
Re: [PATCH net-next V3 5/6] net: introduce NETDEV_CHANGE_TX_QUEUE_LEN
On 16-06-29 08:52 PM, Jason Wang wrote: > This patch introduces a new event - NETDEV_CHANGE_TX_QUEUE_LEN, this > will be triggered when tx_queue_len. It could be used by net device > who want to do some processing at that time. An example is tun who may > want to resize tx array when tx_queue_len is changed. > > Signed-off-by: Jason Wang> --- > include/linux/netdevice.h | 1 + > net/core/net-sysfs.c | 15 ++- > 2 files changed, 15 insertions(+), 1 deletion(-) > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h > index e84d9d2..7dc2ec7 100644 > --- a/include/linux/netdevice.h > +++ b/include/linux/netdevice.h > @@ -2237,6 +2237,7 @@ struct netdev_lag_lower_state_info { > #define NETDEV_PRECHANGEUPPER0x001A > #define NETDEV_CHANGELOWERSTATE 0x001B > #define NETDEV_UDP_TUNNEL_PUSH_INFO 0x001C > +#define NETDEV_CHANGE_TX_QUEUE_LEN 0x001E > > int register_netdevice_notifier(struct notifier_block *nb); > int unregister_netdevice_notifier(struct notifier_block *nb); > diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c > index 7a0b616..6e4f347 100644 > --- a/net/core/net-sysfs.c > +++ b/net/core/net-sysfs.c > @@ -322,7 +322,20 @@ NETDEVICE_SHOW_RW(flags, fmt_hex); > > static int change_tx_queue_len(struct net_device *dev, unsigned long new_len) > { > - dev->tx_queue_len = new_len; > + int res, orig_len = dev->tx_queue_len; > + > + if (new_len != orig_len) { > + dev->tx_queue_len = new_len; > + res = call_netdevice_notifiers(NETDEV_CHANGE_TX_QUEUE_LEN, dev); > + res = notifier_to_errno(res); > + if (res) { > + netdev_err(dev, > +"refused to change device tx_queue_len\n"); > + dev->tx_queue_len = orig_len; > + return -EFAULT; > + } > + } > + > return 0; > } > > Acked-by: John Fastabend Great timing I was just looking into this because I need it for the qdisc side. It looks like this covers the sysfs change but the tx_queue_len can also be changed via rtnetlink as well. So we need another patch for that path right? if (tb[IFLA_TXQLEN]) { unsigned long value = nla_get_u32(tb[IFLA_TXQLEN]); if (dev->tx_queue_len ^ value) status |= DO_SETLINK_NOTIFY; dev->tx_queue_len = value; } Thanks, John
[PATCH net-next V3 3/6] ptr_ring: support resizing multiple queues
From: "Michael S. Tsirkin"Sometimes, we need support resizing multiple queues at once. This is because it was not easy to recover to recover from a partial failure of multiple queues resizing. Signed-off-by: Michael S. Tsirkin Signed-off-by: Jason Wang --- include/linux/ptr_ring.h | 71 +++- tools/virtio/ringtest/ptr_ring.c | 5 +++ 2 files changed, 67 insertions(+), 9 deletions(-) diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h index d78b8b8..2052011 100644 --- a/include/linux/ptr_ring.h +++ b/include/linux/ptr_ring.h @@ -349,20 +349,14 @@ static inline int ptr_ring_init(struct ptr_ring *r, int size, gfp_t gfp) return 0; } -static inline int ptr_ring_resize(struct ptr_ring *r, int size, gfp_t gfp, - void (*destroy)(void *)) +static inline void **__ptr_ring_swap_queue(struct ptr_ring *r, void **queue, + int size, gfp_t gfp, + void (*destroy)(void *)) { - unsigned long flags; int producer = 0; - void **queue = __ptr_ring_init_queue_alloc(size, gfp); void **old; void *ptr; - if (!queue) - return -ENOMEM; - - spin_lock_irqsave(&(r)->producer_lock, flags); - while ((ptr = ptr_ring_consume(r))) if (producer < size) queue[producer++] = ptr; @@ -375,6 +369,23 @@ static inline int ptr_ring_resize(struct ptr_ring *r, int size, gfp_t gfp, old = r->queue; r->queue = queue; + return old; +} + +static inline int ptr_ring_resize(struct ptr_ring *r, int size, gfp_t gfp, + void (*destroy)(void *)) +{ + unsigned long flags; + void **queue = __ptr_ring_init_queue_alloc(size, gfp); + void **old; + + if (!queue) + return -ENOMEM; + + spin_lock_irqsave(&(r)->producer_lock, flags); + + old = __ptr_ring_swap_queue(r, queue, size, gfp, destroy); + spin_unlock_irqrestore(&(r)->producer_lock, flags); kfree(old); @@ -382,6 +393,48 @@ static inline int ptr_ring_resize(struct ptr_ring *r, int size, gfp_t gfp, return 0; } +static inline int ptr_ring_resize_multiple(struct ptr_ring **rings, int nrings, + int size, + gfp_t gfp, void (*destroy)(void *)) +{ + unsigned long flags; + void ***queues; + int i; + + queues = kmalloc(nrings * sizeof *queues, gfp); + if (!queues) + goto noqueues; + + for (i = 0; i < nrings; ++i) { + queues[i] = __ptr_ring_init_queue_alloc(size, gfp); + if (!queues[i]) + goto nomem; + } + + for (i = 0; i < nrings; ++i) { + spin_lock_irqsave(&(rings[i])->producer_lock, flags); + queues[i] = __ptr_ring_swap_queue(rings[i], queues[i], + size, gfp, destroy); + spin_unlock_irqrestore(&(rings[i])->producer_lock, flags); + } + + for (i = 0; i < nrings; ++i) + kfree(queues[i]); + + kfree(queues); + + return 0; + +nomem: + while (--i >= 0) + kfree(queues[i]); + + kfree(queues); + +noqueues: + return -ENOMEM; +} + static inline void ptr_ring_cleanup(struct ptr_ring *r, void (*destroy)(void *)) { void *ptr; diff --git a/tools/virtio/ringtest/ptr_ring.c b/tools/virtio/ringtest/ptr_ring.c index 74abd74..68e4f9f 100644 --- a/tools/virtio/ringtest/ptr_ring.c +++ b/tools/virtio/ringtest/ptr_ring.c @@ -17,6 +17,11 @@ typedef pthread_spinlock_t spinlock_t; typedef int gfp_t; +static void *kmalloc(unsigned size, gfp_t gfp) +{ + return memalign(64, size); +} + static void *kzalloc(unsigned size, gfp_t gfp) { void *p = memalign(64, size); -- 2.7.4
Re: [PATCH net-next 08/16] net/devlink: Add E-Switch mode control
On 16-06-29 08:35 PM, John Fastabend wrote: > On 16-06-29 03:09 PM, John Fastabend wrote: >> On 16-06-29 02:33 PM, Or Gerlitz wrote: >>> On Wed, Jun 29, 2016 at 7:35 PM, John Fastabend >>>wrote: On 16-06-29 07:48 AM, Or Gerlitz wrote: > On 6/28/2016 10:31 PM, John Fastabend wrote: >> On 16-06-28 12:12 PM, Jiri Pirko wrote: >>> >>> Why?! Please, leave legacy be legacy. Use the new mode for >>> implementing new features. Don't make things any more complicated :( >>> >>> [...] >> Maybe I'm reading to much into the devlink flag names and if instead >> you use a switch like the following, >>> >>VF representer : enable/disable the creation VF netdev's to represent >> the virtual functions on the PF >>> >> Much less complicated then magic switching between forwarding logic IMO >> and you don't whack a default configuration that an entire stack (e.g. >> libvirt) has been built to use. >>> > Re letting the user to observe/modify the rules added by the > driver/firmware while legacy mode. Even if possible with bridge/fdb, it > will be really pragmatical and doesn't make sense to get that donefor > the TC subsystem. So this isn't a well defined solution and anyway, as > you said, legacy mode enhancements is a different exercise. Personally, > I agree with Jiri, that we should legacy be legacyand focus on adding > the new model. >>> The ixgbe driver already supports bridge and tc commands without the VF representer. Adding the VF representer to these drivers just extends the existing support so we have an identifier for VFs and now the redirect action works and the fdb commands can specify the VF netdevs. I don't see this as a problem because we already do it today with 'ip' and bridge tools. >>> >>> To be precise, for both ixgbe and mlx5, the existing tc support >>> (u32/ixgbe, flower/mlx5) is not for switching functionality but rather >>> for NIC-ish one, e.g drop, mark, etc. Indeed in ixgbe you added >>> redirect to VF, but this is only for south --> north (wire --> VF) >>> traffic, w.o the VF rep you can't do the other way around. >>> >> >> Correct which is why we need the VF rep. So we are completely in >> sync there. >> >>> Just to clarify, to what exact bridge command support did you refer for >>> ixgbe? >> >> 'bridge fdb' commands are supported today on the PF. But its the >> same story as above we need the VF rep to also use it on the >> VF representer >> >> Also 'bridge link' command for veb/vepa modes is supported and the >> other link attributes could be supported with additional driver >> support. No need for core changes here. But again yes only on the >> PF so again we need the VF reps. >> >>> >>> The forwarding done in the legacy mode is not well defined, and >>> different across vendors, adding there the VF reps will not make it >>> any better b/c some steering rules will be set by tc/bridge offloads >>> while other rules will be put by the driver. >>> I don't see how this takes us to better place. >> >> In legacy mode or any other mode you are defining some default policy >> and rules. >> >> In the legacy mode we use mac/vlan assigned l2 forwarding entries in the >> hardware fdb which are seen when you query 'ip link' and 'bridge fdb' >> today. And similarly can be modified today using 'ip link' and 'bridge >> fdb' at least on the intel devices. Its not undefined in any way with >> a quick query of the tools we can learn exactly what the configuration >> is and even change it. This works fairly well with existing controllers >> and stacks. >> >> The limitations are 'ip' only supports a single MAC address per VF and >> 'tc' doesn't work on VF ports because when the VF is assigned to a VM >> or namespace we lose visibility of it. Providing a VF rep for this >> solves both of those problems. >> >> In this new mode the default policy is to create a default miss rule >> and implement no l2 forwarding rules. Unfortunately not all hardware >> in use supports this default miss rule case but would still benefit >> from having a VF rep. So we shouldn't make this a stipulation for >> enabling VF reps. It also changes a default policy that has been in >> place for years without IMO at least any compelling reason. It will >> be easy enough to change the default l2 policy to a flow based model >> with a few bridge/tc commands. >> >>> We are also slightly in disagreement about what the default should be with VF netdevs. I think the default should be the same L2 mac/vlan switch behavior and see no reason to change it by default just because we added VF netdevs. The infrastructure libvirt/openstack/etc are built around this default today. But I guess nothing in this series specifies what the defaults of any given driver will be. VF netdevs are still useful even on older hardware that only supports mac/vlan forwarding to
[PATCH net-next V3 2/6] skb_array: minor tweak
Signed-off-by: Michael S. TsirkinSigned-off-by: Jason Wang --- include/linux/skb_array.h | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/include/linux/skb_array.h b/include/linux/skb_array.h index 678bfbf..2dd0d1e 100644 --- a/include/linux/skb_array.h +++ b/include/linux/skb_array.h @@ -151,12 +151,12 @@ static inline int skb_array_init(struct skb_array *a, int size, gfp_t gfp) return ptr_ring_init(>ring, size, gfp); } -void __skb_array_destroy_skb(void *ptr) +static void __skb_array_destroy_skb(void *ptr) { kfree_skb(ptr); } -int skb_array_resize(struct skb_array *a, int size, gfp_t gfp) +static inline int skb_array_resize(struct skb_array *a, int size, gfp_t gfp) { return ptr_ring_resize(>ring, size, gfp, __skb_array_destroy_skb); } -- 2.7.4
[PATCH net-next V3 5/6] net: introduce NETDEV_CHANGE_TX_QUEUE_LEN
This patch introduces a new event - NETDEV_CHANGE_TX_QUEUE_LEN, this will be triggered when tx_queue_len. It could be used by net device who want to do some processing at that time. An example is tun who may want to resize tx array when tx_queue_len is changed. Signed-off-by: Jason Wang--- include/linux/netdevice.h | 1 + net/core/net-sysfs.c | 15 ++- 2 files changed, 15 insertions(+), 1 deletion(-) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index e84d9d2..7dc2ec7 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -2237,6 +2237,7 @@ struct netdev_lag_lower_state_info { #define NETDEV_PRECHANGEUPPER 0x001A #define NETDEV_CHANGELOWERSTATE0x001B #define NETDEV_UDP_TUNNEL_PUSH_INFO0x001C +#define NETDEV_CHANGE_TX_QUEUE_LEN 0x001E int register_netdevice_notifier(struct notifier_block *nb); int unregister_netdevice_notifier(struct notifier_block *nb); diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c index 7a0b616..6e4f347 100644 --- a/net/core/net-sysfs.c +++ b/net/core/net-sysfs.c @@ -322,7 +322,20 @@ NETDEVICE_SHOW_RW(flags, fmt_hex); static int change_tx_queue_len(struct net_device *dev, unsigned long new_len) { - dev->tx_queue_len = new_len; + int res, orig_len = dev->tx_queue_len; + + if (new_len != orig_len) { + dev->tx_queue_len = new_len; + res = call_netdevice_notifiers(NETDEV_CHANGE_TX_QUEUE_LEN, dev); + res = notifier_to_errno(res); + if (res) { + netdev_err(dev, + "refused to change device tx_queue_len\n"); + dev->tx_queue_len = orig_len; + return -EFAULT; + } + } + return 0; } -- 2.7.4
[PATCH net-next V3 1/6] ptr_ring: support zero length ring
Sometimes, we need zero length ring. But current code will crash since we don't do any check before accessing the ring. This patch fixes this. Signed-off-by: Jason Wang--- include/linux/ptr_ring.h | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h index 562a65e..d78b8b8 100644 --- a/include/linux/ptr_ring.h +++ b/include/linux/ptr_ring.h @@ -102,7 +102,7 @@ static inline bool ptr_ring_full_bh(struct ptr_ring *r) */ static inline int __ptr_ring_produce(struct ptr_ring *r, void *ptr) { - if (r->queue[r->producer]) + if (unlikely(!r->size) || r->queue[r->producer]) return -ENOSPC; r->queue[r->producer++] = ptr; @@ -164,7 +164,9 @@ static inline int ptr_ring_produce_bh(struct ptr_ring *r, void *ptr) */ static inline void *__ptr_ring_peek(struct ptr_ring *r) { - return r->queue[r->consumer]; + if (likely(r->size)) + return r->queue[r->consumer]; + return NULL; } /* Note: callers invoking this in a loop must use a compiler barrier, -- 2.7.4
[PATCH net-next V3 4/6] skb_array: add wrappers for resizing
Signed-off-by: Michael S. TsirkinSigned-off-by: Jason Wang --- include/linux/skb_array.h | 9 + 1 file changed, 9 insertions(+) diff --git a/include/linux/skb_array.h b/include/linux/skb_array.h index 2dd0d1e..f4dfade 100644 --- a/include/linux/skb_array.h +++ b/include/linux/skb_array.h @@ -161,6 +161,15 @@ static inline int skb_array_resize(struct skb_array *a, int size, gfp_t gfp) return ptr_ring_resize(>ring, size, gfp, __skb_array_destroy_skb); } +static inline int skb_array_resize_multiple(struct skb_array **rings, + int nrings, int size, gfp_t gfp) +{ + BUILD_BUG_ON(offsetof(struct skb_array, ring)); + return ptr_ring_resize_multiple((struct ptr_ring **)rings, + nrings, size, gfp, + __skb_array_destroy_skb); +} + static inline void skb_array_cleanup(struct skb_array *a) { ptr_ring_cleanup(>ring, __skb_array_destroy_skb); -- 2.7.4
[PATCH net-next V3 0/6] switch to use tx skb array in tun
Hi all: This series tries to switch to use skb array in tun. This is used to eliminate the spinlock contention between producer and consumer. The conversion was straightforward: just introdce a tx skb array and use it instead of sk_receive_queue. A minor issue is to keep the tx_queue_len behaviour, since tun used to use it for the length of sk_receive_queue. This is done through: - add the ability to resize multiple rings at once to avoid handling partial resize failure for mutiple rings. - add the support for zero length ring. - introduce a notifier which was triggered when tx_queue_len was changed for a netdev. - resize all queues during the tx_queue_len changing. Tests shows about 15% improvement on guest rx pps: Before: ~130pps After : ~150pps Changes from V2: - add multiple rings resizing support for ptr_ring/skb_array - add zero length ring support - introdce a NETDEV_CHANGE_TX_QUEUE_LEN - drop new flags Changes from V1: - switch to use skb array instead of a customized circular buffer - add non-blocking support - rename .peek to .peek_len - drop lockless peeking since test show very minor improvement Jason Wang (5): ptr_ring: support zero length ring skb_array: minor tweak skb_array: add wrappers for resizing net: introduce NETDEV_CHANGE_TX_QUEUE_LEN tun: switch to use skb array for tx Michael S. Tsirkin (1): ptr_ring: support resizing multiple queues drivers/net/tun.c| 138 --- drivers/vhost/net.c | 16 - include/linux/net.h | 1 + include/linux/netdevice.h| 1 + include/linux/ptr_ring.h | 77 ++ include/linux/skb_array.h| 13 +++- net/core/net-sysfs.c | 15 - tools/virtio/ringtest/ptr_ring.c | 5 ++ 8 files changed, 243 insertions(+), 23 deletions(-) -- 2.7.4
[PATCH net-next V3 6/6] tun: switch to use skb array for tx
We used to queue tx packets in sk_receive_queue, this is less efficient since it requires spinlocks to synchronize between producer and consumer. This patch tries to address this by: - switch from sk_receive_queue to a skb_array, and resize it when tx_queue_len was changed. - introduce a new proto_ops peek_len which was used for peeking the skb length. - implement a tun version of peek_len for vhost_net to use and convert vhost_net to use peek_len if possible. Pktgen test shows about 15.3% improvement on guest receiving pps for small buffers: Before: ~130pps After : ~150pps Signed-off-by: Jason Wang--- drivers/net/tun.c | 138 +--- drivers/vhost/net.c | 16 +- include/linux/net.h | 1 + 3 files changed, 146 insertions(+), 9 deletions(-) diff --git a/drivers/net/tun.c b/drivers/net/tun.c index 4884802..3be69ea 100644 --- a/drivers/net/tun.c +++ b/drivers/net/tun.c @@ -71,6 +71,7 @@ #include #include #include +#include #include @@ -167,6 +168,7 @@ struct tun_file { }; struct list_head next; struct tun_struct *detached; + struct skb_array tx_array; }; struct tun_flow_entry { @@ -515,7 +517,11 @@ static struct tun_struct *tun_enable_queue(struct tun_file *tfile) static void tun_queue_purge(struct tun_file *tfile) { - skb_queue_purge(>sk.sk_receive_queue); + struct sk_buff *skb; + + while ((skb = skb_array_consume(>tx_array)) != NULL) + kfree_skb(skb); + skb_queue_purge(>sk.sk_error_queue); } @@ -560,6 +566,8 @@ static void __tun_detach(struct tun_file *tfile, bool clean) tun->dev->reg_state == NETREG_REGISTERED) unregister_netdevice(tun->dev); } + if (tun) + skb_array_cleanup(>tx_array); sock_put(>sk); } } @@ -613,6 +621,7 @@ static void tun_detach_all(struct net_device *dev) static int tun_attach(struct tun_struct *tun, struct file *file, bool skip_filter) { struct tun_file *tfile = file->private_data; + struct net_device *dev = tun->dev; int err; err = security_tun_dev_attach(tfile->socket.sk, tun->security); @@ -642,6 +651,13 @@ static int tun_attach(struct tun_struct *tun, struct file *file, bool skip_filte if (!err) goto out; } + + if (!tfile->detached && + skb_array_init(>tx_array, dev->tx_queue_len, GFP_KERNEL)) { + err = -ENOMEM; + goto out; + } + tfile->queue_index = tun->numqueues; tfile->socket.sk->sk_shutdown &= ~RCV_SHUTDOWN; rcu_assign_pointer(tfile->tun, tun); @@ -891,8 +907,8 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev) nf_reset(skb); - /* Enqueue packet */ - skb_queue_tail(>socket.sk->sk_receive_queue, skb); + if (skb_array_produce(>tx_array, skb)) + goto drop; /* Notify and wake up reader process */ if (tfile->flags & TUN_FASYNC) @@ -1107,7 +1123,7 @@ static unsigned int tun_chr_poll(struct file *file, poll_table *wait) poll_wait(file, sk_sleep(sk), wait); - if (!skb_queue_empty(>sk_receive_queue)) + if (!skb_array_empty(>tx_array)) mask |= POLLIN | POLLRDNORM; if (sock_writeable(sk) || @@ -1426,22 +1442,61 @@ done: return total; } +static struct sk_buff *tun_ring_recv(struct tun_file *tfile, int noblock, +int *err) +{ + DECLARE_WAITQUEUE(wait, current); + struct sk_buff *skb = NULL; + + skb = skb_array_consume(>tx_array); + if (skb) + goto out; + if (noblock) { + *err = -EAGAIN; + goto out; + } + + add_wait_queue(>wq.wait, ); + current->state = TASK_INTERRUPTIBLE; + + while (1) { + skb = skb_array_consume(>tx_array); + if (skb) + break; + if (signal_pending(current)) { + *err = -ERESTARTSYS; + break; + } + if (tfile->socket.sk->sk_shutdown & RCV_SHUTDOWN) { + *err = -EFAULT; + break; + } + + schedule(); + }; + + current->state = TASK_RUNNING; + remove_wait_queue(>wq.wait, ); + +out: + return skb; +} + static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile, struct iov_iter *to, int noblock) { struct sk_buff *skb; ssize_t ret; - int peeked, err, off = 0; + int err; tun_debug(KERN_INFO, tun, "tun_do_read\n"); if (!iov_iter_count(to)) return 0; - /* Read
Re: [PATCH v2 net-next] tcp: md5: use kmalloc() backed scratch areas
On Wed, Jun 29, 2016 at 03:39:37PM -0700, Andy Lutomirski wrote: > > I don't care about TCP MD5 performance at all. Ease of maintenance is > nice, though, and maybe there are other places in the kernel where > performance does matter. TCP MD5 is using ahash because it touches SG lists. Touching SG lists is a pretty reliable indicator for hashing a large amount of data. In fact TCP MD5 hashes the entire packet content so we're talking about ~1000 bytes, just like IPsec. Therefore it is completely pointless to use something like shash for it as shash is only meant to be used for those hashing less one block of data or less. If you're aware of any other user in the kernel that is using ahash and is hashing a small amount of data in aggregate (not per update) please let me know. Cheers, -- Email: Herbert XuHome Page: http://gondor.apana.org.au/~herbert/ PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH net-next 08/16] net/devlink: Add E-Switch mode control
On 16-06-29 03:09 PM, John Fastabend wrote: > On 16-06-29 02:33 PM, Or Gerlitz wrote: >> On Wed, Jun 29, 2016 at 7:35 PM, John Fastabend >>wrote: >>> On 16-06-29 07:48 AM, Or Gerlitz wrote: On 6/28/2016 10:31 PM, John Fastabend wrote: > On 16-06-28 12:12 PM, Jiri Pirko wrote: >> >> Why?! Please, leave legacy be legacy. Use the new mode for >> implementing new features. Don't make things any more complicated :( >> >> [...] > Maybe I'm reading to much into the devlink flag names and if instead > you use a switch like the following, >> >VF representer : enable/disable the creation VF netdev's to represent > the virtual functions on the PF >> > Much less complicated then magic switching between forwarding logic IMO > and you don't whack a default configuration that an entire stack (e.g. > libvirt) has been built to use. >> Re letting the user to observe/modify the rules added by the driver/firmware while legacy mode. Even if possible with bridge/fdb, it will be really pragmatical and doesn't make sense to get that donefor the TC subsystem. So this isn't a well defined solution and anyway, as you said, legacy mode enhancements is a different exercise. Personally, I agree with Jiri, that we should legacy be legacyand focus on adding the new model. >> >>> The ixgbe driver already supports bridge and tc commands without the VF >>> representer. Adding the VF representer to these drivers just extends >>> the existing support so we have an identifier for VFs and now the >>> redirect action works and the fdb commands can specify the VF netdevs. >>> I don't see this as a problem because we already do it today with >>> 'ip' and bridge tools. >> >> To be precise, for both ixgbe and mlx5, the existing tc support >> (u32/ixgbe, flower/mlx5) is not for switching functionality but rather >> for NIC-ish one, e.g drop, mark, etc. Indeed in ixgbe you added >> redirect to VF, but this is only for south --> north (wire --> VF) >> traffic, w.o the VF rep you can't do the other way around. >> > > Correct which is why we need the VF rep. So we are completely in > sync there. > >> Just to clarify, to what exact bridge command support did you refer for >> ixgbe? > > 'bridge fdb' commands are supported today on the PF. But its the > same story as above we need the VF rep to also use it on the > VF representer > > Also 'bridge link' command for veb/vepa modes is supported and the > other link attributes could be supported with additional driver > support. No need for core changes here. But again yes only on the > PF so again we need the VF reps. > >> >> The forwarding done in the legacy mode is not well defined, and >> different across vendors, adding there the VF reps will not make it >> any better b/c some steering rules will be set by tc/bridge offloads >> while other rules will be put by the driver. >> I don't see how this takes us to better place. > > In legacy mode or any other mode you are defining some default policy > and rules. > > In the legacy mode we use mac/vlan assigned l2 forwarding entries in the > hardware fdb which are seen when you query 'ip link' and 'bridge fdb' > today. And similarly can be modified today using 'ip link' and 'bridge > fdb' at least on the intel devices. Its not undefined in any way with > a quick query of the tools we can learn exactly what the configuration > is and even change it. This works fairly well with existing controllers > and stacks. > > The limitations are 'ip' only supports a single MAC address per VF and > 'tc' doesn't work on VF ports because when the VF is assigned to a VM > or namespace we lose visibility of it. Providing a VF rep for this > solves both of those problems. > > In this new mode the default policy is to create a default miss rule > and implement no l2 forwarding rules. Unfortunately not all hardware > in use supports this default miss rule case but would still benefit > from having a VF rep. So we shouldn't make this a stipulation for > enabling VF reps. It also changes a default policy that has been in > place for years without IMO at least any compelling reason. It will > be easy enough to change the default l2 policy to a flow based model > with a few bridge/tc commands. > >> >>> We are also slightly in disagreement about what the default should be >>> with VF netdevs. I think the default should be the same L2 mac/vlan >>> switch behavior and see no reason to change it by default just because >>> we added VF netdevs. The infrastructure libvirt/openstack/etc are built >>> around this default today. But I guess nothing in this series specifies >>> what the defaults of any given driver will be. VF netdevs are still >>> useful even on older hardware that only supports mac/vlan forwarding to >>> expose statistics and send/receive control frames such as lldp. >> >> Again, this is not about default engineering... and using the
Re: [PATCH net-next V2] tun: introduce tx skb ring
On 2016年06月28日 15:09, Michael S. Tsirkin wrote: On Thu, Jun 23, 2016 at 01:14:07PM +0800, Jason Wang wrote: On 2016年06月23日 02:18, Michael S. Tsirkin wrote: On Fri, Jun 17, 2016 at 03:41:20AM +0300, Michael S. Tsirkin wrote: Would it help to have ptr_ring_resize that gets an array of rings and resizes them both to same length? OK, here it is. Untested so far, and no skb wrapper. Pls let me know whether this is what you had in mind. Exactly what I want. Thanks Ok and this for skb_array --> skb_array: add wrappers for resizing Signed-off-by: Michael S. Tsirkin-- diff --git a/include/linux/skb_array.h b/include/linux/skb_array.h index c900708..7e01c1f 100644 --- a/include/linux/skb_array.h +++ b/include/linux/skb_array.h @@ -151,16 +151,24 @@ static inline int skb_array_init(struct skb_array *a, int size, gfp_t gfp) return ptr_ring_init(>ring, size, 0, gfp); } -void __skb_array_destroy_skb(void *ptr) +static void __skb_array_destroy_skb(void *ptr) { kfree_skb(ptr); } -int skb_array_resize(struct skb_array *a, int size, gfp_t gfp) +static inline int skb_array_resize(struct skb_array *a, int size, gfp_t gfp) { return ptr_ring_resize(>ring, size, gfp, __skb_array_destroy_skb); } Will split up the above tweak into another patch when reposting. +static inline int skb_raay_resize_multiple(struct skb_array **rings, int nrings, I think you mean 'skb_array_resize' here. + int size, gfp_t gfp) +{ + BUILD_BUG_ON(offsetof(struct skb_array, ring)); + ptr_ring_resize_multiple((struct ptr_ring **)rings, nrings, size, gfp, +__skb_array_destroy_skb); This should be return ptr_ring_resize_multiple(... +} + static inline void skb_array_cleanup(struct skb_array *a) { ptr_ring_cleanup(>ring, __skb_array_destroy_skb); With this, looks like there's no need for a new flag. Will repost the series with those two patches. Thanks
Re: [PATCH net-next 9/9] net: hns: get reset registers from DT
On 2016/6/29 17:11, David Miller wrote: > From: Yisen Zhuang> Date: Mon, 27 Jun 2016 17:54:15 +0800 > >> @@ -361,9 +371,10 @@ static int hns_mdio_reset(struct mii_bus *bus) >> return -ENODEV; >> } >> >> +sc_reg = _dev->sc_reg; >> /* 1. reset req, and read reset st check */ >> -ret = mdio_sc_cfg_reg_write(mdio_dev, MDIO_SC_RESET_REQ, 0x1, >> -MDIO_SC_RESET_ST, 0x1, >> +ret = mdio_sc_cfg_reg_write(mdio_dev, sc_reg->mdio_reset_req, 0x1, >> +sc_reg->mdio_reset_st, 0x1, >> MDIO_CHECK_SET_ST); >> if (ret) { >> dev_err(>dev, "MDIO reset fail\n"); > What in the world are you doing to the indentation here? > > Please read your patches before actually sending them, such things > will be quite obvious by simple visual inspection. > > . > Hi David, i am sorry for my carelessness. i will pay more attention next time. Thanks for pointing it our -- MBR, Yankejian (Hackim Yim)
linux-next: manual merge of the net-next tree with Linus' tree
Hi all, Today's linux-next merge of the net-next tree got a conflict in: drivers/net/phy/fixed_phy.c between commit: 69fc58a57e56 ("net: phy: Manage fixed PHY address space using IDA") from Linus' tree and commit: bf7afb29d545 ("phy: improve safety of fixed-phy MII register reading") from the net-next tree. I fixed it up (see below) and can carry the fix as necessary. This is now fixed as far as linux-next is concerned, but any non trivial conflicts should be mentioned to your upstream maintainer when your tree is submitted for merging. You may also want to consider cooperating with the maintainer of the conflicting tree to minimise any particularly complex conflicts. -- Cheers, Stephen Rothwell diff --cc drivers/net/phy/fixed_phy.c index 9ec7f7353434,b376ada83598.. --- a/drivers/net/phy/fixed_phy.c +++ b/drivers/net/phy/fixed_phy.c @@@ -23,9 -23,9 +23,10 @@@ #include #include #include + #include +#include - #define MII_REGS_NUM 29 + #include "swphy.h" struct fixed_mdio_bus { struct mii_bus *mii_bus;
[net-next PATCH 0/5] HFSC patches, part 1
Hi, It's revised version of part of the patches I submitted really, really long time ago (back then I asked Patrick to ignore them as I found some issues shortly after submitting). Anyway this is the first set with very simple fixes/changes though some of them relatively subtle (I tried to do very exhaustive commit messages explaining what and why with those). The patches are against net-next tree. The second set will be heavier - or rather with more complex explanations, among those I have: - a fix to subtle issue introduced in http://permalink.gmane.org/gmane.linux.kernel.commits.2-4/8281 along with simplifying related stuff - update times to 96 bits (which allows to "just" use 32 bit shifts and improves curve definition accuracy at more extreme low/high speeds) - add curve "merging" instead of just selecting in convex case (computations mirror those from concave intersection) But these are eventually for later. Michal Soltys (5): net/sched/sch_hfsc.c: handle corner cases where head may change invalidating calculated deadline net/sched/sch_hfsc.c: add unlikely() in qdisc_peek_len() net/sched/sch_hfsc.c: remove leftover dlist and droplist net/sched/sch_hfsc.c: go passive after vt update net/sched/sch_hfsc.c: anchor virtual curve at proper vt in hfsc_change_fsc() net/sched/sch_hfsc.c | 54 ++-- 1 file changed, 27 insertions(+), 27 deletions(-) -- 2.1.3
[net-next PATCH 5/5] net/sched/sch_hfsc.c: anchor virtual curve at proper vt in hfsc_change_fsc()
cl->cl_vt alone is relative only to the current backlog period, while the curve operates on cumulative virtual time. This patch adds missing cl->cl_vtoff. Signed-off-by: Michal Soltys--- net/sched/sch_hfsc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/sched/sch_hfsc.c b/net/sched/sch_hfsc.c index 4eef857..dff92ea 100644 --- a/net/sched/sch_hfsc.c +++ b/net/sched/sch_hfsc.c @@ -940,7 +940,7 @@ static void hfsc_change_fsc(struct hfsc_class *cl, struct tc_service_curve *fsc) { sc2isc(fsc, >cl_fsc); - rtsc_init(>cl_virtual, >cl_fsc, cl->cl_vt, cl->cl_total); + rtsc_init(>cl_virtual, >cl_fsc, cl->cl_vtoff + cl->cl_vt, cl->cl_total); cl->cl_flags |= HFSC_FSC; } -- 2.1.3
[net-next PATCH 2/5] net/sched/sch_hfsc.c: add unlikely() in qdisc_peek_len()
The condition can only succeed on wrong configurations. Signed-off-by: Michal Soltys--- net/sched/sch_hfsc.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/sched/sch_hfsc.c b/net/sched/sch_hfsc.c index 6d6df6b..e2244bb 100644 --- a/net/sched/sch_hfsc.c +++ b/net/sched/sch_hfsc.c @@ -882,7 +882,7 @@ qdisc_peek_len(struct Qdisc *sch) unsigned int len; skb = sch->ops->peek(sch); - if (skb == NULL) { + if (unlikely(skb == NULL)) { qdisc_warn_nonwc("qdisc_peek_len", sch); return 0; } -- 2.1.3
[net-next PATCH 1/5] net/sched/sch_hfsc.c: handle corner cases where head may change invalidating calculated deadline
Realtime scheduling implemented in HFSC uses head of the queue to make the decision about which packet to schedule next. But in case of any head drop, the deadline calculated for the previous head is not necessarily correct for the next head (unless both packets have the same length). Thanks to peek() function used during dequeue - which internally is a dequeue operation - hfsc is almost safe from this issue, as peek() dequeues and isolates the head storing it temporarily until the real dequeue happens. But there is one exception: if after the class activation a drop happens before the first dequeue operation, there's never a chance to do the peek(). Adding peek() call in enqueue - if this is the first packet in a new backlog period AND the scheduler has realtime curve defined - fixes that one corner case. The 1st hfsc_dequeue() will use that peeked packet, similarly as every subsequent hfsc_dequeue() call uses packet peeked by the previous call. Signed-off-by: Michal Soltys--- net/sched/sch_hfsc.c | 11 ++- 1 file changed, 10 insertions(+), 1 deletion(-) diff --git a/net/sched/sch_hfsc.c b/net/sched/sch_hfsc.c index 8cb5eff..6d6df6b 100644 --- a/net/sched/sch_hfsc.c +++ b/net/sched/sch_hfsc.c @@ -1594,8 +1594,17 @@ hfsc_enqueue(struct sk_buff *skb, struct Qdisc *sch, struct sk_buff **to_free) return err; } - if (cl->qdisc->q.qlen == 1) + if (cl->qdisc->q.qlen == 1) { set_active(cl, qdisc_pkt_len(skb)); + /* +* If this is the first packet, isolate the head so an eventual +* head drop before the first dequeue operation has no chance +* to invalidate the deadline. +*/ + if (cl->cl_flags & HFSC_RSC) + cl->qdisc->ops->peek(cl->qdisc); + + } qdisc_qstats_backlog_inc(sch, skb); sch->q.qlen++; -- 2.1.3
[net-next PATCH 3/5] net/sched/sch_hfsc.c: remove leftover dlist and droplist
This is update to: commit a09ceb0e08140a ("sched: remove qdisc->drop") That commit removed qdisc->drop, but left alone dlist and droplist that no longer serve any meaningful purpose. Signed-off-by: Michal Soltys--- net/sched/sch_hfsc.c | 8 1 file changed, 8 deletions(-) diff --git a/net/sched/sch_hfsc.c b/net/sched/sch_hfsc.c index e2244bb..df07f06 100644 --- a/net/sched/sch_hfsc.c +++ b/net/sched/sch_hfsc.c @@ -130,7 +130,6 @@ struct hfsc_class { struct rb_node vt_node; /* parent's vt_tree member */ struct rb_root cf_tree; /* active children sorted by cl_f */ struct rb_node cf_node; /* parent's cf_heap member */ - struct list_head dlist; /* drop list member */ u64 cl_total; /* total work in bytes */ u64 cl_cumul; /* cumulative work in bytes done by @@ -177,8 +176,6 @@ struct hfsc_sched { struct hfsc_class root; /* root class */ struct Qdisc_class_hash clhash; /* class hash */ struct rb_root eligible;/* eligible tree */ - struct list_head droplist; /* active leaf class list (for - dropping) */ struct qdisc_watchdog watchdog; /* watchdog timer */ }; @@ -858,7 +855,6 @@ set_active(struct hfsc_class *cl, unsigned int len) if (cl->cl_flags & HFSC_FSC) init_vf(cl, len); - list_add_tail(>dlist, >sched->droplist); } static void @@ -867,8 +863,6 @@ set_passive(struct hfsc_class *cl) if (cl->cl_flags & HFSC_RSC) eltree_remove(cl); - list_del(>dlist); - /* * vttree is now handled in update_vf() so that update_vf(cl, 0, 0) * needs to be called explicitly to remove a class from vttree. @@ -1443,7 +1437,6 @@ hfsc_init_qdisc(struct Qdisc *sch, struct nlattr *opt) if (err < 0) return err; q->eligible = RB_ROOT; - INIT_LIST_HEAD(>droplist); q->root.cl_common.classid = sch->handle; q->root.refcnt = 1; @@ -1527,7 +1520,6 @@ hfsc_reset_qdisc(struct Qdisc *sch) hfsc_reset_class(cl); } q->eligible = RB_ROOT; - INIT_LIST_HEAD(>droplist); qdisc_watchdog_cancel(>watchdog); sch->qstats.backlog = 0; sch->q.qlen = 0; -- 2.1.3
[net-next PATCH 4/5] net/sched/sch_hfsc.c: go passive after vt update
When a class is going passive, it should update its cl_vt first to be consistent with the last dequeue operation. Otherwise its cl_vt will be one packet behind and parent's cvtmax might not be updated as well. One possible side effect is if some class goes passive and subsequently goes active /without/ its parent going passive - with cl_vt lagging one packet behind - comparison made in init_vf() will be affected (same period). Signed-off-by: Michal Soltys--- net/sched/sch_hfsc.c | 31 +++ 1 file changed, 15 insertions(+), 16 deletions(-) diff --git a/net/sched/sch_hfsc.c b/net/sched/sch_hfsc.c index df07f06..4eef857 100644 --- a/net/sched/sch_hfsc.c +++ b/net/sched/sch_hfsc.c @@ -778,6 +778,20 @@ update_vf(struct hfsc_class *cl, unsigned int len, u64 cur_time) else go_passive = 0; + /* update vt */ + cl->cl_vt = rtsc_y2x(>cl_virtual, cl->cl_total) + - cl->cl_vtoff + cl->cl_vtadj; + + /* +* if vt of the class is smaller than cvtmin, +* the class was skipped in the past due to non-fit. +* if so, we need to adjust vtadj. +*/ + if (cl->cl_vt < cl->cl_parent->cl_cvtmin) { + cl->cl_vtadj += cl->cl_parent->cl_cvtmin - cl->cl_vt; + cl->cl_vt = cl->cl_parent->cl_cvtmin; + } + if (go_passive) { /* no more active child, going passive */ @@ -794,25 +808,10 @@ update_vf(struct hfsc_class *cl, unsigned int len, u64 cur_time) continue; } - /* -* update vt and f -*/ - cl->cl_vt = rtsc_y2x(>cl_virtual, cl->cl_total) - - cl->cl_vtoff + cl->cl_vtadj; - - /* -* if vt of the class is smaller than cvtmin, -* the class was skipped in the past due to non-fit. -* if so, we need to adjust vtadj. -*/ - if (cl->cl_vt < cl->cl_parent->cl_cvtmin) { - cl->cl_vtadj += cl->cl_parent->cl_cvtmin - cl->cl_vt; - cl->cl_vt = cl->cl_parent->cl_cvtmin; - } - /* update the vt tree */ vttree_update(cl); + /* update f */ if (cl->cl_flags & HFSC_USC) { cl->cl_myf = cl->cl_myfadj + rtsc_y2x(>cl_ulimit, cl->cl_total); -- 2.1.3
Re: [iproute PATCH 1/2] ip-address: Support filtering by slave type, too
On Tue, 28 Jun 2016 15:07:16 +0200 Phil Sutterwrote: > +static int match_link_kind(struct rtattr **tb, char *kind, bool slave) const char *kind ?? > +{ > + if (!tb[IFLA_LINKINFO]) > + return -1; > + > + return strcmp(parse_link_kind(tb[IFLA_LINKINFO], slave), kind); > +} > +
[PATCH 0/1] qlcnic: add wmb() call in transmit data path.
0001-qlcnic-add-wmb-call-in-transmit-data-path.patch This patch adds a wmb() call in the Tx data path to ensure writes are completed before hardware fetches updated Tx descriptors. Please apply to net. Thanks, Sony
[PATCH 1/1] qlcnic: add wmb() call in transmit data path.
Call wmb() to ensure writes are complete before hardware fetches updated Tx descriptors. Signed-off-by: Sony Chacko--- drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c |2 ++ 1 files changed, 2 insertions(+), 0 deletions(-) diff --git a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c index 607bb7d..87c642d 100644 --- a/drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c +++ b/drivers/net/ethernet/qlogic/qlcnic/qlcnic_io.c @@ -772,6 +772,8 @@ netdev_tx_t qlcnic_xmit_frame(struct sk_buff *skb, struct net_device *netdev) tx_ring->tx_stats.tx_bytes += skb->len; tx_ring->tx_stats.xmit_called++; + /* Ensure writes are complete before HW fetches Tx descriptors */ + wmb(); qlcnic_update_cmd_producer(tx_ring); return NETDEV_TX_OK; -- 1.7.1
Re: [PATCH v2 net-next] tcp: md5: use kmalloc() backed scratch areas
On Wed, Jun 29, 2016 at 2:44 PM, Eric Dumazetwrote: > On Wed, 2016-06-29 at 09:41 -0700, Andy Lutomirski wrote: > >> Overall, it looks like there's overhead of something like 50ns for >> each ahash invocation vs the shash equivalent. It's not huge, but >> it's there. (This is cache-hot. I bet it's considerably worse if >> cache-cold, because ahash will require a lot more code cache lines as >> well as the extra cache lines involved in the scatterlist and whatever >> arch stuff is needed to map back and forth between virtual and >> physical addresses. > > I am kind of mystified seeing someone caring about TCP MD5, other than > just making sure it wont crash the host when it needs to be used ;) > > The real useful work would be to use a jump label so that we can avoid > spending cycles for non TCP MD5 sessions, when a host never had to use > any MD5 negotiation. > > > I don't care about TCP MD5 performance at all. Ease of maintenance is nice, though, and maybe there are other places in the kernel where performance does matter. --Andy
Re: [PATCH net-next 08/16] net/devlink: Add E-Switch mode control
On 16-06-29 02:33 PM, Or Gerlitz wrote: > On Wed, Jun 29, 2016 at 7:35 PM, John Fastabend >wrote: >> On 16-06-29 07:48 AM, Or Gerlitz wrote: >>> On 6/28/2016 10:31 PM, John Fastabend wrote: On 16-06-28 12:12 PM, Jiri Pirko wrote: > > Why?! Please, leave legacy be legacy. Use the new mode for > implementing new features. Don't make things any more complicated :( > > [...] Maybe I'm reading to much into the devlink flag names and if instead you use a switch like the following, > VF representer : enable/disable the creation VF netdev's to represent the virtual functions on the PF > Much less complicated then magic switching between forwarding logic IMO and you don't whack a default configuration that an entire stack (e.g. libvirt) has been built to use. > >>> Re letting the user to observe/modify the rules added by the >>> driver/firmware while legacy mode. Even if possible with bridge/fdb, it >>> will be really pragmatical and doesn't make sense to get that donefor >>> the TC subsystem. So this isn't a well defined solution and anyway, as >>> you said, legacy mode enhancements is a different exercise. Personally, >>> I agree with Jiri, that we should legacy be legacyand focus on adding >>> the new model. > >> The ixgbe driver already supports bridge and tc commands without the VF >> representer. Adding the VF representer to these drivers just extends >> the existing support so we have an identifier for VFs and now the >> redirect action works and the fdb commands can specify the VF netdevs. >> I don't see this as a problem because we already do it today with >> 'ip' and bridge tools. > > To be precise, for both ixgbe and mlx5, the existing tc support > (u32/ixgbe, flower/mlx5) is not for switching functionality but rather > for NIC-ish one, e.g drop, mark, etc. Indeed in ixgbe you added > redirect to VF, but this is only for south --> north (wire --> VF) > traffic, w.o the VF rep you can't do the other way around. > Correct which is why we need the VF rep. So we are completely in sync there. > Just to clarify, to what exact bridge command support did you refer for ixgbe? 'bridge fdb' commands are supported today on the PF. But its the same story as above we need the VF rep to also use it on the VF representer Also 'bridge link' command for veb/vepa modes is supported and the other link attributes could be supported with additional driver support. No need for core changes here. But again yes only on the PF so again we need the VF reps. > > The forwarding done in the legacy mode is not well defined, and > different across vendors, adding there the VF reps will not make it > any better b/c some steering rules will be set by tc/bridge offloads > while other rules will be put by the driver. > I don't see how this takes us to better place. In legacy mode or any other mode you are defining some default policy and rules. In the legacy mode we use mac/vlan assigned l2 forwarding entries in the hardware fdb which are seen when you query 'ip link' and 'bridge fdb' today. And similarly can be modified today using 'ip link' and 'bridge fdb' at least on the intel devices. Its not undefined in any way with a quick query of the tools we can learn exactly what the configuration is and even change it. This works fairly well with existing controllers and stacks. The limitations are 'ip' only supports a single MAC address per VF and 'tc' doesn't work on VF ports because when the VF is assigned to a VM or namespace we lose visibility of it. Providing a VF rep for this solves both of those problems. In this new mode the default policy is to create a default miss rule and implement no l2 forwarding rules. Unfortunately not all hardware in use supports this default miss rule case but would still benefit from having a VF rep. So we shouldn't make this a stipulation for enabling VF reps. It also changes a default policy that has been in place for years without IMO at least any compelling reason. It will be easy enough to change the default l2 policy to a flow based model with a few bridge/tc commands. > >> We are also slightly in disagreement about what the default should be >> with VF netdevs. I think the default should be the same L2 mac/vlan >> switch behavior and see no reason to change it by default just because >> we added VF netdevs. The infrastructure libvirt/openstack/etc are built >> around this default today. But I guess nothing in this series specifies >> what the defaults of any given driver will be. VF netdevs are still >> useful even on older hardware that only supports mac/vlan forwarding to >> expose statistics and send/receive control frames such as lldp. > > Again, this is not about default engineering... and using the VF reps > (not VF netdevs) in legacy mode only make it more cryptic to my > opinion. I agree some changes would be needed in openstack to support > the new model, but this is how
Re: [PATCH v2 net-next] tcp: md5: use kmalloc() backed scratch areas
On Wed, 2016-06-29 at 09:41 -0700, Andy Lutomirski wrote: > Overall, it looks like there's overhead of something like 50ns for > each ahash invocation vs the shash equivalent. It's not huge, but > it's there. (This is cache-hot. I bet it's considerably worse if > cache-cold, because ahash will require a lot more code cache lines as > well as the extra cache lines involved in the scatterlist and whatever > arch stuff is needed to map back and forth between virtual and > physical addresses. I am kind of mystified seeing someone caring about TCP MD5, other than just making sure it wont crash the host when it needs to be used ;) The real useful work would be to use a jump label so that we can avoid spending cycles for non TCP MD5 sessions, when a host never had to use any MD5 negotiation.
Re: [net] e1000e: keep VLAN interfaces functional after rxvlan off
David Miller wrote: From: Jeff KirsherDate: Tue, 28 Jun 2016 20:41:31 -0700 From: Jarod Wilson I've got a bug report about an e1000e interface, where a VLAN interface is set up on top of it: $ ip link add link ens1f0 name ens1f0.99 type vlan id 99 $ ip link set ens1f0 up $ ip link set ens1f0.99 up $ ip addr add 192.168.99.92 dev ens1f0.99 At this point, I can ping another host on vlan 99, ip 192.168.99.91. However, if I do the following: $ ethtool -K ens1f0 rxvlan off Then no traffic passes on ens1f0.99. It comes back if I toggle rxvlan on again. I'm not sure if this is actually intended behavior, or if there's a lack of software VLAN stripping fallback, or what, but things continue to work if I simply don't call e1000e_vlan_strip_disable() if there are active VLANs (plagiarizing a function from the e1000 driver here) on the interface. Also slipped a related-ish fix to the kerneldoc text for e1000e_vlan_strip_disable here... Signed-off-by: Jarod Wilson Tested-by: Aaron Brown Signed-off-by: Jeff Kirsher Applied, thanks. Hm.. There was actually a v2 patch that followed this one that fixed the problem slightly differently and slightly better, I think. http://marc.info/?l=intel-wired-lan=146551652424417=2 -- Jarod Wilson ja...@redhat.com
Re: [PATCH] ipv4: Fix ip_skb_dst_mtu to use the sk passed by ip_finish_output
On Wed, Jun 29, 2016, at 20:47, Shmulik Ladkani wrote: > ip_skb_dst_mtu uses skb->sk, assuming it is an AF_INET socket (e.g. it > calls ip_sk_use_pmtu which casts sk as an inet_sk). > > However, in the case of UDP tunneling, the skb->sk is not necessarily an > inet socket (could be AF_PACKET socket, or AF_UNSPEC if arriving from > tun/tap). > > OTOH, the sk passed as an argument throughout IP stack's output path is > the one which is of PMTU interest: > - In case of local sockets, sk is same as skb->sk; > - In case of a udp tunnel, sk is the tunneling socket. > > Fix, by passing ip_finish_output's sk to ip_skb_dst_mtu. > This augments 7026b1ddb6 'netfilter: Pass socket pointer down through > okfn().' > > Signed-off-by: Shmulik Ladkani> --- > > Discovered while debugging other issue in adjacent code area. > > My setup seems to be handling this well; However I'd appreciate a > Tested-by or Reviewed-by as one can only cover relatively few of the > possible code flows leading to ip_skb_dst_mtu. Well spotted. It all looks good, the socket's pmtudisc field is initialized accordingly and I suspect people want to have vxlan to use the path mtu when transmitting packets, even during forwarding. Maybe this can be tweaked with an additional patch, if people want to change this behavior. But as your patch doesn't change any of this, Reviewed-by: Hannes Frederic Sowa Thanks!
[net-next 06/15] ixgbe: Error handler for duplicate filter locations in hardware for cls_u32 offloads
From: Amritha NambiarFor u32 classifier filters, avoid overwriting existing filter in a hardware location without removing it first, to clean up inconsistencies due to duplicate values for filter location. Verified with the following filters: Create child hash tables: handle 1: u32 divisor 1 handle 2: u32 divisor 1 Link to the child hash table from parent hash table: handle 800:0:11 u32 ht 800: link 1: \ offset at 0 mask 0f00 shift 6 plus 0 eat \ match ip protocol 6 ff match ip dst 15.0.0.1/32 handle 800:0:12 u32 ht 800: link 2: \ offset at 0 mask 0f00 shift 6 plus 0 eat \ match ip protocol 17 ff match ip dst 16.0.0.1/32 Add filter into child hash table: handle 1:0:3 u32 ht 1: \ match tcp src 22 action drop Add another filter to the same location: handle 2:0:3 u32 ht 2: \ match tcp src 33 action drop Signed-off-by: Amritha Nambiar Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 41 --- 1 file changed, 25 insertions(+), 16 deletions(-) diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c index 75e6855..fd5a761 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c @@ -8315,8 +8315,11 @@ static int ixgbe_delete_clsu32(struct ixgbe_adapter *adapter, /* Clear this filter in the link data it is associated with */ if (uhtid != 0x800) { jump = adapter->jump_tables[uhtid]; - if (jump) - clear_bit(loc - 1, jump->child_loc_map); + if (!jump) + return -EINVAL; + if (!test_bit(loc - 1, jump->child_loc_map)) + return -EINVAL; + clear_bit(loc - 1, jump->child_loc_map); } /* Check if the filter being deleted is a link */ @@ -8606,7 +8609,7 @@ static int ixgbe_configure_clsu32(struct ixgbe_adapter *adapter, mask = kzalloc(sizeof(*mask), GFP_KERNEL); if (!mask) { err = -ENOMEM; - goto err_out; + goto free_input; } jump->input = input; jump->mask = mask; @@ -8629,7 +8632,7 @@ static int ixgbe_configure_clsu32(struct ixgbe_adapter *adapter, mask = kzalloc(sizeof(*mask), GFP_KERNEL); if (!mask) { err = -ENOMEM; - goto err_out; + goto free_input; } if ((uhtid != 0x800) && (adapter->jump_tables[uhtid])) { @@ -8639,6 +8642,20 @@ static int ixgbe_configure_clsu32(struct ixgbe_adapter *adapter, if ((adapter->jump_tables[uhtid])->mask) memcpy(mask, (adapter->jump_tables[uhtid])->mask, sizeof(*mask)); + + /* Lookup in all child hash tables if this location is already +* filled with a filter +*/ + for (i = 1; i < IXGBE_MAX_LINK_HANDLE; i++) { + struct ixgbe_jump_table *link = adapter->jump_tables[i]; + + if (link && (test_bit(loc - 1, link->child_loc_map))) { + e_err(drv, "Filter exists in location: %x\n", + loc); + err = -EINVAL; + goto err_out; + } + } } err = ixgbe_clsu32_build_input(input, mask, cls, field_ptr, NULL); if (err) @@ -8670,25 +8687,17 @@ static int ixgbe_configure_clsu32(struct ixgbe_adapter *adapter, ixgbe_update_ethtool_fdir_entry(adapter, input, input->sw_idx); spin_unlock(>fdir_perfect_lock); - if ((uhtid != 0x800) && (adapter->jump_tables[uhtid])) { - struct ixgbe_jump_table *link = adapter->jump_tables[uhtid]; + if ((uhtid != 0x800) && (adapter->jump_tables[uhtid])) + set_bit(loc - 1, (adapter->jump_tables[uhtid])->child_loc_map); - if (test_bit(loc - 1, link->child_loc_map)) { - e_err(drv, "Filter: %x exists in hash table: %x\n", - loc, uhtid); - err = -EINVAL; - goto free_mask; - } - set_bit(loc - 1, link->child_loc_map); - } kfree(mask); return err; err_out_w_lock: spin_unlock(>fdir_perfect_lock); err_out: - kfree(input); -free_mask: kfree(mask); +free_input: + kfree(input); free_jump: kfree(jump); return err; --
[net-next 05/15] ixgbe: Fix deleting link filters for cls_u32 offloads
From: Amritha NambiarOn deleting filters which are links to a child hash table, the filters in the child hash table must be cleared from the hardware if there is no link between the parent and child hash table. Verified with the following filters: Create a child hash table: handle 1: u32 divisor 1 Link to the child hash table from parent hash table: handle 800:0:10 u32 ht 800: link 1: \ offset at 0 mask 0f00 shift 6 plus 0 eat \ match ip protocol 6 ff match ip dst 15.0.0.1/32 Add filters into child hash table: handle 1:0:2 u32 ht 1: \ match tcp src 22 action drop handle 1:0:3 u32 ht 1: \ match tcp src 33 action drop Delete link filter from parent hash table: handle 800:0:10 u32 Signed-off-by: Amritha Nambiar Acked-by: Sridhar Samudrala Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 75 +++--- drivers/net/ethernet/intel/ixgbe/ixgbe_model.h | 4 ++ 2 files changed, 72 insertions(+), 7 deletions(-) diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c index 468fa9d..75e6855 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c @@ -8300,14 +8300,50 @@ int ixgbe_setup_tc(struct net_device *dev, u8 tc) static int ixgbe_delete_clsu32(struct ixgbe_adapter *adapter, struct tc_cls_u32_offload *cls) { + u32 hdl = cls->knode.handle; u32 uhtid = TC_U32_USERHTID(cls->knode.handle); - u32 loc; - int err; + u32 loc = cls->knode.handle & 0xf; + int err = 0, i, j; + struct ixgbe_jump_table *jump = NULL; + + if (loc > IXGBE_MAX_HW_ENTRIES) + return -EINVAL; if ((uhtid != 0x800) && (uhtid >= IXGBE_MAX_LINK_HANDLE)) return -EINVAL; - loc = cls->knode.handle & 0xf; + /* Clear this filter in the link data it is associated with */ + if (uhtid != 0x800) { + jump = adapter->jump_tables[uhtid]; + if (jump) + clear_bit(loc - 1, jump->child_loc_map); + } + + /* Check if the filter being deleted is a link */ + for (i = 1; i < IXGBE_MAX_LINK_HANDLE; i++) { + jump = adapter->jump_tables[i]; + if (jump && jump->link_hdl == hdl) { + /* Delete filters in the hardware in the child hash +* table associated with this link +*/ + for (j = 0; j < IXGBE_MAX_HW_ENTRIES; j++) { + if (!test_bit(j, jump->child_loc_map)) + continue; + spin_lock(>fdir_perfect_lock); + err = ixgbe_update_ethtool_fdir_entry(adapter, + NULL, + j + 1); + spin_unlock(>fdir_perfect_lock); + clear_bit(j, jump->child_loc_map); + } + /* Remove resources for this link */ + kfree(jump->input); + kfree(jump->mask); + kfree(jump); + adapter->jump_tables[i] = NULL; + return err; + } + } spin_lock(>fdir_perfect_lock); err = ixgbe_update_ethtool_fdir_entry(adapter, NULL, loc); @@ -8541,6 +8577,18 @@ static int ixgbe_configure_clsu32(struct ixgbe_adapter *adapter, if (!test_bit(link_uhtid - 1, >tables)) return err; + /* Multiple filters as links to the same hash table are not +* supported. To add a new filter with the same next header +* but different match/jump conditions, create a new hash table +* and link to it. +*/ + if (adapter->jump_tables[link_uhtid] && + (adapter->jump_tables[link_uhtid])->link_hdl) { + e_err(drv, "Link filter exists for link: %x\n", + link_uhtid); + return err; + } + for (i = 0; nexthdr[i].jump; i++) { if (nexthdr[i].o != cls->knode.sel->offoff || nexthdr[i].s != cls->knode.sel->offshift || @@ -8558,10 +8606,12 @@ static int ixgbe_configure_clsu32(struct ixgbe_adapter *adapter, mask = kzalloc(sizeof(*mask), GFP_KERNEL); if (!mask) {
[net-next 01/15] fm10k: don't use BIT() macro where the value isn't a bitmask
From: Jacob KellerThe FM10K_MAX_DATA_PER_TXD is really just using a bitshift as a power of 2 operation in an efficient manner. We shouldn't represent this as a BIT() because that obscures the intention of the operation. Signed-off-by: Jacob Keller Tested-by: Krishneil Singh Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/fm10k/fm10k.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/intel/fm10k/fm10k.h b/drivers/net/ethernet/intel/fm10k/fm10k.h index fcf106e..e98b86b 100644 --- a/drivers/net/ethernet/intel/fm10k/fm10k.h +++ b/drivers/net/ethernet/intel/fm10k/fm10k.h @@ -406,7 +406,7 @@ static inline u16 fm10k_desc_unused(struct fm10k_ring *ring) (&(((union fm10k_rx_desc *)((R)->desc))[i])) #define FM10K_MAX_TXD_PWR 14 -#define FM10K_MAX_DATA_PER_TXD BIT(FM10K_MAX_TXD_PWR) +#define FM10K_MAX_DATA_PER_TXD (1u << FM10K_MAX_TXD_PWR) /* Tx Descriptors needed, worst case */ #define TXD_USE_COUNT(S) DIV_ROUND_UP((S), FM10K_MAX_DATA_PER_TXD) -- 2.5.5
[net-next 02/15] fm10k: Align Rx buffers to 512B blocks
From: Alexander DuyckWhile reviewing the i40e driver changes to support page based receive I realized that I had overlooked the fact that the fm10k hardware required a 512 byte alignment for Rx buffers. This patch is meant to address that by changing the alignment for Rx buffers to 512 bytes instead of allowing it to be L1 cache aligned. Signed-off-by: Alexander Duyck Tested-by: Krishneil Singh Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/fm10k/fm10k_main.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c b/drivers/net/ethernet/intel/fm10k/fm10k_main.c index 0e166e9..53d02416 100644 --- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c +++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c @@ -272,7 +272,7 @@ static bool fm10k_add_rx_frag(struct fm10k_rx_buffer *rx_buffer, #if (PAGE_SIZE < 8192) unsigned int truesize = FM10K_RX_BUFSZ; #else - unsigned int truesize = SKB_DATA_ALIGN(size); + unsigned int truesize = ALIGN(size, 512); #endif unsigned int pull_len; -- 2.5.5
Re: [PATCH net-next 08/16] net/devlink: Add E-Switch mode control
On Wed, Jun 29, 2016 at 7:35 PM, John Fastabendwrote: > On 16-06-29 07:48 AM, Or Gerlitz wrote: >> On 6/28/2016 10:31 PM, John Fastabend wrote: >>> On 16-06-28 12:12 PM, Jiri Pirko wrote: Why?! Please, leave legacy be legacy. Use the new mode for implementing new features. Don't make things any more complicated :( [...] >>> Maybe I'm reading to much into the devlink flag names and if instead >>> you use a switch like the following, >>>VF representer : enable/disable the creation VF netdev's to represent >>> the virtual functions on the PF >>> Much less complicated then magic switching between forwarding logic IMO >>> and you don't whack a default configuration that an entire stack (e.g. >>> libvirt) has been built to use. >> Re letting the user to observe/modify the rules added by the >> driver/firmware while legacy mode. Even if possible with bridge/fdb, it >> will be really pragmatical and doesn't make sense to get that donefor >> the TC subsystem. So this isn't a well defined solution and anyway, as >> you said, legacy mode enhancements is a different exercise. Personally, >> I agree with Jiri, that we should legacy be legacyand focus on adding >> the new model. > The ixgbe driver already supports bridge and tc commands without the VF > representer. Adding the VF representer to these drivers just extends > the existing support so we have an identifier for VFs and now the > redirect action works and the fdb commands can specify the VF netdevs. > I don't see this as a problem because we already do it today with > 'ip' and bridge tools. To be precise, for both ixgbe and mlx5, the existing tc support (u32/ixgbe, flower/mlx5) is not for switching functionality but rather for NIC-ish one, e.g drop, mark, etc. Indeed in ixgbe you added redirect to VF, but this is only for south --> north (wire --> VF) traffic, w.o the VF rep you can't do the other way around. Just to clarify, to what exact bridge command support did you refer for ixgbe? The forwarding done in the legacy mode is not well defined, and different across vendors, adding there the VF reps will not make it any better b/c some steering rules will be set by tc/bridge offloads while other rules will be put by the driver. I don't see how this takes us to better place. > We are also slightly in disagreement about what the default should be > with VF netdevs. I think the default should be the same L2 mac/vlan > switch behavior and see no reason to change it by default just because > we added VF netdevs. The infrastructure libvirt/openstack/etc are built > around this default today. But I guess nothing in this series specifies > what the defaults of any given driver will be. VF netdevs are still > useful even on older hardware that only supports mac/vlan forwarding to > expose statistics and send/receive control frames such as lldp. Again, this is not about default engineering... and using the VF reps (not VF netdevs) in legacy mode only make it more cryptic to my opinion. I agree some changes would be needed in openstack to support the new model, but this is how progress is made... you can't always make all layer above you unchanged. Note that the VF reps behave the same as tap devices (v-switch doing xmit on tap --> recv in VM, VM sends --> recv on tap into the v-switch), so the change in open-stack would not be that big. [...] > Why I think the VF representer is a per port ethtool flag and not a > devlink option is my use case might be to assign a PF into a VM or > namespace where I don't want VF netdevs. again, we think the correct place to set how the eswitch is managed is through eswitch manager PCI devices and not net devices and hence ethtool is not the way to go. Also, how do you want your e-switch to be managed in this case?
[net-next 09/15] igb: re-use igb_ptp_reset in igb_ptp_init
From: Jacob KellerModify igb_ptp_init to take advantage of igb_ptp_reset, and remove duplicated work that was occurring in both igb_ptp_reset and igb_ptp_init. In total, resetting the TSAUXC register, and resetting the system time both happen in igb_ptp_reset already. igb_ptp_reset now also takes care of starting the delayed work item for overflow checks, as well. Signed-off-by: Jacob Keller Tested-by: Aaron Brown Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/igb/igb_main.c | 3 +- drivers/net/ethernet/intel/igb/igb_ptp.c | 48 +++ 2 files changed, 18 insertions(+), 33 deletions(-) diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c index ef3d642..750a5d8 100644 --- a/drivers/net/ethernet/intel/igb/igb_main.c +++ b/drivers/net/ethernet/intel/igb/igb_main.c @@ -2027,7 +2027,8 @@ void igb_reset(struct igb_adapter *adapter) wr32(E1000_VET, ETHERNET_IEEE_VLAN_TYPE); /* Re-enable PTP, where applicable. */ - igb_ptp_reset(adapter); + if (adapter->ptp_flags & IGB_PTP_ENABLED) + igb_ptp_reset(adapter); igb_get_phy_info(hw); } diff --git a/drivers/net/ethernet/intel/igb/igb_ptp.c b/drivers/net/ethernet/intel/igb/igb_ptp.c index 88d367d..907c225 100644 --- a/drivers/net/ethernet/intel/igb/igb_ptp.c +++ b/drivers/net/ethernet/intel/igb/igb_ptp.c @@ -1043,6 +1043,13 @@ int igb_ptp_set_ts_config(struct net_device *netdev, struct ifreq *ifr) -EFAULT : 0; } +/** + * igb_ptp_init - Initialize PTP functionality + * @adapter: Board private structure + * + * This function is called at device probe to initialize the PTP + * functionality. + */ void igb_ptp_init(struct igb_adapter *adapter) { struct e1000_hw *hw = >hw; @@ -1065,8 +1072,6 @@ void igb_ptp_init(struct igb_adapter *adapter) adapter->cc.mask = CYCLECOUNTER_MASK(64); adapter->cc.mult = 1; adapter->cc.shift = IGB_82576_TSYNC_SHIFT; - /* Dial the nominal frequency. */ - wr32(E1000_TIMINCA, INCPERIOD_82576 | INCVALUE_82576); adapter->ptp_flags |= IGB_PTP_OVERFLOW_CHECK; break; case e1000_82580: @@ -1086,8 +1091,6 @@ void igb_ptp_init(struct igb_adapter *adapter) adapter->cc.mask = CYCLECOUNTER_MASK(IGB_NBITS_82580); adapter->cc.mult = 1; adapter->cc.shift = 0; - /* Enable the timer functions by clearing bit 31. */ - wr32(E1000_TSAUXC, 0x0); adapter->ptp_flags |= IGB_PTP_OVERFLOW_CHECK; break; case e1000_i210: @@ -1113,46 +1116,24 @@ void igb_ptp_init(struct igb_adapter *adapter) adapter->ptp_caps.settime64 = igb_ptp_settime_i210; adapter->ptp_caps.enable = igb_ptp_feature_enable_i210; adapter->ptp_caps.verify = igb_ptp_verify_pin; - /* Enable the timer functions by clearing bit 31. */ - wr32(E1000_TSAUXC, 0x0); break; default: adapter->ptp_clock = NULL; return; } - wrfl(); - spin_lock_init(>tmreg_lock); INIT_WORK(>ptp_tx_work, igb_ptp_tx_work); - /* Initialize the clock and overflow work for devices that need it. */ - if ((hw->mac.type == e1000_i210) || (hw->mac.type == e1000_i211)) { - struct timespec64 ts = ktime_to_timespec64(ktime_get_real()); - - igb_ptp_settime_i210(>ptp_caps, ); - } else { - timecounter_init(>tc, >cc, -ktime_to_ns(ktime_get_real())); - } - - if (adapter->ptp_flags & IGB_PTP_OVERFLOW_CHECK) { + if (adapter->ptp_flags & IGB_PTP_OVERFLOW_CHECK) INIT_DELAYED_WORK(>ptp_overflow_work, igb_ptp_overflow_check); - schedule_delayed_work(>ptp_overflow_work, - IGB_SYSTIM_OVERFLOW_PERIOD); - } - - /* Initialize the time sync interrupts for devices that support it. */ - if (hw->mac.type >= e1000_82580) { - wr32(E1000_TSIM, TSYNC_INTERRUPTS); - wr32(E1000_IMS, E1000_IMS_TS); - } - adapter->tstamp_config.rx_filter = HWTSTAMP_FILTER_NONE; adapter->tstamp_config.tx_type = HWTSTAMP_TX_OFF; + igb_ptp_reset(adapter); + adapter->ptp_clock = ptp_clock_register(>ptp_caps, >pdev->dev); if (IS_ERR(adapter->ptp_clock)) { @@ -1205,9 +1186,6 @@ void igb_ptp_reset(struct igb_adapter *adapter) struct e1000_hw *hw = >hw; unsigned long flags; - if (!(adapter->ptp_flags & IGB_PTP_ENABLED)) - return; - /* reset
[net-next 08/15] igb: introduce IGB_PTP_OVERFLOW_CHECK flag
From: Jacob KellerDon't continue to use complex MAC type checks for handling various cases where we have overflow check code. Make this code more obvious by introducing a flag which is enabled for hardware that needs these checks. Signed-off-by: Jacob Keller Tested-by: Aaron Brown Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/igb/igb.h | 1 + drivers/net/ethernet/intel/igb/igb_ptp.c | 21 - 2 files changed, 9 insertions(+), 13 deletions(-) diff --git a/drivers/net/ethernet/intel/igb/igb.h b/drivers/net/ethernet/intel/igb/igb.h index 1e18a9e..38daaab 100644 --- a/drivers/net/ethernet/intel/igb/igb.h +++ b/drivers/net/ethernet/intel/igb/igb.h @@ -477,6 +477,7 @@ struct igb_adapter { /* flags controlling PTP/1588 function */ #define IGB_PTP_ENABLEDBIT(0) +#define IGB_PTP_OVERFLOW_CHECK BIT(1) #define IGB_FLAG_HAS_MSI BIT(0) #define IGB_FLAG_DCA_ENABLED BIT(1) diff --git a/drivers/net/ethernet/intel/igb/igb_ptp.c b/drivers/net/ethernet/intel/igb/igb_ptp.c index 504102b..88d367d 100644 --- a/drivers/net/ethernet/intel/igb/igb_ptp.c +++ b/drivers/net/ethernet/intel/igb/igb_ptp.c @@ -1067,6 +1067,7 @@ void igb_ptp_init(struct igb_adapter *adapter) adapter->cc.shift = IGB_82576_TSYNC_SHIFT; /* Dial the nominal frequency. */ wr32(E1000_TIMINCA, INCPERIOD_82576 | INCVALUE_82576); + adapter->ptp_flags |= IGB_PTP_OVERFLOW_CHECK; break; case e1000_82580: case e1000_i354: @@ -1087,6 +1088,7 @@ void igb_ptp_init(struct igb_adapter *adapter) adapter->cc.shift = 0; /* Enable the timer functions by clearing bit 31. */ wr32(E1000_TSAUXC, 0x0); + adapter->ptp_flags |= IGB_PTP_OVERFLOW_CHECK; break; case e1000_i210: case e1000_i211: @@ -1132,7 +1134,9 @@ void igb_ptp_init(struct igb_adapter *adapter) } else { timecounter_init(>tc, >cc, ktime_to_ns(ktime_get_real())); + } + if (adapter->ptp_flags & IGB_PTP_OVERFLOW_CHECK) { INIT_DELAYED_WORK(>ptp_overflow_work, igb_ptp_overflow_check); @@ -1169,20 +1173,11 @@ void igb_ptp_init(struct igb_adapter *adapter) **/ void igb_ptp_stop(struct igb_adapter *adapter) { - switch (adapter->hw.mac.type) { - case e1000_82576: - case e1000_82580: - case e1000_i354: - case e1000_i350: - cancel_delayed_work_sync(>ptp_overflow_work); - break; - case e1000_i210: - case e1000_i211: - /* No delayed work to cancel. */ - break; - default: + if (!(adapter->ptp_flags & IGB_PTP_ENABLED)) return; - } + + if (adapter->ptp_flags & IGB_PTP_OVERFLOW_CHECK) + cancel_delayed_work_sync(>ptp_overflow_work); cancel_work_sync(>ptp_tx_work); if (adapter->ptp_tx_skb) { -- 2.5.5
[net-next 07/15] igb: introduce ptp_flags variable and use it to replace IGB_FLAG_PTP
From: Jacob KellerUpcoming patches will introduce new PTP specific flags. To avoid cluttering the normal flags variable, introduce PTP specific "ptp_flags" variable for this purpose, and move IGB_FLAG_PTP to become IGB_PTP_ENABLED. Signed-off-by: Jacob Keller Tested-by: Aaron Brown Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/igb/igb.h | 5 - drivers/net/ethernet/intel/igb/igb_ptp.c | 7 --- 2 files changed, 8 insertions(+), 4 deletions(-) diff --git a/drivers/net/ethernet/intel/igb/igb.h b/drivers/net/ethernet/intel/igb/igb.h index b9609af..1e18a9e 100644 --- a/drivers/net/ethernet/intel/igb/igb.h +++ b/drivers/net/ethernet/intel/igb/igb.h @@ -445,6 +445,7 @@ struct igb_adapter { unsigned long ptp_tx_start; unsigned long last_rx_ptp_check; unsigned long last_rx_timestamp; + unsigned int ptp_flags; spinlock_t tmreg_lock; struct cyclecounter cc; struct timecounter tc; @@ -474,12 +475,14 @@ struct igb_adapter { u16 eee_advert; }; +/* flags controlling PTP/1588 function */ +#define IGB_PTP_ENABLEDBIT(0) + #define IGB_FLAG_HAS_MSI BIT(0) #define IGB_FLAG_DCA_ENABLED BIT(1) #define IGB_FLAG_QUAD_PORT_A BIT(2) #define IGB_FLAG_QUEUE_PAIRS BIT(3) #define IGB_FLAG_DMAC BIT(4) -#define IGB_FLAG_PTP BIT(5) #define IGB_FLAG_RSS_FIELD_IPV4_UDPBIT(6) #define IGB_FLAG_RSS_FIELD_IPV6_UDPBIT(7) #define IGB_FLAG_WOL_SUPPORTED BIT(8) diff --git a/drivers/net/ethernet/intel/igb/igb_ptp.c b/drivers/net/ethernet/intel/igb/igb_ptp.c index f097c5a..504102b 100644 --- a/drivers/net/ethernet/intel/igb/igb_ptp.c +++ b/drivers/net/ethernet/intel/igb/igb_ptp.c @@ -684,6 +684,7 @@ void igb_ptp_rx_hang(struct igb_adapter *adapter) u32 tsyncrxctl = rd32(E1000_TSYNCRXCTL); unsigned long rx_event; + /* Other hardware uses per-packet timestamps */ if (hw->mac.type != e1000_82576) return; @@ -1156,7 +1157,7 @@ void igb_ptp_init(struct igb_adapter *adapter) } else { dev_info(>pdev->dev, "added PHC on %s\n", adapter->netdev->name); - adapter->flags |= IGB_FLAG_PTP; + adapter->ptp_flags |= IGB_PTP_ENABLED; } } @@ -1194,7 +1195,7 @@ void igb_ptp_stop(struct igb_adapter *adapter) ptp_clock_unregister(adapter->ptp_clock); dev_info(>pdev->dev, "removed PHC on %s\n", adapter->netdev->name); - adapter->flags &= ~IGB_FLAG_PTP; + adapter->ptp_flags &= ~IGB_PTP_ENABLED; } } @@ -1209,7 +1210,7 @@ void igb_ptp_reset(struct igb_adapter *adapter) struct e1000_hw *hw = >hw; unsigned long flags; - if (!(adapter->flags & IGB_FLAG_PTP)) + if (!(adapter->ptp_flags & IGB_PTP_ENABLED)) return; /* reset the tstamp_config */ -- 2.5.5
[net-next 04/15] e1000e: prevent division by zero if TIMINCA is zero
From: Denys VlasenkoUsers report that under VMWare, er32(TIMINCA) returns zero. This causes division by zero at init time as follows: ==> incvalue = er32(TIMINCA) & E1000_TIMINCA_INCVALUE_MASK; for (i = 0; i < E1000_MAX_82574_SYSTIM_REREADS; i++) { /* latch SYSTIMH on read of SYSTIML */ systim_next = (cycle_t)er32(SYSTIML); systim_next |= (cycle_t)er32(SYSTIMH) << 32; time_delta = systim_next - systim; temp = time_delta; > rem = do_div(temp, incvalue); This change makes kernel survive this, and users report that NIC does work after this change. Since on real hardware incvalue is never zero, this should not affect real hardware use case. Signed-off-by: Denys Vlasenko Tested-by: Aaron Brown Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/e1000e/netdev.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c index 75e6089..9d5bab8 100644 --- a/drivers/net/ethernet/intel/e1000e/netdev.c +++ b/drivers/net/ethernet/intel/e1000e/netdev.c @@ -4352,7 +4352,8 @@ static cycle_t e1000e_cyclecounter_read(const struct cyclecounter *cc) time_delta = systim_next - systim; temp = time_delta; - rem = do_div(temp, incvalue); + /* VMWare users have seen incvalue of zero, don't div / 0 */ + rem = incvalue ? do_div(temp, incvalue) : (time_delta != 0); systim = systim_next; -- 2.5.5
[net-next 12/15] fm10k: Remove create_workqueue
From: Bhaktipriya Shridharalloc_workqueue replaces deprecated create_workqueue(). A dedicated workqueue has been used since the workitem (viz fm10k_service_task, which manages and runs other subtasks) is involved in normal device operation and requires forward progress under memory pressure. create_workqueue has been replaced with alloc_workqueue with max_active as 0 since there is no need for throttling the number of active work items. Since network devices may be used in memory reclaim path, WQ_MEM_RECLAIM has been set to guarantee forward progress. flush_workqueue is unnecessary since destroy_workqueue() itself calls drain_workqueue() which flushes repeatedly till the workqueue becomes empty. Hence the call to flush_workqueue() has been dropped. Signed-off-by: Bhaktipriya Shridhar Acked-by: Tejun Heo Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/fm10k/fm10k_main.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c b/drivers/net/ethernet/intel/fm10k/fm10k_main.c index 53d02416..a9ccc1e 100644 --- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c +++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c @@ -56,7 +56,7 @@ static int __init fm10k_init_module(void) pr_info("%s\n", fm10k_copyright); /* create driver workqueue */ - fm10k_workqueue = create_workqueue("fm10k"); + fm10k_workqueue = alloc_workqueue("fm10k", WQ_MEM_RECLAIM, 0); fm10k_dbg_init(); @@ -77,7 +77,6 @@ static void __exit fm10k_exit_module(void) fm10k_dbg_exit(); /* destroy driver workqueue */ - flush_workqueue(fm10k_workqueue); destroy_workqueue(fm10k_workqueue); } module_exit(fm10k_exit_module); -- 2.5.5
[net-next 13/15] ixgbe: Correct reporting of timestamping for x550
From: Tony NguyenUpdate ixgbe_ethtool_get_ts_info() to show that x550 supports hardware timestamping of all packets. Reported-by: Guy Harris Signed-off-by: Tony Nguyen Signed-off-by: Jacob Keller Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c index 59b771b..8a84507 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c @@ -2991,10 +2991,15 @@ static int ixgbe_get_ts_info(struct net_device *dev, { struct ixgbe_adapter *adapter = netdev_priv(dev); + /* we always support timestamping disabled */ + info->rx_filters = BIT(HWTSTAMP_FILTER_NONE); + switch (adapter->hw.mac.type) { case ixgbe_mac_X550: case ixgbe_mac_X550EM_x: case ixgbe_mac_x550em_a: + info->rx_filters |= BIT(HWTSTAMP_FILTER_ALL); + /* fallthrough */ case ixgbe_mac_X540: case ixgbe_mac_82599EB: info->so_timestamping = @@ -3014,8 +3019,7 @@ static int ixgbe_get_ts_info(struct net_device *dev, BIT(HWTSTAMP_TX_OFF) | BIT(HWTSTAMP_TX_ON); - info->rx_filters = - BIT(HWTSTAMP_FILTER_NONE) | + info->rx_filters |= BIT(HWTSTAMP_FILTER_PTP_V1_L4_SYNC) | BIT(HWTSTAMP_FILTER_PTP_V1_L4_DELAY_REQ) | BIT(HWTSTAMP_FILTER_PTP_V2_EVENT); -- 2.5.5
[net-next 00/15][pull request] Intel Wired LAN Driver Updates 2016-06-29
This series contains updates and fixes to e1000e, igb, ixgbe and fm10k. A true smorgasbord of changes. Jake cleans up some obscurity by not using the BIT() macro on bitshift operation and also fixed the calculated index when looping through the indir array. Fixes the issue with igb's workqueue item for overflow check from causing a surprise remove event. The ptp_flags variable is added to simplify the work of writing several complex MAC type checks in the PTP code while fixing the workqueue. Alex Duyck fixes the receive buffers alignment which should not be L1 cache aligned, but to 512 bytes instead. Denys Vlasenko prevents a division by zero which was reported under VMWare for e1000e. Amritha fixes an issue where filters in a child hash table must be cleared from the hardware before delete the filter links in ixgbe. Bhaktipriya Shridhar simply replaces the deprecated create_workqueue() with alloc_workqueue() for fm10k. Tony corrects ixgbe ethtool reporting to show x550 supports hardware timestamping of all packets. Emil fixes an issue where MAC-VLANs on the VF fail to pass traffic due to spoofed packets. Andrew Lunn increases performance on some systems where syncing a buffer for DMA is expensive. So rather than sync the whole 2K receive buffer, only synchronize the length of the frame. The following are changes since commit 6f30e8b022c8e3a722928ddb1a2ae0be852fcc0e: Merge branch '40GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue and are available in the git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 1GbE Alexander Duyck (1): fm10k: Align Rx buffers to 512B blocks Amritha Nambiar (2): ixgbe: Fix deleting link filters for cls_u32 offloads ixgbe: Error handler for duplicate filter locations in hardware for cls_u32 offloads Andrew Lunn (1): igb: Only DMA sync frame length Bhaktipriya Shridhar (1): fm10k: Remove create_workqueue Denys Vlasenko (1): e1000e: prevent division by zero if TIMINCA is zero Emil Tantilov (1): ixgbe: fix spoofed packets with macvlans Jacob Keller (7): fm10k: don't use BIT() macro where the value isn't a bitmask fm10k: fix incorrect index calculation in fm10k_write_reta igb: introduce ptp_flags variable and use it to replace IGB_FLAG_PTP igb: introduce IGB_PTP_OVERFLOW_CHECK flag igb: re-use igb_ptp_reset in igb_ptp_init igb: implement igb_ptp_suspend igb: call igb_ptp_suspend during suspend/resume cycle Tony Nguyen (1): ixgbe: Correct reporting of timestamping for x550 drivers/net/ethernet/intel/e1000e/netdev.c | 3 +- drivers/net/ethernet/intel/fm10k/fm10k.h | 2 +- drivers/net/ethernet/intel/fm10k/fm10k_ethtool.c | 5 +- drivers/net/ethernet/intel/fm10k/fm10k_main.c| 5 +- drivers/net/ethernet/intel/igb/igb.h | 7 +- drivers/net/ethernet/intel/igb/igb_main.c| 12 ++-- drivers/net/ethernet/intel/igb/igb_ptp.c | 92 +++- drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c | 8 ++- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c| 76 +++- drivers/net/ethernet/intel/ixgbe/ixgbe_model.h | 4 ++ drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 1 + 11 files changed, 148 insertions(+), 67 deletions(-) -- 2.5.5
[net-next 03/15] fm10k: fix incorrect index calculation in fm10k_write_reta
From: Jacob KellerThe index calculated when looping through the indir array passed to fm10k_write_reta was incorrectly calculated as the first part i needs to be multiplied by 4. Fixes: 0cfea7a65738 ("fm10k: fix possible null pointer deref after kcalloc", 2016-04-13) Signed-off-by: Jacob Keller Tested-by: Krishneil Singh Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/fm10k/fm10k_ethtool.c | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_ethtool.c b/drivers/net/ethernet/intel/fm10k/fm10k_ethtool.c index 9c0d875..9b51954 100644 --- a/drivers/net/ethernet/intel/fm10k/fm10k_ethtool.c +++ b/drivers/net/ethernet/intel/fm10k/fm10k_ethtool.c @@ -983,9 +983,10 @@ void fm10k_write_reta(struct fm10k_intfc *interface, const u32 *indir) /* generate a new table if we weren't given one */ for (j = 0; j < 4; j++) { if (indir) - n = indir[i + j]; + n = indir[4 * i + j]; else - n = ethtool_rxfh_indir_default(i + j, rss_i); + n = ethtool_rxfh_indir_default(4 * i + j, + rss_i); table[j] = n; } -- 2.5.5
[net-next 14/15] ixgbe: fix spoofed packets with macvlans
From: Emil TantilovWhen setting spoofing, both VLAN and MAC need to be set together. This change resolves an issue where MAC-VLANs on the VF fail to pass traffic due to spoofed packets. Signed-off-by: Emil Tantilov Tested-by: Andrew Bowers Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c index c5caacd..8618599 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_sriov.c @@ -954,6 +954,7 @@ static int ixgbe_set_vf_macvlan_msg(struct ixgbe_adapter *adapter, struct ixgbe_hw *hw = >hw; hw->mac.ops.set_mac_anti_spoofing(hw, false, vf); + hw->mac.ops.set_vlan_anti_spoofing(hw, false, vf); } } -- 2.5.5
[net-next 11/15] igb: call igb_ptp_suspend during suspend/resume cycle
From: Jacob KellerProperly stop the extra workqueue items and ensure that we resume cleanly. This is better than using igb_ptp_init and igb_ptp_stop since these functions destroy the PHC device, which will cause other problems if we do so. Since igb_ptp_reset now re-schedules the work-queue item we don't need an equivalent igb_ptp_resume in the resume workflow. Signed-off-by: Jacob Keller Tested-by: Aaron Brown Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/igb/igb_main.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c index 750a5d8..a15f826 100644 --- a/drivers/net/ethernet/intel/igb/igb_main.c +++ b/drivers/net/ethernet/intel/igb/igb_main.c @@ -7528,6 +7528,8 @@ static int __igb_shutdown(struct pci_dev *pdev, bool *enable_wake, if (netif_running(netdev)) __igb_close(netdev, true); + igb_ptp_suspend(adapter); + igb_clear_interrupt_scheme(adapter); #ifdef CONFIG_PM -- 2.5.5
[net-next 15/15] igb: Only DMA sync frame length
From: Andrew LunnOn some platforms, syncing a buffer for DMA is expensive. Rather than sync the whole 2K receive buffer, only synchronise the length of the frame, which will typically be the MTU, or a much smaller TCP ACK. For an IMX6Q, this gives around 6% increased TCP receive performance, which is cache operations bound and reduces CPU load for TCP transmit. Signed-off-by: Andrew Lunn Tested-by: Aaron Brown Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/igb/igb_main.c | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c index a15f826..9bcba42 100644 --- a/drivers/net/ethernet/intel/igb/igb_main.c +++ b/drivers/net/ethernet/intel/igb/igb_main.c @@ -6856,12 +6856,12 @@ static bool igb_can_reuse_rx_page(struct igb_rx_buffer *rx_buffer, **/ static bool igb_add_rx_frag(struct igb_ring *rx_ring, struct igb_rx_buffer *rx_buffer, + unsigned int size, union e1000_adv_rx_desc *rx_desc, struct sk_buff *skb) { struct page *page = rx_buffer->page; unsigned char *va = page_address(page) + rx_buffer->page_offset; - unsigned int size = le16_to_cpu(rx_desc->wb.upper.length); #if (PAGE_SIZE < 8192) unsigned int truesize = IGB_RX_BUFSZ; #else @@ -6913,6 +6913,7 @@ static struct sk_buff *igb_fetch_rx_buffer(struct igb_ring *rx_ring, union e1000_adv_rx_desc *rx_desc, struct sk_buff *skb) { + unsigned int size = le16_to_cpu(rx_desc->wb.upper.length); struct igb_rx_buffer *rx_buffer; struct page *page; @@ -6948,11 +6949,11 @@ static struct sk_buff *igb_fetch_rx_buffer(struct igb_ring *rx_ring, dma_sync_single_range_for_cpu(rx_ring->dev, rx_buffer->dma, rx_buffer->page_offset, - IGB_RX_BUFSZ, + size, DMA_FROM_DEVICE); /* pull page into skb */ - if (igb_add_rx_frag(rx_ring, rx_buffer, rx_desc, skb)) { + if (igb_add_rx_frag(rx_ring, rx_buffer, size, rx_desc, skb)) { /* hand second half of page back to the ring */ igb_reuse_rx_page(rx_ring, rx_buffer); } else { -- 2.5.5
[net-next 10/15] igb: implement igb_ptp_suspend
From: Jacob KellerMake igb_ptp_stop take advantage of this new function to reduce code duplication. Signed-off-by: Jacob Keller Tested-by: Aaron Brown Signed-off-by: Jeff Kirsher --- drivers/net/ethernet/intel/igb/igb.h | 1 + drivers/net/ethernet/intel/igb/igb_ptp.c | 22 +- 2 files changed, 18 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/intel/igb/igb.h b/drivers/net/ethernet/intel/igb/igb.h index 38daaab..5387b3a 100644 --- a/drivers/net/ethernet/intel/igb/igb.h +++ b/drivers/net/ethernet/intel/igb/igb.h @@ -550,6 +550,7 @@ void igb_set_fw_version(struct igb_adapter *); void igb_ptp_init(struct igb_adapter *adapter); void igb_ptp_stop(struct igb_adapter *adapter); void igb_ptp_reset(struct igb_adapter *adapter); +void igb_ptp_suspend(struct igb_adapter *adapter); void igb_ptp_rx_hang(struct igb_adapter *adapter); void igb_ptp_rx_rgtstamp(struct igb_q_vector *q_vector, struct sk_buff *skb); void igb_ptp_rx_pktstamp(struct igb_q_vector *q_vector, unsigned char *va, diff --git a/drivers/net/ethernet/intel/igb/igb_ptp.c b/drivers/net/ethernet/intel/igb/igb_ptp.c index 907c225..e61b647 100644 --- a/drivers/net/ethernet/intel/igb/igb_ptp.c +++ b/drivers/net/ethernet/intel/igb/igb_ptp.c @@ -1147,12 +1147,13 @@ void igb_ptp_init(struct igb_adapter *adapter) } /** - * igb_ptp_stop - Disable PTP device and stop the overflow check. - * @adapter: Board private structure. + * igb_ptp_suspend - Disable PTP work items and prepare for suspend + * @adapter: Board private structure * - * This function stops the PTP support and cancels the delayed work. - **/ -void igb_ptp_stop(struct igb_adapter *adapter) + * This function stops the overflow check work and PTP Tx timestamp work, and + * will prepare the device for OS suspend. + */ +void igb_ptp_suspend(struct igb_adapter *adapter) { if (!(adapter->ptp_flags & IGB_PTP_ENABLED)) return; @@ -1166,6 +1167,17 @@ void igb_ptp_stop(struct igb_adapter *adapter) adapter->ptp_tx_skb = NULL; clear_bit_unlock(__IGB_PTP_TX_IN_PROGRESS, >state); } +} + +/** + * igb_ptp_stop - Disable PTP device and stop the overflow check. + * @adapter: Board private structure. + * + * This function stops the PTP support and cancels the delayed work. + **/ +void igb_ptp_stop(struct igb_adapter *adapter) +{ + igb_ptp_suspend(adapter); if (adapter->ptp_clock) { ptp_clock_unregister(adapter->ptp_clock); -- 2.5.5
[PATCH net-next 0/3] nfp: few code improvements
Hi! Three small patches for net-next. First and second patches improve the code quality by spelling things correctly and removing unused parameters. Third patch hooks-in standard kernel implementation of .get_link() in ethtool ops. Jakub Kicinski (3): nfp: correct name of control BAR define nfp: remove unused parameter from nfp_net_write_mac_addr() nfp: implement ethtool .get_link() callback drivers/net/ethernet/netronome/nfp/nfp_net.h | 2 +- drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 13 +++-- drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c | 1 + drivers/net/ethernet/netronome/nfp/nfp_netvf_main.c | 4 ++-- 4 files changed, 11 insertions(+), 9 deletions(-) -- 1.9.1
[PATCH net-next 2/3] nfp: remove unused parameter from nfp_net_write_mac_addr()
nfp_net_write_mac_addr() always writes to the BAR the current device address taken from netdev struct. The address given as parameter is actually ignored. Since all callers pass netdev->dev_addr simply remove the parameter. While at it improve the function's kdoc a bit. Signed-off-by: Jakub Kicinski--- drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 13 +++-- 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c index 2195ed3053da..df29cea0ad01 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c @@ -1845,13 +1845,14 @@ void nfp_net_coalesce_write_cfg(struct nfp_net *nn) } /** - * nfp_net_write_mac_addr() - Write mac address to device registers + * nfp_net_write_mac_addr() - Write mac address to the device control BAR * @nn: NFP Net device to reconfigure - * @mac: Six-byte MAC address to be written * - * We do a bit of byte swapping dance because firmware is LE. + * Writes the MAC address from the netdev to the device control BAR. Does not + * perform the required reconfig. We do a bit of byte swapping dance because + * firmware is LE. */ -static void nfp_net_write_mac_addr(struct nfp_net *nn, const u8 *mac) +static void nfp_net_write_mac_addr(struct nfp_net *nn) { nn_writel(nn, NFP_NET_CFG_MACADDR + 0, get_unaligned_be32(nn->netdev->dev_addr)); @@ -1952,7 +1953,7 @@ static int __nfp_net_set_config_and_enable(struct nfp_net *nn) nn_writeq(nn, NFP_NET_CFG_RXRS_ENABLE, nn->num_rx_rings == 64 ? 0xULL : ((u64)1 << nn->num_rx_rings) - 1); - nfp_net_write_mac_addr(nn, nn->netdev->dev_addr); + nfp_net_write_mac_addr(nn); nn_writel(nn, NFP_NET_CFG_MTU, nn->netdev->mtu); nn_writel(nn, NFP_NET_CFG_FLBUFSZ, nn->fl_bufsz); @@ -2739,7 +2740,7 @@ int nfp_net_netdev_init(struct net_device *netdev) nn->cap = nn_readl(nn, NFP_NET_CFG_CAP); nn->max_mtu = nn_readl(nn, NFP_NET_CFG_MAX_MTU); - nfp_net_write_mac_addr(nn, nn->netdev->dev_addr); + nfp_net_write_mac_addr(nn); /* Set default MTU and Freelist buffer size */ if (nn->max_mtu < NFP_NET_DEFAULT_MTU) -- 1.9.1
[PATCH net-next 3/3] nfp: implement ethtool .get_link() callback
Point the ethtool .get_link() callback to the standard ethtool_op_get_link() implementation. Signed-off-by: Jakub Kicinski--- drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c b/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c index ccfef1f17627..7d7933d00b8f 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c @@ -605,6 +605,7 @@ static int nfp_net_set_coalesce(struct net_device *netdev, static const struct ethtool_ops nfp_net_ethtool_ops = { .get_drvinfo= nfp_net_get_drvinfo, + .get_link = ethtool_op_get_link, .get_ringparam = nfp_net_get_ringparam, .set_ringparam = nfp_net_set_ringparam, .get_strings= nfp_net_get_strings, -- 1.9.1
[PATCH net-next 1/3] nfp: correct name of control BAR define
Spell abbreviation of control as ctrl not crtl. Signed-off-by: Jakub Kicinski--- drivers/net/ethernet/netronome/nfp/nfp_net.h| 2 +- drivers/net/ethernet/netronome/nfp/nfp_netvf_main.c | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net.h b/drivers/net/ethernet/netronome/nfp/nfp_net.h index e744acc18ef4..690635660195 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_net.h +++ b/drivers/net/ethernet/netronome/nfp/nfp_net.h @@ -63,7 +63,7 @@ #define NFP_NET_POLL_TIMEOUT 5 /* Bar allocation */ -#define NFP_NET_CRTL_BAR 0 +#define NFP_NET_CTRL_BAR 0 #define NFP_NET_Q0_BAR 2 #define NFP_NET_Q1_BAR 4 /* OBSOLETE */ diff --git a/drivers/net/ethernet/netronome/nfp/nfp_netvf_main.c b/drivers/net/ethernet/netronome/nfp/nfp_netvf_main.c index e2b22b8a20f1..37abef016a0a 100644 --- a/drivers/net/ethernet/netronome/nfp/nfp_netvf_main.c +++ b/drivers/net/ethernet/netronome/nfp/nfp_netvf_main.c @@ -124,11 +124,11 @@ static int nfp_netvf_pci_probe(struct pci_dev *pdev, * first NFP_NET_CFG_BAR_SZ of the BAR. This keeps the code * the identical for PF and VF drivers. */ - ctrl_bar = ioremap_nocache(pci_resource_start(pdev, NFP_NET_CRTL_BAR), + ctrl_bar = ioremap_nocache(pci_resource_start(pdev, NFP_NET_CTRL_BAR), NFP_NET_CFG_BAR_SZ); if (!ctrl_bar) { dev_err(>dev, - "Failed to map resource %d\n", NFP_NET_CRTL_BAR); + "Failed to map resource %d\n", NFP_NET_CTRL_BAR); err = -EIO; goto err_pci_regions; } -- 1.9.1
Re: Question on i40e PCIe relaxed ordering (RO)
On Wed, 2016-06-29 at 13:18 -0700, Greg wrote: > On Wed, 2016-06-29 at 13:05 -0700, tndave wrote: > > Hi, > > > > Running iperf tcp test on 2 sparc systems with i40e connected back to > > back, I see huge number of 'port.rx_dropped' (on iperf server). Based > on > > past experience with ixgbe, this could very well because of PCIe RO > > (relaxed ordering) not enabled. > > > > I am trying to confirm RO is enabled. i40e datasheet mentioned RO > > settings in 3 different sections: > > > > 1. section 10.2.2.2.38 PCIe Global Config 2 - GLPCI_CNF2 register > > contains global status fields of PCIe configuration. The bit 0 of the > > register is "RO_DIS". If this bit is set to 1 RO is disabled. > > > > RO_DIS in my setup is 0 imply RO is not disabled. > > > > 2. section 12.3.5.5 Device Control Register (0xA8; RW) has bit 4 > > that enable/disable RO. This is pcie cap register. > > > > In my i40e pcie config space value at offset 0xA8 is "2110". i.e 4th > bit > > set to 1 imply RO is enabled. > > > > 3. section 3.1.2.7.2 mentions some relaxed ordering rules > > e.g. "The GLLAN_RCTL.RXDESCRDROEN bit (loaded from NVM) enables relaxed > > ordering for Rx descriptor reads" > > > > However, GLLAN_RCTL register definition has not bit like RXDESCRDROEN. > > Same goes for GLLAN_TCTL.TXDESCRDROEN. > > > > Am I missing anything? please advise. > > I would try posting this question to the e1000 developer list over at > Source Forge. The Intel customer support folks used to monitor that > list closely when I was there, hopefully they still are. We are supposed to be monitoring both lists, but just in case I have added e1000-devel list... signature.asc Description: This is a digitally signed message part
Re: [RFC 3/7] net: ethernet: bgmac: move BCMA MDIO Phy code into a separate file
On Wed, Jun 29, 2016 at 4:15 PM, Andrew Lunnwrote: > On Wed, Jun 29, 2016 at 04:08:20PM -0400, Jon Mason wrote: >> On Wed, Jun 29, 2016 at 2:46 PM, Andrew Lunn wrote: >> > On Wed, Jun 29, 2016 at 11:35:28AM -0700, Florian Fainelli wrote: >> >> On 06/29/2016 07:13 AM, Andrew Lunn wrote: >> >> > Hi Jon >> >> > >> >> > I know you are just refactoring code, but at some point it would be >> >> > good to take a closer look at this MDIO bus driver. >> > >> >> And, to re-iterate all of your points are valid, but this is premature >> > >> > We agree then :-) >> > >> >Andrew >> >> I also agree with all of your points, but hope this is not something >> that would prevent this patch series from being acceptable. > > No, it is acceptable as is. > > However, it would be nice the clean up the mess, especially if you are > planning on using this code with new platforms, not just legacy stuff > which is bit rotting. I don't believe we'll be using the BCMA infrastructure going forward. We'll be doing our own MDIO Phys, adding them to the device tree, and referencing them through the DT/OF method. This is the current plan for NSP, NS2, Cygnus, and other upcoming platforms. If someone can find an elegant way of splitting off the Phy code from the BCMA core, then this can be better than it is now. But, it will take someone with docs, time, and hardware to give this code the audit and cleanup it really needs. BTW, my last comment isn't intended as a slight against the authors of this code. It is unfortunate that they were put in a position of having to reverse engineer drivers for this HW. Thanks, Jon > > Andrew
Re: Question on i40e PCIe relaxed ordering (RO)
On Wed, 2016-06-29 at 13:05 -0700, tndave wrote: > Hi, > > Running iperf tcp test on 2 sparc systems with i40e connected back to > back, I see huge number of 'port.rx_dropped' (on iperf server). Based on > past experience with ixgbe, this could very well because of PCIe RO > (relaxed ordering) not enabled. > > I am trying to confirm RO is enabled. i40e datasheet mentioned RO > settings in 3 different sections: > > 1. section 10.2.2.2.38 PCIe Global Config 2 - GLPCI_CNF2 register > contains global status fields of PCIe configuration. The bit 0 of the > register is "RO_DIS". If this bit is set to 1 RO is disabled. > > RO_DIS in my setup is 0 imply RO is not disabled. > > 2. section 12.3.5.5 Device Control Register (0xA8; RW) has bit 4 > that enable/disable RO. This is pcie cap register. > > In my i40e pcie config space value at offset 0xA8 is "2110". i.e 4th bit > set to 1 imply RO is enabled. > > 3. section 3.1.2.7.2 mentions some relaxed ordering rules > e.g. "The GLLAN_RCTL.RXDESCRDROEN bit (loaded from NVM) enables relaxed > ordering for Rx descriptor reads" > > However, GLLAN_RCTL register definition has not bit like RXDESCRDROEN. > Same goes for GLLAN_TCTL.TXDESCRDROEN. > > Am I missing anything? please advise. I would try posting this question to the e1000 developer list over at Source Forge. The Intel customer support folks used to monitor that list closely when I was there, hopefully they still are. Regards, - Greg > > Thanks. > > -Tushar > > (Ref:http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xl710-10-40-controller-datasheet.pdf) > > >
Re: [PATCH net-next v3] tcp: use RFC6298 compliant TCP RTO calculation
BD R Y -- mean TCPRecovLat 3s -7% +39% +38% mean TCPRecovLat252s +1% -11% -11% This is indeed very interesting and somewhat unexpected. Do you have any clue why Y is as bad as R and so much worse than B? By my understanding I would have expected Y to be similar to B. At least tests on the Mean Response Waiting Time of sender limited flows show hardly any difference to B (as expected). Also, is a potential longer time in TCPRecovLat such a bad thing considering your information on HTTP response performance?
Re: [PATCH] [v6] net: emac: emac gigabit ethernet controller driver
Arnd Bergmann wrote: Sure, but in theory, my for-loop is correct, right? Wouldn't there be some value in setting a 36-bit or 40-bit DMA mask if it works? We have a platform where memory starts at a 40-bit address, so some devices have a 44-bit address bus. If a 64-bit mask doesn't work, then a 32-bit mask certainly wont. The question is whether it makes any difference to the driver: what decision does the driver make differently if it has set a 44-bit mask? We don't have ZONE_DMA44 (and it's very unlikely that we will introduce it), so neither the driver nor the network stack can actually act on the fact that a 64-bit mask failed but a 44-bit mask succeeded. A 44-bit mask would have access to much larger regions of memory than a 32-bit mask. Maybe one day we will have a version of dma_alloc_coherent that uses the DMA mask to specify an actual range of physical addresses to allocate from. In this situation, a mask of 44 would be much better than a mask of 32. It just seems wrong for driver to hard-code 64-bit and 32-bit masks and pretend like those are the only two sizes. Very few 64-bit SOCs have a full 64-bit address bus. Most have actually around 36-48 bits. If dma_set_mask(64) fails on these systems, then every driver will fall back to 32 bits, even though they can technically access all of DDR. Maybe we need a function like this: int dma_find_best_mask(struct device *dev, unsigned int max) { struct dma_map_ops *ops = get_dma_ops(dev); unsigned int mask; int ret; if (!ops || !ops->dma_supported) return max; for (mask = max; mask >= 32; mask--) { ret = ops->dma_supported(dev, mask); if (ret < 0) return ret; if (ret > 0) return mask; } return -ENOMEM; } You pass a mask size that you know is the maximum that the device could support (in my case, 64). It then returns a number that should be passed to dma_set_mask(). Platforms will also allow allow the driver to set a mask that is larger than what the bus supports, as long as all RAM is reachable by the bus. And that check (like all others) is made in the dma_set_mask call? Yes. Regarding the system you mentioned: I understand that it has no memory in ZONE_DMA32 and no IOMMU (btw, that also means 32-bit PCI devices are fundamentally broken), and the bus can only address the lower 44 bit of address space. Actually, we have an IOMMU, and the size of the address space depends on the SOC. Some SOCs have 32 bits, some have over 40. The EMAC is the same on all of these SOCs. Does this system support more than 15TB of RAM? If not, then there is no problem, and if it does, I think your best way out is not to disable SWIOTLB. That will be a slower compared to a system with an IOMMU or the fictional ZONE_DMA44, but it should work correctly. In either case (less than 15TB without swiotlb, or more than 15TB with swiotlb), the driver should just ask for a 64-bit mask and that will succeed. Some SOCs can support over 15TB. They all have IOMMUs. Apparently, dma_set_mask(64) won't always succeed. If the SOC has fewer DMA address lines than total memory, then a mask of 64 will fail. I don't want to implement a solution that just happens to work on the EMAC. I'm trying to come up with a generic solution that can work in any driver on any SOC, whether you use device tree or ACPI. -- Qualcomm Innovation Center, Inc. The Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, a Linux Foundation collaborative project.
Re: [RFC 3/7] net: ethernet: bgmac: move BCMA MDIO Phy code into a separate file
On Wed, Jun 29, 2016 at 04:08:20PM -0400, Jon Mason wrote: > On Wed, Jun 29, 2016 at 2:46 PM, Andrew Lunnwrote: > > On Wed, Jun 29, 2016 at 11:35:28AM -0700, Florian Fainelli wrote: > >> On 06/29/2016 07:13 AM, Andrew Lunn wrote: > >> > Hi Jon > >> > > >> > I know you are just refactoring code, but at some point it would be > >> > good to take a closer look at this MDIO bus driver. > > > >> And, to re-iterate all of your points are valid, but this is premature > > > > We agree then :-) > > > >Andrew > > I also agree with all of your points, but hope this is not something > that would prevent this patch series from being acceptable. No, it is acceptable as is. However, it would be nice the clean up the mess, especially if you are planning on using this code with new platforms, not just legacy stuff which is bit rotting. Andrew
Re: [RFC 1/7] net: ethernet: bgmac: change bgmac_* prints to dev_* prints
On Tue, Jun 28, 2016 at 3:43 PM, Joe Percheswrote: > On Tue, 2016-06-28 at 15:34 -0400, Jon Mason wrote: >> The bgmac_* print wrappers call dev_* prints with the dev pointer from >> the bcma core. In anticipation of removing the bcma requirement for >> this driver, these must be changed to not reference that struct. So, >> simply change all of the bgmac_* prints to their dev_* counterparts. In >> some cases netdev_* prints are more appropriate, so change those as >> well. > > Thanks, but... > >> diff --git a/drivers/net/ethernet/broadcom/bgmac.c >> b/drivers/net/ethernet/broadcom/bgmac.c > [] >> @@ -1515,7 +1516,7 @@ static void bgmac_get_drvinfo(struct net_device >> *net_dev, >> struct ethtool_drvinfo *info) >> { >> strlcpy(info->driver, KBUILD_MODNAME, sizeof(info->driver)); >> - strlcpy(info->bus_info, "BCMA", sizeof(info->bus_info)); >> + strlcpy(info->bus_info, "AXI", sizeof(info->bus_info)); >> } > > Unrelated change? This is a needed change, but unrelated to this patch. I'll move it to the 5/7 patch, where it more logically fits. Thanks, Jon >
Re: [RFC 3/7] net: ethernet: bgmac: move BCMA MDIO Phy code into a separate file
On Wed, Jun 29, 2016 at 2:46 PM, Andrew Lunnwrote: > On Wed, Jun 29, 2016 at 11:35:28AM -0700, Florian Fainelli wrote: >> On 06/29/2016 07:13 AM, Andrew Lunn wrote: >> > Hi Jon >> > >> > I know you are just refactoring code, but at some point it would be >> > good to take a closer look at this MDIO bus driver. > >> And, to re-iterate all of your points are valid, but this is premature > > We agree then :-) > >Andrew I also agree with all of your points, but hope this is not something that would prevent this patch series from being acceptable. It could be quite an effort to try and properly document the code and what it is doing, get HW to verify the assumptions from the documentation (as the MIPS HW was done by a different group, no hardware is available, etc), relocate the code to the drivers/net/phy and add a standard interface. From your original comment, I believe that you agree that this additional effort is not necessary at this point in time. Thanks, Jon
Question on i40e PCIe relaxed ordering (RO)
Hi, Running iperf tcp test on 2 sparc systems with i40e connected back to back, I see huge number of 'port.rx_dropped' (on iperf server). Based on past experience with ixgbe, this could very well because of PCIe RO (relaxed ordering) not enabled. I am trying to confirm RO is enabled. i40e datasheet mentioned RO settings in 3 different sections: 1. section 10.2.2.2.38 PCIe Global Config 2 - GLPCI_CNF2 register contains global status fields of PCIe configuration. The bit 0 of the register is "RO_DIS". If this bit is set to 1 RO is disabled. RO_DIS in my setup is 0 imply RO is not disabled. 2. section 12.3.5.5 Device Control Register (0xA8; RW) has bit 4 that enable/disable RO. This is pcie cap register. In my i40e pcie config space value at offset 0xA8 is "2110". i.e 4th bit set to 1 imply RO is enabled. 3. section 3.1.2.7.2 mentions some relaxed ordering rules e.g. "The GLLAN_RCTL.RXDESCRDROEN bit (loaded from NVM) enables relaxed ordering for Rx descriptor reads" However, GLLAN_RCTL register definition has not bit like RXDESCRDROEN. Same goes for GLLAN_TCTL.TXDESCRDROEN. Am I missing anything? please advise. Thanks. -Tushar (Ref:http://www.intel.com/content/dam/www/public/us/en/documents/datasheets/xl710-10-40-controller-datasheet.pdf)
[net-next PATCH v2 2/2] net: samples: pktgen mode samples/tests for qdisc layer
This adds samples for pktgen to use with new mode to inject pkts into the qdisc layer. This also doubles as nice test cases to test any patches against qdisc layer. Signed-off-by: John Fastabend--- .../pktgen/pktgen_bench_xmit_mode_queue_xmit.sh| 70 1 file changed, 70 insertions(+) create mode 100755 samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh diff --git a/samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh b/samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh new file mode 100755 index 000..eee06cc --- /dev/null +++ b/samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh @@ -0,0 +1,70 @@ +#!/bin/bash +# +# Benchmark script: +# - developed for benchmarking egress qdisc path, derived from +#ingress benchmark script. +# +# Script for injecting packets into egress qdisc path of the stack +# with pktgen "xmit_mode queue_xmit". +# +basedir=`dirname $0` +source ${basedir}/functions.sh +root_check_run_with_sudo "$@" + +# Parameter parsing via include +source ${basedir}/parameters.sh +# Using invalid DST_MAC will cause the packets to get dropped in +# ip_rcv() which is part of the test +[ -z "$DEST_IP" ] && DEST_IP="198.18.0.42" +[ -z "$DST_MAC" ] && DST_MAC="90:e2:ba:ff:ff:ff" + +# Burst greater than 1 are invalid but allow users to specify it and +# get an error instead of silently ignoring it. +[ -z "$BURST" ] && BURST=1 + +# Base Config +DELAY="0"# Zero means max speed +COUNT="1000" # Zero means indefinitely + +# General cleanup everything since last run +pg_ctrl "reset" + +# Threads are specified with parameter -t value in $THREADS +for ((thread = 0; thread < $THREADS; thread++)); do +# The device name is extended with @name, using thread number to +# make then unique, but any name will do. +dev=${DEV}@${thread} + +# Add remove all other devices and add_device $dev to thread +pg_thread $thread "rem_device_all" +pg_thread $thread "add_device" $dev + +# Base config of dev +pg_set $dev "flag QUEUE_MAP_CPU" +pg_set $dev "count $COUNT" +pg_set $dev "pkt_size $PKT_SIZE" +pg_set $dev "delay $DELAY" +pg_set $dev "flag NO_TIMESTAMP" + +# Destination +pg_set $dev "dst_mac $DST_MAC" +pg_set $dev "dst $DEST_IP" + +# Inject packet into RX path of stack +pg_set $dev "xmit_mode queue_xmit" + +# Burst allow us to avoid measuring SKB alloc/free overhead +pg_set $dev "burst $BURST" +done + +# start_run +echo "Running... ctrl^C to stop" >&2 +pg_ctrl "start" +echo "Done" >&2 + +# Print results +for ((thread = 0; thread < $THREADS; thread++)); do +dev=${DEV}@${thread} +echo "Device: $dev" +cat /proc/net/pktgen/$dev | grep -A2 "Result:" +done
[net-next PATCH v2 1/2] net: pktgen: support injecting packets for qdisc testing
Add another xmit_mode to pktgen to allow testing xmit functionality of qdiscs. The new mode "queue_xmit" injects packets at __dev_queue_xmit() so that qdisc is called. Signed-off-by: John Fastabend--- net/core/pktgen.c | 42 -- 1 file changed, 40 insertions(+), 2 deletions(-) diff --git a/net/core/pktgen.c b/net/core/pktgen.c index f74ab9c..4b3d467 100644 --- a/net/core/pktgen.c +++ b/net/core/pktgen.c @@ -213,6 +213,7 @@ /* Xmit modes */ #define M_START_XMIT 0 /* Default normal TX */ #define M_NETIF_RECEIVE1 /* Inject packets into stack */ +#define M_QUEUE_XMIT 2 /* Inject packet into qdisc */ /* If lock -- protects updating of if_list */ #define if_lock(t) spin_lock(&(t->if_lock)); @@ -626,6 +627,8 @@ static int pktgen_if_show(struct seq_file *seq, void *v) if (pkt_dev->xmit_mode == M_NETIF_RECEIVE) seq_puts(seq, " xmit_mode: netif_receive\n"); + else if (pkt_dev->xmit_mode == M_QUEUE_XMIT) + seq_puts(seq, " xmit_mode: xmit_queue\n"); seq_puts(seq, " Flags: "); @@ -1142,8 +1145,10 @@ static ssize_t pktgen_if_write(struct file *file, return len; i += len; - if ((value > 1) && (pkt_dev->xmit_mode == M_START_XMIT) && - (!(pkt_dev->odev->priv_flags & IFF_TX_SKB_SHARING))) + if ((value > 1) && + ((pkt_dev->xmit_mode == M_QUEUE_XMIT) || +((pkt_dev->xmit_mode == M_START_XMIT) && +(!(pkt_dev->odev->priv_flags & IFF_TX_SKB_SHARING) return -ENOTSUPP; pkt_dev->burst = value < 1 ? 1 : value; sprintf(pg_result, "OK: burst=%d", pkt_dev->burst); @@ -1198,6 +1203,9 @@ static ssize_t pktgen_if_write(struct file *file, * at module loading time */ pkt_dev->clone_skb = 0; + } else if (strcmp(f, "queue_xmit") == 0) { + pkt_dev->xmit_mode = M_QUEUE_XMIT; + pkt_dev->last_ok = 1; } else { sprintf(pg_result, "xmit_mode -:%s:- unknown\nAvailable modes: %s", @@ -3434,6 +3442,36 @@ static void pktgen_xmit(struct pktgen_dev *pkt_dev) #endif } while (--burst > 0); goto out; /* Skips xmit_mode M_START_XMIT */ + } else if (pkt_dev->xmit_mode == M_QUEUE_XMIT) { + local_bh_disable(); + atomic_add(burst, _dev->skb->users); + + ret = dev_queue_xmit(pkt_dev->skb); + switch (ret) { + case NET_XMIT_SUCCESS: + pkt_dev->sofar++; + pkt_dev->seq_num++; + pkt_dev->tx_bytes += pkt_dev->last_pkt_size; + break; + case NET_XMIT_DROP: + case NET_XMIT_CN: + /* These are all valid return codes for a qdisc but +* indicate packets are being dropped or will likely +* be dropped soon. +*/ + case NETDEV_TX_BUSY: + /* qdisc may call dev_hard_start_xmit directly in cases +* where no queues exist e.g. loopback device, virtual +* devices, etc. In this case we need to handle +* NETDEV_TX_ codes. +*/ + default: + pkt_dev->errors++; + net_info_ratelimited("%s xmit error: %d\n", +pkt_dev->odevname, ret); + break; + } + goto out; } txq = skb_get_tx_queue(odev, pkt_dev->skb);
Re: [PATCH 0/2] brcmfmac: support removing AP interfaces with new fw
On 29 June 2016 at 21:54, Rafał Miłeckiwrote: > This is the latest patchset needed to get brcmfmac working reasonably well > with BCM4366. > > Both patches were already sent as V2 RFC (10 days ago), there were no more > comments since then and this is the same code as in V2 RFC. I was mostly > waiting > for accepting > brcmfmac: fix lockup when removing P2P interface after event timeout > before sending these 2 (because of dependency). > > Rafał Miłecki (2): > brcmfmac: delete interface directly in code that sent fw request > brcmfmac: support removing AP interfaces with "interface_remove" > > .../broadcom/brcm80211/brcmfmac/cfg80211.c | 39 > +- > .../wireless/broadcom/brcm80211/brcmfmac/fweh.c| 10 -- > .../net/wireless/broadcom/brcm80211/brcmfmac/p2p.c | 3 +- > 3 files changed, 47 insertions(+), 5 deletions(-) Oops, I just realized get_maintainer.pl didn't pick any ppl/ML for this cover letter. Added them now. -- Rafał
[PATCH 2/2] brcmfmac: support removing AP interfaces with "interface_remove"
New firmwares (e.g. 10.10.69.36 for BCM4366) support "interface_remove" for removing interfaces. Try to use this method on cfg80211 request. In case of older firmwares (e.g. 7.35.177.56 for BCM43602 as I tested) this will just result in firmware rejecting command and this won't change any behavior. Signed-off-by: Rafał Miłecki--- .../broadcom/brcm80211/brcmfmac/cfg80211.c | 39 +- 1 file changed, 38 insertions(+), 1 deletion(-) diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c index a5db953..6e6066a 100644 --- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c +++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c @@ -785,12 +785,48 @@ s32 brcmf_notify_escan_complete(struct brcmf_cfg80211_info *cfg, return err; } +static int brcmf_cfg80211_del_ap_iface(struct wiphy *wiphy, + struct wireless_dev *wdev) +{ + struct brcmf_cfg80211_info *cfg = wiphy_priv(wiphy); + struct net_device *ndev = wdev->netdev; + struct brcmf_if *ifp = netdev_priv(ndev); + int ret; + int err; + + brcmf_cfg80211_arm_vif_event(cfg, ifp->vif); + + err = brcmf_fil_bsscfg_data_set(ifp, "interface_remove", NULL, 0); + if (err) { + brcmf_err("interface_remove failed %d\n", err); + goto err_unarm; + } + + /* wait for firmware event */ + ret = brcmf_cfg80211_wait_vif_event(cfg, BRCMF_E_IF_DEL, + BRCMF_VIF_EVENT_TIMEOUT); + if (!ret) { + brcmf_err("timeout occurred\n"); + err = -EIO; + goto err_unarm; + } + + brcmf_remove_interface(ifp, true); + +err_unarm: + brcmf_cfg80211_arm_vif_event(cfg, NULL); + return err; +} + static int brcmf_cfg80211_del_iface(struct wiphy *wiphy, struct wireless_dev *wdev) { struct brcmf_cfg80211_info *cfg = wiphy_priv(wiphy); struct net_device *ndev = wdev->netdev; + if (ndev && ndev == cfg_to_ndev(cfg)) + return -ENOTSUPP; + /* vif event pending in firmware */ if (brcmf_cfg80211_vif_event_armed(cfg)) return -EBUSY; @@ -807,12 +843,13 @@ int brcmf_cfg80211_del_iface(struct wiphy *wiphy, struct wireless_dev *wdev) switch (wdev->iftype) { case NL80211_IFTYPE_ADHOC: case NL80211_IFTYPE_STATION: - case NL80211_IFTYPE_AP: case NL80211_IFTYPE_AP_VLAN: case NL80211_IFTYPE_WDS: case NL80211_IFTYPE_MONITOR: case NL80211_IFTYPE_MESH_POINT: return -EOPNOTSUPP; + case NL80211_IFTYPE_AP: + return brcmf_cfg80211_del_ap_iface(wiphy, wdev); case NL80211_IFTYPE_P2P_CLIENT: case NL80211_IFTYPE_P2P_GO: case NL80211_IFTYPE_P2P_DEVICE: -- 1.8.4.5
[PATCH 1/2] brcmfmac: delete interface directly in code that sent fw request
So far when receiving event about in-firmware-interface removal our event worker was notifying listener and afterwards it was removing Linux interface. First of all it was resulting in slightly unexpected order. The listener (del_virtual_intf callback) was (usually) returning with success before we even called unregister_netdev(ice). Please note this couldn't be simply fixed by changing order of calls in brcmf_fweh_handle_if_event as unregistering interface earlier could free struct brcmf_if. Another problem of current implementation are possible lockups. Focus on the time slot between calling event handler and removing Linux interface. During that time original caller may leave (unlocking rtnl semaphore) *and* another call to the same code may be done (locking it again). If that happens our event handler will stuck at removing Linux interface, it won't handle another event and will block process holding rtnl lock. This can be simply solved by unregistering interface in a proper callback, right after receiving confirmation event from firmware. This only required modifying worker to don't unregister on its own if there is someone waiting for the event. Signed-off-by: Rafał Miłecki--- drivers/net/wireless/broadcom/brcm80211/brcmfmac/fweh.c | 10 -- drivers/net/wireless/broadcom/brcm80211/brcmfmac/p2p.c | 3 +-- 2 files changed, 9 insertions(+), 4 deletions(-) diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/fweh.c b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/fweh.c index 9da7a4c..79c081f 100644 --- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/fweh.c +++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/fweh.c @@ -18,6 +18,7 @@ #include "brcmu_wifi.h" #include "brcmu_utils.h" +#include "cfg80211.h" #include "core.h" #include "debug.h" #include "tracepoint.h" @@ -182,8 +183,13 @@ static void brcmf_fweh_handle_if_event(struct brcmf_pub *drvr, err = brcmf_fweh_call_event_handler(ifp, emsg->event_code, emsg, data); - if (ifp && ifevent->action == BRCMF_E_IF_DEL) - brcmf_remove_interface(ifp, false); + if (ifp && ifevent->action == BRCMF_E_IF_DEL) { + bool armed = brcmf_cfg80211_vif_event_armed(drvr->config); + + /* Default handling in case no-one waits for this event */ + if (!armed) + brcmf_remove_interface(ifp, false); + } } /** diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/p2p.c b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/p2p.c index f6241fd..66f942f 100644 --- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/p2p.c +++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/p2p.c @@ -2288,8 +2288,7 @@ int brcmf_p2p_del_vif(struct wiphy *wiphy, struct wireless_dev *wdev) else err = 0; } - if (err) - brcmf_remove_interface(vif->ifp, true); + brcmf_remove_interface(vif->ifp, true); brcmf_cfg80211_arm_vif_event(cfg, NULL); if (vif->wdev.iftype != NL80211_IFTYPE_P2P_DEVICE) -- 1.8.4.5
Re: [PATCH] [v6] net: emac: emac gigabit ethernet controller driver
On Wednesday, June 29, 2016 10:46:23 AM CEST Timur Tabi wrote: > Arnd Bergmann wrote: > > Usually drivers try 64-bit mask and 32-bit masks, and the 32 bit > > mask is practically guaranteed to succeed. > > Sure, but in theory, my for-loop is correct, right? Wouldn't there be > some value in setting a 36-bit or 40-bit DMA mask if it works? We have > a platform where memory starts at a 40-bit address, so some devices have > a 44-bit address bus. If a 64-bit mask doesn't work, then a 32-bit mask > certainly wont. The question is whether it makes any difference to the driver: what decision does the driver make differently if it has set a 44-bit mask? We don't have ZONE_DMA44 (and it's very unlikely that we will introduce it), so neither the driver nor the network stack can actually act on the fact that a 64-bit mask failed but a 44-bit mask succeeded. > > Platforms will also allow allow the driver to set a mask that > > is larger than what the bus supports, as long as all RAM is > > reachable by the bus. > > And that check (like all others) is made in the dma_set_mask call? Yes. Regarding the system you mentioned: I understand that it has no memory in ZONE_DMA32 and no IOMMU (btw, that also means 32-bit PCI devices are fundamentally broken), and the bus can only address the lower 44 bit of address space. Does this system support more than 15TB of RAM? If not, then there is no problem, and if it does, I think your best way out is not to disable SWIOTLB. That will be a slower compared to a system with an IOMMU or the fictional ZONE_DMA44, but it should work correctly. In either case (less than 15TB without swiotlb, or more than 15TB with swiotlb), the driver should just ask for a 64-bit mask and that will succeed. Arnd
[net-next PATCH 2/2] net: samples: pktgen mode samples/tests for qdisc layer
This adds samples for pktgen to use with new mode to inject pkts into the qdisc layer. This also doubles as nice test cases to test any patches against qdisc layer. Signed-off-by: John Fastabend--- .../pktgen/pktgen_bench_xmit_mode_queue_xmit.sh| 67 1 file changed, 67 insertions(+) create mode 100755 samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh diff --git a/samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh b/samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh new file mode 100755 index 000..f1aaef0 --- /dev/null +++ b/samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh @@ -0,0 +1,67 @@ +#!/bin/bash +# +# Benchmark script: +# - developed for benchmarking egress qdisc path, derived from +#ingress benchmark script. +# +# Script for injecting packets into egress qdisc path of the stack +# with pktgen "xmit_mode queue_xmit". +# +basedir=`dirname $0` +source ${basedir}/functions.sh +root_check_run_with_sudo "$@" + +# Parameter parsing via include +source ${basedir}/parameters.sh +# Using invalid DST_MAC will cause the packets to get dropped in +# ip_rcv() which is part of the test +[ -z "$DEST_IP" ] && DEST_IP="198.18.0.42" +[ -z "$DST_MAC" ] && DST_MAC="90:e2:ba:ff:ff:ff" +[ -z "$BURST" ] && BURST=1024 + +# Base Config +DELAY="0"# Zero means max speed +COUNT="1000" # Zero means indefinitely + +# General cleanup everything since last run +pg_ctrl "reset" + +# Threads are specified with parameter -t value in $THREADS +for ((thread = 0; thread < $THREADS; thread++)); do +# The device name is extended with @name, using thread number to +# make then unique, but any name will do. +dev=${DEV}@${thread} + +# Add remove all other devices and add_device $dev to thread +pg_thread $thread "rem_device_all" +pg_thread $thread "add_device" $dev + +# Base config of dev +pg_set $dev "flag QUEUE_MAP_CPU" +pg_set $dev "count $COUNT" +pg_set $dev "pkt_size $PKT_SIZE" +pg_set $dev "delay $DELAY" +pg_set $dev "flag NO_TIMESTAMP" + +# Destination +pg_set $dev "dst_mac $DST_MAC" +pg_set $dev "dst $DEST_IP" + +# Inject packet into RX path of stack +pg_set $dev "xmit_mode queue_xmit" + +# Burst allow us to avoid measuring SKB alloc/free overhead +pg_set $dev "burst $BURST" +done + +# start_run +echo "Running... ctrl^C to stop" >&2 +pg_ctrl "start" +echo "Done" >&2 + +# Print results +for ((thread = 0; thread < $THREADS; thread++)); do +dev=${DEV}@${thread} +echo "Device: $dev" +cat /proc/net/pktgen/$dev | grep -A2 "Result:" +done
[net-next PATCH 1/2] net: pktgen: support injecting packets for qdisc testing
Add another xmit_mode to pktgen to allow testing xmit functionality of qdiscs. The new mode "queue_xmit" injects packets at __dev_queue_xmit() so that qdisc is called. Signed-off-by: John Fastabend--- net/core/pktgen.c | 42 -- 1 file changed, 40 insertions(+), 2 deletions(-) diff --git a/net/core/pktgen.c b/net/core/pktgen.c index f74ab9c..4b3d467 100644 --- a/net/core/pktgen.c +++ b/net/core/pktgen.c @@ -213,6 +213,7 @@ /* Xmit modes */ #define M_START_XMIT 0 /* Default normal TX */ #define M_NETIF_RECEIVE1 /* Inject packets into stack */ +#define M_QUEUE_XMIT 2 /* Inject packet into qdisc */ /* If lock -- protects updating of if_list */ #define if_lock(t) spin_lock(&(t->if_lock)); @@ -626,6 +627,8 @@ static int pktgen_if_show(struct seq_file *seq, void *v) if (pkt_dev->xmit_mode == M_NETIF_RECEIVE) seq_puts(seq, " xmit_mode: netif_receive\n"); + else if (pkt_dev->xmit_mode == M_QUEUE_XMIT) + seq_puts(seq, " xmit_mode: xmit_queue\n"); seq_puts(seq, " Flags: "); @@ -1142,8 +1145,10 @@ static ssize_t pktgen_if_write(struct file *file, return len; i += len; - if ((value > 1) && (pkt_dev->xmit_mode == M_START_XMIT) && - (!(pkt_dev->odev->priv_flags & IFF_TX_SKB_SHARING))) + if ((value > 1) && + ((pkt_dev->xmit_mode == M_QUEUE_XMIT) || +((pkt_dev->xmit_mode == M_START_XMIT) && +(!(pkt_dev->odev->priv_flags & IFF_TX_SKB_SHARING) return -ENOTSUPP; pkt_dev->burst = value < 1 ? 1 : value; sprintf(pg_result, "OK: burst=%d", pkt_dev->burst); @@ -1198,6 +1203,9 @@ static ssize_t pktgen_if_write(struct file *file, * at module loading time */ pkt_dev->clone_skb = 0; + } else if (strcmp(f, "queue_xmit") == 0) { + pkt_dev->xmit_mode = M_QUEUE_XMIT; + pkt_dev->last_ok = 1; } else { sprintf(pg_result, "xmit_mode -:%s:- unknown\nAvailable modes: %s", @@ -3434,6 +3442,36 @@ static void pktgen_xmit(struct pktgen_dev *pkt_dev) #endif } while (--burst > 0); goto out; /* Skips xmit_mode M_START_XMIT */ + } else if (pkt_dev->xmit_mode == M_QUEUE_XMIT) { + local_bh_disable(); + atomic_add(burst, _dev->skb->users); + + ret = dev_queue_xmit(pkt_dev->skb); + switch (ret) { + case NET_XMIT_SUCCESS: + pkt_dev->sofar++; + pkt_dev->seq_num++; + pkt_dev->tx_bytes += pkt_dev->last_pkt_size; + break; + case NET_XMIT_DROP: + case NET_XMIT_CN: + /* These are all valid return codes for a qdisc but +* indicate packets are being dropped or will likely +* be dropped soon. +*/ + case NETDEV_TX_BUSY: + /* qdisc may call dev_hard_start_xmit directly in cases +* where no queues exist e.g. loopback device, virtual +* devices, etc. In this case we need to handle +* NETDEV_TX_ codes. +*/ + default: + pkt_dev->errors++; + net_info_ratelimited("%s xmit error: %d\n", +pkt_dev->odevname, ret); + break; + } + goto out; } txq = skb_get_tx_queue(odev, pkt_dev->skb);
Re: [PATCH v4] fib_rules: Added NLM_F_EXCL support to fib_nl_newrule
On 6/29/16 1:22 AM, Mateusz Bajorski wrote: When adding rule with NLM_F_EXCL flag then check if the same rule exist. If yes then exit with -EEXIST. This is already implemented in iproute2: if (cmd == RTM_NEWRULE) { req.n.nlmsg_flags |= NLM_F_CREATE|NLM_F_EXCL; req.r.rtm_type = RTN_UNICAST; } Tested ipv4 and ipv6 with net-next linux on qemu x86 expected behavior after patch: localhost ~ # ip rule 0:from all lookup local 32766:from all lookup main 32767:from all lookup default localhost ~ # ip rule add from 10.46.177.97 lookup 104 pref 1005 localhost ~ # ip rule add from 10.46.177.97 lookup 104 pref 1005 RTNETLINK answers: File exists localhost ~ # ip rule 0:from all lookup local 1005:from 10.46.177.97 lookup 104 32766:from all lookup main 32767:from all lookup default There was already topic regarding this but I don't see any changes merged and problem still occurs. https://lkml.kernel.org/r/1135778809.5944.7.camel+%28%29+localhost+%21+localdomain Signed-off-by: Mateusz Bajorski--- Changes in v2: section moved to new place where new rule is already built Changes in v3: compare moved to helper, added fr_net compare Changes in v4: added l3mdev compare net/core/fib_rules.c | 49 + 1 file changed, 49 insertions(+) Acked-by: David Ahern
[PATCH iproute2] bridge: man: fix STP LISTENING description
Correct the unclear and poorly conjugated STP LISTENING documentation. Signed-off-by: Vivien Didelot--- man/man8/bridge.8 | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/man/man8/bridge.8 b/man/man8/bridge.8 index ac42118..8a39673 100644 --- a/man/man8/bridge.8 +++ b/man/man8/bridge.8 @@ -235,7 +235,7 @@ error. .B 1 - STP LISTENING state. Only valid if STP is enabled on the bridge. In this -state the port for list for STP BPDUs and drop all other traffic. +state the port listens for STP BPDUs and drops all other traffic frames. .sp .B 2 -- 2.9.0
[PATCH iproute2] bridge: man: fix BPUD typo
s/BPUD/BPDU/ in guard description. Signed-off-by: Vivien Didelot--- man/man8/bridge.8 | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/man/man8/bridge.8 b/man/man8/bridge.8 index 8a39673..ef4f83e 100644 --- a/man/man8/bridge.8 +++ b/man/man8/bridge.8 @@ -256,7 +256,7 @@ STP BPDUs. .TP .BR "guard on " or " guard off " -Controls whether STP BPUDs will be processed by the bridge port. By default, +Controls whether STP BPDUs will be processed by the bridge port. By default, the flag is turned off allowed BPDU processing. Turning this flag on will cause the port to stop processing STP BPDUs. -- 2.9.0
net.git commit 3bb549ae4c51
Hi, I'm getting ready with the next installment of changes to rds-tcp for mprds, and I noticed that commit 3bb549ae4c51 ("RDS: TCP: rds_tcp_accept_one() should transition socket from RESETTING to UP") has not made its way to net-next. I also noticed that (even though this is a one-liner), this patch generates a conflict when applied as-is to net-next. My planned changes will cause more turbulence around this piece of code, so they depend on this patch. What is the advised approach here? Should I just port this patch over to net-next as part of my patch-set, and make a note about the port, or something else? --Sowmini
[PATCH net v3] usbnet: Stop RX Q on MTU change
When MTU is changed unlink_urbs() flushes RX Q but mean while usbnet_bh() can fill up the Q at the same time. Depends on which HCD is down there unlink takes long time then the flush never ends. Signed-off-by: Soohoon LeeReviewed-by: Kimball Murray --- drivers/net/usb/usbnet.c | 10 +++--- 1 files changed, 7 insertions(+), 3 deletions(-) diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c index 61ba464..6086a01 100644 --- a/drivers/net/usb/usbnet.c +++ b/drivers/net/usb/usbnet.c @@ -395,8 +395,11 @@ int usbnet_change_mtu (struct net_device *net, int new_mtu) dev->hard_mtu = net->mtu + net->hard_header_len; if (dev->rx_urb_size == old_hard_mtu) { dev->rx_urb_size = dev->hard_mtu; - if (dev->rx_urb_size > old_rx_urb_size) + if (dev->rx_urb_size > old_rx_urb_size) { + usbnet_pause_rx(dev); usbnet_unlink_rx_urbs(dev); + usbnet_resume_rx(dev); + } } /* max qlen depend on hard_mtu and rx_urb_size */ @@ -1508,8 +1511,9 @@ static void usbnet_bh (unsigned long param) } else if (netif_running (dev->net) && netif_device_present (dev->net) && netif_carrier_ok(dev->net) && - !timer_pending (>delay) && - !test_bit (EVENT_RX_HALT, >flags)) { + !timer_pending(>delay) && + !test_bit(EVENT_RX_PAUSED, >flags) && + !test_bit(EVENT_RX_HALT, >flags)) { int temp = dev->rxq.qlen; if (temp < RX_QLEN(dev)) {
Re: [RFC 0/7] net: ethernet: bgmac: Add platform device support
On 06/28/2016 12:34 PM, Jon Mason wrote: > I'm sending out this RFC to see if this is the direction the maintainers > would like to go to add support for other, non-bcma iProc SoC's to the > bgmac driver. Specifically, we are interested in adding support for the > NSP, Cygnus, and NS2 families (with more possible down the road). > > To support non-bcma enabled SoCs, we need to add the standard device > tree "platform device" support. Unfortunately, this driver is very > tighly coupled with the bcma bus and much unwinding is needed. I tried > to break this up into a number of patches to make it more obvious what > was being done to add platform device support. I was able to verify > that the bcma code still works using a 53012K board (NS SoC), and that > the platform code works using a 58625K board (NSP SoC). > > It is worth noting that the phy logic present in the driver needs to be > moved to drivers/phy. However, I was not able to fully decouple that > code from the bgmac driver. I was able to move it into a separate C > file, with only 2 function calls needed to create and destroy the mii > bus. Someone with more knowledge of this and HW to test it needs to do > it properly. This would natually dovetail into creating an interface > which the NSP bgmac can use for the external MDIO Phy to properly > connect (instead of using the fixed phy). This looks very good to me, and I just tested this with a BCM58625 w/ b53-srab on top of that, and everything works nicely: Tested-by: Florian Fainelli-- Florian
Re: [RFC 5/7] net: ethernet: bgmac: Add platform device support
On 06/28/2016 12:34 PM, Jon Mason wrote: > The bcma portion of the driver has been split off into a bcma specific > driver. This has been mirrored for the platform driver. The last > references to the bcma core struct have been changed into a generic > function call. These function calls are wrappers to either the original > bcma code or new platform functions that access the same areas via MMIO. > This necessitated adding function pointers for both platform and bcma to > hide which backend is being used from the generic bgmac code. > > Signed-off-by: Jon Mason> --- [snip] > +static int bgmac_probe(struct platform_device *pdev) > +{ > + struct device_node *np = pdev->dev.of_node; > + struct bgmac *bgmac; > + struct resource regs; > + const u8 *mac_addr; > + int rc; > + > + bgmac = kzalloc(sizeof(*bgmac), GFP_KERNEL); You could utilize devm_kzalloc() here which simplifies the error path. > + if (!bgmac) > + return -ENOMEM; > + > + platform_set_drvdata(pdev, bgmac); > + > + /* Set the features of the 4707 family */ > + bgmac->feature_flags |= BGMAC_FEAT_CLKCTLST; > + bgmac->feature_flags |= BGMAC_FEAT_NO_RESET; > + bgmac->feature_flags |= BGMAC_FEAT_FORCE_SPEED_2500; > + bgmac->feature_flags |= BGMAC_FEAT_CMDCFG_SR_REV4; > + bgmac->feature_flags |= BGMAC_FEAT_TX_MASK_SETUP; > + bgmac->feature_flags |= BGMAC_FEAT_RX_MASK_SETUP; > + > + bgmac->dev = >dev; > + bgmac->dma_dev = >dev; > + > + mac_addr = of_get_mac_address(np); > + if (mac_addr) > + ether_addr_copy(bgmac->mac_addr, mac_addr); > + else > + dev_warn(>dev, "MAC address not present in device > tree\n"); > + > + bgmac->irq = platform_get_irq(pdev, 0); > + if (bgmac->irq < 0) { > + rc = bgmac->irq; > + dev_err(>dev, "Unable to obtain IRQ\n"); > + goto err; > + } > + > + rc = of_address_to_resource(np, 0, ); > + if (rc < 0) { > + dev_err(>dev, "Unable to obtain base resource\n"); > + goto err; > + } Here you could fetch the resource using a traditional platform_get_resource(pdev, IORESOURCE_MEM, 0) and... > + > + bgmac->plat.base = ioremap(regs.start, resource_size()); > + if (!bgmac->plat.base) { > + dev_err(>dev, "Unable to map base resource\n"); > + rc = -ENOMEM; > + goto err; > + } ... here do a devm_ioremap_resource(), which also does a request_mem_region, so this shows up nicely in /proc/iomem. > + > + rc = of_address_to_resource(np, 1, ); > + if (rc < 0) { > + dev_err(>dev, "Unable to obtain idm resource\n"); > + goto err1; > + } > + > + bgmac->plat.idm_base = ioremap(regs.start, resource_size()); > + if (!bgmac->plat.base) { > + dev_err(>dev, "Unable to map idm resource\n"); > + rc = -ENOMEM; > + goto err1; > + } Same here. > + > + bgmac->read = platform_bgmac_read; > + bgmac->write = platform_bgmac_write; > + bgmac->idm_read = platform_bgmac_idm_read; > + bgmac->idm_write = platform_bgmac_idm_write; > + bgmac->clk_enabled = platform_bgmac_clk_enabled; > + bgmac->clk_enable = platform_bgmac_clk_enable; > + bgmac->cco_ctl_maskset = platform_bgmac_cco_ctl_maskset; > + bgmac->get_bus_clock = platform_bgmac_get_bus_clock; > + bgmac->cmn_maskset32 = platform_bgmac_cmn_maskset32; > + > + rc = bgmac_enet_probe(bgmac); > + if (rc) > + goto err2; > + > + return 0; > + > +err2: > + iounmap(bgmac->plat.idm_base); > +err1: > + iounmap(bgmac->plat.base); > +err: > + kfree(bgmac); And with devm_* helpers none of that is needed now. -- Florian
[PATCH] ipv4: Fix ip_skb_dst_mtu to use the sk passed by ip_finish_output
ip_skb_dst_mtu uses skb->sk, assuming it is an AF_INET socket (e.g. it calls ip_sk_use_pmtu which casts sk as an inet_sk). However, in the case of UDP tunneling, the skb->sk is not necessarily an inet socket (could be AF_PACKET socket, or AF_UNSPEC if arriving from tun/tap). OTOH, the sk passed as an argument throughout IP stack's output path is the one which is of PMTU interest: - In case of local sockets, sk is same as skb->sk; - In case of a udp tunnel, sk is the tunneling socket. Fix, by passing ip_finish_output's sk to ip_skb_dst_mtu. This augments 7026b1ddb6 'netfilter: Pass socket pointer down through okfn().' Signed-off-by: Shmulik Ladkani--- Discovered while debugging other issue in adjacent code area. My setup seems to be handling this well; However I'd appreciate a Tested-by or Reviewed-by as one can only cover relatively few of the possible code flows leading to ip_skb_dst_mtu. include/net/ip.h| 5 ++--- net/bridge/br_netfilter_hooks.c | 2 +- net/ipv4/ip_output.c| 4 ++-- 3 files changed, 5 insertions(+), 6 deletions(-) diff --git a/include/net/ip.h b/include/net/ip.h index 37165fba37..08f36cd2b8 100644 --- a/include/net/ip.h +++ b/include/net/ip.h @@ -313,10 +313,9 @@ static inline unsigned int ip_dst_mtu_maybe_forward(const struct dst_entry *dst, return min(dst->dev->mtu, IP_MAX_MTU); } -static inline unsigned int ip_skb_dst_mtu(const struct sk_buff *skb) +static inline unsigned int ip_skb_dst_mtu(struct sock *sk, + const struct sk_buff *skb) { - struct sock *sk = skb->sk; - if (!sk || !sk_fullsock(sk) || ip_sk_use_pmtu(sk)) { bool forwarding = IPCB(skb)->flags & IPSKB_FORWARDED; diff --git a/net/bridge/br_netfilter_hooks.c b/net/bridge/br_netfilter_hooks.c index 2d25979273..77e7f69bf8 100644 --- a/net/bridge/br_netfilter_hooks.c +++ b/net/bridge/br_netfilter_hooks.c @@ -700,7 +700,7 @@ static int br_nf_ip_fragment(struct net *net, struct sock *sk, struct sk_buff *skb, int (*output)(struct net *, struct sock *, struct sk_buff *)) { - unsigned int mtu = ip_skb_dst_mtu(skb); + unsigned int mtu = ip_skb_dst_mtu(sk, skb); struct iphdr *iph = ip_hdr(skb); if (unlikely(((iph->frag_off & htons(IP_DF)) && !skb->ignore_df) || diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c index cbac493c91..e23f141c9b 100644 --- a/net/ipv4/ip_output.c +++ b/net/ipv4/ip_output.c @@ -271,7 +271,7 @@ static int ip_finish_output(struct net *net, struct sock *sk, struct sk_buff *sk return dst_output(net, sk, skb); } #endif - mtu = ip_skb_dst_mtu(skb); + mtu = ip_skb_dst_mtu(sk, skb); if (skb_is_gso(skb)) return ip_finish_output_gso(net, sk, skb, mtu); @@ -541,7 +541,7 @@ int ip_do_fragment(struct net *net, struct sock *sk, struct sk_buff *skb, iph = ip_hdr(skb); - mtu = ip_skb_dst_mtu(skb); + mtu = ip_skb_dst_mtu(sk, skb); if (IPCB(skb)->frag_max_size && IPCB(skb)->frag_max_size < mtu) mtu = IPCB(skb)->frag_max_size; -- 2.7.4
Re: [RFC 3/7] net: ethernet: bgmac: move BCMA MDIO Phy code into a separate file
On Wed, Jun 29, 2016 at 11:35:28AM -0700, Florian Fainelli wrote: > On 06/29/2016 07:13 AM, Andrew Lunn wrote: > > Hi Jon > > > > I know you are just refactoring code, but at some point it would be > > good to take a closer look at this MDIO bus driver. > And, to re-iterate all of your points are valid, but this is premature We agree then :-) Andrew
[PATCH net v2.3] usbnet: Stop RX Q on MTU change
When MTU is changed unlink_urbs() flushes RX Q but mean while usbnet_bh() can fill up the Q at the same time. Depends on which HCD is down there unlink takes long time then the flush never ends. Signed-off-by: Soohoon LeeReviewed-by: Kimball Murray --- drivers/net/usb/usbnet.c |6 +- 1 files changed, 5 insertions(+), 1 deletions(-) diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c index 61ba464..ce72dd0 100644 --- a/drivers/net/usb/usbnet.c +++ b/drivers/net/usb/usbnet.c @@ -395,8 +395,11 @@ int usbnet_change_mtu (struct net_device *net, int new_mtu) dev->hard_mtu = net->mtu + net->hard_header_len; if (dev->rx_urb_size == old_hard_mtu) { dev->rx_urb_size = dev->hard_mtu; - if (dev->rx_urb_size > old_rx_urb_size) + if (dev->rx_urb_size > old_rx_urb_size) { + usbnet_pause_rx(dev); usbnet_unlink_rx_urbs(dev); + usbnet_resume_rx(dev); + } } /* max qlen depend on hard_mtu and rx_urb_size */ @@ -1509,6 +1512,7 @@ static void usbnet_bh (unsigned long param) netif_device_present (dev->net) && netif_carrier_ok(dev->net) && !timer_pending (>delay) && + !test_bit (EVENT_RX_PAUSED, >flags) && !test_bit (EVENT_RX_HALT, >flags)) { int temp = dev->rxq.qlen;
Re: [RFC 6/7] dt-bindings: net: bgmac: add bindings documentation for bgmac
On 06/28/2016 12:34 PM, Jon Mason wrote: > Signed-off-by: Jon Mason> --- > .../devicetree/bindings/net/brcm,bgmac-enet.txt | 21 > + > 1 file changed, 21 insertions(+) > create mode 100644 Documentation/devicetree/bindings/net/brcm,bgmac-enet.txt > > diff --git a/Documentation/devicetree/bindings/net/brcm,bgmac-enet.txt > b/Documentation/devicetree/bindings/net/brcm,bgmac-enet.txt > new file mode 100644 > index 000..efd36d5 > --- /dev/null > +++ b/Documentation/devicetree/bindings/net/brcm,bgmac-enet.txt > @@ -0,0 +1,21 @@ > +Broadcom GMAC Ethernet Controller Device Tree Bindings > +- > + > +Required properties: > + - compatible: "brcm,bgmac-enet" You might want something a bit more explicit like "brcm,bgmac-nsp" for instance and also define a compatible string for the Northstar family maybe? > + - reg: Address and length of the GMAC registers, > + Address and length of the GMAC IDM registers > + - interrupts: Interrupt number > + > +Optional properties: > +- mac-address: mac address to be assigned to the device > + > +Examples: > + > +gmac0: enet@18022000 { > + compatible = "brcm,bgmac-enet"; > + reg = <0x18022000 0x1000>, > + <0x1811 0x1000>; > + interrupts = ; > + status = "disabled"; > +}; > -- Florian
Re: [RFC 3/7] net: ethernet: bgmac: move BCMA MDIO Phy code into a separate file
On 06/29/2016 07:13 AM, Andrew Lunn wrote: > Hi Jon > > I know you are just refactoring code, but at some point it would be > good to take a closer look at this MDIO bus driver. > > The MDIO bus driver should be generic, allowing access to all 32 > addresses on the bus, if that makes sense. You could for example have > a B53 switch hanging off the MDIO bus The basic read/write > functions seem to allow this. That's the case on MIPS-based SoCs actually. > >> +/* http://bcm-v4.sipsolutions.net/mac-gbit/gmac/chipphyinit */ >> +static void bcma_mdio_phy_init(struct bcma_mdio *bcma_mdio) >> +{ >> +struct bcma_chipinfo *ci = _mdio->core->bus->chipinfo; >> +u8 i; >> + >> +if (ci->id == BCMA_CHIP_ID_BCM5356) { >> +for (i = 0; i < 5; i++) { >> +bcma_mdio_phy_write(bcma_mdio, i, 0x1f, 0x008b); >> +bcma_mdio_phy_write(bcma_mdio, i, 0x15, 0x0100); >> +bcma_mdio_phy_write(bcma_mdio, i, 0x1f, 0x000f); >> +bcma_mdio_phy_write(bcma_mdio, i, 0x12, 0x2aaa); >> +bcma_mdio_phy_write(bcma_mdio, i, 0x1f, 0x000b); > > It would be good to document what these writes are doing. Also, could > they be moved into the phy driver? Does the PHY implement an ID in > registers 2 and 3 that can be used to determine these are needed? > > Also, why is it iterating over 5 PHYs? Is this actually a switch with > 5 ports? Yes, this is applying workarounds to the integrated PHYs, the plan is to move that to a proper PHY driver so we can get rid of that. > >> +} >> +} >> +if ((ci->id == BCMA_CHIP_ID_BCM5357 && ci->pkg != 10) || >> +(ci->id == BCMA_CHIP_ID_BCM4749 && ci->pkg != 10) || >> +(ci->id == BCMA_CHIP_ID_BCM53572 && ci->pkg != 9)) { >> +struct bcma_drv_cc *cc = _mdio->core->bus->drv_cc; >> + >> +bcma_chipco_chipctl_maskset(cc, 2, ~0xc000, 0); >> +bcma_chipco_chipctl_maskset(cc, 4, ~0x8000, 0); >> +for (i = 0; i < 5; i++) { >> +bcma_mdio_phy_write(bcma_mdio, i, 0x1f, 0x000f); >> +bcma_mdio_phy_write(bcma_mdio, i, 0x16, 0x5284); >> +bcma_mdio_phy_write(bcma_mdio, i, 0x1f, 0x000b); >> +bcma_mdio_phy_write(bcma_mdio, i, 0x17, 0x0010); >> +bcma_mdio_phy_write(bcma_mdio, i, 0x1f, 0x000f); >> +bcma_mdio_phy_write(bcma_mdio, i, 0x16, 0x5296); >> +bcma_mdio_phy_write(bcma_mdio, i, 0x17, 0x1073); >> +bcma_mdio_phy_write(bcma_mdio, i, 0x17, 0x9073); >> +bcma_mdio_phy_write(bcma_mdio, i, 0x16, 0x52b6); >> +bcma_mdio_phy_write(bcma_mdio, i, 0x17, 0x9273); >> +bcma_mdio_phy_write(bcma_mdio, i, 0x1f, 0x000b); > > Same comment here. > >> +/* http://bcm-v4.sipsolutions.net/mac-gbit/gmac/chipphyreset */ >> +static int bcma_mdio_phy_reset(struct mii_bus *bus) >> +{ >> +struct bcma_mdio *bcma_mdio = bus->priv; >> +u8 phyaddr = bcma_mdio->phyaddr; >> + >> +if (bcma_mdio->phyaddr == BGMAC_PHY_NOREGS) >> +return 0; >> + >> +bcma_mdio_phy_write(bcma_mdio, phyaddr, MII_BMCR, BMCR_RESET); >> +udelay(100); >> +if (bcma_mdio_phy_read(bcma_mdio, phyaddr, MII_BMCR) & BMCR_RESET) >> +dev_err(_mdio->core->dev, "PHY reset failed\n"); >> +bcma_mdio_phy_init(bcma_mdio); > > This appears to be resetting one PHY. What about the others? Why not > put this in the PHY driver? Probably should be there, keep in mind this driver was mostly reverse engineered, from an internal code base which usually did not plan on utilizing existing APIs. > >> +struct mii_bus *bcma_mdio_mii_register(struct bcma_device *core, u8 phyaddr) >> +{ >> +struct bcma_mdio *bcma_mdio; >> +struct mii_bus *mii_bus; >> +int err; >> + >> +bcma_mdio = kzalloc(sizeof(*bcma_mdio), GFP_KERNEL); >> +if (!bcma_mdio) >> +return ERR_PTR(-ENOMEM); >> + >> +mii_bus = mdiobus_alloc(); >> +if (!mii_bus) { >> +err = -ENOMEM; >> +goto err; >> +} >> + >> +mii_bus->name = "bcma_mdio mii bus"; >> +sprintf(mii_bus->id, "%s-%d-%d", "bcma_mdio", core->bus->num, >> +core->core_unit); >> +mii_bus->priv = bcma_mdio; >> +mii_bus->read = bcma_mdio_mii_read; >> +mii_bus->write = bcma_mdio_mii_write; >> +mii_bus->reset = bcma_mdio_phy_reset; >> +mii_bus->parent = >dev; >> +mii_bus->phy_mask = ~(1 << phyaddr); > > Is there a need to limit this to just the one PHY? Usually this is only relevant for an external PHY, and Jon is just moving code around without making functional changes to existing BCMA code, so all of that can be changed at a later time. > >> + >> +bcma_mdio->core = core; >> +bcma_mdio->phyaddr = phyaddr; >> + >> +err = mdiobus_register(mii_bus); > > Something which appears to be missing from the later
[PATCH v4 iproute2 2/6] ip link/addr: Add support for vrf keyword
Add vrf keyword to 'ip link' and 'ip addr' commands (common list code). Allows: 1. Adding a link to a VRF $ ip link set NAME vrf NAME Removing a link from a VRF still uses 'ip link set NAME nomaster' 2. Showing links associated with a VRF: $ ip link show vrf NAME 3. List addresses associated with links in a VRF $ ip -br addr show vrf red Signed-off-by: David Ahern--- ip/ipaddress.c | 12 +++- ip/iplink.c | 15 +-- man/man8/ip-address.8.in | 6 ++ man/man8/ip-link.8.in| 10 ++ 4 files changed, 40 insertions(+), 3 deletions(-) diff --git a/ip/ipaddress.c b/ip/ipaddress.c index d4d649505e15..54896c726e07 100644 --- a/ip/ipaddress.c +++ b/ip/ipaddress.c @@ -80,7 +80,7 @@ static void usage(void) fprintf(stderr, "[ to PREFIX ] [ FLAG-LIST ] [ label LABEL ] [up]\n"); fprintf(stderr, " ip address [ show [ dev IFNAME ] [ scope SCOPE-ID ] [ master DEVICE ]\n"); fprintf(stderr, " [ type TYPE ] [ to PREFIX ] [ FLAG-LIST ]\n"); - fprintf(stderr, " [ label LABEL ] [up] ]\n"); + fprintf(stderr, " [ label LABEL ] [up] [ vrf NAME ] ]\n"); fprintf(stderr, " ip address {showdump|restore}\n"); fprintf(stderr, "IFADDR := PREFIX | ADDR peer PREFIX\n"); fprintf(stderr, " [ broadcast ADDR ] [ anycast ADDR ]\n"); @@ -1620,6 +1620,16 @@ static int ipaddr_list_flush_or_save(int argc, char **argv, int action) if (!ifindex) invarg("Device does not exist\n", *argv); filter.master = ifindex; + } else if (strcmp(*argv, "vrf") == 0) { + int ifindex; + + NEXT_ARG(); + ifindex = ll_name_to_index(*argv); + if (!ifindex) + invarg("Not a valid VRF name\n", *argv); + if (!name_is_vrf(*argv)) + invarg("Not a valid VRF name\n", *argv); + filter.master = ifindex; } else if (strcmp(*argv, "type") == 0) { int soff; diff --git a/ip/iplink.c b/ip/iplink.c index b1f8a37922f5..f2a2e13cf0c5 100644 --- a/ip/iplink.c +++ b/ip/iplink.c @@ -82,11 +82,11 @@ void iplink_usage(void) fprintf(stderr, " [ query_rss { on | off} ]\n"); fprintf(stderr, " [ state { auto | enable | disable} ] ]\n"); fprintf(stderr, " [ trust { on | off} ] ]\n"); - fprintf(stderr, " [ master DEVICE ]\n"); + fprintf(stderr, " [ master DEVICE ][ vrf NAME ]\n"); fprintf(stderr, " [ nomaster ]\n"); fprintf(stderr, " [ addrgenmode { eui64 | none | stable_secret | random } ]\n"); fprintf(stderr, " [ protodown { on | off } ]\n"); - fprintf(stderr, " ip link show [ DEVICE | group GROUP ] [up] [master DEV] [type TYPE]\n"); + fprintf(stderr, " ip link show [ DEVICE | group GROUP ] [up] [master DEV] [vrf NAME] [type TYPE]\n"); if (iplink_have_newlink()) { fprintf(stderr, " ip link help [ TYPE ]\n"); @@ -603,6 +603,17 @@ int iplink_parse(int argc, char **argv, struct iplink_req *req, invarg("Device does not exist\n", *argv); addattr_l(>n, sizeof(*req), IFLA_MASTER, , 4); + } else if (strcmp(*argv, "vrf") == 0) { + int ifindex; + + NEXT_ARG(); + ifindex = ll_name_to_index(*argv); + if (!ifindex) + invarg("Not a valid VRF name\n", *argv); + if (!name_is_vrf(*argv)) + invarg("Not a valid VRF name\n", *argv); + addattr_l(>n, sizeof(*req), IFLA_MASTER, + , sizeof(ifindex)); } else if (matches(*argv, "nomaster") == 0) { int ifindex = 0; diff --git a/man/man8/ip-address.8.in b/man/man8/ip-address.8.in index 3cbe4181f7e3..7cb927173133 100644 --- a/man/man8/ip-address.8.in +++ b/man/man8/ip-address.8.in @@ -45,6 +45,8 @@ ip-address \- protocol address management .IR DEVICE " ] [ " .B type .IR TYPE " ] [ " +.B vrf +.IR NAME " ] [ " .BR up " ] ]" .ti -8 @@ -280,6 +282,10 @@ is a usual shell style pattern. only list interfaces enslaved to this master device. .TP +.BI vrf " NAME " +only list interfaces enslaved to this vrf. + +.TP .BI type " TYPE" only list interfaces of the
[PATCH v4 iproute2 4/6] ip route: Change type mask to bitmask
Allow option to select multiple route types to show or exlude specific route types. Signed-off-by: David Ahern--- ip/iproute.c | 11 ++- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/ip/iproute.c b/ip/iproute.c index 8224d7ffa94b..aae693d17be8 100644 --- a/ip/iproute.c +++ b/ip/iproute.c @@ -113,7 +113,7 @@ static struct int flushe; int protocol, protocolmask; int scope, scopemask; - int type, typemask; + __u64 typemask; int tos, tosmask; int iif, iifmask; int oif, oifmask; @@ -178,7 +178,8 @@ static int filter_nlmsg(struct nlmsghdr *n, struct rtattr **tb, int host_len) return 0; if ((filter.scope^r->rtm_scope)) return 0; - if ((filter.type^r->rtm_type)) + + if (filter.typemask && !(filter.typemask & (1 << r->rtm_type))) return 0; if ((filter.tos^r->rtm_tos)) return 0; @@ -365,7 +366,8 @@ int print_route(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg) if (n->nlmsg_type == RTM_DELROUTE) fprintf(fp, "Deleted "); - if ((r->rtm_type != RTN_UNICAST || show_details > 0) && !filter.type) + if ((r->rtm_type != RTN_UNICAST || show_details > 0) && + (!filter.typemask || (filter.typemask & (1 << r->rtm_type fprintf(fp, "%s ", rtnl_rtntype_n2a(r->rtm_type, b1, sizeof(b1))); if (tb[RTA_DST]) { @@ -1433,10 +1435,9 @@ static int iproute_list_flush_or_save(int argc, char **argv, int action) int type; NEXT_ARG(); - filter.typemask = -1; if (rtnl_rtntype_a2n(, *argv)) invarg("node type value is invalid\n", *argv); - filter.type = type; + filter.typemask = (1<
[PATCH v4 iproute2 0/6] Add support for vrf keyword
Currently the syntax for VRF related commands is rather kludgy and inconsistent from one subcommand to another. This set adds support for the VRF keyword to the link, address, neigh, and route commands to improve the user experience listing data associated with vrfs, modifying routes or doing a route lookup. v4 - Stephen's comments about patch 1 also apply to patch 5, so updated it v3 - addressed comments from Stephen - added documentation updates to patches 2, 3 and 6 v2 - rebased to top of tree - all checkpatch warnings are usage lines. The change in these patches is consistent with existing code for usage lines David Ahern (6): ip vrf: Add name_is_vrf ip link/addr: Add support for vrf keyword ip neigh: Add support for keyword ip route: Change type mask to bitmask ip vrf: Add ipvrf_get_table ip route: Add support for vrf keyword ip/ip_common.h | 3 ++ ip/ipaddress.c | 12 - ip/iplink.c | 15 ++- ip/iplink_vrf.c | 113 +++ ip/ipneigh.c | 14 +- ip/iproute.c | 43 +++--- man/man8/ip-address.8.in | 6 +++ man/man8/ip-link.8.in| 10 + man/man8/ip-neighbour.8 | 8 +++- man/man8/ip-route.8.in | 19 +++- 10 files changed, 230 insertions(+), 13 deletions(-) -- 2.1.4
[PATCH v4 iproute2 6/6] ip route: Add support for vrf keyword
Add vrf keyword to 'ip route' commands. Allows: 1. Users can list routes by VRF name: $ ip route show vrf NAME VRF tables have all routes including local and broadcast routes. The VRF keyword filters LOCAL and BROADCAST routes; to see all routes the table option can be used. Or to see local routes only for a VRF: $ ip route show vrf NAME type local 2. Add or delete a route for a VRF: $ ip route {add|delete} vrf NAME 3. Do a route lookup for a VRF: $ ip route get vrf NAME ADDRESS Signed-off-by: David Ahern--- ip/iproute.c | 32 ++-- man/man8/ip-route.8.in | 19 ++- 2 files changed, 48 insertions(+), 3 deletions(-) diff --git a/ip/iproute.c b/ip/iproute.c index aae693d17be8..bd661c16cb46 100644 --- a/ip/iproute.c +++ b/ip/iproute.c @@ -67,10 +67,10 @@ static void usage(void) fprintf(stderr, " ip route showdump\n"); fprintf(stderr, " ip route get ADDRESS [ from ADDRESS iif STRING ]\n"); fprintf(stderr, "[ oif STRING ] [ tos TOS ]\n"); - fprintf(stderr, "[ mark NUMBER ]\n"); + fprintf(stderr, "[ mark NUMBER ] [ vrf NAME ]\n"); fprintf(stderr, " ip route { add | del | change | append | replace } ROUTE\n"); fprintf(stderr, "SELECTOR := [ root PREFIX ] [ match PREFIX ] [ exact PREFIX ]\n"); - fprintf(stderr, "[ table TABLE_ID ] [ proto RTPROTO ]\n"); + fprintf(stderr, "[ table TABLE_ID ] [ vrf NAME ] [ proto RTPROTO ]\n"); fprintf(stderr, "[ type TYPE ] [ scope SCOPE ]\n"); fprintf(stderr, "ROUTE := NODE_SPEC [ INFO_SPEC ]\n"); fprintf(stderr, "NODE_SPEC := [ TYPE ] PREFIX [ tos TOS ]\n"); @@ -1141,6 +1141,20 @@ static int iproute_modify(int cmd, unsigned int flags, int argc, char **argv) addattr32(, sizeof(req), RTA_TABLE, tid); } table_ok = 1; + } else if (matches(*argv, "vrf") == 0) { + __u32 tid; + + NEXT_ARG(); + tid = ipvrf_get_table(*argv); + if (tid == 0) + invarg("Invalid VRF\n", *argv); + if (tid < 256) + req.r.rtm_table = tid; + else { + req.r.rtm_table = RT_TABLE_UNSPEC; + addattr32(, sizeof(req), RTA_TABLE, tid); + } + table_ok = 1; } else if (strcmp(*argv, "dev") == 0 || strcmp(*argv, "oif") == 0) { NEXT_ARG(); @@ -1395,6 +1409,15 @@ static int iproute_list_flush_or_save(int argc, char **argv, int action) } } else filter.tb = tid; + } else if (matches(*argv, "vrf") == 0) { + __u32 tid; + + NEXT_ARG(); + tid = ipvrf_get_table(*argv); + if (tid == 0) + invarg("Invalid VRF\n", *argv); + filter.tb = tid; + filter.typemask = ~(1 << RTN_LOCAL | 1<
[PATCH v4 iproute2 3/6] ip neigh: Add support for keyword
Add vrf keyword to 'ip neigh' commands. Allows listing neighbor entries for all links associated with a given VRF. Signed-off-by: David Ahern--- ip/ipneigh.c| 14 +- man/man8/ip-neighbour.8 | 8 +++- 2 files changed, 20 insertions(+), 2 deletions(-) diff --git a/ip/ipneigh.c b/ip/ipneigh.c index 4ddb747e2086..3e444712645f 100644 --- a/ip/ipneigh.c +++ b/ip/ipneigh.c @@ -48,7 +48,8 @@ static void usage(void) { fprintf(stderr, "Usage: ip neigh { add | del | change | replace }\n" "{ ADDR [ lladdr LLADDR ] [ nud STATE ] | proxy ADDR } [ dev DEV ]\n"); - fprintf(stderr, " ip neigh { show | flush } [ proxy ] [ to PREFIX ] [ dev DEV ] [ nud STATE ]\n\n"); + fprintf(stderr, " ip neigh { show | flush } [ proxy ] [ to PREFIX ] [ dev DEV ] [ nud STATE ]\n"); + fprintf(stderr, " [ vrf NAME ]\n\n"); fprintf(stderr, "STATE := { permanent | noarp | stale | reachable | none |\n" " incomplete | delay | probe | failed }\n"); exit(-1); @@ -385,6 +386,17 @@ static int do_show_or_flush(int argc, char **argv, int flush) invarg("Device does not exist\n", *argv); addattr32(, sizeof(req), NDA_MASTER, ifindex); filter.master = ifindex; + } else if (strcmp(*argv, "vrf") == 0) { + int ifindex; + + NEXT_ARG(); + ifindex = ll_name_to_index(*argv); + if (!ifindex) + invarg("Not a valid VRF name\n", *argv); + if (!name_is_vrf(*argv)) + invarg("Not a valid VRF name\n", *argv); + addattr32(, sizeof(req), NDA_MASTER, ifindex); + filter.master = ifindex; } else if (strcmp(*argv, "unused") == 0) { filter.unused_only = 1; } else if (strcmp(*argv, "nud") == 0) { diff --git a/man/man8/ip-neighbour.8 b/man/man8/ip-neighbour.8 index b292e1814495..bbfe8e72a09c 100644 --- a/man/man8/ip-neighbour.8 +++ b/man/man8/ip-neighbour.8 @@ -31,7 +31,9 @@ ip-neighbour \- neighbour/arp tables management. .B dev .IR DEV " ] [ " .B nud -.IR STATE " ]" +.IR STATE " ] [ " +.B vrf +.IR NAME " ] " .ti -8 .IR STATE " := {" @@ -164,6 +166,10 @@ the prefix selecting the neighbours to list. only list the neighbours attached to this device. .TP +.BI vrf " NAME" +only list the neighbours for given VRF. + +.TP .BI proxy list neighbour proxies. -- 2.1.4
[PATCH v4 iproute2 5/6] ip vrf: Add ipvrf_get_table
Add ipvrf_get_table to lookup table id for device name. Returns 0 on any error or if name is not a VRF device. Signed-off-by: David Ahern--- ip/ip_common.h | 1 + ip/iplink_vrf.c | 63 + 2 files changed, 64 insertions(+) diff --git a/ip/ip_common.h b/ip/ip_common.h index b747ea47122d..c8188122ab5d 100644 --- a/ip/ip_common.h +++ b/ip/ip_common.h @@ -90,6 +90,7 @@ struct link_util *get_link_slave_kind(const char *slave_kind); void br_dump_bridge_id(const struct ifla_bridge_id *id, char *buf, size_t len); +__u32 ipvrf_get_table(const char *name); bool name_is_vrf(const char *name); #ifndefINFINITY_LIFE_TIME diff --git a/ip/iplink_vrf.c b/ip/iplink_vrf.c index 166f566e6d77..015b41e1e7b0 100644 --- a/ip/iplink_vrf.c +++ b/ip/iplink_vrf.c @@ -97,6 +97,69 @@ struct link_util vrf_slave_link_util = { .slave = true, }; +/* returns table id if name is a VRF device */ +__u32 ipvrf_get_table(const char *name) +{ + struct { + struct nlmsghdr n; + struct ifinfomsgi; + charbuf[1024]; + } req = { + .n = { + .nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg)), + .nlmsg_flags = NLM_F_REQUEST, + .nlmsg_type = RTM_GETLINK, + }, + .i = { + .ifi_family = preferred_family, + }, + }; + struct { + struct nlmsghdr n; + char buf[8192]; + } answer; + struct rtattr *tb[IFLA_MAX+1]; + struct rtattr *li[IFLA_INFO_MAX+1]; + struct rtattr *vrf_attr[IFLA_VRF_MAX + 1]; + struct ifinfomsg *ifi; + __u32 tb_id = 0; + int len; + + addattr_l(, sizeof(req), IFLA_IFNAME, name, strlen(name) + 1); + + if (rtnl_talk(, , , sizeof(answer)) < 0) + return 0; + + ifi = NLMSG_DATA(); + len = answer.n.nlmsg_len - NLMSG_LENGTH(sizeof(*ifi)); + if (len < 0) { + fprintf(stderr, "BUG: Invalid response to link query.\n"); + return 0; + } + + parse_rtattr(tb, IFLA_MAX, IFLA_RTA(ifi), len); + + if (!tb[IFLA_LINKINFO]) + return 0; + + parse_rtattr_nested(li, IFLA_INFO_MAX, tb[IFLA_LINKINFO]); + + if (!li[IFLA_INFO_KIND] || !li[IFLA_INFO_DATA]) + return 0; + + if (strcmp(RTA_DATA(li[IFLA_INFO_KIND]), "vrf")) + return 0; + + parse_rtattr_nested(vrf_attr, IFLA_VRF_MAX, li[IFLA_INFO_DATA]); + if (vrf_attr[IFLA_VRF_TABLE]) + tb_id = rta_getattr_u32(vrf_attr[IFLA_VRF_TABLE]); + + if (!tb_id) + fprintf(stderr, "BUG: VRF %s is missing table id\n", name); + + return tb_id; +} + bool name_is_vrf(const char *name) { struct { -- 2.1.4
[PATCH v4 iproute2 1/6] ip vrf: Add name_is_vrf
Add name_is_vrf function to determine if given name corresponds to a VRF device. Signed-off-by: David Ahern--- ip/ip_common.h | 2 ++ ip/iplink_vrf.c | 50 ++ 2 files changed, 52 insertions(+) diff --git a/ip/ip_common.h b/ip/ip_common.h index e8da9e034b15..b747ea47122d 100644 --- a/ip/ip_common.h +++ b/ip/ip_common.h @@ -90,6 +90,8 @@ struct link_util *get_link_slave_kind(const char *slave_kind); void br_dump_bridge_id(const struct ifla_bridge_id *id, char *buf, size_t len); +bool name_is_vrf(const char *name); + #ifndefINFINITY_LIFE_TIME #define INFINITY_LIFE_TIME 0xU #endif diff --git a/ip/iplink_vrf.c b/ip/iplink_vrf.c index e3c7b4652da5..166f566e6d77 100644 --- a/ip/iplink_vrf.c +++ b/ip/iplink_vrf.c @@ -96,3 +96,53 @@ struct link_util vrf_slave_link_util = { .print_opt = vrf_slave_print_opt, .slave = true, }; + +bool name_is_vrf(const char *name) +{ + struct { + struct nlmsghdr n; + struct ifinfomsgi; + charbuf[1024]; + } req = { + .n = { + .nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg)), + .nlmsg_flags = NLM_F_REQUEST, + .nlmsg_type = RTM_GETLINK, + }, + .i = { + .ifi_family = preferred_family, + }, + }; + struct { + struct nlmsghdr n; + char buf[8192]; + } answer; + struct rtattr *tb[IFLA_MAX+1]; + struct rtattr *li[IFLA_INFO_MAX+1]; + struct ifinfomsg *ifi; + int len; + + addattr_l(, sizeof(req), IFLA_IFNAME, name, strlen(name) + 1); + + if (rtnl_talk(, , , sizeof(answer)) < 0) + return false; + + ifi = NLMSG_DATA(); + len = answer.n.nlmsg_len - NLMSG_LENGTH(sizeof(*ifi)); + if (len < 0) { + fprintf(stderr, "BUG: Invalid response to link query.\n"); + return false; + } + + parse_rtattr(tb, IFLA_MAX, IFLA_RTA(ifi), len); + + if (!tb[IFLA_LINKINFO]) + return false; + + parse_rtattr_nested(li, IFLA_INFO_MAX, tb[IFLA_LINKINFO]); + + if (!li[IFLA_INFO_KIND]) + return false; + + return strcmp(RTA_DATA(li[IFLA_INFO_KIND]), "vrf") == 0; +} -- 2.1.4
[PATCH v3 iproute2 1/6] ip vrf: Add name_is_vrf
Add name_is_vrf function to determine if given name corresponds to a VRF device. Signed-off-by: David Ahern--- ip/ip_common.h | 2 ++ ip/iplink_vrf.c | 50 ++ 2 files changed, 52 insertions(+) diff --git a/ip/ip_common.h b/ip/ip_common.h index e8da9e034b15..b747ea47122d 100644 --- a/ip/ip_common.h +++ b/ip/ip_common.h @@ -90,6 +90,8 @@ struct link_util *get_link_slave_kind(const char *slave_kind); void br_dump_bridge_id(const struct ifla_bridge_id *id, char *buf, size_t len); +bool name_is_vrf(const char *name); + #ifndefINFINITY_LIFE_TIME #define INFINITY_LIFE_TIME 0xU #endif diff --git a/ip/iplink_vrf.c b/ip/iplink_vrf.c index e3c7b4652da5..166f566e6d77 100644 --- a/ip/iplink_vrf.c +++ b/ip/iplink_vrf.c @@ -96,3 +96,53 @@ struct link_util vrf_slave_link_util = { .print_opt = vrf_slave_print_opt, .slave = true, }; + +bool name_is_vrf(const char *name) +{ + struct { + struct nlmsghdr n; + struct ifinfomsgi; + charbuf[1024]; + } req = { + .n = { + .nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg)), + .nlmsg_flags = NLM_F_REQUEST, + .nlmsg_type = RTM_GETLINK, + }, + .i = { + .ifi_family = preferred_family, + }, + }; + struct { + struct nlmsghdr n; + char buf[8192]; + } answer; + struct rtattr *tb[IFLA_MAX+1]; + struct rtattr *li[IFLA_INFO_MAX+1]; + struct ifinfomsg *ifi; + int len; + + addattr_l(, sizeof(req), IFLA_IFNAME, name, strlen(name) + 1); + + if (rtnl_talk(, , , sizeof(answer)) < 0) + return false; + + ifi = NLMSG_DATA(); + len = answer.n.nlmsg_len - NLMSG_LENGTH(sizeof(*ifi)); + if (len < 0) { + fprintf(stderr, "BUG: Invalid response to link query.\n"); + return false; + } + + parse_rtattr(tb, IFLA_MAX, IFLA_RTA(ifi), len); + + if (!tb[IFLA_LINKINFO]) + return false; + + parse_rtattr_nested(li, IFLA_INFO_MAX, tb[IFLA_LINKINFO]); + + if (!li[IFLA_INFO_KIND]) + return false; + + return strcmp(RTA_DATA(li[IFLA_INFO_KIND]), "vrf") == 0; +} -- 2.1.4
[PATCH v3 iproute2 2/6] ip link/addr: Add support for vrf keyword
Add vrf keyword to 'ip link' and 'ip addr' commands (common list code). Allows: 1. Adding a link to a VRF $ ip link set NAME vrf NAME Removing a link from a VRF still uses 'ip link set NAME nomaster' 2. Showing links associated with a VRF: $ ip link show vrf NAME 3. List addresses associated with links in a VRF $ ip -br addr show vrf red Signed-off-by: David Ahern--- ip/ipaddress.c | 12 +++- ip/iplink.c | 15 +-- man/man8/ip-address.8.in | 6 ++ man/man8/ip-link.8.in| 10 ++ 4 files changed, 40 insertions(+), 3 deletions(-) diff --git a/ip/ipaddress.c b/ip/ipaddress.c index d4d649505e15..54896c726e07 100644 --- a/ip/ipaddress.c +++ b/ip/ipaddress.c @@ -80,7 +80,7 @@ static void usage(void) fprintf(stderr, "[ to PREFIX ] [ FLAG-LIST ] [ label LABEL ] [up]\n"); fprintf(stderr, " ip address [ show [ dev IFNAME ] [ scope SCOPE-ID ] [ master DEVICE ]\n"); fprintf(stderr, " [ type TYPE ] [ to PREFIX ] [ FLAG-LIST ]\n"); - fprintf(stderr, " [ label LABEL ] [up] ]\n"); + fprintf(stderr, " [ label LABEL ] [up] [ vrf NAME ] ]\n"); fprintf(stderr, " ip address {showdump|restore}\n"); fprintf(stderr, "IFADDR := PREFIX | ADDR peer PREFIX\n"); fprintf(stderr, " [ broadcast ADDR ] [ anycast ADDR ]\n"); @@ -1620,6 +1620,16 @@ static int ipaddr_list_flush_or_save(int argc, char **argv, int action) if (!ifindex) invarg("Device does not exist\n", *argv); filter.master = ifindex; + } else if (strcmp(*argv, "vrf") == 0) { + int ifindex; + + NEXT_ARG(); + ifindex = ll_name_to_index(*argv); + if (!ifindex) + invarg("Not a valid VRF name\n", *argv); + if (!name_is_vrf(*argv)) + invarg("Not a valid VRF name\n", *argv); + filter.master = ifindex; } else if (strcmp(*argv, "type") == 0) { int soff; diff --git a/ip/iplink.c b/ip/iplink.c index b1f8a37922f5..f2a2e13cf0c5 100644 --- a/ip/iplink.c +++ b/ip/iplink.c @@ -82,11 +82,11 @@ void iplink_usage(void) fprintf(stderr, " [ query_rss { on | off} ]\n"); fprintf(stderr, " [ state { auto | enable | disable} ] ]\n"); fprintf(stderr, " [ trust { on | off} ] ]\n"); - fprintf(stderr, " [ master DEVICE ]\n"); + fprintf(stderr, " [ master DEVICE ][ vrf NAME ]\n"); fprintf(stderr, " [ nomaster ]\n"); fprintf(stderr, " [ addrgenmode { eui64 | none | stable_secret | random } ]\n"); fprintf(stderr, " [ protodown { on | off } ]\n"); - fprintf(stderr, " ip link show [ DEVICE | group GROUP ] [up] [master DEV] [type TYPE]\n"); + fprintf(stderr, " ip link show [ DEVICE | group GROUP ] [up] [master DEV] [vrf NAME] [type TYPE]\n"); if (iplink_have_newlink()) { fprintf(stderr, " ip link help [ TYPE ]\n"); @@ -603,6 +603,17 @@ int iplink_parse(int argc, char **argv, struct iplink_req *req, invarg("Device does not exist\n", *argv); addattr_l(>n, sizeof(*req), IFLA_MASTER, , 4); + } else if (strcmp(*argv, "vrf") == 0) { + int ifindex; + + NEXT_ARG(); + ifindex = ll_name_to_index(*argv); + if (!ifindex) + invarg("Not a valid VRF name\n", *argv); + if (!name_is_vrf(*argv)) + invarg("Not a valid VRF name\n", *argv); + addattr_l(>n, sizeof(*req), IFLA_MASTER, + , sizeof(ifindex)); } else if (matches(*argv, "nomaster") == 0) { int ifindex = 0; diff --git a/man/man8/ip-address.8.in b/man/man8/ip-address.8.in index 3cbe4181f7e3..7cb927173133 100644 --- a/man/man8/ip-address.8.in +++ b/man/man8/ip-address.8.in @@ -45,6 +45,8 @@ ip-address \- protocol address management .IR DEVICE " ] [ " .B type .IR TYPE " ] [ " +.B vrf +.IR NAME " ] [ " .BR up " ] ]" .ti -8 @@ -280,6 +282,10 @@ is a usual shell style pattern. only list interfaces enslaved to this master device. .TP +.BI vrf " NAME " +only list interfaces enslaved to this vrf. + +.TP .BI type " TYPE" only list interfaces of the
[PATCH v3 iproute2 0/6] Add support for vrf keyword
Currently the syntax for VRF related commands is rather kludgy and inconsistent from one subcommand to another. This set adds support for the VRF keyword to the link, address, neigh, and route commands to improve the user experience listing data associated with vrfs, modifying routes or doing a route lookup. v3 - addressed comments from Stephen - added documentation updates to patches 2, 3 and 6 v2 - rebased to top of tree - all checkpatch warnings are usage lines. The change in these patches is consistent with existing code for usage lines David Ahern (6): ip vrf: Add name_is_vrf ip link/addr: Add support for vrf keyword ip neigh: Add support for keyword ip route: Change type mask to bitmask ip vrf: Add ipvrf_get_table ip route: Add support for vrf keyword ip/ip_common.h | 3 ++ ip/ipaddress.c | 12 - ip/iplink.c | 15 +- ip/iplink_vrf.c | 116 +++ ip/ipneigh.c | 14 +- ip/iproute.c | 43 +++--- man/man8/ip-address.8.in | 6 +++ man/man8/ip-link.8.in| 10 man/man8/ip-neighbour.8 | 8 +++- man/man8/ip-route.8.in | 19 +++- 10 files changed, 233 insertions(+), 13 deletions(-) -- 2.1.4
[PATCH v3 iproute2 4/6] ip route: Change type mask to bitmask
Allow option to select multiple route types to show or exlude specific route types. Signed-off-by: David Ahern--- ip/iproute.c | 11 ++- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/ip/iproute.c b/ip/iproute.c index 8224d7ffa94b..aae693d17be8 100644 --- a/ip/iproute.c +++ b/ip/iproute.c @@ -113,7 +113,7 @@ static struct int flushe; int protocol, protocolmask; int scope, scopemask; - int type, typemask; + __u64 typemask; int tos, tosmask; int iif, iifmask; int oif, oifmask; @@ -178,7 +178,8 @@ static int filter_nlmsg(struct nlmsghdr *n, struct rtattr **tb, int host_len) return 0; if ((filter.scope^r->rtm_scope)) return 0; - if ((filter.type^r->rtm_type)) + + if (filter.typemask && !(filter.typemask & (1 << r->rtm_type))) return 0; if ((filter.tos^r->rtm_tos)) return 0; @@ -365,7 +366,8 @@ int print_route(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg) if (n->nlmsg_type == RTM_DELROUTE) fprintf(fp, "Deleted "); - if ((r->rtm_type != RTN_UNICAST || show_details > 0) && !filter.type) + if ((r->rtm_type != RTN_UNICAST || show_details > 0) && + (!filter.typemask || (filter.typemask & (1 << r->rtm_type fprintf(fp, "%s ", rtnl_rtntype_n2a(r->rtm_type, b1, sizeof(b1))); if (tb[RTA_DST]) { @@ -1433,10 +1435,9 @@ static int iproute_list_flush_or_save(int argc, char **argv, int action) int type; NEXT_ARG(); - filter.typemask = -1; if (rtnl_rtntype_a2n(, *argv)) invarg("node type value is invalid\n", *argv); - filter.type = type; + filter.typemask = (1<
[PATCH v3 iproute2 5/6] ip vrf: Add ipvrf_get_table
Add ipvrf_get_table to lookup table id for device name. Returns 0 on any error or if name is not a VRF device. Signed-off-by: David Ahern--- ip/ip_common.h | 1 + ip/iplink_vrf.c | 66 + 2 files changed, 67 insertions(+) diff --git a/ip/ip_common.h b/ip/ip_common.h index b747ea47122d..8faed6174f06 100644 --- a/ip/ip_common.h +++ b/ip/ip_common.h @@ -90,6 +90,7 @@ struct link_util *get_link_slave_kind(const char *slave_kind); void br_dump_bridge_id(const struct ifla_bridge_id *id, char *buf, size_t len); +__u32 ipvrf_get_table(char *name); bool name_is_vrf(const char *name); #ifndefINFINITY_LIFE_TIME diff --git a/ip/iplink_vrf.c b/ip/iplink_vrf.c index 166f566e6d77..b4013773537f 100644 --- a/ip/iplink_vrf.c +++ b/ip/iplink_vrf.c @@ -97,6 +97,72 @@ struct link_util vrf_slave_link_util = { .slave = true, }; +/* returns table id if name is a VRF device */ +__u32 ipvrf_get_table(char *name) +{ + struct { + struct nlmsghdr n; + struct ifinfomsgi; + charbuf[1024]; + } req = { + .n = { + .nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg)), + .nlmsg_flags = NLM_F_REQUEST, + .nlmsg_type = RTM_GETLINK, + }, + .i = { + .ifi_family = preferred_family, + }, + }; + struct { + struct nlmsghdr n; + char buf[8192]; + } answer; + struct rtattr *tb[IFLA_MAX+1]; + struct rtattr *li[IFLA_INFO_MAX+1]; + struct rtattr *vrf_attr[IFLA_VRF_MAX + 1]; + struct ifinfomsg *ifi; + __u32 tb_id = 0; + int len; + + addattr_l(, sizeof(req), IFLA_IFNAME, name, strlen(name) + 1); + + if (rtnl_talk(, , , sizeof(answer)) < 0) + goto err; + + ifi = NLMSG_DATA(); + len = answer.n.nlmsg_len - NLMSG_LENGTH(sizeof(*ifi)); + if (len < 0) { + fprintf(stderr, "BUG: Invalid response to link query.\n"); + goto err; + } + + parse_rtattr(tb, IFLA_MAX, IFLA_RTA(ifi), len); + + if (!tb[IFLA_LINKINFO]) + goto err; + + parse_rtattr_nested(li, IFLA_INFO_MAX, tb[IFLA_LINKINFO]); + + if (!li[IFLA_INFO_KIND] || !li[IFLA_INFO_DATA]) + goto err; + + if (strcmp(RTA_DATA(li[IFLA_INFO_KIND]), "vrf")) + goto err; + + parse_rtattr_nested(vrf_attr, IFLA_VRF_MAX, li[IFLA_INFO_DATA]); + if (vrf_attr[IFLA_VRF_TABLE]) + tb_id = rta_getattr_u32(vrf_attr[IFLA_VRF_TABLE]); + + if (!tb_id) + fprintf(stderr, "BUG: VRF %s is missing table id\n", name); + + return tb_id; + +err: + return 0; +} + bool name_is_vrf(const char *name) { struct { -- 2.1.4
[PATCH v3 iproute2 6/6] ip route: Add support for vrf keyword
Add vrf keyword to 'ip route' commands. Allows: 1. Users can list routes by VRF name: $ ip route show vrf NAME VRF tables have all routes including local and broadcast routes. The VRF keyword filters LOCAL and BROADCAST routes; to see all routes the table option can be used. Or to see local routes only for a VRF: $ ip route show vrf NAME type local 2. Add or delete a route for a VRF: $ ip route {add|delete} vrf NAME 3. Do a route lookup for a VRF: $ ip route get vrf NAME ADDRESS Signed-off-by: David Ahern--- ip/iproute.c | 32 ++-- man/man8/ip-route.8.in | 19 ++- 2 files changed, 48 insertions(+), 3 deletions(-) diff --git a/ip/iproute.c b/ip/iproute.c index aae693d17be8..bd661c16cb46 100644 --- a/ip/iproute.c +++ b/ip/iproute.c @@ -67,10 +67,10 @@ static void usage(void) fprintf(stderr, " ip route showdump\n"); fprintf(stderr, " ip route get ADDRESS [ from ADDRESS iif STRING ]\n"); fprintf(stderr, "[ oif STRING ] [ tos TOS ]\n"); - fprintf(stderr, "[ mark NUMBER ]\n"); + fprintf(stderr, "[ mark NUMBER ] [ vrf NAME ]\n"); fprintf(stderr, " ip route { add | del | change | append | replace } ROUTE\n"); fprintf(stderr, "SELECTOR := [ root PREFIX ] [ match PREFIX ] [ exact PREFIX ]\n"); - fprintf(stderr, "[ table TABLE_ID ] [ proto RTPROTO ]\n"); + fprintf(stderr, "[ table TABLE_ID ] [ vrf NAME ] [ proto RTPROTO ]\n"); fprintf(stderr, "[ type TYPE ] [ scope SCOPE ]\n"); fprintf(stderr, "ROUTE := NODE_SPEC [ INFO_SPEC ]\n"); fprintf(stderr, "NODE_SPEC := [ TYPE ] PREFIX [ tos TOS ]\n"); @@ -1141,6 +1141,20 @@ static int iproute_modify(int cmd, unsigned int flags, int argc, char **argv) addattr32(, sizeof(req), RTA_TABLE, tid); } table_ok = 1; + } else if (matches(*argv, "vrf") == 0) { + __u32 tid; + + NEXT_ARG(); + tid = ipvrf_get_table(*argv); + if (tid == 0) + invarg("Invalid VRF\n", *argv); + if (tid < 256) + req.r.rtm_table = tid; + else { + req.r.rtm_table = RT_TABLE_UNSPEC; + addattr32(, sizeof(req), RTA_TABLE, tid); + } + table_ok = 1; } else if (strcmp(*argv, "dev") == 0 || strcmp(*argv, "oif") == 0) { NEXT_ARG(); @@ -1395,6 +1409,15 @@ static int iproute_list_flush_or_save(int argc, char **argv, int action) } } else filter.tb = tid; + } else if (matches(*argv, "vrf") == 0) { + __u32 tid; + + NEXT_ARG(); + tid = ipvrf_get_table(*argv); + if (tid == 0) + invarg("Invalid VRF\n", *argv); + filter.tb = tid; + filter.typemask = ~(1 << RTN_LOCAL | 1<
[PATCH v3 iproute2 3/6] ip neigh: Add support for keyword
Add vrf keyword to 'ip neigh' commands. Allows listing neighbor entries for all links associated with a given VRF. Signed-off-by: David Ahern--- ip/ipneigh.c| 14 +- man/man8/ip-neighbour.8 | 8 +++- 2 files changed, 20 insertions(+), 2 deletions(-) diff --git a/ip/ipneigh.c b/ip/ipneigh.c index 4ddb747e2086..3e444712645f 100644 --- a/ip/ipneigh.c +++ b/ip/ipneigh.c @@ -48,7 +48,8 @@ static void usage(void) { fprintf(stderr, "Usage: ip neigh { add | del | change | replace }\n" "{ ADDR [ lladdr LLADDR ] [ nud STATE ] | proxy ADDR } [ dev DEV ]\n"); - fprintf(stderr, " ip neigh { show | flush } [ proxy ] [ to PREFIX ] [ dev DEV ] [ nud STATE ]\n\n"); + fprintf(stderr, " ip neigh { show | flush } [ proxy ] [ to PREFIX ] [ dev DEV ] [ nud STATE ]\n"); + fprintf(stderr, " [ vrf NAME ]\n\n"); fprintf(stderr, "STATE := { permanent | noarp | stale | reachable | none |\n" " incomplete | delay | probe | failed }\n"); exit(-1); @@ -385,6 +386,17 @@ static int do_show_or_flush(int argc, char **argv, int flush) invarg("Device does not exist\n", *argv); addattr32(, sizeof(req), NDA_MASTER, ifindex); filter.master = ifindex; + } else if (strcmp(*argv, "vrf") == 0) { + int ifindex; + + NEXT_ARG(); + ifindex = ll_name_to_index(*argv); + if (!ifindex) + invarg("Not a valid VRF name\n", *argv); + if (!name_is_vrf(*argv)) + invarg("Not a valid VRF name\n", *argv); + addattr32(, sizeof(req), NDA_MASTER, ifindex); + filter.master = ifindex; } else if (strcmp(*argv, "unused") == 0) { filter.unused_only = 1; } else if (strcmp(*argv, "nud") == 0) { diff --git a/man/man8/ip-neighbour.8 b/man/man8/ip-neighbour.8 index b292e1814495..bbfe8e72a09c 100644 --- a/man/man8/ip-neighbour.8 +++ b/man/man8/ip-neighbour.8 @@ -31,7 +31,9 @@ ip-neighbour \- neighbour/arp tables management. .B dev .IR DEV " ] [ " .B nud -.IR STATE " ]" +.IR STATE " ] [ " +.B vrf +.IR NAME " ] " .ti -8 .IR STATE " := {" @@ -164,6 +166,10 @@ the prefix selecting the neighbours to list. only list the neighbours attached to this device. .TP +.BI vrf " NAME" +only list the neighbours for given VRF. + +.TP .BI proxy list neighbour proxies. -- 2.1.4
Re: [PATCH v2 iproute2 0/6] Add support for vrf keyword
On 6/29/16 9:07 AM, Stephen Hemminger wrote: On Mon, 27 Jun 2016 11:50:55 -0700 David Ahernwrote: Currently the syntax for VRF related commands is rather kludgy and inconsistent from one subcommand to another. This set adds support for the VRF keyword to the link, address, neigh, and route commands to improve the user experience listing data associated with vrfs, modifying routes or doing a route lookup. v2 - rebased to top of tree - all checkpatch warnings are usage lines. The change in these patches is consistent with existing code for usage lines Does this break current user scripts? I don't see how it can. Existing syntax is not touched so if a user wants to run: $ ip link show master red $ ip ro sh table 1001 it still works. Using the vrf keyword just makes for a more natural syntax: $ ip link show vrf red $ ip ro sh vrf red in this case I don't have to lookup what table device red is associated with ip learns it and shows that table. It seems this method will cause lots of additional netlink requests to check if device is a vrf. Won't this impact users with 1000's of devices? It only adds 1 GETLINK request if the vrf keyword is used. The lookup verifies the name is actually a VRF. It's really no different than the ll_name_to_index lookups used when processing the command line (ll_name_to_index relies on a cache or an ioctl, if_nametoindex).
Re: [PATCH iproute2 1/6] ip vrf: Add name_is_vrf
On 6/29/16 9:06 AM, Stephen Hemminger wrote: On Mon, 27 Jun 2016 11:50:56 -0700 David Ahernwrote: diff --git a/ip/ip_common.h b/ip/ip_common.h index e8da9e034b15..410eb135774a 100644 --- a/ip/ip_common.h +++ b/ip/ip_common.h @@ -90,6 +90,8 @@ struct link_util *get_link_slave_kind(const char *slave_kind); void br_dump_bridge_id(const struct ifla_bridge_id *id, char *buf, size_t len); +bool name_is_vrf(char *name); + #ifndefINFINITY_LIFE_TIME #define INFINITY_LIFE_TIME 0xU #endif diff --git a/ip/iplink_vrf.c b/ip/iplink_vrf.c index e3c7b4652da5..abd43c08423e 100644 --- a/ip/iplink_vrf.c +++ b/ip/iplink_vrf.c @@ -96,3 +96,56 @@ struct link_util vrf_slave_link_util = { .print_opt = vrf_slave_print_opt, .slave = true, }; + +bool name_is_vrf(char *name) Why not? bool name_is_vrf(const char *name) sure. +{ + struct { + struct nlmsghdr n; + struct ifinfomsgi; + charbuf[1024]; + } req = { + .n = { + .nlmsg_len = NLMSG_LENGTH(sizeof(struct ifinfomsg)), + .nlmsg_flags = NLM_F_REQUEST, + .nlmsg_type = RTM_GETLINK, + }, + .i = { + .ifi_family = preferred_family, + }, + }; + struct { + struct nlmsghdr n; + char buf[8192]; + } answer; + struct rtattr *tb[IFLA_MAX+1]; + struct rtattr *li[IFLA_INFO_MAX+1]; + struct ifinfomsg *ifi; + int len; + + addattr_l(, sizeof(req), IFLA_IFNAME, name, strlen(name) + 1); + + if (rtnl_talk(, , , sizeof(answer)) < 0) + goto err; Just return false instead of all these goto's? Also you might want to give some indication of error. Any failure and the user gets the message "not a VRF". The reason can be the device is not a VRF or the device does not exist but that's the same thing in this case: $ ./ip link show vrf foo Error: argument "foo" is wrong: Not a valid VRF name $ ./ip link show vrf eth1 Error: argument "eth1" is wrong: Not a valid VRF name
Re: [PATCH v2 net-next] tcp: md5: use kmalloc() backed scratch areas
On Wed, Jun 29, 2016 at 8:38 AM, Herbert Xuwrote: > On Wed, Jun 29, 2016 at 08:26:43AM -0700, Andy Lutomirski wrote: >> >> Two reasons: >> >> 1. Code like tcp md5 would be simpler if it could pass a scatterlist >> to hash the skb but use a virtual address for the header. > > True. But I bet we can make it simpler in other ways without > creating special interfaces for it. Look at how we do IPsec > encryption/hashing, this is what TCP md5 should look like. But > nobody in their right mind would bother doing this because this > is dead code. > >> 2. The actual calling sequence and the amount of indirection is much >> less for shash, so hashing short buffer is probably *much* faster. > > Really? > > Have you measured the speed difference between the ahash and shash > interfaces? Are you sure that this would matter when compared > against the speed of hashing a single MD5 block? > It's non-negligible. If I had SHA-NI, it would probably be a bigger relative difference. [0.535717] SHA1 ahash: warming up [0.737471] SHA1 ahash: 50 loops, 386 ns/loop [0.737937] SHA1 shash: warming up [0.903131] SHA1 shash: 50 loops, 323 ns/loop [1.055409] SHA1 shash_digest: 50 loops, 303 ns/loop [0.574410] SHA1 ahash: warming up [0.796927] SHA1 ahash: 50 loops, 414 ns/loop [0.797695] SHA1 shash: warming up [0.959772] SHA1 shash: 50 loops, 316 ns/loop [1.100886] SHA1 shash_digest: 50 loops, 280 ns/loop Here's MD5: [0.504708] MD5 ahash: warming up [0.688155] MD5 ahash: 50 loops, 344 ns/loop [0.688971] MD5 shash: warming up [0.834953] MD5 shash: 50 loops, 285 ns/loop [0.982021] MD5 shash_digest: 50 loops, 292 ns/loop [0.570780] MD5 ahash: warming up [0.754325] MD5 ahash: 50 loops, 357 ns/loop [0.754807] MD5 shash: warming up [0.906861] MD5 shash: 50 loops, 297 ns/loop [1.059057] MD5 shash_digest: 50 loops, 303 ns/loop If you throw in sub-one-block requests, it'll be worse, because those requests don't do real work. Overall, it looks like there's overhead of something like 50ns for each ahash invocation vs the shash equivalent. It's not huge, but it's there. (This is cache-hot. I bet it's considerably worse if cache-cold, because ahash will require a lot more code cache lines as well as the extra cache lines involved in the scatterlist and whatever arch stuff is needed to map back and forth between virtual and physical addresses. Here's the code: static void test_ahash(void) { struct crypto_ahash *hash; struct ahash_request *req; struct scatterlist sg_in; char *buf, *out; int i; u64 end, start; unsigned loops = 50; hash = crypto_alloc_ahash("sha1", 0, CRYPTO_ALG_ASYNC); if (IS_ERR(hash)) return; req = ahash_request_alloc(hash, GFP_KERNEL); if (!req) return; ahash_request_set_callback(req, 0, NULL, NULL); buf = kmalloc(64, GFP_KERNEL); memset(buf, 0, 64); out = kmalloc(20, GFP_KERNEL); pr_err("SHA1 ahash: warming up\n"); for (i = 0; i < 1; i++) { crypto_ahash_init(req); sg_init_one(_in, buf, 64); ahash_request_set_crypt(req, _in, NULL, 64); if (crypto_ahash_update(req)) return; ahash_request_set_crypt(req, NULL, out, 0); if (crypto_ahash_final(req)) return; } start = ktime_get_ns(); for (i = 0; i < loops; i++) { crypto_ahash_init(req); sg_init_one(_in, buf, 64); ahash_request_set_crypt(req, _in, NULL, 64); if (crypto_ahash_update(req)) return; ahash_request_set_crypt(req, NULL, out, 0); if (crypto_ahash_final(req)) return; } end = ktime_get_ns(); pr_err("SHA1 ahash: %u loops, %llu ns/loop\n", loops, (unsigned long long)((end-start)/loops)); } static void test_shash(void) { struct crypto_shash *hash; struct shash_desc *desc; char *buf, *out; int i; u64 end, start; unsigned loops = 50; hash = crypto_alloc_shash("sha1", 0, CRYPTO_ALG_ASYNC); if (IS_ERR(hash)) return; desc = kmalloc(crypto_shash_descsize(hash), GFP_KERNEL); if (!desc) return; desc->tfm = hash; desc->flags = 0; /* * Hmm. This interface isn't conducive to the use of hardware that * needs state beyond what lives in a single memory buffer -- how would * the shash implementation know when to free the state? */ buf = kmalloc(64, GFP_KERNEL); memset(buf, 0, 64); out = kmalloc(20, GFP_KERNEL); pr_err("SHA1 shash: warming up\n"); for (i = 0; i < 1; i++) { crypto_shash_init(desc); if (crypto_shash_update(desc, buf, 64)) return; if (crypto_shash_final(desc, out)) return; } start = ktime_get_ns(); for (i = 0; i < loops; i++) { /* * For fairness, this
Re: [PATCH net-next 08/16] net/devlink: Add E-Switch mode control
On 16-06-29 07:48 AM, Or Gerlitz wrote: > On 6/28/2016 10:31 PM, John Fastabend wrote: >> On 16-06-28 12:12 PM, Jiri Pirko wrote: >>> >>> Why?! Please, leave legacy be legacy. Use the new mode for >>> implementing new features. Don't make things any more complicated :( >>> >> OK so how I read this is there are two things going on that are being >> conflated together. Creating VF netdev's is linked to the PCIe >> subsystems and brings VFs into the netdev model. This is a good thing >> but doesn't need to be a global nic policy it can be per port hence >> the ethtool flag vs devlink discussion. I don't actually have a use case >> to have one port with VF netdevs and another without it so I'm not too >> particular on this. Logically it looks like a per port setting because >> the hardware has no issues with making one physical function create >> a netdev for each of its VFs and the other one run without these >> netdevs. This is why I called it out. >> >> How this relates to bridge, tc, etc. is now you have a identifier to >> configure instead of using strange 'ip link set ... vf#' commands. This >> is great. But I see no reason the hardware has to make changes to >> the existing tables or any of this. Before we used 'bridge fdb' and 'ip >> link' now we can use bridge tools more effectively and can deprecate >> the overloaded use of ip. But again I see no reason to thrash the >> forwarding state of the switch because we happen to be adding VFs. >> Having a set of fdb rules to forward MAC/Vlan pairs (as we do now) >> seems like a perfectly reasonable default. Add with this patch now >> when I run 'fdb show' I can see the defaults. >> >> Maybe I'm reading to much into the devlink flag names and if instead >> you use a switch like the following, >> >>VF representer : enable/disable the creation VF netdev's to represent >> the virtual functions on the PF >> >> >> Much less complicated then magic switching between forwarding logic IMO >> and you don't whack a default configuration that an entire stack (e.g. >> libvirt) has been built to use. > > > John, > > I'll try to address here the core questions and arguments you brought. > thanks. Also just to reiterate I really like the series just a few details here. > Re letting the user to observe/modify the rules added by the > driver/firmware while legacy mode. Even if possible with bridge/fdb, it > will be really pragmatical and doesn't make sense to get that donefor > the TC subsystem. So this isn't a well defined solution and anyway, as > you said, legacy mode enhancements is a different exercise. Personally, > I agree with Jiri, that we should legacy be legacyand focus on adding > the new model. > The ixgbe driver already supports bridge and tc commands without the VF representer. Adding the VF representer to these drivers just extends the existing support so we have an identifier for VFs and now the redirect action works and the fdb commands can specify the VF netdevs. I don't see this as a problem because we already do it today with 'ip' and bridge tools. We are also slightly in disagreement about what the default should be with VF netdevs. I think the default should be the same L2 mac/vlan switch behavior and see no reason to change it by default just because we added VF netdevs. The infrastructure libvirt/openstack/etc are built around this default today. But I guess nothing in this series specifies what the defaults of any given driver will be. VF netdevs are still useful even on older hardware that only supports mac/vlan forwarding to expose statistics and send/receive control frames such as lldp. > The new model has few building blocks, and by all means, have the VF > representors is not the full story, which is not magic but rather the > following: > > 1. VF (vport) representors netdevices + the needed mechanics > (send-to-vport rules that makes xmit on VF rep --> recv on VF) > We all agree on this. For me this should be its own knob VF netdevs or no VF netdevs. There is also my point that this is really a port attribute of the PCIe configuration not a switch attribute. > 2. handling HW data-patch misses --> send to CPU or drop Yep need this also but we have a standard way to configure this already with bridge and 'tc' so why have a toggle for it? Also you don't know in the driver where I want to send missed packets. In some use cases I have the VM is managing the system and in these cases I want to send missed packets to a VF. In ixgbe we get this for free (with the vf identifier netdevs) because we have 'tc' and 'bridge' already hooked up. With 'tc' you can define a wild card match with low priority and with 'bridge' model you can setup the flood ports to do this. > > 3. ability to offload SW rules (tc/bridge/etc) using VF representors and > ingress qdiscs / bridge fdb rules / switchdev fdb rule, etc > > The knob we suggested says that the system is put into a state where > 1,2,3 are needed to make it full
Re: [net 2/2] ixgbevf: ixgbevf_write/read_posted_mbx should use IXGBE_ERR_MBX to initialize ret_val
On Wed, 2016-06-29 at 09:30 -0700, Jeff Kirsher wrote: > From: Xin Long> > Now ixgbevf_write/read_posted_mbx use -IXGBE_ERR_MBX as the initiative > return value, but it's incorrect, cause in ixgbevf_vlan_rx_add_vid(), > it use err == IXGBE_ERR_MBX, the err returned from mac.ops.set_vfta, > and in ixgbevf_set_vfta_vf, it return from write/read_posted. so we > should initialize err with IXGBE_ERR_MBX, instead of -IXGBE_ERR_MBX. I believe "initiative" should be "initial", "cause" should be "because" and "it use err" should be "it uses err". And I won't be a grammar nazi about all the commas. - Greg > > With this fix, the other functions that called it also can work well, > cause they only care about if err is 0 or not. > > Signed-off-by: Xin Long > Tested-by: Andrew Bowers > Signed-off-by: Jeff Kirsher > --- > drivers/net/ethernet/intel/ixgbevf/mbx.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/drivers/net/ethernet/intel/ixgbevf/mbx.c > b/drivers/net/ethernet/intel/ixgbevf/mbx.c > index 61a80da..2819abc 100644 > --- a/drivers/net/ethernet/intel/ixgbevf/mbx.c > +++ b/drivers/net/ethernet/intel/ixgbevf/mbx.c > @@ -85,7 +85,7 @@ static s32 ixgbevf_poll_for_ack(struct ixgbe_hw *hw) > static s32 ixgbevf_read_posted_mbx(struct ixgbe_hw *hw, u32 *msg, u16 size) > { > struct ixgbe_mbx_info *mbx = >mbx; > - s32 ret_val = -IXGBE_ERR_MBX; > + s32 ret_val = IXGBE_ERR_MBX; > > if (!mbx->ops.read) > goto out; > @@ -111,7 +111,7 @@ out: > static s32 ixgbevf_write_posted_mbx(struct ixgbe_hw *hw, u32 *msg, u16 size) > { > struct ixgbe_mbx_info *mbx = >mbx; > - s32 ret_val = -IXGBE_ERR_MBX; > + s32 ret_val = IXGBE_ERR_MBX; > > /* exit if either we can't write or there isn't a defined timeout */ > if (!mbx->ops.write || !mbx->timeout)