Re: [PATCH v2 net-next 0/2] platform data controls for mdio-gpio
On 12/8/18 7:12 AM, Andrew Lunn wrote:
> Soon to be mainlined is an x86 platform with a Marvell switch, and a
> bit-banging MDIO bus. In order to make this work, the phy_mask of the
> MDIO bus needs to be set to prevent scanning for PHYs, and the
> phy_ignore_ta_mask needs to be set because the switch has broken
> turnaround.
>
> Add a platform_data structure with these parameters.

Looks good, thanks Andrew. I do wonder whether we could eventually define a
common mii_bus_platform_data structure comprised of these two members (if
nothing else) and update the common mdiobus_register() code path to look
for them. If a subsequent platform data/device MDIO bus shows up, we could
do that at that time. Thanks!

> Andrew Lunn (2):
>   net: phy: mdio-gpio: Add platform_data support for phy_mask
>   net: phy: mdio-gpio: Add phy_ignore_ta_mask to platform data
>
>  MAINTAINERS                             |  1 +
>  drivers/net/phy/mdio-gpio.c             |  7 +++
>  include/linux/platform_data/mdio-gpio.h | 14 ++
>  3 files changed, 22 insertions(+)
>  create mode 100644 include/linux/platform_data/mdio-gpio.h

--
Florian
Re: [PATCH v2 net-next 2/2] net: phy: mdio-gpio: Add phy_ignore_ta_mask to platform data
On 12/8/18 7:12 AM, Andrew Lunn wrote:
> The Marvell 6390 Ethernet switch family does not perform MDIO
> turnaround correctly. Many hardware MDIO bus masters don't care about
> this, but the bit-banging implementation in Linux does by default. Add
> phy_ignore_ta_mask to the platform data so that the bit-banging code
> can be told which devices are known to get TA wrong.
>
> v2
> --
> int -> u32 in platform data structure
>
> Signed-off-by: Andrew Lunn

Reviewed-by: Florian Fainelli
--
Florian
Re: [PATCH v2 net-next 1/2] net: phy: mdio-gpio: Add platform_data support for phy_mask
On 12/8/18 7:12 AM, Andrew Lunn wrote:
> It is sometimes necessary to instantiate a bit-banging MDIO bus as a
> platform device, without the aid of device tree.
>
> When device tree is being used, the bus is not scanned for devices;
> only those devices which are in device tree are probed. Without device
> tree, by default, all addresses on the bus are scanned. This may then
> find a device which is not a PHY, e.g. a switch. And the switch may
> have registers containing values which look like a PHY, so during the
> scan a PHY device is wrongly created.
>
> After the bus has been registered, a search is made for
> mdio_board_info structures which indicate devices on the bus and the
> driver which should be used for them. This is typically used to
> instantiate Ethernet switches from platform drivers. However, if the
> scanning of the bus has created a PHY device at the same location as
> indicated in the board info for a switch, the switch device is not
> created, since the address is already busy.
>
> This can be avoided by setting the phy_mask of the mdio bus. This mask
> prevents addresses on the bus from being scanned.
>
> v2
> --
> int -> u32 in platform data structure
>
> Signed-off-by: Andrew Lunn

Reviewed-by: Florian Fainelli
--
Florian
[PATCH v2 net-next 0/2] platform data controls for mdio-gpio
Soon to be mainlined is an x86 platform with a Marvell switch, and a
bit-banging MDIO bus. In order to make this work, the phy_mask of the
MDIO bus needs to be set to prevent scanning for PHYs, and the
phy_ignore_ta_mask needs to be set because the switch has broken
turnaround.

Add a platform_data structure with these parameters.

Andrew Lunn (2):
  net: phy: mdio-gpio: Add platform_data support for phy_mask
  net: phy: mdio-gpio: Add phy_ignore_ta_mask to platform data

 MAINTAINERS                             |  1 +
 drivers/net/phy/mdio-gpio.c             |  7 +++
 include/linux/platform_data/mdio-gpio.h | 14 ++
 3 files changed, 22 insertions(+)
 create mode 100644 include/linux/platform_data/mdio-gpio.h

--
2.19.1
[PATCH v2 net-next 2/2] net: phy: mdio-gpio: Add phy_ignore_ta_mask to platform data
The Marvell 6390 Ethernet switch family does not perform MDIO
turnaround correctly. Many hardware MDIO bus masters don't care about
this, but the bit-banging implementation in Linux does by default. Add
phy_ignore_ta_mask to the platform data so that the bit-banging code
can be told which devices are known to get TA wrong.

v2
--
int -> u32 in platform data structure

Signed-off-by: Andrew Lunn
---
 drivers/net/phy/mdio-gpio.c             | 4 +++-
 include/linux/platform_data/mdio-gpio.h | 1 +
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/phy/mdio-gpio.c b/drivers/net/phy/mdio-gpio.c
index 1e296dd4067a..ea9a0e339778 100644
--- a/drivers/net/phy/mdio-gpio.c
+++ b/drivers/net/phy/mdio-gpio.c
@@ -130,8 +130,10 @@ static struct mii_bus *mdio_gpio_bus_init(struct device *dev,
 	else
 		strncpy(new_bus->id, "gpio", MII_BUS_ID_SIZE);
 
-	if (pdata)
+	if (pdata) {
 		new_bus->phy_mask = pdata->phy_mask;
+		new_bus->phy_ignore_ta_mask = pdata->phy_ignore_ta_mask;
+	}
 
 	dev_set_drvdata(dev, new_bus);
 
diff --git a/include/linux/platform_data/mdio-gpio.h b/include/linux/platform_data/mdio-gpio.h
index a5d5ff5e174c..13874fa6e767 100644
--- a/include/linux/platform_data/mdio-gpio.h
+++ b/include/linux/platform_data/mdio-gpio.h
@@ -8,6 +8,7 @@
 struct mdio_gpio_platform_data {
 	u32 phy_mask;
+	u32 phy_ignore_ta_mask;
 };
 
 #endif /* __LINUX_MDIO_GPIO_PDATA_H */
--
2.19.1
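For illustration, the turnaround check that the bit-banging code applies, and that this mask suppresses, can be sketched in plain C. The helper name and shape are assumptions for the sketch, not the kernel's mdio-bitbang code:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative sketch: after the register address phase of an MDIO read,
 * a compliant PHY drives the first turnaround (TA) bit low. A device
 * whose address bit is set in phy_ignore_ta_mask (like the 6390 family)
 * has the check skipped and its data accepted anyway. */
static bool mdiobb_ta_ok(uint32_t phy_ignore_ta_mask, unsigned int phy_addr,
                         unsigned int first_ta_bit)
{
    if (phy_ignore_ta_mask & (1u << phy_addr))
        return true;            /* known-broken device: accept anyway */
    return first_ta_bit == 0;   /* compliant device drives TA low */
}
```

With the mask bit clear, a high TA bit makes the read fail; with it set, the same read is accepted.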
[PATCH v2 net-next 1/2] net: phy: mdio-gpio: Add platform_data support for phy_mask
It is sometimes necessary to instantiate a bit-banging MDIO bus as a
platform device, without the aid of device tree.

When device tree is being used, the bus is not scanned for devices;
only those devices which are in device tree are probed. Without device
tree, by default, all addresses on the bus are scanned. This may then
find a device which is not a PHY, e.g. a switch. And the switch may
have registers containing values which look like a PHY, so during the
scan a PHY device is wrongly created.

After the bus has been registered, a search is made for
mdio_board_info structures which indicate devices on the bus and the
driver which should be used for them. This is typically used to
instantiate Ethernet switches from platform drivers. However, if the
scanning of the bus has created a PHY device at the same location as
indicated in the board info for a switch, the switch device is not
created, since the address is already busy.

This can be avoided by setting the phy_mask of the mdio bus. This mask
prevents addresses on the bus from being scanned.
v2
--
int -> u32 in platform data structure

Signed-off-by: Andrew Lunn
---
 MAINTAINERS                             |  1 +
 drivers/net/phy/mdio-gpio.c             |  5 +
 include/linux/platform_data/mdio-gpio.h | 13 +
 3 files changed, 19 insertions(+)
 create mode 100644 include/linux/platform_data/mdio-gpio.h

diff --git a/MAINTAINERS b/MAINTAINERS
index fb88b6863d10..9d3b899f9ba2 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5610,6 +5610,7 @@ F:	include/linux/of_net.h
 F:	include/linux/phy.h
 F:	include/linux/phy_fixed.h
 F:	include/linux/platform_data/mdio-bcm-unimac.h
+F:	include/linux/platform_data/mdio-gpio.h
 F:	include/trace/events/mdio.h
 F:	include/uapi/linux/mdio.h
 F:	include/uapi/linux/mii.h
diff --git a/drivers/net/phy/mdio-gpio.c b/drivers/net/phy/mdio-gpio.c
index 0fbcedcdf6e2..1e296dd4067a 100644
--- a/drivers/net/phy/mdio-gpio.c
+++ b/drivers/net/phy/mdio-gpio.c
@@ -24,6 +24,7 @@
 #include
 #include
 #include
+#include <linux/platform_data/mdio-gpio.h>
 #include
 #include
 #include
@@ -112,6 +113,7 @@ static struct mii_bus *mdio_gpio_bus_init(struct device *dev,
 					  struct mdio_gpio_info *bitbang,
 					  int bus_id)
 {
+	struct mdio_gpio_platform_data *pdata = dev_get_platdata(dev);
 	struct mii_bus *new_bus;
 
 	bitbang->ctrl.ops = &mdio_gpio_ops;
@@ -128,6 +130,9 @@ static struct mii_bus *mdio_gpio_bus_init(struct device *dev,
 	else
 		strncpy(new_bus->id, "gpio", MII_BUS_ID_SIZE);
 
+	if (pdata)
+		new_bus->phy_mask = pdata->phy_mask;
+
 	dev_set_drvdata(dev, new_bus);
 
 	return new_bus;
diff --git a/include/linux/platform_data/mdio-gpio.h b/include/linux/platform_data/mdio-gpio.h
new file mode 100644
index 000000000000..a5d5ff5e174c
--- /dev/null
+++ b/include/linux/platform_data/mdio-gpio.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * MDIO-GPIO bus platform data structure
+ */
+
+#ifndef __LINUX_MDIO_GPIO_PDATA_H
+#define __LINUX_MDIO_GPIO_PDATA_H
+
+struct mdio_gpio_platform_data {
+	u32 phy_mask;
+};
+
+#endif /* __LINUX_MDIO_GPIO_PDATA_H */
--
2.19.1
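The scan-suppression effect of phy_mask can be sketched as follows. This is an illustrative helper, not the actual mdiobus_register() scan code:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative sketch: bus registration probes each of the 32 possible
 * MDIO addresses unless the corresponding bit is set in the bus's
 * phy_mask. Setting the switch's address bit therefore keeps the scan
 * from wrongly creating a PHY device at that address, leaving it free
 * for the mdio_board_info-driven switch instantiation. */
static bool mdio_addr_is_probed(uint32_t phy_mask, unsigned int addr)
{
    return addr < 32 && !(phy_mask & (1u << addr));
}
```

A platform wanting only the switch at address 4 masked would pass `phy_mask = 1u << 4`; a platform with no PHYs at all would pass `~0`.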
Re: [Patch v2 net-next] call sk_dst_reset when set SO_DONTROUTE
From: yupeng
Date: Wed, 5 Dec 2018 18:56:28 -0800

> after set SO_DONTROUTE to 1, the IP layer should not route packets if
> the dest IP address is not in link scope. But if the socket has cached
> the dst_entry, such packets would be routed until the sk_dst_cache
> expires. So we should clean the sk_dst_cache when a user set
> SO_DONTROUTE option. Below are server/client python scripts which
> could reproduce this issue:
 ...
> Signed-off-by: yupeng

Applied.
Re: [PATCH v2 net-next] neighbor: Improve garbage collection
From: David Ahern
Date: Fri, 7 Dec 2018 12:24:57 -0800

> From: David Ahern
>
> The existing garbage collection algorithm has a number of problems:
 ...
> This patch addresses these problems as follows:
>
> 1. Use of a separate list_head to track entries that can be garbage
>    collected along with a separate counter. PERMANENT entries are not
>    added to this list.
>
>    The gc_thresh parameters are only compared to the new counter, not the
>    total entries in the table. The forced_gc function is updated to only
>    walk this new gc_list looking for entries to evict.
>
> 2. Entries are added to the list head at the tail and removed from the
>    front.
>
> 3. Entries are only evicted if they were last updated more than 5 seconds
>    ago, adhering to the original intent of gc_thresh2.
>
> 4. Forced gc is stopped once the number of gc_entries drops below
>    gc_thresh2.
>
> 5. Since gc checks do not apply to PERMANENT entries, gc levels are skipped
>    when allocating a new neighbor for a PERMANENT entry. By extension this
>    means there are no explicit limits on the number of PERMANENT entries
>    that can be created, but this is no different than FIB entries or FDB
>    entries.
>
> Signed-off-by: David Ahern
> ---
> v2
> - remove on_gc_list boolean in favor of !list_empty
> - fix neigh_alloc to add new entry to tail of list_head

Again, looks great, applied.
Re: [PATCH v2 net-next 0/4] net: aquantia: add RSS configuration
From: Igor Russkikh
Date: Fri, 7 Dec 2018 14:00:09 +

> In this patchset few bugs related to RSS are fixed and RSS table and
> hash key configuration is added.
>
> We also do increase max number of HW rings up to 8.
>
> v2: removed extra arg check

Series applied.
[PATCH v2 net-next] neighbor: Improve garbage collection
From: David Ahern

The existing garbage collection algorithm has a number of problems:

1. The gc algorithm will not evict PERMANENT entries, as those entries
   are managed by userspace, yet the existing algorithm walks the entire
   hash table, which means it always considers PERMANENT entries when
   looking for entries to evict. In some use cases (e.g., EVPN) there
   can be tens of thousands of PERMANENT entries, leading to wasted CPU
   cycles when gc kicks in. As an example, with 32k permanent entries,
   neigh_alloc has been observed taking more than 4 msec per invocation.

2. Currently, when the number of neighbor entries hits gc_thresh2 and
   the last flush for the table was more than 5 seconds ago, gc kicks in
   and walks the entire hash table evicting *all* entries not in
   PERMANENT or REACHABLE state and not marked as externally learned.
   There is no discriminator on when the neigh entry was created or if
   it just moved from REACHABLE to another NUD_VALID state (e.g.,
   NUD_STALE).

   It is possible for entries to be created or for established neighbor
   entries to be moved to STALE (e.g., an external node sends an ARP
   request) right before the 5 second window lapses:

        -----|---x|----|-----
             t-5   t   t+5

   If that happens, those entries are evicted during gc, causing
   unnecessary thrashing on neighbor entries and userspace caches trying
   to track them.

   Further, this contradicts the description of gc_thresh2 which says
   "Entries older than 5 seconds will be cleared". One workaround is to
   make gc_thresh2 == gc_thresh3, but that negates the whole point of
   having separate thresholds.

3. Clearing *all* neigh non-PERMANENT/REACHABLE/externally learned
   entries when gc_thresh2 is exceeded is overkill and contributes to
   thrashing, especially during startup.

This patch addresses these problems as follows:

1. Use of a separate list_head to track entries that can be garbage
   collected along with a separate counter. PERMANENT entries are not
   added to this list.
   The gc_thresh parameters are only compared to the new counter, not
   the total entries in the table. The forced_gc function is updated to
   only walk this new gc_list looking for entries to evict.

2. Entries are added to the list head at the tail and removed from the
   front.

3. Entries are only evicted if they were last updated more than 5
   seconds ago, adhering to the original intent of gc_thresh2.

4. Forced gc is stopped once the number of gc_entries drops below
   gc_thresh2.

5. Since gc checks do not apply to PERMANENT entries, gc levels are
   skipped when allocating a new neighbor for a PERMANENT entry. By
   extension this means there are no explicit limits on the number of
   PERMANENT entries that can be created, but this is no different than
   FIB entries or FDB entries.

Signed-off-by: David Ahern
---
v2
- remove on_gc_list boolean in favor of !list_empty
- fix neigh_alloc to add new entry to tail of list_head

 Documentation/networking/ip-sysctl.txt |   4 +-
 include/net/neighbour.h                |   3 +
 net/core/neighbour.c                   | 119 +++--
 3 files changed, 90 insertions(+), 36 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index af2a69439b93..acdfb5d2bcaa 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -108,8 +108,8 @@ neigh/default/gc_thresh2 - INTEGER
 	Default: 512
 
 neigh/default/gc_thresh3 - INTEGER
-	Maximum number of neighbor entries allowed.  Increase this
-	when using large numbers of interfaces and when communicating
+	Maximum number of non-PERMANENT neighbor entries allowed.  Increase
+	this when using large numbers of interfaces and when communicating
 	with large numbers of directly-connected peers.
 	Default: 1024
 
diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index f58b384aa6c9..6c13072910ab 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -154,6 +154,7 @@ struct neighbour {
 	struct hh_cache		hh;
 	int			(*output)(struct neighbour *, struct sk_buff *);
 	const struct neigh_ops	*ops;
+	struct list_head	gc_list;
 	struct rcu_head		rcu;
 	struct net_device	*dev;
 	u8			primary_key[0];
@@ -214,6 +215,8 @@ struct neigh_table {
 	struct timer_list	proxy_timer;
 	struct sk_buff_head	proxy_queue;
 	atomic_t		entries;
+	atomic_t		gc_entries;
+	struct list_head	gc_list;
 	rwlock_t		lock;
 	unsigned long		last_rand;
 	struct neigh_statistics	__percpu *stats;
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 6d479b5562be..c3b58712e98b 100644
--- a/net/core/neighbour.c
+++
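The forced-gc policy described in points 1-4 can be modeled in userspace as a walk over a FIFO of eviction candidates. This is a hypothetical sketch using an array; the kernel operates on a list_head of struct neighbour, not an array:

```c
#include <stddef.h>

/* Illustrative model only. Candidates sit oldest-first on a FIFO
 * (entries are added at the tail); PERMANENT entries never appear here.
 * Forced gc walks from the front, evicts only entries last updated more
 * than 5 seconds ago, and stops once the count drops below gc_thresh2. */
struct gc_entry {
    long updated;               /* time of last update, in seconds */
};

static size_t forced_gc(const struct gc_entry *list, size_t n, long now,
                        size_t gc_thresh2)
{
    size_t kept = n, i;

    for (i = 0; i < n; i++) {
        if (kept < gc_thresh2)
            break;              /* enough reclaimed; stop forced gc */
        if (now - list[i].updated > 5)
            kept--;             /* older than 5 seconds: evict */
    }
    return kept;                /* entries remaining on the gc list */
}
```

Note how a freshly updated entry (e.g. moved to STALE by an incoming ARP just before the window lapses) survives, which is exactly the thrashing case the patch fixes.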
RE: [PATCH v2 net-next 0/8] dpaa2-eth: Introduce XDP support
> -----Original Message-----
> From: Ilias Apalodimas
> Sent: Friday, December 7, 2018 7:52 PM
> To: Ioana Ciocoi Radulescu
> Cc: Jesper Dangaard Brouer; netdev@vger.kernel.org; da...@davemloft.net;
> Ioana Ciornei; dsah...@gmail.com; Camelia Alexandra Groza
> Subject: Re: [PATCH v2 net-next 0/8] dpaa2-eth: Introduce XDP support
>
> Hi Ioana,
>
> > > I only did a quick grep around the driver so i might be missing
> > > something, but i can only see allocations via napi_alloc_frag(). XDP
> > > requires pages (either a single page per packet or a driver that does
> > > the page management of its own and fits 2 frames in a single page,
> > > assuming 4kb pages).
> > > Am i missing something on the driver?
> >
> > No, I guess I'm the one missing stuff, I didn't realise single page per
> > packet is a hard requirement for XDP. Could you point me to more info
> > on this?
>
> Well, if you don't have to use 64kb pages you can use the page_pool API
> (only used from mlx5 atm) and get the xdp recycling for free. The memory
> 'waste' for 4kb pages isn't too much if the platforms the driver sits on
> have decent amounts of memory (and the number of descriptors used is not
> too high).
> We still have work in progress with Jesper (just posted an RFC) with
> improvements on the API.
> Using it is fairly straightforward. This is a patchset on Marvell's
> mvneta driver with the API changes needed:
> https://www.spinics.net/lists/netdev/msg538285.html
>
> If you need 64kb pages you would have to introduce page recycling and
> sharing like the intel/mlx drivers in your driver.

Thanks a lot for the info, will look into this. Do you have any pointers
as to why the full page restriction exists in the first place? Sorry if
it's a dumb question, but I haven't found details on this and I'd really
like to understand it.

Thanks,
Ioana
Re: [PATCH v2 net-next 0/8] dpaa2-eth: Introduce XDP support
Hi Ioana,

> > I only did a quick grep around the driver so i might be missing
> > something, but i can only see allocations via napi_alloc_frag(). XDP
> > requires pages (either a single page per packet or a driver that does
> > the page management of its own and fits 2 frames in a single page,
> > assuming 4kb pages).
> > Am i missing something on the driver?
>
> No, I guess I'm the one missing stuff, I didn't realise single page per
> packet is a hard requirement for XDP. Could you point me to more info on
> this?

Well, if you don't have to use 64kb pages you can use the page_pool API
(only used from mlx5 atm) and get the xdp recycling for free. The memory
'waste' for 4kb pages isn't too much if the platforms the driver sits on
have decent amounts of memory (and the number of descriptors used is not
too high).
We still have work in progress with Jesper (just posted an RFC) with
improvements on the API.
Using it is fairly straightforward. This is a patchset on Marvell's mvneta
driver with the API changes needed:
https://www.spinics.net/lists/netdev/msg538285.html

If you need 64kb pages you would have to introduce page recycling and
sharing like the intel/mlx drivers in your driver.

/Ilias
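The page-budget arithmetic behind the "single page per packet / 2 frames in a single page" remark can be sketched like this. The headroom and shared-info sizes are typical 64-bit values assumed for the sketch, not taken from any particular driver:

```c
/* Illustrative arithmetic only. An XDP-capable buffer reserves headroom
 * for the XDP program (conventionally 256 bytes) and tailroom for
 * struct skb_shared_info (roughly 320 bytes on 64-bit). The usable
 * frame size is whatever remains of the page (or page half, if a driver
 * splits a 4kb page into two buffers). */
#define XDP_HEADROOM  256
#define SHINFO_SZ     320   /* approximate */

static int max_frame(int buf_sz)
{
    return buf_sz - XDP_HEADROOM - SHINFO_SZ;
}
```

A whole 4096-byte page leaves ample room for a 1500-byte MTU frame, while splitting the page in two leaves only ~1472 bytes per buffer, which is why two-frames-per-page schemes need care around headroom and MTU.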
RE: [PATCH v2 net-next 0/8] dpaa2-eth: Introduce XDP support
> -----Original Message-----
> From: Ilias Apalodimas
> Sent: Friday, December 7, 2018 7:20 PM
> To: Ioana Ciocoi Radulescu
> Cc: Jesper Dangaard Brouer; netdev@vger.kernel.org; da...@davemloft.net;
> Ioana Ciornei; dsah...@gmail.com; Camelia Alexandra Groza
> Subject: Re: [PATCH v2 net-next 0/8] dpaa2-eth: Introduce XDP support
>
> Hi Ioana,
>
> > > > Add support for XDP programs. Only XDP_PASS, XDP_DROP and XDP_TX
> > > > actions are supported for now. Frame header changes are also
> > > > allowed.
>
> I only did a quick grep around the driver so i might be missing
> something, but i can only see allocations via napi_alloc_frag(). XDP
> requires pages (either a single page per packet or a driver that does
> the page management of its own and fits 2 frames in a single page,
> assuming 4kb pages).
> Am i missing something on the driver?

No, I guess I'm the one missing stuff, I didn't realise single page per
packet is a hard requirement for XDP. Could you point me to more info on
this?

Thanks,
Ioana

> > > Do you have any XDP performance benchmarks on this hardware?
> >
> > We have some preliminary perf data that doesn't look great,
> > but we hope to improve it :)
>
> As Jesper said we are doing similar work on a cortex a-53 and plan to
> work on a-72 as well. We might be able to help out.
>
> /Ilias
Re: [PATCH v2 net-next 0/8] dpaa2-eth: Introduce XDP support
Hi Ioana,

> > > Add support for XDP programs. Only XDP_PASS, XDP_DROP and XDP_TX
> > > actions are supported for now. Frame header changes are also
> > > allowed.

I only did a quick grep around the driver so i might be missing something,
but i can only see allocations via napi_alloc_frag(). XDP requires pages
(either a single page per packet or a driver that does the page management
of its own and fits 2 frames in a single page, assuming 4kb pages).
Am i missing something on the driver?

> > Do you have any XDP performance benchmarks on this hardware?
>
> We have some preliminary perf data that doesn't look great,
> but we hope to improve it :)

As Jesper said, we are doing similar work on a Cortex A53 and plan to work
on the A72 as well. We might be able to help out.

/Ilias
RE: [PATCH v2 net-next 0/8] dpaa2-eth: Introduce XDP support
> -----Original Message-----
> From: Jesper Dangaard Brouer
> Sent: Wednesday, December 5, 2018 5:45 PM
> To: Ioana Ciocoi Radulescu
> Cc: bro...@redhat.com; netdev@vger.kernel.org; da...@davemloft.net;
> Ioana Ciornei; dsah...@gmail.com; Camelia Alexandra Groza; Ilias
> Apalodimas
> Subject: Re: [PATCH v2 net-next 0/8] dpaa2-eth: Introduce XDP support
>
> On Mon, 26 Nov 2018 16:27:28 +
> Ioana Ciocoi Radulescu wrote:
>
> > Add support for XDP programs. Only XDP_PASS, XDP_DROP and XDP_TX
> > actions are supported for now. Frame header changes are also
> > allowed.
>
> Do you have any XDP performance benchmarks on this hardware?

We have some preliminary perf data that doesn't look great, but we hope to
improve it :)

On a LS2088A with A72 cores @2GHz (numbers in Mpps):

                               1 core    8 cores
  ----------------------------------------------------
  XDP_DROP (no touching data)   5.68      29.6 (linerate)
  XDP_DROP (xdp1 sample)        3.46      25.18
  XDP_TX   (xdp2 sample)        1.71      13.26

For comparison, plain IP forwarding through the stack is currently around
0.5Mpps (1 core) / 3.8Mpps (8 cores).

> Also what boards (and arch's) are using this dpaa2-eth driver?

Currently supported: LS2088A, LS1088A, soon LX2160A (all with ARM64
cores).

> Any devel board I can buy?

I should have an answer for this early next week and will get back to you.

Thanks,
Ioana

> p.s.
> Ilias and I are coding up page_pool and XDP support for the Marvell
> mvneta driver, which is avail on a number of avail boards, see here[1]
>
> [1] https://github.com/xdp-project/xdp-project/blob/master/areas/arm64/arm01_selecting_hardware.org
> --
> Best regards,
> Jesper Dangaard Brouer
> MSc.CS, Principal Kernel Engineer at Red Hat
> LinkedIn: http://www.linkedin.com/in/brouer
[PATCH v2 net-next 3/4] net: aquantia: fix initialization of RSS table
From: Dmitry Bogdanov

Now the RSS indirection table is initialized before setting up the number
of hw queues; consequently the table may be filled with non-existing
queues. This patch moves the initialization to after the number of hw
queues is known.

Signed-off-by: Dmitry Bogdanov
Signed-off-by: Igor Russkikh
---
 drivers/net/ethernet/aquantia/atlantic/aq_nic.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_nic.c b/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
index d617289d95f7..0147c037ca96 100644
--- a/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
@@ -84,8 +84,6 @@ void aq_nic_cfg_start(struct aq_nic_s *self)
 
 	cfg->is_lro = AQ_CFG_IS_LRO_DEF;
 
-	aq_nic_rss_init(self, cfg->num_rss_queues);
-
 	/*descriptors */
 	cfg->rxds = min(cfg->aq_hw_caps->rxds_max, AQ_CFG_RXDS_DEF);
 	cfg->txds = min(cfg->aq_hw_caps->txds_max, AQ_CFG_TXDS_DEF);
@@ -106,6 +104,8 @@ void aq_nic_cfg_start(struct aq_nic_s *self)
 
 	cfg->num_rss_queues = min(cfg->vecs, AQ_CFG_NUM_RSS_QUEUES_DEF);
 
+	aq_nic_rss_init(self, cfg->num_rss_queues);
+
 	cfg->irq_type = aq_pci_func_get_irq_type(self);
 
 	if ((cfg->irq_type == AQ_HW_IRQ_LEGACY) ||
--
2.17.1
[PATCH v2 net-next 1/4] net: aquantia: fix RSS table and key sizes
From: Dmitry Bogdanov

Set RSS indirection table and RSS hash key sizes to their real size.

Signed-off-by: Dmitry Bogdanov
Signed-off-by: Igor Russkikh
---
 drivers/net/ethernet/aquantia/atlantic/aq_cfg.h | 4 ++--
 drivers/net/ethernet/aquantia/atlantic/aq_nic.c | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_cfg.h b/drivers/net/ethernet/aquantia/atlantic/aq_cfg.h
index 91eb8910b1c9..90a0e1d0d622 100644
--- a/drivers/net/ethernet/aquantia/atlantic/aq_cfg.h
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_cfg.h
@@ -42,8 +42,8 @@
 #define AQ_CFG_IS_LRO_DEF           1U
 
 /* RSS */
-#define AQ_CFG_RSS_INDIRECTION_TABLE_MAX  128U
-#define AQ_CFG_RSS_HASHKEY_SIZE           320U
+#define AQ_CFG_RSS_INDIRECTION_TABLE_MAX  64U
+#define AQ_CFG_RSS_HASHKEY_SIZE           40U
 
 #define AQ_CFG_IS_RSS_DEF           1U
 #define AQ_CFG_NUM_RSS_QUEUES_DEF   AQ_CFG_VECS_DEF
diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_nic.c b/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
index 279ea58f4a9e..d617289d95f7 100644
--- a/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
@@ -44,7 +44,7 @@ static void aq_nic_rss_init(struct aq_nic_s *self, unsigned int num_rss_queues)
 	struct aq_rss_parameters *rss_params = &cfg->aq_rss;
 	int i = 0;
 
-	static u8 rss_key[40] = {
+	static u8 rss_key[AQ_CFG_RSS_HASHKEY_SIZE] = {
 		0x1e, 0xad, 0x71, 0x87, 0x65, 0xfc, 0x26, 0x7d,
 		0x0d, 0x45, 0x67, 0x74, 0xcd, 0x06, 0x1a, 0x18,
 		0xb6, 0xc1, 0xf0, 0xc7, 0xbb, 0x18, 0xbe, 0xf8,
--
2.17.1
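How an RSS indirection table of the corrected size is consumed can be sketched as follows. This is illustrative, not driver code:

```c
#include <stdint.h>

/* AQ_CFG_RSS_INDIRECTION_TABLE_MAX after this patch */
#define RSS_INDIR_SZ 64

/* Illustrative sketch: the low bits of a packet's RSS (Toeplitz) hash
 * index the indirection table, and the entry selects the receive queue.
 * A table declared larger than the hardware's real one (the old 128U)
 * just means initialization writes entries the device never uses. */
static uint32_t rss_queue_for_hash(const uint32_t *indir, uint32_t hash)
{
    return indir[hash % RSS_INDIR_SZ];
}
```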
[PATCH v2 net-next 4/4] net: aquantia: add support of RSS configuration
From: Dmitry Bogdanov

Add support for configuration of the RSS hash key and RSS indirection
table.

Signed-off-by: Dmitry Bogdanov
Signed-off-by: Igor Russkikh
---
 .../ethernet/aquantia/atlantic/aq_ethtool.c | 36 +++
 1 file changed, 36 insertions(+)

diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_ethtool.c b/drivers/net/ethernet/aquantia/atlantic/aq_ethtool.c
index a5fd71692c8b..fcbfecf41c45 100644
--- a/drivers/net/ethernet/aquantia/atlantic/aq_ethtool.c
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_ethtool.c
@@ -202,6 +202,41 @@ static int aq_ethtool_get_rss(struct net_device *ndev, u32 *indir, u8 *key,
 	return 0;
 }
 
+static int aq_ethtool_set_rss(struct net_device *netdev, const u32 *indir,
+			      const u8 *key, const u8 hfunc)
+{
+	struct aq_nic_s *aq_nic = netdev_priv(netdev);
+	struct aq_nic_cfg_s *cfg;
+	unsigned int i = 0U;
+	u32 rss_entries;
+	int err = 0;
+
+	cfg = aq_nic_get_cfg(aq_nic);
+	rss_entries = cfg->aq_rss.indirection_table_size;
+
+	/* We do not allow change in unsupported parameters */
+	if (hfunc != ETH_RSS_HASH_NO_CHANGE && hfunc != ETH_RSS_HASH_TOP)
+		return -EOPNOTSUPP;
+	/* Fill out the redirection table */
+	if (indir)
+		for (i = 0; i < rss_entries; i++)
+			cfg->aq_rss.indirection_table[i] = indir[i];
+
+	/* Fill out the rss hash key */
+	if (key) {
+		memcpy(cfg->aq_rss.hash_secret_key, key,
+		       sizeof(cfg->aq_rss.hash_secret_key));
+		err = aq_nic->aq_hw_ops->hw_rss_hash_set(aq_nic->aq_hw,
+							 &cfg->aq_rss);
+		if (err)
+			return err;
+	}
+
+	err = aq_nic->aq_hw_ops->hw_rss_set(aq_nic->aq_hw, &cfg->aq_rss);
+
+	return err;
+}
+
 static int aq_ethtool_get_rxnfc(struct net_device *ndev,
 				struct ethtool_rxnfc *cmd,
 				u32 *rule_locs)
@@ -549,6 +584,7 @@ const struct ethtool_ops aq_ethtool_ops = {
 	.set_pauseparam  = aq_ethtool_set_pauseparam,
 	.get_rxfh_key_size = aq_ethtool_get_rss_key_size,
 	.get_rxfh        = aq_ethtool_get_rss,
+	.set_rxfh        = aq_ethtool_set_rss,
 	.get_rxnfc       = aq_ethtool_get_rxnfc,
 	.set_rxnfc       = aq_ethtool_set_rxnfc,
 	.get_sset_count  = aq_ethtool_get_sset_count,
--
2.17.1
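The shape of the table a user would install through this set_rxfh hook, e.g. with `ethtool -X <dev> equal 8`, can be sketched with a hypothetical helper (not driver code):

```c
#include <stdint.h>

/* Illustrative sketch of the "equal spread" indirection table ethtool
 * builds before handing it to the driver's set_rxfh: entry i maps to
 * queue i % num_queues, so hashes are distributed evenly. */
static void rss_fill_equal(uint32_t *indir, unsigned int entries,
                           unsigned int num_queues)
{
    unsigned int i;

    for (i = 0; i < entries; i++)
        indir[i] = i % num_queues;
}
```

With the 64-entry table and 8 queues from this series, each queue receives exactly 8 table slots.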
[PATCH v2 net-next 2/4] net: aquantia: increase max number of hw queues
From: Dmitry Bogdanov

Increase the upper limit of the hw queues up to 8. This makes RSS better
on multicore CPUs. This is the maximum the AQC hardware supports in one
traffic class. The actual value is still limited by the number of
available cpu cores.

Signed-off-by: Dmitry Bogdanov
Signed-off-by: Igor Russkikh
---
 drivers/net/ethernet/aquantia/atlantic/aq_cfg.h           | 2 +-
 drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_cfg.h b/drivers/net/ethernet/aquantia/atlantic/aq_cfg.h
index 90a0e1d0d622..3944ce7f0870 100644
--- a/drivers/net/ethernet/aquantia/atlantic/aq_cfg.h
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_cfg.h
@@ -12,7 +12,7 @@
 #ifndef AQ_CFG_H
 #define AQ_CFG_H
 
-#define AQ_CFG_VECS_DEF	4U
+#define AQ_CFG_VECS_DEF	8U
 #define AQ_CFG_TCS_DEF	1U
 
 #define AQ_CFG_TXDS_DEF	4096U
diff --git a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c
index 6af7d7f0cdca..08596a7a6486 100644
--- a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c
+++ b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c
@@ -21,7 +21,7 @@
 
 #define DEFAULT_B0_BOARD_BASIC_CAPABILITIES \
 	.is_64_dma = true,		  \
-	.msix_irqs = 4U,		  \
+	.msix_irqs = 8U,		  \
 	.irq_mask = ~0U,		  \
 	.vecs = HW_ATL_B0_RSS_MAX,	  \
 	.tcs = HW_ATL_B0_TC_MAX,	  \
--
2.17.1
[PATCH v2 net-next 0/4] net: aquantia: add RSS configuration
In this patchset a few bugs related to RSS are fixed, and RSS table and
hash key configuration is added.

We also increase the max number of HW rings up to 8.

v2: removed extra arg check

Dmitry Bogdanov (4):
  net: aquantia: fix RSS table and key sizes
  net: aquantia: increase max number of hw queues
  net: aquantia: fix initialization of RSS table
  net: aquantia: add support of RSS configuration

 .../net/ethernet/aquantia/atlantic/aq_cfg.h   |  6 ++--
 .../ethernet/aquantia/atlantic/aq_ethtool.c   | 36 +++
 .../net/ethernet/aquantia/atlantic/aq_nic.c   |  6 ++--
 .../aquantia/atlantic/hw_atl/hw_atl_b0.c      |  2 +-
 4 files changed, 43 insertions(+), 7 deletions(-)

--
2.17.1
Re: [PATCH v2 net-next 1/1] net: netem: use a list in addition to rbtree
From: Peter Oskolkov
Date: Tue, 4 Dec 2018 11:55:56 -0800

> When testing high-bandwidth TCP streams with large windows,
> high latency, and low jitter, netem consumes a lot of CPU cycles
> doing rbtree rebalancing.
>
> This patch uses a linear list/queue in addition to the rbtree:
> if an incoming packet is past the tail of the linear queue, it is
> added there, otherwise it is inserted into the rbtree.
>
> Without this patch, perf shows netem_enqueue, netem_dequeue,
> and rb_* functions among the top offenders. With this patch,
> only netem_enqueue is noticeable if jitter is low/absent.
>
> Suggested-by: Eric Dumazet
> Signed-off-by: Peter Oskolkov

Applied, thanks.
[Patch v2 net-next] call sk_dst_reset when set SO_DONTROUTE
After setting SO_DONTROUTE to 1, the IP layer should not route packets if
the destination IP address is not in link scope. But if the socket has
cached the dst_entry, such packets would be routed until the sk_dst_cache
expires. So we should clean the sk_dst_cache when a user sets the
SO_DONTROUTE option. Below are server/client python scripts which can
reproduce this issue:

server side code:
==========================================================================
import socket
import struct
import time

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(('0.0.0.0', 9000))
s.listen(1)
sock, addr = s.accept()
sock.setsockopt(socket.SOL_SOCKET, socket.SO_DONTROUTE, struct.pack('i', 1))
while True:
    sock.send(b'foo')
    time.sleep(1)
==========================================================================

client side code:
==========================================================================
import socket
import time

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(('server_address', 9000))
while True:
    data = s.recv(1024)
    print(data)
==========================================================================

Signed-off-by: yupeng
---
 net/core/sock.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/core/sock.c b/net/core/sock.c
index f5bb89785e47..f00902c532cc 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -700,6 +700,7 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
 		break;
 	case SO_DONTROUTE:
 		sock_valbool_flag(sk, SOCK_LOCALROUTE, valbool);
+		sk_dst_reset(sk);
 		break;
 	case SO_BROADCAST:
 		sock_valbool_flag(sk, SOCK_BROADCAST, valbool);
--
2.17.1
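A minimal C counterpart of the reproducer's setsockopt() step, setting the option and reading it back. With this patch applied, the set also drops any cached dst_entry in the kernel, so the next send is re-routed:

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/* Enable SO_DONTROUTE on a socket and read it back via getsockopt().
 * Returns 1 if the option took effect, -1 on error. */
static int set_dontroute(int fd)
{
    int one = 1, val = 0;
    socklen_t len = sizeof(val);

    if (setsockopt(fd, SOL_SOCKET, SO_DONTROUTE, &one, sizeof(one)) < 0)
        return -1;
    if (getsockopt(fd, SOL_SOCKET, SO_DONTROUTE, &val, &len) < 0)
        return -1;
    return val;
}
```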
Re: [PATCH v2 net-next 0/8] dpaa2-eth: Introduce XDP support
On Mon, 26 Nov 2018 16:27:28 + Ioana Ciocoi Radulescu wrote:
> Add support for XDP programs. Only XDP_PASS, XDP_DROP and XDP_TX
> actions are supported for now. Frame header changes are also
> allowed.

Do you have any XDP performance benchmarks on this hardware?

Also, what boards (and arches) use this dpaa2-eth driver? Any devel board I can buy?

p.s. Ilias and I are coding up page_pool and XDP support for the Marvell mvneta driver, which is available on a number of boards, see here[1]

[1] https://github.com/xdp-project/xdp-project/blob/master/areas/arm64/arm01_selecting_hardware.org

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
Re: [PATCH v2 net-next] ip6_tunnel: Adding support of mapping rules for MAP-E tunnel
From: Felix Jia
Date: Mon, 3 Dec 2018 16:39:31 +1300

> +int
> +ip6_get_addrport(struct iphdr *iph, __be32 *saddr4, __be32 *daddr4,
> +		 __be16 *sport4, __be16 *dport4, __u8 *proto, int *icmperr)
> +{

This looks like something the flow dissector can do already, please look into utilizing that common piece of infrastructure instead of reimplementing it.

> +	u8 *ptr;
> +	struct iphdr *icmpiph = NULL;
> +	struct tcphdr *tcph, *icmptcph;
> +	struct udphdr *udph, *icmpudph;
> +	struct icmphdr *icmph, *icmpicmph;

Please always order local variables from longest to shortest line. Please audit your entire submission for this problem.

> +static struct ip6_tnl_rule *ip6_tnl_rule_find(struct net_device *dev,
> +					      __be32 _dst)
> +{
> +	u32 dst = ntohl(_dst);
> +	struct ip6_rule_list *pos = NULL;
> +	struct ip6_tnl *t = netdev_priv(dev);
> +
> +	list_for_each_entry(pos, &t->rules.list, list) {
> +		int mask =
> +			0xffffffff ^ ((1 << (32 - pos->data.ipv4_prefixlen)) - 1);
> +		if ((dst & mask) == ntohl(pos->data.ipv4_subnet.s_addr))
> +			return &pos->data;
> +	}
> +	return NULL;
> +}

How will this scale with large numbers of rules? This rule facility seems to be designed in a way that sophisticated (at least as fast as "O(log N)") lookup schemes aren't even possible, and, even worse, the ordering matters.
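To illustrate the scaling concern raised above, here is a hedged Python sketch of one alternative to a linear rule walk: bucket rules by IPv4 prefix length, so a lookup costs at most 33 hash probes regardless of rule count, and ordering no longer matters (longest prefix always wins). All names here are illustrative, not from the patch.

```python
def add_rule(table, subnet, plen, data):
    # table maps prefix length -> {masked subnet -> rule data}
    mask = 0xffffffff ^ ((1 << (32 - plen)) - 1) if plen else 0
    table.setdefault(plen, {})[subnet & mask] = data

def lookup(table, dst):
    # probe longest prefix first: cost is bounded by the number of
    # distinct prefix lengths, not the number of rules
    for plen in sorted(table, reverse=True):
        mask = 0xffffffff ^ ((1 << (32 - plen)) - 1) if plen else 0
        hit = table[plen].get(dst & mask)
        if hit is not None:
            return hit
    return None

table = {}
add_rule(table, 0x0A000000, 8, "rule-10/8")       # 10.0.0.0/8
add_rule(table, 0x0A010000, 16, "rule-10.1/16")   # 10.1.0.0/16
print(lookup(table, 0x0A010203))                  # 10.1.2.3 -> rule-10.1/16
print(lookup(table, 0x0A990101))                  # 10.153.1.1 -> rule-10/8
```

The kernel's FIB tries solve the same problem more thoroughly; this sketch only shows why the list walk in the patch is not the only option.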
Re: [PATCH v2 net-next 1/1] net: netem: use a list in addition to rbtree
On 12/04/2018 11:55 AM, Peter Oskolkov wrote: > When testing high-bandwidth TCP streams with large windows, > high latency, and low jitter, netem consumes a lot of CPU cycles > doing rbtree rebalancing. > > This patch uses a linear list/queue in addition to the rbtree: > if an incoming packet is past the tail of the linear queue, it is > added there, otherwise it is inserted into the rbtree. > > Without this patch, perf shows netem_enqueue, netem_dequeue, > and rb_* functions among the top offenders. With this patch, > only netem_enqueue is noticeable if jitter is low/absent. > > Suggested-by: Eric Dumazet > Signed-off-by: Peter Oskolkov > --- Reviewed-by: Eric Dumazet
[PATCH v2 net-next 0/1] net: netem: use a list _and_ rbtree
v2: address style suggestions by Stephen Hemminger. All changes are noop vs v1. Peter Oskolkov (1): net: netem: use a list in addition to rbtree net/sched/sch_netem.c | 89 +-- 1 file changed, 69 insertions(+), 20 deletions(-)
[PATCH v2 net-next 1/1] net: netem: use a list in addition to rbtree
When testing high-bandwidth TCP streams with large windows, high latency, and low jitter, netem consumes a lot of CPU cycles doing rbtree rebalancing.

This patch uses a linear list/queue in addition to the rbtree: if an incoming packet is past the tail of the linear queue, it is added there, otherwise it is inserted into the rbtree.

Without this patch, perf shows netem_enqueue, netem_dequeue, and rb_* functions among the top offenders. With this patch, only netem_enqueue is noticeable if jitter is low/absent.

Suggested-by: Eric Dumazet
Signed-off-by: Peter Oskolkov
---
 net/sched/sch_netem.c | 89 +--
 1 file changed, 69 insertions(+), 20 deletions(-)

diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index 2c38e3d07924..84658f60a872 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -77,6 +77,10 @@ struct netem_sched_data {
 	/* internal t(ime)fifo qdisc uses t_root and sch->limit */
 	struct rb_root t_root;

+	/* a linear queue; reduces rbtree rebalancing when jitter is low */
+	struct sk_buff	*t_head;
+	struct sk_buff	*t_tail;
+
 	/* optional qdisc for classful handling (NULL at netem init) */
 	struct Qdisc	*qdisc;

@@ -369,26 +373,39 @@ static void tfifo_reset(struct Qdisc *sch)
 		rb_erase(&skb->rbnode, &q->t_root);
 		rtnl_kfree_skbs(skb, skb);
 	}
+
+	rtnl_kfree_skbs(q->t_head, q->t_tail);
+	q->t_head = NULL;
+	q->t_tail = NULL;
 }

 static void tfifo_enqueue(struct sk_buff *nskb, struct Qdisc *sch)
 {
 	struct netem_sched_data *q = qdisc_priv(sch);
 	u64 tnext = netem_skb_cb(nskb)->time_to_send;
-	struct rb_node **p = &q->t_root.rb_node, *parent = NULL;

-	while (*p) {
-		struct sk_buff *skb;
-
-		parent = *p;
-		skb = rb_to_skb(parent);
-		if (tnext >= netem_skb_cb(skb)->time_to_send)
-			p = &parent->rb_right;
+	if (!q->t_tail || tnext >= netem_skb_cb(q->t_tail)->time_to_send) {
+		if (q->t_tail)
+			q->t_tail->next = nskb;
 		else
-			p = &parent->rb_left;
+			q->t_head = nskb;
+		q->t_tail = nskb;
+	} else {
+		struct rb_node **p = &q->t_root.rb_node, *parent = NULL;
+
+		while (*p) {
+			struct sk_buff *skb;
+
+			parent = *p;
+			skb = rb_to_skb(parent);
+			if (tnext >= netem_skb_cb(skb)->time_to_send)
+				p = &parent->rb_right;
+			else
+				p = &parent->rb_left;
+		}
+		rb_link_node(&nskb->rbnode, parent, p);
+		rb_insert_color(&nskb->rbnode, &q->t_root);
 	}
-	rb_link_node(&nskb->rbnode, parent, p);
-	rb_insert_color(&nskb->rbnode, &q->t_root);
 	sch->q.qlen++;
 }

@@ -530,9 +547,16 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 				t_skb = skb_rb_last(&q->t_root);
 				t_last = netem_skb_cb(t_skb);
 				if (!last ||
-				    t_last->time_to_send > last->time_to_send) {
+				    t_last->time_to_send > last->time_to_send)
+					last = t_last;
+			}
+			if (q->t_tail) {
+				struct netem_skb_cb *t_last =
+					netem_skb_cb(q->t_tail);
+
+				if (!last ||
+				    t_last->time_to_send > last->time_to_send)
 					last = t_last;
-				}
 			}

 			if (last) {
@@ -611,11 +635,38 @@ static void get_slot_next(struct netem_sched_data *q, u64 now)
 	q->slot.bytes_left = q->slot_config.max_bytes;
 }

+static struct sk_buff *netem_peek(struct netem_sched_data *q)
+{
+	struct sk_buff *skb = skb_rb_first(&q->t_root);
+	u64 t1, t2;
+
+	if (!skb)
+		return q->t_head;
+	if (!q->t_head)
+		return skb;
+
+	t1 = netem_skb_cb(skb)->time_to_send;
+	t2 = netem_skb_cb(q->t_head)->time_to_send;
+	if (t1 < t2)
+		return skb;
+	return q->t_head;
+}
+
+static void netem_erase_head(struct netem_sched_data *q, struct sk_buff *skb)
+{
+	if (skb == q->t_head) {
+		q->t_head = skb->next;
+		if (!q->t_head)
+			q->t_tail = NULL;
+	} else {
+		rb_erase(&skb->rbnode, &q->t_root);
+	}
+}
+
 static struct sk_buff *netem_dequeue(struct Qdisc *sch)
 {
 	struct netem_sched_data *q = qdisc_priv(sch);
 	struct sk_buff *skb;
-	struct rb_node *p;

tfifo_dequeue:
	skb =
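The idea in the patch above can be modeled outside the kernel: in-order packets take an O(1) append to the tail list, and only jittered (out-of-order) packets pay for an ordered insert. A hedged Python sketch, using a bisect-ed list as a stand-in for the rbtree (names are illustrative, not the kernel's):

```python
import bisect
from collections import deque

class TFifo:
    def __init__(self):
        self.q = deque()       # linear fast path (t_head/t_tail in the patch)
        self.tree = []         # stand-in for the rbtree
        self.tree_inserts = 0  # how often the slow path was taken

    def enqueue(self, t):
        # past the tail of the linear queue -> O(1) append
        if not self.q or t >= self.q[-1]:
            self.q.append(t)
        else:
            bisect.insort(self.tree, t)
            self.tree_inserts += 1

    def dequeue(self):
        # peek both structures, deliver the earlier time_to_send
        if self.tree and (not self.q or self.tree[0] < self.q[0]):
            return self.tree.pop(0)
        return self.q.popleft() if self.q else None

f = TFifo()
for t in [1, 2, 5, 4, 6]:   # one out-of-order arrival (4)
    f.enqueue(t)
print(f.tree_inserts)                    # -> 1: only the jittered packet
print([f.dequeue() for _ in range(5)])   # -> [1, 2, 4, 5, 6]
```

With zero jitter every enqueue hits the list, which is exactly the case where the original rbtree-only code wasted cycles on rebalancing.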
[PATCH v2 net-next] ip6_tunnel: Adding support of mapping rules for MAP-E tunnel
From: Blair Steven

Mapping of Address and Port with Encapsulation (MAP-E) is defined in RFC7597, and is an IPv6 transition technology providing interoperability between IPv4 and IPv6 networks. MAP-E uses the encapsulation mode described in RFC2473 (IPv6 Tunneling) to transport IPv4 and IPv6 packets over an IPv6 network. It requires a list of rules for mapping between IPv4 prefix/shared addresses and IPv6 addresses. This patch also supports the mapping rules defined in the draft3 version of the RFC.

Co-developed-by: Felix Jia
Co-developed-by: Sheena Mira-ato
Co-developed-by: Masakazu Asama
Signed-off-by: Blair Steven
Signed-off-by: Felix Jia
Signed-off-by: Sheena Mira-ato
Signed-off-by: Masakazu Asama
---
 include/net/ip6_tunnel.h       |  18 ++
 include/uapi/linux/if_tunnel.h |  18 ++
 net/ipv6/ip6_tunnel.c          | 490 -
 3 files changed, 524 insertions(+), 2 deletions(-)

diff --git a/include/net/ip6_tunnel.h b/include/net/ip6_tunnel.h
index 69b4bcf880c9..ed715ee8d87c 100644
--- a/include/net/ip6_tunnel.h
+++ b/include/net/ip6_tunnel.h
@@ -18,6 +18,16 @@
 /* determine capability on a per-packet basis */
 #define IP6_TNL_F_CAP_PER_PACKET 0x4

+struct ip6_tnl_rule {
+	struct in_addr ipv4_subnet;
+	struct in6_addr ipv6_subnet;
+	u8 version;
+	u8 ea_length;
+	u8 psid_offset;
+	u8 ipv4_prefixlen;
+	u8 ipv6_prefixlen;
+};
+
 struct __ip6_tnl_parm {
 	char name[IFNAMSIZ];	/* name of tunnel device */
 	int link;		/* ifindex of underlying L2 interface */
@@ -40,6 +50,13 @@ struct __ip6_tnl_parm {
 	__u8 erspan_ver;	/* ERSPAN version */
 	__u8 dir;		/* direction */
 	__u16 hwid;		/* hwid */
+	__u8 rule_action;
+	struct ip6_tnl_rule rule;
+};
+
+struct ip6_rule_list {
+	struct list_head list;
+	struct ip6_tnl_rule data;
 };

 /* IPv6 tunnel */
@@ -63,6 +80,7 @@ struct ip6_tnl {
 	int encap_hlen;		/* Encap header length (FOU,GUE) */
 	struct ip_tunnel_encap encap;
 	int mlink;
+	struct ip6_rule_list rules;
 };

 struct ip6_tnl_encap_ops {
diff --git a/include/uapi/linux/if_tunnel.h b/include/uapi/linux/if_tunnel.h
index 1b3d148c4560..7cb09c8c4d8a 100644
--- a/include/uapi/linux/if_tunnel.h
+++ b/include/uapi/linux/if_tunnel.h
@@ -77,10 +77,28 @@ enum {
 	IFLA_IPTUN_ENCAP_DPORT,
 	IFLA_IPTUN_COLLECT_METADATA,
 	IFLA_IPTUN_FWMARK,
+	IFLA_IPTUN_RULE_VERSION,
+	IFLA_IPTUN_RULE_ACTION,
+	IFLA_IPTUN_RULE_IPV6_PREFIX,
+	IFLA_IPTUN_RULE_IPV6_PREFIXLEN,
+	IFLA_IPTUN_RULE_IPV4_PREFIX,
+	IFLA_IPTUN_RULE_IPV4_PREFIXLEN,
+	IFLA_IPTUN_RULE_EA_LENGTH,
+	IFLA_IPTUN_RULE_PSID_OFFSET,
 	__IFLA_IPTUN_MAX,
 };
 #define IFLA_IPTUN_MAX (__IFLA_IPTUN_MAX - 1)

+enum map_rule_versions {
+	MAP_VERSION_RFC,
+	MAP_VERSION_DRAFT3,
+};
+
+enum tunnel_rule_actions {
+	TUNNEL_RULE_ACTION_ADD = 1,
+	TUNNEL_RULE_ACTION_DELETE = 2,
+};
+
 enum tunnel_encap_types {
 	TUNNEL_ENCAP_NONE,
 	TUNNEL_ENCAP_FOU,
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index a9d06d4dd057..3bd7a5045f28 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -20,6 +20,8 @@

 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

+#include
+#include
 #include
 #include
 #include
@@ -32,6 +34,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -124,6 +127,226 @@ static struct net_device_stats *ip6_get_stats(struct net_device *dev)
 	return &dev->stats;
 }

+int
+ip6_get_addrport(struct iphdr *iph, __be32 *saddr4, __be32 *daddr4,
+		 __be16 *sport4, __be16 *dport4, __u8 *proto, int *icmperr)
+{
+	u8 *ptr;
+	struct iphdr *icmpiph = NULL;
+	struct tcphdr *tcph, *icmptcph;
+	struct udphdr *udph, *icmpudph;
+	struct icmphdr *icmph, *icmpicmph;
+
+	*icmperr = 0;
+	*saddr4 = iph->saddr;
+	*daddr4 = iph->daddr;
+	ptr = (u8 *)iph;
+	ptr += iph->ihl * 4;
+	switch (iph->protocol) {
+	case IPPROTO_TCP:
+		*proto = IPPROTO_TCP;
+		tcph = (struct tcphdr *)ptr;
+		*sport4 = tcph->source;
+		*dport4 = tcph->dest;
+		break;
+	case IPPROTO_UDP:
+		*proto = IPPROTO_UDP;
+		udph = (struct udphdr *)ptr;
+		*sport4 = udph->source;
+		*dport4 = udph->dest;
+		break;
+	case IPPROTO_ICMP:
+		*proto = IPPROTO_ICMP;
+		icmph = (struct icmphdr *)ptr;
+		switch (icmph->type) {
+		case ICMP_DEST_UNREACH:
+		case ICMP_SOURCE_QUENCH:
+		case ICMP_TIME_EXCEEDED:
+		case ICMP_PARAMETERPROB:
+			*icmperr = 1;
+			ptr = (u8
Re: [PATCH v2 net-next] cxgb4: number of VFs supported is not always 16
From: Ganesh Goudar Date: Tue, 27 Nov 2018 14:59:06 +0530 > Total number of VFs supported by PF is used to determine the last > byte of VF's mac address. Number of VFs supported is not always > 16, use the variable nvfs to get the number of VFs supported > rather than hard coding it to 16. > > Signed-off-by: Casey Leedom > Signed-off-by: Ganesh Goudar > --- > V2: Fixes typo in commit message Applied.
Re: [PATCH v2 net-next 0/8] dpaa2-eth: Introduce XDP support
From: Ioana Ciocoi Radulescu
Date: Wed, 28 Nov 2018 09:18:28 +

> They apply cleanly for me.

I figured out what happened. The patches were mis-ordered (specifically patches #3 and #4) when I added them to the patchwork bundle, and that is what causes them to fail.

Series applied, thanks!
Re: [PATCH v2 net-next 1/8] dpaa2-eth: Add basic XDP support
On 11/26/18 9:27 AM, Ioana Ciocoi Radulescu wrote: > We keep one XDP program reference per channel. The only actions > supported for now are XDP_DROP and XDP_PASS. > > Until now we didn't enforce a maximum size for Rx frames based > on MTU value. Change that, since for XDP mode we must ensure no > scatter-gather frames can be received. > > Signed-off-by: Ioana Radulescu > --- > v2: - xdp packets count towards the rx packets and bytes counters > - add warning message with the maximum supported MTU value for XDP > > drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c | 189 > ++- > drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h | 6 + > 2 files changed, 194 insertions(+), 1 deletion(-) > Reviewed-by: David Ahern
Re: [PATCH v2 net-next 8/8] dpaa2-eth: Add xdp counters
On 11/26/18 9:27 AM, Ioana Ciocoi Radulescu wrote: > Add counters for xdp processed frames to the channel statistics. > > Signed-off-by: Ioana Radulescu > --- > v2: no changes > Reviewed-by: David Ahern
Re: [PATCH v2 net-next 0/8] dpaa2-eth: Introduce XDP support
On 11/28/18 2:18 AM, Ioana Ciocoi Radulescu wrote:
>> -----Original Message-----
>> From: David Miller
>> Sent: Wednesday, November 28, 2018 2:25 AM
>> To: Ioana Ciocoi Radulescu
>> Cc: netdev@vger.kernel.org; Ioana Ciornei ; dsah...@gmail.com; Camelia Alexandra Groza
>> Subject: Re: [PATCH v2 net-next 0/8] dpaa2-eth: Introduce XDP support
>>
>> From: Ioana Ciocoi Radulescu
>> Date: Mon, 26 Nov 2018 16:27:28 +
>>
>>> Add support for XDP programs. Only XDP_PASS, XDP_DROP and XDP_TX
>>> actions are supported for now. Frame header changes are also
>>> allowed.
>>>
>>> v2: - count the XDP packets in the rx/tx interface stats
>>>     - add message with the maximum supported MTU value for XDP
>>
>> This doesn't apply cleanly to net-next.
>>
>> Could you please do a quick respin so I can apply this?
>
> They apply cleanly for me. To double-check, I've downloaded the mbox
> patches from patchwork and applied them on net-next.git, master branch
> (commit 86d1d8b72c).
> I'm obviously doing something wrong, but I don't know what.

Same here. All patches applied cleanly to net-next.
RE: [PATCH v2 net-next 0/8] dpaa2-eth: Introduce XDP support
> -----Original Message-----
> From: David Miller
> Sent: Wednesday, November 28, 2018 2:25 AM
> To: Ioana Ciocoi Radulescu
> Cc: netdev@vger.kernel.org; Ioana Ciornei ; dsah...@gmail.com; Camelia Alexandra Groza
> Subject: Re: [PATCH v2 net-next 0/8] dpaa2-eth: Introduce XDP support
>
> From: Ioana Ciocoi Radulescu
> Date: Mon, 26 Nov 2018 16:27:28 +
>
>> Add support for XDP programs. Only XDP_PASS, XDP_DROP and XDP_TX
>> actions are supported for now. Frame header changes are also
>> allowed.
>>
>> v2: - count the XDP packets in the rx/tx interface stats
>>     - add message with the maximum supported MTU value for XDP
>
> This doesn't apply cleanly to net-next.
>
> Could you please do a quick respin so I can apply this?

They apply cleanly for me. To double-check, I've downloaded the mbox patches from patchwork and applied them on net-next.git, master branch (commit 86d1d8b72c). I'm obviously doing something wrong, but I don't know what.

Thanks,
Ioana
Re: [PATCH v2 net-next] tcp: remove hdrlen argument from tcp_queue_rcv()
From: Eric Dumazet Date: Mon, 26 Nov 2018 14:49:12 -0800 > Only one caller needs to pull TCP headers, so lets > move __skb_pull() to the caller side. > > Signed-off-by: Eric Dumazet > Acked-by: Yuchung Cheng > --- > v2: sent as a standalone patch. Applied, thanks Eric.
Re: [PATCH v2 net-next 0/8] dpaa2-eth: Introduce XDP support
From: Ioana Ciocoi Radulescu
Date: Mon, 26 Nov 2018 16:27:28 +

> Add support for XDP programs. Only XDP_PASS, XDP_DROP and XDP_TX
> actions are supported for now. Frame header changes are also
> allowed.
>
> v2: - count the XDP packets in the rx/tx interface stats
>     - add message with the maximum supported MTU value for XDP

This doesn't apply cleanly to net-next.

Could you please do a quick respin so I can apply this?

Thanks!
Re: [PATCH v2 net-next 4/4] tcp: implement coalescing on backlog queue
On Tue, Nov 27, 2018 at 2:13 PM Eric Dumazet wrote:
>
> On 11/27/2018 01:58 PM, Neal Cardwell wrote:
>
>> I wonder if technically perhaps the logic should skip coalescing if
>> the tail or skb has the TCP_FLAG_URG bit set? It seems if skbs are
>> coalesced, and some have urgent data and some do not, then the
>> TCP_FLAG_URG bit will be accumulated into the tail header, but there
>> will be no way to ensure the correct urgent offsets for the one or
>> more skbs with urgent data are passed along.
>
> Yes, I guess I need to fix that, thanks.
>
> I will simply make sure both thtail->urg and th->urg are not set.
>
> I could only test thtail->urg, but that would require copying th->urg_ptr and th->urg,
> and quite frankly we should not spend cycles on URG stuff.

pseudo code added in V3

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 9fa7516fb5c33277be4ba3a667ff61202d8dd445..4904250a9aac5001410f9454258cbb8978bb8202 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1668,6 +1668,8 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
 	    TCP_SKB_CB(tail)->ip_dsfield != TCP_SKB_CB(skb)->ip_dsfield ||
+	    ((TCP_SKB_CB(tail)->tcp_flags |
+	      TCP_SKB_CB(skb)->tcp_flags) & TCPHDR_URG) ||
 	    ((TCP_SKB_CB(tail)->tcp_flags ^
 	      TCP_SKB_CB(skb)->tcp_flags) & (TCPHDR_ECE | TCPHDR_CWR)) ||
 #ifdef CONFIG_TLS_DEVICE
 	    tail->decrypted != skb->decrypted ||
 #endif
Re: [PATCH v2 net-next 4/4] tcp: implement coalescing on backlog queue
On 11/27/2018 01:58 PM, Neal Cardwell wrote: > I wonder if technically perhaps the logic should skip coalescing if > the tail or skb has the TCP_FLAG_URG bit set? It seems if skbs are > coalesced, and some have urgent data and some do not, then the > TCP_FLAG_URG bit will be accumulated into the tail header, but there > will be no way to ensure the correct urgent offsets for the one or > more skbs with urgent data are passed along. Yes, I guess I need to fix that, thanks. I will simply make sure both thtail->urg and th->urg are not set. I could only test thtail->urg, but that would require copying th->urg_ptr and th->urg, and quite frankly we should not spend cycles on URG stuff.
Re: [PATCH v2 net-next 4/4] tcp: implement coalescing on backlog queue
On Tue, Nov 27, 2018 at 10:57 AM Eric Dumazet wrote:
>
> In case GRO is not as efficient as it should be or disabled,
> we might have a user thread trapped in __release_sock() while
> softirq handler flood packets up to the point we have to drop.
>
> This patch balances work done from user thread and softirq,
> to give more chances to __release_sock() to complete its work
> before new packets are added to the backlog.
>
> This also helps if we receive many ACK packets, since GRO
> does not aggregate them.
>
> This patch brings ~60% throughput increase on a receiver
> without GRO, but the spectacular gain is really on
> 1000x release_sock() latency reduction I have measured.
>
> Signed-off-by: Eric Dumazet
> Cc: Neal Cardwell
> Cc: Yuchung Cheng
> ---
...
> +	if (TCP_SKB_CB(tail)->end_seq != TCP_SKB_CB(skb)->seq ||
> +	    TCP_SKB_CB(tail)->ip_dsfield != TCP_SKB_CB(skb)->ip_dsfield ||
> +#ifdef CONFIG_TLS_DEVICE
> +	    tail->decrypted != skb->decrypted ||
> +#endif
> +	    thtail->doff != th->doff ||
> +	    memcmp(thtail + 1, th + 1, hdrlen - sizeof(*th)))
> +		goto no_coalesce;
> +
> +	__skb_pull(skb, hdrlen);
> +	if (skb_try_coalesce(tail, skb, &fragstolen, &delta)) {
> +		thtail->window = th->window;
> +
> +		TCP_SKB_CB(tail)->end_seq = TCP_SKB_CB(skb)->end_seq;
> +
> +		if (after(TCP_SKB_CB(skb)->ack_seq, TCP_SKB_CB(tail)->ack_seq))
> +			TCP_SKB_CB(tail)->ack_seq = TCP_SKB_CB(skb)->ack_seq;
> +
> +		TCP_SKB_CB(tail)->tcp_flags |= TCP_SKB_CB(skb)->tcp_flags;

I wonder if technically perhaps the logic should skip coalescing if the tail or skb has the TCP_FLAG_URG bit set? It seems if skbs are coalesced, and some have urgent data and some do not, then the TCP_FLAG_URG bit will be accumulated into the tail header, but there will be no way to ensure the correct urgent offsets for the one or more skbs with urgent data are passed along.

Thinking out loud, I guess if this is ECN/DCTCP and some ACKs have TCP_FLAG_ECE and some don't, this will effectively have all ACKed bytes be treated as ECN-marked.
Probably OK, since if this coalescing path is being hit the sender may be overloaded and slowing down might be a good thing. Otherwise, looks great to me. Thanks for doing this! neal
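The eligibility test being debated in this thread can be sketched outside the kernel. A hedged Python model (constants match the kernel's TCPHDR_* values; the function name and signature are illustrative): coalescing is refused when the segments are not contiguous, when either carries URG (urgent offsets could not be preserved), or when ECE/CWR differ between tail and new segment.

```python
TCPHDR_URG, TCPHDR_ECE, TCPHDR_CWR = 0x20, 0x40, 0x80

def can_coalesce(tail_flags, skb_flags, tail_end_seq, skb_seq):
    if tail_end_seq != skb_seq:                     # not contiguous in sequence space
        return False
    if (tail_flags | skb_flags) & TCPHDR_URG:       # any urgent data -> keep separate
        return False
    if (tail_flags ^ skb_flags) & (TCPHDR_ECE | TCPHDR_CWR):
        return False                                # ECN state differs
    return True

print(can_coalesce(0, 0, 1000, 1000))               # -> True
print(can_coalesce(0, TCPHDR_URG, 1000, 1000))      # -> False (URG guard)
print(can_coalesce(TCPHDR_ECE, 0, 1000, 1000))      # -> False (ECN mismatch)
```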
Re: [PATCH v2 net-next 2/4] tcp: take care of compressed acks in tcp_add_reno_sack()
On 11/27/2018 01:19 PM, Neal Cardwell wrote: > On Tue, Nov 27, 2018 at 10:57 AM Eric Dumazet wrote: >> >> Neal pointed out that non sack flows might suffer from ACK compression >> added in the following patch ("tcp: implement coalescing on backlog queue") >> >> Instead of tweaking tcp_add_backlog() we can take into >> account how many ACK were coalesced, this information >> will be available in skb_shinfo(skb)->gso_segs >> >> Signed-off-by: Eric Dumazet >> --- > ... >> @@ -2679,8 +2683,8 @@ static void tcp_process_loss(struct sock *sk, int >> flag, bool is_dupack, >> /* A Reno DUPACK means new data in F-RTO step 2.b above are >> * delivered. Lower inflight to clock out (re)tranmissions. >> */ >> - if (after(tp->snd_nxt, tp->high_seq) && is_dupack) >> - tcp_add_reno_sack(sk); >> + if (after(tp->snd_nxt, tp->high_seq)) >> + tcp_add_reno_sack(sk, num_dupack); >> else if (flag & FLAG_SND_UNA_ADVANCED) >> tcp_reset_reno_sack(tp); >> } > > I think this probably should be checking num_dupack, something like: > > + if (after(tp->snd_nxt, tp->high_seq) && num_dupack) > + tcp_add_reno_sack(sk, num_dupack); > > If we don't check num_dupack, that seems to mean that after FRTO sends > the two new data packets (making snd_nxt after high_seq), the patch > would have a particular non-SACK FRTO loss recovery always go into the > "if" branch where we tcp_add_reno_sack() function, and we would never > have a chance to get to the "else" branch where we check if > FLAG_SND_UNA_ADVANCED and zero out the reno SACKs. > > Otherwise the patch looks great to me. Thanks for doing this! > Oh right, I missed the else clause, I thought that tcp_add_reno_sack() checking the num_dupack was enough. Thanks.
Re: [PATCH v2 net-next 3/4] tcp: make tcp_space() aware of socket backlog
On Tue, Nov 27, 2018 at 10:57 AM Eric Dumazet wrote: > > Jean-Louis Dupond reported poor iscsi TCP receive performance > that we tracked to backlog drops. > > Apparently we fail to send window updates reflecting the > fact that we are under stress. > > Note that we might lack a proper window increase when > backlog is fully processed, since __release_sock() clears > sk->sk_backlog.len _after_ all skbs have been processed. > > This should not matter in practice. If we had a significant > load through socket backlog, we are in a dangerous > situation. > > Reported-by: Jean-Louis Dupond > Signed-off-by: Eric Dumazet > --- Acked-by: Neal Cardwell Nice. Thanks! neal
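The idea being acked here can be stated in a few lines: advertise receive space net of the bytes parked on the socket backlog, so window updates reflect stress. A hedged Python sketch (the function name mirrors tcp_space(), but the arithmetic is a simplification of the kernel's, which also accounts for rcv_ssthresh and buffer overhead):

```python
def tcp_space(rcvbuf, backlog_len, rmem_alloc):
    # bytes queued on the backlog count against advertised space
    return max(0, rcvbuf - backlog_len - rmem_alloc)

print(tcp_space(65536, 0, 16384))      # -> 49152
print(tcp_space(65536, 32768, 16384))  # -> 16384: backlog shrinks the window
```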
Re: [PATCH v2 net-next 2/4] tcp: take care of compressed acks in tcp_add_reno_sack()
On Tue, Nov 27, 2018 at 10:57 AM Eric Dumazet wrote: > > Neal pointed out that non sack flows might suffer from ACK compression > added in the following patch ("tcp: implement coalescing on backlog queue") > > Instead of tweaking tcp_add_backlog() we can take into > account how many ACK were coalesced, this information > will be available in skb_shinfo(skb)->gso_segs > > Signed-off-by: Eric Dumazet > --- ... > @@ -2679,8 +2683,8 @@ static void tcp_process_loss(struct sock *sk, int flag, > bool is_dupack, > /* A Reno DUPACK means new data in F-RTO step 2.b above are > * delivered. Lower inflight to clock out (re)tranmissions. > */ > - if (after(tp->snd_nxt, tp->high_seq) && is_dupack) > - tcp_add_reno_sack(sk); > + if (after(tp->snd_nxt, tp->high_seq)) > + tcp_add_reno_sack(sk, num_dupack); > else if (flag & FLAG_SND_UNA_ADVANCED) > tcp_reset_reno_sack(tp); > } I think this probably should be checking num_dupack, something like: + if (after(tp->snd_nxt, tp->high_seq) && num_dupack) + tcp_add_reno_sack(sk, num_dupack); If we don't check num_dupack, that seems to mean that after FRTO sends the two new data packets (making snd_nxt after high_seq), the patch would have a particular non-SACK FRTO loss recovery always go into the "if" branch where we tcp_add_reno_sack() function, and we would never have a chance to get to the "else" branch where we check if FLAG_SND_UNA_ADVANCED and zero out the reno SACKs. Otherwise the patch looks great to me. Thanks for doing this! neal
Re: [PATCH v2 net-next 1/4] tcp: hint compiler about sack flows
On Tue, Nov 27, 2018 at 10:57 AM Eric Dumazet wrote: > > Tell the compiler that most TCP flows are using SACK these days. > > There is no need to add the unlikely() clause in tcp_is_reno(), > the compiler is able to infer it. > > Signed-off-by: Eric Dumazet > --- Acked-by: Neal Cardwell Nice. Thanks! neal
Re: [PATCH v2 net-next 0/4] tcp: take a bit more care of backlog stress
On Tue, Nov 27, 2018 at 7:57 AM, Eric Dumazet wrote: > While working on the SACK compression issue Jean-Louis Dupond > reported, we found that his linux box was suffering very hard > from tail drops on the socket backlog queue. > > First patch hints the compiler about sack flows being the norm. > > Second patch changes non-sack code in preparation of the ack > compression. > > Third patch fixes tcp_space() to take backlog into account. > > Fourth patch is attempting coalescing when a new packet must > be added to the backlog queue. Cooking bigger skbs helps > to keep backlog list smaller and speeds its handling when > user thread finally releases the socket lock. > > v2: added feedback from Neal : tcp: take care of compressed acks in > tcp_add_reno_sack() > added : tcp: hint compiler about sack flows > added : tcp: make tcp_space() aware of socket backlog Great feature! Acked-by: Yuchung Cheng > > > > Eric Dumazet (4): > tcp: hint compiler about sack flows > tcp: take care of compressed acks in tcp_add_reno_sack() > tcp: make tcp_space() aware of socket backlog > tcp: implement coalescing on backlog queue > > include/net/tcp.h | 4 +- > include/uapi/linux/snmp.h | 1 + > net/ipv4/proc.c | 1 + > net/ipv4/tcp_input.c | 58 +++--- > net/ipv4/tcp_ipv4.c | 88 --- > 5 files changed, 119 insertions(+), 33 deletions(-) > > -- > 2.20.0.rc0.387.gc7a69e6b6c-goog >
[PATCH v2 net-next 2/4] tcp: take care of compressed acks in tcp_add_reno_sack()
Neal pointed out that non sack flows might suffer from ACK compression added in the following patch ("tcp: implement coalescing on backlog queue") Instead of tweaking tcp_add_backlog() we can take into account how many ACK were coalesced, this information will be available in skb_shinfo(skb)->gso_segs Signed-off-by: Eric Dumazet --- net/ipv4/tcp_input.c | 58 +--- 1 file changed, 33 insertions(+), 25 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index f32397890b6dcbc34976954c4be142108efa04d8..33d9956d667cbd5eaf6a93913a10ce5d419b8a3a 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -1863,16 +1863,20 @@ static void tcp_check_reno_reordering(struct sock *sk, const int addend) /* Emulate SACKs for SACKless connection: account for a new dupack. */ -static void tcp_add_reno_sack(struct sock *sk) +static void tcp_add_reno_sack(struct sock *sk, int num_dupack) { - struct tcp_sock *tp = tcp_sk(sk); - u32 prior_sacked = tp->sacked_out; + if (num_dupack) { + struct tcp_sock *tp = tcp_sk(sk); + u32 prior_sacked = tp->sacked_out; + s32 delivered; - tp->sacked_out++; - tcp_check_reno_reordering(sk, 0); - if (tp->sacked_out > prior_sacked) - tp->delivered++; /* Some out-of-order packet is delivered */ - tcp_verify_left_out(tp); + tp->sacked_out += num_dupack; + tcp_check_reno_reordering(sk, 0); + delivered = tp->sacked_out - prior_sacked; + if (delivered > 0) + tp->delivered += delivered; + tcp_verify_left_out(tp); + } } /* Account for ACK, ACKing some data in Reno Recovery phase. */ @@ -2634,7 +2638,7 @@ void tcp_enter_recovery(struct sock *sk, bool ece_ack) /* Process an ACK in CA_Loss state. Move to CA_Open if lost data are * recovered or spurious. Otherwise retransmits more on partial ACKs. 
*/ -static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack, +static void tcp_process_loss(struct sock *sk, int flag, int num_dupack, int *rexmit) { struct tcp_sock *tp = tcp_sk(sk); @@ -2653,7 +2657,7 @@ static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack, return; if (after(tp->snd_nxt, tp->high_seq)) { - if (flag & FLAG_DATA_SACKED || is_dupack) + if (flag & FLAG_DATA_SACKED || num_dupack) tp->frto = 0; /* Step 3.a. loss was real */ } else if (flag & FLAG_SND_UNA_ADVANCED && !recovered) { tp->high_seq = tp->snd_nxt; @@ -2679,8 +2683,8 @@ static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack, /* A Reno DUPACK means new data in F-RTO step 2.b above are * delivered. Lower inflight to clock out (re)tranmissions. */ - if (after(tp->snd_nxt, tp->high_seq) && is_dupack) - tcp_add_reno_sack(sk); + if (after(tp->snd_nxt, tp->high_seq)) + tcp_add_reno_sack(sk, num_dupack); else if (flag & FLAG_SND_UNA_ADVANCED) tcp_reset_reno_sack(tp); } @@ -2757,13 +2761,13 @@ static bool tcp_force_fast_retransmit(struct sock *sk) * tcp_xmit_retransmit_queue(). 
*/ static void tcp_fastretrans_alert(struct sock *sk, const u32 prior_snd_una, - bool is_dupack, int *ack_flag, int *rexmit) + int num_dupack, int *ack_flag, int *rexmit) { struct inet_connection_sock *icsk = inet_csk(sk); struct tcp_sock *tp = tcp_sk(sk); int fast_rexmit = 0, flag = *ack_flag; - bool do_lost = is_dupack || ((flag & FLAG_DATA_SACKED) && -tcp_force_fast_retransmit(sk)); + bool do_lost = num_dupack || ((flag & FLAG_DATA_SACKED) && + tcp_force_fast_retransmit(sk)); if (!tp->packets_out && tp->sacked_out) tp->sacked_out = 0; @@ -2810,8 +2814,8 @@ static void tcp_fastretrans_alert(struct sock *sk, const u32 prior_snd_una, switch (icsk->icsk_ca_state) { case TCP_CA_Recovery: if (!(flag & FLAG_SND_UNA_ADVANCED)) { - if (tcp_is_reno(tp) && is_dupack) - tcp_add_reno_sack(sk); + if (tcp_is_reno(tp)) + tcp_add_reno_sack(sk, num_dupack); } else { if (tcp_try_undo_partial(sk, prior_snd_una)) return; @@ -2826,7 +2830,7 @@ static void tcp_fastretrans_alert(struct sock *sk, const u32 prior_snd_una, tcp_identify_packet_loss(sk, ack_flag); break;
[PATCH v2 net-next 4/4] tcp: implement coalescing on backlog queue
In case GRO is not as efficient as it should be, or is disabled, we might have a user thread trapped in __release_sock() while the softirq handler floods packets up to the point we have to drop. This patch balances work done from the user thread and softirq, to give more chances to __release_sock() to complete its work before new packets are added to the backlog. This also helps if we receive many ACK packets, since GRO does not aggregate them. This patch brings ~60% throughput increase on a receiver without GRO, but the spectacular gain is really the ~1000x release_sock() latency reduction I have measured. Signed-off-by: Eric Dumazet Cc: Neal Cardwell Cc: Yuchung Cheng --- include/uapi/linux/snmp.h | 1 + net/ipv4/proc.c | 1 + net/ipv4/tcp_ipv4.c | 88 --- 3 files changed, 84 insertions(+), 6 deletions(-) diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h index f80135e5feaa88609db6dff75b2bc2d637b2..86dc24a96c90ab047d5173d625450facd6c6dd79 100644 --- a/include/uapi/linux/snmp.h +++ b/include/uapi/linux/snmp.h @@ -243,6 +243,7 @@ enum LINUX_MIB_TCPREQQFULLDROP, /* TCPReqQFullDrop */ LINUX_MIB_TCPRETRANSFAIL, /* TCPRetransFail */ LINUX_MIB_TCPRCVCOALESCE, /* TCPRcvCoalesce */ + LINUX_MIB_TCPBACKLOGCOALESCE, /* TCPBacklogCoalesce */ LINUX_MIB_TCPOFOQUEUE, /* TCPOFOQueue */ LINUX_MIB_TCPOFODROP, /* TCPOFODrop */ LINUX_MIB_TCPOFOMERGE, /* TCPOFOMerge */ diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c index 70289682a6701438aed99a00a9705c39fa4394d3..c3610b37bb4ce665b1976d8cc907b6dd0de42ab9 100644 --- a/net/ipv4/proc.c +++ b/net/ipv4/proc.c @@ -219,6 +219,7 @@ static const struct snmp_mib snmp4_net_list[] = { SNMP_MIB_ITEM("TCPRenoRecoveryFail", LINUX_MIB_TCPRENORECOVERYFAIL), SNMP_MIB_ITEM("TCPSackRecoveryFail", LINUX_MIB_TCPSACKRECOVERYFAIL), SNMP_MIB_ITEM("TCPRcvCollapsed", LINUX_MIB_TCPRCVCOLLAPSED), + SNMP_MIB_ITEM("TCPBacklogCoalesce", LINUX_MIB_TCPBACKLOGCOALESCE), SNMP_MIB_ITEM("TCPDSACKOldSent", LINUX_MIB_TCPDSACKOLDSENT), SNMP_MIB_ITEM("TCPDSACKOfoSent", 
LINUX_MIB_TCPDSACKOFOSENT), SNMP_MIB_ITEM("TCPDSACKRecv", LINUX_MIB_TCPDSACKRECV), diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 795605a2327504b8a025405826e7e0ca8dc8501d..b587a841678eb66ece005a9900537fd3f3dab963 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -1619,12 +1619,14 @@ int tcp_v4_early_demux(struct sk_buff *skb) bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb) { u32 limit = sk->sk_rcvbuf + sk->sk_sndbuf; - - /* Only socket owner can try to collapse/prune rx queues -* to reduce memory overhead, so add a little headroom here. -* Few sockets backlog are possibly concurrently non empty. -*/ - limit += 64*1024; + struct skb_shared_info *shinfo; + const struct tcphdr *th; + struct tcphdr *thtail; + struct sk_buff *tail; + unsigned int hdrlen; + bool fragstolen; + u32 gso_segs; + int delta; /* In case all data was pulled from skb frags (in __pskb_pull_tail()), * we can fix skb->truesize to its real value to avoid future drops. @@ -1636,6 +1638,80 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb) skb_dst_drop(skb); + if (unlikely(tcp_checksum_complete(skb))) { + bh_unlock_sock(sk); + __TCP_INC_STATS(sock_net(sk), TCP_MIB_CSUMERRORS); + __TCP_INC_STATS(sock_net(sk), TCP_MIB_INERRS); + return true; + } + + /* Attempt coalescing to last skb in backlog, even if we are +* above the limits. +* This is okay because skb capacity is limited to MAX_SKB_FRAGS. 
+*/ + th = (const struct tcphdr *)skb->data; + hdrlen = th->doff * 4; + shinfo = skb_shinfo(skb); + + if (!shinfo->gso_size) + shinfo->gso_size = skb->len - hdrlen; + + if (!shinfo->gso_segs) + shinfo->gso_segs = 1; + + tail = sk->sk_backlog.tail; + if (!tail) + goto no_coalesce; + thtail = (struct tcphdr *)tail->data; + + if (TCP_SKB_CB(tail)->end_seq != TCP_SKB_CB(skb)->seq || + TCP_SKB_CB(tail)->ip_dsfield != TCP_SKB_CB(skb)->ip_dsfield || +#ifdef CONFIG_TLS_DEVICE + tail->decrypted != skb->decrypted || +#endif + thtail->doff != th->doff || + memcmp(thtail + 1, th + 1, hdrlen - sizeof(*th))) + goto no_coalesce; + + __skb_pull(skb, hdrlen); + if (skb_try_coalesce(tail, skb, &fragstolen, &delta)) { + thtail->window = th->window; + + TCP_SKB_CB(tail)->end_seq = TCP_SKB_CB(skb)->end_seq; + + if (after(TCP_SKB_CB(skb)->ack_seq, TCP_SKB_CB(tail)->ack_seq)) +
[PATCH v2 net-next 3/4] tcp: make tcp_space() aware of socket backlog
Jean-Louis Dupond reported poor iSCSI TCP receive performance that we tracked to backlog drops. Apparently we fail to send window updates reflecting the fact that we are under stress. Note that we might lack a proper window increase when the backlog is fully processed, since __release_sock() clears sk->sk_backlog.len _after_ all skbs have been processed. This should not matter in practice. If we had a significant load through the socket backlog, we are in a dangerous situation. Reported-by: Jean-Louis Dupond Signed-off-by: Eric Dumazet --- include/net/tcp.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index 0c61bf0a06dac95268c26b6302a2afbaef4c88b3..3b522259da7d5a54d7d3730ddd8d8c9ef24313e1 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -1368,7 +1368,7 @@ static inline int tcp_win_from_space(const struct sock *sk, int space) /* Note: caller must be prepared to deal with negative returns */ static inline int tcp_space(const struct sock *sk) { - return tcp_win_from_space(sk, sk->sk_rcvbuf - + return tcp_win_from_space(sk, sk->sk_rcvbuf - sk->sk_backlog.len - atomic_read(&sk->sk_rmem_alloc)); } -- 2.20.0.rc0.387.gc7a69e6b6c-goog
[PATCH v2 net-next 1/4] tcp: hint compiler about sack flows
Tell the compiler that most TCP flows are using SACK these days. There is no need to add the unlikely() clause in tcp_is_reno(), the compiler is able to infer it. Signed-off-by: Eric Dumazet --- include/net/tcp.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/net/tcp.h b/include/net/tcp.h index 63e37dd1c274cc396e41ea9612cf67a5b7c89776..0c61bf0a06dac95268c26b6302a2afbaef4c88b3 100644 --- a/include/net/tcp.h +++ b/include/net/tcp.h @@ -1124,7 +1124,7 @@ void tcp_rate_check_app_limited(struct sock *sk); */ static inline int tcp_is_sack(const struct tcp_sock *tp) { - return tp->rx_opt.sack_ok; + return likely(tp->rx_opt.sack_ok); } static inline bool tcp_is_reno(const struct tcp_sock *tp) -- 2.20.0.rc0.387.gc7a69e6b6c-goog
[PATCH v2 net-next 0/4] tcp: take a bit more care of backlog stress
While working on the SACK compression issue Jean-Louis Dupond reported, we found that his linux box was suffering very hard from tail drops on the socket backlog queue. First patch hints the compiler about sack flows being the norm. Second patch changes non-sack code in preparation of the ack compression. Third patch fixes tcp_space() to take backlog into account. Fourth patch is attempting coalescing when a new packet must be added to the backlog queue. Cooking bigger skbs helps to keep backlog list smaller and speeds its handling when user thread finally releases the socket lock. v2: added feedback from Neal : tcp: take care of compressed acks in tcp_add_reno_sack() added : tcp: hint compiler about sack flows added : tcp: make tcp_space() aware of socket backlog Eric Dumazet (4): tcp: hint compiler about sack flows tcp: take care of compressed acks in tcp_add_reno_sack() tcp: make tcp_space() aware of socket backlog tcp: implement coalescing on backlog queue include/net/tcp.h | 4 +- include/uapi/linux/snmp.h | 1 + net/ipv4/proc.c | 1 + net/ipv4/tcp_input.c | 58 +++--- net/ipv4/tcp_ipv4.c | 88 --- 5 files changed, 119 insertions(+), 33 deletions(-) -- 2.20.0.rc0.387.gc7a69e6b6c-goog
RE: [PATCH v2 net-next 1/8] dpaa2-eth: Add basic XDP support
> -----Original Message----- > From: Ioana Ciocoi Radulescu > Sent: Monday, November 26, 2018 18:27 > To: netdev@vger.kernel.org; da...@davemloft.net > Cc: Ioana Ciornei ; dsah...@gmail.com; Camelia > Alexandra Groza > Subject: [PATCH v2 net-next 1/8] dpaa2-eth: Add basic XDP support > > We keep one XDP program reference per channel. The only actions > supported for now are XDP_DROP and XDP_PASS. > > Until now we didn't enforce a maximum size for Rx frames based > on MTU value. Change that, since for XDP mode we must ensure no > scatter-gather frames can be received. > > Signed-off-by: Ioana Radulescu Acked-by: Camelia Groza
[PATCH v2 net-next] cxgb4: number of VFs supported is not always 16
Total number of VFs supported by PF is used to determine the last byte of VF's mac address. Number of VFs supported is not always 16, use the variable nvfs to get the number of VFs supported rather than hard coding it to 16. Signed-off-by: Casey Leedom Signed-off-by: Ganesh Goudar --- V2: Fixes typo in commit message --- drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c index 7f76ad9..6ba9099 100644 --- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c +++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c @@ -2646,7 +2646,7 @@ static void cxgb4_mgmt_fill_vf_station_mac_addr(struct adapter *adap) for (vf = 0, nvfs = pci_sriov_get_totalvfs(adap->pdev); vf < nvfs; vf++) { - macaddr[5] = adap->pf * 16 + vf; + macaddr[5] = adap->pf * nvfs + vf; ether_addr_copy(adap->vfinfo[vf].vf_mac_addr, macaddr); } } -- 2.1.0
[PATCH v2 net-next] tcp: remove hdrlen argument from tcp_queue_rcv()
Only one caller needs to pull TCP headers, so let's move __skb_pull() to the caller side. Signed-off-by: Eric Dumazet Acked-by: Yuchung Cheng --- v2: sent as a standalone patch. net/ipv4/tcp_input.c | 13 ++--- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 568dbf3b711af75e5f4f0a309f8943579e913494..f32397890b6dcbc34976954c4be142108efa04d8 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -4603,13 +4603,12 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb) } } -static int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, int hdrlen, - bool *fragstolen) +static int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, + bool *fragstolen) { int eaten; struct sk_buff *tail = skb_peek_tail(&sk->sk_receive_queue); - __skb_pull(skb, hdrlen); eaten = (tail && tcp_try_coalesce(sk, tail, skb, fragstolen)) ? 1 : 0; @@ -4660,7 +4659,7 @@ int tcp_send_rcvq(struct sock *sk, struct msghdr *msg, size_t size) TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(skb)->seq + size; TCP_SKB_CB(skb)->ack_seq = tcp_sk(sk)->snd_una - 1; - if (tcp_queue_rcv(sk, skb, 0, &fragstolen)) { + if (tcp_queue_rcv(sk, skb, &fragstolen)) { WARN_ON_ONCE(fragstolen); /* should not happen */ __kfree_skb(skb); } @@ -4720,7 +4719,7 @@ static void tcp_data_queue(struct sock *sk, struct sk_buff *skb) goto drop; } - eaten = tcp_queue_rcv(sk, skb, 0, &fragstolen); + eaten = tcp_queue_rcv(sk, skb, &fragstolen); if (skb->len) tcp_event_data_recv(sk, skb); if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN) @@ -5596,8 +5595,8 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb) NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPHPHITS); /* Bulk data transfer: receiver */ - eaten = tcp_queue_rcv(sk, skb, tcp_header_len, - &fragstolen); + __skb_pull(skb, tcp_header_len); + eaten = tcp_queue_rcv(sk, skb, &fragstolen); tcp_event_data_recv(sk, skb); -- 2.20.0.rc0.387.gc7a69e6b6c-goog
[PATCH v2 net-next 6/8] dpaa2-eth: Add support for XDP_TX
Send frames back on the same port for XDP_TX action. Since the frame buffers have been allocated by us, we can recycle them directly into the Rx buffer pool instead of requesting a confirmation frame upon transmission complete. Signed-off-by: Ioana Radulescu --- v2: XDP_TX packets count towards the tx packets and bytes counters drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c | 51 +++- drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h | 2 + 2 files changed, 52 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c index c2e880b..bc582c4 100644 --- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c +++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c @@ -240,14 +240,53 @@ static void xdp_release_buf(struct dpaa2_eth_priv *priv, ch->xdp.drop_cnt = 0; } +static int xdp_enqueue(struct dpaa2_eth_priv *priv, struct dpaa2_fd *fd, + void *buf_start, u16 queue_id) +{ + struct dpaa2_eth_fq *fq; + struct dpaa2_faead *faead; + u32 ctrl, frc; + int i, err; + + /* Mark the egress frame hardware annotation area as valid */ + frc = dpaa2_fd_get_frc(fd); + dpaa2_fd_set_frc(fd, frc | DPAA2_FD_FRC_FAEADV); + dpaa2_fd_set_ctrl(fd, DPAA2_FD_CTRL_ASAL); + + /* Instruct hardware to release the FD buffer directly into +* the buffer pool once transmission is completed, instead of +* sending a Tx confirmation frame to us +*/ + ctrl = DPAA2_FAEAD_A4V | DPAA2_FAEAD_A2V | DPAA2_FAEAD_EBDDV; + faead = dpaa2_get_faead(buf_start, false); + faead->ctrl = cpu_to_le32(ctrl); + faead->conf_fqid = 0; + + fq = &priv->fq[queue_id]; + for (i = 0; i < DPAA2_ETH_ENQUEUE_RETRIES; i++) { + err = dpaa2_io_service_enqueue_qd(fq->channel->dpio, + priv->tx_qdid, 0, + fq->tx_qdbin, fd); + if (err != -EBUSY) + break; + } + + return err; +} + static u32 run_xdp(struct dpaa2_eth_priv *priv, struct dpaa2_eth_channel *ch, + struct dpaa2_eth_fq *rx_fq, struct dpaa2_fd *fd, void *vaddr) { dma_addr_t addr = dpaa2_fd_get_addr(fd); + struct 
rtnl_link_stats64 *percpu_stats; struct bpf_prog *xdp_prog; struct xdp_buff xdp; u32 xdp_act = XDP_PASS; + int err; + + percpu_stats = this_cpu_ptr(priv->percpu_stats); rcu_read_lock(); @@ -269,6 +308,16 @@ static u32 run_xdp(struct dpaa2_eth_priv *priv, switch (xdp_act) { case XDP_PASS: break; + case XDP_TX: + err = xdp_enqueue(priv, fd, vaddr, rx_fq->flowid); + if (err) { + xdp_release_buf(priv, ch, addr); + percpu_stats->tx_errors++; + } else { + percpu_stats->tx_packets++; + percpu_stats->tx_bytes += dpaa2_fd_get_len(fd); + } + break; default: bpf_warn_invalid_xdp_action(xdp_act); case XDP_ABORTED: @@ -317,7 +366,7 @@ static void dpaa2_eth_rx(struct dpaa2_eth_priv *priv, percpu_extras = this_cpu_ptr(priv->percpu_extras); if (fd_format == dpaa2_fd_single) { - xdp_act = run_xdp(priv, ch, (struct dpaa2_fd *)fd, vaddr); + xdp_act = run_xdp(priv, ch, fq, (struct dpaa2_fd *)fd, vaddr); if (xdp_act != XDP_PASS) { percpu_stats->rx_packets++; percpu_stats->rx_bytes += dpaa2_fd_get_len(fd); diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h index 23cf9d9..5530a0e 100644 --- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h +++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h @@ -139,7 +139,9 @@ struct dpaa2_faead { }; #define DPAA2_FAEAD_A2V0x2000 +#define DPAA2_FAEAD_A4V0x0800 #define DPAA2_FAEAD_UPDV 0x1000 +#define DPAA2_FAEAD_EBDDV 0x2000 #define DPAA2_FAEAD_UPD0x0010 /* Accessors for the hardware annotation fields that we use */ -- 2.7.4
[PATCH v2 net-next 2/8] dpaa2-eth: Allow XDP header adjustments
Reserve XDP_PACKET_HEADROOM bytes in Rx buffers to allow XDP programs to increase frame header size. Signed-off-by: Ioana Radulescu --- v2: no changes drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c | 43 ++-- 1 file changed, 40 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c index d3cfed4..008cdf8 100644 --- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c +++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c @@ -216,11 +216,15 @@ static u32 run_xdp(struct dpaa2_eth_priv *priv, xdp.data = vaddr + dpaa2_fd_get_offset(fd); xdp.data_end = xdp.data + dpaa2_fd_get_len(fd); - xdp.data_hard_start = xdp.data; + xdp.data_hard_start = xdp.data - XDP_PACKET_HEADROOM; xdp_set_data_meta_invalid(&xdp); xdp_act = bpf_prog_run_xdp(xdp_prog, &xdp); + /* xdp.data pointer may have changed */ + dpaa2_fd_set_offset(fd, xdp.data - vaddr); + dpaa2_fd_set_len(fd, xdp.data_end - xdp.data); + switch (xdp_act) { case XDP_PASS: break; @@ -1483,7 +1487,7 @@ static bool xdp_mtu_valid(struct dpaa2_eth_priv *priv, int mtu) mfl = DPAA2_ETH_L2_MAX_FRM(mtu); linear_mfl = DPAA2_ETH_RX_BUF_SIZE - DPAA2_ETH_RX_HWA_SIZE - -dpaa2_eth_rx_head_room(priv); +dpaa2_eth_rx_head_room(priv) - XDP_PACKET_HEADROOM; if (mfl > linear_mfl) { netdev_warn(priv->net_dev, "Maximum MTU for XDP is %d\n", @@ -1537,6 +1541,32 @@ static int dpaa2_eth_change_mtu(struct net_device *dev, int new_mtu) return 0; } +static int update_rx_buffer_headroom(struct dpaa2_eth_priv *priv, bool has_xdp) +{ + struct dpni_buffer_layout buf_layout = {0}; + int err; + + err = dpni_get_buffer_layout(priv->mc_io, 0, priv->mc_token, +DPNI_QUEUE_RX, &buf_layout); + if (err) { + netdev_err(priv->net_dev, "dpni_get_buffer_layout failed\n"); + return err; + } + + /* Reserve extra headroom for XDP header size changes */ + buf_layout.data_head_room = dpaa2_eth_rx_head_room(priv) + + (has_xdp ? 
XDP_PACKET_HEADROOM : 0); + buf_layout.options = DPNI_BUF_LAYOUT_OPT_DATA_HEAD_ROOM; + err = dpni_set_buffer_layout(priv->mc_io, 0, priv->mc_token, +DPNI_QUEUE_RX, &buf_layout); + if (err) { + netdev_err(priv->net_dev, "dpni_set_buffer_layout failed\n"); + return err; + } + + return 0; +} + static int setup_xdp(struct net_device *dev, struct bpf_prog *prog) { struct dpaa2_eth_priv *priv = netdev_priv(dev); @@ -1560,11 +1590,18 @@ static int setup_xdp(struct net_device *dev, struct bpf_prog *prog) if (up) dpaa2_eth_stop(dev); - /* While in xdp mode, enforce a maximum Rx frame size based on MTU */ + /* While in xdp mode, enforce a maximum Rx frame size based on MTU. +* Also, when switching between xdp/non-xdp modes we need to reconfigure +* our Rx buffer layout. Buffer pool was drained on dpaa2_eth_stop, +* so we are sure no old format buffers will be used from now on. +*/ if (need_update) { err = set_rx_mfl(priv, dev->mtu, !!prog); if (err) goto out_err; + err = update_rx_buffer_headroom(priv, !!prog); + if (err) + goto out_err; } old = xchg(&priv->xdp_prog, prog); -- 2.7.4
[PATCH v2 net-next 7/8] dpaa2-eth: Cleanup channel stats
Remove unused counter. Reorder fields in channel stats structure to match the ethtool strings order and make it easier to print them with ethtool -S. Signed-off-by: Ioana Radulescu --- v2: no changes drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c | 1 - drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h | 6 ++ drivers/net/ethernet/freescale/dpaa2/dpaa2-ethtool.c | 16 +--- 3 files changed, 7 insertions(+), 16 deletions(-) diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c index bc582c4..d2bc5da 100644 --- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c +++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c @@ -467,7 +467,6 @@ static int consume_frames(struct dpaa2_eth_channel *ch, return 0; fq->stats.frames += cleaned; - ch->stats.frames += cleaned; /* A dequeue operation only pulls frames from a single queue * into the store. Return the frame queue as an out param. diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h index 5530a0e..41a2a0d 100644 --- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h +++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h @@ -245,12 +245,10 @@ struct dpaa2_eth_fq_stats { struct dpaa2_eth_ch_stats { /* Volatile dequeues retried due to portal busy */ __u64 dequeue_portal_busy; - /* Number of CDANs; useful to estimate avg NAPI len */ - __u64 cdan; - /* Number of frames received on queues from this channel */ - __u64 frames; /* Pull errors */ __u64 pull_err; + /* Number of CDANs; useful to estimate avg NAPI len */ + __u64 cdan; }; /* Maximum number of queues associated with a DPNI */ diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-ethtool.c b/drivers/net/ethernet/freescale/dpaa2/dpaa2-ethtool.c index 26bd5a2..79eeebe 100644 --- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-ethtool.c +++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-ethtool.c @@ -174,8 +174,6 @@ static void dpaa2_eth_get_ethtool_stats(struct 
net_device *net_dev, int j, k, err; int num_cnt; union dpni_statistics dpni_stats; - u64 cdan = 0; - u64 portal_busy = 0, pull_err = 0; struct dpaa2_eth_priv *priv = netdev_priv(net_dev); struct dpaa2_eth_drv_stats *extras; struct dpaa2_eth_ch_stats *ch_stats; @@ -212,16 +210,12 @@ static void dpaa2_eth_get_ethtool_stats(struct net_device *net_dev, } i += j; - for (j = 0; j < priv->num_channels; j++) { - ch_stats = &priv->channel[j]->stats; - cdan += ch_stats->cdan; - portal_busy += ch_stats->dequeue_portal_busy; - pull_err += ch_stats->pull_err; + /* Per-channel stats */ + for (k = 0; k < priv->num_channels; k++) { + ch_stats = &priv->channel[k]->stats; + for (j = 0; j < sizeof(*ch_stats) / sizeof(__u64); j++) + *((__u64 *)data + i + j) += *((__u64 *)ch_stats + j); } - - *(data + i++) = portal_busy; - *(data + i++) = pull_err; - *(data + i++) = cdan; } static int prep_eth_rule(struct ethhdr *eth_value, struct ethhdr *eth_mask, -- 2.7.4
[PATCH v2 net-next 3/8] dpaa2-eth: Move function
We'll use function free_bufs() on the XDP path as well, so move it higher in order to avoid a forward declaration. Signed-off-by: Ioana Radulescu --- v2: no changes drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c | 34 1 file changed, 17 insertions(+), 17 deletions(-) diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c index 008cdf8..174c960 100644 --- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c +++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c @@ -200,6 +200,23 @@ static struct sk_buff *build_frag_skb(struct dpaa2_eth_priv *priv, return skb; } +/* Free buffers acquired from the buffer pool or which were meant to + * be released in the pool + */ +static void free_bufs(struct dpaa2_eth_priv *priv, u64 *buf_array, int count) +{ + struct device *dev = priv->net_dev->dev.parent; + void *vaddr; + int i; + + for (i = 0; i < count; i++) { + vaddr = dpaa2_iova_to_virt(priv->iommu_domain, buf_array[i]); + dma_unmap_single(dev, buf_array[i], DPAA2_ETH_RX_BUF_SIZE, +DMA_FROM_DEVICE); + skb_free_frag(vaddr); + } +} + static u32 run_xdp(struct dpaa2_eth_priv *priv, struct dpaa2_eth_channel *ch, struct dpaa2_fd *fd, void *vaddr) @@ -797,23 +814,6 @@ static int set_tx_csum(struct dpaa2_eth_priv *priv, bool enable) return 0; } -/* Free buffers acquired from the buffer pool or which were meant to - * be released in the pool - */ -static void free_bufs(struct dpaa2_eth_priv *priv, u64 *buf_array, int count) -{ - struct device *dev = priv->net_dev->dev.parent; - void *vaddr; - int i; - - for (i = 0; i < count; i++) { - vaddr = dpaa2_iova_to_virt(priv->iommu_domain, buf_array[i]); - dma_unmap_single(dev, buf_array[i], DPAA2_ETH_RX_BUF_SIZE, -DMA_FROM_DEVICE); - skb_free_frag(vaddr); - } -} - /* Perform a single release command to add buffers * to the specified buffer pool */ -- 2.7.4
[PATCH v2 net-next 5/8] dpaa2-eth: Map Rx buffers as bidirectional
In order to support enqueueing Rx FDs back to hardware, we need to DMA map Rx buffers as bidirectional. Signed-off-by: Ioana Radulescu --- v2: no changes drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c | 14 +++--- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c index ac4cb81..c2e880b 100644 --- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c +++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c @@ -87,7 +87,7 @@ static void free_rx_fd(struct dpaa2_eth_priv *priv, addr = dpaa2_sg_get_addr(&sgt[i]); sg_vaddr = dpaa2_iova_to_virt(priv->iommu_domain, addr); dma_unmap_single(dev, addr, DPAA2_ETH_RX_BUF_SIZE, -DMA_FROM_DEVICE); +DMA_BIDIRECTIONAL); skb_free_frag(sg_vaddr); if (dpaa2_sg_is_final(&sgt[i])) @@ -145,7 +145,7 @@ static struct sk_buff *build_frag_skb(struct dpaa2_eth_priv *priv, sg_addr = dpaa2_sg_get_addr(sge); sg_vaddr = dpaa2_iova_to_virt(priv->iommu_domain, sg_addr); dma_unmap_single(dev, sg_addr, DPAA2_ETH_RX_BUF_SIZE, -DMA_FROM_DEVICE); +DMA_BIDIRECTIONAL); sg_length = dpaa2_sg_get_len(sge); @@ -212,7 +212,7 @@ static void free_bufs(struct dpaa2_eth_priv *priv, u64 *buf_array, int count) for (i = 0; i < count; i++) { vaddr = dpaa2_iova_to_virt(priv->iommu_domain, buf_array[i]); dma_unmap_single(dev, buf_array[i], DPAA2_ETH_RX_BUF_SIZE, -DMA_FROM_DEVICE); +DMA_BIDIRECTIONAL); skb_free_frag(vaddr); } } @@ -306,7 +306,7 @@ static void dpaa2_eth_rx(struct dpaa2_eth_priv *priv, vaddr = dpaa2_iova_to_virt(priv->iommu_domain, addr); dma_sync_single_for_cpu(dev, addr, DPAA2_ETH_RX_BUF_SIZE, - DMA_FROM_DEVICE); + DMA_BIDIRECTIONAL); fas = dpaa2_get_fas(vaddr, false); prefetch(fas); @@ -325,13 +325,13 @@ static void dpaa2_eth_rx(struct dpaa2_eth_priv *priv, } dma_unmap_single(dev, addr, DPAA2_ETH_RX_BUF_SIZE, -DMA_FROM_DEVICE); +DMA_BIDIRECTIONAL); skb = build_linear_skb(ch, fd, vaddr); } else if (fd_format == dpaa2_fd_sg) { WARN_ON(priv->xdp_prog); 
dma_unmap_single(dev, addr, DPAA2_ETH_RX_BUF_SIZE, -DMA_FROM_DEVICE); +DMA_BIDIRECTIONAL); skb = build_frag_skb(priv, ch, buf_data); skb_free_frag(vaddr); percpu_extras->rx_sg_frames++; @@ -865,7 +865,7 @@ static int add_bufs(struct dpaa2_eth_priv *priv, buf = PTR_ALIGN(buf, priv->rx_buf_align); addr = dma_map_single(dev, buf, DPAA2_ETH_RX_BUF_SIZE, - DMA_FROM_DEVICE); + DMA_BIDIRECTIONAL); if (unlikely(dma_mapping_error(dev, addr))) goto err_map; -- 2.7.4
[PATCH v2 net-next 8/8] dpaa2-eth: Add xdp counters
Add counters for xdp processed frames to the channel statistics. Signed-off-by: Ioana Radulescu --- v2: no changes drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c | 3 +++ drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h | 4 drivers/net/ethernet/freescale/dpaa2/dpaa2-ethtool.c | 3 +++ 3 files changed, 10 insertions(+) diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c index d2bc5da..be84171 100644 --- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c +++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c @@ -313,9 +313,11 @@ static u32 run_xdp(struct dpaa2_eth_priv *priv, if (err) { xdp_release_buf(priv, ch, addr); percpu_stats->tx_errors++; + ch->stats.xdp_tx_err++; } else { percpu_stats->tx_packets++; percpu_stats->tx_bytes += dpaa2_fd_get_len(fd); + ch->stats.xdp_tx++; } break; default: @@ -324,6 +326,7 @@ static u32 run_xdp(struct dpaa2_eth_priv *priv, trace_xdp_exception(priv->net_dev, xdp_prog, xdp_act); case XDP_DROP: xdp_release_buf(priv, ch, addr); + ch->stats.xdp_drop++; break; } diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h index 41a2a0d..69c965d 100644 --- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h +++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h @@ -249,6 +249,10 @@ struct dpaa2_eth_ch_stats { __u64 pull_err; /* Number of CDANs; useful to estimate avg NAPI len */ __u64 cdan; + /* XDP counters */ + __u64 xdp_drop; + __u64 xdp_tx; + __u64 xdp_tx_err; }; /* Maximum number of queues associated with a DPNI */ diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-ethtool.c b/drivers/net/ethernet/freescale/dpaa2/dpaa2-ethtool.c index 79eeebe..0c831bf 100644 --- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-ethtool.c +++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-ethtool.c @@ -45,6 +45,9 @@ static char dpaa2_ethtool_extras[][ETH_GSTRING_LEN] = { "[drv] dequeue portal busy", "[drv] channel pull errors", "[drv] cdan", 
+ "[drv] xdp drop", + "[drv] xdp tx", + "[drv] xdp tx errors", }; #define DPAA2_ETH_NUM_EXTRA_STATS ARRAY_SIZE(dpaa2_ethtool_extras) -- 2.7.4
[PATCH v2 net-next 0/8] dpaa2-eth: Introduce XDP support
Add support for XDP programs. Only XDP_PASS, XDP_DROP and XDP_TX actions are supported for now. Frame header changes are also allowed. v2: - count the XDP packets in the rx/tx interface stats - add message with the maximum supported MTU value for XDP Ioana Radulescu (8): dpaa2-eth: Add basic XDP support dpaa2-eth: Allow XDP header adjustments dpaa2-eth: Move function dpaa2-eth: Release buffers back to pool on XDP_DROP dpaa2-eth: Map Rx buffers as bidirectional dpaa2-eth: Add support for XDP_TX dpaa2-eth: Cleanup channel stats dpaa2-eth: Add xdp counters drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c | 349 +++-- drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h | 20 +- .../net/ethernet/freescale/dpaa2/dpaa2-ethtool.c | 19 +- 3 files changed, 350 insertions(+), 38 deletions(-) -- 2.7.4
[PATCH v2 net-next 4/8] dpaa2-eth: Release buffers back to pool on XDP_DROP
Instead of freeing the RX buffers, release them back into the pool. We wait for the maximum number of buffers supported by a single release command to accumulate before issuing the command. Also, don't unmap the Rx buffers at the beginning of the Rx routine anymore, since that would require remapping them before release. Instead, just do a DMA sync at first and only unmap if the frame is meant for the stack. Signed-off-by: Ioana Radulescu --- v2: no changes drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c | 34 +--- drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h | 2 ++ 2 files changed, 33 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c index 174c960..ac4cb81 100644 --- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c +++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c @@ -217,10 +217,34 @@ static void free_bufs(struct dpaa2_eth_priv *priv, u64 *buf_array, int count) } } +static void xdp_release_buf(struct dpaa2_eth_priv *priv, + struct dpaa2_eth_channel *ch, + dma_addr_t addr) +{ + int err; + + ch->xdp.drop_bufs[ch->xdp.drop_cnt++] = addr; + if (ch->xdp.drop_cnt < DPAA2_ETH_BUFS_PER_CMD) + return; + + while ((err = dpaa2_io_service_release(ch->dpio, priv->bpid, + ch->xdp.drop_bufs, + ch->xdp.drop_cnt)) == -EBUSY) + cpu_relax(); + + if (err) { + free_bufs(priv, ch->xdp.drop_bufs, ch->xdp.drop_cnt); + ch->buf_count -= ch->xdp.drop_cnt; + } + + ch->xdp.drop_cnt = 0; +} + static u32 run_xdp(struct dpaa2_eth_priv *priv, struct dpaa2_eth_channel *ch, struct dpaa2_fd *fd, void *vaddr) { + dma_addr_t addr = dpaa2_fd_get_addr(fd); struct bpf_prog *xdp_prog; struct xdp_buff xdp; u32 xdp_act = XDP_PASS; @@ -250,8 +274,7 @@ static u32 run_xdp(struct dpaa2_eth_priv *priv, case XDP_ABORTED: trace_xdp_exception(priv->net_dev, xdp_prog, xdp_act); case XDP_DROP: - ch->buf_count--; - free_rx_fd(priv, fd, vaddr); + xdp_release_buf(priv, ch, addr); break; } @@ -282,7 +305,8 @@ static 
void dpaa2_eth_rx(struct dpaa2_eth_priv *priv, trace_dpaa2_rx_fd(priv->net_dev, fd); vaddr = dpaa2_iova_to_virt(priv->iommu_domain, addr); - dma_unmap_single(dev, addr, DPAA2_ETH_RX_BUF_SIZE, DMA_FROM_DEVICE); + dma_sync_single_for_cpu(dev, addr, DPAA2_ETH_RX_BUF_SIZE, + DMA_FROM_DEVICE); fas = dpaa2_get_fas(vaddr, false); prefetch(fas); @@ -300,10 +324,14 @@ static void dpaa2_eth_rx(struct dpaa2_eth_priv *priv, return; } + dma_unmap_single(dev, addr, DPAA2_ETH_RX_BUF_SIZE, +DMA_FROM_DEVICE); skb = build_linear_skb(ch, fd, vaddr); } else if (fd_format == dpaa2_fd_sg) { WARN_ON(priv->xdp_prog); + dma_unmap_single(dev, addr, DPAA2_ETH_RX_BUF_SIZE, +DMA_FROM_DEVICE); skb = build_frag_skb(priv, ch, buf_data); skb_free_frag(vaddr); percpu_extras->rx_sg_frames++; diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h index 2873a15..23cf9d9 100644 --- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h +++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h @@ -285,6 +285,8 @@ struct dpaa2_eth_fq { struct dpaa2_eth_ch_xdp { struct bpf_prog *prog; + u64 drop_bufs[DPAA2_ETH_BUFS_PER_CMD]; + int drop_cnt; }; struct dpaa2_eth_channel { -- 2.7.4
[PATCH v2 net-next 1/8] dpaa2-eth: Add basic XDP support
We keep one XDP program reference per channel. The only actions supported for now are XDP_DROP and XDP_PASS. Until now we didn't enforce a maximum size for Rx frames based on MTU value. Change that, since for XDP mode we must ensure no scatter-gather frames can be received. Signed-off-by: Ioana Radulescu --- v2: - xdp packets count towards the rx packets and bytes counters - add warning message with the maximum supported MTU value for XDP drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c | 189 ++- drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h | 6 + 2 files changed, 194 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c index 640967a..d3cfed4 100644 --- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c +++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c @@ -13,7 +13,8 @@ #include #include #include - +#include +#include #include #include "dpaa2-eth.h" @@ -199,6 +200,45 @@ static struct sk_buff *build_frag_skb(struct dpaa2_eth_priv *priv, return skb; } +static u32 run_xdp(struct dpaa2_eth_priv *priv, + struct dpaa2_eth_channel *ch, + struct dpaa2_fd *fd, void *vaddr) +{ + struct bpf_prog *xdp_prog; + struct xdp_buff xdp; + u32 xdp_act = XDP_PASS; + + rcu_read_lock(); + + xdp_prog = READ_ONCE(ch->xdp.prog); + if (!xdp_prog) + goto out; + + xdp.data = vaddr + dpaa2_fd_get_offset(fd); + xdp.data_end = xdp.data + dpaa2_fd_get_len(fd); + xdp.data_hard_start = xdp.data; + xdp_set_data_meta_invalid(); + + xdp_act = bpf_prog_run_xdp(xdp_prog, ); + + switch (xdp_act) { + case XDP_PASS: + break; + default: + bpf_warn_invalid_xdp_action(xdp_act); + case XDP_ABORTED: + trace_xdp_exception(priv->net_dev, xdp_prog, xdp_act); + case XDP_DROP: + ch->buf_count--; + free_rx_fd(priv, fd, vaddr); + break; + } + +out: + rcu_read_unlock(); + return xdp_act; +} + /* Main Rx frame processing routine */ static void dpaa2_eth_rx(struct dpaa2_eth_priv *priv, struct dpaa2_eth_channel *ch, @@ -215,6 
+255,7 @@ static void dpaa2_eth_rx(struct dpaa2_eth_priv *priv, struct dpaa2_fas *fas; void *buf_data; u32 status = 0; + u32 xdp_act; /* Tracing point */ trace_dpaa2_rx_fd(priv->net_dev, fd); @@ -231,8 +272,17 @@ static void dpaa2_eth_rx(struct dpaa2_eth_priv *priv, percpu_extras = this_cpu_ptr(priv->percpu_extras); if (fd_format == dpaa2_fd_single) { + xdp_act = run_xdp(priv, ch, (struct dpaa2_fd *)fd, vaddr); + if (xdp_act != XDP_PASS) { + percpu_stats->rx_packets++; + percpu_stats->rx_bytes += dpaa2_fd_get_len(fd); + return; + } + skb = build_linear_skb(ch, fd, vaddr); } else if (fd_format == dpaa2_fd_sg) { + WARN_ON(priv->xdp_prog); + skb = build_frag_skb(priv, ch, buf_data); skb_free_frag(vaddr); percpu_extras->rx_sg_frames++; @@ -1427,6 +1477,141 @@ static int dpaa2_eth_ioctl(struct net_device *dev, struct ifreq *rq, int cmd) return -EINVAL; } +static bool xdp_mtu_valid(struct dpaa2_eth_priv *priv, int mtu) +{ + int mfl, linear_mfl; + + mfl = DPAA2_ETH_L2_MAX_FRM(mtu); + linear_mfl = DPAA2_ETH_RX_BUF_SIZE - DPAA2_ETH_RX_HWA_SIZE - +dpaa2_eth_rx_head_room(priv); + + if (mfl > linear_mfl) { + netdev_warn(priv->net_dev, "Maximum MTU for XDP is %d\n", + linear_mfl - VLAN_ETH_HLEN); + return false; + } + + return true; +} + +static int set_rx_mfl(struct dpaa2_eth_priv *priv, int mtu, bool has_xdp) +{ + int mfl, err; + + /* We enforce a maximum Rx frame length based on MTU only if we have +* an XDP program attached (in order to avoid Rx S/G frames). 
+* Otherwise, we accept all incoming frames as long as they are not +* larger than maximum size supported in hardware +*/ + if (has_xdp) + mfl = DPAA2_ETH_L2_MAX_FRM(mtu); + else + mfl = DPAA2_ETH_MFL; + + err = dpni_set_max_frame_length(priv->mc_io, 0, priv->mc_token, mfl); + if (err) { + netdev_err(priv->net_dev, "dpni_set_max_frame_length failed\n"); + return err; + } + + return 0; +} + +static int dpaa2_eth_change_mtu(struct net_device *dev, int new_mtu) +{ + struct dpaa2_eth_priv *priv = netdev_priv(dev); + int err; + + if (!priv->xdp_prog) + goto out; + + if (!xdp_mtu_valid(priv, new_mtu)) + return -EINVAL; +
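The run_xdp() switch above handles XDP_PASS explicitly, warns on unknown actions, and deliberately falls through from XDP_ABORTED to the XDP_DROP path that recycles the buffer. A minimal userspace model of that dispatch (hypothetical counter names, no real BPF program involved):

```c
#include <assert.h>

enum xdp_action { XDP_ABORTED, XDP_DROP, XDP_PASS };  /* subset of the real enum */

struct counters { int passed, dropped, warned, traced; };

/* mirrors the switch in run_xdp(): unknown -> warn, ABORTED -> trace,
 * and both fall through to the DROP handling that recycles the buffer */
static void handle_act(int act, struct counters *c)
{
    switch (act) {
    case XDP_PASS:
        c->passed++;            /* frame continues to the stack */
        break;
    default:
        c->warned++;            /* bpf_warn_invalid_xdp_action() */
        /* fall through */
    case XDP_ABORTED:
        c->traced++;            /* trace_xdp_exception() */
        /* fall through */
    case XDP_DROP:
        c->dropped++;           /* buffer goes back to the pool */
        break;
    }
}
```

The intentional fallthrough is the same idiom the patch uses: an aborted or invalid program result is reported, then treated exactly like a drop.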
[PATCH v2 net-next] net: remove unsafe skb_insert()
I do not see how one can effectively use skb_insert() without holding some kind of lock. Otherwise other cpus could have changed the list right before we have a chance of acquiring list->lock.

Only existing user is in drivers/infiniband/hw/nes/nes_mgt.c and this one probably meant to use __skb_insert() since it appears nesqp->pau_list is protected by nesqp->pau_lock. This looks like nesqp->pau_lock could be removed, since nesqp->pau_list.lock could be used instead.

Signed-off-by: Eric Dumazet
Cc: Faisal Latif
Cc: Doug Ledford
Cc: Jason Gunthorpe
Cc: linux-rdma
---
 drivers/infiniband/hw/nes/nes_mgt.c |  4 ++--
 include/linux/skbuff.h              |  2 --
 net/core/skbuff.c                   | 22 --
 3 files changed, 2 insertions(+), 26 deletions(-)

diff --git a/drivers/infiniband/hw/nes/nes_mgt.c b/drivers/infiniband/hw/nes/nes_mgt.c
index fc0c191014e908eea32d752f3499295ef143aa0a..cc4dce5c3e5f6d99fc44fcde7334e70ac7a33002 100644
--- a/drivers/infiniband/hw/nes/nes_mgt.c
+++ b/drivers/infiniband/hw/nes/nes_mgt.c
@@ -551,14 +551,14 @@ static void queue_fpdus(struct sk_buff *skb, struct nes_vnic *nesvnic, struct ne
 	/* Queue skb by sequence number */
 	if (skb_queue_len(&nesqp->pau_list) == 0) {
-		skb_queue_head(&nesqp->pau_list, skb);
+		__skb_queue_head(&nesqp->pau_list, skb);
 	} else {
 		skb_queue_walk(&nesqp->pau_list, tmpskb) {
 			cb = (struct nes_rskb_cb *)&tmpskb->cb[0];
 			if (before(seqnum, cb->seqnum))
 				break;
 		}
-		skb_insert(tmpskb, skb, &nesqp->pau_list);
+		__skb_insert(skb, tmpskb->prev, tmpskb, &nesqp->pau_list);
 	}
 	if (nesqp->pau_state == PAU_READY)
 		process_it = true;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f17a7452ac7bf47ef4bcf89840bba165cee6f50a..73902acf2b71c8800d81b744a936a7420f33b459 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1749,8 +1749,6 @@ static inline void skb_queue_head_init_class(struct sk_buff_head *list,
  *	The "__skb_xxxx()" functions are the non-atomic ones that
  *	can only be called with interrupts disabled.
  */
-void skb_insert(struct sk_buff *old, struct sk_buff *newsk,
-		struct sk_buff_head *list);
 static inline void __skb_insert(struct sk_buff *newsk,
 				struct sk_buff *prev, struct sk_buff *next,
 				struct sk_buff_head *list)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 9a8a72cefe9b94d3821b9cc5ba5bba647ae51267..02cd7ae3d0fb26ef0a8b006390154fdefd0d292f 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2990,28 +2990,6 @@ void skb_append(struct sk_buff *old, struct sk_buff *newsk, struct sk_buff_head
 }
 EXPORT_SYMBOL(skb_append);
 
-/**
- *	skb_insert	-	insert a buffer
- *	@old: buffer to insert before
- *	@newsk: buffer to insert
- *	@list: list to use
- *
- *	Place a packet before a given packet in a list. The list locks are
- *	taken and this function is atomic with respect to other list locked
- *	calls.
- *
- *	A buffer cannot be placed on two lists at the same time.
- */
-void skb_insert(struct sk_buff *old, struct sk_buff *newsk, struct sk_buff_head *list)
-{
-	unsigned long flags;
-
-	spin_lock_irqsave(&list->lock, flags);
-	__skb_insert(newsk, old->prev, old, list);
-	spin_unlock_irqrestore(&list->lock, flags);
-}
-EXPORT_SYMBOL(skb_insert);
-
 static inline void skb_split_inside_header(struct sk_buff *skb,
 					   struct sk_buff* skb1,
 					   const u32 len, const int pos)
-- 
2.20.0.rc0.387.gc7a69e6b6c-goog
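The conversion above replaces the self-locking skb_insert() with __skb_insert(), the non-atomic variant, because the caller (queue_fpdus) already serializes the list with its own lock. The core operation is a plain doubly-linked insert-before; a minimal sketch with hypothetical node/list types (not the sk_buff API):

```c
#include <assert.h>
#include <stddef.h>

struct node {
    int seq;
    struct node *prev, *next;
};

struct dlist {                  /* like sk_buff_head: a sentinel node */
    struct node head;
};

static void list_init(struct dlist *l)
{
    l->head.prev = l->head.next = &l->head;
}

/* non-atomic insert-before, like __skb_insert(): the caller must already
 * hold whatever lock protects this list */
static void insert_before(struct node *newn, struct node *next)
{
    newn->next = next;
    newn->prev = next->prev;
    next->prev->next = newn;
    next->prev = newn;
}

/* keep ascending seq order, as queue_fpdus() does with FPDU seqnums:
 * walk until the first node with a larger seq and insert before it */
static void insert_sorted(struct dlist *l, struct node *newn)
{
    struct node *pos = l->head.next;

    while (pos != &l->head && pos->seq <= newn->seq)
        pos = pos->next;
    insert_before(newn, pos);   /* pos may be the sentinel: appends */
}
```

Taking the list's own spinlock inside insert_before() as well — which is what skb_insert() did — would be useless here: the walk that chose the insertion point happened outside that lock, so the position could already be stale. That is exactly the race the removed function invited.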
Re: [PATCH v2 net-next] net: lpc_eth: fix trivial comment typo
From: Andrea Claudi Date: Tue, 20 Nov 2018 18:30:30 +0100 > Fix comment typo rxfliterctrl -> rxfilterctrl > > Signed-off-by: Andrea Claudi Applied.
Re: [PATCH v2 net-next] cxgb4/cxgb4vf: Fix mac_hlist initialization and free
From: Arjun Vynipadath Date: Tue, 20 Nov 2018 12:11:39 +0530 > Null pointer dereference seen when cxgb4vf driver is unloaded > without bringing up any interfaces, moving mac_hlist initialization > to driver probe and free the mac_hlist in remove to fix the issue. > > Fixes: 24357e06ba51 ("cxgb4vf: fix memleak in mac_hlist initialization") > Signed-off-by: Arjun Vynipadath > Signed-off-by: Casey Leedom > Signed-off-by: Ganesh Goudar > --- > v2: > - Updated commit description as per Leon's feedback Applied.
[PATCH v2 net-next] net: lpc_eth: fix trivial comment typo
Fix comment typo rxfliterctrl -> rxfilterctrl

Signed-off-by: Andrea Claudi
---
 drivers/net/ethernet/nxp/lpc_eth.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/nxp/lpc_eth.c b/drivers/net/ethernet/nxp/lpc_eth.c
index bd8695a4faaa..89d17399fb5a 100644
--- a/drivers/net/ethernet/nxp/lpc_eth.c
+++ b/drivers/net/ethernet/nxp/lpc_eth.c
@@ -280,7 +280,7 @@
 #define LPC_FCCR_MIRRORCOUNTERCURRENT(n)	((n) & 0x)
 
 /*
- * rxfliterctrl, rxfilterwolstatus, and rxfilterwolclear shared
+ * rxfilterctrl, rxfilterwolstatus, and rxfilterwolclear shared
  * register definitions
  */
 #define LPC_RXFLTRW_ACCEPTUNICAST		(1 << 0)
@@ -291,7 +291,7 @@
 #define LPC_RXFLTRW_ACCEPTPERFECT		(1 << 5)
 
 /*
- * rxfliterctrl register definitions
+ * rxfilterctrl register definitions
  */
 #define LPC_RXFLTRWSTS_MAGICPACKETENWOL	(1 << 12)
 #define LPC_RXFLTRWSTS_RXFILTERENWOL	(1 << 13)
-- 
2.17.2
[PATCH v2 net-next] cxgb4/cxgb4vf: Fix mac_hlist initialization and free
Null pointer dereference seen when cxgb4vf driver is unloaded without bringing up any interfaces, moving mac_hlist initialization to driver probe and free the mac_hlist in remove to fix the issue. Fixes: 24357e06ba51 ("cxgb4vf: fix memleak in mac_hlist initialization") Signed-off-by: Arjun Vynipadath Signed-off-by: Casey Leedom Signed-off-by: Ganesh Goudar --- v2: - Updated commit description as per Leon's feedback --- drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 19 ++- drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c | 6 +++--- 2 files changed, 13 insertions(+), 12 deletions(-) diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c index 956e708..cdd6f48 100644 --- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c +++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c @@ -2280,8 +2280,6 @@ static int cxgb_up(struct adapter *adap) #if IS_ENABLED(CONFIG_IPV6) update_clip(adap); #endif - /* Initialize hash mac addr list*/ - INIT_LIST_HEAD(>mac_hlist); return err; irq_err: @@ -2295,8 +2293,6 @@ static int cxgb_up(struct adapter *adap) static void cxgb_down(struct adapter *adapter) { - struct hash_mac_addr *entry, *tmp; - cancel_work_sync(>tid_release_task); cancel_work_sync(>db_full_task); cancel_work_sync(>db_drop_task); @@ -2306,11 +2302,6 @@ static void cxgb_down(struct adapter *adapter) t4_sge_stop(adapter); t4_free_sge_resources(adapter); - list_for_each_entry_safe(entry, tmp, >mac_hlist, list) { - list_del(>list); - kfree(entry); - } - adapter->flags &= ~FULL_INIT_DONE; } @@ -5629,6 +5620,9 @@ static int init_one(struct pci_dev *pdev, const struct pci_device_id *ent) (is_t5(adapter->params.chip) ? 
STATMODE_V(0) : T6_STATMODE_V(0))); + /* Initialize hash mac addr list */ + INIT_LIST_HEAD(>mac_hlist); + for_each_port(adapter, i) { netdev = alloc_etherdev_mq(sizeof(struct port_info), MAX_ETH_QSETS); @@ -5907,6 +5901,7 @@ static int init_one(struct pci_dev *pdev, const struct pci_device_id *ent) static void remove_one(struct pci_dev *pdev) { struct adapter *adapter = pci_get_drvdata(pdev); + struct hash_mac_addr *entry, *tmp; if (!adapter) { pci_release_regions(pdev); @@ -5956,6 +5951,12 @@ static void remove_one(struct pci_dev *pdev) if (adapter->num_uld || adapter->num_ofld_uld) t4_uld_mem_free(adapter); free_some_resources(adapter); + list_for_each_entry_safe(entry, tmp, >mac_hlist, +list) { + list_del(>list); + kfree(entry); + } + #if IS_ENABLED(CONFIG_IPV6) t4_cleanup_clip_tbl(adapter); #endif diff --git a/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c b/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c index 8ec503c..8a2ad6b 100644 --- a/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c +++ b/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c @@ -723,9 +723,6 @@ static int adapter_up(struct adapter *adapter) if (adapter->flags & USING_MSIX) name_msix_vecs(adapter); - /* Initialize hash mac addr list*/ - INIT_LIST_HEAD(>mac_hlist); - adapter->flags |= FULL_INIT_DONE; } @@ -3038,6 +3035,9 @@ static int cxgb4vf_pci_probe(struct pci_dev *pdev, if (err) goto err_unmap_bar; + /* Initialize hash mac addr list */ + INIT_LIST_HEAD(>mac_hlist); + /* * Allocate our "adapter ports" and stitch everything together. */ -- 2.9.5
[PATCH V2 net-next 0/5] net: hns3: Add support of hardware GRO to HNS3 Driver
This patch-set adds support of hardware assisted GRO feature to HNS3 driver on Rev B(=0x21) platform. Current hardware only supports TCP/IPv{4|6} flows. Change Log: V1->V2: 1. Remove redundant print reported by Leon Romanovsky. Link: https://lkml.org/lkml/2018/11/13/715 Peng Li (5): net: hns3: Enable HW GRO for Rev B(=0x21) HNS3 hardware net: hns3: Add handling of GRO Pkts not fully RX'ed in NAPI poll net: hns3: Add support for ethtool -K to enable/disable HW GRO net: hns3: Add skb chain when num of RX buf exceeds MAX_SKB_FRAGS net: hns3: Adds GRO params to SKB for the stack drivers/net/ethernet/hisilicon/hns3/hnae3.h | 7 + .../net/ethernet/hisilicon/hns3/hns3_enet.c | 289 ++ .../net/ethernet/hisilicon/hns3/hns3_enet.h | 17 +- .../hisilicon/hns3/hns3pf/hclge_cmd.h | 7 + .../hisilicon/hns3/hns3pf/hclge_main.c| 39 +++ .../hisilicon/hns3/hns3vf/hclgevf_cmd.h | 8 + .../hisilicon/hns3/hns3vf/hclgevf_main.c | 39 +++ 7 files changed, 339 insertions(+), 67 deletions(-) -- 2.17.1
Re: [PATCH v2 net-next] net: phy: improve struct phy_device member interrupts handling
From: Heiner Kallweit Date: Fri, 9 Nov 2018 18:35:52 +0100 > As a heritage from the very early days of phylib member interrupts is > defined as u32 even though it's just a flag whether interrupts are > enabled. So we can change it to a bitfield member. In addition change > the code dealing with this member in a way that it's clear we're > dealing with a bool value. > > Signed-off-by: Heiner Kallweit > --- > v2: > - use false/true instead of 0/1 for the constants Applied.
Re: [PATCH v2 net-next] net: phy: improve struct phy_device member interrupts handling
On 11/9/18 9:35 AM, Heiner Kallweit wrote: > As a heritage from the very early days of phylib member interrupts is > defined as u32 even though it's just a flag whether interrupts are > enabled. So we can change it to a bitfield member. In addition change > the code dealing with this member in a way that it's clear we're > dealing with a bool value. > > Signed-off-by: Heiner Kallweit Reviewed-by: Florian Fainelli -- Florian
Re: [PATCH v2 net-next] net: phy: improve struct phy_device member interrupts handling
On Fri, Nov 09, 2018 at 06:35:52PM +0100, Heiner Kallweit wrote: > As a heritage from the very early days of phylib member interrupts is > defined as u32 even though it's just a flag whether interrupts are > enabled. So we can change it to a bitfield member. In addition change > the code dealing with this member in a way that it's clear we're > dealing with a bool value. > > Signed-off-by: Heiner Kallweit Reviewed-by: Andrew Lunn Andrew
[PATCH v2 net-next] net: phy: improve struct phy_device member interrupts handling
As a heritage from the very early days of phylib member interrupts is defined as u32 even though it's just a flag whether interrupts are enabled. So we can change it to a bitfield member. In addition change the code dealing with this member in a way that it's clear we're dealing with a bool value.

Signed-off-by: Heiner Kallweit
---
v2:
- use false/true instead of 0/1 for the constants

Actually this member isn't needed at all and could be replaced with a parameter in phy_driver->config_intr. But this would mean an API change, maybe I come up with a proposal later.
---
 drivers/net/phy/phy.c |  4 ++--
 include/linux/phy.h   | 10 +-
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index dd5bff955..8dac890f3 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -115,9 +115,9 @@ static int phy_clear_interrupt(struct phy_device *phydev)
  *
  * Returns 0 on success or < 0 on error.
  */
-static int phy_config_interrupt(struct phy_device *phydev, u32 interrupts)
+static int phy_config_interrupt(struct phy_device *phydev, bool interrupts)
 {
-	phydev->interrupts = interrupts;
+	phydev->interrupts = interrupts ? 1 : 0;
 
 	if (phydev->drv->config_intr)
 		return phydev->drv->config_intr(phydev);
diff --git a/include/linux/phy.h b/include/linux/phy.h
index 240e04d5a..59bb31ee1 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -262,8 +262,8 @@ static inline struct mii_bus *devm_mdiobus_alloc(struct device *dev)
 void devm_mdiobus_free(struct device *dev, struct mii_bus *bus);
 struct phy_device *mdiobus_scan(struct mii_bus *bus, int addr);
 
-#define PHY_INTERRUPT_DISABLED	0x0
-#define PHY_INTERRUPT_ENABLED	0x8000
+#define PHY_INTERRUPT_DISABLED	false
+#define PHY_INTERRUPT_ENABLED	true
 
 /* PHY state machine states:
  *
@@ -409,6 +409,9 @@ struct phy_device {
 	/* The most recently read link state */
 	unsigned link:1;
 
+	/* Interrupts are enabled */
+	unsigned interrupts:1;
+
 	enum phy_state state;
 
 	u32 dev_flags;
@@ -424,9 +427,6 @@ struct phy_device {
 	int pause;
 	int asym_pause;
 
-	/* Enabled Interrupts */
-	u32 interrupts;
-
 	/* Union of PHY and Attached devices' supported modes */
 	/* See mii.h for more info */
 	u32 supported;
-- 
2.19.1
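One detail worth spelling out: assignment into a one-bit unsigned bitfield keeps only the low bit, so storing the old mask-style PHY_INTERRUPT_ENABLED value (a high bit) directly would silently become 0. That is why the patch normalizes with `interrupts ? 1 : 0`. A small sketch (struct and constant are illustrative, not the phylib definitions):

```c
#include <assert.h>

struct phy_state {
    unsigned link:1;
    unsigned interrupts:1;      /* a flag now, not a mask word */
};

/* mirrors phy_config_interrupt(): normalize any truthy value to 1 */
static void config_interrupt(struct phy_state *p, unsigned int interrupts)
{
    p->interrupts = interrupts ? 1 : 0;
}

/* what a plain assignment would do: modulo-2 truncation into a :1 field */
static unsigned int truncating_assign(unsigned int v)
{
    struct phy_state p = {0, 0};

    p.interrupts = v;           /* keeps only the low-order bit */
    return p.interrupts;
}
```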
Re: [PATCH v2 net-next] sock: Reset dst when changing sk_mark via setsockopt
On 11/07/2018 08:55 PM, David Barmann wrote: > When setting the SO_MARK socket option, the dst needs to be reset so > that a new route lookup is performed. > > This fixes the case where an application wants to change routing by > setting a new sk_mark. If this is done after some packets have already > been sent, the dst is cached and has no effect. > > Signed-off-by: David Barmann > --- > net/core/sock.c | 6 -- > 1 file changed, 4 insertions(+), 2 deletions(-) > > diff --git a/net/core/sock.c b/net/core/sock.c > index 7b304e454a38..c74b10be86cb 100644 > --- a/net/core/sock.c > +++ b/net/core/sock.c > @@ -952,10 +952,12 @@ int sock_setsockopt(struct socket *sock, int level, int > optname, > clear_bit(SOCK_PASSSEC, >flags); > break; > case SO_MARK: > - if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) > + if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) { > ret = -EPERM; > - else > + } else { > sk->sk_mark = val; > + sk_dst_reset(sk); There is no need to force a sk_dst_reset(sk) if sk_mark was not changed. I already gave you this feedback, please do not ignore it. Thanks.
[PATCH v2 net-next] sock: Reset dst when changing sk_mark via setsockopt
When setting the SO_MARK socket option, the dst needs to be reset so that a new route lookup is performed. This fixes the case where an application wants to change routing by setting a new sk_mark. If this is done after some packets have already been sent, the dst is cached and has no effect. Signed-off-by: David Barmann --- net/core/sock.c | 6 -- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/net/core/sock.c b/net/core/sock.c index 7b304e454a38..c74b10be86cb 100644 --- a/net/core/sock.c +++ b/net/core/sock.c @@ -952,10 +952,12 @@ int sock_setsockopt(struct socket *sock, int level, int optname, clear_bit(SOCK_PASSSEC, >flags); break; case SO_MARK: - if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) + if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) { ret = -EPERM; - else + } else { sk->sk_mark = val; + sk_dst_reset(sk); + } break; case SO_RXQ_OVFL: -- 2.14.5
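The patch is a cache-invalidation fix: the socket caches its route (dst), so a changed sk_mark must drop that cache or routing decisions keep using the old mark. Incorporating Eric Dumazet's review comment from the thread above (only reset when the mark actually changed), the logic reduces to the following sketch; the struct and counters are illustrative stand-ins, not the kernel's sock API:

```c
#include <assert.h>

struct sock_s {
    unsigned int mark;
    int dst_cached;             /* 1 while a cached route is valid */
    int lookups;                /* route lookups actually performed */
};

/* stand-in for the output path: reuse the cached dst when possible */
static void route_output(struct sock_s *sk)
{
    if (!sk->dst_cached) {      /* miss: do a fresh lookup and cache it */
        sk->lookups++;
        sk->dst_cached = 1;
    }
}

/* SO_MARK handling with the reviewer's refinement applied:
 * drop the cached dst only when the mark really changed */
static void set_mark(struct sock_s *sk, unsigned int val)
{
    if (sk->mark != val) {
        sk->mark = val;
        sk->dst_cached = 0;     /* models sk_dst_reset(sk) */
    }
}
```

The capability check (CAP_NET_ADMIN) from the real setsockopt path is orthogonal to the caching behavior and omitted here.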
Re: [PATCH v2 net-next 0/8] net: dsa: microchip: Modify KSZ9477 DSA driver in preparation to add other KSZ switch drivers
On 08/16/2018 02:34 PM, tristram...@microchip.com wrote: >> -Original Message- >> From: Florian Fainelli >> Sent: Wednesday, August 15, 2018 5:29 PM >> To: Tristram Ha - C24268 ; Andrew Lunn >> ; Pavel Machek ; Ruediger Schmitt >> >> Cc: Arkadi Sharshevsky ; UNGLinuxDriver >> ; netdev@vger.kernel.org >> Subject: Re: [PATCH v2 net-next 0/8] net: dsa: microchip: Modify KSZ9477 >> DSA driver in preparation to add other KSZ switch drivers >> >> On 12/05/2017 05:46 PM, tristram...@microchip.com wrote: >>> From: Tristram Ha >>> >>> This series of patches is to modify the original KSZ9477 DSA driver so >>> that other KSZ switch drivers can be added and use the common code. >>> >>> There are several steps to accomplish this achievement. First is to >>> rename some function names with a prefix to indicate chip specific >>> function. Second is to move common code into header that can be shared. >>> Last is to modify tag_ksz.c so that it can handle many tail tag formats >>> used by different KSZ switch drivers. >>> >>> ksz_common.c will contain the common code used by all KSZ switch drivers. >>> ksz9477.c will contain KSZ9477 code from the original ksz_common.c. >>> ksz9477_spi.c is renamed from ksz_spi.c. >>> ksz9477_reg.h is renamed from ksz_9477_reg.h. >>> ksz_common.h is added to provide common code access to KSZ switch >>> drivers. >>> ksz_spi.h is added to provide common SPI access functions to KSZ SPI >>> drivers. >> >> Is something gating this series from getting included? It's been nearly >> 8 months now and this has not been include nor resubmitted, any plans to >> rebase that patch series and work towards inclusion in net-next when it >> opens back again? >> >> Thank you! > > Sorry for the long delay. I will restart my kernel submission effort next > month > after finishing the work on current development project. > Tristram, any chance of resubmitting this or should someone with access to those switches take up your series and submit it? -- Florian
Re: [PATCH V2 net-next] net: ena: Fix Kconfig dependency on X86
From: Date: Wed, 17 Oct 2018 10:04:21 +

> From: Netanel Belgazal
>
> The Kconfig limitation of X86 is too wide.
> The ENA driver only requires a little endian dependency.
>
> Change the dependency to be on little endian CPU.
>
> Signed-off-by: Netanel Belgazal

Applied.
[PATCH V2 net-next] net: ena: Fix Kconfig dependency on X86
From: Netanel Belgazal

The Kconfig limitation of X86 is too wide.
The ENA driver only requires a little endian dependency.

Change the dependency to be on little endian CPU.

Signed-off-by: Netanel Belgazal
---
 drivers/net/ethernet/amazon/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/amazon/Kconfig b/drivers/net/ethernet/amazon/Kconfig
index 99b30353541a..9e87d7b8360f 100644
--- a/drivers/net/ethernet/amazon/Kconfig
+++ b/drivers/net/ethernet/amazon/Kconfig
@@ -17,7 +17,7 @@ if NET_VENDOR_AMAZON
 
 config ENA_ETHERNET
 	tristate "Elastic Network Adapter (ENA) support"
-	depends on (PCI_MSI && X86)
+	depends on PCI_MSI && !CPU_BIG_ENDIAN
 	---help---
 	  This driver supports Elastic Network Adapter (ENA)"
-- 
2.15.2.AMZN
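The Kconfig change encodes "works on any little-endian CPU, not just x86" as a build-time constraint. For completeness, the same property can be probed at run time without undefined behavior by inspecting the lowest-addressed byte of a multi-byte integer; both probes below are equivalent formulations, shown only to illustrate what the `!CPU_BIG_ENDIAN` dependency asserts:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* returns 1 on a little-endian CPU: the low-order byte is stored first */
static int cpu_is_little_endian(void)
{
    uint32_t probe = 1;
    uint8_t first;

    memcpy(&first, &probe, 1);  /* read the lowest-addressed byte */
    return first == 1;
}

/* a second, independent probe using a 16-bit pattern */
static int le_via_pattern(void)
{
    uint16_t probe = 0x0102;
    uint8_t first;

    memcpy(&first, &probe, 1);
    return first == 0x02;       /* 0x02 is the low-order byte */
}
```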
Re: [PATCH v2 net-next 00/11] net: Kernel side filtering for route dumps
From: David Ahern Date: Mon, 15 Oct 2018 18:56:40 -0700 > From: David Ahern > > Implement kernel side filtering of route dumps by protocol (e.g., which > routing daemon installed the route), route type (e.g., unicast), table > id and nexthop device. > > iproute2 has been doing this filtering in userspace for years; pushing > the filters to the kernel side reduces the amount of data the kernel > sends and reduces wasted cycles on both sides processing unwanted data. > These initial options provide a huge improvement for efficiently > examining routes on large scale systems. > > v2 > - better handling of requests for a specific table. Rather than walking > the hash of all tables, lookup the specific table and dump it > - refactor mr_rtm_dumproute moving the loop over the table into a > helper that can be invoked directly > - add hook to return NLM_F_DUMP_FILTERED in DONE message to ensure > it is returned even when the dump returns nothing Looks great David, I'll push this out to net-next after my build tests finish. Thanks.
[PATCH v2 net-next 06/11] ipmr: Refactor mr_rtm_dumproute
From: David Ahern Move per-table loops from mr_rtm_dumproute to mr_table_dump and export mr_table_dump for dumps by specific table id. Signed-off-by: David Ahern --- include/linux/mroute_base.h | 6 net/ipv4/ipmr_base.c| 88 - 2 files changed, 61 insertions(+), 33 deletions(-) diff --git a/include/linux/mroute_base.h b/include/linux/mroute_base.h index 6675b9f81979..db85373c8d15 100644 --- a/include/linux/mroute_base.h +++ b/include/linux/mroute_base.h @@ -283,6 +283,12 @@ void *mr_mfc_find_any(struct mr_table *mrt, int vifi, void *hasharg); int mr_fill_mroute(struct mr_table *mrt, struct sk_buff *skb, struct mr_mfc *c, struct rtmsg *rtm); +int mr_table_dump(struct mr_table *mrt, struct sk_buff *skb, + struct netlink_callback *cb, + int (*fill)(struct mr_table *mrt, struct sk_buff *skb, + u32 portid, u32 seq, struct mr_mfc *c, + int cmd, int flags), + spinlock_t *lock); int mr_rtm_dumproute(struct sk_buff *skb, struct netlink_callback *cb, struct mr_table *(*iter)(struct net *net, struct mr_table *mrt), diff --git a/net/ipv4/ipmr_base.c b/net/ipv4/ipmr_base.c index 1ad9aa62a97b..132dd2613ca5 100644 --- a/net/ipv4/ipmr_base.c +++ b/net/ipv4/ipmr_base.c @@ -268,6 +268,55 @@ int mr_fill_mroute(struct mr_table *mrt, struct sk_buff *skb, } EXPORT_SYMBOL(mr_fill_mroute); +int mr_table_dump(struct mr_table *mrt, struct sk_buff *skb, + struct netlink_callback *cb, + int (*fill)(struct mr_table *mrt, struct sk_buff *skb, + u32 portid, u32 seq, struct mr_mfc *c, + int cmd, int flags), + spinlock_t *lock) +{ + unsigned int e = 0, s_e = cb->args[1]; + unsigned int flags = NLM_F_MULTI; + struct mr_mfc *mfc; + int err; + + list_for_each_entry_rcu(mfc, >mfc_cache_list, list) { + if (e < s_e) + goto next_entry; + + err = fill(mrt, skb, NETLINK_CB(cb->skb).portid, + cb->nlh->nlmsg_seq, mfc, RTM_NEWROUTE, flags); + if (err < 0) + goto out; +next_entry: + e++; + } + e = 0; + s_e = 0; + + spin_lock_bh(lock); + list_for_each_entry(mfc, >mfc_unres_queue, list) { + if (e < s_e) + goto 
next_entry2; + + err = fill(mrt, skb, NETLINK_CB(cb->skb).portid, + cb->nlh->nlmsg_seq, mfc, RTM_NEWROUTE, flags); + if (err < 0) { + spin_unlock_bh(lock); + goto out; + } +next_entry2: + e++; + } + spin_unlock_bh(lock); + err = 0; + e = 0; + +out: + cb->args[1] = e; + return err; +} + int mr_rtm_dumproute(struct sk_buff *skb, struct netlink_callback *cb, struct mr_table *(*iter)(struct net *net, struct mr_table *mrt), @@ -277,51 +326,24 @@ int mr_rtm_dumproute(struct sk_buff *skb, struct netlink_callback *cb, int cmd, int flags), spinlock_t *lock) { - unsigned int t = 0, e = 0, s_t = cb->args[0], s_e = cb->args[1]; + unsigned int t = 0, s_t = cb->args[0]; struct net *net = sock_net(skb->sk); struct mr_table *mrt; - struct mr_mfc *mfc; + int err; rcu_read_lock(); for (mrt = iter(net, NULL); mrt; mrt = iter(net, mrt)) { if (t < s_t) goto next_table; - list_for_each_entry_rcu(mfc, >mfc_cache_list, list) { - if (e < s_e) - goto next_entry; - if (fill(mrt, skb, NETLINK_CB(cb->skb).portid, -cb->nlh->nlmsg_seq, mfc, -RTM_NEWROUTE, NLM_F_MULTI) < 0) - goto done; -next_entry: - e++; - } - e = 0; - s_e = 0; - - spin_lock_bh(lock); - list_for_each_entry(mfc, >mfc_unres_queue, list) { - if (e < s_e) - goto next_entry2; - if (fill(mrt, skb, NETLINK_CB(cb->skb).portid, -cb->nlh->nlmsg_seq, mfc, -RTM_NEWROUTE, NLM_F_MULTI) < 0) { - spin_unlock_bh(lock); - goto done; - } -next_entry2: - e++; - } - spin_unlock_bh(lock); - e = 0; - s_e = 0; + +
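The refactored mr_table_dump() keeps its position in cb->args[1] (the s_e/e cursor) so that a dump interrupted by a full skb resumes at the next entry on the following netlink call. The resumable-iteration pattern in miniature, with the cursor standing in for cb->args[]:

```c
#include <assert.h>

/* dump entries [*cursor .. n) into out, at most room entries per call;
 * returns how many were written and leaves *cursor at the next entry,
 * mimicking how netlink dump callbacks carry state in cb->args[] */
static int table_dump(const int *table, int n, int *cursor,
                      int *out, int room)
{
    int written = 0;

    while (*cursor < n && written < room)
        out[written++] = table[(*cursor)++];
    return written;
}
```

A return of 0 plays the role of "dump complete": the kernel then emits NLMSG_DONE instead of another data message.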
[PATCH v2 net-next 01/11] netlink: Add answer_flags to netlink_callback
From: David Ahern With dump filtering we need a way to ensure the NLM_F_DUMP_FILTERED flag is set on a message back to the user if the data returned is influenced by some input attributes. Normally this can be done as messages are added to the skb, but if the filter results in no data being returned, the user could be confused as to why. This patch adds answer_flags to the netlink_callback allowing dump handlers to set the NLM_F_DUMP_FILTERED at a minimum in the NLMSG_DONE message ensuring the flag gets back to the user. The netlink_callback space is initialized to 0 via a memset in __netlink_dump_start, so init of the new answer_flags is covered. Signed-off-by: David Ahern --- include/linux/netlink.h | 1 + net/netlink/af_netlink.c | 3 ++- 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/include/linux/netlink.h b/include/linux/netlink.h index 72580f1a72a2..4da90a6ab536 100644 --- a/include/linux/netlink.h +++ b/include/linux/netlink.h @@ -180,6 +180,7 @@ struct netlink_callback { u16 family; u16 min_dump_alloc; boolstrict_check; + u16 answer_flags; unsigned intprev_seq, seq; longargs[6]; }; diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c index e613a9f89600..6bb9f3cde0b0 100644 --- a/net/netlink/af_netlink.c +++ b/net/netlink/af_netlink.c @@ -2257,7 +2257,8 @@ static int netlink_dump(struct sock *sk) } nlh = nlmsg_put_answer(skb, cb, NLMSG_DONE, - sizeof(nlk->dump_done_errno), NLM_F_MULTI); + sizeof(nlk->dump_done_errno), + NLM_F_MULTI | cb->answer_flags); if (WARN_ON(!nlh)) goto errout_skb; -- 2.11.0
[PATCH v2 net-next 05/11] net/mpls: Plumb support for filtering route dumps
From: David Ahern Implement kernel side filtering of routes by egress device index and protocol. MPLS uses only a single table and route type. Signed-off-by: David Ahern --- net/mpls/af_mpls.c | 42 +- 1 file changed, 41 insertions(+), 1 deletion(-) diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c index bfcb4759c9ee..48f4cbd9fb38 100644 --- a/net/mpls/af_mpls.c +++ b/net/mpls/af_mpls.c @@ -2067,12 +2067,35 @@ static int mpls_valid_fib_dump_req(struct net *net, const struct nlmsghdr *nlh, } #endif +static bool mpls_rt_uses_dev(struct mpls_route *rt, +const struct net_device *dev) +{ + struct net_device *nh_dev; + + if (rt->rt_nhn == 1) { + struct mpls_nh *nh = rt->rt_nh; + + nh_dev = rtnl_dereference(nh->nh_dev); + if (dev == nh_dev) + return true; + } else { + for_nexthops(rt) { + nh_dev = rtnl_dereference(nh->nh_dev); + if (nh_dev == dev) + return true; + } endfor_nexthops(rt); + } + + return false; +} + static int mpls_dump_routes(struct sk_buff *skb, struct netlink_callback *cb) { const struct nlmsghdr *nlh = cb->nlh; struct net *net = sock_net(skb->sk); struct mpls_route __rcu **platform_label; struct fib_dump_filter filter = {}; + unsigned int flags = NLM_F_MULTI; size_t platform_labels; unsigned int index; @@ -2084,6 +2107,14 @@ static int mpls_dump_routes(struct sk_buff *skb, struct netlink_callback *cb) err = mpls_valid_fib_dump_req(net, nlh, , cb->extack); if (err < 0) return err; + + /* for MPLS, there is only 1 table with fixed type and flags. +* If either are set in the filter then return nothing. 
+*/ + if ((filter.table_id && filter.table_id != RT_TABLE_MAIN) || + (filter.rt_type && filter.rt_type != RTN_UNICAST) || +filter.flags) + return skb->len; } index = cb->args[0]; @@ -2092,15 +2123,24 @@ static int mpls_dump_routes(struct sk_buff *skb, struct netlink_callback *cb) platform_label = rtnl_dereference(net->mpls.platform_label); platform_labels = net->mpls.platform_labels; + + if (filter.filter_set) + flags |= NLM_F_DUMP_FILTERED; + for (; index < platform_labels; index++) { struct mpls_route *rt; + rt = rtnl_dereference(platform_label[index]); if (!rt) continue; + if ((filter.dev && !mpls_rt_uses_dev(rt, filter.dev)) || + (filter.protocol && rt->rt_protocol != filter.protocol)) + continue; + if (mpls_dump_route(skb, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq, RTM_NEWROUTE, - index, rt, NLM_F_MULTI) < 0) + index, rt, flags) < 0) break; } cb->args[0] = index; -- 2.11.0
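The MPLS dump above tests each route against the requested device and protocol and, whenever any filter was supplied, ORs NLM_F_DUMP_FILTERED into the reply flags — even if nothing matched, so userspace can tell an empty filtered dump from an empty table. The shape of that logic, with illustrative types and a stand-in flag value:

```c
#include <assert.h>

#define F_FILTERED 0x20         /* stand-in for NLM_F_DUMP_FILTERED */

struct route  { int dev; int protocol; };
struct filter { int dev; int protocol; int filter_set; };

/* count routes that pass the filter; set F_FILTERED in *flags whenever
 * the request carried any filter, even if no route matched */
static int dump_routes(const struct route *rt, int n,
                       const struct filter *f, int *flags)
{
    int i, kept = 0;

    if (f->filter_set)
        *flags |= F_FILTERED;
    for (i = 0; i < n; i++) {
        if (f->dev && rt[i].dev != f->dev)
            continue;           /* egress device does not match */
        if (f->protocol && rt[i].protocol != f->protocol)
            continue;           /* installing protocol does not match */
        kept++;
    }
    return kept;
}
```

A zero field means "no constraint", matching how fib_dump_filter treats unset attributes.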
[PATCH v2 net-next 11/11] net/ipv4: Bail early if user only wants prefix entries
From: David Ahern Unlike IPv6, IPv4 does not have routes marked with RTF_PREFIX_RT. If the flag is set in the dump request, just return. In the process of this change, move the CLONE check to use the new filter flags. Signed-off-by: David Ahern --- net/ipv4/fib_frontend.c | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c index e86ca2255181..5bf653f36911 100644 --- a/net/ipv4/fib_frontend.c +++ b/net/ipv4/fib_frontend.c @@ -886,10 +886,14 @@ static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb) err = ip_valid_fib_dump_req(net, nlh, , cb); if (err < 0) return err; + } else if (nlmsg_len(nlh) >= sizeof(struct rtmsg)) { + struct rtmsg *rtm = nlmsg_data(nlh); + + filter.flags = rtm->rtm_flags & (RTM_F_PREFIX | RTM_F_CLONED); } - if (nlmsg_len(nlh) >= sizeof(struct rtmsg) && - ((struct rtmsg *)nlmsg_data(nlh))->rtm_flags & RTM_F_CLONED) + /* fib entries are never clones and ipv4 does not use prefix flag */ + if (filter.flags & (RTM_F_PREFIX | RTM_F_CLONED)) return skb->len; if (filter.table_id) { -- 2.11.0
[PATCH v2 net-next 07/11] net: Plumb support for filtering ipv4 and ipv6 multicast route dumps
From: David Ahern Implement kernel side filtering of routes by egress device index and table id. If the table id is given in the filter, lookup table and call mr_table_dump directly for it. Signed-off-by: David Ahern --- include/linux/mroute_base.h | 7 --- net/ipv4/ipmr.c | 18 +++--- net/ipv4/ipmr_base.c| 42 +++--- net/ipv6/ip6mr.c| 18 +++--- 4 files changed, 73 insertions(+), 12 deletions(-) diff --git a/include/linux/mroute_base.h b/include/linux/mroute_base.h index db85373c8d15..34de06b426ef 100644 --- a/include/linux/mroute_base.h +++ b/include/linux/mroute_base.h @@ -7,6 +7,7 @@ #include #include #include +#include /** * struct vif_device - interface representor for multicast routing @@ -288,7 +289,7 @@ int mr_table_dump(struct mr_table *mrt, struct sk_buff *skb, int (*fill)(struct mr_table *mrt, struct sk_buff *skb, u32 portid, u32 seq, struct mr_mfc *c, int cmd, int flags), - spinlock_t *lock); + spinlock_t *lock, struct fib_dump_filter *filter); int mr_rtm_dumproute(struct sk_buff *skb, struct netlink_callback *cb, struct mr_table *(*iter)(struct net *net, struct mr_table *mrt), @@ -296,7 +297,7 @@ int mr_rtm_dumproute(struct sk_buff *skb, struct netlink_callback *cb, struct sk_buff *skb, u32 portid, u32 seq, struct mr_mfc *c, int cmd, int flags), -spinlock_t *lock); +spinlock_t *lock, struct fib_dump_filter *filter); int mr_dump(struct net *net, struct notifier_block *nb, unsigned short family, int (*rules_dump)(struct net *net, @@ -346,7 +347,7 @@ mr_rtm_dumproute(struct sk_buff *skb, struct netlink_callback *cb, struct sk_buff *skb, u32 portid, u32 seq, struct mr_mfc *c, int cmd, int flags), -spinlock_t *lock) +spinlock_t *lock, struct fib_dump_filter *filter) { return -EINVAL; } diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c index 44d777058960..3fa988e6a3df 100644 --- a/net/ipv4/ipmr.c +++ b/net/ipv4/ipmr.c @@ -2528,18 +2528,30 @@ static int ipmr_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh, static int ipmr_rtm_dumproute(struct sk_buff 
*skb, struct netlink_callback *cb) { struct fib_dump_filter filter = {}; + int err; if (cb->strict_check) { - int err; - err = ip_valid_fib_dump_req(sock_net(skb->sk), cb->nlh, &filter, cb->extack); if (err < 0) return err; } + if (filter.table_id) { + struct mr_table *mrt; + + mrt = ipmr_get_table(sock_net(skb->sk), filter.table_id); + if (!mrt) { + NL_SET_ERR_MSG(cb->extack, "ipv4: MR table does not exist"); + return -ENOENT; + } + err = mr_table_dump(mrt, skb, cb, _ipmr_fill_mroute, + &mfc_unres_lock, &filter); + return skb->len ? : err; + } + return mr_rtm_dumproute(skb, cb, ipmr_mr_table_iter, - _ipmr_fill_mroute, &mfc_unres_lock); + _ipmr_fill_mroute, &mfc_unres_lock, &filter); } static const struct nla_policy rtm_ipmr_policy[RTA_MAX + 1] = { diff --git a/net/ipv4/ipmr_base.c b/net/ipv4/ipmr_base.c index 132dd2613ca5..bfe8fd04afa0 100644 --- a/net/ipv4/ipmr_base.c +++ b/net/ipv4/ipmr_base.c @@ -268,21 +268,45 @@ int mr_fill_mroute(struct mr_table *mrt, struct sk_buff *skb, } EXPORT_SYMBOL(mr_fill_mroute); +static bool mr_mfc_uses_dev(const struct mr_table *mrt, + const struct mr_mfc *c, + const struct net_device *dev) +{ + int ct; + + for (ct = c->mfc_un.res.minvif; ct < c->mfc_un.res.maxvif; ct++) { + if (VIF_EXISTS(mrt, ct) && c->mfc_un.res.ttls[ct] < 255) { + const struct vif_device *vif; + + vif = &mrt->vif_table[ct]; + if (vif->dev == dev) + return true; + } + } + return false; +} + int mr_table_dump(struct mr_table *mrt, struct sk_buff *skb, struct netlink_callback *cb, int (*fill)(struct mr_table *mrt, struct sk_buff *skb, u32 portid, u32 seq, struct mr_mfc *c, int cmd, int flags), - spinlock_t *lock) + spinlock_t *lock, struct fib_dump_filter *filter) { unsigned int e = 0,
[PATCH v2 net-next 10/11] net/ipv6: Bail early if user only wants cloned entries
From: David Ahern Similar to IPv4, IPv6 fib no longer contains cloned routes. If a user requests a route dump for only cloned entries, no sense walking the FIB and returning everything. Signed-off-by: David Ahern --- net/ipv6/ip6_fib.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c index 5562c77022c6..2a058b408a6a 100644 --- a/net/ipv6/ip6_fib.c +++ b/net/ipv6/ip6_fib.c @@ -586,10 +586,13 @@ static int inet6_dump_fib(struct sk_buff *skb, struct netlink_callback *cb) } else if (nlmsg_len(nlh) >= sizeof(struct rtmsg)) { struct rtmsg *rtm = nlmsg_data(nlh); - if (rtm->rtm_flags & RTM_F_PREFIX) - arg.filter.flags = RTM_F_PREFIX; + arg.filter.flags = rtm->rtm_flags & (RTM_F_PREFIX|RTM_F_CLONED); } + /* fib entries are never clones */ + if (arg.filter.flags & RTM_F_CLONED) + return skb->len; + w = (void *)cb->args[2]; if (!w) { /* New dump: -- 2.11.0
[PATCH v2 net-next 09/11] net/mpls: Handle kernel side filtering of route dumps
From: David Ahern Update the dump request parsing in MPLS for the non-INET case to enable kernel side filtering. If INET is disabled the only filters that make sense for MPLS are protocol and nexthop device. Signed-off-by: David Ahern --- net/mpls/af_mpls.c | 33 - 1 file changed, 28 insertions(+), 5 deletions(-) diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c index 24381696932a..7d55d4c04088 100644 --- a/net/mpls/af_mpls.c +++ b/net/mpls/af_mpls.c @@ -2044,7 +2044,9 @@ static int mpls_valid_fib_dump_req(struct net *net, const struct nlmsghdr *nlh, struct netlink_callback *cb) { struct netlink_ext_ack *extack = cb->extack; + struct nlattr *tb[RTA_MAX + 1]; struct rtmsg *rtm; + int err, i; if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*rtm))) { NL_SET_ERR_MSG_MOD(extack, "Invalid header for FIB dump request"); @@ -2053,15 +2055,36 @@ static int mpls_valid_fib_dump_req(struct net *net, const struct nlmsghdr *nlh, rtm = nlmsg_data(nlh); if (rtm->rtm_dst_len || rtm->rtm_src_len || rtm->rtm_tos || - rtm->rtm_table || rtm->rtm_protocol || rtm->rtm_scope || - rtm->rtm_type|| rtm->rtm_flags) { + rtm->rtm_table || rtm->rtm_scope|| rtm->rtm_type || + rtm->rtm_flags) { NL_SET_ERR_MSG_MOD(extack, "Invalid values in header for FIB dump request"); return -EINVAL; } - if (nlmsg_attrlen(nlh, sizeof(*rtm))) { - NL_SET_ERR_MSG_MOD(extack, "Invalid data after header in FIB dump request"); - return -EINVAL; + if (rtm->rtm_protocol) { + filter->protocol = rtm->rtm_protocol; + filter->filter_set = 1; + cb->answer_flags = NLM_F_DUMP_FILTERED; + } + + err = nlmsg_parse_strict(nlh, sizeof(*rtm), tb, RTA_MAX, +rtm_mpls_policy, extack); + if (err < 0) + return err; + + for (i = 0; i <= RTA_MAX; ++i) { + int ifindex; + + if (i == RTA_OIF) { + ifindex = nla_get_u32(tb[i]); + filter->dev = __dev_get_by_index(net, ifindex); + if (!filter->dev) + return -ENODEV; + filter->filter_set = 1; + } else if (tb[i]) { + NL_SET_ERR_MSG_MOD(extack, "Unsupported attribute in dump request"); + return 
-EINVAL; + } } return 0; -- 2.11.0
[PATCH v2 net-next 04/11] net/ipv6: Plumb support for filtering route dumps
From: David Ahern Implement kernel side filtering of routes by table id, egress device index, protocol, and route type. If the table id is given in the filter, lookup the table and call fib6_dump_table directly for it. Move the existing route flags check for prefix only routes to the new filter. Signed-off-by: David Ahern --- net/ipv6/ip6_fib.c | 28 ++-- net/ipv6/route.c | 40 2 files changed, 54 insertions(+), 14 deletions(-) diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c index 94e61fe47ff8..a51fc357a05c 100644 --- a/net/ipv6/ip6_fib.c +++ b/net/ipv6/ip6_fib.c @@ -583,10 +583,12 @@ static int inet6_dump_fib(struct sk_buff *skb, struct netlink_callback *cb) err = ip_valid_fib_dump_req(net, nlh, , cb->extack); if (err < 0) return err; - } + } else if (nlmsg_len(nlh) >= sizeof(struct rtmsg)) { + struct rtmsg *rtm = nlmsg_data(nlh); - s_h = cb->args[0]; - s_e = cb->args[1]; + if (rtm->rtm_flags & RTM_F_PREFIX) + arg.filter.flags = RTM_F_PREFIX; + } w = (void *)cb->args[2]; if (!w) { @@ -612,6 +614,20 @@ static int inet6_dump_fib(struct sk_buff *skb, struct netlink_callback *cb) arg.net = net; w->args = + if (arg.filter.table_id) { + tb = fib6_get_table(net, arg.filter.table_id); + if (!tb) { + NL_SET_ERR_MSG_MOD(cb->extack, "FIB table does not exist"); + return -ENOENT; + } + + res = fib6_dump_table(tb, skb, cb); + goto out; + } + + s_h = cb->args[0]; + s_e = cb->args[1]; + rcu_read_lock(); for (h = s_h; h < FIB6_TABLE_HASHSZ; h++, s_e = 0) { e = 0; @@ -621,16 +637,16 @@ static int inet6_dump_fib(struct sk_buff *skb, struct netlink_callback *cb) goto next; res = fib6_dump_table(tb, skb, cb); if (res != 0) - goto out; + goto out_unlock; next: e++; } } -out: +out_unlock: rcu_read_unlock(); cb->args[1] = e; cb->args[0] = h; - +out: res = res < 0 ? 
res : skb->len; if (res <= 0) fib6_dump_end(cb); diff --git a/net/ipv6/route.c b/net/ipv6/route.c index f4e08b0689a8..9fd600e42f9d 100644 --- a/net/ipv6/route.c +++ b/net/ipv6/route.c @@ -4767,28 +4767,52 @@ static int rt6_fill_node(struct net *net, struct sk_buff *skb, return -EMSGSIZE; } +static bool fib6_info_uses_dev(const struct fib6_info *f6i, + const struct net_device *dev) +{ + if (f6i->fib6_nh.nh_dev == dev) + return true; + + if (f6i->fib6_nsiblings) { + struct fib6_info *sibling, *next_sibling; + + list_for_each_entry_safe(sibling, next_sibling, +>fib6_siblings, fib6_siblings) { + if (sibling->fib6_nh.nh_dev == dev) + return true; + } + } + + return false; +} + int rt6_dump_route(struct fib6_info *rt, void *p_arg) { struct rt6_rtnl_dump_arg *arg = (struct rt6_rtnl_dump_arg *) p_arg; + struct fib_dump_filter *filter = >filter; + unsigned int flags = NLM_F_MULTI; struct net *net = arg->net; if (rt == net->ipv6.fib6_null_entry) return 0; - if (nlmsg_len(arg->cb->nlh) >= sizeof(struct rtmsg)) { - struct rtmsg *rtm = nlmsg_data(arg->cb->nlh); - - /* user wants prefix routes only */ - if (rtm->rtm_flags & RTM_F_PREFIX && - !(rt->fib6_flags & RTF_PREFIX_RT)) { - /* success since this is not a prefix route */ + if ((filter->flags & RTM_F_PREFIX) && + !(rt->fib6_flags & RTF_PREFIX_RT)) { + /* success since this is not a prefix route */ + return 1; + } + if (filter->filter_set) { + if ((filter->rt_type && rt->fib6_type != filter->rt_type) || + (filter->dev && !fib6_info_uses_dev(rt, filter->dev)) || + (filter->protocol && rt->fib6_protocol != filter->protocol)) { return 1; } + flags |= NLM_F_DUMP_FILTERED; } return rt6_fill_node(net, arg->skb, rt, NULL, NULL, NULL, 0, RTM_NEWROUTE, NETLINK_CB(arg->cb->skb).portid, -arg->cb->nlh->nlmsg_seq, NLM_F_MULTI); +arg->cb->nlh->nlmsg_seq, flags); } static int inet6_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh, -- 2.11.0
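The per-route decision added to rt6_dump_route() above boils down to a predicate plus an NLM_F_DUMP_FILTERED flags tweak. A sketch of that logic in Python (the dict field names are illustrative only, not kernel API; flag values are from <linux/netlink.h>):

```python
NLM_F_MULTI         = 0x02
NLM_F_DUMP_FILTERED = 0x20

def dump_decision(route, flt):
    """Mirror of the filtering added to rt6_dump_route(): return
    (emit, nlmsg_flags). An empty flt dict means filter_set is false."""
    flags = NLM_F_MULTI
    if flt.get("rt_type") and route["type"] != flt["rt_type"]:
        return False, flags          # wrong route type: skip, count as success
    if flt.get("dev") and flt["dev"] not in route["devs"]:
        return False, flags          # no nexthop (or sibling) uses the device
    if flt.get("protocol") and route["protocol"] != flt["protocol"]:
        return False, flags          # installed by a different daemon
    if flt:
        flags |= NLM_F_DUMP_FILTERED # tell userspace results were filtered
    return True, flags
```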
[PATCH v2 net-next 02/11] net: Add struct for fib dump filter
From: David Ahern Add struct fib_dump_filter for options on limiting which routes are returned in a dump request. The current list is table id, protocol, route type, rtm_flags and nexthop device index. struct net is needed to lookup the net_device from the index. Declare the filter for each route dump handler and plumb the new arguments from dump handlers to ip_valid_fib_dump_req. Signed-off-by: David Ahern --- include/net/ip6_route.h | 1 + include/net/ip_fib.h| 13 - net/ipv4/fib_frontend.c | 6 -- net/ipv4/ipmr.c | 6 +- net/ipv6/ip6_fib.c | 5 +++-- net/ipv6/ip6mr.c| 5 - net/mpls/af_mpls.c | 12 7 files changed, 37 insertions(+), 11 deletions(-) diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h index cef186dbd2ce..7ab119936e69 100644 --- a/include/net/ip6_route.h +++ b/include/net/ip6_route.h @@ -174,6 +174,7 @@ struct rt6_rtnl_dump_arg { struct sk_buff *skb; struct netlink_callback *cb; struct net *net; + struct fib_dump_filter filter; }; int rt6_dump_route(struct fib6_info *f6i, void *p_arg); diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h index 852e4ebf2209..667013bf4266 100644 --- a/include/net/ip_fib.h +++ b/include/net/ip_fib.h @@ -222,6 +222,16 @@ struct fib_table { unsigned long __data[0]; }; +struct fib_dump_filter { + u32 table_id; + /* filter_set is an optimization that an entry is set */ + boolfilter_set; + unsigned char protocol; + unsigned char rt_type; + unsigned intflags; + struct net_device *dev; +}; + int fib_table_lookup(struct fib_table *tb, const struct flowi4 *flp, struct fib_result *res, int fib_flags); int fib_table_insert(struct net *, struct fib_table *, struct fib_config *, @@ -453,6 +463,7 @@ static inline void fib_proc_exit(struct net *net) u32 ip_mtu_from_fib_result(struct fib_result *res, __be32 daddr); -int ip_valid_fib_dump_req(const struct nlmsghdr *nlh, +int ip_valid_fib_dump_req(struct net *net, const struct nlmsghdr *nlh, + struct fib_dump_filter *filter, struct netlink_ext_ack *extack); #endif /* _NET_FIB_H 
*/ diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c index 0f1beceb47d5..850850dd80e1 100644 --- a/net/ipv4/fib_frontend.c +++ b/net/ipv4/fib_frontend.c @@ -802,7 +802,8 @@ static int inet_rtm_newroute(struct sk_buff *skb, struct nlmsghdr *nlh, return err; } -int ip_valid_fib_dump_req(const struct nlmsghdr *nlh, +int ip_valid_fib_dump_req(struct net *net, const struct nlmsghdr *nlh, + struct fib_dump_filter *filter, struct netlink_ext_ack *extack) { struct rtmsg *rtm; @@ -837,6 +838,7 @@ static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb) { const struct nlmsghdr *nlh = cb->nlh; struct net *net = sock_net(skb->sk); + struct fib_dump_filter filter = {}; unsigned int h, s_h; unsigned int e = 0, s_e; struct fib_table *tb; @@ -844,7 +846,7 @@ static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb) int dumped = 0, err; if (cb->strict_check) { - err = ip_valid_fib_dump_req(nlh, cb->extack); + err = ip_valid_fib_dump_req(net, nlh, , cb->extack); if (err < 0) return err; } diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c index 91b0d5671649..44d777058960 100644 --- a/net/ipv4/ipmr.c +++ b/net/ipv4/ipmr.c @@ -2527,9 +2527,13 @@ static int ipmr_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh, static int ipmr_rtm_dumproute(struct sk_buff *skb, struct netlink_callback *cb) { + struct fib_dump_filter filter = {}; + if (cb->strict_check) { - int err = ip_valid_fib_dump_req(cb->nlh, cb->extack); + int err; + err = ip_valid_fib_dump_req(sock_net(skb->sk), cb->nlh, + , cb->extack); if (err < 0) return err; } diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c index 0783af11b0b7..94e61fe47ff8 100644 --- a/net/ipv6/ip6_fib.c +++ b/net/ipv6/ip6_fib.c @@ -569,17 +569,18 @@ static int inet6_dump_fib(struct sk_buff *skb, struct netlink_callback *cb) { const struct nlmsghdr *nlh = cb->nlh; struct net *net = sock_net(skb->sk); + struct rt6_rtnl_dump_arg arg = {}; unsigned int h, s_h; unsigned int e = 0, s_e; - struct 
rt6_rtnl_dump_arg arg; struct fib6_walker *w; struct fib6_table *tb; struct hlist_head *head; int res = 0; if (cb->strict_check) { - int err = ip_valid_fib_dump_req(nlh, cb->extack); +
[PATCH v2 net-next 08/11] net: Enable kernel side filtering of route dumps
From: David Ahern Update parsing of route dump request to enable kernel side filtering. Allow filtering results by protocol (e.g., which routing daemon installed the route), route type (e.g., unicast), table id and nexthop device. These amount to the low hanging fruit, yet a huge improvement, for dumping routes. ip_valid_fib_dump_req is called with RTNL held, so __dev_get_by_index can be used to look up the device index without taking a reference. From there filter->dev is only used during dump loops with the lock still held. Set NLM_F_DUMP_FILTERED in the answer_flags so the user knows the results have been filtered should no entries be returned. Signed-off-by: David Ahern --- include/net/ip_fib.h| 2 +- net/ipv4/fib_frontend.c | 51 ++--- net/ipv4/ipmr.c | 2 +- net/ipv6/ip6_fib.c | 2 +- net/ipv6/ip6mr.c| 2 +- net/mpls/af_mpls.c | 9 + 6 files changed, 53 insertions(+), 15 deletions(-) diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h index 1eabc9edd2b9..e8d9456bf36e 100644 --- a/include/net/ip_fib.h +++ b/include/net/ip_fib.h @@ -465,5 +465,5 @@ u32 ip_mtu_from_fib_result(struct fib_result *res, __be32 daddr); int ip_valid_fib_dump_req(struct net *net, const struct nlmsghdr *nlh, struct fib_dump_filter *filter, - struct netlink_ext_ack *extack); + struct netlink_callback *cb); #endif /* _NET_FIB_H */ diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c index 37dc8ac366fd..e86ca2255181 100644 --- a/net/ipv4/fib_frontend.c +++ b/net/ipv4/fib_frontend.c @@ -804,9 +804,14 @@ static int inet_rtm_newroute(struct sk_buff *skb, struct nlmsghdr *nlh, int ip_valid_fib_dump_req(struct net *net, const struct nlmsghdr *nlh, struct fib_dump_filter *filter, - struct netlink_ext_ack *extack) + struct netlink_callback *cb) { + struct netlink_ext_ack *extack = cb->extack; + struct nlattr *tb[RTA_MAX + 1]; struct rtmsg *rtm; + int err, i; + + ASSERT_RTNL(); if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*rtm))) { NL_SET_ERR_MSG(extack, "Invalid header for FIB dump 
request"); @@ -815,8 +820,7 @@ int ip_valid_fib_dump_req(struct net *net, const struct nlmsghdr *nlh, rtm = nlmsg_data(nlh); if (rtm->rtm_dst_len || rtm->rtm_src_len || rtm->rtm_tos || - rtm->rtm_table || rtm->rtm_protocol || rtm->rtm_scope || - rtm->rtm_type) { + rtm->rtm_scope) { NL_SET_ERR_MSG(extack, "Invalid values in header for FIB dump request"); return -EINVAL; } @@ -825,9 +829,42 @@ int ip_valid_fib_dump_req(struct net *net, const struct nlmsghdr *nlh, return -EINVAL; } - if (nlmsg_attrlen(nlh, sizeof(*rtm))) { - NL_SET_ERR_MSG(extack, "Invalid data after header in FIB dump request"); - return -EINVAL; + filter->flags= rtm->rtm_flags; + filter->protocol = rtm->rtm_protocol; + filter->rt_type = rtm->rtm_type; + filter->table_id = rtm->rtm_table; + + err = nlmsg_parse_strict(nlh, sizeof(*rtm), tb, RTA_MAX, +rtm_ipv4_policy, extack); + if (err < 0) + return err; + + for (i = 0; i <= RTA_MAX; ++i) { + int ifindex; + + if (!tb[i]) + continue; + + switch (i) { + case RTA_TABLE: + filter->table_id = nla_get_u32(tb[i]); + break; + case RTA_OIF: + ifindex = nla_get_u32(tb[i]); + filter->dev = __dev_get_by_index(net, ifindex); + if (!filter->dev) + return -ENODEV; + break; + default: + NL_SET_ERR_MSG(extack, "Unsupported attribute in dump request"); + return -EINVAL; + } + } + + if (filter->flags || filter->protocol || filter->rt_type || + filter->table_id || filter->dev) { + filter->filter_set = 1; + cb->answer_flags = NLM_F_DUMP_FILTERED; } return 0; @@ -846,7 +883,7 @@ static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb) int dumped = 0, err; if (cb->strict_check) { - err = ip_valid_fib_dump_req(net, nlh, , cb->extack); + err = ip_valid_fib_dump_req(net, nlh, , cb); if (err < 0) return err; } diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c index 3fa988e6a3df..7a3e2acda94c 100644 --- a/net/ipv4/ipmr.c +++ b/net/ipv4/ipmr.c @@ -2532,7 +2532,7 @@ static int ipmr_rtm_dumproute(struct sk_buff *skb,
[PATCH v2 net-next 00/11] net: Kernel side filtering for route dumps
From: David Ahern Implement kernel side filtering of route dumps by protocol (e.g., which routing daemon installed the route), route type (e.g., unicast), table id and nexthop device. iproute2 has been doing this filtering in userspace for years; pushing the filters to the kernel side reduces the amount of data the kernel sends and reduces wasted cycles on both sides processing unwanted data. These initial options provide a huge improvement for efficiently examining routes on large scale systems. v2 - better handling of requests for a specific table. Rather than walking the hash of all tables, lookup the specific table and dump it - refactor mr_rtm_dumproute moving the loop over the table into a helper that can be invoked directly - add hook to return NLM_F_DUMP_FILTERED in DONE message to ensure it is returned even when the dump returns nothing David Ahern (11): netlink: Add answer_flags to netlink_callback net: Add struct for fib dump filter net/ipv4: Plumb support for filtering route dumps net/ipv6: Plumb support for filtering route dumps net/mpls: Plumb support for filtering route dumps ipmr: Refactor mr_rtm_dumproute net: Plumb support for filtering ipv4 and ipv6 multicast route dumps net: Enable kernel side filtering of route dumps net/mpls: Handle kernel side filtering of route dumps net/ipv6: Bail early if user only wants cloned entries net/ipv4: Bail early if user only wants prefix entries include/linux/mroute_base.h | 11 +++- include/linux/netlink.h | 1 + include/net/ip6_route.h | 1 + include/net/ip_fib.h| 17 -- net/ipv4/fib_frontend.c | 76 ++ net/ipv4/fib_trie.c | 37 + net/ipv4/ipmr.c | 22 ++-- net/ipv4/ipmr_base.c| 126 net/ipv6/ip6_fib.c | 34 +--- net/ipv6/ip6mr.c| 21 ++-- net/ipv6/route.c| 40 +++--- net/mpls/af_mpls.c | 92 +++- net/netlink/af_netlink.c| 3 +- 13 files changed, 386 insertions(+), 95 deletions(-) -- 2.11.0
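To illustrate what the strict-check parser in this series accepts, here is a hypothetical userspace sketch (Python, values hardcoded for illustration; constants and struct layouts taken from the uapi headers <linux/netlink.h> and <linux/rtnetlink.h>) that packs a filtered RTM_GETROUTE dump request: filters travel in the rtm_protocol/rtm_type/rtm_flags/rtm_table header fields and in RTA_TABLE/RTA_OIF attributes:

```python
import struct

# netlink / rtnetlink constants (from the uapi headers)
NLM_F_REQUEST = 0x01
NLM_F_DUMP    = 0x300          # NLM_F_ROOT | NLM_F_MATCH
RTM_GETROUTE  = 26
AF_INET       = 2
RTPROT_STATIC = 4
RTA_OIF       = 4
RTA_TABLE     = 15

def rtattr_u32(rta_type, value):
    """Encode one u32 rtattr; a u32 payload is already 4-byte aligned."""
    return struct.pack("=HHI", 8, rta_type, value)  # rta_len = 4 hdr + 4 data

def filtered_dump_request(table=254, oif=2, protocol=RTPROT_STATIC, seq=1):
    """Build the bytes of an RTM_GETROUTE dump request that the new
    strict-check parser would accept (illustrative, not sent anywhere)."""
    # struct rtmsg: 8 unsigned chars followed by a u32 rtm_flags
    rtm = struct.pack("=8BI",
                      AF_INET,   # rtm_family
                      0, 0, 0,   # rtm_dst_len, rtm_src_len, rtm_tos: must be 0
                      0,         # rtm_table (RTA_TABLE attribute used instead)
                      protocol,  # rtm_protocol: filter by installing daemon
                      0,         # rtm_scope: must be 0
                      0,         # rtm_type: 0 = any
                      0)         # rtm_flags
    payload = rtm + rtattr_u32(RTA_TABLE, table) + rtattr_u32(RTA_OIF, oif)
    # struct nlmsghdr: len, type, flags, seq, pid
    nlh = struct.pack("=IHHII", 16 + len(payload), RTM_GETROUTE,
                      NLM_F_REQUEST | NLM_F_DUMP, seq, 0)
    return nlh + payload
```

Any other header field or attribute causes the kernel to reject the dump with EINVAL, which is what lets old non-filtering requesters keep working while new ones opt in.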
[PATCH v2 net-next 03/11] net/ipv4: Plumb support for filtering route dumps
From: David Ahern Implement kernel side filtering of routes by table id, egress device index, protocol and route type. If the table id is given in the filter, lookup the table and call fib_table_dump directly for it. Signed-off-by: David Ahern --- include/net/ip_fib.h| 2 +- net/ipv4/fib_frontend.c | 13 - net/ipv4/fib_trie.c | 37 ++--- 3 files changed, 39 insertions(+), 13 deletions(-) diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h index 667013bf4266..1eabc9edd2b9 100644 --- a/include/net/ip_fib.h +++ b/include/net/ip_fib.h @@ -239,7 +239,7 @@ int fib_table_insert(struct net *, struct fib_table *, struct fib_config *, int fib_table_delete(struct net *, struct fib_table *, struct fib_config *, struct netlink_ext_ack *extack); int fib_table_dump(struct fib_table *table, struct sk_buff *skb, - struct netlink_callback *cb); + struct netlink_callback *cb, struct fib_dump_filter *filter); int fib_table_flush(struct net *net, struct fib_table *table); struct fib_table *fib_trie_unmerge(struct fib_table *main_tb); void fib_table_flush_external(struct fib_table *table); diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c index 850850dd80e1..37dc8ac366fd 100644 --- a/net/ipv4/fib_frontend.c +++ b/net/ipv4/fib_frontend.c @@ -855,6 +855,17 @@ static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb) ((struct rtmsg *)nlmsg_data(nlh))->rtm_flags & RTM_F_CLONED) return skb->len; + if (filter.table_id) { + tb = fib_get_table(net, filter.table_id); + if (!tb) { + NL_SET_ERR_MSG(cb->extack, "ipv4: FIB table does not exist"); + return -ENOENT; + } + + err = fib_table_dump(tb, skb, cb, ); + return skb->len ? 
: err; + } + s_h = cb->args[0]; s_e = cb->args[1]; @@ -869,7 +880,7 @@ static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb) if (dumped) memset(>args[2], 0, sizeof(cb->args) - 2 * sizeof(cb->args[0])); - err = fib_table_dump(tb, skb, cb); + err = fib_table_dump(tb, skb, cb, ); if (err < 0) { if (likely(skb->len)) goto out; diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c index 5bc0c89e81e4..237c9f72b265 100644 --- a/net/ipv4/fib_trie.c +++ b/net/ipv4/fib_trie.c @@ -2003,12 +2003,17 @@ void fib_free_table(struct fib_table *tb) } static int fn_trie_dump_leaf(struct key_vector *l, struct fib_table *tb, -struct sk_buff *skb, struct netlink_callback *cb) +struct sk_buff *skb, struct netlink_callback *cb, +struct fib_dump_filter *filter) { + unsigned int flags = NLM_F_MULTI; __be32 xkey = htonl(l->key); struct fib_alias *fa; int i, s_i; + if (filter->filter_set) + flags |= NLM_F_DUMP_FILTERED; + s_i = cb->args[4]; i = 0; @@ -2016,25 +2021,35 @@ static int fn_trie_dump_leaf(struct key_vector *l, struct fib_table *tb, hlist_for_each_entry_rcu(fa, >leaf, fa_list) { int err; - if (i < s_i) { - i++; - continue; - } + if (i < s_i) + goto next; - if (tb->tb_id != fa->tb_id) { - i++; - continue; + if (tb->tb_id != fa->tb_id) + goto next; + + if (filter->filter_set) { + if (filter->rt_type && fa->fa_type != filter->rt_type) + goto next; + + if ((filter->protocol && +fa->fa_info->fib_protocol != filter->protocol)) + goto next; + + if (filter->dev && + !fib_info_nh_uses_dev(fa->fa_info, filter->dev)) + goto next; } err = fib_dump_info(skb, NETLINK_CB(cb->skb).portid, cb->nlh->nlmsg_seq, RTM_NEWROUTE, tb->tb_id, fa->fa_type, xkey, KEYLENGTH - fa->fa_slen, - fa->fa_tos, fa->fa_info, NLM_F_MULTI); + fa->fa_tos, fa->fa_info, flags); if (err < 0) { cb->args[4] = i; return err; } +next: i++; } @@ -2044,7 +2059,7 @@ static int
Re: [PATCH V2 net-next 5/5] ptp: Add a driver for InES time stamping IP core.
On Sun, Oct 07, 2018 at 10:38:23AM -0700, Richard Cochran wrote: > The InES at the ZHAW offers a PTP time stamping IP core. The FPGA > logic recognizes and time stamps PTP frames on the MII bus. This > patch adds a driver for the core along with a device tree binding to > allow hooking the driver to MII buses. > > Signed-off-by: Richard Cochran > --- > Documentation/devicetree/bindings/ptp/ptp-ines.txt | 37 + Bindings should be separate patch. > drivers/ptp/Kconfig| 10 + > drivers/ptp/Makefile | 1 + > drivers/ptp/ptp_ines.c | 870 > + > 4 files changed, 918 insertions(+) > create mode 100644 Documentation/devicetree/bindings/ptp/ptp-ines.txt > create mode 100644 drivers/ptp/ptp_ines.c > > diff --git a/Documentation/devicetree/bindings/ptp/ptp-ines.txt > b/Documentation/devicetree/bindings/ptp/ptp-ines.txt > new file mode 100644 > index ..1484b62802c7 > --- /dev/null > +++ b/Documentation/devicetree/bindings/ptp/ptp-ines.txt > @@ -0,0 +1,37 @@ > +ZHAW InES PTP time stamping IP core > + > +The IP core needs two different kinds of nodes. The control node > +lives somewhere in the memory map and specifies the address of the > +control registers. There can be up to three port handles placed as > +attributes of PHY nodes. These associate a particular MII bus with a > +port index within the IP core. > + > +Required properties of the control node: > + > +- compatible:"ines,ptp-ctrl" ines is not registered vendor prefix. Should it be 'zhaw' instead? > +- reg: physical address and size of the register bank > +- #phandle-cells:must be one (1) #timestamper-cells Or if it is always 1, you could omit it. > + > +Required format of the port handle within the PHY node: > + > +- timestamper: provides control node reference and > + the port channel within the IP core This and #timestamper-cells need to be in a common binding doc. And bonus points if you add a check in dtc for this. Should be a one-liner. 
> + > +Example: > + > + tstamper: timestamper@6000 { > + compatible = "ines,ptp-ctrl"; > + reg = <0x6000 0x80>; > + #phandle-cells = <1>; > + }; > + > + ethernet@8000 { > + ... > + mdio { > + ... > + phy@3 { > + ... > + timestamper = <&tstamper 0>; > + }; > + }; > + };
Re: [PATCH v2 net-next] net/ipv6: Add knob to skip DELROUTE message on device down
From: David Ahern Date: Thu, 11 Oct 2018 20:17:21 -0700 > From: David Ahern > > Another difference between IPv4 and IPv6 is the generation of RTM_DELROUTE > notifications when a device is taken down (admin down) or deleted. IPv4 > does not generate a message for routes evicted by the down or delete; > IPv6 does. A NOS at scale really needs to avoid these messages and have > IPv4 and IPv6 behave similarly, relying on userspace to handle link > notifications and evict the routes. > > At this point existing user behavior needs to be preserved. Since > notifications are a global action (not per app) the only way to preserve > existing behavior and allow the messages to be skipped is to add a new > sysctl (net/ipv6/route/skip_notify_on_dev_down) which can be set to > disable the notifications. > > IPv6 route code already supports the option to skip the message (it is > used for multipath routes for example). Besides the new sysctl we need > to pass the skip_notify setting through the generic fib6_clean and > fib6_walk functions to fib6_clean_node and to set skip_notify on calls > to __ip_del_rt for the addrconf_ifdown path. > > Signed-off-by: David Ahern > --- > v2 > - removed the changes to addrconf and anycast. addrconf_ifdown calls > rt6_disable_ip which calls rt6_sync_down_dev. The last one evicts all > routes for the device, so the delete route calls done later in addrconf > and anycast are superfluous Applied.
[PATCH v2 net-next] net/ipv6: Add knob to skip DELROUTE message on device down
From: David Ahern Another difference between IPv4 and IPv6 is the generation of RTM_DELROUTE notifications when a device is taken down (admin down) or deleted. IPv4 does not generate a message for routes evicted by the down or delete; IPv6 does. A NOS at scale really needs to avoid these messages and have IPv4 and IPv6 behave similarly, relying on userspace to handle link notifications and evict the routes. At this point existing user behavior needs to be preserved. Since notifications are a global action (not per app) the only way to preserve existing behavior and allow the messages to be skipped is to add a new sysctl (net/ipv6/route/skip_notify_on_dev_down) which can be set to disable the notifications. IPv6 route code already supports the option to skip the message (it is used for multipath routes for example). Besides the new sysctl we need to pass the skip_notify setting through the generic fib6_clean and fib6_walk functions to fib6_clean_node and to set skip_notify on calls to __ip_del_rt for the addrconf_ifdown path. Signed-off-by: David Ahern --- v2 - removed the changes to addrconf and anycast. addrconf_ifdown calls rt6_disable_ip which calls rt6_sync_down_dev. The last one evicts all routes for the device, so the delete route calls done later in addrconf and anycast are superfluous Documentation/networking/ip-sysctl.txt | 8 include/net/ip6_fib.h | 3 +++ include/net/netns/ipv6.h | 1 + net/ipv6/ip6_fib.c | 20 +++- net/ipv6/route.c | 20 +++- 5 files changed, 46 insertions(+), 6 deletions(-) diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 960de8fe3f40..163b5ff1073c 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -1442,6 +1442,14 @@ max_hbh_length - INTEGER header. Default: INT_MAX (unlimited) +skip_notify_on_dev_down - BOOLEAN + Controls whether an RTM_DELROUTE message is generated for routes + removed when a device is taken down or deleted. 
IPv4 does not + generate this message; IPv6 does by default. Setting this sysctl + to true skips the message, making IPv4 and IPv6 on par in relying + on userspace caches to track link events and evict routes. + Default: false (generate message) + IPv6 Fragmentation: ip6frag_high_thresh - INTEGER diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h index f06e968f1992..caabfd84a098 100644 --- a/include/net/ip6_fib.h +++ b/include/net/ip6_fib.h @@ -407,6 +407,9 @@ struct fib6_node *fib6_locate(struct fib6_node *root, void fib6_clean_all(struct net *net, int (*func)(struct fib6_info *, void *arg), void *arg); +void fib6_clean_all_skip_notify(struct net *net, + int (*func)(struct fib6_info *, void *arg), + void *arg); int fib6_add(struct fib6_node *root, struct fib6_info *rt, struct nl_info *info, struct netlink_ext_ack *extack); diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h index f0e396ab9bec..ef1ed529f33c 100644 --- a/include/net/netns/ipv6.h +++ b/include/net/netns/ipv6.h @@ -45,6 +45,7 @@ struct netns_sysctl_ipv6 { int max_dst_opts_len; int max_hbh_opts_len; int seg6_flowlabel; + bool skip_notify_on_dev_down; }; struct netns_ipv6 { diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c index e14d244c551f..9ba72d94d60f 100644 --- a/net/ipv6/ip6_fib.c +++ b/net/ipv6/ip6_fib.c @@ -47,6 +47,7 @@ struct fib6_cleaner { int (*func)(struct fib6_info *, void *arg); int sernum; void *arg; + bool skip_notify; }; #ifdef CONFIG_IPV6_SUBTREES @@ -1956,6 +1957,7 @@ static int fib6_clean_node(struct fib6_walker *w) struct fib6_cleaner *c = container_of(w, struct fib6_cleaner, w); struct nl_info info = { .nl_net = c->net, + .skip_notify = c->skip_notify, }; if (c->sernum != FIB6_NO_SERNUM_CHANGE && @@ -2007,7 +2009,7 @@ static int fib6_clean_node(struct fib6_walker *w) static void fib6_clean_tree(struct net *net, struct fib6_node *root, int (*func)(struct fib6_info *, void *arg), - int sernum, void *arg) + int sernum, void *arg, bool skip_notify) { struct 
fib6_cleaner c; @@ -2019,13 +2021,14 @@ static void fib6_clean_tree(struct net *net, struct fib6_node *root, c.sernum = sernum; c.arg = arg; c.net = net; + c.skip_notify = skip_notify; fib6_walk(net, &c.w); } static void __fib6_clean_all(struct net *net, int (*func)(struct fib6_info *, void *), -int sernum, void *arg) +int sernum, void *arg, bool skip_notify) {
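For completeness, this is how an operator would flip the new knob on a kernel carrying this patch; a usage sketch (only the sysctl path comes from the commit message above — the drop-in file name is made up):

```
# at runtime: skip RTM_DELROUTE on device down/delete (IPv6)
sysctl -w net.ipv6.route.skip_notify_on_dev_down=1

# or persistently, e.g. in a hypothetical /etc/sysctl.d/90-ipv6-routes.conf
net.ipv6.route.skip_notify_on_dev_down = 1
```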
Re: [PATCH V2 net-next 00/12] Improving performance and reducing latencies, by using latest capabilities exposed in ENA device
From: Date: Thu, 11 Oct 2018 11:26:15 +0300 > From: Arthur Kiyanovski > > This patchset introduces the following: > 1. A new placement policy of Tx headers and descriptors, which takes > advantage of an option to place headers + descriptors in device memory > space. This is sometimes referred to as LLQ - low latency queue. > The patch set defines the admin capability, maps the device memory as > write-combined, and adds a mode in transmit datapath to do header + > descriptor placement on the device. > 2. Support for RX checksum offloading > 3. Miscellaneous small improvements and code cleanups > > Note: V1 of this patchset was created as if patches e2a322a 248ab77 > from net were applied to net-next before applying the patchset. This V2 > version does not assume this, and should be applied directly on net-next > without the aforementioned patches. Series applied.
[PATCH V2 net-next 11/12] net: ena: update driver version to 2.0.1
From: Arthur Kiyanovski Signed-off-by: Arthur Kiyanovski --- drivers/net/ethernet/amazon/ena/ena_netdev.h | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.h b/drivers/net/ethernet/amazon/ena/ena_netdev.h index d241dfc..5218736 100644 --- a/drivers/net/ethernet/amazon/ena/ena_netdev.h +++ b/drivers/net/ethernet/amazon/ena/ena_netdev.h @@ -43,9 +43,9 @@ #include "ena_com.h" #include "ena_eth_com.h" -#define DRV_MODULE_VER_MAJOR 1 -#define DRV_MODULE_VER_MINOR 5 -#define DRV_MODULE_VER_SUBMINOR 0 +#define DRV_MODULE_VER_MAJOR 2 +#define DRV_MODULE_VER_MINOR 0 +#define DRV_MODULE_VER_SUBMINOR 1 #define DRV_MODULE_NAME "ena" #ifndef DRV_MODULE_VERSION -- 2.7.4
[PATCH V2 net-next 10/12] net: ena: remove redundant parameter in ena_com_admin_init()
From: Arthur Kiyanovski Remove redundant spinlock acquire parameter from ena_com_admin_init() Signed-off-by: Arthur Kiyanovski --- drivers/net/ethernet/amazon/ena/ena_com.c| 6 ++ drivers/net/ethernet/amazon/ena/ena_com.h| 5 + drivers/net/ethernet/amazon/ena/ena_netdev.c | 2 +- 3 files changed, 4 insertions(+), 9 deletions(-) diff --git a/drivers/net/ethernet/amazon/ena/ena_com.c b/drivers/net/ethernet/amazon/ena/ena_com.c index 5c468b2..420cede 100644 --- a/drivers/net/ethernet/amazon/ena/ena_com.c +++ b/drivers/net/ethernet/amazon/ena/ena_com.c @@ -1701,8 +1701,7 @@ void ena_com_mmio_reg_read_request_write_dev_addr(struct ena_com_dev *ena_dev) } int ena_com_admin_init(struct ena_com_dev *ena_dev, - struct ena_aenq_handlers *aenq_handlers, - bool init_spinlock) + struct ena_aenq_handlers *aenq_handlers) { struct ena_com_admin_queue *admin_queue = &ena_dev->admin_queue; u32 aq_caps, acq_caps, dev_sts, addr_low, addr_high; @@ -1728,8 +1727,7 @@ int ena_com_admin_init(struct ena_com_dev *ena_dev, atomic_set(&admin_queue->outstanding_cmds, 0); - if (init_spinlock) - spin_lock_init(&admin_queue->q_lock); + spin_lock_init(&admin_queue->q_lock); ret = ena_com_init_comp_ctxt(admin_queue); if (ret) diff --git a/drivers/net/ethernet/amazon/ena/ena_com.h b/drivers/net/ethernet/amazon/ena/ena_com.h index 25af8d0..ae8b485 100644 --- a/drivers/net/ethernet/amazon/ena/ena_com.h +++ b/drivers/net/ethernet/amazon/ena/ena_com.h @@ -436,8 +436,6 @@ void ena_com_mmio_reg_read_request_destroy(struct ena_com_dev *ena_dev); /* ena_com_admin_init - Init the admin and the async queues * @ena_dev: ENA communication layer struct * @aenq_handlers: Those handlers to be called upon event. - * @init_spinlock: Indicate if this method should init the admin spinlock or - * the spinlock was init before (for example, in a case of FLR). * * Initialize the admin submission and completion queues. * Initialize the asynchronous events notification queues. 
@@ -445,8 +443,7 @@ void ena_com_mmio_reg_read_request_destroy(struct ena_com_dev *ena_dev); * @return - 0 on success, negative value on failure. */ int ena_com_admin_init(struct ena_com_dev *ena_dev, - struct ena_aenq_handlers *aenq_handlers, - bool init_spinlock); + struct ena_aenq_handlers *aenq_handlers); /* ena_com_admin_destroy - Destroy the admin and the async events queues. * @ena_dev: ENA communication layer struct diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c b/drivers/net/ethernet/amazon/ena/ena_netdev.c index e71bf82..3494d4a 100644 --- a/drivers/net/ethernet/amazon/ena/ena_netdev.c +++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c @@ -2503,7 +2503,7 @@ static int ena_device_init(struct ena_com_dev *ena_dev, struct pci_dev *pdev, } /* ENA admin level init */ - rc = ena_com_admin_init(ena_dev, &aenq_handlers, true); + rc = ena_com_admin_init(ena_dev, &aenq_handlers); if (rc) { dev_err(dev, "Can not initialize ena admin queue with device\n"); -- 2.7.4
[PATCH V2 net-next 09/12] net: ena: change rx copybreak default to reduce kernel memory pressure
From: Arthur Kiyanovski Improves socket memory utilization when receiving packets larger than 128 bytes (the previous rx copybreak) and smaller than 256 bytes. Signed-off-by: Arthur Kiyanovski --- drivers/net/ethernet/amazon/ena/ena_netdev.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.h b/drivers/net/ethernet/amazon/ena/ena_netdev.h index 0cf35ae..d241dfc 100644 --- a/drivers/net/ethernet/amazon/ena/ena_netdev.h +++ b/drivers/net/ethernet/amazon/ena/ena_netdev.h @@ -81,7 +81,7 @@ #define ENA_DEFAULT_RING_SIZE (1024) #define ENA_TX_WAKEUP_THRESH (MAX_SKB_FRAGS + 2) -#define ENA_DEFAULT_RX_COPYBREAK (128 - NET_IP_ALIGN) +#define ENA_DEFAULT_RX_COPYBREAK (256 - NET_IP_ALIGN) /* limit the buffer size to 600 bytes to handle MTU changes from very * small to very large, in which case the number of buffers per packet -- 2.7.4
[PATCH V2 net-next 07/12] net: ena: explicit casting and initialization, and clearer error handling
From: Arthur Kiyanovski Signed-off-by: Arthur Kiyanovski --- drivers/net/ethernet/amazon/ena/ena_com.c| 39 drivers/net/ethernet/amazon/ena/ena_netdev.c | 5 ++-- drivers/net/ethernet/amazon/ena/ena_netdev.h | 22 3 files changed, 36 insertions(+), 30 deletions(-) diff --git a/drivers/net/ethernet/amazon/ena/ena_com.c b/drivers/net/ethernet/amazon/ena/ena_com.c index 5220c75..5c468b2 100644 --- a/drivers/net/ethernet/amazon/ena/ena_com.c +++ b/drivers/net/ethernet/amazon/ena/ena_com.c @@ -235,7 +235,7 @@ static struct ena_comp_ctx *__ena_com_submit_admin_cmd(struct ena_com_admin_queu tail_masked = admin_queue->sq.tail & queue_size_mask; /* In case of queue FULL */ - cnt = atomic_read(&admin_queue->outstanding_cmds); + cnt = (u16)atomic_read(&admin_queue->outstanding_cmds); if (cnt >= admin_queue->q_depth) { pr_debug("admin queue is full.\n"); admin_queue->stats.out_of_space++; @@ -304,7 +304,7 @@ static struct ena_comp_ctx *ena_com_submit_admin_cmd(struct ena_com_admin_queue struct ena_admin_acq_entry *comp, size_t comp_size_in_bytes) { - unsigned long flags; + unsigned long flags = 0; struct ena_comp_ctx *comp_ctx; spin_lock_irqsave(&admin_queue->q_lock, flags); @@ -332,7 +332,7 @@ static int ena_com_init_io_sq(struct ena_com_dev *ena_dev, memset(&io_sq->desc_addr, 0x0, sizeof(io_sq->desc_addr)); - io_sq->dma_addr_bits = ena_dev->dma_addr_bits; + io_sq->dma_addr_bits = (u8)ena_dev->dma_addr_bits; io_sq->desc_entry_size = (io_sq->direction == ENA_COM_IO_QUEUE_DIRECTION_TX) ? 
sizeof(struct ena_eth_io_tx_desc) : @@ -486,7 +486,7 @@ static void ena_com_handle_admin_completion(struct ena_com_admin_queue *admin_qu /* Go over all the completions */ while ((READ_ONCE(cqe->acq_common_descriptor.flags) & - ENA_ADMIN_ACQ_COMMON_DESC_PHASE_MASK) == phase) { + ENA_ADMIN_ACQ_COMMON_DESC_PHASE_MASK) == phase) { /* Do not read the rest of the completion entry before the * phase bit was validated */ @@ -537,7 +537,8 @@ static int ena_com_comp_status_to_errno(u8 comp_status) static int ena_com_wait_and_process_admin_cq_polling(struct ena_comp_ctx *comp_ctx, struct ena_com_admin_queue *admin_queue) { - unsigned long flags, timeout; + unsigned long flags = 0; + unsigned long timeout; int ret; timeout = jiffies + usecs_to_jiffies(admin_queue->completion_timeout); @@ -736,7 +737,7 @@ static int ena_com_config_llq_info(struct ena_com_dev *ena_dev, static int ena_com_wait_and_process_admin_cq_interrupts(struct ena_comp_ctx *comp_ctx, struct ena_com_admin_queue *admin_queue) { - unsigned long flags; + unsigned long flags = 0; int ret; wait_for_completion_timeout(&comp_ctx->wait_event, @@ -782,7 +783,7 @@ static u32 ena_com_reg_bar_read32(struct ena_com_dev *ena_dev, u16 offset) volatile struct ena_admin_ena_mmio_req_read_less_resp *read_resp = mmio_read->read_resp; u32 mmio_read_reg, ret, i; - unsigned long flags; + unsigned long flags = 0; u32 timeout = mmio_read->reg_read_to; might_sleep(); @@ -1426,7 +1427,7 @@ void ena_com_abort_admin_commands(struct ena_com_dev *ena_dev) void ena_com_wait_for_abort_completion(struct ena_com_dev *ena_dev) { struct ena_com_admin_queue *admin_queue = &ena_dev->admin_queue; - unsigned long flags; + unsigned long flags = 0; spin_lock_irqsave(&admin_queue->q_lock, flags); while (atomic_read(&admin_queue->outstanding_cmds) != 0) { @@ -1470,7 +1471,7 @@ bool ena_com_get_admin_running_state(struct ena_com_dev *ena_dev) void ena_com_set_admin_running_state(struct ena_com_dev *ena_dev, bool state) { struct ena_com_admin_queue *admin_queue = 
&ena_dev->admin_queue; - unsigned long flags; + unsigned long flags = 0; spin_lock_irqsave(&admin_queue->q_lock, flags); ena_dev->admin_queue.running_state = state; @@ -1504,7 +1505,7 @@ int ena_com_set_aenq_config(struct ena_com_dev *ena_dev, u32 groups_flag) } if ((get_resp.u.aenq.supported_groups & groups_flag) != groups_flag) { - pr_warn("Trying to set unsupported aenq events. supported flag: %x asked flag: %x\n", + pr_warn("Trying to set unsupported aenq events. supported flag: 0x%x asked flag: 0x%x\n", get_resp.u.aenq.supported_groups, groups_flag); return -EOPNOTSUPP; } @@ -1652,7 +1653,7 @@ int ena_com_mmio_reg_read_request_init(struct ena_com_dev *ena_dev)
[PATCH V2 net-next 12/12] net: ena: fix indentations in ena_defs for better readability
From: Arthur Kiyanovski Signed-off-by: Arthur Kiyanovski --- drivers/net/ethernet/amazon/ena/ena_admin_defs.h | 334 +- drivers/net/ethernet/amazon/ena/ena_eth_io_defs.h | 223 +++ drivers/net/ethernet/amazon/ena/ena_regs_defs.h | 206 +++-- 3 files changed, 338 insertions(+), 425 deletions(-) diff --git a/drivers/net/ethernet/amazon/ena/ena_admin_defs.h b/drivers/net/ethernet/amazon/ena/ena_admin_defs.h index b439ec1..9f80b73 100644 --- a/drivers/net/ethernet/amazon/ena/ena_admin_defs.h +++ b/drivers/net/ethernet/amazon/ena/ena_admin_defs.h @@ -32,119 +32,81 @@ #ifndef _ENA_ADMIN_H_ #define _ENA_ADMIN_H_ -enum ena_admin_aq_opcode { - ENA_ADMIN_CREATE_SQ = 1, - - ENA_ADMIN_DESTROY_SQ= 2, - - ENA_ADMIN_CREATE_CQ = 3, - - ENA_ADMIN_DESTROY_CQ= 4, - - ENA_ADMIN_GET_FEATURE = 8, - ENA_ADMIN_SET_FEATURE = 9, - - ENA_ADMIN_GET_STATS = 11, +enum ena_admin_aq_opcode { + ENA_ADMIN_CREATE_SQ = 1, + ENA_ADMIN_DESTROY_SQ= 2, + ENA_ADMIN_CREATE_CQ = 3, + ENA_ADMIN_DESTROY_CQ= 4, + ENA_ADMIN_GET_FEATURE = 8, + ENA_ADMIN_SET_FEATURE = 9, + ENA_ADMIN_GET_STATS = 11, }; enum ena_admin_aq_completion_status { - ENA_ADMIN_SUCCESS = 0, - - ENA_ADMIN_RESOURCE_ALLOCATION_FAILURE = 1, - - ENA_ADMIN_BAD_OPCODE= 2, - - ENA_ADMIN_UNSUPPORTED_OPCODE= 3, - - ENA_ADMIN_MALFORMED_REQUEST = 4, - + ENA_ADMIN_SUCCESS = 0, + ENA_ADMIN_RESOURCE_ALLOCATION_FAILURE = 1, + ENA_ADMIN_BAD_OPCODE= 2, + ENA_ADMIN_UNSUPPORTED_OPCODE= 3, + ENA_ADMIN_MALFORMED_REQUEST = 4, /* Additional status is provided in ACQ entry extended_status */ - ENA_ADMIN_ILLEGAL_PARAMETER = 5, - - ENA_ADMIN_UNKNOWN_ERROR = 6, - - ENA_ADMIN_RESOURCE_BUSY = 7, + ENA_ADMIN_ILLEGAL_PARAMETER = 5, + ENA_ADMIN_UNKNOWN_ERROR = 6, + ENA_ADMIN_RESOURCE_BUSY = 7, }; enum ena_admin_aq_feature_id { - ENA_ADMIN_DEVICE_ATTRIBUTES = 1, - - ENA_ADMIN_MAX_QUEUES_NUM= 2, - - ENA_ADMIN_HW_HINTS = 3, - - ENA_ADMIN_LLQ = 4, - - ENA_ADMIN_RSS_HASH_FUNCTION = 10, - - ENA_ADMIN_STATELESS_OFFLOAD_CONFIG = 11, - - ENA_ADMIN_RSS_REDIRECTION_TABLE_CONFIG = 12, - 
- ENA_ADMIN_MTU = 14, - - ENA_ADMIN_RSS_HASH_INPUT= 18, - - ENA_ADMIN_INTERRUPT_MODERATION = 20, - - ENA_ADMIN_AENQ_CONFIG = 26, - - ENA_ADMIN_LINK_CONFIG = 27, - - ENA_ADMIN_HOST_ATTR_CONFIG = 28, - - ENA_ADMIN_FEATURES_OPCODE_NUM = 32, + ENA_ADMIN_DEVICE_ATTRIBUTES = 1, + ENA_ADMIN_MAX_QUEUES_NUM= 2, + ENA_ADMIN_HW_HINTS = 3, + ENA_ADMIN_LLQ = 4, + ENA_ADMIN_RSS_HASH_FUNCTION = 10, + ENA_ADMIN_STATELESS_OFFLOAD_CONFIG = 11, + ENA_ADMIN_RSS_REDIRECTION_TABLE_CONFIG = 12, + ENA_ADMIN_MTU = 14, + ENA_ADMIN_RSS_HASH_INPUT= 18, + ENA_ADMIN_INTERRUPT_MODERATION = 20, + ENA_ADMIN_AENQ_CONFIG = 26, + ENA_ADMIN_LINK_CONFIG = 27, + ENA_ADMIN_HOST_ATTR_CONFIG = 28, + ENA_ADMIN_FEATURES_OPCODE_NUM = 32, }; enum ena_admin_placement_policy_type { /* descriptors and headers are in host memory */ - ENA_ADMIN_PLACEMENT_POLICY_HOST = 1, - + ENA_ADMIN_PLACEMENT_POLICY_HOST = 1, /* descriptors and headers are in device memory (a.k.a Low Latency * Queue) */ - ENA_ADMIN_PLACEMENT_POLICY_DEV = 3, + ENA_ADMIN_PLACEMENT_POLICY_DEV = 3, }; enum ena_admin_link_types { - ENA_ADMIN_LINK_SPEED_1G = 0x1, - - ENA_ADMIN_LINK_SPEED_2_HALF_G = 0x2, - - ENA_ADMIN_LINK_SPEED_5G = 0x4, - - ENA_ADMIN_LINK_SPEED_10G= 0x8, - - ENA_ADMIN_LINK_SPEED_25G= 0x10, - - ENA_ADMIN_LINK_SPEED_40G= 0x20, - - ENA_ADMIN_LINK_SPEED_50G= 0x40, - - ENA_ADMIN_LINK_SPEED_100G = 0x80, - - ENA_ADMIN_LINK_SPEED_200G = 0x100, - - ENA_ADMIN_LINK_SPEED_400G = 0x200, +