Re: [PATCH v2 net-next 0/2] platform data controls for mdio-gpio

2018-12-08 Thread Florian Fainelli
On 12/8/18 7:12 AM, Andrew Lunn wrote:
> Soon to be mainlined is an x86 platform with a Marvell switch, and a
> bit-banging MDIO bus. In order to make this work, the phy_mask of the
> MDIO bus needs to be set to prevent scanning for PHYs, and the
> phy_ignore_ta_mask needs to be set because the switch has broken
> turnaround.
> 
> Add a platform_data structure with these parameters.

Looks good, thanks Andrew. I do wonder if we could eventually define a
common mii_bus_platform_data structure comprising these two members (if
nothing else) and maybe update the common mdiobus_register() code path to
look for them; a rough sketch of the idea is below. If a subsequent
platform data/device MDIO bus shows up, we could do that at that time.
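
(Hypothetical sketch, not an existing kernel API today:)

struct mii_bus_platform_data {
	u32 phy_mask;		/* addresses to exclude from the PHY scan */
	u32 phy_ignore_ta_mask;	/* devices known to have broken turnaround */
};

mdiobus_register() could then check dev_get_platdata() on the bus's parent
device and apply these fields for any platform MDIO bus, not just mdio-gpio.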

Thanks!

> 
> Andrew Lunn (2):
>   net: phy: mdio-gpio: Add platform_data support for phy_mask
>   net: phy: mdio-gpio: Add phy_ignore_ta_mask to platform data
> 
>  MAINTAINERS |  1 +
>  drivers/net/phy/mdio-gpio.c |  7 +++
>  include/linux/platform_data/mdio-gpio.h | 14 ++
>  3 files changed, 22 insertions(+)
>  create mode 100644 include/linux/platform_data/mdio-gpio.h
> 


-- 
Florian


Re: [PATCH v2 net-next 2/2] net: phy: mdio-gpio: Add phy_ignore_ta_mask to platform data

2018-12-08 Thread Florian Fainelli
On 12/8/18 7:12 AM, Andrew Lunn wrote:
> The Marvell 6390 Ethernet switch family does not perform MDIO
> turnaround correctly. Many hardware MDIO bus masters don't care about
> this, but the bitbanging implementation in Linux does by default. Add
> phy_ignore_ta_mask to the platform data so that the bitbanging code
> can be told which devices are known to get TA wrong.
> 
> v2
> --
> int -> u32 in platform data structure
> 
> Signed-off-by: Andrew Lunn 

Reviewed-by: Florian Fainelli 
-- 
Florian


Re: [PATCH v2 net-next 1/2] net: phy: mdio-gpio: Add platform_data support for phy_mask

2018-12-08 Thread Florian Fainelli
On 12/8/18 7:12 AM, Andrew Lunn wrote:
> It is sometimes necessary to instantiate a bit-banging MDIO bus as a
> platform device, without the aid of device tree.
> 
> When device tree is being used, the bus is not scanned for devices,
> only those devices which are in device tree are probed. Without device
> tree, by default, all addresses on the bus are scanned. This may then
> find a device which is not a PHY, e.g. a switch. And the switch may
> have registers containing values which look like a PHY. So during the
> scan, a PHY device is wrongly created.
> 
> After the bus has been registered, a search is made for
> mdio_board_info structures which indicates devices on the bus, and the
> driver which should be used for them. This is typically used to
> instantiate Ethernet switches from platform drivers.  However, if the
> scanning of the bus has created a PHY device at the same location as
> indicated in the board info for a switch, the switch device is not
> created, since the address is already busy.
> 
> This can be avoided by setting the phy_mask of the mdio bus. This mask
> prevents addresses on the bus from being scanned.
> 
> v2
> --
> int -> u32 in platform data structure
> 
> Signed-off-by: Andrew Lunn 

Reviewed-by: Florian Fainelli 
-- 
Florian


[PATCH v2 net-next 0/2] platform data controls for mdio-gpio

2018-12-08 Thread Andrew Lunn
Soon to be mainlined is an x86 platform with a Marvell switch, and a
bit-banging MDIO bus. In order to make this work, the phy_mask of the
MDIO bus needs to be set to prevent scanning for PHYs, and the
phy_ignore_ta_mask needs to be set because the switch has broken
turnaround.

Add a platform_data structure with these parameters.
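
For reference, board code could then instantiate the bus roughly as below
(a sketch under assumptions: the MDC/MDIO lines come from a separate gpiod
lookup table, not shown, and the switch address 4 is made up):

#include <linux/bits.h>
#include <linux/err.h>
#include <linux/platform_device.h>
#include <linux/platform_data/mdio-gpio.h>

static const struct mdio_gpio_platform_data mdio_pdata = {
	.phy_mask           = 0xffffffff,	/* scan no addresses for PHYs */
	.phy_ignore_ta_mask = BIT(4),		/* switch at addr 4 has broken TA */
};

static int __init board_mdio_init(void)
{
	struct platform_device *pdev;

	pdev = platform_device_register_data(NULL, "mdio-gpio", 0,
					     &mdio_pdata, sizeof(mdio_pdata));
	return PTR_ERR_OR_ZERO(pdev);
}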

Andrew Lunn (2):
  net: phy: mdio-gpio: Add platform_data support for phy_mask
  net: phy: mdio-gpio: Add phy_ignore_ta_mask to platform data

 MAINTAINERS |  1 +
 drivers/net/phy/mdio-gpio.c |  7 +++
 include/linux/platform_data/mdio-gpio.h | 14 ++
 3 files changed, 22 insertions(+)
 create mode 100644 include/linux/platform_data/mdio-gpio.h

-- 
2.19.1



[PATCH v2 net-next 2/2] net: phy: mdio-gpio: Add phy_ignore_ta_mask to platform data

2018-12-08 Thread Andrew Lunn
The Marvell 6390 Ethernet switch family does not perform MDIO
turnaround correctly. Many hardware MDIO bus masters don't care about
this, but the bitbanging implementation in Linux does by default. Add
phy_ignore_ta_mask to the platform data so that the bitbanging code
can be told which devices are known to get TA wrong.
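
For context, a roughly paraphrased sketch of the turnaround check in the
bit-banging core (mdiobb_read() in drivers/net/phy/mdio-bitbang.c):

	/* Second turnaround bit: a compliant device drives it to zero.
	 * Devices listed in phy_ignore_ta_mask are exempt from the check.
	 */
	if (mdiobb_get_bit(ctrl) != 0 &&
	    !(bus->phy_ignore_ta_mask & (1 << phy))) {
		/* Device is not driving TA low: flush whatever bits it
		 * may be sending and fail the read.
		 */
		for (i = 0; i < 32; i++)
			mdiobb_get_bit(ctrl);
		return 0xffff;
	}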

v2
--
int -> u32 in platform data structure

Signed-off-by: Andrew Lunn 
---
 drivers/net/phy/mdio-gpio.c | 4 +++-
 include/linux/platform_data/mdio-gpio.h | 1 +
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/phy/mdio-gpio.c b/drivers/net/phy/mdio-gpio.c
index 1e296dd4067a..ea9a0e339778 100644
--- a/drivers/net/phy/mdio-gpio.c
+++ b/drivers/net/phy/mdio-gpio.c
@@ -130,8 +130,10 @@ static struct mii_bus *mdio_gpio_bus_init(struct device *dev,
else
strncpy(new_bus->id, "gpio", MII_BUS_ID_SIZE);
 
-   if (pdata)
+   if (pdata) {
new_bus->phy_mask = pdata->phy_mask;
+   new_bus->phy_ignore_ta_mask = pdata->phy_ignore_ta_mask;
+   }
 
dev_set_drvdata(dev, new_bus);
 
diff --git a/include/linux/platform_data/mdio-gpio.h b/include/linux/platform_data/mdio-gpio.h
index a5d5ff5e174c..13874fa6e767 100644
--- a/include/linux/platform_data/mdio-gpio.h
+++ b/include/linux/platform_data/mdio-gpio.h
@@ -8,6 +8,7 @@
 
 struct mdio_gpio_platform_data {
u32 phy_mask;
+   u32 phy_ignore_ta_mask;
 };
 
 #endif /* __LINUX_MDIO_GPIO_PDATA_H */
-- 
2.19.1



[PATCH v2 net-next 1/2] net: phy: mdio-gpio: Add platform_data support for phy_mask

2018-12-08 Thread Andrew Lunn
It is sometimes necessary to instantiate a bit-banging MDIO bus as a
platform device, without the aid of device tree.

When device tree is being used, the bus is not scanned for devices,
only those devices which are in device tree are probed. Without device
tree, by default, all addresses on the bus are scanned. This may then
find a device which is not a PHY, e.g. a switch. And the switch may
have registers containing values which look like a PHY. So during the
scan, a PHY device is wrongly created.

After the bus has been registered, a search is made for
mdio_board_info structures which indicates devices on the bus, and the
driver which should be used for them. This is typically used to
instantiate Ethernet switches from platform drivers.  However, if the
scanning of the bus has created a PHY device at the same location as
indicated in the board info for a switch, the switch device is not
created, since the address is already busy.

This can be avoided by setting the phy_mask of the mdio bus. This mask
prevents addresses on the bus from being scanned.
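
For illustration, the switch can still be found without a scan by declaring
it via board info (a sketch; the bus id and address are illustrative, and a
DSA switch would also need driver-specific platform data, not shown):

#include <linux/phy.h>

static const struct mdio_board_info board_switch_info = {
	.bus_id    = "gpio-0",		/* must match the mdio-gpio bus id */
	.modalias  = "mv88e6085",	/* binds the Marvell switch driver */
	.mdio_addr = 4,			/* address the scan would have claimed */
};

/* in board init, before the MDIO bus is registered: */
mdiobus_register_board_info(&board_switch_info, 1);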

v2
--
int -> u32 in platform data structure

Signed-off-by: Andrew Lunn 
---
 MAINTAINERS |  1 +
 drivers/net/phy/mdio-gpio.c |  5 +
 include/linux/platform_data/mdio-gpio.h | 13 +
 3 files changed, 19 insertions(+)
 create mode 100644 include/linux/platform_data/mdio-gpio.h

diff --git a/MAINTAINERS b/MAINTAINERS
index fb88b6863d10..9d3b899f9ba2 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5610,6 +5610,7 @@ F:	include/linux/of_net.h
 F: include/linux/phy.h
 F: include/linux/phy_fixed.h
 F: include/linux/platform_data/mdio-bcm-unimac.h
+F: include/linux/platform_data/mdio-gpio.h
 F: include/trace/events/mdio.h
 F: include/uapi/linux/mdio.h
 F: include/uapi/linux/mii.h
diff --git a/drivers/net/phy/mdio-gpio.c b/drivers/net/phy/mdio-gpio.c
index 0fbcedcdf6e2..1e296dd4067a 100644
--- a/drivers/net/phy/mdio-gpio.c
+++ b/drivers/net/phy/mdio-gpio.c
@@ -24,6 +24,7 @@
 #include 
 #include 
 #include 
+#include <linux/platform_data/mdio-gpio.h>
 #include 
 #include 
 #include 
@@ -112,6 +113,7 @@ static struct mii_bus *mdio_gpio_bus_init(struct device *dev,
  struct mdio_gpio_info *bitbang,
  int bus_id)
 {
+   struct mdio_gpio_platform_data *pdata = dev_get_platdata(dev);
struct mii_bus *new_bus;
 
bitbang->ctrl.ops = &mdio_gpio_ops;
@@ -128,6 +130,9 @@ static struct mii_bus *mdio_gpio_bus_init(struct device *dev,
else
strncpy(new_bus->id, "gpio", MII_BUS_ID_SIZE);
 
+   if (pdata)
+   new_bus->phy_mask = pdata->phy_mask;
+
dev_set_drvdata(dev, new_bus);
 
return new_bus;
diff --git a/include/linux/platform_data/mdio-gpio.h b/include/linux/platform_data/mdio-gpio.h
new file mode 100644
index ..a5d5ff5e174c
--- /dev/null
+++ b/include/linux/platform_data/mdio-gpio.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * MDIO-GPIO bus platform data structure
+ */
+
+#ifndef __LINUX_MDIO_GPIO_PDATA_H
+#define __LINUX_MDIO_GPIO_PDATA_H
+
+struct mdio_gpio_platform_data {
+   u32 phy_mask;
+};
+
+#endif /* __LINUX_MDIO_GPIO_PDATA_H */
-- 
2.19.1



Re: [Patch v2 net-next] call sk_dst_reset when set SO_DONTROUTE

2018-12-07 Thread David Miller
From: yupeng 
Date: Wed,  5 Dec 2018 18:56:28 -0800

> After SO_DONTROUTE is set to 1, the IP layer should not route packets if
> the destination IP address is not in link scope. But if the socket has
> cached the dst_entry, such packets would be routed until the sk_dst_cache
> expires. So we should clean the sk_dst_cache when a user sets the
> SO_DONTROUTE option. Below are server/client Python scripts which can
> reproduce this issue:
 ...
> Signed-off-by: yupeng 

Applied.


Re: [PATCH v2 net-next] neighbor: Improve garbage collection

2018-12-07 Thread David Miller
From: David Ahern 
Date: Fri,  7 Dec 2018 12:24:57 -0800

> From: David Ahern 
> 
> The existing garbage collection algorithm has a number of problems:
 ...
> This patch addresses these problems as follows:
> 
> 1. Use of a separate list_head to track entries that can be garbage
>collected along with a separate counter. PERMANENT entries are not
>added to this list.
> 
>The gc_thresh parameters are only compared to the new counter, not the
>total entries in the table. The forced_gc function is updated to only
>walk this new gc_list looking for entries to evict.
> 
> 2. Entries are added to the list head at the tail and removed from the
>front.
> 
> 3. Entries are only evicted if they were last updated more than 5 seconds
>ago, adhering to the original intent of gc_thresh2.
> 
> 4. Forced gc is stopped once the number of gc_entries drops below
>gc_thresh2.
> 
> 5. Since gc checks do not apply to PERMANENT entries, gc levels are skipped
>when allocating a new neighbor for a PERMANENT entry. By extension this
>means there are no explicit limits on the number of PERMANENT entries
>that can be created, but this is no different than FIB entries or FDB
>entries.
> 
> Signed-off-by: David Ahern 
> ---
> v2
> - remove on_gc_list boolean in favor of !list_empty
> - fix neigh_alloc to add new entry to tail of list_head

Again, looks great, applied.


Re: [PATCH v2 net-next 0/4] net: aquantia: add RSS configuration

2018-12-07 Thread David Miller
From: Igor Russkikh 
Date: Fri, 7 Dec 2018 14:00:09 +

> In this patchset, a few bugs related to RSS are fixed, and RSS table and
> hash key configuration is added.
> 
> We also increase the max number of HW rings up to 8.
> 
> v2: removed extra arg check

Series applied.


[PATCH v2 net-next] neighbor: Improve garbage collection

2018-12-07 Thread David Ahern
From: David Ahern 

The existing garbage collection algorithm has a number of problems:

1. The gc algorithm will not evict PERMANENT entries as those entries
   are managed by userspace, yet the existing algorithm walks the entire
   hash table which means it always considers PERMANENT entries when
   looking for entries to evict. In some use cases (e.g., EVPN) there
   can be tens of thousands of PERMANENT entries leading to wasted
   CPU cycles when gc kicks in. As an example, with 32k permanent
   entries, neigh_alloc has been observed taking more than 4 msec per
   invocation.

2. Currently, when the number of neighbor entries hits gc_thresh2 and
   the last flush for the table was more than 5 seconds ago, gc kicks in
   and walks the entire hash table, evicting *all* entries not in PERMANENT
   or REACHABLE state and not marked as externally learned. There is no
   discriminator on when the neigh entry was created or if it just moved
   from REACHABLE to another NUD_VALID state (e.g., NUD_STALE).

   It is possible for entries to be created or for established neighbor
   entries to be moved to STALE (e.g., an external node sends an ARP
   request) right before the 5 second window lapses:

        --------|-----------x--|-----------|--------
               t-5             t          t+5

   If that happens, those entries are evicted during gc, causing unnecessary
   thrashing on neighbor entries and userspace caches trying to track them.

   Further, this contradicts the description of gc_thresh2 which says
   "Entries older than 5 seconds will be cleared".

   One workaround is to make gc_thresh2 == gc_thresh3 but that negates the
   whole point of having separate thresholds.

3. Clearing *all* neigh non-PERMANENT/REACHABLE/externally learned entries
   when gc_thresh2 is exceeded is overkill and contributes to thrashing,
   especially during startup.

This patch addresses these problems as follows:

1. Use of a separate list_head to track entries that can be garbage
   collected along with a separate counter. PERMANENT entries are not
   added to this list.

   The gc_thresh parameters are only compared to the new counter, not the
   total entries in the table. The forced_gc function is updated to only
   walk this new gc_list looking for entries to evict.

2. Entries are added to the list head at the tail and removed from the
   front.

3. Entries are only evicted if they were last updated more than 5 seconds
   ago, adhering to the original intent of gc_thresh2.

4. Forced gc is stopped once the number of gc_entries drops below
   gc_thresh2.

5. Since gc checks do not apply to PERMANENT entries, gc levels are skipped
   when allocating a new neighbor for a PERMANENT entry. By extension this
   means there are no explicit limits on the number of PERMANENT entries
   that can be created, but this is no different than FIB entries or FDB
   entries.
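
For illustration, a minimal sketch of the eviction walk described above
(not the actual patch; it reuses the field names from the diff below):

static void forced_gc_sketch(struct neigh_table *tbl)
{
	struct neighbour *n, *tmp;
	unsigned long tref = jiffies - 5 * HZ;

	/* oldest entries sit at the front; new ones are added at the tail */
	list_for_each_entry_safe(n, tmp, &tbl->gc_list, gc_list) {
		if (atomic_read(&tbl->gc_entries) <= tbl->gc_thresh2)
			break;			/* point 4: below gc_thresh2 */
		if (time_after(n->updated, tref))
			continue;		/* point 3: touched within 5s */
		list_del_init(&n->gc_list);	/* evict the entry */
		atomic_dec(&tbl->gc_entries);
	}
}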

Signed-off-by: David Ahern 
---
v2
- remove on_gc_list boolean in favor of !list_empty
- fix neigh_alloc to add new entry to tail of list_head

 Documentation/networking/ip-sysctl.txt |   4 +-
 include/net/neighbour.h|   3 +
 net/core/neighbour.c   | 119 +++--
 3 files changed, 90 insertions(+), 36 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index af2a69439b93..acdfb5d2bcaa 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -108,8 +108,8 @@ neigh/default/gc_thresh2 - INTEGER
Default: 512
 
 neigh/default/gc_thresh3 - INTEGER
-   Maximum number of neighbor entries allowed.  Increase this
-   when using large numbers of interfaces and when communicating
+   Maximum number of non-PERMANENT neighbor entries allowed.  Increase
+   this when using large numbers of interfaces and when communicating
with large numbers of directly-connected peers.
Default: 1024
 
diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index f58b384aa6c9..6c13072910ab 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -154,6 +154,7 @@ struct neighbour {
struct hh_cache hh;
int (*output)(struct neighbour *, struct sk_buff *);
const struct neigh_ops  *ops;
+   struct list_headgc_list;
struct rcu_head rcu;
struct net_device   *dev;
u8  primary_key[0];
@@ -214,6 +215,8 @@ struct neigh_table {
struct timer_list   proxy_timer;
struct sk_buff_head proxy_queue;
atomic_tentries;
+   atomic_tgc_entries;
+   struct list_headgc_list;
rwlock_tlock;
unsigned long   last_rand;
struct neigh_statistics __percpu *stats;
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 6d479b5562be..c3b58712e98b 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c

RE: [PATCH v2 net-next 0/8] dpaa2-eth: Introduce XDP support

2018-12-07 Thread Ioana Ciocoi Radulescu
> -Original Message-
> From: Ilias Apalodimas 
> Sent: Friday, December 7, 2018 7:52 PM
> To: Ioana Ciocoi Radulescu 
> Cc: Jesper Dangaard Brouer ;
> netdev@vger.kernel.org; da...@davemloft.net; Ioana Ciornei
> ; dsah...@gmail.com; Camelia Alexandra Groza
> 
> Subject: Re: [PATCH v2 net-next 0/8] dpaa2-eth: Introduce XDP support
> 
> Hi Ioana,
> > > > >
> > > I only did a quick grep around the driver so i might be missing something,
> > > but i can only see allocations via napi_alloc_frag(). XDP requires pages
> > > (either a single page per packet or a driver that does the page
> management
> > > of
> > > its own and fits 2 frames in a single page, assuming 4kb pages).
> > > Am i missing something on the driver?
> >
> > No, I guess I'm the one missing stuff, I didn't realise single page per 
> > packet
> > is a hard requirement for XDP. Could you point me to more info on this?
> >
> 
> Well if you don't have to use 64kb pages you can use the page_pool API (only
> used from mlx5 atm) and get the xdp recycling for free. The memory 'waste'
> for
> 4kb pages isn't too much if the platforms the driver sits on have decent
> amounts
> of memory  (and the number of descriptors used is not too high).
> We still have work in progress with Jesper (just posted an RFC) with
> improvements on the API.
> Using it is fairly straightforward. This is a patchset on marvell's mvneta
> driver with the API changes needed:
> https://www.spinics.net/lists/netdev/msg538285.html
> 
> If you need 64kb pages you would have to introduce page recycling and
> sharing
> like intel/mlx drivers on your driver.

Thanks a lot for the info, will look into this. Do you have any pointers
as to why the full page restriction exists in the first place? Sorry if it's
a dumb question, but I haven't found details on this and I'd really like
to understand it.

Thanks
Ioana


Re: [PATCH v2 net-next 0/8] dpaa2-eth: Introduce XDP support

2018-12-07 Thread Ilias Apalodimas
Hi Ioana,
> > > >
> > I only did a quick grep around the driver so i might be missing something,
> > but i can only see allocations via napi_alloc_frag(). XDP requires pages
> > (either a single page per packet or a driver that does the page management
> > of
> > its own and fits 2 frames in a single page, assuming 4kb pages).
> > Am i missing something on the driver?
> 
> No, I guess I'm the one missing stuff, I didn't realise single page per packet
> is a hard requirement for XDP. Could you point me to more info on this?
> 

Well if you don't have to use 64kb pages you can use the page_pool API (only
used from mlx5 atm) and get the xdp recycling for free. The memory 'waste' for
4kb pages isn't too much if the platforms the driver sits on have decent amounts
of memory  (and the number of descriptors used is not too high).
We still have work in progress with Jesper (just posted an RFC) with
improvements on the API.
Using it is fairly straightforward. This is a patchset on Marvell's mvneta
driver with the API changes needed: 
https://www.spinics.net/lists/netdev/msg538285.html

If you need 64kb pages you would have to introduce page recycling and sharing 
like intel/mlx drivers on your driver.
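
If it helps, creating such a pool looks roughly like this (a minimal
sketch with illustrative parameters; see the mvneta patchset above for a
real conversion):

#include <net/page_pool.h>

static struct page_pool *rxq_pool_create(struct device *dev, int ring_size)
{
	struct page_pool_params pp_params = {
		.order     = 0,			/* one 4kb page per frame */
		.flags     = PP_FLAG_DMA_MAP,	/* pool handles DMA mapping */
		.pool_size = ring_size,
		.nid       = NUMA_NO_NODE,
		.dev       = dev,
		.dma_dir   = DMA_FROM_DEVICE,
	};

	return page_pool_create(&pp_params);	/* ERR_PTR() on failure */
}

/* then refill each RX descriptor with page_pool_dev_alloc_pages(pool) */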

/Ilias


RE: [PATCH v2 net-next 0/8] dpaa2-eth: Introduce XDP support

2018-12-07 Thread Ioana Ciocoi Radulescu



> -Original Message-
> From: Ilias Apalodimas 
> Sent: Friday, December 7, 2018 7:20 PM
> To: Ioana Ciocoi Radulescu 
> Cc: Jesper Dangaard Brouer ;
> netdev@vger.kernel.org; da...@davemloft.net; Ioana Ciornei
> ; dsah...@gmail.com; Camelia Alexandra Groza
> 
> Subject: Re: [PATCH v2 net-next 0/8] dpaa2-eth: Introduce XDP support
> 
> Hi Ioana,
> > >
> > > > Add support for XDP programs. Only XDP_PASS, XDP_DROP and
> XDP_TX
> > > > actions are supported for now. Frame header changes are also
> > > > allowed.
> 
> I only did a quick grep around the driver so i might be missing something,
> but i can only see allocations via napi_alloc_frag(). XDP requires pages
> (either a single page per packet or a driver that does the page management
> of
> its own and fits 2 frames in a single page, assuming 4kb pages).
> Am i missing something on the driver?

No, I guess I'm the one missing stuff, I didn't realise single page per packet
is a hard requirement for XDP. Could you point me to more info on this?

Thanks,
Ioana

> 
> > >
> > > Do you have any XDP performance benchmarks on this hardware?
> >
> > We have some preliminary perf data that doesn't look great,
> > but we hope to improve it :)
> 
> As Jesper said we are doing similar work on a cortex a-53 and plan to work on
> a-72 as well. We might be able to help out.
> 
> /Ilias


Re: [PATCH v2 net-next 0/8] dpaa2-eth: Introduce XDP support

2018-12-07 Thread Ilias Apalodimas
Hi Ioana,
> > 
> > > Add support for XDP programs. Only XDP_PASS, XDP_DROP and XDP_TX
> > > actions are supported for now. Frame header changes are also
> > > allowed.

I only did a quick grep around the driver so i might be missing something, 
but i can only see allocations via napi_alloc_frag(). XDP requires pages 
(either a single page per packet or a driver that does the page management of
its own and fits 2 frames in a single page, assuming 4kb pages). 
Am i missing something on the driver? 

> > 
> > Do you have any XDP performance benchmarks on this hardware?
> 
> We have some preliminary perf data that doesn't look great,
> but we hope to improve it :)

As Jesper said we are doing similar work on a cortex a-53 and plan to work on
a-72 as well. We might be able to help out.

/Ilias


RE: [PATCH v2 net-next 0/8] dpaa2-eth: Introduce XDP support

2018-12-07 Thread Ioana Ciocoi Radulescu
> -Original Message-
> From: Jesper Dangaard Brouer 
> Sent: Wednesday, December 5, 2018 5:45 PM
> To: Ioana Ciocoi Radulescu 
> Cc: bro...@redhat.com; netdev@vger.kernel.org; da...@davemloft.net;
> Ioana Ciornei ; dsah...@gmail.com; Camelia
> Alexandra Groza ; Ilias Apalodimas
> 
> Subject: Re: [PATCH v2 net-next 0/8] dpaa2-eth: Introduce XDP support
> 
> On Mon, 26 Nov 2018 16:27:28 +
> Ioana Ciocoi Radulescu  wrote:
> 
> > Add support for XDP programs. Only XDP_PASS, XDP_DROP and XDP_TX
> > actions are supported for now. Frame header changes are also
> > allowed.
> 
> Do you have any XDP performance benchmarks on this hardware?

We have some preliminary perf data that doesn't look great,
but we hope to improve it :)

On a LS2088A with A72 cores @2GHz (numbers in Mpps):
                             1 core   8 cores
-----------------------------------------------------
XDP_DROP (no touching data)   5.68    29.6 (line rate)
XDP_DROP (xdp1 sample)        3.46    25.18
XDP_TX   (xdp2 sample)        1.71    13.26

For comparison, plain IP forwarding through the stack
is currently around 0.5Mpps (1c) / 3.8Mpps (8c).

>
> Also what boards (and arch's) are using this dpaa2-eth driver?

Currently supported LS2088A, LS1088A, soon LX2160A (all with
ARM64 cores).

> Any devel board I can buy?

I should have an answer for this early next week and will
get back to you.

Thanks,
Ioana

> 
> 
> p.s. Ilias and I are coding up page_pool and XDP support for Marvell
> mvneta driver, which is avail on a number of avail boards, see here[1]
> 
> [1] https://github.com/xdp-project/xdp-project/blob/master/areas/arm64/arm01_selecting_hardware.org
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer


[PATCH v2 net-next 3/4] net: aquantia: fix initialization of RSS table

2018-12-07 Thread Igor Russkikh
From: Dmitry Bogdanov 

Currently the RSS indirection table is initialized before the number of
hw queues is set up; consequently, the table may be filled with references
to nonexistent queues. This patch moves the initialization to after the
number of hw queues is known.

Signed-off-by: Dmitry Bogdanov 
Signed-off-by: Igor Russkikh 
---
 drivers/net/ethernet/aquantia/atlantic/aq_nic.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_nic.c b/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
index d617289d95f7..0147c037ca96 100644
--- a/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
@@ -84,8 +84,6 @@ void aq_nic_cfg_start(struct aq_nic_s *self)
 
cfg->is_lro = AQ_CFG_IS_LRO_DEF;
 
-   aq_nic_rss_init(self, cfg->num_rss_queues);
-
/*descriptors */
cfg->rxds = min(cfg->aq_hw_caps->rxds_max, AQ_CFG_RXDS_DEF);
cfg->txds = min(cfg->aq_hw_caps->txds_max, AQ_CFG_TXDS_DEF);
@@ -106,6 +104,8 @@ void aq_nic_cfg_start(struct aq_nic_s *self)
 
cfg->num_rss_queues = min(cfg->vecs, AQ_CFG_NUM_RSS_QUEUES_DEF);
 
+   aq_nic_rss_init(self, cfg->num_rss_queues);
+
cfg->irq_type = aq_pci_func_get_irq_type(self);
 
if ((cfg->irq_type == AQ_HW_IRQ_LEGACY) ||
-- 
2.17.1



[PATCH v2 net-next 1/4] net: aquantia: fix RSS table and key sizes

2018-12-07 Thread Igor Russkikh
From: Dmitry Bogdanov 

Set the RSS indirection table and RSS hash key sizes to their real values.

Signed-off-by: Dmitry Bogdanov 
Signed-off-by: Igor Russkikh 
---
 drivers/net/ethernet/aquantia/atlantic/aq_cfg.h | 4 ++--
 drivers/net/ethernet/aquantia/atlantic/aq_nic.c | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_cfg.h b/drivers/net/ethernet/aquantia/atlantic/aq_cfg.h
index 91eb8910b1c9..90a0e1d0d622 100644
--- a/drivers/net/ethernet/aquantia/atlantic/aq_cfg.h
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_cfg.h
@@ -42,8 +42,8 @@
 #define AQ_CFG_IS_LRO_DEF   1U
 
 /* RSS */
-#define AQ_CFG_RSS_INDIRECTION_TABLE_MAX  128U
-#define AQ_CFG_RSS_HASHKEY_SIZE   320U
+#define AQ_CFG_RSS_INDIRECTION_TABLE_MAX  64U
+#define AQ_CFG_RSS_HASHKEY_SIZE   40U
 
 #define AQ_CFG_IS_RSS_DEF   1U
 #define AQ_CFG_NUM_RSS_QUEUES_DEF   AQ_CFG_VECS_DEF
diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_nic.c b/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
index 279ea58f4a9e..d617289d95f7 100644
--- a/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
@@ -44,7 +44,7 @@ static void aq_nic_rss_init(struct aq_nic_s *self, unsigned int num_rss_queues)
struct aq_rss_parameters *rss_params = &cfg->aq_rss;
int i = 0;
 
-   static u8 rss_key[40] = {
+   static u8 rss_key[AQ_CFG_RSS_HASHKEY_SIZE] = {
0x1e, 0xad, 0x71, 0x87, 0x65, 0xfc, 0x26, 0x7d,
0x0d, 0x45, 0x67, 0x74, 0xcd, 0x06, 0x1a, 0x18,
0xb6, 0xc1, 0xf0, 0xc7, 0xbb, 0x18, 0xbe, 0xf8,
-- 
2.17.1



[PATCH v2 net-next 4/4] net: aquantia: add support of RSS configuration

2018-12-07 Thread Igor Russkikh
From: Dmitry Bogdanov 

Add support for configuring the RSS hash key and RSS indirection table.
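
Conceptually, the indirection table maps the hardware's per-flow hash to
an RX queue, roughly as below (illustrative, not driver code):

/* the NIC hashes each flow with the Toeplitz key, then the hash indexes
 * the indirection table (64 entries on AQC, per patch 1/4) to pick a queue
 */
static u32 rss_pick_queue(u32 rx_hash, const u8 *indir, u32 indir_size)
{
	return indir[rx_hash % indir_size];
}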

Signed-off-by: Dmitry Bogdanov 
Signed-off-by: Igor Russkikh 
---
 .../ethernet/aquantia/atlantic/aq_ethtool.c   | 36 +++
 1 file changed, 36 insertions(+)

diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_ethtool.c b/drivers/net/ethernet/aquantia/atlantic/aq_ethtool.c
index a5fd71692c8b..fcbfecf41c45 100644
--- a/drivers/net/ethernet/aquantia/atlantic/aq_ethtool.c
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_ethtool.c
@@ -202,6 +202,41 @@ static int aq_ethtool_get_rss(struct net_device *ndev, u32 *indir, u8 *key,
return 0;
 }
 
+static int aq_ethtool_set_rss(struct net_device *netdev, const u32 *indir,
+ const u8 *key, const u8 hfunc)
+{
+   struct aq_nic_s *aq_nic = netdev_priv(netdev);
+   struct aq_nic_cfg_s *cfg;
+   unsigned int i = 0U;
+   u32 rss_entries;
+   int err = 0;
+
+   cfg = aq_nic_get_cfg(aq_nic);
+   rss_entries = cfg->aq_rss.indirection_table_size;
+
+   /* We do not allow change in unsupported parameters */
+   if (hfunc != ETH_RSS_HASH_NO_CHANGE && hfunc != ETH_RSS_HASH_TOP)
+   return -EOPNOTSUPP;
+   /* Fill out the redirection table */
+   if (indir)
+   for (i = 0; i < rss_entries; i++)
+   cfg->aq_rss.indirection_table[i] = indir[i];
+
+   /* Fill out the rss hash key */
+   if (key) {
+   memcpy(cfg->aq_rss.hash_secret_key, key,
+  sizeof(cfg->aq_rss.hash_secret_key));
+   err = aq_nic->aq_hw_ops->hw_rss_hash_set(aq_nic->aq_hw,
+    &cfg->aq_rss);
+   if (err)
+   return err;
+   }
+
+   err = aq_nic->aq_hw_ops->hw_rss_set(aq_nic->aq_hw, &cfg->aq_rss);
+
+   return err;
+}
+
 static int aq_ethtool_get_rxnfc(struct net_device *ndev,
struct ethtool_rxnfc *cmd,
u32 *rule_locs)
@@ -549,6 +584,7 @@ const struct ethtool_ops aq_ethtool_ops = {
.set_pauseparam  = aq_ethtool_set_pauseparam,
.get_rxfh_key_size   = aq_ethtool_get_rss_key_size,
.get_rxfh= aq_ethtool_get_rss,
+   .set_rxfh= aq_ethtool_set_rss,
.get_rxnfc   = aq_ethtool_get_rxnfc,
.set_rxnfc   = aq_ethtool_set_rxnfc,
.get_sset_count  = aq_ethtool_get_sset_count,
-- 
2.17.1



[PATCH v2 net-next 2/4] net: aquantia: increase max number of hw queues

2018-12-07 Thread Igor Russkikh
From: Dmitry Bogdanov 

Increase the upper limit of the hw queues up to 8.
This improves RSS on multi-core CPUs.

This is the maximum the AQC hardware supports in one traffic class.

The actual value is still limited by the number of available CPU cores.

Signed-off-by: Dmitry Bogdanov 
Signed-off-by: Igor Russkikh 
---
 drivers/net/ethernet/aquantia/atlantic/aq_cfg.h   | 2 +-
 drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_cfg.h b/drivers/net/ethernet/aquantia/atlantic/aq_cfg.h
index 90a0e1d0d622..3944ce7f0870 100644
--- a/drivers/net/ethernet/aquantia/atlantic/aq_cfg.h
+++ b/drivers/net/ethernet/aquantia/atlantic/aq_cfg.h
@@ -12,7 +12,7 @@
 #ifndef AQ_CFG_H
 #define AQ_CFG_H
 
-#define AQ_CFG_VECS_DEF   4U
+#define AQ_CFG_VECS_DEF   8U
 #define AQ_CFG_TCS_DEF1U
 
 #define AQ_CFG_TXDS_DEF4096U
diff --git a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c
index 6af7d7f0cdca..08596a7a6486 100644
--- a/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c
+++ b/drivers/net/ethernet/aquantia/atlantic/hw_atl/hw_atl_b0.c
@@ -21,7 +21,7 @@
 
 #define DEFAULT_B0_BOARD_BASIC_CAPABILITIES \
.is_64_dma = true,\
-   .msix_irqs = 4U,  \
+   .msix_irqs = 8U,  \
.irq_mask = ~0U,  \
.vecs = HW_ATL_B0_RSS_MAX,\
.tcs = HW_ATL_B0_TC_MAX,  \
-- 
2.17.1



[PATCH v2 net-next 0/4] net: aquantia: add RSS configuration

2018-12-07 Thread Igor Russkikh
In this patchset, a few bugs related to RSS are fixed, and RSS table and
hash key configuration is added.

We also increase the max number of HW rings up to 8.

v2: removed extra arg check

Dmitry Bogdanov (4):
  net: aquantia: fix RSS table and key sizes
  net: aquantia: increase max number of hw queues
  net: aquantia: fix initialization of RSS table
  net: aquantia: add support of RSS configuration

 .../net/ethernet/aquantia/atlantic/aq_cfg.h   |  6 ++--
 .../ethernet/aquantia/atlantic/aq_ethtool.c   | 36 +++
 .../net/ethernet/aquantia/atlantic/aq_nic.c   |  6 ++--
 .../aquantia/atlantic/hw_atl/hw_atl_b0.c  |  2 +-
 4 files changed, 43 insertions(+), 7 deletions(-)

-- 
2.17.1



Re: [PATCH v2 net-next 1/1] net: netem: use a list in addition to rbtree

2018-12-05 Thread David Miller
From: Peter Oskolkov 
Date: Tue,  4 Dec 2018 11:55:56 -0800

> When testing high-bandwidth TCP streams with large windows,
> high latency, and low jitter, netem consumes a lot of CPU cycles
> doing rbtree rebalancing.
> 
> This patch uses a linear list/queue in addition to the rbtree:
> if an incoming packet is past the tail of the linear queue, it is
> added there, otherwise it is inserted into the rbtree.
> 
> Without this patch, perf shows netem_enqueue, netem_dequeue,
> and rb_* functions among the top offenders. With this patch,
> only netem_enqueue is noticeable if jitter is low/absent.
> 
> Suggested-by: Eric Dumazet 
> Signed-off-by: Peter Oskolkov 

Applied, thanks.


[Patch v2 net-next] call sk_dst_reset when set SO_DONTROUTE

2018-12-05 Thread yupeng
After SO_DONTROUTE is set to 1, the IP layer should not route packets if
the destination IP address is not in link scope. But if the socket has
cached the dst_entry, such packets would be routed until the sk_dst_cache
expires. So we should clean the sk_dst_cache when a user sets the
SO_DONTROUTE option. Below are server/client Python scripts which can
reproduce this issue:

server side code:
==
import socket
import struct
import time

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(('0.0.0.0', 9000))
s.listen(1)
sock, addr = s.accept()
sock.setsockopt(socket.SOL_SOCKET, socket.SO_DONTROUTE, struct.pack('i', 1))
while True:
    sock.send(b'foo')
    time.sleep(1)
==

client side code:
==
import socket
import time

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(('server_address', 9000))
while True:
    data = s.recv(1024)
    print(data)
==

Signed-off-by: yupeng 
---
 net/core/sock.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/core/sock.c b/net/core/sock.c
index f5bb89785e47..f00902c532cc 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -700,6 +700,7 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
break;
case SO_DONTROUTE:
sock_valbool_flag(sk, SOCK_LOCALROUTE, valbool);
+   sk_dst_reset(sk);
break;
case SO_BROADCAST:
sock_valbool_flag(sk, SOCK_BROADCAST, valbool);
-- 
2.17.1



Re: [PATCH v2 net-next 0/8] dpaa2-eth: Introduce XDP support

2018-12-05 Thread Jesper Dangaard Brouer
On Mon, 26 Nov 2018 16:27:28 +
Ioana Ciocoi Radulescu  wrote:

> Add support for XDP programs. Only XDP_PASS, XDP_DROP and XDP_TX
> actions are supported for now. Frame header changes are also
> allowed.

Do you have any XDP performance benchmarks on this hardware?

Also what boards (and arch's) are using this dpaa2-eth driver?
Any devel board I can buy?


p.s. Ilias and I are coding up page_pool and XDP support for Marvell
mvneta driver, which is avail on a number of avail boards, see here[1]

[1] 
https://github.com/xdp-project/xdp-project/blob/master/areas/arm64/arm01_selecting_hardware.org
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH v2 net-next] ip6_tunnel: Adding support of mapping rules for MAP-E tunnel

2018-12-04 Thread David Miller
From: Felix Jia 
Date: Mon,  3 Dec 2018 16:39:31 +1300

> +int
> +ip6_get_addrport(struct iphdr *iph, __be32 *saddr4, __be32 *daddr4,
> +  __be16 *sport4, __be16 *dport4, __u8 *proto, int *icmperr)
> +{

This looks like something the flow dissector can already do; please look into
utilizing that common piece of infrastructure instead of reimplementing it.

> + u8 *ptr;
> + struct iphdr *icmpiph = NULL;
> + struct tcphdr *tcph, *icmptcph;
> + struct udphdr *udph, *icmpudph;
> + struct icmphdr *icmph, *icmpicmph;

Please always order local variables from longest to shortest line.

Please audit your entire submission for this problem.

> +static struct ip6_tnl_rule *ip6_tnl_rule_find(struct net_device *dev,
> +   __be32 _dst)
> +{
> + u32 dst = ntohl(_dst);
> + struct ip6_rule_list *pos = NULL;
> + struct ip6_tnl *t = netdev_priv(dev);
> +
> + list_for_each_entry(pos, &t->rules.list, list) {
> + int mask =
> + 0xffffffff ^ ((1 << (32 - pos->data.ipv4_prefixlen)) - 1);
> + if ((dst & mask) == ntohl(pos->data.ipv4_subnet.s_addr))
> + return >data;
> + }
> + return NULL;
> +}

How will this scale with large numbers of rules?

This rule facility seems to be designed in a way that sophisticated
(at least as fast as "O(log N)") lookup schemes aren't even possible,
and that even worse the ordering matters.


Re: [PATCH v2 net-next 1/1] net: netem: use a list in addition to rbtree

2018-12-04 Thread Eric Dumazet



On 12/04/2018 11:55 AM, Peter Oskolkov wrote:
> When testing high-bandwidth TCP streams with large windows,
> high latency, and low jitter, netem consumes a lot of CPU cycles
> doing rbtree rebalancing.
> 
> This patch uses a linear list/queue in addition to the rbtree:
> if an incoming packet is past the tail of the linear queue, it is
> added there, otherwise it is inserted into the rbtree.
> 
> Without this patch, perf shows netem_enqueue, netem_dequeue,
> and rb_* functions among the top offenders. With this patch,
> only netem_enqueue is noticeable if jitter is low/absent.
> 
> Suggested-by: Eric Dumazet 
> Signed-off-by: Peter Oskolkov 
> ---

Reviewed-by: Eric Dumazet 



[PATCH v2 net-next 0/1] net: netem: use a list _and_ rbtree

2018-12-04 Thread Peter Oskolkov
v2: address style suggestions by Stephen Hemminger.

All changes are noop vs v1.

Peter Oskolkov (1):
  net: netem: use a list in addition to rbtree

 net/sched/sch_netem.c | 89 +--
 1 file changed, 69 insertions(+), 20 deletions(-)



[PATCH v2 net-next 1/1] net: netem: use a list in addition to rbtree

2018-12-04 Thread Peter Oskolkov
When testing high-bandwidth TCP streams with large windows,
high latency, and low jitter, netem consumes a lot of CPU cycles
doing rbtree rebalancing.

This patch uses a linear list/queue in addition to the rbtree:
if an incoming packet is past the tail of the linear queue, it is
added there, otherwise it is inserted into the rbtree.

Without this patch, perf shows netem_enqueue, netem_dequeue,
and rb_* functions among the top offenders. With this patch,
only netem_enqueue is noticeable if jitter is low/absent.

Suggested-by: Eric Dumazet 
Signed-off-by: Peter Oskolkov 
---
 net/sched/sch_netem.c | 89 +--
 1 file changed, 69 insertions(+), 20 deletions(-)

diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index 2c38e3d07924..84658f60a872 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -77,6 +77,10 @@ struct netem_sched_data {
/* internal t(ime)fifo qdisc uses t_root and sch->limit */
struct rb_root t_root;
 
+   /* a linear queue; reduces rbtree rebalancing when jitter is low */
+   struct sk_buff  *t_head;
+   struct sk_buff  *t_tail;
+
/* optional qdisc for classful handling (NULL at netem init) */
struct Qdisc*qdisc;
 
@@ -369,26 +373,39 @@ static void tfifo_reset(struct Qdisc *sch)
rb_erase(&skb->rbnode, &q->t_root);
rtnl_kfree_skbs(skb, skb);
}
+
+   rtnl_kfree_skbs(q->t_head, q->t_tail);
+   q->t_head = NULL;
+   q->t_tail = NULL;
 }
 
 static void tfifo_enqueue(struct sk_buff *nskb, struct Qdisc *sch)
 {
struct netem_sched_data *q = qdisc_priv(sch);
u64 tnext = netem_skb_cb(nskb)->time_to_send;
-   struct rb_node **p = &q->t_root.rb_node, *parent = NULL;
 
-   while (*p) {
-   struct sk_buff *skb;
-
-   parent = *p;
-   skb = rb_to_skb(parent);
-   if (tnext >= netem_skb_cb(skb)->time_to_send)
-   p = &parent->rb_right;
+   if (!q->t_tail || tnext >= netem_skb_cb(q->t_tail)->time_to_send) {
+   if (q->t_tail)
+   q->t_tail->next = nskb;
else
-   p = &parent->rb_left;
+   q->t_head = nskb;
+   q->t_tail = nskb;
+   } else {
+   struct rb_node **p = &q->t_root.rb_node, *parent = NULL;
+
+   while (*p) {
+   struct sk_buff *skb;
+
+   parent = *p;
+   skb = rb_to_skb(parent);
+   if (tnext >= netem_skb_cb(skb)->time_to_send)
+   p = &parent->rb_right;
+   else
+   p = &parent->rb_left;
+   }
+   rb_link_node(&nskb->rbnode, parent, p);
+   rb_insert_color(&nskb->rbnode, &q->t_root);
}
-   rb_link_node(&nskb->rbnode, parent, p);
-   rb_insert_color(&nskb->rbnode, &q->t_root);
sch->q.qlen++;
 }
 
@@ -530,9 +547,16 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
t_skb = skb_rb_last(&q->t_root);
t_last = netem_skb_cb(t_skb);
if (!last ||
-   t_last->time_to_send > last->time_to_send) {
+   t_last->time_to_send > last->time_to_send)
+   last = t_last;
+   }
+   if (q->t_tail) {
+   struct netem_skb_cb *t_last =
+   netem_skb_cb(q->t_tail);
+
+   if (!last ||
+   t_last->time_to_send > last->time_to_send)
last = t_last;
-   }
}
 
if (last) {
@@ -611,11 +635,38 @@ static void get_slot_next(struct netem_sched_data *q, u64 now)
q->slot.bytes_left = q->slot_config.max_bytes;
 }
 
+static struct sk_buff *netem_peek(struct netem_sched_data *q)
+{
+   struct sk_buff *skb = skb_rb_first(&q->t_root);
+   u64 t1, t2;
+
+   if (!skb)
+   return q->t_head;
+   if (!q->t_head)
+   return skb;
+
+   t1 = netem_skb_cb(skb)->time_to_send;
+   t2 = netem_skb_cb(q->t_head)->time_to_send;
+   if (t1 < t2)
+   return skb;
+   return q->t_head;
+}
+
+static void netem_erase_head(struct netem_sched_data *q, struct sk_buff *skb)
+{
+   if (skb == q->t_head) {
+   q->t_head = skb->next;
+   if (!q->t_head)
+   q->t_tail = NULL;
+   } else {
+   rb_erase(&skb->rbnode, &q->t_root);
+   }
+}
+
 static struct sk_buff *netem_dequeue(struct Qdisc *sch)
 {
struct netem_sched_data *q = qdisc_priv(sch);
struct sk_buff *skb;
-   struct rb_node *p;
 
 tfifo_dequeue:
skb = 

[PATCH v2 net-next] ip6_tunnel: Adding support of mapping rules for MAP-E tunnel

2018-12-02 Thread Felix Jia
From: Blair Steven 

Mapping of Addresses and Ports with Encapsulation (MAP-E) is defined in
RFC7597, and is an IPv6 transition technology providing interoperability
between IPv4 and IPv6 networks.

MAP-E uses the encapsulation mode described in RFC2473 (IPv6 Tunneling)
to transport IPv4 and IPv6 packets over an IPv6 network. It requires a
list of rules for mapping between IPv4 prefix/shared addresses and IPv6
addresses.

This patch also supports the mapping rules defined in the draft3 version
of the RFC.

Co-developed-by: Felix Jia 
Co-developed-by: Sheena Mira-ato 
Co-developed-by: Masakazu Asama 
Signed-off-by: Blair Steven 
Signed-off-by: Felix Jia 
Signed-off-by: Sheena Mira-ato 
Signed-off-by: Masakazu Asama 
---
 include/net/ip6_tunnel.h   |  18 ++
 include/uapi/linux/if_tunnel.h |  18 ++
 net/ipv6/ip6_tunnel.c  | 490 -
 3 files changed, 524 insertions(+), 2 deletions(-)

diff --git a/include/net/ip6_tunnel.h b/include/net/ip6_tunnel.h
index 69b4bcf880c9..ed715ee8d87c 100644
--- a/include/net/ip6_tunnel.h
+++ b/include/net/ip6_tunnel.h
@@ -18,6 +18,16 @@
 /* determine capability on a per-packet basis */
 #define IP6_TNL_F_CAP_PER_PACKET 0x4
 
+struct ip6_tnl_rule {
+   struct in_addr ipv4_subnet;
+   struct in6_addr ipv6_subnet;
+   u8 version;
+   u8 ea_length;
+   u8 psid_offset;
+   u8 ipv4_prefixlen;
+   u8 ipv6_prefixlen;
+};
+
 struct __ip6_tnl_parm {
char name[IFNAMSIZ];/* name of tunnel device */
int link;   /* ifindex of underlying L2 interface */
@@ -40,6 +50,13 @@ struct __ip6_tnl_parm {
__u8erspan_ver; /* ERSPAN version */
__u8dir;/* direction */
__u16   hwid;   /* hwid */
+   __u8rule_action;
+   struct ip6_tnl_rule rule;
+};
+
+struct ip6_rule_list {
+   struct list_head list;
+   struct ip6_tnl_rule data;
 };
 
 /* IPv6 tunnel */
@@ -63,6 +80,7 @@ struct ip6_tnl {
int encap_hlen; /* Encap header length (FOU,GUE) */
struct ip_tunnel_encap encap;
int mlink;
+   struct ip6_rule_list rules;
 };
 
 struct ip6_tnl_encap_ops {
diff --git a/include/uapi/linux/if_tunnel.h b/include/uapi/linux/if_tunnel.h
index 1b3d148c4560..7cb09c8c4d8a 100644
--- a/include/uapi/linux/if_tunnel.h
+++ b/include/uapi/linux/if_tunnel.h
@@ -77,10 +77,28 @@ enum {
IFLA_IPTUN_ENCAP_DPORT,
IFLA_IPTUN_COLLECT_METADATA,
IFLA_IPTUN_FWMARK,
+   IFLA_IPTUN_RULE_VERSION,
+   IFLA_IPTUN_RULE_ACTION,
+   IFLA_IPTUN_RULE_IPV6_PREFIX,
+   IFLA_IPTUN_RULE_IPV6_PREFIXLEN,
+   IFLA_IPTUN_RULE_IPV4_PREFIX,
+   IFLA_IPTUN_RULE_IPV4_PREFIXLEN,
+   IFLA_IPTUN_RULE_EA_LENGTH,
+   IFLA_IPTUN_RULE_PSID_OFFSET,
__IFLA_IPTUN_MAX,
 };
 #define IFLA_IPTUN_MAX (__IFLA_IPTUN_MAX - 1)
 
+enum map_rule_versions {
+   MAP_VERSION_RFC,
+   MAP_VERSION_DRAFT3,
+};
+
+enum tunnel_rule_actions {
+   TUNNEL_RULE_ACTION_ADD = 1,
+   TUNNEL_RULE_ACTION_DELETE = 2,
+};
+
 enum tunnel_encap_types {
TUNNEL_ENCAP_NONE,
TUNNEL_ENCAP_FOU,
diff --git a/net/ipv6/ip6_tunnel.c b/net/ipv6/ip6_tunnel.c
index a9d06d4dd057..3bd7a5045f28 100644
--- a/net/ipv6/ip6_tunnel.c
+++ b/net/ipv6/ip6_tunnel.c
@@ -20,6 +20,8 @@
 
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
+#include 
+#include 
 #include 
 #include 
 #include 
@@ -32,6 +34,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -124,6 +127,226 @@ static struct net_device_stats *ip6_get_stats(struct net_device *dev)
return >stats;
 }
 
+int
+ip6_get_addrport(struct iphdr *iph, __be32 *saddr4, __be32 *daddr4,
+__be16 *sport4, __be16 *dport4, __u8 *proto, int *icmperr)
+{
+   u8 *ptr;
+   struct iphdr *icmpiph = NULL;
+   struct tcphdr *tcph, *icmptcph;
+   struct udphdr *udph, *icmpudph;
+   struct icmphdr *icmph, *icmpicmph;
+
+   *icmperr = 0;
+   *saddr4 = iph->saddr;
+   *daddr4 = iph->daddr;
+   ptr = (u8 *)iph;
+   ptr += iph->ihl * 4;
+   switch (iph->protocol) {
+   case IPPROTO_TCP:
+   *proto = IPPROTO_TCP;
+   tcph = (struct tcphdr *)ptr;
+   *sport4 = tcph->source;
+   *dport4 = tcph->dest;
+   break;
+   case IPPROTO_UDP:
+   *proto = IPPROTO_UDP;
+   udph = (struct udphdr *)ptr;
+   *sport4 = udph->source;
+   *dport4 = udph->dest;
+   break;
+   case IPPROTO_ICMP:
+   *proto = IPPROTO_ICMP;
+   icmph = (struct icmphdr *)ptr;
+   switch (icmph->type) {
+   case ICMP_DEST_UNREACH:
+   case ICMP_SOURCE_QUENCH:
+   case ICMP_TIME_EXCEEDED:
+   case ICMP_PARAMETERPROB:
+   *icmperr = 1;
+   ptr = (u8 

Re: [PATCH v2 net-next] cxgb4: number of VFs supported is not always 16

2018-11-30 Thread David Miller
From: Ganesh Goudar 
Date: Tue, 27 Nov 2018 14:59:06 +0530

> Total number of VFs supported by PF is used to determine the last
> byte of VF's mac address. Number of VFs supported is not always
> 16, use the variable nvfs to get the number of VFs supported
> rather than hard coding it to 16.
> 
> Signed-off-by: Casey Leedom 
> Signed-off-by: Ganesh Goudar 
> ---
> V2: Fixes typo in commit message

Applied.


Re: [PATCH v2 net-next 0/8] dpaa2-eth: Introduce XDP support

2018-11-28 Thread David Miller
From: Ioana Ciocoi Radulescu 
Date: Wed, 28 Nov 2018 09:18:28 +

> They apply cleanly for me.

I figured out what happened.

The patches were mis-ordered (specifically patches #3 and #4) when I added
them to the patchwork bundle, and that is what causes them to fail.

Series applied, thanks!


Re: [PATCH v2 net-next 1/8] dpaa2-eth: Add basic XDP support

2018-11-28 Thread David Ahern
On 11/26/18 9:27 AM, Ioana Ciocoi Radulescu wrote:
> We keep one XDP program reference per channel. The only actions
> supported for now are XDP_DROP and XDP_PASS.
> 
> Until now we didn't enforce a maximum size for Rx frames based
> on MTU value. Change that, since for XDP mode we must ensure no
> scatter-gather frames can be received.
> 
> Signed-off-by: Ioana Radulescu 
> ---
> v2: - xdp packets count towards the rx packets and bytes counters
> - add warning message with the maximum supported MTU value for XDP
> 
>  drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c | 189 
> ++-
>  drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h |   6 +
>  2 files changed, 194 insertions(+), 1 deletion(-)
> 

Reviewed-by: David Ahern 




Re: [PATCH v2 net-next 8/8] dpaa2-eth: Add xdp counters

2018-11-28 Thread David Ahern
On 11/26/18 9:27 AM, Ioana Ciocoi Radulescu wrote:
> Add counters for xdp processed frames to the channel statistics.
> 
> Signed-off-by: Ioana Radulescu 
> ---
> v2: no changes
> 

Reviewed-by: David Ahern 




Re: [PATCH v2 net-next 0/8] dpaa2-eth: Introduce XDP support

2018-11-28 Thread David Ahern
On 11/28/18 2:18 AM, Ioana Ciocoi Radulescu wrote:
>> -Original Message-
>> From: David Miller 
>> Sent: Wednesday, November 28, 2018 2:25 AM
>> To: Ioana Ciocoi Radulescu 
>> Cc: netdev@vger.kernel.org; Ioana Ciornei ;
>> dsah...@gmail.com; Camelia Alexandra Groza 
>> Subject: Re: [PATCH v2 net-next 0/8] dpaa2-eth: Introduce XDP support
>>
>> From: Ioana Ciocoi Radulescu 
>> Date: Mon, 26 Nov 2018 16:27:28 +
>>
>>> Add support for XDP programs. Only XDP_PASS, XDP_DROP and XDP_TX
>>> actions are supported for now. Frame header changes are also
>>> allowed.
>>>
>>> v2: - count the XDP packets in the rx/tx interface stats
>>> - add message with the maximum supported MTU value for XDP
>>
>> This doesn't apply cleanly to net-next.
>>
>> Could you please do a quick respin so I can apply this?
> 
> They apply cleanly for me. To double-check, I've downloaded the mbox
> patches from patchwork and applied them on net-next.git, master branch
> (commit 86d1d8b72c).
> I'm obviously doing something wrong, but I don't know what.

same here. All patches applied cleanly to net-next.


RE: [PATCH v2 net-next 0/8] dpaa2-eth: Introduce XDP support

2018-11-28 Thread Ioana Ciocoi Radulescu
> -Original Message-
> From: David Miller 
> Sent: Wednesday, November 28, 2018 2:25 AM
> To: Ioana Ciocoi Radulescu 
> Cc: netdev@vger.kernel.org; Ioana Ciornei ;
> dsah...@gmail.com; Camelia Alexandra Groza 
> Subject: Re: [PATCH v2 net-next 0/8] dpaa2-eth: Introduce XDP support
> 
> From: Ioana Ciocoi Radulescu 
> Date: Mon, 26 Nov 2018 16:27:28 +
> 
> > Add support for XDP programs. Only XDP_PASS, XDP_DROP and XDP_TX
> > actions are supported for now. Frame header changes are also
> > allowed.
> >
> > v2: - count the XDP packets in the rx/tx interface stats
> > - add message with the maximum supported MTU value for XDP
> 
> This doesn't apply cleanly to net-next.
> 
> Could you please do a quick respin so I can apply this?

They apply cleanly for me. To double-check, I've downloaded the mbox
patches from patchwork and applied them on net-next.git, master branch
(commit 86d1d8b72c).
I'm obviously doing something wrong, but I don't know what.

Thanks,
Ioana


Re: [PATCH v2 net-next] tcp: remove hdrlen argument from tcp_queue_rcv()

2018-11-27 Thread David Miller
From: Eric Dumazet 
Date: Mon, 26 Nov 2018 14:49:12 -0800

> Only one caller needs to pull TCP headers, so lets
> move __skb_pull() to the caller side.
> 
> Signed-off-by: Eric Dumazet 
> Acked-by: Yuchung Cheng 
> ---
> v2: sent as a standalone patch.

Applied, thanks Eric.


Re: [PATCH v2 net-next 0/8] dpaa2-eth: Introduce XDP support

2018-11-27 Thread David Miller
From: Ioana Ciocoi Radulescu 
Date: Mon, 26 Nov 2018 16:27:28 +

> Add support for XDP programs. Only XDP_PASS, XDP_DROP and XDP_TX
> actions are supported for now. Frame header changes are also
> allowed.
> 
> v2: - count the XDP packets in the rx/tx interface stats
> - add message with the maximum supported MTU value for XDP

This doesn't apply cleanly to net-next.

Could you please do a quick respin so I can apply this?

Thanks!


Re: [PATCH v2 net-next 4/4] tcp: implement coalescing on backlog queue

2018-11-27 Thread Eric Dumazet
On Tue, Nov 27, 2018 at 2:13 PM Eric Dumazet  wrote:
>
>
>
> On 11/27/2018 01:58 PM, Neal Cardwell wrote:
>
> > I wonder if technically perhaps the logic should skip coalescing if
> > the tail or skb has the TCP_FLAG_URG bit set? It seems if skbs are
> > coalesced, and some have urgent data and some do not, then the
> > TCP_FLAG_URG bit will be accumulated into the tail header, but there
> > will be no way to ensure the correct urgent offsets for the one or
> > more skbs with urgent data are passed along.
>
> Yes, I guess I need to fix that, thanks.
>
> I will simply make sure both thtail->urg and th->urg are not set.
>
> I could only test thtail->urg, but that would require copying th->urg_ptr and 
> th->urg,
> and quite frankly we should not spend cycles on URG stuff.

pseudo code added in V3

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 9fa7516fb5c33277be4ba3a667ff61202d8dd445..4904250a9aac5001410f9454258cbb8978bb8202 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1668,6 +1668,8 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
TCP_SKB_CB(tail)->ip_dsfield != TCP_SKB_CB(skb)->ip_dsfield ||
+((TCP_SKB_CB(tail)->tcp_flags |
+TCP_SKB_CB(skb)->tcp_flags) & TCPHDR_URG) ||
+   ((TCP_SKB_CB(tail)->tcp_flags ^
+ TCP_SKB_CB(skb)->tcp_flags) & (TCPHDR_ECE | TCPHDR_CWR)) ||
 #ifdef CONFIG_TLS_DEVICE
tail->decrypted != skb->decrypted ||
 #endif


Re: [PATCH v2 net-next 4/4] tcp: implement coalescing on backlog queue

2018-11-27 Thread Eric Dumazet



On 11/27/2018 01:58 PM, Neal Cardwell wrote:

> I wonder if technically perhaps the logic should skip coalescing if
> the tail or skb has the TCP_FLAG_URG bit set? It seems if skbs are
> coalesced, and some have urgent data and some do not, then the
> TCP_FLAG_URG bit will be accumulated into the tail header, but there
> will be no way to ensure the correct urgent offsets for the one or
> more skbs with urgent data are passed along.

Yes, I guess I need to fix that, thanks.

I will simply make sure both thtail->urg and th->urg are not set.

I could only test thtail->urg, but that would require copying th->urg_ptr and 
th->urg,
and quite frankly we should not spend cycles on URG stuff.





Re: [PATCH v2 net-next 4/4] tcp: implement coalescing on backlog queue

2018-11-27 Thread Neal Cardwell
On Tue, Nov 27, 2018 at 10:57 AM Eric Dumazet  wrote:
>
> In case GRO is not as efficient as it should be or disabled,
> we might have a user thread trapped in __release_sock() while
> softirq handler flood packets up to the point we have to drop.
>
> This patch balances work done from user thread and softirq,
> to give more chances to __release_sock() to complete its work
> before new packets are added to the backlog.
>
> This also helps if we receive many ACK packets, since GRO
> does not aggregate them.
>
> This patch brings ~60% throughput increase on a receiver
> without GRO, but the spectacular gain is really the
> 1000x release_sock() latency reduction I have measured.
>
> Signed-off-by: Eric Dumazet 
> Cc: Neal Cardwell 
> Cc: Yuchung Cheng 
> ---
...
> +   if (TCP_SKB_CB(tail)->end_seq != TCP_SKB_CB(skb)->seq ||
> +   TCP_SKB_CB(tail)->ip_dsfield != TCP_SKB_CB(skb)->ip_dsfield ||
> +#ifdef CONFIG_TLS_DEVICE
> +   tail->decrypted != skb->decrypted ||
> +#endif
> +   thtail->doff != th->doff ||
> +   memcmp(thtail + 1, th + 1, hdrlen - sizeof(*th)))
> +   goto no_coalesce;
> +
> +   __skb_pull(skb, hdrlen);
> +   if (skb_try_coalesce(tail, skb, &fragstolen, &delta)) {
> +   thtail->window = th->window;
> +
> +   TCP_SKB_CB(tail)->end_seq = TCP_SKB_CB(skb)->end_seq;
> +
> +   if (after(TCP_SKB_CB(skb)->ack_seq, 
> TCP_SKB_CB(tail)->ack_seq))
> +   TCP_SKB_CB(tail)->ack_seq = TCP_SKB_CB(skb)->ack_seq;
> +
> +   TCP_SKB_CB(tail)->tcp_flags |= TCP_SKB_CB(skb)->tcp_flags;

I wonder if technically perhaps the logic should skip coalescing if
the tail or skb has the TCP_FLAG_URG bit set? It seems if skbs are
coalesced, and some have urgent data and some do not, then the
TCP_FLAG_URG bit will be accumulated into the tail header, but there
will be no way to ensure the correct urgent offsets for the one or
more skbs with urgent data are passed along.

Thinking out loud, I guess if this is ECN/DCTCP and some ACKs have
TCP_FLAG_ECE and some don't, this will effectively have all ACKed
bytes be treated as ECN-marked. Probably OK, since if this coalescing
path is being hit the sender may be overloaded and slowing down might
be a good thing.

Otherwise, looks great to me. Thanks for doing this!

neal


Re: [PATCH v2 net-next 2/4] tcp: take care of compressed acks in tcp_add_reno_sack()

2018-11-27 Thread Eric Dumazet



On 11/27/2018 01:19 PM, Neal Cardwell wrote:
> On Tue, Nov 27, 2018 at 10:57 AM Eric Dumazet  wrote:
>>
>> Neal pointed out that non-SACK flows might suffer from ACK compression
>> added in the following patch ("tcp: implement coalescing on backlog queue")
>>
>> Instead of tweaking tcp_add_backlog() we can take into
>> account how many ACKs were coalesced; this information
>> will be available in skb_shinfo(skb)->gso_segs
>>
>> Signed-off-by: Eric Dumazet 
>> ---
> ...
>> @@ -2679,8 +2683,8 @@ static void tcp_process_loss(struct sock *sk, int 
>> flag, bool is_dupack,
>> /* A Reno DUPACK means new data in F-RTO step 2.b above are
>  * delivered. Lower inflight to clock out (re)transmissions.
>>  */
>> -   if (after(tp->snd_nxt, tp->high_seq) && is_dupack)
>> -   tcp_add_reno_sack(sk);
>> +   if (after(tp->snd_nxt, tp->high_seq))
>> +   tcp_add_reno_sack(sk, num_dupack);
>> else if (flag & FLAG_SND_UNA_ADVANCED)
>> tcp_reset_reno_sack(tp);
>> }
> 
> I think this probably should be checking num_dupack, something like:
> 
> +   if (after(tp->snd_nxt, tp->high_seq) && num_dupack)
> +   tcp_add_reno_sack(sk, num_dupack);
> 
> If we don't check num_dupack, that seems to mean that after FRTO sends
> the two new data packets (making snd_nxt after high_seq), the patch
> would have a particular non-SACK FRTO loss recovery always go into the
> "if" branch where we tcp_add_reno_sack() function, and we would never
> have a chance to get to the "else" branch where we check if
> FLAG_SND_UNA_ADVANCED and zero out the reno SACKs.
> 
> Otherwise the patch looks great to me. Thanks for doing this!
>

Oh right, I missed the else clause, I thought that tcp_add_reno_sack()
checking the num_dupack was enough.

Thanks.


Re: [PATCH v2 net-next 3/4] tcp: make tcp_space() aware of socket backlog

2018-11-27 Thread Neal Cardwell
On Tue, Nov 27, 2018 at 10:57 AM Eric Dumazet  wrote:
>
> Jean-Louis Dupond reported poor iscsi TCP receive performance
> that we tracked to backlog drops.
>
> Apparently we fail to send window updates reflecting the
> fact that we are under stress.
>
> Note that we might lack a proper window increase when
> backlog is fully processed, since __release_sock() clears
> sk->sk_backlog.len _after_ all skbs have been processed.
>
> This should not matter in practice. If we had a significant
> load through socket backlog, we are in a dangerous
> situation.
>
> Reported-by: Jean-Louis Dupond 
> Signed-off-by: Eric Dumazet 
> ---

Acked-by: Neal Cardwell 

Nice. Thanks!

neal


Re: [PATCH v2 net-next 2/4] tcp: take care of compressed acks in tcp_add_reno_sack()

2018-11-27 Thread Neal Cardwell
On Tue, Nov 27, 2018 at 10:57 AM Eric Dumazet  wrote:
>
> Neal pointed out that non-SACK flows might suffer from ACK compression
> added in the following patch ("tcp: implement coalescing on backlog queue")
>
> Instead of tweaking tcp_add_backlog() we can take into
> account how many ACKs were coalesced; this information
> will be available in skb_shinfo(skb)->gso_segs
>
> Signed-off-by: Eric Dumazet 
> ---
...
> @@ -2679,8 +2683,8 @@ static void tcp_process_loss(struct sock *sk, int flag, 
> bool is_dupack,
> /* A Reno DUPACK means new data in F-RTO step 2.b above are
>  * delivered. Lower inflight to clock out (re)tranmissions.
>  */
> -   if (after(tp->snd_nxt, tp->high_seq) && is_dupack)
> -   tcp_add_reno_sack(sk);
> +   if (after(tp->snd_nxt, tp->high_seq))
> +   tcp_add_reno_sack(sk, num_dupack);
> else if (flag & FLAG_SND_UNA_ADVANCED)
> tcp_reset_reno_sack(tp);
> }

I think this probably should be checking num_dupack, something like:

+   if (after(tp->snd_nxt, tp->high_seq) && num_dupack)
+   tcp_add_reno_sack(sk, num_dupack);

If we don't check num_dupack, that seems to mean that after FRTO sends
the two new data packets (making snd_nxt after high_seq), the patch
would have a particular non-SACK FRTO loss recovery always go into the
"if" branch where we tcp_add_reno_sack() function, and we would never
have a chance to get to the "else" branch where we check if
FLAG_SND_UNA_ADVANCED and zero out the reno SACKs.

Otherwise the patch looks great to me. Thanks for doing this!

neal


Re: [PATCH v2 net-next 1/4] tcp: hint compiler about sack flows

2018-11-27 Thread Neal Cardwell
On Tue, Nov 27, 2018 at 10:57 AM Eric Dumazet  wrote:
>
> Tell the compiler that most TCP flows are using SACK these days.
>
> There is no need to add the unlikely() clause in tcp_is_reno(),
> the compiler is able to infer it.
>
> Signed-off-by: Eric Dumazet 
> ---

Acked-by: Neal Cardwell 

Nice. Thanks!

neal


Re: [PATCH v2 net-next 0/4] tcp: take a bit more care of backlog stress

2018-11-27 Thread Yuchung Cheng
On Tue, Nov 27, 2018 at 7:57 AM, Eric Dumazet  wrote:
> While working on the SACK compression issue Jean-Louis Dupond
> reported, we found that his linux box was suffering very hard
> from tail drops on the socket backlog queue.
>
> First patch hints the compiler about sack flows being the norm.
>
> Second patch changes non-sack code in preparation of the ack
> compression.
>
> Third patch fixes tcp_space() to take backlog into account.
>
> Fourth patch is attempting coalescing when a new packet must
> be added to the backlog queue. Cooking bigger skbs helps
> to keep backlog list smaller and speeds its handling when
> user thread finally releases the socket lock.
>
> v2: added feedback from Neal : tcp: take care of compressed acks in 
> tcp_add_reno_sack()
> added : tcp: hint compiler about sack flows
> added : tcp: make tcp_space() aware of socket backlog
Great feature!

Acked-by: Yuchung Cheng 

>
>
>
> Eric Dumazet (4):
>   tcp: hint compiler about sack flows
>   tcp: take care of compressed acks in tcp_add_reno_sack()
>   tcp: make tcp_space() aware of socket backlog
>   tcp: implement coalescing on backlog queue
>
>  include/net/tcp.h |  4 +-
>  include/uapi/linux/snmp.h |  1 +
>  net/ipv4/proc.c   |  1 +
>  net/ipv4/tcp_input.c  | 58 +++---
>  net/ipv4/tcp_ipv4.c   | 88 ---
>  5 files changed, 119 insertions(+), 33 deletions(-)
>
> --
> 2.20.0.rc0.387.gc7a69e6b6c-goog
>


[PATCH v2 net-next 2/4] tcp: take care of compressed acks in tcp_add_reno_sack()

2018-11-27 Thread Eric Dumazet
Neal pointed out that non-SACK flows might suffer from ACK compression
added in the following patch ("tcp: implement coalescing on backlog queue")

Instead of tweaking tcp_add_backlog() we can take into
account how many ACKs were coalesced; this information
will be available in skb_shinfo(skb)->gso_segs

Signed-off-by: Eric Dumazet 
---
 net/ipv4/tcp_input.c | 58 +---
 1 file changed, 33 insertions(+), 25 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 
f32397890b6dcbc34976954c4be142108efa04d8..33d9956d667cbd5eaf6a93913a10ce5d419b8a3a
 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1863,16 +1863,20 @@ static void tcp_check_reno_reordering(struct sock *sk, 
const int addend)
 
 /* Emulate SACKs for SACKless connection: account for a new dupack. */
 
-static void tcp_add_reno_sack(struct sock *sk)
+static void tcp_add_reno_sack(struct sock *sk, int num_dupack)
 {
-   struct tcp_sock *tp = tcp_sk(sk);
-   u32 prior_sacked = tp->sacked_out;
+   if (num_dupack) {
+   struct tcp_sock *tp = tcp_sk(sk);
+   u32 prior_sacked = tp->sacked_out;
+   s32 delivered;
 
-   tp->sacked_out++;
-   tcp_check_reno_reordering(sk, 0);
-   if (tp->sacked_out > prior_sacked)
-   tp->delivered++; /* Some out-of-order packet is delivered */
-   tcp_verify_left_out(tp);
+   tp->sacked_out += num_dupack;
+   tcp_check_reno_reordering(sk, 0);
+   delivered = tp->sacked_out - prior_sacked;
+   if (delivered > 0)
+   tp->delivered += delivered;
+   tcp_verify_left_out(tp);
+   }
 }
 
 /* Account for ACK, ACKing some data in Reno Recovery phase. */
@@ -2634,7 +2638,7 @@ void tcp_enter_recovery(struct sock *sk, bool ece_ack)
 /* Process an ACK in CA_Loss state. Move to CA_Open if lost data are
  * recovered or spurious. Otherwise retransmits more on partial ACKs.
  */
-static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack,
+static void tcp_process_loss(struct sock *sk, int flag, int num_dupack,
 int *rexmit)
 {
struct tcp_sock *tp = tcp_sk(sk);
@@ -2653,7 +2657,7 @@ static void tcp_process_loss(struct sock *sk, int flag, 
bool is_dupack,
return;
 
if (after(tp->snd_nxt, tp->high_seq)) {
-   if (flag & FLAG_DATA_SACKED || is_dupack)
+   if (flag & FLAG_DATA_SACKED || num_dupack)
tp->frto = 0; /* Step 3.a. loss was real */
} else if (flag & FLAG_SND_UNA_ADVANCED && !recovered) {
tp->high_seq = tp->snd_nxt;
@@ -2679,8 +2683,8 @@ static void tcp_process_loss(struct sock *sk, int flag, 
bool is_dupack,
/* A Reno DUPACK means new data in F-RTO step 2.b above are
 * delivered. Lower inflight to clock out (re)transmissions.
 */
-   if (after(tp->snd_nxt, tp->high_seq) && is_dupack)
-   tcp_add_reno_sack(sk);
+   if (after(tp->snd_nxt, tp->high_seq))
+   tcp_add_reno_sack(sk, num_dupack);
else if (flag & FLAG_SND_UNA_ADVANCED)
tcp_reset_reno_sack(tp);
}
@@ -2757,13 +2761,13 @@ static bool tcp_force_fast_retransmit(struct sock *sk)
  * tcp_xmit_retransmit_queue().
  */
 static void tcp_fastretrans_alert(struct sock *sk, const u32 prior_snd_una,
- bool is_dupack, int *ack_flag, int *rexmit)
+ int num_dupack, int *ack_flag, int *rexmit)
 {
struct inet_connection_sock *icsk = inet_csk(sk);
struct tcp_sock *tp = tcp_sk(sk);
int fast_rexmit = 0, flag = *ack_flag;
-   bool do_lost = is_dupack || ((flag & FLAG_DATA_SACKED) &&
-tcp_force_fast_retransmit(sk));
+   bool do_lost = num_dupack || ((flag & FLAG_DATA_SACKED) &&
+ tcp_force_fast_retransmit(sk));
 
if (!tp->packets_out && tp->sacked_out)
tp->sacked_out = 0;
@@ -2810,8 +2814,8 @@ static void tcp_fastretrans_alert(struct sock *sk, const 
u32 prior_snd_una,
switch (icsk->icsk_ca_state) {
case TCP_CA_Recovery:
if (!(flag & FLAG_SND_UNA_ADVANCED)) {
-   if (tcp_is_reno(tp) && is_dupack)
-   tcp_add_reno_sack(sk);
+   if (tcp_is_reno(tp))
+   tcp_add_reno_sack(sk, num_dupack);
} else {
if (tcp_try_undo_partial(sk, prior_snd_una))
return;
@@ -2826,7 +2830,7 @@ static void tcp_fastretrans_alert(struct sock *sk, const 
u32 prior_snd_una,
tcp_identify_packet_loss(sk, ack_flag);
break;
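
For illustration, a hedged sketch of how the caller side can turn the
coalesced-segment count into num_dupack (the variable names and the
clamp to at least one are assumptions, not taken verbatim from this
patch):

/* Hypothetical caller: a pure ACK normally counts as one dupack, but
 * a backlog-coalesced skb carries the number of merged ACKs in
 * skb_shinfo(skb)->gso_segs, so the SACKless path can credit them all
 * in a single call.
 */
int num_dupack = max_t(u16, 1, skb_shinfo(skb)->gso_segs);

tcp_fastretrans_alert(sk, prior_snd_una, num_dupack, &flag, &rexmit);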

[PATCH v2 net-next 4/4] tcp: implement coalescing on backlog queue

2018-11-27 Thread Eric Dumazet
In case GRO is not as efficient as it should be or disabled,
we might have a user thread trapped in __release_sock() while
softirq handler flood packets up to the point we have to drop.

This patch balances work done from user thread and softirq,
to give more chances to __release_sock() to complete its work
before new packets are added to the backlog.

This also helps if we receive many ACK packets, since GRO
does not aggregate them.

This patch brings ~60% throughput increase on a receiver
without GRO, but the spectacular gain is really the
1000x release_sock() latency reduction I have measured.

Signed-off-by: Eric Dumazet 
Cc: Neal Cardwell 
Cc: Yuchung Cheng 
---
 include/uapi/linux/snmp.h |  1 +
 net/ipv4/proc.c   |  1 +
 net/ipv4/tcp_ipv4.c   | 88 ---
 3 files changed, 84 insertions(+), 6 deletions(-)

diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
index 
f80135e5feaa88609db6dff75b2bc2d637b2..86dc24a96c90ab047d5173d625450facd6c6dd79
 100644
--- a/include/uapi/linux/snmp.h
+++ b/include/uapi/linux/snmp.h
@@ -243,6 +243,7 @@ enum
LINUX_MIB_TCPREQQFULLDROP,  /* TCPReqQFullDrop */
LINUX_MIB_TCPRETRANSFAIL,   /* TCPRetransFail */
LINUX_MIB_TCPRCVCOALESCE,   /* TCPRcvCoalesce */
+   LINUX_MIB_TCPBACKLOGCOALESCE,   /* TCPBacklogCoalesce */
LINUX_MIB_TCPOFOQUEUE,  /* TCPOFOQueue */
LINUX_MIB_TCPOFODROP,   /* TCPOFODrop */
LINUX_MIB_TCPOFOMERGE,  /* TCPOFOMerge */
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 
70289682a6701438aed99a00a9705c39fa4394d3..c3610b37bb4ce665b1976d8cc907b6dd0de42ab9
 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -219,6 +219,7 @@ static const struct snmp_mib snmp4_net_list[] = {
SNMP_MIB_ITEM("TCPRenoRecoveryFail", LINUX_MIB_TCPRENORECOVERYFAIL),
SNMP_MIB_ITEM("TCPSackRecoveryFail", LINUX_MIB_TCPSACKRECOVERYFAIL),
SNMP_MIB_ITEM("TCPRcvCollapsed", LINUX_MIB_TCPRCVCOLLAPSED),
+   SNMP_MIB_ITEM("TCPBacklogCoalesce", LINUX_MIB_TCPBACKLOGCOALESCE),
SNMP_MIB_ITEM("TCPDSACKOldSent", LINUX_MIB_TCPDSACKOLDSENT),
SNMP_MIB_ITEM("TCPDSACKOfoSent", LINUX_MIB_TCPDSACKOFOSENT),
SNMP_MIB_ITEM("TCPDSACKRecv", LINUX_MIB_TCPDSACKRECV),
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 
795605a2327504b8a025405826e7e0ca8dc8501d..b587a841678eb66ece005a9900537fd3f3dab963
 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1619,12 +1619,14 @@ int tcp_v4_early_demux(struct sk_buff *skb)
 bool tcp_add_backlog(struct sock *sk, struct sk_buff *skb)
 {
u32 limit = sk->sk_rcvbuf + sk->sk_sndbuf;
-
-   /* Only socket owner can try to collapse/prune rx queues
-* to reduce memory overhead, so add a little headroom here.
-* Few sockets backlog are possibly concurrently non empty.
-*/
-   limit += 64*1024;
+   struct skb_shared_info *shinfo;
+   const struct tcphdr *th;
+   struct tcphdr *thtail;
+   struct sk_buff *tail;
+   unsigned int hdrlen;
+   bool fragstolen;
+   u32 gso_segs;
+   int delta;
 
/* In case all data was pulled from skb frags (in __pskb_pull_tail()),
 * we can fix skb->truesize to its real value to avoid future drops.
@@ -1636,6 +1638,80 @@ bool tcp_add_backlog(struct sock *sk, struct sk_buff 
*skb)
 
skb_dst_drop(skb);
 
+   if (unlikely(tcp_checksum_complete(skb))) {
+   bh_unlock_sock(sk);
+   __TCP_INC_STATS(sock_net(sk), TCP_MIB_CSUMERRORS);
+   __TCP_INC_STATS(sock_net(sk), TCP_MIB_INERRS);
+   return true;
+   }
+
+   /* Attempt coalescing to last skb in backlog, even if we are
+* above the limits.
+* This is okay because skb capacity is limited to MAX_SKB_FRAGS.
+*/
+   th = (const struct tcphdr *)skb->data;
+   hdrlen = th->doff * 4;
+   shinfo = skb_shinfo(skb);
+
+   if (!shinfo->gso_size)
+   shinfo->gso_size = skb->len - hdrlen;
+
+   if (!shinfo->gso_segs)
+   shinfo->gso_segs = 1;
+
+   tail = sk->sk_backlog.tail;
+   if (!tail)
+   goto no_coalesce;
+   thtail = (struct tcphdr *)tail->data;
+
+   if (TCP_SKB_CB(tail)->end_seq != TCP_SKB_CB(skb)->seq ||
+   TCP_SKB_CB(tail)->ip_dsfield != TCP_SKB_CB(skb)->ip_dsfield ||
+#ifdef CONFIG_TLS_DEVICE
+   tail->decrypted != skb->decrypted ||
+#endif
+   thtail->doff != th->doff ||
+   memcmp(thtail + 1, th + 1, hdrlen - sizeof(*th)))
+   goto no_coalesce;
+
+   __skb_pull(skb, hdrlen);
+   if (skb_try_coalesce(tail, skb, &fragstolen, &delta)) {
+   thtail->window = th->window;
+
+   TCP_SKB_CB(tail)->end_seq = TCP_SKB_CB(skb)->end_seq;
+
+   if (after(TCP_SKB_CB(skb)->ack_seq, TCP_SKB_CB(tail)->ack_seq))
+  

[PATCH v2 net-next 3/4] tcp: make tcp_space() aware of socket backlog

2018-11-27 Thread Eric Dumazet
Jean-Louis Dupond reported poor iscsi TCP receive performance
that we tracked to backlog drops.

Apparently we fail to send window updates reflecting the
fact that we are under stress.

Note that we might lack a proper window increase when
backlog is fully processed, since __release_sock() clears
sk->sk_backlog.len _after_ all skbs have been processed.

This should not matter in practice. If we had a significant
load through socket backlog, we are in a dangerous
situation.

Reported-by: Jean-Louis Dupond 
Signed-off-by: Eric Dumazet 
---
 include/net/tcp.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 
0c61bf0a06dac95268c26b6302a2afbaef4c88b3..3b522259da7d5a54d7d3730ddd8d8c9ef24313e1
 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1368,7 +1368,7 @@ static inline int tcp_win_from_space(const struct sock 
*sk, int space)
 /* Note: caller must be prepared to deal with negative returns */
 static inline int tcp_space(const struct sock *sk)
 {
-   return tcp_win_from_space(sk, sk->sk_rcvbuf -
+   return tcp_win_from_space(sk, sk->sk_rcvbuf - sk->sk_backlog.len -
  atomic_read(&sk->sk_rmem_alloc));
 }
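
The effect on the advertised window, with invented numbers:

/* Illustration only (all values made up):
 *   sk_rcvbuf       = 1024 KB
 *   sk_rmem_alloc   =  256 KB  (already queued for the application)
 *   sk_backlog.len  =  512 KB  (still waiting for the socket lock)
 *
 * before: tcp_win_from_space(sk, 1024 KB - 256 KB)          -> from 768 KB
 * after:  tcp_win_from_space(sk, 1024 KB - 512 KB - 256 KB) -> from 256 KB
 *
 * The advertised window now shrinks while the backlog fills,
 * throttling the sender instead of tail-dropping at the backlog.
 */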
 
-- 
2.20.0.rc0.387.gc7a69e6b6c-goog



[PATCH v2 net-next 1/4] tcp: hint compiler about sack flows

2018-11-27 Thread Eric Dumazet
Tell the compiler that most TCP flows are using SACK these days.

There is no need to add the unlikely() clause in tcp_is_reno(),
the compiler is able to infer it.

Signed-off-by: Eric Dumazet 
---
 include/net/tcp.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 
63e37dd1c274cc396e41ea9612cf67a5b7c89776..0c61bf0a06dac95268c26b6302a2afbaef4c88b3
 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1124,7 +1124,7 @@ void tcp_rate_check_app_limited(struct sock *sk);
  */
 static inline int tcp_is_sack(const struct tcp_sock *tp)
 {
-   return tp->rx_opt.sack_ok;
+   return likely(tp->rx_opt.sack_ok);
 }
 
 static inline bool tcp_is_reno(const struct tcp_sock *tp)
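
For reference, likely() expands to gcc's branch-prediction builtin,
roughly as in include/linux/compiler.h:

#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)

Since tcp_is_reno() is simply !tcp_is_sack(), the inverted prediction
falls out of the same expression, which is why no explicit unlikely()
is needed there.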
-- 
2.20.0.rc0.387.gc7a69e6b6c-goog



[PATCH v2 net-next 0/4] tcp: take a bit more care of backlog stress

2018-11-27 Thread Eric Dumazet
While working on the SACK compression issue Jean-Louis Dupond
reported, we found that his linux box was suffering very hard
from tail drops on the socket backlog queue.

First patch hints the compiler about sack flows being the norm.

Second patch changes non-sack code in preparation of the ack
compression.

Third patch fixes tcp_space() to take backlog into account.

Fourth patch is attempting coalescing when a new packet must
be added to the backlog queue. Cooking bigger skbs helps
to keep backlog list smaller and speeds its handling when
user thread finally releases the socket lock.

v2: added feedback from Neal : tcp: take care of compressed acks in 
tcp_add_reno_sack() 
added : tcp: hint compiler about sack flows
added : tcp: make tcp_space() aware of socket backlog



Eric Dumazet (4):
  tcp: hint compiler about sack flows
  tcp: take care of compressed acks in tcp_add_reno_sack()
  tcp: make tcp_space() aware of socket backlog
  tcp: implement coalescing on backlog queue

 include/net/tcp.h |  4 +-
 include/uapi/linux/snmp.h |  1 +
 net/ipv4/proc.c   |  1 +
 net/ipv4/tcp_input.c  | 58 +++---
 net/ipv4/tcp_ipv4.c   | 88 ---
 5 files changed, 119 insertions(+), 33 deletions(-)

-- 
2.20.0.rc0.387.gc7a69e6b6c-goog



RE: [PATCH v2 net-next 1/8] dpaa2-eth: Add basic XDP support

2018-11-27 Thread Camelia Alexandra Groza
> -Original Message-
> From: Ioana Ciocoi Radulescu
> Sent: Monday, November 26, 2018 18:27
> To: netdev@vger.kernel.org; da...@davemloft.net
> Cc: Ioana Ciornei ; dsah...@gmail.com; Camelia
> Alexandra Groza 
> Subject: [PATCH v2 net-next 1/8] dpaa2-eth: Add basic XDP support
> 
> We keep one XDP program reference per channel. The only actions
> supported for now are XDP_DROP and XDP_PASS.
> 
> Until now we didn't enforce a maximum size for Rx frames based
> on MTU value. Change that, since for XDP mode we must ensure no
> scatter-gather frames can be received.
> 
> Signed-off-by: Ioana Radulescu 

Acked-by: Camelia Groza 


[PATCH v2 net-next] cxgb4: number of VFs supported is not always 16

2018-11-27 Thread Ganesh Goudar
The total number of VFs supported by the PF is used to determine the
last byte of a VF's mac address. The number of VFs supported is not
always 16; use the variable nvfs to get the number of VFs supported
rather than hard-coding it to 16.

Signed-off-by: Casey Leedom 
Signed-off-by: Ganesh Goudar 
---
V2: Fixes typo in commit message
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index 7f76ad9..6ba9099 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -2646,7 +2646,7 @@ static void cxgb4_mgmt_fill_vf_station_mac_addr(struct 
adapter *adap)
 
for (vf = 0, nvfs = pci_sriov_get_totalvfs(adap->pdev);
vf < nvfs; vf++) {
-   macaddr[5] = adap->pf * 16 + vf;
+   macaddr[5] = adap->pf * nvfs + vf;
ether_addr_copy(adap->vfinfo[vf].vf_mac_addr, macaddr);
}
 }
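
A worked example of the failure mode (numbers invented): on an adapter
exposing 32 VFs per PF, the old code gives PF0/VF16 the last byte
0 * 16 + 16 = 16, which collides with PF1/VF0's 1 * 16 + 0 = 16. With
nvfs = 32 those become 16 and 32, so every (pf, vf) pair maps to a
unique byte as long as pf * nvfs + vf stays below 256.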
-- 
2.1.0



[PATCH v2 net-next] tcp: remove hdrlen argument from tcp_queue_rcv()

2018-11-26 Thread Eric Dumazet
Only one caller needs to pull TCP headers, so let's
move __skb_pull() to the caller side.

Signed-off-by: Eric Dumazet 
Acked-by: Yuchung Cheng 
---
v2: sent as a standalone patch.

 net/ipv4/tcp_input.c | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 
568dbf3b711af75e5f4f0a309f8943579e913494..f32397890b6dcbc34976954c4be142108efa04d8
 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4603,13 +4603,12 @@ static void tcp_data_queue_ofo(struct sock *sk, struct 
sk_buff *skb)
}
 }
 
-static int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb, 
int hdrlen,
- bool *fragstolen)
+static int __must_check tcp_queue_rcv(struct sock *sk, struct sk_buff *skb,
+ bool *fragstolen)
 {
int eaten;
struct sk_buff *tail = skb_peek_tail(>sk_receive_queue);
 
-   __skb_pull(skb, hdrlen);
eaten = (tail &&
 tcp_try_coalesce(sk, tail,
  skb, fragstolen)) ? 1 : 0;
@@ -4660,7 +4659,7 @@ int tcp_send_rcvq(struct sock *sk, struct msghdr *msg, 
size_t size)
TCP_SKB_CB(skb)->end_seq = TCP_SKB_CB(skb)->seq + size;
TCP_SKB_CB(skb)->ack_seq = tcp_sk(sk)->snd_una - 1;
 
-   if (tcp_queue_rcv(sk, skb, 0, &fragstolen)) {
+   if (tcp_queue_rcv(sk, skb, &fragstolen)) {
WARN_ON_ONCE(fragstolen); /* should not happen */
__kfree_skb(skb);
}
@@ -4720,7 +4719,7 @@ static void tcp_data_queue(struct sock *sk, struct 
sk_buff *skb)
goto drop;
}
 
-   eaten = tcp_queue_rcv(sk, skb, 0, &fragstolen);
+   eaten = tcp_queue_rcv(sk, skb, &fragstolen);
if (skb->len)
tcp_event_data_recv(sk, skb);
if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
@@ -5596,8 +5595,8 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff 
*skb)
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPHPHITS);
 
/* Bulk data transfer: receiver */
-   eaten = tcp_queue_rcv(sk, skb, tcp_header_len,
- &fragstolen);
+   __skb_pull(skb, tcp_header_len);
+   eaten = tcp_queue_rcv(sk, skb, &fragstolen);
 
tcp_event_data_recv(sk, skb);
 
-- 
2.20.0.rc0.387.gc7a69e6b6c-goog



[PATCH v2 net-next 6/8] dpaa2-eth: Add support for XDP_TX

2018-11-26 Thread Ioana Ciocoi Radulescu
Send frames back on the same port for XDP_TX action.
Since the frame buffers have been allocated by us, we can recycle
them directly into the Rx buffer pool instead of requesting a
confirmation frame upon transmission complete.

Signed-off-by: Ioana Radulescu 
---
v2: XDP_TX packets count towards the tx packets and bytes counters

 drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c | 51 +++-
 drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h |  2 +
 2 files changed, 52 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c 
b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
index c2e880b..bc582c4 100644
--- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
+++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
@@ -240,14 +240,53 @@ static void xdp_release_buf(struct dpaa2_eth_priv *priv,
ch->xdp.drop_cnt = 0;
 }
 
+static int xdp_enqueue(struct dpaa2_eth_priv *priv, struct dpaa2_fd *fd,
+  void *buf_start, u16 queue_id)
+{
+   struct dpaa2_eth_fq *fq;
+   struct dpaa2_faead *faead;
+   u32 ctrl, frc;
+   int i, err;
+
+   /* Mark the egress frame hardware annotation area as valid */
+   frc = dpaa2_fd_get_frc(fd);
+   dpaa2_fd_set_frc(fd, frc | DPAA2_FD_FRC_FAEADV);
+   dpaa2_fd_set_ctrl(fd, DPAA2_FD_CTRL_ASAL);
+
+   /* Instruct hardware to release the FD buffer directly into
+* the buffer pool once transmission is completed, instead of
+* sending a Tx confirmation frame to us
+*/
+   ctrl = DPAA2_FAEAD_A4V | DPAA2_FAEAD_A2V | DPAA2_FAEAD_EBDDV;
+   faead = dpaa2_get_faead(buf_start, false);
+   faead->ctrl = cpu_to_le32(ctrl);
+   faead->conf_fqid = 0;
+
+   fq = &priv->fq[queue_id];
+   for (i = 0; i < DPAA2_ETH_ENQUEUE_RETRIES; i++) {
+   err = dpaa2_io_service_enqueue_qd(fq->channel->dpio,
+ priv->tx_qdid, 0,
+ fq->tx_qdbin, fd);
+   if (err != -EBUSY)
+   break;
+   }
+
+   return err;
+}
+
 static u32 run_xdp(struct dpaa2_eth_priv *priv,
   struct dpaa2_eth_channel *ch,
+  struct dpaa2_eth_fq *rx_fq,
   struct dpaa2_fd *fd, void *vaddr)
 {
dma_addr_t addr = dpaa2_fd_get_addr(fd);
+   struct rtnl_link_stats64 *percpu_stats;
struct bpf_prog *xdp_prog;
struct xdp_buff xdp;
u32 xdp_act = XDP_PASS;
+   int err;
+
+   percpu_stats = this_cpu_ptr(priv->percpu_stats);
 
rcu_read_lock();
 
@@ -269,6 +308,16 @@ static u32 run_xdp(struct dpaa2_eth_priv *priv,
switch (xdp_act) {
case XDP_PASS:
break;
+   case XDP_TX:
+   err = xdp_enqueue(priv, fd, vaddr, rx_fq->flowid);
+   if (err) {
+   xdp_release_buf(priv, ch, addr);
+   percpu_stats->tx_errors++;
+   } else {
+   percpu_stats->tx_packets++;
+   percpu_stats->tx_bytes += dpaa2_fd_get_len(fd);
+   }
+   break;
default:
bpf_warn_invalid_xdp_action(xdp_act);
case XDP_ABORTED:
@@ -317,7 +366,7 @@ static void dpaa2_eth_rx(struct dpaa2_eth_priv *priv,
percpu_extras = this_cpu_ptr(priv->percpu_extras);
 
if (fd_format == dpaa2_fd_single) {
-   xdp_act = run_xdp(priv, ch, (struct dpaa2_fd *)fd, vaddr);
+   xdp_act = run_xdp(priv, ch, fq, (struct dpaa2_fd *)fd, vaddr);
if (xdp_act != XDP_PASS) {
percpu_stats->rx_packets++;
percpu_stats->rx_bytes += dpaa2_fd_get_len(fd);
diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h 
b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h
index 23cf9d9..5530a0e 100644
--- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h
+++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h
@@ -139,7 +139,9 @@ struct dpaa2_faead {
 };
 
#define DPAA2_FAEAD_A2V    0x2000
+#define DPAA2_FAEAD_A4V    0x0800
#define DPAA2_FAEAD_UPDV   0x1000
+#define DPAA2_FAEAD_EBDDV  0x2000
#define DPAA2_FAEAD_UPD    0x0010
 
 /* Accessors for the hardware annotation fields that we use */
-- 
2.7.4



[PATCH v2 net-next 2/8] dpaa2-eth: Allow XDP header adjustments

2018-11-26 Thread Ioana Ciocoi Radulescu
Reserve XDP_PACKET_HEADROOM bytes in Rx buffers to allow XDP
programs to increase frame header size.

Signed-off-by: Ioana Radulescu 
---
v2: no changes

 drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c | 43 ++--
 1 file changed, 40 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c 
b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
index d3cfed4..008cdf8 100644
--- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
+++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
@@ -216,11 +216,15 @@ static u32 run_xdp(struct dpaa2_eth_priv *priv,
 
xdp.data = vaddr + dpaa2_fd_get_offset(fd);
xdp.data_end = xdp.data + dpaa2_fd_get_len(fd);
-   xdp.data_hard_start = xdp.data;
+   xdp.data_hard_start = xdp.data - XDP_PACKET_HEADROOM;
xdp_set_data_meta_invalid(&xdp);

xdp_act = bpf_prog_run_xdp(xdp_prog, &xdp);
 
+   /* xdp.data pointer may have changed */
+   dpaa2_fd_set_offset(fd, xdp.data - vaddr);
+   dpaa2_fd_set_len(fd, xdp.data_end - xdp.data);
+
switch (xdp_act) {
case XDP_PASS:
break;
@@ -1483,7 +1487,7 @@ static bool xdp_mtu_valid(struct dpaa2_eth_priv *priv, 
int mtu)
 
mfl = DPAA2_ETH_L2_MAX_FRM(mtu);
linear_mfl = DPAA2_ETH_RX_BUF_SIZE - DPAA2_ETH_RX_HWA_SIZE -
-dpaa2_eth_rx_head_room(priv);
+dpaa2_eth_rx_head_room(priv) - XDP_PACKET_HEADROOM;
 
if (mfl > linear_mfl) {
netdev_warn(priv->net_dev, "Maximum MTU for XDP is %d\n",
@@ -1537,6 +1541,32 @@ static int dpaa2_eth_change_mtu(struct net_device *dev, 
int new_mtu)
return 0;
 }
 
+static int update_rx_buffer_headroom(struct dpaa2_eth_priv *priv, bool has_xdp)
+{
+   struct dpni_buffer_layout buf_layout = {0};
+   int err;
+
+   err = dpni_get_buffer_layout(priv->mc_io, 0, priv->mc_token,
+DPNI_QUEUE_RX, &buf_layout);
+   if (err) {
+   netdev_err(priv->net_dev, "dpni_get_buffer_layout failed\n");
+   return err;
+   }
+
+   /* Reserve extra headroom for XDP header size changes */
+   buf_layout.data_head_room = dpaa2_eth_rx_head_room(priv) +
+   (has_xdp ? XDP_PACKET_HEADROOM : 0);
+   buf_layout.options = DPNI_BUF_LAYOUT_OPT_DATA_HEAD_ROOM;
+   err = dpni_set_buffer_layout(priv->mc_io, 0, priv->mc_token,
+DPNI_QUEUE_RX, &buf_layout);
+   if (err) {
+   netdev_err(priv->net_dev, "dpni_set_buffer_layout failed\n");
+   return err;
+   }
+
+   return 0;
+}
+
 static int setup_xdp(struct net_device *dev, struct bpf_prog *prog)
 {
struct dpaa2_eth_priv *priv = netdev_priv(dev);
@@ -1560,11 +1590,18 @@ static int setup_xdp(struct net_device *dev, struct 
bpf_prog *prog)
if (up)
dpaa2_eth_stop(dev);
 
-   /* While in xdp mode, enforce a maximum Rx frame size based on MTU */
+   /* While in xdp mode, enforce a maximum Rx frame size based on MTU.
+* Also, when switching between xdp/non-xdp modes we need to reconfigure
+* our Rx buffer layout. Buffer pool was drained on dpaa2_eth_stop,
+* so we are sure no old format buffers will be used from now on.
+*/
if (need_update) {
err = set_rx_mfl(priv, dev->mtu, !!prog);
if (err)
goto out_err;
+   err = update_rx_buffer_headroom(priv, !!prog);
+   if (err)
+   goto out_err;
}
 
old = xchg(&priv->xdp_prog, prog);
-- 
2.7.4
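
To see what the reserved headroom buys, here is a minimal XDP program
that grows the frame at the front. This is a sketch, not part of this
series; it follows the bpf_helpers.h conventions used by the kernel
samples/selftests of that era:

#include <linux/bpf.h>
#include "bpf_helpers.h"

SEC("xdp")
int xdp_push16(struct xdp_md *ctx)
{
	/* Move xdp->data 16 bytes toward data_hard_start, i.e. prepend
	 * 16 bytes of header; this fails unless the driver reserved
	 * headroom (XDP_PACKET_HEADROOM) in front of the frame.
	 */
	if (bpf_xdp_adjust_head(ctx, -16))
		return XDP_DROP;
	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";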



[PATCH v2 net-next 7/8] dpaa2-eth: Cleanup channel stats

2018-11-26 Thread Ioana Ciocoi Radulescu
Remove unused counter. Reorder fields in channel stats structure
to match the ethtool strings order and make it easier to print them
with ethtool -S.

Signed-off-by: Ioana Radulescu 
---
v2: no changes

 drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c |  1 -
 drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h |  6 ++
 drivers/net/ethernet/freescale/dpaa2/dpaa2-ethtool.c | 16 +---
 3 files changed, 7 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c 
b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
index bc582c4..d2bc5da 100644
--- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
+++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
@@ -467,7 +467,6 @@ static int consume_frames(struct dpaa2_eth_channel *ch,
return 0;
 
fq->stats.frames += cleaned;
-   ch->stats.frames += cleaned;
 
/* A dequeue operation only pulls frames from a single queue
 * into the store. Return the frame queue as an out param.
diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h 
b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h
index 5530a0e..41a2a0d 100644
--- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h
+++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h
@@ -245,12 +245,10 @@ struct dpaa2_eth_fq_stats {
 struct dpaa2_eth_ch_stats {
/* Volatile dequeues retried due to portal busy */
__u64 dequeue_portal_busy;
-   /* Number of CDANs; useful to estimate avg NAPI len */
-   __u64 cdan;
-   /* Number of frames received on queues from this channel */
-   __u64 frames;
/* Pull errors */
__u64 pull_err;
+   /* Number of CDANs; useful to estimate avg NAPI len */
+   __u64 cdan;
 };
 
 /* Maximum number of queues associated with a DPNI */
diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-ethtool.c 
b/drivers/net/ethernet/freescale/dpaa2/dpaa2-ethtool.c
index 26bd5a2..79eeebe 100644
--- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-ethtool.c
+++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-ethtool.c
@@ -174,8 +174,6 @@ static void dpaa2_eth_get_ethtool_stats(struct net_device 
*net_dev,
int j, k, err;
int num_cnt;
union dpni_statistics dpni_stats;
-   u64 cdan = 0;
-   u64 portal_busy = 0, pull_err = 0;
struct dpaa2_eth_priv *priv = netdev_priv(net_dev);
struct dpaa2_eth_drv_stats *extras;
struct dpaa2_eth_ch_stats *ch_stats;
@@ -212,16 +210,12 @@ static void dpaa2_eth_get_ethtool_stats(struct net_device 
*net_dev,
}
i += j;
 
-   for (j = 0; j < priv->num_channels; j++) {
-   ch_stats = &priv->channel[j]->stats;
-   cdan += ch_stats->cdan;
-   portal_busy += ch_stats->dequeue_portal_busy;
-   pull_err += ch_stats->pull_err;
+   /* Per-channel stats */
+   for (k = 0; k < priv->num_channels; k++) {
+   ch_stats = &priv->channel[k]->stats;
+   for (j = 0; j < sizeof(*ch_stats) / sizeof(__u64); j++)
+   *((__u64 *)data + i + j) += *((__u64 *)ch_stats + j);
}
-
-   *(data + i++) = portal_busy;
-   *(data + i++) = pull_err;
-   *(data + i++) = cdan;
 }
 
 static int prep_eth_rule(struct ethhdr *eth_value, struct ethhdr *eth_mask,
-- 
2.7.4



[PATCH v2 net-next 3/8] dpaa2-eth: Move function

2018-11-26 Thread Ioana Ciocoi Radulescu
We'll use function free_bufs() on the XDP path as well, so move
it higher in order to avoid a forward declaration.

Signed-off-by: Ioana Radulescu 
---
v2: no changes

 drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c | 34 
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c 
b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
index 008cdf8..174c960 100644
--- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
+++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
@@ -200,6 +200,23 @@ static struct sk_buff *build_frag_skb(struct 
dpaa2_eth_priv *priv,
return skb;
 }
 
+/* Free buffers acquired from the buffer pool or which were meant to
+ * be released in the pool
+ */
+static void free_bufs(struct dpaa2_eth_priv *priv, u64 *buf_array, int count)
+{
+   struct device *dev = priv->net_dev->dev.parent;
+   void *vaddr;
+   int i;
+
+   for (i = 0; i < count; i++) {
+   vaddr = dpaa2_iova_to_virt(priv->iommu_domain, buf_array[i]);
+   dma_unmap_single(dev, buf_array[i], DPAA2_ETH_RX_BUF_SIZE,
+DMA_FROM_DEVICE);
+   skb_free_frag(vaddr);
+   }
+}
+
 static u32 run_xdp(struct dpaa2_eth_priv *priv,
   struct dpaa2_eth_channel *ch,
   struct dpaa2_fd *fd, void *vaddr)
@@ -797,23 +814,6 @@ static int set_tx_csum(struct dpaa2_eth_priv *priv, bool 
enable)
return 0;
 }
 
-/* Free buffers acquired from the buffer pool or which were meant to
- * be released in the pool
- */
-static void free_bufs(struct dpaa2_eth_priv *priv, u64 *buf_array, int count)
-{
-   struct device *dev = priv->net_dev->dev.parent;
-   void *vaddr;
-   int i;
-
-   for (i = 0; i < count; i++) {
-   vaddr = dpaa2_iova_to_virt(priv->iommu_domain, buf_array[i]);
-   dma_unmap_single(dev, buf_array[i], DPAA2_ETH_RX_BUF_SIZE,
-DMA_FROM_DEVICE);
-   skb_free_frag(vaddr);
-   }
-}
-
 /* Perform a single release command to add buffers
  * to the specified buffer pool
  */
-- 
2.7.4



[PATCH v2 net-next 5/8] dpaa2-eth: Map Rx buffers as bidirectional

2018-11-26 Thread Ioana Ciocoi Radulescu
In order to support enqueueing Rx FDs back to hardware, we need to
DMA map Rx buffers as bidirectional.

Signed-off-by: Ioana Radulescu 
---
v2: no changes

 drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c 
b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
index ac4cb81..c2e880b 100644
--- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
+++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
@@ -87,7 +87,7 @@ static void free_rx_fd(struct dpaa2_eth_priv *priv,
addr = dpaa2_sg_get_addr(&sgt[i]);
sg_vaddr = dpaa2_iova_to_virt(priv->iommu_domain, addr);
dma_unmap_single(dev, addr, DPAA2_ETH_RX_BUF_SIZE,
-DMA_FROM_DEVICE);
+DMA_BIDIRECTIONAL);
 
skb_free_frag(sg_vaddr);
if (dpaa2_sg_is_final(&sgt[i]))
@@ -145,7 +145,7 @@ static struct sk_buff *build_frag_skb(struct dpaa2_eth_priv 
*priv,
sg_addr = dpaa2_sg_get_addr(sge);
sg_vaddr = dpaa2_iova_to_virt(priv->iommu_domain, sg_addr);
dma_unmap_single(dev, sg_addr, DPAA2_ETH_RX_BUF_SIZE,
-DMA_FROM_DEVICE);
+DMA_BIDIRECTIONAL);
 
sg_length = dpaa2_sg_get_len(sge);
 
@@ -212,7 +212,7 @@ static void free_bufs(struct dpaa2_eth_priv *priv, u64 
*buf_array, int count)
for (i = 0; i < count; i++) {
vaddr = dpaa2_iova_to_virt(priv->iommu_domain, buf_array[i]);
dma_unmap_single(dev, buf_array[i], DPAA2_ETH_RX_BUF_SIZE,
-DMA_FROM_DEVICE);
+DMA_BIDIRECTIONAL);
skb_free_frag(vaddr);
}
 }
@@ -306,7 +306,7 @@ static void dpaa2_eth_rx(struct dpaa2_eth_priv *priv,
 
vaddr = dpaa2_iova_to_virt(priv->iommu_domain, addr);
dma_sync_single_for_cpu(dev, addr, DPAA2_ETH_RX_BUF_SIZE,
-   DMA_FROM_DEVICE);
+   DMA_BIDIRECTIONAL);
 
fas = dpaa2_get_fas(vaddr, false);
prefetch(fas);
@@ -325,13 +325,13 @@ static void dpaa2_eth_rx(struct dpaa2_eth_priv *priv,
}
 
dma_unmap_single(dev, addr, DPAA2_ETH_RX_BUF_SIZE,
-DMA_FROM_DEVICE);
+DMA_BIDIRECTIONAL);
skb = build_linear_skb(ch, fd, vaddr);
} else if (fd_format == dpaa2_fd_sg) {
WARN_ON(priv->xdp_prog);
 
dma_unmap_single(dev, addr, DPAA2_ETH_RX_BUF_SIZE,
-DMA_FROM_DEVICE);
+DMA_BIDIRECTIONAL);
skb = build_frag_skb(priv, ch, buf_data);
skb_free_frag(vaddr);
percpu_extras->rx_sg_frames++;
@@ -865,7 +865,7 @@ static int add_bufs(struct dpaa2_eth_priv *priv,
buf = PTR_ALIGN(buf, priv->rx_buf_align);
 
addr = dma_map_single(dev, buf, DPAA2_ETH_RX_BUF_SIZE,
- DMA_FROM_DEVICE);
+ DMA_BIDIRECTIONAL);
if (unlikely(dma_mapping_error(dev, addr)))
goto err_map;
 
-- 
2.7.4



[PATCH v2 net-next 8/8] dpaa2-eth: Add xdp counters

2018-11-26 Thread Ioana Ciocoi Radulescu
Add counters for xdp processed frames to the channel statistics.

Signed-off-by: Ioana Radulescu 
---
v2: no changes

 drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c | 3 +++
 drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h | 4 
 drivers/net/ethernet/freescale/dpaa2/dpaa2-ethtool.c | 3 +++
 3 files changed, 10 insertions(+)

diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c 
b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
index d2bc5da..be84171 100644
--- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
+++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
@@ -313,9 +313,11 @@ static u32 run_xdp(struct dpaa2_eth_priv *priv,
if (err) {
xdp_release_buf(priv, ch, addr);
percpu_stats->tx_errors++;
+   ch->stats.xdp_tx_err++;
} else {
percpu_stats->tx_packets++;
percpu_stats->tx_bytes += dpaa2_fd_get_len(fd);
+   ch->stats.xdp_tx++;
}
break;
default:
@@ -324,6 +326,7 @@ static u32 run_xdp(struct dpaa2_eth_priv *priv,
trace_xdp_exception(priv->net_dev, xdp_prog, xdp_act);
case XDP_DROP:
xdp_release_buf(priv, ch, addr);
+   ch->stats.xdp_drop++;
break;
}
 
diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h 
b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h
index 41a2a0d..69c965d 100644
--- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h
+++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h
@@ -249,6 +249,10 @@ struct dpaa2_eth_ch_stats {
__u64 pull_err;
/* Number of CDANs; useful to estimate avg NAPI len */
__u64 cdan;
+   /* XDP counters */
+   __u64 xdp_drop;
+   __u64 xdp_tx;
+   __u64 xdp_tx_err;
 };
 
 /* Maximum number of queues associated with a DPNI */
diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-ethtool.c 
b/drivers/net/ethernet/freescale/dpaa2/dpaa2-ethtool.c
index 79eeebe..0c831bf 100644
--- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-ethtool.c
+++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-ethtool.c
@@ -45,6 +45,9 @@ static char dpaa2_ethtool_extras[][ETH_GSTRING_LEN] = {
"[drv] dequeue portal busy",
"[drv] channel pull errors",
"[drv] cdan",
+   "[drv] xdp drop",
+   "[drv] xdp tx",
+   "[drv] xdp tx errors",
 };
 
 #define DPAA2_ETH_NUM_EXTRA_STATS  ARRAY_SIZE(dpaa2_ethtool_extras)
-- 
2.7.4



[PATCH v2 net-next 0/8] dpaa2-eth: Introduce XDP support

2018-11-26 Thread Ioana Ciocoi Radulescu
Add support for XDP programs. Only XDP_PASS, XDP_DROP and XDP_TX
actions are supported for now. Frame header changes are also
allowed.

v2: - count the XDP packets in the rx/tx interface stats
- add message with the maximum supported MTU value for XDP

Ioana Radulescu (8):
  dpaa2-eth: Add basic XDP support
  dpaa2-eth: Allow XDP header adjustments
  dpaa2-eth: Move function
  dpaa2-eth: Release buffers back to pool on XDP_DROP
  dpaa2-eth: Map Rx buffers as bidirectional
  dpaa2-eth: Add support for XDP_TX
  dpaa2-eth: Cleanup channel stats
  dpaa2-eth: Add xdp counters

 drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c   | 349 +++--
 drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h   |  20 +-
 .../net/ethernet/freescale/dpaa2/dpaa2-ethtool.c   |  19 +-
 3 files changed, 350 insertions(+), 38 deletions(-)

-- 
2.7.4



[PATCH v2 net-next 4/8] dpaa2-eth: Release buffers back to pool on XDP_DROP

2018-11-26 Thread Ioana Ciocoi Radulescu
Instead of freeing the RX buffers, release them back into the pool.
We wait for the maximum number of buffers supported by a single
release command to accumulate before issuing the command.

Also, don't unmap the Rx buffers at the beginning of the Rx routine
anymore, since that would require remapping them before release.
Instead, just do a DMA sync at first and only unmap if the frame is
meant for the stack.

Signed-off-by: Ioana Radulescu 
---
v2: no changes

 drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c | 34 +---
 drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h |  2 ++
 2 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c 
b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
index 174c960..ac4cb81 100644
--- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
+++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
@@ -217,10 +217,34 @@ static void free_bufs(struct dpaa2_eth_priv *priv, u64 
*buf_array, int count)
}
 }
 
+static void xdp_release_buf(struct dpaa2_eth_priv *priv,
+   struct dpaa2_eth_channel *ch,
+   dma_addr_t addr)
+{
+   int err;
+
+   ch->xdp.drop_bufs[ch->xdp.drop_cnt++] = addr;
+   if (ch->xdp.drop_cnt < DPAA2_ETH_BUFS_PER_CMD)
+   return;
+
+   while ((err = dpaa2_io_service_release(ch->dpio, priv->bpid,
+  ch->xdp.drop_bufs,
+  ch->xdp.drop_cnt)) == -EBUSY)
+   cpu_relax();
+
+   if (err) {
+   free_bufs(priv, ch->xdp.drop_bufs, ch->xdp.drop_cnt);
+   ch->buf_count -= ch->xdp.drop_cnt;
+   }
+
+   ch->xdp.drop_cnt = 0;
+}
+
 static u32 run_xdp(struct dpaa2_eth_priv *priv,
   struct dpaa2_eth_channel *ch,
   struct dpaa2_fd *fd, void *vaddr)
 {
+   dma_addr_t addr = dpaa2_fd_get_addr(fd);
struct bpf_prog *xdp_prog;
struct xdp_buff xdp;
u32 xdp_act = XDP_PASS;
@@ -250,8 +274,7 @@ static u32 run_xdp(struct dpaa2_eth_priv *priv,
case XDP_ABORTED:
trace_xdp_exception(priv->net_dev, xdp_prog, xdp_act);
case XDP_DROP:
-   ch->buf_count--;
-   free_rx_fd(priv, fd, vaddr);
+   xdp_release_buf(priv, ch, addr);
break;
}
 
@@ -282,7 +305,8 @@ static void dpaa2_eth_rx(struct dpaa2_eth_priv *priv,
trace_dpaa2_rx_fd(priv->net_dev, fd);
 
vaddr = dpaa2_iova_to_virt(priv->iommu_domain, addr);
-   dma_unmap_single(dev, addr, DPAA2_ETH_RX_BUF_SIZE, DMA_FROM_DEVICE);
+   dma_sync_single_for_cpu(dev, addr, DPAA2_ETH_RX_BUF_SIZE,
+   DMA_FROM_DEVICE);
 
fas = dpaa2_get_fas(vaddr, false);
prefetch(fas);
@@ -300,10 +324,14 @@ static void dpaa2_eth_rx(struct dpaa2_eth_priv *priv,
return;
}
 
+   dma_unmap_single(dev, addr, DPAA2_ETH_RX_BUF_SIZE,
+DMA_FROM_DEVICE);
skb = build_linear_skb(ch, fd, vaddr);
} else if (fd_format == dpaa2_fd_sg) {
WARN_ON(priv->xdp_prog);
 
+   dma_unmap_single(dev, addr, DPAA2_ETH_RX_BUF_SIZE,
+DMA_FROM_DEVICE);
skb = build_frag_skb(priv, ch, buf_data);
skb_free_frag(vaddr);
percpu_extras->rx_sg_frames++;
diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h 
b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h
index 2873a15..23cf9d9 100644
--- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h
+++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h
@@ -285,6 +285,8 @@ struct dpaa2_eth_fq {
 
 struct dpaa2_eth_ch_xdp {
struct bpf_prog *prog;
+   u64 drop_bufs[DPAA2_ETH_BUFS_PER_CMD];
+   int drop_cnt;
 };
 
 struct dpaa2_eth_channel {
-- 
2.7.4



[PATCH v2 net-next 1/8] dpaa2-eth: Add basic XDP support

2018-11-26 Thread Ioana Ciocoi Radulescu
We keep one XDP program reference per channel. The only actions
supported for now are XDP_DROP and XDP_PASS.

Until now we didn't enforce a maximum size for Rx frames based
on MTU value. Change that, since for XDP mode we must ensure no
scatter-gather frames can be received.

Signed-off-by: Ioana Radulescu 
---
v2: - xdp packets count towards the rx packets and bytes counters
- add warning message with the maximum supported MTU value for XDP

 drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c | 189 ++-
 drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h |   6 +
 2 files changed, 194 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c 
b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
index 640967a..d3cfed4 100644
--- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
+++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
@@ -13,7 +13,8 @@
 #include 
 #include 
 #include 
-
+#include 
+#include 
 #include 
 
 #include "dpaa2-eth.h"
@@ -199,6 +200,45 @@ static struct sk_buff *build_frag_skb(struct 
dpaa2_eth_priv *priv,
return skb;
 }
 
+static u32 run_xdp(struct dpaa2_eth_priv *priv,
+  struct dpaa2_eth_channel *ch,
+  struct dpaa2_fd *fd, void *vaddr)
+{
+   struct bpf_prog *xdp_prog;
+   struct xdp_buff xdp;
+   u32 xdp_act = XDP_PASS;
+
+   rcu_read_lock();
+
+   xdp_prog = READ_ONCE(ch->xdp.prog);
+   if (!xdp_prog)
+   goto out;
+
+   xdp.data = vaddr + dpaa2_fd_get_offset(fd);
+   xdp.data_end = xdp.data + dpaa2_fd_get_len(fd);
+   xdp.data_hard_start = xdp.data;
+   xdp_set_data_meta_invalid(&xdp);
+
+   xdp_act = bpf_prog_run_xdp(xdp_prog, &xdp);
+
+   switch (xdp_act) {
+   case XDP_PASS:
+   break;
+   default:
+   bpf_warn_invalid_xdp_action(xdp_act);
+   case XDP_ABORTED:
+   trace_xdp_exception(priv->net_dev, xdp_prog, xdp_act);
+   case XDP_DROP:
+   ch->buf_count--;
+   free_rx_fd(priv, fd, vaddr);
+   break;
+   }
+
+out:
+   rcu_read_unlock();
+   return xdp_act;
+}
+
 /* Main Rx frame processing routine */
 static void dpaa2_eth_rx(struct dpaa2_eth_priv *priv,
 struct dpaa2_eth_channel *ch,
@@ -215,6 +255,7 @@ static void dpaa2_eth_rx(struct dpaa2_eth_priv *priv,
struct dpaa2_fas *fas;
void *buf_data;
u32 status = 0;
+   u32 xdp_act;
 
/* Tracing point */
trace_dpaa2_rx_fd(priv->net_dev, fd);
@@ -231,8 +272,17 @@ static void dpaa2_eth_rx(struct dpaa2_eth_priv *priv,
percpu_extras = this_cpu_ptr(priv->percpu_extras);
 
if (fd_format == dpaa2_fd_single) {
+   xdp_act = run_xdp(priv, ch, (struct dpaa2_fd *)fd, vaddr);
+   if (xdp_act != XDP_PASS) {
+   percpu_stats->rx_packets++;
+   percpu_stats->rx_bytes += dpaa2_fd_get_len(fd);
+   return;
+   }
+
skb = build_linear_skb(ch, fd, vaddr);
} else if (fd_format == dpaa2_fd_sg) {
+   WARN_ON(priv->xdp_prog);
+
skb = build_frag_skb(priv, ch, buf_data);
skb_free_frag(vaddr);
percpu_extras->rx_sg_frames++;
@@ -1427,6 +1477,141 @@ static int dpaa2_eth_ioctl(struct net_device *dev, 
struct ifreq *rq, int cmd)
return -EINVAL;
 }
 
+static bool xdp_mtu_valid(struct dpaa2_eth_priv *priv, int mtu)
+{
+   int mfl, linear_mfl;
+
+   mfl = DPAA2_ETH_L2_MAX_FRM(mtu);
+   linear_mfl = DPAA2_ETH_RX_BUF_SIZE - DPAA2_ETH_RX_HWA_SIZE -
+dpaa2_eth_rx_head_room(priv);
+
+   if (mfl > linear_mfl) {
+   netdev_warn(priv->net_dev, "Maximum MTU for XDP is %d\n",
+   linear_mfl - VLAN_ETH_HLEN);
+   return false;
+   }
+
+   return true;
+}
+
+static int set_rx_mfl(struct dpaa2_eth_priv *priv, int mtu, bool has_xdp)
+{
+   int mfl, err;
+
+   /* We enforce a maximum Rx frame length based on MTU only if we have
+* an XDP program attached (in order to avoid Rx S/G frames).
+* Otherwise, we accept all incoming frames as long as they are not
+* larger than maximum size supported in hardware
+*/
+   if (has_xdp)
+   mfl = DPAA2_ETH_L2_MAX_FRM(mtu);
+   else
+   mfl = DPAA2_ETH_MFL;
+
+   err = dpni_set_max_frame_length(priv->mc_io, 0, priv->mc_token, mfl);
+   if (err) {
+   netdev_err(priv->net_dev, "dpni_set_max_frame_length failed\n");
+   return err;
+   }
+
+   return 0;
+}
+
+static int dpaa2_eth_change_mtu(struct net_device *dev, int new_mtu)
+{
+   struct dpaa2_eth_priv *priv = netdev_priv(dev);
+   int err;
+
+   if (!priv->xdp_prog)
+   goto out;
+
+   if (!xdp_mtu_valid(priv, new_mtu))
+   return -EINVAL;
+

[PATCH v2 net-next] net: remove unsafe skb_insert()

2018-11-25 Thread Eric Dumazet
I do not see how one can effectively use skb_insert() without holding
some kind of lock. Otherwise other cpus could have changed the list
right before we have a chance of acquiring list->lock.

Only existing user is in drivers/infiniband/hw/nes/nes_mgt.c and this
one probably meant to use __skb_insert() since it appears nesqp->pau_list
is protected by nesqp->pau_lock. This looks like nesqp->pau_lock
could be removed, since nesqp->pau_list.lock could be used instead.

Signed-off-by: Eric Dumazet 
Cc: Faisal Latif 
Cc: Doug Ledford 
Cc: Jason Gunthorpe 
Cc: linux-rdma 
---
 drivers/infiniband/hw/nes/nes_mgt.c |  4 ++--
 include/linux/skbuff.h  |  2 --
 net/core/skbuff.c   | 22 --
 3 files changed, 2 insertions(+), 26 deletions(-)

diff --git a/drivers/infiniband/hw/nes/nes_mgt.c 
b/drivers/infiniband/hw/nes/nes_mgt.c
index 
fc0c191014e908eea32d752f3499295ef143aa0a..cc4dce5c3e5f6d99fc44fcde7334e70ac7a33002
 100644
--- a/drivers/infiniband/hw/nes/nes_mgt.c
+++ b/drivers/infiniband/hw/nes/nes_mgt.c
@@ -551,14 +551,14 @@ static void queue_fpdus(struct sk_buff *skb, struct 
nes_vnic *nesvnic, struct ne
 
/* Queue skb by sequence number */
if (skb_queue_len(&nesqp->pau_list) == 0) {
-   skb_queue_head(&nesqp->pau_list, skb);
+   __skb_queue_head(&nesqp->pau_list, skb);
} else {
skb_queue_walk(&nesqp->pau_list, tmpskb) {
cb = (struct nes_rskb_cb *)&tmpskb->cb[0];
if (before(seqnum, cb->seqnum))
break;
}
-   skb_insert(tmpskb, skb, &nesqp->pau_list);
+   __skb_insert(skb, tmpskb->prev, tmpskb, &nesqp->pau_list);
}
if (nesqp->pau_state == PAU_READY)
process_it = true;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 
f17a7452ac7bf47ef4bcf89840bba165cee6f50a..73902acf2b71c8800d81b744a936a7420f33b459
 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1749,8 +1749,6 @@ static inline void skb_queue_head_init_class(struct 
sk_buff_head *list,
  * The "__skb_()" functions are the non-atomic ones that
  * can only be called with interrupts disabled.
  */
-void skb_insert(struct sk_buff *old, struct sk_buff *newsk,
-   struct sk_buff_head *list);
 static inline void __skb_insert(struct sk_buff *newsk,
struct sk_buff *prev, struct sk_buff *next,
struct sk_buff_head *list)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 
9a8a72cefe9b94d3821b9cc5ba5bba647ae51267..02cd7ae3d0fb26ef0a8b006390154fdefd0d292f
 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2990,28 +2990,6 @@ void skb_append(struct sk_buff *old, struct sk_buff 
*newsk, struct sk_buff_head
 }
 EXPORT_SYMBOL(skb_append);
 
-/**
- * skb_insert  -   insert a buffer
- * @old: buffer to insert before
- * @newsk: buffer to insert
- * @list: list to use
- *
- * Place a packet before a given packet in a list. The list locks are
- * taken and this function is atomic with respect to other list locked
- * calls.
- *
- * A buffer cannot be placed on two lists at the same time.
- */
-void skb_insert(struct sk_buff *old, struct sk_buff *newsk, struct 
sk_buff_head *list)
-{
-   unsigned long flags;
-
-   spin_lock_irqsave(&list->lock, flags);
-   __skb_insert(newsk, old->prev, old, list);
-   spin_unlock_irqrestore(&list->lock, flags);
-}
-EXPORT_SYMBOL(skb_insert);
-
 static inline void skb_split_inside_header(struct sk_buff *skb,
   struct sk_buff* skb1,
   const u32 len, const int pos)
-- 
2.20.0.rc0.387.gc7a69e6b6c-goog
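
The pattern any remaining user would need instead, sketched here with a
hypothetical should_insert_before() predicate (not a real kernel
function):

unsigned long flags;
struct sk_buff *pos;

spin_lock_irqsave(&list->lock, flags);
skb_queue_walk(list, pos) {
	if (should_insert_before(newsk, pos))	/* hypothetical */
		break;
}
/* Insert before 'pos'; if the walk ran off the end, 'pos' is the list
 * head and this appends at the tail. The lock is held across both the
 * search and the splice, which skb_insert() by itself could never
 * guarantee.
 */
__skb_insert(newsk, pos->prev, pos, list);
spin_unlock_irqrestore(&list->lock, flags);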



Re: [PATCH v2 net-next] net: lpc_eth: fix trivial comment typo

2018-11-21 Thread David Miller
From: Andrea Claudi 
Date: Tue, 20 Nov 2018 18:30:30 +0100

> Fix comment typo rxfliterctrl -> rxfilterctrl
> 
> Signed-off-by: Andrea Claudi 

Applied.


Re: [PATCH v2 net-next] cxgb4/cxgb4vf: Fix mac_hlist initialization and free

2018-11-20 Thread David Miller
From: Arjun Vynipadath 
Date: Tue, 20 Nov 2018 12:11:39 +0530

> Null pointer dereference seen when the cxgb4vf driver is unloaded
> without bringing up any interfaces. Move mac_hlist initialization
> to driver probe and free the mac_hlist in remove to fix the issue.
> 
> Fixes: 24357e06ba51 ("cxgb4vf: fix memleak in mac_hlist initialization")
> Signed-off-by: Arjun Vynipadath 
> Signed-off-by: Casey Leedom 
> Signed-off-by: Ganesh Goudar 
> ---
> v2:
> - Updated commit description as per Leon's feedback

Applied.


[PATCH v2 net-next] net: lpc_eth: fix trivial comment typo

2018-11-20 Thread Andrea Claudi
Fix comment typo rxfliterctrl -> rxfilterctrl

Signed-off-by: Andrea Claudi 
---
 drivers/net/ethernet/nxp/lpc_eth.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/nxp/lpc_eth.c 
b/drivers/net/ethernet/nxp/lpc_eth.c
index bd8695a4faaa..89d17399fb5a 100644
--- a/drivers/net/ethernet/nxp/lpc_eth.c
+++ b/drivers/net/ethernet/nxp/lpc_eth.c
@@ -280,7 +280,7 @@
 #define LPC_FCCR_MIRRORCOUNTERCURRENT(n)   ((n) & 0x)
 
 /*
- * rxfliterctrl, rxfilterwolstatus, and rxfilterwolclear shared
+ * rxfilterctrl, rxfilterwolstatus, and rxfilterwolclear shared
  * register definitions
  */
 #define LPC_RXFLTRW_ACCEPTUNICAST  (1 << 0)
@@ -291,7 +291,7 @@
 #define LPC_RXFLTRW_ACCEPTPERFECT  (1 << 5)
 
 /*
- * rxfliterctrl register definitions
+ * rxfilterctrl register definitions
  */
 #define LPC_RXFLTRWSTS_MAGICPACKETENWOL(1 << 12)
 #define LPC_RXFLTRWSTS_RXFILTERENWOL   (1 << 13)
-- 
2.17.2



[PATCH v2 net-next] cxgb4/cxgb4vf: Fix mac_hlist initialization and free

2018-11-19 Thread Arjun Vynipadath
Null pointer dereference seen when the cxgb4vf driver is unloaded
without bringing up any interfaces. Move mac_hlist initialization
to driver probe and free the mac_hlist in remove to fix the issue.

Fixes: 24357e06ba51 ("cxgb4vf: fix memleak in mac_hlist initialization")
Signed-off-by: Arjun Vynipadath 
Signed-off-by: Casey Leedom 
Signed-off-by: Ganesh Goudar 
---
v2:
- Updated commit description as per Leon's feedback
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 19 ++-
 drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c |  6 +++---
 2 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index 956e708..cdd6f48 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -2280,8 +2280,6 @@ static int cxgb_up(struct adapter *adap)
 #if IS_ENABLED(CONFIG_IPV6)
update_clip(adap);
 #endif
-   /* Initialize hash mac addr list*/
-   INIT_LIST_HEAD(&adap->mac_hlist);
return err;
 
  irq_err:
@@ -2295,8 +2293,6 @@ static int cxgb_up(struct adapter *adap)
 
 static void cxgb_down(struct adapter *adapter)
 {
-   struct hash_mac_addr *entry, *tmp;
-
cancel_work_sync(>tid_release_task);
cancel_work_sync(>db_full_task);
cancel_work_sync(>db_drop_task);
@@ -2306,11 +2302,6 @@ static void cxgb_down(struct adapter *adapter)
t4_sge_stop(adapter);
t4_free_sge_resources(adapter);
 
-   list_for_each_entry_safe(entry, tmp, &adapter->mac_hlist, list) {
-   list_del(&entry->list);
-   kfree(entry);
-   }
-
adapter->flags &= ~FULL_INIT_DONE;
 }
 
@@ -5629,6 +5620,9 @@ static int init_one(struct pci_dev *pdev, const struct 
pci_device_id *ent)
 (is_t5(adapter->params.chip) ? STATMODE_V(0) :
  T6_STATMODE_V(0)));
 
+   /* Initialize hash mac addr list */
+   INIT_LIST_HEAD(&adapter->mac_hlist);
+
for_each_port(adapter, i) {
netdev = alloc_etherdev_mq(sizeof(struct port_info),
   MAX_ETH_QSETS);
@@ -5907,6 +5901,7 @@ static int init_one(struct pci_dev *pdev, const struct 
pci_device_id *ent)
 static void remove_one(struct pci_dev *pdev)
 {
struct adapter *adapter = pci_get_drvdata(pdev);
+   struct hash_mac_addr *entry, *tmp;
 
if (!adapter) {
pci_release_regions(pdev);
@@ -5956,6 +5951,12 @@ static void remove_one(struct pci_dev *pdev)
if (adapter->num_uld || adapter->num_ofld_uld)
t4_uld_mem_free(adapter);
free_some_resources(adapter);
+   list_for_each_entry_safe(entry, tmp, &adapter->mac_hlist,
+list) {
+   list_del(&entry->list);
+   kfree(entry);
+   }
+
 #if IS_ENABLED(CONFIG_IPV6)
t4_cleanup_clip_tbl(adapter);
 #endif
diff --git a/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c b/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
index 8ec503c..8a2ad6b 100644
--- a/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4vf/cxgb4vf_main.c
@@ -723,9 +723,6 @@ static int adapter_up(struct adapter *adapter)
if (adapter->flags & USING_MSIX)
name_msix_vecs(adapter);
 
-   /* Initialize hash mac addr list*/
-   INIT_LIST_HEAD(&adapter->mac_hlist);
-
adapter->flags |= FULL_INIT_DONE;
}
 
@@ -3038,6 +3035,9 @@ static int cxgb4vf_pci_probe(struct pci_dev *pdev,
if (err)
goto err_unmap_bar;
 
+   /* Initialize hash mac addr list */
+   INIT_LIST_HEAD(&adapter->mac_hlist);
+
/*
 * Allocate our "adapter ports" and stitch everything together.
 */
-- 
2.9.5



[PATCH V2 net-next 0/5] net: hns3: Add support of hardware GRO to HNS3 Driver

2018-11-15 Thread Salil Mehta
This patch-set adds support for the hardware-assisted GRO feature to
the HNS3 driver on the Rev B (=0x21) platform. The current hardware
only supports TCP/IPv{4|6} flows.

Change Log:
V1->V2:
1. Remove redundant print reported by Leon Romanovsky.
   Link: https://lkml.org/lkml/2018/11/13/715

Peng Li (5):
  net: hns3: Enable HW GRO for Rev B(=0x21) HNS3 hardware
  net: hns3: Add handling of GRO Pkts not fully RX'ed in NAPI poll
  net: hns3: Add support for ethtool -K to enable/disable HW GRO
  net: hns3: Add skb chain when num of RX buf exceeds MAX_SKB_FRAGS
  net: hns3: Adds GRO params to SKB for the stack

 drivers/net/ethernet/hisilicon/hns3/hnae3.h   |   7 +
 .../net/ethernet/hisilicon/hns3/hns3_enet.c   | 289 ++
 .../net/ethernet/hisilicon/hns3/hns3_enet.h   |  17 +-
 .../hisilicon/hns3/hns3pf/hclge_cmd.h |   7 +
 .../hisilicon/hns3/hns3pf/hclge_main.c|  39 +++
 .../hisilicon/hns3/hns3vf/hclgevf_cmd.h   |   8 +
 .../hisilicon/hns3/hns3vf/hclgevf_main.c  |  39 +++
 7 files changed, 339 insertions(+), 67 deletions(-)

-- 
2.17.1




Re: [PATCH v2 net-next] net: phy: improve struct phy_device member interrupts handling

2018-11-09 Thread David Miller
From: Heiner Kallweit 
Date: Fri, 9 Nov 2018 18:35:52 +0100

> As a heritage from the very early days of phylib, member interrupts is
> defined as u32 even though it's just a flag indicating whether interrupts
> are enabled. So we can change it to a bitfield member. In addition, change
> the code dealing with this member in a way that makes it clear we're
> dealing with a bool value.
> 
> Signed-off-by: Heiner Kallweit 
> ---
> v2:
> - use false/true instead of 0/1 for the constants

Applied.


Re: [PATCH v2 net-next] net: phy: improve struct phy_device member interrupts handling

2018-11-09 Thread Florian Fainelli
On 11/9/18 9:35 AM, Heiner Kallweit wrote:
> As a heritage from the very early days of phylib, member interrupts is
> defined as u32 even though it's just a flag indicating whether interrupts
> are enabled. So we can change it to a bitfield member. In addition, change
> the code dealing with this member in a way that makes it clear we're
> dealing with a bool value.
> 
> Signed-off-by: Heiner Kallweit 

Reviewed-by: Florian Fainelli 
-- 
Florian


Re: [PATCH v2 net-next] net: phy: improve struct phy_device member interrupts handling

2018-11-09 Thread Andrew Lunn
On Fri, Nov 09, 2018 at 06:35:52PM +0100, Heiner Kallweit wrote:
> As a heritage from the very early days of phylib, member interrupts is
> defined as u32 even though it's just a flag indicating whether interrupts
> are enabled. So we can change it to a bitfield member. In addition, change
> the code dealing with this member in a way that makes it clear we're
> dealing with a bool value.
> 
> Signed-off-by: Heiner Kallweit 

Reviewed-by: Andrew Lunn 

Andrew


[PATCH v2 net-next] net: phy: improve struct phy_device member interrupts handling

2018-11-09 Thread Heiner Kallweit
As a heritage from the very early days of phylib, member interrupts is
defined as u32 even though it's just a flag indicating whether interrupts
are enabled. So we can change it to a bitfield member. In addition, change
the code dealing with this member in a way that makes it clear we're
dealing with a bool value.

Signed-off-by: Heiner Kallweit 
---
v2:
- use false/true instead of 0/1 for the constants

Actually this member isn't needed at all and could be replaced with
a parameter in phy_driver->config_intr. But this would mean an API
change; maybe I'll come up with a proposal later.
---
 drivers/net/phy/phy.c |  4 ++--
 include/linux/phy.h   | 10 +-
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index dd5bff955..8dac890f3 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -115,9 +115,9 @@ static int phy_clear_interrupt(struct phy_device *phydev)
  *
  * Returns 0 on success or < 0 on error.
  */
-static int phy_config_interrupt(struct phy_device *phydev, u32 interrupts)
+static int phy_config_interrupt(struct phy_device *phydev, bool interrupts)
 {
-   phydev->interrupts = interrupts;
+   phydev->interrupts = interrupts ? 1 : 0;
if (phydev->drv->config_intr)
return phydev->drv->config_intr(phydev);
 
diff --git a/include/linux/phy.h b/include/linux/phy.h
index 240e04d5a..59bb31ee1 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -262,8 +262,8 @@ static inline struct mii_bus *devm_mdiobus_alloc(struct device *dev)
 void devm_mdiobus_free(struct device *dev, struct mii_bus *bus);
 struct phy_device *mdiobus_scan(struct mii_bus *bus, int addr);
 
-#define PHY_INTERRUPT_DISABLED 0x0
-#define PHY_INTERRUPT_ENABLED  0x8000
+#define PHY_INTERRUPT_DISABLED false
+#define PHY_INTERRUPT_ENABLED  true
 
 /* PHY state machine states:
  *
@@ -409,6 +409,9 @@ struct phy_device {
/* The most recently read link state */
unsigned link:1;
 
+   /* Interrupts are enabled */
+   unsigned interrupts:1;
+
enum phy_state state;
 
u32 dev_flags;
@@ -424,9 +427,6 @@ struct phy_device {
int pause;
int asym_pause;
 
-   /* Enabled Interrupts */
-   u32 interrupts;
-
/* Union of PHY and Attached devices' supported modes */
/* See mii.h for more info */
u32 supported;
-- 
2.19.1



Re: [PATCH v2 net-next] sock: Reset dst when changing sk_mark via setsockopt

2018-11-07 Thread Eric Dumazet



On 11/07/2018 08:55 PM, David Barmann wrote:
> When setting the SO_MARK socket option, the dst needs to be reset so
> that a new route lookup is performed.
> 
> This fixes the case where an application wants to change routing by
> setting a new sk_mark.  If this is done after some packets have already
> been sent, the cached dst is reused and the new mark has no effect.
> 
> Signed-off-by: David Barmann 
> ---
>  net/core/sock.c | 6 --
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 7b304e454a38..c74b10be86cb 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -952,10 +952,12 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
>   clear_bit(SOCK_PASSSEC, &sock->flags);
>   break;
>   case SO_MARK:
> - if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
> + if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) {
>   ret = -EPERM;
> - else
> + } else {
>   sk->sk_mark = val;
> + sk_dst_reset(sk);


There is no need to force a sk_dst_reset(sk) if sk_mark was not changed.

I already gave you this feedback, please do not ignore it.

Thanks.



[PATCH v2 net-next] sock: Reset dst when changing sk_mark via setsockopt

2018-11-07 Thread David Barmann
When setting the SO_MARK socket option, the dst needs to be reset so
that a new route lookup is performed.

This fixes the case where an application wants to change routing by
setting a new sk_mark.  If this is done after some packets have already
been sent, the cached dst is reused and the new mark has no effect.
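
For illustration, a minimal userspace sequence that hits this path (the
mark value is hypothetical and assumes CAP_NET_ADMIN plus an ip rule
matching fwmark 42):

        #include <arpa/inet.h>
        #include <netinet/in.h>
        #include <sys/socket.h>
        #include <unistd.h>

        int main(void)
        {
                struct sockaddr_in dst = {
                        .sin_family = AF_INET,
                        .sin_port = htons(9),           /* discard */
                };
                int fd = socket(AF_INET, SOCK_DGRAM, 0);
                int mark = 42;

                inet_pton(AF_INET, "192.0.2.1", &dst.sin_addr);
                connect(fd, (struct sockaddr *)&dst, sizeof(dst));
                send(fd, "x", 1, 0);    /* dst looked up and cached */

                setsockopt(fd, SOL_SOCKET, SO_MARK, &mark, sizeof(mark));
                send(fd, "x", 1, 0);    /* re-routed only if dst was reset */
                close(fd);
                return 0;
        }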

Signed-off-by: David Barmann 
---
 net/core/sock.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 7b304e454a38..c74b10be86cb 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -952,10 +952,12 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
clear_bit(SOCK_PASSSEC, &sock->flags);
break;
case SO_MARK:
-   if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
+   if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)) {
ret = -EPERM;
-   else
+   } else {
sk->sk_mark = val;
+   sk_dst_reset(sk);
+   }
break;
 
case SO_RXQ_OVFL:
-- 
2.14.5



Re: [PATCH v2 net-next 0/8] net: dsa: microchip: Modify KSZ9477 DSA driver in preparation to add other KSZ switch drivers

2018-10-19 Thread Florian Fainelli
On 08/16/2018 02:34 PM, tristram...@microchip.com wrote:
>> -Original Message-
>> From: Florian Fainelli 
>> Sent: Wednesday, August 15, 2018 5:29 PM
>> To: Tristram Ha - C24268 ; Andrew Lunn
>> ; Pavel Machek ; Ruediger Schmitt
>> 
>> Cc: Arkadi Sharshevsky ; UNGLinuxDriver
>> ; netdev@vger.kernel.org
>> Subject: Re: [PATCH v2 net-next 0/8] net: dsa: microchip: Modify KSZ9477
>> DSA driver in preparation to add other KSZ switch drivers
>>
>> On 12/05/2017 05:46 PM, tristram...@microchip.com wrote:
>>> From: Tristram Ha 
>>>
>>> This series of patches is to modify the original KSZ9477 DSA driver so
>>> that other KSZ switch drivers can be added and use the common code.
>>>
>>> There are several steps to accomplish this achievement.  First is to
>>> rename some function names with a prefix to indicate chip specific
>>> function.  Second is to move common code into header that can be shared.
>>> Last is to modify tag_ksz.c so that it can handle many tail tag formats
>>> used by different KSZ switch drivers.
>>>
>>> ksz_common.c will contain the common code used by all KSZ switch drivers.
>>> ksz9477.c will contain KSZ9477 code from the original ksz_common.c.
>>> ksz9477_spi.c is renamed from ksz_spi.c.
>>> ksz9477_reg.h is renamed from ksz_9477_reg.h.
>>> ksz_common.h is added to provide common code access to KSZ switch
>>> drivers.
>>> ksz_spi.h is added to provide common SPI access functions to KSZ SPI
>>> drivers.
>>
>> Is something gating this series from getting included? It's been nearly
>> 8 months now and this has not been include nor resubmitted, any plans to
>> rebase that patch series and work towards inclusion in net-next when it
>> opens back again?
>>
>> Thank you!
> 
> Sorry for the long delay.  I will restart my kernel submission effort
> next month after finishing the work on the current development project.
> 

Tristram, any chance of resubmitting this or should someone with access
to those switches take up your series and submit it?
-- 
Florian


Re: [PATCH V2 net-next] net: ena: Fix Kconfig dependency on X86

2018-10-17 Thread David Miller
From: 
Date: Wed, 17 Oct 2018 10:04:21 +

> From: Netanel Belgazal 
> 
> The Kconfig limitation to X86 is too wide.
> The ENA driver only requires a little endian dependency.
> 
> Change the dependency to be on little endian CPU.
> 
> Signed-off-by: Netanel Belgazal 

Applied.


[PATCH V2 net-next] net: ena: Fix Kconfig dependency on X86

2018-10-17 Thread netanel
From: Netanel Belgazal 

The Kconfig limitation to X86 is too wide.
The ENA driver only requires a little endian dependency.

Change the dependency to be on little endian CPU.

Signed-off-by: Netanel Belgazal 
---
 drivers/net/ethernet/amazon/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/amazon/Kconfig b/drivers/net/ethernet/amazon/Kconfig
index 99b30353541a..9e87d7b8360f 100644
--- a/drivers/net/ethernet/amazon/Kconfig
+++ b/drivers/net/ethernet/amazon/Kconfig
@@ -17,7 +17,7 @@ if NET_VENDOR_AMAZON
 
 config ENA_ETHERNET
tristate "Elastic Network Adapter (ENA) support"
-   depends on (PCI_MSI && X86)
+   depends on PCI_MSI && !CPU_BIG_ENDIAN
---help---
  This driver supports Elastic Network Adapter (ENA)"
 
-- 
2.15.2.AMZN



Re: [PATCH v2 net-next 00/11] net: Kernel side filtering for route dumps

2018-10-16 Thread David Miller
From: David Ahern 
Date: Mon, 15 Oct 2018 18:56:40 -0700

> From: David Ahern 
> 
> Implement kernel side filtering of route dumps by protocol (e.g., which
> routing daemon installed the route), route type (e.g., unicast), table
> id and nexthop device.
> 
> iproute2 has been doing this filtering in userspace for years; pushing
> the filters to the kernel side reduces the amount of data the kernel
> sends and reduces wasted cycles on both sides processing unwanted data.
> These initial options provide a huge improvement for efficiently
> examining routes on large scale systems.
> 
> v2
> - better handling of requests for a specific table. Rather than walking
>   the hash of all tables, lookup the specific table and dump it
> - refactor mr_rtm_dumproute moving the loop over the table into a
>   helper that can be invoked directly
> - add hook to return NLM_F_DUMP_FILTERED in DONE message to ensure
>   it is returned even when the dump returns nothing

Looks great David, I'll push this out to net-next after my build tests
finish.

Thanks.


[PATCH v2 net-next 06/11] ipmr: Refactor mr_rtm_dumproute

2018-10-15 Thread David Ahern
From: David Ahern 

Move per-table loops from mr_rtm_dumproute to mr_table_dump and export
mr_table_dump for dumps by specific table id.

Signed-off-by: David Ahern 
---
 include/linux/mroute_base.h |  6 
 net/ipv4/ipmr_base.c| 88 -
 2 files changed, 61 insertions(+), 33 deletions(-)

diff --git a/include/linux/mroute_base.h b/include/linux/mroute_base.h
index 6675b9f81979..db85373c8d15 100644
--- a/include/linux/mroute_base.h
+++ b/include/linux/mroute_base.h
@@ -283,6 +283,12 @@ void *mr_mfc_find_any(struct mr_table *mrt, int vifi, void *hasharg);
 
 int mr_fill_mroute(struct mr_table *mrt, struct sk_buff *skb,
   struct mr_mfc *c, struct rtmsg *rtm);
+int mr_table_dump(struct mr_table *mrt, struct sk_buff *skb,
+ struct netlink_callback *cb,
+ int (*fill)(struct mr_table *mrt, struct sk_buff *skb,
+ u32 portid, u32 seq, struct mr_mfc *c,
+ int cmd, int flags),
+ spinlock_t *lock);
 int mr_rtm_dumproute(struct sk_buff *skb, struct netlink_callback *cb,
 struct mr_table *(*iter)(struct net *net,
  struct mr_table *mrt),
diff --git a/net/ipv4/ipmr_base.c b/net/ipv4/ipmr_base.c
index 1ad9aa62a97b..132dd2613ca5 100644
--- a/net/ipv4/ipmr_base.c
+++ b/net/ipv4/ipmr_base.c
@@ -268,6 +268,55 @@ int mr_fill_mroute(struct mr_table *mrt, struct sk_buff *skb,
 }
 EXPORT_SYMBOL(mr_fill_mroute);
 
+int mr_table_dump(struct mr_table *mrt, struct sk_buff *skb,
+ struct netlink_callback *cb,
+ int (*fill)(struct mr_table *mrt, struct sk_buff *skb,
+ u32 portid, u32 seq, struct mr_mfc *c,
+ int cmd, int flags),
+ spinlock_t *lock)
+{
+   unsigned int e = 0, s_e = cb->args[1];
+   unsigned int flags = NLM_F_MULTI;
+   struct mr_mfc *mfc;
+   int err;
+
+   list_for_each_entry_rcu(mfc, &mrt->mfc_cache_list, list) {
+   if (e < s_e)
+   goto next_entry;
+
+   err = fill(mrt, skb, NETLINK_CB(cb->skb).portid,
+  cb->nlh->nlmsg_seq, mfc, RTM_NEWROUTE, flags);
+   if (err < 0)
+   goto out;
+next_entry:
+   e++;
+   }
+   e = 0;
+   s_e = 0;
+
+   spin_lock_bh(lock);
+   list_for_each_entry(mfc, &mrt->mfc_unres_queue, list) {
+   if (e < s_e)
+   goto next_entry2;
+
+   err = fill(mrt, skb, NETLINK_CB(cb->skb).portid,
+  cb->nlh->nlmsg_seq, mfc, RTM_NEWROUTE, flags);
+   if (err < 0) {
+   spin_unlock_bh(lock);
+   goto out;
+   }
+next_entry2:
+   e++;
+   }
+   spin_unlock_bh(lock);
+   err = 0;
+   e = 0;
+
+out:
+   cb->args[1] = e;
+   return err;
+}
+
 int mr_rtm_dumproute(struct sk_buff *skb, struct netlink_callback *cb,
 struct mr_table *(*iter)(struct net *net,
  struct mr_table *mrt),
@@ -277,51 +326,24 @@ int mr_rtm_dumproute(struct sk_buff *skb, struct netlink_callback *cb,
 int cmd, int flags),
 spinlock_t *lock)
 {
-   unsigned int t = 0, e = 0, s_t = cb->args[0], s_e = cb->args[1];
+   unsigned int t = 0, s_t = cb->args[0];
struct net *net = sock_net(skb->sk);
struct mr_table *mrt;
-   struct mr_mfc *mfc;
+   int err;
 
rcu_read_lock();
for (mrt = iter(net, NULL); mrt; mrt = iter(net, mrt)) {
if (t < s_t)
goto next_table;
-   list_for_each_entry_rcu(mfc, &mrt->mfc_cache_list, list) {
-   if (e < s_e)
-   goto next_entry;
-   if (fill(mrt, skb, NETLINK_CB(cb->skb).portid,
-cb->nlh->nlmsg_seq, mfc,
-RTM_NEWROUTE, NLM_F_MULTI) < 0)
-   goto done;
-next_entry:
-   e++;
-   }
-   e = 0;
-   s_e = 0;
-
-   spin_lock_bh(lock);
-   list_for_each_entry(mfc, &mrt->mfc_unres_queue, list) {
-   if (e < s_e)
-   goto next_entry2;
-   if (fill(mrt, skb, NETLINK_CB(cb->skb).portid,
-cb->nlh->nlmsg_seq, mfc,
-RTM_NEWROUTE, NLM_F_MULTI) < 0) {
-   spin_unlock_bh(lock);
-   goto done;
-   }
-next_entry2:
-   e++;
-   }
-   spin_unlock_bh(lock);
-   e = 0;
-   s_e = 0;
+
+   

[PATCH v2 net-next 01/11] netlink: Add answer_flags to netlink_callback

2018-10-15 Thread David Ahern
From: David Ahern 

With dump filtering we need a way to ensure the NLM_F_DUMP_FILTERED
flag is set on a message back to the user if the data returned is
influenced by some input attributes. Normally this can be done as
messages are added to the skb, but if the filter results in no data
being returned, the user could be confused as to why.

This patch adds answer_flags to the netlink_callback allowing dump
handlers to set the NLM_F_DUMP_FILTERED at a minimum in the
NLMSG_DONE message ensuring the flag gets back to the user.

The netlink_callback space is initialized to 0 via a memset in
__netlink_dump_start, so init of the new answer_flags is covered.
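
As a hedged sketch (the foo_* names are hypothetical, not part of this
patch), a filtered dump handler would use the new field like this:

        static int foo_dump(struct sk_buff *skb, struct netlink_callback *cb)
        {
                /* even if no entries survive the filter, NLMSG_DONE
                 * will now carry NLM_F_DUMP_FILTERED back to the user */
                if (foo_request_has_filters(cb))
                        cb->answer_flags = NLM_F_DUMP_FILTERED;

                return foo_fill_entries(skb, cb);
        }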

Signed-off-by: David Ahern 
---
 include/linux/netlink.h  | 1 +
 net/netlink/af_netlink.c | 3 ++-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/netlink.h b/include/linux/netlink.h
index 72580f1a72a2..4da90a6ab536 100644
--- a/include/linux/netlink.h
+++ b/include/linux/netlink.h
@@ -180,6 +180,7 @@ struct netlink_callback {
u16 family;
u16 min_dump_alloc;
boolstrict_check;
+   u16 answer_flags;
unsigned intprev_seq, seq;
longargs[6];
 };
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index e613a9f89600..6bb9f3cde0b0 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -2257,7 +2257,8 @@ static int netlink_dump(struct sock *sk)
}
 
nlh = nlmsg_put_answer(skb, cb, NLMSG_DONE,
-  sizeof(nlk->dump_done_errno), NLM_F_MULTI);
+  sizeof(nlk->dump_done_errno),
+  NLM_F_MULTI | cb->answer_flags);
if (WARN_ON(!nlh))
goto errout_skb;
 
-- 
2.11.0



[PATCH v2 net-next 05/11] net/mpls: Plumb support for filtering route dumps

2018-10-15 Thread David Ahern
From: David Ahern 

Implement kernel side filtering of routes by egress device index and
protocol. MPLS uses only a single table and route type.

Signed-off-by: David Ahern 
---
 net/mpls/af_mpls.c | 42 +-
 1 file changed, 41 insertions(+), 1 deletion(-)

diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index bfcb4759c9ee..48f4cbd9fb38 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -2067,12 +2067,35 @@ static int mpls_valid_fib_dump_req(struct net *net, const struct nlmsghdr *nlh,
 }
 #endif
 
+static bool mpls_rt_uses_dev(struct mpls_route *rt,
+const struct net_device *dev)
+{
+   struct net_device *nh_dev;
+
+   if (rt->rt_nhn == 1) {
+   struct mpls_nh *nh = rt->rt_nh;
+
+   nh_dev = rtnl_dereference(nh->nh_dev);
+   if (dev == nh_dev)
+   return true;
+   } else {
+   for_nexthops(rt) {
+   nh_dev = rtnl_dereference(nh->nh_dev);
+   if (nh_dev == dev)
+   return true;
+   } endfor_nexthops(rt);
+   }
+
+   return false;
+}
+
 static int mpls_dump_routes(struct sk_buff *skb, struct netlink_callback *cb)
 {
const struct nlmsghdr *nlh = cb->nlh;
struct net *net = sock_net(skb->sk);
struct mpls_route __rcu **platform_label;
struct fib_dump_filter filter = {};
+   unsigned int flags = NLM_F_MULTI;
size_t platform_labels;
unsigned int index;
 
@@ -2084,6 +2107,14 @@ static int mpls_dump_routes(struct sk_buff *skb, struct netlink_callback *cb)
err = mpls_valid_fib_dump_req(net, nlh, &filter, cb->extack);
if (err < 0)
return err;
+
+   /* for MPLS, there is only 1 table with fixed type and flags.
+* If either are set in the filter then return nothing.
+*/
+   if ((filter.table_id && filter.table_id != RT_TABLE_MAIN) ||
+   (filter.rt_type && filter.rt_type != RTN_UNICAST) ||
+filter.flags)
+   return skb->len;
}
 
index = cb->args[0];
@@ -2092,15 +2123,24 @@ static int mpls_dump_routes(struct sk_buff *skb, struct netlink_callback *cb)
 
platform_label = rtnl_dereference(net->mpls.platform_label);
platform_labels = net->mpls.platform_labels;
+
+   if (filter.filter_set)
+   flags |= NLM_F_DUMP_FILTERED;
+
for (; index < platform_labels; index++) {
struct mpls_route *rt;
+
rt = rtnl_dereference(platform_label[index]);
if (!rt)
continue;
 
+   if ((filter.dev && !mpls_rt_uses_dev(rt, filter.dev)) ||
+   (filter.protocol && rt->rt_protocol != filter.protocol))
+   continue;
+
if (mpls_dump_route(skb, NETLINK_CB(cb->skb).portid,
cb->nlh->nlmsg_seq, RTM_NEWROUTE,
-   index, rt, NLM_F_MULTI) < 0)
+   index, rt, flags) < 0)
break;
}
cb->args[0] = index;
-- 
2.11.0



[PATCH v2 net-next 11/11] net/ipv4: Bail early if user only wants prefix entries

2018-10-15 Thread David Ahern
From: David Ahern 

Unlike IPv6, IPv4 does not have routes marked with RTF_PREFIX_RT. If the
flag is set in the dump request, just return.

In the process of this change, move the CLONE check to use the new
filter flags.

Signed-off-by: David Ahern 
---
 net/ipv4/fib_frontend.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index e86ca2255181..5bf653f36911 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -886,10 +886,14 @@ static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
err = ip_valid_fib_dump_req(net, nlh, &filter, cb);
if (err < 0)
return err;
+   } else if (nlmsg_len(nlh) >= sizeof(struct rtmsg)) {
+   struct rtmsg *rtm = nlmsg_data(nlh);
+
+   filter.flags = rtm->rtm_flags & (RTM_F_PREFIX | RTM_F_CLONED);
}
 
-   if (nlmsg_len(nlh) >= sizeof(struct rtmsg) &&
-   ((struct rtmsg *)nlmsg_data(nlh))->rtm_flags & RTM_F_CLONED)
+   /* fib entries are never clones and ipv4 does not use prefix flag */
+   if (filter.flags & (RTM_F_PREFIX | RTM_F_CLONED))
return skb->len;
 
if (filter.table_id) {
-- 
2.11.0



[PATCH v2 net-next 07/11] net: Plumb support for filtering ipv4 and ipv6 multicast route dumps

2018-10-15 Thread David Ahern
From: David Ahern 

Implement kernel side filtering of routes by egress device index and
table id. If the table id is given in the filter, lookup table and
call mr_table_dump directly for it.

Signed-off-by: David Ahern 
---
 include/linux/mroute_base.h |  7 ---
 net/ipv4/ipmr.c | 18 +++---
 net/ipv4/ipmr_base.c| 42 +++---
 net/ipv6/ip6mr.c| 18 +++---
 4 files changed, 73 insertions(+), 12 deletions(-)

diff --git a/include/linux/mroute_base.h b/include/linux/mroute_base.h
index db85373c8d15..34de06b426ef 100644
--- a/include/linux/mroute_base.h
+++ b/include/linux/mroute_base.h
@@ -7,6 +7,7 @@
 #include 
 #include 
 #include 
+#include 
 
 /**
  * struct vif_device - interface representor for multicast routing
@@ -288,7 +289,7 @@ int mr_table_dump(struct mr_table *mrt, struct sk_buff *skb,
  int (*fill)(struct mr_table *mrt, struct sk_buff *skb,
  u32 portid, u32 seq, struct mr_mfc *c,
  int cmd, int flags),
- spinlock_t *lock);
+ spinlock_t *lock, struct fib_dump_filter *filter);
 int mr_rtm_dumproute(struct sk_buff *skb, struct netlink_callback *cb,
 struct mr_table *(*iter)(struct net *net,
  struct mr_table *mrt),
@@ -296,7 +297,7 @@ int mr_rtm_dumproute(struct sk_buff *skb, struct netlink_callback *cb,
 struct sk_buff *skb,
 u32 portid, u32 seq, struct mr_mfc *c,
 int cmd, int flags),
-spinlock_t *lock);
+spinlock_t *lock, struct fib_dump_filter *filter);
 
 int mr_dump(struct net *net, struct notifier_block *nb, unsigned short family,
int (*rules_dump)(struct net *net,
@@ -346,7 +347,7 @@ mr_rtm_dumproute(struct sk_buff *skb, struct netlink_callback *cb,
 struct sk_buff *skb,
 u32 portid, u32 seq, struct mr_mfc *c,
 int cmd, int flags),
-spinlock_t *lock)
+spinlock_t *lock, struct fib_dump_filter *filter)
 {
return -EINVAL;
 }
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 44d777058960..3fa988e6a3df 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -2528,18 +2528,30 @@ static int ipmr_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh,
 static int ipmr_rtm_dumproute(struct sk_buff *skb, struct netlink_callback *cb)
 {
struct fib_dump_filter filter = {};
+   int err;
 
if (cb->strict_check) {
-   int err;
-
err = ip_valid_fib_dump_req(sock_net(skb->sk), cb->nlh,
&filter, cb->extack);
if (err < 0)
return err;
}
 
+   if (filter.table_id) {
+   struct mr_table *mrt;
+
+   mrt = ipmr_get_table(sock_net(skb->sk), filter.table_id);
+   if (!mrt) {
+   NL_SET_ERR_MSG(cb->extack, "ipv4: MR table does not 
exist");
+   return -ENOENT;
+   }
+   err = mr_table_dump(mrt, skb, cb, _ipmr_fill_mroute,
+   &mfc_unres_lock, &filter);
+   return skb->len ? : err;
+   }
+
return mr_rtm_dumproute(skb, cb, ipmr_mr_table_iter,
-   _ipmr_fill_mroute, &mfc_unres_lock);
+   _ipmr_fill_mroute, &mfc_unres_lock, &filter);
 }
 
 static const struct nla_policy rtm_ipmr_policy[RTA_MAX + 1] = {
diff --git a/net/ipv4/ipmr_base.c b/net/ipv4/ipmr_base.c
index 132dd2613ca5..bfe8fd04afa0 100644
--- a/net/ipv4/ipmr_base.c
+++ b/net/ipv4/ipmr_base.c
@@ -268,21 +268,45 @@ int mr_fill_mroute(struct mr_table *mrt, struct sk_buff *skb,
 }
 EXPORT_SYMBOL(mr_fill_mroute);
 
+static bool mr_mfc_uses_dev(const struct mr_table *mrt,
+   const struct mr_mfc *c,
+   const struct net_device *dev)
+{
+   int ct;
+
+   for (ct = c->mfc_un.res.minvif; ct < c->mfc_un.res.maxvif; ct++) {
+   if (VIF_EXISTS(mrt, ct) && c->mfc_un.res.ttls[ct] < 255) {
+   const struct vif_device *vif;
+
+   vif = &mrt->vif_table[ct];
+   if (vif->dev == dev)
+   return true;
+   }
+   }
+   return false;
+}
+
 int mr_table_dump(struct mr_table *mrt, struct sk_buff *skb,
  struct netlink_callback *cb,
  int (*fill)(struct mr_table *mrt, struct sk_buff *skb,
  u32 portid, u32 seq, struct mr_mfc *c,
  int cmd, int flags),
- spinlock_t *lock)
+ spinlock_t *lock, struct fib_dump_filter *filter)
 {
unsigned int e = 0, 

[PATCH v2 net-next 10/11] net/ipv6: Bail early if user only wants cloned entries

2018-10-15 Thread David Ahern
From: David Ahern 

Similar to IPv4, the IPv6 fib no longer contains cloned routes. If a user
requests a route dump for only cloned entries, there is no sense in walking
the FIB and returning everything.

Signed-off-by: David Ahern 
---
 net/ipv6/ip6_fib.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 5562c77022c6..2a058b408a6a 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -586,10 +586,13 @@ static int inet6_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
} else if (nlmsg_len(nlh) >= sizeof(struct rtmsg)) {
struct rtmsg *rtm = nlmsg_data(nlh);
 
-   if (rtm->rtm_flags & RTM_F_PREFIX)
-   arg.filter.flags = RTM_F_PREFIX;
+   arg.filter.flags = rtm->rtm_flags & (RTM_F_PREFIX|RTM_F_CLONED);
}
 
+   /* fib entries are never clones */
+   if (arg.filter.flags & RTM_F_CLONED)
+   return skb->len;
+
w = (void *)cb->args[2];
if (!w) {
/* New dump:
-- 
2.11.0



[PATCH v2 net-next 09/11] net/mpls: Handle kernel side filtering of route dumps

2018-10-15 Thread David Ahern
From: David Ahern 

Update the dump request parsing in MPLS for the non-INET case to
enable kernel side filtering. If INET is disabled the only filters
that make sense for MPLS are protocol and nexthop device.

Signed-off-by: David Ahern 
---
 net/mpls/af_mpls.c | 33 -
 1 file changed, 28 insertions(+), 5 deletions(-)

diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 24381696932a..7d55d4c04088 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -2044,7 +2044,9 @@ static int mpls_valid_fib_dump_req(struct net *net, const struct nlmsghdr *nlh,
   struct netlink_callback *cb)
 {
struct netlink_ext_ack *extack = cb->extack;
+   struct nlattr *tb[RTA_MAX + 1];
struct rtmsg *rtm;
+   int err, i;
 
if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*rtm))) {
NL_SET_ERR_MSG_MOD(extack, "Invalid header for FIB dump 
request");
@@ -2053,15 +2055,36 @@ static int mpls_valid_fib_dump_req(struct net *net, const struct nlmsghdr *nlh,
 
rtm = nlmsg_data(nlh);
if (rtm->rtm_dst_len || rtm->rtm_src_len  || rtm->rtm_tos   ||
-   rtm->rtm_table   || rtm->rtm_protocol || rtm->rtm_scope ||
-   rtm->rtm_type|| rtm->rtm_flags) {
+   rtm->rtm_table   || rtm->rtm_scope|| rtm->rtm_type  ||
+   rtm->rtm_flags) {
NL_SET_ERR_MSG_MOD(extack, "Invalid values in header for FIB 
dump request");
return -EINVAL;
}
 
-   if (nlmsg_attrlen(nlh, sizeof(*rtm))) {
-   NL_SET_ERR_MSG_MOD(extack, "Invalid data after header in FIB 
dump request");
-   return -EINVAL;
+   if (rtm->rtm_protocol) {
+   filter->protocol = rtm->rtm_protocol;
+   filter->filter_set = 1;
+   cb->answer_flags = NLM_F_DUMP_FILTERED;
+   }
+
+   err = nlmsg_parse_strict(nlh, sizeof(*rtm), tb, RTA_MAX,
+rtm_mpls_policy, extack);
+   if (err < 0)
+   return err;
+
+   for (i = 0; i <= RTA_MAX; ++i) {
+   int ifindex;
+
+   if (i == RTA_OIF) {
+   ifindex = nla_get_u32(tb[i]);
+   filter->dev = __dev_get_by_index(net, ifindex);
+   if (!filter->dev)
+   return -ENODEV;
+   filter->filter_set = 1;
+   } else if (tb[i]) {
+   NL_SET_ERR_MSG_MOD(extack, "Unsupported attribute in 
dump request");
+   return -EINVAL;
+   }
}
 
return 0;
-- 
2.11.0



[PATCH v2 net-next 04/11] net/ipv6: Plumb support for filtering route dumps

2018-10-15 Thread David Ahern
From: David Ahern 

Implement kernel side filtering of routes by table id, egress device
index, protocol, and route type. If the table id is given in the filter,
lookup the table and call fib6_dump_table directly for it.

Move the existing route flags check for prefix only routes to the new
filter.

Signed-off-by: David Ahern 
---
 net/ipv6/ip6_fib.c | 28 ++--
 net/ipv6/route.c   | 40 
 2 files changed, 54 insertions(+), 14 deletions(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 94e61fe47ff8..a51fc357a05c 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -583,10 +583,12 @@ static int inet6_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
err = ip_valid_fib_dump_req(net, nlh, &arg.filter, cb->extack);
if (err < 0)
return err;
-   }
+   } else if (nlmsg_len(nlh) >= sizeof(struct rtmsg)) {
+   struct rtmsg *rtm = nlmsg_data(nlh);
 
-   s_h = cb->args[0];
-   s_e = cb->args[1];
+   if (rtm->rtm_flags & RTM_F_PREFIX)
+   arg.filter.flags = RTM_F_PREFIX;
+   }
 
w = (void *)cb->args[2];
if (!w) {
@@ -612,6 +614,20 @@ static int inet6_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
arg.net = net;
w->args = &arg;
 
+   if (arg.filter.table_id) {
+   tb = fib6_get_table(net, arg.filter.table_id);
+   if (!tb) {
+   NL_SET_ERR_MSG_MOD(cb->extack, "FIB table does not 
exist");
+   return -ENOENT;
+   }
+
+   res = fib6_dump_table(tb, skb, cb);
+   goto out;
+   }
+
+   s_h = cb->args[0];
+   s_e = cb->args[1];
+
rcu_read_lock();
for (h = s_h; h < FIB6_TABLE_HASHSZ; h++, s_e = 0) {
e = 0;
@@ -621,16 +637,16 @@ static int inet6_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
goto next;
res = fib6_dump_table(tb, skb, cb);
if (res != 0)
-   goto out;
+   goto out_unlock;
 next:
e++;
}
}
-out:
+out_unlock:
rcu_read_unlock();
cb->args[1] = e;
cb->args[0] = h;
-
+out:
res = res < 0 ? res : skb->len;
if (res <= 0)
fib6_dump_end(cb);
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index f4e08b0689a8..9fd600e42f9d 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -4767,28 +4767,52 @@ static int rt6_fill_node(struct net *net, struct sk_buff *skb,
return -EMSGSIZE;
 }
 
+static bool fib6_info_uses_dev(const struct fib6_info *f6i,
+  const struct net_device *dev)
+{
+   if (f6i->fib6_nh.nh_dev == dev)
+   return true;
+
+   if (f6i->fib6_nsiblings) {
+   struct fib6_info *sibling, *next_sibling;
+
+   list_for_each_entry_safe(sibling, next_sibling,
+   &f6i->fib6_siblings, fib6_siblings) {
+   if (sibling->fib6_nh.nh_dev == dev)
+   return true;
+   }
+   }
+
+   return false;
+}
+
 int rt6_dump_route(struct fib6_info *rt, void *p_arg)
 {
struct rt6_rtnl_dump_arg *arg = (struct rt6_rtnl_dump_arg *) p_arg;
struct fib_dump_filter *filter = &arg->filter;
+   unsigned int flags = NLM_F_MULTI;
struct net *net = arg->net;
 
if (rt == net->ipv6.fib6_null_entry)
return 0;
 
-   if (nlmsg_len(arg->cb->nlh) >= sizeof(struct rtmsg)) {
-   struct rtmsg *rtm = nlmsg_data(arg->cb->nlh);
-
-   /* user wants prefix routes only */
-   if (rtm->rtm_flags & RTM_F_PREFIX &&
-   !(rt->fib6_flags & RTF_PREFIX_RT)) {
-   /* success since this is not a prefix route */
+   if ((filter->flags & RTM_F_PREFIX) &&
+   !(rt->fib6_flags & RTF_PREFIX_RT)) {
+   /* success since this is not a prefix route */
+   return 1;
+   }
+   if (filter->filter_set) {
+   if ((filter->rt_type && rt->fib6_type != filter->rt_type) ||
+   (filter->dev && !fib6_info_uses_dev(rt, filter->dev)) ||
+   (filter->protocol && rt->fib6_protocol != filter->protocol)) {
return 1;
}
+   flags |= NLM_F_DUMP_FILTERED;
}
 
return rt6_fill_node(net, arg->skb, rt, NULL, NULL, NULL, 0,
 RTM_NEWROUTE, NETLINK_CB(arg->cb->skb).portid,
-arg->cb->nlh->nlmsg_seq, NLM_F_MULTI);
+arg->cb->nlh->nlmsg_seq, flags);
 }
 
 static int inet6_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh,
-- 
2.11.0



[PATCH v2 net-next 02/11] net: Add struct for fib dump filter

2018-10-15 Thread David Ahern
From: David Ahern 

Add struct fib_dump_filter for options on limiting which routes are
returned in a dump request. The current list is table id, protocol,
route type, rtm_flags and nexthop device index. struct net is needed
to look up the net_device from the index.

Declare the filter for each route dump handler and plumb the new
arguments from dump handlers to ip_valid_fib_dump_req.

Signed-off-by: David Ahern 
---
 include/net/ip6_route.h |  1 +
 include/net/ip_fib.h| 13 -
 net/ipv4/fib_frontend.c |  6 --
 net/ipv4/ipmr.c |  6 +-
 net/ipv6/ip6_fib.c  |  5 +++--
 net/ipv6/ip6mr.c|  5 -
 net/mpls/af_mpls.c  | 12 
 7 files changed, 37 insertions(+), 11 deletions(-)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index cef186dbd2ce..7ab119936e69 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -174,6 +174,7 @@ struct rt6_rtnl_dump_arg {
struct sk_buff *skb;
struct netlink_callback *cb;
struct net *net;
+   struct fib_dump_filter filter;
 };
 
 int rt6_dump_route(struct fib6_info *f6i, void *p_arg);
diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 852e4ebf2209..667013bf4266 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -222,6 +222,16 @@ struct fib_table {
unsigned long   __data[0];
 };
 
+struct fib_dump_filter {
+   u32 table_id;
+   /* filter_set is an optimization that an entry is set */
+   boolfilter_set;
+   unsigned char   protocol;
+   unsigned char   rt_type;
+   unsigned intflags;
+   struct net_device   *dev;
+};
+
 int fib_table_lookup(struct fib_table *tb, const struct flowi4 *flp,
 struct fib_result *res, int fib_flags);
 int fib_table_insert(struct net *, struct fib_table *, struct fib_config *,
@@ -453,6 +463,7 @@ static inline void fib_proc_exit(struct net *net)
 
 u32 ip_mtu_from_fib_result(struct fib_result *res, __be32 daddr);
 
-int ip_valid_fib_dump_req(const struct nlmsghdr *nlh,
+int ip_valid_fib_dump_req(struct net *net, const struct nlmsghdr *nlh,
+ struct fib_dump_filter *filter,
  struct netlink_ext_ack *extack);
 #endif  /* _NET_FIB_H */
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 0f1beceb47d5..850850dd80e1 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -802,7 +802,8 @@ static int inet_rtm_newroute(struct sk_buff *skb, struct nlmsghdr *nlh,
return err;
 }
 
-int ip_valid_fib_dump_req(const struct nlmsghdr *nlh,
+int ip_valid_fib_dump_req(struct net *net, const struct nlmsghdr *nlh,
+ struct fib_dump_filter *filter,
  struct netlink_ext_ack *extack)
 {
struct rtmsg *rtm;
@@ -837,6 +838,7 @@ static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 {
const struct nlmsghdr *nlh = cb->nlh;
struct net *net = sock_net(skb->sk);
+   struct fib_dump_filter filter = {};
unsigned int h, s_h;
unsigned int e = 0, s_e;
struct fib_table *tb;
@@ -844,7 +846,7 @@ static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
int dumped = 0, err;
 
if (cb->strict_check) {
-   err = ip_valid_fib_dump_req(nlh, cb->extack);
+   err = ip_valid_fib_dump_req(net, nlh, &filter, cb->extack);
if (err < 0)
return err;
}
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 91b0d5671649..44d777058960 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -2527,9 +2527,13 @@ static int ipmr_rtm_getroute(struct sk_buff *in_skb, struct nlmsghdr *nlh,
 
 static int ipmr_rtm_dumproute(struct sk_buff *skb, struct netlink_callback *cb)
 {
+   struct fib_dump_filter filter = {};
+
if (cb->strict_check) {
-   int err = ip_valid_fib_dump_req(cb->nlh, cb->extack);
+   int err;
 
+   err = ip_valid_fib_dump_req(sock_net(skb->sk), cb->nlh,
+   &filter, cb->extack);
if (err < 0)
return err;
}
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 0783af11b0b7..94e61fe47ff8 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -569,17 +569,18 @@ static int inet6_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
 {
const struct nlmsghdr *nlh = cb->nlh;
struct net *net = sock_net(skb->sk);
+   struct rt6_rtnl_dump_arg arg = {};
unsigned int h, s_h;
unsigned int e = 0, s_e;
-   struct rt6_rtnl_dump_arg arg;
struct fib6_walker *w;
struct fib6_table *tb;
struct hlist_head *head;
int res = 0;
 
if (cb->strict_check) {
-   int err = ip_valid_fib_dump_req(nlh, cb->extack);
+   

[PATCH v2 net-next 08/11] net: Enable kernel side filtering of route dumps

2018-10-15 Thread David Ahern
From: David Ahern 

Update parsing of route dump request to enable kernel side filtering.
Allow filtering results by protocol (e.g., which routing daemon installed
the route), route type (e.g., unicast), table id and nexthop device. These
amount to the low hanging fruit, yet a huge improvement, for dumping
routes.

ip_valid_fib_dump_req is called with RTNL held, so __dev_get_by_index can
be used to look up the device index without taking a reference. From
there filter->dev is only used during dump loops with the lock still held.

Set NLM_F_DUMP_FILTERED in the answer_flags so the user knows the results
have been filtered should no entries be returned.
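
For context, a minimal sketch of a dump request that this parsing now
honors (table 100 and RTPROT_BGP are assumed example values; the socket
must have opted in to strict checking via NETLINK_GET_STRICT_CHK):

        #include <linux/rtnetlink.h>
        #include <string.h>
        #include <sys/socket.h>
        #include <unistd.h>

        int main(void)
        {
                struct {
                        struct nlmsghdr nlh;
                        struct rtmsg rtm;
                } req;
                int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

                memset(&req, 0, sizeof(req));
                req.nlh.nlmsg_len = NLMSG_LENGTH(sizeof(struct rtmsg));
                req.nlh.nlmsg_type = RTM_GETROUTE;
                req.nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
                req.rtm.rtm_family = AF_INET;
                req.rtm.rtm_protocol = RTPROT_BGP;      /* routes from BGP */
                req.rtm.rtm_table = 100;                /* example table id */

                send(fd, &req, req.nlh.nlmsg_len, 0);
                /* recv() NLM_F_MULTI parts until NLMSG_DONE; replies
                 * carry NLM_F_DUMP_FILTERED */
                close(fd);
                return 0;
        }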

Signed-off-by: David Ahern 
---
 include/net/ip_fib.h|  2 +-
 net/ipv4/fib_frontend.c | 51 ++---
 net/ipv4/ipmr.c |  2 +-
 net/ipv6/ip6_fib.c  |  2 +-
 net/ipv6/ip6mr.c|  2 +-
 net/mpls/af_mpls.c  |  9 +
 6 files changed, 53 insertions(+), 15 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 1eabc9edd2b9..e8d9456bf36e 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -465,5 +465,5 @@ u32 ip_mtu_from_fib_result(struct fib_result *res, __be32 daddr);
 
 int ip_valid_fib_dump_req(struct net *net, const struct nlmsghdr *nlh,
  struct fib_dump_filter *filter,
- struct netlink_ext_ack *extack);
+ struct netlink_callback *cb);
 #endif  /* _NET_FIB_H */
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 37dc8ac366fd..e86ca2255181 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -804,9 +804,14 @@ static int inet_rtm_newroute(struct sk_buff *skb, struct nlmsghdr *nlh,
 
 int ip_valid_fib_dump_req(struct net *net, const struct nlmsghdr *nlh,
  struct fib_dump_filter *filter,
- struct netlink_ext_ack *extack)
+ struct netlink_callback *cb)
 {
+   struct netlink_ext_ack *extack = cb->extack;
+   struct nlattr *tb[RTA_MAX + 1];
struct rtmsg *rtm;
+   int err, i;
+
+   ASSERT_RTNL();
 
if (nlh->nlmsg_len < nlmsg_msg_size(sizeof(*rtm))) {
NL_SET_ERR_MSG(extack, "Invalid header for FIB dump request");
@@ -815,8 +820,7 @@ int ip_valid_fib_dump_req(struct net *net, const struct nlmsghdr *nlh,
 
rtm = nlmsg_data(nlh);
if (rtm->rtm_dst_len || rtm->rtm_src_len  || rtm->rtm_tos   ||
-   rtm->rtm_table   || rtm->rtm_protocol || rtm->rtm_scope ||
-   rtm->rtm_type) {
+   rtm->rtm_scope) {
NL_SET_ERR_MSG(extack, "Invalid values in header for FIB dump 
request");
return -EINVAL;
}
@@ -825,9 +829,42 @@ int ip_valid_fib_dump_req(struct net *net, const struct nlmsghdr *nlh,
return -EINVAL;
}
 
-   if (nlmsg_attrlen(nlh, sizeof(*rtm))) {
-   NL_SET_ERR_MSG(extack, "Invalid data after header in FIB dump 
request");
-   return -EINVAL;
+   filter->flags    = rtm->rtm_flags;
+   filter->protocol = rtm->rtm_protocol;
+   filter->rt_type  = rtm->rtm_type;
+   filter->table_id = rtm->rtm_table;
+
+   err = nlmsg_parse_strict(nlh, sizeof(*rtm), tb, RTA_MAX,
+rtm_ipv4_policy, extack);
+   if (err < 0)
+   return err;
+
+   for (i = 0; i <= RTA_MAX; ++i) {
+   int ifindex;
+
+   if (!tb[i])
+   continue;
+
+   switch (i) {
+   case RTA_TABLE:
+   filter->table_id = nla_get_u32(tb[i]);
+   break;
+   case RTA_OIF:
+   ifindex = nla_get_u32(tb[i]);
+   filter->dev = __dev_get_by_index(net, ifindex);
+   if (!filter->dev)
+   return -ENODEV;
+   break;
+   default:
+   NL_SET_ERR_MSG(extack, "Unsupported attribute in dump 
request");
+   return -EINVAL;
+   }
+   }
+
+   if (filter->flags || filter->protocol || filter->rt_type ||
+   filter->table_id || filter->dev) {
+   filter->filter_set = 1;
+   cb->answer_flags = NLM_F_DUMP_FILTERED;
}
 
return 0;
@@ -846,7 +883,7 @@ static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
int dumped = 0, err;
 
if (cb->strict_check) {
-   err = ip_valid_fib_dump_req(net, nlh, &filter, cb->extack);
+   err = ip_valid_fib_dump_req(net, nlh, &filter, cb);
if (err < 0)
return err;
}
diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
index 3fa988e6a3df..7a3e2acda94c 100644
--- a/net/ipv4/ipmr.c
+++ b/net/ipv4/ipmr.c
@@ -2532,7 +2532,7 @@ static int ipmr_rtm_dumproute(struct sk_buff *skb, 

[PATCH v2 net-next 00/11] net: Kernel side filtering for route dumps

2018-10-15 Thread David Ahern
From: David Ahern 

Implement kernel side filtering of route dumps by protocol (e.g., which
routing daemon installed the route), route type (e.g., unicast), table
id and nexthop device.

iproute2 has been doing this filtering in userspace for years; pushing
the filters to the kernel side reduces the amount of data the kernel
sends and reduces wasted cycles on both sides processing unwanted data.
These initial options provide a huge improvement for efficiently
examining routes on large scale systems.

v2
- better handling of requests for a specific table. Rather than walking
  the hash of all tables, lookup the specific table and dump it
- refactor mr_rtm_dumproute moving the loop over the table into a
  helper that can be invoked directly
- add hook to return NLM_F_DUMP_FILTERED in DONE message to ensure
  it is returned even when the dump returns nothing

David Ahern (11):
  netlink: Add answer_flags to netlink_callback
  net: Add struct for fib dump filter
  net/ipv4: Plumb support for filtering route dumps
  net/ipv6: Plumb support for filtering route dumps
  net/mpls: Plumb support for filtering route dumps
  ipmr: Refactor mr_rtm_dumproute
  net: Plumb support for filtering ipv4 and ipv6 multicast route dumps
  net: Enable kernel side filtering of route dumps
  net/mpls: Handle kernel side filtering of route dumps
  net/ipv6: Bail early if user only wants cloned entries
  net/ipv4: Bail early if user only wants prefix entries

 include/linux/mroute_base.h |  11 +++-
 include/linux/netlink.h |   1 +
 include/net/ip6_route.h |   1 +
 include/net/ip_fib.h|  17 --
 net/ipv4/fib_frontend.c |  76 ++
 net/ipv4/fib_trie.c |  37 +
 net/ipv4/ipmr.c |  22 ++--
 net/ipv4/ipmr_base.c| 126 
 net/ipv6/ip6_fib.c  |  34 +---
 net/ipv6/ip6mr.c|  21 ++--
 net/ipv6/route.c|  40 +++---
 net/mpls/af_mpls.c  |  92 +++-
 net/netlink/af_netlink.c|   3 +-
 13 files changed, 386 insertions(+), 95 deletions(-)

-- 
2.11.0



[PATCH v2 net-next 03/11] net/ipv4: Plumb support for filtering route dumps

2018-10-15 Thread David Ahern
From: David Ahern 

Implement kernel side filtering of routes by table id, egress device index,
protocol and route type. If the table id is given in the filter, lookup the
table and call fib_table_dump directly for it.

Signed-off-by: David Ahern 
---
 include/net/ip_fib.h|  2 +-
 net/ipv4/fib_frontend.c | 13 -
 net/ipv4/fib_trie.c | 37 ++---
 3 files changed, 39 insertions(+), 13 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 667013bf4266..1eabc9edd2b9 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -239,7 +239,7 @@ int fib_table_insert(struct net *, struct fib_table *, 
struct fib_config *,
 int fib_table_delete(struct net *, struct fib_table *, struct fib_config *,
 struct netlink_ext_ack *extack);
 int fib_table_dump(struct fib_table *table, struct sk_buff *skb,
-  struct netlink_callback *cb);
+  struct netlink_callback *cb, struct fib_dump_filter *filter);
 int fib_table_flush(struct net *net, struct fib_table *table);
 struct fib_table *fib_trie_unmerge(struct fib_table *main_tb);
 void fib_table_flush_external(struct fib_table *table);
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 850850dd80e1..37dc8ac366fd 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -855,6 +855,17 @@ static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
((struct rtmsg *)nlmsg_data(nlh))->rtm_flags & RTM_F_CLONED)
return skb->len;
 
+   if (filter.table_id) {
+   tb = fib_get_table(net, filter.table_id);
+   if (!tb) {
+   NL_SET_ERR_MSG(cb->extack, "ipv4: FIB table does not 
exist");
+   return -ENOENT;
+   }
+
+   err = fib_table_dump(tb, skb, cb, &filter);
+   return skb->len ? : err;
+   }
+
s_h = cb->args[0];
s_e = cb->args[1];
 
@@ -869,7 +880,7 @@ static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
if (dumped)
memset(>args[2], 0, sizeof(cb->args) -
 2 * sizeof(cb->args[0]));
-   err = fib_table_dump(tb, skb, cb);
+   err = fib_table_dump(tb, skb, cb, &filter);
if (err < 0) {
if (likely(skb->len))
goto out;
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 5bc0c89e81e4..237c9f72b265 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -2003,12 +2003,17 @@ void fib_free_table(struct fib_table *tb)
 }
 
 static int fn_trie_dump_leaf(struct key_vector *l, struct fib_table *tb,
-struct sk_buff *skb, struct netlink_callback *cb)
+struct sk_buff *skb, struct netlink_callback *cb,
+struct fib_dump_filter *filter)
 {
+   unsigned int flags = NLM_F_MULTI;
__be32 xkey = htonl(l->key);
struct fib_alias *fa;
int i, s_i;
 
+   if (filter->filter_set)
+   flags |= NLM_F_DUMP_FILTERED;
+
s_i = cb->args[4];
i = 0;
 
@@ -2016,25 +2021,35 @@ static int fn_trie_dump_leaf(struct key_vector *l, struct fib_table *tb,
hlist_for_each_entry_rcu(fa, &l->leaf, fa_list) {
int err;
 
-   if (i < s_i) {
-   i++;
-   continue;
-   }
+   if (i < s_i)
+   goto next;
 
-   if (tb->tb_id != fa->tb_id) {
-   i++;
-   continue;
+   if (tb->tb_id != fa->tb_id)
+   goto next;
+
+   if (filter->filter_set) {
+   if (filter->rt_type && fa->fa_type != filter->rt_type)
+   goto next;
+
+   if ((filter->protocol &&
+fa->fa_info->fib_protocol != filter->protocol))
+   goto next;
+
+   if (filter->dev &&
+   !fib_info_nh_uses_dev(fa->fa_info, filter->dev))
+   goto next;
}
 
err = fib_dump_info(skb, NETLINK_CB(cb->skb).portid,
cb->nlh->nlmsg_seq, RTM_NEWROUTE,
tb->tb_id, fa->fa_type,
xkey, KEYLENGTH - fa->fa_slen,
-   fa->fa_tos, fa->fa_info, NLM_F_MULTI);
+   fa->fa_tos, fa->fa_info, flags);
if (err < 0) {
cb->args[4] = i;
return err;
}
+next:
i++;
}
 
@@ -2044,7 +2059,7 @@ static int 

Re: [PATCH V2 net-next 5/5] ptp: Add a driver for InES time stamping IP core.

2018-10-12 Thread Rob Herring
On Sun, Oct 07, 2018 at 10:38:23AM -0700, Richard Cochran wrote:
> The InES at the ZHAW offers a PTP time stamping IP core.  The FPGA
> logic recognizes and time stamps PTP frames on the MII bus.  This
> patch adds a driver for the core along with a device tree binding to
> allow hooking the driver to MII buses.
> 
> Signed-off-by: Richard Cochran 
> ---
>  Documentation/devicetree/bindings/ptp/ptp-ines.txt |  37 +

Bindings should be a separate patch.

>  drivers/ptp/Kconfig|  10 +
>  drivers/ptp/Makefile   |   1 +
>  drivers/ptp/ptp_ines.c | 870 
> +
>  4 files changed, 918 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/ptp/ptp-ines.txt
>  create mode 100644 drivers/ptp/ptp_ines.c
> 
> diff --git a/Documentation/devicetree/bindings/ptp/ptp-ines.txt b/Documentation/devicetree/bindings/ptp/ptp-ines.txt
> new file mode 100644
> index ..1484b62802c7
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/ptp/ptp-ines.txt
> @@ -0,0 +1,37 @@
> +ZHAW InES PTP time stamping IP core
> +
> +The IP core needs two different kinds of nodes.  The control node
> +lives somewhere in the memory map and specifies the address of the
> +control registers.  There can be up to three port handles placed as
> +attributes of PHY nodes.  These associate a particular MII bus with a
> +port index within the IP core.
> +
> +Required properties of the control node:
> +
> +- compatible:"ines,ptp-ctrl"

ines is not a registered vendor prefix. Should it be 'zhaw' instead?

> +- reg:   physical address and size of the register bank
> +- #phandle-cells:must be one (1)

#timestamper-cells

Or if it is always 1, you could omit it.

> +
> +Required format of the port handle within the PHY node:
> +
> +- timestamper:   provides control node reference and
> + the port channel within the IP core

This and #timestamper-cells need to be in a common binding doc.

And bonus points if you add a check in dtc for this. Should be a 
one-liner.

> +
> +Example:
> +
> + tstamper: timestamper@6000 {
> + compatible = "ines,ptp-ctrl";
> + reg = <0x6000 0x80>;
> + #phandle-cells = <1>;
> + };
> +
> + ethernet@8000 {
> + ...
> + mdio {
> + ...
> + phy@3 {
> + ...
> + timestamper = <&tstamper 0>;
> + };
> + };
> + };


Re: [PATCH v2 net-next] net/ipv6: Add knob to skip DELROUTE message on device down

2018-10-12 Thread David Miller
From: David Ahern 
Date: Thu, 11 Oct 2018 20:17:21 -0700

> From: David Ahern 
> 
> Another difference between IPv4 and IPv6 is the generation of RTM_DELROUTE
> notifications when a device is taken down (admin down) or deleted. IPv4
> does not generate a message for routes evicted by the down or delete;
> IPv6 does. A NOS at scale really needs to avoid these messages and have
> IPv4 and IPv6 behave similarly, relying on userspace to handle link
> notifications and evict the routes.
> 
> At this point existing user behavior needs to be preserved. Since
> notifications are a global action (not per app) the only way to preserve
> existing behavior and allow the messages to be skipped is to add a new
> sysctl (net/ipv6/route/skip_notify_on_dev_down) which can be set to
> disable the notifications.
> 
> IPv6 route code already supports the option to skip the message (it is
> used for multipath routes for example). Besides the new sysctl we need
> to pass the skip_notify setting through the generic fib6_clean and
> fib6_walk functions to fib6_clean_node and to set skip_notify on calls
> to __ip_del_rt for the addrconf_ifdown path.
> 
> Signed-off-by: David Ahern 
> ---
> v2
> - removed the changes to addrconf and anycast. addrconf_ifdown calls
>   rt6_disable_ip which calls rt6_sync_down_dev. The last one evicts all
>   routes for the device, so the delete route calls done later in addrconf
>   and anycast are superfluous

Applied.


[PATCH v2 net-next] net/ipv6: Add knob to skip DELROUTE message on device down

2018-10-11 Thread David Ahern
From: David Ahern 

Another difference between IPv4 and IPv6 is the generation of RTM_DELROUTE
notifications when a device is taken down (admin down) or deleted. IPv4
does not generate a message for routes evicted by the down or delete;
IPv6 does. A NOS at scale really needs to avoid these messages and have
IPv4 and IPv6 behave similarly, relying on userspace to handle link
notifications and evict the routes.

At this point existing user behavior needs to be preserved. Since
notifications are a global action (not per app) the only way to preserve
existing behavior and allow the messages to be skipped is to add a new
sysctl (net/ipv6/route/skip_notify_on_dev_down) which can be set to
disable the notifications.

IPv6 route code already supports the option to skip the message (it is
used for multipath routes for example). Besides the new sysctl we need
to pass the skip_notify setting through the generic fib6_clean and
fib6_walk functions to fib6_clean_node and to set skip_notify on calls
to __ip_del_rt for the addrconf_ifdown path.
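
Illustrative only -- flipping the knob from a small C helper (the proc
path follows directly from the sysctl name above):

        #include <fcntl.h>
        #include <unistd.h>

        int main(void)
        {
                int fd = open("/proc/sys/net/ipv6/route/skip_notify_on_dev_down",
                              O_WRONLY);

                if (fd < 0)
                        return 1;
                write(fd, "1", 1);      /* stop RTM_DELROUTE on dev down */
                close(fd);
                return 0;
        }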

Signed-off-by: David Ahern 
---
v2
- removed the changes to addrconf and anycast. addrconf_ifdown calls
  rt6_disable_ip which calls rt6_sync_down_dev. The last one evicts all
  routes for the device, so the delete route calls done later in addrconf
  and anycast are superfluous

 Documentation/networking/ip-sysctl.txt |  8 
 include/net/ip6_fib.h  |  3 +++
 include/net/netns/ipv6.h   |  1 +
 net/ipv6/ip6_fib.c | 20 +++-
 net/ipv6/route.c   | 20 +++-
 5 files changed, 46 insertions(+), 6 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 960de8fe3f40..163b5ff1073c 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -1442,6 +1442,14 @@ max_hbh_length - INTEGER
header.
Default: INT_MAX (unlimited)
 
+skip_notify_on_dev_down - BOOLEAN
+   Controls whether an RTM_DELROUTE message is generated for routes
+   removed when a device is taken down or deleted. IPv4 does not
+   generate this message; IPv6 does by default. Setting this sysctl
+   to true skips the message, making IPv4 and IPv6 on par in relying
+   on userspace caches to track link events and evict routes.
+   Default: false (generate message)
+
 IPv6 Fragmentation:
 
 ip6frag_high_thresh - INTEGER
diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index f06e968f1992..caabfd84a098 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -407,6 +407,9 @@ struct fib6_node *fib6_locate(struct fib6_node *root,
 
void fib6_clean_all(struct net *net, int (*func)(struct fib6_info *, void *arg),
void *arg);
+void fib6_clean_all_skip_notify(struct net *net,
+   int (*func)(struct fib6_info *, void *arg),
+   void *arg);
 
 int fib6_add(struct fib6_node *root, struct fib6_info *rt,
 struct nl_info *info, struct netlink_ext_ack *extack);
diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h
index f0e396ab9bec..ef1ed529f33c 100644
--- a/include/net/netns/ipv6.h
+++ b/include/net/netns/ipv6.h
@@ -45,6 +45,7 @@ struct netns_sysctl_ipv6 {
int max_dst_opts_len;
int max_hbh_opts_len;
int seg6_flowlabel;
+   bool skip_notify_on_dev_down;
 };
 
 struct netns_ipv6 {
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index e14d244c551f..9ba72d94d60f 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -47,6 +47,7 @@ struct fib6_cleaner {
int (*func)(struct fib6_info *, void *arg);
int sernum;
void *arg;
+   bool skip_notify;
 };
 
 #ifdef CONFIG_IPV6_SUBTREES
@@ -1956,6 +1957,7 @@ static int fib6_clean_node(struct fib6_walker *w)
struct fib6_cleaner *c = container_of(w, struct fib6_cleaner, w);
struct nl_info info = {
.nl_net = c->net,
+   .skip_notify = c->skip_notify,
};
 
if (c->sernum != FIB6_NO_SERNUM_CHANGE &&
@@ -2007,7 +2009,7 @@ static int fib6_clean_node(struct fib6_walker *w)
 
 static void fib6_clean_tree(struct net *net, struct fib6_node *root,
int (*func)(struct fib6_info *, void *arg),
-   int sernum, void *arg)
+   int sernum, void *arg, bool skip_notify)
 {
struct fib6_cleaner c;
 
@@ -2019,13 +2021,14 @@ static void fib6_clean_tree(struct net *net, struct fib6_node *root,
c.sernum = sernum;
c.arg = arg;
c.net = net;
+   c.skip_notify = skip_notify;
 
fib6_walk(net, &c);
 }
 
 static void __fib6_clean_all(struct net *net,
 int (*func)(struct fib6_info *, void *),
-int sernum, void *arg)
+int sernum, void *arg, bool skip_notify)
 {
 

Re: [PATCH V2 net-next 00/12] Improving performance and reducing latencies, by using latest capabilities exposed in ENA device

2018-10-11 Thread David Miller
From: 
Date: Thu, 11 Oct 2018 11:26:15 +0300

> From: Arthur Kiyanovski 
> 
> This patchset introduces the following:
> 1. A new placement policy of Tx headers and descriptors, which takes
> advantage of an option to place headers + descriptors in device memory
> space. This is sometimes referred to as LLQ - low latency queue.
> The patch set defines the admin capability, maps the device memory as
> write-combined, and adds a mode in transmit datapath to do header +
> descriptor placement on the device.
> 2. Support for RX checksum offloading
> 3. Miscellaneous small improvements and code cleanups
> 
> Note: V1 of this patchset was created as if patches e2a322a 248ab77
> from net were applied to net-next before applying the patchset. This V2 
> version does not assume this, and should be applied directly on net-next
> without the aforementioned patches.

Series applied.
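
[ Editorial note: "maps the device memory as write-combined" refers to
  mapping the BAR with a WC ioremap variant so that pushed headers and
  descriptors coalesce into larger PCIe writes. A minimal sketch of the
  technique, with the BAR index and offsets assumed for illustration:

	void __iomem *dev_mem;

	dev_mem = pci_iomap_wc(pdev, mem_bar, 0);	/* WC mapping */
	if (!dev_mem)
		return -EIO;
	/* push descriptors + header straight into device memory */
	memcpy_toio(dev_mem + llq_offset, desc_buf, desc_bytes);

  This is editorial, not code taken from the series. ]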


[PATCH V2 net-next 11/12] net: ena: update driver version to 2.0.1

2018-10-11 Thread akiyano
From: Arthur Kiyanovski 

Signed-off-by: Arthur Kiyanovski 
---
 drivers/net/ethernet/amazon/ena/ena_netdev.h | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.h 
b/drivers/net/ethernet/amazon/ena/ena_netdev.h
index d241dfc..5218736 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.h
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.h
@@ -43,9 +43,9 @@
 #include "ena_com.h"
 #include "ena_eth_com.h"
 
-#define DRV_MODULE_VER_MAJOR   1
-#define DRV_MODULE_VER_MINOR   5
-#define DRV_MODULE_VER_SUBMINOR 0
+#define DRV_MODULE_VER_MAJOR   2
+#define DRV_MODULE_VER_MINOR   0
+#define DRV_MODULE_VER_SUBMINOR 1
 
 #define DRV_MODULE_NAME"ena"
 #ifndef DRV_MODULE_VERSION
-- 
2.7.4



[PATCH V2 net-next 10/12] net: ena: remove redundant parameter in ena_com_admin_init()

2018-10-11 Thread akiyano
From: Arthur Kiyanovski 

Remove the redundant spinlock-initialization parameter from
ena_com_admin_init(); the admin queue lock is now initialized
unconditionally.

Signed-off-by: Arthur Kiyanovski 
---
 drivers/net/ethernet/amazon/ena/ena_com.c| 6 ++
 drivers/net/ethernet/amazon/ena/ena_com.h| 5 +
 drivers/net/ethernet/amazon/ena/ena_netdev.c | 2 +-
 3 files changed, 4 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_com.c 
b/drivers/net/ethernet/amazon/ena/ena_com.c
index 5c468b2..420cede 100644
--- a/drivers/net/ethernet/amazon/ena/ena_com.c
+++ b/drivers/net/ethernet/amazon/ena/ena_com.c
@@ -1701,8 +1701,7 @@ void ena_com_mmio_reg_read_request_write_dev_addr(struct 
ena_com_dev *ena_dev)
 }
 
 int ena_com_admin_init(struct ena_com_dev *ena_dev,
-  struct ena_aenq_handlers *aenq_handlers,
-  bool init_spinlock)
+  struct ena_aenq_handlers *aenq_handlers)
 {
	struct ena_com_admin_queue *admin_queue = &ena_dev->admin_queue;
u32 aq_caps, acq_caps, dev_sts, addr_low, addr_high;
@@ -1728,8 +1727,7 @@ int ena_com_admin_init(struct ena_com_dev *ena_dev,
 
	atomic_set(&admin_queue->outstanding_cmds, 0);
 
-   if (init_spinlock)
-   spin_lock_init(&admin_queue->q_lock);
+   spin_lock_init(&admin_queue->q_lock);
 
ret = ena_com_init_comp_ctxt(admin_queue);
if (ret)
diff --git a/drivers/net/ethernet/amazon/ena/ena_com.h 
b/drivers/net/ethernet/amazon/ena/ena_com.h
index 25af8d0..ae8b485 100644
--- a/drivers/net/ethernet/amazon/ena/ena_com.h
+++ b/drivers/net/ethernet/amazon/ena/ena_com.h
@@ -436,8 +436,6 @@ void ena_com_mmio_reg_read_request_destroy(struct 
ena_com_dev *ena_dev);
 /* ena_com_admin_init - Init the admin and the async queues
  * @ena_dev: ENA communication layer struct
  * @aenq_handlers: Those handlers to be called upon event.
- * @init_spinlock: Indicate if this method should init the admin spinlock or
- * the spinlock was init before (for example, in a case of FLR).
  *
  * Initialize the admin submission and completion queues.
  * Initialize the asynchronous events notification queues.
@@ -445,8 +443,7 @@ void ena_com_mmio_reg_read_request_destroy(struct 
ena_com_dev *ena_dev);
  * @return - 0 on success, negative value on failure.
  */
 int ena_com_admin_init(struct ena_com_dev *ena_dev,
-  struct ena_aenq_handlers *aenq_handlers,
-  bool init_spinlock);
+  struct ena_aenq_handlers *aenq_handlers);
 
 /* ena_com_admin_destroy - Destroy the admin and the async events queues.
  * @ena_dev: ENA communication layer struct
diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c 
b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index e71bf82..3494d4a 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -2503,7 +2503,7 @@ static int ena_device_init(struct ena_com_dev *ena_dev, 
struct pci_dev *pdev,
}
 
/* ENA admin level init */
-   rc = ena_com_admin_init(ena_dev, &aenq_handlers, true);
+   rc = ena_com_admin_init(ena_dev, &aenq_handlers);
if (rc) {
dev_err(dev,
"Can not initialize ena admin queue with device\n");
-- 
2.7.4
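
[ Editorial note: the dropped flag existed so a function-level-reset
  (FLR) recovery path could avoid re-initializing a spinlock it had set
  up earlier. spin_lock_init() on a lock that is not currently held
  simply resets it to the unlocked state, so calling it unconditionally
  is safe, which is what makes the parameter redundant. ]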



[PATCH V2 net-next 09/12] net: ena: change rx copybreak default to reduce kernel memory pressure

2018-10-11 Thread akiyano
From: Arthur Kiyanovski 

Improves socket memory utilization when receiving packets larger
than 128 bytes (the previous rx copybreak) and smaller than 256 bytes.

Signed-off-by: Arthur Kiyanovski 
---
 drivers/net/ethernet/amazon/ena/ena_netdev.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.h 
b/drivers/net/ethernet/amazon/ena/ena_netdev.h
index 0cf35ae..d241dfc 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.h
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.h
@@ -81,7 +81,7 @@
 #define ENA_DEFAULT_RING_SIZE  (1024)
 
 #define ENA_TX_WAKEUP_THRESH   (MAX_SKB_FRAGS + 2)
-#define ENA_DEFAULT_RX_COPYBREAK   (128 - NET_IP_ALIGN)
+#define ENA_DEFAULT_RX_COPYBREAK   (256 - NET_IP_ALIGN)
 
 /* limit the buffer size to 600 bytes to handle MTU changes from very
  * small to very large, in which case the number of buffers per packet
-- 
2.7.4
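
[ Editorial note: "rx copybreak" is the threshold below which the
  driver copies a received frame into a small freshly-allocated skb and
  recycles the original DMA buffer, rather than wrapping the large
  buffer in an skb. A generic sketch of the idea (not the driver's
  exact code):

	if (len <= rx_copybreak) {
		skb = napi_alloc_skb(napi, len);	/* small skb */
		if (skb) {
			skb_put_data(skb, buf_vaddr, len);	/* copy out */
			/* the large DMA buffer stays on the ring for reuse */
			return skb;
		}
	}
	/* otherwise hand the original buffer up the stack */

  Raising the default from 128 to 256 bytes means frames in that range
  no longer pin a full-size receive buffer while queued on a socket,
  which is the memory-pressure win the changelog describes. ]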



[PATCH V2 net-next 07/12] net: ena: explicit casting and initialization, and clearer error handling

2018-10-11 Thread akiyano
From: Arthur Kiyanovski 

Signed-off-by: Arthur Kiyanovski 
---
 drivers/net/ethernet/amazon/ena/ena_com.c| 39 
 drivers/net/ethernet/amazon/ena/ena_netdev.c |  5 ++--
 drivers/net/ethernet/amazon/ena/ena_netdev.h | 22 
 3 files changed, 36 insertions(+), 30 deletions(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_com.c 
b/drivers/net/ethernet/amazon/ena/ena_com.c
index 5220c75..5c468b2 100644
--- a/drivers/net/ethernet/amazon/ena/ena_com.c
+++ b/drivers/net/ethernet/amazon/ena/ena_com.c
@@ -235,7 +235,7 @@ static struct ena_comp_ctx 
*__ena_com_submit_admin_cmd(struct ena_com_admin_queu
tail_masked = admin_queue->sq.tail & queue_size_mask;
 
/* In case of queue FULL */
-   cnt = atomic_read(&admin_queue->outstanding_cmds);
+   cnt = (u16)atomic_read(&admin_queue->outstanding_cmds);
if (cnt >= admin_queue->q_depth) {
pr_debug("admin queue is full.\n");
admin_queue->stats.out_of_space++;
@@ -304,7 +304,7 @@ static struct ena_comp_ctx *ena_com_submit_admin_cmd(struct 
ena_com_admin_queue
 struct ena_admin_acq_entry 
*comp,
 size_t comp_size_in_bytes)
 {
-   unsigned long flags;
+   unsigned long flags = 0;
struct ena_comp_ctx *comp_ctx;
 
	spin_lock_irqsave(&admin_queue->q_lock, flags);
@@ -332,7 +332,7 @@ static int ena_com_init_io_sq(struct ena_com_dev *ena_dev,
 
	memset(&io_sq->desc_addr, 0x0, sizeof(io_sq->desc_addr));
 
-   io_sq->dma_addr_bits = ena_dev->dma_addr_bits;
+   io_sq->dma_addr_bits = (u8)ena_dev->dma_addr_bits;
io_sq->desc_entry_size =
(io_sq->direction == ENA_COM_IO_QUEUE_DIRECTION_TX) ?
sizeof(struct ena_eth_io_tx_desc) :
@@ -486,7 +486,7 @@ static void ena_com_handle_admin_completion(struct 
ena_com_admin_queue *admin_qu
 
/* Go over all the completions */
while ((READ_ONCE(cqe->acq_common_descriptor.flags) &
-   ENA_ADMIN_ACQ_COMMON_DESC_PHASE_MASK) == phase) {
+   ENA_ADMIN_ACQ_COMMON_DESC_PHASE_MASK) == phase) {
/* Do not read the rest of the completion entry before the
 * phase bit was validated
 */
@@ -537,7 +537,8 @@ static int ena_com_comp_status_to_errno(u8 comp_status)
 static int ena_com_wait_and_process_admin_cq_polling(struct ena_comp_ctx 
*comp_ctx,
 struct ena_com_admin_queue 
*admin_queue)
 {
-   unsigned long flags, timeout;
+   unsigned long flags = 0;
+   unsigned long timeout;
int ret;
 
timeout = jiffies + usecs_to_jiffies(admin_queue->completion_timeout);
@@ -736,7 +737,7 @@ static int ena_com_config_llq_info(struct ena_com_dev 
*ena_dev,
 static int ena_com_wait_and_process_admin_cq_interrupts(struct ena_comp_ctx 
*comp_ctx,
struct 
ena_com_admin_queue *admin_queue)
 {
-   unsigned long flags;
+   unsigned long flags = 0;
int ret;
 
	wait_for_completion_timeout(&comp_ctx->wait_event,
@@ -782,7 +783,7 @@ static u32 ena_com_reg_bar_read32(struct ena_com_dev 
*ena_dev, u16 offset)
volatile struct ena_admin_ena_mmio_req_read_less_resp *read_resp =
mmio_read->read_resp;
u32 mmio_read_reg, ret, i;
-   unsigned long flags;
+   unsigned long flags = 0;
u32 timeout = mmio_read->reg_read_to;
 
might_sleep();
@@ -1426,7 +1427,7 @@ void ena_com_abort_admin_commands(struct ena_com_dev 
*ena_dev)
 void ena_com_wait_for_abort_completion(struct ena_com_dev *ena_dev)
 {
	struct ena_com_admin_queue *admin_queue = &ena_dev->admin_queue;
-   unsigned long flags;
+   unsigned long flags = 0;
 
	spin_lock_irqsave(&admin_queue->q_lock, flags);
	while (atomic_read(&admin_queue->outstanding_cmds) != 0) {
@@ -1470,7 +1471,7 @@ bool ena_com_get_admin_running_state(struct ena_com_dev 
*ena_dev)
 void ena_com_set_admin_running_state(struct ena_com_dev *ena_dev, bool state)
 {
	struct ena_com_admin_queue *admin_queue = &ena_dev->admin_queue;
-   unsigned long flags;
+   unsigned long flags = 0;
 
	spin_lock_irqsave(&admin_queue->q_lock, flags);
ena_dev->admin_queue.running_state = state;
@@ -1504,7 +1505,7 @@ int ena_com_set_aenq_config(struct ena_com_dev *ena_dev, 
u32 groups_flag)
}
 
if ((get_resp.u.aenq.supported_groups & groups_flag) != groups_flag) {
-   pr_warn("Trying to set unsupported aenq events. supported flag: 
%x asked flag: %x\n",
+   pr_warn("Trying to set unsupported aenq events. supported flag: 
0x%x asked flag: 0x%x\n",
get_resp.u.aenq.supported_groups, groups_flag);
return -EOPNOTSUPP;
}
@@ -1652,7 +1653,7 @@ int ena_com_mmio_reg_read_request_init(struct ena_com_dev 
*ena_dev)
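
[ Editorial note: the message is truncated here. The recurring
  "unsigned long flags = 0;" hunks in this patch pre-initialize the
  variable handed to spin_lock_irqsave(); the macro always writes it,
  but some static checkers cannot see through the macro and emit a
  may-be-used-uninitialized warning. The pattern:

	unsigned long flags = 0;	/* pacify checkers; macro writes it */

	spin_lock_irqsave(&admin_queue->q_lock, flags);
	/* ... critical section ... */
	spin_unlock_irqrestore(&admin_queue->q_lock, flags);
  ]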
 

[PATCH V2 net-next 12/12] net: ena: fix indentations in ena_defs for better readability

2018-10-11 Thread akiyano
From: Arthur Kiyanovski 

Signed-off-by: Arthur Kiyanovski 
---
 drivers/net/ethernet/amazon/ena/ena_admin_defs.h  | 334 +-
 drivers/net/ethernet/amazon/ena/ena_eth_io_defs.h | 223 +++
 drivers/net/ethernet/amazon/ena/ena_regs_defs.h   | 206 +++--
 3 files changed, 338 insertions(+), 425 deletions(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_admin_defs.h 
b/drivers/net/ethernet/amazon/ena/ena_admin_defs.h
index b439ec1..9f80b73 100644
--- a/drivers/net/ethernet/amazon/ena/ena_admin_defs.h
+++ b/drivers/net/ethernet/amazon/ena/ena_admin_defs.h
@@ -32,119 +32,81 @@
 #ifndef _ENA_ADMIN_H_
 #define _ENA_ADMIN_H_
 
-enum ena_admin_aq_opcode {
-   ENA_ADMIN_CREATE_SQ = 1,
-
-   ENA_ADMIN_DESTROY_SQ= 2,
-
-   ENA_ADMIN_CREATE_CQ = 3,
-
-   ENA_ADMIN_DESTROY_CQ= 4,
-
-   ENA_ADMIN_GET_FEATURE   = 8,
 
-   ENA_ADMIN_SET_FEATURE   = 9,
-
-   ENA_ADMIN_GET_STATS = 11,
+enum ena_admin_aq_opcode {
+   ENA_ADMIN_CREATE_SQ = 1,
+   ENA_ADMIN_DESTROY_SQ= 2,
+   ENA_ADMIN_CREATE_CQ = 3,
+   ENA_ADMIN_DESTROY_CQ= 4,
+   ENA_ADMIN_GET_FEATURE   = 8,
+   ENA_ADMIN_SET_FEATURE   = 9,
+   ENA_ADMIN_GET_STATS = 11,
 };
 
 enum ena_admin_aq_completion_status {
-   ENA_ADMIN_SUCCESS   = 0,
-
-   ENA_ADMIN_RESOURCE_ALLOCATION_FAILURE   = 1,
-
-   ENA_ADMIN_BAD_OPCODE= 2,
-
-   ENA_ADMIN_UNSUPPORTED_OPCODE= 3,
-
-   ENA_ADMIN_MALFORMED_REQUEST = 4,
-
+   ENA_ADMIN_SUCCESS   = 0,
+   ENA_ADMIN_RESOURCE_ALLOCATION_FAILURE   = 1,
+   ENA_ADMIN_BAD_OPCODE= 2,
+   ENA_ADMIN_UNSUPPORTED_OPCODE= 3,
+   ENA_ADMIN_MALFORMED_REQUEST = 4,
/* Additional status is provided in ACQ entry extended_status */
-   ENA_ADMIN_ILLEGAL_PARAMETER = 5,
-
-   ENA_ADMIN_UNKNOWN_ERROR = 6,
-
-   ENA_ADMIN_RESOURCE_BUSY = 7,
+   ENA_ADMIN_ILLEGAL_PARAMETER = 5,
+   ENA_ADMIN_UNKNOWN_ERROR = 6,
+   ENA_ADMIN_RESOURCE_BUSY = 7,
 };
 
 enum ena_admin_aq_feature_id {
-   ENA_ADMIN_DEVICE_ATTRIBUTES = 1,
-
-   ENA_ADMIN_MAX_QUEUES_NUM= 2,
-
-   ENA_ADMIN_HW_HINTS  = 3,
-
-   ENA_ADMIN_LLQ   = 4,
-
-   ENA_ADMIN_RSS_HASH_FUNCTION = 10,
-
-   ENA_ADMIN_STATELESS_OFFLOAD_CONFIG  = 11,
-
-   ENA_ADMIN_RSS_REDIRECTION_TABLE_CONFIG  = 12,
-
-   ENA_ADMIN_MTU   = 14,
-
-   ENA_ADMIN_RSS_HASH_INPUT= 18,
-
-   ENA_ADMIN_INTERRUPT_MODERATION  = 20,
-
-   ENA_ADMIN_AENQ_CONFIG   = 26,
-
-   ENA_ADMIN_LINK_CONFIG   = 27,
-
-   ENA_ADMIN_HOST_ATTR_CONFIG  = 28,
-
-   ENA_ADMIN_FEATURES_OPCODE_NUM   = 32,
+   ENA_ADMIN_DEVICE_ATTRIBUTES = 1,
+   ENA_ADMIN_MAX_QUEUES_NUM= 2,
+   ENA_ADMIN_HW_HINTS  = 3,
+   ENA_ADMIN_LLQ   = 4,
+   ENA_ADMIN_RSS_HASH_FUNCTION = 10,
+   ENA_ADMIN_STATELESS_OFFLOAD_CONFIG  = 11,
+   ENA_ADMIN_RSS_REDIRECTION_TABLE_CONFIG  = 12,
+   ENA_ADMIN_MTU   = 14,
+   ENA_ADMIN_RSS_HASH_INPUT= 18,
+   ENA_ADMIN_INTERRUPT_MODERATION  = 20,
+   ENA_ADMIN_AENQ_CONFIG   = 26,
+   ENA_ADMIN_LINK_CONFIG   = 27,
+   ENA_ADMIN_HOST_ATTR_CONFIG  = 28,
+   ENA_ADMIN_FEATURES_OPCODE_NUM   = 32,
 };
 
 enum ena_admin_placement_policy_type {
/* descriptors and headers are in host memory */
-   ENA_ADMIN_PLACEMENT_POLICY_HOST = 1,
-
+   ENA_ADMIN_PLACEMENT_POLICY_HOST = 1,
/* descriptors and headers are in device memory (a.k.a Low Latency
 * Queue)
 */
-   ENA_ADMIN_PLACEMENT_POLICY_DEV  = 3,
+   ENA_ADMIN_PLACEMENT_POLICY_DEV  = 3,
 };
 
 enum ena_admin_link_types {
-   ENA_ADMIN_LINK_SPEED_1G = 0x1,
-
-   ENA_ADMIN_LINK_SPEED_2_HALF_G   = 0x2,
-
-   ENA_ADMIN_LINK_SPEED_5G = 0x4,
-
-   ENA_ADMIN_LINK_SPEED_10G= 0x8,
-
-   ENA_ADMIN_LINK_SPEED_25G= 0x10,
-
-   ENA_ADMIN_LINK_SPEED_40G= 0x20,
-
-   ENA_ADMIN_LINK_SPEED_50G= 0x40,
-
-   ENA_ADMIN_LINK_SPEED_100G   = 0x80,
-
-   ENA_ADMIN_LINK_SPEED_200G   = 0x100,
-
-   ENA_ADMIN_LINK_SPEED_400G   = 0x200,
+   
