date:20180925

Re: netlink: 16 bytes leftover after parsing attributes in process `ip'.

2018-09-25 Thread Jiri Benc

On Tue, 25 Sep 2018 09:37:41 -0600, David Ahern wrote:
> For ifaddrmsg ifa_flags aligns with ifi_type which is set by kernel side
> so this should be ok.

Does the existing user space set ifi_type to anything? Does it zero out
the field?

Are we able to find a flag value that is not going to be set by unaware
user space? I.e., a bit that is unused by the current ARPHRD values on
both little and big endian? (ARPHRD_NONE might be a problem, though...)

 Jiri
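
For reference, a minimal sketch of the overlap being discussed. It assumes the usual uapi layouts of struct ifinfomsg (linux/rtnetlink.h) and struct ifaddrmsg (linux/if_addr.h), reproduced here as local mirror structs so the snippet builds anywhere. The *_mirror names and flags_seen_for_type() are hypothetical and exist only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Local mirror of struct ifinfomsg (assumed layout, see lead-in). */
struct ifinfomsg_mirror {
	unsigned char  ifi_family;
	unsigned char  ifi_pad;
	unsigned short ifi_type;	/* ARPHRD_* value */
	int            ifi_index;
	unsigned int   ifi_flags;
	unsigned int   ifi_change;
};

/* Local mirror of struct ifaddrmsg (assumed layout, see lead-in). */
struct ifaddrmsg_mirror {
	uint8_t  ifa_family;
	uint8_t  ifa_prefixlen;
	uint8_t  ifa_flags;
	uint8_t  ifa_scope;
	uint32_t ifa_index;
};

/*
 * Returns the byte a parser would read as ifa_flags if the sender had
 * actually filled in an ifinfomsg with the given ifi_type.
 */
static uint8_t flags_seen_for_type(unsigned short ifi_type)
{
	struct ifinfomsg_mirror ifm = { .ifi_type = ifi_type };
	const unsigned char *raw = (const unsigned char *)&ifm;

	return raw[offsetof(struct ifaddrmsg_mirror, ifa_flags)];
}
```

Which half of ifi_type lands on ifa_flags depends on byte order: the low byte on little endian, the high byte on big endian. A value such as ARPHRD_NONE (0xFFFE) is non-zero in both halves, which is why it may be a problem for picking an unused bit.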

Re: Marvell phy errata origins?

2018-09-25 Thread Harini Katakam

Hi,
On Tue, Sep 25, 2018 at 11:00 PM Harini Katakam  wrote:
>
> Hi Daniel,
>
> On Tue, Sep 25, 2018 at 9:10 PM Andrew Lunn  wrote:
> >
> > > I hope this thread isn't too old to bring back to life. So it seems
> > > that Harini found that m88e did not need this errata, and Cisco
> > > previously found that Harini's patch fixed m88e1112, we included it
> > > internally for that reason
> > >
> > > Now I'm getting reports that this errata fixes issues we're seeing on
> > > m88e. We see an interrupt storm without the errata, despite the errata
> > > not being defined in the datasheet.
> >
> > Is everybody actually using interrupts? It could be in one system
> > phylib is polling.
> >
>
> Yes, we weren't using interrupts; we used phy poll.
>
> As I recall, the register and page combination was reserved and
> the access seemed to fail.
> It will be useful if we can get the errata description or version details.
> I'll check if I can get any more information.

One of the PHY parts used was "88E-B2-bab1i000"

Regards,
Harini

Re: [PATCH net-next] tls: Fix socket mem accounting error under async encryption

2018-09-25 Thread David Miller

From: Vakul Garg 
Date: Wed, 26 Sep 2018 04:19:25 +

> BTW, I noticed following build failure.
> It gets resolved after reverting d6ab93364734.
> 
>   CC [M]  drivers/net/phy/marvell.o
> drivers/net/phy/marvell.c: In function 'm88e1121_config_aneg':
> drivers/net/phy/marvell.c:468:25: error: 'autoneg' undeclared (first use in 
> this function); did you mean 'put_net'?
>   if (phydev->autoneg != autoneg || changed) {
>  ^~~
>  put_net
> drivers/net/phy/marvell.c:468:25: note: each undeclared identifier is 
> reported only once for each function it appears in
> 

Thanks, I just fixed it as follows below.

Please CC: the patch author when you report build failures like this
in the future.


From 4b1bd69769454175268908f50b32f1cbfee5bb83 Mon Sep 17 00:00:00 2001
From: "David S. Miller" 
Date: Tue, 25 Sep 2018 22:41:31 -0700
Subject: [PATCH] net: phy: marvell: Fix build.

Local variable 'autoneg' doesn't even exist:

drivers/net/phy/marvell.c: In function 'm88e1121_config_aneg':
drivers/net/phy/marvell.c:468:25: error: 'autoneg' undeclared (first use in 
this function); did you mean 'put_net'?
  if (phydev->autoneg != autoneg || changed) {
 ^~~

Fixes: d6ab93364734 ("net: phy: marvell: Avoid unnecessary soft reset")
Reported-by: Vakul Garg 
Signed-off-by: David S. Miller 
---
 drivers/net/phy/marvell.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/phy/marvell.c b/drivers/net/phy/marvell.c
index b55a7376bfdc..24fc4a73c300 100644
--- a/drivers/net/phy/marvell.c
+++ b/drivers/net/phy/marvell.c
@@ -465,7 +465,7 @@ static int m88e1121_config_aneg(struct phy_device *phydev)
if (err < 0)
return err;
 
-   if (phydev->autoneg != autoneg || changed) {
+   if (phydev->autoneg != AUTONEG_ENABLE || changed) {
/* A software reset is used to ensure a "commit" of the
 * changes is done.
 */
-- 
2.17.1

Re: [bpf-next PATCH 1/3] net: fix generic XDP to handle if eth header was mangled

2018-09-25 Thread Song Liu

On Tue, Sep 25, 2018 at 7:26 AM Jesper Dangaard Brouer
 wrote:
>
> XDP can modify (and resize) the Ethernet header in the packet.
>
> There is a bug in generic-XDP, because skb->protocol and skb->pkt_type
> are set up before reaching (netif_receive_)generic_xdp.
>
> This bug was hit when XDP was popping VLAN headers (changing
> eth->h_proto): skb->protocol still contained the VLAN indication
> (ETH_P_8021Q), causing invocation of skb_vlan_untag(skb), which corrupts
> the packet (basically popping the VLAN again).
>
> This patch catches the case where XDP changed the eth header in such a
> way that SKB fields need to be updated.
>
> Fixes: d445516966dc ("net: xdp: support xdp generic on virtual devices")
> Signed-off-by: Jesper Dangaard Brouer 
> ---
>  net/core/dev.c |   14 ++
>  1 file changed, 14 insertions(+)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index ca78dc5a79a3..db6d89f536cb 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4258,6 +4258,9 @@ static u32 netif_receive_generic_xdp(struct sk_buff 
> *skb,
> struct netdev_rx_queue *rxqueue;
> void *orig_data, *orig_data_end;
> u32 metalen, act = XDP_DROP;
> +   __be16 orig_eth_type;
> +   struct ethhdr *eth;
> +   bool orig_bcast;
> int hlen, off;
> u32 mac_len;
>
> @@ -4298,6 +4301,9 @@ static u32 netif_receive_generic_xdp(struct sk_buff 
> *skb,
> xdp->data_hard_start = skb->data - skb_headroom(skb);
> orig_data_end = xdp->data_end;
> orig_data = xdp->data;
> +   eth = (struct ethhdr *)xdp->data;
> +   orig_bcast = is_multicast_ether_addr_64bits(eth->h_dest);
> +   orig_eth_type = eth->h_proto;
>
> rxqueue = netif_get_rxqueue(skb);
> xdp->rxq = &rxqueue->xdp_rxq;
> @@ -4321,6 +4327,14 @@ static u32 netif_receive_generic_xdp(struct sk_buff 
> *skb,
>
> }
>
> +   /* check if XDP changed eth hdr such SKB needs update */
> +   eth = (struct ethhdr *)xdp->data;
> +   if ((orig_eth_type != eth->h_proto) ||
> +   (orig_bcast != is_multicast_ether_addr_64bits(eth->h_dest))) {

Are the actions below always correct for the condition above? Do we need
to confirm that the SKB is updated properly?

> +   __skb_push(skb, mac_len);
> +   skb->protocol = eth_type_trans(skb, skb->dev);
> +   }
> +
> switch (act) {
> case XDP_REDIRECT:
> case XDP_TX:
>
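
The condition quoted above can be modelled in user space roughly as follows. This is only a sketch: struct eth_hdr_mirror and skb_fields_stale() are hypothetical names, h_proto is compared as an opaque 16-bit value without byte-order conversion, and the real code goes on to call eth_type_trans() to recompute the SKB fields, which is not modelled here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct ethhdr. */
struct eth_hdr_mirror {
	uint8_t  h_dest[6];
	uint8_t  h_source[6];
	uint16_t h_proto;
};

/* Multicast (and broadcast) addresses have the low bit of byte 0 set. */
static bool is_multicast(const uint8_t *addr)
{
	return addr[0] & 0x01;
}

/*
 * True if the header changed in a way that makes previously derived
 * protocol / packet-type information stale: either the EtherType
 * differs or the destination switched between unicast and multicast.
 */
static bool skb_fields_stale(const struct eth_hdr_mirror *before,
			     const struct eth_hdr_mirror *after)
{
	return before->h_proto != after->h_proto ||
	       is_multicast(before->h_dest) != is_multicast(after->h_dest);
}
```

Note that a change confined to the source address, or a destination change that stays unicast, does not trip this check.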

[PATCH] net-tcp: /proc/sys/net/ipv4/tcp_probe_interval is a u32 not int

2018-09-25 Thread Maciej Żenczykowski

From: Maciej Żenczykowski 

(fix documentation and sysctl access to treat it as such)

Tested:
  # zcat /proc/config.gz | egrep ^CONFIG_HZ
  CONFIG_HZ_1000=y
  CONFIG_HZ=1000
  # echo $[(1<<32)/1000 + 1] | tee /proc/sys/net/ipv4/tcp_probe_interval
  4294968
  tee: /proc/sys/net/ipv4/tcp_probe_interval: Invalid argument
  # echo $[(1<<32)/1000] | tee /proc/sys/net/ipv4/tcp_probe_interval
  4294967
  # echo 0 | tee /proc/sys/net/ipv4/tcp_probe_interval
  # echo -1 | tee /proc/sys/net/ipv4/tcp_probe_interval
  -1
  tee: /proc/sys/net/ipv4/tcp_probe_interval: Invalid argument

Signed-off-by: Maciej Żenczykowski 
---
 Documentation/networking/ip-sysctl.txt | 2 +-
 net/ipv4/sysctl_net_ipv4.c | 6 --
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt 
b/Documentation/networking/ip-sysctl.txt
index 8313a636dd53..960de8fe3f40 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -425,7 +425,7 @@ tcp_mtu_probing - INTEGER
  1 - Disabled by default, enabled when an ICMP black hole detected
  2 - Always enabled, use initial MSS of tcp_base_mss.
 
-tcp_probe_interval - INTEGER
+tcp_probe_interval - UNSIGNED INTEGER
Controls how often to start TCP Packetization-Layer Path MTU
Discovery reprobe. The default is reprobing every 10 minutes as
per RFC4821.
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index b92f422f2fa8..891ed2f91467 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -48,6 +48,7 @@ static int tcp_syn_retries_max = MAX_TCP_SYNCNT;
 static int ip_ping_group_range_min[] = { 0, 0 };
 static int ip_ping_group_range_max[] = { GID_T_MAX, GID_T_MAX };
 static int comp_sack_nr_max = 255;
+static u32 u32_max_div_HZ = UINT_MAX / HZ;
 
 /* obsolete */
 static int sysctl_tcp_low_latency __read_mostly;
@@ -745,9 +746,10 @@ static struct ctl_table ipv4_net_table[] = {
{
.procname   = "tcp_probe_interval",
.data   = &init_net.ipv4.sysctl_tcp_probe_interval,
-   .maxlen = sizeof(int),
+   .maxlen = sizeof(u32),
.mode   = 0644,
-   .proc_handler   = proc_dointvec,
+   .proc_handler   = proc_douintvec_minmax,
+   .extra2 = &u32_max_div_HZ,
},
{
.procname   = "igmp_link_local_mcast_reports",
-- 
2.19.0.605.g01d371f741-goog

[PATCH net v2] bnxt_en: Fix TX timeout during netpoll.

2018-09-25 Thread Michael Chan

The current netpoll implementation in the bnxt_en driver has problems
that may cause TX completion events to be missed.  bnxt_poll_work() in
effect is only handling at most 1 TX packet before exiting.  In addition,
there may be in flight TX completions that ->poll() may miss even
after we fix bnxt_poll_work() to handle all visible TX completions.
netpoll may not call ->poll() again and HW may not generate IRQ
because the driver does not ARM the IRQ when the budget (0 for netpoll)
is reached.

We fix it by handling all TX completions and always ARMing the IRQ
when we exit ->poll() with 0 budget.

Also, the logic to ACK the completion ring in case it is almost filled
with TX completions needs to be adjusted to take care of the 0 budget
case, as discussed with Eric Dumazet 

Reported-by: Song Liu 
Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 61957b0..0478e56 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -1884,8 +1884,11 @@ static int bnxt_poll_work(struct bnxt *bp, struct 
bnxt_napi *bnapi, int budget)
if (TX_CMP_TYPE(txcmp) == CMP_TYPE_TX_L2_CMP) {
tx_pkts++;
/* return full budget so NAPI will complete. */
-   if (unlikely(tx_pkts > bp->tx_wake_thresh))
+   if (unlikely(tx_pkts > bp->tx_wake_thresh)) {
rx_pkts = budget;
+   raw_cons = NEXT_RAW_CMP(raw_cons);
+   break;
+   }
} else if ((TX_CMP_TYPE(txcmp) & 0x30) == 0x10) {
if (likely(budget))
rc = bnxt_rx_pkt(bp, bnapi, &raw_cons, &event);
@@ -1913,7 +1916,7 @@ static int bnxt_poll_work(struct bnxt *bp, struct 
bnxt_napi *bnapi, int budget)
}
raw_cons = NEXT_RAW_CMP(raw_cons);
 
-   if (rx_pkts == budget)
+   if (rx_pkts && rx_pkts == budget)
break;
}
 
@@ -2027,8 +2030,12 @@ static int bnxt_poll(struct napi_struct *napi, int 
budget)
while (1) {
work_done += bnxt_poll_work(bp, bnapi, budget - work_done);
 
-   if (work_done >= budget)
+   if (work_done >= budget) {
+   if (!budget)
+   BNXT_CP_DB_REARM(cpr->cp_doorbell,
+cpr->cp_raw_cons);
break;
+   }
 
if (!bnxt_has_work(bp, cpr)) {
if (napi_complete_done(napi, work_done))
-- 
2.5.1

Re: [PATCH] net/ncsi: Add NCSI OEM command for FB Tiogapass

2018-09-25 Thread Samuel Mendoza-Jonas

On Tue, 2018-09-25 at 18:16 +, Vijay Khemka wrote:
> Hi Joel,
> Thanks, I am adding netdev mailing list here.
> Yes, this command is supported on all Mellanox cards. It is as per the
> Mellanox specification.
> 
> Regards
> -Vijay

Hi Vijay,

Thanks for the patch; before I get too into a review though I'd like to
loop in Justin (cc'd) who I know is also working on an OEM command patch.
The changes here are very specific (eg. a command specific config option
"CONFIG_NCSI_OEM_CMD_GET_MAC"), which is ok on a small scale but if we
start to add an increasing number of commands could get out of hand.
As I understand it, Justin's version adds a generic handler, using the
NCSI Netlink interface to pass OEM commands and responses to and from
userspace, which does the actual packet handling.
It would be good to compare these two approaches first before committing
to any one path.

Justin -  could you weigh in here and give a description of your intended
changes? Are you able to post your changes upstream so we can compare?

Regards,
Samuel

> 
> On 9/24/18, 5:30 PM, "Joel Stanley"  wrote:
> 
> Hi Vijay,
> 
> On Tue, 25 Sep 2018 at 09:39, Vijay Khemka  wrote:
> >
> > > This patch adds OEM command to get mac address from NCSI device and
> > > configure the same to the network card.
> > >
> > > ncsi_cmd_arg - Modified this structure to include bigger payload data.
> > > ncsi_cmd_handler_oem: This function handles OEM command request
> > ncsi_rsp_handler_oem: This function handles response for OEM command.
> > get_mac_address_oem_mlx: This function will send OEM command to get
> > mac address for Mellanox card
> > set_mac_affinity_mlx: This will send OEM command to set Mac affinity
> > for Mellanox card
> 
> Thanks for the patch. The code looks okay, but I wanted to get some
> input from our NCSI maintainer as to how OEM commands would be
> structured. Sam, can you please provide some review here?
> 
> Is the command supported on all Mellanox cards, just some, or does
> TiogaPass have a special firmware that enables it?
> 
> We should include the netdev mailing list in this discussion as the
> patch needs to be acceptable for upstream.
> 
> Cheers,
> 
> Joel
> 
> >
> > Signed-off-by: Vijay Khemka 
> > ---
> >  net/ncsi/Kconfig   |  3 ++
> >  net/ncsi/internal.h| 11 +--
> >  net/ncsi/ncsi-cmd.c| 24 +--
> >  net/ncsi/ncsi-manage.c | 68 ++
> >  net/ncsi/ncsi-pkt.h| 16 ++
> >  net/ncsi/ncsi-rsp.c| 33 +++-
> >  6 files changed, 149 insertions(+), 6 deletions(-)
> >
> > diff --git a/net/ncsi/Kconfig b/net/ncsi/Kconfig
> > index 08a8a6031fd7..b8bf89fea7c8 100644
> > --- a/net/ncsi/Kconfig
> > +++ b/net/ncsi/Kconfig
> > @@ -10,3 +10,6 @@ config NET_NCSI
> >   support. Enable this only if your system connects to a network
> >   device via NCSI and the ethernet driver you're using supports
> >   the protocol explicitly.
> > +config NCSI_OEM_CMD_GET_MAC
> > +   bool "Get NCSI OEM MAC Address"
> > +   depends on NET_NCSI
> > diff --git a/net/ncsi/internal.h b/net/ncsi/internal.h
> > index 8055e3965cef..da17958e6a4b 100644
> > --- a/net/ncsi/internal.h
> > +++ b/net/ncsi/internal.h
> > @@ -68,6 +68,10 @@ enum {
> > NCSI_MODE_MAX
> >  };
> >
> > +#define NCSI_OEM_MFR_MLX_ID 0x8119
> > +#define NCSI_OEM_MLX_CMD_GET_MAC0x1b00
> > +#define NCSI_OEM_MLX_CMD_SET_AFFINITY   0x010700
> > +
> >  struct ncsi_channel_version {
> > u32 version;/* Supported BCD encoded NCSI version */
> > u32 alpha2; /* Supported BCD encoded NCSI version */
> > @@ -236,6 +240,7 @@ enum {
> > ncsi_dev_state_probe_dp,
> > ncsi_dev_state_config_sp= 0x0301,
> > ncsi_dev_state_config_cis,
> > +   ncsi_dev_state_config_oem_gma,
> > ncsi_dev_state_config_clear_vids,
> > ncsi_dev_state_config_svf,
> > ncsi_dev_state_config_ev,
> > @@ -301,9 +306,9 @@ struct ncsi_cmd_arg {
> > unsigned short   payload; /* Command packet payload 
> length */
> > unsigned int req_flags;   /* NCSI request properties
>*/
> > union {
> > -   unsigned char  bytes[16]; /* Command packet specific 
> data  */
> > -   unsigned short words[8];
> > -   unsigned int   dwords[4];
> > +   unsigned char  bytes[64]; /* Command packet specific 
> data  */
> > +   unsigned short words[32];
> > +   unsigned int   dwords[16];
> > };
> >  };
> >
> > diff --git a/net/ncsi/ncsi-cmd.c

Re: [PATCH bpf-next] bpftool: Fix bpftool net output

2018-09-25 Thread Song Liu

On Tue, Sep 25, 2018 at 3:25 PM Andrey Ignatov  wrote:
>
> Print `bpftool net` output to stdout instead of stderr. Only errors
> should be printed to stderr. Regular output should go to stdout and this
> is what all other subcommands of bpftool do, including --json and
> --pretty formats of `bpftool net` itself.
>
> Fixes: commit f6f3bac08ff9 ("tools/bpf: bpftool: add net support")
> Signed-off-by: Andrey Ignatov 
> Acked-by: Yonghong Song 

Acked-by: Song Liu 

> ---
>  tools/bpf/bpftool/netlink_dumper.h | 18 +-
>  1 file changed, 9 insertions(+), 9 deletions(-)
>
> diff --git a/tools/bpf/bpftool/netlink_dumper.h 
> b/tools/bpf/bpftool/netlink_dumper.h
> index 0788cfbbed0e..e3516b586a34 100644
> --- a/tools/bpf/bpftool/netlink_dumper.h
> +++ b/tools/bpf/bpftool/netlink_dumper.h
> @@ -16,7 +16,7 @@
> jsonw_name(json_wtr, name); \
> jsonw_start_object(json_wtr);   \
> } else {\
> -   fprintf(stderr, "%s {", name);  \
> +   fprintf(stdout, "%s {", name);  \
> }   \
>  }
>
> @@ -25,7 +25,7 @@
> if (json_output)\
> jsonw_start_object(json_wtr);   \
> else\
> -   fprintf(stderr, "{");   \
> +   fprintf(stdout, "{");   \
>  }
>
>  #define NET_END_OBJECT_NESTED  \
> @@ -33,7 +33,7 @@
> if (json_output)\
> jsonw_end_object(json_wtr); \
> else\
> -   fprintf(stderr, "}");   \
> +   fprintf(stdout, "}");   \
>  }
>
>  #define NET_END_OBJECT \
> @@ -47,7 +47,7 @@
> if (json_output)\
> jsonw_end_object(json_wtr); \
> else\
> -   fprintf(stderr, "\n");  \
> +   fprintf(stdout, "\n");  \
>  }
>
>  #define NET_START_ARRAY(name, fmt_str) \
> @@ -56,7 +56,7 @@
> jsonw_name(json_wtr, name); \
> jsonw_start_array(json_wtr);\
> } else {\
> -   fprintf(stderr, fmt_str, name); \
> +   fprintf(stdout, fmt_str, name); \
> }   \
>  }
>
> @@ -65,7 +65,7 @@
> if (json_output)\
> jsonw_end_array(json_wtr);  \
> else\
> -   fprintf(stderr, "%s", endstr);  \
> +   fprintf(stdout, "%s", endstr);  \
>  }
>
>  #define NET_DUMP_UINT(name, fmt_str, val)  \
> @@ -73,7 +73,7 @@
> if (json_output)\
> jsonw_uint_field(json_wtr, name, val);  \
> else\
> -   fprintf(stderr, fmt_str, val);  \
> +   fprintf(stdout, fmt_str, val);  \
>  }
>
>  #define NET_DUMP_STR(name, fmt_str, str)   \
> @@ -81,7 +81,7 @@
> if (json_output)\
> jsonw_string_field(json_wtr, name, str);\
> else\
> -   fprintf(stderr, fmt_str, str);  \
> +   fprintf(stdout, fmt_str, str);  \
>  }
>
>  #define NET_DUMP_STR_ONLY(str) \
> @@ -89,7 +89,7 @@
> if (json_output)\
> jsonw_string(json_wtr, str);\
> else\
> -   fprintf(stderr, "%s ", str);\
> +   fprintf(stdout, "%s ", str);\
>  }
>
>  #endif
> --
> 2.17.1
>

RE: [PATCH net-next] tls: Fix socket mem accounting error under async encryption

2018-09-25 Thread Vakul Garg

> -Original Message-
> From: David Miller 
> Sent: Wednesday, September 26, 2018 9:10 AM
> To: Vakul Garg 
> Cc: netdev@vger.kernel.org; bor...@mellanox.com;
> avia...@mellanox.com; davejwat...@fb.com; doro...@fb.com
> Subject: Re: [PATCH net-next] tls: Fix socket mem accounting error under
> async encryption
> 
> From: Vakul Garg 
> Date: Wed, 26 Sep 2018 01:54:25 +
> 
> > I don't find this patch and one other ("tls: Fixed a memory leak
> > during socket close") in linux-net-next. Could you please kindly
> > check? Regards.
> 
> After applying I didn't push out and instead I started a test build, closed my
> laptop, did a lot of other things and just came back to finish the build.
> 
> It'll show up momentarily.
> 
> Thanks for your patience.
 
Thanks for explaining the workflow.

BTW, I noticed following build failure.
It gets resolved after reverting d6ab93364734.

  CC [M]  drivers/net/phy/marvell.o
drivers/net/phy/marvell.c: In function 'm88e1121_config_aneg':
drivers/net/phy/marvell.c:468:25: error: 'autoneg' undeclared (first use in 
this function); did you mean 'put_net'?
  if (phydev->autoneg != autoneg || changed) {
 ^~~
 put_net
drivers/net/phy/marvell.c:468:25: note: each undeclared identifier is reported 
only once for each function it appears in

[PATCH 2/2] net-ipv4: remove 2 always zero parameters from ipv4_redirect()

2018-09-25 Thread Maciej Żenczykowski

From: Maciej Żenczykowski 

(the parameters in question are mark and flow_flags)

Reviewed-by: David Ahern 
Signed-off-by: Maciej Żenczykowski 
---
 include/net/route.h   | 3 +--
 net/ipv4/ah4.c| 2 +-
 net/ipv4/esp4.c   | 2 +-
 net/ipv4/icmp.c   | 2 +-
 net/ipv4/ip_gre.c | 4 ++--
 net/ipv4/ip_vti.c | 2 +-
 net/ipv4/ipcomp.c | 2 +-
 net/ipv4/ipip.c   | 2 +-
 net/ipv4/route.c  | 4 ++--
 net/ipv6/sit.c| 4 ++--
 net/xfrm/xfrm_interface.c | 2 +-
 11 files changed, 14 insertions(+), 15 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index 73c605bdd6d8..9883dc82f723 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -203,8 +203,7 @@ static inline int ip_route_input(struct sk_buff *skb, 
__be32 dst, __be32 src,
 void ipv4_update_pmtu(struct sk_buff *skb, struct net *net, u32 mtu, int oif,
  u8 protocol);
 void ipv4_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, u32 mtu);
-void ipv4_redirect(struct sk_buff *skb, struct net *net, int oif, u32 mark,
-  u8 protocol, int flow_flags);
+void ipv4_redirect(struct sk_buff *skb, struct net *net, int oif, u8 protocol);
 void ipv4_sk_redirect(struct sk_buff *skb, struct sock *sk);
 void ip_rt_send_redirect(struct sk_buff *skb);
 
diff --git a/net/ipv4/ah4.c b/net/ipv4/ah4.c
index 8811fe30282a..c01fa791260d 100644
--- a/net/ipv4/ah4.c
+++ b/net/ipv4/ah4.c
@@ -463,7 +463,7 @@ static int ah4_err(struct sk_buff *skb, u32 info)
if (icmp_hdr(skb)->type == ICMP_DEST_UNREACH)
ipv4_update_pmtu(skb, net, info, 0, IPPROTO_AH);
else
-   ipv4_redirect(skb, net, 0, 0, IPPROTO_AH, 0);
+   ipv4_redirect(skb, net, 0, IPPROTO_AH);
xfrm_state_put(x);
 
return 0;
diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index 2d0274441923..071533dd33c2 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -822,7 +822,7 @@ static int esp4_err(struct sk_buff *skb, u32 info)
if (icmp_hdr(skb)->type == ICMP_DEST_UNREACH)
ipv4_update_pmtu(skb, net, info, 0, IPPROTO_ESP);
else
-   ipv4_redirect(skb, net, 0, 0, IPPROTO_ESP, 0);
+   ipv4_redirect(skb, net, 0, IPPROTO_ESP);
xfrm_state_put(x);
 
return 0;
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 8013b37b598f..d832beed6e3a 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -1100,7 +1100,7 @@ void icmp_err(struct sk_buff *skb, u32 info)
if (type == ICMP_DEST_UNREACH && code == ICMP_FRAG_NEEDED)
ipv4_update_pmtu(skb, net, info, 0, IPPROTO_ICMP);
else if (type == ICMP_REDIRECT)
-   ipv4_redirect(skb, net, 0, 0, IPPROTO_ICMP, 0);
+   ipv4_redirect(skb, net, 0, IPPROTO_ICMP);
 }
 
 /*
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 83b80fafd8f2..38befe829caf 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -243,8 +243,8 @@ static void gre_err(struct sk_buff *skb, u32 info)
return;
}
if (type == ICMP_REDIRECT) {
-   ipv4_redirect(skb, dev_net(skb->dev), skb->dev->ifindex, 0,
- IPPROTO_GRE, 0);
+   ipv4_redirect(skb, dev_net(skb->dev), skb->dev->ifindex,
+ IPPROTO_GRE);
return;
}
 
diff --git a/net/ipv4/ip_vti.c b/net/ipv4/ip_vti.c
index 1b5571cb3282..de31b302d69c 100644
--- a/net/ipv4/ip_vti.c
+++ b/net/ipv4/ip_vti.c
@@ -320,7 +320,7 @@ static int vti4_err(struct sk_buff *skb, u32 info)
if (icmp_hdr(skb)->type == ICMP_DEST_UNREACH)
ipv4_update_pmtu(skb, net, info, 0, protocol);
else
-   ipv4_redirect(skb, net, 0, 0, protocol, 0);
+   ipv4_redirect(skb, net, 0, protocol);
xfrm_state_put(x);
 
return 0;
diff --git a/net/ipv4/ipcomp.c b/net/ipv4/ipcomp.c
index 04049d1330a2..9119d012ba46 100644
--- a/net/ipv4/ipcomp.c
+++ b/net/ipv4/ipcomp.c
@@ -50,7 +50,7 @@ static int ipcomp4_err(struct sk_buff *skb, u32 info)
if (icmp_hdr(skb)->type == ICMP_DEST_UNREACH)
ipv4_update_pmtu(skb, net, info, 0, IPPROTO_COMP);
else
-   ipv4_redirect(skb, net, 0, 0, IPPROTO_COMP, 0);
+   ipv4_redirect(skb, net, 0, IPPROTO_COMP);
xfrm_state_put(x);
 
return 0;
diff --git a/net/ipv4/ipip.c b/net/ipv4/ipip.c
index 6ff008e5818d..e65287c27e3d 100644
--- a/net/ipv4/ipip.c
+++ b/net/ipv4/ipip.c
@@ -180,7 +180,7 @@ static int ipip_err(struct sk_buff *skb, u32 info)
}
 
if (type == ICMP_REDIRECT) {
-   ipv4_redirect(skb, net, t->parms.link, 0, iph->protocol, 0);
+   ipv4_redirect(skb, net, t->parms.link, iph->protocol);
goto out;
}
 
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 7bbe3fc80b90..dce2ed66ebe1 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1130,14

[PATCH 1/2] net-ipv4: remove 2 always zero parameters from ipv4_update_pmtu()

2018-09-25 Thread Maciej Żenczykowski

From: Maciej Żenczykowski 

(the parameters in question are mark and flow_flags)

Reviewed-by: David Ahern 
Signed-off-by: Maciej Żenczykowski 
---
 include/net/route.h | 2 +-
 net/ipv4/ah4.c  | 2 +-
 net/ipv4/esp4.c | 2 +-
 net/ipv4/icmp.c | 2 +-
 net/ipv4/ip_gre.c   | 2 +-
 net/ipv4/ip_vti.c   | 2 +-
 net/ipv4/ipcomp.c   | 2 +-
 net/ipv4/ipip.c | 3 +--
 net/ipv4/route.c| 8 +++-
 net/ipv6/sit.c  | 2 +-
 net/netfilter/ipvs/ip_vs_core.c | 3 +--
 net/xfrm/xfrm_interface.c   | 2 +-
 12 files changed, 14 insertions(+), 18 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index bb53cdba38dc..73c605bdd6d8 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -201,7 +201,7 @@ static inline int ip_route_input(struct sk_buff *skb, 
__be32 dst, __be32 src,
 }
 
 void ipv4_update_pmtu(struct sk_buff *skb, struct net *net, u32 mtu, int oif,
- u32 mark, u8 protocol, int flow_flags);
+ u8 protocol);
 void ipv4_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, u32 mtu);
 void ipv4_redirect(struct sk_buff *skb, struct net *net, int oif, u32 mark,
   u8 protocol, int flow_flags);
diff --git a/net/ipv4/ah4.c b/net/ipv4/ah4.c
index 4dd95cdd8070..8811fe30282a 100644
--- a/net/ipv4/ah4.c
+++ b/net/ipv4/ah4.c
@@ -461,7 +461,7 @@ static int ah4_err(struct sk_buff *skb, u32 info)
return 0;
 
if (icmp_hdr(skb)->type == ICMP_DEST_UNREACH)
-   ipv4_update_pmtu(skb, net, info, 0, 0, IPPROTO_AH, 0);
+   ipv4_update_pmtu(skb, net, info, 0, IPPROTO_AH);
else
ipv4_redirect(skb, net, 0, 0, IPPROTO_AH, 0);
xfrm_state_put(x);
diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index 97689012b357..2d0274441923 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -820,7 +820,7 @@ static int esp4_err(struct sk_buff *skb, u32 info)
return 0;
 
if (icmp_hdr(skb)->type == ICMP_DEST_UNREACH)
-   ipv4_update_pmtu(skb, net, info, 0, 0, IPPROTO_ESP, 0);
+   ipv4_update_pmtu(skb, net, info, 0, IPPROTO_ESP);
else
ipv4_redirect(skb, net, 0, 0, IPPROTO_ESP, 0);
xfrm_state_put(x);
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 695979b7ef6d..8013b37b598f 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -1098,7 +1098,7 @@ void icmp_err(struct sk_buff *skb, u32 info)
}
 
if (type == ICMP_DEST_UNREACH && code == ICMP_FRAG_NEEDED)
-   ipv4_update_pmtu(skb, net, info, 0, 0, IPPROTO_ICMP, 0);
+   ipv4_update_pmtu(skb, net, info, 0, IPPROTO_ICMP);
else if (type == ICMP_REDIRECT)
ipv4_redirect(skb, net, 0, 0, IPPROTO_ICMP, 0);
 }
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index c3385a84f8ff..83b80fafd8f2 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -239,7 +239,7 @@ static void gre_err(struct sk_buff *skb, u32 info)
 
if (type == ICMP_DEST_UNREACH && code == ICMP_FRAG_NEEDED) {
ipv4_update_pmtu(skb, dev_net(skb->dev), info,
-skb->dev->ifindex, 0, IPPROTO_GRE, 0);
+skb->dev->ifindex, IPPROTO_GRE);
return;
}
if (type == ICMP_REDIRECT) {
diff --git a/net/ipv4/ip_vti.c b/net/ipv4/ip_vti.c
index f38cb21d773d..1b5571cb3282 100644
--- a/net/ipv4/ip_vti.c
+++ b/net/ipv4/ip_vti.c
@@ -318,7 +318,7 @@ static int vti4_err(struct sk_buff *skb, u32 info)
return 0;
 
if (icmp_hdr(skb)->type == ICMP_DEST_UNREACH)
-   ipv4_update_pmtu(skb, net, info, 0, 0, protocol, 0);
+   ipv4_update_pmtu(skb, net, info, 0, protocol);
else
ipv4_redirect(skb, net, 0, 0, protocol, 0);
xfrm_state_put(x);
diff --git a/net/ipv4/ipcomp.c b/net/ipv4/ipcomp.c
index d97f4f2787f5..04049d1330a2 100644
--- a/net/ipv4/ipcomp.c
+++ b/net/ipv4/ipcomp.c
@@ -48,7 +48,7 @@ static int ipcomp4_err(struct sk_buff *skb, u32 info)
return 0;
 
if (icmp_hdr(skb)->type == ICMP_DEST_UNREACH)
-   ipv4_update_pmtu(skb, net, info, 0, 0, IPPROTO_COMP, 0);
+   ipv4_update_pmtu(skb, net, info, 0, IPPROTO_COMP);
else
ipv4_redirect(skb, net, 0, 0, IPPROTO_COMP, 0);
xfrm_state_put(x);
diff --git a/net/ipv4/ipip.c b/net/ipv4/ipip.c
index c891235b4966..6ff008e5818d 100644
--- a/net/ipv4/ipip.c
+++ b/net/ipv4/ipip.c
@@ -175,8 +175,7 @@ static int ipip_err(struct sk_buff *skb, u32 info)
}
 
if (type == ICMP_DEST_UNREACH && code == ICMP_FRAG_NEEDED) {
-   ipv4_update_pmtu(skb, net, info, t->parms.link, 0,
-iph->protocol, 0);
+   ipv4_update_pmtu(skb, net, info, t->parms.link, iph->protocol);

Re: pull-request: bpf-next 2018-09-25

2018-09-25 Thread David Miller

From: Daniel Borkmann 
Date: Tue, 25 Sep 2018 22:43:43 +0200

> The following pull-request contains BPF updates for your *net-next*
> tree.

Pulled, there was a minor merge conflict.  Please double check my work.

Thanks.

Re: [PATCH net-next] tls: Fix socket mem accounting error under async encryption

2018-09-25 Thread David Miller

From: Vakul Garg 
Date: Wed, 26 Sep 2018 01:54:25 +

> I don't find this patch and one other ("tls: Fixed a memory leak
> during socket close") in linux-net-next. Could you please kindly
> check? Regards.

After applying I didn't push out and instead I started a test build,
closed my laptop, did a lot of other things and just came back to
finish the build.

It'll show up momentarily.

Thanks for your patience.

Re: [PATCH net-next] bridge: br_arp_nd_proxy: set icmp6_router if neigh has NTF_ROUTER

2018-09-25 Thread David Miller

From: Roopa Prabhu 
Date: Tue, 25 Sep 2018 14:39:14 -0700

> From: Roopa Prabhu 
> 
> Fixes: ed842faeb2bd ("bridge: suppress nd pkts on BR_NEIGH_SUPPRESS ports")
> Signed-off-by: Roopa Prabhu 

Applied.

Re: [PATCH] net: dsa: lantiq_gswip: Depend on HAS_IOMEM

2018-09-25 Thread David Miller

From: Hauke Mehrtens 
Date: Tue, 25 Sep 2018 21:55:33 +0200

> The driver uses devm_ioremap_resource() which is only available when
> CONFIG_HAS_IOMEM is set, make the driver depend on this config option.
> User mode Linux does not have CONFIG_HAS_IOMEM set and the driver was
> failing on this architecture.
> 
> Fixes: 14fceff4771e ("net: dsa: Add Lantiq / Intel DSA driver for vrx200")
> Reported-by: kbuild test robot 
> Signed-off-by: Hauke Mehrtens 

Applied to net-next.

Re: [net-next 00/10][pull request] 40GbE Intel Wired LAN Driver Updates 2018-09-25

2018-09-25 Thread David Miller

From: Jeff Kirsher 
Date: Tue, 25 Sep 2018 13:19:54 -0700

> This series contains updates to i40e and xsk.
 ...
> The following are changes since commit 
> bd6207202db8974ca3d3183ca0d5611d45b2973c:
>   net: macb: Clean 64b dma addresses if they are not detected
> and are available in the git repository at:
>   git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 40GbE

Pulled, thanks Jeff.

Re: [PATCH net-next v3 00/10] Refactor classifier API to work with Qdisc/blocks without rtnl lock

2018-09-25 Thread David Miller

From: Vlad Buslov 
Date: Mon, 24 Sep 2018 19:18:32 +0300

 ...
> The goal of this change is to refactor tcf_block_find() and its
> dependencies to allow concurrent execution:
> - Extend Qdisc API with rcu to lookup and take reference to Qdisc
>   without relying on rtnl lock.
> - Extend tcf_block with atomic reference counting and rcu.
> - Always take reference to tcf_block while working with it.
> - Implement tcf_block_release() to release resources obtained by
>   tcf_block_find()
> - Create infrastructure to allow registering Qdiscs with class ops that
>   do not require the caller to hold rtnl lock.

Series applied, thank you.

Re: [PATCH net RFT] bnxt_en: Fix TX timeout during netpoll.

2018-09-25 Thread Michael Chan

On Tue, Sep 25, 2018 at 7:25 PM Eric Dumazet  wrote:
>
> On Tue, Sep 25, 2018 at 7:15 PM Michael Chan  
> wrote:
> >
> > On Tue, Sep 25, 2018 at 4:11 PM Michael Chan  
> > wrote:
> > >
> > > On Tue, Sep 25, 2018 at 3:15 PM Eric Dumazet  
> > > wrote:
> > >
> > > >
> > > > It seems bnx2 should have a similar issue ?
> > > >
> > >
> > > Yes, I think so.  The MSIX mode in bnx2 is also auto-masking, meaning
> > > that MSIX will only assert once after it is ARMed.  If we return from
> > > ->poll() when budget of 0 is reached without ARMing, we may not get
> > > another MSIX.
> > >
> >
> > On second thought, I think bnx2 is ok.  If netpoll is polling on the
> > TX packets and reaching budget of 0 and returning, the INT_ACK_CMD
> > register is untouched.  bnx2 uses the status block for events and the
> > producers/consumers are cumulative.  So there is no need to ACK the
> > status block unless ARMing for interrupts.  If there is an IRQ about
> > to be fired, it won't be affected by the polling done by netpoll.
> >
> > In the case of bnxt, a completion ring is used for the events.  The
> > polling done by netpoll will cause the completion ring to be ACKed as
> > entries are processed.  ACKing the completion ring without ARMing may
> > cause future IRQs to be disabled for that ring.
>
> About bnxt : Are you sure it is all about IRQ problems ?

I'm pretty sure, because FB first reported TX timeouts followed by
ring reset failures when running netconsole.  These ring reset
failures are caused by IRQs no longer working on some rings.

>
> What if the whole ring buffer is filled, then all entries
> are processed from netpoll.
>
> If cp_raw_cons becomes too high without the NIC knowing its (updated)
> value, maybe no IRQ can be generated anymore because
> of some wrapping issue (based on ring size)

Good point.  We have logic to handle that.  We will ACK the ring at
least once every tp->tx_wake_thresh TX packets.  But this logic fails
when the budget is 0, so I need to send a revised patch to take care of
this one case.

>
> I guess that in order to test this, we would need something bursting
> 16000 messages while holding napi->poll_owner.
> The (single) IRQ would set/grab the SCHED bit but the cpu responsible
> to service this (soft)irq would spin for the whole test,
> and no more IRQ should be fired really.

Right, not easy to hit.  But it should be handled by my v2 patch.  Thanks.

Re: [PATCH] net-tcp: /proc/sys/net/ipv4/tcp_probe_interval is a u32 not int

2018-09-25 Thread Eric Dumazet

On 09/25/2018 05:41 PM, Maciej Żenczykowski wrote:
> From: Maciej Żenczykowski 
> 
> (fix documentation and sysctl access to treat it as such)
> 
> Signed-off-by: Maciej Żenczykowski 
> ---

While we are at it, we probably should add sane limits,
given tcp_mtu_check_reprobe() does :

u32 interval = net->ipv4.sysctl_tcp_probe_interval;
...
if (unlikely(delta >= interval * HZ)) {

A limit of UINT_MAX / HZ would avoid an overflow I guess.
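
To make the overflow concrete, here is a small user-space sketch. It is not kernel code: HZ is passed as a parameter because its value depends on the kernel configuration, and both helper names are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Largest interval (in seconds) for which interval * hz still fits in a u32. */
static uint32_t demo_max_interval(uint32_t hz)
{
	assert(hz > 0);
	return UINT32_MAX / hz;
}

/* Mirrors the u32 multiplication in the quoted check: wraps modulo 2^32. */
static uint32_t demo_interval_jiffies(uint32_t interval, uint32_t hz)
{
	return interval * hz;
}
```

Assuming HZ=1000, the limit works out to 4294967 seconds; one second more and the product wraps around to a small number, so the reprobe check would fire almost immediately.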

Re: [PATCH v2 net-next] tcp: expose sk_state in tcp_retransmit_skb tracepoint

2018-09-25 Thread Eric Dumazet

On 09/24/2018 05:57 AM, Yafang Shao wrote:
> After sk_state is exposed, we can tell in which state this retransmission
> occurs. That could give us more detail for diagnostics.
> For example, if this retransmission occurs in SYN_SENT state, it may
> also indicate that the SYN packet was dropped by the remote peer because
> its SYN backlog queue was full, and then we could check the remote peer.
> 
> BTW, SYNACK retransmission is traced in the tcp_retransmit_synack tracepoint.
> 
> Signed-off-by: Yafang Shao 

Signed-off-by: Eric Dumazet

[PATCH net] vxlan: fill ttl inherit info

2018-09-25 Thread Hangbin Liu

When adding vxlan ttl inherit support, I forgot to fill it in when dumping
vxlan info. Fix it now.

Fixes: 72f6d71e491e6 ("vxlan: add ttl inherit support")
Signed-off-by: Hangbin Liu 
---
 drivers/net/vxlan.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index ababba3..2b8da2b 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -3539,6 +3539,7 @@ static size_t vxlan_get_size(const struct net_device *dev)
nla_total_size(sizeof(__u32)) + /* IFLA_VXLAN_LINK */
nla_total_size(sizeof(struct in6_addr)) + /* IFLA_VXLAN_LOCAL{6} */
nla_total_size(sizeof(__u8)) +  /* IFLA_VXLAN_TTL */
+   nla_total_size(sizeof(__u8)) +  /* IFLA_VXLAN_TTL_INHERIT */
nla_total_size(sizeof(__u8)) +  /* IFLA_VXLAN_TOS */
nla_total_size(sizeof(__be32)) + /* IFLA_VXLAN_LABEL */
nla_total_size(sizeof(__u8)) +  /* IFLA_VXLAN_LEARNING */
@@ -3603,6 +3604,8 @@ static int vxlan_fill_info(struct sk_buff *skb, const struct net_device *dev)
}
 
if (nla_put_u8(skb, IFLA_VXLAN_TTL, vxlan->cfg.ttl) ||
+   nla_put_u8(skb, IFLA_VXLAN_TTL_INHERIT,
+  !!(vxlan->cfg.flags & VXLAN_F_TTL_INHERIT)) ||
nla_put_u8(skb, IFLA_VXLAN_TOS, vxlan->cfg.tos) ||
nla_put_be32(skb, IFLA_VXLAN_LABEL, vxlan->cfg.label) ||
nla_put_u8(skb, IFLA_VXLAN_LEARNING,
-- 
2.5.5
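
As a side note on the vxlan_get_size() hunk, the cost of one extra u8 attribute can be worked out with a user-space copy of the netlink alignment arithmetic. This is a sketch only: the DEMO_ names are invented here, and the constants are assumed to match the uapi netlink definitions (4-byte alignment, 4-byte attribute header).

```c
#include <stddef.h>

/* Values as in the Linux netlink uapi headers, redefined here so the
 * sketch builds without kernel headers. */
#define DEMO_NLA_ALIGNTO 4
#define DEMO_NLA_ALIGN(len) \
	(((len) + DEMO_NLA_ALIGNTO - 1) & ~(DEMO_NLA_ALIGNTO - 1))
#define DEMO_NLA_HDRLEN 4 /* aligned size of struct nlattr: two u16 fields */

/* Header plus payload, padded to the 4-byte attribute alignment. */
static size_t demo_nla_total_size(size_t payload)
{
	return DEMO_NLA_ALIGN(DEMO_NLA_HDRLEN + payload);
}
```

Under these assumptions a one-byte attribute such as IFLA_VXLAN_TTL_INHERIT still takes 8 bytes in the message, which is why the size estimate and the fill function have to be kept in step.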

Re: [PATCH net RFT] bnxt_en: Fix TX timeout during netpoll.

2018-09-25 Thread Eric Dumazet

On Tue, Sep 25, 2018 at 7:15 PM Michael Chan  wrote:
>
> On Tue, Sep 25, 2018 at 4:11 PM Michael Chan wrote:
> >
> > On Tue, Sep 25, 2018 at 3:15 PM Eric Dumazet  wrote:
> >
> > >
> > > It seems bnx2 should have a similar issue ?
> > >
> >
> > Yes, I think so.  The MSIX mode in bnx2 is also auto-masking, meaning
> > that MSIX will only assert once after it is ARMed.  If we return from
> > ->poll() when budget of 0 is reached without ARMing, we may not get
> > another MSIX.
> >
>
> On second thought, I think bnx2 is ok.  If netpoll is polling on the
> TX packets and reaching budget of 0 and returning, the INT_ACK_CMD
> register is untouched.  bnx2 uses the status block for events and the
> producers/consumers are cumulative.  So there is no need to ACK the
> status block unless ARMing for interrupts.  If there is an IRQ about
> to be fired, it won't be affected by the polling done by netpoll.
>
> In the case of bnxt, a completion ring is used for the events.  The
> polling done by netpoll will cause the completion ring to be ACKed as
> entries are processed.  ACKing the completion ring without ARMing may
> cause future IRQs to be disabled for that ring.

About bnxt : Are you sure it is all about IRQ problems ?

What if the whole ring buffer is filled, then all entries
are processed from netpoll.

If cp_raw_cons becomes too high without the NIC knowing its (updated)
value, maybe no IRQ can be generated anymore because
of some wrapping issue (based on ring size)

I guess that in order to test this, we would need something bursting
16000 messages while holding napi->poll_owner.
The (single) IRQ would set/grab the SCHED bit but the cpu responsible
to service this (soft)irq would spin for the whole test,
and no more IRQ should be fired really.

Re: [PATCH net RFT] bnxt_en: Fix TX timeout during netpoll.

2018-09-25 Thread Michael Chan

On Tue, Sep 25, 2018 at 4:11 PM Michael Chan  wrote:
>
> On Tue, Sep 25, 2018 at 3:15 PM Eric Dumazet  wrote:
>
> >
> > It seems bnx2 should have a similar issue ?
> >
>
> Yes, I think so.  The MSIX mode in bnx2 is also auto-masking, meaning
> that MSIX will only assert once after it is ARMed.  If we return from
> ->poll() when budget of 0 is reached without ARMing, we may not get
> another MSIX.
>

On second thought, I think bnx2 is ok.  If netpoll is polling on the
TX packets and reaching budget of 0 and returning, the INT_ACK_CMD
register is untouched.  bnx2 uses the status block for events and the
producers/consumers are cumulative.  So there is no need to ACK the
status block unless ARMing for interrupts.  If there is an IRQ about
to be fired, it won't be affected by the polling done by netpoll.

In the case of bnxt, a completion ring is used for the events.  The
polling done by netpoll will cause the completion ring to be ACKed as
entries are processed.  ACKing the completion ring without ARMing may
cause future IRQs to be disabled for that ring.

RE: [PATCH net-next] tls: Fix socket mem accounting error under async encryption

2018-09-25 Thread Vakul Garg

> -Original Message-
> From: David Miller 
> Sent: Tuesday, September 25, 2018 11:14 PM
> To: Vakul Garg 
> Cc: netdev@vger.kernel.org; bor...@mellanox.com;
> avia...@mellanox.com; davejwat...@fb.com; doro...@fb.com
> Subject: Re: [PATCH net-next] tls: Fix socket mem accounting error under
> async encryption
> 
> From: Vakul Garg 
> Date: Tue, 25 Sep 2018 16:26:17 +0530
> 
> > Current async encryption implementation sometimes showed up socket
> > memory accounting error during socket close. This results in kernel
> > warning calltrace. The root cause of the problem is that socket var
> > sk_forward_alloc gets corrupted due to access in sk_mem_charge() and
> > sk_mem_uncharge() being invoked from multiple concurrent contexts in
> > multicore processor. The apis sk_mem_charge() and sk_mem_uncharge()
> > are called from functions alloc_plaintext_sg(), free_sg() etc. It is
> > required that memory accounting apis are called under a socket lock.
> >
> > The plaintext sg data sent for encryption is freed using free_sg() in
> > tls_encryption_done(). It is wrong to call free_sg() from this function.
> > This is because this function may run in irq context. We cannot
> > acquire socket lock in this function.
> >
> > We remove calling of function free_sg() for plaintext data from
> > tls_encryption_done() and defer freeing up of plaintext data to the
> > time when the record is picked up from tx_list and transmitted/freed.
> > When
> > tls_tx_records() gets called, socket is already locked and thus there
> > is no concurrent access problem.
> >
> > Fixes: a42055e8d2c3 ("net/tls: Add support for async encryption")
> > Signed-off-by: Vakul Garg 
> 
> Applied.
 
I can't find this patch, or one other ("tls: Fixed a memory leak during
socket close"), in linux-net-next. Could you please check? Regards.

Re: [PATCH 2/2] net-ipv4: remove 2 always zero parameters from ipv4_redirect()

2018-09-25 Thread David Ahern

On 9/25/18 6:41 PM, Maciej Żenczykowski wrote:
> From: Maciej Żenczykowski 

A summary here of which 2 parameters are always 0 would be nice.

> 
> Signed-off-by: Maciej Żenczykowski 
> ---
>  include/net/route.h   | 3 +--
>  net/ipv4/ah4.c| 2 +-
>  net/ipv4/esp4.c   | 2 +-
>  net/ipv4/icmp.c   | 2 +-
>  net/ipv4/ip_gre.c | 4 ++--
>  net/ipv4/ip_vti.c | 2 +-
>  net/ipv4/ipcomp.c | 2 +-
>  net/ipv4/ipip.c   | 2 +-
>  net/ipv4/route.c  | 4 ++--
>  net/ipv6/sit.c| 4 ++--
>  net/xfrm/xfrm_interface.c | 2 +-
>  11 files changed, 14 insertions(+), 15 deletions(-)

Reviewed-by: David Ahern

Re: [PATCH 1/2] net-ipv4: remove 2 always zero parameters from ipv4_update_pmtu()

2018-09-25 Thread David Ahern

On 9/25/18 6:41 PM, Maciej Żenczykowski wrote:
> From: Maciej Żenczykowski 
> 

A summary here of which 2 parameters are always 0 would be nice.

> Signed-off-by: Maciej Żenczykowski 
> ---
>  include/net/route.h | 2 +-
>  net/ipv4/ah4.c  | 2 +-
>  net/ipv4/esp4.c | 2 +-
>  net/ipv4/icmp.c | 2 +-
>  net/ipv4/ip_gre.c   | 2 +-
>  net/ipv4/ip_vti.c   | 2 +-
>  net/ipv4/ipcomp.c   | 2 +-
>  net/ipv4/ipip.c | 3 +--
>  net/ipv4/route.c| 8 +++-
>  net/ipv6/sit.c  | 2 +-
>  net/netfilter/ipvs/ip_vs_core.c | 3 +--
>  net/xfrm/xfrm_interface.c   | 2 +-
>  12 files changed, 14 insertions(+), 18 deletions(-)
> 

Reviewed-by: David Ahern

[PATCH] net-tcp: /proc/sys/net/ipv4/tcp_probe_interval is a u32 not int

2018-09-25 Thread Maciej Żenczykowski

From: Maciej Żenczykowski 

(fix documentation and sysctl access to treat it as such)

Signed-off-by: Maciej Żenczykowski 
---
 Documentation/networking/ip-sysctl.txt | 2 +-
 net/ipv4/sysctl_net_ipv4.c | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 8313a636dd53..960de8fe3f40 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -425,7 +425,7 @@ tcp_mtu_probing - INTEGER
  1 - Disabled by default, enabled when an ICMP black hole detected
  2 - Always enabled, use initial MSS of tcp_base_mss.
 
-tcp_probe_interval - INTEGER
+tcp_probe_interval - UNSIGNED INTEGER
Controls how often to start TCP Packetization-Layer Path MTU
Discovery reprobe. The default is reprobing every 10 minutes as
per RFC4821.
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index b92f422f2fa8..c8fa935c3cdb 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -745,9 +745,9 @@ static struct ctl_table ipv4_net_table[] = {
{
.procname   = "tcp_probe_interval",
.data   = &init_net.ipv4.sysctl_tcp_probe_interval,
-   .maxlen = sizeof(int),
+   .maxlen = sizeof(u32),
.mode   = 0644,
-   .proc_handler   = proc_dointvec,
+   .proc_handler   = proc_douintvec,
},
{
.procname   = "igmp_link_local_mcast_reports",
-- 
2.19.0.605.g01d371f741-goog

[PATCH 2/2] net-ipv4: remove 2 always zero parameters from ipv4_redirect()

2018-09-25 Thread Maciej Żenczykowski

From: Maciej Żenczykowski 

Signed-off-by: Maciej Żenczykowski 
---
 include/net/route.h   | 3 +--
 net/ipv4/ah4.c| 2 +-
 net/ipv4/esp4.c   | 2 +-
 net/ipv4/icmp.c   | 2 +-
 net/ipv4/ip_gre.c | 4 ++--
 net/ipv4/ip_vti.c | 2 +-
 net/ipv4/ipcomp.c | 2 +-
 net/ipv4/ipip.c   | 2 +-
 net/ipv4/route.c  | 4 ++--
 net/ipv6/sit.c| 4 ++--
 net/xfrm/xfrm_interface.c | 2 +-
 11 files changed, 14 insertions(+), 15 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index 73c605bdd6d8..9883dc82f723 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -203,8 +203,7 @@ static inline int ip_route_input(struct sk_buff *skb, __be32 dst, __be32 src,
 void ipv4_update_pmtu(struct sk_buff *skb, struct net *net, u32 mtu, int oif,
  u8 protocol);
 void ipv4_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, u32 mtu);
-void ipv4_redirect(struct sk_buff *skb, struct net *net, int oif, u32 mark,
-  u8 protocol, int flow_flags);
+void ipv4_redirect(struct sk_buff *skb, struct net *net, int oif, u8 protocol);
 void ipv4_sk_redirect(struct sk_buff *skb, struct sock *sk);
 void ip_rt_send_redirect(struct sk_buff *skb);
 
diff --git a/net/ipv4/ah4.c b/net/ipv4/ah4.c
index 8811fe30282a..c01fa791260d 100644
--- a/net/ipv4/ah4.c
+++ b/net/ipv4/ah4.c
@@ -463,7 +463,7 @@ static int ah4_err(struct sk_buff *skb, u32 info)
if (icmp_hdr(skb)->type == ICMP_DEST_UNREACH)
ipv4_update_pmtu(skb, net, info, 0, IPPROTO_AH);
else
-   ipv4_redirect(skb, net, 0, 0, IPPROTO_AH, 0);
+   ipv4_redirect(skb, net, 0, IPPROTO_AH);
xfrm_state_put(x);
 
return 0;
diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index 2d0274441923..071533dd33c2 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -822,7 +822,7 @@ static int esp4_err(struct sk_buff *skb, u32 info)
if (icmp_hdr(skb)->type == ICMP_DEST_UNREACH)
ipv4_update_pmtu(skb, net, info, 0, IPPROTO_ESP);
else
-   ipv4_redirect(skb, net, 0, 0, IPPROTO_ESP, 0);
+   ipv4_redirect(skb, net, 0, IPPROTO_ESP);
xfrm_state_put(x);
 
return 0;
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 8013b37b598f..d832beed6e3a 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -1100,7 +1100,7 @@ void icmp_err(struct sk_buff *skb, u32 info)
if (type == ICMP_DEST_UNREACH && code == ICMP_FRAG_NEEDED)
ipv4_update_pmtu(skb, net, info, 0, IPPROTO_ICMP);
else if (type == ICMP_REDIRECT)
-   ipv4_redirect(skb, net, 0, 0, IPPROTO_ICMP, 0);
+   ipv4_redirect(skb, net, 0, IPPROTO_ICMP);
 }
 
 /*
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 83b80fafd8f2..38befe829caf 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -243,8 +243,8 @@ static void gre_err(struct sk_buff *skb, u32 info)
return;
}
if (type == ICMP_REDIRECT) {
-   ipv4_redirect(skb, dev_net(skb->dev), skb->dev->ifindex, 0,
- IPPROTO_GRE, 0);
+   ipv4_redirect(skb, dev_net(skb->dev), skb->dev->ifindex,
+ IPPROTO_GRE);
return;
}
 
diff --git a/net/ipv4/ip_vti.c b/net/ipv4/ip_vti.c
index 1b5571cb3282..de31b302d69c 100644
--- a/net/ipv4/ip_vti.c
+++ b/net/ipv4/ip_vti.c
@@ -320,7 +320,7 @@ static int vti4_err(struct sk_buff *skb, u32 info)
if (icmp_hdr(skb)->type == ICMP_DEST_UNREACH)
ipv4_update_pmtu(skb, net, info, 0, protocol);
else
-   ipv4_redirect(skb, net, 0, 0, protocol, 0);
+   ipv4_redirect(skb, net, 0, protocol);
xfrm_state_put(x);
 
return 0;
diff --git a/net/ipv4/ipcomp.c b/net/ipv4/ipcomp.c
index 04049d1330a2..9119d012ba46 100644
--- a/net/ipv4/ipcomp.c
+++ b/net/ipv4/ipcomp.c
@@ -50,7 +50,7 @@ static int ipcomp4_err(struct sk_buff *skb, u32 info)
if (icmp_hdr(skb)->type == ICMP_DEST_UNREACH)
ipv4_update_pmtu(skb, net, info, 0, IPPROTO_COMP);
else
-   ipv4_redirect(skb, net, 0, 0, IPPROTO_COMP, 0);
+   ipv4_redirect(skb, net, 0, IPPROTO_COMP);
xfrm_state_put(x);
 
return 0;
diff --git a/net/ipv4/ipip.c b/net/ipv4/ipip.c
index 6ff008e5818d..e65287c27e3d 100644
--- a/net/ipv4/ipip.c
+++ b/net/ipv4/ipip.c
@@ -180,7 +180,7 @@ static int ipip_err(struct sk_buff *skb, u32 info)
}
 
if (type == ICMP_REDIRECT) {
-   ipv4_redirect(skb, net, t->parms.link, 0, iph->protocol, 0);
+   ipv4_redirect(skb, net, t->parms.link, iph->protocol);
goto out;
}
 
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 7bbe3fc80b90..dce2ed66ebe1 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1130,14 +1130,14 @@ void ipv4_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, u32 mtu)

[PATCH 1/2] net-ipv4: remove 2 always zero parameters from ipv4_update_pmtu()

2018-09-25 Thread Maciej Żenczykowski

From: Maciej Żenczykowski 

Signed-off-by: Maciej Żenczykowski 
---
 include/net/route.h | 2 +-
 net/ipv4/ah4.c  | 2 +-
 net/ipv4/esp4.c | 2 +-
 net/ipv4/icmp.c | 2 +-
 net/ipv4/ip_gre.c   | 2 +-
 net/ipv4/ip_vti.c   | 2 +-
 net/ipv4/ipcomp.c   | 2 +-
 net/ipv4/ipip.c | 3 +--
 net/ipv4/route.c| 8 +++-
 net/ipv6/sit.c  | 2 +-
 net/netfilter/ipvs/ip_vs_core.c | 3 +--
 net/xfrm/xfrm_interface.c   | 2 +-
 12 files changed, 14 insertions(+), 18 deletions(-)

diff --git a/include/net/route.h b/include/net/route.h
index bb53cdba38dc..73c605bdd6d8 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -201,7 +201,7 @@ static inline int ip_route_input(struct sk_buff *skb, __be32 dst, __be32 src,
 }
 
 void ipv4_update_pmtu(struct sk_buff *skb, struct net *net, u32 mtu, int oif,
- u32 mark, u8 protocol, int flow_flags);
+ u8 protocol);
 void ipv4_sk_update_pmtu(struct sk_buff *skb, struct sock *sk, u32 mtu);
 void ipv4_redirect(struct sk_buff *skb, struct net *net, int oif, u32 mark,
   u8 protocol, int flow_flags);
diff --git a/net/ipv4/ah4.c b/net/ipv4/ah4.c
index 4dd95cdd8070..8811fe30282a 100644
--- a/net/ipv4/ah4.c
+++ b/net/ipv4/ah4.c
@@ -461,7 +461,7 @@ static int ah4_err(struct sk_buff *skb, u32 info)
return 0;
 
if (icmp_hdr(skb)->type == ICMP_DEST_UNREACH)
-   ipv4_update_pmtu(skb, net, info, 0, 0, IPPROTO_AH, 0);
+   ipv4_update_pmtu(skb, net, info, 0, IPPROTO_AH);
else
ipv4_redirect(skb, net, 0, 0, IPPROTO_AH, 0);
xfrm_state_put(x);
diff --git a/net/ipv4/esp4.c b/net/ipv4/esp4.c
index 97689012b357..2d0274441923 100644
--- a/net/ipv4/esp4.c
+++ b/net/ipv4/esp4.c
@@ -820,7 +820,7 @@ static int esp4_err(struct sk_buff *skb, u32 info)
return 0;
 
if (icmp_hdr(skb)->type == ICMP_DEST_UNREACH)
-   ipv4_update_pmtu(skb, net, info, 0, 0, IPPROTO_ESP, 0);
+   ipv4_update_pmtu(skb, net, info, 0, IPPROTO_ESP);
else
ipv4_redirect(skb, net, 0, 0, IPPROTO_ESP, 0);
xfrm_state_put(x);
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 695979b7ef6d..8013b37b598f 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -1098,7 +1098,7 @@ void icmp_err(struct sk_buff *skb, u32 info)
}
 
if (type == ICMP_DEST_UNREACH && code == ICMP_FRAG_NEEDED)
-   ipv4_update_pmtu(skb, net, info, 0, 0, IPPROTO_ICMP, 0);
+   ipv4_update_pmtu(skb, net, info, 0, IPPROTO_ICMP);
else if (type == ICMP_REDIRECT)
ipv4_redirect(skb, net, 0, 0, IPPROTO_ICMP, 0);
 }
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index c3385a84f8ff..83b80fafd8f2 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -239,7 +239,7 @@ static void gre_err(struct sk_buff *skb, u32 info)
 
if (type == ICMP_DEST_UNREACH && code == ICMP_FRAG_NEEDED) {
ipv4_update_pmtu(skb, dev_net(skb->dev), info,
-skb->dev->ifindex, 0, IPPROTO_GRE, 0);
+skb->dev->ifindex, IPPROTO_GRE);
return;
}
if (type == ICMP_REDIRECT) {
diff --git a/net/ipv4/ip_vti.c b/net/ipv4/ip_vti.c
index f38cb21d773d..1b5571cb3282 100644
--- a/net/ipv4/ip_vti.c
+++ b/net/ipv4/ip_vti.c
@@ -318,7 +318,7 @@ static int vti4_err(struct sk_buff *skb, u32 info)
return 0;
 
if (icmp_hdr(skb)->type == ICMP_DEST_UNREACH)
-   ipv4_update_pmtu(skb, net, info, 0, 0, protocol, 0);
+   ipv4_update_pmtu(skb, net, info, 0, protocol);
else
ipv4_redirect(skb, net, 0, 0, protocol, 0);
xfrm_state_put(x);
diff --git a/net/ipv4/ipcomp.c b/net/ipv4/ipcomp.c
index d97f4f2787f5..04049d1330a2 100644
--- a/net/ipv4/ipcomp.c
+++ b/net/ipv4/ipcomp.c
@@ -48,7 +48,7 @@ static int ipcomp4_err(struct sk_buff *skb, u32 info)
return 0;
 
if (icmp_hdr(skb)->type == ICMP_DEST_UNREACH)
-   ipv4_update_pmtu(skb, net, info, 0, 0, IPPROTO_COMP, 0);
+   ipv4_update_pmtu(skb, net, info, 0, IPPROTO_COMP);
else
ipv4_redirect(skb, net, 0, 0, IPPROTO_COMP, 0);
xfrm_state_put(x);
diff --git a/net/ipv4/ipip.c b/net/ipv4/ipip.c
index c891235b4966..6ff008e5818d 100644
--- a/net/ipv4/ipip.c
+++ b/net/ipv4/ipip.c
@@ -175,8 +175,7 @@ static int ipip_err(struct sk_buff *skb, u32 info)
}
 
if (type == ICMP_DEST_UNREACH && code == ICMP_FRAG_NEEDED) {
-   ipv4_update_pmtu(skb, net, info, t->parms.link, 0,
-iph->protocol, 0);
+   ipv4_update_pmtu(skb, net, info, t->parms.link, iph->protocol);
goto out;
}
 
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index

Re: [PATCH net 00/15] netpoll: avoid capture effects for NAPI drivers

2018-09-25 Thread Song Liu

> On Sep 24, 2018, at 8:30 AM, Eric Dumazet  wrote:
> 
> On Sun, Sep 23, 2018 at 10:04 PM David Miller  wrote:
>> 
>> Series applied, thanks Eric.
> 
> Thanks David.
> 
> Song, would you please test this additional patch?
> 
> diff --git a/net/core/netpoll.c b/net/core/netpoll.c
> index 3219a2932463096566ce8ff336ecdf699422dd65..2ad45babe621b2c979ad5496b7df4342e4efbaa6 100644
> --- a/net/core/netpoll.c
> +++ b/net/core/netpoll.c
> @@ -150,13 +150,6 @@ static void poll_one_napi(struct napi_struct *napi)
> {
>int work = 0;
> 
> -   /* net_rx_action's ->poll() invocations and our's are
> -* synchronized by this test which is only made while
> -* holding the napi->poll_lock.
> -*/
> -   if (!test_bit(NAPI_STATE_SCHED, &napi->state))
> -   return;
> -
>/* If we set this bit but see that it has already been set,
> * that indicates that napi has been disabled and we need
> * to abort this operation


Reviewed-and-tested-by: Song Liu

Re: [PATCH net RFT] bnxt_en: Fix TX timeout during netpoll.

2018-09-25 Thread Song Liu

Thanks Michael!

This works well in my tests. 


Reviewed-and-tested-by: Song Liu 


> On Sep 25, 2018, at 2:31 PM, Michael Chan  wrote:
> 
> The current netpoll implementation in the bnxt_en driver has problems
> that may miss TX completion events.  bnxt_poll_work() in effect is
> only handling at most 1 TX packet before exiting.  In addition,
> there may be in flight TX completions that ->poll() may miss even
> after we fix bnxt_poll_work() to handle all visible TX completions.
> netpoll may not call ->poll() again and HW may not generate IRQ
> because the driver does not ARM the IRQ when the budget (0 for netpoll)
> is reached.
> 
> We fix it by handling all TX completions and to always ARM the IRQ
> when we exit ->poll() with 0 budget.
> 
> Reported-by: Song Liu 
> Signed-off-by: Michael Chan 
> ---
> drivers/net/ethernet/broadcom/bnxt/bnxt.c | 8 ++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> index 61957b0..c981b53 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> @@ -1913,7 +1913,7 @@ static int bnxt_poll_work(struct bnxt *bp, struct bnxt_napi *bnapi, int budget)
>   }
>   raw_cons = NEXT_RAW_CMP(raw_cons);
> 
> - if (rx_pkts == budget)
> + if (rx_pkts && rx_pkts == budget)
>   break;
>   }
> 
> @@ -2027,8 +2027,12 @@ static int bnxt_poll(struct napi_struct *napi, int budget)
>   while (1) {
>   work_done += bnxt_poll_work(bp, bnapi, budget - work_done);
> 
> - if (work_done >= budget)
> + if (work_done >= budget) {
> + if (!budget)
> + BNXT_CP_DB_REARM(cpr->cp_doorbell,
> +  cpr->cp_raw_cons);
>   break;
> + }
> 
>   if (!bnxt_has_work(bp, cpr)) {
>   if (napi_complete_done(napi, work_done))
> -- 
> 2.5.1
>

Re: [PATCH net RFT] bnxt_en: Fix TX timeout during netpoll.

2018-09-25 Thread Michael Chan

On Tue, Sep 25, 2018 at 3:15 PM Eric Dumazet  wrote:

>
> It seems bnx2 should have a similar issue ?
>

Yes, I think so.  The MSIX mode in bnx2 is also auto-masking, meaning
that MSIX will only assert once after it is ARMed.  If we return from
->poll() when budget of 0 is reached without ARMing, we may not get
another MSIX.

I can work on a similar patch but I don't have bnx2 cards to test with
anymore.  Thanks.

[PATCH bpf-next] bpftool: Fix bpftool net output

2018-09-25 Thread Andrey Ignatov

Print `bpftool net` output to stdout instead of stderr. Only errors
should be printed to stderr. Regular output should go to stdout and this
is what all other subcommands of bpftool do, including --json and
--pretty formats of `bpftool net` itself.

Fixes: commit f6f3bac08ff9 ("tools/bpf: bpftool: add net support")
Signed-off-by: Andrey Ignatov 
Acked-by: Yonghong Song 
---
 tools/bpf/bpftool/netlink_dumper.h | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/tools/bpf/bpftool/netlink_dumper.h b/tools/bpf/bpftool/netlink_dumper.h
index 0788cfbbed0e..e3516b586a34 100644
--- a/tools/bpf/bpftool/netlink_dumper.h
+++ b/tools/bpf/bpftool/netlink_dumper.h
@@ -16,7 +16,7 @@
jsonw_name(json_wtr, name); \
jsonw_start_object(json_wtr);   \
} else {\
-   fprintf(stderr, "%s {", name);  \
+   fprintf(stdout, "%s {", name);  \
}   \
 }
 
@@ -25,7 +25,7 @@
if (json_output)\
jsonw_start_object(json_wtr);   \
else\
-   fprintf(stderr, "{");   \
+   fprintf(stdout, "{");   \
 }
 
 #define NET_END_OBJECT_NESTED  \
@@ -33,7 +33,7 @@
if (json_output)\
jsonw_end_object(json_wtr); \
else\
-   fprintf(stderr, "}");   \
+   fprintf(stdout, "}");   \
 }
 
 #define NET_END_OBJECT \
@@ -47,7 +47,7 @@
if (json_output)\
jsonw_end_object(json_wtr); \
else\
-   fprintf(stderr, "\n");  \
+   fprintf(stdout, "\n");  \
 }
 
 #define NET_START_ARRAY(name, fmt_str) \
@@ -56,7 +56,7 @@
jsonw_name(json_wtr, name); \
jsonw_start_array(json_wtr);\
} else {\
-   fprintf(stderr, fmt_str, name); \
+   fprintf(stdout, fmt_str, name); \
}   \
 }
 
@@ -65,7 +65,7 @@
if (json_output)\
jsonw_end_array(json_wtr);  \
else\
-   fprintf(stderr, "%s", endstr);  \
+   fprintf(stdout, "%s", endstr);  \
 }
 
 #define NET_DUMP_UINT(name, fmt_str, val)  \
@@ -73,7 +73,7 @@
if (json_output)\
jsonw_uint_field(json_wtr, name, val);  \
else\
-   fprintf(stderr, fmt_str, val);  \
+   fprintf(stdout, fmt_str, val);  \
 }
 
 #define NET_DUMP_STR(name, fmt_str, str)   \
@@ -81,7 +81,7 @@
if (json_output)\
jsonw_string_field(json_wtr, name, str);\
else\
-   fprintf(stderr, fmt_str, str);  \
+   fprintf(stdout, fmt_str, str);  \
 }
 
 #define NET_DUMP_STR_ONLY(str) \
@@ -89,7 +89,7 @@
if (json_output)\
jsonw_string(json_wtr, str);\
else\
-   fprintf(stderr, "%s ", str);\
+   fprintf(stdout, "%s ", str);\
 }
 
 #endif
-- 
2.17.1

Re: [PATCH net RFT] bnxt_en: Fix TX timeout during netpoll.

2018-09-25 Thread Eric Dumazet

On 09/25/2018 02:31 PM, Michael Chan wrote:
> The current netpoll implementation in the bnxt_en driver has problems
> that may miss TX completion events.  bnxt_poll_work() in effect is
> only handling at most 1 TX packet before exiting.  In addition,
> there may be in flight TX completions that ->poll() may miss even
> after we fix bnxt_poll_work() to handle all visible TX completions.
> netpoll may not call ->poll() again and HW may not generate IRQ
> because the driver does not ARM the IRQ when the budget (0 for netpoll)
> is reached.
> 
> We fix it by handling all TX completions and to always ARM the IRQ
> when we exit ->poll() with 0 budget.
> 
> Reported-by: Song Liu 
> Signed-off-by: Michael Chan 
> ---
>  drivers/net/ethernet/broadcom/bnxt/bnxt.c | 8 ++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> index 61957b0..c981b53 100644
> --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
> @@ -1913,7 +1913,7 @@ static int bnxt_poll_work(struct bnxt *bp, struct bnxt_napi *bnapi, int budget)
>   }
>   raw_cons = NEXT_RAW_CMP(raw_cons);
>  
> - if (rx_pkts == budget)
> + if (rx_pkts && rx_pkts == budget)
>   break;
>   }
>  
> @@ -2027,8 +2027,12 @@ static int bnxt_poll(struct napi_struct *napi, int budget)
>   while (1) {
>   work_done += bnxt_poll_work(bp, bnapi, budget - work_done);
>  
> - if (work_done >= budget)
> + if (work_done >= budget) {
> + if (!budget)
> + BNXT_CP_DB_REARM(cpr->cp_doorbell,
> +  cpr->cp_raw_cons);
>   break;
> + }
>  
>   if (!bnxt_has_work(bp, cpr)) {
>   if (napi_complete_done(napi, work_done))
> 

Hi Michael, thanks for the patch.

It seems bnx2 should have a similar issue ?

[PATCH net-next] bridge: br_arp_nd_proxy: set icmp6_router if neigh has NTF_ROUTER

2018-09-25 Thread Roopa Prabhu

From: Roopa Prabhu 

Fixes: ed842faeb2bd ("bridge: suppress nd pkts on BR_NEIGH_SUPPRESS ports")
Signed-off-by: Roopa Prabhu 
---
 net/bridge/br_arp_nd_proxy.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/bridge/br_arp_nd_proxy.c b/net/bridge/br_arp_nd_proxy.c
index 2cf7716..d42e390 100644
--- a/net/bridge/br_arp_nd_proxy.c
+++ b/net/bridge/br_arp_nd_proxy.c
@@ -311,7 +311,7 @@ static void br_nd_send(struct net_bridge *br, struct net_bridge_port *p,
/* Neighbor Advertisement */
memset(na, 0, sizeof(*na) + na_olen);
na->icmph.icmp6_type = NDISC_NEIGHBOUR_ADVERTISEMENT;
-   na->icmph.icmp6_router = 0; /* XXX: should be 1 ? */
+   na->icmph.icmp6_router = (n->flags & NTF_ROUTER) ? 1 : 0;
na->icmph.icmp6_override = 1;
na->icmph.icmp6_solicited = 1;
na->target = ns->target;
-- 
2.1.4

[PATCH net RFT] bnxt_en: Fix TX timeout during netpoll.

2018-09-25 Thread Michael Chan

The current netpoll implementation in the bnxt_en driver has problems
that may miss TX completion events.  bnxt_poll_work() in effect is
only handling at most 1 TX packet before exiting.  In addition,
there may be in flight TX completions that ->poll() may miss even
after we fix bnxt_poll_work() to handle all visible TX completions.
netpoll may not call ->poll() again and HW may not generate IRQ
because the driver does not ARM the IRQ when the budget (0 for netpoll)
is reached.

We fix it by handling all TX completions and by always ARMing the IRQ
when we exit ->poll() with 0 budget.

Reported-by: Song Liu 
Signed-off-by: Michael Chan 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 61957b0..c981b53 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -1913,7 +1913,7 @@ static int bnxt_poll_work(struct bnxt *bp, struct bnxt_napi *bnapi, int budget)
}
raw_cons = NEXT_RAW_CMP(raw_cons);
 
-   if (rx_pkts == budget)
+   if (rx_pkts && rx_pkts == budget)
break;
}
 
@@ -2027,8 +2027,12 @@ static int bnxt_poll(struct napi_struct *napi, int budget)
while (1) {
work_done += bnxt_poll_work(bp, bnapi, budget - work_done);
 
-   if (work_done >= budget)
+   if (work_done >= budget) {
+   if (!budget)
+   BNXT_CP_DB_REARM(cpr->cp_doorbell,
+cpr->cp_raw_cons);
break;
+   }
 
if (!bnxt_has_work(bp, cpr)) {
if (napi_complete_done(napi, work_done))
-- 
2.5.1

Re: bond: take rcu lock in bond_poll_controller

2018-09-25 Thread Cong Wang

On Mon, Sep 24, 2018 at 1:08 PM Dave Jones  wrote:
>
> Callers of bond_for_each_slave_rcu are expected to hold the rcu lock,
> otherwise a trace like below is shown

Interesting, netpoll_send_skb_on_dev() already assumes RCU read lock
when it calls rcu_dereference_bh()...

I wonder how it can't catch such a warning before the one you reported.

[PATCH iproute2 net-next] ipneigh: support setting of NTF_ROUTER on neigh entries

2018-09-25 Thread Roopa Prabhu

From: Roopa Prabhu 

Signed-off-by: Roopa Prabhu 
---
 ip/ipneigh.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/ip/ipneigh.c b/ip/ipneigh.c
index a0af705..5747152 100644
--- a/ip/ipneigh.c
+++ b/ip/ipneigh.c
@@ -139,6 +139,8 @@ static int ipneigh_modify(int cmd, int flags, int argc, char **argv)
dst_ok = 1;
dev_ok = 1;
req.ndm.ndm_flags |= NTF_PROXY;
+   } else if (strcmp(*argv, "router") == 0) {
+   req.ndm.ndm_flags |= NTF_ROUTER;
} else if (strcmp(*argv, "dev") == 0) {
NEXT_ARG();
dev = *argv;
-- 
2.1.4

Re: [PATCH net 00/15] netpoll: avoid capture effects for NAPI drivers

2018-09-25 Thread Michael Chan

On Tue, Sep 25, 2018 at 11:25 AM Song Liu  wrote:

>
> Hi Michael,
>
> This may not be related. But I am looking at this:
>
> bnxt_poll_work() {
>
> while (1) {
> 
> if (rx_pkts == budget)
> return
> }
> }
>
> With budget of 0, the loop will terminate after processing one packet.
> But I think the expectation is to finish all tx packets. So it doesn't
> feel right. Could you please confirm?
>

Right, this in effect is processing only 1 TX packet so it will be
inefficient at least.

But I think fixing it here still will not fix all the issues, because
even if we process all the TX packets here, we may still miss some
that are in flight.  When we exit poll, netpoll may not call us back
again and there may be no interrupts because we don't ARM the IRQ when
budget of 0 is reached.  I will send a test patch shortly for review
and testing.  Thanks.
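
To make the budget-0 case concrete, here is a minimal userspace sketch in C.
It is not the bnxt code: model_poll_tx() and its parameters are hypothetical,
and the ring is assumed to hold TX completions only, so rx_pkts never moves
off zero. It contrasts the exit test quoted above (rx_pkts == budget) with a
test that additionally requires at least one RX packet.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of one poll pass over a completion ring that holds
 * only TX completions. TX work is not charged against the NAPI budget,
 * so rx_pkts stays 0 for the whole pass. Returns the number of TX
 * completions reaped before the loop exits.
 */
static int model_poll_tx(int tx_pending, int budget, bool old_check)
{
	int rx_pkts = 0;
	int tx_reaped = 0;

	assert(tx_pending >= 0 && budget >= 0);

	while (tx_pending > 0) {
		tx_pending--;
		tx_reaped++;

		/* Original test: already true when budget == 0. */
		if (old_check && rx_pkts == budget)
			break;
		/* Stricter test: only fires after at least one RX packet. */
		if (!old_check && rx_pkts && rx_pkts == budget)
			break;
	}
	return tx_reaped;
}
```

With five pending TX completions and a budget of 0, the first test stops
after a single completion while the second drains all five. The sketch does
not model re-arming the interrupt, which is the separate problem described
above.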

pull-request: bpf-next 2018-09-25

2018-09-25 Thread Daniel Borkmann

Hi David,

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Allow for RX stack hardening by implementing the kernel's flow
   dissector in BPF. Idea was originally presented at netconf 2017 [0].
   Quote from merge commit:

 [...] Because of the rigorous checks of the BPF verifier, this
 provides significant security guarantees. In particular, the BPF
 flow dissector cannot get inside of an infinite loop, as with
 CVE-2013-4348, because BPF programs are guaranteed to terminate.
 It cannot read outside of packet bounds, because all memory accesses
 are checked. Also, with BPF the administrator can decide which
 protocols to support, reducing potential attack surface. Rarely
 encountered protocols can be excluded from dissection and the
 program can be updated without kernel recompile or reboot if a
 bug is discovered. [...]

   Also, a sample flow dissector has been implemented in BPF as part
   of this work, from Petar and Willem.

   [0] http://vger.kernel.org/netconf2017_files/rx_hardening_and_udp_gso.pdf

2) Add support for bpftool to list currently active attachment
   points of BPF networking programs providing a quick overview
   similar to bpftool's perf subcommand, from Yonghong.

3) Fix a verifier pruning instability bug where a union member
   from the register state was not cleared properly leading to
   branches not being pruned despite them being valid candidates,
   from Alexei.

4) Various smaller fast-path optimizations in XDP's map redirect
   code, from Jesper.

5) Enable bpftool to recognize BPF_MAP_TYPE_REUSEPORT_SOCKARRAY
   maps, from Roman.

6) Remove a duplicate check in libbpf that probes for function
   storage, from Taeung.

7) Fix an issue in test_progs by no longer checking errno, since
   on success its value should not be checked, from Mauricio.

8) Fix unused variable warning in bpf_getsockopt() helper when
   CONFIG_INET is not configured, from Anders.

9) Fix a compilation failure in the BPF sample code's use of
   bpf_flow_keys, from Prashant.

10) Minor cleanups in BPF code, from Yue and Zhong.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git

Thanks a lot!



The following changes since commit 428f944bd58607021b5a1f85d145c0b50f908c6f:

  netlink: Make groups check less stupid in netlink_bind() (2018-09-05 22:11:33 -0700)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git 

for you to fetch changes up to d0e13a1488ad30dc3c2c9347b931cb10f892e3a4:

  flow_dissector: lookup netns by skb->sk if skb->dev is NULL (2018-09-25 17:31:19 +0200)


Alexei Starovoitov (4):
  bpf/verifier: fix verifier instability
  Merge branch 'progarray_mapinmap_dump'
  Merge branch 'bpf-flow-dissector'
  selftests/bpf: fix bpf_flow.c build

Anders Roxell (1):
  net/core/filter: fix unused-variable warning

Jesper Dangaard Brouer (3):
  xdp: unlikely instrumentation for xdp map redirect
  xdp: explicit inline __xdp_map_lookup_elem
  xdp: split code for map vs non-map redirect

Mauricio Vasquez B (2):
  selftests/bpf: add missing executables to .gitignore
  selftests/bpf/test_progs: do not check errno == 0

Petar Penkov (5):
  flow_dissector: implements flow dissector BPF hook
  bpf: sync bpf.h uapi with tools/
  bpf: support flow dissector in libbpf and bpftool
  flow_dissector: implements eBPF parser
  selftests/bpf: test bpf flow dissection

Prashant Bhole (1):
  samples/bpf: fix compilation failure

Roman Gushchin (1):
  bpftool: add support for BPF_MAP_TYPE_REUSEPORT_SOCKARRAY maps

Taeung Song (1):
  libbpf: Remove the duplicate checking of function storage

Willem de Bruijn (2):
  flow_dissector: fix build failure without CONFIG_NET
  flow_dissector: lookup netns by skb->sk if skb->dev is NULL

Yonghong Song (9):
  tools/bpf: sync kernel uapi header if_link.h to tools
  tools/bpf: move bpf/lib netlink related functions into a new file
  tools/bpf: add more netlink functionalities in lib/bpf
  tools/bpf: bpftool: add net support
  bpf: add bpffs pretty print for program array map
  tools/bpf: bpftool: support prog array map and map of maps
  tools/bpf: fix a netlink recv issue
  tools/bpf: bpftool: improve output format for bpftool net
  samples/bpf: fix a compilation failure

YueHaibing (1):
  samples/bpf: remove duplicated includes

zhong jiang (1):
  bpf: remove redundant null pointer check before consume_skb

 include/linux/bpf.h|   1 +
 include/linux/bpf_types.h  |   1 +
 include/linux/skbuff.h |  20 +
 include/net/net_namespace.h

[net-next 01/10] i40e: Fix VF's link state notification

2018-09-25 Thread Jeff Kirsher

From: Mariusz Stachura 

This resolves an issue where the VF link state was not being updated
when the PF is down or up, and the VF link state would always show
that it is running.

Signed-off-by: Mariusz Stachura 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 5 -
 1 file changed, 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index c7d2c9010fdf..fff53470e182 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -8509,14 +8509,9 @@ static void i40e_link_event(struct i40e_pf *pf)
i40e_status status;
bool new_link, old_link;
 
-   /* save off old link status information */
-   pf->hw.phy.link_info_old = pf->hw.phy.link_info;
-
/* set this to force the get_link_status call to refresh state */
pf->hw.phy.get_link_info = true;
-
old_link = (pf->hw.phy.link_info_old.link_info & I40E_AQ_LINK_UP);
-
status = i40e_get_link_status(&pf->hw, &new_link);
 
/* On success, disable temp link polling */
-- 
2.17.1

[net-next 06/10] i40e: Remove unused msglen parameter from virtchnl functions

2018-09-25 Thread Jeff Kirsher

From: Patryk Małek 

msglen parameter seems to be unused in several virtchnl functions.
This patch removes it from signatures of those functions.

Signed-off-by: Patryk Małek 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 .../ethernet/intel/i40e/i40e_virtchnl_pf.c| 96 +++
 1 file changed, 37 insertions(+), 59 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
index 5d5ffde1e93b..f4bb2779f03a 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
@@ -1973,13 +1973,11 @@ static inline int i40e_getnum_vf_vsi_vlan_filters(struct i40e_vsi *vsi)
  * i40e_vc_config_promiscuous_mode_msg
  * @vf: pointer to the VF info
  * @msg: pointer to the msg buffer
- * @msglen: msg length
  *
  * called from the VF to configure the promiscuous mode of
  * VF vsis
  **/
-static int i40e_vc_config_promiscuous_mode_msg(struct i40e_vf *vf,
-  u8 *msg, u16 msglen)
+static int i40e_vc_config_promiscuous_mode_msg(struct i40e_vf *vf, u8 *msg)
 {
struct virtchnl_promisc_info *info =
(struct virtchnl_promisc_info *)msg;
@@ -2034,12 +2032,11 @@ static int i40e_vc_config_promiscuous_mode_msg(struct i40e_vf *vf,
  * i40e_vc_config_queues_msg
  * @vf: pointer to the VF info
  * @msg: pointer to the msg buffer
- * @msglen: msg length
  *
  * called from the VF to configure the rx/tx
  * queues
  **/
-static int i40e_vc_config_queues_msg(struct i40e_vf *vf, u8 *msg, u16 msglen)
+static int i40e_vc_config_queues_msg(struct i40e_vf *vf, u8 *msg)
 {
struct virtchnl_vsi_queue_config_info *qci =
(struct virtchnl_vsi_queue_config_info *)msg;
@@ -2152,12 +2149,11 @@ static int i40e_validate_queue_map(struct i40e_vf *vf, u16 vsi_id,
  * i40e_vc_config_irq_map_msg
  * @vf: pointer to the VF info
  * @msg: pointer to the msg buffer
- * @msglen: msg length
  *
  * called from the VF to configure the irq to
  * queue map
  **/
-static int i40e_vc_config_irq_map_msg(struct i40e_vf *vf, u8 *msg, u16 msglen)
+static int i40e_vc_config_irq_map_msg(struct i40e_vf *vf, u8 *msg)
 {
struct virtchnl_irq_map_info *irqmap_info =
(struct virtchnl_irq_map_info *)msg;
@@ -2249,11 +2245,10 @@ static int i40e_ctrl_vf_rx_rings(struct i40e_vsi *vsi, unsigned long q_map,
  * i40e_vc_enable_queues_msg
  * @vf: pointer to the VF info
  * @msg: pointer to the msg buffer
- * @msglen: msg length
  *
  * called from the VF to enable all or specific queue(s)
  **/
-static int i40e_vc_enable_queues_msg(struct i40e_vf *vf, u8 *msg, u16 msglen)
+static int i40e_vc_enable_queues_msg(struct i40e_vf *vf, u8 *msg)
 {
struct virtchnl_queue_select *vqs =
(struct virtchnl_queue_select *)msg;
@@ -2308,12 +2303,11 @@ static int i40e_vc_enable_queues_msg(struct i40e_vf *vf, u8 *msg, u16 msglen)
  * i40e_vc_disable_queues_msg
  * @vf: pointer to the VF info
  * @msg: pointer to the msg buffer
- * @msglen: msg length
  *
  * called from the VF to disable all or specific
  * queue(s)
  **/
-static int i40e_vc_disable_queues_msg(struct i40e_vf *vf, u8 *msg, u16 msglen)
+static int i40e_vc_disable_queues_msg(struct i40e_vf *vf, u8 *msg)
 {
struct virtchnl_queue_select *vqs =
(struct virtchnl_queue_select *)msg;
@@ -2356,14 +2350,13 @@ static int i40e_vc_disable_queues_msg(struct i40e_vf *vf, u8 *msg, u16 msglen)
  * i40e_vc_request_queues_msg
  * @vf: pointer to the VF info
  * @msg: pointer to the msg buffer
- * @msglen: msg length
  *
  * VFs get a default number of queues but can use this message to request a
  * different number.  If the request is successful, PF will reset the VF and
  * return 0.  If unsuccessful, PF will send message informing VF of number of
  * available queues and return result of sending VF a message.
  **/
-static int i40e_vc_request_queues_msg(struct i40e_vf *vf, u8 *msg, int msglen)
+static int i40e_vc_request_queues_msg(struct i40e_vf *vf, u8 *msg)
 {
struct virtchnl_vf_res_request *vfres =
(struct virtchnl_vf_res_request *)msg;
@@ -2407,11 +2400,10 @@ static int i40e_vc_request_queues_msg(struct i40e_vf *vf, u8 *msg, int msglen)
  * i40e_vc_get_stats_msg
  * @vf: pointer to the VF info
  * @msg: pointer to the msg buffer
- * @msglen: msg length
  *
  * called from the VF to get vsi stats
  **/
-static int i40e_vc_get_stats_msg(struct i40e_vf *vf, u8 *msg, u16 msglen)
+static int i40e_vc_get_stats_msg(struct i40e_vf *vf, u8 *msg)
 {
struct virtchnl_queue_select *vqs =
(struct virtchnl_queue_select *)msg;
@@ -2517,11 +2509,10 @@ static inline int i40e_check_vf_permission(struct i40e_vf *vf,
  * i40e_vc_add_mac_addr_msg
  * @vf: pointer to the VF info
  * @msg: pointer to the msg buffer
- * @msglen: msg length
  *
  * add guest mac address filter
  **/
-static int i40e_vc_add_mac_addr_msg(struct

[net-next 03/10] i40e: use declared variables for pf and hw

2018-09-25 Thread Jeff Kirsher

From: Patryk Małek 

In order to slightly simplify the code use the variables for pf and hw
that are declared in i40e_set_mac function.

Signed-off-by: Patryk Małek 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 9 -
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index fff53470e182..102219cea67f 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -1532,8 +1532,8 @@ static int i40e_set_mac(struct net_device *netdev, void *p)
return 0;
}
 
-   if (test_bit(__I40E_VSI_DOWN, vsi->back->state) ||
-   test_bit(__I40E_RESET_RECOVERY_PENDING, vsi->back->state))
+   if (test_bit(__I40E_DOWN, pf->state) ||
+   test_bit(__I40E_RESET_RECOVERY_PENDING, pf->state))
return -EADDRNOTAVAIL;
 
if (ether_addr_equal(hw->mac.addr, addr->sa_data))
@@ -1557,8 +1557,7 @@ static int i40e_set_mac(struct net_device *netdev, void *p)
if (vsi->type == I40E_VSI_MAIN) {
i40e_status ret;
 
-   ret = i40e_aq_mac_address_write(&vsi->back->hw,
-   I40E_AQC_WRITE_TYPE_LAA_WOL,
+   ret = i40e_aq_mac_address_write(hw, I40E_AQC_WRITE_TYPE_LAA_WOL,
addr->sa_data, NULL);
if (ret)
netdev_info(netdev, "Ignoring error from firmware on LAA update, status %s, AQ ret %s\n",
@@ -1569,7 +1568,7 @@ static int i40e_set_mac(struct net_device *netdev, void *p)
/* schedule our worker thread which will take care of
 * applying the new filter changes
 */
-   i40e_service_event_schedule(vsi->back);
+   i40e_service_event_schedule(pf);
return 0;
 }
 
-- 
2.17.1

[net-next 02/10] i40e: Unset promiscuous settings on VF reset

2018-09-25 Thread Jeff Kirsher

From: Mariusz Stachura 

This patch cleans up promiscuous configuration when a VF reset occurs.
Previously the promiscuous mode settings were still there after the VF
driver removal.

Signed-off-by: Mariusz Stachura 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 .../ethernet/intel/i40e/i40e_virtchnl_pf.c| 267 ++
 1 file changed, 157 insertions(+), 110 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
index 3e707c7e6782..cd0f0bcfeddc 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
@@ -1084,6 +1084,136 @@ static int i40e_quiesce_vf_pci(struct i40e_vf *vf)
return -EIO;
 }
 
+static inline int i40e_getnum_vf_vsi_vlan_filters(struct i40e_vsi *vsi);
+
+/**
+ * i40e_config_vf_promiscuous_mode
+ * @vf: pointer to the VF info
+ * @vsi_id: VSI id
+ * @allmulti: set MAC L2 layer multicast promiscuous enable/disable
+ * @alluni: set MAC L2 layer unicast promiscuous enable/disable
+ *
+ * Called from the VF to configure the promiscuous mode of
+ * VF vsis and from the VF reset path to reset promiscuous mode.
+ **/
+static i40e_status i40e_config_vf_promiscuous_mode(struct i40e_vf *vf,
+  u16 vsi_id,
+  bool allmulti,
+  bool alluni)
+{
+   struct i40e_pf *pf = vf->pf;
+   struct i40e_hw *hw = &pf->hw;
+   struct i40e_mac_filter *f;
+   i40e_status aq_ret = 0;
+   struct i40e_vsi *vsi;
+   int bkt;
+
+   vsi = i40e_find_vsi_from_id(pf, vsi_id);
+   if (!i40e_vc_isvalid_vsi_id(vf, vsi_id) || !vsi)
+   return I40E_ERR_PARAM;
+
+   if (!test_bit(I40E_VIRTCHNL_VF_CAP_PRIVILEGE, &vf->vf_caps)) {
+   dev_err(&pf->pdev->dev,
+   "Unprivileged VF %d is attempting to configure promiscuous mode\n",
+   vf->vf_id);
+   /* Lie to the VF on purpose. */
+   return 0;
+   }
+
+   if (vf->port_vlan_id) {
+   aq_ret = i40e_aq_set_vsi_mc_promisc_on_vlan(hw, vsi->seid,
+   allmulti,
+   vf->port_vlan_id,
+   NULL);
+   if (aq_ret) {
+   int aq_err = pf->hw.aq.asq_last_status;
+
+   dev_err(&pf->pdev->dev,
+   "VF %d failed to set multicast promiscuous mode err %s aq_err %s\n",
+   vf->vf_id,
+   i40e_stat_str(&pf->hw, aq_ret),
+   i40e_aq_str(&pf->hw, aq_err));
+   return aq_ret;
+   }
+
+   aq_ret = i40e_aq_set_vsi_uc_promisc_on_vlan(hw, vsi->seid,
+   alluni,
+   vf->port_vlan_id,
+   NULL);
+   if (aq_ret) {
+   int aq_err = pf->hw.aq.asq_last_status;
+
+   dev_err(&pf->pdev->dev,
+   "VF %d failed to set unicast promiscuous mode err %s aq_err %s\n",
+   vf->vf_id,
+   i40e_stat_str(&pf->hw, aq_ret),
+   i40e_aq_str(&pf->hw, aq_err));
+   }
+   return aq_ret;
+   } else if (i40e_getnum_vf_vsi_vlan_filters(vsi)) {
+   hash_for_each(vsi->mac_filter_hash, bkt, f, hlist) {
+   if (f->vlan < 0 || f->vlan > I40E_MAX_VLANID)
+   continue;
+   aq_ret = i40e_aq_set_vsi_mc_promisc_on_vlan(hw,
+   vsi->seid,
+   allmulti,
+   f->vlan,
+   NULL);
+   if (aq_ret) {
+   int aq_err = pf->hw.aq.asq_last_status;
+
+   dev_err(&pf->pdev->dev,
+   "Could not add VLAN %d to multicast promiscuous domain err %s aq_err %s\n",
+   f->vlan,
+   i40e_stat_str(&pf->hw, aq_ret),
+   i40e_aq_str(&pf->hw, aq_err));
+   }
+
+   aq_ret = i40e_aq_set_vsi_uc_promisc_on_vlan(hw,
+   vsi->seid,
+   alluni,

[net-next 09/10] i40e: clean zero-copy XDP Rx ring on shutdown/reset

2018-09-25 Thread Jeff Kirsher

From: Björn Töpel 

Outstanding Rx descriptors are temporarily stored on a stash/reuse
queue. When/if the HW rings come up again, entries from the stash are
used to re-populate the ring.

The latter required some restructuring of the allocation scheme for
the AF_XDP zero-copy implementation. There is now a fast, and a slow
allocation. The "fast allocation" is used from the fast-path and
obtains free buffers from the fill ring and the internal recycle
mechanism. The "slow allocation" is only used in ring setup, and
obtains buffers from the fill ring and the stash (if any).
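
The split between the two allocation paths can be pictured with a small
userspace model in C. Everything in it is an assumption made for
illustration: the model_* names, the fixed ring size, and the choice to
consult the stash before the fill ring. The internal recycle mechanism used
by the fast path is left out, so this is a sketch of the idea rather than of
the driver code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_RING_SIZE 8

/* Hypothetical stash of buffer handles saved at ring teardown. */
struct model_stash {
	uint32_t length;                    /* handles currently stashed */
	uint64_t handles[MODEL_RING_SIZE];
};

/* Hypothetical fill ring: userspace produces, the driver consumes. */
struct model_fill_ring {
	uint32_t head, tail;                /* consumer / producer index */
	uint64_t addrs[MODEL_RING_SIZE];
};

/* Fast path: take the next handle from the fill ring only. */
static bool model_alloc_fast(struct model_fill_ring *fq, uint64_t *handle)
{
	assert(fq && handle);
	if (fq->head == fq->tail)
		return false;
	*handle = fq->addrs[fq->head++ % MODEL_RING_SIZE];
	return true;
}

/* Slow path (ring setup): use a stashed handle if any, else the fill ring. */
static bool model_alloc_slow(struct model_stash *rq,
			     struct model_fill_ring *fq, uint64_t *handle)
{
	assert(rq && handle);
	if (rq->length) {
		*handle = rq->handles[--rq->length];
		return true;
	}
	return model_alloc_fast(fq, handle);
}
```

In this model a ring that is being repopulated first drains whatever was
stashed and only then consumes new entries from the fill ring, so no handle
that userspace already handed over is lost.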

Signed-off-by: Björn Töpel 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |   4 +-
 .../ethernet/intel/i40e/i40e_txrx_common.h|   1 +
 drivers/net/ethernet/intel/i40e/i40e_xsk.c| 100 --
 3 files changed, 96 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 7f85d4ba8b54..740ea58ba938 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -1355,8 +1355,10 @@ void i40e_clean_rx_ring(struct i40e_ring *rx_ring)
rx_ring->skb = NULL;
}
 
-   if (rx_ring->xsk_umem)
+   if (rx_ring->xsk_umem) {
+   i40e_xsk_clean_rx_ring(rx_ring);
goto skip_free;
+   }
 
/* Free all the Rx ring sk_buffs */
for (i = 0; i < rx_ring->count; i++) {
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h b/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
index 29c68b29d36f..8d46acff6f2e 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
@@ -87,6 +87,7 @@ static inline void i40e_arm_wb(struct i40e_ring *tx_ring,
}
 }
 
+void i40e_xsk_clean_rx_ring(struct i40e_ring *rx_ring);
 void i40e_xsk_clean_tx_ring(struct i40e_ring *tx_ring);
 
 #endif /* I40E_TXRX_COMMON_ */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index d5a9f5b7cfa9..386703883713 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -140,6 +140,7 @@ static void i40e_xsk_umem_dma_unmap(struct i40e_vsi *vsi, struct xdp_umem *umem)
 static int i40e_xsk_umem_enable(struct i40e_vsi *vsi, struct xdp_umem *umem,
u16 qid)
 {
+   struct xdp_umem_fq_reuse *reuseq;
bool if_running;
int err;
 
@@ -156,6 +157,12 @@ static int i40e_xsk_umem_enable(struct i40e_vsi *vsi, struct xdp_umem *umem,
return -EBUSY;
}
 
+   reuseq = xsk_reuseq_prepare(vsi->rx_rings[0]->count);
+   if (!reuseq)
+   return -ENOMEM;
+
+   xsk_reuseq_free(xsk_reuseq_swap(umem, reuseq));
+
err = i40e_xsk_umem_dma_map(vsi, umem);
if (err)
return err;
@@ -353,16 +360,46 @@ static bool i40e_alloc_buffer_zc(struct i40e_ring *rx_ring,
 }
 
 /**
- * i40e_alloc_rx_buffers_zc - Allocates a number of Rx buffers
+ * i40e_alloc_buffer_slow_zc - Allocates an i40e_rx_buffer
  * @rx_ring: Rx ring
- * @count: The number of buffers to allocate
+ * @bi: Rx buffer to populate
  *
- * This function allocates a number of Rx buffers and places them on
- * the Rx ring.
+ * This function allocates an Rx buffer. The buffer can come from fill
+ * queue, or via the reuse queue.
  *
  * Returns true for a successful allocation, false otherwise
  **/
-bool i40e_alloc_rx_buffers_zc(struct i40e_ring *rx_ring, u16 count)
+static bool i40e_alloc_buffer_slow_zc(struct i40e_ring *rx_ring,
+ struct i40e_rx_buffer *bi)
+{
+   struct xdp_umem *umem = rx_ring->xsk_umem;
+   u64 handle, hr;
+
+   if (!xsk_umem_peek_addr_rq(umem, &handle)) {
+   rx_ring->rx_stats.alloc_page_failed++;
+   return false;
+   }
+
+   handle &= rx_ring->xsk_umem->chunk_mask;
+
+   hr = umem->headroom + XDP_PACKET_HEADROOM;
+
+   bi->dma = xdp_umem_get_dma(umem, handle);
+   bi->dma += hr;
+
+   bi->addr = xdp_umem_get_data(umem, handle);
+   bi->addr += hr;
+
+   bi->handle = handle + umem->headroom;
+
+   xsk_umem_discard_addr_rq(umem);
+   return true;
+}
+
+static __always_inline bool
+__i40e_alloc_rx_buffers_zc(struct i40e_ring *rx_ring, u16 count,
+  bool alloc(struct i40e_ring *rx_ring,
+ struct i40e_rx_buffer *bi))
 {
u16 ntu = rx_ring->next_to_use;
union i40e_rx_desc *rx_desc;
@@ -372,7 +409,7 @@ bool i40e_alloc_rx_buffers_zc(struct i40e_ring *rx_ring, u16 count)
rx_desc = I40E_RX_DESC(rx_ring, ntu);
bi = &rx_ring->rx_bi[ntu];
do {
-   if (!i40e_alloc_buffer_zc(rx_ring, bi)) {
+   if (!alloc(rx_ring, bi)) {
ok =

[net-next 00/10][pull request] 40GbE Intel Wired LAN Driver Updates 2018-09-25

2018-09-25 Thread Jeff Kirsher

This series contains updates to i40e and xsk.

Mariusz fixes an issue where the VF link state was not being updated
properly when the PF is down or up.  Also cleaned up the promiscuous
configuration during a VF reset.

Patryk simplifies the code a bit to use the variables for PF and HW that
are declared, rather than using the VSI pointers.  Cleaned up the
message length parameter to several virtchnl functions, since it was not
being used (or needed).

Harshitha fixes two potential race conditions when trying to change VF
settings by creating a helper function to validate that the VF is
enabled and that the VSI is set up.

Sergey corrects a double "link down" message by putting in a check for
whether or not the link is up or going down.

Björn addresses an AF_XDP zero-copy issue where buffers passed
from userspace to the kernel were leaked when the hardware descriptor
ring was torn down.  A zero-copy capable driver picks buffers off the
fill ring and places them on the hardware receive ring to be completed at
a later point when DMA is complete.  The transmit side is similar: the
driver picks buffers off the transmit ring and places them on the
transmit hardware ring.

In the typical flow, the receive buffer will be placed onto a receive
ring (completed to the user), and the transmit buffer will be placed on
the completion ring to notify the user that the transfer is done.

However, if the driver needs to tear down the hardware rings for some
reason (interface goes down, reconfiguration and such), the userspace
buffers cannot be leaked. They have to be reused or completed back to
userspace.

The following are changes since commit bd6207202db8974ca3d3183ca0d5611d45b2973c:
  net: macb: Clean 64b dma addresses if they are not detected
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 40GbE

Björn Töpel (3):
  i40e: clean zero-copy XDP Tx ring on shutdown/reset
  i40e: clean zero-copy XDP Rx ring on shutdown/reset
  i40e: disallow changing the number of descriptors when AF_XDP is on

Harshitha Ramamurthy (1):
  i40e: add a helper function to validate a VF based on the vf id

Jakub Kicinski (1):
  net: xsk: add a simple buffer reuse queue

Mariusz Stachura (2):
  i40e: Fix VF's link state notification
  i40e: Unset promiscuous settings on VF reset

Patryk Małek (2):
  i40e: use declared variables for pf and hw
  i40e: Remove unused msglen parameter from virtchnl functions

Sergey Nemov (1):
  i40e: fix double 'NIC Link is Down' messages

 .../net/ethernet/intel/i40e/i40e_ethtool.c|   8 +
 drivers/net/ethernet/intel/i40e/i40e_main.c   |  19 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |  21 +-
 .../ethernet/intel/i40e/i40e_txrx_common.h|   4 +
 .../ethernet/intel/i40e/i40e_virtchnl_pf.c| 425 ++
 drivers/net/ethernet/intel/i40e/i40e_xsk.c| 151 ++-
 include/net/xdp_sock.h|  69 +++
 net/xdp/xdp_umem.c|   2 +
 net/xdp/xsk_queue.c   |  55 +++
 net/xdp/xsk_queue.h   |   3 +
 10 files changed, 541 insertions(+), 216 deletions(-)

-- 
2.17.1

[net-next 04/10] i40e: add a helper function to validate a VF based on the vf id

2018-09-25 Thread Jeff Kirsher

From: Harshitha Ramamurthy 

When we are trying to change VF settings, two race conditions are
possible. First, the VF is created but not yet enabled. Second, the
VF is enabled but the VSI is still not created or not yet re-created
in the VF reset flow.

This patch introduces a helper function to validate that the VF is
enabled and that the VSI is set up. This patch also calls this
function from other functions which could get into these race conditions.
While we are poking around here, remove unnecessary parentheses that
checkpatch was complaining about.

Signed-off-by: Harshitha Ramamurthy 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 .../ethernet/intel/i40e/i40e_virtchnl_pf.c| 62 ---
 1 file changed, 41 insertions(+), 21 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
index cd0f0bcfeddc..5d5ffde1e93b 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_virtchnl_pf.c
@@ -3842,6 +3842,35 @@ int i40e_vc_process_vflr_event(struct i40e_pf *pf)
return 0;
 }
 
+/**
+ * i40e_validate_vf
+ * @pf: the physical function
+ * @vf_id: VF identifier
+ *
+ * Check that the VF is enabled and the VSI exists.
+ *
+ * Returns 0 on success, negative on failure
+ **/
+static int i40e_validate_vf(struct i40e_pf *pf, int vf_id)
+{
+   struct i40e_vsi *vsi;
+   struct i40e_vf *vf;
+   int ret = 0;
+
+   if (vf_id >= pf->num_alloc_vfs) {
+   dev_err(&pf->pdev->dev,
+   "Invalid VF Identifier %d\n", vf_id);
+   ret = -EINVAL;
+   goto err_out;
+   }
+   vf = &pf->vf[vf_id];
+   vsi = i40e_find_vsi_from_id(pf, vf->lan_vsi_id);
+   if (!vsi)
+   ret = -EINVAL;
+err_out:
+   return ret;
+}
+
 /**
  * i40e_ndo_set_vf_mac
  * @netdev: network interface device structure
@@ -3863,14 +3892,11 @@ int i40e_ndo_set_vf_mac(struct net_device *netdev, int vf_id, u8 *mac)
u8 i;
 
/* validate the request */
-   if (vf_id >= pf->num_alloc_vfs) {
-   dev_err(&pf->pdev->dev,
-   "Invalid VF Identifier %d\n", vf_id);
-   ret = -EINVAL;
+   ret = i40e_validate_vf(pf, vf_id);
+   if (ret)
goto error_param;
-   }
 
-   vf = &(pf->vf[vf_id]);
+   vf = &pf->vf[vf_id];
vsi = pf->vsi[vf->lan_vsi_idx];
 
/* When the VF is resetting wait until it is done.
@@ -3989,11 +4015,9 @@ int i40e_ndo_set_vf_port_vlan(struct net_device *netdev, int vf_id,
int ret = 0;
 
/* validate the request */
-   if (vf_id >= pf->num_alloc_vfs) {
-   dev_err(&pf->pdev->dev, "Invalid VF Identifier %d\n", vf_id);
-   ret = -EINVAL;
+   ret = i40e_validate_vf(pf, vf_id);
+   if (ret)
goto error_pvid;
-   }
 
if ((vlan_id > I40E_MAX_VLANID) || (qos > 7)) {
dev_err(&pf->pdev->dev, "Invalid VF Parameters\n");
@@ -4007,7 +4031,7 @@ int i40e_ndo_set_vf_port_vlan(struct net_device *netdev, int vf_id,
goto error_pvid;
}
 
-   vf = &(pf->vf[vf_id]);
+   vf = &pf->vf[vf_id];
vsi = pf->vsi[vf->lan_vsi_idx];
if (!test_bit(I40E_VF_STATE_INIT, &vf->vf_states)) {
dev_err(&pf->pdev->dev, "VF %d still in reset. Try again.\n",
@@ -4127,11 +4151,9 @@ int i40e_ndo_set_vf_bw(struct net_device *netdev, int vf_id, int min_tx_rate,
int ret = 0;
 
/* validate the request */
-   if (vf_id >= pf->num_alloc_vfs) {
-   dev_err(&pf->pdev->dev, "Invalid VF Identifier %d.\n", vf_id);
-   ret = -EINVAL;
+   ret = i40e_validate_vf(pf, vf_id);
+   if (ret)
goto error;
-   }
 
if (min_tx_rate) {
dev_err(&pf->pdev->dev, "Invalid min tx rate (%d) (greater than 0) specified for VF %d.\n",
@@ -4139,7 +4161,7 @@ int i40e_ndo_set_vf_bw(struct net_device *netdev, int vf_id, int min_tx_rate,
return -EINVAL;
}
 
-   vf = &(pf->vf[vf_id]);
+   vf = &pf->vf[vf_id];
vsi = pf->vsi[vf->lan_vsi_idx];
if (!test_bit(I40E_VF_STATE_INIT, &vf->vf_states)) {
dev_err(&pf->pdev->dev, "VF %d still in reset. Try again.\n",
@@ -4175,13 +4197,11 @@ int i40e_ndo_get_vf_config(struct net_device *netdev,
int ret = 0;
 
/* validate the request */
-   if (vf_id >= pf->num_alloc_vfs) {
-   dev_err(&pf->pdev->dev, "Invalid VF Identifier %d\n", vf_id);
-   ret = -EINVAL;
+   ret = i40e_validate_vf(pf, vf_id);
+   if (ret)
goto error_param;
-   }
 
-   vf = &(pf->vf[vf_id]);
+   vf = &pf->vf[vf_id];
/* first vsi is always the LAN vsi */
vsi = pf->vsi[vf->lan_vsi_idx];
if (!test_bit(I40E_VF_STATE_INIT, &vf->vf_states)) {
-- 
2.17.1

[net-next 07/10] i40e: clean zero-copy XDP Tx ring on shutdown/reset

2018-09-25 Thread Jeff Kirsher

From: Björn Töpel 

When the zero-copy enabled XDP Tx ring is torn down, due to
configuration changes, outstanding frames on the hardware descriptor
ring are queued on the completion ring.

The completion ring has a back-pressure mechanism that will guarantee
that there is sufficient space on the ring.
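
The teardown walk described here can be sketched in userspace C. The
model_* types are hypothetical and heavily simplified: releasing a redirected
XDP frame is modeled by clearing a flag, and the returned count stands in for
the number of AF_XDP frames that would be reported to the completion ring in
one batch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical Tx buffer: has_xdpf marks a frame that did not come
 * from the AF_XDP socket and must be released individually. */
struct model_tx_buf {
	bool has_xdpf;
};

struct model_tx_ring {
	struct model_tx_buf *bufs;
	uint16_t count;             /* number of descriptors in the ring */
	uint16_t next_to_clean;
	uint16_t next_to_use;
};

/*
 * Walk [next_to_clean, next_to_use) with wraparound. Buffers that carry
 * an xdpf are released one by one; all others are AF_XDP frames and are
 * only counted. Returns the number of AF_XDP frames seen.
 */
static uint32_t model_clean_tx_ring(struct model_tx_ring *ring)
{
	uint16_t ntc = ring->next_to_clean;
	uint32_t xsk_frames = 0;

	assert(ring->count > 0);

	while (ntc != ring->next_to_use) {
		struct model_tx_buf *buf = &ring->bufs[ntc];

		if (buf->has_xdpf)
			buf->has_xdpf = false;   /* stands in for a free */
		else
			xsk_frames++;

		if (++ntc >= ring->count)
			ntc = 0;
	}
	return xsk_frames;
}
```

Counting first and completing once at the end keeps the completion-ring
update to a single call, which is only safe because of the back-pressure
guarantee stated above.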

Signed-off-by: Björn Töpel 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 17 +++
 .../ethernet/intel/i40e/i40e_txrx_common.h|  2 ++
 drivers/net/ethernet/intel/i40e/i40e_xsk.c| 30 +++
 3 files changed, 43 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 37bd4e50ccde..7f85d4ba8b54 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -636,13 +636,18 @@ void i40e_clean_tx_ring(struct i40e_ring *tx_ring)
unsigned long bi_size;
u16 i;
 
-   /* ring already cleared, nothing to do */
-   if (!tx_ring->tx_bi)
-   return;
+   if (ring_is_xdp(tx_ring) && tx_ring->xsk_umem) {
+   i40e_xsk_clean_tx_ring(tx_ring);
+   } else {
+   /* ring already cleared, nothing to do */
+   if (!tx_ring->tx_bi)
+   return;
 
-   /* Free all the Tx ring sk_buffs */
-   for (i = 0; i < tx_ring->count; i++)
-   i40e_unmap_and_free_tx_resource(tx_ring, &tx_ring->tx_bi[i]);
+   /* Free all the Tx ring sk_buffs */
+   for (i = 0; i < tx_ring->count; i++)
+   i40e_unmap_and_free_tx_resource(tx_ring,
+   &tx_ring->tx_bi[i]);
+   }
 
bi_size = sizeof(struct i40e_tx_buffer) * tx_ring->count;
memset(tx_ring->tx_bi, 0, bi_size);
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h b/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
index b5afd479a9c5..29c68b29d36f 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
@@ -87,4 +87,6 @@ static inline void i40e_arm_wb(struct i40e_ring *tx_ring,
}
 }
 
+void i40e_xsk_clean_tx_ring(struct i40e_ring *tx_ring);
+
 #endif /* I40E_TXRX_COMMON_ */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index 2ebfc78bbd09..d5a9f5b7cfa9 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -830,3 +830,33 @@ int i40e_xsk_async_xmit(struct net_device *dev, u32 queue_id)
 
return 0;
 }
+
+/**
+ * i40e_xsk_clean_tx_ring - Clean the XDP Tx ring on shutdown
+ * @tx_ring: XDP Tx ring
+ **/
+void i40e_xsk_clean_tx_ring(struct i40e_ring *tx_ring)
+{
+   u16 ntc = tx_ring->next_to_clean, ntu = tx_ring->next_to_use;
+   struct xdp_umem *umem = tx_ring->xsk_umem;
+   struct i40e_tx_buffer *tx_bi;
+   u32 xsk_frames = 0;
+
+   while (ntc != ntu) {
+   tx_bi = &tx_ring->tx_bi[ntc];
+
+   if (tx_bi->xdpf)
+   i40e_clean_xdp_tx_buffer(tx_ring, tx_bi);
+   else
+   xsk_frames++;
+
+   tx_bi->xdpf = NULL;
+
+   ntc++;
+   if (ntc >= tx_ring->count)
+   ntc = 0;
+   }
+
+   if (xsk_frames)
+   xsk_umem_complete_tx(umem, xsk_frames);
+}
-- 
2.17.1

[net-next 08/10] net: xsk: add a simple buffer reuse queue

2018-09-25 Thread Jeff Kirsher

From: Jakub Kicinski 

XSK UMEM is strongly single producer single consumer so reuse of
frames is challenging.  Add a simple "stash" of FILL packets to
reuse for drivers to optionally make use of.  This is useful
when driver has to free (ndo_stop) or resize a ring with an active
AF_XDP ZC socket.
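
The stash idea can be pictured with a small userspace sketch. This is only
an illustration under assumptions, not the kernel code: the names
reuse_stash, stash_alloc, stash_push and stash_pop_or are hypothetical, and
the fall-back to the FILL queue is modelled by a caller-supplied value.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for a reuse queue: a LIFO stash. */
struct reuse_stash {
	uint32_t nentries;  /* capacity of the stash */
	uint32_t length;    /* number of handles currently stashed */
	uint64_t handles[]; /* flexible array of frame handles */
};

/* Allocate an empty stash with room for nentries handles. */
static struct reuse_stash *stash_alloc(uint32_t nentries)
{
	struct reuse_stash *rq;

	rq = calloc(1, sizeof(*rq) + (size_t)nentries * sizeof(rq->handles[0]));
	if (rq)
		rq->nentries = nentries;
	return rq;
}

/* Stash a handle so that it can be handed out again later. */
static void stash_push(struct reuse_stash *rq, uint64_t handle)
{
	assert(rq->length < rq->nentries);
	rq->handles[rq->length++] = handle;
}

/*
 * Prefer a stashed handle. When the stash is empty, return fresh_handle,
 * which models taking the next entry from the regular FILL queue.
 */
static uint64_t stash_pop_or(struct reuse_stash *rq, uint64_t fresh_handle)
{
	if (!rq->length)
		return fresh_handle;
	return rq->handles[--rq->length];
}
```

The point of the LIFO layout is that only the driver touches the stash, so
the single-producer/single-consumer FILL ring itself is never written to
from the wrong side.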

Signed-off-by: Jakub Kicinski 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 include/net/xdp_sock.h | 69 ++
 net/xdp/xdp_umem.c |  2 ++
 net/xdp/xsk_queue.c| 55 +
 net/xdp/xsk_queue.h|  3 ++
 4 files changed, 129 insertions(+)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 932ca0dad6f3..70a115bea4f4 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -21,6 +21,12 @@ struct xdp_umem_page {
dma_addr_t dma;
 };
 
+struct xdp_umem_fq_reuse {
+   u32 nentries;
+   u32 length;
+   u64 handles[];
+};
+
 struct xdp_umem {
struct xsk_queue *fq;
struct xsk_queue *cq;
@@ -37,6 +43,7 @@ struct xdp_umem {
struct page **pgs;
u32 npgs;
struct net_device *dev;
+   struct xdp_umem_fq_reuse *fq_reuse;
u16 queue_id;
bool zc;
spinlock_t xsk_list_lock;
@@ -75,6 +82,10 @@ void xsk_umem_discard_addr(struct xdp_umem *umem);
 void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries);
 bool xsk_umem_consume_tx(struct xdp_umem *umem, dma_addr_t *dma, u32 *len);
 void xsk_umem_consume_tx_done(struct xdp_umem *umem);
+struct xdp_umem_fq_reuse *xsk_reuseq_prepare(u32 nentries);
+struct xdp_umem_fq_reuse *xsk_reuseq_swap(struct xdp_umem *umem,
+ struct xdp_umem_fq_reuse *newq);
+void xsk_reuseq_free(struct xdp_umem_fq_reuse *rq);
 
 static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
 {
@@ -85,6 +96,35 @@ static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem *umem, u64 addr)
 {
return umem->pages[addr >> PAGE_SHIFT].dma + (addr & (PAGE_SIZE - 1));
 }
+
+/* Reuse-queue aware version of FILL queue helpers */
+static inline u64 *xsk_umem_peek_addr_rq(struct xdp_umem *umem, u64 *addr)
+{
+   struct xdp_umem_fq_reuse *rq = umem->fq_reuse;
+
+   if (!rq->length)
+   return xsk_umem_peek_addr(umem, addr);
+
+   *addr = rq->handles[rq->length - 1];
+   return addr;
+}
+
+static inline void xsk_umem_discard_addr_rq(struct xdp_umem *umem)
+{
+   struct xdp_umem_fq_reuse *rq = umem->fq_reuse;
+
+   if (!rq->length)
+   xsk_umem_discard_addr(umem);
+   else
+   rq->length--;
+}
+
+static inline void xsk_umem_fq_reuse(struct xdp_umem *umem, u64 addr)
+{
+   struct xdp_umem_fq_reuse *rq = umem->fq_reuse;
+
+   rq->handles[rq->length++] = addr;
+}
 #else
 static inline int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
@@ -128,6 +168,21 @@ static inline void xsk_umem_consume_tx_done(struct xdp_umem *umem)
 {
 }
 
+static inline struct xdp_umem_fq_reuse *xsk_reuseq_prepare(u32 nentries)
+{
+   return NULL;
+}
+
+static inline struct xdp_umem_fq_reuse *xsk_reuseq_swap(
+   struct xdp_umem *umem,
+   struct xdp_umem_fq_reuse *newq)
+{
+   return NULL;
+}
+static inline void xsk_reuseq_free(struct xdp_umem_fq_reuse *rq)
+{
+}
+
 static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
 {
return NULL;
@@ -137,6 +192,20 @@ static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem *umem, u64 addr)
 {
return 0;
 }
+
+static inline u64 *xsk_umem_peek_addr_rq(struct xdp_umem *umem, u64 *addr)
+{
+   return NULL;
+}
+
+static inline void xsk_umem_discard_addr_rq(struct xdp_umem *umem)
+{
+}
+
+static inline void xsk_umem_fq_reuse(struct xdp_umem *umem, u64 addr)
+{
+}
+
 #endif /* CONFIG_XDP_SOCKETS */
 
 #endif /* _LINUX_XDP_SOCK_H */
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index b3b632c5aeae..555427b3e0fe 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -165,6 +165,8 @@ static void xdp_umem_release(struct xdp_umem *umem)
umem->cq = NULL;
}
 
+   xsk_reuseq_destroy(umem);
+
xdp_umem_unpin_pages(umem);
 
task = get_pid_task(umem->pid, PIDTYPE_PID);
diff --git a/net/xdp/xsk_queue.c b/net/xdp/xsk_queue.c
index 2dc1384d9f27..b66504592d9b 100644
--- a/net/xdp/xsk_queue.c
+++ b/net/xdp/xsk_queue.c
@@ -3,7 +3,9 @@
  * Copyright(c) 2018 Intel Corporation.
  */
 
+#include <linux/log2.h>
 #include <linux/slab.h>
+#include <linux/overflow.h>
 
 #include "xsk_queue.h"
 
@@ -62,3 +64,56 @@ void xskq_destroy(struct xsk_queue *q)
page_frag_free(q->ring);
kfree(q);
 }
+
+struct xdp_umem_fq_reuse *xsk_reuseq_prepare(u32 nentries)
+{
+   struct xdp_umem_fq_reuse *newq;
+
+   /* Check for overflow */
+   if (nentries > (u32)roundup_pow_of_two(nentries))
+   return NULL;
+   nentries = roundup_pow_of_two(nentries);
+
+   newq = kvmalloc(struct_size(newq, handles, nentries),

[net-next 05/10] i40e: fix double 'NIC Link is Down' messages

2018-09-25 Thread Jeff Kirsher

From: Sergey Nemov 

When isup is false, meaning that the interface is about to shut down,
set the new speed to 0 to avoid duplicate 'NIC Link is Down' messages.
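
A minimal userspace sketch of the suppression logic follows. It assumes
the duplicate message comes from comparing against a stale hardware
speed; link_state, print_link_message and LINK_SPEED_UNKNOWN are
hypothetical stand-ins, not driver symbols.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define LINK_SPEED_UNKNOWN 0 /* hypothetical stand-in for the "unknown" speed */

struct link_state {
	bool current_isup;
	int current_speed;
};

/* Returns true if a message was printed, false if it was suppressed. */
static bool print_link_message(struct link_state *st, bool isup, int hw_speed)
{
	/* When going down, ignore whatever speed the hardware still reports. */
	int new_speed = isup ? hw_speed : LINK_SPEED_UNKNOWN;

	assert(st != NULL);

	if (st->current_isup == isup && st->current_speed == new_speed)
		return false;

	st->current_isup = isup;
	st->current_speed = new_speed;
	printf("NIC Link is %s\n", isup ? "Up" : "Down");
	return true;
}
```

With the speed forced to a fixed value on the down path, a second "down"
call compares equal to the cached state even if the reported hardware
speed changed in between.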

Signed-off-by: Sergey Nemov 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 102219cea67f..330bafe6a689 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -6431,7 +6431,10 @@ void i40e_print_link_message(struct i40e_vsi *vsi, bool isup)
char *req_fec = "";
char *an = "";
 
-   new_speed = pf->hw.phy.link_info.link_speed;
+   if (isup)
+   new_speed = pf->hw.phy.link_info.link_speed;
+   else
+   new_speed = I40E_LINK_SPEED_UNKNOWN;
 
if ((vsi->current_isup == isup) && (vsi->current_speed == new_speed))
return;
-- 
2.17.1

[net-next 10/10] i40e: disallow changing the number of descriptors when AF_XDP is on

2018-09-25 Thread Jeff Kirsher

From: Björn Töpel 

When an AF_XDP UMEM is attached to any of the Rx rings, disallow
changing the number of descriptors via e.g. "ethtool -G IFNAME".

Otherwise, the size of the stash/reuse queue can grow unbounded, which
would result in OOM or leaking userspace buffers.
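
The guard can be sketched in plain C as below. This is a simplified
model, assuming a per-queue array of UMEM pointers; vsi_model,
any_rx_ring_has_umem and set_ring_count are hypothetical names used only
for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified model of a VSI: one optional UMEM pointer per queue pair. */
struct vsi_model {
	int num_queue_pairs;
	void **umems; /* may be NULL; umems[i] != NULL means queue i has a UMEM */
};

/* True if at least one queue has a UMEM attached. */
static bool any_rx_ring_has_umem(const struct vsi_model *vsi)
{
	if (!vsi->umems)
		return false;

	for (int i = 0; i < vsi->num_queue_pairs; i++) {
		if (vsi->umems[i])
			return true;
	}
	return false;
}

/* Refuse to resize while any UMEM is attached; otherwise apply new_count. */
static int set_ring_count(struct vsi_model *vsi, int *count, int new_count)
{
	assert(new_count > 0);

	if (any_rx_ring_has_umem(vsi))
		return -EBUSY;

	*count = new_count;
	return 0;
}
```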

Signed-off-by: Björn Töpel 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 .../net/ethernet/intel/i40e/i40e_ethtool.c|  8 +++
 .../ethernet/intel/i40e/i40e_txrx_common.h|  1 +
 drivers/net/ethernet/intel/i40e/i40e_xsk.c| 21 +++
 3 files changed, 30 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index 87fe2e60602f..9f8464f80783 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -5,6 +5,7 @@
 
 #include "i40e.h"
 #include "i40e_diag.h"
+#include "i40e_txrx_common.h"
 
 /* ethtool statistics helpers */
 
@@ -1710,6 +1711,13 @@ static int i40e_set_ringparam(struct net_device *netdev,
(new_rx_count == vsi->rx_rings[0]->count))
return 0;
 
+   /* If there is an AF_XDP UMEM attached to any of the Rx rings,
+* disallow changing the number of descriptors -- regardless
+* of whether the netdev is running or not.
+*/
+   if (i40e_xsk_any_rx_ring_enabled(vsi))
+   return -EBUSY;
+
while (test_and_set_bit(__I40E_CONFIG_BUSY, pf->state)) {
timeout--;
if (!timeout)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h b/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
index 8d46acff6f2e..09809dffe399 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
@@ -89,5 +89,6 @@ static inline void i40e_arm_wb(struct i40e_ring *tx_ring,
 
 void i40e_xsk_clean_rx_ring(struct i40e_ring *rx_ring);
 void i40e_xsk_clean_tx_ring(struct i40e_ring *tx_ring);
+bool i40e_xsk_any_rx_ring_enabled(struct i40e_vsi *vsi);
 
 #endif /* I40E_TXRX_COMMON_ */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index 386703883713..add1e457886d 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -944,3 +944,24 @@ void i40e_xsk_clean_tx_ring(struct i40e_ring *tx_ring)
if (xsk_frames)
xsk_umem_complete_tx(umem, xsk_frames);
 }
+
+/**
+ * i40e_xsk_any_rx_ring_enabled - Checks if Rx rings have AF_XDP UMEM attached
+ * @vsi: vsi
+ *
+ * Returns true if any of the Rx rings has an AF_XDP UMEM attached
+ **/
+bool i40e_xsk_any_rx_ring_enabled(struct i40e_vsi *vsi)
+{
+   int i;
+
+   if (!vsi->xsk_umems)
+   return false;
+
+   for (i = 0; i < vsi->num_queue_pairs; i++) {
+   if (vsi->xsk_umems[i])
+   return true;
+   }
+
+   return false;
+}
-- 
2.17.1

[PATCH] net: dsa: lantiq_gswip: Depend on HAS_IOMEM

2018-09-25 Thread Hauke Mehrtens

The driver uses devm_ioremap_resource(), which is only available when
CONFIG_HAS_IOMEM is set, so make the driver depend on this config option.
User mode Linux does not have CONFIG_HAS_IOMEM set, and the driver was
failing on this architecture.

Fixes: 14fceff4771e ("net: dsa: Add Lantiq / Intel DSA driver for vrx200")
Reported-by: kbuild test robot 
Signed-off-by: Hauke Mehrtens 
---
 drivers/net/dsa/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/dsa/Kconfig b/drivers/net/dsa/Kconfig
index 7c09d8f195ae..71bb3aebded4 100644
--- a/drivers/net/dsa/Kconfig
+++ b/drivers/net/dsa/Kconfig
@@ -25,7 +25,7 @@ config NET_DSA_LOOP
 
 config NET_DSA_LANTIQ_GSWIP
tristate "Lantiq / Intel GSWIP"
-   depends on NET_DSA
+   depends on HAS_IOMEM && NET_DSA
select NET_DSA_TAG_GSWIP
---help---
  This enables support for the Lantiq / Intel GSWIP 2.1 found in
-- 
2.11.0

[PATCH RFC,net-next 01/10] flow_dissector: add flow_rule and flow_match structures and use them

2018-09-25 Thread Pablo Neira Ayuso

This patch wraps the dissector key and mask - that flower uses to
represent the matching side - in the flow_match structure.

To avoid a follow up patch that would edit the same LoCs in the drivers,
this patch also wraps this new flow match structure around the flow rule
object. This new structure will also contain the flow actions in follow
up patches.

This introduces two new interfaces:

bool flow_rule_match_key(rule, dissector_id)

that returns true if the given matching key is set, and:

flow_rule_match_XYZ(rule, &match);

which fetches the matching side XYZ into the match container structure,
so that the key and the mask are retrieved with a single call.
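
As a rough userspace model of these two interfaces (not the kernel
definitions), the sketch below keeps a bitmask of used keys plus per-key
offsets into a shared key/mask layout. All names here (rule_model,
rule_match_key, rule_match_basic, key_id and so on) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum key_id { KEY_BASIC, KEY_PORTS, KEY_MAX };

struct key_basic { uint16_t n_proto; uint8_t ip_proto; };
struct key_ports { uint16_t src, dst; };

/* One container layout, used for both the key and the mask. */
struct keys {
	struct key_basic basic;
	struct key_ports ports;
};

struct rule_model {
	unsigned int used_keys; /* bitmask indexed by enum key_id */
	size_t offset[KEY_MAX]; /* where each key lives inside struct keys */
	const void *key;
	const void *mask;
};

/* Match container: key and mask of one dissector key, fetched together. */
struct match_basic {
	const struct key_basic *key;
	const struct key_basic *mask;
};

/* True if the rule carries the given matching key. */
static bool rule_match_key(const struct rule_model *r, enum key_id id)
{
	return (r->used_keys & (1u << id)) != 0;
}

/* Fill the container with pointers to the basic key and its mask. */
static void rule_match_basic(const struct rule_model *r, struct match_basic *out)
{
	size_t off = r->offset[KEY_BASIC];

	assert(rule_match_key(r, KEY_BASIC));
	out->key = (const struct key_basic *)((const char *)r->key + off);
	out->mask = (const struct key_basic *)((const char *)r->mask + off);
}
```

A driver-style caller would first test rule_match_key() and then read
match.key and match.mask, which mirrors the pattern in the converted
drivers.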

Signed-off-by: Pablo Neira Ayuso 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c   | 174 -
 .../net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c   | 194 --
 drivers/net/ethernet/intel/i40e/i40e_main.c| 178 -
 drivers/net/ethernet/intel/i40evf/i40evf_main.c| 194 --
 drivers/net/ethernet/intel/igb/igb_main.c  |  64 ++--
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c| 418 +
 .../net/ethernet/mellanox/mlxsw/spectrum_flower.c  | 203 +-
 drivers/net/ethernet/netronome/nfp/flower/action.c |  11 +-
 drivers/net/ethernet/netronome/nfp/flower/match.c  | 416 ++--
 .../net/ethernet/netronome/nfp/flower/offload.c| 146 +++
 drivers/net/ethernet/qlogic/qede/qede_filter.c |  85 ++---
 include/net/flow_dissector.h   | 107 ++
 include/net/pkt_cls.h  |   4 +-
 net/core/flow_dissector.c  | 133 +++
 net/sched/cls_flower.c |  18 +-
 15 files changed, 1144 insertions(+), 1201 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
index f4ba9b3f8819..62652ffc8221 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
@@ -169,18 +169,12 @@ static int bnxt_tc_parse_actions(struct bnxt *bp,
return 0;
 }
 
-#define GET_KEY(flow_cmd, key_type)\
-   skb_flow_dissector_target((flow_cmd)->dissector, key_type,\
- (flow_cmd)->key)
-#define GET_MASK(flow_cmd, key_type)   \
-   skb_flow_dissector_target((flow_cmd)->dissector, key_type,\
- (flow_cmd)->mask)
-
 static int bnxt_tc_parse_flow(struct bnxt *bp,
  struct tc_cls_flower_offload *tc_flow_cmd,
  struct bnxt_tc_flow *flow)
 {
-   struct flow_dissector *dissector = tc_flow_cmd->dissector;
+   struct flow_rule *rule = &tc_flow_cmd->rule;
+   struct flow_dissector *dissector = rule->match.dissector;
 
/* KEY_CONTROL and KEY_BASIC are needed for forming a meaningful key */
if ((dissector->used_keys & BIT(FLOW_DISSECTOR_KEY_CONTROL)) == 0 ||
@@ -190,140 +184,120 @@ static int bnxt_tc_parse_flow(struct bnxt *bp,
return -EOPNOTSUPP;
}
 
-   if (dissector_uses_key(dissector, FLOW_DISSECTOR_KEY_BASIC)) {
-   struct flow_dissector_key_basic *key =
-   GET_KEY(tc_flow_cmd, FLOW_DISSECTOR_KEY_BASIC);
-   struct flow_dissector_key_basic *mask =
-   GET_MASK(tc_flow_cmd, FLOW_DISSECTOR_KEY_BASIC);
+   if (flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_BASIC)) {
+   struct flow_match_basic match;
 
-   flow->l2_key.ether_type = key->n_proto;
-   flow->l2_mask.ether_type = mask->n_proto;
+   flow_rule_match_basic(rule, &match);
+   flow->l2_key.ether_type = match.key->n_proto;
+   flow->l2_mask.ether_type = match.mask->n_proto;
 
-   if (key->n_proto == htons(ETH_P_IP) ||
-   key->n_proto == htons(ETH_P_IPV6)) {
-   flow->l4_key.ip_proto = key->ip_proto;
-   flow->l4_mask.ip_proto = mask->ip_proto;
+   if (match.key->n_proto == htons(ETH_P_IP) ||
+   match.key->n_proto == htons(ETH_P_IPV6)) {
+   flow->l4_key.ip_proto = match.key->ip_proto;
+   flow->l4_mask.ip_proto = match.mask->ip_proto;
}
}
 
-   if (dissector_uses_key(dissector, FLOW_DISSECTOR_KEY_ETH_ADDRS)) {
-   struct flow_dissector_key_eth_addrs *key =
-   GET_KEY(tc_flow_cmd, FLOW_DISSECTOR_KEY_ETH_ADDRS);
-   struct flow_dissector_key_eth_addrs *mask =
-   GET_MASK(tc_flow_cmd, FLOW_DISSECTOR_KEY_ETH_ADDRS);
+   if (flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_ETH_ADDRS)) {
+   struct flow_match_eth_addrs match;
 
+   flow_rule_match_eth_addrs(rule, &match);
flow->flags |=

[PATCH RFC,net-next 06/10] drivers: net: use flow action infrastructure

2018-09-25 Thread Pablo Neira Ayuso

This patch updates drivers to use the new flow action infrastructure.

Signed-off-by: Pablo Neira Ayuso 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c   |  78 +++
 .../net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c   | 251 ++--
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c| 252 ++---
 drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c |   2 +-
 .../net/ethernet/mellanox/mlxsw/spectrum_flower.c  |  55 +++--
 drivers/net/ethernet/netronome/nfp/flower/action.c | 183 +++
 drivers/net/ethernet/qlogic/qede/qede_filter.c |  12 +-
 7 files changed, 412 insertions(+), 421 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
index 3505791777e7..8b494e66f519 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
@@ -61,9 +61,9 @@ static u16 bnxt_flow_get_dst_fid(struct bnxt *pf_bp, struct net_device *dev)
 
 static int bnxt_tc_parse_redir(struct bnxt *bp,
   struct bnxt_tc_actions *actions,
-  const struct tc_action *tc_act)
+  const struct flow_action_key *act)
 {
-   struct net_device *dev = tcf_mirred_dev(tc_act);
+   struct net_device *dev = act->dev;
 
if (!dev) {
netdev_info(bp->dev, "no dev in mirred action");
@@ -75,25 +75,12 @@ static int bnxt_tc_parse_redir(struct bnxt *bp,
return 0;
 }
 
-static void bnxt_tc_parse_vlan(struct bnxt *bp,
-  struct bnxt_tc_actions *actions,
-  const struct tc_action *tc_act)
-{
-   if (tcf_vlan_action(tc_act) == TCA_VLAN_ACT_POP) {
-   actions->flags |= BNXT_TC_ACTION_FLAG_POP_VLAN;
-   } else if (tcf_vlan_action(tc_act) == TCA_VLAN_ACT_PUSH) {
-   actions->flags |= BNXT_TC_ACTION_FLAG_PUSH_VLAN;
-   actions->push_vlan_tci = htons(tcf_vlan_push_vid(tc_act));
-   actions->push_vlan_tpid = tcf_vlan_push_proto(tc_act);
-   }
-}
-
 static int bnxt_tc_parse_tunnel_set(struct bnxt *bp,
struct bnxt_tc_actions *actions,
-   const struct tc_action *tc_act)
+   const struct flow_action_key *act)
 {
-   struct ip_tunnel_info *tun_info = tcf_tunnel_info(tc_act);
-   struct ip_tunnel_key *tun_key = &tun_info->key;
+   const struct ip_tunnel_info *tun_info = act->tunnel;
+   const struct ip_tunnel_key *tun_key = &tun_info->key;
 
if (ip_tunnel_info_af(tun_info) != AF_INET) {
netdev_info(bp->dev, "only IPv4 tunnel-encap is supported");
@@ -107,49 +94,44 @@ static int bnxt_tc_parse_tunnel_set(struct bnxt *bp,
 
 static int bnxt_tc_parse_actions(struct bnxt *bp,
 struct bnxt_tc_actions *actions,
-struct tcf_exts *tc_exts)
+struct flow_action *flow_action)
 {
-   const struct tc_action *tc_act;
+   struct flow_action_key *act;
int i, rc;
 
-   if (!tcf_exts_has_actions(tc_exts)) {
+   if (!flow_action_has_keys(flow_action)) {
netdev_info(bp->dev, "no actions");
return -EINVAL;
}
 
-   tcf_exts_for_each_action(i, tc_act, tc_exts) {
-   /* Drop action */
-   if (is_tcf_gact_shot(tc_act)) {
+   flow_action_for_each(i, act, flow_action) {
+   switch (act->id) {
+   case FLOW_ACTION_KEY_DROP:
actions->flags |= BNXT_TC_ACTION_FLAG_DROP;
return 0; /* don't bother with other actions */
-   }
-
-   /* Redirect action */
-   if (is_tcf_mirred_egress_redirect(tc_act)) {
-   rc = bnxt_tc_parse_redir(bp, actions, tc_act);
+   case FLOW_ACTION_KEY_REDIRECT:
+   rc = bnxt_tc_parse_redir(bp, actions, act);
if (rc)
return rc;
-   continue;
-   }
-
-   /* Push/pop VLAN */
-   if (is_tcf_vlan(tc_act)) {
-   bnxt_tc_parse_vlan(bp, actions, tc_act);
-   continue;
-   }
-
-   /* Tunnel encap */
-   if (is_tcf_tunnel_set(tc_act)) {
-   rc = bnxt_tc_parse_tunnel_set(bp, actions, tc_act);
+   break;
+   case FLOW_ACTION_KEY_VLAN_POP:
+   actions->flags |= BNXT_TC_ACTION_FLAG_POP_VLAN;
+   break;
+   case FLOW_ACTION_KEY_VLAN_PUSH:
+   actions->flags |= BNXT_TC_ACTION_FLAG_PUSH_VLAN;
+   actions->push_vlan_tci = htons(act->vlan.vid);
+   actions->push_vlan_tpid = act->vlan.proto;

[PATCH RFC,net-next 10/10] dsa: bcm_sf2: use flow_rule infrastructure

2018-09-25 Thread Pablo Neira Ayuso

Update this driver to use the flow_rule infrastructure, so that the same
code that populates the hardware IR can be used from both the
ethtool_rx_flow and the cls_flower interfaces.

Signed-off-by: Pablo Neira Ayuso 
---
 drivers/net/dsa/bcm_sf2_cfp.c | 311 ++
 1 file changed, 166 insertions(+), 145 deletions(-)

diff --git a/drivers/net/dsa/bcm_sf2_cfp.c b/drivers/net/dsa/bcm_sf2_cfp.c
index 47c5f272a084..9dace0e25a3a 100644
--- a/drivers/net/dsa/bcm_sf2_cfp.c
+++ b/drivers/net/dsa/bcm_sf2_cfp.c
@@ -251,10 +251,12 @@ static int bcm_sf2_cfp_act_pol_set(struct bcm_sf2_priv *priv,
 }
 
 static void bcm_sf2_cfp_slice_ipv4(struct bcm_sf2_priv *priv,
-  struct ethtool_tcpip4_spec *v4_spec,
+  struct flow_rule *flow_rule,
   unsigned int slice_num,
   bool mask)
 {
+   struct flow_match_ipv4_addrs ipv4;
+   struct flow_match_ports ports;
u32 reg, offset;
 
/* C-Tag[31:24]
@@ -268,41 +270,54 @@ static void bcm_sf2_cfp_slice_ipv4(struct bcm_sf2_priv *priv,
offset = CORE_CFP_DATA_PORT(4);
core_writel(priv, reg, offset);
 
+   flow_rule_match_ipv4_addrs(flow_rule, &ipv4);
+   flow_rule_match_ports(flow_rule, &ports);
+
/* UDF_n_A7 [31:24]
 * UDF_n_A6 [23:8]
 * UDF_n_A5 [7:0]
 */
-   reg = be16_to_cpu(v4_spec->pdst) >> 8;
-   if (mask)
+   if (mask) {
+   reg = be16_to_cpu(ports.mask->dst) >> 8;
offset = CORE_CFP_MASK_PORT(3);
-   else
+   } else {
+   reg = be16_to_cpu(ports.key->dst) >> 8;
offset = CORE_CFP_DATA_PORT(3);
+   }
core_writel(priv, reg, offset);
 
/* UDF_n_A5 [31:24]
 * UDF_n_A4 [23:8]
 * UDF_n_A3 [7:0]
 */
-   reg = (be16_to_cpu(v4_spec->pdst) & 0xff) << 24 |
- (u32)be16_to_cpu(v4_spec->psrc) << 8 |
- (be32_to_cpu(v4_spec->ip4dst) & 0xff00) >> 8;
-   if (mask)
+   if (mask) {
+   reg = (be16_to_cpu(ports.mask->dst) & 0xff) << 24 |
+ (u32)be16_to_cpu(ports.mask->src) << 8 |
+ (be32_to_cpu(ipv4.mask->dst) & 0xff00) >> 8;
offset = CORE_CFP_MASK_PORT(2);
-   else
+   } else {
+   reg = (be16_to_cpu(ports.key->dst) & 0xff) << 24 |
+ (u32)be16_to_cpu(ports.key->src) << 8 |
+ (be32_to_cpu(ipv4.key->dst) & 0xff00) >> 8;
offset = CORE_CFP_DATA_PORT(2);
+   }
core_writel(priv, reg, offset);
 
/* UDF_n_A3 [31:24]
 * UDF_n_A2 [23:8]
 * UDF_n_A1 [7:0]
 */
-   reg = (u32)(be32_to_cpu(v4_spec->ip4dst) & 0xff) << 24 |
- (u32)(be32_to_cpu(v4_spec->ip4dst) >> 16) << 8 |
- (be32_to_cpu(v4_spec->ip4src) & 0xff00) >> 8;
-   if (mask)
+   if (mask) {
+   reg = (u32)(be32_to_cpu(ipv4.mask->dst) & 0xff) << 24 |
+ (u32)(be32_to_cpu(ipv4.mask->dst) >> 16) << 8 |
+ (be32_to_cpu(ipv4.mask->src) & 0xff00) >> 8;
offset = CORE_CFP_MASK_PORT(1);
-   else
+   } else {
+   reg = (u32)(be32_to_cpu(ipv4.key->dst) & 0xff) << 24 |
+ (u32)(be32_to_cpu(ipv4.key->dst) >> 16) << 8 |
+ (be32_to_cpu(ipv4.key->src) & 0xff00) >> 8;
offset = CORE_CFP_DATA_PORT(1);
+   }
core_writel(priv, reg, offset);
 
/* UDF_n_A1 [31:24]
@@ -311,56 +326,34 @@ static void bcm_sf2_cfp_slice_ipv4(struct bcm_sf2_priv *priv,
 * Slice ID [3:2]
 * Slice valid  [1:0]
 */
-   reg = (u32)(be32_to_cpu(v4_spec->ip4src) & 0xff) << 24 |
- (u32)(be32_to_cpu(v4_spec->ip4src) >> 16) << 8 |
- SLICE_NUM(slice_num) | SLICE_VALID;
-   if (mask)
+   if (mask) {
+   reg = (u32)(be32_to_cpu(ipv4.mask->src) & 0xff) << 24 |
+ (u32)(be32_to_cpu(ipv4.mask->src) >> 16) << 8 |
+ SLICE_NUM(slice_num) | SLICE_VALID;
offset = CORE_CFP_MASK_PORT(0);
-   else
+   } else {
+   reg = (u32)(be32_to_cpu(ipv4.key->src) & 0xff) << 24 |
+ (u32)(be32_to_cpu(ipv4.key->src) >> 16) << 8 |
+ SLICE_NUM(slice_num) | SLICE_VALID;
offset = CORE_CFP_DATA_PORT(0);
+   }
core_writel(priv, reg, offset);
 }
 
 static int bcm_sf2_cfp_ipv4_rule_set(struct bcm_sf2_priv *priv, int port,
 unsigned int port_num,
 unsigned int queue_num,
-struct ethtool_rx_flow_spec

[PATCH RFC,net-next 03/10] flow_dissector: add flow action infrastructure

2018-09-25 Thread Pablo Neira Ayuso

This patch adds new infrastructure that defines the actions that can be
performed in existing network drivers. This infrastructure allows us to
avoid a direct dependency on the software TC action infrastructure.
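
To give a feel for how a driver might consume such an action list, here
is a small userspace sketch. It is an assumption-laden model: act_id,
act_key, act_list, parse_actions and the HW_FLAG_* values are all
hypothetical and only loosely follow the structures in this patch.

```c
#include <assert.h>
#include <stdint.h>

enum act_id { ACT_DROP, ACT_VLAN_POP, ACT_VLAN_PUSH, ACT_MARK };

/* One action: an id plus per-action parameters in a union. */
struct act_key {
	enum act_id id;
	union {
		struct { uint16_t vid; } vlan; /* ACT_VLAN_PUSH */
		uint32_t mark;                 /* ACT_MARK */
	};
};

struct act_list {
	int num_keys;
	const struct act_key *keys;
};

#define HW_FLAG_DROP      0x1u
#define HW_FLAG_POP_VLAN  0x2u
#define HW_FLAG_PUSH_VLAN 0x4u

/* Returns 0 and fills *flags, or -1 if an action is not supported. */
static int parse_actions(const struct act_list *acts, unsigned int *flags)
{
	assert(acts && flags);
	*flags = 0;

	for (int i = 0; i < acts->num_keys; i++) {
		switch (acts->keys[i].id) {
		case ACT_DROP:
			*flags |= HW_FLAG_DROP;
			return 0; /* no point in looking at further actions */
		case ACT_VLAN_POP:
			*flags |= HW_FLAG_POP_VLAN;
			break;
		case ACT_VLAN_PUSH:
			*flags |= HW_FLAG_PUSH_VLAN;
			break;
		default:
			return -1; /* this model does not offload the action */
		}
	}
	return 0;
}
```

The driver only sees action ids and plain parameters, so nothing in the
loop depends on TC-specific types.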

Signed-off-by: Pablo Neira Ayuso 
---
 include/net/flow_dissector.h | 70 
 net/core/flow_dissector.c| 18 
 2 files changed, 88 insertions(+)

diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index 965a82b8d881..925c208816f1 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -402,8 +402,78 @@ void flow_rule_match_enc_keyid(const struct flow_rule *rule,
 void flow_rule_match_enc_opts(const struct flow_rule *rule,
  struct flow_match_enc_opts *out);
 
+enum flow_action_key_id {
+   FLOW_ACTION_KEY_ACCEPT  = 0,
+   FLOW_ACTION_KEY_DROP,
+   FLOW_ACTION_KEY_TRAP,
+   FLOW_ACTION_KEY_GOTO,
+   FLOW_ACTION_KEY_REDIRECT,
+   FLOW_ACTION_KEY_MIRRED,
+   FLOW_ACTION_KEY_VLAN_PUSH,
+   FLOW_ACTION_KEY_VLAN_POP,
+   FLOW_ACTION_KEY_VLAN_MANGLE,
+   FLOW_ACTION_KEY_TUNNEL_ENCAP,
+   FLOW_ACTION_KEY_TUNNEL_DECAP,
+   FLOW_ACTION_KEY_MANGLE,
+   FLOW_ACTION_KEY_ADD,
+   FLOW_ACTION_KEY_CSUM,
+   FLOW_ACTION_KEY_MARK,
+};
+
+/* This is mirroring enum pedit_header_type definition for easy mapping between
+ * tc pedit action. Legacy TCA_PEDIT_KEY_EX_HDR_TYPE_NETWORK is mapped to
+ * FLOW_ACT_MANGLE_UNSPEC, which is supported by no driver.
+ */
+enum flow_act_mangle_base {
+   FLOW_ACT_MANGLE_UNSPEC  = 0,
+   FLOW_ACT_MANGLE_HDR_TYPE_ETH,
+   FLOW_ACT_MANGLE_HDR_TYPE_IP4,
+   FLOW_ACT_MANGLE_HDR_TYPE_IP6,
+   FLOW_ACT_MANGLE_HDR_TYPE_TCP,
+   FLOW_ACT_MANGLE_HDR_TYPE_UDP,
+};
+
+struct flow_action_key {
+   enum flow_action_key_id id;
+   union {
+   u32 chain_index; /* FLOW_ACTION_KEY_GOTO */
+   struct net_device *dev; /* FLOW_ACTION_KEY_REDIRECT */
+   struct { /* FLOW_ACTION_KEY_VLAN */
+   u16 vid;
+   __be16  proto;
+   u8  prio;
+   } vlan;
+   struct { /* FLOW_ACTION_KEY_PACKET_EDIT */
+   enum flow_act_mangle_base htype;
+   u32 offset;
+   u32 mask;
+   u32 val;
+   } mangle;
+   const struct ip_tunnel_info *tunnel; /* FLOW_ACTION_KEY_TUNNEL_ENCAP */
+   u32 csum_flags; /* FLOW_ACTION_KEY_CSUM */
+   u32 mark; /* FLOW_ACTION_KEY_MARK */
+   };
+};
+
+struct flow_action {
+   int num_keys;
+   struct flow_action_key  *keys;
+};
+
+int flow_action_init(struct flow_action *flow_action, int num_acts);
+void flow_action_free(struct flow_action *flow_action);
+
+static inline bool flow_action_has_keys(const struct flow_action *action)
+{
+   return action->num_keys;
+}
+
+#define flow_action_for_each(__i, __act, __actions)\
+for (__i = 0, __act = &(__actions)->keys[0]; __i < (__actions)->num_keys; __act = &(__actions)->keys[++__i])
+
 struct flow_rule {
struct flow_match   match;
+   struct flow_action  action;
 };
 
 static inline bool flow_rule_match_key(const struct flow_rule *rule,
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 5a22381efccc..e25b235a8e10 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -195,6 +195,24 @@ void flow_rule_match_enc_opts(const struct flow_rule *rule,
 }
 EXPORT_SYMBOL(flow_rule_match_enc_opts);
 
+int flow_action_init(struct flow_action *flow_action, int num_acts)
+{
+   flow_action->keys = kmalloc(sizeof(struct flow_action_key) * num_acts,
+   GFP_KERNEL);
+   if (!flow_action->keys)
+   return -ENOMEM;
+
+   flow_action->num_keys = num_acts;
+   return 0;
+}
+EXPORT_SYMBOL(flow_action_init);
+
+void flow_action_free(struct flow_action *flow_action)
+{
+   kfree(flow_action->keys);
+}
+EXPORT_SYMBOL(flow_action_free);
+
 /**
  * skb_flow_get_be16 - extract be16 entity
  * @skb: sk_buff to extract from
-- 
2.11.0

[PATCH RFC,net-next 09/10] flow_dissector: add basic ethtool_rx_flow_spec to flow_rule structure translator

2018-09-25 Thread Pablo Neira Ayuso

This patch adds a function to translate the ethtool_rx_flow_spec
structure to the flow_rule representation.

This allows us to reuse code on the driver side, given that the flower and
the ethtool_rx_flow interfaces use the same representation.
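
The translation step can be sketched as follows. This is a reduced model
assuming a single TCP/UDP-over-IPv4 spec; l4_spec, generic_match and
spec_to_match are hypothetical names, and the two boolean fields stand in
for the dissector's used-keys bookkeeping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Reduced stand-in for an ethtool-style value or mask specification. */
struct l4_spec {
	uint32_t ip4src, ip4dst;
	uint16_t psrc, pdst;
};

/* Generic representation: key, mask, and which key groups are in use. */
struct generic_match {
	struct l4_spec key, mask;
	bool uses_addrs;
	bool uses_ports;
};

/* Copy only the fields whose mask is non-zero and record what is used. */
static void spec_to_match(const struct l4_spec *val, const struct l4_spec *m,
			  struct generic_match *out)
{
	assert(val && m && out);
	memset(out, 0, sizeof(*out));

	if (m->ip4src) {
		out->key.ip4src = val->ip4src;
		out->mask.ip4src = m->ip4src;
	}
	if (m->ip4dst) {
		out->key.ip4dst = val->ip4dst;
		out->mask.ip4dst = m->ip4dst;
	}
	if (m->psrc) {
		out->key.psrc = val->psrc;
		out->mask.psrc = m->psrc;
	}
	if (m->pdst) {
		out->key.pdst = val->pdst;
		out->mask.pdst = m->pdst;
	}

	out->uses_addrs = m->ip4src || m->ip4dst;
	out->uses_ports = m->psrc || m->pdst;
}
```

Fields with a zero mask are left out of the result entirely, so a driver
reading the generic form never sees values the user did not ask to match.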

Signed-off-by: Pablo Neira Ayuso 
---
 include/net/flow_dissector.h |   5 ++
 net/core/flow_dissector.c| 190 +++
 2 files changed, 195 insertions(+)

diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index 7a4683646d5a..ec9036232538 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -485,4 +485,9 @@ static inline bool flow_rule_match_key(const struct flow_rule *rule,
return dissector_uses_key(rule->match.dissector, key);
 }
 
+struct ethtool_rx_flow_spec;
+
+struct flow_rule *ethtool_rx_flow_rule(const struct ethtool_rx_flow_spec *fs);
+void ethtool_rx_flow_rule_free(struct flow_rule *rule);
+
 #endif
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index e25b235a8e10..73c9c7d29cd3 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -25,6 +25,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static void dissector_set_key(struct flow_dissector *flow_dissector,
  enum flow_dissector_key_id key_id)
@@ -213,6 +214,195 @@ void flow_action_free(struct flow_action *flow_action)
 }
 EXPORT_SYMBOL(flow_action_free);
 
+struct ethtool_rx_flow_key {
+   struct flow_dissector_key_basic basic;
+   union {
+   struct flow_dissector_key_ipv4_addrs ipv4;
+   struct flow_dissector_key_ipv6_addrs ipv6;
+   };
+   struct flow_dissector_key_ports tp;
+   struct flow_dissector_key_ip ip;
+} __aligned(BITS_PER_LONG / 8); /* Ensure that we can do comparisons as longs. */
+
+struct ethtool_rx_flow_match {
+   struct flow_dissector   dissector;
+   struct ethtool_rx_flow_key  key;
+   struct ethtool_rx_flow_key  mask;
+};
+
+struct flow_rule *ethtool_rx_flow_rule(const struct ethtool_rx_flow_spec *fs)
+{
+   static struct in6_addr zero_addr = {};
+   struct ethtool_rx_flow_match *match;
+   struct flow_action_key *act;
+   struct flow_rule *rule;
+
+   rule = kmalloc(sizeof(struct flow_rule), GFP_KERNEL);
+   if (!rule)
+   return NULL;
+
+   match = kzalloc(sizeof(struct ethtool_rx_flow_match), GFP_KERNEL);
+   if (!match)
+   goto err_match;
+
+   rule->match.dissector   = &match->dissector;
+   rule->match.mask= &match->mask;
+   rule->match.key = &match->key;
+
+   match->mask.basic.n_proto = 0xffff;
+
+   switch (fs->flow_type & ~FLOW_EXT) {
+   case TCP_V4_FLOW:
+   case UDP_V4_FLOW: {
+   const struct ethtool_tcpip4_spec *v4_spec, *v4_m_spec;
+
+   match->key.basic.n_proto = htons(ETH_P_IP);
+
+   v4_spec = &fs->h_u.tcp_ip4_spec;
+   v4_m_spec = &fs->m_u.tcp_ip4_spec;
+
+   if (v4_m_spec->ip4src) {
+   match->key.ipv4.src = v4_spec->ip4src;
+   match->mask.ipv4.src = v4_m_spec->ip4src;
+   }
+   if (v4_m_spec->ip4dst) {
+   match->key.ipv4.dst = v4_spec->ip4dst;
+   match->mask.ipv4.dst = v4_m_spec->ip4dst;
+   }
+   if (v4_m_spec->ip4src ||
+   v4_m_spec->ip4dst) {
+   match->dissector.used_keys |=
+   FLOW_DISSECTOR_KEY_IPV4_ADDRS;
+   match->dissector.offset[FLOW_DISSECTOR_KEY_IPV4_ADDRS] =
+   offsetof(struct ethtool_rx_flow_key, ipv4);
+   }
+   if (v4_m_spec->psrc) {
+   match->key.tp.src = v4_spec->psrc;
+   match->mask.tp.src = v4_m_spec->psrc;
+   }
+   if (v4_m_spec->pdst) {
+   match->key.tp.dst = v4_spec->pdst;
+   match->mask.tp.dst = v4_m_spec->pdst;
+   }
+   if (v4_m_spec->psrc ||
+   v4_m_spec->pdst) {
+   match->dissector.used_keys |= FLOW_DISSECTOR_KEY_PORTS;
+   match->dissector.offset[FLOW_DISSECTOR_KEY_PORTS] =
+   offsetof(struct ethtool_rx_flow_key, tp);
+   }
+   if (v4_m_spec->tos) {
+   match->key.ip.tos = v4_spec->tos;
+   match->mask.ip.tos = v4_m_spec->tos;
+   match->dissector.used_keys |= FLOW_DISSECTOR_KEY_IP;
+   match->dissector.offset[FLOW_DISSECTOR_KEY_IP] =
+   offsetof(struct ethtool_rx_flow_key, ip);
+   }
+   }
+   break;
+   case TCP_V6_FLOW:
+   case UDP_V6_FLOW: {
+

[PATCH RFC,net-next 07/10] cls_flower: don't expose TC actions to drivers anymore

2018-09-25 Thread Pablo Neira Ayuso

Now that drivers have been converted to use the flow action
infrastructure, remove this field from the tc_cls_flower_offload
structure.

Signed-off-by: Pablo Neira Ayuso 
---
 include/net/pkt_cls.h  | 1 -
 net/sched/cls_flower.c | 5 -
 2 files changed, 6 deletions(-)

diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index c8e61e8b0257..a6317baff462 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -742,7 +742,6 @@ struct tc_cls_flower_offload {
enum tc_fl_command command;
unsigned long cookie;
struct flow_rule rule;
-   struct tcf_exts *exts;
u32 classid;
struct tc_cls_flower_stats stats;
 };
diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index a89f3370718f..e7fcd5e9ba1e 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -419,7 +419,6 @@ static int fl_hw_replace_filter(struct tcf_proto *tp,
cls_flower.rule.match.dissector = &f->mask->dissector;
cls_flower.rule.match.mask = &f->mask->key;
cls_flower.rule.match.key = &f->mkey;
-   cls_flower.exts = &f->exts;
cls_flower.classid = f->res.classid;
 
if (fl_hw_setup_action(&f->action, &f->exts) < 0)
@@ -455,7 +454,6 @@ static void fl_hw_update_stats(struct tcf_proto *tp, struct cls_fl_filter *f)
tc_cls_common_offload_init(&cls_flower.common, tp, f->flags, NULL);
cls_flower.command = TC_CLSFLOWER_STATS;
cls_flower.cookie = (unsigned long) f;
-   cls_flower.exts = &f->exts;
cls_flower.classid = f->res.classid;
 
tc_setup_cb_call(block, &f->exts, TC_SETUP_CLSFLOWER,
@@ -1464,7 +1462,6 @@ static int fl_reoffload(struct tcf_proto *tp, bool add, tc_setup_cb_t *cb,
cls_flower.rule.match.dissector = &mask->dissector;
cls_flower.rule.match.mask = &mask->key;
cls_flower.rule.match.key = &f->mkey;
-   cls_flower.exts = &f->exts;
cls_flower.rule.action.num_keys = f->action.num_keys;
cls_flower.rule.action.keys = f->action.keys;
cls_flower.classid = f->res.classid;
@@ -1489,7 +1486,6 @@ static void fl_hw_create_tmplt(struct tcf_chain *chain,
 {
struct tc_cls_flower_offload cls_flower = {};
struct tcf_block *block = chain->block;
-   struct tcf_exts dummy_exts = { 0, };
 
cls_flower.common.chain_index = chain->index;
cls_flower.command = TC_CLSFLOWER_TMPLT_CREATE;
@@ -1497,7 +1493,6 @@ static void fl_hw_create_tmplt(struct tcf_chain *chain,
cls_flower.rule.match.dissector = >dissector;
cls_flower.rule.match.mask = >mask;
cls_flower.rule.match.key = >dummy_key;
-   cls_flower.exts = &dummy_exts;
 
/* We don't care if driver (any of them) fails to handle this
 * call. It serves just as a hint for it.
-- 
2.11.0

[PATCH RFC,net-next 05/10] cls_flower: add statistics retrieval infrastructure and use it

2018-09-25 Thread Pablo Neira Ayuso

Provide a tc_cls_flower_stats structure that acts as a container for the
statistics carried in tc_cls_flower_offload, then restore the stats on
the TC actions. Hence, tcf_exts_stats_update() is no longer called from
drivers.

Signed-off-by: Pablo Neira Ayuso 
---
 drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c  |  4 ++--
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c  |  6 +++---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c   |  2 +-
 drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c |  2 +-
 drivers/net/ethernet/netronome/nfp/flower/offload.c   |  4 ++--
 include/net/pkt_cls.h | 15 +++
 net/sched/cls_flower.c|  4 
 7 files changed, 28 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c 
b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
index 62652ffc8221..3505791777e7 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_tc.c
@@ -1358,8 +1358,8 @@ static int bnxt_tc_get_flow_stats(struct bnxt *bp,
lastused = flow->lastused;
spin_unlock(&flow->stats_lock);
 
-   tcf_exts_stats_update(tc_flow_cmd->exts, stats.bytes, stats.packets,
- lastused);
+   tc_cls_flower_stats_update(tc_flow_cmd, stats.bytes, stats.packets,
+  lastused);
return 0;
 }
 
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c
index cff9d854bf51..74fe2ee4636e 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_tc_flower.c
@@ -807,9 +807,9 @@ int cxgb4_tc_flower_stats(struct net_device *dev,
if (ofld_stats->packet_count != packets) {
if (ofld_stats->prev_packet_count != packets)
ofld_stats->last_used = jiffies;
-   tcf_exts_stats_update(cls->exts, bytes - ofld_stats->byte_count,
- packets - ofld_stats->packet_count,
- ofld_stats->last_used);
+   tc_cls_flower_stats_update(cls, bytes - ofld_stats->byte_count,
+  packets - ofld_stats->packet_count,
+  ofld_stats->last_used);
 
ofld_stats->packet_count = packets;
ofld_stats->byte_count = bytes;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index 5139e63daa74..bb29060c0351 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -2906,7 +2906,7 @@ int mlx5e_stats_flower(struct mlx5e_priv *priv,
 
mlx5_fc_query_cached(counter, &bytes, &packets, &lastuse);
 
-   tcf_exts_stats_update(f->exts, bytes, packets, lastuse);
+   tc_cls_flower_stats_update(f, bytes, packets, lastuse);
 
return 0;
 }
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c 
b/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c
index 773dba2067ed..b3668b578e84 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_flower.c
@@ -461,7 +461,7 @@ int mlxsw_sp_flower_stats(struct mlxsw_sp *mlxsw_sp,
if (err)
goto err_rule_get_stats;
 
-   tcf_exts_stats_update(f->exts, bytes, packets, lastuse);
+   tc_cls_flower_stats_update(f, bytes, packets, lastuse);
 
mlxsw_sp_acl_ruleset_put(mlxsw_sp, ruleset);
return 0;
diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c 
b/drivers/net/ethernet/netronome/nfp/flower/offload.c
index 5a9986fccfa4..fbfec64767fb 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
@@ -577,8 +577,8 @@ nfp_flower_get_stats(struct nfp_app *app, struct net_device 
*netdev,
return 0;
 
spin_lock_bh(&nfp_flow->lock);
-   tcf_exts_stats_update(flow->exts, nfp_flow->stats.bytes,
- nfp_flow->stats.pkts, nfp_flow->stats.used);
+   tc_cls_flower_stats_update(flow, nfp_flow->stats.bytes,
+  nfp_flow->stats.pkts, nfp_flow->stats.used);
 
nfp_flow->stats.pkts = 0;
nfp_flow->stats.bytes = 0;
diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index bc145265ebd5..c8e61e8b0257 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -731,6 +731,12 @@ enum tc_fl_command {
TC_CLSFLOWER_TMPLT_DESTROY,
 };
 
+struct tc_cls_flower_stats {
+   u64 pkts;
+   u64 bytes;
+   u64 lastused;
+};
+
 struct tc_cls_flower_offload {
struct tc_cls_common_offload common;
enum tc_fl_command command;
@@ -738,8 +744,17 @@ struct tc_cls_flower_offload {
struct flow_rule rule;
struct tcf_exts *exts;
u32 classid;
+

[PATCH RFC,net-next 08/10] flow_dissector: add wake-up-on-lan and queue to flow_action

2018-09-25 Thread Pablo Neira Ayuso

These actions need to be added to support bcm sf2 features available
through the ethtool_rx_flow interface.

Signed-off-by: Pablo Neira Ayuso 
---
 include/net/flow_dissector.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/include/net/flow_dissector.h b/include/net/flow_dissector.h
index 925c208816f1..7a4683646d5a 100644
--- a/include/net/flow_dissector.h
+++ b/include/net/flow_dissector.h
@@ -418,6 +418,8 @@ enum flow_action_key_id {
FLOW_ACTION_KEY_ADD,
FLOW_ACTION_KEY_CSUM,
FLOW_ACTION_KEY_MARK,
+   FLOW_ACTION_KEY_WAKE,
+   FLOW_ACTION_KEY_QUEUE,
 };
 
 /* This is mirroring enum pedit_header_type definition for easy mapping between
@@ -452,6 +454,7 @@ struct flow_action_key {
const struct ip_tunnel_info *tunnel;/* 
FLOW_ACTION_KEY_TUNNEL_ENCAP */
u32 csum_flags; /* FLOW_ACTION_KEY_CSUM 
*/
u32 mark;   /* FLOW_ACTION_KEY_MARK 
*/
+   u32 queue_index;/* 
FLOW_ACTION_KEY_QUEUE */
};
 };
 
-- 
2.11.0

[PATCH RFC,net-next 04/10] cls_flower: add translator to flow_action representation

2018-09-25 Thread Pablo Neira Ayuso

This implements TC action to flow_action translation from cls_flower.

Signed-off-by: Pablo Neira Ayuso 
---
 net/sched/cls_flower.c | 124 -
 1 file changed, 123 insertions(+), 1 deletion(-)

diff --git a/net/sched/cls_flower.c b/net/sched/cls_flower.c
index e1dd60a2ecb8..a96a80f01c6d 100644
--- a/net/sched/cls_flower.c
+++ b/net/sched/cls_flower.c
@@ -28,6 +28,14 @@
 
 #include 
 #include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
 
 struct fl_flow_key {
int indev_ifindex;
@@ -101,6 +109,7 @@ struct cls_fl_filter {
u32 in_hw_count;
struct rcu_work rwork;
struct net_device *hw_dev;
+   struct flow_action action;
 };
 
 static const struct rhashtable_params mask_ht_params = {
@@ -294,6 +303,107 @@ static void fl_hw_destroy_filter(struct tcf_proto *tp, 
struct cls_fl_filter *f,
tcf_block_offload_dec(block, &f->flags);
 }
 
+static int fl_hw_setup_action(struct flow_action *flow_action,
+ const struct tcf_exts *exts)
+{
+   const struct tc_action *act;
+   int num_acts = 0, i, j, k;
+
+   if (!exts)
+   return 0;
+
+   tcf_exts_for_each_action(i, act, exts) {
+   if (is_tcf_pedit(act))
+   num_acts += tcf_pedit_nkeys(act);
+   else
+   num_acts++;
+   }
+
+   if (!num_acts)
+   return 0;
+
+   if (flow_action_init(flow_action, num_acts) < 0)
+   return -ENOMEM;
+
+   j = 0;
+   tcf_exts_for_each_action(i, act, exts) {
+   struct flow_action_key *key;
+
+   key = &flow_action->keys[j];
+   if (is_tcf_gact_ok(act)) {
+   key->id = FLOW_ACTION_KEY_ACCEPT;
+   } else if (is_tcf_gact_shot(act)) {
+   key->id = FLOW_ACTION_KEY_DROP;
+   } else if (is_tcf_gact_trap(act)) {
+   key->id = FLOW_ACTION_KEY_TRAP;
+   } else if (is_tcf_gact_goto_chain(act)) {
+   key->id = FLOW_ACTION_KEY_GOTO;
+   key->chain_index = tcf_gact_goto_chain_index(act);
+   } else if (is_tcf_mirred_egress_redirect(act)) {
+   key->id = FLOW_ACTION_KEY_REDIRECT;
+   key->dev = tcf_mirred_dev(act);
+   } else if (is_tcf_mirred_egress_mirror(act)) {
+   key->id = FLOW_ACTION_KEY_MIRRED;
+   key->dev = tcf_mirred_dev(act);
+   } else if (is_tcf_vlan(act)) {
+   switch (tcf_vlan_action(act)) {
+   case TCA_VLAN_ACT_PUSH:
+   key->id = FLOW_ACTION_KEY_VLAN_PUSH;
+   key->vlan.vid = tcf_vlan_push_vid(act);
+   key->vlan.proto = tcf_vlan_push_proto(act);
+   key->vlan.prio = tcf_vlan_push_prio(act);
+   break;
+   case TCA_VLAN_ACT_POP:
+   key->id = FLOW_ACTION_KEY_VLAN_POP;
+   break;
+   case TCA_VLAN_ACT_MODIFY:
+   key->id = FLOW_ACTION_KEY_VLAN_MANGLE;
+   key->vlan.vid = tcf_vlan_push_vid(act);
+   key->vlan.proto = tcf_vlan_push_proto(act);
+   key->vlan.prio = tcf_vlan_push_prio(act);
+   break;
+   }
+   } else if (is_tcf_tunnel_set(act)) {
+   key->id = FLOW_ACTION_KEY_TUNNEL_ENCAP;
+   key->tunnel = tcf_tunnel_info(act);
+   } else if (is_tcf_tunnel_release(act)) {
+   key->id = FLOW_ACTION_KEY_TUNNEL_DECAP;
+   key->tunnel = tcf_tunnel_info(act);
+   } else if (is_tcf_pedit(act)) {
+   for (k = 0; k < tcf_pedit_nkeys(act); k++) {
+   switch (tcf_pedit_cmd(act, k)) {
+   case TCA_PEDIT_KEY_EX_CMD_SET:
+   key->id = FLOW_ACTION_KEY_MANGLE;
+   break;
+   case TCA_PEDIT_KEY_EX_CMD_ADD:
+   key->id = FLOW_ACTION_KEY_ADD;
+   break;
+   default:
+   WARN_ON_ONCE(1);
+   break;
+   }
+
+   key->mangle.htype = tcf_pedit_htype(act, k);
+   key->mangle.mask = tcf_pedit_mask(act, k);
+   key->mangle.val = tcf_pedit_val(act, k);
+   key->mangle.offset =

[PATCH RFC,net-next 00/10] add flow_rule infrastructure

2018-09-25 Thread Pablo Neira Ayuso

Hi,

This patchset spins over the existing kernel representation for network
driver offloads based on the existing cls_flower dissector use for the
rule matching side and the TC action infrastructure to represent the
action side.

The proposed object that represents rules looks like this:

struct flow_rule {
struct flow_match   match;
struct flow_action  action;
};

The flow_match structure wraps Jiri Pirko's existing representation
available in cls_flower based on dissectors to represent the matching
side:

struct flow_match {
struct flow_dissector   *dissector;
void*mask;
void*key;
};

The mask and key layouts are opaque, given the dissector object provides
the used_keys flags - to check for rule selectors that are being used -
and the offset to the corresponding key and mask in the opaque
container structures.
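
As an illustration only, the lookup described above can be modelled in
plain userspace C. Every name in this sketch (sketch_dissector,
sketch_match_has_key(), sketch_match_ipv4_addrs(), ...) is hypothetical
and merely stands in for the kernel's flow_dissector and flow_match; the
real structures and helpers differ in detail.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical key ids; the kernel has its own dissector key enum. */
enum sketch_key_id {
	SKETCH_KEY_PORTS,
	SKETCH_KEY_IPV4_ADDRS,
	SKETCH_KEY_MAX,
};

struct sketch_ports { uint16_t src; uint16_t dst; };
struct sketch_ipv4_addrs { uint32_t src; uint32_t dst; };

/* Stand-in for the dissector: which keys are in use, and where they live. */
struct sketch_dissector {
	unsigned int used_keys;        /* bit i set => key i is used */
	size_t offset[SKETCH_KEY_MAX]; /* byte offset inside the container */
};

/* Stand-in for flow_match: mask and key point to opaque containers. */
struct sketch_match {
	const struct sketch_dissector *dissector;
	const void *mask;
	const void *key;
};

/* One possible container layout; callers never rely on it directly. */
struct sketch_container {
	struct sketch_ports ports;
	struct sketch_ipv4_addrs ipv4;
};

struct sketch_ipv4_match {
	const struct sketch_ipv4_addrs *key;
	const struct sketch_ipv4_addrs *mask;
};

/* Check whether a rule selector is in use. */
static int sketch_match_has_key(const struct sketch_match *m,
				enum sketch_key_id id)
{
	return (m->dissector->used_keys & (1u << id)) != 0;
}

/* Fetch key and mask for the IPv4 addresses with one call. */
static void sketch_match_ipv4_addrs(const struct sketch_match *m,
				    struct sketch_ipv4_match *out)
{
	size_t off = m->dissector->offset[SKETCH_KEY_IPV4_ADDRS];

	out->key = (const struct sketch_ipv4_addrs *)((const char *)m->key + off);
	out->mask = (const struct sketch_ipv4_addrs *)((const char *)m->mask + off);
}
```

A caller would first test sketch_match_has_key() and only then fetch the
key and mask pointers, so the container layout stays private to whoever
filled in the dissector.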

Then, the actions to be performed on the matching packets are represented
through the flow_action object:

struct flow_action {
int num_keys;
struct flow_action_key  *keys;
};

This object comes with a num_keys field that specifies the number of
actions - this supports an arbitrary number of actions, the driver
will impose its own restrictions on this - and the array that stores
flow_action_key structures (keys). Each flow action key looks like this:

struct flow_action_key {
enum flow_action_key_id id;
union {
u32 chain_index;/* 
FLOW_ACTION_KEY_GOTO */
struct net_device   *dev;   /* 
FLOW_ACTION_KEY_REDIRECT */
struct {/* 
FLOW_ACTION_KEY_VLAN */
u16 vid;
__be16  proto;
u8  prio;
} vlan;
struct {/* 
FLOW_ACTION_KEY_PACKET_EDIT */
enum flow_act_mangle_base htype;
u32 offset;
u32 mask;
u32 val;
} mangle;
const struct ip_tunnel_info *tunnel;/* 
FLOW_ACTION_KEY_TUNNEL_ENCAP */
u32 csum_flags; /* 
FLOW_ACTION_KEY_CSUM */
u32 mark;   /* 
FLOW_ACTION_KEY_MARK */
u32 queue_index;/* 
FLOW_ACTION_KEY_QUEUE */
};
};

Possible actions in this patchset are:

enum flow_action_key_id {
FLOW_ACTION_KEY_ACCEPT  = 0,
FLOW_ACTION_KEY_DROP,
FLOW_ACTION_KEY_TRAP,
FLOW_ACTION_KEY_GOTO,
FLOW_ACTION_KEY_REDIRECT,
FLOW_ACTION_KEY_MIRRED,
FLOW_ACTION_KEY_VLAN_PUSH,
FLOW_ACTION_KEY_VLAN_POP,
FLOW_ACTION_KEY_VLAN_MANGLE,
FLOW_ACTION_KEY_TUNNEL_ENCAP,
FLOW_ACTION_KEY_TUNNEL_DECAP,
FLOW_ACTION_KEY_MANGLE,
FLOW_ACTION_KEY_ADD,
FLOW_ACTION_KEY_CSUM,
FLOW_ACTION_KEY_MARK,
FLOW_ACTION_KEY_WAKE,
FLOW_ACTION_KEY_QUEUE,
};

which are based on what existing drivers can do from the existing TC
actions.

Common code pattern from the drivers to populate the hardware
intermediate representation looks like this:

if (flow_rule_match_key(rule, FLOW_DISSECTOR_KEY_IPV4_ADDRS)) {
struct flow_match_ipv4_addrs match;

flow_rule_match_ipv4_addrs(rule, &match);
flow->l3_key.ipv4.daddr.s_addr = match.key->dst;
flow->l3_mask.ipv4.daddr.s_addr = match.mask->dst;
flow->l3_key.ipv4.saddr.s_addr = match.key->src;
flow->l3_mask.ipv4.saddr.s_addr = match.mask->src;
}

Then, flow action code parser should look like:

flow_action_for_each(i, act, flow_action) {
   switch (act->id) {
   case FLOW_ACTION_KEY_DROP:
actions->flags |= DRIVER_XYZ_ACTION_FLAG_DROP;
break;
   case ...:
break;
   default:
return -EOPNOTSUPP;
   }
}
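
The parser pattern above can be tried out in userspace with a small
mock-up. This is only a sketch under stated assumptions: the sketch_*
types, the SKETCH_HW_FLAG_* bits and sketch_parse_actions() are
hypothetical stand-ins, not the structures or helpers added by this
patchset, and a plain for loop replaces flow_action_for_each().

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical subset of action ids. */
enum sketch_action_id {
	SKETCH_ACTION_ACCEPT = 0,
	SKETCH_ACTION_DROP,
	SKETCH_ACTION_MARK,
	SKETCH_ACTION_QUEUE,
};

/* Stand-in for flow_action_key: an id plus per-action data. */
struct sketch_action_key {
	enum sketch_action_id id;
	union {
		uint32_t mark;        /* SKETCH_ACTION_MARK */
		uint32_t queue_index; /* SKETCH_ACTION_QUEUE */
	};
};

/* Stand-in for flow_action: a counted array of keys. */
struct sketch_action {
	int num_keys;
	struct sketch_action_key *keys;
};

/* Hypothetical hardware intermediate representation of a driver. */
#define SKETCH_HW_FLAG_DROP (1u << 0)
#define SKETCH_HW_FLAG_MARK (1u << 1)

struct sketch_hw_actions {
	unsigned int flags;
	uint32_t mark;
};

/* Translate the action array; reject anything the "hardware" lacks. */
static int sketch_parse_actions(const struct sketch_action *fa,
				struct sketch_hw_actions *hw)
{
	int i;

	hw->flags = 0;
	hw->mark = 0;

	for (i = 0; i < fa->num_keys; i++) {
		const struct sketch_action_key *act = &fa->keys[i];

		switch (act->id) {
		case SKETCH_ACTION_ACCEPT:
			break;
		case SKETCH_ACTION_DROP:
			hw->flags |= SKETCH_HW_FLAG_DROP;
			break;
		case SKETCH_ACTION_MARK:
			hw->flags |= SKETCH_HW_FLAG_MARK;
			hw->mark = act->mark;
			break;
		default:
			/* This mock driver cannot offload the action. */
			return -EOPNOTSUPP;
		}
	}
	return 0;
}
```

The default branch is where a driver imposes its own restrictions: any
action id it does not know makes the whole rule fail with -EOPNOTSUPP
instead of being silently ignored.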

A quick description of the patches:

Patch #1 introduces the flow_match structure and two new interfaces to
 check for rule selectors that are used and to fetch the key
 and the mask with one single function call. This patch also
 introduces the flow_rule

[PATCH RFC,net-next 02/10] net/mlx5e: allow two independent packet edit actions

2018-09-25 Thread Pablo Neira Ayuso

Although the packet edit infrastructure allows an arbitrary number of
packet manglings in the same action, it is still possible to define a
rule with two or more packet edit instances. In the mlx5e driver, this
would result in only the last packet mangling action being applied.

This patch adds pedit_headers_action struct that annotates the headers
mangling configuration that is used to configure hardware. Then,
alloc_tc_pedit_action() is called to populate the mlx5e hardware
intermediate representation once all actions have been parsed.

Signed-off-by: Pablo Neira Ayuso 
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 77 +
 1 file changed, 54 insertions(+), 23 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index c9d541944a14..5139e63daa74 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -1600,6 +1600,12 @@ struct pedit_headers {
struct udphdr  udp;
 };
 
+struct pedit_headers_action {
+   struct pedit_headersvals;
+   struct pedit_headersmasks;
+   u32 pedits;
+};
+
 static int pedit_header_offsets[] = {
[TCA_PEDIT_KEY_EX_HDR_TYPE_ETH] = offsetof(struct pedit_headers, eth),
[TCA_PEDIT_KEY_EX_HDR_TYPE_IP4] = offsetof(struct pedit_headers, ip4),
@@ -1611,16 +1617,15 @@ static int pedit_header_offsets[] = {
 #define pedit_header(_ph, _htype) ((void *)(_ph) + 
pedit_header_offsets[_htype])
 
 static int set_pedit_val(u8 hdr_type, u32 mask, u32 val, u32 offset,
-struct pedit_headers *masks,
-struct pedit_headers *vals)
+struct pedit_headers_action *hdrs)
 {
u32 *curr_pmask, *curr_pval;
 
if (hdr_type >= __PEDIT_HDR_TYPE_MAX)
goto out_err;
 
-   curr_pmask = (u32 *)(pedit_header(masks, hdr_type) + offset);
-   curr_pval  = (u32 *)(pedit_header(vals, hdr_type) + offset);
+   curr_pmask = (u32 *)(pedit_header(&hdrs->masks, hdr_type) + offset);
+   curr_pval  = (u32 *)(pedit_header(&hdrs->vals, hdr_type) + offset);
 
if (*curr_pmask & mask)  /* disallow acting twice on the same location 
*/
goto out_err;
@@ -1676,8 +1681,7 @@ static struct mlx5_fields fields[] = {
  * max from the SW pedit action. On success, it says how many HW actions were
  * actually parsed.
  */
-static int offload_pedit_fields(struct pedit_headers *masks,
-   struct pedit_headers *vals,
+static int offload_pedit_fields(struct pedit_headers_action *hdrs,
struct mlx5e_tc_flow_parse_attr *parse_attr)
 {
struct pedit_headers *set_masks, *add_masks, *set_vals, *add_vals;
@@ -1691,10 +1695,10 @@ static int offload_pedit_fields(struct pedit_headers 
*masks,
__be16 mask_be16;
void *action;
 
-   set_masks = &masks[TCA_PEDIT_KEY_EX_CMD_SET];
-   add_masks = &masks[TCA_PEDIT_KEY_EX_CMD_ADD];
-   set_vals = &vals[TCA_PEDIT_KEY_EX_CMD_SET];
-   add_vals = &vals[TCA_PEDIT_KEY_EX_CMD_ADD];
+   set_masks = &hdrs[TCA_PEDIT_KEY_EX_CMD_SET].masks;
+   add_masks = &hdrs[TCA_PEDIT_KEY_EX_CMD_ADD].masks;
+   set_vals = &hdrs[TCA_PEDIT_KEY_EX_CMD_SET].vals;
+   add_vals = &hdrs[TCA_PEDIT_KEY_EX_CMD_ADD].vals;
 
action_size = MLX5_UN_SZ_BYTES(set_action_in_add_action_in_auto);
action = parse_attr->mod_hdr_actions;
@@ -1784,12 +1788,12 @@ static int offload_pedit_fields(struct pedit_headers 
*masks,
 }
 
 static int alloc_mod_hdr_actions(struct mlx5e_priv *priv,
-const struct tc_action *a, int namespace,
+u32 pedits, int namespace,
 struct mlx5e_tc_flow_parse_attr *parse_attr)
 {
int nkeys, action_size, max_actions;
 
-   nkeys = tcf_pedit_nkeys(a);
+   nkeys = pedits;
action_size = MLX5_UN_SZ_BYTES(set_action_in_add_action_in_auto);
 
if (namespace == MLX5_FLOW_NAMESPACE_FDB) /* FDB offloading */
@@ -1812,17 +1816,15 @@ static const struct pedit_headers zero_masks = {};
 
 static int parse_tc_pedit_action(struct mlx5e_priv *priv,
 const struct tc_action *a, int namespace,
-struct mlx5e_tc_flow_parse_attr *parse_attr)
+struct mlx5e_tc_flow_parse_attr *parse_attr,
+struct pedit_headers_action *hdrs)
 {
-   struct pedit_headers masks[__PEDIT_CMD_MAX], vals[__PEDIT_CMD_MAX], 
*cmd_masks;
int nkeys, i, err = -EOPNOTSUPP;
u32 mask, val, offset;
u8 cmd, htype;
 
nkeys = tcf_pedit_nkeys(a);
-
-   memset(masks, 0, sizeof(struct pedit_headers) * __PEDIT_CMD_MAX);
-   memset(vals,  0, sizeof(struct pedit_headers) * __PEDIT_CMD_MAX);
+   hdrs->pedits += nkeys;
 
for (i = 0; i < nkeys; i++) {
htype =

Re: [PATCH net 00/15] netpoll: avoid capture effects for NAPI drivers

2018-09-25 Thread Song Liu




> On Sep 25, 2018, at 7:43 AM, Michael Chan  wrote:
> 
> On Tue, Sep 25, 2018 at 7:20 AM Eric Dumazet  wrote:
>> 
>> On Tue, Sep 25, 2018 at 7:02 AM Michael Chan  
>> wrote:
>>> 
>>> On Mon, Sep 24, 2018 at 2:18 PM Song Liu  wrote:
>>>>
>>>>
>>>>
>>>>> On Sep 24, 2018, at 2:05 PM, Eric Dumazet  wrote:
>>>>>
>>>>>>
>>>>>> Interesting, maybe a bnxt specific issue.
>>>>>>
>>>>>> It seems their model is to process TX/RX notification in the same queue,
>>>>>> they throw away RX events if budget == 0
>>>>>>
>>>>>> It means commit e7b9569102995ebc26821789628eef45bd9840d8 is wrong and
>>>>>> must be reverted.
>>>>>>
>>>>>> Otherwise, we have a possibility of blocking a queue under netpoll
>>>>>> pressure.
>>>>>
>>>>> Hmm, actually a revert might not be enough, since code at lines 2030-2031
>>>>> would fire and we might not call napi_complete_done() anyway.
>>>>>
>>>>> Unfortunately this driver logic is quite complex.
>>>>>
>>>>> Could you test on other NIC eventually ?
>>>>>
>>>>
>>>> It actually runs OK on ixgbe.
>>>>
>>>> @Michael, could you please help us with this?
>>>>
>>> I've taken a quick look using today's net tree plus Eric's
>>> poll_one_napi() patch.  The problem I'm seeing is that netpoll calls
>>> bnxt_poll() with budget 0.  And since work_done >= budget of 0, we
>>> return without calling napi_complete_done() and without arming the
>>> interrupt.  netpoll doesn't always call us back until we call
>>> napi_complete_done(), right?  So I think if there are in-flight TX
>>> completions, we'll miss those.
>> 
>> That's the whole point of netpoll :
>> 
>> We drain the TX queues, without interrupts being involved at all,
>> by calling ->napi() with a zero budget.
>> 
>> napi_complete(), even if called from ->napi() while budget was zero,
>> should do nothing but return early.
>> 
>> budget==0 means that ->napi() should process all TX completions.
> 
> All TX completions that we can see.  We cannot see the in-flight ones.
> 
> If budget is exceeded, I think the assumption is that poll will always
> be called again.
> 
>> 
>> So it looks like bnxt has a bug, that is showing up after the latest
>> poll_one_napi() patch.
>> This latest patch is needed otherwise the cpu attempting the
>> netpoll-TX-drain might drain nothing at all,
>> since it does not anymore call ndo_poll_controller() that was grabbing
>> SCHED bits on all queues (napi_schedule() like calls)
> 
> I think the latest patch is preventing the normal interrupt -> NAPI
> path from coming in and cleaning the remaining TX completions and
> arming the interrupt.

Hi Michael, 

This may not be related. But I am looking at this:

bnxt_poll_work() {

while (1) {

if (rx_pkts == budget)
return
}
}

With budget of 0, the loop will terminate after processing one packet. 
But I think the expectation is to finish all tx packets. So it doesn't
feel right. Could you please confirm?
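
To make the concern concrete, here is a small userspace model of the
loop quoted above. It is only a sketch: sketch_poll_work() and its types
are hypothetical, it assumes the exit test runs once after every
completion (the pseudocode does not say where it sits), and it ignores
everything else the real bnxt_poll_work() does.

```c
/* Completion types seen on a combined TX/RX completion ring. */
enum sketch_cmp_type {
	SKETCH_CMP_TX,
	SKETCH_CMP_RX,
};

struct sketch_poll_result {
	int tx_done; /* TX completions processed */
	int rx_pkts; /* RX packets processed */
};

/* Walk the ring and stop as soon as rx_pkts == budget. */
static struct sketch_poll_result
sketch_poll_work(const enum sketch_cmp_type *ring, int n, int budget)
{
	struct sketch_poll_result res = { 0, 0 };
	int i;

	for (i = 0; i < n; i++) {
		if (ring[i] == SKETCH_CMP_TX)
			res.tx_done++;
		else
			res.rx_pkts++;

		/* Same exit test as in the pseudocode above. */
		if (res.rx_pkts == budget)
			break;
	}
	return res;
}
```

In this model, a ring holding four TX completions polled with budget 0
leaves the loop after the first one, because rx_pkts is still 0 and
therefore equal to the budget. With a non-zero budget all four are
processed.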

Thanks,
Song

Re: [PATCH] net/ncsi: Add NCSI OEM command for FB Tiogapass

2018-09-25 Thread Vijay Khemka

Hi Joel,
Thanks, I am adding netdev mailing list here.
Yes, this command is supported on all Mellanox cards, as per the Mellanox
specification.

Regards
-Vijay

On 9/24/18, 5:30 PM, "Joel Stanley"  wrote:

Hi Vijay,

On Tue, 25 Sep 2018 at 09:39, Vijay Khemka  wrote:
>
> This patch adds OEM command to get mac address from NCSI device and
> configure the same to the network card.
>
> ncsi_cmd_arg - Modified this structure to include bigger payload data.
> ncsi_cmd_handler_oem: This function handles OEM command request
> ncsi_rsp_handler_oem: This function handles response for OEM command.
> get_mac_address_oem_mlx: This function will send OEM command to get
> mac address for Mellanox card
> set_mac_affinity_mlx: This will send OEM command to set Mac affinity
> for Mellanox card

Thanks for the patch. The code looks okay, but I wanted to get some
input from our NCSI maintainer as to how OEM commands would be
structured. Sam, can you please provide some review here?

Is the command supported on all Mellanox cards, just some, or does
TiogaPass have a special firmware that enables it?

We should include the netdev mailing list in this discussion as the
patch needs to be acceptable for upstream.

Cheers,

Joel

>
> Signed-off-by: Vijay Khemka 
> ---
>  net/ncsi/Kconfig   |  3 ++
>  net/ncsi/internal.h| 11 +--
>  net/ncsi/ncsi-cmd.c| 24 +--
>  net/ncsi/ncsi-manage.c | 68 ++
>  net/ncsi/ncsi-pkt.h| 16 ++
>  net/ncsi/ncsi-rsp.c| 33 +++-
>  6 files changed, 149 insertions(+), 6 deletions(-)
>
> diff --git a/net/ncsi/Kconfig b/net/ncsi/Kconfig
> index 08a8a6031fd7..b8bf89fea7c8 100644
> --- a/net/ncsi/Kconfig
> +++ b/net/ncsi/Kconfig
> @@ -10,3 +10,6 @@ config NET_NCSI
>   support. Enable this only if your system connects to a network
>   device via NCSI and the ethernet driver you're using supports
>   the protocol explicitly.
> +config NCSI_OEM_CMD_GET_MAC
> +   bool "Get NCSI OEM MAC Address"
> +   depends on NET_NCSI
> diff --git a/net/ncsi/internal.h b/net/ncsi/internal.h
> index 8055e3965cef..da17958e6a4b 100644
> --- a/net/ncsi/internal.h
> +++ b/net/ncsi/internal.h
> @@ -68,6 +68,10 @@ enum {
> NCSI_MODE_MAX
>  };
>
> +#define NCSI_OEM_MFR_MLX_ID 0x8119
> +#define NCSI_OEM_MLX_CMD_GET_MAC0x1b00
> +#define NCSI_OEM_MLX_CMD_SET_AFFINITY   0x010700
> +
>  struct ncsi_channel_version {
> u32 version;/* Supported BCD encoded NCSI version */
> u32 alpha2; /* Supported BCD encoded NCSI version */
> @@ -236,6 +240,7 @@ enum {
> ncsi_dev_state_probe_dp,
> ncsi_dev_state_config_sp= 0x0301,
> ncsi_dev_state_config_cis,
> +   ncsi_dev_state_config_oem_gma,
> ncsi_dev_state_config_clear_vids,
> ncsi_dev_state_config_svf,
> ncsi_dev_state_config_ev,
> @@ -301,9 +306,9 @@ struct ncsi_cmd_arg {
> unsigned short   payload; /* Command packet payload 
length */
> unsigned int req_flags;   /* NCSI request properties  
 */
> union {
> -   unsigned char  bytes[16]; /* Command packet specific data 
 */
> -   unsigned short words[8];
> -   unsigned int   dwords[4];
> +   unsigned char  bytes[64]; /* Command packet specific data 
 */
> +   unsigned short words[32];
> +   unsigned int   dwords[16];
> };
>  };
>
> diff --git a/net/ncsi/ncsi-cmd.c b/net/ncsi/ncsi-cmd.c
> index 7567ca63aae2..3205e22c1734 100644
> --- a/net/ncsi/ncsi-cmd.c
> +++ b/net/ncsi/ncsi-cmd.c
> @@ -211,6 +211,25 @@ static int ncsi_cmd_handler_snfc(struct sk_buff *skb,
> return 0;
>  }
>
> +static int ncsi_cmd_handler_oem(struct sk_buff *skb,
> +   struct ncsi_cmd_arg *nca)
> +{
> +   struct ncsi_cmd_oem_pkt *cmd;
> +   unsigned int len;
> +
> +   len = sizeof(struct ncsi_cmd_pkt_hdr) + 4;
> +   if (nca->payload < 26)
> +   len += 26;
> +   else
> +   len += nca->payload;
> +
> +   cmd = skb_put_zero(skb, len);
> +   memcpy(cmd->data, nca->bytes, nca->payload);
> +   ncsi_cmd_build_header(&cmd->cmd.common, nca);
> +
> +   return 0;
> +}
> +
>  static struct ncsi_cmd_handler {
> unsigned char type;
> int   payload;
> @@ -244,7 +263,7 @@ static struct ncsi_cmd_handler {
> { NCSI_PKT_CMD_GNS,

Re: [PATCH net-next v3 00/10] Refactor classifier API to work with Qdisc/blocks without rtnl lock

2018-09-25 Thread Cong Wang

On Mon, Sep 24, 2018 at 9:23 AM Vlad Buslov  wrote:
>
> Currently, all netlink protocol handlers for updating rules, actions and
> qdiscs are protected with single global rtnl lock which removes any
> possibility for parallelism. This patch set is a third step to remove
> rtnl lock dependency from TC rules update path.
>
> Recently, new rtnl registration flag RTNL_FLAG_DOIT_UNLOCKED was added.
> Handlers registered with this flag are called without RTNL taken. End
> goal is to have rule update handlers(RTM_NEWTFILTER, RTM_DELTFILTER,
> etc.) to be registered with UNLOCKED flag to allow parallel execution.
> However, there is no intention to completely remove or split rtnl lock
> itself. This patch set addresses specific problems in implementation of
> classifiers API that prevent its control path from being executed
> concurrently. Additional changes are required to refactor classifiers
> API and individual classifiers for parallel execution. This patch set
> lays groundwork to eventually register rule update handlers as
> rtnl-unlocked by modifying code in cls API that works with Qdiscs and
> blocks. Following patch set does the same for chains and classifiers.
>
> The goal of this change is to refactor tcf_block_find() and its
> dependencies to allow concurrent execution:
> - Extend Qdisc API with rcu to lookup and take reference to Qdisc
>   without relying on rtnl lock.
> - Extend tcf_block with atomic reference counting and rcu.
> - Always take reference to tcf_block while working with it.
> - Implement tcf_block_release() to release resources obtained by
>   tcf_block_find()
> - Create infrastructure to allow registering Qdiscs with class ops that
>   do not require the caller to hold rtnl lock.
>
> All three netlink rule update handlers use tcf_block_find() to lookup
> Qdisc and block, and this patch set introduces additional means of
> synchronization to substitute rtnl lock in cls API.
>
> Some functions in cls and sch APIs have historic names that no longer
> clearly describe their intent. In order not to make this code even more
> confusing when introducing their concurrency-friendly versions, rename
> these functions to describe actual implementation.
>
> Changes from V2 to V3:
> - Patch 1:
>   - Explicitly include refcount.h in rtnetlink.h.
> - Patch 3:
>   - Move rcu_head field to the end of struct Qdisc.
>   - Rearrange local variable declarations in qdisc_lookup_rcu().
> - Patch 5:
>   - Remove tcf_qdisc_put() and inline its content to callers.


Overall looks good to me,

Acked-by: Cong Wang 


There are some minor issues, e.g. qdisc_free_cb() can be static;
please send a follow-up patch to address them.

Thanks.

Re: [PATCH net-next] tls: Fixed a memory leak during socket close

2018-09-25 Thread David Miller

From: Vakul Garg 
Date: Tue, 25 Sep 2018 20:21:51 +0530

> During socket close, if there is an open record with tx context, it needs
> to be freed apart from freeing up plaintext and encrypted scatter
> lists. This patch frees up the open record if present in tx context.
> 
> Also tls_free_both_sg() has been renamed to tls_free_open_rec() to
> indicate that the free record in tx context is being freed inside the
> function.
> 
> Fixes: a42055e8d2c3 ("net/tls: Add support for async encryption")
> Signed-off-by: Vakul Garg 

Applied.

Re: [PATCH net-next] tls: Fix socket mem accounting error under async encryption

2018-09-25 Thread David Miller

From: Vakul Garg 
Date: Tue, 25 Sep 2018 16:26:17 +0530

> Current async encryption implementation sometimes showed up socket
> memory accounting error during socket close. This results in kernel
> warning calltrace. The root cause of the problem is that socket var
> sk_forward_alloc gets corrupted due to access in sk_mem_charge()
> and sk_mem_uncharge() being invoked from multiple concurrent contexts
> in multicore processor. The apis sk_mem_charge() and sk_mem_uncharge()
> are called from functions alloc_plaintext_sg(), free_sg() etc. It is
> required that memory accounting apis are called under a socket lock.
> 
> The plaintext sg data sent for encryption is freed using free_sg() in
> tls_encryption_done(). It is wrong to call free_sg() from this function.
> This is because this function may run in irq context. We cannot acquire
> socket lock in this function.
> 
> We remove calling of function free_sg() for plaintext data from
> tls_encryption_done() and defer freeing up of plaintext data to the time
> when the record is picked up from tx_list and transmitted/freed. When
> tls_tx_records() gets called, socket is already locked and thus there is
> no concurrent access problem.
> 
> Fixes: a42055e8d2c3 ("net/tls: Add support for async encryption")
> Signed-off-by: Vakul Garg 

Applied.

Re: [PATCH net-next 0/3] r8169: series with smaller improvements

2018-09-25 Thread David Miller

From: Heiner Kallweit 
Date: Tue, 25 Sep 2018 07:55:36 +0200

> This series includes smaller improvements, nothing exciting.

Series applied.

Re: [PATCH v2] net: macb: Clean 64b dma addresses if they are not detected

2018-09-25 Thread David Miller

From: Michal Simek 
Date: Tue, 25 Sep 2018 08:32:50 +0200

> Clear ADDR64 dma bit in DMACFG register in case that HW_DMA_CAP_64B is
> not detected on 64bit system.
> The issue was observed when bootloader(u-boot) does not check macb
> feature at DCFG6 register (DAW64_OFFSET) and enabling 64bit dma support
> by default. Then macb driver is reading DMACFG register back and only
> adding 64bit dma configuration but not cleaning it out.
> 
> Signed-off-by: Michal Simek 
> ---
> 
> Changes in v2:
> - Clean reg at the first place - Edgar
> - Update commit message

Applied, thank you.

Re: Marvell phy errata origins?

2018-09-25 Thread Harini Katakam

Hi Daniel,

On Tue, Sep 25, 2018 at 9:10 PM Andrew Lunn  wrote:
>
> > I hope this thread isn't too old to bring back to life. So it seems
> > that Harini found that m88e did not need this errata, and Cisco
> > previously found that Harini's patch fixed m88e1112, we included it
> > internally for that reason
> >
> > Now I'm getting reports that this errata fixes issues we're seeing on
> > m88e. We see an interrupt storm without the errata, despite the errata
> > not being defined in the datasheet.
>
> Is everybody actually using interrupts? It could be in one system
> phylib is polling.
>

Yes, we weren't using interrupts; we used phy poll.

As I recall, the register and page combination was reserved and
the access seemed to fail.
It will be useful if we can get the errata description or version details.
I'll check if I can get any more information.

Regards,
Harini

Re: [PATCH net v2 1/2] net: core: add member wol_enabled to struct net_device

2018-09-25 Thread Heiner Kallweit

On 25.09.2018 10:28, Michal Kubecek wrote:
> (I wrote my comment to v1 because I overlooked there is a v2 already;
> duplicating it here.)
> 
> On Mon, Sep 24, 2018 at 09:58:59PM +0200, Heiner Kallweit wrote:
>> Add flag wol_enabled to struct net_device indicating whether
>> Wake-on-LAN is enabled. As first user phy_suspend() will use it to
>> decide whether PHY can be suspended or not.
>>
>> Fixes: f1e911d5d0df ("r8169: add basic phylib support")
>> Fixes: e8cfd9d6c772 ("net: phy: call state machine synchronously in 
>> phy_stop")
>> Signed-off-by: Heiner Kallweit 
>> ---
>>  include/linux/netdevice.h | 3 +++
>>  net/core/ethtool.c| 9 -
>>  2 files changed, 11 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index 1cbbf77a6..f5f1f1450 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -1756,6 +1756,8 @@ enum netdev_priv_flags {
>>   *  switch driver and used to set the phys state of the
>>   *  switch port.
>>   *
>> + *  @wol_enabled:   Wake-on-LAN is enabled
>> + *
>>   *  FIXME: cleanup struct net_device such that network protocol info
>>   *  moves out.
>>   */
>> @@ -2039,6 +2041,7 @@ struct net_device {
>>  struct lock_class_key   *qdisc_tx_busylock;
>>  struct lock_class_key   *qdisc_running_key;
>>  	bool			proto_down;
>> +	unsigned		wol_enabled:1;
>>  };
>>  #define to_net_dev(d) container_of(d, struct net_device, dev)
>>  
> 
> As there is no bitfield yet, this would add 4 bytes to struct net_device.
> How about using a bit in net_device::priv_flags like IFF_RXFH_CONFIGURED
> in ethtool_set_rxfh_indir() and ethtool_set_rxfh()?
> 
Indeed alternatively we could add a private flag and the related
getter/setter.
Regarding the size argument: We have a few bool members in struct
net_device (uc_promisc, dismantle, needs_free_netdev, proto_down)
which most likely can be transparently converted to bitfield
members. Then a wol_enabled bitfield member would be no overhead.
I don't have a strong preference and would leave it up to David.

> Michal Kubecek
> 
Heiner
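
The size argument above can be illustrated with a small user-space sketch. The two structs below are hypothetical stand-ins (they are not struct net_device, and the real layout and padding depend on the surrounding members and the ABI); they only show why a lone bitfield next to bool members can cost an extra storage unit, while converting the bools to bitfield members lets wol_enabled share it. toy_can_suspend() is likewise an assumed helper that mirrors the stated intent of letting phy_suspend() consult the flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical layout: four bool members plus one new bitfield member. */
struct toy_dev_bools {
	bool uc_promisc;
	bool dismantle;
	bool needs_free_netdev;
	bool proto_down;
	unsigned wol_enabled:1;
};

/* Hypothetical layout: all five flags share one bitfield storage unit. */
struct toy_dev_bits {
	unsigned uc_promisc:1;
	unsigned dismantle:1;
	unsigned needs_free_netdev:1;
	unsigned proto_down:1;
	unsigned wol_enabled:1;
};

/* Bytes saved by the all-bitfield layout on the current ABI. */
static size_t toy_bitfield_saving(void)
{
	return sizeof(struct toy_dev_bools) - sizeof(struct toy_dev_bits);
}

/* Assumed policy: a PHY may be suspended only if Wake-on-LAN is off. */
static bool toy_can_suspend(const struct toy_dev_bits *dev)
{
	return !dev->wol_enabled;
}
```

The exact saving is ABI dependent, so the sketch only relies on the all-bitfield layout never being larger than the mixed one.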

Re: [PATCH net-next v1 1/5] net: allow binding socket in a VRF when there's an unbound socket

2018-09-25 Thread David Ahern

On 9/25/18 9:26 AM, Mike Manning wrote:
> On 24/09/2018 23:44, David Ahern wrote:
>> On 9/24/18 10:13 AM, Mike Manning wrote:
>>> From: Robert Shearman 
>>>
>>> There is no easy way currently for applications that want to receive
>>> packets in the default VRF to be isolated from packets arriving in
>>> VRFs, which makes using VRF-unaware applications in a VRF-aware system
>>> a potential security risk.
>>
>> That comment is not correct.
>>
>> The point of the l3mdev sysctl's is to prohibit this case. Setting
>> net.ipv4.{tcp,udp}_l3mdev_accept=0 means that a packet arriving on an
>> interface enslaved to a VRF can not be received by a global socket.
> Hi David, thanks for reviewing this. The converse does not hold though,
> i.e. there is no guarantee that the unbound socket will be selected for
> packets when not in a VRF, if there is an unbound socket and a socket
> bound to a VRF. Also, such packets should not be handled by the socket

I need an explicit example here. You are saying a packet arriving on an
interface not enslaved to a VRF might match a socket bound to a VRF?


> in the VRF if there is no unbound socket. We also had an issue with raw
> socket lookup device bind matching. I can break this particular patch
> into smaller patches and provide more detail, would this help? I will
> also update/break up the other patches according to your comments.

Why not add an l3mdev sysctl for raw sockets then?

Yes, please send smaller patches. A diff stat of:
15 files changed, 109 insertions(+), 62 deletions(-)
is a bit harsh.

> 
>>
>> Setting the l3mdev to 1 allows the default socket to work across VRFs.
>> If that is not what you want for a given app or a given VRF, then one
>> option is to add netfilter rules on the VRF device to prohibit it. I
>> just verified this works for both tcp and udp.
> 
> Netfilter is per application and so does not scale. I have not checked
> if it is suitable for packet handling on raw sockets.
> 
>>
>> Further, overlapping binds are allowed using SO_REUSEPORT meaning I can
>> have a server running in the default vrf bound to a port AND a server
>> running bound to a specific vrf and the same port:
>>
>> udp    UNCONN 0  0  *%red:12345 *:*
>>     users:(("vrf-test",pid=1376,fd=3))
>> udp    UNCONN 0  0   *:12345 *:*
>>  users:(("vrf-test",pid=1375,fd=3))
>>
>> tcp    LISTEN 0  1  *%red:12345 *:*
>>     users:(("vrf-test",pid=1356,fd=3))
>> tcp    LISTEN 0  1   *:12345 *:*
>>  users:(("vrf-test",pid=1352,fd=3))
>>
>> For packets arriving on an interface enslaved to a VRF the socket lookup
>> will pick the VRF server over the global one.
> 
> Agreed, but the converse is not guaranteed to hold i.e. packets that are
> not in a VRF may be handled by a socket bound to a VRF.
> 
> We do use SO_REUSEPORT for our own applications so as to run instances
> in the default and other VRFs, but still require these patches (or
> similar) due to how packets are handled when there is an unbound socket
> and sockets bound to different VRFs.

Why can't compute_score be adjusted to account for that case?

> 
>>
>> -- 
>>
>> With this patch set I am seeing a number of tests failing -- socket
>> connections working when they should not or not working when they
>> should. I only skimmed the results. I am guessing this patch is the
>> reason, but that is just a guess.
>>
>> You need to make sure all permutations of:
>> 1. net.ipv4.{tcp,udp}_l3mdev_accept={0,1},
>> 2. connection in the default VRF and in a VRF,
>> 3. locally originated and remote traffic,
>> 4. ipv4 and ipv6
>>
> 
> We are using raw, datagram and stream sockets for ipv4 & ipv6, require
> connectivity for local and remote addresses where appropriate and need
> route leaking between VRFs when configured, we are unaware of any
> outstanding bugs. Is there some way that I can run/analyze the tests
> that are failing for you?

I am not distributing my vrf tests right now. Before sending the
response I quickly verified one case is easy for you to see: set the udp
sysctl to 0, start a global server, send a packet to it via an interface
enslaved to a VRF. It should fail ECONNREFUSED (no socket match) but
instead packet reaches the server.

> 
> Also cf patch 2/5 note that ping to link-local addresses is handled
> consistently with that to global addresses in a VRF, so this now
> succeeds if ping is done in the VRF, i.e. 'sudo ip vrf exec  ping
>  -I 

Shifting packets destined to a LLA from the real device to the vrf
device is a change in behavior. It is not clear to me at the moment that
it will not cause a problem.

> 
>> continue to work as expected meaning packets flow when they should and
>> fail with the right error when they should not. I believe the UDP cases
>> were the main ones failing.
>>
>> Given the test failures, I did not look at the code changes in the patch.
>>
> 

A couple

Re: Marvell phy errata origins?

2018-09-25 Thread Andrew Lunn

> I hope this thread isn't too old to bring back to life. So it seems
> that Harini found that m88e did not need this errata, and Cisco
> previously found that Harini's patch fixed m88e1112, we included it
> internally for that reason
> 
> Now I'm getting reports that this errata fixes issues we're seeing on
> m88e. We see an interrupt storm without the errata, despite the errata
> not being defined in the datasheet.

Is everybody actually using interrupts? It could be in one system
phylib is polling.

When you get an interrupt storm, which interrupt bit is not cleared by
reading the ievent register?

Andrew

Re: netlink: 16 bytes leftover after parsing attributes in process `ip'.

2018-09-25 Thread David Ahern

On 9/25/18 8:47 AM, Jiri Benc wrote:
> On Tue, 25 Sep 2018 11:49:10 +0200, Christian Brauner wrote:
>> So if people really want to hide this issue as much as we can then we
>> can play the guessing game. I could send a patch that roughly does the
>> following:
>>
>> 	if (nlmsg_len(cb->nlh) < sizeof(struct ifinfomsg))
>> 		guessed_header_len = sizeof(struct ifaddrmsg);
>> 	else
>> 		guessed_header_len = sizeof(struct ifinfomsg);
>>
>> This will work since sizeof(ifaddrmsg) == 8 and sizeof(ifinfomsg) == 16.
>> The only valid property for RTM_GETADDR requests is IFA_TARGET_NETNSID.
>> This property is a __s32 which should bring the message up to 12 bytes
>> (not sure about alignment requirements and where we might end up then)
>> which is still less than the 16 bytes without that property from
>> ifinfomsg. That's a hacky hacky hack-hack and will likely work but will
>> break when ifaddrmsg grows a new member or we introduce another property
>> that is valid in RTM_GETADDR requests. It also will not work cleanly
>> when users stuff additional properties in there that are valid for the
>> address family but are not used in RTM_GETADDR requests.
> 
> I'd expect that any potential existing code that makes use of other
> attributes already assumes ifaddrmsg. Hence, if the nlmsg_len is >
> sizeof(ifinfomsg), you can be sure that there are attributes and thus
> the struct used was ifaddrmsg.
> 
> So, in order for RTM_GETADDR to work reliably with attributes, you have
> to ensure that the length is > sizeof(ifinfomsg).

One of the many on-going efforts I have in progress is kernel side
filtering of route dumps. It has this same problem. For it I am
proposing a new flag:

#define RTM_F_PROPER_HEADER	0x4000

ifinfomsg is 16 bytes which is > rtmsg at 12. If the message length is >
16, then rtm_flags can be checked to know if the proper header is sent.

For ifaddrmsg things do not line up as well. Worse, all of the flag bits
are used. But, perhaps one can be overloaded with the limit that you can
never filter on its presence. Since you are adding the first filter to
address dumps such a limitation should be ok.

For example something like this (whitespace damaged on paste) to remove
the guess work:

diff --git a/include/uapi/linux/if_addr.h b/include/uapi/linux/if_addr.h
index dfcf3ce0097f..8e3e9d475db5 100644
--- a/include/uapi/linux/if_addr.h
+++ b/include/uapi/linux/if_addr.h
@@ -41,6 +41,11 @@ enum {
 #define IFA_MAX (__IFA_MAX - 1)

 /* ifa_flags */
+/* IFA_F_PROPER_HEADER is only set in ifa_flags for dumps
+ * to indicate the ancillary header is the expected ifaddrmsg
+ * vs ifinfomsg from legacy userspace
+ */
+#define IFA_F_PROPER_HEADER	0x01
 #define IFA_F_SECONDARY		0x01
 #define IFA_F_TEMPORARY		IFA_F_SECONDARY

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index bfe3ec7ecb14..256b9f88db8f 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -5022,8 +5022,13 @@ static int inet6_dump_addr(struct sk_buff *skb,
struct netlink_callback *cb,
s_idx = idx = cb->args[1];
s_ip_idx = ip_idx = cb->args[2];

-   if (nlmsg_parse(cb->nlh, sizeof(struct ifaddrmsg), tb, IFA_MAX,
-   ifa_ipv6_policy, NULL) >= 0) {
+   if (nlmsg_len(cb->nlh) >= sizeof(struct ifaddrmsg) &&
+   ((struct ifaddrmsg *) nlmsg_data(cb->nlh))->ifa_flags &
IFA_F_PROPER_HEADER) {
+
+   if (nlmsg_parse(cb->nlh, sizeof(struct ifaddrmsg), tb,
IFA_MAX,
+   ifa_ipv6_policy, NULL) >= 0) {
+   ...
+
if (tb[IFA_TARGET_NETNSID]) {
netnsid = nla_get_s32(tb[IFA_TARGET_NETNSID]);


For ifaddrmsg ifa_flags aligns with ifi_type which is set by kernel side
so this should be ok.
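
The remark that ifa_flags lines up with ifi_type can be checked with a small user-space sketch. The two structs below are local mirrors of the uapi layouts (the toy_ names are made up here so the example builds without kernel headers). toy_aliased_ifa_flags() is a hypothetical helper that shows what a receiver would read as ifa_flags if a legacy sender had actually filled in an ifinfomsg; which byte of ifi_type lands there depends on host byte order, and whether existing user space ever sets ifi_type is not something the sketch can answer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct ifaddrmsg (8 bytes). */
struct toy_ifaddrmsg {
	uint8_t  ifa_family;
	uint8_t  ifa_prefixlen;
	uint8_t  ifa_flags;
	uint8_t  ifa_scope;
	uint32_t ifa_index;
};

/* Local mirror of struct ifinfomsg (16 bytes). */
struct toy_ifinfomsg {
	uint8_t  ifi_family;
	uint8_t  ifi_pad;
	uint16_t ifi_type;
	int32_t  ifi_index;
	uint32_t ifi_flags;
	uint32_t ifi_change;
};

/*
 * Reinterpret the first 8 bytes of a legacy ifinfomsg as an ifaddrmsg
 * and return the byte that would be taken for ifa_flags.
 */
static uint8_t toy_aliased_ifa_flags(uint16_t ifi_type)
{
	struct toy_ifinfomsg legacy;
	struct toy_ifaddrmsg seen;

	memset(&legacy, 0, sizeof(legacy));
	legacy.ifi_type = ifi_type;
	memcpy(&seen, &legacy, sizeof(seen));
	return seen.ifa_flags;
}
```

With ifi_type left at zero the aliased flag byte is zero, so a zeroed legacy header would not be mistaken for one carrying the new flag.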

Re: [PATCH bpf-next] flow_dissector: lookup netns by skb->sk if skb->dev is NULL

2018-09-25 Thread Daniel Borkmann

On 09/24/2018 10:49 PM, Willem de Bruijn wrote:
> From: Willem de Bruijn 
> 
> BPF flow dissectors are configured per network namespace.
> __skb_flow_dissect looks up the netns through dev_net(skb->dev).
> 
> In some dissector paths skb->dev is NULL, such as for Unix sockets.
> In these cases fall back to looking up the netns by socket.
> 
> Analyzing the codepaths leading to __skb_flow_dissect I did not find
> a case where both skb->dev and skb->sk are NULL. Warn and fall back to
> standard flow dissector if one is found.
> 
> Fixes: d58e468b1112 ("flow_dissector: implements flow dissector BPF hook")
> Reported-by: Eric Dumazet 
> Signed-off-by: Willem de Bruijn 

Applied to bpf-next, thanks Willem!
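
The fallback order described in the commit message can be sketched in user space as follows. All types here are toy stand-ins invented for the example (they are not the kernel's sk_buff, net_device or sock), and returning NULL stands for the "warn and fall back to the standard flow dissector" case.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for the kernel objects involved; names are hypothetical. */
struct toy_net  { int id; };
struct toy_dev  { struct toy_net *net; };
struct toy_sock { struct toy_net *net; };
struct toy_skb  { struct toy_dev *dev; struct toy_sock *sk; };

/*
 * Pick the network namespace for a packet: prefer the device, fall back
 * to the socket, and return NULL when neither is set so the caller can
 * use the standard dissector instead of a per-netns BPF program.
 */
static struct toy_net *toy_skb_net(const struct toy_skb *skb)
{
	if (skb->dev)
		return skb->dev->net;
	if (skb->sk)
		return skb->sk->net;
	return NULL;
}
```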

Re: Marvell phy errata origins?

2018-09-25 Thread Andrew Lunn

> I've added Gokul who reported the issue to me. Is it possible that Harini
> and Cisco have different m88e phys? Maybe there's an issue with how they
> are hooked up ?

Hi Daniel

The lower 4 bits of the ID registers normally indicate the revision of
the PHY. It might be worth checking if everybody has the same
revision, or the problems are limited to just one revision.

  Andrew
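
A minimal sketch of the revision check suggested above, assuming the usual convention that the two 16-bit PHY ID registers are combined into one 32-bit identifier with the revision in the low 4 bits. The helper names and the model mask are chosen for illustration, and any ID values passed in are only examples.

```c
#include <assert.h>
#include <stdint.h>

/* Mask that ignores the low 4 (revision) bits when comparing models. */
#define TOY_PHY_MODEL_MASK 0xfffffff0u

/* Combine the two 16-bit ID registers: first is the high half. */
static uint32_t toy_phy_id(uint16_t physid1, uint16_t physid2)
{
	return ((uint32_t)physid1 << 16) | physid2;
}

/* Revision is assumed to live in the low 4 bits of the combined ID. */
static unsigned int toy_phy_revision(uint32_t phy_id)
{
	return phy_id & 0xfu;
}

/* Two IDs name the same model if they differ only in revision bits. */
static int toy_same_model(uint32_t a, uint32_t b)
{
	return (a & TOY_PHY_MODEL_MASK) == (b & TOY_PHY_MODEL_MASK);
}
```

Comparing toy_phy_revision() across the affected boards would show whether the reports cluster on one silicon revision.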

Re: [PATCH net-next v1 1/5] net: allow binding socket in a VRF when there's an unbound socket

2018-09-25 Thread Mike Manning


On 24/09/2018 23:44, David Ahern wrote:
> On 9/24/18 10:13 AM, Mike Manning wrote:
>> From: Robert Shearman 
>>
>> There is no easy way currently for applications that want to receive
>> packets in the default VRF to be isolated from packets arriving in
>> VRFs, which makes using VRF-unaware applications in a VRF-aware system
>> a potential security risk.
>
> That comment is not correct.
>
> The point of the l3mdev sysctl's is to prohibit this case. Setting
> net.ipv4.{tcp,udp}_l3mdev_accept=0 means that a packet arriving on an
> interface enslaved to a VRF can not be received by a global socket.

Hi David, thanks for reviewing this. The converse does not hold though,
i.e. there is no guarantee that the unbound socket will be selected for
packets when not in a VRF, if there is an unbound socket and a socket
bound to a VRF. Also, such packets should not be handled by the socket
in the VRF if there is no unbound socket. We also had an issue with raw
socket lookup device bind matching. I can break this particular patch
into smaller patches and provide more detail, would this help? I will
also update/break up the other patches according to your comments.

> Setting the l3mdev to 1 allows the default socket to work across VRFs.
> If that is not what you want for a given app or a given VRF, then one
> option is to add netfilter rules on the VRF device to prohibit it. I
> just verified this works for both tcp and udp.

Netfilter is per application and so does not scale. I have not checked
if it is suitable for packet handling on raw sockets.

> Further, overlapping binds are allowed using SO_REUSEPORT meaning I can
> have a server running in the default vrf bound to a port AND a server
> running bound to a specific vrf and the same port:
>
> udp    UNCONN 0  0  *%red:12345 *:*
>     users:(("vrf-test",pid=1376,fd=3))
> udp    UNCONN 0  0   *:12345 *:*
>  users:(("vrf-test",pid=1375,fd=3))
>
> tcp    LISTEN 0  1  *%red:12345 *:*
>     users:(("vrf-test",pid=1356,fd=3))
> tcp    LISTEN 0  1   *:12345 *:*
>  users:(("vrf-test",pid=1352,fd=3))
>
> For packets arriving on an interface enslaved to a VRF the socket lookup
> will pick the VRF server over the global one.

Agreed, but the converse is not guaranteed to hold i.e. packets that are
not in a VRF may be handled by a socket bound to a VRF.

We do use SO_REUSEPORT for our own applications so as to run instances
in the default and other VRFs, but still require these patches (or
similar) due to how packets are handled when there is an unbound socket
and sockets bound to different VRFs.

> --
>
> With this patch set I am seeing a number of tests failing -- socket
> connections working when they should not or not working when they
> should. I only skimmed the results. I am guessing this patch is the
> reason, but that is just a guess.
>
> You need to make sure all permutations of:
> 1. net.ipv4.{tcp,udp}_l3mdev_accept={0,1},
> 2. connection in the default VRF and in a VRF,
> 3. locally originated and remote traffic,
> 4. ipv4 and ipv6
>

We are using raw, datagram and stream sockets for ipv4 & ipv6, require
connectivity for local and remote addresses where appropriate and need
route leaking between VRFs when configured, we are unaware of any
outstanding bugs. Is there some way that I can run/analyze the tests
that are failing for you?

Also cf patch 2/5 note that ping to link-local addresses is handled
consistently with that to global addresses in a VRF, so this now
succeeds if ping is done in the VRF, i.e. 'sudo ip vrf exec  ping 
 -I 

> continue to work as expected meaning packets flow when they should and
> fail with the right error when they should not. I believe the UDP cases
> were the main ones failing.
>
> Given the test failures, I did not look at the code changes in the patch.
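
The lookup semantics being argued over in this thread can be written down as a toy model. This is not the kernel's compute_score(); it is a hypothetical user-space sketch of the behaviour the participants describe as desirable: a socket bound to a VRF only matches packets in that VRF and wins over an unbound socket, and an unbound socket only sees VRF traffic when the l3mdev_accept sysctl is 1. All names are invented, and a VRF is represented by a nonzero integer with 0 meaning the default VRF.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy socket: bound_vrf == 0 means "not bound to any VRF". */
struct toy_vrf_sock {
	int bound_vrf;
};

/* Score one socket against a packet; -1 means "must not match". */
static int toy_score(const struct toy_vrf_sock *sk, int pkt_vrf,
		     bool l3mdev_accept)
{
	if (sk->bound_vrf)
		return sk->bound_vrf == pkt_vrf ? 2 : -1;
	if (pkt_vrf && !l3mdev_accept)
		return -1;	/* global socket may not take VRF traffic */
	return 1;
}

/* Return the index of the best matching socket, or -1 if none match. */
static int toy_lookup(const struct toy_vrf_sock *socks, size_t n,
		      int pkt_vrf, bool l3mdev_accept)
{
	int best = -1, best_score = -1;
	size_t i;

	for (i = 0; i < n; i++) {
		int score = toy_score(&socks[i], pkt_vrf, l3mdev_accept);

		if (score > best_score) {
			best_score = score;
			best = (int)i;
		}
	}
	return best;
}
```

In this model the UDP case mentioned in the review (sysctl set to 0, only a global server, packet arriving in a VRF) yields no match, which corresponds to the expected ECONNREFUSED.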

Re: Marvell phy errata origins?

2018-09-25 Thread Daniel Walker


On 04/18/2017 07:04 AM, Andrew Lunn wrote:

On Tue, Apr 18, 2017 at 06:16:33AM -0700, Daniel Walker wrote:


Hi,

Cisco is using a Marvell 88E1112 phy. It seems to be fairly similar
to the 88E which Harini added a fix for.


Hi Daniel

If you look at Marvell reference driver, DSDT, they are actually quite
different. Different virtual cable tester, different downshift
configuration, different packet generator, different loopback. I would
say they are different generations of PHY.


In Harini's commit
message for ,

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/net/phy/marvell.c?id=3ec0a0f10ceb

"This function has a sequence accessing Page 5 and Register 31, both
of which are not defined or reserved for this PHY"



I hope this thread isn't too old to bring back to life. So it seems 
that Harini found that m88e did not need this errata, and Cisco 
previously found that Harini's patch fixed m88e1112, we included it 
internally for that reason


Now I'm getting reports that this errata fixes issues we're seeing on 
m88e. We see an interrupt storm without the errata, despite the 
errata not being defined in the datasheet.


I would just send a patch adding the errata, but because Harini removed 
it I guess we really need to suss out what's going on.


I've added Gokul who reported the issue to me. Is it possible that 
Harini and Cisco have different m88e phys? Maybe there's an issue 
with how they are hooked up ?


Daniel

Re: [PATCH 3/5] ixgbe: add AF_XDP zero-copy Rx support

2018-09-25 Thread Jakub Kicinski

On Mon, 24 Sep 2018 18:35:55 +0200, Björn Töpel wrote:
> + if (adapter->flags & IXGBE_FLAG_SRIOV_ENABLED)
> + return -EINVAL;
> +
> + if (adapter->flags & IXGBE_FLAG_DCB_ENABLED)
> + return -EINVAL;

Hm, should you add UMEM checks to all the places these may get
enabled?  Like fabf1bce103a ("ixgbe: Prevent unsupported configurations
with XDP") did?

[PATCH net-next v6 17/23] zinc: Curve25519 generic C implementations and selftest

2018-09-25 Thread Jason A. Donenfeld

This contains two formally verified C implementations of the Curve25519
scalar multiplication function, one for 32-bit systems, and one for
64-bit systems whose compiler supports efficient 128-bit integer types.
Not only are these implementations formally verified, but they are also
the fastest available C implementations. They have been modified to be
friendly to kernel space and to be generally less horrendous looking,
but still an effort has been made to retain their formally verified
characteristic, and so the C might look slightly unidiomatic.

The 64-bit version comes from HACL*: 
https://github.com/project-everest/hacl-star
The 32-bit version comes from Fiat: https://github.com/mit-plv/fiat-crypto

Information: https://cr.yp.to/ecdh.html

Signed-off-by: Jason A. Donenfeld 
Cc: Samuel Neves 
Cc: Andy Lutomirski 
Cc: Greg KH 
Cc: Jean-Philippe Aumasson 
Cc: Karthikeyan Bhargavan 
---
 include/zinc/curve25519.h   |   22 +
 lib/zinc/Kconfig|4 +
 lib/zinc/Makefile   |3 +
 lib/zinc/curve25519/curve25519-fiat32.h |  860 +++
 lib/zinc/curve25519/curve25519-hacl64.h |  784 ++
 lib/zinc/curve25519/curve25519.c|  109 ++
 lib/zinc/selftest/curve25519.h  | 1321 +++
 7 files changed, 3103 insertions(+)
 create mode 100644 include/zinc/curve25519.h
 create mode 100644 lib/zinc/curve25519/curve25519-fiat32.h
 create mode 100644 lib/zinc/curve25519/curve25519-hacl64.h
 create mode 100644 lib/zinc/curve25519/curve25519.c
 create mode 100644 lib/zinc/selftest/curve25519.h

diff --git a/include/zinc/curve25519.h b/include/zinc/curve25519.h
new file mode 100644
index ..def173e736fc
--- /dev/null
+++ b/include/zinc/curve25519.h
@@ -0,0 +1,22 @@
+/* SPDX-License-Identifier: GPL-2.0 OR MIT */
+/*
+ * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights 
Reserved.
+ */
+
+#ifndef _ZINC_CURVE25519_H
+#define _ZINC_CURVE25519_H
+
+#include 
+
+enum curve25519_lengths {
+   CURVE25519_KEY_SIZE = 32
+};
+
+bool __must_check curve25519(u8 mypublic[CURVE25519_KEY_SIZE],
+const u8 secret[CURVE25519_KEY_SIZE],
+const u8 basepoint[CURVE25519_KEY_SIZE]);
+void curve25519_generate_secret(u8 secret[CURVE25519_KEY_SIZE]);
+bool __must_check curve25519_generate_public(
+   u8 pub[CURVE25519_KEY_SIZE], const u8 secret[CURVE25519_KEY_SIZE]);
+
+#endif /* _ZINC_CURVE25519_H */
diff --git a/lib/zinc/Kconfig b/lib/zinc/Kconfig
index 7bf4bc88f81f..0d8a94dd73a8 100644
--- a/lib/zinc/Kconfig
+++ b/lib/zinc/Kconfig
@@ -14,6 +14,10 @@ config ZINC_CHACHA20POLY1305
 config ZINC_BLAKE2S
tristate
 
+config ZINC_CURVE25519
+   tristate
+   select CONFIG_CRYPTO
+
 config ZINC_DEBUG
bool "Zinc cryptography library debugging and self-tests"
help
diff --git a/lib/zinc/Makefile b/lib/zinc/Makefile
index 67ad837c822c..65440438c6e5 100644
--- a/lib/zinc/Makefile
+++ b/lib/zinc/Makefile
@@ -25,3 +25,6 @@ obj-$(CONFIG_ZINC_CHACHA20POLY1305) += zinc_chacha20poly1305.o
 zinc_blake2s-y := blake2s/blake2s.o
 zinc_blake2s-$(CONFIG_ZINC_ARCH_X86_64) += blake2s/blake2s-x86_64.o
 obj-$(CONFIG_ZINC_BLAKE2S) += zinc_blake2s.o
+
+zinc_curve25519-y := curve25519/curve25519.o
+obj-$(CONFIG_ZINC_CURVE25519) += zinc_curve25519.o
diff --git a/lib/zinc/curve25519/curve25519-fiat32.h 
b/lib/zinc/curve25519/curve25519-fiat32.h
new file mode 100644
index ..32b5ec7aa040
--- /dev/null
+++ b/lib/zinc/curve25519/curve25519-fiat32.h
@@ -0,0 +1,860 @@
+/* SPDX-License-Identifier: GPL-2.0 OR MIT */
+/*
+ * Copyright (C) 2015-2016 The fiat-crypto Authors.
+ * Copyright (C) 2018 Jason A. Donenfeld . All Rights 
Reserved.
+ *
+ * This is a machine-generated formally verified implementation of Curve25519
+ * ECDH from: . Though originally
+ * machine generated, it has been tweaked to be suitable for use in the kernel.
+ * It is optimized for 32-bit machines and machines that cannot work 
efficiently
+ * with 128-bit integer types.
+ */
+
+/* fe means field element. Here the field is \Z/(2^255-19). An element t,
+ * entries t[0]...t[9], represents the integer t[0]+2^26 t[1]+2^51 t[2]+2^77
+ * t[3]+2^102 t[4]+...+2^230 t[9].
+ * fe limbs are bounded by 1.125*2^26,1.125*2^25,1.125*2^26,1.125*2^25,etc.
+ * Multiplication and carrying produce fe from fe_loose.
+ */
+typedef struct fe { u32 v[10]; } fe;
+
+/* fe_loose limbs are bounded by 
3.375*2^26,3.375*2^25,3.375*2^26,3.375*2^25,etc
+ * Addition and subtraction produce fe_loose from (fe, fe).
+ */
+typedef struct fe_loose { u32 v[10]; } fe_loose;
+
+static __always_inline void fe_frombytes_impl(u32 h[10], const u8 *s)
+{
+   /* Ignores top bit of s. */
+   u32 a0 = get_unaligned_le32(s);
+   u32 a1 = get_unaligned_le32(s+4);
+   u32 a2 = get_unaligned_le32(s+8);
+   u32 a3 = get_unaligned_le32(s+12);
+   u32 a4 =
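
The limb layout described in the fiat32 header comment (weights 2^0, 2^26, 2^51, 2^77, 2^102, ..., 2^230) can be sketched as below. This is an illustration only, not the code from the patch: the patch loads 32-bit words with get_unaligned_le32(), whereas this hypothetical toy_fe_frombytes() extracts one bit at a time for clarity. The toy_ names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_FE_LIMBS 10

/* Bit offset of limb i: ceil(25.5 * i) = 0, 26, 51, 77, 102, ..., 230. */
static unsigned int toy_limb_offset(unsigned int i)
{
	return (51u * i + 1u) / 2u;
}

/* Width of limb i in bits; alternates 26, 25, 26, 25, ... */
static unsigned int toy_limb_bits(unsigned int i)
{
	return toy_limb_offset(i + 1) - toy_limb_offset(i);
}

/*
 * Unpack a 32-byte little-endian value into ten limbs. Only bits 0..254
 * are read, so the top bit of s[31] is ignored, as in the patch comment.
 */
static void toy_fe_frombytes(uint32_t h[TOY_FE_LIMBS], const uint8_t s[32])
{
	unsigned int i, b;

	for (i = 0; i < TOY_FE_LIMBS; i++) {
		unsigned int off = toy_limb_offset(i);
		uint32_t v = 0;

		for (b = 0; b < toy_limb_bits(i); b++) {
			unsigned int pos = off + b;

			v |= (uint32_t)((s[pos >> 3] >> (pos & 7)) & 1u) << b;
		}
		h[i] = v;
	}
}
```

The ten widths sum to 255 bits, which matches a field element modulo 2^255-19.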

[PATCH net-next] tls: Fixed a memory leak during socket close

2018-09-25 Thread Vakul Garg

During socket close, if there is an open record with tx context, it needs
to be freed apart from freeing up plaintext and encrypted scatter
lists. This patch frees up the open record if present in tx context.

Also tls_free_both_sg() has been renamed to tls_free_open_rec() to
indicate that the open record in tx context is being freed inside the
function.

Fixes: a42055e8d2c3 ("net/tls: Add support for async encryption")
Signed-off-by: Vakul Garg 
---
 net/tls/tls_sw.c | 6 --
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index bf03f32aa983..6f2dc1873556 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -310,7 +310,7 @@ static void free_sg(struct sock *sk, struct scatterlist *sg,
*sg_size = 0;
 }
 
-static void tls_free_both_sg(struct sock *sk)
+static void tls_free_open_rec(struct sock *sk)
 {
struct tls_context *tls_ctx = tls_get_ctx(sk);
struct tls_sw_context_tx *ctx = tls_sw_ctx_tx(tls_ctx);
@@ -327,6 +327,8 @@ static void tls_free_both_sg(struct sock *sk)
free_sg(sk, rec->sg_plaintext_data,
 		&rec->sg_plaintext_num_elem,
 		&rec->sg_plaintext_size);
+
+   kfree(rec);
 }
 
 int tls_tx_records(struct sock *sk, int flags)
@@ -1580,7 +1582,7 @@ void tls_sw_free_resources_tx(struct sock *sk)
}
 
crypto_free_aead(ctx->aead_send);
-   tls_free_both_sg(sk);
+   tls_free_open_rec(sk);
 
kfree(ctx);
 }
-- 
2.13.6

Re: netlink: 16 bytes leftover after parsing attributes in process `ip'.

2018-09-25 Thread Jiri Benc

On Tue, 25 Sep 2018 11:49:10 +0200, Christian Brauner wrote:
> So if people really want to hide this issue as much as we can then we
> can play the guessing game. I could send a patch that roughly does the
> following:
> 
> 	if (nlmsg_len(cb->nlh) < sizeof(struct ifinfomsg))
> 		guessed_header_len = sizeof(struct ifaddrmsg);
> 	else
> 		guessed_header_len = sizeof(struct ifinfomsg);
> 
> This will work since sizeof(ifaddrmsg) == 8 and sizeof(ifinfomsg) == 16.
> The only valid property for RTM_GETADDR requests is IFA_TARGET_NETNSID.
> This property is a __s32 which should bring the message up to 12 bytes
> (not sure about alignment requirements and where we might end up then)
> which is still less than the 16 bytes without that property from
> ifinfomsg. That's a hacky hacky hack-hack and will likely work but will
> break when ifaddrmsg grows a new member or we introduce another property
> that is valid in RTM_GETADDR requests. It also will not work cleanly
> when users stuff additional properties in there that are valid for the
> address family but are not used in RTM_GETADDR requests.

I'd expect that any potential existing code that makes use of other
attributes already assumes ifaddrmsg. Hence, if the nlmsg_len is >
sizeof(ifinfomsg), you can be sure that there are attributes and thus
the struct used was ifaddrmsg.

So, in order for RTM_GETADDR to work reliably with attributes, you have
to ensure that the length is > sizeof(ifinfomsg).

This can be achieved by putting IFA_TARGET_NETNSID into a nested
attribute. Just define IFA_EXTENDED (feel free to invent a better name,
of course) and put IFA_TARGET_NETNSID inside. Then in the code, attempt
to parse only when the size is large enough:

	if (nlmsg_len(cb->nlh) > sizeof(struct ifinfomsg)) {
		int err;

		err = nlmsg_parse(cb->nlh, sizeof(struct ifaddrmsg), tb,
				  IFA_MAX, ifa_ipv6_policy, NULL);
		if (err < 0)
			return err;
		if (tb[IFA_EXTENDED]) {
			...parse the nested attribute...
			if (tb_nested[IFA_TARGET_NETNSID]) {
				...etc...
			}
		}
	}

Another option is forcing the user space to add another attribute, for
example, IFA_FLAGS_PRESENT, and attempt parsing only when it is
present. The logic would then be:

	if (nlmsg_len(cb->nlh) > sizeof(struct ifinfomsg)) {
		int err;

		err = nlmsg_parse(cb->nlh, sizeof(struct ifaddrmsg), tb,
				  IFA_MAX, ifa_ipv6_policy, NULL);
		if (err < 0)
			return err;
		if (tb[IFA_FLAGS_PRESENT] && tb[IFA_TARGET_NETNSID]) {
			...etc...
		}
	}

 Jiri
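
The length arithmetic behind the "> sizeof(ifinfomsg)" rule can be worked through in a short sketch. It assumes the usual netlink attribute framing (a 4-byte attribute header and 4-byte alignment) and the header sizes quoted in the thread; the toy_ helpers are hypothetical names. Under those assumptions a request carrying a single flat 4-byte attribute after ifaddrmsg comes to exactly 16 bytes, which is not greater than sizeof(ifinfomsg), while wrapping the same attribute in a nested attribute pushes the payload to 20 bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_IFADDRMSG_LEN 8u	/* sizeof(struct ifaddrmsg), per the thread */
#define TOY_IFINFOMSG_LEN 16u	/* sizeof(struct ifinfomsg), per the thread */
#define TOY_NLA_HDRLEN    4u	/* attribute header: u16 length + u16 type */

/* Round up to the 4-byte netlink attribute alignment. */
static size_t toy_nla_align(size_t len)
{
	return (len + 3) & ~(size_t)3;
}

/* Space taken by one attribute with the given payload size. */
static size_t toy_nla_total_size(size_t payload)
{
	return toy_nla_align(TOY_NLA_HDRLEN + payload);
}

/* The rule from the mail: only parse attributes past this length. */
static bool toy_may_parse_attrs(size_t nlmsg_payload_len)
{
	return nlmsg_payload_len > TOY_IFINFOMSG_LEN;
}

/* ifaddrmsg followed by one flat __s32 attribute. */
static size_t toy_flat_request_len(void)
{
	return TOY_IFADDRMSG_LEN + toy_nla_total_size(4);
}

/* ifaddrmsg followed by a nested attribute wrapping one __s32 attribute. */
static size_t toy_nested_request_len(void)
{
	return TOY_IFADDRMSG_LEN + toy_nla_total_size(toy_nla_total_size(4));
}
```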

Re: [PATCH net 00/15] netpoll: avoid capture effects for NAPI drivers

2018-09-25 Thread Michael Chan

On Tue, Sep 25, 2018 at 7:20 AM Eric Dumazet  wrote:
>
> On Tue, Sep 25, 2018 at 7:02 AM Michael Chan  
> wrote:
> >
> > On Mon, Sep 24, 2018 at 2:18 PM Song Liu  wrote:
> > >
> > >
> > >
> > > > On Sep 24, 2018, at 2:05 PM, Eric Dumazet  wrote:
> > > >
> > > >>
> > > >> Interesting, maybe a bnxt specific issue.
> > > >>
> > > >> It seems their model is to process TX/RX notification in the same 
> > > >> queue,
> > > >> they throw away RX events if budget == 0
> > > >>
> > > >> It means commit e7b9569102995ebc26821789628eef45bd9840d8 is wrong and
> > > >> must be reverted.
> > > >>
> > > >> Otherwise, we have a possibility of blocking a queue under netpoll 
> > > >> pressure.
> > > >
> > > > Hmm, actually a revert might not be enough, since code at lines 
> > > > 2030-2031
> > > > would fire and we might not call napi_complete_done() anyway.
> > > >
> > > > Unfortunately this driver logic is quite complex.
> > > >
> > > > Could you test on other NIC eventually ?
> > > >
> > >
> > > It actually runs OK on ixgbe.
> > >
> > > @Michael, could you please help us with this?
> > >
> > I've taken a quick look using today's net tree plus Eric's
> > poll_one_napi() patch.  The problem I'm seeing is that netpoll calls
> > bnxt_poll() with budget 0.  And since work_done >= budget of 0, we
> > return without calling napi_complete_done() and without arming the
> > interrupt.  netpoll doesn't always call us back until we call
> > napi_complete_done(), right?  So I think if there are in-flight TX
> > completions, we'll miss those.
>
> That's the whole point of netpoll :
>
>  We drain the TX queues, without interrupts being involved at all,
> by calling ->napi() with a zero budget.
>
> napi_complete(), even if called from ->napi() while budget was zero,
> should do nothing but return early.
>
> budget==0 means that ->napi() should process all TX completions.

All TX completions that we can see.  We cannot see the in-flight ones.

If budget is exceeded, I think the assumption is that poll will always
be called again.

>
> So it looks like bnxt has a bug, that is showing up after the latest
> poll_one_napi() patch.
> This latest patch is needed otherwise the cpu attempting the
> netpoll-TX-drain might drain nothing at all,
> since it does not anymore call ndo_poll_controller() that was grabbing
> SCHED bits on all queues (napi_schedule() like calls)

I think the latest patch is preventing the normal interrupt -> NAPI
path from coming in and cleaning the remaining TX completions and
arming the interrupt.
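
The budget == 0 behaviour described in this exchange can be modelled with a toy poll routine. This is a hypothetical user-space sketch, not bnxt or NAPI code: it only encodes the logic as stated in the mails, namely that visible TX completions are reaped regardless of budget, RX is limited by the budget, and a poll that reports work_done >= budget returns without completing and without re-arming the interrupt.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy ring state; field names are invented for the example. */
struct toy_ring {
	int tx_pending;		/* TX completions currently visible */
	int rx_pending;		/* RX packets waiting */
	bool irq_armed;		/* set when polling completes */
};

/* Returns the amount of RX work done, in the style of a poll callback. */
static int toy_poll(struct toy_ring *r, int budget)
{
	int work_done = 0;

	/* TX completions that are visible now are always processed. */
	r->tx_pending = 0;

	while (work_done < budget && r->rx_pending > 0) {
		r->rx_pending--;
		work_done++;
	}

	/* With budget == 0 this is always true: 0 >= 0. */
	if (work_done >= budget)
		return budget;	/* not completed; relies on another poll */

	r->irq_armed = true;	/* completion path re-arms the interrupt */
	return work_done;
}
```

Calling toy_poll() with a budget of 0 drains the visible TX work but leaves the interrupt unarmed, so any completion that arrives afterwards depends on a later poll.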

[bpf-next PATCH 3/3] selftests/bpf: add XDP selftests for modifying and popping VLAN headers

2018-09-25 Thread Jesper Dangaard Brouer

This XDP selftest also contains a small TC-bpf component. It provokes
the generic-XDP bug fixed in the previous commit.

The selftest itself shows how to do VLAN manipulation from XDP and TC.
The test demonstrates how XDP ingress can remove a VLAN tag, and how TC
egress can add back a VLAN tag.

This use-case originates from a production need by an ISP (kviknet.dk),
which gets DSL-lines terminated as VLAN Q-in-Q tagged packets, and wants
to avoid having a net_device for every end-customer on the box doing
the L2 to L3 termination.

The test-setup is done via a veth-pair and creating two network
namespaces (ns1 and ns2).  The 'ns1' simulates the ISP network that is
loading the BPF-progs stripping and adding VLAN IDs.  The 'ns2'
simulates the DSL-customer that is using VLAN tagged packets.

Running the script with --interactive will simply not call the
cleanup function.  This gives the effect of creating a testlab that
the users can inspect and play with.  The --verbose option will simply
request that the shell print input lines as they are read; this
includes comments, which in effect makes the comments visible docs.

Reported-by: Yoel Caspersen 
Signed-off-by: Jesper Dangaard Brouer 
---
 tools/testing/selftests/bpf/Makefile |5 
 tools/testing/selftests/bpf/test_xdp_vlan.c  |  292 ++
 tools/testing/selftests/bpf/test_xdp_vlan.sh |  195 +
 3 files changed, 490 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_xdp_vlan.c
 create mode 100755 tools/testing/selftests/bpf/test_xdp_vlan.sh

diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index fd3851d5c079..b2c80a73b148 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -35,7 +35,7 @@ TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o 
test_tcp_estats.o test
test_get_stack_rawtp.o test_sockmap_kern.o test_sockhash_kern.o \
test_lwt_seg6local.o sendmsg4_prog.o sendmsg6_prog.o 
test_lirc_mode2_kern.o \
get_cgroup_id_kern.o socket_cookie_prog.o test_select_reuseport_kern.o \
-   test_skb_cgroup_id_kern.o bpf_flow.o
+   test_skb_cgroup_id_kern.o bpf_flow.o test_xdp_vlan.o
 
 # Order correspond to 'make run_tests' order
 TEST_PROGS := test_kmod.sh \
@@ -48,7 +48,8 @@ TEST_PROGS := test_kmod.sh \
test_lwt_seg6local.sh \
test_lirc_mode2.sh \
test_skb_cgroup_id.sh \
-   test_flow_dissector.sh
+   test_flow_dissector.sh \
+   test_xdp_vlan.sh
 
 # Compile but not part of 'make run_tests'
 TEST_GEN_PROGS_EXTENDED = test_libbpf_open test_sock_addr 
test_skb_cgroup_id_user \
diff --git a/tools/testing/selftests/bpf/test_xdp_vlan.c 
b/tools/testing/selftests/bpf/test_xdp_vlan.c
new file mode 100644
index ..365a7d2d9f5c
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_xdp_vlan.c
@@ -0,0 +1,292 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *  Copyright(c) 2018 Jesper Dangaard Brouer.
+ *
+ * XDP/TC VLAN manipulation example
+ *
+ * GOTCHA: Remember to disable NIC hardware offloading of VLANs,
+ * else the VLAN tags are NOT inlined in the packet payload:
+ *
+ *  # ethtool -K ixgbe2 rxvlan off
+ *
+ * Verify setting:
+ *  # ethtool -k ixgbe2 | grep rx-vlan-offload
+ *  rx-vlan-offload: off
+ *
+ */
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "bpf_helpers.h"
+#include "bpf_endian.h"
+
+/* linux/if_vlan.h have not exposed this as UAPI, thus mirror some here
+ *
+ * struct vlan_hdr - vlan header
+ * @h_vlan_TCI: priority and VLAN ID
+ * @h_vlan_encapsulated_proto: packet type ID or len
+ */
+struct _vlan_hdr {
+   __be16 h_vlan_TCI;
+   __be16 h_vlan_encapsulated_proto;
+};
+#define VLAN_PRIO_MASK 0xe000 /* Priority Code Point */
+#define VLAN_PRIO_SHIFT 13
+#define VLAN_CFI_MASK  0x1000 /* Canonical Format Indicator */
+#define VLAN_TAG_PRESENT   VLAN_CFI_MASK
+#define VLAN_VID_MASK  0x0fff /* VLAN Identifier */
+#define VLAN_N_VID 4096
+
+struct parse_pkt {
+   __u16 l3_proto;
+   __u16 l3_offset;
+   __u16 vlan_outer;
+   __u16 vlan_inner;
+   __u8  vlan_outer_offset;
+   __u8  vlan_inner_offset;
+};
+
+char _license[] SEC("license") = "GPL";
+
+static __always_inline
+bool parse_eth_frame(struct ethhdr *eth, void *data_end, struct parse_pkt *pkt)
+{
+   __u16 eth_type;
+   __u8 offset;
+
+   offset = sizeof(*eth);
+   /* Make sure packet is large enough for parsing eth + 2 VLAN headers */
+   if ((void *)eth + offset + (2*sizeof(struct _vlan_hdr)) > data_end)
+   return false;
+
+   eth_type = eth->h_proto;
+
+   /* Handle outer VLAN tag */
+   if (eth_type == bpf_htons(ETH_P_8021Q)
+   || eth_type == bpf_htons(ETH_P_8021AD)) {
+   struct _vlan_hdr *vlan_hdr;
+
+   vlan_hdr = (void *)eth + offset;
+
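
For readers following the parsing logic in the (truncated) patch above, here is a minimal user-space sketch, not part of the patch, of how an outer VLAN tag can be located in a raw Ethernet frame and its VLAN ID extracted. The names rd_be16(), outer_vlan_id() and the SKETCH_* constants are hypothetical, and unlike the BPF program this sketch only requires room for a single VLAN header.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_ETH_HLEN      14      /* dst MAC + src MAC + EtherType */
#define SKETCH_VLAN_HLEN     4       /* TCI + encapsulated EtherType */
#define SKETCH_ETH_P_8021Q   0x8100  /* 802.1Q VLAN tag */
#define SKETCH_ETH_P_8021AD  0x88A8  /* 802.1ad service tag */
#define SKETCH_VLAN_VID_MASK 0x0fff  /* low 12 bits of the TCI */

/* Read a big-endian (network order) 16-bit value from a byte buffer. */
static uint16_t rd_be16(const uint8_t *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

/*
 * Return true and store the VLAN ID in *vid if the frame carries an
 * outer VLAN tag; return false if the frame is too short or untagged.
 */
static bool outer_vlan_id(const uint8_t *frame, size_t len, uint16_t *vid)
{
	uint16_t eth_type;

	/* Bounds check first, mirroring the data_end check in the BPF code */
	if (len < SKETCH_ETH_HLEN + SKETCH_VLAN_HLEN)
		return false;

	/* EtherType sits after the two 6-byte MAC addresses */
	eth_type = rd_be16(frame + 12);
	if (eth_type != SKETCH_ETH_P_8021Q && eth_type != SKETCH_ETH_P_8021AD)
		return false;

	/* The TCI follows the EtherType; mask off priority and CFI bits */
	*vid = rd_be16(frame + SKETCH_ETH_HLEN) & SKETCH_VLAN_VID_MASK;
	return true;
}
```

A TCI of 0xa00a, for instance, carries priority 5 in the top three bits and VLAN ID 10 in the low twelve.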

[bpf-next PATCH 2/3] bpf: make TC vlan bpf_helpers avail to selftests

2018-09-25 Thread Jesper Dangaard Brouer

The helper bpf_skb_vlan_push is needed by the next patch, and the helper
bpf_skb_vlan_pop is added for completeness regarding VLAN helpers.

Signed-off-by: Jesper Dangaard Brouer 
---
 tools/testing/selftests/bpf/bpf_helpers.h |4 
 1 file changed, 4 insertions(+)

diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index e4be7730222d..d057e6891a6b 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -143,6 +143,10 @@ static unsigned long long (*bpf_skb_cgroup_id)(void *ctx) =
(void *) BPF_FUNC_skb_cgroup_id;
 static unsigned long long (*bpf_skb_ancestor_cgroup_id)(void *ctx, int level) =
(void *) BPF_FUNC_skb_ancestor_cgroup_id;
+static int (*bpf_skb_vlan_push)(void *ctx, __be16 vlan_proto, __u16 vlan_tci) =
+   (void *) BPF_FUNC_skb_vlan_push;
+static int (*bpf_skb_vlan_pop)(void *ctx) =
+   (void *) BPF_FUNC_skb_vlan_pop;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions

[bpf-next PATCH 1/3] net: fix generic XDP to handle if eth header was mangled

2018-09-25 Thread Jesper Dangaard Brouer

XDP can modify (and resize) the Ethernet header in the packet.

There is a bug in generic-XDP, because skb->protocol and skb->pkt_type
are set up before reaching (netif_receive_)generic_xdp.

This bug was hit when XDP was popping VLAN headers (changing
eth->h_proto), as skb->protocol still contained the VLAN indication
(ETH_P_8021Q), causing invocation of skb_vlan_untag(skb), which corrupts
the packet (basically popping the VLAN again).

This patch detects when XDP has changed the eth header in such a way
that the SKB fields need to be updated.

Fixes: d445516966dc ("net: xdp: support xdp generic on virtual devices")
Signed-off-by: Jesper Dangaard Brouer 
---
 net/core/dev.c |   14 ++
 1 file changed, 14 insertions(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index ca78dc5a79a3..db6d89f536cb 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4258,6 +4258,9 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
struct netdev_rx_queue *rxqueue;
void *orig_data, *orig_data_end;
u32 metalen, act = XDP_DROP;
+   __be16 orig_eth_type;
+   struct ethhdr *eth;
+   bool orig_bcast;
int hlen, off;
u32 mac_len;
 
@@ -4298,6 +4301,9 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
xdp->data_hard_start = skb->data - skb_headroom(skb);
orig_data_end = xdp->data_end;
orig_data = xdp->data;
+   eth = (struct ethhdr *)xdp->data;
+   orig_bcast = is_multicast_ether_addr_64bits(eth->h_dest);
+   orig_eth_type = eth->h_proto;
 
rxqueue = netif_get_rxqueue(skb);
xdp->rxq = &rxqueue->xdp_rxq;
@@ -4321,6 +4327,14 @@ static u32 netif_receive_generic_xdp(struct sk_buff *skb,
 
}
 
+   /* check if XDP changed eth hdr such SKB needs update */
+   eth = (struct ethhdr *)xdp->data;
+   if ((orig_eth_type != eth->h_proto) ||
+   (orig_bcast != is_multicast_ether_addr_64bits(eth->h_dest))) {
+   __skb_push(skb, mac_len);
+   skb->protocol = eth_type_trans(skb, skb->dev);
+   }
+
switch (act) {
case XDP_REDIRECT:
case XDP_TX:
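
The check this patch adds can be illustrated outside the kernel. The following is a minimal user-space sketch under stated assumptions, not the kernel code: it snapshots the EtherType and the multicast (group) bit of the destination MAC before a program runs, then compares them against the frame afterwards. The names eth_snapshot, eth_snap() and eth_hdr_changed() are hypothetical; in the kernel a detected change is what triggers the eth_type_trans() call.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* The two header properties the patch compares before and after XDP. */
struct eth_snapshot {
	uint16_t proto_be;  /* EtherType, kept in network byte order */
	bool mcast;         /* group bit of the destination MAC address */
};

/* Capture the properties from the start of an Ethernet frame. */
static struct eth_snapshot eth_snap(const uint8_t *frame)
{
	struct eth_snapshot s;

	memcpy(&s.proto_be, frame + 12, sizeof(s.proto_be));
	/* Lowest bit of the first destination octet marks multicast/broadcast */
	s.mcast = (frame[0] & 0x01) != 0;
	return s;
}

/*
 * Return true if the frame (at its possibly moved start) no longer
 * matches the snapshot, i.e. protocol metadata would need recomputing.
 */
static bool eth_hdr_changed(struct eth_snapshot before,
			    const uint8_t *frame_after)
{
	struct eth_snapshot after = eth_snap(frame_after);

	return before.proto_be != after.proto_be ||
	       before.mcast != after.mcast;
}
```

Popping a VLAN tag changes the EtherType from 0x8100 to that of the inner protocol, so the comparison fires. A program that leaves both fields alone does not trigger the extra work.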

[bpf-next PATCH 0/3] bpf/xdp: fix generic-XDP and demonstrate VLAN manipulation

2018-09-25 Thread Jesper Dangaard Brouer

While implementing PoC building blocks for eBPF code (XDP+TC) that can
manipulate VLAN headers, I discovered a bug in generic-XDP.

The fix should be backported to stable kernels.  Even though
generic-XDP was introduced in v4.12, I think the bug is not exposed
until v4.14, in the mentioned Fixes commit.

---
P.S. Sent from Embedded Recipes 2018 in Paris

Jesper Dangaard Brouer (3):
  net: fix generic XDP to handle if eth header was mangled
  bpf: make TC vlan bpf_helpers avail to selftests
  selftests/bpf: add XDP selftests for modifying and popping VLAN headers


 net/core/dev.c   |   14 +
 tools/testing/selftests/bpf/Makefile |5 
 tools/testing/selftests/bpf/bpf_helpers.h|4 
 tools/testing/selftests/bpf/test_xdp_vlan.c  |  292 ++
 tools/testing/selftests/bpf/test_xdp_vlan.sh |  195 +
 5 files changed, 508 insertions(+), 2 deletions(-)
 create mode 100644 tools/testing/selftests/bpf/test_xdp_vlan.c
 create mode 100755 tools/testing/selftests/bpf/test_xdp_vlan.sh


Re: [PATCH net 00/15] netpoll: avoid capture effects for NAPI drivers

2018-09-25 Thread Eric Dumazet

On Tue, Sep 25, 2018 at 7:02 AM Michael Chan  wrote:
>
> On Mon, Sep 24, 2018 at 2:18 PM Song Liu  wrote:
> >
> >
> >
> > > On Sep 24, 2018, at 2:05 PM, Eric Dumazet  wrote:
> > >
> > >>
> > >> Interesting, maybe a bnxt specific issue.
> > >>
> > >> It seems their model is to process TX/RX notification in the same queue,
> > >> they throw away RX events if budget == 0
> > >>
> > >> It means commit e7b9569102995ebc26821789628eef45bd9840d8 is wrong and
> > >> must be reverted.
> > >>
> > >> Otherwise, we have a possibility of blocking a queue under netpoll 
> > >> pressure.
> > >
> > > Hmm, actually a revert might not be enough, since code at lines 2030-2031
> > > would fire and we might not call napi_complete_done() anyway.
> > >
> > > Unfortunately this driver logic is quite complex.
> > >
> > > Could you test on other NIC eventually ?
> > >
> >
> > It actually runs OK on ixgbe.
> >
> > @Michael, could you please help us with this?
> >
> I've taken a quick look using today's net tree plus Eric's
> poll_one_napi() patch.  The problem I'm seeing is that netpoll calls
> bnxt_poll() with budget 0.  And since work_done >= budget of 0, we
> return without calling napi_complete_done() and without arming the
> interrupt.  netpoll doesn't always call us back until we call
> napi_complete_done(), right?  So I think if there are in-flight TX
> completions, we'll miss those.

That's the whole point of netpoll :

 We drain the TX queues, without interrupts being involved at all,
by calling ->napi() with a zero budget.

napi_complete(), even if called from ->napi() while budget was zero,
should do nothing but return early.

budget==0 means that ->napi() should process all TX completions.

So it looks like bnxt has a bug that is showing up after the latest
poll_one_napi() patch.
This latest patch is needed; otherwise the CPU attempting the
netpoll TX drain might drain nothing at all, since it no longer calls
ndo_poll_controller(), which was grabbing SCHED bits on all queues
(napi_schedule()-like calls).
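
The contract described above can be sketched as a toy model. This is a simplified illustration under stated assumptions, not bnxt or kernel code: ring_sketch and poll_sketch() are hypothetical names, and the irq_armed flag merely stands in for calling napi_complete_done() and re-enabling the interrupt.

```c
#include <stdbool.h>

/* Toy stand-in for one combined TX/RX completion ring. */
struct ring_sketch {
	int tx_pending;   /* TX completions waiting to be reaped */
	int rx_pending;   /* RX packets waiting to be processed */
	bool irq_armed;   /* models napi_complete_done() + IRQ re-arm */
};

/*
 * Poll handler sketch: TX completions are always drained; RX work is
 * bounded by budget. With budget == 0 (the netpoll case) no RX packet
 * is processed and the interrupt state is left untouched.
 */
static int poll_sketch(struct ring_sketch *r, int budget)
{
	int work_done = 0;

	/* Reap every TX completion regardless of budget */
	r->tx_pending = 0;

	/* netpoll only wants the TX drain */
	if (budget == 0)
		return 0;

	while (r->rx_pending > 0 && work_done < budget) {
		r->rx_pending--;
		work_done++;
	}

	/* Only complete and re-arm when the budget was not exhausted */
	if (work_done < budget)
		r->irq_armed = true;

	return work_done;
}
```

In this model a zero-budget call drains TX and returns early, and a later call with a normal budget picks up the RX work that was left in place.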

Re: [PATCH v2 net-next 0/9] bnxt_en: devlink param updates

2018-09-25 Thread Jakub Kicinski

On Tue, 25 Sep 2018 09:44:41 +0530, Vasundhara Volam wrote:
> On Mon, Sep 24, 2018 at 10:11 PM Jakub Kicinski
>  wrote:
> >
> > On Mon, 24 Sep 2018 10:46:12 +0530, Vasundhara Volam wrote:  
> > > This patchset adds support for 3 generic and 1 driver-specific devlink
> > > parameters. Add documentation for these configuration parameters.
> > >
> > > Also, this patchset adds support to return proper error code if
> > > HWRM_NVM_GET/SET_VARIABLE commands return error code
> > > HWRM_ERR_CODE_RESOURCE_ACCESS_DENIED.
> > >
> > > v1->v2:
> > > -Remove hw_tc_offload parameter.
> > > -Update all patches with Cc of MAINTAINERS.
> > > -Add more description in commit message for device specific parameter.
> > > -Add a new Documentation/networking/devlink-params.txt with some
> > > generic devlink parameters information.
> > > -Add a new Documentation/networking/devlink-params-bnxt_en.txt with 
> > > devlink
> > > parameters information that are supported by bnxt_en driver.
> > >
> > > Vasundhara Volam (9):
> > >   devlink: Add generic parameter ignore_ari
> > >   devlink: Add generic parameter msix_vec_per_pf_max
> > >   devlink: Add generic parameter msix_vec_per_pf_min  
> >
> > Nobody agreed with me that we need structure the PCIe bits better so
> > I'll let go...
> >  
> > >   bnxt_en: Use ignore_ari devlink parameter
> > >   bnxt_en: return proper error when FW returns
> > > HWRM_ERR_CODE_RESOURCE_ACCESS_DENIED
> > >   bnxt_en: Use msix_vec_per_pf_max and msix_vec_per_pf_min devlink
> > > params.
> > >   bnxt_en: Add a driver specific gre_ver_check devlink parameter.  
> >
> > This looks like configuring forwarding rules with devlink, but again,  
> Do you think, this parameter should be made generic?

By no means.

> > I won't object if I'm the only one who finds this inappropriate.
> >
> > You should CC people who gave you feedback on the previous version.  
> Sorry, I will add in the next version of the patchset. Thanks.
> >  
> > >   devlink: Add Documentation/networking/devlink-params.txt
> > >   devlink: Add Documentation/networking/devlink-params-bnxt.txt

Re: [PATCH v2 net-next 9/9] devlink: Add Documentation/networking/devlink-params-bnxt.txt

2018-09-25 Thread Jakub Kicinski

On Tue, 25 Sep 2018 09:37:20 +0530, Vasundhara Volam wrote:
> > Why duplicate the description of the generic parameters?  
> Not all generic parameters are used by all drivers. So, I want to add
> information about
> type and configuration mode about generic parameters used by bnxt_en driver.
> I can remove description part keeping type and configuration mode, if
> it looks duplication.

That'd be better.

Re: [PATCH net 00/15] netpoll: avoid capture effects for NAPI drivers

2018-09-25 Thread Michael Chan

On Mon, Sep 24, 2018 at 2:18 PM Song Liu  wrote:
>
>
>
> > On Sep 24, 2018, at 2:05 PM, Eric Dumazet  wrote:
> >
> >>
> >> Interesting, maybe a bnxt specific issue.
> >>
> >> It seems their model is to process TX/RX notification in the same queue,
> >> they throw away RX events if budget == 0
> >>
> >> It means commit e7b9569102995ebc26821789628eef45bd9840d8 is wrong and
> >> must be reverted.
> >>
> >> Otherwise, we have a possibility of blocking a queue under netpoll 
> >> pressure.
> >
> > Hmm, actually a revert might not be enough, since code at lines 2030-2031
> > would fire and we might not call napi_complete_done() anyway.
> >
> > Unfortunately this driver logic is quite complex.
> >
> > Could you test on other NIC eventually ?
> >
>
> It actually runs OK on ixgbe.
>
> @Michael, could you please help us with this?
>
I've taken a quick look using today's net tree plus Eric's
poll_one_napi() patch.  The problem I'm seeing is that netpoll calls
bnxt_poll() with budget 0.  And since work_done >= budget of 0, we
return without calling napi_complete_done() and without arming the
interrupt.  netpoll doesn't always call us back until we call
napi_complete_done(), right?  So I think if there are in-flight TX
completions, we'll miss those.

Re: requesting stable backport of BPF security fix (commit dd066823db2ac4e22f721ec85190817b58059a54)

2018-09-25 Thread Daniel Borkmann

On 09/25/2018 02:46 PM, Jann Horn wrote:
> Hi!
> 
> Per the policy at Documentation/networking/netdev-FAQ.rst, I'm sending
> this to netdev@ and davem, rather than stable@; with a CC to security@
> because I believe that this is a security process issue.
> 
> Upstream commit dd066823db2ac4e22f721ec85190817b58059a54
> ("bpf/verifier: disallow pointer subtraction") fixes a security bug
> (kernel pointer leak to unprivileged userspace). The fix has been in
> Linus' tree since about a week ago, but the patch still doesn't appear
> in Greg's linux-4.18.y linux-stable-rc repo, in Greg's 4.18
> stable-queue, or in davem's stable queue at
> http://patchwork.ozlabs.org/bundle/davem/stable/?state=* .
> 
> Please queue it up for backporting.

Done & flushed out now, sorry for the delay.

Thanks,
Daniel

Re: netlink: 16 bytes leftover after parsing attributes in process `ip'.

2018-09-25 Thread Stephen Hemminger

On Tue, 25 Sep 2018 14:34:08 +0200
Christian Brauner  wrote:

> On Tue, Sep 25, 2018, 14:07 Stephen Hemminger 
> wrote:
> 
> > On Tue, 25 Sep 2018 11:49:10 +0200
> > Christian Brauner  wrote:
> >  
> > > On Mon, Sep 24, 2018 at 09:19:06PM -0600, David Ahern wrote:  
> > > > On top of net-next I am seeing a dmesg error:
> > > >
> > > > netlink: 16 bytes leftover after parsing attributes in process `ip'.
> > > >
> > > > I traced it to address lists and commit:
> > > >
> > > > commit 6ecf4c37eb3e89b0832c9616089a5cdca3747da7
> > > > Author: Christian Brauner 
> > > > Date:   Tue Sep 4 21:53:50 2018 +0200
> > > >
> > > > ipv6: enable IFA_TARGET_NETNSID for RTM_GETADDR
> > > >
> > > > Per the commit you are trying to guess whether the ancillary header is
> > > > an ifinfomsg or a ifaddrmsg. I am guessing you are guessing wrong.  
> > :-)  
> > >
> > > Well, I currently don't guess at all. :) I'm parsing with struct
> > > ifaddrmsg as assumed header size but ignore parsing errors when that
> > > fails. You don't get the niceties of the new property if you don't pack  
> >
> > There are legacy parts of netlink interface with kernel.
> > The ABI has evolved over time but some old parts are stuck in the past.
> >
> >  
> > > > I don't have time to take this to ground, but address listing is not  
> > the  
> > > > only area subject to iproute2's SNAFU of infomsg everywhere on dumps. I
> > > > have thought about this for route dumps, but its solution does not work
> > > > here. You'll need to find something because the current warning on  
> > every  
> > > > address dump is not acceptable.  
> > >
> > > Two points before I propose a mitigation:
> > >
> > > 1. The burden of seeing pr_warn_ratelimited() messages in dmesg when
> > >userspace is doing something wrong is imho justifiable.
> > >Actually, I would argue that we should not hide the problem from
> > >userspace at all. The rate-limited (so no logging DOS afaict) warning
> > >messages are a perfect indicator that a tool is doing something wrong
> > >*without* introducing any regressions.
> > >The rtnetlink manpage clearly indicates that ifaddrmsg is supposed to
> > >be used too. Additionally, userspace stuffs an ifinfomsg in there but
> > >expects to receive ifaddrmsg. They should be warned loudly. :) So I
> > >actually like the warning messages.
> > > 2. Userspace should be fixed. Especially such an important standard tool
> > >as iproute2 that is maintained on git.kernel.org (glibc is already
> > >doing the right thing).
> > >
> > > So if people really want to hide this issue as much as we can then we
> > > can play the guessing game. I could send a patch that roughly does the
> > > following:
> > >
> > > if (nlmsg_len(cb->nlh) < sizeof(struct ifinfomsg))
> > > guessed_header_len = sizeof(struct ifaddrmsg);
> > > else
> > > guessed_header_len = sizeof(struct ifinfomsg);
> > >
> > > This will work since sizeof(ifaddrmsg) == 8 and sizeof(ifinfomsg) == 16.
> > > The only valid property for RTM_GETADDR requests is IFA_TARGET_NETNSID.
> > > This property is a __s32 which should bring the message up to 12 bytes
> > > (not sure about alignment requirements and where we might end up then)
> > > which is still less than the 16 bytes without that property from
> > > ifinfomsg. That's a hacky hacky hack-hack and will likely work but will
> > > break when ifaddrmsg grows a new member or we introduce another property
> > > that is valid in RTM_GETADDR requests. It also will not work cleanly
> > > when users stuff additional properties in there that are valid for the
> > > address family but are not used in RTM_GETADDR requests.
> > >
> > > I would like to hear what other people and davem think we should do.
> > > Patch it away or print the warning.
> > >
> > > Christian  
> >
> > You can't break old programs. That is one of the rules of kernel.
> > Therefore, please either revert the kernel change or put the new attribute
> > in a place where old versions do not cause problem.
> >  
> 
> Sorry, there's a misunderstanding here. The code doesn't regress anything.
> The patch  was  written  in a backward compatible way. The only thing it
> causes are rate-limited logging messages when the wrong struct is passed.
> But the request still succeeds. The issue is with the logging afaict.
> 
> Christian


That still means enterprise distributions that use the current kernel will
get customer complaints. You need to remove the warning.
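
As a way to check the size arithmetic in the quoted proposal, here is a small sketch built on the UAPI headers. It is an illustration only, not a proposed kernel patch: guess_header_len() mirrors the quoted pseudo-code, and ifaddr_req_len_with_s32_attr() is a hypothetical helper that adds the netlink attribute header (NLA_HDRLEN) to the 4-byte payload, which is the alignment question the quoted text leaves open.

```c
#include <stddef.h>
#include <stdint.h>
#include <linux/netlink.h>    /* NLA_HDRLEN, NLA_ALIGN */
#include <linux/rtnetlink.h>  /* struct ifinfomsg */
#include <linux/if_addr.h>    /* struct ifaddrmsg */

/* Length-based guess of which ancillary header a request carries. */
static size_t guess_header_len(size_t payload_len)
{
	if (payload_len < sizeof(struct ifinfomsg))
		return sizeof(struct ifaddrmsg);
	return sizeof(struct ifinfomsg);
}

/*
 * Payload length of a request that uses struct ifaddrmsg followed by
 * one attribute carrying a 32-bit value (such as IFA_TARGET_NETNSID).
 */
static size_t ifaddr_req_len_with_s32_attr(void)
{
	return sizeof(struct ifaddrmsg) + NLA_HDRLEN +
	       NLA_ALIGN(sizeof(int32_t));
}
```

By this arithmetic the attribute costs 8 bytes (a 4-byte attribute header plus the 4-byte value), so an ifaddrmsg request carrying it comes to 16 bytes, the same as a bare ifinfomsg. If that holds, payload length alone would not distinguish the two cases, which is one more reason to treat the heuristic as fragile.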

[PATCH iproute2 1/1] DEBUG: Fix make check when need build generate_nlmsg

2018-09-25 Thread Petr Vorel

make check from top level Makefile defines several flags which break
building generate_nlmsg:

$ make check
make -C tools
gcc  -Wall -Wstrict-prototypes  -Wmissing-prototypes -Wmissing-declarations 
-Wold-style-definition -Wformat=2 -O2 -I../include -I../include/uapi 
-DRESOLVE_HOSTNAMES -DLIBDIR=\"/usr/lib\" -DCONFDIR=\"/etc/iproute2\" 
-DNETNS_RUN_DIR=\"/var/run/netns\" -DNETNS_ETC_DIR=\"/etc/netns\" -D_GNU_SOURCE 
-D_FILE_OFFSET_BITS=64 -D_LARGEFILE_SOURCE -D_LARGEFILE64_SOURCE  -DHAVE_SETNS 
-DHAVE_SELINUX -DHAVE_ELF -DHAVE_LIBMNL -I/usr/include/libmnl -DNEED_STRLCPY 
-DHAVE_LIBCAP ../lib/libutil.a ../lib/libnetlink.a -lselinux -lelf -lmnl -lcap  
-I../../include -include../../include/uapi/linux/netlink.h -o generate_nlmsg 
generate_nlmsg.c ../../lib/libnetlink.c -lmnl
gcc: error: ../lib/libutil.a: No such file or directory
gcc: error: ../lib/libnetlink.a: No such file or directory
make[2]: *** [Makefile:5: generate_nlmsg] Error 1
make[1]: *** [Makefile:40: generate_nlmsg] Error 2

To fix it, reset CFLAGS in the sub-Makefile and remove LDLIBS entirely (as
the required -lmnl flag was specified in 5dc2204c ("testsuite: add libmnl")).

Fixes: 8804a8c0 ("Makefile: Add check target")

Signed-off-by: Petr Vorel 
---
Hi Stephen,

I'm sorry for this regression.

Kind regards,
Petr
---
 testsuite/tools/Makefile | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/testsuite/tools/Makefile b/testsuite/tools/Makefile
index 7d53d226..e1d9bfef 100644
--- a/testsuite/tools/Makefile
+++ b/testsuite/tools/Makefile
@@ -1,8 +1,9 @@
 # SPDX-License-Identifier: GPL-2.0
+CFLAGS=
 include ../../config.mk
 
 generate_nlmsg: generate_nlmsg.c ../../lib/libnetlink.c
-   $(CC) $(CPPFLAGS) $(CFLAGS) $(LDLIBS) $(EXTRA_CFLAGS) -I../../include -include../../include/uapi/linux/netlink.h -o $@ $^ -lmnl
+   $(CC) $(CPPFLAGS) $(CFLAGS) $(EXTRA_CFLAGS) -I../../include -include../../include/uapi/linux/netlink.h -o $@ $^ -lmnl
 
 clean:
rm -f generate_nlmsg
-- 
2.19.0

requesting stable backport of BPF security fix (commit dd066823db2ac4e22f721ec85190817b58059a54)

2018-09-25 Thread Jann Horn

Hi!

Per the policy at Documentation/networking/netdev-FAQ.rst, I'm sending
this to netdev@ and davem, rather than stable@; with a CC to security@
because I believe that this is a security process issue.

Upstream commit dd066823db2ac4e22f721ec85190817b58059a54
("bpf/verifier: disallow pointer subtraction") fixes a security bug
(kernel pointer leak to unprivileged userspace). The fix has been in
Linus' tree since about a week ago, but the patch still doesn't appear
in Greg's linux-4.18.y linux-stable-rc repo, in Greg's 4.18
stable-queue, or in davem's stable queue at
http://patchwork.ozlabs.org/bundle/davem/stable/?state=* .

Please queue it up for backporting.

I am curious: Why was this not queued up for stable immediately? Or
was it, and I looked in the wrong place?

Re: [RFC PATCH iproute2-next] System specification health API

2018-09-25 Thread Eran Ben Elisha





On 9/16/2018 10:57 PM, Andrew Lunn wrote:
>> Why is this going under iproute rather than using one of the existing sensor
>> API's.
>> For example Intel NIC's have thermal sensors etc.
>
> Hi Stephen
>
> These are not that sort of sensors. This is part of the naming problem
> here. It is not really to do with health, it is about exceptions and
> bugs. And the sensors are more like timeouts and watchdogs.
>
> It is clear that the current names lead to a lot of confusion. Maybe:
>
> health -> exception
> sensor -> condition
>
> Andrew

I think renaming to those names can work well.

(Sorry for the late response, local holiday season...)

Eran

Re: netlink: 16 bytes leftover after parsing attributes in process `ip'.

2018-09-25 Thread Stephen Hemminger

On Tue, 25 Sep 2018 11:49:10 +0200
Christian Brauner  wrote:

> On Mon, Sep 24, 2018 at 09:19:06PM -0600, David Ahern wrote:
> > On top of net-next I am seeing a dmesg error:
> > 
> > netlink: 16 bytes leftover after parsing attributes in process `ip'.
> > 
> > I traced it to address lists and commit:
> > 
> > commit 6ecf4c37eb3e89b0832c9616089a5cdca3747da7
> > Author: Christian Brauner 
> > Date:   Tue Sep 4 21:53:50 2018 +0200
> > 
> > ipv6: enable IFA_TARGET_NETNSID for RTM_GETADDR
> > 
> > Per the commit you are trying to guess whether the ancillary header is
> > an ifinfomsg or a ifaddrmsg. I am guessing you are guessing wrong. :-)  
> 
> Well, I currently don't guess at all. :) I'm parsing with struct
> ifaddrmsg as assumed header size but ignore parsing errors when that
> fails. You don't get the niceties of the new property if you don't pack

There are legacy parts of netlink interface with kernel.
The ABI has evolved over time but some old parts are stuck in the past.


> > I don't have time to take this to ground, but address listing is not the
> > only area subject to iproute2's SNAFU of infomsg everywhere on dumps. I
> > have thought about this for route dumps, but its solution does not work
> > here. You'll need to find something because the current warning on every
> > address dump is not acceptable.  
> 
> Two points before I propose a mitigation:
> 
> 1. The burden of seeing pr_warn_ratelimited() messages in dmesg when
>userspace is doing something wrong is imho justifiable.
>Actually, I would argue that we should not hide the problem from
>userspace at all. The rate-limited (so no logging DOS afaict) warning
>messages are a perfect indicator that a tool is doing something wrong
>*without* introducing any regressions.
>The rtnetlink manpage clearly indicates that ifaddrmsg is supposed to
>be used too. Additionally, userspace stuffs an ifinfomsg in there but
>expects to receive ifaddrmsg. They should be warned loudly. :) So I
>actually like the warning messages.
> 2. Userspace should be fixed. Especially such an important standard tool
>as iproute2 that is maintained on git.kernel.org (glibc is already
>doing the right thing).
> 
> So if people really want to hide this issue as much as we can then we
> can play the guessing game. I could send a patch that roughly does the
> following:
> 
> if (nlmsg_len(cb->nlh) < sizeof(struct ifinfomsg))
> guessed_header_len = sizeof(struct ifaddrmsg);
> else
> guessed_header_len = sizeof(struct ifinfomsg);
> 
> This will work since sizeof(ifaddrmsg) == 8 and sizeof(ifinfomsg) == 16.
> The only valid property for RTM_GETADDR requests is IFA_TARGET_NETNSID.
> This property is a __s32 which should bring the message up to 12 bytes
> (not sure about alignment requirements and where we might end up then)
> which is still less than the 16 bytes without that property from
> ifinfomsg. That's a hacky hacky hack-hack and will likely work but will
> break when ifaddrmsg grows a new member or we introduce another property
> that is valid in RTM_GETADDR requests. It also will not work cleanly
> when users stuff additional properties in there that are valid for the
> address family but are not used in RTM_GETADDR requests.
> 
> I would like to hear what other people and davem think we should do.
> Patch it away or print the warning.
> 
> Christian

You can't break old programs. That is one of the rules of kernel.
Therefore, please either revert the kernel change or put the new attribute
in a place where old versions do not cause problem.

There are people who run new kernels on old versions of iproute (like enterprise
distributions) and vice versa.
