[PATCH bpf v3] tools/bpftool: fix a bug in bpftool perf

2018-06-11 Thread Yonghong Song
Commit b04df400c302 ("tools/bpftool: add perf subcommand")
introduced bpftool subcommand perf to query bpf program
kuprobe and tracepoint attachments.

The perf subcommand will first test whether bpf subcommand
BPF_TASK_FD_QUERY is supported in kernel or not. It does it
by opening a file with argv[0] and feeds the file descriptor
and current task pid to the kernel for querying.

Such an approach won't work if the argv[0] cannot be opened
successfully in the current directory. This is especially
true when bpftool is accessible through PATH env variable.
The error below reflects the open failure for file argv[0]
at home directory.

  [yhs@localhost ~]$ which bpftool
  /usr/local/sbin/bpftool
  [yhs@localhost ~]$ bpftool perf
  Error: perf_query_support: No such file or directory

To fix the issue, let us open root directory ("/")
which exists in every linux system. With the fix, the
error message will correctly reflect the permission issue.

  [yhs@localhost ~]$ which bpftool
  /usr/local/sbin/bpftool
  [yhs@localhost ~]$ bpftool perf
  Error: perf_query_support: Operation not permitted
  HINT: non root or kernel doesn't support TASK_FD_QUERY

Fixes: b04df400c302 ("tools/bpftool: add perf subcommand")
Reported-by: Alexei Starovoitov 
Signed-off-by: Yonghong Song 
---
 tools/bpf/bpftool/perf.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

Changelogs:
 v2 -> v3:
   . fix a stupid error (flippingg the condition) I introduced in v2.
 v1 -> v2:
   . remove '\n' in the p_err format string in order to
 have valid json output.

diff --git a/tools/bpf/bpftool/perf.c b/tools/bpf/bpftool/perf.c
index ac6b1a12c9b7..b76b77dcfd1f 100644
--- a/tools/bpf/bpftool/perf.c
+++ b/tools/bpf/bpftool/perf.c
@@ -29,9 +29,10 @@ static bool has_perf_query_support(void)
if (perf_query_supported)
goto out;
 
-   fd = open(bin_name, O_RDONLY);
+   fd = open("/", O_RDONLY);
if (fd < 0) {
-   p_err("perf_query_support: %s", strerror(errno));
+   p_err("perf_query_support: cannot open directory \"/\" (%s)",
+ strerror(errno));
goto out;
}
 
-- 
2.14.3



Re: [PATCH net] tls: fix NULL pointer dereference on poll

2018-06-11 Thread Christoph Hellwig
> Looks like the recent conversion from poll to poll_mask callback started
> in 152524231023 ("net: add support for ->poll_mask in proto_ops") missed
> to eventually convert kTLS, too: TCP's ->poll was converted over to the
> ->poll_mask in commit 2c7d3dacebd4 ("net/tcp: convert to ->poll_mask")
> and therefore kTLS wrongly saved the ->poll old one which is now NULL.

Looks like this TLS code was added in the same cycle. 

> diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
> index 301f224..a127d61 100644
> --- a/net/tls/tls_main.c
> +++ b/net/tls/tls_main.c
> @@ -712,7 +712,7 @@ static int __init tls_register(void)
>   build_protos(tls_prots[TLSV4], _prot);
>  
>   tls_sw_proto_ops = inet_stream_ops;
> - tls_sw_proto_ops.poll = tls_sw_poll;
> + tls_sw_proto_ops.poll_mask = tls_sw_poll_mask;
>   tls_sw_proto_ops.splice_read = tls_sw_splice_read;

Not new in this patch, but copying ops vectors is a very bad idea, not
only because your new instance can't be marked const and you thus open
up exploit vectors. I would suggest to clean this up eventually.

> +__poll_t tls_sw_poll_mask(struct socket *sock, __poll_t events)
>  {
>   struct sock *sk = sock->sk;
>   struct tls_context *tls_ctx = tls_get_ctx(sk);
>   struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx);
> + __poll_t mask;
>  
> + /* Grab EPOLLOUT and EPOLLHUP from the underlying socket */
> + mask = ctx->sk_poll_mask(sock, events);
>  
> + /* Clear EPOLLIN bits, and set based on recv_pkt */
> + mask &= ~(EPOLLIN | EPOLLRDNORM);
>   if (ctx->recv_pkt)
> + mask |= EPOLLIN | EPOLLRDNORM;
>  
> + return mask;

So you call the underlying protocol method on the struct sock of
the TLS code?  Again not reall new in this patch, but how is this
even supposed to work?


[PATCH net 1/4] nfp: don't pad strings in nfp_cpp_resource_find() to avoid gcc 8 warning

2018-06-11 Thread Jakub Kicinski
Once upon a time nfp_cpp_resource_find() took a name parameter,
which could be any user-chosen string.  Resources are identified
by a CRC32 hash of a 8 byte string, so we had to pad user input
with zeros to make sure CRC32 gave the correct result.

Since then nfp_cpp_resource_find() was made to operate on allocated
resources only (struct nfp_resource).  We kzalloc those so there is
no need to pad the strings and use memcmp.

This avoids a GCC 8 stringop-truncation warning:

In function ‘nfp_cpp_resource_find’,
inlined from ‘nfp_resource_try_acquire’ at .../nfpcore/nfp_resource.c:153:8,
inlined from ‘nfp_resource_acquire’ at .../nfpcore/nfp_resource.c:206:9:
.../nfpcore/nfp_resource.c:108:2: warning:  strncpy’ output may be 
truncated copying 8 bytes from a string of length 8 [-Wstringop-truncation]
  strncpy(name_pad, res->name, sizeof(name_pad));
  ^~

Signed-off-by: Jakub Kicinski 
Reviewed-by: Dirk van der Merwe 
---
 drivers/net/ethernet/netronome/nfp/nfpcore/nfp_resource.c | 7 ++-
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_resource.c 
b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_resource.c
index 2dd89dba9311..d32af598da90 100644
--- a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_resource.c
+++ b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_resource.c
@@ -98,21 +98,18 @@ struct nfp_resource {
 
 static int nfp_cpp_resource_find(struct nfp_cpp *cpp, struct nfp_resource *res)
 {
-   char name_pad[NFP_RESOURCE_ENTRY_NAME_SZ] = {};
struct nfp_resource_entry entry;
u32 cpp_id, key;
int ret, i;
 
cpp_id = NFP_CPP_ID(NFP_RESOURCE_TBL_TARGET, 3, 0);  /* Atomic read */
 
-   strncpy(name_pad, res->name, sizeof(name_pad));
-
/* Search for a matching entry */
-   if (!memcmp(name_pad, NFP_RESOURCE_TBL_NAME "\0\0\0\0\0\0\0\0", 8)) {
+   if (!strcmp(res->name, NFP_RESOURCE_TBL_NAME)) {
nfp_err(cpp, "Grabbing device lock not supported\n");
return -EOPNOTSUPP;
}
-   key = crc32_posix(name_pad, sizeof(name_pad));
+   key = crc32_posix(res->name, NFP_RESOURCE_ENTRY_NAME_SZ);
 
for (i = 0; i < NFP_RESOURCE_TBL_ENTRIES; i++) {
u64 addr = NFP_RESOURCE_TBL_BASE +
-- 
2.17.1



[PATCH net 2/4] nfp: include all ring counters in interface stats

2018-06-11 Thread Jakub Kicinski
We are gathering software statistics on per-ring basis.
.ndo_get_stats64 handler adds the rings up.  Unfortunately
we are currently only adding up active rings, which means
that if user decreases the number of active rings the
statistics from deactivated rings will no longer be counted
and total interface statistics may go backwards.

Always sum all possible rings, the stats are allocated
statically for max number of rings, so we don't have to
worry about them being removed.  We could add the stats
up when user changes the ring count, but it seems unnecessary..
Adding up inactive rings will be very quick since no datapath
will be touching them.

Fixes: 164d1e9e5d52 ("nfp: add support for ethtool .set_channels")
Signed-off-by: Jakub Kicinski 
Reviewed-by: Dirk van der Merwe 
---
 drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c 
b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index 75110c8d6a90..ed27176c2bce 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -3121,7 +3121,7 @@ static void nfp_net_stat64(struct net_device *netdev,
struct nfp_net *nn = netdev_priv(netdev);
int r;
 
-   for (r = 0; r < nn->dp.num_r_vecs; r++) {
+   for (r = 0; r < nn->max_r_vecs; r++) {
struct nfp_net_r_vector *r_vec = >r_vecs[r];
u64 data[3];
unsigned int start;
-- 
2.17.1



[PATCH net 0/4] nfp: fix a warning, stats, naming and route leak

2018-06-11 Thread Jakub Kicinski
Hi!

Various fixes for the NFP.  Patch 1 fixes a harmless GCC 8 warning.
Patch 2 ensures statistics are correct after users decrease the number
of channels/rings.  Patch 3 restores phy_port_name behaviour for flower,
ndo_get_phy_port_name used to return -EOPNOTSUPP on one of the netdevs,
and we need to keep it that way otherwise interface names may change.
Patch 4 fixes refcnt leak in flower tunnel offload code.

Jakub Kicinski (3):
  nfp: don't pad strings in nfp_cpp_resource_find() to avoid gcc 8
warning
  nfp: include all ring counters in interface stats
  nfp: remove phys_port_name on flower's vNIC

Pieter Jansen van Vuuren (1):
  nfp: flower: free dst_entry in route table

 drivers/net/ethernet/netronome/nfp/flower/main.c  | 1 +
 drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c   | 2 ++
 drivers/net/ethernet/netronome/nfp/nfp_net.h  | 4 
 drivers/net/ethernet/netronome/nfp/nfp_net_common.c   | 4 ++--
 drivers/net/ethernet/netronome/nfp/nfpcore/nfp_resource.c | 7 ++-
 5 files changed, 11 insertions(+), 7 deletions(-)

-- 
2.17.1



[PATCH net 3/4] nfp: remove phys_port_name on flower's vNIC

2018-06-11 Thread Jakub Kicinski
.ndo_get_phys_port_name was recently extended to support multi-vNIC
FWs.  These are firmwares which can have more than one vNIC per PF
without associated port (e.g. Adaptive Buffer Management FW), therefore
we need a way of distinguishing the vNICs.  Unfortunately, it's too
late to make flower use the same naming.  Flower users may depend on
.ndo_get_phys_port_name returning -EOPNOTSUPP, for example the name
udev gave the PF vNIC was just the bare PCI device-based name before
the change, and will have 'nn0' appended after.

To ensure flower's vNIC doesn't have phys_port_name attribute, add
a flag to vNIC struct and set it in flower code.  New projects will
not set the flag adhere to the naming scheme from the start.

Fixes: 51c1df83e35c ("nfp: assign vNIC id as phys_port_name of vNICs which are 
not ports")
Signed-off-by: Jakub Kicinski 
Reviewed-by: Dirk van der Merwe 
Reviewed-by: Simon Horman 
---
 drivers/net/ethernet/netronome/nfp/flower/main.c| 1 +
 drivers/net/ethernet/netronome/nfp/nfp_net.h| 4 
 drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 2 +-
 3 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/main.c 
b/drivers/net/ethernet/netronome/nfp/flower/main.c
index 19cfa162ac65..1decf3a1cad3 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/main.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/main.c
@@ -455,6 +455,7 @@ static int nfp_flower_vnic_alloc(struct nfp_app *app, 
struct nfp_net *nn,
 
eth_hw_addr_random(nn->dp.netdev);
netif_keep_dst(nn->dp.netdev);
+   nn->vnic_no_name = true;
 
return 0;
 
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net.h 
b/drivers/net/ethernet/netronome/nfp/nfp_net.h
index 57cb035dcc6d..2a71a9ffd095 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net.h
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net.h
@@ -590,6 +590,8 @@ struct nfp_net_dp {
  * @vnic_list: Entry on device vNIC list
  * @pdev:  Backpointer to PCI device
  * @app:   APP handle if available
+ * @vnic_no_name:  For non-port PF vNIC make ndo_get_phys_port_name return
+ * -EOPNOTSUPP to keep backwards compatibility (set by app)
  * @port:  Pointer to nfp_port structure if vNIC is a port
  * @app_priv:  APP private data for this vNIC
  */
@@ -663,6 +665,8 @@ struct nfp_net {
struct pci_dev *pdev;
struct nfp_app *app;
 
+   bool vnic_no_name;
+
struct nfp_port *port;
 
void *app_priv;
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c 
b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index ed27176c2bce..d4c27f849f9b 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -3286,7 +3286,7 @@ nfp_net_get_phys_port_name(struct net_device *netdev, 
char *name, size_t len)
if (nn->port)
return nfp_port_get_phys_port_name(netdev, name, len);
 
-   if (nn->dp.is_vf)
+   if (nn->dp.is_vf || nn->vnic_no_name)
return -EOPNOTSUPP;
 
n = snprintf(name, len, "n%d", nn->id);
-- 
2.17.1



[PATCH net 4/4] nfp: flower: free dst_entry in route table

2018-06-11 Thread Jakub Kicinski
From: Pieter Jansen van Vuuren 

We need to release the refcnt on dst_entry in the route table, otherwise
we will leak the route.

Fixes: 8e6a9046b66a ("nfp: flower vxlan neighbour offload")
Signed-off-by: Pieter Jansen van Vuuren 
Signed-off-by: Louis Peens 
Reviewed-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c 
b/drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c
index ec524d97869d..78afe75129ab 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c
@@ -381,6 +381,8 @@ nfp_tun_neigh_event_handler(struct notifier_block *nb, 
unsigned long event,
err = PTR_ERR_OR_ZERO(rt);
if (err)
return NOTIFY_DONE;
+
+   ip_rt_put(rt);
 #else
return NOTIFY_DONE;
 #endif
-- 
2.17.1



Re: [PATCH bpf v2] tools/bpftool: fix a bug in bpftool perf

2018-06-11 Thread Yonghong Song




On 6/11/18 7:42 PM, Jakub Kicinski wrote:

On Mon, 11 Jun 2018 19:01:08 -0700, Yonghong Song wrote:

Commit b04df400c302 ("tools/bpftool: add perf subcommand")
introduced bpftool subcommand perf to query bpf program
kuprobe and tracepoint attachments.

The perf subcommand will first test whether bpf subcommand
BPF_TASK_FD_QUERY is supported in kernel or not. It does it
by opening a file with argv[0] and feeds the file descriptor
and current task pid to the kernel for querying.

Such an approach won't work if the argv[0] cannot be opened
successfully in the current directory. This is especially
true when bpftool is accessible through PATH env variable.
The error below reflects the open failure for file argv[0]
at home directory.

   [yhs@localhost ~]$ which bpftool
   /usr/local/sbin/bpftool
   [yhs@localhost ~]$ bpftool perf
   Error: perf_query_support: No such file or directory

To fix the issue, let us open root directory ("/")
which exists in every linux system. With the fix, the
error message will correctly reflect the permission issue.

   [yhs@localhost ~]$ which bpftool
   /usr/local/sbin/bpftool
   [yhs@localhost ~]$ bpftool perf
   Error: perf_query_support: Operation not permitted
   HINT: non root or kernel doesn't support TASK_FD_QUERY

Fixes: b04df400c302 ("tools/bpftool: add perf subcommand")
Reported-by: Alexei Starovoitov 
Signed-off-by: Yonghong Song 
---
  tools/bpf/bpftool/perf.c | 7 ---
  1 file changed, 4 insertions(+), 3 deletions(-)

Changelogs:
  v1 -> v2:
. remove '\n' in the p_err format string in order to
  have valid json output.

diff --git a/tools/bpf/bpftool/perf.c b/tools/bpf/bpftool/perf.c
index ac6b1a12c9b7..239715aa6fb9 100644
--- a/tools/bpf/bpftool/perf.c
+++ b/tools/bpf/bpftool/perf.c
@@ -29,9 +29,10 @@ static bool has_perf_query_support(void)
if (perf_query_supported)
goto out;
  
-	fd = open(bin_name, O_RDONLY);

-   if (fd < 0) {
-   p_err("perf_query_support: %s", strerror(errno));
+   fd = open("/", O_RDONLY);
+   if (fd > 0) {


Looking at this again, why change the comparison from less than to
greater than 0?


Totally my fault. I altered the condition to test p_err output since
I cannot trigger it normally, but forgot to change it back.
Will send yet another revision :-(


+   p_err("perf_query_support: cannot open directory \"/\" (%s)",
+ strerror(errno));
goto out;
}
  




problems with SCTP GSO

2018-06-11 Thread David Miller


I would like to bring up some problems with the current GSO
implementation in SCTP.

The most important for me right now is that SCTP uses
"skb_gro_receive()" to build "GSO" frames :-(

Really it just ends up using the slow path (basically, label 'merge'
and onwards).

So, using a GRO helper to build GSO packets is not great.

I want to make major surgery here and the only way I can is if
it is exactly the GRO demuxing path that uses skb_gro_receive().

Those paths pass in the list head from the NAPI struct that initiated
the GRO code paths.  That makes it easy for me to change this to use a
list_head or a hash chain.

Probably in the short term SCTP should just have a private helper that
builds the frag list, appending 'skb' to 'head'.

In the long term, SCTP should use the page frags just like TCP to
append the data when building GSO frames.  Then it could actually be
offloaded and passed into drivers without linearizing.


Re: [PATCH iproute2-next] ip-xfrm: Add support for OUTPUT_MARK

2018-06-11 Thread Stephen Hemminger
On Mon, 11 Jun 2018 20:11:33 -0600
Subash Abhinov Kasiviswanathan  wrote:

> This patch adds support for OUTPUT_MARK in xfrm state to exercise the
> functionality added by kernel commit 077fbac405bf
> ("net: xfrm: support setting an output mark.").
> 
> Sample output with output-mark -
> 
> src 192.168.1.1 dst 192.168.1.2
> proto esp spi 0x4321 reqid 0 mode tunnel
> replay-window 0 flag af-unspec
> auth-trunc xcbc(aes) 0x3ed0af408cf5dcbf5d5d9a5fa806b211 96
> enc cbc(aes) 0x3ed0af408cf5dcbf5d5d9a5fa806b233
> anti-replay context: seq 0x0, oseq 0x0, bitmap 0x
> output-mark 0x2
> 
> Signed-off-by: Subash Abhinov Kasiviswanathan 

This reminds me. It is time to convert xfrm to json output.


Re: [PATCH RFC] tcp: Do not reload skb pointer after skb_gro_receive().

2018-06-11 Thread Eric Dumazet



On 06/11/2018 06:00 PM, David Miller wrote:
> 
> This is not necessary.  skb_gro_receive() will never change what
> 'head' points to.
> 
> In it's original implementation (see commit 71d93b39e52e ("net: Add
> skb_gro_receive")), it did:
> 
> 
> + *head = nskb;
> + nskb->next = p->next;
> + p->next = NULL;
> 
> 
> This sequence was removed in commit 58025e46ea2d ("net: gro: remove
> obsolete code from skb_gro_receive()")
> 
> Signed-off-by: David S. Miller 

SGTM, thanks David !

Signed-off-by: Eric Dumazet 



Re: [PATCH bpf v2] tools/bpftool: fix a bug in bpftool perf

2018-06-11 Thread Jakub Kicinski
On Mon, 11 Jun 2018 19:01:08 -0700, Yonghong Song wrote:
> Commit b04df400c302 ("tools/bpftool: add perf subcommand")
> introduced bpftool subcommand perf to query bpf program
> kuprobe and tracepoint attachments.
> 
> The perf subcommand will first test whether bpf subcommand
> BPF_TASK_FD_QUERY is supported in kernel or not. It does it
> by opening a file with argv[0] and feeds the file descriptor
> and current task pid to the kernel for querying.
> 
> Such an approach won't work if the argv[0] cannot be opened
> successfully in the current directory. This is especially
> true when bpftool is accessible through PATH env variable.
> The error below reflects the open failure for file argv[0]
> at home directory.
> 
>   [yhs@localhost ~]$ which bpftool
>   /usr/local/sbin/bpftool
>   [yhs@localhost ~]$ bpftool perf
>   Error: perf_query_support: No such file or directory
> 
> To fix the issue, let us open root directory ("/")
> which exists in every linux system. With the fix, the
> error message will correctly reflect the permission issue.
> 
>   [yhs@localhost ~]$ which bpftool
>   /usr/local/sbin/bpftool
>   [yhs@localhost ~]$ bpftool perf
>   Error: perf_query_support: Operation not permitted
>   HINT: non root or kernel doesn't support TASK_FD_QUERY
> 
> Fixes: b04df400c302 ("tools/bpftool: add perf subcommand")
> Reported-by: Alexei Starovoitov 
> Signed-off-by: Yonghong Song 
> ---
>  tools/bpf/bpftool/perf.c | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> Changelogs:
>  v1 -> v2:
>. remove '\n' in the p_err format string in order to
>  have valid json output.
> 
> diff --git a/tools/bpf/bpftool/perf.c b/tools/bpf/bpftool/perf.c
> index ac6b1a12c9b7..239715aa6fb9 100644
> --- a/tools/bpf/bpftool/perf.c
> +++ b/tools/bpf/bpftool/perf.c
> @@ -29,9 +29,10 @@ static bool has_perf_query_support(void)
>   if (perf_query_supported)
>   goto out;
>  
> - fd = open(bin_name, O_RDONLY);
> - if (fd < 0) {
> - p_err("perf_query_support: %s", strerror(errno));
> + fd = open("/", O_RDONLY);
> + if (fd > 0) {

Looking at this again, why change the comparison from less than to
greater than 0?

> + p_err("perf_query_support: cannot open directory \"/\" (%s)",
> +   strerror(errno));
>   goto out;
>   }
>  



Re: [PATCH iproute2-next] ip-xfrm: Add support for OUTPUT_MARK

2018-06-11 Thread Lorenzo Colitti
On Tue, Jun 12, 2018 at 11:12 AM Subash Abhinov Kasiviswanathan
 wrote:
>
> This patch adds support for OUTPUT_MARK in xfrm state to exercise the
> functionality added by kernel commit 077fbac405bf
> ("net: xfrm: support setting an output mark.").
>
> Sample output with output-mark -
>
> src 192.168.1.1 dst 192.168.1.2
> proto esp spi 0x4321 reqid 0 mode tunnel
> replay-window 0 flag af-unspec
> auth-trunc xcbc(aes) 0x3ed0af408cf5dcbf5d5d9a5fa806b211 96
> enc cbc(aes) 0x3ed0af408cf5dcbf5d5d9a5fa806b233
> anti-replay context: seq 0x0, oseq 0x0, bitmap 0x
> output-mark 0x2

Have you considered putting this earlier up in the output, where the
mark is printed as well?

> +   if (tb[XFRMA_OUTPUT_MARK]) {
> +   __u32 output_mark = rta_getattr_u32(tb[XFRMA_OUTPUT_MARK]);
> +
> +   fprintf(fp, "\toutput-mark 0x%x %s", output_mark, _SL_);
> +   }
>  }

If you wanted to implement the suggestion above, I think you could do
that by moving this code into xfrm_xfrma_print.

Other than that, LGTM.

Acked-by: Lorenzo Colitti 

Steffen - what's the status of the set_mark patches? Are you holding
them until the tree opens again? If so, then once they go in, we can
just make "set-mark" behave the same as "output-mark" in the iproute2
code, and add support for the mask as well.


Re: [PATCH net] failover: eliminate callback hell

2018-06-11 Thread Michael S. Tsirkin
On Mon, Jun 11, 2018 at 11:56:56AM -0700, Siwei Liu wrote:
> The current implementation may only work with new userspace, even so
> the eth0/eth0nsby naming is not consistenly persisted due to races in
> bus probing.

Which race do you mean exactly?

-- 
MST


[PATCH iproute2-next] ip-xfrm: Add support for OUTPUT_MARK

2018-06-11 Thread Subash Abhinov Kasiviswanathan
This patch adds support for OUTPUT_MARK in xfrm state to exercise the
functionality added by kernel commit 077fbac405bf
("net: xfrm: support setting an output mark.").

Sample output with output-mark -

src 192.168.1.1 dst 192.168.1.2
proto esp spi 0x4321 reqid 0 mode tunnel
replay-window 0 flag af-unspec
auth-trunc xcbc(aes) 0x3ed0af408cf5dcbf5d5d9a5fa806b211 96
enc cbc(aes) 0x3ed0af408cf5dcbf5d5d9a5fa806b233
anti-replay context: seq 0x0, oseq 0x0, bitmap 0x
output-mark 0x2

Signed-off-by: Subash Abhinov Kasiviswanathan 
---
 ip/ipxfrm.c| 5 +
 ip/xfrm_state.c| 9 +
 man/man8/ip-xfrm.8 | 2 ++
 3 files changed, 16 insertions(+)

diff --git a/ip/ipxfrm.c b/ip/ipxfrm.c
index 12c2f72..601f9b8 100644
--- a/ip/ipxfrm.c
+++ b/ip/ipxfrm.c
@@ -956,6 +956,11 @@ void xfrm_state_info_print(struct xfrm_usersa_info *xsinfo,
fprintf(fp, "%s %s", (char *)(sctx + 1), _SL_);
}
 
+   if (tb[XFRMA_OUTPUT_MARK]) {
+   __u32 output_mark = rta_getattr_u32(tb[XFRMA_OUTPUT_MARK]);
+
+   fprintf(fp, "\toutput-mark 0x%x %s", output_mark, _SL_);
+   }
 }
 
 void xfrm_policy_info_print(struct xfrm_userpolicy_info *xpinfo,
diff --git a/ip/xfrm_state.c b/ip/xfrm_state.c
index 85d959c..d005802 100644
--- a/ip/xfrm_state.c
+++ b/ip/xfrm_state.c
@@ -61,6 +61,7 @@ static void usage(void)
fprintf(stderr, "[ flag FLAG-LIST ] [ sel SELECTOR ] [ 
LIMIT-LIST ] [ encap ENCAP ]\n");
fprintf(stderr, "[ coa ADDR[/PLEN] ] [ ctx CTX ] [ extra-flag 
EXTRA-FLAG-LIST ]\n");
fprintf(stderr, "[ offload [dev DEV] dir DIR ]\n");
+   fprintf(stderr, "[ output-mark OUTPUT-MARK]\n");
fprintf(stderr, "Usage: ip xfrm state allocspi ID [ mode MODE ] [ mark 
MARK [ mask MASK ] ]\n");
fprintf(stderr, "[ reqid REQID ] [ seq SEQ ] [ min SPI max SPI 
]\n");
fprintf(stderr, "Usage: ip xfrm state { delete | get } ID [ mark MARK [ 
mask MASK ] ]\n");
@@ -322,6 +323,7 @@ static int xfrm_state_modify(int cmd, unsigned int flags, 
int argc, char **argv)
struct xfrm_user_sec_ctx sctx;
charstr[CTX_BUF_SIZE];
} ctx = {};
+   __u32 output_mark = 0;
 
while (argc > 0) {
if (strcmp(*argv, "mode") == 0) {
@@ -437,6 +439,10 @@ static int xfrm_state_modify(int cmd, unsigned int flags, 
int argc, char **argv)
invarg("value after \"offload dir\" is 
invalid", *argv);
is_offload = false;
}
+   } else if (strcmp(*argv, "output-mark") == 0) {
+   NEXT_ARG();
+   if (get_u32(_mark, *argv, 0))
+   invarg("value after \"output-mark\" is 
invalid", *argv);
} else {
/* try to assume ALGO */
int type = xfrm_algotype_getbyname(*argv);
@@ -720,6 +726,9 @@ static int xfrm_state_modify(int cmd, unsigned int flags, 
int argc, char **argv)
}
}
 
+   if (output_mark != 0)
+   addattr32(, sizeof(req.buf), XFRMA_OUTPUT_MARK, 
output_mark);
+
if (rtnl_open_byproto(, 0, NETLINK_XFRM) < 0)
exit(1);
 
diff --git a/man/man8/ip-xfrm.8 b/man/man8/ip-xfrm.8
index 988cc6a..e001596 100644
--- a/man/man8/ip-xfrm.8
+++ b/man/man8/ip-xfrm.8
@@ -59,6 +59,8 @@ ip-xfrm \- transform configuration
 .IR CTX " ]"
 .RB "[ " extra-flag
 .IR EXTRA-FLAG-LIST " ]"
+.RB "[ " output-mark
+.IR OUTPUT-MARK " ]"
 
 .ti -8
 .B "ip xfrm state allocspi"
-- 
1.9.1



[PATCH bpf v2] tools/bpftool: fix a bug in bpftool perf

2018-06-11 Thread Yonghong Song
Commit b04df400c302 ("tools/bpftool: add perf subcommand")
introduced bpftool subcommand perf to query bpf program
kuprobe and tracepoint attachments.

The perf subcommand will first test whether bpf subcommand
BPF_TASK_FD_QUERY is supported in kernel or not. It does it
by opening a file with argv[0] and feeds the file descriptor
and current task pid to the kernel for querying.

Such an approach won't work if the argv[0] cannot be opened
successfully in the current directory. This is especially
true when bpftool is accessible through PATH env variable.
The error below reflects the open failure for file argv[0]
at home directory.

  [yhs@localhost ~]$ which bpftool
  /usr/local/sbin/bpftool
  [yhs@localhost ~]$ bpftool perf
  Error: perf_query_support: No such file or directory

To fix the issue, let us open root directory ("/")
which exists in every linux system. With the fix, the
error message will correctly reflect the permission issue.

  [yhs@localhost ~]$ which bpftool
  /usr/local/sbin/bpftool
  [yhs@localhost ~]$ bpftool perf
  Error: perf_query_support: Operation not permitted
  HINT: non root or kernel doesn't support TASK_FD_QUERY

Fixes: b04df400c302 ("tools/bpftool: add perf subcommand")
Reported-by: Alexei Starovoitov 
Signed-off-by: Yonghong Song 
---
 tools/bpf/bpftool/perf.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

Changelogs:
 v1 -> v2:
   . remove '\n' in the p_err format string in order to
 have valid json output.

diff --git a/tools/bpf/bpftool/perf.c b/tools/bpf/bpftool/perf.c
index ac6b1a12c9b7..239715aa6fb9 100644
--- a/tools/bpf/bpftool/perf.c
+++ b/tools/bpf/bpftool/perf.c
@@ -29,9 +29,10 @@ static bool has_perf_query_support(void)
if (perf_query_supported)
goto out;
 
-   fd = open(bin_name, O_RDONLY);
-   if (fd < 0) {
-   p_err("perf_query_support: %s", strerror(errno));
+   fd = open("/", O_RDONLY);
+   if (fd > 0) {
+   p_err("perf_query_support: cannot open directory \"/\" (%s)",
+ strerror(errno));
goto out;
}
 
-- 
2.14.3



Re: [PATCH bpf] tools/bpftool: fix a bug in bpftool perf

2018-06-11 Thread Yonghong Song




On 6/11/18 6:49 PM, Jakub Kicinski wrote:

On Mon, 11 Jun 2018 18:15:16 -0700, Yonghong Song wrote:

Commit b04df400c302 ("tools/bpftool: add perf subcommand")
introduced bpftool subcommand perf to query bpf program
kuprobe and tracepoint attachments.

The perf subcommand will first test whether bpf subcommand
BPF_TASK_FD_QUERY is supported in kernel or not. It does it
by opening a file with argv[0] and feeds the file descriptor
and current task pid to the kernel for querying.

Such an approach won't work if the argv[0] cannot be opened
successfully in the current directory. This is especially
true when bpftool is accessible through PATH env variable.
The error below reflects the open failure for file argv[0]
at home directory.

   [yhs@localhost ~]$ which bpftool
   /usr/local/sbin/bpftool
   [yhs@localhost ~]$ bpftool perf
   Error: perf_query_support: No such file or directory

To fix the issue, let us open root directory ("/")
which exists in every linux system. With the fix, the
error message will correctly reflect the permission issue.

   [yhs@localhost ~]$ which bpftool
   /usr/local/sbin/bpftool
   [yhs@localhost ~]$ bpftool perf
   Error: perf_query_support: Operation not permitted
   HINT: non root or kernel doesn't support TASK_FD_QUERY

Fixes: b04df400c302 ("tools/bpftool: add perf subcommand")
Reported-by: Alexei Starovoitov 
Signed-off-by: Yonghong Song 
---
  tools/bpf/bpftool/perf.c | 7 ---
  1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/tools/bpf/bpftool/perf.c b/tools/bpf/bpftool/perf.c
index ac6b1a12c9b7..f1e4c9b270e2 100644
--- a/tools/bpf/bpftool/perf.c
+++ b/tools/bpf/bpftool/perf.c
@@ -29,9 +29,10 @@ static bool has_perf_query_support(void)
if (perf_query_supported)
goto out;
  
-	fd = open(bin_name, O_RDONLY);

-   if (fd < 0) {
-   p_err("perf_query_support: %s", strerror(errno));
+   fd = open("/", O_RDONLY);
+   if (fd > 0) {
+   p_err("perf_query_support: cannot open directory \"/\" (%s)\n",


nit: no \n at the end of p_err() format, because it breaks JSON :(


Thanks for letting me know. Will send v2 to fix this!




+ strerror(errno));
goto out;
}
  




Re: [PATCH bpf] tools/bpftool: fix a bug in bpftool perf

2018-06-11 Thread Jakub Kicinski
On Mon, 11 Jun 2018 18:15:16 -0700, Yonghong Song wrote:
> Commit b04df400c302 ("tools/bpftool: add perf subcommand")
> introduced bpftool subcommand perf to query bpf program
> kuprobe and tracepoint attachments.
> 
> The perf subcommand will first test whether bpf subcommand
> BPF_TASK_FD_QUERY is supported in kernel or not. It does it
> by opening a file with argv[0] and feeds the file descriptor
> and current task pid to the kernel for querying.
> 
> Such an approach won't work if the argv[0] cannot be opened
> successfully in the current directory. This is especially
> true when bpftool is accessible through PATH env variable.
> The error below reflects the open failure for file argv[0]
> at home directory.
> 
>   [yhs@localhost ~]$ which bpftool
>   /usr/local/sbin/bpftool
>   [yhs@localhost ~]$ bpftool perf
>   Error: perf_query_support: No such file or directory
> 
> To fix the issue, let us open root directory ("/")
> which exists in every linux system. With the fix, the
> error message will correctly reflect the permission issue.
> 
>   [yhs@localhost ~]$ which bpftool
>   /usr/local/sbin/bpftool
>   [yhs@localhost ~]$ bpftool perf
>   Error: perf_query_support: Operation not permitted
>   HINT: non root or kernel doesn't support TASK_FD_QUERY
> 
> Fixes: b04df400c302 ("tools/bpftool: add perf subcommand")
> Reported-by: Alexei Starovoitov 
> Signed-off-by: Yonghong Song 
> ---
>  tools/bpf/bpftool/perf.c | 7 ---
>  1 file changed, 4 insertions(+), 3 deletions(-)
> 
> diff --git a/tools/bpf/bpftool/perf.c b/tools/bpf/bpftool/perf.c
> index ac6b1a12c9b7..f1e4c9b270e2 100644
> --- a/tools/bpf/bpftool/perf.c
> +++ b/tools/bpf/bpftool/perf.c
> @@ -29,9 +29,10 @@ static bool has_perf_query_support(void)
>   if (perf_query_supported)
>   goto out;
>  
> - fd = open(bin_name, O_RDONLY);
> - if (fd < 0) {
> - p_err("perf_query_support: %s", strerror(errno));
> + fd = open("/", O_RDONLY);
> + if (fd > 0) {
> + p_err("perf_query_support: cannot open directory \"/\" (%s)\n",

nit: no \n at the end of p_err() format, because it breaks JSON :(

> +   strerror(errno));
>   goto out;
>   }
>  



[PATCH bpf] tools/bpftool: fix a bug in bpftool perf

2018-06-11 Thread Yonghong Song
Commit b04df400c302 ("tools/bpftool: add perf subcommand")
introduced bpftool subcommand perf to query bpf program
kuprobe and tracepoint attachments.

The perf subcommand will first test whether bpf subcommand
BPF_TASK_FD_QUERY is supported in kernel or not. It does it
by opening a file with argv[0] and feeds the file descriptor
and current task pid to the kernel for querying.

Such an approach won't work if the argv[0] cannot be opened
successfully in the current directory. This is especially
true when bpftool is accessible through PATH env variable.
The error below reflects the open failure for file argv[0]
at home directory.

  [yhs@localhost ~]$ which bpftool
  /usr/local/sbin/bpftool
  [yhs@localhost ~]$ bpftool perf
  Error: perf_query_support: No such file or directory

To fix the issue, let us open root directory ("/")
which exists in every linux system. With the fix, the
error message will correctly reflect the permission issue.

  [yhs@localhost ~]$ which bpftool
  /usr/local/sbin/bpftool
  [yhs@localhost ~]$ bpftool perf
  Error: perf_query_support: Operation not permitted
  HINT: non root or kernel doesn't support TASK_FD_QUERY

Fixes: b04df400c302 ("tools/bpftool: add perf subcommand")
Reported-by: Alexei Starovoitov 
Signed-off-by: Yonghong Song 
---
 tools/bpf/bpftool/perf.c | 7 ---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/tools/bpf/bpftool/perf.c b/tools/bpf/bpftool/perf.c
index ac6b1a12c9b7..f1e4c9b270e2 100644
--- a/tools/bpf/bpftool/perf.c
+++ b/tools/bpf/bpftool/perf.c
@@ -29,9 +29,10 @@ static bool has_perf_query_support(void)
if (perf_query_supported)
goto out;
 
-   fd = open(bin_name, O_RDONLY);
-   if (fd < 0) {
-   p_err("perf_query_support: %s", strerror(errno));
+   fd = open("/", O_RDONLY);
+   if (fd > 0) {
+   p_err("perf_query_support: cannot open directory \"/\" (%s)\n",
+ strerror(errno));
goto out;
}
 
-- 
2.14.3



[PATCH RFC] tcp: Do not reload skb pointer after skb_gro_receive().

2018-06-11 Thread David Miller


This is not necessary.  skb_gro_receive() will never change what
'head' points to.

In it's original implementation (see commit 71d93b39e52e ("net: Add
skb_gro_receive")), it did:


+   *head = nskb;
+   nskb->next = p->next;
+   p->next = NULL;


This sequence was removed in commit 58025e46ea2d ("net: gro: remove
obsolete code from skb_gro_receive()")

Signed-off-by: David S. Miller 

diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index 4d58e2ce0b5b..8cc7c3487330 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -268,8 +268,6 @@ struct sk_buff **tcp_gro_receive(struct sk_buff **head, 
struct sk_buff *skb)
goto out_check_final;
}
 
-   p = *head;
-   th2 = tcp_hdr(p);
tcp_flag_word(th2) |= flags & (TCP_FLAG_FIN | TCP_FLAG_PSH);
 
 out_check_final:


Re: [Intel-wired-lan] [jkirsher/next-queue PATCH 5/7] net: Add generic ndo_select_queue functions

2018-06-11 Thread Alexander Duyck
Jeff,

Looks like I have some bugs on non-x86 architecture. I need to address
these in a v2 of this set so please hold off on applying until I can
get that submitted either tonight or tomorrow morning.

Thanks.

- Alex

On Mon, Jun 11, 2018 at 4:10 PM, kbuild test robot  wrote:
> Hi Alexander,
>
> Thank you for the patch! Yet something to improve:
>
> [auto build test ERROR on net-next/master]
> [also build test ERROR on next-20180608]
> [cannot apply to v4.17]
> [if your patch is applied to the wrong git tree, please drop us a note to 
> help improve the system]
>
> url:
> https://github.com/0day-ci/linux/commits/Alexander-Duyck/Add-support-for-L2-Fwd-Offload-w-o-ndo_select_queue/20180612-015220
> config: arm-allmodconfig (attached as .config)
> compiler: arm-linux-gnueabi-gcc (Debian 7.2.0-11) 7.2.0
> reproduce:
> wget 
> https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
> ~/bin/make.cross
> chmod +x ~/bin/make.cross
> # save the attached .config to linux build tree
> GCC_VERSION=7.2.0 make.cross ARCH=arm
>
> All errors (new ones prefixed by >>):
>
>>> drivers/net//ethernet/ti/netcp_core.c:1968:22: error: initialization from 
>>> incompatible pointer type [-Werror=incompatible-pointer-types]
>  .ndo_select_queue = dev_pick_tx_zero,
>  ^~~~
>drivers/net//ethernet/ti/netcp_core.c:1968:22: note: (near initialization 
> for 'netcp_netdev_ops.ndo_select_queue')
>cc1: some warnings being treated as errors
>
> vim +1968 drivers/net//ethernet/ti/netcp_core.c
>
>   1955
>   1956  static const struct net_device_ops netcp_netdev_ops = {
>   1957  .ndo_open   = netcp_ndo_open,
>   1958  .ndo_stop   = netcp_ndo_stop,
>   1959  .ndo_start_xmit = netcp_ndo_start_xmit,
>   1960  .ndo_set_rx_mode= netcp_set_rx_mode,
>   1961  .ndo_do_ioctl   = netcp_ndo_ioctl,
>   1962  .ndo_get_stats64= netcp_get_stats,
>   1963  .ndo_set_mac_address= eth_mac_addr,
>   1964  .ndo_validate_addr  = eth_validate_addr,
>   1965  .ndo_vlan_rx_add_vid= netcp_rx_add_vid,
>   1966  .ndo_vlan_rx_kill_vid   = netcp_rx_kill_vid,
>   1967  .ndo_tx_timeout = netcp_ndo_tx_timeout,
>> 1968  .ndo_select_queue   = dev_pick_tx_zero,
>   1969  .ndo_setup_tc   = netcp_setup_tc,
>   1970  };
>   1971
>
> ---
> 0-DAY kernel test infrastructureOpen Source Technology Center
> https://lists.01.org/pipermail/kbuild-all   Intel Corporation


Re: pull-request: bpf 2018-06-12

2018-06-11 Thread David Miller
From: Daniel Borkmann 
Date: Tue, 12 Jun 2018 01:59:56 +0200

> The following pull-request contains BPF updates for your *net* tree.
> 
> The main changes are:
> 
> 1) Avoid an allocation warning in AF_XDP by adding __GFP_NOWARN for the
>umem setup, from Björn.
> 
> 2) Silence a warning in bpf fs when an application tries to open(2) a
>pinned bpf obj due to missing fops. Add a dummy open fop that continues
>to just bail out in such case, from Daniel.
> 
> 3) Fix a BPF selftest urandom_read build issue where gcc complains that
>it gets built twice, from Anders.
> 
> Please consider pulling these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Pulled, thanks Daniel.


Re: [net 0/5][pull request] Intel Wired LAN Driver Updates 2018-06-11

2018-06-11 Thread David Miller
From: Jeff Kirsher 
Date: Mon, 11 Jun 2018 09:16:25 -0700

> This series contains fixes to ixgbe IPsec and MACVLAN.
> 
> Alex provides the 5 fixes in this series, starting with fixing an issue
> where num_rx_pools was not being populated until after the queues and
> interrupts were reinitialized when enabling MACVLAN interfaces.  Updated
> to use CONFIG_XFRM_OFFLOAD instead of CONFIG_XFRM, since the code
> requires CONFIG_XFRM_OFFLOAD to be enabled.  Moved the IPsec
> initialization function to be more consistent with the placement of
> similar initialization functions and before the call to reset the
> hardware, which will clean up any link issues that may have been
> introduced.  Fixed the boolean logic that was testing for transmit OR
> receive ready bits, when it should have been testing for transmit AND
> receive ready bits.  Fixed the bit definitions for SECTXSTAT and SECRXSTAT
> registers and ensure that if IPsec is disabled on the part, do not
> enable it.
> 
> The following are changes since commit 
> f0dc7f9c6dd99891611fca5849cbc4c6965b690e:
>   Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
> and are available in the git repository at:
>   git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue 10GbE

Pulled, thanks Jeff.


Re: [PATCH net] failover: eliminate callback hell

2018-06-11 Thread Samudrala, Sridhar

On 6/11/2018 12:34 PM, Samudrala, Sridhar wrote:


On 6/11/2018 11:10 AM, Michael S. Tsirkin wrote:

On Mon, Jun 04, 2018 at 08:42:31PM -0700, Stephen Hemminger wrote:

   * Set permanent and current address of net_failover device
 to match the primary.


We copy the dev_addr of standby dev to failover_dev in net_failover_create()
before calling register_netdev().
register_netdev() does a copy of dev_addr to perm_addr.

So i don't think this is an issue.



   * Carrier should be marked off before registering device
 the net_failover device.


Will fix this and also a couple of places dev_err() needs to be replaced with 
netdev_err()


Sridhar, do we want to address this?
If yes, could you please take a look at addressing these
meanwhile, while we keep arguing about making API changes?


Sure. I will submit patches to address these issues raised by Stephen.






pull-request: bpf 2018-06-12

2018-06-11 Thread Daniel Borkmann
Hi David,

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Avoid an allocation warning in AF_XDP by adding __GFP_NOWARN for the
   umem setup, from Björn.

2) Silence a warning in bpf fs when an application tries to open(2) a
   pinned bpf obj due to missing fops. Add a dummy open fop that continues
   to just bail out in such case, from Daniel.

3) Fix a BPF selftest urandom_read build issue where gcc complains that
   it gets built twice, from Anders.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Thanks a lot!



The following changes since commit 66e58e0ef80a56a1d7857b6ce121141563cdd93e:

  bpfilter: fix race in pipe access (2018-06-07 20:07:28 -0400)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git 

for you to fetch changes up to a343993c518ce252b62ec00ac06bccfb1d17129d:

  xsk: silence warning on memory allocation failure (2018-06-11 23:49:07 +0200)


Anders Roxell (1):
  selftests: bpf: fix urandom_read build issue

Björn Töpel (1):
  xsk: silence warning on memory allocation failure

Daniel Borkmann (1):
  bpf: implement dummy fops for bpf objects

 kernel/bpf/inode.c   | 14 --
 net/xdp/xdp_umem.c   |  3 ++-
 tools/testing/selftests/bpf/Makefile |  4 +---
 3 files changed, 15 insertions(+), 6 deletions(-)


[PATCH v2 net-next] vlan: implement vlan id and protocol changes

2018-06-11 Thread Chas Williams
vlan_changelink silently ignores attempts to change the vlan id
or protocol id of an existing vlan interface.  Implement by adding
the new vlan id and protocol to the interface's vlan group and then
removing the old vlan id and protocol from the vlan group.

Signed-off-by: Chas Williams <3ch...@gmail.com>
---
 include/linux/netdevice.h |  1 +
 net/8021q/vlan.c  |  4 ++--
 net/8021q/vlan.h  |  2 ++
 net/8021q/vlan_netlink.c  | 38 ++
 net/core/dev.c|  1 +
 5 files changed, 44 insertions(+), 2 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3ec9850c7936..a95ae238addf 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2409,6 +2409,7 @@ enum netdev_cmd {
NETDEV_CVLAN_FILTER_DROP_INFO,
NETDEV_SVLAN_FILTER_PUSH_INFO,
NETDEV_SVLAN_FILTER_DROP_INFO,
+   NETDEV_CHANGEVLAN,
 };
 const char *netdev_cmd_to_name(enum netdev_cmd cmd);
 
diff --git a/net/8021q/vlan.c b/net/8021q/vlan.c
index 73a65789271b..b5e0ad1a581a 100644
--- a/net/8021q/vlan.c
+++ b/net/8021q/vlan.c
@@ -51,8 +51,8 @@ const char vlan_version[] = DRV_VERSION;
 
 /* End of global variables definitions. */
 
-static int vlan_group_prealloc_vid(struct vlan_group *vg,
-  __be16 vlan_proto, u16 vlan_id)
+int vlan_group_prealloc_vid(struct vlan_group *vg,
+   __be16 vlan_proto, u16 vlan_id)
 {
struct net_device **array;
unsigned int pidx, vidx;
diff --git a/net/8021q/vlan.h b/net/8021q/vlan.h
index 44df1c3df02d..c734dd21d70d 100644
--- a/net/8021q/vlan.h
+++ b/net/8021q/vlan.h
@@ -116,6 +116,8 @@ int register_vlan_dev(struct net_device *dev, struct 
netlink_ext_ack *extack);
 void unregister_vlan_dev(struct net_device *dev, struct list_head *head);
 bool vlan_dev_inherit_address(struct net_device *dev,
  struct net_device *real_dev);
+int vlan_group_prealloc_vid(struct vlan_group *vg,
+   __be16 vlan_proto, u16 vlan_id);
 
 static inline u32 vlan_get_ingress_priority(struct net_device *dev,
u16 vlan_tci)
diff --git a/net/8021q/vlan_netlink.c b/net/8021q/vlan_netlink.c
index 9b60c1e399e2..0e59babe6651 100644
--- a/net/8021q/vlan_netlink.c
+++ b/net/8021q/vlan_netlink.c
@@ -107,10 +107,48 @@ static int vlan_changelink(struct net_device *dev, struct 
nlattr *tb[],
   struct nlattr *data[],
   struct netlink_ext_ack *extack)
 {
+   struct vlan_dev_priv *vlan = vlan_dev_priv(dev);
struct ifla_vlan_flags *flags;
struct ifla_vlan_qos_mapping *m;
struct nlattr *attr;
int rem;
+   int err;
+   __be16 vlan_proto = vlan->vlan_proto;
+   u16 vlan_id = vlan->vlan_id;
+
+   if (data[IFLA_VLAN_ID])
+   vlan_id = nla_get_u16(data[IFLA_VLAN_ID]);
+
+   if (data[IFLA_VLAN_PROTOCOL])
+   vlan_proto = nla_get_be16(data[IFLA_VLAN_PROTOCOL]);
+
+   if (vlan->vlan_id != vlan_id || vlan->vlan_proto != vlan_proto) {
+   struct net_device *real_dev = vlan->real_dev;
+   struct vlan_info *vlan_info;
+   struct vlan_group *grp;
+   __be16 old_vlan_proto = vlan->vlan_proto;
+   u16 old_vlan_id = vlan->vlan_id;
+
+   err = vlan_vid_add(real_dev, vlan_proto, vlan_id);
+   if (err)
+   return err;
+   vlan_info = rtnl_dereference(real_dev->vlan_info);
+   grp = _info->grp;
+   err = vlan_group_prealloc_vid(grp, vlan_proto, vlan_id);
+   if (err < 0) {
+   vlan_vid_del(real_dev, vlan_proto, vlan_id);
+   return err;
+   }
+   vlan_group_set_device(grp, vlan_proto, vlan_id, dev);
+   vlan->vlan_proto = vlan_proto;
+   vlan->vlan_id = vlan_id;
+
+   vlan_group_set_device(grp, old_vlan_proto, old_vlan_id, NULL);
+   vlan_vid_del(real_dev, old_vlan_proto, old_vlan_id);
+
+   err = call_netdevice_notifiers(NETDEV_CHANGEVLAN, dev);
+   notifier_to_errno(err);
+   }
 
if (data[IFLA_VLAN_FLAGS]) {
flags = nla_data(data[IFLA_VLAN_FLAGS]);
diff --git a/net/core/dev.c b/net/core/dev.c
index 6e18242a1cae..849fdb60fd21 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1587,6 +1587,7 @@ const char *netdev_cmd_to_name(enum netdev_cmd cmd)
N(UDP_TUNNEL_DROP_INFO) N(CHANGE_TX_QUEUE_LEN)
N(CVLAN_FILTER_PUSH_INFO) N(CVLAN_FILTER_DROP_INFO)
N(SVLAN_FILTER_PUSH_INFO) N(SVLAN_FILTER_DROP_INFO)
+   N(CHANGEVLAN)
}
 #undef N
return "UNKNOWN_NETDEV_EVENT";
-- 
2.14.3



Re: [PATCH] tcp: verify the checksum of the first data segment in a new connection

2018-06-11 Thread van der Linden, Frank
Yeah, true, it's missing INERRS in this case. I'll fix it up a bit.

Frank

On 6/11/18, 4:38 PM, "Eric Dumazet"  wrote:



On 06/11/2018 04:25 PM, van der Linden, Frank wrote:
> A few comments on this one:
> 
> - obviously this is fairly serious, as it can let corrupted data all the 
way up to the application

Sure, although anyone relying on CRC checksum for ensuring TCP data 
integrity
has big troubles ;)

I would rather have a refined version of this patch doing a "goto 
csum_error" 
so that we properly increment TCP_MIB_CSUMERRORS and TCP_MIB_INERRS 

Thanks !





Re: [PATCH net] net/ipv6: Ensure cfg is properly initialized in ipv6_create_tempaddr

2018-06-11 Thread David Miller
From: dsah...@kernel.org
Date: Mon, 11 Jun 2018 07:12:12 -0700

> From: David Ahern 
> 
> Valdis reported a BUG in ipv6_add_addr:
 ...
> Looking at the code I found 1 element (peer_pfx) of the newly introduced
> ifa6_config struct that is not initialized. Use a memset rather than hard
> coding an init for each struct element.
> 
> Reported-by: Valdis Kletnieks 
> Fixes: e6464b8c63619 ("net/ipv6: Convert ipv6_add_addr to struct ifa6_config")
> Signed-off-by: David Ahern 

Applied, thanks David.


Re: [PATCH] tcp: verify the checksum of the first data segment in a new connection

2018-06-11 Thread Eric Dumazet



On 06/11/2018 04:25 PM, van der Linden, Frank wrote:
> A few comments on this one:
> 
> - obviously this is fairly serious, as it can let corrupted data all the way 
> up to the application

Sure, although anyone relying on CRC checksum for ensuring TCP data integrity
has big troubles ;)

I would rather have a refined version of this patch doing a "goto csum_error" 
so that we properly increment TCP_MIB_CSUMERRORS and TCP_MIB_INERRS 

Thanks !


Re: [PATCH net] tls: fix NULL pointer dereference on poll

2018-06-11 Thread David Miller
From: Daniel Borkmann 
Date: Mon, 11 Jun 2018 23:22:04 +0200

> While hacking on kTLS, I ran into the following panic from an
> unprivileged netserver / netperf TCP session:
 ...
> Debugging further, it turns out that calling into ctx->sk_poll() is
> invalid since sk_poll itself is NULL which was saved from the original
> TCP socket in order for tls_sw_poll() to invoke it.
> 
> Looks like the recent conversion from poll to poll_mask callback started
> in 152524231023 ("net: add support for ->poll_mask in proto_ops") missed
> to eventually convert kTLS, too: TCP's ->poll was converted over to the
> ->poll_mask in commit 2c7d3dacebd4 ("net/tcp: convert to ->poll_mask")
> and therefore kTLS wrongly saved the ->poll old one which is now NULL.
> 
> Convert kTLS over to use ->poll_mask instead. Also instead of POLLIN |
> POLLRDNORM use the proper EPOLLIN | EPOLLRDNORM bits as the case in
> tcp_poll_mask() as well that is mangled here.
> 
> Fixes: 2c7d3dacebd4 ("net/tcp: convert to ->poll_mask")
> Signed-off-by: Daniel Borkmann 

Applied, thanks Daniel.


Re: [bpf PATCH] bpf: selftest fix for sockmap

2018-06-11 Thread Y Song
On Mon, Jun 11, 2018 at 11:47 AM, John Fastabend
 wrote:
> In selftest test_maps the sockmap test case attempts to add a socket
> in listening state to the sockmap. This is no longer a valid operation
> so it fails as expected. However, the test wrongly reports this as an
> error now. Fix the test to avoid adding sockets in listening state.
>
> Fixes: 945ae430aa44 ("bpf: sockmap only allow ESTABLISHED sock state")

I cannot reproduce the failure and I cannot find the above "Fixes" commit.
I checked latest bpf/bpf-next/net-next trees. Maybe I missed something?

> Signed-off-by: John Fastabend 
> ---
>  tools/testing/selftests/bpf/test_maps.c |4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/tools/testing/selftests/bpf/test_maps.c 
> b/tools/testing/selftests/bpf/test_maps.c
> index 6c25334..9fed5f0 100644
> --- a/tools/testing/selftests/bpf/test_maps.c
> +++ b/tools/testing/selftests/bpf/test_maps.c
> @@ -564,7 +564,7 @@ static void test_sockmap(int tasks, void *data)
> }
>
> /* Test update without programs */
> -   for (i = 0; i < 6; i++) {
> +   for (i = 2; i < 6; i++) {
> err = bpf_map_update_elem(fd, , [i], BPF_ANY);
> if (err) {
> printf("Failed noprog update sockmap '%i:%i'\n",
> @@ -727,7 +727,7 @@ static void test_sockmap(int tasks, void *data)
> }
>
> /* Test map update elem afterwards fd lives in fd and map_fd */
> -   for (i = 0; i < 6; i++) {
> +   for (i = 2; i < 6; i++) {
> err = bpf_map_update_elem(map_fd_rx, , [i], BPF_ANY);
> if (err) {
> printf("Failed map_fd_rx update sockmap %i '%i:%i'\n",
>


Re: [PATCH] tcp: verify the checksum of the first data segment in a new connection

2018-06-11 Thread van der Linden, Frank
A few comments on this one:

- obviously this is fairly serious, as it can let corrupted data all the way up 
to the application
- I am not nuts about the patch itself, the code feels a bit cluttered, but 
it's the least invasive way
  I could think of. Probably some refactoring is needed at some point.
- here's how to easily reproduce it:

On the sender, set up artificial packet corruption:

#!/bin/sh
tc qdisc add dev eth0 root handle 1: prio
tc qdisc add dev eth0 parent 1:3 netem corrupt 50.0%
tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 match ip dst $DSTADDR 
flowid 10:3


Then, run the following on the sender:

while :; do dd if=/dev/zero of=/dev/stdout bs=4096 count=4 | nc $DSTADDR 1; 
sleep 1; done

..and this on the receiver:

uname -r; grep ^Tcp /proc/net/snmp; ttl=$((${SECONDS} + 300)); while [[ 
${SECONDS} -lt ${ttl} ]]; do out="foo.$(date +%s)"; nc -l 1 > "${out}"; 
md5=$(md5sum "${out}"|cut -d\  -f 1); echo -n "${md5}"; if [[ "${md5}" != 
"ce338fe6899778aacfc28414f2d9498b" ]]; then echo " corrupted"; else echo; fi; 
done; grep ^Tcp /proc/net/snmp

[obviously, if you change the size / content, the md5 sum has to be changed 
here]

This reproduces it fairly quickly (within 20 seconds) for us, on 4.14.x 
kernels. 4.17 kernels were verified to still have the issue.

Frank

On 6/11/18, 4:15 PM, "Frank van der Linden"  wrote:

commit 079096f103fa ("tcp/dccp: install syn_recv requests into ehash
table") introduced an optimization for the handling of child sockets
created for a new TCP connection.

But this optimization passes any data associated with the last ACK of the
connection handshake up the stack without verifying its checksum, because it
calls tcp_child_process(), which in turn calls tcp_rcv_state_process()
directly.  These lower-level processing functions do not do any checksum
verification.

Insert a tcp_checksum_complete call in the TCP_NEW_SYN_RECEIVE path to
fix this.

Signed-off-by: Frank van der Linden 
---
 net/ipv4/tcp_ipv4.c | 8 +++-
 net/ipv6/tcp_ipv6.c | 8 +++-
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f70586b50838..1ec4c0d4aba5 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1703,7 +1703,13 @@ int tcp_v4_rcv(struct sk_buff *skb)
th = (const struct tcphdr *)skb->data;
iph = ip_hdr(skb);
tcp_v4_fill_cb(skb, iph, th);
-   nsk = tcp_check_req(sk, skb, req, false, _stolen);
+
+   if (tcp_checksum_complete(skb)) {
+   __TCP_INC_STATS(net, TCP_MIB_CSUMERRORS);
+   } else {
+   nsk = tcp_check_req(sk, skb, req, false,
+   _stolen);
+   }
}
if (!nsk) {
reqsk_put(req);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 6d664d83cd16..a12b694d3d1e 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1486,7 +1486,13 @@ static int tcp_v6_rcv(struct sk_buff *skb)
th = (const struct tcphdr *)skb->data;
hdr = ipv6_hdr(skb);
tcp_v6_fill_cb(skb, hdr, th);
-   nsk = tcp_check_req(sk, skb, req, false, _stolen);
+
+   if (tcp_checksum_complete(skb)) {
+   __TCP_INC_STATS(net, TCP_MIB_CSUMERRORS);
+   } else {
+   nsk = tcp_check_req(sk, skb, req, false,
+   _stolen);
+   }
}
if (!nsk) {
reqsk_put(req);
-- 
2.14.4





[PATCH] tcp: verify the checksum of the first data segment in a new connection

2018-06-11 Thread Frank van der Linden
commit 079096f103fa ("tcp/dccp: install syn_recv requests into ehash
table") introduced an optimization for the handling of child sockets
created for a new TCP connection.

But this optimization passes any data associated with the last ACK of the
connection handshake up the stack without verifying its checksum, because it
calls tcp_child_process(), which in turn calls tcp_rcv_state_process()
directly.  These lower-level processing functions do not do any checksum
verification.

Insert a tcp_checksum_complete call in the TCP_NEW_SYN_RECEIVE path to
fix this.

Signed-off-by: Frank van der Linden 
---
 net/ipv4/tcp_ipv4.c | 8 +++-
 net/ipv6/tcp_ipv6.c | 8 +++-
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f70586b50838..1ec4c0d4aba5 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1703,7 +1703,13 @@ int tcp_v4_rcv(struct sk_buff *skb)
th = (const struct tcphdr *)skb->data;
iph = ip_hdr(skb);
tcp_v4_fill_cb(skb, iph, th);
-   nsk = tcp_check_req(sk, skb, req, false, _stolen);
+
+   if (tcp_checksum_complete(skb)) {
+   __TCP_INC_STATS(net, TCP_MIB_CSUMERRORS);
+   } else {
+   nsk = tcp_check_req(sk, skb, req, false,
+   _stolen);
+   }
}
if (!nsk) {
reqsk_put(req);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 6d664d83cd16..a12b694d3d1e 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1486,7 +1486,13 @@ static int tcp_v6_rcv(struct sk_buff *skb)
th = (const struct tcphdr *)skb->data;
hdr = ipv6_hdr(skb);
tcp_v6_fill_cb(skb, hdr, th);
-   nsk = tcp_check_req(sk, skb, req, false, _stolen);
+
+   if (tcp_checksum_complete(skb)) {
+   __TCP_INC_STATS(net, TCP_MIB_CSUMERRORS);
+   } else {
+   nsk = tcp_check_req(sk, skb, req, false,
+   _stolen);
+   }
}
if (!nsk) {
reqsk_put(req);
-- 
2.14.4



Re: [bpf PATCH v2 1/2] bpf: sockmap, fix crash when ipv6 sock is added

2018-06-11 Thread Daniel Borkmann
Hi John,

On 06/08/2018 05:06 PM, John Fastabend wrote:
> This fixes a crash where we assign tcp_prot to IPv6 sockets instead
> of tcpv6_prot.
> 
> Previously we overwrote the sk->prot field with tcp_prot even in the
> AF_INET6 case. This patch ensures the correct tcp_prot and tcpv6_prot
> are used. Further, only allow ESTABLISHED connections to join the
> map per note in TLS ULP,
> 
>/* The TLS ulp is currently supported only for TCP sockets
> * in ESTABLISHED state.
> * Supporting sockets in LISTEN state will require us
> * to modify the accept implementation to clone rather then
> * share the ulp context.
> */
> 
> Also tested with 'netserver -6' and 'netperf -H [IPv6]' as well as
> 'netperf -H [IPv4]'. The ESTABLISHED check resolves the previously
> crashing case here.
> 
> Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support")
> Reported-by: syzbot+5c063698bdbfac19f...@syzkaller.appspotmail.com
> Signed-off-by: John Fastabend 
> Signed-off-by: Wei Wang 
[...]

Still one question for some more clarification below that popped up while
review:

> @@ -162,6 +164,8 @@ static bool bpf_tcp_stream_read(const struct sock *sk)
>  }
>  
>  static struct proto tcp_bpf_proto;
> +static struct proto tcpv6_bpf_proto;

These two are global, w/o locking.

>  static int bpf_tcp_init(struct sock *sk)
>  {
>   struct smap_psock *psock;
> @@ -181,14 +185,30 @@ static int bpf_tcp_init(struct sock *sk)
>   psock->save_close = sk->sk_prot->close;
>   psock->sk_proto = sk->sk_prot;
>  
> + if (sk->sk_family == AF_INET6) {
> + tcpv6_bpf_proto = *sk->sk_prot;
> + tcpv6_bpf_proto.close = bpf_tcp_close;
> + } else {
> + tcp_bpf_proto = *sk->sk_prot;
> + tcp_bpf_proto.close = bpf_tcp_close;
> + }

And each time we add a BPF ULP to a v4/v6 socket, we override tcp{,v6}_bpf_proto
from scratch.

>   if (psock->bpf_tx_msg) {
> + tcpv6_bpf_proto.sendmsg = bpf_tcp_sendmsg;
> + tcpv6_bpf_proto.sendpage = bpf_tcp_sendpage;
> + tcpv6_bpf_proto.recvmsg = bpf_tcp_recvmsg;
> + tcpv6_bpf_proto.stream_memory_read = bpf_tcp_stream_read;
>   tcp_bpf_proto.sendmsg = bpf_tcp_sendmsg;
>   tcp_bpf_proto.sendpage = bpf_tcp_sendpage;
>   tcp_bpf_proto.recvmsg = bpf_tcp_recvmsg;
>   tcp_bpf_proto.stream_memory_read = bpf_tcp_stream_read;
>   }
>  
> - sk->sk_prot = _bpf_proto;
> + if (sk->sk_family == AF_INET6)
> + sk->sk_prot = _bpf_proto;
> + else
> + sk->sk_prot = _bpf_proto;

Where every active socket would be affected from it as well. Isn't that
generally racy? E.g. existing ones where tcpv6_bpf_proto.sendmsg points
to bpf_tcp_sendmsg would get overridden with earlier assignment on the
tcpv6_bpf_proto = *sk->sk_prot during their lifetime after bpf_tcp_init().

In the kTLS case, the v4 protos are built up in module init via tls_register()
and never change from there. The v6 ones are only reloaded when their addr
changes e.g. module reload would come to mind, which should only be possible
once no active v6 socket is present. What speaks against adapting similar
scheme resp. what am I missing that the above would work? (Would be nice if
there was some discussion in commit log related to it on 'why' this approach
was done differently.)

Thanks,
Daniel

>   rcu_read_unlock();
>   return 0;
>  }
> @@ -,8 +1131,6 @@ static void bpf_tcp_msg_add(struct smap_psock *psock,
>  
>  static int bpf_tcp_ulp_register(void)
>  {
> - tcp_bpf_proto = tcp_prot;
> - tcp_bpf_proto.close = bpf_tcp_close;
>   /* Once BPF TX ULP is registered it is never unregistered. It
>* will be in the ULP list for the lifetime of the system. Doing
>* duplicate registers is not a problem.
> 



Re: [Intel-wired-lan] [jkirsher/next-queue PATCH 5/7] net: Add generic ndo_select_queue functions

2018-06-11 Thread kbuild test robot
Hi Alexander,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on net-next/master]
[also build test ERROR on next-20180608]
[cannot apply to v4.17]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Alexander-Duyck/Add-support-for-L2-Fwd-Offload-w-o-ndo_select_queue/20180612-015220
config: arm-allmodconfig (attached as .config)
compiler: arm-linux-gnueabi-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=7.2.0 make.cross ARCH=arm 

All errors (new ones prefixed by >>):

>> drivers/net//ethernet/ti/netcp_core.c:1968:22: error: initialization from 
>> incompatible pointer type [-Werror=incompatible-pointer-types]
 .ndo_select_queue = dev_pick_tx_zero,
 ^~~~
   drivers/net//ethernet/ti/netcp_core.c:1968:22: note: (near initialization 
for 'netcp_netdev_ops.ndo_select_queue')
   cc1: some warnings being treated as errors

vim +1968 drivers/net//ethernet/ti/netcp_core.c

  1955  
  1956  static const struct net_device_ops netcp_netdev_ops = {
  1957  .ndo_open   = netcp_ndo_open,
  1958  .ndo_stop   = netcp_ndo_stop,
  1959  .ndo_start_xmit = netcp_ndo_start_xmit,
  1960  .ndo_set_rx_mode= netcp_set_rx_mode,
  1961  .ndo_do_ioctl   = netcp_ndo_ioctl,
  1962  .ndo_get_stats64= netcp_get_stats,
  1963  .ndo_set_mac_address= eth_mac_addr,
  1964  .ndo_validate_addr  = eth_validate_addr,
  1965  .ndo_vlan_rx_add_vid= netcp_rx_add_vid,
  1966  .ndo_vlan_rx_kill_vid   = netcp_rx_kill_vid,
  1967  .ndo_tx_timeout = netcp_ndo_tx_timeout,
> 1968  .ndo_select_queue   = dev_pick_tx_zero,
  1969  .ndo_setup_tc   = netcp_setup_tc,
  1970  };
  1971  

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


.config.gz
Description: application/gzip


Re: [PATCH] net: do not allow changing SO_REUSEADDR/SO_REUSEPORT on bound sockets

2018-06-11 Thread Marc Dionne
On Mon, Jun 11, 2018 at 7:29 PM, Maciej Żenczykowski
 wrote:
>> This change is a potential performance regression for us.  Our
>> networking code sets up multiple sockets used by multiple threads to
>> listen on the same udp port.  For the first socket, the bind is done
>> before setting SO_REUSEPORT so that we can get an error and detect
>> that the port is currently in use by a different process, possibly a
>> previous incarnation of the same application.
>>
>> With this change, the servers end up with a single listener socket and
>> thread, which can have a large impact on performance for a busy
>> server.
>>
>> While we can adjust for the new behaviour going forward, this could be
>> an unpleasant surprise for our customers who get this update when
>> moving to a new kernel version or through a backport to stable
>> kernels, if this is targeted for stable.
>
> Probably you can:
> fd1=socket()
> fd2=socket()
> bind(fd1,port=0)
> setsockopt(fd2,REUSEPORT,1)
> port = getsockname(fd1)
> close(fd1)
> bind(fd2,port)
>
> Although yeah there's a slight chance of a race (ie. 2nd bind failing,
> in which case close() and retry).

This would be a bit different since we want a specific port rather
than a wildcard, but yes, a temporary socket could be bound just to
check if the port is in use and closed afterwards.  The problem is
that since fd2 has REUSEPORT set, that second bind might well succeed
if there was a race and the other user of the port also has REUSEPORT
set, as could be the case if it's another instance of the same
application.

Marc


Re: Qualcomm rmnet driver and qmi_wwan

2018-06-11 Thread Subash Abhinov Kasiviswanathan

both patches work properly for me.

Maybe it could be helpful in the first patch to add a print when
pass-through setting fails if raw-ip has not been set, just to let the
user know what is happening.



Thanks for testing Daniele.
I can add an error message there when pass through mode setting fails.
Will you be submitting the patch for iproute2. Otherwise, I can do so
a bit later.


Maybe rmnet_is_real_dev_registered(dev->net) instead, since we use that
elsewhere in this function?

Like Daniele said: It would be good to have some way to know when the
rawip condition fails.  Or even better: Automatically force rawip mode
when the rmnet driver attaches.  But that doesn't seem possible?  No
notifications or anything when an rx handler is registered?

Hmm, looking at this I wonder: Is the rawip check really necessary?  
You

skip all the extra rawip code in the driver anyway, so I don't see how
it matters.  But maybe the ethernet header_ops are a problem?

And I wonder about using skb->dev here.  Does that really work?  I
didn't think we set that until later.  Why not use dev->net instead?




Hi Bjørn

Yes, I will switch to using dev->net.
There doesnt seem to be a net device notifier event when the rx 
registration

happens.

If the dev type is ethernet, rmnet driver will try to remove the first 
14
bytes since it assumes an ethernet header is present in the packet. 
Hence the

need for raw ip mode in qmi_wwan.

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project


Re: [PATCH bpf] xsk: silence warning on memory allocation failure

2018-06-11 Thread Y Song
On Mon, Jun 11, 2018 at 4:57 AM, Björn Töpel  wrote:
> From: Björn Töpel 
>
> syzkaller reported a warning from xdp_umem_pin_pages():
>
>   WARNING: CPU: 1 PID: 4537 at mm/slab_common.c:996 kmalloc_slab+0x56/0x70 
> mm/slab_common.c:996
>   ...
>   __do_kmalloc mm/slab.c:3713 [inline]
>   __kmalloc+0x25/0x760 mm/slab.c:3727
>   kmalloc_array include/linux/slab.h:634 [inline]
>   kcalloc include/linux/slab.h:645 [inline]
>   xdp_umem_pin_pages net/xdp/xdp_umem.c:205 [inline]
>   xdp_umem_reg net/xdp/xdp_umem.c:318 [inline]
>   xdp_umem_create+0x5c9/0x10f0 net/xdp/xdp_umem.c:349
>   xsk_setsockopt+0x443/0x550 net/xdp/xsk.c:531
>   __sys_setsockopt+0x1bd/0x390 net/socket.c:1935
>   __do_sys_setsockopt net/socket.c:1946 [inline]
>   __se_sys_setsockopt net/socket.c:1943 [inline]
>   __x64_sys_setsockopt+0xbe/0x150 net/socket.c:1943
>   do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> This is a warning about attempting to allocate more than
> KMALLOC_MAX_SIZE memory. The request originates from userspace, and if
> the request is too big, the kernel is free to deny its allocation. In
> this patch, the failed allocation attempt is silenced with
> __GFP_NOWARN.
>
> Fixes: c0c77d8fb787 ("xsk: add user memory registration support sockopt")
> Reported-by: syzbot+4abadc5d69117b346...@syzkaller.appspotmail.com
> Signed-off-by: Björn Töpel 

Acked-by: Yonghong Song 


Re: [PATCH] net: do not allow changing SO_REUSEADDR/SO_REUSEPORT on bound sockets

2018-06-11 Thread Maciej Żenczykowski
> This change is a potential performance regression for us.  Our
> networking code sets up multiple sockets used by multiple threads to
> listen on the same udp port.  For the first socket, the bind is done
> before setting SO_REUSEPORT so that we can get an error and detect
> that the port is currently in use by a different process, possibly a
> previous incarnation of the same application.
>
> With this change, the servers end up with a single listener socket and
> thread, which can have a large impact on performance for a busy
> server.
>
> While we can adjust for the new behaviour going forward, this could be
> an unpleasant surprise for our customers who get this update when
> moving to a new kernel version or through a backport to stable
> kernels, if this is targeted for stable.

Probably you can:
fd1=socket()
fd2=socket()
bind(fd1,port=0)
setsockopt(fd2,REUSEPORT,1)
port = getsockname(fd1)
close(fd1)
bind(fd2,port)

Although yeah there's a slight chance of a race (ie. 2nd bind failing,
in which case close() and retry).


Re: [PATCH net] tls: fix NULL pointer dereference on poll

2018-06-11 Thread Dave Watson
On 06/11/18 11:22 PM, Daniel Borkmann wrote:
> While hacking on kTLS, I ran into the following panic from an
> unprivileged netserver / netperf TCP session:
> 
>   [...]
> Looks like the recent conversion from poll to poll_mask callback started
> in 152524231023 ("net: add support for ->poll_mask in proto_ops") missed
> to eventually convert kTLS, too: TCP's ->poll was converted over to the
> ->poll_mask in commit 2c7d3dacebd4 ("net/tcp: convert to ->poll_mask")
> and therefore kTLS wrongly saved the ->poll old one which is now NULL.
> 
> Convert kTLS over to use ->poll_mask instead. Also instead of POLLIN |
> POLLRDNORM use the proper EPOLLIN | EPOLLRDNORM bits as the case in
> tcp_poll_mask() as well that is mangled here.

Thanks, was just trying to bisect this myself. Works for me. 

Tested-by: Dave Watson 

> Fixes: 2c7d3dacebd4 ("net/tcp: convert to ->poll_mask")
> Signed-off-by: Daniel Borkmann 
> Cc: Christoph Hellwig 
> Cc: Dave Watson 


Re: [PATCH bpf] xsk: silence warning on memory allocation failure

2018-06-11 Thread Daniel Borkmann
On 06/11/2018 01:57 PM, Björn Töpel wrote:
> From: Björn Töpel 
> 
> syzkaller reported a warning from xdp_umem_pin_pages():
> 
>   WARNING: CPU: 1 PID: 4537 at mm/slab_common.c:996 kmalloc_slab+0x56/0x70 
> mm/slab_common.c:996
>   ...
>   __do_kmalloc mm/slab.c:3713 [inline]
>   __kmalloc+0x25/0x760 mm/slab.c:3727
>   kmalloc_array include/linux/slab.h:634 [inline]
>   kcalloc include/linux/slab.h:645 [inline]
>   xdp_umem_pin_pages net/xdp/xdp_umem.c:205 [inline]
>   xdp_umem_reg net/xdp/xdp_umem.c:318 [inline]
>   xdp_umem_create+0x5c9/0x10f0 net/xdp/xdp_umem.c:349
>   xsk_setsockopt+0x443/0x550 net/xdp/xsk.c:531
>   __sys_setsockopt+0x1bd/0x390 net/socket.c:1935
>   __do_sys_setsockopt net/socket.c:1946 [inline]
>   __se_sys_setsockopt net/socket.c:1943 [inline]
>   __x64_sys_setsockopt+0xbe/0x150 net/socket.c:1943
>   do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> This is a warning about attempting to allocate more than
> KMALLOC_MAX_SIZE memory. The request originates from userspace, and if
> the request is too big, the kernel is free to deny its allocation. In
> this patch, the failed allocation attempt is silenced with
> __GFP_NOWARN.
> 
> Fixes: c0c77d8fb787 ("xsk: add user memory registration support sockopt")
> Reported-by: syzbot+4abadc5d69117b346...@syzkaller.appspotmail.com
> Signed-off-by: Björn Töpel 

Applied to bpf, thanks Björn!


Re: [PATCH] net: do not allow changing SO_REUSEADDR/SO_REUSEPORT on bound sockets

2018-06-11 Thread Marc Dionne
On Sun, Jun 3, 2018 at 2:47 PM, Maciej Żenczykowski
 wrote:
> From: Maciej Żenczykowski 
>
> It is not safe to do so because such sockets are already in the
> hash tables and changing these options can result in invalidating
> the tb->fastreuse(port) caching.
>
> This can have later far reaching consequences wrt. bind conflict checks
> which rely on these caches (for optimization purposes).
>
> Not to mention that you can currently end up with two identical
> non-reuseport listening sockets bound to the same local ip:port
> by clearing reuseport on them after they've already both been bound.
>
> There is unfortunately no EISBOUND error or anything similar,
> and EISCONN seems to be misleading for a bound-but-not-connected
> socket, so use EUCLEAN 'Structure needs cleaning' which AFAICT
> is the closest you can get to meaning 'socket in bad state'.
> (although perhaps EINVAL wouldn't be a bad choice either?)
>
> This does unfortunately run the risk of breaking buggy
> userspace programs...

This change is a potential performance regression for us.  Our
networking code sets up multiple sockets used by multiple threads to
listen on the same udp port.  For the first socket, the bind is done
before setting SO_REUSEPORT so that we can get an error and detect
that the port is currently in use by a different process, possibly a
previous incarnation of the same application.

With this change, the servers end up with a single listener socket and
thread, which can have a large impact on performance for a busy
server.

While we can adjust for the new behaviour going forward, this could be
an unpleasant surprise for our customers who get this update when
moving to a new kernel version or through a backport to stable
kernels, if this is targeted for stable.

Marc


[PATCH net] tls: fix NULL pointer dereference on poll

2018-06-11 Thread Daniel Borkmann
While hacking on kTLS, I ran into the following panic from an
unprivileged netserver / netperf TCP session:

  BUG: unable to handle kernel NULL pointer dereference at 
  PGD 80037f378067 P4D 80037f378067 PUD 3c0e61067 PMD 0
  Oops: 0010 [#1] SMP KASAN PTI
  CPU: 1 PID: 2289 Comm: netserver Not tainted 4.17.0+ #139
  Hardware name: LENOVO 20FBCTO1WW/20FBCTO1WW, BIOS N1FET47W (1.21 ) 11/28/2016
  RIP: 0010:  (null)
  Code: Bad RIP value.
  RSP: 0018:88036abcf740 EFLAGS: 00010246
  RAX: dc00 RBX: 88036f5f6800 RCX: 11006debed26
  RDX: 88036abcf920 RSI: 8803cb1a4f00 RDI: 8803c258c280
  RBP: 8803c258c280 R08: 8803c258c280 R09: ed006f559d48
  R10: 88037aacea43 R11: ed006f559d49 R12: 8803c258c280
  R13: 8803cb1a4f20 R14: 00db R15: c168a350
  FS:  7f7e631f4700() GS:8803d1c8() knlGS:
  CS:  0010 DS:  ES:  CR0: 80050033
  CR2: ffd6 CR3: 0003ccf64005 CR4: 003606e0
  DR0:  DR1:  DR2: 
  DR3:  DR6: fffe0ff0 DR7: 0400
  Call Trace:
   ? tls_sw_poll+0xa4/0x160 [tls]
   ? sock_poll+0x20a/0x680
   ? do_select+0x77b/0x11a0
   ? poll_schedule_timeout.constprop.12+0x130/0x130
   ? pick_link+0xb00/0xb00
   ? read_word_at_a_time+0x13/0x20
   ? vfs_poll+0x270/0x270
   ? deref_stack_reg+0xad/0xe0
   ? __read_once_size_nocheck.constprop.6+0x10/0x10
  [...]

Debugging further, it turns out that calling into ctx->sk_poll() is
invalid since sk_poll itself is NULL which was saved from the original
TCP socket in order for tls_sw_poll() to invoke it.

Looks like the recent conversion from poll to poll_mask callback started
in 152524231023 ("net: add support for ->poll_mask in proto_ops") missed
to eventually convert kTLS, too: TCP's ->poll was converted over to the
->poll_mask in commit 2c7d3dacebd4 ("net/tcp: convert to ->poll_mask")
and therefore kTLS wrongly saved the ->poll old one which is now NULL.

Convert kTLS over to use ->poll_mask instead. Also instead of POLLIN |
POLLRDNORM use the proper EPOLLIN | EPOLLRDNORM bits as the case in
tcp_poll_mask() as well that is mangled here.

Fixes: 2c7d3dacebd4 ("net/tcp: convert to ->poll_mask")
Signed-off-by: Daniel Borkmann 
Cc: Christoph Hellwig 
Cc: Dave Watson 
---
 include/net/tls.h  |  6 ++
 net/tls/tls_main.c |  2 +-
 net/tls/tls_sw.c   | 19 +--
 3 files changed, 12 insertions(+), 15 deletions(-)

diff --git a/include/net/tls.h b/include/net/tls.h
index 70c2737..7f84ea3 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -109,8 +109,7 @@ struct tls_sw_context_rx {
 
struct strparser strp;
void (*saved_data_ready)(struct sock *sk);
-   unsigned int (*sk_poll)(struct file *file, struct socket *sock,
-   struct poll_table_struct *wait);
+   __poll_t (*sk_poll_mask)(struct socket *sock, __poll_t events);
struct sk_buff *recv_pkt;
u8 control;
bool decrypted;
@@ -225,8 +224,7 @@ void tls_sw_free_resources_tx(struct sock *sk);
 void tls_sw_free_resources_rx(struct sock *sk);
 int tls_sw_recvmsg(struct sock *sk, struct msghdr *msg, size_t len,
   int nonblock, int flags, int *addr_len);
-unsigned int tls_sw_poll(struct file *file, struct socket *sock,
-struct poll_table_struct *wait);
+__poll_t tls_sw_poll_mask(struct socket *sock, __poll_t events);
 ssize_t tls_sw_splice_read(struct socket *sock, loff_t *ppos,
   struct pipe_inode_info *pipe,
   size_t len, unsigned int flags);
diff --git a/net/tls/tls_main.c b/net/tls/tls_main.c
index 301f224..a127d61 100644
--- a/net/tls/tls_main.c
+++ b/net/tls/tls_main.c
@@ -712,7 +712,7 @@ static int __init tls_register(void)
build_protos(tls_prots[TLSV4], _prot);
 
tls_sw_proto_ops = inet_stream_ops;
-   tls_sw_proto_ops.poll = tls_sw_poll;
+   tls_sw_proto_ops.poll_mask = tls_sw_poll_mask;
tls_sw_proto_ops.splice_read = tls_sw_splice_read;
 
 #ifdef CONFIG_TLS_DEVICE
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 8ca57d0..34895b7 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -915,23 +915,22 @@ ssize_t tls_sw_splice_read(struct socket *sock,  loff_t 
*ppos,
return copied ? : err;
 }
 
-unsigned int tls_sw_poll(struct file *file, struct socket *sock,
-struct poll_table_struct *wait)
+__poll_t tls_sw_poll_mask(struct socket *sock, __poll_t events)
 {
-   unsigned int ret;
struct sock *sk = sock->sk;
struct tls_context *tls_ctx = tls_get_ctx(sk);
struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx);
+   __poll_t mask;
 
-   /* Grab POLLOUT and POLLHUP from the underlying socket */
-   ret = ctx->sk_poll(file, sock, wait);
+   /* Grab EPOLLOUT and EPOLLHUP from the 

Re: [PATCH net] ipv6: allow PMTU exceptions to local routes

2018-06-11 Thread David Miller
From: Julian Anastasov 
Date: Mon, 11 Jun 2018 02:02:54 +0300

> IPVS setups with local client and remote tunnel server need
> to create exception for the local virtual IP. What we do is to
> change PMTU from 64KB (on "lo") to 1460 in the common case.
> 
> Suggested-by: Martin KaFai Lau 
> Fixes: 45e4fd26683c ("ipv6: Only create RTF_CACHE routes after encountering 
> pmtu exception")
> Fixes: 7343ff31ebf0 ("ipv6: Don't create clones of host routes.")
> Signed-off-by: Julian Anastasov 

Applied and queued up for -stable, thanks.


[Patch net] smc: convert to ->poll_mask

2018-06-11 Thread Cong Wang
smc->clcsock is an internal TCP socket, after TCP socket
converts to ->poll_mask, ->poll doesn't exist any more.
So just convert smc socket to ->poll_mask too.

Fixes: 2c7d3dacebd4 ("net/tcp: convert to ->poll_mask")
Reported-by: syzbot+f5066e369b2d5fff6...@syzkaller.appspotmail.com
Cc: Christoph Hellwig 
Cc: Ursula Braun 
Signed-off-by: Cong Wang 
---
 net/smc/af_smc.c | 12 +++-
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 973b4471b532..da7f02edcd37 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -1273,8 +1273,7 @@ static __poll_t smc_accept_poll(struct sock *parent)
return mask;
 }
 
-static __poll_t smc_poll(struct file *file, struct socket *sock,
-poll_table *wait)
+static __poll_t smc_poll_mask(struct socket *sock, __poll_t events)
 {
struct sock *sk = sock->sk;
__poll_t mask = 0;
@@ -1290,7 +1289,7 @@ static __poll_t smc_poll(struct file *file, struct socket 
*sock,
if ((sk->sk_state == SMC_INIT) || smc->use_fallback) {
/* delegate to CLC child sock */
release_sock(sk);
-   mask = smc->clcsock->ops->poll(file, smc->clcsock, wait);
+   mask = smc->clcsock->ops->poll_mask(smc->clcsock, events);
lock_sock(sk);
sk->sk_err = smc->clcsock->sk->sk_err;
if (sk->sk_err) {
@@ -1308,11 +1307,6 @@ static __poll_t smc_poll(struct file *file, struct 
socket *sock,
}
}
} else {
-   if (sk->sk_state != SMC_CLOSED) {
-   release_sock(sk);
-   sock_poll_wait(file, sk_sleep(sk), wait);
-   lock_sock(sk);
-   }
if (sk->sk_err)
mask |= EPOLLERR;
if ((sk->sk_shutdown == SHUTDOWN_MASK) ||
@@ -1625,7 +1619,7 @@ static const struct proto_ops smc_sock_ops = {
.socketpair = sock_no_socketpair,
.accept = smc_accept,
.getname= smc_getname,
-   .poll   = smc_poll,
+   .poll_mask  = smc_poll_mask,
.ioctl  = smc_ioctl,
.listen = smc_listen,
.shutdown   = smc_shutdown,
-- 
2.13.0



RE: [net] fq_codel: fix NULL pointer deref in fq_codel_reset

2018-06-11 Thread Keller, Jacob E
> -Original Message-
> From: Cong Wang [mailto:xiyou.wangc...@gmail.com]
> Sent: Monday, June 11, 2018 1:03 PM
> To: Kirsher, Jeffrey T 
> Cc: David Miller ; Keller, Jacob E
> ; Linux Kernel Network Developers
> ; nhor...@redhat.com; sassm...@redhat.com;
> jogre...@redhat.com; Eric Dumazet 
> Subject: Re: [net] fq_codel: fix NULL pointer deref in fq_codel_reset
> 

> Making q->flows_cnt 0 is simpler and easier to understand.

Feel free to propose such a patch :)

Thanks,
Jake


Re: [net] fq_codel: fix NULL pointer deref in fq_codel_reset

2018-06-11 Thread Cong Wang
On Mon, Jun 11, 2018 at 12:57 PM, Keller, Jacob E
 wrote:
>
> I'm open to alternative suggestinos for fixing this, I think Eric suggested 
> that maybe we should just remove the ->reset() call from qdisc_destroy..?

You can't remove ->reset() for non-failure call path.

For failure path, yeah, but it is much simpler to just make
q->flows_cnt be 0 for this specific case.


Re: [net] fq_codel: fix NULL pointer deref in fq_codel_reset

2018-06-11 Thread Cong Wang
On Mon, Jun 11, 2018 at 10:00 AM, Jeff Kirsher
 wrote:
>
> We could mitigate some of these issues by changing fq_codel_init to more
> explicitly cleanup after itself when failing. For example, we could
> ensure that q->flowcnt was set to 0 so that the loop over each flow in
> fq_codel_reset would not trigger. However, this would not prevent a NULL
> pointer dereference when attempting to memset the q->backlogs.

Are you saying memset(ptr, 0, 0) is not nop?? :-/

Making q->flows_cnt 0 is simpler and easier to understand.


RE: [net] fq_codel: fix NULL pointer deref in fq_codel_reset

2018-06-11 Thread Keller, Jacob E
> -Original Message-
> From: Kirsher, Jeffrey T
> Sent: Monday, June 11, 2018 10:00 AM
> To: da...@davemloft.net
> Cc: Keller, Jacob E ; netdev@vger.kernel.org;
> nhor...@redhat.com; sassm...@redhat.com; jogre...@redhat.com; Eric
> Dumazet ; Kirsher, Jeffrey T
> 
> Subject: [net] fq_codel: fix NULL pointer deref in fq_codel_reset
> 
> From: Jacob Keller 
> 
> The function qdisc_create_dftl attempts to create a default qdisc. If
> this fails, it calls qdisc_destroy when cleaning up. The qdisc_destroy
> function calls the ->reset op on the qdisc.
> 
> In the case of sch_fq_codel.c, this function will panic when the qdisc
> wasn't properly initialized:
> 
>kernel: BUG: unable to handle kernel NULL pointer dereference at
> 0008
>kernel: IP: fq_codel_reset+0x58/0xd0 [sch_fq_codel]
>kernel: PGD 0 P4D 0
>kernel: Oops:  [#1] SMP PTI
>kernel: Modules linked in: i40iw i40e(OE) xt_CHECKSUM iptable_mangle
> ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat
> nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack tun bridge stp llc
> devlink ebtable_filter ebtables ip6table_filter ip6_tables rpcrdma ib_isert
> iscsi_target_mod sunrpc ib_iser libiscsi scsi_transport_iscsi ib_srpt
> target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm
> ib_uverbs ib_umad rdma_cm ib_cm iw_cm intel_rapl sb_edac
> x86_pkg_temp_thermal intel_powerclamp coretemp kvm irqbypass
> crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate iTCO_wdt
> iTCO_vendor_support intel_uncore ib_core intel_rapl_perf mei_me mei joydev
> i2c_i801 lpc_ich ioatdma shpchp wmi sch_fq_codel xfs libcrc32c mgag200 ixgbe
> drm_kms_helper isci ttm firewire_ohci
>kernel:  mdio drm igb libsas crc32c_intel firewire_core ptp pps_core
> scsi_transport_sas crc_itu_t dca i2c_algo_bit ipmi_si ipmi_devintf
> ipmi_msghandler [last unloaded: i40e]
>kernel: CPU: 10 PID: 4219 Comm: ip Tainted: G   OE
> 4.16.13custom-fq-
> codel-test+ #3
>kernel: Hardware name: Intel Corporation S2600CO/S2600CO, BIOS
> SE5C600.86B.02.05.0004.051120151007 05/11/2015
>kernel: RIP: 0010:fq_codel_reset+0x58/0xd0 [sch_fq_codel]
>kernel: RSP: 0018:bfbf4c1fb620 EFLAGS: 00010246
>kernel: RAX: 0400 RBX:  RCX: 05b9
>kernel: RDX:  RSI: 9d03264a60c0 RDI: 9cfd17b31c00
>kernel: RBP: 0001 R08: 000260c0 R09: b679c3e9
>kernel: R10: f1dab06a0e80 R11: 9cfd163af800 R12: 9cfd17b31c00
>kernel: R13: 0001 R14: 9cfd153de600 R15: 0001
>kernel: FS:  7fdec2f92800() GS:9d032648()
> knlGS:
>kernel: CS:  0010 DS:  ES:  CR0: 80050033
>kernel: CR2: 0008 CR3: 000c1956a006 CR4: 000606e0
>kernel: Call Trace:
>kernel:  qdisc_destroy+0x56/0x140
>kernel:  qdisc_create_dflt+0x8b/0xb0
>kernel:  mq_init+0xc1/0xf0
>kernel:  qdisc_create_dflt+0x5a/0xb0
>kernel:  dev_activate+0x205/0x230
>kernel:  __dev_open+0xf5/0x160
>kernel:  __dev_change_flags+0x1a3/0x210
>kernel:  dev_change_flags+0x21/0x60
>kernel:  do_setlink+0x660/0xdf0
>kernel:  ? down_trylock+0x25/0x30
>kernel:  ? xfs_buf_trylock+0x1a/0xd0 [xfs]
>kernel:  ? rtnl_newlink+0x816/0x990
>kernel:  ? _xfs_buf_find+0x327/0x580 [xfs]
>kernel:  ? _cond_resched+0x15/0x30
>kernel:  ? kmem_cache_alloc+0x20/0x1b0
>kernel:  ? rtnetlink_rcv_msg+0x200/0x2f0
>kernel:  ? rtnl_calcit.isra.30+0x100/0x100
>kernel:  ? netlink_rcv_skb+0x4c/0x120
>kernel:  ? netlink_unicast+0x19e/0x260
>kernel:  ? netlink_sendmsg+0x1ff/0x3c0
>kernel:  ? sock_sendmsg+0x36/0x40
>kernel:  ? ___sys_sendmsg+0x295/0x2f0
>kernel:  ? ebitmap_cmp+0x6d/0x90
>kernel:  ? dev_get_by_name_rcu+0x73/0x90
>kernel:  ? skb_dequeue+0x52/0x60
>kernel:  ? __inode_wait_for_writeback+0x7f/0xf0
>kernel:  ? bit_waitqueue+0x30/0x30
>kernel:  ? fsnotify_grab_connector+0x3c/0x60
>kernel:  ? __sys_sendmsg+0x51/0x90
>kernel:  ? do_syscall_64+0x74/0x180
>kernel:  ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2
>kernel: Code: 00 00 48 89 87 00 02 00 00 8b 87 a0 01 00 00 85 c0 0f 84 84 
> 00 00 00
> 31 ed 48 63 dd 83 c5 01 48 c1 e3 06 49 03 9c 24 90 01 00 00 <48> 8b 73 08 48 
> 8b 3b e8
> 6c 9a 4f f6 48 8d 43 10 48 c7 03 00 00
>kernel: RIP: fq_codel_reset+0x58/0xd0 [sch_fq_codel] RSP: bfbf4c1fb620
>kernel: CR2: 0008
>kernel: ---[ end trace e81a62bede66274e ]---
> 
> This occurs because if fq_codel_init fails, it has left the private data
> in an incomplete state. For example, if tcf_block_get fails, (as in the
> above panic), then q->flows and q->backlogs will be NULL. Thus they will
> cause NULL pointer access when attempting to reset them in
> fq_codel_reset.
> 
> We could mitigate some of these issues by changing fq_codel_init to more
> 

RE: locking in wimax/i2400m

2018-06-11 Thread Perez-Gonzalez, Inaky
Hello Sebastian

Thanks for checking this out,

SA> I tried to figure out if the URB-completion handler uses any
SA> locking and stumbled here.

SA> i2400m_pm_notifier() is called from process context. This
SA> function invokes i2400m_fw_cache() + i2400m_fw_uncache(). Both
SA> functions do spin_lock(>rx_lock); while in other
SA> places (say i2400mu_rxd()) it does
SA> spin_lock_irqsave(>rx_lock, flags);

SA> So what do I miss? Is this lock never used in interrupt
SA> context and lockdep didn't complain or did nobody try suspend
SA> with this driver before?  From what I can tell
SA> i2400m_dev_bootstrap() has the same locking problem.

I don't remember ever getting to try suspend in the driver, so that
might be the case. That said, I haven't touched this code in years, or
the current locking best practices, so I'll let others chime in,
Thomas being prolly one of the key ones.



[PATCH net 1/3] hv_netvsc: drop common code until callback model fixed

2018-06-11 Thread Stephen Hemminger
The callback model of handling network failover is not suitable
in the current form.
  1. It was merged without addressing all the review feedback.
  2. It was merged without approval of any of the netvsc maintainers.
  3. Design discussion on how to handle PV/VF fallback is still
 not complete.
  4. IMHO the code model using callbacks is trying to make
 something common which isn't.

Revert the netvsc specific changes for now. Does not impact ongoing
development of failover model for virtio.
Revisit this after a simpler library based failover kernel
routines are extracted.

This reverts
commit 9c6ffbacdb57 ("hv_netvsc: fix error return code in netvsc_probe()")
and
commit 1ff78076d8dd ("netvsc: refactor notifier/event handling code to use the 
failover framework")

Signed-off-by: Stephen Hemminger 
---
 drivers/net/hyperv/Kconfig  |   1 -
 drivers/net/hyperv/hyperv_net.h |   2 -
 drivers/net/hyperv/netvsc_drv.c | 224 +++-
 3 files changed, 165 insertions(+), 62 deletions(-)

diff --git a/drivers/net/hyperv/Kconfig b/drivers/net/hyperv/Kconfig
index 23a2d145813a..0765d5f61714 100644
--- a/drivers/net/hyperv/Kconfig
+++ b/drivers/net/hyperv/Kconfig
@@ -2,6 +2,5 @@ config HYPERV_NET
tristate "Microsoft Hyper-V virtual network driver"
depends on HYPERV
select UCS2_STRING
-   select FAILOVER
help
  Select this option to enable the Hyper-V virtual network driver.
diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index 99d8e7398a5b..1be34d2e3563 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -932,8 +932,6 @@ struct net_device_context {
u32 vf_alloc;
/* Serial number of the VF to team with */
u32 vf_serial;
-
-   struct failover *failover;
 };
 
 /* Per channel data */
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 8eec156418ea..2d4370c94b6e 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -43,7 +43,6 @@
 #include 
 #include 
 #include 
-#include 
 
 #include "hyperv_net.h"
 
@@ -1782,6 +1781,46 @@ static void netvsc_link_change(struct work_struct *w)
rtnl_unlock();
 }
 
+static struct net_device *get_netvsc_bymac(const u8 *mac)
+{
+   struct net_device *dev;
+
+   ASSERT_RTNL();
+
+   for_each_netdev(_net, dev) {
+   if (dev->netdev_ops != _ops)
+   continue;   /* not a netvsc device */
+
+   if (ether_addr_equal(mac, dev->perm_addr))
+   return dev;
+   }
+
+   return NULL;
+}
+
+static struct net_device *get_netvsc_byref(struct net_device *vf_netdev)
+{
+   struct net_device *dev;
+
+   ASSERT_RTNL();
+
+   for_each_netdev(_net, dev) {
+   struct net_device_context *net_device_ctx;
+
+   if (dev->netdev_ops != _ops)
+   continue;   /* not a netvsc device */
+
+   net_device_ctx = netdev_priv(dev);
+   if (!rtnl_dereference(net_device_ctx->nvdev))
+   continue;   /* device is removed */
+
+   if (rtnl_dereference(net_device_ctx->vf_netdev) == vf_netdev)
+   return dev; /* a match */
+   }
+
+   return NULL;
+}
+
 /* Called when VF is injecting data into network stack.
  * Change the associated network device from VF to netvsc.
  * note: already called with rcu_read_lock
@@ -1804,6 +1843,46 @@ static rx_handler_result_t netvsc_vf_handle_frame(struct 
sk_buff **pskb)
return RX_HANDLER_ANOTHER;
 }
 
+static int netvsc_vf_join(struct net_device *vf_netdev,
+ struct net_device *ndev)
+{
+   struct net_device_context *ndev_ctx = netdev_priv(ndev);
+   int ret;
+
+   ret = netdev_rx_handler_register(vf_netdev,
+netvsc_vf_handle_frame, ndev);
+   if (ret != 0) {
+   netdev_err(vf_netdev,
+  "can not register netvsc VF receive handler (err = 
%d)\n",
+  ret);
+   goto rx_handler_failed;
+   }
+
+   ret = netdev_master_upper_dev_link(vf_netdev, ndev,
+  NULL, NULL, NULL);
+   if (ret != 0) {
+   netdev_err(vf_netdev,
+  "can not set master device %s (err = %d)\n",
+  ndev->name, ret);
+   goto upper_link_failed;
+   }
+
+   /* set slave flag before open to prevent IPv6 addrconf */
+   vf_netdev->flags |= IFF_SLAVE;
+
+   schedule_delayed_work(_ctx->vf_takeover, VF_TAKEOVER_INT);
+
+   call_netdevice_notifiers(NETDEV_JOIN, vf_netdev);
+
+   netdev_info(vf_netdev, "joined to %s\n", ndev->name);
+   return 0;
+
+upper_link_failed:
+   netdev_rx_handler_unregister(vf_netdev);
+rx_handler_failed:
+   return ret;
+}
+
 

[PATCH net 0/3] hv_netvsc: notification and namespace fixes

2018-06-11 Thread Stephen Hemminger
This set of patches addresses two set of fixes. First it backs out
the common callback model which was merged in net-next without
completing all the review feedback or getting maintainer approval.

Then it fixes the transparent VF management code to handle network
namespaces.

Stephen Hemminger (3):
  hv_netvsc: drop common code until callback model fixed
  hv_netvsc: fix network namespace issues with VF support
  hv_netvsc: move VF to same namespace as netvsc device

 drivers/net/hyperv/Kconfig  |   1 -
 drivers/net/hyperv/hyperv_net.h |   4 +-
 drivers/net/hyperv/netvsc_drv.c | 242 
 3 files changed, 184 insertions(+), 63 deletions(-)

-- 
2.17.1



[PATCH net 3/3] hv_netvsc: move VF to same namespace as netvsc device

2018-06-11 Thread Stephen Hemminger
When VF is added, the paravirtual device is already present
and may have been moved to another network namespace. For example,
sometimes the management interface is put in another net namespace
in some environments.

The VF should get moved to where the netvsc device is when the
VF is discovered. The user can move it later (if desired).

Signed-off-by: Stephen Hemminger 
---
 drivers/net/hyperv/netvsc_drv.c | 21 -
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 8cb21e013d1d..bf1b845c1147 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -1930,6 +1930,7 @@ static int netvsc_register_vf(struct net_device 
*vf_netdev)
struct net_device *ndev;
struct net_device_context *net_device_ctx;
struct netvsc_device *netvsc_dev;
+   int ret;
 
if (vf_netdev->addr_len != ETH_ALEN)
return NOTIFY_DONE;
@@ -1948,11 +1949,29 @@ static int netvsc_register_vf(struct net_device 
*vf_netdev)
if (!netvsc_dev || rtnl_dereference(net_device_ctx->vf_netdev))
return NOTIFY_DONE;
 
-   if (netvsc_vf_join(vf_netdev, ndev) != 0)
+   /* if syntihetic interface is a different namespace,
+* then move the VF to that namespace; join will be
+* done again in that context.
+*/
+   if (!net_eq(dev_net(ndev), dev_net(vf_netdev))) {
+   ret = dev_change_net_namespace(vf_netdev,
+  dev_net(ndev), "eth%d");
+   if (ret)
+   netdev_err(vf_netdev,
+  "could not move to same namespace as %s: 
%d\n",
+  ndev->name, ret);
+   else
+   netdev_info(vf_netdev,
+   "VF moved to namespace with: %s\n",
+   ndev->name);
return NOTIFY_DONE;
+   }
 
netdev_info(ndev, "VF registering: %s\n", vf_netdev->name);
 
+   if (netvsc_vf_join(vf_netdev, ndev) != 0)
+   return NOTIFY_DONE;
+
dev_hold(vf_netdev);
rcu_assign_pointer(net_device_ctx->vf_netdev, vf_netdev);
return NOTIFY_OK;
-- 
2.17.1



[PATCH net 2/3] hv_netvsc: fix network namespace issues with VF support

2018-06-11 Thread Stephen Hemminger
When finding the parent netvsc device, the search needs to be across
all netvsc device instances (independent of network namespace).

Find parent device of VF using upper_dev_get routine which
searches only adjacent list.

Fixes: e8ff40d4bff1 ("hv_netvsc: improve VF device matching")
Signed-off-by: Stephen Hemminger 

netns aware byref
---
 drivers/net/hyperv/hyperv_net.h |  2 ++
 drivers/net/hyperv/netvsc_drv.c | 43 +++--
 2 files changed, 22 insertions(+), 23 deletions(-)

diff --git a/drivers/net/hyperv/hyperv_net.h b/drivers/net/hyperv/hyperv_net.h
index 1be34d2e3563..c0b3f3c125d4 100644
--- a/drivers/net/hyperv/hyperv_net.h
+++ b/drivers/net/hyperv/hyperv_net.h
@@ -902,6 +902,8 @@ struct net_device_context {
struct hv_device *device_ctx;
/* netvsc_device */
struct netvsc_device __rcu *nvdev;
+   /* list of netvsc net_devices */
+   struct list_head list;
/* reconfigure work */
struct delayed_work dwork;
/* last reconfig time */
diff --git a/drivers/net/hyperv/netvsc_drv.c b/drivers/net/hyperv/netvsc_drv.c
index 2d4370c94b6e..8cb21e013d1d 100644
--- a/drivers/net/hyperv/netvsc_drv.c
+++ b/drivers/net/hyperv/netvsc_drv.c
@@ -69,6 +69,8 @@ static int debug = -1;
 module_param(debug, int, 0444);
 MODULE_PARM_DESC(debug, "Debug level (0=none,...,16=all)");
 
+static LIST_HEAD(netvsc_dev_list);
+
 static void netvsc_change_rx_flags(struct net_device *net, int change)
 {
struct net_device_context *ndev_ctx = netdev_priv(net);
@@ -1783,13 +1785,10 @@ static void netvsc_link_change(struct work_struct *w)
 
 static struct net_device *get_netvsc_bymac(const u8 *mac)
 {
-   struct net_device *dev;
-
-   ASSERT_RTNL();
+   struct net_device_context *ndev_ctx;
 
-   for_each_netdev(_net, dev) {
-   if (dev->netdev_ops != _ops)
-   continue;   /* not a netvsc device */
+   list_for_each_entry(ndev_ctx, _dev_list, list) {
+   struct net_device *dev = hv_get_drvdata(ndev_ctx->device_ctx);
 
if (ether_addr_equal(mac, dev->perm_addr))
return dev;
@@ -1800,25 +1799,18 @@ static struct net_device *get_netvsc_bymac(const u8 
*mac)
 
 static struct net_device *get_netvsc_byref(struct net_device *vf_netdev)
 {
+   struct net_device_context *net_device_ctx;
struct net_device *dev;
 
-   ASSERT_RTNL();
-
-   for_each_netdev(_net, dev) {
-   struct net_device_context *net_device_ctx;
+   dev = netdev_master_upper_dev_get(vf_netdev);
+   if (!dev || dev->netdev_ops != _ops)
+   return NULL;/* not a netvsc device */
 
-   if (dev->netdev_ops != _ops)
-   continue;   /* not a netvsc device */
+   net_device_ctx = netdev_priv(dev);
+   if (!rtnl_dereference(net_device_ctx->nvdev))
+   return NULL;/* device is removed */
 
-   net_device_ctx = netdev_priv(dev);
-   if (!rtnl_dereference(net_device_ctx->nvdev))
-   continue;   /* device is removed */
-
-   if (rtnl_dereference(net_device_ctx->vf_netdev) == vf_netdev)
-   return dev; /* a match */
-   }
-
-   return NULL;
+   return dev;
 }
 
 /* Called when VF is injecting data into network stack.
@@ -2095,15 +2087,19 @@ static int netvsc_probe(struct hv_device *dev,
else
net->max_mtu = ETH_DATA_LEN;
 
-   ret = register_netdev(net);
+   rtnl_lock();
+   ret = register_netdevice(net);
if (ret != 0) {
pr_err("Unable to register netdev.\n");
goto register_failed;
}
 
-   return ret;
+   list_add(_device_ctx->list, _dev_list);
+   rtnl_unlock();
+   return 0;
 
 register_failed:
+   rtnl_unlock();
rndis_filter_device_remove(dev, nvdev);
 rndis_failed:
free_percpu(net_device_ctx->vf_stats);
@@ -2149,6 +2145,7 @@ static int netvsc_remove(struct hv_device *dev)
rndis_filter_device_remove(dev, nvdev);
 
unregister_netdevice(net);
+   list_del(_ctx->list);
 
rtnl_unlock();
rcu_read_unlock();
-- 
2.17.1



Re: [PATCH net] failover: eliminate callback hell

2018-06-11 Thread Samudrala, Sridhar



On 6/11/2018 11:10 AM, Michael S. Tsirkin wrote:

On Mon, Jun 04, 2018 at 08:42:31PM -0700, Stephen Hemminger wrote:

   * Set permanent and current address of net_failover device
 to match the primary.

   * Carrier should be marked off before registering device
 the net_failover device.

Sridhar, do we want to address this?
If yes, could you please take a look at addressing these
meanwhile, while we keep arguing about making API changes?


Sure. I will submit patches to address these issues raised by Stephen.




Re: [PATCH net] failover: eliminate callback hell

2018-06-11 Thread Siwei Liu
On Mon, Jun 11, 2018 at 8:22 AM, Stephen Hemminger
 wrote:
> On Fri, 8 Jun 2018 17:42:21 -0700
> Siwei Liu  wrote:
>
>> On Fri, Jun 8, 2018 at 5:02 PM, Stephen Hemminger
>>  wrote:
>> > On Fri, 8 Jun 2018 16:44:12 -0700
>> > Siwei Liu  wrote:
>> >
>> >> On Fri, Jun 8, 2018 at 4:18 PM, Stephen Hemminger
>> >>  wrote:
>> >> > On Fri, 8 Jun 2018 15:25:59 -0700
>> >> > Siwei Liu  wrote:
>> >> >
>> >> >> On Wed, Jun 6, 2018 at 2:24 PM, Stephen Hemminger
>> >> >>  wrote:
>> >> >> > On Wed, 6 Jun 2018 15:30:27 +0300
>> >> >> > "Michael S. Tsirkin"  wrote:
>> >> >> >
>> >> >> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:
>> >> >> >> > Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org 
>> >> >> >> > wrote:
>> >> >> >> > >The net failover should be a simple library, not a virtual
>> >> >> >> > >object with function callbacks (see callback hell).
>> >> >> >> >
>> >> >> >> > Why just a library? It should do a common things. I think it 
>> >> >> >> > should be a
>> >> >> >> > virtual object. Looks like your patch again splits the common
>> >> >> >> > functionality into multiple drivers. That is kind of backwards 
>> >> >> >> > attitude.
>> >> >> >> > I don't get it. We should rather focus on fixing the mess the
>> >> >> >> > introduction of netvsc-bonding caused and switch netvsc to 
>> >> >> >> > 3-netdev
>> >> >> >> > model.
>> >> >> >>
>> >> >> >> So it seems that at least one benefit for netvsc would be better
>> >> >> >> handling of renames.
>> >> >> >>
>> >> >> >> Question is how can this change to 3-netdev happen?  Stephen is
>> >> >> >> concerned about risk of breaking some userspace.
>> >> >> >>
>> >> >> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
>> >> >> >> address, and you said then "why not use existing network namespaces
>> >> >> >> rather than inventing a new abstraction". So how about it then? Do 
>> >> >> >> you
>> >> >> >> want to find a way to use namespaces to hide the PV device for 
>> >> >> >> netvsc
>> >> >> >> compatibility?
>> >> >> >>
>> >> >> >
>> >> >> > Netvsc can't work with 3 dev model. MS has worked with enough 
>> >> >> > distro's and
>> >> >> > startups that all demand eth0 always be present. And VF may come and 
>> >> >> > go.
>> >> >> > After this history, there is a strong motivation not to change how 
>> >> >> > kernel
>> >> >> > behaves. Switching to 3 device model would be perceived as breaking
>> >> >> > existing userspace.
>> >> >> >
>> >> >> > With virtio you can  work it out with the distro's yourself.
>> >> >> > There is no pre-existing semantics to deal with.
>> >> >> >
>> >> >> > For the virtio, I don't see the need for IFF_HIDDEN.
>> >> >>
>> >> >> I have a somewhat different view regarding IFF_HIDDEN. The purpose of
>> >> >> that flag, as well as the 1-netdev model, is to have a means to
>> >> >> inherit the interface name from the VF, and to eliminate playing hacks
>> >> >> around renaming devices, customizing udev rules and et al. Why
>> >> >> inheriting VF's name important? To allow existing config/setup around
>> >> >> VF continues to work across kernel feature upgrade. Most of network
>> >> >> config files in all distros are based on interface names. Few are MAC
>> >> >> address based but making lower slaves hidden would cover the rest. And
>> >> >> most importantly, preserving the same level of user experience as
>> >> >> using raw VF interface once getting all ndo_ops and ethtool_ops
>> >> >> exposed. This is essential to realize transparent live migration that
>> >> >> users dont have to learn and be aware of the undertaken.
>> >> >
>> >> > Inheriting the VF name will fail in the migration scenario.
>> >> > It is perfectly reasonable to migrate a guest to another machine where
>> >> > the VF PCI address is different. And since current udev/systemd model
>> >> > is to base network device name off of PCI address, the device will 
>> >> > change
>> >> > name when guest is migrated.
>> >> >
>> >> The scenario of having VF on a different PCI address on post migration
>> >> is essentially equal to plugging in a new NIC. Why it has to pair with
>> >> the original PV? A sepearte PV device should be in place to pair the
>> >> new VF.
>> >
>> > The host only guarantees that the PV device will be on the same network.
>> > It does not make any PCI guarantees. The way Windows works is to find
>> > the device based on "serial number" which is an Hyper-V specific attribute
>> > of PCI devices.
>> >
>> > I considered naming off of serial number but that won't work for the
>> > case where PV device is present first and VF arrives later. The serial
>> > number is attribute of VF, not the PV which is there first.
>>
>> I assume the PV can get that information ahead of time before VF
>> arrives? Without it how do you match the device when you see a VF
>> coming with some serial number? Is it possible for PV to get the
>> matching SN even earlier during probe time? Or it has to depend on the
>> presence of vPCI bridge to 

Re: net: do not allow changing SO_REUSEADDR/SO_REUSEPORT on bound sockets

2018-06-11 Thread Andrei Vagin
On Sun, Jun 03, 2018 at 10:47:05AM -0700, Maciej Żenczykowski wrote:
> From: Maciej Żenczykowski 
> 
> It is not safe to do so because such sockets are already in the
> hash tables and changing these options can result in invalidating
> the tb->fastreuse(port) caching.
> 
> This can have later far reaching consequences wrt. bind conflict checks
> which rely on these caches (for optimization purposes).
> 
> Not to mention that you can currently end up with two identical
> non-reuseport listening sockets bound to the same local ip:port
> by clearing reuseport on them after they've already both been bound.
> 
> There is unfortunately no EISBOUND error or anything similar,
> and EISCONN seems to be misleading for a bound-but-not-connected
> socket, so use EUCLEAN 'Structure needs cleaning' which AFAICT
> is the closest you can get to meaning 'socket in bad state'.
> (although perhaps EINVAL wouldn't be a bad choice either?)
> 
> This does unfortunately run the risk of breaking buggy
> userspace programs...
> 
> Signed-off-by: Maciej Żenczykowski 
> Cc: Eric Dumazet 
> 
> Change-Id: I77c2b3429b2fdf42671eee0fa7a8ba721c94963b
> Reviewed-by: Eric Dumazet 
> ---
>  net/core/sock.c | 15 ++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 435a0ba85e52..feca4c98f8a0 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -728,9 +728,22 @@ int sock_setsockopt(struct socket *sock, int level, int 
> optname,
>   sock_valbool_flag(sk, SOCK_DBG, valbool);
>   break;
>   case SO_REUSEADDR:
> - sk->sk_reuse = (valbool ? SK_CAN_REUSE : SK_NO_REUSE);
> + val = (valbool ? SK_CAN_REUSE : SK_NO_REUSE);
> + if ((sk->sk_family == PF_INET || sk->sk_family == PF_INET6) &&
> + inet_sk(sk)->inet_num &&
> + (sk->sk_reuse != val)) {
> + ret = (sk->sk_state == TCP_ESTABLISHED) ? -EISCONN : 
> -EUCLEAN;

There are a few more states like TCP_LAST_ACK, TCP_CLOSE_WAIT which
means that a socket is connected.

Actually, I don't see any reasons to return two different values here.

> + break;
> + }
> + sk->sk_reuse = val;
>   break;
>   case SO_REUSEPORT:
> + if ((sk->sk_family == PF_INET || sk->sk_family == PF_INET6) &&
> + inet_sk(sk)->inet_num &&
> + (sk->sk_reuseport != valbool)) {
> + ret = (sk->sk_state == TCP_ESTABLISHED) ? -EISCONN : 
> -EUCLEAN;
> + break;
> + }
>   sk->sk_reuseport = valbool;
>   break;
>   case SO_TYPE:


Re: [PATCH net] failover: eliminate callback hell

2018-06-11 Thread Siwei Liu
On Fri, Jun 8, 2018 at 6:29 PM, Jakub Kicinski  wrote:
> On Fri, 8 Jun 2018 16:44:12 -0700, Siwei Liu wrote:
>> >> I have a somewhat different view regarding IFF_HIDDEN. The purpose of
>> >> that flag, as well as the 1-netdev model, is to have a means to
>> >> inherit the interface name from the VF, and to eliminate playing hacks
>> >> around renaming devices, customizing udev rules and et al. Why
>> >> inheriting VF's name important? To allow existing config/setup around
>> >> VF continues to work across kernel feature upgrade. Most of network
>> >> config files in all distros are based on interface names. Few are MAC
>> >> address based but making lower slaves hidden would cover the rest. And
>> >> most importantly, preserving the same level of user experience as
>> >> using raw VF interface once getting all ndo_ops and ethtool_ops
>> >> exposed. This is essential to realize transparent live migration that
>> >> users dont have to learn and be aware of the undertaken.
>> >
>> > Inheriting the VF name will fail in the migration scenario.
>> > It is perfectly reasonable to migrate a guest to another machine where
>> > the VF PCI address is different. And since current udev/systemd model
>> > is to base network device name off of PCI address, the device will change
>> > name when guest is migrated.
>> >
>> The scenario of having VF on a different PCI address on post migration
>> is essentially equal to plugging in a new NIC. Why it has to pair with
>> the original PV? A sepearte PV device should be in place to pair the
>> new VF.
>
> IMHO it may be a better idea to look at the VF as acceleration for the
> PV rather than PV a migration vehicle from the VF.  Hence we should

I'm basically talking about two use cases not about solutions or
implementations specifically. As said, the one I'm looking into needs
to migrate a pre-failover VF setup to 1-netdev failover model in a
transparent manner. There's no point to switch PCI address back and
forth in the backend to set where to bind the PV or the VF, as you
have no ways to predict what guest kernel will be running until its
fully loaded. Supporting a VF on new location binding to existing PV
might be nice, but not directly relevant to those who don't need this
side feature than migration itself.

Having said that, while I somewhat agree both use cases should have
its own place in the picture, I don't think judging one better than
the other or vice versa is logical IMHO.

> continue to follow the naming of PV, like the current implementation
> does implicitly by linking to PV's struct device.

The current implementation may only work with new userspace, even so
the eth0/eth0nsby naming is not consistenly persisted due to races in
bus probing. The naming part should be fixed.

-Siwei


[bpf PATCH] bpf: selftest fix for sockmap

2018-06-11 Thread John Fastabend
In selftest test_maps the sockmap test case attempts to add a socket
in listening state to the sockmap. This is no longer a valid operation
so it fails as expected. However, the test wrongly reports this as an
error now. Fix the test to avoid adding sockets in listening state.

Fixes: 945ae430aa44 ("bpf: sockmap only allow ESTABLISHED sock state")
Signed-off-by: John Fastabend 
---
 tools/testing/selftests/bpf/test_maps.c |4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_maps.c 
b/tools/testing/selftests/bpf/test_maps.c
index 6c25334..9fed5f0 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -564,7 +564,7 @@ static void test_sockmap(int tasks, void *data)
}
 
/* Test update without programs */
-   for (i = 0; i < 6; i++) {
+   for (i = 2; i < 6; i++) {
err = bpf_map_update_elem(fd, , [i], BPF_ANY);
if (err) {
printf("Failed noprog update sockmap '%i:%i'\n",
@@ -727,7 +727,7 @@ static void test_sockmap(int tasks, void *data)
}
 
/* Test map update elem afterwards fd lives in fd and map_fd */
-   for (i = 0; i < 6; i++) {
+   for (i = 2; i < 6; i++) {
err = bpf_map_update_elem(map_fd_rx, , [i], BPF_ANY);
if (err) {
printf("Failed map_fd_rx update sockmap %i '%i:%i'\n",



Re: net: do not allow changing SO_REUSEADDR/SO_REUSEPORT on bound sockets

2018-06-11 Thread Andrei Vagin
Cc: Pavel

On Fri, Jun 08, 2018 at 03:07:30AM -0700, Maciej Żenczykowski wrote:
> I think we probably need to make sk->sk_reuse back into a boolean.
> (ie. eliminate SK_FORCE_REUSE)
> 
> Then add a new tcp/udp sk->ignore_bind_conflicts boolean setting...
> (ie. not just for tcp, but sol_socket)  [or perhaps SO_REPAIR,
> sk->repair or something]
> 
> What I'm not certain of is exactly what sorts of conflicts it should ignore...
> all?  probably not, still seems utterly wrong to allow creation of 2 connected
> tcp sockets with identical 5-tuples.

It is required when we are restoring i_b_c sockets on a server side.  In
this cases, they all have the same source address of a listening socket.

To restore these sockets, we need to be able to create a listening socket
and all i_b_c sockets and bind them all to the same source address.

BTW: Here is an example of how tcp_repair works:
https://github.com/avagin/tcp-repair/blob/master/tcp-constructor.c

> 
> Would it only ignore conflicts against other i_b_c sockets?
> ie. set it on all sockets as we're repairing, then clear it on them
> all once we're done?

TCP_REPAIR (which is set SK_FORCE_REUSE) is used to restore only i_b_c
sockets. SK_FORCE_REUSE is needed to ignore bind conflicts for repaired
sockets. It ignores conflicts agains other i_b_c and listen sockets.

The current idea is that CRIU will restore listening sockets first, and
them it will restore i_b_c sockets.

Pls, take a look at the attached patch.

> 
> and ignore all the fast caching when checking conflicts for an i_b_c socket?
> 
> For CRIU is it safe to assume we're restoring an entire namespace into
> a new namespace?

No. It isn't. CRIU can restore processes in an existing network namespace.

> 
> Could we perhaps instead allow a new namespace to ignore bind conflicts until
> we flip it into enforcing mode?

No, we could not

>From 990baa56993827ae6f4441cf078eddf73389d6ee Mon Sep 17 00:00:00 2001
From: Andrei Vagin 
Date: Fri, 8 Jun 2018 23:27:46 -0700
Subject: [PATCH] net: split sk_reuse into sk_reuse and sk_force_reuse

Currently sk_reuse can have there values: SK_NO_REUSE, SK_CAN_REUSE,
SK_FORCE_REUSE. SK_CAN_REUSE is set by SOL_REUSEADDR.  SK_FORCE_REUSE is
used to ignore bind conflicts for sockets in the repair mode.

This patch makes sk->sk_reuse back into a boolean and adds
sk->sk_force_reuse to track SK_FORCE_REUSE separatly.

Recently here were changes which prohibit to change
SO_REUSEADDR/SO_REUSEPORT on bound sockets and now it is impossible to
set origin values of these parameters for restored (repaired) sockets.

With introduced changes, the tcp_repair mode doesn't affect sk_reuse, so
it is possible to set its value before switching a socket into the
repair mode.

Fixes: f396922d862a ("net: do not allow changing SO_REUSEADDR/SO_REUSEPORT on 
bound sockets")
Signed-off-by: Andrei Vagin 
---
 include/net/sock.h  | 13 -
 net/ipv4/inet_connection_sock.c |  2 +-
 net/ipv4/tcp.c  |  4 ++--
 3 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index b3b75419eafe..8ad19286ab9e 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -130,6 +130,7 @@ typedef __u64 __bitwise __addrpair;
  * @skc_family: network address family
  * @skc_state: Connection state
  * @skc_reuse: %SO_REUSEADDR setting
+ * @skc_force_reuse: ignore bind conflicts
  * @skc_reuseport: %SO_REUSEPORT setting
  * @skc_bound_dev_if: bound device index if != 0
  * @skc_bind_node: bind hash linkage for various protocol lookup tables
@@ -174,7 +175,8 @@ struct sock_common {
 
unsigned short  skc_family;
volatile unsigned char  skc_state;
-   unsigned char   skc_reuse:4;
+   unsigned char   skc_reuse:1;
+   unsigned char   skc_force_reuse:1;
unsigned char   skc_reuseport:1;
unsigned char   skc_ipv6only:1;
unsigned char   skc_net_refcnt:1;
@@ -339,6 +341,7 @@ struct sock {
 #define sk_family  __sk_common.skc_family
 #define sk_state   __sk_common.skc_state
 #define sk_reuse   __sk_common.skc_reuse
+#define sk_force_reuse __sk_common.skc_force_reuse
 #define sk_reuseport   __sk_common.skc_reuseport
 #define sk_ipv6only__sk_common.skc_ipv6only
 #define sk_net_refcnt  __sk_common.skc_net_refcnt
@@ -502,16 +505,8 @@ enum sk_pacing {
 #define rcu_dereference_sk_user_data(sk)   
rcu_dereference(__sk_user_data((sk)))
 #define rcu_assign_sk_user_data(sk, ptr)   
rcu_assign_pointer(__sk_user_data((sk)), ptr)
 
-/*
- * SK_CAN_REUSE and SK_NO_REUSE on a socket mean that the socket is OK
- * or not whether his port will be reused by someone else. SK_FORCE_REUSE
- * on a socket means that the socket will reuse everybody else's port
- * without looking at the other's sk_reuse value.
- */
-
 #define SK_NO_REUSE0
 #define 

Re: [PATCH net] failover: eliminate callback hell

2018-06-11 Thread Michael S. Tsirkin
On Mon, Jun 04, 2018 at 08:42:31PM -0700, Stephen Hemminger wrote:
>   * Set permanent and current address of net_failover device
> to match the primary.
> 
>   * Carrier should be marked off before registering device
> the net_failover device.

Sridhar, do we want to address this?
If yes, could you please take a look at addressing these
meanwhile, while we keep arguing about making API changes?

-- 
MST


Re: [PATCH net] KEYS: DNS: fix parsing multiple options

2018-06-11 Thread Simon Horman
On Mon, Jun 11, 2018 at 10:57:42AM -0700, Eric Biggers wrote:
> Hi Simon,
> 
> On Mon, Jun 11, 2018 at 11:40:23AM +0200, Simon Horman wrote:
> > On Fri, Jun 08, 2018 at 09:20:37AM -0700, Eric Biggers wrote:
> > > From: Eric Biggers 
> > > 
> > > My recent fix for dns_resolver_preparse() printing very long strings was
> > > incomplete, as shown by syzbot which still managed to hit the
> > > WARN_ONCE() in set_precision() by adding a crafted "dns_resolver" key:
> > > 
> > > precision 50001 too large
> > > WARNING: CPU: 7 PID: 864 at lib/vsprintf.c:2164 vsnprintf+0x48a/0x5a0
> > > 
> > > The bug this time isn't just a printing bug, but also a logical error
> > > when multiple options ("#"-separated strings) are given in the key
> > > payload.  Specifically, when separating an option string into name and
> > > value, if there is no value then the name is incorrectly considered to
> > > end at the end of the key payload, rather than the end of the current
> > > option.  This bypasses validation of the option length, and also means
> > > that specifying multiple options is broken -- which presumably has gone
> > > unnoticed as there is currently only one valid option anyway.
> > > 
> > > Fix it by correctly calculating the length of the option name.
> > > 
> > > Reproducer:
> > > 
> > > perl -e 'print "#A#", "\x00" x 5' | keyctl padd dns_resolver desc 
> > > @s
> > > 
> > > Fixes: 4a2d789267e0 ("DNS: If the DNS server returns an error, allow that 
> > > to be cached [ver #2]")
> > > Signed-off-by: Eric Biggers 
> > > ---
> > >  net/dns_resolver/dns_key.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/net/dns_resolver/dns_key.c b/net/dns_resolver/dns_key.c
> > > index 40c851693f77e..d448823d4d2ed 100644
> > > --- a/net/dns_resolver/dns_key.c
> > > +++ b/net/dns_resolver/dns_key.c
> > > @@ -97,7 +97,7 @@ dns_resolver_preparse(struct key_preparsed_payload 
> > > *prep)
> > >   return -EINVAL;
> > >   }
> > >  
> > > - eq = memchr(opt, '=', opt_len) ?: end;
> > > + eq = memchr(opt, '=', opt_len) ?: next_opt;
> > >   opt_nlen = eq - opt;
> > >   eq++;
> > 
> > It seems risky to advance eq++ in the case there the value is empty.
> > Its not not pointing to the value but it may be accessed twice further on
> > in this loop.
> > 
> 
> Sure, that's a separate existing issue though, and it must be checked that the
> value is present before using it anyway, which the code already does, so it's
> not a "real" bug.  I think I'll keep this patch simple and leave that part 
> as-is
> for now.

Thanks Eric, I was reflecting on that too. I agree that your patch resolves
a problem without introducing a new one.

Reviewed-by: Simon Horman 


Re: [PATCH net] failover: eliminate callback hell

2018-06-11 Thread Michael S. Tsirkin
On Wed, Jun 06, 2018 at 03:21:21PM -0700, Stephen Hemminger wrote:
> > > 
> > > I think you need to get rid of triggering off of the name change.  
> > 
> > Worth considering down the road since it's a bit of a hack but it's one
> > we won't have trouble supporting, unlike the delayed uplink.
> 
> You can't depend on userspace doing rename.

There's only so much we can do to add new functionality to old
userspace. You can always just use PV and all will work.

-- 
MST


Re: [PATCH net] KEYS: DNS: fix parsing multiple options

2018-06-11 Thread Eric Biggers
Hi Simon,

On Mon, Jun 11, 2018 at 11:40:23AM +0200, Simon Horman wrote:
> On Fri, Jun 08, 2018 at 09:20:37AM -0700, Eric Biggers wrote:
> > From: Eric Biggers 
> > 
> > My recent fix for dns_resolver_preparse() printing very long strings was
> > incomplete, as shown by syzbot which still managed to hit the
> > WARN_ONCE() in set_precision() by adding a crafted "dns_resolver" key:
> > 
> > precision 50001 too large
> > WARNING: CPU: 7 PID: 864 at lib/vsprintf.c:2164 vsnprintf+0x48a/0x5a0
> > 
> > The bug this time isn't just a printing bug, but also a logical error
> > when multiple options ("#"-separated strings) are given in the key
> > payload.  Specifically, when separating an option string into name and
> > value, if there is no value then the name is incorrectly considered to
> > end at the end of the key payload, rather than the end of the current
> > option.  This bypasses validation of the option length, and also means
> > that specifying multiple options is broken -- which presumably has gone
> > unnoticed as there is currently only one valid option anyway.
> > 
> > Fix it by correctly calculating the length of the option name.
> > 
> > Reproducer:
> > 
> > perl -e 'print "#A#", "\x00" x 5' | keyctl padd dns_resolver desc @s
> > 
> > Fixes: 4a2d789267e0 ("DNS: If the DNS server returns an error, allow that 
> > to be cached [ver #2]")
> > Signed-off-by: Eric Biggers 
> > ---
> >  net/dns_resolver/dns_key.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/net/dns_resolver/dns_key.c b/net/dns_resolver/dns_key.c
> > index 40c851693f77e..d448823d4d2ed 100644
> > --- a/net/dns_resolver/dns_key.c
> > +++ b/net/dns_resolver/dns_key.c
> > @@ -97,7 +97,7 @@ dns_resolver_preparse(struct key_preparsed_payload *prep)
> > return -EINVAL;
> > }
> >  
> > -   eq = memchr(opt, '=', opt_len) ?: end;
> > +   eq = memchr(opt, '=', opt_len) ?: next_opt;
> > opt_nlen = eq - opt;
> > eq++;
> 
> It seems risky to advance eq++ in the case there the value is empty.
> Its not not pointing to the value but it may be accessed twice further on
> in this loop.
> 

Sure, that's a separate existing issue though, and it must be checked that the
value is present before using it anyway, which the code already does, so it's
not a "real" bug.  I think I'll keep this patch simple and leave that part as-is
for now.

Eric


[jkirsher/next-queue PATCH 2/7] net: Add support for subordinate device traffic classes

2018-06-11 Thread Alexander Duyck
This patch is meant to provide the basic tools needed to allow us to create
subordinate device traffic classes. The general idea here is to allow
subdividing the queues of a device into queue groups accessible through an
upper device such as a macvlan.

The idea here is to enforce the idea that an upper device has to be a
single queue device, ideally with IFF_NO_QUQUE set. With that being the
case we can pretty much guarantee that the tc_to_txq mappings and XPS maps
for the upper device are unused. As such we could reuse those in order to
support subdividing the lower device and distributing those queues between
the subordinate devices.

In order to distinguish between a regular set of traffic classes and if a
device is carrying subordinate traffic classes I changed num_tc from a u8
to a s16 value and use the negative values to represent the suboordinate
pool values. So starting at -1 and running to -32768 we can encode those as
pool values, and the existing values of 0 to 15 can be maintained.

Signed-off-by: Alexander Duyck 
---
 include/linux/netdevice.h |   16 
 net/core/dev.c|   89 +
 net/core/net-sysfs.c  |   21 ++-
 3 files changed, 124 insertions(+), 2 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 3ec9850..41b4660 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -569,6 +569,9 @@ struct netdev_queue {
 * (/sys/class/net/DEV/Q/trans_timeout)
 */
unsigned long   trans_timeout;
+
+   /* Suboordinate device that the queue has been assigned to */
+   struct net_device   *sb_dev;
 /*
  * write-mostly part
  */
@@ -1978,7 +1981,7 @@ struct net_device {
 #ifdef CONFIG_DCB
const struct dcbnl_rtnl_ops *dcbnl_ops;
 #endif
-   u8  num_tc;
+   s16 num_tc;
struct netdev_tc_txqtc_to_txq[TC_MAX_QUEUE];
u8  prio_tc_map[TC_BITMASK + 1];
 
@@ -2032,6 +2035,17 @@ int netdev_get_num_tc(struct net_device *dev)
return dev->num_tc;
 }
 
+void netdev_unbind_sb_channel(struct net_device *dev,
+ struct net_device *sb_dev);
+int netdev_bind_sb_channel_queue(struct net_device *dev,
+struct net_device *sb_dev,
+u8 tc, u16 count, u16 offset);
+int netdev_set_sb_channel(struct net_device *dev, u16 channel);
+static inline int netdev_get_sb_channel(struct net_device *dev)
+{
+   return max_t(int, -dev->num_tc, 0);
+}
+
 static inline
 struct netdev_queue *netdev_get_tx_queue(const struct net_device *dev,
 unsigned int index)
diff --git a/net/core/dev.c b/net/core/dev.c
index 6e18242..27fe4f2 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2068,11 +2068,13 @@ int netdev_txq_to_tc(struct net_device *dev, unsigned 
int txq)
struct netdev_tc_txq *tc = >tc_to_txq[0];
int i;
 
+   /* walk through the TCs and see if it falls into any of them */
for (i = 0; i < TC_MAX_QUEUE; i++, tc++) {
if ((txq - tc->offset) < tc->count)
return i;
}
 
+   /* didn't find it, just return -1 to indicate no match */
return -1;
}
 
@@ -2215,7 +2217,14 @@ int netif_set_xps_queue(struct net_device *dev, const 
struct cpumask *mask,
bool active = false;
 
if (dev->num_tc) {
+   /* Do not allow XPS on subordinate device directly */
num_tc = dev->num_tc;
+   if (num_tc < 0)
+   return -EINVAL;
+
+   /* If queue belongs to subordinate dev use its map */
+   dev = netdev_get_tx_queue(dev, index)->sb_dev ? : dev;
+
tc = netdev_txq_to_tc(dev, index);
if (tc < 0)
return -EINVAL;
@@ -2366,11 +2375,25 @@ int netif_set_xps_queue(struct net_device *dev, const 
struct cpumask *mask,
 EXPORT_SYMBOL(netif_set_xps_queue);
 
 #endif
+static void netdev_unbind_all_sb_channels(struct net_device *dev)
+{
+   struct netdev_queue *txq = >_tx[dev->num_tx_queues];
+
+   /* Unbind any subordinate channels */
+   while (txq-- != >_tx[0]) {
+   if (txq->sb_dev)
+   netdev_unbind_sb_channel(dev, txq->sb_dev);
+   }
+}
+
 void netdev_reset_tc(struct net_device *dev)
 {
 #ifdef CONFIG_XPS
netif_reset_xps_queues_gt(dev, 0);
 #endif
+   netdev_unbind_all_sb_channels(dev);
+
+   /* Reset TC configuration of device */
dev->num_tc = 0;
memset(dev->tc_to_txq, 0, sizeof(dev->tc_to_txq));
memset(dev->prio_tc_map, 0, sizeof(dev->prio_tc_map));
@@ -2399,11 +2422,77 @@ int netdev_set_num_tc(struct net_device *dev, u8 num_tc)
 #ifdef CONFIG_XPS

[jkirsher/next-queue PATCH 0/7] Add support for L2 Fwd Offload w/o ndo_select_queue

2018-06-11 Thread Alexander Duyck
This patch series is meant to allow support for the L2 forward offload, aka
MACVLAN offload without the need for using ndo_select_queue.

The existing solution currently requires that we use ndo_select_queue in
the transmit path if we want to associate specific Tx queues with a given
MACVLAN interface. In order to get away from this we need to repurpose the
tc_to_txq array and XPS pointer for the MACVLAN interface and use those as
a means of accessing the queues on the lower device. As a result we cannot
offload a device that is configured as multiqueue, however it doesn't
really make sense to configure a macvlan interfaced as being multiqueue
anyway since it doesn't really have a qdisc of its own in the first place.

I am submitting this as an RFC for the netdev mailing list, and officially
submitting it for testing to Jeff Kirsher's next-queue in order to validate
the ixgbe specific bits.

The big changes in this set are:
  Allow lower device to update tc_to_txq and XPS map of offloaded MACVLAN
  Disable XPS for single queue devices
  Replace accel_priv with sb_dev in ndo_select_queue
  Add sb_dev parameter to fallback function for ndo_select_queue
  Consolidated ndo_select_queue functions that appeared to be duplicates

---

Alexander Duyck (7):
  net-sysfs: Drop support for XPS and traffic_class on single queue device
  net: Add support for subordinate device traffic classes
  ixgbe: Add code to populate and use macvlan tc to Tx queue map
  net: Add support for subordinate traffic classes to netdev_pick_tx
  net: Add generic ndo_select_queue functions
  net: allow ndo_select_queue to pass netdev
  net: allow fallback function to pass netdev


 drivers/infiniband/hw/hfi1/vnic_main.c|2 
 drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c |4 -
 drivers/net/bonding/bond_main.c   |3 
 drivers/net/ethernet/amazon/ena/ena_netdev.c  |5 -
 drivers/net/ethernet/broadcom/bcmsysport.c|6 -
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c   |6 +
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h   |3 
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c   |5 -
 drivers/net/ethernet/hisilicon/hns/hns_enet.c |5 -
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   62 ++--
 drivers/net/ethernet/lantiq_etop.c|   10 -
 drivers/net/ethernet/mellanox/mlx4/en_tx.c|7 +
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h  |3 
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |3 
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c   |5 -
 drivers/net/ethernet/renesas/ravb_main.c  |3 
 drivers/net/ethernet/sun/ldmvsw.c |3 
 drivers/net/ethernet/sun/sunvnet.c|3 
 drivers/net/ethernet/ti/netcp_core.c  |9 -
 drivers/net/hyperv/netvsc_drv.c   |6 -
 drivers/net/macvlan.c |   10 -
 drivers/net/net_failover.c|7 +
 drivers/net/team/team.c   |3 
 drivers/net/tun.c |3 
 drivers/net/wireless/marvell/mwifiex/main.c   |3 
 drivers/net/xen-netback/interface.c   |4 -
 drivers/net/xen-netfront.c|3 
 drivers/staging/netlogic/xlr_net.c|9 -
 drivers/staging/rtl8188eu/os_dep/os_intfs.c   |3 
 drivers/staging/rtl8723bs/os_dep/os_intfs.c   |7 -
 include/linux/netdevice.h |   32 
 net/core/dev.c|  154 ++---
 net/core/net-sysfs.c  |   36 +
 net/mac80211/iface.c  |4 -
 net/packet/af_packet.c|9 -
 35 files changed, 306 insertions(+), 134 deletions(-)

--


[jkirsher/next-queue PATCH 3/7] ixgbe: Add code to populate and use macvlan tc to Tx queue map

2018-06-11 Thread Alexander Duyck
This patch makes it so that we use the tc_to_txq mapping in the macvlan
device in order to select the Tx queue for outgoing packets.

The idea here is to try and move away from using ixgbe_select_queue and to
come up with a generic way to make this work for devices going forward. By
encoding this information in the netdev this can become something that can
be used generically as a solution for similar setups going forward.

Signed-off-by: Alexander Duyck 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   44 ++---
 1 file changed, 38 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index fc23e36..6e27848 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -5271,6 +5271,8 @@ static void ixgbe_clean_rx_ring(struct ixgbe_ring 
*rx_ring)
 static int ixgbe_fwd_ring_up(struct ixgbe_adapter *adapter,
 struct ixgbe_fwd_adapter *accel)
 {
+   u16 rss_i = adapter->ring_feature[RING_F_RSS].indices;
+   int num_tc = netdev_get_num_tc(adapter->netdev);
struct net_device *vdev = accel->netdev;
int i, baseq, err;
 
@@ -5282,6 +5284,11 @@ static int ixgbe_fwd_ring_up(struct ixgbe_adapter 
*adapter,
accel->rx_base_queue = baseq;
accel->tx_base_queue = baseq;
 
+   /* record configuration for macvlan interface in vdev */
+   for (i = 0; i < num_tc; i++)
+   netdev_bind_sb_channel_queue(adapter->netdev, vdev,
+i, rss_i, baseq + (rss_i * i));
+
for (i = 0; i < adapter->num_rx_queues_per_pool; i++)
adapter->rx_ring[baseq + i]->netdev = vdev;
 
@@ -5306,6 +5313,10 @@ static int ixgbe_fwd_ring_up(struct ixgbe_adapter 
*adapter,
 
netdev_err(vdev, "L2FW offload disabled due to L2 filter error\n");
 
+   /* unbind the queues and drop the subordinate channel config */
+   netdev_unbind_sb_channel(adapter->netdev, vdev);
+   netdev_set_sb_channel(vdev, 0);
+
clear_bit(accel->pool, adapter->fwd_bitmask);
kfree(accel);
 
@@ -8212,18 +8223,22 @@ static u16 ixgbe_select_queue(struct net_device *dev, 
struct sk_buff *skb,
  void *accel_priv, select_queue_fallback_t 
fallback)
 {
struct ixgbe_fwd_adapter *fwd_adapter = accel_priv;
-   struct ixgbe_adapter *adapter;
-   int txq;
 #ifdef IXGBE_FCOE
+   struct ixgbe_adapter *adapter;
struct ixgbe_ring_feature *f;
 #endif
+   int txq;
 
if (fwd_adapter) {
-   adapter = netdev_priv(dev);
-   txq = reciprocal_scale(skb_get_hash(skb),
-  adapter->num_rx_queues_per_pool);
+   u8 tc = netdev_get_num_tc(dev) ?
+   netdev_get_prio_tc_map(dev, skb->priority) : 0;
+   struct net_device *vdev = fwd_adapter->netdev;
+
+   txq = vdev->tc_to_txq[tc].offset;
+   txq += reciprocal_scale(skb_get_hash(skb),
+   vdev->tc_to_txq[tc].count);
 
-   return txq + fwd_adapter->tx_base_queue;
+   return txq;
}
 
 #ifdef IXGBE_FCOE
@@ -8777,6 +8792,11 @@ static int ixgbe_reassign_macvlan_pool(struct net_device 
*vdev, void *data)
/* if we cannot find a free pool then disable the offload */
netdev_err(vdev, "L2FW offload disabled due to lack of queue 
resources\n");
macvlan_release_l2fw_offload(vdev);
+
+   /* unbind the queues and drop the subordinate channel config */
+   netdev_unbind_sb_channel(adapter->netdev, vdev);
+   netdev_set_sb_channel(vdev, 0);
+
kfree(accel);
 
return 0;
@@ -9785,6 +9805,13 @@ static void *ixgbe_fwd_add(struct net_device *pdev, 
struct net_device *vdev)
if (!macvlan_supports_dest_filter(vdev))
return ERR_PTR(-EMEDIUMTYPE);
 
+   /* We need to lock down the macvlan to be a single queue device so that
+* we can reuse the tc_to_txq field in the macvlan netdev to represent
+* the queue mapping to our netdev.
+*/
+   if (netif_is_multiqueue(vdev))
+   return ERR_PTR(-ERANGE);
+
pool = find_first_zero_bit(adapter->fwd_bitmask, adapter->num_rx_pools);
if (pool == adapter->num_rx_pools) {
u16 used_pools = adapter->num_vfs + adapter->num_rx_pools;
@@ -9841,6 +9868,7 @@ static void *ixgbe_fwd_add(struct net_device *pdev, 
struct net_device *vdev)
return ERR_PTR(-ENOMEM);
 
set_bit(pool, adapter->fwd_bitmask);
+   netdev_set_sb_channel(vdev, pool);
accel->pool = pool;
accel->netdev = vdev;
 
@@ -9882,6 +9910,10 @@ static void ixgbe_fwd_del(struct net_device *pdev, void 
*priv)
ring->netdev = NULL;
}
 
+   /* unbind the queues and drop the subordinate channel 

[jkirsher/next-queue PATCH 5/7] net: Add generic ndo_select_queue functions

2018-06-11 Thread Alexander Duyck
This patch adds a generic version of the ndo_select_queue functions for
either returning 0 or selecting a queue based on the processor ID. This is
generally meant to just reduce the number of functions we have to change
in the future when we have to deal with ndo_select_queue changes.

Signed-off-by: Alexander Duyck 
---
 drivers/net/ethernet/lantiq_etop.c   |   10 +-
 drivers/net/ethernet/ti/netcp_core.c |9 +
 drivers/staging/netlogic/xlr_net.c   |9 +
 include/linux/netdevice.h|2 ++
 net/core/dev.c   |   12 
 net/packet/af_packet.c   |9 ++---
 6 files changed, 19 insertions(+), 32 deletions(-)

diff --git a/drivers/net/ethernet/lantiq_etop.c 
b/drivers/net/ethernet/lantiq_etop.c
index afc8100..7a637b5 100644
--- a/drivers/net/ethernet/lantiq_etop.c
+++ b/drivers/net/ethernet/lantiq_etop.c
@@ -563,14 +563,6 @@ struct ltq_etop_priv {
spin_unlock_irqrestore(>lock, flags);
 }
 
-static u16
-ltq_etop_select_queue(struct net_device *dev, struct sk_buff *skb,
- void *accel_priv, select_queue_fallback_t fallback)
-{
-   /* we are currently only using the first queue */
-   return 0;
-}
-
 static int
 ltq_etop_init(struct net_device *dev)
 {
@@ -641,7 +633,7 @@ struct ltq_etop_priv {
.ndo_set_mac_address = ltq_etop_set_mac_address,
.ndo_validate_addr = eth_validate_addr,
.ndo_set_rx_mode = ltq_etop_set_multicast_list,
-   .ndo_select_queue = ltq_etop_select_queue,
+   .ndo_select_queue = dev_pick_tx_zero,
.ndo_init = ltq_etop_init,
.ndo_tx_timeout = ltq_etop_tx_timeout,
 };
diff --git a/drivers/net/ethernet/ti/netcp_core.c 
b/drivers/net/ethernet/ti/netcp_core.c
index e40aa3e..2c455bd 100644
--- a/drivers/net/ethernet/ti/netcp_core.c
+++ b/drivers/net/ethernet/ti/netcp_core.c
@@ -1889,13 +1889,6 @@ static int netcp_rx_kill_vid(struct net_device *ndev, 
__be16 proto, u16 vid)
return err;
 }
 
-static u16 netcp_select_queue(struct net_device *dev, struct sk_buff *skb,
- void *accel_priv,
- select_queue_fallback_t fallback)
-{
-   return 0;
-}
-
 static int netcp_setup_tc(struct net_device *dev, enum tc_setup_type type,
  void *type_data)
 {
@@ -1972,7 +1965,7 @@ static int netcp_setup_tc(struct net_device *dev, enum 
tc_setup_type type,
.ndo_vlan_rx_add_vid= netcp_rx_add_vid,
.ndo_vlan_rx_kill_vid   = netcp_rx_kill_vid,
.ndo_tx_timeout = netcp_ndo_tx_timeout,
-   .ndo_select_queue   = netcp_select_queue,
+   .ndo_select_queue   = dev_pick_tx_zero,
.ndo_setup_tc   = netcp_setup_tc,
 };
 
diff --git a/drivers/staging/netlogic/xlr_net.c 
b/drivers/staging/netlogic/xlr_net.c
index e461168..4e6611e 100644
--- a/drivers/staging/netlogic/xlr_net.c
+++ b/drivers/staging/netlogic/xlr_net.c
@@ -290,13 +290,6 @@ static netdev_tx_t xlr_net_start_xmit(struct sk_buff *skb,
return NETDEV_TX_OK;
 }
 
-static u16 xlr_net_select_queue(struct net_device *ndev, struct sk_buff *skb,
-   void *accel_priv,
-   select_queue_fallback_t fallback)
-{
-   return (u16)smp_processor_id();
-}
-
 static void xlr_hw_set_mac_addr(struct net_device *ndev)
 {
struct xlr_net_priv *priv = netdev_priv(ndev);
@@ -403,7 +396,7 @@ static void xlr_stats(struct net_device *ndev, struct 
rtnl_link_stats64 *stats)
.ndo_open = xlr_net_open,
.ndo_stop = xlr_net_stop,
.ndo_start_xmit = xlr_net_start_xmit,
-   .ndo_select_queue = xlr_net_select_queue,
+   .ndo_select_queue = dev_pick_tx_cpu_id,
.ndo_set_mac_address = xlr_net_set_mac_addr,
.ndo_set_rx_mode = xlr_set_rx_mode,
.ndo_get_stats64 = xlr_stats,
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 91b3ca9..f277149 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2551,6 +2551,8 @@ struct net_device *__dev_get_by_flags(struct net *net, 
unsigned short flags,
 void dev_close_many(struct list_head *head, bool unlink);
 void dev_disable_lro(struct net_device *dev);
 int dev_loopback_xmit(struct net *net, struct sock *sk, struct sk_buff 
*newskb);
+u16 dev_pick_tx_zero(struct net_device *dev, struct sk_buff *skb);
+u16 dev_pick_tx_cpu_id(struct net_device *dev, struct sk_buff *skb);
 int dev_queue_xmit(struct sk_buff *skb);
 int dev_queue_xmit_accel(struct sk_buff *skb, struct net_device *sb_dev);
 int dev_direct_xmit(struct sk_buff *skb, u16 queue_id);
diff --git a/net/core/dev.c b/net/core/dev.c
index 2249294..d746fdd 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3505,6 +3505,18 @@ static inline int get_xps_queue(struct net_device *dev,
 #endif
 }
 
+u16 dev_pick_tx_zero(struct net_device *dev, struct sk_buff *skb)
+{
+   return 0;
+}
+EXPORT_SYMBOL(dev_pick_tx_zero);
+
+u16 

[jkirsher/next-queue PATCH 7/7] net: allow fallback function to pass netdev

2018-06-11 Thread Alexander Duyck
For most of these calls we can just pass NULL through to the fallback
function as the sb_dev. The only cases where we cannot are the cases where
we might be dealing with either an upper device or a driver that would
have configured things to support an sb_dev itself.

The only driver that has any signficant change in this patchset should be
ixgbe as we can drop the redundant functionality that existed in both the
ndo_select_queue function and the fallback function that was passed through
to us.

Signed-off-by: Alexander Duyck 
---
 drivers/net/ethernet/amazon/ena/ena_netdev.c|2 +-
 drivers/net/ethernet/broadcom/bcmsysport.c  |4 ++--
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c |3 ++-
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c |2 +-
 drivers/net/ethernet/hisilicon/hns/hns_enet.c   |2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c   |4 ++--
 drivers/net/ethernet/mellanox/mlx4/en_tx.c  |4 ++--
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c |2 +-
 drivers/net/hyperv/netvsc_drv.c |2 +-
 drivers/net/net_failover.c  |2 +-
 drivers/net/xen-netback/interface.c |2 +-
 include/linux/netdevice.h   |3 ++-
 net/core/dev.c  |   12 +++-
 net/packet/af_packet.c  |7 +--
 14 files changed, 21 insertions(+), 30 deletions(-)

diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c 
b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index e3befb1..c673ac2 100644
--- a/drivers/net/ethernet/amazon/ena/ena_netdev.c
+++ b/drivers/net/ethernet/amazon/ena/ena_netdev.c
@@ -2224,7 +2224,7 @@ static u16 ena_select_queue(struct net_device *dev, 
struct sk_buff *skb,
if (skb_rx_queue_recorded(skb))
qid = skb_get_rx_queue(skb);
else
-   qid = fallback(dev, skb);
+   qid = fallback(dev, skb, NULL);
 
return qid;
 }
diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c 
b/drivers/net/ethernet/broadcom/bcmsysport.c
index 32f548e..eb890c4 100644
--- a/drivers/net/ethernet/broadcom/bcmsysport.c
+++ b/drivers/net/ethernet/broadcom/bcmsysport.c
@@ -2116,7 +2116,7 @@ static u16 bcm_sysport_select_queue(struct net_device 
*dev, struct sk_buff *skb,
unsigned int q, port;
 
if (!netdev_uses_dsa(dev))
-   return fallback(dev, skb);
+   return fallback(dev, skb, NULL);
 
/* DSA tagging layer will have configured the correct queue */
q = BRCM_TAG_GET_QUEUE(queue);
@@ -2124,7 +2124,7 @@ static u16 bcm_sysport_select_queue(struct net_device 
*dev, struct sk_buff *skb,
tx_ring = priv->ring_map[q + port * priv->per_port_num_tx_queues];
 
if (unlikely(!tx_ring))
-   return fallback(dev, skb);
+   return fallback(dev, skb, NULL);
 
return tx_ring->index;
 }
diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c 
b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
index 969dcc9..7a1b99f 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -1928,7 +1928,8 @@ u16 bnx2x_select_queue(struct net_device *dev, struct 
sk_buff *skb,
}
 
/* select a non-FCoE queue */
-   return fallback(dev, skb) % (BNX2X_NUM_ETH_QUEUES(bp) * bp->max_cos);
+   return fallback(dev, skb, NULL) %
+  (BNX2X_NUM_ETH_QUEUES(bp) * bp->max_cos);
 }
 
 void bnx2x_set_num_queues(struct bnx2x *bp)
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index 8de3039..380931d 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -972,7 +972,7 @@ static u16 cxgb_select_queue(struct net_device *dev, struct 
sk_buff *skb,
return txq;
}
 
-   return fallback(dev, skb) % dev->real_num_tx_queues;
+   return fallback(dev, skb, NULL) % dev->real_num_tx_queues;
 }
 
 static int closest_timer(const struct sge *s, int time)
diff --git a/drivers/net/ethernet/hisilicon/hns/hns_enet.c 
b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
index c36a231..8327254 100644
--- a/drivers/net/ethernet/hisilicon/hns/hns_enet.c
+++ b/drivers/net/ethernet/hisilicon/hns/hns_enet.c
@@ -2033,7 +2033,7 @@ static void hns_nic_get_stats64(struct net_device *ndev,
is_multicast_ether_addr(eth_hdr->h_dest))
return 0;
else
-   return fallback(ndev, skb);
+   return fallback(ndev, skb, NULL);
 }
 
 static const struct net_device_ops hns_nic_netdev_ops = {
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 5d9867e..eef64d0 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -8248,11 +8248,11 @@ static u16 

[jkirsher/next-queue PATCH 6/7] net: allow ndo_select_queue to pass netdev

2018-06-11 Thread Alexander Duyck
This patch makes it so that instead of passing a void pointer as the
accel_priv we instead pass a net_device pointer as sb_dev. Making this
change allows us to pass the subordinate device through to the fallback
function eventually so that we can keep the actual code in the
ndo_select_queue call as focused on possible on the exception cases.

Signed-off-by: Alexander Duyck 
---
 drivers/infiniband/hw/hfi1/vnic_main.c|2 +-
 drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c |4 ++--
 drivers/net/bonding/bond_main.c   |3 ++-
 drivers/net/ethernet/amazon/ena/ena_netdev.c  |3 ++-
 drivers/net/ethernet/broadcom/bcmsysport.c|2 +-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c   |3 ++-
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.h   |3 ++-
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c   |3 ++-
 drivers/net/ethernet/hisilicon/hns/hns_enet.c |3 ++-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |7 ---
 drivers/net/ethernet/mellanox/mlx4/en_tx.c|3 ++-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h  |3 ++-
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |3 ++-
 drivers/net/ethernet/mellanox/mlx5/core/en_tx.c   |3 ++-
 drivers/net/ethernet/renesas/ravb_main.c  |3 ++-
 drivers/net/ethernet/sun/ldmvsw.c |3 ++-
 drivers/net/ethernet/sun/sunvnet.c|3 ++-
 drivers/net/hyperv/netvsc_drv.c   |4 ++--
 drivers/net/net_failover.c|5 +++--
 drivers/net/team/team.c   |3 ++-
 drivers/net/tun.c |3 ++-
 drivers/net/wireless/marvell/mwifiex/main.c   |3 ++-
 drivers/net/xen-netback/interface.c   |2 +-
 drivers/net/xen-netfront.c|3 ++-
 drivers/staging/rtl8188eu/os_dep/os_intfs.c   |3 ++-
 drivers/staging/rtl8723bs/os_dep/os_intfs.c   |7 +++
 include/linux/netdevice.h |   11 +++
 net/core/dev.c|6 --
 net/mac80211/iface.c  |4 ++--
 net/packet/af_packet.c|9 +++--
 30 files changed, 73 insertions(+), 44 deletions(-)

diff --git a/drivers/infiniband/hw/hfi1/vnic_main.c 
b/drivers/infiniband/hw/hfi1/vnic_main.c
index 5d65582..616fc9b 100644
--- a/drivers/infiniband/hw/hfi1/vnic_main.c
+++ b/drivers/infiniband/hw/hfi1/vnic_main.c
@@ -423,7 +423,7 @@ static netdev_tx_t hfi1_netdev_start_xmit(struct sk_buff 
*skb,
 
 static u16 hfi1_vnic_select_queue(struct net_device *netdev,
  struct sk_buff *skb,
- void *accel_priv,
+ struct net_device *sb_dev,
  select_queue_fallback_t fallback)
 {
struct hfi1_vnic_vport_info *vinfo = opa_vnic_dev_priv(netdev);
diff --git a/drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c 
b/drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c
index 0c8aec6..6155878 100644
--- a/drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c
+++ b/drivers/infiniband/ulp/opa_vnic/opa_vnic_netdev.c
@@ -95,7 +95,7 @@ static netdev_tx_t opa_netdev_start_xmit(struct sk_buff *skb,
 }
 
 static u16 opa_vnic_select_queue(struct net_device *netdev, struct sk_buff 
*skb,
-void *accel_priv,
+struct net_device *sb_dev,
 select_queue_fallback_t fallback)
 {
struct opa_vnic_adapter *adapter = opa_vnic_priv(netdev);
@@ -107,7 +107,7 @@ static u16 opa_vnic_select_queue(struct net_device *netdev, 
struct sk_buff *skb,
mdata->entropy = opa_vnic_calc_entropy(skb);
mdata->vl = opa_vnic_get_vl(adapter, skb);
rc = adapter->rn_ops->ndo_select_queue(netdev, skb,
-  accel_priv, fallback);
+  sb_dev, fallback);
skb_pull(skb, sizeof(*mdata));
return rc;
 }
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index bd53a71..e33f689 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -4094,7 +4094,8 @@ static inline int bond_slave_override(struct bonding 
*bond,
 
 
 static u16 bond_select_queue(struct net_device *dev, struct sk_buff *skb,
-void *accel_priv, select_queue_fallback_t fallback)
+struct net_device *sb_dev,
+select_queue_fallback_t fallback)
 {
/* This helper function exists to help dev_pick_tx get the correct
 * destination queue.  Using a helper function skips a call to
diff --git a/drivers/net/ethernet/amazon/ena/ena_netdev.c 
b/drivers/net/ethernet/amazon/ena/ena_netdev.c
index f2af87d..e3befb1 100644
--- 

[jkirsher/next-queue PATCH 4/7] net: Add support for subordinate traffic classes to netdev_pick_tx

2018-06-11 Thread Alexander Duyck
This change makes it so that we can support the concept of subordinate
device traffic classes to the core networking code. In doing this we can
start pulling out the driver specific bits needed to support selecting a
queue based on an upper device.

The solution at is currently stands is only partially implemented. I have
the start of some XPS bits in here, but I would still need to allow for
configuration of the XPS maps on the queues reserved for the subordinate
devices. For now I am using the reference to the sb_dev XPS map as just a
way to skip the lookup of the lower device XPS map for now as that would
result in the wrong queue being picked.

Signed-off-by: Alexander Duyck 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   19 +++-
 drivers/net/macvlan.c |   10 +---
 include/linux/netdevice.h |4 +-
 net/core/dev.c|   57 +++--
 4 files changed, 45 insertions(+), 45 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 6e27848..053a54c 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -8219,20 +8219,17 @@ static void ixgbe_atr(struct ixgbe_ring *ring,
  input, common, ring->queue_index);
 }
 
+#ifdef IXGBE_FCOE
 static u16 ixgbe_select_queue(struct net_device *dev, struct sk_buff *skb,
  void *accel_priv, select_queue_fallback_t 
fallback)
 {
-   struct ixgbe_fwd_adapter *fwd_adapter = accel_priv;
-#ifdef IXGBE_FCOE
struct ixgbe_adapter *adapter;
struct ixgbe_ring_feature *f;
-#endif
int txq;
 
-   if (fwd_adapter) {
-   u8 tc = netdev_get_num_tc(dev) ?
-   netdev_get_prio_tc_map(dev, skb->priority) : 0;
-   struct net_device *vdev = fwd_adapter->netdev;
+   if (accel_priv) {
+   u8 tc = netdev_get_prio_tc_map(dev, skb->priority);
+   struct net_device *vdev = accel_priv;
 
txq = vdev->tc_to_txq[tc].offset;
txq += reciprocal_scale(skb_get_hash(skb),
@@ -8241,8 +8238,6 @@ static u16 ixgbe_select_queue(struct net_device *dev, 
struct sk_buff *skb,
return txq;
}
 
-#ifdef IXGBE_FCOE
-
/*
 * only execute the code below if protocol is FCoE
 * or FIP and we have FCoE enabled on the adapter
@@ -8268,11 +8263,9 @@ static u16 ixgbe_select_queue(struct net_device *dev, 
struct sk_buff *skb,
txq -= f->indices;
 
return txq + f->offset;
-#else
-   return fallback(dev, skb);
-#endif
 }
 
+#endif
 static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
   struct xdp_frame *xdpf)
 {
@@ -10076,7 +10069,6 @@ static int ixgbe_xdp_xmit(struct net_device *dev, int n,
.ndo_open   = ixgbe_open,
.ndo_stop   = ixgbe_close,
.ndo_start_xmit = ixgbe_xmit_frame,
-   .ndo_select_queue   = ixgbe_select_queue,
.ndo_set_rx_mode= ixgbe_set_rx_mode,
.ndo_validate_addr  = eth_validate_addr,
.ndo_set_mac_address= ixgbe_set_mac,
@@ -10099,6 +10091,7 @@ static int ixgbe_xdp_xmit(struct net_device *dev, int n,
.ndo_poll_controller= ixgbe_netpoll,
 #endif
 #ifdef IXGBE_FCOE
+   .ndo_select_queue   = ixgbe_select_queue,
.ndo_fcoe_ddp_setup = ixgbe_fcoe_ddp_get,
.ndo_fcoe_ddp_target = ixgbe_fcoe_ddp_target,
.ndo_fcoe_ddp_done = ixgbe_fcoe_ddp_put,
diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index adde8fc..401e1d1 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -514,7 +514,6 @@ static int macvlan_queue_xmit(struct sk_buff *skb, struct 
net_device *dev)
const struct macvlan_dev *vlan = netdev_priv(dev);
const struct macvlan_port *port = vlan->port;
const struct macvlan_dev *dest;
-   void *accel_priv = NULL;
 
if (vlan->mode == MACVLAN_MODE_BRIDGE) {
const struct ethhdr *eth = (void *)skb->data;
@@ -533,15 +532,10 @@ static int macvlan_queue_xmit(struct sk_buff *skb, struct 
net_device *dev)
return NET_XMIT_SUCCESS;
}
}
-
-   /* For packets that are non-multicast and not bridged we will pass
-* the necessary information so that the lowerdev can distinguish
-* the source of the packets via the accel_priv value.
-*/
-   accel_priv = vlan->accel_priv;
 xmit_world:
skb->dev = vlan->lowerdev;
-   return dev_queue_xmit_accel(skb, accel_priv);
+   return dev_queue_xmit_accel(skb,
+   netdev_get_sb_channel(dev) ? dev : NULL);
 }
 
 static inline netdev_tx_t macvlan_netpoll_send_skb(struct macvlan_dev *vlan, 
struct sk_buff *skb)
diff --git 

[jkirsher/next-queue PATCH 1/7] net-sysfs: Drop support for XPS and traffic_class on single queue device

2018-06-11 Thread Alexander Duyck
This patch makes it so that we do not report the traffic class or allow XPS
configuration on single queue devices. This is mostly to avoid unnecessary
complexity with changes I have planned that will allow us to reuse
the unused tc_to_txq and XPS configuration on a single queue device to
allow it to make use of a subset of queues on an underlying device.

Signed-off-by: Alexander Duyck 
---
 net/core/net-sysfs.c |   15 +--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
index bb7e80f..335c6a4 100644
--- a/net/core/net-sysfs.c
+++ b/net/core/net-sysfs.c
@@ -1047,9 +1047,14 @@ static ssize_t traffic_class_show(struct netdev_queue 
*queue,
  char *buf)
 {
struct net_device *dev = queue->dev;
-   int index = get_netdev_queue_index(queue);
-   int tc = netdev_txq_to_tc(dev, index);
+   int index;
+   int tc;
 
+   if (!netif_is_multiqueue(dev))
+   return -ENOENT;
+
+   index = get_netdev_queue_index(queue);
+   tc = netdev_txq_to_tc(dev, index);
if (tc < 0)
return -EINVAL;
 
@@ -1214,6 +1219,9 @@ static ssize_t xps_cpus_show(struct netdev_queue *queue,
cpumask_var_t mask;
unsigned long index;
 
+   if (!netif_is_multiqueue(dev))
+   return -ENOENT;
+
index = get_netdev_queue_index(queue);
 
if (dev->num_tc) {
@@ -1260,6 +1268,9 @@ static ssize_t xps_cpus_store(struct netdev_queue *queue,
cpumask_var_t mask;
int err;
 
+   if (!netif_is_multiqueue(dev))
+   return -ENOENT;
+
if (!capable(CAP_NET_ADMIN))
return -EPERM;
 



Backport bonding patches to fix active-passive

2018-06-11 Thread Nate Clark
Hi,

On the latest 4.9 stable active-passive bonding does not always
failover to the passive slave when carrier is lost on the active
slave. It seems that the issue stems from the backport of
c4adfc822bf5d8e97660b6114b5a8892530ce8cb, bonding: make speed, duplex
setting consistent with link state. There were subsequent patches
which resolved issues with the change to bond_update_speed_duplex
which were not backported. The three commits which seem to resolve the
issue are b5bf0f5b16b9c316c34df9f31d4be8729eb86845,
3f3c278c94dd994fe0d9f21679ae19b9c0a55292 and
ad729bc9acfb7c47112964b4877ef5404578ed13. There are other commits in
mainline which also revolve around
c4adfc822bf5d8e97660b6114b5a8892530ce8cb but are not necessary to
resolving the active-passive failover problems.

Would it be possible to queue up the three commits for backporting to
4.9 stable:
b5bf0f5b16b9c316c34df9f31d4be8729eb86845 bonding: correctly update
link status during mii-commit
3f3c278c94dd994fe0d9f21679ae19b9c0a55292 bonding: fix active-backup transition
ad729bc9acfb7c47112964b4877ef5404578ed13 bonding: require speed/duplex
only for 802.3ad, alb and tlb

All of those commits apply cleanly to 4.9.107.

Thanks,
-nate


Re: Qualcomm rmnet driver and qmi_wwan

2018-06-11 Thread Bjørn Mork
Subash Abhinov Kasiviswanathan  writes:

>> thanks, I will test it on Monday.
>>
>> Just a question for my knowledge: is the new sysfs attribute really
>> needed? I mean, is there not any other way to understand from qmi_wwan
>> without user intervention that there is the rmnet device attached?
>>
>> Regards,
>> Daniele
>>
>
> Hi Daniele
>
> You can check for the rx_handler attached to qmi_wwan dev and see if it
> belongs to rmnet. You can use the attached patch for it but it think the
> sysfs way might be a bit cleaner.
>
> -- 
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
> a Linux Foundation Collaborative Project
>
> From f7a2b90948da47ade1b345eddb37b721f5ab65f4 Mon Sep 17 00:00:00 2001
> From: Subash Abhinov Kasiviswanathan 
> Date: Sat, 9 Jun 2018 11:14:22 -0600
> Subject: [PATCH] net: qmi_wwan: Allow packets to pass through to rmnet
>
> Pass through mode is to allow packets in MAP format to be passed
> on to rmnet if the rmnet rx handler is attached to it.
>
> Signed-off-by: Subash Abhinov Kasiviswanathan 
> ---
>  drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c |  4 +++-
>  drivers/net/usb/qmi_wwan.c | 10 ++
>  include/linux/if_rmnet.h   | 20 
>  3 files changed, 33 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/if_rmnet.h
>
> diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c 
> b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
> index 5f4e447..164a18f 100644
> --- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
> +++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
> @@ -17,6 +17,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include "rmnet_config.h"
>  #include "rmnet_handlers.h"
>  #include "rmnet_vnd.h"
> @@ -48,10 +49,11 @@
>   [IFLA_RMNET_FLAGS]  = { .len = sizeof(struct ifla_rmnet_flags) },
>  };
>  
> -static int rmnet_is_real_dev_registered(const struct net_device *real_dev)
> +int rmnet_is_real_dev_registered(const struct net_device *real_dev)
>  {
>   return rcu_access_pointer(real_dev->rx_handler) == rmnet_rx_handler;
>  }
> +EXPORT_SYMBOL(rmnet_is_real_dev_registered);
>  
>  /* Needs rtnl lock */
>  static struct rmnet_port*
> diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c
> index f52a9be..abdae63 100644
> --- a/drivers/net/usb/qmi_wwan.c
> +++ b/drivers/net/usb/qmi_wwan.c
> @@ -22,6 +22,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  /* This driver supports wwan (3G/LTE/?) devices using a vendor
>   * specific management protocol called Qualcomm MSM Interface (QMI) -
> @@ -354,6 +355,10 @@ static ssize_t add_mux_store(struct device *d,  struct 
> device_attribute *attr, c
>   if (kstrtou8(buf, 0, _id))
>   return -EINVAL;
>  
> + /* rmnet is already attached here */
> + if (rmnet_is_real_dev_registered(to_net_dev(d)))
> + return -EINVAL;
> +


Maybe rmnet_is_real_dev_registered(dev->net) instead, since we use that
elsewhere in this function?


>   /* mux_id [1 - 0x7f] range empirically found */
>   if (mux_id < 1 || mux_id > 0x7f)
>   return -EINVAL;
> @@ -543,6 +548,11 @@ static int qmi_wwan_rx_fixup(struct usbnet *dev, struct 
> sk_buff *skb)
>   if (skb->len < dev->net->hard_header_len)
>   return 0;
>  
> + if (rawip && rmnet_is_real_dev_registered(skb->dev)) {
> + skb->protocol = htons(ETH_P_MAP);
> + return (netif_rx(skb) == NET_RX_SUCCESS);
> + }

Like Daniele said: It would be good to have some way to know when the
rawip condition fails.  Or even better: Automatically force rawip mode
when the rmnet driver attaches.  But that doesn't seem possible?  No
notifications or anything when an rx handler is registered?

Hmm, looking at this I wonder: Is the rawip check really necessary?  You
skip all the extra rawip code in the driver anyway, so I don't see how
it matters.  But maybe the ethernet header_ops are a problem?

And I wonder about using skb->dev here.  Does that really work?  I
didn't think we set that until later.  Why not use dev->net instead?



Bjørn


[net] fq_codel: fix NULL pointer deref in fq_codel_reset

2018-06-11 Thread Jeff Kirsher
From: Jacob Keller 

The function qdisc_create_dftl attempts to create a default qdisc. If
this fails, it calls qdisc_destroy when cleaning up. The qdisc_destroy
function calls the ->reset op on the qdisc.

In the case of sch_fq_codel.c, this function will panic when the qdisc
wasn't properly initialized:

   kernel: BUG: unable to handle kernel NULL pointer dereference at 
0008
   kernel: IP: fq_codel_reset+0x58/0xd0 [sch_fq_codel]
   kernel: PGD 0 P4D 0
   kernel: Oops:  [#1] SMP PTI
   kernel: Modules linked in: i40iw i40e(OE) xt_CHECKSUM iptable_mangle 
ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat 
nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack tun bridge stp llc 
devlink ebtable_filter ebtables ip6table_filter ip6_tables rpcrdma ib_isert 
iscsi_target_mod sunrpc ib_iser libiscsi scsi_transport_iscsi ib_srpt 
target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs 
ib_umad rdma_cm ib_cm iw_cm intel_rapl sb_edac x86_pkg_temp_thermal 
intel_powerclamp coretemp kvm irqbypass crct10dif_pclmul crc32_pclmul 
ghash_clmulni_intel intel_cstate iTCO_wdt iTCO_vendor_support intel_uncore 
ib_core intel_rapl_perf mei_me mei joydev i2c_i801 lpc_ich ioatdma shpchp wmi 
sch_fq_codel xfs libcrc32c mgag200 ixgbe drm_kms_helper isci ttm firewire_ohci
   kernel:  mdio drm igb libsas crc32c_intel firewire_core ptp pps_core 
scsi_transport_sas crc_itu_t dca i2c_algo_bit ipmi_si ipmi_devintf 
ipmi_msghandler [last unloaded: i40e]
   kernel: CPU: 10 PID: 4219 Comm: ip Tainted: G   OE
4.16.13custom-fq-codel-test+ #3
   kernel: Hardware name: Intel Corporation S2600CO/S2600CO, BIOS 
SE5C600.86B.02.05.0004.051120151007 05/11/2015
   kernel: RIP: 0010:fq_codel_reset+0x58/0xd0 [sch_fq_codel]
   kernel: RSP: 0018:bfbf4c1fb620 EFLAGS: 00010246
   kernel: RAX: 0400 RBX:  RCX: 05b9
   kernel: RDX:  RSI: 9d03264a60c0 RDI: 9cfd17b31c00
   kernel: RBP: 0001 R08: 000260c0 R09: b679c3e9
   kernel: R10: f1dab06a0e80 R11: 9cfd163af800 R12: 9cfd17b31c00
   kernel: R13: 0001 R14: 9cfd153de600 R15: 0001
   kernel: FS:  7fdec2f92800() GS:9d032648() 
knlGS:
   kernel: CS:  0010 DS:  ES:  CR0: 80050033
   kernel: CR2: 0008 CR3: 000c1956a006 CR4: 000606e0
   kernel: Call Trace:
   kernel:  qdisc_destroy+0x56/0x140
   kernel:  qdisc_create_dflt+0x8b/0xb0
   kernel:  mq_init+0xc1/0xf0
   kernel:  qdisc_create_dflt+0x5a/0xb0
   kernel:  dev_activate+0x205/0x230
   kernel:  __dev_open+0xf5/0x160
   kernel:  __dev_change_flags+0x1a3/0x210
   kernel:  dev_change_flags+0x21/0x60
   kernel:  do_setlink+0x660/0xdf0
   kernel:  ? down_trylock+0x25/0x30
   kernel:  ? xfs_buf_trylock+0x1a/0xd0 [xfs]
   kernel:  ? rtnl_newlink+0x816/0x990
   kernel:  ? _xfs_buf_find+0x327/0x580 [xfs]
   kernel:  ? _cond_resched+0x15/0x30
   kernel:  ? kmem_cache_alloc+0x20/0x1b0
   kernel:  ? rtnetlink_rcv_msg+0x200/0x2f0
   kernel:  ? rtnl_calcit.isra.30+0x100/0x100
   kernel:  ? netlink_rcv_skb+0x4c/0x120
   kernel:  ? netlink_unicast+0x19e/0x260
   kernel:  ? netlink_sendmsg+0x1ff/0x3c0
   kernel:  ? sock_sendmsg+0x36/0x40
   kernel:  ? ___sys_sendmsg+0x295/0x2f0
   kernel:  ? ebitmap_cmp+0x6d/0x90
   kernel:  ? dev_get_by_name_rcu+0x73/0x90
   kernel:  ? skb_dequeue+0x52/0x60
   kernel:  ? __inode_wait_for_writeback+0x7f/0xf0
   kernel:  ? bit_waitqueue+0x30/0x30
   kernel:  ? fsnotify_grab_connector+0x3c/0x60
   kernel:  ? __sys_sendmsg+0x51/0x90
   kernel:  ? do_syscall_64+0x74/0x180
   kernel:  ? entry_SYSCALL_64_after_hwframe+0x3d/0xa2
   kernel: Code: 00 00 48 89 87 00 02 00 00 8b 87 a0 01 00 00 85 c0 0f 84 84 00 
00 00 31 ed 48 63 dd 83 c5 01 48 c1 e3 06 49 03 9c 24 90 01 00 00 <48> 8b 73 08 
48 8b 3b e8 6c 9a 4f f6 48 8d 43 10 48 c7 03 00 00
   kernel: RIP: fq_codel_reset+0x58/0xd0 [sch_fq_codel] RSP: bfbf4c1fb620
   kernel: CR2: 0008
   kernel: ---[ end trace e81a62bede66274e ]---

This occurs because if fq_codel_init fails, it has left the private data
in an incomplete state. For example, if tcf_block_get fails, (as in the
above panic), then q->flows and q->backlogs will be NULL. Thus they will
cause NULL pointer access when attempting to reset them in
fq_codel_reset.

We could mitigate some of these issues by changing fq_codel_init to more
explicitly cleanup after itself when failing. For example, we could
ensure that q->flowcnt was set to 0 so that the loop over each flow in
fq_codel_reset would not trigger. However, this would not prevent a NULL
pointer dereference when attempting to memset the q->backlogs.

Instead, just add a NULL check prior to attempting to reset these
fields.

Cc: Eric Dumazet 
Signed-off-by: Jacob Keller 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 net/sched/sch_fq_codel.c | 15 +--
 1 file changed, 9 

Re: [PATCH nf-next] netfilter: nft_reject_bridge: remove unnecessary ttl set

2018-06-11 Thread Taehee Yoo
2018-06-12 1:35 GMT+09:00 Taehee Yoo :
> In the nft_reject_br_send_v4_tcp_reset(), a ttl is set by
> the nf_reject_ip_tcphdr_put(). so, below code is unnecessary.
>
> Signed-off-by: Taehee Yoo 
> ---
>  net/bridge/netfilter/nft_reject_bridge.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/net/bridge/netfilter/nft_reject_bridge.c 
> b/net/bridge/netfilter/nft_reject_bridge.c
> index eaf05de..e0b082c 100644
> --- a/net/bridge/netfilter/nft_reject_bridge.c
> +++ b/net/bridge/netfilter/nft_reject_bridge.c
> @@ -89,8 +89,7 @@ static void nft_reject_br_send_v4_tcp_reset(struct net *net,
> niph = nf_reject_iphdr_put(nskb, oldskb, IPPROTO_TCP,
>net->ipv4.sysctl_ip_default_ttl);
> nf_reject_ip_tcphdr_put(nskb, oldskb, oth);
> -   niph->ttl   = net->ipv4.sysctl_ip_default_ttl;
> -   niph->tot_len   = htons(nskb->len);
> +   niph->tot_len = htons(nskb->len);
> ip_send_check(niph);
>
> nft_reject_br_push_etherhdr(oldskb, nskb);
> --
> 2.9.3
>

I'm so sorry, I sent this to you by mistake.
Please ignore this.

Thanks


mainline: x86_64: kernel panic: RIP: 0010:__xfrm_policy_check+0xcb/0x690

2018-06-11 Thread Naresh Kamboju
Kernel panic on x86_64 machine running mainline 4.17.0 kernel while testing
selftests bpf test_tunnel.sh test caused this kernel panic.
I have noticed this kernel panic start happening from
4.17.0-rc7-next-20180529 and still happening on 4.17.0-next-20180608.

[  213.638287] BUG: unable to handle kernel NULL pointer dereference
at 0008
++[ ip xfrm poli  213.674036] PGD 0 P4D 0
[  213.674118] audit: type=1327 audit(1528917683.623:7):
proctitle=6970007866726D00706F6C69637900616464007372630031302E312E312E3130302F3332006473740031302E312E312E3230302F33320064697200696E00746D706C00737263003137322E31362E312E31303000647374003137322E31362E312E3230300070726F746F006573700072657169640031006D6F64650074756E6E
[  213.677950] Oops:  [#1] SMP PTI
cy[ add src 10.1.  213.677952] CPU: 2 PID: 0 Comm: swapper/2 Tainted:
GW 4.17.0-next-20180608 #1
[  213.677953] Hardware name: Supermicro SYS-5019S-ML/X11SSH-F, BIOS
2.0b 07/27/2017
[  213.726998] RIP: 0010:__xfrm_policy_check+0xcb/0x690
[  213.731962] Code: 80 3d 0a d8 f1 00 00 0f 84 c1 02 00 00 4c 8b 25
2b af f4 00 e8 66 a6 6a ff 85 c0 74 0d 80 3d eb d7 f1 00 00 0f 84 d5
02 00 00 <49> 8b 44 24 08 48 85 c0 74 0c 48 8d b5 78 ff ff ff 4c 89 ff
ff d0
1.[100/32 dst 10.  213.750836] RSP: 0018:91cf6fd03a48 EFLAGS: 00010246
[  213.757441] RAX:  RBX: 0002 RCX: 0002
[  213.764566] RDX: b863ebe0 RSI:  RDI: 
[  213.771688] RBP: 91cf6fd03b18 R08: b863ebe0 R09: 
[  213.778813] R10: 91cf6fd039d0 R11:  R12: 
[  213.785935] R13: 91cf5b23d84e R14: 91cf5b779f80 R15: 91cf5589cc00
[  213.793062] FS:  () GS:91cf6fd0()
knlGS:
[  213.801162] CS:  0010 DS:  ES:  CR0: 80050033
[  213.806900] CR2: 0008 CR3: 4201e001 CR4: 003606e0
[  213.814025] DR0:  DR1:  DR2: 
[  213.821200] DR3:  DR6: fffe0ff0 DR7: 0400
[  213.828324] Call Trace:
[  213.830769]  
[  213.832783]  ? trace_hardirqs_on+0xd/0x10
[  213.836819]  __xfrm_policy_check2.constprop.36+0x6c/0xc0
[  213.842131]  tcp_v4_rcv+0x9ef/0xbd0
[  213.845615]  ? ip_local_deliver_finish+0x26/0x340
[  213.850314]  ip_local_deliver_finish+0xc1/0x340
[  213.854843]  ip_local_deliver+0x74/0x220
[  213.858761]  ? inet_del_offload+0x40/0x40
[  213.862767]  ip_rcv_finish+0x1f0/0x550
[  213.866519]  ip_rcv+0x282/0x480
[  213.869657]  ? ip_local_deliver_finish+0x340/0x340
[  213.874448]  __netif_receive_skb_core+0x3b2/0xd30
[  213.879145]  ? lock_acquire+0xd5/0x1c0
[  213.882891]  __netif_receive_skb+0x18/0x60
[  213.886990]  ? __netif_receive_skb+0x18/0x60
[  213.891252]  netif_receive_skb_internal+0x79/0x370
[  213.896062]  napi_gro_receive+0x138/0x1b0
[  213.900121]  igb_poll+0x610/0xe70
[  213.903440]  net_rx_action+0x246/0x4b0
[  213.907190]  ? lock_acquire+0xd5/0x1c0
[  213.910933]  ? igb_msix_ring+0x5e/0x70
[  213.914681]  __do_softirq+0xbf/0x493
[  213.918260]  irq_exit+0xc3/0xd0
[  213.921405]  do_IRQ+0x65/0x110
[  213.924464]  common_interrupt+0xf/0xf
[  213.928128]  
[  213.930225] RIP: 0010:cpuidle_enter_state+0xa7/0x370
1.[1.200/32 dir i  213.935182] Code: 47 e8 bd 9a 7f ff 48 89 45 d0 0f
1f 44 00 00 31 ff e8 2d a9 7f ff 80 7d c7 00 0f 85 ee 01 00 00 e8 ae
9f 81 ff fb 48 8b 4d d0 <48> 2b 4d c8 48 ba cf f7 53 e3 a5 9b c4 20 48
89 c8 48 c1 f9 3f 48
[  213.955445] RSP: 0018:ab2c01943e38 EFLAGS: 0246 ORIG_RAX:
ffdc
[  213.963002] RAX: 91cf6fd21ec0 RBX: 0002 RCX: 0031bdd50c87
[  213.970127] RDX: 0031bdd50c87 RSI: 2aaa RDI: b84ab752
[  213.977250] RBP: ab2c01943e78 R08: 0061 R09: 0018
[  213.984375] R10: ab2c01943e18 R11: 0092 R12: 91cf5ce88000
[  213.991497] R13: b94cf278 R14: 0002 R15: b94cf260
[  213.998624]  ? cpuidle_enter_state+0xa2/0x370
[  214.002982]  ? cpuidle_enter_state+0xa2/0x370
[  214.007332]  cpuidle_enter+0x17/0x20
[  214.010902]  call_cpuidle+0x23/0x40
[  214.014387]  do_idle+0x1f0/0x250
[  214.017613]  cpu_startup_entry+0x73/0x80
[  214.021538]  start_secondary+0x175/0x1a0
[  214.025465]  secondary_startup_64+0xa5/0xb0
[  214.029651] Modules linked in: cls_bpf xt_mark algif_hash af_alg
x86_pkg_temp_thermal fuse
[  214.037941] CR2: 0008
[  214.041255] ---[ end trace a0b077febc9b99ca ]---
[  214.045874] RIP: 0010:__xfrm_policy_check+0xcb/0x690
n tmpl src 172.1[  214.050838] Code: 80 3d 0a d8 f1 00 00 0f 84 c1 02
00 00 4c 8b 25 2b af f4 00 e8 66 a6 6a ff 85 c0 74 0d 80 3d eb d7 f1
00 00 0f 84 d5 02 00 00 <49> 8b 44 24 08 48 85 c0 74 0c 48 8d b5 78 ff
ff ff 4c 89 ff ff d0
[  214.071103] RSP: 0018:91cf6fd03a48 EFLAGS: 00010246
[  214.076327] RAX:  RBX: 0002 RCX: 0002
[  214.083451] RDX: 

Re: [PATCH 3/6] arcnet: com20020: Add com20020 io mapped version

2018-06-11 Thread kbuild test robot
Hi Andrea,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]
[also build test WARNING on v4.17 next-20180608]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Andrea-Greco/arcnet-leds-Removed-leds-dependecy/20180611-222941
reproduce:
# apt-get install sparse
make ARCH=x86_64 allmodconfig
make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

>> drivers/net/arcnet/com20020-io.c:34:45: sparse: incorrect type in argument 1 
>> (different address spaces) @@expected void [noderef] * 
>> @@got sn:2>* @@
   drivers/net/arcnet/com20020-io.c:34:45:expected void [noderef] 
*
   drivers/net/arcnet/com20020-io.c:34:45:got void *
   drivers/net/arcnet/com20020-io.c:39:45: sparse: incorrect type in argument 2 
(different address spaces) @@expected void [noderef] * @@   
 got sn:2>* @@
   drivers/net/arcnet/com20020-io.c:39:45:expected void [noderef] 
*
   drivers/net/arcnet/com20020-io.c:39:45:got void *
>> drivers/net/arcnet/com20020-io.c:44:22: sparse: incorrect type in argument 1 
>> (different address spaces) @@expected void [noderef] *port @@
>> got void [noderef] *port @@
   drivers/net/arcnet/com20020-io.c:44:22:expected void [noderef] 
*port
   drivers/net/arcnet/com20020-io.c:44:22:got void *[noderef] 

   drivers/net/arcnet/com20020-io.c:49:23: sparse: incorrect type in argument 1 
(different address spaces) @@expected void [noderef] *port @@got 
void [noderef] *port @@
   drivers/net/arcnet/com20020-io.c:49:23:expected void [noderef] 
*port
   drivers/net/arcnet/com20020-io.c:49:23:got void *[noderef] 

>> drivers/net/arcnet/com20020-io.c:219:19: sparse: cast removes address space 
>> of expression
   drivers/net/arcnet/com20020-io.c: In function 'io_arc_inb':
   drivers/net/arcnet/com20020-io.c:34:17: warning: cast to pointer from 
integer of different size [-Wint-to-pointer-cast]
 return ioread8((void *__iomem) addr + offset);
^
   drivers/net/arcnet/com20020-io.c: In function 'io_arc_outb':
   drivers/net/arcnet/com20020-io.c:39:18: warning: cast to pointer from 
integer of different size [-Wint-to-pointer-cast]
 iowrite8(value, (void *__iomem)addr + offset);
 ^
   drivers/net/arcnet/com20020-io.c: In function 'io_arc_insb':
   drivers/net/arcnet/com20020-io.c:44:14: warning: cast to pointer from 
integer of different size [-Wint-to-pointer-cast]
 ioread8_rep((void *__iomem) (addr + offset), buffer, count);
 ^
   drivers/net/arcnet/com20020-io.c: In function 'io_arc_outsb':
   drivers/net/arcnet/com20020-io.c:49:15: warning: cast to pointer from 
integer of different size [-Wint-to-pointer-cast]
 iowrite8_rep((void *__iomem) (addr + offset), buffer, count);
  ^
   drivers/net/arcnet/com20020-io.c: In function 'com20020_probe':
   drivers/net/arcnet/com20020-io.c:219:11: warning: cast from pointer to 
integer of different size [-Wpointer-to-int-cast]
 ioaddr = (int)devm_ioremap(>dev, iores->start,
  ^
   drivers/net/arcnet/com20020-io.c:288:27: warning: cast to pointer from 
integer of different size [-Wint-to-pointer-cast]
 devm_iounmap(>dev, (void __iomem *)ioaddr);
  ^

vim +34 drivers/net/arcnet/com20020-io.c

31  
32  static unsigned int io_arc_inb(int addr, int offset)
33  {
  > 34  return ioread8((void *__iomem) addr + offset);
35  }
36  
37  static void io_arc_outb(int value, int addr, int offset)
38  {
  > 39  iowrite8(value, (void *__iomem)addr + offset);
40  }
41  
42  static void io_arc_insb(int  addr, int offset, void *buffer, int count)
43  {
  > 44  ioread8_rep((void *__iomem) (addr + offset), buffer, count);
45  }
46  
47  static void io_arc_outsb(int addr, int offset, void *buffer, int count)
48  {
49  iowrite8_rep((void *__iomem) (addr + offset), buffer, count);
50  }
51  
52  enum com20020_xtal_freq {
53  freq_10Mhz = 10,
54  freq_20Mhz = 20,
55  };
56  
57  enum com20020_arcnet_speed {
58  arc_speed_10M_bps = 1000,
59  arc_speed_5M_bps = 500,
60  arc_speed_2M50_bps = 250,
61  arc_speed_1M25_bps = 125,
62  arc_speed_625K_bps = 625000,
63  arc_speed_312K5_bps = 312500,
64  arc_speed_156K25_bps = 156250,
65  };
66  
67  enum com20020_timeout {
68  arc_timeout_328us =   328000,
69  arc_timeout_164us = 164000,
70  arc_timeout_82us =  82000,
71  arc_timeout_20u5s =  20500,
72  };
73  
74  static int setup_clock(int *c

[PATCH nf-next] netfilter: nft_reject_bridge: remove unnecessary ttl set

2018-06-11 Thread Taehee Yoo
In the nft_reject_br_send_v4_tcp_reset(), a ttl is set by
the nf_reject_ip_tcphdr_put(). so, below code is unnecessary.

Signed-off-by: Taehee Yoo 
---
 net/bridge/netfilter/nft_reject_bridge.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/bridge/netfilter/nft_reject_bridge.c 
b/net/bridge/netfilter/nft_reject_bridge.c
index eaf05de..e0b082c 100644
--- a/net/bridge/netfilter/nft_reject_bridge.c
+++ b/net/bridge/netfilter/nft_reject_bridge.c
@@ -89,8 +89,7 @@ static void nft_reject_br_send_v4_tcp_reset(struct net *net,
niph = nf_reject_iphdr_put(nskb, oldskb, IPPROTO_TCP,
   net->ipv4.sysctl_ip_default_ttl);
nf_reject_ip_tcphdr_put(nskb, oldskb, oth);
-   niph->ttl   = net->ipv4.sysctl_ip_default_ttl;
-   niph->tot_len   = htons(nskb->len);
+   niph->tot_len = htons(nskb->len);
ip_send_check(niph);
 
nft_reject_br_push_etherhdr(oldskb, nskb);
-- 
2.9.3



Re: [PATCH v2 15/24] net: qualcomm: MODULE_DEVICE_TABLE(serdev)

2018-06-11 Thread Ricardo Ribalda Delgado
Hi Marcel,
On Mon, Jun 11, 2018 at 5:52 PM Marcel Holtmann  wrote:
>
> Hi Ricardo,
>
>  the commit message is misleading me. If I build something with ACPI or 
>  DT support, then modinfo will show all modalias information for ACPI and 
>  DT compatible strings. What else does udev/modprobe actually need? Is 
>  something broken with the modalias export?
> >>>
> >>> The main purpose is to autoload drivers for devices that have been
> >>> created via sysfs or another module.
> >>>
> >>> Eg1: We have a serial port on a standard computer that has connected a
> >>> GPS module. Since it is something that is not in the ACPI nor the DT
> >>> table the user will run
> >>>
> >>> echo serdev_gps > /sys/bus/serial/devices/serial0/new_device
> >>>
> >>> Eg2 module: 
> >>> https://github.com/ribalda/linux/blob/415bb3f0076c2b846ebe5409589b8e1e3004f55a/drivers/tty/serdev/test_platform.c
> >>>
> >>> Modprobe does not know what module to load for that device unless
> >>> there is a matching MODULE_DEVICE_TABLE
> >>> Today, we have the same functionality for i2c devices
> >>> https://www.kernel.org/doc/Documentation/i2c/instantiating-devices
> >>
> >> but why does this have to be the driver name? I would rather say, create a 
> >> generic string that describes the hardware and then use that. I think you 
> >> also want the special handling, as from USB for example, that lets you 
> >> reference an existing .compatible or ACPI ID as reference so that the 
> >> driver_data is copied.
> >
> > We can choose any name, but if there are no special variants (like the
> > rave-sp driver) I would prefer using the module name. We can of course
> > use more names, like the part number of the the device, but it is very
> > convenient to also use the module name, and other subsystems also do
> > that.
> >
> > If you want to add specific names for this device please let me know
> > and I will add them to the list. You know much more about this module
> > than myself.
>
> if we want to use the driver name, then why not build this into the 
> new_device handling itself instead of duplicating that name in each 
> serdev_device_id table. What is the reference here? For example platform 
> device do matching on driver name if I recall this correctly.

We do the matching also over driver name at serdev_device_match():

if (sdrv->id_table)
return !!serdev_match_id(sdrv->id_table, sdev);

return strcmp(sdev->modalias, drv->name) == 0;

This is just for the module autoloading

>
> However what is most important is that device_get_match_data() actually 
> works. There should be no difference between DT, ACPI or a dynamic device.

Agree, will take a look to this tomorrow morning.

>
> Regards
>
> Marcel
>


-- 
Ricardo Ribalda


Re: [PATCH 3/6] arcnet: com20020: Add com20020 io mapped version

2018-06-11 Thread kbuild test robot
Hi Andrea,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on net-next/master]
[also build test WARNING on v4.17 next-20180608]
[if your patch is applied to the wrong git tree, please drop us a note to help 
improve the system]

url:
https://github.com/0day-ci/linux/commits/Andrea-Greco/arcnet-leds-Removed-leds-dependecy/20180611-222941
config: sparc64-allmodconfig (attached as .config)
compiler: sparc64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
wget 
https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O 
~/bin/make.cross
chmod +x ~/bin/make.cross
# save the attached .config to linux build tree
GCC_VERSION=7.2.0 make.cross ARCH=sparc64 

All warnings (new ones prefixed by >>):

   drivers/net//arcnet/com20020-io.c: In function 'io_arc_inb':
>> drivers/net//arcnet/com20020-io.c:34:17: warning: cast to pointer from 
>> integer of different size [-Wint-to-pointer-cast]
 return ioread8((void *__iomem) addr + offset);
^
   drivers/net//arcnet/com20020-io.c: In function 'io_arc_outb':
   drivers/net//arcnet/com20020-io.c:39:18: warning: cast to pointer from 
integer of different size [-Wint-to-pointer-cast]
 iowrite8(value, (void *__iomem)addr + offset);
 ^
   drivers/net//arcnet/com20020-io.c: In function 'io_arc_insb':
   drivers/net//arcnet/com20020-io.c:44:14: warning: cast to pointer from 
integer of different size [-Wint-to-pointer-cast]
 ioread8_rep((void *__iomem) (addr + offset), buffer, count);
 ^
   drivers/net//arcnet/com20020-io.c: In function 'io_arc_outsb':
   drivers/net//arcnet/com20020-io.c:49:15: warning: cast to pointer from 
integer of different size [-Wint-to-pointer-cast]
 iowrite8_rep((void *__iomem) (addr + offset), buffer, count);
  ^
   drivers/net//arcnet/com20020-io.c: In function 'com20020_probe':
>> drivers/net//arcnet/com20020-io.c:219:11: warning: cast from pointer to 
>> integer of different size [-Wpointer-to-int-cast]
 ioaddr = (int)devm_ioremap(>dev, iores->start,
  ^
   drivers/net//arcnet/com20020-io.c:288:27: warning: cast to pointer from 
integer of different size [-Wint-to-pointer-cast]
 devm_iounmap(>dev, (void __iomem *)ioaddr);
  ^

vim +34 drivers/net//arcnet/com20020-io.c

31  
32  static unsigned int io_arc_inb(int addr, int offset)
33  {
  > 34  return ioread8((void *__iomem) addr + offset);
35  }
36  
37  static void io_arc_outb(int value, int addr, int offset)
38  {
39  iowrite8(value, (void *__iomem)addr + offset);
40  }
41  
42  static void io_arc_insb(int  addr, int offset, void *buffer, int count)
43  {
  > 44  ioread8_rep((void *__iomem) (addr + offset), buffer, count);
45  }
46  
47  static void io_arc_outsb(int addr, int offset, void *buffer, int count)
48  {
49  iowrite8_rep((void *__iomem) (addr + offset), buffer, count);
50  }
51  
52  enum com20020_xtal_freq {
53  freq_10Mhz = 10,
54  freq_20Mhz = 20,
55  };
56  
57  enum com20020_arcnet_speed {
58  arc_speed_10M_bps = 1000,
59  arc_speed_5M_bps = 500,
60  arc_speed_2M50_bps = 250,
61  arc_speed_1M25_bps = 125,
62  arc_speed_625K_bps = 625000,
63  arc_speed_312K5_bps = 312500,
64  arc_speed_156K25_bps = 156250,
65  };
66  
67  enum com20020_timeout {
68  arc_timeout_328us =   328000,
69  arc_timeout_164us = 164000,
70  arc_timeout_82us =  82000,
71  arc_timeout_20u5s =  20500,
72  };
73  
74  static int setup_clock(int *clockp, int *clockm, int xtal, int 
arcnet_speed)
75  {
76  int pll_factor, req_clock_frq = 20;
77  
78  switch (arcnet_speed) {
79  case arc_speed_10M_bps:
80  req_clock_frq = 80;
81  *clockp = 0;
82  break;
83  case arc_speed_5M_bps:
84  req_clock_frq = 40;
85  *clockp = 0;
86  break;
87  case arc_speed_2M50_bps:
88  *clockp = 0;
89  break;
90  case arc_speed_1M25_bps:
91  *clockp = 1;
92  break;
93  case arc_speed_625K_bps:
94  *clockp = 2;
95  break;
96  case arc_speed_312K5_bps:
97  *clockp = 3;
98  break;
99  case arc_speed_156K25_bps:
   100  *clockp = 4;
   101  break;
   102  default:
   103  return -EINVAL;
   104  }
   105  
 

[net 3/5] ixgbe: Move ipsec init function to before reset call

2018-06-11 Thread Jeff Kirsher
From: Alexander Duyck 

This patch moves the IPsec init function in ixgbe_sw_init. This way it is a
bit more consistent with the placement of similar initialization functions
and is placed before the reset_hw call which should allow us to clean up
any link issues that may be introduced by the fact that we force the link
up if somehow the device had IPsec still enabled before the driver was
loaded.

In addition to the function move it is necessary to change the assignment
of netdev->features. The easiest way to do this is to just test for the
existence of adapter->ipsec and if it is present we set the feature bits.

CC: stable 
Fixes: 49a94d74d948 ("ixgbe: add ipsec engine start and stop routines")
Reported-by: Andre Tomt 
Signed-off-by: Alexander Duyck 
Acked-by: Shannon Nelson 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c |  7 ---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c  | 11 +--
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
index 344a1f213a5f..38d8cf75e9ad 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
@@ -1001,13 +1001,6 @@ void ixgbe_init_ipsec_offload(struct ixgbe_adapter 
*adapter)
 
adapter->netdev->xfrmdev_ops = _xfrmdev_ops;
 
-#define IXGBE_ESP_FEATURES (NETIF_F_HW_ESP | \
-NETIF_F_HW_ESP_TX_CSUM | \
-NETIF_F_GSO_ESP)
-
-   adapter->netdev->features |= IXGBE_ESP_FEATURES;
-   adapter->netdev->hw_enc_features |= IXGBE_ESP_FEATURES;
-
return;
 
 err2:
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index a925f05ec342..8d061af276d3 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -6117,6 +6117,7 @@ static int ixgbe_sw_init(struct ixgbe_adapter *adapter,
 #ifdef CONFIG_IXGBE_DCB
ixgbe_init_dcb(adapter);
 #endif
+   ixgbe_init_ipsec_offload(adapter);
 
/* default flow control settings */
hw->fc.requested_mode = ixgbe_fc_full;
@@ -10429,6 +10430,14 @@ static int ixgbe_probe(struct pci_dev *pdev, const 
struct pci_device_id *ent)
if (hw->mac.type >= ixgbe_mac_82599EB)
netdev->features |= NETIF_F_SCTP_CRC;
 
+#ifdef CONFIG_XFRM_OFFLOAD
+#define IXGBE_ESP_FEATURES (NETIF_F_HW_ESP | \
+NETIF_F_HW_ESP_TX_CSUM | \
+NETIF_F_GSO_ESP)
+
+   if (adapter->ipsec)
+   netdev->features |= IXGBE_ESP_FEATURES;
+#endif
/* copy netdev features into list of user selectable features */
netdev->hw_features |= netdev->features |
   NETIF_F_HW_VLAN_CTAG_FILTER |
@@ -10491,8 +10500,6 @@ static int ixgbe_probe(struct pci_dev *pdev, const 
struct pci_device_id *ent)
 NETIF_F_FCOE_MTU;
}
 #endif /* IXGBE_FCOE */
-   ixgbe_init_ipsec_offload(adapter);
-
if (adapter->flags2 & IXGBE_FLAG2_RSC_CAPABLE)
netdev->hw_features |= NETIF_F_LRO;
if (adapter->flags2 & IXGBE_FLAG2_RSC_ENABLED)
-- 
2.17.1



[net 1/5] ixgbe: Fix setting of TC configuration for macvlan case

2018-06-11 Thread Jeff Kirsher
From: Alexander Duyck 

When we were enabling macvlan interfaces we weren't correctly configuring
things until ixgbe_setup_tc was called a second time either by tweaking the
number of queues or increasing the macvlan count past 15.

The issue came down to the fact that num_rx_pools is not populated until
after the queues and interrupts are reinitialized.

Instead of trying to set it sooner we can just move the call to setup at
least 1 traffic class to the SR-IOV/VMDq setup function so that we just set
it for this one case. We already had a spot that was configuring the queues
for TC 0 in the code here anyway so it makes sense to also set the number
of TCs here as well.

CC: stable 
Fixes: 49cfbeb7a95c ("ixgbe: Fix handling of macvlan Tx offload")
Signed-off-by: Alexander Duyck 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c  | 8 
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 8 
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
index 893a9206e718..d361f570ca37 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
@@ -593,6 +593,14 @@ static bool ixgbe_set_sriov_queues(struct ixgbe_adapter 
*adapter)
}
 
 #endif
+   /* To support macvlan offload we have to use num_tc to
+* restrict the queues that can be used by the device.
+* By doing this we can avoid reporting a false number of
+* queues.
+*/
+   if (vmdq_i > 1)
+   netdev_set_num_tc(adapter->netdev, 1);
+
/* populate TC0 for use by pool 0 */
netdev_set_tc_queue(adapter->netdev, 0,
adapter->num_rx_queues_per_pool, 0);
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 4929f7265598..f9e0dc041cfb 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -8822,14 +8822,6 @@ int ixgbe_setup_tc(struct net_device *dev, u8 tc)
} else {
netdev_reset_tc(dev);
 
-   /* To support macvlan offload we have to use num_tc to
-* restrict the queues that can be used by the device.
-* By doing this we can avoid reporting a false number of
-* queues.
-*/
-   if (!tc && adapter->num_rx_pools > 1)
-   netdev_set_num_tc(dev, 1);
-
if (adapter->hw.mac.type == ixgbe_mac_82598EB)
adapter->hw.fc.requested_mode = adapter->last_lfc_mode;
 
-- 
2.17.1



[net 4/5] ixgbe: Avoid loopback and fix boolean logic in ipsec_stop_data

2018-06-11 Thread Jeff Kirsher
From: Alexander Duyck 

This patch fixes two issues. First we add an early test for the Tx and Rx
security block ready bits. By doing this we can avoid the need for waits or
loopback in the event that the security block is already flushed out.
Secondly we fix the boolean logic that was testing for the Tx OR Rx ready
bits being set and change it so that we only exit if the Tx AND Rx ready
bits are both set.

CC: stable 
Signed-off-by: Alexander Duyck 
Acked-by: Shannon Nelson 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c | 13 +++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
index 38d8cf75e9ad..7b23fb0c2d07 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
@@ -158,7 +158,16 @@ static void ixgbe_ipsec_stop_data(struct ixgbe_adapter 
*adapter)
reg |= IXGBE_SECRXCTRL_RX_DIS;
IXGBE_WRITE_REG(hw, IXGBE_SECRXCTRL, reg);
 
-   IXGBE_WRITE_FLUSH(hw);
+   /* If both Tx and Rx are ready there are no packets
+* that we need to flush so the loopback configuration
+* below is not necessary.
+*/
+   t_rdy = IXGBE_READ_REG(hw, IXGBE_SECTXSTAT) &
+   IXGBE_SECTXSTAT_SECTX_RDY;
+   r_rdy = IXGBE_READ_REG(hw, IXGBE_SECRXSTAT) &
+   IXGBE_SECRXSTAT_SECRX_RDY;
+   if (t_rdy && r_rdy)
+   return;
 
/* If the tx fifo doesn't have link, but still has data,
 * we can't clear the tx sec block.  Set the MAC loopback
@@ -185,7 +194,7 @@ static void ixgbe_ipsec_stop_data(struct ixgbe_adapter 
*adapter)
IXGBE_SECTXSTAT_SECTX_RDY;
r_rdy = IXGBE_READ_REG(hw, IXGBE_SECRXSTAT) &
IXGBE_SECRXSTAT_SECRX_RDY;
-   } while (!t_rdy && !r_rdy && limit--);
+   } while (!(t_rdy && r_rdy) && limit--);
 
/* undo loopback if we played with it earlier */
if (!link) {
-- 
2.17.1



[net 2/5] ixgbe: Use CONFIG_XFRM_OFFLOAD instead of CONFIG_XFRM

2018-06-11 Thread Jeff Kirsher
From: Alexander Duyck 

There is no point in adding code if CONFIG_XFRM is defined that we won't
use unless CONFIG_XFRM_OFFLOAD is defined. So instead of leaving this code
floating around I am replacing the ifdef with what I believe is the correct
one so that we only include the code and variables if they will actually be
used.

CC: stable 
Signed-off-by: Alexander Duyck 
Acked-by: Shannon Nelson 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h  | 4 ++--
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 2 +-
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h 
b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index fc534e91c6b2..144d5fe6b944 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -760,9 +760,9 @@ struct ixgbe_adapter {
 #define IXGBE_RSS_KEY_SIZE 40  /* size of RSS Hash Key in bytes */
u32 *rss_key;
 
-#ifdef CONFIG_XFRM
+#ifdef CONFIG_XFRM_OFFLOAD
struct ixgbe_ipsec *ipsec;
-#endif /* CONFIG_XFRM */
+#endif /* CONFIG_XFRM_OFFLOAD */
 };
 
 static inline u8 ixgbe_max_rss_indices(struct ixgbe_adapter *adapter)
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index f9e0dc041cfb..a925f05ec342 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -9896,7 +9896,7 @@ ixgbe_features_check(struct sk_buff *skb, struct 
net_device *dev,
 * the TSO, so it's the exception.
 */
if (skb->encapsulation && !(features & NETIF_F_TSO_MANGLEID)) {
-#ifdef CONFIG_XFRM
+#ifdef CONFIG_XFRM_OFFLOAD
if (!skb->sp)
 #endif
features &= ~NETIF_F_TSO;
-- 
2.17.1



[net 0/5][pull request] Intel Wired LAN Driver Updates 2018-06-11

2018-06-11 Thread Jeff Kirsher
This series contains fixes to ixgbe IPsec and MACVLAN.

Alex provides the 5 fixes in this series, starting with fixing an issue
where num_rx_pools was not being populated until after the queues and
interrupts were reinitialized when enabling MACVLAN interfaces.  Updated
to use CONFIG_XFRM_OFFLOAD instead of CONFIG_XFRM, since the code
requires CONFIG_XFRM_OFFLOAD to be enabled.  Moved the IPsec
initialization function to be more consistent with the placement of
similar initialization functions and before the call to reset the
hardware, which will clean up any link issues that may have been
introduced.  Fixed the boolean logic that was testing for transmit OR
receive ready bits, when it should have been testing for transmit AND
receive ready bits.  Fixed the bit definitions for SECTXSTAT and SECRXSTAT
registers and ensure that if IPsec is disabled on the part, do not
enable it.

The following are changes since commit f0dc7f9c6dd99891611fca5849cbc4c6965b690e:
  Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue 10GbE

Alexander Duyck (5):
  ixgbe: Fix setting of TC configuration for macvlan case
  ixgbe: Use CONFIG_XFRM_OFFLOAD instead of CONFIG_XFRM
  ixgbe: Move ipsec init function to before reset call
  ixgbe: Avoid loopback and fix boolean logic in ipsec_stop_data
  ixgbe: Fix bit definitions and add support for testing for ipsec
support

 drivers/net/ethernet/intel/ixgbe/ixgbe.h  |  4 +--
 .../net/ethernet/intel/ixgbe/ixgbe_ipsec.c| 34 +--
 drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c  |  8 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 21 ++--
 drivers/net/ethernet/intel/ixgbe/ixgbe_type.h |  6 ++--
 5 files changed, 48 insertions(+), 25 deletions(-)

-- 
2.17.1



[net 5/5] ixgbe: Fix bit definitions and add support for testing for ipsec support

2018-06-11 Thread Jeff Kirsher
From: Alexander Duyck 

This patch addresses two issues. First it adds the correct bit definitions
for the SECTXSTAT and SECRXSTAT registers. Then it makes use of those
definitions to test for if IPsec has been disabled on the part and if so we
do not enable it.

CC: stable 
Signed-off-by: Alexander Duyck 
Reported-by: Andre Tomt 
Acked-by: Shannon Nelson 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c | 14 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_type.h  |  6 --
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
index 7b23fb0c2d07..c116f459945d 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ipsec.c
@@ -975,10 +975,22 @@ void ixgbe_ipsec_rx(struct ixgbe_ring *rx_ring,
  **/
 void ixgbe_init_ipsec_offload(struct ixgbe_adapter *adapter)
 {
+   struct ixgbe_hw *hw = >hw;
struct ixgbe_ipsec *ipsec;
+   u32 t_dis, r_dis;
size_t size;
 
-   if (adapter->hw.mac.type == ixgbe_mac_82598EB)
+   if (hw->mac.type == ixgbe_mac_82598EB)
+   return;
+
+   /* If there is no support for either Tx or Rx offload
+* we should not be advertising support for IPsec.
+*/
+   t_dis = IXGBE_READ_REG(hw, IXGBE_SECTXSTAT) &
+   IXGBE_SECTXSTAT_SECTX_OFF_DIS;
+   r_dis = IXGBE_READ_REG(hw, IXGBE_SECRXSTAT) &
+   IXGBE_SECRXSTAT_SECRX_OFF_DIS;
+   if (t_dis || r_dis)
return;
 
ipsec = kzalloc(sizeof(*ipsec), GFP_KERNEL);
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
index e8ed37749ab1..44cfb2021145 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
@@ -599,13 +599,15 @@ struct ixgbe_nvm_version {
 #define IXGBE_SECTXCTRL_STORE_FORWARD   0x0004
 
 #define IXGBE_SECTXSTAT_SECTX_RDY   0x0001
-#define IXGBE_SECTXSTAT_ECC_TXERR   0x0002
+#define IXGBE_SECTXSTAT_SECTX_OFF_DIS   0x0002
+#define IXGBE_SECTXSTAT_ECC_TXERR   0x0004
 
 #define IXGBE_SECRXCTRL_SECRX_DIS   0x0001
 #define IXGBE_SECRXCTRL_RX_DIS  0x0002
 
 #define IXGBE_SECRXSTAT_SECRX_RDY   0x0001
-#define IXGBE_SECRXSTAT_ECC_RXERR   0x0002
+#define IXGBE_SECRXSTAT_SECRX_OFF_DIS   0x0002
+#define IXGBE_SECRXSTAT_ECC_RXERR   0x0004
 
 /* LinkSec (MacSec) Registers */
 #define IXGBE_LSECTXCAP 0x08A00
-- 
2.17.1



Re: [PATCH net] ipv6: allow PMTU exceptions to local routes

2018-06-11 Thread Martin KaFai Lau
On Mon, Jun 11, 2018 at 02:02:54AM +0300, Julian Anastasov wrote:
> IPVS setups with local client and remote tunnel server need
> to create exception for the local virtual IP. What we do is to
> change PMTU from 64KB (on "lo") to 1460 in the common case.
> 
> Suggested-by: Martin KaFai Lau 
> Fixes: 45e4fd26683c ("ipv6: Only create RTF_CACHE routes after encountering 
> pmtu exception")
> Fixes: 7343ff31ebf0 ("ipv6: Don't create clones of host routes.")
> Signed-off-by: Julian Anastasov 
Acked-by: Martin KaFai Lau 


locking in wimax/i2400m

2018-06-11 Thread Sebastian Andrzej Siewior
I tried to figure out if the URB-completion handler uses any locking and
stumbled here.

i2400m_pm_notifier() is called from process context. This function
invokes i2400m_fw_cache() + i2400m_fw_uncache(). Both functions do 
spin_lock(>rx_lock);
while in other places (say i2400mu_rxd()) it does 
spin_lock_irqsave(>rx_lock, flags);

So what do I miss? Is this lock never used in interrupt context and
lockdep didn't complain or did nobody try suspend with this driver
before?
>From what I can tell i2400m_dev_bootstrap() has the same locking
problem. 

While here, I noticed that i2400m_fw_cache() does use GFP_ATOMIC which
should be GFP_KERNEL since the context can't be atomic at this point
(there is even request_firmware() later on).

I am also curious why there is NULL and ~0 because it does not seem to
make a difference in i2400m_fw_uncache() but I'm more interested in
the locking bits :)

Sebastian


Re: [PATCH v2 15/24] net: qualcomm: MODULE_DEVICE_TABLE(serdev)

2018-06-11 Thread Marcel Holtmann
Hi Ricardo,

 the commit message is misleading me. If I build something with ACPI or DT 
 support, then modinfo will show all modalias information for ACPI and DT 
 compatible strings. What else does udev/modprobe actually need? Is 
 something broken with the modalias export?
>>> 
>>> The main purpose is to autoload drivers for devices that have been
>>> created via sysfs or another module.
>>> 
>>> Eg1: We have a serial port on a standard computer that has connected a
>>> GPS module. Since it is something that is not in the ACPI nor the DT
>>> table the user will run
>>> 
>>> echo serdev_gps > /sys/bus/serial/devices/serial0/new_device
>>> 
>>> Eg2 module: 
>>> https://github.com/ribalda/linux/blob/415bb3f0076c2b846ebe5409589b8e1e3004f55a/drivers/tty/serdev/test_platform.c
>>> 
>>> Modprobe does not know what module to load for that device unless
>>> there is a matching MODULE_DEVICE_TABLE
>>> Today, we have the same functionality for i2c devices
>>> https://www.kernel.org/doc/Documentation/i2c/instantiating-devices
>> 
>> but why does this have to be the driver name? I would rather say, create a 
>> generic string that describes the hardware and then use that. I think you 
>> also want the special handling, as from USB for example, that lets you 
>> reference an existing .compatible or ACPI ID as reference so that the 
>> driver_data is copied.
> 
> We can choose any name, but if there are no special variants (like the
> rave-sp driver) I would prefer using the module name. We can of course
> use more names, like the part number of the the device, but it is very
> convenient to also use the module name, and other subsystems also do
> that.
> 
> If you want to add specific names for this device please let me know
> and I will add them to the list. You know much more about this module
> than myself.

if we want to use the driver name, then why not build this into the new_device 
handling itself instead of duplicating that name in each serdev_device_id 
table. What is the reference here? For example platform device do matching on 
driver name if I recall this correctly.

However what is most important is that device_get_match_data() actually works. 
There should be no difference between DT, ACPI or a dynamic device.

Regards

Marcel



Re: [PATCH] Bluetooth: hci_bcm: Configure SCO routing automatically

2018-06-11 Thread Marcel Holtmann
Hi Rob,

 Added support to automatically configure the SCO packet routing at the
 device setup. The SCO packets are used with the HSP / HFP profiles, but in
 some devices (ex. CYW43438) they are routed to a PCM output by default. 
 This
 change allows sending the vendor specific HCI command to configure the SCO
 routing. The parameters of the command are loaded from the device tree.
>>> 
>>> Please wrap your commit msg.
>> 
>> 
>> Sure.
>>> 
>>> 
 
 Signed-off-by: Attila Tőkés 
 ---
 .../bindings/net/broadcom-bluetooth.txt   |  7 ++
>>> 
>>> Please split bindings to separate patch.
>> 
>> 
>> Ok, I will split this in two.
>>> 
>>> 
>>> 
>>> 
 drivers/bluetooth/hci_bcm.c   | 72 +++
 2 files changed, 79 insertions(+)
 
 diff --git
 a/Documentation/devicetree/bindings/net/broadcom-bluetooth.txt
 b/Documentation/devicetree/bindings/net/broadcom-bluetooth.txt
 index 4194ff7e..aea3a094 100644
 --- a/Documentation/devicetree/bindings/net/broadcom-bluetooth.txt
 +++ b/Documentation/devicetree/bindings/net/broadcom-bluetooth.txt
 @@ -21,6 +21,12 @@ Optional properties:
  - clocks: clock specifier if external clock provided to the controller
  - clock-names: should be "extclk"
 
 + SCO routing parameters:
 + - sco-routing: 0-3 (PCM, Transport, Codec, I2S)
 + - pcm-interface-rate: 0-4 (128 Kbps - 2048 Kbps)
 + - pcm-frame-type: 0 (short), 1 (long)
 + - pcm-sync-mode: 0 (slave), 1 (master)
 + - pcm-clock-mode: 0 (slave), 1 (master)
>>> 
>>> Are these Broadcom specific? Properties need either vendor prefix or
>>> to be documented in a common location. I think these look like the
>>> latter.
>> 
>> 
>> These will be used as parameters of a vendor specific (Broadcom/Cypress)
>> command configuring the SCO packet routing. See the Write_SCO_PCM_Int_Param
>> command from: http://www.cypress.com/file/298311/download.
> 
> The DT should just describe how the h/w is hooked-up. What the s/w has
> to do based on that is the driver's problem which is certainly
> vendor/chip specific, but that is all irrelevant to the binding.
> 
>> What would be the property names with a Broadcom / Cypress vendor prefix?
>> 
>>brcm,sco-routing
>>brcm,pcm-interface-rate
>>brcm,pcm-frame-type
>>brcm,pcm-sync-mode
>>brcm,pcm-clock-mode
>> 
>> ?
> 
> Yes.

we can do this. However all pcm-* are optional if you switch the HCI transport. 
And sco-routing should default to HCI if that is not present. Meaning a driver 
should actively trying to change this. Nevertheless, it would be good if a 
driver reads the current settings.

In theory we could make sco-routing generic, but so many vendors have different 
modes, that we better keep this vendor specific.

Regards

Marcel



Re: [PATCH] Bluetooth: hci_bcm: Configure SCO routing automatically

2018-06-11 Thread Rob Herring
On Sat, Jun 9, 2018 at 12:26 AM, Attila Tőkés  wrote:
> Hello Rob,
>
> On Fri, Jun 8, 2018 at 8:25 PM, Rob Herring  wrote:
>>
>> On Fri, Jun 8, 2018 at 10:20 AM,   wrote:
>> > From: Attila Tőkés 
>> >
>> > Added support to automatically configure the SCO packet routing at the
>> > device setup. The SCO packets are used with the HSP / HFP profiles, but in
>> > some devices (ex. CYW43438) they are routed to a PCM output by default. 
>> > This
>> > change allows sending the vendor specific HCI command to configure the SCO
>> > routing. The parameters of the command are loaded from the device tree.
>>
>> Please wrap your commit msg.
>
>
> Sure.
>>
>>
>> >
>> > Signed-off-by: Attila Tőkés 
>> > ---
>> >  .../bindings/net/broadcom-bluetooth.txt   |  7 ++
>>
>> Please split bindings to separate patch.
>
>
> Ok, I will split this in two.
>>
>>
>>
>>
>> >  drivers/bluetooth/hci_bcm.c   | 72 +++
>> >  2 files changed, 79 insertions(+)
>> >
>> > diff --git
>> > a/Documentation/devicetree/bindings/net/broadcom-bluetooth.txt
>> > b/Documentation/devicetree/bindings/net/broadcom-bluetooth.txt
>> > index 4194ff7e..aea3a094 100644
>> > --- a/Documentation/devicetree/bindings/net/broadcom-bluetooth.txt
>> > +++ b/Documentation/devicetree/bindings/net/broadcom-bluetooth.txt
>> > @@ -21,6 +21,12 @@ Optional properties:
>> >   - clocks: clock specifier if external clock provided to the controller
>> >   - clock-names: should be "extclk"
>> >
>> > + SCO routing parameters:
>> > + - sco-routing: 0-3 (PCM, Transport, Codec, I2S)
>> > + - pcm-interface-rate: 0-4 (128 Kbps - 2048 Kbps)
>> > + - pcm-frame-type: 0 (short), 1 (long)
>> > + - pcm-sync-mode: 0 (slave), 1 (master)
>> > + - pcm-clock-mode: 0 (slave), 1 (master)
>>
>> Are these Broadcom specific? Properties need either vendor prefix or
>> to be documented in a common location. I think these look like the
>> latter.
>
>
> These will be used as parameters of a vendor specific (Broadcom/Cypress)
> command configuring the SCO packet routing. See the Write_SCO_PCM_Int_Param
> command from: http://www.cypress.com/file/298311/download.

The DT should just describe how the h/w is hooked-up. What the s/w has
to do based on that is the driver's problem which is certainly
vendor/chip specific, but that is all irrelevant to the binding.

> What would be the property names with a Broadcom / Cypress vendor prefix?
>
> brcm,sco-routing
> brcm,pcm-interface-rate
> brcm,pcm-frame-type
> brcm,pcm-sync-mode
> brcm,pcm-clock-mode
>
> ?

Yes.

>
>>
>>
>> However, this also looks incomplete to me. For example, which SoC
>> I2S/PCM port is BT audio connected to and how does it fit into the
>> existing audio related bindings? There's been work on HDMI audio
>> bindings which would be similar (except for the SCO over UART at
>> least).
>
>
> The I2S / PCM pins of the Bluetooth chip most likely will not be connected
> to the host SoC. When used with a SoC we probably want tor route the SCO
> packets over the UART transport. A2DP data already goes here and we probably
> want the same for HSP / HFP.

Then that should be the default with no properties.

I imagine for phones, vendors want I2S hooked up for voice calls so
audio can be routed directly to/from the modem and the power hungry
apps processor can be kept in low power modes.

> For example in the Raspberry Pi 3 B, the CYW43438's PCM output is not
> connected (there is no I2S output), But it may be connected to any other
> device in other hardware.
>
> The purpose of this command is to tell the BT chip where to send the SCO
> packets. From the driver's perspective, it does not really matters where
> these pins are connected.

Bindings are for h/w description (and config to some extent).


>> >  Example:
>> >
>> > @@ -31,5 +37,6 @@ Example:
>> > bluetooth {
>> > compatible = "brcm,bcm43438-bt";
>> > max-speed = <921600>;
>> > +   sco-routing = <1>; /* 1 = transport (UART) */
>> > };
>> >  };
>> > diff --git a/drivers/bluetooth/hci_bcm.c b/drivers/bluetooth/hci_bcm.c
>> > index ddbd8c6a..0e729534 100644
>> > --- a/drivers/bluetooth/hci_bcm.c
>> > +++ b/drivers/bluetooth/hci_bcm.c
>> > @@ -83,6 +83,16 @@
>> >   * @hu: pointer to HCI UART controller struct,
>> >   * used to disable flow control during runtime suspend and system
>> > sleep
>> >   * @is_suspended: whether flow control is currently disabled
>> > + *
>> > + *  SCO routing parameters:
>> > + *   used as the parameters for the bcm_set_pcm_int_params command
>> > + * @sco_routing:
>> > + *  >= 255 (skip SCO routing configuration)
>> > + *  0-3 (PCM, Transport, Codec, I2S)
>> > + * @pcm_interface_rate: 0-4 (128 Kbps - 2048 Kbps)
>> > + * @pcm_frame_type: 0 (short), 1 (long)
>> > + * @pcm_sync_mode: 0 (slave), 1 (master)
>> > + * @pcm_clock_mode: 0 (slave), 1 (master)
>> >   */
>> >  struct bcm_device {
>> > /* Must be the first 

Re: [PATCH v2 15/24] net: qualcomm: MODULE_DEVICE_TABLE(serdev)

2018-06-11 Thread Ricardo Ribalda Delgado
Hi Marcel,
On Mon, Jun 11, 2018 at 5:28 PM Marcel Holtmann  wrote:
>
> Hi Marcel,
>
> >> the commit message is misleading me. If I build something with ACPI or DT 
> >> support, then modinfo will show all modalias information for ACPI and DT 
> >> compatible strings. What else does udev/modprobe actually need? Is 
> >> something broken with the modalias export?
> >
> > The main purpose is to autoload drivers for devices that have been
> > created via sysfs or another module.
> >
> > Eg1: We have a serial port on a standard computer that has connected a
> > GPS module. Since it is something that is not in the ACPI nor the DT
> > table the user will run
> >
> > echo serdev_gps > /sys/bus/serial/devices/serial0/new_device
> >
> > Eg2 module: 
> > https://github.com/ribalda/linux/blob/415bb3f0076c2b846ebe5409589b8e1e3004f55a/drivers/tty/serdev/test_platform.c
> >
> > Modprobe does not know what module to load for that device unless
> > there is a matching MODULE_DEVICE_TABLE
> > Today, we have the same functionality for i2c devices
> > https://www.kernel.org/doc/Documentation/i2c/instantiating-devices
>
> but why does this have to be the driver name? I would rather say, create a 
> generic string that describes the hardware and then use that. I think you 
> also want the special handling, as from USB for example, that lets you 
> reference an existing .compatible or ACPI ID as reference so that the 
> driver_data is copied.

We can choose any name, but if there are no special variants (like the
rave-sp driver) I would prefer using the module name. We can of course
use more names, like the part number of the the device, but it is very
convenient to also use the module name, and other subsystems also do
that.

If you want to add specific names for this device please let me know
and I will add them to the list. You know much more about this module
than myself.

Thanks!

>
> Regards
>
> Marcel
>


-- 
Ricardo Ribalda


Re: [PATCH v2 15/24] net: qualcomm: MODULE_DEVICE_TABLE(serdev)

2018-06-11 Thread Marcel Holtmann
Hi Marcel,

>> the commit message is misleading me. If I build something with ACPI or DT 
>> support, then modinfo will show all modalias information for ACPI and DT 
>> compatible strings. What else does udev/modprobe actually need? Is something 
>> broken with the modalias export?
> 
> The main purpose is to autoload drivers for devices that have been
> created via sysfs or another module.
> 
> Eg1: We have a serial port on a standard computer that has connected a
> GPS module. Since it is something that is not in the ACPI nor the DT
> table the user will run
> 
> echo serdev_gps > /sys/bus/serial/devices/serial0/new_device
> 
> Eg2 module: 
> https://github.com/ribalda/linux/blob/415bb3f0076c2b846ebe5409589b8e1e3004f55a/drivers/tty/serdev/test_platform.c
> 
> Modprobe does not know what module to load for that device unless
> there is a matching MODULE_DEVICE_TABLE
> Today, we have the same functionality for i2c devices
> https://www.kernel.org/doc/Documentation/i2c/instantiating-devices

but why does this have to be the driver name? I would rather say, create a 
generic string that describes the hardware and then use that. I think you also 
want the special handling, as from USB for example, that lets you reference an 
existing .compatible or ACPI ID as reference so that the driver_data is copied.

Regards

Marcel



Re: [PATCH net] failover: eliminate callback hell

2018-06-11 Thread Stephen Hemminger
On Fri, 8 Jun 2018 17:42:21 -0700
Siwei Liu  wrote:

> On Fri, Jun 8, 2018 at 5:02 PM, Stephen Hemminger
>  wrote:
> > On Fri, 8 Jun 2018 16:44:12 -0700
> > Siwei Liu  wrote:
> >  
> >> On Fri, Jun 8, 2018 at 4:18 PM, Stephen Hemminger
> >>  wrote:  
> >> > On Fri, 8 Jun 2018 15:25:59 -0700
> >> > Siwei Liu  wrote:
> >> >  
> >> >> On Wed, Jun 6, 2018 at 2:24 PM, Stephen Hemminger
> >> >>  wrote:  
> >> >> > On Wed, 6 Jun 2018 15:30:27 +0300
> >> >> > "Michael S. Tsirkin"  wrote:
> >> >> >  
> >> >> >> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:  
> >> >> >> > Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org 
> >> >> >> > wrote:  
> >> >> >> > >The net failover should be a simple library, not a virtual
> >> >> >> > >object with function callbacks (see callback hell).  
> >> >> >> >
> >> >> >> > Why just a library? It should do a common things. I think it 
> >> >> >> > should be a
> >> >> >> > virtual object. Looks like your patch again splits the common
> >> >> >> > functionality into multiple drivers. That is kind of backwards 
> >> >> >> > attitude.
> >> >> >> > I don't get it. We should rather focus on fixing the mess the
> >> >> >> > introduction of netvsc-bonding caused and switch netvsc to 3-netdev
> >> >> >> > model.  
> >> >> >>
> >> >> >> So it seems that at least one benefit for netvsc would be better
> >> >> >> handling of renames.
> >> >> >>
> >> >> >> Question is how can this change to 3-netdev happen?  Stephen is
> >> >> >> concerned about risk of breaking some userspace.
> >> >> >>
> >> >> >> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
> >> >> >> address, and you said then "why not use existing network namespaces
> >> >> >> rather than inventing a new abstraction". So how about it then? Do 
> >> >> >> you
> >> >> >> want to find a way to use namespaces to hide the PV device for netvsc
> >> >> >> compatibility?
> >> >> >>  
> >> >> >
> >> >> > Netvsc can't work with 3 dev model. MS has worked with enough 
> >> >> > distro's and
> >> >> > startups that all demand eth0 always be present. And VF may come and 
> >> >> > go.
> >> >> > After this history, there is a strong motivation not to change how 
> >> >> > kernel
> >> >> > behaves. Switching to 3 device model would be perceived as breaking
> >> >> > existing userspace.
> >> >> >
> >> >> > With virtio you can  work it out with the distro's yourself.
> >> >> > There is no pre-existing semantics to deal with.
> >> >> >
> >> >> > For the virtio, I don't see the need for IFF_HIDDEN.  
> >> >>
> >> >> I have a somewhat different view regarding IFF_HIDDEN. The purpose of
> >> >> that flag, as well as the 1-netdev model, is to have a means to
> >> >> inherit the interface name from the VF, and to eliminate playing hacks
> >> >> around renaming devices, customizing udev rules and et al. Why
> >> >> inheriting VF's name important? To allow existing config/setup around
> >> >> VF continues to work across kernel feature upgrade. Most of network
> >> >> config files in all distros are based on interface names. Few are MAC
> >> >> address based but making lower slaves hidden would cover the rest. And
> >> >> most importantly, preserving the same level of user experience as
> >> >> using raw VF interface once getting all ndo_ops and ethtool_ops
> >> >> exposed. This is essential to realize transparent live migration that
> >> >> users dont have to learn and be aware of the undertaken.  
> >> >
> >> > Inheriting the VF name will fail in the migration scenario.
> >> > It is perfectly reasonable to migrate a guest to another machine where
> >> > the VF PCI address is different. And since current udev/systemd model
> >> > is to base network device name off of PCI address, the device will change
> >> > name when guest is migrated.
> >> >  
> >> The scenario of having VF on a different PCI address on post migration
> >> is essentially equal to plugging in a new NIC. Why it has to pair with
> >> the original PV? A sepearte PV device should be in place to pair the
> >> new VF.  
> >
> > The host only guarantees that the PV device will be on the same network.
> > It does not make any PCI guarantees. The way Windows works is to find
> > the device based on "serial number" which is an Hyper-V specific attribute
> > of PCI devices.
> >
> > I considered naming off of serial number but that won't work for the
> > case where PV device is present first and VF arrives later. The serial
> > number is attribute of VF, not the PV which is there first.  
> 
> I assume the PV can get that information ahead of time before VF
> arrives? Without it how do you match the device when you see a VF
> coming with some serial number? Is it possible for PV to get the
> matching SN even earlier during probe time? Or it has to depend on the
> presence of vPCI bridge to generate this SN?



NO. the PV device does not know ahead of time and there are scenario
where the serial and PCI info can change when it does arrive. These
are test 

Re: [PATCH net] failover: eliminate callback hell

2018-06-11 Thread Stephen Hemminger
On Fri, 8 Jun 2018 15:54:38 -0700
Siwei Liu  wrote:

> On Wed, Jun 6, 2018 at 2:54 PM, Samudrala, Sridhar
>  wrote:
> >
> >
> > On 6/6/2018 2:24 PM, Stephen Hemminger wrote:  
> >>
> >> On Wed, 6 Jun 2018 15:30:27 +0300
> >> "Michael S. Tsirkin"  wrote:
> >>  
> >>> On Wed, Jun 06, 2018 at 09:25:12AM +0200, Jiri Pirko wrote:  
> 
>  Tue, Jun 05, 2018 at 05:42:31AM CEST, step...@networkplumber.org wrote:  
> >
> > The net failover should be a simple library, not a virtual
> > object with function callbacks (see callback hell).  
> 
>  Why just a library? It should do a common things. I think it should be a
>  virtual object. Looks like your patch again splits the common
>  functionality into multiple drivers. That is kind of backwards attitude.
>  I don't get it. We should rather focus on fixing the mess the
>  introduction of netvsc-bonding caused and switch netvsc to 3-netdev
>  model.  
> >>>
> >>> So it seems that at least one benefit for netvsc would be better
> >>> handling of renames.
> >>>
> >>> Question is how can this change to 3-netdev happen?  Stephen is
> >>> concerned about risk of breaking some userspace.
> >>>
> >>> Stephen, this seems to be the usecase that IFF_HIDDEN was trying to
> >>> address, and you said then "why not use existing network namespaces
> >>> rather than inventing a new abstraction". So how about it then? Do you
> >>> want to find a way to use namespaces to hide the PV device for netvsc
> >>> compatibility?
> >>>  
> >> Netvsc can't work with 3 dev model. MS has worked with enough distro's and
> >> startups that all demand eth0 always be present. And VF may come and go.
> >> After this history, there is a strong motivation not to change how kernel
> >> behaves. Switching to 3 device model would be perceived as breaking
> >> existing userspace.  
> >
> >
> > I think it should be possible for netvsc to work with 3 dev model if the
> > only
> > requirement is that eth0 will always be present. With net_failover, you will
> > see eth0 and eth0nsby OR with older distros eth0 and eth1.  It may be an
> > issue
> > if somehow there is userspace requirement that there can be only 2 netdevs,
> > not 3
> > when VF is plugged.
> >
> > eth0 will be the net_failover device and eth0nsby/eth1 will be the netvsc
> > device
> > and the IP address gets configured on eth0. Will this be an issue?
> >  
> Did you realize that the eth0 name in the current 3-netdev code can't
> be consistently persisted across reboot, if you have more than one VFs
> to pair with? On one boot it got eth0/eth0nsby, on the next it may get
> eth1/eth1nsby on the same interface.
> 
> It won't be useable by default until you add some custom udev rules.
> 

I think there is no reason to break things by moving netvsc to 3 device
model. 

The first device probed is always the same on Hyper-V/Azure, and always
comes up as eth0. The order comes from the fact that they are reported
to guest in that order and currently vmbus probe is single threaded.


Re: [PATCH v2 15/24] net: qualcomm: MODULE_DEVICE_TABLE(serdev)

2018-06-11 Thread Ricardo Ribalda Delgado
Hi Marcel,
On Mon, Jun 11, 2018 at 3:01 PM Marcel Holtmann  wrote:
>
>
> the commit message is misleading me. If I build something with ACPI or DT 
> support, then modinfo will show all modalias information for ACPI and DT 
> compatible strings. What else does udev/modprobe actually need? Is something 
> broken with the modalias export?

The main purpose is to autoload drivers for devices that have been
created via sysfs or another module.

Eg1: We have a serial port on a standard computer that has connected a
GPS module. Since it is something that is not in the ACPI nor the DT
table the user will run

echo serdev_gps > /sys/bus/serial/devices/serial0/new_device

Eg2 module: 
https://github.com/ribalda/linux/blob/415bb3f0076c2b846ebe5409589b8e1e3004f55a/drivers/tty/serdev/test_platform.c

Modprobe does not know what module to load for that device unless
there is a matching MODULE_DEVICE_TABLE
Today, we have the same functionality for i2c devices
https://www.kernel.org/doc/Documentation/i2c/instantiating-devices

I guess the commit message is really bad :), sorry about that. Any
suggestions to improve it?

Thanks!

>
> Regards
>
> Marcel
>


-- 
Ricardo Ribalda


Re: [iproute2-next v2 1/2] tipc: JSON support for showing nametable

2018-06-11 Thread Stephen Hemminger
On Mon, 11 Jun 2018 09:16:59 +0700
Hoang Le  wrote:

> @@ -33,6 +35,8 @@ static void about(struct cmdl *cmdl)
>   "\n"
>   "Options:\n"
>   " -h, --help \t\tPrint help for last given command\n"
> + " -j, --json \t\tJson format printouts\n"
> + " -p, --pretty \t\tpretty print\n"
>   "\n"

You should also update manual page:  man/man8/tipc-nametable.8
Just cut/paste text from ip or tc pages.


Re: Qualcomm rmnet driver and qmi_wwan

2018-06-11 Thread Daniele Palmas
Hi Subash,

2018-06-09 19:55 GMT+02:00 Subash Abhinov Kasiviswanathan
:
>> thanks, I will test it on Monday.
>>
>> Just a question for my knowledge: is the new sysfs attribute really
>> needed? I mean, is there not any other way to understand from qmi_wwan
>> without user intervention that there is the rmnet device attached?
>>
>> Regards,
>> Daniele
>>
>
> Hi Daniele
>
> You can check for the rx_handler attached to qmi_wwan dev and see if it
> belongs to rmnet. You can use the attached patch for it but it think the
> sysfs way might be a bit cleaner.
>

both patches work properly for me.

Maybe it could be helpful in the first patch to add a print when
pass-through setting fails if raw-ip has not been set, just to let the
user know what is happening.

Thanks,
Daniele

>
> --
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
> a Linux Foundation Collaborative Project


[PATCH 4/6] arcnet: com20020: bindings for smsc com20020

2018-06-11 Thread Andrea Greco
From: Andrea Greco 

Add devicetree bindings for smsc com20020

Signed-off-by: Andrea Greco 
---
 .../devicetree/bindings/net/smsc-com20020.txt   | 21 +
 1 file changed, 21 insertions(+)
 create mode 100644 Documentation/devicetree/bindings/net/smsc-com20020.txt

diff --git a/Documentation/devicetree/bindings/net/smsc-com20020.txt 
b/Documentation/devicetree/bindings/net/smsc-com20020.txt
new file mode 100644
index ..660a4a751f29
--- /dev/null
+++ b/Documentation/devicetree/bindings/net/smsc-com20020.txt
@@ -0,0 +1,21 @@
+SMSC com20020 Arcnet network controller
+
+Required property:
+- timeout-ns: Arcnet bus timeout, Idle Time (328000 - 20500)
+- bus-speed-bps: Arcnet bus speed (1000 - 156250)
+- smsc,xtal-mhz: External oscillator frequency
+- smsc,backplane-enabled: Controller use backplane mode
+- reset-gpios: Chip reset pin
+- interrupts: Should contain controller interrupt
+
+arcnet@2800 {
+   compatible = "smsc,com20020";
+
+   timeout-ns = <20500>;
+   bus-speed-hz = <1000>;
+   smsc,xtal-mhz = <20>;
+   smsc,backplane-enabled;
+
+   reset-gpios = < 21 GPIO_ACTIVE_LOW>;
+   interrupts = < 10 GPIO_ACTIVE_LOW>;
+};
-- 
2.14.4



  1   2   >