date:20180221

Re: [PATCH net v2 1/2] Revert "tuntap: add missing xdp flush"

2018-02-21 Thread Sergei Shtylyov


On 2/22/2018 9:24 AM, Jason Wang wrote:


This reverts commit 762c330d670e3d4b795cf7a8d761866fdd1eef49. The
reason is we try to batch packets for devmap which causes calling
xdp_do_flush() under the process context. Simply disable premmption


   s/under/in/.
   Disabling preemption.


may not work since process may move among processors which lead
xdp_do_flush() to miss some flushes on some processors.

So simply revert the patch, a follow-up path will add the xdp flush
correctly.

Reported-by: Christoffer Dall 
Fixes: 762c330d670e ("tuntap: add missing xdp flush")
Signed-off-by: Jason Wang 

[...]

MBR, Sergei

Re: [PATCH net v2 2/2] tuntap: correctly add the missing xdp flush

2018-02-21 Thread Sergei Shtylyov


Hello!

On 2/22/2018 9:24 AM, Jason Wang wrote:


Commit 762c330d670e ("tuntap: add missing xdp flush") tries to fix the
devmap stall caused by missed xdp flush by counting the pending xdp
redirected packets and flush when it exceeds NAPI_POLL_WEIGHT or
MSG_MORE is clear. This may lead BUG() since xdp_do_flush() was


   Lead to BUG().


called under process context with preemption enabled. Simply disable


   s/under/in the/?


preemption may silent the warning but be not enough since process may


   Silence.


move between different CPUS during a batch which cause xdp_do_flush()
misses some CPU where the process run previously. Consider the several
fallouts, that commit was reverted. To fix the issue correctly, we can
simply calling xdp_do_flush() immediately after xdp_do_redirect(),


   Call.


a side effect is that this removes any possibility of batching which
could be addressed in the future.

Reported-by: Christoffer Dall 
Fixes: 762c330d670e ("tuntap: add missing xdp flush")
Signed-off-by: Jason Wang 

[...]

MBR, Sergei

[PATCH iproute2-next v1] rdma: Add batch command support

2018-02-21 Thread Leon Romanovsky

From: Leon Romanovsky 

Implement an option (-b) to execute rdma tool commands
from a supplied file. This follows the same model
used by the ip and devlink tools, expecting
every command to be on its own line.

These commands are expected to be free of any -*
global flags (e.g. -d, -j, etc.), which should be
passed on the outer command line instead.

Signed-off-by: Leon Romanovsky 
---

Changelog v0->v1:
  * Used ARRAY_SIZE instead of hardcoded value as an input to makeargs()

David,

This patch is based on iproute2.git because iproute2-next doesn't
have latest restrack code. The patch itself is completely independent
from that code and is supposed to go to -next, but it has conflicts
(manual page and help line).

Can you please merge iproute2 master into iproute2-next prior to
applying this patch?

Thanks
---
 man/man8/rdma.8 | 16 +
 rdma/rdma.c | 70 ++---
 2 files changed, 78 insertions(+), 8 deletions(-)

diff --git a/man/man8/rdma.8 b/man/man8/rdma.8
index 798b33d3..fba77693 100644
--- a/man/man8/rdma.8
+++ b/man/man8/rdma.8
@@ -11,6 +11,12 @@ rdma \- RDMA tool
 .BR help " }"
 .sp

+.ti -8
+.B rdma
+.RB "[ " -force " ] "
+.BI "-batch " filename
+.sp
+
 .ti -8
 .IR OBJECT " := { "
 .BR dev " | " link " }"
@@ -31,6 +37,16 @@ Print the version of the
 .B rdma
 tool and exit.

+.TP
+.BR "\-b", " \-batch " 
+Read commands from provided file or standard input and invoke them.
+First failure will cause termination of rdma.
+
+.TP
+.BR "\-force"
+Don't terminate rdma on errors in batch mode.
+If there were any errors during execution of the commands, the application return code will be non zero.
+
 .TP
 .BR "\-d" , " --details"
 Otuput detailed information.
diff --git a/rdma/rdma.c b/rdma/rdma.c
index 19608f41..ab2c9608 100644
--- a/rdma/rdma.c
+++ b/rdma/rdma.c
@@ -15,8 +15,9 @@
 static void help(char *name)
 {
pr_out("Usage: %s [ OPTIONS ] OBJECT { COMMAND | help }\n"
+  "   %s [ -f[orce] ] -b[atch] filename\n"
   "where  OBJECT := { dev | link | resource | help }\n"
-  "   OPTIONS := { -V[ersion] | -d[etails] | -j[son] | -p[retty]}\n", name);
+  "   OPTIONS := { -V[ersion] | -d[etails] | -j[son] | -p[retty]}\n", name, name);
 }

 static int cmd_help(struct rd *rd)
@@ -25,7 +26,7 @@ static int cmd_help(struct rd *rd)
return 0;
 }

-static int rd_cmd(struct rd *rd)
+static int rd_cmd(struct rd *rd, int argc, char **argv)
 {
const struct rd_cmd cmds[] = {
{ NULL, cmd_help },
@@ -36,17 +37,54 @@ static int rd_cmd(struct rd *rd)
{ 0 }
};

+   rd->argc = argc;
+   rd->argv = argv;
+
return rd_exec_cmd(rd, cmds, "object");
 }

-static int rd_init(struct rd *rd, int argc, char **argv, char *filename)
+static int rd_batch(struct rd *rd, const char *name, bool force)
+{
+   char *line = NULL;
+   size_t len = 0;
+   int ret = 0;
+
+   if (name && strcmp(name, "-") != 0) {
+   if (!freopen(name, "r", stdin)) {
+   pr_err("Cannot open file \"%s\" for reading: %s\n",
+  name, strerror(errno));
+   return errno;
+   }
+   }
+
+   cmdlineno = 0;
+   while (getcmdline(&line, &len, stdin) != -1) {
+   char *largv[512];
+   int largc;
+
+   largc = makeargs(line, largv, ARRAY_SIZE(largv));
+   if (!largc)
+   continue;   /* blank line */
+
+   ret = rd_cmd(rd, largc, largv);
+   if (ret) {
+   pr_err("Command failed %s:%d\n", name, cmdlineno);
+   if (!force)
+   break;
+   }
+   }
+
+   free(line);
+
+   return ret;
+}
+
+static int rd_init(struct rd *rd, char *filename)
 {
uint32_t seq;
int ret;

rd->filename = filename;
-   rd->argc = argc;
-   rd->argv = argv;
INIT_LIST_HEAD(&rd->dev_map_list);
INIT_LIST_HEAD(&rd->filter_list);

@@ -87,11 +125,15 @@ int main(int argc, char **argv)
{ "json",   no_argument,NULL, 'j' },
{ "pretty", no_argument,NULL, 'p' },
{ "details",no_argument,NULL, 'd' },
+   { "force",  no_argument,NULL, 'f' },
+   { "batch",  required_argument,  NULL, 'b' },
{ NULL, 0, NULL, 0 }
};
+   const char *batch_file = NULL;
bool pretty_output = false;
bool show_details = false;
bool json_output = false;
+   bool force = false;
char *filename;
struct rd rd;
int opt;
@@ -99,7 +141,7 @@ int main(int argc, char **argv)

filename = basename(argv[0]);

-   while ((opt =
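
Note on the rd_batch() loop above (the diff is cut off at this point in the archive): the control flow can be sketched in user space as below. This is only a minimal sketch under stated assumptions. split_args() is a hypothetical stand-in for iproute2's makeargs() (it does plain whitespace splitting only), plain getline() replaces getcmdline(), and run_batch() and demo_cmd() are illustrative names that do not appear in the patch. As in the patch, the return value is the status of the last command executed.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for makeargs(): split a line on whitespace, in place. */
static int split_args(char *line, char **argv, int max)
{
	char *save = NULL;
	char *tok;
	int argc = 0;

	for (tok = strtok_r(line, " \t\r\n", &save);
	     tok && argc < max;
	     tok = strtok_r(NULL, " \t\r\n", &save))
		argv[argc++] = tok;
	return argc;
}

typedef int (*cmd_fn)(int argc, char **argv);

/*
 * Run one command per input line. Blank lines are skipped. The loop stops
 * at the first failing command unless force is set. *executed receives the
 * number of commands that were actually run.
 */
static int run_batch(FILE *in, cmd_fn run, int force, int *executed)
{
	char *line = NULL;
	size_t len = 0;
	int ret = 0;

	assert(in && run && executed);
	*executed = 0;
	while (getline(&line, &len, in) != -1) {
		char *argv[64];
		int argc = split_args(line, argv, 64);

		if (!argc)
			continue;	/* blank line */
		ret = run(argc, argv);
		(*executed)++;
		if (ret && !force)
			break;
	}
	free(line);
	return ret;
}

/* Illustrative command handler: fails when the first word is "bad". */
static int demo_cmd(int argc, char **argv)
{
	(void)argc;
	return strcmp(argv[0], "bad") == 0 ? -1 : 0;
}
```

Feeding run_batch() a stream that contains a failing line in the middle shows the difference between the default mode (stop at the failure) and the force mode (keep going and report the last status).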

[PATCH bpf] bpf: fix rcu lockdep warning for lpm_trie map_free callback

2018-02-21 Thread Yonghong Song

Commit 9a3efb6b661f ("bpf: fix memory leak in lpm_trie map_free callback function")
fixed a memory leak and removed unnecessary locks in the map_free callback function.
Unfortunately, it introduced a lockdep warning. When lockdep checking is turned on,
running tools/testing/selftests/bpf/test_lpm_map produces:

  [   98.294321] =
  [   98.294807] WARNING: suspicious RCU usage
  [   98.295359] 4.16.0-rc2+ #193 Not tainted
  [   98.295907] -
  [   98.296486] /home/yhs/work/bpf/kernel/bpf/lpm_trie.c:572 suspicious rcu_dereference_check() usage!
  [   98.297657]
  [   98.297657] other info that might help us debug this:
  [   98.297657]
  [   98.298663]
  [   98.298663] rcu_scheduler_active = 2, debug_locks = 1
  [   98.299536] 2 locks held by kworker/2:1/54:
  [   98.300152]  #0:  ((wq_completion)"events"){+.+.}, at: [<196bc1f0>] process_one_work+0x157/0x5c0
  [   98.301381]  #1:  ((work_completion)(&map->work)){+.+.}, at: [<196bc1f0>] process_one_work+0x157/0x5c0

Since actual trie tree removal happens only after no other
accesses to the tree are possible, this patch simply converted all
rcu protected pointer access to normal access, which removed the
above warning.

Fixes: 9a3efb6b661f ("bpf: fix memory leak in lpm_trie map_free callback function")
Reported-by: Eric Dumazet 
Signed-off-by: Yonghong Song 
---
 kernel/bpf/lpm_trie.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index a75e02c..0c15813 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -552,7 +552,7 @@ static struct bpf_map *trie_alloc(union bpf_attr *attr)
 static void trie_free(struct bpf_map *map)
 {
struct lpm_trie *trie = container_of(map, struct lpm_trie, map);
-   struct lpm_trie_node __rcu **slot;
+   struct lpm_trie_node **slot;
struct lpm_trie_node *node;
 
/* Wait for outstanding programs to complete
@@ -569,23 +569,22 @@ static void trie_free(struct bpf_map *map)
slot = &trie->root;
 
for (;;) {
-   node = rcu_dereference_protected(*slot,
-   lockdep_is_held(&trie->lock));
+   node = *slot;
if (!node)
goto out;
 
-   if (rcu_access_pointer(node->child[0])) {
+   if (node->child[0]) {
slot = &node->child[0];
continue;
}
 
-   if (rcu_access_pointer(node->child[1])) {
+   if (node->child[1]) {
slot = &node->child[1];
continue;
}
 
kfree(node);
-   RCU_INIT_POINTER(*slot, NULL);
+   *slot = NULL;
break;
}
}
-- 
2.9.5
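
For readers following the trie_free() loop in the diff above, here is a minimal user-space sketch of the same idea: free a binary trie without recursion or an explicit stack by repeatedly walking from the root to a leaf, freeing it, and clearing the slot that pointed at it. Plain pointers are enough only because nothing else can reach the tree at that point, which is the argument the commit message makes. The names struct node, node_new() and trie_free_all() are illustrative assumptions, not kernel code, and the kernel version additionally waits for outstanding programs before it starts.

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative node type: a binary trie node with two child slots. */
struct node {
	struct node *child[2];
};

/* Allocate a node with the given children (either may be NULL). */
static struct node *node_new(struct node *left, struct node *right)
{
	struct node *n = calloc(1, sizeof(*n));

	assert(n);
	n->child[0] = left;
	n->child[1] = right;
	return n;
}

/*
 * Free every node reachable from *root and return how many were freed.
 * Each outer iteration restarts at the root, descends to a leaf, frees
 * it and clears the slot that referenced it.
 */
static size_t trie_free_all(struct node **root)
{
	size_t freed = 0;

	for (;;) {
		struct node **slot = root;

		for (;;) {
			struct node *node = *slot;

			if (!node)
				return freed;	/* only the root slot can be empty here */
			if (node->child[0]) {
				slot = &node->child[0];
				continue;
			}
			if (node->child[1]) {
				slot = &node->child[1];
				continue;
			}
			free(node);
			*slot = NULL;
			freed++;
			break;
		}
	}
}
```

The walk costs up to one root-to-leaf descent per node, which is acceptable for a teardown path and avoids any auxiliary allocation.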

[PATCH net v2 1/2] Revert "tuntap: add missing xdp flush"

2018-02-21 Thread Jason Wang

This reverts commit 762c330d670e3d4b795cf7a8d761866fdd1eef49. The
reason is we try to batch packets for devmap which causes calling
xdp_do_flush() under the process context. Simply disable premmption
may not work since process may move among processors which lead
xdp_do_flush() to miss some flushes on some processors.

So simply revert the patch, a follow-up path will add the xdp flush
correctly.

Reported-by: Christoffer Dall 
Fixes: 762c330d670e ("tuntap: add missing xdp flush")
Signed-off-by: Jason Wang 
---
 drivers/net/tun.c | 15 ---
 1 file changed, 15 deletions(-)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index b52258c..2823a4a 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -181,7 +181,6 @@ struct tun_file {
struct tun_struct *detached;
struct ptr_ring tx_ring;
struct xdp_rxq_info xdp_rxq;
-   int xdp_pending_pkts;
 };
 
 struct tun_flow_entry {
@@ -1662,7 +1661,6 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
case XDP_REDIRECT:
get_page(alloc_frag->page);
alloc_frag->offset += buflen;
-   ++tfile->xdp_pending_pkts;
err = xdp_do_redirect(tun->dev, &xdp, xdp_prog);
if (err)
goto err_redirect;
@@ -1984,11 +1982,6 @@ static ssize_t tun_chr_write_iter(struct kiocb *iocb, struct iov_iter *from)
result = tun_get_user(tun, tfile, NULL, from,
  file->f_flags & O_NONBLOCK, false);
 
-   if (tfile->xdp_pending_pkts) {
-   tfile->xdp_pending_pkts = 0;
-   xdp_do_flush_map();
-   }
-
tun_put(tun);
return result;
 }
@@ -2325,13 +2318,6 @@ static int tun_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
ret = tun_get_user(tun, tfile, m->msg_control, &m->msg_iter,
   m->msg_flags & MSG_DONTWAIT,
   m->msg_flags & MSG_MORE);
-
-   if (tfile->xdp_pending_pkts >= NAPI_POLL_WEIGHT ||
-   !(m->msg_flags & MSG_MORE)) {
-   tfile->xdp_pending_pkts = 0;
-   xdp_do_flush_map();
-   }
-
tun_put(tun);
return ret;
 }
@@ -3163,7 +3149,6 @@ static int tun_chr_open(struct inode *inode, struct file * file)
sock_set_flag(&tfile->sk, SOCK_ZEROCOPY);
 
memset(&tfile->tx_ring, 0, sizeof(tfile->tx_ring));
-   tfile->xdp_pending_pkts = 0;
 
return 0;
 }
-- 
2.7.4

[PATCH net v2 2/2] tuntap: correctly add the missing xdp flush

2018-02-21 Thread Jason Wang

Commit 762c330d670e ("tuntap: add missing xdp flush") tries to fix the
devmap stall caused by missed xdp flush by counting the pending xdp
redirected packets and flush when it exceeds NAPI_POLL_WEIGHT or
MSG_MORE is clear. This may lead BUG() since xdp_do_flush() was
called under process context with preemption enabled. Simply disable
preemption may silent the warning but be not enough since process may
move between different CPUS during a batch which cause xdp_do_flush()
misses some CPU where the process run previously. Consider the several
fallouts, that commit was reverted. To fix the issue correctly, we can
simply calling xdp_do_flush() immediately after xdp_do_redirect(),
a side effect is that this removes any possibility of batching which
could be addressed in the future.

Reported-by: Christoffer Dall 
Fixes: 762c330d670e ("tuntap: add missing xdp flush")
Signed-off-by: Jason Wang 
---
 drivers/net/tun.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 2823a4a..a363ea2 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1662,6 +1662,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
get_page(alloc_frag->page);
alloc_frag->offset += buflen;
err = xdp_do_redirect(tun->dev, &xdp, xdp_prog);
+   xdp_do_flush_map();
if (err)
goto err_redirect;
rcu_read_unlock();
-- 
2.7.4
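
The reasoning in the two tuntap commit messages (a deferred flush misses CPUs the task ran on earlier, a flush right after each redirect does not) can be illustrated with a toy user-space model. This is only a sketch and not kernel code. Each "CPU" owns a counter of redirected but unflushed packets, a flush clears only the counter of the CPU it runs on, and the array of CPU ids stands for where the task happened to be running for each packet. All names are illustrative, and the model simply assumes that the task cannot migrate between a redirect and the flush that immediately follows it.

```c
#include <assert.h>

#define MODEL_NR_CPUS 4

/* Toy model: per-CPU count of redirected packets that were not flushed yet. */
struct flush_model {
	int pending[MODEL_NR_CPUS];
};

static int model_unflushed(const struct flush_model *m)
{
	int cpu, total = 0;

	for (cpu = 0; cpu < MODEL_NR_CPUS; cpu++)
		total += m->pending[cpu];
	return total;
}

/*
 * Deferred scheme: redirect every packet, then flush once on the CPU
 * where the batch ended. Returns the number of packets left unflushed.
 */
static int run_deferred(const int *cpus, int n)
{
	struct flush_model m = { { 0 } };
	int i;

	for (i = 0; i < n; i++) {
		assert(cpus[i] >= 0 && cpus[i] < MODEL_NR_CPUS);
		m.pending[cpus[i]]++;
	}
	if (n > 0)
		m.pending[cpus[n - 1]] = 0;	/* flush only the current CPU */
	return model_unflushed(&m);
}

/* Immediate scheme: flush on the same CPU right after each redirect. */
static int run_immediate(const int *cpus, int n)
{
	struct flush_model m = { { 0 } };
	int i;

	for (i = 0; i < n; i++) {
		assert(cpus[i] >= 0 && cpus[i] < MODEL_NR_CPUS);
		m.pending[cpus[i]]++;
		m.pending[cpus[i]] = 0;
	}
	return model_unflushed(&m);
}
```

With a batch whose packets were handled on CPUs 0, 0 and 1, the deferred scheme leaves the two packets queued on CPU 0 behind, while the immediate scheme leaves none, at the cost of giving up batching.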

Re: [PATCH] netlink: put module reference if dump start fails

2018-02-21 Thread Bo YU


Hi,
On Wed, Feb 21, 2018 at 04:41:05PM +0100, Jason A. Donenfeld wrote:

Fixes: 41c87425a1ac ("netlink: do not set cb_running if dump's start() errs")

I think you had better resend it.

Bo,

Re: [RFC net PATCH] virtio_net: disable XDP_REDIRECT in receive_mergeable() case

2018-02-21 Thread Jason Wang




On 2018-02-21 00:52, John Fastabend wrote:

On 02/20/2018 03:17 AM, Jesper Dangaard Brouer wrote:

On Fri, 16 Feb 2018 09:19:02 -0800
John Fastabend  wrote:


On 02/16/2018 07:41 AM, Jesper Dangaard Brouer wrote:

On Fri, 16 Feb 2018 13:31:37 +0800
Jason Wang  wrote:
   

On 2018-02-16 06:43, Jesper Dangaard Brouer wrote:

The virtio_net code has three different RX code paths in receive_buf().
Two of these code paths can handle XDP, but one of them is broken for
at least XDP_REDIRECT.

Function(1): receive_big() does not support XDP.
Function(2): receive_small() supports XDP fully and uses build_skb().
Function(3): receive_mergeable() has broken XDP_REDIRECT and uses napi_alloc_skb().

The simple explanation is that receive_mergeable() is broken because
it uses napi_alloc_skb(), which violates XDP given XDP assumes packet
header+data in single page and enough tail room for skb_shared_info.

The longer explanation is that receive_mergeable() tries to
work around and satisfy these XDP requirements, e.g. by having a
function xdp_linearize_page() that allocates and memcpy's RX buffers
around (in case the packet is scattered across multiple RX buffers).  This
does currently satisfy XDP_PASS, XDP_DROP and XDP_TX (but only because
we have not implemented bpf_xdp_adjust_tail yet).

The XDP_REDIRECT action combined with cpumap is broken, and causes
hard-to-debug crashes.  The main issue is that the RX packet does not have
the needed tail-room (SKB_DATA_ALIGN(skb_shared_info)), causing
skb_shared_info to overlap the next packet's head-room (in which cpumap
stores info).

Reproducing depends on the packet payload length and on whether the RX-buffer
size happened to have tail-room for skb_shared_info or not.  But to make
this even harder to troubleshoot, the RX-buffer size is changed
dynamically at runtime based on an Exponentially Weighted Moving Average
(EWMA) over the packet length, when refilling RX rings.

This patch only disables XDP_REDIRECT support in the receive_mergeable()
case, because it can cause a real crash.

But IMHO we should NOT support XDP in receive_mergeable() at all,
because the principles behind XDP are to gain speed by (1) code
simplicity, (2) sacrificing memory and (3) where possible moving
runtime checks to setup time.  These principles are clearly being
violated in receive_mergeable(), which e.g. tracks the average
buffer size at runtime to reduce memory consumption.

I agree to disable it for -net now.

Okay... I'll send an official patch later.
   

For net-next, we probably can do:

- drop xdp_linearize_page() and do XDP through generic XDP helper
   after skb was built

I disagree strongly here - it makes no sense.

Why do you want to explicitly fall back to Generic-XDP?
(... then all the performance gain is gone!)
And besides, a couple of function calls later, the generic XDP code
will/can get invoked anyhow...
   

Hi, Can we get EWMA to ensure for majority of cases we have the extra
head room? Seems we could just over-estimate the size by N-bytes. In
some cases we may under-estimate and then would need to fall back to
generic-xdp or otherwise growing the buffer which of course would be
painful and slow, but presumably would happen rarely.

Hmmm... (first of all it is missing tail-room not head-room).
Second having all this extra size estimating code, and fallback options
leaves a very bad taste in my mouth... this sounds like a sure way to
kill performance.



I think it would be much better to keep this feature vs kill it and
make its configuration even more painful to get XDP working on virtio.

Based on your request, I'm going to fix as much as possible of the
XDP code path in the virtio_net driver... I now have 4 fix patches...


Thanks a lot!


There is no way around disabling XDP_REDIRECT in receive_mergeable(),
as XDP does not have a way to detect/know the "data_hard_end" of the
data "frame".


Disabling EWMA also seems reasonable to me.

To me, it seems more reasonable to have a separate RX function call
when an XDP program gets attached, and in that process change to the
memory model so it is compatible with XDP.


I would be OK with that but would be curious to see what Jason and
Michael think. When I original wrote the XDP for virtio support the
XDP infra was still primitive and we didn't have metadata, cpu maps,
etc.


Yes, that's why cpumap fails.


  yet. I suspect there might need to be some additional coordination
between guest and host though to switch the packet modes. If I recall
this was where some of the original trouble came from.


Maybe, but I think just having a separate refill function should be
sufficient? Then we could reuse the existing code to deal with e.g.
synchronization.




  

Take a step back:
  What is the reason/use-case for implementing XDP inside virtio_net?

 From a DDoS/performance perspective XDP in virtio_net happens on the
"wrong-side" as it is activated _inside_ the guest OS, which is too
late for a DDoS filter, as the guest kick/switch
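
The tail-room requirement discussed earlier in this message can be sketched as a simple arithmetic check: after the headroom and the packet bytes, the RX buffer must still have room for an aligned skb_shared_info. This is only a user-space sketch. MODEL_SHINFO_SIZE and MODEL_CACHE_LINE are made-up numbers (the real values depend on architecture and kernel configuration), and has_shinfo_tailroom() is an illustrative name, not a virtio_net or XDP function.

```c
#include <assert.h>
#include <stddef.h>

/* Round x up to a multiple of a, where a is a power of two. */
#define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((size_t)(a) - 1))

/* Assumed, illustrative sizes; not taken from any kernel header. */
#define MODEL_CACHE_LINE  64
#define MODEL_SHINFO_SIZE 320

/*
 * Return 1 if a buffer of buf_len bytes holding headroom + pkt_len bytes
 * still has enough tail-room for an aligned shared-info area, else 0.
 */
static int has_shinfo_tailroom(size_t buf_len, size_t headroom, size_t pkt_len)
{
	size_t used = headroom + pkt_len;
	size_t need = ALIGN_UP(MODEL_SHINFO_SIZE, MODEL_CACHE_LINE);

	assert(need >= MODEL_SHINFO_SIZE);
	if (used > buf_len)
		return 0;
	return buf_len - used >= need;
}
```

Under these assumed numbers, a 2048-byte buffer with 256 bytes of headroom and a 1400-byte packet passes the check, while a 1536-byte buffer sized closely around the same packet does not, which matches the description of why reproduction depends on payload length and on the current buffer size.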

[PATCH iproute2 net-next] ss: print skmeminfo for packet sockets

2018-02-21 Thread Roopa Prabhu

From: Roopa Prabhu 

before:
$ss --packet -p -m
p_raw0  0*:eth0
  users:(("lldpd",pid=2240,fd=11))

after:
$ss --packet -p -m
p_raw0  0*:eth0
  users:(("lldpd",pid=2240,fd=11))
  skmem:(r0,rb266240,t0,tb266240,f0,w0,o320,bl0,d0)

Signed-off-by: Roopa Prabhu 
---
 misc/ss.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/misc/ss.c b/misc/ss.c
index 29a2507..49f9c49 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -3920,6 +3920,9 @@ static int packet_show_sock(const struct sockaddr_nl *addr,
fil++;
}
}
+
+   if (show_mem)
+   print_skmeminfo(tb, PACKET_DIAG_MEMINFO);
return 0;
 }
 
-- 
2.1.4
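
To help read the skmem:(...) line in the "after" output above, here is a small sketch that formats an array of memory counters the same way. It is a stand-alone illustration, not the ss source. The enum is a local mirror of the SK_MEMINFO_* index order as I understand it from the sock_diag UAPI header, format_skmem() is a hypothetical helper, and unlike this sketch the real tool may omit some fields depending on what the kernel reports.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Local, assumed mirror of the SK_MEMINFO_* index order. */
enum {
	M_RMEM_ALLOC,	/* r:  receive memory allocated */
	M_RCVBUF,	/* rb: receive buffer size */
	M_WMEM_ALLOC,	/* t:  transmit memory allocated */
	M_SNDBUF,	/* tb: send buffer size */
	M_FWD_ALLOC,	/* f:  forward-allocated memory */
	M_WMEM_QUEUED,	/* w:  memory queued for transmit */
	M_OPTMEM,	/* o:  option memory */
	M_BACKLOG,	/* bl: backlog memory */
	M_DROPS,	/* d:  dropped packets */
	M_VARS
};

/* Format the counters as one skmem:(...) string; returns snprintf's result. */
static int format_skmem(char *out, size_t size, const uint32_t *m)
{
	assert(out && m);
	return snprintf(out, size,
			"skmem:(r%u,rb%u,t%u,tb%u,f%u,w%u,o%u,bl%u,d%u)",
			(unsigned)m[M_RMEM_ALLOC], (unsigned)m[M_RCVBUF],
			(unsigned)m[M_WMEM_ALLOC], (unsigned)m[M_SNDBUF],
			(unsigned)m[M_FWD_ALLOC], (unsigned)m[M_WMEM_QUEUED],
			(unsigned)m[M_OPTMEM], (unsigned)m[M_BACKLOG],
			(unsigned)m[M_DROPS]);
}
```

Filling the array with the values from the example output (rb and tb of 266240, o of 320, everything else 0) reproduces the string shown in the "after" sample.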

Re: [PATCH bpf] bpf, x64: implement retpoline for tail call

2018-02-21 Thread Alexei Starovoitov

On Wed, Feb 21, 2018 at 07:53:22PM -0800, Eric Dumazet wrote:
> > So what kinda comment there would make sense?
> 
> I was thinking of something very explicit :
> 
> /* byte sequence for following assembly code used by eBPF
>call ...
>...
>retq
> */
> #define RETPOLINE_RAX_DIRECT_FOR_EBPF \
>    EMIT1_off32(0xE8, 7);/* callq  */   \
>    /* capture_spec: */\
>    EMIT2(0xF3, 0x90);   /* pause */   \
>    EMIT3(0x0F, 0xAE, 0xE8); /* lfence */  \
>    EMIT2(0xEB, 0xF9);   /* jmp  */  \
>    /* set_up_target: */   \
>    EMIT4(0x48, 0x89, 0x04, 0x24); /* mov %rax,(%rsp) */   \
>    EMIT1(0xC3); /* retq */\

got it. yeah. makes sense to me.

Re: [PATCH bpf v2] bpf: fix memory leak in lpm_trie map_free callback function

2018-02-21 Thread Yonghong Song




On 2/21/18 7:40 PM, Eric Dumazet wrote:

On Tue, 2018-02-13 at 19:17 -0800, Alexei Starovoitov wrote:

On Tue, Feb 13, 2018 at 07:00:21PM -0800, Yonghong Song wrote:

There is a memory leak happening in lpm_trie map_free callback
function trie_free. The trie structure itself does not get freed.

Also, trie_free function did not do synchronize_rcu before freeing
various data structures. This is incorrect as some rcu_read_lock
region(s) for lookup, update, delete or get_next_key may not complete yet.
The fix is to add synchronize_rcu in the beginning of trie_free.
The useless spin_lock is removed from this function as well.

Fixes: b95a5c4db09b ("bpf: add a longest prefix match trie map implementation")
Reported-by: Mathieu Malaterre 
Reported-by: Alexei Starovoitov 
Tested-by: Mathieu Malaterre 
Signed-off-by: Yonghong Song 
---
  kernel/bpf/lpm_trie.c | 11 +++
  1 file changed, 7 insertions(+), 4 deletions(-)

v1->v2:
   Make comments more precise and make label name more appropriate,
   as suggested by Daniel


Applied to bpf tree, Thanks Yonghong.



This does not look good.

LOCKDEP surely should complain to

node = rcu_dereference_protected(*slot, lockdep_is_held(&trie->lock));

Since we no longer hold trie->lock


Eric,

Thanks for spotting this issue. Will fix this issue soon.

Yonghong

Re: [PATCH bpf] bpf, x64: implement retpoline for tail call

2018-02-21 Thread Eric Dumazet

On Wed, 2018-02-21 at 19:43 -0800, Alexei Starovoitov wrote:
> On Wed, Feb 21, 2018 at 07:04:02PM -0800, Eric Dumazet wrote:
> > On Thu, 2018-02-22 at 01:05 +0100, Daniel Borkmann wrote:
> > 
> > ...
> > 
> > > +/* Instead of plain jmp %rax, we emit a retpoline to control
> > > + * speculative execution for the indirect branch.
> > > + */
> > > +static void emit_retpoline_rax_trampoline(u8 **pprog)
> > > +{
> > > + u8 *prog = *pprog;
> > > + int cnt = 0;
> > > +
> > > + EMIT1_off32(0xE8, 7);/* callq  */
> > > + /* capture_spec: */
> > > + EMIT2(0xF3, 0x90);   /* pause */
> > > + EMIT3(0x0F, 0xAE, 0xE8); /* lfence */
> > > + EMIT2(0xEB, 0xF9);   /* jmp  */
> > > + /* set_up_target: */
> > > + EMIT4(0x48, 0x89, 0x04, 0x24); /* mov %rax,(%rsp) */
> > > + EMIT1(0xC3); /* retq */
> > > +
> > > + BUILD_BUG_ON(cnt != RETPOLINE_SIZE);
> > > + *pprog = prog;
> > 
> > You might define the actual code sequence (and length) in 
> > arch/x86/include/asm/nospec-branch.h
> > 
> > If we need to adjust code sequences for RETPOLINE, then we wont
> > forget/miss that arch/x86/net/bpf_jit_comp.c had it hard-coded.
> 
> like adding a comment to asm/nospec-branch.h that says
> "dont forget to adjust bpf_jit_comp.c" ?
> but clang/gcc generate slightly different sequences for
> retpoline anyway, so even if '.macro RETPOLINE_JMP' in
> nospec-branch.h changes it doesn't mean that x64 jit has to change.
> So what kinda comment there would make sense?

I was thinking of something very explicit :

/* byte sequence for following assembly code used by eBPF
   call ...
   ...
   retq
*/
#define RETPOLINE_RAX_DIRECT_FOR_EBPF \
   EMIT1_off32(0xE8, 7);/* callq  */   \
   /* capture_spec: */\
   EMIT2(0xF3, 0x90);   /* pause */   \
   EMIT3(0x0F, 0xAE, 0xE8); /* lfence */  \
   EMIT2(0xEB, 0xF9);   /* jmp  */  \
   /* set_up_target: */   \
   EMIT4(0x48, 0x89, 0x04, 0x24); /* mov %rax,(%rsp) */   \
   EMIT1(0xC3); /* retq */\

Might be simply byte encoded, (array of 17 bytes)

Well, something like that anyway...
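
A byte-encoded variant along the lines suggested above could look like the following sketch. The bytes are exactly the ones emitted by the EMIT*() calls quoted in this thread, with the 32-bit call displacement written out in little-endian order. The names RETPOLINE_RAX_LEN, retpoline_rax_bytes and emit_retpoline_rax() are illustrative assumptions and are not taken from nospec-branch.h or bpf_jit_comp.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define RETPOLINE_RAX_LEN 17

/* Retpoline sequence for an indirect jump through %rax, as raw bytes. */
static const uint8_t retpoline_rax_bytes[RETPOLINE_RAX_LEN] = {
	0xE8, 0x07, 0x00, 0x00, 0x00,	/* callq set_up_target (rel32 = 7) */
	0xF3, 0x90,			/* capture_spec: pause */
	0x0F, 0xAE, 0xE8,		/* lfence */
	0xEB, 0xF9,			/* jmp capture_spec (rel8 = -7) */
	0x48, 0x89, 0x04, 0x24,		/* set_up_target: mov %rax,(%rsp) */
	0xC3,				/* retq */
};

/*
 * Copy the sequence into buf. Returns the number of bytes written, or 0
 * if fewer than RETPOLINE_RAX_LEN bytes of room are available.
 */
static size_t emit_retpoline_rax(uint8_t *buf, size_t room)
{
	assert(buf);
	if (room < sizeof(retpoline_rax_bytes))
		return 0;
	memcpy(buf, retpoline_rax_bytes, sizeof(retpoline_rax_bytes));
	return sizeof(retpoline_rax_bytes);
}
```

The displacements can be checked by hand: the call ends at offset 5 and skips 7 bytes to the mov at offset 12, and the jmp ends at offset 12 and jumps back 7 bytes to the pause at offset 5.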

[PATCH net-next 3/3] nfp: advertise firmware for mixed 10G/25G mode

2018-02-21 Thread Jakub Kicinski

From: Dirk van der Merwe 

The AMDA0099-0001 platform can support the 1x10G + 1x25G mixed mode
operation. Recently, firmware has been added for this configuration
mode.

Signed-off-by: Dirk van der Merwe 
Acked-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/nfp_main.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_main.c b/drivers/net/ethernet/netronome/nfp/nfp_main.c
index ab301d56430b..c4b1f344b4da 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_main.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_main.c
@@ -645,6 +645,7 @@ MODULE_FIRMWARE("netronome/nic_AMDA0097-0001_4x10_1x40.nffw");
 MODULE_FIRMWARE("netronome/nic_AMDA0097-0001_8x10.nffw");
 MODULE_FIRMWARE("netronome/nic_AMDA0099-0001_2x10.nffw");
 MODULE_FIRMWARE("netronome/nic_AMDA0099-0001_2x25.nffw");
+MODULE_FIRMWARE("netronome/nic_AMDA0099-0001_1x10_1x25.nffw");
 
 MODULE_AUTHOR("Netronome Systems ");
 MODULE_LICENSE("GPL");
-- 
2.15.1

[PATCH net-next] ibmvnic: Split counters for scrq/pools/napi

2018-02-21 Thread Nathan Fontenot

Tracking the number of active sub-crqs, pools, and napi with a single
counter has problems handling some failover scenarios. This is due to
the sub-crqs, pools, and napi being initialized in different places,
and to where the active counts are updated.

This patch simplifies this by keeping a separate counter for each of
the tx and rx sub-crqs, pools, and napi.

Signed-off-by: Nathan Fontenot 
---
 drivers/net/ethernet/ibm/ibmvnic.c |   38 
 drivers/net/ethernet/ibm/ibmvnic.h |7 +--
 2 files changed, 22 insertions(+), 23 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index 1703b881252f..8ca88f7cc661 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -461,7 +461,7 @@ static void release_rx_pools(struct ibmvnic_adapter *adapter)
if (!adapter->rx_pool)
return;
 
-   for (i = 0; i < adapter->num_active_rx_scrqs; i++) {
+   for (i = 0; i < adapter->num_active_rx_pools; i++) {
rx_pool = &adapter->rx_pool[i];
 
netdev_dbg(adapter->netdev, "Releasing rx_pool[%d]\n", i);
@@ -484,6 +484,7 @@ static void release_rx_pools(struct ibmvnic_adapter *adapter)
 
kfree(adapter->rx_pool);
adapter->rx_pool = NULL;
+   adapter->num_active_rx_pools = 0;
 }
 
 static int init_rx_pools(struct net_device *netdev)
@@ -508,6 +509,8 @@ static int init_rx_pools(struct net_device *netdev)
return -1;
}
 
+   adapter->num_active_rx_pools = rxadd_subcrqs;
+
for (i = 0; i < rxadd_subcrqs; i++) {
rx_pool = &adapter->rx_pool[i];
 
@@ -608,7 +611,7 @@ static void release_tx_pools(struct ibmvnic_adapter *adapter)
if (!adapter->tx_pool)
return;
 
-   for (i = 0; i < adapter->num_active_tx_scrqs; i++) {
+   for (i = 0; i < adapter->num_active_tx_pools; i++) {
netdev_dbg(adapter->netdev, "Releasing tx_pool[%d]\n", i);
tx_pool = &adapter->tx_pool[i];
kfree(tx_pool->tx_buff);
@@ -619,6 +622,7 @@ static void release_tx_pools(struct ibmvnic_adapter *adapter)
 
kfree(adapter->tx_pool);
adapter->tx_pool = NULL;
+   adapter->num_active_tx_pools = 0;
 }
 
 static int init_tx_pools(struct net_device *netdev)
@@ -635,6 +639,8 @@ static int init_tx_pools(struct net_device *netdev)
if (!adapter->tx_pool)
return -1;
 
+   adapter->num_active_tx_pools = tx_subcrqs;
+
for (i = 0; i < tx_subcrqs; i++) {
tx_pool = &adapter->tx_pool[i];
 
@@ -745,6 +751,7 @@ static int init_napi(struct ibmvnic_adapter *adapter)
   ibmvnic_poll, NAPI_POLL_WEIGHT);
}
 
+   adapter->num_active_rx_napi = adapter->req_rx_queues;
return 0;
 }
 
@@ -755,7 +762,7 @@ static void release_napi(struct ibmvnic_adapter *adapter)
if (!adapter->napi)
return;
 
-   for (i = 0; i < adapter->num_active_rx_scrqs; i++) {
+   for (i = 0; i < adapter->num_active_rx_napi; i++) {
if (&adapter->napi[i]) {
netdev_dbg(adapter->netdev,
   "Releasing napi[%d]\n", i);
@@ -765,6 +772,7 @@ static void release_napi(struct ibmvnic_adapter *adapter)
 
kfree(adapter->napi);
adapter->napi = NULL;
+   adapter->num_active_rx_napi = 0;
 }
 
 static int ibmvnic_login(struct net_device *netdev)
@@ -998,10 +1006,6 @@ static int init_resources(struct ibmvnic_adapter *adapter)
return rc;
 
rc = init_tx_pools(netdev);
-
-   adapter->num_active_tx_scrqs = adapter->req_tx_queues;
-   adapter->num_active_rx_scrqs = adapter->req_rx_queues;
-
return rc;
 }
 
@@ -1706,9 +1710,6 @@ static int do_reset(struct ibmvnic_adapter *adapter,
 
release_napi(adapter);
init_napi(adapter);
-
-   adapter->num_active_tx_scrqs = adapter->req_tx_queues;
-   adapter->num_active_rx_scrqs = adapter->req_rx_queues;
} else {
rc = reset_tx_pools(adapter);
if (rc)
@@ -2398,19 +2399,10 @@ static struct ibmvnic_sub_crq_queue *init_sub_crq_queue(struct ibmvnic_adapter
 
 static void release_sub_crqs(struct ibmvnic_adapter *adapter, bool do_h_free)
 {
-   u64 num_tx_scrqs, num_rx_scrqs;
int i;
 
-   if (adapter->state == VNIC_PROBED) {
-   num_tx_scrqs = adapter->req_tx_queues;
-   num_rx_scrqs = adapter->req_rx_queues;
-   } else {
-   num_tx_scrqs = adapter->num_active_tx_scrqs;
-   num_rx_scrqs = adapter->num_active_rx_scrqs;
-   }
-
if (adapter->tx_scrq) {
-   for (i = 0; i < num_tx_scrqs; i++) {
+   for (i = 0; i < adapter->num_active_tx_scrqs; i++) {

[PATCH net-next 2/3] aquantia: add Makefiles to all directories

2018-02-21 Thread Jakub Kicinski

To be able to build separate objects we need to provide
Kbuild with a Makefile in each directory.

Signed-off-by: Jakub Kicinski 
---
CC: Igor Russkikh 

 drivers/net/ethernet/aquantia/atlantic/hw_atl/Makefile | 2 ++
 1 file changed, 2 insertions(+)
 create mode 100644 drivers/net/ethernet/aquantia/atlantic/hw_atl/Makefile

diff --git a/drivers/net/ethernet/aquantia/atlantic/hw_atl/Makefile b/drivers/net/ethernet/aquantia/atlantic/hw_atl/Makefile
new file mode 100644
index ..805fa28f391a
--- /dev/null
+++ b/drivers/net/ethernet/aquantia/atlantic/hw_atl/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0
+# kbuild requires Makefile in a directory to build individual objects
-- 
2.15.1

[PATCH net-next 0/3] nfp: build and FW initramfs updates

2018-02-21 Thread Jakub Kicinski

Hi!

This set adds empty Makefiles to allow building single object files
(useful for build-testing); Kbuild does not cater to this use case
too well.  Two ethernet drivers currently suffer from this
(nfp, aquantia), and both are fixed.

Dirk adds an uncommon FW image name to the list of firmware files
the module may request.

Dirk van der Merwe (1):
  nfp: advertise firmware for mixed 10G/25G mode

Jakub Kicinski (2):
  nfp: add Makefiles to all directories
  aquantia: add Makefiles to all directories

 drivers/net/ethernet/aquantia/atlantic/hw_atl/Makefile  | 2 ++
 drivers/net/ethernet/netronome/nfp/bpf/Makefile | 2 ++
 drivers/net/ethernet/netronome/nfp/flower/Makefile  | 2 ++
 drivers/net/ethernet/netronome/nfp/nfp_main.c   | 1 +
 drivers/net/ethernet/netronome/nfp/nfpcore/Makefile | 2 ++
 drivers/net/ethernet/netronome/nfp/nfpcore/nfp6000/Makefile | 2 ++
 drivers/net/ethernet/netronome/nfp/nic/Makefile | 2 ++
 7 files changed, 13 insertions(+)
 create mode 100644 drivers/net/ethernet/aquantia/atlantic/hw_atl/Makefile
 create mode 100644 drivers/net/ethernet/netronome/nfp/bpf/Makefile
 create mode 100644 drivers/net/ethernet/netronome/nfp/flower/Makefile
 create mode 100644 drivers/net/ethernet/netronome/nfp/nfpcore/Makefile
 create mode 100644 drivers/net/ethernet/netronome/nfp/nfpcore/nfp6000/Makefile
 create mode 100644 drivers/net/ethernet/netronome/nfp/nic/Makefile

-- 
2.15.1

[PATCH net-next 1/3] nfp: add Makefiles to all directories

2018-02-21 Thread Jakub Kicinski

To be able to build separate objects we need to provide
Kbuild with a Makefile in each directory.

Signed-off-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/bpf/Makefile | 2 ++
 drivers/net/ethernet/netronome/nfp/flower/Makefile  | 2 ++
 drivers/net/ethernet/netronome/nfp/nfpcore/Makefile | 2 ++
 drivers/net/ethernet/netronome/nfp/nfpcore/nfp6000/Makefile | 2 ++
 drivers/net/ethernet/netronome/nfp/nic/Makefile | 2 ++
 5 files changed, 10 insertions(+)
 create mode 100644 drivers/net/ethernet/netronome/nfp/bpf/Makefile
 create mode 100644 drivers/net/ethernet/netronome/nfp/flower/Makefile
 create mode 100644 drivers/net/ethernet/netronome/nfp/nfpcore/Makefile
 create mode 100644 drivers/net/ethernet/netronome/nfp/nfpcore/nfp6000/Makefile
 create mode 100644 drivers/net/ethernet/netronome/nfp/nic/Makefile

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/Makefile b/drivers/net/ethernet/netronome/nfp/bpf/Makefile
new file mode 100644
index ..805fa28f391a
--- /dev/null
+++ b/drivers/net/ethernet/netronome/nfp/bpf/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0
+# kbuild requires Makefile in a directory to build individual objects
diff --git a/drivers/net/ethernet/netronome/nfp/flower/Makefile 
b/drivers/net/ethernet/netronome/nfp/flower/Makefile
new file mode 100644
index ..805fa28f391a
--- /dev/null
+++ b/drivers/net/ethernet/netronome/nfp/flower/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0
+# kbuild requires Makefile in a directory to build individual objects
diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/Makefile 
b/drivers/net/ethernet/netronome/nfp/nfpcore/Makefile
new file mode 100644
index ..805fa28f391a
--- /dev/null
+++ b/drivers/net/ethernet/netronome/nfp/nfpcore/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0
+# kbuild requires Makefile in a directory to build individual objects
diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp6000/Makefile 
b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp6000/Makefile
new file mode 100644
index ..805fa28f391a
--- /dev/null
+++ b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp6000/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0
+# kbuild requires Makefile in a directory to build individual objects
diff --git a/drivers/net/ethernet/netronome/nfp/nic/Makefile 
b/drivers/net/ethernet/netronome/nfp/nic/Makefile
new file mode 100644
index ..805fa28f391a
--- /dev/null
+++ b/drivers/net/ethernet/netronome/nfp/nic/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0
+# kbuild requires Makefile in a directory to build individual objects
-- 
2.15.1

Re: [PATCH bpf] bpf, x64: implement retpoline for tail call

2018-02-21 Thread Alexei Starovoitov

On Wed, Feb 21, 2018 at 07:04:02PM -0800, Eric Dumazet wrote:
> On Thu, 2018-02-22 at 01:05 +0100, Daniel Borkmann wrote:
> 
> ...
> 
> > +/* Instead of plain jmp %rax, we emit a retpoline to control
> > + * speculative execution for the indirect branch.
> > + */
> > +static void emit_retpoline_rax_trampoline(u8 **pprog)
> > +{
> > +   u8 *prog = *pprog;
> > +   int cnt = 0;
> > +
> > +   EMIT1_off32(0xE8, 7);/* callq  */
> > +   /* capture_spec: */
> > +   EMIT2(0xF3, 0x90);   /* pause */
> > +   EMIT3(0x0F, 0xAE, 0xE8); /* lfence */
> > +   EMIT2(0xEB, 0xF9);   /* jmp  */
> > +   /* set_up_target: */
> > +   EMIT4(0x48, 0x89, 0x04, 0x24); /* mov %rax,(%rsp) */
> > +   EMIT1(0xC3); /* retq */
> > +
> > +   BUILD_BUG_ON(cnt != RETPOLINE_SIZE);
> > +   *pprog = prog;
> 
> You might define the actual code sequence (and length) in 
> arch/x86/include/asm/nospec-branch.h
> 
> If we need to adjust code sequences for RETPOLINE, then we won't
> forget/miss that arch/x86/net/bpf_jit_comp.c had it hard-coded.

like adding a comment to asm/nospec-branch.h that says
"dont forget to adjust bpf_jit_comp.c" ?
but clang/gcc generate slightly different sequences for
retpoline anyway, so even if '.macro RETPOLINE_JMP' in
nospec-branch.h changes it doesn't mean that x64 jit has to change.
So what kinda comment there would make sense?

Re: [PATCH bpf v2] bpf: fix memory leak in lpm_trie map_free callback function

2018-02-21 Thread Eric Dumazet

On Tue, 2018-02-13 at 19:17 -0800, Alexei Starovoitov wrote:
> On Tue, Feb 13, 2018 at 07:00:21PM -0800, Yonghong Song wrote:
> > There is a memory leak happening in lpm_trie map_free callback
> > function trie_free. The trie structure itself does not get freed.
> > 
> > Also, trie_free function did not do synchronize_rcu before freeing
> > various data structures. This is incorrect as some rcu_read_lock
> > region(s) for lookup, update, delete or get_next_key may not complete yet.
> > The fix is to add synchronize_rcu in the beginning of trie_free.
> > The useless spin_lock is removed from this function as well.
> > 
> > Fixes: b95a5c4db09b ("bpf: add a longest prefix match trie map 
> > implementation")
> > Reported-by: Mathieu Malaterre 
> > Reported-by: Alexei Starovoitov 
> > Tested-by: Mathieu Malaterre 
> > Signed-off-by: Yonghong Song 
> > ---
> >  kernel/bpf/lpm_trie.c | 11 +++
> >  1 file changed, 7 insertions(+), 4 deletions(-)
> > 
> > v1->v2:
> >   Make comments more precise and make label name more appropriate,
> >   as suggested by Daniel
> 
> Applied to bpf tree, Thanks Yonghong.


This does not look good.

LOCKDEP surely should complain to

node = rcu_dereference_protected(*slot, lockdep_is_held(&trie->lock));

Since we no longer hold trie->lock

Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-21 Thread Samudrala, Sridhar


On 2/21/2018 6:35 PM, Samudrala, Sridhar wrote:

On 2/21/2018 5:59 PM, Siwei Liu wrote:

On Wed, Feb 21, 2018 at 4:17 PM, Alexander Duyck
 wrote:

On Wed, Feb 21, 2018 at 3:50 PM, Siwei Liu  wrote:

I haven't checked emails for days and did not realize the new revision
had already come out. And thank you for the effort, this revision
really looks to be a step forward towards our use case and is close to
what we wanted to do. A few questions in line.

On Sat, Feb 17, 2018 at 9:12 AM, Alexander Duyck
 wrote:
On Fri, Feb 16, 2018 at 6:38 PM, Jakub Kicinski  
wrote:

On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote:

Patch 2 is in response to the community request for a 3 netdev
solution.  However, it creates some issues we'll get into in a moment.
It extends virtio_net to use alternate datapath when available and
registered. When BACKUP feature is enabled, virtio_net driver creates
an additional 'bypass' netdev that acts as a master device and controls
2 slave devices.  The original virtio_net netdev is registered as
'backup' netdev and a passthru/vf device with the same MAC gets
registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
associated with the same 'pci' device.  The user accesses the network
interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
as default for transmits when it is available with link up and running.

Thank you for doing this.


We noticed a couple of issues with this approach during testing.
- As both 'bypass' and 'backup' netdevs are associated with the same
   virtio pci device, udev tries to rename both of them with the same name
   and the 2nd rename will fail. This would be OK as long as the first netdev
   to be renamed is the 'bypass' netdev, but the order in which udev gets
   to rename the 2 netdevs is not reliable.

Out of curiosity - why do you link the master netdev to the virtio
struct device?

The basic idea of all this is that we wanted this to work with an
existing VM image that was using virtio. As such we were trying to
make it so that the bypass interface takes the place of the original
virtio and get udev to rename the bypass to what the original
virtio_net was.

Could it also be made possible to take over the config from VF instead
of virtio on an existing VM image? And get udev to rename the bypass
netdev to what the original VF was. I don't say tightly binding the
bypass master to only virtio or VF, but I think we should provide both
options to support different upgrade paths. Possibly we could tweak
the device tree layout to reuse the same PCI slot for the master
bypass netdev, such that udev would not get confused when renaming the
device. The VF needs to use a different function slot afterwards.
Perhaps we might need a special multiseat-like QEMU device for that
purpose?

In our case we'll upgrade the config from VF to virtio-bypass directly.

So if I am understanding what you are saying you are wanting to flip
the backup interface from the virtio to a VF. The problem is that
becomes a bit of a vendor lock-in solution since it would rely on a
specific VF driver. I would agree with Jiri that we don't want to go
down that path. We don't want every VF out there firing up its own
separate bond. Ideally you want the hypervisor to be able to manage
all of this which is why it makes sense to have virtio manage this and
why this is associated with the virtio_net interface.

No, that's not what I was talking about of course. I thought you
mentioned the upgrade scenario this patch would like to address is to
use the bypass interface "to take the place of the original virtio,
and get udev to rename the bypass to what the original virtio_net
was". That is one of the possible upgrade paths for sure. However the
upgrade path I was seeking is to use the bypass interface to take the
place of original VF interface while retaining the name and network
configs, which generally can be done simply with kernel upgrade. It
would become limiting as this patch makes the bypass interface share
the same virtio pci device with virtio backup. Can this bypass
interface be made general to take place of any pci device other than
virtio-net? This will be more helpful as the cloud users who have
existing setup on VF interface don't have to recreate it on virtio-net
and VF separately again.


Yes. This sounds interesting. Looks like you want an existing VM image with
VF only configuration to get transparent live migration support by adding
virtio_net with BACKUP feature.  We may need another feature bit to switch
between these 2 options.


After thinking some more, this may be more involved than adding a new
feature bit.  This requires a netdev created by virtio to take over the name of
a VF netdev associated with a PCI device that may not be plugged in when
the virtio driver is coming up. This definitely requires some new messages
Re: [RFC net PATCH] virtio_net: disable XDP_REDIRECT in receive_mergeable() case

2018-02-21 Thread Jason Wang




On 2018年02月16日 23:41, Jesper Dangaard Brouer wrote:

On Fri, 16 Feb 2018 13:31:37 +0800
Jason Wang  wrote:


On 2018年02月16日 06:43, Jesper Dangaard Brouer wrote:

The virtio_net code has three different RX code-paths in receive_buf().
Two of these code paths can handle XDP, but one of them is broken for
at least XDP_REDIRECT.

Function(1): receive_big() does not support XDP.
Function(2): receive_small() supports XDP fully and uses build_skb().
Function(3): receive_mergeable() has broken XDP_REDIRECT and uses napi_alloc_skb().

The simple explanation is that receive_mergeable() is broken because
it uses napi_alloc_skb(), which violates XDP given XDP assumes packet
header+data in single page and enough tail room for skb_shared_info.

The longer explanation is that receive_mergeable() tries to
work around and satisfy these XDP requirements e.g. by having a
function xdp_linearize_page() that allocates and memcpy RX buffers
around (in case packet is scattered across multiple rx buffers).  This
does currently satisfy XDP_PASS, XDP_DROP and XDP_TX (but only because
we have not implemented bpf_xdp_adjust_tail yet).

The XDP_REDIRECT action combined with cpumap is broken, and causes
hard-to-debug crashes.  The main issue is that the RX packet does not have
the needed tail-room (SKB_DATA_ALIGN(skb_shared_info)), causing
skb_shared_info to overlap the next packet's head-room (in which cpumap
stores info).

Reproducing depends on the packet payload length and on whether the
RX-buffer size happened to have tail-room for skb_shared_info or not.  To
make this even harder to troubleshoot, the RX-buffer size changes
dynamically at runtime, based on an Exponentially Weighted Moving Average
(EWMA) over the packet length, when refilling RX rings.
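
To make the moving-target buffer size concrete, here is a minimal
user-space sketch of an exponentially weighted moving average over
packet lengths. The names (pkt_len_ewma, EWMA_WEIGHT) and the weight of
8 are assumptions chosen for illustration; this is not the virtio_net
code, which relies on the kernel's own EWMA helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical weight: each new sample contributes 1/8 of the average. */
#define EWMA_WEIGHT 8u

struct pkt_len_ewma {
	uint32_t avg;	/* current average packet length in bytes */
	int primed;	/* 0 until the first sample has been seen */
};

/* Fold one observed packet length into the running average. */
static void pkt_len_ewma_add(struct pkt_len_ewma *e, uint32_t len)
{
	assert(e != NULL);
	if (!e->primed) {
		/* The first sample seeds the average directly. */
		e->avg = len;
		e->primed = 1;
		return;
	}
	/* avg = (avg * (W - 1) + len) / W, computed in 64 bits */
	e->avg = (uint32_t)(((uint64_t)e->avg * (EWMA_WEIGHT - 1) + len) /
			    EWMA_WEIGHT);
}
```

Because the average lags behind the traffic, a buffer sized from it may
or may not leave tail-room for a given packet, which matches the
hard-to-reproduce behaviour described above.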

This patch only disables XDP_REDIRECT support in the receive_mergeable()
case, because it can cause a real crash.

But IMHO we should NOT support XDP in receive_mergeable() at all,
because the principles behind XDP are to gain speed by (1) code
simplicity, (2) sacrificing memory and (3) where possible moving
runtime checks to setup time.  These principles are clearly being
violated in receive_mergeable(), which e.g. tracks the average
buffer size at runtime to save memory.

I agree to disable it for -net now.

Okay... I'll send an official patch later.


For net-next, we probably can do:

- drop xdp_linearize_page() and do XDP through generic XDP helper
   after skb was built

I disagree strongly here - it makes no sense.

Why do you want to explicitly fall back to Generic-XDP?
(... then all the performance gain is gone!)


Note this only happens when:

1) Rx buffer size is underestimated; we could disable estimation and
then this won't happen
2) headroom is not sufficient; we try hard not to stop the device during
XDP set, so this can happen but only for the first several packets


So this looks pretty fine and removes a lot of complex code.


And besides, a couple of function calls later, the generic XDP code
will/can get invoked anyhow...


What if we choose to use native mode of XDP?




Take a step back:
  What is the reason/use-case for implementing XDP inside virtio_net?

 From a DDoS/performance perspective XDP in virtio_net happens on the
"wrong-side" as it is activated _inside_ the guest OS, which is too
late for a DDoS filter, as the guest kick/switch overhead has already
occurred.


I don't see any difference for virtio-net here. Consider a real hardware
NIC: XDP (except for the offload case) also starts to drop packets only
after they reach the hardware.




I do use XDP_DROP inside the guest (driver virtio_net), but just to
perform what I can zoom-in benchmarking, for perf-record isolating the
early RX code path in the guest.  (Using iptables "raw" table drop is
almost as useful for that purpose).



We cannot assume the type of virtio-net backend; it could be dpdk or
another high-speed implementation. And I'm pretty sure there are more use
cases, here are two:


- Use XDP to accelerate nested VMs
- XDP offload to host



The XDP ndo_xdp_xmit in tuntap/tun.c (that you also implemented) is
significantly more interesting.  As it allow us to skip large parts of
the network stack and redirect from a physical device (ixgbe) into a
guest device.  Ran a benchmark:
  - 0.5 Mpps with normal code path into device with driver tun
  - 3.7 Mpps with XDP_REDIRECT from ixgbe into same device

Plus, there are indications that 3.7Mpps is not the real limit, as
guest CPU doing XDP_DROP is 75% idle... thus this is likely a
scheduling + queue size issue.


Yes, XDP_DROP can do more (but I forget the exact number). Btw testpmd
(in guest) can give me about 3Mpps when doing forwarding (io mode). The
main bottleneck in this case is vhost: XDP_REDIRECT can provide about
8Mpps to tun, but vhost can only receive about 4Mpps, and vhost tx can
only reach 3Mpps.


Thanks





- disable EWMA when XDP is set and reserve enough tailroom.


Besides the described bug:

Update(1): There is also a OOM leak in the

[PATCH iproute2] README: re-add updated information link

2018-02-21 Thread Jakub Kicinski

From: Quentin Monnet 

The "Information" link was removed from README file in commit
d7843207e6fd ("README: update location of git repositories, remove
broken info link"), because it redirected to a page that no longer
existed on the Linux Foundation wiki.

This page has just been restored, so we can add the link back again.
Since the previous link was a redirection, use the updated link instead.

Thanks to Luca Boccassi for investigating this issue, restoring and
updating the page.

Signed-off-by: Quentin Monnet 
---
 README | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/README b/README
index 1b7f44272f5a..f66fd5faf4cf 100644
--- a/README
+++ b/README
@@ -1,5 +1,8 @@
 This is a set of utilities for Linux networking.
 
+Information:
+https://wiki.linuxfoundation.org/networking/iproute2
+
 Download:
 http://www.kernel.org/pub/linux/utils/net/iproute2/
 
-- 
2.15.1

Re: [PATCH bpf] bpf, x64: implement retpoline for tail call

2018-02-21 Thread Eric Dumazet

On Thu, 2018-02-22 at 01:05 +0100, Daniel Borkmann wrote:

...

> +/* Instead of plain jmp %rax, we emit a retpoline to control
> + * speculative execution for the indirect branch.
> + */
> +static void emit_retpoline_rax_trampoline(u8 **pprog)
> +{
> + u8 *prog = *pprog;
> + int cnt = 0;
> +
> + EMIT1_off32(0xE8, 7);/* callq  */
> + /* capture_spec: */
> + EMIT2(0xF3, 0x90);   /* pause */
> + EMIT3(0x0F, 0xAE, 0xE8); /* lfence */
> + EMIT2(0xEB, 0xF9);   /* jmp  */
> + /* set_up_target: */
> + EMIT4(0x48, 0x89, 0x04, 0x24); /* mov %rax,(%rsp) */
> + EMIT1(0xC3); /* retq */
> +
> + BUILD_BUG_ON(cnt != RETPOLINE_SIZE);
> + *pprog = prog;

You might define the actual code sequence (and length) in 
arch/x86/include/asm/nospec-branch.h

If we need to adjust code sequences for RETPOLINE, then we won't
forget/miss that arch/x86/net/bpf_jit_comp.c had it hard-coded.

Thanks Daniel.
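
One way to picture the suggestion above is a single definition that
carries both the byte sequence and its length, so the two cannot drift
apart. The following is only a user-space sketch: retpoline_rax_bytes,
RETPOLINE_RAX_SIZE and emit_retpoline_rax are hypothetical names, the
bytes are copied from the quoted patch, and the label names in the
comments are inferred from the jump offsets. Whether such a definition
belongs in nospec-branch.h is exactly what the thread is debating.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte sequence copied from the quoted patch. */
static const uint8_t retpoline_rax_bytes[] = {
	0xE8, 0x07, 0x00, 0x00, 0x00,	/* callq set_up_target (rel32 = 7) */
	0xF3, 0x90,			/* capture_spec: pause */
	0x0F, 0xAE, 0xE8,		/* lfence */
	0xEB, 0xF9,			/* jmp capture_spec (rel8 = -7) */
	0x48, 0x89, 0x04, 0x24,		/* set_up_target: mov %rax,(%rsp) */
	0xC3,				/* retq */
};

/* The length is derived from the table, never hard-coded separately. */
#define RETPOLINE_RAX_SIZE sizeof(retpoline_rax_bytes)

/* Copy the sequence into buf; returns bytes written, or 0 if no room. */
static size_t emit_retpoline_rax(uint8_t *buf, size_t room)
{
	assert(buf != NULL);
	if (room < RETPOLINE_RAX_SIZE)
		return 0;
	memcpy(buf, retpoline_rax_bytes, RETPOLINE_RAX_SIZE);
	return RETPOLINE_RAX_SIZE;
}
```

With this layout a consumer can check its own size constant against
RETPOLINE_RAX_SIZE at build time, in the spirit of the BUILD_BUG_ON()
in the patch.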

Re: [PATCH v2 net-next 1/2] lan743x: Add main source files for new lan743x driver

2018-02-21 Thread Andrew Lunn

> +static void lan743x_intr_unregister_isr(struct lan743x_adapter *adapter,
> + int vector_index)
> +{
> + struct lan743x_vector *vector = &adapter->intr.vector_list
> + [vector_index];
> +
> + devm_free_irq(&adapter->pci.pdev->dev, vector->irq, vector);

Hi Bryan

The point of devm_ is that you don't need to free resources you have
allocated using devm_. The core will release them when the device is
removed.

In this case, you might want to ensure the hardware will not generate
any more interrupts, but you can leave the core to call free_irq().

Please look at all your devm_*free() like calls, and remove them if
they are not needed. 
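
The ownership model described above can be pictured with a small
user-space sketch. This is not the kernel's devres API: fake_dev,
fake_devm_add, fake_dev_remove and count_release are made-up names that
only model the idea that the core, not the driver, runs the release
callbacks when the device goes away.

```c
#include <assert.h>
#include <stddef.h>

#define FAKE_DEVRES_MAX 8

typedef void (*release_fn)(void *res);

/* Hypothetical device that remembers how to release what it was given. */
struct fake_dev {
	release_fn release[FAKE_DEVRES_MAX];
	void *res[FAKE_DEVRES_MAX];
	size_t count;
};

/* Register a resource; from now on the "core" owns its release. */
static int fake_devm_add(struct fake_dev *dev, release_fn fn, void *res)
{
	assert(dev != NULL && fn != NULL);
	if (dev->count == FAKE_DEVRES_MAX)
		return -1;
	dev->release[dev->count] = fn;
	dev->res[dev->count] = res;
	dev->count++;
	return 0;
}

/* On removal, release everything in reverse order of registration. */
static void fake_dev_remove(struct fake_dev *dev)
{
	assert(dev != NULL);
	while (dev->count > 0) {
		dev->count--;
		dev->release[dev->count](dev->res[dev->count]);
	}
}

/* Example release callback: counts how many resources were released. */
static void count_release(void *res)
{
	(*(int *)res)++;
}
```

A driver written against this model registers its resources and has no
explicit free calls in its cleanup path, which is the shape the review
asks for.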

> +static void lan743x_mdiobus_cleanup(struct lan743x_adapter *adapter)
> +{
> + if (adapter->init_flags & LAN743X_INIT_FLAG_MDIOBUS_REGISTERED) {
> + mdiobus_unregister(adapter->mdiobus);
> + adapter->init_flags &= ~LAN743X_INIT_FLAG_MDIOBUS_REGISTERED;
> + }
> + if (adapter->init_flags & LAN743X_INIT_FLAG_MDIOBUS_ALLOCATED) {
> + devm_mdiobus_free(&adapter->pci.pdev->dev, adapter->mdiobus);
> + adapter->mdiobus = NULL;
> + adapter->init_flags &= ~LAN743X_INIT_FLAG_MDIOBUS_ALLOCATED;
> + }
> +}

So you can delete devm_mdiobus_free(). That probably means you can also remove 
LAN743X_INIT_FLAG_MDIOBUS_ALLOCATED.

> +
> +static void lan743x_full_cleanup(struct lan743x_adapter *adapter)
> +{
> + if (adapter->init_flags & LAN743X_INIT_FLAG_NETDEV_REGISTERED) {
> + unregister_netdev(adapter->netdev);
> + adapter->init_flags &= ~LAN743X_INIT_FLAG_NETDEV_REGISTERED;
> + }
> + lan743x_mdiobus_cleanup(adapter);
> + lan743x_hardware_cleanup(adapter);
> + if (adapter->init_flags & LAN743X_COMPONENT_FLAG_PCI) {
> + lan743x_pci_cleanup(adapter);
> + adapter->init_flags &= ~LAN743X_COMPONENT_FLAG_PCI;
> + }
> +
> + /* would have freed netdev here.
> +  * but netdev was allocated with devm_alloc_etherdev.
> +  * and devm_free_netdev is not accessible.
> +  * so it is expected to be freed by the devm subsystem.
> +  */

And this comment can go.

Andrew

Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-21 Thread Samudrala, Sridhar


On 2/21/2018 5:59 PM, Siwei Liu wrote:

On Wed, Feb 21, 2018 at 4:17 PM, Alexander Duyck
 wrote:

On Wed, Feb 21, 2018 at 3:50 PM, Siwei Liu  wrote:

I haven't checked emails for days and did not realize the new revision
had already come out. And thank you for the effort, this revision
really looks to be a step forward towards our use case and is close to
what we wanted to do. A few questions in line.

On Sat, Feb 17, 2018 at 9:12 AM, Alexander Duyck
 wrote:

On Fri, Feb 16, 2018 at 6:38 PM, Jakub Kicinski  wrote:

On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote:

Patch 2 is in response to the community request for a 3 netdev
solution.  However, it creates some issues we'll get into in a moment.
It extends virtio_net to use alternate datapath when available and
registered. When BACKUP feature is enabled, virtio_net driver creates
an additional 'bypass' netdev that acts as a master device and controls
2 slave devices.  The original virtio_net netdev is registered as
'backup' netdev and a passthru/vf device with the same MAC gets
registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
associated with the same 'pci' device.  The user accesses the network
interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
as default for transmits when it is available with link up and running.
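
The transmit choice described above can be sketched in a few lines of
user-space C. The types and names here (fake_netdev, fake_bypass,
fake_bypass_xmit_dev) are hypothetical and are not the structures used
by the patch; falling back to the 'backup' netdev when the 'active' one
is unusable is an assumption drawn from the naming.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a slave netdev's state. */
struct fake_netdev {
	const char *name;
	bool link_up;
	bool running;
};

/* Hypothetical 'bypass' master with its two slaves. */
struct fake_bypass {
	struct fake_netdev *active;	/* passthru/VF device, may be NULL */
	struct fake_netdev *backup;	/* original virtio_net netdev */
};

static bool fake_netdev_usable(const struct fake_netdev *dev)
{
	return dev != NULL && dev->link_up && dev->running;
}

/* Prefer 'active' when it is present, link up and running. */
static struct fake_netdev *fake_bypass_xmit_dev(const struct fake_bypass *b)
{
	assert(b != NULL);
	if (fake_netdev_usable(b->active))
		return b->active;
	if (fake_netdev_usable(b->backup))
		return b->backup;
	return NULL;	/* nothing usable: the caller would drop the packet */
}
```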

Thank you for doing this.


We noticed a couple of issues with this approach during testing.
- As both 'bypass' and 'backup' netdevs are associated with the same
   virtio pci device, udev tries to rename both of them with the same name
   and the 2nd rename will fail. This would be OK as long as the first netdev
   to be renamed is the 'bypass' netdev, but the order in which udev gets
   to rename the 2 netdevs is not reliable.

Out of curiosity - why do you link the master netdev to the virtio
struct device?

The basic idea of all this is that we wanted this to work with an
existing VM image that was using virtio. As such we were trying to
make it so that the bypass interface takes the place of the original
virtio and get udev to rename the bypass to what the original
virtio_net was.

Could it also be made possible to take over the config from VF instead
of virtio on an existing VM image? And get udev to rename the bypass
netdev to what the original VF was. I don't say tightly binding the
bypass master to only virtio or VF, but I think we should provide both
options to support different upgrade paths. Possibly we could tweak
the device tree layout to reuse the same PCI slot for the master
bypass netdev, such that udev would not get confused when renaming the
device. The VF needs to use a different function slot afterwards.
Perhaps we might need a special multiseat-like QEMU device for that
purpose?

In our case we'll upgrade the config from VF to virtio-bypass directly.

So if I am understanding what you are saying you are wanting to flip
the backup interface from the virtio to a VF. The problem is that
becomes a bit of a vendor lock-in solution since it would rely on a
specific VF driver. I would agree with Jiri that we don't want to go
down that path. We don't want every VF out there firing up its own
separate bond. Ideally you want the hypervisor to be able to manage
all of this which is why it makes sense to have virtio manage this and
why this is associated with the virtio_net interface.

No, that's not what I was talking about of course. I thought you
mentioned the upgrade scenario this patch would like to address is to
use the bypass interface "to take the place of the original virtio,
and get udev to rename the bypass to what the original virtio_net
was". That is one of the possible upgrade paths for sure. However the
upgrade path I was seeking is to use the bypass interface to take the
place of original VF interface while retaining the name and network
configs, which generally can be done simply with kernel upgrade. It
would become limiting as this patch makes the bypass interface share
the same virtio pci device with virtio backup. Can this bypass
interface be made general to take place of any pci device other than
virtio-net? This will be more helpful as the cloud users who have
existing setup on VF interface don't have to recreate it on virtio-net
and VF separately again.


Yes. This sounds interesting. Looks like you want an existing VM image with
VF only configuration to get transparent live migration support by adding
virtio_net with BACKUP feature.  We may need another feature bit to switch
between these 2 options.





The other bits get into more complexity than we are ready to handle
for now. I think I might have talked about something similar that I
was referring to as a "virtio-bond" where you would have a PCI/PCIe
tree topology that makes this easier to sort out, and the "virtio-bond"
would be used to handle coordination/configuration of a much more
complex interface.

That was

[PATCH v2] selftests/bpf/test_maps: exit child process without error in ENOMEM case

2018-02-21 Thread Li Zhijian

From: Li Zhijian 

test_maps contains a series of stress tests, and previously a memory
allocation failure in one of them would abort the remaining tests.
---
Failed to create hashmap key=8 value=262144 'Cannot allocate memory'
Failed to create hashmap key=16 value=262144 'Cannot allocate memory'
Failed to create hashmap key=8 value=262144 'Cannot allocate memory'
Failed to create hashmap key=8 value=262144 'Cannot allocate memory'
test_maps: test_maps.c:955: run_parallel: Assertion `status == 0' failed.
Aborted
not ok 1..3 selftests:  test_maps [FAIL]
---
After this patch, the remaining tests continue when an ENOMEM failure
occurs.

CC: Alexei Starovoitov 
CC: Philip Li 
Suggested-by: Daniel Borkmann 
Signed-off-by: Li Zhijian 
---
 tools/testing/selftests/bpf/test_maps.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_maps.c 
b/tools/testing/selftests/bpf/test_maps.c
index 436c4c7..9e03a4c 100644
--- a/tools/testing/selftests/bpf/test_maps.c
+++ b/tools/testing/selftests/bpf/test_maps.c
@@ -126,6 +126,8 @@ static void test_hashmap_sizes(int task, void *data)
fd = bpf_create_map(BPF_MAP_TYPE_HASH, i, j,
2, map_flags);
if (fd < 0) {
+   if (errno == ENOMEM)
+   return;
printf("Failed to create hashmap key=%d 
value=%d '%s'\n",
   i, j, strerror(errno));
exit(1);
-- 
2.7.4
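
The effect of the change can be sketched outside the selftest: a forked
child that treats ENOMEM as "skip" exits with status 0, so a parent that
asserts on the exit status keeps going. The helper names below
(child_task, run_one_child) are hypothetical, and the failure is
simulated rather than produced by bpf_create_map(); the real test
returns from the test function instead of exiting directly.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Child body: pretend map creation failed with the given errno value. */
static void child_task(int err)
{
	if (err == ENOMEM)
		_exit(0);	/* out of memory is not counted as a failure */
	_exit(1);		/* any other error is a real test failure */
}

/* Fork one child and return its exit status, or -1 on fork/wait errors. */
static int run_one_child(int simulated_errno)
{
	int status = 0;
	pid_t pid;

	assert(simulated_errno > 0);	/* errno values are positive */
	pid = fork();
	if (pid == 0)
		child_task(simulated_errno);	/* never returns */
	if (pid < 0)
		return -1;
	if (waitpid(pid, &status, 0) != pid)
		return -1;
	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

A parent that asserts status == 0, as run_parallel does in the log
above, would therefore pass for the ENOMEM case and still fail for other
errors.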

[PATCH iproute2] ip: Properly display AF_BRIDGE address information for neighbor events

2018-02-21 Thread Donald Sharp

The vxlan driver when a neighbor add/delete event occurs sends
NDA_DST filled with a union:

union vxlan_addr {
struct sockaddr_in sin;
struct sockaddr_in6 sin6;
struct sockaddr sa;
};

This eventually calls rt_addr_n2a_r which had no handler for the
AF_BRIDGE family and "???" was being printed.

Add code to properly display this data when requested.

Signed-off-by: Donald Sharp 
---
 lib/utils.c | 18 ++
 1 file changed, 18 insertions(+)

diff --git a/lib/utils.c b/lib/utils.c
index 24aeddd8..e01e18a7 100644
--- a/lib/utils.c
+++ b/lib/utils.c
@@ -1004,6 +1004,24 @@ const char *rt_addr_n2a_r(int af, int len,
}
case AF_PACKET:
return ll_addr_n2a(addr, len, ARPHRD_VOID, buf, buflen);
+   case AF_BRIDGE:
+   {
+   unsigned short family = ((struct sockaddr *)addr)->sa_family;
+   struct sockaddr_in6 *sin6;
+   struct sockaddr_in *sin;
+
+   switch(family) {
+   case AF_INET:
+   sin = (struct sockaddr_in *)addr;
+   return inet_ntop(AF_INET, &sin->sin_addr, buf, buflen);
+   case AF_INET6:
+   sin6 = (struct sockaddr_in6 *)addr;
+   return inet_ntop(AF_INET6, &sin6->sin6_addr,
+buf, buflen);
+   }
+
+   /* fallthrough */
+   }
default:
return "???";
}
-- 
2.14.3
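
The dispatch on the embedded address family can be tried in user space
with a stand-alone sketch. The union mirrors the one quoted in the
commit message; vxlan_addr_example, vxlan_addr_to_str and
vxlan_addr_from_ipv4 are hypothetical names and are not part of
iproute2.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/* Same shape as the union quoted in the commit message. */
union vxlan_addr_example {
	struct sockaddr_in sin;
	struct sockaddr_in6 sin6;
	struct sockaddr sa;
};

/* Format the address according to the family stored in the union. */
static const char *vxlan_addr_to_str(const union vxlan_addr_example *a,
				     char *buf, size_t buflen)
{
	assert(a != NULL && buf != NULL);
	switch (a->sa.sa_family) {
	case AF_INET:
		return inet_ntop(AF_INET, &a->sin.sin_addr, buf,
				 (socklen_t)buflen);
	case AF_INET6:
		return inet_ntop(AF_INET6, &a->sin6.sin6_addr, buf,
				 (socklen_t)buflen);
	default:
		return "???";	/* unknown family, same marker as the patch */
	}
}

/* Helper for the example: fill the union from dotted-quad text. */
static int vxlan_addr_from_ipv4(union vxlan_addr_example *a, const char *text)
{
	memset(a, 0, sizeof(*a));
	a->sin.sin_family = AF_INET;
	return inet_pton(AF_INET, text, &a->sin.sin_addr) == 1 ? 0 : -1;
}
```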

Re: ss issue on arm not showing UDP listening ports

2018-02-21 Thread Jesse Cooper

Thank you for the suggestions. This is on a raspberry pi 3 not sure if
that fact matters. I will notify Raspbian of the issue.

On 02/21/2018 03:03 PM, Stefano Brivio wrote:
> On Wed, 21 Feb 2018 12:37:31 -0500
> jesse_coo...@codeholics.com wrote:
> 
>> ss utility, iproute2-ss161212
> 
> Works for me on iproute2-ss161212 and 4.9.0 kernel on armv7l. Unless
> somebody on the list has other ideas, I guess you should either try
> more recent versions, debug it (strace should show a pair of
> recvmsg/sendmsg for each UDP socket) or file a ticket for your
> distribution.
>

nft/bpf interpreters and spectre2. Was: [PATCH RFC 0/4] net: add bpfilter

2018-02-21 Thread Alexei Starovoitov

On Wed, Feb 21, 2018 at 01:13:03PM +0100, Florian Westphal wrote:
> 
> Obvious candidates are: meta, numgen, limit, objref, quota, reject.
> 
> We should probably also consider removing
> CONFIG_NFT_SET_RBTREE and CONFIG_NFT_SET_HASH and just always
> build both too (at least rbtree since that offers interval).
> 
> For the indirect call issue we can use direct calls from eval loop for
> some of the more frequently used ones, similar to what we do already
> for nft_cmp_fast_expr. 

nft_cmp_fast_expr and other expressions mentioned above made me think...

do we have the same issue with nft interpreter as we had with bpf one?
bpf interpreter was used as part of spectre2 attack to leak
information via cache side channel and let VM read hypervisor memory.
Due to that issue we removed bpf interpreter from the kernel code.
That's what CONFIG_BPF_JIT_ALWAYS_ON is for...
but we still have nft interpreter in the kernel that can also
execute arbitrary nft expressions.

Jann's exploit used the following bpf instructions:
struct bpf_insn evil_bytecode_instrs[] = {
// rax = target_byte_addr
{ .code = BPF_LD | BPF_IMM | BPF_DW, .dst_reg = 0, .imm = target_byte_addr }, { .imm = target_byte_addr>>32 },
// rdi = timing_leak_array
{ .code = BPF_LD | BPF_IMM | BPF_DW, .dst_reg = 1, .imm = host_timing_leak_addr }, { .imm = host_timing_leak_addr>>32 },
// rax = *(u8*)rax
{ .code = BPF_LDX | BPF_MEM | BPF_B, .dst_reg = 0, .src_reg = 0, .off = 0 },
// rax = rax << ...
{ .code = BPF_ALU64 | BPF_LSH | BPF_K, .dst_reg = 0, .imm = 10 - bit_idx },
// rax = rax & 0x400
{ .code = BPF_ALU64 | BPF_AND | BPF_K, .dst_reg = 0, .imm = 0x400 },
// rax = rdi + rax
{ .code = BPF_ALU64 | BPF_ADD | BPF_X, .dst_reg = 0, .src_reg = 1 },
// *(u8*) (rax + 0x800)
{ .code = BPF_LDX | BPF_MEM | BPF_B, .dst_reg = 0, .src_reg = 0, .off = 0x800 },

and a gadget to jump into __bpf_prog_run with insn pointing
to memory controlled by the guest while accessible
(at different virt address) by the hypervisor.

It seems possible to construct similar sequence of instructions
out of nft expressions and use gadget that jumps into nft_do_chain().
The attacker would need to discover more kernel addresses:
nft_do_chain, nft_cmp_fast_ops, nft_payload_fast_ops, nft_bitwise_eval,
nft_lookup_eval, and nft_bitmap_lookup
to populate nft chains, rules and expressions in guest memory
compared to the bpf interpreter attack.

Then in nft_do_chain(struct nft_pktinfo *pkt, void *priv)
pkt needs to point to fake struct sk_buff in guest memory with
skb->head == target_byte_addr
The first nft expression can be nft_payload_fast_eval().
If it's properly constructed with
(nft_payload->base == NFT_PAYLOAD_NETWORK_HEADER, offset == 0, len == 0, dreg == 1)
it will do arbitrary load of
*(u8 *)dest = *(u8 *)ptr;
from target_byte_addr into register 1 of nft state machine
(dest is u32 array of registers in the stack of nft_do_chain)
Second nft expression can be nft_bitwise_eval() to mask particular
bit in register 1.
Then nft_cmp_eval() to check whether bit is one or zero and
conditional NFT_BREAK out of first nft expression into second nft rule.
The last conditional nft_immediate_eval() in the first rule will set
register 1 to 0x400 * 8 while the first nft_bitwise_eval() in
the second rule will do r1 &= 0x400 * 8.
So at this point r1 will have either 0x400 * 8 or 0 depending
on value of speculatively loaded bit.
The last expression can be nft_lookup_eval() with 
nft_lookup->set->ops->lookup == nft_bitmap_lookup
which will do nft_bitmap->bitmap[idx] where idx = r1 / 8
The memory used for this last nft_lookup/bitmap expression is
both an instruction and timing_leak_array itself.
If I'm not mistaken, this sequence of nft expression will
speculatively execute very similar logic as in evil_bytecode_instrs[]

The amount of actual speculative native cpu load/stores/branches is
probably more than executed by bpf interpreter for these evil bytecodes,
but likely well within cpu speculation window of 100+ insns.

Obviously such exploit is harder to do than bpf based one.
Do we need to do anything about it ?
Maybe it's easier to find gadgets in .text of vmlinux
instead of messing with interpreters?

Jann,
can you comment on removing interpreters in general?
Do we need to worry about having bpf and/or nft interpreter
in the kernel?

Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-21 Thread Samudrala, Sridhar

On 2/21/2018 6:02 PM, Jakub Kicinski wrote:
> On Wed, 21 Feb 2018 12:57:09 -0800, Alexander Duyck wrote:
>>> I don't see why the team cannot be there always.
>>
>> It is more the logistical nightmare. Part of the goal here was to work
>> with the cloud base images that are out there such as
>> https://alt.fedoraproject.org/cloud/. With just the kernel changes the
>> overhead for this stays fairly small and would be pulled in as just a
>> standard part of the kernel update process. The virtio bypass only
>> pops up if the backup bit is present. With the team solution it
>> requires that the base image use the team driver on virtio_net when it
>> sees one. I doubt the OSVs would want to do that just because SR-IOV
>> isn't that popular of a case.
>
> IIUC we need to monitor for a "backup hint", spawn the master, rename it
> to maintain backwards compatibility with no-VF setups and enslave the VF
> if it appears.
>
> All those sound possible from user space, the advantage of the kernel
> solution right now is that it has more complete code.
>
> Am I misunderstanding?

I think there is some misunderstanding about the exact requirement and
the use case we are trying to solve. If the guest is allowed to do this
configuration, we already have a solution with either bond/team based
user space configuration.

This is to enable cloud service providers to provide an accelerated
datapath by simply letting tenants bring their own images, with the only
requirement being to enable their kernels with the newer virtio_net
driver with BACKUP support and the VF driver.

To recap from an earlier thread, here is a response from Stephen that
talks about the requirement for the netvsc solution, and we would like
to provide a similar solution for KVM based cloud deployments.

> The requirement with Azure accelerated network was that a stock
> distribution image from the store must be able to run unmodified and
> get accelerated networking.
> Not sure if other environments need to work the same, but it would
> be nice.
>
> That meant no additional setup scripts (aka no bonding) and also it must
> work transparently with hot-plug. Also there are diverse set of
> environments: openstack, cloudinit, network manager and systemd. The
> solution had to not depend on any one of them, but also not break any
> of them.

Thanks
Sridhar

[PATCH v2 iproute2-next 2/3] ip: Display ip rule protocol used

2018-02-21 Thread Donald Sharp

Modify 'ip rule' command to notice when the kernel passes
to us the originating protocol.

Add code to allow the `ip rule flush protocol XXX`
command to be accepted and properly handled.

Modify the documentation to reflect these code changes.

Signed-off-by: Donald Sharp 
---
 ip/iprule.c| 29 ++---
 man/man8/ip-rule.8 | 18 +-
 2 files changed, 39 insertions(+), 8 deletions(-)

diff --git a/ip/iprule.c b/ip/iprule.c
index 00a6c26a..39008768 100644
--- a/ip/iprule.c
+++ b/ip/iprule.c
@@ -47,6 +47,7 @@ static void usage(void)
"[ iif STRING ] [ oif STRING ] [ pref NUMBER ] [ 
l3mdev ]\n"
"[ uidrange NUMBER-NUMBER ]\n"
"ACTION := [ table TABLE_ID ]\n"
+   "  [ protocol RPROTO ]\n"
"  [ nat ADDRESS ]\n"
"  [ realms [SRCREALM/]DSTREALM ]\n"
"  [ goto NUMBER ]\n"
@@ -71,6 +72,8 @@ static struct
struct fib_rule_uid_range range;
inet_prefix src;
inet_prefix dst;
+   int protocol;
+   int protocolmask;
 } filter;
 
 static inline int frh_get_table(struct fib_rule_hdr *frh, struct rtattr **tb)
@@ -338,6 +341,10 @@ int print_rule(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
rtnl_rtntype_n2a(frh->action,
 b1, sizeof(b1)));
 
+   if (frh->proto != RTPROT_UNSPEC)
+   fprintf(fp, " proto %s ",
+   rtnl_rtprot_n2a(frh->proto, b1, sizeof(b1)));
+
fprintf(fp, "\n");
fflush(fp);
return 0;
@@ -391,6 +398,9 @@ static int flush_rule(const struct sockaddr_nl *who, struct nlmsghdr *n,
 
parse_rtattr(tb, FRA_MAX, RTM_RTA(frh), len);
 
+   if ((filter.protocol^frh->proto))
+   return 0;
+
if (tb[FRA_PRIORITY]) {
n->nlmsg_type = RTM_DELRULE;
n->nlmsg_flags = NLM_F_REQUEST;
@@ -415,12 +425,6 @@ static int iprule_list_flush_or_save(int argc, char **argv, int action)
if (af == AF_UNSPEC)
af = AF_INET;
 
-   if (action != IPRULE_LIST && argc > 0) {
-   fprintf(stderr, "\"ip rule %s\" does not take any arguments.\n",
-   action == IPRULE_SAVE ? "save" : "flush");
-   return -1;
-   }
-
switch (action) {
case IPRULE_SAVE:
if (save_rule_prep())
@@ -508,7 +512,18 @@ static int iprule_list_flush_or_save(int argc, char **argv, int action)
NEXT_ARG();
if (get_prefix(, *argv, af))
invarg("from value is invalid\n", *argv);
-   } else {
+   } else if (matches(*argv, "protocol") == 0) {
+   __u32 prot;
+   NEXT_ARG();
+   filter.protocolmask = -1;
+   if (rtnl_rtprot_a2n(, *argv)) {
+   if (strcmp(*argv, "all") != 0)
+   invarg("invalid \"protocol\"\n", *argv);
+   prot = 0;
+   filter.protocolmask = 0;
+   }
+   filter.protocol = prot;
+   } else{
if (matches(*argv, "dst") == 0 ||
matches(*argv, "to") == 0) {
NEXT_ARG();
diff --git a/man/man8/ip-rule.8 b/man/man8/ip-rule.8
index a5c47981..98b2573d 100644
--- a/man/man8/ip-rule.8
+++ b/man/man8/ip-rule.8
@@ -50,6 +50,8 @@ ip-rule \- routing policy database management
 .IR ACTION " := [ "
 .B  table
 .IR TABLE_ID " ] [ "
+.B  protocol
+.IR RPROTO " ] [ "
 .B  nat
 .IR ADDRESS " ] [ "
 .B realms
@@ -240,6 +242,10 @@ The options preference and order are synonyms with priority.
 the routing table identifier to lookup if the rule selector matches.
 It is also possible to use lookup instead of table.
 
+.TP
+.BI protocol " RPROTO"
+the protocol that installed the rule in question.
+
 .TP
 .BI suppress_prefixlength " NUMBER"
 reject routing decisions that have a prefix length of NUMBER or less.
@@ -275,7 +281,11 @@ updates, it flushes the routing cache with
 .RE
 .TP
 .B ip rule flush - also dumps all the deleted rules.
-This command has no arguments.
+.RS
+.TP
+.BI protocol " RPROTO"
+Select the originating protocol.
+.RE
 .TP
 .B ip rule show - list rules
 This command has no arguments.
@@ -283,6 +293,12 @@ The options list or lst are synonyms with show.
 
 .TP
 .B ip rule save
+.RS
+.TP
+.BI protocol " RPROTO"
+Select the originating protocol.
+.RE
+.TP
 save rules table information to stdout
 .RS
 This command behaves like
-- 
2.14.3
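
For illustration, the two new filter fields could be combined with a
mask-and-compare, as sketched below. This is an assumption about the
intent of protocolmask, not what the flush_rule() hunk above does (that
hunk tests filter.protocol ^ frh->proto without the mask). The struct
and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirror of the two fields this patch adds to the filter. */
struct proto_filter {
	int protocol;      /* protocol number to match */
	int protocolmask;  /* -1: match exactly, 0: match any ("all" or unset) */
};

/* Returns true when a rule carrying 'proto' passes the filter.
 * XOR yields the differing bits; the mask selects which of them matter. */
static bool proto_matches(const struct proto_filter *f, uint8_t proto)
{
	return ((f->protocol ^ proto) & f->protocolmask) == 0;
}
```

With a zero mask every rule passes, so a flush without a protocol
argument would still cover rules that carry a non-zero proto.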

[PATCH v2 iproute2-next 0/3] Allow 'ip rule' command to use protocol

2018-02-21 Thread Donald Sharp

Fix iprule.c to use the actual `struct fib_rule_hdr` and to
allow the end user to see and use the protocol keyword
for rule manipulations.

v2: Rearrange and code changes as per David Ahern

Donald Sharp (3):
  ip: Use the `struct fib_rule_hdr` for rules
  ip: Display ip rule protocol used
  ip: Allow rules to accept a specified protocol

 include/uapi/linux/fib_rules.h |   2 +-
 ip/iprule.c| 164 -
 man/man8/ip-rule.8 |  18 -
 3 files changed, 114 insertions(+), 70 deletions(-)

-- 
2.14.3

[PATCH v2 iproute2-next 1/3] ip: Use the `struct fib_rule_hdr` for rules

2018-02-21 Thread Donald Sharp

The iprule.c code was using `struct rtmsg` as the data
type to pass into the kernel for the netlink message.
While 'struct rtmsg' and `struct fib_rule_hdr` are
the same size and mostly the same, we should use
the correct data structure.  This commit translates
the data structures to have iprule.c use the correct
one.
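
The "same size and mostly the same" claim can be checked with a small
layout sketch. The two structs below are local mirrors written for
illustration (the *_mirror names are hypothetical); the rtmsg fields
follow the uapi rtnetlink header as I understand it, and the
fib_rule_hdr fields follow the fib_rules.h hunk in this patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Mirror of struct rtmsg (assumed layout, for illustration only). */
struct rtmsg_mirror {
	uint8_t  rtm_family, rtm_dst_len, rtm_src_len, rtm_tos;
	uint8_t  rtm_table, rtm_protocol, rtm_scope, rtm_type;
	uint32_t rtm_flags;
};

/* Mirror of struct fib_rule_hdr after this patch (res2 became proto). */
struct fib_rule_hdr_mirror {
	uint8_t  family, dst_len, src_len, tos;
	uint8_t  table, proto, res1, action;
	uint32_t flags;
};

/* Same overall size, and the fields iprule.c reads sit at the same offsets. */
_Static_assert(sizeof(struct rtmsg_mirror) ==
	       sizeof(struct fib_rule_hdr_mirror), "size differs");
_Static_assert(offsetof(struct rtmsg_mirror, rtm_table) ==
	       offsetof(struct fib_rule_hdr_mirror, table), "table offset");
_Static_assert(offsetof(struct rtmsg_mirror, rtm_protocol) ==
	       offsetof(struct fib_rule_hdr_mirror, proto), "proto offset");
_Static_assert(offsetof(struct rtmsg_mirror, rtm_flags) ==
	       offsetof(struct fib_rule_hdr_mirror, flags), "flags offset");
```

This is why the old code happened to work, and why switching types is a
readability and correctness cleanup rather than a wire-format change.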

Additionally copy over the modified fib_rules.h file

Signed-off-by: Donald Sharp 
---
 include/uapi/linux/fib_rules.h |   2 +-
 ip/iprule.c| 129 ++---
 2 files changed, 69 insertions(+), 62 deletions(-)

diff --git a/include/uapi/linux/fib_rules.h b/include/uapi/linux/fib_rules.h
index 2b642bf9..92553917 100644
--- a/include/uapi/linux/fib_rules.h
+++ b/include/uapi/linux/fib_rules.h
@@ -23,8 +23,8 @@ struct fib_rule_hdr {
__u8 tos;
 
__u8 table;
+   __u8 proto;
__u8 res1;   /* reserved */
-   __u8 res2;   /* reserved */
__u8 action;
 
__u32   flags;
diff --git a/ip/iprule.c b/ip/iprule.c
index a3abf2f6..00a6c26a 100644
--- a/ip/iprule.c
+++ b/ip/iprule.c
@@ -73,25 +73,33 @@ static struct
inet_prefix dst;
 } filter;
 
+static inline int frh_get_table(struct fib_rule_hdr *frh, struct rtattr **tb)
+{
+   __u32 table = frh->table;
+   if (tb[RTA_TABLE])
+   table = rta_getattr_u32(tb[RTA_TABLE]);
+   return table;
+}
+
 static bool filter_nlmsg(struct nlmsghdr *n, struct rtattr **tb, int host_len)
 {
-   struct rtmsg *r = NLMSG_DATA(n);
+   struct fib_rule_hdr *frh = NLMSG_DATA(n);
__u32 table;
 
-   if (preferred_family != AF_UNSPEC && r->rtm_family != preferred_family)
+   if (preferred_family != AF_UNSPEC && frh->family != preferred_family)
return false;
 
if (filter.prefmask &&
filter.pref ^ (tb[FRA_PRIORITY] ? rta_getattr_u32(tb[FRA_PRIORITY]) : 0))
return false;
-   if (filter.not && !(r->rtm_flags & FIB_RULE_INVERT))
+   if (filter.not && !(frh->flags & FIB_RULE_INVERT))
return false;
 
if (filter.src.family) {
inet_prefix *f_src = 
 
-   if (f_src->family != r->rtm_family ||
-   f_src->bitlen > r->rtm_src_len)
+   if (f_src->family != frh->family ||
+   f_src->bitlen > frh->src_len)
return false;
 
if (inet_addr_match_rta(f_src, tb[FRA_SRC]))
@@ -101,15 +109,15 @@ static bool filter_nlmsg(struct nlmsghdr *n, struct rtattr **tb, int host_len)
if (filter.dst.family) {
inet_prefix *f_dst = 
 
-   if (f_dst->family != r->rtm_family ||
-   f_dst->bitlen > r->rtm_dst_len)
+   if (f_dst->family != frh->family ||
+   f_dst->bitlen > frh->dst_len)
return false;
 
if (inet_addr_match_rta(f_dst, tb[FRA_DST]))
return false;
}
 
-   if (filter.tosmask && filter.tos ^ r->rtm_tos)
+   if (filter.tosmask && filter.tos ^ frh->tos)
return false;
 
if (filter.fwmark) {
@@ -159,7 +167,7 @@ static bool filter_nlmsg(struct nlmsghdr *n, struct rtattr **tb, int host_len)
return false;
}
 
-   table = rtm_get_table(r, tb);
+   table = frh_get_table(frh, tb);
if (filter.tb > 0 && filter.tb ^ table)
return false;
 
@@ -169,7 +177,7 @@ static bool filter_nlmsg(struct nlmsghdr *n, struct rtattr **tb, int host_len)
 int print_rule(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
 {
FILE *fp = (FILE *)arg;
-   struct rtmsg *r = NLMSG_DATA(n);
+   struct fib_rule_hdr *frh = NLMSG_DATA(n);
int len = n->nlmsg_len;
int host_len = -1;
__u32 table;
@@ -180,13 +188,13 @@ int print_rule(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
if (n->nlmsg_type != RTM_NEWRULE && n->nlmsg_type != RTM_DELRULE)
return 0;
 
-   len -= NLMSG_LENGTH(sizeof(*r));
+   len -= NLMSG_LENGTH(sizeof(*frh));
if (len < 0)
return -1;
 
-   parse_rtattr(tb, FRA_MAX, RTM_RTA(r), len);
+   parse_rtattr(tb, FRA_MAX, RTM_RTA(frh), len);
 
-   host_len = af_bit_len(r->rtm_family);
+   host_len = af_bit_len(frh->family);
 
if (!filter_nlmsg(n, tb, host_len))
return 0;
@@ -200,41 +208,41 @@ int print_rule(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
else
fprintf(fp, "0:\t");
 
-   if (r->rtm_flags & FIB_RULE_INVERT)
+   if (frh->flags & FIB_RULE_INVERT)
fprintf(fp, "not ");
 
if (tb[FRA_SRC]) {
-   if (r->rtm_src_len != host_len) {
+   if (frh->src_len != host_len) {
fprintf(fp,

[PATCH v2 iproute2-next 3/3] ip: Allow rules to accept a specified protocol

2018-02-21 Thread Donald Sharp

Allow the specification of a protocol when the user
adds/modifies/deletes a rule.

Signed-off-by: Donald Sharp 
---
 ip/iprule.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/ip/iprule.c b/ip/iprule.c
index 39008768..192fe215 100644
--- a/ip/iprule.c
+++ b/ip/iprule.c
@@ -683,6 +683,12 @@ static int iprule_modify(int cmd, int argc, char **argv)
if (get_rt_realms_or_raw(, *argv))
invarg("invalid realms\n", *argv);
addattr32(, sizeof(req), FRA_FLOW, realm);
+   } else if (matches(*argv, "protocol") == 0) {
+   __u32 proto;
+   NEXT_ARG();
+   if (rtnl_rtprot_a2n(, *argv))
+   invarg("\"protocol\" value is invalid\n", *argv);
+   req.frh.proto = proto;
} else if (matches(*argv, "table") == 0 ||
   strcmp(*argv, "lookup") == 0) {
NEXT_ARG();
-- 
2.14.3

Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-21 Thread Jakub Kicinski

On Wed, 21 Feb 2018 12:57:09 -0800, Alexander Duyck wrote:
> > I don't see why the team cannot be there always.  
> 
> It is more the logistical nightmare. Part of the goal here was to work
> with the cloud base images that are out there such as
> https://alt.fedoraproject.org/cloud/. With just the kernel changes the
> overhead for this stays fairly small and would be pulled in as just a
> standard part of the kernel update process. The virtio bypass only
> pops up if the backup bit is present. With the team solution it
> requires that the base image use the team driver on virtio_net when it
> sees one. I doubt the OSVs would want to do that just because SR-IOV
> isn't that popular of a case.

IIUC we need to monitor for a "backup hint", spawn the master, rename it
to maintain backwards compatibility with no-VF setups and enslave the VF
if it appears.

All those sound possible from user space, the advantage of the kernel
solution right now is that it has more complete code.

Am I misunderstanding?

Re: [PATCH RFC PoC 0/3] nftables meets bpf

2018-02-21 Thread Jakub Kicinski

On Wed, 21 Feb 2018 16:30:07 -0800, Florian Fainelli wrote:
> On 02/21/2018 03:46 PM, Jakub Kicinski wrote:
> > On Tue, 20 Feb 2018 11:58:22 +0100, Pablo Neira Ayuso wrote:  
> >> We also have a large range of TCAM based hardware offload out there
> >> that will _not_ work with your BPF HW offload infrastructure. What
> >> this bpf infrastructure pushes into the kernel is just a blob
> >> expressing things in a very low-level instruction-set: trying to find
> >> a mapping of that to typical HW intermediate representations in the
> >> TCAM based HW offload world will be simply crazy.  
> > 
> > I'm not sure where the TCAM talk is coming from.  Think much smaller -
> > cellular modems/phone SoCs, 32bit ARM/MIPS router box CPUs.  The
> > information the verifier is gathering will be crucial for optimizing
> > those.  Please don't discount the value of being able to use
> > heterogeneous processing units by the networking stack.
> 
> The only use case that we have a good answer for is when there is no HW
> offload capability available, because there, we know that eBPF is our
> best possible solution for a software fast path, in large part because
> of all the efforts that went into making it both safe and fast.

I was trying to point out that JITing eBPF for the host on 32 bit
systems is already a pain. Jiong Wang is leading an effort to improve
this both from LLVM and verifier angles, IOW running through the
verifier may become useful even for host JITs :)

> When there is offloading HW available, there does not appear to be a
> perfect answer to this problem of, given a standard Linux utility that
> can express any sort of match + action, be it ethtool::rxnfc,
> tc/cls_{u32,flower}, nftables, how do I transform that into what makes
> most sense to my HW? You could:
> 
> - have hardware that understands BPF bytecode directly, great, then you
> don't have to do anything, just pass it up the driver baby, oh wait,
> it's not that simple, the NFP driver is not small

True, it's not the largest but fair point, IMHO we should be trying to
push for sharing as much code between drivers as possible, and on all
fronts, but that's a topic for another time...

> - transform BPF back into something that your hardware understands, does
> that belong in the kernel? Maybe, maybe not

Personally, I think there is a non-zero probability of AMP CPUs/systems
becoming more common.  NFP is very powerful and fast, but a less advanced
solution may just use an off-the-shelf MIPS/ARM/Andes core.  Taking it
slightly further from home to the cellular/WiFi wake up problem which
was mentioned by Android folks at one of the netdevs - if we have a
MIPS/ARM/Andes *host* JIT in the kernel, and the NIC processor is built
on one of those, all the driver needs to provide is some glue and we can
offload filtering to the MCU on the NIC/modem!

> - use a completely different intermediate representation like P4,
> brainfuck, I don't know
>
> Maybe first things first, we have at least 3 different programming
> interfaces, if not more: ethtool::rxnfc, tc/cls_{u32,flower}, nftables
> that are all capable of programming TCAMs and hardware capable of match
> + action, how about we start with having some sort of common library
> code that:
> 
> - validates input parameters against HW capabilities

This one may be quite hard.

> - does the adequate transformation from any of these interfaces into a
> generic set of input parameters
> - define what the appropriate behavior is when programming through all
> of these 3 interfaces that ultimately access the same shared piece of
> HW, and therefore need to manage resources allocation?

That would be great! :)  Flower stands out today as the most feature
rich and a go-to for TCAM offloads.

>

Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-21 Thread Siwei Liu

On Wed, Feb 21, 2018 at 4:17 PM, Alexander Duyck
 wrote:
> On Wed, Feb 21, 2018 at 3:50 PM, Siwei Liu  wrote:
>> I haven't checked emails for days and did not realize the new revision
>> had already come out. And thank you for the effort, this revision
>> really looks to be a step forward towards our use case and is close to
>> what we wanted to do. A few questions in line.
>>
>> On Sat, Feb 17, 2018 at 9:12 AM, Alexander Duyck
>>  wrote:
>>> On Fri, Feb 16, 2018 at 6:38 PM, Jakub Kicinski  wrote:
>>>> On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote:
>>>>> Patch 2 is in response to the community request for a 3 netdev
>>>>> solution.  However, it creates some issues we'll get into in a moment.
>>>>> It extends virtio_net to use alternate datapath when available and
>>>>> registered. When BACKUP feature is enabled, virtio_net driver creates
>>>>> an additional 'bypass' netdev that acts as a master device and controls
>>>>> 2 slave devices.  The original virtio_net netdev is registered as
>>>>> 'backup' netdev and a passthru/vf device with the same MAC gets
>>>>> registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>>>>> associated with the same 'pci' device.  The user accesses the network
>>>>> interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
>>>>> as default for transmits when it is available with link up and running.
>>>>
>>>> Thank you for doing this.
>>>>
>>>>> We noticed a couple of issues with this approach during testing.
>>>>> - As both 'bypass' and 'backup' netdevs are associated with the same
>>>>>   virtio pci device, udev tries to rename both of them with the same name
>>>>>   and the 2nd rename will fail. This would be OK as long as the first netdev
>>>>>   to be renamed is the 'bypass' netdev, but the order in which udev gets
>>>>>   to rename the 2 netdevs is not reliable.
>>>>
>>>> Out of curiosity - why do you link the master netdev to the virtio
>>>> struct device?
>>>
>>> The basic idea of all this is that we wanted this to work with an
>>> existing VM image that was using virtio. As such we were trying to
>>> make it so that the bypass interface takes the place of the original
>>> virtio and get udev to rename the bypass to what the original
>>> virtio_net was.
>>
>> Could it also be made possible to take over the config from VF instead
>> of virtio on an existing VM image? And get udev to rename the bypass
>> netdev to what the original VF was. I don't say tightly binding the
>> bypass master to only virtio or VF, but I think we should provide both
>> options to support different upgrade paths. Possibly we could tweak
>> the device tree layout to reuse the same PCI slot for the master
>> bypass netdev, such that udev would not get confused when renaming the
>> device. The VF needs to use a different function slot afterwards.
>> Perhaps we might need a special multiseat-like QEMU device for that
>> purpose?
>>
>> In our case we'll upgrade the config from VF to virtio-bypass directly.
>
> So if I am understanding what you are saying you are wanting to flip
> the backup interface from the virtio to a VF. The problem is that
> becomes a bit of a vendor lock-in solution since it would rely on a
> specific VF driver. I would agree with Jiri that we don't want to go
> down that path. We don't want every VF out there firing up its own
> separate bond. Ideally you want the hypervisor to be able to manage
> all of this which is why it makes sense to have virtio manage this and
> why this is associated with the virtio_net interface.

No, that's not what I was talking about of course. I thought you
mentioned the upgrade scenario this patch would like to address is to
use the bypass interface "to take the place of the original virtio,
and get udev to rename the bypass to what the original virtio_net
was". That is one of the possible upgrade paths for sure. However the
upgrade path I was seeking is to use the bypass interface to take the
place of the original VF interface while retaining the name and network
configs, which generally can be done simply with a kernel upgrade. It
would become limiting as this patch makes the bypass interface share
the same virtio pci device with the virtio backup. Can this bypass
interface be made general enough to take the place of any pci device
other than virtio-net? This will be more helpful as cloud users who have
an existing setup on a VF interface don't have to recreate it on
virtio-net and VF separately again.

>
> The other bits get into more complexity than we are ready to handle
> for now. I think I might have talked about something similar that I
> was referring to as a "virtio-bond" where you would have a PCI/PCIe
> tree topology that makes this easier to sort out, and the "virtio-bond"
> would be used to handle coordination/configuration of a much more
> complex interface.

That was one way to solve this problem but I'd like to see

Bug with 'ip' command where build a stuck list of interfaces

2018-02-21 Thread Rm Beer

Hello.
Running 'ip tunnel add tun0 mode ipip dev enp1s0' creates two interfaces,
'tunl0' and 'tun0'.
Later, 'ip tunnel del tun0' removes only 'tun0' and leaves
'tunl0' in the list.
After this, nothing more can be done with that entry: it cannot be
created and it cannot be deleted. It is stuck.

# ip link list
:
:
:
4: tunl0@NONE:  mtu 1480 qdisc noop state DOWN mode DEFAULT
group default qlen 1000
link/ipip 0.0.0.0 brd 0.0.0.0

[PATCH v2 net-next] net: dsa: mv88e6xxx: scratch registers and external MDIO pins

2018-02-21 Thread Andrew Lunn

MV88E6352 and later switches support GPIO control through the "Scratch
& Misc" global2 register. Two of the pins controlled this way on the
mv88e6390 family are the external MDIO pins. They can either be used
as part of the MII interface for port 0, GPIOs, or MDIO. Add a
function to configure them for MDIO, if possible, and call it when
registering the external MDIO bus.
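
The mux decision made by the new function can be sketched as a pure
function. This is a simplified model, not the driver code: the bit value
is a placeholder, the helper name is hypothetical, and clearing the bit
in the "else" case is an assumption. The inversion follows the comment
in the patch that NO_CPU being 0 inverts the meaning of the bit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder for the NORMALSMI bit in the misc config scratch register. */
#define SMI_MUX_BIT 0x80u

/* Returns the new misc config value for the requested pin usage.
 * external: true to route the pins to the external SMI (MDIO) bus.
 * no_cpu:   state of the NO_CPU configuration bit read from the chip. */
static uint8_t smi_mux_apply(uint8_t misc_cfg, bool external, bool no_cpu)
{
	/* NO_CPU being 0 inverts the meaning of the bit. */
	if (!no_cpu)
		external = !external;

	if (external)
		misc_cfg |= SMI_MUX_BIT;
	else
		misc_cfg &= (uint8_t)~SMI_MUX_BIT;	/* assumed: clear the bit */

	return misc_cfg;
}
```

The real function additionally returns -EBUSY before touching the
register when the port 0 mode field reads 0x01 or 0x02.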

Suggested-by: Russell King 
Signed-off-by: Andrew Lunn 
---
v2: Make stub function static inline, as reported by 0-day.

 drivers/net/dsa/mv88e6xxx/chip.c|  9 +
 drivers/net/dsa/mv88e6xxx/global2.h | 14 
 drivers/net/dsa/mv88e6xxx/global2_scratch.c | 51 +
 3 files changed, 74 insertions(+)

diff --git a/drivers/net/dsa/mv88e6xxx/chip.c b/drivers/net/dsa/mv88e6xxx/chip.c
index 39c7ad7e490f..e1b5c5c66fce 100644
--- a/drivers/net/dsa/mv88e6xxx/chip.c
+++ b/drivers/net/dsa/mv88e6xxx/chip.c
@@ -2165,6 +2165,15 @@ static int mv88e6xxx_mdio_register(struct mv88e6xxx_chip *chip,
struct mii_bus *bus;
int err;
 
+   if (external) {
+   mutex_lock(>reg_lock);
+   err = mv88e6xxx_g2_scratch_gpio_set_smi(chip, true);
+   mutex_unlock(>reg_lock);
+
+   if (err)
+   return err;
+   }
+
bus = devm_mdiobus_alloc_size(chip->dev, sizeof(*mdio_bus));
if (!bus)
return -ENOMEM;
diff --git a/drivers/net/dsa/mv88e6xxx/global2.h b/drivers/net/dsa/mv88e6xxx/global2.h
index 25f92b3d7157..d85c91036c0f 100644
--- a/drivers/net/dsa/mv88e6xxx/global2.h
+++ b/drivers/net/dsa/mv88e6xxx/global2.h
@@ -266,6 +266,11 @@
 #define MV88E6352_G2_SCRATCH_GPIO_PCTL5 0x6D
 #define MV88E6352_G2_SCRATCH_GPIO_PCTL6 0x6E
 #define MV88E6352_G2_SCRATCH_GPIO_PCTL7 0x6F
+#define MV88E6352_G2_SCRATCH_CONFIG_DATA0  0x70
+#define MV88E6352_G2_SCRATCH_CONFIG_DATA1  0x71
+#define MV88E6352_G2_SCRATCH_CONFIG_DATA1_NO_CPU   BIT(2)
+#define MV88E6352_G2_SCRATCH_CONFIG_DATA2  0x72
+#define MV88E6352_G2_SCRATCH_CONFIG_DATA2_P0_MODE_MASK 0x3
 
 #define MV88E6352_G2_SCRATCH_GPIO_PCTL_GPIO 0
 #define MV88E6352_G2_SCRATCH_GPIO_PCTL_TRIG 1
@@ -325,6 +330,9 @@ extern const struct mv88e6xxx_avb_ops mv88e6390_avb_ops;
 
 extern const struct mv88e6xxx_gpio_ops mv88e6352_gpio_ops;
 
+int mv88e6xxx_g2_scratch_gpio_set_smi(struct mv88e6xxx_chip *chip,
+ bool external);
+
 #else /* !CONFIG_NET_DSA_MV88E6XXX_GLOBAL2 */
 
 static inline int mv88e6xxx_g2_require(struct mv88e6xxx_chip *chip)
@@ -465,6 +473,12 @@ static const struct mv88e6xxx_avb_ops mv88e6390_avb_ops = {};
 
 static const struct mv88e6xxx_gpio_ops mv88e6352_gpio_ops = {};
 
+static inline int mv88e6xxx_g2_scratch_gpio_set_smi(struct mv88e6xxx_chip *chip,
+   bool external)
+{
+   return -EOPNOTSUPP;
+}
+
 #endif /* CONFIG_NET_DSA_MV88E6XXX_GLOBAL2 */
 
 #endif /* _MV88E6XXX_GLOBAL2_H */
diff --git a/drivers/net/dsa/mv88e6xxx/global2_scratch.c b/drivers/net/dsa/mv88e6xxx/global2_scratch.c
index 0ff12bff9f0e..3f92b8892dc7 100644
--- a/drivers/net/dsa/mv88e6xxx/global2_scratch.c
+++ b/drivers/net/dsa/mv88e6xxx/global2_scratch.c
@@ -238,3 +238,54 @@ const struct mv88e6xxx_gpio_ops mv88e6352_gpio_ops = {
.get_pctl = mv88e6352_g2_scratch_gpio_get_pctl,
.set_pctl = mv88e6352_g2_scratch_gpio_set_pctl,
 };
+
+/**
+ * mv88e6xxx_g2_scratch_gpio_set_smi - set gpio muxing for external smi
+ * @chip: chip private data
+ * @external: set mux for external smi, or free for gpio usage
+ *
+ * Some mv88e6xxx models have GPIO pins that may be configured as
+ * an external SMI interface, or they may be made free for other
+ * GPIO uses.
+ */
+int mv88e6xxx_g2_scratch_gpio_set_smi(struct mv88e6xxx_chip *chip,
+ bool external)
+{
+   int misc_cfg = MV88E6352_G2_SCRATCH_MISC_CFG;
+   int config_data1 = MV88E6352_G2_SCRATCH_CONFIG_DATA1;
+   int config_data2 = MV88E6352_G2_SCRATCH_CONFIG_DATA2;
+   bool no_cpu;
+   u8 p0_mode;
+   int err;
+   u8 val;
+
+   err = mv88e6xxx_g2_scratch_read(chip, config_data2, );
+   if (err)
+   return err;
+
+   p0_mode = val & MV88E6352_G2_SCRATCH_CONFIG_DATA2_P0_MODE_MASK;
+
+   if (p0_mode == 0x01 || p0_mode == 0x02)
+   return -EBUSY;
+
+   err = mv88e6xxx_g2_scratch_read(chip, config_data1, );
+   if (err)
+   return err;
+
+   no_cpu = !!(val & MV88E6352_G2_SCRATCH_CONFIG_DATA1_NO_CPU);
+
+   err = mv88e6xxx_g2_scratch_read(chip, misc_cfg, );
+   if (err)
+   return err;
+
+   /* NO_CPU being 0 inverts the meaning of the bit */
+   if (!no_cpu)
+   external = !external;
+
+   if (external)
+   val |= MV88E6352_G2_SCRATCH_MISC_CFG_NORMALSMI;
+   else
+

Re: [PATCH] selftests/bpf: update gitignore with test_libbpf_open

2018-02-21 Thread Daniel Borkmann

On 02/22/2018 01:37 AM, Shuah Khan wrote:
> On 02/21/2018 05:33 PM, Daniel Borkmann wrote:
>> Hi Shuah,
>>
>> On 02/22/2018 12:03 AM, Shuah Khan wrote:
>>> On 02/21/2018 03:48 PM, David Miller wrote:
>>>> From: Anders Roxell 
>>>> Date: Wed, 21 Feb 2018 22:30:01 +0100
>>>>
>>>>> bpf builds a test program for loading BPF ELF files. Add the executable
>>>>> to the .gitignore list.
>>>>>
>>>>> Signed-off-by: Anders Roxell 
>>>>
>>>> Acked-by: David S. Miller 
>>>
>>> Thanks. I will pull this in for 4.16-rc
>>
>> I would have taken this into bpf tree, but fair enough. This one
>> in particular doesn't cause any conflicts right now, so feel free
>> to pick it up then. But usual paths are via bpf / bpf-next as we
>> otherwise would run into real ugly merge conflicts.
> 
> Daniel,
> 
> Go for it. I know bpf tree stuff causes conflicts. That is why I usually
> stay away from bpfs selftests and let you handle them.
> 
> You are taking the other one away, pick this up as well.

Okay, great, thanks for your understanding, Shuah. Just applied
to bpf tree.

> Acked-by: Shuah Khan 
> 
> thanks,
> -- Shuah
>

Re: [PATCH net-next] RDS: deliver zerocopy completion notification with data as an optimization

2018-02-21 Thread Willem de Bruijn

On Wed, Feb 21, 2018 at 7:26 PM, Sowmini Varadhan
 wrote:
> On (02/21/18 18:45), Willem de Bruijn wrote:
>>
>> I do mean returning 0 instead of -EAGAIN if control data is ready.
>> Something like
>>
>> @@ -611,7 +611,8 @@ int rds_recvmsg(struct socket *sock, struct msghdr
>> *msg, size_t size,
>>
>> if (!rds_next_incoming(rs, )) {
>> if (nonblock) {
>> -   ret = -EAGAIN;
>> +   ncookies = rds_recvmsg_zcookie(rs, msg);
>> +   ret = ncookies ? 0 : -EAGAIN;
>> break;
>> }
>
> Yes, but you now have an implicit branch based on ncookies, so I'm
> not sure it saved all that much?

At least it removes the extra list empty check in the hot path and
relegates this to the relatively rare branch when the queue is empty
and the syscall is non-blocking.

> like I said let me revisit this

Okay. I won't harp on this further.

>> By the way, the put_cmsg is unconditional even if the caller did
>> not supply msg_control. So it is basically no longer safe to ever
>> call read, recv or recvfrom on a socket if zerocopy notifications
>> are outstanding.
>
> Wait, I thought put_cmsg already checks for these things.

It does, and sets MSG_CTRUNC to signal that it was unable to
write all control data. But by then the notifications have already
been dequeued.
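
For reference, a receiver can detect that truncation after the call by
testing msg_flags. A minimal userspace sketch (the helper name is made
up for illustration; MSG_CTRUNC and struct msghdr are the standard
socket API):

```c
#include <stdbool.h>
#include <sys/socket.h>

/* Call after recvmsg() has filled in 'msg'. Returns true when the kernel
 * had more control data than fit in the buffer the caller supplied. */
static bool cmsg_truncated(const struct msghdr *msg)
{
	return (msg->msg_flags & MSG_CTRUNC) != 0;
}
```

Detecting it does not bring the dropped notifications back, which is
the concern here.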

>> It is possible to check msg_controllen before even deciding whether
>> to try to dequeue notifications (and take the lock). I see that this is
>> not common. But RDS of all cases seems to do this, in
>> rds_notify_queue_get:
>
> yes the comment above that code suggests that this bit of code
> was done to avoid calling put_cmsg while holding the rs_lock.
>
> One bit of administrivia though- if I now drop sk_error_queue for
> PF_RDS, I'll have to fix selftests in the same patch too, so the
> patch will get a bit bulky (and thus a bit more difficult to review).

Understood. It might be cleanest to split into three patches. One
revert of the error queue code, one new feature and one update
to the test.

Re: [PATCH] selftests/bpf: update gitignore with test_libbpf_open

2018-02-21 Thread Shuah Khan

On 02/21/2018 05:33 PM, Daniel Borkmann wrote:
> Hi Shuah,
> 
> On 02/22/2018 12:03 AM, Shuah Khan wrote:
>> On 02/21/2018 03:48 PM, David Miller wrote:
>>> From: Anders Roxell 
>>> Date: Wed, 21 Feb 2018 22:30:01 +0100
>>>
>>>> bpf builds a test program for loading BPF ELF files. Add the executable
>>>> to the .gitignore list.
>>>>
>>>> Signed-off-by: Anders Roxell 
>>>
>>> Acked-by: David S. Miller 
>>
>> Thanks. I will pull this in for 4.16-rc
> 
> I would have taken this into bpf tree, but fair enough. This one
> in particular doesn't cause any conflicts right now, so feel free
> to pick it up then. But usual paths are via bpf / bpf-next as we
> otherwise would run into real ugly merge conflicts.
> 

Daniel,

Go for it. I know bpf tree stuff causes conflicts. That is why I usually
stay away from bpfs selftests and let you handle them.

You are taking the other one away, pick this up as well.

Acked-by: Shuah Khan 

thanks,
-- Shuah

Re: [PATCH] selftests/bpf: update gitignore with test_libbpf_open

2018-02-21 Thread Daniel Borkmann

Hi Shuah,

On 02/22/2018 12:03 AM, Shuah Khan wrote:
> On 02/21/2018 03:48 PM, David Miller wrote:
>> From: Anders Roxell 
>> Date: Wed, 21 Feb 2018 22:30:01 +0100
>>
>>> bpf builds a test program for loading BPF ELF files. Add the executable
>>> to the .gitignore list.
>>>
>>> Signed-off-by: Anders Roxell 
>>
>> Acked-by: David S. Miller 
> 
> Thanks. I will pull this in for 4.16-rc

I would have taken this into bpf tree, but fair enough. This one
in particular doesn't cause any conflicts right now, so feel free
to pick it up then. But usual paths are via bpf / bpf-next as we
otherwise would run into real ugly merge conflicts.

Thanks,
Daniel

Re: [PATCH RFC PoC 0/3] nftables meets bpf

2018-02-21 Thread Florian Fainelli

On 02/21/2018 03:46 PM, Jakub Kicinski wrote:
> On Tue, 20 Feb 2018 11:58:22 +0100, Pablo Neira Ayuso wrote:
>> We also have a large range of TCAM based hardware offload out there
>> that will _not_ work with your BPF HW offload infrastructure. What
>> this bpf infrastructure pushes into the kernel is just a blob
>> expressing things in a very low-level instruction-set: trying to find
>> a mapping of that to typical HW intermediate representations in the
>> TCAM based HW offload world will be simply crazy.
> 
> I'm not sure where the TCAM talk is coming from.  Think much smaller -
> cellular modems/phone SoCs, 32bit ARM/MIPS router box CPUs.  The
> information the verifier is gathering will be crucial for optimizing
> those.  Please don't discount the value of being able to use
> heterogeneous processing units by the networking stack.
> 

The only use case that we have a good answer for is when there is no HW
offload capability available, because there, we know that eBPF is our
best possible solution for a software fast path, in large part because
of all the efforts that went into making it both safe and fast.

When there is offloading HW available, there does not appear to be a
perfect answer to this problem of, given a standard Linux utility that
can express any sort of match + action, be it ethtool::rxnfc,
tc/cls_{u32,flower}, nftables, how do I transform that into what makes
most sense to my HW? You could:

- have hardware that understands BPF bytecode directly, great, then you
don't have to do anything, just pass it up the driver baby, oh wait,
it's not that simple, the NFP driver is not small

- transform BPF back into something that your hardware understands, does
that belong in the kernel? Maybe, maybe not

- use a completely different intermediate representation like P4,
brainfuck, I don't know

Maybe first things first, we have at least 3 different programming
interfaces, if not more: ethtool::rxnfc, tc/cls_{u32,flower}, nftables
that are all capable of programming TCAMs and hardware capable of match
+ action, how about we start with having some sort of common library
code that:

- validates input parameters against HW capabilities
- does the adequate transformation from any of these interfaces into a
generic set of input parameters
- defines what the appropriate behavior is when programming through all
of these 3 interfaces that ultimately access the same shared piece of
HW, and therefore need to manage resource allocation?
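
The three library duties listed above could be sketched roughly as follows. This is only an illustration under stated assumptions: `struct demo_rule`, `struct demo_hw_caps` and `demo_rule_install()` are hypothetical names, not an existing kernel API, and the "generic set of input parameters" is reduced to a match bitmask plus one action.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical generic rule that an ethtool, tc or nftables front end
 * would be translated into before it reaches the driver. */
struct demo_rule {
	uint32_t match_fields;  /* bitmask of fields the rule matches on */
	uint32_t action;        /* action index, assumed to be below 32 */
};

/* Hypothetical description of what one piece of hardware can do. */
struct demo_hw_caps {
	uint32_t match_fields;  /* fields the hardware can match on */
	uint32_t actions;       /* bitmask of supported action indexes */
	unsigned int max_rules; /* shared table size */
	unsigned int used_rules;/* entries already taken by any interface */
};

/* Validate a rule against the capabilities and account for the shared
 * table, whichever interface the rule came from. */
static int demo_rule_install(struct demo_hw_caps *hw,
			     const struct demo_rule *rule)
{
	if (rule->match_fields & ~hw->match_fields)
		return -EOPNOTSUPP;     /* unsupported match field */
	if (rule->action >= 32 || !(hw->actions & (1u << rule->action)))
		return -EOPNOTSUPP;     /* unsupported action */
	if (hw->used_rules >= hw->max_rules)
		return -ENOSPC;         /* shared resource exhausted */

	hw->used_rules++;
	return 0;
}
```

Keeping the accounting in one place is what would let all three interfaces share the same hardware table without each of them tracking usage separately.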


-- 
Florian

[PATCH net] ibmvnic: Fix early release of login buffer

2018-02-21 Thread Thomas Falcon

The login buffer is released before the driver can perform
sanity checks between resources the driver requested and what
firmware will provide. Don't release the login buffer until
the sanity check is performed.

Fixes: 34f0f4e3f488 ("ibmvnic: Fix login buffer memory leaks")
Signed-off-by: Thomas Falcon 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index 1703b88..340e1ab 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -3790,7 +3790,6 @@ static int handle_login_rsp(union ibmvnic_crq *login_rsp_crq,
 
dma_unmap_single(dev, adapter->login_buf_token, adapter->login_buf_sz,
 DMA_BIDIRECTIONAL);
-   release_login_buffer(adapter);
dma_unmap_single(dev, adapter->login_rsp_buf_token,
 adapter->login_rsp_buf_sz, DMA_BIDIRECTIONAL);
 
@@ -3821,6 +3820,7 @@ static int handle_login_rsp(union ibmvnic_crq *login_rsp_crq,
ibmvnic_remove(adapter->vdev);
return -EIO;
}
+   release_login_buffer(adapter);
complete(&adapter->init_done);
 
return 0;
-- 
1.8.3.1
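
The ordering problem described in the commit message can be sketched in plain userspace C. This is not the driver code: `struct login_ctx`, `check_login_rsp()`, `release_login()` and `handle_login()` are hypothetical names, the buffer is heap memory rather than a DMA mapping, and the sketch frees on both the success and failure paths for simplicity.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for the adapter state; not the ibmvnic layout. */
struct login_ctx {
	int *requested; /* login buffer: what the driver asked for */
	int granted;    /* what firmware reported back */
};

/* Sanity check that still needs to read the login buffer. */
static int check_login_rsp(const struct login_ctx *ctx)
{
	if (!ctx->requested)
		return -EINVAL; /* buffer was released too early */
	return ctx->granted == *ctx->requested ? 0 : -EIO;
}

static void release_login(struct login_ctx *ctx)
{
	free(ctx->requested);
	ctx->requested = NULL;
}

/* Release only after the check has finished reading the buffer. */
static int handle_login(struct login_ctx *ctx)
{
	int rc = check_login_rsp(ctx);

	release_login(ctx);
	return rc;
}
```

Swapping the two calls in `handle_login()` would make the check run against a released buffer, which is the class of bug the patch avoids.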

Re: [PATCH net-next] RDS: deliver zerocopy completion notification with data as an optimization

2018-02-21 Thread Sowmini Varadhan

On (02/21/18 18:45), Willem de Bruijn wrote:
> 
> I do mean returning 0 instead of -EAGAIN if control data is ready.
> Something like
> 
> @@ -611,7 +611,8 @@ int rds_recvmsg(struct socket *sock, struct msghdr
> *msg, size_t size,
> 
> if (!rds_next_incoming(rs, &inc)) {
> if (nonblock) {
> -   ret = -EAGAIN;
> +   ncookies = rds_recvmsg_zcookie(rs, msg);
> +   ret = ncookies ? 0 : -EAGAIN;
> break;
> }

Yes, but you now have an implicit branch based on ncookies, so I'm
not sure it saved all that much? like I said let me revisit this

> By the way, the put_cmsg is unconditional even if the caller did
> not supply msg_control. So it is basically no longer safe to ever
> call read, recv or recvfrom on a socket if zerocopy notifications
> are outstanding.

Wait, I thought put_cmsg already checks for these things. 

> It is possible to check msg_controllen before even deciding whether
> to try to dequeue notifications (and take the lock). I see that this is
> not common. But RDS of all cases seems to do this, in
> rds_notify_queue_get:

yes the comment above that code suggests that this bit of code
was done to avoid calling put_cmsg while holding the rs_lock.

One bit of administrivia though- if I now drop sk_error_queue for
PF_RDS, I'll have to fix selftests in the same patch too, so the
patch will get a bit bulky (and thus a bit more difficult to review).

Re: [PATCH] selftests/bpf: tcpbpf_kern: use in6_* macros from glibc

2018-02-21 Thread Daniel Borkmann

On 02/21/2018 05:51 PM, Anders Roxell wrote:
> Both glibc and the kernel have in6_* macros definitions. Build fails
> because it picks up wrong in6_* macro from the kernel header and not the
> header from glibc.
> 
> Fixes build error below:
> clang -I. -I./include/uapi -I../../../include/uapi
>  -Wno-compare-distinct-pointer-types \
>  -O2 -target bpf -emit-llvm -c test_tcpbpf_kern.c -o - |  \
> llc -march=bpf -mcpu=generic -filetype=obj
>  -o .../tools/testing/selftests/bpf/test_tcpbpf_kern.o
> In file included from test_tcpbpf_kern.c:12:
> .../netinet/in.h:101:5: error: expected identifier
> IPPROTO_HOPOPTS = 0,   /* IPv6 Hop-by-Hop options.  */
> ^
> .../linux/in6.h:131:26: note: expanded from macro 'IPPROTO_HOPOPTS'
> ^
> In file included from test_tcpbpf_kern.c:12:
> /usr/include/netinet/in.h:103:5: error: expected identifier
> IPPROTO_ROUTING = 43,  /* IPv6 routing header.  */
> ^
> .../linux/in6.h:132:26: note: expanded from macro 'IPPROTO_ROUTING'
> ^
> In file included from test_tcpbpf_kern.c:12:
> .../netinet/in.h:105:5: error: expected identifier
> IPPROTO_FRAGMENT = 44, /* IPv6 fragmentation header.  */
> ^
> 
> Since both glibc and the kernel have in6_* macros definitions, use the
> one from glibc.  Kernel headers will check for previous libc definitions
> by including include/linux/libc-compat.h.
> 
> Reported-by: Daniel Díaz 
> Signed-off-by: Anders Roxell 

Applied to bpf tree, thanks Anders!

[PATCH net-next] ibmvnic: Fix TX descriptor tracking

2018-02-21 Thread Thomas Falcon

With the recent change, transmissions that only needed
one descriptor were being missed. The result is that such
packets were tracked as outstanding transmissions but never
removed when their completion notifications were received.

Fixes: ffc385b95adb ("ibmvnic: Keep track of supplementary TX descriptors")
Signed-off-by: Thomas Falcon 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index 340e1ab..b3a34d9 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -1478,7 +1478,6 @@ static int ibmvnic_xmit(struct sk_buff *skb, struct net_device *netdev)
if ((*hdrs >> 7) & 1) {
build_hdr_descs_arr(tx_buff, &num_entries, *hdrs);
tx_crq.v1.n_crq_elem = num_entries;
-   tx_buff->num_entries = num_entries;
tx_buff->indir_arr[0] = tx_crq;
tx_buff->indir_dma = dma_map_single(dev, tx_buff->indir_arr,
sizeof(tx_buff->indir_arr),
@@ -1533,6 +1532,7 @@ static int ibmvnic_xmit(struct sk_buff *skb, struct net_device *netdev)
netif_stop_subqueue(netdev, queue_num);
}
 
+   tx_buff->num_entries = num_entries;
tx_packets++;
tx_bytes += skb->len;
txq->trans_start = jiffies;
-- 
1.8.3.1
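
The effect of moving the assignment out of the conditional can be shown with a small sketch. The names here (`struct demo_tx_buff`, `demo_record_tx()`) are hypothetical, and the sketch assumes the descriptor count starts at one for every packet, which the patch text implies but does not show.

```c
/* Hypothetical per-packet bookkeeping; not the ibmvnic structure. */
struct demo_tx_buff {
	int num_entries;
};

/* Record how many descriptors a transmission used.
 * needs_hdr_descs models the "(*hdrs >> 7) & 1" test in the patch;
 * hdr_descs is the number of supplementary header descriptors. */
static void demo_record_tx(struct demo_tx_buff *buf, int needs_hdr_descs,
			   int hdr_descs)
{
	int num_entries = 1; /* assumed: one descriptor is always used */

	if (needs_hdr_descs)
		num_entries += hdr_descs;

	/* Assigned after the branch, so the single-descriptor case
	 * is recorded as well. */
	buf->num_entries = num_entries;
}
```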

Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-21 Thread Alexander Duyck

On Wed, Feb 21, 2018 at 3:50 PM, Siwei Liu  wrote:
> I haven't checked emails for days and did not realize the new revision
> had already come out. And thank you for the effort, this revision
> really looks to be a step forward towards our use case and is close to
> what we wanted to do. A few questions in line.
>
> On Sat, Feb 17, 2018 at 9:12 AM, Alexander Duyck
>  wrote:
>> On Fri, Feb 16, 2018 at 6:38 PM, Jakub Kicinski  wrote:
>>> On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote:
>>>> Patch 2 is in response to the community request for a 3 netdev
>>>> solution.  However, it creates some issues we'll get into in a moment.
>>>> It extends virtio_net to use alternate datapath when available and
>>>> registered. When BACKUP feature is enabled, virtio_net driver creates
>>>> an additional 'bypass' netdev that acts as a master device and controls
>>>> 2 slave devices.  The original virtio_net netdev is registered as
>>>> 'backup' netdev and a passthru/vf device with the same MAC gets
>>>> registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>>>> associated with the same 'pci' device.  The user accesses the network
>>>> interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
>>>> as default for transmits when it is available with link up and running.
>>>
>>> Thank you for doing this.
>>>
>>>> We noticed a couple of issues with this approach during testing.
>>>> - As both 'bypass' and 'backup' netdevs are associated with the same
>>>>   virtio pci device, udev tries to rename both of them with the same name
>>>>   and the 2nd rename will fail. This would be OK as long as the first netdev
>>>>   to be renamed is the 'bypass' netdev, but the order in which udev gets
>>>>   to rename the 2 netdevs is not reliable.
>>>
>>> Out of curiosity - why do you link the master netdev to the virtio
>>> struct device?
>>
>> The basic idea of all this is that we wanted this to work with an
>> existing VM image that was using virtio. As such we were trying to
>> make it so that the bypass interface takes the place of the original
>> virtio and get udev to rename the bypass to what the original
>> virtio_net was.
>
> Could it also be made possible to take over the config from the VF instead
> of virtio on an existing VM image, and get udev to rename the bypass
> netdev to what the original VF was? I'm not suggesting we tightly bind the
> bypass master to only virtio or VF, but I think we should provide both
> options to support different upgrade paths. Possibly we could tweak
> the device tree layout to reuse the same PCI slot for the master
> bypass netdev, such that udev would not get confused when renaming the
> device. The VF needs to use a different function slot afterwards.
> Perhaps we might need a special multiseat-like QEMU device for that
> purpose?
>
> In our case we'll upgrade the config from VF to virtio-bypass directly.

So if I am understanding what you are saying you are wanting to flip
the backup interface from the virtio to a VF. The problem is that
becomes a bit of a vendor lock-in solution since it would rely on a
specific VF driver. I would agree with Jiri that we don't want to go
down that path. We don't want every VF out there firing up its own
separate bond. Ideally you want the hypervisor to be able to manage
all of this which is why it makes sense to have virtio manage this and
why this is associated with the virtio_net interface.

The other bits get into more complexity than we are ready to handle
for now. I think I might have talked about something similar that I
was referring to as a "virtio-bond" where you would have a PCI/PCIe
tree topology that makes this easier to sort out, and the "virtio-bond"
would be used to handle coordination/configuration of a much more
complex interface.

>>
>>> FWIW two solutions that immediately come to mind is to export "backup"
>>> as phys_port_name of the backup virtio link and/or assign a name to the
>>> master like you are doing already.  I think team uses team%d and bond
>>> uses bond%d, soft naming of master devices seems quite natural in this
>>> case.
>>
>> I figured I had overlooked something like that.. Thanks for pointing
>> this out. Okay so I think the phys_port_name approach might resolve
>> the original issue. If I am reading things correctly what we end up
>> with is the master showing up as "ens1" for example and the backup
>> showing up as "ens1nbackup". Am I understanding that right?
>>
>> The problem with the team/bond%d approach is that it creates a new
>> netdevice and so it would require guest configuration changes.
>>
>>> IMHO phys_port_name == "backup" if BACKUP bit is set on slave virtio
>>> link is quite neat.
>>
>> I agree. For non-"backup" virtio_net devices would it be okay for us to
>> just return -EOPNOTSUPP? I assume it would be and that way the legacy
>> behavior could be maintained although the function still exists.
>>
 - When

Re: [PATCH] bpf: clean up unused-variable warning

2018-02-21 Thread Daniel Borkmann

On 02/20/2018 11:07 PM, Arnd Bergmann wrote:
> The only user of this variable is inside of an #ifdef, causing
> a warning without CONFIG_INET:
> 
> net/core/filter.c: In function 'bpf_sock_ops_cb_flags_set':
> net/core/filter.c:3382:6: error: unused variable 'val' [-Werror=unused-variable]
>   int val = argval & BPF_SOCK_OPS_ALL_CB_FLAGS;
> 
> This replaces the #ifdef with a nicer IS_ENABLED() check that
> makes the code more readable and avoids the warning.
> 
> Fixes: b13d88072172 ("bpf: Adds field bpf_sock_ops_cb_flags to tcp_sock")
> Signed-off-by: Arnd Bergmann 

Now applied to bpf, thanks Arnd!
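
For readers unfamiliar with the pattern, here is a userspace sketch of why an IS_ENABLED()-style check avoids the unused-variable warning. The macro below is a simplified reimplementation of the preprocessor trick, handling only options defined to 1; `DEMO_IS_ENABLED`, `CONFIG_DEMO_INET`, `DEMO_ALL_CB_FLAGS` and `demo_cb_flags_set()` are hypothetical names, not the kernel's.

```c
#include <errno.h>

/* Expands to 1 if the option is defined to 1, otherwise to 0. */
#define DEMO_ARG_PLACEHOLDER_1 0,
#define demo_take_second(ignored, val, ...) val
#define demo_is_defined__(arg1_or_junk) demo_take_second(arg1_or_junk 1, 0, 0)
#define demo_is_defined_(val) demo_is_defined__(DEMO_ARG_PLACEHOLDER_##val)
#define demo_is_defined(x) demo_is_defined_(x)
#define DEMO_IS_ENABLED(option) demo_is_defined(option)

#define CONFIG_DEMO_INET 1      /* pretend the option is enabled */
#define DEMO_ALL_CB_FLAGS 0x7   /* hypothetical flag mask */

static int demo_cb_flags_set(int argval)
{
	/* val is referenced in ordinary C code below, so the compiler
	 * sees a use of it even when the option is disabled. */
	int val = argval & DEMO_ALL_CB_FLAGS;

	if (!DEMO_IS_ENABLED(CONFIG_DEMO_INET))
		return -EINVAL;

	return val;
}
```

With an #ifdef, the only use of `val` disappears at preprocessing time when the option is off. With the check written as a normal `if`, the dead branch is still compiled and then discarded by the optimizer.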

[PATCH bpf] bpf, x64: implement retpoline for tail call

2018-02-21 Thread Daniel Borkmann

Implement a retpoline [0] for the BPF tail call JIT'ing that converts
the indirect jump via jmp %rax that is used to make the long jump into
another JITed BPF image. Since this is subject to speculative execution,
we need to control the transient instruction sequence here as well
when CONFIG_RETPOLINE is set, and direct it into a pause + lfence loop.
The latter aligns also with what gcc / clang emits (e.g. [1]).

JIT dump after patch:

  # bpftool p d x i 1
   0: (18) r2 = map[id:1]
   2: (b7) r3 = 0
   3: (85) call bpf_tail_call#12
   4: (b7) r0 = 2
   5: (95) exit

With CONFIG_RETPOLINE:

  # bpftool p d j i 1
  [...]
  33:   cmp    %edx,0x24(%rsi)
  36:   jbe    0x0072  |*
  38:   mov    0x24(%rbp),%eax
  3e:   cmp    $0x20,%eax
  41:   ja     0x0072  |
  43:   add    $0x1,%eax
  46:   mov    %eax,0x24(%rbp)
  4c:   mov    0x90(%rsi,%rdx,8),%rax
  54:   test   %rax,%rax
  57:   je     0x0072  |
  59:   mov    0x28(%rax),%rax
  5d:   add    $0x25,%rax
  61:   callq  0x006d  |+
  66:   pause          |
  68:   lfence         |
  6b:   jmp    0x0066  |
  6d:   mov    %rax,(%rsp) |
  71:   retq           |
  72:   mov    $0x2,%eax
  [...]

  * relative fall-through jumps in error case
  + retpoline for indirect jump

Without CONFIG_RETPOLINE:

  # bpftool p d j i 1
  [...]
  33:   cmp    %edx,0x24(%rsi)
  36:   jbe    0x0063  |*
  38:   mov    0x24(%rbp),%eax
  3e:   cmp    $0x20,%eax
  41:   ja     0x0063  |
  43:   add    $0x1,%eax
  46:   mov    %eax,0x24(%rbp)
  4c:   mov    0x90(%rsi,%rdx,8),%rax
  54:   test   %rax,%rax
  57:   je     0x0063  |
  59:   mov    0x28(%rax),%rax
  5d:   add    $0x25,%rax
  61:   jmpq   *%rax   |-
  63:   mov    $0x2,%eax
  [...]

  * relative fall-through jumps in error case
  - plain indirect jump as before

  [0] https://support.google.com/faqs/answer/7625886
  [1] https://github.com/gcc-mirror/gcc/commit/a31e654fa107be968b802786d747e962c2fcdb2b

Signed-off-by: Daniel Borkmann 
---
 arch/x86/net/bpf_jit_comp.c | 52 -
 1 file changed, 42 insertions(+), 10 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 4923d92..7e8d562 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -261,6 +261,35 @@ static void emit_prologue(u8 **pprog, u32 stack_depth)
*pprog = prog;
 }
 
+#ifdef CONFIG_RETPOLINE
+# define RETPOLINE_SIZE   17
+# define OFFSET_ADJ       RETPOLINE_SIZE
+
+/* Instead of plain jmp %rax, we emit a retpoline to control
+ * speculative execution for the indirect branch.
+ */
+static void emit_retpoline_rax_trampoline(u8 **pprog)
+{
+   u8 *prog = *pprog;
+   int cnt = 0;
+
+   EMIT1_off32(0xE8, 7);    /* callq set_up_target */
+   /* capture_spec: */
+   EMIT2(0xF3, 0x90);       /* pause */
+   EMIT3(0x0F, 0xAE, 0xE8); /* lfence */
+   EMIT2(0xEB, 0xF9);       /* jmp capture_spec */
+   /* set_up_target: */
+   EMIT4(0x48, 0x89, 0x04, 0x24); /* mov %rax,(%rsp) */
+   EMIT1(0xC3); /* retq */
+
+   BUILD_BUG_ON(cnt != RETPOLINE_SIZE);
+   *pprog = prog;
+}
+#else
+/* Plain jmp %rax version used. */
+# define OFFSET_ADJ       2
+#endif
+
 /* generate the following code:
  * ... bpf_tail_call(void *ctx, struct bpf_array *array, u64 index) ...
  *   if (index >= array->map.max_entries)
@@ -290,7 +319,7 @@ static void emit_bpf_tail_call(u8 **pprog)
EMIT2(0x89, 0xD2);/* mov edx, edx */
EMIT3(0x39, 0x56, /* cmp dword ptr [rsi + 16], edx */
  offsetof(struct bpf_array, map.max_entries));
-#define OFFSET1 43 /* number of bytes to jump */
+#define OFFSET1 (41 + OFFSET_ADJ) /* number of bytes to jump */
EMIT2(X86_JBE, OFFSET1);  /* jbe out */
label1 = cnt;
 
@@ -299,7 +328,7 @@ static void emit_bpf_tail_call(u8 **pprog)
 */
EMIT2_off32(0x8B, 0x85, 36);  /* mov eax, dword ptr [rbp + 36] */
EMIT3(0x83, 0xF8, MAX_TAIL_CALL_CNT); /* cmp eax, MAX_TAIL_CALL_CNT */
-#define OFFSET2 32
+#define OFFSET2 (30 + OFFSET_ADJ)
EMIT2(X86_JA, OFFSET2);   /* ja out */
label2 = cnt;
EMIT3(0x83, 0xC0, 0x01);  /* add eax, 1 */
@@ -313,7 +342,7 @@ static void emit_bpf_tail_call(u8 **pprog)
 *   goto out;
 */
EMIT3(0x48, 0x85, 0xC0);  /* test rax,rax */
-#define OFFSET3 10
+#define OFFSET3 (8 + OFFSET_ADJ)
EMIT2(X86_JE, OFFSET3);   /* je out */
label3 = cnt;
 
@@ -326,16 +355,19 @@ static void emit_bpf_tail_call(u8 **pprog)
 * rdi == ctx (1st arg)
 * rax == prog->bpf_func + prologue_size
 */
-   EMIT2(0xFF, 0xE0);/* jmp rax */
-
-   /* out: */
-
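
The byte counts and jump offsets in the patch can be cross-checked against the JIT dumps with a small userspace sketch. The bytes are the ones emitted by the EMIT calls above, with the call's 32-bit offset written little-endian. `demo_build_retpoline()` and `demo_branch_target()` are hypothetical helper names used only for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

#define RETPOLINE_SIZE 17

/* Copy the retpoline byte sequence into out[] and return its length. */
static size_t demo_build_retpoline(uint8_t out[RETPOLINE_SIZE])
{
	static const uint8_t seq[RETPOLINE_SIZE] = {
		0xE8, 0x07, 0x00, 0x00, 0x00, /* callq +7 -> set_up_target */
		0xF3, 0x90,                   /* capture_spec: pause */
		0x0F, 0xAE, 0xE8,             /* lfence */
		0xEB, 0xF9,                   /* jmp -7 -> capture_spec */
		0x48, 0x89, 0x04, 0x24,       /* set_up_target: mov %rax,(%rsp) */
		0xC3,                         /* retq */
	};
	size_t i;

	for (i = 0; i < sizeof(seq); i++)
		out[i] = seq[i];
	return sizeof(seq);
}

/* A relative branch lands at the end of the instruction plus its
 * displacement. */
static int demo_branch_target(int insn_off, int insn_len, int rel)
{
	return insn_off + insn_len + rel;
}
```

Applied to the dumps, the jbe at 0x36 is two bytes long, so a displacement of 41 + 17 reaches 0x72 with the retpoline and 41 + 2 reaches 0x63 without it, matching the two listings.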

Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-21 Thread Siwei Liu

I haven't checked emails for days and did not realize the new revision
had already come out. And thank you for the effort, this revision
really looks to be a step forward towards our use case and is close to
what we wanted to do. A few questions in line.

On Sat, Feb 17, 2018 at 9:12 AM, Alexander Duyck
 wrote:
> On Fri, Feb 16, 2018 at 6:38 PM, Jakub Kicinski  wrote:
>> On Fri, 16 Feb 2018 10:11:19 -0800, Sridhar Samudrala wrote:
>>> Patch 2 is in response to the community request for a 3 netdev
>>> solution.  However, it creates some issues we'll get into in a moment.
>>> It extends virtio_net to use alternate datapath when available and
>>> registered. When BACKUP feature is enabled, virtio_net driver creates
>>> an additional 'bypass' netdev that acts as a master device and controls
>>> 2 slave devices.  The original virtio_net netdev is registered as
>>> 'backup' netdev and a passthru/vf device with the same MAC gets
>>> registered as 'active' netdev. Both 'bypass' and 'backup' netdevs are
>>> associated with the same 'pci' device.  The user accesses the network
>>> interface via 'bypass' netdev. The 'bypass' netdev chooses 'active' netdev
>>> as default for transmits when it is available with link up and running.
>>
>> Thank you for doing this.
>>
>>> We noticed a couple of issues with this approach during testing.
>>> - As both 'bypass' and 'backup' netdevs are associated with the same
>>>   virtio pci device, udev tries to rename both of them with the same name
>>>   and the 2nd rename will fail. This would be OK as long as the first netdev
>>>   to be renamed is the 'bypass' netdev, but the order in which udev gets
>>>   to rename the 2 netdevs is not reliable.
>>
>> Out of curiosity - why do you link the master netdev to the virtio
>> struct device?
>
> The basic idea of all this is that we wanted this to work with an
> existing VM image that was using virtio. As such we were trying to
> make it so that the bypass interface takes the place of the original
> virtio and get udev to rename the bypass to what the original
> virtio_net was.

Could it also be made possible to take over the config from the VF instead
of virtio on an existing VM image, and get udev to rename the bypass
netdev to what the original VF was? I'm not suggesting we tightly bind the
bypass master to only virtio or VF, but I think we should provide both
options to support different upgrade paths. Possibly we could tweak
the device tree layout to reuse the same PCI slot for the master
bypass netdev, such that udev would not get confused when renaming the
device. The VF needs to use a different function slot afterwards.
Perhaps we might need a special multiseat-like QEMU device for that
purpose?

In our case we'll upgrade the config from VF to virtio-bypass directly.

>
>> FWIW two solutions that immediately come to mind is to export "backup"
>> as phys_port_name of the backup virtio link and/or assign a name to the
>> master like you are doing already.  I think team uses team%d and bond
>> uses bond%d, soft naming of master devices seems quite natural in this
>> case.
>
> I figured I had overlooked something like that.. Thanks for pointing
> this out. Okay so I think the phys_port_name approach might resolve
> the original issue. If I am reading things correctly what we end up
> with is the master showing up as "ens1" for example and the backup
> showing up as "ens1nbackup". Am I understanding that right?
>
> The problem with the team/bond%d approach is that it creates a new
> netdevice and so it would require guest configuration changes.
>
>> IMHO phys_port_name == "backup" if BACKUP bit is set on slave virtio
>> link is quite neat.
>
> I agree. For non-"backup" virtio_net devices would it be okay for us to
> just return -EOPNOTSUPP? I assume it would be and that way the legacy
> behavior could be maintained although the function still exists.
>
>>> - When the 'active' netdev is unplugged OR not present on a destination
>>>   system after live migration, the user will see 2 virtio_net netdevs.
>>
>> That's necessary and expected, all configuration applies to the master
>> so master must exist.
>
> With the naming issue resolved this is the only item left outstanding.
> This becomes a matter of form vs function.
>
> The main complaint about the "3 netdev" solution is that it is a bit
> confusing to have the 2 netdevs present if the VF isn't there. The idea
> is that having the extra "master" netdev there if there isn't really a
> bond is a bit ugly.

Is it uglier in terms of user experience rather than
functionality? I don't want it dynamically changed between 2-netdev
and 3-netdev depending on the presence of VF. That gets back to my
original question and suggestion earlier: why not just hide the lower
netdevs from udev renaming and such? What important observability
benefits would users get from exposing the lower netdevs?

Thanks,
-Siwei

>
> The downside of the "2 netdev" solution is that you have to deal

Re: [PATCH RFC PoC 0/3] nftables meets bpf

2018-02-21 Thread Jakub Kicinski

On Tue, 20 Feb 2018 11:58:22 +0100, Pablo Neira Ayuso wrote:
> We also have a large range of TCAM based hardware offload out there
> that will _not_ work with your BPF HW offload infrastructure. What
> this bpf infrastructure pushes into the kernel is just a blob
> expressing things in a very low-level instruction-set: trying to find
> a mapping of that to typical HW intermediate representations in the
> TCAM based HW offload world will be simply crazy.

I'm not sure where the TCAM talk is coming from.  Think much smaller -
cellular modems/phone SoCs, 32bit ARM/MIPS router box CPUs.  The
information the verifier is gathering will be crucial for optimizing
those.  Please don't discount the value of being able to use
heterogeneous processing units by the networking stack.

Re: [PATCH net-next] RDS: deliver zerocopy completion notification with data as an optimization

2018-02-21 Thread Willem de Bruijn

>> Okay. If callers must already handle 0 as a valid return code, then
>> it is fine to add another case that does this.
>>
>> The extra branch in the hot path is still rather unfortunate. Could
>> this be integrated in the existing if (nonblock) branch below?
>
> that's where I first started. it got even hairier because the
> callers expect a retval of 0 (-EAGAIN threw rds-stress into an
> infinite loop of continulally trying to recv) and the end result
> was just confusing code with the same number of branches..

I do mean returning 0 instead of -EAGAIN if control data is ready.
Something like

@@ -611,7 +611,8 @@ int rds_recvmsg(struct socket *sock, struct msghdr
*msg, size_t size,

if (!rds_next_incoming(rs, &inc)) {
if (nonblock) {
-   ret = -EAGAIN;
+   ncookies = rds_recvmsg_zcookie(rs, msg);
+   ret = ncookies ? 0 : -EAGAIN;
break;
}

By the way, the put_cmsg is unconditional even if the caller did
not supply msg_control. So it is basically no longer safe to ever
call read, recv or recvfrom on a socket if zerocopy notifications
are outstanding.

It is possible to check msg_controllen before even deciding whether
to try to dequeue notifications (and take the lock). I see that this is
not common. But RDS of all cases seems to do this, in
rds_notify_queue_get:

max_messages = msghdr->msg_controllen /
CMSG_SPACE(sizeof(cmsg));
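
As a userspace illustration of the msg_controllen check described above, the following sketch computes how many control messages of a given payload size fit into the caller's buffer, and returns zero when no control buffer was supplied. `demo_max_cmsgs()` is a hypothetical helper; only `struct msghdr` and `CMSG_SPACE()` come from the standard socket headers.

```c
#include <stddef.h>
#include <sys/socket.h>

/* Number of control messages with payload_len bytes of data that fit
 * into the control buffer the caller passed in msg. */
static size_t demo_max_cmsgs(const struct msghdr *msg, size_t payload_len)
{
	if (!msg->msg_control)
		return 0; /* caller did not ask for ancillary data */

	return msg->msg_controllen / CMSG_SPACE(payload_len);
}
```

A result of zero can be tested before taking any lock, so callers using plain read, recv or recvfrom never trigger the dequeue path.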

Re: [PATCH net-next] RDS: deliver zerocopy completion notification with data as an optimization

2018-02-21 Thread Sowmini Varadhan

On (02/21/18 17:50), Willem de Bruijn wrote:
> 
> In the common case no more than one notification will be outstanding,
> but with a fixed number of notifications per packet, in edge cases this
> list may be long.
   :
> Socket functions block if sk_err is non-zero. See for instance
> tcp_sendmsg_locked. It is set by most code that also calls
> sock_queue_err_skb and also on dequeue from err skb.
> 
> This is the main reason that I would consider dropping error
> queue completely if you expect all users of RDS to use the
> cmsg on regular read to get these notifications.

I see. That's a good point, and maybe it makes sense to just have
a struct sk_buff_head rs_zcookie_queue on the rds_sock, and
have rds_rm_zerocopy_callback chain cookies to this rs_zcookie_queue.

[discussion regarding rds_recvmsg return values elided]

> Okay. If callers must already handle 0 as a valid return code, then
> it is fine to add another case that does this.
> 
> The extra branch in the hot path is still rather unfortunate. Could
> this be integrated in the existing if (nonblock) branch below?

that's where I first started. it got even hairier because the
callers expect a retval of 0 (-EAGAIN threw rds-stress into an
infinite loop of continually trying to recv) and the end result
was just confusing code with the same number of branches.. 
let me revisit this when I spin out V2 without the sk_error_queue..

--Sowmini

[PATCH net-next 4/7] net: Rename NETEVENT_MULTIPATH_HASH_UPDATE

2018-02-21 Thread David Ahern

Rename NETEVENT_MULTIPATH_HASH_UPDATE to
NETEVENT_IPV4_MPATH_HASH_UPDATE to denote it relates to a change
in the IPv4 hash policy.

Signed-off-by: David Ahern 
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c | 2 +-
 include/net/netevent.h                                | 2 +-
 net/ipv4/sysctl_net_ipv4.c                            | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index 05146970c19c..da8ca721f2fc 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -2427,7 +2427,7 @@ static int mlxsw_sp_router_netevent_event(struct notifier_block *nb,
mlxsw_core_schedule_work(&net_work->work);
mlxsw_sp_port_dev_put(mlxsw_sp_port);
break;
-   case NETEVENT_MULTIPATH_HASH_UPDATE:
+   case NETEVENT_IPV4_MPATH_HASH_UPDATE:
net = ptr;
 
if (!net_eq(net, &init_net))
diff --git a/include/net/netevent.h b/include/net/netevent.h
index 40e7bab68490..baee605a94ab 100644
--- a/include/net/netevent.h
+++ b/include/net/netevent.h
@@ -26,7 +26,7 @@ enum netevent_notif_type {
NETEVENT_NEIGH_UPDATE = 1, /* arg is struct neighbour ptr */
NETEVENT_REDIRECT, /* arg is struct netevent_redirect ptr */
NETEVENT_DELAY_PROBE_TIME_UPDATE, /* arg is struct neigh_parms ptr */
-   NETEVENT_MULTIPATH_HASH_UPDATE, /* arg is struct net ptr */
+   NETEVENT_IPV4_MPATH_HASH_UPDATE, /* arg is struct net ptr */
 };
 
 int register_netevent_notifier(struct notifier_block *nb);
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 89683d868b37..011de9a20ec6 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -400,7 +400,7 @@ static int proc_fib_multipath_hash_policy(struct ctl_table *table, int write,
 
ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
if (write && ret == 0)
-   call_netevent_notifiers(NETEVENT_MULTIPATH_HASH_UPDATE, net);
+   call_netevent_notifiers(NETEVENT_IPV4_MPATH_HASH_UPDATE, net);
 
return ret;
 }
-- 
2.11.0

Re: [PATCH] selftests/bpf: update gitignore with test_libbpf_open

2018-02-21 Thread Shuah Khan

On 02/21/2018 03:48 PM, David Miller wrote:
> From: Anders Roxell 
> Date: Wed, 21 Feb 2018 22:30:01 +0100
> 
>> bpf builds a test program for loading BPF ELF files. Add the executable
>> to the .gitignore list.
>>
>> Signed-off-by: Anders Roxell 
> 
> Acked-by: David S. Miller 
> 
> 

Thanks. I will pull this in for 4.16-rc

-- Shuah

Re: [PATCH net-next] RDS: deliver zerocopy completion notification with data as an optimization

2018-02-21 Thread Willem de Bruijn

On Wed, Feb 21, 2018 at 5:14 PM, Sowmini Varadhan
 wrote:
> On (02/21/18 16:54), Willem de Bruijn wrote:
>>
>> I'd put this optimization behind a socket option.
>
> Yes, that thought occurred to me as well- I think RDS applications
> are unlikely to use the error_queue path because, as I mentioned
> before, these are heavily request-response based, so you're
> going to be getting something back for each sendmsg anyway.
>
> So if I had a sockopt, it would be something that would be
> "piggyback completion" most (all?) of the time anyway, but I suppose
> making it explicit is useful to have.
>
>> Either that, or always store cookies on an RDS specific queue, return as recv
>> cmsg and then remove the alternative error queue path completely.
>
> I think the error queue path is good to have, if only to be aligned
> with other socket families.
>
>> > +   spin_lock_irqsave(&q->lock, flags);
>>
>> This seems expensive on every recvmsg. Even if zerocopy is not enabled.
>
> noted, I'll predicate it on both zcopy and the sockopt suggestion.
>
>> > +   if (skb_queue_empty(q)) {
>> > +   spin_unlock_irqrestore(&q->lock, flags);
>> > +   return 0;
>> > +   }
>> > +   skb_queue_walk_safe(q, skb, tmp) {
>>
>> This too. If anything, just peek at the queue head and skip otherwise.
>> Anything else will cause an error, which blocks regular read and write
>> on the socket.
>
> but that would be technically incorrect, because you could have a
> mix/match of the zcopy notification with other sk_error_queue messages
> (given that rds_rm_zerocopy_callback looks at the tail, and
> creates a new skb if the tail is not a SO_EE_ORIGIN_ZCOOKIE).
>
> Of course, its true that *today* the only thing in the rds socket
> error queue is the cookie list, so this queue walk is a bit of overkill..
>
> but maybe I am missing something you are concerned about? The queue walk is
> intended to pull out the first SO_EE_ORIGIN_ZCOOKIE (if it exists)
> and return quietly otherwise

Yes, avoiding out of order completions is sensible.

In the common case no more than one notification will be outstanding,
but with a fixed number of notifications per packet, in edge cases this
list may be long.

Reading using a reverse walk avoids that problem.

> (leaving err_queue unchanged)- where
> do you see the blocking error?

Socket functions block if sk_err is non-zero. See for instance
tcp_sendmsg_locked. It is set by most code that also calls
sock_queue_err_skb and also on dequeue from err skb.

This is the main reason that I would consider dropping error
queue completely if you expect all users of RDS to use the
cmsg on regular read to get these notifications.

>> > +   if (list_empty(&rs->rs_recv_queue) && nonblock) {
>> > +   ncookies = rds_recvmsg_zcookie(rs, msg);
>> > +   if (ncookies) {
>> > +   ret = 0;
>>
>> This probably changes semantics for MSG_DONTWAIT. Will it ever return 0 now?
>
> The only time MSG_DONTWAIT was previously returning 0 for PF_RDS
> was when there was a congestion notification or rdma completion
> i.e., the checks for rs_notify_queue and rs_cong_notify earlier
> in the function, and in those 2 cases it was not returning data
> (i.e. ret was always 0). In all other cases  rds_recvmsg would
> always return either
> - ret == -1, with errno set to EAGAIN or ETIMEDOUT, depending on the value
>   of MSG_DONTWAIT, or,
> - ret > 0 with data bytes.
>
> That behavior (as well as the existing behavior for congestion notification
> and rdma completion) has not changed. The only thing that changed is that
> if there is no data to pass up, the ret will be 0, errno will be 0,
> and the ancillary data may have completion cookies.

Okay. If callers must already handle 0 as a valid return code, then
it is fine to add another case that does this.

The extra branch in the hot path is still rather unfortunate. Could
this be integrated in the existing if (nonblock) branch below?

Re: [PATCH net-next v4 0/2] Remove IPVlan module dependencies on IPv6 and L3 Master dev

2018-02-21 Thread David Miller

From: Matteo Croce 
Date: Wed, 21 Feb 2018 01:31:12 +0100

> The IPVlan module currently depends on IPv6 and L3 Master dev.
> Refactor the code to allow building IPVlan module regardless of the value
> of CONFIG_IPV6 as done in other drivers like VxLAN or GENEVE.
> Also change the CONFIG_NET_L3_MASTER_DEV dependency into a select,
> since compiling L3 Master device alone has little sense.
> 
> $ grep -wE 'CONFIG_(IPV6|IPVLAN)' .config
> CONFIG_IPV6=y
> CONFIG_IPVLAN=m
> $ ll drivers/net/ipvlan/ipvlan.ko
> 48K drivers/net/ipvlan/ipvlan.ko
> 
> $ grep -wE 'CONFIG_(IPV6|IPVLAN)' .config
> # CONFIG_IPV6 is not set
> CONFIG_IPVLAN=m
> $ ll drivers/net/ipvlan/ipvlan.ko
> 44K drivers/net/ipvlan/ipvlan.ko

Series applied, thanks Matteo.

Re: [PATCH net-next v2 1/1] net: Allow a rule to track originating protocol

2018-02-21 Thread David Miller

From: Donald Sharp 
Date: Tue, 20 Feb 2018 08:55:58 -0500

> Allow a rule that is being added/deleted/modified or
> dumped to contain the originating protocol's id.
> 
> The protocol is handled just like a routes originating
> protocol is.  This is especially useful because there
> is starting to be a plethora of different user space
> programs adding rules.
> 
> Allow the vrf device to specify that the kernel is the originator
> of the rule created for this device.
> 
> Signed-off-by: Donald Sharp 

Looks good, applied, thanks Donald.

Re: [PATCH] selftests/bpf: update gitignore with test_libbpf_open

2018-02-21 Thread David Miller

From: Anders Roxell 
Date: Wed, 21 Feb 2018 22:30:01 +0100

> bpf builds a test program for loading BPF ELF files. Add the executable
> to the .gitignore list.
> 
> Signed-off-by: Anders Roxell 

Acked-by: David S. Miller

[PATCH] net/smc9194: Remove bogus CONFIG_MAC reference

2018-02-21 Thread Finn Thain

AFAIK the only version of smc9194.c with Mac support is the one in the
linux-mac68k CVS repo, which never made it to the mainline.

Despite that, from v2.3.45, arch/m68k/config.in listed CONFIG_SMC9194
under CONFIG_MAC. This mistake got carried over into Kconfig in v2.5.55.
(See pre-git era "[PATCH] add m68k dependencies to net driver config".)

Signed-off-by: Finn Thain 
---
 drivers/net/ethernet/smsc/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/smsc/Kconfig b/drivers/net/ethernet/smsc/Kconfig
index 63aca9f847e1..4c2f612e4414 100644
--- a/drivers/net/ethernet/smsc/Kconfig
+++ b/drivers/net/ethernet/smsc/Kconfig
@@ -20,7 +20,7 @@ if NET_VENDOR_SMSC
 
 config SMC9194
tristate "SMC 9194 support"
-   depends on (ISA || MAC && BROKEN)
+   depends on ISA
select CRC32
---help---
  This is support for the SMC9xxx based Ethernet cards. Choose this
-- 
2.16.1

Re: [PATCH net-next] RDS: deliver zerocopy completion notification with data as an optimization

2018-02-21 Thread Sowmini Varadhan

On (02/21/18 16:54), Willem de Bruijn wrote:
> 
> I'd put this optimization behind a socket option.

Yes, that thought occurred to me as well- I think RDS applications
are unlikely to use the error_queue path because, as I mentioned
before, these are heavily request-response based, so you're 
going to be getting something back for each sendmsg anyway.

So if I had a sockopt, it would be something that would be
"piggyback completion" most (all?) of the time anyway, but I suppose
making it explicit is useful to have.

> Either that, or always store cookies on an RDS specific queue, return as recv
> cmsg and then remove the alternative error queue path completely.

I think the error queue path is good to have, if only to be aligned
with other socket families.

> > +   spin_lock_irqsave(&q->lock, flags);
> 
> This seems expensive on every recvmsg. Even if zerocopy is not enabled.

noted, I'll predicate it on both zcopy and the sockopt suggestion.

> > +   if (skb_queue_empty(q)) {
> > +   spin_unlock_irqrestore(&q->lock, flags);
> > +   return 0;
> > +   }
> > +   skb_queue_walk_safe(q, skb, tmp) {
> 
> This too. If anything, just peek at the queue head and skip otherwise.
> Anything else will cause an error, which blocks regular read and write
> on the socket.

but that would be technically incorrect, because you could have a 
mix/match of the zcopy notification with other sk_error_queue messages
(given that rds_rm_zerocopy_callback looks at the tail, and
creates a new skb if the tail is not a SO_EE_ORIGIN_ZCOOKIE).

Of course, it's true that *today* the only thing in the rds socket
error queue is the cookie list, so this queue walk is a bit of overkill..

but maybe I am missing something you are concerned about? The queue walk is
intended to pull out the first SO_EE_ORIGIN_ZCOOKIE (if it exists)
and return quietly otherwise (leaving err_queue unchanged)- where
do you see the blocking error?
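
To make the intended walk concrete, here is a minimal userspace sketch
of "pull out the first zcookie entry, otherwise leave the queue alone".
It is only a model under stated assumptions: struct node, struct queue,
ORIGIN_ZCOOKIE and dequeue_first_zcookie() are hypothetical stand-ins
for the skb error queue and SO_EE_ORIGIN_ZCOOKIE, and no locking is
shown.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical origin tags standing in for the kernel's SO_EE_ORIGIN_* values. */
enum origin { ORIGIN_OTHER, ORIGIN_ZCOOKIE };

struct node {
	struct node *next;
	enum origin origin;
	unsigned int cookie;
};

struct queue {
	struct node *head;
};

/*
 * Unlink and return the first ORIGIN_ZCOOKIE node, or NULL if none exists.
 * All other nodes keep their relative order, so a miss leaves the queue
 * exactly as it was.
 */
struct node *dequeue_first_zcookie(struct queue *q)
{
	struct node **link;

	assert(q != NULL);
	for (link = &q->head; *link != NULL; link = &(*link)->next) {
		struct node *n = *link;

		if (n->origin == ORIGIN_ZCOOKIE) {
			*link = n->next;	/* splice the node out */
			n->next = NULL;
			return n;
		}
	}
	return NULL;
}
```

The cost of a miss is a full walk of the list, which is the expense
being discussed in this thread.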

> > +   if (list_empty(&rs->rs_recv_queue) && nonblock) {
> > +   ncookies = rds_recvmsg_zcookie(rs, msg);
> > +   if (ncookies) {
> > +   ret = 0;
> 
> This probably changes semantics for MSG_DONTWAIT. Will it ever return 0 now?

The only time MSG_DONTWAIT was previously returning 0 for PF_RDS
was when there was a congestion notification or rdma completion
i.e., the checks for rs_notify_queue and rs_cong_notify earlier
in the function, and in those 2 cases it was not returning data
(i.e. ret was always 0). In all other cases  rds_recvmsg would
always return either
- ret == -1, with errno set to EAGAIN or ETIMEDOUT, depending on the value
  of MSG_DONTWAIT, or,
- ret > 0 with data bytes.

That behavior (as well as the existing behavior for congestion notification
and rdma completion) has not changed. The only thing that changed is that
if there is no data to pass up, the ret will be 0, errno will be 0,
and the ancillary data may have completion cookies.
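
From the application side, the cases listed above could be told apart
roughly as follows. This is a sketch only: enum rds_recv_outcome and
classify_recv() are hypothetical names, and the caller is assumed to
pass the recvmsg() return value together with the errno it observed.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>

enum rds_recv_outcome {
	RECV_DATA,		/* ret > 0: payload bytes, cookies may ride along */
	RECV_NOTIFY_ONLY,	/* ret == 0: no payload, inspect the ancillary data */
	RECV_RETRY,		/* ret == -1 with EAGAIN or ETIMEDOUT */
	RECV_ERROR		/* ret == -1 with any other errno */
};

/* Map a recvmsg() result onto the cases described in the text. */
enum rds_recv_outcome classify_recv(ssize_t ret, int err)
{
	assert(ret >= -1);

	if (ret > 0)
		return RECV_DATA;
	if (ret == 0)
		return RECV_NOTIFY_ONLY;
	if (err == EAGAIN || err == ETIMEDOUT)
		return RECV_RETRY;
	return RECV_ERROR;
}
```

A caller that already handles the congestion and rdma notification
cases has a RECV_NOTIFY_ONLY path; the cookies simply become one more
thing to look for in the control buffer on that path.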

--Sowmini

Re: syzcaller patch postings...

2018-02-21 Thread Florian Westphal

David Miller  wrote:
> I have to mention this now before it gets out of control.
> 
> I would like to ask that syzkaller stop posting the patch it is
> testing when it posts to netdev.

Same for netfilter-devel.

I could not get a reproducer to trigger and asked syzbot to test
the patch (great feature, thanks!) -- I did not cc any mailing list.

So I was very surprised that syzbot announced the test result by adding
back all CCs from the original report.  I think it's better to
announce the result to the patch author only.

Re: [PATCH net-next] RDS: deliver zerocopy completion notification with data as an optimization

2018-02-21 Thread Willem de Bruijn

On Wed, Feb 21, 2018 at 3:19 PM, Sowmini Varadhan
 wrote:
> This commit is an optimization that builds on top of commit 01883eda72bd
> ("rds: support for zcopy completion notification") for PF_RDS sockets.
>
> Cookies associated with zerocopy completion are passed up on the POLLIN
> channel, piggybacked with data wherever possible. Such cookies are passed
> up as ancillary data (at level SOL_RDS) in a struct rds_zcopy_cookies when
> the returned value of recvmsg() is >= 0. A max of SO_EE_ORIGIN_MAX_ZCOOKIES
> may be passed with each message.

I'd put this optimization behind a socket option.

Either that, or always store cookies on an RDS specific queue, return as recv
cmsg and then remove the alternative error queue path completely.

Having both is confusing, also in terms of system behavior (e.g., see signaling
and sk_err handling in sock_dequeue_err_skb).

> +static int rds_recvmsg_zcookie(struct rds_sock *rs, struct msghdr *msg)
> +{
> +   struct sk_buff *skb, *tmp;
> +   struct sock_exterr_skb *serr;
> +   struct sock *sk = rds_rs_to_sk(rs);
> +   struct sk_buff_head *q = &sk->sk_error_queue;
> +   struct rds_zcopy_cookies done;
> +   u32 *ptr;
> +   int i;
> +   unsigned long flags;
> +
> +   spin_lock_irqsave(&q->lock, flags);

This seems expensive on every recvmsg. Even if zerocopy is not enabled.

> +   if (skb_queue_empty(q)) {
> +   spin_unlock_irqrestore(&q->lock, flags);
> +   return 0;
> +   }
> +   skb_queue_walk_safe(q, skb, tmp) {

This too. If anything, just peek at the queue head and skip otherwise.
Anything else will cause an error, which blocks regular read and write
on the socket.

>  int rds_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
> int msg_flags)
>  {
> @@ -586,6 +623,7 @@ int rds_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
> int ret = 0, nonblock = msg_flags & MSG_DONTWAIT;
> DECLARE_SOCKADDR(struct sockaddr_in *, sin, msg->msg_name);
> struct rds_incoming *inc = NULL;
> +   int ncookies;
>
> /* udp_recvmsg()->sock_recvtimeo() gets away without locking too.. */
> timeo = sock_rcvtimeo(sk, nonblock);
> @@ -609,6 +647,14 @@ int rds_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
> break;
> }
>
> +   if (list_empty(&rs->rs_recv_queue) && nonblock) {
> +   ncookies = rds_recvmsg_zcookie(rs, msg);
> +   if (ncookies) {
> +   ret = 0;

This probably changes semantics for MSG_DONTWAIT. Will it ever return 0 now?

[PATCH v2 net-next 2/2] lan743x: Update MAINTAINERS to include lan743x driver

2018-02-21 Thread Bryan Whitehead

Update MAINTAINERS to include lan743x driver

Signed-off-by: Bryan Whitehead 
---
 MAINTAINERS | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 9a7f76e..c340125 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9149,6 +9149,13 @@ F:   drivers/net/dsa/microchip/*
 F: include/linux/platform_data/microchip-ksz.h
 F: Documentation/devicetree/bindings/net/dsa/ksz.txt
 
+MICROCHIP LAN743X ETHERNET DRIVER
+M: Bryan Whitehead 
+M: Microchip Linux Driver Support 
+L: netdev@vger.kernel.org
+S: Maintained
+F: drivers/net/ethernet/microchip/lan743x_*
+
 MICROCHIP USB251XB DRIVER
 M: Richard Leitner 
 L: linux-...@vger.kernel.org
-- 
2.7.4

Re: [PATCH] rds: send: mark expected switch fall-through in rds_rm_size

2018-02-21 Thread David Miller

From: "Gustavo A. R. Silva" 
Date: Mon, 19 Feb 2018 12:10:20 -0600

> In preparation to enabling -Wimplicit-fallthrough, mark switch cases
> where we are expecting to fall through.
> 
> Addresses-Coverity-ID: 1465362 ("Missing break in switch")
> Signed-off-by: Gustavo A. R. Silva 

Applied.

syzcaller patch postings...

2018-02-21 Thread David Miller


I have to mention this now before it gets out of control.

I would like to ask that syzkaller stop posting the patch it is
testing when it posts to netdev.

This creates a lot of confusion and I have to manually change the
status in patchwork of every patch syzkaller posts in this way.

I would suggest to post a link to the patch in the archives or
even better, the patchwork link.

Thanks.

Re: [PATCH] selftests/bpf: tcpbpf_kern: use in6_* macros from glibc

2018-02-21 Thread Daniel Díaz

On 02/21/2018 10:51 AM, Anders Roxell wrote:
> Both glibc and the kernel have in6_* macros definitions. Build fails
> because it picks up wrong in6_* macro from the kernel header and not the
> header from glibc.
> 
> Fixes build error below:
> clang -I. -I./include/uapi -I../../../include/uapi
>  -Wno-compare-distinct-pointer-types \
>  -O2 -target bpf -emit-llvm -c test_tcpbpf_kern.c -o - |  \
> llc -march=bpf -mcpu=generic -filetype=obj
>  -o .../tools/testing/selftests/bpf/test_tcpbpf_kern.o
> In file included from test_tcpbpf_kern.c:12:
> .../netinet/in.h:101:5: error: expected identifier
> IPPROTO_HOPOPTS = 0,   /* IPv6 Hop-by-Hop options.  */
> ^
> .../linux/in6.h:131:26: note: expanded from macro 'IPPROTO_HOPOPTS'
> ^
> In file included from test_tcpbpf_kern.c:12:
> /usr/include/netinet/in.h:103:5: error: expected identifier
> IPPROTO_ROUTING = 43,  /* IPv6 routing header.  */
> ^
> .../linux/in6.h:132:26: note: expanded from macro 'IPPROTO_ROUTING'
> ^
> In file included from test_tcpbpf_kern.c:12:
> .../netinet/in.h:105:5: error: expected identifier
> IPPROTO_FRAGMENT = 44, /* IPv6 fragmentation header.  */
> ^
> 
> Since both glibc and the kernel have in6_* macros definitions, use the
> one from glibc.  Kernel headers will check for previous libc definitions
> by including include/linux/libc-compat.h.
> 
> Reported-by: Daniel Díaz 
> Signed-off-by: Anders Roxell 

FWIW, this was
Tested-by: Daniel Díaz 

> ---
>  tools/testing/selftests/bpf/test_tcpbpf_kern.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/tools/testing/selftests/bpf/test_tcpbpf_kern.c b/tools/testing/selftests/bpf/test_tcpbpf_kern.c
> index 57119ad57a3f..3e645ee41ed5 100644
> --- a/tools/testing/selftests/bpf/test_tcpbpf_kern.c
> +++ b/tools/testing/selftests/bpf/test_tcpbpf_kern.c
> @@ -5,7 +5,6 @@
>  #include 
>  #include 
>  #include 
> -#include 
>  #include 
>  #include 
>  #include

Re: [PATCH] selftests/bpf: update gitignore with test_libbpf_open

2018-02-21 Thread Daniel Díaz

On 02/21/2018 03:30 PM, Anders Roxell wrote:
> bpf builds a test program for loading BPF ELF files. Add the executable
> to the .gitignore list.
> 
> Signed-off-by: Anders Roxell 

Tested-by: Daniel Díaz 

> ---
>  tools/testing/selftests/bpf/.gitignore | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore
> index cc15af2e54fe..9cf83f895d98 100644
> --- a/tools/testing/selftests/bpf/.gitignore
> +++ b/tools/testing/selftests/bpf/.gitignore
> @@ -11,3 +11,4 @@ test_progs
>  test_tcpbpf_user
>  test_verifier_log
>  feature
> +test_libbpf_open

Re: [PATCH 00/19] Netfilter fixes for net

2018-02-21 Thread David Miller

From: Pablo Neira Ayuso 
Date: Tue, 20 Feb 2018 17:38:47 +0100

> The following patchset contains large batch with Netfilter fixes for
> your net tree, mostly due to syzbot report fixups and pr_err()
> ratelimiting, more specifically, they are:
 ...
> You can pull these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git

Pulled, thank you.

> P.S: If I can get net.git merge into net-next.git, I'll appreciate
>  since I have people willing to bang me here with patches that
>  have dependencies with this batch. Thanks again!

That might have to wait until the weekend.  I'll see what I can
do meanwhile.

Thanks.

[PATCH] selftests/bpf: update gitignore with test_libbpf_open

2018-02-21 Thread Anders Roxell

bpf builds a test program for loading BPF ELF files. Add the executable
to the .gitignore list.

Signed-off-by: Anders Roxell 
---
 tools/testing/selftests/bpf/.gitignore | 1 +
 1 file changed, 1 insertion(+)

diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore
index cc15af2e54fe..9cf83f895d98 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -11,3 +11,4 @@ test_progs
 test_tcpbpf_user
 test_verifier_log
 feature
+test_libbpf_open
-- 
2.11.0

Possible iputils infrastructure/hosting issue on skbuff.net

2018-02-21 Thread Graph Worlok

http://www.skbuff.net/iputils/ has a list of source tarball releases.

All of these releases reference
http://www.skbuff.net/iputils/iputils-current.tar.bz2 as the location
to obtain the source.

This URL returns a 404, meaning that most Linux systems in existence refer
to an unavailable URL for the source in their man pages.

If this location is no longer valid going forwards, can we please at
least have a copy of the current stable source modified to reflect
this fact available at
http://www.skbuff.net/iputils/iputils-current.tar.bz2, so that anybody
who does care enough to pull down the latest source can get a pointer
to the new location?

Or is it simply an infrastructure issue, and the script to label the
latest tarball release as -current is no longer running?

Regards.

Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-21 Thread Alexander Duyck

On Wed, Feb 21, 2018 at 11:38 AM, Jiri Pirko  wrote:
> Wed, Feb 21, 2018 at 06:56:35PM CET, alexander.du...@gmail.com wrote:
>>On Wed, Feb 21, 2018 at 8:58 AM, Jiri Pirko  wrote:
>>> Wed, Feb 21, 2018 at 05:49:49PM CET, alexander.du...@gmail.com wrote:
On Wed, Feb 21, 2018 at 8:11 AM, Jiri Pirko  wrote:
> Wed, Feb 21, 2018 at 04:56:48PM CET, alexander.du...@gmail.com wrote:
>>On Wed, Feb 21, 2018 at 1:51 AM, Jiri Pirko  wrote:
>>> Tue, Feb 20, 2018 at 11:33:56PM CET, kubak...@wp.pl wrote:
On Tue, 20 Feb 2018 21:14:10 +0100, Jiri Pirko wrote:
> Yeah, I can see it now :( I guess that the ship has sailed and we are
> stuck with this ugly thing forever...
>
> Could you at least make some common code that is shared in between
> netvsc and virtio_net so this is handled in exactly the same way in both?

IMHO netvsc is a vendor specific driver which made a mistake on what
behaviour it provides (or tried to align itself with Windows SR-IOV).
Let's not make a far, far more commonly deployed and important driver
(virtio) bug-compatible with netvsc.
>>>
>>> Yeah. netvsc solution is a dangerous precedent here and in my opinion
>>> it was a huge mistake to merge it. I personally would vote to unmerge it
>>> and make the solution based on team/bond.
>>>
>>>

To Jiri's initial comments, I feel the same way, in fact I've talked to
the NetworkManager guys to get auto-bonding based on MACs handled in
user space.  I think it may very well get done in next versions of NM,
but isn't done yet.  Stephen also raised the point that not everybody is
using NM.
>>>
>>> Can be done in NM, networkd or other network management tools.
>>> Even easier to do this in teamd and let them all benefit.
>>>
>>> Actually, I took a stab to implement this in teamd. Took me like an hour
>>> and half.
>>>
>>> You can just run teamd with config option "kidnap" like this:
>>> # teamd/teamd -c '{"kidnap": true }'
>>>
>>> Whenever teamd sees another netdev to appear with the same mac as his,
>>> or whenever teamd sees another netdev to change mac to his,
>>> it enslaves it.
>>>
>>> Here's the patch (quick and dirty):
>>>
>>> Subject: [patch teamd] teamd: introduce kidnap feature
>>>
>>> Signed-off-by: Jiri Pirko 
>>
>>So this doesn't really address the original problem we were trying to
>>solve. You asked earlier why the netdev name mattered and it mostly
>>has to do with configuration. Specifically what our patch is
>>attempting to resolve is the issue of how to allow a cloud provider to
>>upgrade their customer to SR-IOV support and live migration without
>>requiring them to reconfigure their guest. So the general idea with
>>our patch is to take a VM that is running with virtio_net only and
>>allow it to instead spawn a virtio_bypass master using the same netdev
>>name as the original virtio, and then have the virtio_net and VF come
>>up and be enslaved by the bypass interface. Doing it this way we can
>>allow for multi-vendor SR-IOV live migration support using a guest
>>that was originally configured for virtio only.
>>
>>The problem with your solution is we already have teaming and bonding
>>as you said. There is already a write-up from Red Hat on how to do it
>>(https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/virtual_machine_management_guide/sect-migrating_virtual_machines_between_hosts).
>>That is all well and good as long as you are willing to keep around
>>two VM images, one for virtio, and one for SR-IOV with live migration.
>
> You don't need 2 images. You need only one. The one with the team setup.
> That's it. If another netdev with the same mac appears, teamd will
> enslave it and run traffic on it. If not, ok, you'll go only through
> virtio_net.

Isn't that going to cause the routing table to get messed up when we
rearrange the netdevs? We don't want to have a significant disruption
in traffic when we are adding/removing the VF. It seems like we would
need to invalidate any entries that were configured for the virtio_net
and reestablish them on the new team interface. Part of the criteria
we have been working with is that we should be able to transition from
having a VF to not or vice versa without seeing any significant
disruption in the traffic.
>>>
>>> What? You have routes on the team netdev. virtio_net and VF are only
>>> slaves. What are you talking about? I don't get it :/
>>
>>So let's walk through this by example. The general idea of the base case
>>for all this is somebody starting with virtio_net, we will call the

[for-next 1/7] net/mlx5: CQ Database per EQ

2018-02-21 Thread Saeed Mahameed

Before this patch the driver had one CQ database protected by a single
spinlock, which synchronizes CQ addition/removal with CQ IRQ handling.

On a system with a large number of CPUs and a workload that generates
many interrupts, this global spinlock becomes a hotspot and introduces
contention between the active cores, which significantly hurts
performance and becomes a bottleneck that prevents seamless CPU scaling.

To solve this we simply move the CQ database and its spinlock to be per
EQ (IRQ), thus per core.
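
As a rough userspace illustration of the idea (not the driver code),
each EQ can own its lock and its CQ table, so a lookup on one EQ never
touches the lock of another. Everything here is an assumption made for
the sketch: struct eq, struct cq, eq_add_cq() and eq_cq_get() are
hypothetical names, a small array replaces the radix tree, and a C11
atomic_flag replaces the kernel spinlock.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define CQS_PER_EQ 8	/* hypothetical capacity; the driver uses a radix tree */

struct cq {
	unsigned int cqn;
	int refcount;	/* plain int, only touched under the EQ lock here */
};

struct eq {
	atomic_flag lock;		/* one lock per EQ, not one per device */
	struct cq *cqs[CQS_PER_EQ];
};

void eq_init(struct eq *eq)
{
	size_t i;

	atomic_flag_clear(&eq->lock);
	for (i = 0; i < CQS_PER_EQ; i++)
		eq->cqs[i] = NULL;
}

static void eq_lock(struct eq *eq)
{
	while (atomic_flag_test_and_set_explicit(&eq->lock, memory_order_acquire))
		;	/* spin until the flag is released */
}

static void eq_unlock(struct eq *eq)
{
	atomic_flag_clear_explicit(&eq->lock, memory_order_release);
}

/* Register a CQ with this EQ; returns 0 on success, -1 if the table is full. */
int eq_add_cq(struct eq *eq, struct cq *cq)
{
	size_t i;
	int ret = -1;

	assert(eq != NULL && cq != NULL);
	eq_lock(eq);
	for (i = 0; i < CQS_PER_EQ; i++) {
		if (eq->cqs[i] == NULL) {
			eq->cqs[i] = cq;
			ret = 0;
			break;
		}
	}
	eq_unlock(eq);
	return ret;
}

/* Look up a CQ by number on this EQ only and take a reference on it. */
struct cq *eq_cq_get(struct eq *eq, unsigned int cqn)
{
	struct cq *found = NULL;
	size_t i;

	assert(eq != NULL);
	eq_lock(eq);
	for (i = 0; i < CQS_PER_EQ; i++) {
		if (eq->cqs[i] != NULL && eq->cqs[i]->cqn == cqn) {
			found = eq->cqs[i];
			found->refcount++;
			break;
		}
	}
	eq_unlock(eq);
	return found;
}
```

With one struct eq per IRQ, two cores handling completions on
different EQs contend on different locks, which is the effect the
numbers below are measuring.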

Tested with:
system: 2 sockets, 14 cores per socket, hyperthreading, 2x14x2=56 cores
netperf command: ./super_netperf 200 -P 0 -t TCP_RR  -H  -l 30 -- -r 300,300 -o -s 1M,1M -S 1M,1M

WITHOUT THIS PATCH:
Average:  CPU   %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest  %gnice  %idle
Average:  all   4.32   0.00  36.15     0.09  0.00  34.02    0.00    0.00    0.00  25.41

Samples: 2M of event 'cycles:pp', Event count (approx.): 1554616897271
Overhead    Command    Shared Object     Symbol
+   14.28%  swapper    [kernel.vmlinux]  [k] intel_idle
+   12.25%  swapper    [kernel.vmlinux]  [k] queued_spin_lock_slowpath
+   10.29%  netserver  [kernel.vmlinux]  [k] queued_spin_lock_slowpath
+    1.32%  netserver  [kernel.vmlinux]  [k] mlx5e_xmit

WITH THIS PATCH:
Average:  CPU   %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest  %gnice  %idle
Average:  all   4.27   0.00  34.31     0.01  0.00  18.71    0.00    0.00    0.00  42.69

Samples: 2M of event 'cycles:pp', Event count (approx.): 1498132937483
Overhead    Command    Shared Object     Symbol
+   23.33%  swapper    [kernel.vmlinux]  [k] intel_idle
+    1.69%  netserver  [kernel.vmlinux]  [k] mlx5e_xmit

Tested-by: Song Liu 
Signed-off-by: Saeed Mahameed 
Reviewed-by: Gal Pressman 
---
 drivers/net/ethernet/mellanox/mlx5/core/cq.c   | 69 +++---
 drivers/net/ethernet/mellanox/mlx5/core/eq.c   | 10 +++-
 drivers/net/ethernet/mellanox/mlx5/core/main.c |  8 +--
 include/linux/mlx5/cq.h                        |  3 +-
 include/linux/mlx5/driver.h                    | 22 
 5 files changed, 62 insertions(+), 50 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/cq.c b/drivers/net/ethernet/mellanox/mlx5/core/cq.c
index 1016e05c7ec7..dfbeeaa43276 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/cq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/cq.c
@@ -86,10 +86,10 @@ static void mlx5_add_cq_to_tasklet(struct mlx5_core_cq *cq)
spin_unlock_irqrestore(&tasklet_ctx->lock, flags);
 }
 
-void mlx5_cq_completion(struct mlx5_core_dev *dev, u32 cqn)
+void mlx5_cq_completion(struct mlx5_eq *eq, u32 cqn)
 {
+   struct mlx5_cq_table *table = &eq->cq_table;
struct mlx5_core_cq *cq;
-   struct mlx5_cq_table *table = &dev->priv.cq_table;
 
spin_lock(&table->lock);
cq = radix_tree_lookup(&table->tree, cqn);
@@ -98,7 +98,7 @@ void mlx5_cq_completion(struct mlx5_core_dev *dev, u32 cqn)
spin_unlock(&table->lock);
 
if (!cq) {
-   mlx5_core_warn(dev, "Completion event for bogus CQ 0x%x\n", cqn);
+   mlx5_core_warn(eq->dev, "Completion event for bogus CQ 0x%x\n", cqn);
return;
}
 
@@ -110,9 +110,9 @@ void mlx5_cq_completion(struct mlx5_core_dev *dev, u32 cqn)
complete(&cq->free);
 }
 
-void mlx5_cq_event(struct mlx5_core_dev *dev, u32 cqn, int event_type)
+void mlx5_cq_event(struct mlx5_eq *eq, u32 cqn, int event_type)
 {
-   struct mlx5_cq_table *table = &dev->priv.cq_table;
+   struct mlx5_cq_table *table = &eq->cq_table;
struct mlx5_core_cq *cq;
 
spin_lock(&table->lock);
@@ -124,7 +124,7 @@ void mlx5_cq_event(struct mlx5_core_dev *dev, u32 cqn, int event_type)
spin_unlock(&table->lock);
 
if (!cq) {
-   mlx5_core_warn(dev, "Async event for bogus CQ 0x%x\n", cqn);
+   mlx5_core_warn(eq->dev, "Async event for bogus CQ 0x%x\n", cqn);
return;
}
 
@@ -137,19 +137,22 @@ void mlx5_cq_event(struct mlx5_core_dev *dev, u32 cqn, int event_type)
 int mlx5_core_create_cq(struct mlx5_core_dev *dev, struct mlx5_core_cq *cq,
u32 *in, int inlen)
 {
-   struct mlx5_cq_table *table = &dev->priv.cq_table;
u32 out[MLX5_ST_SZ_DW(create_cq_out)];
u32 din[MLX5_ST_SZ_DW(destroy_cq_in)];
u32 dout[MLX5_ST_SZ_DW(destroy_cq_out)];
int eqn = MLX5_GET(cqc, MLX5_ADDR_OF(create_cq_in, in, cq_context),
   c_eqn);
-   struct mlx5_eq *eq;
+   struct mlx5_eq *eq, *async_eq;
+   struct mlx5_cq_table *table;
int err;
 
+   async_eq = &dev->priv.eq_table.async_eq;
eq = mlx5_eqn2eq(dev, eqn);
if (IS_ERR(eq))

[pull request][for-next 0/7] Mellanox, mlx5 shared code updates 2018-02-21

2018-02-21 Thread Saeed Mahameed

Hi Dave & Doug,

This series includes shared code updates for mlx5 core driver for both
netdev and rdma subsystems.  This series should be pulled to both
trees so we can continue netdev and rdma specific submissions separately.

For more information please see tag log below.

P.S. We expect two more shared code pull requests.

The series doesn't cause any conflict with the latest mlx5 net fixes
series.

Please pull and let me know if there's any issue,

Thanks,
Saeed.

---

The following changes since commit 7928b2cbe55b2a410a0f5c1f154610059c57b1b2:

  Linux 4.16-rc1 (2018-02-11 15:04:29 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux.git tags/mlx5-updates-2018-02-21

for you to fetch changes up to 388ca8be00370db132464e27f745b8a0add19fcb:

  IB/mlx5: Implement fragmented completion queue (CQ) (2018-02-15 00:30:03 -0800)


mlx5-updates-2018-02-21

This series includes shared code updates for mlx5 core driver for both
netdev and rdma subsystems.

By Saeed,
The first six patches of the series address a performance issue and
should provide a performance boost for multi-core, interrupt-hungry
workloads.  The issue is fixed in the first patch; the other five
refactor the code in light of this fix.

The problem being fixed is a shared spinlock, accessed across all HCA
IRQs, which protects the CQ database.  To solve this we simply move the CQ
database and its spinlock to be per EQ (IRQ), thus per core.

By Yonatan,
Fragmented completion queue (CQ) for RDMA:
a core driver implementation that creates fragmented CQ buffers rather than
one large contiguous memory buffer.  The implementation scheme already
exists and is used by the netdev CQs; the patch shares that code with the
rdma CQ creation flow and makes use of the new API in the mlx5_ib driver.
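
For readers unfamiliar with fragmented buffers, the addressing can be
sketched in userspace as below. This is an illustration under
assumptions, not the mlx5 API: struct frag_buf and frag_buf_entry() are
hypothetical names, all fragments are assumed to be the same size, and
both the entry size and the entries per fragment are assumed to be
powers of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct frag_buf {
	uint8_t **frags;		/* array of equally sized fragments */
	unsigned int log_stride;	/* log2 of the entry size in bytes */
	unsigned int log_frag_strides;	/* log2 of the entries per fragment */
};

/*
 * Return the address of entry ix: the high bits of the index select the
 * fragment, the low bits select the entry inside that fragment.
 */
void *frag_buf_entry(const struct frag_buf *fb, uint32_t ix)
{
	uint32_t frag, within;

	assert(fb != NULL);
	frag = ix >> fb->log_frag_strides;
	within = ix & ((1u << fb->log_frag_strides) - 1);
	return fb->frags[frag] + ((size_t)within << fb->log_stride);
}
```

Because only the per-fragment allocations need to be contiguous, the
queue can be built from several small allocations instead of one large
one.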

Thanks,
Saeed.


Saeed Mahameed (6):
  net/mlx5: CQ Database per EQ
  net/mlx5: Add missing likely/unlikely hints to cq events
  net/mlx5: EQ add/del CQ API
  net/mlx5: CQ hold/put API
  net/mlx5: Move CQ completion and event forwarding logic to eq.c
  net/mlx5: Remove redundant EQ API exports

Yonatan Cohen (1):
  IB/mlx5: Implement fragmented completion queue (CQ)

 drivers/infiniband/hw/mlx5/cq.c                    |  64 +++-
 drivers/infiniband/hw/mlx5/mlx5_ib.h               |   6 +-
 drivers/net/ethernet/mellanox/mlx5/core/alloc.c    |  37 ---
 drivers/net/ethernet/mellanox/mlx5/core/cq.c       | 116 +
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c    |  11 +-
 drivers/net/ethernet/mellanox/mlx5/core/eq.c       |  92 +++-
 drivers/net/ethernet/mellanox/mlx5/core/main.c     |   8 +-
 .../net/ethernet/mellanox/mlx5/core/mlx5_core.h    |  21
 drivers/net/ethernet/mellanox/mlx5/core/wq.c       |  18 ++--
 drivers/net/ethernet/mellanox/mlx5/core/wq.h       |  22 ++--
 include/linux/mlx5/cq.h                            |  14 ++-
 include/linux/mlx5/driver.h                        |  88
 12 files changed, 279 insertions(+), 218 deletions(-)

[for-next 2/7] net/mlx5: Add missing likely/unlikely hints to cq events

2018-02-21 Thread Saeed Mahameed

If a hardware event is targeting a CQ, that CQ should exist.
Add unlikely to error handling flows.
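
For context, the hints boil down to the compiler builtin shown in this
userspace sketch. The macro definitions mirror the usual
__builtin_expect() idiom (GCC and Clang); struct event_cq and
deliver_event() are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Branch hints: tell the compiler which outcome to lay out as the fast path. */
#define likely(x)	__builtin_expect(!!(x), 1)
#define unlikely(x)	__builtin_expect(!!(x), 0)

struct event_cq {
	unsigned int cqn;
	int events;
};

/* Count an event on cq; a missing CQ is the rare error path. */
int deliver_event(struct event_cq *cq)
{
	if (unlikely(cq == NULL))
		return -1;

	cq->events++;
	return 0;
}
```

The hints change code layout only; the result of the condition is the
same with or without them.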

Signed-off-by: Saeed Mahameed 
Reviewed-by: Gal Pressman 
---
 drivers/net/ethernet/mellanox/mlx5/core/cq.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/cq.c b/drivers/net/ethernet/mellanox/mlx5/core/cq.c
index dfbeeaa43276..9feeb555e937 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/cq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/cq.c
@@ -97,7 +97,7 @@ void mlx5_cq_completion(struct mlx5_eq *eq, u32 cqn)
refcount_inc(&cq->refcount);
spin_unlock(&table->lock);
 
-   if (!cq) {
+   if (unlikely(!cq)) {
mlx5_core_warn(eq->dev, "Completion event for bogus CQ 0x%x\n", cqn);
return;
}
@@ -118,12 +118,12 @@ void mlx5_cq_event(struct mlx5_eq *eq, u32 cqn, int event_type)
spin_lock(&table->lock);
 
cq = radix_tree_lookup(&table->tree, cqn);
-   if (cq)
+   if (likely(cq))
refcount_inc(&cq->refcount);
 
spin_unlock(&table->lock);
 
-   if (!cq) {
+   if (unlikely(!cq)) {
mlx5_core_warn(eq->dev, "Async event for bogus CQ 0x%x\n", cqn);
return;
}
-- 
2.14.3

[for-next 5/7] net/mlx5: Move CQ completion and event forwarding logic to eq.c

2018-02-21 Thread Saeed Mahameed

Since the CQ tree is now per EQ, CQ completion and event forwarding are
now part of the EQ logic.  This patch moves that logic to eq.c and makes
those functions static.

Signed-off-by: Saeed Mahameed 
Reviewed-by: Gal Pressman 
---
 drivers/net/ethernet/mellanox/mlx5/core/cq.c | 45 -
 drivers/net/ethernet/mellanox/mlx5/core/eq.c | 49 ++--
 include/linux/mlx5/driver.h                  |  2 --
 3 files changed, 47 insertions(+), 49 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/cq.c b/drivers/net/ethernet/mellanox/mlx5/core/cq.c
index 06dc7bd302ed..669ed16938b3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/cq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/cq.c
@@ -85,51 +85,6 @@ static void mlx5_add_cq_to_tasklet(struct mlx5_core_cq *cq)
spin_unlock_irqrestore(&tasklet_ctx->lock, flags);
 }
 
-/* caller must eventually call mlx5_cq_put on the returned cq */
-static struct mlx5_core_cq *mlx5_eq_cq_get(struct mlx5_eq *eq, u32 cqn)
-{
-   struct mlx5_cq_table *table = &eq->cq_table;
-   struct mlx5_core_cq *cq = NULL;
-
-   spin_lock(&table->lock);
-   cq = radix_tree_lookup(&table->tree, cqn);
-   if (likely(cq))
-   mlx5_cq_hold(cq);
-   spin_unlock(&table->lock);
-
-   return cq;
-}
-
-void mlx5_cq_completion(struct mlx5_eq *eq, u32 cqn)
-{
-   struct mlx5_core_cq *cq = mlx5_eq_cq_get(eq, cqn);
-
-   if (unlikely(!cq)) {
-   mlx5_core_warn(eq->dev, "Completion event for bogus CQ 0x%x\n", cqn);
-   return;
-   }
-
-   ++cq->arm_sn;
-
-   cq->comp(cq);
-
-   mlx5_cq_put(cq);
-}
-
-void mlx5_cq_event(struct mlx5_eq *eq, u32 cqn, int event_type)
-{
-   struct mlx5_core_cq *cq = mlx5_eq_cq_get(eq, cqn);
-
-   if (unlikely(!cq)) {
-   mlx5_core_warn(eq->dev, "Async event for bogus CQ 0x%x\n", cqn);
-   return;
-   }
-
-   cq->event(cq, event_type);
-
-   mlx5_cq_put(cq);
-}
-
 int mlx5_core_create_cq(struct mlx5_core_dev *dev, struct mlx5_core_cq *cq,
u32 *in, int inlen)
 {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index c1f0468e95bd..7e442b38a8ca 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -393,6 +393,51 @@ static void general_event_handler(struct mlx5_core_dev *dev,
 	}
 }
 
+/* caller must eventually call mlx5_cq_put on the returned cq */
+static struct mlx5_core_cq *mlx5_eq_cq_get(struct mlx5_eq *eq, u32 cqn)
+{
+	struct mlx5_cq_table *table = &eq->cq_table;
+	struct mlx5_core_cq *cq = NULL;
+
+	spin_lock(&table->lock);
+	cq = radix_tree_lookup(&table->tree, cqn);
+	if (likely(cq))
+		mlx5_cq_hold(cq);
+	spin_unlock(&table->lock);
+
+	return cq;
+}
+
+static void mlx5_eq_cq_completion(struct mlx5_eq *eq, u32 cqn)
+{
+	struct mlx5_core_cq *cq = mlx5_eq_cq_get(eq, cqn);
+
+	if (unlikely(!cq)) {
+		mlx5_core_warn(eq->dev, "Completion event for bogus CQ 0x%x\n", cqn);
+		return;
+	}
+
+	++cq->arm_sn;
+
+	cq->comp(cq);
+
+	mlx5_cq_put(cq);
+}
+
+static void mlx5_eq_cq_event(struct mlx5_eq *eq, u32 cqn, int event_type)
+{
+	struct mlx5_core_cq *cq = mlx5_eq_cq_get(eq, cqn);
+
+	if (unlikely(!cq)) {
+		mlx5_core_warn(eq->dev, "Async event for bogus CQ 0x%x\n", cqn);
+		return;
+	}
+
+	cq->event(cq, event_type);
+
+	mlx5_cq_put(cq);
+}
+
 static irqreturn_t mlx5_eq_int(int irq, void *eq_ptr)
 {
 	struct mlx5_eq *eq = eq_ptr;
@@ -415,7 +460,7 @@ static irqreturn_t mlx5_eq_int(int irq, void *eq_ptr)
 		switch (eqe->type) {
 		case MLX5_EVENT_TYPE_COMP:
 			cqn = be32_to_cpu(eqe->data.comp.cqn) & 0xff;
-			mlx5_cq_completion(eq, cqn);
+			mlx5_eq_cq_completion(eq, cqn);
 			break;
 		case MLX5_EVENT_TYPE_DCT_DRAINED:
 			rsn = be32_to_cpu(eqe->data.dct.dctn) & 0xff;
@@ -472,7 +517,7 @@ static irqreturn_t mlx5_eq_int(int irq, void *eq_ptr)
 			cqn = be32_to_cpu(eqe->data.cq_err.cqn) & 0xff;
 			mlx5_core_warn(dev, "CQ error on CQN 0x%x, syndrome 0x%x\n",
 				       cqn, eqe->data.cq_err.syndrome);
-			mlx5_cq_event(eq, cqn, eqe->type);
+			mlx5_eq_cq_event(eq, cqn, eqe->type);
 			break;
 
 		case MLX5_EVENT_TYPE_PAGE_REQUEST:
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 96e003db2bcd..09e2f3e8753c 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -1049,12 +1049,10 @@ int mlx5_eq_init(struct mlx5_core_dev *dev);
 void

Re: [PATCH net-next] RDS: deliver zerocopy completion notification with data as an optimization

2018-02-21 Thread Santosh Shilimkar


On 2/21/2018 12:19 PM, Sowmini Varadhan wrote:

This commit is an optimization that builds on top of commit 01883eda72bd
("rds: support for zcopy completion notification") for PF_RDS sockets.

Cookies associated with zerocopy completion are passed up on the POLLIN
channel, piggybacked with data whereever possible. Such cookies are passed

s/whereever/wherever


up as ancillary data (at level SOL_RDS) in a struct rds_zcopy_cookies when
the returned value of recvmsg() is >= 0. A max of SO_EE_ORIGIN_MAX_ZCOOKIES
may be passed with each message.

Signed-off-by: Sowmini Varadhan 
---

Acked-by: Santosh Shilimkar
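
For readers following along, the receive side described in the commit message could look roughly like the userspace sketch below. It is only an illustration under stated assumptions: the `ex_` names are made up for this example, the struct layout and the cmsg type value 13 are copied from the posted patch (and may change before merge), and the cookie limit of 8 is assumed to match SO_EE_ORIGIN_MAX_ZCOOKIES. It walks the ancillary data returned by recvmsg() and collects any zerocopy completion cookies found at level SOL_RDS.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Assumed values mirroring the posted patch; a real program should take
 * them from <linux/rds.h> and <linux/errqueue.h> once they are available. */
#ifndef SOL_RDS
#define SOL_RDS 276
#endif
#define EX_RDS_CMSG_ZCOPY_COMPLETION 13	/* from the patch, may change */
#define EX_MAX_ZCOOKIES 8		/* assumed SO_EE_ORIGIN_MAX_ZCOOKIES */

/* Local copy of the layout proposed as struct rds_zcopy_cookies. */
struct ex_zcopy_cookies {
	uint32_t num;
	uint32_t cookies[EX_MAX_ZCOOKIES];
};

/* Walk the control messages of a received msghdr and copy every zerocopy
 * completion cookie into out[]. Returns the number of cookies copied. */
static size_t ex_collect_zcookies(struct msghdr *msg, uint32_t *out,
				  size_t out_len)
{
	struct cmsghdr *cmsg;
	size_t n = 0;

	for (cmsg = CMSG_FIRSTHDR(msg); cmsg; cmsg = CMSG_NXTHDR(msg, cmsg)) {
		struct ex_zcopy_cookies ck;

		if (cmsg->cmsg_level != SOL_RDS ||
		    cmsg->cmsg_type != EX_RDS_CMSG_ZCOPY_COMPLETION)
			continue;
		/* Skip anything too short to hold the full structure. */
		if (cmsg->cmsg_len < CMSG_LEN(sizeof(ck)))
			continue;

		memcpy(&ck, CMSG_DATA(cmsg), sizeof(ck));
		if (ck.num > EX_MAX_ZCOOKIES)
			ck.num = EX_MAX_ZCOOKIES;	/* never trust the count */
		for (uint32_t i = 0; i < ck.num && n < out_len; i++)
			out[n++] = ck.cookies[i];
	}
	return n;
}
```

A caller would invoke this after any recvmsg() that returns >= 0, since per the description cookies may arrive with data or on their own.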

Re: ss issue on arm not showing UDP listening ports

2018-02-21 Thread Stefano Brivio

On Wed, 21 Feb 2018 12:37:31 -0500
jesse_coo...@codeholics.com wrote:

> ss utility, iproute2-ss161212

Works for me on iproute2-ss161212 and 4.9.0 kernel on armv7l. Unless
somebody on the list has other ideas, I guess you should either try
more recent versions, debug it (strace should show a pair of
recvmsg/sendmsg for each UDP socket) or file a ticket for your
distribution.

-- 
Stefano

Re: [PATCH V7 2/4] sctp: Add ip option support

2018-02-21 Thread Paul Moore

On February 21, 2018 9:33:51 AM Marcelo Ricardo Leitner wrote:
> On Tue, Feb 20, 2018 at 07:15:27PM +, Richard Haines wrote:
>> Add ip option support to allow LSM security modules to utilise CIPSO/IPv4
>> and CALIPSO/IPv6 services.
>> 
>> Signed-off-by: Richard Haines 
>
> LGTM too, thanks!
>
> Acked-by: Marcelo Ricardo Leitner 

I agree, thanks everyone for all the work, review, and patience behind this 
patchset!  I'll work on merging this into selinux/next and I'll send a note 
when it's done.

--
paul moore
www.paul-moore.com

[PATCH v2 net-next 0/2] lan743x: Add new lan743x driver

2018-02-21 Thread Bryan Whitehead

Add new lan743x driver. 

The lan743x from Microchip Technologies Inc,
is a PCIe to Gigabit Ethernet Controller.

Bryan Whitehead (2):
  lan743x: Add main source files for new lan743x driver
  lan743x: Update MAINTAINERS to include lan743x driver

 MAINTAINERS   |7 +
 drivers/net/ethernet/microchip/Kconfig|   10 +
 drivers/net/ethernet/microchip/Makefile   |3 +
 drivers/net/ethernet/microchip/lan743x_main.c | 2757 +
 drivers/net/ethernet/microchip/lan743x_main.h |  686 ++
 5 files changed, 3463 insertions(+)
 create mode 100644 drivers/net/ethernet/microchip/lan743x_main.c
 create mode 100644 drivers/net/ethernet/microchip/lan743x_main.h

-- 
2.7.4

Re: Qualcomm rmnet driver and qmi_wwan

2018-02-21 Thread Subash Abhinov Kasiviswanathan


On 2018-02-21 04:38, Daniele Palmas wrote:

Hello,

In the rmnet kernel documentation I read:

"This driver can be used to register onto any physical network device in
IP mode. Physical transports include USB, HSIC, PCIe and IP accelerator."

Does this mean that it can be used in association with the qmi_wwan driver?

If yes, can someone give me a hint on the steps to follow?

If not, does anyone know if it is possible to modify qmi_wwan in order
to take advantage of the features provided by the rmnet driver?

In this case, hints on the changes needed to modify qmi_wwan are welcome.

Thanks in advance,
Daniele


Hi

I haven't used qmi_wwan, so the following comment is based on code inspection.
qmimux_register_device() creates qmimux devices with the USB net device as
real_dev. The multiplexing and aggregation header (qmimux_hdr) is stripped off
in qmimux_rx_fixup() and the packet is passed on to the stack.

You could instead create rmnet devices with the USB netdevice as real_dev.
The packets from the USB net driver can be queued to the network stack
directly, as the rmnet driver will set up an RX handler. The rmnet driver will
process the packets further and then queue them to the network stack.

--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

[for-next 6/7] net/mlx5: Remove redundant EQ API exports

2018-02-21 Thread Saeed Mahameed

The EQ structure and API are private to the mlx5_core driver; external
drivers should not have access to, or the means to manipulate, EQ objects.

Remove redundant exports and move API functions out of the linux/mlx5
include directory into the driver's mlx5_core.h private include file.

Signed-off-by: Saeed Mahameed 
Reviewed-by: Gal Pressman 
---
 drivers/net/ethernet/mellanox/mlx5/core/eq.c|  3 ---
 drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h | 17 +
 include/linux/mlx5/driver.h | 17 -
 3 files changed, 17 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index 7e442b38a8ca..c1c94974e16b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -720,7 +720,6 @@ int mlx5_create_map_eq(struct mlx5_core_dev *dev, struct mlx5_eq *eq, u8 vecidx,
 	mlx5_buf_free(dev, &eq->buf);
 	return err;
 }
-EXPORT_SYMBOL_GPL(mlx5_create_map_eq);
 
 int mlx5_destroy_unmap_eq(struct mlx5_core_dev *dev, struct mlx5_eq *eq)
 {
@@ -747,7 +746,6 @@ int mlx5_destroy_unmap_eq(struct mlx5_core_dev *dev, struct mlx5_eq *eq)
 
 	return err;
 }
-EXPORT_SYMBOL_GPL(mlx5_destroy_unmap_eq);
 
 int mlx5_eq_add_cq(struct mlx5_eq *eq, struct mlx5_core_cq *cq)
 {
@@ -925,4 +923,3 @@ int mlx5_core_eq_query(struct mlx5_core_dev *dev, struct mlx5_eq *eq,
 	MLX5_SET(query_eq_in, in, eq_number, eq->eqn);
 	return mlx5_cmd_exec(dev, in, sizeof(in), out, outlen);
 }
-EXPORT_SYMBOL_GPL(mlx5_core_eq_query);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
index 54a1cbfb1b5a..23e17ac0cba5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/mlx5_core.h
@@ -117,11 +117,28 @@ int mlx5_destroy_scheduling_element_cmd(struct mlx5_core_dev *dev, u8 hierarchy,
 int mlx5_wait_for_vf_pages(struct mlx5_core_dev *dev);
 u64 mlx5_read_internal_timer(struct mlx5_core_dev *dev);
 
+int mlx5_eq_init(struct mlx5_core_dev *dev);
+void mlx5_eq_cleanup(struct mlx5_core_dev *dev);
+int mlx5_create_map_eq(struct mlx5_core_dev *dev, struct mlx5_eq *eq, u8 vecidx,
+		       int nent, u64 mask, const char *name,
+		       enum mlx5_eq_type type);
+int mlx5_destroy_unmap_eq(struct mlx5_core_dev *dev, struct mlx5_eq *eq);
 int mlx5_eq_add_cq(struct mlx5_eq *eq, struct mlx5_core_cq *cq);
 int mlx5_eq_del_cq(struct mlx5_eq *eq, struct mlx5_core_cq *cq);
+int mlx5_core_eq_query(struct mlx5_core_dev *dev, struct mlx5_eq *eq,
+  u32 *out, int outlen);
+int mlx5_start_eqs(struct mlx5_core_dev *dev);
+void mlx5_stop_eqs(struct mlx5_core_dev *dev);
 struct mlx5_eq *mlx5_eqn2eq(struct mlx5_core_dev *dev, int eqn);
 u32 mlx5_eq_poll_irq_disabled(struct mlx5_eq *eq);
 void mlx5_cq_tasklet_cb(unsigned long data);
+void mlx5_cmd_comp_handler(struct mlx5_core_dev *dev, u64 vec, bool forced);
+int mlx5_debug_eq_add(struct mlx5_core_dev *dev, struct mlx5_eq *eq);
+void mlx5_debug_eq_remove(struct mlx5_core_dev *dev, struct mlx5_eq *eq);
+int mlx5_eq_debugfs_init(struct mlx5_core_dev *dev);
+void mlx5_eq_debugfs_cleanup(struct mlx5_core_dev *dev);
+int mlx5_cq_debugfs_init(struct mlx5_core_dev *dev);
+void mlx5_cq_debugfs_cleanup(struct mlx5_core_dev *dev);
 
 int mlx5_query_pcam_reg(struct mlx5_core_dev *dev, u32 *pcam, u8 feature_group,
u8 access_reg_group);
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 09e2f3e8753c..2860a253275b 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -1045,20 +1045,11 @@ int mlx5_satisfy_startup_pages(struct mlx5_core_dev *dev, int boot);
 int mlx5_reclaim_startup_pages(struct mlx5_core_dev *dev);
 void mlx5_register_debugfs(void);
 void mlx5_unregister_debugfs(void);
-int mlx5_eq_init(struct mlx5_core_dev *dev);
-void mlx5_eq_cleanup(struct mlx5_core_dev *dev);
 void mlx5_fill_page_array(struct mlx5_buf *buf, __be64 *pas);
 void mlx5_fill_page_frag_array(struct mlx5_frag_buf *frag_buf, __be64 *pas);
 void mlx5_rsc_event(struct mlx5_core_dev *dev, u32 rsn, int event_type);
 void mlx5_srq_event(struct mlx5_core_dev *dev, u32 srqn, int event_type);
 struct mlx5_core_srq *mlx5_core_get_srq(struct mlx5_core_dev *dev, u32 srqn);
-void mlx5_cmd_comp_handler(struct mlx5_core_dev *dev, u64 vec, bool forced);
-int mlx5_create_map_eq(struct mlx5_core_dev *dev, struct mlx5_eq *eq, u8 vecidx,
-		       int nent, u64 mask, const char *name,
-		       enum mlx5_eq_type type);
-int mlx5_destroy_unmap_eq(struct mlx5_core_dev *dev, struct mlx5_eq *eq);
-int mlx5_start_eqs(struct mlx5_core_dev *dev);
-void mlx5_stop_eqs(struct mlx5_core_dev *dev);
 int mlx5_vector2eqn(struct mlx5_core_dev *dev, int vector, int *eqn,

Re: [PATCH net-next,v3] net: sched: add em_ipt ematch for calling xtables matches

2018-02-21 Thread David Miller

From: Eyal Birger 
Date: Thu, 15 Feb 2018 19:42:43 +0200

> This commit adds a new tc ematch for using netfilter xtable matches.
> 
> This allows early classification as well as mirroring/redirecting traffic
> based on logic implemented in netfilter extensions.
> 
> The currently supported use case is classification based on the incoming
> IPSec state used during decapsulation, using the 'policy' iptables extension
> (xt_policy).
> 
> The module dynamically fetches the netfilter match module and calls
> it using a fake xt_action_param structure based on validated userspace
> provided parameters.
> 
> As the xt_policy match does not access skb->data, no skb modifications
> are needed on match.
> 
> Signed-off-by: Eyal Birger 

Applied, thank you.

Re: ppp/pppoe, still panic 4.15.3 in ppp_push

2018-02-21 Thread Guillaume Nault

On Sun, Feb 18, 2018 at 12:01:02PM +0200, Denys Fedoryshchenko wrote:
> On 2018-02-16 20:48, Guillaume Nault wrote:
> > On Fri, Feb 16, 2018 at 01:13:18PM +0200, Denys Fedoryshchenko wrote:
> > > As far as i can see there is only KASAN triggered again(and server
> > > rebooted
> > > shortly after that), but nothing else:
> > > 
> > Ok, so no refcount failure detected. Not what I expected... but that's
> > still useful information. It's getting even harder to find a ppp scenario
> > that could lead to such symptoms.
> > If that's acceptable for you, you can try reverting the few commits
> > that entered after 4.14.
> > 
> > 02612bb05e51df8489db5e94d0cf8d1c81f87b0c pppoe: take ->needed_headroom
> > of lower device into account on xmit
> > 0171c41835591e9aa2e384b703ef9a6ae367c610 ppp: unlock all_ppp_mutex
> > before registering device
> > e6675000f9a404f7651724c0b2e2e71f7247d3a1 ppp: exit_net cleanup checks
> > added
> > f02b2320b27c16b644691267ee3b5c110846f49e ppp: Destroy the mutex when
> > cleanup
> > 90e229ef61fad240554f5899eb122fbe44990f78 ppp: allow usage in namespaces
> > 709c89b45b874b2f81a074b8802a736009873f48 drivers, net, ppp: convert
> > syncppp.refcnt from atomic_t to refcount_t
> > d780cd44e3cea119a3346e6d7c04d35b9c50d54b drivers, net, ppp: convert
> > ppp_file.refcnt from atomic_t to refcount_t
> > 313a912155c78ed87ad6fca175dc56b75fd00a58 drivers, net, ppp: convert
> > asyncppp.refcnt from atomic_t to refcount_t
> > 
> > Sorry, but I have nothing better to propose for now. At least that
> > should help narrowing the problem space.
> > I'm going to stress test ppp_generic and pppoe on my side.
> > 
> Quick update.
> Testing with the first 5 patches reverted didn't change anything.
> But reverting more, with the last 4 patches as well (I did them all together),
> does change things; I probably need to repeat for one more night, reverting
> just the refcount_t patches.
>
So you got the following trace with all 8 patches reverted, right?
I prefer to concentrate on the other traces for now. If this one tends
to be reproducible, you can try to activate lockdep (for lack of better
suggestion).

>  [25222.173840] ------------[ cut here ]------------
>  [25222.174259] NETDEV WATCHDOG: eth1 (ixgbe): transmit queue 3 timed out
>  [25222.174618] WARNING: CPU: 3 PID: 0 at net/sched/sch_generic.c:323
> dev_watchdog+0x44a/0x555
>  [25222.175212] Modules linked in: pppoe pppox ppp_generic slhc netconsole
> configfs coretemp nf_nat_pptp nf_nat_proto_gre nf_conntrack_pptp
> nf_conntrack_proto_gre tun xt_TEE nf_dup_ipv4 x
> t_REDIRECT nf_nat_redirect xt_nat xt_TCPMSS ipt_REJECT nf_reject_ipv4 xt_set
> xt_string xt_connmark xt_DSCP xt_mark xt_tcpudp ip_set_hash_net
> ip_set_hash_ip ip_set nfnetlink iptable_mangle iptable_filter iptable_na
> t nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables
> x_tables 8021q garp mrp stp llc ixgbe dca
>  [25222.177133] CPU: 3 PID: 0 Comm: swapper/3 Tainted: GB   W
> 4.15.3-build-0134 #6
>  [25222.184121] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80
> 04/02/2015
>  [25222.184457] RIP: 0010:dev_watchdog+0x44a/0x555
>  [25222.184791] RSP: 0018:8803f22c7d98 EFLAGS: 00010292
>  [25222.185127] RAX:  RBX: 8803ded00438 RCX:
> 
>  [25222.185463] RDX: 0001 RSI: 0002 RDI:
> ed007e458fa8
>  [25222.185797] RBP: 8803ded0 R08: 0001 R09:
> 
>  [25222.186133] R10: 8803f22c7e30 R11: 0001 R12:
> 8803ded28450
>  [25222.186471] R13: 0003 R14: dc00 R15:
> 8803ded283c0
>  [25222.186804] FS:  () GS:8803f22c()
> knlGS:
>  [25222.187401] CS:  0010 DS:  ES:  CR0: 80050033
>  [25222.187739] CR2: 561f5bffc128 CR3: 000445a0d003 CR4:
> 001606e0
>  [25222.188077] Call Trace:
>  [25222.188410]  
>  [25222.188740]  ? dev_graft_qdisc+0xfa/0xfa
>  [25222.189072]  call_timer_fn+0x15/0x72
>  [25222.189407]  ? dev_graft_qdisc+0xfa/0xfa
>  [25222.189741]  expire_timers+0x1b9/0x1d5
>  [25222.190072]  run_timer_softirq+0x184/0x361
>  [25222.190400]  ? expire_timers+0x1d5/0x1d5
>  [25222.190723]  ? enqueue_hrtimer+0xce/0xd8
>  [25222.191048]  ? __hrtimer_run_queues+0x1ec/0x24d
>  [25222.191373]  __do_softirq+0x17f/0x34a
>  [25222.191702]  irq_exit+0x8f/0xf9
>  [25222.192034]  smp_apic_timer_interrupt+0xcb/0xd6
>  [25222.192365]  apic_timer_interrupt+0x92/0xa0
>  [25222.192695]  
>  [25222.193023] RIP: 0010:mwait_idle+0x99/0xac
>  [25222.193355] RSP: 0018:8803f030fef8 EFLAGS: 0246 ORIG_RAX:
> ff11
>  [25222.193956] RAX:  RBX: 8803f02e3500 RCX:
> 
>  [25222.194290] RDX: 11007e05c6a0 RSI:  RDI:
> 
>  [25222.194626] RBP: 8803f02e3500 R08: ed007ccc8eef R09:
> 8803e6647728
>  [25222.194958] R10: 8803f030fdd0 R11: 0001 R12:
> 
>  [25222.195292] R13: dc00 R14:

Re: ppp/pppoe, still panic 4.15.3 in ppp_push

2018-02-21 Thread Guillaume Nault

On Wed, Feb 21, 2018 at 12:26:51PM +0200, Denys Fedoryshchenko wrote:
> It seems even rebuilding seemingly stable version triggering crashes too
> (but different ones)
Different ones? The trace following your message looks very similar to
your first KASAN report. Or are you referring to the lockup you posted
on Sun, 18 Feb 2018?

Also, which stable versions are you referring to?

I'm interested in the ppp_generic.o file that produced the following
trace. Just to be sure that the differences come from the new debugging
options.

> Maybe it is a coincidence, and the bug reproducer appeared in the network at
> the same time I decided to upgrade the kernel,
> as happened with xt_MSS (and that bug had existed for years).
> 
> Quoting deleted; I added more debug options (as many as the performance
> degradation allows).
> This is vanilla again:
> 
> [14834.090421]
> ==
> [14834.091157] BUG: KASAN: use-after-free in __list_add_valid+0x69/0xad
> [14834.091521] Read of size 8 at addr 8803dbeb8660 by task
> accel-pppd/12636
> [14834.091905]
> [14834.092282] CPU: 0 PID: 12636 Comm: accel-pppd Not tainted
> 4.15.4-build-0134 #1
> [14834.092930] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80
> 04/02/2015
> [14834.093320] Call Trace:
> [14834.093680]  dump_stack+0xb3/0x13e
> [14834.094050]  ? _atomic_dec_and_lock+0x10f/0x10f
> [14834.094434]  print_address_description+0x69/0x236
> [14834.094814]  ? __list_add_valid+0x69/0xad
> [14834.095197]  kasan_report+0x219/0x23f
> [14834.095570]  __list_add_valid+0x69/0xad
> [14834.095957]  ppp_ioctl+0x1216/0x2201 [ppp_generic]
> [14834.096348]  ? ppp_write+0x1cc/0x1cc [ppp_generic]
> [14834.096723]  ? get_usage_char.isra.2+0x36/0x36
> [14834.097094]  ? packet_poll+0x362/0x362
> [14834.097455]  ? lock_downgrade+0x4d0/0x4d0
> [14834.097811]  ? rcu_irq_enter_disabled+0x8/0x8
> [14834.098187]  ? get_usage_char.isra.2+0x36/0x36
> [14834.098561]  ? __fget+0x3b8/0x3eb
> [14834.098936]  ? get_usage_char.isra.2+0x36/0x36
> [14834.099309]  ? __fget+0x3a0/0x3eb
> [14834.099682]  ? get_usage_char.isra.2+0x36/0x36
> [14834.100069]  ? __fget+0x3a0/0x3eb
> [14834.100443]  ? lock_downgrade+0x4d0/0x4d0
> [14834.100814]  ? rcu_irq_enter_disabled+0x8/0x8
> [14834.101203]  ? __fget+0x3b8/0x3eb
> [14834.101581]  ? expand_files+0x62f/0x62f
> [14834.101945]  ? kernel_read+0xed/0xed
> [14834.102322]  ? SyS_getpeername+0x28b/0x28b
> [14834.102690]  vfs_ioctl+0x6e/0x81
> [14834.103049]  do_vfs_ioctl+0xe2f/0xe62
> [14834.103413]  ? ioctl_preallocate+0x211/0x211
> [14834.103778]  ? __fget_light+0x28c/0x2ca
> [14834.104150]  ? iterate_fd+0x2a8/0x2a8
> [14834.104526]  ? SyS_rt_sigprocmask+0x12e/0x181
> [14834.104876]  ? sigprocmask+0x23f/0x23f
> [14834.105231]  ? SyS_write+0x148/0x173
> [14834.105580]  ? SyS_read+0x173/0x173
> [14834.105943]  SyS_ioctl+0x39/0x55
> [14834.106316]  ? do_vfs_ioctl+0xe62/0xe62
> [14834.106694]  do_syscall_64+0x262/0x594
> [14834.107076]  ? syscall_return_slowpath+0x351/0x351
> [14834.107447]  ? up_read+0x17/0x2c
> [14834.107806]  ? __do_page_fault+0x68a/0x763
> [14834.108171]  ? entry_SYSCALL_64_after_hwframe+0x36/0x9b
> [14834.108550]  ? trace_hardirqs_off_thunk+0x1a/0x1c
> [14834.108937]  entry_SYSCALL_64_after_hwframe+0x26/0x9b
> [14834.109293] RIP: 0033:0x7fc9be3758a7
> [14834.109652] RSP: 002b:7fc9bf92aaf8 EFLAGS: 0206 ORIG_RAX:
> 0010
> [14834.110313] RAX: ffda RBX: 7fc9bdc5e1e3 RCX:
> 7fc9be3758a7
> [14834.110707] RDX: 7fc9b7ad13e8 RSI: 4004743a RDI:
> 4b9f
> [14834.111082] RBP: 7fc9bf92ab20 R08:  R09:
> 55f07a27fe40
> [14834.111471] R10: 0008 R11: 0206 R12:
> 7fc9b7ad12d8
> [14834.111845] R13: 7ffd06346a6f R14:  R15:
> 7fc9bf92b700
> [14834.112231]
> [14834.112589] Allocated by task 12636:
> [14834.112962]  ppp_register_net_channel+0xc4/0x610 [ppp_generic]
> [14834.113331]  pppoe_connect+0xe6d/0x1097 [pppoe]
> [14834.113691]  SyS_connect+0x19c/0x274
> [14834.114054]  do_syscall_64+0x262/0x594
> [14834.114421]  entry_SYSCALL_64_after_hwframe+0x26/0x9b
> [14834.114792]
> [14834.115139] Freed by task 12636:
> [14834.115504]  kfree+0xe2/0x154
> [14834.115866]  ppp_release+0x11b/0x12a [ppp_generic]
> [14834.116240]  __fput+0x342/0x5ba
> [14834.116611]  task_work_run+0x15d/0x198
> [14834.116973]  exit_to_usermode_loop+0xc7/0x153
> [14834.117320]  do_syscall_64+0x53d/0x594
> [14834.117694]  entry_SYSCALL_64_after_hwframe+0x26/0x9b
> [14834.118067]
> [14834.118426] The buggy address belongs to the object at 8803dbeb8480
> [14834.119087] The buggy address is located 480 bytes inside of
> [14834.119755] The buggy address belongs to the page:
> [14834.120138] page:ea000f6fae00 count:1 mapcount:0 mapping:
> (null) index:0x8803dbebd580 compound_mapcount: 0
> [14834.120817] flags: 0x17ffe0008100(slab|head)
> [14834.121171] raw: 17ffe0008100  8803dbebd580
>

Re: [PATCH net-next ] ibmvnic: Correct goto target for tx irq initialization failure

2018-02-21 Thread David Miller

From: Nathan Fontenot 
Date: Tue, 20 Feb 2018 11:04:18 -0600

> When a failure occurs during initialization of the tx sub crq
> irqs, we should branch to the cleanup of the tx irqs. The current
> code branches to the rx irq cleanup and attempts to cleanup the
> rx irqs which have not been initialized.
> 
> Signed-off-by: Nathan Fontenot 

Applied, thanks.
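
The bug class fixed here is easier to see in a reduced form. The sketch below is not the ibmvnic code; all `ex_` names and the fail-at parameters are hypothetical and exist only to make the control flow testable. It shows the usual staged-initialization layout, where each error label undoes only what was set up before the failure, so a tx failure must jump to the tx cleanup label and never to the rx one.

```c
/* Hypothetical bookkeeping: how many irqs of each kind are currently set up. */
struct ex_irqs {
	int tx_ready;
	int rx_ready;
};

/* Set up ntx tx irqs, then nrx rx irqs. fail_tx_at / fail_rx_at give the
 * index at which setup is made to fail (-1 for no failure). Returns 0 on
 * success or -1 after undoing everything that had been set up. */
static int ex_init_irqs(struct ex_irqs *s, int ntx, int nrx,
			int fail_tx_at, int fail_rx_at)
{
	int i, j = 0;

	for (i = 0; i < ntx; i++) {
		if (i == fail_tx_at)
			goto err_tx;	/* rx not touched yet: skip rx cleanup */
		s->tx_ready++;
	}

	for (j = 0; j < nrx; j++) {
		if (j == fail_rx_at)
			goto err_rx;
		s->rx_ready++;
	}

	return 0;

err_rx:
	/* Undo the rx irqs that were set up, then fall through to tx. */
	while (j-- > 0)
		s->rx_ready--;
err_tx:
	/* Undo only the tx irqs that were actually set up. */
	while (i-- > 0)
		s->tx_ready--;
	return -1;
}
```

Jumping to `err_rx` from the tx loop would run the rx cleanup over state that was never initialized, which is the situation the commit message describes.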

Re: [patch net-next] mlxsw: spectrum_switchdev: Allow port enslavement to a VLAN-unaware bridge

2018-02-21 Thread Ido Schimmel

On Wed, Feb 21, 2018 at 12:41:39PM -0700, David Ahern wrote:
> On 2/21/18 12:25 PM, Ido Schimmel wrote:
> >>
> >> and can talk to the hosts:
> >> # ping6 ff02::2%br0
> > 
> > Can you try ff02::1 ?
> 
> same result.
> 
> > 
> >> PING ff02::2%br0(ff02::2) 56 data bytes
> >> 64 bytes from fe80::7efe:90ff:fee8:3a79: icmp_seq=1 ttl=64 time=0.073 ms
> >> 64 bytes from fe80::202:ff:fe00:2: icmp_seq=1 ttl=64 time=0.661 ms (DUP!)
> >> 64 bytes from fe80::202:ff:fe00:5: icmp_seq=1 ttl=64 time=0.705 ms (DUP!)
> >> 64 bytes from fe80::202:ff:fe00:1: icmp_seq=1 ttl=64 time=0.720 ms (DUP!)
> >> 64 bytes from fe80::202:ff:fe00:3: icmp_seq=1 ttl=64 time=0.729 ms (DUP!)
> >> 64 bytes from fe80::202:ff:fe00:6: icmp_seq=1 ttl=64 time=0.739 ms (DUP!)
> >> 64 bytes from fe80::202:ff:fe00:4: icmp_seq=1 ttl=64 time=0.748 ms (DUP!)
> >> 64 bytes from fe80::202:ff:fe00:7: icmp_seq=1 ttl=64 time=0.757 ms (DUP!)
> >> 64 bytes from fe80::202:ff:fe00:8: icmp_seq=1 ttl=64 time=0.766 ms (DUP!)
> >>
> >> but the hosts can not talk to each other:
> >>
> >> cumulus@host3:~$ net show lldp
> >>
> >> LocalPort  Speed  Mode          RemoteHost   RemotePort
> >> ---------  -----  ------------  -----------  ----------
> >> swp1       10G    Interface/L3  mlx-2700-05  swp1s2
> >>
> >> cumulus@host3:~$ ping6 3000:1000:1000:1000::2
> >> PING 3000:1000:1000:1000::2(3000:1000:1000:1000::2) 56 data bytes
> >> ^C
> >> --- 3000:1000:1000:1000::2 ping statistics ---
> >> 3 packets transmitted, 0 received, 100% packet loss, time 1999ms
> > 
> > Can you please try
> > 
> > # brctl showstp br0
> > 
> > To make sure STP state is set to forwarding?
> 
> it is.
> 
> > 
> > Does it matter if you try IPv4 ping or if vlan_filtering is set 1?
> > Unfortunately, I can't reproduce on my switch.
> 
> Bring up the hosts and then reboot the switch. At that point I get no
> host to host communication. As soon as I flap the port on host1 host3 to
> host1 starts working.
> 
> So it seems to be something about the initial boot state.

You didn't have IPv6 *and* IPv4 ping? I'm asking because it's possible
host1 sent an MLD join to the Solicited-node multicast address before
the bridge started listening, which means it didn't have a corresponding
MDB entry.

Assuming your hosts aren't functioning as multicast routers and sending
MLD queries and that you didn't configure them as mrouter ports on the
switch, then when host3 sent a neighbour solicitation message to host1's
Solicited-node multicast address it wasn't flooded to host1, which
prevented ping from passing.

This also explains why it started working when you flapped the port on
host1, as Linux generates MLD joins in these cases.

You can try to disable snooping:

# ip link set dev br0 type bridge mcast_snooping 0

Just a guess, but worth a try.
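
As a side note for anyone checking MDB entries by hand: the Solicited-node group for a given unicast address is ff02::1:ff00:0/104 plus the low 24 bits of that address (RFC 4291). A minimal sketch in C, with `ex_`-prefixed helper names that are purely illustrative and not part of any existing tool:

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Build the solicited-node multicast address for a unicast IPv6 address:
 * fixed 104-bit prefix ff02:0:0:0:0:1:ff, then the low 24 bits. */
static void ex_solicited_node(const struct in6_addr *ucast,
			      struct in6_addr *out)
{
	static const unsigned char prefix[13] = {
		0xff, 0x02, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0x01, 0xff
	};

	memcpy(out->s6_addr, prefix, sizeof(prefix));
	memcpy(out->s6_addr + 13, ucast->s6_addr + 13, 3);
}

/* Text-to-text wrapper. Returns 0 on success, -1 if the input does not
 * parse as an IPv6 address or the output buffer is too small. */
static int ex_solicited_node_str(const char *ucast, char *buf, socklen_t len)
{
	struct in6_addr in, out;

	if (inet_pton(AF_INET6, ucast, &in) != 1)
		return -1;
	ex_solicited_node(&in, &out);
	return inet_ntop(AF_INET6, &out, buf, len) ? 0 : -1;
}
```

For the address pinged earlier in this thread, 3000:1000:1000:1000::2, this gives ff02::1:ff00:2, which is the group a snooping bridge needs an MDB entry for (or must flood) before neighbour discovery for that address can succeed.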

Thanks!

Re: [PATCH net] amd-xgbe: Restore PCI interrupt enablement setting on resume

2018-02-21 Thread David Miller

From: Tom Lendacky 
Date: Tue, 20 Feb 2018 15:22:05 -0600

> After resuming from suspend, the PCI device support must re-enable the
> interrupt setting so that interrupts are actually delivered.
> 
> Signed-off-by: Tom Lendacky 
> ---
> 
> Please queue this patch up to stable releases 4.14 and above.

Applied and queued up for -stable, thank you.

Re: [patch net-next] mlxsw: spectrum_switchdev: Allow port enslavement to a VLAN-unaware bridge

2018-02-21 Thread David Ahern

On 2/21/18 1:24 PM, Ido Schimmel wrote:
>>> Does it matter if you try IPv4 ping or if vlan_filtering is set 1?
>>> Unfortunately, I can't reproduce on my switch.
>>
>> Bring up the hosts and then reboot the switch. At that point I get no
>> host to host communication. As soon as I flap the port on host1 host3 to
>> host1 starts working.
>>
>> So it seems to be something about the initial boot state.
> 
> You didn't have IPv6 *and* IPv4 ping? I'm asking because it's possible
> host1 sent an MLD join to the Solicited-node multicast address before
> the bridge started listening, which means it didn't have a corresponding
> MDB entry.

The sim only configures IPv6, but it is not acting as an mcast router.
It's really a dummy setup -- bridge on the switch, ports connected to hosts.

> 
> Assuming your hosts aren't functioning as multicast routers and sending
> MLD queries and that you didn't configure them as mrouter ports on the
> switch, then when host3 sent a neighbour solicitation message to host1's
> Solicited-node multicast address it wasn't flooded to host1, which
> prevented ping from passing.
> 
> This also explains why it started working when you flapped the port on
> host1, as Linux generates MLD joins in these cases.
> 
> You can try to disable snooping:
> 
> # ip link set dev br0 type bridge mcast_snooping 0
> 
> Just a guess, but worth a try.

good guess. That change gets it working.

[PATCH net-next] RDS: deliver zerocopy completion notification with data as an optimization

2018-02-21 Thread Sowmini Varadhan

This commit is an optimization that builds on top of commit 01883eda72bd
("rds: support for zcopy completion notification") for PF_RDS sockets.

Cookies associated with zerocopy completion are passed up on the POLLIN
channel, piggybacked with data whereever possible. Such cookies are passed
up as ancillary data (at level SOL_RDS) in a struct rds_zcopy_cookies when
the returned value of recvmsg() is >= 0. A max of SO_EE_ORIGIN_MAX_ZCOOKIES
may be passed with each message.

Signed-off-by: Sowmini Varadhan 
---
 include/uapi/linux/rds.h |8 +++
 net/rds/recv.c   |   47 ++
 2 files changed, 55 insertions(+), 0 deletions(-)

diff --git a/include/uapi/linux/rds.h b/include/uapi/linux/rds.h
index 12e3bca..e733c01 100644
--- a/include/uapi/linux/rds.h
+++ b/include/uapi/linux/rds.h
@@ -37,6 +37,8 @@
 
 #include 
 #include   /* For __kernel_sockaddr_storage. */
+#include 
+#include 
 
 #define RDS_IB_ABI_VERSION 0x301
 
@@ -104,6 +106,7 @@
 #define RDS_CMSG_MASKED_ATOMIC_CSWP	9
 #define RDS_CMSG_RXPATH_LATENCY		11
 #define	RDS_CMSG_ZCOPY_COOKIE		12
+#define	RDS_CMSG_ZCOPY_COMPLETION	13
 
 #define RDS_INFO_FIRST 1
 #define RDS_INFO_COUNTERS  1
@@ -317,6 +320,11 @@ struct rds_rdma_notify {
 #define RDS_RDMA_DROPPED   3
 #define RDS_RDMA_OTHER_ERROR   4
 
+struct rds_zcopy_cookies {
+   __u32 num;
+   __u32 cookies[SO_EE_ORIGIN_MAX_ZCOOKIES];
+};
+
 /*
  * Common set of flags for all RDMA related structs
  */
diff --git a/net/rds/recv.c b/net/rds/recv.c
index b080961..44da829 100644
--- a/net/rds/recv.c
+++ b/net/rds/recv.c
@@ -577,6 +577,43 @@ static int rds_cmsg_recv(struct rds_incoming *inc, struct msghdr *msg,
 	return ret;
 }
 
+static int rds_recvmsg_zcookie(struct rds_sock *rs, struct msghdr *msg)
+{
+	struct sk_buff *skb, *tmp;
+	struct sock_exterr_skb *serr;
+	struct sock *sk = rds_rs_to_sk(rs);
+	struct sk_buff_head *q = &sk->sk_error_queue;
+	struct rds_zcopy_cookies done;
+	u32 *ptr;
+	int i;
+	unsigned long flags;
+
+	spin_lock_irqsave(&q->lock, flags);
+	if (skb_queue_empty(q)) {
+		spin_unlock_irqrestore(&q->lock, flags);
+		return 0;
+	}
+	skb_queue_walk_safe(q, skb, tmp) {
+		serr = SKB_EXT_ERR(skb);
+		if (serr->ee.ee_origin == SO_EE_ORIGIN_ZCOOKIE) {
+			__skb_unlink(skb, q);
+			break;
+		}
+	}
+	spin_unlock_irqrestore(&q->lock, flags);
+
+	if (!skb)
+		return 0;
+	memset(&done, 0, sizeof(done));
+	done.num = serr->ee.ee_data;
+	ptr = (u32 *)skb->data;
+	for (i = 0; i < done.num; i++)
+		done.cookies[i] = *ptr++;
+	put_cmsg(msg, SOL_RDS, RDS_CMSG_ZCOPY_COMPLETION, sizeof(done), &done);
+	consume_skb(skb);
+	return done.num;
+}
+
 int rds_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
 		int msg_flags)
 {
@@ -586,6 +623,7 @@ int rds_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
 	int ret = 0, nonblock = msg_flags & MSG_DONTWAIT;
 	DECLARE_SOCKADDR(struct sockaddr_in *, sin, msg->msg_name);
 	struct rds_incoming *inc = NULL;
+	int ncookies;
 
 	/* udp_recvmsg()->sock_recvtimeo() gets away without locking too.. */
 	timeo = sock_rcvtimeo(sk, nonblock);
@@ -609,6 +647,14 @@ int rds_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
 			break;
 		}
 
+		if (list_empty(&rs->rs_recv_queue) && nonblock) {
+			ncookies = rds_recvmsg_zcookie(rs, msg);
+			if (ncookies) {
+				ret = 0;
+				break;
+			}
+		}
+
 		if (!rds_next_incoming(rs, &inc)) {
 			if (nonblock) {
 				ret = -EAGAIN;
@@ -656,6 +702,7 @@ int rds_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
 			msg->msg_flags |= MSG_TRUNC;
 		}
 
+		ncookies = rds_recvmsg_zcookie(rs, msg);
 		if (rds_cmsg_recv(inc, msg, rs)) {
 			ret = -EFAULT;
 			goto out;
-- 
1.7.1

Re: pull-request: bpf 2018-02-20

2018-02-21 Thread David Miller

From: Daniel Borkmann 
Date: Tue, 20 Feb 2018 22:08:32 +0100

> The following pull-request contains BPF updates for your *net* tree.
> 
> The main changes are:
 ...
> Please consider pulling these changes from:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Pulled, thanks Daniel.

Re: [PATCH net-next] ipv6: allow userspace to add IFA_F_OPTIMISTIC addresses

2018-02-21 Thread David Miller

From: Sabrina Dubroca 
Date: Tue, 20 Feb 2018 19:17:17 +0100

> 2018-02-20, 10:25:41 -0700, David Ahern wrote:
>> On 2/20/18 9:43 AM, Sabrina Dubroca wrote:
>> > According to RFC 4429 (section 3.1), adding new IPv6 addresses as
>> > optimistic addresses is acceptable, as long as the implementation
>> > follows some rules:
>> > 
>> >* Optimistic DAD SHOULD only be used when the implementation is aware
>> > that the address is based on a most likely unique interface
>> > identifier (such as in [RFC2464]), generated randomly [RFC3041],
>> > or by a well-distributed hash function [RFC3972] or assigned by
>> > Dynamic Host Configuration Protocol for IPv6 (DHCPv6) [RFC3315].
>> > Optimistic DAD SHOULD NOT be used for manually entered
>> > addresses.
>> 
>> That last line suggests this patch should not be allowed.
> 
> I think it should. Some tools perform autoconfiguration in userspace,
> why should the kernel prevent them from requesting optimistic DAD?
> 
> If the administrator decides to enable optimistic DAD on
> poorly-chosen addresses, or to disable DAD entirely, that's their
> problem.

See, this is the slippery slope we go down once we have allowed
userspace to engage in the ipv6 autoconfiguration process.

Whether the kernel is in control or not, or enforcing the rules
properly, is always going to be ambiguous and hard to determine.

I somewhat regret allowing us to go down this path...
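
As a rough illustration of the RFC 4429 (section 3.1) rule quoted above, the
check a userspace autoconfiguration tool might apply before requesting an
optimistic address can be sketched as follows. This is only a sketch: the
names AddressOrigin and optimistic_dad_recommended are hypothetical, and
nothing here is a kernel or netlink API.

```python
from enum import Enum


class AddressOrigin(Enum):
    """How an IPv6 address was obtained (hypothetical labels for this sketch)."""
    EUI64 = "eui64"      # interface identifier such as in RFC 2464
    RANDOM = "random"    # generated randomly, RFC 3041
    HASHED = "hashed"    # well-distributed hash function, RFC 3972
    DHCPV6 = "dhcpv6"    # assigned by DHCPv6, RFC 3315
    MANUAL = "manual"    # manually entered by an administrator


# Origins the quoted RFC text treats as "most likely unique".
_LIKELY_UNIQUE = frozenset({
    AddressOrigin.EUI64,
    AddressOrigin.RANDOM,
    AddressOrigin.HASHED,
    AddressOrigin.DHCPV6,
})


def optimistic_dad_recommended(origin: AddressOrigin) -> bool:
    """Return True if the quoted rule permits Optimistic DAD for this origin.

    Manually entered addresses fall under the SHOULD NOT clause.
    """
    return origin in _LIKELY_UNIQUE
```

The open question in the thread is who enforces such a check: with the patch,
the kernel cannot tell whether an address handed to it by userspace was
manually entered or came from a userspace autoconfiguration daemon.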

Re: [PATCH net-next 6/7] mlxsw: spectrum_router: Add support for ipv6 hash policy update

2018-02-21 Thread Ido Schimmel

On Wed, Feb 21, 2018 at 10:49:53AM -0800, David Ahern wrote:
> Similar to 28678f07f127d ("mlxsw: spectrum_router: Update multipath hash
> parameters upon netevents") for IPv4, make sure the kernel and asic are
> using the same hash algorithm for path selection.
> 
> Signed-off-by: David Ahern 

Reviewed-by: Ido Schimmel 

Will review the rest and test tomorrow morning first thing.

Thanks a lot!

Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-21 Thread Jiri Pirko

Wed, Feb 21, 2018 at 06:56:35PM CET, alexander.du...@gmail.com wrote:
>On Wed, Feb 21, 2018 at 8:58 AM, Jiri Pirko  wrote:
>> Wed, Feb 21, 2018 at 05:49:49PM CET, alexander.du...@gmail.com wrote:
>>>On Wed, Feb 21, 2018 at 8:11 AM, Jiri Pirko  wrote:
>>>> Wed, Feb 21, 2018 at 04:56:48PM CET, alexander.du...@gmail.com wrote:
>>>>>On Wed, Feb 21, 2018 at 1:51 AM, Jiri Pirko  wrote:
>>>>>> Tue, Feb 20, 2018 at 11:33:56PM CET, kubak...@wp.pl wrote:
>>>>>>>On Tue, 20 Feb 2018 21:14:10 +0100, Jiri Pirko wrote:
>>>>>>>> Yeah, I can see it now :( I guess that the ship has sailed and we are
>>>>>>>> stuck with this ugly thing forever...
>>>>>>>>
>>>>>>>> Could you at least make some common code that is shared in between
>>>>>>>> netvsc and virtio_net so this is handled in exactly the same way in both?
>>>>>>>
>>>>>>>IMHO netvsc is a vendor specific driver which made a mistake on what
>>>>>>>behaviour it provides (or tried to align itself with Windows SR-IOV).
>>>>>>>Let's not make a far, far more commonly deployed and important driver
>>>>>>>(virtio) bug-compatible with netvsc.
>>>>>>
>>>>>> Yeah. netvsc solution is a dangerous precedent here and in my opinion
>>>>>> it was a huge mistake to merge it. I personally would vote to unmerge it
>>>>>> and make the solution based on team/bond.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>To Jiri's initial comments, I feel the same way, in fact I've talked to
>>>>>>>the NetworkManager guys to get auto-bonding based on MACs handled in
>>>>>>>user space.  I think it may very well get done in next versions of NM,
>>>>>>>but isn't done yet.  Stephen also raised the point that not everybody is
>>>>>>>using NM.
>>>>>>
>>>>>> Can be done in NM, networkd or other network management tools.
>>>>>> Even easier to do this in teamd and let them all benefit.
>>>>>>
>>>>>> Actually, I took a stab to implement this in teamd. Took me like an hour
>>>>>> and half.
>>>>>>
>>>>>> You can just run teamd with config option "kidnap" like this:
>>>>>> # teamd/teamd -c '{"kidnap": true }'
>>>>>>
>>>>>> Whenever teamd sees another netdev to appear with the same mac as his,
>>>>>> or whenever teamd sees another netdev to change mac to his,
>>>>>> it enslaves it.
>>>>>>
>>>>>> Here's the patch (quick and dirty):
>>>>>>
>>>>>> Subject: [patch teamd] teamd: introduce kidnap feature
>>>>>>
>>>>>> Signed-off-by: Jiri Pirko 
>>>>>
>>>>>So this doesn't really address the original problem we were trying to
>>>>>solve. You asked earlier why the netdev name mattered and it mostly
>>>>>has to do with configuration. Specifically what our patch is
>>>>>attempting to resolve is the issue of how to allow a cloud provider to
>>>>>upgrade their customer to SR-IOV support and live migration without
>>>>>requiring them to reconfigure their guest. So the general idea with
>>>>>our patch is to take a VM that is running with virtio_net only and
>>>>>allow it to instead spawn a virtio_bypass master using the same netdev
>>>>>name as the original virtio, and then have the virtio_net and VF come
>>>>>up and be enslaved by the bypass interface. Doing it this way we can
>>>>>allow for multi-vendor SR-IOV live migration support using a guest
>>>>>that was originally configured for virtio only.
>>>>>
>>>>>The problem with your solution is we already have teaming and bonding
>>>>>as you said. There is already a write-up from Red Hat on how to do it
>>>>>(https://access.redhat.com/documentation/en-us/red_hat_virtualization/4.1/html/virtual_machine_management_guide/sect-migrating_virtual_machines_between_hosts).
>>>>>That is all well and good as long as you are willing to keep around
>>>>>two VM images, one for virtio, and one for SR-IOV with live migration.
>>>>
>>>> You don't need 2 images. You need only one. The one with the team setup.
>>>> That's it. If another netdev with the same mac appears, teamd will
>>>> enslave it and run traffic on it. If not, ok, you'll go only through
>>>> virtio_net.
>>>
>>>Isn't that going to cause the routing table to get messed up when we
>>>rearrange the netdevs? We don't want to have a significant disruption
>>>in traffic when we are adding/removing the VF. It seems like we would
>>>need to invalidate any entries that were configured for the virtio_net
>>>and reestablish them on the new team interface. Part of the criteria
>>>we have been working with is that we should be able to transition from
>>>having a VF to not or vice versa without seeing any significant
>>>disruption in the traffic.
>>
>> What? You have routes on the team netdev. virtio_net and VF are only
>> slaves. What are you talking about? I don't get it :/
>
>So let's walk through this by example. The general idea of the base case
>for all this is somebody starting with virtio_net, we will call the
>interface "ens1" for now. It comes up and is assigned a dhcp address
>and everything works as expected. Now in order to get better
>performance we want to add a VF
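
The "kidnap" behaviour described earlier in this thread (enslave any netdev
that appears with, or changes to, the team's MAC address) can be modelled
roughly as below. This is a toy model for illustration only, not the teamd
patch: the Netdev and Team classes and the on_netdev_event method are
hypothetical names, and skipping devices that already have a master is an
added assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Netdev:
    name: str
    mac: str
    master: Optional[str] = None  # name of the enslaving device, if any


@dataclass
class Team:
    name: str
    mac: str
    kidnap: bool = True
    ports: List[str] = field(default_factory=list)

    def on_netdev_event(self, dev: Netdev) -> bool:
        """Handle 'netdev appeared' or 'netdev changed MAC'.

        Returns True if the device was enslaved by this team.
        """
        if not self.kidnap or dev.name == self.name:
            return False
        if dev.master is not None:
            # Assumption: leave devices that already have a master alone.
            return False
        if dev.mac.lower() != self.mac.lower():
            return False
        dev.master = self.name
        self.ports.append(dev.name)
        return True


team = Team("team0", "52:54:00:12:34:56")
virtio = Netdev("ens1", "52:54:00:12:34:56")
team.on_netdev_event(virtio)  # same MAC as the team, so it becomes a port
```

In this model, addresses and routes would be configured on the team device
only, so adding or removing a port with a matching MAC does not touch them,
which is the point made in the reply about routes living on the team netdev.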

Re: ppp/pppoe, still panic 4.15.3 in ppp_push

2018-02-21 Thread Denys Fedoryshchenko


On 2018-02-21 20:55, Guillaume Nault wrote:

> On Wed, Feb 21, 2018 at 12:26:51PM +0200, Denys Fedoryshchenko wrote:
>> It seems even rebuilding seemingly stable version triggering crashes too
>> (but different ones)
>
> Different ones? The trace following your message looks very similar to
> your first KASAN report. Or are you referring to the lockup you posted
> on Sun, 18 Feb 2018?
>
> Also, which stable versions are you referring to?
Trace i sent in previous email - is latest kernel, vanilla, just more
debug options and few options disabled.
One of disabled was spitting some errors (it is obviously bug),
CONFIG_XFRM, in nf_xfrm_me_harder (i reported about it).
And i disabled namespaces, as they are often source of trouble.

Today i will try to revert just:
drivers, net, ppp: convert asyncppp.refcnt from atomic_t to refcount_t
drivers, net, ppp: convert syncppp.refcnt from atomic_t to refcount_t
drivers, net, ppp: convert ppp_file.refcnt from atomic_t to refcount_t

Because i suspect previously, after reverting these patches i got
different kernel panic (and i didn't notice that, now too late to
identify between other crashes), seems it was not KASAN.
I will report results after testing, unfortunately i can't test it more
than once per day.

"Stable" for me was 4.14.2 - but it looks like on that kernel i am
getting different issue now.
I will paste it below.

Another observation, just hour ago, i noticed on another server, where i
am testing 4.15 and 4.14.20 (at moment of testing 4.14.20, but no debug
at that moment), when i killed accel-pppd (pppoe server software), with
8k sessions online, i got some weird behaviour: accel-pppd process got
stuck, same as ifconfig and "ip link", and even kexec -e didn't work
(got stuck too), unless i did kexec -e -x (so it won't try to make
interfaces down on kexec).
I will try to reproduce this bug as well, with debug enabled (lockdep
and so), i hope it is not related.



> I'm interested in the ppp_generic.o file that produced the following
> trace. Just to be sure that the differences come from the new debugging
> options.

Also kernel config:
https://nuclearcat.com/bughunting/config.txt
https://nuclearcat.com/bughunting/ppp_generic.o

This is in 4.14.2, was seemingly stable before:

[50401.388670] NETDEV WATCHDOG: eth1 (ixgbe): transmit queue 1 timed out
[50401.389014] ------------[ cut here ]------------
[50401.389340] WARNING: CPU: 3 PID: 0 at net/sched/sch_generic.c:320 dev_watchdog+0x15c/0x1b9
[50401.389925] Modules linked in: pppoe pppox ppp_generic slhc netconsole configfs coretemp nf_nat_pptp nf_nat_proto_gre nf_conntrack_pptp nf_conntrack_proto_gre tun xt_TEE nf_dup_ipv4 xt_REDIRECT nf_nat_redirect xt_nat xt_TCPMSS ipt_REJECT nf_reject_ipv4 xt_set xt_string xt_connmark xt_DSCP xt_mark xt_tcpudp ip_set_hash_net ip_set_hash_ip ip_set nfnetlink iptable_mangle iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables x_tables 8021q garp mrp stp llc ixgbe dca
[50401.391869] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.14.2-build-0134 #4
[50401.392191] Hardware name: HP ProLiant DL320e Gen8 v2, BIOS P80 04/02/2015

[50401.392513] task: 880434d72640 task.stack: c90001914000
[50401.392836] RIP: 0010:dev_watchdog+0x15c/0x1b9
[50401.393155] RSP: 0018:8804364c3e90 EFLAGS: 00010286
[50401.393470] RAX: 0039 RBX: 88042f6e RCX: 

[50401.393787] RDX: 0001 RSI: 0002 RDI: 
828dbc64
[50401.394103] RBP: 8804364c3eb0 R08: 0001 R09: 

[50401.394420] R10: 0002 R11: 8803fa075c00 R12: 
0001
[50401.394739] R13: 0040 R14: 0003 R15: 
81e05108
[50401.395064] FS:  () GS:8804364c() 
knlGS:

[50401.395645] CS:  0010 DS:  ES:  CR0: 80050033
[50401.395970] CR2: 7fff25fc20a8 CR3: 01e09005 CR4: 
001606e0

[50401.396294] Call Trace:
[50401.396613]  <IRQ>
[50401.396934]  ? qdisc_rcu_free+0x3f/0x3f
[50401.397255]  call_timer_fn.isra.4+0x17/0x7b
[50401.397576]  expire_timers+0x6f/0x7e
[50401.397899]  run_timer_softirq+0x6d/0x8f
[50401.398219]  ? ktime_get+0x3b/0x8c
[50401.398540]  ? lapic_next_event+0x18/0x1c
[50401.398862]  ? clockevents_program_event+0xa3/0xbb
[50401.399186]  __do_softirq+0xbc/0x1ab
[50401.399510]  irq_exit+0x4d/0x8e
[50401.399832]  smp_apic_timer_interrupt+0x73/0x80
[50401.400157]  apic_timer_interrupt+0x8d/0xa0
[50401.400480]  </IRQ>
[50401.400801] RIP: 0010:mwait_idle+0x4e/0x61
[50401.401123] RSP: 0018:c90001917ec0 EFLAGS: 0246 ORIG_RAX: 
ff10
[50401.401714] RAX:  RBX: 880434d72640 RCX: 

[50401.402037] RDX:  RSI:  RDI: 

[50401.402362] RBP: c90001917ec0 R08:  R09: 
0001
[50401.402685] R10: c90001917e58 R11: 037a R12:

Re: [net PATCH 0/4] virtio_net: several bugs in XDP code for driver virtio_net

2018-02-21 Thread David Miller

From: Jesper Dangaard Brouer 
Date: Tue, 20 Feb 2018 14:31:59 +0100

> The virtio_net driver actually violates the original memory model of
> XDP causing hard to debug crashes.  Per request of John Fastabend,
> instead of removing the XDP feature I'm fixing as much as possible.
> While testing virtio_net with XDP_REDIRECT I found 4 different bugs.
> 
> Patch-1: not enough tail-room for build_skb in receive_mergeable()
>  only option is to disable XDP_REDIRECT in receive_mergeable()
> 
> Patch-2: XDP in receive_small() basically never worked (check wrong flag)
> 
> Patch-3: fix memory leak for XDP_REDIRECT in error cases
> 
> Patch-4: avoid crash when ndo_xdp_xmit is called on dev not ready for XDP
> 
> In the longer run, we should consider introducing a separate receive
> function when attaching an XDP program, and also change the memory
> model to be compatible with XDP when attaching an XDP prog.

Series applied, thanks Jesper.
