[RFC bpf] bpf, x64: fix JIT emission for dead code

2018-04-24 Thread Gianluca Borello
Commit 2a5418a13fcf ("bpf: improve dead code sanitizing") replaced dead
code with a series of ja-1 instructions, for safety. That made JIT
compilation much more complex for some BPF programs. One instance of such
programs is, for example:

bool flag = false
...
/* A bunch of other code */
...
if (flag)
do_something()

In some cases llvm is not able to remove the code for do_something() at
compile time, so the generated BPF program ends up with a large amount
of dead instructions. In one specific real life example, there are two
series of ~500 and ~1000 dead instructions in the program. When the
verifier replaces them with a series of ja-1 instructions, it causes an
interesting behavior at JIT time.

During the first pass, since all the instructions are estimated at 64
bytes, the ja-1 instructions end up being translated as 5-byte JMP
instructions (0xE9), because the jump offsets become increasingly large (>
127) as each instruction gets discovered to be 5 bytes instead of the
estimated 64.

Starting from the second pass, the first N instructions of the ja-1
sequence get translated into 2-byte JMPs (0xEB) because the jump offsets
become <= 127 this time. In particular, N is defined as roughly 127 / (5
- 2) ~= 42. So, each further pass will make the subsequent N JMP
instructions shrink from 5 to 2 bytes, making the image shrink every time.
This means that in order to have the entire program converge, there need
to be, in the real example above, at least ~1000 / 42 ~= 24 passes just
for translating the dead code. If we add this number to the passes needed
to translate the other non-dead code, it brings such a program to 40+
passes, and the JIT doesn't complete. Ultimately the userspace loader fails
because such a BPF program was supposed to be part of a prog array whose
owner program is JITed.
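
For reference, a rough back-of-the-envelope sketch of the convergence
arithmetic described above (plain userspace C, not kernel code; the ~1000
dead instructions are taken from the real example):

#include <stdio.h>

int main(void)
{
        int dead_insns = 1000;                  /* length of the ja-1 run */
        int shrink_per_pass = 127 / (5 - 2);    /* ~42 jumps can go from 5-byte
                                                 * 0xE9 to 2-byte 0xEB per pass */
        int passes = (dead_insns + shrink_per_pass - 1) / shrink_per_pass;

        printf("extra JIT passes just for dead code: %d\n", passes); /* ~24 */
        return 0;
}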

While it is certainly possible to try to refactor such programs to help
the compiler remove dead code, the behavior is not really intuitive and it
puts further burden on the BPF developer who is not expecting such
behavior. To make things worse, such programs are working just fine in all
the kernel releases prior to the ja-1 fix.

A possible approach to mitigate this behavior consists in noticing that
for ja-1 instructions we don't really need to rely on the estimated size
of the previous and current instructions: we know that a -1 BPF jump
offset can be safely translated into a 0xEB instruction with a jump offset
of -2.

This fix brings the BPF program in the previous example to complete again
in ~9 passes.

Fixes: 2a5418a13fcf ("bpf: improve dead code sanitizing")
Signed-off-by: Gianluca Borello 
---
Hi

Posting this as RFC since I just wanted to report this potential bug and
propose a possible fix, although I'm not sure if:

1) There might be a better fix on the JIT side
2) We might want to replace the ja-1 instructions with something else
3) We might want to ignore this issue

If we choose option 3, I'd just like to point out that this caused a real
regression on a couple BPF programs that are part of a larger collection
of programs that used to work fine on 4.15 and I recently found broken 
in 4.16, so there would be some value in somehow addressing this.

Thanks

 arch/x86/net/bpf_jit_comp.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index b725154182cc..abce27ceb411 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -1027,7 +1027,17 @@ xadd:if (is_imm8(insn->off))
break;
 
case BPF_JMP | BPF_JA:
-   jmp_offset = addrs[i + insn->off] - addrs[i];
+   if (insn->off == -1)
+   /* -1 jmp instructions will always jump
+* backwards two bytes. Explicitly handling
+* this case avoids wasting too many passes
+* when there are long sequences of replaced
+* dead code.
+*/
+   jmp_offset = -2;
+   else
+   jmp_offset = addrs[i + insn->off] - addrs[i];
+
if (!jmp_offset)
/* optimize out nop jumps */
break;
-- 
2.17.0



[PATCH net-next 0/2] tcp: mmap: rework zerocopy receive

2018-04-24 Thread Eric Dumazet
syzbot reported a lockdep issue caused by tcp mmap() support.

I implemented Andy Lutomirski's nice suggestions to resolve the
issue and increase scalability as well.

First patch is adding a new setsockopt() operation and changes mmap()
behavior.

Second patch changes tcp_mmap reference program.

Eric Dumazet (2):
  tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive
  selftests: net: tcp_mmap must use TCP_ZEROCOPY_RECEIVE

 include/uapi/linux/tcp.h   |   8 ++
 net/ipv4/tcp.c | 186 +
 tools/testing/selftests/net/tcp_mmap.c |  63 +
 3 files changed, 139 insertions(+), 118 deletions(-)

-- 
2.17.0.484.g0c8726318c-goog



[PATCH net-next 1/2] tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive

2018-04-24 Thread Eric Dumazet
When adding the tcp mmap() implementation, I forgot that the socket lock
had to be taken before current->mm->mmap_sem. syzbot eventually caught
the bug.

Since we cannot lock the socket in the tcp mmap() handler, we have to
split the operation in two phases.

1) mmap() on a tcp socket simply reserves VMA space, and nothing else.
  This operation does not involve any TCP locking.

2) setsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...) implements
 the transfer of pages from skbs to one VMA.
  This operation only uses down_read(&current->mm->mmap_sem) after
  holding TCP lock, thus solving the lockdep issue.

This new implementation was suggested by Andy Lutomirski with great details.

Benefits are :

- Better scalability, in case multiple threads reuse VMAs
   (without mmap()/munmap() calls) since mmap_sem won't be write locked.

- Better error recovery.
   The previous mmap() model had to provide the expected size of the
   mapping. If for some reason one part could not be mapped (partial MSS),
   the whole operation had to be aborted.
   With the tcp_zerocopy_receive struct, kernel can report how
   many bytes were successfully mapped, and how many bytes should
   be read to skip the problematic sequence.

- No more memory allocation to hold an array of page pointers.
  16 MB mappings needed 32 KB for this array, potentially using vmalloc() :/

- skbs are freed while mmap_sem has been released

Following patch makes the change in tcp_mmap tool to demonstrate
one possible use of mmap() and setsockopt(... TCP_ZEROCOPY_RECEIVE ...)

Note that memcg might require additional changes.

Fixes: 93ab6cc69162 ("tcp: implement mmap() for zero copy receive")
Signed-off-by: Eric Dumazet 
Reported-by: syzbot 
Suggested-by: Andy Lutomirski 
Cc: linux...@kvack.org
Cc: Soheil Hassas Yeganeh 
---
 include/uapi/linux/tcp.h |   8 ++
 net/ipv4/tcp.c   | 186 ---
 2 files changed, 103 insertions(+), 91 deletions(-)

diff --git a/include/uapi/linux/tcp.h b/include/uapi/linux/tcp.h
index 
379b08700a542d49bbce9b4b49b17879d00b69bb..e9e8373b34b9ddc735329341b91f455bf5c0b17c
 100644
--- a/include/uapi/linux/tcp.h
+++ b/include/uapi/linux/tcp.h
@@ -122,6 +122,7 @@ enum {
 #define TCP_MD5SIG_EXT 32  /* TCP MD5 Signature with extensions */
 #define TCP_FASTOPEN_KEY   33  /* Set the key for Fast Open (cookie) */
 #define TCP_FASTOPEN_NO_COOKIE 34  /* Enable TFO without a TFO cookie */
+#define TCP_ZEROCOPY_RECEIVE   35
 
 struct tcp_repair_opt {
__u32   opt_code;
@@ -276,4 +277,11 @@ struct tcp_diag_md5sig {
__u8tcpm_key[TCP_MD5SIG_MAXKEYLEN];
 };
 
+/* setsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...) */
+
+struct tcp_zerocopy_receive {
+   __u64 address;  /* in: address of mapping */
+   __u32 length;   /* in/out: number of bytes to map/mapped */
+   __u32 recv_skip_hint;   /* out: amount of bytes to skip */
+};
 #endif /* _UAPI_LINUX_TCP_H */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 
dfd090ea54ad47112fc23c61180b5bf8edd2c736..a28eca97df9465a0aa522a1833a171b53c237b1f
 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1726,118 +1726,108 @@ int tcp_set_rcvlowat(struct sock *sk, int val)
 }
 EXPORT_SYMBOL(tcp_set_rcvlowat);
 
-/* When user wants to mmap X pages, we first need to perform the mapping
- * before freeing any skbs in receive queue, otherwise user would be unable
- * to fallback to standard recvmsg(). This happens if some data in the
- * requested block is not exactly fitting in a page.
- *
- * We only support order-0 pages for the moment.
- * mmap() on TCP is very strict, there is no point
- * trying to accommodate with pathological layouts.
- */
+static const struct vm_operations_struct tcp_vm_ops = {
+};
+
 int tcp_mmap(struct file *file, struct socket *sock,
 struct vm_area_struct *vma)
 {
-   unsigned long size = vma->vm_end - vma->vm_start;
-   unsigned int nr_pages = size >> PAGE_SHIFT;
-   struct page **pages_array = NULL;
-   u32 seq, len, offset, nr = 0;
-   struct sock *sk = sock->sk;
-   const skb_frag_t *frags;
+   if (vma->vm_flags & (VM_WRITE | VM_EXEC))
+   return -EPERM;
+   vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC);
+
+   /* Instruct vm_insert_page() to not down_read(mmap_sem) */
+   vma->vm_flags |= VM_MIXEDMAP;
+
+   vma->vm_ops = &tcp_vm_ops;
+   return 0;
+}
+EXPORT_SYMBOL(tcp_mmap);
+
+static int tcp_zerocopy_receive(struct sock *sk,
+   struct tcp_zerocopy_receive *zc)
+{
+   unsigned long address = (unsigned long)zc->address;
+   const skb_frag_t *frags = NULL;
+   u32 length = 0, seq, offset;
+   struct vm_area_struct *vma;
+   struct sk_buff *skb = NULL;
struct tcp_sock *tp;
-   struct sk_buff *skb;
int ret;
 
-   if (vma->vm_pgoff || !nr_pages)
+   if (address & 

[PATCH net-next 2/2] selftests: net: tcp_mmap must use TCP_ZEROCOPY_RECEIVE

2018-04-24 Thread Eric Dumazet
After the prior kernel change, mmap() on a TCP socket only reserves VMA space.

We have to use setsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...)
to perform the transfer of pages from skbs in the TCP receive queue into such a VMA.

struct tcp_zerocopy_receive {
__u64 address;  /* in: address of mapping */
__u32 length;   /* in/out: number of bytes to map/mapped */
__u32 recv_skip_hint;   /* out: amount of bytes to skip */
};

After a successful setsockopt(...TCP_ZEROCOPY_RECEIVE...), @length contains
number of bytes that were mapped, and @recv_skip_hint contains number of bytes
that should be read using conventional read()/recv()/recvmsg() system calls,
to skip a sequence of bytes that cannot be mapped because they are not
properly page aligned.
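
For illustration only (not part of the patch), a minimal sketch of one
possible receive path built on this interface, assuming the updated
<linux/tcp.h> uapi header is available and omitting all error handling:

#include <linux/tcp.h>          /* struct tcp_zerocopy_receive */
#include <netinet/in.h>         /* IPPROTO_TCP */
#include <sys/mman.h>
#include <sys/socket.h>
#include <unistd.h>

static void receive_chunk(int fd, size_t chunk, char *fallback_buf)
{
        /* Phase 1: mmap() on the TCP socket only reserves VMA space. */
        void *addr = mmap(NULL, chunk, PROT_READ, MAP_SHARED, fd, 0);

        /* Phase 2: map received pages from the skbs into that VMA. */
        struct tcp_zerocopy_receive zc = {
                .address = (__u64)(unsigned long)addr,
                .length  = chunk,
        };
        if (setsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE,
                       &zc, sizeof(zc)) == 0) {
                /* zc.length bytes are now readable at addr... */
                if (zc.recv_skip_hint)
                        /* ...and the unmappable remainder is read normally. */
                        read(fd, fallback_buf, zc.recv_skip_hint);
        }
        munmap(addr, chunk);
}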

Signed-off-by: Eric Dumazet 
Cc: Andy Lutomirski 
Cc: Soheil Hassas Yeganeh 
---
 tools/testing/selftests/net/tcp_mmap.c | 63 +++---
 1 file changed, 36 insertions(+), 27 deletions(-)

diff --git a/tools/testing/selftests/net/tcp_mmap.c 
b/tools/testing/selftests/net/tcp_mmap.c
index 
dea342fe6f4e88b5709d2ac37b2fc9a2a320bf44..5b381cdbdd6319556ba4e3dad530fae8f13f5a9b
 100644
--- a/tools/testing/selftests/net/tcp_mmap.c
+++ b/tools/testing/selftests/net/tcp_mmap.c
@@ -76,9 +76,10 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
+#include 
+#include 
 
 #ifndef MSG_ZEROCOPY
 #define MSG_ZEROCOPY0x400
@@ -134,11 +135,12 @@ void hash_zone(void *zone, unsigned int length)
 void *child_thread(void *arg)
 {
unsigned long total_mmap = 0, total = 0;
+   struct tcp_zerocopy_receive zc;
unsigned long delta_usec;
int flags = MAP_SHARED;
struct timeval t0, t1;
char *buffer = NULL;
-   void *oaddr = NULL;
+   void *addr = NULL;
double throughput;
struct rusage ru;
int lu, fd;
@@ -153,41 +155,45 @@ void *child_thread(void *arg)
perror("malloc");
goto error;
}
+   if (zflg) {
+   addr = mmap(NULL, chunk_size, PROT_READ, flags, fd, 0);
+   if (addr == (void *)-1)
+   zflg = 0;
+   }
while (1) {
struct pollfd pfd = { .fd = fd, .events = POLLIN, };
int sub;
 
poll(&pfd, 1, 1);
if (zflg) {
-   void *naddr;
+   int res;
 
-   naddr = mmap(oaddr, chunk_size, PROT_READ, flags, fd, 
0);
-   if (naddr == (void *)-1) {
-   if (errno == EAGAIN) {
-   /* That is if SO_RCVLOWAT is buggy */
-   usleep(1000);
-   continue;
-   }
-   if (errno == EINVAL) {
-   flags = MAP_SHARED;
-   oaddr = NULL;
-   goto fallback;
-   }
-   if (errno != EIO)
-   perror("mmap()");
+   zc.address = (__u64)addr;
+   zc.length = chunk_size;
+   zc.recv_skip_hint = 0;
+   res = setsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE,
+					 &zc, sizeof(zc));
+   if (res == -1)
break;
+
+   if (zc.length) {
+   assert(zc.length <= chunk_size);
+   total_mmap += zc.length;
+   if (xflg)
+   hash_zone(addr, zc.length);
+   total += zc.length;
}
-   total_mmap += chunk_size;
-   if (xflg)
-   hash_zone(naddr, chunk_size);
-   total += chunk_size;
-   if (!keepflag) {
-   flags |= MAP_FIXED;
-   oaddr = naddr;
+   if (zc.recv_skip_hint) {
+   assert(zc.recv_skip_hint <= chunk_size);
+   lu = read(fd, buffer, zc.recv_skip_hint);
+   if (lu > 0) {
+   if (xflg)
+   hash_zone(buffer, lu);
+   total += lu;
+   }
}
continue;
}
-fallback:
sub = 0;
while (sub < chunk_size) {
lu = read(fd, buffer + sub, chunk_size - sub);
@@ -228,6 +234,8 @@ 

RE: [PATCH net-next] lan78xx: Lan7801 Support for Fixed PHY

2018-04-24 Thread RaghuramChary.Jallipalli
Hi Andrew,

> > +#define DRIVER_VERSION "1.0.7"
> 
> Hi Raghuram
> 
> Driver version strings are pretty pointless. You might want to remove it.
>
OK, will remove it.
 
> > +   netdev_info(dev->net, "Registered FIXED PHY\n");
> 
> There are too many netdev_info() messages here. Maybe make them both
> netdev_dbg().
>

OK. Will modify to netdev_dbg()
 
> > +   dev->interface = PHY_INTERFACE_MODE_RGMII;
> > +   dev->fixedphy = phydev;
> 
> You can use
> 
> if (!phy_is_pseudo_fixed_link(phydev))
> 
> to determine if a PHY is a fixed phy. I think you can then do without
> dev->fixedphy.
> 
dev->fixedphy stores the fixed phydev, which will be passed to the 
fixed_phy_unregister routine, so I think the phy_is_pseudo_fixed_link check is not 
necessary.

> > +   ret = lan78xx_write_reg(dev, HW_CFG, buf);
> > +   goto phyinit;
> 
> Please don't use a goto like this. Maybe turn this into a switch statement?
>
OK. Will modify.
 
> > static int lan78xx_phy_init(struct lan78xx_net *dev)
> > goto error;
> 
> Please take a look at what happens at error: It does not look correct.
> Probably now is a good time to refactor the whole of lan78xx_phy_init()
> 

OK. Will take care.

Thanks,
-Raghu


KASAN: stack-out-of-bounds Write in compat_copy_entries

2018-04-24 Thread syzbot

Hello,

syzbot hit the following crash on upstream commit
24cac7009cb1b211f1c793ecb6a462c03dc35818 (Tue Apr 24 21:16:40 2018 +)
Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
syzbot dashboard link:  
https://syzkaller.appspot.com/bug?extid=4e42a04e0bc33cb6c087


So far this crash happened 3 times on upstream.
syzkaller reproducer:  
https://syzkaller.appspot.com/x/repro.syz?id=4827027970457600
Raw console output:  
https://syzkaller.appspot.com/x/log.txt?id=6212733133389824
Kernel config:  
https://syzkaller.appspot.com/x/.config?id=7043958930931867332

compiler: gcc (GCC) 8.0.1 20180413 (experimental)
user-space arch: i386

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+4e42a04e0bc33cb6c...@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed. See footer for  
details.

If you forward the report, please keep this part and the footer.

random: sshd: uninitialized urandom read (32 bytes read)
random: sshd: uninitialized urandom read (32 bytes read)
random: sshd: uninitialized urandom read (32 bytes read)
IPVS: ftp: loaded support on port[0] = 21
==
BUG: KASAN: stack-out-of-bounds in strlcpy include/linux/string.h:300  
[inline]
BUG: KASAN: stack-out-of-bounds in compat_mtw_from_user  
net/bridge/netfilter/ebtables.c:1957 [inline]
BUG: KASAN: stack-out-of-bounds in ebt_size_mwt  
net/bridge/netfilter/ebtables.c:2059 [inline]
BUG: KASAN: stack-out-of-bounds in size_entry_mwt  
net/bridge/netfilter/ebtables.c:2155 [inline]
BUG: KASAN: stack-out-of-bounds in compat_copy_entries+0x96c/0x14a0  
net/bridge/netfilter/ebtables.c:2194

Write of size 33 at addr 8801b0abf888 by task syz-executor0/4504

CPU: 0 PID: 4504 Comm: syz-executor0 Not tainted 4.17.0-rc2+ #40
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011

Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1b9/0x294 lib/dump_stack.c:113
 print_address_description+0x6c/0x20b mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412
 check_memory_region_inline mm/kasan/kasan.c:260 [inline]
 check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
 memcpy+0x37/0x50 mm/kasan/kasan.c:303
 strlcpy include/linux/string.h:300 [inline]
 compat_mtw_from_user net/bridge/netfilter/ebtables.c:1957 [inline]
 ebt_size_mwt net/bridge/netfilter/ebtables.c:2059 [inline]
 size_entry_mwt net/bridge/netfilter/ebtables.c:2155 [inline]
 compat_copy_entries+0x96c/0x14a0 net/bridge/netfilter/ebtables.c:2194
 compat_do_replace+0x483/0x900 net/bridge/netfilter/ebtables.c:2285
 compat_do_ebt_set_ctl+0x2ac/0x324 net/bridge/netfilter/ebtables.c:2367
 compat_nf_sockopt net/netfilter/nf_sockopt.c:144 [inline]
 compat_nf_setsockopt+0x9b/0x140 net/netfilter/nf_sockopt.c:156
 compat_ip_setsockopt+0xff/0x140 net/ipv4/ip_sockglue.c:1279
 inet_csk_compat_setsockopt+0x97/0x120 net/ipv4/inet_connection_sock.c:1041
 compat_tcp_setsockopt+0x49/0x80 net/ipv4/tcp.c:2901
 compat_sock_common_setsockopt+0xb4/0x150 net/core/sock.c:3050
 __compat_sys_setsockopt+0x1ab/0x7c0 net/compat.c:403
 __do_compat_sys_setsockopt net/compat.c:416 [inline]
 __se_compat_sys_setsockopt net/compat.c:413 [inline]
 __ia32_compat_sys_setsockopt+0xbd/0x150 net/compat.c:413
 do_syscall_32_irqs_on arch/x86/entry/common.c:323 [inline]
 do_fast_syscall_32+0x345/0xf9b arch/x86/entry/common.c:394
 entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139
RIP: 0023:0xf7fb3cb9
RSP: 002b:fff0c26c EFLAGS: 0282 ORIG_RAX: 016e
RAX: ffda RBX: 0003 RCX: 
RDX: 0080 RSI: 2300 RDI: 05f4
RBP:  R08:  R09: 
R10:  R11:  R12: 
R13:  R14:  R15: 

The buggy address belongs to the page:
page:ea0006c2afc0 count:0 mapcount:0 mapping: index:0x0
flags: 0x2fffc00()
raw: 02fffc00   
raw:  ea0006c20101  
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 8801b0abf780: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 8801b0abf800: 00 00 00 00 00 f1 f1 f1 f1 00 00 f2 f2 f2 f2 f2

8801b0abf880: f2 00 00 00 07 f3 f3 f3 f3 00 00 00 00 00 00 00

   ^
 8801b0abf900: 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1 00 00
 8801b0abf980: 00 f2 f2 f2 f2 f2 00 00 00 00 00 00 00 00 00 00
==


---
This bug is generated by a dumb bot. It may contain errors.
See https://goo.gl/tpsmEJ for details.
Direct all questions to syzkal...@googlegroups.com.

syzbot will keep 

[RFC v3 1/5] virtio: add packed ring definitions

2018-04-24 Thread Tiwei Bie
Signed-off-by: Tiwei Bie 
---
 include/uapi/linux/virtio_config.h | 12 +++-
 include/uapi/linux/virtio_ring.h   | 36 
 2 files changed, 47 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/virtio_config.h 
b/include/uapi/linux/virtio_config.h
index 308e2096291f..a6e392325e3a 100644
--- a/include/uapi/linux/virtio_config.h
+++ b/include/uapi/linux/virtio_config.h
@@ -49,7 +49,7 @@
  * transport being used (eg. virtio_ring), the rest are per-device feature
  * bits. */
 #define VIRTIO_TRANSPORT_F_START   28
-#define VIRTIO_TRANSPORT_F_END 34
+#define VIRTIO_TRANSPORT_F_END 36
 
 #ifndef VIRTIO_CONFIG_NO_LEGACY
 /* Do we get callbacks when the ring is completely used, even if we've
@@ -71,4 +71,14 @@
  * this is for compatibility with legacy systems.
  */
 #define VIRTIO_F_IOMMU_PLATFORM33
+
+/* This feature indicates support for the packed virtqueue layout. */
+#define VIRTIO_F_RING_PACKED   34
+
+/*
+ * This feature indicates that all buffers are used by the device
+ * in the same order in which they have been made available.
+ */
+#define VIRTIO_F_IN_ORDER  35
+
 #endif /* _UAPI_LINUX_VIRTIO_CONFIG_H */
diff --git a/include/uapi/linux/virtio_ring.h b/include/uapi/linux/virtio_ring.h
index 6d5d5faa989b..3932cb80c347 100644
--- a/include/uapi/linux/virtio_ring.h
+++ b/include/uapi/linux/virtio_ring.h
@@ -44,6 +44,9 @@
 /* This means the buffer contains a list of buffer descriptors. */
 #define VRING_DESC_F_INDIRECT  4
 
+#define VRING_DESC_F_AVAIL(b)  ((b) << 7)
+#define VRING_DESC_F_USED(b)   ((b) << 15)
+
 /* The Host uses this in used->flags to advise the Guest: don't kick me when
  * you add a buffer.  It's unreliable, so it's simply an optimization.  Guest
  * will still kick if it's out of buffers. */
@@ -53,6 +56,10 @@
  * optimization.  */
 #define VRING_AVAIL_F_NO_INTERRUPT 1
 
+#define VRING_EVENT_F_ENABLE   0x0
+#define VRING_EVENT_F_DISABLE  0x1
+#define VRING_EVENT_F_DESC 0x2
+
 /* We support indirect buffer descriptors */
 #define VIRTIO_RING_F_INDIRECT_DESC28
 
@@ -171,4 +178,33 @@ static inline int vring_need_event(__u16 event_idx, __u16 
new_idx, __u16 old)
return (__u16)(new_idx - event_idx - 1) < (__u16)(new_idx - old);
 }
 
+struct vring_packed_desc_event {
+   /* __virtio16 off  : 15; // Descriptor Event Offset
+* __virtio16 wrap : 1;  // Descriptor Event Wrap Counter */
+   __virtio16 off_wrap;
+   /* __virtio16 flags : 2; // Descriptor Event Flags */
+   __virtio16 flags;
+};
+
+struct vring_packed_desc {
+   /* Buffer Address. */
+   __virtio64 addr;
+   /* Buffer Length. */
+   __virtio32 len;
+   /* Buffer ID. */
+   __virtio16 id;
+   /* The flags depending on descriptor type. */
+   __virtio16 flags;
+};
+
+struct vring_packed {
+   unsigned int num;
+
+   struct vring_packed_desc *desc;
+
+   struct vring_packed_desc_event *driver;
+
+   struct vring_packed_desc_event *device;
+};
+
 #endif /* _UAPI_LINUX_VIRTIO_RING_H */
-- 
2.11.0



[RFC v3 0/5] virtio: support packed ring

2018-04-24 Thread Tiwei Bie
Hello everyone,

This RFC implements packed ring support in virtio driver.

Some simple functional tests have been done with Jason's
packed ring implementation in vhost:

https://lkml.org/lkml/2018/4/23/12

Both ping and netperf worked as expected (with EVENT_IDX
disabled). But there are below known issues:

1. Reloading the guest driver will break the Tx/Rx;
2. Zeroing the flags when detaching a used desc will
   break the guest -> host path.

Some simple functional tests have also been done with
Wei's packed ring implementation in QEMU:

http://lists.nongnu.org/archive/html/qemu-devel/2018-04/msg00342.html

Both ping and netperf worked as expected (with EVENT_IDX
disabled). Reloading the guest driver also worked as expected.

TODO:
- Refinements (for code and commit log) and bug fixes;
- Discuss/fix/test EVENT_IDX support;
- Test devices other than net;

RFC v2 -> RFC v3:
- Split into small patches (Jason);
- Add helper virtqueue_use_indirect() (Jason);
- Just set id for the last descriptor of a list (Jason);
- Calculate the prev in virtqueue_add_packed() (Jason);
- Fix/improve desc suppression code (Jason/MST);
- Refine the code layout for XXX_split/packed and wrappers (MST);
- Fix the comments and API in uapi (MST);
- Remove the BUG_ON() for indirect (Jason);
- Some other refinements and bug fixes;

RFC v1 -> RFC v2:
- Add indirect descriptor support - compile test only;
- Add event suppression support - compile test only;
- Move vring_packed_init() out of uapi (Jason, MST);
- Merge two loops into one in virtqueue_add_packed() (Jason);
- Split vring_unmap_one() for packed ring and split ring (Jason);
- Avoid using '%' operator (Jason);
- Rename free_head -> next_avail_idx (Jason);
- Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
- Some other refinements and bug fixes;

Thanks!

Tiwei Bie (5):
  virtio: add packed ring definitions
  virtio_ring: support creating packed ring
  virtio_ring: add packed ring support
  virtio_ring: add event idx support in packed ring
  virtio_ring: enable packed ring

 drivers/virtio/virtio_ring.c   | 1271 
 include/linux/virtio_ring.h|8 +-
 include/uapi/linux/virtio_config.h |   12 +-
 include/uapi/linux/virtio_ring.h   |   36 +
 4 files changed, 1049 insertions(+), 278 deletions(-)

-- 
2.11.0



[RFC v3 2/5] virtio_ring: support creating packed ring

2018-04-24 Thread Tiwei Bie
This commit introduces the support for creating packed ring.
All split ring specific functions are given a _split suffix.
Some necessary stubs for packed ring are also added.

Signed-off-by: Tiwei Bie 
---
 drivers/virtio/virtio_ring.c | 764 ---
 include/linux/virtio_ring.h  |   8 +-
 2 files changed, 513 insertions(+), 259 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 71458f493cf8..e164822ca66e 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -64,8 +64,8 @@ struct vring_desc_state {
 struct vring_virtqueue {
struct virtqueue vq;
 
-   /* Actual memory layout for this queue */
-   struct vring vring;
+   /* Is this a packed ring? */
+   bool packed;
 
/* Can we use weak barriers? */
bool weak_barriers;
@@ -79,19 +79,45 @@ struct vring_virtqueue {
/* Host publishes avail event idx */
bool event;
 
-   /* Head of free buffer list. */
-   unsigned int free_head;
/* Number we've added since last sync. */
unsigned int num_added;
 
/* Last used index we've seen. */
u16 last_used_idx;
 
-   /* Last written value to avail->flags */
-   u16 avail_flags_shadow;
+   union {
+   /* Available for split ring */
+   struct {
+   /* Actual memory layout for this queue. */
+   struct vring vring;
 
-   /* Last written value to avail->idx in guest byte order */
-   u16 avail_idx_shadow;
+   /* Head of free buffer list. */
+   unsigned int free_head;
+
+   /* Last written value to avail->flags */
+   u16 avail_flags_shadow;
+
+   /* Last written value to avail->idx in
+* guest byte order. */
+   u16 avail_idx_shadow;
+   };
+
+   /* Available for packed ring */
+   struct {
+   /* Actual memory layout for this queue. */
+   struct vring_packed vring_packed;
+
+   /* Driver ring wrap counter. */
+   u8 wrap_counter;
+
+   /* Index of the next avail descriptor. */
+   unsigned int next_avail_idx;
+
+   /* Last written value to driver->flags in
+* guest byte order. */
+   u16 event_flags_shadow;
+   };
+   };
 
/* How to notify other side. FIXME: commonalize hcalls! */
bool (*notify)(struct virtqueue *vq);
@@ -201,8 +227,17 @@ static dma_addr_t vring_map_single(const struct 
vring_virtqueue *vq,
  cpu_addr, size, direction);
 }
 
-static void vring_unmap_one(const struct vring_virtqueue *vq,
-   struct vring_desc *desc)
+static int vring_mapping_error(const struct vring_virtqueue *vq,
+  dma_addr_t addr)
+{
+   if (!vring_use_dma_api(vq->vq.vdev))
+   return 0;
+
+   return dma_mapping_error(vring_dma_dev(vq), addr);
+}
+
+static void vring_unmap_one_split(const struct vring_virtqueue *vq,
+ struct vring_desc *desc)
 {
u16 flags;
 
@@ -226,17 +261,9 @@ static void vring_unmap_one(const struct vring_virtqueue 
*vq,
}
 }
 
-static int vring_mapping_error(const struct vring_virtqueue *vq,
-  dma_addr_t addr)
-{
-   if (!vring_use_dma_api(vq->vq.vdev))
-   return 0;
-
-   return dma_mapping_error(vring_dma_dev(vq), addr);
-}
-
-static struct vring_desc *alloc_indirect(struct virtqueue *_vq,
-unsigned int total_sg, gfp_t gfp)
+static struct vring_desc *alloc_indirect_split(struct virtqueue *_vq,
+  unsigned int total_sg,
+  gfp_t gfp)
 {
struct vring_desc *desc;
unsigned int i;
@@ -257,14 +284,14 @@ static struct vring_desc *alloc_indirect(struct virtqueue 
*_vq,
return desc;
 }
 
-static inline int virtqueue_add(struct virtqueue *_vq,
-   struct scatterlist *sgs[],
-   unsigned int total_sg,
-   unsigned int out_sgs,
-   unsigned int in_sgs,
-   void *data,
-   void *ctx,
-   gfp_t gfp)
+static inline int virtqueue_add_split(struct virtqueue *_vq,
+ struct scatterlist *sgs[],
+ unsigned int total_sg,
+ unsigned int out_sgs,
+ unsigned int in_sgs,
+  

[RFC v3 4/5] virtio_ring: add event idx support in packed ring

2018-04-24 Thread Tiwei Bie
This commit introduces the event idx support in packed
ring. This feature is temporarily disabled, because the
implementation in this patch may not work as expected,
and some further discussions on the implementation are
needed, e.g. do we have to check the wrap counter when
checking whether a kick is needed?
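
For reference, the helper reused here, vring_need_event() (already present
in include/uapi/linux/virtio_ring.h and quoted in patch 1/5 above), reports
whether event_idx lies in the half-open window [old, new_idx). A tiny
userspace check of that arithmetic, leaving the wrap-counter question above
open:

#include <assert.h>

typedef unsigned short __u16;

/* Same expression as vring_need_event() in virtio_ring.h. */
static int vring_need_event(__u16 event_idx, __u16 new_idx, __u16 old)
{
        return (__u16)(new_idx - event_idx - 1) < (__u16)(new_idx - old);
}

int main(void)
{
        /* The driver moved next_avail_idx from 10 (old) to 13 (new). */
        assert(vring_need_event(11, 13, 10));   /* event index crossed: kick */
        assert(!vring_need_event(13, 13, 10));  /* not reached yet: no kick */
        return 0;
}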

Signed-off-by: Tiwei Bie 
---
 drivers/virtio/virtio_ring.c | 53 
 1 file changed, 49 insertions(+), 4 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 0181e93897be..b1039c2985b9 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -986,7 +986,7 @@ static inline int virtqueue_add_packed(struct virtqueue 
*_vq,
 static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
 {
struct vring_virtqueue *vq = to_vvq(_vq);
-   u16 flags;
+   u16 new, old, off_wrap, flags;
bool needs_kick;
u32 snapshot;
 
@@ -995,7 +995,12 @@ static bool virtqueue_kick_prepare_packed(struct virtqueue 
*_vq)
 * suppressions. */
virtio_mb(vq->weak_barriers);
 
+   old = vq->next_avail_idx - vq->num_added;
+   new = vq->next_avail_idx;
+   vq->num_added = 0;
+
snapshot = *(u32 *)vq->vring_packed.device;
+   off_wrap = virtio16_to_cpu(_vq->vdev, snapshot & 0xffff);
flags = cpu_to_virtio16(_vq->vdev, snapshot >> 16) & 0x3;
 
 #ifdef DEBUG
@@ -1006,7 +1011,10 @@ static bool virtqueue_kick_prepare_packed(struct 
virtqueue *_vq)
vq->last_add_time_valid = false;
 #endif
 
-   needs_kick = (flags != VRING_EVENT_F_DISABLE);
+   if (flags == VRING_EVENT_F_DESC)
+   needs_kick = vring_need_event(off_wrap & ~(1<<15), new, old);
+   else
+   needs_kick = (flags != VRING_EVENT_F_DISABLE);
END_USE(vq);
return needs_kick;
 }
@@ -1116,6 +1124,15 @@ static void *virtqueue_get_buf_ctx_packed(struct 
virtqueue *_vq,
if (vq->last_used_idx >= vq->vring_packed.num)
vq->last_used_idx -= vq->vring_packed.num;
 
+   /* If we expect an interrupt for the next entry, tell host
+* by writing event index and flush out the write before
+* the read in the next get_buf call. */
+   if (vq->event_flags_shadow == VRING_EVENT_F_DESC)
+   virtio_store_mb(vq->weak_barriers,
+   &vq->vring_packed.driver->off_wrap,
+   cpu_to_virtio16(_vq->vdev, vq->last_used_idx |
+   (vq->wrap_counter << 15)));
+
 #ifdef DEBUG
vq->last_add_time_valid = false;
 #endif
@@ -1143,10 +1160,17 @@ static unsigned 
virtqueue_enable_cb_prepare_packed(struct virtqueue *_vq)
 
/* We optimistically turn back on interrupts, then check if there was
 * more to do. */
+   /* Depending on the VIRTIO_RING_F_USED_EVENT_IDX feature, we need to
+* either clear the flags bit or point the event index at the next
+* entry. Always update the event index to keep code simple. */
+
+   vq->vring_packed.driver->off_wrap = cpu_to_virtio16(_vq->vdev,
+   vq->last_used_idx | (vq->wrap_counter << 15));
 
if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
virtio_wmb(vq->weak_barriers);
-   vq->event_flags_shadow = VRING_EVENT_F_ENABLE;
+   vq->event_flags_shadow = vq->event ? VRING_EVENT_F_DESC :
+VRING_EVENT_F_ENABLE;
vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
vq->event_flags_shadow);
}
@@ -1172,15 +1196,34 @@ static bool virtqueue_poll_packed(struct virtqueue 
*_vq, unsigned last_used_idx)
 static bool virtqueue_enable_cb_delayed_packed(struct virtqueue *_vq)
 {
struct vring_virtqueue *vq = to_vvq(_vq);
+   u16 bufs, used_idx, wrap_counter;
 
START_USE(vq);
 
/* We optimistically turn back on interrupts, then check if there was
 * more to do. */
+   /* Depending on the VIRTIO_RING_F_USED_EVENT_IDX feature, we need to
+* either clear the flags bit or point the event index at the next
+* entry. Always update the event index to keep code simple. */
+
+   /* TODO: tune this threshold */
+   bufs = (u16)(vq->next_avail_idx - vq->last_used_idx) * 3 / 4;
+
+   used_idx = vq->last_used_idx + bufs;
+   wrap_counter = vq->wrap_counter;
+
+   if (used_idx >= vq->vring_packed.num) {
+   used_idx -= vq->vring_packed.num;
+   wrap_counter ^= 1;
+   }
+
+   vq->vring_packed.driver->off_wrap = cpu_to_virtio16(_vq->vdev,
+   used_idx | (wrap_counter << 15));
 
if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
virtio_wmb(vq->weak_barriers);
-   vq->event_flags_shadow = 

[RFC v3 5/5] virtio_ring: enable packed ring

2018-04-24 Thread Tiwei Bie
Signed-off-by: Tiwei Bie 
---
 drivers/virtio/virtio_ring.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index b1039c2985b9..9a3d13e1e2ba 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -1873,6 +1873,8 @@ void vring_transport_features(struct virtio_device *vdev)
break;
case VIRTIO_F_IOMMU_PLATFORM:
break;
+   case VIRTIO_F_RING_PACKED:
+   break;
default:
/* We don't understand this bit. */
__virtio_clear_bit(vdev, i);
-- 
2.11.0



[RFC v3 3/5] virtio_ring: add packed ring support

2018-04-24 Thread Tiwei Bie
This commit introduces the basic support (without EVENT_IDX)
for packed ring.
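
As a side note (not part of the patch), my reading of how the
VRING_DESC_F_AVAIL()/VRING_DESC_F_USED() macros from patch 1/5 are meant to
be combined when the driver publishes a descriptor is: the avail bit is set
to the driver's ring wrap counter and the used bit to its inverse. A tiny
standalone check of that bit layout:

#include <assert.h>

#define VRING_DESC_F_AVAIL(b)   ((b) << 7)
#define VRING_DESC_F_USED(b)    ((b) << 15)

int main(void)
{
        int wrap = 1;   /* driver ring wrap counter */

        /* Freshly published descriptor: avail == wrap, used == !wrap,
         * which lets the device tell new descriptors from stale ones. */
        unsigned int flags = VRING_DESC_F_AVAIL(wrap) | VRING_DESC_F_USED(!wrap);

        assert(flags == (1u << 7));     /* bit 7 set, bit 15 clear */
        return 0;
}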

Signed-off-by: Tiwei Bie 
---
 drivers/virtio/virtio_ring.c | 444 ++-
 1 file changed, 434 insertions(+), 10 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index e164822ca66e..0181e93897be 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -58,7 +58,8 @@
 
 struct vring_desc_state {
void *data; /* Data for callback. */
-   struct vring_desc *indir_desc;  /* Indirect descriptor, if any. */
+   void *indir_desc;   /* Indirect descriptor, if any. */
+   int num;/* Descriptor list length. */
 };
 
 struct vring_virtqueue {
@@ -142,6 +143,16 @@ struct vring_virtqueue {
 
 #define to_vvq(_vq) container_of(_vq, struct vring_virtqueue, vq)
 
+static inline bool virtqueue_use_indirect(struct virtqueue *_vq,
+ unsigned int total_sg)
+{
+   struct vring_virtqueue *vq = to_vvq(_vq);
+
+   /* If the host supports indirect descriptor tables, and we have multiple
+* buffers, then go indirect. FIXME: tune this threshold */
+   return (vq->indirect && total_sg > 1 && vq->vq.num_free);
+}
+
 /*
  * Modern virtio devices have feature bits to specify whether they need a
  * quirk and bypass the IOMMU. If not there, just use the DMA API.
@@ -327,9 +338,7 @@ static inline int virtqueue_add_split(struct virtqueue *_vq,
 
head = vq->free_head;
 
-   /* If the host supports indirect descriptor tables, and we have multiple
-* buffers, then go indirect. FIXME: tune this threshold */
-   if (vq->indirect && total_sg > 1 && vq->vq.num_free)
+   if (virtqueue_use_indirect(_vq, total_sg))
desc = alloc_indirect_split(_vq, total_sg, gfp);
else {
desc = NULL;
@@ -741,6 +750,49 @@ static inline unsigned vring_size_packed(unsigned int num, 
unsigned long align)
& ~(align - 1)) + sizeof(struct vring_packed_desc_event) * 2;
 }
 
+static void vring_unmap_one_packed(const struct vring_virtqueue *vq,
+  struct vring_packed_desc *desc)
+{
+   u16 flags;
+
+   if (!vring_use_dma_api(vq->vq.vdev))
+   return;
+
+   flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
+
+   if (flags & VRING_DESC_F_INDIRECT) {
+   dma_unmap_single(vring_dma_dev(vq),
+virtio64_to_cpu(vq->vq.vdev, desc->addr),
+virtio32_to_cpu(vq->vq.vdev, desc->len),
+(flags & VRING_DESC_F_WRITE) ?
+DMA_FROM_DEVICE : DMA_TO_DEVICE);
+   } else {
+   dma_unmap_page(vring_dma_dev(vq),
+  virtio64_to_cpu(vq->vq.vdev, desc->addr),
+  virtio32_to_cpu(vq->vq.vdev, desc->len),
+  (flags & VRING_DESC_F_WRITE) ?
+  DMA_FROM_DEVICE : DMA_TO_DEVICE);
+   }
+}
+
+static struct vring_packed_desc *alloc_indirect_packed(struct virtqueue *_vq,
+  unsigned int total_sg,
+  gfp_t gfp)
+{
+   struct vring_packed_desc *desc;
+
+   /*
+* We require lowmem mappings for the descriptors because
+* otherwise virt_to_phys will give us bogus addresses in the
+* virtqueue.
+*/
+   gfp &= ~__GFP_HIGHMEM;
+
+   desc = kmalloc(total_sg * sizeof(struct vring_packed_desc), gfp);
+
+   return desc;
+}
+
 static inline int virtqueue_add_packed(struct virtqueue *_vq,
   struct scatterlist *sgs[],
   unsigned int total_sg,
@@ -750,47 +802,419 @@ static inline int virtqueue_add_packed(struct virtqueue 
*_vq,
   void *ctx,
   gfp_t gfp)
 {
+   struct vring_virtqueue *vq = to_vvq(_vq);
+   struct vring_packed_desc *desc;
+   struct scatterlist *sg;
+   unsigned int i, n, descs_used, uninitialized_var(prev), err_idx;
+   __virtio16 uninitialized_var(head_flags), flags;
+   int head, wrap_counter;
+   bool indirect;
+
+   START_USE(vq);
+
+   BUG_ON(data == NULL);
+   BUG_ON(ctx && vq->indirect);
+
+   if (unlikely(vq->broken)) {
+   END_USE(vq);
+   return -EIO;
+   }
+
+#ifdef DEBUG
+   {
+   ktime_t now = ktime_get();
+
+   /* No kick or get, with .1 second between?  Warn. */
+   if (vq->last_add_time_valid)
+   WARN_ON(ktime_to_ms(ktime_sub(now, vq->last_add_time))
+   > 100);
+   vq->last_add_time = 

[PATCH bpf-next 4/4] nfp: bpf: optimize comparisons to negative constants

2018-04-24 Thread Jakub Kicinski
Comparison instruction requires a subtraction.  If the constant
is negative we are more likely to fit it into a NFP instruction
directly if we change the sign and use addition.
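
A tiny standalone illustration of the sign flip (plain C, not NFP code):
with a negative immediate, the subtraction the comparison is based on
produces the same value as adding the absolute value of the immediate, so
only the encoding of the constant changes, not the value the branch
condition is derived from:

#include <assert.h>
#include <stdint.h>

int main(void)
{
        uint64_t reg = 100;
        int32_t imm = -4;       /* immediate from the BPF instruction */

        /* cmp is based on reg - imm; for imm < 0 the very same value is
         * obtained as reg + (-imm), and -imm (4) is small enough to be
         * encoded directly in the instruction. */
        assert(reg - (int64_t)imm == reg + (uint64_t)(-imm));
        return 0;
}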

Signed-off-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/bpf/jit.c  | 42 +++
 drivers/net/ethernet/netronome/nfp/bpf/main.h |  6 +++-
 2 files changed, 35 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/jit.c 
b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
index 5b8da7a67df4..65f0791cae0c 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/jit.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
@@ -1247,6 +1247,7 @@ static int cmp_imm(struct nfp_prog *nfp_prog, struct 
nfp_insn_meta *meta)
const struct bpf_insn *insn = &meta->insn;
u64 imm = insn->imm; /* sign extend */
const struct jmp_code_map *code;
+   enum alu_op alu_op, carry_op;
u8 reg = insn->dst_reg * 2;
swreg tmp_reg;
 
@@ -1254,19 +1255,22 @@ static int cmp_imm(struct nfp_prog *nfp_prog, struct 
nfp_insn_meta *meta)
if (!code)
return -EINVAL;
 
+   alu_op = meta->jump_neg_op ? ALU_OP_ADD : ALU_OP_SUB;
+   carry_op = meta->jump_neg_op ? ALU_OP_ADD_C : ALU_OP_SUB_C;
+
tmp_reg = ur_load_imm_any(nfp_prog, imm & ~0U, imm_b(nfp_prog));
if (!code->swap)
-   emit_alu(nfp_prog, reg_none(), reg_a(reg), ALU_OP_SUB, tmp_reg);
+   emit_alu(nfp_prog, reg_none(), reg_a(reg), alu_op, tmp_reg);
else
-   emit_alu(nfp_prog, reg_none(), tmp_reg, ALU_OP_SUB, reg_a(reg));
+   emit_alu(nfp_prog, reg_none(), tmp_reg, alu_op, reg_a(reg));
 
tmp_reg = ur_load_imm_any(nfp_prog, imm >> 32, imm_b(nfp_prog));
if (!code->swap)
emit_alu(nfp_prog, reg_none(),
-reg_a(reg + 1), ALU_OP_SUB_C, tmp_reg);
+reg_a(reg + 1), carry_op, tmp_reg);
else
emit_alu(nfp_prog, reg_none(),
-tmp_reg, ALU_OP_SUB_C, reg_a(reg + 1));
+tmp_reg, carry_op, reg_a(reg + 1));
 
emit_br(nfp_prog, code->br_mask, insn->off, 0);
 
@@ -2745,21 +2749,35 @@ static void nfp_bpf_opt_neg_add_sub(struct nfp_prog 
*nfp_prog)
continue;
 
if (BPF_CLASS(insn.code) != BPF_ALU &&
-   BPF_CLASS(insn.code) != BPF_ALU64)
+   BPF_CLASS(insn.code) != BPF_ALU64 &&
+   BPF_CLASS(insn.code) != BPF_JMP)
continue;
if (BPF_SRC(insn.code) != BPF_K)
continue;
if (insn.imm >= 0)
continue;
 
-   if (BPF_OP(insn.code) == BPF_ADD)
-   insn.code = BPF_CLASS(insn.code) | BPF_SUB;
-   else if (BPF_OP(insn.code) == BPF_SUB)
-   insn.code = BPF_CLASS(insn.code) | BPF_ADD;
-   else
-   continue;
+   if (BPF_CLASS(insn.code) == BPF_JMP) {
+   switch (BPF_OP(insn.code)) {
+   case BPF_JGE:
+   case BPF_JSGE:
+   case BPF_JLT:
+   case BPF_JSLT:
+   meta->jump_neg_op = true;
+   break;
+   default:
+   continue;
+   }
+   } else {
+   if (BPF_OP(insn.code) == BPF_ADD)
+   insn.code = BPF_CLASS(insn.code) | BPF_SUB;
+   else if (BPF_OP(insn.code) == BPF_SUB)
+   insn.code = BPF_CLASS(insn.code) | BPF_ADD;
+   else
+   continue;
 
-   meta->insn.code = insn.code | BPF_K;
+   meta->insn.code = insn.code | BPF_K;
+   }
 
meta->insn.imm = -insn.imm;
}
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.h 
b/drivers/net/ethernet/netronome/nfp/bpf/main.h
index 4981c8944ca3..68b5d326483d 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.h
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.h
@@ -236,6 +236,7 @@ struct nfp_bpf_reg_state {
  * @xadd_over_16bit: 16bit immediate is not guaranteed
  * @xadd_maybe_16bit: 16bit immediate is possible
  * @jmp_dst: destination info for jump instructions
+ * @jump_neg_op: jump instruction has inverted immediate, use ADD instead of 
SUB
  * @func_id: function id for call instructions
  * @arg1: arg1 for call instructions
  * @arg2: arg2 for call instructions
@@ -264,7 +265,10 @@ struct nfp_insn_meta {
bool xadd_maybe_16bit;
};
/* jump */
-   struct nfp_insn_meta *jmp_dst;
+   struct {
+   struct 

[PATCH bpf-next 2/4] nfp: bpf: optimize add/sub of a negative constant

2018-04-24 Thread Jakub Kicinski
NFP instruction set can fit small immediates into the instruction.
Negative integers, however, will never fit because they will have
the highest bit set.  If we swap the ALU op between ADD and SUB and
negate the constant we have a better chance of fitting small negative
integers into the instruction itself and saving one or two cycles.

immed[gprB_21, 0xfffc]
alu[gprA_4, gprA_4, +, gprB_21], gpr_wrboth
immed[gprB_21, 0x]
alu[gprA_5, gprA_5, +carry, gprB_21], gpr_wrboth

now becomes:

alu[gprA_4, gprA_4, -, 4], gpr_wrboth
alu[gprA_5, gprA_5, -carry, 0], gpr_wrboth

Signed-off-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/bpf/jit.c | 35 
 1 file changed, 35 insertions(+)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/jit.c 
b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
index 9cc638718272..a5590988fc69 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/jit.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
@@ -2777,6 +2777,40 @@ static void nfp_bpf_opt_reg_init(struct nfp_prog 
*nfp_prog)
}
 }
 
+/* abs(insn.imm) will fit better into unrestricted reg immediate -
+ * convert add/sub of a negative number into a sub/add of a positive one.
+ */
+static void nfp_bpf_opt_neg_add_sub(struct nfp_prog *nfp_prog)
+{
+   struct nfp_insn_meta *meta;
+
+   list_for_each_entry(meta, &nfp_prog->insns, l) {
+   struct bpf_insn insn = meta->insn;
+
+   if (meta->skip)
+   continue;
+
+   if (BPF_CLASS(insn.code) != BPF_ALU &&
+   BPF_CLASS(insn.code) != BPF_ALU64)
+   continue;
+   if (BPF_SRC(insn.code) != BPF_K)
+   continue;
+   if (insn.imm >= 0)
+   continue;
+
+   if (BPF_OP(insn.code) == BPF_ADD)
+   insn.code = BPF_CLASS(insn.code) | BPF_SUB;
+   else if (BPF_OP(insn.code) == BPF_SUB)
+   insn.code = BPF_CLASS(insn.code) | BPF_ADD;
+   else
+   continue;
+
+   meta->insn.code = insn.code | BPF_K;
+
+   meta->insn.imm = -insn.imm;
+   }
+}
+
 /* Remove masking after load since our load guarantees this is not needed */
 static void nfp_bpf_opt_ld_mask(struct nfp_prog *nfp_prog)
 {
@@ -3212,6 +3246,7 @@ static int nfp_bpf_optimize(struct nfp_prog *nfp_prog)
 {
nfp_bpf_opt_reg_init(nfp_prog);
 
+   nfp_bpf_opt_neg_add_sub(nfp_prog);
nfp_bpf_opt_ld_mask(nfp_prog);
nfp_bpf_opt_ld_shift(nfp_prog);
nfp_bpf_opt_ldst_gather(nfp_prog);
-- 
2.16.2



[PATCH bpf-next 0/4] nfp: bpf: optimize negative sums

2018-04-24 Thread Jakub Kicinski
Hi!

This set adds an optimization run to the NFP jit to turn ADD and SUB
instructions with negative immediate into the opposite operation with
a positive immediate.  NFP can fit small immediates into the instructions
but it can't ever fit negative immediates.  Addition of small negative
immediates is quite common in BPF programs for stack address calculations,
therefore this optimization gives us non-negligible savings in instruction
count (up to 4%).


Jakub Kicinski (4):
  nfp: bpf: remove double space
  nfp: bpf: optimize add/sub of a negative constant
  nfp: bpf: tabularize generations of compare operations
  nfp: bpf: optimize comparisons to negative constants

 drivers/net/ethernet/netronome/nfp/bpf/jit.c  | 231 +-
 drivers/net/ethernet/netronome/nfp/bpf/main.h |   6 +-
 2 files changed, 124 insertions(+), 113 deletions(-)

-- 
2.16.2



[PATCH bpf-next 1/4] nfp: bpf: remove double space

2018-04-24 Thread Jakub Kicinski
Whitespace cleanup - remove double space.

Signed-off-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/bpf/jit.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/jit.c 
b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
index 29b4e5f8c102..9cc638718272 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/jit.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
@@ -1400,7 +1400,7 @@ map_call_stack_common(struct nfp_prog *nfp_prog, struct 
nfp_insn_meta *meta)
if (!load_lm_ptr)
return 0;
 
-   emit_csr_wr(nfp_prog, stack_reg(nfp_prog),  NFP_CSR_ACT_LM_ADDR0);
+   emit_csr_wr(nfp_prog, stack_reg(nfp_prog), NFP_CSR_ACT_LM_ADDR0);
wrp_nops(nfp_prog, 3);
 
return 0;
-- 
2.16.2



[PATCH bpf-next 3/4] nfp: bpf: tabularize generations of compare operations

2018-04-24 Thread Jakub Kicinski
There are quite a few compare instructions now; use a table
to translate BPF instruction code to NFP instruction parameters
instead of parameterizing helpers.  This saves LOC and makes
future extensions easier.

Signed-off-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/bpf/jit.c | 168 ++-
 1 file changed, 61 insertions(+), 107 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/jit.c 
b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
index a5590988fc69..5b8da7a67df4 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/jit.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/jit.c
@@ -1214,45 +1214,79 @@ wrp_test_reg(struct nfp_prog *nfp_prog, struct 
nfp_insn_meta *meta,
return 0;
 }
 
-static int
-wrp_cmp_imm(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta,
-   enum br_mask br_mask, bool swap)
+static const struct jmp_code_map {
+   enum br_mask br_mask;
+   bool swap;
+} jmp_code_map[] = {
+   [BPF_JGT >> 4]  = { BR_BLO, true },
+   [BPF_JGE >> 4]  = { BR_BHS, false },
+   [BPF_JLT >> 4]  = { BR_BLO, false },
+   [BPF_JLE >> 4]  = { BR_BHS, true },
+   [BPF_JSGT >> 4] = { BR_BLT, true },
+   [BPF_JSGE >> 4] = { BR_BGE, false },
+   [BPF_JSLT >> 4] = { BR_BLT, false },
+   [BPF_JSLE >> 4] = { BR_BGE, true },
+};
+
+static const struct jmp_code_map *nfp_jmp_code_get(struct nfp_insn_meta *meta)
+{
+   unsigned int op;
+
+   op = BPF_OP(meta->insn.code) >> 4;
+   /* br_mask of 0 is BR_BEQ which we don't use in jump code table */
+   if (WARN_ONCE(op >= ARRAY_SIZE(jmp_code_map) ||
+ !jmp_code_map[op].br_mask,
+ "no code found for jump instruction"))
+   return NULL;
+
+   return &jmp_code_map[op];
+}
+
+static int cmp_imm(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
 {
const struct bpf_insn *insn = &meta->insn;
u64 imm = insn->imm; /* sign extend */
+   const struct jmp_code_map *code;
u8 reg = insn->dst_reg * 2;
swreg tmp_reg;
 
+   code = nfp_jmp_code_get(meta);
+   if (!code)
+   return -EINVAL;
+
tmp_reg = ur_load_imm_any(nfp_prog, imm & ~0U, imm_b(nfp_prog));
-   if (!swap)
+   if (!code->swap)
emit_alu(nfp_prog, reg_none(), reg_a(reg), ALU_OP_SUB, tmp_reg);
else
emit_alu(nfp_prog, reg_none(), tmp_reg, ALU_OP_SUB, reg_a(reg));
 
tmp_reg = ur_load_imm_any(nfp_prog, imm >> 32, imm_b(nfp_prog));
-   if (!swap)
+   if (!code->swap)
emit_alu(nfp_prog, reg_none(),
 reg_a(reg + 1), ALU_OP_SUB_C, tmp_reg);
else
emit_alu(nfp_prog, reg_none(),
 tmp_reg, ALU_OP_SUB_C, reg_a(reg + 1));
 
-   emit_br(nfp_prog, br_mask, insn->off, 0);
+   emit_br(nfp_prog, code->br_mask, insn->off, 0);
 
return 0;
 }
 
-static int
-wrp_cmp_reg(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta,
-   enum br_mask br_mask, bool swap)
+static int cmp_reg(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
 {
const struct bpf_insn *insn = &meta->insn;
+   const struct jmp_code_map *code;
u8 areg, breg;
 
+   code = nfp_jmp_code_get(meta);
+   if (!code)
+   return -EINVAL;
+
areg = insn->dst_reg * 2;
breg = insn->src_reg * 2;
 
-   if (swap) {
+   if (code->swap) {
areg ^= breg;
breg ^= areg;
areg ^= breg;
@@ -1261,7 +1295,7 @@ wrp_cmp_reg(struct nfp_prog *nfp_prog, struct 
nfp_insn_meta *meta,
emit_alu(nfp_prog, reg_none(), reg_a(areg), ALU_OP_SUB, reg_b(breg));
emit_alu(nfp_prog, reg_none(),
 reg_a(areg + 1), ALU_OP_SUB_C, reg_b(breg + 1));
-   emit_br(nfp_prog, br_mask, insn->off, 0);
+   emit_br(nfp_prog, code->br_mask, insn->off, 0);
 
return 0;
 }
@@ -2283,46 +2317,6 @@ static int jeq_imm(struct nfp_prog *nfp_prog, struct 
nfp_insn_meta *meta)
return 0;
 }
 
-static int jgt_imm(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
-{
-   return wrp_cmp_imm(nfp_prog, meta, BR_BLO, true);
-}
-
-static int jge_imm(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
-{
-   return wrp_cmp_imm(nfp_prog, meta, BR_BHS, false);
-}
-
-static int jlt_imm(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
-{
-   return wrp_cmp_imm(nfp_prog, meta, BR_BLO, false);
-}
-
-static int jle_imm(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
-{
-   return wrp_cmp_imm(nfp_prog, meta, BR_BHS, true);
-}
-
-static int jsgt_imm(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
-{
-   return wrp_cmp_imm(nfp_prog, meta, BR_BLT, true);
-}
-
-static int jsge_imm(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta)
-{
-   return wrp_cmp_imm(nfp_prog, meta, BR_BGE, false);
-}
-
-static int jslt_imm(struct nfp_prog 

[net-next:master 42/70] drivers/net/usb/lan78xx.c:1651:37-38: WARNING: Use ARRAY_SIZE

2018-04-24 Thread kbuild test robot
tree:   https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git 
master
head:   c749fa181bd5848be78691d23168ec61ce691b95
commit: 496218656f9857d801512efdec1d609ebbe8a83b [42/70] lan78xx: Add support 
to dump lan78xx registers


coccinelle warnings: (new ones prefixed by >>)

>> drivers/net/usb/lan78xx.c:1651:37-38: WARNING: Use ARRAY_SIZE

vim +1651 drivers/net/usb/lan78xx.c

  1641  
  1642  static void
  1643  lan78xx_get_regs(struct net_device *netdev, struct ethtool_regs *regs,
  1644   void *buf)
  1645  {
  1646  u32 *data = buf;
  1647  int i, j;
  1648  struct lan78xx_net *dev = netdev_priv(netdev);
  1649  
  1650  /* Read Device/MAC registers */
> 1651  for (i = 0; i < (sizeof(lan78xx_regs) / sizeof(u32)); i++)
  1652  lan78xx_read_reg(dev, lan78xx_regs[i], &data[i]);
  1653  
  1654  if (!netdev->phydev)
  1655  return;
  1656  
  1657  /* Read PHY registers */
  1658  for (j = 0; j < 32; i++, j++)
  1659  data[i] = phy_read(netdev->phydev, j);
  1660  }
  1661  
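
The conventional fix for this warning (an untested sketch; ARRAY_SIZE() comes
from include/linux/kernel.h) would be:

	/* use ARRAY_SIZE() instead of open-coding sizeof()/sizeof(u32) */
	for (i = 0; i < ARRAY_SIZE(lan78xx_regs); i++)
		lan78xx_read_reg(dev, lan78xx_regs[i], &data[i]);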

---
0-DAY kernel test infrastructureOpen Source Technology Center
https://lists.01.org/pipermail/kbuild-all   Intel Corporation


[PATCH net-next 2/4] nfp: print PCIe link bandwidth on probe

2018-04-24 Thread Jakub Kicinski
To aid debugging of performance issues caused by limited PCIe
bandwidth, print the PCIe link information on probe.

Signed-off-by: Jakub Kicinski 
Reviewed-by: Dirk van der Merwe 
---
 drivers/net/ethernet/netronome/nfp/nfpcore/nfp6000_pcie.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp6000_pcie.c 
b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp6000_pcie.c
index cd678323bacb..a0e336bd1d85 100644
--- a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp6000_pcie.c
+++ b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp6000_pcie.c
@@ -1330,6 +1330,7 @@ struct nfp_cpp *nfp_cpp_from_nfp6000_pcie(struct pci_dev 
*pdev)
/*  Finished with card initialization. */
dev_info(&pdev->dev,
 "Netronome Flow Processor NFP4000/NFP6000 PCIe Card Probe\n");
+   pcie_print_link_status(pdev);
 
nfp = kzalloc(sizeof(*nfp), GFP_KERNEL);
if (!nfp) {
-- 
2.16.2



[PATCH net-next 3/4] nfp: flower: support offloading multiple rules with same cookie

2018-04-24 Thread Jakub Kicinski
From: John Hurley 

When multiple netdevs are attached to a tc offload block and register for
callbacks, a rule added to the block will be propagated to all netdevs.
Previously these were detected as duplicates (based on cookie) and
rejected. Modify the nfp rule lookup function to optionally include an
ingress netdev and a host context along with the cookie value when
searching for a rule. When a new rule is passed to the driver, the netdev
the rule is to be attached to is considered when searching for duplicates.
When a stats update is received from HW, the host context is used
alongside the cookie to map to the correct host rule.

Signed-off-by: John Hurley 
Reviewed-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/flower/main.h   |  8 +--
 .../net/ethernet/netronome/nfp/flower/metadata.c   | 20 +---
 .../net/ethernet/netronome/nfp/flower/offload.c| 27 --
 3 files changed, 38 insertions(+), 17 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/main.h 
b/drivers/net/ethernet/netronome/nfp/flower/main.h
index c67e1b54c614..9e6804bc9b40 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/main.h
+++ b/drivers/net/ethernet/netronome/nfp/flower/main.h
@@ -47,6 +47,7 @@
 struct net_device;
 struct nfp_app;
 
+#define NFP_FL_STATS_CTX_DONT_CARE cpu_to_be32(0xffffffff)
 #define NFP_FL_STATS_ENTRY_RS  BIT(20)
 #define NFP_FL_STATS_ELEM_RS   4
 #define NFP_FL_REPEATED_HASH_MAX   BIT(17)
@@ -189,6 +190,7 @@ struct nfp_fl_payload {
spinlock_t lock; /* lock stats */
struct nfp_fl_stats stats;
__be32 nfp_tun_ipv4_addr;
+   struct net_device *ingress_dev;
char *unmasked_data;
char *mask_data;
char *action_data;
@@ -216,12 +218,14 @@ int nfp_flower_compile_action(struct 
tc_cls_flower_offload *flow,
  struct nfp_fl_payload *nfp_flow);
 int nfp_compile_flow_metadata(struct nfp_app *app,
  struct tc_cls_flower_offload *flow,
- struct nfp_fl_payload *nfp_flow);
+ struct nfp_fl_payload *nfp_flow,
+ struct net_device *netdev);
 int nfp_modify_flow_metadata(struct nfp_app *app,
 struct nfp_fl_payload *nfp_flow);
 
 struct nfp_fl_payload *
-nfp_flower_search_fl_table(struct nfp_app *app, unsigned long 
tc_flower_cookie);
+nfp_flower_search_fl_table(struct nfp_app *app, unsigned long tc_flower_cookie,
+  struct net_device *netdev, __be32 host_ctx);
 struct nfp_fl_payload *
 nfp_flower_remove_fl_table(struct nfp_app *app, unsigned long 
tc_flower_cookie);
 
diff --git a/drivers/net/ethernet/netronome/nfp/flower/metadata.c 
b/drivers/net/ethernet/netronome/nfp/flower/metadata.c
index db977cf8e933..21668aa435e8 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/metadata.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/metadata.c
@@ -99,14 +99,18 @@ static int nfp_get_stats_entry(struct nfp_app *app, u32 
*stats_context_id)
 
 /* Must be called with either RTNL or rcu_read_lock */
 struct nfp_fl_payload *
-nfp_flower_search_fl_table(struct nfp_app *app, unsigned long tc_flower_cookie)
+nfp_flower_search_fl_table(struct nfp_app *app, unsigned long tc_flower_cookie,
+  struct net_device *netdev, __be32 host_ctx)
 {
struct nfp_flower_priv *priv = app->priv;
struct nfp_fl_payload *flower_entry;
 
hash_for_each_possible_rcu(priv->flow_table, flower_entry, link,
   tc_flower_cookie)
-   if (flower_entry->tc_flower_cookie == tc_flower_cookie)
+   if (flower_entry->tc_flower_cookie == tc_flower_cookie &&
+   (!netdev || flower_entry->ingress_dev == netdev) &&
+   (host_ctx == NFP_FL_STATS_CTX_DONT_CARE ||
+flower_entry->meta.host_ctx_id == host_ctx))
return flower_entry;
 
return NULL;
@@ -121,13 +125,11 @@ nfp_flower_update_stats(struct nfp_app *app, struct 
nfp_fl_stats_frame *stats)
flower_cookie = be64_to_cpu(stats->stats_cookie);
 
rcu_read_lock();
-   nfp_flow = nfp_flower_search_fl_table(app, flower_cookie);
+   nfp_flow = nfp_flower_search_fl_table(app, flower_cookie, NULL,
+ stats->stats_con_id);
if (!nfp_flow)
goto exit_rcu_unlock;
 
-   if (nfp_flow->meta.host_ctx_id != stats->stats_con_id)
-   goto exit_rcu_unlock;
-
spin_lock(&nfp_flow->lock);
nfp_flow->stats.pkts += be32_to_cpu(stats->pkt_count);
nfp_flow->stats.bytes += be64_to_cpu(stats->byte_count);
@@ -317,7 +319,8 @@ nfp_check_mask_remove(struct nfp_app *app, char *mask_data, 
u32 mask_len,
 
 int nfp_compile_flow_metadata(struct nfp_app 

[PATCH net-next 4/4] nfp: flower: ignore duplicate cb requests for same rule

2018-04-24 Thread Jakub Kicinski
From: John Hurley 

If a flower rule has a repr both as ingress and egress port, then 2
callbacks may be generated for the same rule request.

Add an indicator to each flow as to whether or not it was added from an
ingress registered cb. If so then ignore add/del/stat requests to it from
an egress cb.

Signed-off-by: John Hurley 
Reviewed-by: Jakub Kicinski 
---
 drivers/net/ethernet/netronome/nfp/flower/main.h   |  1 +
 .../net/ethernet/netronome/nfp/flower/offload.c| 23 +++---
 2 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/main.h 
b/drivers/net/ethernet/netronome/nfp/flower/main.h
index 9e6804bc9b40..733ff53cc601 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/main.h
+++ b/drivers/net/ethernet/netronome/nfp/flower/main.h
@@ -194,6 +194,7 @@ struct nfp_fl_payload {
char *unmasked_data;
char *mask_data;
char *action_data;
+   bool ingress_offload;
 };
 
 struct nfp_fl_stats_frame {
diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c 
b/drivers/net/ethernet/netronome/nfp/flower/offload.c
index bdc82e11a31e..70ec9d821b91 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
@@ -345,7 +345,7 @@ nfp_flower_calculate_key_layers(struct nfp_app *app,
 }
 
 static struct nfp_fl_payload *
-nfp_flower_allocate_new(struct nfp_fl_key_ls *key_layer)
+nfp_flower_allocate_new(struct nfp_fl_key_ls *key_layer, bool egress)
 {
struct nfp_fl_payload *flow_pay;
 
@@ -371,6 +371,8 @@ nfp_flower_allocate_new(struct nfp_fl_key_ls *key_layer)
flow_pay->meta.flags = 0;
	spin_lock_init(&flow_pay->lock);
 
+   flow_pay->ingress_offload = !egress;
+
return flow_pay;
 
 err_free_mask:
@@ -402,8 +404,20 @@ nfp_flower_add_offload(struct nfp_app *app, struct 
net_device *netdev,
struct nfp_flower_priv *priv = app->priv;
struct nfp_fl_payload *flow_pay;
struct nfp_fl_key_ls *key_layer;
+   struct net_device *ingr_dev;
int err;
 
+   ingr_dev = egress ? NULL : netdev;
+   flow_pay = nfp_flower_search_fl_table(app, flow->cookie, ingr_dev,
+ NFP_FL_STATS_CTX_DONT_CARE);
+   if (flow_pay) {
+   /* Ignore as duplicate if it has been added by different cb. */
+   if (flow_pay->ingress_offload && egress)
+   return 0;
+   else
+   return -EOPNOTSUPP;
+   }
+
key_layer = kmalloc(sizeof(*key_layer), GFP_KERNEL);
if (!key_layer)
return -ENOMEM;
@@ -413,7 +427,7 @@ nfp_flower_add_offload(struct nfp_app *app, struct 
net_device *netdev,
if (err)
goto err_free_key_ls;
 
-   flow_pay = nfp_flower_allocate_new(key_layer);
+   flow_pay = nfp_flower_allocate_new(key_layer, egress);
if (!flow_pay) {
err = -ENOMEM;
goto err_free_key_ls;
@@ -485,7 +499,7 @@ nfp_flower_del_offload(struct nfp_app *app, struct 
net_device *netdev,
nfp_flow = nfp_flower_search_fl_table(app, flow->cookie, ingr_dev,
  NFP_FL_STATS_CTX_DONT_CARE);
if (!nfp_flow)
-   return -ENOENT;
+   return egress ? 0 : -ENOENT;
 
err = nfp_modify_flow_metadata(app, nfp_flow);
if (err)
@@ -534,6 +548,9 @@ nfp_flower_get_stats(struct nfp_app *app, struct net_device 
*netdev,
if (!nfp_flow)
return -EINVAL;
 
+   if (nfp_flow->ingress_offload && egress)
+   return 0;
+
	spin_lock_bh(&nfp_flow->lock);
tcf_exts_stats_update(flow->exts, nfp_flow->stats.bytes,
  nfp_flow->stats.pkts, nfp_flow->stats.used);
-- 
2.16.2



[PATCH net-next 0/4] nfp: flower tc block support and nfp PCI updates

2018-04-24 Thread Jakub Kicinski
This series improves the nfp PCIe code by making use of the new
pcie_print_link_status() helper and resetting NFP locks when
driver loads.  This can help us avoid lock-ups after the host crashes and
is rebooted with a PCIe reset, or when a kdump kernel is loaded.

The flower changes come from John; he says:

This patchset fixes offload issues when multiple repr netdevs are bound to
a tc block and filter rules are added. Previously the rule would be passed to
the reprs and would be rejected in all but the first, as the cookie value
would indicate a duplicate. The first patch extends the flow lookup
function to consider both host context and ingress netdev along with the
cookie value. This means that a rule with a given cookie can exist
multiple times assuming the ingress netdev is different. The host context
ensures that stats from fw are associated with the correct instance of the
rule.

The second patch protects against rejecting add/del/stat messages when a
rule has a repr as both an ingress port and an egress dev. In such cases a
callback can be triggered twice (once for ingress and once for egress)
and can lead to duplicate rule detection or incorrect double calls.


Jakub Kicinski (2):
  nfp: reset local locks on init
  nfp: print PCIe link bandwidth on probe

John Hurley (2):
  nfp: flower: support offloading multiple rules with same cookie
  nfp: flower: ignore duplicate cb requests for same rule

 drivers/net/ethernet/netronome/nfp/flower/main.h   |  9 +++-
 .../net/ethernet/netronome/nfp/flower/metadata.c   | 20 +---
 .../net/ethernet/netronome/nfp/flower/offload.c| 50 ++
 drivers/net/ethernet/netronome/nfp/nfp_main.c  |  5 ++
 drivers/net/ethernet/netronome/nfp/nfpcore/nfp.h   |  2 +
 .../ethernet/netronome/nfp/nfpcore/nfp6000_pcie.c  |  1 +
 .../net/ethernet/netronome/nfp/nfpcore/nfp_cpp.h   |  2 +
 .../net/ethernet/netronome/nfp/nfpcore/nfp_mutex.c | 45 +
 .../ethernet/netronome/nfp/nfpcore/nfp_resource.c  | 59 ++
 9 files changed, 173 insertions(+), 20 deletions(-)

-- 
2.16.2



[PATCH net-next 1/4] nfp: reset local locks on init

2018-04-24 Thread Jakub Kicinski
NFP locks record the owner when held; for PCIe devices the owner
ID will be the PCIe link number.  When the driver loads it should scan
known locks, and if they indicate that they are held by the local
endpoint but the driver doesn't actually hold them, release them.

Locks can be left taken, for instance, when the kernel gets kexec-ed or
after a crash.  Management FW tries to clean up stale locks too,
but it currently depends on the PCIe link going down, which doesn't
always happen.
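
As a rough sketch of how the new reclaim helper is used at probe time
(the constant names and the entry layout here are assumptions based on
nfpcore, not the patch's exact code):

/* Illustrative sketch, not the exact contents of the patch. */
static int example_resource_table_reclaim(struct nfp_cpp *cpp)
{
	u64 base = NFP_RESOURCE_TBL_BASE;
	unsigned int i;
	int err;

	/* Reclaim the table lock itself, then every per-entry lock. */
	err = nfp_cpp_mutex_reclaim(cpp, NFP_RESOURCE_TBL_TARGET, base);
	if (err < 0)
		return err;

	for (i = 0; i < NFP_RESOURCE_TBL_ENTRIES; i++) {
		u64 addr = base + i * sizeof(struct nfp_resource_entry);

		err = nfp_cpp_mutex_reclaim(cpp, NFP_RESOURCE_TBL_TARGET,
					    addr);
		if (err < 0)
			return err;
	}

	return 0;
}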

Signed-off-by: Jakub Kicinski 
Reviewed-by: Dirk van der Merwe 
---
 drivers/net/ethernet/netronome/nfp/nfp_main.c  |  5 ++
 drivers/net/ethernet/netronome/nfp/nfpcore/nfp.h   |  2 +
 .../net/ethernet/netronome/nfp/nfpcore/nfp_cpp.h   |  2 +
 .../net/ethernet/netronome/nfp/nfpcore/nfp_mutex.c | 45 +
 .../ethernet/netronome/nfp/nfpcore/nfp_resource.c  | 59 ++
 5 files changed, 113 insertions(+)

diff --git a/drivers/net/ethernet/netronome/nfp/nfp_main.c 
b/drivers/net/ethernet/netronome/nfp/nfp_main.c
index 6a3e3231e111..c8aef9508109 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_main.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_main.c
@@ -489,6 +489,10 @@ static int nfp_pci_probe(struct pci_dev *pdev,
goto err_disable_msix;
}
 
+   err = nfp_resource_table_init(pf->cpp);
+   if (err)
+   goto err_cpp_free;
+
pf->hwinfo = nfp_hwinfo_read(pf->cpp);
 
	dev_info(&pdev->dev, "Assembly: %s%s%s-%s CPLD: %s\n",
@@ -551,6 +555,7 @@ static int nfp_pci_probe(struct pci_dev *pdev,
vfree(pf->dumpspec);
 err_hwinfo_free:
kfree(pf->hwinfo);
+err_cpp_free:
nfp_cpp_free(pf->cpp);
 err_disable_msix:
destroy_workqueue(pf->wq);
diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp.h 
b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp.h
index ced62d112aa2..f44d0a857314 100644
--- a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp.h
+++ b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp.h
@@ -94,6 +94,8 @@ int nfp_nsp_read_sensors(struct nfp_nsp *state, unsigned int 
sensor_mask,
 /* MAC Statistics Accumulator */
 #define NFP_RESOURCE_MAC_STATISTICS"mac.stat"
 
+int nfp_resource_table_init(struct nfp_cpp *cpp);
+
 struct nfp_resource *
 nfp_resource_acquire(struct nfp_cpp *cpp, const char *name);
 
diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_cpp.h 
b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_cpp.h
index c8f2c064cce3..4e19add1c539 100644
--- a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_cpp.h
+++ b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_cpp.h
@@ -295,6 +295,8 @@ void nfp_cpp_mutex_free(struct nfp_cpp_mutex *mutex);
 int nfp_cpp_mutex_lock(struct nfp_cpp_mutex *mutex);
 int nfp_cpp_mutex_unlock(struct nfp_cpp_mutex *mutex);
 int nfp_cpp_mutex_trylock(struct nfp_cpp_mutex *mutex);
+int nfp_cpp_mutex_reclaim(struct nfp_cpp *cpp, int target,
+ unsigned long long address);
 
 /**
  * nfp_cppcore_pcie_unit() - Get PCI Unit of a CPP handle
diff --git a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_mutex.c 
b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_mutex.c
index cb28ac03e4ca..c88bf673cb76 100644
--- a/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_mutex.c
+++ b/drivers/net/ethernet/netronome/nfp/nfpcore/nfp_mutex.c
@@ -59,6 +59,11 @@ static u32 nfp_mutex_unlocked(u16 interface)
	return (u32)interface << 16 | 0x0000;
 }
 
+static u32 nfp_mutex_owner(u32 val)
+{
+   return val >> 16;
+}
+
 static bool nfp_mutex_is_locked(u32 val)
 {
	return (val & 0xffff) == 0x000f;
@@ -351,3 +356,43 @@ int nfp_cpp_mutex_trylock(struct nfp_cpp_mutex *mutex)
 
return nfp_mutex_is_locked(tmp) ? -EBUSY : -EINVAL;
 }
+
+/**
+ * nfp_cpp_mutex_reclaim() - Unlock mutex if held by local endpoint
+ * @cpp:   NFP CPP handle
+ * @target:NFP CPP target ID (ie NFP_CPP_TARGET_CLS or NFP_CPP_TARGET_MU)
+ * @address:   Offset into the address space of the NFP CPP target ID
+ *
+ * Release lock if held by local system.  Extreme care is advised, call only
+ * when no local lock users can exist.
+ *
+ * Return:  0 if the lock was OK, 1 if locked by us, -errno on invalid 
mutex
+ */
+int nfp_cpp_mutex_reclaim(struct nfp_cpp *cpp, int target,
+ unsigned long long address)
+{
+   const u32 mur = NFP_CPP_ID(target, 3, 0);   /* atomic_read */
+   const u32 muw = NFP_CPP_ID(target, 4, 0);   /* atomic_write */
+   u16 interface = nfp_cpp_interface(cpp);
+   int err;
+   u32 tmp;
+
+   err = nfp_cpp_mutex_validate(interface, &target, address);
+   if (err)
+   return err;
+
+   /* Check lock */
+   err = nfp_cpp_readl(cpp, mur, address, &tmp);
+   if (err < 0)
+   return err;
+
+   if (nfp_mutex_is_unlocked(tmp) || nfp_mutex_owner(tmp) != interface)
+   return 0;
+
+   /* Bust the lock */
+   err = nfp_cpp_writel(cpp, muw, 

Re: [PATCH] net: ethernet: ave: fix ave_start_xmit()'s return type

2018-04-24 Thread Kunihiko Hayashi
Hi,

On Tue, 24 Apr 2018 15:17:25 +0200
Luc Van Oostenryck  wrote:

> The method ndo_start_xmit() is defined as returning an 'netdev_tx_t',
> which is a typedef for an enum type, but the implementation in this
> driver returns an 'int'.
> 
> Fix this by returning 'netdev_tx_t' in this driver too.
> 
> Signed-off-by: Luc Van Oostenryck 

Thanks for fixing this.
I think it's preferable to add a 'Fixes' tag:

Fixes: 4c270b55a5af ("net: ethernet: socionext: add AVE ethernet driver")
Acked-by: Kunihiko Hayashi 

Thank you,

> ---
>  drivers/net/ethernet/socionext/sni_ave.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/socionext/sni_ave.c 
> b/drivers/net/ethernet/socionext/sni_ave.c
> index 0b3b7a460..95625bca1 100644
> --- a/drivers/net/ethernet/socionext/sni_ave.c
> +++ b/drivers/net/ethernet/socionext/sni_ave.c
> @@ -1360,7 +1360,7 @@ static int ave_stop(struct net_device *ndev)
>   return 0;
>  }
>  
> -static int ave_start_xmit(struct sk_buff *skb, struct net_device *ndev)
> +static netdev_tx_t ave_start_xmit(struct sk_buff *skb, struct net_device 
> *ndev)
>  {
>   struct ave_private *priv = netdev_priv(ndev);
>   u32 proc_idx, done_idx, ndesc, cmdsts;
> -- 
> 2.17.0

---
Best Regards,
Kunihiko Hayashi
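
For context, netdev_tx_t limits ndo_start_xmit() implementations to the
NETDEV_TX_* return codes; a minimal sketch (not code from sni_ave.c, and
the ring-space helper is made up) looks like:

static netdev_tx_t example_start_xmit(struct sk_buff *skb,
				      struct net_device *ndev)
{
	/* example_tx_ring_full() is a hypothetical helper. */
	if (example_tx_ring_full(ndev)) {
		netif_stop_queue(ndev);
		return NETDEV_TX_BUSY;	/* core keeps the skb and retries */
	}

	/* ... hand the frame to the hardware ... */
	return NETDEV_TX_OK;
}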




Re: [PATCH net-next v2 0/7] net: Extend availability of PHY statistics

2018-04-24 Thread David Miller
From: Florian Fainelli 
Date: Tue, 24 Apr 2018 17:35:09 -0700

> David, sorry looks like there will be a v3, the last patch introduces a
> possible problem with bcm_sf2 which uses b53_common as a library, will
> respin later.

Ok, no problem.


[PATCH] [PATCH bpf-next] samples/bpf/bpf_load.c: remove redundant ret assignment in bpf_load_program()

2018-04-24 Thread Wang Sheng-Hui
2 redundant ret assignments removed:
* 'ret = 1' before the logic 'if (data_maps)': any error jumps to the
  label 'done', so no 'ret = 1' is needed before the error jump.
* After the '/* load programs */' part: if everything goes well, the BPF
  code will be loaded and 'ret' set to 0 by load_and_attach().
  If something goes wrong, 'ret' is set to non-0, and the redundant
  'ret = 0' after the for clause would cause the error to be skipped.
  For example, if some BPF code does not provide a supported program type
  in its ELF SEC("unknown"), the for clause will not call load_and_attach()
  to load the BPF code. 1 should be returned to callers instead of 0.
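
The control flow being described boils down to a pattern like this (a
simplified sketch with placeholder helpers, not the actual
do_load_bpf_file() body):

#include <stdbool.h>

/* Placeholder helpers standing in for the real bpf_load.c logic. */
extern int parse_maps(void);
extern bool section_supported(int i);
extern int load_and_attach(int i);

static int example_load(int nr_sections)
{
	int ret = 1;	/* assume failure until a program is loaded */
	int i;

	if (parse_maps() < 0)
		goto done;	/* 'ret' already reflects the error */

	for (i = 0; i < nr_sections; i++) {
		if (!section_supported(i))
			continue;	/* nothing loaded, 'ret' must stay 1 */
		ret = load_and_attach(i);	/* 0 on success */
	}
	/* A trailing "ret = 0;" here would mask the failures above. */
done:
	return ret;
}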

Signed-off-by: Wang Sheng-Hui 
---
 samples/bpf/bpf_load.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index bebe4188b4b3..feca497d6afd 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -549,7 +549,6 @@ static int do_load_bpf_file(const char *path, fixup_map_cb 
fixup_map)
if (nr_maps < 0) {
printf("Error: Failed loading ELF maps (errno:%d):%s\n",
   nr_maps, strerror(-nr_maps));
-   ret = 1;
goto done;
}
if (load_maps(map_data, nr_maps, fixup_map))
@@ -615,7 +614,6 @@ static int do_load_bpf_file(const char *path, fixup_map_cb 
fixup_map)
}
}
 
-   ret = 0;
 done:
close(fd);
return ret;
-- 
2.17.0





Re: [PATCH 2/5] ide: kill ide_toggle_bounce

2018-04-24 Thread Jens Axboe
On 4/24/18 12:16 PM, Christoph Hellwig wrote:
> ide_toggle_bounce did select various strange block bounce limits, including
> not bouncing at all as soon as an iommu is present in the system.  Given
> that the dma_map routines now handle any required bounce buffering except
> for ISA DMA, and the ide code already must handle either ISA DMA or highmem
> at least for iommu equipped systems we can get rid of the block layer
> bounce limit setting entirely.

Pretty sure I was the one to add this code, when highmem page IO was
enabled about two decades ago...

Outside of DMA, the issue was that the PIO code could not handle
highmem. That's not the case anymore, so this should be fine.

Reviewed-by: Jens Axboe 

-- 
Jens Axboe



general protection fault in smc_set_keepalive

2018-04-24 Thread syzbot

Hello,

syzbot hit the following crash on net-next commit
9c20b9372fbaf6f7d4c05f5f925806a7928f0c73 (Tue Apr 24 03:08:41 2018 +0000)
net: fib_rules: fix l3mdev netlink attr processing
syzbot dashboard link:  
https://syzkaller.appspot.com/bug?extid=cf9012c597c8379d535c


So far this crash happened 2 times on net-next.
C reproducer: https://syzkaller.appspot.com/x/repro.c?id=4775309383565312
syzkaller reproducer:  
https://syzkaller.appspot.com/x/repro.syz?id=4978230683500544
Raw console output:  
https://syzkaller.appspot.com/x/log.txt?id=4770663504019456
Kernel config:  
https://syzkaller.appspot.com/x/.config?id=-2918904850634584293

compiler: gcc (GCC) 8.0.1 20180413 (experimental)

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+cf9012c597c8379d5...@syzkaller.appspotmail.com
It will help syzbot understand when the bug is fixed. See footer for  
details.

If you forward the report, please keep this part and the footer.

kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 0 PID: 4455 Comm: syz-executor060 Not tainted 4.17.0-rc1+ #17
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011

RIP: 0010:smc_set_keepalive+0x4e/0xd0 net/smc/af_smc.c:59
RSP: 0018:8801ced8fa68 EFLAGS: 00010202
RAX: dc00 RBX:  RCX: 85d72bcb
RDX: 0004 RSI: 873f0a94 RDI: 0020
RBP: 8801ced8fa80 R08: 8801b67e44c0 R09: 0006
R10: 8801b67e44c0 R11:  R12: 8801b6bff7c0
R13: 0001 R14: 0003 R15: 8801aee2b540
FS:  009e4880() GS:8801dae0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2: 2040 CR3: 0001b6e75000 CR4: 001406f0
DR0:  DR1:  DR2: 
DR3:  DR6: fffe0ff0 DR7: 0400
Call Trace:
 sock_setsockopt+0x14e2/0x1fe0 net/core/sock.c:801
 __sys_setsockopt+0x2df/0x390 net/socket.c:1899
 __do_sys_setsockopt net/socket.c:1914 [inline]
 __se_sys_setsockopt net/socket.c:1911 [inline]
 __x64_sys_setsockopt+0xbe/0x150 net/socket.c:1911
 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x43fcf9
RSP: 002b:7ffe62977a78 EFLAGS: 0217 ORIG_RAX: 0036
RAX: ffda RBX: 004002c8 RCX: 0043fcf9
RDX: 0009 RSI: 0001 RDI: 0003
RBP: 006ca018 R08: 0004 R09: 004002c8
R10: 2040 R11: 0217 R12: 00401620
R13: 004016b0 R14:  R15: 
Code: ff df 48 89 fa 48 c1 ea 03 80 3c 02 00 75 78 48 8b 9b 50 04 00 00 48  
b8 00 00 00 00 00 fc ff df 48 8d 7b 20 48 89 fa 48 c1 ea 03 <80> 3c 02 00  
75 6b 48 b8 00 00 00 00 00 fc ff df 48 8b 5b 20 48

RIP: smc_set_keepalive+0x4e/0xd0 net/smc/af_smc.c:59 RSP: 8801ced8fa68
---[ end trace a76f9ed0fb111068 ]---


---
This bug is generated by a dumb bot. It may contain errors.
See https://goo.gl/tpsmEJ for details.
Direct all questions to syzkal...@googlegroups.com.

syzbot will keep track of this bug report.
If you forgot to add the Reported-by tag, once the fix for this bug is  
merged

into any tree, please reply to this email with:
#syz fix: exact-commit-title
If you want to test a patch for this bug, please reply with:
#syz test: git://repo/address.git branch
and provide the patch inline or as an attachment.
To mark this as a duplicate of another syzbot report, please reply with:
#syz dup: exact-subject-of-another-report
If it's a one-off invalid bug report, please reply with:
#syz invalid
Note: if the crash happens again, it will cause creation of a new bug  
report.

Note: all commands must start from beginning of the line in the email body.


[PATCH net-next] net: rules: Move l3mdev attribute validation to a helper

2018-04-24 Thread David Ahern
Move the check on the FRA_L3MDEV attribute to a helper to improve the
readability of fib_nl2rule. Update the extack messages to make it
clear when the configuration option is disabled versus when an invalid
value has been passed.

Signed-off-by: David Ahern 
---
 net/core/fib_rules.c | 34 --
 1 file changed, 24 insertions(+), 10 deletions(-)

diff --git a/net/core/fib_rules.c b/net/core/fib_rules.c
index 2271c80fd967..126ffc5bc630 100644
--- a/net/core/fib_rules.c
+++ b/net/core/fib_rules.c
@@ -454,6 +454,27 @@ static struct fib_rule *rule_find(struct fib_rules_ops 
*ops,
return NULL;
 }
 
+#ifdef CONFIG_NET_L3_MASTER_DEV
+static int fib_nl2rule_l3mdev(struct nlattr *nla, struct fib_rule *nlrule,
+ struct netlink_ext_ack *extack)
+{
+   nlrule->l3mdev = nla_get_u8(nla);
+   if (nlrule->l3mdev != 1) {
+   NL_SET_ERR_MSG(extack, "Invalid l3mdev attribute");
+   return -1;
+   }
+
+   return 0;
+}
+#else
+static int fib_nl2rule_l3mdev(struct nlattr *nla, struct fib_rule *nlrule,
+ struct netlink_ext_ack *extack)
+{
+   NL_SET_ERR_MSG(extack, "l3mdev support is not enabled in kernel");
+   return -1;
+}
+#endif
+
 static int fib_nl2rule(struct sk_buff *skb, struct nlmsghdr *nlh,
   struct netlink_ext_ack *extack,
   struct fib_rules_ops *ops,
@@ -536,16 +557,9 @@ static int fib_nl2rule(struct sk_buff *skb, struct 
nlmsghdr *nlh,
nlrule->tun_id = nla_get_be64(tb[FRA_TUN_ID]);
 
err = -EINVAL;
-   if (tb[FRA_L3MDEV]) {
-#ifdef CONFIG_NET_L3_MASTER_DEV
-   nlrule->l3mdev = nla_get_u8(tb[FRA_L3MDEV]);
-   if (nlrule->l3mdev != 1)
-#endif
-   {
-   NL_SET_ERR_MSG(extack, "Invalid l3mdev");
-   goto errout_free;
-   }
-   }
+   if (tb[FRA_L3MDEV] &&
+   fib_nl2rule_l3mdev(tb[FRA_L3MDEV], nlrule, extack) < 0)
+   goto errout_free;
 
nlrule->action = frh->action;
nlrule->flags = frh->flags;
-- 
2.11.0



Re: [PATCH] net: phy: allow scanning busses with missing phys

2018-04-24 Thread Florian Fainelli
On 04/24/2018 09:09 AM, Alexandre Belloni wrote:
> Some MDIO busses will error out when trying to read a phy address with no
> phy present at that address. In that case, probing the bus will fail
> because __mdiobus_register() is scanning the bus for all possible phys
> addresses.
> 
> In case MII_PHYSID1 returns -EIO or -ENODEV, consider there is no phy at
> this address and set the phy ID to 0xffffffff which is then properly
> handled in get_phy_device().
> 
> Suggested-by: Andrew Lunn 
> Signed-off-by: Alexandre Belloni 

Reviewed-by: Florian Fainelli 
-- 
Florian


Re: [PATCH] net: phy: allow scanning busses with missing phys

2018-04-24 Thread Florian Fainelli
On 04/24/2018 11:01 AM, Andrew Lunn wrote:
> On Tue, Apr 24, 2018 at 09:37:09AM -0700, Florian Fainelli wrote:
>>
>>
>> On 04/24/2018 09:09 AM, Alexandre Belloni wrote:
>>> Some MDIO busses will error out when trying to read a phy address with no
>>> phy present at that address. In that case, probing the bus will fail
>>> because __mdiobus_register() is scanning the bus for all possible phys
>>> addresses.
>>>
>>> In case MII_PHYSID1 returns -EIO or -ENODEV, consider there is no phy at
>>> this address and set the phy ID to 0xffffffff which is then properly
>>> handled in get_phy_device().
>>
>> Humm, why not have your MDIO bus implementation do the scanning itself
>> in a reset() callback, which happens before probing the bus, and based
>> on the results, set phy_mask accordingly such that only PHYs present are
>> populated?
> 
> Hi Florian
> 
> Seems a bit odd to have the driver do this, when the core could.

I could see arguments for doing it in either location. If this is done
in the driver, the driver is responsible for knowing what addresses will
respond, which does not seem too big of a stretch. If done by the core, we
relieve the driver from taking care of that. Either way is fine, really.

> 
>> My only concern with your change is that we are having a special
>> treatment for EIO and ENODEV, so we must make sure MDIO bus drivers are
>> all conforming to that.
>  
> I don't see how it could be a problem. It used to be that any error was a
> real error, and would stop the bus from loading. It now means there is
> nothing there. The only possible side effect is that an mdio bus driver
> might remain loaded without any devices if all reads return
> ENODEV/EIO, whereas before it would probably never load.

Right, though that does not seem to be a big problem.

> 
>   Andrew
> 


-- 
Florian
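
For reference, the reset()-callback approach discussed above might look
roughly like this; "foo" is a made-up bus driver name and the exact error
handling is an assumption, so treat it as a sketch rather than a
recommendation:

/* Sketch only: pre-scan in ->reset(), which __mdiobus_register() calls
 * before probing, and mask out addresses where nothing answers.
 */
static int foo_mdio_reset(struct mii_bus *bus)
{
	int addr, ret;

	for (addr = 0; addr < PHY_MAX_ADDR; addr++) {
		ret = bus->read(bus, addr, MII_PHYSID1);
		if (ret < 0)
			bus->phy_mask |= BIT(addr);
	}

	return 0;
}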


Re: [RFC PATCH ghak32 V2 01/13] audit: add container id

2018-04-24 Thread Richard Guy Briggs
On 2018-04-24 15:01, Paul Moore wrote:
> On Mon, Apr 23, 2018 at 10:02 PM, Richard Guy Briggs  wrote:
> > On 2018-04-23 19:15, Paul Moore wrote:
> >> On Sat, Apr 21, 2018 at 10:34 AM, Richard Guy Briggs  
> >> wrote:
> >> > On 2018-04-18 19:47, Paul Moore wrote:
> >> >> On Fri, Mar 16, 2018 at 5:00 AM, Richard Guy Briggs  
> >> >> wrote:
> >> >> > Implement the proc fs write to set the audit container ID of a 
> >> >> > process,
> >> >> > emitting an AUDIT_CONTAINER record to document the event.
> >> >> >
> >> >> > This is a write from the container orchestrator task to a proc entry 
> >> >> > of
> >> >> > the form /proc/PID/containerid where PID is the process ID of the 
> >> >> > newly
> >> >> > created task that is to become the first task in a container, or an
> >> >> > additional task added to a container.
> >> >> >
> >> >> > The write expects up to a u64 value (unset: 18446744073709551615).
> >> >> >
> >> >> > This will produce a record such as this:
> >> >> > type=CONTAINER msg=audit(1519903238.968:261): op=set pid=596 uid=0 
> >> >> > subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 auid=0 
> >> >> > tty=pts0 ses=1 opid=596 old-contid=18446744073709551615 contid=123455 
> >> >> > res=0
> >> >> >
> >> >> > The "op" field indicates an initial set.  The "pid" to "ses" fields 
> >> >> > are
> >> >> > the orchestrator while the "opid" field is the object's PID, the 
> >> >> > process
> >> >> > being "contained".  Old and new container ID values are given in the
> >> >> > "contid" fields, while res indicates its success.
> >> >> >
> >> >> > It is not permitted to self-set, unset or re-set the container ID.  A
> >> >> > child inherits its parent's container ID, but then can be set only 
> >> >> > once
> >> >> > after.
> >> >> >
> >> >> > See: https://github.com/linux-audit/audit-kernel/issues/32
> >> >> >
> >> >> > Signed-off-by: Richard Guy Briggs 
> >> >> > ---
> >> >> >  fs/proc/base.c | 37 
> >> >> >  include/linux/audit.h  | 16 +
> >> >> >  include/linux/init_task.h  |  4 ++-
> >> >> >  include/linux/sched.h  |  1 +
> >> >> >  include/uapi/linux/audit.h |  2 ++
> >> >> >  kernel/auditsc.c   | 84 
> >> >> > ++
> >> >> >  6 files changed, 143 insertions(+), 1 deletion(-)
> 
> ...
> 
> >> >> >  /* audit_rule_data supports filter rules with both integer and string
> >> >> >   * fields.  It corresponds with AUDIT_ADD_RULE, AUDIT_DEL_RULE and
> >> >> > diff --git a/kernel/auditsc.c b/kernel/auditsc.c
> >> >> > index 4e0a4ac..29c8482 100644
> >> >> > --- a/kernel/auditsc.c
> >> >> > +++ b/kernel/auditsc.c
> >> >> > @@ -2073,6 +2073,90 @@ int audit_set_loginuid(kuid_t loginuid)
> >> >> > return rc;
> >> >> >  }
> >> >> >
> >> >> > +static int audit_set_containerid_perm(struct task_struct *task, u64 
> >> >> > containerid)
> >> >> > +{
> >> >> > +   struct task_struct *parent;
> >> >> > +   u64 pcontainerid, ccontainerid;
> >> >> > +
> >> >> > +   /* Don't allow to set our own containerid */
> >> >> > +   if (current == task)
> >> >> > +   return -EPERM;
> >> >>
> >> >> Why not?  Is there some obvious security concern that I'm missing?
> >> >
> >> > We then lose the distinction in the AUDIT_CONTAINER record between the
> >> > initiating PID and the target PID.  This was outlined in the proposal.
> >>
> >> I just went back and reread the v3 proposal and I still don't see a
> >> good explanation of this.  Why is this bad?  What's the security
> >> concern?
> >
> > I don't remember, specifically.  Maybe this has been addressed by the
> > check for children/threads or identical parent container ID.  So, I'm
> > reluctantly willing to remove that check for now.
> 
> Okay.  For the record, if someone can explain to me why this
> restriction saves us from some terrible situation I'm all for leaving
> it.  I'm just opposed to restrictions without solid reasoning behind
> them.
> 
> >> > Having said that, I'm still not sure we have protected sufficiently from
> >> > a child turning around and setting it's parent's as yet unset or
> >> > inherited audit container ID.
> >>
> >> Yes, I believe we only want to let a task set the audit container for
> >> its children (or itself/threads if we decide to allow that, see
> >> above).  There *has* to be a function to check to see if a task is a
> >> child of a given task ... right? ... although this is likely to be a
> >> pointer traversal and locking nightmare ... hmmm.
> >
> > Isn't that just (struct task_struct)parent == (struct
> > task_struct)child->parent (or ->real_parent)?
> >
> > And now that I say that, it is covered by the following patch's child
> > check, so as long as we keep that, we should be fine.
> 
> I was thinking of checking not just current's immediate children, but
> any of its descendants, as I believe that is what we want to limit,
> yes?  I 
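
As a side note, the ancestry check being debated could be sketched roughly
as below; the name is made up and the locking subtleties mentioned above
are glossed over (the caller is assumed to hold rcu_read_lock()):

static bool task_is_descendant(struct task_struct *ancestor,
			       struct task_struct *task)
{
	struct task_struct *p = task;

	/* Walk real_parent links upward until we hit the idle task. */
	while (p != &init_task) {
		if (p == ancestor)
			return true;
		p = rcu_dereference(p->real_parent);
	}

	return p == ancestor;
}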

Re: [PATCH net-next v2 0/7] net: Extend availability of PHY statistics

2018-04-24 Thread Florian Fainelli
On 04/24/2018 05:27 PM, Florian Fainelli wrote:
> Hi all,
> 
> This patch series adds support for retrieving PHY statistics with DSA switches
> when the CPU port uses a PHY to PHY connection (as opposed to MAC to MAC).
> To get there a number of things are done:
> 
> - first we move the code dealing with PHY statistics outside of 
> net/core/ethtool.c
>   and create helper functions since the same code will be reused
> - then we allow network device drivers to provide an ethtool_get_phy_stats 
> callback
>   when the standard PHY library helpers are not suitable
> - we update the DSA functions dealing with ethtool operations to get passed a
>   stringset instead of assuming ETH_SS_STATS like they currently do
> - then we provide a set of standard helpers within DSA as a framework and add
>   the plumbing to allow retrieving the PHY statistics of the CPU port(s)
> - finally plug support for retrieving such PHY statistics with the b53 driver
> 
> Changes in v2:
> 
> - got actual testing when the DSA master network device has a PHY that
>   already provides statistics (thanks Nikita!)
> 
> - fixed the kbuild error reported when CONFIG_PHYLIB=n
> 
> - removed the checking of ops which is redundant and not needed

David, sorry looks like there will be a v3, the last patch introduces a
possible problem with bcm_sf2 which uses b53_common as a library, will
respin later.
-- 
Florian


[PATCH net-next v2 0/7] net: Extend availability of PHY statistics

2018-04-24 Thread Florian Fainelli

Hi all,

This patch series adds support for retrieving PHY statistics with DSA switches
when the CPU port uses a PHY to PHY connection (as opposed to MAC to MAC).
To get there a number of things are done:

- first we move the code dealing with PHY statistics outside of 
net/core/ethtool.c
  and create helper functions since the same code will be reused
- then we allow network device drivers to provide an ethtool_get_phy_stats 
callback
  when the standard PHY library helpers are not suitable
- we update the DSA functions dealing with ethtool operations to get passed a
  stringset instead of assuming ETH_SS_STATS like they currently do
- then we provide a set of standard helpers within DSA as a framework and add
  the plumbing to allow retrieving the PHY statistics of the CPU port(s)
- finally plug support for retrieving such PHY statistics with the b53 driver

Changes in v2:

- got actual testing when the DSA master network device has a PHY that
  already provides statistics (thanks Nikita!)

- fixed the kbuild error reported when CONFIG_PHYLIB=n

- removed the checking of ops which is redundant and not needed

Florian Fainelli (7):
  net: Move PHY statistics code into PHY library helpers
  net: Allow network devices to have PHY statistics
  net: dsa: Do not check for ethtool_ops validity
  net: dsa: Pass stringset to ethtool operations
  net: dsa: Add helper function to obtain PHY device of a given port
  net: dsa: Allow providing PHY statistics from CPU port
  net: dsa: b53: Add support for reading PHY statistics

 drivers/net/dsa/b53/b53_common.c   | 79 ++---
 drivers/net/dsa/b53/b53_priv.h |  6 ++-
 drivers/net/dsa/bcm_sf2.c  |  1 +
 drivers/net/dsa/dsa_loop.c | 11 -
 drivers/net/dsa/lan9303-core.c | 11 -
 drivers/net/dsa/microchip/ksz_common.c | 11 -
 drivers/net/dsa/mt7530.c   | 11 -
 drivers/net/dsa/mv88e6xxx/chip.c   | 10 +++-
 drivers/net/dsa/qca8k.c| 10 +++-
 drivers/net/phy/phy.c  | 48 ++
 include/linux/ethtool.h|  5 ++
 include/linux/phy.h| 20 
 include/net/dsa.h  | 12 -
 net/core/ethtool.c | 61 ---
 net/dsa/master.c   | 62 +++
 net/dsa/port.c | 90 +-
 net/dsa/slave.c|  5 +-
 17 files changed, 366 insertions(+), 87 deletions(-)

-- 
2.14.1
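
As a usage note for the series as a whole: once a CPU port PHY is wired up
this way, its counters appear through the existing PHY statistics interface,
e.g. (assuming the DSA master netdev is eth0; the counter names depend on
the PHY driver):

  # ethtool --phy-statistics eth0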



[PATCH net-next v2 4/7] net: dsa: Pass stringset to ethtool operations

2018-04-24 Thread Florian Fainelli
Up until now we have largely assumed that we were interested in ETH_SS_STATS
type strings for all ethtool operations. This is about to change with
the introduction of additional string sets, e.g. ETH_SS_PHY_STATS.
Update all functions to take an appropriate stringset argument and act
on it when it is different from ETH_SS_STATS, for now.

Signed-off-by: Florian Fainelli 
---
 drivers/net/dsa/b53/b53_common.c   | 11 +--
 drivers/net/dsa/b53/b53_priv.h |  5 +++--
 drivers/net/dsa/dsa_loop.c | 11 +--
 drivers/net/dsa/lan9303-core.c | 11 +--
 drivers/net/dsa/microchip/ksz_common.c | 11 +--
 drivers/net/dsa/mt7530.c   | 11 +--
 drivers/net/dsa/mv88e6xxx/chip.c   | 10 --
 drivers/net/dsa/qca8k.c| 10 --
 include/net/dsa.h  |  5 +++--
 net/dsa/master.c   | 21 +
 net/dsa/slave.c|  5 +++--
 11 files changed, 83 insertions(+), 28 deletions(-)

diff --git a/drivers/net/dsa/b53/b53_common.c b/drivers/net/dsa/b53/b53_common.c
index 78616787f2a3..726b2d8c6fe9 100644
--- a/drivers/net/dsa/b53/b53_common.c
+++ b/drivers/net/dsa/b53/b53_common.c
@@ -806,13 +806,17 @@ static unsigned int b53_get_mib_size(struct b53_device 
*dev)
return B53_MIBS_SIZE;
 }
 
-void b53_get_strings(struct dsa_switch *ds, int port, uint8_t *data)
+void b53_get_strings(struct dsa_switch *ds, int port, u32 stringset,
+uint8_t *data)
 {
struct b53_device *dev = ds->priv;
const struct b53_mib_desc *mibs = b53_get_mib(dev);
unsigned int mib_size = b53_get_mib_size(dev);
unsigned int i;
 
+   if (stringset != ETH_SS_STATS)
+   return;
+
for (i = 0; i < mib_size; i++)
strlcpy(data + i * ETH_GSTRING_LEN,
mibs[i].name, ETH_GSTRING_LEN);
@@ -852,10 +856,13 @@ void b53_get_ethtool_stats(struct dsa_switch *ds, int 
port, uint64_t *data)
 }
 EXPORT_SYMBOL(b53_get_ethtool_stats);
 
-int b53_get_sset_count(struct dsa_switch *ds, int port)
+int b53_get_sset_count(struct dsa_switch *ds, int port, int sset)
 {
struct b53_device *dev = ds->priv;
 
+   if (sset != ETH_SS_STATS)
+   return 0;
+
return b53_get_mib_size(dev);
 }
 EXPORT_SYMBOL(b53_get_sset_count);
diff --git a/drivers/net/dsa/b53/b53_priv.h b/drivers/net/dsa/b53/b53_priv.h
index 1187ebd79287..b933d5cb5c2d 100644
--- a/drivers/net/dsa/b53/b53_priv.h
+++ b/drivers/net/dsa/b53/b53_priv.h
@@ -286,9 +286,10 @@ static inline int b53_switch_get_reset_gpio(struct 
b53_device *dev)
 /* Exported functions towards other drivers */
 void b53_imp_vlan_setup(struct dsa_switch *ds, int cpu_port);
 int b53_configure_vlan(struct dsa_switch *ds);
-void b53_get_strings(struct dsa_switch *ds, int port, uint8_t *data);
+void b53_get_strings(struct dsa_switch *ds, int port, u32 stringset,
+uint8_t *data);
 void b53_get_ethtool_stats(struct dsa_switch *ds, int port, uint64_t *data);
-int b53_get_sset_count(struct dsa_switch *ds, int port);
+int b53_get_sset_count(struct dsa_switch *ds, int port, int sset);
 int b53_br_join(struct dsa_switch *ds, int port, struct net_device *bridge);
 void b53_br_leave(struct dsa_switch *ds, int port, struct net_device *bridge);
 void b53_br_set_stp_state(struct dsa_switch *ds, int port, u8 state);
diff --git a/drivers/net/dsa/dsa_loop.c b/drivers/net/dsa/dsa_loop.c
index f77be9f85cb3..9354cc08d3fd 100644
--- a/drivers/net/dsa/dsa_loop.c
+++ b/drivers/net/dsa/dsa_loop.c
@@ -86,16 +86,23 @@ static int dsa_loop_setup(struct dsa_switch *ds)
return 0;
 }
 
-static int dsa_loop_get_sset_count(struct dsa_switch *ds, int port)
+static int dsa_loop_get_sset_count(struct dsa_switch *ds, int port, int sset)
 {
+   if (sset != ETH_SS_STATS)
+   return 0;
+
return __DSA_LOOP_CNT_MAX;
 }
 
-static void dsa_loop_get_strings(struct dsa_switch *ds, int port, uint8_t 
*data)
+static void dsa_loop_get_strings(struct dsa_switch *ds, int port,
+u32 stringset, uint8_t *data)
 {
struct dsa_loop_priv *ps = ds->priv;
unsigned int i;
 
+   if (stringset != ETH_SS_STATS)
+   return;
+
for (i = 0; i < __DSA_LOOP_CNT_MAX; i++)
memcpy(data + i * ETH_GSTRING_LEN,
   ps->ports[port].mib[i].name, ETH_GSTRING_LEN);
diff --git a/drivers/net/dsa/lan9303-core.c b/drivers/net/dsa/lan9303-core.c
index fefa454f3e56..b4f6e1a67dd9 100644
--- a/drivers/net/dsa/lan9303-core.c
+++ b/drivers/net/dsa/lan9303-core.c
@@ -977,10 +977,14 @@ static const struct lan9303_mib_desc lan9303_mib[] = {
{ .offset = LAN9303_MAC_TX_LATECOL_0, .name = "TxLateCol", },
 };
 
-static void lan9303_get_strings(struct dsa_switch *ds, int port, uint8_t *data)
+static void lan9303_get_strings(struct dsa_switch *ds, 

[PATCH net-next v2 3/7] net: dsa: Do not check for ethtool_ops validity

2018-04-24 Thread Florian Fainelli
This is completely redundant with what netdev_set_default_ethtool_ops()
does; we are always guaranteed to have a valid dev->ethtool_ops pointer.
However, within that structure, not all function pointers may be populated,
so we still have to check them individually.

Signed-off-by: Florian Fainelli 
---
 net/dsa/master.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/dsa/master.c b/net/dsa/master.c
index 90e6df0351eb..9ec16b39ed15 100644
--- a/net/dsa/master.c
+++ b/net/dsa/master.c
@@ -22,7 +22,7 @@ static void dsa_master_get_ethtool_stats(struct net_device 
*dev,
int port = cpu_dp->index;
int count = 0;
 
-   if (ops && ops->get_sset_count && ops->get_ethtool_stats) {
+   if (ops->get_sset_count && ops->get_ethtool_stats) {
count = ops->get_sset_count(dev, ETH_SS_STATS);
ops->get_ethtool_stats(dev, stats, data);
}
@@ -38,7 +38,7 @@ static int dsa_master_get_sset_count(struct net_device *dev, 
int sset)
struct dsa_switch *ds = cpu_dp->ds;
int count = 0;
 
-   if (ops && ops->get_sset_count)
+   if (ops->get_sset_count)
count += ops->get_sset_count(dev, sset);
 
if (sset == ETH_SS_STATS && ds->ops->get_sset_count)
@@ -64,7 +64,7 @@ static void dsa_master_get_strings(struct net_device *dev, 
uint32_t stringset,
/* We do not want to be NULL-terminated, since this is a prefix */
pfx[sizeof(pfx) - 1] = '_';
 
-   if (ops && ops->get_sset_count && ops->get_strings) {
+   if (ops->get_sset_count && ops->get_strings) {
mcount = ops->get_sset_count(dev, ETH_SS_STATS);
ops->get_strings(dev, stringset, data);
}
-- 
2.14.1



[PATCH net-next v2 1/7] net: Move PHY statistics code into PHY library helpers

2018-04-24 Thread Florian Fainelli
In order to make it possible for network device drivers that do not
necessarily have a phy_device attached to still report PHY statistics,
do a preliminary refactoring consisting in creating helper functions
that encapsulate the PHY device driver knowledge within PHYLIB.

Signed-off-by: Florian Fainelli 
---
 drivers/net/phy/phy.c | 48 
 include/linux/phy.h   | 20 
 net/core/ethtool.c| 38 --
 3 files changed, 76 insertions(+), 30 deletions(-)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index 05c1e8ef15e6..a98ed12c0009 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -1277,3 +1277,51 @@ int phy_ethtool_nway_reset(struct net_device *ndev)
return phy_restart_aneg(phydev);
 }
 EXPORT_SYMBOL(phy_ethtool_nway_reset);
+
+int phy_ethtool_get_strings(struct phy_device *phydev, u8 *data)
+{
+   if (!phydev->drv)
+   return -EIO;
+
+   mutex_lock(&phydev->lock);
+   phydev->drv->get_strings(phydev, data);
+   mutex_unlock(&phydev->lock);
+
+   return 0;
+}
+EXPORT_SYMBOL(phy_ethtool_get_strings);
+
+int phy_ethtool_get_sset_count(struct phy_device *phydev)
+{
+   int ret;
+
+   if (!phydev->drv)
+   return -EIO;
+
+   if (phydev->drv->get_sset_count &&
+   phydev->drv->get_strings &&
+   phydev->drv->get_stats) {
+   mutex_lock(&phydev->lock);
+   ret = phydev->drv->get_sset_count(phydev);
+   mutex_unlock(&phydev->lock);
+
+   return ret;
+   }
+
+   return -EOPNOTSUPP;
+}
+EXPORT_SYMBOL(phy_ethtool_get_sset_count);
+
+int phy_ethtool_get_stats(struct phy_device *phydev,
+ struct ethtool_stats *stats, u64 *data)
+{
+   if (!phydev->drv)
+   return -EIO;
+
+   mutex_lock(&phydev->lock);
+   phydev->drv->get_stats(phydev, stats, data);
+   mutex_unlock(&phydev->lock);
+
+   return 0;
+}
+EXPORT_SYMBOL(phy_ethtool_get_stats);
diff --git a/include/linux/phy.h b/include/linux/phy.h
index f0b5870a6d40..6ca81395c545 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -1066,6 +1066,26 @@ int phy_ethtool_nway_reset(struct net_device *ndev);
 #if IS_ENABLED(CONFIG_PHYLIB)
 int __init mdio_bus_init(void);
 void mdio_bus_exit(void);
+int phy_ethtool_get_strings(struct phy_device *phydev, u8 *data);
+int phy_ethtool_get_sset_count(struct phy_device *phydev);
+int phy_ethtool_get_stats(struct phy_device *phydev,
+ struct ethtool_stats *stats, u64 *data);
+#else
+int phy_ethtool_get_strings(struct phy_device *phydev, u8 *data)
+{
+   return -EOPNOTSUPP;
+}
+
+int phy_ethtool_get_sset_count(struct phy_device *phydev)
+{
+   return -EOPNOTSUPP;
+}
+
+int phy_ethtool_get_stats(struct phy_device *phydev,
+ struct ethtool_stats *stats, u64 *data)
+{
+   return -EOPNOTSUPP;
+}
 #endif
 
 extern struct bus_type mdio_bus_type;
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 03416e6dd5d7..f0d42e093c4a 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -210,23 +210,6 @@ static int ethtool_set_features(struct net_device *dev, 
void __user *useraddr)
return ret;
 }
 
-static int phy_get_sset_count(struct phy_device *phydev)
-{
-   int ret;
-
-   if (phydev->drv->get_sset_count &&
-   phydev->drv->get_strings &&
-   phydev->drv->get_stats) {
-   mutex_lock(&phydev->lock);
-   ret = phydev->drv->get_sset_count(phydev);
-   mutex_unlock(&phydev->lock);
-
-   return ret;
-   }
-
-   return -EOPNOTSUPP;
-}
-
 static int __ethtool_get_sset_count(struct net_device *dev, int sset)
 {
const struct ethtool_ops *ops = dev->ethtool_ops;
@@ -245,7 +228,7 @@ static int __ethtool_get_sset_count(struct net_device *dev, 
int sset)
 
if (sset == ETH_SS_PHY_STATS) {
if (dev->phydev)
-   return phy_get_sset_count(dev->phydev);
+   return phy_ethtool_get_sset_count(dev->phydev);
else
return -EOPNOTSUPP;
}
@@ -272,15 +255,10 @@ static void __ethtool_get_strings(struct net_device *dev,
else if (stringset == ETH_SS_PHY_TUNABLES)
memcpy(data, phy_tunable_strings, sizeof(phy_tunable_strings));
else if (stringset == ETH_SS_PHY_STATS) {
-   struct phy_device *phydev = dev->phydev;
-
-   if (phydev) {
-   mutex_lock(&phydev->lock);
-   phydev->drv->get_strings(phydev, data);
-   mutex_unlock(&phydev->lock);
-   } else {
+   if (dev->phydev)
+   phy_ethtool_get_strings(dev->phydev, data);
+   else
return;
-   }
} else
/* ops->get_strings is valid because checked earlier */
 

[PATCH net-next v2 2/7] net: Allow network devices to have PHY statistics

2018-04-24 Thread Florian Fainelli
Add a new callback, get_ethtool_phy_stats(), which allows network device
drivers not making use of the PHY library to return PHY statistics.
Update ethtool_get_phy_stats(), __ethtool_get_sset_count() and
__ethtool_get_strings() accordingly to interrogate the network device
about ETH_SS_PHY_STATS.
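
A minimal sketch of a driver-side user of the new callback might look as
follows (the "foo" names and the single counter are purely illustrative;
the in-tree users come later in the series):

static int foo_get_sset_count(struct net_device *dev, int sset)
{
	return sset == ETH_SS_PHY_STATS ? 1 : 0;
}

static void foo_get_strings(struct net_device *dev, u32 sset, u8 *data)
{
	if (sset == ETH_SS_PHY_STATS)
		strlcpy(data, "phy_rx_errors", ETH_GSTRING_LEN);
}

static void foo_get_ethtool_phy_stats(struct net_device *dev,
				      struct ethtool_stats *stats, u64 *data)
{
	/* One u64 per string reported above for ETH_SS_PHY_STATS. */
	data[0] = foo_read_phy_error_counter(dev);	/* hypothetical helper */
}

static const struct ethtool_ops foo_ethtool_ops = {
	.get_sset_count		= foo_get_sset_count,
	.get_strings		= foo_get_strings,
	.get_ethtool_phy_stats	= foo_get_ethtool_phy_stats,
};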

Signed-off-by: Florian Fainelli 
---
 include/linux/ethtool.h |  5 +
 net/core/ethtool.c  | 39 +--
 2 files changed, 26 insertions(+), 18 deletions(-)

diff --git a/include/linux/ethtool.h b/include/linux/ethtool.h
index ebe41811ed34..9b19556f0156 100644
--- a/include/linux/ethtool.h
+++ b/include/linux/ethtool.h
@@ -310,6 +310,9 @@ bool ethtool_convert_link_mode_to_legacy_u32(u32 
*legacy_u32,
  * fields should be ignored (use %__ETHTOOL_LINK_MODE_MASK_NBITS
  * instead of the latter), any change to them will be overwritten
  * by kernel. Returns a negative error code or zero.
+ * @get_ethtool_phy_stats: Return extended statistics about the PHY device.
+ * This is only useful if the device maintains PHY statistics and
+ * cannot use the standard PHY library helpers.
  *
  * All operations are optional (i.e. the function pointer may be set
  * to %NULL) and callers must take this into account.  Callers must
@@ -405,5 +408,7 @@ struct ethtool_ops {
  struct ethtool_fecparam *);
int (*set_fecparam)(struct net_device *,
  struct ethtool_fecparam *);
+   void(*get_ethtool_phy_stats)(struct net_device *,
+struct ethtool_stats *, u64 *);
 };
 #endif /* _LINUX_ETHTOOL_H */
diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index f0d42e093c4a..4b8992ccf904 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -226,12 +226,9 @@ static int __ethtool_get_sset_count(struct net_device 
*dev, int sset)
if (sset == ETH_SS_PHY_TUNABLES)
return ARRAY_SIZE(phy_tunable_strings);
 
-   if (sset == ETH_SS_PHY_STATS) {
-   if (dev->phydev)
-   return phy_ethtool_get_sset_count(dev->phydev);
-   else
-   return -EOPNOTSUPP;
-   }
+   if (sset == ETH_SS_PHY_STATS && dev->phydev &&
+   !ops->get_ethtool_phy_stats)
+   return phy_ethtool_get_sset_count(dev->phydev);
 
if (ops->get_sset_count && ops->get_strings)
return ops->get_sset_count(dev, sset);
@@ -254,12 +251,10 @@ static void __ethtool_get_strings(struct net_device *dev,
memcpy(data, tunable_strings, sizeof(tunable_strings));
else if (stringset == ETH_SS_PHY_TUNABLES)
memcpy(data, phy_tunable_strings, sizeof(phy_tunable_strings));
-   else if (stringset == ETH_SS_PHY_STATS) {
-   if (dev->phydev)
-   phy_ethtool_get_strings(dev->phydev, data);
-   else
-   return;
-   } else
+   else if (stringset == ETH_SS_PHY_STATS && dev->phydev &&
+!ops->get_ethtool_phy_stats)
+   phy_ethtool_get_strings(dev->phydev, data);
+   else
/* ops->get_strings is valid because checked earlier */
ops->get_strings(dev, stringset, data);
 }
@@ -1971,15 +1966,19 @@ static int ethtool_get_stats(struct net_device *dev, 
void __user *useraddr)
 
 static int ethtool_get_phy_stats(struct net_device *dev, void __user *useraddr)
 {
-   struct ethtool_stats stats;
+   const struct ethtool_ops *ops = dev->ethtool_ops;
struct phy_device *phydev = dev->phydev;
+   struct ethtool_stats stats;
u64 *data;
int ret, n_stats;
 
-   if (!phydev)
+   if (!phydev && (!ops->get_ethtool_phy_stats || !ops->get_sset_count))
return -EOPNOTSUPP;
 
-   n_stats = phy_ethtool_get_sset_count(dev->phydev);
+   if (dev->phydev && !ops->get_ethtool_phy_stats)
+   n_stats = phy_ethtool_get_sset_count(dev->phydev);
+   else
+   n_stats = ops->get_sset_count(dev, ETH_SS_PHY_STATS);
if (n_stats < 0)
return n_stats;
if (n_stats > S32_MAX / sizeof(u64))
@@ -1994,9 +1993,13 @@ static int ethtool_get_phy_stats(struct net_device *dev, 
void __user *useraddr)
if (n_stats && !data)
return -ENOMEM;
 
-   ret = phy_ethtool_get_stats(dev->phydev, &stats, data);
-   if (ret < 0)
-   return ret;
+   if (dev->phydev && !ops->get_ethtool_phy_stats) {
+   ret = phy_ethtool_get_stats(dev->phydev, &stats, data);
+   if (ret < 0)
+   return ret;
+   } else {
+   ops->get_ethtool_phy_stats(dev, &stats, data);
+   }
 
ret = -EFAULT;
if (copy_to_user(useraddr, , sizeof(stats)))
-- 
2.14.1



[PATCH net-next v2 5/7] net: dsa: Add helper function to obtain PHY device of a given port

2018-04-24 Thread Florian Fainelli
In preparation for having more call sites attempting to obtain a
reference against a PHY device corresponding to a particular port,
introduce a helper function for that purpose.

Signed-off-by: Florian Fainelli 
---
 net/dsa/port.c | 33 ++---
 1 file changed, 22 insertions(+), 11 deletions(-)

diff --git a/net/dsa/port.c b/net/dsa/port.c
index 7acc1169d75e..5e2a88720a9a 100644
--- a/net/dsa/port.c
+++ b/net/dsa/port.c
@@ -273,25 +273,38 @@ int dsa_port_vlan_del(struct dsa_port *dp,
return 0;
 }
 
-static int dsa_port_setup_phy_of(struct dsa_port *dp, bool enable)
+static struct phy_device *dsa_port_get_phy_device(struct dsa_port *dp)
 {
-   struct device_node *port_dn = dp->dn;
struct device_node *phy_dn;
-   struct dsa_switch *ds = dp->ds;
struct phy_device *phydev;
-   int port = dp->index;
-   int err = 0;
 
-   phy_dn = of_parse_phandle(port_dn, "phy-handle", 0);
+   phy_dn = of_parse_phandle(dp->dn, "phy-handle", 0);
if (!phy_dn)
-   return 0;
+   return NULL;
 
phydev = of_phy_find_device(phy_dn);
if (!phydev) {
-   err = -EPROBE_DEFER;
-   goto err_put_of;
+   of_node_put(phy_dn);
+   return ERR_PTR(-EPROBE_DEFER);
}
 
+   return phydev;
+}
+
+static int dsa_port_setup_phy_of(struct dsa_port *dp, bool enable)
+{
+   struct dsa_switch *ds = dp->ds;
+   struct phy_device *phydev;
+   int port = dp->index;
+   int err = 0;
+
+   phydev = dsa_port_get_phy_device(dp);
+   if (!phydev)
+   return 0;
+
+   if (IS_ERR(phydev))
+   return PTR_ERR(phydev);
+
if (enable) {
err = genphy_config_init(phydev);
if (err < 0)
@@ -317,8 +330,6 @@ static int dsa_port_setup_phy_of(struct dsa_port *dp, bool 
enable)
 
 err_put_dev:
	put_device(&phydev->mdio.dev);
-err_put_of:
-   of_node_put(phy_dn);
return err;
 }
 
-- 
2.14.1



[PATCH net-next v2 7/7] net: dsa: b53: Add support for reading PHY statistics

2018-04-24 Thread Florian Fainelli
Allow the b53 driver to return PHY statistics when the CPU port used is
other than 5, 7 or 8, because those are typically PHY-less on most
devices. This is useful for debugging link problems between the switch
and an external host when using a non-standard CPU port number (e.g. 4).

Signed-off-by: Florian Fainelli 
---
 drivers/net/dsa/b53/b53_common.c | 76 ++--
 drivers/net/dsa/b53/b53_priv.h   |  1 +
 drivers/net/dsa/bcm_sf2.c|  1 +
 3 files changed, 68 insertions(+), 10 deletions(-)

diff --git a/drivers/net/dsa/b53/b53_common.c b/drivers/net/dsa/b53/b53_common.c
index 726b2d8c6fe9..7faea467aca8 100644
--- a/drivers/net/dsa/b53/b53_common.c
+++ b/drivers/net/dsa/b53/b53_common.c
@@ -806,20 +806,39 @@ static unsigned int b53_get_mib_size(struct b53_device 
*dev)
return B53_MIBS_SIZE;
 }
 
+static struct phy_device *b53_get_phy_device(struct dsa_switch *ds, int port)
+{
+   /* These ports typically do not have built-in PHYs */
+   switch (port) {
+   case B53_CPU_PORT_25:
+   case 7:
+   case B53_CPU_PORT:
+   return NULL;
+   }
+
+   return mdiobus_get_phy(ds->slave_mii_bus, port);
+}
+
 void b53_get_strings(struct dsa_switch *ds, int port, u32 stringset,
 uint8_t *data)
 {
struct b53_device *dev = ds->priv;
const struct b53_mib_desc *mibs = b53_get_mib(dev);
unsigned int mib_size = b53_get_mib_size(dev);
+   struct phy_device *phydev;
unsigned int i;
 
-   if (stringset != ETH_SS_STATS)
-   return;
+   if (stringset == ETH_SS_STATS) {
+   for (i = 0; i < mib_size; i++)
+   strlcpy(data + i * ETH_GSTRING_LEN,
+   mibs[i].name, ETH_GSTRING_LEN);
+   } else if (stringset == ETH_SS_PHY_STATS) {
+   phydev = b53_get_phy_device(ds, port);
+   if (!phydev)
+   return;
 
-   for (i = 0; i < mib_size; i++)
-   strlcpy(data + i * ETH_GSTRING_LEN,
-   mibs[i].name, ETH_GSTRING_LEN);
+   phy_ethtool_get_strings(phydev, data);
+   }
 }
 EXPORT_SYMBOL(b53_get_strings);
 
@@ -856,14 +875,34 @@ void b53_get_ethtool_stats(struct dsa_switch *ds, int 
port, uint64_t *data)
 }
 EXPORT_SYMBOL(b53_get_ethtool_stats);
 
+void b53_get_ethtool_phy_stats(struct dsa_switch *ds, int port, uint64_t *data)
+{
+   struct phy_device *phydev;
+
+   phydev = b53_get_phy_device(ds, port);
+   if (!phydev)
+   return;
+
+   phy_ethtool_get_stats(phydev, NULL, data);
+}
+EXPORT_SYMBOL(b53_get_ethtool_phy_stats);
+
 int b53_get_sset_count(struct dsa_switch *ds, int port, int sset)
 {
struct b53_device *dev = ds->priv;
+   struct phy_device *phydev;
 
-   if (sset != ETH_SS_STATS)
-   return 0;
+   if (sset == ETH_SS_STATS) {
+   return b53_get_mib_size(dev);
+   } else if (sset == ETH_SS_PHY_STATS) {
+   phydev = b53_get_phy_device(ds, port);
+   if (!phydev)
+   return 0;
 
-   return b53_get_mib_size(dev);
+   return phy_ethtool_get_sset_count(phydev);
+   }
+
+   return 0;
 }
 EXPORT_SYMBOL(b53_get_sset_count);
 
@@ -1484,7 +1523,7 @@ void b53_br_fast_age(struct dsa_switch *ds, int port)
 }
 EXPORT_SYMBOL(b53_br_fast_age);
 
-static bool b53_can_enable_brcm_tags(struct dsa_switch *ds, int port)
+static bool b53_possible_cpu_port(struct dsa_switch *ds, int port)
 {
/* Broadcom switches will accept enabling Broadcom tags on the
 * following ports: 5, 7 and 8, any other port is not supported
@@ -1496,10 +1535,19 @@ static bool b53_can_enable_brcm_tags(struct dsa_switch 
*ds, int port)
return true;
}
 
-   dev_warn(ds->dev, "Port %d is not Broadcom tag capable\n", port);
return false;
 }
 
+static bool b53_can_enable_brcm_tags(struct dsa_switch *ds, int port)
+{
+   bool ret = b53_possible_cpu_port(ds, port);
+
+   if (!ret)
+   dev_warn(ds->dev, "Port %d is not Broadcom tag capable\n",
+port);
+   return ret;
+}
+
 enum dsa_tag_protocol b53_get_tag_protocol(struct dsa_switch *ds, int port)
 {
struct b53_device *dev = ds->priv;
@@ -1657,6 +1705,7 @@ static const struct dsa_switch_ops b53_switch_ops = {
.get_strings= b53_get_strings,
.get_ethtool_stats  = b53_get_ethtool_stats,
.get_sset_count = b53_get_sset_count,
+   .get_ethtool_phy_stats  = b53_get_ethtool_phy_stats,
.phy_read   = b53_phy_read16,
.phy_write  = b53_phy_write16,
.adjust_link= b53_adjust_link,
@@ -1961,6 +2010,13 @@ static int b53_switch_init(struct b53_device *dev)
dev->num_ports = dev->cpu_port + 1;
dev->enabled_ports |= BIT(dev->cpu_port);
 
+ 

[PATCH net-next v2 6/7] net: dsa: Allow providing PHY statistics from CPU port

2018-04-24 Thread Florian Fainelli
Implement the same type of ethtool diversion that we have for
ETH_SS_STATS and make it work with ETH_SS_PHY_STATS. This allows
providing PHY-level statistics for CPU ports that are directly
connected to a PHY device.

Signed-off-by: Florian Fainelli 
---
 include/net/dsa.h |  7 +++
 net/dsa/master.c  | 47 -
 net/dsa/port.c| 57 +++
 3 files changed, 106 insertions(+), 5 deletions(-)

diff --git a/include/net/dsa.h b/include/net/dsa.h
index 0bc0aad1b02e..462e9741b210 100644
--- a/include/net/dsa.h
+++ b/include/net/dsa.h
@@ -361,6 +361,8 @@ struct dsa_switch_ops {
void(*get_ethtool_stats)(struct dsa_switch *ds,
 int port, uint64_t *data);
int (*get_sset_count)(struct dsa_switch *ds, int port, int sset);
+   void(*get_ethtool_phy_stats)(struct dsa_switch *ds,
+int port, uint64_t *data);
 
/*
 * ethtool Wake-on-LAN
@@ -589,4 +591,9 @@ static inline int call_dsa_notifiers(unsigned long val, 
struct net_device *dev,
 #define BRCM_TAG_GET_PORT(v)   ((v) >> 8)
 #define BRCM_TAG_GET_QUEUE(v)  ((v) & 0xff)
 
+
+int dsa_port_get_phy_strings(struct dsa_port *dp, uint8_t *data);
+int dsa_port_get_ethtool_phy_stats(struct dsa_port *dp, uint64_t *data);
+int dsa_port_get_phy_sset_count(struct dsa_port *dp);
+
 #endif
diff --git a/net/dsa/master.c b/net/dsa/master.c
index 8d27687fd0ca..c90ee3227dea 100644
--- a/net/dsa/master.c
+++ b/net/dsa/master.c
@@ -31,6 +31,32 @@ static void dsa_master_get_ethtool_stats(struct net_device 
*dev,
ds->ops->get_ethtool_stats(ds, port, data + count);
 }
 
+static void dsa_master_get_ethtool_phy_stats(struct net_device *dev,
+struct ethtool_stats *stats,
+uint64_t *data)
+{
+   struct dsa_port *cpu_dp = dev->dsa_ptr;
+   const struct ethtool_ops *ops = cpu_dp->orig_ethtool_ops;
+   struct dsa_switch *ds = cpu_dp->ds;
+   int port = cpu_dp->index;
+   int count = 0;
+
+   if (dev->phydev && !ops->get_ethtool_phy_stats) {
+   count = phy_ethtool_get_sset_count(dev->phydev);
+   if (count >= 0)
+   phy_ethtool_get_stats(dev->phydev, stats, data);
+   } else if (ops->get_sset_count && ops->get_ethtool_phy_stats) {
+   count = ops->get_sset_count(dev, ETH_SS_PHY_STATS);
+   ops->get_ethtool_phy_stats(dev, stats, data);
+   }
+
+   if (count < 0)
+   count = 0;
+
+   if (ds->ops->get_ethtool_phy_stats)
+   ds->ops->get_ethtool_phy_stats(ds, port, data + count);
+}
+
 static int dsa_master_get_sset_count(struct net_device *dev, int sset)
 {
struct dsa_port *cpu_dp = dev->dsa_ptr;
@@ -38,11 +64,14 @@ static int dsa_master_get_sset_count(struct net_device 
*dev, int sset)
struct dsa_switch *ds = cpu_dp->ds;
int count = 0;
 
-   if (ops->get_sset_count) {
+   if (sset == ETH_SS_PHY_STATS && dev->phydev &&
+   !ops->get_ethtool_phy_stats)
+   count = phy_ethtool_get_sset_count(dev->phydev);
+   else if (ops->get_sset_count)
count = ops->get_sset_count(dev, sset);
-   if (count < 0)
-   count = 0;
-   }
+
+   if (count < 0)
+   count = 0;
 
if (ds->ops->get_sset_count)
count += ds->ops->get_sset_count(ds, cpu_dp->index, sset);
@@ -67,7 +96,14 @@ static void dsa_master_get_strings(struct net_device *dev, 
uint32_t stringset,
/* We do not want to be NULL-terminated, since this is a prefix */
pfx[sizeof(pfx) - 1] = '_';
 
-   if (ops->get_sset_count && ops->get_strings) {
+   if (stringset == ETH_SS_PHY_STATS && dev->phydev &&
+   !ops->get_ethtool_phy_stats) {
+   mcount = phy_ethtool_get_sset_count(dev->phydev);
+   if (mcount < 0)
+   mcount = 0;
+   else
+   phy_ethtool_get_strings(dev->phydev, data);
+   } else if (ops->get_sset_count && ops->get_strings) {
mcount = ops->get_sset_count(dev, stringset);
if (mcount < 0)
mcount = 0;
@@ -107,6 +143,7 @@ static int dsa_master_ethtool_setup(struct net_device *dev)
ops->get_sset_count = dsa_master_get_sset_count;
ops->get_ethtool_stats = dsa_master_get_ethtool_stats;
ops->get_strings = dsa_master_get_strings;
+   ops->get_ethtool_phy_stats = dsa_master_get_ethtool_phy_stats;
 
dev->ethtool_ops = ops;
 
diff --git a/net/dsa/port.c b/net/dsa/port.c
index 5e2a88720a9a..2413beb995be 100644
--- a/net/dsa/port.c
+++ b/net/dsa/port.c
@@ -383,3 +383,60 @@ void dsa_port_link_unregister_of(struct dsa_port *dp)
else
   

Re: [bpf-next PATCH] bpf: reduce runtime of test_sockmap tests

2018-04-24 Thread Daniel Borkmann
On 04/25/2018 01:28 AM, John Fastabend wrote:
> When test_sockmap was running outside of selftests and was not being
> run by build bots, it was reasonable to spend a significant amount of
> time running various tests. The number of tests is high because many
> different I/O iterators are run.
> 
> However, now that test_sockmap is part of selftests, rather than
> iterating through all I/O sides only test a minimal set of min/max
> values along with a few "normal" I/O ops. Also remove the long
> running tests. They can be run from other test frameworks on a regular
> cadence.
> 
> This significantly reduces the runtime of test_sockmap.
> 
> Before:
> 
> $ time sudo ./test_sockmap  > /dev/null
> 
> real4m47.521s
> user0m0.370s
> sys 0m3.131s
> 
> After:
> 
> $ time sudo ./test_sockmap  > /dev/null
> 
> real0m0.514s
> user0m0.104s
> sys 0m0.430s
> 
> The CLI is still available for users that want to test the long
> running tests that do the larger send/recv tests.
> 
> Signed-off-by: John Fastabend 

Applied to bpf-next, thanks John!


Re: [PATCH v2] bpf, x86_32: add eBPF JIT compiler for ia32

2018-04-24 Thread Daniel Borkmann
On 04/19/2018 05:54 PM, Wang YanQing wrote:
> The JIT compiler emits ia32 bit instructions. Currently, It supports
> eBPF only. Classic BPF is supported because of the conversion by BPF core.
> 
> Almost all instructions from eBPF ISA supported except the following:
> BPF_ALU64 | BPF_DIV | BPF_K
> BPF_ALU64 | BPF_DIV | BPF_X
> BPF_ALU64 | BPF_MOD | BPF_K
> BPF_ALU64 | BPF_MOD | BPF_X
> BPF_STX | BPF_XADD | BPF_W
> BPF_STX | BPF_XADD | BPF_DW
> 
> It also doesn't support BPF_JMP|BPF_CALL with BPF_PSEUDO_CALL.
> 
> ia32 has only a few general purpose registers (EAX|EDX|ECX|EBX|ESI|EDI),
> and not all six of them can be treated as truly general purpose in the
> jit:
> MUL instructions need EAX:EDX, and shift instructions need ECX.
> 
> So I decided to use the stack to emulate all eBPF 64-bit registers. This
> simplifies the implementation a lot, because we don't need to deal with
> the flexible memory addressing modes on ia32; for example, we don't need
> to write the code pattern below for a single BPF_ADD instruction:
> 
> if (src is a register && dst is a register)
> {
>//one instruction encoding for ADD instruction
> } else if (only src is a register)
> {
>//another different instruction encoding for ADD instruction
> } else if (only dst is a register)
> {
>//another different instruction encoding for ADD instruction
> } else
> {
>//src and dst are all on stack.
>//move src or dst to temporary registers
> }
> 
> If the if-else-else-else in the above example doesn't look painful
> enough, consider what it takes for a BPF_ALU64|BPF_*SHIFT* instruction,
> which we need many native instructions to emulate.
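
As an aside, a minimal sketch (not code from this patch) of what the
stack emulation means for the 64-bit ALU ops: with every eBPF 64-bit
register spilled to two 32-bit stack slots, even a plain BPF_ALU64 ADD
lowers to an add of the low words plus an add-with-carry of the high
words. A semantic model in C:

#include <stdint.h>

/* one eBPF 64-bit register, kept as two 32-bit stack slots */
struct bpf_reg32 {
	uint32_t lo;
	uint32_t hi;
};

/* the add-low / add-with-carry-high sequence the JIT has to emit */
static void alu64_add(struct bpf_reg32 *dst, const struct bpf_reg32 *src)
{
	uint32_t old_lo = dst->lo;

	dst->lo += src->lo;
	dst->hi += src->hi + (dst->lo < old_lo);	/* carry into high word */
}

Variable 64-bit shifts are worse still, since they additionally need a
branch on the shift count and bits moved between the two words.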
> 
> Tested on my PC (Intel(R) Core(TM) i5-5200U CPU) and virtualbox.

Just out of plain curiosity, do you have a specific use case on the
JIT for x86_32? Are you targeting this to be used for e.g. Atom or
the like (sorry, don't really have a good overview where x86_32 is
still in use these days)?

> Testing results on i5-5200U:
> 
> 1) test_bpf: Summary: 349 PASSED, 0 FAILED, [319/341 JIT'ed]
> 2) test_progs: Summary: 81 PASSED, 2 FAILED.
>test_progs reports "libbpf: incorrect bpf_call opcode" for
>test_l4lb_noinline and test_xdp_noinline, because there is
>no llvm-6.0 on my machine, and current implementation doesn't
>support BPF_PSEUDO_CALL, so I think we can ignore the two failed
>testcases.
> 3) test_lpm: OK
> 4) test_lru_map: OK
> 5) test_verifier: Summary: 823 PASSED, 5 FAILED
>test_verifier reports "invalid bpf_context access off=68 size=1/2/4/8"
>for all 5 FAILED testcases with and without jit; we need to fix the
>failed testcases themselves instead of this jit.

Can you elaborate further on these? Looks like this definitely needs
fixing on 32 bit. Would be great to get a better understanding of the
underlying bug(s) and properly fix them.

> The above tests were all done with each of the following flag settings:
> 1:bpf_jit_enable=1 and bpf_jit_harden=0
> 2:bpf_jit_enable=1 and bpf_jit_harden=2
> 
> Below are some numbers for this jit implementation:
> Note:
>   I ran test_progs in kselftest 100 times continuously for every testcase;
>   the numbers are in the format total/times=avg.
>   The numbers that test_bpf reports show almost the same relation.
> 
> a:jit_enable=0 and jit_harden=0             b:jit_enable=1 and jit_harden=0
>   test_pkt_access:PASS:ipv4:15622/100=156     test_pkt_access:PASS:ipv4:10057/100=100
>   test_pkt_access:PASS:ipv6:9130/100=91       test_pkt_access:PASS:ipv6:5055/100=50
>   test_xdp:PASS:ipv4:240198/100=2401          test_xdp:PASS:ipv4:145945/100=1459
>   test_xdp:PASS:ipv6:137326/100=1373          test_xdp:PASS:ipv6:67337/100=673
>   test_l4lb:PASS:ipv4:61100/100=611           test_l4lb:PASS:ipv4:38137/100=381
>   test_l4lb:PASS:ipv6:101000/100=1010         test_l4lb:PASS:ipv6:57779/100=577
> 
> c:jit_enable=1 and jit_harden=2
>   test_pkt_access:PASS:ipv4:12650/100=126
>   test_pkt_access:PASS:ipv6:7074/100=70
>   test_xdp:PASS:ipv4:147211/100=1472
>   test_xdp:PASS:ipv6:85783/100=857
>   test_l4lb:PASS:ipv4:53222/100=532
>   test_l4lb:PASS:ipv6:76322/100=763
> 
> Yes, the numbers are prettier when jit_harden is turned off. If we want
> to speed up jit_harden, then we need to move BPF_REG_AX to a *real*
> register instead of the stack emulation, but when we do that, we need to
> face all the pain I describe above. We can do it in a next step.
> 
> See Documentation/networking/filter.txt for more information.
> 
> Signed-off-by: Wang YanQing 
[...]
> + /* call */
> + case BPF_JMP | BPF_CALL:
> + {
> + const u8 *r1 = bpf2ia32[BPF_REG_1];
> + const u8 *r2 = bpf2ia32[BPF_REG_2];
> + const u8 *r3 = bpf2ia32[BPF_REG_3];
> + const u8 *r4 = bpf2ia32[BPF_REG_4];
> + const u8 *r5 = bpf2ia32[BPF_REG_5];
> +
> + if (insn->src_reg == BPF_PSEUDO_CALL)
> +

[PATCH v3] ath9k: dfs: Remove VLA usage

2018-04-24 Thread Kees Cook
In the quest to remove all stack VLA usage from the kernel[1], this
redefines FFT_NUM_SAMPLES as a #define instead of a const int, since a
const int is not a constant expression and still triggers gcc's VLA
checking pass.

[1] https://lkml.org/lkml/2018/3/7/621
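
As an aside, a minimal sketch (not part of the patch) of why the two
forms differ for array sizes in C: a const int is not an integer
constant expression, so an array dimensioned with it is formally a VLA,
while a #define expands to a literal and gives a plain fixed-size array.

static const int N_CONST = 4;
#define N_DEFINE 4

void demo(void)
{
	int a[N_CONST];		/* formally a VLA; gcc's -Wvla warns here */
	int b[N_DEFINE];	/* ordinary fixed-size array */

	(void)a;
	(void)b;
}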

Co-developed-by: Andreas Christoforou 
Signed-off-by: Kees Cook 
---
v3: replace FFT_NUM_SAMPLES as a #define (Joe)
---
 drivers/net/wireless/ath/ath9k/dfs.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/wireless/ath/ath9k/dfs.c 
b/drivers/net/wireless/ath/ath9k/dfs.c
index 6fee9a464cce..e6e56a925121 100644
--- a/drivers/net/wireless/ath/ath9k/dfs.c
+++ b/drivers/net/wireless/ath/ath9k/dfs.c
@@ -40,8 +40,8 @@ static const int BIN_DELTA_MIN= 1;
 static const int BIN_DELTA_MAX = 10;
 
 /* we need at least 3 deltas / 4 samples for a reliable chirp detection */
-#define NUM_DIFFS 3
-static const int FFT_NUM_SAMPLES   = (NUM_DIFFS + 1);
+#define NUM_DIFFS  3
+#define FFT_NUM_SAMPLES(NUM_DIFFS + 1)
 
 /* Threshold for difference of delta peaks */
 static const int MAX_DIFF  = 2;
-- 
2.7.4


-- 
Kees Cook
Pixel Security


[PATCH v3 ipsec-next] xfrm: remove VLA usage in __xfrm6_sort()

2018-04-24 Thread Kees Cook
In the quest to remove all stack VLA usage from the kernel[1], just use
XFRM_MAX_DEPTH as already done for the "class" array. For one caller the
loop runs up to 5 iterations, for the other up to 6.

[1] https://lkml.org/lkml/2018/3/7/621

Co-developed-by: Andreas Christoforou 
Signed-off-by: Kees Cook 
---
v3:
- adjust Subject and commit log (Steffen)
- use "= { }" instead of memset() (Stefano)
- reorder variables (Stefano)
v2:
- use XFRM_MAX_DEPTH for "count" array (Steffen and Mathias).
---
 net/ipv6/xfrm6_state.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv6/xfrm6_state.c b/net/ipv6/xfrm6_state.c
index 16f434791763..eeb44b64ae7f 100644
--- a/net/ipv6/xfrm6_state.c
+++ b/net/ipv6/xfrm6_state.c
@@ -60,9 +60,9 @@ xfrm6_init_temprop(struct xfrm_state *x, const struct 
xfrm_tmpl *tmpl,
 static int
 __xfrm6_sort(void **dst, void **src, int n, int (*cmp)(void *p), int maxclass)
 {
-   int i;
+   int count[XFRM_MAX_DEPTH] = { };
int class[XFRM_MAX_DEPTH];
-   int count[maxclass];
+   int i;
 
memset(count, 0, sizeof(count));
 
-- 
2.7.4


-- 
Kees Cook
Pixel Security


[bpf-next PATCH] bpf: reduce runtime of test_sockmap tests

2018-04-24 Thread John Fastabend
When test_sockmap was running outside of selftests and was not being
run by build bots it was reasonable to spend significant amount of
time running various tests. The number of tests is high because many
different I/O iterators are run.

However, now that test_sockmap is part of selftests, rather than
iterating through all I/O sides, only test a minimal set of min/max
values along with a few "normal" I/O ops. Also remove the long
running tests. They can be run from other test frameworks on a regular
cadence.

This significantly reduces the runtime of test_sockmap.

Before:

$ time sudo ./test_sockmap  > /dev/null

real4m47.521s
user0m0.370s
sys 0m3.131s

After:

$ time sudo ./test_sockmap  > /dev/null

real0m0.514s
user0m0.104s
sys 0m0.430s

The CLI is still available for users that want to test the long
running tests that do the larger send/recv tests.

Signed-off-by: John Fastabend 
---
 tools/testing/selftests/bpf/test_sockmap.c |   33 ++--
 1 file changed, 16 insertions(+), 17 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_sockmap.c 
b/tools/testing/selftests/bpf/test_sockmap.c
index 6d63a1c..29c022d 100644
--- a/tools/testing/selftests/bpf/test_sockmap.c
+++ b/tools/testing/selftests/bpf/test_sockmap.c
@@ -344,8 +344,8 @@ static int msg_loop(int fd, int iov_count, int iov_length, 
int cnt,
if (err < 0)
perror("recv start time: ");
while (s->bytes_recvd < total_bytes) {
-   timeout.tv_sec = 1;
-   timeout.tv_usec = 0;
+   timeout.tv_sec = 0;
+   timeout.tv_usec = 10;
 
/* FD sets */
FD_ZERO();
@@ -903,12 +903,10 @@ static int test_exec(int cgrp, struct sockmap_options 
*opt)
 {
int err = __test_exec(cgrp, SENDMSG, opt);
 
-   sched_yield();
if (err)
goto out;
 
err = __test_exec(cgrp, SENDPAGE, opt);
-   sched_yield();
 out:
return err;
 }
@@ -928,19 +926,18 @@ static int test_loop(int cgrp)
opt.iov_length = 0;
opt.rate = 0;
 
-   for (r = 1; r < 100; r += 33) {
-   for (i = 1; i < 100; i += 33) {
-   for (l = 1; l < 100; l += 33) {
-   opt.rate = r;
-   opt.iov_count = i;
-   opt.iov_length = l;
-   err = test_exec(cgrp, &opt);
-   if (err)
-   goto out;
-   }
+   r = 1;
+   for (i = 1; i < 100; i += 33) {
+   for (l = 1; l < 100; l += 33) {
+   opt.rate = r;
+   opt.iov_count = i;
+   opt.iov_length = l;
+   err = test_exec(cgrp, &opt);
+   if (err)
+   goto out;
}
}
-
+   sched_yield();
 out:
return err;
 }
@@ -1031,6 +1028,7 @@ static int test_send(struct sockmap_options *opt, int 
cgrp)
if (err)
goto out;
 out:
+   sched_yield();
return err;
 }
 
@@ -1168,7 +1166,7 @@ static int test_start_end(int cgrp)
opt.iov_length = 100;
txmsg_cork = 1600;
 
-   for (i = 99; i <= 1600; i += 100) {
+   for (i = 99; i <= 1600; i += 500) {
txmsg_start = 0;
txmsg_end = i;
err = test_exec(cgrp, &opt);
@@ -1177,7 +1175,7 @@ static int test_start_end(int cgrp)
}
 
/* Test start/end with cork but pull data in middle */
-   for (i = 199; i <= 1600; i += 100) {
+   for (i = 199; i <= 1600; i += 500) {
txmsg_start = 100;
txmsg_end = i;
err = test_exec(cgrp, &opt);
@@ -1221,6 +1219,7 @@ static int test_start_end(int cgrp)
 out:
txmsg_start = 0;
txmsg_end = 0;
+   sched_yield();
return err;
 }
 



Re: [PATCH] [PATCH bpf-next] samples/bpf/bpf_load.c: remove redundant ret, assignment in bpf_load_program()

2018-04-24 Thread Daniel Borkmann
On 04/24/2018 10:18 AM, shhuiw wrote:
> 
> 2 redundant ret assignments removed:
> * 'ret = 1' before the logic 'if (data_maps)', and if any errors jump to
>   label 'done'. No 'ret = 1' needed before the error jump.
> * After the '/* load programs */' part, if everything goes well, then
>   the BPF code will be loaded and 'ret' set to 0 by load_and_attach().
>   If something goes wrong, 'ret' is set to non-0, and the redundant 'ret = 0'
>   after the for clause would cause the error to be skipped.
>   For example, if some BPF code cannot provide supported program types
>   in ELF SEC("unknown"), the for clause will not call load_and_attach()
>   to load the BPF code. 1 should be returned to callers instead of 0.
> 
> Signed-off-by: Wang Sheng-Hui 

Your patch is corrupted, please use something like git-send-email(1).

Thanks,
Daniel


Re: [PATCH net-next 1/2 v2] netns: restrict uevents

2018-04-24 Thread Eric W. Biederman
Christian Brauner  writes:

> On Wed, Apr 25, 2018, 00:41 Eric W. Biederman  wrote:
>
>  Bah. This code is obviously correct and probably wrong.
>
>  How do we deliver uevents for network devices that are outside of the
>  initial user namespace? The kernel still needs to deliver those.
>
>  The logic to figure out which network namespace a device needs to be
>  delivered to is present in kobj_bcast_filter. That logic will almost
>  certainly need to be turned inside out. Sigh, not as easy as I would
>  have hoped.
>
> My first patch that we discussed put additional filtering logic into 
> kobj_bcast_filter for that very reason. But I can move that logic
> out and come up with a new patch.

I may have mis-understood.

I heard and am still hearing additional filtering to reduce the places
the packet is delivered.

I am saying something needs to change to increase the number of places
the packet is delivered.

For the special class of devices that kobj_bcast_filter would apply to,
those need to be delivered to network namespaces that are no longer on
uevent_sock_list.

So the code fundamentally needs to be split into two paths.  Ordinary
devices that use uevent_sock_list.  Network devices that are just
delivered in their own network namespace.

netlink_broadcast_filtered gets to go away completely.
The logic of figuring out the network namespace though becomes trickier.

Now it may make sense to have all of that as an additional patch on top
of this one or perhaps a precursor patch that addresses the problem.  We
will unfortunately drop those uevents today because their uids are not
valid.  But they are not delivered anywhere else so to allow them to be
received we need to fix them.

Eric

>
>  Christian Brauner  writes:
>  > commit 07e98962fa77 ("kobject: Send hotplug events in all network 
> namespaces")
>  >
>  > enabled sending hotplug events into all network namespaces back in 2010.
>  > Over time the set of uevents that get sent into all network namespaces has
>  > shrunk a little. We have now reached the point where hotplug events for all
>  > devices that carry a namespace tag are filtered according to that
>  > namespace. Specifically, they are filtered whenever the namespace tag of
>  > the kobject does not match the namespace tag of the netlink socket. One
>  > example are network devices. Uevents for network devices only show up in
>  > the network namespaces these devices are moved to or created in.
>  >
>  > However, any uevent for a kobject that does not have a namespace tag
>  > associated with it will not be filtered and we will broadcast it into all
>  > network namespaces. This behavior stopped making sense when user namespaces
>  > were introduced.
>  >
>  > This patch restricts uevents to the initial user namespace for a couple of
>  > reasons that have been extensively discussed on the mailing list [1].
>  > - Thundering herd:
>  > Broadcasting uevents into all network namespaces introduces significant
>  > overhead.
>  > All processes that listen to uevents running in non-initial user
>  > namespaces will end up responding to uevents that will be meaningless to
>  > them. Mainly, because non-initial user namespaces cannot easily manage
>  > devices unless they have a privileged host-process helping them out. This
>  > means that there will be a thundering herd of activity when there
>  > shouldn't be any.
>  > - Uevents from non-root users are already filtered in userspace:
>  > Uevents are filtered by userspace in a user namespace because the
>  > received uid != 0. Instead the uid associated with the event will be
>  > 65534 == "nobody" because the global root uid is not mapped.
>  > This means we can safely and without introducing regressions modify the
>  > kernel to not send uevents into all network namespaces whose owning user
>  > namespace is not the initial user namespace because we know that
>  > userspace will ignore the message because of the uid anyway. I have
>  > a) verified that this is true for every udev implementation out there b)
>  > that this behavior has been present in all udev implementations from the
>  > very beginning.
>  > - Removing needless overhead/Increasing performance:
>  > Currently, the uevent socket for each network namespace is added to the
>  > global variable uevent_sock_list. The list itself needs to be protected
>  > by a mutex. So every time a uevent is generated the mutex is taken on the
>  > list. The mutex is held *from the creation of the uevent (memory
>  > allocation, string creation etc. until all uevent sockets have been
>  > handled*. This is aggravated by the fact that for each uevent socket that
>  > has listeners the mc_list must be walked as well which means we're
>  > talking O(n^2) here. Given that a standard Linux workload usually has
>  > quite a lot of network namespaces and - in the face of containers - a lot
>  > of user namespaces this 

Re: [PATCH net-next 1/2 v2] netns: restrict uevents

2018-04-24 Thread Christian Brauner
On Tue, Apr 24, 2018 at 05:40:07PM -0500, Eric W. Biederman wrote:
> 
> Bah.  This code is obviously correct and probably wrong.
> 
> How do we deliver uevents for network devices that are outside of the
> initial user namespace?  The kernel still needs to deliver those.
> 
> The logic to figure out which network namespace a device needs to be
> delivered to is present in kobj_bcast_filter.  That logic will almost
> certainly need to be turned inside out.  Sigh, not as easy as I would
> have hoped.

That's why my initial patch [1] added additional filtering logic to
kobj_bcast_filter(). But since we care about performance improvements as
well I can come up with a patch that moves this logic out of
kobj_bcast_filter().

Christian
[1]: https://www.spinics.net/lists/netdev/msg494487.html

> 
> Eric
> 
> Christian Brauner  writes:
> > commit 07e98962fa77 ("kobject: Send hotplug events in all network 
> > namespaces")
> >
> > enabled sending hotplug events into all network namespaces back in 2010.
> > Over time the set of uevents that get sent into all network namespaces has
> > shrunk a little. We have now reached the point where hotplug events for all
> > devices that carry a namespace tag are filtered according to that
> > namespace. Specifically, they are filtered whenever the namespace tag of
> > the kobject does not match the namespace tag of the netlink socket. One
> > example are network devices. Uevents for network devices only show up in
> > the network namespaces these devices are moved to or created in.
> >
> > However, any uevent for a kobject that does not have a namespace tag
> > associated with it will not be filtered and we will broadcast it into all
> > network namespaces. This behavior stopped making sense when user namespaces
> > were introduced.
> >
> > This patch restricts uevents to the initial user namespace for a couple of
> > reasons that have been extensively discussed on the mailing list [1].
> > - Thundering herd:
> >   Broadcasting uevents into all network namespaces introduces significant
> >   overhead.
> >   All processes that listen to uevents running in non-initial user
> >   namespaces will end up responding to uevents that will be meaningless to
> >   them. Mainly, because non-initial user namespaces cannot easily manage
> >   devices unless they have a privileged host-process helping them out. This
> >   means that there will be a thundering herd of activity when there
> >   shouldn't be any.
> > - Uevents from non-root users are already filtered in userspace:
> >   Uevents are filtered by userspace in a user namespace because the
> >   received uid != 0. Instead the uid associated with the event will be
> >   65534 == "nobody" because the global root uid is not mapped.
> >   This means we can safely and without introducing regressions modify the
> >   kernel to not send uevents into all network namespaces whose owning user
> >   namespace is not the initial user namespace because we know that
> >   userspace will ignore the message because of the uid anyway. I have
> >   a) verified that this is true for every udev implementation out there b)
> >   that this behavior has been present in all udev implementations from the
> >   very beginning.
> > - Removing needless overhead/Increasing performance:
> >   Currently, the uevent socket for each network namespace is added to the
> >   global variable uevent_sock_list. The list itself needs to be protected
> >   by a mutex. So every time a uevent is generated the mutex is taken on the
> >   list. The mutex is held *from the creation of the uevent (memory
> >   allocation, string creation etc. until all uevent sockets have been
> >   handled*. This is aggravated by the fact that for each uevent socket that
> >   has listeners the mc_list must be walked as well which means we're
> >   talking O(n^2) here. Given that a standard Linux workload usually has
> >   quite a lot of network namespaces and - in the face of containers - a lot
> >   of user namespaces this quickly becomes a performance problem (see
> >   "Thundering herd" above). By just recording uevent sockets of network
> >   namespaces that are owned by the initial user namespace we significantly
> >   increase performance in this codepath.
> > - Injecting uevents:
> >   There's a valid argument that containers might be interested in receiving
> >   device events especially if they are delegated to them by a privileged
> >   userspace process. One prime example are SR-IOV enabled devices that are
> >   explicitly designed to be handed off to other users such as VMs or
> >   containers.
> >   This use-case can now be correctly handled since
> >   commit 692ec06d7c92 ("netns: send uevent messages"). This commit
> >   introduced the ability to send uevents from userspace. As such we can let
> >   a sufficiently privileged (CAP_SYS_ADMIN in the owning user namespace of
> >   the network namespace of the netlink socket) userspace process make a
> > 

Re: [PATCH net-next 2/2 v2] netns: isolate seqnums to use per-netns locks

2018-04-24 Thread Eric W. Biederman
Christian Brauner  writes:

> On Tue, Apr 24, 2018 at 04:52:20PM -0500, Eric W. Biederman wrote:
>> Christian Brauner  writes:
>> 
>> > Now that it's possible to have a different set of uevents in different
>> > network namespaces, per-network namespace uevent sequence numbers are
>> > introduced. This increases performance as locking is now restricted to the
>> > network namespace affected by the uevent rather than locking everything.
>> > Testing revealed significant performance improvements. For details see
>> > "Testing" below.
>> 
>> Maybe.  Your locking is wrong, and a few other things are wrong.  see
>> below.
>
> Thanks for the review! Happy to rework this until it's in a mergeable shape.
>
>> 
>> > Since commit 692ec06 ("netns: send uevent messages") network namespaces not
>> > owned by the initial user namespace can be sent uevents from a sufficiently
>> > privileged userspace process.
>> > In order to send a uevent into a network namespace not owned by the initial
>> > user namespace we currently still need to take the *global mutex* that
>> > locks the uevent socket list even though the list *only contains network
>> > namespaces owned by the initial user namespace*. This needs to be done
>> > because the uevent counter is a global variable. Taking the global lock is
>> > performance sensitive since a user on the host can spawn a pool of n
>> > processes that each create their own new user and network namespaces and then
>> > go on to inject uevents in parallel into the network namespace of all of
>> > these processes. This can have a significant performance impact for the
>> > host's udevd since it means that there can be a lot of delay between a
>> > device being added and the corresponding uevent being sent out and
>> > available for processing by udevd. It also means that each network
>> > namespace not owned by the initial user namespace which userspace has sent
>> > a uevent to will need to wait until the lock becomes available.
>> >
>> > Implementation:
>> > This patch gives each network namespace its own uevent sequence number.
>> > Each network namespace not owned by the initial user namespace receives its
>> > own mutex. The struct uevent_sock is opaque to callers outside of kobject.c
>> > so the mutex *can* and *is* only ever accessed in lib/kobject.c. In this
>> > file it is clearly documented which lock has to be taken. All network
>> > namespaces owned by the initial user namespace will still share the same
>> > lock since they are all served sequentially via the uevent socket list.
>> > This decouples the locking and ensures that the host retrieves uevents as
>> > fast as possible even if there are a lot of uevents injected into network
>> > namespaces not owned by the initial user namespace.  In addition, each
>> > network namespace not owned by the initial user namespace does not have to
>> > wait on any other network namespace not sharing the same user namespace.
>> >
>> > Testing:
>> > Two 4.17-rc1 test kernels were compiled. One with per netns uevent seqnums
>> > with decoupled locking and one without. To ensure that testing made sense
>> > both kernels carried the patch to remove network namespaces not owned by
>> > the initial user namespace from the uevent socket list.
>> > Three tests were constructed. All of them showed significant performance
>> > improvements with per-netns uevent sequence numbers and decoupled locking.
>> >
>> >  # Testcase 1:
>> >Only Injecting Uevents into network namespaces not owned by the initial
>> >user namespace.
>> >- created 1000 new user namespace + network namespace pairs
>> >- opened a uevent listener in each of those namespace pairs
>> >- injected uevents into each of those network namespaces 10,000 times
>> >  meaning 10,000,000 (10 million) uevents were injected. (The high
>> >  number of uevent injections should get rid of a lot of jitter.)
>> >  The injection was done by fork()ing 1000 uevent injectors in a simple
>> >  for-loop to ensure that uevents were injected in parallel.
>> >- mean transaction time was calculated:
>> >  - *without* uevent sequence number namespacing: 67 μs
>> >  - *with* uevent sequence number namespacing:55 μs
>> >  - makes a difference of:12 μs
>> >- a t-test was performed on the two data vectors which revealed
>> >  shows significant performance improvements:
>> >  Welch Two Sample t-test
>> >  data:  x1 and y1
>> >  t = 405.16, df = 18883000, p-value < 2.2e-16
>> >  alternative hypothesis: true difference in means is not equal to 0
>> >  95 percent confidence interval:
>> >  12.14949 12.26761
>> >  sample estimates:
>> >  mean of x mean of y
>> >  68.48594  56.27739
>> >
>> >  # Testcase 2:
>> >Injecting Uevents into network namespaces not owned by the initial user
>> >namespace and network namespaces owned by the 

Re: [PATCH net-next 1/2 v2] netns: restrict uevents

2018-04-24 Thread Eric W. Biederman

Bah.  This code is obviously correct and probably wrong.

How do we deliver uevents for network devices that are outside of the
initial user namespace?  The kernel still needs to deliver those.

The logic to figure out which network namespace a device needs to be
delivered to is present in kobj_bcast_filter.  That logic will almost
certainly need to be turned inside out.  Sigh, not as easy as I would
have hoped.

Eric

Christian Brauner  writes:
> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
>
> enabled sending hotplug events into all network namespaces back in 2010.
> Over time the set of uevents that get sent into all network namespaces has
> shrunk a little. We have now reached the point where hotplug events for all
> devices that carry a namespace tag are filtered according to that
> namespace. Specifically, they are filtered whenever the namespace tag of
> the kobject does not match the namespace tag of the netlink socket. One
> example are network devices. Uevents for network devices only show up in
> the network namespaces these devices are moved to or created in.
>
> However, any uevent for a kobject that does not have a namespace tag
> associated with it will not be filtered and we will broadcast it into all
> network namespaces. This behavior stopped making sense when user namespaces
> were introduced.
>
> This patch restricts uevents to the initial user namespace for a couple of
> reasons that have been extensively discussed on the mailing list [1].
> - Thundering herd:
>   Broadcasting uevents into all network namespaces introduces significant
>   overhead.
>   All processes that listen to uevents running in non-initial user
>   namespaces will end up responding to uevents that will be meaningless to
>   them. Mainly, because non-initial user namespaces cannot easily manage
>   devices unless they have a privileged host-process helping them out. This
>   means that there will be a thundering herd of activity when there
>   shouldn't be any.
> - Uevents from non-root users are already filtered in userspace:
>   Uevents are filtered by userspace in a user namespace because the
>   received uid != 0. Instead the uid associated with the event will be
>   65534 == "nobody" because the global root uid is not mapped.
>   This means we can safely and without introducing regressions modify the
>   kernel to not send uevents into all network namespaces whose owning user
>   namespace is not the initial user namespace because we know that
>   userspace will ignore the message because of the uid anyway. I have
>   a) verified that this is true for every udev implementation out there b)
>   that this behavior has been present in all udev implementations from the
>   very beginning.
> - Removing needless overhead/Increasing performance:
>   Currently, the uevent socket for each network namespace is added to the
>   global variable uevent_sock_list. The list itself needs to be protected
>   by a mutex. So every time a uevent is generated the mutex is taken on the
>   list. The mutex is held *from the creation of the uevent (memory
>   allocation, string creation etc. until all uevent sockets have been
>   handled*. This is aggravated by the fact that for each uevent socket that
>   has listeners the mc_list must be walked as well which means we're
>   talking O(n^2) here. Given that a standard Linux workload usually has
>   quite a lot of network namespaces and - in the face of containers - a lot
>   of user namespaces this quickly becomes a performance problem (see
>   "Thundering herd" above). By just recording uevent sockets of network
>   namespaces that are owned by the initial user namespace we significantly
>   increase performance in this codepath.
> - Injecting uevents:
>   There's a valid argument that containers might be interested in receiving
>   device events especially if they are delegated to them by a privileged
>   userspace process. One prime example are SR-IOV enabled devices that are
>   explicitly designed to be handed off to other users such as VMs or
>   containers.
>   This use-case can now be correctly handled since
>   commit 692ec06d7c92 ("netns: send uevent messages"). This commit
>   introduced the ability to send uevents from userspace. As such we can let
>   a sufficiently privileged (CAP_SYS_ADMIN in the owning user namespace of
>   the network namespace of the netlink socket) userspace process make a
>   decision what uevents should be sent. This removes the need to blindly
>   broadcast uevents into all user namespaces and provides a performant and
>   safe solution to this problem.
> - Filtering logic:
>   This patch filters by *owning user namespace of the network namespace a
>   given task resides in* and not by user namespace of the task per se. This
>   means if the user namespace of a given task is unshared but the network
>   namespace is kept and is owned by the initial user namespace a listener
>   that 

Re: [pci PATCH v8 0/4] Add support for unmanaged SR-IOV

2018-04-24 Thread Alexander Duyck
On Tue, Apr 24, 2018 at 2:51 PM, Bjorn Helgaas  wrote:
> On Sat, Apr 21, 2018 at 05:22:27PM -0700, Alexander Duyck wrote:
>> On Sat, Apr 21, 2018 at 1:34 PM, Bjorn Helgaas  wrote:
>
>> > For example, I'm not sure what you mean by "devices where the PF is
>> > not capable of managing VF resources."
>> >
>> > It *sounds* like you're saying the hardware works differently on some
>> > devices, but I don't think that's what you mean.  I think you're
>> > saying something about which drivers are used for the PF and the VF.
>>
>> That is sort of what I am saying.
>>
>> So for example with ixgbe there is functionality which is controlled
>> in the MMIO space of the PF that affects the functionality of the VFs
>> that are generated on the device. The PF has to rearrange the
>> resources such as queues and interrupts on the device before it can
>> enable SR-IOV, and it could alter those later to limit what the VF is
>> capable of doing.
>>
>> The model I am dealing with via this patch set has a PF that is not
>> much different than the VFs other than the fact that it has some
>> extended configuration space bits in place for SR-IOV, ARI, ACS, and
>> whatever other bits are needed in order to support spawning isolated
>> VFs.
>
> OK, thanks for the explanation, I think I understand what's going on
> now, correct me if I'm mistaken.  I added a hint about "PF" for Randy,
> too.
>
> These are on pci/virtualization for v4.18.

I reviewed them, and all of the changes to patches 1 & 2, both below and
in the tree, look good to me.

Thanks for taking care of all this.

- Alex

> commit 8effc395c209
> Author: Alexander Duyck 
> Date:   Sat Apr 21 15:23:09 2018 -0500
>
> PCI/IOV: Add pci_sriov_configure_simple()
>
> SR-IOV (Single Root I/O Virtualization) is an optional PCIe capability 
> (see
> PCIe r4.0, sec 9).  A PCIe Function with the SR-IOV capability is referred
> to as a PF (Physical Function).  If SR-IOV is enabled on the PF, several
> VFs (Virtual Functions) may be created.  The VFs can be individually
> assigned to virtual machines, which allows them to share a single hardware
> device while being isolated from each other.
>
> Some SR-IOV devices have resources such as queues and interrupts that must
> be set up in the PF before enabling the VFs, so they require a PF driver 
> to
> do that.
>
> Other SR-IOV devices don't require any PF setup before enabling VFs.  Add 
> a
> pci_sriov_configure_simple() interface so PF drivers for such devices can
> use it without repeating the VF-enabling code.
>
> Tested-by: Mark Rustad 
> Signed-off-by: Alexander Duyck 
> [bhelgaas: changelog, comment]
> Signed-off-by: Bjorn Helgaas 
> Reviewed-by: Greg Rose 
> Reviewed-by: Christoph Hellwig :wq
>
> commit a8ccf8a3
> Author: Alexander Duyck 
> Date:   Tue Apr 24 16:47:16 2018 -0500
>
> PCI/IOV: Add pci-pf-stub driver for PFs that only enable VFs
>
> Some SR-IOV PF devices provide no functionality other than acting as a
> means of enabling VFs.  For these devices, we want to enable the VFs and
> assign them to guest virtual machines, but there's no need to have a 
> driver
> for the PF itself.
>
> Add a new pci-pf-stub driver to claim those PF devices and provide the
> generic VF enable functionality.  An administrator can use the sysfs
> "sriov_numvfs" file to enable VFs, then assign them to guests.
>
> For now I only have one example ID provided by Amazon in terms of devices
> that require this functionality.  The general idea is that in the future 
> we
> will see other devices added as vendors come up with devices where the PF
> is more or less just a lightweight shim used to allocate VFs.
>
> Signed-off-by: Alexander Duyck 
> [bhelgaas: changelog]
> Signed-off-by: Bjorn Helgaas 
> Reviewed-by: Greg Rose 
> Reviewed-by: Christoph Hellwig 
>
> commit 115ddc491922
> Author: Alexander Duyck 
> Date:   Tue Apr 24 16:47:22 2018 -0500
>
> net: ena: Use pci_sriov_configure_simple() to enable VFs
>
> Instead of implementing our own version of a SR-IOV configuration stub in
> the ena driver, use the existing pci_sriov_configure_simple() function.
>
> Signed-off-by: Alexander Duyck 
> Signed-off-by: Bjorn Helgaas 
> Reviewed-by: Greg Rose 
> Reviewed-by: Christoph Hellwig 
>
> commit 74d986abc20b
> Author: Alexander Duyck 
> Date:   Tue Apr 24 16:47:27 2018 -0500
>
> nvme-pci: Use pci_sriov_configure_simple() to enable VFs
>
> 

Re: [bpf-next PATCH 0/4] selftests for BPF sockmap use cases

2018-04-24 Thread Daniel Borkmann
On 04/23/2018 11:30 PM, John Fastabend wrote:
> This series moves ./samples/sockmap into BPF selftests. There are a
> few good reasons to do this. First, by pushing this into selftests
> the tests will be run automatically. Second, sockmap was not really
> a sample of anything anymore, but rather a large set of tests.
> 
> Note: There are three recent fixes outstanding against bpf branch
> that can be detected occosionally by the automated tests here.
> 
> https://patchwork.ozlabs.org/patch/903138/
> https://patchwork.ozlabs.org/patch/903139/
> https://patchwork.ozlabs.org/patch/903140/
> 
> ---
> 
> John Fastabend (4):
>   bpf: sockmap, code sockmap_test in C
>   bpf: sockmap, add a set of tests to run by default
>   bpf: sockmap, add selftests
>   bpf: sockmap, remove samples program

Applied to bpf-next, thanks John!


Re: [PATCH net-next 2/2 v2] netns: isolate seqnums to use per-netns locks

2018-04-24 Thread Christian Brauner
On Tue, Apr 24, 2018 at 04:52:20PM -0500, Eric W. Biederman wrote:
> Christian Brauner  writes:
> 
> > Now that it's possible to have a different set of uevents in different
> > network namespaces, per-network namespace uevent sequence numbers are
> > introduced. This increases performance as locking is now restricted to the
> > network namespace affected by the uevent rather than locking everything.
> > Testing revealed significant performance improvements. For details see
> > "Testing" below.
> 
> Maybe.  Your locking is wrong, and a few other things are wrong.  see
> below.

Thanks for the review! Happy to rework this until it's in a mergeable shape.

> 
> > Since commit 692ec06 ("netns: send uevent messages") network namespaces not
> > owned by the initial user namespace can be sent uevents from a sufficiently
> > privileged userspace process.
> > In order to send a uevent into a network namespace not owned by the initial
> > user namespace we currently still need to take the *global mutex* that
> > locks the uevent socket list even though the list *only contains network
> > namespaces owned by the initial user namespace*. This needs to be done
> > because the uevent counter is a global variable. Taking the global lock is
> > performance sensitive since a user on the host can spawn a pool of n
> > processes that each create their own new user and network namespaces and then
> > go on to inject uevents in parallel into the network namespace of all of
> > these processes. This can have a significant performance impact for the
> > host's udevd since it means that there can be a lot of delay between a
> > device being added and the corresponding uevent being sent out and
> > available for processing by udevd. It also means that each network
> > namespace not owned by the initial user namespace which userspace has sent
> > a uevent to will need to wait until the lock becomes available.
> >
> > Implementation:
> > This patch gives each network namespace its own uevent sequence number.
> > Each network namespace not owned by the initial user namespace receives its
> > own mutex. The struct uevent_sock is opaque to callers outside of kobject.c
> > so the mutex *can* and *is* only ever accessed in lib/kobject.c. In this
> > file it is clearly documented which lock has to be taken. All network
> > namespaces owned by the initial user namespace will still share the same
> > lock since they are all served sequentially via the uevent socket list.
> > This decouples the locking and ensures that the host retrieves uevents as
> > fast as possible even if there are a lot of uevents injected into network
> > namespaces not owned by the initial user namespace.  In addition, each
> > network namespace not owned by the initial user namespace does not have to
> > wait on any other network namespace not sharing the same user namespace.
> >
> > Testing:
> > Two 4.17-rc1 test kernels were compiled. One with per netns uevent seqnums
> > with decoupled locking and one without. To ensure that testing made sense
> > both kernels carried the patch to remove network namespaces not owned by
> > the initial user namespace from the uevent socket list.
> > Three tests were constructed. All of them showed significant performance
> > improvements with per-netns uevent sequence numbers and decoupled locking.
> >
> >  # Testcase 1:
> >Only Injecting Uevents into network namespaces not owned by the initial
> >user namespace.
> >- created 1000 new user namespace + network namespace pairs
> >- opened a uevent listener in each of those namespace pairs
> >- injected uevents into each of those network namespaces 10,000 times
> >  meaning 10,000,000 (10 million) uevents were injected. (The high
> >  number of uevent injections should get rid of a lot of jitter.)
> >  The injection was done by fork()ing 1000 uevent injectors in a simple
> >  for-loop to ensure that uevents were injected in parallel.
> >- mean transaction time was calculated:
> >  - *without* uevent sequence number namespacing: 67 μs
> >  - *with* uevent sequence number namespacing:55 μs
> >  - makes a difference of:12 μs
> >- a t-test was performed on the two data vectors which revealed
> >  shows significant performance improvements:
> >  Welch Two Sample t-test
> >  data:  x1 and y1
> >  t = 405.16, df = 18883000, p-value < 2.2e-16
> >  alternative hypothesis: true difference in means is not equal to 0
> >  95 percent confidence interval:
> >  12.14949 12.26761
> >  sample estimates:
> >  mean of x mean of y
> >  68.48594  56.27739
> >
> >  # Testcase 2:
> >Injecting Uevents into network namespaces not owned by the initial user
> >namespace and network namespaces owned by the initial user namespace.
> >- created 500 new user namespace + network namespace pairs
> >- created 500 new network namespace pairs
> >- 

[PATCH] selftests: net: add in_netns.sh TEST_GEN_PROGS_EXTENDED

2018-04-24 Thread Anders Roxell
Script in_netns.sh is a utility function and not its own test so it
shouldn't be part of the TEST_PROGS. in_netns.sh gets used by
run_afpackettests.
To install in_netns.sh without being added to the main run_kselftest.sh
script use the TEST_GEN_PROGS_EXTENDED variable.

Fixes: 5ff9c1a3dd92 ("selftests: net: add in_netns.sh to TEST_PROGS")
Signed-off-by: Anders Roxell 
---
 tools/testing/selftests/net/Makefile | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/net/Makefile 
b/tools/testing/selftests/net/Makefile
index c3761c35f542..4a8cfe8071a7 100644
--- a/tools/testing/selftests/net/Makefile
+++ b/tools/testing/selftests/net/Makefile
@@ -5,7 +5,8 @@ CFLAGS =  -Wall -Wl,--no-as-needed -O2 -g
 CFLAGS += -I../../../../usr/include/
 
 TEST_PROGS := run_netsocktests run_afpackettests test_bpf.sh netdevice.sh 
rtnetlink.sh
-TEST_PROGS += fib_tests.sh fib-onlink-tests.sh in_netns.sh pmtu.sh
+TEST_PROGS += fib_tests.sh fib-onlink-tests.sh pmtu.sh
+TEST_GEN_PROGS_EXTENDED := in_netns.sh
 TEST_GEN_FILES =  socket
 TEST_GEN_FILES += psock_fanout psock_tpacket msg_zerocopy
 TEST_GEN_FILES += tcp_mmap
-- 
2.11.0



Re: [PATCH bpf-next] tools/bpf: remove test_sock_addr from TEST_GEN_PROGS

2018-04-24 Thread Daniel Borkmann
On 04/24/2018 11:45 PM, Yonghong Song wrote:
> Since test_sock_addr is not supposed to run by itself,
> remove it from TEST_GEN_PROGS and add it to
> TEST_GEN_PROGS_EXTENDED. This way, run_tests will
> not run test_sock_addr. The corresponding test to run
> is test_sock_addr.sh.
> 
> Signed-off-by: Yonghong Song 

Applied to bpf-next, thanks Yonghong!


Re: [PATCH net-next 1/2 v2] netns: restrict uevents

2018-04-24 Thread Eric W. Biederman

We already do this in practice in userspace.  It doesn't make much
sense to perform this delivery.  So we might as well make this optimization.

Christian Brauner  writes:
> commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")
>
> enabled sending hotplug events into all network namespaces back in 2010.
> Over time the set of uevents that get sent into all network namespaces has
> shrunk a little. We have now reached the point where hotplug events for all
> devices that carry a namespace tag are filtered according to that
> namespace. Specifically, they are filtered whenever the namespace tag of
> the kobject does not match the namespace tag of the netlink socket. One
> example are network devices. Uevents for network devices only show up in
> the network namespaces these devices are moved to or created in.
>
> However, any uevent for a kobject that does not have a namespace tag
> associated with it will not be filtered and we will broadcast it into all
> network namespaces. This behavior stopped making sense when user namespaces
> were introduced.
>
> This patch restricts uevents to the initial user namespace for a couple of
> reasons that have been extensively discussed on the mailing list [1].
> - Thundering herd:
>   Broadcasting uevents into all network namespaces introduces significant
>   overhead.
>   All processes that listen to uevents running in non-initial user
>   namespaces will end up responding to uevents that will be meaningless to
>   them. Mainly, because non-initial user namespaces cannot easily manage
>   devices unless they have a privileged host-process helping them out. This
>   means that there will be a thundering herd of activity when there
>   shouldn't be any.
> - Uevents from non-root users are already filtered in userspace:
>   Uevents are filtered by userspace in a user namespace because the
>   received uid != 0. Instead the uid associated with the event will be
>   65534 == "nobody" because the global root uid is not mapped.
>   This means we can safely and without introducing regressions modify the
>   kernel to not send uevents into all network namespaces whose owning user
>   namespace is not the initial user namespace because we know that
>   userspace will ignore the message because of the uid anyway. I have
>   a) verified that this is true for every udev implementation out there b)
>   that this behavior has been present in all udev implementations from the
>   very beginning.
> - Removing needless overhead/Increasing performance:
>   Currently, the uevent socket for each network namespace is added to the
>   global variable uevent_sock_list. The list itself needs to be protected
>   by a mutex. So every time a uevent is generated the mutex is taken on the
>   list. The mutex is held *from the creation of the uevent (memory
>   allocation, string creation etc. until all uevent sockets have been
>   handled*. This is aggravated by the fact that for each uevent socket that
>   has listeners the mc_list must be walked as well which means we're
>   talking O(n^2) here. Given that a standard Linux workload usually has
>   quite a lot of network namespaces and - in the face of containers - a lot
>   of user namespaces this quickly becomes a performance problem (see
>   "Thundering herd" above). By just recording uevent sockets of network
>   namespaces that are owned by the initial user namespace we significantly
>   increase performance in this codepath.
> - Injecting uevents:
>   There's a valid argument that containers might be interested in receiving
>   device events especially if they are delegated to them by a privileged
>   userspace process. One prime example are SR-IOV enabled devices that are
>   explicitly designed to be handed off to other users such as VMs or
>   containers.
>   This use-case can now be correctly handled since
>   commit 692ec06d7c92 ("netns: send uevent messages"). This commit
>   introduced the ability to send uevents from userspace. As such we can let
>   a sufficiently privileged (CAP_SYS_ADMIN in the owning user namespace of
>   the network namespace of the netlink socket) userspace process make a
>   decision what uevents should be sent. This removes the need to blindly
>   broadcast uevents into all user namespaces and provides a performant and
>   safe solution to this problem.
> - Filtering logic:
>   This patch filters by *owning user namespace of the network namespace a
>   given task resides in* and not by user namespace of the task per se. This
>   means if the user namespace of a given task is unshared but the network
>   namespace is kept and is owned by the initial user namespace a listener
>   that is opening the uevent socket in that network namespace can still
>   listen to uevents.
>
> [1]: https://lkml.org/lkml/2018/4/4/739
> Signed-off-by: Christian Brauner 
Acked-by: "Eric W. Biederman" 

> ---
> Changelog v1->v2:

Re: [PATCH v2] selftests: bpf: update .gitignore with missing file

2018-04-24 Thread Daniel Borkmann
On 04/24/2018 12:53 AM, Anders Roxell wrote:
> Fixes: c0fa1b6c3efc ("bpf: btf: Add BTF tests")
> Signed-off-by: Anders Roxell 
> ---
> Rebased against bpf-next.

Applied to bpf-next, thanks Anders!


Re: [PATCH bpf-next v5 0/2] bpf: allow map helpers access to map values directly

2018-04-24 Thread Daniel Borkmann
On 04/24/2018 03:06 PM, Paul Chaignon wrote:
> Currently, helpers that expect ARG_PTR_TO_MAP_KEY and ARG_PTR_TO_MAP_VALUE
> can only access stack and packet memory.  This patchset allows these
> helpers to directly access map values by passing registers of type
> PTR_TO_MAP_VALUE.
> 
> The first patch changes the verifier; the second adds new test cases.
> 
> The first three versions of this patchset were sent on the iovisor-dev
> mailing list only.
> 
> Changelogs:
>   Changes in v5:
> - Refactor using check_helper_mem_access.
>   Changes in v4:
> - Rebase.
>   Changes in v3:
> - Bug fixes.
> - Negative test cases.
>   Changes in v2:
> - Additional test cases for adjusted maps.

Applied to bpf-next, thanks Paul!
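
For illustration, a sketch of the pattern this change permits (not one
of the patchset's test cases; the map names and program section below
are made up, and bpf_helpers.h refers to the selftests helper header):

#include <linux/bpf.h>
#include "bpf_helpers.h"

struct bpf_map_def SEC("maps") index_map = {
	.type = BPF_MAP_TYPE_ARRAY,
	.key_size = sizeof(__u32),
	.value_size = sizeof(__u32),
	.max_entries = 1,
};

struct bpf_map_def SEC("maps") data_map = {
	.type = BPF_MAP_TYPE_HASH,
	.key_size = sizeof(__u32),
	.value_size = sizeof(__u64),
	.max_entries = 16,
};

SEC("classifier")
int pass_map_value_as_key(struct __sk_buff *skb)
{
	__u32 zero = 0;
	__u32 *idx;
	__u64 *val;

	idx = bpf_map_lookup_elem(&index_map, &zero);
	if (!idx)
		return 0;

	/* idx is PTR_TO_MAP_VALUE; before this change it had to be
	 * copied to the stack before being passed as a map key.
	 */
	val = bpf_map_lookup_elem(&data_map, idx);
	return val ? 1 : 0;
}

char _license[] SEC("license") = "GPL";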



Re: [PATCH net-next 2/2 v2] netns: isolate seqnums to use per-netns locks

2018-04-24 Thread Eric W. Biederman
Christian Brauner  writes:

> Now that it's possible to have a different set of uevents in different
> network namespaces, per-network namespace uevent sequence numbers are
> introduced. This increases performance as locking is now restricted to the
> network namespace affected by the uevent rather than locking everything.
> Testing revealed significant performance improvements. For details see
> "Testing" below.

Maybe.  Your locking is wrong, and a few other things are wrong.  see
below.

> Since commit 692ec06 ("netns: send uevent messages") network namespaces not
> owned by the initial user namespace can be sent uevents from a sufficiently
> privileged userspace process.
> In order to send a uevent into a network namespace not owned by the initial
> user namespace we currently still need to take the *global mutex* that
> locks the uevent socket list even though the list *only contains network
> namespaces owned by the initial user namespace*. This needs to be done
> because the uevent counter is a global variable. Taking the global lock is
> performance sensitive since a user on the host can spawn a pool of n
> processes that each create their own new user and network namespaces and then
> go on to inject uevents in parallel into the network namespace of all of
> these processes. This can have a significant performance impact for the
> host's udevd since it means that there can be a lot of delay between a
> device being added and the corresponding uevent being sent out and
> available for processing by udevd. It also means that each network
> namespace not owned by the initial user namespace which userspace has sent
> a uevent to will need to wait until the lock becomes available.
>
> Implementation:
> This patch gives each network namespace its own uevent sequence number.
> Each network namespace not owned by the initial user namespace receives its
> own mutex. The struct uevent_sock is opaque to callers outside of kobject.c
> so the mutex *can* and *is* only ever accessed in lib/kobject.c. In this
> file it is clearly documented which lock has to be taken. All network
> namespaces owned by the initial user namespace will still share the same
> lock since they are all served sequentially via the uevent socket list.
> This decouples the locking and ensures that the host retrieves uevents as
> fast as possible even if there are a lot of uevents injected into network
> namespaces not owned by the initial user namespace.  In addition, each
> network namespace not owned by the initial user namespace does not have to
> wait on any other network namespace not sharing the same user namespace.
>
> Testing:
> Two 4.17-rc1 test kernels were compiled. One with per netns uevent seqnums
> with decoupled locking and one without. To ensure that testing made sense
> both kernels carried the patch to remove network namespaces not owned by
> the initial user namespace from the uevent socket list.
> Three tests were constructed. All of them showed significant performance
> improvements with per-netns uevent sequence numbers and decoupled locking.
>
>  # Testcase 1:
>Only Injecting Uevents into network namespaces not owned by the initial
>user namespace.
>- created 1000 new user namespace + network namespace pairs
>- opened a uevent listener in each of those namespace pairs
>- injected uevents into each of those network namespaces 10,000 times
>  meaning 10,000,000 (10 million) uevents were injected. (The high
>  number of uevent injections should get rid of a lot of jitter.)
>  The injection was done by fork()ing 1000 uevent injectors in a simple
>  for-loop to ensure that uevents were injected in parallel.
>- mean transaction time was calculated:
>  - *without* uevent sequence number namespacing: 67 μs
>  - *with* uevent sequence number namespacing:55 μs
>  - makes a difference of:12 μs
>- a t-test was performed on the two data vectors which revealed
>  shows significant performance improvements:
>  Welch Two Sample t-test
>  data:  x1 and y1
>  t = 405.16, df = 18883000, p-value < 2.2e-16
>  alternative hypothesis: true difference in means is not equal to 0
>  95 percent confidence interval:
>  12.14949 12.26761
>  sample estimates:
>  mean of x mean of y
>  68.48594  56.27739
>
>  # Testcase 2:
>Injecting Uevents into network namespaces not owned by the initial user
>namespace and network namespaces owned by the initial user namespace.
>- created 500 new user namespace + network namespace pairs
>- created 500 new network namespace pairs
>- opened a uevent listener in each of those namespace pairs
>- injected uevents into each of those network namespaces 10,000 times
>  meaning 10,000,000 (10 million) uevents were injected. (The high
>  number of uevent injections should get rid of a lot of jitter.)
>  The injection was done by 

Re: [PATCH bpf-next,v3 0/2] bpf: add helper for getting xfrm states

2018-04-24 Thread Daniel Borkmann
On 04/24/2018 04:50 PM, Eyal Birger wrote:
> This patchset adds support for fetching XFRM state information from
> an eBPF program called from TC.
> 
> The first patch introduces a helper for fetching an XFRM state from the
> skb's secpath. The XFRM state is modeled using a new virtual struct which
> contains the SPI, peer address, and reqid values of the state; This struct
> can be extended in the future to provide additional state information.
> 
> The second patch adds a test example in test_tunnel_bpf.sh. The sample
> validates the correct extraction of state information by the eBPF program.
> 
> ---
> v3:
>   - Kept SPI and peer IPv4 address in state in network byte order
> following suggestion from Alexei Starovoitov
> v2:
>   - Fixed two comments by Daniel Borkmann:
> - disallow reserved flags in helper call
> - avoid compiling in helper code when CONFIG_XFRM is off

Applied to bpf-next, thanks Eyal!
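
For illustration, a sketch of how a TC program might use the new helper
(based on the description above, not taken from the patchset; the reqid
and SPI values are made up, and the helper prototype is assumed to be
the one this series adds to the selftests' bpf_helpers.h):

#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include "bpf_helpers.h"
#include "bpf_endian.h"

SEC("classifier")
int check_xfrm_state(struct __sk_buff *skb)
{
	struct bpf_xfrm_state x = {};

	/* fetch state 0 of the skb's secpath; returns 0 on success */
	if (bpf_skb_get_xfrm_state(skb, 0, &x, sizeof(x), 0) != 0)
		return TC_ACT_OK;

	/* SPI and peer IPv4 address are kept in network byte order */
	if (x.reqid == 1 && x.spi == bpf_htonl(0xdeadbeef))
		return TC_ACT_SHOT;

	return TC_ACT_OK;
}

char _license[] SEC("license") = "GPL";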



Re: [pci PATCH v8 0/4] Add support for unmanaged SR-IOV

2018-04-24 Thread Bjorn Helgaas
On Sat, Apr 21, 2018 at 05:22:27PM -0700, Alexander Duyck wrote:
> On Sat, Apr 21, 2018 at 1:34 PM, Bjorn Helgaas  wrote:

> > For example, I'm not sure what you mean by "devices where the PF is
> > not capable of managing VF resources."
> >
> > It *sounds* like you're saying the hardware works differently on some
> > devices, but I don't think that's what you mean.  I think you're
> > saying something about which drivers are used for the PF and the VF.
> 
> That is sort of what I am saying.
> 
> So for example with ixgbe there is functionality which is controlled
> in the MMIO space of the PF that affects the functionality of the VFs
> that are generated on the device. The PF has to rearrange the
> resources such as queues and interrupts on the device before it can
> enable SR-IOV, and it could alter those later to limit what the VF is
> capable of doing.
> 
> The model I am dealing with via this patch set has a PF that is not
> much different than the VFs other than the fact that it has some
> extended configuration space bits in place for SR-IOV, ARI, ACS, and
> whatever other bits are needed in order to support spawning isolated
> VFs.

OK, thanks for the explanation, I think I understand what's going on
now, correct me if I'm mistaken.  I added a hint about "PF" for Randy,
too.

These are on pci/virtualization for v4.18.


commit 8effc395c209
Author: Alexander Duyck 
Date:   Sat Apr 21 15:23:09 2018 -0500

PCI/IOV: Add pci_sriov_configure_simple()

SR-IOV (Single Root I/O Virtualization) is an optional PCIe capability (see
PCIe r4.0, sec 9).  A PCIe Function with the SR-IOV capability is referred
to as a PF (Physical Function).  If SR-IOV is enabled on the PF, several
VFs (Virtual Functions) may be created.  The VFs can be individually
assigned to virtual machines, which allows them to share a single hardware
device while being isolated from each other.

Some SR-IOV devices have resources such as queues and interrupts that must
be set up in the PF before enabling the VFs, so they require a PF driver to
do that.

Other SR-IOV devices don't require any PF setup before enabling VFs.  Add a
pci_sriov_configure_simple() interface so PF drivers for such devices can
use it without repeating the VF-enabling code.

Tested-by: Mark Rustad 
Signed-off-by: Alexander Duyck 
[bhelgaas: changelog, comment]
Signed-off-by: Bjorn Helgaas 
Reviewed-by: Greg Rose 
Reviewed-by: Christoph Hellwig 
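
For a PF driver of such a device, wiring this up is essentially a one-liner
in the pci_driver; a hypothetical sketch (driver name and device IDs below
are made up):

	#include <linux/module.h>
	#include <linux/pci.h>

	static int foo_pf_probe(struct pci_dev *pdev, const struct pci_device_id *id)
	{
		return 0;	/* nothing to set up before VFs can be enabled */
	}

	static const struct pci_device_id foo_pf_ids[] = {
		{ PCI_DEVICE(0x1234, 0xabcd) },	/* made-up vendor/device ID */
		{ }
	};
	MODULE_DEVICE_TABLE(pci, foo_pf_ids);

	static struct pci_driver foo_pf_driver = {
		.name			= "foo_pf",
		.id_table		= foo_pf_ids,
		.probe			= foo_pf_probe,
		.sriov_configure	= pci_sriov_configure_simple,
	};
	module_pci_driver(foo_pf_driver);

	MODULE_LICENSE("GPL");

VFs are then enabled from userspace through the standard "sriov_numvfs"
sysfs attribute.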

commit a8ccf8a3
Author: Alexander Duyck 
Date:   Tue Apr 24 16:47:16 2018 -0500

PCI/IOV: Add pci-pf-stub driver for PFs that only enable VFs

Some SR-IOV PF devices provide no functionality other than acting as a
means of enabling VFs.  For these devices, we want to enable the VFs and
assign them to guest virtual machines, but there's no need to have a driver
for the PF itself.

Add a new pci-pf-stub driver to claim those PF devices and provide the
generic VF enable functionality.  An administrator can use the sysfs
"sriov_numvfs" file to enable VFs, then assign them to guests.

For now I only have one example ID provided by Amazon in terms of devices
that require this functionality.  The general idea is that in the future we
will see other devices added as vendors come up with devices where the PF
is more or less just a lightweight shim used to allocate VFs.

Signed-off-by: Alexander Duyck 
[bhelgaas: changelog]
Signed-off-by: Bjorn Helgaas 
Reviewed-by: Greg Rose 
Reviewed-by: Christoph Hellwig 

commit 115ddc491922
Author: Alexander Duyck 
Date:   Tue Apr 24 16:47:22 2018 -0500

net: ena: Use pci_sriov_configure_simple() to enable VFs

Instead of implementing our own version of a SR-IOV configuration stub in
the ena driver, use the existing pci_sriov_configure_simple() function.

Signed-off-by: Alexander Duyck 
Signed-off-by: Bjorn Helgaas 
Reviewed-by: Greg Rose 
Reviewed-by: Christoph Hellwig 

commit 74d986abc20b
Author: Alexander Duyck 
Date:   Tue Apr 24 16:47:27 2018 -0500

nvme-pci: Use pci_sriov_configure_simple() to enable VFs

Instead of implementing our own version of a SR-IOV configuration stub in
the nvme driver, use the existing pci_sriov_configure_simple() function.

Signed-off-by: Alexander Duyck 
Signed-off-by: Bjorn Helgaas 
Reviewed-by: Christoph Hellwig 


[PATCH bpf-next] tools/bpf: remove test_sock_addr from TEST_GEN_PROGS

2018-04-24 Thread Yonghong Song
Since test_sock_addr is not supposed to run by itself,
remove it from TEST_GEN_PROGS and add it to
TEST_GEN_PROGS_EXTENDED. This way, run_tests will
not run test_sock_addr. The corresponding test to run
is test_sock_addr.sh.

Signed-off-by: Yonghong Song 
---
 tools/testing/selftests/bpf/Makefile | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/bpf/Makefile 
b/tools/testing/selftests/bpf/Makefile
index 0b72cc7..64037ee 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -24,7 +24,7 @@ urandom_read: urandom_read.c
 # Order correspond to 'make run_tests' order
 TEST_GEN_PROGS = test_verifier test_tag test_maps test_lru_map test_lpm_map 
test_progs \
test_align test_verifier_log test_dev_cgroup test_tcpbpf_user \
-   test_sock test_sock_addr test_btf
+   test_sock test_btf
 
 TEST_GEN_FILES = test_pkt_access.o test_xdp.o test_l4lb.o test_tcp_estats.o 
test_obj_id.o \
test_pkt_md_access.o test_xdp_redirect.o test_xdp_meta.o 
sockmap_parse_prog.o \
@@ -43,7 +43,7 @@ TEST_PROGS := test_kmod.sh \
test_sock_addr.sh
 
 # Compile but not part of 'make run_tests'
-TEST_GEN_PROGS_EXTENDED = test_libbpf_open
+TEST_GEN_PROGS_EXTENDED = test_libbpf_open test_sock_addr
 
 include ../lib.mk
 
-- 
2.9.5



[PATCH net-next] sctp: fix identification of new acks for SFR-CACC

2018-04-24 Thread Marcelo Ricardo Leitner
It's currently written as:

if (!tchunk->tsn_gap_acked) {   [1]
tchunk->tsn_gap_acked = 1;
...
}

if (TSN_lte(tsn, sack_ctsn)) {
if (!tchunk->tsn_gap_acked) {
/* SFR-CACC processing */
...
}
}

This causes the SFR-CACC processing on ack reception to never run, as
tchunk->tsn_gap_acked is already true by then. Block [1] was moved to
that position by the commit referenced in the Fixes tag below.

This patch fixes it by doing SFR-CACC processing earlier, before
tsn_gap_acked is set to true.
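
With the fix applied, the ordering in sctp_check_transmitted() becomes
roughly the following (sketch only; the diff below has the exact code):

	if (transport && !tchunk->tsn_gap_acked) {
		/* RTT measurement ... */

		if (TSN_lte(tsn, sack_ctsn)) {
			/* SFR-CACC processing runs here, while
			 * tsn_gap_acked is still false */
		}
	}

	if (!tchunk->tsn_gap_acked) {	[1]
		tchunk->tsn_gap_acked = 1;
		...
	}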

Fixes: 31b02e154940 ("sctp: Failover transmitted list on transport delete")
Signed-off-by: Marcelo Ricardo Leitner 
---

Even though this is a -stable candidate, please apply it to net-next
to avoid conflicts with subsequent patches in my queue. Thanks.

 net/sctp/outqueue.c | 48 +++-
 1 file changed, 23 insertions(+), 25 deletions(-)

diff --git a/net/sctp/outqueue.c b/net/sctp/outqueue.c
index 
f211b3db6a3543073e113da121bb28518b0af491..dee7cbd5483149024f2f3195db2fe4d473b1a00a
 100644
--- a/net/sctp/outqueue.c
+++ b/net/sctp/outqueue.c
@@ -1457,7 +1457,7 @@ static void sctp_check_transmitted(struct sctp_outq *q,
 * the outstanding bytes for this chunk, so only
 * count bytes associated with a transport.
 */
-   if (transport) {
+   if (transport && !tchunk->tsn_gap_acked) {
/* If this chunk is being used for RTT
 * measurement, calculate the RTT and update
 * the RTO using this value.
@@ -1469,14 +1469,34 @@ static void sctp_check_transmitted(struct sctp_outq *q,
 * first instance of the packet or a later
 * instance).
 */
-   if (!tchunk->tsn_gap_acked &&
-   !sctp_chunk_retransmitted(tchunk) &&
+   if (!sctp_chunk_retransmitted(tchunk) &&
tchunk->rtt_in_progress) {
tchunk->rtt_in_progress = 0;
rtt = jiffies - tchunk->sent_at;
sctp_transport_update_rto(transport,
  rtt);
}
+
+   if (TSN_lte(tsn, sack_ctsn)) {
+   /*
+* SFR-CACC algorithm:
+* 2) If the SACK contains gap acks
+* and the flag CHANGEOVER_ACTIVE is
+* set the receiver of the SACK MUST
+* take the following action:
+*
+* B) For each TSN t being acked that
+* has not been acked in any SACK so
+* far, set cacc_saw_newack to 1 for
+* the destination that the TSN was
+* sent to.
+*/
+   if (sack->num_gap_ack_blocks &&
+   q->asoc->peer.primary_path->cacc.
+   changeover_active)
+   transport->cacc.cacc_saw_newack
+   = 1;
+   }
}

/* If the chunk hasn't been marked as ACKED,
@@ -1508,28 +1528,6 @@ static void sctp_check_transmitted(struct sctp_outq *q,
restart_timer = 1;
forward_progress = true;

-   if (!tchunk->tsn_gap_acked) {
-   /*
-* SFR-CACC algorithm:
-* 2) If the SACK contains gap acks
-* and the flag CHANGEOVER_ACTIVE is
-* set the receiver of the SACK MUST
-* take the following action:
-*
-* B) For each TSN t being acked that
-* has not been acked in any SACK so
-* far, set cacc_saw_newack to 1 for
-* the destination that the TSN was
- 

[PATCH net-next] sctp: fix const parameter violation in sctp_make_sack

2018-04-24 Thread Marcelo Ricardo Leitner
sctp_make_sack() makes changes to the asoc, and this cast is just
bypassing the const attribute. As there is no need to have the const
there, just remove it and fix the violation.
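
In general terms, the pattern being removed is the one below (a generic
sketch, not the sctp code itself):

	struct foo { unsigned long counter; };

	static void update(const struct foo *f)
	{
		struct foo *w = (struct foo *)f;	/* silently discards const */

		w->counter++;	/* the caller's supposedly-const object is modified */
	}

When a function really does modify its argument, the honest fix is the one
taken here: drop the const from the prototype instead of casting it away.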

Signed-off-by: Marcelo Ricardo Leitner 
---

This one can go to net or net-next, but targetting net-next here just to
keep it together with the rest (which I'll post as patches get in).

 include/net/sctp/sm.h| 2 +-
 net/sctp/sm_make_chunk.c | 9 -
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/include/net/sctp/sm.h b/include/net/sctp/sm.h
index 
2d0e782c90551377ad654bcef1224bbdb75ba394..f4b657478a304050851f33d92c71162a4a4a2e50
 100644
--- a/include/net/sctp/sm.h
+++ b/include/net/sctp/sm.h
@@ -207,7 +207,7 @@ struct sctp_chunk *sctp_make_datafrag_empty(const struct 
sctp_association *asoc,
int len, __u8 flags, gfp_t gfp);
 struct sctp_chunk *sctp_make_ecne(const struct sctp_association *asoc,
  const __u32 lowest_tsn);
-struct sctp_chunk *sctp_make_sack(const struct sctp_association *asoc);
+struct sctp_chunk *sctp_make_sack(struct sctp_association *asoc);
 struct sctp_chunk *sctp_make_shutdown(const struct sctp_association *asoc,
  const struct sctp_chunk *chunk);
 struct sctp_chunk *sctp_make_shutdown_ack(const struct sctp_association *asoc,
diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
index 
5a4fb1dc8400a0316177ce65be8126857297eb5e..db93eabd6ef500ab20be6e091ee06bd3b60923c9
 100644
--- a/net/sctp/sm_make_chunk.c
+++ b/net/sctp/sm_make_chunk.c
@@ -779,10 +779,9 @@ struct sctp_chunk *sctp_make_datafrag_empty(const struct 
sctp_association *asoc,
  * association.  This reports on which TSN's we've seen to date,
  * including duplicates and gaps.
  */
-struct sctp_chunk *sctp_make_sack(const struct sctp_association *asoc)
+struct sctp_chunk *sctp_make_sack(struct sctp_association *asoc)
 {
	struct sctp_tsnmap *map = (struct sctp_tsnmap *)&asoc->peer.tsn_map;
-   struct sctp_association *aptr = (struct sctp_association *)asoc;
struct sctp_gap_ack_block gabs[SCTP_MAX_GABS];
__u16 num_gabs, num_dup_tsns;
struct sctp_transport *trans;
@@ -857,7 +856,7 @@ struct sctp_chunk *sctp_make_sack(const struct 
sctp_association *asoc)

/* Add the duplicate TSN information.  */
if (num_dup_tsns) {
-   aptr->stats.idupchunks += num_dup_tsns;
+   asoc->stats.idupchunks += num_dup_tsns;
sctp_addto_chunk(retval, sizeof(__u32) * num_dup_tsns,
 sctp_tsnmap_get_dups(map));
}
@@ -869,11 +868,11 @@ struct sctp_chunk *sctp_make_sack(const struct 
sctp_association *asoc)
 * association so no transport will match after a wrap event like this,
 * Until the next sack
 */
-   if (++aptr->peer.sack_generation == 0) {
+   if (++asoc->peer.sack_generation == 0) {
		list_for_each_entry(trans, &asoc->peer.transport_addr_list,
transports)
trans->sack_generation = 0;
-   aptr->peer.sack_generation = 1;
+   asoc->peer.sack_generation = 1;
}
 nodata:
return retval;
--
2.14.3



[PATCH net-next] neighbour: support for NTF_EXT_LEARNED flag

2018-04-24 Thread Roopa Prabhu
From: Roopa Prabhu 

This patch extends NTF_EXT_LEARNED support to the neighbour system.
Example use-case: An Ethernet VPN implementation (e.g. in the FRR routing
suite) can use this flag to add dynamic, reachable, external neigh entries
learned via the control plane. The use of neigh NTF_EXT_LEARNED in this
patch is consistent with its use with bridge and vxlan fdb entries.
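
For reference, a control plane daemon would install such an entry with an
RTM_NEWNEIGH request carrying NTF_EXT_LEARNED in ndm_flags. A minimal,
hypothetical userspace sketch (interface name, IP and MAC addresses are
made up, error handling trimmed):

	#include <linux/netlink.h>
	#include <linux/rtnetlink.h>
	#include <linux/neighbour.h>
	#include <net/if.h>
	#include <arpa/inet.h>
	#include <string.h>
	#include <stdio.h>
	#include <unistd.h>
	#include <sys/socket.h>

	struct nl_req {
		struct nlmsghdr nh;
		struct ndmsg ndm;
		char attrs[64];
	};

	static void add_attr(struct nlmsghdr *nh, int type, const void *data, int len)
	{
		struct rtattr *rta = (struct rtattr *)((char *)nh + NLMSG_ALIGN(nh->nlmsg_len));

		rta->rta_type = type;
		rta->rta_len = RTA_LENGTH(len);
		memcpy(RTA_DATA(rta), data, len);
		nh->nlmsg_len = NLMSG_ALIGN(nh->nlmsg_len) + RTA_ALIGN(rta->rta_len);
	}

	int main(void)
	{
		struct sockaddr_nl kernel = { .nl_family = AF_NETLINK };
		unsigned char lladdr[6] = { 0x00, 0x11, 0x22, 0x33, 0x44, 0x55 };
		struct nl_req req = { 0 };
		struct in_addr dst;
		int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

		req.nh.nlmsg_len   = NLMSG_LENGTH(sizeof(req.ndm));
		req.nh.nlmsg_type  = RTM_NEWNEIGH;
		req.nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_REPLACE;
		req.ndm.ndm_family  = AF_INET;
		req.ndm.ndm_ifindex = if_nametoindex("eth0");	/* example interface */
		req.ndm.ndm_state   = NUD_REACHABLE;
		req.ndm.ndm_flags   = NTF_EXT_LEARNED;		/* externally learned */

		inet_pton(AF_INET, "192.0.2.1", &dst);
		add_attr(&req.nh, NDA_DST, &dst, sizeof(dst));
		add_attr(&req.nh, NDA_LLADDR, lladdr, sizeof(lladdr));

		if (sendto(fd, &req, req.nh.nlmsg_len, 0,
			   (struct sockaddr *)&kernel, sizeof(kernel)) < 0)
			perror("sendto");
		close(fd);
		return 0;
	}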

Signed-off-by: Roopa Prabhu 
---
 include/net/neighbour.h | 19 ++-
 net/core/neighbour.c|  8 +++-
 2 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index e421f86..6c1eecd 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -246,6 +246,7 @@ static inline void *neighbour_priv(const struct neighbour 
*n)
 #define NEIGH_UPDATE_F_OVERRIDE	0x0001
 #define NEIGH_UPDATE_F_WEAK_OVERRIDE   0x0002
 #define NEIGH_UPDATE_F_OVERRIDE_ISROUTER   0x0004
+#define NEIGH_UPDATE_F_EXT_LEARNED 0x2000
 #define NEIGH_UPDATE_F_ISROUTER	0x4000
 #define NEIGH_UPDATE_F_ADMIN   0x8000
 
@@ -526,5 +527,21 @@ static inline void neigh_ha_snapshot(char *dst, const 
struct neighbour *n,
	} while (read_seqretry(&n->ha_lock, seq));
 }
 
-
+static inline void neigh_update_ext_learned(struct neighbour *neigh, u32 flags,
+   int *notify)
+{
+   u8 ndm_flags = 0;
+
+   if (!(flags & NEIGH_UPDATE_F_ADMIN))
+   return;
+
+   ndm_flags |= (flags & NEIGH_UPDATE_F_EXT_LEARNED) ? NTF_EXT_LEARNED : 0;
+   if ((neigh->flags ^ ndm_flags) & NTF_EXT_LEARNED) {
+   if (ndm_flags & NTF_EXT_LEARNED)
+   neigh->flags |= NTF_EXT_LEARNED;
+   else
+   neigh->flags &= ~NTF_EXT_LEARNED;
+   *notify = 1;
+   }
+}
 #endif
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index ce51986..5afae29 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -820,7 +820,8 @@ static void neigh_periodic_work(struct work_struct *work)
		write_lock(&n->lock);
 
state = n->nud_state;
-   if (state & (NUD_PERMANENT | NUD_IN_TIMER)) {
+   if ((state & (NUD_PERMANENT | NUD_IN_TIMER)) ||
+   (n->flags & NTF_EXT_LEARNED)) {
			write_unlock(&n->lock);
goto next_elt;
}
@@ -1136,6 +1137,8 @@ int neigh_update(struct neighbour *neigh, const u8 
*lladdr, u8 new,
if (neigh->dead)
goto out;
 
+   neigh_update_ext_learned(neigh, flags, &notify);
+
if (!(new & NUD_VALID)) {
neigh_del_timer(neigh);
if (old & NUD_CONNECTED)
@@ -1781,6 +1784,9 @@ static int neigh_add(struct sk_buff *skb, struct nlmsghdr 
*nlh,
flags &= ~NEIGH_UPDATE_F_OVERRIDE;
}
 
+   if (ndm->ndm_flags & NTF_EXT_LEARNED)
+   flags |= NEIGH_UPDATE_F_EXT_LEARNED;
+
if (ndm->ndm_flags & NTF_USE) {
neigh_event_send(neigh, NULL);
err = 0;
-- 
2.1.4



Re: [PATCH net-next 2/2 v1] netns: isolate seqnums to use per-netns locks

2018-04-24 Thread Christian Brauner
On Tue, Apr 24, 2018 at 03:39:25PM -0400, David Miller wrote:
> From: Christian Brauner 
> Date: Mon, 23 Apr 2018 12:24:43 +0200
> 
> > +   #ifdef CONFIG_NET
> > +   seqnum = get_ns_uevent_seqnum_by_vpid();
> > +   #else
> > +   seqnum = uevent_seqnum;
> > +   #endif
> 
> Please don't indend the code like this.
> 
> By indenting the CPP directives, which should be at column zero, the
> actual code became double indented.

Ah, sorry. Sent v2 with the indentation fixed just now.

Thanks!
Christian


[PATCH net-next 0/2 v2] netns: uevent performance tweaks

2018-04-24 Thread Christian Brauner
Hey everyone,

This is v2 of "netns: uevent performance tweaks" which contains *no
functional changes*, just a minor indentation fix as requested by David.

As Eric requested, I did extensive testing that shows significant
performance improvements when using per-netns uevent sequence numbers
with decoupled locks. The results and test descriptions were added to
the commit message of
[PATCH 2/2 v1] netns: isolate seqnums to use per-netns locks.

This series deals with a bunch of performance improvements when sending
out uevents that have been extensively discussed here:
https://lkml.org/lkml/2018/4/10/592

- Only record uevent sockets from network namespaces owned by the
  initial user namespace in the global uevent socket list.
  Eric, this is the exact patch we agreed upon in
  https://lkml.org/lkml/2018/4/10/592.
  A very detailed rationale is present in the commit message for
  [PATCH 1/2] netns: restrict uevents
- Decouple the locking for network namespaces in the global uevent
  socket list from the locking for network namespaces not in the global
  uevent socket list.
  A very detailed rationale including performance test results is
  present in the commit message for
  [PATCH 2/2] netns: isolate seqnums to use per-netns locks

Thanks!
Christian

Christian Brauner (2):
  netns: restrict uevents
  netns: isolate seqnums to use per-netns locks

 include/linux/kobject.h |   2 +
 include/net/net_namespace.h |   3 +
 kernel/ksysfs.c |  11 +++-
 lib/kobject_uevent.c| 122 
 net/core/net_namespace.c|  14 +
 5 files changed, 126 insertions(+), 26 deletions(-)

-- 
2.17.0



[PATCH net-next 1/2 v2] netns: restrict uevents

2018-04-24 Thread Christian Brauner
commit 07e98962fa77 ("kobject: Send hotplug events in all network namespaces")

enabled sending hotplug events into all network namespaces back in 2010.
Over time the set of uevents that get sent into all network namespaces has
shrunk a little. We have now reached the point where hotplug events for all
devices that carry a namespace tag are filtered according to that
namespace. Specifically, they are filtered whenever the namespace tag of
the kobject does not match the namespace tag of the netlink socket. One
example are network devices. Uevents for network devices only show up in
the network namespaces these devices are moved to or created in.

However, any uevent for a kobject that does not have a namespace tag
associated with it will not be filtered and we will broadcast it into all
network namespaces. This behavior stopped making sense when user namespaces
were introduced.

This patch restricts uevents to the initial user namespace for a couple of
reasons that have been extensively discussed on the mailing list [1].
- Thundering herd:
  Broadcasting uevents into all network namespaces introduces significant
  overhead.
  All processes that listen to uevents running in non-initial user
  namespaces will end up responding to uevents that will be meaningless to
  them. Mainly, because non-initial user namespaces cannot easily manage
  devices unless they have a privileged host-process helping them out. This
  means that there will be a thundering herd of activity when there
  shouldn't be any.
- Uevents from non-root users are already filtered in userspace:
  Uevents are filtered by userspace in a user namespace because the
  received uid != 0. Instead the uid associated with the event will be
  65534 == "nobody" because the global root uid is not mapped.
  This means we can safely and without introducing regressions modify the
  kernel to not send uevents into all network namespaces whose owning user
  namespace is not the initial user namespace because we know that
  userspace will ignore the message because of the uid anyway. I have
  a) verified that this is true for every udev implementation out there and b)
  that this behavior has been present in all udev implementations from the
  very beginning.
- Removing needless overhead/Increasing performance:
  Currently, the uevent socket for each network namespace is added to the
  global variable uevent_sock_list. The list itself needs to be protected
  by a mutex. So every time a uevent is generated the mutex is taken on the
  list. The mutex is held *from the creation of the uevent (memory
  allocation, string creation etc.) until all uevent sockets have been
  handled*. This is aggravated by the fact that for each uevent socket that
  has listeners the mc_list must be walked as well which means we're
  talking O(n^2) here. Given that a standard Linux workload usually has
  quite a lot of network namespaces and - in the face of containers - a lot
  of user namespaces this quickly becomes a performance problem (see
  "Thundering herd" above). By just recording uevent sockets of network
  namespaces that are owned by the initial user namespace we significantly
  increase performance in this codepath.
- Injecting uevents:
  There's a valid argument that containers might be interested in receiving
  device events especially if they are delegated to them by a privileged
  userspace process. One prime example are SR-IOV enabled devices that are
  explicitly designed to be handed off to other users such as VMs or
  containers.
  This use-case can now be correctly handled since
  commit 692ec06d7c92 ("netns: send uevent messages"). This commit
  introduced the ability to send uevents from userspace. As such we can let
  a sufficiently privileged (CAP_SYS_ADMIN in the owning user namespace of
  the network namespace of the netlink socket) userspace process make a
  decision what uevents should be sent. This removes the need to blindly
  broadcast uevents into all user namespaces and provides a performant and
  safe solution to this problem.
- Filtering logic:
  This patch filters by *owning user namespace of the network namespace a
  given task resides in* and not by user namespace of the task per se. This
  means if the user namespace of a given task is unshared but the network
  namespace is kept and is owned by the initial user namespace, a listener
  that is opening the uevent socket in that network namespace can still
  listen to uevents.
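
Concretely, the core of this change, at uevent socket registration time,
boils down to a check like the following (a rough sketch, not the literal
hunk; the truncated diff below has the real code):

	/* Only uevent sockets of network namespaces owned by the initial
	 * user namespace are kept on the global uevent_sock_list. */
	if (sock_net(ue_sk->sk)->user_ns == &init_user_ns)
		list_add_tail(&ue_sk->list, &uevent_sock_list);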

[1]: https://lkml.org/lkml/2018/4/4/739
Signed-off-by: Christian Brauner 
---
Changelog v1->v2:
* patch unchanged
Changelog v0->v1:
* patch unchanged
---
 lib/kobject_uevent.c | 18 --
 1 file changed, 12 insertions(+), 6 deletions(-)

diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
index 15ea216a67ce..f5f5038787ac 100644
--- a/lib/kobject_uevent.c
+++ b/lib/kobject_uevent.c
@@ -703,9 +703,13 @@ static int uevent_net_init(struct net *net)
 
net->uevent_sock = ue_sk;
 
- 

[PATCH net-next 2/2 v2] netns: isolate seqnums to use per-netns locks

2018-04-24 Thread Christian Brauner
Now that it's possible to have a different set of uevents in different
network namespaces, per-network namespace uevent sequence numbers are
introduced. This increases performance as locking is now restricted to the
network namespace affected by the uevent rather than locking everything.
Testing revealed significant performance improvements. For details see
"Testing" below.

Since commit 692ec06 ("netns: send uevent messages") network namespaces not
owned by the initial user namespace can be sent uevents from a sufficiently
privileged userspace process.
In order to send a uevent into a network namespace not owned by the initial
user namespace we currently still need to take the *global mutex* that
locks the uevent socket list even though the list *only contains network
namespaces owned by the initial user namespace*. This needs to be done
because the uevent counter is a global variable. Taking the global lock is
performance sensitive since a user on the host can spawn a pool of n
processes that each create their own new user and network namespaces and then
go on to inject uevents in parallel into the network namespace of all of
these processes. This can have a significant performance impact for the
host's udevd since it means that there can be a lot of delay between a
device being added and the corresponding uevent being sent out and
available for processing by udevd. It also means that each network
namespace not owned by the initial user namespace which userspace has sent
a uevent to will need to wait until the lock becomes available.

Implementation:
This patch gives each network namespace its own uevent sequence number.
Each network namespace not owned by the initial user namespace receives its
own mutex. The struct uevent_sock is opaque to callers outside of
lib/kobject_uevent.c, so the mutex can only ever be (and only ever is)
accessed in lib/kobject_uevent.c. In this
file it is clearly documented which lock has to be taken. All network
namespaces owned by the initial user namespace will still share the same
lock since they are all served sequentially via the uevent socket list.
This decouples the locking and ensures that the host retrieves uevents as
fast as possible even if there are a lot of uevents injected into network
namespaces not owned by the initial user namespace.  In addition, each
network namespace not owned by the initial user namespace does not have to
wait on any other network namespace not sharing the same user namespace.
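
In structural terms the idea is roughly the following (a sketch only; the
field names here are illustrative and not necessarily those in the patch):

	struct uevent_sock {
		struct list_head list;	/* global list: only netns owned by init_user_ns */
		struct sock *sk;
		/* Serializes uevent sends for a netns that is *not* owned by
		 * init_user_ns; such namespaces no longer take the global
		 * mutex protecting uevent_sock_list at all. */
		struct mutex sk_mutex;
	};

	/* and, per network namespace (e.g. in struct net): */
	u64 uevent_seqnum;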

Testing:
Two 4.17-rc1 test kernels were compiled. One with per netns uevent seqnums
with decoupled locking and one without. To ensure that testing made sense
both kernels carried the patch to remove network namespaces not owned by
the initial user namespace from the uevent socket list.
Three tests were constructed. All of them showed significant performance
improvements with per-netns uevent sequence numbers and decoupled locking.

 # Testcase 1:
   Only Injecting Uevents into network namespaces not owned by the initial
   user namespace.
   - created 1000 new user namespace + network namespace pairs
   - opened a uevent listener in each of those namespace pairs
   - injected uevents into each of those network namespaces 10,000 times
 meaning 10,000,000 (10 million) uevents were injected. (The high
 number of uevent injections should get rid of a lot of jitter.)
 The injection was done by fork()ing 1000 uevent injectors in a simple
 for-loop to ensure that uevents were injected in parallel.
   - mean transaction time was calculated:
 - *without* uevent sequence number namespacing: 67 μs
  - *with* uevent sequence number namespacing:    55 μs
  - makes a difference of:                        12 μs
   - a t-test was performed on the two data vectors which revealed
     significant performance improvements:
 Welch Two Sample t-test
 data:  x1 and y1
 t = 405.16, df = 18883000, p-value < 2.2e-16
 alternative hypothesis: true difference in means is not equal to 0
 95 percent confidence interval:
 12.14949 12.26761
 sample estimates:
 mean of x mean of y
 68.48594  56.27739

 # Testcase 2:
   Injecting Uevents into network namespaces not owned by the initial user
   namespace and network namespaces owned by the initial user namespace.
   - created 500 new user namespace + network namespace pairs
   - created 500 new network namespace pairs
   - opened a uevent listener in each of those namespace pairs
   - injected uevents into each of those network namespaces 10,000 times
 meaning 10,000,000 (10 million) uevents were injected. (The high
 number of uevent injections should get rid of a lot of jitter.)
 The injection was done by fork()ing 1000 uevent injectors in a simple
 for-loop to ensure that uevents were injected in parallel.
   - mean transaction time was calculated:
 - *without* uevent sequence number namespacing: 572 μs
  - *with* uevent sequence number namespacing:    514 μs
 - makes a difference of:   

[GIT] Networking

2018-04-24 Thread David Miller

1) Fix rtnl deadlock in ipvs, from Julian Anastasov.

2) s390 qeth fixes from Julian Wiedmann (control IO completion stalls, bad MAC
   address update sequence, request side races on command IO timeouts).

3) Handle seq_file overflow properly in l2tp, from Guillaume Nault.

4) Fix VLAN priority mappings in cpsw driver, from Ivan Khoronzhuk.

5) Packet scheduler ife action fixes (malformed TLV lengths, etc.)
   from Alexander Aring.

6) Fix out of bounds access in tcp md5 option parser, from Jann Horn.

7) Missing netlink attribute policies in rtm_ipv6_policy table, from
   Eric Dumazet.

8) Missing socket address length checks in l2tp and pppoe connect,
   from Guillaume Nault.

9) Fix netconsole over team and bonding, from Xin Long.

10) Fix race with AF_PACKET socket state bitfields, from Willem de
Bruijn.

Pulled, thanks a lot!

The following changes since commit 83beed7b2b26f232d782127792dd0cd4362fdc41:

  Merge branch 'fixes' of 
git://git.kernel.org/pub/scm/linux/kernel/git/evalenti/linux-soc-thermal 
(2018-04-20 10:56:32 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git 

for you to fetch changes up to d19efb729f10339f91c35003d480dc718cae3b3c:

  Merge branch '1GbE' of 
git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue (2018-04-24 
16:17:59 -0400)


Ahmed Abdelsalam (1):
  ipv6: sr: fix NULL pointer dereference in seg6_do_srh_encap()- v4 pkts

Alexander Aring (3):
  net: sched: ife: signal not finding metaid
  net: sched: ife: handle malformed tlv length
  net: sched: ife: check on metadata length

Anders Roxell (1):
  selftests: bpf: update .gitignore with missing generated files

Anirudh Venkataramanan (2):
  ice: Fix initialization for num_nodes_added
  ice: Fix incorrect comment for action type

Arnd Bergmann (1):
  netfilter: fix CONFIG_NF_REJECT_IPV6=m link error

Ben Shelton (1):
  ice: Do not check INTEVENT bit for OICR interrupts

Colin Ian King (1):
  ixgbevf: ensure xdp_ring resources are free'd on error exit

Cong Wang (3):
  netfilter: conntrack: silent a memory leak warning
  llc: delete timers synchronously in llc_sk_free()
  llc: fix NULL pointer deref for SOCK_ZAPPED

David S. Miller (6):
  Merge branch 's390-qeth-fixes'
  Merge branch 'net-sched-ife-malformed-ife-packet-fixes'
  Merge git://git.kernel.org/.../bpf/bpf
  Merge git://git.kernel.org/.../pablo/nf
  Merge branch 'amd-xgbe-fixes'
  Merge branch '1GbE' of git://git.kernel.org/.../jkirsher/net-queue

Doron Roberts-Kedes (1):
  strparser: Do not call mod_delayed_work with a timeout of LONG_MAX

Edward Cree (1):
  sfc: ARFS filter IDs

Eric Dumazet (1):
  ipv6: add RTA_TABLE and RTA_PREFSRC to rtm_ipv6_policy

Florian Fainelli (1):
  net: ethtool: Add missing kernel doc for FEC parameters

Florian Westphal (4):
  netfilter: nf_conntrack_sip: allow duplicate SDP expectations
  netfilter: ebtables: don't attempt to allocate 0-sized compat array
  netfilter: nf_tables: can't fail after linking rule into active rule list
  netfilter: nf_tables: free set name in error path

Guillaume Nault (3):
  l2tp: fix {pppol2tp, l2tp_dfs}_seq_stop() in case of seq_file overflow
  l2tp: check sockaddr length in pppol2tp_connect()
  pppoe: check sockaddr length in pppoe_connect()

Ivan Khoronzhuk (1):
  net: ethernet: ti: cpsw: fix tx vlan priority mapping

Jack Ma (1):
  netfilter: xt_connmark: Add bit mapping for bit-shift operation.

Jann Horn (2):
  bpf: sockmap remove dead check
  tcp: don't read out-of-bounds opsize

Jingju Hou (1):
  net: phy: marvell: clear wol event before setting it

Julian Anastasov (1):
  ipvs: fix rtnl_lock lockups caused by start_sync_thread

Julian Wiedmann (6):
  s390/qeth: fix error handling in adapter command callbacks
  s390/qeth: avoid control IO completion stalls
  s390/qeth: handle failure on workqueue creation
  s390/qeth: fix MAC address update sequence
  s390/qeth: fix request-side race during cmd IO timeout
  s390/qeth: use Read device to query hypervisor for MAC

Md Fahad Iqbal Polash (1):
  ice: Fix insufficient memory issue in ice_aq_manage_mac_read

Pablo Neira Ayuso (2):
  netfilter: nf_tables: NAT chain and extensions require NF_TABLES
  netfilter: xt_connmark: do not cast xt_connmark_tginfo1 to 
xt_connmark_tginfo2

Stephen Rothwell (1):
  netfilter: conntrack: include kmemleak.h for kmemleak_not_leak()

Taehee Yoo (1):
  netfilter: nf_tables: fix out-of-bounds in nft_chain_commit_update

Thomas Falcon (1):
  ibmvnic: Clean actual number of RX or TX pools

Tom Lendacky (3):
  amd-xgbe: Add pre/post auto-negotiation phy hooks
  amd-xgbe: Improve KR auto-negotiation and training
  amd-xgbe: Only use the SFP supported transceiver signals

Vinicius Costa 

Re: [PATCH net-next] liquidio: Swap VF representor Tx and Rx statistics

2018-04-24 Thread David Miller
From: Felix Manlunas 
Date: Tue, 24 Apr 2018 10:23:27 -0700

> From: Srinivas Jampala 
> 
> Swap VF representor tx and rx interface statistics since it is a
> virtual switchdev port and tx for VM should be rx for VF representor
> and vice-versa.
> 
> Signed-off-by: Srinivas Jampala 
> Acked-by: Derek Chickles 
> Signed-off-by: Felix Manlunas 

Applied, thank you.


Re: [PATCH net-next] net/ipv6: fix LOCKDEP issue in rt6_remove_exception_rt()

2018-04-24 Thread David Miller
From: Eric Dumazet 
Date: Tue, 24 Apr 2018 09:22:49 -0700

> rt6_remove_exception_rt() is called under rcu_read_lock() only.
> 
> We lock rt6_exception_lock a bit later, so we do not hold
> rt6_exception_lock yet.
> 
> Fixes: 8a14e46f1402 ("net/ipv6: Fix missing rcu dereferences on from")
> Signed-off-by: Eric Dumazet 
> Reported-by: syzbot 

Applied, thanks Eric.


Re: [net 0/6][pull request] Intel Wired LAN Driver Updates 2018-04-24

2018-04-24 Thread David Miller
From: Jeff Kirsher 
Date: Tue, 24 Apr 2018 12:43:44 -0700

> On Tue, 2018-04-24 at 12:29 -0700, Jeff Kirsher wrote:
>> This series contains fixes to ixgbevf, igb and ice drivers.
>> 
>> Colin Ian King fixes the return value on error for the new XDP
>> support
>> that went into ixgbevf for 4.16.
> 
> Oops, I meant 4.17, not 4.16.

Pulled with this fixed, thanks Jeff.


packetdrill 2.0 release

2018-04-24 Thread Neal Cardwell
Hi All,

We're happy to announce the 2.0 release of the Google version of the
packetdrill network testing tool.

The code may be found at the packetdrill-v2.0 tag in the Google packetdrill
github repo:
   https://github.com/google/packetdrill

The commit is here:

https://github.com/google/packetdrill/commit/9a0ade62b7c8e3a19854b5855178dc3bb9d7f453

The 2.0 commit message, summarizing features and contributors, is included
below for a quick overview.

cheers,
neal

---
net-test: packetdrill: merge Google packetdrill changes through April 2018

This commit merges into Google's public packetdrill repository the
majority of the packetdrill tool changes made at Google in the period
2013-2018 (after the initial open source release of packetdrill).

Major features added in this commit include:

+ support for testing:
   + cmsg data
   + TCP send timestamping
   + TCP timestamping opt stats (TCP_NLA_BUSY and friends)
   + TCP zero-copy (e.g. see --send_omit_free)
   + TCP MD5 options
   + TCP urgent pointer field
   + experimental and RFC-compliant TCP Fast Open options
   + ICMP sockets
   + the IPv4 or IPv6 TOS field
   + IPv6 flow labels
   + in IPv6-only environments
+ wider system call support:
   + epoll system calls (epoll_create(), epoll_ctl(), epoll_wait())
   + pipe()
   + splice()
   + cap_set()
+ optional final clean-up commands for destructor-like tear-down
   commands that are basically always executed at termination, whether
   scripts fail or succeed
+ improved Python support:
   + exporting symbolic names for tcpi_state values
   + exporting recent additions to Linux struct tcp_info
   + exporting TCP_CC_INFO for Vegas, DCTCP, BBR
   + exporting SO_MEMINFO
+ the ability to test shared libraries that support the sockets API,
   rather than just the kernel sockets API (see packetdrill.h)
+ preprocessor-style symbol definitions, e.g. -Dfoo=bar
+ support for random local IP addresses

Willem de Bruijn spearheaded this effort to upstream this batch of
changes, and put in a huge amount of work to make this happen. I would
like to thank him for all his work on this.

I would also like to thank the following Googlers for their
contributions over the years to the packetdrill code base, which are
reflected in this patch:

   Wei Wang
   Maciej Żenczykowski
   Yuchung Cheng
   Eric Dumazet
   Soheil Hassas Yeganeh
   Dimitris Michailidis
   Willem de Bruijn
   Yaogong Wang
   Eric Salo
   Chonggang Li
   Priyaranjan Jha
   Andreas Terzis
   Xiao Jia
   Mike Maloney
   Barath Raghavan
   Yousuk Seung
   Nandita Dukkipati
   Michael Davidson
   Hsiao-keng Jerry Chu
   Greg Thelen
   Chema Gonzalez
   Luigi Rizzo
   Kevin Athey
   Jeff Grafton
   Francis Y. Yan
   Fabien Duchene
   Bill Sommerfeld
   Anatol Pomazau

This commit has been verified to build cleanly with the default gcc
compiler on the following Linux distributions:

   Debian 8
   Debian 9
   Red Hat Enterprise Linux 7.4
   Ubuntu 14.04
   Ubuntu 17.10

This commit has not been tested on or ported to any BSD variants, due
to lack of time among members of our team. We are happy to accept
patches to get it to compile/run on popular BSD variants.


Re: Boot failures with net-next after rebase to v4.17.0-rc1

2018-04-24 Thread Linus Torvalds
On Tue, Apr 24, 2018 at 12:54 PM, Jesper Dangaard Brouer
 wrote:
> Hi all,
>
> I'm experiencing boot failures with net-next git-tree after it got
> rebased/merged with Linus'es tree at v4.17.0-rc1.

I suspect it's the global bit stuff that came in very late in the
merge window, and had been developed and tested for a while before,
but showed some problems under some configs.

The fix is currently in the x86/pti tree in -tip, see:

   x86/pti: Fix boot problems from Global-bit setting

and I expect it will percolate upstream soon.

In the meantime, it would be good to verify that merging that x86/pti
branch fixes it for you?

There is another candidate for boot problems - do you happen to have
CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled? That can under certain
circumstances get a percpu setup page fault because memory hadn't been
initialized sufficiently.

The fix there is to move the mm_init() call one step earlier in
init/main.c: start_kernel() (to before trap_init()).

And if it's neither of the above, I think you'll need to help bisect it.

   Linus


Re: [PATCH] net/tls: remove redundant second null check on sgout

2018-04-24 Thread David Miller
From: Colin King 
Date: Tue, 24 Apr 2018 13:36:58 +0100

> From: Colin Ian King 
> 
> A duplicated null check on sgout is redundant as it is known to be
> already true because of the identical earlier check. Remove it.
> Detected by cppcheck:
> 
> net/tls/tls_sw.c:696: (warning) Identical inner 'if' condition is always
> true.
> 
> Signed-off-by: Colin Ian King 

Applied to net-next, thank you.


Re: [PATCH] fsl/fman_port: remove redundant check on port->rev_info.major

2018-04-24 Thread David Miller
From: Colin King 
Date: Tue, 24 Apr 2018 12:39:45 +0100

> From: Colin Ian King 
> 
> The check port->rev_info.major >= 6 is being performed twice, thus
> the inner second check is always true and is redundant, hence it
> can be removed. Detected by cppcheck.
> 
> drivers/net/ethernet/freescale/fman/fman_port.c:1394]: (warning)
> Identical inner 'if' condition is always true.
> 
> Signed-off-by: Colin Ian King 

Applied to net-next, thank you.


Boot failures with net-next after rebase to v4.17.0-rc1

2018-04-24 Thread Jesper Dangaard Brouer
Hi all,

I'm experiencing boot failures with net-next git-tree after it got
rebased/merged with Linus'es tree at v4.17.0-rc1.

The boot problem only occurs for certain kernel configs. I've bisected
the config problem down to enabling CONFIG_PREEMPT=y and resulting
dependencies in below diff.

Is this a known problem?
Have others experienced this too?

This happens for me on two different (x86_64) testlab machines...
I also tested on Linus'es tree at v4.17-rc2, and problem also exists
for me there.
-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer


--- config21-steps-works2018-04-24 21:33:42.353751894 +0200
+++ config20-steps-bad  2018-04-24 21:27:19.852654328 +0200
@@ -131,7 +131,7 @@
 #
 # RCU Subsystem
 #
-CONFIG_TREE_RCU=y
+CONFIG_PREEMPT_RCU=y
 # CONFIG_RCU_EXPERT is not set
 CONFIG_SRCU=y
 CONFIG_TREE_SRCU=y
@@ -421,11 +421,7 @@
 CONFIG_BFQ_GROUP_IOSCHED=y
 CONFIG_PREEMPT_NOTIFIERS=y
 CONFIG_ASN1=y
-CONFIG_INLINE_SPIN_UNLOCK_IRQ=y
-CONFIG_INLINE_READ_UNLOCK=y
-CONFIG_INLINE_READ_UNLOCK_IRQ=y
-CONFIG_INLINE_WRITE_UNLOCK=y
-CONFIG_INLINE_WRITE_UNLOCK_IRQ=y
+CONFIG_UNINLINE_SPIN_UNLOCK=y
 CONFIG_ARCH_SUPPORTS_ATOMIC_RMW=y
 CONFIG_MUTEX_SPIN_ON_OWNER=y
 CONFIG_RWSEM_SPIN_ON_OWNER=y
@@ -497,9 +493,10 @@
 CONFIG_SCHED_SMT=y
 CONFIG_SCHED_MC=y
 CONFIG_SCHED_MC_PRIO=y
-CONFIG_PREEMPT_NONE=y
+# CONFIG_PREEMPT_NONE is not set
 # CONFIG_PREEMPT_VOLUNTARY is not set
-# CONFIG_PREEMPT is not set
+CONFIG_PREEMPT=y
+CONFIG_PREEMPT_COUNT=y
 CONFIG_X86_LOCAL_APIC=y
 CONFIG_X86_IO_APIC=y
 CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS=y
@@ -3931,6 +3928,7 @@
 # CONFIG_SCHEDSTATS is not set
 # CONFIG_SCHED_STACK_END_CHECK is not set
 # CONFIG_DEBUG_TIMEKEEPING is not set
+# CONFIG_DEBUG_PREEMPT is not set
 
 #
 # Lock Debugging (spinlocks, mutexes, etc...)
@@ -3996,6 +3994,7 @@
 CONFIG_FUNCTION_GRAPH_TRACER=y
 # CONFIG_PREEMPTIRQ_EVENTS is not set
 # CONFIG_IRQSOFF_TRACER is not set
+# CONFIG_PREEMPT_TRACER is not set
 # CONFIG_SCHED_TRACER is not set
 CONFIG_HWLAT_TRACER=y
 # CONFIG_FTRACE_SYSCALLS is not set


Re: [PATCH net-next v5 0/3] kernel: add support to collect hardware logs in crash recovery kernel

2018-04-24 Thread David Miller
From: Rahul Lakkireddy 
Date: Sat, 21 Apr 2018 22:35:52 +0530

> Patch 1 adds API to vmcore module to allow drivers to register callback
> to collect the device specific hardware/firmware logs.  The logs will
> be added to /proc/vmcore as elf notes.
> 
> Patch 2 updates read and mmap logic to append device specific hardware/
> firmware logs as elf notes.
> 
> Patch 3 shows a cxgb4 driver example using the API to collect
> hardware/firmware logs in crash recovery kernel, before hardware is
> initialized.

Are there any serious remaining objections to this series?  I'm going to
integrate this into net-next soon if not.

Thank you.


Re: [net 0/6][pull request] Intel Wired LAN Driver Updates 2018-04-24

2018-04-24 Thread Jeff Kirsher
On Tue, 2018-04-24 at 12:29 -0700, Jeff Kirsher wrote:
> This series contains fixes to ixgbevf, igb and ice drivers.
> 
> Colin Ian King fixes the return value on error for the new XDP
> support
> that went into ixgbevf for 4.16.

Oops, I meant 4.17, not 4.16.

> 
> Vinicius provides a fix for queue 0 for igb, which was not receiving
> all
> the credits it needed when QAV mode was enabled.
> 
> Anirudh provides several fixes for the new ice driver, starting with
> properly initializing num_nodes_added to zero.  Fixed up a code
> comment
> to better reflect what is really going on in the code.  Fixed how to
> detect if an OICR interrupt has occurred to a more reliable method.
> 
> Md Fahad fixes the ice driver to allocate the right amount of memory
> when reading and storing the device's MAC addresses.  The device can
> have
> up to 2 MAC addresses (LAN and WoL), while WoL is currently not
> supported, we need to ensure it can be properly handled when support
> is
> added.
> 
> The following are changes since commit
> 9cf2f437ca5b39828984064fad213e68fc17ef11:
>   team: fix netconsole setup over team
> and are available in the git repository at:
>   git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue
> 1GbE
> 
> Anirudh Venkataramanan (2):
>   ice: Fix initialization for num_nodes_added
>   ice: Fix incorrect comment for action type
> 
> Ben Shelton (1):
>   ice: Do not check INTEVENT bit for OICR interrupts
> 
> Colin Ian King (1):
>   ixgbevf: ensure xdp_ring resources are free'd on error exit
> 
> Md Fahad Iqbal Polash (1):
>   ice: Fix insufficient memory issue in ice_aq_manage_mac_read
> 
> Vinicius Costa Gomes (1):
>   igb: Fix the transmission mode of queue 0 for Qav mode
> 
>  drivers/net/ethernet/intel/ice/ice_adminq_cmd.h   |  2 +-
>  drivers/net/ethernet/intel/ice/ice_common.c   | 22
> +-
>  drivers/net/ethernet/intel/ice/ice_hw_autogen.h   |  2 --
>  drivers/net/ethernet/intel/ice/ice_main.c |  4 
>  drivers/net/ethernet/intel/ice/ice_sched.c|  4 ++--
>  drivers/net/ethernet/intel/igb/igb_main.c | 17
> -
>  drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |  2 +-
>  7 files changed, 37 insertions(+), 16 deletions(-)
> 




Re: [PATCH net-next 2/2 v1] netns: isolate seqnums to use per-netns locks

2018-04-24 Thread David Miller
From: Christian Brauner 
Date: Mon, 23 Apr 2018 12:24:43 +0200

> + #ifdef CONFIG_NET
> + seqnum = get_ns_uevent_seqnum_by_vpid();
> + #else
> + seqnum = uevent_seqnum;
> + #endif

Please don't indend the code like this.

By indenting the CPP directives, which should be at column zero, the
actual code became double indented.

Thank you.


[net 0/6][pull request] Intel Wired LAN Driver Updates 2018-04-24

2018-04-24 Thread Jeff Kirsher
This series contains fixes to ixgbevf, igb and ice drivers.

Colin Ian King fixes the return value on error for the new XDP support
that went into ixgbevf for 4.16.

Vinicius provides a fix for queue 0 for igb, which was not receiving all
the credits it needed when QAV mode was enabled.

Anirudh provides several fixes for the new ice driver, starting with
properly initializing num_nodes_added to zero.  Fixed up a code comment
to better reflect what is really going on in the code.  Changed the
detection of OICR interrupts to use a more reliable method.

Md Fahad fixes the ice driver to allocate the right amount of memory
when reading and storing the device's MAC addresses.  The device can have
up to 2 MAC addresses (LAN and WoL); while WoL is currently not
supported, we need to ensure it can be properly handled when support is
added.

The following are changes since commit 9cf2f437ca5b39828984064fad213e68fc17ef11:
  team: fix netconsole setup over team
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/net-queue 1GbE

Anirudh Venkataramanan (2):
  ice: Fix initialization for num_nodes_added
  ice: Fix incorrect comment for action type

Ben Shelton (1):
  ice: Do not check INTEVENT bit for OICR interrupts

Colin Ian King (1):
  ixgbevf: ensure xdp_ring resources are free'd on error exit

Md Fahad Iqbal Polash (1):
  ice: Fix insufficient memory issue in ice_aq_manage_mac_read

Vinicius Costa Gomes (1):
  igb: Fix the transmission mode of queue 0 for Qav mode

 drivers/net/ethernet/intel/ice/ice_adminq_cmd.h   |  2 +-
 drivers/net/ethernet/intel/ice/ice_common.c   | 22 +-
 drivers/net/ethernet/intel/ice/ice_hw_autogen.h   |  2 --
 drivers/net/ethernet/intel/ice/ice_main.c |  4 
 drivers/net/ethernet/intel/ice/ice_sched.c|  4 ++--
 drivers/net/ethernet/intel/igb/igb_main.c | 17 -
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c |  2 +-
 7 files changed, 37 insertions(+), 16 deletions(-)

-- 
2.14.3



[net 3/6] ice: Fix initialization for num_nodes_added

2018-04-24 Thread Jeff Kirsher
From: Anirudh Venkataramanan 

ice_sched_add_nodes_to_layer is used recursively, and so we start
with num_nodes_added being 0. This way, in case of an error or if
num_nodes is NULL, the function just returns 0 to indicate that no
nodes were added.

Fixes: 5513b920a4f7 ("ice: Update Tx scheduler tree for VSI multi-Tx queue 
support")
Signed-off-by: Anirudh Venkataramanan 
Tested-by: Tony Brelinski 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ice/ice_sched.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_sched.c 
b/drivers/net/ethernet/intel/ice/ice_sched.c
index f16ff3e4a840..2e6c1d92cc88 100644
--- a/drivers/net/ethernet/intel/ice/ice_sched.c
+++ b/drivers/net/ethernet/intel/ice/ice_sched.c
@@ -751,14 +751,14 @@ ice_sched_add_nodes_to_layer(struct ice_port_info *pi,
u16 num_added = 0;
u32 temp;
 
+   *num_nodes_added = 0;
+
if (!num_nodes)
return status;
 
if (!parent || layer < hw->sw_entry_point_layer)
return ICE_ERR_PARAM;
 
-   *num_nodes_added = 0;
-
/* max children per node per layer */
max_child_nodes =
le16_to_cpu(hw->layer_info[parent->tx_sched_layer].max_children);
-- 
2.14.3



[net 5/6] ice: Do not check INTEVENT bit for OICR interrupts

2018-04-24 Thread Jeff Kirsher
From: Ben Shelton 

According to the hardware spec, checking the INTEVENT bit isn't a
reliable way to detect if an OICR interrupt has occurred. This is
because this bit can be cleared by the hardware/firmware before the
interrupt service routine has run. So instead, just check for OICR
events every time.

Fixes: 940b61af02f4 ("ice: Initialize PF and setup miscellaneous interrupt")
Signed-off-by: Ben Shelton 
Signed-off-by: Anirudh Venkataramanan 
Tested-by: Tony Brelinski 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ice/ice_hw_autogen.h | 2 --
 drivers/net/ethernet/intel/ice/ice_main.c   | 4 
 2 files changed, 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_hw_autogen.h 
b/drivers/net/ethernet/intel/ice/ice_hw_autogen.h
index 1b9e2ef48a9d..499904874b3f 100644
--- a/drivers/net/ethernet/intel/ice/ice_hw_autogen.h
+++ b/drivers/net/ethernet/intel/ice/ice_hw_autogen.h
@@ -121,8 +121,6 @@
 #define PFINT_FW_CTL_CAUSE_ENA_S   30
 #define PFINT_FW_CTL_CAUSE_ENA_M   BIT(PFINT_FW_CTL_CAUSE_ENA_S)
 #define PFINT_OICR 0x0016CA00
-#define PFINT_OICR_INTEVENT_S  0
-#define PFINT_OICR_INTEVENT_M  BIT(PFINT_OICR_INTEVENT_S)
 #define PFINT_OICR_HLP_RDY_S   14
 #define PFINT_OICR_HLP_RDY_M   BIT(PFINT_OICR_HLP_RDY_S)
 #define PFINT_OICR_CPM_RDY_S   15
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c 
b/drivers/net/ethernet/intel/ice/ice_main.c
index 210b7910f1cd..5299caf55a7f 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -1722,9 +1722,6 @@ static irqreturn_t ice_misc_intr(int __always_unused irq, 
void *data)
oicr = rd32(hw, PFINT_OICR);
ena_mask = rd32(hw, PFINT_OICR_ENA);
 
-   if (!(oicr & PFINT_OICR_INTEVENT_M))
-   goto ena_intr;
-
if (oicr & PFINT_OICR_GRST_M) {
u32 reset;
/* we have a reset warning */
@@ -1782,7 +1779,6 @@ static irqreturn_t ice_misc_intr(int __always_unused irq, 
void *data)
}
ret = IRQ_HANDLED;
 
-ena_intr:
/* re-enable interrupt causes that are not handled during this pass */
wr32(hw, PFINT_OICR_ENA, ena_mask);
if (!test_bit(__ICE_DOWN, pf->state)) {
-- 
2.14.3



[net 4/6] ice: Fix incorrect comment for action type

2018-04-24 Thread Jeff Kirsher
From: Anirudh Venkataramanan 

Action type 5 defines large action generic values. Fix comment to
reflect that better.

Signed-off-by: Anirudh Venkataramanan 
Tested-by: Tony Brelinski 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ice/ice_adminq_cmd.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h 
b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
index 5b13ca1bd85f..7dc5f045e969 100644
--- a/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
+++ b/drivers/net/ethernet/intel/ice/ice_adminq_cmd.h
@@ -586,7 +586,7 @@ struct ice_sw_rule_lg_act {
 #define ICE_LG_ACT_MIRROR_VSI_ID_S 3
 #define ICE_LG_ACT_MIRROR_VSI_ID_M (0x3FF << ICE_LG_ACT_MIRROR_VSI_ID_S)
 
-   /* Action type = 5 - Large Action */
+   /* Action type = 5 - Generic Value */
 #define ICE_LG_ACT_GENERIC 0x5
 #define ICE_LG_ACT_GENERIC_VALUE_S 3
 #define ICE_LG_ACT_GENERIC_VALUE_M (0xFFFF << ICE_LG_ACT_GENERIC_VALUE_S)
-- 
2.14.3



[net 2/6] igb: Fix the transmission mode of queue 0 for Qav mode

2018-04-24 Thread Jeff Kirsher
From: Vinicius Costa Gomes 

When Qav mode is enabled, queue 0 should be kept on Stream Reservation
mode. From the i210 datasheet, section 8.12.19:

"Note: Queue0 QueueMode must be set to 1b when TransmitMode is set to
Qav." ("QueueMode 1b" represents the Stream Reservation mode)

The solution is to give queue 0 all the credits it might need, so
it has priority over queue 1.

A situation where this can happen is when cbs is "installed" only on
queue 1, leaving queue 0 alone. For example:

$ tc qdisc replace dev enp2s0 handle 100: parent root mqprio num_tc 3 \
   map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0

$ tc qdisc replace dev enp2s0 parent 100:2 cbs locredit -1470 \
   hicredit 30 sendslope -98 idleslope 2 offload 1

Signed-off-by: Vinicius Costa Gomes 
Tested-by: Aaron Brown 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/igb/igb_main.c | 17 -
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_main.c 
b/drivers/net/ethernet/intel/igb/igb_main.c
index c1c0bc30a16d..cce7ada89255 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -1700,7 +1700,22 @@ static void igb_configure_cbs(struct igb_adapter 
*adapter, int queue,
WARN_ON(hw->mac.type != e1000_i210);
WARN_ON(queue < 0 || queue > 1);
 
-   if (enable) {
+   if (enable || queue == 0) {
+   /* i210 does not allow the queue 0 to be in the Strict
+* Priority mode while the Qav mode is enabled, so,
+* instead of disabling strict priority mode, we give
+* queue 0 the maximum of credits possible.
+*
+* See section 8.12.19 of the i210 datasheet, "Note:
+* Queue0 QueueMode must be set to 1b when
+* TransmitMode is set to Qav."
+*/
+   if (queue == 0 && !enable) {
+   /* max "linkspeed" idleslope in kbps */
+   idleslope = 1000000;
+   hicredit = ETH_FRAME_LEN;
+   }
+
set_tx_desc_fetch_prio(hw, queue, TX_QUEUE_PRIO_HIGH);
set_queue_mode(hw, queue, QUEUE_MODE_STREAM_RESERVATION);
 
-- 
2.14.3



[net 1/6] ixgbevf: ensure xdp_ring resources are free'd on error exit

2018-04-24 Thread Jeff Kirsher
From: Colin Ian King 

The current error handling for failed resource setup for xdp_ring
data is a break out of the loop and a return of 0, indicating everything
was OK when in fact it is not.  Fix this by exiting via the
error exit label err_setup_tx, which will clean up the resources
correctly and return an error status.

Detected by CoverityScan, CID#1466879 ("Logically dead code")

Fixes: 21092e9ce8b1 ("ixgbevf: Add support for XDP_TX action")
Signed-off-by: Colin Ian King 
Tested-by: Andrew Bowers 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c 
b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 3d9033f26eff..e3d04f226d57 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -3420,7 +3420,7 @@ static int ixgbevf_setup_all_tx_resources(struct 
ixgbevf_adapter *adapter)
if (!err)
continue;
		hw_dbg(&adapter->hw, "Allocation for XDP Queue %u failed\n", j);
-   break;
+   goto err_setup_tx;
}
 
return 0;
-- 
2.14.3



[net 6/6] ice: Fix insufficient memory issue in ice_aq_manage_mac_read

2018-04-24 Thread Jeff Kirsher
From: Md Fahad Iqbal Polash 

For the MAC read operation, the device can return up to two (LAN and WoL)
MAC addresses. Without access to adequate memory, the device will return
an error. Fixed this by allocating the right amount of memory. Also, logic
to detect and copy the LAN MAC address into the port_info structure has
been added. Note that the WoL MAC address is ignored currently as the WoL
feature isn't supported yet.

Fixes: dc49c7723676 ("ice: Get MAC/PHY/link info and scheduler topology")
Signed-off-by: Md Fahad Iqbal Polash 
Signed-off-by: Anirudh Venkataramanan 
Tested-by: Tony Brelinski 
Signed-off-by: Jeff Kirsher 
---
 drivers/net/ethernet/intel/ice/ice_common.c | 22 +-
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_common.c 
b/drivers/net/ethernet/intel/ice/ice_common.c
index 21977ec984c4..71d032cc5fa7 100644
--- a/drivers/net/ethernet/intel/ice/ice_common.c
+++ b/drivers/net/ethernet/intel/ice/ice_common.c
@@ -78,6 +78,7 @@ ice_aq_manage_mac_read(struct ice_hw *hw, void *buf, u16 
buf_size,
struct ice_aq_desc desc;
enum ice_status status;
u16 flags;
+   u8 i;
 
	cmd = &desc.params.mac_read;
 
@@ -98,8 +99,16 @@ ice_aq_manage_mac_read(struct ice_hw *hw, void *buf, u16 
buf_size,
return ICE_ERR_CFG;
}
 
-   ether_addr_copy(hw->port_info->mac.lan_addr, resp->mac_addr);
-   ether_addr_copy(hw->port_info->mac.perm_addr, resp->mac_addr);
+   /* A single port can report up to two (LAN and WoL) addresses */
+   for (i = 0; i < cmd->num_addr; i++)
+   if (resp[i].addr_type == ICE_AQC_MAN_MAC_ADDR_TYPE_LAN) {
+   ether_addr_copy(hw->port_info->mac.lan_addr,
+   resp[i].mac_addr);
+   ether_addr_copy(hw->port_info->mac.perm_addr,
+   resp[i].mac_addr);
+   break;
+   }
+
return 0;
 }
 
@@ -464,9 +473,12 @@ enum ice_status ice_init_hw(struct ice_hw *hw)
if (status)
goto err_unroll_sched;
 
-   /* Get port MAC information */
-   mac_buf_len = sizeof(struct ice_aqc_manage_mac_read_resp);
-   mac_buf = devm_kzalloc(ice_hw_to_dev(hw), mac_buf_len, GFP_KERNEL);
+   /* Get MAC information */
+   /* A single port can report up to two (LAN and WoL) addresses */
+   mac_buf = devm_kcalloc(ice_hw_to_dev(hw), 2,
+  sizeof(struct ice_aqc_manage_mac_read_resp),
+  GFP_KERNEL);
+   mac_buf_len = 2 * sizeof(struct ice_aqc_manage_mac_read_resp);
 
if (!mac_buf) {
status = ICE_ERR_NO_MEMORY;
-- 
2.14.3



Re: [PATCH net-next v2 0/2] openvswitch: Support conntrack zone limit

2018-04-24 Thread David Miller
From: Yi-Hung Wei 
Date: Tue, 24 Apr 2018 11:21:33 -0700

> On Tue, Apr 24, 2018 at 10:42 AM, David Miller  wrote:
>> From: Pravin Shelar 
>> Date: Mon, 23 Apr 2018 23:34:48 -0700
>>
>>> OK. Thanks for the info.
>>
>> So, ACK, Reviewed-by, etc.? :-)
>>
> 
> Pravin provided feedback in a previous email.  I will address them and
> send out v3.

Aha, I see, thanks for explaining.


Re: [RFC PATCH ghak32 V2 01/13] audit: add container id

2018-04-24 Thread Paul Moore
On Mon, Apr 23, 2018 at 10:02 PM, Richard Guy Briggs  wrote:
> On 2018-04-23 19:15, Paul Moore wrote:
>> On Sat, Apr 21, 2018 at 10:34 AM, Richard Guy Briggs  wrote:
>> > On 2018-04-18 19:47, Paul Moore wrote:
>> >> On Fri, Mar 16, 2018 at 5:00 AM, Richard Guy Briggs  
>> >> wrote:
>> >> > Implement the proc fs write to set the audit container ID of a process,
>> >> > emitting an AUDIT_CONTAINER record to document the event.
>> >> >
>> >> > This is a write from the container orchestrator task to a proc entry of
>> >> > the form /proc/PID/containerid where PID is the process ID of the newly
>> >> > created task that is to become the first task in a container, or an
>> >> > additional task added to a container.
>> >> >
>> >> > The write expects up to a u64 value (unset: 18446744073709551615).
>> >> >
>> >> > This will produce a record such as this:
>> >> > type=CONTAINER msg=audit(1519903238.968:261): op=set pid=596 uid=0 
>> >> > subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 auid=0 
>> >> > tty=pts0 ses=1 opid=596 old-contid=18446744073709551615 contid=123455 
>> >> > res=0
>> >> >
>> >> > The "op" field indicates an initial set.  The "pid" to "ses" fields are
>> >> > the orchestrator while the "opid" field is the object's PID, the process
>> >> > being "contained".  Old and new container ID values are given in the
>> >> > "contid" fields, while res indicates its success.
>> >> >
>> >> > It is not permitted to self-set, unset or re-set the container ID.  A
>> >> > child inherits its parent's container ID, but then can be set only once
>> >> > after.
>> >> >
>> >> > See: https://github.com/linux-audit/audit-kernel/issues/32
>> >> >
>> >> > Signed-off-by: Richard Guy Briggs 
>> >> > ---
>> >> >  fs/proc/base.c | 37 
>> >> >  include/linux/audit.h  | 16 +
>> >> >  include/linux/init_task.h  |  4 ++-
>> >> >  include/linux/sched.h  |  1 +
>> >> >  include/uapi/linux/audit.h |  2 ++
>> >> >  kernel/auditsc.c   | 84 
>> >> > ++
>> >> >  6 files changed, 143 insertions(+), 1 deletion(-)

...

>> >> >  /* audit_rule_data supports filter rules with both integer and string
>> >> >   * fields.  It corresponds with AUDIT_ADD_RULE, AUDIT_DEL_RULE and
>> >> > diff --git a/kernel/auditsc.c b/kernel/auditsc.c
>> >> > index 4e0a4ac..29c8482 100644
>> >> > --- a/kernel/auditsc.c
>> >> > +++ b/kernel/auditsc.c
>> >> > @@ -2073,6 +2073,90 @@ int audit_set_loginuid(kuid_t loginuid)
>> >> > return rc;
>> >> >  }
>> >> >
>> >> > +static int audit_set_containerid_perm(struct task_struct *task, u64 
>> >> > containerid)
>> >> > +{
>> >> > +   struct task_struct *parent;
>> >> > +   u64 pcontainerid, ccontainerid;
>> >> > +
>> >> > +   /* Don't allow to set our own containerid */
>> >> > +   if (current == task)
>> >> > +   return -EPERM;
>> >>
>> >> Why not?  Is there some obvious security concern that I'm missing?
>> >
>> > We then lose the distinction in the AUDIT_CONTAINER record between the
>> > initiating PID and the target PID.  This was outlined in the proposal.
>>
>> I just went back and reread the v3 proposal and I still don't see a
>> good explanation of this.  Why is this bad?  What's the security
>> concern?
>
> I don't remember, specifically.  Maybe this has been addressed by the
> check for children/threads or identical parent container ID.  So, I'm
> reluctantly willing to remove that check for now.

Okay.  For the record, if someone can explain to me why this
restriction saves us from some terrible situation I'm all for leaving
it.  I'm just opposed to restrictions without solid reasoning behind
them.

>> > Having said that, I'm still not sure we have protected sufficiently from
>> > a child turning around and setting its parent's as yet unset or
>> > inherited audit container ID.
>>
>> Yes, I believe we only want to let a task set the audit container for
>> its children (or itself/threads if we decide to allow that, see
>> above).  There *has* to be a function to check to see if a task is a
>> child of a given task ... right? ... although this is likely to be a
>> pointer traversal and locking nightmare ... hmmm.
>
> Isn't that just (struct task_struct)parent == (struct
> task_struct)child->parent (or ->real_parent)?
>
> And now that I say that, it is covered by the following patch's child
> check, so as long as we keep that, we should be fine.

I was thinking of checking not just current's immediate children, but
> any of its descendants, as I believe that is what we want to limit,
yes?  I just worry that it isn't really practical to perform that
check.
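
For illustration, such a descendant check, if it were attempted despite the
practicality concern, might look roughly like the sketch below. The helper
name is hypothetical, and the ->real_parent walk and RCU locking are only
assumptions, not code from the patch set.

/* Hypothetical sketch: walk the parent chain of @child under RCU and
 * report whether @ancestor appears among its ancestors.
 */
static bool audit_task_is_descendant(struct task_struct *ancestor,
				     struct task_struct *child)
{
	struct task_struct *t;
	bool found = false;

	rcu_read_lock();
	for (t = rcu_dereference(child->real_parent); t != &init_task;
	     t = rcu_dereference(t->real_parent)) {
		if (t == ancestor) {
			found = true;
			break;
		}
	}
	rcu_read_unlock();

	return found;
}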

>> >> I ask because I suppose it might be possible for some container
>> >> runtime to do a fork, setup some of the environment and them exec the
>> >> container (before you answer the obvious "namespaces!" please remember
>> >> we're not trying to 

Re: [PATCH bpf-next 08/15] bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP

2018-04-24 Thread Björn Töpel
2018-04-24 18:56 GMT+02:00 Willem de Bruijn :
> On Mon, Apr 23, 2018 at 9:56 AM, Björn Töpel  wrote:
>> From: Björn Töpel 
>>
>> The xskmap is yet another BPF map, very much inspired by
>> dev/cpu/sockmap, and is a holder of AF_XDP sockets. A user application
>> adds AF_XDP sockets into the map, and by using the bpf_redirect_map
>> helper, an XDP program can redirect XDP frames to an AF_XDP socket.
>>
>> Note that a socket that is bound to certain ifindex/queue index will
>> *only* accept XDP frames from that netdev/queue index. If an XDP
>> program tries to redirect from a netdev/queue index other than what
>> the socket is bound to, the frame will not be received on the socket.
>>
>> A socket can reside in multiple maps.
>>
>> Signed-off-by: Björn Töpel 
>
>> +struct xsk_map_entry {
>> +   struct xdp_sock *xs;
>> +   struct rcu_head rcu;
>> +};
>
>> +struct xdp_sock *__xsk_map_lookup_elem(struct bpf_map *map, u32 key)
>> +{
>> +   struct xsk_map *m = container_of(map, struct xsk_map, map);
>> +   struct xsk_map_entry *entry;
>> +
>> +   if (key >= map->max_entries)
>> +   return NULL;
>> +
>> +   entry = READ_ONCE(m->xsk_map[key]);
>> +   return entry ? entry->xs : NULL;
>> +}
>
> This dynamically allocated structure adds an extra cacheline lookup. If
> xdp_sock gets an rcu_head, it can be linked into the map directly.

Nice one! I'll try this out!
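
For what it's worth, the suggestion above would amount to something like the
sketch below: drop the intermediate xsk_map_entry allocation, keep struct
xdp_sock pointers directly in the map array (i.e. re-declare m->xsk_map as an
array of struct xdp_sock pointers), and give xdp_sock its own rcu_head for
deferred freeing. This is only an illustration of the idea, not the code that
was eventually posted.

/* Sketch: the map array holds xdp_sock pointers directly, so the lookup
 * touches a single pointer instead of an extra xsk_map_entry cacheline.
 */
struct xdp_sock *__xsk_map_lookup_elem(struct bpf_map *map, u32 key)
{
	struct xsk_map *m = container_of(map, struct xsk_map, map);

	if (key >= map->max_entries)
		return NULL;

	return READ_ONCE(m->xsk_map[key]);	/* struct xdp_sock * or NULL */
}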


Re: [PATCH net-next 5/8] net: mscc: Add initial Ocelot switch support

2018-04-24 Thread Alexandre Belloni
I realise now that I didn't reply to this comment:

On 30/03/2018 16:50:08+0200, Andrew Lunn wrote:
> > The fact is that ocelot doesn't have separate controls. The port is
> > either forwarding or not. If it is not forwarding, then there is nothing
> > to tell the HW to do.
> 
> Think about the following sequence:
> 
> ip link set lan0 up
> 
> After this command, i expect to see packets on lan0 arrive at the
> host, tcpdump to work, etc. This probably means the port is in
> 'forwarding' mode, or for B53, STP is disabled.
> 

On Ocelot, forwarding packets to the host (i.e. forwarding frames
received on the port to the cpu port) is separate from bridging ports
together. So after that command, the host can receive packets on lan0.


-- 
Alexandre Belloni, Bootlin (formerly Free Electrons)
Embedded Linux and Kernel engineering
https://bootlin.com


Re: [PATCH bpf-next 05/15] xsk: add support for bind for Rx

2018-04-24 Thread Björn Töpel
2018-04-24 18:55 GMT+02:00 Willem de Bruijn :
> On Mon, Apr 23, 2018 at 9:56 AM, Björn Töpel  wrote:
>> From: Magnus Karlsson 
>>
>> Here, the bind syscall is added. Binding an AF_XDP socket means
>> associating the socket to an umem, a netdev and a queue index. This
>> can be done in two ways.
>>
>> The first way, creating a "socket from scratch". Create the umem using
>> the XDP_UMEM_REG setsockopt and an associated fill queue with
>> XDP_UMEM_FILL_QUEUE. Create the Rx queue using the XDP_RX_QUEUE
>> setsockopt. Call bind passing ifindex and queue index ("channel" in
>> ethtool speak).
>>
>> The second way to bind a socket, is simply skipping the
>> umem/netdev/queue index, and passing another already setup AF_XDP
>> socket. The new socket will then have the same umem/netdev/queue index
>> as the parent so it will share the same umem. You must also set the
>> flags field in the socket address to XDP_SHARED_UMEM.
>>
>> Signed-off-by: Magnus Karlsson 
>> ---
>
>> +static struct socket *xsk_lookup_xsk_from_fd(int fd, int *err)
>> +{
>> +   struct socket *sock;
>> +
>> +   *err = -ENOTSOCK;
>> +   sock = sockfd_lookup(fd, err);
>> +   if (!sock)
>> +   return NULL;
>> +
>> +   if (sock->sk->sk_family != PF_XDP) {
>> +   *err = -ENOPROTOOPT;
>> +   sockfd_put(sock);
>> +   return NULL;
>> +   }
>> +
>> +   *err = 0;
>> +   return sock;
>> +}
>
> In this and similar cases, can use ERR_PTR to avoid the extra argument.

Noted. Thanks!
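
As a rough illustration, the ERR_PTR variant suggested above could look like
the sketch below; callers would then use IS_ERR()/PTR_ERR() instead of
passing an extra *err argument.

static struct socket *xsk_lookup_xsk_from_fd(int fd)
{
	struct socket *sock;
	int err;

	sock = sockfd_lookup(fd, &err);
	if (!sock)
		return ERR_PTR(err);	/* sockfd_lookup() fills in the error */

	if (sock->sk->sk_family != PF_XDP) {
		sockfd_put(sock);
		return ERR_PTR(-ENOPROTOOPT);
	}

	return sock;
}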


Re: [PATCH v3] kvmalloc: always use vmalloc if CONFIG_DEBUG_SG

2018-04-24 Thread Mikulas Patocka


On Tue, 24 Apr 2018, Matthew Wilcox wrote:

> On Tue, Apr 24, 2018 at 08:29:14AM -0400, Mikulas Patocka wrote:
> > 
> > 
> > On Mon, 23 Apr 2018, Matthew Wilcox wrote:
> > 
> > > On Mon, Apr 23, 2018 at 08:06:16PM -0400, Mikulas Patocka wrote:
> > > > Some bugs (such as buffer overflows) are better detected
> > > > with kmalloc code, so we must test the kmalloc path too.
> > > 
> > > Well now, this brings up another item for the collective TODO list --
> > > implement redzone checks for vmalloc.  Unless this is something already
> > > taken care of by kasan or similar.
> > 
> > The kmalloc overflow testing is also not ideal - it rounds the size up to 
> > the next slab size and detects buffer overflows only at this boundary.
> > 
> > Some times ago, I made a "kmalloc guard" patch that places a magic number 
> > immediatelly after the requested size - so that it can detect overflows at 
> > byte boundary 
> > ( https://www.redhat.com/archives/dm-devel/2014-September/msg00018.html )
> > 
> > That patch found a bug in crypto code:
> > ( http://lkml.iu.edu/hypermail/linux/kernel/1409.1/02325.html )
> 
> Is it still worth doing this, now we have kasan?

The kmalloc guard has much lower overhead than kasan.

(BTW. when I tried kasan, it oopsed with persistent memory)

Mikulas
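
For readers who have not seen the referenced dm-devel patch, the "kmalloc
guard" idea boils down to something like the sketch below: over-allocate by a
few bytes, place a magic value immediately after the requested size, and
verify it when the buffer is freed. The helper names and the single guard
byte are made up for illustration; the actual patch hooks into the slab code
rather than wrapping kmalloc().

/* Illustrative sketch of a byte-granular redzone, not the actual patch. */
#define GUARD_MAGIC	0x5a

static inline void *kmalloc_guarded(size_t size, gfp_t gfp)
{
	u8 *p = kmalloc(size + 1, gfp);		/* one extra byte for the guard */

	if (p)
		p[size] = GUARD_MAGIC;		/* right after the requested size */
	return p;
}

static inline void kfree_guarded(const void *ptr, size_t size)
{
	const u8 *p = ptr;

	if (p && p[size] != GUARD_MAGIC)
		pr_err("buffer overflow detected at %p\n", ptr);
	kfree(ptr);
}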


Re: [PATCH bpf-next 07/15] xsk: add Rx receive functions and poll support

2018-04-24 Thread Björn Töpel
2018-04-24 18:56 GMT+02:00 Willem de Bruijn :
> On Mon, Apr 23, 2018 at 9:56 AM, Björn Töpel  wrote:
>> From: Björn Töpel 
>>
>> Here the actual receive functions of AF_XDP are implemented; in a
>> later commit, they will be called from the XDP layers.
>>
>> There's one set of functions for the XDP_DRV side and another for
>> XDP_SKB (generic).
>>
>> Support for the poll syscall is also implemented.
>>
>> Signed-off-by: Björn Töpel 
>> ---
>
>> +/* Common functions operating for both RXTX and umem queues */
>> +
>> +static inline u32 xskq_nb_avail(struct xsk_queue *q, u32 dcnt)
>> +{
>> +   u32 entries = q->prod_tail - q->cons_tail;
>> +
>> +   if (entries == 0) {
>> +   /* Refresh the local pointer */
>> +   q->prod_tail = READ_ONCE(q->ring->producer);
>> +   }
>> +
>> +   entries = q->prod_tail - q->cons_tail;
>
> Probably meant to be inside the branch? Though I see the same
> pattern in the userspace example program.
>

Yes! Nasty C going on here... :-(
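
If the refresh is indeed meant to be inside the branch, the fix would
presumably look like the sketch below; the min(entries, dcnt) return is a
guess at the part of the function that was elided from the quote.

static inline u32 xskq_nb_avail(struct xsk_queue *q, u32 dcnt)
{
	u32 entries = q->prod_tail - q->cons_tail;

	if (entries == 0) {
		/* Only refresh the cached producer index when we look empty,
		 * then recompute with the fresh value.
		 */
		q->prod_tail = READ_ONCE(q->ring->producer);
		entries = q->prod_tail - q->cons_tail;
	}

	return min(entries, dcnt);
}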

>> +static inline u32 *xskq_validate_id(struct xsk_queue *q)
>> +{
>> +   while (q->cons_tail != q->cons_head) {
>> +   struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
>> +   unsigned int idx = q->cons_tail & q->ring_mask;
>> +
>> +   if (xskq_is_valid_id(q, ring->desc[idx]))
>> +   return &ring->desc[idx];
>
> Missing a q->cons_tail increment in this loop?

Indeed! Good catch! Thanks!
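
A sketch of the loop with the missing advance; the exact placement of the
increment and the NULL return for an exhausted queue are assumptions, not the
eventual fix.

static inline u32 *xskq_validate_id(struct xsk_queue *q)
{
	while (q->cons_tail != q->cons_head) {
		struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
		unsigned int idx = q->cons_tail & q->ring_mask;

		if (xskq_is_valid_id(q, ring->desc[idx]))
			return &ring->desc[idx];

		/* Skip over invalid ids so the loop makes progress. */
		q->cons_tail++;
	}

	return NULL;
}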


Björn


Re: [net-next v2] ipv6: sr: Compute flowlabel for outer IPv6 header of seg6 encap mode

2018-04-24 Thread Ahmed Abdelsalam
On Tue, 24 Apr 2018 19:59:55 +0200
Ahmed Abdelsalam  wrote:

> This patch has been tested for IPv6, IPv4, and L2 traffic.
> 
> Signed-off-by: Ahmed Abdelsalam 
> ---
>  include/net/netns/ipv6.h   |  1 +
>  net/ipv6/seg6_iptunnel.c   | 24 ++--
>  net/ipv6/sysctl_net_ipv6.c |  8 
>  3 files changed, 31 insertions(+), 2 deletions(-)
> 
> diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h
> index 97b3a54..c978a31 100644
> --- a/include/net/netns/ipv6.h
> +++ b/include/net/netns/ipv6.h
> @@ -43,6 +43,7 @@ struct netns_sysctl_ipv6 {
>   int max_hbh_opts_cnt;
>   int max_dst_opts_len;
>   int max_hbh_opts_len;
> + int seg6_flowlabel;
>  };
>  
>  struct netns_ipv6 {
> diff --git a/net/ipv6/seg6_iptunnel.c b/net/ipv6/seg6_iptunnel.c
> index 5fe1394..3d9cd86 100644
> --- a/net/ipv6/seg6_iptunnel.c
> +++ b/net/ipv6/seg6_iptunnel.c
> @@ -91,6 +91,24 @@ static void set_tun_src(struct net *net, struct net_device 
> *dev,
>   rcu_read_unlock();
>  }
>  
> +/* Compute flowlabel for outer IPv6 header */
> +__be32 seg6_make_flowlabel(struct net *net, struct sk_buff *skb,
> +struct ipv6hdr *inner_hdr)

David, please take v3 of the patch. 
I re-defined seg6_make_flowlabel() as static to address the kbuild test robot report.

-- 
Ahmed Abdelsalam 


Re: [PATCH 5/5] PCI: remove PCI_DMA_BUS_IS_PHYS

2018-04-24 Thread Palmer Dabbelt

On Tue, 24 Apr 2018 11:16:25 PDT (-0700), Christoph Hellwig wrote:

This was used by the ide, scsi and networking code in the past to
determine if they should bounce payloads.  Now that the dma mapping
always has to support dma to all physical memory (thanks to swiotlb
for non-iommu systems) there is no need for this crude hack any more.

Signed-off-by: Christoph Hellwig 
---
[...]
diff --git a/arch/riscv/include/asm/pci.h b/arch/riscv/include/asm/pci.h
index 0f2fc9ef20fc..b3638c505728 100644
--- a/arch/riscv/include/asm/pci.h
+++ b/arch/riscv/include/asm/pci.h
@@ -26,9 +26,6 @@
 /* RISC-V shim does not initialize PCI bus */
 #define pcibios_assign_all_busses() 1

-/* We do not have an IOMMU */
-#define PCI_DMA_BUS_IS_PHYS 1
-
 extern int isa_dma_bridge_buggy;

 #ifdef CONFIG_PCI


Thanks!

Acked-by: Palmer Dabbelt  (For the RISC-V change)


[net-next v3] ipv6: sr: Compute flowlabel for outer IPv6 header of seg6 encap mode

2018-04-24 Thread Ahmed Abdelsalam
ECMP (equal-cost multipath) hashes are typically computed on the packets'
5-tuple (src IP, dst IP, src port, dst port, L4 proto).

For encapsulated packets, the L4 data is not readily available and ECMP
hashing will often revert to (src IP, dst IP). This will lead to traffic
polarization on a single ECMP path, causing congestion and waste of network
capacity.

In IPv6, the 20-bit flow label field is also used as part of the ECMP hash.
In the absence of L4 data, the hashing will be on (src IP, dst IP, flow
label). Having a non-zero flow label is thus important for proper traffic
load balancing when L4 data is unavailable (i.e., when packets are
encapsulated).

Currently, the seg6_do_srh_encap() function extracts the original packet's
flow label and sets it as the outer IPv6 flow label. There are two issues
with this behaviour:

a) There is no guarantee that the inner flow label is set by the source.
b) If the original packet is not IPv6, the flow label will be set to
zero (e.g., IPv4 or L2 encap).

This patch adds a function, named seg6_make_flowlabel(), that computes a
flow label from a given skb. It supports IPv6, IPv4 and L2 payloads, and
leverages the per-namespace "seg6_flowlabel" sysctl value.

The currently supported behaviours are as follows:
-1: set the flowlabel to zero.
 0: copy the flowlabel from the inner packet when the inner packet is IPv6
    (set the flowlabel to zero for IPv4/L2).
 1: compute the flowlabel using seg6_make_flowlabel().

This patch has been tested for IPv6, IPv4, and L2 traffic.

Signed-off-by: Ahmed Abdelsalam 
---
 include/net/netns/ipv6.h   |  1 +
 net/ipv6/seg6_iptunnel.c   | 24 ++--
 net/ipv6/sysctl_net_ipv6.c |  8 
 3 files changed, 31 insertions(+), 2 deletions(-)

diff --git a/include/net/netns/ipv6.h b/include/net/netns/ipv6.h
index 97b3a54..c978a31 100644
--- a/include/net/netns/ipv6.h
+++ b/include/net/netns/ipv6.h
@@ -43,6 +43,7 @@ struct netns_sysctl_ipv6 {
int max_hbh_opts_cnt;
int max_dst_opts_len;
int max_hbh_opts_len;
+   int seg6_flowlabel;
 };
 
 struct netns_ipv6 {
diff --git a/net/ipv6/seg6_iptunnel.c b/net/ipv6/seg6_iptunnel.c
index 5fe1394..9898926 100644
--- a/net/ipv6/seg6_iptunnel.c
+++ b/net/ipv6/seg6_iptunnel.c
@@ -91,6 +91,24 @@ static void set_tun_src(struct net *net, struct net_device *dev,
rcu_read_unlock();
 }
 
+/* Compute flowlabel for outer IPv6 header */
+static __be32 seg6_make_flowlabel(struct net *net, struct sk_buff *skb,
+ struct ipv6hdr *inner_hdr)
+{
+   int do_flowlabel = net->ipv6.sysctl.seg6_flowlabel;
+   __be32 flowlabel = 0;
+   u32 hash;
+
+   if (do_flowlabel > 0) {
+   hash = skb_get_hash(skb);
+   hash = rol32(hash, 16);
+   flowlabel = (__force __be32)hash & IPV6_FLOWLABEL_MASK;
+   } else if (!do_flowlabel && skb->protocol == htons(ETH_P_IPV6)) {
+   flowlabel = ip6_flowlabel(inner_hdr);
+   }
+   return flowlabel;
+}
+
 /* encapsulate an IPv6 packet within an outer IPv6 header with a given SRH */
 int seg6_do_srh_encap(struct sk_buff *skb, struct ipv6_sr_hdr *osrh, int proto)
 {
@@ -99,6 +117,7 @@ int seg6_do_srh_encap(struct sk_buff *skb, struct ipv6_sr_hdr *osrh, int proto)
struct ipv6hdr *hdr, *inner_hdr;
struct ipv6_sr_hdr *isrh;
int hdrlen, tot_len, err;
+   __be32 flowlabel;
 
hdrlen = (osrh->hdrlen + 1) << 3;
tot_len = hdrlen + sizeof(*hdr);
@@ -119,12 +138,13 @@ int seg6_do_srh_encap(struct sk_buff *skb, struct ipv6_sr_hdr *osrh, int proto)
 * decapsulation will overwrite inner hlim with outer hlim
 */
 
+   flowlabel = seg6_make_flowlabel(net, skb, inner_hdr);
if (skb->protocol == htons(ETH_P_IPV6)) {
ip6_flow_hdr(hdr, ip6_tclass(ip6_flowinfo(inner_hdr)),
-ip6_flowlabel(inner_hdr));
+flowlabel);
hdr->hop_limit = inner_hdr->hop_limit;
} else {
-   ip6_flow_hdr(hdr, 0, 0);
+   ip6_flow_hdr(hdr, 0, flowlabel);
hdr->hop_limit = ip6_dst_hoplimit(skb_dst(skb));
}
 
diff --git a/net/ipv6/sysctl_net_ipv6.c b/net/ipv6/sysctl_net_ipv6.c
index 6fbdef6..e15cd37 100644
--- a/net/ipv6/sysctl_net_ipv6.c
+++ b/net/ipv6/sysctl_net_ipv6.c
@@ -152,6 +152,13 @@ static struct ctl_table ipv6_table_template[] = {
.extra1 = ,
.extra2 = ,
},
+   {
+   .procname   = "seg6_flowlabel",
+   .data   = &init_net.ipv6.sysctl.seg6_flowlabel,
+   .maxlen = sizeof(int),
+   .mode   = 0644,
+   .proc_handler   = proc_dointvec
+   },
{ }
 };
 
@@ -217,6 +224,7 @@ static int __net_init ipv6_sysctl_net_init(struct net *net)
ipv6_table[12].data = &net->ipv6.sysctl.max_dst_opts_len;
ipv6_table[13].data = 

Re: [PATCH net-next v2 0/2] openvswitch: Support conntrack zone limit

2018-04-24 Thread Yi-Hung Wei
On Tue, Apr 24, 2018 at 10:42 AM, David Miller  wrote:
> From: Pravin Shelar 
> Date: Mon, 23 Apr 2018 23:34:48 -0700
>
>> OK. Thanks for the info.
>
> So, ACK, Reviewed-by, etc.? :-)
>

Pravin provided feedback in a previous email.  I will address it and
send out v3.

Thanks,

-Yi-Hung

