Re: [PATCH] slip: Check if rstate is initialized before uncompressing

2018-04-09 Thread tejaswit

On 2018-04-09 20:34, David Miller wrote:

From: Tejaswi Tanikella 
Date: Mon, 9 Apr 2018 14:23:49 +0530


@@ -673,6 +677,7 @@ struct slcompress *
if (cs->cs_tcp.doff > 5)
 	  memcpy(cs->cs_tcpopt, icp + ihl*4 + sizeof(struct tcphdr), 
(cs->cs_tcp.doff - 5) * 4);

cs->cs_hsize = ihl*2 + cs->cs_tcp.doff*2;
+   cs->initialized = 1;
/* Put headers back on packet

 ...

 struct cstate {
byte_t  cs_this;/* connection id number (xmit) */
+   byte_t  initialized;/* non-zero if initialized */


Please use 'bool' and true/false for 'initialized'.


Made the changes.
From b04441041034a5e044705d3c5a2cc338dd4bfc3a Mon Sep 17 00:00:00 2001
From: Tejaswi Tanikella 
Date: Thu, 29 Mar 2018 14:46:41 +0530
Subject: [PATCH] slip: Check if rstate is initialized before uncompressing

On receiving a packet, the state index points to the rstate which must be
used to fill in the IP and TCP headers. But if the state index points to an
rstate which is uninitialized, i.e. filled with zeros, slhc_uncompress gets
stuck in an infinite loop inside ip_fast_csum trying to compute the IP
checksum of a header with zero length.

89.666953:   <2> [] slhc_uncompress+0x464/0x468
89.666965:   <2> [] ppp_receive_nonmp_frame+0x3b4/0x65c
89.666978:   <2> [] ppp_receive_frame+0x64/0x7e0
89.666991:   <2> [] ppp_input+0x104/0x198
89.667005:   <2> [] pppopns_recv_core+0x238/0x370
89.667027:   <2> [] __sk_receive_skb+0xdc/0x250
89.667040:   <2> [] pppopns_recv+0x44/0x60
89.667053:   <2> [] __sock_queue_rcv_skb+0x16c/0x24c
89.667065:   <2> [] sock_queue_rcv_skb+0x2c/0x38
89.667085:   <2> [] raw_rcv+0x124/0x154
89.667098:   <2> [] raw_local_deliver+0x1e0/0x22c
89.667117:   <2> [] ip_local_deliver_finish+0x70/0x24c
89.667131:   <2> [] ip_local_deliver+0x100/0x10c

./scripts/faddr2line vmlinux slhc_uncompress+0x464/0x468 output:
 ip_fast_csum at arch/arm64/include/asm/checksum.h:40
 (inlined by) slhc_uncompress at drivers/net/slip/slhc.c:615

Add a variable to indicate whether the current rstate is initialized. If
a packet arrives for an uninitialized rstate, move to the toss state.
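The failure mode is easy to model in user space (a simplified sketch, not the kernel code: `csum_words` stands in for the `ip_fast_csum` do/while loop, and the iteration cap exists only so the model terminates):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified model of one slhc receive-state slot. */
struct cstate_model {
	uint8_t ihl;		/* IP header length in 32-bit words (cs_ip.ihl) */
	int initialized;	/* the guard this patch adds */
};

/* ip_fast_csum-style loop: it iterates ihl times with a do/while, so an
 * all-zero state (ihl == 0) underflows the counter and spins.  The cap
 * is only here so this model terminates. */
unsigned int csum_words(const struct cstate_model *cs)
{
	unsigned int n = cs->ihl;
	unsigned int words = 0;

	do {
		words++;
	} while (--n && words < 1000000);
	return words;
}

/* The fix: refuse to decompress against an uninitialized slot. */
int uncompress_ok(const struct cstate_model *cs)
{
	return cs->initialized;	/* otherwise: goto bad, i.e. toss state */
}
```

A zeroed slot makes `csum_words` run away to the cap, while `uncompress_ok` rejects it up front, which is exactly what the `goto bad` path in the patch does.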

Signed-off-by: Tejaswi Tanikella 
---
 drivers/net/slip/slhc.c | 5 +
 include/net/slhc_vj.h   | 1 +
 2 files changed, 6 insertions(+)

diff --git a/drivers/net/slip/slhc.c b/drivers/net/slip/slhc.c
index 5782733..f4e93f5 100644
--- a/drivers/net/slip/slhc.c
+++ b/drivers/net/slip/slhc.c
@@ -509,6 +509,10 @@ struct slcompress *
 		if(x < 0 || x > comp->rslot_limit)
 			goto bad;
 
+		/* Check if the cstate is initialized */
+		if (!comp->rstate[x].initialized)
+			goto bad;
+
 		comp->flags &=~ SLF_TOSS;
 		comp->recv_current = x;
 	} else {
@@ -673,6 +677,7 @@ struct slcompress *
 	if (cs->cs_tcp.doff > 5)
 	  memcpy(cs->cs_tcpopt, icp + ihl*4 + sizeof(struct tcphdr), (cs->cs_tcp.doff - 5) * 4);
 	cs->cs_hsize = ihl*2 + cs->cs_tcp.doff*2;
+	cs->initialized = true;
 	/* Put headers back on packet
 	 * Neither header checksum is recalculated
 	 */
diff --git a/include/net/slhc_vj.h b/include/net/slhc_vj.h
index 8716d59..8fcf890 100644
--- a/include/net/slhc_vj.h
+++ b/include/net/slhc_vj.h
@@ -127,6 +127,7 @@
  */
 struct cstate {
 	byte_t	cs_this;	/* connection id number (xmit) */
+	bool	initialized;	/* true if initialized */
 	struct cstate *next;	/* next in ring (xmit) */
 	struct iphdr cs_ip;	/* ip/tcp hdr from most recent packet */
 	struct tcphdr cs_tcp;
-- 
1.9.1



[PATCH v2 0/2] vhost: fix vhost_vq_access_ok() log check

2018-04-09 Thread Stefan Hajnoczi
v2:
 * Rewrote the conditional to make the vq access check clearer [Linus]
 * Added Patch 2 to make the return type consistent and harder to misuse [Linus]

The first patch fixes the vhost virtqueue access check which was recently
broken.  The second patch replaces the int return type with bool to prevent
future bugs.

Stefan Hajnoczi (2):
  vhost: fix vhost_vq_access_ok() log check
  vhost: return bool from *_access_ok() functions

 drivers/vhost/vhost.h |  4 +--
 drivers/vhost/vhost.c | 70 ++-
 2 files changed, 38 insertions(+), 36 deletions(-)

-- 
2.14.3



[PATCH v2 2/2] vhost: return bool from *_access_ok() functions

2018-04-09 Thread Stefan Hajnoczi
Currently vhost *_access_ok() functions return int.  This is error-prone
because there are two popular conventions:

1. 0 means failure, 1 means success
2. -errno means failure, 0 means success

Although vhost mostly uses #1, it does not do so consistently.
umem_access_ok() uses #2.

This patch changes the return type from int to bool so that false means
failure and true means success.  This eliminates a potential source of
errors.
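The hazard with mixed int conventions can be shown with a toy example (hypothetical helpers, not the actual vhost functions):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <errno.h>

/* Convention #1: 0 means failure, 1 means success. */
int check_conv1(size_t size) { return size > 0; }

/* Convention #2: -errno means failure, 0 means success. */
int check_conv2(size_t size) { return size > 0 ? 0 : -EINVAL; }

/* A caller that assumes convention #1 inverts convention #2's meaning:
 * nonzero is read as success, but check_conv2() is nonzero on failure. */
bool looks_ok(size_t size) { return check_conv2(size) != 0; }

/* With bool there is only one possible reading. */
bool check_bool(size_t size) { return size > 0; }
```

`looks_ok(4096)` reports failure for a perfectly valid size and `looks_ok(0)` reports success for an invalid one; returning `bool` everywhere removes that trap.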

Suggested-by: Linus Torvalds 
Signed-off-by: Stefan Hajnoczi 
---
 drivers/vhost/vhost.h |  4 ++--
 drivers/vhost/vhost.c | 66 +--
 2 files changed, 35 insertions(+), 35 deletions(-)

diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index ac4b6056f19a..6e00fa57af09 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -178,8 +178,8 @@ void vhost_dev_cleanup(struct vhost_dev *);
 void vhost_dev_stop(struct vhost_dev *);
 long vhost_dev_ioctl(struct vhost_dev *, unsigned int ioctl, void __user *argp);
 long vhost_vring_ioctl(struct vhost_dev *d, int ioctl, void __user *argp);
-int vhost_vq_access_ok(struct vhost_virtqueue *vq);
-int vhost_log_access_ok(struct vhost_dev *);
+bool vhost_vq_access_ok(struct vhost_virtqueue *vq);
+bool vhost_log_access_ok(struct vhost_dev *);
 
 int vhost_get_vq_desc(struct vhost_virtqueue *,
  struct iovec iov[], unsigned int iov_count,
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 93fd0c75b0d8..b6a082ef33dd 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -641,14 +641,14 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
 }
 EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
 
-static int log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
+static bool log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
 {
u64 a = addr / VHOST_PAGE_SIZE / 8;
 
/* Make sure 64 bit math will not overflow. */
if (a > ULONG_MAX - (unsigned long)log_base ||
a + (unsigned long)log_base > ULONG_MAX)
-   return 0;
+   return false;
 
return access_ok(VERIFY_WRITE, log_base + a,
 (sz + VHOST_PAGE_SIZE * 8 - 1) / VHOST_PAGE_SIZE / 8);
@@ -661,30 +661,30 @@ static bool vhost_overflow(u64 uaddr, u64 size)
 }
 
 /* Caller should have vq mutex and device mutex. */
-static int vq_memory_access_ok(void __user *log_base, struct vhost_umem *umem,
-  int log_all)
+static bool vq_memory_access_ok(void __user *log_base, struct vhost_umem *umem,
+   int log_all)
 {
struct vhost_umem_node *node;
 
if (!umem)
-   return 0;
+   return false;
 
	list_for_each_entry(node, &umem->umem_list, link) {
unsigned long a = node->userspace_addr;
 
if (vhost_overflow(node->userspace_addr, node->size))
-   return 0;
+   return false;
 
 
if (!access_ok(VERIFY_WRITE, (void __user *)a,
node->size))
-   return 0;
+   return false;
else if (log_all && !log_access_ok(log_base,
   node->start,
   node->size))
-   return 0;
+   return false;
}
-   return 1;
+   return true;
 }
 
 static inline void __user *vhost_vq_meta_fetch(struct vhost_virtqueue *vq,
@@ -701,13 +701,13 @@ static inline void __user *vhost_vq_meta_fetch(struct vhost_virtqueue *vq,
 
 /* Can we switch to this memory table? */
 /* Caller should have device mutex but not vq mutex */
-static int memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
-   int log_all)
+static bool memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
+int log_all)
 {
int i;
 
for (i = 0; i < d->nvqs; ++i) {
-   int ok;
+   bool ok;
bool log;
 
	mutex_lock(&d->vqs[i]->mutex);
@@ -717,12 +717,12 @@ static int memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
ok = vq_memory_access_ok(d->vqs[i]->log_base,
 umem, log);
else
-   ok = 1;
+   ok = true;
	mutex_unlock(&d->vqs[i]->mutex);
if (!ok)
-   return 0;
+   return false;
}
-   return 1;
+   return true;
 }
 
 static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
@@ -959,21 +959,21 @@ static void vhost_iotlb_notify_vq(struct vhost_dev *d,
	spin_unlock(&d->iotlb_lock);
 }
 
-static int umem_access_ok(u64 uaddr, u64 size, int access)
+static 

[PATCH v2 1/2] vhost: fix vhost_vq_access_ok() log check

2018-04-09 Thread Stefan Hajnoczi
Commit d65026c6c62e7d9616c8ceb5a53b68bcdc050525 ("vhost: validate log
when IOTLB is enabled") introduced a regression.  The logic was
originally:

  if (vq->iotlb)
  return 1;
  return A && B;

After the patch the short-circuit logic for A was inverted:

  if (A || vq->iotlb)
  return A;
  return B;

This patch fixes the regression by rewriting the checks in the obvious
way, no longer returning A when vq->iotlb is non-NULL (which is hard to
understand).
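The three variants can be compared directly in a standalone sketch (A = vq_log_access_ok(), B = vq_access_ok(), iotlb = vq->iotlb != NULL):

```c
#include <assert.h>
#include <stdbool.h>

/* Fixed logic: the log check must always hold; with IOTLB the rest of
 * the validation happens at prefetch time. */
bool intended(bool A, bool B, bool iotlb)
{
	if (!A)
		return false;
	if (iotlb)
		return true;
	return B;
}

/* Regressed logic from d65026c6c62e: when A succeeds and there is no
 * IOTLB, it returns A and never evaluates B. */
bool regressed(bool A, bool B, bool iotlb)
{
	if (A || iotlb)
		return A;
	return B;
}

/* Pre-d65026c6c62e logic: skipped the log check entirely with IOTLB,
 * which is what that commit set out to change. */
bool original_logic(bool A, bool B, bool iotlb)
{
	if (iotlb)
		return true;
	return A && B;
}
```

The divergence is the (A ok, no IOTLB, B bad) case: the regressed code returns success without ever checking the ring addresses.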

Reported-by: syzbot+65a84dde0214b0387...@syzkaller.appspotmail.com
Cc: Jason Wang 
Signed-off-by: Stefan Hajnoczi 
---
 drivers/vhost/vhost.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 5320039671b7..93fd0c75b0d8 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1244,10 +1244,12 @@ static int vq_log_access_ok(struct vhost_virtqueue *vq,
 /* Caller should have vq mutex and device mutex */
 int vhost_vq_access_ok(struct vhost_virtqueue *vq)
 {
-   int ret = vq_log_access_ok(vq, vq->log_base);
+   if (!vq_log_access_ok(vq, vq->log_base))
+   return 0;
 
-   if (ret || vq->iotlb)
-   return ret;
+   /* Access validation occurs at prefetch time with IOTLB */
+   if (vq->iotlb)
+   return 1;
 
return vq_access_ok(vq, vq->num, vq->desc, vq->avail, vq->used);
 }
-- 
2.14.3



Re: [RFC] vhost: introduce mdev based hardware vhost backend

2018-04-09 Thread Tiwei Bie
On Tue, Apr 10, 2018 at 10:52:52AM +0800, Jason Wang wrote:
> On 2018-04-02 23:23, Tiwei Bie wrote:
> > This patch introduces a mdev (mediated device) based hardware
> > vhost backend. This backend is an abstraction of the various
> > hardware vhost accelerators (potentially any device that uses
> > virtio ring can be used as a vhost accelerator). Some generic
> > mdev parent ops are provided for accelerator drivers to support
> > generating mdev instances.
> > 
> > What's this
> > ===
> > 
> > The idea is that we can setup a virtio ring compatible device
> > with the messages available at the vhost-backend. Originally,
> > these messages are used to implement a software vhost backend,
> > but now we will use these messages to setup a virtio ring
> > compatible hardware device. Then the hardware device will be
> > able to work with the guest virtio driver in the VM just like
> > what the software backend does. That is to say, we can implement
> > a hardware based vhost backend in QEMU, and any virtio ring
> > compatible devices potentially can be used with this backend.
> > (We also call it vDPA -- vhost Data Path Acceleration).
> > 
> > One problem is that, different virtio ring compatible devices
> > may have different device interfaces. That is to say, we will
> > need different drivers in QEMU. It could be troublesome. And
> > that's what this patch is trying to fix. The idea behind this
> > patch is very simple: mdev is a standard way to emulate device
> > in kernel.
> 
> So you just move the abstraction layer from qemu to kernel, and you still
> need different drivers in kernel for different device interfaces of
> accelerators. This looks even more complex than leaving it in qemu. As you
> said, another idea is to implement userspace vhost backend for accelerators
> which seems easier and could co-work with other parts of qemu without
> inventing new type of messages.

I'm not quite sure. Do you think it's acceptable to
add various vendor specific hardware drivers in QEMU?

> 
> Need careful thought here to seek a best solution here.

Yeah, definitely! :)
And your opinions would be very helpful!

> 
> >   So we defined a standard device based on mdev, which
> > is able to accept vhost messages. When the mdev emulation code
> > (i.e. the generic mdev parent ops provided by this patch) gets
> > vhost messages, it will parse and deliver them to accelerator
> > drivers. Drivers can use these messages to setup accelerators.
> > 
> > That is to say, the generic mdev parent ops (e.g. read()/write()/
> > ioctl()/...) will be provided for accelerator drivers to register
> > accelerators as mdev parent devices. And each accelerator device
> > will support generating standard mdev instance(s).
> > 
> > With this standard device interface, we will be able to just
> > develop one userspace driver to implement the hardware based
> > vhost backend in QEMU.
> > 
> > Difference between vDPA and PCI passthru
> > 
> > 
> > The key difference between vDPA and PCI passthru is that, in
> > vDPA only the data path of the device (e.g. DMA ring, notify
> > region and queue interrupt) is passed through to the VM, the
> > device control path (e.g. PCI configuration space and MMIO
> > regions) is still defined and emulated by QEMU.
> > 
> > The benefits of keeping virtio device emulation in QEMU compared
> > with virtio device PCI passthru include (but are not limited to):
> > 
> > - consistent device interface for guest OS in the VM;
> > - max flexibility on the hardware design, especially the
> >accelerator for each vhost backend doesn't have to be a
> >full PCI device;
> > - leveraging the existing virtio live-migration framework;
> > 
> > The interface of this mdev based device
> > ===
> > 
> > 1. BAR0
> > 
> > The MMIO region described by BAR0 is the main control
> > interface. Messages will be written to or read from
> > this region.
> > 
> > The message type is determined by the `request` field
> > in message header. The message size is encoded in the
> > message header too. The message format looks like this:
> > 
> > struct vhost_vfio_op {
> > __u64 request;
> > __u32 flags;
> > /* Flag values: */
> > #define VHOST_VFIO_NEED_REPLY 0x1 /* Whether need reply */
> > __u32 size;
> > union {
> > __u64 u64;
> > struct vhost_vring_state state;
> > struct vhost_vring_addr addr;
> > struct vhost_memory memory;
> > } payload;
> > };
> > 
> > The existing vhost-kernel ioctl cmds are reused as
> > the message requests in above structure.
> > 
> > Each message will be written to or read from this
> > region at offset 0:
> > 
> > int vhost_vfio_write(struct vhost_dev *dev, struct vhost_vfio_op *op)
> > {
> > int count = VHOST_VFIO_OP_HDR_SIZE + op->size;
> > struct vhost_vfio *vfio = dev->opaque;
> > int ret;
> > 
> > ret = pwrite64(vfio->device_fd, op, count, 

Re: [PATCH v5 0/6] enable creating [k,u]probe with perf_event_open

2018-04-09 Thread Alexei Starovoitov

On 4/9/18 9:45 PM, Ravi Bangoria wrote:

Hi Song,

On 12/07/2017 04:15 AM, Song Liu wrote:

With the current kernel, user space tools can only create/destroy [k,u]probes
with a text-based API (kprobe_events and uprobe_events in tracefs). This
approach relies on user space to clean up the [k,u]probes after using them.
However, this is not easy for user space to do properly.

To solve this problem, we introduce a file descriptor based API.
Specifically, we extended perf_event_open to create a [k,u]probe and attach
this [k,u]probe to the file descriptor created by perf_event_open. These
[k,u]probes are associated with this file descriptor, so they are not
available in tracefs.


Sorry for being late. One simple question..

Will it be good to support k/uprobe arguments with perf_event_open()?
Do you have any plans about that?


no plans for that. People that use text based interfaces should
probably be using text interfaces consistently.
imo mixing FD-based kprobe api with text is not worth the complexity.
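For reference, the FD-based API discussed above is driven by filling a `perf_event_attr` for the dynamic `kprobe` PMU; a user-space sketch (the PMU type value, symbol name, and offset below are illustrative assumptions — the real type must be read from `/sys/bus/event_source/devices/kprobe/type`):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <linux/perf_event.h>

/* Build the attr for a function-entry kprobe.  pmu_type must be read
 * from /sys/bus/event_source/devices/kprobe/type at runtime; config1
 * carries a pointer to the symbol name and config2 the offset. */
struct perf_event_attr kprobe_attr(uint32_t pmu_type, const char *func,
				   uint64_t offset)
{
	struct perf_event_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.type = pmu_type;
	attr.size = sizeof(attr);
	attr.config1 = (uint64_t)(uintptr_t)func;	/* probed symbol */
	attr.config2 = offset;				/* offset in symbol */
	attr.sample_period = 1;
	return attr;
}
```

The resulting attr would then be passed to perf_event_open(); closing the returned fd tears the probe down, which is the cleanup property the series is after.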



Re: [PATCH bpf-next v8 05/11] seccomp,landlock: Enforce Landlock programs per process hierarchy

2018-04-09 Thread Alexei Starovoitov
On Mon, Apr 09, 2018 at 12:01:59AM +0200, Mickaël Salaün wrote:
> 
> On 04/08/2018 11:06 PM, Andy Lutomirski wrote:
> > On Sun, Apr 8, 2018 at 6:13 AM, Mickaël Salaün  wrote:
> >>
> >> On 02/27/2018 10:48 PM, Mickaël Salaün wrote:
> >>>
> >>> On 27/02/2018 17:39, Andy Lutomirski wrote:
>  On Tue, Feb 27, 2018 at 5:32 AM, Alexei Starovoitov
>   wrote:
> > On Tue, Feb 27, 2018 at 05:20:55AM +, Andy Lutomirski wrote:
> >> On Tue, Feb 27, 2018 at 4:54 AM, Alexei Starovoitov
> >>  wrote:
> >>> On Tue, Feb 27, 2018 at 04:40:34AM +, Andy Lutomirski wrote:
>  On Tue, Feb 27, 2018 at 2:08 AM, Alexei Starovoitov
>   wrote:
> > On Tue, Feb 27, 2018 at 01:41:15AM +0100, Mickaël Salaün wrote:
> >> The seccomp(2) syscall can be used by a task to apply a Landlock 
> >> program
> >> to itself. As a seccomp filter, a Landlock program is enforced for 
> >> the
> >> current task and all its future children. A program is immutable 
> >> and a
> >> task can only add new restricting programs to itself, forming a 
> >> list of
> >> programs.
> >>
> >> A Landlock program is tied to a Landlock hook. If the action on a 
> >> kernel
> >> object is allowed by the other Linux security mechanisms (e.g. DAC,
> >> capabilities, other LSM), then a Landlock hook related to this 
> >> kind of
> >> object is triggered. The list of programs for this hook is then
> >> evaluated. Each program returns a 32-bit value which can deny the
> >> action
> >> on a kernel object with a non-zero value. If every program in the
> >> list
> >> returns zero, then the action on the object is allowed.
> >>
> >> Multiple Landlock programs can be chained to share a 64-bit value
> >> for a
> >> call chain (e.g. evaluating multiple elements of a file path).  
> >> This
> >> chaining is restricted when a process constructs this chain by
> >> loading a
> >> program, but additional checks are performed when it requests to 
> >> apply
> >> this chain of programs to itself.  The restrictions ensure that it 
> >> is
> >> not possible to call multiple programs in a way that would imply to
> >> handle multiple shared values (i.e. cookies) for one chain.  For 
> >> now,
> >> only a fs_pick program can be chained to the same type of program,
> >> because it may make sense if they have different triggers (cf. next
> >> commits).  These restrictions still allow reusing Landlock
> >> programs in
> >> a safe way (e.g. use the same loaded fs_walk program with multiple
> >> chains of fs_pick programs).
> >>
> >> Signed-off-by: Mickaël Salaün 
> >
> > ...
> >
> >> +struct landlock_prog_set *landlock_prepend_prog(
> >> + struct landlock_prog_set *current_prog_set,
> >> + struct bpf_prog *prog)
> >> +{
> >> + struct landlock_prog_set *new_prog_set = current_prog_set;
> >> + unsigned long pages;
> >> + int err;
> >> + size_t i;
> >> + struct landlock_prog_set tmp_prog_set = {};
> >> +
> >> + if (prog->type != BPF_PROG_TYPE_LANDLOCK_HOOK)
> >> + return ERR_PTR(-EINVAL);
> >> +
> >> + /* validate memory size allocation */
> >> + pages = prog->pages;
> >> + if (current_prog_set) {
> >> + size_t i;
> >> +
> >> + for (i = 0; i < 
> >> ARRAY_SIZE(current_prog_set->programs); i++) {
> >> + struct landlock_prog_list *walker_p;
> >> +
> >> + for (walker_p = 
> >> current_prog_set->programs[i];
> >> + walker_p; walker_p = 
> >> walker_p->prev)
> >> + pages += walker_p->prog->pages;
> >> + }
> >> + /* count a struct landlock_prog_set if we need to 
> >> allocate one */
> >> + if (refcount_read(&current_prog_set->usage) != 1)
> >> + pages += round_up(sizeof(*current_prog_set), 
> >> PAGE_SIZE)
> >> + / PAGE_SIZE;
> >> + }
> >> + if (pages > LANDLOCK_PROGRAMS_MAX_PAGES)
> >> + return ERR_PTR(-E2BIG);
> >> +
> >> + /* ensure early that we can allocate enough memory for the 
> >> new
> >> +  * prog_lists */
> >> + err = 

Re: [PATCH v5 0/6] enable creating [k,u]probe with perf_event_open

2018-04-09 Thread Ravi Bangoria
Hi Song,

On 12/07/2017 04:15 AM, Song Liu wrote:
> With the current kernel, user space tools can only create/destroy [k,u]probes
> with a text-based API (kprobe_events and uprobe_events in tracefs). This
> approach relies on user space to clean up the [k,u]probes after using them.
> However, this is not easy for user space to do properly.
>
> To solve this problem, we introduce a file descriptor based API.
> Specifically, we extended perf_event_open to create a [k,u]probe and attach
> this [k,u]probe to the file descriptor created by perf_event_open. These
> [k,u]probes are associated with this file descriptor, so they are not
> available in tracefs.

Sorry for being late. One simple question..

Will it be good to support k/uprobe arguments with perf_event_open()?
Do you have any plans about that?

Thanks,
Ravi



Re: [PATCH v15 ] net/veth/XDP: Line-rate packet forwarding in kernel

2018-04-09 Thread Md. Islam
Gotcha. I'm working on it. I've created a function that creates an
sk_buff from an xdp_buff, but I'm still getting an error while the
sk_buff is being processed by TCP. I will send you the patch once I'm done.

Thanks!

On Thu, Apr 5, 2018 at 10:55 PM, David Ahern  wrote:
> On 4/3/18 9:15 PM, Md. Islam wrote:
>>> Have you looked at what I would consider a more interesting use case of
>>> packets into a node and delivered to a namespace via veth?
>>>
>>>+--+---
>>>| Host | container
>>>|  |
>>>|+---{ veth1 }-|-{veth2}
>>>|   |  |
>>>+{ eth1 }--
>>>
>>> Can xdp / bpf on eth1 be used to speed up delivery to the container?
>>
>> I didn't consider that, but it sounds like an important use case. How
>> do we determine which namespace gets the packet?
>>
>
> FIB lookups of course. Starting with my patch set that handles
> forwarding on eth1, what is needed for XDP with veth? ie., a program on
> eth1 does the lookup and redirects the packet to veth1 for Tx.
> ndo_xdp_xmit for veth knows the packet needs to be forwarded to veth2
> internally and there is no skb allocated for the packet yet.



-- 
Tamim
PhD Candidate,
Kent State University
http://web.cs.kent.edu/~mislam4/


RE: [PATCH net 3/3] lan78xx: Lan7801 Support for Fixed PHY

2018-04-09 Thread RaghuramChary.Jallipalli
> Ah, cool. I was thinking you were going to say an SFP cage.
> 
> What switch is it? Does it have a DSA driver?
> 
We have a 3-port switch, KSZ9893, yet to be released, which is similar to the
KSZ9477/KSZ9897, which has a DSA driver.
Most of the time the 3-port switch is used with LAN7801 to extend the ports.

Thanks,
-Raghu


Re: [RFC v2] virtio: support packed ring

2018-04-09 Thread Tiwei Bie
On Tue, Apr 10, 2018 at 10:55:25AM +0800, Jason Wang wrote:
> On 2018-04-01 22:12, Tiwei Bie wrote:
> > Hello everyone,
> > 
> > This RFC implements packed ring support for virtio driver.
> > 
> > The code was tested with DPDK vhost (testpmd/vhost-PMD) implemented
> > by Jens at http://dpdk.org/ml/archives/dev/2018-January/089417.html
> > Minor changes are needed for the vhost code, e.g. to kick the guest.
> > 
> > TODO:
> > - Refinements and bug fixes;
> > - Split into small patches;
> > - Test indirect descriptor support;
> > - Test/fix event suppression support;
> > - Test devices other than net;
> > 
> > RFC v1 -> RFC v2:
> > - Add indirect descriptor support - compile test only;
> > - Add event suppression support - compile test only;
> > - Move vring_packed_init() out of uapi (Jason, MST);
> > - Merge two loops into one in virtqueue_add_packed() (Jason);
> > - Split vring_unmap_one() for packed ring and split ring (Jason);
> > - Avoid using '%' operator (Jason);
> > - Rename free_head -> next_avail_idx (Jason);
> > - Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
> > - Some other refinements and bug fixes;
> > 
> > Thanks!
> 
> Will try to review this later.
> 
> But it would be better if you can split it (more than 1000 lines is too big
> to be reviewed easily). E.g you can at least split it into three patches,
> new structures, datapath, and event suppression.
> 

No problem! It's on my TODO list. I'll get it done in the next version.

Thanks!


Re: [PATCH net-next 1/5] virtio: Add support for SCTP checksum offloading

2018-04-09 Thread Jason Wang



On 2018-04-02 21:40, Vladislav Yasevich wrote:

To support SCTP checksum offloading, we need to add a new feature
to virtio_net, so we can negotiate support between the hypervisor
and the guest.

The signalling to the guest that an alternate checksum needs to
be used is done via a new flag in the virtio_net_hdr.  If the
flag is set, the host will know to perform an alternate checksum
calculation, which right now is only CRC32c.
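The alternate checksum here is CRC32c (Castagnoli), not the usual Internet one's-complement sum; a minimal bitwise sketch of the arithmetic (the kernel's own crc32c() in lib/libcrc32c is table- or instruction-accelerated, this version only illustrates what the offload replaces):

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Bitwise CRC32c, reflected polynomial 0x82F63B78, as used by SCTP. */
uint32_t crc32c_sw(uint32_t crc, const unsigned char *buf, size_t len)
{
	crc = ~crc;
	while (len--) {
		crc ^= *buf++;
		/* Process one bit at a time, LSB first. */
		for (int k = 0; k < 8; k++)
			crc = (crc >> 1) ^ (0x82F63B78u & -(crc & 1));
	}
	return ~crc;
}
```

`crc32c_sw(0, (const unsigned char *)"123456789", 9)` yields the standard CRC32c check value 0xE3069283.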

Signed-off-by: Vladislav Yasevich 
---
  drivers/net/virtio_net.c| 11 ---
  include/linux/virtio_net.h  |  6 ++
  include/uapi/linux/virtio_net.h |  2 ++
  3 files changed, 16 insertions(+), 3 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 7b187ec..b601294 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2724,9 +2724,14 @@ static int virtnet_probe(struct virtio_device *vdev)
/* Do we support "hardware" checksums? */
if (virtio_has_feature(vdev, VIRTIO_NET_F_CSUM)) {
/* This opens up the world of extra features. */
-   dev->hw_features |= NETIF_F_HW_CSUM | NETIF_F_SG;
+   netdev_features_t sctp = 0;
+
+   if (virtio_has_feature(vdev, VIRTIO_NET_F_SCTP_CSUM))
+   sctp |= NETIF_F_SCTP_CRC;
+
+   dev->hw_features |= NETIF_F_HW_CSUM | NETIF_F_SG | sctp;
if (csum)
-   dev->features |= NETIF_F_HW_CSUM | NETIF_F_SG;
+   dev->features |= NETIF_F_HW_CSUM | NETIF_F_SG | sctp;
  
  		if (virtio_has_feature(vdev, VIRTIO_NET_F_GSO)) {

dev->hw_features |= NETIF_F_TSO
@@ -2952,7 +2957,7 @@ static struct virtio_device_id id_table[] = {
  };
  
  #define VIRTNET_FEATURES \

-   VIRTIO_NET_F_CSUM, VIRTIO_NET_F_GUEST_CSUM, \
+   VIRTIO_NET_F_CSUM, VIRTIO_NET_F_GUEST_CSUM,  VIRTIO_NET_F_SCTP_CSUM, \


It looks to me that _F_SCTP_CSUM implies the ability of both device and
driver. Do we still need the flexibility, as with csum, to distinguish
guest/host ability, e.g. a _F_GUEST_SCTP_CSUM?


Thanks


VIRTIO_NET_F_MAC, \
VIRTIO_NET_F_HOST_TSO4, VIRTIO_NET_F_HOST_UFO, VIRTIO_NET_F_HOST_TSO6, \
VIRTIO_NET_F_HOST_ECN, VIRTIO_NET_F_GUEST_TSO4, 
VIRTIO_NET_F_GUEST_TSO6, \
diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h
index f144216..2e7a64a 100644
--- a/include/linux/virtio_net.h
+++ b/include/linux/virtio_net.h
@@ -39,6 +39,9 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb,
  
  		if (!skb_partial_csum_set(skb, start, off))

return -EINVAL;
+
+   if (hdr->flags & VIRTIO_NET_HDR_F_CSUM_NOT_INET)
+   skb->csum_not_inet = 1;
}
  
  	if (hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) {

@@ -96,6 +99,9 @@ static inline int virtio_net_hdr_from_skb(const struct 
sk_buff *skb,
hdr->flags = VIRTIO_NET_HDR_F_DATA_VALID;
} /* else everything is zero */
  
+	if (skb->csum_not_inet)
+		hdr->flags |= VIRTIO_NET_HDR_F_CSUM_NOT_INET;
+
return 0;
  }
  
diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h

index 5de6ed3..3f279c8 100644
--- a/include/uapi/linux/virtio_net.h
+++ b/include/uapi/linux/virtio_net.h
@@ -36,6 +36,7 @@
  #define VIRTIO_NET_F_GUEST_CSUM   1   /* Guest handles pkts w/ 
partial csum */
  #define VIRTIO_NET_F_CTRL_GUEST_OFFLOADS 2 /* Dynamic offload configuration. 
*/
  #define VIRTIO_NET_F_MTU  3   /* Initial MTU advice */
+#define VIRTIO_NET_F_SCTP_CSUM  4  /* SCTP checksum offload support */
  #define VIRTIO_NET_F_MAC  5   /* Host has given MAC address. */
  #define VIRTIO_NET_F_GUEST_TSO4   7   /* Guest can handle TSOv4 in. */
  #define VIRTIO_NET_F_GUEST_TSO6   8   /* Guest can handle TSOv6 in. */
@@ -101,6 +102,7 @@ struct virtio_net_config {
  struct virtio_net_hdr_v1 {
  #define VIRTIO_NET_HDR_F_NEEDS_CSUM   1   /* Use csum_start, csum_offset 
*/
  #define VIRTIO_NET_HDR_F_DATA_VALID   2   /* Csum is valid */
+#define VIRTIO_NET_HDR_F_CSUM_NOT_INET  4   /* Checksum is not inet */
__u8 flags;
  #define VIRTIO_NET_HDR_GSO_NONE   0   /* Not a GSO frame */
  #define VIRTIO_NET_HDR_GSO_TCPV4  1   /* GSO frame, IPv4 TCP (TSO) */




Re: [Resend Patch 1/3] Vmbus: Add function to report available ring buffer to write in total ring size percentage

2018-04-09 Thread Martin K. Petersen

Long,

> I hope this patch set goes through SCSI, because its purpose is to
> improve storvsc.
>
> If this strategy is not possible, I can resubmit the 1st two patches to
> net, and the 3rd patch to scsi after the 1st two are merged.

Applied to my staging tree for 4.18/scsi-queue. Thanks!

-- 
Martin K. Petersen  Oracle Linux Engineering


Re: [RFC v2] virtio: support packed ring

2018-04-09 Thread Jason Wang



On 2018-04-01 22:12, Tiwei Bie wrote:

Hello everyone,

This RFC implements packed ring support for virtio driver.

The code was tested with DPDK vhost (testpmd/vhost-PMD) implemented
by Jens at http://dpdk.org/ml/archives/dev/2018-January/089417.html
Minor changes are needed for the vhost code, e.g. to kick the guest.

TODO:
- Refinements and bug fixes;
- Split into small patches;
- Test indirect descriptor support;
- Test/fix event suppression support;
- Test devices other than net;

RFC v1 -> RFC v2:
- Add indirect descriptor support - compile test only;
- Add event suppression support - compile test only;
- Move vring_packed_init() out of uapi (Jason, MST);
- Merge two loops into one in virtqueue_add_packed() (Jason);
- Split vring_unmap_one() for packed ring and split ring (Jason);
- Avoid using '%' operator (Jason);
- Rename free_head -> next_avail_idx (Jason);
- Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
- Some other refinements and bug fixes;

Thanks!


Will try to review this later.

But it would be better if you can split it (more than 1000 lines is too 
big to be reviewed easily). E.g you can at least split it into three 
patches, new structures, datapath, and event suppression.


Thanks




Signed-off-by: Tiwei Bie 
---
  drivers/virtio/virtio_ring.c   | 1094 +---
  include/linux/virtio_ring.h|8 +-
  include/uapi/linux/virtio_config.h |   12 +-
  include/uapi/linux/virtio_ring.h   |   61 ++
  4 files changed, 980 insertions(+), 195 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 71458f493cf8..0515dca34d77 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -58,14 +58,15 @@
  
  struct vring_desc_state {

void *data; /* Data for callback. */
-   struct vring_desc *indir_desc;  /* Indirect descriptor, if any. */
+   void *indir_desc;   /* Indirect descriptor, if any. */
+   int num;/* Descriptor list length. */
  };
  
  struct vring_virtqueue {

struct virtqueue vq;
  
-	/* Actual memory layout for this queue */

-   struct vring vring;
+   /* Is this a packed ring? */
+   bool packed;
  
  	/* Can we use weak barriers? */

bool weak_barriers;
@@ -79,19 +80,45 @@ struct vring_virtqueue {
/* Host publishes avail event idx */
bool event;
  
-	/* Head of free buffer list. */

-   unsigned int free_head;
/* Number we've added since last sync. */
unsigned int num_added;
  
  	/* Last used index we've seen. */

u16 last_used_idx;
  
-	/* Last written value to avail->flags */

-   u16 avail_flags_shadow;
+   union {
+   /* Available for split ring */
+   struct {
+   /* Actual memory layout for this queue. */
+   struct vring vring;
  
-	/* Last written value to avail->idx in guest byte order */

-   u16 avail_idx_shadow;
+   /* Head of free buffer list. */
+   unsigned int free_head;
+
+   /* Last written value to avail->flags */
+   u16 avail_flags_shadow;
+
+   /* Last written value to avail->idx in
+* guest byte order. */
+   u16 avail_idx_shadow;
+   };
+
+   /* Available for packed ring */
+   struct {
+   /* Actual memory layout for this queue. */
+   struct vring_packed vring_packed;
+
+   /* Driver ring wrap counter. */
+   u8 wrap_counter;
+
+   /* Index of the next avail descriptor. */
+   unsigned int next_avail_idx;
+
+   /* Last written value to driver->flags in
+* guest byte order. */
+   u16 event_flags_shadow;
+   };
+   };
  
  	/* How to notify other side. FIXME: commonalize hcalls! */

bool (*notify)(struct virtqueue *vq);
@@ -201,8 +228,33 @@ static dma_addr_t vring_map_single(const struct 
vring_virtqueue *vq,
  cpu_addr, size, direction);
  }
  
-static void vring_unmap_one(const struct vring_virtqueue *vq,
-			    struct vring_desc *desc)
+static void vring_unmap_one_split(const struct vring_virtqueue *vq,
+ struct vring_desc *desc)
+{
+   u16 flags;
+
+   if (!vring_use_dma_api(vq->vq.vdev))
+   return;
+
+   flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
+
+   if (flags & VRING_DESC_F_INDIRECT) {
+   dma_unmap_single(vring_dma_dev(vq),
+virtio64_to_cpu(vq->vq.vdev, desc->addr),
+virtio32_to_cpu(vq->vq.vdev, desc->len),
+   

Re: [RFC] vhost: introduce mdev based hardware vhost backend

2018-04-09 Thread Jason Wang



On 2018-04-02 23:23, Tiwei Bie wrote:

This patch introduces a mdev (mediated device) based hardware
vhost backend. This backend is an abstraction of the various
hardware vhost accelerators (potentially any device that uses
virtio ring can be used as a vhost accelerator). Some generic
mdev parent ops are provided for accelerator drivers to support
generating mdev instances.

What's this
===

The idea is that we can setup a virtio ring compatible device
with the messages available at the vhost-backend. Originally,
these messages are used to implement a software vhost backend,
but now we will use these messages to setup a virtio ring
compatible hardware device. Then the hardware device will be
able to work with the guest virtio driver in the VM just like
what the software backend does. That is to say, we can implement
a hardware based vhost backend in QEMU, and any virtio ring
compatible devices potentially can be used with this backend.
(We also call it vDPA -- vhost Data Path Acceleration).

One problem is that different virtio ring compatible devices
may have different device interfaces. That is to say, we would
need different drivers in QEMU, which could be troublesome, and
that's what this patch is trying to fix. The idea behind this
patch is very simple: mdev is a standard way to emulate devices
in the kernel.


So you just move the abstraction layer from qemu to kernel, and you 
still need different drivers in kernel for different device interfaces 
of accelerators. This looks even more complex than leaving it in qemu. 
As you said, another idea is to implement userspace vhost backend for 
accelerators which seems easier and could co-work with other parts of 
qemu without inventing new type of messages.


This needs careful thought to find the best solution.


  So we defined a standard device based on mdev, which
is able to accept vhost messages. When the mdev emulation code
(i.e. the generic mdev parent ops provided by this patch) gets
vhost messages, it will parse and deliver them to accelerator
drivers. Drivers can use these messages to setup accelerators.

That is to say, the generic mdev parent ops (e.g. read()/write()/
ioctl()/...) will be provided for accelerator drivers to register
accelerators as mdev parent devices. And each accelerator device
will support generating standard mdev instance(s).

With this standard device interface, we will be able to just
develop one userspace driver to implement the hardware based
vhost backend in QEMU.

Difference between vDPA and PCI passthru
========================================

The key difference between vDPA and PCI passthru is that, in
vDPA, only the data path of the device (e.g. DMA ring, notify
region and queue interrupt) is passed through to the VM, while
the device control path (e.g. PCI configuration space and MMIO
regions) is still defined and emulated by QEMU.

The benefits of keeping virtio device emulation in QEMU compared
with virtio device PCI passthru include (but are not limited to):

- consistent device interface for guest OS in the VM;
- max flexibility on the hardware design, especially the
   accelerator for each vhost backend doesn't have to be a
   full PCI device;
- leveraging the existing virtio live-migration framework;

The interface of this mdev based device
===

1. BAR0

The MMIO region described by BAR0 is the main control
interface. Messages will be written to or read from
this region.

The message type is determined by the `request` field in the
message header, and the message size is also encoded in the
header. The message format looks like this:

struct vhost_vfio_op {
__u64 request;
__u32 flags;
/* Flag values: */
#define VHOST_VFIO_NEED_REPLY 0x1 /* Whether need reply */
__u32 size;
union {
__u64 u64;
struct vhost_vring_state state;
struct vhost_vring_addr addr;
struct vhost_memory memory;
} payload;
};

The existing vhost-kernel ioctl cmds are reused as
the message requests in the above structure.

Each message will be written to or read from this
region at offset 0:

int vhost_vfio_write(struct vhost_dev *dev, struct vhost_vfio_op *op)
{
int count = VHOST_VFIO_OP_HDR_SIZE + op->size;
struct vhost_vfio *vfio = dev->opaque;
int ret;

ret = pwrite64(vfio->device_fd, op, count, vfio->bar0_offset);
if (ret != count)
return -1;

return 0;
}

int vhost_vfio_read(struct vhost_dev *dev, struct vhost_vfio_op *op)
{
int count = VHOST_VFIO_OP_HDR_SIZE + op->size;
struct vhost_vfio *vfio = dev->opaque;
uint64_t request = op->request;
int ret;

ret = pread64(vfio->device_fd, op, count, vfio->bar0_offset);
if (ret != count || request != op->request)
return -1;

return 0;
}

It's quite straightforward to configure the device.
Just write the message to 

Re: [PATCH net 3/3] lan78xx: Lan7801 Support for Fixed PHY

2018-04-09 Thread Andrew Lunn
On Tue, Apr 10, 2018 at 02:23:23AM +, 
raghuramchary.jallipa...@microchip.com wrote:
> > 
> > What do you expect is connected to the MAC if there is no PHY?
> > 
> Hi Andrew,
> We connect the Ethernet switch to this MAC.

Ah, cool. I was thinking you were going to say an SFP cage.

What switch is it? Does it have a DSA driver?

 Andrew


Re: [PATCH 2/2] alx: add disable_wol parameter

2018-04-09 Thread AceLan Kao
The problem is that I don't have a machine with that wakeup issue, and I
need the WoL feature.
Instead of spreading an "alx with WoL" dkms package everywhere, I would
like to see it supported in the driver and disabled by default.

Moreover, the wakeup issue may come from old Atheros chips, or result
from a buggy BIOS.
With WoL removed from the driver, no one will report issues about it,
and we lose any chance to find a fix.

Adding this feature back is not papering over the issue; it gives
people a chance to examine the feature.

2018-04-09 22:50 GMT+08:00 David Miller :
> From: Andrew Lunn 
> Date: Mon, 9 Apr 2018 14:39:10 +0200
>
>> On Mon, Apr 09, 2018 at 07:35:14PM +0800, AceLan Kao wrote:
>>> The WoL feature was reported broken and will lead to
>>> the system resume immediately after suspending.
>>> This symptom is not happening on every system, so adding
>>> disable_wol option and disable WoL by default to prevent the issue from
>>> happening again.
>>
>>>  const char alx_drv_name[] = "alx";
>>>
>>> +/* disable WoL by default */
>>> +bool disable_wol = 1;
>>> +module_param(disable_wol, bool, 0);
>>> +MODULE_PARM_DESC(disable_wol, "Disable Wake on Lan feature");
>>> +
>>
>> Hi AceLan
>>
>> This seems like you are papering over the cracks. And module
>> parameters are not liked.
>>
>> Please try to find the real problem.
>
> Agreed.


RE: [PATCH net 1/3] lan78xx: PHY DSP registers initialization to address EEE link drop issues with long cables

2018-04-09 Thread RaghuramChary.Jallipalli
> >
> > Hi Raghuram
> >
> > You might want to look at phy_read_paged(), phy_write_paged(), etc.
> >
> > There can be race conditions with paged access.
> 
> Yep, so something like:
> 
> static void lan88xx_TR_reg_set(struct phy_device *phydev, u16 regaddr,
>  u32 data)
> {
>   int save_page, val;
>   u16 buf;
> 
>   save_page = phy_save_page(phydev);
>   save_page = phy_save_page(phydev);
>   phy_write_paged(phydev, LAN88XX_EXT_PAGE_ACCESS_TR,
>   LAN88XX_EXT_PAGE_TR_LOW_DATA, (data & 0xFFFF));
>   phy_write_paged(phydev, LAN88XX_EXT_PAGE_ACCESS_TR,
>   LAN88XX_EXT_PAGE_TR_HIGH_DATA,
>   (data & 0x00FF0000) >> 16);
> 
>   /* Config control bits [15:13] of register */
>   buf = (regaddr & ~(0x3 << 13));/* Clr [14:13] to write data in reg */
>   buf |= 0x8000; /* Set [15] to Packet transmit */
> 
>   phy_write_paged(phydev, LAN88XX_EXT_PAGE_ACCESS_TR,
>   LAN88XX_EXT_PAGE_TR_CR, buf);
>   usleep_range(1000, 2000);/* Wait for Data to be written */
> 
>   val = phy_read_paged(phydev, LAN88XX_EXT_PAGE_ACCESS_TR,
>LAN88XX_EXT_PAGE_TR_CR);
>   if (!(val & 0x8000))
>   pr_warn("TR Register[0x%X] configuration failed\n",
> regaddr);
> 
>   phy_restore_page(phydev, save_page, 0); }
> 
> Since PHY accesses and thus things like phy_save_page() can fail, the return
> type of this function should be changed to 'int' and some error checking
> should be added.

Thanks David/Andrew.
Will take care of it.

Thanks,
Raghu


RE: [PATCH net 2/3] lan78xx: Add support to dump lan78xx registers

2018-04-09 Thread RaghuramChary.Jallipalli
> If there is no PHY attached, you probably should not include PHY_REG_SIZE
> here.
> 
Sure, will address it.

Thanks,
Raghu


RE: [PATCH net 3/3] lan78xx: Lan7801 Support for Fixed PHY

2018-04-09 Thread RaghuramChary.Jallipalli
> 
> What do you expect is connected to the MAC if there is no PHY?
> 
Hi Andrew,
We connect the Ethernet switch to this MAC. The Ethernet switch port connected 
to the MAC does not have a PHY.
In this case, we need to load the MAC driver and set the link speed/duplex.

Thanks,
Raghu


Re: [RFC bpf-next] bpf: document eBPF helpers and add a script to generate man page

2018-04-09 Thread Alexei Starovoitov
On Mon, Apr 09, 2018 at 02:25:26PM +0100, Quentin Monnet wrote:
> 
> Anyway, I am fine with keeping just signatures, descriptions and return
> values for now. I will submit a new version with only those items.

Thank you.

Could you also split it into few patches?
 include/uapi/linux/bpf.h   | 2237 
 scripts/bpf_helpers_doc.py |  568 +++
 2 files changed, 2429 insertions(+), 376 deletions(-)

replying back and forth on a single patch of such size will be tedious
for others to follow.
Maybe document ~10 helpers at a time? A total of ~7 patches plus an
extra patch for the .py script?



[PATCH v2] net: decnet: Replace GFP_ATOMIC with GFP_KERNEL in dn_route_init

2018-04-09 Thread Jia-Ju Bai
dn_route_init() is never called in atomic context.

The call chain ending up at dn_route_init() is:
[1] dn_route_init() <- decnet_init()
decnet_init() is only set as a parameter of module_init().

Despite never getting called from atomic context,
dn_route_init() calls __get_free_pages() with GFP_ATOMIC,
which does not sleep during allocation.
GFP_ATOMIC is not necessary here and can be replaced with GFP_KERNEL,
which can sleep and improves the chance of successful allocation.

This was found by a static analysis tool named DCNS, written by myself,
and I also checked it manually.

Signed-off-by: Jia-Ju Bai 
---
v2:
* Modify the description of GFP_ATOMIC in v1.
  Thank Eric for good advice.
---
 net/decnet/dn_route.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/decnet/dn_route.c b/net/decnet/dn_route.c
index 0bd3afd..59ed12a 100644
--- a/net/decnet/dn_route.c
+++ b/net/decnet/dn_route.c
@@ -1898,7 +1898,7 @@ void __init dn_route_init(void)
while(dn_rt_hash_mask & (dn_rt_hash_mask - 1))
dn_rt_hash_mask--;
dn_rt_hash_table = (struct dn_rt_hash_bucket *)
-   __get_free_pages(GFP_ATOMIC, order);
+   __get_free_pages(GFP_KERNEL, order);
} while (dn_rt_hash_table == NULL && --order > 0);
 
if (!dn_rt_hash_table)
-- 
1.9.1



[PATCH v2] net: tipc: Replace GFP_ATOMIC with GFP_KERNEL in tipc_mon_create

2018-04-09 Thread Jia-Ju Bai
tipc_mon_create() is never called in atomic context.

The call chain ending up at tipc_mon_create() is:
[1] tipc_mon_create() <- tipc_enable_bearer() <- tipc_nl_bearer_enable()
tipc_nl_bearer_enable() calls rtnl_lock(), which indicates this function
is not called in atomic context.

Despite never getting called from atomic context,
tipc_mon_create() calls kzalloc() with GFP_ATOMIC,
which does not sleep during allocation.
GFP_ATOMIC is not necessary here and can be replaced with GFP_KERNEL,
which can sleep and improves the chance of successful allocation.

This was found by a static analysis tool named DCNS, written by myself,
and I also checked it manually.

Signed-off-by: Jia-Ju Bai 
---
v2:
* Modify the description of GFP_ATOMIC in v1.
  Thank Eric for good advice.
---
 net/tipc/monitor.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/tipc/monitor.c b/net/tipc/monitor.c
index 9e109bb..9714d80 100644
--- a/net/tipc/monitor.c
+++ b/net/tipc/monitor.c
@@ -604,9 +604,9 @@ int tipc_mon_create(struct net *net, int bearer_id)
if (tn->monitors[bearer_id])
return 0;
 
-   mon = kzalloc(sizeof(*mon), GFP_ATOMIC);
-   self = kzalloc(sizeof(*self), GFP_ATOMIC);
-   dom = kzalloc(sizeof(*dom), GFP_ATOMIC);
+   mon = kzalloc(sizeof(*mon), GFP_KERNEL);
+   self = kzalloc(sizeof(*self), GFP_KERNEL);
+   dom = kzalloc(sizeof(*dom), GFP_KERNEL);
if (!mon || !self || !dom) {
kfree(mon);
kfree(self);
-- 
1.9.1



[PATCH v2] net: ncsi: Replace GFP_ATOMIC with GFP_KERNEL in ncsi_register_dev

2018-04-09 Thread Jia-Ju Bai
ncsi_register_dev() is never called in atomic context.
This function is only called by ftgmac100_probe() in 
drivers/net/ethernet/faraday/ftgmac100.c.
And ftgmac100_probe() is only set as ".probe" in "struct platform_driver".

Despite never getting called from atomic context,
ncsi_register_dev() calls kzalloc() with GFP_ATOMIC,
which does not sleep during allocation.
GFP_ATOMIC is not necessary here and can be replaced with GFP_KERNEL,
which can sleep and improves the chance of successful allocation.

This was found by a static analysis tool named DCNS, written by myself,
and I also checked it manually.

Signed-off-by: Jia-Ju Bai 
---
v2:
* Modify the description of GFP_ATOMIC in v1.
  Thank Eric for good advice.
---
 net/ncsi/ncsi-manage.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ncsi/ncsi-manage.c b/net/ncsi/ncsi-manage.c
index 3fd3c39..6b5b5a0 100644
--- a/net/ncsi/ncsi-manage.c
+++ b/net/ncsi/ncsi-manage.c
@@ -1508,7 +1508,7 @@ struct ncsi_dev *ncsi_register_dev(struct net_device *dev,
return nd;
 
/* Create NCSI device */
-   ndp = kzalloc(sizeof(*ndp), GFP_ATOMIC);
+   ndp = kzalloc(sizeof(*ndp), GFP_KERNEL);
if (!ndp)
return NULL;
 
-- 
1.9.1



[PATCH v2] net: dccp: Replace GFP_ATOMIC with GFP_KERNEL in dccp_init

2018-04-09 Thread Jia-Ju Bai
dccp_init() is never called in atomic context.
This function is only set as a parameter of module_init().

Despite never getting called from atomic context,
dccp_init() calls __get_free_pages() with GFP_ATOMIC,
which does not sleep during allocation.
GFP_ATOMIC is not necessary here and can be replaced with GFP_KERNEL,
which can sleep and improves the chance of successful allocation.

This was found by a static analysis tool named DCNS, written by myself,
and I also checked it manually.

Signed-off-by: Jia-Ju Bai 
---
v2:
* Modify the description of GFP_ATOMIC in v1.
  Thank Eric for good advice.
---
 net/dccp/proto.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/dccp/proto.c b/net/dccp/proto.c
index b68168f..f63ba93 100644
--- a/net/dccp/proto.c
+++ b/net/dccp/proto.c
@@ -1159,7 +1159,7 @@ static int __init dccp_init(void)
hash_size--;
dccp_hashinfo.ehash_mask = hash_size - 1;
dccp_hashinfo.ehash = (struct inet_ehash_bucket *)
-   __get_free_pages(GFP_ATOMIC|__GFP_NOWARN, ehash_order);
+   __get_free_pages(GFP_KERNEL|__GFP_NOWARN, ehash_order);
} while (!dccp_hashinfo.ehash && --ehash_order > 0);
 
if (!dccp_hashinfo.ehash) {
@@ -1182,7 +1182,7 @@ static int __init dccp_init(void)
bhash_order > 0)
continue;
dccp_hashinfo.bhash = (struct inet_bind_hashbucket *)
-   __get_free_pages(GFP_ATOMIC|__GFP_NOWARN, bhash_order);
+   __get_free_pages(GFP_KERNEL|__GFP_NOWARN, bhash_order);
} while (!dccp_hashinfo.bhash && --bhash_order >= 0);
 
if (!dccp_hashinfo.bhash) {
-- 
1.9.1



Re: [PATCH RESEND net] vhost: fix vhost_vq_access_ok() log check

2018-04-09 Thread Stefan Hajnoczi
On Tue, Apr 10, 2018 at 3:40 AM, Michael S. Tsirkin  wrote:
> From: Stefan Hajnoczi 
>
> Commit d65026c6c62e7d9616c8ceb5a53b68bcdc050525 ("vhost: validate log
> when IOTLB is enabled") introduced a regression.  The logic was
> originally:
>
>   if (vq->iotlb)
>   return 1;
>   return A && B;
>
> After the patch the short-circuit logic for A was inverted:
>
>   if (A || vq->iotlb)
>   return A;
>   return B;
>
> The correct logic is:
>
>   if (!A || vq->iotlb)
>   return A;
>   return B;
>
> Reported-by: syzbot+65a84dde0214b0387...@syzkaller.appspotmail.com
> Cc: Jason Wang 
> Signed-off-by: Stefan Hajnoczi 
> Acked-by: Michael S. Tsirkin 
>
> ---
>  drivers/vhost/vhost.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

NACK

I will send a v2 with cleaner logic as suggested by Linus.

Stefan


Re: [PATCH bpf] bpf/tracing: fix a deadlock in perf_event_detach_bpf_prog

2018-04-09 Thread Alexei Starovoitov

On 4/9/18 11:41 AM, Yonghong Song wrote:



On 4/9/18 9:47 AM, Alexei Starovoitov wrote:

On 4/9/18 9:18 AM, Yonghong Song wrote:

syzbot reported a possible deadlock in perf_event_detach_bpf_prog.

...

@@ -985,16 +986,31 @@ int perf_event_query_prog_array(struct perf_event *event, void __user *info)
 		return -EINVAL;
 	if (copy_from_user(&query, uquery, sizeof(query)))
 		return -EFAULT;
-	if (query.ids_len > BPF_TRACE_MAX_PROGS)
+
+	ids_len = query.ids_len;
+	if (ids_len > BPF_TRACE_MAX_PROGS)
 		return -E2BIG;
+	ids = kcalloc(ids_len, sizeof(u32), GFP_USER | __GFP_NOWARN);
+	if (!ids)
+		return -ENOMEM;
 
 	mutex_lock(&bpf_event_mutex);
 	ret = bpf_prog_array_copy_info(event->tp_event->prog_array,
-				       uquery->ids,
-				       query.ids_len,
-				       &uquery->prog_cnt);
+				       ids,
+				       ids_len,
+				       &prog_cnt);
 	mutex_unlock(&bpf_event_mutex);
 
+	if (!ret || ret == -ENOSPC) {
+		if (copy_to_user(&uquery->prog_cnt, &prog_cnt, sizeof(prog_cnt)) ||
+		    copy_to_user(uquery->ids, ids, ids_len * sizeof(u32))) {
+			ret = -EFAULT;
+			goto out;
+		}
+	}
+
+out:
+	kfree(ids);


alloc/free just to avoid this locking dependency feels suboptimal.


We actually already do kcalloc/kfree in bpf_prog_array_copy_to_user.
In that function, we do not copy_to_user one id at a time.
We allocate a temporary array, store the result there,
and at the end call one copy_to_user to copy to the user buffer.

The patch here just moves this allocation and the associated
copy_to_user out of that function and outside bpf_event_mutex.
It does not introduce new allocations.


I see, so the patch is essentially open-coding
bpf_prog_array_copy_to_user().
Can we share the code then?
The bpf/core.c callsite used by trace/bpf_trace.c
and the similar callsite in bpf/cgroup.c
should use a common helper.




may be we can get rid of bpf_event_mutex in some cases?
the perf event itself is locked via perf_event_ctx_lock() when we're
calling perf_event_query_prog_array, perf_event_attach|detach_bpf_prog.
I forgot what was the motivation for us to introduce it in the
first place.


The original motivation for the lock was to make sure bpf_prog_array
does not change in the middle of attach/detach/query. It looks like
we have:
   . perf_event_attach|query under perf_event_ctx_lock
   . perf_event_detach not under perf_event_ctx_lock
Introducing perf_event_ctx_lock to perf_event_detach could still
lead to the deadlock.


ahh, right, since the progs are in even->tp_event which can
be shared by multiple perf_events.
Scratch that idea.



Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-09 Thread Siwei Liu
On Mon, Apr 9, 2018 at 4:03 PM, Stephen Hemminger
 wrote:
> On Mon, 9 Apr 2018 15:30:42 -0700
> Siwei Liu  wrote:
>
>> On Mon, Apr 9, 2018 at 3:15 PM, Andrew Lunn  wrote:
>> >> No, implementation wise I'd avoid changing the class on the fly. What
>> >> I'm looking to is a means to add a secondary class or class aliasing
>> >> mechanism for netdevs that allows mapping for a kernel device
>> >> namespace (/class/net-kernel) to userspace (/class/net). Imagine
>> >> creating symlinks between these two namespaces as an analogy. All
>> >> userspace visible netdevs today will have both a kernel name and a
>> >> userspace visible name, having one (/class/net) referencing the other
>> >> (/class/net-kernel) in its own namespace. The newly introduced
>> >> IFF_AUTO_MANAGED device will have a kernel name only
>> >> (/class/net-kernel). As a result, the existing applications using
>> >> /class/net don't break, while we're adding the kernel namespace that
>> >> allows IFF_AUTO_MANAGED devices which will not be exposed to userspace
>> >> at all.
>> >
>> > My gut feeling is this whole scheme will not fly. You really should be
>> > talking to GregKH.
>>
>> Will do. Before spreading it out loudly I'd run it within netdev to
>> clarify the need for why not exposing the lower netdevs is critical
>> for cloud service providers in the face of introducing a new feature,
>> and we are not hiding anything but exposing it in a way that don't
>> break existing userspace applications while introducing feature is
>> possible with the limitation of keeping old userspace still.
>>
>> >
>> > Anyway, please remember that IFF_AUTO_MANAGED will need to be dynamic.
>> > A device can start out as a normal device, and will change to being
>> > automatic later, when the user on top of it probes.
>>
>> Sure. In whatever form it's still a netdev, and changing the namespace
>> should be more dynamic than changing the class.
>>
>> Thanks a lot,
>> -Siwei
>>
>> >
>> > Andrew
>
> Also, remember for netdev's /sys is really a third class API.
> The primary API's are netlink and ioctl. Also why not use existing
> network namespaces rather than inventing a new abstraction?

Because we want to leave old userspace unmodified while making SR-IOV
live migration transparent to users. Specifically, we'd want old udevd
to skip uevents for the lower netdevs, while making new udevd able to
name the bypass_master interface by referencing the PCI slot
information, which is only present in the sysfs entry for the
IFF_AUTO_MANAGED net device.

The problem with using a network namespace is that no sysfs entry will
be around for an IFF_AUTO_MANAGED netdev if we isolate it in a
separate netns, unless we generalize the scope of what netns is
designed for (isolation, I mean). For auto-managed netdevs we don't
necessarily want strict isolation, but rather a way of sticking to the
1-netdev model for the strict backward-compatibility requirement of
old userspace, while exposing the information in a way new userspace
understands.

Thanks,
-Siwei


Re: [PATCH net-next] netns: filter uevents correctly

2018-04-09 Thread Eric W. Biederman
Christian Brauner  writes:

> On Thu, Apr 05, 2018 at 10:59:49PM -0500, Eric W. Biederman wrote:
>> Christian Brauner  writes:
>> 
>> > On Thu, Apr 05, 2018 at 05:26:59PM +0300, Kirill Tkhai wrote:
>> >> On 05.04.2018 17:07, Christian Brauner wrote:
>> >> > On Thu, Apr 05, 2018 at 04:01:03PM +0300, Kirill Tkhai wrote:
>> >> >> On 04.04.2018 22:48, Christian Brauner wrote:
>> >> >>> commit 07e98962fa77 ("kobject: Send hotplug events in all network 
>> >> >>> namespaces")
>> >> >>>
>> >> >>> enabled sending hotplug events into all network namespaces back in 
>> >> >>> 2010.
>> >> >>> Over time the set of uevents that get sent into all network 
>> >> >>> namespaces has
>> >> >>> shrunk. We have now reached the point where hotplug events for all 
>> >> >>> devices
>> >> >>> that carry a namespace tag are filtered according to that namespace.
>> >> >>>
>> >> >>> Specifically, they are filtered whenever the namespace tag of the 
>> >> >>> kobject
>> >> >>> does not match the namespace tag of the netlink socket. One example 
>> >> >>> are
>> >> >>> network devices. Uevents for network devices only show up in the 
>> >> >>> network
>> >> >>> namespaces these devices are moved to or created in.
>> >> >>>
>> >> >>> However, any uevent for a kobject that does not have a namespace tag
>> >> >>> associated with it will not be filtered and we will *try* to 
>> >> >>> broadcast it
>> >> >>> into all network namespaces.
>> >> >>>
>> >> >>> The original patchset was written in 2010 before user namespaces were 
>> >> >>> a
>> >> >>> thing. With the introduction of user namespaces sending out uevents 
>> >> >>> became
>> >> >>> partially isolated as they were filtered by user namespaces:
>> >> >>>
>> >> >>> net/netlink/af_netlink.c:do_one_broadcast()
>> >> >>>
>> >> >>> if (!net_eq(sock_net(sk), p->net)) {
>> >> >>> if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
>> >> >>> return;
>> >> >>>
>> >> >>> if (!peernet_has_id(sock_net(sk), p->net))
>> >> >>> return;
>> >> >>>
>> >> >>> if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
>> >> >>>  CAP_NET_BROADCAST))
>> >> >>> return;
>> >> >>> }
>> >> >>>
>> >> >>> The file_ns_capable() check will check whether the caller had
>> >> >>> CAP_NET_BROADCAST at the time of opening the netlink socket in the 
>> >> >>> user
>> >> >>> namespace of interest. This check is fine in general but seems 
>> >> >>> insufficient
>> >> >>> to me when paired with uevents. The reason is that devices always 
>> >> >>> belong to
>> >> >>> the initial user namespace so uevents for kobjects that do not carry a
>> >> >>> namespace tag should never be sent into another user namespace. This 
>> >> >>> has
>> >> >>> been the intention all along. But there's one case where this breaks,
>> >> >>> namely if a new user namespace is created by root on the host and an
>> >> >>> identity mapping is established between root on the host and root in 
>> >> >>> the
>> >> >>> new user namespace. Here's a reproducer:
>> >> >>>
>> >> >>>  sudo unshare -U --map-root
>> >> >>>  udevadm monitor -k
>> >> >>>  # Now change to initial user namespace and e.g. do
>> >> >>>  modprobe kvm
>> >> >>>  # or
>> >> >>>  rmmod kvm
>> >> >>>
>> >> >>> will allow the non-initial user namespace to retrieve all uevents 
>> >> >>> from the
>> >> >>> host. This seems very anecdotal given that in the general case user
>> >> >>> namespaces do not see any uevents and also can't really do anything 
>> >> >>> useful
>> >> >>> with them.
>> >> >>>
>> >> >>> Additionally, it is now possible to send uevents from userspace. As 
>> >> >>> such we
>> >> >>> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
>> >> >>> namespace of the network namespace of the netlink socket) userspace 
>> >> >>> process
>> >> >>> make a decision what uevents should be sent.
>> >> >>>
>> >> >>> This makes me think that we should simply ensure that uevents for 
>> >> >>> kobjects
>> >> >>> that do not carry a namespace tag are *always* filtered by user 
>> >> >>> namespace
>> >> >>> in kobj_bcast_filter(). Specifically:
>> >> >>> - If the owning user namespace of the uevent socket is not 
>> >> >>> init_user_ns the
>> >> >>>   event will always be filtered.
>> >> >>> - If the network namespace the uevent socket belongs to was created 
>> >> >>> in the
>> >> >>>   initial user namespace but was opened from a non-initial user 
>> >> >>> namespace
>> >> >>>   the event will be filtered as well.
>> >> >>> Put another way, uevents for kobjects not carrying a namespace tag 
>> >> >>> are now
>> >> >>> always only sent to the initial user namespace. The regression 
>> >> >>> potential
>> >> >>> for this is near to non-existent since user namespaces can't really do
>> >> >>> anything with interesting devices.
>> >> >>>
>> >> >>> Signed-off-by: Christian Brauner 
>> 

Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-09 Thread Stephen Hemminger
On Mon, 9 Apr 2018 15:30:42 -0700
Siwei Liu  wrote:

> On Mon, Apr 9, 2018 at 3:15 PM, Andrew Lunn  wrote:
> >> No, implementation wise I'd avoid changing the class on the fly. What
> >> I'm looking to is a means to add a secondary class or class aliasing
> >> mechanism for netdevs that allows mapping for a kernel device
> >> namespace (/class/net-kernel) to userspace (/class/net). Imagine
> >> creating symlinks between these two namespaces as an analogy. All
> >> userspace visible netdevs today will have both a kernel name and a
> >> userspace visible name, having one (/class/net) referencing the other
> >> (/class/net-kernel) in its own namespace. The newly introduced
> >> IFF_AUTO_MANAGED device will have a kernel name only
> >> (/class/net-kernel). As a result, the existing applications using
> >> /class/net don't break, while we're adding the kernel namespace that
> >> allows IFF_AUTO_MANAGED devices which will not be exposed to userspace
> >> at all.  
> >
> > My gut feeling is this whole scheme will not fly. You really should be
> > talking to GregKH.  
> 
> Will do. Before spreading it out loudly I'd run it within netdev to
> clarify the need for why not exposing the lower netdevs is critical
> for cloud service providers in the face of introducing a new feature,
> and we are not hiding anything but exposing it in a way that don't
> break existing userspace applications while introducing feature is
> possible with the limitation of keeping old userspace still.
> 
> >
> > Anyway, please remember that IFF_AUTO_MANAGED will need to be dynamic.
> > A device can start out as a normal device, and will change to being
> > automatic later, when the user on top of it probes.  
> 
> Sure. In whatever form it's still a netdev, and changing the namespace
> should be more dynamic than changing the class.
> 
> Thanks a lot,
> -Siwei
> 
> >
> > Andrew  

Also, remember for netdev's /sys is really a third class API.
The primary API's are netlink and ioctl. Also why not use existing
network namespaces rather than inventing a new abstraction?


Re: [PATCH] iscsi: respond to netlink with unicast when appropriate

2018-04-09 Thread Lee Duncan
On 04/09/2018 03:15 PM, Chris Leech wrote:
> Instead of always multicasting responses, send a unicast netlink message
> directed at the correct pid.  This will be needed if we ever want to
> support multiple userspace processes interacting with the kernel over
> iSCSI netlink simultaneously.  Limitations can currently be seen if you
> attempt to run multiple iscsistart commands in parallel.
> 
> We've fixed up the userspace issues in iscsistart that prevented
> multiple instances from running, so now attempts to speed up booting by
> bringing up multiple iscsi sessions at once in the initramfs are just
> running into misrouted responses that this fixes.

As you may know, I disagree with running multiple iscsistart processes
at the same time, since that's what iscsid is for.

Nevertheless, I believe we _should_ be able to have multiple processes
talking to the kernel target code, so I agree with these changes.

> 
> Signed-off-by: Chris Leech 
> ---
>  drivers/scsi/scsi_transport_iscsi.c | 29 ++---
>  1 file changed, 18 insertions(+), 11 deletions(-)
> 
> ... (diffs removed to save electrons)
> 

Reviewed-by: Lee Duncan 
-- 
Lee Duncan
SUSE Labs


Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-09 Thread Siwei Liu
On Mon, Apr 9, 2018 at 3:15 PM, Andrew Lunn  wrote:
>> No, implementation wise I'd avoid changing the class on the fly. What
>> I'm looking to is a means to add a secondary class or class aliasing
>> mechanism for netdevs that allows mapping for a kernel device
>> namespace (/class/net-kernel) to userspace (/class/net). Imagine
>> creating symlinks between these two namespaces as an analogy. All
>> userspace visible netdevs today will have both a kernel name and a
>> userspace visible name, having one (/class/net) referencing the other
>> (/class/net-kernel) in its own namespace. The newly introduced
>> IFF_AUTO_MANAGED device will have a kernel name only
>> (/class/net-kernel). As a result, the existing applications using
>> /class/net don't break, while we're adding the kernel namespace that
>> allows IFF_AUTO_MANAGED devices which will not be exposed to userspace
>> at all.
>
> My gut feeling is this whole scheme will not fly. You really should be
> talking to GregKH.

Will do. Before spreading it out loudly I'd run it within netdev to
clarify why not exposing the lower netdevs is critical for cloud
service providers when introducing a new feature, and that we are not
hiding anything but exposing it in a way that doesn't break existing
userspace applications, so the feature can be introduced while old
userspace keeps working.

>
> Anyway, please remember that IFF_AUTO_MANAGED will need to be dynamic.
> A device can start out as a normal device, and will change to being
> automatic later, when the user on top of it probes.

Sure. In whatever form it's still a netdev, and changing the namespace
should be more dynamic than changing the class.

Thanks a lot,
-Siwei

>
> Andrew


[PATCH] iscsi: respond to netlink with unicast when appropriate

2018-04-09 Thread Chris Leech
Instead of always multicasting responses, send a unicast netlink message
directed at the correct pid.  This will be needed if we ever want to
support multiple userspace processes interacting with the kernel over
iSCSI netlink simultaneously.  Limitations can currently be seen if you
attempt to run multiple iscsistart commands in parallel.

We've fixed up the userspace issues in iscsistart that prevented
multiple instances from running, so now attempts to speed up booting by
bringing up multiple iscsi sessions at once in the initramfs are just
running into misrouted responses that this fixes.

Signed-off-by: Chris Leech 
---
 drivers/scsi/scsi_transport_iscsi.c | 29 ++---
 1 file changed, 18 insertions(+), 11 deletions(-)

diff --git a/drivers/scsi/scsi_transport_iscsi.c 
b/drivers/scsi/scsi_transport_iscsi.c
index f4b52b44b966..65f6c94f2e9b 100644
--- a/drivers/scsi/scsi_transport_iscsi.c
+++ b/drivers/scsi/scsi_transport_iscsi.c
@@ -2322,6 +2322,12 @@ iscsi_multicast_skb(struct sk_buff *skb, uint32_t group, 
gfp_t gfp)
return nlmsg_multicast(nls, skb, 0, group, gfp);
 }
 
+static int
+iscsi_unicast_skb(struct sk_buff *skb, u32 portid)
+{
+   return nlmsg_unicast(nls, skb, portid);
+}
+
 int iscsi_recv_pdu(struct iscsi_cls_conn *conn, struct iscsi_hdr *hdr,
   char *data, uint32_t data_size)
 {
@@ -2524,14 +2530,11 @@ void iscsi_ping_comp_event(uint32_t host_no, struct 
iscsi_transport *transport,
 EXPORT_SYMBOL_GPL(iscsi_ping_comp_event);
 
 static int
-iscsi_if_send_reply(uint32_t group, int seq, int type, int done, int multi,
-   void *payload, int size)
+iscsi_if_send_reply(u32 portid, int type, void *payload, int size)
 {
struct sk_buff  *skb;
struct nlmsghdr *nlh;
int len = nlmsg_total_size(size);
-   int flags = multi ? NLM_F_MULTI : 0;
-   int t = done ? NLMSG_DONE : type;
 
skb = alloc_skb(len, GFP_ATOMIC);
if (!skb) {
@@ -2539,10 +2542,9 @@ iscsi_if_send_reply(uint32_t group, int seq, int type, 
int done, int multi,
return -ENOMEM;
}
 
-   nlh = __nlmsg_put(skb, 0, 0, t, (len - sizeof(*nlh)), 0);
-   nlh->nlmsg_flags = flags;
+   nlh = __nlmsg_put(skb, 0, 0, type, (len - sizeof(*nlh)), 0);
memcpy(nlmsg_data(nlh), payload, size);
-   return iscsi_multicast_skb(skb, group, GFP_ATOMIC);
+   return iscsi_unicast_skb(skb, portid);
 }
 
 static int
@@ -3470,6 +3472,7 @@ static int
 iscsi_if_recv_msg(struct sk_buff *skb, struct nlmsghdr *nlh, uint32_t *group)
 {
int err = 0;
+   u32 portid;
struct iscsi_uevent *ev = nlmsg_data(nlh);
struct iscsi_transport *transport = NULL;
struct iscsi_internal *priv;
@@ -3490,10 +3493,12 @@ iscsi_if_recv_msg(struct sk_buff *skb, struct nlmsghdr 
*nlh, uint32_t *group)
if (!try_module_get(transport->owner))
return -EINVAL;
 
+   portid = NETLINK_CB(skb).portid;
+
switch (nlh->nlmsg_type) {
case ISCSI_UEVENT_CREATE_SESSION:
err = iscsi_if_create_session(priv, ep, ev,
- NETLINK_CB(skb).portid,
+ portid,
  ev->u.c_session.initial_cmdsn,
  ev->u.c_session.cmds_max,
  ev->u.c_session.queue_depth);
@@ -3506,7 +3511,7 @@ iscsi_if_recv_msg(struct sk_buff *skb, struct nlmsghdr 
*nlh, uint32_t *group)
}
 
err = iscsi_if_create_session(priv, ep, ev,
-   NETLINK_CB(skb).portid,
+   portid,
ev->u.c_bound_session.initial_cmdsn,
ev->u.c_bound_session.cmds_max,
ev->u.c_bound_session.queue_depth);
@@ -3664,6 +3669,8 @@ iscsi_if_recv_msg(struct sk_buff *skb, struct nlmsghdr 
*nlh, uint32_t *group)
 static void
 iscsi_if_rx(struct sk_buff *skb)
 {
+   u32 portid = NETLINK_CB(skb).portid;
+
	mutex_lock(&rx_queue_mutex);
while (skb->len >= NLMSG_HDRLEN) {
int err;
@@ -3699,8 +3706,8 @@ iscsi_if_rx(struct sk_buff *skb)
break;
if (ev->type == ISCSI_UEVENT_GET_CHAP && !err)
break;
-   err = iscsi_if_send_reply(group, nlh->nlmsg_seq,
-   nlh->nlmsg_type, 0, 0, ev, sizeof(*ev));
+   err = iscsi_if_send_reply(portid, nlh->nlmsg_type,
+ ev, sizeof(*ev));
} while (err < 0 && err != -ECONNREFUSED && err != -ESRCH);
skb_pull(skb, rlen);
}
-- 
2.14.3



Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-09 Thread Andrew Lunn
> No, implementation wise I'd avoid changing the class on the fly. What
> I'm looking to is a means to add a secondary class or class aliasing
> mechanism for netdevs that allows mapping for a kernel device
> namespace (/class/net-kernel) to userspace (/class/net). Imagine
> creating symlinks between these two namespaces as an analogy. All
> userspace visible netdevs today will have both a kernel name and a
> userspace visible name, having one (/class/net) referencing the other
> (/class/net-kernel) in its own namespace. The newly introduced
> IFF_AUTO_MANAGED device will have a kernel name only
> (/class/net-kernel). As a result, the existing applications using
> /class/net don't break, while we're adding the kernel namespace that
> allows IFF_AUTO_MANAGED devices which will not be exposed to userspace
> at all.

My gut feeling is this whole scheme will not fly. You really should be
talking to GregKH.

Anyway, please remember that IFF_AUTO_MANAGED will need to be dynamic.
A device can start out as a normal device, and will change to being
automatic later, when the user on top of it probes.

Andrew


Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-09 Thread Siwei Liu
On Fri, Apr 6, 2018 at 8:19 PM, Andrew Lunn  wrote:
> Hi Siwei
>
>> I think everyone seems to agree not to fiddle with the ":" prefix, but
>> rather have a new class of network subsystem under /sys/class thus a
>> separate device namespace e.g. /sys/class/net-kernel for those
>> auto-managed lower netdevs is needed.
>
> How do you get a device into this new class? I don't know the Linux
> driver model too well, but to get a device out of one class and into
> another, i think you need to device_del(dev). modify dev->class and
> then device_add(dev). However, device_add() says you are not allowed
> to do this.

No, implementation wise I'd avoid changing the class on the fly. What
I'm looking to is a means to add a secondary class or class aliasing
mechanism for netdevs that allows mapping for a kernel device
namespace (/class/net-kernel) to userspace (/class/net). Imagine
creating symlinks between these two namespaces as an analogy. All
userspace visible netdevs today will have both a kernel name and a
userspace visible name, having one (/class/net) referencing the other
(/class/net-kernel) in its own namespace. The newly introduced
IFF_AUTO_MANAGED device will have a kernel name only
(/class/net-kernel). As a result, the existing applications using
/class/net don't break, while we're adding the kernel namespace that
allows IFF_AUTO_MANAGED devices which will not be exposed to userspace
at all.

As this requires changing the internals of driver model core it's a
rather big hammer approach I'd think. If there exists a better
implementation than this to allow adding a separate layer of in-kernel
device namespace, I'd more than welcome to hear about.

>
> And i don't even see how this helps. Are you also not going to call
> list_netdevice()? Are you going to add some other list for these
> devices in a different class?

list_netdevice() is still called. I think with the current RFC patch,
I've added two lists for netdevs under the kernel namespace:
dev_cmpl_list and name_cmpl_hlist. As a result of that, all userspace
netdevs get registered will be added to two types of lists: the
userspace list for e.g. dev_list, and also the kernelspace list e.g.
dev_cmpl_list (I can rename it to something more accurate). The
IFF_AUTO_MANAGED device will be only added to kernelspace list e.g.
dev_cmpl_list.

Hope all your questions are answered.

Thanks,
-Siwei


>
>Andrew


Re: [RFC PATCH v2 00/14] Introducing AF_XDP support

2018-04-09 Thread William Tu
On Tue, Mar 27, 2018 at 9:59 AM, Björn Töpel  wrote:
> From: Björn Töpel 
>
> This RFC introduces a new address family called AF_XDP that is
> optimized for high performance packet processing and, in upcoming
> patch sets, zero-copy semantics. In this v2 version, we have removed
> all zero-copy related code in order to make it smaller, simpler and
> hopefully more review friendly. This RFC only supports copy-mode for
> the generic XDP path (XDP_SKB) for both RX and TX and copy-mode for RX
> using the XDP_DRV path. Zero-copy support requires XDP and driver
> changes that Jesper Dangaard Brouer is working on. Some of his work is
> already on the mailing list for review. We will publish our zero-copy
> support for RX and TX on top of his patch sets at a later point in
> time.
>
> An AF_XDP socket (XSK) is created with the normal socket()
> syscall. Associated with each XSK are two queues: the RX queue and the
> TX queue. A socket can receive packets on the RX queue and it can send
> packets on the TX queue. These queues are registered and sized with
> the setsockopts XDP_RX_QUEUE and XDP_TX_QUEUE, respectively. It is
> mandatory to have at least one of these queues for each socket. In
> contrast to AF_PACKET V2/V3 these descriptor queues are separated from
> packet buffers. An RX or TX descriptor points to a data buffer in a
> memory area called a UMEM. RX and TX can share the same UMEM so that a
> packet does not have to be copied between RX and TX. Moreover, if a
> packet needs to be kept for a while due to a possible retransmit, the
> descriptor that points to that packet can be changed to point to
> another and reused right away. This again avoids copying data.
>
> This new dedicated packet buffer area is called a UMEM. It consists of
> a number of equally size frames and each frame has a unique frame
> id. A descriptor in one of the queues references a frame by
> referencing its frame id. The user space allocates memory for this
> UMEM using whatever means it feels is most appropriate (malloc, mmap,
> huge pages, etc). This memory area is then registered with the kernel
> using the new setsockopt XDP_UMEM_REG. The UMEM also has two queues:
> the FILL queue and the COMPLETION queue. The fill queue is used by the
> application to send down frame ids for the kernel to fill in with RX
> packet data. References to these frames will then appear in the RX
> queue of the XSK once they have been received. The completion queue,
> on the other hand, contains frame ids that the kernel has transmitted
> completely and can now be used again by user space, for either TX or
> RX. Thus, the frame ids appearing in the completion queue are ids that
> were previously transmitted using the TX queue. In summary, the RX and
> FILL queues are used for the RX path and the TX and COMPLETION queues
> are used for the TX path.
>
Can we register a UMEM to multiple devices' queues?

So far the l2fwd sample code is sending/receiving from the same
queue. I'm thinking about forwarding packets from one device to another.
Now I'm copying packets from one device's RX desc to another device's TX
completion queue. But this introduces one extra copy.

One way I can do this is to call the bpf_redirect helper function, but
sometimes I still need to process the packet in userspace.

I like this work!
Thanks a lot.
William


Re: [net-next PATCH v3 01/11] soc: ti: K2G: enhancement to support QMSS in K2G NAVSS

2018-04-09 Thread Rob Herring
On Mon, Apr 02, 2018 at 10:37:51AM -0400, Murali Karicheri wrote:
> Navigator Subsystem (NAVSS) available on the K2G SoC has a cut-down
> version of QMSS with fewer queues, internal linking RAM with fewer
> buffers, etc.  It doesn't have the status and explicit push register
> space present in the QMSS available on other K2 SoCs. So define reg
> indices specific to QMSS on K2G. This patch introduces the
> "ti,66ak2g-navss-qm" compatible to identify QMSS on K2G NAVSS
> and to customize the dts handling code. Per the device manual,
> descriptors with index less than or equal to regions0_size are in
> region 0 for K2 QMSS, whereas for QMSS on K2G, descriptors with index
> less than regions0_size are in region 0. So update the size accordingly
> in the regions0_size bits of the linking RAM size 0 register.
> 
> Signed-off-by: Murali Karicheri 
> Signed-off-by: WingMan Kwok 
> ---
>  .../bindings/soc/ti/keystone-navigator-qmss.txt|  9 ++-

Reviewed-by: Rob Herring 

>  drivers/soc/ti/knav_qmss.h |  6 ++
>  drivers/soc/ti/knav_qmss_queue.c   | 90 
> --
>  3 files changed, 82 insertions(+), 23 deletions(-)


[PATCH] mISDN: Remove VLAs

2018-04-09 Thread Laura Abbott

There's an ongoing effort to remove VLAs[1] from the kernel to eventually
turn on -Wvla. Remove the VLAs from the mISDN code by switching to using
kstrdup in one place and just using the upper bound in another.

[1] https://lkml.org/lkml/2018/3/7/621

Signed-off-by: Laura Abbott 
---
 drivers/isdn/mISDN/dsp_hwec.c   | 8 +---
 drivers/isdn/mISDN/l1oip_core.c | 2 +-
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/isdn/mISDN/dsp_hwec.c b/drivers/isdn/mISDN/dsp_hwec.c
index a6e87076acc2..5336bbdbfdc5 100644
--- a/drivers/isdn/mISDN/dsp_hwec.c
+++ b/drivers/isdn/mISDN/dsp_hwec.c
@@ -68,12 +68,12 @@ void dsp_hwec_enable(struct dsp *dsp, const char *arg)
goto _do;
 
{
-   char _dup[len + 1];
char *dup, *tok, *name, *val;
int tmp;
 
-   strcpy(_dup, arg);
-   dup = _dup;
+   dup = kstrdup(arg, GFP_ATOMIC);
+   if (!dup)
+   return;
 
	while ((tok = strsep(&dup, ","))) {
if (!strlen(tok))
@@ -89,6 +89,8 @@ void dsp_hwec_enable(struct dsp *dsp, const char *arg)
deftaps = tmp;
}
}
+
+   kfree(dup);
}
 
 _do:
diff --git a/drivers/isdn/mISDN/l1oip_core.c b/drivers/isdn/mISDN/l1oip_core.c
index 21d50e4cc5e1..df594245deca 100644
--- a/drivers/isdn/mISDN/l1oip_core.c
+++ b/drivers/isdn/mISDN/l1oip_core.c
@@ -279,7 +279,7 @@ l1oip_socket_send(struct l1oip *hc, u8 localcodec, u8 
channel, u32 chanmask,
  u16 timebase, u8 *buf, int len)
 {
u8 *p;
-   u8 frame[len + 32];
+   u8 frame[L1OIP_MAX_PERFRAME + 32];
struct socket *socket = NULL;
 
if (debug & DEBUG_L1OIP_MSG)
-- 
2.14.3



Re: [PATCH AUTOSEL for 4.9 160/293] MIPS: Give __secure_computing() access to syscall arguments.

2018-04-09 Thread James Hogan
On Mon, Apr 09, 2018 at 12:24:58AM +, Sasha Levin wrote:
> From: David Daney 
> 
> [ Upstream commit 669c4092225f0ed5df12ebee654581b558a5e3ed ]
> 
> KProbes of __seccomp_filter() are not very useful without access to
> the syscall arguments.
> 
> Do what x86 does, and populate a struct seccomp_data to be passed to
> __secure_computing().  This allows samples/bpf/tracex5 to extract a
> sensible trace.

This broke o32 indirect syscalls, and was fixed by commit 3d729deaf287
("MIPS: seccomp: Fix indirect syscall args").

Cheers
James




Patch "x86/asm: Don't use RBP as a temporary register in csum_partial_copy_generic()" has been added to the 4.9-stable tree

2018-04-09 Thread gregkh

This is a note to let you know that I've just added the patch titled

x86/asm: Don't use RBP as a temporary register in 
csum_partial_copy_generic()

to the 4.9-stable tree which can be found at:

http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary

The filename of the patch is:
 
x86-asm-don-t-use-rbp-as-a-temporary-register-in-csum_partial_copy_generic.patch
and it can be found in the queue-4.9 subdirectory.

If you, or anyone else, feels it should not be added to the stable tree,
please let  know about it.


From foo@baz Mon Apr  9 17:09:24 CEST 2018
From: Josh Poimboeuf 
Date: Thu, 4 May 2017 09:51:40 -0500
Subject: x86/asm: Don't use RBP as a temporary register in 
csum_partial_copy_generic()

From: Josh Poimboeuf 


[ Upstream commit 42fc6c6cb1662ba2fa727dd01c9473c63be4e3b6 ]

Andrey Konovalov reported the following warning while fuzzing the kernel
with syzkaller:

  WARNING: kernel stack regs at 8800686869f8 in a.out:4933 has bad 'bp' 
value c3fc855a10167ec0

The unwinder dump revealed that RBP had a bad value when an interrupt
occurred in csum_partial_copy_generic().

That function saves RBP on the stack and then overwrites it, using it as
a scratch register.  That's problematic because it breaks stack traces
if an interrupt occurs in the middle of the function.

Replace the usage of RBP with another callee-saved register (R15) so
stack traces are no longer affected.

Reported-by: Andrey Konovalov 
Tested-by: Andrey Konovalov 
Signed-off-by: Josh Poimboeuf 
Cc: Cong Wang 
Cc: David S . Miller 
Cc: Dmitry Vyukov 
Cc: Eric Dumazet 
Cc: Kostya Serebryany 
Cc: Linus Torvalds 
Cc: Marcelo Ricardo Leitner 
Cc: Neil Horman 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Vlad Yasevich 
Cc: linux-s...@vger.kernel.org
Cc: netdev 
Cc: syzkaller 
Link: 
http://lkml.kernel.org/r/4b03a961efda5ec9bfe46b7b9c9ad72d1efad343.1493909486.git.jpoim...@redhat.com
Signed-off-by: Ingo Molnar 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman 
---
 arch/x86/lib/csum-copy_64.S |   12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

--- a/arch/x86/lib/csum-copy_64.S
+++ b/arch/x86/lib/csum-copy_64.S
@@ -55,7 +55,7 @@ ENTRY(csum_partial_copy_generic)
movq  %r12, 3*8(%rsp)
movq  %r14, 4*8(%rsp)
movq  %r13, 5*8(%rsp)
-   movq  %rbp, 6*8(%rsp)
+   movq  %r15, 6*8(%rsp)
 
movq  %r8, (%rsp)
movq  %r9, 1*8(%rsp)
@@ -74,7 +74,7 @@ ENTRY(csum_partial_copy_generic)
/* main loop. clear in 64 byte blocks */
/* r9: zero, r8: temp2, rbx: temp1, rax: sum, rcx: saved length */
/* r11: temp3, rdx: temp4, r12 loopcnt */
-   /* r10: temp5, rbp: temp6, r14 temp7, r13 temp8 */
+   /* r10: temp5, r15: temp6, r14 temp7, r13 temp8 */
.p2align 4
 .Lloop:
source
@@ -89,7 +89,7 @@ ENTRY(csum_partial_copy_generic)
source
movq  32(%rdi), %r10
source
-   movq  40(%rdi), %rbp
+   movq  40(%rdi), %r15
source
movq  48(%rdi), %r14
source
@@ -103,7 +103,7 @@ ENTRY(csum_partial_copy_generic)
adcq  %r11, %rax
adcq  %rdx, %rax
adcq  %r10, %rax
-   adcq  %rbp, %rax
+   adcq  %r15, %rax
adcq  %r14, %rax
adcq  %r13, %rax
 
@@ -121,7 +121,7 @@ ENTRY(csum_partial_copy_generic)
dest
movq %r10, 32(%rsi)
dest
-   movq %rbp, 40(%rsi)
+   movq %r15, 40(%rsi)
dest
movq %r14, 48(%rsi)
dest
@@ -203,7 +203,7 @@ ENTRY(csum_partial_copy_generic)
movq 3*8(%rsp), %r12
movq 4*8(%rsp), %r14
movq 5*8(%rsp), %r13
-   movq 6*8(%rsp), %rbp
+   movq 6*8(%rsp), %r15
addq $7*8, %rsp
ret
 


Patches currently in stable-queue which might be from jpoim...@redhat.com are

queue-4.9/x86-asm-don-t-use-rbp-as-a-temporary-register-in-csum_partial_copy_generic.patch


RFC: ti_hecc: don't repoll but interrupt again

2018-04-09 Thread Jeroen Hofstee

Hello,

I posted the RFC below to the linux-can ML, but it might actually be
more appropriate for netdev (since it involves NAPI). This driver
lies a bit about the amount of work done. Changing that (always
looking at work done instead of the hw state) seems to solve the issue,
but I have no clear understanding of why that is (or why it takes down
the ethernet stack as well).

If someone has suggestions / ideas on what to check, please let me
know.

Regards,
Jeroen



The mail send to linux-can:


While updating to Linux 4.9.93, the ti_hecc causes several problems. One
of them is that under high load on the CAN bus and ethernet, the ethernet
connection completely stalls and won't recover. The patch below seems to
fix that (read as much as we can and always re-enable interrupts; it
will poll again thereafter).

It would be appreciated if someone more familiar with this code can
comment on it.

Thanks in advance,

Jeroen

--- a/drivers/net/can/ti_hecc.c
+++ b/drivers/net/can/ti_hecc.c
@@ -625,20 +625,16 @@ static int ti_hecc_rx_poll(struct napi_struct
*napi, int quota)
 spin_unlock_irqrestore(&priv->mbx_lock, flags);
 } else if (priv->rx_next == HECC_MAX_TX_MBOX - 1) {
 priv->rx_next = HECC_RX_FIRST_MBOX;
-   break;
 }
 }

 /* Enable packet interrupt if all pkts are handled */
-   if (hecc_read(priv, HECC_CANRMP) == 0) {
+   if (num_pkts < quota) {
 napi_complete(napi);
 /* Re-enable RX mailbox interrupts */
 mbx_mask = hecc_read(priv, HECC_CANMIM);
 mbx_mask |= HECC_TX_MBOX_MASK;
 hecc_write(priv, HECC_CANMIM, mbx_mask);
-   } else {
-   /* repoll is done only if whole budget is used */
-   num_pkts = quota;
 }



[PATCH RESEND net] vhost: fix vhost_vq_access_ok() log check

2018-04-09 Thread Michael S. Tsirkin
From: Stefan Hajnoczi 

Commit d65026c6c62e7d9616c8ceb5a53b68bcdc050525 ("vhost: validate log
when IOTLB is enabled") introduced a regression.  The logic was
originally:

  if (vq->iotlb)
  return 1;
  return A && B;

After the patch the short-circuit logic for A was inverted:

  if (A || vq->iotlb)
  return A;
  return B;

The correct logic is:

  if (!A || vq->iotlb)
  return A;
  return B;

Reported-by: syzbot+65a84dde0214b0387...@syzkaller.appspotmail.com
Cc: Jason Wang 
Signed-off-by: Stefan Hajnoczi 
Acked-by: Michael S. Tsirkin 

---
 drivers/vhost/vhost.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 5320039671b7..f6af4210679a 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1246,7 +1246,7 @@ int vhost_vq_access_ok(struct vhost_virtqueue *vq)
 {
int ret = vq_log_access_ok(vq, vq->log_base);
 
-   if (ret || vq->iotlb)
+   if (!ret || vq->iotlb)
return ret;
 
return vq_access_ok(vq, vq->num, vq->desc, vq->avail, vq->used);
-- 
2.14.3


Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available

2018-04-09 Thread Samudrala, Sridhar

On 4/9/2018 1:07 AM, Jiri Pirko wrote:

Sat, Apr 07, 2018 at 12:59:14AM CEST, sridhar.samudr...@intel.com wrote:

On 4/6/2018 5:48 AM, Jiri Pirko wrote:

Thu, Apr 05, 2018 at 11:08:22PM CEST, sridhar.samudr...@intel.com wrote:

[...]


+static int virtnet_bypass_join_child(struct net_device *bypass_netdev,
+struct net_device *child_netdev)
+{
+   struct virtnet_bypass_info *vbi;
+   bool backup;
+
+   vbi = netdev_priv(bypass_netdev);
+   backup = (child_netdev->dev.parent == bypass_netdev->dev.parent);
+   if (backup ? rtnl_dereference(vbi->backup_netdev) :
+   rtnl_dereference(vbi->active_netdev)) {
+   netdev_info(bypass_netdev,
+   "%s attempting to join bypass dev when %s already 
present\n",
+   child_netdev->name, backup ? "backup" : "active");

Bypass module should check if there is already some other netdev
enslaved and refuse right there.

This will work for virtio-net with the 3-netdev model, but this check has to
be done by netvsc, as its bypass_netdev is the same as the backup_netdev.
Will add a flag while registering with the bypass module to indicate whether
the driver is using a 2-netdev or 3-netdev model; based on that flag this
check can be done in the bypass module for the 3-netdev scenario.

Just let me understand it clearly. What I expect the difference would be
between the 2-netdev and 3-netdev models is this:
2netdev:
   bypass_master
  /
 /
VF_slave

3netdev:
   bypass_master
  / \
 /   \
VF_slave   backup_slave

Is that correct? If not, how does it look like?



Looks correct.
VF_slave and backup_slave are the original netdevs and are present in both
models.
In the 3-netdev model, a bypass_master netdev is created, and VF_slave and
backup_slave are marked as the two slaves of this new netdev.

In the 2-netdev model, backup_slave acts as bypass_master, and the bypass
module doesn't have access to netdev_priv of the backup_slave.

Once I moved all the ndo_ops of the master netdev to the bypass module, I
realized that we can move the create/destroy of the upper netdev to bypass.c
as well. That way the changes to virtio_net become minimal.

With these updates, the bypass module now supports both models by exporting
2 sets of functions.
3 netdev:
   int bypass_master_create(struct net_device *backup_netdev,
struct bypass_master **pbypass_master);
   void bypass_master_destroy(struct bypass_master *bypass_master);


2 netdev:
   int bypass_master_register(struct net_device *backup_netdev,
  struct bypass_ops *ops,
  struct bypass_master **pbypass_master);
   void bypass_master_unregister(struct bypass_master *bypass_master);

Will send the next revision in a day or two.

Thanks
Sridhar



Re: [PATCH bpf] bpf/tracing: fix a deadlock in perf_event_detach_bpf_prog

2018-04-09 Thread Yonghong Song



On 4/9/18 9:47 AM, Alexei Starovoitov wrote:

On 4/9/18 9:18 AM, Yonghong Song wrote:

syzbot reported a possible deadlock in perf_event_detach_bpf_prog.

...
@@ -985,16 +986,31 @@ int perf_event_query_prog_array(struct 
perf_event *event, void __user *info)

 return -EINVAL;
 if (copy_from_user(&query, uquery, sizeof(query)))
 return -EFAULT;
-    if (query.ids_len > BPF_TRACE_MAX_PROGS)
+
+    ids_len = query.ids_len;
+    if (ids_len > BPF_TRACE_MAX_PROGS)
 return -E2BIG;
+    ids = kcalloc(ids_len, sizeof(u32), GFP_USER | __GFP_NOWARN);
+    if (!ids)
+    return -ENOMEM;

 mutex_lock(&bpf_event_mutex);
 ret = bpf_prog_array_copy_info(event->tp_event->prog_array,
-   uquery->ids,
-   query.ids_len,
-   &uquery->prog_cnt);
+   ids,
+   ids_len,
+   &prog_cnt);
 mutex_unlock(&bpf_event_mutex);

+    if (!ret || ret == -ENOSPC) {
+    if (copy_to_user(&uquery->prog_cnt, &prog_cnt, 
sizeof(prog_cnt)) ||

+    copy_to_user(uquery->ids, ids, ids_len * sizeof(u32))) {
+    ret = -EFAULT;
+    goto out;
+    }
+    }
+
+out:
+    kfree(ids);


alloc/free just to avoid this locking dependency feels suboptimal.


We actually already did kcalloc/kfree in bpf_prog_array_copy_to_user.
In that function, we did not copy_to_user one id at a time. We allocated
a temporary array, stored the result there, and at the end called one
copy_to_user to copy to the user buffer.

The patch here just moves this allocation and the associated copy_to_user
out of the function and bpf_event_mutex. It does not introduce new
allocations.



may be we can get rid of bpf_event_mutex in some cases?
the perf event itself is locked via perf_event_ctx_lock() when we're
calling perf_event_query_prog_array, perf_event_attach|detach_bpf_prog.
I forgot what was the motivation for us to introduce it in the
first place.


The original motivation for the lock was to make sure bpf_prog_array
does not change in the middle of attach/detach/query. It looks like
we have:
   . perf_event_attach|query under perf_event_ctx_lock
   . perf_event_detach not under perf_event_ctx_lock
Introducing perf_event_ctx_lock to perf_event_detach could still
have the deadlock.





Re: [PATCH iproute2-next 1/1] tc: jsonify skbedit action

2018-04-09 Thread Roman Mashak
David Ahern  writes:

> On 4/3/18 1:24 PM, Roman Mashak wrote:
>>  if (tb[TCA_SKBEDIT_PTYPE] != NULL) {
>> -ptype = RTA_DATA(tb[TCA_SKBEDIT_PTYPE]);
>> -if (*ptype == PACKET_HOST)
>> -fprintf(f, " ptype host");
>> -else if (*ptype == PACKET_BROADCAST)
>> -fprintf(f, " ptype broadcast");
>> -else if (*ptype == PACKET_MULTICAST)
>> -fprintf(f, " ptype multicast");
>> -else if (*ptype == PACKET_OTHERHOST)
>> -fprintf(f, " ptype otherhost");
>> +ptype = rta_getattr_u16(tb[TCA_SKBEDIT_PTYPE]);
>> +if (ptype == PACKET_HOST)
>> +print_string(PRINT_ANY, "ptype", " %s", "ptype host");
>> +else if (ptype == PACKET_BROADCAST)
>> +print_string(PRINT_ANY, "ptype", " %s",
>> + "ptype broadcast");
>> +else if (ptype == PACKET_MULTICAST)
>> +print_string(PRINT_ANY, "ptype", " %s",
>> + "ptype multicast");
>> +else if (ptype == PACKET_OTHERHOST)
>> +print_string(PRINT_ANY, "ptype", " %s",
>> + "ptype otherhost");
>
> Shouldn't that be:
> print_string(PRINT_ANY, "ptype", "ptype %s", "otherhost");
>
> And ditto for the other strings.
>
>>  else
>> -fprintf(f, " ptype %d", *ptype);
>> +print_uint(PRINT_ANY, "ptype", " %u", ptype);
>
> And then this one needs 'ptype' before %u

OK. I will send v2.


RE: [Intel-wired-lan] [next-queue PATCH v6 10/10] igb: Add support for adding offloaded clsflower filters

2018-04-09 Thread Vinicius Costa Gomes
Hi,

"Brown, Aaron F"  writes:

>> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@osuosl.org] On
>> Behalf Of Vinicius Costa Gomes
>> Sent: Thursday, March 29, 2018 2:08 PM
>> To: intel-wired-...@lists.osuosl.org
>> Cc: netdev@vger.kernel.org; Sanchez-Palencia, Jesus > palen...@intel.com>
>> Subject: [Intel-wired-lan] [next-queue PATCH v6 10/10] igb: Add support for
>> adding offloaded clsflower filters
>> 
>> This allows filters added by tc-flower and specifying MAC addresses,
>> Ethernet types, and the VLAN priority field, to be offloaded to the
>> controller.
>
> Can I get a brief explanation for enabling this?  I'm currently happy
> with this patch series from a regression perspective, but am
> personally a bit, umm, challenged with tc in general but would like to
> run it through the paces a bit.  If it can be done in a one or two
> liner I think it would be a good addition to the patch description.


Of course, my bad for not adding those examples since the beginning.

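For illustration, commands along these lines would exercise the offload path (a hedged sketch: eth0, the MAC address, and the VLAN priority are placeholders, not values taken from the patch series; hw-tc-offload must be enabled on the device):

```shell
# Let the driver offload tc filters
ethtool -K eth0 hw-tc-offload on
tc qdisc add dev eth0 ingress

# Offload a flower filter matching a destination MAC address
tc filter add dev eth0 parent ffff: protocol ip flower skip_sw \
        dst_mac aa:bb:cc:dd:ee:ff action drop

# Or match on the VLAN priority field
tc filter add dev eth0 parent ffff: protocol 802.1q flower skip_sw \
        vlan_prio 3 action drop
```

skip_sw asks for a hardware-only filter, so the command fails visibly if the controller cannot accept it.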


Cheers,
--
Vinicius


RE: [Intel-wired-lan] [next-queue PATCH v6 08/10] igb: Add MAC address support for ethtool nftuple filters

2018-04-09 Thread Vinicius Costa Gomes
Hi,

"Brown, Aaron F"  writes:

[...]

>
>> 
>> I added that note in the hope that someone else would have a stronger
>> opinion about what to do.
>
> I don't have a strong opinion beyond my preference for an ideal world
> where everything works :)  If the part simply cannot filter on the src
> address as a whole without the protocol I would ideally prefer an
> attempt in ethtool to set the filter on src address as a whole to
> return an error WHILE still allowing the filter to be set on an
> ethertype when the proto keyword is issued.  If ethtool does not allow
> that fine grain of control then I think the way it is now is good, I'd
> rather have the annoyance of being able to set a filter that does
> nothing than not be able to set the more specific filter at all.

We are agreed. The next version of this series implements just that:
specifying the src address alone is an error, but if the classifier has
a src address and any other filter, it works.


Cheers,
--
Vinicius


[PATCH RFC iptables] iptables: Per-net ns lock

2018-04-09 Thread Kirill Tkhai
In CRIU and LXC-restore we have met a situation
where iptables in a container can't be restored
because of permission denied:

https://github.com/checkpoint-restore/criu/issues/469

Containers want to restore their own net ns,
while they may not have their own mnt ns.
In that case they share the host's /run/xtables.lock
file, but they may not have permission to open it.

Patch makes /run/xtables.lock to be per-namespace,
i.e., to refer to the caller task's net ns.

What do you think?

Thanks,
Kirill

Signed-off-by: Kirill Tkhai 
---
 iptables/xshared.c |7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/iptables/xshared.c b/iptables/xshared.c
index 06db72d4..b6dbe4e7 100644
--- a/iptables/xshared.c
+++ b/iptables/xshared.c
@@ -254,7 +254,12 @@ static int xtables_lock(int wait, struct timeval *wait_interval)
time_left.tv_sec = wait;
time_left.tv_usec = 0;
 
-   fd = open(XT_LOCK_NAME, O_CREAT, 0600);
+   if (symlink("/proc/self/ns/net", XT_LOCK_NAME) != 0 &&
+   errno != EEXIST) {
+   fprintf(stderr, "Fatal: can't create lock file\n");
+   return XT_LOCK_FAILED;
+   }
+   fd = open(XT_LOCK_NAME, O_RDONLY);
if (fd < 0) {
fprintf(stderr, "Fatal: can't open lock file %s: %s\n",
XT_LOCK_NAME, strerror(errno));



Re: [RFC PATCH bpf-next 2/6] bpf: add bpf_get_stack helper

2018-04-09 Thread Yonghong Song



On 4/9/18 3:01 AM, Daniel Borkmann wrote:

On 04/09/2018 07:02 AM, Alexei Starovoitov wrote:

On 4/8/18 9:53 PM, Yonghong Song wrote:

@@ -1004,7 +1007,8 @@ static void __bpf_prog_put(struct bpf_prog *prog, bool do_idr_lock)
  bpf_prog_kallsyms_del(prog->aux->func[i]);
  bpf_prog_kallsyms_del(prog);

-    call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu);
+    synchronize_rcu();
+    __bpf_prog_put_rcu(&prog->aux->rcu);


there should have been lockdep splat.
We cannot call synchronize_rcu here, since we cannot sleep
in some cases.


Let me double check this. The following is the reason
why I am using synchronize_rcu().

With call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu),
__bpf_prog_put_rcu() calls put_callchain_buffers(), which
takes a mutex_lock(); a kernel built with CONFIG_DEBUG_ATOMIC_SLEEP=y
will complain, since a potential sleep inside an rcu callback is
not allowed.


I see. Indeed. We cannot call put_callchain_buffers() from an rcu callback,
but doing synchronize_rcu() here is also not possible.
How about moving put_callchain into bpf_prog_free_deferred()?


+1, the assumption is that you can call bpf_prog_put() and also the
bpf_map_put() from any context. Sleeping here for a long time might
subtly break things badly.


Thanks for the suggestion! This should work.


Re: [PATCH bpf] bpf/tracing: fix a deadlock in perf_event_detach_bpf_prog

2018-04-09 Thread Alexei Starovoitov

On 4/9/18 9:18 AM, Yonghong Song wrote:

syzbot reported a possible deadlock in perf_event_detach_bpf_prog.

...

@@ -985,16 +986,31 @@ int perf_event_query_prog_array(struct perf_event *event, void __user *info)
return -EINVAL;
if (copy_from_user(&query, uquery, sizeof(query)))
return -EFAULT;
-   if (query.ids_len > BPF_TRACE_MAX_PROGS)
+
+   ids_len = query.ids_len;
+   if (ids_len > BPF_TRACE_MAX_PROGS)
return -E2BIG;
+   ids = kcalloc(ids_len, sizeof(u32), GFP_USER | __GFP_NOWARN);
+   if (!ids)
+   return -ENOMEM;

mutex_lock(&bpf_event_mutex);
ret = bpf_prog_array_copy_info(event->tp_event->prog_array,
-  uquery->ids,
-  query.ids_len,
-  &uquery->prog_cnt);
+  ids,
+  ids_len,
+  &prog_cnt);
mutex_unlock(&bpf_event_mutex);

+   if (!ret || ret == -ENOSPC) {
+   if (copy_to_user(&uquery->prog_cnt, &prog_cnt, sizeof(prog_cnt)) ||
+   copy_to_user(uquery->ids, ids, ids_len * sizeof(u32))) {
+   ret = -EFAULT;
+   goto out;
+   }
+   }
+
+out:
+   kfree(ids);


alloc/free just to avoid this locking dependency feels suboptimal.

maybe we can get rid of bpf_event_mutex in some cases?
the perf event itself is locked via perf_event_ctx_lock() when we're
calling perf_event_query_prog_array, perf_event_attach|detach_bpf_prog.
I forgot what was the motivation for us to introduce it in the
first place.



Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc

2018-04-09 Thread Jesus Sanchez-Palencia
Hi Thomas,


On 03/28/2018 12:48 AM, Thomas Gleixner wrote:

(...)

>
> There are two modes:
>
>   1) Send at the given TX time (Explicit mode)
>
>   2) Send before given TX time (Deadline mode)
>
> There is no need to specify 'drop if late' simply because if the message is
> handed in past the given TX time, it's too late by definition. What you are
> trying to implement is a hybrid of TSN and general purpose (not time aware)
> networking in one go. And you do that because your overall design is not
> looking at the big picture. You designed from a given use case assumption
> and tried to fit other things into it with duct tape.


Ok, I see the difference now, thanks. I have just two more questions about the
deadline mode, please see below.

(...)


>
>>> Coming back to the overall scheme. If you start upfront with a time slice
>>> manager which is designed to:
>>>
>>>   - Handle multiple channels
>>>
>>>   - Expose the time constraints, properties per channel
>>>
>>> then you can fit all kind of use cases, whether designed by committee or
>>> not. You can configure that thing per node or network wide. It does not
>>> make a difference. The only difference are the resulting constraints.
>>
>>
>> Ok, and I believe the above was covered by what we had proposed before, 
>> unless
>> what you meant by time constraints is beyond the configured port schedule.
>>
>> Are you suggesting that we'll need to have a kernel entity that is not only
>> aware of the current traffic classes 'schedule', but also of the resources 
>> that
>> are still available for new streams to be accommodated into the classes? 
>> Putting
>> it differently, is the TAS you envision just an entity that runs a schedule, 
>> or
>> is it a time-aware 'orchestrator'?
>
> In the first place its something which runs a defined schedule.
>
> The accommodation for new streams is required, but not necessarily at the
> root qdisc level. That might be a qdisc feeding into it.
>
> Assume you have a bandwidth reservation, aka time slot, for audio. If your
> audio related qdisc does deadline scheduling then you can add new streams
> to it up to the point where it's no longer able to fit.
>
> The only thing which might be needed at the root qdisc is the ability to
> utilize unused time slots for other purposes, but that's not required to be
> there in the first place as long as its designed in a way that it can be
> added later on.


Ok, agreed.


>
>>> So lets look once more at the picture in an abstract way:
>>>
>>>[ NIC ]
>>>   |
>>>  [ Time slice manager ]
>>> |   |
>>>  [ Ch 0 ] ... [ Ch N ]
>>>
>>> So you have a bunch of properties here:
>>>
>>> 1) Number of Channels ranging from 1 to N
>>>
>>> 2) Start point, slice period and slice length per channel
>>
>> Ok, so we agree that a TAS entity is needed. Assuming that channels are 
>> traffic
>> classes, do you have something else in mind other than a new root qdisc?
>
> Whatever you call it, the important point is that it is the gate keeper to
> the network adapter and there is no way around it. It fully controls the
> timed schedule how simple or how complex it may be.


Ok, and I've finally understood the nuance between the above and what we had
planned initially.


(...)


>>
>> * TAS:
>>
>>The idea we are currently exploring is to add a "time-aware", priority 
>> based
>>qdisc, that also exposes the Tx queues available and provides a mechanism 
>> for
>>mapping priority <-> traffic class <-> Tx queues in a similar fashion as
>>mqprio. We are calling this qdisc 'taprio', and its 'tc' cmd line would 
>> be:
>>
>>$ tc qdisc add dev ens4 parent root handle 100 taprio num_tc 4 \
>> map 2 2 1 0 3 3 3 3 3 3 3 3 3 3 3 3 \
>> queues 0 1 2 3  \
>> sched-file gates.sched [base-time ]   \
>>[cycle-time ] [extension-time ]
>>
>> is multi-line, with each line being of the following format:
>>  
>>
>>Qbv only defines one : "S" for 'SetGates'
>>
>>For example:
>>
>>S 0x01 300
>>S 0x03 500
>>
>>This means that there are two intervals, the first will have the gate
>>for traffic class 0 open for 300 nanoseconds, the second will have
>>both traffic classes open for 500 nanoseconds.
>
> To accommodate stuff like control systems you also need a base line, which
> is not expressed as interval. Otherwise you can't schedule network wide
> explicit plans. That's either an absolute network-time (TAI) time stamp or
> an offset to a well defined network-time (TAI) time stamp, e.g. start of
> epoch or something else which is agreed on. The actual schedule then fast
> forwards past now (TAI) and sets up the slots from there. That makes node
> hotplug possible as well.


Sure, and the [base-time ] on the command line above was actually
wrong. It should have been expressed 

[PATCH bpf] bpf/tracing: fix a deadlock in perf_event_detach_bpf_prog

2018-04-09 Thread Yonghong Song
syzbot reported a possible deadlock in perf_event_detach_bpf_prog.
The error details:
  ==
  WARNING: possible circular locking dependency detected
  4.16.0-rc7+ #3 Not tainted
  --
  syz-executor7/24531 is trying to acquire lock:
   (bpf_event_mutex){+.+.}, at: [<8a849b07>] perf_event_detach_bpf_prog+0x92/0x3d0 kernel/trace/bpf_trace.c:854

  but task is already holding lock:
   (&mm->mmap_sem){}, at: [<38768f87>] vm_mmap_pgoff+0x198/0x280 mm/util.c:353

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #1 (&mm->mmap_sem){}:
   __might_fault+0x13a/0x1d0 mm/memory.c:4571
   _copy_to_user+0x2c/0xc0 lib/usercopy.c:25
   copy_to_user include/linux/uaccess.h:155 [inline]
   bpf_prog_array_copy_info+0xf2/0x1c0 kernel/bpf/core.c:1694
   perf_event_query_prog_array+0x1c7/0x2c0 kernel/trace/bpf_trace.c:891
   _perf_ioctl kernel/events/core.c:4750 [inline]
   perf_ioctl+0x3e1/0x1480 kernel/events/core.c:4770
   vfs_ioctl fs/ioctl.c:46 [inline]
   do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:686
   SYSC_ioctl fs/ioctl.c:701 [inline]
   SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692
   do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
   entry_SYSCALL_64_after_hwframe+0x42/0xb7

  -> #0 (bpf_event_mutex){+.+.}:
   lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3920
   __mutex_lock_common kernel/locking/mutex.c:756 [inline]
   __mutex_lock+0x16f/0x1a80 kernel/locking/mutex.c:893
   mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908
   perf_event_detach_bpf_prog+0x92/0x3d0 kernel/trace/bpf_trace.c:854
   perf_event_free_bpf_prog kernel/events/core.c:8147 [inline]
   _free_event+0xbdb/0x10f0 kernel/events/core.c:4116
   put_event+0x24/0x30 kernel/events/core.c:4204
   perf_mmap_close+0x60d/0x1010 kernel/events/core.c:5172
   remove_vma+0xb4/0x1b0 mm/mmap.c:172
   remove_vma_list mm/mmap.c:2490 [inline]
   do_munmap+0x82a/0xdf0 mm/mmap.c:2731
   mmap_region+0x59e/0x15a0 mm/mmap.c:1646
   do_mmap+0x6c0/0xe00 mm/mmap.c:1483
   do_mmap_pgoff include/linux/mm.h:2223 [inline]
   vm_mmap_pgoff+0x1de/0x280 mm/util.c:355
   SYSC_mmap_pgoff mm/mmap.c:1533 [inline]
   SyS_mmap_pgoff+0x462/0x5f0 mm/mmap.c:1491
   SYSC_mmap arch/x86/kernel/sys_x86_64.c:100 [inline]
   SyS_mmap+0x16/0x20 arch/x86/kernel/sys_x86_64.c:91
   do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
   entry_SYSCALL_64_after_hwframe+0x42/0xb7

  other info that might help us debug this:

   Possible unsafe locking scenario:

 CPU0CPU1
 
lock(&mm->mmap_sem);
 lock(bpf_event_mutex);
 lock(&mm->mmap_sem);
lock(bpf_event_mutex);

   *** DEADLOCK ***
  ==

The bug was introduced by commit f371b304f12e ("bpf/tracing: allow
user space to query prog array on the same tp"), where copy_to_user,
which requires mm->mmap_sem, is called inside the bpf_event_mutex lock.
At the same time, during perf_event file descriptor close,
mm->mmap_sem is held first and then the subsequent
perf_event_detach_bpf_prog needs the bpf_event_mutex lock.
Such a scenario caused a deadlock.

As suggested by Daniel, moving copy_to_user out of the
bpf_event_mutex lock should fix the problem.

Fixes: f371b304f12e ("bpf/tracing: allow user space to query prog array on the same tp")
Reported-by: syzbot+dc5ca0e4c9bfafaf2...@syzkaller.appspotmail.com
Signed-off-by: Yonghong Song 
---
 include/linux/bpf.h  |  4 ++--
 kernel/bpf/core.c| 25 +++--
 kernel/trace/bpf_trace.c | 24 
 3 files changed, 41 insertions(+), 12 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 95a7abd..486e65e 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -339,8 +339,8 @@ int bpf_prog_array_copy_to_user(struct bpf_prog_array __rcu *progs,
 void bpf_prog_array_delete_safe(struct bpf_prog_array __rcu *progs,
struct bpf_prog *old_prog);
 int bpf_prog_array_copy_info(struct bpf_prog_array __rcu *array,
-__u32 __user *prog_ids, u32 request_cnt,
-__u32 __user *prog_cnt);
+u32 *prog_ids, u32 request_cnt,
+u32 *prog_cnt);
 int bpf_prog_array_copy(struct bpf_prog_array __rcu *old_array,
struct bpf_prog *exclude_prog,
struct bpf_prog *include_prog,
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index d315b39..a95a7de 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1683,22 +1683,35 @@ int bpf_prog_array_copy(struct bpf_prog_array __rcu *old_array,
 }
 
 int 

Re: [PATCH net-next] netns: filter uevents correctly

2018-04-09 Thread Christian Brauner
On Thu, Apr 05, 2018 at 10:59:49PM -0500, Eric W. Biederman wrote:
> Christian Brauner  writes:
> 
> > On Thu, Apr 05, 2018 at 05:26:59PM +0300, Kirill Tkhai wrote:
> >> On 05.04.2018 17:07, Christian Brauner wrote:
> >> > On Thu, Apr 05, 2018 at 04:01:03PM +0300, Kirill Tkhai wrote:
> >> >> On 04.04.2018 22:48, Christian Brauner wrote:
> >> >>> commit 07e98962fa77 ("kobject: Send hotplug events in all network 
> >> >>> namespaces")
> >> >>>
> >> >>> enabled sending hotplug events into all network namespaces back in 
> >> >>> 2010.
> >> >>> Over time the set of uevents that get sent into all network namespaces 
> >> >>> has
> >> >>> shrunk. We have now reached the point where hotplug events for all 
> >> >>> devices
> >> >>> that carry a namespace tag are filtered according to that namespace.
> >> >>>
> >> >>> Specifically, they are filtered whenever the namespace tag of the 
> >> >>> kobject
> >> >>> does not match the namespace tag of the netlink socket. One example are
> >> >>> network devices. Uevents for network devices only show up in the 
> >> >>> network
> >> >>> namespaces these devices are moved to or created in.
> >> >>>
> >> >>> However, any uevent for a kobject that does not have a namespace tag
> >> >>> associated with it will not be filtered and we will *try* to broadcast 
> >> >>> it
> >> >>> into all network namespaces.
> >> >>>
> >> >>> The original patchset was written in 2010 before user namespaces were a
> >> >>> thing. With the introduction of user namespaces sending out uevents 
> >> >>> became
> >> >>> partially isolated as they were filtered by user namespaces:
> >> >>>
> >> >>> net/netlink/af_netlink.c:do_one_broadcast()
> >> >>>
> >> >>> if (!net_eq(sock_net(sk), p->net)) {
> >> >>> if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID))
> >> >>> return;
> >> >>>
> >> >>> if (!peernet_has_id(sock_net(sk), p->net))
> >> >>> return;
> >> >>>
> >> >>> if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns,
> >> >>>  CAP_NET_BROADCAST))
> >> >>> return;
> >> >>> }
> >> >>>
> >> >>> The file_ns_capable() check will check whether the caller had
> >> >>> CAP_NET_BROADCAST at the time of opening the netlink socket in the user
> >> >>> namespace of interest. This check is fine in general but seems 
> >> >>> insufficient
> >> >>> to me when paired with uevents. The reason is that devices always 
> >> >>> belong to
> >> >>> the initial user namespace so uevents for kobjects that do not carry a
> >> >>> namespace tag should never be sent into another user namespace. This 
> >> >>> has
> >> >>> been the intention all along. But there's one case where this breaks,
> >> >>> namely if a new user namespace is created by root on the host and an
> >> >>> identity mapping is established between root on the host and root in 
> >> >>> the
> >> >>> new user namespace. Here's a reproducer:
> >> >>>
> >> >>>  sudo unshare -U --map-root
> >> >>>  udevadm monitor -k
> >> >>>  # Now change to initial user namespace and e.g. do
> >> >>>  modprobe kvm
> >> >>>  # or
> >> >>>  rmmod kvm
> >> >>>
> >> >>> will allow the non-initial user namespace to retrieve all uevents from 
> >> >>> the
> >> >>> host. This seems very anecdotal given that in the general case user
> >> >>> namespaces do not see any uevents and also can't really do anything 
> >> >>> useful
> >> >>> with them.
> >> >>>
> >> >>> Additionally, it is now possible to send uevents from userspace. As 
> >> >>> such we
> >> >>> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user
> >> >>> namespace of the network namespace of the netlink socket) userspace 
> >> >>> process
> >> >>> make a decision what uevents should be sent.
> >> >>>
> >> >>> This makes me think that we should simply ensure that uevents for 
> >> >>> kobjects
> >> >>> that do not carry a namespace tag are *always* filtered by user 
> >> >>> namespace
> >> >>> in kobj_bcast_filter(). Specifically:
> >> >>> - If the owning user namespace of the uevent socket is not 
> >> >>> init_user_ns the
> >> >>>   event will always be filtered.
> >> >>> - If the network namespace the uevent socket belongs to was created in 
> >> >>> the
> >> >>>   initial user namespace but was opened from a non-initial user 
> >> >>> namespace
> >> >>>   the event will be filtered as well.
> >> >>> Put another way, uevents for kobjects not carrying a namespace tag are 
> >> >>> now
> >> >>> always only sent to the initial user namespace. The regression 
> >> >>> potential
> >> >>> for this is near to non-existent since user namespaces can't really do
> >> >>> anything with interesting devices.
> >> >>>
> >> >>> Signed-off-by: Christian Brauner 
> >> >>> ---
> >> >>>  lib/kobject_uevent.c | 10 +-
> >> >>>  1 file changed, 9 insertions(+), 1 deletion(-)
> >> >>>
> >> >>> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> >> >>> 

Re: Enable and configure storm prevention in a network device

2018-04-09 Thread Andrew Lunn
> > The Marvell switches have leaky buckets, which can be used for
> > limiting broadcast and multicast packets, as well as traffic shaping
> > in general. Storm prevention is just a form of traffic shaping, so if
> > we have generic traffic shaping, it can be used for storm prevention.
> > 
> > TI's CPSW hardware as well has a similar capability to limit broadcast and
> > multicast packets at the ingress. Isn't it traffic policing at the ingress
> > rather than traffic shaping, as the hardware drops the frames at the ingress
> > if the rate exceeds a limit?

Hi Murali

It depends on the generation of Marvell switches. Older ones have just
egress traffic shaping. Newer ones also have ingress rate limiting.

   Andrew


Re: Enable and configure storm prevention in a network device

2018-04-09 Thread Murali Karicheri
Andrew,

On 04/06/2018 10:30 AM, Andrew Lunn wrote:
> On Thu, Apr 05, 2018 at 03:35:06PM -0700, Florian Fainelli wrote:
>> On 04/05/2018 01:20 PM, David Miller wrote:
>>> From: Murali Karicheri 
>>> Date: Thu, 5 Apr 2018 16:14:49 -0400
>>>
 Is there a standard way to implement and configure storm prevention
 in a Linux network device?
>>>
>>> What kind of "storm", an interrupt storm?
>>>
>>
>> I would assume Murali is referring to L2 broadcast storms which is
>> common in switches. There is not an API for that AFAICT and I am not
>> sure what a proper API would look like.
> 
> tc?
> 
> The Marvell switches have leaky buckets, which can be used for
> limiting broadcast and multicast packets, as well as traffic shaping
> in general. Storm prevention is just a form of traffic shaping, so if
> we have generic traffic shaping, it can be used for storm prevention.
> 
TI's CPSW hardware as well has a similar capability to limit broadcast and
multicast packets at the ingress. Isn't it traffic policing at the ingress
rather than traffic shaping, as the hardware drops the frames at the ingress
if the rate exceeds a limit?

I haven't done any work on TC, but with my limited knowledge of it, it
seems we might be able to offload the TC policing to the hardware
through ndo_setup_tc()? Could someone shed some light on how to do this
with tc? Some tc filter command and then some hook in the kernel to
offload this to hardware?
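As a starting point, a setup along these lines is the usual shape for ingress policing with tc (an untested sketch: eth0, the broadcast match, and the rate are placeholders, and whether it is actually offloaded depends on the driver's ndo_setup_tc support):

```shell
# Attach an ingress hook, then police broadcast frames at ingress;
# skip_sw asks the driver to handle the filter in hardware.
tc qdisc add dev eth0 clsact
tc filter add dev eth0 ingress protocol all flower skip_sw \
        dst_mac ff:ff:ff:ff:ff:ff \
        action police rate 1mbit burst 10k drop
```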

Murali

>Andrew
> 


-- 
Murali Karicheri
Linux Kernel, Keystone


Re: [PATCH net] inetpeer: fix uninit-value in inet_getpeer

2018-04-09 Thread Eric Dumazet


On 04/09/2018 07:58 AM, David Miller wrote:
> From: Eric Dumazet 
> Date: Mon,  9 Apr 2018 06:43:27 -0700
> 
>> syzbot/KMSAN reported that p->dtime was read while it was
>> not yet initialized in :
>>
>>  delta = (__u32)jiffies - p->dtime;
>>  if (delta < ttl || !refcount_dec_if_one(&p->refcnt))
>>  gc_stack[i] = NULL;
>>
>> This is a false positive, because the inetpeer won't be erased
>> from the rb-tree if the refcount_dec_if_one(&p->refcnt) does not
>> succeed. And this won't happen before the first inet_putpeer() call
>> for this inetpeer has been done, and the ->dtime field is written
>> exactly before the refcount_dec_and_test(&p->refcnt).
>>
>> The KMSAN report was :
>  ...
>> Signed-off-by: Eric Dumazet 
>> Reported-by: syzbot 
> 
> Applied, but it looks like we are just adding assignments simply
> to placate these reports when the tools and facilities cannot
> see through the logic properly.
> 

To be fair, this is because the check on ->dtime should be done a
second time after the refcount_dec_if_one(&p->refcnt).

It is a tiny race, and we do not really care given the nature of the
inetpeer cache: best effort, and a DDoS candidate anyway.

If we purge one entry too soon, it is not a big deal.

I believe the tool is fine.


Re: [PATCH] slip: Check if rstate is initialized before uncompressing

2018-04-09 Thread David Miller
From: Tejaswi Tanikella 
Date: Mon, 9 Apr 2018 14:23:49 +0530

> @@ -673,6 +677,7 @@ struct slcompress *
>   if (cs->cs_tcp.doff > 5)
> memcpy(cs->cs_tcpopt, icp + ihl*4 + sizeof(struct tcphdr), (cs->cs_tcp.doff - 5) * 4);
>   cs->cs_hsize = ihl*2 + cs->cs_tcp.doff*2;
> + cs->initialized = 1;
>   /* Put headers back on packet
 ...
>  struct cstate {
>   byte_t  cs_this;/* connection id number (xmit) */
> + byte_t  initialized;/* non-zero if initialized */

Please use 'bool' and true/false for 'initialized'.


Re: [PATCH RESEND v2] vhost-net: set packet weight of tx polling to 2 * vq size

2018-04-09 Thread David Miller
From: haibinzhang(张海斌) 
Date: Mon, 9 Apr 2018 07:22:17 +

> handle_tx will delay rx for tens or even hundreds of milliseconds when tx is busy
> polling udp packets with small length (e.g. 1-byte udp payload), because setting
> VHOST_NET_WEIGHT takes into account only sent bytes but not single packet
> length.
> 
> Ping-Latencies shown below were tested between two Virtual Machines using
> netperf (UDP_STREAM, len=1), and then another machine pinged the client:
 ...
> Acked-by: Michael S. Tsirkin 
> Signed-off-by: Haibin Zhang 
> Signed-off-by: Yunfang Tai 
> Signed-off-by: Lidong Chen 

Applied, thank you.


Re: [PATCH v5] net: thunderx: rework mac addresses list to u64 array

2018-04-09 Thread David Miller
From: Vadim Lomovtsev 
Date: Mon,  9 Apr 2018 06:24:48 -0700

> From: Vadim Lomovtsev 
> 
> It is too expensive to pass u64 values via a linked list; instead,
> allocate an array for them sized by the overall number of mac addresses
> from the netdev.
> 
> This eventually removes multiple kmalloc() calls, avoids memory
> fragmentation and allows putting a single null check on the kmalloc
> return value in order to prevent a potential null pointer dereference.
> 
> Addresses-Coverity-ID: 1467429 ("Dereference null return value")
> Fixes: 37c3347eb247 ("net: thunderx: add ndo_set_rx_mode callback 
> implementation for VF")
> Reported-by: Dan Carpenter 
> Signed-off-by: Vadim Lomovtsev 

Applied, thanks.


Re: [PATCH net] inetpeer: fix uninit-value in inet_getpeer

2018-04-09 Thread David Miller
From: Eric Dumazet 
Date: Mon,  9 Apr 2018 06:43:27 -0700

> syzbot/KMSAN reported that p->dtime was read while it was
> not yet initialized in :
> 
>   delta = (__u32)jiffies - p->dtime;
>   if (delta < ttl || !refcount_dec_if_one(>refcnt))
>   gc_stack[i] = NULL;
> 
> This is a false positive, because the inetpeer wont be erased
> from rb-tree if the refcount_dec_if_one(>refcnt) does not
> succeed. And this wont happen before first inet_putpeer() call
> for this inetpeer has been done, and ->dtime field is written
> exactly before the refcount_dec_and_test(>refcnt).
> 
> The KMSAN report was :
 ...
> Signed-off-by: Eric Dumazet 
> Reported-by: syzbot 

Applied, but it looks like we are just adding assignments simply
to placate these reports when the tools and facilities cannot
see through the logic properly.


Re: [PATCH 2/2] alx: add disable_wol paramenter

2018-04-09 Thread David Miller
From: Andrew Lunn 
Date: Mon, 9 Apr 2018 14:39:10 +0200

> On Mon, Apr 09, 2018 at 07:35:14PM +0800, AceLan Kao wrote:
>> The WoL feature was reported broken: it leads to the system resuming
>> immediately after suspending.
>> This symptom does not happen on every system, so add a disable_wol
>> option and disable WoL by default to prevent the issue from
>> happening again.
> 
>>  const char alx_drv_name[] = "alx";
>>  
>> +/* disable WoL by default */
>> +bool disable_wol = 1;
>> +module_param(disable_wol, bool, 0);
>> +MODULE_PARM_DESC(disable_wol, "Disable Wake on Lan feature");
>> +
> 
> Hi AceLan
> 
> This seems like you are papering over the cracks. And module
> parameters are not liked.
> 
> Please try to find the real problem.

Agreed.


Re: [PATCH] net: decnet: Replace GFP_ATOMIC with GFP_KERNEL in dn_route_init

2018-04-09 Thread Jia-Ju Bai



On 2018/4/9 22:44, Eric Dumazet wrote:


On 04/09/2018 07:10 AM, Jia-Ju Bai wrote:

dn_route_init() is never called in atomic context.

The call chain ending up at dn_route_init() is:
[1] dn_route_init() <- decnet_init()
decnet_init() is only set as a parameter of module_init().

Despite never getting called from atomic context,
dn_route_init() calls __get_free_pages() with GFP_ATOMIC,
which waits busily for allocation.
GFP_ATOMIC is not necessary and can be replaced with GFP_KERNEL,
to avoid busy waiting and improve the possibility of successful allocation.

This is found by a static analysis tool named DCNS written by myself.
And I also manually check it.

Signed-off-by: Jia-Ju Bai 
---
  net/decnet/dn_route.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/decnet/dn_route.c b/net/decnet/dn_route.c
index 0bd3afd..59ed12a 100644
--- a/net/decnet/dn_route.c
+++ b/net/decnet/dn_route.c
@@ -1898,7 +1898,7 @@ void __init dn_route_init(void)
while(dn_rt_hash_mask & (dn_rt_hash_mask - 1))
dn_rt_hash_mask--;
dn_rt_hash_table = (struct dn_rt_hash_bucket *)
-   __get_free_pages(GFP_ATOMIC, order);
+   __get_free_pages(GFP_KERNEL, order);
} while (dn_rt_hash_table == NULL && --order > 0);
  
  	if (!dn_rt_hash_table)



This might OOM under pressure.


Sorry, I do not understand this.
Could you please explain the reason?


This would need __GFP_NOWARN | __GFP_NORETRY  I guess, and would target net-next



Do you mean
__get_free_pages(__GFP_NOWARN | __GFP_NORETRY) or
__get_free_pages(__GFP_NOWARN | __GFP_NORETRY | GFP_KERNEL)?


Best wishes,
Jia-Ju Bai


Re: [PATCH] net: dccp: Replace GFP_ATOMIC with GFP_KERNEL in dccp_init

2018-04-09 Thread Jia-Ju Bai



On 2018/4/9 22:42, Eric Dumazet wrote:


On 04/09/2018 07:10 AM, Jia-Ju Bai wrote:

dccp_init() is never called in atomic context.
This function is only set as a parameter of module_init().

Despite never getting called from atomic context,
dccp_init() calls __get_free_pages() with GFP_ATOMIC,
which waits busily for allocation.

What do you mean by "waits busily" ?

GFP_ATOMIC does not sleep, does not wait.


Sorry, I should modify it to "does not sleep".
Do you think it is okay?


Best wishes,
Jia-Ju Bai


Re: [PATCH] net: decnet: Replace GFP_ATOMIC with GFP_KERNEL in dn_route_init

2018-04-09 Thread Eric Dumazet


On 04/09/2018 07:10 AM, Jia-Ju Bai wrote:
> dn_route_init() is never called in atomic context.
> 
> The call chain ending up at dn_route_init() is:
> [1] dn_route_init() <- decnet_init()
> decnet_init() is only set as a parameter of module_init().
> 
> Despite never getting called from atomic context,
> dn_route_init() calls __get_free_pages() with GFP_ATOMIC,
> which waits busily for allocation.
> GFP_ATOMIC is not necessary and can be replaced with GFP_KERNEL,
> to avoid busy waiting and improve the possibility of successful allocation.
> 
> This is found by a static analysis tool named DCNS written by myself.
> And I also manually check it.
> 
> Signed-off-by: Jia-Ju Bai 
> ---
>  net/decnet/dn_route.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/decnet/dn_route.c b/net/decnet/dn_route.c
> index 0bd3afd..59ed12a 100644
> --- a/net/decnet/dn_route.c
> +++ b/net/decnet/dn_route.c
> @@ -1898,7 +1898,7 @@ void __init dn_route_init(void)
>   while(dn_rt_hash_mask & (dn_rt_hash_mask - 1))
>   dn_rt_hash_mask--;
>   dn_rt_hash_table = (struct dn_rt_hash_bucket *)
> - __get_free_pages(GFP_ATOMIC, order);
> + __get_free_pages(GFP_KERNEL, order);
>   } while (dn_rt_hash_table == NULL && --order > 0);
>  
>   if (!dn_rt_hash_table)
> 

This might OOM under pressure.

This would need __GFP_NOWARN | __GFP_NORETRY  I guess, and would target net-next



Re: [PATCH] netfilter: fix CONFIG_NF_REJECT_IPV6=m link error

2018-04-09 Thread Arnd Bergmann
On Mon, Apr 9, 2018 at 4:37 PM, Pablo Neira Ayuso  wrote:
> Hi Arnd,
>
> On Mon, Apr 09, 2018 at 12:53:12PM +0200, Arnd Bergmann wrote:
>> We get a new link error with CONFIG_NFT_REJECT_INET=y and 
>> CONFIG_NF_REJECT_IPV6=m
>
> I think we can update NFT_REJECT_INET so it depends on NFT_REJECT_IPV4
> and NFT_REJECT_IPV6. This doesn't allow here CONFIG_NFT_REJECT_INET=y
> and CONFIG_NF_REJECT_IPV6=m.
>
> I mean, just like we do with NFT_FIB_INET.

That can only work if NFT_REJECT_INET can be made a 'tristate' symbol
again, so that code gets built as a loadable module if
CONFIG_NF_REJECT_IPV6=m.

> BTW, I think this problem is not related to the recent patch,
> but something older that the kbuild robot has triggered more easily for
> some reason?

02c7b25e5f54 is the one that turned NF_TABLES_INET into a 'bool'
symbol. NFT_REJECT depends on NF_TABLES_INET, so it used to be
restricted to a loadable module with IPV6=m, but can now be
built-in, which causes that link error.

  Arnd


Re: WARNING in ip_rt_bug

2018-04-09 Thread David Miller
From: Dmitry Vyukov 
Date: Mon, 9 Apr 2018 08:06:20 +0200

> +Eric said that perhaps we just need to revert:
> 
> commit c378a9c019cf5e017d1ed24954b54fae7bebd2bc
> Date:   Sat May 21 07:16:42 2011 +
> ipv4: Give backtrace in ip_rt_bug().

And I replied to him that we shouldn't.

Reverting makes the backtrace, and all the useful debugging
information, go away.  It won't fix the actual bug, which seems
to be that ICMP's route lookup tried to use an input route
for sending a packet.


Re: [PATCH] net: dccp: Replace GFP_ATOMIC with GFP_KERNEL in dccp_init

2018-04-09 Thread Eric Dumazet


On 04/09/2018 07:10 AM, Jia-Ju Bai wrote:
> dccp_init() is never called in atomic context.
> This function is only set as a parameter of module_init().
> 
> Despite never getting called from atomic context,
> dccp_init() calls __get_free_pages() with GFP_ATOMIC,
> which waits busily for allocation.

What do you mean by "waits busily" ?

GFP_ATOMIC does not sleep, does not wait.

> GFP_ATOMIC is not necessary and can be replaced with GFP_KERNEL,
> to avoid busy waiting and improve the possibility of successful allocation.
> 
> This was found by a static analysis tool named DCNS that I wrote,
> and I have also checked it manually.
> 
> Signed-off-by: Jia-Ju Bai 
> ---




Re: [PATCH net] arp: fix arp_filter on l3slave devices

2018-04-09 Thread David Ahern
On 4/8/18 9:36 PM, Sasha Levin wrote:
> Hi,
> 
> [This is an automated email]
> 
> This commit has been processed by the -stable helper bot and determined
> to be a high probability candidate for -stable trees. (score: 33.5930)
> 
> The bot has tested the following trees: v4.16, v4.15.15, v4.14.32, v4.9.92, 
> v4.4.126.
> 
> v4.16: Build OK!
> v4.15.15: Build OK!
> v4.14.32: Build OK!
> v4.9.92: Build OK!
> v4.4.126: Build OK!

All of those would be good for this patch. Thanks


Re: [PATCH] netfilter: fix CONFIG_NF_REJECT_IPV6=m link error

2018-04-09 Thread Pablo Neira Ayuso
Hi Arnd,

On Mon, Apr 09, 2018 at 12:53:12PM +0200, Arnd Bergmann wrote:
> We get a new link error with CONFIG_NFT_REJECT_INET=y and 
> CONFIG_NF_REJECT_IPV6=m

I think we can update NFT_REJECT_INET so it depends on NFT_REJECT_IPV4
and NFT_REJECT_IPV6. That would rule out CONFIG_NFT_REJECT_INET=y
together with CONFIG_NF_REJECT_IPV6=m.

I mean, just like we do with NFT_FIB_INET.

BTW, I think this problem is not related to the recent patch, but to
something older that the kbuild robot has triggered more easily for
some reason?

Thanks for your patch!
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index d3220b43c832..b48c57bb9aaf 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -601,7 +601,8 @@ config NFT_REJECT
 
 config NFT_REJECT_INET
 	depends on NF_TABLES_INET
-	default NFT_REJECT
+	depends on NFT_REJECT_IPV4
+	depends on NFT_REJECT_IPV6
 	tristate
 
 config NFT_COMPAT


Re: Enable and configure storm prevention in a network device

2018-04-09 Thread Murali Karicheri
On 04/05/2018 06:35 PM, Florian Fainelli wrote:
> On 04/05/2018 01:20 PM, David Miller wrote:
>> From: Murali Karicheri 
>> Date: Thu, 5 Apr 2018 16:14:49 -0400
>>
>>> Is there a standard way to implement and configure storm prevention
>>> in a Linux network device?
>>
>> What kind of "storm", an interrupt storm?
>>
> 
> I would assume Murali is referring to L2 broadcast storms which is
> common in switches. There is not an API for that AFAICT and I am not
> sure what a proper API would look like.
> 
Thanks Florian for adding more details. Yes, I am referring to
L2 broadcast or multicast storms. At this point I don't see a reason
to exclude unicast storms either, if the hardware has the capability.


-- 
Murali Karicheri
Linux Kernel, Keystone


Re: [PATCH net] net: dsa: mv88e6xxx: Fix receive time stamp race condition.

2018-04-09 Thread Richard Cochran
On Mon, Apr 09, 2018 at 12:03:14AM -0700, Richard Cochran wrote:
> This patch fixes the race by moving the queue onto a list on the stack
> before reading out the latched time stamp value.
> 
> Fixes: c6fe0ad2c3499 ("net: dsa: mv88e6xxx: add rx/tx timestamping support")

Dave, please hold off on this patch.  I am seeing new problems in my
testing with this applied.  I still need to get to the bottom of
this.

Thanks,
Richard


[PATCH] net: tipc: Replace GFP_ATOMIC with GFP_KERNEL in tipc_mon_create

2018-04-09 Thread Jia-Ju Bai
tipc_mon_create() is never called in atomic context.

The call chain ending up at tipc_mon_create() is:
[1] tipc_mon_create() <- tipc_enable_bearer() <- tipc_nl_bearer_enable()
tipc_nl_bearer_enable() calls rtnl_lock(), which indicates this function
is not called in atomic context.

Despite never getting called from atomic context,
tipc_mon_create() calls kzalloc() with GFP_ATOMIC,
which waits busily for allocation.
GFP_ATOMIC is not necessary and can be replaced with GFP_KERNEL,
to avoid busy waiting and improve the possibility of successful allocation.

This was found by a static analysis tool named DCNS that I wrote,
and I have also checked it manually.

Signed-off-by: Jia-Ju Bai 
---
 net/tipc/monitor.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/tipc/monitor.c b/net/tipc/monitor.c
index 9e109bb..9714d80 100644
--- a/net/tipc/monitor.c
+++ b/net/tipc/monitor.c
@@ -604,9 +604,9 @@ int tipc_mon_create(struct net *net, int bearer_id)
if (tn->monitors[bearer_id])
return 0;
 
-   mon = kzalloc(sizeof(*mon), GFP_ATOMIC);
-   self = kzalloc(sizeof(*self), GFP_ATOMIC);
-   dom = kzalloc(sizeof(*dom), GFP_ATOMIC);
+   mon = kzalloc(sizeof(*mon), GFP_KERNEL);
+   self = kzalloc(sizeof(*self), GFP_KERNEL);
+   dom = kzalloc(sizeof(*dom), GFP_KERNEL);
if (!mon || !self || !dom) {
kfree(mon);
kfree(self);
-- 
1.9.1



[PATCH] net: ncsi: Replace GFP_ATOMIC with GFP_KERNEL in ncsi_register_dev

2018-04-09 Thread Jia-Ju Bai
ncsi_register_dev() is never called in atomic context.
This function is only called by ftgmac100_probe() in 
drivers/net/ethernet/faraday/ftgmac100.c.
And ftgmac100_probe() is only set as ".probe" in "struct platform_driver".

Despite never getting called from atomic context,
ncsi_register_dev() calls kzalloc() with GFP_ATOMIC,
which waits busily for allocation.
GFP_ATOMIC is not necessary and can be replaced with GFP_KERNEL,
to avoid busy waiting and improve the possibility of successful allocation.

This was found by a static analysis tool named DCNS that I wrote,
and I have also checked it manually.

Signed-off-by: Jia-Ju Bai 
---
 net/ncsi/ncsi-manage.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ncsi/ncsi-manage.c b/net/ncsi/ncsi-manage.c
index 3fd3c39..6b5b5a0 100644
--- a/net/ncsi/ncsi-manage.c
+++ b/net/ncsi/ncsi-manage.c
@@ -1508,7 +1508,7 @@ struct ncsi_dev *ncsi_register_dev(struct net_device *dev,
return nd;
 
/* Create NCSI device */
-   ndp = kzalloc(sizeof(*ndp), GFP_ATOMIC);
+   ndp = kzalloc(sizeof(*ndp), GFP_KERNEL);
if (!ndp)
return NULL;
 
-- 
1.9.1



[PATCH] net: decnet: Replace GFP_ATOMIC with GFP_KERNEL in dn_route_init

2018-04-09 Thread Jia-Ju Bai
dn_route_init() is never called in atomic context.

The call chain ending up at dn_route_init() is:
[1] dn_route_init() <- decnet_init()
decnet_init() is only set as a parameter of module_init().

Despite never getting called from atomic context,
dn_route_init() calls __get_free_pages() with GFP_ATOMIC,
which waits busily for allocation.
GFP_ATOMIC is not necessary and can be replaced with GFP_KERNEL,
to avoid busy waiting and improve the possibility of successful allocation.

This was found by a static analysis tool named DCNS that I wrote,
and I have also checked it manually.

Signed-off-by: Jia-Ju Bai 
---
 net/decnet/dn_route.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/decnet/dn_route.c b/net/decnet/dn_route.c
index 0bd3afd..59ed12a 100644
--- a/net/decnet/dn_route.c
+++ b/net/decnet/dn_route.c
@@ -1898,7 +1898,7 @@ void __init dn_route_init(void)
while(dn_rt_hash_mask & (dn_rt_hash_mask - 1))
dn_rt_hash_mask--;
dn_rt_hash_table = (struct dn_rt_hash_bucket *)
-   __get_free_pages(GFP_ATOMIC, order);
+   __get_free_pages(GFP_KERNEL, order);
} while (dn_rt_hash_table == NULL && --order > 0);
 
if (!dn_rt_hash_table)
-- 
1.9.1



[PATCH] net: dccp: Replace GFP_ATOMIC with GFP_KERNEL in dccp_init

2018-04-09 Thread Jia-Ju Bai
dccp_init() is never called in atomic context.
This function is only set as a parameter of module_init().

Despite never getting called from atomic context,
dccp_init() calls __get_free_pages() with GFP_ATOMIC,
which waits busily for allocation.
GFP_ATOMIC is not necessary and can be replaced with GFP_KERNEL,
to avoid busy waiting and improve the possibility of successful allocation.

This was found by a static analysis tool named DCNS that I wrote,
and I have also checked it manually.

Signed-off-by: Jia-Ju Bai 
---
 net/dccp/proto.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/dccp/proto.c b/net/dccp/proto.c
index b68168f..f63ba93 100644
--- a/net/dccp/proto.c
+++ b/net/dccp/proto.c
@@ -1159,7 +1159,7 @@ static int __init dccp_init(void)
hash_size--;
dccp_hashinfo.ehash_mask = hash_size - 1;
dccp_hashinfo.ehash = (struct inet_ehash_bucket *)
-   __get_free_pages(GFP_ATOMIC|__GFP_NOWARN, ehash_order);
+   __get_free_pages(GFP_KERNEL|__GFP_NOWARN, ehash_order);
} while (!dccp_hashinfo.ehash && --ehash_order > 0);
 
if (!dccp_hashinfo.ehash) {
@@ -1182,7 +1182,7 @@ static int __init dccp_init(void)
bhash_order > 0)
continue;
dccp_hashinfo.bhash = (struct inet_bind_hashbucket *)
-   __get_free_pages(GFP_ATOMIC|__GFP_NOWARN, bhash_order);
+   __get_free_pages(GFP_KERNEL|__GFP_NOWARN, bhash_order);
} while (!dccp_hashinfo.bhash && --bhash_order >= 0);
 
if (!dccp_hashinfo.bhash) {
-- 
1.9.1



[PATCH net] inetpeer: fix uninit-value in inet_getpeer

2018-04-09 Thread Eric Dumazet
syzbot/KMSAN reported that p->dtime was read while it was
not yet initialized in:

delta = (__u32)jiffies - p->dtime;
if (delta < ttl || !refcount_dec_if_one(&p->refcnt))
gc_stack[i] = NULL;

This is a false positive, because the inetpeer won't be erased
from the rb-tree if refcount_dec_if_one(&p->refcnt) does not
succeed. And that won't happen before the first inet_putpeer() call
for this inetpeer has been done, and the ->dtime field is written
exactly before the refcount_dec_and_test(&p->refcnt).

The KMSAN report was:

BUG: KMSAN: uninit-value in inet_peer_gc net/ipv4/inetpeer.c:163 [inline]
BUG: KMSAN: uninit-value in inet_getpeer+0x1567/0x1e70 net/ipv4/inetpeer.c:228
CPU: 0 PID: 9494 Comm: syz-executor5 Not tainted 4.16.0+ #82
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:17 [inline]
 dump_stack+0x185/0x1d0 lib/dump_stack.c:53
 kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
 __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676
 inet_peer_gc net/ipv4/inetpeer.c:163 [inline]
 inet_getpeer+0x1567/0x1e70 net/ipv4/inetpeer.c:228
 inet_getpeer_v4 include/net/inetpeer.h:110 [inline]
 icmpv4_xrlim_allow net/ipv4/icmp.c:330 [inline]
 icmp_send+0x2b44/0x3050 net/ipv4/icmp.c:725
 ip_options_compile+0x237c/0x29f0 net/ipv4/ip_options.c:472
 ip_rcv_options net/ipv4/ip_input.c:284 [inline]
 ip_rcv_finish+0xda8/0x16d0 net/ipv4/ip_input.c:365
 NF_HOOK include/linux/netfilter.h:288 [inline]
 ip_rcv+0x119d/0x16f0 net/ipv4/ip_input.c:493
 __netif_receive_skb_core+0x47cf/0x4a80 net/core/dev.c:4562
 __netif_receive_skb net/core/dev.c:4627 [inline]
 netif_receive_skb_internal+0x49d/0x630 net/core/dev.c:4701
 netif_receive_skb+0x230/0x240 net/core/dev.c:4725
 tun_rx_batched drivers/net/tun.c:1555 [inline]
 tun_get_user+0x6d88/0x7580 drivers/net/tun.c:1962
 tun_chr_write_iter+0x1d4/0x330 drivers/net/tun.c:1990
 do_iter_readv_writev+0x7bb/0x970 include/linux/fs.h:1776
 do_iter_write+0x30d/0xd40 fs/read_write.c:932
 vfs_writev fs/read_write.c:977 [inline]
 do_writev+0x3c9/0x830 fs/read_write.c:1012
 SYSC_writev+0x9b/0xb0 fs/read_write.c:1085
 SyS_writev+0x56/0x80 fs/read_write.c:1082
 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x455111
RSP: 002b:7fae0365cba0 EFLAGS: 0293 ORIG_RAX: 0014
RAX: ffda RBX: 002e RCX: 00455111
RDX: 0001 RSI: 7fae0365cbf0 RDI: 00fc
RBP: 2040 R08: 00fc R09: 
R10: 002e R11: 0293 R12: 
R13: 0658 R14: 006fc8e0 R15: 

Uninit was created at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
 kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188
 kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314
 kmem_cache_alloc+0xaab/0xb90 mm/slub.c:2756
 inet_getpeer+0xed8/0x1e70 net/ipv4/inetpeer.c:210
 inet_getpeer_v4 include/net/inetpeer.h:110 [inline]
 ip4_frag_init+0x4d1/0x740 net/ipv4/ip_fragment.c:153
 inet_frag_alloc net/ipv4/inet_fragment.c:369 [inline]
 inet_frag_create net/ipv4/inet_fragment.c:385 [inline]
 inet_frag_find+0x7da/0x1610 net/ipv4/inet_fragment.c:418
 ip_find net/ipv4/ip_fragment.c:275 [inline]
 ip_defrag+0x448/0x67a0 net/ipv4/ip_fragment.c:676
 ip_check_defrag+0x775/0xda0 net/ipv4/ip_fragment.c:724
 packet_rcv_fanout+0x2a8/0x8d0 net/packet/af_packet.c:1447
 deliver_skb net/core/dev.c:1897 [inline]
 deliver_ptype_list_skb net/core/dev.c:1912 [inline]
 __netif_receive_skb_core+0x314a/0x4a80 net/core/dev.c:4545
 __netif_receive_skb net/core/dev.c:4627 [inline]
 netif_receive_skb_internal+0x49d/0x630 net/core/dev.c:4701
 netif_receive_skb+0x230/0x240 net/core/dev.c:4725
 tun_rx_batched drivers/net/tun.c:1555 [inline]
 tun_get_user+0x6d88/0x7580 drivers/net/tun.c:1962
 tun_chr_write_iter+0x1d4/0x330 drivers/net/tun.c:1990
 do_iter_readv_writev+0x7bb/0x970 include/linux/fs.h:1776
 do_iter_write+0x30d/0xd40 fs/read_write.c:932
 vfs_writev fs/read_write.c:977 [inline]
 do_writev+0x3c9/0x830 fs/read_write.c:1012
 SYSC_writev+0x9b/0xb0 fs/read_write.c:1085
 SyS_writev+0x56/0x80 fs/read_write.c:1082
 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
 entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Signed-off-by: Eric Dumazet 
Reported-by: syzbot 
---
 net/ipv4/inetpeer.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv4/inetpeer.c b/net/ipv4/inetpeer.c
index 1f04bd91fc2e999ddb82f4be92d39d229166b691..d757b9642d0d1c418bffad5bcd50e8e7bf336c66 100644
--- a/net/ipv4/inetpeer.c
+++ b/net/ipv4/inetpeer.c
@@ -211,6 +211,7 @@ struct inet_peer *inet_getpeer(struct inet_peer_base *base,
p = kmem_cache_alloc(peer_cachep, GFP_ATOMIC);
if (p) {
p->daddr = *daddr;
+   p->dtime = (__u32)jiffies;

Re: [RFC bpf-next] bpf: document eBPF helpers and add a script to generate man page

2018-04-09 Thread Quentin Monnet
2018-04-09 12:52 UTC+0200 ~ Markus Heiser 
> 
>> Am 09.04.2018 um 12:08 schrieb Daniel Borkmann :
> [...]
> 
>>> May I completely misunderstood you, so correct my if I'am wrong:
>>>
>>> - ./scripts/bpf_helpers_doc.py : produces reST markup from C-comments
>>> - ./scripts/kerne-doc  : produces reST markup from C-comments
>>>
>>> IMO: both are doing the same job, so why not using kernel-doc?
>>
>> They are not really doing the same job, in bpf_helpers_doc.py case you don't
>> want the whole header rendered, but just a fraction of it, that is, the
>> single big comment which describes all BPF helper functions that a BPF
>> program developer has available to use in user space - aka the entries in
>> the __BPF_FUNC_MAPPER() macro;
> 
> 
>> I also doubt the latter would actually qualify
>> in kdoc context as some sort of a function description.
> 
> latter .. ah, OK .. thanks for clarifying. 
> 
> -- Markus --

As Daniel explained, kernel-doc does not apply in this case, we do not
have the full function prototype for eBPF helpers in the header file.
But to be honest, I didn't even realise kernel-doc was available and
could do something close to what I was looking for, so thanks for your
feedback! :)

Quentin



v4.16 in_dev_finish_destroy() accessing freed idev->dev->pcpu_refcnt

2018-04-09 Thread Mark Rutland
Hi all,

As a heads-up, while fuzzing v4.16 on arm64, I'm hitting an intermittent
issue where in_dev_finish_destroy() calls dev_put() on idev->dev, where
idev->dev->pcpu_refcnt is NULL. Apparently idev->dev has already been
freed.

This results in a fault where we try to access the percpu offset of
NULL, such as in the example splat at the end of this mail.

So far I've had no luck in extracting a reliable reproducer, but I've
uploaded the relevant syzkaller logs (and reports) to my kernel.org
webspace [1].

Thanks,
Mark.

[1] 
https://www.kernel.org/pub/linux/kernel/people/mark/bugs/20180409-in_dev_finish_destroy-null-pcpu_refcnt/

Unable to handle kernel paging request at virtual address 60002be8
Mem abort info:
  ESR = 0x9604
  Exception class = DABT (current EL), IL = 32 bits
  SET = 0, FnV = 0
  EA = 0, S1PTW = 0
Data abort info:
  ISV = 0, ISS = 0x0004
  CM = 0, WnR = 0
user pgtable: 4k pages, 48-bit VAs, pgdp = fb56b784
[60002be8] pgd=
Internal error: Oops: 9604 [#1] PREEMPT SMP
Modules linked in:
CPU: 2 PID: 22 Comm: ksoftirqd/2 Not tainted 4.16.0-6-g6eee13dfb529 #1
Hardware name: linux,dummy-virt (DT)
pstate: 8045 (Nzcv daif +PAN -UAO)
pc : __percpu_add arch/arm64/include/asm/percpu.h:102 [inline]
pc : dev_put include/linux/netdevice.h:3395 [inline]
pc : in_dev_finish_destroy+0xcc/0x230 net/ipv4/devinet.c:231
lr : dev_put include/linux/netdevice.h:3395 [inline]
lr : in_dev_finish_destroy+0xa4/0x230 net/ipv4/devinet.c:231
sp : 800036317b70
x29: 800036317b70 x28: 80001f39d338 
x27: 2a917460 x26: 2a917000 
x25: 000a x24: 2a919000 
x23: 800036317c90 x22: 29b84da8 
x21: 13e73a68 x20: 80006e7a9100 
x19: 80001f39d180 x18:  
x17:  x16: 282951f0 
x15: 0001 x14: f200 
x13: 2a9fb560 x12: 0002 
x11: 1c147fbb x10: 0004 
x9 :  x8 : 1fffe4000178908c 
x7 :  x6 :  
x5 :  x4 : 1fffe4000178908c 
x3 : 1fffe4000178908c x2 :  
x1 : 60002be8 x0 : 60002be8 
Process ksoftirqd/2 (pid: 22, stack limit = 0xe96adbb2)
Call trace:
 __percpu_add arch/arm64/include/asm/percpu.h:102 [inline]
 dev_put include/linux/netdevice.h:3395 [inline]
 in_dev_finish_destroy+0xcc/0x230 net/ipv4/devinet.c:231
 in_dev_put include/linux/inetdevice.h:248 [inline]
 in_dev_rcu_put+0x34/0x48 net/ipv4/devinet.c:287
 __rcu_reclaim kernel/rcu/rcu.h:172 [inline]
 rcu_do_batch kernel/rcu/tree.c:2674 [inline]
 invoke_rcu_callbacks kernel/rcu/tree.c:2933 [inline]
 __rcu_process_callbacks kernel/rcu/tree.c:2900 [inline]
 rcu_process_callbacks+0x350/0x988 kernel/rcu/tree.c:2917
 __do_softirq+0x318/0x734 kernel/softirq.c:285
 run_ksoftirqd+0x70/0xa8 kernel/softirq.c:666
 smpboot_thread_fn+0x544/0x9d0 kernel/smpboot.c:164
 kthread+0x2f8/0x380 kernel/kthread.c:238
 ret_from_fork+0x10/0x18 arch/arm64/kernel/entry.S:1158
Code: d538d081 f9425a80 9282 8b01 (885f7c04) 
---[ end trace 577c2c0fbbf0d4d4 ]---


Re: [RFC bpf-next] bpf: document eBPF helpers and add a script to generate man page

2018-04-09 Thread Quentin Monnet
2018-04-09 11:01 UTC+0200 ~ Daniel Borkmann 
> On 04/06/2018 01:11 PM, Quentin Monnet wrote:
>> eBPF helper functions can be called from within eBPF programs to perform
>> a variety of tasks that would be otherwise hard or impossible to do with
>> eBPF itself. There is a growing number of such helper functions in the
>> kernel, but documentation is scarce. The main user space header file
>> does contain a short commented description of most helpers, but it is
>> somewhat outdated and not complete. It is more a "cheat sheet" than a
>> real documentation accessible to new eBPF developers.
>>
>> This commit attempts to improve the situation by replacing the existing
>> overview for the helpers with a more developed description. Furthermore,
>> a Python script is added to generate a manual page for eBPF helpers. The
>> workflow is the following, and requires the rst2man utility:
>>
>> $ ./scripts/bpf_helpers_doc.py \
>> --filename include/uapi/linux/bpf.h > /tmp/bpf-helpers.rst
>> $ rst2man /tmp/bpf-helpers.rst > /tmp/bpf-helpers.7
>> $ man /tmp/bpf-helpers.7
>>
>> The objective is to keep all documentation related to the helpers in a
>> single place, and to be able to generate from here a manual page that
>> could be packaged in the man-pages repository and shipped with most
>> distributions [1].
>>
>> Additionally, parsing the prototypes of the helper functions could
>> hopefully be reused, with a different Printer object, to generate
>> header files needed in some eBPF-related projects.
>>
>> Regarding the description of each helper, it comprises several items:
>>
>> - The function prototype.
>> - A description of the function and of its arguments (except for a
>>   couple of cases, when there are no arguments and the return value
>>   makes the function usage really obvious).
>> - A description of return values (if not void).
>> - A listing of eBPF program types (if relevant, map types) compatible
>>   with the helper.
>> - Information about the helper being restricted to GPL programs, or not.
>> - The kernel version in which the helper was introduced.
>> - The commit that introduced the helper (this is mostly to have it in
>>   the source of the man page, as it can be used to track changes and
>>   update the page).
>>
>> For several helpers, descriptions are inspired (at times, nearly copied)
>> from the commit logs introducing them in the kernel--Many thanks to
>> their respective authors! They were completed as much as possible, the
>> objective being to have something easily accessible even for people just
>> starting with eBPF. There is probably a bit more work to do in this
>> direction for some helpers.
>>
>> Some RST formatting is used in the descriptions (not in function
>> prototypes, to keep them readable, but the Python script provided in
>> order to generate the RST for the manual page does add formatting to
>> prototypes, to produce something pretty) to get "bold" and "italics" in
>> manual pages. Hopefully, the descriptions in bpf.h file remains
>> perfectly readable. Note that the few trailing white spaces are
>> intentional, removing them would break paragraphs for rst2man.
>>
>> The descriptions should ideally be updated each time someone adds a new
>> helper, or updates the behaviour (compatibility extended to new program
>> types, new socket option supported...) or the interface (new flags
>> available, ...) of existing ones.
>>
>> [1] I have not contacted people from the man-pages project prior to
>> sending this RFC, so I can offer no guaranty at this time that they
>> would accept to take the generated man page.
>>
>> Cc: linux-...@vger.kernel.org
>> Cc: linux-...@vger.kernel.org
>> Signed-off-by: Quentin Monnet 
> 
> Great work, thanks a lot for doing this!
> 
> [...]
>> + * int bpf_probe_read(void *dst, u32 size, const void *src)
>> + *  Description
>> + *  For tracing programs, safely attempt to read *size* bytes from
>> + *  address *src* and store the data in *dst*.
>> + *  Return
>> + *  0 on success, or a negative error in case of failure.
>> + *  For
>> + *  *Tracing programs*.
>> + *  GPL only
>> + *  Yes
>> + *  Since
>> + *  Linux 4.1
>> + *
>> + * .. commit 2541517c32be
>>   *
>>   * u64 bpf_ktime_get_ns(void)
>> - * Return: current ktime
>> - *
>> - * int bpf_trace_printk(const char *fmt, int fmt_size, ...)
>> - * Return: length of buffer written or negative error
>> + *  Description
>> + *  Return the time elapsed since system boot, in nanoseconds.
>> + *  Return
>> + *  Current *ktime*.
>> + *  For
>> + *  All program types, except
>> + *  **BPF_PROG_TYPE_CGROUP_DEVICE**.
> 
> I think we should probably always just list the actual program types instead,
> since when new types get added to the kernel not supporting bpf_ktime_get_ns()
> helper in this example, the above statement would be misleading 

[PATCH v5] net: thunderx: rework mac addresses list to u64 array

2018-04-09 Thread Vadim Lomovtsev
From: Vadim Lomovtsev 

It is too expensive to pass u64 values via a linked list, so instead
allocate an array for them, sized by the overall number of MAC
addresses from the netdev.

This removes multiple kmalloc() calls, avoids memory fragmentation,
and allows a single NULL check on the kmalloc() return value in order
to prevent a potential NULL pointer dereference.

Addresses-Coverity-ID: 1467429 ("Dereference null return value")
Fixes: 37c3347eb247 ("net: thunderx: add ndo_set_rx_mode callback implementation for VF")
Reported-by: Dan Carpenter 
Signed-off-by: Vadim Lomovtsev 
---
Changes from v1 to v2:
 - C99 syntax: update xcast_addr_list struct field mc[0] -> mc[];
Changes from v2 to v3:
 - update commit description with 'Reported-by: Dan Carpenter';
 - update size calculations for mc list to offsetof() call
   instead of explicit arithmetic;
Changes from v3 to v4:
 - change loop control variable type from u8 to int, accordingly
   to mc_count size;
Changes from v4 to v5:
 - remove explicit initialization of 'idx' variable as per review comment;
---
 drivers/net/ethernet/cavium/thunder/nic.h|  7 +-
 drivers/net/ethernet/cavium/thunder/nicvf_main.c | 28 +---
 2 files changed, 11 insertions(+), 24 deletions(-)

diff --git a/drivers/net/ethernet/cavium/thunder/nic.h 
b/drivers/net/ethernet/cavium/thunder/nic.h
index 5fc46c5a4f36..448d1fafc827 100644
--- a/drivers/net/ethernet/cavium/thunder/nic.h
+++ b/drivers/net/ethernet/cavium/thunder/nic.h
@@ -265,14 +265,9 @@ struct nicvf_drv_stats {
 
 struct cavium_ptp;
 
-struct xcast_addr {
-   struct list_head list;
-   u64  addr;
-};
-
 struct xcast_addr_list {
-   struct list_head list;
int  count;
+   u64  mc[];
 };
 
 struct nicvf_work {
diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c 
b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
index 1e9a31fef729..707db3304396 100644
--- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c
+++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c
@@ -1929,7 +1929,7 @@ static void nicvf_set_rx_mode_task(struct work_struct 
*work_arg)
  work.work);
struct nicvf *nic = container_of(vf_work, struct nicvf, rx_mode_work);
union nic_mbx mbx = {};
-   struct xcast_addr *xaddr, *next;
+   int idx;
 
if (!vf_work)
return;
@@ -1956,16 +1956,10 @@ static void nicvf_set_rx_mode_task(struct work_struct 
*work_arg)
/* check if we have any specific MACs to be added to PF DMAC filter */
if (vf_work->mc) {
/* now go through kernel list of MACs and add them one by one */
-   list_for_each_entry_safe(xaddr, next,
-    &vf_work->mc->list, list) {
+   for (idx = 0; idx < vf_work->mc->count; idx++) {
mbx.xcast.msg = NIC_MBOX_MSG_ADD_MCAST;
-   mbx.xcast.data.mac = xaddr->addr;
+   mbx.xcast.data.mac = vf_work->mc->mc[idx];
nicvf_send_msg_to_pf(nic, &mbx);
-
-   /* after receiving ACK from PF release memory */
-   list_del(&xaddr->list);
-   kfree(xaddr);
-   vf_work->mc->count--;
}
kfree(vf_work->mc);
}
@@ -1996,17 +1990,15 @@ static void nicvf_set_rx_mode(struct net_device *netdev)
mode |= BGX_XCAST_MCAST_FILTER;
/* here we need to copy mc addrs */
if (netdev_mc_count(netdev)) {
-   struct xcast_addr *xaddr;
-
-   mc_list = kmalloc(sizeof(*mc_list), GFP_ATOMIC);
-   INIT_LIST_HEAD(&mc_list->list);
+   mc_list = kmalloc(offsetof(typeof(*mc_list),
+  mc[netdev_mc_count(netdev)]),
+ GFP_ATOMIC);
+   if (unlikely(!mc_list))
+   return;
+   mc_list->count = 0;
netdev_hw_addr_list_for_each(ha, &netdev->mc) {
-   xaddr = kmalloc(sizeof(*xaddr),
-   GFP_ATOMIC);
-   xaddr->addr =
+   mc_list->mc[mc_list->count] =
ether_addr_to_u64(ha->addr);
-   list_add_tail(&xaddr->list,
- &mc_list->list);
mc_list->count++;
}
}
-- 
2.14.3



Re: kernel BUG at drivers/vhost/vhost.c:LINE! (2)

2018-04-09 Thread Stefan Hajnoczi
On Mon, Apr 9, 2018 at 11:28 AM, Stefan Hajnoczi  wrote:
> On Mon, Apr 09, 2018 at 05:44:36AM +0300, Michael S. Tsirkin wrote:
>> On Mon, Apr 09, 2018 at 10:37:45AM +0800, Stefan Hajnoczi wrote:
>> > On Sat, Apr 7, 2018 at 3:02 AM, syzbot
>> >  wrote:
>> > > syzbot hit the following crash on upstream commit
>> > > 38c23685b273cfb4ccf31a199feccce3bdcb5d83 (Fri Apr 6 04:29:35 2018 +)
>> > > Merge tag 'armsoc-drivers' of
>> > > git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
>> > > syzbot dashboard link:
>> > > https://syzkaller.appspot.com/bug?extid=65a84dde0214b0387ccd
>> >
>> > To prevent duplicated work: I am working on this one.
>> >
>> > Stefan
>>
>> Do you want to try this patchset:
>> https://lkml.org/lkml/2018/4/5/665
>>
>> ?
>
> Thanks, I'll give it a shot.
>
> I also noticed a regression in commit
> d65026c6c62e7d9616c8ceb5a53b68bcdc050525 ("vhost: validate log when
> IOTLB is enabled") and am currently testing a fix.

I have sent a fix:
https://lkml.org/lkml/2018/4/9/390

Stefan


Re: WARNING in ip_rt_bug

2018-04-09 Thread Eric Dumazet


On 04/08/2018 11:06 PM, Dmitry Vyukov wrote:
> On Mon, Apr 9, 2018 at 7:59 AM, syzbot
>  wrote:
>> Hello,
>>
>> syzbot hit the following crash on net-next commit
>> 8bde261e535257e81087d39ff808414e2f5aa39d (Sun Apr 1 02:31:43 2018 +)
>> Merge tag 'mlx5-updates-2018-03-30' of
>> git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
>> syzbot dashboard link:
>> https://syzkaller.appspot.com/bug?extid=b09ac67a2af842b12eab
>>
>> Unfortunately, I don't have any reproducer for this crash yet.
>> Raw console output:
>> https://syzkaller.appspot.com/x/log.txt?id=5991727739437056
>> Kernel config:
>> https://syzkaller.appspot.com/x/.config?id=3327544840960562528
>> compiler: gcc (GCC) 7.1.1 20170620
>>
>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>> Reported-by: syzbot+b09ac67a2af842b12...@syzkaller.appspotmail.com
>> It will help syzbot understand when the bug is fixed. See footer for
>> details.
>> If you forward the report, please keep this part and the footer.
> 
> 
> +Eric said that perhaps we just need to revert:
> 
> commit c378a9c019cf5e017d1ed24954b54fae7bebd2bc
> Date:   Sat May 21 07:16:42 2011 +
> ipv4: Give backtrace in ip_rt_bug().
> 

And David replied :


Let's not do the revert, I wouldn't have seen the backtrace which
points where this bug is if we had.

icmp_route_lookup(), in one branch, does an input route lookup and
uses the result of that to send the icmp message.

That can't be right, input routes should never be used for
transmitting traffice and that's how we end up at ip_rt_bug().





[PATCH] net/mlx5: remove some extraneous spaces in indentations

2018-04-09 Thread Colin King
From: Colin Ian King 

There are several lines where there is an extraneous space causing
indentation misalignment. Remove them.

Cleans up Coccinelle warnings:

./drivers/net/ethernet/mellanox/mlx5/core/qp.c:409:3-18: code aligned
with following code on line 410
./drivers/net/ethernet/mellanox/mlx5/core/qp.c:415:3-18: code aligned
with following code on line 416
./drivers/net/ethernet/mellanox/mlx5/core/qp.c:421:3-18: code aligned
with following code on line 422

Signed-off-by: Colin Ian King 
---
 drivers/net/ethernet/mellanox/mlx5/core/qp.c | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/qp.c 
b/drivers/net/ethernet/mellanox/mlx5/core/qp.c
index 02d6c5b5d502..4ca07bfb6b14 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/qp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/qp.c
@@ -407,21 +407,21 @@ static int modify_qp_mbox_alloc(struct mlx5_core_dev 
*dev, u16 opcode, int qpn,
case MLX5_CMD_OP_RST2INIT_QP:
if (MBOX_ALLOC(mbox, rst2init_qp))
return -ENOMEM;
-MOD_QP_IN_SET_QPC(rst2init_qp, mbox->in, opcode, qpn,
-  opt_param_mask, qpc);
-break;
+   MOD_QP_IN_SET_QPC(rst2init_qp, mbox->in, opcode, qpn,
+ opt_param_mask, qpc);
+   break;
case MLX5_CMD_OP_INIT2RTR_QP:
if (MBOX_ALLOC(mbox, init2rtr_qp))
return -ENOMEM;
-MOD_QP_IN_SET_QPC(init2rtr_qp, mbox->in, opcode, qpn,
-  opt_param_mask, qpc);
-break;
+   MOD_QP_IN_SET_QPC(init2rtr_qp, mbox->in, opcode, qpn,
+ opt_param_mask, qpc);
+   break;
case MLX5_CMD_OP_RTR2RTS_QP:
if (MBOX_ALLOC(mbox, rtr2rts_qp))
return -ENOMEM;
-MOD_QP_IN_SET_QPC(rtr2rts_qp, mbox->in, opcode, qpn,
-  opt_param_mask, qpc);
-break;
+   MOD_QP_IN_SET_QPC(rtr2rts_qp, mbox->in, opcode, qpn,
+ opt_param_mask, qpc);
+   break;
case MLX5_CMD_OP_RTS2RTS_QP:
if (MBOX_ALLOC(mbox, rts2rts_qp))
return -ENOMEM;
-- 
2.15.1



Re: [PATCH 2/2] alx: add disable_wol paramenter

2018-04-09 Thread Andrew Lunn
On Mon, Apr 09, 2018 at 07:35:14PM +0800, AceLan Kao wrote:
> The WoL feature was reported broken: it causes the system to resume
> immediately after suspending. This symptom does not happen on every
> system, so add a disable_wol option and disable WoL by default to
> prevent the issue from happening again.

>  const char alx_drv_name[] = "alx";
>  
> +/* disable WoL by default */
> +bool disable_wol = 1;
> +module_param(disable_wol, bool, 0);
> +MODULE_PARM_DESC(disable_wol, "Disable Wake on Lan feature");
> +

Hi AceLan

This seems like you are papering over the cracks. And module
parameters are not liked.

Please try to find the real problem.

   Andrew


Re: Passing uninitialised local variable

2018-04-09 Thread Petr Machata
Arend van Spriel  writes:

> On 3/28/2018 1:20 PM, Himanshu Jha wrote:
>> I recently found that a local variable is passed uninitialised to the
>> function at
>>
>> drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c:2950
>>
>>  u32 var;
>>  err = brcmf_fil_iovar_int_get(ifp, "dtim_assoc", &var);

>>
>> s32
>> brcmf_fil_iovar_int_get(struct brcmf_if *ifp, char *name, u32 *data)
>> {
>>  __le32 data_le = cpu_to_le32(*data);

>> }
>>
>> We can clearly see that 'var' is used uninitialised in the very first line,
>> which is undefined behavior.
>
> Why undefined? We copy some stack data and we do transfer that to the device. 
> However in this case
> the device does nothing with it and it is simply overwritten by the response.

"Undefined behavior" is a technical term for when there are no
guarantees as to what the result of executing a given code will be. None
at all--it might for example abort, and that would be perfectly valid as
well. (To be clear, this is not about the device, but about the CPU that
this code runs on.)

Uninitialized reads are one example of a code construct that invokes
undefined behavior.

Thanks,
Petr
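To make Petr's point concrete, here is a hedged userspace sketch (none of this is the brcmfmac code; fake_iovar_int_get and get_dtim_assoc are made-up names) of a getter that reads *data before overwriting it, and of the zero-initialization that keeps that first read defined:

```c
/* Hedged sketch (not the brcmfmac code): a getter that, like
 * brcmf_fil_iovar_int_get(), first READS *data to build its request
 * before overwriting it with the response.  Passing a pointer to an
 * uninitialized variable into such a getter is an uninitialized read;
 * zero-initializing the caller's variable avoids the undefined
 * behavior. */
#include <assert.h>
#include <stdint.h>

static int fake_iovar_int_get(const char *name, uint32_t *data)
{
	uint32_t request = *data;	/* reads the caller's value first */

	(void)name;
	(void)request;			/* a real driver would send this out */
	*data = 100;			/* the response overwrites the input */
	return 0;
}

static uint32_t get_dtim_assoc(void)
{
	uint32_t var = 0;		/* initialized: the first read is defined */

	if (fake_iovar_int_get("dtim_assoc", &var) != 0)
		return 0;
	return var;
}
```

The device overwriting the value later (as Arend notes) does not undo the uninitialized read on the CPU side; initializing at the declaration does.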


[PATCH 1/2] Revert "alx: remove WoL support"

2018-04-09 Thread AceLan Kao
This reverts commit bc2bebe8de8ed4ba6482c9cc370b0dd72ffe8cd2.

There are still many people who need this feature, so try adding it back.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=61651
Signed-off-by: AceLan Kao 
---
 drivers/net/ethernet/atheros/alx/ethtool.c |  36 +++
 drivers/net/ethernet/atheros/alx/hw.c  | 154 -
 drivers/net/ethernet/atheros/alx/hw.h  |   5 +
 drivers/net/ethernet/atheros/alx/main.c| 142 --
 4 files changed, 326 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/atheros/alx/ethtool.c 
b/drivers/net/ethernet/atheros/alx/ethtool.c
index 2f4eabf..859e272 100644
--- a/drivers/net/ethernet/atheros/alx/ethtool.c
+++ b/drivers/net/ethernet/atheros/alx/ethtool.c
@@ -310,11 +310,47 @@ static int alx_get_sset_count(struct net_device *netdev, 
int sset)
}
 }
 
+static void alx_get_wol(struct net_device *netdev, struct ethtool_wolinfo *wol)
+{
+   struct alx_priv *alx = netdev_priv(netdev);
+   struct alx_hw *hw = &alx->hw;
+
+   wol->supported = WAKE_MAGIC | WAKE_PHY;
+   wol->wolopts = 0;
+
+   if (hw->sleep_ctrl & ALX_SLEEP_WOL_MAGIC)
+   wol->wolopts |= WAKE_MAGIC;
+   if (hw->sleep_ctrl & ALX_SLEEP_WOL_PHY)
+   wol->wolopts |= WAKE_PHY;
+}
+
+static int alx_set_wol(struct net_device *netdev, struct ethtool_wolinfo *wol)
+{
+   struct alx_priv *alx = netdev_priv(netdev);
+   struct alx_hw *hw = &alx->hw;
+
+   if (wol->wolopts & ~(WAKE_MAGIC | WAKE_PHY))
+   return -EOPNOTSUPP;
+
+   hw->sleep_ctrl = 0;
+
+   if (wol->wolopts & WAKE_MAGIC)
+   hw->sleep_ctrl |= ALX_SLEEP_WOL_MAGIC;
+   if (wol->wolopts & WAKE_PHY)
+   hw->sleep_ctrl |= ALX_SLEEP_WOL_PHY;
+
+   device_set_wakeup_enable(&alx->hw.pdev->dev, hw->sleep_ctrl);
+
+   return 0;
+}
+
 const struct ethtool_ops alx_ethtool_ops = {
.get_pauseparam = alx_get_pauseparam,
.set_pauseparam = alx_set_pauseparam,
.get_msglevel   = alx_get_msglevel,
.set_msglevel   = alx_set_msglevel,
+   .get_wol= alx_get_wol,
+   .set_wol= alx_set_wol,
.get_link   = ethtool_op_get_link,
.get_strings= alx_get_strings,
.get_sset_count = alx_get_sset_count,
diff --git a/drivers/net/ethernet/atheros/alx/hw.c 
b/drivers/net/ethernet/atheros/alx/hw.c
index 6ac40b0..f9bf612 100644
--- a/drivers/net/ethernet/atheros/alx/hw.c
+++ b/drivers/net/ethernet/atheros/alx/hw.c
@@ -332,6 +332,16 @@ void alx_set_macaddr(struct alx_hw *hw, const u8 *addr)
alx_write_mem32(hw, ALX_STAD1, val);
 }
 
+static void alx_enable_osc(struct alx_hw *hw)
+{
+   u32 val;
+
+   /* rising edge */
+   val = alx_read_mem32(hw, ALX_MISC);
+   alx_write_mem32(hw, ALX_MISC, val & ~ALX_MISC_INTNLOSC_OPEN);
+   alx_write_mem32(hw, ALX_MISC, val | ALX_MISC_INTNLOSC_OPEN);
+}
+
 static void alx_reset_osc(struct alx_hw *hw, u8 rev)
 {
u32 val, val2;
@@ -774,7 +784,6 @@ int alx_setup_speed_duplex(struct alx_hw *hw, u32 ethadv, 
u8 flowctrl)
return err;
 }
 
-
 void alx_post_phy_link(struct alx_hw *hw)
 {
u16 phy_val, len, agc;
@@ -848,6 +857,65 @@ void alx_post_phy_link(struct alx_hw *hw)
}
 }
 
+/* NOTE:
+ *1. phy link must be established before calling this function
+ *2. wol option (pattern,magic,link,etc.) is configed before call it.
+ */
+int alx_pre_suspend(struct alx_hw *hw, int speed, u8 duplex)
+{
+   u32 master, mac, phy, val;
+   int err = 0;
+
+   master = alx_read_mem32(hw, ALX_MASTER);
+   master &= ~ALX_MASTER_PCLKSEL_SRDS;
+   mac = hw->rx_ctrl;
+   /* 10/100 half */
+   ALX_SET_FIELD(mac, ALX_MAC_CTRL_SPEED,  ALX_MAC_CTRL_SPEED_10_100);
+   mac &= ~(ALX_MAC_CTRL_FULLD | ALX_MAC_CTRL_RX_EN | ALX_MAC_CTRL_TX_EN);
+
+   phy = alx_read_mem32(hw, ALX_PHY_CTRL);
+   phy &= ~(ALX_PHY_CTRL_DSPRST_OUT | ALX_PHY_CTRL_CLS);
+   phy |= ALX_PHY_CTRL_RST_ANALOG | ALX_PHY_CTRL_HIB_PULSE |
+  ALX_PHY_CTRL_HIB_EN;
+
+   /* without any activity  */
+   if (!(hw->sleep_ctrl & ALX_SLEEP_ACTIVE)) {
+   err = alx_write_phy_reg(hw, ALX_MII_IER, 0);
+   if (err)
+   return err;
+   phy |= ALX_PHY_CTRL_IDDQ | ALX_PHY_CTRL_POWER_DOWN;
+   } else {
+   if (hw->sleep_ctrl & (ALX_SLEEP_WOL_MAGIC | ALX_SLEEP_CIFS))
+   mac |= ALX_MAC_CTRL_RX_EN | ALX_MAC_CTRL_BRD_EN;
+   if (hw->sleep_ctrl & ALX_SLEEP_CIFS)
+   mac |= ALX_MAC_CTRL_TX_EN;
+   if (duplex == DUPLEX_FULL)
+   mac |= ALX_MAC_CTRL_FULLD;
+   if (speed == SPEED_1000)
+   ALX_SET_FIELD(mac, ALX_MAC_CTRL_SPEED,
+ ALX_MAC_CTRL_SPEED_1000);
+   phy |= ALX_PHY_CTRL_DSPRST_OUT;
+   

[PATCH 2/2] alx: add disable_wol paramenter

2018-04-09 Thread AceLan Kao
The WoL feature was reported broken: it causes the system to resume
immediately after suspending. This symptom does not happen on every
system, so add a disable_wol option and disable WoL by default to
prevent the issue from happening again.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=61651
Signed-off-by: AceLan Kao 
---
 drivers/net/ethernet/atheros/alx/ethtool.c | 7 ++-
 drivers/net/ethernet/atheros/alx/main.c| 5 +
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/atheros/alx/ethtool.c 
b/drivers/net/ethernet/atheros/alx/ethtool.c
index 859e272..e50129d 100644
--- a/drivers/net/ethernet/atheros/alx/ethtool.c
+++ b/drivers/net/ethernet/atheros/alx/ethtool.c
@@ -46,6 +46,8 @@
 #include "reg.h"
 #include "hw.h"
 
+extern const bool disable_wol;
+
 /* The order of these strings must match the order of the fields in
  * struct alx_hw_stats
  * See hw.h
@@ -315,6 +317,9 @@ static void alx_get_wol(struct net_device *netdev, struct 
ethtool_wolinfo *wol)
struct alx_priv *alx = netdev_priv(netdev);
struct alx_hw *hw = &alx->hw;
 
+   if (disable_wol)
+   return;
+
wol->supported = WAKE_MAGIC | WAKE_PHY;
wol->wolopts = 0;
 
@@ -329,7 +334,7 @@ static int alx_set_wol(struct net_device *netdev, struct 
ethtool_wolinfo *wol)
struct alx_priv *alx = netdev_priv(netdev);
struct alx_hw *hw = &alx->hw;
 
-   if (wol->wolopts & ~(WAKE_MAGIC | WAKE_PHY))
+   if (disable_wol || (wol->wolopts & ~(WAKE_MAGIC | WAKE_PHY)))
return -EOPNOTSUPP;
 
hw->sleep_ctrl = 0;
diff --git a/drivers/net/ethernet/atheros/alx/main.c 
b/drivers/net/ethernet/atheros/alx/main.c
index c0e2bb2..c27fdf7 100644
--- a/drivers/net/ethernet/atheros/alx/main.c
+++ b/drivers/net/ethernet/atheros/alx/main.c
@@ -51,6 +51,11 @@
 
 const char alx_drv_name[] = "alx";
 
+/* disable WoL by default */
+bool disable_wol = 1;
+module_param(disable_wol, bool, 0);
+MODULE_PARM_DESC(disable_wol, "Disable Wake on Lan feature");
+
 static void alx_free_txbuf(struct alx_tx_queue *txq, int entry)
 {
struct alx_buffer *txb = &txq->bufs[entry];
-- 
2.7.4



Re: [iovisor-dev] Best userspace programming API for XDP features query to kernel?

2018-04-09 Thread Jesper Dangaard Brouer
On Fri, 6 Apr 2018 12:36:18 +0200
Daniel Borkmann  wrote:

> On 04/05/2018 10:51 PM, Jesper Dangaard Brouer wrote:
> > On Thu, 5 Apr 2018 12:37:19 +0200
> > Daniel Borkmann  wrote:
> >   
> >> On 04/04/2018 02:28 PM, Jesper Dangaard Brouer via iovisor-dev wrote:  
> >>> Hi Suricata people,
> >>>
> >>> When Eric Leblond (and I helped) integrated XDP in Suricata, we ran
> >>> into the issue, that at Suricata load/start time, we cannot determine
> >>> if the chosen XDP config options, like xdp-cpu-redirect[1], is valid on
> >>> this HW (e.g require driver XDP_REDIRECT support and bpf cpumap).
> >>>
> >>> We would have liked a way to report that suricata.yaml config was
> >>> invalid for this hardware/setup.  Now, it just loads, and packets gets
> >>> silently dropped by XDP (well a WARN_ONCE and catchable via tracepoints).
> >>>
> >>> My question to suricata developers: (Q1) Do you already have code that
> >>> query the kernel or drivers for features?
> >>>
> >>> At the IOvisor call (2 weeks ago), we discussed two options of exposing
> >>> XDP features avail in a given driver.
> >>>
> >>> Option#1: Extend existing ethtool -k/-K "offload and other features"
> >>> with some XDP features, that userspace can query. (Do you already query
> >>> offloads, regarding Q1)
> >>>
> >>> Option#2: Invent a new 'ip link set xdp' netlink msg with a query option. 
> >>>
> >>
> >> I don't really mind if you go via ethtool, as long as we handle this
> >> generically from there and e.g. call the dev's ndo_bpf handler such that
> >> we keep all the information in one place. This can be a new ndo_bpf command
> >> e.g. XDP_QUERY_FEATURES or such.  
> > 
> > Just to be clear: notice as Victor points out[2], they are programmable
> > going though the IOCTL (SIOCETHTOOL) and not using cmdline tools.  
> 
> Sure, that was perfectly clear. (But at the same time if you extend the
> ioctl, it's obvious to also add support to actual ethtool cmdline tool.)
> 
> > [2] https://github.com/OISF/suricata/blob/master/src/util-ioctl.c#L326
> > 
> > If you want everything to go through the drivers ndo_bpf call anyway
> > (which userspace API is netlink based) then at what point to you  
> 
> Not really, that's the front end. ndo_bpf itself is a plain netdev op
> and has no real tie to netlink.
> 
> > want drivers to call their own ndo_bpf, when activated though their
> > ethtool_ops ? (Sorry, but I don't follow the flow you are proposing)
> > 
> > Notice, I'm not directly against using the drivers ndo_bpf call.  I can
> > see it does provide kernel more flexibility than the ethtool IOCTL.  
> 
> What I was saying is that even if you go via ethtool ioctl api, where
> you end up in dev_ethtool() and have some new ETHTOOL_* query command,
> then instead of adding a new ethtool_ops callback, we can and should
> reuse ndo_bpf from there.

Okay, you want to catch this in dev_ethtool(); that connected the dots
for me, thanks.  BUT it does feel a little fake and confusing to do
this, as driver developers seeing the info via ethtool would expect
that they need to implement an ethtool_ops callback.

My original idea behind going through the ethtool APIs was that every
driver is already using them, and they are familiar to driver
developers.  And there are existing userspace tools that sysadmins
already know, which we can leverage.

Thinking about this over the weekend, I have changed my mind a bit,
based on your feedback.  I do buy into your idea of forcing things
through the driver's ndo_bpf call.  I'm no longer convinced this should
go through ethtool, but as you describe, we can do that if we find it
is the easiest way to export and leverage an existing userspace tool.
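A hedged sketch of the scheme being discussed, with every name hypothetical (xdp_cmd, XDP_QUERY_FEATURES and fake_ndo_bpf are illustrative, not the kernel API): a single driver callback answers the feature query, and any front end, ethtool ioctl or netlink, funnels into it:

```c
/* Hedged sketch of the idea under discussion; all names here are
 * hypothetical.  One driver entry point (modeled on ndo_bpf) handles
 * a feature-query command, so both front ends see the same answer. */
#include <stdint.h>

enum xdp_cmd { XDP_SETUP_PROG, XDP_QUERY_FEATURES };

#define XDP_FEATURE_BASIC	(1u << 0)
#define XDP_FEATURE_REDIRECT	(1u << 1)

struct xdp_query {
	enum xdp_cmd cmd;
	uint32_t features;	/* filled in by the driver on query */
};

/* A driver supporting XDP_REDIRECT reports the bit here. */
static int fake_ndo_bpf(struct xdp_query *q)
{
	switch (q->cmd) {
	case XDP_QUERY_FEATURES:
		q->features = XDP_FEATURE_BASIC | XDP_FEATURE_REDIRECT;
		return 0;
	default:
		return -1;	/* command not handled in this sketch */
	}
}

/* Front-end helper: the ethtool and netlink paths would both end up
 * calling into the same driver callback. */
static int supports_redirect(void)
{
	struct xdp_query q = { .cmd = XDP_QUERY_FEATURES, .features = 0 };

	if (fake_ndo_bpf(&q))
		return 0;
	return (q.features & XDP_FEATURE_REDIRECT) != 0;
}
```

This is what would let Suricata reject an xdp-cpu-redirect config at load time instead of silently dropping packets.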



> [...]
> > Here, I want to discuss how drivers expose/tell userspace that they
> > support a given feature: Specifically a bit for: XDP_REDIRECT action
> > support.
> >   
> >> Same for meta data,  
> > 
> > Well, not really.  It would be a "nice-to-have", but not strictly
> > needed as a feature bit.  XDP meta-data is controlled via a helper.
> > And the BPF-prog can detect/see runtime, that the helper bpf_xdp_adjust_meta
> > returns -ENOTSUPP (and need to check the ret value anyhow).  Thus,
> > there is that not much gained by exposing this to be detected setup
> > time, as all drivers should eventually support this, and we can detect
> > it runtime.
> > 
> > The missing XDP_REDIRECT action features bit it different, as the
> > BPF-prog cannot detect runtime that this is an unsupported action.
> > Plus, setup time we cannot query the driver for supported XDP actions.  
> 
> Ok, so with the example of meta data, you're arguing that it's okay
> to load a native XDP program onto a driver, and run actual traffic on
> the NIC in order probe for the availability of the feature when you're
> saying that it "can detect/see [at] runtime". I totally agree with you
> that all drivers should eventually support this (same with XDP_REDIRECT),
> but 

[PATCH] netfilter: fix CONFIG_NF_REJECT_IPV6=m link error

2018-04-09 Thread Arnd Bergmann
We get a new link error with CONFIG_NFT_REJECT_INET=y and 
CONFIG_NF_REJECT_IPV6=m
after larger parts of the nftables modules are linked together:

net/netfilter/nft_reject_inet.o: In function `nft_reject_inet_eval':
nft_reject_inet.c:(.text+0x17c): undefined reference to `nf_send_unreach6'
nft_reject_inet.c:(.text+0x190): undefined reference to `nf_send_reset6'

The problem is that with NF_TABLES_INET set, we implicitly try to use
the ipv6 version as well for NFT_REJECT, but when CONFIG_IPV6 is set to
a loadable module, it's impossible to reach that.

The best workaround I found is to express the above as a Kconfig
dependency, forcing NFT_REJECT itself to be 'm' in that particular
configuration.

Fixes: 02c7b25e5f54 ("netfilter: nf_tables: build-in filter chain type")
Signed-off-by: Arnd Bergmann 
---
 net/netfilter/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index 704b3832dbad..44d8a55e9721 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -594,6 +594,7 @@ config NFT_QUOTA
 config NFT_REJECT
default m if NETFILTER_ADVANCED=n
tristate "Netfilter nf_tables reject support"
+   depends on !NF_TABLES_INET || (IPV6!=m || m)
help
  This option adds the "reject" expression that you can use to
  explicitly deny and notify via TCP reset/ICMP informational errors
-- 
2.9.0
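For readers unfamiliar with Kconfig tristate expressions, a hedged reading of why this dependency line works (this is an annotation, not part of the patch):

```
# In Kconfig, "depends on <expr>" caps a tristate symbol at the value
# of <expr>.  With NF_TABLES_INET=y and IPV6=m:
#
#   !NF_TABLES_INET     -> n
#   IPV6!=m             -> n
#   (IPV6!=m || m)      -> m     ("n || m" evaluates to "m")
#   whole expression    -> m     -> NFT_REJECT is limited to m
#
# so NFT_REJECT can no longer be built-in while nf_send_reset6 and
# nf_send_unreach6 live in a module.
```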



Re: [RFC bpf-next] bpf: document eBPF helpers and add a script to generate man page

2018-04-09 Thread Markus Heiser

> Am 09.04.2018 um 12:08 schrieb Daniel Borkmann :
[...]

>> Maybe I completely misunderstood you, so correct me if I'm wrong:
>> 
>> - ./scripts/bpf_helpers_doc.py : produces reST markup from C-comments
>> - ./scripts/kernel-doc : produces reST markup from C-comments
>> 
>> IMO: both are doing the same job, so why not use kernel-doc?
> 
> They are not really doing the same job, in bpf_helpers_doc.py case you don't
> want the whole header rendered, but just a fraction of it, that is, the
> single big comment which describes all BPF helper functions that a BPF
> program developer has available to use in user space - aka the entries in
> the __BPF_FUNC_MAPPER() macro;


> I also doubt the latter would actually qualify
> in kdoc context as some sort of a function description.

latter .. ah, OK .. thanks for clarifying. 

-- Markus --



Re: [PATCH v2 2/2] af_key: Use DIV_ROUND_UP() instead of open-coded equivalent

2018-04-09 Thread Steffen Klassert
On Sat, Apr 07, 2018 at 11:40:47AM -0400, Kevin Easton wrote:
> Several places use (x + 7) / 8 to convert from a number of bits to a number
> of bytes.  Replace those with DIV_ROUND_UP(x, 8) instead, for consistency
> with other parts of the same file.
> 
> Signed-off-by: Kevin Easton 

Please resubmit this one to ipsec-next after the
merge window. Thanks!
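For reference, a hedged userspace sketch showing that the kernel's DIV_ROUND_UP macro and the open-coded form agree on the bits-to-bytes conversion (the helper names are made up for illustration):

```c
/* The kernel's DIV_ROUND_UP (include/linux/kernel.h) reduced to
 * plain C, next to the open-coded (x + 7) / 8 form the patch
 * replaces; both round a bit count up to whole bytes. */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

static unsigned int bits_to_bytes_open_coded(unsigned int bits)
{
	return (bits + 7) / 8;
}

static unsigned int bits_to_bytes(unsigned int bits)
{
	return DIV_ROUND_UP(bits, 8);
}
```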


Re: [PATCH v2 1/2] af_key: Always verify length of provided sadb_key

2018-04-09 Thread Steffen Klassert
On Sat, Apr 07, 2018 at 11:40:33AM -0400, Kevin Easton wrote:
> Key extensions (struct sadb_key) include a user-specified number of key
> bits.  The kernel uses that number to determine how much key data to copy
> out of the message in pfkey_msg2xfrm_state().
> 
> The length of the sadb_key message must be verified to be long enough,
> even in the case of SADB_X_AALG_NULL.  Furthermore, the sadb_key_len value
> must be long enough to include both the key data and the struct sadb_key
> itself.
> 
> Introduce a helper function verify_key_len(), and call it from
> parse_exthdrs() where other exthdr types are similarly checked for
> correctness.
> 
> Signed-off-by: Kevin Easton 
> Reported-by: 
> syzbot+5022a34ca5a3d49b84223653fab632dfb7b4c...@syzkaller.appspotmail.com

Applied to the ipsec tree, thanks Kevin!
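A hedged sketch of the kind of length check the patch describes (struct sadb_key_hdr and verify_key_len_sketch are illustrative stand-ins, not the actual af_key code): per PF_KEY, sadb_key_len counts 8-byte units and must cover both the header and the user-claimed key bits:

```c
/* Hedged stand-in for the af_key check: reject a key extension
 * whose claimed key bits would run past the end of the extension,
 * which is what let the kernel copy beyond the message. */
#include <stddef.h>
#include <stdint.h>

struct sadb_key_hdr {		/* stand-in for struct sadb_key */
	uint16_t sadb_key_len;	/* total length in 8-byte units */
	uint16_t sadb_key_exttype;
	uint16_t sadb_key_bits;	/* user-claimed key size in bits */
	uint16_t sadb_key_reserved;
};

static int verify_key_len_sketch(const struct sadb_key_hdr *key)
{
	size_t total = (size_t)key->sadb_key_len * 8;
	size_t keylen = (key->sadb_key_bits + 7) / 8;	/* bits -> bytes */

	if (total < sizeof(*key) + keylen)
		return -1;	/* claimed key overruns the extension */
	return 0;
}
```

The check must run even for SADB_X_AALG_NULL, since the copy length comes from the user-controlled bit count either way.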


Re: [PATCH v2 0/2] af_key: Fix for sadb_key memcpy read overrun

2018-04-09 Thread Steffen Klassert
On Sat, Apr 07, 2018 at 11:40:18AM -0400, Kevin Easton wrote:
> As found by syzbot, af_key does not properly validate the key length in
> sadb_key messages from userspace.  This can result in copying from beyond
> the end of the sadb_key part of the message, or indeed beyond the end of
> the entire packet.
> 
> Both these patches apply cleanly to ipsec-next.  Based on Steffen's
> feedback I have re-ordered them so that the fix only is in patch 1, which
> I would suggest is also a stable tree candidate, whereas patch 2 is a
> cleanup only.

I think some explanation is needed here. Usually bugfixes and cleanups
don't go to the same tree. For IPsec, bugfixes go to the 'ipsec' tree
while cleanups and new features go to the 'ipsec-next' tree. So
you need to split up your patchsets into patches that are targeted
at 'ipsec' and 'ipsec-next'. Aside from that, we are currently in the
merge window. This means that most maintainers don't accept patches to
their -next trees. If you have patches for a -next tree, wait until
the merge window is over (when v4.17-rc1 is released) and send them
then.



Re: [RFC bpf-next] bpf: document eBPF helpers and add a script to generate man page

2018-04-09 Thread Daniel Borkmann
On 04/09/2018 11:35 AM, Markus Heiser wrote:
> 
>> Am 09.04.2018 um 11:25 schrieb Daniel Borkmann :
>>
>> On 04/09/2018 11:21 AM, Markus Heiser wrote:
>> [...]
>>> Do we really need another kernel-doc parser?
>>>
>>>  ./scripts/kernel-doc include/uapi/linux/bpf.h
>>>
>>> should already do the job (producing .rst). For more infos, take a look at
>>
>> This has absolutely zero to do with kernel-doc, but rather producing
>> a description of BPF helper function that are later assembled into an
>> actual man-page that BPF program developers (user space) can use.
> 
> Maybe I completely misunderstood you, so correct me if I'm wrong:
> 
> - ./scripts/bpf_helpers_doc.py : produces reST markup from C-comments
> - ./scripts/kernel-doc : produces reST markup from C-comments
> 
> IMO: both are doing the same job, so why not use kernel-doc?

They are not really doing the same job, in bpf_helpers_doc.py case you don't
want the whole header rendered, but just a fraction of it, that is, the
single big comment which describes all BPF helper functions that a BPF
program developer has available to use in user space - aka the entries in
the __BPF_FUNC_MAPPER() macro; I also doubt the latter would actually qualify
in kdoc context as some sort of a function description.


Re: [RFC PATCH bpf-next 2/6] bpf: add bpf_get_stack helper

2018-04-09 Thread Daniel Borkmann
On 04/09/2018 07:02 AM, Alexei Starovoitov wrote:
> On 4/8/18 9:53 PM, Yonghong Song wrote:
 @@ -1004,7 +1007,8 @@ static void __bpf_prog_put(struct bpf_prog
 *prog, bool do_idr_lock)
  bpf_prog_kallsyms_del(prog->aux->func[i]);
  bpf_prog_kallsyms_del(prog);

 -    call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu);
 +    synchronize_rcu();
 +    __bpf_prog_put_rcu(&prog->aux->rcu);
>>>
>>> there should have been lockdep splat.
>>> We cannot call synchronize_rcu here, since we cannot sleep
>>> in some cases.
>>
>> Let me double check this. The following is the reason
>> why I am using synchronize_rcu().
>>
>> With call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu) and
>> _bpf_prog_put_rcu calls put_callchain_buffers() which
>> calls mutex_lock(), the runtime with CONFIG_DEBUG_ATOMIC_SLEEP=y
>> will complains since potential sleep inside the call_rcu is not
>> allowed.
> 
> I see. Indeed. We cannot call put_callchain_buffers() from rcu callback,
> but doing synchronize_rcu() here is also not possible.
> How about moving put_callchain into bpf_prog_free_deferred() ?

+1, the assumption is that you can call bpf_prog_put() and also the
bpf_map_put() from any context. Sleeping here for a long time might
subtly break things badly.


Re: [PATCH v4] net: thunderx: rework mac addresses list to u64 array

2018-04-09 Thread Vadim Lomovtsev
On Sun, Apr 08, 2018 at 12:42:00PM -0400, David Miller wrote:
> From: Vadim Lomovtsev 
> Date: Fri,  6 Apr 2018 12:53:54 -0700
> 
> > @@ -1929,7 +1929,7 @@ static void nicvf_set_rx_mode_task(struct work_struct 
> > *work_arg)
> >   work.work);
> > struct nicvf *nic = container_of(vf_work, struct nicvf, rx_mode_work);
> > union nic_mbx mbx = {};
> > -   struct xcast_addr *xaddr, *next;
> > +   int idx = 0;
> 
> No need to initialize idx.
> 
> > +   for (idx = 0; idx < vf_work->mc->count; idx++) {
> 
> As it is always explicitly initialized at, and only used inside of,
> this loop.

Ok, will post next version shortly.
Thanks for your time.

Vadim
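David's review point, illustrated with a hedged standalone sketch (sum_first is a made-up example, not the thunderx code): when the for statement always assigns the index before its first use, an initializer at the declaration is dead code:

```c
/* The "= 0" at the declaration would be dead code here, because the
 * for statement assigns idx before any use; declare it plain and
 * initialize it in the loop header only. */
static int sum_first(const int *vals, int count)
{
	int idx;	/* no "= 0" needed at the declaration */
	int sum = 0;

	for (idx = 0; idx < count; idx++)
		sum += vals[idx];
	return sum;
}
```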


Re: [RFC bpf-next] bpf: document eBPF helpers and add a script to generate man page

2018-04-09 Thread Markus Heiser

> Am 09.04.2018 um 11:25 schrieb Daniel Borkmann :
> 
> On 04/09/2018 11:21 AM, Markus Heiser wrote:
> [...]
>> Do we really need another kernel-doc parser?
>> 
>>  ./scripts/kernel-doc include/uapi/linux/bpf.h
>> 
>> should already do the job (producing .rst). For more infos, take a look at
> 
> This has absolutely zero to do with kernel-doc, but rather producing
> a description of BPF helper function that are later assembled into an
> actual man-page that BPF program developers (user space) can use.

Maybe I completely misunderstood you, so correct me if I'm wrong:

- ./scripts/bpf_helpers_doc.py : produces reST markup from C-comments
- ./scripts/kernel-doc : produces reST markup from C-comments

IMO: both are doing the same job, so why not use kernel-doc?

-- Markus --




Re: [RFC bpf-next] bpf: document eBPF helpers and add a script to generate man page

2018-04-09 Thread Markus Heiser
On 04/06/2018 01:11 PM, Quentin Monnet wrote:
>> eBPF helper functions can be called from within eBPF programs to perform
>> a variety of tasks that would be otherwise hard or impossible to do with
>> eBPF itself. There is a growing number of such helper functions in the
>> kernel, but documentation is scarce. The main user space header file
>> does contain a short commented description of most helpers, but it is
>> somewhat outdated and not complete. It is more a "cheat sheet" than a
>> real documentation accessible to new eBPF developers.
>> 
>> This commit attempts to improve the situation by replacing the existing
>> overview for the helpers with a more developed description. Furthermore,
>> a Python script is added to generate a manual page for eBPF helpers. The
>> workflow is the following, and requires the rst2man utility:
>> 
>>$ ./scripts/bpf_helpers_doc.py \
>>--filename include/uapi/linux/bpf.h > /tmp/bpf-helpers.rst
>>$ rst2man /tmp/bpf-helpers.rst > /tmp/bpf-helpers.7
>>$ man /tmp/bpf-helpers.7
>> 
>> The objective is to keep all documentation related to the helpers in a
>> single place,

Do we really need another kernel-doc parser?

  ./scripts/kernel-doc include/uapi/linux/bpf.h

should already do the job (producing .rst). For more infos, take a look at

  https://www.kernel.org/doc/html/latest/doc-guide/index.html

-- Markus --

PS: sorry for re-post, first post was HTML which is not accepted by ML :o

