Re: [PATCH] slip: Check if rstate is initialized before uncompressing
On 2018-04-09 20:34, David Miller wrote:
> From: Tejaswi Tanikella
> Date: Mon, 9 Apr 2018 14:23:49 +0530
>
>> @@ -673,6 +677,7 @@ struct slcompress *
>>  	if (cs->cs_tcp.doff > 5)
>>  		memcpy(cs->cs_tcpopt, icp + ihl*4 + sizeof(struct tcphdr),
>>  		       (cs->cs_tcp.doff - 5) * 4);
>>  	cs->cs_hsize = ihl*2 + cs->cs_tcp.doff*2;
>> +	cs->initialized = 1;
>>  	/* Put headers back on packet
... 
>> struct cstate {
>> 	byte_t	cs_this;	/* connection id number (xmit) */
>> +	byte_t	initialized;	/* non-zero if initialized */
>
> Please use 'bool' and true/false for 'initialized'.

Made the changes.

From b04441041034a5e044705d3c5a2cc338dd4bfc3a Mon Sep 17 00:00:00 2001
From: Tejaswi Tanikella
Date: Thu, 29 Mar 2018 14:46:41 +0530
Subject: [PATCH] slip: Check if rstate is initialized before uncompressing

On receiving a packet, the state index points to the rstate which must be
used to fill in the IP and TCP headers.  But if the state index points to
an rstate which is uninitialized, i.e. filled with zeros, it gets stuck in
an infinite loop inside ip_fast_csum trying to compute the IP checksum of
a header with zero length.

89.666953: <2> [] slhc_uncompress+0x464/0x468
89.666965: <2> [] ppp_receive_nonmp_frame+0x3b4/0x65c
89.666978: <2> [] ppp_receive_frame+0x64/0x7e0
89.666991: <2> [] ppp_input+0x104/0x198
89.667005: <2> [] pppopns_recv_core+0x238/0x370
89.667027: <2> [] __sk_receive_skb+0xdc/0x250
89.667040: <2> [] pppopns_recv+0x44/0x60
89.667053: <2> [] __sock_queue_rcv_skb+0x16c/0x24c
89.667065: <2> [] sock_queue_rcv_skb+0x2c/0x38
89.667085: <2> [] raw_rcv+0x124/0x154
89.667098: <2> [] raw_local_deliver+0x1e0/0x22c
89.667117: <2> [] ip_local_deliver_finish+0x70/0x24c
89.667131: <2> [] ip_local_deliver+0x100/0x10c

./scripts/faddr2line vmlinux slhc_uncompress+0x464/0x468 output:
ip_fast_csum at arch/arm64/include/asm/checksum.h:40
(inlined by) slhc_uncompress at drivers/net/slip/slhc.c:615

Add a variable to indicate whether the current rstate is initialized.  If
a packet arrives for an uninitialized rstate, move to the toss state.
Signed-off-by: Tejaswi Tanikella
---
 drivers/net/slip/slhc.c | 5 +++++
 include/net/slhc_vj.h   | 1 +
 2 files changed, 6 insertions(+)

diff --git a/drivers/net/slip/slhc.c b/drivers/net/slip/slhc.c
index 5782733..f4e93f5 100644
--- a/drivers/net/slip/slhc.c
+++ b/drivers/net/slip/slhc.c
@@ -509,6 +509,10 @@ struct slcompress *
 		if(x < 0 || x > comp->rslot_limit)
 			goto bad;
 
+		/* Check if the cstate is initialized */
+		if (!comp->rstate[x].initialized)
+			goto bad;
+
 		comp->flags &=~ SLF_TOSS;
 		comp->recv_current = x;
 	} else {
@@ -673,6 +677,7 @@ struct slcompress *
 	if (cs->cs_tcp.doff > 5)
 		memcpy(cs->cs_tcpopt, icp + ihl*4 + sizeof(struct tcphdr),
 		       (cs->cs_tcp.doff - 5) * 4);
 	cs->cs_hsize = ihl*2 + cs->cs_tcp.doff*2;
+	cs->initialized = true;
 	/* Put headers back on packet
 	 * Neither header checksum is recalculated
 	 */
diff --git a/include/net/slhc_vj.h b/include/net/slhc_vj.h
index 8716d59..8fcf890 100644
--- a/include/net/slhc_vj.h
+++ b/include/net/slhc_vj.h
@@ -127,6 +127,7 @@
  */
 struct cstate {
 	byte_t	cs_this;	/* connection id number (xmit) */
+	bool	initialized;	/* true if initialized */
 	struct cstate *next;	/* next in ring (xmit) */
 	struct iphdr cs_ip;	/* ip/tcp hdr from most recent packet */
 	struct tcphdr cs_tcp;
-- 
1.9.1
[PATCH v2 0/2] vhost: fix vhost_vq_access_ok() log check
v2:
 * Rewrote the conditional to make the vq access check clearer [Linus]
 * Added Patch 2 to make the return type consistent and harder to misuse
   [Linus]

The first patch fixes the vhost virtqueue access check which was recently
broken.  The second patch replaces the int return type with bool to
prevent future bugs.

Stefan Hajnoczi (2):
  vhost: fix vhost_vq_access_ok() log check
  vhost: return bool from *_access_ok() functions

 drivers/vhost/vhost.h |  4 +--
 drivers/vhost/vhost.c | 70 ++-
 2 files changed, 38 insertions(+), 36 deletions(-)

-- 
2.14.3
[PATCH v2 2/2] vhost: return bool from *_access_ok() functions
Currently vhost *_access_ok() functions return int.  This is error-prone
because there are two popular conventions:

1. 0 means failure, 1 means success
2. -errno means failure, 0 means success

Although vhost mostly uses #1, it does not do so consistently.
umem_access_ok() uses #2.

This patch changes the return type from int to bool so that false means
failure and true means success.  This eliminates a potential source of
errors.

Suggested-by: Linus Torvalds
Signed-off-by: Stefan Hajnoczi
---
 drivers/vhost/vhost.h |  4 ++--
 drivers/vhost/vhost.c | 66 +--
 2 files changed, 35 insertions(+), 35 deletions(-)

diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index ac4b6056f19a..6e00fa57af09 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -178,8 +178,8 @@ void vhost_dev_cleanup(struct vhost_dev *);
 void vhost_dev_stop(struct vhost_dev *);
 long vhost_dev_ioctl(struct vhost_dev *, unsigned int ioctl, void __user *argp);
 long vhost_vring_ioctl(struct vhost_dev *d, int ioctl, void __user *argp);
-int vhost_vq_access_ok(struct vhost_virtqueue *vq);
-int vhost_log_access_ok(struct vhost_dev *);
+bool vhost_vq_access_ok(struct vhost_virtqueue *vq);
+bool vhost_log_access_ok(struct vhost_dev *);
 
 int vhost_get_vq_desc(struct vhost_virtqueue *,
 		      struct iovec iov[], unsigned int iov_count,
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 93fd0c75b0d8..b6a082ef33dd 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -641,14 +641,14 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
 }
 EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
 
-static int log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
+static bool log_access_ok(void __user *log_base, u64 addr, unsigned long sz)
 {
 	u64 a = addr / VHOST_PAGE_SIZE / 8;
 
 	/* Make sure 64 bit math will not overflow. */
 	if (a > ULONG_MAX - (unsigned long)log_base ||
 	    a + (unsigned long)log_base > ULONG_MAX)
-		return 0;
+		return false;
 
 	return access_ok(VERIFY_WRITE, log_base + a,
 			 (sz + VHOST_PAGE_SIZE * 8 - 1) / VHOST_PAGE_SIZE / 8);
@@ -661,30 +661,30 @@ static bool vhost_overflow(u64 uaddr, u64 size)
 }
 
 /* Caller should have vq mutex and device mutex. */
-static int vq_memory_access_ok(void __user *log_base, struct vhost_umem *umem,
-			       int log_all)
+static bool vq_memory_access_ok(void __user *log_base, struct vhost_umem *umem,
+				int log_all)
 {
 	struct vhost_umem_node *node;
 
 	if (!umem)
-		return 0;
+		return false;
 
 	list_for_each_entry(node, &umem->umem_list, link) {
 		unsigned long a = node->userspace_addr;
 
 		if (vhost_overflow(node->userspace_addr, node->size))
-			return 0;
+			return false;
 
 		if (!access_ok(VERIFY_WRITE, (void __user *)a,
 				    node->size))
-			return 0;
+			return false;
 		else if (log_all && !log_access_ok(log_base,
 						   node->start,
 						   node->size))
-			return 0;
+			return false;
 	}
-	return 1;
+	return true;
 }
 
 static inline void __user *vhost_vq_meta_fetch(struct vhost_virtqueue *vq,
@@ -701,13 +701,13 @@ static inline void __user *vhost_vq_meta_fetch(struct vhost_virtqueue *vq,
 
 /* Can we switch to this memory table? */
 /* Caller should have device mutex but not vq mutex */
-static int memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
-			    int log_all)
+static bool memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
+			     int log_all)
 {
 	int i;
 
 	for (i = 0; i < d->nvqs; ++i) {
-		int ok;
+		bool ok;
 		bool log;
 
 		mutex_lock(&d->vqs[i]->mutex);
@@ -717,12 +717,12 @@ static int memory_access_ok(struct vhost_dev *d, struct vhost_umem *umem,
 			ok = vq_memory_access_ok(d->vqs[i]->log_base,
 						 umem, log);
 		else
-			ok = 1;
+			ok = true;
 		mutex_unlock(&d->vqs[i]->mutex);
 
 		if (!ok)
-			return 0;
+			return false;
 	}
-	return 1;
+	return true;
 }
 
 static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
@@ -959,21 +959,21 @@ static void vhost_iotlb_notify_vq(struct vhost_dev *d,
 	spin_unlock(&d->iotlb_lock);
 }
 
-static int umem_access_ok(u64 uaddr, u64 size, int access)
+static
[PATCH v2 1/2] vhost: fix vhost_vq_access_ok() log check
Commit d65026c6c62e7d9616c8ceb5a53b68bcdc050525 ("vhost: validate log
when IOTLB is enabled") introduced a regression.  The logic was
originally:

  if (vq->iotlb)
      return 1;
  return A && B;

After the patch the short-circuit logic for A was inverted:

  if (A || vq->iotlb)
      return A;
  return B;

This patch fixes the regression by rewriting the checks in the obvious
way, no longer returning A when vq->iotlb is non-NULL (which is hard to
understand).

Reported-by: syzbot+65a84dde0214b0387...@syzkaller.appspotmail.com
Cc: Jason Wang
Signed-off-by: Stefan Hajnoczi
---
 drivers/vhost/vhost.c | 8 +---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 5320039671b7..93fd0c75b0d8 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1244,10 +1244,12 @@ static int vq_log_access_ok(struct vhost_virtqueue *vq,
 /* Caller should have vq mutex and device mutex */
 int vhost_vq_access_ok(struct vhost_virtqueue *vq)
 {
-	int ret = vq_log_access_ok(vq, vq->log_base);
+	if (!vq_log_access_ok(vq, vq->log_base))
+		return 0;
 
-	if (ret || vq->iotlb)
-		return ret;
+	/* Access validation occurs at prefetch time with IOTLB */
+	if (vq->iotlb)
+		return 1;
 
 	return vq_access_ok(vq, vq->num, vq->desc, vq->avail, vq->used);
 }
-- 
2.14.3
Re: [RFC] vhost: introduce mdev based hardware vhost backend
On Tue, Apr 10, 2018 at 10:52:52AM +0800, Jason Wang wrote:
> On 2018年04月02日 23:23, Tiwei Bie wrote:
> > This patch introduces a mdev (mediated device) based hardware
> > vhost backend. This backend is an abstraction of the various
> > hardware vhost accelerators (potentially any device that uses
> > virtio ring can be used as a vhost accelerator). Some generic
> > mdev parent ops are provided for accelerator drivers to support
> > generating mdev instances.
> >
> > What's this
> > ===========
> >
> > The idea is that we can setup a virtio ring compatible device
> > with the messages available at the vhost-backend. Originally,
> > these messages are used to implement a software vhost backend,
> > but now we will use these messages to setup a virtio ring
> > compatible hardware device. Then the hardware device will be
> > able to work with the guest virtio driver in the VM just like
> > what the software backend does. That is to say, we can implement
> > a hardware based vhost backend in QEMU, and any virtio ring
> > compatible devices potentially can be used with this backend.
> > (We also call it vDPA -- vhost Data Path Acceleration).
> >
> > One problem is that, different virtio ring compatible devices
> > may have different device interfaces. That is to say, we will
> > need different drivers in QEMU. It could be troublesome. And
> > that's what this patch is trying to fix. The idea behind this
> > patch is very simple: mdev is a standard way to emulate devices
> > in the kernel.
>
> So you just move the abstraction layer from qemu to kernel, and you
> still need different drivers in kernel for different device interfaces
> of accelerators. This looks even more complex than leaving it in qemu.
> As you said, another idea is to implement userspace vhost backend for
> accelerators which seems easier and could co-work with other parts of
> qemu without inventing new type of messages.

I'm not quite sure. Do you think it's acceptable to add various vendor
specific hardware drivers in QEMU?

> Need careful thought here to seek a best solution here.

Yeah, definitely! :) And your opinions would be very helpful!

> > So we defined a standard device based on mdev, which
> > is able to accept vhost messages. When the mdev emulation code
> > (i.e. the generic mdev parent ops provided by this patch) gets
> > vhost messages, it will parse and deliver them to accelerator
> > drivers. Drivers can use these messages to setup accelerators.
> >
> > That is to say, the generic mdev parent ops (e.g. read()/write()/
> > ioctl()/...) will be provided for accelerator drivers to register
> > accelerators as mdev parent devices. And each accelerator device
> > will support generating standard mdev instance(s).
> >
> > With this standard device interface, we will be able to just
> > develop one userspace driver to implement the hardware based
> > vhost backend in QEMU.
> >
> > Difference between vDPA and PCI passthru
> > ========================================
> >
> > The key difference between vDPA and PCI passthru is that, in
> > vDPA only the data path of the device (e.g. DMA ring, notify
> > region and queue interrupt) is passed through to the VM, the
> > device control path (e.g. PCI configuration space and MMIO
> > regions) is still defined and emulated by QEMU.
> >
> > The benefits of keeping virtio device emulation in QEMU compared
> > with virtio device PCI passthru include (but are not limited to):
> >
> > - consistent device interface for guest OS in the VM;
> > - max flexibility on the hardware design, especially the
> >   accelerator for each vhost backend doesn't have to be a
> >   full PCI device;
> > - leveraging the existing virtio live-migration framework;
> >
> > The interface of this mdev based device
> > =======================================
> >
> > 1. BAR0
> >
> > The MMIO region described by BAR0 is the main control
> > interface. Messages will be written to or read from
> > this region.
> >
> > The message type is determined by the `request` field
> > in the message header. The message size is encoded in the
> > message header too. The message format looks like this:
> >
> > struct vhost_vfio_op {
> > 	__u64 request;
> > 	__u32 flags;
> > 	/* Flag values: */
> > #define VHOST_VFIO_NEED_REPLY 0x1 /* Whether need reply */
> > 	__u32 size;
> > 	union {
> > 		__u64 u64;
> > 		struct vhost_vring_state state;
> > 		struct vhost_vring_addr addr;
> > 		struct vhost_memory memory;
> > 	} payload;
> > };
> >
> > The existing vhost-kernel ioctl cmds are reused as
> > the message requests in above structure.
> >
> > Each message will be written to or read from this
> > region at offset 0:
> >
> > int vhost_vfio_write(struct vhost_dev *dev, struct vhost_vfio_op *op)
> > {
> > 	int count = VHOST_VFIO_OP_HDR_SIZE + op->size;
> > 	struct vhost_vfio *vfio = dev->opaque;
> > 	int ret;
> >
> > 	ret = pwrite64(vfio->device_fd, op, count,
Re: [PATCH v5 0/6] enable creating [k,u]probe with perf_event_open
On 4/9/18 9:45 PM, Ravi Bangoria wrote:
> Hi Song,
>
> On 12/07/2017 04:15 AM, Song Liu wrote:
>> With current kernel, user space tools can only create/destroy [k,u]probes
>> with a text-based API (kprobe_events and uprobe_events in tracefs). This
>> approach relies on user space to clean up the [k,u]probe after using them.
>> However, this is not easy for user space to clean up properly.
>>
>> To solve this problem, we introduce a file descriptor based API.
>> Specifically, we extended perf_event_open to create [k,u]probe, and attach
>> this [k,u]probe to the file descriptor created by perf_event_open. These
>> [k,u]probe are associated with this file descriptor, so they are not
>> available in tracefs.
>
> Sorry for being late. One simple question..
>
> Will it be good to support k/uprobe arguments with perf_event_open()?
> Do you have any plans about that?

No plans for that.  People that use text-based interfaces should probably
be using text interfaces consistently.  IMO mixing the FD-based kprobe API
with text is not worth the complexity.
Re: [PATCH bpf-next v8 05/11] seccomp,landlock: Enforce Landlock programs per process hierarchy
On Mon, Apr 09, 2018 at 12:01:59AM +0200, Mickaël Salaün wrote: > > On 04/08/2018 11:06 PM, Andy Lutomirski wrote: > > On Sun, Apr 8, 2018 at 6:13 AM, Mickaël Salaünwrote: > >> > >> On 02/27/2018 10:48 PM, Mickaël Salaün wrote: > >>> > >>> On 27/02/2018 17:39, Andy Lutomirski wrote: > On Tue, Feb 27, 2018 at 5:32 AM, Alexei Starovoitov > wrote: > > On Tue, Feb 27, 2018 at 05:20:55AM +, Andy Lutomirski wrote: > >> On Tue, Feb 27, 2018 at 4:54 AM, Alexei Starovoitov > >> wrote: > >>> On Tue, Feb 27, 2018 at 04:40:34AM +, Andy Lutomirski wrote: > On Tue, Feb 27, 2018 at 2:08 AM, Alexei Starovoitov > wrote: > > On Tue, Feb 27, 2018 at 01:41:15AM +0100, Mickaël Salaün wrote: > >> The seccomp(2) syscall can be used by a task to apply a Landlock > >> program > >> to itself. As a seccomp filter, a Landlock program is enforced for > >> the > >> current task and all its future children. A program is immutable > >> and a > >> task can only add new restricting programs to itself, forming a > >> list of > >> programss. > >> > >> A Landlock program is tied to a Landlock hook. If the action on a > >> kernel > >> object is allowed by the other Linux security mechanisms (e.g. DAC, > >> capabilities, other LSM), then a Landlock hook related to this > >> kind of > >> object is triggered. The list of programs for this hook is then > >> evaluated. Each program return a 32-bit value which can deny the > >> action > >> on a kernel object with a non-zero value. If every programs of the > >> list > >> return zero, then the action on the object is allowed. > >> > >> Multiple Landlock programs can be chained to share a 64-bits value > >> for a > >> call chain (e.g. evaluating multiple elements of a file path). > >> This > >> chaining is restricted when a process construct this chain by > >> loading a > >> program, but additional checks are performed when it requests to > >> apply > >> this chain of programs to itself. 
The restrictions ensure that it > >> is > >> not possible to call multiple programs in a way that would imply to > >> handle multiple shared values (i.e. cookies) for one chain. For > >> now, > >> only a fs_pick program can be chained to the same type of program, > >> because it may make sense if they have different triggers (cf. next > >> commits). This restrictions still allows to reuse Landlock > >> programs in > >> a safe way (e.g. use the same loaded fs_walk program with multiple > >> chains of fs_pick programs). > >> > >> Signed-off-by: Mickaël Salaün > > > > ... > > > >> +struct landlock_prog_set *landlock_prepend_prog( > >> + struct landlock_prog_set *current_prog_set, > >> + struct bpf_prog *prog) > >> +{ > >> + struct landlock_prog_set *new_prog_set = current_prog_set; > >> + unsigned long pages; > >> + int err; > >> + size_t i; > >> + struct landlock_prog_set tmp_prog_set = {}; > >> + > >> + if (prog->type != BPF_PROG_TYPE_LANDLOCK_HOOK) > >> + return ERR_PTR(-EINVAL); > >> + > >> + /* validate memory size allocation */ > >> + pages = prog->pages; > >> + if (current_prog_set) { > >> + size_t i; > >> + > >> + for (i = 0; i < > >> ARRAY_SIZE(current_prog_set->programs); i++) { > >> + struct landlock_prog_list *walker_p; > >> + > >> + for (walker_p = > >> current_prog_set->programs[i]; > >> + walker_p; walker_p = > >> walker_p->prev) > >> + pages += walker_p->prog->pages; > >> + } > >> + /* count a struct landlock_prog_set if we need to > >> allocate one */ > >> + if (refcount_read(_prog_set->usage) != 1) > >> + pages += round_up(sizeof(*current_prog_set), > >> PAGE_SIZE) > >> + / PAGE_SIZE; > >> + } > >> + if (pages > LANDLOCK_PROGRAMS_MAX_PAGES) > >> + return ERR_PTR(-E2BIG); > >> + > >> + /* ensure early that we can allocate enough memory for the > >> new > >> + * prog_lists */ > >> + err =
Re: [PATCH v5 0/6] enable creating [k,u]probe with perf_event_open
Hi Song,

On 12/07/2017 04:15 AM, Song Liu wrote:
> With current kernel, user space tools can only create/destroy [k,u]probes
> with a text-based API (kprobe_events and uprobe_events in tracefs). This
> approach relies on user space to clean up the [k,u]probe after using them.
> However, this is not easy for user space to clean up properly.
>
> To solve this problem, we introduce a file descriptor based API.
> Specifically, we extended perf_event_open to create [k,u]probe, and attach
> this [k,u]probe to the file descriptor created by perf_event_open. These
> [k,u]probe are associated with this file descriptor, so they are not
> available in tracefs.

Sorry for being late. One simple question..

Will it be good to support k/uprobe arguments with perf_event_open()?
Do you have any plans about that?

Thanks,
Ravi
Re: [PATCH v15 ] net/veth/XDP: Line-rate packet forwarding in kernel
Gotcha. I'm working on it. I've created a function that creates an sk_buff
from an xdp_buff, but I'm still getting an error while the sk_buff is being
processed by TCP. I will send you the patch once I'm done. Thanks!

On Thu, Apr 5, 2018 at 10:55 PM, David Ahern wrote:
> On 4/3/18 9:15 PM, Md. Islam wrote:
>>> Have you looked at what I would consider a more interesting use case of
>>> packets into a node and delivered to a namespace via veth?
>>>
>>>    +------+---
>>>    | Host | container
>>>    |      |
>>>    |  +---{ veth1 }-|-{veth2}
>>>    |  |   |
>>>    +{ eth1 }--
>>>
>>> Can xdp / bpf on eth1 be used to speed up delivery to the container?
>>
>> I didn't consider that, but it sounds like an important use case. How
>> do we determine which namespace gets the packet?
>>
>
> FIB lookups of course. Starting with my patch set that handles
> forwarding on eth1, what is needed for XDP with veth? ie., a program on
> eth1 does the lookup and redirects the packet to veth1 for Tx.
> ndo_xdp_xmit for veth knows the packet needs to be forwarded to veth2
> internally and there is no skb allocated for the packet yet.

-- 
Tamim
PhD Candidate, Kent State University
http://web.cs.kent.edu/~mislam4/
RE: [PATCH net 3/3] lan78xx: Lan7801 Support for Fixed PHY
> Ah, cool. I was thinking you were going to say an SFP cage.
>
> What switch is it? Does it have a DSA driver?

We have a 3-port switch, KSZ9893, yet to be released, which is similar to
the KSZ9477/KSZ9897, which has a DSA driver. Most of the time a 3-port
switch is used with the LAN7801 to extend the ports.

Thanks,
-Raghu
Re: [RFC v2] virtio: support packed ring
On Tue, Apr 10, 2018 at 10:55:25AM +0800, Jason Wang wrote:
> On 2018年04月01日 22:12, Tiwei Bie wrote:
>> Hello everyone,
>>
>> This RFC implements packed ring support for virtio driver.
>>
>> The code was tested with DPDK vhost (testpmd/vhost-PMD) implemented
>> by Jens at http://dpdk.org/ml/archives/dev/2018-January/089417.html
>> Minor changes are needed for the vhost code, e.g. to kick the guest.
>>
>> TODO:
>> - Refinements and bug fixes;
>> - Split into small patches;
>> - Test indirect descriptor support;
>> - Test/fix event suppression support;
>> - Test devices other than net;
>>
>> RFC v1 -> RFC v2:
>> - Add indirect descriptor support - compile test only;
>> - Add event suppression support - compile test only;
>> - Move vring_packed_init() out of uapi (Jason, MST);
>> - Merge two loops into one in virtqueue_add_packed() (Jason);
>> - Split vring_unmap_one() for packed ring and split ring (Jason);
>> - Avoid using '%' operator (Jason);
>> - Rename free_head -> next_avail_idx (Jason);
>> - Add comments for virtio_wmb() in virtqueue_add_packed() (Jason);
>> - Some other refinements and bug fixes;
>>
>> Thanks!
>
> Will try to review this later.
>
> But it would be better if you can split it (more than 1000 lines is too
> big to be reviewed easily). E.g. you can at least split it into three
> patches: new structures, datapath, and event suppression.

No problem! It's on my TODO list. I'll get it done in the next version.
Thanks!
Re: [PATCH net-next 1/5] virtio: Add support for SCTP checksum offloading
On 2018年04月02日 21:40, Vladislav Yasevich wrote: To support SCTP checksum offloading, we need to add a new feature to virtio_net, so we can negotiate support between the hypervisor and the guest. The signalling to the guest that an alternate checksum needs to be used is done via a new flag in the virtio_net_hdr. If the flag is set, the host will know to perform an alternate checksum calculation, which right now is only CRC32c. Signed-off-by: Vladislav Yasevich--- drivers/net/virtio_net.c| 11 --- include/linux/virtio_net.h | 6 ++ include/uapi/linux/virtio_net.h | 2 ++ 3 files changed, 16 insertions(+), 3 deletions(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index 7b187ec..b601294 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -2724,9 +2724,14 @@ static int virtnet_probe(struct virtio_device *vdev) /* Do we support "hardware" checksums? */ if (virtio_has_feature(vdev, VIRTIO_NET_F_CSUM)) { /* This opens up the world of extra features. */ - dev->hw_features |= NETIF_F_HW_CSUM | NETIF_F_SG; + netdev_features_t sctp = 0; + + if (virtio_has_feature(vdev, VIRTIO_NET_F_SCTP_CSUM)) + sctp |= NETIF_F_SCTP_CRC; + + dev->hw_features |= NETIF_F_HW_CSUM | NETIF_F_SG | sctp; if (csum) - dev->features |= NETIF_F_HW_CSUM | NETIF_F_SG; + dev->features |= NETIF_F_HW_CSUM | NETIF_F_SG | sctp; if (virtio_has_feature(vdev, VIRTIO_NET_F_GSO)) { dev->hw_features |= NETIF_F_TSO @@ -2952,7 +2957,7 @@ static struct virtio_device_id id_table[] = { }; #define VIRTNET_FEATURES \ - VIRTIO_NET_F_CSUM, VIRTIO_NET_F_GUEST_CSUM, \ + VIRTIO_NET_F_CSUM, VIRTIO_NET_F_GUEST_CSUM, VIRTIO_NET_F_SCTP_CSUM, \ It looks to me _F_SCTP_CSUM implies the ability of both device and driver. Do we still need the flexibility like csum to differ guest/host ability like e.g _F_GUEST_SCTP_CSUM? 
Thanks VIRTIO_NET_F_MAC, \ VIRTIO_NET_F_HOST_TSO4, VIRTIO_NET_F_HOST_UFO, VIRTIO_NET_F_HOST_TSO6, \ VIRTIO_NET_F_HOST_ECN, VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6, \ diff --git a/include/linux/virtio_net.h b/include/linux/virtio_net.h index f144216..2e7a64a 100644 --- a/include/linux/virtio_net.h +++ b/include/linux/virtio_net.h @@ -39,6 +39,9 @@ static inline int virtio_net_hdr_to_skb(struct sk_buff *skb, if (!skb_partial_csum_set(skb, start, off)) return -EINVAL; + + if (hdr->flags & VIRTIO_NET_HDR_F_CSUM_NOT_INET) + skb->csum_not_inet = 1; } if (hdr->gso_type != VIRTIO_NET_HDR_GSO_NONE) { @@ -96,6 +99,9 @@ static inline int virtio_net_hdr_from_skb(const struct sk_buff *skb, hdr->flags = VIRTIO_NET_HDR_F_DATA_VALID; } /* else everything is zero */ + if (skb->csum_not_inet) + hdr->flags &= VIRTIO_NET_HDR_F_CSUM_NOT_INET; + return 0; } diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h index 5de6ed3..3f279c8 100644 --- a/include/uapi/linux/virtio_net.h +++ b/include/uapi/linux/virtio_net.h @@ -36,6 +36,7 @@ #define VIRTIO_NET_F_GUEST_CSUM 1 /* Guest handles pkts w/ partial csum */ #define VIRTIO_NET_F_CTRL_GUEST_OFFLOADS 2 /* Dynamic offload configuration. */ #define VIRTIO_NET_F_MTU 3 /* Initial MTU advice */ +#define VIRTIO_NET_F_SCTP_CSUM 4 /* SCTP checksum offload support */ #define VIRTIO_NET_F_MAC 5 /* Host has given MAC address. */ #define VIRTIO_NET_F_GUEST_TSO4 7 /* Guest can handle TSOv4 in. */ #define VIRTIO_NET_F_GUEST_TSO6 8 /* Guest can handle TSOv6 in. */ @@ -101,6 +102,7 @@ struct virtio_net_config { struct virtio_net_hdr_v1 { #define VIRTIO_NET_HDR_F_NEEDS_CSUM 1 /* Use csum_start, csum_offset */ #define VIRTIO_NET_HDR_F_DATA_VALID 2 /* Csum is valid */ +#define VIRTIO_NET_HDR_F_CSUM_NOT_INET 4 /* Checksum is not inet */ __u8 flags; #define VIRTIO_NET_HDR_GSO_NONE 0 /* Not a GSO frame */ #define VIRTIO_NET_HDR_GSO_TCPV4 1 /* GSO frame, IPv4 TCP (TSO) */
Re: [Resend Patch 1/3] Vmbus: Add function to report available ring buffer to write in total ring size percentage
Long,

> I hope this patch set goes through SCSI, because its purpose is to
> improve storvsc.
>
> If this strategy is not possible, I can resubmit the first two patches
> to net, and the 3rd patch to scsi after the first two are merged.

Applied to my staging tree for 4.18/scsi-queue. Thanks!

-- 
Martin K. Petersen
Oracle Linux Engineering
Re: [RFC v2] virtio: support packed ring
On 2018年04月01日 22:12, Tiwei Bie wrote: Hello everyone, This RFC implements packed ring support for virtio driver. The code was tested with DPDK vhost (testpmd/vhost-PMD) implemented by Jens at http://dpdk.org/ml/archives/dev/2018-January/089417.html Minor changes are needed for the vhost code, e.g. to kick the guest. TODO: - Refinements and bug fixes; - Split into small patches; - Test indirect descriptor support; - Test/fix event suppression support; - Test devices other than net; RFC v1 -> RFC v2: - Add indirect descriptor support - compile test only; - Add event suppression supprt - compile test only; - Move vring_packed_init() out of uapi (Jason, MST); - Merge two loops into one in virtqueue_add_packed() (Jason); - Split vring_unmap_one() for packed ring and split ring (Jason); - Avoid using '%' operator (Jason); - Rename free_head -> next_avail_idx (Jason); - Add comments for virtio_wmb() in virtqueue_add_packed() (Jason); - Some other refinements and bug fixes; Thanks! Will try to review this later. But it would be better if you can split it (more than 1000 lines is too big to be reviewed easily). E.g you can at least split it into three patches, new structures, datapath, and event suppression. Thanks Signed-off-by: Tiwei Bie--- drivers/virtio/virtio_ring.c | 1094 +--- include/linux/virtio_ring.h|8 +- include/uapi/linux/virtio_config.h | 12 +- include/uapi/linux/virtio_ring.h | 61 ++ 4 files changed, 980 insertions(+), 195 deletions(-) diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c index 71458f493cf8..0515dca34d77 100644 --- a/drivers/virtio/virtio_ring.c +++ b/drivers/virtio/virtio_ring.c @@ -58,14 +58,15 @@ struct vring_desc_state { void *data; /* Data for callback. */ - struct vring_desc *indir_desc; /* Indirect descriptor, if any. */ + void *indir_desc; /* Indirect descriptor, if any. */ + int num;/* Descriptor list length. 
*/ }; struct vring_virtqueue { struct virtqueue vq; - /* Actual memory layout for this queue */ - struct vring vring; + /* Is this a packed ring? */ + bool packed; /* Can we use weak barriers? */ bool weak_barriers; @@ -79,19 +80,45 @@ struct vring_virtqueue { /* Host publishes avail event idx */ bool event; - /* Head of free buffer list. */ - unsigned int free_head; /* Number we've added since last sync. */ unsigned int num_added; /* Last used index we've seen. */ u16 last_used_idx; - /* Last written value to avail->flags */ - u16 avail_flags_shadow; + union { + /* Available for split ring */ + struct { + /* Actual memory layout for this queue. */ + struct vring vring; - /* Last written value to avail->idx in guest byte order */ - u16 avail_idx_shadow; + /* Head of free buffer list. */ + unsigned int free_head; + + /* Last written value to avail->flags */ + u16 avail_flags_shadow; + + /* Last written value to avail->idx in +* guest byte order. */ + u16 avail_idx_shadow; + }; + + /* Available for packed ring */ + struct { + /* Actual memory layout for this queue. */ + struct vring_packed vring_packed; + + /* Driver ring wrap counter. */ + u8 wrap_counter; + + /* Index of the next avail descriptor. */ + unsigned int next_avail_idx; + + /* Last written value to driver->flags in +* guest byte order. */ + u16 event_flags_shadow; + }; + }; /* How to notify other side. FIXME: commonalize hcalls! 
*/ bool (*notify)(struct virtqueue *vq); @@ -201,8 +228,33 @@ static dma_addr_t vring_map_single(const struct vring_virtqueue *vq, cpu_addr, size, direction); } -static void vring_unmap_one(const struct vring_virtqueue *vq, - struct vring_desc *desc) +static void vring_unmap_one_split(const struct vring_virtqueue *vq, + struct vring_desc *desc) +{ + u16 flags; + + if (!vring_use_dma_api(vq->vq.vdev)) + return; + + flags = virtio16_to_cpu(vq->vq.vdev, desc->flags); + + if (flags & VRING_DESC_F_INDIRECT) { + dma_unmap_single(vring_dma_dev(vq), +virtio64_to_cpu(vq->vq.vdev, desc->addr), +virtio32_to_cpu(vq->vq.vdev, desc->len), +
Re: [RFC] vhost: introduce mdev based hardware vhost backend
On 2018年04月02日 23:23, Tiwei Bie wrote: This patch introduces a mdev (mediated device) based hardware vhost backend. This backend is an abstraction of the various hardware vhost accelerators (potentially any device that uses virtio ring can be used as a vhost accelerator). Some generic mdev parent ops are provided for accelerator drivers to support generating mdev instances. What's this === The idea is that we can setup a virtio ring compatible device with the messages available at the vhost-backend. Originally, these messages are used to implement a software vhost backend, but now we will use these messages to setup a virtio ring compatible hardware device. Then the hardware device will be able to work with the guest virtio driver in the VM just like what the software backend does. That is to say, we can implement a hardware based vhost backend in QEMU, and any virtio ring compatible devices potentially can be used with this backend. (We also call it vDPA -- vhost Data Path Acceleration). One problem is that, different virtio ring compatible devices may have different device interfaces. That is to say, we will need different drivers in QEMU. It could be troublesome. And that's what this patch trying to fix. The idea behind this patch is very simple: mdev is a standard way to emulate device in kernel. So you just move the abstraction layer from qemu to kernel, and you still need different drivers in kernel for different device interfaces of accelerators. This looks even more complex than leaving it in qemu. As you said, another idea is to implement userspace vhost backend for accelerators which seems easier and could co-work with other parts of qemu without inventing new type of messages. Need careful thought here to seek a best solution here. So we defined a standard device based on mdev, which is able to accept vhost messages. When the mdev emulation code (i.e. 
the generic mdev parent ops provided by this patch) gets vhost messages, it will parse and deliver them to accelerator drivers. Drivers can use these messages to set up accelerators. That is to say, the generic mdev parent ops (e.g. read()/write()/ ioctl()/...) will be provided for accelerator drivers to register accelerators as mdev parent devices. And each accelerator device will support generating standard mdev instance(s). With this standard device interface, we will be able to develop just one userspace driver to implement the hardware based vhost backend in QEMU. Difference between vDPA and PCI passthru === The key difference between vDPA and PCI passthru is that in vDPA only the data path of the device (e.g. DMA ring, notify region and queue interrupt) is passed through to the VM, while the device control path (e.g. PCI configuration space and MMIO regions) is still defined and emulated by QEMU. The benefits of keeping virtio device emulation in QEMU compared with virtio device PCI passthru include (but are not limited to): - a consistent device interface for the guest OS in the VM; - maximum flexibility in the hardware design, especially since the accelerator for each vhost backend doesn't have to be a full PCI device; - leveraging the existing virtio live-migration framework; The interface of this mdev based device === 1. BAR0 The MMIO region described by BAR0 is the main control interface. Messages will be written to or read from this region. The message type is determined by the `request` field in the message header. The message size is encoded in the message header too. The message format looks like this: struct vhost_vfio_op { __u64 request; __u32 flags; /* Flag values: */ #define VHOST_VFIO_NEED_REPLY 0x1 /* Whether a reply is needed */ __u32 size; union { __u64 u64; struct vhost_vring_state state; struct vhost_vring_addr addr; struct vhost_memory memory; } payload; }; The existing vhost-kernel ioctl cmds are reused as the message requests in the above structure. 
Each message will be written to or read from this region at offset 0: int vhost_vfio_write(struct vhost_dev *dev, struct vhost_vfio_op *op) { int count = VHOST_VFIO_OP_HDR_SIZE + op->size; struct vhost_vfio *vfio = dev->opaque; int ret; ret = pwrite64(vfio->device_fd, op, count, vfio->bar0_offset); if (ret != count) return -1; return 0; } int vhost_vfio_read(struct vhost_dev *dev, struct vhost_vfio_op *op) { int count = VHOST_VFIO_OP_HDR_SIZE + op->size; struct vhost_vfio *vfio = dev->opaque; uint64_t request = op->request; int ret; ret = pread64(vfio->device_fd, op, count, vfio->bar0_offset); if (ret != count || request != op->request) return -1; return 0; } It's quite straightforward to set things to the device. Just need to write the message to
Re: [PATCH net 3/3] lan78xx: Lan7801 Support for Fixed PHY
On Tue, Apr 10, 2018 at 02:23:23AM +, raghuramchary.jallipa...@microchip.com wrote: > > > > What do you expect is connected to the MAC if there is no PHY? > > > Hi Andrew, > We connect the Ethernet switch to this MAC. Ah, cool. I was thinking you were going to say an SFP cage. What switch is it? Does it have a DSA driver? Andrew
Re: [PATCH 2/2] alx: add disable_wol parameter
The problem is I don't have a machine with that wakeup issue, and I need the WoL feature. Instead of spreading an "alx with WoL" dkms package everywhere, I would like to see it supported in the driver and disabled by default. Moreover, the wakeup issue may come from old Atheros chips, or result from a buggy BIOS. With WoL removed from the driver, no one will report issues about it, and we have no chance of finding a fix for it. Adding this feature back is not papering over the issue; it gives people a chance to examine the feature. 2018-04-09 22:50 GMT+08:00 David Miller: > From: Andrew Lunn > Date: Mon, 9 Apr 2018 14:39:10 +0200 > >> On Mon, Apr 09, 2018 at 07:35:14PM +0800, AceLan Kao wrote: >>> The WoL feature was reported broken and will lead to >>> the system resume immediately after suspending. >>> This symptom is not happening on every system, so adding >>> disable_wol option and disable WoL by default to prevent the issue from >>> happening again. >> >>> const char alx_drv_name[] = "alx"; >>> >>> +/* disable WoL by default */ >>> +bool disable_wol = 1; >>> +module_param(disable_wol, bool, 0); >>> +MODULE_PARM_DESC(disable_wol, "Disable Wake on Lan feature"); >>> + >> >> Hi AceLan >> >> This seems like you are papering over the cracks. And module >> parameters are not liked. >> >> Please try to find the real problem. > > Agreed.
RE: [PATCH net 1/3] lan78xx: PHY DSP registers initialization to address EEE link drop issues with long cables
> > > > Hi Raghuram > > > > You might want to look at phy_read_paged(), phy_write_paged(), etc. > > > > There can be race conditions with paged access. > > Yep, so something like: > > static void lan88xx_TR_reg_set(struct phy_device *phydev, u16 regaddr, > u32 data) > { > int save_page, val; > u16 buf; > > save_page = phy_save_page(phydev); > phy_write_paged(phydev, LAN88XX_EXT_PAGE_ACCESS_TR, > LAN88XX_EXT_PAGE_TR_LOW_DATA, (data & > 0x)); > phy_write_paged(phydev, LAN88XX_EXT_PAGE_ACCESS_TR, > LAN88XX_EXT_PAGE_TR_HIGH_DATA, > (data & 0x00FF) >> 16); > > /* Config control bits [15:13] of register */ > buf = (regaddr & ~(0x3 << 13));/* Clr [14:13] to write data in reg */ > buf |= 0x8000; /* Set [15] to Packet transmit */ > > phy_write_paged(phydev, LAN88XX_EXT_PAGE_ACCESS_TR, > LAN88XX_EXT_PAGE_TR_CR, buf); > usleep_range(1000, 2000);/* Wait for Data to be written */ > > val = phy_read_paged(phydev, LAN88XX_EXT_PAGE_ACCESS_TR, >LAN88XX_EXT_PAGE_TR_CR); > if (!(val & 0x8000)) > pr_warn("TR Register[0x%X] configuration failed\n", > regaddr); > > phy_restore_page(phydev, save_page, 0); } > > Since PHY accesses and thus things like phy_save_page() can fail, the return > type of this function should be changed to 'int' and some error checking > should be added. Thanks David/Andrew. Will take care of it. Thanks, Raghu
RE: [PATCH net 2/3] lan78xx: Add support to dump lan78xx registers
> If there is no PHY attached, you probably should not include PHY_REG_SIZE > here. > Sure, will address it. Thanks, Raghu
RE: [PATCH net 3/3] lan78xx: Lan7801 Support for Fixed PHY
> > What do you expect is connected to the MAC if there is no PHY? > Hi Andrew, We connect the Ethernet switch to this MAC. The Ethernet switch port connected to the MAC does not have a PHY. In this case, we need to load the MAC driver and set the link speed/duplex. Thanks, Raghu
Re: [RFC bpf-next] bpf: document eBPF helpers and add a script to generate man page
On Mon, Apr 09, 2018 at 02:25:26PM +0100, Quentin Monnet wrote: > > Anyway, I am fine with keeping just signatures, descriptions and return > values for now. I will submit a new version with only those items. Thank you. Could you also split it into a few patches? include/uapi/linux/bpf.h | 2237 scripts/bpf_helpers_doc.py | 568 +++ 2 files changed, 2429 insertions(+), 376 deletions(-) Replying back and forth on a single patch of such size will be tedious for others to follow. Maybe document ~10 helpers at a time? A total of ~7 patches, plus an extra patch for the .py?
[PATCH v2] net: decnet: Replace GFP_ATOMIC with GFP_KERNEL in dn_route_init
dn_route_init() is never called in atomic context. The call chain ending up at dn_route_init() is: [1] dn_route_init() <- decnet_init() decnet_init() is only set as a parameter of module_init(). Despite never getting called from atomic context, dn_route_init() calls __get_free_pages() with GFP_ATOMIC, which does not sleep for allocation. GFP_ATOMIC is not necessary here and can be replaced with GFP_KERNEL, which can sleep and improves the chances of a successful allocation. This was found by a static analysis tool named DCNS, written by myself, and I have also checked it manually. Signed-off-by: Jia-Ju Bai--- v2: * Modify the description of GFP_ATOMIC in v1. Thanks to Eric for the good advice. --- net/decnet/dn_route.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/decnet/dn_route.c b/net/decnet/dn_route.c index 0bd3afd..59ed12a 100644 --- a/net/decnet/dn_route.c +++ b/net/decnet/dn_route.c @@ -1898,7 +1898,7 @@ void __init dn_route_init(void) while(dn_rt_hash_mask & (dn_rt_hash_mask - 1)) dn_rt_hash_mask--; dn_rt_hash_table = (struct dn_rt_hash_bucket *) - __get_free_pages(GFP_ATOMIC, order); + __get_free_pages(GFP_KERNEL, order); } while (dn_rt_hash_table == NULL && --order > 0); if (!dn_rt_hash_table) -- 1.9.1
[PATCH v2] net: tipc: Replace GFP_ATOMIC with GFP_KERNEL in tipc_mon_create
tipc_mon_create() is never called in atomic context. The call chain ending up at tipc_mon_create() is: [1] tipc_mon_create() <- tipc_enable_bearer() <- tipc_nl_bearer_enable() tipc_nl_bearer_enable() calls rtnl_lock(), which indicates this function is not called in atomic context. Despite never getting called from atomic context, tipc_mon_create() calls kzalloc() with GFP_ATOMIC, which does not sleep for allocation. GFP_ATOMIC is not necessary here and can be replaced with GFP_KERNEL, which can sleep and improves the chances of a successful allocation. This was found by a static analysis tool named DCNS, written by myself, and I have also checked it manually. Signed-off-by: Jia-Ju Bai--- v2: * Modify the description of GFP_ATOMIC in v1. Thanks to Eric for the good advice. --- net/tipc/monitor.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/net/tipc/monitor.c b/net/tipc/monitor.c index 9e109bb..9714d80 100644 --- a/net/tipc/monitor.c +++ b/net/tipc/monitor.c @@ -604,9 +604,9 @@ int tipc_mon_create(struct net *net, int bearer_id) if (tn->monitors[bearer_id]) return 0; - mon = kzalloc(sizeof(*mon), GFP_ATOMIC); - self = kzalloc(sizeof(*self), GFP_ATOMIC); - dom = kzalloc(sizeof(*dom), GFP_ATOMIC); + mon = kzalloc(sizeof(*mon), GFP_KERNEL); + self = kzalloc(sizeof(*self), GFP_KERNEL); + dom = kzalloc(sizeof(*dom), GFP_KERNEL); if (!mon || !self || !dom) { kfree(mon); kfree(self); -- 1.9.1
[PATCH v2] net: ncsi: Replace GFP_ATOMIC with GFP_KERNEL in ncsi_register_dev
ncsi_register_dev() is never called in atomic context. This function is only called by ftgmac100_probe() in drivers/net/ethernet/faraday/ftgmac100.c. And ftgmac100_probe() is only set as ".probe" in "struct platform_driver". Despite never getting called from atomic context, ncsi_register_dev() calls kzalloc() with GFP_ATOMIC, which does not sleep for allocation. GFP_ATOMIC is not necessary here and can be replaced with GFP_KERNEL, which can sleep and improves the chances of a successful allocation. This was found by a static analysis tool named DCNS, written by myself, and I have also checked it manually. Signed-off-by: Jia-Ju Bai--- v2: * Modify the description of GFP_ATOMIC in v1. Thanks to Eric for the good advice. --- net/ncsi/ncsi-manage.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/ncsi/ncsi-manage.c b/net/ncsi/ncsi-manage.c index 3fd3c39..6b5b5a0 100644 --- a/net/ncsi/ncsi-manage.c +++ b/net/ncsi/ncsi-manage.c @@ -1508,7 +1508,7 @@ struct ncsi_dev *ncsi_register_dev(struct net_device *dev, return nd; /* Create NCSI device */ - ndp = kzalloc(sizeof(*ndp), GFP_ATOMIC); + ndp = kzalloc(sizeof(*ndp), GFP_KERNEL); if (!ndp) return NULL; -- 1.9.1
[PATCH v2] net: dccp: Replace GFP_ATOMIC with GFP_KERNEL in dccp_init
dccp_init() is never called in atomic context. This function is only set as a parameter of module_init(). Despite never getting called from atomic context, dccp_init() calls __get_free_pages() with GFP_ATOMIC, which does not sleep for allocation. GFP_ATOMIC is not necessary here and can be replaced with GFP_KERNEL, which can sleep and improves the chances of a successful allocation. This was found by a static analysis tool named DCNS, written by myself, and I have also checked it manually. Signed-off-by: Jia-Ju Bai--- v2: * Modify the description of GFP_ATOMIC in v1. Thanks to Eric for the good advice. --- net/dccp/proto.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/dccp/proto.c b/net/dccp/proto.c index b68168f..f63ba93 100644 --- a/net/dccp/proto.c +++ b/net/dccp/proto.c @@ -1159,7 +1159,7 @@ static int __init dccp_init(void) hash_size--; dccp_hashinfo.ehash_mask = hash_size - 1; dccp_hashinfo.ehash = (struct inet_ehash_bucket *) - __get_free_pages(GFP_ATOMIC|__GFP_NOWARN, ehash_order); + __get_free_pages(GFP_KERNEL|__GFP_NOWARN, ehash_order); } while (!dccp_hashinfo.ehash && --ehash_order > 0); if (!dccp_hashinfo.ehash) { @@ -1182,7 +1182,7 @@ static int __init dccp_init(void) bhash_order > 0) continue; dccp_hashinfo.bhash = (struct inet_bind_hashbucket *) - __get_free_pages(GFP_ATOMIC|__GFP_NOWARN, bhash_order); + __get_free_pages(GFP_KERNEL|__GFP_NOWARN, bhash_order); } while (!dccp_hashinfo.bhash && --bhash_order >= 0); if (!dccp_hashinfo.bhash) { -- 1.9.1
Re: [PATCH RESEND net] vhost: fix vhost_vq_access_ok() log check
On Tue, Apr 10, 2018 at 3:40 AM, Michael S. Tsirkinwrote: > From: Stefan Hajnoczi > > Commit d65026c6c62e7d9616c8ceb5a53b68bcdc050525 ("vhost: validate log > when IOTLB is enabled") introduced a regression. The logic was > originally: > > if (vq->iotlb) > return 1; > return A && B; > > After the patch the short-circuit logic for A was inverted: > > if (A || vq->iotlb) > return A; > return B; > > The correct logic is: > > if (!A || vq->iotlb) > return A; > return B; > > Reported-by: syzbot+65a84dde0214b0387...@syzkaller.appspotmail.com > Cc: Jason Wang > Signed-off-by: Stefan Hajnoczi > Acked-by: Michael S. Tsirkin > > --- > drivers/vhost/vhost.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) NACK I will send a v2 with cleaner logic as suggested by Linus. Stefan
Re: [PATCH bpf] bpf/tracing: fix a deadlock in perf_event_detach_bpf_prog
On 4/9/18 11:41 AM, Yonghong Song wrote: On 4/9/18 9:47 AM, Alexei Starovoitov wrote: On 4/9/18 9:18 AM, Yonghong Song wrote: syzbot reported a possible deadlock in perf_event_detach_bpf_prog. ... @@ -985,16 +986,31 @@ int perf_event_query_prog_array(struct perf_event *event, void __user *info) return -EINVAL; if (copy_from_user(&query, uquery, sizeof(query))) return -EFAULT; -if (query.ids_len > BPF_TRACE_MAX_PROGS) + +ids_len = query.ids_len; +if (ids_len > BPF_TRACE_MAX_PROGS) return -E2BIG; +ids = kcalloc(ids_len, sizeof(u32), GFP_USER | __GFP_NOWARN); +if (!ids) +return -ENOMEM; mutex_lock(&bpf_event_mutex); ret = bpf_prog_array_copy_info(event->tp_event->prog_array, - uquery->ids, - query.ids_len, - &uquery->prog_cnt); + ids, + ids_len, + &prog_cnt); mutex_unlock(&bpf_event_mutex); +if (!ret || ret == -ENOSPC) { +if (copy_to_user(&uquery->prog_cnt, &prog_cnt, sizeof(prog_cnt)) || +copy_to_user(uquery->ids, ids, ids_len * sizeof(u32))) { +ret = -EFAULT; +goto out; +} +} + +out: +kfree(ids); alloc/free just to avoid this locking dependency feels suboptimal. We actually already did kcalloc/kfree in bpf_prog_array_copy_to_user. In that function, we did not copy_to_user one id at a time. We allocate a temporary array, store the result there, and at the end call copy_to_user once to copy to the user buffer. The patch here just moved this allocation and the associated copy_to_user out of the function and out of bpf_event_mutex. It did not introduce new allocations. I see, so the patch is essentially open-coding bpf_prog_array_copy_to_user(). Can we share the code then? The bpf/core.c callsite used by trace/bpf_trace.c and the similar callsite in bpf/cgroup.c should be using a common helper. Maybe we can get rid of bpf_event_mutex in some cases? The perf event itself is locked via perf_event_ctx_lock() when we're calling perf_event_query_prog_array and perf_event_attach|detach_bpf_prog. I forgot what the motivation was for us to introduce it in the first place. 
The original motivation for the lock was to make sure bpf_prog_array does not change in the middle of attach/detach/query. It looks like we have: . perf_event_attach|query under perf_event_ctx_lock . perf_event_detach not under perf_event_ctx_lock Introducing perf_event_ctx_lock to perf_event_detach could still hit the deadlock. ahh, right, since the progs are in event->tp_event, which can be shared by multiple perf_events. Scratch that idea.
Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
On Mon, Apr 9, 2018 at 4:03 PM, Stephen Hemmingerwrote: > On Mon, 9 Apr 2018 15:30:42 -0700 > Siwei Liu wrote: > >> On Mon, Apr 9, 2018 at 3:15 PM, Andrew Lunn wrote: >> >> No, implementation wise I'd avoid changing the class on the fly. What >> >> I'm looking to is a means to add a secondary class or class aliasing >> >> mechanism for netdevs that allows mapping for a kernel device >> >> namespace (/class/net-kernel) to userspace (/class/net). Imagine >> >> creating symlinks between these two namespaces as an analogy. All >> >> userspace visible netdevs today will have both a kernel name and a >> >> userspace visible name, having one (/class/net) referencing the other >> >> (/class/net-kernel) in its own namespace. The newly introduced >> >> IFF_AUTO_MANAGED device will have a kernel name only >> >> (/class/net-kernel). As a result, the existing applications using >> >> /class/net don't break, while we're adding the kernel namespace that >> >> allows IFF_AUTO_MANAGED devices which will not be exposed to userspace >> >> at all. >> > >> > My gut feeling is this whole scheme will not fly. You really should be >> > talking to GregKH. >> >> Will do. Before spreading it out loudly I'd run it within netdev to >> clarify why not exposing the lower netdevs is critical >> for cloud service providers in the face of introducing a new feature; >> we are not hiding anything, but exposing it in a way that doesn't >> break existing userspace applications while still making the new >> feature possible with old userspace kept unchanged. >> >> > >> > Anyway, please remember that IFF_AUTO_MANAGED will need to be dynamic. >> > A device can start out as a normal device, and will change to being >> > automatic later, when the user on top of it probes. >> >> Sure. In whatever form it's still a netdev, and changing the namespace >> should be more dynamic than changing the class. 
>> >> Thanks a lot, >> -Siwei >> >> > >> > Andrew > > Also, remember that for netdevs, /sys is really a third-class API. > The primary APIs are netlink and ioctl. Also why not use existing > network namespaces rather than inventing a new abstraction? Because we want to leave old userspace unmodified while making SR-IOV live migration transparent to users. Specifically, we'd want old udevd to skip uevents for the lower netdevs, while also making new udevd able to name the bypass_master interface by referencing the PCI slot information, which is only present in the sysfs entry for the IFF_AUTO_MANAGED net device. The problem with using a network namespace is that no sysfs entry will be around for the IFF_AUTO_MANAGED netdev if we isolate it out into a separate netns, unless we generalize the scope of what netns is designed for (isolation, I mean). For auto-managed netdevs we don't necessarily want strict isolation, but rather a way of sticking to the 1-netdev model for the strict backward compatibility requirement of old userspace, while exposing the information in a way new userspace understands. Thanks, -Siwei
Re: [PATCH net-next] netns: filter uevents correctly
Christian Braunerwrites: > On Thu, Apr 05, 2018 at 10:59:49PM -0500, Eric W. Biederman wrote: >> Christian Brauner writes: >> >> > On Thu, Apr 05, 2018 at 05:26:59PM +0300, Kirill Tkhai wrote: >> >> On 05.04.2018 17:07, Christian Brauner wrote: >> >> > On Thu, Apr 05, 2018 at 04:01:03PM +0300, Kirill Tkhai wrote: >> >> >> On 04.04.2018 22:48, Christian Brauner wrote: >> >> >>> commit 07e98962fa77 ("kobject: Send hotplug events in all network >> >> >>> namespaces") >> >> >>> >> >> >>> enabled sending hotplug events into all network namespaces back in >> >> >>> 2010. >> >> >>> Over time the set of uevents that get sent into all network >> >> >>> namespaces has >> >> >>> shrunk. We have now reached the point where hotplug events for all >> >> >>> devices >> >> >>> that carry a namespace tag are filtered according to that namespace. >> >> >>> >> >> >>> Specifically, they are filtered whenever the namespace tag of the >> >> >>> kobject >> >> >>> does not match the namespace tag of the netlink socket. One example >> >> >>> are >> >> >>> network devices. Uevents for network devices only show up in the >> >> >>> network >> >> >>> namespaces these devices are moved to or created in. >> >> >>> >> >> >>> However, any uevent for a kobject that does not have a namespace tag >> >> >>> associated with it will not be filtered and we will *try* to >> >> >>> broadcast it >> >> >>> into all network namespaces. >> >> >>> >> >> >>> The original patchset was written in 2010 before user namespaces were >> >> >>> a >> >> >>> thing. 
With the introduction of user namespaces sending out uevents >> >>> became >> >>> partially isolated as they were filtered by user namespaces: >> >>> >> >>> net/netlink/af_netlink.c:do_one_broadcast() >> >>> >> >>> if (!net_eq(sock_net(sk), p->net)) { >> >>> if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID)) >> >>> return; >> >>> >> >>> if (!peernet_has_id(sock_net(sk), p->net)) >> >>> return; >> >>> >> >>> if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns, >> >>> CAP_NET_BROADCAST)) >> >>> return; >> >>> } >> >>> >> >>> The file_ns_capable() check will check whether the caller had >> >>> CAP_NET_BROADCAST at the time of opening the netlink socket in the >> >>> user >> >>> namespace of interest. This check is fine in general but seems >> >>> insufficient >> >>> to me when paired with uevents. The reason is that devices always >> >>> belong to >> >>> the initial user namespace so uevents for kobjects that do not carry a >> >>> namespace tag should never be sent into another user namespace. This >> >>> has >> >>> been the intention all along. But there's one case where this breaks, >> >>> namely if a new user namespace is created by root on the host and an >> >>> identity mapping is established between root on the host and root in >> >>> the >> >>> new user namespace. Here's a reproducer: >> >>> >> >>> sudo unshare -U --map-root >> >>> udevadm monitor -k >> >>> # Now change to initial user namespace and e.g. do >> >>> modprobe kvm >> >>> # or >> >>> rmmod kvm >> >>> >> >>> will allow the non-initial user namespace to retrieve all uevents >> >>> from the >> >>> host. This seems very anecdotal given that in the general case user >> >>> namespaces do not see any uevents and also can't really do anything >> >>> useful >> >>> with them. >> >>> >> >>> Additionally, it is now possible to send uevents from userspace. 
As >> >> >>> such we >> >> >>> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user >> >> >>> namespace of the network namespace of the netlink socket) userspace >> >> >>> process >> >> >>> make a decision what uevents should be sent. >> >> >>> >> >> >>> This makes me think that we should simply ensure that uevents for >> >> >>> kobjects >> >> >>> that do not carry a namespace tag are *always* filtered by user >> >> >>> namespace >> >> >>> in kobj_bcast_filter(). Specifically: >> >> >>> - If the owning user namespace of the uevent socket is not >> >> >>> init_user_ns the >> >> >>> event will always be filtered. >> >> >>> - If the network namespace the uevent socket belongs to was created >> >> >>> in the >> >> >>> initial user namespace but was opened from a non-initial user >> >> >>> namespace >> >> >>> the event will be filtered as well. >> >> >>> Put another way, uevents for kobjects not carrying a namespace tag >> >> >>> are now >> >> >>> always only sent to the initial user namespace. The regression >> >> >>> potential >> >> >>> for this is near to non-existent since user namespaces can't really do >> >> >>> anything with interesting devices. >> >> >>> >> >> >>> Signed-off-by: Christian Brauner >>
Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
On Mon, 9 Apr 2018 15:30:42 -0700 Siwei Liuwrote: > On Mon, Apr 9, 2018 at 3:15 PM, Andrew Lunn wrote: > >> No, implementation wise I'd avoid changing the class on the fly. What > >> I'm looking to is a means to add a secondary class or class aliasing > >> mechanism for netdevs that allows mapping for a kernel device > >> namespace (/class/net-kernel) to userspace (/class/net). Imagine > >> creating symlinks between these two namespaces as an analogy. All > >> userspace visible netdevs today will have both a kernel name and a > >> userspace visible name, having one (/class/net) referencing the other > >> (/class/net-kernel) in its own namespace. The newly introduced > >> IFF_AUTO_MANAGED device will have a kernel name only > >> (/class/net-kernel). As a result, the existing applications using > >> /class/net don't break, while we're adding the kernel namespace that > >> allows IFF_AUTO_MANAGED devices which will not be exposed to userspace > >> at all. > > > > My gut feeling is this whole scheme will not fly. You really should be > > talking to GregKH. > > Will do. Before spreading it out loudly I'd run it within netdev to > clarify why not exposing the lower netdevs is critical > for cloud service providers in the face of introducing a new feature; > we are not hiding anything, but exposing it in a way that doesn't > break existing userspace applications while still making the new > feature possible with old userspace kept unchanged. > > > > > Anyway, please remember that IFF_AUTO_MANAGED will need to be dynamic. > > A device can start out as a normal device, and will change to being > > automatic later, when the user on top of it probes. > > Sure. In whatever form it's still a netdev, and changing the namespace > should be more dynamic than changing the class. > > Thanks a lot, > -Siwei > > > > > Andrew Also, remember that for netdevs, /sys is really a third-class API. The primary APIs are netlink and ioctl. 
Also why not use existing network namespaces rather than inventing a new abstraction?
Re: [PATCH] iscsi: respond to netlink with unicast when appropriate
On 04/09/2018 03:15 PM, Chris Leech wrote: > Instead of always multicasting responses, send a unicast netlink message > directed at the correct pid. This will be needed if we ever want to > support multiple userspace processes interacting with the kernel over > iSCSI netlink simultaneously. Limitations can currently be seen if you > attempt to run multiple iscsistart commands in parallel. > > We've fixed up the userspace issues in iscsistart that prevented > multiple instances from running, so now attempts to speed up booting by > bringing up multiple iscsi sessions at once in the initramfs are just > running into misrouted responses that this fixes. As you may know, I disagree with running multiple iscsistart-s at the same time, since that's what iscsid is for. Nevertheless, I believe we _should_ be able to have multiple processes talking to the kernel target code, so I agree with these changes. > > Signed-off-by: Chris Leech> --- > drivers/scsi/scsi_transport_iscsi.c | 29 ++--- > 1 file changed, 18 insertions(+), 11 deletions(-) > > ... (diffs removed to save electrons) > Reviewed-by: Lee Duncan -- Lee Duncan SUSE Labs
Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
On Mon, Apr 9, 2018 at 3:15 PM, Andrew Lunnwrote: >> No, implementation wise I'd avoid changing the class on the fly. What >> I'm looking to is a means to add a secondary class or class aliasing >> mechanism for netdevs that allows mapping for a kernel device >> namespace (/class/net-kernel) to userspace (/class/net). Imagine >> creating symlinks between these two namespaces as an analogy. All >> userspace visible netdevs today will have both a kernel name and a >> userspace visible name, having one (/class/net) referencing the other >> (/class/net-kernel) in its own namespace. The newly introduced >> IFF_AUTO_MANAGED device will have a kernel name only >> (/class/net-kernel). As a result, the existing applications using >> /class/net don't break, while we're adding the kernel namespace that >> allows IFF_AUTO_MANAGED devices which will not be exposed to userspace >> at all. > > My gut feeling is this whole scheme will not fly. You really should be > talking to GregKH. Will do. Before spreading it out loudly I'd run it within netdev to clarify why not exposing the lower netdevs is critical for cloud service providers in the face of introducing a new feature; we are not hiding anything, but exposing it in a way that doesn't break existing userspace applications while still making the new feature possible with old userspace kept unchanged. > > Anyway, please remember that IFF_AUTO_MANAGED will need to be dynamic. > A device can start out as a normal device, and will change to being > automatic later, when the user on top of it probes. Sure. In whatever form it's still a netdev, and changing the namespace should be more dynamic than changing the class. Thanks a lot, -Siwei > > Andrew
[PATCH] iscsi: respond to netlink with unicast when appropriate
Instead of always multicasting responses, send a unicast netlink message directed at the correct pid. This will be needed if we ever want to support multiple userspace processes interacting with the kernel over iSCSI netlink simultaneously. Limitations can currently be seen if you attempt to run multiple iscsistart commands in parallel. We've fixed up the userspace issues in iscsistart that prevented multiple instances from running, so now attempts to speed up booting by bringing up multiple iscsi sessions at once in the initramfs are just running into misrouted responses that this fixes. Signed-off-by: Chris Leech--- drivers/scsi/scsi_transport_iscsi.c | 29 ++--- 1 file changed, 18 insertions(+), 11 deletions(-) diff --git a/drivers/scsi/scsi_transport_iscsi.c b/drivers/scsi/scsi_transport_iscsi.c index f4b52b44b966..65f6c94f2e9b 100644 --- a/drivers/scsi/scsi_transport_iscsi.c +++ b/drivers/scsi/scsi_transport_iscsi.c @@ -2322,6 +2322,12 @@ iscsi_multicast_skb(struct sk_buff *skb, uint32_t group, gfp_t gfp) return nlmsg_multicast(nls, skb, 0, group, gfp); } +static int +iscsi_unicast_skb(struct sk_buff *skb, u32 portid) +{ + return nlmsg_unicast(nls, skb, portid); +} + int iscsi_recv_pdu(struct iscsi_cls_conn *conn, struct iscsi_hdr *hdr, char *data, uint32_t data_size) { @@ -2524,14 +2530,11 @@ void iscsi_ping_comp_event(uint32_t host_no, struct iscsi_transport *transport, EXPORT_SYMBOL_GPL(iscsi_ping_comp_event); static int -iscsi_if_send_reply(uint32_t group, int seq, int type, int done, int multi, - void *payload, int size) +iscsi_if_send_reply(u32 portid, int type, void *payload, int size) { struct sk_buff *skb; struct nlmsghdr *nlh; int len = nlmsg_total_size(size); - int flags = multi ? NLM_F_MULTI : 0; - int t = done ? 
NLMSG_DONE : type; skb = alloc_skb(len, GFP_ATOMIC); if (!skb) { @@ -2539,10 +2542,9 @@ iscsi_if_send_reply(uint32_t group, int seq, int type, int done, int multi, return -ENOMEM; } - nlh = __nlmsg_put(skb, 0, 0, t, (len - sizeof(*nlh)), 0); - nlh->nlmsg_flags = flags; + nlh = __nlmsg_put(skb, 0, 0, type, (len - sizeof(*nlh)), 0); memcpy(nlmsg_data(nlh), payload, size); - return iscsi_multicast_skb(skb, group, GFP_ATOMIC); + return iscsi_unicast_skb(skb, portid); } static int @@ -3470,6 +3472,7 @@ static int iscsi_if_recv_msg(struct sk_buff *skb, struct nlmsghdr *nlh, uint32_t *group) { int err = 0; + u32 portid; struct iscsi_uevent *ev = nlmsg_data(nlh); struct iscsi_transport *transport = NULL; struct iscsi_internal *priv; @@ -3490,10 +3493,12 @@ iscsi_if_recv_msg(struct sk_buff *skb, struct nlmsghdr *nlh, uint32_t *group) if (!try_module_get(transport->owner)) return -EINVAL; + portid = NETLINK_CB(skb).portid; + switch (nlh->nlmsg_type) { case ISCSI_UEVENT_CREATE_SESSION: err = iscsi_if_create_session(priv, ep, ev, - NETLINK_CB(skb).portid, + portid, ev->u.c_session.initial_cmdsn, ev->u.c_session.cmds_max, ev->u.c_session.queue_depth); @@ -3506,7 +3511,7 @@ iscsi_if_recv_msg(struct sk_buff *skb, struct nlmsghdr *nlh, uint32_t *group) } err = iscsi_if_create_session(priv, ep, ev, - NETLINK_CB(skb).portid, + portid, ev->u.c_bound_session.initial_cmdsn, ev->u.c_bound_session.cmds_max, ev->u.c_bound_session.queue_depth); @@ -3664,6 +3669,8 @@ iscsi_if_recv_msg(struct sk_buff *skb, struct nlmsghdr *nlh, uint32_t *group) static void iscsi_if_rx(struct sk_buff *skb) { + u32 portid = NETLINK_CB(skb).portid; + mutex_lock(_queue_mutex); while (skb->len >= NLMSG_HDRLEN) { int err; @@ -3699,8 +3706,8 @@ iscsi_if_rx(struct sk_buff *skb) break; if (ev->type == ISCSI_UEVENT_GET_CHAP && !err) break; - err = iscsi_if_send_reply(group, nlh->nlmsg_seq, - nlh->nlmsg_type, 0, 0, ev, sizeof(*ev)); + err = iscsi_if_send_reply(portid, nlh->nlmsg_type, + ev, sizeof(*ev)); } while (err < 
0 && err != -ECONNREFUSED && err != -ESRCH); skb_pull(skb, rlen); } -- 2.14.3
Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
> No, implementation-wise I'd avoid changing the class on the fly. What
> I'm looking at is a means to add a secondary class or class-aliasing
> mechanism for netdevs that allows mapping a kernel device
> namespace (/class/net-kernel) to userspace (/class/net). Imagine
> creating symlinks between these two namespaces as an analogy. All
> userspace-visible netdevs today will have both a kernel name and a
> userspace-visible name, with one (/class/net) referencing the other
> (/class/net-kernel) in its own namespace. The newly introduced
> IFF_AUTO_MANAGED device will have a kernel name only
> (/class/net-kernel). As a result, the existing applications using
> /class/net don't break, while we're adding the kernel namespace that
> allows IFF_AUTO_MANAGED devices which will not be exposed to userspace
> at all.

My gut feeling is this whole scheme will not fly. You really should be talking to GregKH.

Anyway, please remember that IFF_AUTO_MANAGED will need to be dynamic. A device can start out as a normal device, and will change to being automatic later, when the user on top of it probes.

	Andrew
Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
On Fri, Apr 6, 2018 at 8:19 PM, Andrew Lunn wrote:
> Hi Siwei
>
>> I think everyone seems to agree not to fiddle with the ":" prefix, but
>> rather have a new class of network subsystem under /sys/class, thus a
>> separate device namespace, e.g. /sys/class/net-kernel, for those
>> auto-managed lower netdevs is needed.
>
> How do you get a device into this new class? I don't know the Linux
> driver model too well, but to get a device out of one class and into
> another, I think you need to device_del(dev), modify dev->class and
> then device_add(dev). However, device_add() says you are not allowed
> to do this.

No, implementation-wise I'd avoid changing the class on the fly. What I'm looking at is a means to add a secondary class or class-aliasing mechanism for netdevs that allows mapping a kernel device namespace (/class/net-kernel) to userspace (/class/net). Imagine creating symlinks between these two namespaces as an analogy. All userspace-visible netdevs today will have both a kernel name and a userspace-visible name, with one (/class/net) referencing the other (/class/net-kernel) in its own namespace. The newly introduced IFF_AUTO_MANAGED device will have a kernel name only (/class/net-kernel). As a result, the existing applications using /class/net don't break, while we're adding the kernel namespace that allows IFF_AUTO_MANAGED devices which will not be exposed to userspace at all.

As this requires changing the internals of the driver model core, it's a rather big-hammer approach, I'd think. If there exists a better implementation that allows adding a separate layer of in-kernel device namespace, I'd be more than happy to hear about it.

> And I don't even see how this helps. Are you also not going to call
> list_netdevice()? Are you going to add some other list for these
> devices in a different class?

list_netdevice() is still called.
I think with the current RFC patch, I've added two lists for netdevs under the kernel namespace: dev_cmpl_list and name_cmpl_hlist. As a result, all userspace netdevs that get registered will be added to both types of lists: the userspace list, e.g. dev_list, and also the kernel-space list, e.g. dev_cmpl_list (I can rename it to something more accurate). The IFF_AUTO_MANAGED device will only be added to the kernel-space list, e.g. dev_cmpl_list.

Hope all your questions are answered.

Thanks,
-Siwei

>
> Andrew
Re: [RFC PATCH v2 00/14] Introducing AF_XDP support
On Tue, Mar 27, 2018 at 9:59 AM, Björn Töpel wrote:
> From: Björn Töpel
>
> This RFC introduces a new address family called AF_XDP that is
> optimized for high performance packet processing and, in upcoming
> patch sets, zero-copy semantics. In this v2 version, we have removed
> all zero-copy related code in order to make it smaller, simpler and
> hopefully more review friendly. This RFC only supports copy-mode for
> the generic XDP path (XDP_SKB) for both RX and TX and copy-mode for RX
> using the XDP_DRV path. Zero-copy support requires XDP and driver
> changes that Jesper Dangaard Brouer is working on. Some of his work is
> already on the mailing list for review. We will publish our zero-copy
> support for RX and TX on top of his patch sets at a later point in
> time.
>
> An AF_XDP socket (XSK) is created with the normal socket()
> syscall. Associated with each XSK are two queues: the RX queue and the
> TX queue. A socket can receive packets on the RX queue and it can send
> packets on the TX queue. These queues are registered and sized with
> the setsockopts XDP_RX_QUEUE and XDP_TX_QUEUE, respectively. It is
> mandatory to have at least one of these queues for each socket. In
> contrast to AF_PACKET V2/V3 these descriptor queues are separated from
> packet buffers. An RX or TX descriptor points to a data buffer in a
> memory area called a UMEM. RX and TX can share the same UMEM so that a
> packet does not have to be copied between RX and TX. Moreover, if a
> packet needs to be kept for a while due to a possible retransmit, the
> descriptor that points to that packet can be changed to point to
> another and reused right away. This again avoids copying data.
>
> This new dedicated packet buffer area is called a UMEM. It consists of
> a number of equally sized frames and each frame has a unique frame
> id. A descriptor in one of the queues references a frame by
> referencing its frame id.
> The user space allocates memory for this
> UMEM using whatever means it feels is most appropriate (malloc, mmap,
> huge pages, etc). This memory area is then registered with the kernel
> using the new setsockopt XDP_UMEM_REG. The UMEM also has two queues:
> the FILL queue and the COMPLETION queue. The fill queue is used by the
> application to send down frame ids for the kernel to fill in with RX
> packet data. References to these frames will then appear in the RX
> queue of the XSK once they have been received. The completion queue,
> on the other hand, contains frame ids that the kernel has transmitted
> completely and can now be used again by user space, for either TX or
> RX. Thus, the frame ids appearing in the completion queue are ids that
> were previously transmitted using the TX queue. In summary, the RX and
> FILL queues are used for the RX path and the TX and COMPLETION queues
> are used for the TX path.
>

Can we register a UMEM to multiple devices' queues? So far the l2fwd sample code is sending/receiving from the same queue. I'm thinking about forwarding packets from one device to another. Right now I'm copying packets from one device's RX descriptor to another device's TX completion queue, but this introduces one extra copy. One option is to call the bpf_redirect helper function, but sometimes I still need to process the packet in userspace.

I like this work! Thanks a lot.

William
Re: [net-next PATCH v3 01/11] soc: ti: K2G: enhancement to support QMSS in K2G NAVSS
On Mon, Apr 02, 2018 at 10:37:51AM -0400, Murali Karicheri wrote:
> The Navigator Subsystem (NAVSS) available on the K2G SoC has a cut-down
> version of QMSS with fewer queues, an internal linking RAM with a
> smaller number of buffers, etc. It doesn't have the status and
> explicit push register space found in the QMSS available on other K2
> SoCs. So define reg indices specific to QMSS on K2G. This patch
> introduces the "ti,66ak2g-navss-qm" compatible to identify QMSS on K2G
> NAVSS and to customize the dts handling code. Per the device manual,
> descriptors with index less than or equal to regions0_size are in
> region 0 in the case of K2 QMSS, whereas for QMSS on K2G, descriptors
> with index less than regions0_size are in region 0. So update the size
> accordingly in the regions0_size bits of the linking ram size 0
> register.
>
> Signed-off-by: Murali Karicheri
> Signed-off-by: WingMan Kwok
> ---
>  .../bindings/soc/ti/keystone-navigator-qmss.txt |  9 ++-

Reviewed-by: Rob Herring

>  drivers/soc/ti/knav_qmss.h       |  6 ++
>  drivers/soc/ti/knav_qmss_queue.c | 90 --
>  3 files changed, 82 insertions(+), 23 deletions(-)
[PATCH] mISDN: Remove VLAs
There's an ongoing effort to remove VLAs[1] from the kernel to eventually turn on -Wvla. Remove the VLAs from the mISDN code by switching to using kstrdup in one place and just using the upper bound in another.

[1] https://lkml.org/lkml/2018/3/7/621

Signed-off-by: Laura Abbott
---
 drivers/isdn/mISDN/dsp_hwec.c   | 8 +---
 drivers/isdn/mISDN/l1oip_core.c | 2 +-
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/isdn/mISDN/dsp_hwec.c b/drivers/isdn/mISDN/dsp_hwec.c
index a6e87076acc2..5336bbdbfdc5 100644
--- a/drivers/isdn/mISDN/dsp_hwec.c
+++ b/drivers/isdn/mISDN/dsp_hwec.c
@@ -68,12 +68,12 @@ void dsp_hwec_enable(struct dsp *dsp, const char *arg)
 		goto _do;
 
 	{
-		char _dup[len + 1];
 		char *dup, *tok, *name, *val;
 		int tmp;
 
-		strcpy(_dup, arg);
-		dup = _dup;
+		dup = kstrdup(arg, GFP_ATOMIC);
+		if (!dup)
+			return;
 
 		while ((tok = strsep(&dup, ","))) {
 			if (!strlen(tok))
@@ -89,6 +89,8 @@ void dsp_hwec_enable(struct dsp *dsp, const char *arg)
 				deftaps = tmp;
 			}
 		}
+
+		kfree(dup);
 	}
 
 _do:

diff --git a/drivers/isdn/mISDN/l1oip_core.c b/drivers/isdn/mISDN/l1oip_core.c
index 21d50e4cc5e1..df594245deca 100644
--- a/drivers/isdn/mISDN/l1oip_core.c
+++ b/drivers/isdn/mISDN/l1oip_core.c
@@ -279,7 +279,7 @@ l1oip_socket_send(struct l1oip *hc, u8 localcodec, u8 channel, u32 chanmask,
 		  u16 timebase, u8 *buf, int len)
 {
 	u8 *p;
-	u8 frame[len + 32];
+	u8 frame[L1OIP_MAX_PERFRAME + 32];
 	struct socket *socket = NULL;
 
 	if (debug & DEBUG_L1OIP_MSG)
-- 
2.14.3
Re: [PATCH AUTOSEL for 4.9 160/293] MIPS: Give __secure_computing() access to syscall arguments.
On Mon, Apr 09, 2018 at 12:24:58AM +, Sasha Levin wrote: > From: David Daney> > [ Upstream commit 669c4092225f0ed5df12ebee654581b558a5e3ed ] > > KProbes of __seccomp_filter() are not very useful without access to > the syscall arguments. > > Do what x86 does, and populate a struct seccomp_data to be passed to > __secure_computing(). This allows samples/bpf/tracex5 to extract a > sensible trace. This broke o32 indirect syscalls, and was fixed by commit 3d729deaf287 ("MIPS: seccomp: Fix indirect syscall args"). Cheers James signature.asc Description: Digital signature
Patch "x86/asm: Don't use RBP as a temporary register in csum_partial_copy_generic()" has been added to the 4.9-stable tree
This is a note to let you know that I've just added the patch titled x86/asm: Don't use RBP as a temporary register in csum_partial_copy_generic() to the 4.9-stable tree which can be found at: http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary The filename of the patch is: x86-asm-don-t-use-rbp-as-a-temporary-register-in-csum_partial_copy_generic.patch and it can be found in the queue-4.9 subdirectory. If you, or anyone else, feels it should not be added to the stable tree, please letknow about it. >From foo@baz Mon Apr 9 17:09:24 CEST 2018 From: Josh Poimboeuf Date: Thu, 4 May 2017 09:51:40 -0500 Subject: x86/asm: Don't use RBP as a temporary register in csum_partial_copy_generic() From: Josh Poimboeuf [ Upstream commit 42fc6c6cb1662ba2fa727dd01c9473c63be4e3b6 ] Andrey Konovalov reported the following warning while fuzzing the kernel with syzkaller: WARNING: kernel stack regs at 8800686869f8 in a.out:4933 has bad 'bp' value c3fc855a10167ec0 The unwinder dump revealed that RBP had a bad value when an interrupt occurred in csum_partial_copy_generic(). That function saves RBP on the stack and then overwrites it, using it as a scratch register. That's problematic because it breaks stack traces if an interrupt occurs in the middle of the function. Replace the usage of RBP with another callee-saved register (R15) so stack traces are no longer affected. Reported-by: Andrey Konovalov Tested-by: Andrey Konovalov Signed-off-by: Josh Poimboeuf Cc: Cong Wang Cc: David S . 
Miller Cc: Dmitry Vyukov Cc: Eric Dumazet Cc: Kostya Serebryany Cc: Linus Torvalds Cc: Marcelo Ricardo Leitner Cc: Neil Horman Cc: Peter Zijlstra Cc: Thomas Gleixner Cc: Vlad Yasevich Cc: linux-s...@vger.kernel.org Cc: netdev Cc: syzkaller Link: http://lkml.kernel.org/r/4b03a961efda5ec9bfe46b7b9c9ad72d1efad343.1493909486.git.jpoim...@redhat.com Signed-off-by: Ingo Molnar Signed-off-by: Sasha Levin Signed-off-by: Greg Kroah-Hartman --- arch/x86/lib/csum-copy_64.S | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) --- a/arch/x86/lib/csum-copy_64.S +++ b/arch/x86/lib/csum-copy_64.S @@ -55,7 +55,7 @@ ENTRY(csum_partial_copy_generic) movq %r12, 3*8(%rsp) movq %r14, 4*8(%rsp) movq %r13, 5*8(%rsp) - movq %rbp, 6*8(%rsp) + movq %r15, 6*8(%rsp) movq %r8, (%rsp) movq %r9, 1*8(%rsp) @@ -74,7 +74,7 @@ ENTRY(csum_partial_copy_generic) /* main loop. clear in 64 byte blocks */ /* r9: zero, r8: temp2, rbx: temp1, rax: sum, rcx: saved length */ /* r11: temp3, rdx: temp4, r12 loopcnt */ - /* r10: temp5, rbp: temp6, r14 temp7, r13 temp8 */ + /* r10: temp5, r15: temp6, r14 temp7, r13 temp8 */ .p2align 4 .Lloop: source @@ -89,7 +89,7 @@ ENTRY(csum_partial_copy_generic) source movq 32(%rdi), %r10 source - movq 40(%rdi), %rbp + movq 40(%rdi), %r15 source movq 48(%rdi), %r14 source @@ -103,7 +103,7 @@ ENTRY(csum_partial_copy_generic) adcq %r11, %rax adcq %rdx, %rax adcq %r10, %rax - adcq %rbp, %rax + adcq %r15, %rax adcq %r14, %rax adcq %r13, %rax @@ -121,7 +121,7 @@ ENTRY(csum_partial_copy_generic) dest movq %r10, 32(%rsi) dest - movq %rbp, 40(%rsi) + movq %r15, 40(%rsi) dest movq %r14, 48(%rsi) dest @@ -203,7 +203,7 @@ ENTRY(csum_partial_copy_generic) movq 3*8(%rsp), %r12 movq 4*8(%rsp), %r14 movq 5*8(%rsp), %r13 - movq 6*8(%rsp), %rbp + movq 6*8(%rsp), %r15 addq $7*8, %rsp ret Patches currently in stable-queue which might be from jpoim...@redhat.com are queue-4.9/x86-asm-don-t-use-rbp-as-a-temporary-register-in-csum_partial_copy_generic.patch
RFC: ti_hecc: don't repoll but interrupt again
Hello,

I posted the RFC below to the linux-can ML, but it might actually be more appropriate for netdev (since it involves NAPI). This driver is lying a bit about the amount of work done. Changing that (always look at the work done instead of the hardware state) seems to solve the issue, but I have no clear understanding of why that is (and why it takes down the ethernet stack as well). If someone has suggestions / ideas on what to check, please let me know.

Regards,

Jeroen

The mail sent to linux-can:

While updating to Linux 4.9.93, the ti_hecc driver causes several problems. One of them is that under high load on the CAN bus and ethernet, the ethernet connection completely stalls and won't recover. The patch below seems to fix that (as in, read as much as we can and always re-enable interrupts; it will poll again thereafter). It would be appreciated if someone more familiar with this code could comment on it.

Thanks in advance,

Jeroen

--- a/drivers/net/can/ti_hecc.c
+++ b/drivers/net/can/ti_hecc.c
@@ -625,20 +625,16 @@ static int ti_hecc_rx_poll(struct napi_struct *napi, int quota)
 			spin_unlock_irqrestore(&priv->mbx_lock, flags);
 		} else if (priv->rx_next == HECC_MAX_TX_MBOX - 1) {
 			priv->rx_next = HECC_RX_FIRST_MBOX;
-			break;
 		}
 	}
 
 	/* Enable packet interrupt if all pkts are handled */
-	if (hecc_read(priv, HECC_CANRMP) == 0) {
+	if (num_pkts < quota) {
 		napi_complete(napi);
 		/* Re-enable RX mailbox interrupts */
 		mbx_mask = hecc_read(priv, HECC_CANMIM);
 		mbx_mask |= HECC_TX_MBOX_MASK;
 		hecc_write(priv, HECC_CANMIM, mbx_mask);
-	} else {
-		/* repoll is done only if whole budget is used */
-		num_pkts = quota;
 	}
[PATCH RESEND net] vhost: fix vhost_vq_access_ok() log check
From: Stefan Hajnoczi

Commit d65026c6c62e7d9616c8ceb5a53b68bcdc050525 ("vhost: validate log when IOTLB is enabled") introduced a regression. The logic was originally:

  if (vq->iotlb)
      return 1;
  return A && B;

After the patch the short-circuit logic for A was inverted:

  if (A || vq->iotlb)
      return A;
  return B;

The correct logic is:

  if (!A || vq->iotlb)
      return A;
  return B;

Reported-by: syzbot+65a84dde0214b0387...@syzkaller.appspotmail.com
Cc: Jason Wang
Signed-off-by: Stefan Hajnoczi
Acked-by: Michael S. Tsirkin
---
 drivers/vhost/vhost.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 5320039671b7..f6af4210679a 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1246,7 +1246,7 @@ int vhost_vq_access_ok(struct vhost_virtqueue *vq)
 {
 	int ret = vq_log_access_ok(vq, vq->log_base);
 
-	if (ret || vq->iotlb)
+	if (!ret || vq->iotlb)
 		return ret;
 
 	return vq_access_ok(vq, vq->num, vq->desc, vq->avail, vq->used);
-- 
2.14.3
Re: [RFC PATCH net-next v5 3/4] virtio_net: Extend virtio to use VF datapath when available
On 4/9/2018 1:07 AM, Jiri Pirko wrote:
> Sat, Apr 07, 2018 at 12:59:14AM CEST, sridhar.samudr...@intel.com wrote:
>> On 4/6/2018 5:48 AM, Jiri Pirko wrote:
>>> Thu, Apr 05, 2018 at 11:08:22PM CEST, sridhar.samudr...@intel.com wrote:
>>> [...]
>>>
>>>> +static int virtnet_bypass_join_child(struct net_device *bypass_netdev,
>>>> +				     struct net_device *child_netdev)
>>>> +{
>>>> +	struct virtnet_bypass_info *vbi;
>>>> +	bool backup;
>>>> +
>>>> +	vbi = netdev_priv(bypass_netdev);
>>>> +	backup = (child_netdev->dev.parent == bypass_netdev->dev.parent);
>>>> +	if (backup ? rtnl_dereference(vbi->backup_netdev) :
>>>> +			rtnl_dereference(vbi->active_netdev)) {
>>>> +		netdev_info(bypass_netdev,
>>>> +			    "%s attempting to join bypass dev when %s already present\n",
>>>> +			    child_netdev->name, backup ? "backup" : "active");
>>>
>>> Bypass module should check if there is already some other netdev
>>> enslaved and refuse right there.
>>
>> This will work for virtio-net with the 3-netdev model, but this check
>> has to be done by netvsc as its bypass_netdev is the same as the
>> backup_netdev. Will add a flag while registering with the bypass
>> module to indicate if the driver is doing a 2-netdev or 3-netdev
>> model, and based on that flag this check can be done in the bypass
>> module for the 3-netdev scenario.
>
> Just let me understand it clearly. What I expect the difference between
> the 2-netdev and 3-netdev models to be is this:
>
> 2netdev:
>        bypass_master
>              /
>         VF_slave
>
> 3netdev:
>        bypass_master
>          /       \
>    VF_slave   backup_slave
>
> Is that correct? If not, what does it look like?

Looks correct. VF_slave and backup_slave are the original netdevs and are present in both models. In the 3-netdev model, the bypass_master netdev is created, and VF_slave and backup_slave are marked as the 2 slaves of this new netdev. In the 2-netdev model, backup_slave acts as bypass_master and the bypass module doesn't have access to the netdev_priv of the backup_slave.

Once I moved all the ndo_ops of the master netdev to the bypass module, I realized that we can move the create/destroy of the upper netdev also to bypass.c.
That way the changes to virtio_net become very minimal. With these updates, bypass module now supports both the models by exporting 2 sets of functions. 3 netdev: int bypass_master_create(struct net_device *backup_netdev, struct bypass_master **pbypass_master); void bypass_master_destroy(struct bypass_master *bypass_master); 2 netdev: int bypass_master_register(struct net_device *backup_netdev, struct bypass_ops *ops, struct bypass_master **pbypass_master); void bypass_master_unregister(struct bypass_master *bypass_master); Will send the next revision in a day or two. Thanks Sridhar
Re: [PATCH bpf] bpf/tracing: fix a deadlock in perf_event_detach_bpf_prog
On 4/9/18 9:47 AM, Alexei Starovoitov wrote: On 4/9/18 9:18 AM, Yonghong Song wrote: syzbot reported a possible deadlock in perf_event_detach_bpf_prog. ... @@ -985,16 +986,31 @@ int perf_event_query_prog_array(struct perf_event *event, void __user *info) return -EINVAL; if (copy_from_user(, uquery, sizeof(query))) return -EFAULT; - if (query.ids_len > BPF_TRACE_MAX_PROGS) + + ids_len = query.ids_len; + if (ids_len > BPF_TRACE_MAX_PROGS) return -E2BIG; + ids = kcalloc(ids_len, sizeof(u32), GFP_USER | __GFP_NOWARN); + if (!ids) + return -ENOMEM; mutex_lock(_event_mutex); ret = bpf_prog_array_copy_info(event->tp_event->prog_array, - uquery->ids, - query.ids_len, - >prog_cnt); + ids, + ids_len, + _cnt); mutex_unlock(_event_mutex); + if (!ret || ret == -ENOSPC) { + if (copy_to_user(>prog_cnt, _cnt, sizeof(prog_cnt)) || + copy_to_user(uquery->ids, ids, ids_len * sizeof(u32))) { + ret = -EFAULT; + goto out; + } + } + +out: + kfree(ids); alloc/free just to avoid this locking dependency feels suboptimal. We actually already did kcalloc/kfree in bpf_prog_array_copy_to_user. In that function, we did not copy_to_user one id at a time. We allocate a temporary array and store the result there and at the end, we call one copy_to_user to copy to the user buffer. The patch here just moved this allocation and associated copy_to_user out of the function and bpf_event_mutex. It did not introduce new allocations. may be we can get rid of bpf_event_mutex in some cases? the perf event itself is locked via perf_event_ctx_lock() when we're calling perf_event_query_prog_array, perf_event_attach|detach_bpf_prog. I forgot what was the motivation for us to introduce it in the first place. The original motivation for the lock to make sure bpf_prog_array does not change during middle of attach/detach/query. it looks like we have: . perf_event_attach|query under perf_event_ctx_lock . 
perf_event_detach not under perf_event_ctx_lock Introducing perf_event_ctx_lock to perf_event_detach could still have the deadlock.
Re: [PATCH iproute2-next 1/1] tc: jsonify skbedit action
David Ahern writes:

> On 4/3/18 1:24 PM, Roman Mashak wrote:
>> 	if (tb[TCA_SKBEDIT_PTYPE] != NULL) {
>> -		ptype = RTA_DATA(tb[TCA_SKBEDIT_PTYPE]);
>> -		if (*ptype == PACKET_HOST)
>> -			fprintf(f, " ptype host");
>> -		else if (*ptype == PACKET_BROADCAST)
>> -			fprintf(f, " ptype broadcast");
>> -		else if (*ptype == PACKET_MULTICAST)
>> -			fprintf(f, " ptype multicast");
>> -		else if (*ptype == PACKET_OTHERHOST)
>> -			fprintf(f, " ptype otherhost");
>> +		ptype = rta_getattr_u16(tb[TCA_SKBEDIT_PTYPE]);
>> +		if (ptype == PACKET_HOST)
>> +			print_string(PRINT_ANY, "ptype", " %s", "ptype host");
>> +		else if (ptype == PACKET_BROADCAST)
>> +			print_string(PRINT_ANY, "ptype", " %s",
>> +				     "ptype broadcast");
>> +		else if (ptype == PACKET_MULTICAST)
>> +			print_string(PRINT_ANY, "ptype", " %s",
>> +				     "ptype multicast");
>> +		else if (ptype == PACKET_OTHERHOST)
>> +			print_string(PRINT_ANY, "ptype", " %s",
>> +				     "ptype otherhost");
>
> Shouldn't that be:
>     print_string(PRINT_ANY, "ptype", "ptype %s", "otherhost");
>
> And ditto for the other strings.
>
>> 		else
>> -			fprintf(f, " ptype %d", *ptype);
>> +			print_uint(PRINT_ANY, "ptype", " %u", ptype);
>
> And then this one needs 'ptype' before %u

OK. I will send v2.
RE: [Intel-wired-lan] [next-queue PATCH v6 10/10] igb: Add support for adding offloaded clsflower filters
Hi,

"Brown, Aaron F" writes:

>> From: Intel-wired-lan [mailto:intel-wired-lan-boun...@osuosl.org] On
>> Behalf Of Vinicius Costa Gomes
>> Sent: Thursday, March 29, 2018 2:08 PM
>> To: intel-wired-...@lists.osuosl.org
>> Cc: netdev@vger.kernel.org; Sanchez-Palencia, Jesus
>> palen...@intel.com>
>> Subject: [Intel-wired-lan] [next-queue PATCH v6 10/10] igb: Add support for
>> adding offloaded clsflower filters
>>
>> This allows filters added by tc-flower and specifying MAC addresses,
>> Ethernet types, and the VLAN priority field, to be offloaded to the
>> controller.
>
> Can I get a brief explanation for enabling this? I'm currently happy
> with this patch series from a regression perspective, but am
> personally a bit, umm, challenged with tc in general but would like to
> run it through the paces a bit. If it can be done in a one or two
> liner I think it would be a good addition to the patch description.

Of course, my bad for not adding those examples since the beginning.

Cheers,
--
Vinicius
RE: [Intel-wired-lan] [next-queue PATCH v6 08/10] igb: Add MAC address support for ethtool nftuple filters
Hi,

"Brown, Aaron F" writes:

[...]

>> I added that note in the hope that someone else would have a stronger
>> opinion about what to do.
>
> I don't have a strong opinion beyond my preference for an ideal world
> where everything works :) If the part simply cannot filter on the src
> address as a whole without the protocol, I would ideally prefer an
> attempt in ethtool to set the filter on the src address as a whole to
> return an error WHILE still allowing the filter to be set on an
> ethertype when the proto keyword is issued. If ethtool does not allow
> that fine a grain of control then I think the way it is now is good;
> I'd rather have the annoyance of being able to set a filter that does
> nothing than not be able to set the more specific filter at all.

We are agreed. The next version of this series implements just that: specifying the src address alone is an error, but if the classifier has a src address and any other filter, it works.

Cheers,
--
Vinicius
[PATCH RFC iptables] iptables: Per-net ns lock
In CRIU and LXC-restore we met the situation when iptables in a container can't be restored because of permission denied: https://github.com/checkpoint-restore/criu/issues/469

Containers want to restore their own net ns, while they may not have their own mnt ns. In that case they share the host's /run/xtables.lock file, but they may not have permission to open it.

The patch makes /run/xtables.lock per-namespace, i.e., it refers to the caller task's net ns.

What do you think?

Thanks,
Kirill

Signed-off-by: Kirill Tkhai
---
 iptables/xshared.c | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/iptables/xshared.c b/iptables/xshared.c
index 06db72d4..b6dbe4e7 100644
--- a/iptables/xshared.c
+++ b/iptables/xshared.c
@@ -254,7 +254,12 @@ static int xtables_lock(int wait, struct timeval *wait_interval)
 	time_left.tv_sec = wait;
 	time_left.tv_usec = 0;
 
-	fd = open(XT_LOCK_NAME, O_CREAT, 0600);
+	if (symlink("/proc/self/ns/net", XT_LOCK_NAME) != 0 &&
+	    errno != EEXIST) {
+		fprintf(stderr, "Fatal: can't create lock file\n");
+		return XT_LOCK_FAILED;
+	}
+	fd = open(XT_LOCK_NAME, O_RDONLY);
 	if (fd < 0) {
 		fprintf(stderr, "Fatal: can't open lock file %s: %s\n",
 			XT_LOCK_NAME, strerror(errno));
Re: [RFC PATCH bpf-next 2/6] bpf: add bpf_get_stack helper
On 4/9/18 3:01 AM, Daniel Borkmann wrote: On 04/09/2018 07:02 AM, Alexei Starovoitov wrote: On 4/8/18 9:53 PM, Yonghong Song wrote:

@@ -1004,7 +1007,8 @@ static void __bpf_prog_put(struct bpf_prog *prog, bool do_idr_lock)
 		bpf_prog_kallsyms_del(prog->aux->func[i]);
 	bpf_prog_kallsyms_del(prog);
 
-	call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu);
+	synchronize_rcu();
+	__bpf_prog_put_rcu(&prog->aux->rcu);

there should have been a lockdep splat. We cannot call synchronize_rcu here, since we cannot sleep in some cases. Let me double check this.

The following is the reason why I am using synchronize_rcu(). With call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu), __bpf_prog_put_rcu calls put_callchain_buffers() which calls mutex_lock(), so the runtime with CONFIG_DEBUG_ATOMIC_SLEEP=y will complain, since a potential sleep inside the call_rcu callback is not allowed.

I see. Indeed. We cannot call put_callchain_buffers() from an rcu callback, but doing synchronize_rcu() here is also not possible. How about moving put_callchain into bpf_prog_free_deferred()?

+1, the assumption is that you can call bpf_prog_put() and also bpf_map_put() from any context. Sleeping here for a long time might subtly break things badly.

Thanks for the suggestion! This should work.
Re: [PATCH bpf] bpf/tracing: fix a deadlock in perf_event_detach_bpf_prog
On 4/9/18 9:18 AM, Yonghong Song wrote:

syzbot reported a possible deadlock in perf_event_detach_bpf_prog.
...

@@ -985,16 +986,31 @@ int perf_event_query_prog_array(struct perf_event *event, void __user *info)
 		return -EINVAL;
 	if (copy_from_user(&query, uquery, sizeof(query)))
 		return -EFAULT;
-	if (query.ids_len > BPF_TRACE_MAX_PROGS)
+
+	ids_len = query.ids_len;
+	if (ids_len > BPF_TRACE_MAX_PROGS)
 		return -E2BIG;
+	ids = kcalloc(ids_len, sizeof(u32), GFP_USER | __GFP_NOWARN);
+	if (!ids)
+		return -ENOMEM;
 
 	mutex_lock(&bpf_event_mutex);
 	ret = bpf_prog_array_copy_info(event->tp_event->prog_array,
-				       uquery->ids,
-				       query.ids_len,
-				       &uquery->prog_cnt);
+				       ids,
+				       ids_len,
+				       &prog_cnt);
 	mutex_unlock(&bpf_event_mutex);
 
+	if (!ret || ret == -ENOSPC) {
+		if (copy_to_user(&uquery->prog_cnt, &prog_cnt, sizeof(prog_cnt)) ||
+		    copy_to_user(uquery->ids, ids, ids_len * sizeof(u32))) {
+			ret = -EFAULT;
+			goto out;
+		}
+	}
+
+out:
+	kfree(ids);

alloc/free just to avoid this locking dependency feels suboptimal. Maybe we can get rid of bpf_event_mutex in some cases? The perf event itself is locked via perf_event_ctx_lock() when we're calling perf_event_query_prog_array and perf_event_attach|detach_bpf_prog. I forgot what the motivation was for us to introduce it in the first place.
Re: [RFC v3 net-next 13/18] net/sched: Introduce the TBS Qdisc
Hi Thomas, On 03/28/2018 12:48 AM, Thomas Gleixner wrote: (...) > > There are two modes: > > 1) Send at the given TX time (Explicit mode) > > 2) Send before given TX time (Deadline mode) > > There is no need to specify 'drop if late' simply because if the message is > handed in past the given TX time, it's too late by definition. What you are > trying to implement is a hybrid of TSN and general purpose (not time aware) > networking in one go. And you do that because your overall design is not > looking at the big picture. You designed from a given use case assumption > and tried to fit other things into it with duct tape. Ok, I see the difference now, thanks. I have just two more questions about the deadline mode, please see below. (...) > >>> Coming back to the overall scheme. If you start upfront with a time slice >>> manager which is designed to: >>> >>> - Handle multiple channels >>> >>> - Expose the time constraints, properties per channel >>> >>> then you can fit all kind of use cases, whether designed by committee or >>> not. You can configure that thing per node or network wide. It does not >>> make a difference. The only difference are the resulting constraints. >> >> >> Ok, and I believe the above was covered by what we had proposed before, >> unless >> what you meant by time constraints is beyond the configured port schedule. >> >> Are you suggesting that we'll need to have a kernel entity that is not only >> aware of the current traffic classes 'schedule', but also of the resources >> that >> are still available for new streams to be accommodated into the classes? >> Putting >> it differently, is the TAS you envision just an entity that runs a schedule, >> or >> is it a time-aware 'orchestrator'? > > In the first place its something which runs a defined schedule. > > The accomodation for new streams is required, but not necessarily at the > root qdisc level. That might be a qdisc feeding into it. 
> > Assume you have a bandwidth reservation, aka time slot, for audio. If your > audio related qdisc does deadline scheduling then you can add new streams > to it up to the point where it's no longer able to fit. > > The only thing which might be needed at the root qdisc is the ability to > utilize unused time slots for other purposes, but that's not required to be > there in the first place as long as it's designed in a way that it can be > added later on. Ok, agreed. > >>> So let's look once more at the picture in an abstract way: >>> >>> [ NIC ] >>> | >>> [ Time slice manager ] >>> | | >>> [ Ch 0 ] ... [ Ch N ] >>> >>> So you have a bunch of properties here: >>> >>> 1) Number of Channels ranging from 1 to N >>> >>> 2) Start point, slice period and slice length per channel >> >> Ok, so we agree that a TAS entity is needed. Assuming that channels are >> traffic >> classes, do you have something else in mind other than a new root qdisc? > > Whatever you call it, the important point is that it is the gate keeper to > the network adapter and there is no way around it. It fully controls the > timed schedule, however simple or complex it may be. Ok, and I've finally understood the nuance between the above and what we had planned initially. (...) >> >> * TAS: >> >> The idea we are currently exploring is to add a "time-aware", priority >> based >> qdisc, that also exposes the Tx queues available and provides a mechanism >> for >> mapping priority <-> traffic class <-> Tx queues in a similar fashion as >> mqprio.
We are calling this qdisc 'taprio', and its 'tc' cmd line would >> be: >> >> $ tc qdisc add dev ens4 parent root handle 100 taprio num_tc 4 \ >> map 2 2 1 0 3 3 3 3 3 3 3 3 3 3 3 3 \ >> queues 0 1 2 3 \ >> sched-file gates.sched [base-time ] \ >> [cycle-time ] [extension-time ] >> >> is multi-line, with each line being of the following format: >> >> >> Qbv only defines one : "S" for 'SetGates' >> >> For example: >> >> S 0x01 300 >> S 0x03 500 >> >> This means that there are two intervals, the first will have the gate >> for traffic class 0 open for 300 nanoseconds, the second will have >> both traffic classes open for 500 nanoseconds. > > To accommodate stuff like control systems you also need a base line, which > is not expressed as an interval. Otherwise you can't schedule network wide > explicit plans. That's either an absolute network-time (TAI) time stamp or > an offset to a well defined network-time (TAI) time stamp, e.g. start of > epoch or something else which is agreed on. The actual schedule then fast > forwards past now (TAI) and sets up the slots from there. That makes node > hotplug possible as well. Sure, and the [base-time ] on the command line above was actually wrong. It should have been expressed
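For illustration, here is a sketch of how the pieces discussed above could fit together. This is hedged: the sched-file syntax is only the one proposed in this thread, and the base-time value and its unit (an absolute TAI timestamp in nanoseconds) are assumptions, not settled syntax:

```
# gates.sched: one "S <gate mask> <interval>" entry per line
S 0x01 300
S 0x03 500

# hypothetical invocation anchoring the cycle to an absolute TAI time
$ tc qdisc add dev ens4 parent root handle 100 taprio num_tc 4 \
      map 2 2 1 0 3 3 3 3 3 3 3 3 3 3 3 3 \
      queues 0 1 2 3 \
      sched-file gates.sched base-time 1528743495910289987
```

With a base-time, the qdisc can fast-forward the cycle past the current TAI time and start the first 300 ns slot at a point every node on the network agrees on, which is what makes network-wide plans and node hotplug possible.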
[PATCH bpf] bpf/tracing: fix a deadlock in perf_event_detach_bpf_prog
syzbot reported a possible deadlock in perf_event_detach_bpf_prog. The error details: == WARNING: possible circular locking dependency detected 4.16.0-rc7+ #3 Not tainted -- syz-executor7/24531 is trying to acquire lock: (bpf_event_mutex){+.+.}, at: [<8a849b07>] perf_event_detach_bpf_prog+0x92/0x3d0 kernel/trace/bpf_trace.c:854 but task is already holding lock: (&mm->mmap_sem){++++}, at: [<38768f87>] vm_mmap_pgoff+0x198/0x280 mm/util.c:353 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #1 (&mm->mmap_sem){++++}: __might_fault+0x13a/0x1d0 mm/memory.c:4571 _copy_to_user+0x2c/0xc0 lib/usercopy.c:25 copy_to_user include/linux/uaccess.h:155 [inline] bpf_prog_array_copy_info+0xf2/0x1c0 kernel/bpf/core.c:1694 perf_event_query_prog_array+0x1c7/0x2c0 kernel/trace/bpf_trace.c:891 _perf_ioctl kernel/events/core.c:4750 [inline] perf_ioctl+0x3e1/0x1480 kernel/events/core.c:4770 vfs_ioctl fs/ioctl.c:46 [inline] do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:686 SYSC_ioctl fs/ioctl.c:701 [inline] SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692 do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x42/0xb7 -> #0 (bpf_event_mutex){+.+.}: lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3920 __mutex_lock_common kernel/locking/mutex.c:756 [inline] __mutex_lock+0x16f/0x1a80 kernel/locking/mutex.c:893 mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908 perf_event_detach_bpf_prog+0x92/0x3d0 kernel/trace/bpf_trace.c:854 perf_event_free_bpf_prog kernel/events/core.c:8147 [inline] _free_event+0xbdb/0x10f0 kernel/events/core.c:4116 put_event+0x24/0x30 kernel/events/core.c:4204 perf_mmap_close+0x60d/0x1010 kernel/events/core.c:5172 remove_vma+0xb4/0x1b0 mm/mmap.c:172 remove_vma_list mm/mmap.c:2490 [inline] do_munmap+0x82a/0xdf0 mm/mmap.c:2731 mmap_region+0x59e/0x15a0 mm/mmap.c:1646 do_mmap+0x6c0/0xe00 mm/mmap.c:1483 do_mmap_pgoff include/linux/mm.h:2223 [inline] vm_mmap_pgoff+0x1de/0x280 mm/util.c:355 SYSC_mmap_pgoff mm/mmap.c:1533 [inline]
SyS_mmap_pgoff+0x462/0x5f0 mm/mmap.c:1491 SYSC_mmap arch/x86/kernel/sys_x86_64.c:100 [inline] SyS_mmap+0x16/0x20 arch/x86/kernel/sys_x86_64.c:91 do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x42/0xb7 other info that might help us debug this: Possible unsafe locking scenario: CPU0 CPU1 lock(&mm->mmap_sem); lock(bpf_event_mutex); lock(&mm->mmap_sem); lock(bpf_event_mutex); *** DEADLOCK *** == The bug is introduced by Commit f371b304f12e ("bpf/tracing: allow user space to query prog array on the same tp") where copy_to_user, which requires mm->mmap_sem, is called inside the bpf_event_mutex lock. At the same time, during perf_event file descriptor close, mm->mmap_sem is held first and then the subsequent perf_event_detach_bpf_prog needs the bpf_event_mutex lock. Such a scenario caused a deadlock. As suggested by Daniel, moving copy_to_user out of the bpf_event_mutex lock should fix the problem. Fixes: f371b304f12e ("bpf/tracing: allow user space to query prog array on the same tp") Reported-by: syzbot+dc5ca0e4c9bfafaf2...@syzkaller.appspotmail.com Signed-off-by: Yonghong Song --- include/linux/bpf.h | 4 ++-- kernel/bpf/core.c | 25 +++-- kernel/trace/bpf_trace.c | 24 3 files changed, 41 insertions(+), 12 deletions(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 95a7abd..486e65e 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -339,8 +339,8 @@ int bpf_prog_array_copy_to_user(struct bpf_prog_array __rcu *progs, void bpf_prog_array_delete_safe(struct bpf_prog_array __rcu *progs, struct bpf_prog *old_prog); int bpf_prog_array_copy_info(struct bpf_prog_array __rcu *array, -__u32 __user *prog_ids, u32 request_cnt, -__u32 __user *prog_cnt); +u32 *prog_ids, u32 request_cnt, +u32 *prog_cnt); int bpf_prog_array_copy(struct bpf_prog_array __rcu *old_array, struct bpf_prog *exclude_prog, struct bpf_prog *include_prog, diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c index d315b39..a95a7de 100644 --- a/kernel/bpf/core.c +++
b/kernel/bpf/core.c @@ -1683,22 +1683,35 @@ int bpf_prog_array_copy(struct bpf_prog_array __rcu *old_array, } int
Re: [PATCH net-next] netns: filter uevents correctly
On Thu, Apr 05, 2018 at 10:59:49PM -0500, Eric W. Biederman wrote: > Christian Braunerwrites: > > > On Thu, Apr 05, 2018 at 05:26:59PM +0300, Kirill Tkhai wrote: > >> On 05.04.2018 17:07, Christian Brauner wrote: > >> > On Thu, Apr 05, 2018 at 04:01:03PM +0300, Kirill Tkhai wrote: > >> >> On 04.04.2018 22:48, Christian Brauner wrote: > >> >>> commit 07e98962fa77 ("kobject: Send hotplug events in all network > >> >>> namespaces") > >> >>> > >> >>> enabled sending hotplug events into all network namespaces back in > >> >>> 2010. > >> >>> Over time the set of uevents that get sent into all network namespaces > >> >>> has > >> >>> shrunk. We have now reached the point where hotplug events for all > >> >>> devices > >> >>> that carry a namespace tag are filtered according to that namespace. > >> >>> > >> >>> Specifically, they are filtered whenever the namespace tag of the > >> >>> kobject > >> >>> does not match the namespace tag of the netlink socket. One example are > >> >>> network devices. Uevents for network devices only show up in the > >> >>> network > >> >>> namespaces these devices are moved to or created in. > >> >>> > >> >>> However, any uevent for a kobject that does not have a namespace tag > >> >>> associated with it will not be filtered and we will *try* to broadcast > >> >>> it > >> >>> into all network namespaces. > >> >>> > >> >>> The original patchset was written in 2010 before user namespaces were a > >> >>> thing. 
With the introduction of user namespaces sending out uevents > >> >>> became > >> >>> partially isolated as they were filtered by user namespaces: > >> >>> > >> >>> net/netlink/af_netlink.c:do_one_broadcast() > >> >>> > >> >>> if (!net_eq(sock_net(sk), p->net)) { > >> >>> if (!(nlk->flags & NETLINK_F_LISTEN_ALL_NSID)) > >> >>> return; > >> >>> > >> >>> if (!peernet_has_id(sock_net(sk), p->net)) > >> >>> return; > >> >>> > >> >>> if (!file_ns_capable(sk->sk_socket->file, p->net->user_ns, > >> >>> CAP_NET_BROADCAST)) > >> >>> j return; > >> >>> } > >> >>> > >> >>> The file_ns_capable() check will check whether the caller had > >> >>> CAP_NET_BROADCAST at the time of opening the netlink socket in the user > >> >>> namespace of interest. This check is fine in general but seems > >> >>> insufficient > >> >>> to me when paired with uevents. The reason is that devices always > >> >>> belong to > >> >>> the initial user namespace so uevents for kobjects that do not carry a > >> >>> namespace tag should never be sent into another user namespace. This > >> >>> has > >> >>> been the intention all along. But there's one case where this breaks, > >> >>> namely if a new user namespace is created by root on the host and an > >> >>> identity mapping is established between root on the host and root in > >> >>> the > >> >>> new user namespace. Here's a reproducer: > >> >>> > >> >>> sudo unshare -U --map-root > >> >>> udevadm monitor -k > >> >>> # Now change to initial user namespace and e.g. do > >> >>> modprobe kvm > >> >>> # or > >> >>> rmmod kvm > >> >>> > >> >>> will allow the non-initial user namespace to retrieve all uevents from > >> >>> the > >> >>> host. This seems very anecdotal given that in the general case user > >> >>> namespaces do not see any uevents and also can't really do anything > >> >>> useful > >> >>> with them. > >> >>> > >> >>> Additionally, it is now possible to send uevents from userspace. 
As > >> >>> such we > >> >>> can let a sufficiently privileged (CAP_SYS_ADMIN in the owning user > >> >>> namespace of the network namespace of the netlink socket) userspace > >> >>> process > >> >>> make a decision what uevents should be sent. > >> >>> > >> >>> This makes me think that we should simply ensure that uevents for > >> >>> kobjects > >> >>> that do not carry a namespace tag are *always* filtered by user > >> >>> namespace > >> >>> in kobj_bcast_filter(). Specifically: > >> >>> - If the owning user namespace of the uevent socket is not > >> >>> init_user_ns the > >> >>> event will always be filtered. > >> >>> - If the network namespace the uevent socket belongs to was created in > >> >>> the > >> >>> initial user namespace but was opened from a non-initial user > >> >>> namespace > >> >>> the event will be filtered as well. > >> >>> Put another way, uevents for kobjects not carrying a namespace tag are > >> >>> now > >> >>> always only sent to the initial user namespace. The regression > >> >>> potential > >> >>> for this is near to non-existent since user namespaces can't really do > >> >>> anything with interesting devices. > >> >>> > >> >>> Signed-off-by: Christian Brauner > >> >>> --- > >> >>> lib/kobject_uevent.c | 10 +- > >> >>> 1 file changed, 9 insertions(+), 1 deletion(-) > >> >>> > >> >>> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c > >> >>>
Re: Enable and configure storm prevention in a network device
> > The Marvell switches have leaky buckets, which can be used for > > limiting broadcast and multicast packets, as well as traffic shaping > > in general. Storm prevention is just a form of traffic shaping, so if > > we have generic traffic shaping, it can be used for storm prevention. > > > TI's CPSW hardware also has a similar capability to limit broadcast and > multicast packets at the ingress. Isn't that traffic policing at the ingress > rather than traffic shaping, since the hardware drops the frames at the ingress > if the rate exceeds a limit? Hi Murali, It depends on the generation of Marvell switches. Older ones have just egress traffic shaping. Newer ones also have ingress rate limiting. Andrew
Re: Enable and configure storm prevention in a network device
Andrew, On 04/06/2018 10:30 AM, Andrew Lunn wrote: > On Thu, Apr 05, 2018 at 03:35:06PM -0700, Florian Fainelli wrote: >> On 04/05/2018 01:20 PM, David Miller wrote: >>> From: Murali Karicheri >>> Date: Thu, 5 Apr 2018 16:14:49 -0400 >>> Is there a standard way to implement and configure storm prevention in a Linux network device? >>> >>> What kind of "storm", an interrupt storm? >>> >> >> I would assume Murali is referring to L2 broadcast storms which is >> common in switches. There is not an API for that AFAICT and I am not >> sure what a proper API would look like. > > tc? > > The Marvell switches have leaky buckets, which can be used for > limiting broadcast and multicast packets, as well as traffic shaping > in general. Storm prevention is just a form of traffic shaping, so if > we have generic traffic shaping, it can be used for storm prevention. > TI's CPSW hardware also has a similar capability to limit broadcast and multicast packets at the ingress. Isn't that traffic policing at the ingress rather than traffic shaping, since the hardware drops the frames at the ingress if the rate exceeds a limit? I haven't done any work on TC, but with my limited knowledge of TC, it seems we might be able to use TC to offload the TC policing to the hardware through ndo_setup_tc()? Could someone shed some light on how to do this with tc? Some tc filter command and then some hook in kernel to offload this to hardware? Murali > Andrew > -- Murali Karicheri Linux Kernel, Keystone
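To make the question concrete, here is a hedged sketch of what ingress broadcast policing might look like with tc. The device name and rates are made up, and whether the filter is actually offloaded via ndo_setup_tc() depends entirely on the driver:

```shell
# attach an ingress hook, then police broadcast frames to roughly 1 Mbit/s
tc qdisc add dev eth0 clsact
tc filter add dev eth0 ingress flower dst_mac ff:ff:ff:ff:ff:ff \
    action police rate 1mbit burst 10k conform-exceed drop

# with skip_sw the filter must be offloaded to hardware, or the command fails
tc filter add dev eth0 ingress flower skip_sw dst_mac ff:ff:ff:ff:ff:ff \
    action police rate 1mbit burst 10k conform-exceed drop
```

A multicast variant would match on the multicast bit of the destination MAC instead of the all-ones broadcast address.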
Re: [PATCH net] inetpeer: fix uninit-value in inet_getpeer
On 04/09/2018 07:58 AM, David Miller wrote: > From: Eric Dumazet > Date: Mon, 9 Apr 2018 06:43:27 -0700 > >> syzbot/KMSAN reported that p->dtime was read while it was >> not yet initialized in : >> >> delta = (__u32)jiffies - p->dtime; >> if (delta < ttl || !refcount_dec_if_one(&p->refcnt)) >> gc_stack[i] = NULL; >> >> This is a false positive, because the inetpeer won't be erased >> from the rb-tree if the refcount_dec_if_one(&p->refcnt) does not >> succeed. And this won't happen before the first inet_putpeer() call >> for this inetpeer has been done, and the ->dtime field is written >> exactly before the refcount_dec_and_test(&p->refcnt). >> >> The KMSAN report was : > ... >> Signed-off-by: Eric Dumazet >> Reported-by: syzbot > > Applied, but it looks like we are just adding assignments simply > to placate these reports when the tools and facilities cannot > see through the logic properly. > To be fair, this is because the check on ->dtime should be done a second time after the refcount_dec_if_one(&p->refcnt). It is a tiny race, and we do not really care given the nature of the inetpeer cache: best effort, and a DDoS candidate anyway. If we purge one entry too soon, it is not a big deal. I believe the tool is fine.
Re: [PATCH] slip: Check if rstate is initialized before uncompressing
From: Tejaswi Tanikella Date: Mon, 9 Apr 2018 14:23:49 +0530 > @@ -673,6 +677,7 @@ struct slcompress * > if (cs->cs_tcp.doff > 5) > memcpy(cs->cs_tcpopt, icp + ihl*4 + sizeof(struct tcphdr), > (cs->cs_tcp.doff - 5) * 4); > cs->cs_hsize = ihl*2 + cs->cs_tcp.doff*2; > + cs->initialized = 1; > /* Put headers back on packet ... > struct cstate { > byte_t cs_this; /* connection id number (xmit) */ > + byte_t initialized; /* non-zero if initialized */ Please use 'bool' and true/false for 'initialized'.
Re: [PATCH RESEND v2] vhost-net: set packet weight of tx polling to 2 * vq size
From: haibinzhang(张海斌) Date: Mon, 9 Apr 2018 07:22:17 + > handle_tx will delay rx for tens or even hundreds of milliseconds when tx is busy > polling udp packets with small length (e.g. 1-byte udp payload), because setting > VHOST_NET_WEIGHT takes into account only sent-bytes but not single packet > length. > > Ping-Latencies shown below were tested between two Virtual Machines using > netperf (UDP_STREAM, len=1), and then another machine pinged the client: ... > Acked-by: Michael S. Tsirkin > Signed-off-by: Haibin Zhang > Signed-off-by: Yunfang Tai > Signed-off-by: Lidong Chen Applied, thank you.
Re: [PATCH v5] net: thunderx: rework mac addresses list to u64 array
From: Vadim Lomovtsev Date: Mon, 9 Apr 2018 06:24:48 -0700 > From: Vadim Lomovtsev > > It is too expensive to pass u64 values via a linked list; instead, > allocate an array for them sized by the overall number of mac addresses from the netdev. > > This eventually removes multiple kmalloc() calls, avoids memory > fragmentation and allows a single null check on the kmalloc > return value in order to prevent a potential null pointer dereference. > > Addresses-Coverity-ID: 1467429 ("Dereference null return value") > Fixes: 37c3347eb247 ("net: thunderx: add ndo_set_rx_mode callback > implementation for VF") > Reported-by: Dan Carpenter > Signed-off-by: Vadim Lomovtsev Applied, thanks.
Re: [PATCH net] inetpeer: fix uninit-value in inet_getpeer
From: Eric Dumazet Date: Mon, 9 Apr 2018 06:43:27 -0700 > syzbot/KMSAN reported that p->dtime was read while it was > not yet initialized in : > > delta = (__u32)jiffies - p->dtime; > if (delta < ttl || !refcount_dec_if_one(&p->refcnt)) > gc_stack[i] = NULL; > > This is a false positive, because the inetpeer won't be erased > from the rb-tree if the refcount_dec_if_one(&p->refcnt) does not > succeed. And this won't happen before the first inet_putpeer() call > for this inetpeer has been done, and the ->dtime field is written > exactly before the refcount_dec_and_test(&p->refcnt). > > The KMSAN report was : ... > Signed-off-by: Eric Dumazet > Reported-by: syzbot Applied, but it looks like we are just adding assignments simply to placate these reports when the tools and facilities cannot see through the logic properly.
Re: [PATCH 2/2] alx: add disable_wol parameter
From: Andrew Lunn Date: Mon, 9 Apr 2018 14:39:10 +0200 > On Mon, Apr 09, 2018 at 07:35:14PM +0800, AceLan Kao wrote: >> The WoL feature was reported broken and will lead to >> the system resuming immediately after suspending. >> This symptom is not happening on every system, so add a >> disable_wol option and disable WoL by default to prevent the issue from >> happening again. > >> const char alx_drv_name[] = "alx"; >> >> +/* disable WoL by default */ >> +bool disable_wol = 1; >> +module_param(disable_wol, bool, 0); >> +MODULE_PARM_DESC(disable_wol, "Disable Wake on Lan feature"); >> + > > Hi AceLan > > This seems like you are papering over the cracks. And module > parameters are not liked. > > Please try to find the real problem. Agreed.
Re: [PATCH] net: decnet: Replace GFP_ATOMIC with GFP_KERNEL in dn_route_init
On 2018/4/9 22:44, Eric Dumazet wrote: On 04/09/2018 07:10 AM, Jia-Ju Bai wrote: dn_route_init() is never called in atomic context. The call chain ending up at dn_route_init() is: [1] dn_route_init() <- decnet_init() decnet_init() is only set as a parameter of module_init(). Despite never getting called from atomic context, dn_route_init() calls __get_free_pages() with GFP_ATOMIC, which waits busily for allocation. GFP_ATOMIC is not necessary and can be replaced with GFP_KERNEL, to avoid busy waiting and improve the possibility of successful allocation. This is found by a static analysis tool named DCNS written by myself. And I also manually check it. Signed-off-by: Jia-Ju Bai --- net/decnet/dn_route.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/decnet/dn_route.c b/net/decnet/dn_route.c index 0bd3afd..59ed12a 100644 --- a/net/decnet/dn_route.c +++ b/net/decnet/dn_route.c @@ -1898,7 +1898,7 @@ void __init dn_route_init(void) while(dn_rt_hash_mask & (dn_rt_hash_mask - 1)) dn_rt_hash_mask--; dn_rt_hash_table = (struct dn_rt_hash_bucket *) - __get_free_pages(GFP_ATOMIC, order); + __get_free_pages(GFP_KERNEL, order); } while (dn_rt_hash_table == NULL && --order > 0); if (!dn_rt_hash_table) This might OOM under pressure. Sorry, I do not understand this. Could you please explain the reason? This would need __GFP_NOWARN | __GFP_NORETRY I guess, and would target net-next Do you mean __get_free_pages(__GFP_NOWARN | __GFP_NORETRY) or __get_free_pages(__GFP_NOWARN | __GFP_NORETRY | GFP_KERNEL)? Best wishes, Jia-Ju Bai
Re: [PATCH] net: dccp: Replace GFP_ATOMIC with GFP_KERNEL in dccp_init
On 2018/4/9 22:42, Eric Dumazet wrote: On 04/09/2018 07:10 AM, Jia-Ju Bai wrote: dccp_init() is never called in atomic context. This function is only set as a parameter of module_init(). Despite never getting called from atomic context, dccp_init() calls __get_free_pages() with GFP_ATOMIC, which waits busily for allocation. What do you mean by "waits busily" ? GFP_ATOMIC does not sleep, does not wait. Sorry, I should modify it to "does not sleep". Do you think it is okay? Best wishes, Jia-Ju Bai
Re: [PATCH] net: decnet: Replace GFP_ATOMIC with GFP_KERNEL in dn_route_init
On 04/09/2018 07:10 AM, Jia-Ju Bai wrote: > dn_route_init() is never called in atomic context. > > The call chain ending up at dn_route_init() is: > [1] dn_route_init() <- decnet_init() > decnet_init() is only set as a parameter of module_init(). > > Despite never getting called from atomic context, > dn_route_init() calls __get_free_pages() with GFP_ATOMIC, > which waits busily for allocation. > GFP_ATOMIC is not necessary and can be replaced with GFP_KERNEL, > to avoid busy waiting and improve the possibility of successful allocation. > > This is found by a static analysis tool named DCNS written by myself. > And I also manually check it. > > Signed-off-by: Jia-Ju Bai > --- > net/decnet/dn_route.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/net/decnet/dn_route.c b/net/decnet/dn_route.c > index 0bd3afd..59ed12a 100644 > --- a/net/decnet/dn_route.c > +++ b/net/decnet/dn_route.c > @@ -1898,7 +1898,7 @@ void __init dn_route_init(void) > while(dn_rt_hash_mask & (dn_rt_hash_mask - 1)) > dn_rt_hash_mask--; > dn_rt_hash_table = (struct dn_rt_hash_bucket *) > - __get_free_pages(GFP_ATOMIC, order); > + __get_free_pages(GFP_KERNEL, order); > } while (dn_rt_hash_table == NULL && --order > 0); > > if (!dn_rt_hash_table) > This might OOM under pressure. This would need __GFP_NOWARN | __GFP_NORETRY I guess, and would target net-next
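For what it's worth, a sketch of what Eric's suggestion would likely look like in dn_route_init(). This assumes (as is usual for kernel allocation masks) that the modifiers are OR'ed together with GFP_KERNEL:

```c
/* Sketch only, not a submitted patch: GFP_KERNEL remains the base mask;
 * __GFP_NOWARN suppresses the allocation-failure warning and
 * __GFP_NORETRY bounds reclaim effort, so the surrounding
 * do { ... } while (--order > 0) loop can fall back to a smaller order
 * instead of triggering the OOM killer under memory pressure. */
dn_rt_hash_table = (struct dn_rt_hash_bucket *)
	__get_free_pages(GFP_KERNEL | __GFP_NOWARN | __GFP_NORETRY, order);
```

This mirrors the dccp_init() pattern in this series, which already combines a base flag with __GFP_NOWARN for the same fallback-loop reason.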
Re: [PATCH] netfilter: fix CONFIG_NF_REJECT_IPV6=m link error
On Mon, Apr 9, 2018 at 4:37 PM, Pablo Neira Ayuso wrote: > Hi Arnd, > > On Mon, Apr 09, 2018 at 12:53:12PM +0200, Arnd Bergmann wrote: >> We get a new link error with CONFIG_NFT_REJECT_INET=y and >> CONFIG_NF_REJECT_IPV6=m > > I think we can update NFT_REJECT_INET so it depends on NFT_REJECT_IPV4 > and NFT_REJECT_IPV6. This doesn't allow here CONFIG_NFT_REJECT_INET=y > and CONFIG_NF_REJECT_IPV6=m. > > I mean, just like we do with NFT_FIB_INET. That can only work if NFT_REJECT_INET can be made a 'tristate' symbol again, so that the code gets built as a loadable module if CONFIG_NF_REJECT_IPV6=m. > BTW, I think this problem is not related to the recent patch, > but something older that the kbuild robot has triggered more easily for > some reason? 02c7b25e5f54 is the one that turned NF_TABLES_INET into a 'bool' symbol. NFT_REJECT depends on NF_TABLES_INET, so it used to be restricted to a loadable module with IPV6=m, but can now be built-in, which causes that link error. Arnd
Re: WARNING in ip_rt_bug
From: Dmitry Vyukov Date: Mon, 9 Apr 2018 08:06:20 +0200 > +Eric said that perhaps we just need to revert: > > commit c378a9c019cf5e017d1ed24954b54fae7bebd2bc > Date: Sat May 21 07:16:42 2011 + > ipv4: Give backtrace in ip_rt_bug(). And I replied to him that we shouldn't. Reverting makes the backtrace, and all the useful debugging information, go away. It won't fix the actual bug, which seems to be that ICMP's route lookup tried to use an input route for sending a packet.
Re: [PATCH] net: dccp: Replace GFP_ATOMIC with GFP_KERNEL in dccp_init
On 04/09/2018 07:10 AM, Jia-Ju Bai wrote: > dccp_init() is never called in atomic context. > This function is only set as a parameter of module_init(). > > Despite never getting called from atomic context, > dccp_init() calls __get_free_pages() with GFP_ATOMIC, > which waits busily for allocation. What do you mean by "waits busily" ? GFP_ATOMIC does not sleep, does not wait. > GFP_ATOMIC is not necessary and can be replaced with GFP_KERNEL, > to avoid busy waiting and improve the possibility of successful allocation. > > This is found by a static analysis tool named DCNS written by myself. > And I also manually check it. > > Signed-off-by: Jia-Ju Bai > ---
Re: [PATCH net] arp: fix arp_filter on l3slave devices
On 4/8/18 9:36 PM, Sasha Levin wrote: > Hi, > > [This is an automated email] > > This commit has been processed by the -stable helper bot and determined > to be a high probability candidate for -stable trees. (score: 33.5930) > > The bot has tested the following trees: v4.16, v4.15.15, v4.14.32, v4.9.92, > v4.4.126. > > v4.16: Build OK! > v4.15.15: Build OK! > v4.14.32: Build OK! > v4.9.92: Build OK! > v4.4.126: Build OK! All of those would be good for this patch. Thanks
Re: [PATCH] netfilter: fix CONFIG_NF_REJECT_IPV6=m link error
Hi Arnd, On Mon, Apr 09, 2018 at 12:53:12PM +0200, Arnd Bergmann wrote: > We get a new link error with CONFIG_NFT_REJECT_INET=y and > CONFIG_NF_REJECT_IPV6=m I think we can update NFT_REJECT_INET so it depends on NFT_REJECT_IPV4 and NFT_REJECT_IPV6. This doesn't allow here CONFIG_NFT_REJECT_INET=y and CONFIG_NF_REJECT_IPV6=m. I mean, just like we do with NFT_FIB_INET. BTW, I think this problem is not related to the recent patch, but something older that the kbuild robot has triggered more easily for some reason? Thanks for your patch! diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig index d3220b43c832..b48c57bb9aaf 100644 --- a/net/netfilter/Kconfig +++ b/net/netfilter/Kconfig @@ -601,7 +601,8 @@ config NFT_REJECT config NFT_REJECT_INET depends on NF_TABLES_INET - default NFT_REJECT + depends on NFT_REJECT_IPV4 + depends on NFT_REJECT_IPV6 tristate config NFT_COMPAT
Re: Enable and configure storm prevention in a network device
On 04/05/2018 06:35 PM, Florian Fainelli wrote: > On 04/05/2018 01:20 PM, David Miller wrote: >> From: Murali Karicheri >> Date: Thu, 5 Apr 2018 16:14:49 -0400 >> >>> Is there a standard way to implement and configure storm prevention >>> in a Linux network device? >> >> What kind of "storm", an interrupt storm? >> > > I would assume Murali is referring to L2 broadcast storms which is > common in switches. There is not an API for that AFAICT and I am not > sure what a proper API would look like. > Thanks Florian for adding more details. Yes, I am referring to an L2 broadcast or multicast storm. At this point I don't see a reason to exclude unicast storms as well if the hardware has the capability. -- Murali Karicheri Linux Kernel, Keystone
Re: [PATCH net] net: dsa: mv88e6xxx: Fix receive time stamp race condition.
On Mon, Apr 09, 2018 at 12:03:14AM -0700, Richard Cochran wrote: > This patch fixes the race by moving the queue onto a list on the stack > before reading out the latched time stamp value. > > Fixes: c6fe0ad2c3499 ("net: dsa: mv88e6xxx: add rx/tx timestamping support") Dave, please hold off on this patch. I am seeing new problems in my testing with this applied. I still need to get to the bottom of this. Thanks, Richard
[PATCH] net: tipc: Replace GFP_ATOMIC with GFP_KERNEL in tipc_mon_create
tipc_mon_create() is never called in atomic context. The call chain ending up at tipc_mon_create() is: [1] tipc_mon_create() <- tipc_enable_bearer() <- tipc_nl_bearer_enable() tipc_nl_bearer_enable() calls rtnl_lock(), which indicates this function is not called in atomic context. Despite never getting called from atomic context, tipc_mon_create() calls kzalloc() with GFP_ATOMIC, which waits busily for allocation. GFP_ATOMIC is not necessary and can be replaced with GFP_KERNEL, to avoid busy waiting and improve the possibility of successful allocation. This is found by a static analysis tool named DCNS written by myself. And I also manually check it. Signed-off-by: Jia-Ju Bai --- net/tipc/monitor.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/net/tipc/monitor.c b/net/tipc/monitor.c index 9e109bb..9714d80 100644 --- a/net/tipc/monitor.c +++ b/net/tipc/monitor.c @@ -604,9 +604,9 @@ int tipc_mon_create(struct net *net, int bearer_id) if (tn->monitors[bearer_id]) return 0; - mon = kzalloc(sizeof(*mon), GFP_ATOMIC); - self = kzalloc(sizeof(*self), GFP_ATOMIC); - dom = kzalloc(sizeof(*dom), GFP_ATOMIC); + mon = kzalloc(sizeof(*mon), GFP_KERNEL); + self = kzalloc(sizeof(*self), GFP_KERNEL); + dom = kzalloc(sizeof(*dom), GFP_KERNEL); if (!mon || !self || !dom) { kfree(mon); kfree(self); -- 1.9.1
[PATCH] net: ncsi: Replace GFP_ATOMIC with GFP_KERNEL in ncsi_register_dev
ncsi_register_dev() is never called in atomic context. This function is only called by ftgmac100_probe() in drivers/net/ethernet/faraday/ftgmac100.c. And ftgmac100_probe() is only set as ".probe" in "struct platform_driver". Despite never getting called from atomic context, ncsi_register_dev() calls kzalloc() with GFP_ATOMIC, which waits busily for allocation. GFP_ATOMIC is not necessary and can be replaced with GFP_KERNEL, to avoid busy waiting and improve the possibility of successful allocation. This is found by a static analysis tool named DCNS written by myself. And I also manually check it. Signed-off-by: Jia-Ju Bai --- net/ncsi/ncsi-manage.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/ncsi/ncsi-manage.c b/net/ncsi/ncsi-manage.c index 3fd3c39..6b5b5a0 100644 --- a/net/ncsi/ncsi-manage.c +++ b/net/ncsi/ncsi-manage.c @@ -1508,7 +1508,7 @@ struct ncsi_dev *ncsi_register_dev(struct net_device *dev, return nd; /* Create NCSI device */ - ndp = kzalloc(sizeof(*ndp), GFP_ATOMIC); + ndp = kzalloc(sizeof(*ndp), GFP_KERNEL); if (!ndp) return NULL; -- 1.9.1
[PATCH] net: decnet: Replace GFP_ATOMIC with GFP_KERNEL in dn_route_init
dn_route_init() is never called in atomic context. The call chain ending up at dn_route_init() is: [1] dn_route_init() <- decnet_init() decnet_init() is only set as a parameter of module_init(). Despite never getting called from atomic context, dn_route_init() calls __get_free_pages() with GFP_ATOMIC, which waits busily for allocation. GFP_ATOMIC is not necessary and can be replaced with GFP_KERNEL, to avoid busy waiting and improve the possibility of successful allocation. This is found by a static analysis tool named DCNS written by myself. And I also manually check it. Signed-off-by: Jia-Ju Bai --- net/decnet/dn_route.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/net/decnet/dn_route.c b/net/decnet/dn_route.c index 0bd3afd..59ed12a 100644 --- a/net/decnet/dn_route.c +++ b/net/decnet/dn_route.c @@ -1898,7 +1898,7 @@ void __init dn_route_init(void) while(dn_rt_hash_mask & (dn_rt_hash_mask - 1)) dn_rt_hash_mask--; dn_rt_hash_table = (struct dn_rt_hash_bucket *) - __get_free_pages(GFP_ATOMIC, order); + __get_free_pages(GFP_KERNEL, order); } while (dn_rt_hash_table == NULL && --order > 0); if (!dn_rt_hash_table) -- 1.9.1
[PATCH] net: dccp: Replace GFP_ATOMIC with GFP_KERNEL in dccp_init
dccp_init() is never called in atomic context. This function is only set as a parameter of module_init(). Despite never getting called from atomic context, dccp_init() calls __get_free_pages() with GFP_ATOMIC, which waits busily for allocation. GFP_ATOMIC is not necessary and can be replaced with GFP_KERNEL, to avoid busy waiting and improve the possibility of successful allocation. This is found by a static analysis tool named DCNS written by myself. And I also manually check it. Signed-off-by: Jia-Ju Bai --- net/dccp/proto.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/net/dccp/proto.c b/net/dccp/proto.c index b68168f..f63ba93 100644 --- a/net/dccp/proto.c +++ b/net/dccp/proto.c @@ -1159,7 +1159,7 @@ static int __init dccp_init(void) hash_size--; dccp_hashinfo.ehash_mask = hash_size - 1; dccp_hashinfo.ehash = (struct inet_ehash_bucket *) - __get_free_pages(GFP_ATOMIC|__GFP_NOWARN, ehash_order); + __get_free_pages(GFP_KERNEL|__GFP_NOWARN, ehash_order); } while (!dccp_hashinfo.ehash && --ehash_order > 0); if (!dccp_hashinfo.ehash) { @@ -1182,7 +1182,7 @@ static int __init dccp_init(void) bhash_order > 0) continue; dccp_hashinfo.bhash = (struct inet_bind_hashbucket *) - __get_free_pages(GFP_ATOMIC|__GFP_NOWARN, bhash_order); + __get_free_pages(GFP_KERNEL|__GFP_NOWARN, bhash_order); } while (!dccp_hashinfo.bhash && --bhash_order >= 0); if (!dccp_hashinfo.bhash) { -- 1.9.1
[PATCH net] inetpeer: fix uninit-value in inet_getpeer
syzbot/KMSAN reported that p->dtime was read while it was not yet initialized in: delta = (__u32)jiffies - p->dtime; if (delta < ttl || !refcount_dec_if_one(&p->refcnt)) gc_stack[i] = NULL; This is a false positive, because the inetpeer won't be erased from the rb-tree if the refcount_dec_if_one(&p->refcnt) does not succeed. And this won't happen before the first inet_putpeer() call for this inetpeer has been done, and the ->dtime field is written exactly before the refcount_dec_and_test(&p->refcnt). The KMSAN report was: BUG: KMSAN: uninit-value in inet_peer_gc net/ipv4/inetpeer.c:163 [inline] BUG: KMSAN: uninit-value in inet_getpeer+0x1567/0x1e70 net/ipv4/inetpeer.c:228 CPU: 0 PID: 9494 Comm: syz-executor5 Not tainted 4.16.0+ #82 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:17 [inline] dump_stack+0x185/0x1d0 lib/dump_stack.c:53 kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067 __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:676 inet_peer_gc net/ipv4/inetpeer.c:163 [inline] inet_getpeer+0x1567/0x1e70 net/ipv4/inetpeer.c:228 inet_getpeer_v4 include/net/inetpeer.h:110 [inline] icmpv4_xrlim_allow net/ipv4/icmp.c:330 [inline] icmp_send+0x2b44/0x3050 net/ipv4/icmp.c:725 ip_options_compile+0x237c/0x29f0 net/ipv4/ip_options.c:472 ip_rcv_options net/ipv4/ip_input.c:284 [inline] ip_rcv_finish+0xda8/0x16d0 net/ipv4/ip_input.c:365 NF_HOOK include/linux/netfilter.h:288 [inline] ip_rcv+0x119d/0x16f0 net/ipv4/ip_input.c:493 __netif_receive_skb_core+0x47cf/0x4a80 net/core/dev.c:4562 __netif_receive_skb net/core/dev.c:4627 [inline] netif_receive_skb_internal+0x49d/0x630 net/core/dev.c:4701 netif_receive_skb+0x230/0x240 net/core/dev.c:4725 tun_rx_batched drivers/net/tun.c:1555 [inline] tun_get_user+0x6d88/0x7580 drivers/net/tun.c:1962 tun_chr_write_iter+0x1d4/0x330 drivers/net/tun.c:1990 do_iter_readv_writev+0x7bb/0x970 include/linux/fs.h:1776 do_iter_write+0x30d/0xd40 fs/read_write.c:932 vfs_writev
fs/read_write.c:977 [inline] do_writev+0x3c9/0x830 fs/read_write.c:1012 SYSC_writev+0x9b/0xb0 fs/read_write.c:1085 SyS_writev+0x56/0x80 fs/read_write.c:1082 do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 RIP: 0033:0x455111 RSP: 002b:7fae0365cba0 EFLAGS: 0293 ORIG_RAX: 0014 RAX: ffda RBX: 002e RCX: 00455111 RDX: 0001 RSI: 7fae0365cbf0 RDI: 00fc RBP: 2040 R08: 00fc R09: R10: 002e R11: 0293 R12: R13: 0658 R14: 006fc8e0 R15: Uninit was created at: kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline] kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188 kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314 kmem_cache_alloc+0xaab/0xb90 mm/slub.c:2756 inet_getpeer+0xed8/0x1e70 net/ipv4/inetpeer.c:210 inet_getpeer_v4 include/net/inetpeer.h:110 [inline] ip4_frag_init+0x4d1/0x740 net/ipv4/ip_fragment.c:153 inet_frag_alloc net/ipv4/inet_fragment.c:369 [inline] inet_frag_create net/ipv4/inet_fragment.c:385 [inline] inet_frag_find+0x7da/0x1610 net/ipv4/inet_fragment.c:418 ip_find net/ipv4/ip_fragment.c:275 [inline] ip_defrag+0x448/0x67a0 net/ipv4/ip_fragment.c:676 ip_check_defrag+0x775/0xda0 net/ipv4/ip_fragment.c:724 packet_rcv_fanout+0x2a8/0x8d0 net/packet/af_packet.c:1447 deliver_skb net/core/dev.c:1897 [inline] deliver_ptype_list_skb net/core/dev.c:1912 [inline] __netif_receive_skb_core+0x314a/0x4a80 net/core/dev.c:4545 __netif_receive_skb net/core/dev.c:4627 [inline] netif_receive_skb_internal+0x49d/0x630 net/core/dev.c:4701 netif_receive_skb+0x230/0x240 net/core/dev.c:4725 tun_rx_batched drivers/net/tun.c:1555 [inline] tun_get_user+0x6d88/0x7580 drivers/net/tun.c:1962 tun_chr_write_iter+0x1d4/0x330 drivers/net/tun.c:1990 do_iter_readv_writev+0x7bb/0x970 include/linux/fs.h:1776 do_iter_write+0x30d/0xd40 fs/read_write.c:932 vfs_writev fs/read_write.c:977 [inline] do_writev+0x3c9/0x830 fs/read_write.c:1012 SYSC_writev+0x9b/0xb0 fs/read_write.c:1085 SyS_writev+0x56/0x80 fs/read_write.c:1082 do_syscall_64+0x309/0x430 
arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x3d/0xa2 Signed-off-by: Eric Dumazet Reported-by: syzbot --- net/ipv4/inetpeer.c | 1 + 1 file changed, 1 insertion(+) diff --git a/net/ipv4/inetpeer.c b/net/ipv4/inetpeer.c index 1f04bd91fc2e999ddb82f4be92d39d229166b691..d757b9642d0d1c418bffad5bcd50e8e7bf336c66 100644 --- a/net/ipv4/inetpeer.c +++ b/net/ipv4/inetpeer.c @@ -211,6 +211,7 @@ struct inet_peer *inet_getpeer(struct inet_peer_base *base, p = kmem_cache_alloc(peer_cachep, GFP_ATOMIC); if (p) { p->daddr = *daddr; + p->dtime = (__u32)jiffies;
Re: [RFC bpf-next] bpf: document eBPF helpers and add a script to generate man page
2018-04-09 12:52 UTC+0200 ~ Markus Heiser> >> Am 09.04.2018 um 12:08 schrieb Daniel Borkmann : > [...] > >>> Maybe I completely misunderstood you, so correct me if I'm wrong: >>> >>> - ./scripts/bpf_helpers_doc.py : produces reST markup from C-comments >>> - ./scripts/kernel-doc : produces reST markup from C-comments >>> >>> IMO: both are doing the same job, so why not use kernel-doc? >> >> They are not really doing the same job, in bpf_helpers_doc.py case you don't >> want the whole header rendered, but just a fraction of it, that is, the >> single big comment which describes all BPF helper functions that a BPF >> program developer has available to use in user space - aka the entries in >> the __BPF_FUNC_MAPPER() macro; > > >> I also doubt the latter would actually qualify >> in kdoc context as some sort of a function description. > > latter .. ah, OK .. thanks for clarifying. > > -- Markus -- As Daniel explained, kernel-doc does not apply in this case, we do not have the full function prototype for eBPF helpers in the header file. But to be honest, I didn't even realise kernel-doc was available and could do something close to what I was looking for, so thanks for your feedback! :) Quentin
v4.16 in_dev_finish_destroy() accessing freed idev->dev->pcpu_refcnt
Hi all, As a heads-up, while fuzzing v4.16 on arm64, I'm hitting an intermittent issue where in_dev_finish_destroy() calls dev_put() on idev->dev, where idev->dev->pcpu_refcnt is NULL. Apparently idev->dev has already been freed. This results in a fault where we try to access the percpu offset of NULL, such as in the example splat at the end of this mail. So far I've had no luck in extracting a reliable reproducer, but I've uploaded the relevant syzkaller logs (and reports) to my kernel.org webspace [1]. Thanks, Mark. [1] https://www.kernel.org/pub/linux/kernel/people/mark/bugs/20180409-in_dev_finish_destroy-null-pcpu_refcnt/ Unable to handle kernel paging request at virtual address 60002be8 Mem abort info: ESR = 0x9604 Exception class = DABT (current EL), IL = 32 bits SET = 0, FnV = 0 EA = 0, S1PTW = 0 Data abort info: ISV = 0, ISS = 0x0004 CM = 0, WnR = 0 user pgtable: 4k pages, 48-bit VAs, pgdp = fb56b784 [60002be8] pgd= Internal error: Oops: 9604 [#1] PREEMPT SMP Modules linked in: CPU: 2 PID: 22 Comm: ksoftirqd/2 Not tainted 4.16.0-6-g6eee13dfb529 #1 Hardware name: linux,dummy-virt (DT) pstate: 8045 (Nzcv daif +PAN -UAO) pc : __percpu_add arch/arm64/include/asm/percpu.h:102 [inline] pc : dev_put include/linux/netdevice.h:3395 [inline] pc : in_dev_finish_destroy+0xcc/0x230 net/ipv4/devinet.c:231 lr : dev_put include/linux/netdevice.h:3395 [inline] lr : in_dev_finish_destroy+0xa4/0x230 net/ipv4/devinet.c:231 sp : 800036317b70 x29: 800036317b70 x28: 80001f39d338 x27: 2a917460 x26: 2a917000 x25: 000a x24: 2a919000 x23: 800036317c90 x22: 29b84da8 x21: 13e73a68 x20: 80006e7a9100 x19: 80001f39d180 x18: x17: x16: 282951f0 x15: 0001 x14: f200 x13: 2a9fb560 x12: 0002 x11: 1c147fbb x10: 0004 x9 : x8 : 1fffe4000178908c x7 : x6 : x5 : x4 : 1fffe4000178908c x3 : 1fffe4000178908c x2 : x1 : 60002be8 x0 : 60002be8 Process ksoftirqd/2 (pid: 22, stack limit = 0xe96adbb2) Call trace: __percpu_add arch/arm64/include/asm/percpu.h:102 [inline] dev_put include/linux/netdevice.h:3395 
[inline] in_dev_finish_destroy+0xcc/0x230 net/ipv4/devinet.c:231 in_dev_put include/linux/inetdevice.h:248 [inline] in_dev_rcu_put+0x34/0x48 net/ipv4/devinet.c:287 __rcu_reclaim kernel/rcu/rcu.h:172 [inline] rcu_do_batch kernel/rcu/tree.c:2674 [inline] invoke_rcu_callbacks kernel/rcu/tree.c:2933 [inline] __rcu_process_callbacks kernel/rcu/tree.c:2900 [inline] rcu_process_callbacks+0x350/0x988 kernel/rcu/tree.c:2917 __do_softirq+0x318/0x734 kernel/softirq.c:285 run_ksoftirqd+0x70/0xa8 kernel/softirq.c:666 smpboot_thread_fn+0x544/0x9d0 kernel/smpboot.c:164 kthread+0x2f8/0x380 kernel/kthread.c:238 ret_from_fork+0x10/0x18 arch/arm64/kernel/entry.S:1158 Code: d538d081 f9425a80 9282 8b01 (885f7c04) ---[ end trace 577c2c0fbbf0d4d4 ]---
Re: [RFC bpf-next] bpf: document eBPF helpers and add a script to generate man page
2018-04-09 11:01 UTC+0200 ~ Daniel Borkmann> On 04/06/2018 01:11 PM, Quentin Monnet wrote: >> eBPF helper functions can be called from within eBPF programs to perform >> a variety of tasks that would be otherwise hard or impossible to do with >> eBPF itself. There is a growing number of such helper functions in the >> kernel, but documentation is scarce. The main user space header file >> does contain a short commented description of most helpers, but it is >> somewhat outdated and not complete. It is more a "cheat sheet" than a >> real documentation accessible to new eBPF developers. >> >> This commit attempts to improve the situation by replacing the existing >> overview for the helpers with a more developed description. Furthermore, >> a Python script is added to generate a manual page for eBPF helpers. The >> workflow is the following, and requires the rst2man utility: >> >> $ ./scripts/bpf_helpers_doc.py \ >> --filename include/uapi/linux/bpf.h > /tmp/bpf-helpers.rst >> $ rst2man /tmp/bpf-helpers.rst > /tmp/bpf-helpers.7 >> $ man /tmp/bpf-helpers.7 >> >> The objective is to keep all documentation related to the helpers in a >> single place, and to be able to generate from here a manual page that >> could be packaged in the man-pages repository and shipped with most >> distributions [1]. >> >> Additionally, parsing the prototypes of the helper functions could >> hopefully be reused, with a different Printer object, to generate >> header files needed in some eBPF-related projects. >> >> Regarding the description of each helper, it comprises several items: >> >> - The function prototype. >> - A description of the function and of its arguments (except for a >> couple of cases, when there are no arguments and the return value >> makes the function usage really obvious). >> - A description of return values (if not void). >> - A listing of eBPF program types (if relevant, map types) compatible >> with the helper. 
>> - Information about the helper being restricted to GPL programs, or not. >> - The kernel version in which the helper was introduced. >> - The commit that introduced the helper (this is mostly to have it in >> the source of the man page, as it can be used to track changes and >> update the page). >> >> For several helpers, descriptions are inspired (at times, nearly copied) >> from the commit logs introducing them in the kernel--Many thanks to >> their respective authors! They were completed as much as possible, the >> objective being to have something easily accessible even for people just >> starting with eBPF. There is probably a bit more work to do in this >> direction for some helpers. >> >> Some RST formatting is used in the descriptions (not in function >> prototypes, to keep them readable, but the Python script provided in >> order to generate the RST for the manual page does add formatting to >> prototypes, to produce something pretty) to get "bold" and "italics" in >> manual pages. Hopefully, the descriptions in the bpf.h file remain >> perfectly readable. Note that the few trailing white spaces are >> intentional, removing them would break paragraphs for rst2man. >> >> The descriptions should ideally be updated each time someone adds a new >> helper, or updates the behaviour (compatibility extended to new program >> types, new socket option supported...) or the interface (new flags >> available, ...) of existing ones. >> >> [1] I have not contacted people from the man-pages project prior to >> sending this RFC, so I can offer no guarantee at this time that they >> would agree to take the generated man page. >> >> Cc: linux-...@vger.kernel.org >> Cc: linux-...@vger.kernel.org >> Signed-off-by: Quentin Monnet > > Great work, thanks a lot for doing this! > > [...]
>> + * int bpf_probe_read(void *dst, u32 size, const void *src) >> + * Description >> + * For tracing programs, safely attempt to read *size* bytes from >> + * address *src* and store the data in *dst*. >> + * Return >> + * 0 on success, or a negative error in case of failure. >> + * For >> + * *Tracing programs*. >> + * GPL only >> + * Yes >> + * Since >> + * Linux 4.1 >> + * >> + * .. commit 2541517c32be >> * >> * u64 bpf_ktime_get_ns(void) >> - * Return: current ktime >> - * >> - * int bpf_trace_printk(const char *fmt, int fmt_size, ...) >> - * Return: length of buffer written or negative error >> + * Description >> + * Return the time elapsed since system boot, in nanoseconds. >> + * Return >> + * Current *ktime*. >> + * For >> + * All program types, except >> + * **BPF_PROG_TYPE_CGROUP_DEVICE**. > > I think we should probably always just list the actual program types instead, > since when new types get added to the kernel not supporting bpf_ktime_get_ns() > helper in this example, the above statement would be misleading
[PATCH v5] net: thunderx: rework mac addresses list to u64 array
From: Vadim Lomovtsev It is too expensive to pass u64 values via a linked list; instead, allocate an array for them sized by the overall number of MAC addresses from the netdev. This eventually removes multiple kmalloc() calls, avoids memory fragmentation and allows putting a single NULL check on the kmalloc() return value in order to prevent a potential NULL pointer dereference. Addresses-Coverity-ID: 1467429 ("Dereference null return value") Fixes: 37c3347eb247 ("net: thunderx: add ndo_set_rx_mode callback implementation for VF") Reported-by: Dan Carpenter Signed-off-by: Vadim Lomovtsev --- Changes from v1 to v2: - C99 syntax: update xcast_addr_list struct field mc[0] -> mc[]; Changes from v2 to v3: - update commit description with 'Reported-by: Dan Carpenter'; - update size calculations for mc list to offsetof() call instead of explicit arithmetic; Changes from v3 to v4: - change loop control variable type from u8 to int, accordingly to mc_count size; Changes from v4 to v5: - remove explicit initialization of 'idx' variable as per review comment; --- drivers/net/ethernet/cavium/thunder/nic.h | 7 +- drivers/net/ethernet/cavium/thunder/nicvf_main.c | 28 +--- 2 files changed, 11 insertions(+), 24 deletions(-) diff --git a/drivers/net/ethernet/cavium/thunder/nic.h b/drivers/net/ethernet/cavium/thunder/nic.h index 5fc46c5a4f36..448d1fafc827 100644 --- a/drivers/net/ethernet/cavium/thunder/nic.h +++ b/drivers/net/ethernet/cavium/thunder/nic.h @@ -265,14 +265,9 @@ struct nicvf_drv_stats { struct cavium_ptp; -struct xcast_addr { - struct list_head list; - u64 addr; -}; - struct xcast_addr_list { - struct list_head list; int count; + u64 mc[]; }; struct nicvf_work { diff --git a/drivers/net/ethernet/cavium/thunder/nicvf_main.c b/drivers/net/ethernet/cavium/thunder/nicvf_main.c index 1e9a31fef729..707db3304396 100644 --- a/drivers/net/ethernet/cavium/thunder/nicvf_main.c +++ b/drivers/net/ethernet/cavium/thunder/nicvf_main.c @@ -1929,7 +1929,7 @@ static void nicvf_set_rx_mode_task(struct work_struct
*work_arg) struct nicvf_work *vf_work = container_of(work_arg, struct nicvf_work, work.work); struct nicvf *nic = container_of(vf_work, struct nicvf, rx_mode_work); union nic_mbx mbx = {}; - struct xcast_addr *xaddr, *next; + int idx; if (!vf_work) return; @@ -1956,16 +1956,10 @@ static void nicvf_set_rx_mode_task(struct work_struct *work_arg) /* check if we have any specific MACs to be added to PF DMAC filter */ if (vf_work->mc) { /* now go through kernel list of MACs and add them one by one */ - list_for_each_entry_safe(xaddr, next, - &vf_work->mc->list, list) { + for (idx = 0; idx < vf_work->mc->count; idx++) { mbx.xcast.msg = NIC_MBOX_MSG_ADD_MCAST; - mbx.xcast.data.mac = xaddr->addr; + mbx.xcast.data.mac = vf_work->mc->mc[idx]; nicvf_send_msg_to_pf(nic, &mbx); - - /* after receiving ACK from PF release memory */ - list_del(&xaddr->list); - kfree(xaddr); - vf_work->mc->count--; } kfree(vf_work->mc); } @@ -1996,17 +1990,15 @@ static void nicvf_set_rx_mode(struct net_device *netdev) mode |= BGX_XCAST_MCAST_FILTER; /* here we need to copy mc addrs */ if (netdev_mc_count(netdev)) { - struct xcast_addr *xaddr; - - mc_list = kmalloc(sizeof(*mc_list), GFP_ATOMIC); - INIT_LIST_HEAD(&mc_list->list); + mc_list = kmalloc(offsetof(typeof(*mc_list), + mc[netdev_mc_count(netdev)]), + GFP_ATOMIC); + if (unlikely(!mc_list)) + return; + mc_list->count = 0; netdev_hw_addr_list_for_each(ha, &netdev->mc) { - xaddr = kmalloc(sizeof(*xaddr), - GFP_ATOMIC); - xaddr->addr = + mc_list->mc[mc_list->count] = ether_addr_to_u64(ha->addr); - list_add_tail(&xaddr->list, - &mc_list->list); mc_list->count++; } } -- 2.14.3
Re: kernel BUG at drivers/vhost/vhost.c:LINE! (2)
On Mon, Apr 9, 2018 at 11:28 AM, Stefan Hajnoczi wrote: > On Mon, Apr 09, 2018 at 05:44:36AM +0300, Michael S. Tsirkin wrote: >> On Mon, Apr 09, 2018 at 10:37:45AM +0800, Stefan Hajnoczi wrote: >> > On Sat, Apr 7, 2018 at 3:02 AM, syzbot >> > wrote: >> > > syzbot hit the following crash on upstream commit >> > > 38c23685b273cfb4ccf31a199feccce3bdcb5d83 (Fri Apr 6 04:29:35 2018 +) >> > > Merge tag 'armsoc-drivers' of >> > > git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc >> > > syzbot dashboard link: >> > > https://syzkaller.appspot.com/bug?extid=65a84dde0214b0387ccd >> > >> > To prevent duplicated work: I am working on this one. >> > >> > Stefan >> >> Do you want to try this patchset: >> https://lkml.org/lkml/2018/4/5/665 >> >> ? > > Thanks, I'll give it a shot. > > I also noticed a regression in commit > d65026c6c62e7d9616c8ceb5a53b68bcdc050525 ("vhost: validate log when > IOTLB is enabled") and am currently testing a fix. I have sent a fix: https://lkml.org/lkml/2018/4/9/390 Stefan
Re: WARNING in ip_rt_bug
On 04/08/2018 11:06 PM, Dmitry Vyukov wrote: > On Mon, Apr 9, 2018 at 7:59 AM, syzbot > wrote: >> Hello, >> >> syzbot hit the following crash on net-next commit >> 8bde261e535257e81087d39ff808414e2f5aa39d (Sun Apr 1 02:31:43 2018 +) >> Merge tag 'mlx5-updates-2018-03-30' of >> git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux >> syzbot dashboard link: >> https://syzkaller.appspot.com/bug?extid=b09ac67a2af842b12eab >> >> Unfortunately, I don't have any reproducer for this crash yet. >> Raw console output: >> https://syzkaller.appspot.com/x/log.txt?id=5991727739437056 >> Kernel config: >> https://syzkaller.appspot.com/x/.config?id=3327544840960562528 >> compiler: gcc (GCC) 7.1.1 20170620 >> >> IMPORTANT: if you fix the bug, please add the following tag to the commit: >> Reported-by: syzbot+b09ac67a2af842b12...@syzkaller.appspotmail.com >> It will help syzbot understand when the bug is fixed. See footer for >> details. >> If you forward the report, please keep this part and the footer. > > > +Eric said that perhaps we just need to revert: > > commit c378a9c019cf5e017d1ed24954b54fae7bebd2bc > Date: Sat May 21 07:16:42 2011 + > ipv4: Give backtrace in ip_rt_bug(). > And David replied: Let's not do the revert, I wouldn't have seen the backtrace which points where this bug is if we had. icmp_route_lookup(), in one branch, does an input route lookup and uses the result of that to send the icmp message. That can't be right, input routes should never be used for transmitting traffic and that's how we end up at ip_rt_bug().
[PATCH] net/mlx5: remove some extraneous spaces in indentations
From: Colin Ian King There are several lines where there is an extraneous space causing indentation misalignment. Remove them. Cleans up Coccinelle warnings: ./drivers/net/ethernet/mellanox/mlx5/core/qp.c:409:3-18: code aligned with following code on line 410 ./drivers/net/ethernet/mellanox/mlx5/core/qp.c:415:3-18: code aligned with following code on line 416 ./drivers/net/ethernet/mellanox/mlx5/core/qp.c:421:3-18: code aligned with following code on line 422 Signed-off-by: Colin Ian King --- drivers/net/ethernet/mellanox/mlx5/core/qp.c | 18 +- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/drivers/net/ethernet/mellanox/mlx5/core/qp.c b/drivers/net/ethernet/mellanox/mlx5/core/qp.c index 02d6c5b5d502..4ca07bfb6b14 100644 --- a/drivers/net/ethernet/mellanox/mlx5/core/qp.c +++ b/drivers/net/ethernet/mellanox/mlx5/core/qp.c @@ -407,21 +407,21 @@ static int modify_qp_mbox_alloc(struct mlx5_core_dev *dev, u16 opcode, int qpn, case MLX5_CMD_OP_RST2INIT_QP: if (MBOX_ALLOC(mbox, rst2init_qp)) return -ENOMEM; -MOD_QP_IN_SET_QPC(rst2init_qp, mbox->in, opcode, qpn, - opt_param_mask, qpc); -break; + MOD_QP_IN_SET_QPC(rst2init_qp, mbox->in, opcode, qpn, + opt_param_mask, qpc); + break; case MLX5_CMD_OP_INIT2RTR_QP: if (MBOX_ALLOC(mbox, init2rtr_qp)) return -ENOMEM; -MOD_QP_IN_SET_QPC(init2rtr_qp, mbox->in, opcode, qpn, - opt_param_mask, qpc); -break; + MOD_QP_IN_SET_QPC(init2rtr_qp, mbox->in, opcode, qpn, + opt_param_mask, qpc); + break; case MLX5_CMD_OP_RTR2RTS_QP: if (MBOX_ALLOC(mbox, rtr2rts_qp)) return -ENOMEM; -MOD_QP_IN_SET_QPC(rtr2rts_qp, mbox->in, opcode, qpn, - opt_param_mask, qpc); -break; + MOD_QP_IN_SET_QPC(rtr2rts_qp, mbox->in, opcode, qpn, + opt_param_mask, qpc); + break; case MLX5_CMD_OP_RTS2RTS_QP: if (MBOX_ALLOC(mbox, rts2rts_qp)) return -ENOMEM; -- 2.15.1
Re: [PATCH 2/2] alx: add disable_wol parameter
On Mon, Apr 09, 2018 at 07:35:14PM +0800, AceLan Kao wrote: > The WoL feature was reported broken and will lead to > the system resume immediately after suspending. > This symptom is not happening on every system, so adding > disable_wol option and disable WoL by default to prevent the issue from > happening again. > const char alx_drv_name[] = "alx"; > > +/* disable WoL by default */ > +bool disable_wol = 1; > +module_param(disable_wol, bool, 0); > +MODULE_PARM_DESC(disable_wol, "Disable Wake on Lan feature"); > + Hi AceLan This seems like you are papering over the cracks. And module parameters are not liked. Please try to find the real problem. Andrew
Re: Passing uninitialised local variable
Arend van Spriel writes: > On 3/28/2018 1:20 PM, Himanshu Jha wrote: >> I recently found that a local variable is passed uninitialised to the >> function at >> >> drivers/net/wireless/broadcom/brcm80211/brcmfmac/cfg80211.c:2950 >> >> u32 var; >> err = brcmf_fil_iovar_int_get(ifp, "dtim_assoc", &var); >> >> s32 >> brcmf_fil_iovar_int_get(struct brcmf_if *ifp, char *name, u32 *data) >> { >> __le32 data_le = cpu_to_le32(*data); >> } >> >> We can clearly see that 'var' is used uninitialised in the very first line >> which is an undefined behavior. > > Why undefined? We copy some stack data and we do transfer that to the device. > However in this case > the device does nothing with it and it is simply overwritten by the response. "Undefined behavior" is a technical term for when there are no guarantees as to what the result of executing a given code will be. None at all--it might for example abort, and that would be perfectly valid as well. (To be clear, this is not about the device, but about the CPU that this code runs on.) Uninitialized reads are one example of a code construct that invokes undefined behavior. Thanks, Petr
[PATCH 1/2] Revert "alx: remove WoL support"
This reverts commit bc2bebe8de8ed4ba6482c9cc370b0dd72ffe8cd2. There are still many people who need this feature, so try adding it back. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=61651 Signed-off-by: AceLan Kao--- drivers/net/ethernet/atheros/alx/ethtool.c | 36 +++ drivers/net/ethernet/atheros/alx/hw.c | 154 - drivers/net/ethernet/atheros/alx/hw.h | 5 + drivers/net/ethernet/atheros/alx/main.c| 142 -- 4 files changed, 326 insertions(+), 11 deletions(-) diff --git a/drivers/net/ethernet/atheros/alx/ethtool.c b/drivers/net/ethernet/atheros/alx/ethtool.c index 2f4eabf..859e272 100644 --- a/drivers/net/ethernet/atheros/alx/ethtool.c +++ b/drivers/net/ethernet/atheros/alx/ethtool.c @@ -310,11 +310,47 @@ static int alx_get_sset_count(struct net_device *netdev, int sset) } } +static void alx_get_wol(struct net_device *netdev, struct ethtool_wolinfo *wol) +{ + struct alx_priv *alx = netdev_priv(netdev); + struct alx_hw *hw = &alx->hw; + + wol->supported = WAKE_MAGIC | WAKE_PHY; + wol->wolopts = 0; + + if (hw->sleep_ctrl & ALX_SLEEP_WOL_MAGIC) + wol->wolopts |= WAKE_MAGIC; + if (hw->sleep_ctrl & ALX_SLEEP_WOL_PHY) + wol->wolopts |= WAKE_PHY; +} + +static int alx_set_wol(struct net_device *netdev, struct ethtool_wolinfo *wol) +{ + struct alx_priv *alx = netdev_priv(netdev); + struct alx_hw *hw = &alx->hw; + + if (wol->wolopts & ~(WAKE_MAGIC | WAKE_PHY)) + return -EOPNOTSUPP; + + hw->sleep_ctrl = 0; + + if (wol->wolopts & WAKE_MAGIC) + hw->sleep_ctrl |= ALX_SLEEP_WOL_MAGIC; + if (wol->wolopts & WAKE_PHY) + hw->sleep_ctrl |= ALX_SLEEP_WOL_PHY; + + device_set_wakeup_enable(&alx->hw.pdev->dev, hw->sleep_ctrl); + + return 0; +} + const struct ethtool_ops alx_ethtool_ops = { .get_pauseparam = alx_get_pauseparam, .set_pauseparam = alx_set_pauseparam, .get_msglevel = alx_get_msglevel, .set_msglevel = alx_set_msglevel, + .get_wol= alx_get_wol, + .set_wol= alx_set_wol, .get_link = ethtool_op_get_link, .get_strings= alx_get_strings, .get_sset_count = alx_get_sset_count, diff --git
a/drivers/net/ethernet/atheros/alx/hw.c b/drivers/net/ethernet/atheros/alx/hw.c index 6ac40b0..f9bf612 100644 --- a/drivers/net/ethernet/atheros/alx/hw.c +++ b/drivers/net/ethernet/atheros/alx/hw.c @@ -332,6 +332,16 @@ void alx_set_macaddr(struct alx_hw *hw, const u8 *addr) alx_write_mem32(hw, ALX_STAD1, val); } +static void alx_enable_osc(struct alx_hw *hw) +{ + u32 val; + + /* rising edge */ + val = alx_read_mem32(hw, ALX_MISC); + alx_write_mem32(hw, ALX_MISC, val & ~ALX_MISC_INTNLOSC_OPEN); + alx_write_mem32(hw, ALX_MISC, val | ALX_MISC_INTNLOSC_OPEN); +} + static void alx_reset_osc(struct alx_hw *hw, u8 rev) { u32 val, val2; @@ -774,7 +784,6 @@ int alx_setup_speed_duplex(struct alx_hw *hw, u32 ethadv, u8 flowctrl) return err; } - void alx_post_phy_link(struct alx_hw *hw) { u16 phy_val, len, agc; @@ -848,6 +857,65 @@ void alx_post_phy_link(struct alx_hw *hw) } } +/* NOTE: + *1. phy link must be established before calling this function + *2. wol option (pattern,magic,link,etc.) is configed before call it. 
+ */ +int alx_pre_suspend(struct alx_hw *hw, int speed, u8 duplex) +{ + u32 master, mac, phy, val; + int err = 0; + + master = alx_read_mem32(hw, ALX_MASTER); + master &= ~ALX_MASTER_PCLKSEL_SRDS; + mac = hw->rx_ctrl; + /* 10/100 half */ + ALX_SET_FIELD(mac, ALX_MAC_CTRL_SPEED, ALX_MAC_CTRL_SPEED_10_100); + mac &= ~(ALX_MAC_CTRL_FULLD | ALX_MAC_CTRL_RX_EN | ALX_MAC_CTRL_TX_EN); + + phy = alx_read_mem32(hw, ALX_PHY_CTRL); + phy &= ~(ALX_PHY_CTRL_DSPRST_OUT | ALX_PHY_CTRL_CLS); + phy |= ALX_PHY_CTRL_RST_ANALOG | ALX_PHY_CTRL_HIB_PULSE | + ALX_PHY_CTRL_HIB_EN; + + /* without any activity */ + if (!(hw->sleep_ctrl & ALX_SLEEP_ACTIVE)) { + err = alx_write_phy_reg(hw, ALX_MII_IER, 0); + if (err) + return err; + phy |= ALX_PHY_CTRL_IDDQ | ALX_PHY_CTRL_POWER_DOWN; + } else { + if (hw->sleep_ctrl & (ALX_SLEEP_WOL_MAGIC | ALX_SLEEP_CIFS)) + mac |= ALX_MAC_CTRL_RX_EN | ALX_MAC_CTRL_BRD_EN; + if (hw->sleep_ctrl & ALX_SLEEP_CIFS) + mac |= ALX_MAC_CTRL_TX_EN; + if (duplex == DUPLEX_FULL) + mac |= ALX_MAC_CTRL_FULLD; + if (speed == SPEED_1000) + ALX_SET_FIELD(mac, ALX_MAC_CTRL_SPEED, + ALX_MAC_CTRL_SPEED_1000); + phy |= ALX_PHY_CTRL_DSPRST_OUT; +
[PATCH 2/2] alx: add disable_wol parameter
The WoL feature was reported broken and will lead the system to resume immediately after suspending. This symptom is not happening on every system, so add a disable_wol option and disable WoL by default to prevent the issue from happening again. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=61651 Signed-off-by: AceLan Kao--- drivers/net/ethernet/atheros/alx/ethtool.c | 7 ++- drivers/net/ethernet/atheros/alx/main.c| 5 + 2 files changed, 11 insertions(+), 1 deletion(-) diff --git a/drivers/net/ethernet/atheros/alx/ethtool.c b/drivers/net/ethernet/atheros/alx/ethtool.c index 859e272..e50129d 100644 --- a/drivers/net/ethernet/atheros/alx/ethtool.c +++ b/drivers/net/ethernet/atheros/alx/ethtool.c @@ -46,6 +46,8 @@ #include "reg.h" #include "hw.h" +extern const bool disable_wol; + /* The order of these strings must match the order of the fields in * struct alx_hw_stats * See hw.h @@ -315,6 +317,9 @@ static void alx_get_wol(struct net_device *netdev, struct ethtool_wolinfo *wol) struct alx_priv *alx = netdev_priv(netdev); struct alx_hw *hw = &alx->hw; + if (disable_wol) + return; + wol->supported = WAKE_MAGIC | WAKE_PHY; wol->wolopts = 0; @@ -329,7 +334,7 @@ static int alx_set_wol(struct net_device *netdev, struct ethtool_wolinfo *wol) struct alx_priv *alx = netdev_priv(netdev); struct alx_hw *hw = &alx->hw; - if (wol->wolopts & ~(WAKE_MAGIC | WAKE_PHY)) + if (disable_wol || (wol->wolopts & ~(WAKE_MAGIC | WAKE_PHY))) return -EOPNOTSUPP; hw->sleep_ctrl = 0; diff --git a/drivers/net/ethernet/atheros/alx/main.c b/drivers/net/ethernet/atheros/alx/main.c index c0e2bb2..c27fdf7 100644 --- a/drivers/net/ethernet/atheros/alx/main.c +++ b/drivers/net/ethernet/atheros/alx/main.c @@ -51,6 +51,11 @@ const char alx_drv_name[] = "alx"; +/* disable WoL by default */ +bool disable_wol = 1; +module_param(disable_wol, bool, 0); +MODULE_PARM_DESC(disable_wol, "Disable Wake on Lan feature"); + static void alx_free_txbuf(struct alx_tx_queue *txq, int entry) { struct alx_buffer *txb =
&txq->bufs[entry]; -- 2.7.4
Re: [iovisor-dev] Best userspace programming API for XDP features query to kernel?
On Fri, 6 Apr 2018 12:36:18 +0200 Daniel Borkmann wrote: > On 04/05/2018 10:51 PM, Jesper Dangaard Brouer wrote: > > On Thu, 5 Apr 2018 12:37:19 +0200 > > Daniel Borkmann wrote: > > > >> On 04/04/2018 02:28 PM, Jesper Dangaard Brouer via iovisor-dev wrote: > >>> Hi Suricata people, > >>> > >>> When Eric Leblond (and I helped) integrated XDP in Suricata, we ran > >>> into the issue, that at Suricata load/start time, we cannot determine > >>> if the chosen XDP config options, like xdp-cpu-redirect[1], are valid on > >>> this HW (e.g. require driver XDP_REDIRECT support and bpf cpumap). > >>> > >>> We would have liked a way to report that suricata.yaml config was > >>> invalid for this hardware/setup. Now, it just loads, and packets get > >>> silently dropped by XDP (well a WARN_ONCE and catchable via tracepoints). > >>> > >>> My question to suricata developers: (Q1) Do you already have code that > >>> queries the kernel or drivers for features? > >>> > >>> At the IOvisor call (2 weeks ago), we discussed two options of exposing > >>> XDP features avail in a given driver. > >>> > >>> Option#1: Extend existing ethtool -k/-K "offload and other features" > >>> with some XDP features, that userspace can query. (Do you already query > >>> offloads, regarding Q1) > >>> > >>> Option#2: Invent a new 'ip link set xdp' netlink msg with a query option. > >>> > >> > >> I don't really mind if you go via ethtool, as long as we handle this > >> generically from there and e.g. call the dev's ndo_bpf handler such that > >> we keep all the information in one place. This can be a new ndo_bpf command > >> e.g. XDP_QUERY_FEATURES or such. > > > > Just to be clear: notice as Victor points out[2], they are programmable > > going through the IOCTL (SIOCETHTOOL) and not using cmdline tools. > > Sure, that was perfectly clear. (But at the same time if you extend the > ioctl, it's obvious to also add support to actual ethtool cmdline tool.)
> > > [2] https://github.com/OISF/suricata/blob/master/src/util-ioctl.c#L326 > > > > If you want everything to go through the drivers ndo_bpf call anyway > > (which userspace API is netlink based) then at what point do you > > Not really, that's the front end. ndo_bpf itself is a plain netdev op > and has no real tie to netlink. > > > want drivers to call their own ndo_bpf, when activated through their > > ethtool_ops ? (Sorry, but I don't follow the flow you are proposing) > > > > Notice, I'm not directly against using the drivers ndo_bpf call. I can > > see it does provide kernel more flexibility than the ethtool IOCTL. > > What I was saying is that even if you go via ethtool ioctl api, where > you end up in dev_ethtool() and have some new ETHTOOL_* query command, > then instead of adding a new ethtool_ops callback, we can and should > reuse ndo_bpf from there. Okay, you want to catch this in dev_ethtool(), that connected the dots for me, thanks. BUT it does feel a little fake and confusing to do this, as the driver developers seeing the info via ethtool would expect they need to implement an ethtool_ops callback. My original idea behind going through the ethtool APIs was that every driver is already using this, and it's familiar to the driver developers. And there are existing userspace tools that sysadmins already know, that we can leverage. Thinking about this over the weekend, I have changed my mind a bit, based on your feedback. I do buy into your idea of forcing things through the drivers ndo_bpf call. I'm no longer convinced this should go through ethtool, but as (you describe) we can, if we find that this is the easiest way to export and leverage an existing userspace tool.
> > It would be a "nice-to-have", but not strictly
> > needed as a feature bit. XDP meta-data is controlled via a helper,
> > and the BPF-prog can detect/see at runtime that the helper bpf_xdp_adjust_meta
> > returns -ENOTSUPP (and needs to check the ret value anyhow). Thus,
> > there is not that much gained by exposing this to be detected at setup
> > time, as all drivers should eventually support this, and we can detect
> > it at runtime.
> >
> > The missing XDP_REDIRECT action feature bit is different, as the
> > BPF-prog cannot detect at runtime that this is an unsupported action.
> > Plus, at setup time we cannot query the driver for supported XDP actions.
>
> Ok, so with the example of meta data, you're arguing that it's okay
> to load a native XDP program onto a driver, and run actual traffic on
> the NIC in order to probe for the availability of the feature, when you're
> saying that it "can detect/see [at] runtime". I totally agree with you
> that all drivers should eventually support this (same with XDP_REDIRECT),
> but
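Neither query mechanism exists in the kernel at this point; purely to make the discussion concrete, here is a hypothetical userspace-side sketch of the setup-time check Suricata could perform if feature bits were exposed. All of the XDP_FEAT_* names and the helper are invented for this illustration and do not exist in any kernel API:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical feature bits a driver could report via an
 * XDP_QUERY_FEATURES-style ndo_bpf command or an ethtool ioctl.
 * These names are invented for this example only. */
#define XDP_FEAT_BASIC     (1U << 0)  /* XDP_DROP / XDP_PASS / XDP_TX */
#define XDP_FEAT_REDIRECT  (1U << 1)  /* XDP_REDIRECT action supported */
#define XDP_FEAT_CPUMAP    (1U << 2)  /* redirect into a bpf cpumap */

/* Setup-time check: refuse to start (instead of silently dropping
 * packets at runtime) when the driver lacks a required feature. */
static int xdp_features_ok(uint32_t driver_feats, uint32_t required)
{
    return (driver_feats & required) == required;
}
```

With such a query, a config option like xdp-cpu-redirect could be rejected at suricata.yaml load time rather than discovered via a WARN_ONCE after traffic starts flowing.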
[PATCH] netfilter: fix CONFIG_NF_REJECT_IPV6=m link error
We get a new link error with CONFIG_NFT_REJECT_INET=y and CONFIG_NF_REJECT_IPV6=m after larger parts of the nftables modules are linked together:

net/netfilter/nft_reject_inet.o: In function `nft_reject_inet_eval':
nft_reject_inet.c:(.text+0x17c): undefined reference to `nf_send_unreach6'
nft_reject_inet.c:(.text+0x190): undefined reference to `nf_send_reset6'

The problem is that with NF_TABLES_INET set, we implicitly try to use the ipv6 version as well for NFT_REJECT, but when CONFIG_IPV6 is set to a loadable module, it's impossible to reach that. The best workaround I found is to express the above as a Kconfig dependency, forcing NFT_REJECT itself to be 'm' in that particular configuration.

Fixes: 02c7b25e5f54 ("netfilter: nf_tables: build-in filter chain type")
Signed-off-by: Arnd Bergmann
---
 net/netfilter/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index 704b3832dbad..44d8a55e9721 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -594,6 +594,7 @@ config NFT_QUOTA
 config NFT_REJECT
 	default m if NETFILTER_ADVANCED=n
 	tristate "Netfilter nf_tables reject support"
+	depends on !NF_TABLES_INET || (IPV6!=m || m)
 	help
 	  This option adds the "reject" expression that you can use to
 	  explicitly deny and notify via TCP reset/ICMP informational errors
-- 
2.9.0
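The dependency expression is terse; the following annotated walkthrough (my reading of Kconfig tristate semantics, not part of the patch) shows why it caps NFT_REJECT at m in exactly the failing configuration:

```
# Kconfig tristate logic: y > m > n, `||` takes the maximum, and a
# literal `m` in an expression evaluates to m. So for
#   depends on !NF_TABLES_INET || (IPV6!=m || m)
#
#   NF_TABLES_INET  IPV6  !NF_TABLES_INET  IPV6!=m || m  whole expr
#   n               any   y                -             y  (no limit)
#   y               y     n                y || m = y    y  (no limit)
#   y               n     n                y || m = y    y  (no limit)
#   y               m     n                n || m = m    m  (NFT_REJECT capped at m)
```

In the last row NFT_REJECT can no longer be built-in, so nft_reject_inet.o can never reference nf_send_unreach6/nf_send_reset6 from vmlinux while they live in a module.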
Re: [RFC bpf-next] bpf: document eBPF helpers and add a script to generate man page
> On 09.04.2018 at 12:08, Daniel Borkmann wrote:
[...]
>> Maybe I completely misunderstood you, so correct me if I'm wrong:
>>
>> - ./scripts/bpf_helpers_doc.py : produces reST markup from C comments
>> - ./scripts/kernel-doc : produces reST markup from C comments
>>
>> IMO: both are doing the same job, so why not use kernel-doc?
>
> They are not really doing the same job; in the bpf_helpers_doc.py case you don't
> want the whole header rendered, but just a fraction of it, that is, the
> single big comment which describes all the BPF helper functions that a BPF
> program developer has available to use in user space - aka the entries in
> the __BPF_FUNC_MAPPER() macro;
> I also doubt the latter would actually qualify
> in kdoc context as some sort of a function description.

"latter" .. ah, OK .. thanks for clarifying.

-- Markus --
Re: [PATCH v2 2/2] af_key: Use DIV_ROUND_UP() instead of open-coded equivalent
On Sat, Apr 07, 2018 at 11:40:47AM -0400, Kevin Easton wrote:
> Several places use (x + 7) / 8 to convert from a number of bits to a number
> of bytes. Replace those with DIV_ROUND_UP(x, 8) instead, for consistency
> with other parts of the same file.
>
> Signed-off-by: Kevin Easton

Please resubmit this one to ipsec-next after the merge window. Thanks!
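For readers comparing the two spellings: the two forms are arithmetically identical for non-negative values, which is why the patch is a pure cleanup. A small userspace sketch (the macro is copied from the kernel's include/linux/kernel.h; the wrapper function names are mine):

```c
#include <assert.h>
#include <stddef.h>

/* DIV_ROUND_UP as defined in the kernel's include/linux/kernel.h */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* The open-coded bits-to-bytes conversion used in af_key.c... */
static size_t bits_to_bytes_open_coded(size_t bits)
{
    return (bits + 7) / 8;
}

/* ...and the macro spelling the patch switches to. */
static size_t bits_to_bytes_div_round_up(size_t bits)
{
    return DIV_ROUND_UP(bits, 8);
}
```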
Re: [PATCH v2 1/2] af_key: Always verify length of provided sadb_key
On Sat, Apr 07, 2018 at 11:40:33AM -0400, Kevin Easton wrote:
> Key extensions (struct sadb_key) include a user-specified number of key
> bits. The kernel uses that number to determine how much key data to copy
> out of the message in pfkey_msg2xfrm_state().
>
> The length of the sadb_key message must be verified to be long enough,
> even in the case of SADB_X_AALG_NULL. Furthermore, the sadb_key_len value
> must be long enough to include both the key data and the struct sadb_key
> itself.
>
> Introduce a helper function verify_key_len(), and call it from
> parse_exthdrs() where other exthdr types are similarly checked for
> correctness.
>
> Signed-off-by: Kevin Easton
> Reported-by: syzbot+5022a34ca5a3d49b84223653fab632dfb7b4c...@syzkaller.appspotmail.com

Applied to the ipsec tree, thanks Kevin!
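A userspace sketch of the check the commit message describes. The struct layout follows the PF_KEY v2 definition (RFC 2367 / include/uapi/linux/pfkeyv2.h, where sadb_key_len counts 64-bit words), but the function body is my reconstruction of the described logic, not the kernel helper, and it omits the separate check against the overall message length:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* PF_KEY v2 key extension header (RFC 2367). */
struct sadb_key {
    uint16_t sadb_key_len;      /* extension length in 64-bit words */
    uint16_t sadb_key_exttype;
    uint16_t sadb_key_bits;     /* key length in bits */
    uint16_t sadb_key_reserved;
    /* key data follows */
};

/* Sketch of verify_key_len(): the extension must be long enough to
 * hold both the header and the key data implied by sadb_key_bits;
 * otherwise copying sadb_key_bits worth of key would read past the
 * end of the extension. */
static int verify_key_len(const struct sadb_key *key)
{
    size_t key_bytes = ((size_t)key->sadb_key_bits + 7) / 8;

    if ((size_t)key->sadb_key_len * 8 <
        sizeof(struct sadb_key) + key_bytes)
        return -1;  /* claimed key data exceeds the extension */
    return 0;
}
```

For example, a 128-bit key needs 16 bytes of data plus the 8-byte header, so sadb_key_len must be at least 3 (24 bytes / 8).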
Re: [PATCH v2 0/2] af_key: Fix for sadb_key memcpy read overrun
On Sat, Apr 07, 2018 at 11:40:18AM -0400, Kevin Easton wrote:
> As found by syzbot, af_key does not properly validate the key length in
> sadb_key messages from userspace. This can result in copying from beyond
> the end of the sadb_key part of the message, or indeed beyond the end of
> the entire packet.
>
> Both these patches apply cleanly to ipsec-next. Based on Steffen's
> feedback I have re-ordered them so that the fix only is in patch 1, which
> I would suggest is also a stable tree candidate, whereas patch 2 is a
> cleanup only.

I think some explanation is needed here. Usually bugfixes and cleanups don't go into the same tree. For IPsec, bugfixes go to the 'ipsec' tree while cleanups and new features go to the 'ipsec-next' tree. So you need to split up your patchsets into patches that are targeted at 'ipsec' and at 'ipsec-next'.

Aside from that, we are in the 'merge window' currently. This means that most maintainers don't accept patches to their -next trees. If you have patches for a -next tree, wait until the merge window is over (when v4.17-rc1 is released) and send them then.
Re: [RFC bpf-next] bpf: document eBPF helpers and add a script to generate man page
On 04/09/2018 11:35 AM, Markus Heiser wrote:
>
>> On 09.04.2018 at 11:25, Daniel Borkmann wrote:
>>
>> On 04/09/2018 11:21 AM, Markus Heiser wrote:
>> [...]
>>> Do we really need another kernel-doc parser?
>>>
>>> ./scripts/kernel-doc include/uapi/linux/bpf.h
>>>
>>> should already do the job (producing .rst). For more infos, take a look at
>>
>> This has absolutely zero to do with kernel-doc, but rather with producing
>> a description of the BPF helper functions that is later assembled into an
>> actual man page that BPF program developers (user space) can use.
>
> Maybe I completely misunderstood you, so correct me if I'm wrong:
>
> - ./scripts/bpf_helpers_doc.py : produces reST markup from C comments
> - ./scripts/kernel-doc : produces reST markup from C comments
>
> IMO: both are doing the same job, so why not use kernel-doc?

They are not really doing the same job; in the bpf_helpers_doc.py case you don't want the whole header rendered, but just a fraction of it, that is, the single big comment which describes all the BPF helper functions that a BPF program developer has available to use in user space - aka the entries in the __BPF_FUNC_MAPPER() macro. I also doubt the latter would actually qualify in kdoc context as some sort of a function description.
Re: [RFC PATCH bpf-next 2/6] bpf: add bpf_get_stack helper
On 04/09/2018 07:02 AM, Alexei Starovoitov wrote:
> On 4/8/18 9:53 PM, Yonghong Song wrote:
>>>> @@ -1004,7 +1007,8 @@ static void __bpf_prog_put(struct bpf_prog *prog, bool do_idr_lock)
>>>>  			bpf_prog_kallsyms_del(prog->aux->func[i]);
>>>>  	bpf_prog_kallsyms_del(prog);
>>>> -	call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu);
>>>> +	synchronize_rcu();
>>>> +	__bpf_prog_put_rcu(&prog->aux->rcu);
>>>
>>> there should have been a lockdep splat.
>>> We cannot call synchronize_rcu here, since we cannot sleep
>>> in some cases.
>>
>> Let me double check this. The following is the reason
>> why I am using synchronize_rcu().
>>
>> With call_rcu(&prog->aux->rcu, __bpf_prog_put_rcu), and
>> __bpf_prog_put_rcu calling put_callchain_buffers() which
>> calls mutex_lock(), the runtime with CONFIG_DEBUG_ATOMIC_SLEEP=y
>> will complain, since a potential sleep inside the call_rcu callback
>> is not allowed.
>
> I see. Indeed. We cannot call put_callchain_buffers() from an rcu callback,
> but doing synchronize_rcu() here is also not possible.
> How about moving put_callchain into bpf_prog_free_deferred()?

+1, the assumption is that you can call bpf_prog_put() and also bpf_map_put() from any context. Sleeping here for a long time might subtly break things badly.
Re: [PATCH v4] net: thunderx: rework mac addresses list to u64 array
On Sun, Apr 08, 2018 at 12:42:00PM -0400, David Miller wrote:
> From: Vadim Lomovtsev
> Date: Fri, 6 Apr 2018 12:53:54 -0700
>
> > @@ -1929,7 +1929,7 @@ static void nicvf_set_rx_mode_task(struct work_struct
> > *work_arg)
> >  					     work.work);
> >  	struct nicvf *nic = container_of(vf_work, struct nicvf, rx_mode_work);
> >  	union nic_mbx mbx = {};
> > -	struct xcast_addr *xaddr, *next;
> > +	int idx = 0;
>
> No need to initialize idx.
>
> > +	for (idx = 0; idx < vf_work->mc->count; idx++) {
>
> As it is always explicitly initialized at, and only used inside of,
> this loop.

Ok, will post next version shortly.

Thanks for your time.
Vadim
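The review point generalizes: an initializer at the declaration is dead code when a for-loop always assigns the variable before any use. A trivial standalone illustration (hypothetical function, not from the thunderx driver):

```c
#include <assert.h>

/* Illustration of the review comment: `idx` needs no initializer at
 * its declaration, because the for-loop below always assigns it
 * before the first use. */
static int sum_first_n(int n)
{
    int idx;      /* no "= 0" needed here */
    int sum = 0;

    for (idx = 0; idx < n; idx++)
        sum += idx;
    return sum;
}
```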
Re: [RFC bpf-next] bpf: document eBPF helpers and add a script to generate man page
> On 09.04.2018 at 11:25, Daniel Borkmann wrote:
>
> On 04/09/2018 11:21 AM, Markus Heiser wrote:
> [...]
>> Do we really need another kernel-doc parser?
>>
>> ./scripts/kernel-doc include/uapi/linux/bpf.h
>>
>> should already do the job (producing .rst). For more infos, take a look at
>
> This has absolutely zero to do with kernel-doc, but rather with producing
> a description of the BPF helper functions that is later assembled into an
> actual man page that BPF program developers (user space) can use.

Maybe I completely misunderstood you, so correct me if I'm wrong:

- ./scripts/bpf_helpers_doc.py : produces reST markup from C comments
- ./scripts/kernel-doc : produces reST markup from C comments

IMO: both are doing the same job, so why not use kernel-doc?

-- Markus --
Re: [RFC bpf-next] bpf: document eBPF helpers and add a script to generate man page
On 04/06/2018 01:11 PM, Quentin Monnet wrote:
>> eBPF helper functions can be called from within eBPF programs to perform
>> a variety of tasks that would be otherwise hard or impossible to do with
>> eBPF itself. There is a growing number of such helper functions in the
>> kernel, but documentation is scarce. The main user space header file
>> does contain a short commented description of most helpers, but it is
>> somewhat outdated and not complete. It is more a "cheat sheet" than real
>> documentation accessible to new eBPF developers.
>>
>> This commit attempts to improve the situation by replacing the existing
>> overview for the helpers with a more developed description. Furthermore,
>> a Python script is added to generate a manual page for eBPF helpers. The
>> workflow is the following, and requires the rst2man utility:
>>
>>    $ ./scripts/bpf_helpers_doc.py \
>>        --filename include/uapi/linux/bpf.h > /tmp/bpf-helpers.rst
>>    $ rst2man /tmp/bpf-helpers.rst > /tmp/bpf-helpers.7
>>    $ man /tmp/bpf-helpers.7
>>
>> The objective is to keep all documentation related to the helpers in a
>> single place,

Do we really need another kernel-doc parser?

./scripts/kernel-doc include/uapi/linux/bpf.h

should already do the job (producing .rst). For more infos, take a look at
https://www.kernel.org/doc/html/latest/doc-guide/index.html

-- Markus --

PS: sorry for the re-post; the first post was HTML, which is not accepted by the ML :o