Re: [v3,net-next,2/2] tls: Use correct sk->sk_prot for IPV6

2018-02-26 Thread Boris Pismenny

Hi Guenter,

On 2/23/2018 11:52 PM, Guenter Roeck wrote:

Hi Ilya,

On Mon, Sep 04, 2017 at 01:14:01PM +0300, Ilya Lesokhin wrote:

The tls ulp overrides sk->sk_prot with new tls-specific proto structs.
The tls-specific structs were previously based on the ipv4-specific
tcp_prot struct.
As a result, attaching the tls ulp to an ipv6 tcp socket replaced
some ipv6 callbacks with their ipv4 equivalents.

This patch adds ipv6 tls proto structs and uses them when
attached to ipv6 sockets.
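
(For context, a minimal sketch of that direction: later revisions build the
TLS proto template from the socket's own sk_prot so the family-specific
callbacks survive. The names and the simplified single-template layout below
are illustrative, not the exact patch:)

static struct proto tls_prots[2];	/* [0] IPv4, [1] IPv6 */

static void tls_build_proto(struct sock *sk)
{
	int ip_ver = sk->sk_family == AF_INET6 ? 1 : 0;

	/* start from the socket's own proto instead of tcp_prot,
	 * so the ipv6 callbacks are preserved */
	tls_prots[ip_ver] = *sk->sk_prot;
	tls_prots[ip_ver].setsockopt = tls_setsockopt;
	tls_prots[ip_ver].getsockopt = tls_getsockopt;
	tls_prots[ip_ver].close = tls_sk_proto_close;

	sk->sk_prot = &tls_prots[ip_ver];
}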



Do you plan to update this patch with the missing TCPv6 support?


We'll re-spin a v4 by EOW.


As far as I can see, the part that was accepted upstream does not fix
the TCPv6 protocol issue which triggers CVE-2018-5703.

If adding IPv6 support is not possible/acceptable, would it make sense
to limit TLS support to TCPv4, i.e. add something like

if (sk->sk_prot != &tcp_prot)
return -EINVAL;

to tls_init() ?


AFAIK there are users of TLS over IPv6 who wouldn't find this acceptable.

Best,
Boris.


inconsistent lock state with usbnet/asix usb ethernet and xhci

2018-02-26 Thread Marek Szyprowski

Hi

I've noticed that the USBnet/ASIX AX88772B USB driver produces a lockdep
kernel warning ("inconsistent lock state") on the Chromebook2 Peach-PIT
board. No special activity is needed to reproduce this issue; it happens
on almost every boot. The ASIX USB ethernet adapter is connected to the
XHCI USB host controller on that board. Is this a known issue? Frankly,
I have no idea where to look to fix it. The same adapter connected to EHCI
ports on other boards based on the same SoC works fine without any warnings.

Here is some more information from that board:
# lsusb
Bus 006 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 005 Device 002: ID 0b95:772b ASIX Electronics Corp. AX88772B
Bus 005 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 004 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Bus 001 Device 002: ID 2232:1056 Silicon Motion
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub

# lsusb -t
/:  Bus 06.Port 1: Dev 1, Class=root_hub, Driver=xhci-hcd/1p, 5000M
/:  Bus 05.Port 1: Dev 1, Class=root_hub, Driver=xhci-hcd/1p, 480M
    |__ Port 1: Dev 2, If 0, Class=Vendor Specific Class, Driver=asix, 480M
/:  Bus 04.Port 1: Dev 1, Class=root_hub, Driver=xhci-hcd/1p, 5000M
/:  Bus 03.Port 1: Dev 1, Class=root_hub, Driver=xhci-hcd/1p, 480M
/:  Bus 02.Port 1: Dev 1, Class=root_hub, Driver=exynos-ohci/3p, 12M
/:  Bus 01.Port 1: Dev 1, Class=root_hub, Driver=exynos-ehci/3p, 480M
    |__ Port 1: Dev 2, If 0, Class=Video, Driver=, 480M
    |__ Port 1: Dev 2, If 1, Class=Video, Driver=, 480M


And the log with mentioned warning:

[   17.768040] 
[   17.772239] WARNING: inconsistent lock state
[   17.776511] 4.16.0-rc3-next-20180227-7-g876c53a7493c #453 Not tainted
[   17.783329] 
[   17.787580] inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
[   17.793607] swapper/0/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
[   17.798751]  (>seq#5){?.-.}, at: [<9b22e5f0>] 
asix_rx_fixup_internal+0x188/0x288

[   17.806790] {IN-HARDIRQ-W} state was registered at:
[   17.811677]   tx_complete+0x100/0x208
[   17.815319]   __usb_hcd_giveback_urb+0x60/0xf0
[   17.819770]   xhci_giveback_urb_in_irq+0xa8/0x240
[   17.824469]   xhci_td_cleanup+0xf4/0x16c
[   17.828367]   xhci_irq+0xe74/0x2240
[   17.831827]   usb_hcd_irq+0x24/0x38
[   17.835343]   __handle_irq_event_percpu+0x98/0x510
[   17.840111]   handle_irq_event_percpu+0x1c/0x58
[   17.844623]   handle_irq_event+0x38/0x5c
[   17.848519]   handle_fasteoi_irq+0xa4/0x138
[   17.852681]   generic_handle_irq+0x18/0x28
[   17.856760]   __handle_domain_irq+0x6c/0xe4
[   17.860941]   gic_handle_irq+0x54/0xa0
[   17.864666]   __irq_svc+0x70/0xb0
[   17.867964]   arch_cpu_idle+0x20/0x3c
[   17.871578]   arch_cpu_idle+0x20/0x3c
[   17.875190]   do_idle+0x144/0x218
[   17.878468]   cpu_startup_entry+0x18/0x1c
[   17.882454]   start_kernel+0x394/0x400
[   17.886177] irq event stamp: 161912
[   17.889616] hardirqs last  enabled at (161912): [<7bedfacf>] 
__netdev_alloc_skb+0xcc/0x140
[   17.897893] hardirqs last disabled at (161911): [] 
__netdev_alloc_skb+0x94/0x140

[   17.904903] exynos5-hsi2c 12ca.i2c: tx timeout
[   17.906116] softirqs last  enabled at (161904): [<387102ff>] 
irq_enter+0x78/0x80
[   17.906123] softirqs last disabled at (161905): [] 
irq_exit+0x134/0x158

[   17.925722]
[   17.925722] other info that might help us debug this:
[   17.933435]  Possible unsafe locking scenario:
[   17.933435]
[   17.940331]        CPU0
[   17.942488]        ----
[   17.944894]   lock(>seq#5);
[   17.948274]   <Interrupt>
[   17.950847]     lock(>seq#5);
[   17.954386]
[   17.954386]  *** DEADLOCK ***
[   17.954386]
[   17.962422] no locks held by swapper/0/0.
[   17.966011]
[   17.966011] stack backtrace:
[   17.971333] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 
4.16.0-rc3-next-20180227-7-g876c53a7493c #453

[   17.980312] Hardware name: SAMSUNG EXYNOS (Flattened Device Tree)
[   17.986380] [] (unwind_backtrace) from [] 
(show_stack+0x10/0x14)
[   17.994128] [] (show_stack) from [] 
(dump_stack+0x90/0xc8)
[   18.001339] [] (dump_stack) from [] 
(print_usage_bug+0x25c/0x2cc)
[   18.009161] [] (print_usage_bug) from [] 
(mark_lock+0x290/0x698)

[   18.014952] exynos5-hsi2c 12ca.i2c: tx timeout
[   18.016899] [] (mark_lock) from [] 
(__lock_acquire+0x454/0x1850)
[   18.029449] [] (__lock_acquire) from [] 
(lock_acquire+0xc8/0x2b8)
[   18.037272] [] (lock_acquire) from [] 
(usbnet_skb_return+0x7c/0x1a0)
[   18.045356] [] (usbnet_skb_return) from [] 
(asix_rx_fixup_internal+0x188/0x288)
[   18.054420] [] (asix_rx_fixup_internal) from [] 
(usbnet_bh+0xf8/0x2e4)
[   18.062694] [] (usbnet_bh) from [] 
(tasklet_action+0x8c/0x13c)
[   18.070259] [] (tasklet_action) from [] 
(__do_softirq+0xd4/0x6d4)
[   18.078089] [] (__do_softirq) from [] 
(irq_exit+0x134/0x158)
[   18.085480] [] (irq_exit) from [] 
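
(For readers unfamiliar with this kind of lockdep splat: the same lock class
is taken from hard-IRQ context by tx_complete() and, without IRQ protection,
from softirq context by the rx path. A minimal sketch of the pattern and the
conventional fix, with illustrative names, not the actual usbnet code:)

#include <linux/spinlock.h>

static DEFINE_SPINLOCK(seq_lock);
static u64 seq;

/* runs in hard-IRQ context (URB completion from the xHCI interrupt) */
static void tx_complete_example(void)
{
	spin_lock(&seq_lock);		/* IRQs are already disabled here */
	seq++;
	spin_unlock(&seq_lock);
}

/* runs in softirq/tasklet context (usbnet_bh -> rx fixup) */
static void rx_fixup_example(void)
{
	unsigned long flags;

	/* a plain spin_lock() here is what lockdep flags: the IRQ that
	 * runs tx_complete_example() could interrupt us while we hold
	 * the lock and deadlock; the IRQ-safe variant avoids that */
	spin_lock_irqsave(&seq_lock, flags);
	seq++;
	spin_unlock_irqrestore(&seq_lock, flags);
}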

Re: [PATCH bpf-next v8 05/11] seccomp,landlock: Enforce Landlock programs per process hierarchy

2018-02-26 Thread Alexei Starovoitov
On Tue, Feb 27, 2018 at 05:20:55AM +, Andy Lutomirski wrote:
> On Tue, Feb 27, 2018 at 4:54 AM, Alexei Starovoitov
>  wrote:
> > On Tue, Feb 27, 2018 at 04:40:34AM +, Andy Lutomirski wrote:
> >> On Tue, Feb 27, 2018 at 2:08 AM, Alexei Starovoitov
> >>  wrote:
> >> > On Tue, Feb 27, 2018 at 01:41:15AM +0100, Mickaël Salaün wrote:
> >> >> The seccomp(2) syscall can be used by a task to apply a Landlock program
> >> >> to itself. As a seccomp filter, a Landlock program is enforced for the
> >> >> current task and all its future children. A program is immutable and a
> >> >> task can only add new restricting programs to itself, forming a list of
> >> >> programs.
> >> >>
> >> >> A Landlock program is tied to a Landlock hook. If the action on a kernel
> >> >> object is allowed by the other Linux security mechanisms (e.g. DAC,
> >> >> capabilities, other LSMs), then a Landlock hook related to this kind of
> >> >> object is triggered. The list of programs for this hook is then
> >> >> evaluated. Each program returns a 32-bit value which can deny the action
> >> >> on a kernel object with a non-zero value. If every program in the list
> >> >> returns zero, then the action on the object is allowed.
> >> >>
> >> >> Multiple Landlock programs can be chained to share a 64-bit value for a
> >> >> call chain (e.g. evaluating multiple elements of a file path).  This
> >> >> chaining is restricted when a process constructs this chain by loading a
> >> >> program, but additional checks are performed when it requests to apply
> >> >> this chain of programs to itself.  The restrictions ensure that it is
> >> >> not possible to call multiple programs in a way that would imply
> >> >> handling multiple shared values (i.e. cookies) for one chain.  For now,
> >> >> only a fs_pick program can be chained to the same type of program,
> >> >> because it may make sense if they have different triggers (cf. next
> >> >> commits).  These restrictions still allow reusing Landlock programs in
> >> >> a safe way (e.g. using the same loaded fs_walk program with multiple
> >> >> chains of fs_pick programs).
> >> >>
> >> >> Signed-off-by: Mickaël Salaün 
> >> >
> >> > ...
> >> >
> >> >> +struct landlock_prog_set *landlock_prepend_prog(
> >> >> + struct landlock_prog_set *current_prog_set,
> >> >> + struct bpf_prog *prog)
> >> >> +{
> >> >> + struct landlock_prog_set *new_prog_set = current_prog_set;
> >> >> + unsigned long pages;
> >> >> + int err;
> >> >> + size_t i;
> >> >> + struct landlock_prog_set tmp_prog_set = {};
> >> >> +
> >> >> + if (prog->type != BPF_PROG_TYPE_LANDLOCK_HOOK)
> >> >> + return ERR_PTR(-EINVAL);
> >> >> +
> >> >> + /* validate memory size allocation */
> >> >> + pages = prog->pages;
> >> >> + if (current_prog_set) {
> >> >> + size_t i;
> >> >> +
> >> >> + for (i = 0; i < ARRAY_SIZE(current_prog_set->programs); 
> >> >> i++) {
> >> >> + struct landlock_prog_list *walker_p;
> >> >> +
> >> >> + for (walker_p = current_prog_set->programs[i];
> >> >> + walker_p; walker_p = 
> >> >> walker_p->prev)
> >> >> + pages += walker_p->prog->pages;
> >> >> + }
> >> >> + /* count a struct landlock_prog_set if we need to 
> >> >> allocate one */
> >> >> + if (refcount_read(&current_prog_set->usage) != 1)
> >> >> + pages += round_up(sizeof(*current_prog_set), 
> >> >> PAGE_SIZE)
> >> >> + / PAGE_SIZE;
> >> >> + }
> >> >> + if (pages > LANDLOCK_PROGRAMS_MAX_PAGES)
> >> >> + return ERR_PTR(-E2BIG);
> >> >> +
> >> >> + /* ensure early that we can allocate enough memory for the new
> >> >> +  * prog_lists */
> >> >> + err = store_landlock_prog(&tmp_prog_set, current_prog_set, prog);
> >> >> + if (err)
> >> >> + return ERR_PTR(err);
> >> >> +
> >> >> + /*
> >> >> +  * Each task_struct points to an array of prog list pointers.  
> >> >> These
> >> >> +  * tables are duplicated when additions are made (which means each
> >> >> +  * table needs to be refcounted for the processes using it). When 
> >> >> a new
> >> >> +  * table is created, all the refcounters on the prog_list are 
> >> >> bumped (to
> >> >> +  * track each table that references the prog). When a new prog is
> >> >> +  * added, it's just prepended to the list for the new table to 
> >> >> point
> >> >> +  * at.
> >> >> +  *
> >> >> +  * Manage all the possible errors before this step to not 
> >> >> uselessly
> >> >> +  * duplicate current_prog_set and avoid a rollback.
> >> >> +  */
> >> >> + if (!new_prog_set) {
> >> >> + /*
> >> >> +  * If there is no Landlock program set used by the 
> >> >> current 

Re: [PATCH bpf-next v8 05/11] seccomp,landlock: Enforce Landlock programs per process hierarchy

2018-02-26 Thread Andy Lutomirski
On Tue, Feb 27, 2018 at 4:54 AM, Alexei Starovoitov
 wrote:
> On Tue, Feb 27, 2018 at 04:40:34AM +, Andy Lutomirski wrote:
>> On Tue, Feb 27, 2018 at 2:08 AM, Alexei Starovoitov
>>  wrote:
>> > On Tue, Feb 27, 2018 at 01:41:15AM +0100, Mickaël Salaün wrote:
>> >> The seccomp(2) syscall can be used by a task to apply a Landlock program
>> >> to itself. As a seccomp filter, a Landlock program is enforced for the
>> >> current task and all its future children. A program is immutable and a
>> >> task can only add new restricting programs to itself, forming a list of
>> >> programs.
>> >>
>> >> A Landlock program is tied to a Landlock hook. If the action on a kernel
>> >> object is allowed by the other Linux security mechanisms (e.g. DAC,
>> >> capabilities, other LSMs), then a Landlock hook related to this kind of
>> >> object is triggered. The list of programs for this hook is then
>> >> evaluated. Each program returns a 32-bit value which can deny the action
>> >> on a kernel object with a non-zero value. If every program in the list
>> >> returns zero, then the action on the object is allowed.
>> >>
>> >> Multiple Landlock programs can be chained to share a 64-bit value for a
>> >> call chain (e.g. evaluating multiple elements of a file path).  This
>> >> chaining is restricted when a process constructs this chain by loading a
>> >> program, but additional checks are performed when it requests to apply
>> >> this chain of programs to itself.  The restrictions ensure that it is
>> >> not possible to call multiple programs in a way that would imply
>> >> handling multiple shared values (i.e. cookies) for one chain.  For now,
>> >> only a fs_pick program can be chained to the same type of program,
>> >> because it may make sense if they have different triggers (cf. next
>> >> commits).  These restrictions still allow reusing Landlock programs in
>> >> a safe way (e.g. using the same loaded fs_walk program with multiple
>> >> chains of fs_pick programs).
>> >>
>> >> Signed-off-by: Mickaël Salaün 
>> >
>> > ...
>> >
>> >> +struct landlock_prog_set *landlock_prepend_prog(
>> >> + struct landlock_prog_set *current_prog_set,
>> >> + struct bpf_prog *prog)
>> >> +{
>> >> + struct landlock_prog_set *new_prog_set = current_prog_set;
>> >> + unsigned long pages;
>> >> + int err;
>> >> + size_t i;
>> >> + struct landlock_prog_set tmp_prog_set = {};
>> >> +
>> >> + if (prog->type != BPF_PROG_TYPE_LANDLOCK_HOOK)
>> >> + return ERR_PTR(-EINVAL);
>> >> +
>> >> + /* validate memory size allocation */
>> >> + pages = prog->pages;
>> >> + if (current_prog_set) {
>> >> + size_t i;
>> >> +
>> >> + for (i = 0; i < ARRAY_SIZE(current_prog_set->programs); 
>> >> i++) {
>> >> + struct landlock_prog_list *walker_p;
>> >> +
>> >> + for (walker_p = current_prog_set->programs[i];
>> >> + walker_p; walker_p = walker_p->prev)
>> >> + pages += walker_p->prog->pages;
>> >> + }
>> >> + /* count a struct landlock_prog_set if we need to allocate 
>> >> one */
>> >> + if (refcount_read(&current_prog_set->usage) != 1)
>> >> + pages += round_up(sizeof(*current_prog_set), 
>> >> PAGE_SIZE)
>> >> + / PAGE_SIZE;
>> >> + }
>> >> + if (pages > LANDLOCK_PROGRAMS_MAX_PAGES)
>> >> + return ERR_PTR(-E2BIG);
>> >> +
>> >> + /* ensure early that we can allocate enough memory for the new
>> >> +  * prog_lists */
>> >> + err = store_landlock_prog(&tmp_prog_set, current_prog_set, prog);
>> >> + if (err)
>> >> + return ERR_PTR(err);
>> >> +
>> >> + /*
>> >> +  * Each task_struct points to an array of prog list pointers.  These
>> >> +  * tables are duplicated when additions are made (which means each
>> >> +  * table needs to be refcounted for the processes using it). When a 
>> >> new
>> >> +  * table is created, all the refcounters on the prog_list are 
>> >> bumped (to
>> >> +  * track each table that references the prog). When a new prog is
>> >> +  * added, it's just prepended to the list for the new table to point
>> >> +  * at.
>> >> +  *
>> >> +  * Manage all the possible errors before this step to not uselessly
>> >> +  * duplicate current_prog_set and avoid a rollback.
>> >> +  */
>> >> + if (!new_prog_set) {
>> >> + /*
>> >> +  * If there is no Landlock program set used by the current 
>> >> task,
>> >> +  * then create a new one.
>> >> +  */
>> >> + new_prog_set = new_landlock_prog_set();
>> >> + if (IS_ERR(new_prog_set))
>> >> + goto put_tmp_lists;
>> >> + } else if (refcount_read(&current_prog_set->usage) > 1) {
>> 

Re: [PATCH bpf-next v8 08/11] landlock: Add ptrace restrictions

2018-02-26 Thread Andy Lutomirski


> On Feb 26, 2018, at 8:17 PM, Andy Lutomirski  wrote:
> 
>> On Tue, Feb 27, 2018 at 12:41 AM, Mickaël Salaün  wrote:
>> A landlocked process has less privileges than a non-landlocked process
>> and must then be subject to additional restrictions when manipulating
>> processes. To be allowed to use ptrace(2) and related syscalls on a
>> target process, a landlocked process must have a subset of the target
>> process' rules.
>> 
>> Signed-off-by: Mickaël Salaün 
>> Cc: Alexei Starovoitov 
>> Cc: Andy Lutomirski 
>> Cc: Daniel Borkmann 
>> Cc: David S. Miller 
>> Cc: James Morris 
>> Cc: Kees Cook 
>> Cc: Serge E. Hallyn 
>> ---
>> 
>> Changes since v6:
>> * factor out ptrace check
>> * constify pointers
>> * cleanup headers
>> * use the new security_add_hooks()
>> ---
>> security/landlock/Makefile   |   2 +-
>> security/landlock/hooks_ptrace.c | 124 
>> +++
>> security/landlock/hooks_ptrace.h |  11 
>> security/landlock/init.c |   2 +
>> 4 files changed, 138 insertions(+), 1 deletion(-)
>> create mode 100644 security/landlock/hooks_ptrace.c
>> create mode 100644 security/landlock/hooks_ptrace.h
>> 
>> diff --git a/security/landlock/Makefile b/security/landlock/Makefile
>> index d0f532a93b4e..605504d852d3 100644
>> --- a/security/landlock/Makefile
>> +++ b/security/landlock/Makefile
>> @@ -3,4 +3,4 @@ obj-$(CONFIG_SECURITY_LANDLOCK) := landlock.o
>> landlock-y := init.o chain.o task.o \
>>tag.o tag_fs.o \
>>enforce.o enforce_seccomp.o \
>> -   hooks.o hooks_cred.o hooks_fs.o
>> +   hooks.o hooks_cred.o hooks_fs.o hooks_ptrace.o
>> diff --git a/security/landlock/hooks_ptrace.c 
>> b/security/landlock/hooks_ptrace.c
>> new file mode 100644
>> index ..f1b977b9c808
>> --- /dev/null
>> +++ b/security/landlock/hooks_ptrace.c
>> @@ -0,0 +1,124 @@
>> +/*
>> + * Landlock LSM - ptrace hooks
>> + *
>> + * Copyright © 2017 Mickaël Salaün 
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License version 2, as
>> + * published by the Free Software Foundation.
>> + */
>> +
>> +#include 
>> +#include 
>> +#include  /* ARRAY_SIZE */
>> +#include 
>> +#include  /* struct task_struct */
>> +#include 
>> +
>> +#include "common.h" /* struct landlock_prog_set */
>> +#include "hooks.h" /* landlocked() */
>> +#include "hooks_ptrace.h"
>> +
>> +static bool progs_are_subset(const struct landlock_prog_set *parent,
>> +   const struct landlock_prog_set *child)
>> +{
>> +   size_t i;
>> +
>> +   if (!parent || !child)
>> +   return false;
>> +   if (parent == child)
>> +   return true;
>> +
>> +   for (i = 0; i < ARRAY_SIZE(child->programs); i++) {
> 
> ARRAY_SIZE(child->programs) seems misleading.  Is there no define
> NUM_LANDLOCK_PROG_TYPES or similar?
> 
>> +   struct landlock_prog_list *walker;
>> +   bool found_parent = false;
>> +
>> +   if (!parent->programs[i])
>> +   continue;
>> +   for (walker = child->programs[i]; walker;
>> +   walker = walker->prev) {
>> +   if (walker == parent->programs[i]) {
>> +   found_parent = true;
>> +   break;
>> +   }
>> +   }
>> +   if (!found_parent)
>> +   return false;
>> +   }
>> +   return true;
>> +}
> 
> If you used seccomp, you'd get this type of check for free, and it
> would be a lot easier to comprehend.  AFAICT the only extra leniency
> you're granting is that you're agnostic to the order in which the
> rules associated with different program types were applied, which
> could easily be added to seccomp.
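
(For reference, the seccomp check in question is a single ancestor walk,
roughly the following, paraphrased from kernel/seccomp.c; field names may
differ between kernel versions:)

/* Returns 1 if parent is an ancestor of (or equal to) child. */
static int is_ancestor(struct seccomp_filter *parent,
		       struct seccomp_filter *child)
{
	/* NULL is the root ancestor. */
	if (parent == NULL)
		return 1;
	for (; child; child = child->prev)
		if (child == parent)
			return 1;
	return 0;
}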

On second thought, this is all way too complicated.  I think the correct logic 
is either "if you are filtered by Landlock, you cannot ptrace anything" or to 
delete this patch entirely. If something like Tycho's notifiers goes in, then 
it's not obvious that, just because you have the same set of filters, you have 
the same privilege.  Similarly, if a feature that lets a filter query its 
cgroup goes in (and you proposed this once!) then the logic you implemented 
here is wrong.

Or you could just say that it's the responsibility of a Landlock user to 
properly filter ptrace() just like it's the responsibility of seccomp users to 
filter ptrace if needed.
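
(Under that model, a sandbox that cares would install something like the
following before applying its Landlock programs; a minimal sketch assuming
libseccomp is available:)

#include <errno.h>
#include <seccomp.h>

static int deny_ptrace(void)
{
	int rc;
	scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW);

	if (!ctx)
		return -ENOMEM;
	/* fail any ptrace(2) call with EPERM instead of allowing it */
	seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(ptrace), 0);
	rc = seccomp_load(ctx);
	seccomp_release(ctx);
	return rc;
}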

I take this as further evidence that Landlock makes much more sense as part of 
seccomp than as a totally separate thing.  We've very carefully reviewed these 
things for seccomp.  Please don't make us do it again from scratch.

Re: [PATCH bpf-next v8 05/11] seccomp,landlock: Enforce Landlock programs per process hierarchy

2018-02-26 Thread Alexei Starovoitov
On Tue, Feb 27, 2018 at 04:40:34AM +, Andy Lutomirski wrote:
> On Tue, Feb 27, 2018 at 2:08 AM, Alexei Starovoitov
>  wrote:
> > On Tue, Feb 27, 2018 at 01:41:15AM +0100, Mickaël Salaün wrote:
> >> The seccomp(2) syscall can be used by a task to apply a Landlock program
> >> to itself. As a seccomp filter, a Landlock program is enforced for the
> >> current task and all its future children. A program is immutable and a
> >> task can only add new restricting programs to itself, forming a list of
> >> programs.
> >>
> >> A Landlock program is tied to a Landlock hook. If the action on a kernel
> >> object is allowed by the other Linux security mechanisms (e.g. DAC,
> >> capabilities, other LSMs), then a Landlock hook related to this kind of
> >> object is triggered. The list of programs for this hook is then
> >> evaluated. Each program returns a 32-bit value which can deny the action
> >> on a kernel object with a non-zero value. If every program in the list
> >> returns zero, then the action on the object is allowed.
> >>
> >> Multiple Landlock programs can be chained to share a 64-bit value for a
> >> call chain (e.g. evaluating multiple elements of a file path).  This
> >> chaining is restricted when a process constructs this chain by loading a
> >> program, but additional checks are performed when it requests to apply
> >> this chain of programs to itself.  The restrictions ensure that it is
> >> not possible to call multiple programs in a way that would imply
> >> handling multiple shared values (i.e. cookies) for one chain.  For now,
> >> only a fs_pick program can be chained to the same type of program,
> >> because it may make sense if they have different triggers (cf. next
> >> commits).  These restrictions still allow reusing Landlock programs in
> >> a safe way (e.g. using the same loaded fs_walk program with multiple
> >> chains of fs_pick programs).
> >>
> >> Signed-off-by: Mickaël Salaün 
> >
> > ...
> >
> >> +struct landlock_prog_set *landlock_prepend_prog(
> >> + struct landlock_prog_set *current_prog_set,
> >> + struct bpf_prog *prog)
> >> +{
> >> + struct landlock_prog_set *new_prog_set = current_prog_set;
> >> + unsigned long pages;
> >> + int err;
> >> + size_t i;
> >> + struct landlock_prog_set tmp_prog_set = {};
> >> +
> >> + if (prog->type != BPF_PROG_TYPE_LANDLOCK_HOOK)
> >> + return ERR_PTR(-EINVAL);
> >> +
> >> + /* validate memory size allocation */
> >> + pages = prog->pages;
> >> + if (current_prog_set) {
> >> + size_t i;
> >> +
> >> + for (i = 0; i < ARRAY_SIZE(current_prog_set->programs); i++) 
> >> {
> >> + struct landlock_prog_list *walker_p;
> >> +
> >> + for (walker_p = current_prog_set->programs[i];
> >> + walker_p; walker_p = walker_p->prev)
> >> + pages += walker_p->prog->pages;
> >> + }
> >> + /* count a struct landlock_prog_set if we need to allocate 
> >> one */
> >> + if (refcount_read(&current_prog_set->usage) != 1)
> >> + pages += round_up(sizeof(*current_prog_set), 
> >> PAGE_SIZE)
> >> + / PAGE_SIZE;
> >> + }
> >> + if (pages > LANDLOCK_PROGRAMS_MAX_PAGES)
> >> + return ERR_PTR(-E2BIG);
> >> +
> >> + /* ensure early that we can allocate enough memory for the new
> >> +  * prog_lists */
> >> + err = store_landlock_prog(&tmp_prog_set, current_prog_set, prog);
> >> + if (err)
> >> + return ERR_PTR(err);
> >> +
> >> + /*
> >> +  * Each task_struct points to an array of prog list pointers.  These
> >> +  * tables are duplicated when additions are made (which means each
> >> +  * table needs to be refcounted for the processes using it). When a 
> >> new
> >> +  * table is created, all the refcounters on the prog_list are bumped 
> >> (to
> >> +  * track each table that references the prog). When a new prog is
> >> +  * added, it's just prepended to the list for the new table to point
> >> +  * at.
> >> +  *
> >> +  * Manage all the possible errors before this step to not uselessly
> >> +  * duplicate current_prog_set and avoid a rollback.
> >> +  */
> >> + if (!new_prog_set) {
> >> + /*
> >> +  * If there is no Landlock program set used by the current 
> >> task,
> >> +  * then create a new one.
> >> +  */
> >> + new_prog_set = new_landlock_prog_set();
> >> + if (IS_ERR(new_prog_set))
> >> + goto put_tmp_lists;
> >> + } else if (refcount_read(&current_prog_set->usage) > 1) {
> >> + /*
> >> +  * If the current task is not the sole user of its Landlock
> >> +  * program set, then duplicate them.
> >> +  */
> >> +  

Re: [net-next v3 0/2] eBPF seccomp filters

2018-02-26 Thread Andy Lutomirski


> On Feb 26, 2018, at 8:38 PM, Kees Cook  wrote:
> 
> On Mon, Feb 26, 2018 at 8:19 PM, Andy Lutomirski  wrote:
>>> On Feb 26, 2018, at 3:20 PM, Kees Cook  wrote:
>>> 
>>> On Mon, Feb 26, 2018 at 3:04 PM, Alexei Starovoitov
>>>  wrote:
> On Mon, Feb 26, 2018 at 07:26:54AM +, Sargun Dhillon wrote:
> This patchset enables seccomp filters to be written in eBPF. Although, 
> this
> [...]
 The main statement I want to hear from seccomp maintainers before
 proceeding any further on this that enabling eBPF in seccomp won't lead
 to seccomp folks arguing against changes in bpf core (like verifier)
 just because it's used by seccomp.
 It must be spelled out in the commit log with explicit Ack.
>>> 
>>> The primary thing I'm concerned about with eBPF and seccomp is
>>> side-effects from eBPF programs running at syscall time. This is an
>>> extremely sensitive area, and I want to be sure there won't be
>>> feature-creep here that leads to seccomp getting into a bad state.
>>> 
>>> As long as seccomp can continue to have its own verifier, I *think* this
>>> will be fine, though, again I remain concerned about maps, etc. I'm
>>> still reviewing these patches and how they might provide overlap with
>>> Tycho's needs too, etc.
>> 
>> I'm not sure I see this as a huge problem.  As far as I can see, there
>> are three ways that a verifier change could be problematic:
>> 
>> 1. Addition of a new type of map.  But seccomp would just not allow
>> new map types by default, right?
>> 
>> 2. Addition of a new BPF_CALLable helper.  Seccomp wants a way to
>> whitelist BPF_CALL targets.  That should be straightforward.
> 
> Yup, agreed on 1 and 2.
> 
>> 3. Straight-up bugs.  Those are exactly as problematic as verifier
>> bugs in any other unprivileged eBPF program type, right?  I don't see
>> why seccomp is special here.
> 
> My concern is more about unintended design mistakes or other feature
> creep with side-effects, especially when it comes to privileges and
> synchronization. Getting no-new-privs done correctly, for example,
> took some careful thought and discussion, and I'm shy from how painful
> TSYNC was on the process locking side, and eBPF has had some rather
> ugly flaws in the past (and recently: it was nice to be able to say
> for Spectre that seccomp filters couldn't be constructed to make
> attacks but eBPF could). Adding the complexity needs to be worth the
> gain. I'm on board for doing it, I just want to be careful. :)
> 

I agree.  I think that, if we do this right, we get a clean version of Tycho's 
notifiers.  We can also very easily build on that to send a non-blocking 
message to the notifier fd, which gets us a version of seccomp logging that 
works for things like Chromium and even strace.  I think this is worth it.

I also think this sort of argument is why Mickaël's privileged-first Landlock 
is the wrong approach.  By getting the unprivileged parts right from day one, 
we can carefully extend the mechanism and keep it usable by unprivileged apps.  
But, if we'd started as root-only, fixing up everything needed to make it safe 
for unprivileged users after the fact would have been quite messy.

And the considerations for making eBPF safe for use by unprivileged tasks to 
filter their descendents are more or less the same for seccomp and Landlock.  
Can we please arrange things so we solve this problem only once?

Re: [PATCH bpf-next v8 05/11] seccomp,landlock: Enforce Landlock programs per process hierarchy

2018-02-26 Thread Andy Lutomirski
On Tue, Feb 27, 2018 at 2:08 AM, Alexei Starovoitov
 wrote:
> On Tue, Feb 27, 2018 at 01:41:15AM +0100, Mickaël Salaün wrote:
>> The seccomp(2) syscall can be used by a task to apply a Landlock program
>> to itself. As a seccomp filter, a Landlock program is enforced for the
>> current task and all its future children. A program is immutable and a
>> task can only add new restricting programs to itself, forming a list of
>> programs.
>>
>> A Landlock program is tied to a Landlock hook. If the action on a kernel
>> object is allowed by the other Linux security mechanisms (e.g. DAC,
>> capabilities, other LSMs), then a Landlock hook related to this kind of
>> object is triggered. The list of programs for this hook is then
>> evaluated. Each program returns a 32-bit value which can deny the action
>> on a kernel object with a non-zero value. If every program in the list
>> returns zero, then the action on the object is allowed.
>>
>> Multiple Landlock programs can be chained to share a 64-bit value for a
>> call chain (e.g. evaluating multiple elements of a file path).  This
>> chaining is restricted when a process constructs this chain by loading a
>> program, but additional checks are performed when it requests to apply
>> this chain of programs to itself.  The restrictions ensure that it is
>> not possible to call multiple programs in a way that would imply
>> handling multiple shared values (i.e. cookies) for one chain.  For now,
>> only a fs_pick program can be chained to the same type of program,
>> because it may make sense if they have different triggers (cf. next
>> commits).  These restrictions still allow reusing Landlock programs in
>> a safe way (e.g. using the same loaded fs_walk program with multiple
>> chains of fs_pick programs).
>>
>> Signed-off-by: Mickaël Salaün 
>
> ...
>
>> +struct landlock_prog_set *landlock_prepend_prog(
>> + struct landlock_prog_set *current_prog_set,
>> + struct bpf_prog *prog)
>> +{
>> + struct landlock_prog_set *new_prog_set = current_prog_set;
>> + unsigned long pages;
>> + int err;
>> + size_t i;
>> + struct landlock_prog_set tmp_prog_set = {};
>> +
>> + if (prog->type != BPF_PROG_TYPE_LANDLOCK_HOOK)
>> + return ERR_PTR(-EINVAL);
>> +
>> + /* validate memory size allocation */
>> + pages = prog->pages;
>> + if (current_prog_set) {
>> + size_t i;
>> +
>> + for (i = 0; i < ARRAY_SIZE(current_prog_set->programs); i++) {
>> + struct landlock_prog_list *walker_p;
>> +
>> + for (walker_p = current_prog_set->programs[i];
>> + walker_p; walker_p = walker_p->prev)
>> + pages += walker_p->prog->pages;
>> + }
>> + /* count a struct landlock_prog_set if we need to allocate one 
>> */
>> + if (refcount_read(&current_prog_set->usage) != 1)
>> + pages += round_up(sizeof(*current_prog_set), PAGE_SIZE)
>> + / PAGE_SIZE;
>> + }
>> + if (pages > LANDLOCK_PROGRAMS_MAX_PAGES)
>> + return ERR_PTR(-E2BIG);
>> +
>> + /* ensure early that we can allocate enough memory for the new
>> +  * prog_lists */
>> + err = store_landlock_prog(&tmp_prog_set, current_prog_set, prog);
>> + if (err)
>> + return ERR_PTR(err);
>> +
>> + /*
>> +  * Each task_struct points to an array of prog list pointers.  These
>> +  * tables are duplicated when additions are made (which means each
>> +  * table needs to be refcounted for the processes using it). When a new
>> +  * table is created, all the refcounters on the prog_list are bumped 
>> (to
>> +  * track each table that references the prog). When a new prog is
>> +  * added, it's just prepended to the list for the new table to point
>> +  * at.
>> +  *
>> +  * Manage all the possible errors before this step to not uselessly
>> +  * duplicate current_prog_set and avoid a rollback.
>> +  */
>> + if (!new_prog_set) {
>> + /*
>> +  * If there is no Landlock program set used by the current 
>> task,
>> +  * then create a new one.
>> +  */
>> + new_prog_set = new_landlock_prog_set();
>> + if (IS_ERR(new_prog_set))
>> + goto put_tmp_lists;
>> + } else if (refcount_read(&current_prog_set->usage) > 1) {
>> + /*
>> +  * If the current task is not the sole user of its Landlock
>> +  * program set, then duplicate them.
>> +  */
>> + new_prog_set = new_landlock_prog_set();
>> + if (IS_ERR(new_prog_set))
>> + goto put_tmp_lists;
>> + for (i = 0; i < ARRAY_SIZE(new_prog_set->programs); i++) {
>> + new_prog_set->programs[i] =
>> +   

Re: [net-next v3 0/2] eBPF seccomp filters

2018-02-26 Thread Kees Cook
On Mon, Feb 26, 2018 at 8:19 PM, Andy Lutomirski  wrote:
>> On Feb 26, 2018, at 3:20 PM, Kees Cook  wrote:
>>
>> On Mon, Feb 26, 2018 at 3:04 PM, Alexei Starovoitov
>>  wrote:
 On Mon, Feb 26, 2018 at 07:26:54AM +, Sargun Dhillon wrote:
 This patchset enables seccomp filters to be written in eBPF. Although, this
 [...]
>>> The main statement I want to hear from seccomp maintainers before
>>> proceeding any further on this that enabling eBPF in seccomp won't lead
>>> to seccomp folks arguing against changes in bpf core (like verifier)
>>> just because it's used by seccomp.
>>> It must be spelled out in the commit log with explicit Ack.
>>
>> The primary thing I'm concerned about with eBPF and seccomp is
>> side-effects from eBPF programs running at syscall time. This is an
>> extremely sensitive area, and I want to be sure there won't be
>> feature-creep here that leads to seccomp getting into a bad state.
>>
>> As long as seccomp can continue to have its own verifier, I *think* this
>> will be fine, though, again I remain concerned about maps, etc. I'm
>> still reviewing these patches and how they might provide overlap with
>> Tycho's needs too, etc.
>
> I'm not sure I see this as a huge problem.  As far as I can see, there
> are three ways that a verifier change could be problematic:
>
> 1. Addition of a new type of map.  But seccomp would just not allow
> new map types by default, right?
>
> 2. Addition of a new BPF_CALLable helper.  Seccomp wants a way to
> whitelist BPF_CALL targets.  That should be straightforward.

Yup, agreed on 1 and 2.

> 3. Straight-up bugs.  Those are exactly as problematic as verifier
> bugs in any other unprivileged eBPF program type, right?  I don't see
> why seccomp is special here.

My concern is more about unintended design mistakes or other feature
creep with side-effects, especially when it comes to privileges and
synchronization. Getting no-new-privs done correctly, for example,
took some careful thought and discussion, and I'm shy from how painful
TSYNC was on the process locking side, and eBPF has had some rather
ugly flaws in the past (and recently: it was nice to be able to say
for Spectre that seccomp filters couldn't be constructed to make
attacks but eBPF could). Adding the complexity needs to be worth the
gain. I'm on board for doing it, I just want to be careful. :)

-Kees

-- 
Kees Cook
Pixel Security


Re: [net] cb0789fd8b: INFO:task_blocked_for_more_than#seconds

2018-02-26 Thread Florian Fainelli
On February 26, 2018 7:02:34 PM PST, kernel test robot <fengguang...@intel.com> 
wrote:
>FYI, we noticed the following commit (built with gcc-6):
>
>commit: cb0789fd8b03b59948257b67531d0b5fd87061c5 ("net: phy: Restore
>phy_resume() locking assumption")
>url:
>https://github.com/0day-ci/linux/commits/Andrew-Lunn/net-phy-Restore-phy_resume-locking-assumption/20180226-220932

This was testing v1 of Andrew's patch, which had a mistake leading to a
deadlock; v2 does not have that problem.

-- 
Florian


Re: [PATCH bpf-next v8 00/11] Landlock LSM: Toward unprivileged sandboxing

2018-02-26 Thread Andy Lutomirski
On Tue, Feb 27, 2018 at 12:41 AM, Mickaël Salaün  wrote:
> Hi,
>
> This eighth series is a major revamp of the Landlock design compared to
> the previous series [1]. This enables more flexibility and granularity
> of access control with file paths. It is now possible to enforce an
> access control according to a file hierarchy. Landlock uses the concept
> of inode and path to identify such hierarchy. In a way, it brings tools
> to program what is a file hierarchy.
>
> There are now three types of Landlock hooks: FS_WALK, FS_PICK and FS_GET.
> Each of them accepts a dedicated eBPF program, called a Landlock
> program.  They can be chained to enforce a full access control according
> to a list of directories or files. The set of actions on a file is well
> defined (e.g. read, write, ioctl, append, lock, mount...) taking
> inspiration from the major Linux LSMs and some other access-controls
> like Capsicum.  These program types are designed to be cache-friendly,
> which give room for optimizations in the future.
>
> The documentation patch contains some kernel documentation and
> explanations on how to use Landlock.  The compiled documentation and
> a talk I gave at FOSDEM can be found here: https://landlock.io
> This patch series can be found in the branch landlock-v8 in this repo:
> https://github.com/landlock-lsm/linux
>
> There are still some minor issues with this patch series, but it should
> demonstrate how powerful this design may be. One of these issues is that
> it is not a stackable LSM anymore, but the infrastructure for managing
> security blobs should allow stacking it with other LSMs [4].
>
> This is the first step of the roadmap discussed at LPC [2].  While the
> intended final goal is to allow unprivileged users to use Landlock, this
> series allows only a process with global CAP_SYS_ADMIN to load and
> enforce a rule.  This may help to get feedback and avoid unexpected
> behaviors.
>
> This series can be applied on top of bpf-next, commit 7d72637eb39f
> ("Merge branch 'x86-jit'").  This can be tested with
> CONFIG_SECCOMP_FILTER and CONFIG_SECURITY_LANDLOCK.  I would really
> appreciate constructive comments on the design and the code.
>
>
> # Landlock LSM
>
> The goal of this new Linux Security Module (LSM) called Landlock is to
> allow any process, including unprivileged ones, to create powerful
> security sandboxes comparable to XNU Sandbox or OpenBSD Pledge. This
> kind of sandbox is expected to help mitigate the security impact of bugs
> or unexpected/malicious behaviors in user-space applications.
>
> The approach taken is to add the minimum amount of code while still
> allowing the user-space application to create quite complex access
> rules.  A dedicated security policy language such as the one used by
> SELinux, AppArmor and other major LSMs involves a lot of code and is
> usually permitted to only a trusted user (i.e. root).  On the contrary,
> eBPF programs already exist and are designed to be safely loaded by
> unprivileged user-space.
>
> This design does not seem too intrusive but is flexible enough to allow
> a powerful sandbox mechanism accessible by any process on Linux. The use
> of seccomp and Landlock is more suitable with the help of a user-space
> library (e.g.  libseccomp) that could help to specify a high-level
> language to express a security policy instead of raw eBPF programs.
> Moreover, thanks to the LLVM front-end, it is quite easy to write an
> eBPF program with a subset of the C language.
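
(As an illustration of such a restricted-C program, a hypothetical fs_pick
Landlock program; the SEC() name and the context type are made up for the
example, not the series' exact ABI:)

/* hypothetical Landlock program in restricted C for the LLVM eBPF
 * back-end; the section name and context layout are illustrative */
SEC("landlock/fs_pick")
int deny_everything(struct landlock_ctx_fs_pick *ctx)
{
	return 1;	/* non-zero denies the action on the object */
}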
>
>
> # Frequently asked questions
>
> ## Why is seccomp-bpf not enough?
>
> A seccomp filter can access only raw syscall arguments (i.e. the
> register values) which means that it is not possible to filter according
> to the value pointed to by an argument, such as a file pathname. As an
> embryonic Landlock version demonstrated, filtering at the syscall level
> is complicated (e.g. need to take care of race conditions). This is
> mainly because the access control checkpoints of the kernel are not at
> this high-level but more underneath, at the LSM-hook level. The LSM
> hooks are designed to handle this kind of checks.  Landlock abstracts
> this approach to leverage the ability of unprivileged users to limit
> themselves.
>
> Cf. section "What it isn't?" in Documentation/prctl/seccomp_filter.txt
>
>
> ## Why use the seccomp(2) syscall?
>
> Landlock uses the same semantics as seccomp to apply access rule
> restrictions. It adds a new layer of security for the current process
> which is inherited by its children. It makes sense to use a unique
> access-restricting syscall (that should be allowed by seccomp filters)
> which can only drop privileges. Moreover, a Landlock rule could come
> from outside a process (e.g.  passed through a UNIX socket). It is then
> useful to differentiate the creation/load of Landlock eBPF programs via
> bpf(2), from rule enforcement via seccomp(2).
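
(Concretely, the split described above would look roughly like this from
user space; SECCOMP_PREPEND_LANDLOCK_PROG and its value are hypothetical
stand-ins for the series' seccomp(2) extension, not a stable ABI:)

#include <err.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SECCOMP_PREPEND_LANDLOCK_PROG
#define SECCOMP_PREPEND_LANDLOCK_PROG 4	/* hypothetical value */
#endif

static void enforce_landlock(int prog_fd)
{
	/* prog_fd was obtained earlier from bpf(BPF_PROG_LOAD, ...) */
	if (syscall(__NR_seccomp, SECCOMP_PREPEND_LANDLOCK_PROG, 0, &prog_fd))
		err(1, "seccomp(landlock)");
}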

This seems like a weak argument to me.  Sure, this is a bit different
from seccomp(), 

Re: [net-next v3 1/2] bpf, seccomp: Add eBPF filter capabilities

2018-02-26 Thread Kees Cook
On Mon, Feb 26, 2018 at 8:08 PM, Sargun Dhillon  wrote:
> On Mon, Feb 26, 2018 at 7:57 PM, Tycho Andersen  wrote:
>> On Mon, Feb 26, 2018 at 07:49:48PM -0800, Sargun Dhillon wrote:
>>> On Mon, Feb 26, 2018 at 4:54 PM, Tycho Andersen  wrote:
>>> > On Mon, Feb 26, 2018 at 07:27:05AM +, Sargun Dhillon wrote:
>>> >> +config SECCOMP_FILTER_EXTENDED
>>> >> + bool "Extended BPF seccomp filters"
>>> >> + depends on SECCOMP_FILTER && BPF_SYSCALL
>>> >> + depends on !CHECKPOINT_RESTORE
>>> >
>>> > Why not just give -EINVAL or something in case one of these is
>>> > requested, instead of making them incompatible at compile time?
>>> >
>>> > Tycho
>>> There's already code to return -EMEDIUMTYPE if it's a non-classic or
>>> non-saved filter. Under the normal case, with CHECKPOINT_RESTORE
>>> enabled, you should never be able to get that. I think it makes sense
>>> to preserve this behaviour.
>>
>> Oh, right. So can't we just drop this, and the existing code will
>> DTRT, i.e. give you -EMEDIUMTYPE because the new filters aren't
>> supported, until they are?
>>
>> Tycho
> My suggestion is we merge this as is, so we don't break checkpoint /
> restore, and I will try to get the filter dumping patch in the same
> development cycle, as it comes at minimal risk. Otherwise, we risk
> introducing a feature which could break checkpoint/restore, even in
> unprivileged containers, since anyone can load a BPF seccomp filter.

There is no rush to merge such a drastic expansion of the seccomp
attack surface. :) For me, the driving feature is if we can get
Tycho's notifier implemented in eBPF. The speed improvements, as far
as I'm concerned, aren't sufficient to add eBPF to seccomp. They are
certainly a nice benefit, but seccomp must be very conservative about
adding attack surface.

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH bpf-next 3/8] tools: bpftool: detect sub-programs from the eBPF sequence

2018-02-26 Thread Alexei Starovoitov
On Mon, Feb 26, 2018 at 08:12:49PM -0800, Jakub Kicinski wrote:
> From: Jiong Wang 
> 
> This patch detects all sub-programs from the eBPF sequence and keeps the
> information in the new CFG data structure.
> 
> The detection algorithm is basically the same as the one in the verifier,
> except that we need to use insn->off instead of insn->imm to get the
> pc-relative call offset, because the verifier modifies insn->off/insn->imm
> when it finishes verification.
> 
> Also, we don't need to repeat some sanity checks that the verifier has
> already done.
> 
> Signed-off-by: Jiong Wang 
> Acked-by: Jakub Kicinski 

overall idea looks good to me.
One step towards proper decompiler ;)
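
(The detection itself boils down to one pass over the instructions,
roughly as below; mark_subprog_start() is an illustrative helper, not the
patch's actual function name:)

/* every pc-relative call target starts a sub-program; in a
 * post-verifier dump the offset lives in insn->off, not insn->imm */
static void find_subprog_starts(struct bpf_insn *insns, unsigned int cnt)
{
	unsigned int i;

	for (i = 0; i < cnt; i++) {
		struct bpf_insn *insn = &insns[i];

		if (insn->code != (BPF_JMP | BPF_CALL) ||
		    insn->src_reg != BPF_PSEUDO_CALL)
			continue;
		mark_subprog_start(insns + i + insn->off + 1);
	}
}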

> diff --git a/tools/bpf/bpftool/cfg.c b/tools/bpf/bpftool/cfg.c
> new file mode 100644
> index ..42a0d4c4271c
> --- /dev/null
> +++ b/tools/bpf/bpftool/cfg.c
> @@ -0,0 +1,143 @@
> +/*
> + * Copyright (C) 2018 Netronome Systems, Inc.
> + *
> + * This software is dual licensed under the GNU General License Version 2,
> + * June 1991 as shown in the file COPYING in the top-level directory of this
> + * source tree or the BSD 2-Clause License provided below.  You have the
> + * option to license this software under the complete terms of either 
> license.

please use SPDX instead.
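
(i.e. replace the whole license block with a single tag, something like:)

// SPDX-License-Identifier: (GPL-2.0 OR BSD-2-Clause)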



Re: [net-next v3 0/2] eBPF seccomp filters

2018-02-26 Thread Andy Lutomirski
> On Feb 26, 2018, at 3:20 PM, Kees Cook  wrote:
>
> On Mon, Feb 26, 2018 at 3:04 PM, Alexei Starovoitov
>  wrote:
>>> On Mon, Feb 26, 2018 at 07:26:54AM +, Sargun Dhillon wrote:
>>> This patchset enables seccomp filters to be written in eBPF. Although, this
>>> [...]
>> The main statement I want to hear from seccomp maintainers before
>> proceeding any further on this that enabling eBPF in seccomp won't lead
>> to seccomp folks arguing against changes in bpf core (like verifier)
>> just because it's used by seccomp.
>> It must be spelled out in the commit log with explicit Ack.
>
> The primary thing I'm concerned about with eBPF and seccomp is
> side-effects from eBPF programs running at syscall time. This is an
> extremely sensitive area, and I want to be sure there won't be
> feature-creep here that leads to seccomp getting into a bad state.
>
> As long as seccomp can continue to have its own verifier, I *think* this
> will be fine, though, again I remain concerned about maps, etc. I'm
> still reviewing these patches and how they might provide overlap with
> Tycho's needs too, etc.

I'm not sure I see this as a huge problem.  As far as I can see, there
are three ways that a verifier change could be problematic:

1. Addition of a new type of map.  But seccomp would just not allow
new map types by default, right?

2. Addition of a new BPF_CALLable helper.  Seccomp wants a way to
whitelist BPF_CALL targets.  That should be straightforward.

3. Straight-up bugs.  Those are exactly as problematic as verifier
bugs in any other unprivileged eBPF program type, right?  I don't see
why seccomp is special here.
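
(Point 2 is mechanical: a seccomp eBPF program type would supply its own
get_func_proto callback and return NULL for anything not explicitly
allowed. A minimal sketch; the exact callback signature varies by kernel
version, and the one helper shown is just an example of a whitelisted
entry:)

static const struct bpf_func_proto *
seccomp_func_proto(enum bpf_func_id func_id)
{
	switch (func_id) {
	case BPF_FUNC_get_current_pid_tgid:
		return &bpf_get_current_pid_tgid_proto;
	default:
		return NULL;	/* verifier rejects any other helper */
	}
}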


Re: [PATCH bpf-next v8 08/11] landlock: Add ptrace restrictions

2018-02-26 Thread Andy Lutomirski
On Tue, Feb 27, 2018 at 12:41 AM, Mickaël Salaün  wrote:
> A landlocked process has less privileges than a non-landlocked process
> and must then be subject to additional restrictions when manipulating
> processes. To be allowed to use ptrace(2) and related syscalls on a
> target process, a landlocked process must have a subset of the target
> process' rules.
>
> Signed-off-by: Mickaël Salaün 
> Cc: Alexei Starovoitov 
> Cc: Andy Lutomirski 
> Cc: Daniel Borkmann 
> Cc: David S. Miller 
> Cc: James Morris 
> Cc: Kees Cook 
> Cc: Serge E. Hallyn 
> ---
>
> Changes since v6:
> * factor out ptrace check
> * constify pointers
> * cleanup headers
> * use the new security_add_hooks()
> ---
>  security/landlock/Makefile   |   2 +-
>  security/landlock/hooks_ptrace.c | 124 
> +++
>  security/landlock/hooks_ptrace.h |  11 
>  security/landlock/init.c |   2 +
>  4 files changed, 138 insertions(+), 1 deletion(-)
>  create mode 100644 security/landlock/hooks_ptrace.c
>  create mode 100644 security/landlock/hooks_ptrace.h
>
> diff --git a/security/landlock/Makefile b/security/landlock/Makefile
> index d0f532a93b4e..605504d852d3 100644
> --- a/security/landlock/Makefile
> +++ b/security/landlock/Makefile
> @@ -3,4 +3,4 @@ obj-$(CONFIG_SECURITY_LANDLOCK) := landlock.o
>  landlock-y := init.o chain.o task.o \
> tag.o tag_fs.o \
> enforce.o enforce_seccomp.o \
> -   hooks.o hooks_cred.o hooks_fs.o
> +   hooks.o hooks_cred.o hooks_fs.o hooks_ptrace.o
> diff --git a/security/landlock/hooks_ptrace.c 
> b/security/landlock/hooks_ptrace.c
> new file mode 100644
> index ..f1b977b9c808
> --- /dev/null
> +++ b/security/landlock/hooks_ptrace.c
> @@ -0,0 +1,124 @@
> +/*
> + * Landlock LSM - ptrace hooks
> + *
> + * Copyright © 2017 Mickaël Salaün 
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2, as
> + * published by the Free Software Foundation.
> + */
> +
> +#include 
> +#include 
> +#include  /* ARRAY_SIZE */
> +#include 
> +#include  /* struct task_struct */
> +#include 
> +
> +#include "common.h" /* struct landlock_prog_set */
> +#include "hooks.h" /* landlocked() */
> +#include "hooks_ptrace.h"
> +
> +static bool progs_are_subset(const struct landlock_prog_set *parent,
> +   const struct landlock_prog_set *child)
> +{
> +   size_t i;
> +
> +   if (!parent || !child)
> +   return false;
> +   if (parent == child)
> +   return true;
> +
> +   for (i = 0; i < ARRAY_SIZE(child->programs); i++) {

ARRAY_SIZE(child->programs) seems misleading.  Is there no define
NUM_LANDLOCK_PROG_TYPES or similar?

> +   struct landlock_prog_list *walker;
> +   bool found_parent = false;
> +
> +   if (!parent->programs[i])
> +   continue;
> +   for (walker = child->programs[i]; walker;
> +   walker = walker->prev) {
> +   if (walker == parent->programs[i]) {
> +   found_parent = true;
> +   break;
> +   }
> +   }
> +   if (!found_parent)
> +   return false;
> +   }
> +   return true;
> +}

If you used seccomp, you'd get this type of check for free, and it
would be a lot easier to comprehend.  AFAICT the only extra leniency
you're granting is that you're agnostic to the order in which the
rules associated with different program types were applied, which
could easily be added to seccomp.


Re: [PATCH bpf-next 0/2] Few BPF kselftest improvements

2018-02-26 Thread Alexei Starovoitov
On Mon, Feb 26, 2018 at 10:34:31PM +0100, Daniel Borkmann wrote:
> First one unifies the often repeated rlimit handling and the
> second one enables and adds run-time tests for BPF tail calls
> which provides useful coverage in particular for JITs.

Applied to bpf-next, thanks Daniel.



[PATCH bpf-next 2/8] tools: bpftool: factor out xlated dump related code into separate file

2018-02-26 Thread Jakub Kicinski
From: Jiong Wang 

This patch factors out the code for dumping xlated eBPF instructions into
xlated_dumper.[h|c].

These dumper functions are quite independent, so they are better kept
separate.

New dumper support will be added in later patches in this set.

Signed-off-by: Jiong Wang 
Acked-by: Jakub Kicinski 
---
 tools/bpf/bpftool/prog.c  | 255 +-
 tools/bpf/bpftool/xlated_dumper.c | 282 ++
 tools/bpf/bpftool/xlated_dumper.h |  58 
 3 files changed, 341 insertions(+), 254 deletions(-)
 create mode 100644 tools/bpf/bpftool/xlated_dumper.c
 create mode 100644 tools/bpf/bpftool/xlated_dumper.h

diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index 950d11dd42ab..c5afee9838e6 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -48,7 +48,7 @@
 #include 
 
 #include "main.h"
-#include "disasm.h"
+#include "xlated_dumper.h"
 
 static const char * const prog_type_name[] = {
[BPF_PROG_TYPE_UNSPEC]  = "unspec",
@@ -407,259 +407,6 @@ static int do_show(int argc, char **argv)
return err;
 }
 
-#define SYM_MAX_NAME   256
-
-struct kernel_sym {
-   unsigned long address;
-   char name[SYM_MAX_NAME];
-};
-
-struct dump_data {
-   unsigned long address_call_base;
-   struct kernel_sym *sym_mapping;
-   __u32 sym_count;
-   char scratch_buff[SYM_MAX_NAME];
-};
-
-static int kernel_syms_cmp(const void *sym_a, const void *sym_b)
-{
-   return ((struct kernel_sym *)sym_a)->address -
-  ((struct kernel_sym *)sym_b)->address;
-}
-
-static void kernel_syms_load(struct dump_data *dd)
-{
-   struct kernel_sym *sym;
-   char buff[256];
-   void *tmp, *address;
-   FILE *fp;
-
-   fp = fopen("/proc/kallsyms", "r");
-   if (!fp)
-   return;
-
-   while (!feof(fp)) {
-   if (!fgets(buff, sizeof(buff), fp))
-   break;
-   tmp = realloc(dd->sym_mapping,
- (dd->sym_count + 1) *
- sizeof(*dd->sym_mapping));
-   if (!tmp) {
-out:
-   free(dd->sym_mapping);
-   dd->sym_mapping = NULL;
-   fclose(fp);
-   return;
-   }
-   dd->sym_mapping = tmp;
-   sym = &dd->sym_mapping[dd->sym_count];
-   if (sscanf(buff, "%p %*c %s", &address, sym->name) != 2)
-   continue;
-   sym->address = (unsigned long)address;
-   if (!strcmp(sym->name, "__bpf_call_base")) {
-   dd->address_call_base = sym->address;
-   /* sysctl kernel.kptr_restrict was set */
-   if (!sym->address)
-   goto out;
-   }
-   if (sym->address)
-   dd->sym_count++;
-   }
-
-   fclose(fp);
-
-   qsort(dd->sym_mapping, dd->sym_count,
- sizeof(*dd->sym_mapping), kernel_syms_cmp);
-}
-
-static void kernel_syms_destroy(struct dump_data *dd)
-{
-   free(dd->sym_mapping);
-}
-
-static struct kernel_sym *kernel_syms_search(struct dump_data *dd,
-unsigned long key)
-{
-   struct kernel_sym sym = {
-   .address = key,
-   };
-
-   return dd->sym_mapping ?
-  bsearch(&sym, dd->sym_mapping, dd->sym_count,
-  sizeof(*dd->sym_mapping), kernel_syms_cmp) : NULL;
-}
-
-static void print_insn(struct bpf_verifier_env *env, const char *fmt, ...)
-{
-   va_list args;
-
-   va_start(args, fmt);
-   vprintf(fmt, args);
-   va_end(args);
-}
-
-static const char *print_call_pcrel(struct dump_data *dd,
-   struct kernel_sym *sym,
-   unsigned long address,
-   const struct bpf_insn *insn)
-{
-   if (sym)
-   snprintf(dd->scratch_buff, sizeof(dd->scratch_buff),
-"%+d#%s", insn->off, sym->name);
-   else
-   snprintf(dd->scratch_buff, sizeof(dd->scratch_buff),
-"%+d#0x%lx", insn->off, address);
-   return dd->scratch_buff;
-}
-
-static const char *print_call_helper(struct dump_data *dd,
-struct kernel_sym *sym,
-unsigned long address)
-{
-   if (sym)
-   snprintf(dd->scratch_buff, sizeof(dd->scratch_buff),
-"%s", sym->name);
-   else
-   snprintf(dd->scratch_buff, sizeof(dd->scratch_buff),
-"0x%lx", address);
-   return dd->scratch_buff;
-}
-
-static const char *print_call(void *private_data,
- const struct bpf_insn 

[PATCH bpf-next 4/8] tools: bpftool: partition basic-block for each function in the CFG

2018-02-26 Thread Jakub Kicinski
From: Jiong Wang 

This patch partitions the basic blocks for each function in the CFG. The
algorithm is simple: we identify the basic-block heads in a first
traversal, then identify the tails in a second traversal.

We could build extended basic blocks (EBBs) in later steps. EBBs could
make the graph more readable when the eBPF sequence is large.

Signed-off-by: Jiong Wang 
Acked-by: Jakub Kicinski 
---
 tools/bpf/bpftool/cfg.c | 118 +++-
 1 file changed, 117 insertions(+), 1 deletion(-)

diff --git a/tools/bpf/bpftool/cfg.c b/tools/bpf/bpftool/cfg.c
index 42a0d4c4271c..a61b76edc18d 100644
--- a/tools/bpf/bpftool/cfg.c
+++ b/tools/bpf/bpftool/cfg.c
@@ -45,17 +45,32 @@ struct cfg {
 
 struct func_node {
struct list_head l;
+   struct list_head bbs;
struct bpf_insn *start;
struct bpf_insn *end;
int idx;
+   int bb_num;
+};
+
+struct bb_node {
+   struct list_head l;
+   struct bpf_insn *head;
+   struct bpf_insn *tail;
+   int idx;
 };
 
 #define func_prev(func)list_prev_entry(func, l)
 #define func_next(func)list_next_entry(func, l)
+#define bb_prev(bb)list_prev_entry(bb, l)
+#define bb_next(bb)list_next_entry(bb, l)
 #define cfg_first_func(cfg)\
	list_first_entry(&cfg->funcs, struct func_node, l)
 #define cfg_last_func(cfg) \
	list_last_entry(&cfg->funcs, struct func_node, l)
+#define func_first_bb(func)\
+   list_first_entry(&func->bbs, struct bb_node, l)
+#define func_last_bb(func) \
+   list_last_entry(&func->bbs, struct bb_node, l)
 
 static struct func_node *cfg_append_func(struct cfg *cfg, struct bpf_insn 
*insn)
 {
@@ -82,6 +97,30 @@ static struct func_node *cfg_append_func(struct cfg *cfg, 
struct bpf_insn *insn)
return new_func;
 }
 
+static struct bb_node *func_append_bb(struct func_node *func,
+ struct bpf_insn *insn)
+{
+   struct bb_node *new_bb, *bb;
+
+   list_for_each_entry(bb, &func->bbs, l) {
+   if (bb->head == insn)
+   return bb;
+   else if (bb->head > insn)
+   break;
+   }
+
+   bb = bb_prev(bb);
+   new_bb = calloc(1, sizeof(*new_bb));
+   if (!new_bb) {
+   p_err("OOM when allocating BB node");
+   return NULL;
+   }
+   new_bb->head = insn;
+   list_add(&new_bb->l, &bb->l);
+
+   return new_bb;
+}
+
 static bool cfg_partition_funcs(struct cfg *cfg, struct bpf_insn *cur,
struct bpf_insn *end)
 {
@@ -111,13 +150,83 @@ static bool cfg_partition_funcs(struct cfg *cfg, struct 
bpf_insn *cur,
return false;
 }
 
+static bool func_partition_bb_head(struct func_node *func)
+{
+   struct bpf_insn *cur, *end;
+   struct bb_node *bb;
+
+   cur = func->start;
+   end = func->end;
+   INIT_LIST_HEAD(&func->bbs);
+   bb = func_append_bb(func, cur);
+   if (!bb)
+   return true;
+
+   for (; cur <= end; cur++) {
+   if (BPF_CLASS(cur->code) == BPF_JMP) {
+   u8 opcode = BPF_OP(cur->code);
+
+   if (opcode == BPF_EXIT || opcode == BPF_CALL)
+   continue;
+
+   bb = func_append_bb(func, cur + cur->off + 1);
+   if (!bb)
+   return true;
+
+   if (opcode != BPF_JA) {
+   bb = func_append_bb(func, cur + 1);
+   if (!bb)
+   return true;
+   }
+   }
+   }
+
+   return false;
+}
+
+static void func_partition_bb_tail(struct func_node *func)
+{
+   struct bb_node *bb, *last;
+   unsigned int bb_idx = 0;
+
+   last = func_last_bb(func);
+   last->tail = func->end;
+   bb = func_first_bb(func);
+   list_for_each_entry_from(bb, &last->l, l) {
+   bb->tail = bb_next(bb)->head - 1;
+   bb->idx = bb_idx++;
+   }
+
+   last->idx = bb_idx++;
+   func->bb_num = bb_idx;
+}
+
+static bool func_partition_bb(struct func_node *func)
+{
+   if (func_partition_bb_head(func))
+   return true;
+
+   func_partition_bb_tail(func);
+
+   return false;
+}
+
 static bool cfg_build(struct cfg *cfg, struct bpf_insn *insn, unsigned int len)
 {
int cnt = len / sizeof(*insn);
+   struct func_node *func;
 
	INIT_LIST_HEAD(&cfg->funcs);
 
-   return cfg_partition_funcs(cfg, insn, insn + cnt);
+   if (cfg_partition_funcs(cfg, insn, insn + cnt))
+   return true;
+
+	list_for_each_entry(func, &cfg->funcs, l) {
+   if (func_partition_bb(func))
+   return true;
+   }
+
+   return false;
 }
 
 static void cfg_destroy(struct cfg 

[PATCH bpf-next 7/8] tools: bpftool: new command-line option and documentation for 'visual'

2018-02-26 Thread Jakub Kicinski
From: Jiong Wang 

This patch adds a new command-line option for visualizing the xlated eBPF
sequence.

Documentation is updated accordingly.

Usage:

  bpftool prog dump xlated id 2 visual
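
The DOT output can then be rendered with standard Graphviz tools, for
example (assuming graphviz is installed):

  bpftool prog dump xlated id 2 visual | dot -Tsvg -o prog.svg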

Reviewed-by: Quentin Monnet 
Signed-off-by: Jiong Wang 
Acked-by: Jakub Kicinski 
---
 tools/bpf/bpftool/Documentation/bpftool-prog.rst | 18 --
 tools/bpf/bpftool/prog.c | 12 +++-
 2 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/tools/bpf/bpftool/Documentation/bpftool-prog.rst 
b/tools/bpf/bpftool/Documentation/bpftool-prog.rst
index e4ceee7f2dff..67ca6c69376c 100644
--- a/tools/bpf/bpftool/Documentation/bpftool-prog.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool-prog.rst
@@ -21,7 +21,7 @@ MAP COMMANDS
 =
 
 |  **bpftool** **prog { show | list }** [*PROG*]
-|  **bpftool** **prog dump xlated** *PROG* [{**file** *FILE* | **opcodes**}]
+|  **bpftool** **prog dump xlated** *PROG* [{**file** *FILE* | **opcodes** | **visual**}]
|  **bpftool** **prog dump jited**  *PROG* [{**file** *FILE* | **opcodes**}]
 |  **bpftool** **prog pin** *PROG* *FILE*
 |  **bpftool** **prog load** *OBJ* *FILE*
@@ -39,12 +39,18 @@ DESCRIPTION
  Output will start with program ID followed by program type and
  zero or more named attributes (depending on kernel version).
 
-   **bpftool prog dump xlated** *PROG* [{ **file** *FILE* | **opcodes** }]
- Dump eBPF instructions of the program from the kernel.
- If *FILE* is specified image will be written to a file,
- otherwise it will be disassembled and printed to stdout.
+   **bpftool prog dump xlated** *PROG* [{ **file** *FILE* | **opcodes** | **visual** }]
+ Dump eBPF instructions of the program from the kernel. By
+ default, eBPF will be disassembled and printed to standard
+ output in human-readable format. In this case, **opcodes**
+ controls if raw opcodes should be printed as well.
 
- **opcodes** controls if raw opcodes will be printed.
+ If **file** is specified, the binary image will instead be
+ written to *FILE*.
+
+ If **visual** is specified, control flow graph (CFG) will be
+ built instead, and eBPF instructions will be presented with
+ CFG in DOT format, on standard output.
 
**bpftool prog dump jited**  *PROG* [{ **file** *FILE* | **opcodes** }]
  Dump jited image (host machine code) of the program.
diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index c5afee9838e6..f7a810897eac 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -47,6 +47,7 @@
 #include 
 #include 
 
+#include "cfg.h"
 #include "main.h"
 #include "xlated_dumper.h"
 
@@ -415,6 +416,7 @@ static int do_dump(int argc, char **argv)
unsigned int buf_size;
char *filepath = NULL;
bool opcodes = false;
+   bool visual = false;
unsigned char *buf;
__u32 *member_len;
__u64 *member_ptr;
@@ -453,6 +455,9 @@ static int do_dump(int argc, char **argv)
} else if (is_prefix(*argv, "opcodes")) {
opcodes = true;
NEXT_ARG();
+   } else if (is_prefix(*argv, "visual")) {
+   visual = true;
+   NEXT_ARG();
}
 
if (argc) {
@@ -536,6 +541,11 @@ static int do_dump(int argc, char **argv)
}
 
disasm_print_insn(buf, *member_len, opcodes, name);
+   } else if (visual) {
+   if (json_output)
+   jsonw_null(json_wtr);
+   else
+   dump_xlated_cfg(buf, *member_len);
} else {
kernel_syms_load();
if (json_output)
@@ -596,7 +606,7 @@ static int do_help(int argc, char **argv)
 
fprintf(stderr,
"Usage: %s %s { show | list } [PROG]\n"
-   "   %s %s dump xlated PROG [{ file FILE | opcodes }]\n"
+   "   %s %s dump xlated PROG [{ file FILE | opcodes | visual 
}]\n"
"   %s %s dump jited  PROG [{ file FILE | opcodes }]\n"
"   %s %s pin   PROG FILE\n"
"   %s %s load  OBJ  FILE\n"
-- 
2.15.1



[PATCH bpf-next 8/8] tools: bpftool: add bash completion for CFG dump

2018-02-26 Thread Jakub Kicinski
From: Quentin Monnet 

Add bash completion for the "visual" keyword used for dumping the CFG of
eBPF programs with bpftool. Make sure we only complete with this keyword
when we dump "xlated" (and not "jited") instructions.
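
An illustrative session (hypothetical program id; completions follow the
logic wired up below):

  $ bpftool prog dump xlated id 2 [TAB]
  file     opcodes  visual
  $ bpftool prog dump jited id 2 [TAB]
  file     opcodes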

Acked-by: Jiong Wang 
Signed-off-by: Quentin Monnet 
Acked-by: Jakub Kicinski 
---
 tools/bpf/bpftool/bash-completion/bpftool | 13 +
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/tools/bpf/bpftool/bash-completion/bpftool 
b/tools/bpf/bpftool/bash-completion/bpftool
index 08719c54a614..490811b45fa7 100644
--- a/tools/bpf/bpftool/bash-completion/bpftool
+++ b/tools/bpf/bpftool/bash-completion/bpftool
@@ -147,7 +147,7 @@ _bpftool()
 
 # Deal with simplest keywords
 case $prev in
-help|key|opcodes)
+help|key|opcodes|visual)
 return 0
 ;;
 tag)
@@ -223,11 +223,16 @@ _bpftool()
 return 0
 ;;
 *)
-_bpftool_once_attr 'file'
+_bpftool_once_attr 'file'
+if _bpftool_search_list 'xlated'; then
+COMPREPLY+=( $( compgen -W 'opcodes visual' -- \
+"$cur" ) )
+else
 COMPREPLY+=( $( compgen -W 'opcodes' -- \
 "$cur" ) )
-return 0
-;;
+fi
+return 0
+;;
 esac
 ;;
 pin)
-- 
2.15.1



[PATCH bpf-next 1/8] tools: bpftool: remove unnecessary 'if' to reduce indentation

2018-02-26 Thread Jakub Kicinski
From: Jiong Wang 

It is obvious we could use 'else if' instead of starting a new 'if' in the
touched code.

Signed-off-by: Jiong Wang 
Acked-by: Jakub Kicinski 
---
 tools/bpf/bpftool/prog.c | 38 ++
 1 file changed, 18 insertions(+), 20 deletions(-)

diff --git a/tools/bpf/bpftool/prog.c b/tools/bpf/bpftool/prog.c
index e549e329be82..950d11dd42ab 100644
--- a/tools/bpf/bpftool/prog.c
+++ b/tools/bpf/bpftool/prog.c
@@ -777,27 +777,25 @@ static int do_dump(int argc, char **argv)
 
if (json_output)
jsonw_null(json_wtr);
-   } else {
-   if (member_len == _prog_len) {
-   const char *name = NULL;
-
-   if (info.ifindex) {
-   name = ifindex_to_bfd_name_ns(info.ifindex,
- info.netns_dev,
- info.netns_ino);
-   if (!name)
-   goto err_free;
-   }
-
-   disasm_print_insn(buf, *member_len, opcodes, name);
-   } else {
-   kernel_syms_load();
-   if (json_output)
-   dump_xlated_json(, buf, *member_len, 
opcodes);
-   else
-   dump_xlated_plain(, buf, *member_len, 
opcodes);
-   kernel_syms_destroy();
+   } else if (member_len == _prog_len) {
+   const char *name = NULL;
+
+   if (info.ifindex) {
+   name = ifindex_to_bfd_name_ns(info.ifindex,
+ info.netns_dev,
+ info.netns_ino);
+   if (!name)
+   goto err_free;
}
+
+   disasm_print_insn(buf, *member_len, opcodes, name);
+   } else {
+   kernel_syms_load();
+   if (json_output)
+   dump_xlated_json(, buf, *member_len, opcodes);
+   else
+   dump_xlated_plain(, buf, *member_len, opcodes);
+   kernel_syms_destroy();
}
 
free(buf);
-- 
2.15.1



[PATCH bpf-next 0/8] tools: bpftool: visualization support for eBPF program

2018-02-26 Thread Jakub Kicinski
Jiong says:

This patch set is an application of CFG information to eBPF program
visualization. It presents some initial code for building CFG information
from eBPF instruction sequences.

After we get the eBPF program bytecode, we do sub-program detection and
basic-block partition. This information is then visualized as a DOT
graph.

The user could use any DOT graphic tools (xdot, graphviz etc) to view it.

For example:

  bpftool prog dump xlated id 2 visual &>output.dot

  [xdot | dotty] output.dot
  dot -Tpng output.dot -o output.png

This initial patch set hasn't done much tuning of the .dot description
layout or decoration; we could improve these later once the direction of
the patch set is agreed on. We could also visualize some static analysis
performance data.


Jiong Wang (7):
  tools: bpftool: remove unnecessary 'if' to reduce indentation
  tools: bpftool: factor out xlated dump related code into separate file
  tools: bpftool: detect sub-programs from the eBPF sequence
  tools: bpftool: partition basic-block for each function in the CFG
  tools: bpftool: add out edges for each basic-block
  tools: bpftool: generate .dot graph from CFG information
  tools: bpftool: new command-line option and documentation for 'visual'

Quentin Monnet (1):
  tools: bpftool: add bash completion for CFG dump

 tools/bpf/bpftool/Documentation/bpftool-prog.rst |  18 +-
 tools/bpf/bpftool/bash-completion/bpftool|  13 +-
 tools/bpf/bpftool/cfg.c  | 510 +++
 tools/bpf/bpftool/cfg.h  |  39 ++
 tools/bpf/bpftool/prog.c | 305 ++
 tools/bpf/bpftool/xlated_dumper.c| 334 +++
 tools/bpf/bpftool/xlated_dumper.h|  60 +++
 7 files changed, 994 insertions(+), 285 deletions(-)
 create mode 100644 tools/bpf/bpftool/cfg.c
 create mode 100644 tools/bpf/bpftool/cfg.h
 create mode 100644 tools/bpf/bpftool/xlated_dumper.c
 create mode 100644 tools/bpf/bpftool/xlated_dumper.h

-- 
2.15.1



[PATCH bpf-next 3/8] tools: bpftool: detect sub-programs from the eBPF sequence

2018-02-26 Thread Jakub Kicinski
From: Jiong Wang 

This patch detects all sub-programs from the eBPF sequence and keeps the
information in the new CFG data structure.

The detection algorithm is basically the same as the one in the verifier,
except we need to use insn->off instead of insn->imm to get the
pc-relative call offset, because the verifier modifies
insn->off/insn->imm while finishing the verification.

Also, we don't need to repeat some sanity checks, as the verifier has
already done them.
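
In code, the detection boils down to scanning for pseudo calls, mirroring
the loop in cfg_partition_funcs() below:

	/* Sketch of the core check; 'cur' walks the instruction array. */
	if (cur->code == (BPF_JMP | BPF_CALL) &&
	    cur->src_reg == BPF_PSEUDO_CALL)
		/* the call target starts a new sub-program */
		cfg_append_func(cfg, cur + cur->off + 1);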

Signed-off-by: Jiong Wang 
Acked-by: Jakub Kicinski 
---
 tools/bpf/bpftool/cfg.c | 143 
 tools/bpf/bpftool/cfg.h |  39 +
 2 files changed, 182 insertions(+)
 create mode 100644 tools/bpf/bpftool/cfg.c
 create mode 100644 tools/bpf/bpftool/cfg.h

diff --git a/tools/bpf/bpftool/cfg.c b/tools/bpf/bpftool/cfg.c
new file mode 100644
index ..42a0d4c4271c
--- /dev/null
+++ b/tools/bpf/bpftool/cfg.c
@@ -0,0 +1,143 @@
+/*
+ * Copyright (C) 2018 Netronome Systems, Inc.
+ *
+ * This software is dual licensed under the GNU General License Version 2,
+ * June 1991 as shown in the file COPYING in the top-level directory of this
+ * source tree or the BSD 2-Clause License provided below.  You have the
+ * option to license this software under the complete terms of either license.
+ *
+ * The BSD 2-Clause License:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  1. Redistributions of source code must retain the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer.
+ *
+ *  2. Redistributions in binary form must reproduce the above
+ * copyright notice, this list of conditions and the following
+ * disclaimer in the documentation and/or other materials
+ * provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <linux/list.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include "cfg.h"
+#include "main.h"
+
+struct cfg {
+   struct list_head funcs;
+   int func_num;
+};
+
+struct func_node {
+   struct list_head l;
+   struct bpf_insn *start;
+   struct bpf_insn *end;
+   int idx;
+};
+
+#define func_prev(func)	list_prev_entry(func, l)
+#define func_next(func)	list_next_entry(func, l)
+#define cfg_first_func(cfg)	\
+	list_first_entry(&cfg->funcs, struct func_node, l)
+#define cfg_last_func(cfg)	\
+	list_last_entry(&cfg->funcs, struct func_node, l)
+
+static struct func_node *cfg_append_func(struct cfg *cfg, struct bpf_insn 
*insn)
+{
+   struct func_node *new_func, *func;
+
+	list_for_each_entry(func, &cfg->funcs, l) {
+   if (func->start == insn)
+   return func;
+   else if (func->start > insn)
+   break;
+   }
+
+   func = func_prev(func);
+   new_func = calloc(1, sizeof(*new_func));
+   if (!new_func) {
+   p_err("OOM when allocating FUNC node");
+   return NULL;
+   }
+   new_func->start = insn;
+   new_func->idx = cfg->func_num;
+	list_add(&new_func->l, &func->l);
+   cfg->func_num++;
+
+   return new_func;
+}
+
+static bool cfg_partition_funcs(struct cfg *cfg, struct bpf_insn *cur,
+   struct bpf_insn *end)
+{
+   struct func_node *func, *last_func;
+
+   func = cfg_append_func(cfg, cur);
+   if (!func)
+   return true;
+
+   for (; cur < end; cur++) {
+   if (cur->code != (BPF_JMP | BPF_CALL))
+   continue;
+   if (cur->src_reg != BPF_PSEUDO_CALL)
+   continue;
+   func = cfg_append_func(cfg, cur + cur->off + 1);
+   if (!func)
+   return true;
+   }
+
+   last_func = cfg_last_func(cfg);
+   last_func->end = end - 1;
+   func = cfg_first_func(cfg);
+	list_for_each_entry_from(func, &last_func->l, l) {
+   func->end = func_next(func)->start - 1;
+   }
+
+   return false;
+}
+
+static bool cfg_build(struct cfg *cfg, struct bpf_insn *insn, unsigned int len)
+{
+   int cnt = len / sizeof(*insn);
+
+	INIT_LIST_HEAD(&cfg->funcs);
+
+   return cfg_partition_funcs(cfg, insn, insn + cnt);
+}
+
+static void cfg_destroy(struct cfg *cfg)
+{
+   struct func_node *func, *func2;
+
+  

[PATCH bpf-next 5/8] tools: bpftool: add out edges for each basic-block

2018-02-26 Thread Jakub Kicinski
From: Jiong Wang 

This patch adds out edges for each basic-block. We will need these out
edges to finish the .dot graph drawing.
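
For illustration (hypothetical block layout), a conditional jump gives
its basic-block two successor edges, matching the flags defined below:

	/*
	 *   bb0: ... ; if r1 > 10 goto L
	 *     |-- EDGE_FLAG_FALLTHROUGH --> bb1 (next instruction)
	 *     `-- EDGE_FLAG_JUMP ---------> bb2 (jump target L)
	 */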

Signed-off-by: Jiong Wang 
Acked-by: Jakub Kicinski 
---
 tools/bpf/bpftool/cfg.c | 162 +++-
 1 file changed, 160 insertions(+), 2 deletions(-)

diff --git a/tools/bpf/bpftool/cfg.c b/tools/bpf/bpftool/cfg.c
index a61b76edc18d..1dd84e3b8e8a 100644
--- a/tools/bpf/bpftool/cfg.c
+++ b/tools/bpf/bpftool/cfg.c
@@ -54,15 +54,32 @@ struct func_node {
 
 struct bb_node {
struct list_head l;
+   struct list_head e_prevs;
+   struct list_head e_succs;
struct bpf_insn *head;
struct bpf_insn *tail;
int idx;
 };
 
+#define EDGE_FLAG_EMPTY0x0
+#define EDGE_FLAG_FALLTHROUGH  0x1
+#define EDGE_FLAG_JUMP 0x2
+struct edge_node {
+   struct list_head l;
+   struct bb_node *src;
+   struct bb_node *dst;
+   int flags;
+};
+
+#define ENTRY_BLOCK_INDEX  0
+#define EXIT_BLOCK_INDEX   1
+#define NUM_FIXED_BLOCKS   2
 #define func_prev(func)	list_prev_entry(func, l)
 #define func_next(func)	list_next_entry(func, l)
 #define bb_prev(bb)		list_prev_entry(bb, l)
 #define bb_next(bb)		list_next_entry(bb, l)
+#define entry_bb(func)		func_first_bb(func)
+#define exit_bb(func)		func_last_bb(func)
 #define cfg_first_func(cfg)	\
 	list_first_entry(&cfg->funcs, struct func_node, l)
 #define cfg_last_func(cfg)	\
@@ -116,11 +133,30 @@ static struct bb_node *func_append_bb(struct func_node 
*func,
return NULL;
}
new_bb->head = insn;
+	INIT_LIST_HEAD(&new_bb->e_prevs);
+	INIT_LIST_HEAD(&new_bb->e_succs);
 	list_add(&new_bb->l, &bb->l);
 
return new_bb;
 }
 
+static struct bb_node *func_insert_dummy_bb(struct list_head *after)
+{
+   struct bb_node *bb;
+
+   bb = calloc(1, sizeof(*bb));
+   if (!bb) {
+   p_err("OOM when allocating BB node");
+   return NULL;
+   }
+
+	INIT_LIST_HEAD(&bb->e_prevs);
+	INIT_LIST_HEAD(&bb->e_succs);
+	list_add(&bb->l, after);
+
+   return bb;
+}
+
 static bool cfg_partition_funcs(struct cfg *cfg, struct bpf_insn *cur,
struct bpf_insn *end)
 {
@@ -186,8 +222,8 @@ static bool func_partition_bb_head(struct func_node *func)
 
 static void func_partition_bb_tail(struct func_node *func)
 {
+   unsigned int bb_idx = NUM_FIXED_BLOCKS;
struct bb_node *bb, *last;
-   unsigned int bb_idx = 0;
 
last = func_last_bb(func);
last->tail = func->end;
@@ -201,6 +237,23 @@ static void func_partition_bb_tail(struct func_node *func)
func->bb_num = bb_idx;
 }
 
+static bool func_add_special_bb(struct func_node *func)
+{
+   struct bb_node *bb;
+
+	bb = func_insert_dummy_bb(&func->bbs);
+   if (!bb)
+   return true;
+   bb->idx = ENTRY_BLOCK_INDEX;
+
+	bb = func_insert_dummy_bb(&func_last_bb(func)->l);
+   if (!bb)
+   return true;
+   bb->idx = EXIT_BLOCK_INDEX;
+
+   return false;
+}
+
 static bool func_partition_bb(struct func_node *func)
 {
if (func_partition_bb_head(func))
@@ -211,6 +264,96 @@ static bool func_partition_bb(struct func_node *func)
return false;
 }
 
+static struct bb_node *func_search_bb_with_head(struct func_node *func,
+   struct bpf_insn *insn)
+{
+   struct bb_node *bb;
+
+	list_for_each_entry(bb, &func->bbs, l) {
+   if (bb->head == insn)
+   return bb;
+   }
+
+   return NULL;
+}
+
+static struct edge_node *new_edge(struct bb_node *src, struct bb_node *dst,
+ int flags)
+{
+   struct edge_node *e;
+
+   e = calloc(1, sizeof(*e));
+   if (!e) {
+   p_err("OOM when allocating edge node");
+   return NULL;
+   }
+
+   if (src)
+   e->src = src;
+   if (dst)
+   e->dst = dst;
+
+   e->flags |= flags;
+
+   return e;
+}
+
+static bool func_add_bb_edges(struct func_node *func)
+{
+   struct bpf_insn *insn;
+   struct edge_node *e;
+   struct bb_node *bb;
+
+   bb = entry_bb(func);
+   e = new_edge(bb, bb_next(bb), EDGE_FLAG_FALLTHROUGH);
+   if (!e)
+   return true;
+	list_add_tail(&e->l, &bb->e_succs);
+
+   bb = exit_bb(func);
+   e = new_edge(bb_prev(bb), bb, EDGE_FLAG_FALLTHROUGH);
+   if (!e)
+   return true;
+	list_add_tail(&e->l, &bb->e_prevs);
+
+   bb = entry_bb(func);
+   bb = bb_next(bb);
+	list_for_each_entry_from(bb, &exit_bb(func)->l, l) {
+   e = new_edge(bb, NULL, EDGE_FLAG_EMPTY);
+   if (!e)
+   return true;
+   e->src = bb;
+
+  

[PATCH bpf-next 6/8] tools: bpftool: generate .dot graph from CFG information

2018-02-26 Thread Jakub Kicinski
From: Jiong Wang 

This patch lets bpftool print a .dot graph to stdout.

This graph is generated by the following steps:

  - iterate through the function list.
  - generate a basic-block (BB) definition for each BB in the function.
  - draw out edges to connect BBs.

This patch is the initial support; the layout and decoration of the .dot
graph could be improved.

Also, it would be useful if we could visualize some performance data from
static analysis.
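
A minimal sketch of the emitted DOT, following the printf formats in
cfg_dump() below (node labels abbreviated, indices illustrative):

  digraph "DOT graph for eBPF program" {
  subgraph "cluster_0" {
	style="dashed";
	color="black";
	label="func_0 ()";
	fn_0_bb_0 [shape=Mdiamond,style=filled,label="ENTRY"];
	fn_0_bb_2 [shape=record,style=filled,label="{0: r0 = 0|1: exit}"];
	fn_0_bb_1 [shape=Mdiamond,style=filled,label="EXIT"];
	fn_0_bb_0:s -> fn_0_bb_2:n [style="solid,bold", color=black, weight=10, constraint=true];
	fn_0_bb_2:s -> fn_0_bb_1:n [style="solid,bold", color=black, weight=10, constraint=true];
	fn_0_bb_0:s -> fn_0_bb_1:n [style="invis", constraint=true];
  }
  }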

Signed-off-by: Jiong Wang 
Acked-by: Jakub Kicinski 
---
 tools/bpf/bpftool/cfg.c   | 93 +++
 tools/bpf/bpftool/xlated_dumper.c | 52 ++
 tools/bpf/bpftool/xlated_dumper.h |  2 +
 3 files changed, 147 insertions(+)

diff --git a/tools/bpf/bpftool/cfg.c b/tools/bpf/bpftool/cfg.c
index 1dd84e3b8e8a..d93d87297ae8 100644
--- a/tools/bpf/bpftool/cfg.c
+++ b/tools/bpf/bpftool/cfg.c
@@ -37,6 +37,7 @@
 
 #include "cfg.h"
 #include "main.h"
+#include "xlated_dumper.h"
 
 struct cfg {
struct list_head funcs;
@@ -404,6 +405,96 @@ static void cfg_destroy(struct cfg *cfg)
}
 }
 
+static void draw_bb_node(struct func_node *func, struct bb_node *bb)
+{
+   const char *shape;
+
+   if (bb->idx == ENTRY_BLOCK_INDEX || bb->idx == EXIT_BLOCK_INDEX)
+   shape = "Mdiamond";
+   else
+   shape = "record";
+
+   printf("\tfn_%d_bb_%d [shape=%s,style=filled,label=\"",
+  func->idx, bb->idx, shape);
+
+   if (bb->idx == ENTRY_BLOCK_INDEX) {
+   printf("ENTRY");
+   } else if (bb->idx == EXIT_BLOCK_INDEX) {
+   printf("EXIT");
+   } else {
+   unsigned int start_idx;
+   struct dump_data dd = {};
+
+   printf("{");
+   kernel_syms_load();
+   start_idx = bb->head - func->start;
+		dump_xlated_for_graph(&dd, bb->head, bb->tail, start_idx);
+   kernel_syms_destroy();
+   printf("}");
+   }
+
+   printf("\"];\n\n");
+}
+
+static void draw_bb_succ_edges(struct func_node *func, struct bb_node *bb)
+{
+   const char *style = "\"solid,bold\"";
+   const char *color = "black";
+   int func_idx = func->idx;
+   struct edge_node *e;
+   int weight = 10;
+
+	if (list_empty(&bb->e_succs))
+   return;
+
+	list_for_each_entry(e, &bb->e_succs, l) {
+   printf("\tfn_%d_bb_%d:s -> fn_%d_bb_%d:n [style=%s, color=%s, 
weight=%d, constraint=true",
+  func_idx, e->src->idx, func_idx, e->dst->idx,
+  style, color, weight);
+   printf("];\n");
+   }
+}
+
+static void func_output_bb_def(struct func_node *func)
+{
+   struct bb_node *bb;
+
+	list_for_each_entry(bb, &func->bbs, l) {
+   draw_bb_node(func, bb);
+   }
+}
+
+static void func_output_edges(struct func_node *func)
+{
+   int func_idx = func->idx;
+   struct bb_node *bb;
+
+	list_for_each_entry(bb, &func->bbs, l) {
+   draw_bb_succ_edges(func, bb);
+   }
+
+   /* Add an invisible edge from ENTRY to EXIT, this is to
+* improve the graph layout.
+*/
+   printf("\tfn_%d_bb_%d:s -> fn_%d_bb_%d:n [style=\"invis\", 
constraint=true];\n",
+  func_idx, ENTRY_BLOCK_INDEX, func_idx, EXIT_BLOCK_INDEX);
+}
+
+static void cfg_dump(struct cfg *cfg)
+{
+   struct func_node *func;
+
+   printf("digraph \"DOT graph for eBPF program\" {\n");
+	list_for_each_entry(func, &cfg->funcs, l) {
+		printf("subgraph \"cluster_%d\" {\n\tstyle=\"dashed\";\n\tcolor=\"black\";\n\tlabel=\"func_%d ()\";\n",
+		       func->idx, func->idx);
+   func_output_bb_def(func);
+   func_output_edges(func);
+   printf("}\n");
+   }
+   printf("}\n");
+}
+
 void dump_xlated_cfg(void *buf, unsigned int len)
 {
struct bpf_insn *insn = buf;
@@ -413,5 +504,7 @@ void dump_xlated_cfg(void *buf, unsigned int len)
	if (cfg_build(&cfg, insn, len))
return;
 
+	cfg_dump(&cfg);
+
 	cfg_destroy(&cfg);
 }
diff --git a/tools/bpf/bpftool/xlated_dumper.c 
b/tools/bpf/bpftool/xlated_dumper.c
index a765a2123bff..7c7145131d30 100644
--- a/tools/bpf/bpftool/xlated_dumper.c
+++ b/tools/bpf/bpftool/xlated_dumper.c
@@ -119,6 +119,37 @@ static void print_insn(struct bpf_verifier_env *env, const 
char *fmt, ...)
va_end(args);
 }
 
+static void
+print_insn_for_graph(struct bpf_verifier_env *env, const char *fmt, ...)
+{
+   char buf[64], *p;
+   va_list args;
+
+   va_start(args, fmt);
+   vsnprintf(buf, sizeof(buf), fmt, args);
+   va_end(args);
+
+   p = buf;
+   while (*p != '\0') {
+   if (*p == '\n') {
+   memmove(p + 3, p, strlen(buf) + 1 - (p - buf));
+   /* Align each instruction dump 

Re: [net-next v3 1/2] bpf, seccomp: Add eBPF filter capabilities

2018-02-26 Thread Sargun Dhillon
On Mon, Feb 26, 2018 at 7:57 PM, Tycho Andersen  wrote:
> On Mon, Feb 26, 2018 at 07:49:48PM -0800, Sargun Dhillon wrote:
>> On Mon, Feb 26, 2018 at 4:54 PM, Tycho Andersen  wrote:
>> > On Mon, Feb 26, 2018 at 07:27:05AM +, Sargun Dhillon wrote:
>> >> +config SECCOMP_FILTER_EXTENDED
>> >> + bool "Extended BPF seccomp filters"
>> >> + depends on SECCOMP_FILTER && BPF_SYSCALL
>> >> + depends on !CHECKPOINT_RESTORE
>> >
>> > Why not just give -EINVAL or something in case one of these is
>> > requested, instead of making them incompatible at compile time?
>> >
>> > Tycho
>> There's already code to return -EMEDIUMTYPE if it's a non-classic, or
>> non-saved filter. Under the normal case, with CHECKPOINT_RESTORE
>> enabled, you should never be able to get that. I think it makes sense
>> to preserve this behaviour.
>
> Oh, right. So can't we just drop this, and the existing code will
> DTRT, i.e. give you -EMEDIUMTYPE because the new filters aren't
> supported, until they are?
>
> Tycho
My suggestion is we merge this as is, so we don't break checkpoint /
restore, and I will try to get the filter dumping patch in the same
development cycle, as it comes at minimal risk. Otherwise, we risk
introducing a feature which could break checkpoint/restore, even in
unprivileged containers, since anyone can load a BPF seccomp filter.


Re: [net-next v3 0/2] eBPF seccomp filters

2018-02-26 Thread Tycho Andersen
On Mon, Feb 26, 2018 at 07:46:19PM -0800, Sargun Dhillon wrote:
> On Mon, Feb 26, 2018 at 5:01 PM, Tycho Andersen  wrote:
> > On Mon, Feb 26, 2018 at 03:20:15PM -0800, Kees Cook wrote:
> >> On Mon, Feb 26, 2018 at 3:04 PM, Alexei Starovoitov
> >>  wrote:
> >> > On Mon, Feb 26, 2018 at 07:26:54AM +, Sargun Dhillon wrote:
> >> >> This patchset enables seccomp filters to be written in eBPF. Although, 
> >> >> this
> >> >> [...]
> >> > The main statement I want to hear from seccomp maintainers before
> >> > proceeding any further on this that enabling eBPF in seccomp won't lead
> >> > to seccomp folks arguing against changes in bpf core (like verifier)
> >> > just because it's used by seccomp.
> >> > It must be spelled out in the commit log with explicit Ack.
> >>
> >> The primary thing I'm concerned about with eBPF and seccomp is
> >> side-effects from eBPF programs running at syscall time. This is an
> >> extremely sensitive area, and I want to be sure there won't be
> >> feature-creep here that leads to seccomp getting into a bad state.
> >>
> >> As long as seccomp can continue have its own verifier,
> >
> > I guess these patches should introduce some additional restrictions in
> > kernel/seccomp.c then? Based on my reading now, it's whatever the eBPF
> > verifier allows.
> >
> Like what? The helpers allowed are listed in seccomp.c. You have the
> same restrictions as the traditional eBPF verifier (no unsafe memory
> access, no backward jumps, etc.). I'm not sure which built-in eBPF
> functionality presents risk.

I think that's the $64,000 question that Kees is trying to answer re:
maps, etc.

There's also the possibility that eBPF grows something new
that's unsafe for seccomp.

Cheers,

Tycho


Re: [PATCH bpf] bpf, ppc64: fix out of bounds access in tail call

2018-02-26 Thread Alexei Starovoitov
On Mon, Feb 26, 2018 at 10:00:47PM +0100, Daniel Borkmann wrote:
> While working on 16338a9b3ac3 ("bpf, arm64: fix out of bounds access in
> tail call") I noticed that ppc64 JIT is partially affected as well. While
> the bound checking is correctly performed as unsigned comparison, the
> register with the index value, however, is never truncated into 32-bit
> space, so e.g. an index value of 0x100000000ULL with a map of 1 element
> would pass with PPC_CMPLW() whereas we later on continue with the full
> 64-bit register value. Therefore, as we do in the interpreter and other
> JITs, truncate the value to 32 bit initially in order to fix the access.
> 
> Fixes: ce0761419fae ("powerpc/bpf: Implement support for tail calls")
> Signed-off-by: Daniel Borkmann 
> Reviewed-by: Naveen N. Rao 
> Tested-by: Naveen N. Rao 

Applied to bpf tree, thanks Daniel.
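
For illustration, a standalone sketch (not from the patch) of why the
truncation matters:

	/* Hypothetical values: the low 32 bits of the index are 0, so a
	 * 32-bit unsigned compare passes the bound check, while the
	 * untruncated 64-bit value is used for the access afterwards.
	 */
	u64 index = 0x100000000ULL;	/* hypothetical runtime value */
	u32 max_entries = 1;		/* map with a single element */
	void *array[1], *entry;

	if ((u32)index < max_entries)	/* 32-bit compare: 0 < 1, passes */
		entry = array[index];	/* 64-bit index: out of bounds */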



Re: [net-next v3 1/2] bpf, seccomp: Add eBPF filter capabilities

2018-02-26 Thread Tycho Andersen
On Mon, Feb 26, 2018 at 07:49:48PM -0800, Sargun Dhillon wrote:
> On Mon, Feb 26, 2018 at 4:54 PM, Tycho Andersen  wrote:
> > On Mon, Feb 26, 2018 at 07:27:05AM +, Sargun Dhillon wrote:
> >> +config SECCOMP_FILTER_EXTENDED
> >> + bool "Extended BPF seccomp filters"
> >> + depends on SECCOMP_FILTER && BPF_SYSCALL
> >> + depends on !CHECKPOINT_RESTORE
> >
> > Why not just give -EINVAL or something in case one of these is
> > requested, instead of making them incompatible at compile time?
> >
> > Tycho
> There's already code to return -EMEDIUMTYPE if it's a non-classic, or
> non-saved filter. Under the normal case, with CHECKPOINT_RESTORE
> enabled, you should never be able to get that. I think it makes sense
> to preserve this behaviour.

Oh, right. So can't we just drop this, and the existing code will
DTRT, i.e. give you -EMEDIUMTYPE because the new filters aren't
supported, until they are?

Tycho


Re: [net-next v3 1/2] bpf, seccomp: Add eBPF filter capabilities

2018-02-26 Thread Sargun Dhillon
On Mon, Feb 26, 2018 at 4:54 PM, Tycho Andersen  wrote:
> On Mon, Feb 26, 2018 at 07:27:05AM +, Sargun Dhillon wrote:
>> +config SECCOMP_FILTER_EXTENDED
>> + bool "Extended BPF seccomp filters"
>> + depends on SECCOMP_FILTER && BPF_SYSCALL
>> + depends on !CHECKPOINT_RESTORE
>
> Why not just give -EINVAL or something in case one of these is
> requested, instead of making them incompatible at compile time?
>
> Tycho
There's already code to return -EMEDIUMTYPE if it's a non-classic, or
non-saved filter. Under the normal case, with CHECKPOINT_RESTORE
enabled, you should never be able to get that. I think it makes sense
to preserve this behaviour.

My rough plan is to introduce a mechanism to dump filters like you can
with cBPF filters. If you look at my v1, there was a patch that did this.
Once this gets in, I can prepare that patch, and we can lift this
restriction.


Re: [net-next v3 0/2] eBPF seccomp filters

2018-02-26 Thread Sargun Dhillon
On Mon, Feb 26, 2018 at 5:01 PM, Tycho Andersen  wrote:
> On Mon, Feb 26, 2018 at 03:20:15PM -0800, Kees Cook wrote:
>> On Mon, Feb 26, 2018 at 3:04 PM, Alexei Starovoitov
>>  wrote:
>> > On Mon, Feb 26, 2018 at 07:26:54AM +, Sargun Dhillon wrote:
>> >> This patchset enables seccomp filters to be written in eBPF. Although, 
>> >> this
>> >> [...]
>> > The main statement I want to hear from seccomp maintainers before
>> > proceeding any further on this that enabling eBPF in seccomp won't lead
>> > to seccomp folks arguing against changes in bpf core (like verifier)
>> > just because it's used by seccomp.
>> > It must be spelled out in the commit log with explicit Ack.
>>
>> The primary thing I'm concerned about with eBPF and seccomp is
>> side-effects from eBPF programs running at syscall time. This is an
>> extremely sensitive area, and I want to be sure there won't be
>> feature-creep here that leads to seccomp getting into a bad state.
>>
>> As long as seccomp can continue have its own verifier,
>
> I guess these patches should introduce some additional restrictions in
> kernel/seccomp.c then? Based on my reading now, it's whatever the eBPF
> verifier allows.
>
Like what? The helpers allowed are listed in seccomp.c. You have the
same restrictions as the traditional eBPF verifier (no unsafe memory
access, no backward jumps, etc.). I'm not sure which built-in eBPF
functionality presents risk.

>> I *think* this will be fine, though, again I remain concerned about
>> maps, etc. I'm still reviewing these patches and how they might
>> provide overlap with Tycho's needs too, etc.
>
> Yes, it's on my TODO list to take a look at how to do it as suggested
> by Alexi on top of this set before posting a v2. Haven't had time
> recently, though.
>
> Cheers,
>
> Tycho

There's a lot of interest (in general) in having a mechanism to do
notifications to userspace processes from eBPF for a variety of use
cases. I think that this would be valuable for more than just seccomp,
if it's implemented in a general-purpose manner.


Re: [PATCH] net: make tc-police action MTU behavior match documentation

2018-02-26 Thread Cong Wang
On Mon, Feb 26, 2018 at 12:10 PM, Andrew Collins
 wrote:
> The man page for tc-police states that the MTU defaults to
> unlimited if peakrate is not specified, but it actually defaults
> to 2047.

I don't find such statement from the man page:
http://man7.org/linux/man-pages/man8/tc-police.8.html


What am I missing?


Re: [PATCH v2 net] net: phy: Restore phy_resume() locking assumption

2018-02-26 Thread Florian Fainelli
On February 26, 2018 4:56:06 PM PST, Andrew Lunn  wrote:
>commit f5e64032a799 ("net: phy: fix resume handling") changes the
>locking semantics for phy_resume() such that the caller now needs to
>hold the phy mutex. Not all call sites were adopted to this new
>semantic, resulting in warnings from the added
>WARN_ON(!mutex_is_locked(&phydev->lock)).  Rather than change the
>semantics, add a __phy_resume() and restore the old behavior of
>phy_resume().
>
>Reported-by: Heiner Kallweit 
>Fixes: f5e64032a799 ("net: phy: fix resume handling")
>Signed-off-by: Andrew Lunn 
>---
>v2: Fix typo in phy_resume() calling itself
>---

This version looks good thanks for taking care of that:

Reviewed-by: Florian Fainelli 

-- 
Florian


Re: [PATCH net-next] net/ncsi: Add generic netlink family

2018-02-26 Thread Samuel Mendoza-Jonas
On Mon, 2018-02-26 at 11:31 -0500, David Miller wrote:
> From: Samuel Mendoza-Jonas 
> Date: Fri, 23 Feb 2018 15:15:18 +1100
> 
> > + * @NCSI_CMD_SET_INTERFACE: set preferred package and channel combination.
> > + *   Requires NCSI_ATTR_IFINDEX and the preferred NCSI_ATTR_PACKAGE_ID and
> > + *   optionally the preferred NCSI_ATTR_CHANNEL_ID. If neither IDs are
> > + *   specified the setting is cleared.
> 
> I think clearing the setting when the required attributes are missing
> is dangerous behavior.
> 
> It is ambiguous whether the user intended the setting to be cleared,
> or was in error and forgot to supply the attribute due to a bug.

Fair point - I'll change this to be an error and add a separate command
to clear the setting explicitly.
In that vein, is having NCSI_ATTR_CHANNEL_ID as an optional parameter
ambiguous enough to justify a separate command as well?

Regards,
Sam


Re: [net PATCH 1/4] virtio_net: disable XDP_REDIRECT in receive_mergeable() case

2018-02-26 Thread Jason Wang



On 2018-02-27 10:25, Jason Wang wrote:



On 2018-02-27 08:40, Michael S. Tsirkin wrote:

IMHO we should consider NOT supporting XDP in receive_mergeable() at
all, because the principles behind XDP are to gain speed by (1) code
simplicity, (2) sacrificing memory and (3) where possible moving
runtime checks to setup time.  These principles are clearly being
violated in receive_mergeable(), which e.g. tracks average buffer size
at runtime to save memory consumption.

In the longer run, we should consider introducing a separate receive
function when attaching an XDP program, and also change the memory
model to be compatible with XDP when attaching an XDP prog.

I agree with a separate function approach.

So each buffer is tagged as xdp/non xdp, we check that
and handle appropriately - where non xdp could be handled
by the generic path.


If we want to have separated function, should we do it for all XDP 
capable drivers instead of virtio-net only?


Thanks



What's more, since we don't stop the device during XDP setup, even if a
buffer is tagged as xdp/non-xdp during refill, we still can't make sure
whether or not XDP was used at the receive function. So we cannot handle
them separately.


Thanks


Re: [RFC PATCH v2] ptr_ring: linked list fallback

2018-02-26 Thread Jason Wang



On 2018-02-27 04:34, Michael S. Tsirkin wrote:

On Mon, Feb 26, 2018 at 11:15:42AM +0800, Jason Wang wrote:

On 2018-02-26 09:17, Michael S. Tsirkin wrote:

So pointer rings work fine, but they have a problem: make them too small
and not enough entries fit.  Make them too large and you start flushing
your cache and running out of memory.

This is a new idea of mine: a ring backed by a linked list. Once you run
out of ring entries, instead of a drop you fall back on a list with a
common lock.

Should work well for the case where the ring is typically sized
correctly, but will help address the fact that some user try to set e.g.
tx queue length to 100.

In other words, the idea is that if a user sets a really huge TX queue
length, we allocate a ptr_ring which is smaller, and use the backup
linked list when necessary to provide the requested TX queue length
legitimately.

My hope this will move us closer to direction where e.g. fw codel can
use ptr rings without locking at all.  The API is still very rough, and
I really need to take a hard look at lock nesting.

Compiled only, sending for early feedback/flames.

Signed-off-by: Michael S. Tsirkin
---

changes from v1:
- added clarifications by DaveM in the commit log
- build fixes

   include/linux/ptr_ring.h | 64 
+---
   1 file changed, 61 insertions(+), 3 deletions(-)

diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
index d72b2e7..8aa8882 100644
--- a/include/linux/ptr_ring.h
+++ b/include/linux/ptr_ring.h
@@ -31,11 +31,18 @@
   #include 
   #endif
+/* entries must start with the following structure */
+struct plist {
+   struct plist *next;
+   struct plist *last; /* only valid in the 1st entry */
+};

So I wonder whether or not it's better to do this in e.g. the skb_array
implementation. Then it can use its own prev/next field.

XDP uses ptr ring directly, doesn't it?



Well I believe the main user for this is qdisc, which uses skb_array. And
we cannot use what is implemented in this patch directly for sk_buff
without some changes to the data structure.


For XDP, we need to embed plist in struct xdp_buff too, so it looks to
me that the better approach is to have separate functions for ptr_ring
and skb_array.


Thanks
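
A minimal sketch of the produce-side fallback being discussed (all names
are hypothetical, not from the actual patch; entries are assumed to embed
a list pointer, as with struct plist above):

	#include <linux/ptr_ring.h>
	#include <linux/spinlock.h>

	struct overflow_list {
		struct plist *head;
		struct plist *tail;
		spinlock_t lock;
	};

	static int produce_with_fallback(struct ptr_ring *r,
					 struct overflow_list *ov,
					 struct plist *p)
	{
		/* Fast path: try the fixed-size ring first. */
		if (!ptr_ring_produce(r, p))
			return 0;

		/* Slow path: ring full, append to the common-lock list. */
		spin_lock(&ov->lock);
		p->next = NULL;
		if (ov->tail)
			ov->tail->next = p;
		else
			ov->head = p;
		ov->tail = p;
		spin_unlock(&ov->lock);
		return 0;
	}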


Re: [net PATCH 1/4] virtio_net: disable XDP_REDIRECT in receive_mergeable() case

2018-02-26 Thread Jason Wang



On 2018-02-27 08:40, Michael S. Tsirkin wrote:

IMHO we should consider NOT supporting XDP in receive_mergeable() at
all, because the principles behind XDP are to gain speed by (1) code
simplicity, (2) sacrificing memory and (3) where possible moving
runtime checks to setup time.  These principles are clearly being
violated in receive_mergeable(), which e.g. tracks average buffer size
at runtime to save memory consumption.

In the longer run, we should consider introducing a separate receive
function when attaching an XDP program, and also change the memory
model to be compatible with XDP when attaching an XDP prog.

I agree with a separate function approach.

So each buffer is tagged as xdp/non xdp, we check that
and handle appropriately - where non xdp could be handled
by the generic path.


If we want to have separated function, should we do it for all XDP 
capable drivers instead of virtio-net only?


Thanks



Re: [net-next 1/7] i40e: link_down_on_close private flag support

2018-02-26 Thread Jakub Kicinski
On Mon, 26 Feb 2018 13:39:25 -0800, Jeff Kirsher wrote:
> From: Mariusz Stachura 
> 
> This patch introduces a new ethtool private flag used for
> forcing the true link state. The function i40e_force_link_state(), which
> implements this functionality, was added; it sets phy_type = 0 in order
> to work around the firmware's LESM. False-positive error messages were
> suppressed.
> 
> The ndo_open() should not succeed if there were issues with forcing link
> state to be UP.
> 
> Added I40E_PHY_TYPES_BITMASK define with all phy types OR-ed together in
> one bitmask.  Added after phy type definition, so it will be hard to
> forget to include new phy types to the bitmask.
> 
> CC: Jakub Kicinski 
> Signed-off-by: Mariusz Stachura 
> Signed-off-by: Mitch Williams 
> Tested-by: Andrew Bowers 
> Signed-off-by: Jeff Kirsher 

Thanks, from a functional perspective this looks reasonable AFAICT.

We may want to have a conversation about priv flags like this at a later
time, since I guess more NICs won't bring the link down when the netdev
is closed today...  Although it may be a larger endeavour, perhaps it
would be cool to communicate to the user the reason why the MAC/PHY is on
in some standard way...  NC-SI and multi-host come to mind.

> @@ -7524,6 +7596,9 @@ int i40e_open(struct net_device *netdev)
>  
>   netif_carrier_off(netdev);
>  
> + if (i40e_force_link_state(pf, true))
> + return -EAGAIN;

There are some minor style issues with the rest of the patch, but here
you may want to (a) propagate the actual error value;

>   err = i40e_vsi_open(vsi);
>   if (err)
>   return err;

(b) undo the link state if vsi_open() fails.

Maybe that's a bit of a nitpick...
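
One possible shape of both suggestions, as a sketch only (untested, and
assuming i40e_force_link_state(pf, false) is the proper undo):

	err = i40e_force_link_state(pf, true);
	if (err)
		return err;		/* (a) propagate the real error */

	err = i40e_vsi_open(vsi);
	if (err) {
		/* (b) undo the forced link state on failure */
		i40e_force_link_state(pf, false);
		return err;
	}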


http://www.skbuff.net/iputils/iputils-current.tar.bz2 gives 404 error.

2018-02-26 Thread Graph Worlok
I tried posting this last week, but it seems to have been lost.

Did somebody forget to migrate part of a script to correctly link
-current against the latest tarball at http://www.skbuff.net/iputils/
?

It's referenced as the source location for ping in just about any
distro I can get my hands on.

Regards.


Re: [PATCH bpf-next v2] samples/bpf: Add program for CPU state statistics

2018-02-26 Thread Leo Yan
On Mon, Feb 26, 2018 at 11:26:52AM +0100, Daniel Borkmann wrote:
> On 02/26/2018 02:19 AM, Leo Yan wrote:

[...]

> > CPU states statistics:
> > state(ms)  cstate-0  cstate-1  cstate-2  pstate-0  pstate-1  pstate-2  pstate-3  pstate-4
> > CPU-0      767       6111      111863    561       31        756       853       190
> > CPU-1      241       10606     107956    484       125       646       990       85
> > CPU-2      413       19721     98735     636       84        696       757       89
> > CPU-3      84        11711     79989     17516     909       4811      5773      341
> > CPU-4      152       19610     98229     444       53        649       708       1283
> > CPU-5      185       8781      108697    666       91        671       677       1365
> > CPU-6      157       21964     95825     581       67        566       684       1284
> > CPU-7      125       15238     102704    398       20        665       786       1197
> > 
> > Cc: Daniel Lezcano 
> > Cc: Vincent Guittot 
> > Signed-off-by: Leo Yan 
> 
> Applied to bpf-next, thanks Leo!

Thanks for help, Daniel.


Re: [PATCH bpf-next v8 05/11] seccomp,landlock: Enforce Landlock programs per process hierarchy

2018-02-26 Thread Alexei Starovoitov
On Tue, Feb 27, 2018 at 01:41:15AM +0100, Mickaël Salaün wrote:
> The seccomp(2) syscall can be used by a task to apply a Landlock program
> to itself. As a seccomp filter, a Landlock program is enforced for the
> current task and all its future children. A program is immutable and a
> task can only add new restricting programs to itself, forming a list of
> programs.
> 
> A Landlock program is tied to a Landlock hook. If the action on a kernel
> object is allowed by the other Linux security mechanisms (e.g. DAC,
> capabilities, other LSM), then a Landlock hook related to this kind of
> object is triggered. The list of programs for this hook is then
> evaluated. Each program returns a 32-bit value which can deny the action
> on a kernel object with a non-zero value. If every program in the list
> returns zero, then the action on the object is allowed.
> 
> Multiple Landlock programs can be chained to share a 64-bit value for a
> call chain (e.g. evaluating multiple elements of a file path).  This
> chaining is restricted when a process constructs this chain by loading a
> program, but additional checks are performed when it requests to apply
> this chain of programs to itself.  The restrictions ensure that it is
> not possible to call multiple programs in a way that would imply
> handling multiple shared values (i.e. cookies) for one chain.  For now,
> only a fs_pick program can be chained to the same type of program,
> because it may make sense if they have different triggers (cf. next
> commits).  This restriction still allows reusing Landlock programs in
> a safe way (e.g. use the same loaded fs_walk program with multiple
> chains of fs_pick programs).
> 
> Signed-off-by: Mickaël Salaün 

...

> +struct landlock_prog_set *landlock_prepend_prog(
> + struct landlock_prog_set *current_prog_set,
> + struct bpf_prog *prog)
> +{
> + struct landlock_prog_set *new_prog_set = current_prog_set;
> + unsigned long pages;
> + int err;
> + size_t i;
> + struct landlock_prog_set tmp_prog_set = {};
> +
> + if (prog->type != BPF_PROG_TYPE_LANDLOCK_HOOK)
> + return ERR_PTR(-EINVAL);
> +
> + /* validate memory size allocation */
> + pages = prog->pages;
> + if (current_prog_set) {
> + size_t i;
> +
> + for (i = 0; i < ARRAY_SIZE(current_prog_set->programs); i++) {
> + struct landlock_prog_list *walker_p;
> +
> + for (walker_p = current_prog_set->programs[i];
> + walker_p; walker_p = walker_p->prev)
> + pages += walker_p->prog->pages;
> + }
> + /* count a struct landlock_prog_set if we need to allocate one 
> */
> +		if (refcount_read(&current_prog_set->usage) != 1)
> + pages += round_up(sizeof(*current_prog_set), PAGE_SIZE)
> + / PAGE_SIZE;
> + }
> + if (pages > LANDLOCK_PROGRAMS_MAX_PAGES)
> + return ERR_PTR(-E2BIG);
> +
> + /* ensure early that we can allocate enough memory for the new
> +  * prog_lists */
> +	err = store_landlock_prog(&tmp_prog_set, current_prog_set, prog);
> + if (err)
> + return ERR_PTR(err);
> +
> + /*
> +  * Each task_struct points to an array of prog list pointers.  These
> +  * tables are duplicated when additions are made (which means each
> +  * table needs to be refcounted for the processes using it). When a new
> +  * table is created, all the refcounters on the prog_list are bumped (to
> +  * track each table that references the prog). When a new prog is
> +  * added, it's just prepended to the list for the new table to point
> +  * at.
> +  *
> +  * Manage all the possible errors before this step to not uselessly
> +  * duplicate current_prog_set and avoid a rollback.
> +  */
> + if (!new_prog_set) {
> + /*
> +  * If there is no Landlock program set used by the current task,
> +  * then create a new one.
> +  */
> + new_prog_set = new_landlock_prog_set();
> + if (IS_ERR(new_prog_set))
> + goto put_tmp_lists;
> +	} else if (refcount_read(&current_prog_set->usage) > 1) {
> + /*
> +  * If the current task is not the sole user of its Landlock
> +  * program set, then duplicate them.
> +  */
> + new_prog_set = new_landlock_prog_set();
> + if (IS_ERR(new_prog_set))
> + goto put_tmp_lists;
> + for (i = 0; i < ARRAY_SIZE(new_prog_set->programs); i++) {
> + new_prog_set->programs[i] =
> + READ_ONCE(current_prog_set->programs[i]);
> + if (new_prog_set->programs[i])
> +			refcount_inc(&new_prog_set->programs[i]->usage);
> + }
> 

Re: [PATCH net] r8152: fix tx packets accounting

2018-02-26 Thread David Miller
From: Eric Dumazet 
Date: Sun, 25 Feb 2018 19:12:10 -0800

> From: Eric Dumazet 
> 
> r8152 driver handles TSO packets (limited to ~16KB) quite well,
> but pretends each TSO logical packet is a single packet on the wire.
> 
> There is also some error since headers are accounted once, but
> error rate is small enough that we do not care.
> 
> Signed-off-by: Eric Dumazet 

Applied, thanks Eric.


Re: [net-next 00/13][pull request] 10GbE Intel Wired LAN Driver Updates 2018-02-26

2018-02-26 Thread David Miller
From: Jeff Kirsher 
Date: Mon, 26 Feb 2018 10:07:43 -0800

> This series contains updates to ixgbe and ixgbevf only.
 ...
> Emil updates ixgbevf with several features and improvements done in
> other drivers, starting with the handling of page addresses so that we
> always refer to them using a void pointer.  Added a 'legacy-rx' flag to
> allow switching between the old and new receive code paths.  Added
> support for using 3K buffers in order-1 pages.  Updated the driver to
> ensure that calls to ixgbevf_open() are rtnl lock protected and improved
> the error handling when setting up multiple queues.  Added support for
> providing a buffer with head room and tail room to allow for shared
> info, NET_SKB_PAD, and NET_IP_ALIGN, so that we can start using
> build_skb to build frames instead of using memcpy() the headers.
> Updated the logic of handling rings closer to ixgbe.  Consolidated the
> receive paths to reduce duplication when we expand them in the future.
> Added build_skb() support to ixgbevf.

This seems to suggest that XDP support for ixgbevf is coming next? :-)

> The following are changes since commit 
> f74290fdb363665538743d14c4f00aeacdb68d87:
>   Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
> and are available in the git repository at:
>   git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 10GbE

Pulled, thanks Jeff.


Re: [next-queue PATCH 7/8] igb: Add support for adding offloaded clsflower filters

2018-02-26 Thread Vinicius Costa Gomes
Hi,

Florian Fainelli  writes:

> On 02/26/2018 04:40 PM, Vinicius Costa Gomes wrote:
>> Hi,
>> 
>> Florian Fainelli  writes:
>> 
>>> On February 23, 2018 5:20:35 PM PST, Vinicius Costa Gomes 
>>>  wrote:
 This allows filters added by tc-flower and specifying MAC addresses,
 Ethernet types, and the VLAN priority field, to be offloaded to the
 controller.

 This reuses most of the infrastructure used by ethtool, ethtool can be
 used to read these filters, but modification and deletion can only be
 done via tc-flower.
>>>
>>> You would want to check what other drivers supporting both
>>> ethtool::rxnfc and cls_flower do, but this can be seriously confusing
>>> to an user. As an user I would be more comfortable with seeing only
>>> rules added through ethtool via ethtool and those added by cls_flower
>>> via cls_flower. They will both access a shared set of resources but it
>>> seems easier for me to dump rules with both tools to figure out why
>>> -ENOSPC was returned rather than seeing something I did not add.
>>> Others might see it entirely differently.
>> 
>> I took a closer look at mlx5 and i40e, and they seem to use different
>> hardware capabilities (even incompatible in the case of i40e, which
>> returns an error when trying to add a cls_flower filter when an ethtool
>> based filter exists) for ethtool and cls_flower. So I don't think the
>> model applies exactly here.
>> 
>> As they keep the filters separated from the user perspective, I could do
>> the same, in the name of convention, it's just that the separation is
>> more "artificial". But I have no strong opinions either way.
>
> True, I would still conform to what these two drivers do since they have
> a large user base (so does igb, but not for cls_flower yet, since you
> are obviously working on it).

Awesome. Will present them as separate to the user, then.

>
>> 
>>>
>>> If you added the ability for cls_flower to indicate a queue number and
>>> either a fixed rule index or auto-placement (RX_CLS_LOC_ANY), could
>>> that eliminate entirely the need for adding MAC address steering in
>>> earlier patches?
>> 
>> I am not sure that I understand. 'cls_flower' already has support for
>> indicating a queue number (expressed via the 'hw_tc' parameter to tc)
>> (commit 384c181e3780 ("net: sched: Identify hardware traffic classes
>> using classid").
>
> I had missed that cls_flower gained the capability to specify a queue
> number, that's good. What it still does not support AFAICT, but ethtool
> does, is either automatically allocating a rule location (Rule ID
> shown by ethtool) or allowing placement at a specific location. This can
> be important when the rule location can be carried by the hardware on
> e.g: a per-packet basis, the hardware that I work with (bcm_sf2_cfp.c)
> makes use of that for instance, maybe this is such an isolated case that
> I should take care of it at some point if I was remotely serious into
> providing tc/cls_flower support for that driver...

Now I am starting to see what you meant about the rule location. In the
case of the igb-based controller, only rule 0 is special (it's reserved
for the local address), and the user has no control over it. From what I
can see, for the igb driver, the location of rules only controls the
order in which they are displayed to the user.

>
>> 
>> And adding more control for the allocation of indexes for the rules seem
>> not to help much in reducing the size/complexity of this series. I.e.
>> this series has 4 parts: bug fixes, adding support for source addresses
>> for MAC filters, adding ethtool support MAC address filters (as it was
>> only missing some glue code), and adding offloading for some types of
>> cls_flower filters. More control for the allocation of rule indexes would
>> only affect the cls_flower part.
>> 
>> But perhaps I could be missing something here.
>
> You are absolutely right, it was not so much about trying to reduce the
> complexity as about avoiding having two user interface facilities:
> ethtool and tc/cls_flower to do essentially the same thing, yet having
> some small differences in the offered capabilities, in the case of
> tc/cls_flower, lack of specification of rule location.
> -- 
> Florian


Cheers,
--
Vinicius


Re: [PATCH bpf-next v8 01/11] fs,security: Add a security blob to nameidata

2018-02-26 Thread Al Viro
On Tue, Feb 27, 2018 at 12:57:21AM +, Al Viro wrote:
> On Tue, Feb 27, 2018 at 01:41:11AM +0100, Mickaël Salaün wrote:
> > The function current_nameidata_security(struct inode *) can be used to
> > retrieve a blob's pointer address tied to the inode being walked through.
> > This enables following a path lookup and knowing where an inode access
> > comes from. This is needed for the Landlock LSM to be able to restrict
> > access to a file path.
> > 
> > The LSM hook nameidata_free_security(struct inode *) is called before
> > freeing the associated nameidata.
> 
> NAK.  Not without well-defined semantics and "some Linux S uses that for
> something, don't ask what" does not count.

Incidentally, pathwalk mechanics is subject to change at zero notice, so
if you want something, you'd better
* have explicitly defined semantics
* explain what it is - on fsdevel
* not have it hidden behind the layers of opaque LSM dreck, pardon
the redundancy.

Again, pathwalk internals have changed in the past and may bloody well
change again in the future.  There's a damn good reason why struct nameidata
is _not_ visible outside of fs/namei.c, and quietly relying upon any
implementation details is no-go.


Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-26 Thread Michael S. Tsirkin
On Mon, Feb 26, 2018 at 05:02:18PM -0800, Stephen Hemminger wrote:
> On Mon, 26 Feb 2018 08:19:24 +0100
> Jiri Pirko  wrote:
> 
> > Sat, Feb 24, 2018 at 12:59:04AM CET, step...@networkplumber.org wrote:
> > >On Thu, 22 Feb 2018 13:30:12 -0800
> > >Alexander Duyck  wrote:
> > >  
> > >> > Again, I undertand your motivation. Yet I don't like your solution.
> > >> > But if the decision is made to do this in-driver bonding. I would like
> > >> > to see it baing done some generic way:
> > >> > 1) share the same "in-driver bonding core" code with netvsc
> > >> >put to net/core.
> > >> > 2) the "in-driver bonding core" will strictly limit the functionality,
> > >> >like active-backup mode only, one vf, one backup, vf netdev type
> > >> >check (so noone could enslave a tap or anything else)
> > >> > If user would need something more, he should employ team/bond.
> > >
> > >Sharing would be good, but netvsc world would really like to only have
> > >one visible network device.  
> > 
> > Why do you mind? All would be the same, there would be just another
> > netdevice unused by the vm user (same as the vf netdev).
> > 
> 
> I mind because our requirement is no changes to userspace.
> No special udev rules, no bonding script, no setup.

Agreed. It is mostly fine from this point of view, except that you need
to know to skip the slaves.  Maybe we could look at some kind of
trick e.g. pretending link is down for slaves?

> Things like cloudinit running on current distro's expect to see a single
> eth0.  The VF device show up can also be an issue because distro's have
> stupid rules like Network Manager trying to start DHCP on every interface.
> We deal with that now by doing stuff like udev rules to get it to stop
> but that is still causing user errors.

So the ideal of a single net device isn't achieved by netvsc.

Since you have scripts to skip the PT device, can't they
hide the PV slave too? How do they identify the device to skip?

I agree it would be nice to have a way to hide the extra netdev
from userspace.

The benefit of the separation is that each slave device can
be configured with e.g. its own native ethtool commands for
optimum performance.

-- 
MST


Re: [PATCH 0/2] mark some slabs as visible not mergeable

2018-02-26 Thread David Miller
From: Stephen Hemminger 
Date: Mon, 26 Feb 2018 13:46:13 -0800

> This is ancient original iproute2 code that dumpster dives into
> slabinfo to get summary statistics on active objects.
> 
>   1) open sockets (sock_inode_cache)

The sockets inuse counter from /proc/net/sockstat is really
sufficient for this.

>   2) TCP ports bound (tcp_bind_buckets) [*]
>   3) TCP time wait sockets (tw_sock_TCP) [*]

Time wait is provided by /proc/net/sockstat as well.

>   4) TCP syn sockets (request_sock_TCP) [*]

It shouldn't be too hard to fill in the last two gaps, maintaining a
counter for bind buckets and request socks, and exporting them in new
/proc/net/sockstat fields.

That would be so much better than disabling a useful optimization
in the SLAB allocator.

Thank you.
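
For illustration, a hypothetical sketch (not an actual patch) of the
export side: two new counters bumped in the bind-bucket and request-sock
alloc/free paths, then printed from sockstat_seq_show() in
net/ipv4/proc.c:

	/* 'tcp_bind_bucket_count' and 'tcp_request_sock_count' are
	 * hypothetical counters maintained by the alloc/free paths.
	 */
	seq_printf(seq, "TCP: bind %d req %d\n",
		   atomic_read(&tcp_bind_bucket_count),
		   atomic_read(&tcp_request_sock_count));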


Re: [RFC PATCH v3 0/3] Enable virtio_net to act as a backup for a passthru device

2018-02-26 Thread Stephen Hemminger
On Mon, 26 Feb 2018 08:19:24 +0100
Jiri Pirko  wrote:

> Sat, Feb 24, 2018 at 12:59:04AM CET, step...@networkplumber.org wrote:
> >On Thu, 22 Feb 2018 13:30:12 -0800
> >Alexander Duyck  wrote:
> >  
> >> > Again, I undertand your motivation. Yet I don't like your solution.
> >> > But if the decision is made to do this in-driver bonding. I would like
> >> > to see it baing done some generic way:
> >> > 1) share the same "in-driver bonding core" code with netvsc
> >> >put to net/core.
> >> > 2) the "in-driver bonding core" will strictly limit the functionality,
> >> >like active-backup mode only, one vf, one backup, vf netdev type
> >> >check (so noone could enslave a tap or anything else)
> >> > If user would need something more, he should employ team/bond.
> >
> >Sharing would be good, but netvsc world would really like to only have
> >one visible network device.  
> 
> Why do you mind? All would be the same, there would be just another
> netdevice unused by the vm user (same as the vf netdev).
> 

I mind because our requirement is no changes to userspace.
No special udev rules, no bonding script, no setup.

Things like cloudinit running on current distros expect to see a single
eth0.  The VF device showing up can also be an issue because distros have
stupid rules like Network Manager trying to start DHCP on every interface.
We deal with that now by doing stuff like udev rules to get it to stop,
but that is still causing user errors.


Re: [net-next v3 0/2] eBPF seccomp filters

2018-02-26 Thread Tycho Andersen
On Mon, Feb 26, 2018 at 03:20:15PM -0800, Kees Cook wrote:
> On Mon, Feb 26, 2018 at 3:04 PM, Alexei Starovoitov
>  wrote:
> > On Mon, Feb 26, 2018 at 07:26:54AM +, Sargun Dhillon wrote:
> >> This patchset enables seccomp filters to be written in eBPF. Although, this
> >> [...]
> > The main statement I want to hear from seccomp maintainers before
> > proceeding any further on this that enabling eBPF in seccomp won't lead
> > to seccomp folks arguing against changes in bpf core (like verifier)
> > just because it's used by seccomp.
> > It must be spelled out in the commit log with explicit Ack.
> 
> The primary thing I'm concerned about with eBPF and seccomp is
> side-effects from eBPF programs running at syscall time. This is an
> extremely sensitive area, and I want to be sure there won't be
> feature-creep here that leads to seccomp getting into a bad state.
> 
> As long as seccomp can continue have its own verifier,

I guess these patches should introduce some additional restrictions in
kernel/seccomp.c then? Based on my reading now, it's whatever the eBPF
verifier allows.
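
For instance, a hedged sketch of one such restriction (the function name
is made up; used_map_cnt is the verifier's existing bookkeeping):

#include <linux/bpf.h>
#include <linux/errno.h>

/* Hypothetical check in kernel/seccomp.c: reject eBPF filters that
 * reference maps, so a filter cannot carry state or side effects. */
static int seccomp_check_ebpf_prog(const struct bpf_prog *prog)
{
	if (prog->aux->used_map_cnt)
		return -EINVAL;
	return 0;
}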

> I *think* this will be fine, though, again I remain concerned about
> maps, etc. I'm still reviewing these patches and how they might
> provide overlap with Tycho's needs too, etc.

Yes, it's on my TODO list to take a look at how to do it as suggested
by Alexi on top of this set before posting a v2. Haven't had time
recently, though.

Cheers,

Tycho


Re: [PATCH bpf-next v8 01/11] fs,security: Add a security blob to nameidata

2018-02-26 Thread Al Viro
On Tue, Feb 27, 2018 at 01:41:11AM +0100, Mickaël Salaün wrote:
> The function current_nameidata_lookup(const struct inode *) can be used
> to retrieve a blob's pointer address tied to the inode being walked
> through. This enables following a path lookup and knowing where an inode
> access comes from. This is needed for the Landlock LSM to be able to
> restrict access to file paths.
> 
> The LSM hook nameidata_put_lookup() is called before freeing the
> associated nameidata.

NAK.  Not without well-defined semantics, and "some Linux LSM uses that for
something, don't ask what" does not count.


[PATCH v2 net] net: phy: Restore phy_resume() locking assumption

2018-02-26 Thread Andrew Lunn
commit f5e64032a799 ("net: phy: fix resume handling") changes the
locking semantics for phy_resume() such that the caller now needs to
hold the phy mutex. Not all call sites were adapted to this new
semantic, resulting in warnings from the added
WARN_ON(!mutex_is_locked(&phydev->lock)).  Rather than change the
semantics, add a __phy_resume() and restore the old behavior of
phy_resume().

Reported-by: Heiner Kallweit 
Fixes: f5e64032a799 ("net: phy: fix resume handling")
Signed-off-by: Andrew Lunn 
---
v2: Fix typo in phy_resume() calling itself
---
 drivers/net/phy/phy.c|  2 +-
 drivers/net/phy/phy_device.c | 18 +-
 include/linux/phy.h  |  1 +
 3 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index e3e29c2b028b..a6f924fee584 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -819,7 +819,7 @@ void phy_start(struct phy_device *phydev)
break;
case PHY_HALTED:
/* if phy was suspended, bring the physical link up again */
-   phy_resume(phydev);
+   __phy_resume(phydev);
 
/* make sure interrupts are re-enabled for the PHY */
if (phy_interrupt_is_valid(phydev)) {
diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index d39ae77707ef..478405e544cc 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -135,9 +135,7 @@ static int mdio_bus_phy_resume(struct device *dev)
if (!mdio_bus_phy_may_suspend(phydev))
goto no_resume;
 
-   mutex_lock(&phydev->lock);
ret = phy_resume(phydev);
-   mutex_unlock(&phydev->lock);
if (ret < 0)
return ret;
 
@@ -1041,9 +1039,7 @@ int phy_attach_direct(struct net_device *dev, struct phy_device *phydev,
if (err)
goto error;
 
-   mutex_lock(&phydev->lock);
phy_resume(phydev);
-   mutex_unlock(&phydev->lock);
phy_led_triggers_register(phydev);
 
return err;
@@ -1172,7 +1168,7 @@ int phy_suspend(struct phy_device *phydev)
 }
 EXPORT_SYMBOL(phy_suspend);
 
-int phy_resume(struct phy_device *phydev)
+int __phy_resume(struct phy_device *phydev)
 {
struct phy_driver *phydrv = to_phy_driver(phydev->mdio.dev.driver);
int ret = 0;
@@ -1189,6 +1185,18 @@ int phy_resume(struct phy_device *phydev)
 
return ret;
 }
+EXPORT_SYMBOL(__phy_resume);
+
+int phy_resume(struct phy_device *phydev)
+{
+   int ret;
+
+   mutex_lock(&phydev->lock);
+   ret = __phy_resume(phydev);
+   mutex_unlock(&phydev->lock);
+
+   return ret;
+}
 EXPORT_SYMBOL(phy_resume);
 
 int phy_loopback(struct phy_device *phydev, bool enable)
diff --git a/include/linux/phy.h b/include/linux/phy.h
index 5a0c3e53e7c2..d7069539f351 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -924,6 +924,7 @@ void phy_device_remove(struct phy_device *phydev);
 int phy_init_hw(struct phy_device *phydev);
 int phy_suspend(struct phy_device *phydev);
 int phy_resume(struct phy_device *phydev);
+int __phy_resume(struct phy_device *phydev);
 int phy_loopback(struct phy_device *phydev, bool enable);
 struct phy_device *phy_attach(struct net_device *dev, const char *bus_id,
  phy_interface_t interface);
-- 
2.16.2



Re: [net-next v3 1/2] bpf, seccomp: Add eBPF filter capabilities

2018-02-26 Thread Tycho Andersen
On Mon, Feb 26, 2018 at 07:27:05AM +, Sargun Dhillon wrote:
> +config SECCOMP_FILTER_EXTENDED
> + bool "Extended BPF seccomp filters"
> + depends on SECCOMP_FILTER && BPF_SYSCALL
> + depends on !CHECKPOINT_RESTORE

Why not just give -EINVAL or something in case one of these is
requested, instead of making them incompatible at compile time?

Tycho


Re: [next-queue PATCH 7/8] igb: Add support for adding offloaded clsflower filters

2018-02-26 Thread Florian Fainelli
On 02/26/2018 04:40 PM, Vinicius Costa Gomes wrote:
> Hi,
> 
> Florian Fainelli  writes:
> 
>> On February 23, 2018 5:20:35 PM PST, Vinicius Costa Gomes 
>>  wrote:
>>> This allows filters added by tc-flower and specifying MAC addresses,
>>> Ethernet types, and the VLAN priority field, to be offloaded to the
>>> controller.
>>>
>>> This reuses most of the infrastructure used by ethtool, ethtool can be
>>> used to read these filters, but modification and deletion can only be
>>> done via tc-flower.
>>
>> You would want to check what other drivers supporting both
>> ethtool::rxnfc and cls_flower do, but this can be seriously confusing
>> to a user. As a user I would be more comfortable with seeing only
>> rules added through ethtool via ethtool and those added by cls_flower
>> via cls_flower. They will both access a shared set of resources but it
>> seems easier for me to dump rules with both tools to figure out why
>> -ENOSPC was returned rather than seeing something I did not add.
>> Others might see it entirely differently.
> 
> I took a closer look at mlx5 and i40e, and they seem to use different
> hardware capabilities (even incompatible in the case of i40e, which
> returns an error when trying to add a cls_flower filter when an ethtool
> based filter exists) for ethtool and cls_flower. So I don't think the
> model applies exactly here.
> 
> As they keep the filters separated for the user perspective, I could do
> the same, in the name of convention, it's just that the separation is
> more "artificial". But I have no strong opinions either way.

True, I would still conform to what these two drivers do since they have
a large user base (so does igb, but not yet for cls_flower yet since you
are obviously working on it).

> 
>>
>> If you added the ability for cls_flower to indicate a queue number and
>> either a fixed rule index or auto-placement (RX_CLS_LOC_ANY), could
>> that eliminate entirely the need for adding MAC address steering in
>> earlier patches?
> 
> I am not sure that I understand. 'cls_flower' already has support for
> indicating a queue number (expressed via the 'hw_tc' parameter to tc)
> (commit 384c181e3780 ("net: sched: Identify hardware traffic classes
> using classid").

I had missed that cls_flower gained the capability to specify a queue
number, that's good. What it still does not support AFAICT, but ethtool
does, is either automatically allocating a rule location (the Rule ID
shown by ethtool) or allowing placement at a specific location. This can
be important when the rule location can be carried by the hardware on,
e.g., a per-packet basis; the hardware that I work with (bcm_sf2_cfp.c)
makes use of that, for instance. Maybe this is such an isolated case that
I should take care of it at some point, if I were remotely serious about
providing tc/cls_flower support for that driver...
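
For the record, the user-visible difference looks roughly like this
(illustrative commands only; device name and addresses are made up):

# ethtool can place a rule at an explicit location (slot 5 here):
ethtool -N eth0 flow-type ether dst 01:80:c2:00:00:0e action 2 loc 5

# tc/cls_flower can pick the hardware traffic class, but not the slot:
tc filter add dev eth0 ingress flower skip_sw \
	dst_mac 01:80:c2:00:00:0e hw_tc 2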

> 
> And adding more control for the allocation of indexes for the rules seems
> not to help much in reducing the size/complexity of this series. I.e.
> this series has 4 parts: bug fixes, adding support for source addresses
> for MAC filters, adding ethtool support MAC address filters (as it was
> only missing some glue code), and adding offloading for some types of
> cls_flower filters. More control for the allocation of rule indexes would
> only affect the cls_flower part.
> 
> But perhaps I could be missing something here.

You are absolutely right, it was not so much about trying to reduce the
complexity as about avoiding having two user interface facilities,
ethtool and tc/cls_flower, doing essentially the same thing yet having
some small differences in the offered capabilities, in the case of
tc/cls_flower the lack of a way to specify the rule location.
-- 
Florian


RE: [PATCH RFC iproute-next 5/5] rdma: Add PD resource tracking information

2018-02-26 Thread Steve Wise
> > On Wed, Feb 14, 2018 at 01:07:01PM -0800, Steve Wise wrote:
> > > Sample output:
> > >
> > > # rdma resource show pd
> > > link cxgb4_0/- local_dma_lkey 0x0 usecnt 4 flags 0x0 pid 30503 comm
> rping
> >
> > One more thing, flags need to be pre-parsed and accessible with "-d"
> > command,
> > as we did with dev,link capabilities.
> 
> Will do.
> 
> Thanks for reviewing this series!
> 

Turns out the only flags field was in the pd, and pd->flags only has one
flag, currently, indicating that the global_dma_rkey is in use.  Since the
kernel side only sends up the global_dma_rkey if that flag is set, I just
dropped the PD_FLAGS attribute in both patch series.  If pd flags grow, we
can add it, but I don't expect that.


Steve.



[PATCH bpf-next v8 00/11] Landlock LSM: Toward unprivileged sandboxing

2018-02-26 Thread Mickaël Salaün
Hi,

This eighth series is a major revamp of the Landlock design compared to
the previous series [1]. This enables more flexibility and granularity
of access control with file paths. It is now possible to enforce an
access control according to a file hierarchy. Landlock uses the concept
of inode and path to identify such a hierarchy. In a way, it brings tools
to program what a file hierarchy is.

There are now three types of Landlock hooks: FS_WALK, FS_PICK and FS_GET.
Each of them accepts a dedicated eBPF program, called a Landlock
program.  They can be chained to enforce a full access control according
to a list of directories or files. The set of actions on a file is well
defined (e.g. read, write, ioctl, append, lock, mount...) taking
inspiration from the major Linux LSMs and some other access-controls
like Capsicum.  These program types are designed to be cache-friendly,
which gives room for optimizations in the future.

The documentation patch contains some kernel documentation and
explanations on how to use Landlock.  The compiled documentation and
a talk I gave at FOSDEM can be found here: https://landlock.io
This patch series can be found in the branch landlock-v8 in this repo:
https://github.com/landlock-lsm/linux

There are still some minor issues with this patch series, but it should
demonstrate how powerful this design may be. One of these issues is that
it is not a stackable LSM anymore, but the infrastructure management of
security blobs should allow stacking it with other LSMs [4].

This is the first step of the roadmap discussed at LPC [2].  While the
intended final goal is to allow unprivileged users to use Landlock, this
series allows only a process with global CAP_SYS_ADMIN to load and
enforce a rule.  This may help to get feedback and avoid unexpected
behaviors.

This series can be applied on top of bpf-next, commit 7d72637eb39f
("Merge branch 'x86-jit'").  This can be tested with
CONFIG_SECCOMP_FILTER and CONFIG_SECURITY_LANDLOCK.  I would really
appreciate constructive comments on the design and the code.


# Landlock LSM

The goal of this new Linux Security Module (LSM) called Landlock is to
allow any process, including unprivileged ones, to create powerful
security sandboxes comparable to XNU Sandbox or OpenBSD Pledge. This
kind of sandbox is expected to help mitigate the security impact of bugs
or unexpected/malicious behaviors in user-space applications.

The approach taken is to add the minimum amount of code while still
allowing the user-space application to create quite complex access
rules.  A dedicated security policy language such as the one used by
SELinux, AppArmor and other major LSMs involves a lot of code and is
usually permitted to only a trusted user (i.e. root).  On the contrary,
eBPF programs already exist and are designed to be safely loaded by
unprivileged user-space.

This design does not seem too intrusive but is flexible enough to allow
a powerful sandbox mechanism accessible by any process on Linux. The use
of seccomp and Landlock is best combined with a user-space library (e.g.
libseccomp) that could provide a high-level language to express a
security policy instead of raw eBPF programs. Moreover, thanks to the
LLVM front-end, it is quite easy to write an eBPF program with a subset
of the C language.


# Frequently asked questions

## Why is seccomp-bpf not enough?

A seccomp filter can access only raw syscall arguments (i.e. the
register values) which means that it is not possible to filter according
to the value pointed to by an argument, such as a file pathname. As an
embryonic Landlock version demonstrated, filtering at the syscall level
is complicated (e.g. need to take care of race conditions). This is
mainly because the access control checkpoints of the kernel are not at
this high-level but more underneath, at the LSM-hook level. The LSM
hooks are designed to handle this kind of checks.  Landlock abstracts
this approach to leverage the ability of unprivileged users to limit
themselves.

Cf. section "What it isn't?" in Documentation/prctl/seccomp_filter.txt


## Why use the seccomp(2) syscall?

Landlock uses the same semantics as seccomp to apply access rule
restrictions. It adds a new layer of security for the current process
which is inherited by its children. It makes sense to use a unique
access-restricting syscall (that should be allowed by seccomp filters)
which can only drop privileges. Moreover, a Landlock rule could come
from outside a process (e.g.  passed through a UNIX socket). It is then
useful to differentiate the creation/load of Landlock eBPF programs via
bpf(2), from rule enforcement via seccomp(2).
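
For illustration, enforcement from user space would look roughly like
this (hedged sketch: the command name follows this series' earlier
changelogs and its value here is invented):

#include <sys/syscall.h>
#include <unistd.h>

#ifndef SECCOMP_APPEND_LANDLOCK_RULE
#define SECCOMP_APPEND_LANDLOCK_RULE 2	/* assumed value, illustration only */
#endif

/* Apply an already-loaded Landlock eBPF program (bpf_fd) to the current
 * process and its future children, the same way a seccomp filter is
 * applied. */
static int landlock_enforce(int bpf_fd)
{
	return syscall(__NR_seccomp, SECCOMP_APPEND_LANDLOCK_RULE, 0, &bpf_fd);
}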


## Why a new LSM? Are SELinux, AppArmor, Smack and Tomoyo not good
   enough?

The current access control LSMs are fine for their purpose, which is to
give *root* the ability to enforce a security policy for the
*system*. What is missing is a way to enforce a security policy for any
application by its developer and 

[PATCH bpf-next v8 05/11] seccomp,landlock: Enforce Landlock programs per process hierarchy

2018-02-26 Thread Mickaël Salaün
The seccomp(2) syscall can be used by a task to apply a Landlock program
to itself. As a seccomp filter, a Landlock program is enforced for the
current task and all its future children. A program is immutable and a
task can only add new restricting programs to itself, forming a list of
programs.

A Landlock program is tied to a Landlock hook. If the action on a kernel
object is allowed by the other Linux security mechanisms (e.g. DAC,
capabilities, other LSMs), then a Landlock hook related to this kind of
object is triggered. The list of programs for this hook is then
evaluated. Each program returns a 32-bit value which can deny the action
on a kernel object with a non-zero value. If every program in the list
returns zero, then the action on the object is allowed.
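
As an illustration of this convention, a do-nothing Landlock program
would look like the following (the context type name is an assumption,
loosely based on the samples later in this series):

/* Compiled with LLVM to eBPF; returning 0 allows the checked action,
 * any non-zero return denies it. */
#include "bpf_helpers.h"	/* SEC() macro from samples/selftests */

struct landlock_ctx_fs_pick;	/* assumed context type for FS_PICK */

SEC("landlock1")
int fs_pick_allow(struct landlock_ctx_fs_pick *ctx)
{
	return 0;
}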

Multiple Landlock programs can be chained to share a 64-bit value for a
call chain (e.g. evaluating multiple elements of a file path).  This
chaining is restricted when a process constructs this chain by loading a
program, but additional checks are performed when it requests to apply
this chain of programs to itself.  The restrictions ensure that it is
not possible to call multiple programs in a way that would imply
handling multiple shared values (i.e. cookies) for one chain.  For now,
only a fs_pick program can be chained to the same type of program,
because it may make sense if they have different triggers (cf. next
commits).  This restriction still allows reusing Landlock programs in
a safe way (e.g. use the same loaded fs_walk program with multiple
chains of fs_pick programs).

Signed-off-by: Mickaël Salaün 
Cc: Alexei Starovoitov 
Cc: Andrew Morton 
Cc: Andy Lutomirski 
Cc: James Morris 
Cc: Kees Cook 
Cc: Serge E. Hallyn 
Cc: Will Drewry 
Link: https://lkml.kernel.org/r/c10a503d-5e35-7785-2f3d-25ed8dd63...@digikod.net
---

Changes since v7:
* handle and verify program chains
* split and rename providers.c to enforce.c and enforce_seccomp.c
* rename LANDLOCK_SUBTYPE_* to LANDLOCK_*

Changes since v6:
* rename some functions with more accurate names to reflect that an eBPF
  program for Landlock could be used for something else than a rule
* reword rule "appending" to "prepending" and explain it
* remove the superfluous no_new_privs check, only check global
  CAP_SYS_ADMIN when prepending a Landlock rule (needed for containers)
* create and use {get,put}_seccomp_landlock() (suggested by Kees Cook)
* replace ifdef with static inlined function (suggested by Kees Cook)
* use get_user() (suggested by Kees Cook)
* replace atomic_t with refcount_t (requested by Kees Cook)
* move struct landlock_{rule,events} from landlock.h to common.h
* cleanup headers

Changes since v5:
* remove struct landlock_node and use a similar inheritance mechanisme
  as seccomp-bpf (requested by Andy Lutomirski)
* rename SECCOMP_ADD_LANDLOCK_RULE to SECCOMP_APPEND_LANDLOCK_RULE
* rename file manager.c to providers.c
* add comments
* typo and cosmetic fixes

Changes since v4:
* merge manager and seccomp patches
* return -EFAULT in seccomp(2) when user_bpf_fd is null to easely check
  if Landlock is supported
* only allow a process with the global CAP_SYS_ADMIN to use Landlock
  (will be lifted in the future)
* add an early check to exit as soon as possible if the current process
  does not have Landlock rules

Changes since v3:
* remove the hard link with seccomp (suggested by Andy Lutomirski and
  Kees Cook):
  * remove the cookie which could imply multiple evaluation of Landlock
rules
  * remove the origin field in struct landlock_data
* remove documentation fix (merged upstream)
* rename the new seccomp command to SECCOMP_ADD_LANDLOCK_RULE
* internal renaming
* split commit
* new design to be able to inherit on the fly the parent rules

Changes since v2:
* Landlock programs can now be run without seccomp filter but for any
  syscall (from the process) or interruption
* move Landlock related functions and structs into security/landlock/*
  (to manage cgroups as well)
* fix seccomp filter handling: run Landlock programs for each of their
  legitimate seccomp filter
* properly clean up all seccomp results
* cosmetic changes to ease the understanding
* fix some ifdef
---
 include/linux/landlock.h|  37 
 include/linux/seccomp.h |   5 +
 include/uapi/linux/seccomp.h|   1 +
 kernel/fork.c   |   8 +-
 kernel/seccomp.c|   4 +
 security/landlock/Makefile  |   3 +-
 security/landlock/chain.c   |  39 
 security/landlock/chain.h   |  35 
 security/landlock/common.h  |  53 +
 security/landlock/enforce.c | 386 
 security/landlock/enforce.h |  21 ++
 security/landlock/enforce_seccomp.c | 102 ++
 12 files changed, 692 insertions(+), 2 

[PATCH bpf-next v8 03/11] bpf: Add eBPF program subtype and is_valid_subtype() verifier

2018-02-26 Thread Mickaël Salaün
The goal of the program subtype is to be able to have different static
fine-grained verifications for a unique program type.

The struct bpf_verifier_ops gets a new optional function:
is_valid_subtype(). This new callback is called at the beginning of the
eBPF program verification to check if the (optional) program subtype is
valid.

The struct bpf_prog_ops gets a new optional function: put_extra(). This
may be used to put extra data.

For now, only Landlock eBPF programs are using a program subtype (see
next commits) but this could be used by other program types in the
future.
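
A hedged sketch of what such a callback could look like for Landlock
(the exact signature is inferred from the description above, and the
hook type constants come from the later patches of this series):

#include <linux/bpf.h>
#include <uapi/linux/landlock.h>

/* Accept only subtypes that name a known Landlock hook type. */
static bool landlock_is_valid_subtype(const struct bpf_prog_extra *prog_extra)
{
	switch (prog_extra->subtype.landlock_hook.type) {
	case LANDLOCK_HOOK_FS_WALK:
	case LANDLOCK_HOOK_FS_PICK:
	case LANDLOCK_HOOK_FS_GET:
		return true;
	default:
		return false;
	}
}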

Signed-off-by: Mickaël Salaün 
Cc: Alexei Starovoitov 
Cc: Daniel Borkmann 
Cc: David S. Miller 
Link: https://lkml.kernel.org/r/20160827205559.ga43...@ast-mbp.thefacebook.com
---

Changes since v7:
* rename LANDLOCK_SUBTYPE_* to LANDLOCK_*
* move subtype in bpf_prog_aux and use only one bit for has_subtype
  (suggested by Alexei Starovoitov)
* wrap the prog_subtype with a prog_extra to be able to reference kernel
  pointers:
  * add an optional put_extra() function to struct bpf_prog_ops to be
able to free the pointed data
  * replace all the prog_subtype with prog_extra in the struct
bpf_verifier_ops functions
* remove the ABI field (requested by Alexei Starovoitov)
* rename subtype fields

Changes since v6:
* rename Landlock version to ABI to better reflect its purpose
* fix unsigned integer checks
* fix pointer cast
* constify pointers
* rebase

Changes since v5:
* use a prog_subtype pointer and make it future-proof
* add subtype test
* constify bpf_load_program()'s subtype argument
* cleanup subtype initialization
* rebase

Changes since v4:
* replace the "status" field with "version" (more generic)
* replace the "access" field with "ability" (less confusing)

Changes since v3:
* remove the "origin" field
* add an "option" field
* cleanup comments
---
 include/linux/bpf.h | 17 ++-
 include/uapi/linux/bpf.h| 11 +
 kernel/bpf/cgroup.c |  6 ++-
 kernel/bpf/core.c   |  5 +-
 kernel/bpf/syscall.c| 35 -
 kernel/bpf/verifier.c   | 19 ++--
 kernel/trace/bpf_trace.c| 15 --
 net/core/filter.c   | 76 ++---
 samples/bpf/bpf_load.c  |  3 +-
 samples/bpf/cookie_uid_helper_example.c |  2 +-
 samples/bpf/fds_example.c   |  2 +-
 samples/bpf/sock_example.c  |  3 +-
 samples/bpf/test_cgrp2_attach.c |  2 +-
 samples/bpf/test_cgrp2_attach2.c|  4 +-
 samples/bpf/test_cgrp2_sock.c   |  2 +-
 tools/include/uapi/linux/bpf.h  | 11 +
 tools/lib/bpf/bpf.c | 15 --
 tools/lib/bpf/bpf.h |  8 +--
 tools/lib/bpf/libbpf.c  |  5 +-
 tools/perf/tests/bpf.c  |  2 +-
 tools/testing/selftests/bpf/test_align.c|  2 +-
 tools/testing/selftests/bpf/test_tag.c  |  2 +-
 tools/testing/selftests/bpf/test_verifier.c | 18 ++-
 23 files changed, 200 insertions(+), 65 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 66df387106de..377b2f3519f3 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -200,26 +200,38 @@ bpf_ctx_record_field_size(struct bpf_insn_access_aux *aux, u32 size)
aux->ctx_field_size = size;
 }
 
+/* specific data per program type */
+struct bpf_prog_extra {
+   union bpf_prog_subtype subtype;
+   union {
+   struct bpf_prog *previous;
+   } landlock_hook;
+};
+
 struct bpf_prog_ops {
int (*test_run)(struct bpf_prog *prog, const union bpf_attr *kattr,
union bpf_attr __user *uattr);
+   void (*put_extra)(struct bpf_prog_extra *prog_extra);
 };
 
 struct bpf_verifier_ops {
/* return eBPF function prototype for verification */
-   const struct bpf_func_proto *(*get_func_proto)(enum bpf_func_id func_id);
+   const struct bpf_func_proto *(*get_func_proto)(enum bpf_func_id func_id,
+ const struct bpf_prog_extra *prog_extra);
 
/* return true if 'size' wide access at offset 'off' within bpf_context
 * with 'type' (read or write) is allowed
 */
bool (*is_valid_access)(int off, int size, enum bpf_access_type type,
-   struct bpf_insn_access_aux *info);
+   struct bpf_insn_access_aux *info,
+   const struct bpf_prog_extra *prog_extra);
int (*gen_prologue)(struct bpf_insn *insn, bool direct_write,
const struct bpf_prog *prog);
u32 (*convert_ctx_access)(enum bpf_access_type type,
  const struct bpf_insn 

[PATCH bpf-next v8 02/11] fs,security: Add a new file access type: MAY_CHROOT

2018-02-26 Thread Mickaël Salaün
For compatibility reasons, MAY_CHROOT is always set together with
MAY_CHDIR. However, this new flag makes it possible to differentiate a
chdir from a chroot.

This is needed for the Landlock LSM to be able to evaluate a new root
directory.
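
For illustration, an LSM could now react specifically to chroot from its
inode_permission hook (hedged sketch; the policy helper is made up):

#include <linux/errno.h>
#include <linux/fs.h>

static bool example_root_change_allowed(const struct inode *inode)
{
	return true;	/* placeholder policy decision */
}

static int example_inode_permission(struct inode *inode, int mask)
{
	/* A plain chdir sets MAY_CHDIR only; chroot now adds MAY_CHROOT. */
	if ((mask & MAY_CHROOT) && !example_root_change_allowed(inode))
		return -EACCES;
	return 0;
}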

Signed-off-by: Mickaël Salaün 
Cc: Alexander Viro 
Cc: Casey Schaufler 
Cc: James Morris 
Cc: John Johansen 
Cc: Kees Cook 
Cc: Paul Moore 
Cc: "Serge E. Hallyn" 
Cc: Stephen Smalley 
Cc: Tetsuo Handa 
Cc: linux-fsde...@vger.kernel.org
---
 fs/open.c  | 3 ++-
 include/linux/fs.h | 1 +
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/open.c b/fs/open.c
index 7ea118471dce..084d147c0e96 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -489,7 +489,8 @@ SYSCALL_DEFINE1(chroot, const char __user *, filename)
if (error)
goto out;
 
-   error = inode_permission(path.dentry->d_inode, MAY_EXEC | MAY_CHDIR);
+   error = inode_permission(path.dentry->d_inode, MAY_EXEC | MAY_CHDIR |
+   MAY_CHROOT);
if (error)
goto dput_and_out;
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 2a815560fda0..67c62374446c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -90,6 +90,7 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
 #define MAY_CHDIR  0x0040
 /* called from RCU mode, don't block */
 #define MAY_NOT_BLOCK  0x0080
+#define MAY_CHROOT 0x0100
 
 /*
  * flags in file.f_mode.  Note that FMODE_READ and FMODE_WRITE must correspond
-- 
2.16.2



[PATCH bpf-next v8 04/11] bpf,landlock: Define an eBPF program type for Landlock hooks

2018-02-26 Thread Mickaël Salaün
Add a new type of eBPF program used by Landlock hooks. This type of
program can be chained with the same eBPF program type (according to
subtype rules). State can be kept through a value available in the
program's context (e.g. named "cookie" for Landlock programs).

This new BPF program type will be registered with the Landlock LSM
initialization.

Add an initial Landlock Kconfig and update the MAINTAINERS file.

Signed-off-by: Mickaël Salaün 
Cc: Alexei Starovoitov 
Cc: Andy Lutomirski 
Cc: Daniel Borkmann 
Cc: David S. Miller 
Cc: James Morris 
Cc: Kees Cook 
Cc: Serge E. Hallyn 
---

Changes since v7:
* cosmetic fixes
* rename LANDLOCK_SUBTYPE_* to LANDLOCK_*
* cleanup UAPI definitions and move them from bpf.h to landlock.h
  (suggested by Alexei Starovoitov)
* disable Landlock by default (suggested by Alexei Starovoitov)
* rename BPF_PROG_TYPE_LANDLOCK_{RULE,HOOK}
* update the Kconfig
* update the MAINTAINERS file
* replace the IOCTL, LOCK and FCNTL events with FS_PICK, FS_WALK and
  FS_GET hook types
* add the ability to chain programs with an eBPF program file descriptor
  (i.e. the "previous" field in a Landlock subtype) and keep a state
  with a "cookie" value available from the context
* add a "triggers" subtype bitfield to match specific actions (e.g.
  append, chdir, read...)

Changes since v6:
* add 3 more sub-events: IOCTL, LOCK, FCNTL
  https://lkml.kernel.org/r/2fbc99a6-f190-f335-bd14-04bdeed35...@digikod.net
* rename LANDLOCK_VERSION to LANDLOCK_ABI to better reflect its purpose,
  and move it from landlock.h to common.h
* rename BPF_PROG_TYPE_LANDLOCK to BPF_PROG_TYPE_LANDLOCK_RULE: an eBPF
  program could be used for something else than a rule
* simplify struct landlock_context by removing the arch and syscall_nr fields
* remove all eBPF map functions call, remove ABILITY_WRITE
* refactor bpf_landlock_func_proto() (suggested by Kees Cook)
* constify pointers
* fix doc inclusion

Changes since v5:
* rename file hooks.c to init.c
* fix spelling

Changes since v4:
* merge a minimal (not enabled) LSM code and Kconfig in this commit

Changes since v3:
* split commit
* revamp the landlock_context:
  * add arch, syscall_nr and syscall_cmd (ioctl, fcntl…) to be able to
cross-check action with the event type
  * replace args array with dedicated fields to ease the addition of new
fields
---
 MAINTAINERS |  13 +++
 include/linux/bpf_types.h   |   3 +
 include/uapi/linux/bpf.h|   1 +
 include/uapi/linux/landlock.h   | 155 +++
 security/Kconfig|   1 +
 security/Makefile   |   2 +
 security/landlock/Kconfig   |  18 
 security/landlock/Makefile  |   3 +
 security/landlock/common.h  |  32 +++
 security/landlock/init.c| 180 
 tools/include/uapi/linux/bpf.h  |   1 +
 tools/include/uapi/linux/landlock.h | 155 +++
 12 files changed, 564 insertions(+)
 create mode 100644 include/uapi/linux/landlock.h
 create mode 100644 security/landlock/Kconfig
 create mode 100644 security/landlock/Makefile
 create mode 100644 security/landlock/common.h
 create mode 100644 security/landlock/init.c
 create mode 100644 tools/include/uapi/linux/landlock.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 3bdc260e36b7..ac0809094bae 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7882,6 +7882,19 @@ S:   Maintained
 F: net/l3mdev
 F: include/net/l3mdev.h
 
+LANDLOCK SECURITY MODULE
+M: Mickaël Salaün 
+S: Supported
+F: Documentation/security/landlock/
+F: include/linux/landlock.h
+F: include/uapi/linux/landlock.h
+F: samples/bpf/landlock*
+F: security/landlock/
+F: tools/include/uapi/linux/landlock.h
+F: tools/testing/selftests/landlock/
+K: landlock
+K: LANDLOCK
+
 LANTIQ MIPS ARCHITECTURE
 M: John Crispin 
 L: linux-m...@linux-mips.org
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 19b8349a3809..0ca019f3ae4a 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -22,6 +22,9 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_PERF_EVENT, perf_event)
 #ifdef CONFIG_CGROUP_BPF
 BPF_PROG_TYPE(BPF_PROG_TYPE_CGROUP_DEVICE, cg_dev)
 #endif
+#ifdef CONFIG_SECURITY_LANDLOCK
+BPF_PROG_TYPE(BPF_PROG_TYPE_LANDLOCK_HOOK, landlock)
+#endif
 
 BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops)
 BPF_MAP_TYPE(BPF_MAP_TYPE_PERCPU_ARRAY, percpu_array_map_ops)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 87885c92ca78..2433aa1a0fd4 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -133,6 +133,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_SOCK_OPS,
BPF_PROG_TYPE_SK_SKB,

[PATCH bpf-next v8 06/11] bpf,landlock: Add a new map type: inode

2018-02-26 Thread Mickaël Salaün
This new map stores arbitrary 64-bit values referenced by inode keys.
The map can be updated from user space with file descriptors pointing to
inodes tied to a file system.  From an eBPF (Landlock) program point of
view, such a map is read-only and can only be used to retrieve a 64-bit
value tied to a given inode.  This is useful to recognize an inode
tagged by user space, without access rights to this inode (i.e. no need
to have write access to this inode).

This also adds new BPF map object types: landlock_tag_object and
landlock_chain.  The landlock_chain pointer is needed to be able to
handle multiple tags per inode.  The landlock_tag_object is needed to
update a reference to a list of shared tags.  This is typically used by
a struct file (reference) and a struct inode (shared list of tags).
This way, we can account the process/user for the number of tagged
files, while still being able to read the tags from the pointed inode.

Add dedicated BPF functions to handle this type of map:
* bpf_inode_map_update_elem()
* bpf_inode_map_lookup_elem()
* bpf_inode_map_delete_elem()
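
A hedged user-space sketch of populating such a map (the map type name
BPF_MAP_TYPE_INODE is assumed here; the wrappers are the existing
tools/lib/bpf ones):

#include <fcntl.h>
#include <linux/bpf.h>
#include <linux/types.h>
#include "bpf.h"	/* tools/lib/bpf */

int tag_etc(void)
{
	/* key: file descriptor referring to an inode; value: 64-bit tag */
	int map_fd = bpf_create_map(BPF_MAP_TYPE_INODE, sizeof(int),
				    sizeof(__u64), 32, 0);
	int file_fd = open("/etc", O_RDONLY | O_CLOEXEC);
	__u64 tag = 1;

	if (map_fd < 0 || file_fd < 0)
		return -1;
	return bpf_map_update_elem(map_fd, &file_fd, &tag, BPF_ANY);
}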

Signed-off-by: Mickaël Salaün 
Cc: Alexei Starovoitov 
Cc: Andy Lutomirski 
Cc: Daniel Borkmann 
Cc: David S. Miller 
Cc: James Morris 
Cc: Kees Cook 
Cc: Serge E. Hallyn 
Cc: Jann Horn 
---

Changes since v7:
* new design with a dedicated map and a BPF function to tie a value to
  an inode
* add the ability to set or get a tag on an inode from a Landlock
  program

Changes since v6:
* remove WARN_ON() for missing dentry->d_inode
* refactor bpf_landlock_func_proto() (suggested by Kees Cook)

Changes since v5:
* cosmetic fixes and rebase

Changes since v4:
* use a file abstraction (handle) to wrap inode, dentry, path and file
  structs
* remove bpf_landlock_cmp_fs_beneath()
* rename the BPF helper and move it to kernel/bpf/
* tighten helpers accessible by a Landlock rule

Changes since v3:
* remove bpf_landlock_cmp_fs_prop() (suggested by Alexei Starovoitov)
* add hooks dealing with struct inode and struct path pointers:
  inode_permission and inode_getattr
* add abstraction over eBPF helper arguments thanks to wrapping structs
* add bpf_landlock_get_fs_mode() helper to check file type and mode
* merge WARN_ON() (suggested by Kees Cook)
* fix and update bpf_helpers.h
* use BPF_CALL_* for eBPF helpers (suggested by Alexei Starovoitov)
* make handle arraymap safe (RCU) and remove buggy synchronize_rcu()
* factor out the arraymay walk
* use size_t to index array (suggested by Jann Horn)

Changes since v2:
* add MNT_INTERNAL check to only add file handle from user-visible FS
  (e.g. no anonymous inode)
* replace struct file* with struct path* in map_landlock_handle
* add BPF protos
* fix bpf_landlock_cmp_fs_prop_with_struct_file()
---
 include/linux/bpf.h|  18 ++
 include/linux/bpf_types.h  |   3 +
 include/linux/landlock.h   |  24 +++
 include/uapi/linux/bpf.h   |  22 ++-
 kernel/bpf/Makefile|   3 +
 kernel/bpf/core.c  |   1 +
 kernel/bpf/helpers.c   |  38 +
 kernel/bpf/inodemap.c  | 318 +++
 kernel/bpf/syscall.c   |  27 ++-
 kernel/bpf/verifier.c  |  25 +++
 security/landlock/Makefile |   1 +
 security/landlock/tag.c| 373 +
 security/landlock/tag.h|  36 
 security/landlock/tag_fs.c |  59 +++
 security/landlock/tag_fs.h |  26 +++
 tools/include/uapi/linux/bpf.h |  22 ++-
 16 files changed, 993 insertions(+), 3 deletions(-)
 create mode 100644 kernel/bpf/inodemap.c
 create mode 100644 security/landlock/tag.c
 create mode 100644 security/landlock/tag.h
 create mode 100644 security/landlock/tag_fs.c
 create mode 100644 security/landlock/tag_fs.h

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 377b2f3519f3..c9b940a44c3e 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -127,6 +127,10 @@ enum bpf_arg_type {
 
ARG_PTR_TO_CTX, /* pointer to context */
ARG_ANYTHING,   /* any (initialized) argument is ok */
+
+   ARG_PTR_TO_INODE,   /* pointer to a struct inode */
+   ARG_PTR_TO_LL_TAG_OBJ,  /* pointer to a struct landlock_tag_object */
+   ARG_PTR_TO_LL_CHAIN,/* pointer to a struct landlock_chain */
 };
 
 /* type of values returned from helper functions */
@@ -184,6 +188,9 @@ enum bpf_reg_type {
PTR_TO_PACKET_META,  /* skb->data - meta_len */
PTR_TO_PACKET,   /* reg points to skb->data */
PTR_TO_PACKET_END,   /* skb->data + headlen */
+   PTR_TO_INODE,/* reg points to struct inode */
+   PTR_TO_LL_TAG_OBJ,   /* reg points to struct landlock_tag_object */
+   PTR_TO_LL_CHAIN, /* reg points to struct landlock_chain */
 };
 
 

[PATCH bpf-next v8 01/11] fs,security: Add a security blob to nameidata

2018-02-26 Thread Mickaël Salaün
The function current_nameidata_lookup(const struct inode *) can be used
to retrieve a blob's pointer address tied to the inode being walked
through. This enables following a path lookup and knowing where an inode
access comes from. This is needed for the Landlock LSM to be able to
restrict access to file paths.

The LSM hook nameidata_put_lookup() is called before freeing the
associated nameidata.
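
For illustration, an LSM hook could consume this as follows (hedged
sketch; only the API added below is real):

#include <linux/err.h>
#include <linux/namei.h>

static int example_inode_permission(struct inode *inode, int mask)
{
	struct nameidata_lookup *lookup = current_nameidata_lookup(inode);

	if (IS_ERR(lookup))
		return 0;	/* inode not reached through a path walk */
	/* lookup->security is the LSM-owned blob, released through the new
	 * nameidata_put_lookup hook; lookup->type is the last component
	 * type (LAST_NORM, LAST_DOT, ...). */
	return 0;
}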

Signed-off-by: Mickaël Salaün 
Cc: Alexander Viro 
Cc: Casey Schaufler 
Cc: James Morris 
Cc: John Johansen 
Cc: Kees Cook 
Cc: Paul Moore 
Cc: "Serge E. Hallyn" 
Cc: Stephen Smalley 
Cc: Tetsuo Handa 
Cc: linux-fsde...@vger.kernel.org
---
 fs/namei.c| 39 +++
 include/linux/lsm_hooks.h |  7 +++
 include/linux/namei.h | 14 +-
 include/linux/security.h  |  7 +++
 security/security.c   |  7 +++
 5 files changed, 73 insertions(+), 1 deletion(-)

diff --git a/fs/namei.c b/fs/namei.c
index 921ae32dbc80..d592b3fb0d1e 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -505,6 +505,9 @@ struct nameidata {
struct inode*link_inode;
unsignedroot_seq;
int dfd;
+#ifdef CONFIG_SECURITY
+   struct nameidata_lookup lookup;
+#endif
 } __randomize_layout;
 
 static void set_nameidata(struct nameidata *p, int dfd, struct filename *name)
@@ -515,6 +518,9 @@ static void set_nameidata(struct nameidata *p, int dfd, struct filename *name)
p->name = name;
p->total_link_count = old ? old->total_link_count : 0;
p->saved = old;
+#ifdef CONFIG_SECURITY
+   p->lookup.security = NULL;
+#endif
current->nameidata = p;
 }
 
@@ -522,6 +528,7 @@ static void restore_nameidata(void)
 {
struct nameidata *now = current->nameidata, *old = now->saved;
 
+   security_nameidata_put_lookup(&now->lookup, now->inode);
current->nameidata = old;
if (old)
old->total_link_count = now->total_link_count;
@@ -549,6 +556,27 @@ static int __nd_alloc_stack(struct nameidata *nd)
return 0;
 }
 
+#ifdef CONFIG_SECURITY
+/**
+ * current_nameidata_lookup - get the state of the current path walk
+ *
+ * @inode: inode associated to the path walk
+ *
+ * Used by LSM modules for access restriction based on path walk. The LSM is in
+ * charge of the lookup->security blob allocation and management. The hook
+ * security_nameidata_put_lookup() will be called after the path walk end.
+ *
+ * Return ERR_PTR(-ENOENT) if there is no match.
+ */
+struct nameidata_lookup *current_nameidata_lookup(const struct inode *inode)
+{
+   if (!current->nameidata || current->nameidata->inode != inode)
+   return ERR_PTR(-ENOENT);
+   return &current->nameidata->lookup;
+}
+EXPORT_SYMBOL(current_nameidata_lookup);
+#endif
+
 /**
  * path_connected - Verify that a path->dentry is below path->mnt.mnt_root
  * @path: nameidate to verify
@@ -2009,6 +2037,13 @@ static inline u64 hash_name(const void *salt, const char *name)
 
 #endif
 
+static inline void refresh_lookup(struct nameidata *nd)
+{
+#ifdef CONFIG_SECURITY
+   nd->lookup.type = nd->last_type;
+#endif
+}
+
 /*
  * Name resolution.
  * This is the basic name resolution function, turning a pathname into
@@ -2025,6 +2060,8 @@ static int link_path_walk(const char *name, struct nameidata *nd)
name++;
if (!*name)
return 0;
+   /* be ready for may_lookup() */
+   refresh_lookup(nd);
 
/* At this point we know we have a real path component. */
for(;;) {
@@ -2064,6 +2101,8 @@ static int link_path_walk(const char *name, struct nameidata *nd)
nd->last.hash_len = hash_len;
nd->last.name = name;
nd->last_type = type;
+   /* be ready for the next security_inode_permission() */
+   refresh_lookup(nd);
 
name += hashlen_len(hash_len);
if (!*name)
diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index 7161d8e7ee79..d71cf183f0be 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -428,6 +428,10 @@
  * security module does not know about attribute or a negative error code
  * to abort the copy up. Note that the caller is responsible for reading
  * and writing the xattrs as this hook is merely a filter.
+ * @nameidata_put_lookup:
+ * Deallocate and clear the current's nameidata->lookup.security field.
+ * @lookup->security contains the security structure to be freed.
+ * @inode is the last associated inode to the path walk
  *
  * Security hooks for file operations
  *
@@ -1514,6 +1518,8 @@ union security_list_options {
void (*inode_getsecid)(struct inode *inode, 

[PATCH bpf-next v8 08/11] landlock: Add ptrace restrictions

2018-02-26 Thread Mickaël Salaün
A landlocked process has fewer privileges than a non-landlocked process
and must therefore be subject to additional restrictions when
manipulating other processes. To be allowed to use ptrace(2) and related
syscalls on a target process, a landlocked process must have a subset of
the target process' rules.

Signed-off-by: Mickaël Salaün 
Cc: Alexei Starovoitov 
Cc: Andy Lutomirski 
Cc: Daniel Borkmann 
Cc: David S. Miller 
Cc: James Morris 
Cc: Kees Cook 
Cc: Serge E. Hallyn 
---

Changes since v6:
* factor out ptrace check
* constify pointers
* cleanup headers
* use the new security_add_hooks()
---
 security/landlock/Makefile   |   2 +-
 security/landlock/hooks_ptrace.c | 124 +++
 security/landlock/hooks_ptrace.h |  11 
 security/landlock/init.c |   2 +
 4 files changed, 138 insertions(+), 1 deletion(-)
 create mode 100644 security/landlock/hooks_ptrace.c
 create mode 100644 security/landlock/hooks_ptrace.h

diff --git a/security/landlock/Makefile b/security/landlock/Makefile
index d0f532a93b4e..605504d852d3 100644
--- a/security/landlock/Makefile
+++ b/security/landlock/Makefile
@@ -3,4 +3,4 @@ obj-$(CONFIG_SECURITY_LANDLOCK) := landlock.o
 landlock-y := init.o chain.o task.o \
tag.o tag_fs.o \
enforce.o enforce_seccomp.o \
-   hooks.o hooks_cred.o hooks_fs.o
+   hooks.o hooks_cred.o hooks_fs.o hooks_ptrace.o
diff --git a/security/landlock/hooks_ptrace.c b/security/landlock/hooks_ptrace.c
new file mode 100644
index ..f1b977b9c808
--- /dev/null
+++ b/security/landlock/hooks_ptrace.c
@@ -0,0 +1,124 @@
+/*
+ * Landlock LSM - ptrace hooks
+ *
+ * Copyright © 2017 Mickaël Salaün 
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2, as
+ * published by the Free Software Foundation.
+ */
+
+#include 
+#include 
+#include  /* ARRAY_SIZE */
+#include 
+#include  /* struct task_struct */
+#include 
+
+#include "common.h" /* struct landlock_prog_set */
+#include "hooks.h" /* landlocked() */
+#include "hooks_ptrace.h"
+
+static bool progs_are_subset(const struct landlock_prog_set *parent,
+   const struct landlock_prog_set *child)
+{
+   size_t i;
+
+   if (!parent || !child)
+   return false;
+   if (parent == child)
+   return true;
+
+   for (i = 0; i < ARRAY_SIZE(child->programs); i++) {
+   struct landlock_prog_list *walker;
+   bool found_parent = false;
+
+   if (!parent->programs[i])
+   continue;
+   for (walker = child->programs[i]; walker;
+   walker = walker->prev) {
+   if (walker == parent->programs[i]) {
+   found_parent = true;
+   break;
+   }
+   }
+   if (!found_parent)
+   return false;
+   }
+   return true;
+}
+
+static bool task_has_subset_progs(const struct task_struct *parent,
+   const struct task_struct *child)
+{
+#ifdef CONFIG_SECCOMP_FILTER
+   if (progs_are_subset(parent->seccomp.landlock_prog_set,
+   child->seccomp.landlock_prog_set))
+   /* must be ANDed with other providers (i.e. cgroup) */
+   return true;
+#endif /* CONFIG_SECCOMP_FILTER */
+   return false;
+}
+
+static int task_ptrace(const struct task_struct *parent,
+   const struct task_struct *child)
+{
+   if (!landlocked(parent))
+   return 0;
+
+   if (!landlocked(child))
+   return -EPERM;
+
+   if (task_has_subset_progs(parent, child))
+   return 0;
+
+   return -EPERM;
+}
+
+/**
+ * hook_ptrace_access_check - determine whether the current process may access
+ *   another
+ *
+ * @child: the process to be accessed
+ * @mode: the mode of attachment
+ *
+ * If the current task has Landlock programs, then the child must have at least
+ * the same programs.  Else denied.
+ *
+ * Determine whether a process may access another, returning 0 if permission
+ * granted, -errno if denied.
+ */
+static int hook_ptrace_access_check(struct task_struct *child,
+   unsigned int mode)
+{
+   return task_ptrace(current, child);
+}
+
+/**
+ * hook_ptrace_traceme - determine whether another process may trace the
+ *  current one
+ *
+ * @parent: the task proposed to be the tracer
+ *
+ * If the parent has Landlock programs, then the current task must have the
+ * same or more programs.
+ * Else denied.
+ *
+ * Determine whether the nominated task is permitted to trace the current
+ * process, returning 0 if 

[PATCH bpf-next v8 11/11] landlock: Add user and kernel documentation for Landlock

2018-02-26 Thread Mickaël Salaün
This documentation can be built with the Sphinx framework.

Signed-off-by: Mickaël Salaün 
Cc: Alexei Starovoitov 
Cc: Andy Lutomirski 
Cc: Daniel Borkmann 
Cc: David S. Miller 
Cc: James Morris 
Cc: Jonathan Corbet 
Cc: Kees Cook 
Cc: Serge E. Hallyn 
---

Changes since v7:
* update documentation according to the Landlock revamp

Changes since v6:
* add a check for ctx->event
* rename BPF_PROG_TYPE_LANDLOCK to BPF_PROG_TYPE_LANDLOCK_RULE
* rename Landlock version to ABI to better reflect its purpose and add a
  dedicated changelog section
* update tables
* relax no_new_privs recommendations
* remove ABILITY_WRITE related functions
* reword rule "appending" to "prepending" and explain it
* cosmetic fixes

Changes since v5:
* update the rule hierarchy inheritance explanation
* briefly explain ctx->arg2
* add ptrace restrictions
* explain EPERM
* update example (subtype)
* use ":manpage:"
---
 Documentation/security/index.rst   |   1 +
 Documentation/security/landlock/index.rst  |  19 +++
 Documentation/security/landlock/kernel.rst | 100 ++
 Documentation/security/landlock/user.rst   | 206 +
 4 files changed, 326 insertions(+)
 create mode 100644 Documentation/security/landlock/index.rst
 create mode 100644 Documentation/security/landlock/kernel.rst
 create mode 100644 Documentation/security/landlock/user.rst

diff --git a/Documentation/security/index.rst b/Documentation/security/index.rst
index 298a94a33f05..1db294025d0f 100644
--- a/Documentation/security/index.rst
+++ b/Documentation/security/index.rst
@@ -11,3 +11,4 @@ Security Documentation
LSM
self-protection
tpm/index
+   landlock/index
diff --git a/Documentation/security/landlock/index.rst b/Documentation/security/landlock/index.rst
new file mode 100644
index ..8afde6a5805c
--- /dev/null
+++ b/Documentation/security/landlock/index.rst
@@ -0,0 +1,19 @@
+=========================================
+Landlock LSM: programmatic access control
+=========================================
+
+Landlock is a stackable Linux Security Module (LSM) that makes it possible to
+create security sandboxes.  This kind of sandbox is expected to help mitigate
+the security impact of bugs or unexpected/malicious behaviors in user-space
+applications.  The current version allows only a process with the global
+CAP_SYS_ADMIN capability to create such sandboxes but the ultimate goal of
+Landlock is to empower any process, including unprivileged ones, to securely
+restrict themselves.  Landlock is inspired by seccomp-bpf but instead of
+filtering syscalls and their raw arguments, a Landlock rule can inspect the use
+of kernel objects like files and hence make a decision according to the kernel
+semantics.
+
+.. toctree::
+
+user
+kernel
diff --git a/Documentation/security/landlock/kernel.rst b/Documentation/security/landlock/kernel.rst
new file mode 100644
index ..0a52915e346c
--- /dev/null
+++ b/Documentation/security/landlock/kernel.rst
@@ -0,0 +1,100 @@
+==============================
+Landlock: kernel documentation
+==============================
+
+eBPF properties
+===============
+
+To get an expressive language while still being safe and small, Landlock is
+based on eBPF. Landlock should be usable by untrusted processes and must
+therefore expose a minimal attack surface. The eBPF bytecode is minimal,
+powerful, widely used and designed to be used by untrusted applications. Thus,
+reusing the eBPF support in the kernel enables a generic approach while
+minimizing new code.
+
+An eBPF program has access to an eBPF context containing some fields used to
+inspect the current object. These arguments can be used directly (e.g. cookie)
+or passed to helper functions according to their types (e.g. inode pointer). It
+is then possible to do complex access checks without race conditions or
+inconsistent evaluation (i.e.  `incorrect mirroring of the OS code and state
+`_).
+
+A Landlock hook describes a particular access type.  For now, there are three
+hooks dedicated to filesystem related operations: LANDLOCK_HOOK_FS_PICK,
+LANDLOCK_HOOK_FS_WALK and LANDLOCK_HOOK_FS_GET.  A Landlock program is tied to
+one hook.  This makes it possible to statically check context accesses,
+potentially performed by such a program, and hence prevents kernel address
+leaks and ensures the right use of hook arguments with eBPF functions.  Any
+user can add multiple Landlock programs per Landlock hook.  They are stacked and
+evaluated one after the other, starting from the most recent program, as
+seccomp-bpf does with its filters.  Underneath, a hook is an abstraction over a
+set of LSM hooks.
+
+
+Guiding 

[PATCH bpf-next v8 10/11] bpf,landlock: Add tests for Landlock

2018-02-26 Thread Mickaël Salaün
Test basic context access, ptrace protection, filesystem hooks and
Landlock program chaining with multiple cases.

Signed-off-by: Mickaël Salaün 
Cc: Alexei Starovoitov 
Cc: Andy Lutomirski 
Cc: Daniel Borkmann 
Cc: David S. Miller 
Cc: James Morris 
Cc: Kees Cook 
Cc: Serge E. Hallyn 
Cc: Shuah Khan 
Cc: Will Drewry 
---

Changes since v7:
* update tests and add new ones for filesystem hierarchy and Landlock
  chains.

Changes since v6:
* use the new kselftest_harness.h
* use const variables
* replace ASSERT_STEP with ASSERT_*
* rename BPF_PROG_TYPE_LANDLOCK to BPF_PROG_TYPE_LANDLOCK_RULE
* force sample library rebuild
* fix install target

Changes since v5:
* add subtype test
* add ptrace tests
* split and rename files
* cleanup and rebase
---
 tools/testing/selftests/Makefile   |   1 +
 tools/testing/selftests/bpf/bpf_helpers.h  |   7 +
 tools/testing/selftests/bpf/test_verifier.c|  84 +
 tools/testing/selftests/landlock/.gitignore|   5 +
 tools/testing/selftests/landlock/Makefile  |  35 ++
 tools/testing/selftests/landlock/test.h|  31 ++
 tools/testing/selftests/landlock/test_base.c   |  27 ++
 tools/testing/selftests/landlock/test_chain.c  | 249 +
 tools/testing/selftests/landlock/test_fs.c | 492 +
 tools/testing/selftests/landlock/test_ptrace.c | 158 
 10 files changed, 1089 insertions(+)
 create mode 100644 tools/testing/selftests/landlock/.gitignore
 create mode 100644 tools/testing/selftests/landlock/Makefile
 create mode 100644 tools/testing/selftests/landlock/test.h
 create mode 100644 tools/testing/selftests/landlock/test_base.c
 create mode 100644 tools/testing/selftests/landlock/test_chain.c
 create mode 100644 tools/testing/selftests/landlock/test_fs.c
 create mode 100644 tools/testing/selftests/landlock/test_ptrace.c

diff --git a/tools/testing/selftests/Makefile b/tools/testing/selftests/Makefile
index 7442dfb73b7f..5d00deb3cab6 100644
--- a/tools/testing/selftests/Makefile
+++ b/tools/testing/selftests/Makefile
@@ -14,6 +14,7 @@ TARGETS += gpio
 TARGETS += intel_pstate
 TARGETS += ipc
 TARGETS += kcmp
+TARGETS += landlock
 TARGETS += lib
 TARGETS += membarrier
 TARGETS += memfd
diff --git a/tools/testing/selftests/bpf/bpf_helpers.h b/tools/testing/selftests/bpf/bpf_helpers.h
index dde2c11d7771..414e267491f7 100644
--- a/tools/testing/selftests/bpf/bpf_helpers.h
+++ b/tools/testing/selftests/bpf/bpf_helpers.h
@@ -86,6 +86,13 @@ static int (*bpf_perf_prog_read_value)(void *ctx, void *buf,
(void *) BPF_FUNC_perf_prog_read_value;
 static int (*bpf_override_return)(void *ctx, unsigned long rc) =
(void *) BPF_FUNC_override_return;
+static unsigned long long (*bpf_inode_map_lookup)(void *map, void *key) =
+   (void *) BPF_FUNC_inode_map_lookup;
+static unsigned long long (*bpf_inode_get_tag)(void *inode, void *chain) =
+   (void *) BPF_FUNC_inode_get_tag;
+static unsigned long long (*bpf_landlock_set_tag)(void *tag_obj, void *chain,
+ unsigned long long value) =
+   (void *) BPF_FUNC_landlock_set_tag;
 
 /* llvm builtin functions that eBPF C program may use to
  * emit BPF_LD_ABS and BPF_LD_IND instructions
diff --git a/tools/testing/selftests/bpf/test_verifier.c b/tools/testing/selftests/bpf/test_verifier.c
index 3c24a5a7bafc..5f68b95187fe 100644
--- a/tools/testing/selftests/bpf/test_verifier.c
+++ b/tools/testing/selftests/bpf/test_verifier.c
@@ -31,6 +31,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -11240,6 +11241,89 @@ static struct bpf_test tests[] = {
.result = REJECT,
.has_prog_subtype = true,
},
+   {
+   "missing subtype",
+   .insns = {
+   BPF_MOV32_IMM(BPF_REG_0, 0),
+   BPF_EXIT_INSN(),
+   },
+   .errstr = "",
+   .result = REJECT,
+   .prog_type = BPF_PROG_TYPE_LANDLOCK_HOOK,
+   },
+   {
+   "landlock/fs_pick: always accept",
+   .insns = {
+   BPF_MOV32_IMM(BPF_REG_0, 0),
+   BPF_EXIT_INSN(),
+   },
+   .result = ACCEPT,
+   .prog_type = BPF_PROG_TYPE_LANDLOCK_HOOK,
+   .has_prog_subtype = true,
+   .prog_subtype = {
+   .landlock_hook = {
+   .type = LANDLOCK_HOOK_FS_PICK,
+   .triggers = LANDLOCK_TRIGGER_FS_PICK_READ,
+   }
+   },
+   },
+   {
+   "landlock/fs_pick: read context",
+   .insns = {
+   BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
+

[PATCH bpf-next v8 07/11] landlock: Handle filesystem access control

2018-02-26 Thread Mickaël Salaün
This adds three Landlock hooks: FS_WALK, FS_PICK and FS_GET.

The FS_WALK hook is used to walk through a file path. A program tied to
this hook will be evaluated for each directory traversal except the last
one if it is the leaf of the path.

The FS_PICK hook is used to validate a set of actions requested on a
file. These actions are defined with triggers (e.g. read, write, open,
append...).

The FS_GET hook is used to tag open files, which is necessary to be able
to evaluate relative paths.  A program tied to this hook can tag a file
with an inode map.

A Landlock program can be chained to another if it is permitted by the
BPF verifier. A FS_WALK can be chained to a FS_PICK, which can be chained
to a FS_GET.

The Landlock LSM hook registration is done after the other LSMs to only
run actions from user-space, via eBPF programs, if the access was
granted by major (privileged) LSMs.
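
A hedged sketch of how user space would express such a chain through the
program subtype (field names follow patch 04's changelog and the
selftests; treat them as assumptions):

#include <linux/bpf.h>
#include <linux/landlock.h>

/* The fs_walk program is loaded first; its file descriptor is then
 * referenced as "previous" by the chained fs_pick program. */
static union bpf_prog_subtype walk_subtype = {
	.landlock_hook = {
		.type = LANDLOCK_HOOK_FS_WALK,
	},
};

static union bpf_prog_subtype pick_subtype = {
	.landlock_hook = {
		.type = LANDLOCK_HOOK_FS_PICK,
		.triggers = LANDLOCK_TRIGGER_FS_PICK_READ,
		/* .previous = walk_fd, filled in after loading fs_walk */
	},
};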

Signed-off-by: Mickaël Salaün 
Cc: Alexei Starovoitov 
Cc: Andy Lutomirski 
Cc: Daniel Borkmann 
Cc: David S. Miller 
Cc: James Morris 
Cc: Kees Cook 
Cc: Serge E. Hallyn 
---

Changes since v7:
* major rewrite with clean Landlock hooks able to deal with file paths

Changes since v6:
* add 3 more sub-events: IOCTL, LOCK, FCNTL
  https://lkml.kernel.org/r/2fbc99a6-f190-f335-bd14-04bdeed35...@digikod.net
* use the new security_add_hooks()
* explain the -Werror=unused-function
* constify pointers
* cleanup headers

Changes since v5:
* split hooks.[ch] into hooks.[ch] and hooks_fs.[ch]
* add more documentation
* cosmetic fixes
* rebase (SCALAR_VALUE)

Changes since v4:
* add LSM hook abstraction called Landlock event
  * use the compiler type checking to verify hooks use by an event
  * handle all filesystem related LSM hooks (e.g. file_permission,
mmap_file, sb_mount...)
* register BPF programs for Landlock just after LSM hooks registration
* move hooks registration after other LSMs
* add failsafes to check if a hook is not used by the kernel
* allow partial raw value access form the context (needed for programs
  generated by LLVM)

Changes since v3:
* split commit
* add hooks dealing with struct inode and struct path pointers:
  inode_permission and inode_getattr
* add abstraction over eBPF helper arguments thanks to wrapping structs
---
 include/linux/lsm_hooks.h   |5 +
 security/landlock/Makefile  |5 +-
 security/landlock/common.h  |9 +
 security/landlock/enforce_seccomp.c |   10 +
 security/landlock/hooks.c   |  121 +
 security/landlock/hooks.h   |   35 ++
 security/landlock/hooks_cred.c  |   52 ++
 security/landlock/hooks_cred.h  |1 +
 security/landlock/hooks_fs.c| 1021 +++
 security/landlock/hooks_fs.h|   60 ++
 security/landlock/init.c|   56 ++
 security/landlock/task.c|   34 ++
 security/landlock/task.h|   29 +
 security/security.c |   12 +-
 14 files changed, 1447 insertions(+), 3 deletions(-)
 create mode 100644 security/landlock/hooks.c
 create mode 100644 security/landlock/hooks.h
 create mode 100644 security/landlock/hooks_cred.c
 create mode 100644 security/landlock/hooks_cred.h
 create mode 100644 security/landlock/hooks_fs.c
 create mode 100644 security/landlock/hooks_fs.h
 create mode 100644 security/landlock/task.c
 create mode 100644 security/landlock/task.h

diff --git a/include/linux/lsm_hooks.h b/include/linux/lsm_hooks.h
index d71cf183f0be..c40163385b68 100644
--- a/include/linux/lsm_hooks.h
+++ b/include/linux/lsm_hooks.h
@@ -2032,5 +2032,10 @@ void __init loadpin_add_hooks(void);
 #else
 static inline void loadpin_add_hooks(void) { };
 #endif
+#ifdef CONFIG_SECURITY_LANDLOCK
+extern void __init landlock_add_hooks(void);
+#else
+static inline void __init landlock_add_hooks(void) { }
+#endif /* CONFIG_SECURITY_LANDLOCK */
 
 #endif /* ! __LINUX_LSM_HOOKS_H */
diff --git a/security/landlock/Makefile b/security/landlock/Makefile
index 0e1dd4612ecc..d0f532a93b4e 100644
--- a/security/landlock/Makefile
+++ b/security/landlock/Makefile
@@ -1,5 +1,6 @@
 obj-$(CONFIG_SECURITY_LANDLOCK) := landlock.o
 
-landlock-y := init.o chain.o \
+landlock-y := init.o chain.o task.o \
tag.o tag_fs.o \
-   enforce.o enforce_seccomp.o
+   enforce.o enforce_seccomp.o \
+   hooks.o hooks_cred.o hooks_fs.o
diff --git a/security/landlock/common.h b/security/landlock/common.h
index 245e4ccafcf2..6d36b70068d5 100644
--- a/security/landlock/common.h
+++ b/security/landlock/common.h
@@ -82,4 +82,13 @@ static inline enum landlock_hook_type get_type(struct bpf_prog *prog)
return prog->aux->extra->subtype.landlock_hook.type;
 }
 
+__maybe_unused
+static bool current_has_prog_type(enum landlock_hook_type hook_type)
+{
+   struct landlock_prog_set *prog_set;
+
+   prog_set = 

[PATCH bpf-next v8 09/11] bpf: Add a Landlock sandbox example

2018-02-26 Thread Mickaël Salaün
Add a basic sandbox tool to launch a command which is only allowed to
access, in a read-only or read-write way, a whitelist of file hierarchies.

Add to the bpf_load library the ability to handle a BPF program subtype.

Signed-off-by: Mickaël Salaün 
Cc: Alexei Starovoitov 
Cc: Andy Lutomirski 
Cc: Daniel Borkmann 
Cc: David S. Miller 
Cc: James Morris 
Cc: Kees Cook 
Cc: Serge E. Hallyn 
---

Changes since v7:
* rewrite the example using an inode map
* add to bpf_load the ability to handle subtypes per program type

Changes since v6:
* check return value of load_and_attach()
* allow to write on pipes
* rename BPF_PROG_TYPE_LANDLOCK to BPF_PROG_TYPE_LANDLOCK_RULE
* rename Landlock version to ABI to better reflect its purpose
* use const variable (suggested by Kees Cook)
* remove useless definitions (suggested by Kees Cook)
* add detailed explanations (suggested by Kees Cook)

Changes since v5:
* cosmetic fixes
* rebase

Changes since v4:
* write the Landlock rule in C and compile it with LLVM
* remove cgroup handling
* remove path handling: only handle a read-only environment
* remove errno return codes

Changes since v3:
* remove seccomp and origin field: completely free from seccomp programs
* handle more FS-related hooks
* handle inode hooks and directory traversal
* add faked but consistent view thanks to ENOENT
* add /lib64 in the example
* fix spelling
* rename some types and definitions (e.g. SECCOMP_ADD_LANDLOCK_RULE)

Changes since v2:
* use BPF_PROG_ATTACH for cgroup handling
---
 samples/bpf/Makefile |   4 +
 samples/bpf/bpf_load.c   |  82 -
 samples/bpf/bpf_load.h   |   7 ++
 samples/bpf/landlock1.h  |  14 
 samples/bpf/landlock1_kern.c | 171 +++
 samples/bpf/landlock1_user.c | 164 +
 6 files changed, 439 insertions(+), 3 deletions(-)
 create mode 100644 samples/bpf/landlock1.h
 create mode 100644 samples/bpf/landlock1_kern.c
 create mode 100644 samples/bpf/landlock1_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index ec3fc8d88e87..015b1375daa5 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -43,6 +43,7 @@ hostprogs-y += xdp_redirect_cpu
 hostprogs-y += xdp_monitor
 hostprogs-y += xdp_rxq_info
 hostprogs-y += syscall_tp
+hostprogs-y += landlock1
 
 # Libbpf dependencies
 LIBBPF := ../../tools/lib/bpf/bpf.o ../../tools/lib/bpf/nlattr.o
@@ -93,6 +94,7 @@ xdp_redirect_cpu-objs := bpf_load.o $(LIBBPF) xdp_redirect_cpu_user.o
 xdp_monitor-objs := bpf_load.o $(LIBBPF) xdp_monitor_user.o
 xdp_rxq_info-objs := bpf_load.o $(LIBBPF) xdp_rxq_info_user.o
 syscall_tp-objs := bpf_load.o $(LIBBPF) syscall_tp_user.o
+landlock1-objs := bpf_load.o $(LIBBPF) landlock1_user.o
 
 # Tell kbuild to always build the programs
 always := $(hostprogs-y)
@@ -144,6 +146,7 @@ always += xdp_monitor_kern.o
 always += xdp_rxq_info_kern.o
 always += xdp2skb_meta_kern.o
 always += syscall_tp_kern.o
+always += landlock1_kern.o
 
 HOSTCFLAGS += -I$(objtree)/usr/include
 HOSTCFLAGS += -I$(srctree)/tools/lib/
@@ -188,6 +191,7 @@ HOSTLOADLIBES_xdp_redirect_cpu += -lelf
 HOSTLOADLIBES_xdp_monitor += -lelf
 HOSTLOADLIBES_xdp_rxq_info += -lelf
 HOSTLOADLIBES_syscall_tp += -lelf
+HOSTLOADLIBES_landlock1 += -lelf
 
# Allows pointing LLC/CLANG to a LLVM backend with bpf support, redefine on cmdline:
#  make samples/bpf/ LLC=~/git/llvm/build/bin/llc CLANG=~/git/llvm/build/bin/clang
diff --git a/samples/bpf/bpf_load.c b/samples/bpf/bpf_load.c
index 5bb37db6054b..f7c91093b2f5 100644
--- a/samples/bpf/bpf_load.c
+++ b/samples/bpf/bpf_load.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -43,6 +44,9 @@ int prog_array_fd = -1;
 struct bpf_map_data map_data[MAX_MAPS];
 int map_data_count = 0;
 
+struct bpf_subtype_data subtype_data[MAX_PROGS];
+int subtype_data_count = 0;
+
 static int populate_prog_array(const char *event, int prog_fd)
 {
int ind = atoi(event), err;
@@ -67,12 +71,14 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)
bool is_cgroup_sk = strncmp(event, "cgroup/sock", 11) == 0;
bool is_sockops = strncmp(event, "sockops", 7) == 0;
bool is_sk_skb = strncmp(event, "sk_skb", 6) == 0;
+   bool is_landlock = strncmp(event, "landlock", 8) == 0;
size_t insns_cnt = size / sizeof(struct bpf_insn);
enum bpf_prog_type prog_type;
char buf[256];
int fd, efd, err, id;
struct perf_event_attr attr = {};
union bpf_prog_subtype *st = NULL;
+   struct bpf_subtype_data *sd = NULL;
 
attr.type = PERF_TYPE_TRACEPOINT;
attr.sample_type = PERF_SAMPLE_RAW;
@@ -97,6 +103,50 @@ static int load_and_attach(const char *event, struct bpf_insn *prog, int size)

Re: [net PATCH 1/4] virtio_net: disable XDP_REDIRECT in receive_mergeable() case

2018-02-26 Thread Michael S. Tsirkin
On Tue, Feb 20, 2018 at 02:32:04PM +0100, Jesper Dangaard Brouer wrote:
> The virtio_net code has three different RX code-paths in receive_buf().
> Two of these code paths can handle XDP, but one of them is broken for
> at least XDP_REDIRECT.
> 
> Function(1): receive_big() does not support XDP.
> Function(2): receive_small() supports XDP fully and uses build_skb().
> Function(3): receive_mergeable() has broken XDP_REDIRECT and uses napi_alloc_skb().
> 
> The simple explanation is that receive_mergeable() is broken because
> it uses napi_alloc_skb(), which violates XDP given XDP assumes packet
> header+data in single page and enough tail room for skb_shared_info.
> 
> The longer explanation is that receive_mergeable() tries to
> work around and satisfy these XDP requirements e.g. by having a
> function xdp_linearize_page() that allocates and memcpys RX buffers
> around (in case packet is scattered across multiple rx buffers).  This
> does currently satisfy XDP_PASS, XDP_DROP and XDP_TX (but only because
> we have not implemented bpf_xdp_adjust_tail yet).
> 
> The XDP_REDIRECT action combined with cpumap is broken, and causes
> hard-to-debug crashes.  The main issue is that the RX packet does not have
> the needed tail-room (SKB_DATA_ALIGN(skb_shared_info)), causing
> skb_shared_info to overlap the next packets head-room (in which cpumap
> stores info).
> 
> Reproducing depends on the packet payload length and on whether the
> RX-buffer size happened to have tail-room for skb_shared_info or not.
> But to make this even harder to troubleshoot, the RX-buffer size is
> dynamically changed at runtime, based on an Exponentially Weighted
> Moving Average (EWMA) over the packet length, when refilling RX rings.
> 
> This patch only disables XDP_REDIRECT support in receive_mergeable()
> case, because it can cause a real crash.
> 
> IMHO we should consider NOT supporting XDP in receive_mergeable() at
> all, because the principles behind XDP are to gain speed by (1) code
> simplicity, (2) sacrificing memory and (3) where possible moving
> runtime checks to setup time.  These principles are clearly being
> violated in receive_mergeable(), which e.g. tracks the average buffer
> size at runtime to save memory consumption.
> 
> In the longer run, we should consider introducing a separate receive
> function when attaching an XDP program, and also change the memory
> model to be compatible with XDP when attaching an XDP prog.

I agree with a separate function approach.

So each buffer is tagged as xdp/non xdp, we check that
and handle appropriately - where non xdp could be handled
by the generic path.
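
Roughly, as a sketch (all names below are hypothetical, not actual
virtio_net code):

    /* hypothetical dispatch on a per-buffer tag recorded at refill
     * time: XDP-tagged buffers were posted with an XDP-safe memory
     * layout, everything else takes the existing generic path */
    static struct sk_buff *virtnet_receive_buf(struct virtnet_info *vi,
                                               struct receive_queue *rq,
                                               void *buf, unsigned int len)
    {
            if (virtnet_buf_is_xdp(buf))    /* assumed tag helper */
                    return receive_buf_xdp(vi, rq, buf, len);

            return receive_buf_generic(vi, rq, buf, len);
    }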



> Fixes: 186b3c998c50 ("virtio-net: support XDP_REDIRECT")
> Signed-off-by: Jesper Dangaard Brouer 
> ---
>  drivers/net/virtio_net.c |7 ---
>  1 file changed, 7 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 626c27352ae2..0ca91942a884 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -677,7 +677,6 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>   struct bpf_prog *xdp_prog;
>   unsigned int truesize;
>   unsigned int headroom = mergeable_ctx_to_headroom(ctx);
> - int err;
>  
>   head_skb = NULL;
>  
> @@ -754,12 +753,6 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>   goto err_xdp;
>   rcu_read_unlock();
>   goto xdp_xmit;
> - case XDP_REDIRECT:
> - err = xdp_do_redirect(dev, , xdp_prog);
> - if (!err)
> - *xdp_xmit = true;
> - rcu_read_unlock();
> - goto xdp_xmit;
>   default:
>   bpf_warn_invalid_xdp_action(act);
>   case XDP_ABORTED:


Re: [next-queue PATCH 7/8] igb: Add support for adding offloaded clsflower filters

2018-02-26 Thread Vinicius Costa Gomes
Hi,

Florian Fainelli  writes:

> On February 23, 2018 5:20:35 PM PST, Vinicius Costa Gomes 
>  wrote:
>>This allows filters added by tc-flower and specifying MAC addresses,
>>Ethernet types, and the VLAN priority field, to be offloaded to the
>>controller.
>>
>>This reuses most of the infrastructure used by ethtool, ethtool can be
>>used to read these filters, but modification and deletion can only be
>>done via tc-flower.
>
> You would want to check what other drivers supporting both
> ethtool::rxnfc and cls_flower do, but this can be seriously confusing
> to an user. As an user I would be more comfortable with seeing only
> rules added through ethtool via ethtool and those added by cls_flower
> via cls_flower. They will both access a shared set of resources but it
> seems easier for me to dump rules with both tools to figure out why
> -ENOSPC was returned rather than seeing something I did not add.
> Others might see it entirely differently.

I took a closer look at mlx5 and i40e, and they seem to use different
hardware capabilities (even incompatible in the case of i40e, which
returns an error when trying to add a cls_flower filter when an ethtool
based filter exists) for ethtool and cls_flower. So I don't think the
model applies exactly here.

As they keep the filters separated from the user's perspective, I could do
the same, in the name of convention; it's just that the separation would be
more "artificial". But I have no strong opinions either way.

>
> If you added the ability for cls_flower to indicate a queue number and
> either a fixed rule index or auto-placement (RX_CLS_LOC_ANY), could
> that eliminate entirely the need for adding MAC address steering in
> earlier patches?

I am not sure that I understand. 'cls_flower' already has support for
indicating a queue number (expressed via the 'hw_tc' parameter to tc)
(commit 384c181e3780 ("net: sched: Identify hardware traffic classes
using classid").

And adding more control over the allocation of indexes for the rules seems
not to help much in reducing the size/complexity of this series. I.e.
this series has 4 parts: bug fixes, adding support for source addresses
for MAC filters, adding ethtool support MAC address filters (as it was
only missing some glue code), and adding offloading for some types of
cls_flower filters. More control for the allocation of rule indexes would
only affect the cls_flower part.

But perhaps I could be missing something here.

>
> -- 
> Florian


Re: [PATCH RFC net-next 16/20] net/ipv6: Cleanup exception route handling

2018-02-26 Thread Wei Wang
On Mon, Feb 26, 2018 at 3:02 PM, David Ahern  wrote:
> On 2/26/18 3:29 PM, Wei Wang wrote:
>> On Sun, Feb 25, 2018 at 11:47 AM, David Ahern  wrote:
>>> IPv6 FIB will only contain FIB entries with exception routes added to
>>> the FIB entry. Remove CACHE and dst checks from fib6 add and delete since
>>> they can never happen once the data type changes.
>>>
>>> Fixup the lookup functions to use a f6i name for fib lookups and retain
>>> the current rt name for return variables.
>>>
>>> Signed-off-by: David Ahern 
>>> ---
>>>  net/ipv6/ip6_fib.c |  16 +--
>>>  net/ipv6/route.c   | 122 ++---
>>>  2 files changed, 71 insertions(+), 67 deletions(-)
>>>
>>> diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
>>> index 5b03f7e8d850..63a91db61749 100644
>>> --- a/net/ipv6/ip6_fib.c
>>> +++ b/net/ipv6/ip6_fib.c
>>> @@ -1046,7 +1046,7 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct rt6_info *rt,
>>>  static void fib6_start_gc(struct net *net, struct rt6_info *rt)
>>>  {
>>> if (!timer_pending(&net->ipv6.ip6_fib_timer) &&
>>> -   (rt->rt6i_flags & (RTF_EXPIRES | RTF_CACHE)))
>>> +   (rt->rt6i_flags & RTF_EXPIRES))
>>> mod_timer(&net->ipv6.ip6_fib_timer,
>>>   jiffies + net->ipv6.sysctl.ip6_rt_gc_interval);
>>>  }
>>> @@ -1097,8 +1097,6 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt,
>>
>> This rt here should be f6i?
>>
>>>
>>> if (WARN_ON_ONCE(!atomic_read(&rt->dst.__refcnt)))
>>> return -EINVAL;
>>> -   if (WARN_ON_ONCE(rt->rt6i_flags & RTF_CACHE))
>>> -   return -EINVAL;
>>>
>>> if (info->nlh) {
>>> if (!(info->nlh->nlmsg_flags & NLM_F_CREATE))
>>> @@ -1622,8 +1620,6 @@ static void fib6_del_route(struct fib6_table *table, struct fib6_node *fn,
>>>
>>> RT6_TRACE("fib6_del_route\n");
>>>
>>> -   WARN_ON_ONCE(rt->rt6i_flags & RTF_CACHE);
>>> -
>>> /* Unlink it */
>>> *rtp = rt->rt6_next;
>>
>> This rt here is also f6i right?
>>
>>> rt->rt6i_node = NULL;
>>> @@ -1692,21 +1688,11 @@ int fib6_del(struct rt6_info *rt, struct nl_info *info)
>>
>> This rt here is also f6i right?
>>
>>> struct rt6_info __rcu **rtp;
>>> struct rt6_info __rcu **rtp_next;
>>>
>>> -#if RT6_DEBUG >= 2
>>> -   if (rt->dst.obsolete > 0) {
>>> -   WARN_ON(fn);
>>> -   return -ENOENT;
>>> -   }
>>> -#endif
>>> if (!fn || rt == net->ipv6.fib6_null_entry)
>>> return -ENOENT;
>>>
>>> WARN_ON(!(fn->fn_flags & RTN_RTINFO));
>>>
>>> -   /* remove cached dst from exception table */
>>> -   if (rt->rt6i_flags & RTF_CACHE)
>>> -   return rt6_remove_exception_rt(rt);
>>
>> Could you help delete rt6_remove_exception_rt() function? I don't
>> think it is used anymore.
>
> It is still used by ip6_negative_advice, ip6_link_failure and
> ip6_del_cached_rt. It can be made static; will fix.
>
Right. Missed those.

>
> The rest of your comments for this patch are renaming rt to f6i. My
> thought is to follow up with another patch that does the rename of rt to
> f6i for all fib6_info. Given how large this change is already I did not
> want to add extra diffs for that. If there is agreement to fold that
> part in now, I can do it.
Sure. Sounds good to me.


Re: [PATCH RFC net-next 10/20] net/ipv6: move expires into rt6_info

2018-02-26 Thread Wei Wang
On Mon, Feb 26, 2018 at 2:55 PM, David Ahern  wrote:
> On 2/26/18 3:28 PM, Wei Wang wrote:
>>> @@ -213,11 +234,6 @@ static inline void rt6_set_expires(struct rt6_info *rt, unsigned long expires)
>>>
>>>  static inline void rt6_update_expires(struct rt6_info *rt0, int timeout)
>>>  {
>>> -   struct rt6_info *rt;
>>> -
>>> -   for (rt = rt0; rt && !(rt->rt6i_flags & RTF_EXPIRES); rt = rt->from);
>>> -   if (rt && rt != rt0)
>>> -   rt0->dst.expires = rt->dst.expires;
>>
>> I was wondering if we need to retain the above logic. It makes sure
>> dst.expires gets synced to its "parent" route. But  it might be hard
>> because after your change, we can no longer use rt->from to refer to
>> the "parent".
>
> As I understand it, the FIB entries are cloned into pcpu, uncached and
> exception routes. We should never have an rt6_info that ever points back
> more than 1 level -- ie., the dst rt6_info points to a from representing
> the original FIB entry.
>
Yes. Agree.

> After my change 'from' will still point to the FIB entry as a fib6_info
> which has its own expires.
>
understood. And fib6_age() is using fib6_check_expired() and
rt6_age_exceptions() is checking rt->dst.expires which I think is
correct.

> When I looked at this code I was really confused. At best, the for loop
> above sets rt0->dst.expires to some value based on the 'from' but then
> the very next line calls dst_set_expires with the passed in timeout value.
>
>
>>
>> dst_set_expires(&rt0->dst, timeout);
>>> rt0->rt6i_flags |= RTF_EXPIRES;
>>>  }
>


Re: [PATCH] test_bpf: add a schedule point

2018-02-26 Thread Eric Dumazet
On Mon, 2018-02-26 at 21:11 +0100, Daniel Borkmann wrote:
> On 02/26/2018 07:52 PM, Eric Dumazet wrote:
> > From: Eric Dumazet 
> > 
> > test_bpf() is taking 1.6 seconds nowadays, it is time
> > to add a schedule point in it.
> > 
> > Signed-off-by: Eric Dumazet 
> 
> Applied to bpf tree, thanks Eric!

Thanks Daniel

Note that some BPF programs are quite expensive

[  173.447471] test_bpf: #264 BPF_MAXINSNS: Call heavy transformations jited:1 19248 18548 PASS
jited:1 12519 PASS
[  173.509228] test_bpf: #269 BPF_MAXINSNS: ld_abs+get_processor_id jited:1 20896 PASS

So we can still consume ~200 ms per test, without cond_resched()

Maybe reducing MAX_TESTRUNS from 10000 to 1000 would be the next step ?
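
For reference, the applied fix is essentially one scheduling point per
test case; a sketch of the shape of the change in test_bpf()'s main loop
in lib/test_bpf.c (context approximate):

    for (i = 0; i < ARRAY_SIZE(tests); i++) {
            cond_resched();         /* the added schedule point: yield
                                     * between expensive test cases */

            if (exclude_test(i))
                    continue;
            /* ... generate, run and verify test i ... */
    }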




[PATCH 4/5 net-next] ibmvnic: Report queue stops and restarts as debug output

2018-02-26 Thread Thomas Falcon
It's not necessary to report each time a queue is stopped and restarted
as an informational message. Change that to be a debug message so that
it can be observed if needed but not printed by default.

Signed-off-by: Thomas Falcon 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index b907c0a607e6..53b2c91fa786 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -1528,7 +1528,7 @@ static int ibmvnic_xmit(struct sk_buff *skb, struct net_device *netdev)
 
	if (atomic_add_return(num_entries, &tx_scrq->used)
>= adapter->req_tx_entries_per_subcrq) {
-   netdev_info(netdev, "Stopping queue %d\n", queue_num);
+   netdev_dbg(netdev, "Stopping queue %d\n", queue_num);
netif_stop_subqueue(netdev, queue_num);
}
 
@@ -2541,8 +2541,8 @@ static int ibmvnic_complete_tx(struct ibmvnic_adapter *adapter,
__netif_subqueue_stopped(adapter->netdev,
 scrq->pool_index)) {
netif_wake_subqueue(adapter->netdev, scrq->pool_index);
-   netdev_info(adapter->netdev, "Started queue %d\n",
-   scrq->pool_index);
+   netdev_dbg(adapter->netdev, "Started queue %d\n",
+  scrq->pool_index);
}
}
 
-- 
2.15.0



[PATCH 5/5 net-next] ibmvnic: Do not attempt to login if RX or TX queues are not allocated

2018-02-26 Thread Thomas Falcon
If a device reset fails for some reason, TX and RX queue resources
could be released. If a user attempts to open the device in this scenario,
it may result in a kernel panic as the driver tries to access this
memory. To fix this, include a check before device login that TX/RX
queues are still there before enabling the device. In addition, return a
value that can be checked in case of any errors to avoid waiting for a
completion that will never come.

Signed-off-by: Thomas Falcon 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 21 +++--
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index 53b2c91fa786..765407179fdd 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -111,7 +111,7 @@ static int ibmvnic_poll(struct napi_struct *napi, int data);
 static void send_map_query(struct ibmvnic_adapter *adapter);
 static void send_request_map(struct ibmvnic_adapter *, dma_addr_t, __be32, u8);
 static void send_request_unmap(struct ibmvnic_adapter *, u8);
-static void send_login(struct ibmvnic_adapter *adapter);
+static int send_login(struct ibmvnic_adapter *adapter);
 static void send_cap_queries(struct ibmvnic_adapter *adapter);
 static int init_sub_crqs(struct ibmvnic_adapter *);
 static int init_sub_crq_irqs(struct ibmvnic_adapter *adapter);
@@ -809,8 +809,11 @@ static int ibmvnic_login(struct net_device *netdev)
}
 
	reinit_completion(&adapter->init_done);
-   send_login(adapter);
-   if (!wait_for_completion_timeout(&adapter->init_done,
+   rc = send_login(adapter);
+   if (rc) {
+   dev_err(dev, "Unable to attempt device login\n");
+   return rc;
+   } else if (!wait_for_completion_timeout(&adapter->init_done,
 timeout)) {
dev_err(dev, "Login timeout\n");
return -1;
@@ -3074,7 +3077,7 @@ static void vnic_add_client_data(struct ibmvnic_adapter *adapter,
strncpy(>name, adapter->netdev->name, len);
 }
 
-static void send_login(struct ibmvnic_adapter *adapter)
+static int send_login(struct ibmvnic_adapter *adapter)
 {
struct ibmvnic_login_rsp_buffer *login_rsp_buffer;
struct ibmvnic_login_buffer *login_buffer;
@@ -3090,6 +3093,12 @@ static void send_login(struct ibmvnic_adapter *adapter)
struct vnic_login_client_data *vlcd;
int i;
 
+   if (!adapter->tx_scrq || !adapter->rx_scrq) {
+   netdev_err(adapter->netdev,
+  "RX or TX queues are not allocated, device login failed\n");
+   return -1;
+   }
+
release_login_rsp_buffer(adapter);
client_data_len = vnic_client_data_len(adapter);
 
@@ -3187,7 +3196,7 @@ static void send_login(struct ibmvnic_adapter *adapter)
crq.login.len = cpu_to_be32(buffer_size);
	ibmvnic_send_crq(adapter, &crq);
 
-   return;
+   return 0;
 
 buf_rsp_map_failed:
kfree(login_rsp_buffer);
@@ -3196,7 +3205,7 @@ static void send_login(struct ibmvnic_adapter *adapter)
 buf_map_failed:
kfree(login_buffer);
 buf_alloc_failed:
-   return;
+   return -1;
 }
 
 static void send_request_map(struct ibmvnic_adapter *adapter, dma_addr_t addr,
-- 
2.15.0



[PATCH 0/5 net-next] ibmvnic: Miscellaneous driver fixes and enhancements

2018-02-26 Thread Thomas Falcon
There is not a general theme to this patch set other than that it
fixes a few issues with the ibmvnic driver. I will just give a quick
summary of what each patch does here.

"ibmvnic: Fix TX descriptor tracking again" resolves a race condition
introduced in an earlier fix to track outstanding transmit descriptors.
This condition can throw off the tracking counter to the point that
a transmit queue will halt forever.

"ibmvnic: Allocate statistics buffers during probe" allocates queue
statistics buffers on device probe to avoid a crash when accessing
statistics of an unopened interface.

"ibmvnic: Harden TX/RX pool cleaning" includes additional checks to
avoid a bad access when cleaning RX and TX buffer pools during a device
reset.

"ibmvnic: Report queue stops and restarts as debug output" changes TX
queue state notifications from informational to debug messages. This
information is not necessarily useful to a user and under load can result
in a lot of log output.

"ibmvnic: Do not attempt to login if RX or TX queues are not allocated"
checks that device queues have been allocated successfully before
attempting device login. This resolves a panic that could occur if a
user attempted to configure a device after a failed reset.

Thanks for your attention.

Thomas Falcon (5):
  ibmvnic: Fix TX descriptor tracking again
  ibmvnic: Allocate statistics buffers during probe
  ibmvnic: Harden TX/RX pool cleaning
  ibmvnic: Report queue stops and restarts as debug output
  ibmvnic: Do not attempt to login if RX or TX queues are not allocated

 drivers/net/ethernet/ibm/ibmvnic.c | 71 +++---
 1 file changed, 43 insertions(+), 28 deletions(-)

-- 
2.15.0



[PATCH 2/5 net-next] ibmvnic: Allocate statistics buffers during probe

2018-02-26 Thread Thomas Falcon
Currently, buffers holding individual queue statistics are allocated
when the device is opened. If an ibmvnic interface is hotplugged or
initialized but never opened, an attempt to get statistics with
ethtool will result in a kernel panic.

Since the driver allocates a constant number of buffers (enough for the
maximum number of supported queues), these can be allocated during device
probe and freed when the device is hot-unplugged or the module is removed.

Signed-off-by: Thomas Falcon 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 21 +++--
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index 32ee202de13a..1a0f67b60003 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -845,8 +845,6 @@ static void release_resources(struct ibmvnic_adapter *adapter)
release_tx_pools(adapter);
release_rx_pools(adapter);
 
-   release_stats_token(adapter);
-   release_stats_buffers(adapter);
release_error_buffers(adapter);
release_napi(adapter);
release_login_rsp_buffer(adapter);
@@ -974,14 +972,6 @@ static int init_resources(struct ibmvnic_adapter *adapter)
if (rc)
return rc;
 
-   rc = init_stats_buffers(adapter);
-   if (rc)
-   return rc;
-
-   rc = init_stats_token(adapter);
-   if (rc)
-   return rc;
-
adapter->vpd = kzalloc(sizeof(*adapter->vpd), GFP_KERNEL);
if (!adapter->vpd)
return -ENOMEM;
@@ -4431,6 +4421,14 @@ static int ibmvnic_init(struct ibmvnic_adapter *adapter)
release_crq_queue(adapter);
}
 
+   rc = init_stats_buffers(adapter);
+   if (rc)
+   return rc;
+
+   rc = init_stats_token(adapter);
+   if (rc)
+   return rc;
+
return rc;
 }
 
@@ -4538,6 +4536,9 @@ static int ibmvnic_remove(struct vio_dev *dev)
release_sub_crqs(adapter, 1);
release_crq_queue(adapter);
 
+   release_stats_token(adapter);
+   release_stats_buffers(adapter);
+
adapter->state = VNIC_REMOVED;
 
	mutex_unlock(&adapter->reset_lock);
-- 
2.15.0



[PATCH 1/5 net-next] ibmvnic: Fix TX descriptor tracking again

2018-02-26 Thread Thomas Falcon
Sorry, the previous change introduced a race condition between
transmit completion processing and tracking TX descriptors. If a
completion is received before the number of descriptors is logged,
the number of descriptors will be added but not removed. After enough
occurrences, this could halt the transmit queue forever.

Log the number of descriptors used by a transmit before sending.
I stress tested the fix on two different systems running over the
weekend without any issues.
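
To make the window concrete, an illustrative interleaving (comment-form
sketch, not driver code):

    /*
     *   xmit (CPU A)                      completion (CPU B)
     *   ------------                      ------------------
     *   send_subcrq(...)
     *                                     atomic_sub(tx_buff->num_entries,
     *                                                &tx_scrq->used)
     *                                     -- num_entries not yet written,
     *                                        a stale value is subtracted
     *   tx_buff->num_entries = n;
     *   atomic_add(n, &tx_scrq->used);
     *
     * The add is never matched by an equal sub, so ->used only grows and
     * the queue can stay stopped forever.  Writing num_entries before
     * sending, as in the diff below, closes the window.
     */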

Signed-off-by: Thomas Falcon 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index 5a86a916492c..32ee202de13a 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -1482,6 +1482,7 @@ static int ibmvnic_xmit(struct sk_buff *skb, struct net_device *netdev)
if ((*hdrs >> 7) & 1) {
	build_hdr_descs_arr(tx_buff, &num_entries, *hdrs);
tx_crq.v1.n_crq_elem = num_entries;
+   tx_buff->num_entries = num_entries;
tx_buff->indir_arr[0] = tx_crq;
tx_buff->indir_dma = dma_map_single(dev, tx_buff->indir_arr,
sizeof(tx_buff->indir_arr),
@@ -1500,6 +1501,7 @@ static int ibmvnic_xmit(struct sk_buff *skb, struct net_device *netdev)
   (u64)tx_buff->indir_dma,
   (u64)num_entries);
} else {
+   tx_buff->num_entries = num_entries;
lpar_rc = send_subcrq(adapter, handle_array[queue_num],
  &tx_crq);
}
@@ -1536,7 +1538,6 @@ static int ibmvnic_xmit(struct sk_buff *skb, struct net_device *netdev)
netif_stop_subqueue(netdev, queue_num);
}
 
-   tx_buff->num_entries = num_entries;
tx_packets++;
tx_bytes += skb->len;
txq->trans_start = jiffies;
-- 
2.15.0



[PATCH 3/5 net-next] ibmvnic: Harden TX/RX pool cleaning

2018-02-26 Thread Thomas Falcon
If the driver releases resources after a failed reset or some other
error, the driver might attempt to clean up and free memory that
isn't there anymore. Include some additional checks that RX/TX queues
along with their associated structures are still there before cleaning.

Signed-off-by: Thomas Falcon 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 20 
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index 1a0f67b60003..b907c0a607e6 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -1081,6 +1081,7 @@ static int ibmvnic_open(struct net_device *netdev)
 static void clean_rx_pools(struct ibmvnic_adapter *adapter)
 {
struct ibmvnic_rx_pool *rx_pool;
+   struct ibmvnic_rx_buff *rx_buff;
u64 rx_entries;
int rx_scrqs;
int i, j;
@@ -1094,14 +1095,15 @@ static void clean_rx_pools(struct ibmvnic_adapter *adapter)
/* Free any remaining skbs in the rx buffer pools */
for (i = 0; i < rx_scrqs; i++) {
	rx_pool = &adapter->rx_pool[i];
-   if (!rx_pool)
+   if (!rx_pool || !rx_pool->rx_buff)
continue;
 
netdev_dbg(adapter->netdev, "Cleaning rx_pool[%d]\n", i);
for (j = 0; j < rx_entries; j++) {
-   if (rx_pool->rx_buff[j].skb) {
-   dev_kfree_skb_any(rx_pool->rx_buff[j].skb);
-   rx_pool->rx_buff[j].skb = NULL;
+   rx_buff = &rx_pool->rx_buff[j];
+   if (rx_buff && rx_buff->skb) {
+   dev_kfree_skb_any(rx_buff->skb);
+   rx_buff->skb = NULL;
}
}
}
@@ -1110,6 +1112,7 @@ static void clean_rx_pools(struct ibmvnic_adapter *adapter)
 static void clean_tx_pools(struct ibmvnic_adapter *adapter)
 {
struct ibmvnic_tx_pool *tx_pool;
+   struct ibmvnic_tx_buff *tx_buff;
u64 tx_entries;
int tx_scrqs;
int i, j;
@@ -1123,14 +1126,15 @@ static void clean_tx_pools(struct ibmvnic_adapter *adapter)
/* Free any remaining skbs in the tx buffer pools */
for (i = 0; i < tx_scrqs; i++) {
	tx_pool = &adapter->tx_pool[i];
-   if (!tx_pool)
+   if (!tx_pool || !tx_pool->tx_buff)
continue;
 
netdev_dbg(adapter->netdev, "Cleaning tx_pool[%d]\n", i);
for (j = 0; j < tx_entries; j++) {
-   if (tx_pool->tx_buff[j].skb) {
-   dev_kfree_skb_any(tx_pool->tx_buff[j].skb);
-   tx_pool->tx_buff[j].skb = NULL;
+   tx_buff = &tx_pool->tx_buff[j];
+   if (tx_buff && tx_buff->skb) {
+   dev_kfree_skb_any(tx_buff->skb);
+   tx_buff->skb = NULL;
}
}
}
-- 
2.15.0



Re: [net-next v3 0/2] eBPF seccomp filters

2018-02-26 Thread Sargun Dhillon
On Mon, Feb 26, 2018 at 3:04 PM, Alexei Starovoitov  wrote:
> On Mon, Feb 26, 2018 at 07:26:54AM +, Sargun Dhillon wrote:
>> This patchset enables seccomp filters to be written in eBPF. Although this
>> patchset doesn't introduce much of the functionality enabled by eBPF, it lays
>> the ground work for it. Currently, you have to disable CHECKPOINT_RESTORE
>> support in order to utilize eBPF seccomp filters, as eBPF filters cannot be
>> retrieved via the ptrace GET_FILTER API.
>
> this was discussed multiple times in the past.
> In eBPF land it's practically impossible to do checkpoint/restore
> of the whole bpf program/map graph.
>
>> Any user can load a bpf seccomp filter program, and it can be pinned and
>> reused without requiring access to the bpf syscalls. A user only requires
>> the traditional permissions of either being cap_sys_admin, or have
>> no_new_privs set in order to install their rule.
>>
>> The primary reason for not adding maps support in this patchset is
>> to avoid introducing new complexities around PR_SET_NO_NEW_PRIVS.
>> If we have a map that the BPF program can read, it can potentially
>> "change" privileges after running. It seems like doing writes only
>> is safe, because it can be pure, and side effect free, and therefore
>> not negatively effect PR_SET_NO_NEW_PRIVS. Nonetheless, if we come
>> to an agreement, this can be in a follow-up patchset.
>
> readonly maps already exist. See BPF_F_RDONLY.
> Is that not enough?
>
With BPF_F_RDONLY, is there a mechanism to populate a prog_array, and
then mark it rd_only?
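
For context, a minimal sketch of how a map is created with BPF_F_RDONLY
today -- the flag is fixed at map-creation time, which is exactly the
populate-then-freeze gap being asked about (sketch only):

    #include <linux/bpf.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* sketch: a prog_array created read-only from user space; since
     * BPF_F_RDONLY is set at creation, this fd cannot later be used
     * to populate the map */
    static int create_rdonly_prog_array(void)
    {
            union bpf_attr attr = {
                    .map_type    = BPF_MAP_TYPE_PROG_ARRAY,
                    .key_size    = sizeof(__u32),
                    .value_size  = sizeof(__u32),
                    .max_entries = 512,
                    .map_flags   = BPF_F_RDONLY,
            };

            return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
    }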

>> A benchmark of this patchset is as follows for a very standard eBPF filter:
>>
>> Given this test program:
>> for (i = 10; i < ; i++) syscall(__NR_getpid);
>>
>> If I implement an eBPF filter with PROG_ARRAYs with a program per syscall,
>> and tail call, the numbers are such:
>> ebpf JIT 12.3% slower than native
>> ebpf no JIT 13.6% slower than native
>> seccomp JIT 17.6% slower than native
>> seccomp no JIT 37% slower than native
>
> the perf gains are misleading, since patches don't enable bpf_tail_call.
>
> The main statement I want to hear from seccomp maintainers before
> proceeding any further on this that enabling eBPF in seccomp won't lead
> to seccomp folks arguing against changes in bpf core (like verifier)
> just because it's used by seccomp.
> It must be spelled out in the commit log with explicit Ack.
>


[PATCH] net: allow interface to be set into VRF if VLAN interface in same VRF

2018-02-26 Thread Mike Manning
Setting an interface into a VRF fails with 'RTNETLINK answers: File
exists' if one of its VLAN interfaces is already in the same VRF.
As the VRF is an upper device of the VLAN interface, it is also showing
up as an upper device of the interface itself. The solution is to
restrict this check to devices other than master. As only one master
device can be linked to a device, the check in this case is that the
upper device (VRF) being linked to is not the same as the master device
instead of it not being any one of the upper devices.

The following example shows an interface ens12 (with a VLAN interface
ens12.10) being set into VRF green, which behaves as expected:

  # ip link add link ens12 ens12.10 type vlan id 10
  # ip link set dev ens12 master vrfgreen
  # ip link show dev ens12
3: ens12:  mtu 1500 qdisc fq_codel
   master vrfgreen state UP mode DEFAULT group default qlen 1000
   link/ether 52:54:00:4c:a0:45 brd ff:ff:ff:ff:ff:ff

But if the VLAN interface has previously been set into the same VRF,
then setting the interface into the VRF fails:

  # ip link set dev ens12 nomaster
  # ip link set dev ens12.10 master vrfgreen
  # ip link show dev ens12.10
39: ens12.10@ens12:  mtu 1500
qdisc noqueue master vrfgreen state UP mode DEFAULT group default
qlen 1000 link/ether 52:54:00:4c:a0:45 brd ff:ff:ff:ff:ff:ff
  # ip link set dev ens12 master vrfgreen
RTNETLINK answers: File exists

The workaround is to move the VLAN interface back into the default VRF
beforehand, but it has to be shut first so as to avoid the risk of
traffic leaking from the VRF. This fix avoids needing this workaround.

Signed-off-by: Mike Manning 
---
 net/core/dev.c | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index d4362be..2cedf52 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6396,6 +6396,7 @@ static int __netdev_upper_dev_link(struct net_device *dev,
.linking = true,
.upper_info = upper_info,
};
+   struct net_device *master_dev;
int ret = 0;
 
ASSERT_RTNL();
@@ -6407,11 +6408,14 @@ static int __netdev_upper_dev_link(struct net_device *dev,
if (netdev_has_upper_dev(upper_dev, dev))
return -EBUSY;
 
-   if (netdev_has_upper_dev(dev, upper_dev))
-   return -EEXIST;
-
-   if (master && netdev_master_upper_dev_get(dev))
-   return -EBUSY;
+   if (!master) {
+   if (netdev_has_upper_dev(dev, upper_dev))
+   return -EEXIST;
+   } else {
+   master_dev = netdev_master_upper_dev_get(dev);
+   if (master_dev)
+   return master_dev == upper_dev ? -EEXIST : -EBUSY;
+   }
 
ret = call_netdevice_notifiers_info(NETDEV_PRECHANGEUPPER,
					    &changeupper_info.info);
-- 
2.1.4



Re: linux-next: manual merge of the bpf-next tree with the bpf tree

2018-02-26 Thread Stephen Rothwell
Hi Dave,

On Mon, 26 Feb 2018 11:41:47 +1100 Stephen Rothwell  wrote:
>
> Today's linux-next merge of the bpf-next tree got a conflict in:
> 
>   tools/testing/selftests/bpf/test_verifier.c
> 
> between commit:
> 
>   ca36960211eb ("bpf: allow xadd only on aligned memory")
> 
> from the bpf tree and commit:
> 
>   23d191a82c13 ("bpf: add various jit test cases")
> 
> from the bpf-next tree.
> 
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging.  You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.
> 
> -- 
> Cheers,
> Stephen Rothwell
> 
> diff --cc tools/testing/selftests/bpf/test_verifier.c
> index 437c0b1c9d21,c987d3a2426f..
> --- a/tools/testing/selftests/bpf/test_verifier.c
> +++ b/tools/testing/selftests/bpf/test_verifier.c
> @@@ -11163,64 -11140,95 +11166,153 @@@ static struct bpf_test tests[] = 
>   .result = REJECT,
>   .prog_type = BPF_PROG_TYPE_TRACEPOINT,
>   },
>  +{
>  +"xadd/w check unaligned stack",
>  +.insns = {
>  +BPF_MOV64_IMM(BPF_REG_0, 1),
>  +BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_0, -8),
>  +BPF_STX_XADD(BPF_W, BPF_REG_10, BPF_REG_0, -7),
>  +BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_10, -8),
>  +BPF_EXIT_INSN(),
>  +},
>  +.result = REJECT,
>  +.errstr = "misaligned stack access off",
>  +.prog_type = BPF_PROG_TYPE_SCHED_CLS,
>  +},
>  +{
>  +"xadd/w check unaligned map",
>  +.insns = {
>  +BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
>  +BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
>  +BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
>  +BPF_LD_MAP_FD(BPF_REG_1, 0),
>  +BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
>  + BPF_FUNC_map_lookup_elem),
>  +BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
>  +BPF_EXIT_INSN(),
>  +BPF_MOV64_IMM(BPF_REG_1, 1),
>  +BPF_STX_XADD(BPF_W, BPF_REG_0, BPF_REG_1, 3),
>  +BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, 3),
>  +BPF_EXIT_INSN(),
>  +},
>  +.fixup_map1 = { 3 },
>  +.result = REJECT,
>  +.errstr = "misaligned value access off",
>  +.prog_type = BPF_PROG_TYPE_SCHED_CLS,
>  +},
>  +{
>  +"xadd/w check unaligned pkt",
>  +.insns = {
>  +BPF_LDX_MEM(BPF_W, BPF_REG_2, BPF_REG_1,
>  +offsetof(struct xdp_md, data)),
>  +BPF_LDX_MEM(BPF_W, BPF_REG_3, BPF_REG_1,
>  +offsetof(struct xdp_md, data_end)),
>  +BPF_MOV64_REG(BPF_REG_1, BPF_REG_2),
>  +BPF_ALU64_IMM(BPF_ADD, BPF_REG_1, 8),
>  +BPF_JMP_REG(BPF_JLT, BPF_REG_1, BPF_REG_3, 2),
>  +BPF_MOV64_IMM(BPF_REG_0, 99),
>  +BPF_JMP_IMM(BPF_JA, 0, 0, 6),
>  +BPF_MOV64_IMM(BPF_REG_0, 1),
>  +BPF_ST_MEM(BPF_W, BPF_REG_2, 0, 0),
>  +BPF_ST_MEM(BPF_W, BPF_REG_2, 3, 0),
>  +BPF_STX_XADD(BPF_W, BPF_REG_2, BPF_REG_0, 1),
>  +BPF_STX_XADD(BPF_W, BPF_REG_2, BPF_REG_0, 2),
>  +BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_2, 1),
>  +BPF_EXIT_INSN(),
>  +},
>  +.result = REJECT,
>  +.errstr = "BPF_XADD stores into R2 packet",
>  +.prog_type = BPF_PROG_TYPE_XDP,
>  +},
> + {
> + "jit: lsh, rsh, arsh by 1",
> + .insns = {
> + BPF_MOV64_IMM(BPF_REG_0, 1),
> + BPF_MOV64_IMM(BPF_REG_1, 0xff),
> + BPF_ALU64_IMM(BPF_LSH, BPF_REG_1, 1),
> + BPF_ALU32_IMM(BPF_LSH, BPF_REG_1, 1),
> + BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0x3fc, 1),
> + BPF_EXIT_INSN(),
> + BPF_ALU64_IMM(BPF_RSH, BPF_REG_1, 1),
> + BPF_ALU32_IMM(BPF_RSH, BPF_REG_1, 1),
> + BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0xff, 1),
> + BPF_EXIT_INSN(),
> + BPF_ALU64_IMM(BPF_ARSH, BPF_REG_1, 1),
> + BPF_JMP_IMM(BPF_JEQ, BPF_REG_1, 0x7f, 1),
> + BPF_EXIT_INSN(),
> + BPF_MOV64_IMM(BPF_REG_0, 2),
> + BPF_EXIT_INSN(),
> + },
> +   

Re: [net-next v3 0/2] eBPF seccomp filters

2018-02-26 Thread Kees Cook
On Mon, Feb 26, 2018 at 3:04 PM, Alexei Starovoitov  wrote:
> On Mon, Feb 26, 2018 at 07:26:54AM +, Sargun Dhillon wrote:
>> This patchset enables seccomp filters to be written in eBPF. Although this
>> [...]
> The main statement I want to hear from seccomp maintainers before
> proceeding any further on this that enabling eBPF in seccomp won't lead
> to seccomp folks arguing against changes in bpf core (like verifier)
> just because it's used by seccomp.
> It must be spelled out in the commit log with explicit Ack.

The primary thing I'm concerned about with eBPF and seccomp is
side-effects from eBPF programs running at syscall time. This is an
extremely sensitive area, and I want to be sure there won't be
feature-creep here that leads to seccomp getting into a bad state.

As long as seccomp can continue have its own verifier, I *think* this
will be fine, though, again I remain concerned about maps, etc. I'm
still reviewing these patches and how they might provide overlap with
Tycho's needs too, etc.

-Kees

-- 
Kees Cook
Pixel Security


Re: [PATCH RFC net-next 07/20] net/ipv6: Move nexthop data to fib6_nh

2018-02-26 Thread Wei Wang
On Mon, Feb 26, 2018 at 2:47 PM, David Ahern  wrote:
> On 2/26/18 3:28 PM, Wei Wang wrote:
>> On Sun, Feb 25, 2018 at 11:47 AM, David Ahern  wrote:
>>> Introduce fib6_nh structure and move nexthop related data from
>>> rt6_info and rt6_info.dst to fib6_nh. References to dev, gateway or
>>> lwtstate from a FIB lookup perspective are converted to use fib6_nh;
>>> datapath references to dst version are left as is.
>>>
>>
>> My understanding is that after your whole patch series, sibling routes
>> will still have their own fib6_info. Does it make sense to make this
>> fib6_nh as an array in fib6_info so that sibling routes will share
>> fib6_info but will have their own fib6_nh as a future improvement? It
>> matches ipv4 behavior. And I think it will make the sibling route
>> handling code easier?
>
> I was not planning to. IPv6 allowing individual nexthops to be added and
> deleted is very convenient. I do agree the existing sibling route
> linkage makes the code much more complicated than it needs to be.
>
> After this set, I plan to send patches for nexthops as separate objects
> - which will have an impact on how multipath routes are done. With
> nexthop objects there will be 1 prefix route pointing to a nexthop
> object that is multipath (meaning it points in turn to a series of
> nexthop objects). This provides the simplification (no sibling linkage)
> without losing the individual nexthop add / delete option.

Got it. Thanks for the explanation.


Re: [net-next v3 0/2] eBPF seccomp filters

2018-02-26 Thread Alexei Starovoitov
On Mon, Feb 26, 2018 at 07:26:54AM +, Sargun Dhillon wrote:
> This patchset enables seccomp filters to be written in eBPF. Although this
> patchset doesn't introduce much of the functionality enabled by eBPF, it lays
> the ground work for it. Currently, you have to disable CHECKPOINT_RESTORE
> support in order to utilize eBPF seccomp filters, as eBPF filters cannot be
> retrieved via the ptrace GET_FILTER API.

this was discussed multiple times in the past.
In eBPF land it's practically impossible to do checkpoint/restore
of the whole bpf program/map graph.

> Any user can load a bpf seccomp filter program, and it can be pinned and
> reused without requiring access to the bpf syscalls. A user only requires
> the traditional permissions of either being cap_sys_admin, or have
> no_new_privs set in order to install their rule.
> 
> The primary reason for not adding maps support in this patchset is
> to avoid introducing new complexities around PR_SET_NO_NEW_PRIVS.
> If we have a map that the BPF program can read, it can potentially
> "change" privileges after running. It seems like doing writes only
> is safe, because it can be pure, and side effect free, and therefore
> not negatively effect PR_SET_NO_NEW_PRIVS. Nonetheless, if we come
> to an agreement, this can be in a follow-up patchset.

readonly maps already exist. See BPF_F_RDONLY.
Is that not enough?

> A benchmark of this patchset is as follows for a very standard eBPF filter:
> 
> Given this test program:
> for (i = 10; i < ; i++) syscall(__NR_getpid);
> 
> If I implement an eBPF filter with PROG_ARRAYs with a program per syscall,
> and tail call, the numbers are such:
> ebpf JIT 12.3% slower than native
> ebpf no JIT 13.6% slower than native
> seccomp JIT 17.6% slower than native
> seccomp no JIT 37% slower than native

the perf gains are misleading, since patches don't enable bpf_tail_call.

The main statement I want to hear from seccomp maintainers before
proceeding any further on this that enabling eBPF in seccomp won't lead
to seccomp folks arguing against changes in bpf core (like verifier)
just because it's used by seccomp.
It must be spelled out in the commit log with explicit Ack.



Re: [PATCH RFC net-next 16/20] net/ipv6: Cleanup exception route handling

2018-02-26 Thread David Ahern
On 2/26/18 3:29 PM, Wei Wang wrote:
> On Sun, Feb 25, 2018 at 11:47 AM, David Ahern  wrote:
>> IPv6 FIB will only contain FIB entries with exception routes added to
>> the FIB entry. Remove CACHE and dst checks from fib6 add and delete since
>> they can never happen once the data type changes.
>>
>> Fixup the lookup functions to use a f6i name for fib lookups and retain
>> the current rt name for return variables.
>>
>> Signed-off-by: David Ahern 
>> ---
>>  net/ipv6/ip6_fib.c |  16 +--
>>  net/ipv6/route.c   | 122 ++---
>>  2 files changed, 71 insertions(+), 67 deletions(-)
>>
>> diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
>> index 5b03f7e8d850..63a91db61749 100644
>> --- a/net/ipv6/ip6_fib.c
>> +++ b/net/ipv6/ip6_fib.c
>> @@ -1046,7 +1046,7 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct rt6_info *rt,
>>  static void fib6_start_gc(struct net *net, struct rt6_info *rt)
>>  {
>> if (!timer_pending(&net->ipv6.ip6_fib_timer) &&
>> -   (rt->rt6i_flags & (RTF_EXPIRES | RTF_CACHE)))
>> +   (rt->rt6i_flags & RTF_EXPIRES))
>> mod_timer(&net->ipv6.ip6_fib_timer,
>>   jiffies + net->ipv6.sysctl.ip6_rt_gc_interval);
>>  }
>> @@ -1097,8 +1097,6 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt,
> 
> This rt here should be f6i?
> 
>>
>> if (WARN_ON_ONCE(!atomic_read(&rt->dst.__refcnt)))
>> return -EINVAL;
>> -   if (WARN_ON_ONCE(rt->rt6i_flags & RTF_CACHE))
>> -   return -EINVAL;
>>
>> if (info->nlh) {
>> if (!(info->nlh->nlmsg_flags & NLM_F_CREATE))
>> @@ -1622,8 +1620,6 @@ static void fib6_del_route(struct fib6_table *table, struct fib6_node *fn,
>>
>> RT6_TRACE("fib6_del_route\n");
>>
>> -   WARN_ON_ONCE(rt->rt6i_flags & RTF_CACHE);
>> -
>> /* Unlink it */
>> *rtp = rt->rt6_next;
> 
> This rt here is also f6i right?
> 
>> rt->rt6i_node = NULL;
>> @@ -1692,21 +1688,11 @@ int fib6_del(struct rt6_info *rt, struct nl_info *info)
> 
> This rt here is also f6i right?
> 
>> struct rt6_info __rcu **rtp;
>> struct rt6_info __rcu **rtp_next;
>>
>> -#if RT6_DEBUG >= 2
>> -   if (rt->dst.obsolete > 0) {
>> -   WARN_ON(fn);
>> -   return -ENOENT;
>> -   }
>> -#endif
>> if (!fn || rt == net->ipv6.fib6_null_entry)
>> return -ENOENT;
>>
>> WARN_ON(!(fn->fn_flags & RTN_RTINFO));
>>
>> -   /* remove cached dst from exception table */
>> -   if (rt->rt6i_flags & RTF_CACHE)
>> -   return rt6_remove_exception_rt(rt);
> 
> Could you help delete rt6_remove_exception_rt() function? I don't
> think it is used anymore.

It is still used by ip6_negative_advice, ip6_link_failure and
ip6_del_cached_rt. It can be made static; will fix.


The rest of your comments for this patch are renaming rt to f6i. My
thought is to follow up with another patch that does the rename of rt to
f6i for all fib6_info. Given how large this change is already I did not
want to add extra diffs for that. If there is agreement to fold that
part in now, I can do it.


Re: [PATCH V8 2/4] sctp: Add ip option support

2018-02-26 Thread Marcelo Ricardo Leitner
On Mon, Feb 26, 2018 at 05:48:48PM -0500, Paul Moore wrote:
On Sat, Feb 24, 2018 at 11:18 AM, Richard Haines  wrote:
> > Add ip option support to allow LSM security modules to utilise CIPSO/IPv4
> > and CALIPSO/IPv6 services.
> >
> > Signed-off-by: Richard Haines 
> > ---
> > All SCTP lksctp-tools/src/func_tests run correctly in enforcing mode.
> > All "./sctp-tests run" obtained from: https://github.com/sctp/sctp-tests
> > pass.
> >
> > V7 Changes:
> > 1) Log when copy ip options fail for IPv4 and IPv6
> > 2) Correct sctp_setsockopt_maxseg() function. Note that the lksctp-tools
> > func_tests do not test with struct sctp_assoc_value. Just used simple test
> > and okay.
> > 3) Move calculation of overheads to sctp_packet_config().
> > NOTE: Initially in sctp_packet_reset() I set packet->size and
> > packet->overhead to zero (as it is a reset). This was okay for all the
> > lksctp-tools function tests, however when running "sctp-tests" ndatshched
> > tests it causes these to fail with an st_s.log entry of:
> > sid: 3, expected: 3
> > sid: 3, expected: 3
> > unexpected sid packet !!!
> > sid: 1, expected: 3
> >
> > I then found sctp_packet_transmit() relies on setting
> > "packet->size = packet->overhead;" to reset size to the current overhead
> > after sending packets, hence the comment in sctp_packet_reset()
> >
> > V8 Change:
> > Fix sparse warning:
> > net/sctp/protocol.c:269:28: sparse: dereference of noderef expression
> > highlighted in [1] for sctp_v4_ip_options_len() function.
> >
> > [1] https://lists.01.org/pipermail/kbuild-all/2018-February/043695.html
> >
> >  include/net/sctp/sctp.h|  4 +++-
> >  include/net/sctp/structs.h |  2 ++
> >  net/sctp/chunk.c   | 10 +++---
> >  net/sctp/ipv6.c| 45 ++---
> >  net/sctp/output.c  | 34 +-
> >  net/sctp/protocol.c| 43 +++
> >  net/sctp/socket.c  | 11 ---
> >  7 files changed, 122 insertions(+), 27 deletions(-)
> 
> Thanks Richard.
> 
> Neil and Marcelo, I transferred your acked-by to this patch, if you've
> got any objections to that please let me know.

That's fine by me. Thanks

  Marcelo


Re: [PATCH RFC net-next 10/20] net/ipv6: move expires into rt6_info

2018-02-26 Thread David Ahern
On 2/26/18 3:28 PM, Wei Wang wrote:
>> @@ -213,11 +234,6 @@ static inline void rt6_set_expires(struct rt6_info *rt, unsigned long expires)
>>
>>  static inline void rt6_update_expires(struct rt6_info *rt0, int timeout)
>>  {
>> -   struct rt6_info *rt;
>> -
>> -   for (rt = rt0; rt && !(rt->rt6i_flags & RTF_EXPIRES); rt = rt->from);
>> -   if (rt && rt != rt0)
>> -   rt0->dst.expires = rt->dst.expires;
> 
> I was wondering if we need to retain the above logic. It makes sure
> dst.expires gets synced to its "parent" route. But  it might be hard
> because after your change, we can no longer use rt->from to refer to
> the "parent".

As I understand it, the FIB entries are cloned into pcpu, uncached and
exception routes. We should never have an rt6_info that ever points back
more than 1 level -- i.e., the dst rt6_info points to a 'from' representing
the original FIB entry.

After my change 'from' will still point to the FIB entry as a fib6_info
which has its own expires.

When I looked at this code I was really confused. At best, the for loop
above sets rt0->dst.expires to some value based on the 'from' but then
the very next line calls dst_set_expires with the passed in timeout value.


> 
>> dst_set_expires(>dst, timeout);
>> rt0->rt6i_flags |= RTF_EXPIRES;
>>  }



Re: [PATCH V8 2/4] sctp: Add ip option support

2018-02-26 Thread Paul Moore
On Sat, Feb 24, 2018 at 11:18 AM, Richard Haines  wrote:
> Add ip option support to allow LSM security modules to utilise CIPSO/IPv4
> and CALIPSO/IPv6 services.
>
> Signed-off-by: Richard Haines 
> ---
> All SCTP lksctp-tools/src/func_tests run correctly in enforcing mode.
> All "./sctp-tests run" obtained from: https://github.com/sctp/sctp-tests
> pass.
>
> V7 Changes:
> 1) Log when copy ip options fail for IPv4 and IPv6
> 2) Correct sctp_setsockopt_maxseg() function. Note that the lksctp-tools
> func_tests do not test with struct sctp_assoc_value. Just used simple test
> and okay.
> 3) Move calculation of overheads to sctp_packet_config().
> NOTE: Initially in sctp_packet_reset() I set packet->size and
> packet->overhead to zero (as it is a reset). This was okay for all the
> lksctp-tools function tests, however when running "sctp-tests" ndatshched
> tests it causes these to fail with an st_s.log entry of:
> sid: 3, expected: 3
> sid: 3, expected: 3
> unexpected sid packet !!!
> sid: 1, expected: 3
>
> I then found sctp_packet_transmit() relies on setting
> "packet->size = packet->overhead;" to reset size to the current overhead
> after sending packets, hence the comment in sctp_packet_reset()
>
> V8 Change:
> Fix sparse warning:
> net/sctp/protocol.c:269:28: sparse: dereference of noderef expression
> highlighted in [1] for sctp_v4_ip_options_len() function.
>
> [1] https://lists.01.org/pipermail/kbuild-all/2018-February/043695.html
>
>  include/net/sctp/sctp.h|  4 +++-
>  include/net/sctp/structs.h |  2 ++
>  net/sctp/chunk.c   | 10 +++---
>  net/sctp/ipv6.c| 45 ++---
>  net/sctp/output.c  | 34 +-
>  net/sctp/protocol.c| 43 +++
>  net/sctp/socket.c  | 11 ---
>  7 files changed, 122 insertions(+), 27 deletions(-)

Thanks Richard.

Neil and Marcelo, I transferred your acked-by to this patch, if you've
got any objections to that please let me know.

> diff --git a/include/net/sctp/sctp.h b/include/net/sctp/sctp.h
> index f7ae6b0..25c5c87 100644
> --- a/include/net/sctp/sctp.h
> +++ b/include/net/sctp/sctp.h
> @@ -441,9 +441,11 @@ static inline int sctp_list_single_entry(struct list_head *head)
>  static inline int sctp_frag_point(const struct sctp_association *asoc, int pmtu)
>  {
> struct sctp_sock *sp = sctp_sk(asoc->base.sk);
> +   struct sctp_af *af = sp->pf->af;
> int frag = pmtu;
>
> -   frag -= sp->pf->af->net_header_len;
> +   frag -= af->ip_options_len(asoc->base.sk);
> +   frag -= af->net_header_len;
> frag -= sizeof(struct sctphdr) + sctp_datachk_len(>stream);
>
> if (asoc->user_frag)
> diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
> index 03e92dd..ead5fce 100644
> --- a/include/net/sctp/structs.h
> +++ b/include/net/sctp/structs.h
> @@ -491,6 +491,7 @@ struct sctp_af {
> void(*ecn_capable)(struct sock *sk);
> __u16   net_header_len;
> int sockaddr_len;
> +   int (*ip_options_len)(struct sock *sk);
> sa_family_t sa_family;
> struct list_head list;
>  };
> @@ -515,6 +516,7 @@ struct sctp_pf {
> int (*addr_to_user)(struct sctp_sock *sk, union sctp_addr *addr);
> void (*to_sk_saddr)(union sctp_addr *, struct sock *sk);
> void (*to_sk_daddr)(union sctp_addr *, struct sock *sk);
> +   void (*copy_ip_options)(struct sock *sk, struct sock *newsk);
> struct sctp_af *af;
>  };
>
> diff --git a/net/sctp/chunk.c b/net/sctp/chunk.c
> index 991a530..d726d21 100644
> --- a/net/sctp/chunk.c
> +++ b/net/sctp/chunk.c
> @@ -171,6 +171,8 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *asoc,
> struct list_head *pos, *temp;
> struct sctp_chunk *chunk;
> struct sctp_datamsg *msg;
> +   struct sctp_sock *sp;
> +   struct sctp_af *af;
> int err;
>
> msg = sctp_datamsg_new(GFP_KERNEL);
> @@ -189,9 +191,11 @@ struct sctp_datamsg *sctp_datamsg_from_user(struct sctp_association *asoc,
> /* This is the biggest possible DATA chunk that can fit into
>  * the packet
>  */
> -   max_data = asoc->pathmtu -
> -  sctp_sk(asoc->base.sk)->pf->af->net_header_len -
> -  sizeof(struct sctphdr) - sctp_datachk_len(>stream);
> +   sp = sctp_sk(asoc->base.sk);
> +   af = sp->pf->af;
> +   max_data = asoc->pathmtu - af->net_header_len -
> +  sizeof(struct sctphdr) - sctp_datachk_len(>stream) -
> +  af->ip_options_len(asoc->base.sk);
> max_data = SCTP_TRUNC4(max_data);
>
> /* If the the peer requested that we authenticate DATA chunks
> diff --git a/net/sctp/ipv6.c 

Re: [PATCH RFC net-next 07/20] net/ipv6: Move nexthop data to fib6_nh

2018-02-26 Thread David Ahern
On 2/26/18 3:28 PM, Wei Wang wrote:
> On Sun, Feb 25, 2018 at 11:47 AM, David Ahern  wrote:
>> Introduce fib6_nh structure and move nexthop related data from
>> rt6_info and rt6_info.dst to fib6_nh. References to dev, gateway or
>> lwtstate from a FIB lookup perspective are converted to use fib6_nh;
>> datapath references to dst version are left as is.
>>
> 
> My understanding is that after your whole patch series, sibling routes
> will still have their own fib6_info. Does it make sense to make this
> fib6_nh as an array in fib6_info so that sibling routes will share
> fib6_info but will have their own fib6_nh as a future improvement? It
> matches ipv4 behavior. And I think it will make the sibling route
> handling code easier?

I was not planning to. IPv6 allowing individual nexthops to be added and
deleted is very convenient. I do agree the existing sibling route
linkage makes the code much more complicated than it needs to be.

After this set, I plan to send patches for nexthops as separate objects
- which will have an impact on how multipath routes are done. With
nexthop objects there will be 1 prefix route pointing to a nexthop
object that is multipath (meaning it points in turn to a series of
nexthop objects). This provides the simplification (no sibling linkage)
without losing the individual nexthop add / delete option.
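
For reference, the rough shape implied by the patch description quoted
above -- a sketch only; the field names are assumptions, not copied from
the patch:

    /* sketch: nexthop data split out of rt6_info/dst; the description
     * mentions dev, gateway and lwtstate (field names assumed) */
    struct fib6_nh {
            struct in6_addr         nh_gw;
            struct net_device       *nh_dev;
            struct lwtunnel_state   *nh_lwtstate;
            int                     nh_weight;      /* multipath weight */
    };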


Re: [RFC PATCH V4] pci: virtio_pci: Add SR-IOV support for virtio_pci devices

2018-02-26 Thread Alexander Duyck
On Mon, Feb 26, 2018 at 2:38 PM, Michael S. Tsirkin  wrote:
> On Mon, Feb 26, 2018 at 10:05:31AM -0800, Alexander Duyck wrote:
>> On Mon, Feb 26, 2018 at 9:48 AM, Rustad, Mark D  wrote:
>> > Alex,
>> >
>> >> On Feb 26, 2018, at 7:26 AM, Alexander Duyck  wrote:
>> >>
>> >> Mark,
>> >>
>> >> In the future please don't put my "Reviewed-by" on a patch that I
>> >> haven't reviewed. I believe I reviewed one of the earlier patches, but
>> >> I hadn't reviewed this version.
>> >
>> > I'm very sorry. I completely spaced doing something about that. I think 
>> > yours was the first Reviewed-by I ever had in this way. In the future I 
>> > will remove such things from my changelog right after sending. Thanks for 
>> > alerting me to what I had failed to do.
>> >
>> >> Also, after thinking about it over the weekend we may want to look at
>> >> just coming up with a truly "generic" solution that is applied to
>> >> SR-IOV capable devices that don't have a SR-IOV capable driver loaded
>> >> on them. That would allow us to handle the uio, vfio, pci-stub, and
>> virtio cases all in one fell swoop. I think us going through and
>> >> modifying one patch at a time to do this kind of thing isn't going to
>> >> scale.
>> >
>> > The notion of that kind of troubles me - at least pci-stub does. Having 
>> > worked on ixgbe a bit, I have to wonder what kind of havoc would ensue if 
>> > an ixgbe device were assigned to a guest, and an attempt was made to 
>> > allocate VFs by the pci-stub. The guest could be running any version of 
>> > the ixgbe driver, possibly even an old one that didn't support SR-IOV. 
>> > Even if it did support SR-IOV, I don't know how it would respond to 
>> > mailbox messages when it doesn't think it has VFs.
>>
>> The assumption here is that the root user knows what they are doing.
>
> There are tools that let non-root users load the stub.
>
> People use that to e.g. prevent a native driver from loading while also
> assuming it won't break the kernel.
>
>
>> We have already had some discussion on this in regards to VFIO. My
>> thought is we look at adding a new PCI sysfs option called
>> "sriov_unmanaged_autoprobe" that would be similar to
>> "sriov_drivers_autoprobe" and is used to determine if we allow for
>> auto probing of the VFs into the host kernel when SR-IOV is enabled.
>
> I'm not sure how a global option can work for use-cases
> such as containers though.

It isn't global. It is per PF. There is a sysfs value per PF called
"sriov_drivers_autoprobe" that currently controls whether VFs are
automatically bound to drivers by the kernel. What I am suggesting
is we split that value, and then add a new one called
"sriov_unmanaged_autoprobe" to handle the case where either there is
no driver on the PF or the driver loaded doesn't support SR-IOV. The
new value would default to disabled, so that drivers are not bound to
VFs spawned by an unmanaged PF. The root user can toggle the sysfs
entry if they want that behavior to change.
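
As a sketch of the probe decision being proposed (the unmanaged_autoprobe
field and the helper below are assumptions for illustration; only
sriov_drivers_autoprobe exists today):

    /* Illustration only: where the proposed knob would slot in */
    static bool pci_vf_autoprobe_allowed(struct pci_dev *pf)
    {
        /* managed PF: its driver implements sriov_configure */
        if (pf->driver && pf->driver->sriov_configure)
            return pf->sriov->drivers_autoprobe;
        /* unmanaged PF: proposed knob, default false */
        return pf->sriov->unmanaged_autoprobe;
    }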

The idea is based on a request by Alex Williamson that we prevent VF
drivers from loading in the host if a PF is managed by user-space or a
firmware entity and we don't have a clear way of configuring VFs from
the kernel.

- Alex


Re: [RFC PATCH V4] pci: virtio_pci: Add SR-IOV support for virtio_pci devices

2018-02-26 Thread Michael S. Tsirkin
On Mon, Feb 26, 2018 at 02:38:01PM -0800, Alexander Duyck wrote:
> On Mon, Feb 26, 2018 at 2:32 PM, Michael S. Tsirkin  wrote:
> > On Mon, Feb 26, 2018 at 07:26:14AM -0800, Alexander Duyck wrote:
> >> On Sun, Feb 25, 2018 at 8:48 PM, Mark Rustad  
> >> wrote:
> >> > Hardware-realized virtio_pci devices can implement SR-IOV, so this
> >> > patch enables its use. The device in question is an upcoming Intel
> >> > NIC that implements both a virtio_net PF and virtio_net VFs. These
> >> > are hardware realizations of what has been up to now been a software
> >> > interface.
> >> >
> >> > The device in question has the following 4-part PCI IDs:
> >> >
> >> > PF: vendor: 1af4 device: 1041 subvendor: 8086 subdevice: 15fe
> >> > VF: vendor: 1af4 device: 1041 subvendor: 8086 subdevice: 05fe
> >> >
> >> > The patch needs no check for device ID, because the callback will
> >> > never be made for devices that do not assert the capability or
> >> > when run on a platform incapable of SR-IOV.
> >> >
> >> > One reason for this patch is because the hardware requires the
> >> > vendor ID of a VF to be the same as the vendor ID of the PF that
> >> > created it. So it seemed logical to simply have a fully-functioning
> >> > virtio_net PF create the VFs. This patch makes that possible.
> >> >
> >> > Signed-off-by: Mark Rustad 
> >> > Reviewed-by: Alexander Duyck 
> >>
> >> Mark,
> >>
> >> In the future please don't put my "Reviewed-by" on a patch that I
> >> haven't reviewed. I believe I reviewed one of the earlier patches, but
> >> I hadn't reviewed this version.
> >>
> >> Also, after thinking about it over the weekend we may want to look at
> >> just coming up with a truly "generic" solution that is applied to
> >> SR-IOV capable devices that don't have a SR-IOV capable driver loaded
> >> on them. That would allow us to handle the uio, vfio, pci-stub, and
> >> virtio cases all in one fell swoop. I think us going through and
> >> modifying one patch at a time to do this kind of thing isn't going to
> >> scale.
> >
> > uio really can't support VFs properly - without proper IOMMU
> > support any MSIs can corrupt kernel memory, and VFs are
> > limited to MSIs.
> 
> UIO wasn't being run on the VFs, it was just running the PF.

I see. That's fine then.

> The point
> is that there have been about 4 attempts, including this one, to add
> SR-IOV support to drivers that don't actually do any VF management
> internally. They were just being used as a shim so that they could add
> the sriov_configure function to a driver that would load on the PF.
> 
> If we make the solution generic I think it should turn out pretty
> clean. Most of the work just needs to happen in the sysfs function for
> storing the value that is written to sriov_numvfs. I'm working with
> Mark and a few other people now to get this addressed and I hope that
> we can have a patch available shortly.
> 
> Thanks.
> 
> - Alex


Re: [RFC PATCH V4] pci: virtio_pci: Add SR-IOV support for virtio_pci devices

2018-02-26 Thread Michael S. Tsirkin
On Mon, Feb 26, 2018 at 10:05:31AM -0800, Alexander Duyck wrote:
> On Mon, Feb 26, 2018 at 9:48 AM, Rustad, Mark D  
> wrote:
> > Alex,
> >
> >> On Feb 26, 2018, at 7:26 AM, Alexander Duyck  
> >> wrote:
> >>
> >> Mark,
> >>
> >> In the future please don't put my "Reviewed-by" on a patch that I
> >> haven't reviewed. I believe I reviewed one of the earlier patches, but
> >> I hadn't reviewed this version.
> >
> > I'm very sorry. I completely spaced doing something about that. I think 
> > yours was the first Reviewed-by I ever had in this way. In the future I 
> > will remove such things from my changelog right after sending. Thanks for 
> > alerting me to what I had failed to do.
> >
> >> Also, after thinking about it over the weekend we may want to look at
> >> just coming up with a truly "generic" solution that is applied to
> >> SR-IOV capable devices that don't have a SR-IOV capable driver loaded
> >> on them. That would allow us to handle the uio, vfio, pci-stub, and
> >> virtio cases all in one fell swoop. I think us going through and
> >> modifying one patch at a time to do this kind of thing isn't going to
> >> scale.
> >
> > The notion of that kind of troubles me - at least pci-stub does. Having 
> > worked on ixgbe a bit, I have to wonder what kind of havoc would ensue if 
> > an ixgbe device were assigned to a guest, and an attempt was made to 
> > allocate VFs by the pci-stub. The guest could be running any version of the 
> > ixgbe driver, possibly even an old one that didn't support SR-IOV. Even if 
> > it did support SR-IOV, I don't know how it would respond to mailbox 
> > messages when it doesn't think it has VFs.
> 
> The assumption here is that the root user knows what they are doing.

There are tools that let non-root users load the stub.

> People use that to e.g. prevent a native driver from loading while also
assuming it won't break the kernel.


> We have already had some discussion on this in regards to VFIO. My
> thought is we look at adding a new PCI sysfs option called
> "sriov_unmanaged_autoprobe" that would be similar to
> "sriov_drivers_autoprobe" and is used to determine if we allow for
> auto probing of the VFs into the host kernel when SR-IOV is enabled.

I'm not sure how a global option can work for use-cases
such as containers though.

> I
> would want to default the value to false so that by default an
> unmanaged PF wouldn't have its VFs assigned to the host unless we
> specifically enable it by updating the sysfs value.
> 
> >> I'll try to do some digging and find the VFIO approach we had been
> >> working on. I think with a couple tweaks we can probably make that
> >> truly generic and ready for submission.
> >
> > I'd like to know more about what you are thinking about.
> 
> Basic idea is to take your generic SR-IOV enable/disable bits from
> (http://patchwork.ozlabs.org/patch/877674/) and combine it with the
> some of the autoprobe bits and feedback comments from
> (http://patchwork.ozlabs.org/patch/846454/). The idea would be when
> either no driver is loaded, or a driver without the sriov_configure
> method we update the iov auto probe setting based on the value we set
> via sriov_unmanaged_autoprobe, and then call your generic
> configuration method to enable SR-IOV.
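
Roughly, the sriov_numvfs store path would then gain a fallback along
these lines; this is a sketch under the assumptions above, where
pci_sriov_configure_generic() comes from Mark's patch, the existing
pci_vf_drivers_autoprobe() helper applies the policy, and the
unmanaged_autoprobe field is the proposed addition:

    /* Illustration only: fallback when the PF driver can't manage SR-IOV */
    static int sriov_numvfs_fallback(struct pci_dev *pdev, int num_vfs)
    {
        if (pdev->driver && pdev->driver->sriov_configure)
            return pdev->driver->sriov_configure(pdev, num_vfs);

        /* unmanaged PF: apply the proposed autoprobe policy (default off) */
        pci_vf_drivers_autoprobe(pdev, pdev->sriov->unmanaged_autoprobe);
        return pci_sriov_configure_generic(pdev, num_vfs);
    }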


Re: [RFC PATCH V4] pci: virtio_pci: Add SR-IOV support for virtio_pci devices

2018-02-26 Thread Alexander Duyck
On Mon, Feb 26, 2018 at 2:32 PM, Michael S. Tsirkin  wrote:
> On Mon, Feb 26, 2018 at 07:26:14AM -0800, Alexander Duyck wrote:
>> On Sun, Feb 25, 2018 at 8:48 PM, Mark Rustad  wrote:
>> > Hardware-realized virtio_pci devices can implement SR-IOV, so this
>> > patch enables its use. The device in question is an upcoming Intel
>> > NIC that implements both a virtio_net PF and virtio_net VFs. These
>> > are hardware realizations of what has been up to now been a software
>> > interface.
>> >
>> > The device in question has the following 4-part PCI IDs:
>> >
>> > PF: vendor: 1af4 device: 1041 subvendor: 8086 subdevice: 15fe
>> > VF: vendor: 1af4 device: 1041 subvendor: 8086 subdevice: 05fe
>> >
>> > The patch needs no check for device ID, because the callback will
>> > never be made for devices that do not assert the capability or
>> > when run on a platform incapable of SR-IOV.
>> >
>> > One reason for this patch is because the hardware requires the
>> > vendor ID of a VF to be the same as the vendor ID of the PF that
>> > created it. So it seemed logical to simply have a fully-functioning
>> > virtio_net PF create the VFs. This patch makes that possible.
>> >
>> > Signed-off-by: Mark Rustad 
>> > Reviewed-by: Alexander Duyck 
>>
>> Mark,
>>
>> In the future please don't put my "Reviewed-by" on a patch that I
>> haven't reviewed. I believe I reviewed one of the earlier patches, but
>> I hadn't reviewed this version.
>>
>> Also, after thinking about it over the weekend we may want to look at
>> just coming up with a truly "generic" solution that is applied to
>> SR-IOV capable devices that don't have a SR-IOV capable driver loaded
>> on them. That would allow us to handle the uio, vfio, pci-stub, and
>> virtio cases all in one fell swoop. I think us going through and
>> modifying one patch at a time to do this kind of thing isn't going to
>> scale.
>
> uio really can't support VFs properly - without proper IOMMU
> support any MSIs can corrupt kernel memory, and VFs are
> limited to MSIs.

UIO wasn't being run on the VFs, it was just running the PF. The point
is that there have been about 4 attempts, including this one, to add
SR-IOV support to drivers that don't actually do any VF management
internally. They were just being used as a shim so that they could add
the sriov_configure function to a driver that would load on the PF.

If we make the solution generic I think it should turn out pretty
clean. Most of the work just needs to happen in the sysfs function for
storing the value that is written to sriov_numvfs. I'm working with
Mark and a few other people now to get this addressed and I hope that
we can have a patch available shortly.

Thanks.

- Alex


Re: [RFC PATCH V4] pci: virtio_pci: Add SR-IOV support for virtio_pci devices

2018-02-26 Thread Michael S. Tsirkin
On Mon, Feb 26, 2018 at 07:26:14AM -0800, Alexander Duyck wrote:
> On Sun, Feb 25, 2018 at 8:48 PM, Mark Rustad  wrote:
> > Hardware-realized virtio_pci devices can implement SR-IOV, so this
> > patch enables its use. The device in question is an upcoming Intel
> > NIC that implements both a virtio_net PF and virtio_net VFs. These
> > are hardware realizations of what has been up to now been a software
> > interface.
> >
> > The device in question has the following 4-part PCI IDs:
> >
> > PF: vendor: 1af4 device: 1041 subvendor: 8086 subdevice: 15fe
> > VF: vendor: 1af4 device: 1041 subvendor: 8086 subdevice: 05fe
> >
> > The patch needs no check for device ID, because the callback will
> > never be made for devices that do not assert the capability or
> > when run on a platform incapable of SR-IOV.
> >
> > One reason for this patch is because the hardware requires the
> > vendor ID of a VF to be the same as the vendor ID of the PF that
> > created it. So it seemed logical to simply have a fully-functioning
> > virtio_net PF create the VFs. This patch makes that possible.
> >
> > Signed-off-by: Mark Rustad 
> > Reviewed-by: Alexander Duyck 
> 
> Mark,
> 
> In the future please don't put my "Reviewed-by" on a patch that I
> haven't reviewed. I believe I reviewed one of the earlier patches, but
> I hadn't reviewed this version.
> 
> Also, after thinking about it over the weekend we may want to look at
> just coming up with a truly "generic" solution that is applied to
> SR-IOV capable devices that don't have a SR-IOV capable driver loaded
> on them. That would allow us to handle the uio, vfio, pci-stub, and
> virtio cases all in one fell swoop. I think us going through and
> modifying one patch at a time to do this kind of thing isn't going to
> scale.

uio really can't support VFs properly - without proper IOMMU
support any MSIs can corrupt kernel memory, and VFs are
limited to MSIs.

> I'll try to do some digging and find the VFIO approach we had been
> working on. I think with a couple tweaks we can probably make that
> truly generic and ready for submission.
> 
> Thanks.
> 
> - Alex
> 
> > ---
> > Changes in V4:
> > - V3 was a mis-send, this has what was intended
> > - Move most code to new helpers in pci/iov.c, pci_sriov_configure_generic
> >   and pci_sriov_disable_generic
> > - Correct mislabeling of vendor and device IDs
> > - Other minor changelog fixes
> > - Rebased to pci/master, since most changes are in that area now
> > - No new ifdefs with this approach (yay)
> > Changes in V3:
> > - Missent patch, please disregard
> > Changes in V2:
> > - Simplified logic from previous version, removed added driver variable
> > - Disable SR-IOV on driver removal except when VFs are assigned
> > - Sent as RFC to virtio-dev, linux-pci, netdev, lkml and others
> > ---
> >  drivers/pci/iov.c  |   50 
> > 
> >  drivers/virtio/virtio_pci_common.c |2 +
> >  include/linux/pci.h|   10 +++
> >  3 files changed, 62 insertions(+)
> >
> > diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> > index 677924ae0350..4b110e169b7c 100644
> > --- a/drivers/pci/iov.c
> > +++ b/drivers/pci/iov.c
> > @@ -367,6 +367,56 @@ static void sriov_disable(struct pci_dev *dev)
> > pci_iov_set_numvfs(dev, 0);
> >  }
> >
> > +/**
> > + * pci_sriov_disable_generic - standard helper to disable SR-IOV
> > + * @dev:the PCI PF device whose VFs are to be disabled
> > + */
> > +int pci_sriov_disable_generic(struct pci_dev *dev)
> > +{
> > +   /*
> > +* If vfs are assigned we cannot shut down SR-IOV without causing
> > +* issues, so just leave the hardware available.
> > +*/
> > +   if (pci_vfs_assigned(dev)) {
> > +   pci_warn(dev,
> > +"Cannot disable SR-IOV while VFs are assigned - 
> > VFs will not be deallocated\n");
> > +   return -EPERM;
> > +   }
> > +   pci_disable_sriov(dev);
> > +   return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(pci_sriov_disable_generic);
> > +
> > +static int pci_sriov_enable(struct pci_dev *dev, int num_vfs)
> > +{
> > +   int rc;
> > +
> > +   if (pci_num_vf(dev))
> > +   return -EINVAL;
> > +
> > +   rc = pci_enable_sriov(dev, num_vfs);
> > +   if (rc) {
> > +   pci_warn(dev, "Failed to enable PCI sriov: %d\n", rc);
> > +   return rc;
> > +   }
> > +   pci_info(dev, "SR-IOV enabled with %d VFs\n", num_vfs);
> > +   return num_vfs;
> > +}
> > +
> > +/**
> > + * pci_sriov_configure_generic - standard helper to configure SR-IOV
> > + * @dev: the PCI PF device that is configuring SR-IOV
> > + */
> > +int pci_sriov_configure_generic(struct pci_dev *dev, int num_vfs)
> > +{
> > +   if (num_vfs)
> > +   return pci_sriov_enable(dev, num_vfs);
> > +   if (!pci_num_vf(dev))
> > +   return 
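
(The quoted patch is truncated by the archive.) For context, the virtio
side of the change, judging from the diffstat, amounts to wiring the
generic helper into the PF driver, roughly:

    /* Sketch of the virtio_pci_common.c hookup; other fields elided */
    static struct pci_driver virtio_pci_driver = {
        .name            = "virtio-pci",
        .id_table        = virtio_pci_id_table,
        .probe           = virtio_pci_probe,
        .remove          = virtio_pci_remove,
        .sriov_configure = pci_sriov_configure_generic,
    };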

Re: [PATCH RFC net-next 16/20] net/ipv6: Cleanup exception route handling

2018-02-26 Thread Wei Wang
On Sun, Feb 25, 2018 at 11:47 AM, David Ahern  wrote:
> IPv6 FIB will only contain FIB entries with exception routes added to
> the FIB entry. Remove CACHE and dst checks from fib6 add and delete since
> they can never happen once the data type changes.
>
> Fixup the lookup functions to use a f6i name for fib lookups and retain
> the current rt name for return variables.
>
> Signed-off-by: David Ahern 
> ---
>  net/ipv6/ip6_fib.c |  16 +--
>  net/ipv6/route.c   | 122 
> ++---
>  2 files changed, 71 insertions(+), 67 deletions(-)
>
> diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
> index 5b03f7e8d850..63a91db61749 100644
> --- a/net/ipv6/ip6_fib.c
> +++ b/net/ipv6/ip6_fib.c
> @@ -1046,7 +1046,7 @@ static int fib6_add_rt2node(struct fib6_node *fn, 
> struct rt6_info *rt,
>  static void fib6_start_gc(struct net *net, struct rt6_info *rt)
>  {
> if (!timer_pending(&net->ipv6.ip6_fib_timer) &&
> -   (rt->rt6i_flags & (RTF_EXPIRES | RTF_CACHE)))
> +   (rt->rt6i_flags & RTF_EXPIRES))
> mod_timer(&net->ipv6.ip6_fib_timer,
>   jiffies + net->ipv6.sysctl.ip6_rt_gc_interval);
>  }
> @@ -1097,8 +1097,6 @@ int fib6_add(struct fib6_node *root, struct rt6_info 
> *rt,

This rt here should be f6i?

>
> if (WARN_ON_ONCE(!atomic_read(&rt->dst.__refcnt)))
> return -EINVAL;
> -   if (WARN_ON_ONCE(rt->rt6i_flags & RTF_CACHE))
> -   return -EINVAL;
>
> if (info->nlh) {
> if (!(info->nlh->nlmsg_flags & NLM_F_CREATE))
> @@ -1622,8 +1620,6 @@ static void fib6_del_route(struct fib6_table *table, 
> struct fib6_node *fn,
>
> RT6_TRACE("fib6_del_route\n");
>
> -   WARN_ON_ONCE(rt->rt6i_flags & RTF_CACHE);
> -
> /* Unlink it */
> *rtp = rt->rt6_next;

This rt here is also f6i right?

> rt->rt6i_node = NULL;
> @@ -1692,21 +1688,11 @@ int fib6_del(struct rt6_info *rt, struct nl_info 
> *info)

This rt here is also f6i right?

> struct rt6_info __rcu **rtp;
> struct rt6_info __rcu **rtp_next;
>
> -#if RT6_DEBUG >= 2
> -   if (rt->dst.obsolete > 0) {
> -   WARN_ON(fn);
> -   return -ENOENT;
> -   }
> -#endif
> if (!fn || rt == net->ipv6.fib6_null_entry)
> return -ENOENT;
>
> WARN_ON(!(fn->fn_flags & RTN_RTINFO));
>
> -   /* remove cached dst from exception table */
> -   if (rt->rt6i_flags & RTF_CACHE)
> -   return rt6_remove_exception_rt(rt);

Could you help delete rt6_remove_exception_rt() function? I don't
think it is used anymore.

> -
> /*
>  *  Walk the leaf entries looking for ourself
>  */
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index 3ea60e932eb9..19b91c60ee55 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -1094,35 +1094,36 @@ static struct rt6_info *ip6_pol_route_lookup(struct 
> net *net,
>  struct fib6_table *table,
>  struct flowi6 *fl6, int flags)
>  {
> -   struct rt6_info *rt, *rt_cache;
> +   struct rt6_info *f6i;
> struct fib6_node *fn;
> +   struct rt6_info *rt;
>
> rcu_read_lock();
> fn = fib6_lookup(&table->tb6_root, &fl6->daddr, &fl6->saddr);
>  restart:
> -   rt = rcu_dereference(fn->leaf);
> -   if (!rt) {
> -   rt = net->ipv6.fib6_null_entry;
> +   f6i = rcu_dereference(fn->leaf);
> +   if (!f6i) {
> +   f6i = net->ipv6.fib6_null_entry;
> } else {
> -   rt = rt6_device_match(net, rt, &fl6->saddr,
> +   f6i = rt6_device_match(net, f6i, &fl6->saddr,
>   fl6->flowi6_oif, flags);
> -   if (rt->rt6i_nsiblings && fl6->flowi6_oif == 0)
> -   rt = rt6_multipath_select(rt, fl6,
> +   if (f6i->rt6i_nsiblings && fl6->flowi6_oif == 0)
> +   f6i = rt6_multipath_select(f6i, fl6,
>   fl6->flowi6_oif, flags);
> }
> -   if (rt == net->ipv6.fib6_null_entry) {
> +   if (f6i == net->ipv6.fib6_null_entry) {
> fn = fib6_backtrack(fn, &fl6->saddr);
> if (fn)
> goto restart;
> }
> +
> /* Search through exception table */
> -   rt_cache = rt6_find_cached_rt(rt, &fl6->daddr, &fl6->saddr);
> -   if (rt_cache) {
> -   rt = rt_cache;
> +   rt = rt6_find_cached_rt(f6i, &fl6->daddr, &fl6->saddr);
> +   if (rt) {
> if (ip6_hold_safe(net, &rt, true))
> dst_use_noref(&rt->dst, jiffies);
> } else {
> -   rt = ip6_create_rt_rcu(rt);
> +   rt = ip6_create_rt_rcu(f6i);
> }
>
> rcu_read_unlock();
> @@ -1204,9 +1205,6 @@ static struct rt6_info *ip6_rt_cache_alloc(struct 
> 

Re: [PATCH RFC net-next 10/20] net/ipv6: move expires into rt6_info

2018-02-26 Thread Wei Wang
On Sun, Feb 25, 2018 at 11:47 AM, David Ahern  wrote:
> Add expires to rt6_info for FIB entries, and add fib6 helpers to
> manage it. Data path use of dst.expires remains.
>
> Signed-off-by: David Ahern 
> ---
>  include/net/ip6_fib.h | 26 +-
>  net/ipv6/addrconf.c   |  6 +++---
>  net/ipv6/ip6_fib.c|  8 
>  net/ipv6/ndisc.c  |  2 +-
>  net/ipv6/route.c  | 14 +++---
>  5 files changed, 36 insertions(+), 20 deletions(-)
>
> diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
> index da81669b9c90..3ba0bb7c7a43 100644
> --- a/include/net/ip6_fib.h
> +++ b/include/net/ip6_fib.h
> @@ -179,6 +179,7 @@ struct rt6_info {
> should_flush:1,
> unused:6;
>
> +   unsigned long   expires;
> struct dst_metrics  *fib6_metrics;
>  #define fib6_pmtu  fib6_metrics->metrics[RTAX_MTU-1]
>  #define fib6_hoplimit  fib6_metrics->metrics[RTAX_HOPLIMIT-1]
> @@ -199,6 +200,26 @@ static inline struct inet6_dev *ip6_dst_idev(struct 
> dst_entry *dst)
> return ((struct rt6_info *)dst)->rt6i_idev;
>  }
>
> +static inline void fib6_clean_expires(struct rt6_info *f6i)
> +{
> +   f6i->rt6i_flags &= ~RTF_EXPIRES;
> +   f6i->expires = 0;
> +}
> +
> +static inline void fib6_set_expires(struct rt6_info *f6i,
> +   unsigned long expires)
> +{
> +   f6i->expires = expires;
> +   f6i->rt6i_flags |= RTF_EXPIRES;
> +}
> +
> +static inline bool fib6_check_expired(const struct rt6_info *f6i)
> +{
> +   if (f6i->rt6i_flags & RTF_EXPIRES)
> +   return time_after(jiffies, f6i->expires);
> +   return false;
> +}
> +
>  static inline void rt6_clean_expires(struct rt6_info *rt)
>  {
> rt->rt6i_flags &= ~RTF_EXPIRES;
> @@ -213,11 +234,6 @@ static inline void rt6_set_expires(struct rt6_info *rt, 
> unsigned long expires)
>
>  static inline void rt6_update_expires(struct rt6_info *rt0, int timeout)
>  {
> -   struct rt6_info *rt;
> -
> -   for (rt = rt0; rt && !(rt->rt6i_flags & RTF_EXPIRES); rt = rt->from);
> -   if (rt && rt != rt0)
> -   rt0->dst.expires = rt->dst.expires;

I was wondering if we need to retain the above logic. It makes sure
dst.expires gets synced to its "parent" route. But it might be hard
because after your change, we can no longer use rt->from to refer to
the "parent".

> dst_set_expires(&rt0->dst, timeout);
> rt0->rt6i_flags |= RTF_EXPIRES;
>  }
> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> index eeecef2b83a4..478f45bf13cf 100644
> --- a/net/ipv6/addrconf.c
> +++ b/net/ipv6/addrconf.c
> @@ -1202,7 +1202,7 @@ cleanup_prefix_route(struct inet6_ifaddr *ifp, unsigned 
> long expires, bool del_r
> ip6_del_rt(dev_net(ifp->idev->dev), rt);
> else {
> if (!(rt->rt6i_flags & RTF_EXPIRES))
> -   rt6_set_expires(rt, expires);
> +   fib6_set_expires(rt, expires);
> ip6_rt_put(rt);
> }
> }
> @@ -2648,9 +2648,9 @@ void addrconf_prefix_rcv(struct net_device *dev, u8 
> *opt, int len, bool sllao)
> rt = NULL;
> } else if (addrconf_finite_timeout(rt_expires)) {
> /* not infinity */
> -   rt6_set_expires(rt, jiffies + rt_expires);
> +   fib6_set_expires(rt, jiffies + rt_expires);
> } else {
> -   rt6_clean_expires(rt);
> +   fib6_clean_expires(rt);
> }
> } else if (valid_lft) {
> clock_t expires = 0;
> diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
> index faa2b46349df..7bc23b048189 100644
> --- a/net/ipv6/ip6_fib.c
> +++ b/net/ipv6/ip6_fib.c
> @@ -886,9 +886,9 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct 
> rt6_info *rt,
> if (!(iter->rt6i_flags & RTF_EXPIRES))
> return -EEXIST;
> if (!(rt->rt6i_flags & RTF_EXPIRES))
> -   rt6_clean_expires(iter);
> +   fib6_clean_expires(iter);
> else
> -   rt6_set_expires(iter, rt->dst.expires);
> +   fib6_set_expires(iter, rt->expires);
> iter->fib6_pmtu = rt->fib6_pmtu;
> return -EEXIST;
> }
> @@ -1975,8 +1975,8 @@ static int fib6_age(struct rt6_info *rt, void *arg)
>  *  Routes are expired even if they are 

[PATCH ipsec-next] esp: check the NETIF_F_HW_ESP_TX_CSUM bit before segmenting

2018-02-26 Thread Shannon Nelson
If I understand correctly, we should not be asking for a
checksum offload on an ipsec packet if the netdev isn't
advertising NETIF_F_HW_ESP_TX_CSUM.  In that case, we should
clear the NETIF_F_CSUM_MASK bits.

Signed-off-by: Shannon Nelson 
---
 net/ipv4/esp4_offload.c | 2 ++
 net/ipv6/esp6_offload.c | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/net/ipv4/esp4_offload.c b/net/ipv4/esp4_offload.c
index da5635f..7cf755e 100644
--- a/net/ipv4/esp4_offload.c
+++ b/net/ipv4/esp4_offload.c
@@ -138,6 +138,8 @@ static struct sk_buff *esp4_gso_segment(struct sk_buff *skb,
if (!(features & NETIF_F_HW_ESP) || !x->xso.offload_handle ||
(x->xso.dev != skb->dev))
esp_features = features & ~(NETIF_F_SG | NETIF_F_CSUM_MASK);
+   else if (!(features & NETIF_F_HW_ESP_TX_CSUM))
+   esp_features = features & ~NETIF_F_CSUM_MASK;
 
xo->flags |= XFRM_GSO_SEGMENT;
 
diff --git a/net/ipv6/esp6_offload.c b/net/ipv6/esp6_offload.c
index 3fd1ec7..27f59b6 100644
--- a/net/ipv6/esp6_offload.c
+++ b/net/ipv6/esp6_offload.c
@@ -165,6 +165,8 @@ static struct sk_buff *esp6_gso_segment(struct sk_buff *skb,
if (!(features & NETIF_F_HW_ESP) || !x->xso.offload_handle ||
(x->xso.dev != skb->dev))
esp_features = features & ~(NETIF_F_SG | NETIF_F_CSUM_MASK);
+   else if (!(features & NETIF_F_HW_ESP_TX_CSUM))
+   esp_features = features & ~NETIF_F_CSUM_MASK;
 
xo->flags |= XFRM_GSO_SEGMENT;
 
-- 
2.7.4



Re: [PATCH RFC net-next 07/20] net/ipv6: Move nexthop data to fib6_nh

2018-02-26 Thread Wei Wang
On Sun, Feb 25, 2018 at 11:47 AM, David Ahern  wrote:
> Introduce fib6_nh structure and move nexthop related data from
> rt6_info and rt6_info.dst to fib6_nh. References to dev, gateway or
> lwtstate from a FIB lookup perspective are converted to use fib6_nh;
> datapath references to dst version are left as is.
>

My understanding is that after your whole patch series, sibling routes
will still have their own fib6_info. As a future improvement, does it
make sense to make fib6_nh an array in fib6_info, so that sibling
routes share one fib6_info but each have their own fib6_nh? That would
match the IPv4 behavior, and I think it would make the sibling route
handling code easier.
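
For reference, the IPv4 arrangement being referred to looks roughly like
this (simplified from include/net/ip_fib.h):

    struct fib_info {
        /* refcounts, metrics, etc. omitted */
        int             fib_nhs;    /* number of paths */
        struct fib_nh   fib_nh[0];  /* one per path, allocated inline */
    };

so all paths of a multipath route hang off one shared fib_info.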

> Signed-off-by: David Ahern 
> ---
>  .../net/ethernet/mellanox/mlxsw/spectrum_router.c  |  32 ++--
>  include/net/ip6_fib.h  |  16 +-
>  include/net/ip6_route.h|   6 +-
>  net/ipv6/addrconf.c|   2 +-
>  net/ipv6/ip6_fib.c |   6 +-
>  net/ipv6/route.c   | 164 
> -
>  6 files changed, 127 insertions(+), 99 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c 
> b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
> index 05146970c19c..90d01df783b3 100644
> --- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
> +++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
> @@ -2700,9 +2700,9 @@ mlxsw_sp_nexthop6_group_cmp(const struct 
> mlxsw_sp_nexthop_group *nh_grp,
> struct in6_addr *gw;
> int ifindex, weight;
>
> -   ifindex = mlxsw_sp_rt6->rt->dst.dev->ifindex;
> -   weight = mlxsw_sp_rt6->rt->rt6i_nh_weight;
> -   gw = &mlxsw_sp_rt6->rt->rt6i_gateway;
> +   ifindex = mlxsw_sp_rt6->rt->fib6_nh.nh_dev->ifindex;
> +   weight = mlxsw_sp_rt6->rt->fib6_nh.nh_weight;
> +   gw = &mlxsw_sp_rt6->rt->fib6_nh.nh_gw;
> if (!mlxsw_sp_nexthop6_group_has_nexthop(nh_grp, gw, ifindex,
>  weight))
> return false;
> @@ -2768,7 +2768,7 @@ mlxsw_sp_nexthop6_group_hash(struct mlxsw_sp_fib6_entry 
> *fib6_entry, u32 seed)
> struct net_device *dev;
>
> list_for_each_entry(mlxsw_sp_rt6, &fib6_entry->rt6_list, list) {
> -   dev = mlxsw_sp_rt6->rt->dst.dev;
> +   dev = mlxsw_sp_rt6->rt->fib6_nh.nh_dev;
> val ^= dev->ifindex;
> }
>
> @@ -3766,9 +3766,9 @@ mlxsw_sp_rt6_nexthop(struct mlxsw_sp_nexthop_group 
> *nh_grp,
> struct mlxsw_sp_nexthop *nh = &nh_grp->nexthops[i];
> struct rt6_info *rt = mlxsw_sp_rt6->rt;
>
> -   if (nh->rif && nh->rif->dev == rt->dst.dev &&
> +   if (nh->rif && nh->rif->dev == rt->fib6_nh.nh_dev &&
> ipv6_addr_equal((const struct in6_addr *) &nh->gw_addr,
> -   &rt->rt6i_gateway))
> +   &rt->fib6_nh.nh_gw))
> return nh;
> continue;
> }
> @@ -3825,7 +3825,7 @@ mlxsw_sp_fib6_entry_offload_set(struct 
> mlxsw_sp_fib_entry *fib_entry)
>
> if (fib_entry->type == MLXSW_SP_FIB_ENTRY_TYPE_LOCAL) {
> list_first_entry(&fib6_entry->rt6_list, struct mlxsw_sp_rt6,
> -list)->rt->rt6i_nh_flags |= RTNH_F_OFFLOAD;
> +list)->rt->fib6_nh.nh_flags |= RTNH_F_OFFLOAD;
> return;
> }
>
> @@ -3835,9 +3835,9 @@ mlxsw_sp_fib6_entry_offload_set(struct 
> mlxsw_sp_fib_entry *fib_entry)
>
> nh = mlxsw_sp_rt6_nexthop(nh_grp, mlxsw_sp_rt6);
> if (nh && nh->offloaded)
> -   mlxsw_sp_rt6->rt->rt6i_nh_flags |= RTNH_F_OFFLOAD;
> +   mlxsw_sp_rt6->rt->fib6_nh.nh_flags |= RTNH_F_OFFLOAD;
> else
> -   mlxsw_sp_rt6->rt->rt6i_nh_flags &= ~RTNH_F_OFFLOAD;
> +   mlxsw_sp_rt6->rt->fib6_nh.nh_flags &= ~RTNH_F_OFFLOAD;
> }
>  }
>
> @@ -3852,7 +3852,7 @@ mlxsw_sp_fib6_entry_offload_unset(struct 
> mlxsw_sp_fib_entry *fib_entry)
> list_for_each_entry(mlxsw_sp_rt6, &fib6_entry->rt6_list, list) {
> struct rt6_info *rt = mlxsw_sp_rt6->rt;
>
> -   rt->rt6i_nh_flags &= ~RTNH_F_OFFLOAD;
> +   rt->fib6_nh.nh_flags &= ~RTNH_F_OFFLOAD;
> }
>  }
>
> @@ -4748,8 +4748,8 @@ static bool mlxsw_sp_nexthop6_ipip_type(const struct 
> mlxsw_sp *mlxsw_sp,
> const struct rt6_info *rt,
> enum mlxsw_sp_ipip_type *ret)
>  {
> -   return rt->dst.dev &&
> -  mlxsw_sp_netdev_ipip_type(mlxsw_sp, rt->dst.dev, ret);
> +   return rt->fib6_nh.nh_dev &&
> + 

[PATCH net-next 0/4] stmmac barrier fixes and cleanup

2018-02-26 Thread Niklas Cassel
stmmac barrier fixes and cleanup

Niklas Cassel (4):
  net: stmmac: ensure that the MSS desc is the last desc to set the own
bit
  net: stmmac: use correct barrier between coherent memory and MMIO
  net: stmmac: ensure that the device has released ownership before
reading data
  net: stmmac: make dwmac4_release_tx_desc() clear all descriptor fields

 drivers/net/ethernet/stmicro/stmmac/dwmac4_descs.c |  2 ++
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c  | 18 +++---
 2 files changed, 17 insertions(+), 3 deletions(-)

-- 
2.14.2



[PATCH net-next 2/4] net: stmmac: use correct barrier between coherent memory and MMIO

2018-02-26 Thread Niklas Cassel
The last memory barrier in stmmac_xmit()/stmmac_tso_xmit() is placed
between a coherent memory write and a MMIO write:

The own bit is written in First Desc (TSO: MSS desc or First Desc).

The DMA engine is started by a write to the tx desc tail pointer/
enable dma transmission register, i.e. a MMIO write.

This barrier cannot be a simple dma_wmb(), since a dma_wmb() is only
used to guarantee the ordering, with respect to other writes,
to cache coherent DMA memory.

To guarantee that the cache coherent memory writes have completed
before we attempt to write to the cache incoherent MMIO region,
we need to use the more heavyweight barrier wmb().
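
Reduced to its essentials, the sequence being fixed looks like this
(descriptor field and register names are illustrative, not the exact
stmmac code):

    /* Illustration only: coherent memory write, then MMIO doorbell */
    tx_desc->des3 |= cpu_to_le32(TDES3_OWN);    /* coherent DMA memory */

    wmb();  /* order the descriptor write against the MMIO write below */

    /* MMIO write that makes the DMA engine fetch the descriptor */
    writel(tail_ptr, priv->ioaddr + DMA_CHAN_TX_END_ADDR(queue));

A dma_wmb() here would only order the descriptor write against other
coherent memory writes, not against the doorbell.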

Signed-off-by: Niklas Cassel 
---
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c 
b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 3b5e7b06e796..6dd04f237b2a 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -2997,7 +2997,7 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, 
struct net_device *dev)
 * descriptor and then barrier is needed to make sure that
 * all is coherent before granting the DMA engine.
 */
-   dma_wmb();
+   wmb();
 
if (netif_msg_pktdata(priv)) {
pr_info("%s: curr=%d dirty=%d f=%d, e=%d, f_p=%p, nfrags %d\n",
@@ -3221,7 +3221,7 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, 
struct net_device *dev)
 * descriptor and then barrier is needed to make sure that
 * all is coherent before granting the DMA engine.
 */
-   dma_wmb();
+   wmb();
}
 
netdev_tx_sent_queue(netdev_get_tx_queue(dev, queue), skb->len);
-- 
2.14.2



[PATCH net-next 1/4] net: stmmac: ensure that the MSS desc is the last desc to set the own bit

2018-02-26 Thread Niklas Cassel
A dma_wmb() is used to guarantee the ordering, with respect to
other writes, to cache coherent DMA memory.

There is a dma_wmb() in prepare_tx_desc()/prepare_tso_tx_desc() which
ensures that TDES0/1/2 is written before TDES3 (which contains the own
bit), for First Desc.

However, in the rare case that MSS changes, there will be a MSS
context descriptor in front of the regular DMA descriptors:

<MSS desc> <- DMA Next Descriptor
<First Desc>
<Second Desc>
<Last Desc>

Thus, for this special case, we need a dma_wmb()
after prepare_tso_tx_desc()/before writing the own bit to the MSS desc,
so that we flush the write to TDES3 for First Desc,
in order to ensure that the MSS descriptor is the last descriptor to
set the own bit.

Signed-off-by: Niklas Cassel 
---
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c 
b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index c8d86d77e03d..3b5e7b06e796 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -2983,8 +2983,15 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, 
struct net_device *dev)
tcp_hdrlen(skb) / 4, (skb->len - proto_hdr_len));
 
/* If context desc is used to change MSS */
-   if (mss_desc)
+   if (mss_desc) {
+   /* Make sure that first descriptor has been completely
+* written, including its own bit. This is because MSS is
+* actually before first descriptor, so we need to make
+* sure that MSS's own bit is the last thing written.
+*/
+   dma_wmb();
priv->hw->desc->set_tx_owner(mss_desc);
+   }
 
/* The own bit must be the latest setting done when prepare the
 * descriptor and then barrier is needed to make sure that
-- 
2.14.2


