Re: [PATCH bpf] bpf, riscv: clear target register high 32-bits for and/or/xor on ALU32

2019-05-21 Thread Daniel Borkmann
On 05/21/2019 03:46 PM, Björn Töpel wrote:
> When using 32-bit subregisters (ALU32), the RISC-V JIT would not clear
> the high 32-bits of the target register and therefore generate
> incorrect code.
> 
> E.g., in the following code:
> 
>   $ cat test.c
>   unsigned int f(unsigned long long a,
>  unsigned int b)
>   {
>   return (unsigned int)a & b;
>   }
> 
>   $ clang-9 -target bpf -O2 -emit-llvm -S test.c -o - | \
>   llc-9 -mattr=+alu32 -mcpu=v3
>   .text
>   .file   "test.c"
>   .globl  f
>   .p2align3
>   .type   f,@function
>   f:
>   r0 = r1
>   w0 &= w2
>   exit
>   .Lfunc_end0:
>   .size   f, .Lfunc_end0-f
> 
> The JIT would not clear the high 32-bits of r0 after the
> and-operation, which in this case might give an incorrect return
> value.
> 
> After this patch, that is not the case, and the upper 32-bits are
> cleared.
> 
> Reported-by: Jiong Wang 
> Fixes: 2353ecc6f91f ("bpf, riscv: add BPF JIT for RV64G")
> Signed-off-by: Björn Töpel 

Was this missed because test_verifier did not have test coverage?
If so, could you follow up with alu32 test cases for it, so other
JITs can be checked for this kind of issue as well. We should
probably have one for every alu32 alu op to make sure it's not
forgotten anywhere.

Thanks,
Daniel


Re: [PATCH] Documentation/networking: fix af_xdp.rst Sphinx warnings

2019-05-21 Thread Daniel Borkmann
On 05/20/2019 11:22 PM, Randy Dunlap wrote:
> From: Randy Dunlap 
> 
> Fix Sphinx warnings in Documentation/networking/af_xdp.rst by
> adding indentation:
> 
> Documentation/networking/af_xdp.rst:319: WARNING: Literal block expected; 
> none found.
> Documentation/networking/af_xdp.rst:326: WARNING: Literal block expected; 
> none found.
> 
> Fixes: 0f4a9b7d4ecb ("xsk: add FAQ to facilitate for first time users")
> 
> Signed-off-by: Randy Dunlap 
> Cc: Magnus Karlsson 
> Cc: Daniel Borkmann 

Applied, thanks!


Re: [PATCH bpf] bpf: fix out-of-bounds read in __bpf_skc_lookup

2019-05-21 Thread Daniel Borkmann
On 05/21/2019 09:52 AM, Lorenz Bauer wrote:
> __bpf_skc_lookup takes a socket tuple and the length of the
> tuple as an argument. Based on the length, it decides which
> address family to pass to the helper function sk_lookup.
> 
> In case of AF_INET6, it fails to verify that the length
> of the tuple is long enough. sk_lookup may therefore access
> data past the end of the tuple.
> 
> Fixes: 6acc9b432e67 ("bpf: Add helper to retrieve socket in BPF")
> Signed-off-by: Lorenz Bauer 

Applied, thanks!
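
For readers of the archive, the fix amounts to validating the tuple
length before choosing the address family in __bpf_skc_lookup(). This
is a sketch of the idea from the commit message, not the verbatim
merged hunk:

	if (len == sizeof(tuple->ipv4))
		family = AF_INET;
	else if (len == sizeof(tuple->ipv6))
		family = AF_INET6;
	else
		return NULL;	/* reject truncated tuples */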


Re: [PATCH bpf] bpf: Check sk_fullsock() before returning from bpf_sk_lookup()

2019-05-21 Thread Daniel Borkmann
On 05/17/2019 11:21 PM, Martin KaFai Lau wrote:
> The BPF_FUNC_sk_lookup_xxx helpers return RET_PTR_TO_SOCKET_OR_NULL.
> Meaning a fullsock ptr and its fullsock's fields in bpf_sock can be
> accessed, e.g. type, protocol, mark and priority.
> Some new helper, like bpf_sk_storage_get(), also expects
> ARG_PTR_TO_SOCKET is a fullsock.
> 
> bpf_sk_lookup() currently calls sk_to_full_sk() before returning.
> However, the ptr returned from sk_to_full_sk() is not guaranteed
> to be a fullsock.  For example, it cannot get a fullsock if sk
> is in TCP_TIME_WAIT.
> 
> This patch checks for sk_fullsock() before returning. If it is not
> a fullsock, sock_gen_put() is called if needed and then returns NULL.
> 
> Fixes: 6acc9b432e67 ("bpf: Add helper to retrieve socket in BPF")
> Cc: Joe Stringer 
> Signed-off-by: Martin KaFai Lau 

Applied, thanks!
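
For context, the described change boils down to the following in the
lookup path (a sketch based on the commit message; the exact refcount
handling is in the merged patch):

	sk = sk_to_full_sk(sk);
	if (sk && !sk_fullsock(sk)) {
		/* e.g. TCP_TIME_WAIT: not a fullsock, so drop the
		 * lookup reference if one was taken and return NULL
		 */
		sock_gen_put(sk);
		sk = NULL;
	}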


Re: [PATCH] samples: bpf: fix: change the buffer size for read()

2019-05-21 Thread Daniel Borkmann
On 05/19/2019 11:05 AM, Chang-Hsien Tsai wrote:
> If the trace available to read is 4096 bytes or larger,
> the return value sz will be 4096, so buf[sz] = 0 then
> writes one byte past the end of buf (an off-by-one error).
> 
> static char buf[4096];
> ssize_t sz;
> 
> sz = read(trace_fd, buf, sizeof(buf));
> if (sz > 0) {
> buf[sz] = 0;
> puts(buf);
> }
> 
> Signed-off-by: Chang-Hsien Tsai 

Applied, thanks!
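
For reference, the fix is simply to leave room for the terminator so
that buf[sz] = 0 always stays in bounds (mirroring the one-liner in
the patch):

	sz = read(trace_fd, buf, sizeof(buf) - 1);
	if (sz > 0) {
		buf[sz] = 0;	/* now always in bounds */
		puts(buf);
	}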


bpf-next is OPEN

2019-05-21 Thread Daniel Borkmann
Merge window is closed, so new round begins as usual.

Thanks everyone,
Daniel


Re: [PATCH net] net: sched: sch_ingress: do not report ingress filter info in egress path

2019-05-22 Thread Daniel Borkmann
On 05/22/2019 12:20 PM, Lorenzo Bianconi wrote:
>> On Tue, 2019-05-21 at 14:59 +0200, Lorenzo Bianconi wrote:
>>> Currently, if we add a filter to the ingress qdisc (e.g. matchall), the
>>> filter data are reported even in the egress path. The issue can be
>>> triggered with the following reproducer:
> 
> [...]
> 
>>> diff --git a/net/sched/sch_ingress.c b/net/sched/sch_ingress.c
>>> index 0bac926b46c7..1825347fed3a 100644
>>> --- a/net/sched/sch_ingress.c
>>> +++ b/net/sched/sch_ingress.c
>>> @@ -31,7 +31,7 @@ static struct Qdisc *ingress_leaf(struct Qdisc *sch, 
>>> unsigned long arg)
>>>  
>>>  static unsigned long ingress_find(struct Qdisc *sch, u32 classid)
>>>  {
>>> -   return TC_H_MIN(classid) + 1;
>>> +   return TC_H_MIN(classid);
>>
>> probably this breaks a command that was wrong before, but it's worth
>> mentioning. Because of the above hunk, the following command
>>
>> # tc qdisc add dev test0 ingress
>> # tc filter add dev test0 parent :fff1 matchall action drop
>> # tc filter add dev test0 parent : matchall action continue
>>
>> gave no errors, and dropped packets on an unpatched kernel. With this patch,
>> the kernel refuses to add the 'matchall' rules (and because of that,
>> traffic passes).
>>
>> running TDC, it seems that a patched kernel does not pass anymore
>> some of the test cases belonging to the 'filter' category:
>>
>> # ./tdc.py -e 901f
>> Test 901f: Add fw filter with prio at 32-bit maxixum
>> exit: 2
>> exit: 0
>> RTNETLINK answers: Invalid argument
>> We have an error talking to the kernel, -1
>>
>> All test results:
>> 1..1
>> not ok 1 901f - Add fw filter with prio at 32-bit maxixum
>> Command exited with 2, expected 0
>> RTNETLINK answers: Invalid argument
>> We have an error talking to the kernel, -1
>>
>> (the same test is passing on an unpatched kernel)
>>
>> Do you think it's worth fixing those test cases too?
>>
>> thanks a lot!
>> -- 
>> davide
> 
> Hi Davide,
> 
> thanks for pointing this out. Applying this patch, the ingress qdisc has
> the same behaviour as the clsact one.
> 
> $tc qdisc add dev lo clsact
> $tc filter add dev lo parent :fff1 matchall action drop
> Error: Specified class doesn't exist.
> We have an error talking to the kernel, -1
> $tc filter add dev lo parent :fff2 matchall action drop
> 
> $tc qdisc add dev lo ingress
> $tc filter add dev lo parent :fff2 matchall action drop
> 
> Is it acceptable? If so, I can fix the tests as well.
> If not, is there another way to verify the filter is for the ingress path if
> the parent identifier is not constant? (ingress_find() reports the TC_H_MIN
> of the parent identifier.)

As far as I know this would break sch_ingress users ... For sch_ingress,
any minor should be accepted. For sch_clsact, only 0xFFF2U and 0xFFF3U
are accepted, so it can be extended in the future if needed. For old
sch_ingress that ship has sailed, which is why sch_clsact was needed in
order to have such selectors, see also 1f211a1b929c ("net, sched: add
clsact qdisc"). Meaning, the minors for sch_ingress are a superset of
sch_clsact's and not compatible in that sense. If you adapt sch_ingress
to the same behavior as sch_clsact, things might indeed break, as Davide
pointed out; see the sketch of clsact's minor handling below.
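
For reference (roughly clsact_find() from net/sched/sch_ingress.c,
where TC_H_MIN_INGRESS is 0xFFF2U and TC_H_MIN_EGRESS is 0xFFF3U):

static unsigned long clsact_find(struct Qdisc *sch, u32 classid)
{
	switch (TC_H_MIN(classid)) {
	case TC_H_MIN(TC_H_MIN_INGRESS):
	case TC_H_MIN(TC_H_MIN_EGRESS):
		return TC_H_MIN(classid);
	default:
		return 0;
	}
}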

> Regards,
> Lorenzo
> 
>>
>>>  }
>>>  
>>>  static unsigned long ingress_bind_filter(struct Qdisc *sch,
>>> @@ -53,7 +53,12 @@ static struct tcf_block *ingress_tcf_block(struct Qdisc 
>>> *sch, unsigned long cl,
>>>  {
>>> struct ingress_sched_data *q = qdisc_priv(sch);
>>>  
>>> -   return q->block;
>>> +   switch (cl) {
>>> +   case TC_H_MIN(TC_H_MIN_INGRESS):
>>> +   return q->block;
>>> +   default:
>>> +   return NULL;
>>> +   }
>>>  }
>>>  
>>>  static void clsact_chain_head_change(struct tcf_proto *tp_head, void *priv)
>>
>>



Re: [PATCH bpf] bpf, riscv: clear target register high 32-bits for and/or/xor on ALU32

2019-05-23 Thread Daniel Borkmann
On 05/21/2019 04:12 PM, Björn Töpel wrote:
> On Tue, 21 May 2019 at 16:02, Daniel Borkmann  wrote:
>> On 05/21/2019 03:46 PM, Björn Töpel wrote:
>>> When using 32-bit subregisters (ALU32), the RISC-V JIT would not clear
>>> the high 32-bits of the target register and therefore generate
>>> incorrect code.
>>>
>>> E.g., in the following code:
>>>
>>>   $ cat test.c
>>>   unsigned int f(unsigned long long a,
>>>  unsigned int b)
>>>   {
>>>   return (unsigned int)a & b;
>>>   }
>>>
>>>   $ clang-9 -target bpf -O2 -emit-llvm -S test.c -o - | \
>>>   llc-9 -mattr=+alu32 -mcpu=v3
>>>   .text
>>>   .file   "test.c"
>>>   .globl  f
>>>   .p2align3
>>>   .type   f,@function
>>>   f:
>>>   r0 = r1
>>>   w0 &= w2
>>>   exit
>>>   .Lfunc_end0:
>>>   .size   f, .Lfunc_end0-f
>>>
>>> The JIT would not clear the high 32-bits of r0 after the
>>> and-operation, which in this case might give an incorrect return
>>> value.
>>>
>>> After this patch, that is not the case, and the upper 32-bits are
>>> cleared.
>>>
>>> Reported-by: Jiong Wang 
>>> Fixes: 2353ecc6f91f ("bpf, riscv: add BPF JIT for RV64G")
>>> Signed-off-by: Björn Töpel 
>>
>> Was this missed because test_verifier did not have test coverage?
> 
> Yup, and Jiong noted it.

Applied, thanks!


Re: [PATCH bpf] selftests: bpf: add zero extend checks for ALU32 and/or/xor

2019-05-23 Thread Daniel Borkmann
On 05/23/2019 08:38 AM, Y Song wrote:
> On Wed, May 22, 2019 at 1:46 PM Björn Töpel  wrote:
>> On Wed, 22 May 2019 at 20:13, Y Song  wrote:
>>> On Wed, May 22, 2019 at 2:25 AM Björn Töpel  wrote:

 Add three tests to test_verifier/basic_instr that make sure that the
 high 32-bits of the destination register is cleared after an ALU32
 and/or/xor.

 Signed-off-by: Björn Töpel 
>>>
>>> I think the patch is intended for bpf-next, right? The patch itself looks
>>> good to me.
>>> Acked-by: Yonghong Song 
>>
>> Thank you. Actually, it was intended for the bpf tree, as a test
>> follow-up for this [1] fix.
> Then maybe you want to add a Fixes tag and resubmit?

Why would the test case need a fixes tag? It's common practice that we have
BPF fixes that we queue to bpf tree along with kselftest test cases related
to them. Therefore, applied as well, thanks for following up!

Björn, in my reply to the fix, I mentioned we should ideally have test
snippets for all of the alu32 insns so nothing falls through the cracks
when JITs get added or changed. If you have some cycles to add the
remaining missing ones, that would be much appreciated; a sketch of what
such a snippet could look like is below.
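
For example (illustrative constants, not the exact merged test):

{
	"and32: high 32 bits of dst are cleared",
	.insns = {
	BPF_LD_IMM64(BPF_REG_0, 0xffffffff000000ffULL),
	/* w0 &= 0xff; a correct JIT zero-extends r0 here */
	BPF_ALU32_IMM(BPF_AND, BPF_REG_0, 0xff),
	/* shift the (hopefully cleared) upper half into the result */
	BPF_ALU64_IMM(BPF_RSH, BPF_REG_0, 32),
	BPF_EXIT_INSN(),
	},
	.result = ACCEPT,
	.retval = 0,
},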

Thanks,
Daniel


Re: [bpf PATCH] bpf: sockmap, restore sk_write_space when psock gets dropped

2019-05-23 Thread Daniel Borkmann
On 05/22/2019 12:01 PM, Jakub Sitnicki wrote:
> Once psock gets unlinked from its sock (sk_psock_drop), user-space can
> still trigger a call to sk->sk_write_space by setting TCP_NOTSENT_LOWAT
> socket option. This causes a null-ptr-deref because we try to read
> psock->saved_write_space from sk_psock_write_space:
> 
> ==
> BUG: KASAN: null-ptr-deref in sk_psock_write_space+0x69/0x80
> Read of size 8 at addr 01a0 by task sockmap-echo/131
> 
> CPU: 0 PID: 131 Comm: sockmap-echo Not tainted 5.2.0-rc1-00094-gf49aa1de9836 
> #81
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> ?-20180724_192412-buildhw-07.phx2.fedoraproject.org-1.fc29 04/01/2014
> Call Trace:
>  ? sk_psock_write_space+0x69/0x80
>  __kasan_report.cold.2+0x5/0x3f
>  ? sk_psock_write_space+0x69/0x80
>  kasan_report+0xe/0x20
>  sk_psock_write_space+0x69/0x80
>  tcp_setsockopt+0x69a/0xfc0
>  ? tcp_shutdown+0x70/0x70
>  ? fsnotify+0x5b0/0x5f0
>  ? remove_wait_queue+0x90/0x90
>  ? __fget_light+0xa5/0xf0
>  __sys_setsockopt+0xe6/0x180
>  ? sockfd_lookup_light+0xb0/0xb0
>  ? vfs_write+0x195/0x210
>  ? ksys_write+0xc9/0x150
>  ? __x64_sys_read+0x50/0x50
>  ? __bpf_trace_x86_fpu+0x10/0x10
>  __x64_sys_setsockopt+0x61/0x70
>  do_syscall_64+0xc5/0x520
>  ? vmacache_find+0xc0/0x110
>  ? syscall_return_slowpath+0x110/0x110
>  ? handle_mm_fault+0xb4/0x110
>  ? entry_SYSCALL_64_after_hwframe+0x3e/0xbe
>  ? trace_hardirqs_off_caller+0x4b/0x120
>  ? trace_hardirqs_off_thunk+0x1a/0x3a
>  entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x7f2e5e7cdcce
> Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb b1 66 2e 0f 1f 84 00 00 00 00 00
> 0f 1f 44 00 00 f3 0f 1e fa 49 89 ca b8 36 00 00 00 0f 05 <48> 3d 01 f0 ff
> ff 73 01 c3 48 8b 0d 8a 11 0c 00 f7 d8 64 89 01 48
> RSP: 002b:7ffed011b778 EFLAGS: 0206 ORIG_RAX: 0036
> RAX: ffda RBX: 0003 RCX: 7f2e5e7cdcce
> RDX: 0019 RSI: 0006 RDI: 0007
> RBP: 7ffed011b790 R08: 0004 R09: 7f2e5e84ee80
> R10: 7ffed011b788 R11: 0206 R12: 7ffed011b78c
> R13: 7ffed011b788 R14: 0007 R15: 0068
> ==
> 
> Restore the saved sk_write_space callback when psock is being dropped to
> fix the crash.
> 
> Signed-off-by: Jakub Sitnicki 

LGTM, applied thanks!
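
For context, the fix boils down to restoring the callback while the
psock is being unlinked (a sketch; see the merged patch for the exact
placement in sk_psock_drop()):

	/* in sk_psock_drop(), while unlinking the psock from sk: */
	sk->sk_write_space = psock->saved_write_space;

With that, a later setsockopt(TCP_NOTSENT_LOWAT) invokes the original
callback instead of sk_psock_write_space() on a psock that is gone.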


Re: [PATCH v2 bpf-next 0/3] bpf: increase jmp sequence limit

2019-05-23 Thread Daniel Borkmann
On 05/22/2019 05:14 AM, Alexei Starovoitov wrote:
> Patch 1 - jmp sequence limit
> Patch 2 - improve existing tests
> Patch 3 - add pyperf-based realistic bpf program that takes advantage
> of higher limit and use it as a stress test
> 
> v1->v2: fixed nit in patch 3. added Andrii's acks
> 
> Alexei Starovoitov (3):
>   bpf: bump jmp sequence limit
>   selftests/bpf: adjust verifier scale test
>   selftests/bpf: add pyperf scale test
> 
>  kernel/bpf/verifier.c |   7 +-
>  .../bpf/prog_tests/bpf_verif_scale.c  |  31 +-
>  tools/testing/selftests/bpf/progs/pyperf.h| 268 ++
>  tools/testing/selftests/bpf/progs/pyperf100.c |   4 +
>  tools/testing/selftests/bpf/progs/pyperf180.c |   4 +
>  tools/testing/selftests/bpf/progs/pyperf50.c  |   4 +
>  tools/testing/selftests/bpf/test_verifier.c   |  31 +-
>  7 files changed, 319 insertions(+), 30 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/progs/pyperf.h
>  create mode 100644 tools/testing/selftests/bpf/progs/pyperf100.c
>  create mode 100644 tools/testing/selftests/bpf/progs/pyperf180.c
>  create mode 100644 tools/testing/selftests/bpf/progs/pyperf50.c
> 

Looks good, applied, thanks!


Re: [PATCH bpf-next v2 1/3] bpf: implement bpf_send_signal() helper

2019-05-23 Thread Daniel Borkmann
On 05/22/2019 07:39 AM, Yonghong Song wrote:
> This patch tries to solve the following specific use case.
> 
> Currently, bpf program can already collect stack traces
> through kernel function get_perf_callchain()
> when certain events happen (e.g., cache miss counter or
> cpu clock counter overflows). But such stack traces are
> not enough for jitted programs, e.g., hhvm (jited php).
> To get real stack trace, jit engine internal data structures
> need to be traversed in order to get the real user functions.
> 
> bpf program itself may not be the best place to traverse
> the jit engine as the traversing logic could be complex and
> it is not a stable interface either.
> 
> Instead, hhvm implements a signal handler,
> e.g. for SIGALARM, and a set of program locations which
> it can dump stack traces. When it receives a signal, it will
> dump the stack at the next such program location.
> 
> Such a mechanism can be implemented in the following way:
>   . a perf ring buffer is created between bpf program
> and tracing app.
>   . once a particular event happens, bpf program writes
> to the ring buffer and the tracing app gets notified.
>   . the tracing app sends a signal SIGALARM to the hhvm.
> 
> But this method could have large delays, causing skewed
> profiling results.
> 
> This patch implements bpf_send_signal() helper to send
> a signal to hhvm in real time, resulting in intended stack traces.
> 
> Signed-off-by: Yonghong Song 
> ---
>  include/uapi/linux/bpf.h | 17 +-
>  kernel/trace/bpf_trace.c | 67 
>  2 files changed, 83 insertions(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 63e0cf66f01a..68d4470523a0 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -2672,6 +2672,20 @@ union bpf_attr {
>   *   0 on success.
>   *
>   *   **-ENOENT** if the bpf-local-storage cannot be found.
> + *
> + * int bpf_send_signal(u32 sig)
> + *   Description
> + *   Send signal *sig* to the current task.
> + *   Return
> + *   0 on success or successfully queued.
> + *
> + *   **-EBUSY** if work queue under nmi is full.
> + *
> + *   **-EINVAL** if *sig* is invalid.
> + *
> + *   **-EPERM** if no permission to send the *sig*.
> + *
> + *   **-EAGAIN** if bpf program can try again.
>   */
>  #define __BPF_FUNC_MAPPER(FN)\
>   FN(unspec), \
> @@ -2782,7 +2796,8 @@ union bpf_attr {
>   FN(strtol), \
>   FN(strtoul),\
>   FN(sk_storage_get), \
> - FN(sk_storage_delete),
> + FN(sk_storage_delete),  \
> + FN(send_signal),
>  
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index f92d6ad5e080..f8cd0db7289f 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -567,6 +567,58 @@ static const struct bpf_func_proto 
> bpf_probe_read_str_proto = {
>   .arg3_type  = ARG_ANYTHING,
>  };
>  
> +struct send_signal_irq_work {
> + struct irq_work irq_work;
> + u32 sig;
> +};
> +
> +static DEFINE_PER_CPU(struct send_signal_irq_work, send_signal_work);
> +
> +static void do_bpf_send_signal(struct irq_work *entry)
> +{
> + struct send_signal_irq_work *work;
> +
> + work = container_of(entry, struct send_signal_irq_work, irq_work);
> + group_send_sig_info(work->sig, SEND_SIG_PRIV, current, PIDTYPE_TGID);
> +}
> +
> +BPF_CALL_1(bpf_send_signal, u32, sig)
> +{
> + struct send_signal_irq_work *work = NULL;
> +
> + /* Similar to bpf_probe_write_user, task needs to be
> +  * in a sound condition and kernel memory access be
> +  * permitted in order to send signal to the current
> +  * task.
> +  */
> + if (unlikely(current->flags & (PF_KTHREAD | PF_EXITING)))
> + return -EPERM;
> + if (unlikely(uaccess_kernel()))
> + return -EPERM;
> + if (unlikely(!nmi_uaccess_okay()))
> + return -EPERM;
> +
> + if (in_nmi()) {

Hm, I'm a bit confused: can't this only be done out of process context
in general, since only there does current point to e.g. hhvm? I'm
probably missing something. Could you elaborate?

> + work = this_cpu_ptr(&send_signal_work);
> + if (work->irq_work.flags & IRQ_WORK_BUSY)
> + return -EBUSY;
> +
> + work->sig = sig;
> + irq_work_queue(&work->irq_work);
> + return 0;
> + }
> +
> + return group_send_sig_info(sig, SEND_SIG_PRIV, current, PIDTYPE_TGID);
> +

Nit: extra newline slipped in

> +}
> +
> +static const struct bpf_func_proto bpf_send_signal_proto = {
> + .func   = bpf_send_signal,
> + .gpl_only   = false,
> + .ret_type   = RET_INTEGER,
> +   

Re: [PATCH bpf-next v2 1/3] bpf: implement bpf_send_signal() helper

2019-05-23 Thread Daniel Borkmann
On 05/23/2019 05:58 PM, Yonghong Song wrote:
> On 5/23/19 8:41 AM, Daniel Borkmann wrote:
>> On 05/22/2019 07:39 AM, Yonghong Song wrote:
>>> This patch tries to solve the following specific use case.
>>>
>>> Currently, bpf program can already collect stack traces
>>> through kernel function get_perf_callchain()
>>> when certain events happen (e.g., cache miss counter or
>>> cpu clock counter overflows). But such stack traces are
>>> not enough for jitted programs, e.g., hhvm (jited php).
>>> To get real stack trace, jit engine internal data structures
>>> need to be traversed in order to get the real user functions.
>>>
>>> bpf program itself may not be the best place to traverse
>>> the jit engine as the traversing logic could be complex and
>>> it is not a stable interface either.
>>>
>>> Instead, hhvm implements a signal handler,
>>> e.g. for SIGALARM, and a set of program locations which
>>> it can dump stack traces. When it receives a signal, it will
>>> dump the stack at the next such program location.
>>>
>>> Such a mechanism can be implemented in the following way:
>>>. a perf ring buffer is created between bpf program
>>>  and tracing app.
>>>. once a particular event happens, bpf program writes
>>>  to the ring buffer and the tracing app gets notified.
>>>. the tracing app sends a signal SIGALARM to the hhvm.
>>>
>>> But this method could have large delays, causing skewed
>>> profiling results.
>>>
>>> This patch implements bpf_send_signal() helper to send
>>> a signal to hhvm in real time, resulting in intended stack traces.
>>>
>>> Signed-off-by: Yonghong Song 
>>> ---
>>>   include/uapi/linux/bpf.h | 17 +-
>>>   kernel/trace/bpf_trace.c | 67 
>>>   2 files changed, 83 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>> index 63e0cf66f01a..68d4470523a0 100644
>>> --- a/include/uapi/linux/bpf.h
>>> +++ b/include/uapi/linux/bpf.h
>>> @@ -2672,6 +2672,20 @@ union bpf_attr {
>>>*0 on success.
>>>*
>>>***-ENOENT** if the bpf-local-storage cannot be found.
>>> + *
>>> + * int bpf_send_signal(u32 sig)
>>> + * Description
>>> + * Send signal *sig* to the current task.
>>> + * Return
>>> + * 0 on success or successfully queued.
>>> + *
>>> + * **-EBUSY** if work queue under nmi is full.
>>> + *
>>> + * **-EINVAL** if *sig* is invalid.
>>> + *
>>> + * **-EPERM** if no permission to send the *sig*.
>>> + *
>>> + * **-EAGAIN** if bpf program can try again.
>>>*/
>>>   #define __BPF_FUNC_MAPPER(FN) \
>>> FN(unspec), \
>>> @@ -2782,7 +2796,8 @@ union bpf_attr {
>>> FN(strtol), \
>>> FN(strtoul),\
>>> FN(sk_storage_get), \
>>> -   FN(sk_storage_delete),
>>> +   FN(sk_storage_delete),  \
>>> +   FN(send_signal),
>>>   
>>>   /* integer value in 'imm' field of BPF_CALL instruction selects which 
>>> helper
>>>* function eBPF program intends to call
>>> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
>>> index f92d6ad5e080..f8cd0db7289f 100644
>>> --- a/kernel/trace/bpf_trace.c
>>> +++ b/kernel/trace/bpf_trace.c
>>> @@ -567,6 +567,58 @@ static const struct bpf_func_proto 
>>> bpf_probe_read_str_proto = {
>>> .arg3_type  = ARG_ANYTHING,
>>>   };
>>>   
>>> +struct send_signal_irq_work {
>>> +   struct irq_work irq_work;
>>> +   u32 sig;
>>> +};
>>> +
>>> +static DEFINE_PER_CPU(struct send_signal_irq_work, send_signal_work);
>>> +
>>> +static void do_bpf_send_signal(struct irq_work *entry)
>>> +{
>>> +   struct send_signal_irq_work *work;
>>> +
>>> +   work = container_of(entry, struct send_signal_irq_work, irq_work);
>>> +   group_send_sig_info(work->sig, SEND_SIG_PRIV, current, PIDTYPE_TGID);
>>> +}
>>> +
>>> +BPF_CALL_1(bpf_send_signal, u32, sig)
>>> +{
>>> +   struct send_signal_irq_work *work = NULL;
>>> +
>>> +   /* Similar 

Re: [PATCH bpf-next v2 1/3] bpf: implement bpf_send_signal() helper

2019-05-23 Thread Daniel Borkmann
On 05/23/2019 11:30 PM, Yonghong Song wrote:
> On 5/23/19 2:07 PM, Yonghong Song wrote:
>> On 5/23/19 9:28 AM, Daniel Borkmann wrote:
>>> On 05/23/2019 05:58 PM, Yonghong Song wrote:
>>>> On 5/23/19 8:41 AM, Daniel Borkmann wrote:
>>>>> On 05/22/2019 07:39 AM, Yonghong Song wrote:
>>>>>> This patch tries to solve the following specific use case.
>>>>>>
>>>>>> Currently, bpf program can already collect stack traces
>>>>>> through kernel function get_perf_callchain()
>>>>>> when certain events happen (e.g., cache miss counter or
>>>>>> cpu clock counter overflows). But such stack traces are
>>>>>> not enough for jitted programs, e.g., hhvm (jited php).
>>>>>> To get real stack trace, jit engine internal data structures
>>>>>> need to be traversed in order to get the real user functions.
>>>>>>
>>>>>> bpf program itself may not be the best place to traverse
>>>>>> the jit engine as the traversing logic could be complex and
>>>>>> it is not a stable interface either.
>>>>>>
>>>>>> Instead, hhvm implements a signal handler,
>>>>>> e.g. for SIGALARM, and a set of program locations which
>>>>>> it can dump stack traces. When it receives a signal, it will
>>>>>> dump the stack at the next such program location.
>>>>>>
>>>>>> Such a mechanism can be implemented in the following way:
>>>>>>  . a perf ring buffer is created between bpf program
>>>>>>and tracing app.
>>>>>>  . once a particular event happens, bpf program writes
>>>>>>to the ring buffer and the tracing app gets notified.
>>>>>>  . the tracing app sends a signal SIGALARM to the hhvm.
>>>>>>
>>>>>> But this method could have large delays, causing skewed
>>>>>> profiling results.
>>>>>>
>>>>>> This patch implements bpf_send_signal() helper to send
>>>>>> a signal to hhvm in real time, resulting in intended stack traces.
>>>>>>
>>>>>> Signed-off-by: Yonghong Song 
>>>>>> ---
>>>>>> include/uapi/linux/bpf.h | 17 +-
>>>>>> kernel/trace/bpf_trace.c | 67 
>>>>>> 
>>>>>> 2 files changed, 83 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>>>>> index 63e0cf66f01a..68d4470523a0 100644
>>>>>> --- a/include/uapi/linux/bpf.h
>>>>>> +++ b/include/uapi/linux/bpf.h
>>>>>> @@ -2672,6 +2672,20 @@ union bpf_attr {
>>>>>>  *   0 on success.
>>>>>>  *
>>>>>>  *   **-ENOENT** if the bpf-local-storage cannot be found.
>>>>>> + *
>>>>>> + * int bpf_send_signal(u32 sig)
>>>>>> + *  Description
>>>>>> + *  Send signal *sig* to the current task.
>>>>>> + *  Return
>>>>>> + *  0 on success or successfully queued.
>>>>>> + *
>>>>>> + *  **-EBUSY** if work queue under nmi is full.
>>>>>> + *
>>>>>> + *  **-EINVAL** if *sig* is invalid.
>>>>>> + *
>>>>>> + *  **-EPERM** if no permission to send the *sig*.
>>>>>> + *
>>>>>> + *  **-EAGAIN** if bpf program can try again.
>>>>>>  */
>>>>>> #define __BPF_FUNC_MAPPER(FN)\
>>>>>>  FN(unspec), \
>>>>>> @@ -2782,7 +2796,8 @@ union bpf_attr {
>>>>>>  FN(strtol), \
>>>>>>  FN(strtoul),\
>>>>>>  FN(sk_storage_get), \
>>>>>> -FN(sk_storage_delete),
>>>>>> +FN(sk_storage_delete),  \
>>>>>> +FN(send_signal),
>>>>>> 
>>>>>> /* integer value in 'imm' field of BPF_CALL instruction selects 
>>>>>> which helper
>>>>>>  * function eBPF program intend

Re: [PATCH] bpf: sockmap, fix use after free from sleep in psock backlog workqueue

2019-05-24 Thread Daniel Borkmann
On 05/23/2019 05:48 PM, John Fastabend wrote:
> Backlog work for psock (sk_psock_backlog) might sleep while waiting
> for memory to free up when sending packets. However, while sleeping
> the socket may be closed and removed from the map by the user space
> side.
> 
> This breaks an assumption in sk_stream_wait_memory, which expects the
> wait queue to still be there when it wakes up, resulting in the
> use-after-free shown below. To fix this, mark sendmsg as MSG_DONTWAIT
> to avoid the sleep altogether. We already set the flag for the
> sendpage case but we missed the case where sendmsg is used.
> Sockmap is currently the only user of skb_send_sock_locked() so only
> the sockmap paths should be impacted.
> 
> ==
> BUG: KASAN: use-after-free in remove_wait_queue+0x31/0x70
> Write of size 8 at addr 888069a0c4e8 by task kworker/0:2/110
> 
> CPU: 0 PID: 110 Comm: kworker/0:2 Not tainted 
> 5.0.0-rc2-00335-g28f9d1a3d4fe-dirty #14
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 
> 04/01/2014
> Workqueue: events sk_psock_backlog
> Call Trace:
>  print_address_description+0x6e/0x2b0
>  ? remove_wait_queue+0x31/0x70
>  kasan_report+0xfd/0x177
>  ? remove_wait_queue+0x31/0x70
>  ? remove_wait_queue+0x31/0x70
>  remove_wait_queue+0x31/0x70
>  sk_stream_wait_memory+0x4dd/0x5f0
>  ? sk_stream_wait_close+0x1b0/0x1b0
>  ? wait_woken+0xc0/0xc0
>  ? tcp_current_mss+0xc5/0x110
>  tcp_sendmsg_locked+0x634/0x15d0
>  ? tcp_set_state+0x2e0/0x2e0
>  ? __kasan_slab_free+0x1d1/0x230
>  ? kmem_cache_free+0x70/0x140
>  ? sk_psock_backlog+0x40c/0x4b0
>  ? process_one_work+0x40b/0x660
>  ? worker_thread+0x82/0x680
>  ? kthread+0x1b9/0x1e0
>  ? ret_from_fork+0x1f/0x30
>  ? check_preempt_curr+0xaf/0x130
>  ? iov_iter_kvec+0x5f/0x70
>  ? kernel_sendmsg_locked+0xa0/0xe0
>  skb_send_sock_locked+0x273/0x3c0
>  ? skb_splice_bits+0x180/0x180
>  ? start_thread+0xe0/0xe0
>  ? update_min_vruntime.constprop.27+0x88/0xc0
>  sk_psock_backlog+0xb3/0x4b0
>  ? strscpy+0xbf/0x1e0
>  process_one_work+0x40b/0x660
>  worker_thread+0x82/0x680
>  ? process_one_work+0x660/0x660
>  kthread+0x1b9/0x1e0
>  ? __kthread_create_on_node+0x250/0x250
>  ret_from_fork+0x1f/0x30
> 
> Fixes: 20bf50de3028c ("skbuff: Function to send an skbuf on a socket")
> Reported-by: Jakub Sitnicki 
> Tested-by: Jakub Sitnicki 
> Signed-off-by: John Fastabend 
> ---
>  net/core/skbuff.c |1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index e89be62..c3b03c5 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -2337,6 +2337,7 @@ int skb_send_sock_locked(struct sock *sk, struct 
> sk_buff *skb, int offset,
>   kv.iov_base = skb->data + offset;
>   kv.iov_len = slen;
>   memset(&msg, 0, sizeof(msg));
> + msg.flags = MSG_DONTWAIT;
>  
>   ret = kernel_sendmsg_locked(sk, &msg, &kv, 1, slen);
>   if (ret <= 0)

This doesn't even compile. :( It should have been msg_flags instead ...
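
I.e., struct msghdr carries the per-call flags in msg_flags, so the
hunk needs to be:

	memset(&msg, 0, sizeof(msg));
	msg.msg_flags = MSG_DONTWAIT;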


Re: [PATCH v2 bpf-next 0/3] bpf: optimize explored_states

2019-05-24 Thread Daniel Borkmann
On 05/22/2019 05:17 AM, Alexei Starovoitov wrote:
> Convert explored_states array into hash table and use simple hash to
> reduce verifier peak memory consumption for programs with bpf2bpf calls.
> More details in patch 3.
> 
> v1->v2: fixed Jakub's small nit in patch 1
> 
> Alexei Starovoitov (3):
>   bpf: cleanup explored_states
>   bpf: split explored_states
>   bpf: convert explored_states to hash table
> 
>  include/linux/bpf_verifier.h |  2 +
>  kernel/bpf/verifier.c| 77 ++--
>  2 files changed, 50 insertions(+), 29 deletions(-)
> 

Applied, thanks!


Re: [RFC net-next 1/1] net: sched: protect against loops in TC filter hooks

2019-05-24 Thread Daniel Borkmann
On 05/24/2019 06:05 PM, John Hurley wrote:
> TC hooks allow the application of filters and actions to packets at both
> ingress and egress of the network stack. It is possible, with poor
> configuration, that this can produce loops whereby an ingress hook calls
> a mirred egress action that has an egress hook that redirects back to
> the first ingress, etc. The TC core classifier protects against loops when
> doing reclassification but, as yet, there is no protection against a packet
> looping between multiple hooks. This can lead to stack overflow panics.
> 
> Add a per-cpu counter that tracks recursion of packets through TC hooks.
> The packet will be dropped if the recursion limit is exceeded, and the
> counter is reset for the next packet.

NAK. This is quite a hack in the middle of the fast path. Such redirection
usually has a rescheduling point, as in the cls_bpf case. If, as I read your
commit message above, this is not the case for the mirred action, then fix
mirred instead if it's broken in that way.

Thanks,
Daniel

> Signed-off-by: John Hurley 
> Reviewed-by: Simon Horman 
> ---
>  net/core/dev.c | 62 
> +++---
>  1 file changed, 55 insertions(+), 7 deletions(-)
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index b6b8505..a6d9ed7 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -154,6 +154,9 @@
>  /* This should be increased if a protocol with a bigger head is added. */
>  #define GRO_MAX_HEAD (MAX_HEADER + 128)
>  
> +#define SCH_RECURSION_LIMIT  4
> +static DEFINE_PER_CPU(int, sch_recursion_level);
> +
>  static DEFINE_SPINLOCK(ptype_lock);
>  static DEFINE_SPINLOCK(offload_lock);
>  struct list_head ptype_base[PTYPE_HASH_SIZE] __read_mostly;
> @@ -3598,16 +3601,42 @@ int dev_loopback_xmit(struct net *net, struct sock 
> *sk, struct sk_buff *skb)
>  }
>  EXPORT_SYMBOL(dev_loopback_xmit);
>  
> +static inline int sch_check_inc_recur_level(void)
> +{
> + int rec_level = __this_cpu_inc_return(sch_recursion_level);
> +
> + if (rec_level >= SCH_RECURSION_LIMIT) {
> + net_warn_ratelimited("Recursion limit reached on TC datapath, 
> probable configuration error\n");
> + return -ELOOP;
> + }
> +
> + return 0;
> +}
> +
> +static inline void sch_dec_recur_level(void)
> +{
> + __this_cpu_dec(sch_recursion_level);
> +}
> +
>  #ifdef CONFIG_NET_EGRESS
>  static struct sk_buff *
>  sch_handle_egress(struct sk_buff *skb, int *ret, struct net_device *dev)
>  {
>   struct mini_Qdisc *miniq = rcu_dereference_bh(dev->miniq_egress);
>   struct tcf_result cl_res;
> + int err;
>  
>   if (!miniq)
>   return skb;
>  
> + err = sch_check_inc_recur_level();
> + if (err) {
> + sch_dec_recur_level();
> + *ret = NET_XMIT_DROP;
> + consume_skb(skb);
> + return NULL;
> + }
> +
>   /* qdisc_skb_cb(skb)->pkt_len was already set by the caller. */
>   mini_qdisc_bstats_cpu_update(miniq, skb);
>  
> @@ -3620,22 +3649,26 @@ sch_handle_egress(struct sk_buff *skb, int *ret, 
> struct net_device *dev)
>   mini_qdisc_qstats_cpu_drop(miniq);
>   *ret = NET_XMIT_DROP;
>   kfree_skb(skb);
> - return NULL;
> + skb = NULL;
> + break;
>   case TC_ACT_STOLEN:
>   case TC_ACT_QUEUED:
>   case TC_ACT_TRAP:
>   *ret = NET_XMIT_SUCCESS;
>   consume_skb(skb);
> - return NULL;
> + skb = NULL;
> + break;
>   case TC_ACT_REDIRECT:
>   /* No need to push/pop skb's mac_header here on egress! */
>   skb_do_redirect(skb);
>   *ret = NET_XMIT_SUCCESS;
> - return NULL;
> + skb = NULL;
> + break;
>   default:
>   break;
>   }
>  
> + sch_dec_recur_level();
>   return skb;
>  }
>  #endif /* CONFIG_NET_EGRESS */
> @@ -4670,6 +4703,7 @@ sch_handle_ingress(struct sk_buff *skb, struct 
> packet_type **pt_prev, int *ret,
>  #ifdef CONFIG_NET_CLS_ACT
>   struct mini_Qdisc *miniq = rcu_dereference_bh(skb->dev->miniq_ingress);
>   struct tcf_result cl_res;
> + int err;
>  
>   /* If there's at least one ingress present somewhere (so
>* we get here via enabled static key), remaining devices
> @@ -4679,6 +4713,14 @@ sch_handle_ingress(struct sk_buff *skb, struct 
> packet_type **pt_prev, int *ret,
>   if (!miniq)
>   return skb;
>  
> + err = sch_check_inc_recur_level();
> + if (err) {
> + sch_dec_recur_level();
> + *ret = NET_XMIT_DROP;
> + consume_skb(skb);
> + return NULL;
> + }
> +
>   if (*pt_prev) {
>   *ret = deliver_skb(skb, *pt_prev, orig_dev);
>   *pt_prev = NULL;
> @@ -4696,12 +4738,14 @@ sch_handle_ingress(struct sk_buff *skb, struct 
> packet_type **pt_prev, int *ret,
>   case TC_ACT_SHOT:
> 

Re: [PATCH v2] bpf: sockmap, fix use after free from sleep in psock backlog workqueue

2019-05-24 Thread Daniel Borkmann
On 05/24/2019 05:01 PM, John Fastabend wrote:
> Backlog work for psock (sk_psock_backlog) might sleep while waiting
> for memory to free up when sending packets. However, while sleeping
> the socket may be closed and removed from the map by the user space
> side.
> 
> This breaks an assumption in sk_stream_wait_memory, which expects the
> wait queue to still be there when it wakes up, resulting in the
> use-after-free shown below. To fix this, mark sendmsg as MSG_DONTWAIT
> to avoid the sleep altogether. We already set the flag for the
> sendpage case but we missed the case where sendmsg is used.
> Sockmap is currently the only user of skb_send_sock_locked() so only
> the sockmap paths should be impacted.
> 
> ==
> BUG: KASAN: use-after-free in remove_wait_queue+0x31/0x70
> Write of size 8 at addr 888069a0c4e8 by task kworker/0:2/110
> 
> CPU: 0 PID: 110 Comm: kworker/0:2 Not tainted 
> 5.0.0-rc2-00335-g28f9d1a3d4fe-dirty #14
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 
> 04/01/2014
> Workqueue: events sk_psock_backlog
> Call Trace:
>  print_address_description+0x6e/0x2b0
>  ? remove_wait_queue+0x31/0x70
>  kasan_report+0xfd/0x177
>  ? remove_wait_queue+0x31/0x70
>  ? remove_wait_queue+0x31/0x70
>  remove_wait_queue+0x31/0x70
>  sk_stream_wait_memory+0x4dd/0x5f0
>  ? sk_stream_wait_close+0x1b0/0x1b0
>  ? wait_woken+0xc0/0xc0
>  ? tcp_current_mss+0xc5/0x110
>  tcp_sendmsg_locked+0x634/0x15d0
>  ? tcp_set_state+0x2e0/0x2e0
>  ? __kasan_slab_free+0x1d1/0x230
>  ? kmem_cache_free+0x70/0x140
>  ? sk_psock_backlog+0x40c/0x4b0
>  ? process_one_work+0x40b/0x660
>  ? worker_thread+0x82/0x680
>  ? kthread+0x1b9/0x1e0
>  ? ret_from_fork+0x1f/0x30
>  ? check_preempt_curr+0xaf/0x130
>  ? iov_iter_kvec+0x5f/0x70
[...]

Applied, thanks!


Re: [PATCH bpf-next v2 1/3] bpf: implement bpf_send_signal() helper

2019-05-24 Thread Daniel Borkmann
On 05/24/2019 01:54 AM, Yonghong Song wrote:
> 
> 
> On 5/23/19 4:08 PM, Daniel Borkmann wrote:
>> On 05/23/2019 11:30 PM, Yonghong Song wrote:
>>> On 5/23/19 2:07 PM, Yonghong Song wrote:
>>>> On 5/23/19 9:28 AM, Daniel Borkmann wrote:
>>>>> On 05/23/2019 05:58 PM, Yonghong Song wrote:
>>>>>> On 5/23/19 8:41 AM, Daniel Borkmann wrote:
>>>>>>> On 05/22/2019 07:39 AM, Yonghong Song wrote:
>>>>>>>> This patch tries to solve the following specific use case.
>>>>>>>>
>>>>>>>> Currently, bpf program can already collect stack traces
>>>>>>>> through kernel function get_perf_callchain()
>>>>>>>> when certain events happen (e.g., cache miss counter or
>>>>>>>> cpu clock counter overflows). But such stack traces are
>>>>>>>> not enough for jitted programs, e.g., hhvm (jited php).
>>>>>>>> To get real stack trace, jit engine internal data structures
>>>>>>>> need to be traversed in order to get the real user functions.
>>>>>>>>
>>>>>>>> bpf program itself may not be the best place to traverse
>>>>>>>> the jit engine as the traversing logic could be complex and
>>>>>>>> it is not a stable interface either.
>>>>>>>>
>>>>>>>> Instead, hhvm implements a signal handler,
>>>>>>>> e.g. for SIGALARM, and a set of program locations which
>>>>>>>> it can dump stack traces. When it receives a signal, it will
>>>>>>>> dump the stack at the next such program location.
>>>>>>>>
>>>>>>>> Such a mechanism can be implemented in the following way:
>>>>>>>>   . a perf ring buffer is created between bpf program
>>>>>>>> and tracing app.
>>>>>>>>   . once a particular event happens, bpf program writes
>>>>>>>> to the ring buffer and the tracing app gets notified.
>>>>>>>>   . the tracing app sends a signal SIGALARM to the hhvm.
>>>>>>>>
>>>>>>>> But this method could have large delays, causing skewed
>>>>>>>> profiling results.
>>>>>>>>
>>>>>>>> This patch implements bpf_send_signal() helper to send
>>>>>>>> a signal to hhvm in real time, resulting in intended stack traces.
>>>>>>>>
>>>>>>>> Signed-off-by: Yonghong Song 
>>>>>>>> ---
>>>>>>>>  include/uapi/linux/bpf.h | 17 +-
>>>>>>>>  kernel/trace/bpf_trace.c | 67 
>>>>>>>> 
>>>>>>>>  2 files changed, 83 insertions(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>>>>>>> index 63e0cf66f01a..68d4470523a0 100644
>>>>>>>> --- a/include/uapi/linux/bpf.h
>>>>>>>> +++ b/include/uapi/linux/bpf.h
>>>>>>>> @@ -2672,6 +2672,20 @@ union bpf_attr {
>>>>>>>>   *0 on success.
>>>>>>>>   *
>>>>>>>>   ***-ENOENT** if the bpf-local-storage cannot be 
>>>>>>>> found.
>>>>>>>> + *
>>>>>>>> + * int bpf_send_signal(u32 sig)
>>>>>>>> + *Description
>>>>>>>> + *Send signal *sig* to the current task.
>>>>>>>> + *Return
>>>>>>>> + *0 on success or successfully queued.
>>>>>>>> + *
>>>>>>>> + ***-EBUSY** if work queue under nmi is full.
>>>>>>>> + *
>>>>>>>> + ***-EINVAL** if *sig* is invalid.
>>>>>>>> + *
>>>>>>>> + ***-EPERM** if no permission to send the *sig*.
>>>>>>>> + *
>>>>>>>> + ***-EAGAIN** if bpf program can try again.
>>>>>>>>   */
>>>>>>>>  #define __BPF_FUNC_MAPPER(FN) \
>>>>>>>> 

Re: [PATCH bpf-next v5 0/3] bpf: implement bpf_send_signal() helper

2019-05-24 Thread Daniel Borkmann
On 05/23/2019 11:47 PM, Yonghong Song wrote:
> This patch tries to solve the following specific use case. 
> 
> Currently, bpf program can already collect stack traces
> through kernel function get_perf_callchain()
> when certain events happen (e.g., cache miss counter or
> cpu clock counter overflows). But such stack traces are 
> not enough for jitted programs, e.g., hhvm (jited php).
> To get real stack trace, jit engine internal data structures
> need to be traversed in order to get the real user functions.
> 
> bpf program itself may not be the best place to traverse 
> the jit engine as the traversing logic could be complex and
> it is not a stable interface either.
> 
> Instead, hhvm implements a signal handler,
> e.g. for SIGALARM, and a set of program locations which 
> it can dump stack traces. When it receives a signal, it will
> dump the stack at the next such program location.
> 
> This patch implements bpf_send_signal() helper to send
> a signal to hhvm in real time, resulting in intended stack traces. 
> 
> Patch #1 implemented the bpf_send_helper() in the kernel.
> Patch #2 synced uapi header bpf.h to tools directory.
> Patch #3 added a self test which covers tracepoint
> and perf_event bpf programs. 
> 
> Changelogs:
>   v4 => v5:
> . pass the "current" task struct to irq_work as well
>   since the current task struct may change between
>   nmi and subsequent irq_work_interrupt.
>   Discovered by Daniel.
>   v3 => v4:
> . fix one typo and declare "const char *id_path = ..."
>   to avoid directly use the long string in the func body 
>   in Patch #3.
>   v2 => v3:
> . change the standalone test to be part of prog_tests.
>   RFC v1 => v2:
> . previous version allows to send signal to an arbitrary 
>   pid. This version just sends the signal to current
>   task to avoid unstable pid and potential races between
>   sending signals and task state changes for the pid.
> 
> Yonghong Song (3):
>   bpf: implement bpf_send_signal() helper
>   tools/bpf: sync bpf uapi header bpf.h to tools directory
>   tools/bpf: add selftest in test_progs for bpf_send_signal() helper
> 
>  include/uapi/linux/bpf.h  |  17 +-
>  kernel/trace/bpf_trace.c  |  72 +++
>  tools/include/uapi/linux/bpf.h|  17 +-
>  tools/testing/selftests/bpf/bpf_helpers.h |   1 +
>  .../selftests/bpf/prog_tests/send_signal.c| 198 ++
>  .../bpf/progs/test_send_signal_kern.c |  51 +
>  6 files changed, 354 insertions(+), 2 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/send_signal.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_send_signal_kern.c
> 

Applied, thanks. One more remark in patch 1, will reply there.
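
For the archive, a minimal sketch of how a tracing program would use
the new helper (the section name and signal number are illustrative;
the helper always targets the current task):

#include <linux/bpf.h>
#include "bpf_helpers.h"

#define SIGUSR1 10	/* asm-generic value, for illustration */

SEC("tracepoint/syscalls/sys_enter_nanosleep")
int send_sig_on_event(void *ctx)
{
	/* deliver SIGUSR1 to the task that hit the tracepoint */
	bpf_send_signal(SIGUSR1);
	return 0;
}

char _license[] SEC("license") = "GPL";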


Re: [PATCH bpf-next v5 1/3] bpf: implement bpf_send_signal() helper

2019-05-24 Thread Daniel Borkmann
On 05/23/2019 11:47 PM, Yonghong Song wrote:
> This patch tries to solve the following specific use case.
> 
> Currently, bpf program can already collect stack traces
> through kernel function get_perf_callchain()
> when certain events happen (e.g., cache miss counter or
> cpu clock counter overflows). But such stack traces are
> not enough for jitted programs, e.g., hhvm (jited php).
> To get real stack trace, jit engine internal data structures
> need to be traversed in order to get the real user functions.
> 
> bpf program itself may not be the best place to traverse
> the jit engine as the traversing logic could be complex and
> it is not a stable interface either.
> 
> Instead, hhvm implements a signal handler,
> e.g. for SIGALARM, and a set of program locations which
> it can dump stack traces. When it receives a signal, it will
> dump the stack at the next such program location.
> 
> Such a mechanism can be implemented in the following way:
>   . a perf ring buffer is created between bpf program
> and tracing app.
>   . once a particular event happens, bpf program writes
> to the ring buffer and the tracing app gets notified.
>   . the tracing app sends a signal SIGALARM to the hhvm.
> 
> But this method could have large delays, causing skewed
> profiling results.
> 
> This patch implements bpf_send_signal() helper to send
> a signal to hhvm in real time, resulting in intended stack traces.
> 
> Acked-by: Andrii Nakryiko 
> Signed-off-by: Yonghong Song 
> ---
>  include/uapi/linux/bpf.h | 17 +-
>  kernel/trace/bpf_trace.c | 72 
>  2 files changed, 88 insertions(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 63e0cf66f01a..68d4470523a0 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -2672,6 +2672,20 @@ union bpf_attr {
>   *   0 on success.
>   *
>   *   **-ENOENT** if the bpf-local-storage cannot be found.
> + *
> + * int bpf_send_signal(u32 sig)
> + *   Description
> + *   Send signal *sig* to the current task.
> + *   Return
> + *   0 on success or successfully queued.
> + *
> + *   **-EBUSY** if work queue under nmi is full.
> + *
> + *   **-EINVAL** if *sig* is invalid.
> + *
> + *   **-EPERM** if no permission to send the *sig*.
> + *
> + *   **-EAGAIN** if bpf program can try again.
>   */
>  #define __BPF_FUNC_MAPPER(FN)\
>   FN(unspec), \
> @@ -2782,7 +2796,8 @@ union bpf_attr {
>   FN(strtol), \
>   FN(strtoul),\
>   FN(sk_storage_get), \
> - FN(sk_storage_delete),
> + FN(sk_storage_delete),  \
> + FN(send_signal),
>  
>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>   * function eBPF program intends to call
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index f92d6ad5e080..70029eafc71f 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -567,6 +567,63 @@ static const struct bpf_func_proto 
> bpf_probe_read_str_proto = {
>   .arg3_type  = ARG_ANYTHING,
>  };
>  
> +struct send_signal_irq_work {
> + struct irq_work irq_work;
> + struct task_struct *task;
> + u32 sig;
> +};
> +
> +static DEFINE_PER_CPU(struct send_signal_irq_work, send_signal_work);
> +
> +static void do_bpf_send_signal(struct irq_work *entry)
> +{
> + struct send_signal_irq_work *work;
> +
> + work = container_of(entry, struct send_signal_irq_work, irq_work);
> + group_send_sig_info(work->sig, SEND_SIG_PRIV, work->task, PIDTYPE_TGID);
> +}
> +
> +BPF_CALL_1(bpf_send_signal, u32, sig)
> +{
> + struct send_signal_irq_work *work = NULL;
> +
> + /* Similar to bpf_probe_write_user, task needs to be
> +  * in a sound condition and kernel memory access be
> +  * permitted in order to send signal to the current
> +  * task.
> +  */
> + if (unlikely(current->flags & (PF_KTHREAD | PF_EXITING)))
> + return -EPERM;
> + if (unlikely(uaccess_kernel()))
> + return -EPERM;
> + if (unlikely(!nmi_uaccess_okay()))
> + return -EPERM;
> +
> + if (in_nmi()) {
> + work = this_cpu_ptr(&send_signal_work);
> + if (work->irq_work.flags & IRQ_WORK_BUSY)

Given that here and in stackmap are the only two users outside of
kernel/irq_work.c, it would probably be good to add a small helper to
include/linux/irq_work.h and use it for both.

Perhaps something like ...

static inline bool irq_work_busy(struct irq_work *work)
{
return READ_ONCE(work->flags) & IRQ_WORK_BUSY;
}

> + return -EBUSY;
> +
> + /* Add the current task, which is the target of sending signal,
> +  * to the irq_work. The current task may change when queued
> +  * irq works get executed.
> +  */
> 

Re: [PATCH bpf-next v5 1/3] bpf: implement bpf_send_signal() helper

2019-05-24 Thread Daniel Borkmann
On 05/24/2019 11:39 PM, Daniel Borkmann wrote:
> On 05/23/2019 11:47 PM, Yonghong Song wrote:
>> This patch tries to solve the following specific use case.
>>
>> Currently, bpf program can already collect stack traces
>> through kernel function get_perf_callchain()
>> when certain events happen (e.g., cache miss counter or
>> cpu clock counter overflows). But such stack traces are
>> not enough for jitted programs, e.g., hhvm (jited php).
>> To get real stack trace, jit engine internal data structures
>> need to be traversed in order to get the real user functions.
>>
>> bpf program itself may not be the best place to traverse
>> the jit engine as the traversing logic could be complex and
>> it is not a stable interface either.
>>
>> Instead, hhvm implements a signal handler,
>> e.g. for SIGALARM, and a set of program locations which
>> it can dump stack traces. When it receives a signal, it will
>> dump the stack at the next such program location.
>>
>> Such a mechanism can be implemented in the following way:
>>   . a perf ring buffer is created between bpf program
>> and tracing app.
>>   . once a particular event happens, bpf program writes
>> to the ring buffer and the tracing app gets notified.
>>   . the tracing app sends a signal SIGALARM to the hhvm.
>>
>> But this method could have large delays, causing skewed
>> profiling results.
>>
>> This patch implements bpf_send_signal() helper to send
>> a signal to hhvm in real time, resulting in intended stack traces.
>>
>> Acked-by: Andrii Nakryiko 
>> Signed-off-by: Yonghong Song 
>> ---
>>  include/uapi/linux/bpf.h | 17 +-
>>  kernel/trace/bpf_trace.c | 72 
>>  2 files changed, 88 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>> index 63e0cf66f01a..68d4470523a0 100644
>> --- a/include/uapi/linux/bpf.h
>> +++ b/include/uapi/linux/bpf.h
>> @@ -2672,6 +2672,20 @@ union bpf_attr {
>>   *  0 on success.
>>   *
>>   *  **-ENOENT** if the bpf-local-storage cannot be found.
>> + *
>> + * int bpf_send_signal(u32 sig)
>> + *  Description
>> + *  Send signal *sig* to the current task.
>> + *  Return
>> + *  0 on success or successfully queued.
>> + *
>> + *  **-EBUSY** if work queue under nmi is full.
>> + *
>> + *  **-EINVAL** if *sig* is invalid.
>> + *
>> + *  **-EPERM** if no permission to send the *sig*.
>> + *
>> + *  **-EAGAIN** if bpf program can try again.
>>   */
>>  #define __BPF_FUNC_MAPPER(FN)   \
>>  FN(unspec), \
>> @@ -2782,7 +2796,8 @@ union bpf_attr {
>>  FN(strtol), \
>>  FN(strtoul),\
>>  FN(sk_storage_get), \
>> -FN(sk_storage_delete),
>> +FN(sk_storage_delete),  \
>> +FN(send_signal),
>>  
>>  /* integer value in 'imm' field of BPF_CALL instruction selects which helper
>>   * function eBPF program intends to call
>> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
>> index f92d6ad5e080..70029eafc71f 100644
>> --- a/kernel/trace/bpf_trace.c
>> +++ b/kernel/trace/bpf_trace.c
>> @@ -567,6 +567,63 @@ static const struct bpf_func_proto 
>> bpf_probe_read_str_proto = {
>>  .arg3_type  = ARG_ANYTHING,
>>  };
>>  
>> +struct send_signal_irq_work {
>> +struct irq_work irq_work;
>> +struct task_struct *task;
>> +u32 sig;
>> +};
>> +
>> +static DEFINE_PER_CPU(struct send_signal_irq_work, send_signal_work);
>> +
>> +static void do_bpf_send_signal(struct irq_work *entry)
>> +{
>> +struct send_signal_irq_work *work;
>> +
>> +work = container_of(entry, struct send_signal_irq_work, irq_work);
>> +group_send_sig_info(work->sig, SEND_SIG_PRIV, work->task, PIDTYPE_TGID);
>> +}
>> +
>> +BPF_CALL_1(bpf_send_signal, u32, sig)
>> +{
>> +struct send_signal_irq_work *work = NULL;
>> +

Oh, and one more thing:

if (!valid_signal(sig))
return -EINVAL;

Otherwise when deferring the work, you don't have any such feedback.

>> +/* Similar to bpf_probe_write_user, task needs to be
>> + * in a sound condition and kernel memory access be
>> + * permitted in order to send signal to the current
>> + * task.
>> + */
>> + 

Re: [PATCH bpf-next v5 1/3] bpf: implement bpf_send_signal() helper

2019-05-24 Thread Daniel Borkmann
On 05/25/2019 12:20 AM, Yonghong Song wrote:
> On 5/24/19 2:39 PM, Daniel Borkmann wrote:
>> On 05/23/2019 11:47 PM, Yonghong Song wrote:
>>> This patch tries to solve the following specific use case.
>>>
>>> Currently, bpf program can already collect stack traces
>>> through kernel function get_perf_callchain()
>>> when certain events happen (e.g., cache miss counter or
>>> cpu clock counter overflows). But such stack traces are
>>> not enough for jitted programs, e.g., hhvm (jited php).
>>> To get real stack trace, jit engine internal data structures
>>> need to be traversed in order to get the real user functions.
>>>
>>> bpf program itself may not be the best place to traverse
>>> the jit engine as the traversing logic could be complex and
>>> it is not a stable interface either.
>>>
>>> Instead, hhvm implements a signal handler,
>>> e.g. for SIGALARM, and a set of program locations which
>>> it can dump stack traces. When it receives a signal, it will
>>> dump the stack at the next such program location.
>>>
>>> Such a mechanism can be implemented in the following way:
>>>. a perf ring buffer is created between bpf program
>>>  and tracing app.
>>>. once a particular event happens, bpf program writes
>>>  to the ring buffer and the tracing app gets notified.
>>>. the tracing app sends a signal SIGALARM to the hhvm.
>>>
>>> But this method could have large delays, causing skewed
>>> profiling results.
>>>
>>> This patch implements bpf_send_signal() helper to send
>>> a signal to hhvm in real time, resulting in intended stack traces.
>>>
>>> Acked-by: Andrii Nakryiko 
>>> Signed-off-by: Yonghong Song 
>>> ---
>>>   include/uapi/linux/bpf.h | 17 +-
>>>   kernel/trace/bpf_trace.c | 72 
>>>   2 files changed, 88 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>> index 63e0cf66f01a..68d4470523a0 100644
>>> --- a/include/uapi/linux/bpf.h
>>> +++ b/include/uapi/linux/bpf.h
>>> @@ -2672,6 +2672,20 @@ union bpf_attr {
>>>*0 on success.
>>>*
>>>***-ENOENT** if the bpf-local-storage cannot be found.
>>> + *
>>> + * int bpf_send_signal(u32 sig)
>>> + * Description
>>> + * Send signal *sig* to the current task.
>>> + * Return
>>> + * 0 on success or successfully queued.
>>> + *
>>> + * **-EBUSY** if work queue under nmi is full.
>>> + *
>>> + * **-EINVAL** if *sig* is invalid.
>>> + *
>>> + * **-EPERM** if no permission to send the *sig*.
>>> + *
>>> + * **-EAGAIN** if bpf program can try again.
>>>*/
>>>   #define __BPF_FUNC_MAPPER(FN) \
>>> FN(unspec), \
>>> @@ -2782,7 +2796,8 @@ union bpf_attr {
>>> FN(strtol), \
>>> FN(strtoul),\
>>> FN(sk_storage_get), \
>>> -   FN(sk_storage_delete),
>>> +   FN(sk_storage_delete),  \
>>> +   FN(send_signal),
>>>   
>>>   /* integer value in 'imm' field of BPF_CALL instruction selects which 
>>> helper
>>>* function eBPF program intends to call
>>> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
>>> index f92d6ad5e080..70029eafc71f 100644
>>> --- a/kernel/trace/bpf_trace.c
>>> +++ b/kernel/trace/bpf_trace.c
>>> @@ -567,6 +567,63 @@ static const struct bpf_func_proto 
>>> bpf_probe_read_str_proto = {
>>> .arg3_type  = ARG_ANYTHING,
>>>   };
>>>   
>>> +struct send_signal_irq_work {
>>> +   struct irq_work irq_work;
>>> +   struct task_struct *task;
>>> +   u32 sig;
>>> +};
>>> +
>>> +static DEFINE_PER_CPU(struct send_signal_irq_work, send_signal_work);
>>> +
>>> +static void do_bpf_send_signal(struct irq_work *entry)
>>> +{
>>> +   struct send_signal_irq_work *work;
>>> +
>>> +   work = container_of(entry, struct send_signal_irq_work, irq_work);
>>> +   group_send_sig_info(work->sig, SEND_SIG_PRIV, work->task, PIDTYPE_TGID);
>>> +}
>>> +
>>> +BPF_CALL_1(bpf_send_signal, u32, sig)
>>> +{
>>> +

Re: [PATCH bpf-next v5 1/3] bpf: implement bpf_send_signal() helper

2019-05-24 Thread Daniel Borkmann
On 05/25/2019 12:23 AM, Yonghong Song wrote:
> On 5/24/19 2:59 PM, Daniel Borkmann wrote:
>> On 05/24/2019 11:39 PM, Daniel Borkmann wrote:
>>> On 05/23/2019 11:47 PM, Yonghong Song wrote:
>>>> This patch tries to solve the following specific use case.
>>>>
>>>> Currently, bpf program can already collect stack traces
>>>> through kernel function get_perf_callchain()
>>>> when certain events happen (e.g., cache miss counter or
>>>> cpu clock counter overflows). But such stack traces are
>>>> not enough for jitted programs, e.g., hhvm (jited php).
>>>> To get real stack trace, jit engine internal data structures
>>>> need to be traversed in order to get the real user functions.
>>>>
>>>> bpf program itself may not be the best place to traverse
>>>> the jit engine as the traversing logic could be complex and
>>>> it is not a stable interface either.
>>>>
>>>> Instead, hhvm implements a signal handler,
>>>> e.g. for SIGALARM, and a set of program locations which
>>>> it can dump stack traces. When it receives a signal, it will
>>>> dump the stack at the next such program location.
>>>>
>>>> Such a mechanism can be implemented in the following way:
>>>>. a perf ring buffer is created between bpf program
>>>>  and tracing app.
>>>>. once a particular event happens, bpf program writes
>>>>  to the ring buffer and the tracing app gets notified.
>>>>. the tracing app sends a signal SIGALARM to the hhvm.
>>>>
>>>> But this method could have large delays, causing skewed
>>>> profiling results.
>>>>
>>>> This patch implements bpf_send_signal() helper to send
>>>> a signal to hhvm in real time, resulting in intended stack traces.
>>>>
>>>> Acked-by: Andrii Nakryiko 
>>>> Signed-off-by: Yonghong Song 
>>>> ---
>>>>   include/uapi/linux/bpf.h | 17 +-
>>>>   kernel/trace/bpf_trace.c | 72 
>>>>   2 files changed, 88 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>>> index 63e0cf66f01a..68d4470523a0 100644
>>>> --- a/include/uapi/linux/bpf.h
>>>> +++ b/include/uapi/linux/bpf.h
>>>> @@ -2672,6 +2672,20 @@ union bpf_attr {
>>>>*   0 on success.
>>>>*
>>>>*   **-ENOENT** if the bpf-local-storage cannot be found.
>>>> + *
>>>> + * int bpf_send_signal(u32 sig)
>>>> + *    Description
>>>> + *        Send signal *sig* to the current task.
>>>> + *    Return
>>>> + *        0 on success or successfully queued.
>>>> + *
>>>> + *        **-EBUSY** if work queue under nmi is full.
>>>> + *
>>>> + *        **-EINVAL** if *sig* is invalid.
>>>> + *
>>>> + *        **-EPERM** if no permission to send the *sig*.
>>>> + *
>>>> + *        **-EAGAIN** if bpf program can try again.
>>>>*/
>>>>   #define __BPF_FUNC_MAPPER(FN)\
>>>>FN(unspec), \
>>>> @@ -2782,7 +2796,8 @@ union bpf_attr {
>>>>FN(strtol), \
>>>>FN(strtoul),\
>>>>FN(sk_storage_get), \
>>>> -  FN(sk_storage_delete),
>>>> +  FN(sk_storage_delete),  \
>>>> +  FN(send_signal),
>>>>   
>>>>   /* integer value in 'imm' field of BPF_CALL instruction selects which 
>>>> helper
>>>>* function eBPF program intends to call
>>>> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
>>>> index f92d6ad5e080..70029eafc71f 100644
>>>> --- a/kernel/trace/bpf_trace.c
>>>> +++ b/kernel/trace/bpf_trace.c
>>>> @@ -567,6 +567,63 @@ static const struct bpf_func_proto 
>>>> bpf_probe_read_str_proto = {
>>>>.arg3_type  = ARG_ANYTHING,
>>>>   };
>>>>   
>>>> +struct send_signal_irq_work {
>>>> +  struct irq_work irq_work;
>>>> +  struct task_struct *task;
>>>> +  u32 sig;
>>>> +};
>>>> +
>>>> +static DEFINE_PER_CPU(struct send_signal_irq_work, send_signal_work);
>>>> +
>>>> +static void do_bpf_send_signal(struct irq_work *entry)
>>>> +{
>>>> +  struct send_signal_irq_work *work;
>>>> +
>>>> +  work = container_of(entry, struct send_signal_irq_work, irq_work);
>>>> +  group_send_sig_info(work->sig, SEND_SIG_PRIV, work->task, PIDTYPE_TGID);
>>>> +}
>>>> +
>>>> +BPF_CALL_1(bpf_send_signal, u32, sig)
>>>> +{
>>>> +  struct send_signal_irq_work *work = NULL;
>>>> +
>>
>> Oh, and one more thing:
>>
>>  if (!valid_signal(sig))
>>  return -EINVAL;
>>
>> Otherwise when deferring the work, you don't have any such feedback.
> 
> Good advice! Do you want me send a followup patch or
> resend the whole series?

Let's do follow-up.
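
For a sense of how the helper is meant to be consumed, a minimal sketch
of a kernel-side sampling program. The section name, the signal number
and the availability of a bpf_send_signal() declaration in the
selftests' bpf_helpers.h are assumptions of this example, not part of
the patch:

  #include <linux/bpf.h>
  #include "bpf_helpers.h"

  /* SIGUSR2 is 12 on x86-64; hard-coded here to stay libc-free. */
  #define SIGUSR2_NR 12

  SEC("perf_event")
  int profile_cpu(void *ctx)
  {
          /* 0 on success or successfully queued; -EBUSY, -EINVAL,
           * -EPERM or -EAGAIN otherwise, matching the uapi doc above.
           */
          bpf_send_signal(SIGUSR2_NR);
          return 0;
  }

  char _license[] SEC("license") = "GPL";

The user space side then only needs a SIGUSR2 handler in the profiled
process (hhvm in the use case above) that walks the JIT's own metadata.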


Re: [PATCH bpf-next] bpftool: auto-complete BTF IDs for btf dump

2019-05-28 Thread Daniel Borkmann
On 05/26/2019 02:01 AM, Andrii Nakryiko wrote:
> Auto-complete BTF IDs for the `btf dump id` sub-command. The list of possible
> BTF IDs is scavenged from loaded BPF programs that have associated BTFs, as
> there is currently no API in libbpf to fetch the list of all BTFs in the
> system.
> 
> Suggested-by: Quentin Monnet 
> Signed-off-by: Andrii Nakryiko 

Applied, thanks!

(Please add versioning in the subject in future, e.g. [PATCH v2 bpf-next])


Re: [PATCH bpf-next] bpf: check signal validity in nmi for bpf_send_signal() helper

2019-05-28 Thread Daniel Borkmann
On 05/25/2019 08:57 PM, Yonghong Song wrote:
> Commit 8b401f9ed244 ("bpf: implement bpf_send_signal() helper")
> introduced the bpf_send_signal() helper. If the context is nmi,
> the signal-sending work needs to be deferred to irq_work.
> If the signal is invalid, the error will appear in irq_work
> and won't be propagated to the user.
> 
> This patch adds an early check in the helper itself to
> notify the user of an invalid signal, as suggested by Daniel.
> 
> Signed-off-by: Yonghong Song 

Applied, thanks!
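
The shape of that early check, as a sketch (defer_via_irq_work() is a
hypothetical stand-in for the per-cpu irq_work path from the original
patch, and the task-state checks are elided):

  BPF_CALL_1(bpf_send_signal, u32, sig)
  {
          /* Validate up front so -EINVAL reaches the BPF program even
           * when actual delivery has to be deferred out of nmi context.
           */
          if (unlikely(!valid_signal(sig)))
                  return -EINVAL;

          if (in_nmi())
                  return defer_via_irq_work(sig); /* hypothetical stand-in */

          return group_send_sig_info(sig, SEND_SIG_PRIV, current,
                                     PIDTYPE_TGID);
  }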


Re: [PATCH] [PATCH bpf] style fix in while(!feof()) loop

2019-05-28 Thread Daniel Borkmann
On 05/26/2019 12:32 PM, Chang-Hsien Tsai wrote:
> Use fgets() as the while loop condition.
> 
> Signed-off-by: Chang-Hsien Tsai 

Applied, thanks!


Re: [PATCH bpf-next v3 0/3] tools: bpftool: add an option for debug output from libbpf and verifier

2019-05-28 Thread Daniel Borkmann
On 05/24/2019 12:36 PM, Quentin Monnet wrote:
> Hi,
> This series adds an option to bpftool to make it print additional
> information via libbpf and the kernel verifier when attempting to load
> programs.
> 
> A new API function is added to libbpf in order to pass the log_level from
> bpftool with the bpf_object__* part of the API.
> 
> Options for a finer control over the log levels to use for libbpf and the
> verifier could be added in the future, if desired.
> 
> v3:
> - Fix and clarify commit logs.
> 
> v2:
> - Do not add distinct options for libbpf and verifier logs, just keep the
>   one that sets all log levels to their maximum. Rename the option.
> - Do not offer a way to pick desired log levels. The choice is "use the
>   option to print all logs" or "stick with the defaults".
> - Do not export BPF_LOG_* flags to user header.
> - Update all man pages (most bpftool operations use libbpf and may print
>   libbpf logs). Verifier logs are only used when attempting to load
>   programs for now, so bpftool-prog.rst and bpftool.rst remain the only
>   pages updated in that regard.
> 
> Previous discussion available at:
> https://lore.kernel.org/bpf/20190523105426.3938-1-quentin.mon...@netronome.com/
> https://lore.kernel.org/bpf/20190429095227.9745-1-quentin.mon...@netronome.com/
> 
> Quentin Monnet (3):
>   tools: bpftool: add -d option to get debug output from libbpf
>   libbpf: add bpf_object__load_xattr() API function to pass log_level
>   tools: bpftool: make -d option print debug output from verifier
> 
>  .../bpf/bpftool/Documentation/bpftool-btf.rst |  4 +++
>  .../bpftool/Documentation/bpftool-cgroup.rst  |  4 +++
>  .../bpftool/Documentation/bpftool-feature.rst |  4 +++
>  .../bpf/bpftool/Documentation/bpftool-map.rst |  4 +++
>  .../bpf/bpftool/Documentation/bpftool-net.rst |  4 +++
>  .../bpftool/Documentation/bpftool-perf.rst|  4 +++
>  .../bpftool/Documentation/bpftool-prog.rst|  5 
>  tools/bpf/bpftool/Documentation/bpftool.rst   |  4 +++
>  tools/bpf/bpftool/bash-completion/bpftool |  2 +-
>  tools/bpf/bpftool/main.c  | 16 ++-
>  tools/bpf/bpftool/main.h  |  1 +
>  tools/bpf/bpftool/prog.c  | 27 ---
>  tools/lib/bpf/Makefile|  2 +-
>  tools/lib/bpf/libbpf.c| 20 +++---
>  tools/lib/bpf/libbpf.h|  6 +
>  tools/lib/bpf/libbpf.map  |  5 
>  16 files changed, 96 insertions(+), 16 deletions(-)
> 

Applied, thanks!


Re: [PATCH bpf-next] selftests/bpf: fail test_tunnel.sh if subtests fail

2019-05-28 Thread Daniel Borkmann
On 05/25/2019 12:28 AM, Stanislav Fomichev wrote:
> Right now test_tunnel.sh always exits with success even if some
> of the subtests fail. Since the output is very verbose, it's
> hard to spot the issues with subtests. Let's fail the script
> if any subtest fails.
> 
> Signed-off-by: Stanislav Fomichev 

Applied, thanks!


Re: [PATCH v2] samples: bpf: fix style in bpf_load

2019-05-28 Thread Daniel Borkmann
On 05/23/2019 09:24 AM, Daniel T. Lee wrote:
> This commit fixes style problems in samples/bpf/bpf_load.c
> 
> Styles that have been changed are:
>  - Magic string use of 'DEBUGFS'
>  - Useless zero initialization of a global variable
>  - Minor style fix with whitespace
> 
> Signed-off-by: Daniel T. Lee 

Applied, thanks!


Re: [PATCH bpf 0/2] selftests: bpf: more sub-register zero extension unit tests

2019-05-29 Thread Daniel Borkmann
On 05/29/2019 11:57 AM, Jiong Wang wrote:
> JIT back-ends need to guarantee the high 32 bits are cleared whenever an eBPF
> insn writes only the low 32-bit sub-register. It is possible that some JIT
> back-ends have failed to do this and are silently generating wrong images.
> 
> This set completes the unit tests, so bugs of this kind can be exposed.
> 
> Jiong Wang (2):
>   selftests: bpf: move sub-register zero extension checks into subreg.c
>   selftests: bpf: complete sub-register zero extension checks
> 
>  tools/testing/selftests/bpf/verifier/basic_instr.c |  39 --
>  tools/testing/selftests/bpf/verifier/subreg.c  | 533 
> +
>  2 files changed, 533 insertions(+), 39 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/verifier/subreg.c

Looks good, thanks for following up! Applied, thanks!
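
For readers who want the flavor of these checks, a minimal hand-rolled
variant in the style of the verifier test structs quoted elsewhere in
this digest; this entry is purely illustrative, not one of the tests
actually added by the set:

  {
          "alu32 and: high 32 bits must be cleared (illustrative)",
          .insns = {
                  BPF_LD_IMM64(BPF_REG_0, 0xffffffff00000003ULL),
                  BPF_ALU32_IMM(BPF_AND, BPF_REG_0, 2),   /* w0 &= 2 */
                  /* r0 must now be 2, not 0xffffffff00000002 */
                  BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 2, 2),
                  BPF_MOV64_IMM(BPF_REG_0, 1),            /* fail */
                  BPF_EXIT_INSN(),
                  BPF_MOV64_IMM(BPF_REG_0, 0),            /* pass */
                  BPF_EXIT_INSN(),
          },
          .result = ACCEPT,
          .retval = 0,
  },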


Re: [PATCH bpf-next] libbpf: prevent overwriting of log_level in bpf_object__load_progs()

2019-05-29 Thread Daniel Borkmann
On 05/29/2019 11:23 AM, Quentin Monnet wrote:
> There are two functions in libbpf that support passing a log_level
> parameter for the verifier for loading programs:
> bpf_object__load_xattr() and bpf_load_program_xattr(). Both accept an
> attribute object containing the log_level, and apply it to the programs
> to load.
> 
> It turns out that to effectively load the programs, the latter function
> eventually relies on the former. This was not taken into account when
> adding support for log_level in bpf_object__load_xattr(), and the
> log_level passed to bpf_load_program_xattr() later gets overwritten with
> a zero value, thus disabling verifier logs for the program in all cases:
> 
> bpf_load_program_xattr() // prog.log_level = N;

I'm confused with your commit message. How can bpf_load_program_xattr()
make sense here, this is the one doing the bpf syscall. Do you mean to
say bpf_prog_load_xattr()? Because this one sets prog->log_level = 
attr->log_level
and calls bpf_object__load() which in turn does bpf_object__load_xattr()
with an attr that has attr->log_level of 0 such that bpf_object__load_progs()
then overrides it. Unless I'm missing something, please fix up this
description properly and resubmit.

>  -> bpf_object__load()   // attr.log_level = 0;
>  -> bpf_object__load_xattr() // 
>  -> bpf_object__load_progs() // prog.log_level = attr.log_level;
> 
> Fix this by OR-ing the log_level in bpf_object__load_progs(), instead of
> overwriting it.
> 
> Fixes: 60276f984998 ("libbpf: add bpf_object__load_xattr() API function to 
> pass log_level")
> Reported-by: Alexei Starovoitov 
> Signed-off-by: Quentin Monnet 
> ---
>  tools/lib/bpf/libbpf.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
> index ca4432f5b067..30cb08e2eb75 100644
> --- a/tools/lib/bpf/libbpf.c
> +++ b/tools/lib/bpf/libbpf.c
> @@ -2232,7 +2232,7 @@ bpf_object__load_progs(struct bpf_object *obj, int 
> log_level)
>   for (i = 0; i < obj->nr_programs; i++) {
>   if (bpf_program__is_function_storage(&obj->programs[i], obj))
>   continue;
> - obj->programs[i].log_level = log_level;
> + obj->programs[i].log_level |= log_level;
>   err = bpf_program__load(&obj->programs[i],
>   obj->license,
>   obj->kern_version);
> 



Re: [PATCH bpf-next] selftests/bpf: fix compilation error for flow_dissector.c

2019-05-29 Thread Daniel Borkmann
On 05/29/2019 11:48 AM, Alan Maguire wrote:
> When building the tools/testing/selftest/bpf subdirectory,
> (running both a local directory "make" and a
> "make -C tools/testing/selftests/bpf") I keep hitting the
> following compilation error:
> 
> prog_tests/flow_dissector.c: In function ‘create_tap’:
> prog_tests/flow_dissector.c:150:38: error: ‘IFF_NAPI’ undeclared (first
> use in this function)
>.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_NAPI | IFF_NAPI_FRAGS,
>   ^
> prog_tests/flow_dissector.c:150:38: note: each undeclared identifier is
> reported only once for each function it appears in
> prog_tests/flow_dissector.c:150:49: error: ‘IFF_NAPI_FRAGS’ undeclared
> 
> Adding include/uapi/linux/if_tun.h to tools/include/uapi/linux
> resolves the problem and ensures the compilation of the file
> does not depend on having up-to-date kernel headers locally.
> 
> Signed-off-by: Alan Maguire 

Applied, thanks!


Re: [PATCH bpf v2] selftests: bpf: fix compiler warning

2019-05-29 Thread Daniel Borkmann
On 05/28/2019 09:02 PM, Alakesh Haloi wrote:
> Add missing header file following compiler warning
> 
> prog_tests/flow_dissector.c: In function ‘tx_tap’:
> prog_tests/flow_dissector.c:175:9: warning: implicit declaration of function 
> ‘writev’; did you mean ‘write’? [-Wimplicit-function-declaration]
>   return writev(fd, iov, ARRAY_SIZE(iov));
>  ^~
>  write
> 
> Fixes: 0905beec9f52 ("selftests/bpf: run flow dissector tests in skb-less 
> mode")
> Signed-off-by: Alakesh Haloi 

Applied, thanks!


Re: [PATCH bpf-next v4 1/4] bpf: remove __rcu annotations from bpf_prog_array

2019-05-29 Thread Daniel Borkmann
On 05/28/2019 11:14 PM, Stanislav Fomichev wrote:
> Drop __rcu annotations and rcu read sections from bpf_prog_array
> helper functions. They are not needed since all existing callers
> call those helpers from the rcu update side while holding a mutex.
> This guarantees that use-after-free could not happen.
> 
> In the next patches I'll fix the callers with missing
> rcu_dereference_protected to make sparse/lockdep happy, the proper
> way to use these helpers is:
> 
>   struct bpf_prog_array __rcu *progs = ...;
>   struct bpf_prog_array *p;
> 
>   mutex_lock(&mtx);
>   p = rcu_dereference_protected(progs, lockdep_is_held(&mtx));
>   bpf_prog_array_length(p);
>   bpf_prog_array_copy_to_user(p, ...);
>   bpf_prog_array_delete_safe(p, ...);
>   bpf_prog_array_copy_info(p, ...);
>   bpf_prog_array_copy(p, ...);
>   bpf_prog_array_free(p);
>   mutex_unlock(&mtx);
> 
> No functional changes! rcu_dereference_protected with lockdep_is_held
> should catch any cases where we update prog array without a mutex
> (I've looked at existing call sites and I think we hold a mutex
> everywhere).
> 
> Motivation is to fix sparse warnings:
> kernel/bpf/core.c:1803:9: warning: incorrect type in argument 1 (different 
> address spaces)
> kernel/bpf/core.c:1803:9:expected struct callback_head *head
> kernel/bpf/core.c:1803:9:got struct callback_head [noderef]  *
> kernel/bpf/core.c:1877:44: warning: incorrect type in initializer (different 
> address spaces)
> kernel/bpf/core.c:1877:44:expected struct bpf_prog_array_item *item
> kernel/bpf/core.c:1877:44:got struct bpf_prog_array_item [noderef] 
>  *
> kernel/bpf/core.c:1901:26: warning: incorrect type in assignment (different 
> address spaces)
> kernel/bpf/core.c:1901:26:expected struct bpf_prog_array_item *existing
> kernel/bpf/core.c:1901:26:got struct bpf_prog_array_item [noderef] 
>  *
> kernel/bpf/core.c:1935:26: warning: incorrect type in assignment (different 
> address spaces)
> kernel/bpf/core.c:1935:26:expected struct bpf_prog_array_item *[assigned] 
> existing
> kernel/bpf/core.c:1935:26:got struct bpf_prog_array_item [noderef] 
>  *
> 
> v2:
> * remove comment about potential race; that can't happen
>   because all callers are in rcu-update section
> 
> Cc: Roman Gushchin 
> Acked-by: Roman Gushchin 
> Signed-off-by: Stanislav Fomichev 

Series applied, thanks!


Re: [PATCH bpf-next v2] libbpf: prevent overwriting of log_level in bpf_object__load_progs()

2019-05-29 Thread Daniel Borkmann
On 05/29/2019 04:26 PM, Quentin Monnet wrote:
> There are two functions in libbpf that support passing a log_level
> parameter for the verifier for loading programs:
> bpf_object__load_xattr() and bpf_prog_load_xattr(). Both accept an
> attribute object containing the log_level, and apply it to the programs
> to load.
> 
> It turns out that to effectively load the programs, the latter function
> eventually relies on the former. This was not taken into account when
> adding support for log_level in bpf_object__load_xattr(), and the
> log_level passed to bpf_prog_load_xattr() later gets overwritten with a
> zero value, thus disabling verifier logs for the program in all cases:
> 
> bpf_prog_load_xattr() // prog->log_level = attr1->log_level;
> -> bpf_object__load() // attr2->log_level = 0;
>-> bpf_object__load_xattr()// 
>   -> bpf_object__load_progs() // prog->log_level = attr2->log_level;
> 
> Fix this by OR-ing the log_level in bpf_object__load_progs(), instead of
> overwriting it.
> 
> v2: Fix commit log description (confusion on function names in v1).
> 
> Fixes: 60276f984998 ("libbpf: add bpf_object__load_xattr() API function to 
> pass log_level")
> Reported-by: Alexei Starovoitov 
> Signed-off-by: Quentin Monnet 

That's better, applied, thanks!
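
With the fix in place, a caller can set the log level through either
entry point and it survives down to program load. A minimal caller-side
sketch ("prog.o" is a placeholder object file for the example):

  #include <bpf/libbpf.h>

  int load_with_logs(void)
  {
          struct bpf_object *obj = bpf_object__open("prog.o");
          struct bpf_object_load_attr attr = {
                  .obj = obj,
                  .log_level = 1 | 2 | 4, /* all verifier log levels */
          };

          if (libbpf_get_error(obj))
                  return -1;
          return bpf_object__load_xattr(&attr);
  }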


Re: [PATCH v2 bpf-next 0/9] libbpf random fixes

2019-05-29 Thread Daniel Borkmann
On 05/29/2019 07:36 PM, Andrii Nakryiko wrote:
> This patch set is a collection of unrelated fixes for libbpf.
> 
> Patch #1 fixes detection of corrupted BPF section w/ instructions.
> Patch #2 fixes possible errno clobbering.
> Patch #3 simplifies endianness check and brings it in line with few other
> similar checks in libbpf.
> Patch #4 adds check for failed map name retrieval from ELF symbol name.
> Patch #5 fixes return error code to be negative.
> Patch #6 fixes using valid fd (0) as a marker of missing associated BTF.
> Patch #7 removes redundant logic in two places.
> Patch #8 fixes typos in comments and debug output, and fixes formatting.
> Patch #9 unwraps a bunch of multi-line statements and comments.
> 
> v1->v2:
>   - patch #1 simplifications (Song);
> 
> 
> Andrii Nakryiko (9):
>   libbpf: fix detection of corrupted BPF instructions section
>   libbpf: preserve errno before calling into user callback
>   libbpf: simplify endianness check
>   libbpf: check map name retrieved from ELF
>   libbpf: fix error code returned on corrupted ELF
>   libbpf: use negative fd to specify missing BTF
>   libbpf: simplify two pieces of logic
>   libbpf: typo and formatting fixes
>   libbpf: reduce unnecessary line wrapping
> 
>  tools/lib/bpf/libbpf.c | 148 +
>  1 file changed, 60 insertions(+), 88 deletions(-)
> 

Applied, thanks!


Re: [PATCH bpf-next v2 1/2] net: xdp: refactor XDP_QUERY_PROG{,_HW} to netdev

2019-06-03 Thread Daniel Borkmann
On 06/03/2019 10:39 AM, Björn Töpel wrote:
> On Sat, 1 Jun 2019 at 20:12, Jonathan Lemon  wrote:
>> On 31 May 2019, at 2:42, Björn Töpel wrote:
>>> From: Björn Töpel 
>>>
>>> All XDP capable drivers need to implement the XDP_QUERY_PROG{,_HW}
>>> command of ndo_bpf. The query code is fairly generic. This commit
>>> refactors the query code up from the drivers to the netdev level.
>>>
>>> The struct net_device has gained two new members: xdp_prog_hw and
>>> xdp_flags. The former is the offloaded XDP program, if any, and the
>>> latter tracks the flags that the supplied when attaching the XDP
>>> program. The flags only apply to SKB_MODE or DRV_MODE, not HW_MODE.
>>>
>>> The xdp_prog member, previously only used for SKB_MODE, is shared with
>>> DRV_MODE. This is OK, due to the fact that SKB_MODE and DRV_MODE are
>>> mutually exclusive. To differentiate between the two modes, a new
>>> internal flag is introduced as well.
>>
>> I'm not entirely clear why this new flag is needed - GENERIC seems to
>> be an alias for SKB_MODE, so why just use SKB_MODE directly?
>>
>> If the user does not explicitly specify a type (skb|drv|hw), then the
>> command should choose the correct type and then behave as if this type
>> was specified.
> 
> Yes, this is kind of hairy.
> 
> SKB and DRV are mutually exclusive, but HW is not. IOW, valid options are:
> SKB, DRV, HW, SKB+HW DRV+HW.

Correct, HW is a bit special here in that it helps offload parts of
the DRV XDP program to the NIC, but also does RSS steering in BPF etc., hence
this combo is intentionally allowed (see also git log).

> What complicates things further is that SKB and DRV can be implicitly
> (auto/no flags) or explicitly enabled (flags).

Mainly out of historic context: originally the fallback to SKB mode was
implicit if the ndo_bpf was missing. But there are use cases where we
want to fail if the driver does not support native XDP to avoid surprises.

> If a user doesn't pass any flags, the "best supported mode" should be
> selected. If this "auto mode" is used, it should be seen as a third
> mode. E.g.
> 
> ip link set dev eth0 xdp on -- OK
> ip link set dev eth0 xdp off -- OK
> 
> ip link set dev eth0 xdp on -- OK # generic auto selected
> ip link set dev eth0 xdpgeneric off -- NOK, bad flags

This would work if the auto selection would have selected XDP generic.

> ip link set dev eth0 xdp on -- OK # drv auto selected
> ip link set dev eth0 xdpdrv off -- NOK, bad flags

This would work if the auto selection chose native XDP previously. Are
you saying it's not the case?

Also, what is the use case in mixing these commands? It should be xdp
on+off, xdpdrv on+off, and so on. Are you saying you would prefer an
xdp{,any} off that uninstalls everything? Isn't this mostly a user space
issue for whatever orchestrates XDP?

> ...and so on. The idea is that a user should use the same set of flags always.
> 
> The internal "GENERIC" flag is only to determine if the xdp_prog
> represents a DRV version or SKB version. Maybe it would be clearer
> just to add an additional xdp_prog_drv to the net_device, instead?
> 
>> The logic in dev_change_xdp_fd() is too complicated.  It disallows
>> setting (drv|skb), but allows (hw|skb), which I'm not sure is
>> intentional.
>>
>> It should be clearer as to which combinations are allowed.
> 
> Fair point. I'll try to clean it up further.
> 


Re: [PATCH v4 bpf-next 1/2] bpf: Allow bpf_map_lookup_elem() on an xskmap

2019-06-04 Thread Daniel Borkmann
On 06/03/2019 06:38 PM, Jonathan Lemon wrote:
> Currently, the AF_XDP code uses a separate map in order to
> determine if an xsk is bound to a queue.  Instead of doing this,
> have bpf_map_lookup_elem() return the queue_id, as a way of
> indicating that there is a valid entry at the map index.
> 
> Rearrange some xdp_sock members to eliminate structure holes.
> 
> Signed-off-by: Jonathan Lemon 
> Acked-by: Song Liu 
> Acked-by: Björn Töpel 
> ---
>  include/net/xdp_sock.h|  6 +++---
>  kernel/bpf/verifier.c |  6 +-
>  kernel/bpf/xskmap.c   |  4 +++-
>  .../selftests/bpf/verifier/prevent_map_lookup.c   | 15 ---
>  4 files changed, 11 insertions(+), 20 deletions(-)
> 
> diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> index d074b6d60f8a..7d84b1da43d2 100644
> --- a/include/net/xdp_sock.h
> +++ b/include/net/xdp_sock.h
> @@ -57,12 +57,12 @@ struct xdp_sock {
>   struct net_device *dev;
>   struct xdp_umem *umem;
>   struct list_head flush_node;
> - u16 queue_id;
> - struct xsk_queue *tx cacheline_aligned_in_smp;
> - struct list_head list;
> + u32 queue_id;
>   bool zc;
>   /* Protects multiple processes in the control path */
>   struct mutex mutex;
> + struct xsk_queue *tx cacheline_aligned_in_smp;
> + struct list_head list;
>   /* Mutual exclusion of NAPI TX thread and sendmsg error paths
>* in the SKB destructor callback.
>*/
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 2778417e6e0c..91c730f85e92 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -2905,10 +2905,14 @@ static int check_map_func_compatibility(struct 
> bpf_verifier_env *env,
>* appear.
>*/
>   case BPF_MAP_TYPE_CPUMAP:
> - case BPF_MAP_TYPE_XSKMAP:
>   if (func_id != BPF_FUNC_redirect_map)
>   goto error;
>   break;
> + case BPF_MAP_TYPE_XSKMAP:
> + if (func_id != BPF_FUNC_redirect_map &&
> + func_id != BPF_FUNC_map_lookup_elem)
> + goto error;
> + break;
>   case BPF_MAP_TYPE_ARRAY_OF_MAPS:
>   case BPF_MAP_TYPE_HASH_OF_MAPS:
>   if (func_id != BPF_FUNC_map_lookup_elem)
> diff --git a/kernel/bpf/xskmap.c b/kernel/bpf/xskmap.c
> index 686d244e798d..249b22089014 100644
> --- a/kernel/bpf/xskmap.c
> +++ b/kernel/bpf/xskmap.c
> @@ -154,7 +154,9 @@ void __xsk_map_flush(struct bpf_map *map)
>  
>  static void *xsk_map_lookup_elem(struct bpf_map *map, void *key)
>  {
> - return ERR_PTR(-EOPNOTSUPP);
> + struct xdp_sock *xs = __xsk_map_lookup_elem(map, *(u32 *)key);
> +
> + return xs ? &xs->queue_id : NULL;
>  }

How do you guarantee that BPF programs don't mess around with the map values,
e.g. overriding xs->queue_id from the lookup? This should be a read-only map
from the BPF program side.

>  static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
> diff --git a/tools/testing/selftests/bpf/verifier/prevent_map_lookup.c 
> b/tools/testing/selftests/bpf/verifier/prevent_map_lookup.c
> index bbdba990fefb..da7a4b37cb98 100644
> --- a/tools/testing/selftests/bpf/verifier/prevent_map_lookup.c
> +++ b/tools/testing/selftests/bpf/verifier/prevent_map_lookup.c
> @@ -28,21 +28,6 @@
>   .errstr = "cannot pass map_type 18 into func bpf_map_lookup_elem",
>   .prog_type = BPF_PROG_TYPE_SOCK_OPS,
>  },
> -{
> - "prevent map lookup in xskmap",
> - .insns = {
> - BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
> - BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
> - BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
> - BPF_LD_MAP_FD(BPF_REG_1, 0),
> - BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
> - BPF_EXIT_INSN(),
> - },
> - .fixup_map_xskmap = { 3 },
> - .result = REJECT,
> - .errstr = "cannot pass map_type 17 into func bpf_map_lookup_elem",
> - .prog_type = BPF_PROG_TYPE_XDP,
> -},
>  {
>   "prevent map lookup in stack trace",
>   .insns = {
> 



Re: [PATCH v4 bpf-next 1/2] bpf: Allow bpf_map_lookup_elem() on an xskmap

2019-06-04 Thread Daniel Borkmann
On 06/04/2019 04:54 PM, Daniel Borkmann wrote:
> On 06/03/2019 06:38 PM, Jonathan Lemon wrote:
>> Currently, the AF_XDP code uses a separate map in order to
>> determine if an xsk is bound to a queue.  Instead of doing this,
>> have bpf_map_lookup_elem() return the queue_id, as a way of
>> indicating that there is a valid entry at the map index.
>>
>> Rearrange some xdp_sock members to eliminate structure holes.
>>
>> Signed-off-by: Jonathan Lemon 
>> Acked-by: Song Liu 
>> Acked-by: Björn Töpel 
>> ---
>>  include/net/xdp_sock.h|  6 +++---
>>  kernel/bpf/verifier.c |  6 +-
>>  kernel/bpf/xskmap.c   |  4 +++-
>>  .../selftests/bpf/verifier/prevent_map_lookup.c   | 15 ---
>>  4 files changed, 11 insertions(+), 20 deletions(-)
>>
>> diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
>> index d074b6d60f8a..7d84b1da43d2 100644
>> --- a/include/net/xdp_sock.h
>> +++ b/include/net/xdp_sock.h
>> @@ -57,12 +57,12 @@ struct xdp_sock {
>>  struct net_device *dev;
>>  struct xdp_umem *umem;
>>  struct list_head flush_node;
>> -u16 queue_id;
>> -struct xsk_queue *tx cacheline_aligned_in_smp;
>> -struct list_head list;
>> +u32 queue_id;
>>  bool zc;
>>  /* Protects multiple processes in the control path */
>>  struct mutex mutex;
>> +struct xsk_queue *tx cacheline_aligned_in_smp;
>> +struct list_head list;
>>  /* Mutual exclusion of NAPI TX thread and sendmsg error paths
>>   * in the SKB destructor callback.
>>   */
>> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
>> index 2778417e6e0c..91c730f85e92 100644
>> --- a/kernel/bpf/verifier.c
>> +++ b/kernel/bpf/verifier.c
>> @@ -2905,10 +2905,14 @@ static int check_map_func_compatibility(struct 
>> bpf_verifier_env *env,
>>   * appear.
>>   */
>>  case BPF_MAP_TYPE_CPUMAP:
>> -case BPF_MAP_TYPE_XSKMAP:
>>  if (func_id != BPF_FUNC_redirect_map)
>>  goto error;
>>  break;
>> +case BPF_MAP_TYPE_XSKMAP:
>> +if (func_id != BPF_FUNC_redirect_map &&
>> +func_id != BPF_FUNC_map_lookup_elem)
>> +goto error;
>> +break;
>>  case BPF_MAP_TYPE_ARRAY_OF_MAPS:
>>  case BPF_MAP_TYPE_HASH_OF_MAPS:
>>  if (func_id != BPF_FUNC_map_lookup_elem)
>> diff --git a/kernel/bpf/xskmap.c b/kernel/bpf/xskmap.c
>> index 686d244e798d..249b22089014 100644
>> --- a/kernel/bpf/xskmap.c
>> +++ b/kernel/bpf/xskmap.c
>> @@ -154,7 +154,9 @@ void __xsk_map_flush(struct bpf_map *map)
>>  
>>  static void *xsk_map_lookup_elem(struct bpf_map *map, void *key)
>>  {
>> -return ERR_PTR(-EOPNOTSUPP);
>> +struct xdp_sock *xs = __xsk_map_lookup_elem(map, *(u32 *)key);
>> +
>> +return xs ? &xs->queue_id : NULL;
>>  }
> 
> How do you guarantee that BPF programs don't mess around with the map values,
> e.g. overriding xs->queue_id from the lookup? This should be a read-only map
> from the BPF program side.

(Or via a per-cpu scratch var into which you move xs->queue_id and return
from here.)
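
A sketch of that per-cpu scratch idea, so the pointer handed back to the
program never points into struct xdp_sock itself (the xsk_map_scratch
variable is an assumption of this sketch, not code from the thread):

  static DEFINE_PER_CPU(u32, xsk_map_scratch);

  static void *xsk_map_lookup_elem(struct bpf_map *map, void *key)
  {
          struct xdp_sock *xs = __xsk_map_lookup_elem(map, *(u32 *)key);
          u32 *val;

          if (!xs)
                  return NULL;
          /* Copy the queue_id out; program writes only ever hit the
           * scratch slot, never the socket.
           */
          val = this_cpu_ptr(&xsk_map_scratch);
          *val = READ_ONCE(xs->queue_id);
          return val;
  }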


Re: [PATCH bpf] selftests/bpf: move test_lirc_mode2_user to TEST_GEN_PROGS_EXTENDED

2019-06-05 Thread Daniel Borkmann
On 06/04/2019 04:35 AM, Hangbin Liu wrote:
> test_lirc_mode2_user is included in the test_lirc_mode2.sh test and should
> not be run directly.
> 
> Fixes: 6bdd533cee9a ("bpf: add selftest for lirc_mode2 type program")
> Signed-off-by: Hangbin Liu 

Applied, thanks!


[PATCH bpf 1/2] bpf: fix unconnected udp hooks

2019-06-05 Thread Daniel Borkmann
passed independent of sk->sk_state being TCP_ESTABLISHED or not. Note
that for TCP case, the msg->msg_name is ignored in the regular recvmsg
path and therefore not relevant.

For the case of ip{,v6}_recv_error() paths, picked up via MSG_ERRQUEUE,
the hook is not called. This is intentional as it aligns with the same
semantics as in case of TCP cgroup BPF hooks right now. This might be
better addressed in future through a different bpf_attach_type such
that this case can be distinguished from the regular recvmsg paths,
for example.

Fixes: 1cedee13d25a ("bpf: Hooks for sys_sendmsg")
Signed-off-by: Daniel Borkmann 
Cc: Andrey Ignatov 
Cc: Martynas Pumputis 
---
 include/linux/bpf-cgroup.h |  8 
 include/uapi/linux/bpf.h   |  2 ++
 kernel/bpf/syscall.c   |  8 
 kernel/bpf/verifier.c  | 12 
 net/core/filter.c  |  2 ++
 net/ipv4/udp.c |  4 
 net/ipv6/udp.c |  4 
 tools/bpf/bpftool/cgroup.c |  5 -
 tools/include/uapi/linux/bpf.h |  2 ++
 9 files changed, 42 insertions(+), 5 deletions(-)

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index cb3c6b3..a7f7a98 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -238,6 +238,12 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *map, 
void *key,
 #define BPF_CGROUP_RUN_PROG_UDP6_SENDMSG_LOCK(sk, uaddr, t_ctx)
   \
BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, BPF_CGROUP_UDP6_SENDMSG, t_ctx)
 
+#define BPF_CGROUP_RUN_PROG_UDP4_RECVMSG_LOCK(sk, uaddr)   
\
+   BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, BPF_CGROUP_UDP4_RECVMSG, NULL)
+
+#define BPF_CGROUP_RUN_PROG_UDP6_RECVMSG_LOCK(sk, uaddr)   
\
+   BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, BPF_CGROUP_UDP6_RECVMSG, NULL)
+
 #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops)\
 ({\
int __ret = 0; \
@@ -339,6 +345,8 @@ static inline int bpf_percpu_cgroup_storage_update(struct 
bpf_map *map,
 #define BPF_CGROUP_RUN_PROG_INET6_CONNECT_LOCK(sk, uaddr) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_UDP4_SENDMSG_LOCK(sk, uaddr, t_ctx) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_UDP6_SENDMSG_LOCK(sk, uaddr, t_ctx) ({ 0; })
+#define BPF_CGROUP_RUN_PROG_UDP4_RECVMSG_LOCK(sk, uaddr) ({ 0; })
+#define BPF_CGROUP_RUN_PROG_UDP6_RECVMSG_LOCK(sk, uaddr) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_DEVICE_CGROUP(type,major,minor,access) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_SYSCTL(head,table,write,buf,count,pos,nbuf) ({ 0; 
})
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 63e0cf6..e4114a7 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -192,6 +192,8 @@ enum bpf_attach_type {
BPF_LIRC_MODE2,
BPF_FLOW_DISSECTOR,
BPF_CGROUP_SYSCTL,
+   BPF_CGROUP_UDP4_RECVMSG,
+   BPF_CGROUP_UDP6_RECVMSG,
__MAX_BPF_ATTACH_TYPE
 };
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index cb5440b..e8ba3a15 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1581,6 +1581,8 @@ bpf_prog_load_check_attach_type(enum bpf_prog_type 
prog_type,
case BPF_CGROUP_INET6_CONNECT:
case BPF_CGROUP_UDP4_SENDMSG:
case BPF_CGROUP_UDP6_SENDMSG:
+   case BPF_CGROUP_UDP4_RECVMSG:
+   case BPF_CGROUP_UDP6_RECVMSG:
return 0;
default:
return -EINVAL;
@@ -1875,6 +1877,8 @@ static int bpf_prog_attach(const union bpf_attr *attr)
case BPF_CGROUP_INET6_CONNECT:
case BPF_CGROUP_UDP4_SENDMSG:
case BPF_CGROUP_UDP6_SENDMSG:
+   case BPF_CGROUP_UDP4_RECVMSG:
+   case BPF_CGROUP_UDP6_RECVMSG:
ptype = BPF_PROG_TYPE_CGROUP_SOCK_ADDR;
break;
case BPF_CGROUP_SOCK_OPS:
@@ -1960,6 +1964,8 @@ static int bpf_prog_detach(const union bpf_attr *attr)
case BPF_CGROUP_INET6_CONNECT:
case BPF_CGROUP_UDP4_SENDMSG:
case BPF_CGROUP_UDP6_SENDMSG:
+   case BPF_CGROUP_UDP4_RECVMSG:
+   case BPF_CGROUP_UDP6_RECVMSG:
ptype = BPF_PROG_TYPE_CGROUP_SOCK_ADDR;
break;
case BPF_CGROUP_SOCK_OPS:
@@ -2011,6 +2017,8 @@ static int bpf_prog_query(const union bpf_attr *attr,
case BPF_CGROUP_INET6_CONNECT:
case BPF_CGROUP_UDP4_SENDMSG:
case BPF_CGROUP_UDP6_SENDMSG:
+   case BPF_CGROUP_UDP4_RECVMSG:
+   case BPF_CGROUP_UDP6_RECVMSG:
case BPF_CGROUP_SOCK_OPS:
case BPF_CGROUP_DEVICE:
case BPF_CGROUP_SYSCTL:
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 95f93544..d2c8a66 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -5361,9 

[PATCH bpf 2/2] bpf: add further msg_name rewrite tests to test_sock_addr

2019-06-05 Thread Daniel Borkmann
Extend test_sock_addr for recvmsg test cases; bigger parts of the
sendmsg code can be reused for this. Below is the strace view of
the recvmsg rewrites; the sendmsg side does not have a BPF prog
connected to it for the context of this test:

IPv4 test case:

  [pid  4846] bpf(BPF_PROG_ATTACH, {target_fd=3, attach_bpf_fd=4, 
attach_type=0x13 /* BPF_??? */, attach_flags=BPF_F_ALLOW_OVERRIDE}, 112) = 0
  [pid  4846] socket(AF_INET, SOCK_DGRAM, IPPROTO_IP) = 5
  [pid  4846] bind(5, {sa_family=AF_INET, sin_port=htons(), 
sin_addr=inet_addr("127.0.0.1")}, 128) = 0
  [pid  4846] socket(AF_INET, SOCK_DGRAM, IPPROTO_IP) = 6
  [pid  4846] sendmsg(6, {msg_name={sa_family=AF_INET, sin_port=htons(), 
sin_addr=inet_addr("127.0.0.1")}, msg_namelen=128, msg_iov=[{iov_base="a", 
iov_len=1}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 1
  [pid  4846] select(6, [5], NULL, NULL, {tv_sec=2, tv_usec=0}) = 1 (in [5], 
left {tv_sec=1, tv_usec=95})
  [pid  4846] recvmsg(5, {msg_name={sa_family=AF_INET, sin_port=htons(4040), 
sin_addr=inet_addr("192.168.1.254")}, msg_namelen=128->16, 
msg_iov=[{iov_base="a", iov_len=64}], msg_iovlen=1, msg_controllen=0, 
msg_flags=0}, 0) = 1
  [pid  4846] close(6)= 0
  [pid  4846] close(5)= 0
  [pid  4846] bpf(BPF_PROG_DETACH, {target_fd=3, attach_type=0x13 /* BPF_??? 
*/}, 112) = 0

IPv6 test case:

  [pid  4846] bpf(BPF_PROG_ATTACH, {target_fd=3, attach_bpf_fd=4, 
attach_type=0x14 /* BPF_??? */, attach_flags=BPF_F_ALLOW_OVERRIDE}, 112) = 0
  [pid  4846] socket(AF_INET6, SOCK_DGRAM, IPPROTO_IP) = 5
  [pid  4846] bind(5, {sa_family=AF_INET6, sin6_port=htons(), 
inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=htonl(0), 
sin6_scope_id=0}, 128) = 0
  [pid  4846] socket(AF_INET6, SOCK_DGRAM, IPPROTO_IP) = 6
  [pid  4846] sendmsg(6, {msg_name={sa_family=AF_INET6, sin6_port=htons(), 
inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=htonl(0), 
sin6_scope_id=0}, msg_namelen=128, msg_iov=[{iov_base="a", iov_len=1}], 
msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 1
  [pid  4846] select(6, [5], NULL, NULL, {tv_sec=2, tv_usec=0}) = 1 (in [5], 
left {tv_sec=1, tv_usec=96})
  [pid  4846] recvmsg(5, {msg_name={sa_family=AF_INET6, sin6_port=htons(6060), 
inet_pton(AF_INET6, "face:b00c:1234:5678::abcd", &sin6_addr), 
sin6_flowinfo=htonl(0), sin6_scope_id=0}, msg_namelen=128->28, 
msg_iov=[{iov_base="a", iov_len=64}], msg_iovlen=1, msg_controllen=0, 
msg_flags=0}, 0) = 1
  [pid  4846] close(6)= 0
  [pid  4846] close(5)= 0
  [pid  4846] bpf(BPF_PROG_DETACH, {target_fd=3, attach_type=0x14 /* BPF_??? 
*/}, 112) = 0

test_sock_addr run w/o strace view:

  # ./test_sock_addr.sh
  [...]
  Test case: recvmsg4: return code ok .. [PASS]
  Test case: recvmsg4: return code !ok .. [PASS]
  Test case: recvmsg6: return code ok .. [PASS]
  Test case: recvmsg6: return code !ok .. [PASS]
  Test case: recvmsg4: rewrite IP & port (asm) .. [PASS]
  Test case: recvmsg6: rewrite IP & port (asm) .. [PASS]
  [...]

Signed-off-by: Daniel Borkmann 
---
 tools/testing/selftests/bpf/test_sock_addr.c | 213 +--
 1 file changed, 197 insertions(+), 16 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_sock_addr.c 
b/tools/testing/selftests/bpf/test_sock_addr.c
index 3f110ea..4ecde23 100644
--- a/tools/testing/selftests/bpf/test_sock_addr.c
+++ b/tools/testing/selftests/bpf/test_sock_addr.c
@@ -76,6 +76,7 @@ struct sock_addr_test {
enum {
LOAD_REJECT,
ATTACH_REJECT,
+   ATTACH_OKAY,
SYSCALL_EPERM,
SYSCALL_ENOTSUPP,
SUCCESS,
@@ -88,9 +89,13 @@ static int connect4_prog_load(const struct sock_addr_test 
*test);
 static int connect6_prog_load(const struct sock_addr_test *test);
 static int sendmsg_allow_prog_load(const struct sock_addr_test *test);
 static int sendmsg_deny_prog_load(const struct sock_addr_test *test);
+static int recvmsg_allow_prog_load(const struct sock_addr_test *test);
+static int recvmsg_deny_prog_load(const struct sock_addr_test *test);
 static int sendmsg4_rw_asm_prog_load(const struct sock_addr_test *test);
+static int recvmsg4_rw_asm_prog_load(const struct sock_addr_test *test);
 static int sendmsg4_rw_c_prog_load(const struct sock_addr_test *test);
 static int sendmsg6_rw_asm_prog_load(const struct sock_addr_test *test);
+static int recvmsg6_rw_asm_prog_load(const struct sock_addr_test *test);
 static int sendmsg6_rw_c_prog_load(const struct sock_addr_test *test);
 static int sendmsg6_rw_v4mapped_prog_load(const struct sock_addr_test *test);
 static int sendmsg6_rw_wildcard_prog_load(const struct sock_addr_test *test);
@@ -507,6 +512,92 @@ static struct sock_addr_test tests[] = {
SRC6_REWRITE_IP,
SYSCALL_EPERM,
  

Re: [PATCH bpf 1/2] bpf: fix unconnected udp hooks

2019-06-05 Thread Daniel Borkmann
On 06/06/2019 01:54 AM, Martin Lau wrote:
> On Wed, Jun 05, 2019 at 10:40:49PM +0200, Daniel Borkmann wrote:
>> Intention of cgroup bind/connect/sendmsg BPF hooks is to act transparently
>> to applications as also stated in original motivation in 7828f20e3779 ("Merge
>> branch 'bpf-cgroup-bind-connect'"). When recently integrating the latter
>> two hooks into Cilium to enable host based load-balancing with Kubernetes,
>> I ran into the issue that pods couldn't start up as DNS got broken. 
>> Kubernetes
>> typically sets up DNS as a service and is thus subject to load-balancing.
>>
>> Upon further debugging, it turns out that the cgroupv2 sendmsg BPF hooks API
>> is currently insufficient and thus not usable as-is for standard applications
>> shipped with most distros. To break down the issue we ran into with a simple
>> example:
>>
>>   # cat /etc/resolv.conf
>>   nameserver 147.75.207.207
>>   nameserver 147.75.207.208
>>
>> For the purpose of a simple test, we set up above IPs as service IPs and
>> transparently redirect traffic to a different DNS backend server for that
>> node:
>>
>>   # cilium service list
>>   ID   FrontendBackend
>>   1147.75.207.207:53   1 => 8.8.8.8:53
>>   2147.75.207.208:53   1 => 8.8.8.8:53
>>
>> The attached BPF program is basically selecting one of the backends if the
>> service IP/port matches on the cgroup hook. DNS breaks here, because the
>> hooks are not transparent enough to applications which have built-in msg_name
>> address checks:
>>
>>   # nslookup 1.1.1.1
>>   ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
>>   ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
>>   ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
>>   [...]
>>   ;; connection timed out; no servers could be reached
>>
>>   # dig 1.1.1.1
>>   ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
>>   ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.208#53
>>   ;; reply from unexpected source: 8.8.8.8#53, expected 147.75.207.207#53
>>   [...]
>>
>>   ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
>>   ;; global options: +cmd
>>   ;; connection timed out; no servers could be reached
>>
>> For comparison, if none of the service IPs is used, and we tell nslookup
>> to use 8.8.8.8 directly it works just fine, of course:
>>
>>   # nslookup 1.1.1.1 8.8.8.8
>>   1.1.1.1.in-addr.arpa   name = one.one.one.one.
>>
>> In order to fix this and thus act more transparently to the application,
>> this needs reverse translation on the recvmsg() side. A minimal fix for this
>> API is to add similar recvmsg() hooks behind the BPF cgroups static key
>> such that the program can track state and replace the current sockaddr_in{,6}
>> with the original service IP. From BPF side, this basically tracks the
>> service tuple plus socket cookie in an LRU map where the reverse NAT can
>> then be retrieved via map value as one example. Side-note: the BPF cgroups
>> static key should be converted to a per-hook static key in future.
>>
>> Same example after this fix:
>>
>>   # cilium service list
>>   ID   FrontendBackend
>>   1147.75.207.207:53   1 => 8.8.8.8:53
>>   2147.75.207.208:53   1 => 8.8.8.8:53
>>
>> Lookups work fine now:
>>
>>   # nslookup 1.1.1.1
>>   1.1.1.1.in-addr.arpaname = one.one.one.one.
>>
>>   Authoritative answers can be found from:
>>
>>   # dig 1.1.1.1
>>
>>   ; <<>> DiG 9.11.3-1ubuntu1.7-Ubuntu <<>> 1.1.1.1
>>   ;; global options: +cmd
>>   ;; Got answer:
>>   ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 51550
>>   ;; flags: qr rd ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
>>
>>   ;; OPT PSEUDOSECTION:
>>   ; EDNS: version: 0, flags:; udp: 512
>>   ;; QUESTION SECTION:
>>   ;1.1.1.1.   IN  A
>>
>>   ;; AUTHORITY SECTION:
>>   .   23426   IN  SOA a.root-servers.net. 
>> nstld.verisign-grs.com. 2019052001 1800 900 604800 86400
>>
>>   ;; Query time: 17 msec
>>   ;; SERVER: 147.75.207.207#53(147.75.207.207)
>>   ;; WHEN: Tue May 21 12:59:38 UTC 2019
>>   ;; MSG SIZE  rcvd: 111
>>
>> And from an actual packet level it shows that we're using the back end
>> server when talking via 147.75.207.20{7

Re: [PATCH bpf-next v2] samples: bpf: print a warning about headers_install

2019-06-05 Thread Daniel Borkmann
On 06/06/2019 01:47 AM, Jakub Kicinski wrote:
> It seems like periodically someone posts patches to "fix"
> header includes.  The issue is that samples expect the
> include path to have the uAPI headers (from usr/) first,
> and then tools/ headers, so that locally installed uAPI
> headers take precedence.  This means that if users didn't
> run headers_install they will see all sort of strange
> compilation errors, e.g.:
> 
>   HOSTCC  samples/bpf/test_lru_dist
>   samples/bpf/test_lru_dist.c:39:8: error: redefinition of ‘struct list_head’
>struct list_head {
>   ^
>In file included from samples/bpf/test_lru_dist.c:9:0:
>../tools/include/linux/types.h:69:8: note: originally defined here
> struct list_head {
>^
> 
> Try to detect this situation, and print a helpful warning.
> 
> v2: just use HOSTCC (Jiong).
> 
> Signed-off-by: Jakub Kicinski 
> Reviewed-by: Quentin Monnet 

Applied, thanks!


Re: [PATCH bpf 1/2] bpf: fix unconnected udp hooks

2019-06-05 Thread Daniel Borkmann
On 06/06/2019 02:13 AM, Alexei Starovoitov wrote:
> On Wed, Jun 5, 2019 at 5:09 PM Daniel Borkmann  wrote:
>>
>>>>  tools/bpf/bpftool/cgroup.c |  5 -
>>>>  tools/include/uapi/linux/bpf.h |  2 ++
>>> Should the bpf.h sync to tools/ be in a separate patch?
>>
>> I was thinking about it, but concluded that for such a small change, it's not
>> really worth it. If there's a strong opinion, I could do it, but I think
>> that 2-liner sync patch just adds noise.
> 
> it's not about the size. It breaks the sync of libbpf.
> we should really enforce user vs kernel to be separate patches.

Okay, I see. Fair enough, I'll split them in that case.

Thanks,
Daniel


[PATCH bpf v2 2/4] bpf: sync tooling uapi header

2019-06-06 Thread Daniel Borkmann
Sync BPF uapi header in order to pull in BPF_CGROUP_UDP{4,6}_RECVMSG
attach types. This is done and preferred as an extra patch in order
to ease sync of libbpf.

Signed-off-by: Daniel Borkmann 
---
 tools/include/uapi/linux/bpf.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 63e0cf66f01a..e4114a7e4451 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -192,6 +192,8 @@ enum bpf_attach_type {
BPF_LIRC_MODE2,
BPF_FLOW_DISSECTOR,
BPF_CGROUP_SYSCTL,
+   BPF_CGROUP_UDP4_RECVMSG,
+   BPF_CGROUP_UDP6_RECVMSG,
__MAX_BPF_ATTACH_TYPE
 };
 
-- 
2.17.1



[PATCH bpf v2 0/4] Fix unconnected bpf cgroup hooks

2019-06-06 Thread Daniel Borkmann
Please refer to patch 1/4 as the main patch with the details
on the current sendmsg hook API limitations and proposal to fix
it in order to work with basic applications like DNS. Remaining
patches are the usual uapi and tooling updates as well as test
cases. Thanks a lot!

v1 -> v2:
  - Split off uapi header sync and bpftool bits (Martin, Alexei)
  - Added missing bpftool doc and bash completion as well

Daniel Borkmann (4):
  bpf: fix unconnected udp hooks
  bpf: sync tooling uapi header
  bpf, bpftool: enable recvmsg attach types
  bpf: add further msg_name rewrite tests to test_sock_addr

 include/linux/bpf-cgroup.h|   8 +
 include/uapi/linux/bpf.h  |   2 +
 kernel/bpf/syscall.c  |   8 +
 kernel/bpf/verifier.c |  12 +-
 net/core/filter.c |   2 +
 net/ipv4/udp.c|   4 +
 net/ipv6/udp.c|   4 +
 .../bpftool/Documentation/bpftool-cgroup.rst  |   6 +-
 .../bpftool/Documentation/bpftool-prog.rst|   2 +-
 tools/bpf/bpftool/bash-completion/bpftool |   5 +-
 tools/bpf/bpftool/cgroup.c|   5 +-
 tools/bpf/bpftool/prog.c  |   3 +-
 tools/include/uapi/linux/bpf.h|   2 +
 tools/testing/selftests/bpf/test_sock_addr.c  | 213 --
 14 files changed, 250 insertions(+), 26 deletions(-)

-- 
2.17.1
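
To make the intended BPF-side usage of patch 1/4 concrete, a sketch of
the reverse translation on the recvmsg4 hook. The revnat LRU map, its
key/value layout and the use of the socket cookie as key are assumptions
of this example (the actual Cilium datapath differs); returning 1 means
the call is allowed:

  #include <linux/bpf.h>
  #include "bpf_helpers.h"

  struct revnat_val {
          __u32 addr; /* original service IP, network byte order */
          __u32 port; /* original service port, network byte order */
  };

  struct bpf_map_def SEC("maps") revnat = {
          .type        = BPF_MAP_TYPE_LRU_HASH,
          .key_size    = sizeof(__u64), /* socket cookie */
          .value_size  = sizeof(struct revnat_val),
          .max_entries = 16384,
  };

  SEC("cgroup/recvmsg4")
  int reverse_nat(struct bpf_sock_addr *ctx)
  {
          __u64 cookie = bpf_get_socket_cookie(ctx);
          struct revnat_val *val = bpf_map_lookup_elem(&revnat, &cookie);

          /* If sendmsg4 recorded a translation for this socket, rewrite
           * the msg_name the application sees back to the service address.
           */
          if (val) {
                  ctx->user_ip4  = val->addr;
                  ctx->user_port = val->port;
          }
          return 1;
  }

  char _license[] SEC("license") = "GPL";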



[PATCH bpf v2 4/4] bpf: add further msg_name rewrite tests to test_sock_addr

2019-06-06 Thread Daniel Borkmann
Extend test_sock_addr for recvmsg test cases; bigger parts of the
sendmsg code can be reused for this. Below is the strace view of
the recvmsg rewrites; the sendmsg side does not have a BPF prog
connected to it for the context of this test:

IPv4 test case:

  [pid  4846] bpf(BPF_PROG_ATTACH, {target_fd=3, attach_bpf_fd=4, 
attach_type=0x13 /* BPF_??? */, attach_flags=BPF_F_ALLOW_OVERRIDE}, 112) = 0
  [pid  4846] socket(AF_INET, SOCK_DGRAM, IPPROTO_IP) = 5
  [pid  4846] bind(5, {sa_family=AF_INET, sin_port=htons(), 
sin_addr=inet_addr("127.0.0.1")}, 128) = 0
  [pid  4846] socket(AF_INET, SOCK_DGRAM, IPPROTO_IP) = 6
  [pid  4846] sendmsg(6, {msg_name={sa_family=AF_INET, sin_port=htons(), 
sin_addr=inet_addr("127.0.0.1")}, msg_namelen=128, msg_iov=[{iov_base="a", 
iov_len=1}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 1
  [pid  4846] select(6, [5], NULL, NULL, {tv_sec=2, tv_usec=0}) = 1 (in [5], 
left {tv_sec=1, tv_usec=95})
  [pid  4846] recvmsg(5, {msg_name={sa_family=AF_INET, sin_port=htons(4040), 
sin_addr=inet_addr("192.168.1.254")}, msg_namelen=128->16, 
msg_iov=[{iov_base="a", iov_len=64}], msg_iovlen=1, msg_controllen=0, 
msg_flags=0}, 0) = 1
  [pid  4846] close(6)= 0
  [pid  4846] close(5)= 0
  [pid  4846] bpf(BPF_PROG_DETACH, {target_fd=3, attach_type=0x13 /* BPF_??? 
*/}, 112) = 0

IPv6 test case:

  [pid  4846] bpf(BPF_PROG_ATTACH, {target_fd=3, attach_bpf_fd=4, 
attach_type=0x14 /* BPF_??? */, attach_flags=BPF_F_ALLOW_OVERRIDE}, 112) = 0
  [pid  4846] socket(AF_INET6, SOCK_DGRAM, IPPROTO_IP) = 5
  [pid  4846] bind(5, {sa_family=AF_INET6, sin6_port=htons(), 
inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=htonl(0), 
sin6_scope_id=0}, 128) = 0
  [pid  4846] socket(AF_INET6, SOCK_DGRAM, IPPROTO_IP) = 6
  [pid  4846] sendmsg(6, {msg_name={sa_family=AF_INET6, sin6_port=htons(), 
inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=htonl(0), 
sin6_scope_id=0}, msg_namelen=128, msg_iov=[{iov_base="a", iov_len=1}], 
msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 1
  [pid  4846] select(6, [5], NULL, NULL, {tv_sec=2, tv_usec=0}) = 1 (in [5], 
left {tv_sec=1, tv_usec=96})
  [pid  4846] recvmsg(5, {msg_name={sa_family=AF_INET6, sin6_port=htons(6060), 
inet_pton(AF_INET6, "face:b00c:1234:5678::abcd", &sin6_addr), 
sin6_flowinfo=htonl(0), sin6_scope_id=0}, msg_namelen=128->28, 
msg_iov=[{iov_base="a", iov_len=64}], msg_iovlen=1, msg_controllen=0, 
msg_flags=0}, 0) = 1
  [pid  4846] close(6)= 0
  [pid  4846] close(5)= 0
  [pid  4846] bpf(BPF_PROG_DETACH, {target_fd=3, attach_type=0x14 /* BPF_??? 
*/}, 112) = 0

test_sock_addr run w/o strace view:

  # ./test_sock_addr.sh
  [...]
  Test case: recvmsg4: return code ok .. [PASS]
  Test case: recvmsg4: return code !ok .. [PASS]
  Test case: recvmsg6: return code ok .. [PASS]
  Test case: recvmsg6: return code !ok .. [PASS]
  Test case: recvmsg4: rewrite IP & port (asm) .. [PASS]
  Test case: recvmsg6: rewrite IP & port (asm) .. [PASS]
  [...]

Signed-off-by: Daniel Borkmann 
---
 tools/testing/selftests/bpf/test_sock_addr.c | 213 +--
 1 file changed, 197 insertions(+), 16 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_sock_addr.c 
b/tools/testing/selftests/bpf/test_sock_addr.c
index 3f110eaaf29c..4ecde2392327 100644
--- a/tools/testing/selftests/bpf/test_sock_addr.c
+++ b/tools/testing/selftests/bpf/test_sock_addr.c
@@ -76,6 +76,7 @@ struct sock_addr_test {
enum {
LOAD_REJECT,
ATTACH_REJECT,
+   ATTACH_OKAY,
SYSCALL_EPERM,
SYSCALL_ENOTSUPP,
SUCCESS,
@@ -88,9 +89,13 @@ static int connect4_prog_load(const struct sock_addr_test 
*test);
 static int connect6_prog_load(const struct sock_addr_test *test);
 static int sendmsg_allow_prog_load(const struct sock_addr_test *test);
 static int sendmsg_deny_prog_load(const struct sock_addr_test *test);
+static int recvmsg_allow_prog_load(const struct sock_addr_test *test);
+static int recvmsg_deny_prog_load(const struct sock_addr_test *test);
 static int sendmsg4_rw_asm_prog_load(const struct sock_addr_test *test);
+static int recvmsg4_rw_asm_prog_load(const struct sock_addr_test *test);
 static int sendmsg4_rw_c_prog_load(const struct sock_addr_test *test);
 static int sendmsg6_rw_asm_prog_load(const struct sock_addr_test *test);
+static int recvmsg6_rw_asm_prog_load(const struct sock_addr_test *test);
 static int sendmsg6_rw_c_prog_load(const struct sock_addr_test *test);
 static int sendmsg6_rw_v4mapped_prog_load(const struct sock_addr_test *test);
 static int sendmsg6_rw_wildcard_prog_load(const struct sock_addr_test *test);
@@ -507,6 +512,92 @@ static struct sock_addr_test tests[] = {
SRC6_REW

[PATCH bpf v2 3/4] bpf, bpftool: enable recvmsg attach types

2019-06-06 Thread Daniel Borkmann
Trivial patch to bpftool to complete support for attaching programs
to BPF_CGROUP_UDP{4,6}_RECVMSG.

Signed-off-by: Daniel Borkmann 
---
 tools/bpf/bpftool/Documentation/bpftool-cgroup.rst | 6 +-
 tools/bpf/bpftool/Documentation/bpftool-prog.rst   | 2 +-
 tools/bpf/bpftool/bash-completion/bpftool  | 5 +++--
 tools/bpf/bpftool/cgroup.c | 5 -
 tools/bpf/bpftool/prog.c   | 3 ++-
 5 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/tools/bpf/bpftool/Documentation/bpftool-cgroup.rst 
b/tools/bpf/bpftool/Documentation/bpftool-cgroup.rst
index ac26876389c2..e744b3e4e56a 100644
--- a/tools/bpf/bpftool/Documentation/bpftool-cgroup.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool-cgroup.rst
@@ -29,7 +29,7 @@ CGROUP COMMANDS
 |  *PROG* := { **id** *PROG_ID* | **pinned** *FILE* | **tag** *PROG_TAG* }
 |  *ATTACH_TYPE* := { **ingress** | **egress** | **sock_create** | 
**sock_ops** | **device** |
 |  **bind4** | **bind6** | **post_bind4** | **post_bind6** | 
**connect4** | **connect6** |
-|  **sendmsg4** | **sendmsg6** | **sysctl** }
+|  **sendmsg4** | **sendmsg6** | **recvmsg4** | **recvmsg6** | 
**sysctl** }
 |  *ATTACH_FLAGS* := { **multi** | **override** }
 
 DESCRIPTION
@@ -86,6 +86,10 @@ DESCRIPTION
  unconnected udp4 socket (since 4.18);
  **sendmsg6** call to sendto(2), sendmsg(2), sendmmsg(2) for an
  unconnected udp6 socket (since 4.18);
+ **recvmsg4** call to recvfrom(2), recvmsg(2), recvmmsg(2) for
+  an unconnected udp4 socket (since 5.2);
+ **recvmsg6** call to recvfrom(2), recvmsg(2), recvmmsg(2) for
+  an unconnected udp6 socket (since 5.2);
  **sysctl** sysctl access (since 5.2).
 
**bpftool cgroup detach** *CGROUP* *ATTACH_TYPE* *PROG*
diff --git a/tools/bpf/bpftool/Documentation/bpftool-prog.rst 
b/tools/bpf/bpftool/Documentation/bpftool-prog.rst
index e8118544d118..018ecef8dc13 100644
--- a/tools/bpf/bpftool/Documentation/bpftool-prog.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool-prog.rst
@@ -40,7 +40,7 @@ PROG COMMANDS
 |  **lwt_seg6local** | **sockops** | **sk_skb** | **sk_msg** | 
**lirc_mode2** |
 |  **cgroup/bind4** | **cgroup/bind6** | **cgroup/post_bind4** | 
**cgroup/post_bind6** |
 |  **cgroup/connect4** | **cgroup/connect6** | **cgroup/sendmsg4** 
| **cgroup/sendmsg6** |
-|  **cgroup/sysctl**
+|  **cgroup/recvmsg4** | **cgroup/recvmsg6** | **cgroup/sysctl**
 |  }
 |   *ATTACH_TYPE* := {
 |  **msg_verdict** | **stream_verdict** | **stream_parser** | 
**flow_dissector**
diff --git a/tools/bpf/bpftool/bash-completion/bpftool 
b/tools/bpf/bpftool/bash-completion/bpftool
index 50e402a5a9c8..4300adf6e5ab 100644
--- a/tools/bpf/bpftool/bash-completion/bpftool
+++ b/tools/bpf/bpftool/bash-completion/bpftool
@@ -371,6 +371,7 @@ _bpftool()
 lirc_mode2 cgroup/bind4 cgroup/bind6 \
 cgroup/connect4 cgroup/connect6 \
 cgroup/sendmsg4 cgroup/sendmsg6 \
+cgroup/recvmsg4 cgroup/recvmsg6 \
 cgroup/post_bind4 cgroup/post_bind6 \
 cgroup/sysctl" -- \
"$cur" ) )
@@ -666,7 +667,7 @@ _bpftool()
 attach|detach)
 local ATTACH_TYPES='ingress egress sock_create sock_ops \
 device bind4 bind6 post_bind4 post_bind6 connect4 \
-connect6 sendmsg4 sendmsg6 sysctl'
+connect6 sendmsg4 sendmsg6 recvmsg4 recvmsg6 sysctl'
 local ATTACH_FLAGS='multi override'
 local PROG_TYPE='id pinned tag'
 case $prev in
@@ -676,7 +677,7 @@ _bpftool()
 ;;
 
ingress|egress|sock_create|sock_ops|device|bind4|bind6|\
 post_bind4|post_bind6|connect4|connect6|sendmsg4|\
-sendmsg6|sysctl)
+sendmsg6|recvmsg4|recvmsg6|sysctl)
 COMPREPLY=( $( compgen -W "$PROG_TYPE" -- \
 "$cur" ) )
 return 0
diff --git a/tools/bpf/bpftool/cgroup.c b/tools/bpf/bpftool/cgroup.c
index 7e22f115c8c1..73ec8ea33fb4 100644
--- a/tools/bpf/bpftool/cgroup.c
+++ b/tools/bpf/bpftool/cgroup.c
@@ -25,7 +25,8 @@
"   ATTACH_TYPE := { ingress | egress | sock_create |\n"   \
"sock_ops | device | bind4 | bind6 |\n"\
"post_bind4 | post_bind6 |

[PATCH bpf v2 1/4] bpf: fix unconnected udp hooks

2019-06-06 Thread Daniel Borkmann
as
passed independent of sk->sk_state being TCP_ESTABLISHED or not. Note
that for TCP case, the msg->msg_name is ignored in the regular recvmsg
path and therefore not relevant.

For the case of ip{,v6}_recv_error() paths, picked up via MSG_ERRQUEUE,
the hook is not called. This is intentional as it aligns with the same
semantics as in case of TCP cgroup BPF hooks right now. This might be
better addressed in future through a different bpf_attach_type such
that this case can be distinguished from the regular recvmsg paths,
for example.

Fixes: 1cedee13d25a ("bpf: Hooks for sys_sendmsg")
Signed-off-by: Daniel Borkmann 
Cc: Andrey Ignatov 
Cc: Martin KaFai Lau 
Acked-by: Martynas Pumputis 
---
 include/linux/bpf-cgroup.h |  8 
 include/uapi/linux/bpf.h   |  2 ++
 kernel/bpf/syscall.c   |  8 
 kernel/bpf/verifier.c  | 12 
 net/core/filter.c  |  2 ++
 net/ipv4/udp.c |  4 
 net/ipv6/udp.c |  4 
 7 files changed, 36 insertions(+), 4 deletions(-)

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index cb3c6b3b89c8..a7f7a98ec39d 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -238,6 +238,12 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *map, 
void *key,
 #define BPF_CGROUP_RUN_PROG_UDP6_SENDMSG_LOCK(sk, uaddr, t_ctx)
   \
BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, BPF_CGROUP_UDP6_SENDMSG, t_ctx)
 
+#define BPF_CGROUP_RUN_PROG_UDP4_RECVMSG_LOCK(sk, uaddr)   
\
+   BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, BPF_CGROUP_UDP4_RECVMSG, NULL)
+
+#define BPF_CGROUP_RUN_PROG_UDP6_RECVMSG_LOCK(sk, uaddr)   
\
+   BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, BPF_CGROUP_UDP6_RECVMSG, NULL)
+
 #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops)\
 ({\
int __ret = 0; \
@@ -339,6 +345,8 @@ static inline int bpf_percpu_cgroup_storage_update(struct 
bpf_map *map,
 #define BPF_CGROUP_RUN_PROG_INET6_CONNECT_LOCK(sk, uaddr) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_UDP4_SENDMSG_LOCK(sk, uaddr, t_ctx) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_UDP6_SENDMSG_LOCK(sk, uaddr, t_ctx) ({ 0; })
+#define BPF_CGROUP_RUN_PROG_UDP4_RECVMSG_LOCK(sk, uaddr) ({ 0; })
+#define BPF_CGROUP_RUN_PROG_UDP6_RECVMSG_LOCK(sk, uaddr) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_DEVICE_CGROUP(type,major,minor,access) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_SYSCTL(head,table,write,buf,count,pos,nbuf) ({ 0; 
})
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 63e0cf66f01a..e4114a7e4451 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -192,6 +192,8 @@ enum bpf_attach_type {
BPF_LIRC_MODE2,
BPF_FLOW_DISSECTOR,
BPF_CGROUP_SYSCTL,
+   BPF_CGROUP_UDP4_RECVMSG,
+   BPF_CGROUP_UDP6_RECVMSG,
__MAX_BPF_ATTACH_TYPE
 };
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index cb5440b02e82..e8ba3a153691 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1581,6 +1581,8 @@ bpf_prog_load_check_attach_type(enum bpf_prog_type 
prog_type,
case BPF_CGROUP_INET6_CONNECT:
case BPF_CGROUP_UDP4_SENDMSG:
case BPF_CGROUP_UDP6_SENDMSG:
+   case BPF_CGROUP_UDP4_RECVMSG:
+   case BPF_CGROUP_UDP6_RECVMSG:
return 0;
default:
return -EINVAL;
@@ -1875,6 +1877,8 @@ static int bpf_prog_attach(const union bpf_attr *attr)
case BPF_CGROUP_INET6_CONNECT:
case BPF_CGROUP_UDP4_SENDMSG:
case BPF_CGROUP_UDP6_SENDMSG:
+   case BPF_CGROUP_UDP4_RECVMSG:
+   case BPF_CGROUP_UDP6_RECVMSG:
ptype = BPF_PROG_TYPE_CGROUP_SOCK_ADDR;
break;
case BPF_CGROUP_SOCK_OPS:
@@ -1960,6 +1964,8 @@ static int bpf_prog_detach(const union bpf_attr *attr)
case BPF_CGROUP_INET6_CONNECT:
case BPF_CGROUP_UDP4_SENDMSG:
case BPF_CGROUP_UDP6_SENDMSG:
+   case BPF_CGROUP_UDP4_RECVMSG:
+   case BPF_CGROUP_UDP6_RECVMSG:
ptype = BPF_PROG_TYPE_CGROUP_SOCK_ADDR;
break;
case BPF_CGROUP_SOCK_OPS:
@@ -2011,6 +2017,8 @@ static int bpf_prog_query(const union bpf_attr *attr,
case BPF_CGROUP_INET6_CONNECT:
case BPF_CGROUP_UDP4_SENDMSG:
case BPF_CGROUP_UDP6_SENDMSG:
+   case BPF_CGROUP_UDP4_RECVMSG:
+   case BPF_CGROUP_UDP6_RECVMSG:
case BPF_CGROUP_SOCK_OPS:
case BPF_CGROUP_DEVICE:
case BPF_CGROUP_SYSCTL:
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 95f9354495ad..d2c8a6677ac4 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -5361,9 +5361,12 @@ static int check

Re: [PATCH net-next v2 1/2] bpf_xdp_redirect_map: Add flag to return XDP_PASS on map lookup failure

2019-06-06 Thread Daniel Borkmann
On 06/06/2019 03:24 PM, Toke Høiland-Jørgensen wrote:
> From: Toke Høiland-Jørgensen 
> 
> The bpf_redirect_map() helper used by XDP programs doesn't return any
> indication of whether it can successfully redirect to the map index it was
> given. Instead, BPF programs have to track this themselves, leading to
> programs using duplicate maps to track which entries are populated in the
> devmap.
> 
> This patch adds a flag to the XDP version of the bpf_redirect_map() helper,
> which makes the helper do a lookup in the map when called, and return
> XDP_PASS if there is no value at the provided index.
> 
> With this, a BPF program can check the return code from the helper call and
> react if it is XDP_PASS (by, for instance, substituting a different
> redirect). This works for any type of map used for redirect.
> 
> Signed-off-by: Toke Høiland-Jørgensen 
> ---
>  include/uapi/linux/bpf.h |8 
>  net/core/filter.c|   10 +-
>  2 files changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 7c6aef253173..d57df4f0b837 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -3098,6 +3098,14 @@ enum xdp_action {
>   XDP_REDIRECT,
>  };
>  
> +/* Flags for bpf_xdp_redirect_map helper */
> +
> +/* If set, the helper will check if the entry exists in the map and return
> + * XDP_PASS if it doesn't.
> + */
> +#define XDP_REDIRECT_F_PASS_ON_INVALID BIT(0)
> +#define XDP_REDIRECT_ALL_FLAGS XDP_REDIRECT_F_PASS_ON_INVALID
> +
>  /* user accessible metadata for XDP packet hook
>   * new fields must be added to the end of this structure
>   */
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 55bfc941d17a..2e532a9b2605 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -3755,9 +3755,17 @@ BPF_CALL_3(bpf_xdp_redirect_map, struct bpf_map *, 
> map, u32, ifindex,
>  {
>   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
>  
> - if (unlikely(flags))
> + if (unlikely(flags & ~XDP_REDIRECT_ALL_FLAGS))
>   return XDP_ABORTED;
>  
> + if (flags & XDP_REDIRECT_F_PASS_ON_INVALID) {
> + void *val;
> +
> + val = __xdp_map_lookup_elem(map, ifindex);
> + if (unlikely(!val))
> + return XDP_PASS;

Generally looks good to me, also the second part with the flag. Given we store
into the per-CPU scratch space and functions like xdp_do_redirect() pick this
up again, we could even propagate val onwards and save a second lookup on the
/same/ element (which also avoids a race if the val was dropped from the map
in the meantime). Given this should all still be within RCU it should work.
Perhaps it even makes sense to do the lookup unconditionally inside the
bpf_xdp_redirect_map() helper iff we manage to do it only once anyway?

> + }
> +
>   ri->ifindex = ifindex;
>   ri->flags = flags;
>   WRITE_ONCE(ri->map, map);
> 



Re: [PATCH net-next v2 1/2] bpf_xdp_redirect_map: Add flag to return XDP_PASS on map lookup failure

2019-06-06 Thread Daniel Borkmann
On 06/06/2019 08:15 PM, Jonathan Lemon wrote:
> On 6 Jun 2019, at 9:15, Toke Høiland-Jørgensen wrote:
>> Alexei Starovoitov  writes:
>>> On Thu, Jun 6, 2019 at 8:51 AM Daniel Borkmann  wrote:
>>>> On 06/06/2019 03:24 PM, Toke Høiland-Jørgensen wrote:
>>>>> From: Toke Høiland-Jørgensen 
>>>>>
>>>>> The bpf_redirect_map() helper used by XDP programs doesn't return any
>>>>> indication of whether it can successfully redirect to the map index it was
>>>>> given. Instead, BPF programs have to track this themselves, leading to
>>>>> programs using duplicate maps to track which entries are populated in the
>>>>> devmap.
>>>>>
>>>>> This patch adds a flag to the XDP version of the bpf_redirect_map() 
>>>>> helper,
>>>>> which makes the helper do a lookup in the map when called, and return
>>>>> XDP_PASS if there is no value at the provided index.
>>>>>
>>>>> With this, a BPF program can check the return code from the helper call 
>>>>> and
>>>>> react if it is XDP_PASS (by, for instance, substituting a different
>>>>> redirect). This works for any type of map used for redirect.
>>>>>
>>>>> Signed-off-by: Toke Høiland-Jørgensen 
>>>>> ---
>>>>>  include/uapi/linux/bpf.h |    8 
>>>>>  net/core/filter.c    |   10 +-
>>>>>  2 files changed, 17 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
>>>>> index 7c6aef253173..d57df4f0b837 100644
>>>>> --- a/include/uapi/linux/bpf.h
>>>>> +++ b/include/uapi/linux/bpf.h
>>>>> @@ -3098,6 +3098,14 @@ enum xdp_action {
>>>>>   XDP_REDIRECT,
>>>>>  };
>>>>>
>>>>> +/* Flags for bpf_xdp_redirect_map helper */
>>>>> +
>>>>> +/* If set, the helper will check if the entry exists in the map and return
>>>>> + * XDP_PASS if it doesn't.
>>>>> + */
>>>>> +#define XDP_REDIRECT_F_PASS_ON_INVALID BIT(0)
>>>>> +#define XDP_REDIRECT_ALL_FLAGS XDP_REDIRECT_F_PASS_ON_INVALID
>>>>> +
>>>>>  /* user accessible metadata for XDP packet hook
>>>>>   * new fields must be added to the end of this structure
>>>>>   */
>>>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>>>> index 55bfc941d17a..2e532a9b2605 100644
>>>>> --- a/net/core/filter.c
>>>>> +++ b/net/core/filter.c
>>>>> @@ -3755,9 +3755,17 @@ BPF_CALL_3(bpf_xdp_redirect_map, struct bpf_map *, 
>>>>> map, u32, ifindex,
>>>>>  {
>>>>>   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
>>>>>
>>>>> - if (unlikely(flags))
>>>>> + if (unlikely(flags & ~XDP_REDIRECT_ALL_FLAGS))
>>>>>   return XDP_ABORTED;
>>>>>
>>>>> + if (flags & XDP_REDIRECT_F_PASS_ON_INVALID) {
>>>>> + void *val;
>>>>> +
>>>>> + val = __xdp_map_lookup_elem(map, ifindex);
>>>>> + if (unlikely(!val))
>>>>> + return XDP_PASS;
>>>>
>>>> Generally looks good to me, also the second part with the flag. Given we
>>>> store into the per-CPU scratch space and functions like xdp_do_redirect()
>>>> pick this up again, we could even propagate val onwards and save a second
>>>> lookup on the /same/ element (which also avoids a race if the val was
>>>> dropped from the map in the meantime). Given this should all still be
>>>> within RCU it should work. Perhaps it even makes sense to do the lookup
>>>> unconditionally inside the bpf_xdp_redirect_map() helper iff we manage to
>>>> do it only once anyway?
>>>
>>> +1
>>>
>>> also I don't think we really need a new flag here.
>>> Yes, it could be considered an uapi change, but it
>>> looks more like bugfix in uapi to me.
>>> Since original behavior was so clunky to use.
>>
>> Hmm, the problem with this is that eBPF programs generally do something
>> like:
>>
>> return bpf_redirect_map(map, idx, 0);
>>
>> after having already modified the packet headers. This will get them a
>> return code of XDP_REDIRECT, and the lookup will then subsequently fail,
>> which returns in XDP_ABORTED in the driver, which you can catch with
>> tracing.
>>
>> However, if we just change it to XDP_PASS, the packet will go up the
>> stack, but because it has already been modified the stack will drop it,
>> more or less invisibly.
>>
>> So the question becomes, is that behaviour change really OK?
> 
> Another option would be treating the flags (or the lower bits of flags)
> as the default xdp action taken if the lookup fails.  0 just happens to
> map to XDP_ABORTED, which gives the initial behavior.  Then the new behavior
> would be:
> 
>     return bpf_redirect_map(map, index, XDP_PASS);

Makes sense, that should work, but as the default (flags == 0) you'd have
to return XDP_REDIRECT to stay consistent with existing behavior.

Thanks,
Daniel


Re: [RFC PATCH bpf-next 6/8] libbpf: allow specifying map definitions using BTF

2019-06-06 Thread Daniel Borkmann
On 06/04/2019 07:31 PM, Andrii Nakryiko wrote:
> On Tue, Jun 4, 2019 at 6:45 AM Stanislav Fomichev  wrote:
>> On 06/03, Stanislav Fomichev wrote:
>>>> BTF is mandatory for _any_ new feature.
>>> If something is easy to support without asking everyone to upgrade to
>>> a bleeding edge llvm, why not do it?
>>> So much for backwards compatibility and flexibility.
>>>
>>>> It's for introspection and debuggability in the first place.
>>>> Good debugging is not optional.
>>> Once llvm 8+ is everywhere, sure, but we are not there yet (I'm talking
>>> about upstream LTS distros like ubuntu/redhat).
>> But putting this aside, one thing that I didn't see addressed in the
>> cover letter is: what is the main motivation for the series?
>> Is it to support iproute2 map definitions (so cilium can switch to libbpf)?
> 
> In general, the motivation is to arrive at a way to support
> declaratively defining maps in such a way, that:
> - captures type information (for debuggability/introspection) in
> coherent and hard-to-screw-up way;
> - allows to support missing useful features w/ good syntax (e.g.,
> natural map-in-map case vs current completely manual non-declarative
> way for libbpf);
> - ultimately allow iproute2 to use libbpf as unified loader (and thus
> the need to support its existing features, like
> BPF_MAP_TYPE_PROG_ARRAY initialization, pinning, map-in-map);

Thanks for working on this & sorry for jumping in late! Generally, I like
the approach of using BTF to make sense out of the individual members and
to have extensibility, so overall I think it's a step in the right direction.
Going back to the example where others complained that the k/v NULL
initialization feels like too much magic from a C pov:

struct {
int type;
int max_entries;
int *key;
struct my_value *value;
} my_map SEC(".maps") = {
.type = BPF_MAP_TYPE_ARRAY,
.max_entries = 16,
};

Given LLVM is in charge of emitting BTF plus given gcc/clang seem /both/
to support *target* specific attributes [0], how about something along these
lines where the type specific info is annotated as a variable BPF target
attribute, like:

struct {
int type;
int max_entries;
} my_map __attribute__((map(int,struct my_value))) = {
.type = BPF_MAP_TYPE_ARRAY,
.max_entries = 16,
};

Of course this would need BPF backend support, but at least that approach
would be more C like. Thus this would define types where we can automatically
derive key/val sizes etc. The SEC() could be dropped as well, since the map
attribute would imply it for LLVM to do the right thing underneath. The
normal/actual members of the struct have a base set of well-known names that
are minimally required, but there could be custom stuff as well where libbpf
would query some user defined callback that can handle these. Anyway, main
point: what do you think about the __attribute__ approach instead? It feels
cleaner to me, at least if feasible.

Thanks,
Daniel

  [0] https://clang.llvm.org/docs/AttributeReference.html
  https://gcc.gnu.org/onlinedocs/gcc/Variable-Attributes.html

> The only missing feature that can be supported reasonably with
> bpf_map_def is pinning (as it's just another int field), but all the
> other use cases requires awkward approach of matching arbitrary IDs,
> which feels like a bad way forward.
> 
>> If that's the case, maybe explicitly focus on that? Once we have
>> proof-of-concept working for iproute2 mode, we can extend it to everything.



Re: [PATCH bpf v2 0/4] Fix unconnected bpf cgroup hooks

2019-06-06 Thread Daniel Borkmann
On 06/06/2019 10:51 PM, Andrey Ignatov wrote:
> Andrey Ignatov  [Thu, 2019-06-06 13:45 -0700]:
>> Daniel Borkmann  [Thu, 2019-06-06 07:36 -0700]:
>>> Please refer to the patch 1/4 as the main patch with the details
>>> on the current sendmsg hook API limitations and proposal to fix
>>> it in order to work with basic applications like DNS. Remaining
>>> patches are the usual uapi and tooling updates as well as test
>>> cases. Thanks a lot!
>>>
>>> v1 -> v2:
>>>   - Split off uapi header sync and bpftool bits (Martin, Alexei)
>>>   - Added missing bpftool doc and bash completion as well
>>>
>>> Daniel Borkmann (4):
>>>   bpf: fix unconnected udp hooks
>>>   bpf: sync tooling uapi header
>>>   bpf, bpftool: enable recvmsg attach types
>>>   bpf: add further msg_name rewrite tests to test_sock_addr
>>>
>>>  include/linux/bpf-cgroup.h|   8 +
>>>  include/uapi/linux/bpf.h  |   2 +
>>>  kernel/bpf/syscall.c  |   8 +
>>>  kernel/bpf/verifier.c |  12 +-
>>>  net/core/filter.c |   2 +
>>>  net/ipv4/udp.c|   4 +
>>>  net/ipv6/udp.c|   4 +
>>>  .../bpftool/Documentation/bpftool-cgroup.rst  |   6 +-
>>>  .../bpftool/Documentation/bpftool-prog.rst|   2 +-
>>>  tools/bpf/bpftool/bash-completion/bpftool |   5 +-
>>>  tools/bpf/bpftool/cgroup.c|   5 +-
>>>  tools/bpf/bpftool/prog.c  |   3 +-
>>>  tools/include/uapi/linux/bpf.h|   2 +
>>>  tools/testing/selftests/bpf/test_sock_addr.c  | 213 --
>>>  14 files changed, 250 insertions(+), 26 deletions(-)
>>
>> tools/lib/bpf/libbpf.c should also be updated: section_names and
> 
> And tools/testing/selftests/bpf/test_section_names.c as well.
> 
>> bpf_prog_type__needs_kver. Please either follow-up separately or send

Sigh, yes, makes sense, I'll fold these in for a final v3. Thanks for
spotting!

>> v3. Other than this LGTM.
>>
>> Acked-by: Andrey Ignatov 
>>
>> -- 
>> Andrey Ignatov
> 



[PATCH bpf v3 0/6] Fix unconnected bpf cgroup hooks

2019-06-06 Thread Daniel Borkmann
Please refer to the patch 1/6 as the main patch with the details
on the current sendmsg hook API limitations and proposal to fix
it in order to work with basic applications like DNS. Remaining
patches are the usual uapi and tooling updates as well as test
cases. Thanks a lot!

v2 -> v3:
  - Add attach types to test_section_names.c and libbpf (Andrey)
  - Added given Acks, rest as-is
v1 -> v2:
  - Split off uapi header sync and bpftool bits (Martin, Alexei)
  - Added missing bpftool doc and bash completion as well

Daniel Borkmann (6):
  bpf: fix unconnected udp hooks
  bpf: sync tooling uapi header
  bpf, libbpf: enable recvmsg attach types
  bpf, bpftool: enable recvmsg attach types
  bpf: more msg_name rewrite tests to test_sock_addr
  bpf: expand section tests for test_section_names

 include/linux/bpf-cgroup.h|   8 +
 include/uapi/linux/bpf.h  |   2 +
 kernel/bpf/syscall.c  |   8 +
 kernel/bpf/verifier.c |  12 +-
 net/core/filter.c |   2 +
 net/ipv4/udp.c|   4 +
 net/ipv6/udp.c|   4 +
 .../bpftool/Documentation/bpftool-cgroup.rst  |   6 +-
 .../bpftool/Documentation/bpftool-prog.rst|   2 +-
 tools/bpf/bpftool/bash-completion/bpftool |   5 +-
 tools/bpf/bpftool/cgroup.c|   5 +-
 tools/bpf/bpftool/prog.c  |   3 +-
 tools/include/uapi/linux/bpf.h|   2 +
 tools/lib/bpf/libbpf.c|   4 +
 .../selftests/bpf/test_section_names.c|  10 +
 tools/testing/selftests/bpf/test_sock_addr.c  | 213 --
 16 files changed, 264 insertions(+), 26 deletions(-)

-- 
2.17.1



[PATCH bpf v3 1/6] bpf: fix unconnected udp hooks

2019-06-06 Thread Daniel Borkmann
[...] as passed independent of sk->sk_state being TCP_ESTABLISHED or not. Note
that for TCP case, the msg->msg_name is ignored in the regular recvmsg
path and therefore not relevant.

For the case of ip{,v6}_recv_error() paths, picked up via MSG_ERRQUEUE,
the hook is not called. This is intentional as it aligns with the same
semantics as in case of TCP cgroup BPF hooks right now. This might be
better addressed in future through a different bpf_attach_type such
that this case can be distinguished from the regular recvmsg paths,
for example.

Fixes: 1cedee13d25a ("bpf: Hooks for sys_sendmsg")
Signed-off-by: Daniel Borkmann 
Acked-by: Andrey Ignatov 
Acked-by: Martin KaFai Lau 
Acked-by: Martynas Pumputis 
---
 include/linux/bpf-cgroup.h |  8 
 include/uapi/linux/bpf.h   |  2 ++
 kernel/bpf/syscall.c   |  8 
 kernel/bpf/verifier.c  | 12 
 net/core/filter.c  |  2 ++
 net/ipv4/udp.c |  4 
 net/ipv6/udp.c |  4 
 7 files changed, 36 insertions(+), 4 deletions(-)

diff --git a/include/linux/bpf-cgroup.h b/include/linux/bpf-cgroup.h
index cb3c6b3b89c8..a7f7a98ec39d 100644
--- a/include/linux/bpf-cgroup.h
+++ b/include/linux/bpf-cgroup.h
@@ -238,6 +238,12 @@ int bpf_percpu_cgroup_storage_update(struct bpf_map *map, 
void *key,
 #define BPF_CGROUP_RUN_PROG_UDP6_SENDMSG_LOCK(sk, uaddr, t_ctx)
   \
BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, BPF_CGROUP_UDP6_SENDMSG, t_ctx)
 
+#define BPF_CGROUP_RUN_PROG_UDP4_RECVMSG_LOCK(sk, uaddr)   
\
+   BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, BPF_CGROUP_UDP4_RECVMSG, NULL)
+
+#define BPF_CGROUP_RUN_PROG_UDP6_RECVMSG_LOCK(sk, uaddr)   
\
+   BPF_CGROUP_RUN_SA_PROG_LOCK(sk, uaddr, BPF_CGROUP_UDP6_RECVMSG, NULL)
+
 #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops)\
 ({\
int __ret = 0; \
@@ -339,6 +345,8 @@ static inline int bpf_percpu_cgroup_storage_update(struct 
bpf_map *map,
 #define BPF_CGROUP_RUN_PROG_INET6_CONNECT_LOCK(sk, uaddr) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_UDP4_SENDMSG_LOCK(sk, uaddr, t_ctx) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_UDP6_SENDMSG_LOCK(sk, uaddr, t_ctx) ({ 0; })
+#define BPF_CGROUP_RUN_PROG_UDP4_RECVMSG_LOCK(sk, uaddr) ({ 0; })
+#define BPF_CGROUP_RUN_PROG_UDP6_RECVMSG_LOCK(sk, uaddr) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_SOCK_OPS(sock_ops) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_DEVICE_CGROUP(type,major,minor,access) ({ 0; })
 #define BPF_CGROUP_RUN_PROG_SYSCTL(head,table,write,buf,count,pos,nbuf) ({ 0; 
})
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 63e0cf66f01a..e4114a7e4451 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -192,6 +192,8 @@ enum bpf_attach_type {
BPF_LIRC_MODE2,
BPF_FLOW_DISSECTOR,
BPF_CGROUP_SYSCTL,
+   BPF_CGROUP_UDP4_RECVMSG,
+   BPF_CGROUP_UDP6_RECVMSG,
__MAX_BPF_ATTACH_TYPE
 };
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index cb5440b02e82..e8ba3a153691 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1581,6 +1581,8 @@ bpf_prog_load_check_attach_type(enum bpf_prog_type 
prog_type,
case BPF_CGROUP_INET6_CONNECT:
case BPF_CGROUP_UDP4_SENDMSG:
case BPF_CGROUP_UDP6_SENDMSG:
+   case BPF_CGROUP_UDP4_RECVMSG:
+   case BPF_CGROUP_UDP6_RECVMSG:
return 0;
default:
return -EINVAL;
@@ -1875,6 +1877,8 @@ static int bpf_prog_attach(const union bpf_attr *attr)
case BPF_CGROUP_INET6_CONNECT:
case BPF_CGROUP_UDP4_SENDMSG:
case BPF_CGROUP_UDP6_SENDMSG:
+   case BPF_CGROUP_UDP4_RECVMSG:
+   case BPF_CGROUP_UDP6_RECVMSG:
ptype = BPF_PROG_TYPE_CGROUP_SOCK_ADDR;
break;
case BPF_CGROUP_SOCK_OPS:
@@ -1960,6 +1964,8 @@ static int bpf_prog_detach(const union bpf_attr *attr)
case BPF_CGROUP_INET6_CONNECT:
case BPF_CGROUP_UDP4_SENDMSG:
case BPF_CGROUP_UDP6_SENDMSG:
+   case BPF_CGROUP_UDP4_RECVMSG:
+   case BPF_CGROUP_UDP6_RECVMSG:
ptype = BPF_PROG_TYPE_CGROUP_SOCK_ADDR;
break;
case BPF_CGROUP_SOCK_OPS:
@@ -2011,6 +2017,8 @@ static int bpf_prog_query(const union bpf_attr *attr,
case BPF_CGROUP_INET6_CONNECT:
case BPF_CGROUP_UDP4_SENDMSG:
case BPF_CGROUP_UDP6_SENDMSG:
+   case BPF_CGROUP_UDP4_RECVMSG:
+   case BPF_CGROUP_UDP6_RECVMSG:
case BPF_CGROUP_SOCK_OPS:
case BPF_CGROUP_DEVICE:
case BPF_CGROUP_SYSCTL:
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 95f9354495ad..d2c8a6677ac4 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -5361,9 +5361,12 @@ static int check
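
For illustration, a minimal cgroup/recvmsg4 program as it could look against
this hook; the 127.0.0.1:53 rewrite target is a made-up example of undoing
reverse NAT for DNS replies, and returning 1 lets the recvmsg proceed:

  #include <linux/bpf.h>
  #include <sys/socket.h>
  #include "bpf_helpers.h"
  #include "bpf_endian.h"

  SEC("cgroup/recvmsg4")
  int rewrite_src4(struct bpf_sock_addr *ctx)
  {
          if (ctx->type != SOCK_DGRAM)
                  return 1;

          /* Rewrite the sender address the application will observe in
           * msg->msg_name back to the original service address.
           */
          ctx->user_ip4  = bpf_htonl(0x7f000001); /* 127.0.0.1 */
          ctx->user_port = bpf_htons(53);
          return 1;
  }

  char _license[] SEC("license") = "GPL";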

[PATCH bpf v3 6/6] bpf: expand section tests for test_section_names

2019-06-06 Thread Daniel Borkmann
Add cgroup/recvmsg{4,6} to test_section_names as well. Test run output:

  # ./test_section_names
  libbpf: failed to guess program type based on ELF section name 'InvAliD'
  libbpf: supported section(type) names are: [...]
  libbpf: failed to guess attach type based on ELF section name 'InvAliD'
  libbpf: attachable section(type) names are: [...]
  libbpf: failed to guess program type based on ELF section name 'cgroup'
  libbpf: supported section(type) names are: [...]
  libbpf: failed to guess attach type based on ELF section name 'cgroup'
  libbpf: attachable section(type) names are: [...]
  Summary: 38 PASSED, 0 FAILED

Signed-off-by: Daniel Borkmann 
---
 tools/testing/selftests/bpf/test_section_names.c | 10 ++
 1 file changed, 10 insertions(+)

diff --git a/tools/testing/selftests/bpf/test_section_names.c 
b/tools/testing/selftests/bpf/test_section_names.c
index bebd4fbca1f4..dee2f2eceb0f 100644
--- a/tools/testing/selftests/bpf/test_section_names.c
+++ b/tools/testing/selftests/bpf/test_section_names.c
@@ -119,6 +119,16 @@ static struct sec_name_test tests[] = {
{0, BPF_PROG_TYPE_CGROUP_SOCK_ADDR, BPF_CGROUP_UDP6_SENDMSG},
{0, BPF_CGROUP_UDP6_SENDMSG},
},
+   {
+   "cgroup/recvmsg4",
+   {0, BPF_PROG_TYPE_CGROUP_SOCK_ADDR, BPF_CGROUP_UDP4_RECVMSG},
+   {0, BPF_CGROUP_UDP4_RECVMSG},
+   },
+   {
+   "cgroup/recvmsg6",
+   {0, BPF_PROG_TYPE_CGROUP_SOCK_ADDR, BPF_CGROUP_UDP6_RECVMSG},
+   {0, BPF_CGROUP_UDP6_RECVMSG},
+   },
{
"cgroup/sysctl",
{0, BPF_PROG_TYPE_CGROUP_SYSCTL, BPF_CGROUP_SYSCTL},
-- 
2.17.1



[PATCH bpf v3 5/6] bpf: more msg_name rewrite tests to test_sock_addr

2019-06-06 Thread Daniel Borkmann
Extend test_sock_addr with recvmsg test cases; bigger parts of the
sendmsg code can be reused for this. Below is the strace view of
the recvmsg rewrites; the sendmsg side does not have a BPF prog
connected to it for the context of this test:

IPv4 test case:

  [pid  4846] bpf(BPF_PROG_ATTACH, {target_fd=3, attach_bpf_fd=4, 
attach_type=0x13 /* BPF_??? */, attach_flags=BPF_F_ALLOW_OVERRIDE}, 112) = 0
  [pid  4846] socket(AF_INET, SOCK_DGRAM, IPPROTO_IP) = 5
  [pid  4846] bind(5, {sa_family=AF_INET, sin_port=htons(), 
sin_addr=inet_addr("127.0.0.1")}, 128) = 0
  [pid  4846] socket(AF_INET, SOCK_DGRAM, IPPROTO_IP) = 6
  [pid  4846] sendmsg(6, {msg_name={sa_family=AF_INET, sin_port=htons(), 
sin_addr=inet_addr("127.0.0.1")}, msg_namelen=128, msg_iov=[{iov_base="a", 
iov_len=1}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 1
  [pid  4846] select(6, [5], NULL, NULL, {tv_sec=2, tv_usec=0}) = 1 (in [5], 
left {tv_sec=1, tv_usec=95})
  [pid  4846] recvmsg(5, {msg_name={sa_family=AF_INET, sin_port=htons(4040), 
sin_addr=inet_addr("192.168.1.254")}, msg_namelen=128->16, 
msg_iov=[{iov_base="a", iov_len=64}], msg_iovlen=1, msg_controllen=0, 
msg_flags=0}, 0) = 1
  [pid  4846] close(6)= 0
  [pid  4846] close(5)= 0
  [pid  4846] bpf(BPF_PROG_DETACH, {target_fd=3, attach_type=0x13 /* BPF_??? 
*/}, 112) = 0

IPv6 test case:

  [pid  4846] bpf(BPF_PROG_ATTACH, {target_fd=3, attach_bpf_fd=4, 
attach_type=0x14 /* BPF_??? */, attach_flags=BPF_F_ALLOW_OVERRIDE}, 112) = 0
  [pid  4846] socket(AF_INET6, SOCK_DGRAM, IPPROTO_IP) = 5
  [pid  4846] bind(5, {sa_family=AF_INET6, sin6_port=htons(), 
inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=htonl(0), 
sin6_scope_id=0}, 128) = 0
  [pid  4846] socket(AF_INET6, SOCK_DGRAM, IPPROTO_IP) = 6
  [pid  4846] sendmsg(6, {msg_name={sa_family=AF_INET6, sin6_port=htons(), 
inet_pton(AF_INET6, "::1", &sin6_addr), sin6_flowinfo=htonl(0), 
sin6_scope_id=0}, msg_namelen=128, msg_iov=[{iov_base="a", iov_len=1}], 
msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 1
  [pid  4846] select(6, [5], NULL, NULL, {tv_sec=2, tv_usec=0}) = 1 (in [5], 
left {tv_sec=1, tv_usec=96})
  [pid  4846] recvmsg(5, {msg_name={sa_family=AF_INET6, sin6_port=htons(6060), 
inet_pton(AF_INET6, "face:b00c:1234:5678::abcd", &sin6_addr), 
sin6_flowinfo=htonl(0), sin6_scope_id=0}, msg_namelen=128->28, 
msg_iov=[{iov_base="a", iov_len=64}], msg_iovlen=1, msg_controllen=0, 
msg_flags=0}, 0) = 1
  [pid  4846] close(6)= 0
  [pid  4846] close(5)= 0
  [pid  4846] bpf(BPF_PROG_DETACH, {target_fd=3, attach_type=0x14 /* BPF_??? 
*/}, 112) = 0

test_sock_addr run w/o strace view:

  # ./test_sock_addr.sh
  [...]
  Test case: recvmsg4: return code ok .. [PASS]
  Test case: recvmsg4: return code !ok .. [PASS]
  Test case: recvmsg6: return code ok .. [PASS]
  Test case: recvmsg6: return code !ok .. [PASS]
  Test case: recvmsg4: rewrite IP & port (asm) .. [PASS]
  Test case: recvmsg6: rewrite IP & port (asm) .. [PASS]
  [...]

Signed-off-by: Daniel Borkmann 
Acked-by: Andrey Ignatov 
Acked-by: Martin KaFai Lau 
---
 tools/testing/selftests/bpf/test_sock_addr.c | 213 +--
 1 file changed, 197 insertions(+), 16 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_sock_addr.c 
b/tools/testing/selftests/bpf/test_sock_addr.c
index 3f110eaaf29c..4ecde2392327 100644
--- a/tools/testing/selftests/bpf/test_sock_addr.c
+++ b/tools/testing/selftests/bpf/test_sock_addr.c
@@ -76,6 +76,7 @@ struct sock_addr_test {
enum {
LOAD_REJECT,
ATTACH_REJECT,
+   ATTACH_OKAY,
SYSCALL_EPERM,
SYSCALL_ENOTSUPP,
SUCCESS,
@@ -88,9 +89,13 @@ static int connect4_prog_load(const struct sock_addr_test 
*test);
 static int connect6_prog_load(const struct sock_addr_test *test);
 static int sendmsg_allow_prog_load(const struct sock_addr_test *test);
 static int sendmsg_deny_prog_load(const struct sock_addr_test *test);
+static int recvmsg_allow_prog_load(const struct sock_addr_test *test);
+static int recvmsg_deny_prog_load(const struct sock_addr_test *test);
 static int sendmsg4_rw_asm_prog_load(const struct sock_addr_test *test);
+static int recvmsg4_rw_asm_prog_load(const struct sock_addr_test *test);
 static int sendmsg4_rw_c_prog_load(const struct sock_addr_test *test);
 static int sendmsg6_rw_asm_prog_load(const struct sock_addr_test *test);
+static int recvmsg6_rw_asm_prog_load(const struct sock_addr_test *test);
 static int sendmsg6_rw_c_prog_load(const struct sock_addr_test *test);
 static int sendmsg6_rw_v4mapped_prog_load(const struct sock_addr_test *test);
 static int sendmsg6_rw_wildcard_prog_load(const struct sock_addr_test *test);
@@ -507,6 +512,92 @@ static struct sock_addr_test tests[] = {

[PATCH bpf v3 4/6] bpf, bpftool: enable recvmsg attach types

2019-06-06 Thread Daniel Borkmann
Trivial patch to bpftool in order to complete enabling attaching programs
to BPF_CGROUP_UDP{4,6}_RECVMSG.

Signed-off-by: Daniel Borkmann 
Acked-by: Andrey Ignatov 
Acked-by: Martin KaFai Lau 
---
 tools/bpf/bpftool/Documentation/bpftool-cgroup.rst | 6 +-
 tools/bpf/bpftool/Documentation/bpftool-prog.rst   | 2 +-
 tools/bpf/bpftool/bash-completion/bpftool  | 5 +++--
 tools/bpf/bpftool/cgroup.c | 5 -
 tools/bpf/bpftool/prog.c   | 3 ++-
 5 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/tools/bpf/bpftool/Documentation/bpftool-cgroup.rst 
b/tools/bpf/bpftool/Documentation/bpftool-cgroup.rst
index ac26876389c2..e744b3e4e56a 100644
--- a/tools/bpf/bpftool/Documentation/bpftool-cgroup.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool-cgroup.rst
@@ -29,7 +29,7 @@ CGROUP COMMANDS
 |  *PROG* := { **id** *PROG_ID* | **pinned** *FILE* | **tag** *PROG_TAG* }
 |  *ATTACH_TYPE* := { **ingress** | **egress** | **sock_create** | 
**sock_ops** | **device** |
 |  **bind4** | **bind6** | **post_bind4** | **post_bind6** | 
**connect4** | **connect6** |
-|  **sendmsg4** | **sendmsg6** | **sysctl** }
+|  **sendmsg4** | **sendmsg6** | **recvmsg4** | **recvmsg6** | 
**sysctl** }
 |  *ATTACH_FLAGS* := { **multi** | **override** }
 
 DESCRIPTION
@@ -86,6 +86,10 @@ DESCRIPTION
  unconnected udp4 socket (since 4.18);
  **sendmsg6** call to sendto(2), sendmsg(2), sendmmsg(2) for an
  unconnected udp6 socket (since 4.18);
+ **recvmsg4** call to recvfrom(2), recvmsg(2), recvmmsg(2) for
+  an unconnected udp4 socket (since 5.2);
+ **recvmsg6** call to recvfrom(2), recvmsg(2), recvmmsg(2) for
+  an unconnected udp6 socket (since 5.2);
  **sysctl** sysctl access (since 5.2).
 
**bpftool cgroup detach** *CGROUP* *ATTACH_TYPE* *PROG*
diff --git a/tools/bpf/bpftool/Documentation/bpftool-prog.rst 
b/tools/bpf/bpftool/Documentation/bpftool-prog.rst
index e8118544d118..018ecef8dc13 100644
--- a/tools/bpf/bpftool/Documentation/bpftool-prog.rst
+++ b/tools/bpf/bpftool/Documentation/bpftool-prog.rst
@@ -40,7 +40,7 @@ PROG COMMANDS
 |  **lwt_seg6local** | **sockops** | **sk_skb** | **sk_msg** | 
**lirc_mode2** |
 |  **cgroup/bind4** | **cgroup/bind6** | **cgroup/post_bind4** | 
**cgroup/post_bind6** |
 |  **cgroup/connect4** | **cgroup/connect6** | **cgroup/sendmsg4** 
| **cgroup/sendmsg6** |
-|  **cgroup/sysctl**
+|  **cgroup/recvmsg4** | **cgroup/recvmsg6** | **cgroup/sysctl**
 |  }
 |   *ATTACH_TYPE* := {
 |  **msg_verdict** | **stream_verdict** | **stream_parser** | 
**flow_dissector**
diff --git a/tools/bpf/bpftool/bash-completion/bpftool 
b/tools/bpf/bpftool/bash-completion/bpftool
index 50e402a5a9c8..4300adf6e5ab 100644
--- a/tools/bpf/bpftool/bash-completion/bpftool
+++ b/tools/bpf/bpftool/bash-completion/bpftool
@@ -371,6 +371,7 @@ _bpftool()
 lirc_mode2 cgroup/bind4 cgroup/bind6 \
 cgroup/connect4 cgroup/connect6 \
 cgroup/sendmsg4 cgroup/sendmsg6 \
+cgroup/recvmsg4 cgroup/recvmsg6 \
 cgroup/post_bind4 cgroup/post_bind6 \
 cgroup/sysctl" -- \
"$cur" ) )
@@ -666,7 +667,7 @@ _bpftool()
 attach|detach)
 local ATTACH_TYPES='ingress egress sock_create sock_ops \
 device bind4 bind6 post_bind4 post_bind6 connect4 \
-connect6 sendmsg4 sendmsg6 sysctl'
+connect6 sendmsg4 sendmsg6 recvmsg4 recvmsg6 sysctl'
 local ATTACH_FLAGS='multi override'
 local PROG_TYPE='id pinned tag'
 case $prev in
@@ -676,7 +677,7 @@ _bpftool()
 ;;
 
ingress|egress|sock_create|sock_ops|device|bind4|bind6|\
 post_bind4|post_bind6|connect4|connect6|sendmsg4|\
-sendmsg6|sysctl)
+sendmsg6|recvmsg4|recvmsg6|sysctl)
 COMPREPLY=( $( compgen -W "$PROG_TYPE" -- \
 "$cur" ) )
 return 0
diff --git a/tools/bpf/bpftool/cgroup.c b/tools/bpf/bpftool/cgroup.c
index 7e22f115c8c1..73ec8ea33fb4 100644
--- a/tools/bpf/bpftool/cgroup.c
+++ b/tools/bpf/bpftool/cgroup.c
@@ -25,7 +25,8 @@
"   ATTACH_TYPE := { ingress | egress | sock_create |\n"   \
"sock_ops | device | bind4 | bind6 

[PATCH bpf v3 2/6] bpf: sync tooling uapi header

2019-06-06 Thread Daniel Borkmann
Sync BPF uapi header in order to pull in BPF_CGROUP_UDP{4,6}_RECVMSG
attach types. This is done and preferred as an extra patch in order
to ease sync of libbpf.

Signed-off-by: Daniel Borkmann 
Acked-by: Andrey Ignatov 
Acked-by: Martin KaFai Lau 
---
 tools/include/uapi/linux/bpf.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 63e0cf66f01a..e4114a7e4451 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -192,6 +192,8 @@ enum bpf_attach_type {
BPF_LIRC_MODE2,
BPF_FLOW_DISSECTOR,
BPF_CGROUP_SYSCTL,
+   BPF_CGROUP_UDP4_RECVMSG,
+   BPF_CGROUP_UDP6_RECVMSG,
__MAX_BPF_ATTACH_TYPE
 };
 
-- 
2.17.1



[PATCH bpf v3 3/6] bpf, libbpf: enable recvmsg attach types

2019-06-06 Thread Daniel Borkmann
Another trivial patch to libbpf in order to enable identifying and
attaching programs to BPF_CGROUP_UDP{4,6}_RECVMSG by section name.

Signed-off-by: Daniel Borkmann 
---
 tools/lib/bpf/libbpf.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 5d046cc7b207..151f7ac1882e 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -3210,6 +3210,10 @@ static const struct {
BPF_CGROUP_UDP4_SENDMSG),
BPF_EAPROG_SEC("cgroup/sendmsg6",   BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
BPF_CGROUP_UDP6_SENDMSG),
+   BPF_EAPROG_SEC("cgroup/recvmsg4",   BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
+   BPF_CGROUP_UDP4_RECVMSG),
+   BPF_EAPROG_SEC("cgroup/recvmsg6",   BPF_PROG_TYPE_CGROUP_SOCK_ADDR,
+   BPF_CGROUP_UDP6_RECVMSG),
BPF_EAPROG_SEC("cgroup/sysctl", BPF_PROG_TYPE_CGROUP_SYSCTL,
BPF_CGROUP_SYSCTL),
 };
-- 
2.17.1
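
For illustration, a hedged user space sketch of how the new section name can
be used end to end with libbpf; include paths, function name and error
handling are simplified and not part of the patch:

  #include <fcntl.h>
  #include <unistd.h>
  #include <bpf/bpf.h>
  #include <bpf/libbpf.h>

  int attach_recvmsg4(const char *obj_path, const char *cgroup_path)
  {
          struct bpf_program *prog;
          struct bpf_object *obj;
          int cg_fd, err;

          obj = bpf_object__open(obj_path);
          if (libbpf_get_error(obj))
                  return -1;
          err = bpf_object__load(obj);
          if (err)
                  return err;
          /* The section name alone lets libbpf infer both the program
           * type and the expected attach type after this patch.
           */
          prog = bpf_object__find_program_by_title(obj, "cgroup/recvmsg4");
          if (!prog)
                  return -1;
          cg_fd = open(cgroup_path, O_RDONLY);
          if (cg_fd < 0)
                  return -1;
          err = bpf_prog_attach(bpf_program__fd(prog), cg_fd,
                                BPF_CGROUP_UDP4_RECVMSG, 0);
          close(cg_fd);
          return err;
  }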



pull-request: bpf 2019-06-07

2019-06-07 Thread Daniel Borkmann
Hi David,

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix several bugs in riscv64 JIT code emission which forgot to clear high
   32-bits for alu32 ops, from Björn and Luke with selftests covering all
   relevant BPF alu ops from Björn and Jiong.

2) Two fixes for UDP BPF reuseport that avoid calling the program in case of
   __udp6_lib_err and UDP GRO which broke reuseport_select_sock() assumption
   that skb->data is pointing to transport header, from Martin.

3) Two fixes for BPF sockmap: a use-after-free from sleep in psock's backlog
   workqueue, and a missing restore of sk_write_space when psock gets dropped,
   from Jakub and John.

4) Fix unconnected UDP sendmsg hook API which is insufficient as-is since it
   breaks standard applications like DNS if reverse NAT is not performed upon
   receive, from Daniel.

5) Fix an out-of-bounds read in __bpf_skc_lookup which in case of AF_INET6
   fails to verify that the length of the tuple is long enough, from Lorenz.

6) Fix libbpf's libbpf__probe_raw_btf to return an fd instead of 0/1 (for
   {un,}successful probe) as that is expected to be propagated as an fd to
   load_sk_storage_btf() and thus closing the wrong descriptor otherwise,
   from Michal.

7) Fix bpftool's JSON output for the case when a lookup fails, from Krzesimir.

8) Minor misc fixes in docs, samples and selftests, from various others.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Thanks a lot!



The following changes since commit af8f3fb7fb077c9df9fed97113a031e792163def:

  net: stmmac: dma channel control register need to be init first (2019-05-20 
20:55:39 -0400)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git 

for you to fetch changes up to 4aeba328019ab9fbaad725b923b4c11c3725b24e:

  Merge branch 'fix-unconnected-udp' (2019-06-06 16:53:13 -0700)


Alakesh Haloi (1):
  selftests: bpf: fix compiler warning in flow_dissector test

Alexei Starovoitov (2):
  Merge branch 'reuseport-fixes'
  Merge branch 'fix-unconnected-udp'

Björn Töpel (2):
  bpf, riscv: clear target register high 32-bits for and/or/xor on ALU32
  selftests: bpf: add zero extend checks for ALU32 and/or/xor

Chang-Hsien Tsai (1):
  samples, bpf: fix to change the buffer size for read()

Daniel Borkmann (7):
  Merge branch 'bpf-subreg-tests'
  bpf: fix unconnected udp hooks
  bpf: sync tooling uapi header
  bpf, libbpf: enable recvmsg attach types
  bpf, bpftool: enable recvmsg attach types
  bpf: more msg_name rewrite tests to test_sock_addr
  bpf: expand section tests for test_section_names

Hangbin Liu (1):
  selftests/bpf: move test_lirc_mode2_user to TEST_GEN_PROGS_EXTENDED

Jakub Sitnicki (1):
  bpf: sockmap, restore sk_write_space when psock gets dropped

Jiong Wang (2):
  selftests: bpf: move sub-register zero extension checks into subreg.c
  selftests: bpf: complete sub-register zero extension checks

John Fastabend (1):
  bpf: sockmap, fix use after free from sleep in psock backlog workqueue

Krzesimir Nowak (1):
  tools: bpftool: Fix JSON output when lookup fails

Lorenz Bauer (1):
  bpf: fix out-of-bounds read in __bpf_skc_lookup

Luke Nelson (1):
  bpf, riscv: clear high 32 bits for ALU32 add/sub/neg/lsh/rsh/arsh

Martin KaFai Lau (3):
  bpf: Check sk_fullsock() before returning from bpf_sk_lookup()
  bpf: udp: ipv6: Avoid running reuseport's bpf_prog from __udp6_lib_err
  bpf: udp: Avoid calling reuseport's bpf_prog from udp_gro

Matteo Croce (1):
  samples, bpf: suppress compiler warning

Michal Rostecki (1):
  libbpf: Return btf_fd for load_sk_storage_btf

Randy Dunlap (1):
  Documentation/networking: fix af_xdp.rst Sphinx warnings

 Documentation/networking/af_xdp.rst|   8 +-
 arch/riscv/net/bpf_jit_comp.c  |  24 +
 include/linux/bpf-cgroup.h |   8 +
 include/linux/skmsg.h  |   2 +
 include/uapi/linux/bpf.h   |   2 +
 kernel/bpf/syscall.c   |   8 +
 kernel/bpf/verifier.c  |  12 +-
 net/core/filter.c  |  26 +-
 net/core/skbuff.c  |   1 +
 net/ipv4/udp.c |  10 +-
 net/ipv6/udp.c |   8 +-
 samples/bpf/bpf_load.c |   2 +-
 samples/bpf/task_fd_query_user.c   |   2 +-
 tools/bpf/bpftool/Documentation/bpftool-cgroup.rst |   6 +-
 tools/bpf/bpftool/Documentation/bpftool-prog.rst   |   2 +-
 tools/bpf/

Re: [PATCH v4 bpf-next 0/2] bpf: Add a new API libbpf_num_possible_cpus()

2019-06-07 Thread Daniel Borkmann
On 06/07/2019 06:37 PM, Hechao Li wrote:
> Getting number of possible CPUs is commonly used for per-CPU BPF maps
> and perf_event_maps. Add a new API libbpf_num_possible_cpus() that
> helps user with per-CPU related operations and remove duplicate
> implementations in bpftool and selftests.
> 
> v4: Fixed error code when reading 0 bytes from possible CPU file
> 
> Hechao Li (2):
>   bpf: add a new API libbpf_num_possible_cpus()
>   bpf: use libbpf_num_possible_cpus in bpftool and selftests
> 
>  tools/bpf/bpftool/common.c | 53 +++-
>  tools/lib/bpf/libbpf.c | 57 ++
>  tools/lib/bpf/libbpf.h | 16 
>  tools/lib/bpf/libbpf.map   |  1 +
>  tools/testing/selftests/bpf/bpf_util.h | 37 +++--
>  5 files changed, 84 insertions(+), 80 deletions(-)

Series applied, thanks!

P.s.: Please retain full history (v1->v2->v3->v4) in cover letter next time as
that is typical convention and helps readers of git log to follow what has been
changed over time.
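
For illustration, a sketch of the intended usage; the wrapper function and
buffer handling are made up, and libbpf_num_possible_cpus() returns a
negative error code on failure:

  #include <errno.h>
  #include <stdlib.h>
  #include <linux/types.h>
  #include <bpf/bpf.h>
  #include <bpf/libbpf.h>

  int read_percpu_counter(int map_fd, __u32 key, __u64 **values)
  {
          int n = libbpf_num_possible_cpus();

          if (n < 0)
                  return n; /* negative error code from libbpf */
          *values = calloc(n, sizeof(__u64));
          if (!*values)
                  return -ENOMEM;
          /* The kernel fills one 8-byte slot per possible CPU. */
          return bpf_map_lookup_elem(map_fd, &key, *values);
  }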


Re: [PATCH v4 bpf-next 0/2] bpf: Add a new API

2019-06-07 Thread Daniel Borkmann
On 06/08/2019 01:25 AM, Roman Gushchin wrote:
> On 06/07/2019 06:37 PM, Hechao Li wrote:
>> Getting number of possible CPUs is commonly used for per-CPU BPF maps
>> and perf_event_maps. Add a new API libbpf_num_possible_cpus() that
>> helps user with per-CPU related operations and remove duplicate
>> implementations in bpftool and selftests.
>>
>> v4: Fixed error code when reading 0 bytes from possible CPU file
>>
>> Hechao Li (2):
>>   bpf: add a new API libbpf_num_possible_cpus()
>>   bpf: use libbpf_num_possible_cpus in bpftool and selftests
>>
>>  tools/bpf/bpftool/common.c | 53 +++-
>>  tools/lib/bpf/libbpf.c | 57 ++
>>  tools/lib/bpf/libbpf.h | 16 
>>  tools/lib/bpf/libbpf.map   |  1 +
>>  tools/testing/selftests/bpf/bpf_util.h | 37 +++--
>>  5 files changed, 84 insertions(+), 80 deletions(-)
> 
>> Series applied, thanks!
> 
>> P.s.: Please retain full history (v1->v2->v3->v4) in cover letter next time 
>> as
>> that is typical convention and helps readers of git log to follow what has 
>> been
>> changed over time.
> 
> 
> Hello!
> 
> I'm getting the following errors on an attempt to build bpf tests.
> Reverting the last patch fixes it.
> 
[...]
> 
> clang -I. -I./include/uapi -I../../../include/uapi -idirafter 
> /usr/local/include -idirafter 
> /opt/fb/devtoolset/bin/../lib/clang/4.0.0/include -idirafter /usr/include 
> -Wno-compare-distinct-pointer-types \
>-O2 -target bpf -emit-llvm -c progs/sockmap_parse_prog.c -o - |  \
> llc -march=bpf -mcpu=generic  -filetype=obj -o 
> /data/users/guro/linux/tools/testing/selftests/bpf/sockmap_parse_prog.o
> In file included from progs/sockmap_parse_prog.c:3:
> ./bpf_util.h:9:10: fatal error: 'libbpf.h' file not found
> #include 
>  ^~
> 1 error generated.

True, I've therefore tossed the series from bpf-next. Hechao, please fix and 
resubmit.


Re: [PATCH v2 bpf-next] selftests/bpf: fix "alu with different scalars 1" on s390

2019-07-15 Thread Daniel Borkmann
On 7/16/19 12:13 AM, Alexei Starovoitov wrote:
> On Thu, Jul 4, 2019 at 1:53 AM Ilya Leoshkevich  wrote:
>>
>> BPF_LDX_MEM is used to load the least significant byte of the retrieved
>> test_val.index, however, on big-endian machines it ends up retrieving
>> the most significant byte.
>>
>> Use the correct least significant byte offset on big-endian machines.
>>
>> Signed-off-by: Ilya Leoshkevich 
>> ---
>>
>> v1->v2:
>> - use __BYTE_ORDER instead of __BYTE_ORDER__.
>>
>>  tools/testing/selftests/bpf/verifier/value_ptr_arith.c | 4 
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/tools/testing/selftests/bpf/verifier/value_ptr_arith.c 
>> b/tools/testing/selftests/bpf/verifier/value_ptr_arith.c
>> index c3de1a2c9dc5..e5940c4e8b8f 100644
>> --- a/tools/testing/selftests/bpf/verifier/value_ptr_arith.c
>> +++ b/tools/testing/selftests/bpf/verifier/value_ptr_arith.c
>> @@ -183,7 +183,11 @@
>> BPF_EMIT_CALL(BPF_FUNC_map_lookup_elem),
>> BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1),
>> BPF_EXIT_INSN(),
>> +#if __BYTE_ORDER == __LITTLE_ENDIAN
>> BPF_LDX_MEM(BPF_B, BPF_REG_1, BPF_REG_0, 0),
>> +#else
>> +   BPF_LDX_MEM(BPF_B, BPF_REG_1, BPF_REG_0, sizeof(int) - 1),
>> +#endif
> 
> I think tests should be arch and endian independent where possible.
> In this case test_val.index is 4 byte and 4 byte load should work just as 
> well.

Yes, agree, this should be fixed with BPF_W as load.
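I.e., the #if/#else hunk above would then collapse to a single
endian-neutral instruction loading the full 4-byte test_val.index, along
the lines of:

  BPF_LDX_MEM(BPF_W, BPF_REG_1, BPF_REG_0, 0),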

Thanks,
Daniel


Re: [PATCH v3 bpf 0/3] fix BTF verification size resolution

2019-07-15 Thread Daniel Borkmann
On 7/12/19 7:25 PM, Andrii Nakryiko wrote:
> BTF size resolution logic isn't always resolving type size correctly, leading
> to erroneous map creation failures due to value size mismatch.
> 
> This patch set:
> 1. fixes the issue (patch #1);
> 2. adds tests for trickier cases (patch #2);
> 3. and converts few test cases utilizing BTF-defined maps, that previously
>couldn't use typedef'ed arrays due to kernel bug (patch #3).
> 
> Andrii Nakryiko (3):
>   bpf: fix BTF verifier size resolution logic
>   selftests/bpf: add trickier size resolution tests
>   selftests/bpf: use typedef'ed arrays as map values
> 
>  kernel/bpf/btf.c  | 19 ++--
>  .../bpf/progs/test_get_stack_rawtp.c  |  3 +-
>  .../bpf/progs/test_stacktrace_build_id.c  |  3 +-
>  .../selftests/bpf/progs/test_stacktrace_map.c |  2 +-
>  tools/testing/selftests/bpf/test_btf.c| 88 +++
>  5 files changed, 104 insertions(+), 11 deletions(-)
> 

Applied, thanks!


Re: [PATCH bpf] selftests/bpf: fix attach_probe on s390

2019-07-15 Thread Daniel Borkmann
On 7/12/19 3:41 PM, Ilya Leoshkevich wrote:
> attach_probe test fails, because it cannot install a kprobe on a
> non-existent sys_nanosleep symbol.
> 
> Use the correct symbol name for the nanosleep syscall on 64-bit s390.
> Don't bother adding one for 31-bit mode, since tests are compiled only
> in 64-bit mode.
> 
> Fixes: 1e8611bbdfc9 ("selftests/bpf: add kprobe/uprobe selftests")
> Signed-off-by: Ilya Leoshkevich 
> Acked-by: Vasily Gorbik 

Applied, thanks!


Re: [PATCH bpf] selftests/bpf: make directory prerequisites order-only

2019-07-15 Thread Daniel Borkmann
On 7/12/19 3:56 PM, Ilya Leoshkevich wrote:
> When directories are used as prerequisites in Makefiles, they can cause
> a lot of unnecessary rebuilds, because a directory is considered changed
> whenever a file in this directory is added, removed or modified.
> 
> If the only thing a target is interested in is the existence of the
> directory it depends on, which is the case for selftests/bpf, this
> directory should be specified as an order-only prerequisite: it would
> still be created in case it does not exist, but it would not trigger a
> rebuild of a target in case it's considered changed.
> 
> Signed-off-by: Ilya Leoshkevich 

Applied, thanks!


Re: [PATCH bpf 0/5] bpf: allow wide (u64) aligned loads for some fields of bpf_sock_addr

2019-07-15 Thread Daniel Borkmann
On 7/15/19 6:39 PM, Stanislav Fomichev wrote:
> When fixing selftests by adding support for wide stores, Yonghong
> reported that he had seen some examples where clang generates
> single u64 loads for two adjacent u32s as well:
> http://lore.kernel.org/netdev/a66c937f-94c0-eaf8-5b37-8587d66c0...@fb.com
> 
> Let's support aligned u64 reads for some bpf_sock_addr fields
> as well.
> 
> (This can probably wait for bpf-next, I'll defer to Younhong and the
> maintainers.)
> 
> Cc: Yonghong Song 
> 
> Stanislav Fomichev (5):
>   bpf: rename bpf_ctx_wide_store_ok to bpf_ctx_wide_access_ok
>   bpf: allow wide aligned loads for bpf_sock_addr user_ip6 and
> msg_src_ip6
>   selftests/bpf: rename verifier/wide_store.c to verifier/wide_access.c
>   selftests/bpf: add selftests for wide loads
>   bpf: sync bpf.h to tools/
> 
>  include/linux/filter.h|  2 +-
>  include/uapi/linux/bpf.h  |  4 +-
>  net/core/filter.c | 24 --
>  tools/include/uapi/linux/bpf.h|  4 +-
>  .../selftests/bpf/verifier/wide_access.c  | 73 +++
>  .../selftests/bpf/verifier/wide_store.c   | 36 -
>  6 files changed, 95 insertions(+), 48 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/verifier/wide_access.c
>  delete mode 100644 tools/testing/selftests/bpf/verifier/wide_store.c
> 

Applied, thanks!
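
For illustration, a sketch of an access pattern this series makes the
verifier accept; the policy itself is made up:

  #include <linux/bpf.h>
  #include "bpf_helpers.h"

  SEC("cgroup/sendmsg6")
  int wide_load(struct bpf_sock_addr *ctx)
  {
          /* One aligned 8-byte read covering user_ip6[0] and
           * user_ip6[1], the pattern clang may emit on its own when
           * fusing two adjacent u32 loads.
           */
          __u64 top = *(__u64 *)&ctx->user_ip6[0];

          return top != 0; /* deny all-zero upper 64 bits (illustrative) */
  }

  char _license[] SEC("license") = "GPL";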


Re: [PATCH bpf] samples/bpf: build with -D__TARGET_ARCH_$(SRCARCH)

2019-07-15 Thread Daniel Borkmann
On 7/15/19 11:11 AM, Ilya Leoshkevich wrote:
> While $ARCH can be relatively flexible (see Makefile and
> tools/scripts/Makefile.arch), $SRCARCH always corresponds to a directory
> name under arch/.
> 
> Therefore, build samples with -D__TARGET_ARCH_$(SRCARCH), since that
> matches the expectations of bpf_helpers.h.
> 
> Signed-off-by: Ilya Leoshkevich 
> Acked-by: Vasily Gorbik 

Applied, thanks!


Re: [PATCH bpf] selftests/bpf: put test_stub.o into $(OUTPUT)

2019-07-15 Thread Daniel Borkmann
On 7/12/19 3:59 PM, Ilya Leoshkevich wrote:
> Add a rule to put test_stub.o in $(OUTPUT) and change the references to
> it accordingly. This prevents test_stub.o from being created in the
> source directory.
> 
> Signed-off-by: Ilya Leoshkevich 

Applied, thanks!


Re: [PATCH v2 bpf-next] selftests/bpf: remove logic duplication in test_verifier.c

2019-07-15 Thread Daniel Borkmann
On 7/12/19 7:44 PM, Andrii Nakryiko wrote:
> test_verifier tests can specify single- and multi-runs tests. Internally
> logic of handling them is duplicated. Get rid of it by making single run
> retval/data specification to be a first run spec.
> 
> Cc: Krzesimir Nowak 
> Signed-off-by: Andrii Nakryiko 

Applied, thanks!


Re: [PATCH bpf v4 00/14] sockmap/tls fixes

2019-07-22 Thread Daniel Borkmann
On 7/19/19 7:37 PM, Jakub Kicinski wrote:
> On Fri, 19 Jul 2019 10:29:13 -0700, Jakub Kicinski wrote:
>> John says:
>>
>> Resolve a series of splats discovered by syzbot and an unhash
>> TLS issue noted by Eric Dumazet.
> 
> Sorry for the delay, this code is quite tricky. According to my testing
> TLS SW and HW should now work, I hope I didn't regress things on the
> sockmap side.

Applied, thanks everyone!


Re: [GIT PULL 0/2] libbpf build fixes

2019-07-22 Thread Daniel Borkmann
On 7/19/19 4:34 PM, Arnaldo Carvalho de Melo wrote:
> Hi Daniel,
> 
>   Please consider pulling or applying from the patches, if someone
> has any issues, please holler,
> 
> - Arnaldo
> 
> Arnaldo Carvalho de Melo (2):
>   libbpf: Fix endianness macro usage for some compilers
>   libbpf: Avoid designated initializers for unnamed union members
> 
>  tools/lib/bpf/btf.c|  5 +++--
>  tools/lib/bpf/libbpf.c | 19 ++-
>  2 files changed, 13 insertions(+), 11 deletions(-)
> 

Applied, thanks!


Re: [PATCH bpf] selftests/bpf: fix sendmsg6_prog on s390

2019-07-22 Thread Daniel Borkmann
On 7/19/19 11:06 AM, Ilya Leoshkevich wrote:
> "sendmsg6: rewrite IP & port (C)" fails on s390, because the code in
> sendmsg_v6_prog() assumes that (ctx->user_ip6[0] & 0x) refers to
> leading IPv6 address digits, which is not the case on big-endian
> machines.
> 
> Since checking bitwise operations doesn't seem to be the point of the
> test, replace two short comparisons with a single int comparison.
> 
> Signed-off-by: Ilya Leoshkevich 

Applied, thanks!
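
For illustration, a sketch of the endian-neutral check described above; the
0xfaceb00c constant mirrors the test's face:b00c::... prefix, and the program
body is trimmed to the comparison:

  #include <linux/bpf.h>
  #include "bpf_helpers.h"
  #include "bpf_endian.h"

  SEC("cgroup/sendmsg6")
  int sendmsg_v6(struct bpf_sock_addr *ctx)
  {
          /* Compare the leading 32 bits of the destination in one go
           * instead of two 16-bit halves, independent of byte order.
           */
          if (ctx->user_ip6[0] != bpf_htonl(0xfaceb00c))
                  return 0; /* deny */
          return 1; /* allow */
  }

  char _license[] SEC("license") = "GPL";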


bpf-next is OPEN

2019-07-22 Thread Daniel Borkmann
Merge window is closed, therefore new round begins.

Thanks everyone,
Daniel


[PATCH net 1/2] sock: make cookie generation global instead of per netns

2019-08-08 Thread Daniel Borkmann
Generating and retrieving socket cookies are a useful feature that is
exposed to BPF for various program types through bpf_get_socket_cookie()
helper.

The fact that the cookie counter is per netns is quite a limitation
for BPF in practice in particular for programs in host namespace that
use socket cookies as part of a map lookup key since they will be
causing socket cookie collisions e.g. when attached to BPF cgroup hooks
or cls_bpf on tc egress in host namespace handling container traffic
from veths or ipvlan slaves. Change the counter to be global instead.

Socket cookie consumers must assume the value as opaque in any case.
The cookie does not guarantee an always unique identifier since it
could wrap in fabricated corner cases where two sockets could end up
holding the same cookie, but is good enough to be used as a hint for
many use cases; not every socket must have a cookie generated hence
knowledge of the counter value does not provide much value either way.

Signed-off-by: Daniel Borkmann 
Cc: Eric Dumazet 
Cc: Alexei Starovoitov 
Cc: Willem de Bruijn 
Cc: Martynas Pumputis 
---
 include/net/net_namespace.h | 1 -
 include/uapi/linux/bpf.h| 4 ++--
 net/core/sock_diag.c| 3 ++-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 4a9da951a794..cb668bc2692d 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -61,7 +61,6 @@ struct net {
spinlock_t  rules_mod_lock;
 
u32 hash_mix;
-   atomic64_t  cookie_gen;
 
struct list_headlist;   /* list of network namespaces */
struct list_headexit_list;  /* To linked to call pernet exit
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index fa1c753dcdbc..a5aa7d3ac6a1 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1466,8 +1466,8 @@ union bpf_attr {
  * If no cookie has been set yet, generate a new cookie. Once
  * generated, the socket cookie remains stable for the life of the
  * socket. This helper can be useful for monitoring per socket
- * networking traffic statistics as it provides a unique socket
- * identifier per namespace.
+ * networking traffic statistics as it provides a global socket
+ * identifier that can be assumed unique.
  * Return
  * A 8-byte long non-decreasing number on success, or 0 if the
  * socket field is missing inside *skb*.
diff --git a/net/core/sock_diag.c b/net/core/sock_diag.c
index 3312a5849a97..c13ffbd33d8d 100644
--- a/net/core/sock_diag.c
+++ b/net/core/sock_diag.c
@@ -19,6 +19,7 @@ static const struct sock_diag_handler 
*sock_diag_handlers[AF_MAX];
 static int (*inet_rcv_compat)(struct sk_buff *skb, struct nlmsghdr *nlh);
 static DEFINE_MUTEX(sock_diag_table_mutex);
 static struct workqueue_struct *broadcast_wq;
+static atomic64_t cookie_gen;
 
 u64 sock_gen_cookie(struct sock *sk)
 {
@@ -27,7 +28,7 @@ u64 sock_gen_cookie(struct sock *sk)
 
if (res)
return res;
-   res = atomic64_inc_return(&sock_net(sk)->cookie_gen);
+   res = atomic64_inc_return(&cookie_gen);
atomic64_cmpxchg(&sk->sk_cookie, 0, res);
}
 }
-- 
2.17.1
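
For illustration, a sketch of the map-key use case mentioned above: a
cgroup_skb program accounting egress bytes per socket cookie. Map name and
sizing are illustrative:

  #include <linux/bpf.h>
  #include "bpf_helpers.h"

  struct bpf_map_def SEC("maps") bytes_per_sock = {
          .type        = BPF_MAP_TYPE_HASH,
          .key_size    = sizeof(__u64), /* socket cookie */
          .value_size  = sizeof(__u64), /* byte counter  */
          .max_entries = 65536,
  };

  SEC("cgroup_skb/egress")
  int count_egress(struct __sk_buff *skb)
  {
          __u64 cookie = bpf_get_socket_cookie(skb);
          __u64 *bytes, init = skb->len;

          if (!cookie)
                  return 1;
          bytes = bpf_map_lookup_elem(&bytes_per_sock, &cookie);
          if (bytes)
                  __sync_fetch_and_add(bytes, skb->len);
          else
                  bpf_map_update_elem(&bytes_per_sock, &cookie, &init,
                                      BPF_ANY);
          return 1;
  }

  char _license[] SEC("license") = "GPL";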



[PATCH net 2/2] bpf: sync bpf.h to tools infrastructure

2019-08-08 Thread Daniel Borkmann
Pull in updates in BPF helper function description.

Signed-off-by: Daniel Borkmann 
---
 tools/include/uapi/linux/bpf.h | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 4e455018da65..a5aa7d3ac6a1 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1466,8 +1466,8 @@ union bpf_attr {
  * If no cookie has been set yet, generate a new cookie. Once
  * generated, the socket cookie remains stable for the life of the
  * socket. This helper can be useful for monitoring per socket
- * networking traffic statistics as it provides a unique socket
- * identifier per namespace.
+ * networking traffic statistics as it provides a global socket
+ * identifier that can be assumed unique.
  * Return
  * A 8-byte long non-decreasing number on success, or 0 if the
  * socket field is missing inside *skb*.
@@ -1571,8 +1571,11 @@ union bpf_attr {
  * but this is only implemented for native XDP (with driver
  * support) as of this writing).
  *
- * All values for *flags* are reserved for future usage, and must
- * be left at zero.
+ * The lower two bits of *flags* are used as the return code if
+ * the map lookup fails. This is so that the return value can be
+ * one of the XDP program return codes up to XDP_TX, as chosen by
+ * the caller. Any higher bits in the *flags* argument must be
+ * unset.
  *
  * When used to redirect packets to net devices, this helper
  * provides a high performance increase over **bpf_redirect**\ ().
-- 
2.17.1
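
For illustration, a sketch of an XDP program using the updated contract:
encoding XDP_PASS in the low bits of flags lets a missing devmap entry fall
through to the stack instead of XDP_ABORTED. Devmap layout and keying by
ingress ifindex are made up:

  #include <linux/bpf.h>
  #include "bpf_helpers.h"

  struct bpf_map_def SEC("maps") tx_port = {
          .type        = BPF_MAP_TYPE_DEVMAP,
          .key_size    = sizeof(int),
          .value_size  = sizeof(int),
          .max_entries = 64,
  };

  SEC("xdp")
  int xdp_redirect_pass(struct xdp_md *ctx)
  {
          /* Lower two bits of flags = return code on lookup miss. */
          return bpf_redirect_map(&tx_port, ctx->ingress_ifindex, XDP_PASS);
  }

  char _license[] SEC("license") = "GPL";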



[PATCH net 0/2] Fix collisions in socket cookie generation

2019-08-08 Thread Daniel Borkmann
This change makes the socket cookie generator as a global counter
instead of per netns in order to fix cookie collisions for BPF use
cases we ran into. See main patch #1 for more details.

Given the change is small/trivial and fixes an issue we're seeing
my preference would be net tree (though it cleanly applies to
net-next as well). Went for net tree instead of bpf tree here given
the main change is in net/core/sock_diag.c, but either way would be
fine with me.

Thanks a lot!

Daniel Borkmann (2):
  sock: make cookie generation global instead of per netns
  bpf: sync bpf.h to tools infrastructure

 include/net/net_namespace.h|  1 -
 include/uapi/linux/bpf.h   |  4 ++--
 net/core/sock_diag.c   |  3 ++-
 tools/include/uapi/linux/bpf.h | 11 +++
 4 files changed, 11 insertions(+), 8 deletions(-)

-- 
2.17.1



Re: [PATCH net 1/2] sock: make cookie generation global instead of per netns

2019-08-08 Thread Daniel Borkmann

On 8/8/19 12:45 PM, Eric Dumazet wrote:

On Thu, Aug 8, 2019 at 11:50 AM Daniel Borkmann  wrote:


Socket cookie consumers must assume the value as opaque in any case.
The cookie does not guarantee an always unique identifier since it
could wrap in fabricated corner cases where two sockets could end up
holding the same cookie,


What do you mean by this ?

Cookie is guaranteed to be unique, it is from a 64bit counter...

There should be no collision.


I meant the [theoretical] corner case where socket_1 has cookie X and
we'd create a socket, trigger sock_gen_cookie() to increment the counter,
and close the socket again in a loop until the counter wraps and socket_2
ends up with the same cookie X; agreed, it's impractical and for little
gain anyway. So in practice there should be no collision, which is what
I tried to say.

Thanks,
Daniel


Re: [PATCH net 1/2] sock: make cookie generation global instead of per netns

2019-08-08 Thread Daniel Borkmann

On 8/8/19 1:37 PM, Eric Dumazet wrote:

On Thu, Aug 8, 2019 at 1:09 PM Daniel Borkmann  wrote:

On 8/8/19 12:45 PM, Eric Dumazet wrote:

On Thu, Aug 8, 2019 at 11:50 AM Daniel Borkmann  wrote:


Socket cookie consumers must assume the value as opqaue in any case.
The cookie does not guarantee an always unique identifier since it
could wrap in fabricated corner cases where two sockets could end up
holding the same cookie,


What do you mean by this ?

Cookie is guaranteed to be unique, it is from a 64bit counter...

There should be no collision.


I meant the [theoretical] corner case where socket_1 has cookie X and
we'd create a socket, trigger sock_gen_cookie() to increment the counter,
and close the socket again in a loop until the counter wraps and socket_2
ends up with the same cookie X; agreed, it's impractical and for little
gain anyway. So in practice there should be no collision, which is what
I tried to say.


If a 64bit counter, updated by one unit at a time, could overflow
during the lifetime of a host, I would agree with you, but this cannot
happen, even if we managed to do 1 billion locked increments per second
(it would still take 584 years).

I would prefer not mentioning something that can not possibly happen
in your changelog ;)


Yep fair enough, makes sense. I'll fix it :)
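
For the record, the arithmetic above checks out; a throwaway sanity check:

  #include <stdio.h>

  int main(void)
  {
          /* 2^64 increments at 10^9 per second, expressed in years. */
          double secs = 18446744073709551616.0 / 1e9;

          printf("%.0f years\n", secs / (365.0 * 24 * 3600)); /* ~585 */
          return 0;
  }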


[PATCH net v2 1/2] sock: make cookie generation global instead of per netns

2019-08-08 Thread Daniel Borkmann
Generating and retrieving socket cookies are a useful feature that is
exposed to BPF for various program types through bpf_get_socket_cookie()
helper.

The fact that the cookie counter is per netns is quite a limitation
for BPF in practice in particular for programs in host namespace that
use socket cookies as part of a map lookup key since they will be
causing socket cookie collisions e.g. when attached to BPF cgroup hooks
or cls_bpf on tc egress in host namespace handling container traffic
from veth or ipvlan devices with peer in different netns. Change the
counter to be global instead.

Socket cookie consumers must assume the value as opaque in any case.
Not every socket must have a cookie generated and knowledge of the
counter value itself does not provide much value either way hence
conversion to global is fine.

Signed-off-by: Daniel Borkmann 
Cc: Eric Dumazet 
Cc: Alexei Starovoitov 
Cc: Willem de Bruijn 
Cc: Martynas Pumputis 
---
 include/net/net_namespace.h | 1 -
 include/uapi/linux/bpf.h| 4 ++--
 net/core/sock_diag.c| 3 ++-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 4a9da951a794..cb668bc2692d 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -61,7 +61,6 @@ struct net {
spinlock_t  rules_mod_lock;
 
u32 hash_mix;
-   atomic64_t  cookie_gen;
 
struct list_headlist;   /* list of network namespaces */
struct list_headexit_list;  /* To linked to call pernet exit
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index fa1c753dcdbc..a5aa7d3ac6a1 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1466,8 +1466,8 @@ union bpf_attr {
  * If no cookie has been set yet, generate a new cookie. Once
  * generated, the socket cookie remains stable for the life of the
  * socket. This helper can be useful for monitoring per socket
- * networking traffic statistics as it provides a unique socket
- * identifier per namespace.
+ * networking traffic statistics as it provides a global socket
+ * identifier that can be assumed unique.
  * Return
  * A 8-byte long non-decreasing number on success, or 0 if the
  * socket field is missing inside *skb*.
diff --git a/net/core/sock_diag.c b/net/core/sock_diag.c
index 3312a5849a97..c13ffbd33d8d 100644
--- a/net/core/sock_diag.c
+++ b/net/core/sock_diag.c
@@ -19,6 +19,7 @@ static const struct sock_diag_handler *sock_diag_handlers[AF_MAX];
 static int (*inet_rcv_compat)(struct sk_buff *skb, struct nlmsghdr *nlh);
 static DEFINE_MUTEX(sock_diag_table_mutex);
 static struct workqueue_struct *broadcast_wq;
+static atomic64_t cookie_gen;
 
 u64 sock_gen_cookie(struct sock *sk)
 {
@@ -27,7 +28,7 @@ u64 sock_gen_cookie(struct sock *sk)
 
if (res)
return res;
-   res = atomic64_inc_return(&sock_net(sk)->cookie_gen);
+   res = atomic64_inc_return(&cookie_gen);
atomic64_cmpxchg(&sk->sk_cookie, 0, res);
}
 }
-- 
2.17.1
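
For readers without the tree at hand, a sketch of how the full function
reads after this change; the retry loop is context that the hunk above
elides, reconstructed here from net/core/sock_diag.c:

  static atomic64_t cookie_gen;

  u64 sock_gen_cookie(struct sock *sk)
  {
          while (1) {
                  u64 res = atomic64_read(&sk->sk_cookie);

                  if (res)
                          return res;
                  /* Take the next value from the now-global counter; the
                   * cmpxchg ensures only the first writer wins if two CPUs
                   * race here, the loser re-reads the winner's value on
                   * the next loop iteration.
                   */
                  res = atomic64_inc_return(&cookie_gen);
                  atomic64_cmpxchg(&sk->sk_cookie, 0, res);
          }
  }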



[PATCH net v2 0/2] Fix collisions in socket cookie generation

2019-08-08 Thread Daniel Borkmann
This change makes the socket cookie generator as a global counter
instead of per netns in order to fix cookie collisions for BPF use
cases we ran into. See main patch #1 for more details.

Given the change is small/trivial and fixes an issue we're seeing
my preference would be net tree (though it cleanly applies to
net-next as well). Went for net tree instead of bpf tree here given
the main change is in net/core/sock_diag.c, but either way would be
fine with me.

Thanks a lot!

v1 -> v2:
  - Fix up commit description in patch #1, thanks Eric!

Daniel Borkmann (2):
  sock: make cookie generation global instead of per netns
  bpf: sync bpf.h to tools infrastructure

 include/net/net_namespace.h|  1 -
 include/uapi/linux/bpf.h   |  4 ++--
 net/core/sock_diag.c   |  3 ++-
 tools/include/uapi/linux/bpf.h | 11 +++
 4 files changed, 11 insertions(+), 8 deletions(-)

-- 
2.17.1



[PATCH net v2 2/2] bpf: sync bpf.h to tools infrastructure

2019-08-08 Thread Daniel Borkmann
Pull in updates in BPF helper function description.

Signed-off-by: Daniel Borkmann 
---
 tools/include/uapi/linux/bpf.h | 11 +++
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 4e455018da65..a5aa7d3ac6a1 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1466,8 +1466,8 @@ union bpf_attr {
  * If no cookie has been set yet, generate a new cookie. Once
  * generated, the socket cookie remains stable for the life of the
  * socket. This helper can be useful for monitoring per socket
- * networking traffic statistics as it provides a unique socket
- * identifier per namespace.
+ * networking traffic statistics as it provides a global socket
+ * identifier that can be assumed unique.
  * Return
  * A 8-byte long non-decreasing number on success, or 0 if the
  * socket field is missing inside *skb*.
@@ -1571,8 +1571,11 @@ union bpf_attr {
  * but this is only implemented for native XDP (with driver
  * support) as of this writing).
  *
- * All values for *flags* are reserved for future usage, and must
- * be left at zero.
+ * The lower two bits of *flags* are used as the return code if
+ * the map lookup fails. This is so that the return value can be
+ * one of the XDP program return codes up to XDP_TX, as chosen by
+ * the caller. Any higher bits in the *flags* argument must be
+ * unset.
  *
  * When used to redirect packets to net devices, this helper
  * provides a high performance increase over **bpf_redirect**\ ().
-- 
2.17.1
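
To illustrate the *flags* semantics documented above, a hedged sketch of
an XDP program that asks for XDP_PASS as the fallback action when the
XSKMAP lookup fails (map and program names are made up):

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  struct {
          __uint(type, BPF_MAP_TYPE_XSKMAP);
          __uint(max_entries, 64);
          __uint(key_size, sizeof(int));
          __uint(value_size, sizeof(int));
  } xsks_map SEC(".maps");

  SEC("xdp")
  int redirect_or_pass(struct xdp_md *ctx)
  {
          int idx = ctx->rx_queue_index;

          /* XDP_PASS in the low bits of flags: if no AF_XDP socket is
           * bound at idx, the packet continues up the regular stack
           * instead of being dropped.
           */
          return bpf_redirect_map(&xsks_map, idx, XDP_PASS);
  }

  char _license[] SEC("license") = "GPL";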



Re: [PATCH bpf 0/2] tools: bpftool: fix pinning error messages

2019-08-09 Thread Daniel Borkmann

On 8/7/19 2:19 AM, Jakub Kicinski wrote:

Hi!

First make sure we don't use "prog" in error messages because
the pinning operation could be performed on a map. Second add
back missing error message if pin syscall failed.

Jakub Kicinski (2):
   tools: bpftool: fix error message (prog -> object)
   tools: bpftool: add error message on pin failure

  tools/bpf/bpftool/common.c | 8 ++--
  1 file changed, 6 insertions(+), 2 deletions(-)



Applied, thanks!


Re: [bpf-next v3 PATCH 0/3] bpf: improvements to xdp_fwd sample

2019-08-09 Thread Daniel Borkmann

On 8/8/19 6:17 PM, Jesper Dangaard Brouer wrote:

V3: Hopefully fixed all issues point out by Yonghong Song

V2: Addressed issues point out by Yonghong Song
  - Please ACK patch 2/3 again
  - Added ACKs and reviewed-by to other patches

This patchset is focused on improvements for XDP forwarding sample
named xdp_fwd, which leverage the existing FIB routing tables as
described in LPC2018[1] talk by David Ahern.

The primary motivation is to illustrate how Toke's recent work
improves usability of XDP_REDIRECT via lookups in devmap. The other
patches are to help users understand the sample.

I have more improvements to xdp_fwd, but those might requires changes
to libbpf.  Thus, sending these patches first as they are isolated.

[1] http://vger.kernel.org/lpc-networking2018.html#session-1

---

Jesper Dangaard Brouer (3):
   samples/bpf: xdp_fwd rename devmap name to be xdp_tx_ports
   samples/bpf: make xdp_fwd more practically usable via devmap lookup
   samples/bpf: xdp_fwd explain bpf_fib_lookup return codes


  samples/bpf/xdp_fwd_kern.c |   39 ++-
  samples/bpf/xdp_fwd_user.c |   35 +++
  2 files changed, 53 insertions(+), 21 deletions(-)

--



Applied, thanks!


pull-request: bpf 2019-08-11

2019-08-11 Thread Daniel Borkmann
Hi David,

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) x64 JIT code generation fix for backward-jumps to 1st insn, from Alexei.

2) Fix buggy multi-closing of BTF file descriptor in libbpf, from Andrii.

3) Fix libbpf_num_possible_cpus() to make it thread safe, from Takshak.

4) Fix bpftool to dump an error if pinning fails, from Jakub.

Please consider pulling these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git

Thanks a lot!



The following changes since commit 0a062ba725cdad3b167782179ee914a8402a0184:

  Merge tag 'mlx5-fixes-2019-07-25' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux (2019-07-26 14:26:41 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf.git 

for you to fetch changes up to 4f7aafd78aeaf18a4f6dea9415df60e745c9dfa7:

  Merge branch 'bpf-bpftool-pinning-error-msg' (2019-08-09 17:38:53 +0200)


Alexei Starovoitov (2):
  bpf: fix x64 JIT code generation for jmp to 1st insn
  selftests/bpf: tests for jmp to 1st insn

Andrii Nakryiko (2):
  libbpf: fix erroneous multi-closing of BTF FD
  libbpf: set BTF FD for prog only when there is supported .BTF.ext data

Daniel Borkmann (1):
  Merge branch 'bpf-bpftool-pinning-error-msg'

Jakub Kicinski (2):
  tools: bpftool: fix error message (prog -> object)
  tools: bpftool: add error message on pin failure

Takshak Chahande (1):
  libbpf : make libbpf_num_possible_cpus function thread safe

 arch/x86/net/bpf_jit_comp.c   |  9 
 tools/bpf/bpftool/common.c|  8 +--
 tools/lib/bpf/libbpf.c| 33 +++
 tools/testing/selftests/bpf/verifier/loops1.c | 28 +++
 4 files changed, 57 insertions(+), 21 deletions(-)


Re: [PATCH v3] tools: bpftool: fix reading from /proc/config.gz

2019-08-12 Thread Daniel Borkmann

On 8/9/19 2:39 AM, Peter Wu wrote:

/proc/config has never existed as far as I can see, but /proc/config.gz
is present on Arch Linux. Add support for decompressing config.gz using
zlib, which is a mandatory dependency of libelf. Replace existing stdio
functions with gzFile operations since the latter transparently handles
uncompressed and gzip-compressed files.

Cc: Quentin Monnet 
Signed-off-by: Peter Wu 


Applied, thanks. Please follow-up with a zlib feature test as suggested
by Jakub.

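A standalone sketch of the gzFile pattern described above (not the bpftool
code itself; the in-tree implementation lives under tools/bpf/bpftool/):

  /* Build with: cc -o dumpcfg dumpcfg.c -lz */
  #include <stdio.h>
  #include <string.h>
  #include <zlib.h>

  int main(void)
  {
          gzFile file = gzopen("/proc/config.gz", "r");
          char buf[4096];

          if (!file)
                  return 1;
          /* gzgets() works like fgets() but decompresses on the fly; it
           * also reads plain uncompressed files unchanged, which is what
           * makes it a drop-in replacement for the old stdio code.
           */
          while (gzgets(file, buf, sizeof(buf)))
                  if (!strncmp(buf, "CONFIG_BPF", 10))
                          fputs(buf, stdout);
          gzclose(file);
          return 0;
  }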

Re: [PATCH bpf-next v2 2/4] bpf: support cloning sk storage on accept()

2019-08-12 Thread Daniel Borkmann

On 8/9/19 6:10 PM, Stanislav Fomichev wrote:

Add new helper bpf_sk_storage_clone which optionally clones sk storage
and call it from sk_clone_lock.

Cc: Martin KaFai Lau 
Cc: Yonghong Song 
Signed-off-by: Stanislav Fomichev 

[...]

+int bpf_sk_storage_clone(const struct sock *sk, struct sock *newsk)
+{
+   struct bpf_sk_storage *new_sk_storage = NULL;
+   struct bpf_sk_storage *sk_storage;
+   struct bpf_sk_storage_elem *selem;
+   int ret;
+
+   RCU_INIT_POINTER(newsk->sk_bpf_storage, NULL);
+
+   rcu_read_lock();
+   sk_storage = rcu_dereference(sk->sk_bpf_storage);
+
+   if (!sk_storage || hlist_empty(&sk_storage->list))
+   goto out;
+
+   hlist_for_each_entry_rcu(selem, &sk_storage->list, snode) {
+   struct bpf_sk_storage_elem *copy_selem;
+   struct bpf_sk_storage_map *smap;
+   struct bpf_map *map;
+   int refold;
+
+   smap = rcu_dereference(SDATA(selem)->smap);
+   if (!(smap->map.map_flags & BPF_F_CLONE))
+   continue;
+
+   map = bpf_map_inc_not_zero(&smap->map, false);
+   if (IS_ERR(map))
+   continue;
+
+   copy_selem = bpf_sk_storage_clone_elem(newsk, smap, selem);
+   if (!copy_selem) {
+   ret = -ENOMEM;
+   bpf_map_put(map);
+   goto err;
+   }
+
+   if (new_sk_storage) {
+   selem_link_map(smap, copy_selem);
+   __selem_link_sk(new_sk_storage, copy_selem);
+   } else {
+   ret = sk_storage_alloc(newsk, smap, copy_selem);
+   if (ret) {
+   kfree(copy_selem);
+   atomic_sub(smap->elem_size,
+  &newsk->sk_omem_alloc);
+   bpf_map_put(map);
+   goto err;
+   }
+
+   new_sk_storage = rcu_dereference(copy_selem->sk_storage);
+   }
+   bpf_map_put(map);


The map get/put combination /under/ RCU read lock seems a bit odd to me, could
you exactly describe the race that this would be preventing?


+   }
+
+out:
+   rcu_read_unlock();
+   return 0;
+
+err:
+   rcu_read_unlock();
+
+   bpf_sk_storage_free(newsk);
+   return ret;
+}

Thanks,
Daniel
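
For reference, a hedged sketch of how a program would opt into this
behavior once the series lands; struct conn_meta and the map name are
made up, only the BPF_F_CLONE flag is from the patch under review:

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  struct conn_meta {
          __u64 created_ns;
  };

  /* BPF_F_CLONE at map creation time requests that storage attached to
   * a listener socket is copied to its child sockets on accept().
   */
  struct {
          __uint(type, BPF_MAP_TYPE_SK_STORAGE);
          __uint(map_flags, BPF_F_NO_PREALLOC | BPF_F_CLONE);
          __type(key, int);
          __type(value, struct conn_meta);
  } conn_meta_map SEC(".maps");

  /* A program holding a fullsock pointer sk would then use:
   *
   *   struct conn_meta *m = bpf_sk_storage_get(&conn_meta_map, sk, NULL,
   *                                            BPF_SK_STORAGE_GET_F_CREATE);
   *
   * and storage created on the listener shows up on accept()ed children.
   */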


Re: [PATCH bpf-next v4 1/2] xsk: remove AF_XDP socket from map when the socket is released

2019-08-12 Thread Daniel Borkmann

On 8/2/19 10:11 AM, Björn Töpel wrote:

From: Björn Töpel 

When an AF_XDP socket is released/closed the XSKMAP still holds a
reference to the socket in a "released" state. The socket will still
use the netdev queue resource, and block newly created sockets from
attaching to that queue, but no user application can access the
fill/complete/rx/tx queues. This means that all applications need to
explicitly clear the map entry of the old "zombie state" socket. This
should be done automatically.

In this patch, the socket tracks, and holds a reference to, the maps
it resides in. When the socket is released, it removes itself from
all maps.

Suggested-by: Bruce Richardson 
Signed-off-by: Björn Töpel 


[ Sorry for the review delay, was on PTO and catching up with things. ]

Overall looks good to me, I think better than previous versions. One question /
clarification for below:


---
  include/net/xdp_sock.h |  18 +++
  kernel/bpf/xskmap.c| 113 ++---
  net/xdp/xsk.c  |  48 +
  3 files changed, 160 insertions(+), 19 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 69796d264f06..066e3ae446a8 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -50,6 +50,16 @@ struct xdp_umem {
struct list_head xsk_list;
  };
  
+/* Nodes are linked in the struct xdp_sock map_list field, and used to
+ * track which maps a certain socket resides in.
+ */
+struct xsk_map;
+struct xsk_map_node {
+   struct list_head node;
+   struct xsk_map *map;
+   struct xdp_sock **map_entry;
+};
+
  struct xdp_sock {
/* struct sock must be the first member of struct xdp_sock */
struct sock sk;
@@ -75,6 +85,9 @@ struct xdp_sock {
/* Protects generic receive. */
spinlock_t rx_lock;
u64 rx_dropped;
+   struct list_head map_list;
+   /* Protects map_list */
+   spinlock_t map_list_lock;
  };
  
  struct xdp_buff;

@@ -96,6 +109,11 @@ struct xdp_umem_fq_reuse *xsk_reuseq_swap(struct xdp_umem *umem,
  void xsk_reuseq_free(struct xdp_umem_fq_reuse *rq);
  struct xdp_umem *xdp_get_umem_from_qid(struct net_device *dev, u16 queue_id);
  
+void xsk_map_try_sock_delete(struct xsk_map *map, struct xdp_sock *xs,
+                            struct xdp_sock **map_entry);
+int xsk_map_inc(struct xsk_map *map);
+void xsk_map_put(struct xsk_map *map);
+
  static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
  {
return umem->pages[addr >> PAGE_SHIFT].addr + (addr & (PAGE_SIZE - 1));
diff --git a/kernel/bpf/xskmap.c b/kernel/bpf/xskmap.c
index 9bb96ace9fa1..780639309f6b 100644
--- a/kernel/bpf/xskmap.c
+++ b/kernel/bpf/xskmap.c
@@ -13,8 +13,71 @@ struct xsk_map {
struct bpf_map map;
struct xdp_sock **xsk_map;
struct list_head __percpu *flush_list;
+   spinlock_t lock; /* Synchronize map updates */
  };
  
+int xsk_map_inc(struct xsk_map *map)
+{
+   struct bpf_map *m = &map->map;
+
+   m = bpf_map_inc(m, false);
+   return IS_ERR(m) ? PTR_ERR(m) : 0;
+}
+
+void xsk_map_put(struct xsk_map *map)
+{
+   bpf_map_put(&map->map);
+}
+
+static struct xsk_map_node *xsk_map_node_alloc(struct xsk_map *map,
+  struct xdp_sock **map_entry)
+{
+   struct xsk_map_node *node;
+   int err;
+
+   node = kzalloc(sizeof(*node), GFP_ATOMIC | __GFP_NOWARN);
+   if (!node)
+   return NULL;
+
+   err = xsk_map_inc(map);
+   if (err) {
+   kfree(node);
+   return ERR_PTR(err);
+   }
+
+   node->map = map;
+   node->map_entry = map_entry;
+   return node;
+}
+
+static void xsk_map_node_free(struct xsk_map_node *node)
+{
+   xsk_map_put(node->map);
+   kfree(node);
+}
+
+static void xsk_map_sock_add(struct xdp_sock *xs, struct xsk_map_node *node)
+{
+   spin_lock_bh(&xs->map_list_lock);
+   list_add_tail(&node->node, &xs->map_list);
+   spin_unlock_bh(&xs->map_list_lock);
+}
+
+static void xsk_map_sock_delete(struct xdp_sock *xs,
+   struct xdp_sock **map_entry)
+{
+   struct xsk_map_node *n, *tmp;
+
+   spin_lock_bh(&xs->map_list_lock);
+   list_for_each_entry_safe(n, tmp, &xs->map_list, node) {
+   if (map_entry == n->map_entry) {
+   list_del(&n->node);
+   xsk_map_node_free(n);
+   }
+   }
+   spin_unlock_bh(&xs->map_list_lock);
+}
+
  static struct bpf_map *xsk_map_alloc(union bpf_attr *attr)
  {
struct xsk_map *m;
@@ -34,6 +97,7 @@ static struct bpf_map *xsk_map_alloc(union bpf_attr *attr)
return ERR_PTR(-ENOMEM);
  
  	bpf_map_init_from_attr(&m->map, attr);

+   spin_lock_init(&m->lock);
  
  	cost = (u64)m->map.max_entries * sizeof(struct xdp_sock *);

cost += sizeof(struct list_head) * num_possible_cpus();
@@ -71,21 +135,9 @@ static 

Re: [PATCH bpf] s390/bpf: fix lcgr instruction encoding

2019-08-12 Thread Daniel Borkmann

On 8/12/19 5:03 PM, Ilya Leoshkevich wrote:

"masking, test in bounds 3" fails on s390, because
BPF_ALU64_IMM(BPF_NEG, BPF_REG_2, 0) ignores the top 32 bits of
BPF_REG_2. The reason is that the JIT emits lcgfr instead of lcgr.
The associated comment indicates that the code was intended to emit
lcgr in the first place; it's just that the wrong opcode was used.

Fix by using the correct opcode.

Signed-off-by: Ilya Leoshkevich 


Applied, thanks!
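
As a reader aid, the difference in semantics expressed as a hedged C
sketch; this is a plain C model of the two s390 instructions, not the
JIT code itself:

  #include <stdint.h>

  /* What BPF_ALU64 | BPF_NEG must compute -- lcgr, a full 64-bit
   * two's-complement negation:
   */
  static uint64_t neg64_correct(uint64_t dst)
  {
          return (uint64_t)(-(int64_t)dst);
  }

  /* What the buggy encoding computed -- lcgfr negates the sign-extended
   * low 32 bits of the source, so the top 32 bits of dst are ignored:
   */
  static uint64_t neg64_buggy(uint64_t dst)
  {
          return (uint64_t)(-(int64_t)(int32_t)dst);
  }

  /* e.g. dst = 0x100000001: correct -> 0xfffffffeffffffff,
   * buggy -> 0xffffffffffffffff (-1), losing the high word.
   */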


Re: [RESEND][PATCH v3 bpf-next] btf: expose BTF info through sysfs

2019-08-13 Thread Daniel Borkmann

On 8/12/19 8:39 PM, Andrii Nakryiko wrote:

Make .BTF section allocated and expose its contents through sysfs.

/sys/kernel/btf directory is created to contain all the BTFs present
inside kernel. Currently there is only kernel's main BTF, represented as
/sys/kernel/btf/kernel file. Once kernel modules' BTFs are supported,
each module will expose its BTF as /sys/kernel/btf/ file.

Current approach relies on a few pieces coming together:
1. pahole is used to take almost final vmlinux image (modulo .BTF and
kallsyms) and generate .BTF section by converting DWARF info into
BTF. This section is not allocated and not mapped to any segment,
though, so is not yet accessible from inside kernel at runtime.
2. objcopy dumps .BTF contents into binary file and subsequently
convert binary file into linkable object file with automatically
generated symbols _binary__btf_kernel_bin_start and
_binary__btf_kernel_bin_end, pointing to start and end, respectively,
of BTF raw data.
3. final vmlinux image is generated by linking this object file (and
kallsyms, if necessary). sysfs_btf.c then creates
/sys/kernel/btf/kernel file and exposes embedded BTF contents through
it. This allows, e.g., libbpf and bpftool to access BTF info at a
well-known location, without resorting to searching for a vmlinux image
on disk (location of which is not standardized and vmlinux image
might not be even available in some scenarios, e.g., inside qemu
during testing).


Small question: given modules will be covered later, would it not be more
obvious to name it /sys/kernel/btf/vmlinux instead?


Alternative approach using .incbin assembler directive to embed BTF
contents directly was attempted but didn't work, because sysfs_proc.o is
not re-compiled during link-vmlinux.sh stage. This is required, though,
to update embedded BTF data (initially empty data is embedded, then
pahole generates BTF info and we need to regenerate sysfs_btf.o with
updated contents, but it's too late at that point).

If BTF couldn't be generated due to missing or too old pahole,
sysfs_btf.c handles that gracefully by detecting that
_binary__btf_kernel_bin_start (weak symbol) is 0 and not creating
/sys/kernel/btf at all.

v2->v3:
- added Documentation/ABI/testing/sysfs-kernel-btf (Greg K-H);
- created proper kobject (btf_kobj) for btf directory (Greg K-H);
- undo v2 change of reusing vmlinux, as it causes extra kallsyms pass
   due to initially missing  __binary__btf_kernel_bin_{start/end} symbols;

v1->v2:
- allow kallsyms stage to re-use vmlinux generated by gen_btf();

Reviewed-by: Greg Kroah-Hartman 
Signed-off-by: Andrii Nakryiko 


In any case, this is great progress, applied thanks!


Re: [PATCH] tools: bpftool: add feature check for zlib

2019-08-13 Thread Daniel Borkmann

On 8/13/19 2:38 AM, Peter Wu wrote:

bpftool requires libelf, and zlib for decompressing /proc/config.gz.
zlib is a transitive dependency via libelf, and became mandatory since
elfutils 0.165 (Jan 2016). The feature check of libelf is already done
in the elfdep target of tools/lib/bpf/Makefile, pulled in by bpftool via
a dependency on libbpf.a. Add a similar feature check for zlib.

Suggested-by: Jakub Kicinski 
Signed-off-by: Peter Wu 


Applied, thanks!


Re: [bpf-next] selftests/bpf: fix race in flow dissector tests

2019-08-13 Thread Daniel Borkmann

On 8/13/19 1:30 AM, Petar Penkov wrote:

From: Petar Penkov 

Since the "last_dissection" map holds only the flow keys for the most
recent packet, there is a small race in the skb-less flow dissector
tests if a new packet comes between transmitting the test packet, and
reading its keys from the map. If this happens, the test packet keys
will be overwritten and the test will fail.

Changing the "last_dissection" map to a hash map, keyed on the
source/dest port pair, resolves this issue. Additionally, let's clear the
last test results from the map between tests to prevent previous test
cases from interfering with the following test cases.

Fixes: 0905beec9f52 ("selftests/bpf: run flow dissector tests in skb-less mode")
Signed-off-by: Petar Penkov 


Applied, thanks!
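
For reference, a hedged sketch of the map shape after this change; the
key struct here is illustrative, the real definitions live in the flow
dissector selftest progs:

  #include <linux/types.h>
  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  /* Keying the results map on the packet's port pair gives each
   * in-flight dissection its own slot, so an unrelated packet can no
   * longer overwrite the test packet's flow keys between transmit and
   * the map read.
   */
  struct flow_key_ports {
          __be16 sport;
          __be16 dport;
  };

  struct {
          __uint(type, BPF_MAP_TYPE_HASH);
          __uint(max_entries, 1024);
          __type(key, struct flow_key_ports);
          __type(value, struct bpf_flow_keys);
  } last_dissection SEC(".maps");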


Re: [PATCH bpf-next v2 2/4] bpf: support cloning sk storage on accept()

2019-08-13 Thread Daniel Borkmann

On 8/12/19 7:52 PM, Stanislav Fomichev wrote:

On 08/12, Daniel Borkmann wrote:

On 8/9/19 6:10 PM, Stanislav Fomichev wrote:

Add new helper bpf_sk_storage_clone which optionally clones sk storage
and call it from sk_clone_lock.

Cc: Martin KaFai Lau 
Cc: Yonghong Song 
Signed-off-by: Stanislav Fomichev 

[...]

+int bpf_sk_storage_clone(const struct sock *sk, struct sock *newsk)
+{
+   struct bpf_sk_storage *new_sk_storage = NULL;
+   struct bpf_sk_storage *sk_storage;
+   struct bpf_sk_storage_elem *selem;
+   int ret;
+
+   RCU_INIT_POINTER(newsk->sk_bpf_storage, NULL);
+
+   rcu_read_lock();
+   sk_storage = rcu_dereference(sk->sk_bpf_storage);
+
+   if (!sk_storage || hlist_empty(&sk_storage->list))
+   goto out;
+
+   hlist_for_each_entry_rcu(selem, &sk_storage->list, snode) {
+   struct bpf_sk_storage_elem *copy_selem;
+   struct bpf_sk_storage_map *smap;
+   struct bpf_map *map;
+   int refold;
+
+   smap = rcu_dereference(SDATA(selem)->smap);
+   if (!(smap->map.map_flags & BPF_F_CLONE))
+   continue;
+
+   map = bpf_map_inc_not_zero(&smap->map, false);
+   if (IS_ERR(map))
+   continue;
+
+   copy_selem = bpf_sk_storage_clone_elem(newsk, smap, selem);
+   if (!copy_selem) {
+   ret = -ENOMEM;
+   bpf_map_put(map);
+   goto err;
+   }
+
+   if (new_sk_storage) {
+   selem_link_map(smap, copy_selem);
+   __selem_link_sk(new_sk_storage, copy_selem);
+   } else {
+   ret = sk_storage_alloc(newsk, smap, copy_selem);
+   if (ret) {
+   kfree(copy_selem);
+   atomic_sub(smap->elem_size,
+  &newsk->sk_omem_alloc);
+   bpf_map_put(map);
+   goto err;
+   }
+
+   new_sk_storage = rcu_dereference(copy_selem->sk_storage);
+   }
+   bpf_map_put(map);


The map get/put combination /under/ RCU read lock seems a bit odd to me, could
you exactly describe the race that this would be preventing?

There is a race between sk storage release and sk storage clone.
bpf_sk_storage_map_free uses synchronize_rcu to wait for all existing
users to finish, and new ones are prevented via the map's refcnt being
zero; we need to do something like that for the clone.
Martin suggested using bpf_map_inc_not_zero/bpf_map_put.
If I read everything correctly, I think without map_inc/map_put we
get the following race:

CPU0                                 CPU1

bpf_map_put
  bpf_sk_storage_map_free(smap)
    synchronize_rcu

    // no more users via bpf or
    // syscall, but clone
    // can still happen

    for each (bucket)
      selem_unlink
        selem_unlink_map(smap)

        // adding anything at
        // this point to the
        // bucket will leak

                                     rcu_read_lock
                                     tcp_v4_rcv
                                       tcp_v4_do_rcv
                                         // sk is lockless TCP_LISTEN
                                         tcp_v4_cookie_check
                                           tcp_v4_syn_recv_sock
                                             bpf_sk_storage_clone
                                               rcu_dereference(sk->sk_bpf_storage)
                                               selem_link_map(smap, copy)
                                               // adding new element to the
                                               // map -> leak
                                     rcu_read_unlock

        selem_unlink_sk
          sk->sk_bpf_storage = NULL

    synchronize_rcu



Makes sense, thanks for clarifying. Perhaps a small comment on top of
the bpf_map_inc_not_zero() would be great as well, so it's immediately
clear also from this location when reading the code why this is done.
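
E.g., something along these lines (illustrative wording only, not
necessarily the final comment):

  		/* Pairs with bpf_map_put() below. bpf_sk_storage_map_free()
  		 * may already be waiting in synchronize_rcu() to tear the
  		 * map down; taking a reference only while the refcnt is
  		 * still non-zero guarantees we never link a new element
  		 * into a map that is on its way out.
  		 */
  		map = bpf_map_inc_not_zero(&smap->map, false);
  		if (IS_ERR(map))
  			continue;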

Thanks,
Daniel


Re: [PATCH bpf-next 0/2] libbpf: make use of BTF through sysfs

2019-08-13 Thread Daniel Borkmann

On 8/13/19 8:54 PM, Andrii Nakryiko wrote:

Now that the kernel's BTF is exposed through sysfs at a well-known location, attempt
to load it first as a target BTF for the purpose of BPF CO-RE relocations.

Patch #1 is a follow-up patch to rename /sys/kernel/btf/kernel into
/sys/kernel/btf/vmlinux.
Patch #2 adds ability to load raw BTF contents from sysfs and expands the list
of locations libbpf attempts to load vmlinux BTF from.

Andrii Nakryiko (2):
   btf: rename /sys/kernel/btf/kernel into /sys/kernel/btf/vmlinux
   libbpf: attempt to load kernel BTF from sysfs first

  Documentation/ABI/testing/sysfs-kernel-btf |  2 +-
  kernel/bpf/sysfs_btf.c | 30 +-
  scripts/link-vmlinux.sh| 18 +++---
  tools/lib/bpf/libbpf.c | 64 +++---
  4 files changed, 82 insertions(+), 32 deletions(-)



LGTM, applied thanks!
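
As a quick usage sketch of the well-known location: the sysfs file is
raw BTF, so it can be read like a regular file and handed to libbpf
(error handling trimmed; btf__new/btf__get_nr_types are existing libbpf
APIs):

  #include <bpf/btf.h>
  #include <bpf/libbpf.h>
  #include <stdio.h>

  int main(void)
  {
          static char buf[32 << 20]; /* vmlinux BTF is a few MB */
          FILE *f = fopen("/sys/kernel/btf/vmlinux", "rb");
          struct btf *btf;
          size_t sz;

          if (!f)
                  return 1;
          sz = fread(buf, 1, sizeof(buf), f);
          fclose(f);

          btf = btf__new(buf, sz);
          if (libbpf_get_error(btf))
                  return 1;
          printf("vmlinux BTF: %u types\n", btf__get_nr_types(btf));
          return 0;
  }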


Re: [PATCH bpf 1/2] bpf: support PTR_TO_MEM{,_OR_NULL} register spilling

2021-01-12 Thread Daniel Borkmann

On 1/12/21 8:46 PM, Andrii Nakryiko wrote:

On Tue, Jan 12, 2021 at 1:14 AM Gilad Reti  wrote:


Add support for pointer-to-mem register spilling, to allow the verifier
to track pointers to valid memory addresses. Such pointers are returned
for example by a successful call of the bpf_ringbuf_reserve helper.

This patch was suggested as a solution by Yonghong Song.

The patch was partially contributed by CyberArk Software, Inc.

Fixes: 457f44363a88 ("bpf: Implement BPF ring buffer and verifier
support for it")


Surprised no one mentioned this yet. Fixes tag should always be on a
single unwrapped line, however long it is, please fix.


Especially for first-time contributors there is usually some luxury that we
would have fixed this up on the fly while applying. ;) But given a v2 is going
to be sent anyway it's good to point it out for future reference, agree.

Thanks,
Daniel
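
For context, a hedged sketch of the pattern the verifier change is
about: a PTR_TO_MEM_OR_NULL returned by bpf_ringbuf_reserve() may get
spilled to the stack by the compiler under register pressure and later
reloaded, and the verifier must keep tracking it across the spill/fill:

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  struct {
          __uint(type, BPF_MAP_TYPE_RINGBUF);
          __uint(max_entries, 1 << 12);
  } rb SEC(".maps");

  SEC("tp/syscalls/sys_enter_write")
  int log_write(void *ctx)
  {
          /* The pointer returned here may live in a stack slot for a
           * while; previously such a spill made the verifier lose track
           * of it as valid memory.
           */
          __u64 *slot = bpf_ringbuf_reserve(&rb, sizeof(*slot), 0);

          if (!slot)
                  return 0;
          *slot = bpf_ktime_get_ns();
          bpf_ringbuf_submit(slot, 0);
          return 0;
  }

  char _license[] SEC("license") = "GPL";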


Re: [PATCH bpf 2/2] libbpf: allow loading empty BTFs

2021-01-12 Thread Daniel Borkmann

On 1/12/21 7:41 AM, Andrii Nakryiko wrote:

On Mon, Jan 11, 2021 at 5:16 PM Yonghong Song  wrote:

On 1/11/21 12:51 PM, Andrii Nakryiko wrote:

On Mon, Jan 11, 2021 at 10:13 AM Yonghong Song  wrote:

On 1/9/21 11:03 PM, Andrii Nakryiko wrote:

Empty BTFs do come up (e.g., simple kernel modules with no new types and
strings, compared to the vmlinux BTF) and there is nothing technically wrong
with them. So remove unnecessary check preventing loading empty BTFs.

Reported-by: Christopher William Snowhill 
Fixes: ("d8123624506c libbpf: Fix BTF data layout checks and allow empty BTF")


Fixed up Fixes tag ^ while applying. ;-)


Signed-off-by: Andrii Nakryiko 
---
tools/lib/bpf/btf.c | 5 -
1 file changed, 5 deletions(-)

diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c
index 3c3f2bc6c652..9970a288dda5 100644
--- a/tools/lib/bpf/btf.c
+++ b/tools/lib/bpf/btf.c
@@ -240,11 +240,6 @@ static int btf_parse_hdr(struct btf *btf)
}

meta_left = btf->raw_size - sizeof(*hdr);
- if (!meta_left) {
- pr_debug("BTF has no data\n");
- return -EINVAL;
- }


Previous kernel patch allows empty btf only if that btf is module (not
base/vmlinux) btf. Here it seems we allow any empty non-module btf to be
loaded into the kernel. In such cases, loading may fail? Maybe we should
detect such cases in libbpf and error out instead of going to kernel and
get error back?


I did this consciously. The kernel is more strict, because there is no
reasonable case where vmlinux BTF or a BPF program's BTF can be empty
(at least not now that we have FUNCs in BTF). But allowing libbpf to load
empty BTF generically is helpful for bpftool, as one example, for
inspection. If you do `bpftool btf dump` on empty BTF, it will just
print nothing and you'll know that it's a valid (from BTF header
perspective) BTF, just doesn't have any types (besides VOID). If we
don't allow it, then we'll just get an error and then you'll have to
do painful hex dumping and decoding to see what's wrong.


It is totally okay to allow empty btf in libbpf. I just want to check:
if this btf is going to be loaded into the kernel, could libbpf check
right before loading whether it is a non-module empty btf, and if so,
not go to the kernel at all?


Ok, I see what you are proposing. We can do that, but it's definitely
separate from these bug fixes. But, to be honest, I wouldn't bother
because libbpf will return BTF verification log with a very readable
"No data" message in it.


Right, seems okay to me for this particular case given the user will be
able to make some sense of it from the log.

Thanks,
Daniel


Re: [PATCH v3 bpf-next 5/7] bpf: support BPF ksym variables in kernel modules

2021-01-13 Thread Daniel Borkmann

On 1/13/21 12:18 AM, Alexei Starovoitov wrote:

On Tue, Jan 12, 2021 at 8:30 AM Daniel Borkmann  wrote:

On 1/12/21 8:55 AM, Andrii Nakryiko wrote:

Add support for directly accessing kernel module variables from BPF programs
using special ldimm64 instructions. This functionality builds upon vmlinux
ksym support, but extends ldimm64 with src_reg=BPF_PSEUDO_BTF_ID to allow
specifying kernel module BTF's FD in insn[1].imm field.

During BPF program load time, verifier will resolve FD to BTF object and will
take reference on BTF object itself and, for module BTFs, corresponding module
as well, to make sure it won't be unloaded from under running BPF program. The
mechanism used is similar to how bpf_prog keeps track of used bpf_maps.

One interesting change is also in how per-CPU variable is determined. The
logic is to find .data..percpu data section in provided BTF, but both vmlinux
and module each have their own .data..percpu entries in BTF. So for module's
case, the search for DATASEC record needs to look at only module's added BTF
types. This is implemented with custom search function.

Acked-by: Yonghong Song 
Acked-by: Hao Luo 
Signed-off-by: Andrii Nakryiko 

[...]

+
+struct module *btf_try_get_module(const struct btf *btf)
+{
+ struct module *res = NULL;
+#ifdef CONFIG_DEBUG_INFO_BTF_MODULES
+ struct btf_module *btf_mod, *tmp;
+
+ mutex_lock(&btf_module_mutex);
+ list_for_each_entry_safe(btf_mod, tmp, &btf_modules, list) {
+ if (btf_mod->btf != btf)
+ continue;
+
+ if (try_module_get(btf_mod->module))
+ res = btf_mod->module;


One more thought (follow-up would be okay I'd think) ... when a module
references a symbol from another module, it similarly needs to bump the
refcount of the module owning it, thus disallowing unload of the owning
module for the referencing module's lifetime. That usage dependency is
visible via /proc/modules however, so if unload doesn't work then lsmod
allows a way to introspect that to the user. This seems to be achieved
via resolve_symbol() where it records its dependency/usage. Would be
great if we could at some point also include the BPF prog name into
that list so that this is more obvious. Wdyt?


I thought about it as well, but plenty of kernel things just grab the ref of ko
and don't add any way to introspect what piece of kernel is holding ko.
So this case won't be the first.
Also if we add it for bpf progs it could be confusing in lsmod,
since it currently only shows other ko-s in there.
Long ago I had an awk script to parse that output to rmmod dependent modules
before rmmoding the main one. If somebody is still doing something like
this, bpf prog names in the same place may break things.
So I think there are more cons than pros.


Hm, true that scripting could break in this case if we were to add bpf
prog names in there. :/ I don't have a better suggestion atm.. we could
potentially add something for the bpf prog info dump via bpftool, but
it's a non-obvious location to people who are used to checking deps via
lsmod. Also true that we bump the ref from plenty of other locations
where it's not directly shown either apart from just the refcnt (e.g.
socket using tcp congctl module etc).

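To make the mechanism under discussion concrete, a hedged sketch of the
program side that this series enables; the module symbol name here is
invented, and __ksym is the section attribute provided by bpf_helpers.h:

  #include "vmlinux.h"
  #include <bpf/bpf_helpers.h>

  /* Hypothetical per-CPU variable exported by some module; the ldimm64
   * relocation against the module's BTF is what the series wires up,
   * and loading the program takes a reference on that module.
   */
  extern const u64 some_module_percpu_stat __ksym;

  SEC("raw_tp/sys_enter")
  int read_mod_var(void *ctx)
  {
          u64 *v = bpf_per_cpu_ptr(&some_module_percpu_stat, 0);

          if (v)
                  bpf_printk("stat: %llu", *v);
          return 0;
  }

  char _license[] SEC("license") = "GPL";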
