Re: [PATCHv6 bpf-next 0/9] uprobe: uretprobe speed up

2024-05-21 Thread Alexei Starovoitov
On Tue, May 21, 2024 at 1:49 PM Deepak Gupta  wrote:
>
> On Tue, May 21, 2024 at 12:48:16PM +0200, Jiri Olsa wrote:
> >hi,
> >as part of the effort on speeding up uprobes [0], this series comes
> >with a return uprobe optimization that uses a syscall instead of the
> >trap on the uretprobe trampoline.
>
> I understand this provides an optimization on x86. I believe the primary
> reason is that syscall is a short, straight-line microcode sequence, while
> trap delivery still does all the GDT / IDT and segmentation checks, which
> makes delivery of the trap slow.
>
> So doing a syscall improves that, although it seems x86 is going to get rid
> of that overhead as part of FRED [1, 2], and Linux kernel support for FRED
> is already upstream [2].
> So I am imagining x86 hardware with FRED support already exists.
>
> On other architectures, I believe trap delivery for the breakpoint
> instruction is the same as for the syscall instruction.
>
> Given that, x86 trap delivery is pretty much following suit here and intends
> to make the trap delivery cost similar to the syscall delivery cost.
>
> Sorry for being a buzzkill here, but ...
> Is it worth introducing this syscall, which otherwise has no use on other
> arches, when x86 (and the x86 kernel) has already taken steps to match
> trap delivery latency with syscall latency?
>
> Did you do any study of this on FRED enabled x86 CPUs?

afaik CPUs with FRED do not exist on the market and it's
not clear when they will be available.
And when they finally are on the shelves,
the overhead of FRED vs int3 will still have to be measured.
int3 with FRED might still be slower than syscall with FRED.

>
> [1] - 
> https://www.intel.com/content/www/us/en/content-details/780121/flexible-return-and-event-delivery-fred-specification.html
> [2] - https://docs.kernel.org/arch/x86/x86_64/fred.html
>
> >
> >The speed-up depends on the instruction type that the uprobe is installed
> >on and on the specific HW type; please check patch 1 for details.
> >



Re: WARNING: kmalloc bug in bpf_uprobe_multi_link_attach

2024-05-15 Thread Alexei Starovoitov
On Tue, May 14, 2024 at 12:33 AM Ubisectech Sirius
 wrote:
>
> Hello.
> We are Ubisectech Sirius Team, the vulnerability lab of China ValiantSec.
> Recently, our team has discovered an issue in Linux kernel 6.7. Attached to
> the email is a PoC file of the issue.

Jiri,

please take a look.

> Stack dump:
>
> loop3: detected capacity change from 0 to 8
> MTD: Attempt to mount non-MTD device "/dev/loop3"
> [ cut here ]
> WARNING: CPU: 1 PID: 10075 at mm/util.c:632 kvmalloc_node+0x199/0x1b0 
> mm/util.c:632
> Modules linked in:
> CPU: 1 PID: 10075 Comm: syz-executor.3 Not tainted 6.7.0 #2
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 
> 04/01/2014
> RIP: 0010:kvmalloc_node+0x199/0x1b0 mm/util.c:632
> Code: 02 1d 00 eb aa e8 a7 49 c6 ff 41 81 e5 00 20 00 00 31 ff 44 89 ee e8 36 
> 45 c6 ff 45 85 ed 0f 85 1b ff ff ff e8 88 49 c6 ff 90 <0f> 0b 90 e9 dd fe ff 
> ff 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40
> RSP: 0018:c90002007b60 EFLAGS: 00010212
> RAX: 23e4 RBX: 0400 RCX: c90003aaa000
> RDX: 0004 RSI: 81c3acc8 RDI: 0005
> RBP: 0037cec8 R08: 0005 R09: 
> R10:  R11:  R12: 
> R13:  R14:  R15: 88805ff6e1b8
> FS:  7fc62205f640() GS:88807ec0() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 001b2e026000 CR3: 5f338000 CR4: 00750ef0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> PKRU: 5554
> Call Trace:
>  
>  kvmalloc include/linux/slab.h:738 [inline]
>  kvmalloc_array include/linux/slab.h:756 [inline]
>  kvcalloc include/linux/slab.h:761 [inline]
>  bpf_uprobe_multi_link_attach+0x3fe/0xf60 kernel/trace/bpf_trace.c:3239
>  link_create kernel/bpf/syscall.c:5012 [inline]
>  __sys_bpf+0x2e85/0x4e00 kernel/bpf/syscall.c:5453
>  __do_sys_bpf kernel/bpf/syscall.c:5487 [inline]
>  __se_sys_bpf kernel/bpf/syscall.c:5485 [inline]
>  __x64_sys_bpf+0x78/0xc0 kernel/bpf/syscall.c:5485
>  do_syscall_x64 arch/x86/entry/common.c:52 [inline]
>  do_syscall_64+0x43/0x120 arch/x86/entry/common.c:83
>  entry_SYSCALL_64_after_hwframe+0x6f/0x77
> RIP: 0033:0x7fc62128fd6d
> Code: c3 e8 97 2b 00 00 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 
> 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 
> 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
> RSP: 002b:7fc62205f028 EFLAGS: 0246 ORIG_RAX: 0141
> RAX: ffda RBX: 7fc6213cbf80 RCX: 7fc62128fd6d
> RDX: 0040 RSI: 21c0 RDI: 001c
> RBP: 7fc6212f14cd R08:  R09: 
> R10:  R11: 0246 R12: 
> R13: 000b R14: 7fc6213cbf80 R15: 7fc62203f000
>  
>
> Thank you for taking the time to read this email and we look forward to 
> working with you further.
>
>
>
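For context, the warning fires because kvmalloc_node() refuses (and warns about)
allocations whose size exceeds INT_MAX unless __GFP_NOWARN is passed, and here the
element count given to kvcalloc() comes straight from userspace. Below is a minimal
sketch of the kind of bound check such a report usually leads to; it is illustrative
only, and the actual fix, limit name, and value may differ:

	/* Illustrative bound on the user-supplied uprobe count; the real
	 * limit and its name are up to the actual fix. */
	#define ILLUSTRATIVE_MAX_UPROBE_MULTI_CNT (1U << 20)

	if (!cnt || cnt > ILLUSTRATIVE_MAX_UPROBE_MULTI_CNT)
		return -E2BIG;

	uprobes = kvcalloc(cnt, sizeof(*uprobes), GFP_KERNEL);
	if (!uprobes)
		return -ENOMEM;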



Re: [PATCH v4 02/14] mm: Switch mm->get_unmapped_area() to a flag

2024-03-25 Thread Alexei Starovoitov
On Mon, Mar 25, 2024 at 7:17 PM Rick Edgecombe
 wrote:
>
>
> diff --git a/mm/util.c b/mm/util.c
> index 669397235787..8619d353a1aa 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -469,17 +469,17 @@ void arch_pick_mmap_layout(struct mm_struct *mm, struct 
> rlimit *rlim_stack)
>
> if (mmap_is_legacy(rlim_stack)) {
> mm->mmap_base = TASK_UNMAPPED_BASE + random_factor;
> -   mm->get_unmapped_area = arch_get_unmapped_area;
> +   clear_bit(MMF_TOPDOWN, &mm->flags);
> } else {
> mm->mmap_base = mmap_base(random_factor, rlim_stack);
> -   mm->get_unmapped_area = arch_get_unmapped_area_topdown;
> +   set_bit(MMF_TOPDOWN, &mm->flags);
> }
>  }
>  #elif defined(CONFIG_MMU) && !defined(HAVE_ARCH_PICK_MMAP_LAYOUT)
>  void arch_pick_mmap_layout(struct mm_struct *mm, struct rlimit *rlim_stack)
>  {
> mm->mmap_base = TASK_UNMAPPED_BASE;
> -   mm->get_unmapped_area = arch_get_unmapped_area;
> +   clear_bit(MMF_TOPDOWN, &mm->flags);
>  }
>  #endif

Makes sense to me.
Acked-by: Alexei Starovoitov 
for the idea and for bpf bits.
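For readers following along, a minimal sketch of how callers are expected to
consume the flag in place of the removed mm->get_unmapped_area pointer; the
helper name below is illustrative, not necessarily the one the series adds:

unsigned long
mm_get_unmapped_area(struct mm_struct *mm, struct file *file,
		     unsigned long addr, unsigned long len,
		     unsigned long pgoff, unsigned long flags)
{
	/* pick bottom-up or top-down layout from the mm flag */
	if (test_bit(MMF_TOPDOWN, &mm->flags))
		return arch_get_unmapped_area_topdown(file, addr, len, pgoff, flags);
	return arch_get_unmapped_area(file, addr, len, pgoff, flags);
}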



raw_tp+cookie is buggy. Was: [syzbot] [bpf?] [trace?] KASAN: slab-use-after-free Read in bpf_trace_run1

2024-03-25 Thread Alexei Starovoitov
Hi Andrii,

syzbot found a UAF in the raw_tp cookie series in bpf-next.
Reverting the whole merge
2e244a72cd48 ("Merge branch 'bpf-raw-tracepoint-support-for-bpf-cookie'")

fixes the issue.

Pls take a look.
See C reproducer below. It splats consistently with CONFIG_KASAN=y

Thanks.

On Sun, Mar 24, 2024 at 4:28 PM syzbot
 wrote:
>
> Hello,
>
> syzbot found the following issue on:
>
> HEAD commit:520fad2e3206 selftests/bpf: scale benchmark counting by us..
> git tree:   bpf-next
> console+strace: https://syzkaller.appspot.com/x/log.txt?x=105af94618
> kernel config:  https://syzkaller.appspot.com/x/.config?x=6fb1be60a193d440
> dashboard link: https://syzkaller.appspot.com/bug?extid=981935d9485a560bfbcb
> compiler:   Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 
> 2.40
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=114f17a518
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=162bb7a518
>
> Downloadable assets:
> disk image: 
> https://storage.googleapis.com/syzbot-assets/4eef3506c5ce/disk-520fad2e.raw.xz
> vmlinux: 
> https://storage.googleapis.com/syzbot-assets/24d60ebe76cc/vmlinux-520fad2e.xz
> kernel image: 
> https://storage.googleapis.com/syzbot-assets/8f883e706550/bzImage-520fad2e.xz
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+981935d9485a560bf...@syzkaller.appspotmail.com
>
> ==
> BUG: KASAN: slab-use-after-free in __bpf_trace_run 
> kernel/trace/bpf_trace.c:2376 [inline]
> BUG: KASAN: slab-use-after-free in bpf_trace_run1+0xcb/0x510 
> kernel/trace/bpf_trace.c:2430
> Read of size 8 at addr 8880290d9918 by task migration/0/19
>
> CPU: 0 PID: 19 Comm: migration/0 Not tainted 
> 6.8.0-syzkaller-05233-g520fad2e3206 #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 02/29/2024
> Stopper: 0x0 <- 0x0
> Call Trace:
>  
>  __dump_stack lib/dump_stack.c:88 [inline]
>  dump_stack_lvl+0x1e7/0x2e0 lib/dump_stack.c:106
>  print_address_description mm/kasan/report.c:377 [inline]
>  print_report+0x169/0x550 mm/kasan/report.c:488
>  kasan_report+0x143/0x180 mm/kasan/report.c:601
>  __bpf_trace_run kernel/trace/bpf_trace.c:2376 [inline]
>  bpf_trace_run1+0xcb/0x510 kernel/trace/bpf_trace.c:2430
>  __traceiter_rcu_utilization+0x74/0xb0 include/trace/events/rcu.h:27
>  trace_rcu_utilization+0x194/0x1c0 include/trace/events/rcu.h:27
>  rcu_note_context_switch+0xc7c/0xff0 kernel/rcu/tree_plugin.h:360
>  __schedule+0x345/0x4a20 kernel/sched/core.c:6635
>  __schedule_loop kernel/sched/core.c:6813 [inline]
>  schedule+0x14b/0x320 kernel/sched/core.c:6828
>  smpboot_thread_fn+0x61e/0xa30 kernel/smpboot.c:160
>  kthread+0x2f0/0x390 kernel/kthread.c:388
>  ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
>  ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:243
>  
>
> Allocated by task 5075:
>  kasan_save_stack mm/kasan/common.c:47 [inline]
>  kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
>  poison_kmalloc_redzone mm/kasan/common.c:370 [inline]
>  __kasan_kmalloc+0x98/0xb0 mm/kasan/common.c:387
>  kasan_kmalloc include/linux/kasan.h:211 [inline]
>  kmalloc_trace+0x1d9/0x360 mm/slub.c:4012
>  kmalloc include/linux/slab.h:590 [inline]
>  kzalloc include/linux/slab.h:711 [inline]
>  bpf_raw_tp_link_attach+0x2a0/0x6e0 kernel/bpf/syscall.c:3816
>  bpf_raw_tracepoint_open+0x1c2/0x240 kernel/bpf/syscall.c:3863
>  __sys_bpf+0x3c0/0x810 kernel/bpf/syscall.c:5673
>  __do_sys_bpf kernel/bpf/syscall.c:5738 [inline]
>  __se_sys_bpf kernel/bpf/syscall.c:5736 [inline]
>  __x64_sys_bpf+0x7c/0x90 kernel/bpf/syscall.c:5736
>  do_syscall_64+0xfb/0x240
>  entry_SYSCALL_64_after_hwframe+0x6d/0x75
>
> Freed by task 5075:
>  kasan_save_stack mm/kasan/common.c:47 [inline]
>  kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
>  kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:589
>  poison_slab_object+0xa6/0xe0 mm/kasan/common.c:240
>  __kasan_slab_free+0x37/0x60 mm/kasan/common.c:256
>  kasan_slab_free include/linux/kasan.h:184 [inline]
>  slab_free_hook mm/slub.c:2121 [inline]
>  slab_free mm/slub.c:4299 [inline]
>  kfree+0x14a/0x380 mm/slub.c:4409
>  bpf_link_release+0x3b/0x50 kernel/bpf/syscall.c:3071
>  __fput+0x429/0x8a0 fs/file_table.c:423
>  task_work_run+0x24f/0x310 kernel/task_work.c:180
>  exit_task_work include/linux/task_work.h:38 [inline]
>  do_exit+0xa1b/0x27e0 kernel/exit.c:878
>  do_group_exit+0x207/0x2c0 kernel/exit.c:1027
>  __do_sys_exit_group kernel/exit.c:1038 [inline]
>  __se_sys_exit_group kernel/exit.c:1036 [inline]
>  __x64_sys_exit_group+0x3f/0x40 kernel/exit.c:1036
>  do_syscall_64+0xfb/0x240
>  entry_SYSCALL_64_after_hwframe+0x6d/0x75
>
> The buggy address belongs to the object at 8880290d9900
>  which belongs to the cache kmalloc-128 of size 128
> The buggy address is located 24 bytes inside of
>  freed 128-byte region [8880290d9900, 8880290d9980)
>
> The buggy address belongs to 

Re: [PATCH v5 00/12] tracing: fprobe: rethook: Use ftrace_regs instead of pt_regs

2023-09-29 Thread Alexei Starovoitov
On Thu, Sep 28, 2023 at 6:21 PM Masami Hiramatsu  wrote:
>
>
> Thus, what I need is to make fprobe use the function-graph tracer's shadow
> stack and trampoline instead of rethook. This may require generalizing its
> interface so that we can share it between fprobe and the function-graph tracer,
> but we don't need to involve rethook and kretprobes anymore.

...

> And need to add patches
>
>  - Introduce a generalized function exit hook interface for ftrace.
>  - Replace rethook in fprobe with the function exit hook interface.

you mean that rethook will be removed after that?



Re: [PATCH bpf-next v5 2/6] bpf: Add a ARG_PTR_TO_CONST_STR argument type

2021-04-20 Thread Alexei Starovoitov
On Tue, Apr 20, 2021 at 5:35 AM Florent Revest  wrote:
>
> On Tue, Apr 20, 2021 at 12:54 AM Alexei Starovoitov
>  wrote:
> >
> > On Mon, Apr 19, 2021 at 05:52:39PM +0200, Florent Revest wrote:
> > > This type provides the guarantee that an argument is going to be a const
> > > pointer to somewhere in a read-only map value. It also checks that this
> > > pointer is followed by a zero character before the end of the map value.
> > >
> > > Signed-off-by: Florent Revest 
> > > Acked-by: Andrii Nakryiko 
> > > ---
> > >  include/linux/bpf.h   |  1 +
> > >  kernel/bpf/verifier.c | 41 +
> > >  2 files changed, 42 insertions(+)
> > >
> > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > index 77d1d8c65b81..c160526fc8bf 100644
> > > --- a/include/linux/bpf.h
> > > +++ b/include/linux/bpf.h
> > > @@ -309,6 +309,7 @@ enum bpf_arg_type {
> > >   ARG_PTR_TO_PERCPU_BTF_ID,   /* pointer to in-kernel percpu type 
> > > */
> > >   ARG_PTR_TO_FUNC,/* pointer to a bpf program function */
> > >   ARG_PTR_TO_STACK_OR_NULL,   /* pointer to stack or NULL */
> > > + ARG_PTR_TO_CONST_STR,   /* pointer to a null terminated read-only 
> > > string */
> > >   __BPF_ARG_TYPE_MAX,
> > >  };
> > >
> > > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> > > index 852541a435ef..5f46dd6f3383 100644
> > > --- a/kernel/bpf/verifier.c
> > > +++ b/kernel/bpf/verifier.c
> > > @@ -4787,6 +4787,7 @@ static const struct bpf_reg_types spin_lock_types = 
> > > { .types = { PTR_TO_MAP_VALU
> > >  static const struct bpf_reg_types percpu_btf_ptr_types = { .types = { 
> > > PTR_TO_PERCPU_BTF_ID } };
> > >  static const struct bpf_reg_types func_ptr_types = { .types = { 
> > > PTR_TO_FUNC } };
> > >  static const struct bpf_reg_types stack_ptr_types = { .types = { 
> > > PTR_TO_STACK } };
> > > +static const struct bpf_reg_types const_str_ptr_types = { .types = { 
> > > PTR_TO_MAP_VALUE } };
> > >
> > >  static const struct bpf_reg_types 
> > > *compatible_reg_types[__BPF_ARG_TYPE_MAX] = {
> > >   [ARG_PTR_TO_MAP_KEY]= &map_key_value_types,
> > > @@ -4817,6 +4818,7 @@ static const struct bpf_reg_types 
> > > *compatible_reg_types[__BPF_ARG_TYPE_MAX] = {
> > >   [ARG_PTR_TO_PERCPU_BTF_ID]  = &percpu_btf_ptr_types,
> > >   [ARG_PTR_TO_FUNC]   = &func_ptr_types,
> > >   [ARG_PTR_TO_STACK_OR_NULL]  = &stack_ptr_types,
> > > + [ARG_PTR_TO_CONST_STR]  = &const_str_ptr_types,
> > >  };
> > >
> > >  static int check_reg_type(struct bpf_verifier_env *env, u32 regno,
> > > @@ -5067,6 +5069,45 @@ static int check_func_arg(struct bpf_verifier_env 
> > > *env, u32 arg,
> > >   if (err)
> > >   return err;
> > >   err = check_ptr_alignment(env, reg, 0, size, true);
> > > + } else if (arg_type == ARG_PTR_TO_CONST_STR) {
> > > + struct bpf_map *map = reg->map_ptr;
> > > + int map_off;
> > > + u64 map_addr;
> > > + char *str_ptr;
> > > +
> > > + if (reg->type != PTR_TO_MAP_VALUE || !map ||
> >
> > I think the 'type' check is redundant,
> > since check_reg_type() did it via compatible_reg_types.
> > If so it's probably better to remove it here ?
> >
> > '!map' looks unnecessary. Can it ever happen? If yes, it's a verifier bug.
> > For example, in check_mem_access() we just deref reg->map_ptr without
> > checking, which, I think, is correct.
>
> I agree with all of the above. I only thought it's better to be safe
> than sorry but if you'd like I could follow up with a patch that
> removes some checks?
...
> Sure, does not hurt. I can also follow up with a patch unless you
> prefer doing it yourself.

Please send a follow-up patch.
I consider this kind of "safe than sorry" to be defensive programming that
promotes less-thinking-is-fine-because-its-faster-to-code style.
I'm sure you've seen my rants against defensive programming in the past :)


Re: [PATCH bpf-next 1/2] bpf: Remove bpf_jit_enable=2 debugging mode

2021-04-19 Thread Alexei Starovoitov
On Sat, Apr 17, 2021 at 1:16 AM Christophe Leroy
 wrote:
>
>
>
> On 16/04/2021 at 01:49, Alexei Starovoitov wrote:
> > On Thu, Apr 15, 2021 at 8:41 AM Quentin Monnet  
> > wrote:
> >>
> >> 2021-04-15 16:37 UTC+0200 ~ Daniel Borkmann 
> >>> On 4/15/21 11:32 AM, Jianlin Lv wrote:
> >>>> For debugging JITs, dumping the JITed image to the kernel log is discouraged;
> >>>> "bpftool prog dump jited" is a much better way to examine JITed dumps.
> >>>> This patch gets rid of the code related to bpf_jit_enable=2 mode and
> >>>> updates the proc handler of bpf_jit_enable; it also adds auxiliary
> >>>> information explaining how to use the bpf_jit_disasm tool after this change.
> >>>>
> >>>> Signed-off-by: Jianlin Lv 
> >>
> >> Hello,
> >>
> >> For what it's worth, I have already seen people dump the JIT image in
> >> kernel logs in Qemu VMs running with just a busybox, not for kernel
> >> development, but in a context where building/using bpftool was not
> >> possible.
> >
> > If building/using bpftool is not possible then the majority of selftests
> > won't be exercised. I don't think such an environment is suitable for any
> > kind of bpf development, much less for JIT debugging.
> > bpf_jit_enable=2 is nothing but a debugging tool for JIT developers.
> > I'd rather nuke that code instead of carrying it from kernel to kernel.
> >
>
> When I implemented JIT for PPC32, it was extremely helpful.
>
> As far as I understand, for the time being bpftool is not usable in my
> environment because it doesn't support cross-compilation when the target's
> endianness differs from the building host's endianness, see discussion at
> https://lore.kernel.org/bpf/21e66a09-514f-f426-b9e2-13baab0b9...@csgroup.eu/
>
> That's right that selftests can't be exercised because they don't build.
>
> The question might be candid as I didn't investigate much about the
> replacement of "bpf_jit_enable=2 debugging mode" by bpftool: how do we use
> bpftool exactly for that? Especially when using the BPF test module?

Kernel developers can add any amount of printk and dumps to debug
their code, but such a debugging aid should not be part of the production kernel.
That sysctl was two things at once: a debugging tool for kernel devs and
introspection for users.
bpftool's jited dump solves the 2nd part. It provides JIT introspection to users.
Debugging of the kernel can be done with any amount of auxiliary code,
including calling print_hex_dump() during JITing.
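For the introspection side, an illustrative bpftool session (the prog id 42 is
just an example):

  # list loaded programs to find the id of the one of interest
  bpftool prog show
  # dump the JITed image of program id 42; append "opcodes" to also print raw bytes
  bpftool prog dump jited id 42
  bpftool prog dump jited id 42 opcodes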


Re: [PATCH bpf-next v2] bpf: Fix some invalid links in bpf_devel_QA.rst

2021-04-19 Thread Alexei Starovoitov
On Wed, Apr 14, 2021 at 4:20 AM Tiezhu Yang  wrote:
>
> There are some "404 Not Found" errors when I click the links for
> "MAINTAINERS" [1], "samples/bpf/" [2] and "selftests" [3]
> in the documentation "HOWTO interact with BPF subsystem" [4].
>
> Use the correct link for "MAINTAINERS" and just remove the links for
> "samples/bpf/" and "selftests" because there is no related
> documentation.
...
>  .. Links
>  .. _Documentation/process/: https://www.kernel.org/doc/html/latest/process/
> -.. _MAINTAINERS: ../../MAINTAINERS
>  .. _netdev-FAQ: ../networking/netdev-FAQ.rst
> -.. _samples/bpf/: ../../samples/bpf/
> -.. _selftests: ../../tools/testing/selftests/bpf/

I'm fine with removing the maintainers and samples links, but the selftests
link would be good to keep.
There is no documentation inside selftests, but clicking through to the
source code feels useful.
Instead of removing it, could you make it clickable?


Re: [PATCH bpf-next v5 2/6] bpf: Add a ARG_PTR_TO_CONST_STR argument type

2021-04-19 Thread Alexei Starovoitov
On Mon, Apr 19, 2021 at 05:52:39PM +0200, Florent Revest wrote:
> This type provides the guarantee that an argument is going to be a const
> pointer to somewhere in a read-only map value. It also checks that this
> pointer is followed by a zero character before the end of the map value.
> 
> Signed-off-by: Florent Revest 
> Acked-by: Andrii Nakryiko 
> ---
>  include/linux/bpf.h   |  1 +
>  kernel/bpf/verifier.c | 41 +
>  2 files changed, 42 insertions(+)
> 
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> index 77d1d8c65b81..c160526fc8bf 100644
> --- a/include/linux/bpf.h
> +++ b/include/linux/bpf.h
> @@ -309,6 +309,7 @@ enum bpf_arg_type {
>   ARG_PTR_TO_PERCPU_BTF_ID,   /* pointer to in-kernel percpu type */
>   ARG_PTR_TO_FUNC,/* pointer to a bpf program function */
>   ARG_PTR_TO_STACK_OR_NULL,   /* pointer to stack or NULL */
> + ARG_PTR_TO_CONST_STR,   /* pointer to a null terminated read-only 
> string */
>   __BPF_ARG_TYPE_MAX,
>  };
>  
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 852541a435ef..5f46dd6f3383 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -4787,6 +4787,7 @@ static const struct bpf_reg_types spin_lock_types = { 
> .types = { PTR_TO_MAP_VALU
>  static const struct bpf_reg_types percpu_btf_ptr_types = { .types = { 
> PTR_TO_PERCPU_BTF_ID } };
>  static const struct bpf_reg_types func_ptr_types = { .types = { PTR_TO_FUNC 
> } };
>  static const struct bpf_reg_types stack_ptr_types = { .types = { 
> PTR_TO_STACK } };
> +static const struct bpf_reg_types const_str_ptr_types = { .types = { 
> PTR_TO_MAP_VALUE } };
>  
>  static const struct bpf_reg_types *compatible_reg_types[__BPF_ARG_TYPE_MAX] 
> = {
>   [ARG_PTR_TO_MAP_KEY]= &map_key_value_types,
> @@ -4817,6 +4818,7 @@ static const struct bpf_reg_types 
> *compatible_reg_types[__BPF_ARG_TYPE_MAX] = {
>   [ARG_PTR_TO_PERCPU_BTF_ID]  = &percpu_btf_ptr_types,
>   [ARG_PTR_TO_FUNC]   = &func_ptr_types,
>   [ARG_PTR_TO_STACK_OR_NULL]  = &stack_ptr_types,
> + [ARG_PTR_TO_CONST_STR]  = &const_str_ptr_types,
>  };
>  
>  static int check_reg_type(struct bpf_verifier_env *env, u32 regno,
> @@ -5067,6 +5069,45 @@ static int check_func_arg(struct bpf_verifier_env 
> *env, u32 arg,
>   if (err)
>   return err;
>   err = check_ptr_alignment(env, reg, 0, size, true);
> + } else if (arg_type == ARG_PTR_TO_CONST_STR) {
> + struct bpf_map *map = reg->map_ptr;
> + int map_off;
> + u64 map_addr;
> + char *str_ptr;
> +
> + if (reg->type != PTR_TO_MAP_VALUE || !map ||

I think the 'type' check is redundant,
since check_reg_type() did it via compatible_reg_types.
If so it's probably better to remove it here ?

'!map' looks unnecessary. Can it ever happen? If yes, it's a verifier bug.
For example, in check_mem_access() we just deref reg->map_ptr without checking,
which, I think, is correct.

> + !bpf_map_is_rdonly(map)) {

This check is needed, of course.

> + verbose(env, "R%d does not point to a readonly map'\n", 
> regno);
> + return -EACCES;
> + }
> +
> + if (!tnum_is_const(reg->var_off)) {
> + verbose(env, "R%d is not a constant address'\n", regno);
> + return -EACCES;
> + }
> +
> + if (!map->ops->map_direct_value_addr) {
> + verbose(env, "no direct value access support for this 
> map type\n");
> + return -EACCES;
> + }
> +
> + err = check_map_access(env, regno, reg->off,
> +map->value_size - reg->off, false);
> + if (err)
> + return err;
> +
> + map_off = reg->off + reg->var_off.value;
> + err = map->ops->map_direct_value_addr(map, &map_addr, map_off);
> + if (err) {

Since the code checks it here, the same check in check_bpf_snprintf_call() should
probably do:
 if (err) {
   verbose("verifier bug\n");
   return -EFAULT;
 }

instead of just "return err;"?

> + verbose(env, "direct value access on string failed\n");

I think the message doesn't tell users much, but they probably should never
see it unless they try to do a lookup from a read-only array with
more than one element.
So I guess it's fine to keep this one as-is. Just flagging.

Anyway, the whole set looks great, so I've applied it to bpf-next.
Thanks!
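As a usage illustration, here is a minimal sketch of a BPF program that satisfies
the ARG_PTR_TO_CONST_STR check, assuming the bpf_snprintf() helper from the same
series: the format string must live in read-only map data, e.g. a global const
that ends up in .rodata (section name and buffer sizes below are illustrative):

// SPDX-License-Identifier: GPL-2.0
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* lands in .rodata, i.e. a read-only map value, as ARG_PTR_TO_CONST_STR requires */
const char fmt[] = "getpid called by pid %d\n";

/* global output buffer (a .bss map value) so the verifier sees initialized memory */
char out[64];

SEC("tracepoint/syscalls/sys_enter_getpid")
int trace_getpid(void *ctx)
{
	__u64 pid = bpf_get_current_pid_tgid() >> 32;

	/* fmt is verified to be a NUL-terminated string in a read-only map */
	bpf_snprintf(out, sizeof(out), fmt, &pid, sizeof(pid));
	return 0;
}

char LICENSE[] SEC("license") = "GPL";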


Re: [PATCH] bpf: fix errno code for unsupported batch ops

2021-04-19 Thread Alexei Starovoitov
On Mon, Apr 19, 2021 at 6:52 AM Pedro Tammela  wrote:
>
> On Sun, Apr 18, 2021 at 7:56 PM, Alexei Starovoitov
>  wrote:
> >
> > On Sun, Apr 18, 2021 at 1:03 PM Pedro Tammela  wrote:
> > >
> > > ENOTSUPP is not a valid userland errno[1], which is annoying for
> > > userland applications that implement a fallback to iterative, report
> > > errors via 'strerror()' or both.
> > >
> > > The batched ops return this errno whenever an operation
> > > is not implemented for kernels that implement batched ops.
> > >
> > > In older kernels, pre batched ops, it returns EINVAL as the arguments
> > > are not supported in the syscall.
> > >
> > > [1] 
> > > https://lore.kernel.org/netdev/20200511165319.2251678-1-k...@kernel.org/
> > >
> > > Signed-off-by: Pedro Tammela 
> > > ---
> > >  kernel/bpf/syscall.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > > index fd495190115e..88fe19c0aeb1 100644
> > > --- a/kernel/bpf/syscall.c
> > > +++ b/kernel/bpf/syscall.c
> > > @@ -3961,7 +3961,7 @@ static int bpf_task_fd_query(const union bpf_attr 
> > > *attr,
> > >  #define BPF_DO_BATCH(fn)   \
> > > do {\
> > > if (!fn) {  \
> > > -   err = -ENOTSUPP;\
> > > +   err = -EOPNOTSUPP;  \
> >
> > $ git grep EOPNOTSUPP kernel/bpf/|wc -l
> > 11
> > $ git grep ENOTSUPP kernel/bpf/|wc -l
> > 51
> >
> > For new code EOPNOTSUPP is better, but I don't think changing all 51 cases
> > is a good idea. Something might depend on it already.
>
> OK, makes sense.
>
> Perhaps, handle this errno in 'libbpf_strerror()'?

That's a good idea.

> So language
> bindings don't get lost when dealing with this errno.

I'm not sure what you mean by "language bindings".
In general, strerror is not that useful. The kernel aliases
multiple conditions into the same error code. The error string
is too generic in practice to be useful.
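For illustration, a minimal userland sketch of the kind of fallback being
discussed: ENOTSUPP is the kernel-internal value 524, which strerror() knows
nothing about. The helper name below is illustrative, not an existing libbpf API:

#include <errno.h>
#include <string.h>

#ifndef ENOTSUPP
#define ENOTSUPP 524	/* kernel-internal errno, not defined in <errno.h> */
#endif

/* illustrative helper, not libbpf_strerror() itself */
static const char *bpf_op_strerror(int err)
{
	if (err < 0)
		err = -err;
	if (err == ENOTSUPP || err == EOPNOTSUPP)
		return "operation not supported";
	return strerror(err);
}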


Re: [PATCH] bpf: fix errno code for unsupported batch ops

2021-04-18 Thread Alexei Starovoitov
On Sun, Apr 18, 2021 at 1:03 PM Pedro Tammela  wrote:
>
> ENOTSUPP is not a valid userland errno[1], which is annoying for
> userland applications that implement a fallback to iterative operations,
> report errors via 'strerror()', or both.
>
> The batched ops return this errno whenever an operation
> is not implemented for kernels that implement batched ops.
>
> In older kernels, before batched ops, it returns EINVAL as the arguments
> are not supported by the syscall.
>
> [1] https://lore.kernel.org/netdev/20200511165319.2251678-1-k...@kernel.org/
>
> Signed-off-by: Pedro Tammela 
> ---
>  kernel/bpf/syscall.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index fd495190115e..88fe19c0aeb1 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -3961,7 +3961,7 @@ static int bpf_task_fd_query(const union bpf_attr *attr,
>  #define BPF_DO_BATCH(fn)   \
> do {\
> if (!fn) {  \
> -   err = -ENOTSUPP;\
> +   err = -EOPNOTSUPP;  \

$ git grep EOPNOTSUPP kernel/bpf/|wc -l
11
$ git grep ENOTSUPP kernel/bpf/|wc -l
51

For new code EOPNOTSUPP is better, but I don't think changing all 51 cases
is a good idea. Something might depend on it already.


Re: [PATCH bpf-next 1/2] bpf: Remove bpf_jit_enable=2 debugging mode

2021-04-15 Thread Alexei Starovoitov
On Thu, Apr 15, 2021 at 8:41 AM Quentin Monnet  wrote:
>
> 2021-04-15 16:37 UTC+0200 ~ Daniel Borkmann 
> > On 4/15/21 11:32 AM, Jianlin Lv wrote:
> >> For debugging JITs, dumping the JITed image to the kernel log is discouraged;
> >> "bpftool prog dump jited" is a much better way to examine JITed dumps.
> >> This patch gets rid of the code related to bpf_jit_enable=2 mode and
> >> updates the proc handler of bpf_jit_enable; it also adds auxiliary
> >> information explaining how to use the bpf_jit_disasm tool after this change.
> >>
> >> Signed-off-by: Jianlin Lv 
>
> Hello,
>
> For what it's worth, I have already seen people dump the JIT image in
> kernel logs in Qemu VMs running with just a busybox, not for kernel
> development, but in a context where building/using bpftool was not
> possible.

If building/using bpftool is not possible then the majority of selftests won't
be exercised. I don't think such an environment is suitable for any kind
of bpf development, much less for JIT debugging.
bpf_jit_enable=2 is nothing but a debugging tool for JIT developers.
I'd rather nuke that code instead of carrying it from kernel to kernel.


Re: [PATCH] selftests/bpf: use !E instead of comparing with NULL

2021-04-13 Thread Alexei Starovoitov
On Tue, Apr 13, 2021 at 9:32 AM  wrote:
>
> > -Original Message-
> > From: Alexei Starovoitov 
> >
> > On Tue, Apr 13, 2021 at 9:19 AM  wrote:
> > >
> > > > -Original Message-
> > > > From: Alexei Starovoitov 
> > > >
> > > > On Tue, Apr 13, 2021 at 9:10 AM  wrote:
> > > > >
> > > > >
> > > > >
> > > > > > -Original Message-
> > > > > > From: Alexei Starovoitov 
> > > > > >
> > > > > > On Tue, Apr 13, 2021 at 2:52 AM Yang Li 
> > > > > >  wrote:
> > > > > > >
> > > > > > > Fix the following coccicheck warnings:
> > > > > > > ./tools/testing/selftests/bpf/progs/profiler.inc.h:189:7-11: 
> > > > > > > WARNING
> > > > > > > comparing pointer to 0, suggest !E
> > > > > > > ./tools/testing/selftests/bpf/progs/profiler.inc.h:361:7-11: 
> > > > > > > WARNING
> > > > > > > comparing pointer to 0, suggest !E
> > > > > > > ./tools/testing/selftests/bpf/progs/profiler.inc.h:386:14-18: 
> > > > > > > WARNING
> > > > > > > comparing pointer to 0, suggest !E
> > > > > > > ./tools/testing/selftests/bpf/progs/profiler.inc.h:402:14-18: 
> > > > > > > WARNING
> > > > > > > comparing pointer to 0, suggest !E
> > > > > > > ./tools/testing/selftests/bpf/progs/profiler.inc.h:433:7-11: 
> > > > > > > WARNING
> > > > > > > comparing pointer to 0, suggest !E
> > > > > > > ./tools/testing/selftests/bpf/progs/profiler.inc.h:534:14-18: 
> > > > > > > WARNING
> > > > > > > comparing pointer to 0, suggest !E
> > > > > > > ./tools/testing/selftests/bpf/progs/profiler.inc.h:625:7-11: 
> > > > > > > WARNING
> > > > > > > comparing pointer to 0, suggest !E
> > > > > > > ./tools/testing/selftests/bpf/progs/profiler.inc.h:767:7-11: 
> > > > > > > WARNING
> > > > > > > comparing pointer to 0, suggest !E
> > > > > > >
> > > > > > > Reported-by: Abaci Robot 
> > > > > > > Signed-off-by: Yang Li 
> > > > > > > ---
> > > > > > >  tools/testing/selftests/bpf/progs/profiler.inc.h | 22 
> > > > > > > +++---
> > > > > > >  1 file changed, 11 insertions(+), 11 deletions(-)
> > > > > > >
> > > > > > > diff --git a/tools/testing/selftests/bpf/progs/profiler.inc.h 
> > > > > > > b/tools/testing/selftests/bpf/progs/profiler.inc.h
> > > > > > > index 4896fdf8..a33066c 100644
> > > > > > > --- a/tools/testing/selftests/bpf/progs/profiler.inc.h
> > > > > > > +++ b/tools/testing/selftests/bpf/progs/profiler.inc.h
> > > > > > > @@ -189,7 +189,7 @@ static INLINE void populate_ancestors(struct 
> > > > > > > task_struct* task,
> > > > > > >  #endif
> > > > > > > for (num_ancestors = 0; num_ancestors < MAX_ANCESTORS; 
> > > > > > > num_ancestors++) {
> > > > > > > parent = BPF_CORE_READ(parent, real_parent);
> > > > > > > -   if (parent == NULL)
> > > > > > > +   if (!parent)
> > > > > >
> > > > > > Sorry, but I'd like the progs to stay as close as possible to the 
> > > > > > way
> > > > > > they were written.
> > > > > Why?
> > > > >
> > > > > > They might not adhere to kernel coding style in some cases.
> > > > > > The code could be grossly inefficient and even buggy.
> > > > > There would have to be a really good reason to accept
> > > > > grossly inefficient and even buggy code into the kernel.
> > > > >
> > > > > Can you please explain what that reason is?
> > > >
> > > > It's not the kernel. It's a test of bpf program.
> > > That doesn't answer the question of why you don't want any changes.
> > >
> > > Why would we not use kernel coding style guidelines and quality 
> > > thresholds for
> > > testing code?  This *is* going into the kernel source tree, where it will 
> > > be
> > > maintained and used by other developers.
> >
> > Because the way the C code is written makes llvm generate a particular
> > code pattern that may not be seen otherwise.
> > Removing an 'if' that is useless to humans, but not to the compiler,
> > will change the generated code, which may or may not trigger the behavior
> > the prog intends to cover.
> > In particular, this profiler.inc.h test is compiled three different ways to
> > maximize code generation differences.
> > It may not be checking error paths in some cases, which can be considered
> > a bug, but that's the intended behavior of the C code as it was written.
> > So it has nothing to do with the "quality of kernel code",
> > and it should not be used by developers. It's neither a sample nor an example.
>
> Ok - in this case it looks like a program, but it is essentially test data 
> (for testing
> the compiler).  Thanks for the explanation.

Yes. That's a good way of saying it.
Of course not all tests are like this.
The majority of bpf progs in selftests/bpf/progs/ are carefully written,
short, and designed as unit tests, while a few are "test data" for llvm.


Re: [PATCH] selftests/bpf: use !E instead of comparing with NULL

2021-04-13 Thread Alexei Starovoitov
On Tue, Apr 13, 2021 at 9:19 AM  wrote:
>
> > -Original Message-
> > From: Alexei Starovoitov 
> >
> > On Tue, Apr 13, 2021 at 9:10 AM  wrote:
> > >
> > >
> > >
> > > > -Original Message-
> > > > From: Alexei Starovoitov 
> > > >
> > > > On Tue, Apr 13, 2021 at 2:52 AM Yang Li  
> > > > wrote:
> > > > >
> > > > > Fix the following coccicheck warnings:
> > > > > ./tools/testing/selftests/bpf/progs/profiler.inc.h:189:7-11: WARNING
> > > > > comparing pointer to 0, suggest !E
> > > > > ./tools/testing/selftests/bpf/progs/profiler.inc.h:361:7-11: WARNING
> > > > > comparing pointer to 0, suggest !E
> > > > > ./tools/testing/selftests/bpf/progs/profiler.inc.h:386:14-18: WARNING
> > > > > comparing pointer to 0, suggest !E
> > > > > ./tools/testing/selftests/bpf/progs/profiler.inc.h:402:14-18: WARNING
> > > > > comparing pointer to 0, suggest !E
> > > > > ./tools/testing/selftests/bpf/progs/profiler.inc.h:433:7-11: WARNING
> > > > > comparing pointer to 0, suggest !E
> > > > > ./tools/testing/selftests/bpf/progs/profiler.inc.h:534:14-18: WARNING
> > > > > comparing pointer to 0, suggest !E
> > > > > ./tools/testing/selftests/bpf/progs/profiler.inc.h:625:7-11: WARNING
> > > > > comparing pointer to 0, suggest !E
> > > > > ./tools/testing/selftests/bpf/progs/profiler.inc.h:767:7-11: WARNING
> > > > > comparing pointer to 0, suggest !E
> > > > >
> > > > > Reported-by: Abaci Robot 
> > > > > Signed-off-by: Yang Li 
> > > > > ---
> > > > >  tools/testing/selftests/bpf/progs/profiler.inc.h | 22 
> > > > > +++---
> > > > >  1 file changed, 11 insertions(+), 11 deletions(-)
> > > > >
> > > > > diff --git a/tools/testing/selftests/bpf/progs/profiler.inc.h 
> > > > > b/tools/testing/selftests/bpf/progs/profiler.inc.h
> > > > > index 4896fdf8..a33066c 100644
> > > > > --- a/tools/testing/selftests/bpf/progs/profiler.inc.h
> > > > > +++ b/tools/testing/selftests/bpf/progs/profiler.inc.h
> > > > > @@ -189,7 +189,7 @@ static INLINE void populate_ancestors(struct 
> > > > > task_struct* task,
> > > > >  #endif
> > > > > for (num_ancestors = 0; num_ancestors < MAX_ANCESTORS; 
> > > > > num_ancestors++) {
> > > > > parent = BPF_CORE_READ(parent, real_parent);
> > > > > -   if (parent == NULL)
> > > > > +   if (!parent)
> > > >
> > > > Sorry, but I'd like the progs to stay as close as possible to the way
> > > > they were written.
> > > Why?
> > >
> > > > They might not adhere to kernel coding style in some cases.
> > > > The code could be grossly inefficient and even buggy.
> > > There would have to be a really good reason to accept
> > > grossly inefficient and even buggy code into the kernel.
> > >
> > > Can you please explain what that reason is?
> >
> > It's not the kernel. It's a test of bpf program.
> That doesn't answer the question of why you don't want any changes.
>
> Why would we not use kernel coding style guidelines and quality thresholds for
> testing code?  This *is* going into the kernel source tree, where it will be
> maintained and used by other developers.

Because the way the C code is written makes llvm generate a particular
code pattern that may not be seen otherwise.
Removing an 'if' that is useless to humans, but not to the compiler,
will change the generated code, which may or may not trigger the behavior
the prog intends to cover.
In particular, this profiler.inc.h test is compiled three different ways to
maximize code generation differences.
It may not be checking error paths in some cases, which can be considered
a bug, but that's the intended behavior of the C code as it was written.
So it has nothing to do with the "quality of kernel code",
and it should not be used by developers. It's neither a sample nor an example.


Re: [PATCH] selftests/bpf: use !E instead of comparing with NULL

2021-04-13 Thread Alexei Starovoitov
On Tue, Apr 13, 2021 at 9:10 AM  wrote:
>
>
>
> > -Original Message-
> > From: Alexei Starovoitov 
> >
> > On Tue, Apr 13, 2021 at 2:52 AM Yang Li  wrote:
> > >
> > > Fix the following coccicheck warnings:
> > > ./tools/testing/selftests/bpf/progs/profiler.inc.h:189:7-11: WARNING
> > > comparing pointer to 0, suggest !E
> > > ./tools/testing/selftests/bpf/progs/profiler.inc.h:361:7-11: WARNING
> > > comparing pointer to 0, suggest !E
> > > ./tools/testing/selftests/bpf/progs/profiler.inc.h:386:14-18: WARNING
> > > comparing pointer to 0, suggest !E
> > > ./tools/testing/selftests/bpf/progs/profiler.inc.h:402:14-18: WARNING
> > > comparing pointer to 0, suggest !E
> > > ./tools/testing/selftests/bpf/progs/profiler.inc.h:433:7-11: WARNING
> > > comparing pointer to 0, suggest !E
> > > ./tools/testing/selftests/bpf/progs/profiler.inc.h:534:14-18: WARNING
> > > comparing pointer to 0, suggest !E
> > > ./tools/testing/selftests/bpf/progs/profiler.inc.h:625:7-11: WARNING
> > > comparing pointer to 0, suggest !E
> > > ./tools/testing/selftests/bpf/progs/profiler.inc.h:767:7-11: WARNING
> > > comparing pointer to 0, suggest !E
> > >
> > > Reported-by: Abaci Robot 
> > > Signed-off-by: Yang Li 
> > > ---
> > >  tools/testing/selftests/bpf/progs/profiler.inc.h | 22 
> > > +++---
> > >  1 file changed, 11 insertions(+), 11 deletions(-)
> > >
> > > diff --git a/tools/testing/selftests/bpf/progs/profiler.inc.h 
> > > b/tools/testing/selftests/bpf/progs/profiler.inc.h
> > > index 4896fdf8..a33066c 100644
> > > --- a/tools/testing/selftests/bpf/progs/profiler.inc.h
> > > +++ b/tools/testing/selftests/bpf/progs/profiler.inc.h
> > > @@ -189,7 +189,7 @@ static INLINE void populate_ancestors(struct 
> > > task_struct* task,
> > >  #endif
> > > for (num_ancestors = 0; num_ancestors < MAX_ANCESTORS; 
> > > num_ancestors++) {
> > > parent = BPF_CORE_READ(parent, real_parent);
> > > -   if (parent == NULL)
> > > +   if (!parent)
> >
> > Sorry, but I'd like the progs to stay as close as possible to the way
> > they were written.
> Why?
>
> > They might not adhere to kernel coding style in some cases.
> > The code could be grossly inefficient and even buggy.
> There would have to be a really good reason to accept
> grossly inefficient and even buggy code into the kernel.
>
> Can you please explain what that reason is?

It's not the kernel. It's a test of a bpf program.


Re: [PATCH] selftests/bpf: use !E instead of comparing with NULL

2021-04-13 Thread Alexei Starovoitov
On Tue, Apr 13, 2021 at 2:52 AM Yang Li  wrote:
>
> Fix the following coccicheck warnings:
> ./tools/testing/selftests/bpf/progs/profiler.inc.h:189:7-11: WARNING
> comparing pointer to 0, suggest !E
> ./tools/testing/selftests/bpf/progs/profiler.inc.h:361:7-11: WARNING
> comparing pointer to 0, suggest !E
> ./tools/testing/selftests/bpf/progs/profiler.inc.h:386:14-18: WARNING
> comparing pointer to 0, suggest !E
> ./tools/testing/selftests/bpf/progs/profiler.inc.h:402:14-18: WARNING
> comparing pointer to 0, suggest !E
> ./tools/testing/selftests/bpf/progs/profiler.inc.h:433:7-11: WARNING
> comparing pointer to 0, suggest !E
> ./tools/testing/selftests/bpf/progs/profiler.inc.h:534:14-18: WARNING
> comparing pointer to 0, suggest !E
> ./tools/testing/selftests/bpf/progs/profiler.inc.h:625:7-11: WARNING
> comparing pointer to 0, suggest !E
> ./tools/testing/selftests/bpf/progs/profiler.inc.h:767:7-11: WARNING
> comparing pointer to 0, suggest !E
>
> Reported-by: Abaci Robot 
> Signed-off-by: Yang Li 
> ---
>  tools/testing/selftests/bpf/progs/profiler.inc.h | 22 +++---
>  1 file changed, 11 insertions(+), 11 deletions(-)
>
> diff --git a/tools/testing/selftests/bpf/progs/profiler.inc.h 
> b/tools/testing/selftests/bpf/progs/profiler.inc.h
> index 4896fdf8..a33066c 100644
> --- a/tools/testing/selftests/bpf/progs/profiler.inc.h
> +++ b/tools/testing/selftests/bpf/progs/profiler.inc.h
> @@ -189,7 +189,7 @@ static INLINE void populate_ancestors(struct task_struct* 
> task,
>  #endif
> for (num_ancestors = 0; num_ancestors < MAX_ANCESTORS; 
> num_ancestors++) {
> parent = BPF_CORE_READ(parent, real_parent);
> -   if (parent == NULL)
> +   if (!parent)

Sorry, but I'd like the progs to stay as close as possible to the way
they were written.
They might not adhere to kernel coding style in some cases.
The code could be grossly inefficient and even buggy.
Please don't run spell checks, coccicheck, checkpatch.pl on them.


Re: BUG: unable to handle kernel paging request in bpf_check

2021-04-12 Thread Alexei Starovoitov
On Mon, Apr 12, 2021 at 12:11 AM Hao Sun  wrote:
>
> Besides, another similar bug occurred while fault injection was enabled.
> 
> BUG: unable to handle kernel paging request in bpf_prog_alloc_no_stats
> 
> RAX: ffda RBX: 0059c080 RCX: 0047338d
> RDX: 0078 RSI: 2300 RDI: 0005
> RBP: 7f7e3c38fc90 R08:  R09: 
> R10:  R11: 0246 R12: 0004
> R13: 7ffed3a1dd6f R14: 7ffed3a1df10 R15: 7f7e3c38fdc0
> BUG: unable to handle page fault for address: 91f2077ed028
> #PF: supervisor write access in kernel mode
> #PF: error_code(0x0002) - not-present page
> PGD 1810067 P4D 1810067 PUD 1915067 PMD 3b907067 PTE 0
> Oops: 0002 [#1] SMP
> CPU: 3 PID: 17344 Comm: executor Not tainted 5.12.0-rc6+ #1
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> 1.13.0-1ubuntu1.1 04/01/2014
> RIP: 0010:bpf_prog_alloc_no_stats+0x251/0x6e0 kernel/bpf/core.c:94

Neither crash makes much sense.
There are !NULL checks in both cases.
I suspect it's a kmsan bug.
Most likely kmsan_map_kernel_range_noflush is doing something wrong.
No idea where that function lives. I don't see it in the kernel sources.


Re: [PATCH bpf-next 3/5] libbpf: add low level TC-BPF API

2021-04-03 Thread Alexei Starovoitov
On Sat, Apr 03, 2021 at 12:38:06AM +0530, Kumar Kartikeya Dwivedi wrote:
> On Sat, Apr 03, 2021 at 12:02:14AM IST, Alexei Starovoitov wrote:
> > On Fri, Apr 2, 2021 at 8:27 AM Kumar Kartikeya Dwivedi  
> > wrote:
> > > [...]
> >
> > All of these things are messy because of tc legacy. bpf tried to follow tc 
> > style
> > with cls and act distinction and it didn't quite work. cls with
> > direct-action is the only
> > thing that became mainstream while tc style attach wasn't really addressed.
> > There were several incidents where tc had tens of thousands of progs 
> > attached
> > because of this attach/query/index weirdness described above.
> > I think the only way to address this properly is to introduce bpf_link 
> > style of
> > attaching to tc. Such bpf_link would support ingress/egress only.
> > direct-action will be implied. There won't be any index, and query
> > will be obvious.
> 
> Note that we already have bpf_link support working (without support for 
> pinning
> of course) in a limited way. The ifindex, protocol, parent_id, priority,
> handle,
> chain_index tuple uniquely identifies a filter, so we stash this in the 
> bpf_link
> and are able to operate on the exact filter during release.

Except they're not unique. The library can stash them, but something else
doing detach via iproute2 or its own netlink calls will detach the prog.
This other app can attach a different prog to the same spot, and now
bpf_link__destroy will be detaching somebody else's prog.

> > So I would like to propose to take this patch set a step further from
> > what Daniel said:
> > int bpf_tc_attach(prog_fd, ifindex, {INGRESS,EGRESS}):
> > and make this proposed api to return FD.
> > To detach from tc ingress/egress just close(fd).
> 
> You mean adding an fd-based TC API to the kernel?

yes.

> > The user processes will not conflict with each other and will not accidentally
> > detach a bpf program that was attached by another user process.
> > Such api will address the existing tc query/attach/detach race
> > conditions.
> 
> Hmm, I think we do solve the race condition by returning the id. As long as 
> you
> don't misuse the interface and go around deleting filters arbitrarily (i.e. 
> only
> detach using the id), programs won't step over each other's filters. Storing 
> the
> id from the netlink response received during detach also eliminates any
> ambiguity from probing through get_info after attach. Same goes for actions,
> and the same applies to the bpf_link returning API (which stashes id/index).

There are plenty of tools and libraries out there that do attach/detach of bpf
to tc. Not everyone is going to convert to this new libbpf api overnight.
So 'misuse of the interface' is not really misuse. It's a reality that is going
to keep happening unless the kernel guarantees ownership of the attachment via FD.

> The only advantage of fd would be the possibility of pinning it, and extending
> lifetime of the filter.

Pinning is one of the advantages. The main selling point of FD is ownership
of the attachment.


Re: [PATCH bpf-next 3/5] libbpf: add low level TC-BPF API

2021-04-02 Thread Alexei Starovoitov
On Fri, Apr 2, 2021 at 8:27 AM Kumar Kartikeya Dwivedi  wrote:
>
> This would be fine, because it's not a fast path or anything, but right now we
> return the id using the netlink response, otherwise for query we have to open
> the socket, prepare the msg, send and recv again. So it's a minor 
> optimization.
>
> However, there's one other problem. In an earlier version of this series, I
> didn't keep the id/index out parameters (to act as handle to the newly 
> attached
> filter/action). This led to problems on query. Suppose a user doesn't
> properly
> fill the opts during query (e.g. in case of filters). This means the netlink
> dump includes all filters matching filled in attributes. If the prog_id for 
> all
> of these is same (e.g. all have same bpf classifier prog attached to them), it
> becomes impossible to determine which one is the filter user asked for. It is
> not possible to enforce filling in all kinds of attributes since some can be
> left out and assigned by default in the kernel (priority, chain_index etc.). 
> So
> returning the newly created filter's id turned out to be the best option. This
> is also used to stash filter related information in bpf_link to properly 
> release
> it later.
>
> The same problem happens with actions: when we look up using the prog_id,
> multiple actions with different indices can match the same prog_id. It is not
> possible to determine which index corresponds to the last loaded action.
>
> So unless there's a better idea on how to deal with this, a query API won't 
> work
> for the case where same bpf prog is attached more than once. Returning the
> id/index during attach seemed better than all other options we considered.

All of these things are messy because of tc legacy. bpf tried to follow tc style
with the cls and act distinction and it didn't quite work. cls with
direct-action is the only
thing that became mainstream while tc style attach wasn't really addressed.
There were several incidents where tc had tens of thousands of progs attached
because of this attach/query/index weirdness described above.
I think the only way to address this properly is to introduce a bpf_link style of
attaching to tc. Such a bpf_link would support ingress/egress only.
direct-action will be implied. There won't be any index, and query
will be obvious.
So I would like to propose to take this patch set a step further from
what Daniel said:
int bpf_tc_attach(prog_fd, ifindex, {INGRESS,EGRESS}):
and make this proposed api return an FD.
To detach from tc ingress/egress just close(fd).
The user processes will not conflict with each other and will not accidentally
detach a bpf program that was attached by another user process.
Such an api will address the existing tc query/attach/detach race conditions.
And the libbpf side of support for this api will be trivial: a single bpf
link_create command with ifindex and ingress|egress arguments.
wdyt?
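A hypothetical sketch of what such an FD-returning attach could look like for a
caller; bpf_tc_link_attach() and BPF_TC_INGRESS are illustrative names for the
proposal above, not an existing API:

	int link_fd;

	/* hypothetical: attach prog in direct-action mode to ingress of ifindex */
	link_fd = bpf_tc_link_attach(prog_fd, ifindex, BPF_TC_INGRESS);
	if (link_fd < 0)
		return link_fd;	/* attach failed */

	/* ... prog stays attached while link_fd is held (or pinned) ... */

	close(link_fd);		/* detaches only this process' attachment */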


Re: [PATCH bpf-next 3/5] libbpf: add low level TC-BPF API

2021-03-30 Thread Alexei Starovoitov
On Tue, Mar 30, 2021 at 2:26 PM Daniel Borkmann  wrote:
>
> On 3/30/21 10:39 PM, Andrii Nakryiko wrote:
> > On Sun, Mar 28, 2021 at 1:11 AM Kumar Kartikeya Dwivedi
> >  wrote:
> >> On Sun, Mar 28, 2021 at 10:12:40AM IST, Andrii Nakryiko wrote:
> >>> Is there some succinct but complete enough documentation/tutorial/etc
> >>> that I can reasonably read to understand kernel APIs provided by TC
> >>> (w.r.t. BPF, of course). I'm trying to wrap my head around this and
> >>> whether API makes sense or not. Please share links, if you have some.
> >>
> >> Hi Andrii,
> >>
> >> Unfortunately for the kernel API part, I couldn't find any when I was 
> >> working
> >> on this. So I had to read the iproute2 tc code (tc_filter.c, f_bpf.c,
> >> m_action.c, m_bpf.c) and the kernel side bits (cls_api.c, cls_bpf.c, 
> >> act_api.c,
> >> act_bpf.c) to grok anything I didn't understand. There's also similar code 
> >> in
> >> libnl (lib/route/{act,cls}.c).
> >>
> >> Other than that, these resources were useful (perhaps you already went 
> >> through
> >> some/all of them):
> >>
> >> https://docs.cilium.io/en/latest/bpf/#tc-traffic-control
> >> https://qmonnet.github.io/whirl-offload/2020/04/11/tc-bpf-direct-action/
> >> tc(8), and tc-bpf(8) man pages
> >>
> >> I hope this is helpful!
> >
> > Thanks! I'll take a look. Sorry, I'm a bit behind with all the stuff,
> > trying to catch up.
> >
> > I was just wondering if it would be more natural instead of having
> > _dev _block variants and having to specify __u32 ifindex, __u32
> > parent_id, __u32 protocol, to have some struct specifying TC
> > "destination"? Maybe not, but I thought I'd bring this up early. So
> > you'd have just bpf_tc_cls_attach(), and you'd so something like
> >
> > bpf_tc_cls_attach(prog_fd, TC_DEV(ifindex, parent_id, protocol))
> >
> > or
> >
> > bpf_tc_cls_attach(prog_fd, TC_BLOCK(block_idx, protocol))
> >
> > ? Or it's taking it too far?
> >
> > But even if not, I think detaching can be unified between _dev and
> > _block, can't it?
>
> Do we even need the _block variant? I would rather prefer to take the chance
> and make it as simple as possible, and only iff really needed extend with
> other APIs, for example:
>
>bpf_tc_attach(prog_fd, ifindex, {INGRESS,EGRESS});
>
> Internally, this will create the sch_clsact qdisc & cls_bpf filter instance
> iff not present yet, and attach to a default prio 1 handle 1, and _always_ in
> direct-action mode. This is /as simple as it gets/ and we don't need to bother
> users with more complex tc/cls_bpf internals unless desired. For example,
> extended APIs could add prio/parent so that multi-prog can be attached to a
> single cls_bpf instance, but even that could be a second step, imho.

+1 to support sched_cls in direct-action mode only.
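For context, roughly what the proposed helper would do under the hood, expressed
with tc(8); device, object file, and section name are illustrative, prio/handle
follow Daniel's description:

  # create the clsact qdisc if it is not present yet
  tc qdisc add dev eth0 clsact
  # attach the bpf classifier at prio 1 handle 1 in direct-action mode
  tc filter add dev eth0 ingress prio 1 handle 1 bpf da obj prog.o sec ingress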


Re: [PATCH bpf-next 5/5] libbpf: add selftests for TC-BPF API

2021-03-30 Thread Alexei Starovoitov
On Tue, Mar 30, 2021 at 1:28 PM Andrii Nakryiko
 wrote:
> >
> > In the other thread you've proposed to copy-paste the hash implementation
> > into pahole. That's not ideal. If we had libbpfutil other projects
> > could have used that without copy-paste.
>
> I know it's not ideal. But I don't think libbpf should be in the
> business of providing generic data structures with stable APIs either.

There is a need for a hash in pahole and it's already using libbpf.
It would be good to reuse the code.

> > that's today. Plus mandatory libelf and libz.
> > I would like to have libsysbpf that doesn't depend on libelf/libz
> > for folks that don't need it.
>
> TBH, bpf.c is such a minimal shim on top of bpf() syscall, that
> providing all of its implementation as a single .h wouldn't be too
> horrible. Then whatever applications want those syscall wrappers would
> just include bpf/bpf.h and have no need for the library at all.

A 1k-line bpf.h, hmm. That's not going to be a conventional C header,
but it could work, I guess.

> > Also I'd like to see symbolizer to be included in "libbpf package".
> > Currently it's the main component that libbcc offers, but libbpf doesn't.
> > Say we don't split libbpf. Then symbolizer will bring some dwarf library
> > (say libdwarves ~ 1Mbyte) and libiberty ~ 500k (for c++ demangle).
> > Now we're looking at multi megabyte libbpf package.
>
> Right, which is one of the reasons why it probably doesn't belong in
> libbpf at all. Another is that it's not BPF-specific functionality at
> all.

Symbolizer, usdt, python and lua bindings are what made libbcc successful.
I think "libbpf package" should include everything that bpf tracing folks
might need.
Getting -l flags correct from a single package isn't a big deal
compared with the need to deal with different packages that
depend on each other.

> I'm against pro-active splitting just in case. I'd rather discuss
> specific problems when we get to them. I think it's premature right
> now to split libbpf.

Fine.
I'm mainly advocating for changing the mental model to see
libbpf as a collection of tools and libraries and not just a single libbpf.a.


Re: [PATCH bpf-next v2] bpf: check flags in 'bpf_ringbuf_discard()' and 'bpf_ringbuf_submit()'

2021-03-30 Thread Alexei Starovoitov
On Tue, Mar 30, 2021 at 3:54 PM Pedro Tammela  wrote:
>
>  BPF_CALL_2(bpf_ringbuf_submit, void *, sample, u64, flags)
>  {
> +   if (unlikely(flags & ~(BPF_RB_NO_WAKEUP | BPF_RB_FORCE_WAKEUP)))
> +   return -EINVAL;
> +
> bpf_ringbuf_commit(sample, flags, false /* discard */);
> +
> return 0;

I think the ringbuf design was meant for bpf_ringbuf_submit to never fail.
If we do flag validation, it should probably be done at verifier time.

Re: [PATCH bpf-next 5/5] libbpf: add selftests for TC-BPF API

2021-03-29 Thread Alexei Starovoitov
On Sun, Mar 28, 2021 at 07:38:42PM -0700, Andrii Nakryiko wrote:
> 
> See above. I don't know which hassle is libbpf for users today. You
> were implying code size used for functionality users might not use
> (e.g., linker). Libbpf is a very small library, <300KB. There are
> users building tools for constrained embedded systems that use libbpf.
> There are/were various problems mentioned, but the size of libbpf
> wasn't yet one of them. We should certainly watch the code bloat, but
> we are not yet at the point where library is too big for users to be
> turned off. 

It's true that today sizeof(libbpf + libelf + libz) ~ 500k is not a concern.
I'm worried about what it will grow to if we don't start splitting things up.

Why split libxdp into its own lib?
If tc attach is going to be part of libbpf, all things xdp should be
part of libbpf as well.

But af_xdp folks are probably annoyed that they need to add -lelf and -lz
though they're not using them. Just tech debt that eventually needs to be paid.

> > I would take this opportunity to split libbpf into maintainable pieces:
> > - libsysbpf - sys_bpf wrappers (pretty much tools/lib/bpf/bpf.c)
> > - libbpfutil - hash, strset
> 
> strset and hash are internal data structures, I never intended to
> expose them through public APIs. I haven't investigated, but if we
> have a separate shared library (libbpfutil), I imagine we won't be
> able to hide those APIs, right?

In the other thread you've proposed to copy-paste the hash implementation
into pahole. That's not ideal. If we had libbpfutil, other projects
could have used that without copy-paste.

> But again, let's just reflect for a second that we are talking about
> the library that takes less than 300KB total. 

That's today, plus mandatory libelf and libz.
I would like to have a libsysbpf that doesn't depend on libelf/libz
for folks that don't need them.
Also I'd like to see the symbolizer included in the "libbpf package".
Currently it's the main component that libbcc offers, but libbpf doesn't.
Say we don't split libbpf. Then the symbolizer will bring in some dwarf library
(say libdwarves ~ 1 Mbyte) and libiberty ~ 500k (for c++ demangling).
Now we're looking at a multi-megabyte libbpf package.
I think the users would benefit from smaller building blocks.
Splitting into 10 mini libs is overkill, of course,
but some split is necessary.
I agree that moving static linking into a separate lib won't really
affect .text size. The point is not to reduce text, but to establish
a framework where such things are possible. Then the symbolizer and
anything fancier that would depend on other libs can be part
of the "libbpf package". I mean a single rpm that contains all libbpf libs.
Basic libsysbpf wouldn't need libelf/z.
libbpfsymbolizer would need libdwarf, etc.
Maybe some libbpfnet would depend on libnl or whatnot.


Re: [PATCH bpf-next 5/5] libbpf: add selftests for TC-BPF API

2021-03-28 Thread Alexei Starovoitov
On Sat, Mar 27, 2021 at 09:32:58PM -0700, Andrii Nakryiko wrote:
> > I think it's better to start with new library for tc/xdp and have
> > libbpf as a dependency on that new lib.
> > For example we can add it as subdir in tools/lib/bpf/.
> >
> > Similarly I think integrating static linking into libbpf was a mistake.
> > It should be a sub library as well.
> >
> > If we end up with core libbpf and ten sublibs for tc, xdp, af_xdp, linking,
> > whatever else the users would appreciate that we don't shove single libbpf
> > to them with a ton of features that they might never use.
> 
> What's the concern exactly? The size of the library? Having 10
> micro-libraries has its own set of downsides, 

specifically?

> I'm not convinced that's
> a better situation for end users. And would certainly cause more
> hassle for libbpf developers and packagers.

For developers and packagers.. yes.
For users.. quite the opposite.
The skel gen and static linking must be split out before the next libbpf 
release.
Not a single application linked with libbpf is going to use those pieces.
bpftool is the one and only that needs them. Hence forcing libbpf users
to increase their .text with dead code is a selfish call by libbpf
developers and packagers. The users' priorities must come first.

> And what did you include in "core libbpf"?

I would take this opportunity to split libbpf into maintainable pieces:
- libsysbpf - sys_bpf wrappers (pretty much tools/lib/bpf/bpf.c)
- libbpfutil - hash, strset
- libbtf - BTF read/write
- libbpfelf - ELF parsing, CORE, ksym, kconfig
- libbpfskel - skeleton gen used by bpftool only
- libbpflink - linker used by bpftool only
- libbpfnet - networking attachment via netlink including TC and XDP
- libbpftrace - perfbuf, ringbuf
- libxdp - Toke's xdp chaining
- libxsk - af_xdp logic

In the future the stack trace symbolization code can come
into libbpftrace or be a part of its own lib.
My upcoming loader program and signed prog generation logic
can be part of libbpfskel.


Re: [PATCH bpf-next 5/5] libbpf: add selftests for TC-BPF API

2021-03-28 Thread Alexei Starovoitov
On Sat, Mar 27, 2021 at 04:17:16PM +0100, Toke Høiland-Jørgensen wrote:
> Alexei Starovoitov  writes:
> 
> > On Thu, Mar 25, 2021 at 05:30:03PM +0530, Kumar Kartikeya Dwivedi wrote:
> >> This adds some basic tests for the low level bpf_tc_* API and its
> >> bpf_program__attach_tc_* wrapper on top.
> >
> > *_block() apis from patch 3 and 4 are not covered by this selftest.
> > Why were they added ? And how were they tested?
> >
> > Pls trim your cc. bpf@vger and netdev@vger would have been enough.
> >
> > My main concern with this set is that it adds netlink apis to libbpf while
> > we already agreed to split xdp manipulation pieces out of libbpf.
> > It would be odd to add tc apis now only to split them later.
> 
> We're not removing the ability to attach an XDP program via netlink from
> libxdp, though. This is the equivalent for TC: the minimum support to
> attach a program, and if you want to do more, you pull in another
> library or roll your own.
> 
> I'm fine with cutting out more stuff and making this even more minimal
> (e.g., remove the block stuff and only support attach/detach on ifaces),
> but we figured we'd err on the side of including too much and getting
> some feedback from others on which bits are the essential ones to keep,
> and which can be dropped.

This is up to you. I'm trying to understand the motivation for *_block() apis.
I'm not taking a stance for/against them.

> > I think it's better to start with new library for tc/xdp and have
> > libbpf as a dependency on that new lib.
> > For example we can add it as subdir in tools/lib/bpf/.
> 
> I agree for the higher-level stuff (though I'm not sure what that would
> be for TC), but right now TC programs are the only ones that cannot be
> attached by libbpf, which is annoying; that's what we're trying to fix.

Sure. I wasn't saying that there is no place for these APIs in libbpf+.
Just that the existing libbpf has already become a kitchen sink of features
that users are not going to use, like static linking.
The tc API was the straw that broke the camel's back.
I think we must move static linking and skeleton out of libbpf before
the next release.


Re: [PATCH bpf-next] bpf: tcp: Remove comma which is causing build error

2021-03-28 Thread Alexei Starovoitov
On Sun, Mar 28, 2021 at 5:05 AM Atul Gopinathan
 wrote:
>
> Currently, building the bpf-next source with the CONFIG_BPF_SYSCALL
> enabled is causing a compilation error:
>
> "net/ipv4/bpf_tcp_ca.c:209:28: error: expected identifier or '(' before
> ',' token"
>
> Fix this by removing an unnecessary comma.
>
> Reported-by: syzbot+0b74d8ec3bf0cc4e4...@syzkaller.appspotmail.com
> Fixes: e78aea8b2170 ("bpf: tcp: Put some tcp cong functions in allowlist for 
> bpf-tcp-cc")
> Signed-off-by: Atul Gopinathan 

Thanks for the quick fix. Applied.


Re: [PATCH bpf-next] bpf: trace jit code when enable BPF_JIT_ALWAYS_ON

2021-03-27 Thread Alexei Starovoitov
On Sat, Mar 27, 2021 at 1:19 AM Jianlin Lv  wrote:
>
> > On Fri, Mar 26, 2021 at 5:40 AM Jianlin Lv  wrote:
> > >
> > > When CONFIG_BPF_JIT_ALWAYS_ON is enabled, the value of
> > bpf_jit_enable
> > > in /proc/sys is limited to SYSCTL_ONE. This is not convenient for 
> > > debugging.
> > > This patch modifies the value of extra2 (max) to 2 that support
> > > developers to emit traces on kernel log.
> > >
> > > Signed-off-by: Jianlin Lv 
> > > ---
> > >  net/core/sysctl_net_core.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
> > > index d84c8a1b280e..aa16883ac445 100644
> > > --- a/net/core/sysctl_net_core.c
> > > +++ b/net/core/sysctl_net_core.c
> > > @@ -386,7 +386,7 @@ static struct ctl_table net_core_table[] = {
> > > .proc_handler   = proc_dointvec_minmax_bpf_enable,
> > >  # ifdef CONFIG_BPF_JIT_ALWAYS_ON
> > > .extra1 = SYSCTL_ONE,
> > > -   .extra2 = SYSCTL_ONE,
> > > +   .extra2 = ,
> >
> > "bpftool prog dump jited" is much better way to examine JITed dumps.
> > I'd rather remove bpf_jit_enable=2 altogether.
>
> In my case, I introduced a bug when I made some adjustments to the arm64
> jit macro A64_MOV(), which caused the SP register to be replaced by the
> XZR register when building prologue, and the wrong value was stored in fp,
> which triggered a crash.
>
> This bug is likely to cause the instruction to access the BPF stack in
> jited prog to trigger a crash.
> I tried to use bpftool to debug, but bpftool crashed when I executed the
> "bpftool prog show" command.
> The syslog shown that bpftool is loading and running some bpf prog.
> because of the bug in the JIT compiler, the bpftool execution failed.

Right, the 'bpftool prog show' command loads a bpf iterator prog,
but you didn't need to use it to dump JITed code.
"bpftool prog dump jited name my_prog"
would have dumped it even when JIT is all buggy.

> bpf_jit_disasm saved me, it helped me dump the jited image:
>
> echo 2> /proc/sys/net/core/bpf_jit_enable
> modprobe test_bpf test_name="SPILL_FILL"
> ./bpf_jit_disasm -o
>
> So keeping bpf_jit_enable=2 is still very meaningful for developers who
> try to modify the JIT compiler.

Sure, and such JIT developers can compile the kernel
without BPF_JIT_ALWAYS_ON, just like you did.
They can also insert printk, etc.
bpf_jit_enable=2 was done long ago when there was no other way
to see JITed code. Now we have proper apis.
That =2 mode can and should be removed.

> IMPORTANT NOTICE: The contents of this email and any attachments are 
> confidential and may also be privileged. If you are not the intended 
> recipient, please notify the sender immediately and do not disclose the 
> contents to any other person, use it for any purpose, or store or copy the 
> information in any medium. Thank you.

please fix your email server/client/whatever. No patches will ever be
accepted with
such disclaimer.


Re: [PATCH bpf-next 5/5] libbpf: add selftests for TC-BPF API

2021-03-26 Thread Alexei Starovoitov
On Thu, Mar 25, 2021 at 05:30:03PM +0530, Kumar Kartikeya Dwivedi wrote:
> This adds some basic tests for the low level bpf_tc_* API and its
> bpf_program__attach_tc_* wrapper on top.

*_block() apis from patch 3 and 4 are not covered by this selftest.
Why were they added ? And how were they tested?

Pls trim your cc. bpf@vger and netdev@vger would have been enough.

My main concern with this set is that it adds netlink apis to libbpf while
we already agreed to split xdp manipulation pieces out of libbpf.
It would be odd to add tc apis now only to split them later.
I think it's better to start with new library for tc/xdp and have
libbpf as a dependency on that new lib.
For example we can add it as subdir in tools/lib/bpf/.

Similarly I think integrating static linking into libbpf was a mistake.
It should be a sub library as well.

If we end up with core libbpf and ten sublibs for tc, xdp, af_xdp, linking,
whatever else the users would appreciate that we don't shove single libbpf
to them with a ton of features that they might never use.


Re: [PATCH bpf-next] bpf: trace jit code when enable BPF_JIT_ALWAYS_ON

2021-03-26 Thread Alexei Starovoitov
On Fri, Mar 26, 2021 at 5:40 AM Jianlin Lv  wrote:
>
> When CONFIG_BPF_JIT_ALWAYS_ON is enabled, the value of bpf_jit_enable in
> /proc/sys is limited to SYSCTL_ONE. This is not convenient for debugging.
> This patch modifies the value of extra2 (max) to 2 that support developers
> to emit traces on kernel log.
>
> Signed-off-by: Jianlin Lv 
> ---
>  net/core/sysctl_net_core.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
> index d84c8a1b280e..aa16883ac445 100644
> --- a/net/core/sysctl_net_core.c
> +++ b/net/core/sysctl_net_core.c
> @@ -386,7 +386,7 @@ static struct ctl_table net_core_table[] = {
> .proc_handler   = proc_dointvec_minmax_bpf_enable,
>  # ifdef CONFIG_BPF_JIT_ALWAYS_ON
> .extra1 = SYSCTL_ONE,
> -   .extra2 = SYSCTL_ONE,
> +   .extra2 = ,

"bpftool prog dump jited" is much better way to examine JITed dumps.
I'd rather remove bpf_jit_enable=2 altogether.


Re: [PATCH bpf-next 2/5] bpf: Add a bpf_snprintf helper

2021-03-22 Thread Alexei Starovoitov
On Wed, Mar 10, 2021 at 11:02:08PM +0100, Florent Revest wrote:
>  
> +struct bpf_snprintf_buf {
> + char buf[MAX_SNPRINTF_MEMCPY][MAX_SNPRINTF_STR_LEN];
> +};
> +static DEFINE_PER_CPU(struct bpf_snprintf_buf, bpf_snprintf_buf);
> +static DEFINE_PER_CPU(int, bpf_snprintf_buf_used);
> +
> +BPF_CALL_5(bpf_snprintf, char *, out, u32, out_size, char *, fmt, u64 *, 
> args,
> +u32, args_len)
> +{
> + int err, i, buf_used, copy_size, fmt_cnt = 0, memcpy_cnt = 0;
> + u64 params[MAX_SNPRINTF_VARARGS];
> + struct bpf_snprintf_buf *bufs;
> +
> + buf_used = this_cpu_inc_return(bpf_snprintf_buf_used);
> + if (WARN_ON_ONCE(buf_used > 1)) {

this can trigger only if the helper itself gets preempted and
another bpf prog runs on the same cpu and calls into this helper
again, right?
If so, how about adding preempt_disable here to avoid this case?
It won't prevent the case where a kprobe is inside the snprintf core,
so the counter is still needed, but it wouldn't trigger by accident.
Also, since bufs are not always used, how about grabbing the
buffers only when %p or %s are seen in fmt?
After snprintf() is done it would conditionally do:
if (bufs_were_used) {
   this_cpu_dec(bpf_snprintf_buf_used);
   preempt_enable();
}
This way simple bpf_snprintf won't ever hit EBUSY.

> + err = -EBUSY;
> + goto out;
> + }
> +
> + bufs = this_cpu_ptr(&bpf_snprintf_buf);
> +
> + /*
> +  * The verifier has already done most of the heavy-work for us in
> +  * check_bpf_snprintf_call. We know that fmt is well formatted and that
> +  * args_len is valid. The only task left is to convert some of the
> +  * arguments. For the %s and %pi* specifiers, we need to read buffers
> +  * from a kernel address during the helper call.
> +  */
> + for (i = 0; fmt[i] != '\0'; i++) {
> + if (fmt[i] != '%')
> + continue;
> +
> + if (fmt[i + 1] == '%') {
> + i++;
> + continue;
> + }
> +
> + /* fmt[i] != 0 && fmt[last] == 0, so we can access fmt[i + 1] */
> + i++;
> +
> + /* skip optional "[0 +-][num]" width formating field */
> + while (fmt[i] == '0' || fmt[i] == '+'  || fmt[i] == '-' ||
> +fmt[i] == ' ')
> + i++;
> + if (fmt[i] >= '1' && fmt[i] <= '9') {
> + i++;
> + while (fmt[i] >= '0' && fmt[i] <= '9')
> + i++;
> + }
> +
> + if (fmt[i] == 's') {
> + void *unsafe_ptr = (void *)(long)args[fmt_cnt];
> +
> + err = strncpy_from_kernel_nofault(bufs->buf[memcpy_cnt],
> +   unsafe_ptr,
> +   MAX_SNPRINTF_STR_LEN);
> + if (err < 0)
> + bufs->buf[memcpy_cnt][0] = '\0';
> + params[fmt_cnt] = (u64)(long)bufs->buf[memcpy_cnt];

how about:
char buf[512]; instead?
instead of memcpy_cnt++ remember how many bytes of the buf were used and
copy next arg after that.
The scratch space would be used more efficiently.
The helper would potentially return ENOSPC if the first string printed via %s
consumed most of the 512 space and the second string doesn't fit.
But the verifier-time if (memcpy_cnt >= MAX_SNPRINTF_MEMCPY) can be removed.
Ten small %s will work fine.

We can allocate a page per-cpu when this helper is used by prog and free
that page when all progs with bpf_snprintf are unloaded.
But extra complexity is probably not worth it. I would start with 512 per-cpu.
It's going to be enough for most users.

Overall looks great. Cannot wait for v2 :)
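
To make that concrete, something like this (just a sketch of the idea,
not the actual implementation):

#include <linux/errno.h>
#include <linux/percpu.h>
#include <linux/preempt.h>

#define BPF_SNPRINTF_BUF_SZ 512

struct bpf_snprintf_buf {
	char buf[BPF_SNPRINTF_BUF_SZ];
};
static DEFINE_PER_CPU(struct bpf_snprintf_buf, bpf_snprintf_buf);
static DEFINE_PER_CPU(int, bpf_snprintf_buf_used);

/* grabbed lazily, only once the first %s or %pi* argument is seen in fmt */
static int get_snprintf_buf(struct bpf_snprintf_buf **bufs)
{
	preempt_disable();
	if (this_cpu_inc_return(bpf_snprintf_buf_used) != 1) {
		/* only a probe placed inside the snprintf core gets here */
		this_cpu_dec(bpf_snprintf_buf_used);
		preempt_enable();
		return -EBUSY;
	}
	*bufs = this_cpu_ptr(&bpf_snprintf_buf);
	return 0;
}

static void put_snprintf_buf(void)
{
	this_cpu_dec(bpf_snprintf_buf_used);
	preempt_enable();
}

Each %s argument would then be copied at the current offset into buf and the
offset advanced past it, so ten small strings fit fine and only a couple of
huge ones would hit ENOSPC.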


Re: linux-next: manual merge of the net-next tree with the net tree

2021-03-19 Thread Alexei Starovoitov
On Fri, Mar 19, 2021 at 8:17 AM Yonghong Song  wrote:
>
>
>
> On 3/19/21 12:21 AM, Daniel Borkmann wrote:
> > On 3/19/21 3:11 AM, Piotr Krysiuk wrote:
> >> Hi Daniel,
> >>
> >> On Fri, Mar 19, 2021 at 12:16 AM Stephen Rothwell 
> >> wrote:
> >>
> >>> diff --cc kernel/bpf/verifier.c
> >>> index 44e4ec1640f1,f9096b049cd6..
> >>> --- a/kernel/bpf/verifier.c
> >>> +++ b/kernel/bpf/verifier.c
> >>> @@@ -5876,10 -6056,22 +6060,23 @@@ static int
> >>> retrieve_ptr_limit(const str
> >>>  if (mask_to_left)
> >>>  *ptr_limit = MAX_BPF_STACK + off;
> >>>  else
> >>>   -  *ptr_limit = -off;
> >>>   -  return 0;
> >>>   +  *ptr_limit = -off - 1;
> >>>   +  return *ptr_limit >= max ? -ERANGE : 0;
> >>> +   case PTR_TO_MAP_KEY:
> >>> +   /* Currently, this code is not exercised as the only use
> >>> +* is bpf_for_each_map_elem() helper which requires
> >>> +* bpf_capble. The code has been tested manually for
> >>> +* future use.
> >>> +*/
> >>> +   if (mask_to_left) {
> >>> +   *ptr_limit = ptr_reg->umax_value + ptr_reg->off;
> >>> +   } else {
> >>> +   off = ptr_reg->smin_value + ptr_reg->off;
> >>> +   *ptr_limit = ptr_reg->map_ptr->key_size - off;
> >>> +   }
> >>> +   return 0;
> >>>
> >>
> >> PTR_TO_MAP_VALUE logic above looks like copy-paste of old
> >> PTR_TO_MAP_VALUE
> >> code from before "bpf: Fix off-by-one for area size in creating mask to
> >> left" and is apparently affected by the same off-by-one, except this time
> >> on "key_size" area and not "value_size".
> >>
> >> This needs to be fixed in the same way as we did with PTR_TO_MAP_VALUE.
> >> What is the best way to proceed?
> >
> > Hm, not sure why PTR_TO_MAP_KEY was added by 69c087ba6225 in the first
> > place, I
> > presume noone expects this to be used from unprivileged as the comment
> > says.
> > Resolution should be to remove the PTR_TO_MAP_KEY case entirely from
> > that switch
> > until we have an actual user.
>
> Alexei suggested so that we don't forget it in the future if
> bpf_capable() requirement is removed.
> https://lore.kernel.org/bpf/c837ae55-2487-2f39-47f6-a18781dc6...@fb.com/
>
> I am okay with either way, fix it or remove it.

I prefer to fix it.


Re: [PATCH bpf-next] bpf: Simplify expression for identify bpf mem type

2021-03-17 Thread Alexei Starovoitov
On Wed, Mar 17, 2021 at 5:52 AM Jianlin Lv  wrote:
> return BPF_CLASS(meta->insn.code);
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 2d3036e292a9..5d77675e7112 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -21,6 +21,7 @@
>  #define BPF_DW 0x18/* double word (64-bit) */
>  #define BPF_ATOMIC 0xc0/* atomic memory ops - op type in immediate */
>  #define BPF_XADD   0xc0/* exclusive add - legacy name */
> +#define BPF_SIZE_MASK  0x18/* mask of size modifier */

Pls don't add kernel internal defines to uapi.


Re: [PATCH v2] bpf: Fix memory leak in copy_process()

2021-03-16 Thread Alexei Starovoitov
On Tue, Mar 16, 2021 at 4:29 AM Zhang, Qiang  wrote:
>
> Hello Alexei Starovoitov Daniel Borkmann
> Please  review this patch.

Please don't top post.


Re: The killing of ideal_nops[]

2021-03-10 Thread Alexei Starovoitov
On Wed, Mar 10, 2021 at 6:29 AM Peter Zijlstra  wrote:
>
> On Wed, Mar 10, 2021 at 09:13:24AM -0500, Steven Rostedt wrote:
> > On Wed, 10 Mar 2021 11:22:48 +0100
> > Peter Zijlstra  wrote:
> >
> > > After this FEATURE_NOPL is unused except for required-features for
> > > x86_64. FEATURE_K8 is only used for PTI and FEATURE_K7 is unused.
> > >
> > > AFAICT this negatively affects lots of 32bit (DONTCARE) and 32bit on
> > > 64bit CPUs (CARELESS) and early AMD (K8) which is from 2003 and almost
> > > 2 decades old by now (SHRUG).
> > >
> > > Everything x86_64 since AMD K10 (2007) was using p6_nops.
> > >
> > > And per FEATURE_NOPL being required for x86_64, all those CPUs can use
> > > p6_nops. So stop caring about NOPs, simplify things and get on with life
> > > :-)
> >
> > Before ripping out all the ideal_nop logic, I wonder if we should just
> > force the nops you want now (that is, don't change the selected
> > ideal_nops, just "pretend" that the CPU wants p6_nops), and see if anyone
> > complains. After a few releases, if there's no complaints, then we can
> > rip out the ideal_nop logic.
>
> Nah, just rip the entire thing out. You should be happy about
> deterministic NOPs :-)

Ack for bpf bits.
I think the cleanup is good from the point of view of having one way to do things.
Though I won't be surprised if somebody comes along with a patch
to use different nops eventually.
When I first looked at it years ago I was wondering why segment selector
prefix is not used. afaik windows was using it, because having cs: ds: nop
makes intel use only one of the instruction decoders (the big and slow one)
which allegedly saves power, since the pipeline has bubbles.
Things could be completely different now in modern u-arches.


Re: [PATCH] net: add net namespace inode for all net_dev events

2021-03-09 Thread Alexei Starovoitov
On Tue, Mar 9, 2021 at 12:37 PM Steven Rostedt  wrote:
>
> The size of the fields and order changes all the time in various events. I
> recommend doing so *all the time*. If you upgrade a kernel, then all the bpf
> programs you have for that kernel should also be updated. You can't rely on
> fields being the same, size or order. The best you can do is expect the
> field to continue to exist, and that's not even a guarantee.

+1. Tracing bpf progs already do that.
Old style tracing progs do it based on the kernel version.
New style is using CO-RE.
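For example (a sketch, not code from this thread), the CO-RE flavor of a
sched_switch prog reads fields through BTF-relocated offsets, so reordering
fields in the kernel doesn't break it:

#include "vmlinux.h"            /* kernel types generated from BTF */
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

SEC("tp_btf/sched_switch")
int BPF_PROG(on_switch, bool preempt, struct task_struct *prev,
	     struct task_struct *next)
{
	/* offsets of ->pid are relocated against the running kernel's BTF
	 * at load time, not hard coded at compile time */
	bpf_printk("prev=%d next=%d", prev->pid, next->pid);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";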


Re: [PATCH] net: add net namespace inode for all net_dev events

2021-03-09 Thread Alexei Starovoitov
On Tue, Mar 9, 2021 at 12:18 PM David Ahern  wrote:
>
> On 3/9/21 1:02 PM, Steven Rostedt wrote:
> > On Tue, 9 Mar 2021 12:53:37 -0700
> > David Ahern  wrote:
> >
> >> Changing the order of the fields will impact any bpf programs expecting
> >> the existing format
> >
> > I thought bpf programs were not API. And why are they not parsing this
> > information? They have these offsets hard coded Why would they do that!
> > The information to extract the data where ever it is has been there from
> > day 1! Way before BPF ever had access to trace events.
>
> BPF programs attached to a tracepoint are passed a context - a structure
> based on the format for the tracepoint. To take an in-tree example, look
> at samples/bpf/offwaketime_kern.c:
>
> ...
>
> /* taken from /sys/kernel/debug/tracing/events/sched/sched_switch/format */
> struct sched_switch_args {
> unsigned long long pad;
> char prev_comm[16];
> int prev_pid;
> int prev_prio;
> long long prev_state;
> char next_comm[16];
> int next_pid;
> int next_prio;
> };
> SEC("tracepoint/sched/sched_switch")
> int oncpu(struct sched_switch_args *ctx)
> {
>
> ...
>
> Production systems do not typically have toolchains installed, so
> dynamic generation of the program based on the 'format' file on the
> running system is not realistic. That means creating the programs on a
> development machine and installing on the production box. Further, there
> is an expectation that a bpf program compiled against version X works on
> version Y.

This is not true.


Re: [PATCH] net: add net namespace inode for all net_dev events

2021-03-09 Thread Alexei Starovoitov
On Tue, Mar 9, 2021 at 12:03 PM Steven Rostedt  wrote:
>
> On Tue, 9 Mar 2021 12:53:37 -0700
> David Ahern  wrote:
>
> > Changing the order of the fields will impact any bpf programs expecting
> > the existing format
>
> I thought bpf programs were not API.

That is correct.


Re: "struct perf_sample_data" alignment

2021-03-05 Thread Alexei Starovoitov
On Fri, Mar 5, 2021 at 7:01 AM Peter Zijlstra  wrote:
>
> On Fri, Mar 05, 2021 at 09:36:30AM +0100, Peter Zijlstra wrote:
> > On Thu, Mar 04, 2021 at 07:45:44PM -0800, Linus Torvalds wrote:
> > > That cacheline_aligned goes back many years, this is not new, it
> > > seems to come from back in 2014: commit 2565711fb7d7 ("perf: Improve
> > > the perf_sample_data struct layout").
> >
> > long time ago...
> >
> > > But it really seems entirely and utterly bogus. That cacheline
> > > alignment makes things *worse*, when the variables are on the local
> > > stack. The local stack is already going to be dirty and in the cache,
> > > and aligning those things isn't going to - and I quote from the code
> > > in that commend in that commit - "minimize the cachelines touched".
> > >
> > > Quite the reverse. It's just going to make the stack frame use *more*
> > > memory, and make any cacheline usage _worse_.
> >
> > IIRC there is more history here, but I can't seem to find references
> > just now.
> >
> > What I remember is that since perf_sample_data is fairly large,
> > unconditionally initializing the whole thing is *slow* (and
> > -fauto-var-init=zero will hurt here).
> >
> > So at some point I removed that full initialization and made sure we
> > only unconditionally touched the first few variables, which gave a
> > measurable speedup.
> >
> > Then things got messy again and the commit 2565711fb7d7 referenced above
> > was cleanup, to get back to that initial state.
> >
> > Now, you're right that __cacheline_aligned on on-stack (and this is
> > indeed mostly on-stack) is fairly tedious (there were a few patches
> > recently to reduce the amount of on-stack instances).
> >
> > I'll put it on the todo list, along with that hotplug stuff (which I
> > tried to fix but ended up with an even bigger mess). I suppose we can
> > try and not have the alignment for the on-stack instances while
> > preserving it for the few off-stack ones.
> >
> > Also; we're running on the NMI stack, and that's not typically hot.
>
> This seems to be it... (completely untested)
>
> ---
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index 3f7f89ea5e51..918a296d2ca2 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -1032,7 +1032,9 @@ struct perf_sample_data {
> u64 cgroup;
> u64 data_page_size;
> u64 code_page_size;
> -} cacheline_aligned;
> +};
> +
> +typedef struct perf_sample_data perf_sample_data_t cacheline_aligned;
>
>  /* default value for data source */
>  #define PERF_MEM_NA (PERF_MEM_S(OP, NA)   |\
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index b0c45d923f0f..f32c623abef6 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -923,7 +923,7 @@ __bpf_perf_event_output(struct pt_regs *regs, struct 
> bpf_map *map,
>   * bpf_perf_event_output
>   */
>  struct bpf_trace_sample_data {
> -   struct perf_sample_data sds[3];
> +   perf_sample_data_t sds[3];

The bpf side doesn't care about cacheline alignment.
No need to add a new typedef just for that.


Re: Why do kprobes and uprobes singlestep?

2021-03-03 Thread Alexei Starovoitov
On Tue, Mar 2, 2021 at 5:46 PM Andy Lutomirski  wrote:
>
>
> > On Mar 2, 2021, at 5:22 PM, Alexei Starovoitov 
> >  wrote:
> >
> > On Tue, Mar 2, 2021 at 1:02 PM Andy Lutomirski  wrote:
> >>
> >>
> >>>> On Mar 2, 2021, at 12:24 PM, Alexei Starovoitov 
> >>>>  wrote:
> >>>
> >>> On Tue, Mar 2, 2021 at 10:38 AM Andy Lutomirski  wrote:
> >>>>
> >>>> Is there something like a uprobe test suite?  How maintained /
> >>>> actively used is uprobe?
> >>>
> >>> uprobe+bpf is heavily used in production.
> >>> selftests/bpf has only one test for it though.
> >>>
> >>> Why are you asking?
> >>
> >> Because the integration with the x86 entry code is a mess, and I want to 
> >> know whether to mark it BROKEN or how to make sure the any cleanups 
> >> actually work.
> >
> > Any test case to repro the issue you found?
> > Is it a bug or just messy code?
>
> Just messy code.
>
> > Nowadays a good chunk of popular applications (python, mysql, etc) has
> > USDTs in them.
> > Issues reported with bcc:
> > https://github.com/iovisor/bcc/issues?q=is%3Aissue+USDT
> > Similar thing with bpftrace.
> > Both standard USDT and semaphore based are used in the wild.
> > uprobe for containers has been a long standing feature request.
> > If you can improve uprobe performance that would be awesome.
> > That's another thing that people report often. We optimized it a bit.
> > More can be done.
>
>
> Wait... USDT is much easier to implement well.  Are we talking just USDT or 
> are we talking about general uprobes in which almost any instruction can get 
> probed?  If the only users that care about uprobes are doing USDT, we could 
> vastly simplify the implementation and probably make it faster, too.

USDTs are driving the majority of uprobe usage.
If they can get faster it will increase their adoption even more.
There are certainly cases of normal uprobes.
They are at the start of the function 99% of the time.
Like the following:
"uprobe:/lib64/libc.so:malloc(u64 size):size:size,_ret",
"uprobe:/lib64/libc.so:free(void *ptr)::ptr",
is common despite its overhead.

Here is the most interesting and practical usage of uprobes:
https://github.com/iovisor/bcc/blob/master/tools/sslsniff.py
and the manpage for the tool:
https://github.com/iovisor/bcc/blob/master/tools/sslsniff_example.txt

uprobe in the middle of the function is very rare.
If the kernel starts rejecting uprobes on some weird instructions
I suspect no one will complain.
Especially if such tightening comes with a performance boost for
a uprobe on a nop and a uprobe at the start (which is typically a push or
alu on %sp).
That would be a great step forward.


Re: Why do kprobes and uprobes singlestep?

2021-03-03 Thread Alexei Starovoitov
On Tue, Mar 2, 2021 at 1:02 PM Andy Lutomirski  wrote:
>
>
> > On Mar 2, 2021, at 12:24 PM, Alexei Starovoitov 
> >  wrote:
> >
> > On Tue, Mar 2, 2021 at 10:38 AM Andy Lutomirski  wrote:
> >>
> >> Is there something like a uprobe test suite?  How maintained /
> >> actively used is uprobe?
> >
> > uprobe+bpf is heavily used in production.
> > selftests/bpf has only one test for it though.
> >
> > Why are you asking?
>
> Because the integration with the x86 entry code is a mess, and I want to know 
> whether to mark it BROKEN or how to make sure the any cleanups actually work.

Any test case to repro the issue you found?
Is it a bug or just messy code?
Nowadays a good chunk of popular applications (python, mysql, etc) has
USDTs in them.
Issues reported with bcc:
https://github.com/iovisor/bcc/issues?q=is%3Aissue+USDT
Similar thing with bpftrace.
Both standard USDT and semaphore based are used in the wild.
uprobe for containers has been a long standing feature request.
If you can improve uprobe performance that would be awesome.
That's another thing that people report often. We optimized it a bit.
More can be done.


Re: Why do kprobes and uprobes singlestep?

2021-03-02 Thread Alexei Starovoitov
On Tue, Mar 2, 2021 at 10:38 AM Andy Lutomirski  wrote:
>
> Is there something like a uprobe test suite?  How maintained /
> actively used is uprobe?

uprobe+bpf is heavily used in production.
selftests/bpf has only one test for it though.

Why are you asking?


Re: [PATCH v6 bpf-next 0/6] bpf: enable task local storage for tracing programs

2021-02-26 Thread Alexei Starovoitov
On Thu, Feb 25, 2021 at 4:04 PM Martin KaFai Lau  wrote:
>
> On Thu, Feb 25, 2021 at 03:43:13PM -0800, Song Liu wrote:
> > This set enables task local storage for non-BPF_LSM programs.
> >
> > It is common for tracing BPF program to access per-task data. Currently,
> > these data are stored in hash tables with pid as the key. In
> > bcc/libbpftools [1], 9 out of 23 tools use such hash tables. However,
> > hash table is not ideal for many use case. Task local storage provides
> > better usability and performance for BPF programs. Please refer to 6/6 for
> > some performance comparison of task local storage vs. hash table.
> Thanks for the patches.
>
> Acked-by: Martin KaFai Lau 

Applied. Thanks.

This will significantly reduce the tracing overhead in 9 out of 23 libbpf-tools. Hooray!
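
For example, something along these lines (a sketch, not one of the actual
tools) replaces a pid-keyed hash with a per-task slot:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct {
	__uint(type, BPF_MAP_TYPE_TASK_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, __u64);
} start_ns SEC(".maps");

SEC("tp_btf/sched_switch")
int BPF_PROG(on_switch, bool preempt, struct task_struct *prev,
	     struct task_struct *next)
{
	__u64 *ts;

	/* storage is created on first use and freed with the task, so no
	 * pid-keyed hash and no stale-entry cleanup */
	ts = bpf_task_storage_get(&start_ns, next, 0,
				  BPF_LOCAL_STORAGE_GET_F_CREATE);
	if (ts)
		*ts = bpf_ktime_get_ns();
	return 0;
}

char LICENSE[] SEC("license") = "GPL";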


Re: [PATCH v4 bpf-next 2/6] bpf: prevent deadlock from recursive bpf_task_storage_[get|delete]

2021-02-23 Thread Alexei Starovoitov

On 2/22/21 11:19 PM, Andrii Nakryiko wrote:

+   bpf_task_storage_lock();
sdata = bpf_local_storage_update(
task, (struct bpf_local_storage_map *)map, value, map_flags);

this should probably be container_of() instead of casting

bpf_task_storage.c uses casting in multiple places. How about we fix it in a
separate patch?


Sure, let's fix all uses in a separate patch. My point is that this
makes an assumption (that struct bpf_map map field is always the very
first one) which is not enforced and not documented in struct
bpf_local_storage_map.



I'd rather document it in a separate patch.
Just doing container_of here and there will lead to the wrong assumption
that it can be in a different place, but it's much more involved.
Consider what happens with all map_alloc callbacks... how that pointer
is hard coded into bpf prog.. then passed back into helpers...
then the logic that can read inner fields of bpf_map from the prog...


Re: [PATCH bpf-next v7 2/5] bpf: Expose bpf_get_socket_cookie to tracing programs

2021-02-11 Thread Alexei Starovoitov
On Wed, Feb 10, 2021 at 3:14 AM Florent Revest  wrote:
>
> +BPF_CALL_1(bpf_get_socket_ptr_cookie, struct sock *, sk)
> +{
> +   return sk ? sock_gen_cookie(sk) : 0;
> +}
> +
> +const struct bpf_func_proto bpf_get_socket_ptr_cookie_proto = {
> +   .func   = bpf_get_socket_ptr_cookie,
> +   .gpl_only   = false,
> +   .ret_type   = RET_INTEGER,
> +   .arg1_type  = ARG_PTR_TO_BTF_ID_SOCK_COMMON,
> +};

As Daniel pointed out there is an sk_destruct issue here, but I don't
think it's fair
to penalize this set and future similar patches. They don't make things worse.
The issue has been there for some time due to sk_storage in tracing and
other helpers. We need to come up with a holistic approach to solve it.
I suspect allow/deny lists will certainly make it better, but won't
really address it,
and will be fragile over long term.
I think tracing would need to be integrated with bpf_lsm and start relying
on security_*_free callbacks to cover this last 1%.
I think that would be a great topic for the next bpf office hours on Feb 25.


Re: [RFC][PATCH] kprobes: Remove kprobe::fault_handler

2021-02-10 Thread Alexei Starovoitov
On Wed, Feb 10, 2021 at 5:57 AM Peter Zijlstra  wrote:
>
>
> Somewhat related.. I had this pending.
>
> ---
> Subject: kprobes: Remove kprobe::fault_handler
> From: Peter Zijlstra 
> Date: Tue Feb 2 10:43:41 CET 2021
>
> The reason for kprobe::fault_handler(), as given by their comment:
>
>  * We come here because instructions in the pre/post
>  * handler caused the page_fault, this could happen
>  * if handler tries to access user space by
>  * copy_from_user(), get_user() etc. Let the
>  * user-specified handler try to fix it first.
>
> If just plain bad. Those other handlers are ran from non-preemptible
> context and had better use _nofault() functions. Also, there is no
> upstream usage of this.

No objections from me.

Since Masami mentioned that systemtap used it, you
probably want to give them a courtesy heads-up that it's going away.
Though looking at the systemtap source code I couldn't find any
reference to it, so it's likely a nop for them anyway.


Re: [GIT PULL] x86/urgent for v5.11-rc7

2021-02-09 Thread Alexei Starovoitov
On Tue, Feb 9, 2021 at 6:49 AM Steven Rostedt  wrote:
>
> On Tue, 9 Feb 2021 09:32:34 +0100 (CET)
> Miroslav Benes  wrote:
>
> > powerpc has this
> >
> > static inline unsigned long klp_get_ftrace_location(unsigned long faddr)
> > {
> > /*
> >  * Live patch works only with -mprofile-kernel on PPC. In this case,
> >  * the ftrace location is always within the first 16 bytes.
> >  */
> > return ftrace_location_range(faddr, faddr + 16);
> > }
> >
> > > > I suppose the trivial fix is to see if it points to endbr64 and if so,
> > > > increment the addr by the length of that.
> > >
> > > I thought of that too. But one thing that may be possible, is to use
> > > kallsym. I believe you can get the range of a function (start and end of
> > > the function) from kallsyms. Then ask ftrace for the addr in that range
> > > (there should only be one).
> >
> > And we can do this if a hard-coded value live above is not welcome. If I
> > remember correctly, we used to have exactly this in the old versions of
> > kGraft. We walked through all ftrace records, called
> > kallsyms_lookup_size_offset() on every record's ip and if the offset+ip
> > matched faddr (in this case), we returned the ip.
>
> Either way is fine. Question is, should we just wait till CET is
> implemented for the kernel before making any of these changes? Just knowing
> that we have a solution to handle it may be good enough for now.

I think the issue is more fundamental than what appears on the surface.
According to endbr64 documentation it's not just any instruction.
The cpu will wait for it and if it's replaced with int3 or not seen at
the branch target the cpu will throw an exception.
If I understood the doc correctly it means that endbr64 can never be
replaced with a breakpoint. If that's the case text_poke_bp and kprobe
need to do extra safety checks.


Re: [GIT PULL] x86/urgent for v5.11-rc7

2021-02-07 Thread Alexei Starovoitov
On Sun, Feb 7, 2021 at 10:21 AM Dave Hansen  wrote:
>
> On 2/7/21 9:58 AM, Borislav Petkov wrote:
> > On Sun, Feb 07, 2021 at 09:49:18AM -0800, Linus Torvalds wrote:
> >> On Sun, Feb 7, 2021 at 2:40 AM Borislav Petkov  wrote:
> >>> - Disable CET instrumentation in the kernel so that gcc doesn't add
> >>> ENDBR64 to kernel code and thus confuse tracing.
> >> So this is clearly the right thing to do for now, but I wonder if
> >> people have a plan for actually enabling CET and endbr at cpl0 at some
> >> point?
> > It probably is an item on some Intel manager's to-enable list. So far,
> > the CET enablement concentrates only on userspace but dhansen might know
> > more about future plans. CCed.
>
> It's definitely on our radar to look at after CET userspace.

What is the desired timeline to enable CET in the kernel?
I think for bpf and tracing it will be mostly straightforward to deal
with the extra endbr64 insn in front of the fentry nop.
Just trying to figure out when this work needs to be done.


Re: [PATCH bpf-next] bpf: clean up for 'const static' in bpf_lsm.c

2021-02-04 Thread Alexei Starovoitov
On Thu, Feb 4, 2021 at 5:40 PM Xu Jia  wrote:
>
> Prefer 'static const' over 'const static' here
>
> Signed-off-by: Xu Jia 
> ---
>  kernel/bpf/bpf_lsm.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/bpf/bpf_lsm.c b/kernel/bpf/bpf_lsm.c
> index 1622a44d1617..75b1c678d558 100644
> --- a/kernel/bpf/bpf_lsm.c
> +++ b/kernel/bpf/bpf_lsm.c
> @@ -67,7 +67,7 @@ BPF_CALL_2(bpf_bprm_opts_set, struct linux_binprm *, bprm, 
> u64, flags)
>
>  BTF_ID_LIST_SINGLE(bpf_bprm_opts_set_btf_ids, struct, linux_binprm)
>
> -const static struct bpf_func_proto bpf_bprm_opts_set_proto = {
> +static const struct bpf_func_proto bpf_bprm_opts_set_proto = {

I totally agree that it's more canonical this way, but I don't think
such git history noise
is worth it.


Re: [PATCH bpf-next v3] bpf: Propagate stack bounds to registers in atomics w/ BPF_FETCH

2021-02-03 Thread Alexei Starovoitov
On Wed, Feb 3, 2021 at 9:07 AM Yonghong Song  wrote:
>
>
>
> On 2/2/21 5:50 AM, Brendan Jackman wrote:
> > When BPF_FETCH is set, atomic instructions load a value from memory
> > into a register. The current verifier code first checks via
> > check_mem_access whether we can access the memory, and then checks
> > via check_reg_arg whether we can write into the register.
> >
> > For loads, check_reg_arg has the side-effect of marking the
> > register's value as unkonwn, and check_mem_access has the side effect
> > of propagating bounds from memory to the register. This currently only
> > takes effect for stack memory.
> >
> > Therefore with the current order, bounds information is thrown away,
> > but by simply reversing the order of check_reg_arg
> > vs. check_mem_access, we can instead propagate bounds smartly.
> >
> > A simple test is added with an infinite loop that can only be proved
> > unreachable if this propagation is present. This is implemented both
> > with C and directly in test_verifier using assembly.
> >
> > Suggested-by: John Fastabend 
> > Signed-off-by: Brendan Jackman 
>
> Ack with a nit below.

Sorry, it was already applied yesterday.
The patchbot just didn't send an auto-reply.


Re: [PATCH bpf-next v2] bpf: Propagate memory bounds to registers in atomics w/ BPF_FETCH

2021-02-01 Thread Alexei Starovoitov
On Mon, Feb 1, 2021 at 7:00 AM Brendan Jackman  wrote:
> +
> +SEC("fentry/bpf_fentry_test1")
> +int BPF_PROG(sub, int x)
> +{
> +   int a = 0;
> +   int b = __sync_fetch_and_add(&a, 1);

It probably needs ENABLE_ATOMICS_TESTS ?

Otherwise clang without -mcpu=v3 will complain:
"fatal error: error in backend: Invalid usage of the XADD return value"


Re: [PATCH bpf-next v6 2/5] bpf: Expose bpf_get_socket_cookie to tracing programs

2021-02-01 Thread Alexei Starovoitov
On Mon, Feb 1, 2021 at 2:32 PM Daniel Borkmann  wrote:
>
> On 1/30/21 12:45 PM, Florent Revest wrote:
> > On Fri, Jan 29, 2021 at 1:49 PM Daniel Borkmann  
> > wrote:
> >> On 1/29/21 11:57 AM, Daniel Borkmann wrote:
> >>> On 1/27/21 10:01 PM, Andrii Nakryiko wrote:
>  On Tue, Jan 26, 2021 at 10:36 AM Florent Revest  
>  wrote:
> >
> > This needs a new helper that:
> > - can work in a sleepable context (using sock_gen_cookie)
> > - takes a struct sock pointer and checks that it's not NULL
> >
> > Signed-off-by: Florent Revest 
> > Acked-by: KP Singh 
> > ---
> >include/linux/bpf.h|  1 +
> >include/uapi/linux/bpf.h   |  8 
> >kernel/trace/bpf_trace.c   |  2 ++
> >net/core/filter.c  | 12 
> >tools/include/uapi/linux/bpf.h |  8 
> >5 files changed, 31 insertions(+)
> >
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index 1aac2af12fed..26219465e1f7 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -1874,6 +1874,7 @@ extern const struct bpf_func_proto 
> > bpf_per_cpu_ptr_proto;
> >extern const struct bpf_func_proto bpf_this_cpu_ptr_proto;
> >extern const struct bpf_func_proto bpf_ktime_get_coarse_ns_proto;
> >extern const struct bpf_func_proto bpf_sock_from_file_proto;
> > +extern const struct bpf_func_proto bpf_get_socket_ptr_cookie_proto;
> >
> >const struct bpf_func_proto *bpf_tracing_func_proto(
> >   enum bpf_func_id func_id, const struct bpf_prog *prog);
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 0b735c2729b2..5855c398d685 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -1673,6 +1673,14 @@ union bpf_attr {
> > * Return
> > * A 8-byte long unique number.
> > *
> > + * u64 bpf_get_socket_cookie(void *sk)
> 
>  should the type be `struct sock *` then?
> >>>
> >>> Checking libbpf's generated bpf_helper_defs.h it generates:
> >>>
> >>> /*
> >>>* bpf_get_socket_cookie
> >>>*
> >>>*  If the **struct sk_buff** pointed by *skb* has a known socket,
> >>>*  retrieve the cookie (generated by the kernel) of this socket.
> >>>*  If no cookie has been set yet, generate a new cookie. Once
> >>>*  generated, the socket cookie remains stable for the life of the
> >>>*  socket. This helper can be useful for monitoring per socket
> >>>*  networking traffic statistics as it provides a global socket
> >>>*  identifier that can be assumed unique.
> >>>*
> >>>* Returns
> >>>*  A 8-byte long non-decreasing number on success, or 0 if the
> >>>*  socket field is missing inside *skb*.
> >>>*/
> >>> static __u64 (*bpf_get_socket_cookie)(void *ctx) = (void *) 46;
> >>>
> >>> So in terms of helper comment it's picking up the description from the
> >>> `u64 bpf_get_socket_cookie(struct sk_buff *skb)` signature. With that
> >>> in mind it would likely make sense to add the actual `struct sock *` type
> >>> to the comment to make it more clear in here.
> >>
> >> One thought that still came to mind when looking over the series again, do
> >> we need to blacklist certain functions from bpf_get_socket_cookie() under
> >> tracing e.g. when attaching to, say fexit? For example, if sk_prot_free()
> >> would be temporary uninlined/exported for testing and 
> >> bpf_get_socket_cookie()
> >> was invoked from a prog upon fexit where sock was already passed back to
> >> allocator, I presume there's risk of mem corruption, no?
> >
> > Mh, this is interesting. I can try to add a deny list in v7 but I'm
> > not sure whether I'll be able to catch them all. I'm assuming that
> > __sk_destruct, sk_destruct, __sk_free, sk_free would be other
> > problematic functions but potentially there would be more.
>
> I was just looking at bpf_skb_output() from a7658e1a4164 ("bpf: Check types of
> arguments passed into helpers") which afaiu has similar issue, back at the 
> time
> this was introduced there was no fentry/fexit but rather raw tp progs, so you
> could still safely dump skb this way including for kfree_skb() tp. Presumably 
> if
> you bpf_skb_output() at 'fexit/kfree_skb' you might be able to similarly crash

the verifier cannot check absolutely everything.
Whitelisting and blacklisting all combinations is not practical.

> your kernel which I don't think is intentional (also given we go above and 
> beyond
> in other areas to avoid crashing or destabilizing e.g. [0] to mention one). 
> Maybe
> these should really only be for BPF_TRACE_RAW_TP (or BPF_PROG_TYPE_LSM) where 
> it
> can be audited that it's safe to use like a7658e1a4164's original intention 
> ...
> or have some sort of function annotation like __acquires/__releases but for 
> tracing
> e.g. 

Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")

2021-01-30 Thread Alexei Starovoitov
On Sat, Jan 30, 2021 at 11:02:49AM +0900, Masami Hiramatsu wrote:
> On Fri, 29 Jan 2021 18:59:43 +0100
> Peter Zijlstra  wrote:
> 
> > On Fri, Jan 29, 2021 at 09:45:48AM -0800, Alexei Starovoitov wrote:
> > > Same things apply to bpf side. We can statically prove safety for
> > > ftrace and kprobe attaching whereas to deal with NMI situation we
> > > have to use run-time checks for recursion prevention, etc.
> > 
> > I have no idea what you're saying. You can attach to functions that are
> > called with random locks held, you can create kprobes in some very
> > sensitive places.
> > 
> > What can you staticlly prove about that?
> 
> For the bpf and the kprobe tracer, if a probe hits in the NMI context,
> it can call the handler with another handler processing events.
> 
> kprobes is carefully avoiding the deadlock by checking recursion
> with per-cpu variable. But if the handler is shared with the other events
> like tracepoints, it needs its own recursion checker too.
> 
> So, Alexei, maybe you need something like this instead of in_nmi() check.
> 
> DEFINE_PER_CPU(bool, under_running_bpf);
> 
> common_handler()
> {
>   if (__this_cpu_read(under_running_bpf))
>   return;
>   __this_cpu_write(under_running_bpf, true);
>   /* execute bpf prog */
>   __this_cpu_write(under_running_bpf, false); 
> }
> 
> Does this work for you?

This exact check is already in trace_call_bpf.
Right after if (in_nmi()).
See bpf_prog_active. It serves a different purpose though.
Simply removing if (in_nmi()) from trace_call_bpf is a bit scary.
I need to analyze all code paths first.
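
For reference, the shape of that guard is roughly the following (a trimmed
sketch, not the verbatim source):

#include <linux/hardirq.h>
#include <linux/percpu.h>

static DEFINE_PER_CPU(int, bpf_prog_active);

static unsigned int run_trace_progs(void *ctx)
{
	unsigned int ret = 1;

	if (in_nmi()) /* not supported yet */
		return 1;

	if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) {
		/* a prog is already running on this cpu: drop the event */
		ret = 0;
		goto out;
	}

	/* ... run the attached bpf program(s) on ctx ... */
out:
	__this_cpu_dec(bpf_prog_active);
	return ret;
}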


Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")

2021-01-29 Thread Alexei Starovoitov
On Fri, Jan 29, 2021 at 02:01:03PM -0500, Steven Rostedt wrote:
> On Fri, 29 Jan 2021 18:59:43 +0100
> Peter Zijlstra  wrote:
> 
> > On Fri, Jan 29, 2021 at 09:45:48AM -0800, Alexei Starovoitov wrote:
> > > Same things apply to bpf side. We can statically prove safety for
> > > ftrace and kprobe attaching whereas to deal with NMI situation we
> > > have to use run-time checks for recursion prevention, etc.  
> > 
> > I have no idea what you're saying. You can attach to functions that are
> > called with random locks held, you can create kprobes in some very
> > sensitive places.
> > 
> > What can you staticlly prove about that?
> 
> I think the main difference is, if you attach a kprobe or ftrace function,
> you can theoretically analyze the location before you do the attachment.

Exactly.
When we're writing bpf helpers we need to carefully think about reentrancy and
NMI.
If the helper looks like:
int nokprobe notrace bpf_something(...)
{
  // access variables A and B
}

The implementation can rely on the fact that even if the helper is reentrant
the state of A and B will be consistent. Either both got touched or none.
NMI is the only condition we have to worry about, because A could be modified
without touching B.
If we write it as
int nokprobe bpf_something(...) { ... }
that would be another case.
Here we need to consider the case that a bpf prog can be attached to it via the
fentry nop.
But there is no need to worry about instructions being split in the middle by a
kprobe via int3.
Since we have a big "if (in_nmi()) goto unsupported;" check in the beginning, we
only need to worry about combinations of a kprobe at the start of the func,
a kprobe anywhere inside the func via int3, and ftrace at the start.
Not having to think about NMI helps a ton.
My earlier this_cpu vs __this_cpu comment is an example of that.
If in_nmi is filtered early, it's one implementation. If nmi has to be handled,
it's a completely different algorithm.
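As a rough illustration of that difference (a sketch, not code from this
thread):

#include <linux/percpu.h>

static DEFINE_PER_CPU(int, prog_active);

/* good enough when NMI is filtered out up front: __this_cpu_inc() only
 * assumes preemption is already disabled, and its read-modify-write is
 * not protected against anything nesting on the same cpu */
static void enter_no_nmi(void)
{
	__this_cpu_inc(prog_active);
}

/* needed when the handler can nest from NMI: this_cpu_inc() must be
 * irq-safe (a single instruction on x86), so the update can't be lost
 * to something landing between the load and the store */
static void enter_nmi_safe(void)
{
	this_cpu_inc(prog_active);
}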
Now you've broken all this logic by making int3 be marked as 'in_nmi', and
bpf progs attached via a kprobe in the middle of the func are now broken.
Do people use that? Yeah, they do.
We have to fix it.
What were your reasons to make int3 in_nmi?
I've read the commit log, but I don't see in it the actual motivation
for int3 other than "it looks like NMI to me. Let's make it so".
The commit logs talk about cpu exceptions. I agree that #DB and #MC do behave 
like NMI.
But #BP does not really. My understanding is that it's used by kprobes and
text_poke_bp only.
If the motivation was to close some issue with text_poke_bp then, sure,
let's make handling of text_poke_bp to be treated as nmi.
But kprobe is not that.
I'm thinking either of the following solutions would be generic:
- introduce another state to preempt flags like "kernel exception"
- remove kprobe's int3 from in_nmi
As bpf specific alternative we can do:
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index 6c0018abe68a..37cc549ad52e 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -96,7 +96,7 @@ unsigned int trace_call_bpf(struct trace_event_call *call, 
void *ctx)
 {
unsigned int ret;

-   if (in_nmi()) /* not supported yet */
+   if (in_nmi() && !kprobe_running()) /* not supported yet */
return 1;

but imo it's less clean than the first two options.


Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")

2021-01-29 Thread Alexei Starovoitov
On Fri, Jan 29, 2021 at 8:24 AM Peter Zijlstra  wrote:
>
> On Fri, Jan 29, 2021 at 10:59:52AM -0500, Steven Rostedt wrote:
> > On Fri, 29 Jan 2021 22:40:11 +0900
> > Masami Hiramatsu  wrote:
> >
> > > > So what, they can all happen with random locks held. Marking them as NMI
> > > > enables a whole bunch of sanity checks that are entirely appropriate.
> > >
> > > How about introducing an idea of Asynchronous NMI (ANMI) and Synchronous
> > > NMI (SNMI)? kprobes and ftrace is synchronously called and can be 
> > > controlled
> > > (we can expect the context) but ANMI may be caused by asynchronous
> > > hardware events on any context.
> > >
> > > If we can distinguish those 2 NMIs on preempt count, bpf people can easily
> > > avoid the inevitable situation.
> >
> > I don't like the name NMI IN SNMI, because they are not NMIs. They are
> > actually more like kernel exceptions. Even page faults in the kernel is
> > similar to a kprobe breakpoint or ftrace. It can happen anywhere, with any
> > lock held. Perhaps we need a kernel exception context? Which by definition
> > is synchronous.

I like the 'kernel exception' name. SNMI doesn't sound right. There is nothing
'non-maskable' here.

>
> What problem are you trying to solve? AFAICT all these contexts have the
> same restrictions, why try and muck about with different names for the
> same thing?

from core kernel perspective the difference between 'kernel exception'
and true NMI is huge:
this_cpu vs __this_cpu
static checks vs runtime checks

Same things apply to bpf side. We can statically prove safety for
ftrace and kprobe attaching whereas to deal with NMI situation we
have to use run-time checks for recursion prevention, etc.


Re: kprobes broken since 0d00449c7a28 ("x86: Replace ist_enter() with nmi_enter()")

2021-01-28 Thread Alexei Starovoitov
On Thu, Jan 28, 2021 at 07:24:14PM +0100, Peter Zijlstra wrote:
> On Thu, Jan 28, 2021 at 06:45:56PM +0200, Nikolay Borisov wrote:
> > it would be placed on the __fentry__ (and not endbr64) hence it works.
> > So perhaps a workaround outside of bpf could essentially detect this
> > scenario and adjust the probe to be on the __fentry__ and not preceding
> > instruction if it's detected to be endbr64 ?
> 
> Arguably the fentry handler should also set the nmi context, it can,
> after all, interrupt pretty much any other context by construction.

But that doesn't make it a real nmi.
An nmi can actually interrupt anything, whereas a kprobe via int3 cannot
hit nokprobe and noinstr sections. The exposure is a lot different.
ftrace is even more contained. It's only at the start of functions,
an even smaller subset of places than kprobes.
So ftrace < kprobe < nmi.
Grouping them all into nmi doesn't make sense to me.
That bpf breaking change went unnoticed mostly because people use
kprobes at the beginning of functions only, but there are cases
where kprobes are in the middle too. People just didn't upgrade kernels
fast enough to notice.
imo the appropriate solution would be to have some distinction between
ftrace < kprobe_via_int3 < nmi, so that the bpf side can react accordingly.
That code in trace_call_bpf:
  if (in_nmi()) /* not supported yet */
was necessary in the past. Now we have all sorts of protections built in.
So it's probably ok to just drop it, but I think adding
called_via_ftrace vs called_via_kprobe_int3 vs called_via_nmi
is the more appropriate solution long term.


Re: [PATCH bpf-next v3 0/2] BPF docs fixups

2021-01-21 Thread Alexei Starovoitov
On Thu, Jan 21, 2021 at 10:54 AM Jonathan Corbet  wrote:
>
> On Wed, 20 Jan 2021 13:39:44 +
> Brendan Jackman  wrote:
>
> > Difference from v2->v3 [1]:
> >
> >  * Just fixed a commite message, rebased, and added Lukas' review tag - 
> > thanks
> >Lukas!
> >
> > Difference from v1->v2 [1]:
> >
> >  * Split into 2 patches
> >
> >  * Avoided unnecessary ': ::' in .rst source
> >
> >  * Tweaked wording of the -mcpu=v3 bit a little more
> >
> > [1] Previous versions:
> > v1: 
> > https://lore.kernel.org/bpf/ca+i-1c1lvkjfqlbyk6siiqhxfy0jcr7ubcamj4jced0a9aw...@mail.gmail.com/T/#t
> > v2: 
> > https://lore.kernel.org/bpf/20210118155735.532663-1-jackm...@google.com/T/#t
> >
> > Brendan Jackman (2):
> >   docs: bpf: Fixup atomics markup
> >   docs: bpf: Clarify -mcpu=v3 requirement for atomic ops
> >
> >  Documentation/networking/filter.rst | 20 +++-
> >  1 file changed, 11 insertions(+), 9 deletions(-)
>
> I'm assuming these will go up through the BPF/networking trees; please let
> me know if I should pick them up instead.

I sent an email yesterday indicating that the set was applied to bpf-next.
There is no other tree it can be applied to without conflicts.
Looks like gmail is struggling again.


Re: [PATCH bpf-next v3 2/2] docs: bpf: Clarify -mcpu=v3 requirement for atomic ops

2021-01-20 Thread Alexei Starovoitov
On Wed, Jan 20, 2021 at 11:57 AM Lukas Bulwahn  wrote:
>
> On Wed, Jan 20, 2021 at 2:39 PM Brendan Jackman  wrote:
> >
> > Alexei pointed out [1] that this wording is pretty confusing. Here's
> > an attempt to be more explicit and clear.
> >
> > [1] 
> > https://lore.kernel.org/bpf/CAADnVQJVvwoZsE1K+6qRxzF7+6CvZNzygnoBW9tZNWJELk5c=q...@mail.gmail.com/T/#m07264fc18fdc43af02fc1320968afefcc73d96f4
> >
>
> It is common practice to use "Link: URL" to refer to other mail
> threads; and to use the "permalink":
>
> https://lore.kernel.org/bpf/CAADnVQJVvwoZsE1K+6qRxzF7+6CvZNzygnoBW9tZNWJELk5c=q...@mail.gmail.com/
>
> which is a bit shorter than the link you provided.

I fixed it up while applying.
Thanks everyone.


Re: [PATCH bpf-next] bpf: Propagate memory bounds to registers in atomics w/ BPF_FETCH

2021-01-20 Thread Alexei Starovoitov
On Mon, Jan 18, 2021 at 04:06:13PM +, Brendan Jackman wrote:
> When BPF_FETCH is set, atomic instructions load a value from memory
> into a register. The current verifier code first checks via
> check_mem_access whether we can access the memory, and then checks
> via check_reg_arg whether we can write into the register.
> 
> For loads, check_reg_arg has the side-effect of marking the
> register's value as unkonwn, and check_mem_access has the side effect
> of propagating bounds from memory to the register.
> 
> Therefore with the current order, bounds information is thrown away,
> but by simply reversing the order of check_reg_arg
> vs. check_mem_access, we can instead propagate bounds smartly.

I think such an improvement makes sense, but please tweak the commit log
to mention that check_mem_access() is doing it only for stack pointers,
since "propagating from memory" is too generic. Most memory
won't have such propagation.

> A simple test is added with an infinite loop that can only be proved
> unreachable if this propagation is present.
> 
> Note that in the test, the memory value has to be written with two
> instructions:
> 
>   BPF_MOV64_IMM(BPF_REG_0, 0),
>   BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_0, -8),
> 
> instead of one:
> 
>   BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
> 
> Because BPF_ST_MEM doesn't seem to set the stack slot type to 0 when
> storing an immediate.

This bit is confusing in the commit log. Pls move it into the test itself.

> Signed-off-by: Brendan Jackman 
> ---
>  kernel/bpf/verifier.c | 32 +++
>  .../selftests/bpf/verifier/atomic_bounds.c| 18 +++
>  2 files changed, 36 insertions(+), 14 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/verifier/atomic_bounds.c
> 
> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 0f82d5d46e2c..0512695c70f4 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -3663,9 +3663,26 @@ static int check_atomic(struct bpf_verifier_env *env, 
> int insn_idx, struct bpf_i
>   return -EACCES;
>   }
>  
> + if (insn->imm & BPF_FETCH) {
> + if (insn->imm == BPF_CMPXCHG)
> + load_reg = BPF_REG_0;
> + else
> + load_reg = insn->src_reg;
> +
> + /* check and record load of old value */
> + err = check_reg_arg(env, load_reg, DST_OP);
> + if (err)
> + return err;
> + } else {
> + /* This instruction accesses a memory location but doesn't
> +  * actually load it into a register.
> +  */
> + load_reg = -1;
> + }
> +
>   /* check whether we can read the memory */
>   err = check_mem_access(env, insn_idx, insn->dst_reg, insn->off,
> -BPF_SIZE(insn->code), BPF_READ, -1, true);
> +BPF_SIZE(insn->code), BPF_READ, load_reg, true);
>   if (err)
>   return err;
>  
> @@ -3675,19 +3692,6 @@ static int check_atomic(struct bpf_verifier_env *env, 
> int insn_idx, struct bpf_i
>   if (err)
>   return err;
>  
> - if (!(insn->imm & BPF_FETCH))
> - return 0;
> -
> - if (insn->imm == BPF_CMPXCHG)
> - load_reg = BPF_REG_0;
> - else
> - load_reg = insn->src_reg;
> -
> - /* check and record load of old value */
> - err = check_reg_arg(env, load_reg, DST_OP);
> - if (err)
> - return err;
> -
>   return 0;
>  }
>  
> diff --git a/tools/testing/selftests/bpf/verifier/atomic_bounds.c 
> b/tools/testing/selftests/bpf/verifier/atomic_bounds.c
> new file mode 100644
> index ..45030165ed63
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/verifier/atomic_bounds.c
> @@ -0,0 +1,18 @@
> +{
> + "BPF_ATOMIC bounds propagation, mem->reg",
> + .insns = {
> + /* a = 0; */
> + BPF_MOV64_IMM(BPF_REG_0, 0),
> + BPF_STX_MEM(BPF_DW, BPF_REG_10, BPF_REG_0, -8),
> + /* b = atomic_fetch_add(, 1); */
> + BPF_MOV64_IMM(BPF_REG_1, 1),
> + BPF_ATOMIC_OP(BPF_DW, BPF_ADD | BPF_FETCH, BPF_REG_10, 
> BPF_REG_1, -8),
> + /* Verifier should be able to tell that this infinite loop 
> isn't reachable. */
> + /* if (b) while (true) continue; */
> + BPF_JMP_IMM(BPF_JNE, BPF_REG_1, 0, -1),
> + BPF_EXIT_INSN(),

I think it's a bit unrealistic to use atomic_add to increment an induction
variable, but I don't mind, since the verifier side is simple.
Could you add a C based test as well?
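For reference, something along these lines would do as a C based test
(a rough sketch, untested; the section/attach point is just an example,
and it assumes clang 12 with -mcpu=v3 so that __sync_fetch_and_add()
compiles to BPF_ATOMIC | BPF_FETCH):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

char _license[] SEC("license") = "GPL";

SEC("fentry/bpf_fentry_test1")
int atomic_bounds(void *ctx)
{
	int a = 0;
	/* old value of 'a' is loaded back into 'b' via BPF_FETCH */
	int b = __sync_fetch_and_add(&a, 1);

	/* b is certainly 0 here; with bounds propagated from the stack
	 * slot the verifier can prove the loop below is dead code
	 */
	if (b)
		while (1)
			;
	return 0;
}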


Re: [PATCH] selftest/bpf: fix typo

2021-01-20 Thread Alexei Starovoitov
On Wed, Jan 20, 2021 at 6:22 AM angkery  wrote:
>
> From: Junlin Yang 
>
> Change 'exeeds' to 'exceeds'.
>
> Signed-off-by: Junlin Yang 

The patch didn't reach patchwork.
Please reduce cc list and resubmit to bpf@vger only.
Also pls mention [PATCH bpf-next] in the subject.

> ---
>  tools/testing/selftests/bpf/prog_tests/btf.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/btf.c 
> b/tools/testing/selftests/bpf/prog_tests/btf.c
> index 8ae97e2..ea008d0 100644
> --- a/tools/testing/selftests/bpf/prog_tests/btf.c
> +++ b/tools/testing/selftests/bpf/prog_tests/btf.c
> @@ -914,7 +914,7 @@ struct btf_raw_test {
> .err_str = "Member exceeds struct_size",
>  },
>
> -/* Test member exeeds the size of struct
> +/* Test member exceeds the size of struct
>   *
>   * struct A {
>   * int m;
> @@ -948,7 +948,7 @@ struct btf_raw_test {
> .err_str = "Member exceeds struct_size",
>  },
>
> -/* Test member exeeds the size of struct
> +/* Test member exceeds the size of struct
>   *
>   * struct A {
>   * int m;
> --
> 1.9.1
>
>


Re: [PATCH] bpf: put file handler if no storage found

2021-01-20 Thread Alexei Starovoitov
On Tue, Jan 19, 2021 at 4:03 AM Pan Bian  wrote:
>
> Put file f if inode_storage_ptr() returns NULL.
>
> Signed-off-by: Pan Bian 
> ---
>  kernel/bpf/bpf_inode_storage.c | 6 +-
>  1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/bpf/bpf_inode_storage.c b/kernel/bpf/bpf_inode_storage.c
> index 6edff97ad594..089d5071d4fc 100644
> --- a/kernel/bpf/bpf_inode_storage.c
> +++ b/kernel/bpf/bpf_inode_storage.c
> @@ -125,8 +125,12 @@ static int bpf_fd_inode_storage_update_elem(struct 
> bpf_map *map, void *key,
>
> fd = *(int *)key;
> f = fget_raw(fd);
> -   if (!f || !inode_storage_ptr(f->f_inode))
> +   if (!f)
> +   return -EBADF;
> +   if (!inode_storage_ptr(f->f_inode)) {
> +   fput(f);
> return -EBADF;
> +   }

Good catch.
Somehow the patch is not in patchwork.
Could you please resubmit with Fixes tag and reduce cc list?
I guess it's hitting some spam filters in vger.


Re: [RFC PATCH 3/7] tun: allow use of BPF_PROG_TYPE_SCHED_CLS program type

2021-01-20 Thread Alexei Starovoitov
On Tue, Jan 12, 2021 at 12:55 PM Yuri Benditovich
 wrote:
>
> On Tue, Jan 12, 2021 at 10:40 PM Yuri Benditovich
>  wrote:
> >
> > On Tue, Jan 12, 2021 at 9:42 PM Yuri Benditovich
> >  wrote:
> > >
> > > This program type can set skb hash value. It will be useful
> > > when the tun will support the hash reporting feature of virtio-net.
> > >
> > > Signed-off-by: Yuri Benditovich 
> > > ---
> > >  drivers/net/tun.c | 2 ++
> > >  1 file changed, 2 insertions(+)
> > >
> > > diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> > > index 7959b5c2d11f..455f7afc1f36 100644
> > > --- a/drivers/net/tun.c
> > > +++ b/drivers/net/tun.c
> > > @@ -2981,6 +2981,8 @@ static int tun_set_ebpf(struct tun_struct *tun, 
> > > struct tun_prog __rcu **prog_p,
> > > prog = NULL;
> > > } else {
> > > prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_SOCKET_FILTER);
> > > +   if (IS_ERR(prog))
> > > +       prog = bpf_prog_get_type(fd, 
> > > BPF_PROG_TYPE_SCHED_CLS);
> > > if (IS_ERR(prog))
> > > return PTR_ERR(prog);
> > > }
> >
> > Comment from Alexei Starovoitov:
> > Patches 1 and 2 are missing for me, so I couldn't review properly,
> > but this diff looks odd.
> > It allows sched_cls prog type to attach to tun.
> > That means everything that sched_cls progs can do will be done from tun 
> > hook?
>
> We do not have an intention to modify the packet in this steering eBPF.

The intent is irrelevant. Using SCHED_CLS here will let users modify the packet
and some users will do so. Hence the tun code has to support it.

> There is just one function that unavailable for BPF_PROG_TYPE_SOCKET_FILTER
> that the eBPF needs to make possible to deliver the hash to the guest
> VM - it is 'bpf_set_hash'
>
> Does it mean that we need to define a new eBPF type for socket filter
> operations + set_hash?
>
> Our problem is that the eBPF calculates 32-bit hash, 16-bit queue
> index and 8-bit of hash type.
> But it is able to return only 32-bit integer, so in this set of
> patches the eBPF returns
> queue index and hash type and saves the hash in skb->hash using 
> bpf_set_hash().

bpf prog can only return a 32-bit integer. That's true.
But the prog can use helpers to set any number of bits and variables.
bpf_set_hash_v2() with hash, queue and index arguments could fit this purpose,
but if you allow it for SCHED_CLS type,
tc side of the code should be ready to deal with that too and this extended
helper should be meaningful for both tc and tun.

In general, if the purpose of the prog is to compute three values, they had better be
grouped together. Returning two of them via an ORed 32-bit integer and
the third via bpf_set_hash() is an awkward API.
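To illustrate what I mean (bpf_set_hash_v2() is a hypothetical helper name
used only for this sketch, nothing like it exists today, and the helper id
below is just a placeholder):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* hypothetical helper: report hash, queue and hash type in one call */
static long (*bpf_set_hash_v2)(struct __sk_buff *skb, __u32 hash,
			       __u32 queue, __u32 hash_type) = (void *)200;

SEC("tc")
int steer(struct __sk_buff *skb)
{
	__u32 hash = 0x12345678;	/* computed from the parsed headers */
	__u32 queue = 1;
	__u32 hash_type = 2;

	bpf_set_hash_v2(skb, hash, queue, hash_type);
	return 0;	/* return value no longer has to carry packed data */
}

char _license[] SEC("license") = "GPL";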


Re: [PATCH bpf-next v5 4/4] selftests/bpf: Add a selftest for the tracing bpf_get_socket_cookie

2021-01-20 Thread Alexei Starovoitov
On Wed, Jan 20, 2021 at 9:08 AM KP Singh  wrote:
>
> On Tue, Jan 19, 2021 at 5:00 PM Florent Revest  wrote:
> >
> > This builds up on the existing socket cookie test which checks whether
> > the bpf_get_socket_cookie helpers provide the same value in
> > cgroup/connect6 and sockops programs for a socket created by the
> > userspace part of the test.
> >
> > Adding a tracing program to the existing objects requires a different
> > attachment strategy and different headers.
> >
> > Signed-off-by: Florent Revest 
>
> Acked-by: KP Singh 
>
> (one minor note, doesn't really need fixing as a part of this though)
>
> > ---
> >  .../selftests/bpf/prog_tests/socket_cookie.c  | 24 +++
> >  .../selftests/bpf/progs/socket_cookie_prog.c  | 41 ---
> >  2 files changed, 52 insertions(+), 13 deletions(-)
> >
> > diff --git a/tools/testing/selftests/bpf/prog_tests/socket_cookie.c 
> > b/tools/testing/selftests/bpf/prog_tests/socket_cookie.c
> > index 53d0c44e7907..e5c5e2ea1deb 100644
> > --- a/tools/testing/selftests/bpf/prog_tests/socket_cookie.c
> > +++ b/tools/testing/selftests/bpf/prog_tests/socket_cookie.c
> > @@ -15,8 +15,8 @@ struct socket_cookie {
> >
> >  void test_socket_cookie(void)
> >  {
> > +   struct bpf_link *set_link, *update_sockops_link, 
> > *update_tracing_link;
> > socklen_t addr_len = sizeof(struct sockaddr_in6);
> > -   struct bpf_link *set_link, *update_link;
> > int server_fd, client_fd, cgroup_fd;
> > struct socket_cookie_prog *skel;
> > __u32 cookie_expected_value;
> > @@ -39,15 +39,21 @@ void test_socket_cookie(void)
> >   PTR_ERR(set_link)))
> > goto close_cgroup_fd;
> >
> > -   update_link = bpf_program__attach_cgroup(skel->progs.update_cookie,
> > -cgroup_fd);
> > -   if (CHECK(IS_ERR(update_link), "update-link-cg-attach", "err %ld\n",
> > - PTR_ERR(update_link)))
> > +   update_sockops_link = bpf_program__attach_cgroup(
> > +   skel->progs.update_cookie_sockops, cgroup_fd);
> > +   if (CHECK(IS_ERR(update_sockops_link), 
> > "update-sockops-link-cg-attach",
> > + "err %ld\n", PTR_ERR(update_sockops_link)))
> > goto free_set_link;
> >
> > +   update_tracing_link = bpf_program__attach(
> > +   skel->progs.update_cookie_tracing);
> > +   if (CHECK(IS_ERR(update_tracing_link), "update-tracing-link-attach",
> > + "err %ld\n", PTR_ERR(update_tracing_link)))
> > +   goto free_update_sockops_link;
> > +
> > server_fd = start_server(AF_INET6, SOCK_STREAM, "::1", 0, 0);
> > if (CHECK(server_fd < 0, "start_server", "errno %d\n", errno))
> > -   goto free_update_link;
> > +   goto free_update_tracing_link;
> >
> > client_fd = connect_to_fd(server_fd, 0);
> > if (CHECK(client_fd < 0, "connect_to_fd", "errno %d\n", errno))
> > @@ -71,8 +77,10 @@ void test_socket_cookie(void)
> > close(client_fd);
> >  close_server_fd:
> > close(server_fd);
> > -free_update_link:
> > -   bpf_link__destroy(update_link);
> > +free_update_tracing_link:
> > +   bpf_link__destroy(update_tracing_link);
>
> I don't think this need to block submission unless there are other
> issues but the
> bpf_link__destroy can just be called in a single cleanup label because
> it handles null or
> erroneous inputs:
>
> int bpf_link__destroy(struct bpf_link *link)
> {
> int err = 0;
>
> if (IS_ERR_OR_NULL(link))
>  return 0;
> [...]

+1 to KP's point.

Also Florent, how did you test it?
This test fails in CI and in my manual run:
./test_progs -t cook
libbpf: load bpf program failed: Permission denied
libbpf: -- BEGIN DUMP LOG ---
libbpf:
; int update_cookie_sockops(struct bpf_sock_ops *ctx)
0: (bf) r6 = r1
; if (ctx->family != AF_INET6)
1: (61) r1 = *(u32 *)(r6 +20)
; if (ctx->family != AF_INET6)
2: (56) if w1 != 0xa goto pc+21
 R1_w=inv10 R6_w=ctx(id=0,off=0,imm=0) R10=fp0
; if (ctx->op != BPF_SOCK_OPS_TCP_CONNECT_CB)
3: (61) r1 = *(u32 *)(r6 +0)
; if (ctx->op != BPF_SOCK_OPS_TCP_CONNECT_CB)
4: (56) if w1 != 0x3 goto pc+19
 R1_w=inv3 R6_w=ctx(id=0,off=0,imm=0) R10=fp0
; if (!ctx->sk)
5: (79) r1 = *(u64 *)(r6 +184)
; if (!ctx->sk)
6: (15) if r1 == 0x0 goto pc+17
 R1_w=sock(id=0,ref_obj_id=0,off=0,imm=0) R6_w=ctx(id=0,off=0,imm=0) R10=fp0
; p = bpf_sk_storage_get(_cookies, ctx->sk, 0, 0);
7: (79) r2 = *(u64 *)(r6 +184)
; p = bpf_sk_storage_get(_cookies, ctx->sk, 0, 0);
8: (18) r1 = 0x888106e41400
10: (b7) r3 = 0
11: (b7) r4 = 0
12: (85) call bpf_sk_storage_get#107
R2 type=sock_or_null expected=sock_common, sock, tcp_sock, xdp_sock, ptr_
processed 12 insns (limit 100) max_states_per_insn 0 total_states
0 peak_states 0 mark_read 0

libbpf: -- END LOG --
libbpf: failed to load program 'update_cookie_sockops'
libbpf: failed to load object 'socket_cookie_prog'
libbpf: failed to load 

Re: [PATCH v3 bpf-next 0/2] Allow attaching to bare tracepoints

2021-01-19 Thread Alexei Starovoitov
On Tue, Jan 19, 2021 at 4:22 AM Qais Yousef  wrote:
>
> Changes in v3:
> * Fix not returning error value correctly in
>   trigger_module_test_write() (Yonghong)
> * Add Yonghong acked-by to patch 1.

Applied. Thanks


Re: cBPF socket filters failing - inexplicably?

2021-01-14 Thread Alexei Starovoitov
Adding appropriate mailing list to cc...

My wild guess is that as soon as the socket got created:
socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
packets were already being queued to it.
So the later setsockopt() is too late to filter them.

Eric, thoughts?
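If that's the case, a workaround on the user side would be to drain whatever
was queued before the filter took effect, e.g. (sketch, untested):

#include <sys/socket.h>

/* Assumption: the frames that slip through were queued on the socket
 * between socket() and setsockopt(SO_ATTACH_FILTER).  Drain them
 * right after attaching the filter.
 */
static void drain_prequeued(int sock)
{
	char buf[2048];

	while (recv(sock, buf, sizeof(buf), MSG_DONTWAIT) >= 0)
		;
	/* stops once recv() returns -1 with errno == EAGAIN */
}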

On Wed, Jan 6, 2021 at 6:55 AM Tom Cook  wrote:
>
> Another factoid to add to this:  I captured all traffic on an
> interface while the test program was running using
>
> tcpdump -i wlo1 -w capture.pcap
>
> observing that multiple packets got through the filter.  I then built
> the bpf_dbg program from the kernel source tree and ran the same
> filter and capture file through it:
>
> $ tools/bpf_dbg
> > load bpf 1,6 0 0 0
> > load pcap capture.pcap
> > run
> bpf passes:0 fails:269288
>
> So bpf_dbg thinks the filter is correct; it's only when the filter is
> attached to an actual socket that it fails occasionally.
>
> Regards,
> Tom
>
> On Wed, Jan 6, 2021 at 10:07 AM Tom Cook  wrote:
> >
> > Just to note I have also reproduced this on a 5.10.0 kernel.
> >
> > On Tue, Jan 5, 2021 at 1:42 PM Tom Cook  wrote:
> > >
> > > In the course of tracking down a defect in some existing software,
> > > I've found the failure demonstrated by the short program below.
> > > Essentially, a cBPF program that just rejects every frame (ie always
> > > returns zero) and is attached to a socket using setsockopt(SOL_SOCKET,
> > > SO_ATTACH_FILTER, ...) still occasionally lets frames through to
> > > userspace.
> > >
> > > The code is based on the first example in
> > > Documentation/networking/filter.txt, except that I've changed the
> > > content of the filter program and added a timeout on the socket.
> > >
> > > To reproduce the problem:
> > >
> > > # gcc test.c -o test
> > > # sudo ./test
> > > ... and in another console start a large network operation.
> > >
> > > In my case, I copied a ~300MB core file I had lying around to another
> > > host on the LAN.  The test code should print the string "Failed to
> > > read from socket" 100 times.  In practice, it produces about 10%
> > > "Received packet with ethertype..." messages.
> > >
> > > I've observed the same result on Ubuntu amd64 glibc system running a
> > > 5.9.0 kernel and also on Alpine arm64v8 muslc system running a 4.9.1
> > > kernel.  I've written test code in both C and Python.  I'm fairly sure
> > > this is not something I'm doing wrong - but very keen to have things
> > > thrown at me if it is.
> > >
> > > Regards,
> > > Tom Cook
> > >
> > >
> > > #include 
> > > #include 
> > > #include 
> > > #include 
> > > #include 
> > > #include 
> > > #include 
> > > #include 
> > >
> > > struct sock_filter code[] = {
> > > { 0x06,0,0,0x00 }  /* BPF_RET | BPF_K   0   0   0 */
> > > };
> > >
> > > struct sock_fprog bpf = {
> > > .len = 1,
> > > .filter = code,
> > > };
> > >
> > > void test() {
> > > uint8_t buf[2048];
> > >
> > > int sock = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
> > > if (sock < 0) {
> > > printf("Failed to open socket\n");
> > > return;
> > > }
> > > int ret = setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &bpf,
> > > sizeof(bpf));
> > > if (ret < 0) {
> > > printf("Failed to set socket filter\n");
> > > return;
> > > }
> > > struct timeval tv = {
> > > .tv_sec = 1
> > > };
> > >
> > > ret = setsockopt(sock, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
> > > if (ret < 0) {
> > > printf("Failed to set socket timeout\n");
> > > return;
> > > }
> > >
> > > ssize_t count = recv(sock, buf, 2048, 0);
> > > if (count <= 0) {
> > > printf("Failed to read from socket\n");
> > > return;
> > > }
> > >
> > > close(sock);
> > >
> > > uint16_t *ethertype = (short*)(buf + 12);
> > > uint8_t *proto = (unsigned char *)(buf + 23);
> > > uint16_t *dport = (uint16_t *)(buf + 14 + 20);
> > >
> > > printf("Received packet with ethertype 0x%04hu, protocol 0x%02hhu
> > > and dport 0x%04hu\n", *ethertype, *proto, *dport);
> > > }
> > >
> > > int main() {
> > > for (size_t ii = 0; ii < 100; ++ii) {
> > > test();
> > > }
> > > }


Re: [PATCH bpf-next 2/3] bpf: Add size arg to build_id_parse function

2021-01-14 Thread Alexei Starovoitov
On Thu, Jan 14, 2021 at 3:44 PM Yonghong Song  wrote:
>
>
>
> On 1/14/21 2:02 PM, Jiri Olsa wrote:
> > On Thu, Jan 14, 2021 at 01:05:33PM -0800, Yonghong Song wrote:
> >>
> >>
> >> On 1/14/21 12:01 PM, Jiri Olsa wrote:
> >>> On Thu, Jan 14, 2021 at 10:56:33AM -0800, Yonghong Song wrote:
> 
> 
>  On 1/14/21 5:40 AM, Jiri Olsa wrote:
> > It's possible to have other build id types (other than default SHA1).
> > Currently there's also ld support for MD5 build id.
> 
>  Currently, bpf build_id based stackmap does not returns the size of
>  the build_id. Did you see an issue here? I guess user space can check
>  the length of non-zero bits of the build id to decide what kind of
>  type it is, right?
> >>>
> >>> you can have zero bytes in the build id hash, so you need to get the size
> >>>
> >>> I never saw MD5 being used in practise just SHA1, but we added the
> >>> size to be complete and make sure we'll fit with build id, because
> >>> there's only limited space in mmap2 event
> >>
> >> I am asking to check whether we should extend uapi struct
> >> bpf_stack_build_id to include build_id_size as well. I guess
> >> we can delay this until a real use case.
> >
> > right, we can try make some MD5 build id binaries and check if it
> > explodes with some bcc tools, but I don't expect that.. I'll try
> > to find some time for that
>
> Thanks. We may have issues on bcc side. For build_id collected in
> kernel, bcc always generates a length-20 string. But for user
> binaries, the build_id string length is equal to actual size of
> the build_id. They may not match (MD5 length is 16).
> The fix is probably to append '0's (up to length 20) for user
> binary build_id's.
>
> I guess MD5 is very seldom used. I will wait if you can reproduce
> the issue and then we might fix it.

Indeed.
Jiri, please check whether md5 is really an issue.
Sounds like we have to do something on the kernel side.
Hopefully zero padding will be enough.
I would prefer to avoid extending the uapi struct to cover a rare case.

I've applied the series, since this issue sounds orthogonal.
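The padding itself would be trivial, something like (sketch, user-space
flavoured; BUILD_ID_SIZE_MAX being the fixed 20-byte field consumers
already expect):

#include <string.h>

#define BUILD_ID_SIZE_MAX 20

/* pad a shorter build id (e.g. 16-byte MD5) out to the fixed field */
static void build_id_pad(unsigned char *id, size_t len)
{
	if (len < BUILD_ID_SIZE_MAX)
		memset(id + len, 0, BUILD_ID_SIZE_MAX - len);
}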


Re: [PATCH bpf-next v7 00/11] Atomics for eBPF

2021-01-14 Thread Alexei Starovoitov
On Thu, Jan 14, 2021 at 10:18 AM Brendan Jackman  wrote:
>
> There's still one unresolved review comment from John[3] which I
> will resolve with a followup patch.
>
> Differences from v6->v7 [1]:
>
> * Fixed riscv build error detected by 0-day robot.

Applied.
Thanks a lot.

Please address John's request in a followup and these few issues:

- rst doesn't look correct. Example:
rst2man Documentation/networking/filter.rst >/dev/null
Documentation/networking/filter.rst:1053: (WARNING/2) Inline emphasis
start-string without end-string.

> Except ``BPF_ADD`` _without_ ``BPF_FETCH`` (for legacy reasons), all 4 byte
> atomic operations require alu32 mode. Clang enables this mode by default in
> architecture v3 (``-mcpu=v3``). For older versions it can be enabled with
> ``-Xclang -target-feature -Xclang +alu32``.

It reads confusingly to me.
I would rephrase 'clang enables this mode by default' into
'clang can generate new atomic instruction when -mcpu=v3 is enabled'.

'For older versions...'
This part I didn't get. The users need clang 12, which is capable of
emitting these insns.
What 'older versions' are you talking about?


Re: [PATCH bpf v2 1/2] bpf: support PTR_TO_MEM{,_OR_NULL} register spilling

2021-01-13 Thread Alexei Starovoitov
On Wed, Jan 13, 2021 at 2:29 PM KP Singh  wrote:
>
> On Wed, Jan 13, 2021 at 6:38 AM Gilad Reti  wrote:
> >
> > Add support for pointer to mem register spilling, to allow the verifier
> > to track pointers to valid memory addresses. Such pointers are returned
> > for example by a successful call of the bpf_ringbuf_reserve helper.
> >
> > The patch was partially contributed by CyberArk Software, Inc.
> >
> > Fixes: 457f44363a88 ("bpf: Implement BPF ring buffer and verifier support 
> > for it")
> > Suggested-by: Yonghong Song 
> > Signed-off-by: Gilad Reti 
>
> Acked-by: KP Singh 

It's borderline between a feature and a fix.
Since the patch is trivial and it addresses a real problem I've
applied it to the bpf tree.
Thanks!


Re: [PATCHv6 0/4] perf: Add mmap2 build id support

2021-01-13 Thread Alexei Starovoitov
On Tue, Jan 12, 2021 at 1:19 AM Jiri Olsa  wrote:
>
> On Mon, Jan 11, 2021 at 06:49:58PM -0800, Alexei Starovoitov wrote:
> > On Mon, Jan 11, 2021 at 10:38:19PM +0100, Jiri Olsa wrote:
> > > hi,
> > > adding the support to have buildid stored in mmap2 event,
> > > so we can bypass the final perf record hunt on build ids.
> > >
> > > This patchset allows perf to record build ID in mmap2 event,
> > > and adds perf tooling to store/download binaries to .debug
> > > cache based on these build IDs.
> > >
> > > Note that the build id retrieval code is stolen from bpf
> > > code, where it's been used (together with file offsets)
> > > to replace IPs in user space stack traces. It's now added
> > > under lib directory.
> > >
> > > v6 changes:
> > >   - last 4 patches rebased Arnaldo's perf/core
> >
> > There were no issues with v5 as far as I can remember.
> > This is just a resubmit to get it landed ?
>
> yes, exactly
>
> > Last time we couldn't quite figure out which tree to go through.
> > I think the recommended path was to go via bpf-next.
> > Is it still the case?
>
> bpf-next would be best for kernel changes,
>   perf: Add build id data in mmap2 event
>   bpf: Add size arg to build_id_parse function
>   bpf: Move stack_map_get_build_id into lib

Then please cc them to bpf@vger and add [PATCH bpf-next]
otherwise it's all very confusing.

> the 'perf buildid-cache' change needs to go through Arnaldo's tree,
> because it depends on changes he already pulled in

Also don't include the 4th patch in the series if it isn't meant for bpf-next.


Re: [PATCH bpf-next v6 00/11] Atomics for eBPF

2021-01-13 Thread Alexei Starovoitov
On Tue, Jan 12, 2021 at 03:42:24PM +, Brendan Jackman wrote:
> Happy new year everyone, and thanks once again for the reviews.
> 
> There's still one unresolved review comment from John[3] but I don't
> think it needs to block the patchset as it stands, it can be a
> separate patch. Hope that's OK.
> 
> Differences from v5->v6 [1]:
> 
> * Carried Björn Töpel's ack for RISC-V code, plus a couple more acks from
>   Yonhgong.
> 
> * Doc fixups.
> 
> * Trivial cleanups.

Please fix riscv build and resubmit.
I think that's the only outstanding issue left.


Re: [RFC PATCH 3/7] tun: allow use of BPF_PROG_TYPE_SCHED_CLS program type

2021-01-12 Thread Alexei Starovoitov
On Tue, Jan 12, 2021 at 11:42 AM Yuri Benditovich
 wrote:
>
> This program type can set skb hash value. It will be useful
> when the tun will support the hash reporting feature of virtio-net.
>
> Signed-off-by: Yuri Benditovich 
> ---
>  drivers/net/tun.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 7959b5c2d11f..455f7afc1f36 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -2981,6 +2981,8 @@ static int tun_set_ebpf(struct tun_struct *tun, struct 
> tun_prog __rcu **prog_p,
> prog = NULL;
> } else {
> prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_SOCKET_FILTER);
> +   if (IS_ERR(prog))
> +   prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_SCHED_CLS);

You've ignored the feedback and just resend? what for?


Re: [PATCHv6 0/4] perf: Add mmap2 build id support

2021-01-11 Thread Alexei Starovoitov
On Mon, Jan 11, 2021 at 10:38:19PM +0100, Jiri Olsa wrote:
> hi,
> adding the support to have buildid stored in mmap2 event,
> so we can bypass the final perf record hunt on build ids.
> 
> This patchset allows perf to record build ID in mmap2 event,
> and adds perf tooling to store/download binaries to .debug
> cache based on these build IDs.
> 
> Note that the build id retrieval code is stolen from bpf
> code, where it's been used (together with file offsets)
> to replace IPs in user space stack traces. It's now added
> under lib directory.
> 
> v6 changes:
>   - last 4 patches rebased Arnaldo's perf/core

There were no issues with v5 as far as I can remember.
This is just a resubmit to get it landed ?
Last time we couldn't quite figure out which tree to go through.
I think the recommended path was to go via bpf-next.
Is it still the case?


Re: [RFC PATCH bpf-next 0/2] bpf, libbpf: share BTF data show functionality

2021-01-11 Thread Alexei Starovoitov
On Mon, Jan 11, 2021 at 05:32:51PM +, Alan Maguire wrote:
> The BPF Type Format (BTF) can be used in conjunction with the helper
> bpf_snprintf_btf() to display kernel data with type information.
> 
> This series generalizes that support and shares it with libbpf so
> that libbpf can display typed data.  BTF display functionality is
> factored out of kernel/bpf/btf.c into kernel/bpf/btf_show_common.c,
> and that file is duplicated in tools/lib/bpf.  Similarly, common
> definitions and inline functions needed for this support are
> extracted into include/linux/btf_common.h and this header is again
> duplicated in tools/lib/bpf.

I think duplication will inevitably cause the code to diverge.
Could you keep a single copy?
Take a look at kernel/bpf/disasm.[ch].
It is compiled once for the kernel and once for bpftool.
So it should be possible to do something similar for this case as well?


Re: [PATCH] bpf: Add signature checking for BPF programs

2021-01-05 Thread Alexei Starovoitov
On Tue, Jan 5, 2021 at 12:00 AM Xichen Lin  wrote:
>
> From: Xichen Lin 
>
> Check the signature of a BPF program against the same set of keys for
> module signature checking.
>
> Currently the format of a signed BPF program is similar to that of
> a signed kernel module, composed of BPF bytecode, signature,
> module_signature structure and a magic string, in order, aligned to
> struct sock_filter.

Commit log talks about 'what' and gives no insight into 'why' the
patch was sent.
Please take time to clearly explain the motivation for the changes.
Also please see earlier discussions on the subject and Arnaldo's preso
from plumbers.


Re: [RFC PATCH 3/7] tun: allow use of BPF_PROG_TYPE_SCHED_CLS program type

2021-01-05 Thread Alexei Starovoitov
On Tue, Jan 05, 2021 at 02:24:12PM +0200, Yuri Benditovich wrote:
> This program type can set skb hash value. It will be useful
> when the tun will support the hash reporting feature of virtio-net.
> 
> Signed-off-by: Yuri Benditovich 
> ---
>  drivers/net/tun.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 7959b5c2d11f..455f7afc1f36 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -2981,6 +2981,8 @@ static int tun_set_ebpf(struct tun_struct *tun, struct 
> tun_prog __rcu **prog_p,
>   prog = NULL;
>   } else {
>   prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_SOCKET_FILTER);
> + if (IS_ERR(prog))
> + prog = bpf_prog_get_type(fd, BPF_PROG_TYPE_SCHED_CLS);

Patches 1 and 2 are missing for me, so I couldn't review properly,
but this diff looks odd.
It allows sched_cls prog type to attach to tun.
That means everything that sched_cls progs can do will be done from tun hook?
sched_cls assumes l2 and can modify the packet.
I think crashes are inevitable.


Re: [PATCH v2] sched/debug: Add new tracepoint to track cpu_capacity

2021-01-05 Thread Alexei Starovoitov
On Tue, Jan 5, 2021 at 3:39 AM Qais Yousef  wrote:
>
> On 01/04/21 10:59, Alexei Starovoitov wrote:
> > > > > I did have a patch that allowed that. It might be worth trying to 
> > > > > upstream it.
> > > > > It just required a new macro which could be problematic.
> > > > >
> > > > > https://github.com/qais-yousef/linux/commit/fb9fea29edb8af327e6b2bf3bc41469a8e66df8b
> > > > >
> > > > > With the above I could attach using bpf::RAW_TRACEPOINT mechanism.
> > > > >
> > > >
> > > > Yeah, that could work. I meant there was no way to do it with what was 
> > > > there :)
> > > >
> > > > In our initial attempts at using BPF to get at nr_running (which I was 
> > > > not
> > > > involved in and don't have all the details...) there were issues being 
> > > > able to
> > > > keep up and losing events.  That may have been an implementation issue, 
> > > > but
> > > > using the module and trace-cmd doesn't have that problem. Hopefully you 
> > > > don't
> > > > see that using RAW_TRACEPOINTs.
> > >
> > > So I have a proper patch for that now, that actually turned out to be 
> > > really
> > > tiny once you untangle exactly what is missing.
> > >
> > > Peter, bpf programs aren't considered ABIs AFAIK, do you have concerns 
> > > about
> > > that?
> > >
> > > Thanks
> > >
> > > --
> > > Qais Yousef
> > >
> > > -->8--
> > >
> > > From cf81de8c7db03d62730939aa902579339e2fc859 Mon Sep 17 00:00:00 2001
> > > From: Qais Yousef 
> > > Date: Wed, 30 Dec 2020 22:20:34 +
> > > Subject: [PATCH] trace: bpf: Allow bpf to attach to bare tracepoints
> > >
> > > Some subsystems only have bare tracepoints (a tracepoint with no
> > > associated trace event) to avoid the problem of trace events being an
> > > ABI that can't be changed.
> > >
> > > From bpf perspective, bare tracepoints are what it calls
> > > RAW_TRACEPOINT().
> > >
> > > Since bpf assumed there's 1:1 mapping, it relied on hooking to
> > > DEFINE_EVENT() macro to create bpf mapping of the tracepoints. Since
> > > bare tracepoints use DECLARE_TRACE() to create the tracepoint, bpf had
> > > no knowledge about their existence.
> > >
> > > By teaching bpf_probe.h to parse DECLARE_TRACE() in a similar fashion to
> > > DEFINE_EVENT(), bpf can find and attach to the new raw tracepoints.
> > >
> > > Enabling that comes with the contract that changes to raw tracepoints
> > > don't constitute a regression if they break existing bpf programs.
> > > We need the ability to continue to morph and modify these raw
> > > tracepoints without worrying about any ABI.
> > >
> > > Signed-off-by: Qais Yousef 
> > > ---
> > >  include/trace/bpf_probe.h | 12 ++--
> > >  1 file changed, 10 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/include/trace/bpf_probe.h b/include/trace/bpf_probe.h
> > > index cd74bffed5c6..a23be89119aa 100644
> > > --- a/include/trace/bpf_probe.h
> > > +++ b/include/trace/bpf_probe.h
> > > @@ -55,8 +55,7 @@
> > >  /* tracepoints with more than 12 arguments will hit build error */
> > >  #define CAST_TO_U64(...) CONCATENATE(__CAST, 
> > > COUNT_ARGS(__VA_ARGS__))(__VA_ARGS__)
> > >
> > > -#undef DECLARE_EVENT_CLASS
> > > -#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
> > > +#define __BPF_DECLARE_TRACE(call, proto, args) \
> > >  static notrace void\
> > >  __bpf_trace_##call(void *__data, proto)  
> > >   \
> > >  {  \
> > > @@ -64,6 +63,10 @@ __bpf_trace_##call(void *__data, proto)
> > >   \
> > > CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(prog, 
> > > CAST_TO_U64(args));  \
> > >  }
> > >
> > > +#undef DECLARE_EVENT_CLASS
> > > +#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
> > > +   __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args))
> > > +
> > >  /*
> > >   * This part is compiled out, it is only here as a build time check

Re: [RFC PATCH bpf-next] ksnoop: kernel argument/return value tracing/display using BTF

2021-01-04 Thread Alexei Starovoitov
On Mon, Jan 04, 2021 at 03:26:31PM +, Alan Maguire wrote:
> 
> ksnoop can be used to show function signatures; for example:
> 
> $ ksnoop info ip_send_skb
> int  ip_send_skb(struct net  * net, struct sk_buff  * skb);
> 
> Then we can trace the function, for example:
> 
> $ ksnoop trace ip_send_skb

Thanks for sharing. It will be a useful tool.

> +
> + data = get_arg(ctx, currtrace->base_arg);
> +
> + dataptr = (void *)data;
> +
> + if (currtrace->offset)
> + dataptr += currtrace->offset;
> +
> + /* look up member value and read into data field, provided
> +  * it <= size of a __u64; when it is, it can be used in
> +  * predicate evaluation.
> +  */
> + if (currtrace->flags & KSNOOP_F_MEMBER) {
> + ret = -EINVAL;
> + data = 0;
> + if (currtrace->size <= sizeof(__u64))
> + ret = bpf_probe_read_kernel(&data,
> + currtrace->size,
> + dataptr);
> + else
> + bpf_printk("size was %d cant trace",
> +currtrace->size);
> + if (ret) {
> + currdata->err_type_id =
> + currtrace->type_id;
> + currdata->err = ret;
> + continue;
> + }
> + if (currtrace->flags & KSNOOP_F_PTR)
> + dataptr = (void *)data;
> + }
> +
> + /* simple predicate evaluation: if any predicate fails,
> +  * skip all tracing for this function.
> +  */
> + if (currtrace->flags & KSNOOP_F_PREDICATE_MASK) {
> + bool ok = false;
> +
> + if (currtrace->flags & KSNOOP_F_PREDICATE_EQ &&
> + data == currtrace->predicate_value)
> + ok = true;
> +
> + if (currtrace->flags & KSNOOP_F_PREDICATE_NOTEQ &&
> + data != currtrace->predicate_value)
> + ok = true;
> +
> + if (currtrace->flags & KSNOOP_F_PREDICATE_GT &&
> + data > currtrace->predicate_value)
> + ok = true;
> + if (currtrace->flags & KSNOOP_F_PREDICATE_LT &&
> + data < currtrace->predicate_value)
> + ok = true;
> +
> + if (!ok)
> + goto skiptrace;
> + }
> +
> + currdata->raw_value = data;
> +
> + if (currtrace->flags & (KSNOOP_F_PTR | KSNOOP_F_MEMBER))
> + btf_ptr.ptr = dataptr;
> + else
> + btf_ptr.ptr = &data;
> +
> + btf_ptr.type_id = currtrace->type_id;
> +
> + if (trace->buf_len + MAX_TRACE_DATA >= MAX_TRACE_BUF)
> + break;
> +
> + buf_offset = &trace->buf[trace->buf_len];
> + if (buf_offset > &trace->buf[MAX_TRACE_BUF]) {
> + currdata->err_type_id = currtrace->type_id;
> + currdata->err = -ENOSPC;
> + continue;
> + }
> + currdata->buf_offset = trace->buf_len;
> +
> + ret = bpf_snprintf_btf(buf_offset,
> +MAX_TRACE_DATA,
> +&btf_ptr, sizeof(btf_ptr),
> +BTF_F_PTR_RAW);

The overhead would be much lower if, instead of printing in the kernel, the
tool's bpf prog dumped the raw struct data into a ring buffer and let the user
space part of the tool do the pretty printing. There would be no need to pass
btf_id from user space to the kernel either. User space would need to
gain the pretty printing logic, but maybe we can share that code somehow between
the kernel and libbpf.

Separately, the interpreter in the bpf prog that handles predicates is kinda
anti-bpf :) I think ksnoop can generate bpf code on the fly instead. No need
for llvm. The currtrace->offset/size would be written into placeholder
instructions in the prog by ksnoop before loading it, with much lower overhead
for filtering.
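To make the first point concrete, the prog side could be as simple as copying
raw bytes plus the type id into a ring buffer and letting user space render it
from BTF; rough sketch (all names made up, assumes the usual
vmlinux.h/bpf_helpers.h includes and a verifier-bounded 'len'):

struct {
	__uint(type, BPF_MAP_TYPE_RINGBUF);
	__uint(max_entries, 256 * 1024);
} rb SEC(".maps");

struct trace_rec {
	__u32 type_id;			/* user space maps this back to BTF */
	__u32 len;
	__u8 data[MAX_TRACE_DATA];
};

static __always_inline int emit(void *ptr, __u32 type_id, __u32 len)
{
	struct trace_rec *rec;

	rec = bpf_ringbuf_reserve(&rb, sizeof(*rec), 0);
	if (!rec)
		return -1;
	rec->type_id = type_id;
	rec->len = len;
	/* no bpf_snprintf_btf() in the kernel; ship the raw struct bytes */
	bpf_probe_read_kernel(rec->data, len, ptr);
	bpf_ringbuf_submit(rec, 0);
	return 0;
}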


Re: [PATCH v2] sched/debug: Add new tracepoint to track cpu_capacity

2021-01-04 Thread Alexei Starovoitov
On Mon, Jan 4, 2021 at 10:29 AM Qais Yousef  wrote:
>
> On 09/08/20 09:19, Phil Auld wrote:
> > Hi Quais,
> >
> > On Mon, Sep 07, 2020 at 12:02:24PM +0100 Qais Yousef wrote:
> > > On 09/02/20 09:54, Phil Auld wrote:
> > > > >
> > > > > I think this decoupling is not necessary. The natural place for those
> > > > > scheduler trace_event based on trace_points extension files is
> > > > > kernel/sched/ and here the internal sched.h can just be included.
> > > > >
> > > > > If someone really wants to build this as an out-of-tree module there 
> > > > > is
> > > > > an easy way to make kernel/sched/sched.h visible.
> > > > >
> > > >
> > > > It's not so much that we really _want_ to do this in an external module.
> > > > But we aren't adding more trace events and my (limited) knowledge of
> > > > BPF let me to the conclusion that its raw tracepoint functionality
> > > > requires full events.  I didn't see any other way to do it.
> > >
> > > I did have a patch that allowed that. It might be worth trying to 
> > > upstream it.
> > > It just required a new macro which could be problematic.
> > >
> > > https://github.com/qais-yousef/linux/commit/fb9fea29edb8af327e6b2bf3bc41469a8e66df8b
> > >
> > > With the above I could attach using bpf::RAW_TRACEPOINT mechanism.
> > >
> >
> > Yeah, that could work. I meant there was no way to do it with what was 
> > there :)
> >
> > In our initial attempts at using BPF to get at nr_running (which I was not
> > involved in and don't have all the details...) there were issues being able 
> > to
> > keep up and losing events.  That may have been an implementation issue, but
> > using the module and trace-cmd doesn't have that problem. Hopefully you 
> > don't
> > see that using RAW_TRACEPOINTs.
>
> So I have a proper patch for that now, that actually turned out to be really
> tiny once you untangle exactly what is missing.
>
> Peter, bpf programs aren't considered ABIs AFAIK, do you have concerns about
> that?
>
> Thanks
>
> --
> Qais Yousef
>
> -->8--
>
> From cf81de8c7db03d62730939aa902579339e2fc859 Mon Sep 17 00:00:00 2001
> From: Qais Yousef 
> Date: Wed, 30 Dec 2020 22:20:34 +
> Subject: [PATCH] trace: bpf: Allow bpf to attach to bare tracepoints
>
> Some subsystems only have bare tracepoints (a tracepoint with no
> associated trace event) to avoid the problem of trace events being an
> ABI that can't be changed.
>
> From bpf perspective, bare tracepoints are what it calls
> RAW_TRACEPOINT().
>
> Since bpf assumed there's 1:1 mapping, it relied on hooking to
> DEFINE_EVENT() macro to create bpf mapping of the tracepoints. Since
> bare tracepoints use DECLARE_TRACE() to create the tracepoint, bpf had
> no knowledge about their existence.
>
> By teaching bpf_probe.h to parse DECLARE_TRACE() in a similar fashion to
> DEFINE_EVENT(), bpf can find and attach to the new raw tracepoints.
>
> Enabling that comes with the contract that changes to raw tracepoints
> don't constitute a regression if they break existing bpf programs.
> We need the ability to continue to morph and modify these raw
> tracepoints without worrying about any ABI.
>
> Signed-off-by: Qais Yousef 
> ---
>  include/trace/bpf_probe.h | 12 ++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/include/trace/bpf_probe.h b/include/trace/bpf_probe.h
> index cd74bffed5c6..a23be89119aa 100644
> --- a/include/trace/bpf_probe.h
> +++ b/include/trace/bpf_probe.h
> @@ -55,8 +55,7 @@
>  /* tracepoints with more than 12 arguments will hit build error */
>  #define CAST_TO_U64(...) CONCATENATE(__CAST, 
> COUNT_ARGS(__VA_ARGS__))(__VA_ARGS__)
>
> -#undef DECLARE_EVENT_CLASS
> -#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
> +#define __BPF_DECLARE_TRACE(call, proto, args) \
>  static notrace void\
>  __bpf_trace_##call(void *__data, proto)  
>   \
>  {  \
> @@ -64,6 +63,10 @@ __bpf_trace_##call(void *__data, proto)
>   \
> CONCATENATE(bpf_trace_run, COUNT_ARGS(args))(prog, 
> CAST_TO_U64(args));  \
>  }
>
> +#undef DECLARE_EVENT_CLASS
> +#define DECLARE_EVENT_CLASS(call, proto, args, tstruct, assign, print) \
> +   __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args))
> +
>  /*
>   * This part is compiled out, it is only here as a build time check
>   * to make sure that if the tracepoint handling changes, the
> @@ -111,6 +114,11 @@ __DEFINE_EVENT(template, call, PARAMS(proto), 
> PARAMS(args), size)
>  #define DEFINE_EVENT_PRINT(template, name, proto, args, print) \
> DEFINE_EVENT(template, name, PARAMS(proto), PARAMS(args))
>
> +#undef DECLARE_TRACE
> +#define DECLARE_TRACE(call, proto, args)   \
> +   __BPF_DECLARE_TRACE(call, PARAMS(proto), PARAMS(args))  \
> +   __DEFINE_EVENT(call, call, 

Re: [PATCH] mm: add prototype for __add_to_page_cache_locked()

2020-12-27 Thread Alexei Starovoitov
On Wed, Dec 23, 2020 at 7:54 AM Christoph Hellwig  wrote:
>
> On Wed, Dec 23, 2020 at 12:11:36PM +, Matthew Wilcox wrote:
> > On Wed, Dec 23, 2020 at 08:31:26AM +, Christoph Hellwig wrote:
> > > Can we please make the eBPF code stop referencing this function instead
> > > of papering over this crap?  It has no business poking into page cache
> > > internals.
> >
> > The reference from the BPF code is simply "you can inject errors here".
> > And I think we want to be able to inject errors to test the error paths,
> > no?
>
> Something that expects a symbol to be global is just pretty broken.
> I think it need to change so that whatever instrumentation is done can
> coexist with a static function.

Pls read commit description that made it global.
It has nothing to do with bpf.


Re: [PATCH bpf-next 1/2] bpf: Add a bpf_kallsyms_lookup helper

2020-12-18 Thread Alexei Starovoitov
On Fri, Dec 18, 2020 at 10:53:57AM -0800, Andrii Nakryiko wrote:
> On Thu, Dec 17, 2020 at 7:20 PM Alexei Starovoitov
>  wrote:
> >
> > On Thu, Dec 17, 2020 at 09:26:09AM -0800, Yonghong Song wrote:
> > >
> > >
> > > On 12/17/20 7:31 AM, Florent Revest wrote:
> > > > On Mon, Dec 14, 2020 at 7:47 AM Yonghong Song  wrote:
> > > > > On 12/11/20 6:40 AM, Florent Revest wrote:
> > > > > > On Wed, Dec 2, 2020 at 10:18 PM Alexei Starovoitov
> > > > > >  wrote:
> > > > > > > I still think that adopting printk/vsnprintf for this instead of
> > > > > > > reinventing the wheel
> > > > > > > is more flexible and easier to maintain long term.
> > > > > > > Almost the same layout can be done with vsnprintf
> > > > > > > with exception of \0 char.
> > > > > > > More meaningful names, etc.
> > > > > > > See Documentation/core-api/printk-formats.rst
> > > > > >
> > > > > > I agree this would be nice. I finally got a bit of time to 
> > > > > > experiment
> > > > > > with this and I noticed a few things:
> > > > > >
> > > > > > First of all, because helpers only have 5 arguments, if we use two 
> > > > > > for
> > > > > > the output buffer and its size and two for the format string and its
> > > > > > size, we are only left with one argument for a modifier. This is 
> > > > > > still
> > > > > > enough for our usecase (where we'd only use "%ps" for example) but 
> > > > > > it
> > > > > > does not strictly-speaking allow for the same layout that Andrii
> > > > > > proposed.
> > > > >
> > > > > See helper bpf_seq_printf. It packs all arguments for format string 
> > > > > and
> > > > > puts them into an array. bpf_seq_printf will unpack them as it parsed
> > > > > through the format string. So it should be doable to have more than
> > > > > "%ps" in format string.
> > > >
> > > > This could be a nice trick, thank you for the suggestion Yonghong :)
> > > >
> > > > My understanding is that this would also require two extra args (one
> > > > for the array of arguments and one for the size of this array) so it
> > > > would still not fit the 5 arguments limit I described in my previous
> > > > email.
> > > > eg: this would not be possible:
> > > > long bpf_snprintf(const char *out, u32 out_size,
> > > >const char *fmt, u32 fmt_size,
> > > >   const void *data, u32 data_len)
> > >
> > > Right. bpf allows only up to 5 parameters.
> > > >
> > > > Would you then suggest that we also put the format string and its
> > > > length in the first and second cells of this array and have something
> > > > along the line of:
> > > > long bpf_snprintf(const char *out, u32 out_size,
> > > >const void *args, u32 args_len) ?
> > > > This seems like a fairly opaque signature to me and harder to verify.
> > >
> > > One way is to define an explicit type for args, something like
> > >struct bpf_fmt_str_data {
> > >   char *fmt;
> > >   u64 fmt_len;
> > >   u64 data[];
> > >};
> >
> > that feels a bit convoluted.
> >
> > The reason I feel uneasy with the helper as originally proposed
> > and with Andrii's proposal is all the extra strlen and strcpy that
> > needs to be done. In the helper we have to call kallsyms_lookup(),
> > which is an OK interface for what it was designed to do,
> > but it's awkward to use for constructing a new string ("%s [%s]", sym, modname)
> > or to send two strings into a ring buffer.
> > Andrii's zero separator idea will simplify bpf prog, but user space
> > would need to do strlen anyway if it needs to pretty print.
> > If we take pain on converting addr to sym+modname let's figure out
> > how to make it easy for the bpf prog to do and easy for user space to 
> > consume.
> > That's why I proposed snprintf.
> 
> I have nothing against snprintf support for symbols. But
> bpf_ksym_resolve() solves only a partially overlapping problem, so
> deserves to be added in addition to snprintf support. With snprintf,
> it will be hard to avoid two loo

Re: [PATCH bpf-next 1/2] bpf: Add a bpf_kallsyms_lookup helper

2020-12-17 Thread Alexei Starovoitov
On Thu, Dec 17, 2020 at 09:26:09AM -0800, Yonghong Song wrote:
> 
> 
> On 12/17/20 7:31 AM, Florent Revest wrote:
> > On Mon, Dec 14, 2020 at 7:47 AM Yonghong Song  wrote:
> > > On 12/11/20 6:40 AM, Florent Revest wrote:
> > > > On Wed, Dec 2, 2020 at 10:18 PM Alexei Starovoitov
> > > >  wrote:
> > > > > I still think that adopting printk/vsnprintf for this instead of
> > > > > reinventing the wheel
> > > > > is more flexible and easier to maintain long term.
> > > > > Almost the same layout can be done with vsnprintf
> > > > > with exception of \0 char.
> > > > > More meaningful names, etc.
> > > > > See Documentation/core-api/printk-formats.rst
> > > > 
> > > > I agree this would be nice. I finally got a bit of time to experiment
> > > > with this and I noticed a few things:
> > > > 
> > > > First of all, because helpers only have 5 arguments, if we use two for
> > > > the output buffer and its size and two for the format string and its
> > > > size, we are only left with one argument for a modifier. This is still
> > > > enough for our usecase (where we'd only use "%ps" for example) but it
> > > > does not strictly-speaking allow for the same layout that Andrii
> > > > proposed.
> > > 
> > > See helper bpf_seq_printf. It packs all arguments for format string and
> > > puts them into an array. bpf_seq_printf will unpack them as it parsed
> > > through the format string. So it should be doable to have more than
> > > "%ps" in format string.
> > 
> > This could be a nice trick, thank you for the suggestion Yonghong :)
> > 
> > My understanding is that this would also require two extra args (one
> > for the array of arguments and one for the size of this array) so it
> > would still not fit the 5 arguments limit I described in my previous
> > email.
> > eg: this would not be possible:
> > long bpf_snprintf(const char *out, u32 out_size,
> >const char *fmt, u32 fmt_size,
> >   const void *data, u32 data_len)
> 
> Right. bpf allows only up to 5 parameters.
> > 
> > Would you then suggest that we also put the format string and its
> > length in the first and second cells of this array and have something
> > along the line of:
> > long bpf_snprintf(const char *out, u32 out_size,
> >const void *args, u32 args_len) ?
> > This seems like a fairly opaque signature to me and harder to verify.
> 
> One way is to define an explicit type for args, something like
>struct bpf_fmt_str_data {
>   char *fmt;
>   u64 fmt_len;
>   u64 data[];
>};

that feels a bit convoluted.

The reason I feel uneasy with the helper as originally proposed
and with Andrii's proposal is all the extra strlen and strcpy that
needs to be done. In the helper we have to call kallsyms_lookup(),
which is an OK interface for what it was designed to do,
but it's awkward to use for constructing a new string ("%s [%s]", sym, modname)
or to send two strings into a ring buffer.
Andrii's zero separator idea will simplify bpf prog, but user space
would need to do strlen anyway if it needs to pretty print.
If we take pain on converting addr to sym+modname let's figure out
how to make it easy for the bpf prog to do and easy for user space to consume.
That's why I proposed snprintf.

As far as 6 arg issue:
long bpf_snprintf(const char *out, u32 out_size,
  const char *fmt, u32 fmt_size,
  const void *data, u32 data_len);
Yeah. It won't work as-is, but fmt_size is unnecessary nowadays.
The verifier understands read-only data.
Hence the helper can be:
long bpf_snprintf(const char *out, u32 out_size,
  const char *fmt,
  const void *data, u32 data_len);
The 3rd arg cannot be ARG_PTR_TO_MEM.
Instead we can introduce ARG_PTR_TO_CONST_STR in the verifier.
See check_mem_access() where it's doing bpf_map_direct_read().
That 'fmt' string will be accessed through the same bpf_map_direct_read().
The verifier would need to check that it's NUL-terminated valid string.
It should probably do % specifier checks at the same time.
At the end bpf_snprintf() will have 5 args and when wrapped with 
BPF_SNPRINTF() macro it will accept an arbitrary number of arguments to print.
It also will be generally useful to do all other kinds of pretty printing.
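With that, usage from a prog would look roughly like this (sketch;
BPF_SNPRINTF() is the proposed wrapper macro, mirroring the existing
BPF_SEQ_PRINTF(), and the attach point is just an example):

SEC("kprobe/ip_send_skb")
int show_sym(struct pt_regs *ctx)
{
	char out[64];

	/* fmt lives in read-only .rodata, so the verifier can validate it
	 * at load time and no fmt_size argument is needed
	 */
	BPF_SNPRINTF(out, sizeof(out), "%pS", (void *)PT_REGS_IP(ctx));
	bpf_printk("%s", out);
	return 0;
}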


Re: [RFC PATCH v1 7/7] powerpc/bpf: Implement extended BPF on PPC32

2020-12-16 Thread Alexei Starovoitov
On Wed, Dec 16, 2020 at 10:07:37AM +, Christophe Leroy wrote:
> Implement Extended Berkeley Packet Filter on Powerpc 32
> 
> Test result with test_bpf module:
> 
>   test_bpf: Summary: 378 PASSED, 0 FAILED, [354/366 JIT'ed]

nice!

> Registers mapping:
> 
>   [BPF_REG_0] = r11-r12
>   /* function arguments */
>   [BPF_REG_1] = r3-r4
>   [BPF_REG_2] = r5-r6
>   [BPF_REG_3] = r7-r8
>   [BPF_REG_4] = r9-r10
>   [BPF_REG_5] = r21-r22 (Args 9 and 10 come in via the stack)
>   /* non volatile registers */
>   [BPF_REG_6] = r23-r24
>   [BPF_REG_7] = r25-r26
>   [BPF_REG_8] = r27-r28
>   [BPF_REG_9] = r29-r30
>   /* frame pointer aka BPF_REG_10 */
>   [BPF_REG_FP] = r31
>   /* eBPF jit internal registers */
>   [BPF_REG_AX] = r19-r20
>   [TMP_REG] = r18
> 
> As PPC32 doesn't have a redzone in the stack,
> use r17 as tail call counter.
> 
> r0 is used as temporary register as much as possible. It is referenced
> directly in the code in order to avoid misuse of it, because some
> instructions interpret it as value 0 instead of register r0
> (ex: addi, addis, stw, lwz, ...)
> 
> The following operations are not implemented:
> 
>   case BPF_ALU64 | BPF_DIV | BPF_X: /* dst /= src */
>   case BPF_ALU64 | BPF_MOD | BPF_X: /* dst %= src */
>   case BPF_STX | BPF_XADD | BPF_DW: /* *(u64 *)(dst + off) += src 
> */
> 
> The following operations are only implemented for power of two constants:
> 
>   case BPF_ALU64 | BPF_MOD | BPF_K: /* dst %= imm */
>   case BPF_ALU64 | BPF_DIV | BPF_K: /* dst /= imm */

Those are sensible limitations. MOD and DIV are rare, but XADD is common.
Please consider doing it as a cmpxchg loop in the future.

Also please run test_progs. It will give a lot better coverage than test_bpf.ko
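For reference, the general shape of that fallback in C (the real thing would
be the lwarx/stwcx. sequence emitted by the JIT, and the BPF_DW case is still
awkward on ppc32 since there is no native 64-bit cmpxchg there):

/* sketch: atomic add implemented as a compare-and-exchange retry loop */
static void xadd32(u32 *ptr, u32 val)
{
	u32 old;

	do {
		old = READ_ONCE(*ptr);
	} while (cmpxchg(ptr, old, old + val) != old);
}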


Re: [PATCH] mm/filemap: add static for function __add_to_page_cache_locked

2020-12-09 Thread Alexei Starovoitov
On Wed, Dec 9, 2020 at 6:31 PM Steven Rostedt  wrote:
>
> On Wed, 9 Dec 2020 17:12:43 -0800
> Alexei Starovoitov  wrote:
>
> > > > > > FWIW, I intend to do some consolidation/renaming in this area.  I
> > > > > > trust that will not be a problem?
> > > > >
> > > > > If it does not break anything, it will be not a problem ;-)
> > > > >
> > > > > It's possible that __add_to_page_cache_locked() can be a global symbol
> > > > > with add_to_page_cache_lru() + add_to_page_cache_locked() being just
> > > > > static/inline wrappers around it.
> > > >
> > > > So what happens to BTF if we change this area entirely?  Your IDs
> > > > sound like some kind of ABI to me, which is extremely scary.
> > >
> > > Is BTF becoming the new tracepoint? That is, random additions of things 
> > > like:
> > >
> > >BTF_ID(func,__add_to_page_cache_locked)
> > >
> > > Like was done in commit 1e6c62a882155 ("bpf: Introduce sleepable BPF
> > > programs") without any notification to the maintainers of the
> > > __add_to_page_cache_locked code, will suddenly become an API?
> >
> > huh? what api/abi you're talking about?
>
> If the function __add_to_page_cache_locked were to be removed due to
> the code being rewritten,  would it break any user space? If not, then
> there's nothing to worry about. ;-)

That function is marked with ALLOW_ERROR_INJECTION.
So any script that exercises it via debugfs (or via bpf) will not work.
That's nothing new. Same "breakage" happens with kprobes, etc.
The function was marked with error_inject for a reason though.
The refactoring or renaming of this code ideally should provide a way to keep
a similar pattern of injecting errors into this code path.
It could be a completely new function, of course.
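For reference, the pattern in question looks like this (whatever the function
ends up being called after a refactor, keeping a global, noinline injection
point with this annotation preserves the behaviour; body elided):

#include <linux/error-injection.h>

/* must stay non-static and noinline so the injection point remains an
 * addressable function the compiler can't fold away
 */
noinline int __add_to_page_cache_locked(struct page *page,
					struct address_space *mapping,
					pgoff_t offset, gfp_t gfp,
					void **shadowp)
{
	/* ... */
	return 0;
}
ALLOW_ERROR_INJECTION(__add_to_page_cache_locked, ERRNO);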


Re: [PATCH] mm/filemap: add static for function __add_to_page_cache_locked

2020-12-09 Thread Alexei Starovoitov
On Wed, Dec 9, 2020 at 2:32 PM Steven Rostedt  wrote:
>
> On Wed, Dec 09, 2020 at 06:05:52PM +, Christoph Hellwig wrote:
> > On Wed, Dec 09, 2020 at 04:51:48PM +0100, Stanislaw Gruszka wrote:
> > > On Wed, Dec 09, 2020 at 03:08:26PM +, Matthew Wilcox wrote:
> > > > On Wed, Dec 09, 2020 at 03:46:28PM +0100, Stanislaw Gruszka wrote:
> > > > > At this point of release cycle we should probably go with revert,
> > > > > but I think the main problem is that BPF and ERROR_INJECTION use
> > > > > function that is not intended to be used externally. For external 
> > > > > users
> > > > > add_to_page_cache_lru() and add_to_page_cache_locked() are exported
> > > > > and I think those should be used (see the patch below).
> > > >
> > > > FWIW, I intend to do some consolidation/renaming in this area.  I
> > > > trust that will not be a problem?
> > >
> > > If it does not break anything, it will be not a problem ;-)
> > >
> > > It's possible that __add_to_page_cache_locked() can be a global symbol
> > > with add_to_page_cache_lru() + add_to_page_cache_locked() being just
> > > static/inline wrappers around it.
> >
> > So what happens to BTF if we change this area entirely?  Your IDs
> > sound like some kind of ABI to me, which is extremely scary.
>
> Is BTF becoming the new tracepoint? That is, random additions of things like:
>
>BTF_ID(func,__add_to_page_cache_locked)
>
> Like was done in commit 1e6c62a882155 ("bpf: Introduce sleepable BPF
> programs") without any notification to the maintainers of the
> __add_to_page_cache_locked code, will suddenly become an API?

huh? what api/abi you're talking about?


Re: [PATCH] mm/filemap: add static for function __add_to_page_cache_locked

2020-12-07 Thread Alexei Starovoitov
On Mon, Dec 7, 2020 at 2:59 PM Alexei Starovoitov
 wrote:
>
> On Mon, Dec 7, 2020 at 2:53 PM Michal Kubecek  wrote:
> >
> > On Mon, Dec 07, 2020 at 02:44:22PM -0800, Alexei Starovoitov wrote:
> > > On Mon, Dec 7, 2020 at 10:36 AM Justin Forbes  
> > > wrote:
> > > >
> > > > On Mon, Dec 7, 2020 at 2:16 AM Michal Kubecek  wrote:
> > > > >
> > > > > On Thu, Nov 12, 2020 at 08:18:57AM +0800, Alex Shi wrote:
> > > > > >
> > > > > >
> > > > > > 在 2020/11/11 上午3:50, Andrew Morton 写道:
> > > > > > > On Tue, 10 Nov 2020 08:39:24 +0530 Souptick Joarder 
> > > > > > >  wrote:
> > > > > > >
> > > > > > >> On Fri, Nov 6, 2020 at 4:55 PM Alex Shi 
> > > > > > >>  wrote:
> > > > > > >>>
> > > > > > >>> Otherwise it cause gcc warning:
> > > > > > >>>   ^~~
> > > > > > >>> ../mm/filemap.c:830:14: warning: no previous prototype for
> > > > > > >>> ‘__add_to_page_cache_locked’ [-Wmissing-prototypes]
> > > > > > >>>  noinline int __add_to_page_cache_locked(struct page *page,
> > > > > > >>>   ^~
> > > > > > >>
> > > > > > >> Is CONFIG_DEBUG_INFO_BTF enabled in your .config ?
> > > > > > >
> > > > > > > hm, yes.
> > > > > >
> > > > > > When the config enabled, compiling looks good untill pahole tool
> > > > > > used to get BTF info, but I still failed on a right version pahole
> > > > > > > 1.16. Sorry.
> > > > > >
> > > > > > >
> > > > > > >>>
> > > > > > >>> Signed-off-by: Alex Shi 
> > > > > > >>> Cc: Andrew Morton 
> > > > > > >>> Cc: linux...@kvack.org
> > > > > > >>> Cc: linux-kernel@vger.kernel.org
> > > > > > >>> ---
> > > > > > >>>  mm/filemap.c | 2 +-
> > > > > > >>>  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > > >>>
> > > > > > >>> diff --git a/mm/filemap.c b/mm/filemap.c
> > > > > > >>> index d90614f501da..249cf489f5df 100644
> > > > > > >>> --- a/mm/filemap.c
> > > > > > >>> +++ b/mm/filemap.c
> > > > > > >>> @@ -827,7 +827,7 @@ int replace_page_cache_page(struct page 
> > > > > > >>> *old, struct page *new, gfp_t gfp_mask)
> > > > > > >>>  }
> > > > > > >>>  EXPORT_SYMBOL_GPL(replace_page_cache_page);
> > > > > > >>>
> > > > > > >>> -noinline int __add_to_page_cache_locked(struct page *page,
> > > > > > >>> +static noinline int __add_to_page_cache_locked(struct page 
> > > > > > >>> *page,
> > > > > > >>> struct address_space 
> > > > > > >>> *mapping,
> > > > > > >>> pgoff_t offset, gfp_t 
> > > > > > >>> gfp,
> > > > > > >>> void **shadowp)
> > > > > > >
> > > > > > > It's unclear to me whether BTF_ID() requires that the target 
> > > > > > > symbol be
> > > > > > > non-static.  It doesn't actually reference the symbol:
> > > > > > >
> > > > > > > #define BTF_ID(prefix, name) \
> > > > > > > __BTF_ID(__ID(__BTF_ID__##prefix##__##name##__))
> > > > > > >
> > > > > >
> > > > > > The above usage make me thought BTF don't require the symbol, though
> > > > > > the symbol still exist in vmlinux with 'static'.
> > > > > >
> > > > > > So any comments of this, Alexei?
> > >
> > > Sorry. I've completely missed this thread.
> > > Now I have a hard time finding context in archives.
> > > If I understood what's going on, the removal of "static" causes issues?
> > > Yes. That's expected.
> > > noinline alone is not enough to work reliably.
> >
> > Not removal, commit 3351b16af494 ("mm/filemap: add static for function
> > __add_to_page_cache_locked") made the function static which breaks the
> > build in btfids phase - but it seems to happen only on some
> > architectures. In our case, ppc64, ppc64le and riscv64 are broken,
> > x86_64, i586 and s390x succeed. (I made a mistake above, aarch64 did not
> > fail - but only because it was not built at all.)
> >
> > The thread starts with
> > http://lkml.kernel.org/r/1604661895-5495-1-git-send-email-alex@linux.alibaba.com
>
> Got it. So the above commit is wrong.
> The addition of "static" is incorrect here.
> Regardless of btf_id generation.
> "static noinline" means that the error injection in that spot is unreliable.
> Even when bpf is completely compiled out of the kernel.

I finally realized that the addition of 'static' was pushed into Linus's tree :(
Please revert commit 3351b16af494 ("mm/filemap: add static for
function __add_to_page_cache_locked")


Re: [PATCH] mm/filemap: add static for function __add_to_page_cache_locked

2020-12-07 Thread Alexei Starovoitov
On Mon, Dec 7, 2020 at 2:53 PM Michal Kubecek  wrote:
>
> On Mon, Dec 07, 2020 at 02:44:22PM -0800, Alexei Starovoitov wrote:
> > On Mon, Dec 7, 2020 at 10:36 AM Justin Forbes  wrote:
> > >
> > > On Mon, Dec 7, 2020 at 2:16 AM Michal Kubecek  wrote:
> > > >
> > > > On Thu, Nov 12, 2020 at 08:18:57AM +0800, Alex Shi wrote:
> > > > >
> > > > >
> > > > > 在 2020/11/11 上午3:50, Andrew Morton 写道:
> > > > > > On Tue, 10 Nov 2020 08:39:24 +0530 Souptick Joarder 
> > > > > >  wrote:
> > > > > >
> > > > > >> On Fri, Nov 6, 2020 at 4:55 PM Alex Shi 
> > > > > >>  wrote:
> > > > > >>>
> > > > > >>> Otherwise it cause gcc warning:
> > > > > >>>   ^~~
> > > > > >>> ../mm/filemap.c:830:14: warning: no previous prototype for
> > > > > >>> ‘__add_to_page_cache_locked’ [-Wmissing-prototypes]
> > > > > >>>  noinline int __add_to_page_cache_locked(struct page *page,
> > > > > >>>   ^~
> > > > > >>
> > > > > >> Is CONFIG_DEBUG_INFO_BTF enabled in your .config ?
> > > > > >
> > > > > > hm, yes.
> > > > >
> > > > > When the config is enabled, compiling looks good until the pahole tool
> > > > > is used to get BTF info, but I still failed to get a correct pahole
> > > > > version (> 1.16). Sorry.
> > > > >
> > > > > >
> > > > > >>>
> > > > > >>> Signed-off-by: Alex Shi 
> > > > > >>> Cc: Andrew Morton 
> > > > > >>> Cc: linux...@kvack.org
> > > > > >>> Cc: linux-kernel@vger.kernel.org
> > > > > >>> ---
> > > > > >>>  mm/filemap.c | 2 +-
> > > > > >>>  1 file changed, 1 insertion(+), 1 deletion(-)
> > > > > >>>
> > > > > >>> diff --git a/mm/filemap.c b/mm/filemap.c
> > > > > >>> index d90614f501da..249cf489f5df 100644
> > > > > >>> --- a/mm/filemap.c
> > > > > >>> +++ b/mm/filemap.c
> > > > > >>> @@ -827,7 +827,7 @@ int replace_page_cache_page(struct page *old, 
> > > > > >>> struct page *new, gfp_t gfp_mask)
> > > > > >>>  }
> > > > > >>>  EXPORT_SYMBOL_GPL(replace_page_cache_page);
> > > > > >>>
> > > > > >>> -noinline int __add_to_page_cache_locked(struct page *page,
> > > > > >>> +static noinline int __add_to_page_cache_locked(struct page *page,
> > > > > >>> struct address_space 
> > > > > >>> *mapping,
> > > > > >>> pgoff_t offset, gfp_t gfp,
> > > > > >>> void **shadowp)
> > > > > >
> > > > > > It's unclear to me whether BTF_ID() requires that the target symbol 
> > > > > > be
> > > > > > non-static.  It doesn't actually reference the symbol:
> > > > > >
> > > > > > #define BTF_ID(prefix, name) \
> > > > > > __BTF_ID(__ID(__BTF_ID__##prefix##__##name##__))
> > > > > >
> > > > >
> > > > > The above usage made me think BTF doesn't require the symbol, though
> > > > > the symbol still exists in vmlinux with 'static'.
> > > > >
> > > > > So any comments of this, Alexei?
> >
> > Sorry. I've completely missed this thread.
> > Now I have a hard time finding context in archives.
> > If I understood what's going on, the removal of "static" causes issues?
> > Yes. That's expected.
> > noinline alone is not enough to work reliably.
>
> Not removal, commit 3351b16af494 ("mm/filemap: add static for function
> __add_to_page_cache_locked") made the function static which breaks the
> build in btfids phase - but it seems to happen only on some
> architectures. In our case, ppc64, ppc64le and riscv64 are broken,
> x86_64, i586 and s390x succeed. (I made a mistake above, aarch64 did not
> fail - but only because it was not built at all.)
>
> The thread starts with
> http://lkml.kernel.org/r/1604661895-5495-1-git-send-email-alex@linux.alibaba.com

Got it. So the above commit is wrong.
The addition of "static" is incorrect here.
Regardless of btf_id generation.
"static noinline" means that the error injection in that spot is unreliable.
Even when bpf is completely compiled out of the kernel.


Re: [PATCH] mm/filemap: add static for function __add_to_page_cache_locked

2020-12-07 Thread Alexei Starovoitov
On Mon, Dec 7, 2020 at 10:36 AM Justin Forbes  wrote:
>
> On Mon, Dec 7, 2020 at 2:16 AM Michal Kubecek  wrote:
> >
> > On Thu, Nov 12, 2020 at 08:18:57AM +0800, Alex Shi wrote:
> > >
> > >
> > > 在 2020/11/11 上午3:50, Andrew Morton 写道:
> > > > On Tue, 10 Nov 2020 08:39:24 +0530 Souptick Joarder 
> > > >  wrote:
> > > >
> > > >> On Fri, Nov 6, 2020 at 4:55 PM Alex Shi  
> > > >> wrote:
> > > >>>
> > > >>> Otherwise it cause gcc warning:
> > > >>>   ^~~
> > > >>> ../mm/filemap.c:830:14: warning: no previous prototype for
> > > >>> ‘__add_to_page_cache_locked’ [-Wmissing-prototypes]
> > > >>>  noinline int __add_to_page_cache_locked(struct page *page,
> > > >>>   ^~
> > > >>
> > > >> Is CONFIG_DEBUG_INFO_BTF enabled in your .config ?
> > > >
> > > > hm, yes.
> > >
> > > When the config is enabled, compiling looks good until the pahole tool
> > > is used to get BTF info, but I still failed to get a correct pahole
> > > version (> 1.16). Sorry.
> > >
> > > >
> > > >>>
> > > >>> Signed-off-by: Alex Shi 
> > > >>> Cc: Andrew Morton 
> > > >>> Cc: linux...@kvack.org
> > > >>> Cc: linux-kernel@vger.kernel.org
> > > >>> ---
> > > >>>  mm/filemap.c | 2 +-
> > > >>>  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >>>
> > > >>> diff --git a/mm/filemap.c b/mm/filemap.c
> > > >>> index d90614f501da..249cf489f5df 100644
> > > >>> --- a/mm/filemap.c
> > > >>> +++ b/mm/filemap.c
> > > >>> @@ -827,7 +827,7 @@ int replace_page_cache_page(struct page *old, 
> > > >>> struct page *new, gfp_t gfp_mask)
> > > >>>  }
> > > >>>  EXPORT_SYMBOL_GPL(replace_page_cache_page);
> > > >>>
> > > >>> -noinline int __add_to_page_cache_locked(struct page *page,
> > > >>> +static noinline int __add_to_page_cache_locked(struct page *page,
> > > >>> struct address_space *mapping,
> > > >>> pgoff_t offset, gfp_t gfp,
> > > >>> void **shadowp)
> > > >
> > > > It's unclear to me whether BTF_ID() requires that the target symbol be
> > > > non-static.  It doesn't actually reference the symbol:
> > > >
> > > > #define BTF_ID(prefix, name) \
> > > > __BTF_ID(__ID(__BTF_ID__##prefix##__##name##__))
> > > >
> > >
> > > The above usage made me think BTF doesn't require the symbol, though
> > > the symbol still exists in vmlinux with 'static'.
> > >
> > > So any comments of this, Alexei?

Sorry. I've completely missed this thread.
Now I have a hard time finding context in archives.
If I understood what's going on, the removal of "static" causes issues?
Yes. That's expected.
noinline alone is not enough to work reliably.


Re: [PATCH bpf-next] tools/resolve_btfids: Fix some error messages

2020-12-03 Thread Alexei Starovoitov
On Thu, Dec 03, 2020 at 10:22:34AM +, Brendan Jackman wrote:
> Add missing newlines and fix polarity of strerror argument.
> 
> Signed-off-by: Brendan Jackman 

Applied, Thanks


Re: [PATCH bpf-next v9 00/34] bpf: switch to memcg-based memory accounting

2020-12-02 Thread Alexei Starovoitov
On Tue, Dec 1, 2020 at 1:59 PM Roman Gushchin  wrote:
>
> 5) Cryptic -EPERM is returned on exceeding the limit. Libbpf even had
>a function to "explain" this case for users.
...
> v9:
>   - always charge the saved memory cgroup, by Daniel, Toke and Alexei
>   - added bpf_map_kzalloc()
>   - rebase and minor fixes

This looks great. Applied.
Please follow up with a change to libbpf's pr_perm_msg().
That helpful warning should stay for old kernels, but it would be
misleading for new kernels.
libbpf probably needs a feature check to make this warning conditional.

Thanks!
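
One way such a feature check could look (a userspace sketch, not libbpf's actual
implementation, under the assumption that a kernel with memcg-based accounting no
longer enforces RLIMIT_MEMLOCK for map creation): drop the memlock rlimit to zero
and try to create a tiny map.

#define _GNU_SOURCE
#include <string.h>
#include <unistd.h>
#include <sys/resource.h>
#include <sys/syscall.h>
#include <linux/bpf.h>

/* Returns 1 if map creation ignores RLIMIT_MEMLOCK (memcg-based accounting),
 * 0 otherwise. Assumes the caller has the privileges needed to create maps. */
static int probe_memcg_accounting(void)
{
	struct rlimit rl = { 0, 0 };
	union bpf_attr attr;
	int fd;

	setrlimit(RLIMIT_MEMLOCK, &rl);

	memset(&attr, 0, sizeof(attr));
	attr.map_type = BPF_MAP_TYPE_ARRAY;
	attr.key_size = 4;
	attr.value_size = 4;
	attr.max_entries = 1;

	fd = syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
	if (fd < 0)
		return 0;	/* likely -EPERM: rlimit-based kernel */
	close(fd);
	return 1;
}

A real probe would run this in a forked child so the lowered rlimit does not affect
the rest of the process.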


Re: [PATCH bpf-next 1/2] bpf: Add a bpf_kallsyms_lookup helper

2020-12-02 Thread Alexei Starovoitov
On Wed, Dec 2, 2020 at 12:32 PM Florent Revest  wrote:
>
> On Tue, 2020-12-01 at 16:55 -0800, Andrii Nakryiko wrote:
> > On Fri, Nov 27, 2020 at 8:09 AM Yonghong Song  wrote:
> > >
> > >
> > > On 11/27/20 3:20 AM, KP Singh wrote:
> > > > On Fri, Nov 27, 2020 at 8:35 AM Yonghong Song  wrote:
> > > > >
> > > > > In this case, the module name may be truncated and the user does not
> > > > > get any indication from the return value. In the helper description, it
> > > > > is mentioned that the module name is currently at most 64 bytes. But
> > > > > from a UAPI perspective, it may still be good to return something
> > > > > to let the user know the name is truncated.
> > > > >
> > > > > I do not know what is the best way to do this. One suggestion
> > > > > is to break it into two helpers, one for symbol name and
> > > > > another
> > > >
> > > > I think it would be slightly preferable to have one helper
> > > > though. maybe something like bpf_get_symbol_info (better names
> > > > anyone? :)) with flags to get the module name or the symbol name
> > > > depending
> > > > on the flag?
> > >
> > > This works even better. Previously I was thinking that if we have two
> > > helpers, we could add flags for each of them for future extension. But we
> > > can certainly have just one helper with flags to indicate
> > > whether this is for the module name or the symbol name or something
> > > else.
> > >
> > > The buffer can be something like
> > > union bpf_ksymbol_info {
> > >char   module_name[];
> > >char   symbol_name[];
> > >...
> > > }
> > > and flags will indicate what information user wants.
> >
> > one more thing that might be useful is to resolve to the symbol's "base
> > address". E.g., if we have an IP inside the function, this would resolve
> > to the start of the function, sort of "canonical" symbol address.
> > Type of ksym is another "characteristic" which could be returned (as
> > a single char?)
> >
> > I wouldn't define bpf_ksymbol_info, though. Just depending on the
> > flag, specify what kind of memory layout (e.g., for strings -
> > zero-terminated string, for address - 8 byte numbers, etc). That way
> > we can also allow fetching multiple things together, they would just
> > be laid out one after another in memory.
> >
> > E.g.:
> >
> > char buf[256];
> > int err = bpf_ksym_resolve(, BPF_KSYM_NAME | BPF_KSYM_MODNAME |
> > BPF_KSYM_BASE_ADDR, buf, sizeof(buf));
> >
> > if (err == -E2BIG)
> >   /* need bigger buffer, but all the data up to truncation point is
> > filled in */
> > else
> >   /* err has exact number of bytes used, including zero terminator(s)
> > */
> >   /* data is laid out as
> > "cpufreq_gov_powersave_init\0cpufreq_powersave\0\x12\x23\x45\x56\x12\
> > x23\x45\x56"
> > */
>
> Great idea! I like that, thanks for the suggestion :)

I still think that adopting printk/vsnprintf for this, instead of
reinventing the wheel, is more flexible and easier to maintain long term.
Almost the same layout can be produced with vsnprintf,
with the exception of the \0 separators.
More meaningful names, etc.
See Documentation/core-api/printk-formats.rst.
If we force the fmt string to come from a read-only map, then the
bpf_trace_printk()-like run-time check of the fmt string can be moved
to load time and performance won't suffer.
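
As an illustration of that direction (a sketch, not the eventual helper API): inside
the kernel, vsnprintf() already renders an address as "symbol+offset [module]" via
%pS, so a printk-style BPF helper with a read-only format string could reuse that
instead of a dedicated kallsyms lookup. The hook pointer and example output below
are assumptions based on the nf_register_net_hook use case mentioned in the thread.

	char buf[KSYM_SYMBOL_LEN];
	unsigned long addr = (unsigned long)ops->hook;	/* assumed: the audited hook address */

	snprintf(buf, sizeof(buf), "%pS", (void *)addr);
	/* buf now holds e.g. "ipv4_conntrack_in+0x0/0x1b0 [nf_conntrack]"
	 * (format illustration only, not a real capture) */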


Re: [PATCH v2 bpf-next 01/13] bpf: x86: Factor out emission of ModR/M for *(reg + off)

2020-12-02 Thread Alexei Starovoitov
On Wed, Dec 2, 2020 at 2:52 AM Brendan Jackman  wrote:
>
> Tue, Dec 01, 2020 at 09:50:00PM -0800, Alexei Starovoitov wrote:
> > On Tue, Dec 1, 2020 at 4:14 AM Brendan Jackman  wrote:
> > >
> > > On Sat, Nov 28, 2020 at 05:15:52PM -0800, Alexei Starovoitov wrote:
> > > > On Fri, Nov 27, 2020 at 05:57:26PM +, Brendan Jackman wrote:
> > > > > +/* Emit the ModR/M byte for addressing *(r1 + off) and r2 */
> > > > > +static void emit_modrm_dstoff(u8 **pprog, u32 r1, u32 r2, int off)
> > > >
> > > > same concern as in the other patch. If you could avoid intel's puzzling
> > > > names like the above, it will make reviewing the patch easier.
> > >
> > > In this case there is actually a call like
> > >
> > >   emit_modrm_dstoff(, src_reg, dst_reg)
> >
> > emit_insn_prefix() ?
>
> Ah sorry, I thought you were talking about the _arg_ names.

I meant both. Arg names and helper name. Sorry for the confusion.


Re: [PATCH v2 bpf-next 10/13] bpf: Add instructions for atomic[64]_[fetch_]sub

2020-12-01 Thread Alexei Starovoitov
On Tue, Dec 1, 2020 at 4:38 AM Brendan Jackman  wrote:
>
> I guess it's also worth remembering other archs might have an atomic
> subtract.

Which one?
arm64 LSE implements atomic_fetch_sub as neg+ldadd.
imo the x64 and arm64 examples outweigh the choices of other archs, if there are such.
Even without LSE it will be a neg+llsc loop.
The reason I proposed a bpf xsub insn earlier is that I thought llvm
wouldn't be able to emit it so easily and the JIT/verifier would struggle.


Re: [PATCH v2 bpf-next 02/13] bpf: x86: Factor out emission of REX byte

2020-12-01 Thread Alexei Starovoitov
On Tue, Dec 1, 2020 at 4:12 AM Brendan Jackman  wrote:
>
> On Sat, Nov 28, 2020 at 05:14:05PM -0800, Alexei Starovoitov wrote:
> > On Fri, Nov 27, 2020 at 05:57:27PM +, Brendan Jackman wrote:
> > > The JIT case for encoding atomic ops is about to get more
> > > complicated. In order to make the review & resulting code easier,
> > > let's factor out some shared helpers.
> > >
> > > Signed-off-by: Brendan Jackman 
> > > ---
> > >  arch/x86/net/bpf_jit_comp.c | 39 ++---
> > >  1 file changed, 23 insertions(+), 16 deletions(-)
> > >
> > > diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> > > index 94b17bd30e00..a839c1a54276 100644
> > > --- a/arch/x86/net/bpf_jit_comp.c
> > > +++ b/arch/x86/net/bpf_jit_comp.c
> > > @@ -702,6 +702,21 @@ static void emit_modrm_dstoff(u8 **pprog, u32 r1, 
> > > u32 r2, int off)
> > > *pprog = prog;
> > >  }
> > >
> > > +/*
> > > + * Emit a REX byte if it will be necessary to address these registers
> >
> > What is "REX byte" ?
> > May be rename it to maybe_emit_mod() ?
>
> Er, this is the REX prefix as described in
> https://wiki.osdev.org/X86-64_Instruction_Encoding#REX_prefix
>
> Would maybe_emit_mod be accurate? In my mind "mod" is a field in the
> ModR/M byte which comes _after_ the opcode. Before developing this
> patchset I knew almost nothing about x86, so maybe I'm missing something
> about the general terminology?

I wrote the JIT without looking into the manual and without studying
the terminology.
Why? Because it was not necessary. I still don't see a reason why
that obscure terminology needs to be brought into the code.
'mod' to me is a 'modifier'. Nothing to do with intel's modrm thing.


Re: [PATCH v2 bpf-next 01/13] bpf: x86: Factor out emission of ModR/M for *(reg + off)

2020-12-01 Thread Alexei Starovoitov
On Tue, Dec 1, 2020 at 4:14 AM Brendan Jackman  wrote:
>
> On Sat, Nov 28, 2020 at 05:15:52PM -0800, Alexei Starovoitov wrote:
> > On Fri, Nov 27, 2020 at 05:57:26PM +, Brendan Jackman wrote:
> > > +/* Emit the ModR/M byte for addressing *(r1 + off) and r2 */
> > > +static void emit_modrm_dstoff(u8 **pprog, u32 r1, u32 r2, int off)
> >
> > same concern as in the other patch. If you could avoid intel's puzzling
> > names like the above, it will make reviewing the patch easier.
>
> In this case there is actually a call like
>
>   emit_modrm_dstoff(, src_reg, dst_reg)

emit_insn_prefix() ?


Re: [PATCH bpf-next 1/2] bpf: Add a bpf_kallsyms_lookup helper

2020-11-30 Thread Alexei Starovoitov
On Mon, Nov 30, 2020 at 05:23:22PM +0100, Florent Revest wrote:
> On Sat, 2020-11-28 at 17:07 -0800, Alexei Starovoitov wrote:
> > On Thu, Nov 26, 2020 at 05:57:47PM +0100, Florent Revest wrote:
> > > This helper exposes the kallsyms_lookup function to eBPF tracing
> > > programs. This can be used to retrieve the name of the symbol at an
> > > address. For example, when hooking into nf_register_net_hook, one
> > > can
> > > audit the name of the registered netfilter hook and potentially
> > > also
> > > the name of the module in which the symbol is located.
> > > 
> > > Signed-off-by: Florent Revest 
> > > ---
> > >  include/uapi/linux/bpf.h   | 16 +
> > >  kernel/trace/bpf_trace.c   | 41
> > > ++
> > >  tools/include/uapi/linux/bpf.h | 16 +
> > >  3 files changed, 73 insertions(+)
> > > 
> > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > index c3458ec1f30a..670998635eac 100644
> > > --- a/include/uapi/linux/bpf.h
> > > +++ b/include/uapi/linux/bpf.h
> > > @@ -3817,6 +3817,21 @@ union bpf_attr {
> > >   *   The **hash_algo** is returned on success,
> > >   *   **-EOPNOTSUP** if IMA is disabled or **-EINVAL** if
> > >   *   invalid arguments are passed.
> > > + *
> > > + * long bpf_kallsyms_lookup(u64 address, char *symbol, u32
> > > symbol_size, char *module, u32 module_size)
> > > + *   Description
> > > + *   Uses kallsyms to write the name of the symbol at
> > > *address*
> > > + *   into *symbol* of size *symbol_sz*. This is guaranteed
> > > to be
> > > + *   zero terminated.
> > > + *   If the symbol is in a module, up to *module_size* bytes
> > > of
> > > + *   the module name is written in *module*. This is also
> > > + *   guaranteed to be zero-terminated. Note: a module name
> > > + *   is always shorter than 64 bytes.
> > > + *   Return
> > > + *   On success, the strictly positive length of the full
> > > symbol
> > > + *   name, If this is greater than *symbol_size*, the
> > > written
> > > + *   symbol is truncated.
> > > + *   On error, a negative value.
> > 
> > Looks like a debug-only helper.
> > I cannot think of a way to use it in production code.
> > What is a program supposed to do with that string?
> > Do string compare? BPF side doesn't have a good way to do string
> > manipulations.
> > If you really need to print a symbolic name for a given address
> > I'd rather extend bpf_trace_printk() to support %pS
> 
> We actually use this helper for auditing, not debugging.
> We don't want to parse /proc/kallsyms from userspace because we have no
> guarantee that the module will still be loaded by the time the event
> reaches userspace (this is also faster in kernelspace).

so what are you going to do with that string?
print it? send to user space via ring buffer?
Where are you getting that $pc ?


Re: [PATCH v2 bpf-next 00/13] Atomics for eBPF

2020-11-28 Thread Alexei Starovoitov
On Fri, Nov 27, 2020 at 09:53:05PM -0800, Yonghong Song wrote:
> 
> 
> On 11/27/20 9:57 AM, Brendan Jackman wrote:
> > Status of the patches
> > =
> > 
> > Thanks for the reviews! Differences from v1->v2 [1]:
> > 
> > * Fixed mistakes in the netronome driver
> > 
> > * Added sub, add, or, xor operations
> > 
> > * The above led to some refactors to keep things readable. (Maybe I
> >should have just waited until I'd implemented these before starting
> >the review...)
> > 
> > * Replaced BPF_[CMP]SET | BPF_FETCH with just BPF_[CMP]XCHG, which
> >include the BPF_FETCH flag
> > 
> > * Added a bit of documentation. Suggestions welcome for more places
> >to dump this info...
> > 
> > The prog_test that's added depends on Clang/LLVM features added by
> > Yonghong in https://reviews.llvm.org/D72184
> > 
> > This only includes a JIT implementation for x86_64 - I don't plan to
> > implement JIT support myself for other architectures.
> > 
> > Operations
> > ==
> > 
> > This patchset adds atomic operations to the eBPF instruction set. The
> > use-case that motivated this work was a trivial and efficient way to
> > generate globally-unique cookies in BPF progs, but I think it's
> > obvious that these features are pretty widely applicable.  The
> > instructions that are added here can be summarised with this list of
> > kernel operations:
> > 
> > * atomic[64]_[fetch_]add
> > * atomic[64]_[fetch_]sub
> > * atomic[64]_[fetch_]and
> > * atomic[64]_[fetch_]or
> 
> * atomic[64]_[fetch_]xor
> 
> > * atomic[64]_xchg
> > * atomic[64]_cmpxchg
> 
> Thanks. Overall looks good to me but I did not check carefully
> on jit part as I am not an expert in x64 assembly...
> 
> This patch also introduced atomic[64]_{sub,and,or,xor}, similar to
> xadd. I am not sure whether it is necessary. For one thing,
> users can just use atomic[64]_fetch_{sub,and,or,xor} to ignore
> return value and they will achieve the same result, right?
> From llvm side, there is no ready-to-use gcc builtin matching
> atomic[64]_{sub,and,or,xor} which does not have return values.
> If we go this route, we will need to invent additional bpf
> specific builtins.

I think bpf-specific builtins are overkill.
As you said, users can use atomic_fetch_xor() and ignore the
return value. I think the llvm backend should be smart enough to use
the BPF_ATOMIC | BPF_XOR insn without the BPF_FETCH bit in such a case.
But if it's too cumbersome to do at the moment, we can skip this
optimization for now.
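
A small BPF C sketch of that point: the same __sync builtin, with and without the
result being used. The intent described above is that the backend pick the non-fetch
encoding when the value is ignored (instruction names per this series; the exact
codegen is an assumption here).

	__u64 flags = 0;
	__u64 old;

	(void)__sync_fetch_and_xor(&flags, 0x1); /* result unused -> BPF_ATOMIC | BPF_XOR */
	old = __sync_fetch_and_xor(&flags, 0x2); /* result used -> BPF_XOR | BPF_FETCH */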


Re: [PATCH v2 bpf-next 11/13] bpf: Add bitwise atomic instructions

2020-11-28 Thread Alexei Starovoitov
On Fri, Nov 27, 2020 at 09:39:10PM -0800, Yonghong Song wrote:
> 
> 
> On 11/27/20 9:57 AM, Brendan Jackman wrote:
> > This adds instructions for
> > 
> > atomic[64]_[fetch_]and
> > atomic[64]_[fetch_]or
> > atomic[64]_[fetch_]xor
> > 
> > All these operations are isomorphic enough to implement with the same
> > verifier, interpreter, and x86 JIT code, hence being a single commit.
> > 
> > The main interesting thing here is that x86 doesn't directly support
> > the fetch_ version these operations, so we need to generate a CMPXCHG
> > loop in the JIT. This requires the use of two temporary registers,
> > IIUC it's safe to use BPF_REG_AX and x86's AUX_REG for this purpose.
> 
> similar to previous xsub (atomic[64]_sub), should we implement
> xand, xor, xxor in llvm?

Yes, please. Unlike atomic_fetch_sub, which can be handled by llvm,
atomic_fetch_or/xor/and have to be seen as separate instructions
because JITs will translate them as a loop.
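
For reference, a C rendition (a sketch, not the emitted x86) of the loop a JIT has to
generate for a fetch-type bitwise atomic on an ISA without a native fetch-or: retry
with compare-and-exchange until the update succeeds.

static long fetch_or(long *p, long v)
{
	long old, new;

	do {
		old = *(volatile long *)p;
		new = old | v;
	} while (!__sync_bool_compare_and_swap(p, old, new));

	return old;	/* the previous value, as the BPF_FETCH semantics require */
}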


Re: [PATCH v2 bpf-next 10/13] bpf: Add instructions for atomic[64]_[fetch_]sub

2020-11-28 Thread Alexei Starovoitov
On Fri, Nov 27, 2020 at 09:35:07PM -0800, Yonghong Song wrote:
> 
> 
> On 11/27/20 9:57 AM, Brendan Jackman wrote:
> > Including only interpreter and x86 JIT support.
> > 
> > x86 doesn't provide an atomic exchange-and-subtract instruction that
> > could be used for BPF_SUB | BPF_FETCH, however we can just emit a NEG
> > followed by an XADD to get the same effect.
> > 
> > Signed-off-by: Brendan Jackman 
> > ---
> >   arch/x86/net/bpf_jit_comp.c  | 16 ++--
> >   include/linux/filter.h   | 20 
> >   kernel/bpf/core.c|  1 +
> >   kernel/bpf/disasm.c  | 16 
> >   kernel/bpf/verifier.c|  2 ++
> >   tools/include/linux/filter.h | 20 
> >   6 files changed, 69 insertions(+), 6 deletions(-)
> > 
> > diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> > index 7431b2937157..a8a9fab13fcf 100644
> > --- a/arch/x86/net/bpf_jit_comp.c
> > +++ b/arch/x86/net/bpf_jit_comp.c
> > @@ -823,6 +823,7 @@ static int emit_atomic(u8 **pprog, u8 atomic_op,
> > /* emit opcode */
> > switch (atomic_op) {
> > +   case BPF_SUB:
> > case BPF_ADD:
> > /* lock *(u32/u64*)(dst_reg + off) = src_reg */
> > EMIT1(simple_alu_opcodes[atomic_op]);
> > @@ -1306,8 +1307,19 @@ st:  if (is_imm8(insn->off))
> > case BPF_STX | BPF_ATOMIC | BPF_W:
> > case BPF_STX | BPF_ATOMIC | BPF_DW:
> > -   err = emit_atomic(, insn->imm, dst_reg, src_reg,
> > - insn->off, BPF_SIZE(insn->code));
> > +   if (insn->imm == (BPF_SUB | BPF_FETCH)) {
> > +   /*
> > +* x86 doesn't have an XSUB insn, so we negate
> > +* and XADD instead.
> > +*/
> > +   emit_neg(, src_reg, BPF_SIZE(insn->code) 
> > == BPF_DW);
> > +   err = emit_atomic(, BPF_ADD | BPF_FETCH,
> > + dst_reg, src_reg, insn->off,
> > + BPF_SIZE(insn->code));
> > +   } else {
> > +   err = emit_atomic(, insn->imm, dst_reg, 
> > src_reg,
> > + insn->off, 
> > BPF_SIZE(insn->code));
> > +   }
> > if (err)
> > return err;
> > break;
> > diff --git a/include/linux/filter.h b/include/linux/filter.h
> > index 6186280715ed..a20a3a536bf5 100644
> > --- a/include/linux/filter.h
> > +++ b/include/linux/filter.h
> > @@ -280,6 +280,26 @@ static inline bool insn_is_zext(const struct bpf_insn 
> > *insn)
> > .off   = OFF,   \
> > .imm   = BPF_ADD | BPF_FETCH })
> > +/* Atomic memory sub, *(uint *)(dst_reg + off16) -= src_reg */
> > +
> > +#define BPF_ATOMIC_SUB(SIZE, DST, SRC, OFF)\
> > +   ((struct bpf_insn) {\
> > +   .code  = BPF_STX | BPF_SIZE(SIZE) | BPF_ATOMIC, \
> > +   .dst_reg = DST, \
> > +   .src_reg = SRC, \
> > +   .off   = OFF,   \
> > +   .imm   = BPF_SUB })
> 
> Currently, llvm does not support XSUB, should we support it in llvm?
> At the source code level, as implemented in the JIT, the user can just do a negate
> followed by xadd.

I forgot we have the BPF_NEG insn :)
Indeed, it's probably easier to handle the atomic_fetch_sub() builtin
completely on the llvm side. It can generate bpf_neg followed by atomic_fetch_add.
No need to burden the verifier, interpreter and JITs with it.
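
A sketch of the llvm-side lowering being agreed on here: since a fetch-and-sub is just
a fetch-and-add of the negated operand, the builtin can be expressed without a dedicated
XSUB instruction (the helper name below is illustrative).

static long fetch_sub(long *p, long v)
{
	/* neg + atomic_fetch_add, equivalent to an atomic fetch-and-sub */
	return __sync_fetch_and_add(p, -v);
}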


Re: [PATCH v2 bpf-next 08/13] bpf: Add instructions for atomic_[cmp]xchg

2020-11-28 Thread Alexei Starovoitov
On Fri, Nov 27, 2020 at 05:57:33PM +, Brendan Jackman wrote:
>  
>  /* atomic op type fields (stored in immediate) */
> -#define BPF_FETCH0x01/* fetch previous value into src reg */
> +#define BPF_XCHG (0xe0 | BPF_FETCH)  /* atomic exchange */
> +#define BPF_CMPXCHG  (0xf0 | BPF_FETCH)  /* atomic compare-and-write */
> +#define BPF_FETCH0x01/* fetch previous value into src reg or r0*/

I think such comment is more confusing than helpful.
I'd just say that the fetch bit is not valid on its own.
It's used to build other instructions like cmpxchg and atomic_fetch_add.

> + } else if (BPF_MODE(insn->code) == BPF_ATOMIC &&
> +insn->imm == (BPF_CMPXCHG)) {

redundant ().

> + verbose(cbs->private_data, "(%02x) r0 = 
> atomic%s_cmpxchg(*(%s *)(r%d %+d), r0, r%d)\n",
> + insn->code,
> + BPF_SIZE(insn->code) == BPF_DW ? "64" : "",
> + bpf_ldst_string[BPF_SIZE(insn->code) >> 3],
> + insn->dst_reg, insn->off,
> + insn->src_reg);
> + } else if (BPF_MODE(insn->code) == BPF_ATOMIC &&
> +insn->imm == (BPF_XCHG)) {

redundant ().
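
A usage sketch for the encodings quoted above, from BPF C source; the builtin-to-
instruction mapping shown in the comments is how this series intends clang to lower
them, and is an assumption here rather than a quote from the patch.

	__u64 v = 1;
	__u64 old;

	old = __sync_lock_test_and_set(&v, 2);        /* BPF_XCHG: unconditional swap */
	old = __sync_val_compare_and_swap(&v, 2, 3);  /* BPF_CMPXCHG: swap iff v == 2, old value lands in R0 */
	old = __sync_fetch_and_add(&v, 4);            /* BPF_ADD | BPF_FETCH: add and return the old value */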


Re: [PATCH v2 bpf-next 01/13] bpf: x86: Factor out emission of ModR/M for *(reg + off)

2020-11-28 Thread Alexei Starovoitov
On Fri, Nov 27, 2020 at 05:57:26PM +, Brendan Jackman wrote:
> +/* Emit the ModR/M byte for addressing *(r1 + off) and r2 */
> +static void emit_modrm_dstoff(u8 **pprog, u32 r1, u32 r2, int off)

same concern as in the other patch. If you could avoid intel's puzzling names
like the above, it will make reviewing the patch easier.


Re: [PATCH v2 bpf-next 02/13] bpf: x86: Factor out emission of REX byte

2020-11-28 Thread Alexei Starovoitov
On Fri, Nov 27, 2020 at 05:57:27PM +, Brendan Jackman wrote:
> The JIT case for encoding atomic ops is about to get more
> complicated. In order to make the review & resulting code easier,
> let's factor out some shared helpers.
> 
> Signed-off-by: Brendan Jackman 
> ---
>  arch/x86/net/bpf_jit_comp.c | 39 ++---
>  1 file changed, 23 insertions(+), 16 deletions(-)
> 
> diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
> index 94b17bd30e00..a839c1a54276 100644
> --- a/arch/x86/net/bpf_jit_comp.c
> +++ b/arch/x86/net/bpf_jit_comp.c
> @@ -702,6 +702,21 @@ static void emit_modrm_dstoff(u8 **pprog, u32 r1, u32 
> r2, int off)
>   *pprog = prog;
>  }
>  
> +/*
> + * Emit a REX byte if it will be necessary to address these registers

What is "REX byte" ?
May be rename it to maybe_emit_mod() ?

> + */
> +static void maybe_emit_rex(u8 **pprog, u32 reg_rm, u32 reg_reg, bool wide)

could you please keep original names as dst_reg/src_reg instead of 
reg_rm/reg_reg ?
reg_reg reads really odd and reg_rm is equally puzzling unless the reader 
studied
intel's manual. I didn't. All these new abbreviations are challenging for me.
> +{
> + u8 *prog = *pprog;
> + int cnt = 0;
> +
> + if (wide)

what is 'wide' ? Why not to call it 'bool is_alu64' ?

> + EMIT1(add_2mod(0x48, reg_rm, reg_reg));
> + else if (is_ereg(reg_rm) || is_ereg(reg_reg))
> + EMIT1(add_2mod(0x40, reg_rm, reg_reg));
> + *pprog = prog;
> +}

