Re: [PATCHv4 bpf-next 0/7] uprobe: uretprobe speed up

2024-05-02 Thread Andrii Nakryiko
On Thu, May 2, 2024 at 5:23 AM Jiri Olsa  wrote:
>
> hi,
> as part of the effort on speeding up the uprobes [0], this patchset comes
> with a return uprobe optimization that uses a syscall instead of the trap
> on the uretprobe trampoline.
>
> The speed up depends on the instruction type that the uprobe is installed
> on and on the specific HW type; please check patch 1 for details.
>
> Patches 1-6 are based on bpf-next/master, but patches 1 and 2 are
> apply-able on linux-trace.git tree probes/for-next branch.
> Patch 7 is based on man-pages master.
>
> v4 changes:
>   - added acks [Oleg,Andrii,Masami]
>   - reworded the man page and adding more info to NOTE section [Masami]
>   - rewrote bpf tests not to use trace_pipe [Andrii]
>   - cc-ed linux-man list
>
> Also available at:
>   https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git
>   uretprobe_syscall
>

It looks great to me, thanks! Unfortunately, the BPF CI build is broken
([0]), probably due to some of the Makefile additions. Please investigate
and fix it (or we'll need to fix something on the BPF CI side), but either
way it looks like you'll need another revision, unfortunately.

pw-bot: cr

  [0] 
https://github.com/kernel-patches/bpf/actions/runs/8923849088/job/24509002194



But while we are at it.

Masami, Oleg,

What should be the logistics of landing this? Can/should we route this
through the bpf-next tree, given there are lots of BPF-based
selftests? Or do you want to take this through
linux-trace/probes/for-next? In the latter case, it's probably better
to apply only the first two patches to probes/for-next and the rest
should still go through the bpf-next tree (otherwise we are running
into conflicts in BPF selftests). Previously we were handling such
cross-tree dependencies by creating a named branch or tag, and merging
it into bpf-next (so that all SHAs are preserved). It's a bunch of
extra work for everyone involved, so the simplest way would be to just
land through bpf-next, of course. But let me know your preferences.

Thanks!

> thanks,
> jirka
>
>
> Notes to check list items in Documentation/process/adding-syscalls.rst:
>
> - System Call Alternatives
>   New syscall seems like the best way in here, becase we need

typo (thanks, Gmail): because

>   just to quickly enter kernel with no extra arguments processing,
>   which we'd need to do if we decided to use another syscall.
>
> - Designing the API: Planning for Extension
>   The uretprobe syscall is very specific and most likely won't be
>   extended in the future.
>
>   At the moment it does not take any arguments and even if it does
>   in future, it's allowed to be called only from trampoline prepared
>   by kernel, so there'll be no broken user.
>
> - Designing the API: Other Considerations
>   N/A because uretprobe syscall does not return reference to kernel
>   object.
>
> - Proposing the API
>   Wiring up of the uretprobe system call si in separate change,

typo: is

>   selftests and man page changes are part of the patchset.
>
> - Generic System Call Implementation
>   There's no CONFIG option for the new functionality because it
>   keeps the same behaviour from the user POV.
>
> - x86 System Call Implementation
>   It's 64-bit syscall only.
>
> - Compatibility System Calls (Generic)
>   N/A uretprobe syscall has no arguments and is not supported
>   for compat processes.
>
> - Compatibility System Calls (x86)
>   N/A uretprobe syscall is not supported for compat processes.
>
> - System Calls Returning Elsewhere
>   N/A.
>
> - Other Details
>   N/A.
>
> - Testing
>   Adding new bpf selftests and ran ltp on top of this change.
>
> - Man Page
>   Attached.
>
> - Do not call System Calls in the Kernel
>   N/A.
>
>
> [0] https://lore.kernel.org/bpf/ZeCXHKJ--iYYbmLj@krava/
> ---
> Jiri Olsa (6):
>   uprobe: Wire up uretprobe system call
>   uprobe: Add uretprobe syscall to speed up return probe
>   selftests/bpf: Add uretprobe syscall test for regs integrity
>   selftests/bpf: Add uretprobe syscall test for regs changes
>   selftests/bpf: Add uretprobe syscall call from user space test
>   selftests/bpf: Add uretprobe compat test
>
>  arch/x86/entry/syscalls/syscall_64.tbl  |   1 +
>  arch/x86/kernel/uprobes.c   | 115 
> 
>  include/linux/syscalls.h|   2 +
>  include/linux/uprobes.h |   3 +
>  include/uapi/asm-generic/unistd.h   |   5 +-
>  kernel/events/uprobes.c |  24 --
>  kernel/sys_ni.c |   2 +
>  tools/include/linux/compiler.h  |   4 +
>  tools/testing/selftests/bpf/.gitignore  |   1 +
>  tools/testing/selftests/bpf/Makefile|   7 +-
>  tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c   | 123 
> -
>  

Re: [PATCHv4 bpf-next 6/7] selftests/bpf: Add uretprobe compat test

2024-05-02 Thread Andrii Nakryiko
On Thu, May 2, 2024 at 5:24 AM Jiri Olsa  wrote:
>
> Adding test that adds return uprobe inside 32-bit task
> and verify the return uprobe and attached bpf programs
> get properly executed.
>
> Reviewed-by: Masami Hiramatsu (Google) 
> Signed-off-by: Jiri Olsa 
> ---
>  tools/testing/selftests/bpf/.gitignore|  1 +
>  tools/testing/selftests/bpf/Makefile  |  7 ++-
>  .../selftests/bpf/prog_tests/uprobe_syscall.c | 60 +++
>  3 files changed, 67 insertions(+), 1 deletion(-)
>
> diff --git a/tools/testing/selftests/bpf/.gitignore 
> b/tools/testing/selftests/bpf/.gitignore
> index f1aebabfb017..69d71223c0dd 100644
> --- a/tools/testing/selftests/bpf/.gitignore
> +++ b/tools/testing/selftests/bpf/.gitignore
> @@ -45,6 +45,7 @@ test_cpp
>  /veristat
>  /sign-file
>  /uprobe_multi
> +/uprobe_compat
>  *.ko
>  *.tmp
>  xskxceiver
> diff --git a/tools/testing/selftests/bpf/Makefile 
> b/tools/testing/selftests/bpf/Makefile
> index 82247aeef857..a94352162290 100644
> --- a/tools/testing/selftests/bpf/Makefile
> +++ b/tools/testing/selftests/bpf/Makefile
> @@ -133,7 +133,7 @@ TEST_GEN_PROGS_EXTENDED = test_sock_addr 
> test_skb_cgroup_id_user \
> xskxceiver xdp_redirect_multi xdp_synproxy veristat xdp_hw_metadata \
> xdp_features bpf_test_no_cfi.ko
>
> -TEST_GEN_FILES += liburandom_read.so urandom_read sign-file uprobe_multi
> +TEST_GEN_FILES += liburandom_read.so urandom_read sign-file uprobe_multi 
> uprobe_compat
>
>  ifneq ($(V),1)
>  submake_extras := feature_display=0
> @@ -631,6 +631,7 @@ TRUNNER_EXTRA_FILES := $(OUTPUT)/urandom_read 
> $(OUTPUT)/bpf_testmod.ko  \
>$(OUTPUT)/xdp_synproxy   \
>$(OUTPUT)/sign-file  \
>$(OUTPUT)/uprobe_multi   \
> +  $(OUTPUT)/uprobe_compat  \
>ima_setup.sh \
>verify_sig_setup.sh  \
>$(wildcard progs/btf_dump_test_case_*.c) \
> @@ -752,6 +753,10 @@ $(OUTPUT)/uprobe_multi: uprobe_multi.c
> $(call msg,BINARY,,$@)
> $(Q)$(CC) $(CFLAGS) -O0 $(LDFLAGS) $^ $(LDLIBS) -o $@
>
> +$(OUTPUT)/uprobe_compat:
> +   $(call msg,BINARY,,$@)
> +   $(Q)echo "int main() { return 0; }" | $(CC) $(CFLAGS) -xc -m32 -O0 - 
> -o $@
> +
>  EXTRA_CLEAN := $(SCRATCH_DIR) $(HOST_SCRATCH_DIR)  \
> prog_tests/tests.h map_tests/tests.h verifier/tests.h   \
> feature bpftool \
> diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c 
> b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> index c6fdb8c59ea3..bfea9a0368a4 100644
> --- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> +++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> @@ -5,6 +5,7 @@
>  #ifdef __x86_64__
>
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -297,6 +298,58 @@ static void test_uretprobe_syscall_call(void)
> close(go[1]);
> close(go[0]);
>  }
> +
> +static void test_uretprobe_compat(void)
> +{
> +   LIBBPF_OPTS(bpf_uprobe_multi_opts, opts,
> +   .retprobe = true,
> +   );
> +   struct uprobe_syscall_executed *skel;
> +   int err, go[2], pid, c, status;
> +
> +   if (pipe(go))
> +   return;

ASSERT_OK() missing, like in the previous patch

Thanks for switching to pipe() + global variable instead of using trace_pipe.

Acked-by: Andrii Nakryiko 

> +
> +   skel = uprobe_syscall_executed__open_and_load();
> +   if (!ASSERT_OK_PTR(skel, "uprobe_syscall_executed__open_and_load"))
> +   goto cleanup;
> +

[...]



Re: [PATCHv4 bpf-next 5/7] selftests/bpf: Add uretprobe syscall call from user space test

2024-05-02 Thread Andrii Nakryiko
On Thu, May 2, 2024 at 5:24 AM Jiri Olsa  wrote:
>
> Adding test to verify that when called from outside of the
> trampoline provided by kernel, the uretprobe syscall will cause
> calling process to receive SIGILL signal and the attached bpf
> program is not executed.
>
> Reviewed-by: Masami Hiramatsu (Google) 
> Signed-off-by: Jiri Olsa 
> ---
>  .../selftests/bpf/prog_tests/uprobe_syscall.c | 95 +++
>  .../bpf/progs/uprobe_syscall_executed.c   | 17 
>  2 files changed, 112 insertions(+)
>  create mode 100644 
> tools/testing/selftests/bpf/progs/uprobe_syscall_executed.c
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c 
> b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> index 1a50cd35205d..c6fdb8c59ea3 100644
> --- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> +++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> @@ -7,7 +7,10 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  #include "uprobe_syscall.skel.h"
> +#include "uprobe_syscall_executed.skel.h"
>
>  __naked unsigned long uretprobe_regs_trigger(void)
>  {
> @@ -209,6 +212,91 @@ static void test_uretprobe_regs_change(void)
> }
>  }
>
> +#ifndef __NR_uretprobe
> +#define __NR_uretprobe 462
> +#endif
> +
> +__naked unsigned long uretprobe_syscall_call_1(void)
> +{
> +   /*
> +* Pretend we are uretprobe trampoline to trigger the return
> +* probe invocation in order to verify we get SIGILL.
> +*/
> +   asm volatile (
> +   "pushq %rax\n"
> +   "pushq %rcx\n"
> +   "pushq %r11\n"
> +   "movq $" __stringify(__NR_uretprobe) ", %rax\n"
> +   "syscall\n"
> +   "popq %r11\n"
> +   "popq %rcx\n"
> +   "retq\n"
> +   );
> +}
> +
> +__naked unsigned long uretprobe_syscall_call(void)
> +{
> +   asm volatile (
> +   "call uretprobe_syscall_call_1\n"
> +   "retq\n"
> +   );
> +}
> +
> +static void test_uretprobe_syscall_call(void)
> +{
> +   LIBBPF_OPTS(bpf_uprobe_multi_opts, opts,
> +   .retprobe = true,
> +   );
> +   struct uprobe_syscall_executed *skel;
> +   int pid, status, err, go[2], c;
> +
> +   if (pipe(go))
> +   return;

very unlikely to fail, but still, ASSERT_OK() would be in order here
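
I.e., something like this (just a sketch):

	if (!ASSERT_OK(pipe(go), "pipe(go)"))
		return;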

But regardless:

Acked-by: Andrii Nakryiko 

[...]



Re: [PATCH v9 00/36] tracing: fprobe: function_graph: Multi-function graph and fprobe on fgraph

2024-04-30 Thread Andrii Nakryiko
On Tue, Apr 30, 2024 at 6:32 AM Masami Hiramatsu  wrote:
>
> On Mon, 29 Apr 2024 13:25:04 -0700
> Andrii Nakryiko  wrote:
>
> > On Mon, Apr 29, 2024 at 6:51 AM Masami Hiramatsu  
> > wrote:
> > >
> > > Hi Andrii,
> > >
> > > On Thu, 25 Apr 2024 13:31:53 -0700
> > > Andrii Nakryiko  wrote:
> > >
> > > > Hey Masami,
> > > >
> > > > I can't really review most of that code as I'm completely unfamiliar
> > > > with all those inner workings of fprobe/ftrace/function_graph. I left
> > > > a few comments where there were somewhat more obvious BPF-related
> > > > pieces.
> > > >
> > > > But I also did run our BPF benchmarks on probes/for-next as a baseline
> > > > and then with your series applied on top. Just to see if there are any
> > > > regressions. I think it will be a useful data point for you.
> > >
> > > Thanks for testing!
> > >
> > > >
> > > > You should be already familiar with the bench tool we have in BPF
> > > > selftests (I used it on some other patches for your tree).
> > >
> > > What patches we need?
> > >
> >
> > You mean for this `bench` tool? They are part of BPF selftests (under
> > tools/testing/selftests/bpf), you can build them by running:
> >
> > $ make RELEASE=1 -j$(nproc) bench
> >
> > After that you'll get a self-contained `bench` binary, which has all
> > the self-contained benchmarks.
> >
> > You might also find a small script (benchs/run_bench_trigger.sh inside
> > BPF selftests directory) helpful, it collects final summary of the
> > benchmark run and optionally accepts a specific set of benchmarks. So
> > you can use it like this:
> >
> > $ benchs/run_bench_trigger.sh kprobe kprobe-multi
> > kprobe :   18.731 ± 0.639M/s
> > kprobe-multi   :   23.938 ± 0.612M/s
> >
> > By default it will run a wider set of benchmarks (no uprobes, but a
> > bunch of extra fentry/fexit tests and stuff like this).
>
> origin:
> # benchs/run_bench_trigger.sh
> kretprobe :1.329 ± 0.007M/s
> kretprobe-multi:1.341 ± 0.004M/s
> # benchs/run_bench_trigger.sh
> kretprobe :1.288 ± 0.014M/s
> kretprobe-multi:1.365 ± 0.002M/s
> # benchs/run_bench_trigger.sh
> kretprobe :1.329 ± 0.002M/s
> kretprobe-multi:1.331 ± 0.011M/s
> # benchs/run_bench_trigger.sh
> kretprobe :1.311 ± 0.003M/s
> kretprobe-multi:1.318 ± 0.002M/s
>
> patched:
>
> # benchs/run_bench_trigger.sh
> kretprobe :1.274 ± 0.003M/s
> kretprobe-multi:1.397 ± 0.002M/s
> # benchs/run_bench_trigger.sh
> kretprobe :1.307 ± 0.002M/s
> kretprobe-multi:1.406 ± 0.004M/s
> # benchs/run_bench_trigger.sh
> kretprobe :1.279 ± 0.004M/s
> kretprobe-multi:1.330 ± 0.014M/s
> # benchs/run_bench_trigger.sh
> kretprobe :1.256 ± 0.010M/s
> kretprobe-multi:1.412 ± 0.003M/s
>
> Hmm, in my case, it seems smaller differences (~3%?).
> I attached perf report results for those, but I don't see large difference.

I ran my benchmarks on bare metal machine (and quite powerful at that,
you can see my numbers are almost 10x of yours), with mitigations
disabled, no retpolines, etc. If you have any of those mitigations it
might result in smaller differences, probably. If you are running
inside QEMU/VM, the results might differ significantly as well.

>
> > > >
> > > > BASELINE
> > > > 
> > > > kprobe :   24.634 ± 0.205M/s
> > > > kprobe-multi   :   28.898 ± 0.531M/s
> > > > kretprobe  :   10.478 ± 0.015M/s
> > > > kretprobe-multi:   11.012 ± 0.063M/s
> > > >
> > > > THIS PATCH SET ON TOP
> > > > =
> > > > kprobe :   25.144 ± 0.027M/s (+2%)
> > > > kprobe-multi   :   28.909 ± 0.074M/s
> > > > kretprobe  :9.482 ± 0.008M/s (-9.5%)
> > > > kretprobe-multi:   13.688 ± 0.027M/s (+24%)
> > >
> > > This looks good. Kretprobe should also use kretprobe-multi (fprobe)
> > > eventually because it should be a single callback version of
> > > kretprobe-multi.
>
> I ran another benchmark (prctl loop, attached), the origin kernel result is 
> here;
>
> # sh ./benchmark.sh
> count = 1000, took 6.748133 sec
>
> And the patched kernel result;
>
> # sh ./benchmark.sh
> count = 1000, took 6.644095 sec
>
> I confirmed that the perf result has no big difference.
>
> Thank you,
>
>
> > >
> > > >
> >

Re: [PATCH RFC] rethook: inline arch_rethook_trampoline_callback() in assembly code

2024-04-29 Thread Andrii Nakryiko
On Wed, Apr 24, 2024 at 5:02 PM Andrii Nakryiko  wrote:
>
> At the lowest level, rethook-based kretprobes on x86-64 architecture go
> through the arch_rethook_trampoline() function, manually written in
> assembly, which calls into a simple arch_rethook_trampoline_callback()
> function, written in C, and only doing a few straightforward field
> assignments, before calling further into rethook_trampoline_handler(),
> which handles kretprobe callbacks generically.
>
> Looking at simplicity of arch_rethook_trampoline_callback(), it seems
> not really worthwhile to spend an extra function call just to do 4 or
> 5 assignments. As such, this patch proposes to "inline"
> arch_rethook_trampoline_callback() into arch_rethook_trampoline() by
> manually implementing it in assembly code.
>
> This has two motivations. First, we do get a bit of runtime speed up by
> avoiding function calls. Using BPF selftests's bench tool, we see
> 0.6%-0.8% throughput improvement for kretprobe/multi-kretprobe
> triggering code path:
>
> BEFORE (latest probes/for-next)
> ===
> kretprobe  :   10.455 ± 0.024M/s
> kretprobe-multi:   11.150 ± 0.012M/s
>
> AFTER (probes/for-next + this patch)
> 
> kretprobe  :   10.540 ± 0.009M/s (+0.8%)
> kretprobe-multi:   11.219 ± 0.042M/s (+0.6%)
>
> Second, and no less importantly for some specialized use cases, this
> avoids unnecessarily "polluting" LBR records with an extra function call
> (recorded as a jump by CPU). This is the case for the retsnoop ([0])
> tool, which relies heavily on capturing LBR records to provide users with
> lots of insight into kernel internals.
>
> This RFC patch is only inlining this function for x86-64, but it's
> possible to do that for 32-bit x86 arch as well and then remove
> arch_rethook_trampoline_callback() implementation altogether. Please let
> me know if this change is acceptable and whether I should complete it
> with 32-bit "inlining" as well. Thanks!
>
>   [0] 
> https://nakryiko.com/posts/retsnoop-intro/#peering-deep-into-functions-with-lbr
>
> Signed-off-by: Andrii Nakryiko 
> ---
>  arch/x86/kernel/asm-offsets_64.c |  4 
>  arch/x86/kernel/rethook.c| 37 +++-
>  2 files changed, 36 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kernel/asm-offsets_64.c 
> b/arch/x86/kernel/asm-offsets_64.c
> index bb65371ea9df..5c444abc540c 100644
> --- a/arch/x86/kernel/asm-offsets_64.c
> +++ b/arch/x86/kernel/asm-offsets_64.c
> @@ -42,6 +42,10 @@ int main(void)
> ENTRY(r14);
> ENTRY(r15);
> ENTRY(flags);
> +   ENTRY(ip);
> +   ENTRY(cs);
> +   ENTRY(ss);
> +   ENTRY(orig_ax);
> BLANK();
>  #undef ENTRY
>
> diff --git a/arch/x86/kernel/rethook.c b/arch/x86/kernel/rethook.c
> index 8a1c0111ae79..3e1c01beebd1 100644
> --- a/arch/x86/kernel/rethook.c
> +++ b/arch/x86/kernel/rethook.c
> @@ -6,6 +6,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  #include "kprobes/common.h"
>
> @@ -34,10 +35,36 @@ asm(
> "   pushq %rsp\n"
> "   pushfq\n"
> SAVE_REGS_STRING
> -   "   movq %rsp, %rdi\n"
> -   "   call arch_rethook_trampoline_callback\n"
> +   "   movq %rsp, %rdi\n" /* $rdi points to regs */
> +   /* fixup registers */
> +   /* regs->cs = __KERNEL_CS; */
> +   "   movq $" __stringify(__KERNEL_CS) ", " __stringify(pt_regs_cs) 
> "(%rdi)\n"
> +   /* regs->ip = (unsigned long)_rethook_trampoline; */
> +   "   movq $arch_rethook_trampoline, " __stringify(pt_regs_ip) 
> "(%rdi)\n"
> +   /* regs->orig_ax = ~0UL; */
> +   "   movq $0x, " __stringify(pt_regs_orig_ax) 
> "(%rdi)\n"
> +   /* regs->sp += 2*sizeof(long); */
> +   "   addq $16, " __stringify(pt_regs_sp) "(%rdi)\n"
> +   /* 2nd arg is frame_pointer = (long *)(regs + 1); */
> +   "   lea " __stringify(PTREGS_SIZE) "(%rdi), %rsi\n"

BTW, all this __stringify() ugliness can be avoided if we move this
assembly into its own .S file, like lots of other assembly functions
in arch/x86/kernel subdir. That has another benefit of generating
better line information in DWARF for those assembly instructions. It's
lots more work, so before I do this, I'd like to get confirmation that
this change is acceptable in principle.

> +   /*
> +* The return address at 'frame_pointer' is recovered by the
> +* a

Re: [PATCH v9 00/36] tracing: fprobe: function_graph: Multi-function graph and fprobe on fgraph

2024-04-29 Thread Andrii Nakryiko
On Sun, Apr 28, 2024 at 4:25 PM Steven Rostedt  wrote:
>
> On Thu, 25 Apr 2024 13:31:53 -0700
> Andrii Nakryiko  wrote:
>
> I'm just coming back from Japan (work and then a vacation), and
> catching up on my email during the 6 hour layover in Detroit.
>
> > Hey Masami,
> >
> > I can't really review most of that code as I'm completely unfamiliar
> > with all those inner workings of fprobe/ftrace/function_graph. I left
> > a few comments where there were somewhat more obvious BPF-related
> > pieces.
> >
> > But I also did run our BPF benchmarks on probes/for-next as a baseline
> > and then with your series applied on top. Just to see if there are any
> > regressions. I think it will be a useful data point for you.
> >
> > You should be already familiar with the bench tool we have in BPF
> > selftests (I used it on some other patches for your tree).
>
> I should get familiar with your tools too.
>

It's a nifty and self-contained tool for micro-benchmarking; I replied to
Masami with a few details on how to build and use it.

> >
> > BASELINE
> > 
> > kprobe :   24.634 ± 0.205M/s
> > kprobe-multi   :   28.898 ± 0.531M/s
> > kretprobe  :   10.478 ± 0.015M/s
> > kretprobe-multi:   11.012 ± 0.063M/s
> >
> > THIS PATCH SET ON TOP
> > =
> > kprobe :   25.144 ± 0.027M/s (+2%)
> > kprobe-multi   :   28.909 ± 0.074M/s
> > kretprobe  :9.482 ± 0.008M/s (-9.5%)
> > kretprobe-multi:   13.688 ± 0.027M/s (+24%)
> >
> > These numbers are pretty stable and look to be more or less representative.
>
> Thanks for running this.
>
> >
> > As you can see, kprobes got a bit faster, kprobe-multi seems to be
> > about the same, though.
> >
> > Then (I suppose they are "legacy") kretprobes got quite noticeably
> > slower, almost by 10%. Not sure why, but looks real after re-running
> > benchmarks a bunch of times and getting stable results.
> >
> > On the other hand, multi-kretprobes got significantly faster (+24%!).
> > Again, I don't know if it is expected or not, but it's a nice
> > improvement.
> >
> > If you have any idea why kretprobes would get so much slower, it would
> > be nice to look into that and see if you can mitigate the regression
> > somehow. Thanks!
>
> My guess is that this patch set helps generic use cases for tracing the
> return of functions, but will likely add more overhead for single use
> cases. That is, kretprobe is made to be specific for a single function,
> but kretprobe-multi is more generic. Hence the generic version will
> improve at the sacrifice of the specific function. I did expect as much.
>
> That said, I think there's probably a lot of low hanging fruit that can
> be done to this series to help improve the kretprobe performance. I'm
> not sure we can get back to the baseline, but I'm hoping we can at
> least make it much better than that 10% slowdown.

That would certainly be appreciated, thanks!

But I'm also considering trying to switch to multi-kprobe/kretprobe
automatically on libbpf side, whenever possible, so that users can get
the best performance. There might still be situations where this can't
be done, so singular kprobe/kretprobe can't be completely deprecated,
but multi variants seem to be universally faster, so I'm going to
make them the default (I need to handle some backwards-compat aspects,
but that's libbpf-specific stuff you shouldn't be concerned with).

>
> I'll be reviewing this patch set this week as I recover from jetlag.
>
> -- Steve



Re: [PATCH v9 00/36] tracing: fprobe: function_graph: Multi-function graph and fprobe on fgraph

2024-04-29 Thread Andrii Nakryiko
On Mon, Apr 29, 2024 at 6:51 AM Masami Hiramatsu  wrote:
>
> Hi Andrii,
>
> On Thu, 25 Apr 2024 13:31:53 -0700
> Andrii Nakryiko  wrote:
>
> > Hey Masami,
> >
> > I can't really review most of that code as I'm completely unfamiliar
> > with all those inner workings of fprobe/ftrace/function_graph. I left
> > a few comments where there were somewhat more obvious BPF-related
> > pieces.
> >
> > But I also did run our BPF benchmarks on probes/for-next as a baseline
> > and then with your series applied on top. Just to see if there are any
> > regressions. I think it will be a useful data point for you.
>
> Thanks for testing!
>
> >
> > You should be already familiar with the bench tool we have in BPF
> > selftests (I used it on some other patches for your tree).
>
> What patches we need?
>

You mean for this `bench` tool? They are part of BPF selftests (under
tools/testing/selftests/bpf), you can build them by running:

$ make RELEASE=1 -j$(nproc) bench

After that you'll get a self-contained `bench` binary, which has all
the self-contained benchmarks.

You might also find a small script (benchs/run_bench_trigger.sh inside
BPF selftests directory) helpful, it collects final summary of the
benchmark run and optionally accepts a specific set of benchmarks. So
you can use it like this:

$ benchs/run_bench_trigger.sh kprobe kprobe-multi
kprobe :   18.731 ± 0.639M/s
kprobe-multi   :   23.938 ± 0.612M/s

By default it will run a wider set of benchmarks (no uprobes, but a
bunch of extra fentry/fexit tests and stuff like this).

> >
> > BASELINE
> > 
> > kprobe :   24.634 ± 0.205M/s
> > kprobe-multi   :   28.898 ± 0.531M/s
> > kretprobe  :   10.478 ± 0.015M/s
> > kretprobe-multi:   11.012 ± 0.063M/s
> >
> > THIS PATCH SET ON TOP
> > =
> > kprobe :   25.144 ± 0.027M/s (+2%)
> > kprobe-multi   :   28.909 ± 0.074M/s
> > kretprobe  :9.482 ± 0.008M/s (-9.5%)
> > kretprobe-multi:   13.688 ± 0.027M/s (+24%)
>
> This looks good. Kretprobe should also use kretprobe-multi (fprobe)
> eventually because it should be a single callback version of
> kretprobe-multi.
>
> >
> > These numbers are pretty stable and look to be more or less representative.
> >
> > As you can see, kprobes got a bit faster, kprobe-multi seems to be
> > about the same, though.
> >
> > Then (I suppose they are "legacy") kretprobes got quite noticeably
> > slower, almost by 10%. Not sure why, but looks real after re-running
> > benchmarks a bunch of times and getting stable results.
>
> Hmm, kretprobe on x86 should use ftrace + rethook even with my series.
> So nothing should be changed. Maybe cache access pattern has been
> changed?
> I'll check it with tracefs (to remove the effect from bpf related changes)
>
> >
> > On the other hand, multi-kretprobes got significantly faster (+24%!).
> > Again, I don't know if it is expected or not, but it's a nice
> > improvement.
>
> Thanks!
>
> >
> > If you have any idea why kretprobes would get so much slower, it would
> > be nice to look into that and see if you can mitigate the regression
> > somehow. Thanks!
>
> OK, let me check it.
>
> Thank you!
>
> >
> >
> > >  51 files changed, 2325 insertions(+), 882 deletions(-)
> > >  create mode 100644 
> > > tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe_repeat.tc
> > >
> > > --
> > > Masami Hiramatsu (Google) 
> > >
>
>
> --
> Masami Hiramatsu (Google) 



Re: [PATCHv3 bpf-next 6/7] selftests/bpf: Add uretprobe compat test

2024-04-29 Thread Andrii Nakryiko
On Mon, Apr 29, 2024 at 12:39 AM Jiri Olsa  wrote:
>
> On Fri, Apr 26, 2024 at 11:06:53AM -0700, Andrii Nakryiko wrote:
> > On Sun, Apr 21, 2024 at 12:43 PM Jiri Olsa  wrote:
> > >
> > > Adding test that adds return uprobe inside 32 bit task
> > > and verify the return uprobe and attached bpf programs
> > > get properly executed.
> > >
> > > Signed-off-by: Jiri Olsa 
> > > ---
> > >  tools/testing/selftests/bpf/.gitignore|  1 +
> > >  tools/testing/selftests/bpf/Makefile  |  6 ++-
> > >  .../selftests/bpf/prog_tests/uprobe_syscall.c | 40 +++
> > >  .../bpf/progs/uprobe_syscall_compat.c | 13 ++
> > >  4 files changed, 59 insertions(+), 1 deletion(-)
> > >  create mode 100644 
> > > tools/testing/selftests/bpf/progs/uprobe_syscall_compat.c
> > >
> > > diff --git a/tools/testing/selftests/bpf/.gitignore 
> > > b/tools/testing/selftests/bpf/.gitignore
> > > index f1aebabfb017..69d71223c0dd 100644
> > > --- a/tools/testing/selftests/bpf/.gitignore
> > > +++ b/tools/testing/selftests/bpf/.gitignore
> > > @@ -45,6 +45,7 @@ test_cpp
> > >  /veristat
> > >  /sign-file
> > >  /uprobe_multi
> > > +/uprobe_compat
> > >  *.ko
> > >  *.tmp
> > >  xskxceiver
> > > diff --git a/tools/testing/selftests/bpf/Makefile 
> > > b/tools/testing/selftests/bpf/Makefile
> > > index edc73f8f5aef..d170b63eca62 100644
> > > --- a/tools/testing/selftests/bpf/Makefile
> > > +++ b/tools/testing/selftests/bpf/Makefile
> > > @@ -134,7 +134,7 @@ TEST_GEN_PROGS_EXTENDED = test_sock_addr 
> > > test_skb_cgroup_id_user \
> > > xskxceiver xdp_redirect_multi xdp_synproxy veristat 
> > > xdp_hw_metadata \
> > > xdp_features bpf_test_no_cfi.ko
> > >
> > > -TEST_GEN_FILES += liburandom_read.so urandom_read sign-file uprobe_multi
> > > +TEST_GEN_FILES += liburandom_read.so urandom_read sign-file uprobe_multi 
> > > uprobe_compat
> >
> > you need to add uprobe_compat to TRUNNER_EXTRA_FILES as well, no?
>
> ah right
>
> > > diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c 
> > > b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> > > index 9233210a4c33..3770254d893b 100644
> > > --- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> > > +++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> > > @@ -11,6 +11,7 @@
> > >  #include 
> > >  #include "uprobe_syscall.skel.h"
> > >  #include "uprobe_syscall_call.skel.h"
> > > +#include "uprobe_syscall_compat.skel.h"
> > >
> > >  __naked unsigned long uretprobe_regs_trigger(void)
> > >  {
> > > @@ -291,6 +292,35 @@ static void test_uretprobe_syscall_call(void)
> > >  "read_trace_pipe_iter");
> > > ASSERT_EQ(found, 0, "found");
> > >  }
> > > +
> > > +static void trace_pipe_compat_cb(const char *str, void *data)
> > > +{
> > > +   if (strstr(str, "uretprobe compat") != NULL)
> > > +   (*(int *)data)++;
> > > +}
> > > +
> > > +static void test_uretprobe_compat(void)
> > > +{
> > > +   struct uprobe_syscall_compat *skel = NULL;
> > > +   int err, found = 0;
> > > +
> > > +   skel = uprobe_syscall_compat__open_and_load();
> > > +   if (!ASSERT_OK_PTR(skel, "uprobe_syscall_compat__open_and_load"))
> > > +   goto cleanup;
> > > +
> > > +   err = uprobe_syscall_compat__attach(skel);
> > > +   if (!ASSERT_OK(err, "uprobe_syscall_compat__attach"))
> > > +   goto cleanup;
> > > +
> > > +   system("./uprobe_compat");
> > > +
> > > +   ASSERT_OK(read_trace_pipe_iter(trace_pipe_compat_cb, &found,
> > > 1000),
> > > +"read_trace_pipe_iter");
> >
> > why so complicated? can't you just set global variable that it was called
>
> hm, we execute separate uprobe_compat (32bit) process that triggers the bpf
> program, so we can't use global variable.. using the trace_pipe was the only
> thing that was easy to do

you need a child process to trigger the uprobe, but you could have
installed the BPF program from the parent process (you'd need to make the
child wait for the parent to be ready, with a normal pipe(), like we do in
other places)
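
Something along these lines (rough sketch, no error handling; the
"executed" global in the skeleton is made up):

	int go[2], status;
	char c = 0;
	pid_t pid;

	if (!ASSERT_OK(pipe(go), "pipe(go)"))
		return;

	pid = fork();
	if (pid == 0) {
		/* child: block until the parent says BPF progs are attached */
		read(go[0], &c, 1);
		execl("./uprobe_compat", "uprobe_compat", NULL);
		exit(1);
	}

	/* parent: open/load/attach the skeleton here ... */

	/* kick the child so it triggers the uretprobe, then reap it */
	write(go[1], &c, 1);
	waitpid(pid, &status, 0);

	/* and check a global variable instead of trace_pipe, e.g.: */
	ASSERT_EQ(skel->bss->executed, 1, "executed");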

Re: [PATCHv3 bpf-next 5/7] selftests/bpf: Add uretprobe syscall call from user space test

2024-04-29 Thread Andrii Nakryiko
On Mon, Apr 29, 2024 at 12:33 AM Jiri Olsa  wrote:
>
> On Fri, Apr 26, 2024 at 11:03:29AM -0700, Andrii Nakryiko wrote:
> > On Sun, Apr 21, 2024 at 12:43 PM Jiri Olsa  wrote:
> > >
> > > Adding test to verify that when called from outside of the
> > > trampoline provided by kernel, the uretprobe syscall will cause
> > > calling process to receive SIGILL signal and the attached bpf
> > > program is not executed.
> > >
> > > Signed-off-by: Jiri Olsa 
> > > ---
> > >  .../selftests/bpf/prog_tests/uprobe_syscall.c | 92 +++
> > >  .../selftests/bpf/progs/uprobe_syscall_call.c | 15 +++
> > >  2 files changed, 107 insertions(+)
> > >  create mode 100644 
> > > tools/testing/selftests/bpf/progs/uprobe_syscall_call.c
> > >
> >
> > See nits below, but overall LGTM
> >
> > Acked-by: Andrii Nakryiko 
> >
> > [...]
> >
> > > @@ -219,6 +301,11 @@ static void test_uretprobe_regs_change(void)
> > >  {
> > > test__skip();
> > >  }
> > > +
> > > +static void test_uretprobe_syscall_call(void)
> > > +{
> > > +   test__skip();
> > > +}
> > >  #endif
> > >
> > >  void test_uprobe_syscall(void)
> > > @@ -228,3 +315,8 @@ void test_uprobe_syscall(void)
> > > if (test__start_subtest("uretprobe_regs_change"))
> > > test_uretprobe_regs_change();
> > >  }
> > > +
> > > +void serial_test_uprobe_syscall_call(void)
> >
> > does it need to be serial? non-serial are still run sequentially
> > within a process (there is no multi-threading), it's more about some
> > global effects on system.
>
> plz see below
>
> >
> > > +{
> > > +   test_uretprobe_syscall_call();
> > > +}
> > > diff --git a/tools/testing/selftests/bpf/progs/uprobe_syscall_call.c 
> > > b/tools/testing/selftests/bpf/progs/uprobe_syscall_call.c
> > > new file mode 100644
> > > index ..5ea03bb47198
> > > --- /dev/null
> > > +++ b/tools/testing/selftests/bpf/progs/uprobe_syscall_call.c
> > > @@ -0,0 +1,15 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +#include "vmlinux.h"
> > > +#include 
> > > +#include 
> > > +
> > > +struct pt_regs regs;
> > > +
> > > +char _license[] SEC("license") = "GPL";
> > > +
> > > +SEC("uretprobe//proc/self/exe:uretprobe_syscall_call")
> > > +int uretprobe(struct pt_regs *regs)
> > > +{
> > > +   bpf_printk("uretprobe called");
> >
> > debugging leftover? we probably don't want to pollute trace_pipe from test
>
> the reason for this is to make sure the bpf program was not executed,
>
> the test makes sure the child gets killed with SIGILL and also that
> the bpf program was not executed by checking the trace_pipe and
> making sure nothing was received
>
> the trace_pipe reading is also why it's serial

you could have attached BPF program from parent process and use a
global variable (and thus eliminate all the trace_pipe system-wide
dependency), but ok, it's fine by me the way this is done

>
> jirka
>
> >
> > > +   return 0;
> > > +}
> > > --
> > > 2.44.0
> > >



Re: [PATCHv3 bpf-next 6/7] selftests/bpf: Add uretprobe compat test

2024-04-26 Thread Andrii Nakryiko
On Sun, Apr 21, 2024 at 12:43 PM Jiri Olsa  wrote:
>
> Adding test that adds return uprobe inside 32 bit task
> and verify the return uprobe and attached bpf programs
> get properly executed.
>
> Signed-off-by: Jiri Olsa 
> ---
>  tools/testing/selftests/bpf/.gitignore|  1 +
>  tools/testing/selftests/bpf/Makefile  |  6 ++-
>  .../selftests/bpf/prog_tests/uprobe_syscall.c | 40 +++
>  .../bpf/progs/uprobe_syscall_compat.c | 13 ++
>  4 files changed, 59 insertions(+), 1 deletion(-)
>  create mode 100644 tools/testing/selftests/bpf/progs/uprobe_syscall_compat.c
>
> diff --git a/tools/testing/selftests/bpf/.gitignore 
> b/tools/testing/selftests/bpf/.gitignore
> index f1aebabfb017..69d71223c0dd 100644
> --- a/tools/testing/selftests/bpf/.gitignore
> +++ b/tools/testing/selftests/bpf/.gitignore
> @@ -45,6 +45,7 @@ test_cpp
>  /veristat
>  /sign-file
>  /uprobe_multi
> +/uprobe_compat
>  *.ko
>  *.tmp
>  xskxceiver
> diff --git a/tools/testing/selftests/bpf/Makefile 
> b/tools/testing/selftests/bpf/Makefile
> index edc73f8f5aef..d170b63eca62 100644
> --- a/tools/testing/selftests/bpf/Makefile
> +++ b/tools/testing/selftests/bpf/Makefile
> @@ -134,7 +134,7 @@ TEST_GEN_PROGS_EXTENDED = test_sock_addr 
> test_skb_cgroup_id_user \
> xskxceiver xdp_redirect_multi xdp_synproxy veristat xdp_hw_metadata \
> xdp_features bpf_test_no_cfi.ko
>
> -TEST_GEN_FILES += liburandom_read.so urandom_read sign-file uprobe_multi
> +TEST_GEN_FILES += liburandom_read.so urandom_read sign-file uprobe_multi 
> uprobe_compat

you need to add uprobe_compat to TRUNNER_EXTRA_FILES as well, no?

>
>  # Emit succinct information message describing current building step
>  # $1 - generic step name (e.g., CC, LINK, etc);
> @@ -761,6 +761,10 @@ $(OUTPUT)/uprobe_multi: uprobe_multi.c
> $(call msg,BINARY,,$@)
> $(Q)$(CC) $(CFLAGS) -O0 $(LDFLAGS) $^ $(LDLIBS) -o $@
>
> +$(OUTPUT)/uprobe_compat:
> +   $(call msg,BINARY,,$@)
> +   $(Q)echo "int main() { return 0; }" | $(CC) $(CFLAGS) -xc -m32 -O0 - 
> -o $@
> +
>  EXTRA_CLEAN := $(SCRATCH_DIR) $(HOST_SCRATCH_DIR)  \
> prog_tests/tests.h map_tests/tests.h verifier/tests.h   \
> feature bpftool \
> diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c 
> b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> index 9233210a4c33..3770254d893b 100644
> --- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> +++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c
> @@ -11,6 +11,7 @@
>  #include 
>  #include "uprobe_syscall.skel.h"
>  #include "uprobe_syscall_call.skel.h"
> +#include "uprobe_syscall_compat.skel.h"
>
>  __naked unsigned long uretprobe_regs_trigger(void)
>  {
> @@ -291,6 +292,35 @@ static void test_uretprobe_syscall_call(void)
>  "read_trace_pipe_iter");
> ASSERT_EQ(found, 0, "found");
>  }
> +
> +static void trace_pipe_compat_cb(const char *str, void *data)
> +{
> +   if (strstr(str, "uretprobe compat") != NULL)
> +   (*(int *)data)++;
> +}
> +
> +static void test_uretprobe_compat(void)
> +{
> +   struct uprobe_syscall_compat *skel = NULL;
> +   int err, found = 0;
> +
> +   skel = uprobe_syscall_compat__open_and_load();
> +   if (!ASSERT_OK_PTR(skel, "uprobe_syscall_compat__open_and_load"))
> +   goto cleanup;
> +
> +   err = uprobe_syscall_compat__attach(skel);
> +   if (!ASSERT_OK(err, "uprobe_syscall_compat__attach"))
> +   goto cleanup;
> +
> +   system("./uprobe_compat");
> +
> +   ASSERT_OK(read_trace_pipe_iter(trace_pipe_compat_cb, &found, 1000),
> +"read_trace_pipe_iter");

why so complicated? can't you just set global variable that it was called

> +   ASSERT_EQ(found, 1, "found");
> +
> +cleanup:
> +   uprobe_syscall_compat__destroy(skel);
> +}
>  #else
>  static void test_uretprobe_regs_equal(void)
>  {
> @@ -306,6 +336,11 @@ static void test_uretprobe_syscall_call(void)
>  {
> test__skip();
>  }
> +
> +static void test_uretprobe_compat(void)
> +{
> +   test__skip();
> +}
>  #endif
>
>  void test_uprobe_syscall(void)
> @@ -320,3 +355,8 @@ void serial_test_uprobe_syscall_call(void)
>  {
> test_uretprobe_syscall_call();
>  }
> +
> +void serial_test_uprobe_syscall_compat(void)

and then no need for serial_test?

> +{
> +   test_uretprobe_compat();
> +}
> diff --git a/tools/testing/selftests/bpf/progs/uprobe_syscall_compat.c 
> b/tools/testing/selftests/bpf/progs/uprobe_syscall_compat.c
> new file mode 100644
> index ..f8adde7f08e2
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/uprobe_syscall_compat.c
> @@ -0,0 +1,13 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include 
> +#include 
> +#include 
> +
> +char _license[] SEC("license") = "GPL";
> +
> +SEC("uretprobe.multi/./uprobe_compat:main")
> +int 

Re: [PATCHv3 bpf-next 5/7] selftests/bpf: Add uretprobe syscall call from user space test

2024-04-26 Thread Andrii Nakryiko
On Sun, Apr 21, 2024 at 12:43 PM Jiri Olsa  wrote:
>
> Adding test to verify that when called from outside of the
> trampoline provided by kernel, the uretprobe syscall will cause
> calling process to receive SIGILL signal and the attached bpf
> program is not executed.
>
> Signed-off-by: Jiri Olsa 
> ---
>  .../selftests/bpf/prog_tests/uprobe_syscall.c | 92 +++
>  .../selftests/bpf/progs/uprobe_syscall_call.c | 15 +++
>  2 files changed, 107 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/progs/uprobe_syscall_call.c
>

See nits below, but overall LGTM

Acked-by: Andrii Nakryiko 

[...]

> @@ -219,6 +301,11 @@ static void test_uretprobe_regs_change(void)
>  {
> test__skip();
>  }
> +
> +static void test_uretprobe_syscall_call(void)
> +{
> +   test__skip();
> +}
>  #endif
>
>  void test_uprobe_syscall(void)
> @@ -228,3 +315,8 @@ void test_uprobe_syscall(void)
> if (test__start_subtest("uretprobe_regs_change"))
> test_uretprobe_regs_change();
>  }
> +
> +void serial_test_uprobe_syscall_call(void)

does it need to be serial? non-serial are still run sequentially
within a process (there is no multi-threading), it's more about some
global effects on system.

> +{
> +   test_uretprobe_syscall_call();
> +}
> diff --git a/tools/testing/selftests/bpf/progs/uprobe_syscall_call.c 
> b/tools/testing/selftests/bpf/progs/uprobe_syscall_call.c
> new file mode 100644
> index ..5ea03bb47198
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/uprobe_syscall_call.c
> @@ -0,0 +1,15 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include "vmlinux.h"
> +#include 
> +#include 
> +
> +struct pt_regs regs;
> +
> +char _license[] SEC("license") = "GPL";
> +
> +SEC("uretprobe//proc/self/exe:uretprobe_syscall_call")
> +int uretprobe(struct pt_regs *regs)
> +{
> +   bpf_printk("uretprobe called");

debugging leftover? we probably don't want to pollute trace_pipe from test

> +   return 0;
> +}
> --
> 2.44.0
>



Re: [PATCHv3 bpf-next 2/7] uprobe: Add uretprobe syscall to speed up return probe

2024-04-26 Thread Andrii Nakryiko
On Sun, Apr 21, 2024 at 12:42 PM Jiri Olsa  wrote:
>
> Adding uretprobe syscall instead of trap to speed up return probe.
>
> At the moment the uretprobe setup/path is:
>
>   - install entry uprobe
>
>   - when the uprobe is hit, it overwrites probed function's return address
> on stack with address of the trampoline that contains breakpoint
> instruction
>
>   - the breakpoint trap code handles the uretprobe consumers execution and
> jumps back to original return address
>
> This patch replaces the above trampoline's breakpoint instruction with new
> uretprobe syscall call. This syscall does exactly the same job as the trap
> with some more extra work:
>
>   - syscall trampoline must save original value for rax/r11/rcx registers
> on stack - rax is set to syscall number and r11/rcx are changed and
> used by syscall instruction
>
>   - the syscall code reads the original values of those registers and
> restore those values in task's pt_regs area
>
>   - only caller from trampoline exposed in '[uprobes]' is allowed,
> the process will receive SIGILL signal otherwise
>
> Even with some extra work, using the uretprobes syscall shows speed
> improvement (compared to using standard breakpoint):
>
>   On Intel (11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz)
>
>   current:
> uretprobe-nop  :1.498 ± 0.000M/s
> uretprobe-push :1.448 ± 0.001M/s
> uretprobe-ret  :0.816 ± 0.001M/s
>
>   with the fix:
> uretprobe-nop  :1.969 ± 0.002M/s  < 31% speed up
> uretprobe-push :1.910 ± 0.000M/s  < 31% speed up
> uretprobe-ret  :0.934 ± 0.000M/s  < 14% speed up
>
>   On Amd (AMD Ryzen 7 5700U)
>
>   current:
> uretprobe-nop  :0.778 ± 0.001M/s
> uretprobe-push :0.744 ± 0.001M/s
> uretprobe-ret  :0.540 ± 0.001M/s
>
>   with the fix:
> uretprobe-nop  :0.860 ± 0.001M/s  < 10% speed up
> uretprobe-push :0.818 ± 0.001M/s  < 10% speed up
> uretprobe-ret  :0.578 ± 0.000M/s  <  7% speed up
>
> The performance test spawns a thread that runs loop which triggers
> uprobe with attached bpf program that increments the counter that
> gets printed in results above.
>
> The uprobe (and uretprobe) kind is determined by which instruction
> is being patched with breakpoint instruction. That's also important
> for uretprobes, because uprobe is installed for each uretprobe.
>
> The performance test is part of bpf selftests:
>   tools/testing/selftests/bpf/run_bench_uprobes.sh
>
> Note at the moment uretprobe syscall is supported only for native
> 64-bit process, compat process still uses standard breakpoint.
>
> Suggested-by: Andrii Nakryiko 
> Signed-off-by: Oleg Nesterov 
> Signed-off-by: Jiri Olsa 
> ---
>  arch/x86/kernel/uprobes.c | 115 ++
>  include/linux/uprobes.h   |   3 +
>  kernel/events/uprobes.c   |  24 +---
>  3 files changed, 135 insertions(+), 7 deletions(-)
>

LGTM as far as I can follow the code

Acked-by: Andrii Nakryiko 

[...]



Re: [PATCHv3 bpf-next 1/7] uprobe: Wire up uretprobe system call

2024-04-26 Thread Andrii Nakryiko
On Sun, Apr 21, 2024 at 12:42 PM Jiri Olsa  wrote:
>
> Wiring up uretprobe system call, which comes in following changes.
> We need to do the wiring before, because the uretprobe implementation
> needs the syscall number.
>
> Note at the moment uretprobe syscall is supported only for native
> 64-bit process.
>
> Signed-off-by: Jiri Olsa 
> ---
>  arch/x86/entry/syscalls/syscall_64.tbl | 1 +
>  include/linux/syscalls.h   | 2 ++
>  include/uapi/asm-generic/unistd.h  | 5 -
>  kernel/sys_ni.c| 2 ++
>  4 files changed, 9 insertions(+), 1 deletion(-)
>

LGTM

Acked-by: Andrii Nakryiko 

> diff --git a/arch/x86/entry/syscalls/syscall_64.tbl 
> b/arch/x86/entry/syscalls/syscall_64.tbl
> index 7e8d46f4147f..af0a33ab06ee 100644
> --- a/arch/x86/entry/syscalls/syscall_64.tbl
> +++ b/arch/x86/entry/syscalls/syscall_64.tbl
> @@ -383,6 +383,7 @@
>  459common  lsm_get_self_attr   sys_lsm_get_self_attr
>  460common  lsm_set_self_attr   sys_lsm_set_self_attr
>  461common  lsm_list_modulessys_lsm_list_modules
> +46264  uretprobe   sys_uretprobe
>
>  #
>  # Due to a historical design error, certain syscalls are numbered differently
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index e619ac10cd23..5318e0e76799 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -972,6 +972,8 @@ asmlinkage long sys_lsm_list_modules(u64 *ids, u32 *size, 
> u32 flags);
>  /* x86 */
>  asmlinkage long sys_ioperm(unsigned long from, unsigned long num, int on);
>
> +asmlinkage long sys_uretprobe(void);
> +
>  /* pciconfig: alpha, arm, arm64, ia64, sparc */
>  asmlinkage long sys_pciconfig_read(unsigned long bus, unsigned long dfn,
> unsigned long off, unsigned long len,
> diff --git a/include/uapi/asm-generic/unistd.h 
> b/include/uapi/asm-generic/unistd.h
> index 75f00965ab15..8a747cd1d735 100644
> --- a/include/uapi/asm-generic/unistd.h
> +++ b/include/uapi/asm-generic/unistd.h
> @@ -842,8 +842,11 @@ __SYSCALL(__NR_lsm_set_self_attr, sys_lsm_set_self_attr)
>  #define __NR_lsm_list_modules 461
>  __SYSCALL(__NR_lsm_list_modules, sys_lsm_list_modules)
>
> +#define __NR_uretprobe 462
> +__SYSCALL(__NR_uretprobe, sys_uretprobe)
> +
>  #undef __NR_syscalls
> -#define __NR_syscalls 462
> +#define __NR_syscalls 463
>
>  /*
>   * 32 bit systems traditionally used different
> diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
> index faad00cce269..be6195e0d078 100644
> --- a/kernel/sys_ni.c
> +++ b/kernel/sys_ni.c
> @@ -391,3 +391,5 @@ COND_SYSCALL(setuid16);
>
>  /* restartable sequence */
>  COND_SYSCALL(rseq);
> +
> +COND_SYSCALL(uretprobe);
> --
> 2.44.0
>



Re: [PATCH 0/2] Objpool performance improvements

2024-04-26 Thread Andrii Nakryiko
On Fri, Apr 26, 2024 at 7:25 AM Masami Hiramatsu  wrote:
>
> Hi Andrii,
>
> On Wed, 24 Apr 2024 14:52:12 -0700
> Andrii Nakryiko  wrote:
>
> > Improve objpool (used heavily in kretprobe hot path) performance with two
> > improvements:
> >   - inlining performance critical objpool_push()/objpool_pop() operations;
> >   - avoiding re-calculating relatively expensive nr_possible_cpus().
>
> Thanks for optimizing objpool. Both looks good to me.

Great, thanks for applying.

>
> BTW, I don't intend to stop this short-term optimization attempts,
> but I would like to ask you check the new fgraph based fprobe
> (kretprobe-multi)[1] instead of objpool/rethook.

You can see that I did :) There is tons of code and I'm not familiar
with internals of function_graph infra, but you can see I did run
benchmarks, so I'm paying attention.

But as for the objpool itself, I think it's a performant and useful
internal building block, and we might use it outside of rethook as
well, so I think making it as fast as possible is good regardless.

>
> [1] 
> https://lore.kernel.org/all/171318533841.254850.15841395205784342850.stgit@devnote2/
>
> I'm considering to obsolete the kretprobe (and rethook) with fprobe
> and eventually remove it. Those have similar feature and we should
> choose safer one.
>

Yep, I had a few more semi-ready patches, but I'll hold off for now
given this move to function graph, plus some of the changes that Jiri
is making in multi-kprobe code. I'll wait for things to settle down a
bit before looking again.

But just to give you some context, I'm the author of the retsnoop tool, and
one of its killer features is LBR capture in kretprobes, which is a
tremendous help in investigating kernel failures, especially in
unfamiliar code (LBR allows you to "look back" and figure out "how did we
get to this condition" after the fact). And so it's important to
minimize the number of wasted LBR records between the moment some kernel
function returns an error (and thus becomes "an interesting event") and
the moment the BPF program that captures LBR is triggered. A big part of
that is the ftrace/fprobe/rethook infra, so I was looking into making
that part as "minimal" as possible, in the sense of eliminating as many
function calls and jumps as possible. This has an added benefit of making
this hot path faster, but my main motivation is LBR.

Anyways, just a bit of context for some of the other patches (like
inlining arch_rethook_trampoline_callback).

> Thank you,
>
> >
> > These opportunities were found when benchmarking and profiling kprobes and
> > kretprobes with BPF-based benchmarks. See individual patches for details and
> > results.
> >
> > Andrii Nakryiko (2):
> >   objpool: enable inlining objpool_push() and objpool_pop() operations
> >   objpool: cache nr_possible_cpus() and avoid caching nr_cpu_ids
> >
> >  include/linux/objpool.h | 105 +++--
> >  lib/objpool.c   | 112 +++-
> >  2 files changed, 107 insertions(+), 110 deletions(-)
> >
> > --
> > 2.43.0
> >
>
>
> --
> Masami Hiramatsu (Google) 



Re: [PATCH v9 00/36] tracing: fprobe: function_graph: Multi-function graph and fprobe on fgraph

2024-04-25 Thread Andrii Nakryiko
On Mon, Apr 15, 2024 at 5:49 AM Masami Hiramatsu (Google)
 wrote:
>
> Hi,
>
> Here is the 9th version of the series to re-implement the fprobe on
> function-graph tracer. The previous version is;
>
> https://lore.kernel.org/all/170887410337.564249.6360118840946697039.stgit@devnote2/
>
> This version is ported on the latest kernel (v6.9-rc3 + probes/for-next)
> and fixed some bugs + performance optimization patch[36/36].
>  - [12/36] Fix to clear fgraph_array entry in registration failure, also
>return -ENOSPC when fgraph_array is full.
>  - [28/36] Add new store_fprobe_entry_data() for fprobe.
>  - [31/36] Remove DIV_ROUND_UP() and fix entry data address calculation.
>  - [36/36] Add new flag to skip timestamp recording.
>
> Overview
> 
> This series does major 2 changes, enable multiple function-graphs on
> the ftrace (e.g. allow function-graph on sub instances) and rewrite the
> fprobe on this function-graph.
>
> The former changes had been sent from Steven Rostedt 4 years ago (*),
> which allows users to set different setting function-graph tracer (and
> other tracers based on function-graph) in each trace-instances at the
> same time.
>
> (*) https://lore.kernel.org/all/20190525031633.811342...@goodmis.org/
>
> The purpose of latter change are;
>
>  1) Remove dependency of the rethook from fprobe so that we can reduce
>the return hook code and shadow stack.
>
>  2) Make 'ftrace_regs' the common trace interface for the function
>boundary.
>
> 1) Currently we have 2(or 3) different function return hook codes,
>  the function-graph tracer and rethook (and legacy kretprobe).
>  But since this  is redundant and needs double maintenance cost,
>  I would like to unify those. From the user's viewpoint, function-
>  graph tracer is very useful to grasp the execution path. For this
>  purpose, it is hard to use the rethook in the function-graph
>  tracer, but the opposite is possible. (Strictly speaking, kretprobe
>  can not use it because it requires 'pt_regs' for historical reasons.)
>
> 2) Now the fprobe provides the 'pt_regs' for its handler, but that is
>  wrong for the function entry and exit. Moreover, depending on the
>  architecture, there is no way to accurately reproduce 'pt_regs'
>  outside of interrupt or exception handlers. This means fprobe should
>  not use 'pt_regs' because it does not use such exceptions.
>  (Conversely, kprobe should use 'pt_regs' because it is an abstract
>   interface of the software breakpoint exception.)
>
> This series changes fprobe to use function-graph tracer for tracing
> function entry and exit, instead of mixture of ftrace and rethook.
> Unlike the rethook which is a per-task list of system-wide allocated
> nodes, the function graph's ret_stack is a per-task shadow stack.
> Thus it does not need to set 'nr_maxactive' (which is the number of
> pre-allocated nodes).
> Also the handlers will get the 'ftrace_regs' instead of 'pt_regs'.
> Since eBPF mulit_kprobe/multi_kretprobe events still use 'pt_regs' as
> their register interface, this changes it to convert 'ftrace_regs' to
> 'pt_regs'. Of course this conversion makes an incomplete 'pt_regs',
> so users must access only registers for function parameters or
> return value.
>
> Design
> --
> Instead of using ftrace's function entry hook directly, the new fprobe
> is built on top of the function-graph's entry and return callbacks
> with 'ftrace_regs'.
>
> Since the fprobe requires access to 'ftrace_regs', the architecture
> must support CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS and
> CONFIG_HAVE_FTRACE_GRAPH_FUNC, which enables to call function-graph
> entry callback with 'ftrace_regs', and also
> CONFIG_HAVE_FUNCTION_GRAPH_FREGS, which passes the ftrace_regs to
> return_to_handler.
>
> All fprobes share a single function-graph ops (means shares a common
> ftrace filter) similar to the kprobe-on-ftrace. This needs another
> layer to find corresponding fprobe in the common function-graph
> callbacks, but has much better scalability, since the number of
> registered function-graph ops is limited.
>
> In the entry callback, the fprobe runs its entry_handler and saves the
> address of 'fprobe' on the function-graph's shadow stack as data. The
> return callback decodes the data to get the 'fprobe' address, and runs
> the exit_handler.
>
> The fprobe introduces two hash-tables, one is for entry callback which
> searches fprobes related to the given function address passed by entry
> callback. The other is for a return callback which checks if the given
> 'fprobe' data structure pointer is still valid. Note that it is
> possible to unregister fprobe before the return callback runs. Thus
> the address validation must be done before using it in the return
> callback.
>
> This series can be applied against the probes/for-next branch, which
> is based on v6.9-rc3.
>
> This series can also be found below branch.
>
> 

Re: [PATCH v9 36/36] fgraph: Skip recording calltime/rettime if it is not needed

2024-04-25 Thread Andrii Nakryiko
On Mon, Apr 15, 2024 at 6:25 AM Masami Hiramatsu (Google)
 wrote:
>
> From: Masami Hiramatsu (Google) 
>
> Skip recording calltime and rettime if the fgraph_ops does not need it.
> This is a kind of performance optimization for fprobe. Since the fprobe
> user does not use these entries, recording timestamp in fgraph is just
> a overhead (e.g. eBPF, ftrace). So introduce the skip_timestamp flag,
> and all fgraph_ops sets this flag, skip recording calltime and rettime.
>
> Suggested-by: Jiri Olsa 
> Signed-off-by: Masami Hiramatsu (Google) 
> ---
>  Changes in v9:
>   - Newly added.
> ---
>  include/linux/ftrace.h |2 ++
>  kernel/trace/fgraph.c  |   46 +++---
>  kernel/trace/fprobe.c  |1 +
>  3 files changed, 42 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h
> index d845a80a3d56..06fc7cbef897 100644
> --- a/include/linux/ftrace.h
> +++ b/include/linux/ftrace.h
> @@ -1156,6 +1156,8 @@ struct fgraph_ops {
> struct ftrace_ops   ops; /* for the hash lists */
> void*private;
> int idx;
> +   /* If skip_timestamp is true, this does not record timestamps. */
> +   boolskip_timestamp;
>  };
>
>  void *fgraph_reserve_data(int idx, int size_bytes);
> diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
> index 7556fbbae323..a5722537bb79 100644
> --- a/kernel/trace/fgraph.c
> +++ b/kernel/trace/fgraph.c
> @@ -131,6 +131,7 @@ DEFINE_STATIC_KEY_FALSE(kill_ftrace_graph);
>  int ftrace_graph_active;
>
>  static struct fgraph_ops *fgraph_array[FGRAPH_ARRAY_SIZE];
> +static bool fgraph_skip_timestamp;
>
>  /* LRU index table for fgraph_array */
>  static int fgraph_lru_table[FGRAPH_ARRAY_SIZE];
> @@ -475,7 +476,7 @@ void ftrace_graph_stop(void)
>  static int
>  ftrace_push_return_trace(unsigned long ret, unsigned long func,
>  unsigned long frame_pointer, unsigned long *retp,
> -int fgraph_idx)
> +int fgraph_idx, bool skip_ts)
>  {
> struct ftrace_ret_stack *ret_stack;
> unsigned long long calltime;
> @@ -498,8 +499,12 @@ ftrace_push_return_trace(unsigned long ret, unsigned 
> long func,
> ret_stack = get_ret_stack(current, current->curr_ret_stack, &index);
> if (ret_stack && ret_stack->func == func &&
> get_fgraph_type(current, index + FGRAPH_RET_INDEX) == 
> FGRAPH_TYPE_BITMAP &&
> -   !is_fgraph_index_set(current, index + FGRAPH_RET_INDEX, 
> fgraph_idx))
> +   !is_fgraph_index_set(current, index + FGRAPH_RET_INDEX, 
> fgraph_idx)) {
> +   /* If previous one skips calltime, update it. */
> +   if (!skip_ts && !ret_stack->calltime)
> +   ret_stack->calltime = trace_clock_local();
> return index + FGRAPH_RET_INDEX;
> +   }
>
> val = (FGRAPH_TYPE_RESERVED << FGRAPH_TYPE_SHIFT) | FGRAPH_RET_INDEX;
>
> @@ -517,7 +522,10 @@ ftrace_push_return_trace(unsigned long ret, unsigned 
> long func,
> return -EBUSY;
> }
>
> -   calltime = trace_clock_local();
> +   if (skip_ts)

would it be ok to add likely() here to keep the least-overhead code path linear?
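
I.e., something like this (assuming the skip path is the hot one for the
fprobe/BPF case; just a sketch):

	if (likely(skip_ts))
		calltime = 0LL;
	else
		calltime = trace_clock_local();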

> +   calltime = 0LL;
> +   else
> +   calltime = trace_clock_local();
>
> index = READ_ONCE(current->curr_ret_stack);
> ret_stack = RET_STACK(current, index);
> @@ -601,7 +609,8 @@ int function_graph_enter_regs(unsigned long ret, unsigned 
> long func,
> trace.func = func;
> trace.depth = ++current->curr_ret_depth;
>
> -   index = ftrace_push_return_trace(ret, func, frame_pointer, retp, 0);
> +   index = ftrace_push_return_trace(ret, func, frame_pointer, retp, 0,
> +fgraph_skip_timestamp);
> if (index < 0)
> goto out;
>
> @@ -654,7 +663,8 @@ int function_graph_enter_ops(unsigned long ret, unsigned 
> long func,
> return -ENODEV;
>
> /* Use start for the distance to ret_stack (skipping over reserve) */
> -   index = ftrace_push_return_trace(ret, func, frame_pointer, retp, 
> gops->idx);
> +   index = ftrace_push_return_trace(ret, func, frame_pointer, retp, 
> gops->idx,
> +gops->skip_timestamp);
> if (index < 0)
> return index;
> type = get_fgraph_type(current, index);
> @@ -732,6 +742,7 @@ ftrace_pop_return_trace(struct ftrace_graph_ret *trace, 
> unsigned long *ret,
> *ret = ret_stack->ret;
> trace->func = ret_stack->func;
> trace->calltime = ret_stack->calltime;
> +   trace->rettime = 0;
> trace->overrun = atomic_read(&ret_stack->trace_overrun);
> trace->depth = current->curr_ret_depth;
> /*
> @@ -792,7 +803,6 @@ __ftrace_return_to_handler(struct 

Re: [PATCH v9 29/36] bpf: Enable kprobe_multi feature if CONFIG_FPROBE is enabled

2024-04-25 Thread Andrii Nakryiko
On Mon, Apr 15, 2024 at 6:22 AM Masami Hiramatsu (Google)
 wrote:
>
> From: Masami Hiramatsu (Google) 
>
> Enable the kprobe_multi feature if CONFIG_FPROBE is enabled. The pt_regs is
> converted from ftrace_regs by ftrace_partial_regs(), thus some registers
> may always return 0. But that should be enough for function entry (accessing
> arguments) and exit (accessing the return value).
>
> Signed-off-by: Masami Hiramatsu (Google) 
> Acked-by: Florent Revest 
> ---
>  Changes from previous series: NOTHING, Update against the new series.
> ---
>  kernel/trace/bpf_trace.c |   22 +-
>  1 file changed, 9 insertions(+), 13 deletions(-)
>
> diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> index e51a6ef87167..57b1174030c9 100644
> --- a/kernel/trace/bpf_trace.c
> +++ b/kernel/trace/bpf_trace.c
> @@ -2577,7 +2577,7 @@ static int __init bpf_event_init(void)
>  fs_initcall(bpf_event_init);
>  #endif /* CONFIG_MODULES */
>
> -#if defined(CONFIG_FPROBE) && defined(CONFIG_DYNAMIC_FTRACE_WITH_REGS)
> +#ifdef CONFIG_FPROBE
>  struct bpf_kprobe_multi_link {
> struct bpf_link link;
> struct fprobe fp;
> @@ -2600,6 +2600,8 @@ struct user_syms {
> char *buf;
>  };
>
> +static DEFINE_PER_CPU(struct pt_regs, bpf_kprobe_multi_pt_regs);

this is a waste if CONFIG_HAVE_PT_REGS_TO_FTRACE_REGS_CAST=y, right?
Can we guard it?
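
E.g., something like this (untested sketch of the suggestion):

#ifndef CONFIG_HAVE_PT_REGS_TO_FTRACE_REGS_CAST
static DEFINE_PER_CPU(struct pt_regs, bpf_kprobe_multi_pt_regs);
#endif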


> +
>  static int copy_user_syms(struct user_syms *us, unsigned long __user *usyms, 
> u32 cnt)
>  {
> unsigned long __user usymbol;
> @@ -2792,13 +2794,14 @@ static u64 bpf_kprobe_multi_entry_ip(struct 
> bpf_run_ctx *ctx)
>
>  static int
>  kprobe_multi_link_prog_run(struct bpf_kprobe_multi_link *link,
> -  unsigned long entry_ip, struct pt_regs *regs)
> +  unsigned long entry_ip, struct ftrace_regs *fregs)
>  {
> struct bpf_kprobe_multi_run_ctx run_ctx = {
> .link = link,
> .entry_ip = entry_ip,
> };
> struct bpf_run_ctx *old_run_ctx;
> +   struct pt_regs *regs;
> int err;
>
> if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) {
> @@ -2809,6 +2812,7 @@ kprobe_multi_link_prog_run(struct bpf_kprobe_multi_link 
> *link,
>
> migrate_disable();
> rcu_read_lock();
> +   regs = ftrace_partial_regs(fregs, 
> this_cpu_ptr(&bpf_kprobe_multi_pt_regs));

and then pass NULL if defined(CONFIG_HAVE_PT_REGS_TO_FTRACE_REGS_CAST)?
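
I.e., something like this (untested sketch; assumes ftrace_partial_regs() can
simply ignore its second argument when the cast is possible):

#ifdef CONFIG_HAVE_PT_REGS_TO_FTRACE_REGS_CAST
	regs = ftrace_partial_regs(fregs, NULL);
#else
	regs = ftrace_partial_regs(fregs, this_cpu_ptr(&bpf_kprobe_multi_pt_regs));
#endif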


> old_run_ctx = bpf_set_run_ctx(&run_ctx.run_ctx);
> err = bpf_prog_run(link->link.prog, regs);
> bpf_reset_run_ctx(old_run_ctx);
> @@ -2826,13 +2830,9 @@ kprobe_multi_link_handler(struct fprobe *fp, unsigned 
> long fentry_ip,
>   void *data)
>  {
> struct bpf_kprobe_multi_link *link;
> -   struct pt_regs *regs = ftrace_get_regs(fregs);
> -
> -   if (!regs)
> -   return 0;
>
> link = container_of(fp, struct bpf_kprobe_multi_link, fp);
> -   kprobe_multi_link_prog_run(link, get_entry_ip(fentry_ip), regs);
> +   kprobe_multi_link_prog_run(link, get_entry_ip(fentry_ip), fregs);
> return 0;
>  }
>
> @@ -2842,13 +2842,9 @@ kprobe_multi_link_exit_handler(struct fprobe *fp, 
> unsigned long fentry_ip,
>void *data)
>  {
> struct bpf_kprobe_multi_link *link;
> -   struct pt_regs *regs = ftrace_get_regs(fregs);
> -
> -   if (!regs)
> -   return;
>
> link = container_of(fp, struct bpf_kprobe_multi_link, fp);
> -   kprobe_multi_link_prog_run(link, get_entry_ip(fentry_ip), regs);
> +   kprobe_multi_link_prog_run(link, get_entry_ip(fentry_ip), fregs);
>  }
>
>  static int symbols_cmp_r(const void *a, const void *b, const void *priv)
> @@ -3107,7 +3103,7 @@ int bpf_kprobe_multi_link_attach(const union bpf_attr 
> *attr, struct bpf_prog *pr
> kvfree(cookies);
> return err;
>  }
> -#else /* !CONFIG_FPROBE || !CONFIG_DYNAMIC_FTRACE_WITH_REGS */
> +#else /* !CONFIG_FPROBE */
>  int bpf_kprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog 
> *prog)
>  {
> return -EOPNOTSUPP;
>
>



[PATCH RFC] rethook: inline arch_rethook_trampoline_callback() in assembly code

2024-04-24 Thread Andrii Nakryiko
At the lowest level, rethook-based kretprobes on the x86-64 architecture go
through the arch_rethook_trampoline() function, manually written in
assembly, which calls into a simple arch_rethook_trampoline_callback()
function, written in C, that only does a few straightforward field
assignments before calling further into rethook_trampoline_handler(),
which handles kretprobe callbacks generically.

Given the simplicity of arch_rethook_trampoline_callback(), it seems
not really worthwhile to spend an extra function call just to do 4 or
5 assignments. As such, this patch proposes to "inline"
arch_rethook_trampoline_callback() into arch_rethook_trampoline() by
implementing it manually in assembly.

This has two motivations. First, we do get a bit of runtime speed up by
avoiding function calls. Using BPF selftests's bench tool, we see
0.6%-0.8% throughput improvement for kretprobe/multi-kretprobe
triggering code path:

BEFORE (latest probes/for-next)
===
kretprobe  :   10.455 ± 0.024M/s
kretprobe-multi:   11.150 ± 0.012M/s

AFTER (probes/for-next + this patch)

kretprobe  :   10.540 ± 0.009M/s (+0.8%)
kretprobe-multi:   11.219 ± 0.042M/s (+0.6%)

Second, and no less importantly for some specialized use cases, this
avoids unnecessarily "polluting" LBR records with an extra function call
(recorded as a jump by the CPU). This is the case for the retsnoop ([0])
tool, which relies heavily on capturing LBR records to provide users with
lots of insight into kernel internals.

This RFC patch only inlines this function for x86-64, but it's
possible to do the same for the 32-bit x86 arch as well and then remove
the arch_rethook_trampoline_callback() implementation altogether. Please
let me know if this change is acceptable and whether I should complete it
with the 32-bit "inlining" as well. Thanks!

  [0] 
https://nakryiko.com/posts/retsnoop-intro/#peering-deep-into-functions-with-lbr

Signed-off-by: Andrii Nakryiko 
---
 arch/x86/kernel/asm-offsets_64.c |  4 
 arch/x86/kernel/rethook.c| 37 +++-
 2 files changed, 36 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c
index bb65371ea9df..5c444abc540c 100644
--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -42,6 +42,10 @@ int main(void)
ENTRY(r14);
ENTRY(r15);
ENTRY(flags);
+   ENTRY(ip);
+   ENTRY(cs);
+   ENTRY(ss);
+   ENTRY(orig_ax);
BLANK();
 #undef ENTRY
 
diff --git a/arch/x86/kernel/rethook.c b/arch/x86/kernel/rethook.c
index 8a1c0111ae79..3e1c01beebd1 100644
--- a/arch/x86/kernel/rethook.c
+++ b/arch/x86/kernel/rethook.c
@@ -6,6 +6,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include "kprobes/common.h"
 
@@ -34,10 +35,36 @@ asm(
"   pushq %rsp\n"
"   pushfq\n"
SAVE_REGS_STRING
-   "   movq %rsp, %rdi\n"
-   "   call arch_rethook_trampoline_callback\n"
+   "   movq %rsp, %rdi\n" /* $rdi points to regs */
+   /* fixup registers */
+   /* regs->cs = __KERNEL_CS; */
+   "   movq $" __stringify(__KERNEL_CS) ", " __stringify(pt_regs_cs) 
"(%rdi)\n"
+   /* regs->ip = (unsigned long)&arch_rethook_trampoline; */
+   "   movq $arch_rethook_trampoline, " __stringify(pt_regs_ip) 
"(%rdi)\n"
+   /* regs->orig_ax = ~0UL; */
+   "   movq $0x, " __stringify(pt_regs_orig_ax) 
"(%rdi)\n"
+   /* regs->sp += 2*sizeof(long); */
+   "   addq $16, " __stringify(pt_regs_sp) "(%rdi)\n"
+   /* 2nd arg is frame_pointer = (long *)(regs + 1); */
+   "   lea " __stringify(PTREGS_SIZE) "(%rdi), %rsi\n"
+   /*
+* The return address at 'frame_pointer' is recovered by the
+* arch_rethook_fixup_return() which called from this
+* rethook_trampoline_handler().
+*/
+   "   call rethook_trampoline_handler\n"
+   /*
+* Copy FLAGS to 'pt_regs::ss' so we can do RET right after POPF.
+*
+* We don't save/restore %rax below, because we ignore
+* rethook_trampoline_handler result.
+*
+* *(unsigned long *)&regs->ss = regs->flags;
+*/
+   "   mov " __stringify(pt_regs_flags) "(%rsp), %rax\n"
+   "   mov %rax, " __stringify(pt_regs_ss) "(%rsp)\n"
RESTORE_REGS_STRING
-   /* In the callback function, 'regs->flags' is copied to 'regs->ss'. */
+   /* We just copied 'regs->flags' into 'regs->ss'. */
"   addq $16, %rsp\n"
"   popfq\n"
 #else
@@ -61,6 +88,7 @@ asm(
 );
 NOKPROBE_SYMBOL(arch_retho
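
For reference, gathering the C equivalents spelled out in the asm comments above
into one place (just a sketch of the logic being inlined; the function name below
is made up, and this is not the removed arch_rethook_trampoline_callback() verbatim):

static void rethook_trampoline_fixup_sketch(struct pt_regs *regs)
{
	/* fixup registers so the generic handler sees a consistent frame */
	regs->cs = __KERNEL_CS;
	regs->ip = (unsigned long)&arch_rethook_trampoline;
	regs->orig_ax = ~0UL;
	regs->sp += 2 * sizeof(long);

	/* 2nd arg is frame_pointer = (long *)(regs + 1) */
	rethook_trampoline_handler(regs, (unsigned long)(regs + 1));

	/* copy FLAGS into pt_regs::ss so RET can follow POPF */
	*(unsigned long *)&regs->ss = regs->flags;
}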

[PATCH 2/2] objpool: cache nr_possible_cpus() and avoid caching nr_cpu_ids

2024-04-24 Thread Andrii Nakryiko
Profiling shows that calling nr_possible_cpus() in objpool_pop() takes
a noticeable amount of CPU (when profiled on 80-core machine), as we
need to recalculate number of set bits in a CPU bit mask. This number
can't change, so there is no point in paying the price for recalculating
it. As such, cache this value in struct objpool_head and use it in
objpool_pop().

On the other hand, cached pool->nr_cpus isn't necessary, as it's not
used in hot path and is also a pretty trivial value to retrieve. So drop
pool->nr_cpus in favor of using nr_cpu_ids everywhere. This way the size
of struct objpool_head remains the same, which is a nice bonus.

Same BPF selftests benchmarks were used to evaluate the effect. Using
changes in previous patch (inlining of objpool_pop/objpool_push) as
baseline, here are the differences:

BASELINE

kretprobe  :    9.937 ± 0.174M/s
kretprobe-multi:   10.440 ± 0.108M/s

AFTER
=
kretprobe  :   10.106 ± 0.120M/s (+1.7%)
kretprobe-multi:   10.515 ± 0.180M/s (+0.7%)

Cc: Matt (Qiang) Wu 
Signed-off-by: Andrii Nakryiko 
---
 include/linux/objpool.h |  6 +++---
 lib/objpool.c   | 12 ++--
 2 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/include/linux/objpool.h b/include/linux/objpool.h
index d8b1f7b91128..cb1758eaa2d3 100644
--- a/include/linux/objpool.h
+++ b/include/linux/objpool.h
@@ -73,7 +73,7 @@ typedef int (*objpool_fini_cb)(struct objpool_head *head, 
void *context);
  * struct objpool_head - object pooling metadata
  * @obj_size:   object size, aligned to sizeof(void *)
  * @nr_objs:total objs (to be pre-allocated with objpool)
- * @nr_cpus:local copy of nr_cpu_ids
+ * @nr_possible_cpus: cached value of num_possible_cpus()
  * @capacity:   max objs can be managed by one objpool_slot
  * @gfp:gfp flags for kmalloc & vmalloc
  * @ref:refcount of objpool
@@ -85,7 +85,7 @@ typedef int (*objpool_fini_cb)(struct objpool_head *head, 
void *context);
 struct objpool_head {
int obj_size;
int nr_objs;
-   int nr_cpus;
+   int nr_possible_cpus;
int capacity;
gfp_t   gfp;
refcount_t  ref;
@@ -176,7 +176,7 @@ static inline void *objpool_pop(struct objpool_head *pool)
raw_local_irq_save(flags);
 
cpu = raw_smp_processor_id();
-   for (i = 0; i < num_possible_cpus(); i++) {
+   for (i = 0; i < pool->nr_possible_cpus; i++) {
obj = __objpool_try_get_slot(pool, cpu);
if (obj)
break;
diff --git a/lib/objpool.c b/lib/objpool.c
index f696308fc026..234f9d0bd081 100644
--- a/lib/objpool.c
+++ b/lib/objpool.c
@@ -50,7 +50,7 @@ objpool_init_percpu_slots(struct objpool_head *pool, int 
nr_objs,
 {
int i, cpu_count = 0;
 
-   for (i = 0; i < pool->nr_cpus; i++) {
+   for (i = 0; i < nr_cpu_ids; i++) {
 
struct objpool_slot *slot;
int nodes, size, rc;
@@ -60,8 +60,8 @@ objpool_init_percpu_slots(struct objpool_head *pool, int 
nr_objs,
continue;
 
/* compute how many objects to be allocated with this slot */
-   nodes = nr_objs / num_possible_cpus();
-   if (cpu_count < (nr_objs % num_possible_cpus()))
+   nodes = nr_objs / pool->nr_possible_cpus;
+   if (cpu_count < (nr_objs % pool->nr_possible_cpus))
nodes++;
cpu_count++;
 
@@ -103,7 +103,7 @@ static void objpool_fini_percpu_slots(struct objpool_head 
*pool)
if (!pool->cpu_slots)
return;
 
-   for (i = 0; i < pool->nr_cpus; i++)
+   for (i = 0; i < nr_cpu_ids; i++)
kvfree(pool->cpu_slots[i]);
kfree(pool->cpu_slots);
 }
@@ -130,13 +130,13 @@ int objpool_init(struct objpool_head *pool, int nr_objs, 
int object_size,
 
/* initialize objpool pool */
memset(pool, 0, sizeof(struct objpool_head));
-   pool->nr_cpus = nr_cpu_ids;
+   pool->nr_possible_cpus = num_possible_cpus();
pool->obj_size = object_size;
pool->capacity = capacity;
pool->gfp = gfp & ~__GFP_ZERO;
pool->context = context;
pool->release = release;
-   slot_size = pool->nr_cpus * sizeof(struct objpool_slot);
+   slot_size = nr_cpu_ids * sizeof(struct objpool_slot);
pool->cpu_slots = kzalloc(slot_size, pool->gfp);
if (!pool->cpu_slots)
return -ENOMEM;
-- 
2.43.0




[PATCH 1/2] objpool: enable inlining objpool_push() and objpool_pop() operations

2024-04-24 Thread Andrii Nakryiko
objpool_push() and objpool_pop() are very performance-critical functions
and can be called very frequently in kretprobe triggering path.

As such, it makes sense to allow compiler to inline them completely to
eliminate function calls overhead. Luckily, their logic is quite well
isolated and doesn't have any sprawling dependencies.

This patch moves both objpool_push() and objpool_pop() into
include/linux/objpool.h and marks them as static inline functions,
enabling inlining. To avoid anyone using internal helpers
(objpool_try_get_slot, objpool_try_add_slot), rename them to use leading
underscores.

We used the kretprobe microbenchmark from BPF selftests (bench trig-kprobe
and trig-kprobe-multi benchmarks) running no-op BPF kretprobe/kretprobe.multi
programs in a tight loop to evaluate the effect. BPF's own overhead in
this case is minimal, and the benchmark mostly stresses the rest of the
in-kernel kretprobe infrastructure. Results are in millions of calls per
second. This is not super scientific, but it shows the trend nevertheless.

BEFORE
==
kretprobe  :    9.794 ± 0.086M/s
kretprobe-multi:   10.219 ± 0.032M/s

AFTER
=
kretprobe  :    9.937 ± 0.174M/s (+1.5%)
kretprobe-multi:   10.440 ± 0.108M/s (+2.2%)

Cc: Matt (Qiang) Wu 
Signed-off-by: Andrii Nakryiko 
---
 include/linux/objpool.h | 101 +++-
 lib/objpool.c   | 100 ---
 2 files changed, 99 insertions(+), 102 deletions(-)

diff --git a/include/linux/objpool.h b/include/linux/objpool.h
index 15aff4a17f0c..d8b1f7b91128 100644
--- a/include/linux/objpool.h
+++ b/include/linux/objpool.h
@@ -5,6 +5,10 @@
 
 #include 
 #include 
+#include 
+#include 
+#include 
+#include 
 
 /*
  * objpool: ring-array based lockless MPMC queue
@@ -118,13 +122,94 @@ int objpool_init(struct objpool_head *pool, int nr_objs, 
int object_size,
 gfp_t gfp, void *context, objpool_init_obj_cb objinit,
 objpool_fini_cb release);
 
+/* try to retrieve object from slot */
+static inline void *__objpool_try_get_slot(struct objpool_head *pool, int cpu)
+{
+   struct objpool_slot *slot = pool->cpu_slots[cpu];
+   /* load head snapshot, other cpus may change it */
+   uint32_t head = smp_load_acquire(&slot->head);
+
+   while (head != READ_ONCE(slot->last)) {
+   void *obj;
+
+   /*
+* data visibility of 'last' and 'head' could be out of
+* order since memory updating of 'last' and 'head' are
+* performed in push() and pop() independently
+*
+* before any retrieving attempts, pop() must guarantee
+* 'last' is behind 'head', that is to say, there must
+* be available objects in slot, which could be ensured
+* by condition 'last != head && last - head <= nr_objs'
+* that is equivalent to 'last - head - 1 < nr_objs' as
+* 'last' and 'head' are both unsigned int32
+*/
+   if (READ_ONCE(slot->last) - head - 1 >= pool->nr_objs) {
+   head = READ_ONCE(slot->head);
+   continue;
+   }
+
+   /* obj must be retrieved before moving forward head */
+   obj = READ_ONCE(slot->entries[head & slot->mask]);
+
+   /* move head forward to mark it's consumption */
+   if (try_cmpxchg_release(&slot->head, &head, head + 1))
+   return obj;
+   }
+
+   return NULL;
+}
+
 /**
  * objpool_pop() - allocate an object from objpool
  * @pool: object pool
  *
  * return value: object ptr or NULL if failed
  */
-void *objpool_pop(struct objpool_head *pool);
+static inline void *objpool_pop(struct objpool_head *pool)
+{
+   void *obj = NULL;
+   unsigned long flags;
+   int i, cpu;
+
+   /* disable local irq to avoid preemption & interruption */
+   raw_local_irq_save(flags);
+
+   cpu = raw_smp_processor_id();
+   for (i = 0; i < num_possible_cpus(); i++) {
+   obj = __objpool_try_get_slot(pool, cpu);
+   if (obj)
+   break;
+   cpu = cpumask_next_wrap(cpu, cpu_possible_mask, -1, 1);
+   }
+   raw_local_irq_restore(flags);
+
+   return obj;
+}
+
+/* adding object to slot, abort if the slot was already full */
+static inline int
+__objpool_try_add_slot(void *obj, struct objpool_head *pool, int cpu)
+{
+   struct objpool_slot *slot = pool->cpu_slots[cpu];
+   uint32_t head, tail;
+
+   /* loading tail and head as a local snapshot, tail first */
+   tail = READ_ONCE(slot->tail);
+
+   do {
+   head = READ_ONCE(slot->head);
+   /* fault caught: something must be wrong */
+   WARN_ON_ONCE(tail - head > pool->nr_objs);
+   } while (!try_cmpxchg_acqui

[PATCH 0/2] Objpool performance improvements

2024-04-24 Thread Andrii Nakryiko
Improve objpool (used heavily in kretprobe hot path) performance with two
improvements:
  - inlining performance critical objpool_push()/objpool_pop() operations;
  - avoiding re-calculating relatively expensive nr_possible_cpus().

These opportunities were found when benchmarking and profiling kprobes and
kretprobes with BPF-based benchmarks. See individual patches for details and
results.

Andrii Nakryiko (2):
  objpool: enable inlining objpool_push() and objpool_pop() operations
  objpool: cache nr_possible_cpus() and avoid caching nr_cpu_ids

 include/linux/objpool.h | 105 +++--
 lib/objpool.c   | 112 +++-
 2 files changed, 107 insertions(+), 110 deletions(-)

-- 
2.43.0




Re: [PATCH v4 2/2] rethook: honor CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING in rethook_try_get()

2024-04-19 Thread Andrii Nakryiko
On Thu, Apr 18, 2024 at 6:00 PM Masami Hiramatsu  wrote:
>
> On Thu, 18 Apr 2024 12:09:09 -0700
> Andrii Nakryiko  wrote:
>
> > Take into account CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING when validating
> > that RCU is watching when trying to set up a rethook on a function entry.
> >
> > One notable exception, where we force the rcu_is_watching() check, is the
> > CONFIG_KPROBE_EVENTS_ON_NOTRACE=y case, in which kretprobes will use the
> > old-style int3-based workflow instead of relying on ftrace, making the RCU
> > watching check important to validate.
> >
> > This further (in addition to improvements in the previous patch)
> > improves BPF multi-kretprobe (which relies on rethook) runtime throughput
> > by 2.3%, according to BPF benchmarks ([0]).
> >
> >   [0] 
> > https://lore.kernel.org/bpf/caef4bzauq2wkmjzdc9s0rbwa01bybgwhn6andxqshyia47p...@mail.gmail.com/
> >
> > Signed-off-by: Andrii Nakryiko 
>
>
> Thanks for update! This looks good to me.

Thanks, Masami! Will you take it through your tree, or you'd like to
route it through bpf-next?

>
> Acked-by: Masami Hiramatsu (Google) 
>
> Thanks,
>
> > ---
> >  kernel/trace/rethook.c | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
> > index fa03094e9e69..a974605ad7a5 100644
> > --- a/kernel/trace/rethook.c
> > +++ b/kernel/trace/rethook.c
> > @@ -166,6 +166,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh)
> >   if (unlikely(!handler))
> >   return NULL;
> >
> > +#if defined(CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING) || 
> > defined(CONFIG_KPROBE_EVENTS_ON_NOTRACE)
> >   /*
> >* This expects the caller will set up a rethook on a function entry.
> >* When the function returns, the rethook will eventually be reclaimed
> > @@ -174,6 +175,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh)
> >*/
> >   if (unlikely(!rcu_is_watching()))
> >   return NULL;
> > +#endif
> >
> >   return (struct rethook_node *)objpool_pop(&rh->pool);
> >  }
> > --
> > 2.43.0
> >
>
>
> --
> Masami Hiramatsu (Google) 



[PATCH v4 2/2] rethook: honor CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING in rethook_try_get()

2024-04-18 Thread Andrii Nakryiko
Take into account CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING when validating
that RCU is watching when trying to set up a rethook on a function entry.

One notable exception, where we force the rcu_is_watching() check, is the
CONFIG_KPROBE_EVENTS_ON_NOTRACE=y case, in which kretprobes will use the
old-style int3-based workflow instead of relying on ftrace, making the RCU
watching check important to validate.

This further (in addition to improvements in the previous patch)
improves BPF multi-kretprobe (which relies on rethook) runtime throughput
by 2.3%, according to BPF benchmarks ([0]).

  [0] 
https://lore.kernel.org/bpf/caef4bzauq2wkmjzdc9s0rbwa01bybgwhn6andxqshyia47p...@mail.gmail.com/

Signed-off-by: Andrii Nakryiko 
---
 kernel/trace/rethook.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
index fa03094e9e69..a974605ad7a5 100644
--- a/kernel/trace/rethook.c
+++ b/kernel/trace/rethook.c
@@ -166,6 +166,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh)
if (unlikely(!handler))
return NULL;
 
+#if defined(CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING) || 
defined(CONFIG_KPROBE_EVENTS_ON_NOTRACE)
/*
 * This expects the caller will set up a rethook on a function entry.
 * When the function returns, the rethook will eventually be reclaimed
@@ -174,6 +175,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh)
 */
if (unlikely(!rcu_is_watching()))
return NULL;
+#endif
 
return (struct rethook_node *)objpool_pop(&rh->pool);
 }
-- 
2.43.0




[PATCH v4 1/2] ftrace: make extra rcu_is_watching() validation check optional

2024-04-18 Thread Andrii Nakryiko
Introduce CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to
control whether ftrace low-level code performs additional
rcu_is_watching()-based validation logic in an attempt to catch noinstr
violations.

This check is expected to never be true and is mostly useful for
low-level validation of ftrace subsystem invariants. For most users it
should probably be kept disabled to eliminate unnecessary runtime
overhead.

This improves BPF multi-kretprobe (relying on ftrace and rethook
infrastructure) runtime throughput by 2%, according to BPF benchmarks ([0]).

  [0] 
https://lore.kernel.org/bpf/caef4bzauq2wkmjzdc9s0rbwa01bybgwhn6andxqshyia47p...@mail.gmail.com/

Cc: Steven Rostedt 
Cc: Masami Hiramatsu 
Cc: Paul E. McKenney 
Acked-by: Masami Hiramatsu (Google) 
Signed-off-by: Andrii Nakryiko 
---
 include/linux/trace_recursion.h |  2 +-
 kernel/trace/Kconfig| 13 +
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
index d48cd92d2364..24ea8ac049b4 100644
--- a/include/linux/trace_recursion.h
+++ b/include/linux/trace_recursion.h
@@ -135,7 +135,7 @@ extern void ftrace_record_recursion(unsigned long ip, 
unsigned long parent_ip);
 # define do_ftrace_record_recursion(ip, pip)   do { } while (0)
 #endif
 
-#ifdef CONFIG_ARCH_WANTS_NO_INSTR
+#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
 # define trace_warn_on_no_rcu(ip)  \
({  \
bool __ret = !rcu_is_watching();\
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 61c541c36596..7aebd1b8f93e 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -974,6 +974,19 @@ config FTRACE_RECORD_RECURSION_SIZE
  This file can be reset, but the limit can not change in
  size at runtime.
 
+config FTRACE_VALIDATE_RCU_IS_WATCHING
+   bool "Validate RCU is on during ftrace execution"
+   depends on FUNCTION_TRACER
+   depends on ARCH_WANTS_NO_INSTR
+   help
+ All callbacks that attach to the function tracing have some sort of
+ protection against recursion. This option is only to verify that
+ ftrace (and other users of ftrace_test_recursion_trylock()) are not
+ called outside of RCU, as if they are, it can cause a race. But it
+ also has a noticeable overhead when enabled.
+
+ If unsure, say N
+
 config RING_BUFFER_RECORD_RECURSION
bool "Record functions that recurse in the ring buffer"
depends on FTRACE_RECORD_RECURSION
-- 
2.43.0




Re: [PATCH v3 2/2] rethook: honor CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING in rethook_try_get()

2024-04-18 Thread Andrii Nakryiko
On Tue, Apr 9, 2024 at 3:48 PM Masami Hiramatsu  wrote:
>
> On Wed,  3 Apr 2024 15:03:28 -0700
> Andrii Nakryiko  wrote:
>
> > Take into account CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING when validating
> > that RCU is watching when trying to set up a rethook on a function entry.
> >
> > This further (in addition to improvements in the previous patch)
> > improves BPF multi-kretprobe (which relies on rethook) runtime throughput
> > by 2.3%, according to BPF benchmarks ([0]).
> >
> >   [0] 
> > https://lore.kernel.org/bpf/caef4bzauq2wkmjzdc9s0rbwa01bybgwhn6andxqshyia47p...@mail.gmail.com/
> >
>
> Hi Andrii,
>
> Can you make this part depends on !KPROBE_EVENTS_ON_NOTRACE (with this
> option, kretprobes can be used without ftrace, but with original int3) ?

Sorry for the late response, I was out on vacation. Makes sense about
KPROBE_EVENTS_ON_NOTRACE, I went with this condition:

#if defined(CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING) ||
defined(CONFIG_KPROBE_EVENTS_ON_NOTRACE)

Will send an updated revision shortly.

> This option should be set N on production system because of safety,
> just for testing raw kretprobes.
>
> Thank you,
>
> > Signed-off-by: Andrii Nakryiko 
> > ---
> >  kernel/trace/rethook.c | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
> > index fa03094e9e69..15b8aa4048d9 100644
> > --- a/kernel/trace/rethook.c
> > +++ b/kernel/trace/rethook.c
> > @@ -166,6 +166,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh)
> >   if (unlikely(!handler))
> >   return NULL;
> >
> > +#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
> >   /*
> >* This expects the caller will set up a rethook on a function entry.
> >* When the function returns, the rethook will eventually be reclaimed
> > @@ -174,6 +175,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh)
> >*/
> >   if (unlikely(!rcu_is_watching()))
> >   return NULL;
> > +#endif
> >
> >   return (struct rethook_node *)objpool_pop(&rh->pool);
> >  }
> > --
> > 2.43.0
> >
>
>
> --
> Masami Hiramatsu (Google) 



Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe

2024-04-18 Thread Andrii Nakryiko
On Mon, Apr 15, 2024 at 1:25 AM Jiri Olsa  wrote:
>
> On Tue, Apr 02, 2024 at 11:33:00AM +0200, Jiri Olsa wrote:
>
> SNIP
>
> >  #include 
> >  #include 
> > @@ -308,6 +309,88 @@ static int uprobe_init_insn(struct arch_uprobe 
> > *auprobe, struct insn *insn, bool
> >  }
> >
> >  #ifdef CONFIG_X86_64
> > +
> > +asm (
> > + ".pushsection .rodata\n"
> > + ".global uretprobe_syscall_entry\n"
> > + "uretprobe_syscall_entry:\n"
> > + "pushq %rax\n"
> > + "pushq %rcx\n"
> > + "pushq %r11\n"
> > + "movq $" __stringify(__NR_uretprobe) ", %rax\n"
> > + "syscall\n"
> > + "popq %r11\n"
> > + "popq %rcx\n"
> > +
> > + /* The uretprobe syscall replaces stored %rax value with final
> > +  * return address, so we don't restore %rax in here and just
> > +  * call ret.
> > +  */
> > + "retq\n"
> > + ".global uretprobe_syscall_end\n"
> > + "uretprobe_syscall_end:\n"
> > + ".popsection\n"
> > +);
> > +
> > +extern u8 uretprobe_syscall_entry[];
> > +extern u8 uretprobe_syscall_end[];
> > +
> > +void *arch_uprobe_trampoline(unsigned long *psize)
> > +{
> > + *psize = uretprobe_syscall_end - uretprobe_syscall_entry;
> > + return uretprobe_syscall_entry;
>
> fyi I realized this screws 32-bit programs, we either need to add
> compat trampoline, or keep the standard breakpoint for them:
>
> +   struct pt_regs *regs = task_pt_regs(current);
> +   static uprobe_opcode_t insn = UPROBE_SWBP_INSN;
> +
> +   if (user_64bit_mode(regs)) {
> +   *psize = uretprobe_syscall_end - uretprobe_syscall_entry;
> +   return uretprobe_syscall_entry;
> +   }
> +
> +   *psize = UPROBE_SWBP_INSN_SIZE;
> +   return &insn;
>
>
> not sure it's worth the effort to add the trampoline, I'll check
>

32-bit arch isn't a high-performance target anyways, so I'd probably
not bother and prioritize simplicity and long term maintenance.

>
> jirka



Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional

2024-04-06 Thread Andrii Nakryiko
On Fri, Apr 5, 2024 at 8:41 PM Masami Hiramatsu  wrote:
>
> On Tue, 2 Apr 2024 22:21:00 -0700
> Andrii Nakryiko  wrote:
>
> > On Tue, Apr 2, 2024 at 9:00 PM Andrii Nakryiko
> >  wrote:
> > >
> > > On Tue, Apr 2, 2024 at 5:52 PM Steven Rostedt  wrote:
> > > >
> > > > On Wed, 3 Apr 2024 09:40:48 +0900
> > > > Masami Hiramatsu (Google)  wrote:
> > > >
> > > > > OK, for me, this last sentence is preferred for the help message. 
> > > > > That explains
> > > > > what this is for.
> > > > >
> > > > > All callbacks that attach to the function tracing have some 
> > > > > sort
> > > > > of protection against recursion. This option is only to 
> > > > > verify that
> > > > >    ftrace (and other users of ftrace_test_recursion_trylock()) 
> > > > >are not
> > > > > called outside of RCU, as if they are, it can cause a race.
> > > > > But it also has a noticeable overhead when enabled.
> > >
> > > Sounds good to me, I can add this to the description of the Kconfig 
> > > option.
> > >
> > > > >
> > > > > BTW, how much overhead does this introduce? and the race case a 
> > > > > kernel crash?
> > >
> > > I just checked our fleet-wide production data for the last 24 hours.
> > > Within the kprobe/kretprobe code path (ftrace_trampoline and
> > > everything called from it), rcu_is_watching (both calls, see below)
> > > cause 0.484% CPU cycles usage, which isn't nothing. So definitely we'd
> > > prefer to be able to avoid that in production use cases.
> > >
> >
> > I just ran synthetic microbenchmark testing multi-kretprobe
> > throughput. We get (in millions of BPF kretprobe-multi program
> > invocations per second):
> >   - 5.568M/s as baseline;
> >   - 5.679M/s with changes in this patch (+2% throughput improvement);
> >   - 5.808M/s with disabling rcu_is_watching in rethook_try_get()
> > (+2.3% more vs just one of rcu_is_watching, and +4.3% vs baseline).
> >
> > It's definitely noticeable.
>
> Thanks for checking the overhead! Hmm, it is considerable.
>
> > > > > or just messed up the ftrace buffer?
> > > >
> > > > There's a hypothetical race where it can cause a use after free.
>
> Hmm, so it might not lead to a kernel crash, but it is better to enable it
> together with other debugging options.
>
> > > >
> > > > That is, after you shutdown ftrace, you need to call 
> > > > synchronize_rcu_tasks(),
> > > > which requires RCU to be watching. There's a theoretical case where that
> > > > task calls the trampoline and misses the synchronization. Note, these
> > > > locations are with preemption disabled, as rcu is always watching when
> > > > preemption is enabled. Thus it would be extremely fast where as the
> > > > synchronize_rcu_tasks() is rather slow.
> > > >
> > > > We also have synchronize_rcu_tasks_rude() which would actually keep the
> > > > trace from happening, as it would schedule on each CPU forcing all CPUs 
> > > > to
> > > > have RCU watching.
> > > >
> > > > I have never heard of this race being hit. I guess it could happen on a 
> > > > VM
> > > > where a vCPU gets preempted at the right moment for a long time and the
> > > > other CPUs synchronize.
> > > >
> > > > But like lockdep, where deadlocks can crash the kernel, we don't enable 
> > > > it
> > > > for production.
> > > >
> > > > The overhead is another function call within the function tracer. I had
> > > > numbers before, but I guess I could run tests again and get new numbers.
> > > >
> > >
> > > I just noticed another rcu_is_watching() call, in rethook_try_get(),
> > > which seems to be a similar and complementary validation check to the
> > > one we are putting under CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING option
> > > in this patch. It feels like both of them should be controlled by the
> > > same settings. WDYT? Can I add CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
> > > guard around rcu_is_watching() check in rethook_try_get() as well?
>
> Hmmm, no, I think it should not change the rethook side because rethook
> can be used with kprobes without ftrace. If we can detect it is used from

It's a good thing that I split that into a separate patch, then.
Hopefully the first patch looks good and you can apply it as is.

> the ftrace, we can skip it. (From this reason, I would like to remove
> return probe from kprobes...)

I'm on PTO for the next two weeks and I can take a look at more
properly guarding rcu_is_watching() in rethook_try_get() when I'm
back. Thanks.

>
> Thank you,
>
> > >
> > >
> > > > Thanks,
> > > >
> > > > -- Steve
>
>
> --
> Masami Hiramatsu (Google) 



Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe

2024-04-03 Thread Andrii Nakryiko
On Wed, Apr 3, 2024 at 5:58 PM Masami Hiramatsu  wrote:
>
> On Wed, 3 Apr 2024 09:58:12 -0700
> Andrii Nakryiko  wrote:
>
> > On Wed, Apr 3, 2024 at 7:09 AM Masami Hiramatsu  wrote:
> > >
> > > On Wed, 3 Apr 2024 11:47:41 +0200
> > > Jiri Olsa  wrote:
> > >
> > > > On Wed, Apr 03, 2024 at 10:07:08AM +0900, Masami Hiramatsu wrote:
> > > > > Hi Jiri,
> > > > >
> > > > > On Tue,  2 Apr 2024 11:33:00 +0200
> > > > > Jiri Olsa  wrote:
> > > > >
> > > > > > Adding uretprobe syscall instead of trap to speed up return probe.
> > > > >
> > > > > This is an interesting approach. But I doubt we need to add an additional
> > > > > syscall just for this purpose. Can't we use another syscall or ioctl?
> > > >
> > > > so the plan is to optimize entry uprobe in a similar way and given
> > > > the syscall is not a scarce resource I wanted to add another syscall
> > > > for that one as well
> > > >
> > > > tbh I'm not sure which syscall or ioctl to reuse for this, it's
> > > > possible to do that, the trampoline will just have to save one or
> > > > more additional registers, but adding new syscall seems cleaner to me
> > >
> > > Hmm, I think a similar syscall is ptrace? prctl may also be a candidate.
> >
> > I think both ptrace and prctl are for completely different use cases
> > and it would be an abuse of existing API to reuse them for uretprobe
> > tracing. Also, keep in mind, that any extra argument that has to be
> > passed into this syscall means that we need to complicate and slow
> > generated assembly code that is injected into user process (to
> > save/restore registers) and also kernel-side (again, to deal with all
> > the extra registers that would be stored/restored on stack).
> >
> > Given syscalls are not some kind of scarce resources, what's the
> > downside to have a dedicated and simple syscall?
>
> Syscalls are explicitly exposed to user space, thus, even if it is used
> ONLY for a very specific situation, it is an official kernel interface,
> and we need to care about compatibility. (If it causes SIGILL except in
> a specific use case, I don't know that there is any "compatibility".)

Check rt_sigreturn syscall (manpage at [0], for example).

   sigreturn() exists only to allow the implementation of signal
   handlers.  It should never be called directly.  (Indeed, a simple
   sigreturn() wrapper in the GNU C library simply returns -1, with
   errno set to ENOSYS.)  Details of the arguments (if any) passed
   to sigreturn() vary depending on the architecture.  (On some
   architectures, such as x86-64, sigreturn() takes no arguments,
   since all of the information that it requires is available in the
   stack frame that was previously created by the kernel on the
   user-space stack.)

This is a very similar use case. Also, check its source code in
arch/x86/kernel/signal_64.c. It sends SIGSEGV to the calling process
on any sign of something not being right. It's exactly the same with
sys_uretprobe.

  [0] https://man7.org/linux/man-pages/man2/sigreturn.2.html

> And the number of syscalls is a limited resource.

We have almost 500 of them, so it doesn't seem like adding 1-2 for good
reasons would be a problem. Can you please point to where the limits
on syscalls as a resource are described? I'm curious to learn.

>
> I'm actually not sure how much we need to care about it, but adding a new
> syscall is worth discussing carefully because all of them are part of
> user-space compatibility.

Absolutely, it's a good discussion to have.

>
> > > > > Also, we should run syzkaller on this syscall. And if uretprobe is
> > > >
> > > > right, I'll check on syzkaller
> > > >
> > > > > set in the user function, what happen if the user function directly
> > > > > calls this syscall? (maybe it consumes shadow stack?)
> > > >
> > > > the process should receive SIGILL if there's no pending uretprobe for
> > > > the current task, or it will trigger uretprobe if there's one pending
> > >
> > > No, that is too aggressive and not safe. Since the syscall is exposed to
> > > user program, it should return appropriate error code instead of SIGILL.
> > >
> >
> > This is the way it is today with uretprobes even through interrupt.
>
> I doubt that the interrupt (exception) and syscall should be handled
> differently. Especially, this exception is injected by uprobes but
> syscall will be caused by itself. But syscall

[PATCH v3 2/2] rethook: honor CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING in rethook_try_get()

2024-04-03 Thread Andrii Nakryiko
Take into account CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING when validating
that RCU is watching when trying to set up a rethook on a function entry.

This further (in addition to improvements in the previous patch)
improves BPF multi-kretprobe (which relies on rethook) runtime throughput
by 2.3%, according to BPF benchmarks ([0]).

  [0] 
https://lore.kernel.org/bpf/caef4bzauq2wkmjzdc9s0rbwa01bybgwhn6andxqshyia47p...@mail.gmail.com/

Signed-off-by: Andrii Nakryiko 
---
 kernel/trace/rethook.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
index fa03094e9e69..15b8aa4048d9 100644
--- a/kernel/trace/rethook.c
+++ b/kernel/trace/rethook.c
@@ -166,6 +166,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh)
if (unlikely(!handler))
return NULL;
 
+#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
/*
 * This expects the caller will set up a rethook on a function entry.
 * When the function returns, the rethook will eventually be reclaimed
@@ -174,6 +175,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh)
 */
if (unlikely(!rcu_is_watching()))
return NULL;
+#endif
 
return (struct rethook_node *)objpool_pop(&rh->pool);
 }
-- 
2.43.0




[PATCH v3 1/2] ftrace: make extra rcu_is_watching() validation check optional

2024-04-03 Thread Andrii Nakryiko
Introduce CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to
control whether ftrace low-level code performs additional
rcu_is_watching()-based validation logic in an attempt to catch noinstr
violations.

This check is expected to never be true and is mostly useful for
low-level validation of ftrace subsystem invariants. For most users it
should probably be kept disabled to eliminate unnecessary runtime
overhead.

This improves BPF multi-kretprobe (relying on ftrace and rethook
infrastructure) runtime throughput by 2%, according to BPF benchmarks ([0]).

  [0] 
https://lore.kernel.org/bpf/caef4bzauq2wkmjzdc9s0rbwa01bybgwhn6andxqshyia47p...@mail.gmail.com/

Cc: Steven Rostedt 
Cc: Masami Hiramatsu 
Cc: Paul E. McKenney 
Signed-off-by: Andrii Nakryiko 
---
 include/linux/trace_recursion.h |  2 +-
 kernel/trace/Kconfig| 13 +
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
index d48cd92d2364..24ea8ac049b4 100644
--- a/include/linux/trace_recursion.h
+++ b/include/linux/trace_recursion.h
@@ -135,7 +135,7 @@ extern void ftrace_record_recursion(unsigned long ip, 
unsigned long parent_ip);
 # define do_ftrace_record_recursion(ip, pip)   do { } while (0)
 #endif
 
-#ifdef CONFIG_ARCH_WANTS_NO_INSTR
+#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
 # define trace_warn_on_no_rcu(ip)  \
({  \
bool __ret = !rcu_is_watching();\
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 61c541c36596..7aebd1b8f93e 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -974,6 +974,19 @@ config FTRACE_RECORD_RECURSION_SIZE
  This file can be reset, but the limit can not change in
  size at runtime.
 
+config FTRACE_VALIDATE_RCU_IS_WATCHING
+   bool "Validate RCU is on during ftrace execution"
+   depends on FUNCTION_TRACER
+   depends on ARCH_WANTS_NO_INSTR
+   help
+ All callbacks that attach to the function tracing have some sort of
+ protection against recursion. This option is only to verify that
+ ftrace (and other users of ftrace_test_recursion_trylock()) are not
+ called outside of RCU, as if they are, it can cause a race. But it
+ also has a noticeable overhead when enabled.
+
+ If unsure, say N
+
 config RING_BUFFER_RECORD_RECURSION
bool "Record functions that recurse in the ring buffer"
depends on FTRACE_RECORD_RECURSION
-- 
2.43.0




Re: [PATCH] uprobes: reduce contention on uprobes_tree access

2024-04-03 Thread Andrii Nakryiko
On Wed, Apr 3, 2024 at 4:05 AM Jonathan Haslam  wrote:
>
> > > > > Given the discussion around per-cpu rw semaphore and need for
> > > > > (internal) batched attachment API for uprobes, do you think you can
> > > > > apply this patch as is for now? We can then gain initial improvements
> > > > > in scalability that are also easy to backport, and Jonathan will work
> > > > > on a more complete solution based on per-cpu RW semaphore, as
> > > > > suggested by Ingo.
> > > >
> > > > Yeah, it is interesting to use per-cpu rw semaphore on uprobe.
> > > > I would like to wait for the next version.
> > >
> > > My initial tests show a nice improvement on the over RW spinlocks but
> > > significant regression in acquiring a write lock. I've got a few days
> > > vacation over Easter but I'll aim to get some more formalised results out
> > > to the thread toward the end of next week.
> >
> > As far as the write lock is only on the cold path, I think you can choose
> > per-cpu RW semaphore. Since it does not do busy wait, the total system
> > performance impact will be small.
> > I look forward to your formalized results :)
>
> Sorry for the delay in getting back to you on this Masami.
>
> I have used one of the bpf selftest benchmarks to provide some form of
> comparison of the 3 different approaches (spinlock, RW spinlock and
> per-cpu RW semaphore). The benchmark used here is the 'trig-uprobe-nop'
> benchmark which just executes a single uprobe with a minimal bpf program
> attached. The tests were done on a 32 core qemu/kvm instance.
>

Thanks a lot for running benchmarks and providing results!

> Things to note about the results:
>
> - The results are slightly variable so don't get too caught up on
>   individual thread count - it's the trend that is important.
> - In terms of throughput with this specific benchmark a *very* macro view
>   is that the RW spinlock provides 40-60% more throughput than the
>   spinlock.  The per-CPU RW semaphore provides in the order of 50-100%
>   more throughput than the spinlock.
> - This doesn't fully reflect the large reduction in latency that we have
>   seen in application based measurements. However, it does demonstrate
>   that even the trivial change of going to a RW spinlock provides
>   significant benefits.

This is probably because trig-uprobe-nop creates a single uprobe that
is triggered on many CPUs, while in production we also have *many*
uprobes running on many CPUs. In this benchmark, besides contention on
uprobes_treelock, we are also hammering on other per-uprobe locks
(register_rwsem; also, if you don't have the [0] patch locally, there will
be another filter lock, filter->rwlock, taken each time). There is also
atomic refcounting going on, which, when you have the same uprobe hit
across all CPUs at the same time, will cause a bunch of cache line
bouncing.

So yes, it's understandable that in practice in production you see an
even larger effect of optimizing uprobe_treelock than in this
micro-benchmark.

  [0] 
https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git/commit/?h=probes/for-next&id=366f7afd3de31d3ce2f4cbff97c6c23b6aa6bcdf

>
> I haven't included the measurements on per-CPU RW semaphore write
> performance as they are completely in line with those that Paul McKenney
> posted on his journal [0]. On a 32 core system I see semaphore writes to
> take in the order of 25-28 millisecs - the cost of the synchronize_rcu().
>
> Each block of results below shows 1 line per execution of the benchmark (the
> "Summary" line) and each line is a run with one more thread added - a
> thread is a "producer". The lines are edited to remove extraneous output
> that adds no value here.
>
> The tests were executed with this driver script:
>
> for num_threads in {1..20}
> do
> sudo ./bench -p $num_threads trig-uprobe-nop | grep Summary

just want to mention the -a (affinity) option that you can pass to the bench
tool; it will pin each thread on its own CPU. It generally makes tests
more uniform, eliminating CPU migration variability.
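
For example (the same driver loop as above, just with affinity pinning; assuming -a takes no argument):

    sudo ./bench -a -p $num_threads trig-uprobe-nop | grep Summary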

> done
>
>
> spinlock
>
> Summary: hits 1.453 ± 0.005M/s (  1.453M/prod)
> Summary: hits 2.087 ± 0.005M/s (  1.043M/prod)
> Summary: hits 2.701 ± 0.012M/s (  0.900M/prod)

I also wanted to point out that the first measurement (1.453M/s in
this row) is total throughput across all threads, while value in
parenthesis (0.900M/prod) is averaged throughput per each thread. So
this M/prod value is the most interesting in this benchmark where we
assess the effect of reducing contention.

> Summary: hits 1.917 ± 0.011M/s (  0.479M/prod)
> Summary: hits 2.105 ± 0.003M/s (  0.421M/prod)
> Summary: hits 1.615 ± 0.006M/s (  0.269M/prod)

[...]



Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe

2024-04-03 Thread Andrii Nakryiko
On Wed, Apr 3, 2024 at 7:09 AM Masami Hiramatsu  wrote:
>
> On Wed, 3 Apr 2024 11:47:41 +0200
> Jiri Olsa  wrote:
>
> > On Wed, Apr 03, 2024 at 10:07:08AM +0900, Masami Hiramatsu wrote:
> > > Hi Jiri,
> > >
> > > On Tue,  2 Apr 2024 11:33:00 +0200
> > > Jiri Olsa  wrote:
> > >
> > > > Adding uretprobe syscall instead of trap to speed up return probe.
> > >
> > > This is an interesting approach. But I doubt we need to add an additional
> > > syscall just for this purpose. Can't we use another syscall or ioctl?
> >
> > so the plan is to optimize entry uprobe in a similar way and given
> > the syscall is not a scarce resource I wanted to add another syscall
> > for that one as well
> >
> > tbh I'm not sure which syscall or ioctl to reuse for this, it's
> > possible to do that, the trampoline will just have to save one or
> > more additional registers, but adding new syscall seems cleaner to me
>
> Hmm, I think a similar syscall is ptrace? prctl may also be a candidate.

I think both ptrace and prctl are for completely different use cases
and it would be an abuse of existing API to reuse them for uretprobe
tracing. Also, keep in mind, that any extra argument that has to be
passed into this syscall means that we need to complicate and slow
generated assembly code that is injected into user process (to
save/restore registers) and also kernel-side (again, to deal with all
the extra registers that would be stored/restored on stack).

Given syscalls are not some kind of scarce resources, what's the
downside to have a dedicated and simple syscall?

>
> >
> > >
> > > Also, we should run syzkaller on this syscall. And if uretprobe is
> >
> > right, I'll check on syzkaller
> >
> > > set in the user function, what happen if the user function directly
> > > calls this syscall? (maybe it consumes shadow stack?)
> >
> > the process should receive SIGILL if there's no pending uretprobe for
> > the current task, or it will trigger uretprobe if there's one pending
>
> No, that is too aggressive and not safe. Since the syscall is exposed to
> user program, it should return appropriate error code instead of SIGILL.
>

This is the way it is today with uretprobes even through the interrupt.
E.g., it could happen that a user process is using fibers and is
replacing the stack pointer without the kernel realizing this, which will
trigger some defensive checks in the uretprobe handling code, and the
kernel will send SIGILL because it can't support such cases. This is
happening today already, and it works fine in practice (except for
applications that manually change the stack pointer; too bad, you can't
trace them with uretprobes, unfortunately).

So I think it's absolutely adequate to have this behavior if the user
process is *intentionally* abusing this API.

> >
> > but we could limit the syscall to be executed just from the trampoline,
> > that should prevent all the user space use cases, I'll do that in next
> > version and add more tests for that
>
> Why not limit? :) The uprobe_handle_trampoline() expects it is called
> only from the trampoline, so it is natural to check the caller address.
> (and uprobe should know where is the trampoline)
>
> Since the syscall is always exposed to the user program, it should
> - Do nothing and return an error unless it is properly called.
> - check the prerequisites for operation strictly.
> I concern that new system calls introduce vulnerabilities.
>

As Oleg and Jiri mentioned, this syscall can't harm the kernel or other
processes, only the process that is abusing the API. So any extra
checks that would slow down this approach are unnecessary overhead
and complication that will never be useful in practice.

Also note that sys_uretprobe is a kind of internal and unstable API
and it is explicitly called out that its contract can change at any
time and user space shouldn't rely on it. It's purely for the kernel's
own usage.

So let's please keep it fast and simple.


> Thank you,
>
>
> >
> > thanks,
> > jirka
> >
> >
> > >

[...]



Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional

2024-04-02 Thread Andrii Nakryiko
On Tue, Apr 2, 2024 at 9:00 PM Andrii Nakryiko
 wrote:
>
> On Tue, Apr 2, 2024 at 5:52 PM Steven Rostedt  wrote:
> >
> > On Wed, 3 Apr 2024 09:40:48 +0900
> > Masami Hiramatsu (Google)  wrote:
> >
> > > OK, for me, this last sentence is preferred for the help message. That 
> > > explains
> > > what this is for.
> > >
> > > All callbacks that attach to the function tracing have some sort
> > > of protection against recursion. This option is only to verify 
> > > that
> > >    ftrace (and other users of ftrace_test_recursion_trylock()) are not
> > > called outside of RCU, as if they are, it can cause a race.
> > > But it also has a noticeable overhead when enabled.
>
> Sounds good to me, I can add this to the description of the Kconfig option.
>
> > >
> > > BTW, how much overhead does this introduce? and the race case a kernel 
> > > crash?
>
> I just checked our fleet-wide production data for the last 24 hours.
> Within the kprobe/kretprobe code path (ftrace_trampoline and
> everything called from it), rcu_is_watching (both calls, see below)
> cause 0.484% CPU cycles usage, which isn't nothing. So definitely we'd
> prefer to be able to avoid that in production use cases.
>

I just ran synthetic microbenchmark testing multi-kretprobe
throughput. We get (in millions of BPF kretprobe-multi program
invocations per second):
  - 5.568M/s as baseline;
  - 5.679M/s with changes in this patch (+2% throughput improvement);
  - 5.808M/s with disabling rcu_is_watching in rethook_try_get()
(+2.3% more vs just one of rcu_is_watching, and +4.3% vs baseline).

It's definitely noticeable.

> > > or just messed up the ftrace buffer?
> >
> > There's a hypothetical race where it can cause a use after free.
> >
> > That is, after you shutdown ftrace, you need to call 
> > synchronize_rcu_tasks(),
> > which requires RCU to be watching. There's a theoretical case where that
> > task calls the trampoline and misses the synchronization. Note, these
> > locations are with preemption disabled, as rcu is always watching when
> > preemption is enabled. Thus it would be extremely fast where as the
> > synchronize_rcu_tasks() is rather slow.
> >
> > We also have synchronize_rcu_tasks_rude() which would actually keep the
> > trace from happening, as it would schedule on each CPU forcing all CPUs to
> > have RCU watching.
> >
> > I have never heard of this race being hit. I guess it could happen on a VM
> > where a vCPU gets preempted at the right moment for a long time and the
> > other CPUs synchronize.
> >
> > But like lockdep, where deadlocks can crash the kernel, we don't enable it
> > for production.
> >
> > The overhead is another function call within the function tracer. I had
> > numbers before, but I guess I could run tests again and get new numbers.
> >
>
> I just noticed another rcu_is_watching() call, in rethook_try_get(),
> which seems to be a similar and complementary validation check to the
> one we are putting under CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING option
> in this patch. It feels like both of them should be controlled by the
> same settings. WDYT? Can I add CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
> guard around rcu_is_watching() check in rethook_try_get() as well?
>
>
> > Thanks,
> >
> > -- Steve



Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional

2024-04-02 Thread Andrii Nakryiko
On Tue, Apr 2, 2024 at 5:52 PM Steven Rostedt  wrote:
>
> On Wed, 3 Apr 2024 09:40:48 +0900
> Masami Hiramatsu (Google)  wrote:
>
> > OK, for me, this last sentence is preferred for the help message. That 
> > explains
> > what this is for.
> >
> > All callbacks that attach to the function tracing have some sort
> > of protection against recursion. This option is only to verify that
> >    ftrace (and other users of ftrace_test_recursion_trylock()) are not
> > called outside of RCU, as if they are, it can cause a race.
> > But it also has a noticeable overhead when enabled.

Sounds good to me, I can add this to the description of the Kconfig option.

> >
> > BTW, how much overhead does this introduce? and the race case a kernel 
> > crash?

I just checked our fleet-wide production data for the last 24 hours.
Within the kprobe/kretprobe code path (ftrace_trampoline and
everything called from it), rcu_is_watching (both calls, see below)
cause 0.484% CPU cycles usage, which isn't nothing. So definitely we'd
prefer to be able to avoid that in production use cases.

> > or just messed up the ftrace buffer?
>
> There's a hypothetical race where it can cause a use after free.
>
> That is, after you shutdown ftrace, you need to call synchronize_rcu_tasks(),
> which requires RCU to be watching. There's a theoretical case where that
> task calls the trampoline and misses the synchronization. Note, these
> locations are with preemption disabled, as rcu is always watching when
> preemption is enabled. Thus it would be extremely fast where as the
> synchronize_rcu_tasks() is rather slow.
>
> We also have synchronize_rcu_tasks_rude() which would actually keep the
> trace from happening, as it would schedule on each CPU forcing all CPUs to
> have RCU watching.
>
> I have never heard of this race being hit. I guess it could happen on a VM
> where a vCPU gets preempted at the right moment for a long time and the
> other CPUs synchronize.
>
> But like lockdep, where deadlocks can crash the kernel, we don't enable it
> for production.
>
> The overhead is another function call within the function tracer. I had
> numbers before, but I guess I could run tests again and get new numbers.
>

I just noticed another rcu_is_watching() call, in rethook_try_get(),
which seems to be a similar and complementary validation check to the
one we are putting under CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING option
in this patch. It feels like both of them should be controlled by the
same settings. WDYT? Can I add CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
guard around rcu_is_watching() check in rethook_try_get() as well?


> Thanks,
>
> -- Steve



Re: [PATCH bpf-next] rethook: Remove warning messages printed for finding return address of a frame.

2024-04-02 Thread Andrii Nakryiko
On Mon, Apr 1, 2024 at 12:16 PM Kui-Feng Lee  wrote:
>
> rethook_find_ret_addr() prints a warning message and returns 0 when the
> target task is running and not the "current" task to prevent returning an
> incorrect return address. However, this check is incomplete as the target
> task can still transition to the running state while the return address
> is being found, although this is safe under RCU.
>
> The issue we encounter is that the kernel frequently prints warning
> messages when BPF profiling programs call to bpf_get_task_stack() on
> running tasks.
>
> The callers should be aware of, and willing to take, the risk of receiving
> an incorrect return address from a running task other than the "current"
> one. A warning is not needed here as the callers knowingly accept that
> risk.
>
> Signed-off-by: Kui-Feng Lee 
> ---
>  kernel/trace/rethook.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c
> index fa03094e9e69..4297a132a7ae 100644
> --- a/kernel/trace/rethook.c
> +++ b/kernel/trace/rethook.c
> @@ -248,7 +248,7 @@ unsigned long rethook_find_ret_addr(struct task_struct 
> *tsk, unsigned long frame
> if (WARN_ON_ONCE(!cur))
> return 0;
>
> -   if (WARN_ON_ONCE(tsk != current && task_is_running(tsk)))
> +   if (tsk != current && task_is_running(tsk))
> return 0;
>

This should probably go through Masami's tree, but the change makes
sense to me, given this is an expected condition.

Acked-by: Andrii Nakryiko 

> do {
> --
> 2.34.1
>
>



Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional

2024-04-01 Thread Andrii Nakryiko
On Mon, Apr 1, 2024 at 5:38 PM Masami Hiramatsu  wrote:
>
> On Mon, 1 Apr 2024 12:09:18 -0400
> Steven Rostedt  wrote:
>
> > On Mon, 1 Apr 2024 20:25:52 +0900
> > Masami Hiramatsu (Google)  wrote:
> >
> > > > Masami,
> > > >
> > > > Are you OK with just keeping it set to N.
> > >
> > > OK, if it is only for the debugging, I'm OK to set N this.
> > >
> > > >
> > > > We could have other options like PROVE_LOCKING enable it.
> > >
> > > Agreed (but it should say this is a debug option)
> >
> > It does say "Validate" which to me is a debug option. What would you
> > suggest?
>
> I think the help message should have "This is for debugging ftrace."
>

Sent v2 with adjusted wording, thanks!

> Thank you,
>
> >
> > -- Steve
>
>
> --
> Masami Hiramatsu (Google) 



[PATCH v2] ftrace: make extra rcu_is_watching() validation check optional

2024-04-01 Thread Andrii Nakryiko
Introduce CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to
control whether ftrace low-level code performs additional
rcu_is_watching()-based validation logic in an attempt to catch noinstr
violations.

This check is expected to never be true and is mostly useful for
low-level debugging of ftrace subsystem. For most users it should
probably be kept disabled to eliminate unnecessary runtime overhead.

Cc: Steven Rostedt 
Cc: Masami Hiramatsu 
Cc: Paul E. McKenney 
Signed-off-by: Andrii Nakryiko 
---
 include/linux/trace_recursion.h |  2 +-
 kernel/trace/Kconfig| 14 ++
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
index d48cd92d2364..24ea8ac049b4 100644
--- a/include/linux/trace_recursion.h
+++ b/include/linux/trace_recursion.h
@@ -135,7 +135,7 @@ extern void ftrace_record_recursion(unsigned long ip, 
unsigned long parent_ip);
 # define do_ftrace_record_recursion(ip, pip)   do { } while (0)
 #endif
 
-#ifdef CONFIG_ARCH_WANTS_NO_INSTR
+#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
 # define trace_warn_on_no_rcu(ip)  \
({  \
bool __ret = !rcu_is_watching();\
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 61c541c36596..fcf45d5c60cb 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -974,6 +974,20 @@ config FTRACE_RECORD_RECURSION_SIZE
  This file can be reset, but the limit can not change in
  size at runtime.
 
+config FTRACE_VALIDATE_RCU_IS_WATCHING
+   bool "Validate RCU is on during ftrace recursion check"
+   depends on FUNCTION_TRACER
+   depends on ARCH_WANTS_NO_INSTR
+   help
+ All callbacks that attach to the function tracing have some sort
+ of protection against recursion. This option performs additional
+ checks to make sure RCU is on when ftrace callbacks recurse.
+
+ This is a feature useful for debugging ftrace. This will add more
+ overhead to all ftrace-based invocations.
+
+ If unsure, say N
+
 config RING_BUFFER_RECORD_RECURSION
bool "Record functions that recurse in the ring buffer"
depends on FTRACE_RECORD_RECURSION
-- 
2.43.0




Re: [PATCH] uprobes: reduce contention on uprobes_tree access

2024-03-29 Thread Andrii Nakryiko
On Fri, Mar 29, 2024 at 5:36 PM Masami Hiramatsu  wrote:
>
> On Fri, 29 Mar 2024 10:33:57 -0700
> Andrii Nakryiko  wrote:
>
> > On Wed, Mar 27, 2024 at 5:45 PM Andrii Nakryiko
> >  wrote:
> > >
> > > On Wed, Mar 27, 2024 at 5:18 PM Masami Hiramatsu  
> > > wrote:
> > > >
> > > > On Wed, 27 Mar 2024 17:06:01 +
> > > > Jonathan Haslam  wrote:
> > > >
> > > > > > > Masami,
> > > > > > >
> > > > > > > Given the discussion around per-cpu rw semaphore and need for
> > > > > > > (internal) batched attachment API for uprobes, do you think you 
> > > > > > > can
> > > > > > > apply this patch as is for now? We can then gain initial 
> > > > > > > improvements
> > > > > > > in scalability that are also easy to backport, and Jonathan will 
> > > > > > > work
> > > > > > > on a more complete solution based on per-cpu RW semaphore, as
> > > > > > > suggested by Ingo.
> > > > > >
> > > > > > Yeah, it is interesting to use per-cpu rw semaphore on uprobe.
> > > > > > I would like to wait for the next version.
> > > > >
> > > > > My initial tests show a nice improvement over the RW spinlocks but
> > > > > significant regression in acquiring a write lock. I've got a few days
> > > > > vacation over Easter but I'll aim to get some more formalised results 
> > > > > out
> > > > > to the thread toward the end of next week.
> > > >
> > > > As far as the write lock is only on the cold path, I think you can 
> > > > choose
> > > > per-cpu RW semaphore. Since it does not do busy wait, the total system
> > > > performance impact will be small.
> > >
> > > No, Masami, unfortunately it's not as simple. In BPF we have BPF
> > > multi-uprobe, which can be used to attach to thousands of user
> > > functions. It currently creates one uprobe at a time, as we don't
> > > really have a batched API. If each such uprobe registration will now
> > > take a (relatively) long time, when multiplied by number of attach-to
> > > user functions, it will be a horrible regression in terms of
> > > attachment/detachment performance.
>
> Ah, got it. So attachment/detachment performance should be counted.
>
> > >
> > > So when we switch to per-CPU rw semaphore, we'll need to provide an
> > > internal batch uprobe attach/detach API to make sure that attaching to
> > > multiple uprobes is still fast.
>
> Yeah, we need such interface like register_uprobes(...).
>
> > >
> > > Which is why I was asking to land this patch as is, as it relieves the
> > > scalability pains in production and is easy to backport to old
> > > kernels. And then we can work on batched APIs and switch to per-CPU rw
> > > semaphore.
>
> OK, then I'll push this to for-next at this moment.

Great, thanks a lot!

> Please share if you have a good idea for the batch interface which can be
> backported. I guess it should involve updating userspace changes too.
>

Yep, we'll investigate the best way to provide a batch interface for
uprobes and will send patches.
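
(Purely for illustration, such a batch interface could have roughly the
shape below -- nothing like this exists yet, and every name and the
signature are made up for the sake of discussion:)

/* hypothetical batched registration API, names are illustrative only */
struct uprobe_batch_entry {
	loff_t offset;			/* probe offset within the inode */
	loff_t ref_ctr_offset;		/* USDT semaphore offset, 0 if unused */
	struct uprobe_consumer *consumer;
};

int uprobe_register_batch(struct inode *inode,
			  struct uprobe_batch_entry *entries, int cnt);
void uprobe_unregister_batch(struct inode *inode,
			     struct uprobe_batch_entry *entries, int cnt);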

> Thank you!
>
> > >
> > > So I hope you can reconsider and accept improvements in this patch,
> > > while Jonathan will keep working on even better final solution.
> > > Thanks!
> > >
> > > > I look forward to your formalized results :)
> > > >
> >
> > BTW, as part of BPF selftests, we have a multi-attach test for uprobes
> > and USDTs, reporting attach/detach timings:
> > $ sudo ./test_progs -v -t uprobe_multi_test/bench
> > bpf_testmod.ko is already unloaded.
> > Loading bpf_testmod.ko...
> > Successfully loaded bpf_testmod.ko.
> > test_bench_attach_uprobe:PASS:uprobe_multi_bench__open_and_load 0 nsec
> > test_bench_attach_uprobe:PASS:uprobe_multi_bench__attach 0 nsec
> > test_bench_attach_uprobe:PASS:uprobes_count 0 nsec
> > test_bench_attach_uprobe: attached in   0.120s
> > test_bench_attach_uprobe: detached in   0.092s
> > #400/5   uprobe_multi_test/bench_uprobe:OK
> > test_bench_attach_usdt:PASS:uprobe_multi__open 0 nsec
> > test_bench_attach_usdt:PASS:bpf_program__attach_usdt 0 nsec
> > test_bench_attach_usdt:PASS:usdt_count 0 nsec
> > test_bench_attach_usdt: attached in   0.124s
> > test_bench_attach_usdt: detached in   0.064s
&

Re: [PATCH] uprobes: reduce contention on uprobes_tree access

2024-03-29 Thread Andrii Nakryiko
On Wed, Mar 27, 2024 at 5:45 PM Andrii Nakryiko
 wrote:
>
> On Wed, Mar 27, 2024 at 5:18 PM Masami Hiramatsu  wrote:
> >
> > On Wed, 27 Mar 2024 17:06:01 +
> > Jonathan Haslam  wrote:
> >
> > > > > Masami,
> > > > >
> > > > > Given the discussion around per-cpu rw semaphore and need for
> > > > > (internal) batched attachment API for uprobes, do you think you can
> > > > > apply this patch as is for now? We can then gain initial improvements
> > > > > in scalability that are also easy to backport, and Jonathan will work
> > > > > on a more complete solution based on per-cpu RW semaphore, as
> > > > > suggested by Ingo.
> > > >
> > > > Yeah, it is interesting to use per-cpu rw semaphore on uprobe.
> > > > I would like to wait for the next version.
> > >
> > > My initial tests show a nice improvement over the RW spinlocks but
> > > significant regression in acquiring a write lock. I've got a few days
> > > vacation over Easter but I'll aim to get some more formalised results out
> > > to the thread toward the end of next week.
> >
> > As far as the write lock is only on the cold path, I think you can choose
> > per-cpu RW semaphore. Since it does not do busy wait, the total system
> > performance impact will be small.
>
> No, Masami, unfortunately it's not as simple. In BPF we have BPF
> multi-uprobe, which can be used to attach to thousands of user
> functions. It currently creates one uprobe at a time, as we don't
> really have a batched API. If each such uprobe registration will now
> take a (relatively) long time, when multiplied by the number of attach-to
> user functions, it will be a horrible regression in terms of
> attachment/detachment performance.
>
> So when we switch to per-CPU rw semaphore, we'll need to provide an
> internal batch uprobe attach/detach API to make sure that attaching to
> multiple uprobes is still fast.
>
> Which is why I was asking to land this patch as is, as it relieves the
> scalability pains in production and is easy to backport to old
> kernels. And then we can work on batched APIs and switch to per-CPU rw
> semaphore.
>
> So I hope you can reconsider and accept improvements in this patch,
> while Jonathan will keep working on even better final solution.
> Thanks!
>
> > I look forward to your formalized results :)
> >

BTW, as part of BPF selftests, we have a multi-attach test for uprobes
and USDTs, reporting attach/detach timings:
$ sudo ./test_progs -v -t uprobe_multi_test/bench
bpf_testmod.ko is already unloaded.
Loading bpf_testmod.ko...
Successfully loaded bpf_testmod.ko.
test_bench_attach_uprobe:PASS:uprobe_multi_bench__open_and_load 0 nsec
test_bench_attach_uprobe:PASS:uprobe_multi_bench__attach 0 nsec
test_bench_attach_uprobe:PASS:uprobes_count 0 nsec
test_bench_attach_uprobe: attached in   0.120s
test_bench_attach_uprobe: detached in   0.092s
#400/5   uprobe_multi_test/bench_uprobe:OK
test_bench_attach_usdt:PASS:uprobe_multi__open 0 nsec
test_bench_attach_usdt:PASS:bpf_program__attach_usdt 0 nsec
test_bench_attach_usdt:PASS:usdt_count 0 nsec
test_bench_attach_usdt: attached in   0.124s
test_bench_attach_usdt: detached in   0.064s
#400/6   uprobe_multi_test/bench_usdt:OK
#400 uprobe_multi_test:OK
Summary: 1/2 PASSED, 0 SKIPPED, 0 FAILED
Successfully unloaded bpf_testmod.ko.

So it should be easy for Jonathan to validate his changes with this.

> > Thank you,
> >
> > >
> > > Jon.
> > >
> > > >
> > > > Thank you,
> > > >
> > > > >
> > > > > >
> > > > > > BTW, how did you measure the overhead? I think spinlock overhead
> > > > > > will depend on how much lock contention happens.
> > > > > >
> > > > > > Thank you,
> > > > > >
> > > > > > >
> > > > > > > [0] https://docs.kernel.org/locking/spinlocks.html
> > > > > > >
> > > > > > > Signed-off-by: Jonathan Haslam 
> > > > > > > ---
> > > > > > >  kernel/events/uprobes.c | 22 +++---
> > > > > > >  1 file changed, 11 insertions(+), 11 deletions(-)
> > > > > > >
> > > > > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > > > > > index 929e98c62965..42bf9b6e8bc0 100644
> > > > > > > --- a/kernel/events/uprobes.c
> > > > > > > +++ b/kernel/events/uprobes.c
> > > > > > > @@ -39,7 +39,7

Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional

2024-03-29 Thread Andrii Nakryiko
On Tue, Mar 26, 2024 at 11:58 AM Steven Rostedt  wrote:
>
> On Tue, 26 Mar 2024 09:16:33 -0700
> Andrii Nakryiko  wrote:
>
> > > It's no different than lockdep. Test boxes should have it enabled, but
> > > there's no reason to have this enabled in a production system.
> > >
> >
> > I tend to agree with Steven here (which is why I sent this patch as it
> > is), but I'm happy to do it as an opt-out, if Masami insists. Please
> > do let me know if I need to send v2 or this one is actually the one
> > we'll end up using. Thanks!
>
> Masami,
>
> Are you OK with just keeping it set to N.
>
> We could have other options like PROVE_LOCKING enable it.
>

So what's the conclusion, Masami? Should I send another version where
this config is opt-out, or are you ok with keeping it as opt-in as
proposed in this revision?

> -- Steve



Re: [PATCH] uprobes: reduce contention on uprobes_tree access

2024-03-27 Thread Andrii Nakryiko
On Wed, Mar 27, 2024 at 5:18 PM Masami Hiramatsu  wrote:
>
> On Wed, 27 Mar 2024 17:06:01 +
> Jonathan Haslam  wrote:
>
> > > > Masami,
> > > >
> > > > Given the discussion around per-cpu rw semaphore and need for
> > > > (internal) batched attachment API for uprobes, do you think you can
> > > > apply this patch as is for now? We can then gain initial improvements
> > > > in scalability that are also easy to backport, and Jonathan will work
> > > > on a more complete solution based on per-cpu RW semaphore, as
> > > > suggested by Ingo.
> > >
> > > Yeah, it is interesting to use per-cpu rw semaphore on uprobe.
> > > I would like to wait for the next version.
> >
> > My initial tests show a nice improvement over the RW spinlocks but
> > significant regression in acquiring a write lock. I've got a few days
> > vacation over Easter but I'll aim to get some more formalised results out
> > to the thread toward the end of next week.
>
> As far as the write lock is only on the cold path, I think you can choose
> per-cpu RW semaphore. Since it does not do busy wait, the total system
> performance impact will be small.

No, Masami, unfortunately it's not as simple. In BPF we have BPF
multi-uprobe, which can be used to attach to thousands of user
functions. It currently creates one uprobe at a time, as we don't
really have a batched API. If each such uprobe registration will now
take a (relatively) long time, when multiplied by the number of attach-to
user functions, it will be a horrible regression in terms of
attachment/detachment performance.

So when we switch to per-CPU rw semaphore, we'll need to provide an
internal batch uprobe attach/detach API to make sure that attaching to
multiple uprobes is still fast.

Which is why I was asking to land this patch as is, as it relieves the
scalability pains in production and is easy to backport to old
kernels. And then we can work on batched APIs and switch to per-CPU rw
semaphore.

So I hope you can reconsider and accept improvements in this patch,
while Jonathan will keep working on even better final solution.
Thanks!

> I look forward to your formalized results :)
>
> Thank you,
>
> >
> > Jon.
> >
> > >
> > > Thank you,
> > >
> > > >
> > > > >
> > > > > BTW, how did you measure the overhead? I think spinlock overhead
> > > > > will depend on how much lock contention happens.
> > > > >
> > > > > Thank you,
> > > > >
> > > > > >
> > > > > > [0] https://docs.kernel.org/locking/spinlocks.html
> > > > > >
> > > > > > Signed-off-by: Jonathan Haslam 
> > > > > > ---
> > > > > >  kernel/events/uprobes.c | 22 +++---
> > > > > >  1 file changed, 11 insertions(+), 11 deletions(-)
> > > > > >
> > > > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > > > > > index 929e98c62965..42bf9b6e8bc0 100644
> > > > > > --- a/kernel/events/uprobes.c
> > > > > > +++ b/kernel/events/uprobes.c
> > > > > > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
> > > > > >   */
> > > > > >  #define no_uprobe_events()   RB_EMPTY_ROOT(_tree)
> > > > > >
> > > > > > -static DEFINE_SPINLOCK(uprobes_treelock);/* serialize rbtree 
> > > > > > access */
> > > > > > +static DEFINE_RWLOCK(uprobes_treelock);  /* serialize rbtree 
> > > > > > access */
> > > > > >
> > > > > >  #define UPROBES_HASH_SZ  13
> > > > > >  /* serialize uprobe->pending_list */
> > > > > > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode 
> > > > > > *inode, loff_t offset)
> > > > > >  {
> > > > > >   struct uprobe *uprobe;
> > > > > >
> > > > > > - spin_lock(_treelock);
> > > > > > + read_lock(_treelock);
> > > > > >   uprobe = __find_uprobe(inode, offset);
> > > > > > - spin_unlock(_treelock);
> > > > > > + read_unlock(_treelock);
> > > > > >
> > > > > >   return uprobe;
> > > > > >  }
> > > > > > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct 
> > > > > > uprobe *uprobe)
> > > > > >  {
> > > > > >   struct uprobe *u;
> > > > > >
> > > > > > - spin_lock(_treelock);
> > > > > > + write_lock(_treelock);
> > > > > >   u = __insert_uprobe(uprobe);
> > > > > > - spin_unlock(_treelock);
> > > > > > + write_unlock(_treelock);
> > > > > >
> > > > > >   return u;
> > > > > >  }
> > > > > > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe)
> > > > > >   if (WARN_ON(!uprobe_is_active(uprobe)))
> > > > > >   return;
> > > > > >
> > > > > > - spin_lock(_treelock);
> > > > > > + write_lock(_treelock);
> > > > > >   rb_erase(>rb_node, _tree);
> > > > > > - spin_unlock(_treelock);
> > > > > > + write_unlock(_treelock);
> > > > > >   RB_CLEAR_NODE(>rb_node); /* for uprobe_is_active() */
> > > > > >   put_uprobe(uprobe);
> > > > > >  }
> > > > > > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode 
> > > > > > *inode,
> > > > > >   min = vaddr_to_offset(vma, start);
> > > > > >   max = min + (end - start) - 1;
> > > 

Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional

2024-03-26 Thread Andrii Nakryiko
On Mon, Mar 25, 2024 at 3:11 PM Steven Rostedt  wrote:
>
> On Mon, 25 Mar 2024 11:38:48 +0900
> Masami Hiramatsu (Google)  wrote:
>
> > On Fri, 22 Mar 2024 09:03:23 -0700
> > Andrii Nakryiko  wrote:
> >
> > > Introduce CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to
> > > control whether ftrace low-level code performs additional
> > > rcu_is_watching()-based validation logic in an attempt to catch noinstr
> > > violations.
> > >
> > > This check is expected to never be true in practice and would be best
> > > controlled with extra config to let users decide if they are willing to
> > > pay the price.
> >
> > Hmm, for me, it sounds like "WARN_ON(something) never be true in practice
> > so disable it by default". I think CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
> > is OK, but that should be set to Y by default. If you have already verified
> > that your system never make it true and you want to optimize your ftrace
> > path, you can manually set it to N at your own risk.
> >
>
> Really, it's for debugging. I would argue that it should *not* be default y.
> Peter added this to find all the locations that could be called where RCU
> is not watching. But the issue I have is that this is that it *does cause
> overhead* with function tracing.
>
> I believe we found pretty much all locations that were an issue, and we
> should now just make it an option for developers.
>
> It's no different than lockdep. Test boxes should have it enabled, but
> there's no reason to have this enabled in a production system.
>

I tend to agree with Steven here (which is why I sent this patch as it
is), but I'm happy to do it as an opt-out, if Masami insists. Please
do let me know if I need to send v2 or this one is actually the one
we'll end up using. Thanks!

> -- Steve
>
>
> > >
> > > Cc: Steven Rostedt 
> > > Cc: Masami Hiramatsu 
> > > Cc: Paul E. McKenney 
> > > Signed-off-by: Andrii Nakryiko 
> > > ---
> > >  include/linux/trace_recursion.h |  2 +-
> > >  kernel/trace/Kconfig| 13 +
> > >  2 files changed, 14 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/include/linux/trace_recursion.h 
> > > b/include/linux/trace_recursion.h
> > > index d48cd92d2364..24ea8ac049b4 100644
> > > --- a/include/linux/trace_recursion.h
> > > +++ b/include/linux/trace_recursion.h
> > > @@ -135,7 +135,7 @@ extern void ftrace_record_recursion(unsigned long ip, 
> > > unsigned long parent_ip);
> > >  # define do_ftrace_record_recursion(ip, pip)   do { } while (0)
> > >  #endif
> > >
> > > -#ifdef CONFIG_ARCH_WANTS_NO_INSTR
> > > +#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
> > >  # define trace_warn_on_no_rcu(ip)  \
> > > ({  \
> > > bool __ret = !rcu_is_watching();\
> >
> > BTW, maybe we can add "unlikely" in the next "if" line?
> >
> > > diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> > > index 61c541c36596..19bce4e217d6 100644
> > > --- a/kernel/trace/Kconfig
> > > +++ b/kernel/trace/Kconfig
> > > @@ -974,6 +974,19 @@ config FTRACE_RECORD_RECURSION_SIZE
> > >   This file can be reset, but the limit can not change in
> > >   size at runtime.
> > >
> > > +config FTRACE_VALIDATE_RCU_IS_WATCHING
> > > +   bool "Validate RCU is on during ftrace recursion check"
> > > +   depends on FUNCTION_TRACER
> > > +   depends on ARCH_WANTS_NO_INSTR
> >
> >   default y
> >
> > > +   help
> > > + All callbacks that attach to the function tracing have some sort
> > > + of protection against recursion. This option performs additional
> > > + checks to make sure RCU is on when ftrace callbacks recurse.
> > > +
> > > + This will add more overhead to all ftrace-based invocations.
> >
> >   ... invocations, but keep it safe.
> >
> > > +
> > > + If unsure, say N
> >
> >   If unsure, say Y
> >
> > Thank you,
> >
> > > +
> > >  config RING_BUFFER_RECORD_RECURSION
> > > bool "Record functions that recurse in the ring buffer"
> > > depends on FTRACE_RECORD_RECURSION
> > > --
> > > 2.43.0
> > >
> >
> >
>



Re: [PATCH] uprobes: reduce contention on uprobes_tree access

2024-03-26 Thread Andrii Nakryiko
On Sun, Mar 24, 2024 at 8:03 PM Masami Hiramatsu  wrote:
>
> On Thu, 21 Mar 2024 07:57:35 -0700
> Jonathan Haslam  wrote:
>
> > Active uprobes are stored in an RB tree and accesses to this tree are
> > dominated by read operations. Currently these accesses are serialized by
> > a spinlock but this leads to enormous contention when large numbers of
> > threads are executing active probes.
> >
> > This patch converts the spinlock used to serialize access to the
> > uprobes_tree RB tree into a reader-writer spinlock. This lock type
> > aligns naturally with the overwhelmingly read-only nature of the tree
> > usage here. Although the addition of reader-writer spinlocks are
> > discouraged [0], this fix is proposed as an interim solution while an
> > RCU based approach is implemented (that work is in a nascent form). This
> > fix also has the benefit of being trivial, self contained and therefore
> > simple to backport.
> >
> > This change has been tested against production workloads that exhibit
> > significant contention on the spinlock and an almost order of magnitude
> > reduction for mean uprobe execution time is observed (28 -> 3.5 microsecs).
>
> Looks good to me.
>
> Acked-by: Masami Hiramatsu (Google) 

Masami,

Given the discussion around per-cpu rw semaphore and need for
(internal) batched attachment API for uprobes, do you think you can
apply this patch as is for now? We can then gain initial improvements
in scalability that are also easy to backport, and Jonathan will work
on a more complete solution based on per-cpu RW semaphore, as
suggested by Ingo.

>
> BTW, how did you measure the overhead? I think spinlock overhead
> will depend on how much lock contention happens.
>
> Thank you,
>
> >
> > [0] https://docs.kernel.org/locking/spinlocks.html
> >
> > Signed-off-by: Jonathan Haslam 
> > ---
> >  kernel/events/uprobes.c | 22 +++---
> >  1 file changed, 11 insertions(+), 11 deletions(-)
> >
> > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> > index 929e98c62965..42bf9b6e8bc0 100644
> > --- a/kernel/events/uprobes.c
> > +++ b/kernel/events/uprobes.c
> > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
> >   */
> >  #define no_uprobe_events()   RB_EMPTY_ROOT(&uprobes_tree)
> >
> > -static DEFINE_SPINLOCK(uprobes_treelock);/* serialize rbtree access */
> > +static DEFINE_RWLOCK(uprobes_treelock);  /* serialize rbtree access */
> >
> >  #define UPROBES_HASH_SZ  13
> >  /* serialize uprobe->pending_list */
> > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, 
> > loff_t offset)
> >  {
> >   struct uprobe *uprobe;
> >
> > - spin_lock(&uprobes_treelock);
> > + read_lock(&uprobes_treelock);
> >   uprobe = __find_uprobe(inode, offset);
> > - spin_unlock(&uprobes_treelock);
> > + read_unlock(&uprobes_treelock);
> >
> >   return uprobe;
> >  }
> > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe 
> > *uprobe)
> >  {
> >   struct uprobe *u;
> >
> > - spin_lock(&uprobes_treelock);
> > + write_lock(&uprobes_treelock);
> >   u = __insert_uprobe(uprobe);
> > - spin_unlock(&uprobes_treelock);
> > + write_unlock(&uprobes_treelock);
> >
> >   return u;
> >  }
> > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe)
> >   if (WARN_ON(!uprobe_is_active(uprobe)))
> >   return;
> >
> > - spin_lock(&uprobes_treelock);
> > + write_lock(&uprobes_treelock);
> >   rb_erase(&uprobe->rb_node, &uprobes_tree);
> > - spin_unlock(&uprobes_treelock);
> > + write_unlock(&uprobes_treelock);
> >   RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
> >   put_uprobe(uprobe);
> >  }
> > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode,
> >   min = vaddr_to_offset(vma, start);
> >   max = min + (end - start) - 1;
> >
> > - spin_lock(&uprobes_treelock);
> > + read_lock(&uprobes_treelock);
> >   n = find_node_in_range(inode, min, max);
> >   if (n) {
> >   for (t = n; t; t = rb_prev(t)) {
> > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode,
> >   get_uprobe(u);
> >   }
> >   }
> > - spin_unlock(&uprobes_treelock);
> > + read_unlock(&uprobes_treelock);
> >  }
> >
> >  /* @vma contains reference counter, not the probed instruction. */
> > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned 
> > long start, unsigned long e
> >   min = vaddr_to_offset(vma, start);
> >   max = min + (end - start) - 1;
> >
> > - spin_lock(&uprobes_treelock);
> > + read_lock(&uprobes_treelock);
> >   n = find_node_in_range(inode, min, max);
> > - spin_unlock(&uprobes_treelock);
> > + read_unlock(&uprobes_treelock);
> >
> >   return !!n;
> >  }
> > --
> > 2.43.0
> >
>
>
> --
> Masami Hiramatsu (Google) 



Re: raw_tp+cookie is buggy. Was: [syzbot] [bpf?] [trace?] KASAN: slab-use-after-free Read in bpf_trace_run1

2024-03-25 Thread Andrii Nakryiko
On Mon, Mar 25, 2024 at 10:27 AM Andrii Nakryiko
 wrote:
>
> On Sun, Mar 24, 2024 at 5:07 PM Alexei Starovoitov
>  wrote:
> >
> > Hi Andrii,
> >
> > syzbot found UAF in raw_tp cookie series in bpf-next.
> > Reverting the whole merge
> > 2e244a72cd48 ("Merge branch 'bpf-raw-tracepoint-support-for-bpf-cookie'")
> >
> > fixes the issue.
> >
> > Pls take a look.
> > See C reproducer below. It splats consistently with CONFIG_KASAN=y
> >
> > Thanks.
>
> Will do, traveling today, so will be offline for a bit, but will check
> first thing afterwards.
>

Ok, so I don't think it's bpf_raw_tp_link specific, it should affect a
bunch of other links (unless I missed something). Basically, when the last
link refcnt drops, we detach, do bpf_prog_put() and then proceed to
kfree the link itself synchronously. But that link can still be referenced
from a running BPF program (I think multi-kprobe/multi-uprobe use it for
cookies, raw_tp with my changes started using the link at runtime, and
there are probably more types), so if we free this memory synchronously,
we can have a UAF.

We should do what we do for bpf_maps and delay freeing; the only question
is how tunable that freeing should be. Always do call_rcu()? Always
call_rcu_tasks_trace() (relevant for sleepable multi-uprobes)? Should we
allow a synchronous free if the link is not directly accessible from the
program during its run?

Anyway, I sent a fix as an RFC so we can discuss.
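
(For illustration only, the deferred-free idea is roughly the sketch below.
The struct layout and helper names are hypothetical and the actual RFC may
look different; the point is just that the final kfree() is pushed past an
RCU grace period instead of happening synchronously in the release path:)

/* hypothetical sketch of RCU-deferred link freeing, names are made up */
struct bpf_example_link {
	struct bpf_link link;
	struct rcu_head rcu;		/* assumed field for deferred freeing */
};

static void bpf_example_link_free_rcu(struct rcu_head *rcu)
{
	struct bpf_example_link *l;

	l = container_of(rcu, struct bpf_example_link, rcu);
	kfree(l);
}

static void bpf_example_link_dealloc(struct bpf_link *link)
{
	struct bpf_example_link *l;

	l = container_of(link, struct bpf_example_link, link);
	/* wait out in-flight program runs that may still read the link */
	call_rcu(&l->rcu, bpf_example_link_free_rcu);
}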

> >
> > On Sun, Mar 24, 2024 at 4:28 PM syzbot
> >  wrote:
> > >
> > > Hello,
> > >
> > > syzbot found the following issue on:
> > >
> > > HEAD commit:520fad2e3206 selftests/bpf: scale benchmark counting by 
> > > us..
> > > git tree:   bpf-next
> > > console+strace: https://syzkaller.appspot.com/x/log.txt?x=105af94618
> > > kernel config:  https://syzkaller.appspot.com/x/.config?x=6fb1be60a193d440
> > > dashboard link: 
> > > https://syzkaller.appspot.com/bug?extid=981935d9485a560bfbcb
> > > compiler:   Debian clang version 15.0.6, GNU ld (GNU Binutils for 
> > > Debian) 2.40
> > > syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=114f17a518
> > > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=162bb7a518
> > >
> > > Downloadable assets:
> > > disk image: 
> > > https://storage.googleapis.com/syzbot-assets/4eef3506c5ce/disk-520fad2e.raw.xz
> > > vmlinux: 
> > > https://storage.googleapis.com/syzbot-assets/24d60ebe76cc/vmlinux-520fad2e.xz
> > > kernel image: 
> > > https://storage.googleapis.com/syzbot-assets/8f883e706550/bzImage-520fad2e.xz
> > >
> > > IMPORTANT: if you fix the issue, please add the following tag to the 
> > > commit:
> > > Reported-by: syzbot+981935d9485a560bf...@syzkaller.appspotmail.com
> > >
> > > ==
> > > BUG: KASAN: slab-use-after-free in __bpf_trace_run 
> > > kernel/trace/bpf_trace.c:2376 [inline]
> > > BUG: KASAN: slab-use-after-free in bpf_trace_run1+0xcb/0x510 
> > > kernel/trace/bpf_trace.c:2430
> > > Read of size 8 at addr 8880290d9918 by task migration/0/19
> > >
> > > CPU: 0 PID: 19 Comm: migration/0 Not tainted 
> > > 6.8.0-syzkaller-05233-g520fad2e3206 #0
> > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> > > Google 02/29/2024
> > > Stopper: 0x0 <- 0x0
> > > Call Trace:
> > >  
> > >  __dump_stack lib/dump_stack.c:88 [inline]
> > >  dump_stack_lvl+0x1e7/0x2e0 lib/dump_stack.c:106
> > >  print_address_description mm/kasan/report.c:377 [inline]
> > >  print_report+0x169/0x550 mm/kasan/report.c:488
> > >  kasan_report+0x143/0x180 mm/kasan/report.c:601
> > >  __bpf_trace_run kernel/trace/bpf_trace.c:2376 [inline]
> > >  bpf_trace_run1+0xcb/0x510 kernel/trace/bpf_trace.c:2430
> > >  __traceiter_rcu_utilization+0x74/0xb0 include/trace/events/rcu.h:27
> > >  trace_rcu_utilization+0x194/0x1c0 include/trace/events/rcu.h:27
> > >  rcu_note_context_switch+0xc7c/0xff0 kernel/rcu/tree_plugin.h:360
> > >  __schedule+0x345/0x4a20 kernel/sched/core.c:6635
> > >  __schedule_loop kernel/sched/core.c:6813 [inline]
> > >  schedule+0x14b/0x320 kernel/sched/core.c:6828
> > >  smpboot_thread_fn+0x61e/0xa30 kernel/smpboot.c:160
> > >  kthread+0x2f0/0x390 kernel/kthread.c:388
> > >  ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
> > >  ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:243
> > &g

Re: [PATCH] uprobes: reduce contention on uprobes_tree access

2024-03-25 Thread Andrii Nakryiko
On Mon, Mar 25, 2024 at 12:12 PM Jonathan Haslam
 wrote:
>
> Hi Ingo,
>
> > > This change has been tested against production workloads that exhibit
> > > significant contention on the spinlock and an almost order of magnitude
> > > reduction for mean uprobe execution time is observed (28 -> 3.5 
> > > microsecs).
> >
> > Have you considered/measured per-CPU RW semaphores?
>
> No I hadn't but thanks hugely for suggesting it! In initial measurements
> it seems to be between 20-100% faster than the RW spinlocks! Apologies for
> all the exclamation marks but I'm very excited. I'll do some more testing
> tomorrow but so far it's looking very good.
>

Documentation ([0]) says that locking for writing calls
synchronize_rcu(), is that right? If that's true, attaching multiple
uprobes (including just attaching a single BPF multi-uprobe) will take
a really long time. We need to confirm we are not significantly
regressing this. And if we do, we need to take measures in the BPF
multi-uprobe attachment code path to make sure that a single
multi-uprobe attachment is still fast.

If my worries above turn out to be true, it still feels like a good first
step would be landing this patch as is (and getting it backported to older
kernels), and then having the percpu rw-semaphore as a final (and a bit
more invasive) solution (it's RCU-based, so it feels like a good primitive
to settle on), making sure not to regress multi-uprobes (we'll probably
need some batched API for multiple uprobes).

Thoughts?

  [0] https://docs.kernel.org/locking/percpu-rw-semaphore.html
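
(To make the trade-off concrete, here is a minimal sketch of the per-CPU
rw-semaphore direction -- this is not Jonathan's patch, and the semaphore
name is made up. The read side stays cheap and scalable, while every
write-side acquisition implies an RCU grace period, which is exactly what
could hurt one-uprobe-at-a-time attachment:)

#include <linux/percpu-rwsem.h>

static DEFINE_STATIC_PERCPU_RWSEM(uprobes_treesem);	/* hypothetical name */

static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
{
	struct uprobe *uprobe;

	/* hot lookup path: readers don't bounce a shared cacheline */
	percpu_down_read(&uprobes_treesem);
	uprobe = __find_uprobe(inode, offset);
	percpu_up_read(&uprobes_treesem);

	return uprobe;
}

static struct uprobe *insert_uprobe(struct uprobe *uprobe)
{
	struct uprobe *u;

	/* cold registration path: each write section waits for a grace period */
	percpu_down_write(&uprobes_treesem);
	u = __insert_uprobe(uprobe);
	percpu_up_write(&uprobes_treesem);

	return u;
}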

> Thanks again for the input.
>
> Jon.



Re: raw_tp+cookie is buggy. Was: [syzbot] [bpf?] [trace?] KASAN: slab-use-after-free Read in bpf_trace_run1

2024-03-25 Thread Andrii Nakryiko
On Sun, Mar 24, 2024 at 5:07 PM Alexei Starovoitov
 wrote:
>
> Hi Andrii,
>
> syzbot found UAF in raw_tp cookie series in bpf-next.
> Reverting the whole merge
> 2e244a72cd48 ("Merge branch 'bpf-raw-tracepoint-support-for-bpf-cookie'")
>
> fixes the issue.
>
> Pls take a look.
> See C reproducer below. It splats consistently with CONFIG_KASAN=y
>
> Thanks.

Will do, traveling today, so will be offline for a bit, but will check
first thing afterwards.

>
> On Sun, Mar 24, 2024 at 4:28 PM syzbot
>  wrote:
> >
> > Hello,
> >
> > syzbot found the following issue on:
> >
> > HEAD commit:520fad2e3206 selftests/bpf: scale benchmark counting by us..
> > git tree:   bpf-next
> > console+strace: https://syzkaller.appspot.com/x/log.txt?x=105af94618
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=6fb1be60a193d440
> > dashboard link: https://syzkaller.appspot.com/bug?extid=981935d9485a560bfbcb
> > compiler:   Debian clang version 15.0.6, GNU ld (GNU Binutils for 
> > Debian) 2.40
> > syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=114f17a518
> > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=162bb7a518
> >
> > Downloadable assets:
> > disk image: 
> > https://storage.googleapis.com/syzbot-assets/4eef3506c5ce/disk-520fad2e.raw.xz
> > vmlinux: 
> > https://storage.googleapis.com/syzbot-assets/24d60ebe76cc/vmlinux-520fad2e.xz
> > kernel image: 
> > https://storage.googleapis.com/syzbot-assets/8f883e706550/bzImage-520fad2e.xz
> >
> > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > Reported-by: syzbot+981935d9485a560bf...@syzkaller.appspotmail.com
> >
> > ==
> > BUG: KASAN: slab-use-after-free in __bpf_trace_run 
> > kernel/trace/bpf_trace.c:2376 [inline]
> > BUG: KASAN: slab-use-after-free in bpf_trace_run1+0xcb/0x510 
> > kernel/trace/bpf_trace.c:2430
> > Read of size 8 at addr 8880290d9918 by task migration/0/19
> >
> > CPU: 0 PID: 19 Comm: migration/0 Not tainted 
> > 6.8.0-syzkaller-05233-g520fad2e3206 #0
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> > Google 02/29/2024
> > Stopper: 0x0 <- 0x0
> > Call Trace:
> >  
> >  __dump_stack lib/dump_stack.c:88 [inline]
> >  dump_stack_lvl+0x1e7/0x2e0 lib/dump_stack.c:106
> >  print_address_description mm/kasan/report.c:377 [inline]
> >  print_report+0x169/0x550 mm/kasan/report.c:488
> >  kasan_report+0x143/0x180 mm/kasan/report.c:601
> >  __bpf_trace_run kernel/trace/bpf_trace.c:2376 [inline]
> >  bpf_trace_run1+0xcb/0x510 kernel/trace/bpf_trace.c:2430
> >  __traceiter_rcu_utilization+0x74/0xb0 include/trace/events/rcu.h:27
> >  trace_rcu_utilization+0x194/0x1c0 include/trace/events/rcu.h:27
> >  rcu_note_context_switch+0xc7c/0xff0 kernel/rcu/tree_plugin.h:360
> >  __schedule+0x345/0x4a20 kernel/sched/core.c:6635
> >  __schedule_loop kernel/sched/core.c:6813 [inline]
> >  schedule+0x14b/0x320 kernel/sched/core.c:6828
> >  smpboot_thread_fn+0x61e/0xa30 kernel/smpboot.c:160
> >  kthread+0x2f0/0x390 kernel/kthread.c:388
> >  ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
> >  ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:243
> >  
> >
> > Allocated by task 5075:
> >  kasan_save_stack mm/kasan/common.c:47 [inline]
> >  kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
> >  poison_kmalloc_redzone mm/kasan/common.c:370 [inline]
> >  __kasan_kmalloc+0x98/0xb0 mm/kasan/common.c:387
> >  kasan_kmalloc include/linux/kasan.h:211 [inline]
> >  kmalloc_trace+0x1d9/0x360 mm/slub.c:4012
> >  kmalloc include/linux/slab.h:590 [inline]
> >  kzalloc include/linux/slab.h:711 [inline]
> >  bpf_raw_tp_link_attach+0x2a0/0x6e0 kernel/bpf/syscall.c:3816
> >  bpf_raw_tracepoint_open+0x1c2/0x240 kernel/bpf/syscall.c:3863
> >  __sys_bpf+0x3c0/0x810 kernel/bpf/syscall.c:5673
> >  __do_sys_bpf kernel/bpf/syscall.c:5738 [inline]
> >  __se_sys_bpf kernel/bpf/syscall.c:5736 [inline]
> >  __x64_sys_bpf+0x7c/0x90 kernel/bpf/syscall.c:5736
> >  do_syscall_64+0xfb/0x240
> >  entry_SYSCALL_64_after_hwframe+0x6d/0x75
> >
> > Freed by task 5075:
> >  kasan_save_stack mm/kasan/common.c:47 [inline]
> >  kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
> >  kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:589
> >  poison_slab_object+0xa6/0xe0 mm/kasan/common.c:240
> >  __kasan_slab_free+0x37/0x60 mm/kasan/common.c:256
> >  kasan_slab_free include/linux/kasan.h:184 [inline]
> >  slab_free_hook mm/slub.c:2121 [inline]
> >  slab_free mm/slub.c:4299 [inline]
> >  kfree+0x14a/0x380 mm/slub.c:4409
> >  bpf_link_release+0x3b/0x50 kernel/bpf/syscall.c:3071
> >  __fput+0x429/0x8a0 fs/file_table.c:423
> >  task_work_run+0x24f/0x310 kernel/task_work.c:180
> >  exit_task_work include/linux/task_work.h:38 [inline]
> >  do_exit+0xa1b/0x27e0 kernel/exit.c:878
> >  do_group_exit+0x207/0x2c0 kernel/exit.c:1027
> >  __do_sys_exit_group kernel/exit.c:1038 [inline]
> >  __se_sys_exit_group kernel/exit.c:1036 [inline]
> >  

Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional

2024-03-25 Thread Andrii Nakryiko
On Sun, Mar 24, 2024 at 7:38 PM Masami Hiramatsu  wrote:
>
> On Fri, 22 Mar 2024 09:03:23 -0700
> Andrii Nakryiko  wrote:
>
> > Introduce CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to
> > control whether ftrace low-level code performs additional
> > rcu_is_watching()-based validation logic in an attempt to catch noinstr
> > violations.
> >
> > This check is expected to never be true in practice and would be best
> > controlled with extra config to let users decide if they are willing to
> > pay the price.
>
> Hmm, for me, it sounds like "WARN_ON(something) never be true in practice
> so disable it by default". I think CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
> is OK, but that should be set to Y by default. If you have already verified
> that your system never make it true and you want to optimize your ftrace
> path, you can manually set it to N at your own risk.

Yeah, I don't think we ever see this warning across our machines. And
sure, I can default it to Y, no problem.

>
> >
> > Cc: Steven Rostedt 
> > Cc: Masami Hiramatsu 
> > Cc: Paul E. McKenney 
> > Signed-off-by: Andrii Nakryiko 
> > ---
> >  include/linux/trace_recursion.h |  2 +-
> >  kernel/trace/Kconfig| 13 +
> >  2 files changed, 14 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/trace_recursion.h 
> > b/include/linux/trace_recursion.h
> > index d48cd92d2364..24ea8ac049b4 100644
> > --- a/include/linux/trace_recursion.h
> > +++ b/include/linux/trace_recursion.h
> > @@ -135,7 +135,7 @@ extern void ftrace_record_recursion(unsigned long ip, 
> > unsigned long parent_ip);
> >  # define do_ftrace_record_recursion(ip, pip) do { } while (0)
> >  #endif
> >
> > -#ifdef CONFIG_ARCH_WANTS_NO_INSTR
> > +#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
> >  # define trace_warn_on_no_rcu(ip)\
> >   ({  \
> >   bool __ret = !rcu_is_watching();\
>
> BTW, maybe we can add "unlikely" in the next "if" line?

sure, can add that as well
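
(Roughly, the resulting macro would look like the sketch below -- the body
is paraphrased from include/linux/trace_recursion.h rather than copied
verbatim, with the suggested unlikely() hint added:)

# define trace_warn_on_no_rcu(ip)					\
	({								\
		bool __ret = !rcu_is_watching();			\
		if (unlikely(__ret) &&					\
		    !trace_recursion_test(TRACE_RECORD_RECURSION_BIT)) { \
			pr_warn("RCU not on for: %pS\n", (void *)(ip));	\
			trace_recursion_set(TRACE_RECORD_RECURSION_BIT); \
		}							\
		__ret;							\
	})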

>
> > diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
> > index 61c541c36596..19bce4e217d6 100644
> > --- a/kernel/trace/Kconfig
> > +++ b/kernel/trace/Kconfig
> > @@ -974,6 +974,19 @@ config FTRACE_RECORD_RECURSION_SIZE
> > This file can be reset, but the limit can not change in
> > size at runtime.
> >
> > +config FTRACE_VALIDATE_RCU_IS_WATCHING
> > + bool "Validate RCU is on during ftrace recursion check"
> > + depends on FUNCTION_TRACER
> > + depends on ARCH_WANTS_NO_INSTR
>
> default y
>

ok

> > + help
> > +   All callbacks that attach to the function tracing have some sort
> > +   of protection against recursion. This option performs additional
> > +   checks to make sure RCU is on when ftrace callbacks recurse.
> > +
> > +   This will add more overhead to all ftrace-based invocations.
>
> ... invocations, but keep it safe.
>
> > +
> > +   If unsure, say N
>
> If unsure, say Y
>

yep, will do, thanks!

> Thank you,
>
> > +
> >  config RING_BUFFER_RECORD_RECURSION
> >   bool "Record functions that recurse in the ring buffer"
> >   depends on FTRACE_RECORD_RECURSION
> > --
> > 2.43.0
> >
>
>
> --
> Masami Hiramatsu (Google) 



[PATCH] ftrace: make extra rcu_is_watching() validation check optional

2024-03-22 Thread Andrii Nakryiko
Introduce CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to
control whether ftrace low-level code performs additional
rcu_is_watching()-based validation logic in an attempt to catch noinstr
violations.

This check is expected to never be true in practice and would be best
controlled with extra config to let users decide if they are willing to
pay the price.

Cc: Steven Rostedt 
Cc: Masami Hiramatsu 
Cc: Paul E. McKenney 
Signed-off-by: Andrii Nakryiko 
---
 include/linux/trace_recursion.h |  2 +-
 kernel/trace/Kconfig| 13 +
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
index d48cd92d2364..24ea8ac049b4 100644
--- a/include/linux/trace_recursion.h
+++ b/include/linux/trace_recursion.h
@@ -135,7 +135,7 @@ extern void ftrace_record_recursion(unsigned long ip, 
unsigned long parent_ip);
 # define do_ftrace_record_recursion(ip, pip)   do { } while (0)
 #endif
 
-#ifdef CONFIG_ARCH_WANTS_NO_INSTR
+#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
 # define trace_warn_on_no_rcu(ip)  \
({  \
bool __ret = !rcu_is_watching();\
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 61c541c36596..19bce4e217d6 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -974,6 +974,19 @@ config FTRACE_RECORD_RECURSION_SIZE
  This file can be reset, but the limit can not change in
  size at runtime.
 
+config FTRACE_VALIDATE_RCU_IS_WATCHING
+   bool "Validate RCU is on during ftrace recursion check"
+   depends on FUNCTION_TRACER
+   depends on ARCH_WANTS_NO_INSTR
+   help
+ All callbacks that attach to the function tracing have some sort
+ of protection against recursion. This option performs additional
+ checks to make sure RCU is on when ftrace callbacks recurse.
+
+ This will add more overhead to all ftrace-based invocations.
+
+ If unsure, say N
+
 config RING_BUFFER_RECORD_RECURSION
bool "Record functions that recurse in the ring buffer"
depends on FTRACE_RECORD_RECURSION
-- 
2.43.0




Re: [PATCH] uprobes: reduce contention on uprobes_tree access

2024-03-21 Thread Andrii Nakryiko
On Thu, Mar 21, 2024 at 7:57 AM Jonathan Haslam
 wrote:
>
> Active uprobes are stored in an RB tree and accesses to this tree are
> dominated by read operations. Currently these accesses are serialized by
> a spinlock but this leads to enormous contention when large numbers of
> threads are executing active probes.
>
> This patch converts the spinlock used to serialize access to the
> uprobes_tree RB tree into a reader-writer spinlock. This lock type
> aligns naturally with the overwhelmingly read-only nature of the tree
> usage here. Although the addition of reader-writer spinlocks are
> discouraged [0], this fix is proposed as an interim solution while an
> RCU based approach is implemented (that work is in a nascent form). This
> fix also has the benefit of being trivial, self contained and therefore
> simple to backport.

Yep, makes sense, I think we'll want to backport this ASAP to some of
the old kernels we have. Thanks!

Acked-by: Andrii Nakryiko 

>
> This change has been tested against production workloads that exhibit
> significant contention on the spinlock and an almost order of magnitude
> reduction for mean uprobe execution time is observed (28 -> 3.5 microsecs).
>
> [0] https://docs.kernel.org/locking/spinlocks.html
>
> Signed-off-by: Jonathan Haslam 
> ---
>  kernel/events/uprobes.c | 22 +++---
>  1 file changed, 11 insertions(+), 11 deletions(-)
>
> diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
> index 929e98c62965..42bf9b6e8bc0 100644
> --- a/kernel/events/uprobes.c
> +++ b/kernel/events/uprobes.c
> @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT;
>   */
>  #define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree)
>
> -static DEFINE_SPINLOCK(uprobes_treelock);  /* serialize rbtree access */
> +static DEFINE_RWLOCK(uprobes_treelock);/* serialize rbtree access */
>
>  #define UPROBES_HASH_SZ13
>  /* serialize uprobe->pending_list */
> @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, 
> loff_t offset)
>  {
> struct uprobe *uprobe;
>
> -   spin_lock(&uprobes_treelock);
> +   read_lock(&uprobes_treelock);
> uprobe = __find_uprobe(inode, offset);
> -   spin_unlock(&uprobes_treelock);
> +   read_unlock(&uprobes_treelock);
>
> return uprobe;
>  }
> @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe)
>  {
> struct uprobe *u;
>
> -   spin_lock(&uprobes_treelock);
> +   write_lock(&uprobes_treelock);
> u = __insert_uprobe(uprobe);
> -   spin_unlock(&uprobes_treelock);
> +   write_unlock(&uprobes_treelock);
>
> return u;
>  }
> @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe)
> if (WARN_ON(!uprobe_is_active(uprobe)))
> return;
>
> -   spin_lock(&uprobes_treelock);
> +   write_lock(&uprobes_treelock);
> rb_erase(&uprobe->rb_node, &uprobes_tree);
> -   spin_unlock(&uprobes_treelock);
> +   write_unlock(&uprobes_treelock);
> RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */
> put_uprobe(uprobe);
>  }
> @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode,
> min = vaddr_to_offset(vma, start);
> max = min + (end - start) - 1;
>
> -   spin_lock(&uprobes_treelock);
> +   read_lock(&uprobes_treelock);
> n = find_node_in_range(inode, min, max);
> if (n) {
> for (t = n; t; t = rb_prev(t)) {
> @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode,
> get_uprobe(u);
> }
> }
> -   spin_unlock(&uprobes_treelock);
> +   read_unlock(&uprobes_treelock);
>  }
>
>  /* @vma contains reference counter, not the probed instruction. */
> @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned 
> long start, unsigned long e
> min = vaddr_to_offset(vma, start);
> max = min + (end - start) - 1;
>
> -   spin_lock(&uprobes_treelock);
> +   read_lock(&uprobes_treelock);
> n = find_node_in_range(inode, min, max);
> -   spin_unlock(&uprobes_treelock);
> +   read_unlock(&uprobes_treelock);
>
> return !!n;
>  }
> --
> 2.43.0
>



Re: [PATCH v2 0/3] uprobes: two common case speed ups

2024-03-19 Thread Andrii Nakryiko
On Mon, Mar 18, 2024 at 9:21 PM Masami Hiramatsu  wrote:
>
> Hi,
>
> On Mon, 18 Mar 2024 11:17:25 -0700
> Andrii Nakryiko  wrote:
>
> > This patch set implements two speed ups for uprobe/uretprobe runtime 
> > execution
> > path for some common scenarios: BPF-only uprobes (patches #1 and #2) and
> > system-wide (non-PID-specific) uprobes (patch #3). Please see individual
> > patches for details.
>
> This series looks good to me. Let me pick it on probes/for-next.

Great, at least I guessed the Git repo right, if not the branch.
Thanks for pulling it in! I assume some other uprobe-related follow up
patches should be based on probes/for-next as well, right?

>
> Thanks!
>
> >
> > v1->v2:
> >   - rebased onto trace/core branch of tracing tree, hopefully I guessed 
> > right;
> >   - simplified user_cpu_buffer usage further (Oleg Nesterov);
> >   - simplified patch #3, just moved speculative check outside of lock 
> > (Oleg);
> >   - added Reviewed-by from Jiri Olsa.
> >
> > Andrii Nakryiko (3):
> >   uprobes: encapsulate preparation of uprobe args buffer
> >   uprobes: prepare uprobe args buffer lazily
> >   uprobes: add speculative lockless system-wide uprobe filter check
> >
> >  kernel/trace/trace_uprobe.c | 103 +---
> >  1 file changed, 59 insertions(+), 44 deletions(-)
> >
> > --
> > 2.43.0
> >
>
>
> --
> Masami Hiramatsu (Google) 



[PATCH v2 3/3] uprobes: add speculative lockless system-wide uprobe filter check

2024-03-18 Thread Andrii Nakryiko
It's very common for BPF-based uprobe/uretprobe use cases to use
system-wide (not PID-specific) probes. In this case uprobe's
trace_uprobe_filter->nr_systemwide counter is bumped at registration
time, and actual filtering is short circuited at the time when
uprobe/uretprobe is triggered.

This is a great optimization, and the only issue with it is that, just to
get to checking this counter, the uprobe subsystem takes the read side of
trace_uprobe_filter->rwlock. This is actually noticeable in profiles and is
just another point of contention when a uprobe is triggered on multiple
CPUs simultaneously.

This patch moves this nr_systemwide check outside of filter list's
rwlock scope, as rwlock is meant to protect list modification, while
nr_systemwide-based check is speculative and racy already, despite the
lock (as discussed in [0]). trace_uprobe_filter_remove() and
trace_uprobe_filter_add() already check for filter->nr_systewide
explicitly outside of __uprobe_perf_filter, so no modifications are
required there.

Results are confirmed with BPF selftests-based benchmarks.

BEFORE (based on changes in previous patch)
===
uprobe-nop :2.732 ± 0.022M/s
uprobe-push:2.621 ± 0.016M/s
uprobe-ret :1.105 ± 0.007M/s
uretprobe-nop  :1.396 ± 0.007M/s
uretprobe-push :1.347 ± 0.008M/s
uretprobe-ret  :0.800 ± 0.006M/s

AFTER
=
uprobe-nop :2.878 ± 0.017M/s (+5.5%, total +8.3%)
uprobe-push:2.753 ± 0.013M/s (+5.3%, total +10.2%)
uprobe-ret :1.142 ± 0.010M/s (+3.8%, total +3.8%)
uretprobe-nop  :1.444 ± 0.008M/s (+3.5%, total +6.5%)
uretprobe-push :1.410 ± 0.010M/s (+4.8%, total +7.1%)
uretprobe-ret  :0.816 ± 0.002M/s (+2.0%, total +3.9%)

In the above, first percentage value is based on top of previous patch
(lazy uprobe buffer optimization), while the "total" percentage is
based on kernel without any of the changes in this patch set.

As can be seen, we get about 4% - 10% speed up, in total, with both lazy
uprobe buffer and speculative filter check optimizations.

  [0] https://lore.kernel.org/bpf/20240313131926.ga19...@redhat.com/

Reviewed-by: Jiri Olsa 
Signed-off-by: Andrii Nakryiko 
---
 kernel/trace/trace_uprobe.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index b5da95240a31..ac05885a6ce6 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -1226,9 +1226,6 @@ __uprobe_perf_filter(struct trace_uprobe_filter *filter, 
struct mm_struct *mm)
 {
struct perf_event *event;
 
-   if (filter->nr_systemwide)
-   return true;
-
	list_for_each_entry(event, &filter->perf_events, hw.tp_list) {
if (event->hw.target->mm == mm)
return true;
@@ -1353,6 +1350,13 @@ static bool uprobe_perf_filter(struct uprobe_consumer 
*uc,
tu = container_of(uc, struct trace_uprobe, consumer);
filter = tu->tp.event->filter;
 
+   /*
+* speculative short-circuiting check to avoid unnecessarily taking
+* filter->rwlock below, if the uprobe has system-wide consumer
+*/
+   if (READ_ONCE(filter->nr_systemwide))
+   return true;
+
	read_lock(&filter->rwlock);
	ret = __uprobe_perf_filter(filter, mm);
	read_unlock(&filter->rwlock);
-- 
2.43.0




[PATCH v2 2/3] uprobes: prepare uprobe args buffer lazily

2024-03-18 Thread Andrii Nakryiko
uprobe_cpu_buffer and corresponding logic to store uprobe args into it
are used for uprobes/uretprobes that are created through tracefs or
perf events.

BPF is yet another user of uprobe/uretprobe infrastructure, but doesn't
need uprobe_cpu_buffer and associated data. For BPF-only use cases this
buffer handling and preparation is a pure overhead. At the same time,
BPF-only uprobe/uretprobe usage is very common in practice. Also, in
a lot of cases applications are very sensitive to performance overheads,
as they might be tracing very high-frequency functions like
malloc()/free(), so every bit of performance improvement matters.

All that is to say that this uprobe_cpu_buffer preparation is an
unnecessary overhead that each BPF user of uprobes/uretprobe has to pay.
This patch is changing this by making uprobe_cpu_buffer preparation
optional. It will happen only if either tracefs-based or perf event-based
uprobe/uretprobe consumer is registered for given uprobe/uretprobe. For
BPF-only use cases this step will be skipped.

We used uprobe/uretprobe benchmark which is part of BPF selftests (see [0])
to estimate the improvements. We have 3 uprobe and 3 uretprobe
scenarios, which vary an instruction that is replaced by uprobe: nop
(fastest uprobe case), `push rbp` (typical case), and non-simulated
`ret` instruction (slowest case). The benchmark thread constantly calls a
user space function in a tight loop. The user space function has a BPF
uprobe or uretprobe program attached, which does nothing but atomic counter
increments to count the number of triggering calls. The benchmark emits
throughput in millions of executions per second.
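
(For reference, the BPF side of such a benchmark is essentially just a
counting program like the sketch below -- this is not the actual
bench_trigger.c source, names are illustrative:)

// minimal counting uprobe program, attached to the user space function
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

long hits = 0;

SEC("uprobe")
int count_uprobe_hit(struct pt_regs *ctx)
{
	__sync_fetch_and_add(&hits, 1);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";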

BEFORE these changes

uprobe-nop :2.657 ± 0.024M/s
uprobe-push:2.499 ± 0.018M/s
uprobe-ret :1.100 ± 0.006M/s
uretprobe-nop  :1.356 ± 0.004M/s
uretprobe-push :1.317 ± 0.019M/s
uretprobe-ret  :0.785 ± 0.007M/s

AFTER these changes
===
uprobe-nop :2.732 ± 0.022M/s (+2.8%)
uprobe-push:2.621 ± 0.016M/s (+4.9%)
uprobe-ret :1.105 ± 0.007M/s (+0.5%)
uretprobe-nop  :1.396 ± 0.007M/s (+2.9%)
uretprobe-push :1.347 ± 0.008M/s (+2.3%)
uretprobe-ret  :0.800 ± 0.006M/s (+1.9%)

So the improvements on this particular machine seems to be between 2% and 5%.

  [0] 
https://github.com/torvalds/linux/blob/master/tools/testing/selftests/bpf/benchs/bench_trigger.c

Reviewed-by: Jiri Olsa 
Signed-off-by: Andrii Nakryiko 
---
 kernel/trace/trace_uprobe.c | 49 +
 1 file changed, 28 insertions(+), 21 deletions(-)

diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 9bffaab448a6..b5da95240a31 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -941,15 +941,21 @@ static struct uprobe_cpu_buffer *uprobe_buffer_get(void)
 
 static void uprobe_buffer_put(struct uprobe_cpu_buffer *ucb)
 {
+   if (!ucb)
+   return;
	mutex_unlock(&ucb->mutex);
 }
 
 static struct uprobe_cpu_buffer *prepare_uprobe_buffer(struct trace_uprobe *tu,
-  struct pt_regs *regs)
+  struct pt_regs *regs,
+  struct uprobe_cpu_buffer 
**ucbp)
 {
struct uprobe_cpu_buffer *ucb;
int dsize, esize;
 
+   if (*ucbp)
+   return *ucbp;
+
esize = SIZEOF_TRACE_ENTRY(is_ret_probe(tu));
	dsize = __get_data_size(&tu->tp, regs);
 
@@ -958,22 +964,25 @@ static struct uprobe_cpu_buffer 
*prepare_uprobe_buffer(struct trace_uprobe *tu,
 
	store_trace_args(ucb->buf, &tu->tp, regs, esize, dsize);
 
+   *ucbp = ucb;
return ucb;
 }
 
 static void __uprobe_trace_func(struct trace_uprobe *tu,
unsigned long func, struct pt_regs *regs,
-   struct uprobe_cpu_buffer *ucb,
+   struct uprobe_cpu_buffer **ucbp,
struct trace_event_file *trace_file)
 {
struct uprobe_trace_entry_head *entry;
struct trace_event_buffer fbuffer;
+   struct uprobe_cpu_buffer *ucb;
void *data;
int size, esize;
	struct trace_event_call *call = trace_probe_event_call(&tu->tp);
 
WARN_ON(call != trace_file->event_call);
 
+   ucb = prepare_uprobe_buffer(tu, regs, ucbp);
if (WARN_ON_ONCE(ucb->dsize > PAGE_SIZE))
return;
 
@@ -1002,7 +1011,7 @@ static void __uprobe_trace_func(struct trace_uprobe *tu,
 
 /* uprobe handler */
 static int uprobe_trace_func(struct trace_uprobe *tu, struct pt_regs *regs,
-struct uprobe_cpu_buffer *ucb)
+struct uprobe_cpu_buffer **ucbp)
 {
struct event_file_link *link;
 
@@ -1011,7 +1020,7 @@ static int uprobe_trace_func(struct trace_uprobe *tu, 
struct pt_regs *regs,

[PATCH v2 1/3] uprobes: encapsulate preparation of uprobe args buffer

2024-03-18 Thread Andrii Nakryiko
Move the logic of fetching temporary per-CPU uprobe buffer and storing
uprobes args into it to a new helper function. Store data size as part
of this buffer, simplifying interfaces a bit, as now we only pass single
uprobe_cpu_buffer reference around, instead of pointer + dsize.

This logic was duplicated across uprobe_dispatcher and uretprobe_dispatcher,
and now will be centralized. All this is also in preparation to make
this uprobe_cpu_buffer handling logic optional in the next patch.

Reviewed-by: Jiri Olsa 
Signed-off-by: Andrii Nakryiko 
---
 kernel/trace/trace_uprobe.c | 78 +++--
 1 file changed, 41 insertions(+), 37 deletions(-)

diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index a84b85d8aac1..9bffaab448a6 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -854,6 +854,7 @@ static const struct file_operations uprobe_profile_ops = {
 struct uprobe_cpu_buffer {
struct mutex mutex;
void *buf;
+   int dsize;
 };
 static struct uprobe_cpu_buffer __percpu *uprobe_cpu_buffer;
 static int uprobe_buffer_refcnt;
@@ -943,9 +944,26 @@ static void uprobe_buffer_put(struct uprobe_cpu_buffer 
*ucb)
mutex_unlock(&ucb->mutex);
 }
 
+static struct uprobe_cpu_buffer *prepare_uprobe_buffer(struct trace_uprobe *tu,
+  struct pt_regs *regs)
+{
+   struct uprobe_cpu_buffer *ucb;
+   int dsize, esize;
+
+   esize = SIZEOF_TRACE_ENTRY(is_ret_probe(tu));
+   dsize = __get_data_size(&tu->tp, regs);
+
+   ucb = uprobe_buffer_get();
+   ucb->dsize = tu->tp.size + dsize;
+
+   store_trace_args(ucb->buf, &tu->tp, regs, esize, dsize);
+
+   return ucb;
+}
+
 static void __uprobe_trace_func(struct trace_uprobe *tu,
unsigned long func, struct pt_regs *regs,
-   struct uprobe_cpu_buffer *ucb, int dsize,
+   struct uprobe_cpu_buffer *ucb,
struct trace_event_file *trace_file)
 {
struct uprobe_trace_entry_head *entry;
@@ -956,14 +974,14 @@ static void __uprobe_trace_func(struct trace_uprobe *tu,
 
WARN_ON(call != trace_file->event_call);
 
-   if (WARN_ON_ONCE(tu->tp.size + dsize > PAGE_SIZE))
+   if (WARN_ON_ONCE(ucb->dsize > PAGE_SIZE))
return;
 
if (trace_trigger_soft_disabled(trace_file))
return;
 
esize = SIZEOF_TRACE_ENTRY(is_ret_probe(tu));
-   size = esize + tu->tp.size + dsize;
+   size = esize + ucb->dsize;
entry = trace_event_buffer_reserve(&fbuffer, trace_file, size);
if (!entry)
return;
@@ -977,14 +995,14 @@ static void __uprobe_trace_func(struct trace_uprobe *tu,
data = DATAOF_TRACE_ENTRY(entry, false);
}
 
-   memcpy(data, ucb->buf, tu->tp.size + dsize);
+   memcpy(data, ucb->buf, ucb->dsize);
 
trace_event_buffer_commit(&fbuffer);
 }
 
 /* uprobe handler */
 static int uprobe_trace_func(struct trace_uprobe *tu, struct pt_regs *regs,
-struct uprobe_cpu_buffer *ucb, int dsize)
+struct uprobe_cpu_buffer *ucb)
 {
struct event_file_link *link;
 
@@ -993,7 +1011,7 @@ static int uprobe_trace_func(struct trace_uprobe *tu, 
struct pt_regs *regs,
 
rcu_read_lock();
trace_probe_for_each_link_rcu(link, &tu->tp)
-   __uprobe_trace_func(tu, 0, regs, ucb, dsize, link->file);
+   __uprobe_trace_func(tu, 0, regs, ucb, link->file);
rcu_read_unlock();
 
return 0;
@@ -1001,13 +1019,13 @@ static int uprobe_trace_func(struct trace_uprobe *tu, 
struct pt_regs *regs,
 
 static void uretprobe_trace_func(struct trace_uprobe *tu, unsigned long func,
 struct pt_regs *regs,
-struct uprobe_cpu_buffer *ucb, int dsize)
+struct uprobe_cpu_buffer *ucb)
 {
struct event_file_link *link;
 
rcu_read_lock();
trace_probe_for_each_link_rcu(link, &tu->tp)
-   __uprobe_trace_func(tu, func, regs, ucb, dsize, link->file);
+   __uprobe_trace_func(tu, func, regs, ucb, link->file);
rcu_read_unlock();
 }
 
@@ -1335,7 +1353,7 @@ static bool uprobe_perf_filter(struct uprobe_consumer *uc,
 
 static void __uprobe_perf_func(struct trace_uprobe *tu,
   unsigned long func, struct pt_regs *regs,
-  struct uprobe_cpu_buffer *ucb, int dsize)
+  struct uprobe_cpu_buffer *ucb)
 {
struct trace_event_call *call = trace_probe_event_call(&tu->tp);
struct uprobe_trace_entry_head *entry;
@@ -1356,7 +1374,7 @@ static void __uprobe_perf_func(struct trace_uprobe *tu,
 
esize = SIZEOF_TRACE_ENTRY(is_ret_probe(tu));
 
-

[PATCH v2 0/3] uprobes: two common case speed ups

2024-03-18 Thread Andrii Nakryiko
This patch set implements two speed ups for uprobe/uretprobe runtime execution
path for some common scenarios: BPF-only uprobes (patches #1 and #2) and
system-wide (non-PID-specific) uprobes (patch #3). Please see individual
patches for details.

v1->v2:
  - rebased onto trace/core branch of tracing tree, hopefully I guessed right;
  - simplified user_cpu_buffer usage further (Oleg Nesterov);
  - simplified patch #3, just moved speculative check outside of lock (Oleg);
  - added Reviewed-by from Jiri Olsa.

Andrii Nakryiko (3):
  uprobes: encapsulate preparation of uprobe args buffer
  uprobes: prepare uprobe args buffer lazily
  uprobes: add speculative lockless system-wide uprobe filter check

 kernel/trace/trace_uprobe.c | 103 +---
 1 file changed, 59 insertions(+), 44 deletions(-)

-- 
2.43.0




Re: [PATCH bpf-next 0/3] uprobes: two common case speed ups

2024-03-13 Thread Andrii Nakryiko
On Wed, Mar 13, 2024 at 2:41 AM Jiri Olsa  wrote:
>
> On Tue, Mar 12, 2024 at 02:02:30PM -0700, Andrii Nakryiko wrote:
> > This patch set implements two speed ups for uprobe/uretprobe runtime 
> > execution
> > path for some common scenarios: BPF-only uprobes (patches #1 and #2) and
> > system-wide (non-PID-specific) uprobes (patch #3). Please see individual
> > patches for details.
> >
> > Given I haven't worked with uprobe code before, I'm unfamiliar with
> > conventions in this subsystem, including which kernel tree patches should be
> > sent to. For now I based all the changes on top of bpf-next/master, which is
> > where I tested and benchmarked everything anyways. Please advise what should
> > I use as a base for subsequent revision. Thanks.

Steven, Masami,

Is this the kind of patches that should go through your tree(s)? Or
you'd be fine with this going through bpf-next? I'd appreciate the
link to the specific GIT repo I should use as a base in the former
case, thank you!

> >
> > Andrii Nakryiko (3):
> >   uprobes: encapsulate preparation of uprobe args buffer
> >   uprobes: prepare uprobe args buffer lazily
> >   uprobes: add speculative lockless system-wide uprobe filter check
>
> nice cleanup and speed up, lgtm
>
> Reviewed-by: Jiri Olsa 
>
> jirka
>
> >
> >  kernel/trace/trace_uprobe.c | 103 ++--
> >  1 file changed, 63 insertions(+), 40 deletions(-)
> >
> > --
> > 2.43.0
> >
> >



Re: [PATCH bpf-next 3/3] uprobes: add speculative lockless system-wide uprobe filter check

2024-03-13 Thread Andrii Nakryiko
On Wed, Mar 13, 2024 at 6:20 AM Oleg Nesterov  wrote:
>
> I forgot everything about this code, plus it has changed a lot since
> I looked at it many years ago, but ...
>
> I think this change is fine but the changelog looks a bit confusing
> (overcomplicated) to me.

It's a new piece of code and logic, so I tried to do my due diligence
and argue why I think it's fine. I'll drop the overcomplicated
explanation, as I agree with you that it's inherently racy even
without my changes (and use-after-free safety is provided with
uprobe->register_rwsem independent from all this).

>
> On 03/12, Andrii Nakryiko wrote:
> >
> > This patch adds a speculative check before grabbing that rwlock. If
> > nr_systemwide is non-zero, lock is skipped and event is passed through.
> > From examining existing logic it looks correct and safe to do. If
> > nr_systemwide is being modified under rwlock in parallel, we have to
> > consider basically just one important race condition: the case when
> > nr_systemwide is dropped from one to zero (from
> > trace_uprobe_filter_remove()) under filter->rwlock, but
> > uprobe_perf_filter() raced and saw it as >0.
>
> Unless I am totally confused, there is nothing new. Even without
> this change trace_uprobe_filter_remove() can clear nr_systemwide
> right after uprobe_perf_filter() drops filter->rwlock.
>
> And of course, trace_uprobe_filter_add() can change nr_systemwide
> from 0 to 1. In this case uprobe_perf_func() can "wrongly" return
> UPROBE_HANDLER_REMOVE but we can't avoid this and afaics this is
> fine even if handler_chain() does unapply_uprobe(), uprobe_perf_open()
> will do uprobe_apply() after that, we can rely on ->register_rwsem.
>

yep, agreed

> > In case we speculatively read nr_systemwide as zero, while it was
> > incremented in parallel, we'll proceed to grabbing filter->rwlock and
> > re-doing the check, this time in lock-protected and non-racy way.
>
> See above...
>
>
> So I think uprobe_perf_filter() needs filter->rwlock only to iterate
> the list, it can check nr_systemwide lockless and this means that you
> can also remove the same check in __uprobe_perf_filter(), other callers
> trace_uprobe_filter_add/remove check it themselves.
>

makes sense, will do
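
i.e. something along these lines (sketch only), with the callers keeping the
lockless READ_ONCE(filter->nr_systemwide) check:

  static bool __uprobe_perf_filter(struct trace_uprobe_filter *filter,
                                   struct mm_struct *mm)
  {
          struct perf_event *event;

          /* nr_systemwide is now checked locklessly by the caller */
          list_for_each_entry(event, &filter->perf_events, hw.tp_list) {
                  if (event->hw.target->mm == mm)
                          return true;
          }

          return false;
  }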

>
> > --- a/kernel/trace/trace_uprobe.c
> > +++ b/kernel/trace/trace_uprobe.c
> > @@ -1351,6 +1351,10 @@ static bool uprobe_perf_filter(struct 
> > uprobe_consumer *uc,
> >   tu = container_of(uc, struct trace_uprobe, consumer);
> >   filter = tu->tp.event->filter;
> >
> > + /* speculative check */
> > + if (READ_ONCE(filter->nr_systemwide))
> > + return true;
> > +
> >   read_lock(&filter->rwlock);
> >   ret = __uprobe_perf_filter(filter, mm);
> >   read_unlock(&filter->rwlock);
>
> ACK,
>
> but see above. I think the changelog should be simplified and the
> filter->nr_systemwide check in __uprobe_perf_filter() should be
> removed. But I won't insist and perhaps I missed something...
>

I think you are right, I'll move the check

> Oleg.
>



Re: [PATCH bpf-next 2/3] uprobes: prepare uprobe args buffer lazily

2024-03-13 Thread Andrii Nakryiko
On Wed, Mar 13, 2024 at 8:48 AM Oleg Nesterov  wrote:
>
> Again, looks good to me, but I have a minor nit. Feel free to ignore.
>
> On 03/12, Andrii Nakryiko wrote:
> >
> >  static void __uprobe_trace_func(struct trace_uprobe *tu,
> >   unsigned long func, struct pt_regs *regs,
> > - struct uprobe_cpu_buffer *ucb,
> > + struct uprobe_cpu_buffer **ucbp,
> >   struct trace_event_file *trace_file)
> >  {
> >   struct uprobe_trace_entry_head *entry;
> >   struct trace_event_buffer fbuffer;
> > + struct uprobe_cpu_buffer *ucb;
> >   void *data;
> >   int size, esize;
> >   struct trace_event_call *call = trace_probe_event_call(>tp);
> >
> > + ucb = *ucbp;
> > + if (!ucb) {
> > + ucb = prepare_uprobe_buffer(tu, regs);
> > + *ucbp = ucb;
> > + }
>
> perhaps it would be more clean to pass ucbp to prepare_uprobe_buffer()
> and change it to do
>
> if (*ucbp)
> return *ucbp;
>
> at the start. Then __uprobe_trace_func() and __uprobe_perf_func() can
> simply do
>
> ucb = prepare_uprobe_buffer(tu, regs, ucbp);

ok, will do

>
> > - uprobe_buffer_put(ucb);
> > + if (ucb)
> > + uprobe_buffer_put(ucb);
>
> Similarly, I think the "ucb != NULL" check should be shifted into
> uprobe_buffer_put().

sure, will hide it inside uprobe_buffer_put()

>
> Oleg.
>



Re: [PATCH bpf-next 1/3] uprobes: encapsulate preparation of uprobe args buffer

2024-03-13 Thread Andrii Nakryiko
On Wed, Mar 13, 2024 at 8:16 AM Oleg Nesterov  wrote:
>
> LGTM, one nit below.
>
> On 03/12, Andrii Nakryiko wrote:
> >
> > +static struct uprobe_cpu_buffer *prepare_uprobe_buffer(struct trace_uprobe 
> > *tu,
> > +struct pt_regs *regs)
> > +{
> > + struct uprobe_cpu_buffer *ucb;
> > + int dsize, esize;
> > +
> > + esize = SIZEOF_TRACE_ENTRY(is_ret_probe(tu));
> > + dsize = __get_data_size(>tp, regs);
> > +
> > + ucb = uprobe_buffer_get();
> > + ucb->dsize = dsize;
> > +
> > + store_trace_args(ucb->buf, >tp, regs, esize, dsize);
> > +
> > + return ucb;
> > +}
>
> OK, but note that every user of ->dsize adds tp.size. So I think you can
> simplify this code a bit more if you change prepare_uprobe_buffer() to do
>
> ucb->dsize = tu->tp.size + dsize;
>
> and update the users.
>

makes sense, done

> Oleg.
>



[PATCH bpf-next 3/3] uprobes: add speculative lockless system-wide uprobe filter check

2024-03-12 Thread Andrii Nakryiko
It's very common for BPF-based uprobe/uretprobe use cases to have
system-wide (not PID-specific) probes. In this case uprobe's
trace_uprobe_filter->nr_systemwide counter is bumped at registration
time, and actual filtering is short circuited at the time when
uprobe/uretprobe is triggered.

This is a great optimization, and the only issue with it is that to even
get to checking this counter uprobe subsystem is taking
read-side trace_uprobe_filter->rwlock. This is actually noticeable in
profiles and is just another point of contention when uprobe is
triggered on multiple CPUs simultaneously.

This patch adds a speculative check before grabbing that rwlock. If
nr_systemwide is non-zero, lock is skipped and event is passed through.
From examining existing logic it looks correct and safe to do. If
nr_systemwide is being modified under rwlock in parallel, we have to
consider basically just one important race condition: the case when
nr_systemwide is dropped from one to zero (from
trace_uprobe_filter_remove()) under filter->rwlock, but
uprobe_perf_filter() raced and saw it as >0.

In this case, we'll proceed with uprobe/uretprobe execution, while
uprobe_perf_close() and uprobe_apply() will be blocked on trying to grab
uprobe->register_rwsem as a writer. It will be blocked because
uprobe_dispatcher() (and, similarly, uretprobe_dispatcher()) runs with
uprobe->register_rwsem taken as a reader. So there is no real race
besides that uprobe/uretprobe might execute one last time before it's
removed, which is fine because from the user space perspective the
uprobe/uretprobe hasn't yet been deactivated.

In case we speculatively read nr_systemwide as zero, while it was
incremented in parallel, we'll proceed to grabbing filter->rwlock and
re-doing the check, this time in lock-protected and non-racy way.

As such, it looks safe to do a quick short circuiting check and save
some performance in a very common system-wide case, not sacrificing hot
path performance due to much rarer possibility of registration or
unregistration of uprobes.

Again, confirming with BPF selftests-based benchmarks.

BEFORE (based on changes in previous patch)
===
uprobe-nop :2.732 ± 0.022M/s
uprobe-push:2.621 ± 0.016M/s
uprobe-ret :1.105 ± 0.007M/s
uretprobe-nop  :1.396 ± 0.007M/s
uretprobe-push :1.347 ± 0.008M/s
uretprobe-ret  :0.800 ± 0.006M/s

AFTER
=
uprobe-nop :2.878 ± 0.017M/s (+5.5%, total +8.3%)
uprobe-push:2.753 ± 0.013M/s (+5.3%, total +10.2%)
uprobe-ret :1.142 ± 0.010M/s (+3.8%, total +3.8%)
uretprobe-nop  :1.444 ± 0.008M/s (+3.5%, total +6.5%)
uretprobe-push :1.410 ± 0.010M/s (+4.8%, total +7.1%)
uretprobe-ret  :0.816 ± 0.002M/s (+2.0%, total +3.9%)

In the above, first percentage value is based on top of previous patch
(lazy uprobe buffer optimization), while the "total" percentage is
based on kernel without any of the changes in this patch set.

As can be seen, we get about 4% - 10% speed up, in total, with both lazy
uprobe buffer and speculative filter check optimizations.

Signed-off-by: Andrii Nakryiko 
---
 kernel/trace/trace_uprobe.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index f2875349d124..be28e6d0578e 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -1351,6 +1351,10 @@ static bool uprobe_perf_filter(struct uprobe_consumer 
*uc,
tu = container_of(uc, struct trace_uprobe, consumer);
filter = tu->tp.event->filter;
 
+   /* speculative check */
+   if (READ_ONCE(filter->nr_systemwide))
+   return true;
+
read_lock(&filter->rwlock);
ret = __uprobe_perf_filter(filter, mm);
read_unlock(&filter->rwlock);
-- 
2.43.0




[PATCH bpf-next 2/3] uprobes: prepare uprobe args buffer lazily

2024-03-12 Thread Andrii Nakryiko
uprobe_cpu_buffer and corresponding logic to store uprobe args into it
are used for uprobes/uretprobes that are created through tracefs or
perf events.

BPF is yet another user of uprobe/uretprobe infrastructure, but doesn't
need uprobe_cpu_buffer and associated data. For BPF-only use cases this
buffer handling and preparation is a pure overhead. At the same time,
BPF-only uprobe/uretprobe usage is very common in practice. Also, in
a lot of cases applications are very sensitive to performance overheads,
as they might be tracing very high frequency functions like
malloc()/free(), so every bit of performance improvement matters.

All that is to say that this uprobe_cpu_buffer preparation is an
unnecessary overhead that each BPF user of uprobes/uretprobe has to pay.
This patch is changing this by making uprobe_cpu_buffer preparation
optional. It will happen only if either tracefs-based or perf event-based
uprobe/uretprobe consumer is registered for given uprobe/uretprobe. For
BPF-only use cases this step will be skipped.

We used uprobe/uretprobe benchmark which is part of BPF selftests (see [0])
to estimate the improvements. We have 3 uprobe and 3 uretprobe
scenarios, which vary an instruction that is replaced by uprobe: nop
(fastest uprobe case), `push rbp` (typical case), and non-simulated
`ret` instruction (slowest case). Benchmark thread is constantly calling
user space function in a tight loop. User space function has attached
BPF uprobe or uretprobe program doing nothing but atomic counter
increments to count number of triggering calls. Benchmark emits
throughput in millions of executions per second.

BEFORE these changes

uprobe-nop :2.657 ± 0.024M/s
uprobe-push:2.499 ± 0.018M/s
uprobe-ret :1.100 ± 0.006M/s
uretprobe-nop  :1.356 ± 0.004M/s
uretprobe-push :1.317 ± 0.019M/s
uretprobe-ret  :0.785 ± 0.007M/s

AFTER these changes
===
uprobe-nop :2.732 ± 0.022M/s (+2.8%)
uprobe-push:2.621 ± 0.016M/s (+4.9%)
uprobe-ret :1.105 ± 0.007M/s (+0.5%)
uretprobe-nop  :1.396 ± 0.007M/s (+2.9%)
uretprobe-push :1.347 ± 0.008M/s (+2.3%)
uretprobe-ret  :0.800 ± 0.006M/s (+1.9%)

So the improvements on this particular machine seem to be between 2% and 5%.

  [0] 
https://github.com/torvalds/linux/blob/master/tools/testing/selftests/bpf/benchs/bench_trigger.c

Signed-off-by: Andrii Nakryiko 
---
 kernel/trace/trace_uprobe.c | 56 ++---
 1 file changed, 34 insertions(+), 22 deletions(-)

diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index a0f60bb10158..f2875349d124 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -963,15 +963,22 @@ static struct uprobe_cpu_buffer 
*prepare_uprobe_buffer(struct trace_uprobe *tu,
 
 static void __uprobe_trace_func(struct trace_uprobe *tu,
unsigned long func, struct pt_regs *regs,
-   struct uprobe_cpu_buffer *ucb,
+   struct uprobe_cpu_buffer **ucbp,
struct trace_event_file *trace_file)
 {
struct uprobe_trace_entry_head *entry;
struct trace_event_buffer fbuffer;
+   struct uprobe_cpu_buffer *ucb;
void *data;
int size, esize;
struct trace_event_call *call = trace_probe_event_call(&tu->tp);
 
+   ucb = *ucbp;
+   if (!ucb) {
+   ucb = prepare_uprobe_buffer(tu, regs);
+   *ucbp = ucb;
+   }
+
WARN_ON(call != trace_file->event_call);
 
if (WARN_ON_ONCE(tu->tp.size + ucb->dsize > PAGE_SIZE))
@@ -1002,7 +1009,7 @@ static void __uprobe_trace_func(struct trace_uprobe *tu,
 
 /* uprobe handler */
 static int uprobe_trace_func(struct trace_uprobe *tu, struct pt_regs *regs,
-struct uprobe_cpu_buffer *ucb)
+struct uprobe_cpu_buffer **ucbp)
 {
struct event_file_link *link;
 
@@ -1011,7 +1018,7 @@ static int uprobe_trace_func(struct trace_uprobe *tu, 
struct pt_regs *regs,
 
rcu_read_lock();
trace_probe_for_each_link_rcu(link, &tu->tp)
-   __uprobe_trace_func(tu, 0, regs, ucb, link->file);
+   __uprobe_trace_func(tu, 0, regs, ucbp, link->file);
rcu_read_unlock();
 
return 0;
@@ -1019,13 +1026,13 @@ static int uprobe_trace_func(struct trace_uprobe *tu, 
struct pt_regs *regs,
 
 static void uretprobe_trace_func(struct trace_uprobe *tu, unsigned long func,
 struct pt_regs *regs,
-struct uprobe_cpu_buffer *ucb)
+struct uprobe_cpu_buffer **ucbp)
 {
struct event_file_link *link;
 
rcu_read_lock();
trace_probe_for_each_link_rcu(link, &tu->tp)
-   __uprobe_trace_func(tu, func, regs, ucb, link->file);
+   __uprobe_trace_func(tu

[PATCH bpf-next 1/3] uprobes: encapsulate preparation of uprobe args buffer

2024-03-12 Thread Andrii Nakryiko
Move the logic of fetching temporary per-CPU uprobe buffer and storing
uprobes args into it to a new helper function. Store data size as part
of this buffer, simplifying interfaces a bit, as now we only pass single
uprobe_cpu_buffer reference around, instead of pointer + dsize.

This logic was duplicated across uprobe_dispatcher and uretprobe_dispatcher,
and now will be centralized. All this is also in preparation to make
this uprobe_cpu_buffer handling logic optional in the next patch.

Signed-off-by: Andrii Nakryiko 
---
 kernel/trace/trace_uprobe.c | 75 -
 1 file changed, 41 insertions(+), 34 deletions(-)

diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index a84b85d8aac1..a0f60bb10158 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -854,6 +854,7 @@ static const struct file_operations uprobe_profile_ops = {
 struct uprobe_cpu_buffer {
struct mutex mutex;
void *buf;
+   int dsize;
 };
 static struct uprobe_cpu_buffer __percpu *uprobe_cpu_buffer;
 static int uprobe_buffer_refcnt;
@@ -943,9 +944,26 @@ static void uprobe_buffer_put(struct uprobe_cpu_buffer 
*ucb)
mutex_unlock(&ucb->mutex);
 }
 
+static struct uprobe_cpu_buffer *prepare_uprobe_buffer(struct trace_uprobe *tu,
+  struct pt_regs *regs)
+{
+   struct uprobe_cpu_buffer *ucb;
+   int dsize, esize;
+
+   esize = SIZEOF_TRACE_ENTRY(is_ret_probe(tu));
+   dsize = __get_data_size(&tu->tp, regs);
+
+   ucb = uprobe_buffer_get();
+   ucb->dsize = dsize;
+
+   store_trace_args(ucb->buf, &tu->tp, regs, esize, dsize);
+
+   return ucb;
+}
+
 static void __uprobe_trace_func(struct trace_uprobe *tu,
unsigned long func, struct pt_regs *regs,
-   struct uprobe_cpu_buffer *ucb, int dsize,
+   struct uprobe_cpu_buffer *ucb,
struct trace_event_file *trace_file)
 {
struct uprobe_trace_entry_head *entry;
@@ -956,14 +974,14 @@ static void __uprobe_trace_func(struct trace_uprobe *tu,
 
WARN_ON(call != trace_file->event_call);
 
-   if (WARN_ON_ONCE(tu->tp.size + dsize > PAGE_SIZE))
+   if (WARN_ON_ONCE(tu->tp.size + ucb->dsize > PAGE_SIZE))
return;
 
if (trace_trigger_soft_disabled(trace_file))
return;
 
esize = SIZEOF_TRACE_ENTRY(is_ret_probe(tu));
-   size = esize + tu->tp.size + dsize;
+   size = esize + tu->tp.size + ucb->dsize;
entry = trace_event_buffer_reserve(&fbuffer, trace_file, size);
if (!entry)
return;
@@ -977,14 +995,14 @@ static void __uprobe_trace_func(struct trace_uprobe *tu,
data = DATAOF_TRACE_ENTRY(entry, false);
}
 
-   memcpy(data, ucb->buf, tu->tp.size + dsize);
+   memcpy(data, ucb->buf, tu->tp.size + ucb->dsize);
 
trace_event_buffer_commit(&fbuffer);
 }
 
 /* uprobe handler */
 static int uprobe_trace_func(struct trace_uprobe *tu, struct pt_regs *regs,
-struct uprobe_cpu_buffer *ucb, int dsize)
+struct uprobe_cpu_buffer *ucb)
 {
struct event_file_link *link;
 
@@ -993,7 +1011,7 @@ static int uprobe_trace_func(struct trace_uprobe *tu, 
struct pt_regs *regs,
 
rcu_read_lock();
trace_probe_for_each_link_rcu(link, &tu->tp)
-   __uprobe_trace_func(tu, 0, regs, ucb, dsize, link->file);
+   __uprobe_trace_func(tu, 0, regs, ucb, link->file);
rcu_read_unlock();
 
return 0;
@@ -1001,13 +1019,13 @@ static int uprobe_trace_func(struct trace_uprobe *tu, 
struct pt_regs *regs,
 
 static void uretprobe_trace_func(struct trace_uprobe *tu, unsigned long func,
 struct pt_regs *regs,
-struct uprobe_cpu_buffer *ucb, int dsize)
+struct uprobe_cpu_buffer *ucb)
 {
struct event_file_link *link;
 
rcu_read_lock();
trace_probe_for_each_link_rcu(link, &tu->tp)
-   __uprobe_trace_func(tu, func, regs, ucb, dsize, link->file);
+   __uprobe_trace_func(tu, func, regs, ucb, link->file);
rcu_read_unlock();
 }
 
@@ -1335,7 +1353,7 @@ static bool uprobe_perf_filter(struct uprobe_consumer *uc,
 
 static void __uprobe_perf_func(struct trace_uprobe *tu,
   unsigned long func, struct pt_regs *regs,
-  struct uprobe_cpu_buffer *ucb, int dsize)
+  struct uprobe_cpu_buffer *ucb)
 {
struct trace_event_call *call = trace_probe_event_call(&tu->tp);
struct uprobe_trace_entry_head *entry;
@@ -1356,7 +1374,7 @@ static void __uprobe_perf_func(struct trace_uprobe *tu,
 
esize = SIZEOF_TRACE_ENTRY(is_ret_probe(tu));
 

[PATCH bpf-next 0/3] uprobes: two common case speed ups

2024-03-12 Thread Andrii Nakryiko
This patch set implements two speed ups for uprobe/uretprobe runtime execution
path for some common scenarios: BPF-only uprobes (patches #1 and #2) and
system-wide (non-PID-specific) uprobes (patch #3). Please see individual
patches for details.

Given I haven't worked with uprobe code before, I'm unfamiliar with
conventions in this subsystem, including which kernel tree patches should be
sent to. For now I based all the changes on top of bpf-next/master, which is
where I tested and benchmarked everything anyways. Please advise what should
I use as a base for subsequent revision. Thanks.

Andrii Nakryiko (3):
  uprobes: encapsulate preparation of uprobe args buffer
  uprobes: prepare uprobe args buffer lazily
  uprobes: add speculative lockless system-wide uprobe filter check

 kernel/trace/trace_uprobe.c | 103 ++--
 1 file changed, 63 insertions(+), 40 deletions(-)

-- 
2.43.0




Re: [PATCH for-next] tracing/kprobes: Add symbol counting check when module loads

2023-10-31 Thread Andrii Nakryiko
On Sat, Oct 28, 2023 at 8:10 PM Masami Hiramatsu (Google)
 wrote:
>
> From: Masami Hiramatsu (Google) 
>
> Check the number of probe target symbols in the target module when
> the module is loaded. If the probe is not on the unique name symbols
> in the module, it will be rejected at that point.
>
> Note that a symbol which has a unique name in the target module
> will be accepted even if there are same-name symbols in the
> kernel or other modules.
>
> Signed-off-by: Masami Hiramatsu (Google) 
> ---
>  kernel/trace/trace_kprobe.c |  112 
> ++-
>  1 file changed, 68 insertions(+), 44 deletions(-)
>

LGTM.

Acked-by: Andrii Nakryiko 


> diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
> index e834f149695b..90cf2219adb4 100644
> --- a/kernel/trace/trace_kprobe.c
> +++ b/kernel/trace/trace_kprobe.c
> @@ -670,6 +670,21 @@ static int register_trace_kprobe(struct trace_kprobe *tk)
> return ret;
>  }
>
> +static int validate_module_probe_symbol(const char *modname, const char 
> *symbol);
> +
> +static int register_module_trace_kprobe(struct module *mod, struct 
> trace_kprobe *tk)
> +{
> +   const char *p;
> +   int ret = 0;
> +
> +   p = strchr(trace_kprobe_symbol(tk), ':');
> +   if (p)
> +   ret = validate_module_probe_symbol(module_name(mod), p++);
> +   if (!ret)
> +   ret = register_trace_kprobe(tk);
> +   return ret;
> +}
> +
>  /* Module notifier call back, checking event on the module */
>  static int trace_kprobe_module_callback(struct notifier_block *nb,
>unsigned long val, void *data)
> @@ -688,7 +703,7 @@ static int trace_kprobe_module_callback(struct 
> notifier_block *nb,
> if (trace_kprobe_within_module(tk, mod)) {
> /* Don't need to check busy - this should have gone. 
> */
> __unregister_trace_kprobe(tk);
> -   ret = __register_trace_kprobe(tk);
> +   ret = register_module_trace_kprobe(mod, tk);
> if (ret)
> pr_warn("Failed to re-register probe %s on 
> %s: %d\n",
> trace_probe_name(&tk->tp),
> @@ -729,17 +744,55 @@ static int count_mod_symbols(void *data, const char 
> *name, unsigned long unused)
> return 0;
>  }
>
> -static unsigned int number_of_same_symbols(char *func_name)
> +static unsigned int number_of_same_symbols(const char *mod, const char 
> *func_name)
>  {
> struct sym_count_ctx ctx = { .count = 0, .name = func_name };
>
> -   kallsyms_on_each_match_symbol(count_symbols, func_name, &ctx);
> +   if (!mod)
> +   kallsyms_on_each_match_symbol(count_symbols, func_name, &ctx);
>
> -   module_kallsyms_on_each_symbol(NULL, count_mod_symbols, &ctx);
> +   module_kallsyms_on_each_symbol(mod, count_mod_symbols, &ctx);
>
> return ctx.count;
>  }
>
> +static int validate_module_probe_symbol(const char *modname, const char 
> *symbol)
> +{
> +   unsigned int count = number_of_same_symbols(modname, symbol);
> +
> +   if (count > 1) {
> +   /*
> +* Users should use ADDR to remove the ambiguity of
> +* using KSYM only.
> +*/
> +   return -EADDRNOTAVAIL;
> +   } else if (count == 0) {
> +   /*
> +* We can return ENOENT earlier than when register the
> +* kprobe.
> +*/
> +   return -ENOENT;
> +   }
> +   return 0;
> +}
> +
> +static int validate_probe_symbol(char *symbol)
> +{
> +   char *mod = NULL, *p;
> +   int ret;
> +
> +   p = strchr(symbol, ':');
> +   if (p) {
> +   mod = symbol;
> +   symbol = p + 1;
> +   *p = '\0';
> +   }
> +   ret = validate_module_probe_symbol(mod, symbol);
> +   if (p)
> +   *p = ':';
> +   return ret;
> +}
> +
>  static int __trace_kprobe_create(int argc, const char *argv[])
>  {
> /*
> @@ -859,6 +912,14 @@ static int __trace_kprobe_create(int argc, const char 
> *argv[])
> trace_probe_log_err(0, BAD_PROBE_ADDR);
> goto parse_error;
> }
> +   ret = validate_probe_symbol(symbol);
> +   if (ret) {
> +   if (ret == -EADDRNOTAVAIL)
> +   trace_probe_log_err(0, NON_UNIQ_SYMBOL);
> +  

[PATCH] tracing/kprobes: Fix symbol counting logic by looking at modules as well

2023-10-27 Thread Andrii Nakryiko
Recent changes to count number of matching symbols when creating
a kprobe event failed to take into account kernel modules. As such, it
breaks kprobes on kernel module symbols, by assuming there is no match.

Fix this by calling module_kallsyms_on_each_symbol() in addition to
kallsyms_on_each_match_symbol() to perform a proper counting.

Cc: Francis Laniel 
Cc: sta...@vger.kernel.org
Cc: Masami Hiramatsu 
Cc: Steven Rostedt 
Fixes: b022f0c7e404 ("tracing/kprobes: Return EADDRNOTAVAIL when func matches 
several symbols")
Signed-off-by: Andrii Nakryiko 
---
 kernel/trace/trace_kprobe.c | 24 
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index effcaede4759..1efb27f35963 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -714,14 +714,30 @@ static int count_symbols(void *data, unsigned long unused)
return 0;
 }
 
+struct sym_count_ctx {
+   unsigned int count;
+   const char *name;
+};
+
+static int count_mod_symbols(void *data, const char *name, unsigned long 
unused)
+{
+   struct sym_count_ctx *ctx = data;
+
+   if (strcmp(name, ctx->name) == 0)
+   ctx->count++;
+
+   return 0;
+}
+
 static unsigned int number_of_same_symbols(char *func_name)
 {
-   unsigned int count;
+   struct sym_count_ctx ctx = { .count = 0, .name = func_name };
+
+   kallsyms_on_each_match_symbol(count_symbols, func_name, &ctx);
 
-   count = 0;
-   kallsyms_on_each_match_symbol(count_symbols, func_name, &count);
+   module_kallsyms_on_each_symbol(NULL, count_mod_symbols, &ctx);
 
-   return count;
+   return ctx.count;
 }
 
 static int __trace_kprobe_create(int argc, const char *argv[])
-- 
2.34.1




Re: [RFC PATCH bpf-next] bpf: change syscall_nr type to int in struct syscall_tp_t

2023-10-13 Thread Andrii Nakryiko
On Fri, Oct 13, 2023 at 7:00 AM Steven Rostedt  wrote:
>
> On Fri, 13 Oct 2023 08:01:34 +0200
> Artem Savkov  wrote:
>
> > > But looking at [0] and briefly reading some of the discussions you,
> > > Steven, had. I'm just wondering if it would be best to avoid
> > > increasing struct trace_entry altogether? It seems like preempt_count
> > > is actually a 4-bit field in trace context, so it doesn't seem like we
> > > really need to allocate an entire byte for both preempt_count and
> > > preempt_lazy_count. Why can't we just combine them and not waste 8
> > > extra bytes for each trace event in a ring buffer?
> > >
> > >   [0] 
> > > https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/commit/?id=b1773eac3f29cbdcdfd16e0339f1a164066e9f71
> >
> > I agree that avoiding increase in struct trace_entry size would be very
> > desirable, but I have no knowledge whether rt developers had reasons to
> > do it like this.
> >
> > Nevertheless I think the issue with verifier running against a wrong
> > struct still needs to be addressed.
>
> Correct. My Ack is based on the current way things are done upstream.
> It was just that linux-rt showed the issue, where the code was not as
> robust as it should have been. To me this was a correctness issue, not
> an issue that had to do with how things are done in linux-rt.

I think we should at least add some BUILD_BUG_ON() that validates
offsets in syscall_tp_t matches the ones in syscall_trace_enter and
syscall_trace_exit, to fail more loudly if there is any mismatch in
the future. WDYT?
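
Something like this, e.g. at the top of perf_syscall_enter()/perf_syscall_exit()
(a rough sketch; exact struct and field names would need double-checking against
kernel/trace/trace_syscalls.c):

  BUILD_BUG_ON(offsetof(struct syscall_tp_t, syscall_nr) !=
               offsetof(struct syscall_trace_enter, nr));
  BUILD_BUG_ON(offsetof(struct syscall_tp_t, args) !=
               offsetof(struct syscall_trace_enter, args));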

>
> As for the changes in linux-rt, they are not upstream yet. I'll have my
> comments on that code when that happens.

Ah, ok, cool. I'd appreciate you cc'ing b...@vger.kernel.org in that
discussion, thank you!

>
> -- Steve



Re: [RFC PATCH bpf-next] bpf: change syscall_nr type to int in struct syscall_tp_t

2023-10-12 Thread Andrii Nakryiko
On Thu, Oct 12, 2023 at 6:43 AM Steven Rostedt  wrote:
>
> On Thu, 12 Oct 2023 13:45:50 +0200
> Artem Savkov  wrote:
>
> > linux-rt-devel tree contains a patch (b1773eac3f29c ("sched: Add support
> > for lazy preemption")) that adds an extra member to struct trace_entry.
> > This causes the offset of args field in struct trace_event_raw_sys_enter
> > to be different from the one in struct syscall_trace_enter:
> >
> > struct trace_event_raw_sys_enter {
> > struct trace_entry ent;  /* 012 */
> >
> > /* XXX last struct has 3 bytes of padding */
> > /* XXX 4 bytes hole, try to pack */
> >
> > long int   id;   /*16 8 */
> > long unsigned int  args[6];  /*2448 */
> > /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */
> > char   __data[]; /*72 0 */
> >
> > /* size: 72, cachelines: 2, members: 4 */
> > /* sum members: 68, holes: 1, sum holes: 4 */
> > /* paddings: 1, sum paddings: 3 */
> > /* last cacheline: 8 bytes */
> > };
> >
> > struct syscall_trace_enter {
> > struct trace_entry ent;  /* 012 */
> >
> > /* XXX last struct has 3 bytes of padding */
> >
> > intnr;   /*12 4 */
> > long unsigned int  args[];   /*16 0 */
> >
> > /* size: 16, cachelines: 1, members: 3 */
> > /* paddings: 1, sum paddings: 3 */
> > /* last cacheline: 16 bytes */
> > };
> >
> > This, in turn, causes perf_event_set_bpf_prog() to fail while running bpf
> > test_profiler testcase because max_ctx_offset is calculated based on the
> > former struct, while off on the latter:
> >
> >   10488 if (is_tracepoint || is_syscall_tp) {
> >   10489 int off = trace_event_get_offsets(event->tp_event);
> >   10490
> >   10491 if (prog->aux->max_ctx_offset > off)
> >   10492 return -EACCES;
> >   10493 }
> >
> > What bpf program is actually getting is a pointer to struct
> > syscall_tp_t, defined in kernel/trace/trace_syscalls.c. This patch fixes
> > the problem by aligning struct syscall_tp_t with with struct
> > syscall_trace_(enter|exit) and changing the tests to use these structs
> > to dereference context.
> >
> > Signed-off-by: Artem Savkov 
>

I think these changes make sense regardless. Can you please resend the
patch without the RFC tag so that our CI can run tests for it?

> Thanks for doing a proper fix.
>
> Acked-by: Steven Rostedt (Google) 

But looking at [0] and briefly reading some of the discussions you,
Steven, had. I'm just wondering if it would be best to avoid
increasing struct trace_entry altogether? It seems like preempt_count
is actually a 4-bit field in trace context, so it doesn't seem like we
really need to allocate an entire byte for both preempt_count and
preempt_lazy_count. Why can't we just combine them and not waste 8
extra bytes for each trace event in a ring buffer?

  [0] 
https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/commit/?id=b1773eac3f29cbdcdfd16e0339f1a164066e9f71
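
For illustration only, the kind of packing meant here (trace_entry layout from
include/linux/trace_events.h; the accessor macro names are made up):

  /* keep trace_entry at its current size: both counters share one byte */
  struct trace_entry {
          unsigned short          type;
          unsigned char           flags;
          unsigned char           preempt_count;  /* bits 0-3: preempt, bits 4-7: lazy */
          int                     pid;
  };

  #define TRACE_PREEMPT_COUNT(pc)       ((pc) & 0x0f)
  #define TRACE_PREEMPT_LAZY_COUNT(pc)  (((pc) >> 4) & 0x0f)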

>
> -- Steve



Re: [RFC PATCH] tracing: change syscall number type in struct syscall_trace_*

2023-10-03 Thread Andrii Nakryiko
On Mon, Oct 2, 2023 at 6:53 AM Artem Savkov  wrote:
>
> linux-rt-devel tree contains a patch that adds an extra member to struct

can you please point to the patch itself that makes that change?

> trace_entry. This causes the offset of args field in struct
> trace_event_raw_sys_enter be different from the one in struct
> syscall_trace_enter:
>
> struct trace_event_raw_sys_enter {
> struct trace_entry ent;  /* 012 */
>
> /* XXX last struct has 3 bytes of padding */
> /* XXX 4 bytes hole, try to pack */
>
> long int   id;   /*16 8 */
> long unsigned int  args[6];  /*2448 */
> /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */
> char   __data[]; /*72 0 */
>
> /* size: 72, cachelines: 2, members: 4 */
> /* sum members: 68, holes: 1, sum holes: 4 */
> /* paddings: 1, sum paddings: 3 */
> /* last cacheline: 8 bytes */
> };
>
> struct syscall_trace_enter {
> struct trace_entry ent;  /* 012 */
>
> /* XXX last struct has 3 bytes of padding */
>
> intnr;   /*12 4 */
> long unsigned int  args[];   /*16 0 */
>
> /* size: 16, cachelines: 1, members: 3 */
> /* paddings: 1, sum paddings: 3 */
> /* last cacheline: 16 bytes */
> };
>
> This, in turn, causes perf_event_set_bpf_prog() to fail while running bpf
> test_profiler testcase because max_ctx_offset is calculated based on the
> former struct, while off on the latter:
>
>   10488 if (is_tracepoint || is_syscall_tp) {
>   10489 int off = trace_event_get_offsets(event->tp_event);
>   10490
>   10491 if (prog->aux->max_ctx_offset > off)
>   10492 return -EACCES;
>   10493 }
>
> This patch changes the type of nr member in syscall_trace_* structs to
> be long so that "args" offset is equal to that in struct
> trace_event_raw_sys_enter.
>
> Signed-off-by: Artem Savkov 
> ---
>  kernel/trace/trace.h  | 4 ++--
>  kernel/trace/trace_syscalls.c | 7 ---
>  2 files changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h
> index 77debe53f07cf..cd1d24df85364 100644
> --- a/kernel/trace/trace.h
> +++ b/kernel/trace/trace.h
> @@ -135,13 +135,13 @@ enum trace_type {
>   */
>  struct syscall_trace_enter {
> struct trace_entry  ent;
> -   int nr;
> +   long    nr;
> unsigned long   args[];
>  };
>
>  struct syscall_trace_exit {
> struct trace_entry  ent;
> -   int nr;
> +   long    nr;
> long    ret;
>  };
>
> diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c
> index de753403cdafb..c26939119f2e4 100644
> --- a/kernel/trace/trace_syscalls.c
> +++ b/kernel/trace/trace_syscalls.c
> @@ -101,7 +101,7 @@ find_syscall_meta(unsigned long syscall)
> return NULL;
>  }
>
> -static struct syscall_metadata *syscall_nr_to_meta(int nr)
> +static struct syscall_metadata *syscall_nr_to_meta(long nr)
>  {
> if (IS_ENABLED(CONFIG_HAVE_SPARSE_SYSCALL_NR))
> return xa_load(&syscalls_metadata_sparse, (unsigned long)nr);
> @@ -132,7 +132,8 @@ print_syscall_enter(struct trace_iterator *iter, int 
> flags,
> struct trace_entry *ent = iter->ent;
> struct syscall_trace_enter *trace;
> struct syscall_metadata *entry;
> -   int i, syscall;
> +   int i;
> +   long syscall;
>
> trace = (typeof(trace))ent;
> syscall = trace->nr;
> @@ -177,7 +178,7 @@ print_syscall_exit(struct trace_iterator *iter, int flags,
> struct trace_seq *s = &iter->seq;
> struct trace_entry *ent = iter->ent;
> struct syscall_trace_exit *trace;
> -   int syscall;
> +   long syscall;
> struct syscall_metadata *entry;
>
> trace = (typeof(trace))ent;
> --
> 2.41.0
>
>



Re: [PATCH v3 bpf-next 11/11] bpf: Test BPF_SK_REUSEPORT_SELECT_OR_MIGRATE.

2021-04-20 Thread Andrii Nakryiko
On Tue, Apr 20, 2021 at 8:45 AM Kuniyuki Iwashima  wrote:
>
> This patch adds a test for BPF_SK_REUSEPORT_SELECT_OR_MIGRATE and
> removes 'static' from settimeo() in network_helpers.c.
>
> Signed-off-by: Kuniyuki Iwashima 
> ---

Almost everything in prog_tests/migrate_reuseport.c, functions and
variables alike, should be static. Except test_migrate_reuseport, of course.

But thank you for using ASSERT_xxx()! :)
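
i.e. roughly this shape (a sketch of the convention, names illustrative, using
the ASSERT_xxx macros from test_progs.h):

  static int setup_servers(void)          /* helpers and globals: static */
  {
          return 0;
  }

  void test_migrate_reuseport(void)       /* only the prog_tests entry point stays global */
  {
          if (!ASSERT_OK(setup_servers(), "setup_servers"))
                  return;
  }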

>  tools/testing/selftests/bpf/network_helpers.c |   2 +-
>  tools/testing/selftests/bpf/network_helpers.h |   1 +
>  .../bpf/prog_tests/migrate_reuseport.c| 483 ++
>  .../bpf/progs/test_migrate_reuseport.c|  51 ++
>  4 files changed, 536 insertions(+), 1 deletion(-)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_migrate_reuseport.c
>

[...]


Re: [PATCH bpf-next v2 4/4] libbpf: add selftests for TC-BPF API

2021-04-19 Thread Andrii Nakryiko
On Mon, Apr 19, 2021 at 5:18 AM Kumar Kartikeya Dwivedi
 wrote:
>
> This adds some basic tests for the low level bpf_tc_cls_* API.
>
> Reviewed-by: Toke Høiland-Jørgensen 
> Signed-off-by: Kumar Kartikeya Dwivedi 
> ---
>  .../selftests/bpf/prog_tests/test_tc_bpf.c| 112 ++
>  .../selftests/bpf/progs/test_tc_bpf_kern.c|  12 ++
>  2 files changed, 124 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/test_tc_bpf.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_tc_bpf_kern.c
>
> diff --git a/tools/testing/selftests/bpf/prog_tests/test_tc_bpf.c 
> b/tools/testing/selftests/bpf/prog_tests/test_tc_bpf.c
> new file mode 100644
> index ..945f3a1a72f8
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/prog_tests/test_tc_bpf.c
> @@ -0,0 +1,112 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#define LO_IFINDEX 1
> +
> +static int test_tc_cls_internal(int fd, __u32 parent_id)
> +{
> +   DECLARE_LIBBPF_OPTS(bpf_tc_cls_opts, opts, .handle = 1, .priority = 
> 10,
> +   .class_id = TC_H_MAKE(1UL << 16, 1),
> +   .chain_index = 5);
> +   struct bpf_tc_cls_attach_id id = {};
> +   struct bpf_tc_cls_info info = {};
> +   int ret;
> +
> +   ret = bpf_tc_cls_attach(fd, LO_IFINDEX, parent_id, &opts, &id);
> +   if (CHECK_FAIL(ret < 0))
> +   return ret;
> +
> +   ret = bpf_tc_cls_get_info(fd, LO_IFINDEX, parent_id, NULL, &info);
> +   if (CHECK_FAIL(ret < 0))
> +   goto end;
> +
> +   ret = -1;
> +
> +   if (CHECK_FAIL(info.id.handle != id.handle) ||
> +   CHECK_FAIL(info.id.chain_index != id.chain_index) ||
> +   CHECK_FAIL(info.id.priority != id.priority) ||
> +   CHECK_FAIL(info.id.handle != 1) ||
> +   CHECK_FAIL(info.id.priority != 10) ||
> +   CHECK_FAIL(info.class_id != TC_H_MAKE(1UL << 16, 1)) ||
> +   CHECK_FAIL(info.id.chain_index != 5))
> +   goto end;
> +
> +   ret = bpf_tc_cls_replace(fd, LO_IFINDEX, parent_id, &opts, &id);
> +   if (CHECK_FAIL(ret < 0))
> +   return ret;
> +
> +   if (CHECK_FAIL(info.id.handle != 1) ||
> +   CHECK_FAIL(info.id.priority != 10) ||
> +   CHECK_FAIL(info.class_id != TC_H_MAKE(1UL << 16, 1)))
> +   goto end;
> +
> +   /* Demonstrate changing attributes */
> +   opts.class_id = TC_H_MAKE(1UL << 16, 2);
> +
> +   ret = bpf_tc_cls_change(fd, LO_IFINDEX, parent_id, &opts, &id);
> +   if (CHECK_FAIL(ret < 0))
> +   goto end;
> +
> +   ret = bpf_tc_cls_get_info(fd, LO_IFINDEX, parent_id, NULL, &info);
> +   if (CHECK_FAIL(ret < 0))
> +   goto end;
> +
> +   if (CHECK_FAIL(info.class_id != TC_H_MAKE(1UL << 16, 2)))
> +   goto end;
> +   if (CHECK_FAIL((info.bpf_flags & TCA_BPF_FLAG_ACT_DIRECT) != 1))
> +   goto end;
> +
> +end:
> +   ret = bpf_tc_cls_detach(LO_IFINDEX, parent_id, &id);
> +   CHECK_FAIL(ret < 0);
> +   return ret;
> +}
> +
> +void test_test_tc_bpf(void)
> +{
> +   const char *file = "./test_tc_bpf_kern.o";
> +   struct bpf_program *clsp;
> +   struct bpf_object *obj;
> +   int cls_fd, ret;
> +
> +   obj = bpf_object__open(file);
> +   if (CHECK_FAIL(IS_ERR_OR_NULL(obj)))
> +   return;
> +
> +   clsp = bpf_object__find_program_by_title(obj, "classifier");
> +   if (CHECK_FAIL(IS_ERR_OR_NULL(clsp)))
> +   goto end;
> +
> +   ret = bpf_object__load(obj);
> +   if (CHECK_FAIL(ret < 0))
> +   goto end;
> +
> +   cls_fd = bpf_program__fd(clsp);
> +
> +   system("tc qdisc del dev lo clsact");
> +
> +   ret = test_tc_cls_internal(cls_fd, BPF_TC_CLSACT_INGRESS);
> +   if (CHECK_FAIL(ret < 0))
> +   goto end;
> +
> +   if (CHECK_FAIL(system("tc qdisc del dev lo clsact")))
> +   goto end;
> +
> +   ret = test_tc_cls_internal(cls_fd, BPF_TC_CLSACT_EGRESS);
> +   if (CHECK_FAIL(ret < 0))
> +   goto end;
> +
> +   CHECK_FAIL(system("tc qdisc del dev lo clsact"));

please don't use CHECK_FAIL. And prefer ASSERT_xxx over CHECK().
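
e.g. something like this (sketch only, reusing the names from the patch):

  ret = bpf_tc_cls_attach(fd, LO_IFINDEX, parent_id, &opts, &id);
  if (!ASSERT_OK(ret, "bpf_tc_cls_attach"))
          return ret;

  ret = bpf_tc_cls_get_info(fd, LO_IFINDEX, parent_id, NULL, &info);
  if (!ASSERT_OK(ret, "bpf_tc_cls_get_info"))
          goto end;

  ASSERT_EQ(info.id.handle, 1, "handle");
  ASSERT_EQ(info.id.priority, 10, "priority");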

> +
> +end:
> +   bpf_object__close(obj);
> +}
> diff --git a/tools/testing/selftests/bpf/progs/test_tc_bpf_kern.c 
> b/tools/testing/selftests/bpf/progs/test_tc_bpf_kern.c
> new file mode 100644
> index ..3dd40e21af8e
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/test_tc_bpf_kern.c
> @@ -0,0 +1,12 @@
> +// SPDX-License-Identifier: GPL-2.0
> +
> +#include 
> +#include 
> +
> +// Dummy prog to test TC-BPF API

no C++-style comments, please (except for SPDX header, of course)
> +
> +SEC("classifier")
> +int cls(struct __sk_buff *skb)
> +{
> +   return 0;
> +}
> --
> 2.30.2
>


Re: [PATCH bpf-next v5 0/6] Add a snprintf eBPF helper

2021-04-19 Thread Andrii Nakryiko
On Mon, Apr 19, 2021 at 8:52 AM Florent Revest  wrote:
>
> We have a usecase where we want to audit symbol names (if available) in
> callback registration hooks. (ex: fentry/nf_register_net_hook)
>
> A few months back, I proposed a bpf_kallsyms_lookup series but it was
> decided in the reviews that a more generic helper, bpf_snprintf, would
> be more useful.
>
> This series implements the helper according to the feedback received in
> https://lore.kernel.org/bpf/20201126165748.1748417-1-rev...@google.com/T/#u
>
> - A new arg type guarantees the NULL-termination of string arguments and
>   lets us pass format strings in only one arg
> - A new helper is implemented using that guarantee. Because the format
>   string is known at verification time, the format string validation is
>   done by the verifier
> - To implement a series of tests for bpf_snprintf, the logic for
>   marshalling variadic args in a fixed-size array is reworked as per:
> https://lore.kernel.org/bpf/20210310015455.1095207-1-rev...@chromium.org/T/#u
>
> ---
> Changes in v5:
> - Fixed the bpf_printf_buf_used counter logic in try_get_fmt_tmp_buf
> - Added a couple of extra incorrect specifiers tests
> - Call test_snprintf_single__destroy unconditionally
> - Fixed a C++-style comment
>
> ---
> Changes in v4:
> - Moved bpf_snprintf, bpf_printf_prepare and bpf_printf_cleanup to
>   kernel/bpf/helpers.c so that they get built without CONFIG_BPF_EVENTS
> - Added negative test cases (various invalid format strings)
> - Renamed put_fmt_tmp_buf() as bpf_printf_cleanup()
> - Fixed a mistake that caused temporary buffers to be unconditionally
>   freed in bpf_printf_prepare
> - Fixed a mistake that caused missing 0 character to be ignored
> - Fixed a warning about integer to pointer conversion
> - Misc cleanups
>
> ---
> Changes in v3:
> - Simplified temporary buffer acquisition with try_get_fmt_tmp_buf()
> - Made zero-termination check more consistent
> - Allowed NULL output_buffer
> - Simplified the BPF_CAST_FMT_ARG macro
> - Three new test cases: number padding, simple string with no arg and
>   string length extraction only with a NULL output buffer
> - Clarified helper's description for edge cases (eg: str_size == 0)
> - Lots of cosmetic changes
>
> ---
> Changes in v2:
> - Extracted the format validation/argument sanitization in a generic way
>   for all printf-like helpers.
> - bpf_snprintf's str_size can now be 0
> - bpf_snprintf is now exposed to all BPF program types
> - We now preempt_disable when using a per-cpu temporary buffer
> - Addressed a few cosmetic changes
>
> Florent Revest (6):
>   bpf: Factorize bpf_trace_printk and bpf_seq_printf
>   bpf: Add a ARG_PTR_TO_CONST_STR argument type
>   bpf: Add a bpf_snprintf helper
>   libbpf: Initialize the bpf_seq_printf parameters array field by field
>   libbpf: Introduce a BPF_SNPRINTF helper macro
>   selftests/bpf: Add a series of tests for bpf_snprintf
>
>  include/linux/bpf.h   |  22 ++
>  include/uapi/linux/bpf.h  |  28 ++
>  kernel/bpf/helpers.c  | 306 ++
>  kernel/bpf/verifier.c |  82 
>  kernel/trace/bpf_trace.c  | 373 ++
>  tools/include/uapi/linux/bpf.h|  28 ++
>  tools/lib/bpf/bpf_tracing.h   |  58 ++-
>  .../selftests/bpf/prog_tests/snprintf.c   | 125 ++
>  .../selftests/bpf/progs/test_snprintf.c   |  73 
>  .../bpf/progs/test_snprintf_single.c  |  20 +
>  10 files changed, 770 insertions(+), 345 deletions(-)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/snprintf.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_snprintf.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_snprintf_single.c
>
> --
> 2.31.1.368.gbe11c130af-goog
>

Looks great, thank you!

For the series:

Acked-by: Andrii Nakryiko 


Re: [PATCH bpf-next v4 6/6] selftests/bpf: Add a series of tests for bpf_snprintf

2021-04-15 Thread Andrii Nakryiko
On Wed, Apr 14, 2021 at 11:54 AM Florent Revest  wrote:
>
> The "positive" part tests all format specifiers when things go well.
>
> The "negative" part makes sure that incorrect format strings fail at
> load time.
>
> Signed-off-by: Florent Revest 
> ---
>  .../selftests/bpf/prog_tests/snprintf.c   | 124 ++
>  .../selftests/bpf/progs/test_snprintf.c   |  73 +++
>  .../bpf/progs/test_snprintf_single.c  |  20 +++
>  3 files changed, 217 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/snprintf.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_snprintf.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_snprintf_single.c
>

[...]

> +/* Loads an eBPF object calling bpf_snprintf with up to 10 characters of fmt 
> */
> +static int load_single_snprintf(char *fmt)
> +{
> +   struct test_snprintf_single *skel;
> +   int ret;
> +
> +   skel = test_snprintf_single__open();
> +   if (!skel)
> +   return -EINVAL;
> +
> +   memcpy(skel->rodata->fmt, fmt, min(strlen(fmt) + 1, 10));
> +
> +   ret = test_snprintf_single__load(skel);
> +   if (!ret)
> +   test_snprintf_single__destroy(skel);

destroy unconditionally?

> +
> +   return ret;
> +}
> +
> +void test_snprintf_negative(void)
> +{
> +   ASSERT_OK(load_single_snprintf("valid %d"), "valid usage");
> +
> +   ASSERT_ERR(load_single_snprintf("0123456789"), "no terminating zero");
> +   ASSERT_ERR(load_single_snprintf("%d %d"), "too many specifiers");
> +   ASSERT_ERR(load_single_snprintf("%pi5"), "invalid specifier 1");
> +   ASSERT_ERR(load_single_snprintf("%a"), "invalid specifier 2");
> +   ASSERT_ERR(load_single_snprintf("%"), "invalid specifier 3");
> +   ASSERT_ERR(load_single_snprintf("\x80"), "non ascii character");
> +   ASSERT_ERR(load_single_snprintf("\x1"), "non printable character");

Some more cases that came to mind:

1. %123987129387192387 -- a long and unterminated specifier
2. similarly %--- or something like that

Do you think they are worth checking?
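
If so, they could slot into the same helper, e.g. (sketch only; kept short so
they fit within the 10-byte fmt buffer):

  ASSERT_ERR(load_single_snprintf("%1234567"), "overlong specifier");
  ASSERT_ERR(load_single_snprintf("%-----"), "repeated modifiers");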

> +}
> +
> +void test_snprintf(void)
> +{
> +   if (test__start_subtest("snprintf_positive"))
> +   test_snprintf_positive();
> +   if (test__start_subtest("snprintf_negative"))
> +   test_snprintf_negative();
> +}

[...]

> +char _license[] SEC("license") = "GPL";
> diff --git a/tools/testing/selftests/bpf/progs/test_snprintf_single.c 
> b/tools/testing/selftests/bpf/progs/test_snprintf_single.c
> new file mode 100644
> index ..15ccc5c43803
> --- /dev/null
> +++ b/tools/testing/selftests/bpf/progs/test_snprintf_single.c
> @@ -0,0 +1,20 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* Copyright (c) 2021 Google LLC. */
> +
> +#include 
> +#include 
> +
> +// The format string is filled from the userspace side such that loading 
> fails

C++ style format

> +static const char fmt[10];
> +
> +SEC("raw_tp/sys_enter")
> +int handler(const void *ctx)
> +{
> +   unsigned long long arg = 42;
> +
> +   bpf_snprintf(NULL, 0, fmt, &arg, sizeof(arg));
> +
> +   return 0;
> +}
> +
> +char _license[] SEC("license") = "GPL";
> --
> 2.31.1.295.g9ea45b61b8-goog
>


Re: [PATCH bpf-next v4 3/6] bpf: Add a bpf_snprintf helper

2021-04-15 Thread Andrii Nakryiko
On Wed, Apr 14, 2021 at 11:54 AM Florent Revest  wrote:
>
> The implementation takes inspiration from the existing bpf_trace_printk
> helper but there are a few differences:
>
> To allow for a large number of format-specifiers, parameters are
> provided in an array, like in bpf_seq_printf.
>
> Because the output string takes two arguments and the array of
> parameters also takes two arguments, the format string needs to fit in
> one argument. Thankfully, ARG_PTR_TO_CONST_STR is guaranteed to point to
> a zero-terminated read-only map so we don't need a format string length
> arg.
>
> Because the format-string is known at verification time, we also do
> a first pass of format string validation in the verifier logic. This
> makes debugging easier.
>
> Signed-off-by: Florent Revest 
> ---

LGTM.
Acked-by: Andrii Nakryiko 
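
A minimal usage sketch via the BPF_SNPRINTF convenience macro from patch 5
(illustrative only, not taken from the selftests):

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_tracing.h>

  char out[64] = {};

  SEC("raw_tp/sys_enter")
  int handler(const void *ctx)
  {
          unsigned long id = 42;

          /* the format string must be a constant, ending up in a read-only map */
          BPF_SNPRINTF(out, sizeof(out), "id=%lu (hex %lx)", id, id);
          return 0;
  }

  char _license[] SEC("license") = "GPL";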

>  include/linux/bpf.h|  1 +
>  include/uapi/linux/bpf.h   | 28 +++
>  kernel/bpf/helpers.c   | 50 ++
>  kernel/bpf/verifier.c  | 41 
>  kernel/trace/bpf_trace.c   |  2 ++
>  tools/include/uapi/linux/bpf.h | 28 +++
>  6 files changed, 150 insertions(+)
>

[...]


Re: [PATCH bpf-next v4 1/6] bpf: Factorize bpf_trace_printk and bpf_seq_printf

2021-04-15 Thread Andrii Nakryiko
On Thu, Apr 15, 2021 at 2:33 AM Florent Revest  wrote:
>
> On Thu, Apr 15, 2021 at 2:38 AM Andrii Nakryiko
>  wrote:
> > On Wed, Apr 14, 2021 at 11:54 AM Florent Revest  wrote:
> > > +static int try_get_fmt_tmp_buf(char **tmp_buf)
> > > +{
> > > +   struct bpf_printf_buf *bufs;
> > > +   int used;
> > > +
> > > +   if (*tmp_buf)
> > > +   return 0;
> > > +
> > > +   preempt_disable();
> > > +   used = this_cpu_inc_return(bpf_printf_buf_used);
> > > +   if (WARN_ON_ONCE(used > 1)) {
> > > +   this_cpu_dec(bpf_printf_buf_used);
> >
> > this makes me uncomfortable. If used > 1, you won't preempt_enable()
> > here, but you'll decrease count. Then later bpf_printf_cleanup() will
> > be called (inside bpf_printf_prepare()) and will further decrease
> > count (which it didn't increase, so it's a mess now).
>
> Awkward, yes. :( This code is untested because it only covers a niche
> preempt_rt usecase that is hard to reproduce but I should have thought
> harder about these corner cases.
>
> > > +   i += 2;
> > > +   if (!final_args)
> > > +   goto fmt_next;
> > > +
> > > +   if (try_get_fmt_tmp_buf(&tmp_buf)) {
> > > +   err = -EBUSY;
> > > +   goto out;
> >
> > this probably should bypass doing bpf_printf_cleanup() and
> > try_get_fmt_tmp_buf() should enable preemption internally on error.
>
> Yes. I'll fix this and spend some more brain cycles thinking about
> what I'm doing. ;)
>
> > > -static __printf(1, 0) int bpf_do_trace_printk(const char *fmt, ...)
> > > +BPF_CALL_5(bpf_trace_printk, char *, fmt, u32, fmt_size, u64, arg1,
> > > +  u64, arg2, u64, arg3)
> > >  {
> > > +   u64 args[MAX_TRACE_PRINTK_VARARGS] = { arg1, arg2, arg3 };
> > > +   enum bpf_printf_mod_type mod[MAX_TRACE_PRINTK_VARARGS];
> > > static char buf[BPF_TRACE_PRINTK_SIZE];
> > > unsigned long flags;
> > > -   va_list ap;
> > > int ret;
> > >
> > > -   raw_spin_lock_irqsave(&trace_printk_lock, flags);
> > > -   va_start(ap, fmt);
> > > -   ret = vsnprintf(buf, sizeof(buf), fmt, ap);
> > > -   va_end(ap);
> > > -   /* vsnprintf() will not append null for zero-length strings */
> > > +   ret = bpf_printf_prepare(fmt, fmt_size, args, args, mod,
> > > +MAX_TRACE_PRINTK_VARARGS);
> > > +   if (ret < 0)
> > > +   return ret;
> > > +
> > > +   ret = snprintf(buf, sizeof(buf), fmt, BPF_CAST_FMT_ARG(0, args, 
> > > mod),
> > > +   BPF_CAST_FMT_ARG(1, args, mod), BPF_CAST_FMT_ARG(2, args, 
> > > mod));
> > > +   /* snprintf() will not append null for zero-length strings */
> > > if (ret == 0)
> > > buf[0] = '\0';
> > > +
> > > +   raw_spin_lock_irqsave(&trace_printk_lock, flags);
> > > trace_bpf_trace_printk(buf);
> > > raw_spin_unlock_irqrestore(&trace_printk_lock, flags);
> > >
> > > -   return ret;
> >
> > see here, no + 1 :(
>
> I wonder if it's a bug or a feature though. The helper documentation
> says the helper returns "the number of bytes written to the buffer". I
> am not familiar with the internals of trace_printk but if the
> terminating \0 is not outputted in the trace_printk buffer, then it
> kind of makes sense.
>
> Also, if anyone uses this return value, I can imagine that the usecase
> would be if (ret == 0) assume_nothing_was_written(). And if we
> suddenly output 1 here, we might break something.
>
> Because the helper is quite old, maybe we should improve the helper
> documentation instead? Your call :)

Yeah, let's make helper's doc a bit more precise, otherwise let's not
touch it. I doubt many users ever check return result of
bpf_trace_printk() at all, tbh.


Re: [PATCH bpf-next 3/5] libbpf: add low level TC-BPF API

2021-04-15 Thread Andrii Nakryiko
On Thu, Apr 15, 2021 at 3:10 PM Daniel Borkmann  wrote:
>
> On 4/15/21 1:58 AM, Andrii Nakryiko wrote:
> > On Wed, Apr 14, 2021 at 4:32 PM Daniel Borkmann  
> > wrote:
> >> On 4/15/21 1:19 AM, Andrii Nakryiko wrote:
> >>> On Wed, Apr 14, 2021 at 3:51 PM Toke Høiland-Jørgensen  
> >>> wrote:
> >>>> Andrii Nakryiko  writes:
> >>>>> On Wed, Apr 14, 2021 at 3:58 AM Toke Høiland-Jørgensen 
> >>>>>  wrote:
> >>>>>> Andrii Nakryiko  writes:
> >>>>>>> On Tue, Apr 6, 2021 at 3:06 AM Toke Høiland-Jørgensen 
> >>>>>>>  wrote:
> >>>>>>>> Andrii Nakryiko  writes:
> >>>>>>>>> On Sat, Apr 3, 2021 at 10:47 AM Alexei Starovoitov
> >>>>>>>>>  wrote:
> >>>>>>>>>> On Sat, Apr 03, 2021 at 12:38:06AM +0530, Kumar Kartikeya Dwivedi 
> >>>>>>>>>> wrote:
> >>>>>>>>>>> On Sat, Apr 03, 2021 at 12:02:14AM IST, Alexei Starovoitov wrote:
> >>>>>>>>>>>> On Fri, Apr 2, 2021 at 8:27 AM Kumar Kartikeya Dwivedi 
> >>>>>>>>>>>>  wrote:
> >>>>>>>>>>>>> [...]
> >>>>>>>>>>>>
> >>>>>>>>>>>> All of these things are messy because of tc legacy. bpf tried to 
> >>>>>>>>>>>> follow tc style
> >>>>>>>>>>>> with cls and act distinction and it didn't quite work. cls with
> >>>>>>>>>>>> direct-action is the only
> >>>>>>>>>>>> thing that became mainstream while tc style attach wasn't really 
> >>>>>>>>>>>> addressed.
> >>>>>>>>>>>> There were several incidents where tc had tens of thousands of 
> >>>>>>>>>>>> progs attached
> >>>>>>>>>>>> because of this attach/query/index weirdness described above.
> >>>>>>>>>>>> I think the only way to address this properly is to introduce 
> >>>>>>>>>>>> bpf_link style of
> >>>>>>>>>>>> attaching to tc. Such bpf_link would support ingress/egress only.
> >>>>>>>>>>>> direction-action will be implied. There won't be any index and 
> >>>>>>>>>>>> query
> >>>>>>>>>>>> will be obvious.
> >>>>>>>>>>>
> >>>>>>>>>>> Note that we already have bpf_link support working (without 
> >>>>>>>>>>> support for pinning
> >>>>>>>>>>> ofcourse) in a limited way. The ifindex, protocol, parent_id, 
> >>>>>>>>>>> priority, handle,
> >>>>>>>>>>> chain_index tuple uniquely identifies a filter, so we stash this 
> >>>>>>>>>>> in the bpf_link
> >>>>>>>>>>> and are able to operate on the exact filter during release.
> >>>>>>>>>>
> >>>>>>>>>> Except they're not unique. The library can stash them, but 
> >>>>>>>>>> something else
> >>>>>>>>>> doing detach via iproute2 or their own netlink calls will detach 
> >>>>>>>>>> the prog.
> >>>>>>>>>> This other app can attach to the same spot a different prog and now
> >>>>>>>>>> bpf_link__destroy will be detaching somebody else prog.
> >>>>>>>>>>
> >>>>>>>>>>>> So I would like to propose to take this patch set a step further 
> >>>>>>>>>>>> from
> >>>>>>>>>>>> what Daniel said:
> >>>>>>>>>>>> int bpf_tc_attach(prog_fd, ifindex, {INGRESS,EGRESS}):
> >>>>>>>>>>>> and make this proposed api to return FD.
> >>>>>>>>>>>> To detach from tc ingress/egress just close(fd).
> >>>>>>>>>>>
> >>>>>>>>>>> You mean adding an fd-based TC API to the kernel?
> >>>>>>

Re: [PATCH bpf-next 3/5] libbpf: add low level TC-BPF API

2021-04-15 Thread Andrii Nakryiko
On Thu, Apr 15, 2021 at 8:57 AM Toke Høiland-Jørgensen  wrote:
>
> Andrii Nakryiko  writes:
>
> > On Wed, Apr 14, 2021 at 3:51 PM Toke Høiland-Jørgensen  
> > wrote:
> >>
> >> Andrii Nakryiko  writes:
> >>
> >> > On Wed, Apr 14, 2021 at 3:58 AM Toke Høiland-Jørgensen  
> >> > wrote:
> >> >>
> >> >> Andrii Nakryiko  writes:
> >> >>
> >> >> > On Tue, Apr 6, 2021 at 3:06 AM Toke Høiland-Jørgensen 
> >> >> >  wrote:
> >> >> >>
> >> >> >> Andrii Nakryiko  writes:
> >> >> >>
> >> >> >> > On Sat, Apr 3, 2021 at 10:47 AM Alexei Starovoitov
> >> >> >> >  wrote:
> >> >> >> >>
> >> >> >> >> On Sat, Apr 03, 2021 at 12:38:06AM +0530, Kumar Kartikeya Dwivedi 
> >> >> >> >> wrote:
> >> >> >> >> > On Sat, Apr 03, 2021 at 12:02:14AM IST, Alexei Starovoitov 
> >> >> >> >> > wrote:
> >> >> >> >> > > On Fri, Apr 2, 2021 at 8:27 AM Kumar Kartikeya Dwivedi 
> >> >> >> >> > >  wrote:
> >> >> >> >> > > > [...]
> >> >> >> >> > >
> >> >> >> >> > > All of these things are messy because of tc legacy. bpf tried 
> >> >> >> >> > > to follow tc style
> >> >> >> >> > > with cls and act distinction and it didn't quite work. cls 
> >> >> >> >> > > with
> >> >> >> >> > > direct-action is the only
> >> >> >> >> > > thing that became mainstream while tc style attach wasn't 
> >> >> >> >> > > really addressed.
> >> >> >> >> > > There were several incidents where tc had tens of thousands 
> >> >> >> >> > > of progs attached
> >> >> >> >> > > because of this attach/query/index weirdness described above.
> >> >> >> >> > > I think the only way to address this properly is to introduce 
> >> >> >> >> > > bpf_link style of
> >> >> >> >> > > attaching to tc. Such bpf_link would support ingress/egress 
> >> >> >> >> > > only.
> >> >> >> >> > > direction-action will be implied. There won't be any index 
> >> >> >> >> > > and query
> >> >> >> >> > > will be obvious.
> >> >> >> >> >
> >> >> >> >> > Note that we already have bpf_link support working (without 
> >> >> >> >> > support for pinning
> >> >> >> >> > ofcourse) in a limited way. The ifindex, protocol, parent_id, 
> >> >> >> >> > priority, handle,
> >> >> >> >> > chain_index tuple uniquely identifies a filter, so we stash 
> >> >> >> >> > this in the bpf_link
> >> >> >> >> > and are able to operate on the exact filter during release.
> >> >> >> >>
> >> >> >> >> Except they're not unique. The library can stash them, but 
> >> >> >> >> something else
> >> >> >> >> doing detach via iproute2 or their own netlink calls will detach 
> >> >> >> >> the prog.
> >> >> >> >> This other app can attach to the same spot a different prog and 
> >> >> >> >> now
> >> >> >> >> bpf_link__destroy will be detaching somebody else prog.
> >> >> >> >>
> >> >> >> >> > > So I would like to propose to take this patch set a step 
> >> >> >> >> > > further from
> >> >> >> >> > > what Daniel said:
> >> >> >> >> > > int bpf_tc_attach(prog_fd, ifindex, {INGRESS,EGRESS}):
> >> >> >> >> > > and make this proposed api to return FD.
> >> >> >> >> > > To detach from tc ingress/egress just close(fd).
> >> >> >> >> >
> >> >> >> >> > You mean adding an fd-based TC API to the kernel?
> >> >> >> >>
>

Re: [PATCH bpf-next v4 1/6] bpf: Factorize bpf_trace_printk and bpf_seq_printf

2021-04-14 Thread Andrii Nakryiko
On Wed, Apr 14, 2021 at 11:54 AM Florent Revest  wrote:
>
> Two helpers (trace_printk and seq_printf) have very similar
> implementations of format string parsing and a third one is coming
> (snprintf). To avoid code duplication and make the code easier to
> maintain, this moves the operations associated with format string
> parsing (validation and argument sanitization) into one generic
> function.
>
> The implementation of the two existing helpers already drifted quite a
> bit so unifying them entailed a lot of changes:
>
> - bpf_trace_printk always expected fmt[fmt_size] to be the terminating
>   NULL character, this is no longer true, the first 0 is terminating.
> - bpf_trace_printk now supports %% (which produces the percentage char).
> - bpf_trace_printk now skips width formatting fields.
> - bpf_trace_printk now supports the X modifier (capital hexadecimal).
> - bpf_trace_printk now supports %pK, %px, %pB, %pi4, %pI4, %pi6 and %pI6
> - argument casting on 32 bit has been simplified into one macro and
>   using an enum instead of obscure int increments.
>
> - bpf_seq_printf now uses bpf_trace_copy_string instead of
>   strncpy_from_kernel_nofault and handles the %pks %pus specifiers.
> - bpf_seq_printf now prints longs correctly on 32 bit architectures.
>
> - both were changed to use a global per-cpu tmp buffer instead of one
>   stack buffer for trace_printk and 6 small buffers for seq_printf.
> - to avoid per-cpu buffer usage conflict, these helpers disable
>   preemption while the per-cpu buffer is in use.
> - both helpers now support the %ps and %pS specifiers to print symbols.
>
> The implementation is also moved from bpf_trace.c to helpers.c because
> the upcoming bpf_snprintf helper will be made available to all BPF
> programs and will need it.
>
> Signed-off-by: Florent Revest 
> ---
>  include/linux/bpf.h  |  20 +++
>  kernel/bpf/helpers.c | 254 +++
>  kernel/trace/bpf_trace.c | 371 ---
>  3 files changed, 311 insertions(+), 334 deletions(-)
>

[...]

> +static int try_get_fmt_tmp_buf(char **tmp_buf)
> +{
> +   struct bpf_printf_buf *bufs;
> +   int used;
> +
> +   if (*tmp_buf)
> +   return 0;
> +
> +   preempt_disable();
> +   used = this_cpu_inc_return(bpf_printf_buf_used);
> +   if (WARN_ON_ONCE(used > 1)) {
> +   this_cpu_dec(bpf_printf_buf_used);

this makes me uncomfortable. If used > 1, you won't preempt_enable()
here, but you'll decrease count. Then later bpf_printf_cleanup() will
be called (inside bpf_printf_prepare()) and will further decrease
count (which it didn't increase, so it's a mess now).

> +   return -EBUSY;
> +   }
> +   bufs = this_cpu_ptr(&bpf_printf_buf);
> +   *tmp_buf = bufs->tmp_buf;
> +
> +   return 0;
> +}
> +

[...]

> +   i += 2;
> +   if (!final_args)
> +   goto fmt_next;
> +
> +   if (try_get_fmt_tmp_buf(&tmp_buf)) {
> +   err = -EBUSY;
> +   goto out;

this probably should bypass doing bpf_printf_cleanup() and
try_get_fmt_tmp_buf() should enable preemption internally on error.
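Something along these lines is what I have in mind (untested sketch, the
only change is re-enabling preemption on the error path so the caller can
just bail out without calling bpf_printf_cleanup()):

static int try_get_fmt_tmp_buf(char **tmp_buf)
{
	struct bpf_printf_buf *bufs;
	int used;

	if (*tmp_buf)
		return 0;

	preempt_disable();
	used = this_cpu_inc_return(bpf_printf_buf_used);
	if (WARN_ON_ONCE(used > 1)) {
		this_cpu_dec(bpf_printf_buf_used);
		preempt_enable();
		return -EBUSY;
	}
	bufs = this_cpu_ptr(&bpf_printf_buf);
	*tmp_buf = bufs->tmp_buf;

	return 0;
}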

> +   }
> +
> +   copy_size = (fmt[i + 2] == '4') ? 4 : 16;
> +   if (tmp_buf_len < copy_size) {
> +   err = -ENOSPC;
> +   goto out;
> +   }
> +

[...]

> -static __printf(1, 0) int bpf_do_trace_printk(const char *fmt, ...)
> +BPF_CALL_5(bpf_trace_printk, char *, fmt, u32, fmt_size, u64, arg1,
> +  u64, arg2, u64, arg3)
>  {
> +   u64 args[MAX_TRACE_PRINTK_VARARGS] = { arg1, arg2, arg3 };
> +   enum bpf_printf_mod_type mod[MAX_TRACE_PRINTK_VARARGS];
> static char buf[BPF_TRACE_PRINTK_SIZE];
> unsigned long flags;
> -   va_list ap;
> int ret;
>
> -   raw_spin_lock_irqsave(&trace_printk_lock, flags);
> -   va_start(ap, fmt);
> -   ret = vsnprintf(buf, sizeof(buf), fmt, ap);
> -   va_end(ap);
> -   /* vsnprintf() will not append null for zero-length strings */
> +   ret = bpf_printf_prepare(fmt, fmt_size, args, args, mod,
> +MAX_TRACE_PRINTK_VARARGS);
> +   if (ret < 0)
> +   return ret;
> +
> +   ret = snprintf(buf, sizeof(buf), fmt, BPF_CAST_FMT_ARG(0, args, mod),
> +   BPF_CAST_FMT_ARG(1, args, mod), BPF_CAST_FMT_ARG(2, args, 
> mod));
> +   /* snprintf() will not append null for zero-length strings */
> if (ret == 0)
> buf[0] = '\0';
> +
> +   raw_spin_lock_irqsave(&trace_printk_lock, flags);
> trace_bpf_trace_printk(buf);
> raw_spin_unlock_irqrestore(&trace_printk_lock, flags);
>
> -   return ret;

see here, no + 1 :(

> -}
> -
> -/*
> - * Only limited trace_printk() conversion specifiers allowed:
> - * %d %i 

Re: [PATCH] selftests/bpf: Fix the ASSERT_ERR_PTR macro

2021-04-14 Thread Andrii Nakryiko
On Wed, Apr 14, 2021 at 11:58 AM Martin KaFai Lau  wrote:
>
> On Wed, Apr 14, 2021 at 05:56:32PM +0200, Florent Revest wrote:
> > It is just missing a ';'. This macro is not used by any test yet.
> >
> > Signed-off-by: Florent Revest 
> Fixes: 22ba36351631 ("selftests/bpf: Move and extend ASSERT_xxx() testing 
> macros")
>

Thanks, Martin. Added Fixes tag and applied to bpf-next.

> Since it has not been used, it could be bpf-next.  Please also tag
> it in the future.
>
> Acked-by: Martin KaFai Lau 


Re: [PATCH bpf-next 3/5] libbpf: add low level TC-BPF API

2021-04-14 Thread Andrii Nakryiko
On Wed, Apr 14, 2021 at 4:32 PM Daniel Borkmann  wrote:
>
> On 4/15/21 1:19 AM, Andrii Nakryiko wrote:
> > On Wed, Apr 14, 2021 at 3:51 PM Toke Høiland-Jørgensen  
> > wrote:
> >> Andrii Nakryiko  writes:
> >>> On Wed, Apr 14, 2021 at 3:58 AM Toke Høiland-Jørgensen  
> >>> wrote:
> >>>> Andrii Nakryiko  writes:
> >>>>> On Tue, Apr 6, 2021 at 3:06 AM Toke Høiland-Jørgensen  
> >>>>> wrote:
> >>>>>> Andrii Nakryiko  writes:
> >>>>>>> On Sat, Apr 3, 2021 at 10:47 AM Alexei Starovoitov
> >>>>>>>  wrote:
> >>>>>>>> On Sat, Apr 03, 2021 at 12:38:06AM +0530, Kumar Kartikeya Dwivedi 
> >>>>>>>> wrote:
> >>>>>>>>> On Sat, Apr 03, 2021 at 12:02:14AM IST, Alexei Starovoitov wrote:
> >>>>>>>>>> On Fri, Apr 2, 2021 at 8:27 AM Kumar Kartikeya Dwivedi 
> >>>>>>>>>>  wrote:
> >>>>>>>>>>> [...]
> >>>>>>>>>>
> >>>>>>>>>> All of these things are messy because of tc legacy. bpf tried to 
> >>>>>>>>>> follow tc style
> >>>>>>>>>> with cls and act distinction and it didn't quite work. cls with
> >>>>>>>>>> direct-action is the only
> >>>>>>>>>> thing that became mainstream while tc style attach wasn't really 
> >>>>>>>>>> addressed.
> >>>>>>>>>> There were several incidents where tc had tens of thousands of 
> >>>>>>>>>> progs attached
> >>>>>>>>>> because of this attach/query/index weirdness described above.
> >>>>>>>>>> I think the only way to address this properly is to introduce 
> >>>>>>>>>> bpf_link style of
> >>>>>>>>>> attaching to tc. Such bpf_link would support ingress/egress only.
> >>>>>>>>>> direction-action will be implied. There won't be any index and 
> >>>>>>>>>> query
> >>>>>>>>>> will be obvious.
> >>>>>>>>>
> >>>>>>>>> Note that we already have bpf_link support working (without support 
> >>>>>>>>> for pinning
> >>>>>>>>> ofcourse) in a limited way. The ifindex, protocol, parent_id, 
> >>>>>>>>> priority, handle,
> >>>>>>>>> chain_index tuple uniquely identifies a filter, so we stash this in 
> >>>>>>>>> the bpf_link
> >>>>>>>>> and are able to operate on the exact filter during release.
> >>>>>>>>
> >>>>>>>> Except they're not unique. The library can stash them, but something 
> >>>>>>>> else
> >>>>>>>> doing detach via iproute2 or their own netlink calls will detach the 
> >>>>>>>> prog.
> >>>>>>>> This other app can attach to the same spot a different prog and now
> >>>>>>>> bpf_link__destroy will be detaching somebody else prog.
> >>>>>>>>
> >>>>>>>>>> So I would like to propose to take this patch set a step further 
> >>>>>>>>>> from
> >>>>>>>>>> what Daniel said:
> >>>>>>>>>> int bpf_tc_attach(prog_fd, ifindex, {INGRESS,EGRESS}):
> >>>>>>>>>> and make this proposed api to return FD.
> >>>>>>>>>> To detach from tc ingress/egress just close(fd).
> >>>>>>>>>
> >>>>>>>>> You mean adding an fd-based TC API to the kernel?
> >>>>>>>>
> >>>>>>>> yes.
> >>>>>>>
> >>>>>>> I'm totally for bpf_link-based TC attachment.
> >>>>>>>
> >>>>>>> But I think *also* having "legacy" netlink-based APIs will allow
> >>>>>>> applications to handle older kernels in a much nicer way without extra
> >>>>>>> dependency on iproute2. We have a similar situation with kprobe, where
> >>>>>>> currently libbpf only supports "modern"

Re: [PATCH bpf-next 3/5] libbpf: add low level TC-BPF API

2021-04-14 Thread Andrii Nakryiko
On Wed, Apr 14, 2021 at 3:51 PM Toke Høiland-Jørgensen  wrote:
>
> Andrii Nakryiko  writes:
>
> > On Wed, Apr 14, 2021 at 3:58 AM Toke Høiland-Jørgensen  
> > wrote:
> >>
> >> Andrii Nakryiko  writes:
> >>
> >> > On Tue, Apr 6, 2021 at 3:06 AM Toke Høiland-Jørgensen  
> >> > wrote:
> >> >>
> >> >> Andrii Nakryiko  writes:
> >> >>
> >> >> > On Sat, Apr 3, 2021 at 10:47 AM Alexei Starovoitov
> >> >> >  wrote:
> >> >> >>
> >> >> >> On Sat, Apr 03, 2021 at 12:38:06AM +0530, Kumar Kartikeya Dwivedi 
> >> >> >> wrote:
> >> >> >> > On Sat, Apr 03, 2021 at 12:02:14AM IST, Alexei Starovoitov wrote:
> >> >> >> > > On Fri, Apr 2, 2021 at 8:27 AM Kumar Kartikeya Dwivedi 
> >> >> >> > >  wrote:
> >> >> >> > > > [...]
> >> >> >> > >
> >> >> >> > > All of these things are messy because of tc legacy. bpf tried to 
> >> >> >> > > follow tc style
> >> >> >> > > with cls and act distinction and it didn't quite work. cls with
> >> >> >> > > direct-action is the only
> >> >> >> > > thing that became mainstream while tc style attach wasn't really 
> >> >> >> > > addressed.
> >> >> >> > > There were several incidents where tc had tens of thousands of 
> >> >> >> > > progs attached
> >> >> >> > > because of this attach/query/index weirdness described above.
> >> >> >> > > I think the only way to address this properly is to introduce 
> >> >> >> > > bpf_link style of
> >> >> >> > > attaching to tc. Such bpf_link would support ingress/egress only.
> >> >> >> > > direction-action will be implied. There won't be any index and 
> >> >> >> > > query
> >> >> >> > > will be obvious.
> >> >> >> >
> >> >> >> > Note that we already have bpf_link support working (without 
> >> >> >> > support for pinning
> >> >> >> > ofcourse) in a limited way. The ifindex, protocol, parent_id, 
> >> >> >> > priority, handle,
> >> >> >> > chain_index tuple uniquely identifies a filter, so we stash this 
> >> >> >> > in the bpf_link
> >> >> >> > and are able to operate on the exact filter during release.
> >> >> >>
> >> >> >> Except they're not unique. The library can stash them, but something 
> >> >> >> else
> >> >> >> doing detach via iproute2 or their own netlink calls will detach the 
> >> >> >> prog.
> >> >> >> This other app can attach to the same spot a different prog and now
> >> >> >> bpf_link__destroy will be detaching somebody else prog.
> >> >> >>
> >> >> >> > > So I would like to propose to take this patch set a step further 
> >> >> >> > > from
> >> >> >> > > what Daniel said:
> >> >> >> > > int bpf_tc_attach(prog_fd, ifindex, {INGRESS,EGRESS}):
> >> >> >> > > and make this proposed api to return FD.
> >> >> >> > > To detach from tc ingress/egress just close(fd).
> >> >> >> >
> >> >> >> > You mean adding an fd-based TC API to the kernel?
> >> >> >>
> >> >> >> yes.
> >> >> >
> >> >> > I'm totally for bpf_link-based TC attachment.
> >> >> >
> >> >> > But I think *also* having "legacy" netlink-based APIs will allow
> >> >> > applications to handle older kernels in a much nicer way without extra
> >> >> > dependency on iproute2. We have a similar situation with kprobe, where
> >> >> > currently libbpf only supports "modern" fd-based attachment, but users
> >> >> > periodically ask questions and struggle to figure out issues on older
> >> >> > kernels that don't support new APIs.
> >> >>
> >> >> +1; I am OK with adding a new bpf_link-based way to attach TC programs,
> >> >> b

Re: [PATCH bpf-next v3 3/6] bpf: Add a bpf_snprintf helper

2021-04-14 Thread Andrii Nakryiko
On Wed, Apr 14, 2021 at 11:30 AM Florent Revest  wrote:
>
> Hey Geert! :)
>
> On Wed, Apr 14, 2021 at 8:02 PM Geert Uytterhoeven  
> wrote:
> > On Wed, Apr 14, 2021 at 9:41 AM Andrii Nakryiko
> >  wrote:
> > > On Mon, Apr 12, 2021 at 8:38 AM Florent Revest  
> > > wrote:
> > > > +   fmt = (char *)fmt_addr + fmt_map_off;
> > > > +
> > >
> > > bot complained about lack of (long) cast before fmt_addr, please address
> >
> > (uintptr_t), I assume?
>
> (uintptr_t) seems more correct to me as well. However, I just had a
> look at the rest of verifier.c and (long) casts are already used
> pretty much everywhere whereas uintptr_t isn't used yet.
> I'll send a v4 with a long cast for the sake of consistency with the
> rest of the verifier.

right, I don't care about long or uintptr_t, both are guaranteed to
work, I just remember seeing a lot of code with (long) cast. I have no
preference.
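I.e. something like:

	fmt = (char *)(long)fmt_addr + fmt_map_off;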


Re: [PATCH bpf-next v3 3/6] bpf: Add a bpf_snprintf helper

2021-04-14 Thread Andrii Nakryiko
On Wed, Apr 14, 2021 at 2:46 AM Florent Revest  wrote:
>
> On Wed, Apr 14, 2021 at 1:16 AM Andrii Nakryiko
>  wrote:
> > On Mon, Apr 12, 2021 at 8:38 AM Florent Revest  wrote:
> > > +static int check_bpf_snprintf_call(struct bpf_verifier_env *env,
> > > +  struct bpf_reg_state *regs)
> > > +{
> > > +   struct bpf_reg_state *fmt_reg = &regs[BPF_REG_3];
> > > +   struct bpf_reg_state *data_len_reg = &regs[BPF_REG_5];
> > > +   struct bpf_map *fmt_map = fmt_reg->map_ptr;
> > > +   int err, fmt_map_off, num_args;
> > > +   u64 fmt_addr;
> > > +   char *fmt;
> > > +
> > > +   /* data must be an array of u64 */
> > > +   if (data_len_reg->var_off.value % 8)
> > > +   return -EINVAL;
> > > +   num_args = data_len_reg->var_off.value / 8;
> > > +
> > > +   /* fmt being ARG_PTR_TO_CONST_STR guarantees that var_off is const
> > > +* and map_direct_value_addr is set.
> > > +*/
> > > +   fmt_map_off = fmt_reg->off + fmt_reg->var_off.value;
> > > +   err = fmt_map->ops->map_direct_value_addr(fmt_map, &fmt_addr,
> > > + fmt_map_off);
> > > +   if (err)
> > > +   return err;
> > > +   fmt = (char *)fmt_addr + fmt_map_off;
> > > +
> >
> > bot complained about lack of (long) cast before fmt_addr, please address
>
> Will do.
>
> > > +   /* Maximumly we can have MAX_SNPRINTF_VARARGS parameters, just 
> > > give
> > > +* all of them to snprintf().
> > > +*/
> > > +   err = snprintf(str, str_size, fmt, BPF_CAST_FMT_ARG(0, args, mod),
> > > +   BPF_CAST_FMT_ARG(1, args, mod), BPF_CAST_FMT_ARG(2, args, 
> > > mod),
> > > +   BPF_CAST_FMT_ARG(3, args, mod), BPF_CAST_FMT_ARG(4, args, 
> > > mod),
> > > +   BPF_CAST_FMT_ARG(5, args, mod), BPF_CAST_FMT_ARG(6, args, 
> > > mod),
> > > +   BPF_CAST_FMT_ARG(7, args, mod), BPF_CAST_FMT_ARG(8, args, 
> > > mod),
> > > +   BPF_CAST_FMT_ARG(9, args, mod), BPF_CAST_FMT_ARG(10, 
> > > args, mod),
> > > +   BPF_CAST_FMT_ARG(11, args, mod));
> > > +
> > > +   put_fmt_tmp_buf();
> >
> > reading this for at least 3rd time, this put_fmt_tmp_buf() looks a bit
> > out of place and kind of random. I think bpf_printf_cleanup() name
> > pairs with bpf_printf_prepare() better.
>
> Yes, I thought it would be clever to name that function
> put_fmt_tmp_buf() as a clear parallel to try_get_fmt_tmp_buf() but
> because it only puts the buffer if it is used and because they get
> called in two different contexts, it's after all maybe not such a
> clever name... I'll revert to bpf_printf_cleanup(). Thank you for your
> patience with my naming adventures! :)
>
> > > +
> > > +   return err + 1;
> >
> > snprintf() already returns string length *including* terminating zero,
> > so this is wrong
>
> lib/vsprintf.c says:
>  * The return value is the number of characters which would be
>  * generated for the given input, excluding the trailing null,
>  * as per ISO C99.
>
> Also if I look at the "no arg" test case in the selftest patch.
> "simple case" is asserted to return 12 which seems correct to me
> (includes the terminating zero only once). Am I missing something ?
>

no, you are right, but that means that bpf_trace_printk is broken, it
doesn't do + 1 (which threw me off here), shall we fix that?

> However that makes me wonder whether it would be more appropriate to
> return the value excluding the trailing null. On one hand it makes
> sense to be coherent with other BPF helpers that include the trailing
> zero (as discussed in patch v1), on the other hand the helper is
> clearly named after the standard "snprintf" function and it's likely
> that users will assume it works the same as the std snprintf.


Having zero included simplifies BPF code tremendously for cases like
bpf_probe_read_str(). So no, let's stick with including zero
terminator in return size.
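For example (untested sketch; 'events' is assumed to be a
BPF_MAP_TYPE_PERF_EVENT_ARRAY map and 'ctx' the program context), the
returned size, NUL included, can be fed straight into another helper
without any +1/-1 adjustments:

	static const char fmt[] = "pid=%d";
	__u64 args[] = { bpf_get_current_pid_tgid() >> 32 };
	char buf[32];
	long n;

	n = bpf_snprintf(buf, sizeof(buf), fmt, args, sizeof(args));
	if (n > 0 && n <= sizeof(buf))
		/* consumer gets a NUL-terminated string of exactly n bytes */
		bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, buf, n);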


Re: [PATCH bpf-next 3/5] libbpf: add low level TC-BPF API

2021-04-14 Thread Andrii Nakryiko
On Wed, Apr 14, 2021 at 3:58 AM Toke Høiland-Jørgensen  wrote:
>
> Andrii Nakryiko  writes:
>
> > On Tue, Apr 6, 2021 at 3:06 AM Toke Høiland-Jørgensen  
> > wrote:
> >>
> >> Andrii Nakryiko  writes:
> >>
> >> > On Sat, Apr 3, 2021 at 10:47 AM Alexei Starovoitov
> >> >  wrote:
> >> >>
> >> >> On Sat, Apr 03, 2021 at 12:38:06AM +0530, Kumar Kartikeya Dwivedi wrote:
> >> >> > On Sat, Apr 03, 2021 at 12:02:14AM IST, Alexei Starovoitov wrote:
> >> >> > > On Fri, Apr 2, 2021 at 8:27 AM Kumar Kartikeya Dwivedi 
> >> >> > >  wrote:
> >> >> > > > [...]
> >> >> > >
> >> >> > > All of these things are messy because of tc legacy. bpf tried to 
> >> >> > > follow tc style
> >> >> > > with cls and act distinction and it didn't quite work. cls with
> >> >> > > direct-action is the only
> >> >> > > thing that became mainstream while tc style attach wasn't really 
> >> >> > > addressed.
> >> >> > > There were several incidents where tc had tens of thousands of 
> >> >> > > progs attached
> >> >> > > because of this attach/query/index weirdness described above.
> >> >> > > I think the only way to address this properly is to introduce 
> >> >> > > bpf_link style of
> >> >> > > attaching to tc. Such bpf_link would support ingress/egress only.
> >> >> > > direction-action will be implied. There won't be any index and query
> >> >> > > will be obvious.
> >> >> >
> >> >> > Note that we already have bpf_link support working (without support 
> >> >> > for pinning
> >> >> > ofcourse) in a limited way. The ifindex, protocol, parent_id, 
> >> >> > priority, handle,
> >> >> > chain_index tuple uniquely identifies a filter, so we stash this in 
> >> >> > the bpf_link
> >> >> > and are able to operate on the exact filter during release.
> >> >>
> >> >> Except they're not unique. The library can stash them, but something 
> >> >> else
> >> >> doing detach via iproute2 or their own netlink calls will detach the 
> >> >> prog.
> >> >> This other app can attach to the same spot a different prog and now
> >> >> bpf_link__destroy will be detaching somebody else prog.
> >> >>
> >> >> > > So I would like to propose to take this patch set a step further 
> >> >> > > from
> >> >> > > what Daniel said:
> >> >> > > int bpf_tc_attach(prog_fd, ifindex, {INGRESS,EGRESS}):
> >> >> > > and make this proposed api to return FD.
> >> >> > > To detach from tc ingress/egress just close(fd).
> >> >> >
> >> >> > You mean adding an fd-based TC API to the kernel?
> >> >>
> >> >> yes.
> >> >
> >> > I'm totally for bpf_link-based TC attachment.
> >> >
> >> > But I think *also* having "legacy" netlink-based APIs will allow
> >> > applications to handle older kernels in a much nicer way without extra
> >> > dependency on iproute2. We have a similar situation with kprobe, where
> >> > currently libbpf only supports "modern" fd-based attachment, but users
> >> > periodically ask questions and struggle to figure out issues on older
> >> > kernels that don't support new APIs.
> >>
> >> +1; I am OK with adding a new bpf_link-based way to attach TC programs,
> >> but we still need to support the netlink API in libbpf.
> >>
> >> > So I think we'd have to support legacy TC APIs, but I agree with
> >> > Alexei and Daniel that we should keep it to the simplest and most
> >> > straightforward API of supporting direction-action attachments and
> >> > setting up qdisc transparently (if I'm getting all the terminology
> >> > right, after reading Quentin's blog post). That coincidentally should
> >> > probably match how bpf_link-based TC API will look like, so all that
> >> > can be abstracted behind a single bpf_link__attach_tc() API as well,
> >> > right? That's the plan for dealing with kprobe right now, btw. Libbpf
> >> > will detect the best available API and trans

Re: [PATCH bpf-next v3 6/6] selftests/bpf: Add a series of tests for bpf_snprintf

2021-04-14 Thread Andrii Nakryiko
On Wed, Apr 14, 2021 at 2:21 AM Florent Revest  wrote:
>
> On Wed, Apr 14, 2021 at 1:21 AM Andrii Nakryiko
>  wrote:
> >
> > On Mon, Apr 12, 2021 at 8:38 AM Florent Revest  wrote:
> > >
> > > This exercises most of the format specifiers.
> > >
> > > Signed-off-by: Florent Revest 
> > > Acked-by: Andrii Nakryiko 
> > > ---
> >
> > As I mentioned on another patch, we probably need negative tests even
> > more than positive ones.
>
> Agreed.
>
> > I think an easy and nice way to do this is to have a separate BPF
> > skeleton where fmt string and arguments are provided through read-only
> > global variables, so that user-space can re-use the same BPF skeleton
> > to simulate multiple cases. BPF program itself would just call
> > bpf_snprintf() and store the returned result.
>
> Ah, great idea! I was thinking of having one skeleton for each but it
> would be a bit much indeed.
>
> Because the format string needs to be in a read only map though, I
> hope it can be modified from userspace before loading. I'll try it out
> and see :) if it doesn't work I'll just use more skeletons

You need read-only variables (const volatile my_type). Their contents
are statically verified by the BPF verifier, yet user-space can set them
up at runtime before load.
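Roughly like this (sketch only, names made up, not the actual selftest):

	/* BPF side: filled by user space before load; after load .rodata is
	 * frozen, so the verifier checks the final bytes of 'fmt'. */
	const volatile char fmt[32] = "%d";

	SEC("raw_tp/sys_enter")
	int handler(void *ctx)
	{
		__u64 args[] = { 42 };
		char out[16];

		/* one arg vs. however many specifiers user space put in fmt */
		bpf_snprintf(out, sizeof(out), (const void *)fmt, args,
			     sizeof(args));
		return 0;
	}

	/* user space, through the generated skeleton (hypothetical name): */
	skel = test_snprintf_neg__open();
	memcpy((void *)skel->rodata->fmt, "%s %s %s", sizeof("%s %s %s"));
	err = test_snprintf_neg__load(skel); /* expected to fail for bad fmt */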

>
> > Whether we need to validate the verifier log is up to debate (though
> > it's not that hard to do by overriding libbpf_print_fn() callback),
> > I'd be ok at least knowing that some bad format strings are rejected
> > and don't crash the kernel.
>
> Alright :)
>
> >
> > >  .../selftests/bpf/prog_tests/snprintf.c   | 81 +++
> > >  .../selftests/bpf/progs/test_snprintf.c   | 74 +
> > >  2 files changed, 155 insertions(+)
> > >  create mode 100644 tools/testing/selftests/bpf/prog_tests/snprintf.c
> > >  create mode 100644 tools/testing/selftests/bpf/progs/test_snprintf.c
> > >
> >
> > [...]


Re: [PATCH bpf-next 3/5] libbpf: add low level TC-BPF API

2021-04-13 Thread Andrii Nakryiko
On Tue, Apr 6, 2021 at 3:06 AM Toke Høiland-Jørgensen  wrote:
>
> Andrii Nakryiko  writes:
>
> > On Sat, Apr 3, 2021 at 10:47 AM Alexei Starovoitov
> >  wrote:
> >>
> >> On Sat, Apr 03, 2021 at 12:38:06AM +0530, Kumar Kartikeya Dwivedi wrote:
> >> > On Sat, Apr 03, 2021 at 12:02:14AM IST, Alexei Starovoitov wrote:
> >> > > On Fri, Apr 2, 2021 at 8:27 AM Kumar Kartikeya Dwivedi 
> >> > >  wrote:
> >> > > > [...]
> >> > >
> >> > > All of these things are messy because of tc legacy. bpf tried to 
> >> > > follow tc style
> >> > > with cls and act distinction and it didn't quite work. cls with
> >> > > direct-action is the only
> >> > > thing that became mainstream while tc style attach wasn't really 
> >> > > addressed.
> >> > > There were several incidents where tc had tens of thousands of progs 
> >> > > attached
> >> > > because of this attach/query/index weirdness described above.
> >> > > I think the only way to address this properly is to introduce bpf_link 
> >> > > style of
> >> > > attaching to tc. Such bpf_link would support ingress/egress only.
> >> > > direction-action will be implied. There won't be any index and query
> >> > > will be obvious.
> >> >
> >> > Note that we already have bpf_link support working (without support for 
> >> > pinning
> >> > ofcourse) in a limited way. The ifindex, protocol, parent_id, priority, 
> >> > handle,
> >> > chain_index tuple uniquely identifies a filter, so we stash this in the 
> >> > bpf_link
> >> > and are able to operate on the exact filter during release.
> >>
> >> Except they're not unique. The library can stash them, but something else
> >> doing detach via iproute2 or their own netlink calls will detach the prog.
> >> This other app can attach to the same spot a different prog and now
> >> bpf_link__destroy will be detaching somebody else prog.
> >>
> >> > > So I would like to propose to take this patch set a step further from
> >> > > what Daniel said:
> >> > > int bpf_tc_attach(prog_fd, ifindex, {INGRESS,EGRESS}):
> >> > > and make this proposed api to return FD.
> >> > > To detach from tc ingress/egress just close(fd).
> >> >
> >> > You mean adding an fd-based TC API to the kernel?
> >>
> >> yes.
> >
> > I'm totally for bpf_link-based TC attachment.
> >
> > But I think *also* having "legacy" netlink-based APIs will allow
> > applications to handle older kernels in a much nicer way without extra
> > dependency on iproute2. We have a similar situation with kprobe, where
> > currently libbpf only supports "modern" fd-based attachment, but users
> > periodically ask questions and struggle to figure out issues on older
> > kernels that don't support new APIs.
>
> +1; I am OK with adding a new bpf_link-based way to attach TC programs,
> but we still need to support the netlink API in libbpf.
>
> > So I think we'd have to support legacy TC APIs, but I agree with
> > Alexei and Daniel that we should keep it to the simplest and most
> > straightforward API of supporting direction-action attachments and
> > setting up qdisc transparently (if I'm getting all the terminology
> > right, after reading Quentin's blog post). That coincidentally should
> > probably match how bpf_link-based TC API will look like, so all that
> > can be abstracted behind a single bpf_link__attach_tc() API as well,
> > right? That's the plan for dealing with kprobe right now, btw. Libbpf
> > will detect the best available API and transparently fall back (maybe
> > with some warning for awareness, due to inherent downsides of legacy
> > APIs: no auto-cleanup being the most prominent one).
>
> Yup, SGTM: Expose both in the low-level API (in bpf.c), and make the
> high-level API auto-detect. That way users can also still use the
> netlink attach function if they don't want the fd-based auto-close
> behaviour of bpf_link.

So I thought a bit more about this, and it feels like the right move
would be to expose only higher-level TC BPF API behind bpf_link. It
will keep the API complexity and amount of APIs that libbpf will have
to support to the minimum, and will keep the API itself simple:
direct-attach with the minimum amount of input arguments. By not
exposing low-level APIs we also table the whole bpf_tc_cls_attach_id
design discussion, as we now can keep as much info as needed inside
bpf_link_tc (which will embed bpf_link internally as well) to support
detachment and possibly some additional querying, if needed.

I think that's the best and least controversial step forward for
getting this API into libbpf.
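As a strawman (all names and the exact argument list are made up here,
just to illustrate the intended shape: one direct-action attach with
minimal inputs, detach by destroying the link):

	struct bpf_link *
	bpf_program__attach_tc(const struct bpf_program *prog, int ifindex,
			       enum bpf_tc_attach_point point /* INGRESS/EGRESS */);

	/* usage */
	link = bpf_program__attach_tc(prog, ifindex, BPF_TC_INGRESS);
	...
	bpf_link__destroy(link); /* detaches and frees everything */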

>
> -Toke
>


Re: [PATCH bpf-next v3 6/6] selftests/bpf: Add a series of tests for bpf_snprintf

2021-04-13 Thread Andrii Nakryiko
On Mon, Apr 12, 2021 at 8:38 AM Florent Revest  wrote:
>
> This exercises most of the format specifiers.
>
> Signed-off-by: Florent Revest 
> Acked-by: Andrii Nakryiko 
> ---

As I mentioned on another patch, we probably need negative tests even
more than positive ones.

I think an easy and nice way to do this is to have a separate BPF
skeleton where fmt string and arguments are provided through read-only
global variables, so that user-space can re-use the same BPF skeleton
to simulate multiple cases. BPF program itself would just call
bpf_snprintf() and store the returned result.

Whether we need to validate the verifier log is up to debate (though
it's not that hard to do by overriding libbpf_print_fn() callback),
I'd be ok at least knowing that some bad format strings are rejected
and don't crash the kernel.
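For reference, capturing the verifier log this way is only a handful of
lines (sketch):

	#include <stdarg.h>
	#include <stdio.h>
	#include <string.h>
	#include <bpf/libbpf.h>

	static char log_buf[64 * 1024];

	static int capture_print(enum libbpf_print_level level,
				 const char *fmt, va_list args)
	{
		size_t off = strlen(log_buf);

		vsnprintf(log_buf + off, sizeof(log_buf) - off, fmt, args);
		return 0;
	}

	/* in the test */
	libbpf_print_fn_t old_print_fn = libbpf_set_print(capture_print);
	/* ... open/load the skeleton, expect failure, search log_buf ... */
	libbpf_set_print(old_print_fn);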


>  .../selftests/bpf/prog_tests/snprintf.c   | 81 +++
>  .../selftests/bpf/progs/test_snprintf.c   | 74 +
>  2 files changed, 155 insertions(+)
>  create mode 100644 tools/testing/selftests/bpf/prog_tests/snprintf.c
>  create mode 100644 tools/testing/selftests/bpf/progs/test_snprintf.c
>

[...]


Re: [PATCH bpf-next v3 5/6] libbpf: Introduce a BPF_SNPRINTF helper macro

2021-04-13 Thread Andrii Nakryiko
On Mon, Apr 12, 2021 at 8:38 AM Florent Revest  wrote:
>
> Similarly to BPF_SEQ_PRINTF, this macro turns variadic arguments into an
> array of u64, making it more natural to call the bpf_snprintf helper.
>
> Signed-off-by: Florent Revest 
> ---

Nice!

Acked-by: Andrii Nakryiko 

>  tools/lib/bpf/bpf_tracing.h | 18 ++
>  1 file changed, 18 insertions(+)
>
> diff --git a/tools/lib/bpf/bpf_tracing.h b/tools/lib/bpf/bpf_tracing.h
> index 1c2e91ee041d..8c954ebc0c7c 100644
> --- a/tools/lib/bpf/bpf_tracing.h
> +++ b/tools/lib/bpf/bpf_tracing.h
> @@ -447,4 +447,22 @@ static __always_inline typeof(name(0)) ____##name(struct 
> pt_regs *ctx, ##args)
>___param, sizeof(___param)); \
>  })
>
> +/*
> + * BPF_SNPRINTF wraps the bpf_snprintf helper with variadic arguments 
> instead of
> + * an array of u64.
> + */
> +#define BPF_SNPRINTF(out, out_size, fmt, args...)  \
> +({ \
> +   static const char ___fmt[] = fmt;   \
> +   unsigned long long ___param[___bpf_narg(args)]; \
> +   \
> +   _Pragma("GCC diagnostic push")  \
> +   _Pragma("GCC diagnostic ignored \"-Wint-conversion\"")  \
> +   ___bpf_fill(___param, args);\
> +   _Pragma("GCC diagnostic pop")   \
> +   \
> +   bpf_snprintf(out, out_size, ___fmt, \
> +___param, sizeof(___param));   \
> +})
> +
>  #endif
> --
> 2.31.1.295.g9ea45b61b8-goog
>
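For context, intended usage from BPF code would look roughly like this
(sketch):

	char comm[16];
	char out[64];
	long ret;

	bpf_get_current_comm(comm, sizeof(comm));
	ret = BPF_SNPRINTF(out, sizeof(out), "comm=%s pid=%d",
			   comm, bpf_get_current_pid_tgid() >> 32);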


Re: [PATCH bpf-next v3 4/6] libbpf: Initialize the bpf_seq_printf parameters array field by field

2021-04-13 Thread Andrii Nakryiko
On Mon, Apr 12, 2021 at 8:38 AM Florent Revest  wrote:
>
> When initializing the __param array with a one liner, if all args are
> const, the initial array value will be placed in the rodata section but
> because libbpf does not support relocation in the rodata section, any
> pointer in this array will stay NULL.
>
> Fixes: c09add2fbc5a ("tools/libbpf: Add bpf_iter support")
> Signed-off-by: Florent Revest 
> ---

Looks good!

Acked-by: Andrii Nakryiko 

>  tools/lib/bpf/bpf_tracing.h | 40 +++--
>  1 file changed, 29 insertions(+), 11 deletions(-)
>

[...]


Re: [PATCH bpf-next v3 3/6] bpf: Add a bpf_snprintf helper

2021-04-13 Thread Andrii Nakryiko
On Mon, Apr 12, 2021 at 8:38 AM Florent Revest  wrote:
>
> The implementation takes inspiration from the existing bpf_trace_printk
> helper but there are a few differences:
>
> To allow for a large number of format-specifiers, parameters are
> provided in an array, like in bpf_seq_printf.
>
> Because the output string takes two arguments and the array of
> parameters also takes two arguments, the format string needs to fit in
> one argument. Thankfully, ARG_PTR_TO_CONST_STR is guaranteed to point to
> a zero-terminated read-only map so we don't need a format string length
> arg.
>
> Because the format-string is known at verification time, we also do
> a first pass of format string validation in the verifier logic. This
> makes debugging easier.
>
> Signed-off-by: Florent Revest 
> ---
>  include/linux/bpf.h|  6 
>  include/uapi/linux/bpf.h   | 28 +++
>  kernel/bpf/helpers.c   |  2 ++
>  kernel/bpf/verifier.c  | 41 
>  kernel/trace/bpf_trace.c   | 50 ++
>  tools/include/uapi/linux/bpf.h | 28 +++
>  6 files changed, 155 insertions(+)
>

[...]

> diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
> index 5f46dd6f3383..d4020e5f91ee 100644
> --- a/kernel/bpf/verifier.c
> +++ b/kernel/bpf/verifier.c
> @@ -5918,6 +5918,41 @@ static int check_reference_leak(struct 
> bpf_verifier_env *env)
> return state->acquired_refs ? -EINVAL : 0;
>  }
>
> +static int check_bpf_snprintf_call(struct bpf_verifier_env *env,
> +  struct bpf_reg_state *regs)
> +{
> +   struct bpf_reg_state *fmt_reg = &regs[BPF_REG_3];
> +   struct bpf_reg_state *data_len_reg = &regs[BPF_REG_5];
> +   struct bpf_map *fmt_map = fmt_reg->map_ptr;
> +   int err, fmt_map_off, num_args;
> +   u64 fmt_addr;
> +   char *fmt;
> +
> +   /* data must be an array of u64 */
> +   if (data_len_reg->var_off.value % 8)
> +   return -EINVAL;
> +   num_args = data_len_reg->var_off.value / 8;
> +
> +   /* fmt being ARG_PTR_TO_CONST_STR guarantees that var_off is const
> +* and map_direct_value_addr is set.
> +*/
> +   fmt_map_off = fmt_reg->off + fmt_reg->var_off.value;
> +   err = fmt_map->ops->map_direct_value_addr(fmt_map, &fmt_addr,
> + fmt_map_off);
> +   if (err)
> +   return err;
> +   fmt = (char *)fmt_addr + fmt_map_off;
> +

bot complained about lack of (long) cast before fmt_addr, please address


[...]

> +   /* Maximumly we can have MAX_SNPRINTF_VARARGS parameters, just give
> +* all of them to snprintf().
> +*/
> +   err = snprintf(str, str_size, fmt, BPF_CAST_FMT_ARG(0, args, mod),
> +   BPF_CAST_FMT_ARG(1, args, mod), BPF_CAST_FMT_ARG(2, args, 
> mod),
> +   BPF_CAST_FMT_ARG(3, args, mod), BPF_CAST_FMT_ARG(4, args, 
> mod),
> +   BPF_CAST_FMT_ARG(5, args, mod), BPF_CAST_FMT_ARG(6, args, 
> mod),
> +   BPF_CAST_FMT_ARG(7, args, mod), BPF_CAST_FMT_ARG(8, args, 
> mod),
> +   BPF_CAST_FMT_ARG(9, args, mod), BPF_CAST_FMT_ARG(10, args, 
> mod),
> +   BPF_CAST_FMT_ARG(11, args, mod));
> +
> +   put_fmt_tmp_buf();

reading this for at least 3rd time, this put_fmt_tmp_buf() looks a bit
out of place and kind of random. I think bpf_printf_cleanup() name
pairs with bpf_printf_prepare() better.

> +
> +   return err + 1;

snprintf() already returns string length *including* terminating zero,
so this is wrong


> +}
> +

[...]


Re: [PATCH bpf-next v3 2/6] bpf: Add a ARG_PTR_TO_CONST_STR argument type

2021-04-13 Thread Andrii Nakryiko
On Mon, Apr 12, 2021 at 8:38 AM Florent Revest  wrote:
>
> This type provides the guarantee that an argument is going to be a const
> pointer to somewhere in a read-only map value. It also checks that this
> pointer is followed by a zero character before the end of the map value.
>
> Signed-off-by: Florent Revest 
> ---

LGTM.

Acked-by: Andrii Nakryiko 

>  include/linux/bpf.h   |  1 +
>  kernel/bpf/verifier.c | 41 +
>  2 files changed, 42 insertions(+)
>

[...]


Re: [PATCH bpf-next v3 1/6] bpf: Factorize bpf_trace_printk and bpf_seq_printf

2021-04-13 Thread Andrii Nakryiko
On Mon, Apr 12, 2021 at 8:38 AM Florent Revest  wrote:
>
> Two helpers (trace_printk and seq_printf) have very similar
> implementations of format string parsing and a third one is coming
> (snprintf). To avoid code duplication and make the code easier to
> maintain, this moves the operations associated with format string
> parsing (validation and argument sanitization) into one generic
> function.
>
> The implementation of the two existing helpers already drifted quite a
> bit so unifying them entailed a lot of changes:
>
> - bpf_trace_printk always expected fmt[fmt_size] to be the terminating
>   NULL character, this is no longer true, the first 0 is terminating.
> - bpf_trace_printk now supports %% (which produces the percentage char).
> - bpf_trace_printk now skips width formatting fields.
> - bpf_trace_printk now supports the X modifier (capital hexadecimal).
> - bpf_trace_printk now supports %pK, %px, %pB, %pi4, %pI4, %pi6 and %pI6
> - argument casting on 32 bit has been simplified into one macro and
>   using an enum instead of obscure int increments.
>
> - bpf_seq_printf now uses bpf_trace_copy_string instead of
>   strncpy_from_kernel_nofault and handles the %pks %pus specifiers.
> - bpf_seq_printf now prints longs correctly on 32 bit architectures.
>
> - both were changed to use a global per-cpu tmp buffer instead of one
>   stack buffer for trace_printk and 6 small buffers for seq_printf.
> - to avoid per-cpu buffer usage conflict, these helpers disable
>   preemption while the per-cpu buffer is in use.
> - both helpers now support the %ps and %pS specifiers to print symbols.
>
> Signed-off-by: Florent Revest 
> ---
>  kernel/trace/bpf_trace.c | 529 ++-
>  1 file changed, 248 insertions(+), 281 deletions(-)
>

[...]

> +/* Per-cpu temp buffers which can be used by printf-like helpers for %s or %p
> + */
> +#define MAX_PRINTF_BUF_LEN 512
> +
> +struct bpf_printf_buf {
> +   char tmp_buf[MAX_PRINTF_BUF_LEN];
> +};
> +static DEFINE_PER_CPU(struct bpf_printf_buf, bpf_printf_buf);
> +static DEFINE_PER_CPU(int, bpf_printf_buf_used);
> +
> +static int try_get_fmt_tmp_buf(char **tmp_buf)
>  {
> -   static char buf[BPF_TRACE_PRINTK_SIZE];
> -   unsigned long flags;
> -   va_list ap;
> -   int ret;
> +   struct bpf_printf_buf *bufs = this_cpu_ptr(&bpf_printf_buf);

why do this_cpu_ptr() here if, in the *tmp_buf case below, you will not
use it? Just a waste of CPU, no?

> +   int used;
>
> -   raw_spin_lock_irqsave(&trace_printk_lock, flags);
> -   va_start(ap, fmt);
> -   ret = vsnprintf(buf, sizeof(buf), fmt, ap);
> -   va_end(ap);
> -   /* vsnprintf() will not append null for zero-length strings */
> -   if (ret == 0)
> -   buf[0] = '\0';
> -   trace_bpf_trace_printk(buf);
> -   raw_spin_unlock_irqrestore(&trace_printk_lock, flags);
> +   if (*tmp_buf)
> +   return 0;
>
> -   return ret;
> +   preempt_disable();
> +   used = this_cpu_inc_return(bpf_printf_buf_used);
> +   if (WARN_ON_ONCE(used > 1)) {
> +   this_cpu_dec(bpf_printf_buf_used);
> +   return -EBUSY;
> +   }

get bufs pointer here instead?

> +   *tmp_buf = bufs->tmp_buf;
> +
> +   return 0;
> +}
> +
> +static void put_fmt_tmp_buf(void)
> +{
> +   if (this_cpu_read(bpf_printf_buf_used)) {
> +   this_cpu_dec(bpf_printf_buf_used);
> +   preempt_enable();
> +   }
>  }
>
>  /*
> - * Only limited trace_printk() conversion specifiers allowed:
> - * %d %i %u %x %ld %li %lu %lx %lld %lli %llu %llx %p %pB %pks %pus %s
> + * bpf_parse_fmt_str - Generic pass on format strings for printf-like helpers
> + *
> + * Returns a negative value if fmt is an invalid format string or 0 
> otherwise.
> + *
> + * This can be used in two ways:
> + * - Format string verification only: when final_args and mod are NULL
> + * - Arguments preparation: in addition to the above verification, it writes 
> in
> + *   final_args a copy of raw_args where pointers from BPF have been 
> sanitized
> + *   into pointers safe to use by snprintf. This also writes in the mod array
> + *   the size requirement of each argument, usable by BPF_CAST_FMT_ARG for 
> ex.
> + *
> + * In argument preparation mode, if 0 is returned, safe temporary buffers are
> + * allocated and put_fmt_tmp_buf should be called to free them after use.
>   */
> -BPF_CALL_5(bpf_trace_printk, char *, fmt, u32, fmt_size, u64, arg1,
> -  u64, arg2, u64, arg3)
> -{
> -   int i, mod[3] = {}, fmt_cnt = 0;
> -   char buf[64], fmt_ptype;
> -   void *unsafe_ptr = NULL;
> -   bool str_seen = false;
> +int bpf_printf_prepare(char *fmt, u32 fmt_size, const u64 *raw_args,
> +   u64 *final_args, enum bpf_printf_mod_type *mod,
> +   u32 num_args)
> +{
> +   int err, i, curr_specifier = 0, copy_size;
> +   char *unsafe_ptr = NULL, *tmp_buf = NULL;
> +   size_t 

Re: mmotm 2021-04-11-20-47 uploaded (bpf: xsk.c)

2021-04-13 Thread Andrii Nakryiko
On Mon, Apr 12, 2021 at 9:38 AM Randy Dunlap  wrote:
>
> On 4/11/21 8:48 PM, a...@linux-foundation.org wrote:
> > The mm-of-the-moment snapshot 2021-04-11-20-47 has been uploaded to
> >
> >https://www.ozlabs.org/~akpm/mmotm/
> >
> > mmotm-readme.txt says
> >
> > README for mm-of-the-moment:
> >
> > https://www.ozlabs.org/~akpm/mmotm/
> >
> > This is a snapshot of my -mm patch queue.  Uploaded at random hopefully
> > more than once a week.
> >
> > You will need quilt to apply these patches to the latest Linus release (5.x
> > or 5.x-rcY).  The series file is in broken-out.tar.gz and is duplicated in
> > https://ozlabs.org/~akpm/mmotm/series
> >
> > The file broken-out.tar.gz contains two datestamp files: .DATE and
> > .DATE-yyyy-mm-dd-hh-mm-ss.  Both contain the string yyyy-mm-dd-hh-mm-ss,
> > followed by the base kernel version against which this patch series is to
> > be applied.
> >
> > This tree is partially included in linux-next.  To see which patches are
> > included in linux-next, consult the `series' file.  Only the patches
> > within the #NEXT_PATCHES_START/#NEXT_PATCHES_END markers are included in
> > linux-next.
> >
> >
> > A full copy of the full kernel tree with the linux-next and mmotm patches
> > already applied is available through git within an hour of the mmotm
> > release.  Individual mmotm releases are tagged.  The master branch always
> > points to the latest release, so it's constantly rebasing.
> >
> >   https://github.com/hnaz/linux-mm
> >
> > The directory https://www.ozlabs.org/~akpm/mmots/ (mm-of-the-second)
> > contains daily snapshots of the -mm tree.  It is updated more frequently
> > than mmotm, and is untested.
> >
> > A git copy of this tree is also available at
> >
> >   https://github.com/hnaz/linux-mm
>
> on x86_64:
>
> xsk.c: In function ‘xsk_socket__create_shared’:
> xsk.c:1027:7: error: redeclaration of ‘unmap’ with no linkage
>   bool unmap = umem->fill_save != fill;
>^
> xsk.c:1020:7: note: previous declaration of ‘unmap’ was here
>   bool unmap, rx_setup_done = false, tx_setup_done = false;
>^
> xsk.c:1028:7: error: redefinition of ‘rx_setup_done’
>   bool rx_setup_done = false, tx_setup_done = false;
>^
> xsk.c:1020:14: note: previous definition of ‘rx_setup_done’ was here
>   bool unmap, rx_setup_done = false, tx_setup_done = false;
>   ^
> xsk.c:1028:30: error: redefinition of ‘tx_setup_done’
>   bool rx_setup_done = false, tx_setup_done = false;
>   ^
> xsk.c:1020:37: note: previous definition of ‘tx_setup_done’ was here
>   bool unmap, rx_setup_done = false, tx_setup_done = false;
>  ^
>
>
> Full randconfig file is attached.

What SHA are you on? I checked that github tree, the source code there
doesn't correspond to the errors here (i.e., there is no unmap
redefinition on lines 1020 and 1027). Could it be some local merge
conflict?

>
> --
> ~Randy
> Reported-by: Randy Dunlap 


Re: [PATCH bpf-next v2] libbpf: clarify flags in ringbuf helpers

2021-04-12 Thread Andrii Nakryiko
On Mon, Apr 12, 2021 at 12:25 PM Pedro Tammela  wrote:
>
> In 'bpf_ringbuf_reserve()' we require the flag to be '0' at the moment.
>
> For 'bpf_ringbuf_{discard,submit,output}' a flag of '0' might send a
> notification to the process if needed.
>
> Signed-off-by: Pedro Tammela 
> ---

Great, thanks! Applied to bpf-next.

>  include/uapi/linux/bpf.h   | 16 
>  tools/include/uapi/linux/bpf.h | 16 
>  2 files changed, 32 insertions(+)
>

[...]
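For reference, the semantics being documented (sketch; 'rb' is an assumed
BPF_MAP_TYPE_RINGBUF map and 'struct event' an assumed sample type):

	struct event *e;

	e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0); /* flags must be 0 */
	if (!e)
		return 0;
	/* ... fill in *e ... */
	bpf_ringbuf_submit(e, 0); /* 0: adaptive notification */
	/* or BPF_RB_NO_WAKEUP / BPF_RB_FORCE_WAKEUP to override */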


Re: memory leak in bpf

2021-04-07 Thread Andrii Nakryiko
On Wed, Apr 7, 2021 at 4:24 PM Rustam Kovhaev  wrote:
>
> On Mon, Mar 01, 2021 at 09:43:00PM +0100, Dmitry Vyukov wrote:
> > On Mon, Mar 1, 2021 at 9:39 PM Rustam Kovhaev  wrote:
> > >
> > > On Mon, Mar 01, 2021 at 08:05:42PM +0100, Dmitry Vyukov wrote:
> > > > On Mon, Mar 1, 2021 at 5:21 PM Rustam Kovhaev  
> > > > wrote:
> > > > >
> > > > > On Wed, Dec 09, 2020 at 10:58:10PM -0800, syzbot wrote:
> > > > > > syzbot has found a reproducer for the following issue on:
> > > > > >
> > > > > > HEAD commit:a68a0262 mm/madvise: remove racy mm ownership check
> > > > > > git tree:   upstream
> > > > > > console output: 
> > > > > > https://syzkaller.appspot.com/x/log.txt?x=11facf1750
> > > > > > kernel config:  
> > > > > > https://syzkaller.appspot.com/x/.config?x=4305fa9ea70c7a9f
> > > > > > dashboard link: 
> > > > > > https://syzkaller.appspot.com/bug?extid=f3694595248708227d35
> > > > > > compiler:   gcc (GCC) 10.1.0-syz 20200507
> > > > > > syz repro:  
> > > > > > https://syzkaller.appspot.com/x/repro.syz?x=159a961350
> > > > > > C reproducer:   
> > > > > > https://syzkaller.appspot.com/x/repro.c?x=11bf712350
> > > > > >
> > > > > > IMPORTANT: if you fix the issue, please add the following tag to 
> > > > > > the commit:
> > > > > > Reported-by: syzbot+f3694595248708227...@syzkaller.appspotmail.com
> > > > > >
> > > > > > Debian GNU/Linux 9 syzkaller ttyS0
> > > > > > Warning: Permanently added '10.128.0.9' (ECDSA) to the list of 
> > > > > > known hosts.
> > > > > > executing program
> > > > > > executing program
> > > > > > executing program
> > > > > > BUG: memory leak
> > > > > > unreferenced object 0x88810efccc80 (size 64):
> > > > > >   comm "syz-executor334", pid 8460, jiffies 4294945724 (age 13.850s)
> > > > > >   hex dump (first 32 bytes):
> > > > > > c0 cb 14 04 00 ea ff ff c0 c2 11 04 00 ea ff ff  
> > > > > > 
> > > > > > c0 56 3f 04 00 ea ff ff 40 18 38 04 00 ea ff ff  
> > > > > > .V?.@.8.
> > > > > >   backtrace:
> > > > > > [<36ae98a7>] kmalloc_node include/linux/slab.h:575 
> > > > > > [inline]
> > > > > > [<36ae98a7>] bpf_ringbuf_area_alloc 
> > > > > > kernel/bpf/ringbuf.c:94 [inline]
> > > > > > [<36ae98a7>] bpf_ringbuf_alloc kernel/bpf/ringbuf.c:135 
> > > > > > [inline]
> > > > > > [<36ae98a7>] ringbuf_map_alloc kernel/bpf/ringbuf.c:183 
> > > > > > [inline]
> > > > > > [<36ae98a7>] ringbuf_map_alloc+0x1be/0x410 
> > > > > > kernel/bpf/ringbuf.c:150
> > > > > > [] find_and_alloc_map 
> > > > > > kernel/bpf/syscall.c:122 [inline]
> > > > > > [] map_create kernel/bpf/syscall.c:825 
> > > > > > [inline]
> > > > > > [] __do_sys_bpf+0x7d0/0x30a0 
> > > > > > kernel/bpf/syscall.c:4381
> > > > > > [<8feaf393>] do_syscall_64+0x2d/0x70 
> > > > > > arch/x86/entry/common.c:46
> > > > > > [] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > > > > >
> > > > > >
> > > > >
> > > > > i am pretty sure that this one is a false positive
> > > > > the problem with reproducer is that it does not terminate all of the
> > > > > child processes that it spawns
> > > > >
> > > > > i confirmed that it is a false positive by tracing __fput() and
> > > > > bpf_map_release(), i ran reproducer, got kmemleak report, then i
> > > > > manually killed those running leftover processes from reproducer and
> > > > > then both functions were executed and memory was freed
> > > > >
> > > > > i am marking this one as:
> > > > > #syz invalid
> > > >
> > > > Hi Rustam,
> > > >
> > > > Thanks for looking into this.
> > > >
> > > > I wonder how/where are these objects referenced? If they are not
> > > > leaked and referenced somewhere, KMEMLEAK should not report them as
> > > > leaks.
> > > > So even if this is a false positive for BPF, this is a true positive
> > > > bug and something to fix for KMEMLEAK ;)
> > > > And syzbot will probably re-create this bug report soon as this still
> > > > happens and is not a one-off thing.
> > >
> > > hi Dmitry, i haven't thought of it this way, but i guess you are right,
> > > it is a kmemleak bug, ideally kmemleak should be aware that there are
> > > still running processes holding references to bpf fd/anonymous inodes
> > > which in their turn hold references to allocated bpf maps
> >
> > KMEMLEAK scans whole memory, so if there are pointers to the object
> > anywhere in memory, KMEMLEAK should not report them as leaked. Running
> > processes have no direct effect on KMEMLEAK logic.
> > So the question is: where are these pointers to these objects? If we
> > answer this, we can check how/why KMEMLEAK misses them. Are they
> > mangled in some way?
> thank you for your comments, they make sense, and indeed, the pointer
> gets vmapped.
> i should have looked into this sooner, because syzbot did trigger the
> issue again, and Andrii had to look into the same bug, sorry 

Re: [PATCH bpf-next v2 3/6] bpf: Add a bpf_snprintf helper

2021-04-07 Thread Andrii Nakryiko
On Tue, Apr 6, 2021 at 9:06 AM Florent Revest  wrote:
>
> On Fri, Mar 26, 2021 at 11:55 PM Andrii Nakryiko
>  wrote:
> > On Tue, Mar 23, 2021 at 7:23 PM Florent Revest  wrote:
> > > The implementation takes inspiration from the existing bpf_trace_printk
> > > helper but there are a few differences:
> > >
> > > To allow for a large number of format-specifiers, parameters are
> > > provided in an array, like in bpf_seq_printf.
> > >
> > > Because the output string takes two arguments and the array of
> > > parameters also takes two arguments, the format string needs to fit in
> > > one argument. But because ARG_PTR_TO_CONST_STR guarantees to point to a
> > > NULL-terminated read-only map, we don't need a format string length arg.
> > >
> > > Because the format-string is known at verification time, we also move
> > > most of the format string validation, currently done in formatting
> > > helper calls, into the verifier logic. This makes debugging easier and
> > > also slightly improves the runtime performance.
> > >
> > > Signed-off-by: Florent Revest 
> > > ---
> > >  include/linux/bpf.h|  6 
> > >  include/uapi/linux/bpf.h   | 28 ++
> > >  kernel/bpf/helpers.c   |  2 ++
> > >  kernel/bpf/verifier.c  | 41 +++
> > >  kernel/trace/bpf_trace.c   | 52 ++
> > >  tools/include/uapi/linux/bpf.h | 28 ++
> > >  6 files changed, 157 insertions(+)
> > >
> > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > > index 7b5319d75b3e..f3d9c8fa60b3 100644
> > > --- a/include/linux/bpf.h
> > > +++ b/include/linux/bpf.h
> > > @@ -1893,6 +1893,7 @@ extern const struct bpf_func_proto 
> > > bpf_skc_to_tcp_request_sock_proto;
> > >  extern const struct bpf_func_proto bpf_skc_to_udp6_sock_proto;
> > >  extern const struct bpf_func_proto bpf_copy_from_user_proto;
> > >  extern const struct bpf_func_proto bpf_snprintf_btf_proto;
> > > +extern const struct bpf_func_proto bpf_snprintf_proto;
> > >  extern const struct bpf_func_proto bpf_per_cpu_ptr_proto;
> > >  extern const struct bpf_func_proto bpf_this_cpu_ptr_proto;
> > >  extern const struct bpf_func_proto bpf_ktime_get_coarse_ns_proto;
> > > @@ -2018,4 +2019,9 @@ int bpf_arch_text_poke(void *ip, enum 
> > > bpf_text_poke_type t,
> > >  struct btf_id_set;
> > >  bool btf_id_set_contains(const struct btf_id_set *set, u32 id);
> > >
> > > +enum bpf_printf_mod_type;
> > > +int bpf_printf_preamble(char *fmt, u32 fmt_size, const u64 *raw_args,
> > > +   u64 *final_args, enum bpf_printf_mod_type *mod,
> > > +   u32 num_args);
> > > +
> > >  #endif /* _LINUX_BPF_H */
> > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > index 2d3036e292a9..86af61e912c6 100644
> > > --- a/include/uapi/linux/bpf.h
> > > +++ b/include/uapi/linux/bpf.h
> > > @@ -4660,6 +4660,33 @@ union bpf_attr {
> > >   * Return
> > >   * The number of traversed map elements for success, **-EINVAL** for
> > >   * invalid **flags**.
> > > + *
> > > + * long bpf_snprintf(char *str, u32 str_size, const char *fmt, u64 *data, u32 data_len)
> > > + * Description
> > > + * Outputs a string into the **str** buffer of size **str_size**
> > > + * based on a format string stored in a read-only map pointed by
> > > + * **fmt**.
> > > + *
> > > + * Each format specifier in **fmt** corresponds to one u64 element
> > > + * in the **data** array. For strings and pointers where pointees
> > > + * are accessed, only the pointer values are stored in the *data*
> > > + * array. The *data_len* is the size of *data* in bytes.
> > > + *
> > > + * Formats **%s** and **%p{i,I}{4,6}** require to read kernel
> > > + * memory. Reading kernel memory may fail due to either invalid
> > > + * address or valid address but requiring a major memory fault. If
> > > + * reading kernel memory fails, the string for **%s** will be an
> > > + * empty string
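
For reference, a minimal sketch of how a BPF program could call the proposed
helper, assuming it lands with the signature quoted above and gets exposed
through libbpf's bpf_helpers.h (the section name, format string and program
name are illustrative). The const format string ends up in the read-only
.rodata map, and each format specifier consumes one u64 slot of the data
array:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

static const char fmt[] = "execve called by pid %d";

SEC("tracepoint/syscalls/sys_enter_execve")
int snprintf_demo(void *ctx)
{
	char out[64];
	__u64 data[] = { bpf_get_current_pid_tgid() >> 32 };

	/* data_len is the size of the data array in bytes */
	bpf_snprintf(out, sizeof(out), fmt, data, sizeof(data));
	return 0;
}

char LICENSE[] SEC("license") = "GPL";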

Re: [PATCH bpf-next v2 1/6] bpf: Factorize bpf_trace_printk and bpf_seq_printf

2021-04-07 Thread Andrii Nakryiko
On Tue, Apr 6, 2021 at 8:35 AM Florent Revest  wrote:
>
> [Sorry for the late replies, I'm just back from a long easter break :)]
>
> On Fri, Mar 26, 2021 at 11:51 PM Andrii Nakryiko
>  wrote:
> > On Fri, Mar 26, 2021 at 2:53 PM Andrii Nakryiko
> >  wrote:
> > > On Tue, Mar 23, 2021 at 7:23 PM Florent Revest  
> > > wrote:
> > > > Unfortunately, the implementation of the two existing helpers already
> > > > drifted quite a bit and unifying them entailed a lot of changes:
> > >
> > > "Unfortunately" as in a lot of extra work for you? I think overall
> > > though it was very fortunate that you ended up doing it, all
> > > implementations are more feature-complete and saner now, no? Thanks a
> > > lot for your hard work!
>
> Ahah, "unfortunately" a bit of extra work for me, indeed. But I find
> this kind of refactoring patches even harder to review than to write
> so thank you too!
>
> > > > - bpf_trace_printk always expected fmt[fmt_size] to be the terminating
> > > >   NULL character, this is no longer true, the first 0 is terminating.
> > >
> > > You mean if you had bpf_trace_printk("bla bla\0some more bla\0", 24)
> > > it would emit that zero character? If yes, I don't think it was a sane
> > > behavior anyways.
>
> The call to snprintf in bpf_do_trace_printk would eventually ignore
> "some more bla" but the parsing done in bpf_trace_printk would indeed
> read the whole string.
>
> > > This is great, you already saved some lines of code! I suspect I'll
> > > have some complaints about mods (it feels like this preamble should
> > > provide extra information about which arguments have to be read from
> > > kernel/user memory), but I'll see next patches first.
> >
> > Disregard the last part (at least for now). I had a mental model that
> > it should be possible to parse a format string once and then remember
> > "instructions" (i.e., arg1 is long, arg2 is string, and so on). But
> > that's too complicated, so I think re-parsing the format string is
> > much simpler.
>
> I also wanted to do that originally but realized it would keep a lot
> of the complexity in the helpers themselves and not really move the
> needle.
>
> > > > +/* Horrid workaround for getting va_list handling working with 
> > > > different
> > > > + * argument type combinations generically for 32 and 64 bit archs.
> > > > + */
> > > > +#define BPF_CAST_FMT_ARG(arg_nb, args, mod)                            \
> > > > +   ((mod[arg_nb] == BPF_PRINTF_LONG_LONG ||                            \
> > > > +     (mod[arg_nb] == BPF_PRINTF_LONG && __BITS_PER_LONG == 64))        \
> > > > +     ? args[arg_nb]                                                    \
> > > > +     : ((mod[arg_nb] == BPF_PRINTF_LONG ||                             \
> > > > +        (mod[arg_nb] == BPF_PRINTF_INT && __BITS_PER_LONG == 32))      \
> > >
> > > is this right? INT is always 32-bit, it's only LONG that differs.
> > > Shouldn't the rule be
> > >
> > > (LONG_LONG || LONG && __BITS_PER_LONG == 64) -> (__u64)args[args_nb]
> > > (INT || LONG && __BITS_PER_LONG == 32) -> (__u32)args[args_nb]
> > >
> > > Does (long) cast do anything fancy when casting from u64? Sorry, maybe
> > > I'm confused.
>
> To be honest, I am also confused by that logic... :p My patch tries to
> conserve exactly the same logic as "88a5c690b6 bpf: fix
> bpf_trace_printk on 32 bit archs" because I was also afraid of missing
> something and could not test it on 32 bit arches. From that commit
> description, it is unclear to me what "u32 and long are passed
> differently to u64, since the result of C conditional operators
> follows the "usual arithmetic conversions" rules" means. Maybe Daniel
> can comment on this?

Yeah, no idea. Seems like the code above should work fine for both
32-bit and 64-bit, and for both little- and big-endian.

>
> > > > +int bpf_printf_preamble(char *fmt, u32 fmt_size, const u64 *raw_args,
> > > > +   u64 *final_args, enum bpf_printf_mod_type *mod,
> > > > +   u32 num_args)
> > > > +{
> > > > +   struct bpf_printf_buf *bufs = this_cpu_ptr(&bpf_printf_buf);
> > > > +   int err, i, fmt_cnt = 0, copy_size, used;

Re: [PATCH bpf-next] libbpf: clarify flags in ringbuf helpers

2021-04-07 Thread Andrii Nakryiko
On Wed, Apr 7, 2021 at 1:10 PM Pedro Tammela  wrote:
>
> On Wed, Apr 7, 2021 at 4:58 PM Andrii Nakryiko wrote:
> >
> > On Wed, Apr 7, 2021 at 11:43 AM Joe Stringer  wrote:
> > >
> > > Hi Pedro,
> > >
> > > On Tue, Apr 6, 2021 at 11:58 AM Pedro Tammela  wrote:
> > > >
> > > > In 'bpf_ringbuf_reserve()' we require the flag to be '0' at the moment.
> > > >
> > > > For 'bpf_ringbuf_{discard,submit,output}' a flag of '0' might send a
> > > > notification to the process if needed.
> > > >
> > > > Signed-off-by: Pedro Tammela 
> > > > ---
> > > >  include/uapi/linux/bpf.h   | 7 +++
> > > >  tools/include/uapi/linux/bpf.h | 7 +++
> > > >  2 files changed, 14 insertions(+)
> > > >
> > > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > > > index 49371eba98ba..8c5c7a893b87 100644
> > > > --- a/include/uapi/linux/bpf.h
> > > > +++ b/include/uapi/linux/bpf.h
> > > > @@ -4061,12 +4061,15 @@ union bpf_attr {
> > > >   * of new data availability is sent.
> > > >   * If **BPF_RB_FORCE_WAKEUP** is specified in *flags*, notification
> > > >   * of new data availability is sent unconditionally.
> > > > + * If **0** is specified in *flags*, notification
> > > > + * of new data availability is sent if needed.
> > >
> > > Maybe a trivial question, but what does "if needed" mean? Does that
> > > mean "when the buffer is full"?
> >
> > I used to call it "adaptive notification", so maybe let's use that
> > term instead of "if needed"? It means that the in-kernel BPF ringbuf
> > code will check if the user-space consumer has caught up and consumed
> > all the available data. In that case user-space might be waiting
> > (sleeping) in epoll_wait() already and not processing samples
> > actively. That means that we have to send a notification, otherwise
> > user-space might never wake up. But if the kernel sees that user-space
> > is still processing a previous record (consumer position < producer
> > position), then we can bypass sending another notification, because
> > the user-space consumer protocol dictates that it needs to consume all
> > the records until consumer position == producer position. So no
> > notification is necessary for the newly submitted sample, as
> > user-space will eventually see it without a notification.
> >
> > Of course there are careful writes and memory ordering involved to
> > make sure that we never miss a notification.
> >
> > Does someone want to try to condense it into a succinct description? ;)
>
> OK.
>
> I can try to condense this and perhaps add it as code in the comment?

Sure. There is already a brief comment to that effect, but having a
high-level explanation in uapi/linux/bpf.h would be great for users.
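
For reference, a minimal sketch of the user-space consumer side this scheme
relies on, using the existing libbpf ring_buffer API (the fd and handler
names are illustrative). ring_buffer__poll() sleeps in epoll_wait() until the
kernel sends a notification and then consumes records up to the producer
position, which is what makes the adaptive notification safe:

#include <bpf/libbpf.h>

static int handle_event(void *ctx, void *data, size_t size)
{
	/* consume one record; returning non-zero stops polling */
	return 0;
}

int consume(int rb_map_fd)
{
	struct ring_buffer *rb;

	rb = ring_buffer__new(rb_map_fd, handle_event, NULL, NULL);
	if (!rb)
		return -1;

	/* blocks in epoll_wait() until notified, then drains the buffer */
	while (ring_buffer__poll(rb, -1 /* timeout, ms */) >= 0)
		;
	ring_buffer__free(rb);
	return 0;
}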


Re: [PATCH bpf-next] libbpf: clarify flags in ringbuf helpers

2021-04-07 Thread Andrii Nakryiko
On Wed, Apr 7, 2021 at 11:43 AM Joe Stringer  wrote:
>
> Hi Pedro,
>
> On Tue, Apr 6, 2021 at 11:58 AM Pedro Tammela  wrote:
> >
> > In 'bpf_ringbuf_reserve()' we require the flag to be '0' at the moment.
> >
> > For 'bpf_ringbuf_{discard,submit,output}' a flag of '0' might send a
> > notification to the process if needed.
> >
> > Signed-off-by: Pedro Tammela 
> > ---
> >  include/uapi/linux/bpf.h   | 7 +++
> >  tools/include/uapi/linux/bpf.h | 7 +++
> >  2 files changed, 14 insertions(+)
> >
> > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> > index 49371eba98ba..8c5c7a893b87 100644
> > --- a/include/uapi/linux/bpf.h
> > +++ b/include/uapi/linux/bpf.h
> > @@ -4061,12 +4061,15 @@ union bpf_attr {
> >   * of new data availability is sent.
> >   * If **BPF_RB_FORCE_WAKEUP** is specified in *flags*, notification
> >   * of new data availability is sent unconditionally.
> > + * If **0** is specified in *flags*, notification
> > + * of new data availability is sent if needed.
>
> Maybe a trivial question, but what does "if needed" mean? Does that
> mean "when the buffer is full"?

I used to call it "adaptive notification", so maybe let's use that
term instead of "if needed"? It means that the in-kernel BPF ringbuf
code will check if the user-space consumer has caught up and consumed
all the available data. In that case user-space might be waiting
(sleeping) in epoll_wait() already and not processing samples
actively. That means that we have to send a notification, otherwise
user-space might never wake up. But if the kernel sees that user-space
is still processing a previous record (consumer position < producer
position), then we can bypass sending another notification, because
the user-space consumer protocol dictates that it needs to consume all
the records until consumer position == producer position. So no
notification is necessary for the newly submitted sample, as
user-space will eventually see it without a notification.

Of course there are careful writes and memory ordering involved to
make sure that we never miss a notification.

Does someone want to try to condense it into a succinct description? ;)
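
To make the three modes concrete, here is a minimal sketch from the BPF side
(the map size, section name and struct event are illustrative; the BPF_RB_*
flags are the existing ones from uapi/linux/bpf.h):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_RINGBUF);
	__uint(max_entries, 256 * 1024);
} rb SEC(".maps");

struct event {
	__u32 pid;
};

SEC("tracepoint/syscalls/sys_enter_execve")
int ringbuf_demo(void *ctx)
{
	struct event *e;

	e = bpf_ringbuf_reserve(&rb, sizeof(*e), 0); /* flags must be 0 here */
	if (!e)
		return 0;
	e->pid = bpf_get_current_pid_tgid() >> 32;

	/* 0 = adaptive notification: wake up user space only if it has
	 * already consumed everything produced so far and may be sleeping
	 * in epoll_wait() */
	bpf_ringbuf_submit(e, 0);

	/* Alternatives:
	 *   bpf_ringbuf_submit(e, BPF_RB_NO_WAKEUP);     never notify
	 *   bpf_ringbuf_submit(e, BPF_RB_FORCE_WAKEUP);  notify unconditionally
	 */
	return 0;
}

char LICENSE[] SEC("license") = "GPL";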


Re: [PATCH bpf-next v2 2/3] libbpf: selftests: refactor 'BPF_PERCPU_TYPE()' and 'bpf_percpu()' macros

2021-04-07 Thread Andrii Nakryiko
On Wed, Apr 7, 2021 at 12:30 PM Pedro Tammela  wrote:
>
> On Wed, Apr 7, 2021 at 3:31 PM Andrii Nakryiko wrote:
> >
> > On Tue, Apr 6, 2021 at 11:55 AM Pedro Tammela  wrote:
> > >
> > > This macro was refactored out of the bpf selftests.
> > >
> > > Since percpu values are rounded up to '8' in the kernel, a careless
> > > user in userspace might encounter unexpected values when parsing the
> > > output of the batched operations.
> >
> > I wonder if a user has to be more careful, though? These
> > BPF_PERCPU_TYPE, __bpf_percpu_align and bpf_percpu macros seem to
> > create just another opaque layer. It actually seems detrimental to me.
> >
> > I'd rather emphasize in the documentation (e.g., in
> > bpf_map_lookup_elem) that all per-cpu maps are aligning values at 8
> > bytes, so user has to make sure that array of values provided to
> > bpf_map_lookup_elem() has each element size rounded up to 8.
>
> From my own experience, the documentation has been a very unreliable
> source, to the point that I usually jump to the code first rather than
> to the documentation nowadays[1].

I totally agree, which is why I think improving docs is necessary.
Unfortunately docs are usually lagging behind, because generally
people hate writing documentation and it's just a fact of life.

> Tests, samples and projects have always been my source of truth and we
> are already lacking a bit on those as well. For instance, the samples
> directory contains programs that are very outdated (I didn't check if
> they are still functional).

Yeah, samples/bpf is bitrotting. selftests/bpf, though, are maintained
and run regularly and vigorously, so making sure they set a good and
realistic example is a good idea.


> I think macros like these will be present in most of the projects
> dealing with batched operations and as a daily user of libbpf I don't
> see how this could not be offered by libbpf as a standardized way to
> declare percpu types.

If I were using per-CPU maps a lot, I'd make sure I use u64 and
aligned(8) types and bypass all the macro ugliness, because there is
no need for it and it just hurts readability. So I don't want libbpf to
incentivize bad choices here by providing seemingly convenient macros.
Users have to be aware that values are 8-byte aligned/extended. That's
not a big secret and not a very obscure thing to learn anyways.

>
> [1] So batched operations were introduced a little bit over a year
> ago and yet the only reference I had for it was the selftests. The
> documentation is on my TODO list, but that's just because I have to
> deal with it daily.
>

Yeah, please do contribute them!

> >
> > In practice, I'd recommend users to always use __u64/__s64 when having
> > primitive integers in a map (they are not saving anything by using
> > int, it just creates an illusion of savings). Well, maybe on 32-bit
> > arches they would save a bit of CPU, but not on typical 64-bit
> > architectures. As for using structs as values, always mark them as
> > __attribute__((aligned(8))).
> >
> > Basically, instead of obscuring the real use some more, let's clarify
> > and maybe even provide some examples in documentation?
>
> Why not do both?
>
> Provide a standardized way to declare a percpu value, along with good
> documentation and examples.
> Let the user decide what is best for his use case.

What is a standardized way? A custom macro with struct { T v; }
inside? That's just one way of doing this, and it requires another
macro to just access the value (because no one wants to write
my_values[cpu].v, right?). I'd say the standardized way of reading
values should look like `my_values[cpu]`, that's it. For that you use
64-bit integers or 8-byte aligned structs. And don't mess with macros
for that at all.

So if a user insists on using int/short/char as value, they can do
their own struct { char v; } __aligned(8) trick. But I'd advise such
users to reconsider and use u64. If they are using structs for values,
always mark __aligned(8) and forget about this in the rest of your
code.

As for allocating memory for array of per-cpu values, there is also no
single standardized way we can come up with. It could be malloc() on
the heap. Or alloca() on the stack. Or it could be a pre-allocated one
for up to the maximum number of supported CPUs. Or... whatever makes sense.

So I think the best way to handle all that is to clearly explain how
reading per-CPU values from per-CPU maps works in BPF and what the
memory layout expectations are.
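
A minimal sketch of what those expectations boil down to on the BPF side
(map and struct names are illustrative): an 8-byte aligned value type
declared without any helper macro; for each key, user space then reads
nr_cpus contiguous 8-byte-aligned values:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* value type marked 8-byte aligned, no declaration macro needed */
struct counter {
	__u32 packets;
	__u32 bytes;
} __attribute__((aligned(8)));

struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, struct counter);
} counters SEC(".maps");

SEC("xdp")
int count_packets(struct xdp_md *ctx)
{
	__u32 key = 0;
	struct counter *c = bpf_map_lookup_elem(&counters, &key);

	if (c) {
		/* from BPF, the lookup returns this CPU's copy */
		c->packets++;
		c->bytes += ctx->data_end - ctx->data;
	}
	return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";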

>
> >
> > >
> > > Now that both array and hash maps have support for batched ops in the
> > > percpu variant, let's provide a convenient macro to declare percpu map
> > > value types.

Re: [syzbot] memory leak in bpf (2)

2021-04-07 Thread Andrii Nakryiko
On Wed, Mar 31, 2021 at 6:08 PM syzbot
 wrote:
>
> Hello,
>
> syzbot found the following issue on:
>
> HEAD commit: 0f4498ce Merge tag 'for-5.12/dm-fixes-2' of git://git.kern..
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=1250e126d0
> kernel config:  https://syzkaller.appspot.com/x/.config?x=49f2683f4e7a4347
> dashboard link: https://syzkaller.appspot.com/bug?extid=5d895828587f49e7fe9b
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=10a17016d0
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=10a32016d0
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+5d895828587f49e7f...@syzkaller.appspotmail.com
>
> Warning: Permanently added '10.128.0.74' (ECDSA) to the list of known hosts.
> executing program
> executing program
> BUG: memory leak
> unreferenced object 0x8881133295c0 (size 64):
>   comm "syz-executor529", pid 8395, jiffies 4294943939 (age 8.130s)
>   hex dump (first 32 bytes):
> 40 48 3c 04 00 ea ff ff 00 48 3c 04 00 ea ff ff  @H<..H<.
> c0 e7 3c 04 00 ea ff ff 80 e7 3c 04 00 ea ff ff  ..<...<.
>   backtrace:
> [] kmalloc_node include/linux/slab.h:577 [inline]
> [] __bpf_map_area_alloc+0xfc/0x120 
> kernel/bpf/syscall.c:300
> [] bpf_ringbuf_area_alloc kernel/bpf/ringbuf.c:90 
> [inline]
> [] bpf_ringbuf_alloc kernel/bpf/ringbuf.c:131 [inline]
> [] ringbuf_map_alloc kernel/bpf/ringbuf.c:170 [inline]
> [] ringbuf_map_alloc+0x134/0x350 
> kernel/bpf/ringbuf.c:146
> [] find_and_alloc_map kernel/bpf/syscall.c:122 [inline]
> [] map_create kernel/bpf/syscall.c:828 [inline]
> [] __do_sys_bpf+0x7c3/0x2fe0 kernel/bpf/syscall.c:4375
> [] do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
> [] entry_SYSCALL_64_after_hwframe+0x44/0xae
>
>

I think either kmemleak or syzbot is mis-reporting this. I've added a
bunch of printks around all allocations performed by BPF ringbuf. When
I run repro, I see this:

[   26.013500] ALLOC rb_map 888118d7d000
[   26.013946] ALLOC KMALLOC AREA 88810d538c00
[   26.014439] ALLOC PAGES 88810d538c00
[   26.014826] ALLOC PAGE[0] ea000419af00
[   26.015272] ALLOC PAGE[1] ea000419aec0
[   26.015686] ALLOC PAGE[2] ea000419ae80
[   26.016090] ALLOC PAGE[3] ea00042e29c0
[   26.016513] ALLOC PAGE[4] ea00042a1000
[   26.016928] VMAP rb c9539000
[   26.017291] ALLOC rb_map->rb c9539000
[   26.017712] FINISHED ALLOC BPF_MAP 888118d7d000
[   32.105069] ALLOC rb_map 888118d7d200
[   32.105568] ALLOC KMALLOC AREA 88810d538c80
[   32.106005] ALLOC PAGES 88810d538c80
[   32.106407] ALLOC PAGE[0] ea000419aa80
[   32.106805] ALLOC PAGE[1] ea000419ab00
[   32.107206] ALLOC PAGE[2] ea000419abc0
[   32.107607] ALLOC PAGE[3] ea0004284480
[   32.108003] ALLOC PAGE[4] ea0004284440
[   32.108419] VMAP rb c95ad000
[   32.108765] ALLOC rb_map->rb c95ad000
[   32.109186] FINISHED ALLOC BPF_MAP 888118d7d200
[   33.592874] kmemleak: 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
[   40.526922] kmemleak: 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak)

On repro side I get these two warnings:

[vmuser@archvm bpf]$ sudo ./repro
BUG: memory leak
unreferenced object 0x88810d538c00 (size 64):
  comm "repro", pid 2140, jiffies 4294692933 (age 14.540s)
  hex dump (first 32 bytes):
00 af 19 04 00 ea ff ff c0 ae 19 04 00 ea ff ff  
80 ae 19 04 00 ea ff ff c0 29 2e 04 00 ea ff ff  .)..
  backtrace:
[<77bfbfbd>] __bpf_map_area_alloc+0x31/0xc0
[<587fa522>] ringbuf_map_alloc.cold.4+0x48/0x218
[<44d49e96>] __do_sys_bpf+0x359/0x1d90
[] do_syscall_64+0x2d/0x40
[<43d3112a>] entry_SYSCALL_64_after_hwframe+0x44/0xae

BUG: memory leak
unreferenced object 0x88810d538c80 (size 64):
  comm "repro", pid 2143, jiffies 4294699025 (age 8.448s)
  hex dump (first 32 bytes):
80 aa 19 04 00 ea ff ff 00 ab 19 04 00 ea ff ff  
c0 ab 19 04 00 ea ff ff 80 44 28 04 00 ea ff ff  .D(.
  backtrace:
[<77bfbfbd>] __bpf_map_area_alloc+0x31/0xc0
[<587fa522>] ringbuf_map_alloc.cold.4+0x48/0x218
[<44d49e96>] __do_sys_bpf+0x359/0x1d90
[] do_syscall_64+0x2d/0x40
[<43d3112a>] entry_SYSCALL_64_after_hwframe+0x44/0xae

Note that both reported leaks (88810d538c80 and 88810d538c00)
correspond to the pages array that bpf_ringbuf is allocating and tracking
properly internally.

Note also that the syzbot repro doesn't close FDs of the created BPF
ringbufs, and even when ./repro itself exits with an error, there are
still two forked processes hanging around in my system. So clearly the
ringbuf maps are alive at that point, and reporting any memory leak
looks weird, because that memory is being used by active, referenced
BPF ringbuf maps.

Re: [PATCH bpf-next v2 2/3] libbpf: selftests: refactor 'BPF_PERCPU_TYPE()' and 'bpf_percpu()' macros

2021-04-07 Thread Andrii Nakryiko
On Tue, Apr 6, 2021 at 11:55 AM Pedro Tammela  wrote:
>
> This macro was refactored out of the bpf selftests.
>
> Since percpu values are rounded up to '8' in the kernel, a careless
> user in userspace might encounter unexpected values when parsing the
> output of the batched operations.

I wonder if a user has to be more careful, though? These
BPF_PERCPU_TYPE, __bpf_percpu_align and bpf_percpu macros seem to
create just another opaque layer. It actually seems detrimental to me.

I'd rather emphasize in the documentation (e.g., in
bpf_map_lookup_elem) that all per-cpu maps are aligning values at 8
bytes, so user has to make sure that array of values provided to
bpf_map_lookup_elem() has each element size rounded up to 8.

In practice, I'd recommend users to always use __u64/__s64 when having
primitive integers in a map (they are not saving anything by using
int, it just creates an illusion of savings). Well, maybe on 32-bit
arches they would save a bit of CPU, but not on typical 64-bit
architectures. As for using structs as values, always mark them as
__attribute__((aligned(8))).

Basically, instead of obscuring the real use some more, let's clarify
and maybe even provide some examples in documentation?

>
> Now that both array and hash maps have support for batched ops in the
> percpu variant, let's provide a convenient macro to declare percpu map
> value types.
>
> Updates the tests to a "reference" usage of the new macro.
>
> Signed-off-by: Pedro Tammela 
> ---
>  tools/lib/bpf/bpf.h   | 10 
>  tools/testing/selftests/bpf/bpf_util.h|  7 ---
>  .../bpf/map_tests/htab_map_batch_ops.c| 48 ++-
>  .../selftests/bpf/prog_tests/map_init.c   |  5 +-
>  tools/testing/selftests/bpf/test_maps.c   | 16 ---
>  5 files changed, 46 insertions(+), 40 deletions(-)
>

[...]

> @@ -400,11 +402,11 @@ static void test_arraymap(unsigned int task, void *data)
>  static void test_arraymap_percpu(unsigned int task, void *data)
>  {
> unsigned int nr_cpus = bpf_num_possible_cpus();
> -   BPF_DECLARE_PERCPU(long, values);
> +   pcpu_map_value_t values[nr_cpus];
> int key, next_key, fd, i;
>
> fd = bpf_create_map(BPF_MAP_TYPE_PERCPU_ARRAY, sizeof(key),
> -   sizeof(bpf_percpu(values, 0)), 2, 0);
> +   sizeof(long), 2, 0);
> if (fd < 0) {
> printf("Failed to create arraymap '%s'!\n", strerror(errno));
> exit(1);
> @@ -459,7 +461,7 @@ static void test_arraymap_percpu(unsigned int task, void *data)
>  static void test_arraymap_percpu_many_keys(void)
>  {
> unsigned int nr_cpus = bpf_num_possible_cpus();

This just sets a bad example for anyone using selftests as an
aspiration for their own code. bpf_num_possible_cpus() does exit(1)
internally if libbpf_num_possible_cpus() returns error. No one should
write real production code like that. So maybe let's provide a better
example instead with error handling and malloc (or perhaps alloca)?
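
A sketch of what such a better example could look like in user space (names
are illustrative); the point is handling the libbpf_num_possible_cpus() error
instead of exiting, and allocating nr_cpus 8-byte elements on the heap
(alloca() would work just as well):

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

int read_percpu_longs(int map_fd, int key)
{
	int nr_cpus = libbpf_num_possible_cpus();

	if (nr_cpus < 0) {
		fprintf(stderr, "failed to get number of possible CPUs: %d\n",
			nr_cpus);
		return nr_cpus;          /* report the error, don't exit(1) */
	}

	/* per-CPU values are 8-byte extended, so use a 64-bit element type */
	__u64 *values = calloc(nr_cpus, sizeof(*values));
	if (!values)
		return -ENOMEM;

	int err = bpf_map_lookup_elem(map_fd, &key, values);
	if (!err) {
		for (int cpu = 0; cpu < nr_cpus; cpu++)
			printf("cpu%d: %llu\n", cpu,
			       (unsigned long long)values[cpu]);
	}
	free(values);
	return err;
}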

> -   BPF_DECLARE_PERCPU(long, values);
> +   pcpu_map_value_t values[nr_cpus];
> /* nr_keys is not too large otherwise the test stresses percpu
>  * allocator more than anything else
>  */
> @@ -467,7 +469,7 @@ static void test_arraymap_percpu_many_keys(void)
> int key, fd, i;
>
> fd = bpf_create_map(BPF_MAP_TYPE_PERCPU_ARRAY, sizeof(key),
> -   sizeof(bpf_percpu(values, 0)), nr_keys, 0);
> +   sizeof(long), nr_keys, 0);
> if (fd < 0) {
> printf("Failed to create per-cpu arraymap '%s'!\n",
>strerror(errno));
> --
> 2.25.1
>

