Re: [PATCHv4 bpf-next 0/7] uprobe: uretprobe speed up
On Thu, May 2, 2024 at 5:23 AM Jiri Olsa wrote:
>
> hi,
> as part of the effort on speeding up the uprobes [0], this patchset
> comes with a return uprobe optimization that uses a syscall instead
> of the trap on the uretprobe trampoline.
>
> The speed up depends on the instruction type the uprobe is installed
> on and on the specific HW type, please check patch 1 for details.
>
> Patches 1-6 are based on bpf-next/master, but patches 1 and 2 are
> applicable on the linux-trace.git tree, probes/for-next branch.
> Patch 7 is based on man-pages master.
>
> v4 changes:
> - added acks [Oleg,Andrii,Masami]
> - reworded the man page and added more info to the NOTE section [Masami]
> - rewrote bpf tests not to use trace_pipe [Andrii]
> - cc-ed linux-man list
>
> Also available at:
> https://git.kernel.org/pub/scm/linux/kernel/git/jolsa/perf.git
> uretprobe_syscall

It looks great to me, thanks! Unfortunately the BPF CI build is broken [0], probably due to some of the Makefile additions. Please investigate and fix (or we'll need to fix something on the BPF CI side); it looks like you'll need another revision, unfortunately.

pw-bot: cr

[0] https://github.com/kernel-patches/bpf/actions/runs/8923849088/job/24509002194

But while we are at it: Masami, Oleg, what should be the logistics of landing this? Can/should we route this through the bpf-next tree, given there are lots of BPF-based selftests? Or do you want to take this through linux-trace/probes/for-next? In the latter case, it's probably better to apply only the first two patches to probes/for-next, and the rest should still go through the bpf-next tree (otherwise we are running into conflicts in BPF selftests).

Previously we were handling such cross-tree dependencies by creating a named branch or tag and merging it into bpf-next (so that all SHAs are preserved). It's a bunch of extra work for everyone involved, so the simplest way would be to just land through bpf-next, of course. But let me know your preferences. Thanks!
> thanks,
> jirka
>
> Notes to check list items in Documentation/process/adding-syscalls.rst:
>
> - System Call Alternatives
>   New syscall seems like the best way in here, becase we need

typo (thanks, Gmail): because

>   just to quickly enter kernel with no extra arguments processing,
>   which we'd need to do if we decided to use another syscall.
>
> - Designing the API: Planning for Extension
>   The uretprobe syscall is very specific and most likely won't be
>   extended in the future.
>
>   At the moment it does not take any arguments and even if it does
>   in future, it's allowed to be called only from trampoline prepared
>   by kernel, so there'll be no broken user.
>
> - Designing the API: Other Considerations
>   N/A because uretprobe syscall does not return reference to kernel
>   object.
>
> - Proposing the API
>   Wiring up of the uretprobe system call si in separate change,

typo: is

>   selftests and man page changes are part of the patchset.
>
> - Generic System Call Implementation
>   There's no CONFIG option for the new functionality because it
>   keeps the same behaviour from the user POV.
>
> - x86 System Call Implementation
>   It's 64-bit syscall only.
>
> - Compatibility System Calls (Generic)
>   N/A uretprobe syscall has no arguments and is not supported
>   for compat processes.
>
> - Compatibility System Calls (x86)
>   N/A uretprobe syscall is not supported for compat processes.
>
> - System Calls Returning Elsewhere
>   N/A.
>
> - Other Details
>   N/A.
>
> - Testing
>   Adding new bpf selftests and ran ltp on top of this change.
>
> - Man Page
>   Attached.
>
> - Do not call System Calls in the Kernel
>   N/A.
>
> [0] https://lore.kernel.org/bpf/ZeCXHKJ--iYYbmLj@krava/
> ---
> Jiri Olsa (6):
>       uprobe: Wire up uretprobe system call
>       uprobe: Add uretprobe syscall to speed up return probe
>       selftests/bpf: Add uretprobe syscall test for regs integrity
>       selftests/bpf: Add uretprobe syscall test for regs changes
>       selftests/bpf: Add uretprobe syscall call from user space test
>       selftests/bpf: Add uretprobe compat test
>
>  arch/x86/entry/syscalls/syscall_64.tbl                |   1 +
>  arch/x86/kernel/uprobes.c                             | 115
>  include/linux/syscalls.h                              |   2 +
>  include/linux/uprobes.h                               |   3 +
>  include/uapi/asm-generic/unistd.h                     |   5 +-
>  kernel/events/uprobes.c                               |  24 --
>  kernel/sys_ni.c                                       |   2 +
>  tools/include/linux/compiler.h                        |   4 +
>  tools/testing/selftests/bpf/.gitignore                |   1 +
>  tools/testing/selftests/bpf/Makefile                  |   7 +-
>  tools/testing/selftests/bpf/bpf_testmod/bpf_testmod.c | 123 -
>
Re: [PATCHv4 bpf-next 6/7] selftests/bpf: Add uretprobe compat test
On Thu, May 2, 2024 at 5:24 AM Jiri Olsa wrote: > > Adding test that adds return uprobe inside 32-bit task > and verify the return uprobe and attached bpf programs > get properly executed. > > Reviewed-by: Masami Hiramatsu (Google) > Signed-off-by: Jiri Olsa > --- > tools/testing/selftests/bpf/.gitignore| 1 + > tools/testing/selftests/bpf/Makefile | 7 ++- > .../selftests/bpf/prog_tests/uprobe_syscall.c | 60 +++ > 3 files changed, 67 insertions(+), 1 deletion(-) > > diff --git a/tools/testing/selftests/bpf/.gitignore > b/tools/testing/selftests/bpf/.gitignore > index f1aebabfb017..69d71223c0dd 100644 > --- a/tools/testing/selftests/bpf/.gitignore > +++ b/tools/testing/selftests/bpf/.gitignore > @@ -45,6 +45,7 @@ test_cpp > /veristat > /sign-file > /uprobe_multi > +/uprobe_compat > *.ko > *.tmp > xskxceiver > diff --git a/tools/testing/selftests/bpf/Makefile > b/tools/testing/selftests/bpf/Makefile > index 82247aeef857..a94352162290 100644 > --- a/tools/testing/selftests/bpf/Makefile > +++ b/tools/testing/selftests/bpf/Makefile > @@ -133,7 +133,7 @@ TEST_GEN_PROGS_EXTENDED = test_sock_addr > test_skb_cgroup_id_user \ > xskxceiver xdp_redirect_multi xdp_synproxy veristat xdp_hw_metadata \ > xdp_features bpf_test_no_cfi.ko > > -TEST_GEN_FILES += liburandom_read.so urandom_read sign-file uprobe_multi > +TEST_GEN_FILES += liburandom_read.so urandom_read sign-file uprobe_multi > uprobe_compat > > ifneq ($(V),1) > submake_extras := feature_display=0 > @@ -631,6 +631,7 @@ TRUNNER_EXTRA_FILES := $(OUTPUT)/urandom_read > $(OUTPUT)/bpf_testmod.ko \ >$(OUTPUT)/xdp_synproxy \ >$(OUTPUT)/sign-file \ >$(OUTPUT)/uprobe_multi \ > + $(OUTPUT)/uprobe_compat \ >ima_setup.sh \ >verify_sig_setup.sh \ >$(wildcard progs/btf_dump_test_case_*.c) \ > @@ -752,6 +753,10 @@ $(OUTPUT)/uprobe_multi: uprobe_multi.c > $(call msg,BINARY,,$@) > $(Q)$(CC) $(CFLAGS) -O0 $(LDFLAGS) $^ $(LDLIBS) -o $@ > > +$(OUTPUT)/uprobe_compat: > + $(call msg,BINARY,,$@) > + $(Q)echo "int main() { return 0; }" | $(CC) 
$(CFLAGS) -xc -m32 -O0 - > -o $@ > + > EXTRA_CLEAN := $(SCRATCH_DIR) $(HOST_SCRATCH_DIR) \ > prog_tests/tests.h map_tests/tests.h verifier/tests.h \ > feature bpftool \ > diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c > b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c > index c6fdb8c59ea3..bfea9a0368a4 100644 > --- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c > +++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c > @@ -5,6 +5,7 @@ > #ifdef __x86_64__ > > #include > +#include > #include > #include > #include > @@ -297,6 +298,58 @@ static void test_uretprobe_syscall_call(void) > close(go[1]); > close(go[0]); > } > + > +static void test_uretprobe_compat(void) > +{ > + LIBBPF_OPTS(bpf_uprobe_multi_opts, opts, > + .retprobe = true, > + ); > + struct uprobe_syscall_executed *skel; > + int err, go[2], pid, c, status; > + > + if (pipe(go)) > + return; ASSERT_OK() missing, like in the previous patch Thanks for switching to pipe() + global variable instead of using trace_pipe. Acked-by: Andrii Nakryiko > + > + skel = uprobe_syscall_executed__open_and_load(); > + if (!ASSERT_OK_PTR(skel, "uprobe_syscall_executed__open_and_load")) > + goto cleanup; > + [...]
Re: [PATCHv4 bpf-next 5/7] selftests/bpf: Add uretprobe syscall call from user space test
On Thu, May 2, 2024 at 5:24 AM Jiri Olsa wrote: > > Adding test to verify that when called from outside of the > trampoline provided by kernel, the uretprobe syscall will cause > calling process to receive SIGILL signal and the attached bpf > program is not executed. > > Reviewed-by: Masami Hiramatsu (Google) > Signed-off-by: Jiri Olsa > --- > .../selftests/bpf/prog_tests/uprobe_syscall.c | 95 +++ > .../bpf/progs/uprobe_syscall_executed.c | 17 > 2 files changed, 112 insertions(+) > create mode 100644 > tools/testing/selftests/bpf/progs/uprobe_syscall_executed.c > > diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c > b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c > index 1a50cd35205d..c6fdb8c59ea3 100644 > --- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c > +++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c > @@ -7,7 +7,10 @@ > #include > #include > #include > +#include > +#include > #include "uprobe_syscall.skel.h" > +#include "uprobe_syscall_executed.skel.h" > > __naked unsigned long uretprobe_regs_trigger(void) > { > @@ -209,6 +212,91 @@ static void test_uretprobe_regs_change(void) > } > } > > +#ifndef __NR_uretprobe > +#define __NR_uretprobe 462 > +#endif > + > +__naked unsigned long uretprobe_syscall_call_1(void) > +{ > + /* > +* Pretend we are uretprobe trampoline to trigger the return > +* probe invocation in order to verify we get SIGILL. 
> +*/ > + asm volatile ( > + "pushq %rax\n" > + "pushq %rcx\n" > + "pushq %r11\n" > + "movq $" __stringify(__NR_uretprobe) ", %rax\n" > + "syscall\n" > + "popq %r11\n" > + "popq %rcx\n" > + "retq\n" > + ); > +} > + > +__naked unsigned long uretprobe_syscall_call(void) > +{ > + asm volatile ( > + "call uretprobe_syscall_call_1\n" > + "retq\n" > + ); > +} > + > +static void test_uretprobe_syscall_call(void) > +{ > + LIBBPF_OPTS(bpf_uprobe_multi_opts, opts, > + .retprobe = true, > + ); > + struct uprobe_syscall_executed *skel; > + int pid, status, err, go[2], c; > + > + if (pipe(go)) > + return; very unlikely to fail, but still, ASSERT_OK() would be in order here But regardless: Acked-by: Andrii Nakryiko [...]
Re: [PATCH v9 00/36] tracing: fprobe: function_graph: Multi-function graph and fprobe on fgraph
On Tue, Apr 30, 2024 at 6:32 AM Masami Hiramatsu wrote:
>
> On Mon, 29 Apr 2024 13:25:04 -0700
> Andrii Nakryiko wrote:
>
> > On Mon, Apr 29, 2024 at 6:51 AM Masami Hiramatsu wrote:
> > >
> > > Hi Andrii,
> > >
> > > On Thu, 25 Apr 2024 13:31:53 -0700
> > > Andrii Nakryiko wrote:
> > >
> > > > Hey Masami,
> > > >
> > > > I can't really review most of that code as I'm completely unfamiliar
> > > > with all those inner workings of fprobe/ftrace/function_graph. I left
> > > > a few comments where there were somewhat more obvious BPF-related
> > > > pieces.
> > > >
> > > > But I also did run our BPF benchmarks on probes/for-next as a baseline
> > > > and then with your series applied on top. Just to see if there are any
> > > > regressions. I think it will be a useful data point for you.
> > >
> > > Thanks for testing!
> > >
> > > > You should be already familiar with the bench tool we have in BPF
> > > > selftests (I used it on some other patches for your tree).
> > >
> > > What patches we need?
> > >
> >
> > You mean for this `bench` tool? They are part of BPF selftests (under
> > tools/testing/selftests/bpf), you can build them by running:
> >
> > $ make RELEASE=1 -j$(nproc) bench
> >
> > After that you'll get a self-contained `bench` binary, which has all
> > the benchmarks.
> >
> > You might also find a small script (benchs/run_bench_trigger.sh inside
> > BPF selftests directory) helpful, it collects final summary of the
> > benchmark run and optionally accepts a specific set of benchmarks. So
> > you can use it like this:
> >
> > $ benchs/run_bench_trigger.sh kprobe kprobe-multi
> > kprobe : 18.731 ± 0.639M/s
> > kprobe-multi : 23.938 ± 0.612M/s
> >
> > By default it will run a wider set of benchmarks (no uprobes, but a
> > bunch of extra fentry/fexit tests and stuff like this).
> origin:
>
> # benchs/run_bench_trigger.sh
> kretprobe      : 1.329 ± 0.007M/s
> kretprobe-multi: 1.341 ± 0.004M/s
> # benchs/run_bench_trigger.sh
> kretprobe      : 1.288 ± 0.014M/s
> kretprobe-multi: 1.365 ± 0.002M/s
> # benchs/run_bench_trigger.sh
> kretprobe      : 1.329 ± 0.002M/s
> kretprobe-multi: 1.331 ± 0.011M/s
> # benchs/run_bench_trigger.sh
> kretprobe      : 1.311 ± 0.003M/s
> kretprobe-multi: 1.318 ± 0.002M/s
>
> patched:
>
> # benchs/run_bench_trigger.sh
> kretprobe      : 1.274 ± 0.003M/s
> kretprobe-multi: 1.397 ± 0.002M/s
> # benchs/run_bench_trigger.sh
> kretprobe      : 1.307 ± 0.002M/s
> kretprobe-multi: 1.406 ± 0.004M/s
> # benchs/run_bench_trigger.sh
> kretprobe      : 1.279 ± 0.004M/s
> kretprobe-multi: 1.330 ± 0.014M/s
> # benchs/run_bench_trigger.sh
> kretprobe      : 1.256 ± 0.010M/s
> kretprobe-multi: 1.412 ± 0.003M/s
>
> Hmm, in my case, it seems smaller differences (~3%?).
> I attached perf report results for those, but I don't see large difference.

I ran my benchmarks on bare metal machine (and quite powerful at that, you can see my numbers are almost 10x of yours), with mitigations disabled, no retpolines, etc. If you have any of those mitigations it might result in smaller differences, probably. If you are running inside QEMU/VM, the results might differ significantly as well.

> > > > BASELINE
> > > > ========
> > > > kprobe         : 24.634 ± 0.205M/s
> > > > kprobe-multi   : 28.898 ± 0.531M/s
> > > > kretprobe      : 10.478 ± 0.015M/s
> > > > kretprobe-multi: 11.012 ± 0.063M/s
> > > >
> > > > THIS PATCH SET ON TOP
> > > > =====================
> > > > kprobe         : 25.144 ± 0.027M/s (+2%)
> > > > kprobe-multi   : 28.909 ± 0.074M/s
> > > > kretprobe      :  9.482 ± 0.008M/s (-9.5%)
> > > > kretprobe-multi: 13.688 ± 0.027M/s (+24%)
> > >
> > > This looks good. Kretprobe should also use kretprobe-multi (fprobe)
> > > eventually because it should be a single callback version of
> > > kretprobe-multi.
> I ran another benchmark (prctl loop, attached), the origin kernel result is
> here;
>
> # sh ./benchmark.sh
> count = 1000, took 6.748133 sec
>
> And the patched kernel result;
>
> # sh ./benchmark.sh
> count = 1000, took 6.644095 sec
>
> I confirmed that the perf result has no big difference.
>
> Thank you,
Re: [PATCH RFC] rethook: inline arch_rethook_trampoline_callback() in assembly code
On Wed, Apr 24, 2024 at 5:02 PM Andrii Nakryiko wrote: > > At the lowest level, rethook-based kretprobes on x86-64 architecture go > through arch_rethook_trampoline() function, manually written in > assembly, which calls into a simple arch_rethook_trampoline_callback() > function, written in C, and only doing a few straightforward field > assignments, before calling further into rethook_trampoline_handler(), > which handles kretprobe callbacks generically. > > Looking at simplicity of arch_rethook_trampoline_callback(), it seems > not really worthwhile to spend an extra function call just to do 4 or > 5 assignments. As such, this patch proposes to "inline" > arch_rethook_trampoline_callback() into arch_rethook_trampoline() by > manually implementing it in assembly code. > > This has two motivations. First, we do get a bit of runtime speed up by > avoiding function calls. Using BPF selftests' bench tool, we see > 0.6%-0.8% throughput improvement for kretprobe/multi-kretprobe > triggering code path: > > BEFORE (latest probes/for-next) > === > kretprobe : 10.455 ± 0.024M/s > kretprobe-multi: 11.150 ± 0.012M/s > > AFTER (probes/for-next + this patch) > > kretprobe : 10.540 ± 0.009M/s (+0.8%) > kretprobe-multi: 11.219 ± 0.042M/s (+0.6%) > > Second, and no less importantly for some specialized use cases, this > avoids unnecessarily "polluting" LBR records with an extra function call > (recorded as a jump by CPU). This is the case for the retsnoop ([0]) > tool, which relies heavily on capturing LBR records to provide users with > lots of insight into kernel internals. > > This RFC patch is only inlining this function for x86-64, but it's > possible to do that for 32-bit x86 arch as well and then remove > arch_rethook_trampoline_callback() implementation altogether. Please let > me know if this change is acceptable and whether I should complete it > with 32-bit "inlining" as well. Thanks! 
> > [0] > https://nakryiko.com/posts/retsnoop-intro/#peering-deep-into-functions-with-lbr > > Signed-off-by: Andrii Nakryiko > --- > arch/x86/kernel/asm-offsets_64.c | 4 > arch/x86/kernel/rethook.c| 37 +++- > 2 files changed, 36 insertions(+), 5 deletions(-) > > diff --git a/arch/x86/kernel/asm-offsets_64.c > b/arch/x86/kernel/asm-offsets_64.c > index bb65371ea9df..5c444abc540c 100644 > --- a/arch/x86/kernel/asm-offsets_64.c > +++ b/arch/x86/kernel/asm-offsets_64.c > @@ -42,6 +42,10 @@ int main(void) > ENTRY(r14); > ENTRY(r15); > ENTRY(flags); > + ENTRY(ip); > + ENTRY(cs); > + ENTRY(ss); > + ENTRY(orig_ax); > BLANK(); > #undef ENTRY > > diff --git a/arch/x86/kernel/rethook.c b/arch/x86/kernel/rethook.c > index 8a1c0111ae79..3e1c01beebd1 100644 > --- a/arch/x86/kernel/rethook.c > +++ b/arch/x86/kernel/rethook.c > @@ -6,6 +6,7 @@ > #include > #include > #include > +#include > > #include "kprobes/common.h" > > @@ -34,10 +35,36 @@ asm( > " pushq %rsp\n" > " pushfq\n" > SAVE_REGS_STRING > - " movq %rsp, %rdi\n" > - " call arch_rethook_trampoline_callback\n" > + " movq %rsp, %rdi\n" /* $rdi points to regs */ > + /* fixup registers */ > + /* regs->cs = __KERNEL_CS; */ > + " movq $" __stringify(__KERNEL_CS) ", " __stringify(pt_regs_cs) > "(%rdi)\n" > + /* regs->ip = (unsigned long)_rethook_trampoline; */ > + " movq $arch_rethook_trampoline, " __stringify(pt_regs_ip) > "(%rdi)\n" > + /* regs->orig_ax = ~0UL; */ > + " movq $0x, " __stringify(pt_regs_orig_ax) > "(%rdi)\n" > + /* regs->sp += 2*sizeof(long); */ > + " addq $16, " __stringify(pt_regs_sp) "(%rdi)\n" > + /* 2nd arg is frame_pointer = (long *)(regs + 1); */ > + " lea " __stringify(PTREGS_SIZE) "(%rdi), %rsi\n" BTW, all this __stringify() ugliness can be avoided if we move this assembly into its own .S file, like lots of other assembly functions in arch/x86/kernel subdir. That has another benefit of generating better line information in DWARF for those assembly instructions. 
It's lots more work, so before I do this, I'd like to get confirmation that this change is acceptable in principle. > + /* > +* The return address at 'frame_pointer' is recovered by the > +* a
Re: [PATCH v9 00/36] tracing: fprobe: function_graph: Multi-function graph and fprobe on fgraph
On Sun, Apr 28, 2024 at 4:25 PM Steven Rostedt wrote: > > On Thu, 25 Apr 2024 13:31:53 -0700 > Andrii Nakryiko wrote: > > I'm just coming back from Japan (work and then a vacation), and > catching up on my email during the 6 hour layover in Detroit. > > > Hey Masami, > > > > I can't really review most of that code as I'm completely unfamiliar > > with all those inner workings of fprobe/ftrace/function_graph. I left > > a few comments where there were somewhat more obvious BPF-related > > pieces. > > > > But I also did run our BPF benchmarks on probes/for-next as a baseline > > and then with your series applied on top. Just to see if there are any > > regressions. I think it will be a useful data point for you. > > > > You should be already familiar with the bench tool we have in BPF > > selftests (I used it on some other patches for your tree). > > I should get familiar with your tools too. > It's a nifty and self-contained tool to do some micro-benchmarking, I replied to Masami with a few details on how to build and use it. > > > > BASELINE > > > > kprobe : 24.634 ± 0.205M/s > > kprobe-multi : 28.898 ± 0.531M/s > > kretprobe : 10.478 ± 0.015M/s > > kretprobe-multi: 11.012 ± 0.063M/s > > > > THIS PATCH SET ON TOP > > = > > kprobe : 25.144 ± 0.027M/s (+2%) > > kprobe-multi : 28.909 ± 0.074M/s > > kretprobe :9.482 ± 0.008M/s (-9.5%) > > kretprobe-multi: 13.688 ± 0.027M/s (+24%) > > > > These numbers are pretty stable and look to be more or less representative. > > Thanks for running this. > > > > > As you can see, kprobes got a bit faster, kprobe-multi seems to be > > about the same, though. > > > > Then (I suppose they are "legacy") kretprobes got quite noticeably > > slower, almost by 10%. Not sure why, but looks real after re-running > > benchmarks a bunch of times and getting stable results. > > > > On the other hand, multi-kretprobes got significantly faster (+24%!). > > Again, I don't know if it is expected or not, but it's a nice > > improvement. 
> > > > If you have any idea why kretprobes would get so much slower, it would > > be nice to look into that and see if you can mitigate the regression > > somehow. Thanks! > > My guess is that this patch set helps generic use cases for tracing the > return of functions, but will likely add more overhead for single use > cases. That is, kretprobe is made to be specific for a single function, > but kretprobe-multi is more generic. Hence the generic version will > improve at the sacrifice of the specific function. I did expect as much. > > That said, I think there's probably a lot of low hanging fruit that can > be done to this series to help improve the kretprobe performance. I'm > not sure we can get back to the baseline, but I'm hoping we can at > least make it much better than that 10% slowdown. That would certainly be appreciated, thanks! But I'm also considering trying to switch to multi-kprobe/kretprobe automatically on libbpf side, whenever possible, so that users can get the best performance. There might still be situations where this can't be done, so singular kprobe/kretprobe can't be completely deprecated, but multi variants seems to be universally faster, so I'm going to make them a default (I need to handle some backwards compat aspect, but that's libbpf-specific stuff you shouldn't be concerned with). > > I'll be reviewing this patch set this week as I recover from jetlag. > > -- Steve
Re: [PATCH v9 00/36] tracing: fprobe: function_graph: Multi-function graph and fprobe on fgraph
On Mon, Apr 29, 2024 at 6:51 AM Masami Hiramatsu wrote: > > Hi Andrii, > > On Thu, 25 Apr 2024 13:31:53 -0700 > Andrii Nakryiko wrote: > > > Hey Masami, > > > > I can't really review most of that code as I'm completely unfamiliar > > with all those inner workings of fprobe/ftrace/function_graph. I left > > a few comments where there were somewhat more obvious BPF-related > > pieces. > > > > But I also did run our BPF benchmarks on probes/for-next as a baseline > > and then with your series applied on top. Just to see if there are any > > regressions. I think it will be a useful data point for you. > > Thanks for testing! > > > > > You should be already familiar with the bench tool we have in BPF > > selftests (I used it on some other patches for your tree). > > What patches we need? > You mean for this `bench` tool? They are part of BPF selftests (under tools/testing/selftests/bpf), you can build them by running: $ make RELEASE=1 -j$(nproc) bench After that you'll get a self-contained `bench` binary, which has all the benchmarks. You might also find a small script (benchs/run_bench_trigger.sh inside BPF selftests directory) helpful, it collects final summary of the benchmark run and optionally accepts a specific set of benchmarks. So you can use it like this: $ benchs/run_bench_trigger.sh kprobe kprobe-multi kprobe : 18.731 ± 0.639M/s kprobe-multi : 23.938 ± 0.612M/s By default it will run a wider set of benchmarks (no uprobes, but a bunch of extra fentry/fexit tests and stuff like this). > > > > BASELINE > > > > kprobe : 24.634 ± 0.205M/s > > kprobe-multi : 28.898 ± 0.531M/s > > kretprobe : 10.478 ± 0.015M/s > > kretprobe-multi: 11.012 ± 0.063M/s > > > > THIS PATCH SET ON TOP > > = > > kprobe : 25.144 ± 0.027M/s (+2%) > > kprobe-multi : 28.909 ± 0.074M/s > > kretprobe :9.482 ± 0.008M/s (-9.5%) > > kretprobe-multi: 13.688 ± 0.027M/s (+24%) > > This looks good. 
Kretprobe should also use kretprobe-multi (fprobe) > eventually because it should be a single callback version of > kretprobe-multi. > > > > > These numbers are pretty stable and look to be more or less representative. > > > > As you can see, kprobes got a bit faster, kprobe-multi seems to be > > about the same, though. > > > > Then (I suppose they are "legacy") kretprobes got quite noticeably > > slower, almost by 10%. Not sure why, but looks real after re-running > > benchmarks a bunch of times and getting stable results. > > Hmm, kretprobe on x86 should use ftrace + rethook even with my series. > So nothing should be changed. Maybe cache access pattern has been > changed? > I'll check it with tracefs (to remove the effect from bpf related changes) > > > > > On the other hand, multi-kretprobes got significantly faster (+24%!). > > Again, I don't know if it is expected or not, but it's a nice > > improvement. > > Thanks! > > > > > If you have any idea why kretprobes would get so much slower, it would > > be nice to look into that and see if you can mitigate the regression > > somehow. Thanks! > > OK, let me check it. > > Thank you! > > > > > > > > 51 files changed, 2325 insertions(+), 882 deletions(-) > > > create mode 100644 > > > tools/testing/selftests/ftrace/test.d/dynevent/add_remove_fprobe_repeat.tc > > > > > > -- > > > Masami Hiramatsu (Google) > > > > > > -- > Masami Hiramatsu (Google)
Re: [PATCHv3 bpf-next 6/7] selftests/bpf: Add uretprobe compat test
On Mon, Apr 29, 2024 at 12:39 AM Jiri Olsa wrote: > > On Fri, Apr 26, 2024 at 11:06:53AM -0700, Andrii Nakryiko wrote: > > On Sun, Apr 21, 2024 at 12:43 PM Jiri Olsa wrote: > > > > > > Adding test that adds return uprobe inside 32 bit task > > > and verify the return uprobe and attached bpf programs > > > get properly executed. > > > > > > Signed-off-by: Jiri Olsa > > > --- > > > tools/testing/selftests/bpf/.gitignore| 1 + > > > tools/testing/selftests/bpf/Makefile | 6 ++- > > > .../selftests/bpf/prog_tests/uprobe_syscall.c | 40 +++ > > > .../bpf/progs/uprobe_syscall_compat.c | 13 ++ > > > 4 files changed, 59 insertions(+), 1 deletion(-) > > > create mode 100644 > > > tools/testing/selftests/bpf/progs/uprobe_syscall_compat.c > > > > > > diff --git a/tools/testing/selftests/bpf/.gitignore > > > b/tools/testing/selftests/bpf/.gitignore > > > index f1aebabfb017..69d71223c0dd 100644 > > > --- a/tools/testing/selftests/bpf/.gitignore > > > +++ b/tools/testing/selftests/bpf/.gitignore > > > @@ -45,6 +45,7 @@ test_cpp > > > /veristat > > > /sign-file > > > /uprobe_multi > > > +/uprobe_compat > > > *.ko > > > *.tmp > > > xskxceiver > > > diff --git a/tools/testing/selftests/bpf/Makefile > > > b/tools/testing/selftests/bpf/Makefile > > > index edc73f8f5aef..d170b63eca62 100644 > > > --- a/tools/testing/selftests/bpf/Makefile > > > +++ b/tools/testing/selftests/bpf/Makefile > > > @@ -134,7 +134,7 @@ TEST_GEN_PROGS_EXTENDED = test_sock_addr > > > test_skb_cgroup_id_user \ > > > xskxceiver xdp_redirect_multi xdp_synproxy veristat > > > xdp_hw_metadata \ > > > xdp_features bpf_test_no_cfi.ko > > > > > > -TEST_GEN_FILES += liburandom_read.so urandom_read sign-file uprobe_multi > > > +TEST_GEN_FILES += liburandom_read.so urandom_read sign-file uprobe_multi > > > uprobe_compat > > > > you need to add uprobe_compat to TRUNNER_EXTRA_FILES as well, no? 
> > ah right > > > > diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c > > > b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c > > > index 9233210a4c33..3770254d893b 100644 > > > --- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c > > > +++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c > > > @@ -11,6 +11,7 @@ > > > #include > > > #include "uprobe_syscall.skel.h" > > > #include "uprobe_syscall_call.skel.h" > > > +#include "uprobe_syscall_compat.skel.h" > > > > > > __naked unsigned long uretprobe_regs_trigger(void) > > > { > > > @@ -291,6 +292,35 @@ static void test_uretprobe_syscall_call(void) > > > "read_trace_pipe_iter"); > > > ASSERT_EQ(found, 0, "found"); > > > } > > > + > > > +static void trace_pipe_compat_cb(const char *str, void *data) > > > +{ > > > + if (strstr(str, "uretprobe compat") != NULL) > > > + (*(int *)data)++; > > > +} > > > + > > > +static void test_uretprobe_compat(void) > > > +{ > > > + struct uprobe_syscall_compat *skel = NULL; > > > + int err, found = 0; > > > + > > > + skel = uprobe_syscall_compat__open_and_load(); > > > + if (!ASSERT_OK_PTR(skel, "uprobe_syscall_compat__open_and_load")) > > > + goto cleanup; > > > + > > > + err = uprobe_syscall_compat__attach(skel); > > > + if (!ASSERT_OK(err, "uprobe_syscall_compat__attach")) > > > + goto cleanup; > > > + > > > + system("./uprobe_compat"); > > > + > > > + ASSERT_OK(read_trace_pipe_iter(trace_pipe_compat_cb, , > > > 1000), > > > +"read_trace_pipe_iter"); > > > > why so complicated? can't you just set global variable that it was called > > hm, we execute separate uprobe_compat (32bit) process that triggers the bpf > program, so we can't use global variable.. using the trace_pipe was the only > thing that was easy to do you need child process to trigger uprobe, but you could have installed BPF program from parent process (you'd need to make child wait for parent to be ready, with normal pipe() like we do in other place
Re: [PATCHv3 bpf-next 5/7] selftests/bpf: Add uretprobe syscall call from user space test
On Mon, Apr 29, 2024 at 12:33 AM Jiri Olsa wrote: > > On Fri, Apr 26, 2024 at 11:03:29AM -0700, Andrii Nakryiko wrote: > > On Sun, Apr 21, 2024 at 12:43 PM Jiri Olsa wrote: > > > > > > Adding test to verify that when called from outside of the > > > trampoline provided by kernel, the uretprobe syscall will cause > > > calling process to receive SIGILL signal and the attached bpf > > > program is not executed. > > > > > > Signed-off-by: Jiri Olsa > > > --- > > > .../selftests/bpf/prog_tests/uprobe_syscall.c | 92 +++ > > > .../selftests/bpf/progs/uprobe_syscall_call.c | 15 +++ > > > 2 files changed, 107 insertions(+) > > > create mode 100644 > > > tools/testing/selftests/bpf/progs/uprobe_syscall_call.c > > > > > > > See nits below, but overall LGTM > > > > Acked-by: Andrii Nakryiko > > > > [...] > > > > > @@ -219,6 +301,11 @@ static void test_uretprobe_regs_change(void) > > > { > > > test__skip(); > > > } > > > + > > > +static void test_uretprobe_syscall_call(void) > > > +{ > > > + test__skip(); > > > +} > > > #endif > > > > > > void test_uprobe_syscall(void) > > > @@ -228,3 +315,8 @@ void test_uprobe_syscall(void) > > > if (test__start_subtest("uretprobe_regs_change")) > > > test_uretprobe_regs_change(); > > > } > > > + > > > +void serial_test_uprobe_syscall_call(void) > > > > does it need to be serial? non-serial are still run sequentially > > within a process (there is no multi-threading), it's more about some > > global effects on system. 
> > plz see below > > > > > > +{ > > > + test_uretprobe_syscall_call(); > > > +} > > > diff --git a/tools/testing/selftests/bpf/progs/uprobe_syscall_call.c > > > b/tools/testing/selftests/bpf/progs/uprobe_syscall_call.c > > > new file mode 100644 > > > index ..5ea03bb47198 > > > --- /dev/null > > > +++ b/tools/testing/selftests/bpf/progs/uprobe_syscall_call.c > > > @@ -0,0 +1,15 @@ > > > +// SPDX-License-Identifier: GPL-2.0 > > > +#include "vmlinux.h" > > > +#include > > > +#include > > > + > > > +struct pt_regs regs; > > > + > > > +char _license[] SEC("license") = "GPL"; > > > + > > > +SEC("uretprobe//proc/self/exe:uretprobe_syscall_call") > > > +int uretprobe(struct pt_regs *regs) > > > +{ > > > + bpf_printk("uretprobe called"); > > > > debugging leftover? we probably don't want to pollute trace_pipe from test > > the reason for this is to make sure the bpf program was not executed, > > the test makes sure the child gets killed with SIGILL and also that > the bpf program was not executed by checking the trace_pipe and > making sure nothing was received > > the trace_pipe reading is also why it's serial you could have attached BPF program from parent process and use a global variable (and thus eliminate all the trace_pipe system-wide dependency), but ok, it's fine by me the way this is done > > jirka > > > > > > + return 0; > > > +} > > > -- > > > 2.44.0 > > >
Re: [PATCHv3 bpf-next 6/7] selftests/bpf: Add uretprobe compat test
On Sun, Apr 21, 2024 at 12:43 PM Jiri Olsa wrote: > > Adding test that adds return uprobe inside 32 bit task > and verify the return uprobe and attached bpf programs > get properly executed. > > Signed-off-by: Jiri Olsa > --- > tools/testing/selftests/bpf/.gitignore| 1 + > tools/testing/selftests/bpf/Makefile | 6 ++- > .../selftests/bpf/prog_tests/uprobe_syscall.c | 40 +++ > .../bpf/progs/uprobe_syscall_compat.c | 13 ++ > 4 files changed, 59 insertions(+), 1 deletion(-) > create mode 100644 tools/testing/selftests/bpf/progs/uprobe_syscall_compat.c > > diff --git a/tools/testing/selftests/bpf/.gitignore > b/tools/testing/selftests/bpf/.gitignore > index f1aebabfb017..69d71223c0dd 100644 > --- a/tools/testing/selftests/bpf/.gitignore > +++ b/tools/testing/selftests/bpf/.gitignore > @@ -45,6 +45,7 @@ test_cpp > /veristat > /sign-file > /uprobe_multi > +/uprobe_compat > *.ko > *.tmp > xskxceiver > diff --git a/tools/testing/selftests/bpf/Makefile > b/tools/testing/selftests/bpf/Makefile > index edc73f8f5aef..d170b63eca62 100644 > --- a/tools/testing/selftests/bpf/Makefile > +++ b/tools/testing/selftests/bpf/Makefile > @@ -134,7 +134,7 @@ TEST_GEN_PROGS_EXTENDED = test_sock_addr > test_skb_cgroup_id_user \ > xskxceiver xdp_redirect_multi xdp_synproxy veristat xdp_hw_metadata \ > xdp_features bpf_test_no_cfi.ko > > -TEST_GEN_FILES += liburandom_read.so urandom_read sign-file uprobe_multi > +TEST_GEN_FILES += liburandom_read.so urandom_read sign-file uprobe_multi > uprobe_compat you need to add uprobe_compat to TRUNNER_EXTRA_FILES as well, no? 
> > # Emit succinct information message describing current building step > # $1 - generic step name (e.g., CC, LINK, etc); > @@ -761,6 +761,10 @@ $(OUTPUT)/uprobe_multi: uprobe_multi.c > $(call msg,BINARY,,$@) > $(Q)$(CC) $(CFLAGS) -O0 $(LDFLAGS) $^ $(LDLIBS) -o $@ > > +$(OUTPUT)/uprobe_compat: > + $(call msg,BINARY,,$@) > + $(Q)echo "int main() { return 0; }" | $(CC) $(CFLAGS) -xc -m32 -O0 - > -o $@ > + > EXTRA_CLEAN := $(SCRATCH_DIR) $(HOST_SCRATCH_DIR) \ > prog_tests/tests.h map_tests/tests.h verifier/tests.h \ > feature bpftool \ > diff --git a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c > b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c > index 9233210a4c33..3770254d893b 100644 > --- a/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c > +++ b/tools/testing/selftests/bpf/prog_tests/uprobe_syscall.c > @@ -11,6 +11,7 @@ > #include > #include "uprobe_syscall.skel.h" > #include "uprobe_syscall_call.skel.h" > +#include "uprobe_syscall_compat.skel.h" > > __naked unsigned long uretprobe_regs_trigger(void) > { > @@ -291,6 +292,35 @@ static void test_uretprobe_syscall_call(void) > "read_trace_pipe_iter"); > ASSERT_EQ(found, 0, "found"); > } > + > +static void trace_pipe_compat_cb(const char *str, void *data) > +{ > + if (strstr(str, "uretprobe compat") != NULL) > + (*(int *)data)++; > +} > + > +static void test_uretprobe_compat(void) > +{ > + struct uprobe_syscall_compat *skel = NULL; > + int err, found = 0; > + > + skel = uprobe_syscall_compat__open_and_load(); > + if (!ASSERT_OK_PTR(skel, "uprobe_syscall_compat__open_and_load")) > + goto cleanup; > + > + err = uprobe_syscall_compat__attach(skel); > + if (!ASSERT_OK(err, "uprobe_syscall_compat__attach")) > + goto cleanup; > + > + system("./uprobe_compat"); > + > + ASSERT_OK(read_trace_pipe_iter(trace_pipe_compat_cb, , 1000), > +"read_trace_pipe_iter"); why so complicated? 
can't you just set global variable that it was called > + ASSERT_EQ(found, 1, "found"); > + > +cleanup: > + uprobe_syscall_compat__destroy(skel); > +} > #else > static void test_uretprobe_regs_equal(void) > { > @@ -306,6 +336,11 @@ static void test_uretprobe_syscall_call(void) > { > test__skip(); > } > + > +static void test_uretprobe_compat(void) > +{ > + test__skip(); > +} > #endif > > void test_uprobe_syscall(void) > @@ -320,3 +355,8 @@ void serial_test_uprobe_syscall_call(void) > { > test_uretprobe_syscall_call(); > } > + > +void serial_test_uprobe_syscall_compat(void) and then no need for serial_test? > +{ > + test_uretprobe_compat(); > +} > diff --git a/tools/testing/selftests/bpf/progs/uprobe_syscall_compat.c > b/tools/testing/selftests/bpf/progs/uprobe_syscall_compat.c > new file mode 100644 > index ..f8adde7f08e2 > --- /dev/null > +++ b/tools/testing/selftests/bpf/progs/uprobe_syscall_compat.c > @@ -0,0 +1,13 @@ > +// SPDX-License-Identifier: GPL-2.0 > +#include > +#include > +#include > + > +char _license[] SEC("license") = "GPL"; > + > +SEC("uretprobe.multi/./uprobe_compat:main") > +int
Re: [PATCHv3 bpf-next 5/7] selftests/bpf: Add uretprobe syscall call from user space test
On Sun, Apr 21, 2024 at 12:43 PM Jiri Olsa wrote: > > Adding test to verify that when called from outside of the > trampoline provided by kernel, the uretprobe syscall will cause > calling process to receive SIGILL signal and the attached bpf > program is no executed. > > Signed-off-by: Jiri Olsa > --- > .../selftests/bpf/prog_tests/uprobe_syscall.c | 92 +++ > .../selftests/bpf/progs/uprobe_syscall_call.c | 15 +++ > 2 files changed, 107 insertions(+) > create mode 100644 tools/testing/selftests/bpf/progs/uprobe_syscall_call.c > See nits below, but overall LGTM Acked-by: Andrii Nakryiko [...] > @@ -219,6 +301,11 @@ static void test_uretprobe_regs_change(void) > { > test__skip(); > } > + > +static void test_uretprobe_syscall_call(void) > +{ > + test__skip(); > +} > #endif > > void test_uprobe_syscall(void) > @@ -228,3 +315,8 @@ void test_uprobe_syscall(void) > if (test__start_subtest("uretprobe_regs_change")) > test_uretprobe_regs_change(); > } > + > +void serial_test_uprobe_syscall_call(void) does it need to be serial? non-serial are still run sequentially within a process (there is no multi-threading), it's more about some global effects on system. > +{ > + test_uretprobe_syscall_call(); > +} > diff --git a/tools/testing/selftests/bpf/progs/uprobe_syscall_call.c > b/tools/testing/selftests/bpf/progs/uprobe_syscall_call.c > new file mode 100644 > index ..5ea03bb47198 > --- /dev/null > +++ b/tools/testing/selftests/bpf/progs/uprobe_syscall_call.c > @@ -0,0 +1,15 @@ > +// SPDX-License-Identifier: GPL-2.0 > +#include "vmlinux.h" > +#include > +#include > + > +struct pt_regs regs; > + > +char _license[] SEC("license") = "GPL"; > + > +SEC("uretprobe//proc/self/exe:uretprobe_syscall_call") > +int uretprobe(struct pt_regs *regs) > +{ > + bpf_printk("uretprobe called"); debugging leftover? we probably don't want to pollute trace_pipe from test > + return 0; > +} > -- > 2.44.0 >
Re: [PATCHv3 bpf-next 2/7] uprobe: Add uretprobe syscall to speed up return probe
On Sun, Apr 21, 2024 at 12:42 PM Jiri Olsa wrote: > > Adding uretprobe syscall instead of trap to speed up return probe. > > At the moment the uretprobe setup/path is: > > - install entry uprobe > > - when the uprobe is hit, it overwrites probed function's return address > on stack with address of the trampoline that contains breakpoint > instruction > > - the breakpoint trap code handles the uretprobe consumers execution and > jumps back to original return address > > This patch replaces the above trampoline's breakpoint instruction with new > ureprobe syscall call. This syscall does exactly the same job as the trap > with some more extra work: > > - syscall trampoline must save original value for rax/r11/rcx registers > on stack - rax is set to syscall number and r11/rcx are changed and > used by syscall instruction > > - the syscall code reads the original values of those registers and > restore those values in task's pt_regs area > > - only caller from trampoline exposed in '[uprobes]' is allowed, > the process will receive SIGILL signal otherwise > > Even with some extra work, using the uretprobes syscall shows speed > improvement (compared to using standard breakpoint): > > On Intel (11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz) > > current: > uretprobe-nop :1.498 ± 0.000M/s > uretprobe-push :1.448 ± 0.001M/s > uretprobe-ret :0.816 ± 0.001M/s > > with the fix: > uretprobe-nop :1.969 ± 0.002M/s < 31% speed up > uretprobe-push :1.910 ± 0.000M/s < 31% speed up > uretprobe-ret :0.934 ± 0.000M/s < 14% speed up > > On Amd (AMD Ryzen 7 5700U) > > current: > uretprobe-nop :0.778 ± 0.001M/s > uretprobe-push :0.744 ± 0.001M/s > uretprobe-ret :0.540 ± 0.001M/s > > with the fix: > uretprobe-nop :0.860 ± 0.001M/s < 10% speed up > uretprobe-push :0.818 ± 0.001M/s < 10% speed up > uretprobe-ret :0.578 ± 0.000M/s < 7% speed up > > The performance test spawns a thread that runs loop which triggers > uprobe with attached bpf program that increments the counter that > gets 
printed in results above. > > The uprobe (and uretprobe) kind is determined by which instruction > is being patched with breakpoint instruction. That's also important > for uretprobes, because uprobe is installed for each uretprobe. > > The performance test is part of bpf selftests: > tools/testing/selftests/bpf/run_bench_uprobes.sh > > Note at the moment uretprobe syscall is supported only for native > 64-bit process, compat process still uses standard breakpoint. > > Suggested-by: Andrii Nakryiko > Signed-off-by: Oleg Nesterov > Signed-off-by: Jiri Olsa > --- > arch/x86/kernel/uprobes.c | 115 ++ > include/linux/uprobes.h | 3 + > kernel/events/uprobes.c | 24 +--- > 3 files changed, 135 insertions(+), 7 deletions(-) > LGTM as far as I can follow the code Acked-by: Andrii Nakryiko [...]
Re: [PATCHv3 bpf-next 1/7] uprobe: Wire up uretprobe system call
On Sun, Apr 21, 2024 at 12:42 PM Jiri Olsa wrote: > > Wiring up uretprobe system call, which comes in following changes. > We need to do the wiring before, because the uretprobe implementation > needs the syscall number. > > Note at the moment uretprobe syscall is supported only for native > 64-bit process. > > Signed-off-by: Jiri Olsa > --- > arch/x86/entry/syscalls/syscall_64.tbl | 1 + > include/linux/syscalls.h | 2 ++ > include/uapi/asm-generic/unistd.h | 5 - > kernel/sys_ni.c| 2 ++ > 4 files changed, 9 insertions(+), 1 deletion(-) > LGTM Acked-by: Andrii Nakryiko > diff --git a/arch/x86/entry/syscalls/syscall_64.tbl > b/arch/x86/entry/syscalls/syscall_64.tbl > index 7e8d46f4147f..af0a33ab06ee 100644 > --- a/arch/x86/entry/syscalls/syscall_64.tbl > +++ b/arch/x86/entry/syscalls/syscall_64.tbl > @@ -383,6 +383,7 @@ > 459common lsm_get_self_attr sys_lsm_get_self_attr > 460common lsm_set_self_attr sys_lsm_set_self_attr > 461common lsm_list_modulessys_lsm_list_modules > +46264 uretprobe sys_uretprobe > > # > # Due to a historical design error, certain syscalls are numbered differently > diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h > index e619ac10cd23..5318e0e76799 100644 > --- a/include/linux/syscalls.h > +++ b/include/linux/syscalls.h > @@ -972,6 +972,8 @@ asmlinkage long sys_lsm_list_modules(u64 *ids, u32 *size, > u32 flags); > /* x86 */ > asmlinkage long sys_ioperm(unsigned long from, unsigned long num, int on); > > +asmlinkage long sys_uretprobe(void); > + > /* pciconfig: alpha, arm, arm64, ia64, sparc */ > asmlinkage long sys_pciconfig_read(unsigned long bus, unsigned long dfn, > unsigned long off, unsigned long len, > diff --git a/include/uapi/asm-generic/unistd.h > b/include/uapi/asm-generic/unistd.h > index 75f00965ab15..8a747cd1d735 100644 > --- a/include/uapi/asm-generic/unistd.h > +++ b/include/uapi/asm-generic/unistd.h > @@ -842,8 +842,11 @@ __SYSCALL(__NR_lsm_set_self_attr, sys_lsm_set_self_attr) > #define __NR_lsm_list_modules 461 
> __SYSCALL(__NR_lsm_list_modules, sys_lsm_list_modules) > > +#define __NR_uretprobe 462 > +__SYSCALL(__NR_uretprobe, sys_uretprobe) > + > #undef __NR_syscalls > -#define __NR_syscalls 462 > +#define __NR_syscalls 463 > > /* > * 32 bit systems traditionally used different > diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c > index faad00cce269..be6195e0d078 100644 > --- a/kernel/sys_ni.c > +++ b/kernel/sys_ni.c > @@ -391,3 +391,5 @@ COND_SYSCALL(setuid16); > > /* restartable sequence */ > COND_SYSCALL(rseq); > + > +COND_SYSCALL(uretprobe); > -- > 2.44.0 >
Re: [PATCH 0/2] Objpool performance improvements
On Fri, Apr 26, 2024 at 7:25 AM Masami Hiramatsu wrote: > > Hi Andrii, > > On Wed, 24 Apr 2024 14:52:12 -0700 > Andrii Nakryiko wrote: > > > Improve objpool (used heavily in kretprobe hot path) performance with two > > improvements: > > - inlining performance critical objpool_push()/objpool_pop() operations; > > - avoiding re-calculating relatively expensive nr_possible_cpus(). > > Thanks for optimizing objpool. Both looks good to me. Great, thanks for applying. > > BTW, I don't intend to stop this short-term optimization attempts, > but I would like to ask you check the new fgraph based fprobe > (kretprobe-multi)[1] instead of objpool/rethook. You can see that I did :) There is tons of code and I'm not familiar with internals of function_graph infra, but you can see I did run benchmarks, so I'm paying attention. But as for the objpool itself, I think it's a performant and useful internal building block, and we might use it outside of rethook as well, so I think making it as fast as possible is good regardless. > > [1] > https://lore.kernel.org/all/171318533841.254850.15841395205784342850.stgit@devnote2/ > > I'm considering to obsolete the kretprobe (and rethook) with fprobe > and eventually remove it. Those have similar feature and we should > choose safer one. > Yep, I had a few more semi-ready patches, but I'll hold off for now given this move to function graph, plus some of the changes that Jiri is making in multi-kprobe code. I'll wait for things to settle down a bit before looking again. But just to give you some context, I'm an author of retsnoop tool, and one of the killer features of it is LBR capture in kretprobes, which is a tremendous help in investigating kernel failures, especially in unfamiliar code (LBR allows to "look back" and figure out "how did we get to this condition" after the fact). 
And so it's important to minimize the amount of wasted LBR records between some kernel function returns error (and thus is "an interesting event" and BPF program that captures LBR is triggered). Big part of that is ftrace/fprobe/rethook infra, so I was looking into making that part as "minimal" as possible, in the sense of eliminating as many function calls and jump as possible. This has an added benefit of making this hot path faster, but my main motivation is LBR. Anyways, just a bit of context for some of the other patches (like inlining arch_rethook_trampoline_callback). > Thank you, > > > > > These opportunities were found when benchmarking and profiling kprobes and > > kretprobes with BPF-based benchmarks. See individual patches for details and > > results. > > > > Andrii Nakryiko (2): > > objpool: enable inlining objpool_push() and objpool_pop() operations > > objpool: cache nr_possible_cpus() and avoid caching nr_cpu_ids > > > > include/linux/objpool.h | 105 +++-- > > lib/objpool.c | 112 +++- > > 2 files changed, 107 insertions(+), 110 deletions(-) > > > > -- > > 2.43.0 > > > > > -- > Masami Hiramatsu (Google)
Re: [PATCH v9 00/36] tracing: fprobe: function_graph: Multi-function graph and fprobe on fgraph
On Mon, Apr 15, 2024 at 5:49 AM Masami Hiramatsu (Google) wrote: > > Hi, > > Here is the 9th version of the series to re-implement the fprobe on > function-graph tracer. The previous version is; > > https://lore.kernel.org/all/170887410337.564249.6360118840946697039.stgit@devnote2/ > > This version is ported on the latest kernel (v6.9-rc3 + probes/for-next) > and fixed some bugs + performance optimization patch[36/36]. > - [12/36] Fix to clear fgraph_array entry in registration failure, also >return -ENOSPC when fgraph_array is full. > - [28/36] Add new store_fprobe_entry_data() for fprobe. > - [31/36] Remove DIV_ROUND_UP() and fix entry data address calculation. > - [36/36] Add new flag to skip timestamp recording. > > Overview > > This series does major 2 changes, enable multiple function-graphs on > the ftrace (e.g. allow function-graph on sub instances) and rewrite the > fprobe on this function-graph. > > The former changes had been sent from Steven Rostedt 4 years ago (*), > which allows users to set different setting function-graph tracer (and > other tracers based on function-graph) in each trace-instances at the > same time. > > (*) https://lore.kernel.org/all/20190525031633.811342...@goodmis.org/ > > The purpose of latter change are; > > 1) Remove dependency of the rethook from fprobe so that we can reduce >the return hook code and shadow stack. > > 2) Make 'ftrace_regs' the common trace interface for the function >boundary. > > 1) Currently we have 2(or 3) different function return hook codes, > the function-graph tracer and rethook (and legacy kretprobe). > But since this is redundant and needs double maintenance cost, > I would like to unify those. From the user's viewpoint, function- > graph tracer is very useful to grasp the execution path. For this > purpose, it is hard to use the rethook in the function-graph > tracer, but the opposite is possible. (Strictly speaking, kretprobe > can not use it because it requires 'pt_regs' for historical reasons.) 
> > 2) Now the fprobe provides the 'pt_regs' for its handler, but that is > wrong for the function entry and exit. Moreover, depending on the > architecture, there is no way to accurately reproduce 'pt_regs' > outside of interrupt or exception handlers. This means fprobe should > not use 'pt_regs' because it does not use such exceptions. > (Conversely, kprobe should use 'pt_regs' because it is an abstract > interface of the software breakpoint exception.) > > This series changes fprobe to use function-graph tracer for tracing > function entry and exit, instead of mixture of ftrace and rethook. > Unlike the rethook which is a per-task list of system-wide allocated > nodes, the function graph's ret_stack is a per-task shadow stack. > Thus it does not need to set 'nr_maxactive' (which is the number of > pre-allocated nodes). > Also the handlers will get the 'ftrace_regs' instead of 'pt_regs'. > Since eBPF mulit_kprobe/multi_kretprobe events still use 'pt_regs' as > their register interface, this changes it to convert 'ftrace_regs' to > 'pt_regs'. Of course this conversion makes an incomplete 'pt_regs', > so users must access only registers for function parameters or > return value. > > Design > -- > Instead of using ftrace's function entry hook directly, the new fprobe > is built on top of the function-graph's entry and return callbacks > with 'ftrace_regs'. > > Since the fprobe requires access to 'ftrace_regs', the architecture > must support CONFIG_HAVE_DYNAMIC_FTRACE_WITH_ARGS and > CONFIG_HAVE_FTRACE_GRAPH_FUNC, which enables to call function-graph > entry callback with 'ftrace_regs', and also > CONFIG_HAVE_FUNCTION_GRAPH_FREGS, which passes the ftrace_regs to > return_to_handler. > > All fprobes share a single function-graph ops (means shares a common > ftrace filter) similar to the kprobe-on-ftrace. 
This needs another > layer to find corresponding fprobe in the common function-graph > callbacks, but has much better scalability, since the number of > registered function-graph ops is limited. > > In the entry callback, the fprobe runs its entry_handler and saves the > address of 'fprobe' on the function-graph's shadow stack as data. The > return callback decodes the data to get the 'fprobe' address, and runs > the exit_handler. > > The fprobe introduces two hash-tables, one is for entry callback which > searches fprobes related to the given function address passed by entry > callback. The other is for a return callback which checks if the given > 'fprobe' data structure pointer is still valid. Note that it is > possible to unregister fprobe before the return callback runs. Thus > the address validation must be done before using it in the return > callback. > > This series can be applied against the probes/for-next branch, which > is based on v6.9-rc3. > > This series can also be found below branch. > >
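[Editor's note: for readers unfamiliar with the consumer side of the API this cover letter reworks, a hedged kernel-code sketch of registering an fprobe pair of entry/exit handlers by symbol filter. Handler signatures follow this series' move from pt_regs to ftrace_regs and may differ between kernel versions; this is a fragment, not standalone-buildable code:]

```c
/* kernel module fragment (sketch) */
#include <linux/fprobe.h>

static int my_entry(struct fprobe *fp, unsigned long entry_ip,
		    unsigned long ret_ip, struct ftrace_regs *fregs,
		    void *entry_data)
{
	/* runs at function entry; per this series, entry_data is stored
	 * on the function-graph shadow stack until the exit handler */
	return 0;
}

static void my_exit(struct fprobe *fp, unsigned long entry_ip,
		    unsigned long ret_ip, struct ftrace_regs *fregs,
		    void *entry_data)
{
	/* runs at function return via the fgraph return hook -- no
	 * nr_maxactive pre-allocation needed, unlike rethook */
}

static struct fprobe fp = {
	.entry_handler = my_entry,
	.exit_handler  = my_exit,
};

/* in module init:
 *	err = register_fprobe(&fp, "vfs_read", NULL);
 */
```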
Re: [PATCH v9 36/36] fgraph: Skip recording calltime/rettime if it is not needed
On Mon, Apr 15, 2024 at 6:25 AM Masami Hiramatsu (Google) wrote: > > From: Masami Hiramatsu (Google) > > Skip recording calltime and rettime if the fgraph_ops does not need it. > This is a kind of performance optimization for fprobe. Since the fprobe > user does not use these entries, recording timestamp in fgraph is just > a overhead (e.g. eBPF, ftrace). So introduce the skip_timestamp flag, > and all fgraph_ops sets this flag, skip recording calltime and rettime. > > Suggested-by: Jiri Olsa > Signed-off-by: Masami Hiramatsu (Google) > --- > Changes in v9: > - Newly added. > --- > include/linux/ftrace.h |2 ++ > kernel/trace/fgraph.c | 46 +++--- > kernel/trace/fprobe.c |1 + > 3 files changed, 42 insertions(+), 7 deletions(-) > > diff --git a/include/linux/ftrace.h b/include/linux/ftrace.h > index d845a80a3d56..06fc7cbef897 100644 > --- a/include/linux/ftrace.h > +++ b/include/linux/ftrace.h > @@ -1156,6 +1156,8 @@ struct fgraph_ops { > struct ftrace_ops ops; /* for the hash lists */ > void*private; > int idx; > + /* If skip_timestamp is true, this does not record timestamps. 
*/ > + boolskip_timestamp; > }; > > void *fgraph_reserve_data(int idx, int size_bytes); > diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c > index 7556fbbae323..a5722537bb79 100644 > --- a/kernel/trace/fgraph.c > +++ b/kernel/trace/fgraph.c > @@ -131,6 +131,7 @@ DEFINE_STATIC_KEY_FALSE(kill_ftrace_graph); > int ftrace_graph_active; > > static struct fgraph_ops *fgraph_array[FGRAPH_ARRAY_SIZE]; > +static bool fgraph_skip_timestamp; > > /* LRU index table for fgraph_array */ > static int fgraph_lru_table[FGRAPH_ARRAY_SIZE]; > @@ -475,7 +476,7 @@ void ftrace_graph_stop(void) > static int > ftrace_push_return_trace(unsigned long ret, unsigned long func, > unsigned long frame_pointer, unsigned long *retp, > -int fgraph_idx) > +int fgraph_idx, bool skip_ts) > { > struct ftrace_ret_stack *ret_stack; > unsigned long long calltime; > @@ -498,8 +499,12 @@ ftrace_push_return_trace(unsigned long ret, unsigned > long func, > ret_stack = get_ret_stack(current, current->curr_ret_stack, ); > if (ret_stack && ret_stack->func == func && > get_fgraph_type(current, index + FGRAPH_RET_INDEX) == > FGRAPH_TYPE_BITMAP && > - !is_fgraph_index_set(current, index + FGRAPH_RET_INDEX, > fgraph_idx)) > + !is_fgraph_index_set(current, index + FGRAPH_RET_INDEX, > fgraph_idx)) { > + /* If previous one skips calltime, update it. */ > + if (!skip_ts && !ret_stack->calltime) > + ret_stack->calltime = trace_clock_local(); > return index + FGRAPH_RET_INDEX; > + } > > val = (FGRAPH_TYPE_RESERVED << FGRAPH_TYPE_SHIFT) | FGRAPH_RET_INDEX; > > @@ -517,7 +522,10 @@ ftrace_push_return_trace(unsigned long ret, unsigned > long func, > return -EBUSY; > } > > - calltime = trace_clock_local(); > + if (skip_ts) would it be ok to add likely() here to keep the least-overhead code path linear? 
> + calltime = 0LL; > + else > + calltime = trace_clock_local(); > > index = READ_ONCE(current->curr_ret_stack); > ret_stack = RET_STACK(current, index); > @@ -601,7 +609,8 @@ int function_graph_enter_regs(unsigned long ret, unsigned > long func, > trace.func = func; > trace.depth = ++current->curr_ret_depth; > > - index = ftrace_push_return_trace(ret, func, frame_pointer, retp, 0); > + index = ftrace_push_return_trace(ret, func, frame_pointer, retp, 0, > +fgraph_skip_timestamp); > if (index < 0) > goto out; > > @@ -654,7 +663,8 @@ int function_graph_enter_ops(unsigned long ret, unsigned > long func, > return -ENODEV; > > /* Use start for the distance to ret_stack (skipping over reserve) */ > - index = ftrace_push_return_trace(ret, func, frame_pointer, retp, > gops->idx); > + index = ftrace_push_return_trace(ret, func, frame_pointer, retp, > gops->idx, > +gops->skip_timestamp); > if (index < 0) > return index; > type = get_fgraph_type(current, index); > @@ -732,6 +742,7 @@ ftrace_pop_return_trace(struct ftrace_graph_ret *trace, > unsigned long *ret, > *ret = ret_stack->ret; > trace->func = ret_stack->func; > trace->calltime = ret_stack->calltime; > + trace->rettime = 0; > trace->overrun = atomic_read(>trace_overrun); > trace->depth = current->curr_ret_depth; > /* > @@ -792,7 +803,6 @@ __ftrace_return_to_handler(struct
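[Editor's note: the likely() suggestion in the review above, applied to the quoted hunk, would look as follows. Whether skip_ts (rather than its negation) is the common case is an assumption, based on the cover letter's statement that all fgraph_ops in this series set skip_timestamp:]

```c
	/* hint code layout toward the cheap, timestamp-free path that
	 * fprobe/eBPF users take */
	if (likely(skip_ts))
		calltime = 0LL;
	else
		calltime = trace_clock_local();
```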
Re: [PATCH v9 29/36] bpf: Enable kprobe_multi feature if CONFIG_FPROBE is enabled
On Mon, Apr 15, 2024 at 6:22 AM Masami Hiramatsu (Google) wrote: > > From: Masami Hiramatsu (Google) > > Enable kprobe_multi feature if CONFIG_FPROBE is enabled. The pt_regs is > converted from ftrace_regs by ftrace_partial_regs(), thus some registers > may always returns 0. But it should be enough for function entry (access > arguments) and exit (access return value). > > Signed-off-by: Masami Hiramatsu (Google) > Acked-by: Florent Revest > --- > Changes from previous series: NOTHING, Update against the new series. > --- > kernel/trace/bpf_trace.c | 22 +- > 1 file changed, 9 insertions(+), 13 deletions(-) > > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c > index e51a6ef87167..57b1174030c9 100644 > --- a/kernel/trace/bpf_trace.c > +++ b/kernel/trace/bpf_trace.c > @@ -2577,7 +2577,7 @@ static int __init bpf_event_init(void) > fs_initcall(bpf_event_init); > #endif /* CONFIG_MODULES */ > > -#if defined(CONFIG_FPROBE) && defined(CONFIG_DYNAMIC_FTRACE_WITH_REGS) > +#ifdef CONFIG_FPROBE > struct bpf_kprobe_multi_link { > struct bpf_link link; > struct fprobe fp; > @@ -2600,6 +2600,8 @@ struct user_syms { > char *buf; > }; > > +static DEFINE_PER_CPU(struct pt_regs, bpf_kprobe_multi_pt_regs); this is a waste if CONFIG_HAVE_PT_REGS_TO_FTRACE_REGS_CAST=y, right? Can we guard it? 
> + > static int copy_user_syms(struct user_syms *us, unsigned long __user *usyms, > u32 cnt) > { > unsigned long __user usymbol; > @@ -2792,13 +2794,14 @@ static u64 bpf_kprobe_multi_entry_ip(struct > bpf_run_ctx *ctx) > > static int > kprobe_multi_link_prog_run(struct bpf_kprobe_multi_link *link, > - unsigned long entry_ip, struct pt_regs *regs) > + unsigned long entry_ip, struct ftrace_regs *fregs) > { > struct bpf_kprobe_multi_run_ctx run_ctx = { > .link = link, > .entry_ip = entry_ip, > }; > struct bpf_run_ctx *old_run_ctx; > + struct pt_regs *regs; > int err; > > if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) { > @@ -2809,6 +2812,7 @@ kprobe_multi_link_prog_run(struct bpf_kprobe_multi_link > *link, > > migrate_disable(); > rcu_read_lock(); > + regs = ftrace_partial_regs(fregs, > this_cpu_ptr(_kprobe_multi_pt_regs)); and then pass NULL if defined(CONFIG_HAVE_PT_REGS_TO_FTRACE_REGS_CAST)? > old_run_ctx = bpf_set_run_ctx(_ctx.run_ctx); > err = bpf_prog_run(link->link.prog, regs); > bpf_reset_run_ctx(old_run_ctx); > @@ -2826,13 +2830,9 @@ kprobe_multi_link_handler(struct fprobe *fp, unsigned > long fentry_ip, > void *data) > { > struct bpf_kprobe_multi_link *link; > - struct pt_regs *regs = ftrace_get_regs(fregs); > - > - if (!regs) > - return 0; > > link = container_of(fp, struct bpf_kprobe_multi_link, fp); > - kprobe_multi_link_prog_run(link, get_entry_ip(fentry_ip), regs); > + kprobe_multi_link_prog_run(link, get_entry_ip(fentry_ip), fregs); > return 0; > } > > @@ -2842,13 +2842,9 @@ kprobe_multi_link_exit_handler(struct fprobe *fp, > unsigned long fentry_ip, >void *data) > { > struct bpf_kprobe_multi_link *link; > - struct pt_regs *regs = ftrace_get_regs(fregs); > - > - if (!regs) > - return; > > link = container_of(fp, struct bpf_kprobe_multi_link, fp); > - kprobe_multi_link_prog_run(link, get_entry_ip(fentry_ip), regs); > + kprobe_multi_link_prog_run(link, get_entry_ip(fentry_ip), fregs); > } > > static int symbols_cmp_r(const void *a, const 
void *b, const void *priv) > @@ -3107,7 +3103,7 @@ int bpf_kprobe_multi_link_attach(const union bpf_attr > *attr, struct bpf_prog *pr > kvfree(cookies); > return err; > } > -#else /* !CONFIG_FPROBE || !CONFIG_DYNAMIC_FTRACE_WITH_REGS */ > +#else /* !CONFIG_FPROBE */ > int bpf_kprobe_multi_link_attach(const union bpf_attr *attr, struct bpf_prog > *prog) > { > return -EOPNOTSUPP; > >
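[Editor's note: Andrii's two comments above suggest one guard: only allocate the per-CPU scratch pt_regs when ftrace_regs cannot simply be cast. A hedged sketch of that shape; the config name is as quoted in the thread, while the helper macro is hypothetical plumbing, not from the patch:]

```c
#ifdef CONFIG_HAVE_PT_REGS_TO_FTRACE_REGS_CAST
/* ftrace_regs is layout-compatible with pt_regs, so
 * ftrace_partial_regs() can cast and needs no scratch storage */
#define bpf_kprobe_multi_pt_regs_ptr()	((struct pt_regs *)NULL)
#else
static DEFINE_PER_CPU(struct pt_regs, bpf_kprobe_multi_pt_regs);
#define bpf_kprobe_multi_pt_regs_ptr()	this_cpu_ptr(&bpf_kprobe_multi_pt_regs)
#endif

/* then, in kprobe_multi_link_prog_run():
 *	regs = ftrace_partial_regs(fregs, bpf_kprobe_multi_pt_regs_ptr());
 */
```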
[PATCH RFC] rethook: inline arch_rethook_trampoline_callback() in assembly code
At the lowest level, rethook-based kretprobes on x86-64 architecture go through arch_rethook_trampoline() function, manually written in assembly, which calls into a simple arch_rethook_trampoline_callback() function, written in C, and only doing a few straightforward field assignments, before calling further into rethook_trampoline_handler(), which handles kretprobe callbacks generically. Looking at simplicity of arch_rethook_trampoline_callback(), it seems not really worthwhile to spend an extra function call just to do 4 or 5 assignments. As such, this patch proposes to "inline" arch_rethook_trampoline_callback() into arch_rethook_trampoline() by manually implementing it in assembly. This has two motivations. First, we do get a bit of runtime speed up by avoiding function calls. Using BPF selftests' bench tool, we see 0.6%-0.8% throughput improvement for kretprobe/multi-kretprobe triggering code path: BEFORE (latest probes/for-next) === kretprobe : 10.455 ± 0.024M/s kretprobe-multi: 11.150 ± 0.012M/s AFTER (probes/for-next + this patch) kretprobe : 10.540 ± 0.009M/s (+0.8%) kretprobe-multi: 11.219 ± 0.042M/s (+0.6%) Second, and no less importantly for some specialized use cases, this avoids unnecessarily "polluting" LBR records with an extra function call (recorded as a jump by CPU). This is the case for the retsnoop ([0]) tool, which relies heavily on capturing LBR records to provide users with lots of insight into kernel internals. This RFC patch is only inlining this function for x86-64, but it's possible to do that for 32-bit x86 arch as well and then remove arch_rethook_trampoline_callback() implementation altogether. Please let me know if this change is acceptable and whether I should complete it with 32-bit "inlining" as well. Thanks! 
[0] https://nakryiko.com/posts/retsnoop-intro/#peering-deep-into-functions-with-lbr Signed-off-by: Andrii Nakryiko --- arch/x86/kernel/asm-offsets_64.c | 4 arch/x86/kernel/rethook.c| 37 +++- 2 files changed, 36 insertions(+), 5 deletions(-) diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c index bb65371ea9df..5c444abc540c 100644 --- a/arch/x86/kernel/asm-offsets_64.c +++ b/arch/x86/kernel/asm-offsets_64.c @@ -42,6 +42,10 @@ int main(void) ENTRY(r14); ENTRY(r15); ENTRY(flags); + ENTRY(ip); + ENTRY(cs); + ENTRY(ss); + ENTRY(orig_ax); BLANK(); #undef ENTRY diff --git a/arch/x86/kernel/rethook.c b/arch/x86/kernel/rethook.c index 8a1c0111ae79..3e1c01beebd1 100644 --- a/arch/x86/kernel/rethook.c +++ b/arch/x86/kernel/rethook.c @@ -6,6 +6,7 @@ #include #include #include +#include #include "kprobes/common.h" @@ -34,10 +35,36 @@ asm( " pushq %rsp\n" " pushfq\n" SAVE_REGS_STRING - " movq %rsp, %rdi\n" - " call arch_rethook_trampoline_callback\n" + " movq %rsp, %rdi\n" /* $rdi points to regs */ + /* fixup registers */ + /* regs->cs = __KERNEL_CS; */ + " movq $" __stringify(__KERNEL_CS) ", " __stringify(pt_regs_cs) "(%rdi)\n" + /* regs->ip = (unsigned long)_rethook_trampoline; */ + " movq $arch_rethook_trampoline, " __stringify(pt_regs_ip) "(%rdi)\n" + /* regs->orig_ax = ~0UL; */ + " movq $0x, " __stringify(pt_regs_orig_ax) "(%rdi)\n" + /* regs->sp += 2*sizeof(long); */ + " addq $16, " __stringify(pt_regs_sp) "(%rdi)\n" + /* 2nd arg is frame_pointer = (long *)(regs + 1); */ + " lea " __stringify(PTREGS_SIZE) "(%rdi), %rsi\n" + /* +* The return address at 'frame_pointer' is recovered by the +* arch_rethook_fixup_return() which called from this +* rethook_trampoline_handler(). +*/ + " call rethook_trampoline_handler\n" + /* +* Copy FLAGS to 'pt_regs::ss' so we can do RET right after POPF. +* +* We don't save/restore %rax below, because we ignore +* rethook_trampoline_handler result. 
+* +* *(unsigned long *)>ss = regs->flags; +*/ + " mov " __stringify(pt_regs_flags) "(%rsp), %rax\n" + " mov %rax, " __stringify(pt_regs_ss) "(%rsp)\n" RESTORE_REGS_STRING - /* In the callback function, 'regs->flags' is copied to 'regs->ss'. */ + /* We just copied 'regs->flags' into 'regs->ss'. */ " addq $16, %rsp\n" " popfq\n" #else @@ -61,6 +88,7 @@ asm( ); NOKPROBE_SYMBOL(arch_retho
[PATCH 2/2] objpool: cache nr_possible_cpus() and avoid caching nr_cpu_ids
Profiling shows that calling nr_possible_cpus() in objpool_pop() takes a noticeable amount of CPU (when profiled on 80-core machine), as we need to recalculate number of set bits in a CPU bit mask. This number can't change, so there is no point in paying the price for recalculating it. As such, cache this value in struct objpool_head and use it in objpool_pop(). On the other hand, cached pool->nr_cpus isn't necessary, as it's not used in hot path and is also a pretty trivial value to retrieve. So drop pool->nr_cpus in favor of using nr_cpu_ids everywhere. This way the size of struct objpool_head remains the same, which is a nice bonus. Same BPF selftests benchmarks were used to evaluate the effect. Using changes in previous patch (inlining of objpool_pop/objpool_push) as baseline, here are the differences: BASELINE kretprobe :9.937 ± 0.174M/s kretprobe-multi: 10.440 ± 0.108M/s AFTER = kretprobe : 10.106 ± 0.120M/s (+1.7%) kretprobe-multi: 10.515 ± 0.180M/s (+0.7%) Cc: Matt (Qiang) Wu Signed-off-by: Andrii Nakryiko --- include/linux/objpool.h | 6 +++--- lib/objpool.c | 12 ++-- 2 files changed, 9 insertions(+), 9 deletions(-) diff --git a/include/linux/objpool.h b/include/linux/objpool.h index d8b1f7b91128..cb1758eaa2d3 100644 --- a/include/linux/objpool.h +++ b/include/linux/objpool.h @@ -73,7 +73,7 @@ typedef int (*objpool_fini_cb)(struct objpool_head *head, void *context); * struct objpool_head - object pooling metadata * @obj_size: object size, aligned to sizeof(void *) * @nr_objs:total objs (to be pre-allocated with objpool) - * @nr_cpus:local copy of nr_cpu_ids + * @nr_possible_cpus: cached value of num_possible_cpus() * @capacity: max objs can be managed by one objpool_slot * @gfp:gfp flags for kmalloc & vmalloc * @ref:refcount of objpool @@ -85,7 +85,7 @@ typedef int (*objpool_fini_cb)(struct objpool_head *head, void *context); struct objpool_head { int obj_size; int nr_objs; - int nr_cpus; + int nr_possible_cpus; int capacity; gfp_t gfp; refcount_t ref; @@ 
-176,7 +176,7 @@ static inline void *objpool_pop(struct objpool_head *pool) raw_local_irq_save(flags); cpu = raw_smp_processor_id(); - for (i = 0; i < num_possible_cpus(); i++) { + for (i = 0; i < pool->nr_possible_cpus; i++) { obj = __objpool_try_get_slot(pool, cpu); if (obj) break; diff --git a/lib/objpool.c b/lib/objpool.c index f696308fc026..234f9d0bd081 100644 --- a/lib/objpool.c +++ b/lib/objpool.c @@ -50,7 +50,7 @@ objpool_init_percpu_slots(struct objpool_head *pool, int nr_objs, { int i, cpu_count = 0; - for (i = 0; i < pool->nr_cpus; i++) { + for (i = 0; i < nr_cpu_ids; i++) { struct objpool_slot *slot; int nodes, size, rc; @@ -60,8 +60,8 @@ objpool_init_percpu_slots(struct objpool_head *pool, int nr_objs, continue; /* compute how many objects to be allocated with this slot */ - nodes = nr_objs / num_possible_cpus(); - if (cpu_count < (nr_objs % num_possible_cpus())) + nodes = nr_objs / pool->nr_possible_cpus; + if (cpu_count < (nr_objs % pool->nr_possible_cpus)) nodes++; cpu_count++; @@ -103,7 +103,7 @@ static void objpool_fini_percpu_slots(struct objpool_head *pool) if (!pool->cpu_slots) return; - for (i = 0; i < pool->nr_cpus; i++) + for (i = 0; i < nr_cpu_ids; i++) kvfree(pool->cpu_slots[i]); kfree(pool->cpu_slots); } @@ -130,13 +130,13 @@ int objpool_init(struct objpool_head *pool, int nr_objs, int object_size, /* initialize objpool pool */ memset(pool, 0, sizeof(struct objpool_head)); - pool->nr_cpus = nr_cpu_ids; + pool->nr_possible_cpus = num_possible_cpus(); pool->obj_size = object_size; pool->capacity = capacity; pool->gfp = gfp & ~__GFP_ZERO; pool->context = context; pool->release = release; - slot_size = pool->nr_cpus * sizeof(struct objpool_slot); + slot_size = nr_cpu_ids * sizeof(struct objpool_slot); pool->cpu_slots = kzalloc(slot_size, pool->gfp); if (!pool->cpu_slots) return -ENOMEM; -- 2.43.0
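The optimization in this patch is a general pattern: an invariant that is expensive to recompute (here, the popcount of the possible-CPU mask) gets computed once at init time and cached in the pool header. A small userspace sketch of the idea, with illustrative names only (this is not the objpool API):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-in for the possible-CPU bitmask (up to 64 CPUs). */
static uint64_t possible_cpu_mask;

/* Analogous to num_possible_cpus(): recounts set bits on every call. */
static int count_possible_cpus(void)
{
	return __builtin_popcountll(possible_cpu_mask);
}

struct mock_pool {
	int nr_possible_cpus;	/* cached once, like objpool_head after this patch */
};

static void mock_pool_init(struct mock_pool *pool)
{
	/* The mask cannot change at runtime, so count the bits exactly once. */
	pool->nr_possible_cpus = count_possible_cpus();
}

/* Hot path now reads a plain int instead of recounting the bitmask. */
static int mock_pool_scan_limit(const struct mock_pool *pool)
{
	return pool->nr_possible_cpus;
}
```

The same reasoning drives the second half of the patch in reverse: nr_cpu_ids is trivial to read and only used on cold paths, so it does not deserve a cached copy, keeping struct size unchanged.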
[PATCH 1/2] objpool: enable inlining objpool_push() and objpool_pop() operations
objpool_push() and objpool_pop() are very performance-critical functions and can be called very frequently in the kretprobe triggering path. As such, it makes sense to allow the compiler to inline them completely to eliminate function call overhead. Luckily, their logic is quite well isolated and doesn't have any sprawling dependencies. This patch moves both objpool_push() and objpool_pop() into include/linux/objpool.h and marks them as static inline functions, enabling inlining. To avoid anyone using the internal helpers (objpool_try_get_slot, objpool_try_add_slot), rename them to use leading underscores. We used the kretprobe microbenchmark from BPF selftests (bench trig-kprobe and trig-kprobe-multi benchmarks) running no-op BPF kretprobe/kretprobe.multi programs in a tight loop to evaluate the effect. BPF's own overhead in this case is minimal and it mostly stresses the rest of the in-kernel kretprobe infrastructure. Results are in millions of calls per second. This is not super scientific, but it shows the trend nevertheless.
BEFORE == kretprobe :9.794 ± 0.086M/s kretprobe-multi: 10.219 ± 0.032M/s AFTER = kretprobe :9.937 ± 0.174M/s (+1.5%) kretprobe-multi: 10.440 ± 0.108M/s (+2.2%) Cc: Matt (Qiang) Wu Signed-off-by: Andrii Nakryiko --- include/linux/objpool.h | 101 +++- lib/objpool.c | 100 --- 2 files changed, 99 insertions(+), 102 deletions(-) diff --git a/include/linux/objpool.h b/include/linux/objpool.h index 15aff4a17f0c..d8b1f7b91128 100644 --- a/include/linux/objpool.h +++ b/include/linux/objpool.h @@ -5,6 +5,10 @@ #include #include +#include +#include +#include +#include /* * objpool: ring-array based lockless MPMC queue @@ -118,13 +122,94 @@ int objpool_init(struct objpool_head *pool, int nr_objs, int object_size, gfp_t gfp, void *context, objpool_init_obj_cb objinit, objpool_fini_cb release); +/* try to retrieve object from slot */ +static inline void *__objpool_try_get_slot(struct objpool_head *pool, int cpu) +{ + struct objpool_slot *slot = pool->cpu_slots[cpu]; + /* load head snapshot, other cpus may change it */ + uint32_t head = smp_load_acquire(&slot->head); + + while (head != READ_ONCE(slot->last)) { + void *obj; + + /* +* data visibility of 'last' and 'head' could be out of +* order since memory updating of 'last' and 'head' are +* performed in push() and pop() independently +* +* before any retrieving attempts, pop() must guarantee +* 'last' is behind 'head', that is to say, there must +* be available objects in slot, which could be ensured +* by condition 'last != head && last - head <= nr_objs' +* that is equivalent to 'last - head - 1 < nr_objs' as +* 'last' and 'head' are both unsigned int32 +*/ + if (READ_ONCE(slot->last) - head - 1 >= pool->nr_objs) { + head = READ_ONCE(slot->head); + continue; + } + + /* obj must be retrieved before moving forward head */ + obj = READ_ONCE(slot->entries[head & slot->mask]); + + /* move head forward to mark its consumption */ + if (try_cmpxchg_release(&slot->head, &head, head + 1)) + return obj; + } + + return NULL; +} + /** * objpool_pop() -
allocate an object from objpool * @pool: object pool * * return value: object ptr or NULL if failed */ -void *objpool_pop(struct objpool_head *pool); +static inline void *objpool_pop(struct objpool_head *pool) +{ + void *obj = NULL; + unsigned long flags; + int i, cpu; + + /* disable local irq to avoid preemption & interruption */ + raw_local_irq_save(flags); + + cpu = raw_smp_processor_id(); + for (i = 0; i < num_possible_cpus(); i++) { + obj = __objpool_try_get_slot(pool, cpu); + if (obj) + break; + cpu = cpumask_next_wrap(cpu, cpu_possible_mask, -1, 1); + } + raw_local_irq_restore(flags); + + return obj; +} + +/* adding object to slot, abort if the slot was already full */ +static inline int +__objpool_try_add_slot(void *obj, struct objpool_head *pool, int cpu) +{ + struct objpool_slot *slot = pool->cpu_slots[cpu]; + uint32_t head, tail; + + /* loading tail and head as a local snapshot, tail first */ + tail = READ_ONCE(slot->tail); + + do { + head = READ_ONCE(slot->head); + /* fault caught: something must be wrong */ + WARN_ON_ONCE(tail - head > pool->nr_objs); + } while (!try_cmpxchg_acqui
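The availability check in __objpool_try_get_slot() leans on unsigned 32-bit wraparound: 'head' counts pops and 'last' counts committed pushes, both as free-running counters, and 'last - head - 1 < nr_objs' holds exactly when the slot has pending objects, even after the counters wrap. A standalone model of just that comparison (not the kernel code, and without the concurrency):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of the slot occupancy test from __objpool_try_get_slot():
 * equivalent to 'last != head && last - head <= nr_objs'.  Unsigned
 * subtraction makes the single comparison correct across wraparound.
 */
static int slot_has_objects(uint32_t head, uint32_t last, uint32_t nr_objs)
{
	return (uint32_t)(last - head - 1) < nr_objs;
}
```

When last == head the subtraction underflows to 0xffffffff, which can never be below nr_objs, so the empty case falls out of the same comparison for free.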
[PATCH 0/2] Objpool performance improvements
Improve objpool (used heavily in kretprobe hot path) performance with two improvements: - inlining performance critical objpool_push()/objpool_pop() operations; - avoiding re-calculating relatively expensive nr_possible_cpus(). These opportunities were found when benchmarking and profiling kprobes and kretprobes with BPF-based benchmarks. See individual patches for details and results. Andrii Nakryiko (2): objpool: enable inlining objpool_push() and objpool_pop() operations objpool: cache nr_possible_cpus() and avoid caching nr_cpu_ids include/linux/objpool.h | 105 +++-- lib/objpool.c | 112 +++- 2 files changed, 107 insertions(+), 110 deletions(-) -- 2.43.0
Re: [PATCH v4 2/2] rethook: honor CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING in rethook_try_get()
On Thu, Apr 18, 2024 at 6:00 PM Masami Hiramatsu wrote: > > On Thu, 18 Apr 2024 12:09:09 -0700 > Andrii Nakryiko wrote: > > > Take into account CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING when validating > > that RCU is watching when trying to set up a rethook on a function entry. > > > > One notable exception when we force the rcu_is_watching() check is the > > CONFIG_KPROBE_EVENTS_ON_NOTRACE=y case, in which case kretprobes will use > > the old-style int3-based workflow instead of relying on ftrace, making the RCU > > watching check important to validate. > > > > This further (in addition to improvements in the previous patch) > > improves BPF multi-kretprobe (which relies on rethook) runtime throughput > > by 2.3%, according to BPF benchmarks ([0]). > > > > [0] > > https://lore.kernel.org/bpf/caef4bzauq2wkmjzdc9s0rbwa01bybgwhn6andxqshyia47p...@mail.gmail.com/ > > > > Signed-off-by: Andrii Nakryiko > > > Thanks for the update! This looks good to me. Thanks, Masami! Will you take it through your tree, or would you like to route it through bpf-next? > > Acked-by: Masami Hiramatsu (Google) > > Thanks, > > > --- > > kernel/trace/rethook.c | 2 ++ > > 1 file changed, 2 insertions(+) > > > > diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c > > index fa03094e9e69..a974605ad7a5 100644 > > --- a/kernel/trace/rethook.c > > +++ b/kernel/trace/rethook.c > > @@ -166,6 +166,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh) > > if (unlikely(!handler)) > > return NULL; > > > > +#if defined(CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING) || > > defined(CONFIG_KPROBE_EVENTS_ON_NOTRACE) > > /* > >* This expects the caller will set up a rethook on a function entry.
> >* When the function returns, the rethook will eventually be reclaimed > > @@ -174,6 +175,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh) > >*/ > > if (unlikely(!rcu_is_watching())) > > return NULL; > > +#endif > > > > return (struct rethook_node *)objpool_pop(&rh->pool); > > } > > -- > > 2.43.0 > > > > > -- > Masami Hiramatsu (Google)
[PATCH v4 2/2] rethook: honor CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING in rethook_try_get()
Take into account CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING when validating that RCU is watching when trying to set up a rethook on a function entry. One notable exception when we force the rcu_is_watching() check is the CONFIG_KPROBE_EVENTS_ON_NOTRACE=y case, in which case kretprobes will use the old-style int3-based workflow instead of relying on ftrace, making the RCU watching check important to validate. This further (in addition to improvements in the previous patch) improves BPF multi-kretprobe (which relies on rethook) runtime throughput by 2.3%, according to BPF benchmarks ([0]). [0] https://lore.kernel.org/bpf/caef4bzauq2wkmjzdc9s0rbwa01bybgwhn6andxqshyia47p...@mail.gmail.com/ Signed-off-by: Andrii Nakryiko --- kernel/trace/rethook.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c index fa03094e9e69..a974605ad7a5 100644 --- a/kernel/trace/rethook.c +++ b/kernel/trace/rethook.c @@ -166,6 +166,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh) if (unlikely(!handler)) return NULL; +#if defined(CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING) || defined(CONFIG_KPROBE_EVENTS_ON_NOTRACE) /* * This expects the caller will set up a rethook on a function entry. * When the function returns, the rethook will eventually be reclaimed @@ -174,6 +175,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh) */ if (unlikely(!rcu_is_watching())) return NULL; +#endif return (struct rethook_node *)objpool_pop(&rh->pool); } -- 2.43.0
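The shape of the guarded check — a runtime validation that compiles in only when either of two config symbols is defined — can be sketched in userspace. Everything below is a mock (the MOCK_* names and the stand-in "pool" are illustrative, not rethook APIs); only the #if defined(A) || defined(B) structure mirrors the patch:

```c
#include <assert.h>
#include <stddef.h>

/* Toggle these to mimic the two Kconfig symbols in the patch. */
#define MOCK_VALIDATE_RCU_IS_WATCHING 1
/* #define MOCK_KPROBE_EVENTS_ON_NOTRACE 1 */

static int mock_rcu_watching = 1;

/* Stand-in for rcu_is_watching(); tests flip mock_rcu_watching. */
static int mock_rcu_is_watching(void)
{
	return mock_rcu_watching;
}

static const char *mock_pooled_node = "node";

/*
 * Shape of rethook_try_get() after the patch: when both mock configs
 * are off, the check (and its cost) disappears at compile time.
 */
static const char *mock_try_get(void)
{
#if defined(MOCK_VALIDATE_RCU_IS_WATCHING) || defined(MOCK_KPROBE_EVENTS_ON_NOTRACE)
	if (!mock_rcu_is_watching())
		return NULL;	/* refuse to hand out a node */
#endif
	return mock_pooled_node;
}
```

With both defines commented out, mock_try_get() compiles down to an unconditional return, which is exactly the 2.3% win the commit message measures in the real code.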
[PATCH v4 1/2] ftrace: make extra rcu_is_watching() validation check optional
Introduce CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to control whether ftrace low-level code performs additional rcu_is_watching()-based validation logic in an attempt to catch noinstr violations. This check is expected to never be true and is mostly useful for low-level validation of ftrace subsystem invariants. For most users it should probably be kept disabled to eliminate unnecessary runtime overhead. This improves BPF multi-kretprobe (relying on ftrace and rethook infrastructure) runtime throughput by 2%, according to BPF benchmarks ([0]). [0] https://lore.kernel.org/bpf/caef4bzauq2wkmjzdc9s0rbwa01bybgwhn6andxqshyia47p...@mail.gmail.com/ Cc: Steven Rostedt Cc: Masami Hiramatsu Cc: Paul E. McKenney Acked-by: Masami Hiramatsu (Google) Signed-off-by: Andrii Nakryiko --- include/linux/trace_recursion.h | 2 +- kernel/trace/Kconfig| 13 + 2 files changed, 14 insertions(+), 1 deletion(-) diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h index d48cd92d2364..24ea8ac049b4 100644 --- a/include/linux/trace_recursion.h +++ b/include/linux/trace_recursion.h @@ -135,7 +135,7 @@ extern void ftrace_record_recursion(unsigned long ip, unsigned long parent_ip); # define do_ftrace_record_recursion(ip, pip) do { } while (0) #endif -#ifdef CONFIG_ARCH_WANTS_NO_INSTR +#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING # define trace_warn_on_no_rcu(ip) \ ({ \ bool __ret = !rcu_is_watching();\ diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig index 61c541c36596..7aebd1b8f93e 100644 --- a/kernel/trace/Kconfig +++ b/kernel/trace/Kconfig @@ -974,6 +974,19 @@ config FTRACE_RECORD_RECURSION_SIZE This file can be reset, but the limit can not change in size at runtime. +config FTRACE_VALIDATE_RCU_IS_WATCHING + bool "Validate RCU is on during ftrace execution" + depends on FUNCTION_TRACER + depends on ARCH_WANTS_NO_INSTR + help + All callbacks that attach to the function tracing have some sort of + protection against recursion. 
This option is only to verify that + ftrace (and other users of ftrace_test_recursion_trylock()) are not + called outside of RCU, as if they are, it can cause a race. But it + also has a noticeable overhead when enabled. + + If unsure, say N + config RING_BUFFER_RECORD_RECURSION bool "Record functions that recurse in the ring buffer" depends on FTRACE_RECORD_RECURSION -- 2.43.0
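The Kconfig switch works by redefining the validation macro to a constant when the option is off, so the check costs nothing in production builds. A rough userspace model of that pattern — all names here are illustrative, not the kernel's trace_warn_on_no_rcu() — using the same GNU statement-expression style as the real macro:

```c
#include <assert.h>

/* Comment this out to mimic CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING=n. */
#define MOCK_VALIDATE_RCU 1

static int mock_rcu_on = 1;
static int mock_warnings;

#ifdef MOCK_VALIDATE_RCU
/* Validation build: evaluate the invariant and count violations. */
# define mock_warn_on_no_rcu()				\
	({						\
		int __ret = !mock_rcu_on;		\
		if (__ret)				\
			mock_warnings++;		\
		__ret;					\
	})
#else
/* Production build: the check and its overhead vanish entirely. */
# define mock_warn_on_no_rcu() (0)
#endif
```

Because the disabled variant is the literal 0, any `if (mock_warn_on_no_rcu()) ...` branch in callers is dead code the compiler can delete, which is the mechanism behind the ~2% throughput difference quoted above.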
Re: [PATCH v3 2/2] rethook: honor CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING in rethook_try_get()
On Tue, Apr 9, 2024 at 3:48 PM Masami Hiramatsu wrote: > > On Wed, 3 Apr 2024 15:03:28 -0700 > Andrii Nakryiko wrote: > > > Take into account CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING when validating > > that RCU is watching when trying to set up a rethook on a function entry. > > > > This further (in addition to improvements in the previous patch) > > improves BPF multi-kretprobe (which relies on rethook) runtime throughput > > by 2.3%, according to BPF benchmarks ([0]). > > > > [0] > > https://lore.kernel.org/bpf/caef4bzauq2wkmjzdc9s0rbwa01bybgwhn6andxqshyia47p...@mail.gmail.com/ > > > > Hi Andrii, > > Can you make this part depend on !KPROBE_EVENTS_ON_NOTRACE (with this > option, kretprobes can be used without ftrace, but with the original int3)? Sorry for the late response, I was out on vacation. Makes sense about KPROBE_EVENTS_ON_NOTRACE, I went with this condition: #if defined(CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING) || defined(CONFIG_KPROBE_EVENTS_ON_NOTRACE) Will send an updated revision shortly. > This option should be set to N on production systems for safety; it is > just for testing raw kretprobes. > > Thank you, > > > Signed-off-by: Andrii Nakryiko > > --- > > kernel/trace/rethook.c | 2 ++ > > 1 file changed, 2 insertions(+) > > > > diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c > > index fa03094e9e69..15b8aa4048d9 100644 > > --- a/kernel/trace/rethook.c > > +++ b/kernel/trace/rethook.c > > @@ -166,6 +166,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh) > > if (unlikely(!handler)) > > return NULL; > > > > +#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING > > /* > >* This expects the caller will set up a rethook on a function entry.
> >* When the function returns, the rethook will eventually be reclaimed > > @@ -174,6 +175,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh) > >*/ > > if (unlikely(!rcu_is_watching())) > > return NULL; > > +#endif > > > > return (struct rethook_node *)objpool_pop(&rh->pool); > > } > > -- > > 2.43.0 > > > > > -- > Masami Hiramatsu (Google)
Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe
On Mon, Apr 15, 2024 at 1:25 AM Jiri Olsa wrote: > > On Tue, Apr 02, 2024 at 11:33:00AM +0200, Jiri Olsa wrote: > > SNIP > > > #include > > #include > > @@ -308,6 +309,88 @@ static int uprobe_init_insn(struct arch_uprobe > > *auprobe, struct insn *insn, bool > > } > > > > #ifdef CONFIG_X86_64 > > + > > +asm ( > > + ".pushsection .rodata\n" > > + ".global uretprobe_syscall_entry\n" > > + "uretprobe_syscall_entry:\n" > > + "pushq %rax\n" > > + "pushq %rcx\n" > > + "pushq %r11\n" > > + "movq $" __stringify(__NR_uretprobe) ", %rax\n" > > + "syscall\n" > > + "popq %r11\n" > > + "popq %rcx\n" > > + > > + /* The uretprobe syscall replaces stored %rax value with final > > + * return address, so we don't restore %rax in here and just > > + * call ret. > > + */ > > + "retq\n" > > + ".global uretprobe_syscall_end\n" > > + "uretprobe_syscall_end:\n" > > + ".popsection\n" > > +); > > + > > +extern u8 uretprobe_syscall_entry[]; > > +extern u8 uretprobe_syscall_end[]; > > + > > +void *arch_uprobe_trampoline(unsigned long *psize) > > +{ > > + *psize = uretprobe_syscall_end - uretprobe_syscall_entry; > > + return uretprobe_syscall_entry; > > fyi I realized this screws 32-bit programs, we either need to add > compat trampoline, or keep the standard breakpoint for them: > > + struct pt_regs *regs = task_pt_regs(current); > + static uprobe_opcode_t insn = UPROBE_SWBP_INSN; > + > + if (user_64bit_mode(regs)) { > + *psize = uretprobe_syscall_end - uretprobe_syscall_entry; > + return uretprobe_syscall_entry; > + } > + > + *psize = UPROBE_SWBP_INSN_SIZE; > + return &insn; > > > not sure it's worth the effort to add the trampoline, I'll check > 32-bit arch isn't a high-performance target anyways, so I'd probably not bother and prioritize simplicity and long term maintenance. > > jirka
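The fallback Jiri sketches above boils down to a mode check at trampoline-selection time: 64-bit tasks get the syscall trampoline, everything else keeps the single int3 breakpoint. A hypothetical userspace model of that selection (the byte sequences and names below are illustrative, not the kernel's actual trampoline bytes; 0xcc is the real x86 int3 opcode):

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_SWBP_INSN 0xcc	/* int3, the traditional uprobe breakpoint */

/* Illustrative stand-in bytes for the syscall-based return trampoline. */
static const unsigned char mock_syscall_trampoline[] = {
	0x50,		/* push %rax */
	0x0f, 0x05,	/* syscall */
	0xc3,		/* ret */
};
static const unsigned char mock_swbp[] = { MOCK_SWBP_INSN };

/* Pick a return trampoline based on whether the task runs in 64-bit mode. */
static const unsigned char *mock_pick_trampoline(int task_is_64bit,
						 size_t *psize)
{
	if (task_is_64bit) {
		*psize = sizeof(mock_syscall_trampoline);
		return mock_syscall_trampoline;
	}
	/* 32-bit task: keep the plain breakpoint, as in the quoted fix. */
	*psize = sizeof(mock_swbp);
	return mock_swbp;
}
```

This matches the conclusion of the thread: the slow-but-simple breakpoint path stays correct for 32-bit tasks, and only the common 64-bit case pays for (and benefits from) the new trampoline.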
Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional
On Fri, Apr 5, 2024 at 8:41 PM Masami Hiramatsu wrote: > > On Tue, 2 Apr 2024 22:21:00 -0700 > Andrii Nakryiko wrote: > > > On Tue, Apr 2, 2024 at 9:00 PM Andrii Nakryiko > > wrote: > > > > > > On Tue, Apr 2, 2024 at 5:52 PM Steven Rostedt wrote: > > > > > > > > On Wed, 3 Apr 2024 09:40:48 +0900 > > > > Masami Hiramatsu (Google) wrote: > > > > > > > > > OK, for me, this last sentence is preferred for the help message. > > > > > That explains > > > > > what this is for. > > > > > > > > > > All callbacks that attach to the function tracing have some > > > > > sort > > > > > of protection against recursion. This option is only to > > > > > verify that > > > > > ftrace (and other users of ftrace_test_recursion_trylock()) > > > > >are not > > > > > called outside of RCU, as if they are, it can cause a race. > > > > > But it also has a noticeable overhead when enabled. > > > > > > Sounds good to me, I can add this to the description of the Kconfig > > > option. > > > > > > > > > > > > > BTW, how much overhead does this introduce? and the race case a > > > > > kernel crash? > > > > > > I just checked our fleet-wide production data for the last 24 hours. > > > Within the kprobe/kretprobe code path (ftrace_trampoline and > > > everything called from it), rcu_is_watching (both calls, see below) > > > cause 0.484% CPU cycles usage, which isn't nothing. So definitely we'd > > > prefer to be able to avoid that in production use cases. > > > > > > > I just ran synthetic microbenchmark testing multi-kretprobe > > throughput. We get (in millions of BPF kretprobe-multi program > > invocations per second): > > - 5.568M/s as baseline; > > - 5.679M/s with changes in this patch (+2% throughput improvement); > > - 5.808M/s with disabling rcu_is_watching in rethook_try_get() > > (+2.3% more vs just one of rcu_is_watching, and +4.3% vs baseline). > > > > It's definitely noticeable. > > Thanks for checking the overhead! Hmm, it is considerable. 
> > > > > > or just messed up the ftrace buffer? > > > > > > > > There's a hypothetical race where it can cause a use after free. > > Hmm, so it might not lead a kernel crash but is better to enable with > other debugging options. > > > > > > > > > That is, after you shutdown ftrace, you need to call > > > > synchronize_rcu_tasks(), > > > > which requires RCU to be watching. There's a theoretical case where that > > > > task calls the trampoline and misses the synchronization. Note, these > > > > locations are with preemption disabled, as rcu is always watching when > > > > preemption is enabled. Thus it would be extremely fast where as the > > > > synchronize_rcu_tasks() is rather slow. > > > > > > > > We also have synchronize_rcu_tasks_rude() which would actually keep the > > > > trace from happening, as it would schedule on each CPU forcing all CPUs > > > > to > > > > have RCU watching. > > > > > > > > I have never heard of this race being hit. I guess it could happen on a > > > > VM > > > > where a vCPU gets preempted at the right moment for a long time and the > > > > other CPUs synchronize. > > > > > > > > But like lockdep, where deadlocks can crash the kernel, we don't enable > > > > it > > > > for production. > > > > > > > > The overhead is another function call within the function tracer. I had > > > > numbers before, but I guess I could run tests again and get new numbers. > > > > > > > > > > I just noticed another rcu_is_watching() call, in rethook_try_get(), > > > which seems to be a similar and complementary validation check to the > > > one we are putting under CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING option > > > in this patch. It feels like both of them should be controlled by the > > > same settings. WDYT? Can I add CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING > > > guard around rcu_is_watching() check in rethook_try_get() as well? > > Hmmm, no, I think it should not change the rethook side because rethook > can be used with kprobes without ftrace. 
If we can detect it is used from It's a good thing that I split that into a separate patch, then. Hopefully the first patch looks good and you can apply it as is. > the ftrace, we can skip it. (From this reason, I would like to remove > return probe from kprobes...) I'm on PTO for the next two weeks and I can take a look at more properly guarding rcu_is_watching() in rethook_try_get() when I'm back. Thanks. > > Thank you, > > > > > > > > > > > Thanks, > > > > > > > > -- Steve > > > -- > Masami Hiramatsu (Google)
Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe
On Wed, Apr 3, 2024 at 5:58 PM Masami Hiramatsu wrote: > > On Wed, 3 Apr 2024 09:58:12 -0700 > Andrii Nakryiko wrote: > > > On Wed, Apr 3, 2024 at 7:09 AM Masami Hiramatsu wrote: > > > > > > On Wed, 3 Apr 2024 11:47:41 +0200 > > > Jiri Olsa wrote: > > > > > > > On Wed, Apr 03, 2024 at 10:07:08AM +0900, Masami Hiramatsu wrote: > > > > > Hi Jiri, > > > > > > > > > > On Tue, 2 Apr 2024 11:33:00 +0200 > > > > > Jiri Olsa wrote: > > > > > > > > > > > Adding uretprobe syscall instead of trap to speed up return probe. > > > > > > > > > > This is interesting approach. But I doubt we need to add additional > > > > > syscall just for this purpose. Can't we use another syscall or ioctl? > > > > > > > > so the plan is to optimize entry uprobe in a similar way and given > > > > the syscall is not a scarce resource I wanted to add another syscall > > > > for that one as well > > > > > > > > tbh I'm not sure sure which syscall or ioctl to reuse for this, it's > > > > possible to do that, the trampoline will just have to save one or > > > > more additional registers, but adding new syscall seems cleaner to me > > > > > > Hmm, I think a similar syscall is ptrace? prctl may also be a candidate. > > > > I think both ptrace and prctl are for completely different use cases > > and it would be an abuse of existing API to reuse them for uretprobe > > tracing. Also, keep in mind, that any extra argument that has to be > > passed into this syscall means that we need to complicate and slow > > generated assembly code that is injected into user process (to > > save/restore registers) and also kernel-side (again, to deal with all > > the extra registers that would be stored/restored on stack). > > > > Given syscalls are not some kind of scarce resources, what's the > > downside to have a dedicated and simple syscall? 
> > Syscalls are explicitly exposed to user space, thus, even if it is used > ONLY for a very specific situation, it is an official kernel interface, > and need to care about the compatibility. (If it causes SIGILL unless > a specific use case, I don't know there is a "compatibility".) Check rt_sigreturn syscall (manpage at [0], for example). sigreturn() exists only to allow the implementation of signal handlers. It should never be called directly. (Indeed, a simple sigreturn() wrapper in the GNU C library simply returns -1, with errno set to ENOSYS.) Details of the arguments (if any) passed to sigreturn() vary depending on the architecture. (On some architectures, such as x86-64, sigreturn() takes no arguments, since all of the information that it requires is available in the stack frame that was previously created by the kernel on the user-space stack.) This is a very similar use case. Also, check its source code in arch/x86/kernel/signal_64.c. It sends SIGSEGV to the calling process on any sign of something not being right. It's exactly the same with sys_uretprobe. [0] https://man7.org/linux/man-pages/man2/sigreturn.2.html > And the number of syscalls are limited resource. We have almost 500 of them, it doesn't seem like adding 1-2 for good reasons would be a problem. Can you please point to where the limits on syscalls as a resource are described? I'm curious to learn. > > I'm actually not sure how much we need to care of it, but adding a new syscall is worth to be discussed carefully because all of them are user-space compatibility. Absolutely, it's a good discussion to have. > > > > > Also, we should run syzkaller on this syscall. And if uretprobe is > > > > right, I'll check on syzkaller > > > > > set in the user function, what happen if the user function directly > > > > calls this syscall? (maybe it consumes shadow stack?)
> > > > > > > > the process should receive SIGILL if there's no pending uretprobe for > > > > the current task, or it will trigger uretprobe if there's one pending > > > > > > No, that is too aggressive and not safe. Since the syscall is exposed to > > > user program, it should return appropriate error code instead of SIGILL. > > > > > > > This is the way it is today with uretprobes even through interrupt. > > I doubt that the interrupt (exception) and syscall should be handled > differently. Especially, this exception is injected by uprobes but > syscall will be caused by itself. But syscall
[PATCH v3 2/2] rethook: honor CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING in rethook_try_get()
Take into account CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING when validating that RCU is watching when trying to set up a rethook on a function entry. This further (in addition to improvements in the previous patch) improves BPF multi-kretprobe (which relies on rethook) runtime throughput by 2.3%, according to BPF benchmarks ([0]). [0] https://lore.kernel.org/bpf/caef4bzauq2wkmjzdc9s0rbwa01bybgwhn6andxqshyia47p...@mail.gmail.com/ Signed-off-by: Andrii Nakryiko --- kernel/trace/rethook.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c index fa03094e9e69..15b8aa4048d9 100644 --- a/kernel/trace/rethook.c +++ b/kernel/trace/rethook.c @@ -166,6 +166,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh) if (unlikely(!handler)) return NULL; +#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING /* * This expects the caller will set up a rethook on a function entry. * When the function returns, the rethook will eventually be reclaimed @@ -174,6 +175,7 @@ struct rethook_node *rethook_try_get(struct rethook *rh) */ if (unlikely(!rcu_is_watching())) return NULL; +#endif return (struct rethook_node *)objpool_pop(&rh->pool); } -- 2.43.0
[PATCH v3 1/2] ftrace: make extra rcu_is_watching() validation check optional
Introduce CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to control whether ftrace low-level code performs additional rcu_is_watching()-based validation logic in an attempt to catch noinstr violations. This check is expected to never be true and is mostly useful for low-level validation of ftrace subsystem invariants. For most users it should probably be kept disabled to eliminate unnecessary runtime overhead. This improves BPF multi-kretprobe (relying on ftrace and rethook infrastructure) runtime throughput by 2%, according to BPF benchmarks ([0]). [0] https://lore.kernel.org/bpf/caef4bzauq2wkmjzdc9s0rbwa01bybgwhn6andxqshyia47p...@mail.gmail.com/ Cc: Steven Rostedt Cc: Masami Hiramatsu Cc: Paul E. McKenney Signed-off-by: Andrii Nakryiko --- include/linux/trace_recursion.h | 2 +- kernel/trace/Kconfig| 13 + 2 files changed, 14 insertions(+), 1 deletion(-) diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h index d48cd92d2364..24ea8ac049b4 100644 --- a/include/linux/trace_recursion.h +++ b/include/linux/trace_recursion.h @@ -135,7 +135,7 @@ extern void ftrace_record_recursion(unsigned long ip, unsigned long parent_ip); # define do_ftrace_record_recursion(ip, pip) do { } while (0) #endif -#ifdef CONFIG_ARCH_WANTS_NO_INSTR +#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING # define trace_warn_on_no_rcu(ip) \ ({ \ bool __ret = !rcu_is_watching();\ diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig index 61c541c36596..7aebd1b8f93e 100644 --- a/kernel/trace/Kconfig +++ b/kernel/trace/Kconfig @@ -974,6 +974,19 @@ config FTRACE_RECORD_RECURSION_SIZE This file can be reset, but the limit can not change in size at runtime. +config FTRACE_VALIDATE_RCU_IS_WATCHING + bool "Validate RCU is on during ftrace execution" + depends on FUNCTION_TRACER + depends on ARCH_WANTS_NO_INSTR + help + All callbacks that attach to the function tracing have some sort of + protection against recursion. 
This option is only to verify that + ftrace (and other users of ftrace_test_recursion_trylock()) are not + called outside of RCU, as if they are, it can cause a race. But it + also has a noticeable overhead when enabled. + + If unsure, say N + config RING_BUFFER_RECORD_RECURSION bool "Record functions that recurse in the ring buffer" depends on FTRACE_RECORD_RECURSION -- 2.43.0
Re: [PATCH] uprobes: reduce contention on uprobes_tree access
On Wed, Apr 3, 2024 at 4:05 AM Jonathan Haslam wrote: > > > > > > Given the discussion around per-cpu rw semaphore and need for > > > > > (internal) batched attachment API for uprobes, do you think you can > > > > > apply this patch as is for now? We can then gain initial improvements > > > > > in scalability that are also easy to backport, and Jonathan will work > > > > > on a more complete solution based on per-cpu RW semaphore, as > > > > > suggested by Ingo. > > > > > > > > Yeah, it is interesting to use per-cpu rw semaphore on uprobe. > > > > I would like to wait for the next version. > > > > > > My initial tests show a nice improvement with the RW spinlocks over the > > > plain spinlock, but a significant regression in acquiring a write lock. I've got a few days' > > > vacation over Easter but I'll aim to get some more formalised results out > > > to the thread toward the end of next week. > > > > As far as the write lock is only on the cold path, I think you can choose > > per-cpu RW semaphore. Since it does not do busy wait, the total system > > performance impact will be small. > > I look forward to your formalized results :) > > Sorry for the delay in getting back to you on this Masami. > > I have used one of the bpf selftest benchmarks to provide some form of > comparison of the 3 different approaches (spinlock, RW spinlock and > per-cpu RW semaphore). The benchmark used here is the 'trig-uprobe-nop' > benchmark which just executes a single uprobe with a minimal bpf program > attached. The tests were done on a 32 core qemu/kvm instance. > Thanks a lot for running benchmarks and providing results! > Things to note about the results: > > - The results are slightly variable so don't get too caught up on > individual thread count - it's the trend that is important. > - In terms of throughput with this specific benchmark a *very* macro view > is that the RW spinlock provides 40-60% more throughput than the > spinlock.
The per-CPU RW semaphore provides in the order of 50-100% > more throughput than the spinlock. > - This doesn't fully reflect the large reduction in latency that we have > seen in application based measurements. However, it does demonstrate > that even the trivial change of going to a RW spinlock provides > significant benefits. This is probably because trig-uprobe-nop creates a single uprobe that is triggered on many CPUs, while in production we also have *many* uprobes running on many CPUs. In this benchmark, besides contention on uprobes_treelock, we are also hammering on other per-uprobe locks (register_rwsem; also, if you don't have the [0] patch locally, there will be another filter lock taken each time, filter->rwlock). There is also atomic refcounting going on, which, when you have the same uprobe hit across all CPUs at the same time, will cause a bunch of cache line bouncing. So yes, it's understandable that in practice in production you see an even larger effect of optimizing uprobes_treelock than in this micro-benchmark. [0] https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git/commit/?h=probes/for-next&id=366f7afd3de31d3ce2f4cbff97c6c23b6aa6bcdf > > I haven't included the measurements on per-CPU RW semaphore write > performance as they are completely in line with those that Paul McKenney > posted on his journal [0]. On a 32 core system I see semaphore writes > take in the order of 25-28 millisecs - the cost of the synchronize_rcu(). > > Each block of results below shows 1 line per execution of the benchmark (the > "Summary" line) and each line is a run with one more thread added - a > thread is a "producer". The lines are edited to remove extraneous output > that adds no value here. > > The tests were executed with this driver script: > > for num_threads in {1..20} > do > sudo ./bench -p $num_threads trig-uprobe-nop | grep Summary just want to mention the -a (affinity) option that you can pass to the bench tool; it will pin each thread on its own CPU.
It generally makes tests more uniform, eliminating variability from CPU migrations. > done > > > spinlock > > Summary: hits 1.453 ± 0.005M/s ( 1.453M/prod) > Summary: hits 2.087 ± 0.005M/s ( 1.043M/prod) > Summary: hits 2.701 ± 0.012M/s ( 0.900M/prod) I also wanted to point out that the first measurement in each row (e.g., 1.453M/s) is total throughput across all threads, while the value in parentheses (e.g., 0.900M/prod) is the averaged throughput per thread. So this M/prod value is the most interesting one in this benchmark, where we assess the effect of reducing contention. > Summary: hits 1.917 ± 0.011M/s ( 0.479M/prod) > Summary: hits 2.105 ± 0.003M/s ( 0.421M/prod) > Summary: hits 1.615 ± 0.006M/s ( 0.269M/prod) [...]
Re: [PATCHv2 1/3] uprobe: Add uretprobe syscall to speed up return probe
On Wed, Apr 3, 2024 at 7:09 AM Masami Hiramatsu wrote: > > On Wed, 3 Apr 2024 11:47:41 +0200 > Jiri Olsa wrote: > > > On Wed, Apr 03, 2024 at 10:07:08AM +0900, Masami Hiramatsu wrote: > > > Hi Jiri, > > > > > > On Tue, 2 Apr 2024 11:33:00 +0200 > > > Jiri Olsa wrote: > > > > > > > Adding uretprobe syscall instead of trap to speed up return probe. > > > > > > This is an interesting approach. But I doubt we need to add an additional > > > syscall just for this purpose. Can't we use another syscall or ioctl? > > > > so the plan is to optimize entry uprobe in a similar way and given > > the syscall is not a scarce resource I wanted to add another syscall > > for that one as well > > > > tbh I'm not sure which syscall or ioctl to reuse for this, it's > > possible to do that, the trampoline will just have to save one or > > more additional registers, but adding a new syscall seems cleaner to me > > Hmm, I think a similar syscall is ptrace? prctl may also be a candidate. I think both ptrace and prctl are for completely different use cases and it would be an abuse of existing API to reuse them for uretprobe tracing. Also, keep in mind that any extra argument that has to be passed into this syscall means that we need to complicate and slow generated assembly code that is injected into the user process (to save/restore registers) and also kernel-side (again, to deal with all the extra registers that would be stored/restored on stack). Given syscalls are not some kind of scarce resources, what's the downside to having a dedicated and simple syscall? > > > > > > > > > Also, we should run syzkaller on this syscall. > > right, I'll check on syzkaller > > > And if uretprobe is set in the user function, what happens if the user function directly > > > calls this syscall? (maybe it consumes shadow stack?)
> > > > the process should receive SIGILL if there's no pending uretprobe for > > the current task, or it will trigger uretprobe if there's one pending > > No, that is too aggressive and not safe. Since the syscall is exposed to > user programs, it should return an appropriate error code instead of SIGILL. > This is the way it is today with uretprobes even through the interrupt. E.g., it could happen that a user process is using fibers and is replacing the stack pointer without the kernel realizing this, which will trigger some defensive checks in the uretprobe handling code and the kernel will send SIGILL because it can't support such cases. This is happening today already, and it works fine in practice (except for applications that manually change the stack pointer; too bad, you can't trace them with uretprobes, unfortunately). So I think it's absolutely adequate to have this behavior if the user process is *intentionally* abusing this API. > > > > but we could limit the syscall to be executed just from the trampoline, > that should prevent all the user space use cases, I'll do that in the next > version and add more tests for that > Why not limit? :) The uprobe_handle_trampoline() expects it is called > only from the trampoline, so it is natural to check the caller address. > (and uprobe should know where the trampoline is) > > Since the syscall is always exposed to the user program, it should > - Do nothing and return an error unless it is properly called. > - Check the prerequisites for operation strictly. > I'm concerned that new system calls introduce vulnerabilities. > As Oleg and Jiri mentioned, this syscall can't harm the kernel or other processes, only the process that is abusing the API. So any extra checks that would slow down this approach are unnecessary overhead and complication that will never be useful in practice.
Also note that sys_uretprobe is a kind of internal and unstable API and it is explicitly called out that its contract can change at any time and user space shouldn't rely on it. It's purely for the kernel's own usage. So let's please keep it fast and simple. > Thank you, > > > > > > thanks, > > jirka > > > > > > > [...]
Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional
On Tue, Apr 2, 2024 at 9:00 PM Andrii Nakryiko wrote: > > On Tue, Apr 2, 2024 at 5:52 PM Steven Rostedt wrote: > > > > On Wed, 3 Apr 2024 09:40:48 +0900 > > Masami Hiramatsu (Google) wrote: > > > > > OK, for me, this last sentence is preferred for the help message. That > > > explains > > > what this is for. > > > > > > All callbacks that attach to the function tracing have some sort > > > of protection against recursion. This option is only to verify > > > that > > > ftrace (and other users of ftrace_test_recursion_trylock()) are not > > > called outside of RCU, as if they are, it can cause a race. > > > But it also has a noticeable overhead when enabled. > > Sounds good to me, I can add this to the description of the Kconfig option. > > > > > > > BTW, how much overhead does this introduce? and the race case a kernel > > > crash? > > I just checked our fleet-wide production data for the last 24 hours. > Within the kprobe/kretprobe code path (ftrace_trampoline and > everything called from it), rcu_is_watching (both calls, see below) > cause 0.484% CPU cycles usage, which isn't nothing. So definitely we'd > prefer to be able to avoid that in production use cases. > I just ran synthetic microbenchmark testing multi-kretprobe throughput. We get (in millions of BPF kretprobe-multi program invocations per second): - 5.568M/s as baseline; - 5.679M/s with changes in this patch (+2% throughput improvement); - 5.808M/s with disabling rcu_is_watching in rethook_try_get() (+2.3% more vs just one of rcu_is_watching, and +4.3% vs baseline). It's definitely noticeable. > > > or just messed up the ftrace buffer? > > > > There's a hypothetical race where it can cause a use after free. > > > > That is, after you shutdown ftrace, you need to call > > synchronize_rcu_tasks(), > > which requires RCU to be watching. There's a theoretical case where that > > task calls the trampoline and misses the synchronization. 
Note, these > > locations are with preemption disabled, as rcu is always watching when > > preemption is enabled. Thus it would be extremely fast where as the > > synchronize_rcu_tasks() is rather slow. > > > > We also have synchronize_rcu_tasks_rude() which would actually keep the > > trace from happening, as it would schedule on each CPU forcing all CPUs to > > have RCU watching. > > > > I have never heard of this race being hit. I guess it could happen on a VM > > where a vCPU gets preempted at the right moment for a long time and the > > other CPUs synchronize. > > > > But like lockdep, where deadlocks can crash the kernel, we don't enable it > > for production. > > > > The overhead is another function call within the function tracer. I had > > numbers before, but I guess I could run tests again and get new numbers. > > > > I just noticed another rcu_is_watching() call, in rethook_try_get(), > which seems to be a similar and complementary validation check to the > one we are putting under CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING option > in this patch. It feels like both of them should be controlled by the > same settings. WDYT? Can I add CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING > guard around rcu_is_watching() check in rethook_try_get() as well? > > > > Thanks, > > > > -- Steve
Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional
On Tue, Apr 2, 2024 at 5:52 PM Steven Rostedt wrote: > > On Wed, 3 Apr 2024 09:40:48 +0900 > Masami Hiramatsu (Google) wrote: > > > OK, for me, this last sentence is preferred for the help message. That > > explains > > what this is for. > > > > All callbacks that attach to the function tracing have some sort > > of protection against recursion. This option is only to verify that > > ftrace (and other users of ftrace_test_recursion_trylock()) are not > > called outside of RCU, as if they are, it can cause a race. > > But it also has a noticeable overhead when enabled. Sounds good to me, I can add this to the description of the Kconfig option. > > > > BTW, how much overhead does this introduce? and the race case a kernel > > crash? I just checked our fleet-wide production data for the last 24 hours. Within the kprobe/kretprobe code path (ftrace_trampoline and everything called from it), rcu_is_watching (both calls, see below) cause 0.484% CPU cycles usage, which isn't nothing. So definitely we'd prefer to be able to avoid that in production use cases. > > or just messed up the ftrace buffer? > > There's a hypothetical race where it can cause a use after free. > > That is, after you shutdown ftrace, you need to call synchronize_rcu_tasks(), > which requires RCU to be watching. There's a theoretical case where that > task calls the trampoline and misses the synchronization. Note, these > locations are with preemption disabled, as rcu is always watching when > preemption is enabled. Thus it would be extremely fast where as the > synchronize_rcu_tasks() is rather slow. > > We also have synchronize_rcu_tasks_rude() which would actually keep the > trace from happening, as it would schedule on each CPU forcing all CPUs to > have RCU watching. > > I have never heard of this race being hit. I guess it could happen on a VM > where a vCPU gets preempted at the right moment for a long time and the > other CPUs synchronize. 
> > But like lockdep, where deadlocks can crash the kernel, we don't enable it > for production. > > The overhead is another function call within the function tracer. I had > numbers before, but I guess I could run tests again and get new numbers. > I just noticed another rcu_is_watching() call, in rethook_try_get(), which seems to be a similar and complementary validation check to the one we are putting under CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING option in this patch. It feels like both of them should be controlled by the same settings. WDYT? Can I add CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING guard around rcu_is_watching() check in rethook_try_get() as well? > Thanks, > > -- Steve
Re: [PATCH bpf-next] rethook: Remove warning messages printed for finding return address of a frame.
On Mon, Apr 1, 2024 at 12:16 PM Kui-Feng Lee wrote: > > rethook_find_ret_addr() prints a warning message and returns 0 when the > target task is running and not the "current" task to prevent returning an > incorrect return address. However, this check is incomplete as the target > task can still transition to the running state when finding the return > address, although it is safe with RCU. > > The issue we encounter is that the kernel frequently prints warning > messages when BPF profiling programs call to bpf_get_task_stack() on > running tasks. > > The callers should be aware and willing to take the risk of receiving an > incorrect return address from a task that is currently running other than > the "current" one. A warning is not needed here as the callers are intent > on it. > > Signed-off-by: Kui-Feng Lee > --- > kernel/trace/rethook.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/kernel/trace/rethook.c b/kernel/trace/rethook.c > index fa03094e9e69..4297a132a7ae 100644 > --- a/kernel/trace/rethook.c > +++ b/kernel/trace/rethook.c > @@ -248,7 +248,7 @@ unsigned long rethook_find_ret_addr(struct task_struct > *tsk, unsigned long frame > if (WARN_ON_ONCE(!cur)) > return 0; > > - if (WARN_ON_ONCE(tsk != current && task_is_running(tsk))) > + if (tsk != current && task_is_running(tsk)) > return 0; > This should probably go through Masami's tree, but the change makes sense to me, given this is an expected condition. Acked-by: Andrii Nakryiko > do { > -- > 2.34.1 > >
Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional
On Mon, Apr 1, 2024 at 5:38 PM Masami Hiramatsu wrote: > > On Mon, 1 Apr 2024 12:09:18 -0400 > Steven Rostedt wrote: > > > On Mon, 1 Apr 2024 20:25:52 +0900 > > Masami Hiramatsu (Google) wrote: > > > > > > Masami, > > > > > > > > Are you OK with just keeping it set to N. > > > > > > OK, if it is only for the debugging, I'm OK to set N this. > > > > > > > > > > > We could have other options like PROVE_LOCKING enable it. > > > > > > Agreed (but it should say this is a debug option) > > > > It does say "Validate" which to me is a debug option. What would you > > suggest? > > I think the help message should have "This is for debugging ftrace." > Sent v2 with adjusted wording, thanks! > Thank you, > > > > > -- Steve > > > -- > Masami Hiramatsu (Google)
[PATCH v2] ftrace: make extra rcu_is_watching() validation check optional
Introduce CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to control whether ftrace low-level code performs additional rcu_is_watching()-based validation logic in an attempt to catch noinstr violations.

This check is expected to never be true and is mostly useful for low-level debugging of ftrace subsystem. For most users it should probably be kept disabled to eliminate unnecessary runtime overhead.

Cc: Steven Rostedt
Cc: Masami Hiramatsu
Cc: Paul E. McKenney
Signed-off-by: Andrii Nakryiko
---
 include/linux/trace_recursion.h |  2 +-
 kernel/trace/Kconfig            | 14 ++++++++++++++
 2 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h
index d48cd92d2364..24ea8ac049b4 100644
--- a/include/linux/trace_recursion.h
+++ b/include/linux/trace_recursion.h
@@ -135,7 +135,7 @@ extern void ftrace_record_recursion(unsigned long ip, unsigned long parent_ip);
 # define do_ftrace_record_recursion(ip, pip) do { } while (0)
 #endif
 
-#ifdef CONFIG_ARCH_WANTS_NO_INSTR
+#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING
 # define trace_warn_on_no_rcu(ip)				\
 	({							\
 		bool __ret = !rcu_is_watching();		\
diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig
index 61c541c36596..fcf45d5c60cb 100644
--- a/kernel/trace/Kconfig
+++ b/kernel/trace/Kconfig
@@ -974,6 +974,20 @@ config FTRACE_RECORD_RECURSION_SIZE
 	  This file can be reset, but the limit can not change in
 	  size at runtime.
 
+config FTRACE_VALIDATE_RCU_IS_WATCHING
+	bool "Validate RCU is on during ftrace recursion check"
+	depends on FUNCTION_TRACER
+	depends on ARCH_WANTS_NO_INSTR
+	help
+	  All callbacks that attach to the function tracing have some sort
+	  of protection against recursion. This option performs additional
+	  checks to make sure RCU is on when ftrace callbacks recurse.
+
+	  This is a feature useful for debugging ftrace. This will add more
+	  overhead to all ftrace-based invocations.
+
+	  If unsure, say N
+
 config RING_BUFFER_RECORD_RECURSION
 	bool "Record functions that recurse in the ring buffer"
 	depends on FTRACE_RECORD_RECURSION
-- 
2.43.0
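For a debug or test kernel, enabling the new option reduces to a short config fragment along these lines (assuming the architecture already selects ARCH_WANTS_NO_INSTR, which is not user-selectable):

    CONFIG_FUNCTION_TRACER=y
    CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING=y

Production configs would simply leave the second line unset, keeping the hot path free of the extra rcu_is_watching() call.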
Re: [PATCH] uprobes: reduce contention on uprobes_tree access
On Fri, Mar 29, 2024 at 5:36 PM Masami Hiramatsu wrote: > > On Fri, 29 Mar 2024 10:33:57 -0700 > Andrii Nakryiko wrote: > > > On Wed, Mar 27, 2024 at 5:45 PM Andrii Nakryiko > > wrote: > > > > > > On Wed, Mar 27, 2024 at 5:18 PM Masami Hiramatsu > > > wrote: > > > > > > > > On Wed, 27 Mar 2024 17:06:01 + > > > > Jonthan Haslam wrote: > > > > > > > > > > > Masami, > > > > > > > > > > > > > > Given the discussion around per-cpu rw semaphore and need for > > > > > > > (internal) batched attachment API for uprobes, do you think you > > > > > > > can > > > > > > > apply this patch as is for now? We can then gain initial > > > > > > > improvements > > > > > > > in scalability that are also easy to backport, and Jonathan will > > > > > > > work > > > > > > > on a more complete solution based on per-cpu RW semaphore, as > > > > > > > suggested by Ingo. > > > > > > > > > > > > Yeah, it is interesting to use per-cpu rw semaphore on uprobe. > > > > > > I would like to wait for the next version. > > > > > > > > > > My initial tests show a nice improvement on the over RW spinlocks but > > > > > significant regression in acquiring a write lock. I've got a few days > > > > > vacation over Easter but I'll aim to get some more formalised results > > > > > out > > > > > to the thread toward the end of next week. > > > > > > > > As far as the write lock is only on the cold path, I think you can > > > > choose > > > > per-cpu RW semaphore. Since it does not do busy wait, the total system > > > > performance impact will be small. > > > > > > No, Masami, unfortunately it's not as simple. In BPF we have BPF > > > multi-uprobe, which can be used to attach to thousands of user > > > functions. It currently creates one uprobe at a time, as we don't > > > really have a batched API. 
If each such uprobe registration will now > > > take a (relatively) long time, when multiplied by number of attach-to > > > user functions, it will be a horrible regression in terms of > > > attachment/detachment performance. > > Ah, got it. So attachment/detachment performance should be counted. > > > > > > > So when we switch to per-CPU rw semaphore, we'll need to provide an > > > internal batch uprobe attach/detach API to make sure that attaching to > > > multiple uprobes is still fast. > > Yeah, we need such interface like register_uprobes(...). > > > > > > > Which is why I was asking to land this patch as is, as it relieves the > > > scalability pains in production and is easy to backport to old > > > kernels. And then we can work on batched APIs and switch to per-CPU rw > > > semaphore. > > OK, then I'll push this to for-next at this moment. Great, thanks a lot! > Please share if you have a good idea for the batch interface which can be > backported. I guess it should involve updating userspace changes too. > Yep, we'll investigate a best way to provide batch interface for uprobes and will send patches. > Thank you! > > > > > > > So I hope you can reconsider and accept improvements in this patch, > > > while Jonathan will keep working on even better final solution. > > > Thanks! > > > > > > > I look forward to your formalized results :) > > > > > > > > BTW, as part of BPF selftests, we have a multi-attach test for uprobes > > and USDTs, reporting attach/detach timings: > > $ sudo ./test_progs -v -t uprobe_multi_test/bench > > bpf_testmod.ko is already unloaded. > > Loading bpf_testmod.ko... > > Successfully loaded bpf_testmod.ko. 
> > test_bench_attach_uprobe:PASS:uprobe_multi_bench__open_and_load 0 nsec > > test_bench_attach_uprobe:PASS:uprobe_multi_bench__attach 0 nsec > > test_bench_attach_uprobe:PASS:uprobes_count 0 nsec > > test_bench_attach_uprobe: attached in 0.120s > > test_bench_attach_uprobe: detached in 0.092s > > #400/5 uprobe_multi_test/bench_uprobe:OK > > test_bench_attach_usdt:PASS:uprobe_multi__open 0 nsec > > test_bench_attach_usdt:PASS:bpf_program__attach_usdt 0 nsec > > test_bench_attach_usdt:PASS:usdt_count 0 nsec > > test_bench_attach_usdt: attached in 0.124s > > test_bench_attach_usdt: detached in 0.064s &
Re: [PATCH] uprobes: reduce contention on uprobes_tree access
On Wed, Mar 27, 2024 at 5:45 PM Andrii Nakryiko wrote: > > On Wed, Mar 27, 2024 at 5:18 PM Masami Hiramatsu wrote: > > > > On Wed, 27 Mar 2024 17:06:01 + > > Jonthan Haslam wrote: > > > > > > > Masami, > > > > > > > > > > Given the discussion around per-cpu rw semaphore and need for > > > > > (internal) batched attachment API for uprobes, do you think you can > > > > > apply this patch as is for now? We can then gain initial improvements > > > > > in scalability that are also easy to backport, and Jonathan will work > > > > > on a more complete solution based on per-cpu RW semaphore, as > > > > > suggested by Ingo. > > > > > > > > Yeah, it is interesting to use per-cpu rw semaphore on uprobe. > > > > I would like to wait for the next version. > > > > > > My initial tests show a nice improvement on the over RW spinlocks but > > > significant regression in acquiring a write lock. I've got a few days > > > vacation over Easter but I'll aim to get some more formalised results out > > > to the thread toward the end of next week. > > > > As far as the write lock is only on the cold path, I think you can choose > > per-cpu RW semaphore. Since it does not do busy wait, the total system > > performance impact will be small. > > No, Masami, unfortunately it's not as simple. In BPF we have BPF > multi-uprobe, which can be used to attach to thousands of user > functions. It currently creates one uprobe at a time, as we don't > really have a batched API. If each such uprobe registration will now > take a (relatively) long time, when multiplied by number of attach-to > user functions, it will be a horrible regression in terms of > attachment/detachment performance. > > So when we switch to per-CPU rw semaphore, we'll need to provide an > internal batch uprobe attach/detach API to make sure that attaching to > multiple uprobes is still fast. 
> > Which is why I was asking to land this patch as is, as it relieves the > scalability pains in production and is easy to backport to old > kernels. And then we can work on batched APIs and switch to per-CPU rw > semaphore. > > So I hope you can reconsider and accept improvements in this patch, > while Jonathan will keep working on even better final solution. > Thanks! > > > I look forward to your formalized results :) > > BTW, as part of BPF selftests, we have a multi-attach test for uprobes and USDTs, reporting attach/detach timings: $ sudo ./test_progs -v -t uprobe_multi_test/bench bpf_testmod.ko is already unloaded. Loading bpf_testmod.ko... Successfully loaded bpf_testmod.ko. test_bench_attach_uprobe:PASS:uprobe_multi_bench__open_and_load 0 nsec test_bench_attach_uprobe:PASS:uprobe_multi_bench__attach 0 nsec test_bench_attach_uprobe:PASS:uprobes_count 0 nsec test_bench_attach_uprobe: attached in 0.120s test_bench_attach_uprobe: detached in 0.092s #400/5 uprobe_multi_test/bench_uprobe:OK test_bench_attach_usdt:PASS:uprobe_multi__open 0 nsec test_bench_attach_usdt:PASS:bpf_program__attach_usdt 0 nsec test_bench_attach_usdt:PASS:usdt_count 0 nsec test_bench_attach_usdt: attached in 0.124s test_bench_attach_usdt: detached in 0.064s #400/6 uprobe_multi_test/bench_usdt:OK #400 uprobe_multi_test:OK Summary: 1/2 PASSED, 0 SKIPPED, 0 FAILED Successfully unloaded bpf_testmod.ko. So it should be easy for Jonathan to validate his changes with this. > > Thank you, > > > > > > > > Jon. > > > > > > > > > > > Thank you, > > > > > > > > > > > > > > > > > > > > > BTW, how did you measure the overhead? I think spinlock overhead > > > > > > will depend on how much lock contention happens. 
> > > > > > > > > > > > Thank you, > > > > > > > > > > > > > > > > > > > > [0] https://docs.kernel.org/locking/spinlocks.html > > > > > > > > > > > > > > Signed-off-by: Jonathan Haslam > > > > > > > --- > > > > > > > kernel/events/uprobes.c | 22 +++--- > > > > > > > 1 file changed, 11 insertions(+), 11 deletions(-) > > > > > > > > > > > > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > > > > > > > index 929e98c62965..42bf9b6e8bc0 100644 > > > > > > > --- a/kernel/events/uprobes.c > > > > > > > +++ b/kernel/events/uprobes.c > > > > > > > @@ -39,7 +39,7
Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional
On Tue, Mar 26, 2024 at 11:58 AM Steven Rostedt wrote: > > On Tue, 26 Mar 2024 09:16:33 -0700 > Andrii Nakryiko wrote: > > > > It's no different than lockdep. Test boxes should have it enabled, but > > > there's no reason to have this enabled in a production system. > > > > > > > I tend to agree with Steven here (which is why I sent this patch as it > > is), but I'm happy to do it as an opt-out, if Masami insists. Please > > do let me know if I need to send v2 or this one is actually the one > > we'll end up using. Thanks! > > Masami, > > Are you OK with just keeping it set to N. > > We could have other options like PROVE_LOCKING enable it. > So what's the conclusion, Masami? Should I send another version where this config is opt-out, or are you ok with keeping it as opt-in as proposed in this revision? > -- Steve
Re: [PATCH] uprobes: reduce contention on uprobes_tree access
On Wed, Mar 27, 2024 at 5:18 PM Masami Hiramatsu wrote: > > On Wed, 27 Mar 2024 17:06:01 + > Jonthan Haslam wrote: > > > > > Masami, > > > > > > > > Given the discussion around per-cpu rw semaphore and need for > > > > (internal) batched attachment API for uprobes, do you think you can > > > > apply this patch as is for now? We can then gain initial improvements > > > > in scalability that are also easy to backport, and Jonathan will work > > > > on a more complete solution based on per-cpu RW semaphore, as > > > > suggested by Ingo. > > > > > > Yeah, it is interesting to use per-cpu rw semaphore on uprobe. > > > I would like to wait for the next version. > > > > My initial tests show a nice improvement on the over RW spinlocks but > > significant regression in acquiring a write lock. I've got a few days > > vacation over Easter but I'll aim to get some more formalised results out > > to the thread toward the end of next week. > > As far as the write lock is only on the cold path, I think you can choose > per-cpu RW semaphore. Since it does not do busy wait, the total system > performance impact will be small. No, Masami, unfortunately it's not as simple. In BPF we have BPF multi-uprobe, which can be used to attach to thousands of user functions. It currently creates one uprobe at a time, as we don't really have a batched API. If each such uprobe registration will now take a (relatively) long time, when multiplied by number of attach-to user functions, it will be a horrible regression in terms of attachment/detachment performance. So when we switch to per-CPU rw semaphore, we'll need to provide an internal batch uprobe attach/detach API to make sure that attaching to multiple uprobes is still fast. Which is why I was asking to land this patch as is, as it relieves the scalability pains in production and is easy to backport to old kernels. And then we can work on batched APIs and switch to per-CPU rw semaphore. 
So I hope you can reconsider and accept improvements in this patch, while Jonathan will keep working on even better final solution. Thanks! > I look forward to your formalized results :) > > Thank you, > > > > > Jon. > > > > > > > > Thank you, > > > > > > > > > > > > > > > > > BTW, how did you measure the overhead? I think spinlock overhead > > > > > will depend on how much lock contention happens. > > > > > > > > > > Thank you, > > > > > > > > > > > > > > > > > [0] https://docs.kernel.org/locking/spinlocks.html > > > > > > > > > > > > Signed-off-by: Jonathan Haslam > > > > > > --- > > > > > > kernel/events/uprobes.c | 22 +++--- > > > > > > 1 file changed, 11 insertions(+), 11 deletions(-) > > > > > > > > > > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > > > > > > index 929e98c62965..42bf9b6e8bc0 100644 > > > > > > --- a/kernel/events/uprobes.c > > > > > > +++ b/kernel/events/uprobes.c > > > > > > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT; > > > > > > */ > > > > > > #define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree) > > > > > > > > > > > > -static DEFINE_SPINLOCK(uprobes_treelock); /* serialize rbtree > > > > > > access */ > > > > > > +static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree > > > > > > access */ > > > > > > > > > > > > #define UPROBES_HASH_SZ 13 > > > > > > /* serialize uprobe->pending_list */ > > > > > > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode > > > > > > *inode, loff_t offset) > > > > > > { > > > > > > struct uprobe *uprobe; > > > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > > + read_lock(&uprobes_treelock); > > > > > > uprobe = __find_uprobe(inode, offset); > > > > > > - spin_unlock(&uprobes_treelock); > > > > > > + read_unlock(&uprobes_treelock); > > > > > > > > > > > > return uprobe; > > > > > > } > > > > > > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct > > > > > > uprobe *uprobe) > > > > > > { > > > > > > struct uprobe *u; > > > > > > > > > > > > - spin_lock(&uprobes_treelock); > 
> > > > > + write_lock(&uprobes_treelock); > > > > > > u = __insert_uprobe(uprobe); > > > > > > - spin_unlock(&uprobes_treelock); > > > > > > + write_unlock(&uprobes_treelock); > > > > > > > > > > > > return u; > > > > > > } > > > > > > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe) > > > > > > if (WARN_ON(!uprobe_is_active(uprobe))) > > > > > > return; > > > > > > > > > > > > - spin_lock(&uprobes_treelock); > > > > > > + write_lock(&uprobes_treelock); > > > > > > rb_erase(&uprobe->rb_node, &uprobes_tree); > > > > > > - spin_unlock(&uprobes_treelock); > > > > > > + write_unlock(&uprobes_treelock); > > > > > > RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */ > > > > > > put_uprobe(uprobe); > > > > > > } > > > > > > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode > > > > > > *inode, > > > > > > min = vaddr_to_offset(vma, start); > > > > > > max = min + (end - start) - 1; > > >
Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional
On Mon, Mar 25, 2024 at 3:11 PM Steven Rostedt wrote: > > On Mon, 25 Mar 2024 11:38:48 +0900 > Masami Hiramatsu (Google) wrote: > > > On Fri, 22 Mar 2024 09:03:23 -0700 > > Andrii Nakryiko wrote: > > > > > Introduce CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to > > > control whether ftrace low-level code performs additional > > > rcu_is_watching()-based validation logic in an attempt to catch noinstr > > > violations. > > > > > > This check is expected to never be true in practice and would be best > > > controlled with extra config to let users decide if they are willing to > > > pay the price. > > > > Hmm, for me, it sounds like "WARN_ON(something) never be true in practice > > so disable it by default". I think CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING > > is OK, but tht should be set to Y by default. If you have already verified > > that your system never make it true and you want to optimize your ftrace > > path, you can manually set it to N at your own risk. > > > > Really, it's for debugging. I would argue that it should *not* be default y. > Peter added this to find all the locations that could be called where RCU > is not watching. But the issue I have is that this is that it *does cause > overhead* with function tracing. > > I believe we found pretty much all locations that were an issue, and we > should now just make it an option for developers. > > It's no different than lockdep. Test boxes should have it enabled, but > there's no reason to have this enabled in a production system. > I tend to agree with Steven here (which is why I sent this patch as it is), but I'm happy to do it as an opt-out, if Masami insists. Please do let me know if I need to send v2 or this one is actually the one we'll end up using. Thanks! > -- Steve > > > > > > > > Cc: Steven Rostedt > > > Cc: Masami Hiramatsu > > > Cc: Paul E. 
McKenney > > > Signed-off-by: Andrii Nakryiko > > > --- > > > include/linux/trace_recursion.h | 2 +- > > > kernel/trace/Kconfig| 13 + > > > 2 files changed, 14 insertions(+), 1 deletion(-) > > > > > > diff --git a/include/linux/trace_recursion.h > > > b/include/linux/trace_recursion.h > > > index d48cd92d2364..24ea8ac049b4 100644 > > > --- a/include/linux/trace_recursion.h > > > +++ b/include/linux/trace_recursion.h > > > @@ -135,7 +135,7 @@ extern void ftrace_record_recursion(unsigned long ip, > > > unsigned long parent_ip); > > > # define do_ftrace_record_recursion(ip, pip) do { } while (0) > > > #endif > > > > > > -#ifdef CONFIG_ARCH_WANTS_NO_INSTR > > > +#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING > > > # define trace_warn_on_no_rcu(ip) \ > > > ({ \ > > > bool __ret = !rcu_is_watching();\ > > > > BTW, maybe we can add "unlikely" in the next "if" line? > > > > > diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig > > > index 61c541c36596..19bce4e217d6 100644 > > > --- a/kernel/trace/Kconfig > > > +++ b/kernel/trace/Kconfig > > > @@ -974,6 +974,19 @@ config FTRACE_RECORD_RECURSION_SIZE > > > This file can be reset, but the limit can not change in > > > size at runtime. > > > > > > +config FTRACE_VALIDATE_RCU_IS_WATCHING > > > + bool "Validate RCU is on during ftrace recursion check" > > > + depends on FUNCTION_TRACER > > > + depends on ARCH_WANTS_NO_INSTR > > > > default y > > > > > + help > > > + All callbacks that attach to the function tracing have some sort > > > + of protection against recursion. This option performs additional > > > + checks to make sure RCU is on when ftrace callbacks recurse. > > > + > > > + This will add more overhead to all ftrace-based invocations. > > > > ... invocations, but keep it safe. 
> > > > > + > > > + If unsure, say N > > > > If unsure, say Y > > > > Thank you, > > > > > + > > > config RING_BUFFER_RECORD_RECURSION > > > bool "Record functions that recurse in the ring buffer" > > > depends on FTRACE_RECORD_RECURSION > > > -- > > > 2.43.0 > > > > > > > >
Re: [PATCH] uprobes: reduce contention on uprobes_tree access
On Sun, Mar 24, 2024 at 8:03 PM Masami Hiramatsu wrote: > > On Thu, 21 Mar 2024 07:57:35 -0700 > Jonathan Haslam wrote: > > > Active uprobes are stored in an RB tree and accesses to this tree are > > dominated by read operations. Currently these accesses are serialized by > > a spinlock but this leads to enormous contention when large numbers of > > threads are executing active probes. > > > > This patch converts the spinlock used to serialize access to the > > uprobes_tree RB tree into a reader-writer spinlock. This lock type > > aligns naturally with the overwhelmingly read-only nature of the tree > > usage here. Although the addition of reader-writer spinlocks is > > discouraged [0], this fix is proposed as an interim solution while an > > RCU based approach is implemented (that work is in a nascent form). This > > fix also has the benefit of being trivial, self-contained and therefore > > simple to backport. > > > > This change has been tested against production workloads that exhibit > > significant contention on the spinlock and an almost order of magnitude > > reduction for mean uprobe execution time is observed (28 -> 3.5 microsecs). > > Looks good to me. > > Acked-by: Masami Hiramatsu (Google) Masami, Given the discussion around per-cpu rw semaphore and the need for an (internal) batched attachment API for uprobes, do you think you can apply this patch as is for now? We can then gain initial improvements in scalability that are also easy to backport, and Jonathan will work on a more complete solution based on per-cpu RW semaphore, as suggested by Ingo. > > BTW, how did you measure the overhead? I think spinlock overhead > will depend on how much lock contention happens. 
> > Thank you, > > > > > [0] https://docs.kernel.org/locking/spinlocks.html > > > > Signed-off-by: Jonathan Haslam > > --- > > kernel/events/uprobes.c | 22 +++--- > > 1 file changed, 11 insertions(+), 11 deletions(-) > > > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > > index 929e98c62965..42bf9b6e8bc0 100644 > > --- a/kernel/events/uprobes.c > > +++ b/kernel/events/uprobes.c > > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT; > > */ > > #define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree) > > > > -static DEFINE_SPINLOCK(uprobes_treelock); /* serialize rbtree access */ > > +static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */ > > > > #define UPROBES_HASH_SZ 13 > > /* serialize uprobe->pending_list */ > > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, > > loff_t offset) > > { > > struct uprobe *uprobe; > > > > - spin_lock(&uprobes_treelock); > > + read_lock(&uprobes_treelock); > > uprobe = __find_uprobe(inode, offset); > > - spin_unlock(&uprobes_treelock); > > + read_unlock(&uprobes_treelock); > > > > return uprobe; > > } > > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe > > *uprobe) > > { > > struct uprobe *u; > > > > - spin_lock(&uprobes_treelock); > > + write_lock(&uprobes_treelock); > > u = __insert_uprobe(uprobe); > > - spin_unlock(&uprobes_treelock); > > + write_unlock(&uprobes_treelock); > > > > return u; > > } > > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe) > > if (WARN_ON(!uprobe_is_active(uprobe))) > > return; > > > > - spin_lock(&uprobes_treelock); > > + write_lock(&uprobes_treelock); > > rb_erase(&uprobe->rb_node, &uprobes_tree); > > - spin_unlock(&uprobes_treelock); > > + write_unlock(&uprobes_treelock); > > RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */ > > put_uprobe(uprobe); > > } > > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode, > > min = vaddr_to_offset(vma, start); > > max = min + (end - start) - 1; > > > > - spin_lock(&uprobes_treelock); > > + read_lock(&uprobes_treelock); > > n = find_node_in_range(inode, min, max); 
> > if (n) { > > for (t = n; t; t = rb_prev(t)) { > > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode, > > get_uprobe(u); > > } > > } > > - spin_unlock(&uprobes_treelock); > > + read_unlock(&uprobes_treelock); > > } > > > > /* @vma contains reference counter, not the probed instruction. */ > > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned > > long start, unsigned long e > > min = vaddr_to_offset(vma, start); > > max = min + (end - start) - 1; > > > > - spin_lock(&uprobes_treelock); > > + read_lock(&uprobes_treelock); > > n = find_node_in_range(inode, min, max); > > - spin_unlock(&uprobes_treelock); > > + read_unlock(&uprobes_treelock); > > > > return !!n; > > } > > -- > > 2.43.0 > > > > > -- > Masami Hiramatsu (Google)
Re: raw_tp+cookie is buggy. Was: [syzbot] [bpf?] [trace?] KASAN: slab-use-after-free Read in bpf_trace_run1
On Mon, Mar 25, 2024 at 10:27 AM Andrii Nakryiko wrote: > > On Sun, Mar 24, 2024 at 5:07 PM Alexei Starovoitov > wrote: > > > > Hi Andrii, > > > > syzbot found UAF in raw_tp cookie series in bpf-next. > > Reverting the whole merge > > 2e244a72cd48 ("Merge branch 'bpf-raw-tracepoint-support-for-bpf-cookie'") > > > > fixes the issue. > > > > Pls take a look. > > See C reproducer below. It splats consistently with CONFIG_KASAN=y > > > > Thanks. > > Will do, traveling today, so will be offline for a bit, but will check > first thing afterwards. > Ok, so I don't think it's bpf_raw_tp_link specific, it should affect a bunch of other links (unless I missed something). Basically, when the last link refcnt drops, we detach, do bpf_prog_put() and then proceed to kfree the link itself synchronously. But that link can still be referenced from a running BPF program (I think multi-kprobe/multi-uprobe use it for cookies, raw_tp with my changes started using the link at runtime, there are probably more types), and so if we free this memory synchronously, we can have UAF. We should do what we do for bpf_maps and delay freeing, the only question is how tunable that freeing can be? Always do call_rcu()? Always call_rcu_tasks_trace() (relevant for sleepable multi-uprobes)? Should we allow synchronous free if the link is not directly accessible from the program during its run? Anyway, I sent a fix as an RFC so we can discuss. > > > > On Sun, Mar 24, 2024 at 4:28 PM syzbot > > wrote: > > > > > > Hello, > > > > > > syzbot found the following issue on: > > > > > > HEAD commit:520fad2e3206 selftests/bpf: scale benchmark counting by > > > us.. 
> > > git tree: bpf-next > > > console+strace: https://syzkaller.appspot.com/x/log.txt?x=105af94618 > > > kernel config: https://syzkaller.appspot.com/x/.config?x=6fb1be60a193d440 > > > dashboard link: > > > https://syzkaller.appspot.com/bug?extid=981935d9485a560bfbcb > > > compiler: Debian clang version 15.0.6, GNU ld (GNU Binutils for > > > Debian) 2.40 > > > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=114f17a518 > > > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=162bb7a518 > > > > > > Downloadable assets: > > > disk image: > > > https://storage.googleapis.com/syzbot-assets/4eef3506c5ce/disk-520fad2e.raw.xz > > > vmlinux: > > > https://storage.googleapis.com/syzbot-assets/24d60ebe76cc/vmlinux-520fad2e.xz > > > kernel image: > > > https://storage.googleapis.com/syzbot-assets/8f883e706550/bzImage-520fad2e.xz > > > > > > IMPORTANT: if you fix the issue, please add the following tag to the > > > commit: > > > Reported-by: syzbot+981935d9485a560bf...@syzkaller.appspotmail.com > > > > > > == > > > BUG: KASAN: slab-use-after-free in __bpf_trace_run > > > kernel/trace/bpf_trace.c:2376 [inline] > > > BUG: KASAN: slab-use-after-free in bpf_trace_run1+0xcb/0x510 > > > kernel/trace/bpf_trace.c:2430 > > > Read of size 8 at addr 8880290d9918 by task migration/0/19 > > > > > > CPU: 0 PID: 19 Comm: migration/0 Not tainted > > > 6.8.0-syzkaller-05233-g520fad2e3206 #0 > > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS > > > Google 02/29/2024 > > > Stopper: 0x0 <- 0x0 > > > Call Trace: > > > > > > __dump_stack lib/dump_stack.c:88 [inline] > > > dump_stack_lvl+0x1e7/0x2e0 lib/dump_stack.c:106 > > > print_address_description mm/kasan/report.c:377 [inline] > > > print_report+0x169/0x550 mm/kasan/report.c:488 > > > kasan_report+0x143/0x180 mm/kasan/report.c:601 > > > __bpf_trace_run kernel/trace/bpf_trace.c:2376 [inline] > > > bpf_trace_run1+0xcb/0x510 kernel/trace/bpf_trace.c:2430 > > > __traceiter_rcu_utilization+0x74/0xb0 
include/trace/events/rcu.h:27 > > > trace_rcu_utilization+0x194/0x1c0 include/trace/events/rcu.h:27 > > > rcu_note_context_switch+0xc7c/0xff0 kernel/rcu/tree_plugin.h:360 > > > __schedule+0x345/0x4a20 kernel/sched/core.c:6635 > > > __schedule_loop kernel/sched/core.c:6813 [inline] > > > schedule+0x14b/0x320 kernel/sched/core.c:6828 > > > smpboot_thread_fn+0x61e/0xa30 kernel/smpboot.c:160 > > > kthread+0x2f0/0x390 kernel/kthread.c:388 > > > ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 > > > ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:243 > >
Re: [PATCH] uprobes: reduce contention on uprobes_tree access
On Mon, Mar 25, 2024 at 12:12 PM Jonathan Haslam wrote: > > Hi Ingo, > > > > This change has been tested against production workloads that exhibit > > > significant contention on the spinlock and an almost order of magnitude > > > reduction for mean uprobe execution time is observed (28 -> 3.5 > > > microsecs). > > > > Have you considered/measured per-CPU RW semaphores? > > No I hadn't but thanks hugely for suggesting it! In initial measurements > it seems to be between 20-100% faster than the RW spinlocks! Apologies for > all the exclamation marks but I'm very excited. I'll do some more testing > tomorrow but so far it's looking very good. > Documentation ([0]) says that locking for writing calls synchronize_rcu(), is that right? If that's true, attaching multiple uprobes (including just attaching a single BPF multi-uprobe) will take a really long time. We need to confirm we are not significantly regressing this. And if we do, we need to take measures in the BPF multi-uprobe attachment code path to make sure that a single multi-uprobe attachment is still fast. If my worries above turn out to be true, it still feels like a good first step should be landing this patch as is (and get it backported to older kernels), and then have percpu rw-semaphore as a final (and a bit more invasive) solution (it's RCU-based, so feels like a good primitive to settle on), making sure to not regress multi-uprobes (we'll probably need some batched API for multiple uprobes). Thoughts? [0] https://docs.kernel.org/locking/percpu-rw-semaphore.html > Thanks again for the input. > > Jon.
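The trade-off being debated here can be illustrated with a deliberately simplified, single-threaded model. This is not the kernel's percpu-rw-semaphore implementation, just a sketch of its shape: each reader touches only its own per-CPU counter (so the read side never bounces a shared cache line), while the writer must account for every CPU, and in the real `percpu_down_write()` additionally waits out an RCU grace period, which is exactly the attach-time cost being worried about above.

```c
#include <assert.h>

#define NCPU 4
static int reader_cnt[NCPU];   /* one counter per CPU (its own cache line in the real thing) */

static void down_read_cpu(int cpu) { reader_cnt[cpu]++; }  /* touches only local state */
static void up_read_cpu(int cpu)   { reader_cnt[cpu]--; }

static int writer_may_proceed(void)
{
        int cpu, sum = 0;

        /* the writer pays for reader cheapness: it must scan all CPUs,
         * and the real percpu_down_write() also waits for an RCU grace
         * period before this check can be trusted */
        for (cpu = 0; cpu < NCPU; cpu++)
                sum += reader_cnt[cpu];
        return sum == 0;
}

int main(void)
{
        down_read_cpu(1);
        assert(!writer_may_proceed());  /* writer blocked while any reader runs */
        up_read_cpu(1);
        assert(writer_may_proceed());   /* no readers anywhere: writer can go */
        return 0;
}
```

This asymmetry is why it fits uprobe hits (overwhelmingly read-side) so well, and also why attaching many uprobes — many write-side acquisitions — needs the batched API mentioned above to stay fast.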
Re: raw_tp+cookie is buggy. Was: [syzbot] [bpf?] [trace?] KASAN: slab-use-after-free Read in bpf_trace_run1
On Sun, Mar 24, 2024 at 5:07 PM Alexei Starovoitov wrote: > > Hi Andrii, > > syzbot found UAF in raw_tp cookie series in bpf-next. > Reverting the whole merge > 2e244a72cd48 ("Merge branch 'bpf-raw-tracepoint-support-for-bpf-cookie'") > > fixes the issue. > > Pls take a look. > See C reproducer below. It splats consistently with CONFIG_KASAN=y > > Thanks. Will do, traveling today, so will be offline for a bit, but will check first thing afterwards. > > On Sun, Mar 24, 2024 at 4:28 PM syzbot > wrote: > > > > Hello, > > > > syzbot found the following issue on: > > > > HEAD commit:520fad2e3206 selftests/bpf: scale benchmark counting by us.. > > git tree: bpf-next > > console+strace: https://syzkaller.appspot.com/x/log.txt?x=105af94618 > > kernel config: https://syzkaller.appspot.com/x/.config?x=6fb1be60a193d440 > > dashboard link: https://syzkaller.appspot.com/bug?extid=981935d9485a560bfbcb > > compiler: Debian clang version 15.0.6, GNU ld (GNU Binutils for > > Debian) 2.40 > > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=114f17a518 > > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=162bb7a518 > > > > Downloadable assets: > > disk image: > > https://storage.googleapis.com/syzbot-assets/4eef3506c5ce/disk-520fad2e.raw.xz > > vmlinux: > > https://storage.googleapis.com/syzbot-assets/24d60ebe76cc/vmlinux-520fad2e.xz > > kernel image: > > https://storage.googleapis.com/syzbot-assets/8f883e706550/bzImage-520fad2e.xz > > > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > > Reported-by: syzbot+981935d9485a560bf...@syzkaller.appspotmail.com > > > > == > > BUG: KASAN: slab-use-after-free in __bpf_trace_run > > kernel/trace/bpf_trace.c:2376 [inline] > > BUG: KASAN: slab-use-after-free in bpf_trace_run1+0xcb/0x510 > > kernel/trace/bpf_trace.c:2430 > > Read of size 8 at addr 8880290d9918 by task migration/0/19 > > > > CPU: 0 PID: 19 Comm: migration/0 Not tainted > > 6.8.0-syzkaller-05233-g520fad2e3206 #0 > > Hardware name: Google 
Google Compute Engine/Google Compute Engine, BIOS > > Google 02/29/2024 > > Stopper: 0x0 <- 0x0 > > Call Trace: > > > > __dump_stack lib/dump_stack.c:88 [inline] > > dump_stack_lvl+0x1e7/0x2e0 lib/dump_stack.c:106 > > print_address_description mm/kasan/report.c:377 [inline] > > print_report+0x169/0x550 mm/kasan/report.c:488 > > kasan_report+0x143/0x180 mm/kasan/report.c:601 > > __bpf_trace_run kernel/trace/bpf_trace.c:2376 [inline] > > bpf_trace_run1+0xcb/0x510 kernel/trace/bpf_trace.c:2430 > > __traceiter_rcu_utilization+0x74/0xb0 include/trace/events/rcu.h:27 > > trace_rcu_utilization+0x194/0x1c0 include/trace/events/rcu.h:27 > > rcu_note_context_switch+0xc7c/0xff0 kernel/rcu/tree_plugin.h:360 > > __schedule+0x345/0x4a20 kernel/sched/core.c:6635 > > __schedule_loop kernel/sched/core.c:6813 [inline] > > schedule+0x14b/0x320 kernel/sched/core.c:6828 > > smpboot_thread_fn+0x61e/0xa30 kernel/smpboot.c:160 > > kthread+0x2f0/0x390 kernel/kthread.c:388 > > ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147 > > ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:243 > > > > > > Allocated by task 5075: > > kasan_save_stack mm/kasan/common.c:47 [inline] > > kasan_save_track+0x3f/0x80 mm/kasan/common.c:68 > > poison_kmalloc_redzone mm/kasan/common.c:370 [inline] > > __kasan_kmalloc+0x98/0xb0 mm/kasan/common.c:387 > > kasan_kmalloc include/linux/kasan.h:211 [inline] > > kmalloc_trace+0x1d9/0x360 mm/slub.c:4012 > > kmalloc include/linux/slab.h:590 [inline] > > kzalloc include/linux/slab.h:711 [inline] > > bpf_raw_tp_link_attach+0x2a0/0x6e0 kernel/bpf/syscall.c:3816 > > bpf_raw_tracepoint_open+0x1c2/0x240 kernel/bpf/syscall.c:3863 > > __sys_bpf+0x3c0/0x810 kernel/bpf/syscall.c:5673 > > __do_sys_bpf kernel/bpf/syscall.c:5738 [inline] > > __se_sys_bpf kernel/bpf/syscall.c:5736 [inline] > > __x64_sys_bpf+0x7c/0x90 kernel/bpf/syscall.c:5736 > > do_syscall_64+0xfb/0x240 > > entry_SYSCALL_64_after_hwframe+0x6d/0x75 > > > > Freed by task 5075: > > kasan_save_stack 
mm/kasan/common.c:47 [inline] > > kasan_save_track+0x3f/0x80 mm/kasan/common.c:68 > > kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:589 > > poison_slab_object+0xa6/0xe0 mm/kasan/common.c:240 > > __kasan_slab_free+0x37/0x60 mm/kasan/common.c:256 > > kasan_slab_free include/linux/kasan.h:184 [inline] > > slab_free_hook mm/slub.c:2121 [inline] > > slab_free mm/slub.c:4299 [inline] > > kfree+0x14a/0x380 mm/slub.c:4409 > > bpf_link_release+0x3b/0x50 kernel/bpf/syscall.c:3071 > > __fput+0x429/0x8a0 fs/file_table.c:423 > > task_work_run+0x24f/0x310 kernel/task_work.c:180 > > exit_task_work include/linux/task_work.h:38 [inline] > > do_exit+0xa1b/0x27e0 kernel/exit.c:878 > > do_group_exit+0x207/0x2c0 kernel/exit.c:1027 > > __do_sys_exit_group kernel/exit.c:1038 [inline] > > __se_sys_exit_group kernel/exit.c:1036 [inline] > >
Re: [PATCH] ftrace: make extra rcu_is_watching() validation check optional
On Sun, Mar 24, 2024 at 7:38 PM Masami Hiramatsu wrote: > > On Fri, 22 Mar 2024 09:03:23 -0700 > Andrii Nakryiko wrote: > > > Introduce CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to > > control whether ftrace low-level code performs additional > > rcu_is_watching()-based validation logic in an attempt to catch noinstr > > violations. > > > > This check is expected to never be true in practice and would be best > > controlled with extra config to let users decide if they are willing to > > pay the price. > > Hmm, for me, it sounds like "WARN_ON(something) will never be true in practice > so disable it by default". I think CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING > is OK, but that should be set to Y by default. If you have already verified > that your system never makes it true and you want to optimize your ftrace > path, you can manually set it to N at your own risk. Yeah, I don't think we ever see this warning across our machines. And sure, I can default it to Y, no problem. > > > > > > Cc: Steven Rostedt > > Cc: Masami Hiramatsu > > Cc: Paul E. McKenney > > Signed-off-by: Andrii Nakryiko > > --- > > include/linux/trace_recursion.h | 2 +- > > kernel/trace/Kconfig| 13 + > > 2 files changed, 14 insertions(+), 1 deletion(-) > > > > diff --git a/include/linux/trace_recursion.h > > b/include/linux/trace_recursion.h > > index d48cd92d2364..24ea8ac049b4 100644 > > --- a/include/linux/trace_recursion.h > > +++ b/include/linux/trace_recursion.h > > @@ -135,7 +135,7 @@ extern void ftrace_record_recursion(unsigned long ip, > > unsigned long parent_ip); > > # define do_ftrace_record_recursion(ip, pip) do { } while (0) > > #endif > > > > -#ifdef CONFIG_ARCH_WANTS_NO_INSTR > > +#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING > > # define trace_warn_on_no_rcu(ip)\ > > ({ \ > > bool __ret = !rcu_is_watching();\ > > BTW, maybe we can add "unlikely" in the next "if" line? 
sure, can add that as well > > > diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig > > index 61c541c36596..19bce4e217d6 100644 > > --- a/kernel/trace/Kconfig > > +++ b/kernel/trace/Kconfig > > @@ -974,6 +974,19 @@ config FTRACE_RECORD_RECURSION_SIZE > > This file can be reset, but the limit can not change in > > size at runtime. > > > > +config FTRACE_VALIDATE_RCU_IS_WATCHING > > + bool "Validate RCU is on during ftrace recursion check" > > + depends on FUNCTION_TRACER > > + depends on ARCH_WANTS_NO_INSTR > > default y > ok > > + help > > + All callbacks that attach to the function tracing have some sort > > + of protection against recursion. This option performs additional > > + checks to make sure RCU is on when ftrace callbacks recurse. > > + > > + This will add more overhead to all ftrace-based invocations. > > ... invocations, but keep it safe. > > > + > > + If unsure, say N > > If unsure, say Y > yep, will do, thanks! > Thank you, > > > + > > config RING_BUFFER_RECORD_RECURSION > > bool "Record functions that recurse in the ring buffer" > > depends on FTRACE_RECORD_RECURSION > > -- > > 2.43.0 > > > > > -- > Masami Hiramatsu (Google)
[PATCH] ftrace: make extra rcu_is_watching() validation check optional
Introduce CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING config option to control whether ftrace low-level code performs additional rcu_is_watching()-based validation logic in an attempt to catch noinstr violations. This check is expected to never be true in practice and would be best controlled with extra config to let users decide if they are willing to pay the price. Cc: Steven Rostedt Cc: Masami Hiramatsu Cc: Paul E. McKenney Signed-off-by: Andrii Nakryiko --- include/linux/trace_recursion.h | 2 +- kernel/trace/Kconfig| 13 + 2 files changed, 14 insertions(+), 1 deletion(-) diff --git a/include/linux/trace_recursion.h b/include/linux/trace_recursion.h index d48cd92d2364..24ea8ac049b4 100644 --- a/include/linux/trace_recursion.h +++ b/include/linux/trace_recursion.h @@ -135,7 +135,7 @@ extern void ftrace_record_recursion(unsigned long ip, unsigned long parent_ip); # define do_ftrace_record_recursion(ip, pip) do { } while (0) #endif -#ifdef CONFIG_ARCH_WANTS_NO_INSTR +#ifdef CONFIG_FTRACE_VALIDATE_RCU_IS_WATCHING # define trace_warn_on_no_rcu(ip) \ ({ \ bool __ret = !rcu_is_watching();\ diff --git a/kernel/trace/Kconfig b/kernel/trace/Kconfig index 61c541c36596..19bce4e217d6 100644 --- a/kernel/trace/Kconfig +++ b/kernel/trace/Kconfig @@ -974,6 +974,19 @@ config FTRACE_RECORD_RECURSION_SIZE This file can be reset, but the limit can not change in size at runtime. +config FTRACE_VALIDATE_RCU_IS_WATCHING + bool "Validate RCU is on during ftrace recursion check" + depends on FUNCTION_TRACER + depends on ARCH_WANTS_NO_INSTR + help + All callbacks that attach to the function tracing have some sort + of protection against recursion. This option performs additional + checks to make sure RCU is on when ftrace callbacks recurse. + + This will add more overhead to all ftrace-based invocations. + + If unsure, say N + config RING_BUFFER_RECORD_RECURSION bool "Record functions that recurse in the ring buffer" depends on FTRACE_RECORD_RECURSION -- 2.43.0
Re: [PATCH] uprobes: reduce contention on uprobes_tree access
On Thu, Mar 21, 2024 at 7:57 AM Jonathan Haslam wrote: > > Active uprobes are stored in an RB tree and accesses to this tree are > dominated by read operations. Currently these accesses are serialized by > a spinlock but this leads to enormous contention when large numbers of > threads are executing active probes. > > This patch converts the spinlock used to serialize access to the > uprobes_tree RB tree into a reader-writer spinlock. This lock type > aligns naturally with the overwhelmingly read-only nature of the tree > usage here. Although the addition of reader-writer spinlocks is > discouraged [0], this fix is proposed as an interim solution while an > RCU based approach is implemented (that work is in a nascent form). This > fix also has the benefit of being trivial, self-contained and therefore > simple to backport. Yep, makes sense, I think we'll want to backport this ASAP to some of the old kernels we have. Thanks! Acked-by: Andrii Nakryiko > > This change has been tested against production workloads that exhibit > significant contention on the spinlock and an almost order of magnitude > reduction for mean uprobe execution time is observed (28 -> 3.5 microsecs). 
> > [0] https://docs.kernel.org/locking/spinlocks.html > > Signed-off-by: Jonathan Haslam > --- > kernel/events/uprobes.c | 22 +++--- > 1 file changed, 11 insertions(+), 11 deletions(-) > > diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c > index 929e98c62965..42bf9b6e8bc0 100644 > --- a/kernel/events/uprobes.c > +++ b/kernel/events/uprobes.c > @@ -39,7 +39,7 @@ static struct rb_root uprobes_tree = RB_ROOT; > */ > #define no_uprobe_events() RB_EMPTY_ROOT(&uprobes_tree) > > -static DEFINE_SPINLOCK(uprobes_treelock); /* serialize rbtree access */ > +static DEFINE_RWLOCK(uprobes_treelock); /* serialize rbtree access */ > > #define UPROBES_HASH_SZ 13 > /* serialize uprobe->pending_list */ > @@ -669,9 +669,9 @@ static struct uprobe *find_uprobe(struct inode *inode, > loff_t offset) > { > struct uprobe *uprobe; > > - spin_lock(&uprobes_treelock); > + read_lock(&uprobes_treelock); > uprobe = __find_uprobe(inode, offset); > - spin_unlock(&uprobes_treelock); > + read_unlock(&uprobes_treelock); > > return uprobe; > } > @@ -701,9 +701,9 @@ static struct uprobe *insert_uprobe(struct uprobe *uprobe) > { > struct uprobe *u; > > - spin_lock(&uprobes_treelock); > + write_lock(&uprobes_treelock); > u = __insert_uprobe(uprobe); > - spin_unlock(&uprobes_treelock); > + write_unlock(&uprobes_treelock); > > return u; > } > @@ -935,9 +935,9 @@ static void delete_uprobe(struct uprobe *uprobe) > if (WARN_ON(!uprobe_is_active(uprobe))) > return; > > - spin_lock(&uprobes_treelock); > + write_lock(&uprobes_treelock); > rb_erase(&uprobe->rb_node, &uprobes_tree); > - spin_unlock(&uprobes_treelock); > + write_unlock(&uprobes_treelock); > RB_CLEAR_NODE(&uprobe->rb_node); /* for uprobe_is_active() */ > put_uprobe(uprobe); > } > @@ -1298,7 +1298,7 @@ static void build_probe_list(struct inode *inode, > min = vaddr_to_offset(vma, start); > max = min + (end - start) - 1; > > - spin_lock(&uprobes_treelock); > + read_lock(&uprobes_treelock); > n = find_node_in_range(inode, min, max); > if (n) { > for (t = n; t; t = rb_prev(t)) { > @@ -1316,7 +1316,7 @@ static void build_probe_list(struct inode *inode, > get_uprobe(u); > } > } > - 
spin_unlock(&uprobes_treelock); > + read_unlock(&uprobes_treelock); > } > > /* @vma contains reference counter, not the probed instruction. */ > @@ -1407,9 +1407,9 @@ vma_has_uprobes(struct vm_area_struct *vma, unsigned > long start, unsigned long e > min = vaddr_to_offset(vma, start); > max = min + (end - start) - 1; > > - spin_lock(&uprobes_treelock); > + read_lock(&uprobes_treelock); > n = find_node_in_range(inode, min, max); > - spin_unlock(&uprobes_treelock); > + read_unlock(&uprobes_treelock); > > return !!n; > } > -- > 2.43.0 >
Re: [PATCH v2 0/3] uprobes: two common case speed ups
On Mon, Mar 18, 2024 at 9:21 PM Masami Hiramatsu wrote: > > Hi, > > On Mon, 18 Mar 2024 11:17:25 -0700 > Andrii Nakryiko wrote: > > > This patch set implements two speed ups for uprobe/uretprobe runtime > > execution > > path for some common scenarios: BPF-only uprobes (patches #1 and #2) and > > system-wide (non-PID-specific) uprobes (patch #3). Please see individual > > patches for details. > > This series looks good to me. Let me pick it on probes/for-next. Great, at least I guessed the Git repo right, if not the branch. Thanks for pulling it in! I assume some other uprobe-related follow up patches should be based on probes/for-next as well, right? > > Thanks! > > > > > v1->v2: > > - rebased onto trace/core branch of tracing tree, hopefully I guessed > > right; > > - simplified user_cpu_buffer usage further (Oleg Nesterov); > > - simplified patch #3, just moved speculative check outside of lock > > (Oleg); > > - added Reviewed-by from Jiri Olsa. > > > > Andrii Nakryiko (3): > > uprobes: encapsulate preparation of uprobe args buffer > > uprobes: prepare uprobe args buffer lazily > > uprobes: add speculative lockless system-wide uprobe filter check > > > > kernel/trace/trace_uprobe.c | 103 +--- > > 1 file changed, 59 insertions(+), 44 deletions(-) > > > > -- > > 2.43.0 > > > > > -- > Masami Hiramatsu (Google)
[PATCH v2 3/3] uprobes: add speculative lockless system-wide uprobe filter check
It's very common with BPF-based uprobe/uretprobe use cases to have system-wide (not PID-specific) probes used. In this case uprobe's trace_uprobe_filter->nr_systemwide counter is bumped at registration time, and actual filtering is short-circuited at the time when uprobe/uretprobe is triggered. This is a great optimization, and the only issue with it is that to even get to checking this counter the uprobe subsystem takes the read-side trace_uprobe_filter->rwlock. This is actually noticeable in profiles and is just another point of contention when uprobe is triggered on multiple CPUs simultaneously. This patch moves this nr_systemwide check outside of filter list's rwlock scope, as rwlock is meant to protect list modification, while nr_systemwide-based check is speculative and racy already, despite the lock (as discussed in [0]). trace_uprobe_filter_remove() and trace_uprobe_filter_add() already check for filter->nr_systemwide explicitly outside of __uprobe_perf_filter, so no modifications are required there. This is confirmed with BPF selftests-based benchmarks. BEFORE (based on changes in previous patch) === uprobe-nop :2.732 ± 0.022M/s uprobe-push:2.621 ± 0.016M/s uprobe-ret :1.105 ± 0.007M/s uretprobe-nop :1.396 ± 0.007M/s uretprobe-push :1.347 ± 0.008M/s uretprobe-ret :0.800 ± 0.006M/s AFTER = uprobe-nop :2.878 ± 0.017M/s (+5.5%, total +8.3%) uprobe-push:2.753 ± 0.013M/s (+5.3%, total +10.2%) uprobe-ret :1.142 ± 0.010M/s (+3.8%, total +3.8%) uretprobe-nop :1.444 ± 0.008M/s (+3.5%, total +6.5%) uretprobe-push :1.410 ± 0.010M/s (+4.8%, total +7.1%) uretprobe-ret :0.816 ± 0.002M/s (+2.0%, total +3.9%) In the above, the first percentage value is based on top of the previous patch (lazy uprobe buffer optimization), while the "total" percentage is based on a kernel without any of the changes in this patch set. As can be seen, we get about 4% - 10% speed up, in total, with both lazy uprobe buffer and speculative filter check optimizations. 
[0] https://lore.kernel.org/bpf/20240313131926.ga19...@redhat.com/ Reviewed-by: Jiri Olsa Signed-off-by: Andrii Nakryiko --- kernel/trace/trace_uprobe.c | 10 +++--- 1 file changed, 7 insertions(+), 3 deletions(-) diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c index b5da95240a31..ac05885a6ce6 100644 --- a/kernel/trace/trace_uprobe.c +++ b/kernel/trace/trace_uprobe.c @@ -1226,9 +1226,6 @@ __uprobe_perf_filter(struct trace_uprobe_filter *filter, struct mm_struct *mm) { struct perf_event *event; - if (filter->nr_systemwide) - return true; - list_for_each_entry(event, &filter->perf_events, hw.tp_list) { if (event->hw.target->mm == mm) return true; @@ -1353,6 +1350,13 @@ static bool uprobe_perf_filter(struct uprobe_consumer *uc, tu = container_of(uc, struct trace_uprobe, consumer); filter = tu->tp.event->filter; + /* +* speculative short-circuiting check to avoid unnecessarily taking +* filter->rwlock below, if the uprobe has system-wide consumer +*/ + if (READ_ONCE(filter->nr_systemwide)) + return true; + read_lock(&filter->rwlock); ret = __uprobe_perf_filter(filter, mm); read_unlock(&filter->rwlock); -- 2.43.0
[PATCH v2 2/3] uprobes: prepare uprobe args buffer lazily
uprobe_cpu_buffer and corresponding logic to store uprobe args into it are used for uprobes/uretprobes that are created through tracefs or perf events. BPF is yet another user of uprobe/uretprobe infrastructure, but doesn't need uprobe_cpu_buffer and associated data. For BPF-only use cases this buffer handling and preparation is pure overhead. At the same time, BPF-only uprobe/uretprobe usage is very common in practice. Also, for a lot of cases applications are very sensitive to performance overheads, as they might be tracing very high-frequency functions like malloc()/free(), so every bit of performance improvement matters. All that is to say that this uprobe_cpu_buffer preparation is an unnecessary overhead that each BPF user of uprobes/uretprobe has to pay. This patch changes this by making uprobe_cpu_buffer preparation optional. It will happen only if either a tracefs-based or perf event-based uprobe/uretprobe consumer is registered for a given uprobe/uretprobe. For BPF-only use cases this step will be skipped. We used the uprobe/uretprobe benchmark which is part of BPF selftests (see [0]) to estimate the improvements. We have 3 uprobe and 3 uretprobe scenarios, which vary the instruction that is replaced by the uprobe: nop (fastest uprobe case), `push rbp` (typical case), and non-simulated `ret` instruction (slowest case). The benchmark thread is constantly calling a user space function in a tight loop. The user space function has an attached BPF uprobe or uretprobe program doing nothing but atomic counter increments to count the number of triggering calls. The benchmark emits throughput in millions of executions per second. 
BEFORE these changes
====================
uprobe-nop      :    2.657 ± 0.024M/s
uprobe-push     :    2.499 ± 0.018M/s
uprobe-ret      :    1.100 ± 0.006M/s
uretprobe-nop   :    1.356 ± 0.004M/s
uretprobe-push  :    1.317 ± 0.019M/s
uretprobe-ret   :    0.785 ± 0.007M/s

AFTER these changes
===================
uprobe-nop      :    2.732 ± 0.022M/s (+2.8%)
uprobe-push     :    2.621 ± 0.016M/s (+4.9%)
uprobe-ret      :    1.105 ± 0.007M/s (+0.5%)
uretprobe-nop   :    1.396 ± 0.007M/s (+2.9%)
uretprobe-push  :    1.347 ± 0.008M/s (+2.3%)
uretprobe-ret   :    0.800 ± 0.006M/s (+1.9%)

So the improvements on this particular machine seem to be between 2% and 5%.

[0] https://github.com/torvalds/linux/blob/master/tools/testing/selftests/bpf/benchs/bench_trigger.c

Reviewed-by: Jiri Olsa
Signed-off-by: Andrii Nakryiko
---
 kernel/trace/trace_uprobe.c | 49 +
 1 file changed, 28 insertions(+), 21 deletions(-)

diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index 9bffaab448a6..b5da95240a31 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -941,15 +941,21 @@ static struct uprobe_cpu_buffer *uprobe_buffer_get(void)

 static void uprobe_buffer_put(struct uprobe_cpu_buffer *ucb)
 {
+       if (!ucb)
+               return;
+
        mutex_unlock(&ucb->mutex);
 }

 static struct uprobe_cpu_buffer *prepare_uprobe_buffer(struct trace_uprobe *tu,
-                                                      struct pt_regs *regs)
+                                                      struct pt_regs *regs,
+                                                      struct uprobe_cpu_buffer **ucbp)
 {
        struct uprobe_cpu_buffer *ucb;
        int dsize, esize;

+       if (*ucbp)
+               return *ucbp;
+
        esize = SIZEOF_TRACE_ENTRY(is_ret_probe(tu));
        dsize = __get_data_size(&tu->tp, regs);

@@ -958,22 +964,25 @@ static struct uprobe_cpu_buffer *prepare_uprobe_buffer(struct trace_uprobe *tu,

        store_trace_args(ucb->buf, &tu->tp, regs, esize, dsize);

+       *ucbp = ucb;
        return ucb;
 }

 static void __uprobe_trace_func(struct trace_uprobe *tu,
                                unsigned long func, struct pt_regs *regs,
-                               struct uprobe_cpu_buffer *ucb,
+                               struct uprobe_cpu_buffer **ucbp,
                                struct trace_event_file *trace_file)
 {
        struct uprobe_trace_entry_head *entry;
        struct trace_event_buffer fbuffer;
+       struct uprobe_cpu_buffer *ucb;
        void *data;
        int size, esize;
        struct trace_event_call *call = trace_probe_event_call(&tu->tp);

        WARN_ON(call != trace_file->event_call);

+       ucb = prepare_uprobe_buffer(tu, regs, ucbp);
        if (WARN_ON_ONCE(ucb->dsize > PAGE_SIZE))
                return;

@@ -1002,7 +1011,7 @@ static void __uprobe_trace_func(struct trace_uprobe *tu,

 /* uprobe handler */
 static int uprobe_trace_func(struct trace_uprobe *tu, struct pt_regs *regs,
-                            struct uprobe_cpu_buffer *ucb)
+                            struct uprobe_cpu_buffer **ucbp)
 {
        struct event_file_link *link;

@@ -1011,7 +1020,7 @@ static int uprobe_trace_func(struct trace_uprobe *tu, struct pt_regs *regs,
[PATCH v2 1/3] uprobes: encapsulate preparation of uprobe args buffer
Move the logic of fetching the temporary per-CPU uprobe buffer and storing uprobe args into it to a new helper function. Store data size as part of this buffer, simplifying interfaces a bit, as now we only pass a single uprobe_cpu_buffer reference around, instead of pointer + dsize. This logic was duplicated across uprobe_dispatcher and uretprobe_dispatcher, and now will be centralized. All this is also in preparation to make this uprobe_cpu_buffer handling logic optional in the next patch.

Reviewed-by: Jiri Olsa
Signed-off-by: Andrii Nakryiko
---
 kernel/trace/trace_uprobe.c | 78 +++--
 1 file changed, 41 insertions(+), 37 deletions(-)

diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index a84b85d8aac1..9bffaab448a6 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -854,6 +854,7 @@ static const struct file_operations uprobe_profile_ops = {
 struct uprobe_cpu_buffer {
        struct mutex mutex;
        void *buf;
+       int dsize;
 };
 static struct uprobe_cpu_buffer __percpu *uprobe_cpu_buffer;
 static int uprobe_buffer_refcnt;
@@ -943,9 +944,26 @@ static void uprobe_buffer_put(struct uprobe_cpu_buffer *ucb)
        mutex_unlock(&ucb->mutex);
 }

+static struct uprobe_cpu_buffer *prepare_uprobe_buffer(struct trace_uprobe *tu,
+                                                      struct pt_regs *regs)
+{
+       struct uprobe_cpu_buffer *ucb;
+       int dsize, esize;
+
+       esize = SIZEOF_TRACE_ENTRY(is_ret_probe(tu));
+       dsize = __get_data_size(&tu->tp, regs);
+
+       ucb = uprobe_buffer_get();
+       ucb->dsize = tu->tp.size + dsize;
+
+       store_trace_args(ucb->buf, &tu->tp, regs, esize, dsize);
+
+       return ucb;
+}
+
 static void __uprobe_trace_func(struct trace_uprobe *tu,
                                unsigned long func, struct pt_regs *regs,
-                               struct uprobe_cpu_buffer *ucb, int dsize,
+                               struct uprobe_cpu_buffer *ucb,
                                struct trace_event_file *trace_file)
 {
        struct uprobe_trace_entry_head *entry;
@@ -956,14 +974,14 @@ static void __uprobe_trace_func(struct trace_uprobe *tu,

        WARN_ON(call != trace_file->event_call);

-       if (WARN_ON_ONCE(tu->tp.size + dsize > PAGE_SIZE))
+       if (WARN_ON_ONCE(ucb->dsize > PAGE_SIZE))
                return;

        if (trace_trigger_soft_disabled(trace_file))
                return;

        esize = SIZEOF_TRACE_ENTRY(is_ret_probe(tu));
-       size = esize + tu->tp.size + dsize;
+       size = esize + ucb->dsize;
        entry = trace_event_buffer_reserve(&fbuffer, trace_file, size);
        if (!entry)
                return;
@@ -977,14 +995,14 @@ static void __uprobe_trace_func(struct trace_uprobe *tu,
                data = DATAOF_TRACE_ENTRY(entry, false);
        }

-       memcpy(data, ucb->buf, tu->tp.size + dsize);
+       memcpy(data, ucb->buf, ucb->dsize);

        trace_event_buffer_commit(&fbuffer);
 }

 /* uprobe handler */
 static int uprobe_trace_func(struct trace_uprobe *tu, struct pt_regs *regs,
-                            struct uprobe_cpu_buffer *ucb, int dsize)
+                            struct uprobe_cpu_buffer *ucb)
 {
        struct event_file_link *link;

@@ -993,7 +1011,7 @@ static int uprobe_trace_func(struct trace_uprobe *tu, struct pt_regs *regs,
        rcu_read_lock();
        trace_probe_for_each_link_rcu(link, &tu->tp)
-               __uprobe_trace_func(tu, 0, regs, ucb, dsize, link->file);
+               __uprobe_trace_func(tu, 0, regs, ucb, link->file);
        rcu_read_unlock();

        return 0;
@@ -1001,13 +1019,13 @@ static int uprobe_trace_func(struct trace_uprobe *tu, struct pt_regs *regs,

 static void uretprobe_trace_func(struct trace_uprobe *tu, unsigned long func,
                                 struct pt_regs *regs,
-                                struct uprobe_cpu_buffer *ucb, int dsize)
+                                struct uprobe_cpu_buffer *ucb)
 {
        struct event_file_link *link;

        rcu_read_lock();
        trace_probe_for_each_link_rcu(link, &tu->tp)
-               __uprobe_trace_func(tu, func, regs, ucb, dsize, link->file);
+               __uprobe_trace_func(tu, func, regs, ucb, link->file);
        rcu_read_unlock();
 }

@@ -1335,7 +1353,7 @@ static bool uprobe_perf_filter(struct uprobe_consumer *uc,

 static void __uprobe_perf_func(struct trace_uprobe *tu,
                               unsigned long func, struct pt_regs *regs,
-                              struct uprobe_cpu_buffer *ucb, int dsize)
+                              struct uprobe_cpu_buffer *ucb)
 {
        struct trace_event_call *call = trace_probe_event_call(&tu->tp);
        struct uprobe_trace_entry_head *entry;
@@ -1356,7 +1374,7 @@ static void __uprobe_perf_func(struct trace_uprobe *tu,

        esize = SIZEOF_TRACE_ENTRY(is_ret_probe(tu));
-
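The interface simplification in the patch above — bundling the buffer pointer and its valid size into one struct so callers stop passing a separate dsize argument everywhere — is a common refactoring. A tiny illustrative sketch (hypothetical names, not the kernel code):

```c
#include <string.h>

/* before: callers had to pass (buf, dsize) pairs around;
 * after: the size travels with the buffer it describes */
struct cpu_buffer {
    char buf[128];
    int dsize;       /* number of valid bytes in buf */
};

/* a consumer now needs only the one reference; it cannot be handed a
 * size that disagrees with the buffer it was paired with */
int emit_event(const struct cpu_buffer *ucb, char *out, int out_cap)
{
    if (ucb->dsize > out_cap)
        return -1;   /* analogous to the WARN_ON_ONCE(size > PAGE_SIZE) guard */
    memcpy(out, ucb->buf, ucb->dsize);
    return ucb->dsize;
}
```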
[PATCH v2 0/3] uprobes: two common case speed ups
This patch set implements two speed ups for the uprobe/uretprobe runtime execution path for some common scenarios: BPF-only uprobes (patches #1 and #2) and system-wide (non-PID-specific) uprobes (patch #3). Please see individual patches for details.

v1->v2:
- rebased onto trace/core branch of tracing tree, hopefully I guessed right;
- simplified user_cpu_buffer usage further (Oleg Nesterov);
- simplified patch #3, just moved speculative check outside of lock (Oleg);
- added Reviewed-by from Jiri Olsa.

Andrii Nakryiko (3):
  uprobes: encapsulate preparation of uprobe args buffer
  uprobes: prepare uprobe args buffer lazily
  uprobes: add speculative lockless system-wide uprobe filter check

 kernel/trace/trace_uprobe.c | 103 +---
 1 file changed, 59 insertions(+), 44 deletions(-)

--
2.43.0
Re: [PATCH bpf-next 0/3] uprobes: two common case speed ups
On Wed, Mar 13, 2024 at 2:41 AM Jiri Olsa wrote:
>
> On Tue, Mar 12, 2024 at 02:02:30PM -0700, Andrii Nakryiko wrote:
> > This patch set implements two speed ups for uprobe/uretprobe runtime execution
> > path for some common scenarios: BPF-only uprobes (patches #1 and #2) and
> > system-wide (non-PID-specific) uprobes (patch #3). Please see individual
> > patches for details.
> >
> > Given I haven't worked with uprobe code before, I'm unfamiliar with
> > conventions in this subsystem, including which kernel tree patches should be
> > sent to. For now I based all the changes on top of bpf-next/master, which is
> > where I tested and benchmarked everything anyways. Please advise what should
> > I use as a base for subsequent revision. Thanks.

Steven, Masami,

Is this the kind of patches that should go through your tree(s)? Or you'd
be fine with this going through bpf-next? I'd appreciate the link to the
specific GIT repo I should use as a base in the former case, thank you!

> >
> > Andrii Nakryiko (3):
> >   uprobes: encapsulate preparation of uprobe args buffer
> >   uprobes: prepare uprobe args buffer lazily
> >   uprobes: add speculative lockless system-wide uprobe filter check
>
> nice cleanup and speed up, lgtm
>
> Reviewed-by: Jiri Olsa
>
> jirka
>
> >
> >  kernel/trace/trace_uprobe.c | 103 ++--
> >  1 file changed, 63 insertions(+), 40 deletions(-)
> >
> > --
> > 2.43.0
> >
> >
Re: [PATCH bpf-next 3/3] uprobes: add speculative lockless system-wide uprobe filter check
On Wed, Mar 13, 2024 at 6:20 AM Oleg Nesterov wrote:
>
> I forgot everything about this code, plus it has changed a lot since
> I looked at it many years ago, but ...
>
> I think this change is fine but the changelog looks a bit confusing
> (overcomplicated) to me.

It's a new piece of code and logic, so I tried to do my due diligence and
argue why I think it's fine. I'll drop the overcomplicated explanation, as
I agree with you that it's inherently racy even without my changes (and
use-after-free safety is provided with uprobe->register_rwsem independent
from all this).

>
> On 03/12, Andrii Nakryiko wrote:
> >
> > This patch adds a speculative check before grabbing that rwlock. If
> > nr_systemwide is non-zero, lock is skipped and event is passed through.
> > From examining existing logic it looks correct and safe to do. If
> > nr_systemwide is being modified under rwlock in parallel, we have to
> > consider basically just one important race condition: the case when
> > nr_systemwide is dropped from one to zero (from
> > trace_uprobe_filter_remove()) under filter->rwlock, but
> > uprobe_perf_filter() raced and saw it as >0.
>
> Unless I am totally confused, there is nothing new. Even without
> this change trace_uprobe_filter_remove() can clear nr_systemwide
> right after uprobe_perf_filter() drops filter->rwlock.
>
> And of course, trace_uprobe_filter_add() can change nr_systemwide
> from 0 to 1. In this case uprobe_perf_func() can "wrongly" return
> UPROBE_HANDLER_REMOVE but we can't avoid this and afaics this is
> fine even if handler_chain() does unapply_uprobe(), uprobe_perf_open()
> will do uprobe_apply() after that, we can rely on ->register_rwsem.
>

yep, agreed

> > In case we speculatively read nr_systemwide as zero, while it was
> > incremented in parallel, we'll proceed to grabbing filter->rwlock and
> > re-doing the check, this time in lock-protected and non-racy way.
>
> See above...
>
>
> So I think uprobe_perf_filter() needs filter->rwlock only to iterate
> the list, it can check nr_systemwide lockless and this means that you
> can also remove the same check in __uprobe_perf_filter(), other callers
> trace_uprobe_filter_add/remove check it themselves.
>

makes sense, will do

> > --- a/kernel/trace/trace_uprobe.c
> > +++ b/kernel/trace/trace_uprobe.c
> > @@ -1351,6 +1351,10 @@ static bool uprobe_perf_filter(struct uprobe_consumer *uc,
> >     tu = container_of(uc, struct trace_uprobe, consumer);
> >     filter = tu->tp.event->filter;
> >
> > +   /* speculative check */
> > +   if (READ_ONCE(filter->nr_systemwide))
> > +           return true;
> > +
> >     read_lock(&filter->rwlock);
> >     ret = __uprobe_perf_filter(filter, mm);
> >     read_unlock(&filter->rwlock);
>
> ACK,
>
> but see above. I think the changelog should be simplified and the
> filter->nr_systemwide check in __uprobe_perf_filter() should be
> removed. But I won't insist and perhaps I missed something...
>

I think you are right, I'll move the check

> Oleg.
>
Re: [PATCH bpf-next 2/3] uprobes: prepare uprobe args buffer lazily
On Wed, Mar 13, 2024 at 8:48 AM Oleg Nesterov wrote:
>
> Again, looks good to me, but I have a minor nit. Feel free to ignore.
>
> On 03/12, Andrii Nakryiko wrote:
> >
> >  static void __uprobe_trace_func(struct trace_uprobe *tu,
> >                             unsigned long func, struct pt_regs *regs,
> > -                           struct uprobe_cpu_buffer *ucb,
> > +                           struct uprobe_cpu_buffer **ucbp,
> >                             struct trace_event_file *trace_file)
> >  {
> >     struct uprobe_trace_entry_head *entry;
> >     struct trace_event_buffer fbuffer;
> > +   struct uprobe_cpu_buffer *ucb;
> >     void *data;
> >     int size, esize;
> >     struct trace_event_call *call = trace_probe_event_call(&tu->tp);
> >
> > +   ucb = *ucbp;
> > +   if (!ucb) {
> > +           ucb = prepare_uprobe_buffer(tu, regs);
> > +           *ucbp = ucb;
> > +   }
>
> perhaps it would be more clean to pass ucbp to prepare_uprobe_buffer()
> and change it to do
>
>       if (*ucbp)
>               return *ucbp;
>
> at the start. Then __uprobe_trace_func() and __uprobe_perf_func() can
> simply do
>
>       ucb = prepare_uprobe_buffer(tu, regs, ucbp);

ok, will do

> > -   uprobe_buffer_put(ucb);
> > +   if (ucb)
> > +           uprobe_buffer_put(ucb);
>
> Similarly, I think the "ucb != NULL" check should be shifted into
> uprobe_buffer_put().

sure, will hide it inside uprobe_buffer_put()

> Oleg.
>
Re: [PATCH bpf-next 1/3] uprobes: encapsulate preparation of uprobe args buffer
On Wed, Mar 13, 2024 at 8:16 AM Oleg Nesterov wrote:
>
> LGTM, one nit below.
>
> On 03/12, Andrii Nakryiko wrote:
> >
> > +static struct uprobe_cpu_buffer *prepare_uprobe_buffer(struct trace_uprobe *tu,
> > +                                                  struct pt_regs *regs)
> > +{
> > +   struct uprobe_cpu_buffer *ucb;
> > +   int dsize, esize;
> > +
> > +   esize = SIZEOF_TRACE_ENTRY(is_ret_probe(tu));
> > +   dsize = __get_data_size(&tu->tp, regs);
> > +
> > +   ucb = uprobe_buffer_get();
> > +   ucb->dsize = dsize;
> > +
> > +   store_trace_args(ucb->buf, &tu->tp, regs, esize, dsize);
> > +
> > +   return ucb;
> > +}
>
> OK, but note that every user of ->dsize adds tp.size. So I think you can
> simplify this code a bit more if you change prepare_uprobe_buffer() to do
>
>       ucb->dsize = tu->tp.size + dsize;
>
> and update the users.

makes sense, done

> Oleg.
>
[PATCH bpf-next 3/3] uprobes: add speculative lockless system-wide uprobe filter check
It's very common with BPF-based uprobe/uretprobe use cases to have system-wide (not PID-specific) probes used. In this case uprobe's trace_uprobe_filter->nr_systemwide counter is bumped at registration time, and actual filtering is short-circuited at the time when the uprobe/uretprobe is triggered.

This is a great optimization, and the only issue with it is that to even get to checking this counter the uprobe subsystem takes the read-side trace_uprobe_filter->rwlock. This is actually noticeable in profiles and is just another point of contention when a uprobe is triggered on multiple CPUs simultaneously.

This patch adds a speculative check before grabbing that rwlock. If nr_systemwide is non-zero, the lock is skipped and the event is passed through. From examining the existing logic it looks correct and safe to do. If nr_systemwide is being modified under the rwlock in parallel, we have to consider basically just one important race condition: the case when nr_systemwide is dropped from one to zero (from trace_uprobe_filter_remove()) under filter->rwlock, but uprobe_perf_filter() raced and saw it as >0.

In this case, we'll proceed with uprobe/uretprobe execution, while uprobe_perf_close() and uprobe_apply() will be blocked on trying to grab uprobe->register_rwsem as a writer. It will be blocked because uprobe_dispatcher() (and, similarly, uretprobe_dispatcher()) runs with uprobe->register_rwsem taken as a reader. So there is no real race besides uprobe/uretprobe might execute one last time before it's removed, which is fine because from the user space perspective the uprobe/uretprobe hasn't yet been deactivated.

In case we speculatively read nr_systemwide as zero, while it was incremented in parallel, we'll proceed to grabbing filter->rwlock and re-doing the check, this time in a lock-protected and non-racy way.
As such, it looks safe to do a quick short-circuiting check and save some performance in the very common system-wide case, not sacrificing hot path performance due to the much rarer possibility of registration or unregistration of uprobes.

Again, confirming with BPF selftests-based benchmarks.

BEFORE (based on changes in previous patch)
===========================================
uprobe-nop      :    2.732 ± 0.022M/s
uprobe-push     :    2.621 ± 0.016M/s
uprobe-ret      :    1.105 ± 0.007M/s
uretprobe-nop   :    1.396 ± 0.007M/s
uretprobe-push  :    1.347 ± 0.008M/s
uretprobe-ret   :    0.800 ± 0.006M/s

AFTER
=====
uprobe-nop      :    2.878 ± 0.017M/s (+5.5%, total +8.3%)
uprobe-push     :    2.753 ± 0.013M/s (+5.3%, total +10.2%)
uprobe-ret      :    1.142 ± 0.010M/s (+3.8%, total +3.8%)
uretprobe-nop   :    1.444 ± 0.008M/s (+3.5%, total +6.5%)
uretprobe-push  :    1.410 ± 0.010M/s (+4.8%, total +7.1%)
uretprobe-ret   :    0.816 ± 0.002M/s (+2.0%, total +3.9%)

In the above, the first percentage value is based on top of the previous patch (lazy uprobe buffer optimization), while the "total" percentage is based on the kernel without any of the changes in this patch set. As can be seen, we get about a 4% - 10% speed up, in total, with both lazy uprobe buffer and speculative filter check optimizations.

Signed-off-by: Andrii Nakryiko
---
 kernel/trace/trace_uprobe.c | 4
 1 file changed, 4 insertions(+)

diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index f2875349d124..be28e6d0578e 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -1351,6 +1351,10 @@ static bool uprobe_perf_filter(struct uprobe_consumer *uc,
        tu = container_of(uc, struct trace_uprobe, consumer);
        filter = tu->tp.event->filter;

+       /* speculative check */
+       if (READ_ONCE(filter->nr_systemwide))
+               return true;
+
        read_lock(&filter->rwlock);
        ret = __uprobe_perf_filter(filter, mm);
        read_unlock(&filter->rwlock);
--
2.43.0
[PATCH bpf-next 2/3] uprobes: prepare uprobe args buffer lazily
uprobe_cpu_buffer and the corresponding logic to store uprobe args into it are used for uprobes/uretprobes that are created through tracefs or perf events.

BPF is yet another user of the uprobe/uretprobe infrastructure, but doesn't need uprobe_cpu_buffer and associated data. For BPF-only use cases this buffer handling and preparation is pure overhead. At the same time, BPF-only uprobe/uretprobe usage is very common in practice. Also, in a lot of cases applications are very sensitive to performance overheads, as they might be tracing very high-frequency functions like malloc()/free(), so every bit of performance improvement matters.

All that is to say that this uprobe_cpu_buffer preparation is an unnecessary overhead that each BPF user of uprobes/uretprobes has to pay. This patch changes this by making uprobe_cpu_buffer preparation optional. It will happen only if either a tracefs-based or perf event-based uprobe/uretprobe consumer is registered for a given uprobe/uretprobe. For BPF-only use cases this step will be skipped.

We used the uprobe/uretprobe benchmark which is part of BPF selftests (see [0]) to estimate the improvements. We have 3 uprobe and 3 uretprobe scenarios, which vary the instruction that is replaced by the uprobe: nop (fastest uprobe case), `push rbp` (typical case), and a non-simulated `ret` instruction (slowest case). The benchmark thread constantly calls a user space function in a tight loop. The user space function has an attached BPF uprobe or uretprobe program doing nothing but atomic counter increments to count the number of triggering calls. The benchmark emits throughput in millions of executions per second.
BEFORE these changes
====================
uprobe-nop      :    2.657 ± 0.024M/s
uprobe-push     :    2.499 ± 0.018M/s
uprobe-ret      :    1.100 ± 0.006M/s
uretprobe-nop   :    1.356 ± 0.004M/s
uretprobe-push  :    1.317 ± 0.019M/s
uretprobe-ret   :    0.785 ± 0.007M/s

AFTER these changes
===================
uprobe-nop      :    2.732 ± 0.022M/s (+2.8%)
uprobe-push     :    2.621 ± 0.016M/s (+4.9%)
uprobe-ret      :    1.105 ± 0.007M/s (+0.5%)
uretprobe-nop   :    1.396 ± 0.007M/s (+2.9%)
uretprobe-push  :    1.347 ± 0.008M/s (+2.3%)
uretprobe-ret   :    0.800 ± 0.006M/s (+1.9%)

So the improvements on this particular machine seem to be between 2% and 5%.

[0] https://github.com/torvalds/linux/blob/master/tools/testing/selftests/bpf/benchs/bench_trigger.c

Signed-off-by: Andrii Nakryiko
---
 kernel/trace/trace_uprobe.c | 56 ++---
 1 file changed, 34 insertions(+), 22 deletions(-)

diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index a0f60bb10158..f2875349d124 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -963,15 +963,22 @@ static struct uprobe_cpu_buffer *prepare_uprobe_buffer(struct trace_uprobe *tu,

 static void __uprobe_trace_func(struct trace_uprobe *tu,
                                unsigned long func, struct pt_regs *regs,
-                               struct uprobe_cpu_buffer *ucb,
+                               struct uprobe_cpu_buffer **ucbp,
                                struct trace_event_file *trace_file)
 {
        struct uprobe_trace_entry_head *entry;
        struct trace_event_buffer fbuffer;
+       struct uprobe_cpu_buffer *ucb;
        void *data;
        int size, esize;
        struct trace_event_call *call = trace_probe_event_call(&tu->tp);

+       ucb = *ucbp;
+       if (!ucb) {
+               ucb = prepare_uprobe_buffer(tu, regs);
+               *ucbp = ucb;
+       }
+
        WARN_ON(call != trace_file->event_call);

        if (WARN_ON_ONCE(tu->tp.size + ucb->dsize > PAGE_SIZE))
@@ -1002,7 +1009,7 @@ static void __uprobe_trace_func(struct trace_uprobe *tu,

 /* uprobe handler */
 static int uprobe_trace_func(struct trace_uprobe *tu, struct pt_regs *regs,
-                            struct uprobe_cpu_buffer *ucb)
+                            struct uprobe_cpu_buffer **ucbp)
 {
        struct event_file_link *link;

@@ -1011,7 +1018,7 @@ static int uprobe_trace_func(struct trace_uprobe *tu, struct pt_regs *regs,
        rcu_read_lock();
        trace_probe_for_each_link_rcu(link, &tu->tp)
-               __uprobe_trace_func(tu, 0, regs, ucb, link->file);
+               __uprobe_trace_func(tu, 0, regs, ucbp, link->file);
        rcu_read_unlock();

        return 0;
@@ -1019,13 +1026,13 @@ static int uprobe_trace_func(struct trace_uprobe *tu, struct pt_regs *regs,

 static void uretprobe_trace_func(struct trace_uprobe *tu, unsigned long func,
                                 struct pt_regs *regs,
-                                struct uprobe_cpu_buffer *ucb)
+                                struct uprobe_cpu_buffer **ucbp)
 {
        struct event_file_link *link;

        rcu_read_lock();
        trace_probe_for_each_link_rcu(link, &tu->tp)
-               __uprobe_trace_func(tu, func, regs, ucb, link->file);
+               __uprobe_trace_func(tu
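The lazy-initialization scheme in the patch above — prepare the expensive buffer only on first use, cache it through a double pointer so later consumers in the same invocation reuse it, and make the put path NULL-tolerant — can be sketched in plain userspace C. This is a hypothetical simplification with stand-in types, not the kernel code:

```c
#include <stdlib.h>

struct cpu_buffer {
    char *buf;
    int dsize;
};

static int prepare_calls;      /* instrumentation for this example only */

/* return the prepared buffer, building it only on the first call
 * per invocation; *ucbp caches the result for later consumers */
struct cpu_buffer *prepare_buffer(struct cpu_buffer **ucbp)
{
    struct cpu_buffer *ucb;

    if (*ucbp)                 /* already prepared by an earlier consumer */
        return *ucbp;

    ucb = calloc(1, sizeof(*ucb));
    ucb->dsize = 64;           /* stand-in for tp.size + dynamic data size */
    ucb->buf = calloc(1, ucb->dsize);
    prepare_calls++;

    *ucbp = ucb;
    return ucb;
}

/* NULL-tolerant put, mirroring the uprobe_buffer_put() change: callers
 * that never triggered preparation can unconditionally call this */
void put_buffer(struct cpu_buffer *ucb)
{
    if (!ucb)
        return;
    free(ucb->buf);
    free(ucb);
}
```

With this shape, a consumer that never touches the buffer (the BPF-only case in the patch) pays nothing, while tracefs/perf consumers trigger preparation exactly once.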
[PATCH bpf-next 1/3] uprobes: encapsulate preparation of uprobe args buffer
Move the logic of fetching the temporary per-CPU uprobe buffer and storing uprobe args into it to a new helper function. Store data size as part of this buffer, simplifying interfaces a bit, as now we only pass a single uprobe_cpu_buffer reference around, instead of pointer + dsize. This logic was duplicated across uprobe_dispatcher and uretprobe_dispatcher, and now will be centralized. All this is also in preparation to make this uprobe_cpu_buffer handling logic optional in the next patch.

Signed-off-by: Andrii Nakryiko
---
 kernel/trace/trace_uprobe.c | 75 -
 1 file changed, 41 insertions(+), 34 deletions(-)

diff --git a/kernel/trace/trace_uprobe.c b/kernel/trace/trace_uprobe.c
index a84b85d8aac1..a0f60bb10158 100644
--- a/kernel/trace/trace_uprobe.c
+++ b/kernel/trace/trace_uprobe.c
@@ -854,6 +854,7 @@ static const struct file_operations uprobe_profile_ops = {
 struct uprobe_cpu_buffer {
        struct mutex mutex;
        void *buf;
+       int dsize;
 };
 static struct uprobe_cpu_buffer __percpu *uprobe_cpu_buffer;
 static int uprobe_buffer_refcnt;
@@ -943,9 +944,26 @@ static void uprobe_buffer_put(struct uprobe_cpu_buffer *ucb)
        mutex_unlock(&ucb->mutex);
 }

+static struct uprobe_cpu_buffer *prepare_uprobe_buffer(struct trace_uprobe *tu,
+                                                      struct pt_regs *regs)
+{
+       struct uprobe_cpu_buffer *ucb;
+       int dsize, esize;
+
+       esize = SIZEOF_TRACE_ENTRY(is_ret_probe(tu));
+       dsize = __get_data_size(&tu->tp, regs);
+
+       ucb = uprobe_buffer_get();
+       ucb->dsize = dsize;
+
+       store_trace_args(ucb->buf, &tu->tp, regs, esize, dsize);
+
+       return ucb;
+}
+
 static void __uprobe_trace_func(struct trace_uprobe *tu,
                                unsigned long func, struct pt_regs *regs,
-                               struct uprobe_cpu_buffer *ucb, int dsize,
+                               struct uprobe_cpu_buffer *ucb,
                                struct trace_event_file *trace_file)
 {
        struct uprobe_trace_entry_head *entry;
@@ -956,14 +974,14 @@ static void __uprobe_trace_func(struct trace_uprobe *tu,

        WARN_ON(call != trace_file->event_call);

-       if (WARN_ON_ONCE(tu->tp.size + dsize > PAGE_SIZE))
+       if (WARN_ON_ONCE(tu->tp.size + ucb->dsize > PAGE_SIZE))
                return;

        if (trace_trigger_soft_disabled(trace_file))
                return;

        esize = SIZEOF_TRACE_ENTRY(is_ret_probe(tu));
-       size = esize + tu->tp.size + dsize;
+       size = esize + tu->tp.size + ucb->dsize;
        entry = trace_event_buffer_reserve(&fbuffer, trace_file, size);
        if (!entry)
                return;
@@ -977,14 +995,14 @@ static void __uprobe_trace_func(struct trace_uprobe *tu,
                data = DATAOF_TRACE_ENTRY(entry, false);
        }

-       memcpy(data, ucb->buf, tu->tp.size + dsize);
+       memcpy(data, ucb->buf, tu->tp.size + ucb->dsize);

        trace_event_buffer_commit(&fbuffer);
 }

 /* uprobe handler */
 static int uprobe_trace_func(struct trace_uprobe *tu, struct pt_regs *regs,
-                            struct uprobe_cpu_buffer *ucb, int dsize)
+                            struct uprobe_cpu_buffer *ucb)
 {
        struct event_file_link *link;

@@ -993,7 +1011,7 @@ static int uprobe_trace_func(struct trace_uprobe *tu, struct pt_regs *regs,
        rcu_read_lock();
        trace_probe_for_each_link_rcu(link, &tu->tp)
-               __uprobe_trace_func(tu, 0, regs, ucb, dsize, link->file);
+               __uprobe_trace_func(tu, 0, regs, ucb, link->file);
        rcu_read_unlock();

        return 0;
@@ -1001,13 +1019,13 @@ static int uprobe_trace_func(struct trace_uprobe *tu, struct pt_regs *regs,

 static void uretprobe_trace_func(struct trace_uprobe *tu, unsigned long func,
                                 struct pt_regs *regs,
-                                struct uprobe_cpu_buffer *ucb, int dsize)
+                                struct uprobe_cpu_buffer *ucb)
 {
        struct event_file_link *link;

        rcu_read_lock();
        trace_probe_for_each_link_rcu(link, &tu->tp)
-               __uprobe_trace_func(tu, func, regs, ucb, dsize, link->file);
+               __uprobe_trace_func(tu, func, regs, ucb, link->file);
        rcu_read_unlock();
 }

@@ -1335,7 +1353,7 @@ static bool uprobe_perf_filter(struct uprobe_consumer *uc,

 static void __uprobe_perf_func(struct trace_uprobe *tu,
                               unsigned long func, struct pt_regs *regs,
-                              struct uprobe_cpu_buffer *ucb, int dsize)
+                              struct uprobe_cpu_buffer *ucb)
 {
        struct trace_event_call *call = trace_probe_event_call(&tu->tp);
        struct uprobe_trace_entry_head *entry;
@@ -1356,7 +1374,7 @@ static void __uprobe_perf_func(struct trace_uprobe *tu,

        esize = SIZEOF_TRACE_ENTRY(is_ret_probe(tu));
[PATCH bpf-next 0/3] uprobes: two common case speed ups
This patch set implements two speed ups for the uprobe/uretprobe runtime execution path for some common scenarios: BPF-only uprobes (patches #1 and #2) and system-wide (non-PID-specific) uprobes (patch #3). Please see individual patches for details.

Given I haven't worked with uprobe code before, I'm unfamiliar with conventions in this subsystem, including which kernel tree patches should be sent to. For now I based all the changes on top of bpf-next/master, which is where I tested and benchmarked everything anyways. Please advise what I should use as a base for the subsequent revision. Thanks.

Andrii Nakryiko (3):
  uprobes: encapsulate preparation of uprobe args buffer
  uprobes: prepare uprobe args buffer lazily
  uprobes: add speculative lockless system-wide uprobe filter check

 kernel/trace/trace_uprobe.c | 103 ++--
 1 file changed, 63 insertions(+), 40 deletions(-)

--
2.43.0
Re: [PATCH for-next] tracing/kprobes: Add symbol counting check when module loads
On Sat, Oct 28, 2023 at 8:10 PM Masami Hiramatsu (Google) wrote:
>
> From: Masami Hiramatsu (Google)
>
> Check the number of probe target symbols in the target module when
> the module is loaded. If the probe is not on the unique name symbols
> in the module, it will be rejected at that point.
>
> Note that the symbol which has a unique name in the target module,
> it will be accepted even if there are same-name symbols in the
> kernel or other modules,
>
> Signed-off-by: Masami Hiramatsu (Google)
> ---
>  kernel/trace/trace_kprobe.c | 112 ++-
>  1 file changed, 68 insertions(+), 44 deletions(-)
>

LGTM.

Acked-by: Andrii Nakryiko

> diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
> index e834f149695b..90cf2219adb4 100644
> --- a/kernel/trace/trace_kprobe.c
> +++ b/kernel/trace/trace_kprobe.c
> @@ -670,6 +670,21 @@ static int register_trace_kprobe(struct trace_kprobe *tk)
>       return ret;
>  }
>
> +static int validate_module_probe_symbol(const char *modname, const char *symbol);
> +
> +static int register_module_trace_kprobe(struct module *mod, struct trace_kprobe *tk)
> +{
> +     const char *p;
> +     int ret = 0;
> +
> +     p = strchr(trace_kprobe_symbol(tk), ':');
> +     if (p)
> +             ret = validate_module_probe_symbol(module_name(mod), p++);
> +     if (!ret)
> +             ret = register_trace_kprobe(tk);
> +     return ret;
> +}
> +
>  /* Module notifier call back, checking event on the module */
>  static int trace_kprobe_module_callback(struct notifier_block *nb,
>                                      unsigned long val, void *data)
> @@ -688,7 +703,7 @@ static int trace_kprobe_module_callback(struct notifier_block *nb,
>               if (trace_kprobe_within_module(tk, mod)) {
>                       /* Don't need to check busy - this should have gone. */
>                       __unregister_trace_kprobe(tk);
> -                     ret = __register_trace_kprobe(tk);
> +                     ret = register_module_trace_kprobe(mod, tk);
>                       if (ret)
>                               pr_warn("Failed to re-register probe %s on %s: %d\n",
>                                       trace_probe_name(&tk->tp),
> @@ -729,17 +744,55 @@ static int count_mod_symbols(void *data, const char *name, unsigned long unused)
>       return 0;
>  }
>
> -static unsigned int number_of_same_symbols(char *func_name)
> +static unsigned int number_of_same_symbols(const char *mod, const char *func_name)
>  {
>       struct sym_count_ctx ctx = { .count = 0, .name = func_name };
>
> -     kallsyms_on_each_match_symbol(count_symbols, func_name, &ctx.count);
> +     if (!mod)
> +             kallsyms_on_each_match_symbol(count_symbols, func_name, &ctx.count);
>
> -     module_kallsyms_on_each_symbol(NULL, count_mod_symbols, &ctx);
> +     module_kallsyms_on_each_symbol(mod, count_mod_symbols, &ctx);
>
>       return ctx.count;
>  }
>
> +static int validate_module_probe_symbol(const char *modname, const char *symbol)
> +{
> +     unsigned int count = number_of_same_symbols(modname, symbol);
> +
> +     if (count > 1) {
> +             /*
> +              * Users should use ADDR to remove the ambiguity of
> +              * using KSYM only.
> +              */
> +             return -EADDRNOTAVAIL;
> +     } else if (count == 0) {
> +             /*
> +              * We can return ENOENT earlier than when register the
> +              * kprobe.
> +              */
> +             return -ENOENT;
> +     }
> +     return 0;
> +}
> +
> +static int validate_probe_symbol(char *symbol)
> +{
> +     char *mod = NULL, *p;
> +     int ret;
> +
> +     p = strchr(symbol, ':');
> +     if (p) {
> +             mod = symbol;
> +             symbol = p + 1;
> +             *p = '\0';
> +     }
> +     ret = validate_module_probe_symbol(mod, symbol);
> +     if (p)
> +             *p = ':';
> +     return ret;
> +}
> +
>  static int __trace_kprobe_create(int argc, const char *argv[])
>  {
>       /*
> @@ -859,6 +912,14 @@ static int __trace_kprobe_create(int argc, const char *argv[])
>                       trace_probe_log_err(0, BAD_PROBE_ADDR);
>                       goto parse_error;
>               }
> +             ret = validate_probe_symbol(symbol);
> +             if (ret) {
> +                     if (ret == -EADDRNOTAVAIL)
> +                             trace_probe_log_err(0, NON_UNIQ_SYMBOL);
> +
[PATCH] tracing/kprobes: Fix symbol counting logic by looking at modules as well
Recent changes to count the number of matching symbols when creating a kprobe event failed to take into account kernel modules. As such, it breaks kprobes on kernel module symbols, by assuming there is no match.

Fix this by calling module_kallsyms_on_each_symbol() in addition to kallsyms_on_each_match_symbol() to perform a proper counting.

Cc: Francis Laniel
Cc: sta...@vger.kernel.org
Cc: Masami Hiramatsu
Cc: Steven Rostedt
Fixes: b022f0c7e404 ("tracing/kprobes: Return EADDRNOTAVAIL when func matches several symbols")
Signed-off-by: Andrii Nakryiko
---
 kernel/trace/trace_kprobe.c | 24
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index effcaede4759..1efb27f35963 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -714,14 +714,30 @@ static int count_symbols(void *data, unsigned long unused)
        return 0;
 }

+struct sym_count_ctx {
+       unsigned int count;
+       const char *name;
+};
+
+static int count_mod_symbols(void *data, const char *name, unsigned long unused)
+{
+       struct sym_count_ctx *ctx = data;
+
+       if (strcmp(name, ctx->name) == 0)
+               ctx->count++;
+
+       return 0;
+}
+
 static unsigned int number_of_same_symbols(char *func_name)
 {
-       unsigned int count;
+       struct sym_count_ctx ctx = { .count = 0, .name = func_name };
+
+       kallsyms_on_each_match_symbol(count_symbols, func_name, &ctx.count);

-       count = 0;
-       kallsyms_on_each_match_symbol(count_symbols, func_name, &count);
+       module_kallsyms_on_each_symbol(NULL, count_mod_symbols, &ctx);

-       return count;
+       return ctx.count;
 }

 static int __trace_kprobe_create(int argc, const char *argv[])
--
2.34.1
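The fix above drives the same counting logic from two different symbol iterators (kallsyms and module kallsyms) by threading a shared context struct through a callback. A hypothetical userspace sketch of that shape, with plain string arrays standing in for the kernel's symbol tables:

```c
#include <string.h>

/* context carries both the name being matched and the running count */
struct sym_count_ctx {
    unsigned int count;
    const char *name;
};

/* callback invoked once per symbol; bumps the count on an exact match */
static int count_symbols_cb(void *data, const char *name)
{
    struct sym_count_ctx *ctx = data;

    if (strcmp(name, ctx->name) == 0)
        ctx->count++;
    return 0;
}

/* stand-in for the kernel's kallsyms_on_each_match_symbol() /
 * module_kallsyms_on_each_symbol() iterators */
static void for_each_symbol(const char **syms, int n,
                            int (*cb)(void *, const char *), void *data)
{
    for (int i = 0; i < n; i++)
        cb(data, syms[i]);
}
```

Because the count lives in the context rather than a local, multiple iterators can be run back-to-back and the totals accumulate, which is exactly what the fix needs when a module symbol shadows (or duplicates) a core-kernel one.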
Re: [RFC PATCH bpf-next] bpf: change syscall_nr type to int in struct syscall_tp_t
On Fri, Oct 13, 2023 at 7:00 AM Steven Rostedt wrote: > > On Fri, 13 Oct 2023 08:01:34 +0200 > Artem Savkov wrote: > > > > But looking at [0] and briefly reading some of the discussions you, > > > Steven, had. I'm just wondering if it would be best to avoid > > > increasing struct trace_entry altogether? It seems like preempt_count > > > is actually a 4-bit field in trace context, so it doesn't seem like we > > > really need to allocate an entire byte for both preempt_count and > > > preempt_lazy_count. Why can't we just combine them and not waste 8 > > > extra bytes for each trace event in a ring buffer? > > > > > > [0] > > > https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/commit/?id=b1773eac3f29cbdcdfd16e0339f1a164066e9f71 > > > > I agree that avoiding increase in struct trace_entry size would be very > > desirable, but I have no knowledge whether rt developers had reasons to > > do it like this. > > > > Nevertheless I think the issue with verifier running against a wrong > > struct still needs to be addressed. > > Correct. My Ack is based on the current way things are done upstream. > It was just that linux-rt showed the issue, where the code was not as > robust as it should have been. To me this was a correctness issue, not > an issue that had to do with how things are done in linux-rt. I think we should at least add some BUILD_BUG_ON() that validates offsets in syscall_tp_t matches the ones in syscall_trace_enter and syscall_trace_exit, to fail more loudly if there is any mismatch in the future. WDYT? > > As for the changes in linux-rt, they are not upstream yet. I'll have my > comments on that code when that happens. Ah, ok, cool. I'd appreciate you cc'ing b...@vger.kernel.org in that discussion, thank you! > > -- Steve
Re: [RFC PATCH bpf-next] bpf: change syscall_nr type to int in struct syscall_tp_t
On Thu, Oct 12, 2023 at 6:43 AM Steven Rostedt wrote: > > On Thu, 12 Oct 2023 13:45:50 +0200 > Artem Savkov wrote: > > > linux-rt-devel tree contains a patch (b1773eac3f29c ("sched: Add support > > for lazy preemption")) that adds an extra member to struct trace_entry. > > This causes the offset of args field in struct trace_event_raw_sys_enter > > be different from the one in struct syscall_trace_enter: > > > > struct trace_event_raw_sys_enter { > > struct trace_entry ent; /* 0 12 */ > > > > /* XXX last struct has 3 bytes of padding */ > > /* XXX 4 bytes hole, try to pack */ > > > > long int id; /* 16 8 */ > > long unsigned int args[6]; /* 24 48 */ > > /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */ > > char __data[]; /* 72 0 */ > > > > /* size: 72, cachelines: 2, members: 4 */ > > /* sum members: 68, holes: 1, sum holes: 4 */ > > /* paddings: 1, sum paddings: 3 */ > > /* last cacheline: 8 bytes */ > > }; > > > > struct syscall_trace_enter { > > struct trace_entry ent; /* 0 12 */ > > > > /* XXX last struct has 3 bytes of padding */ > > > > int nr; /* 12 4 */ > > long unsigned int args[]; /* 16 0 */ > > > > /* size: 16, cachelines: 1, members: 3 */ > > /* paddings: 1, sum paddings: 3 */ > > /* last cacheline: 16 bytes */ > > }; > > > > This, in turn, causes perf_event_set_bpf_prog() fail while running bpf > > test_profiler testcase because max_ctx_offset is calculated based on the > > former struct, while off on the latter: > > > > 10488 if (is_tracepoint || is_syscall_tp) { > > 10489 int off = trace_event_get_offsets(event->tp_event); > > 10490 > > 10491 if (prog->aux->max_ctx_offset > off) > > 10492 return -EACCES; > > 10493 } > > > > What bpf program is actually getting is a pointer to struct > > syscall_tp_t, defined in kernel/trace/trace_syscalls.c. This patch fixes > > the problem by aligning struct syscall_tp_t with struct > > syscall_trace_(enter|exit) and changing the tests to use these structs > > to dereference context. 
> > > > Signed-off-by: Artem Savkov > I think these changes make sense regardless, can you please resend the patch without RFC tag so that our CI can run tests for it? > Thanks for doing a proper fix. > > Acked-by: Steven Rostedt (Google) But looking at [0] and briefly reading some of the discussions you, Steven, had. I'm just wondering if it would be best to avoid increasing struct trace_entry altogether? It seems like preempt_count is actually a 4-bit field in trace context, so it doesn't seem like we really need to allocate an entire byte for both preempt_count and preempt_lazy_count. Why can't we just combine them and not waste 8 extra bytes for each trace event in a ring buffer? [0] https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/commit/?id=b1773eac3f29cbdcdfd16e0339f1a164066e9f71 > > -- Steve
Re: [RFC PATCH] tracing: change syscall number type in struct syscall_trace_*
On Mon, Oct 2, 2023 at 6:53 AM Artem Savkov wrote: > > linux-rt-devel tree contains a patch that adds an extra member to struct can you please point to the patch itself that makes that change? > trace_entry. This causes the offset of args field in struct > trace_event_raw_sys_enter be different from the one in struct > syscall_trace_enter: > > struct trace_event_raw_sys_enter { > struct trace_entry ent; /* 0 12 */ > > /* XXX last struct has 3 bytes of padding */ > /* XXX 4 bytes hole, try to pack */ > > long int id; /* 16 8 */ > long unsigned int args[6]; /* 24 48 */ > /* --- cacheline 1 boundary (64 bytes) was 8 bytes ago --- */ > char __data[]; /* 72 0 */ > > /* size: 72, cachelines: 2, members: 4 */ > /* sum members: 68, holes: 1, sum holes: 4 */ > /* paddings: 1, sum paddings: 3 */ > /* last cacheline: 8 bytes */ > }; > > struct syscall_trace_enter { > struct trace_entry ent; /* 0 12 */ > > /* XXX last struct has 3 bytes of padding */ > > int nr; /* 12 4 */ > long unsigned int args[]; /* 16 0 */ > > /* size: 16, cachelines: 1, members: 3 */ > /* paddings: 1, sum paddings: 3 */ > /* last cacheline: 16 bytes */ > }; > > This, in turn, causes perf_event_set_bpf_prog() fail while running bpf > test_profiler testcase because max_ctx_offset is calculated based on the > former struct, while off on the latter: > > 10488 if (is_tracepoint || is_syscall_tp) { > 10489 int off = trace_event_get_offsets(event->tp_event); > 10490 > 10491 if (prog->aux->max_ctx_offset > off) > 10492 return -EACCES; > 10493 } > > This patch changes the type of nr member in syscall_trace_* structs to > be long so that "args" offset is equal to that in struct > trace_event_raw_sys_enter. 
> > > > Signed-off-by: Artem Savkov > --- > kernel/trace/trace.h | 4 ++-- > kernel/trace/trace_syscalls.c | 7 ++++--- > 2 files changed, 6 insertions(+), 5 deletions(-) > > diff --git a/kernel/trace/trace.h b/kernel/trace/trace.h > index 77debe53f07cf..cd1d24df85364 100644 > --- a/kernel/trace/trace.h > +++ b/kernel/trace/trace.h > @@ -135,13 +135,13 @@ enum trace_type { > */ > struct syscall_trace_enter { > struct trace_entry ent; > - int nr; > + long nr; > unsigned long args[]; > }; > > struct syscall_trace_exit { > struct trace_entry ent; > - int nr; > + long nr; > long ret; > }; > > diff --git a/kernel/trace/trace_syscalls.c b/kernel/trace/trace_syscalls.c > index de753403cdafb..c26939119f2e4 100644 > --- a/kernel/trace/trace_syscalls.c > +++ b/kernel/trace/trace_syscalls.c > @@ -101,7 +101,7 @@ find_syscall_meta(unsigned long syscall) > return NULL; > } > > -static struct syscall_metadata *syscall_nr_to_meta(int nr) > +static struct syscall_metadata *syscall_nr_to_meta(long nr) > { > if (IS_ENABLED(CONFIG_HAVE_SPARSE_SYSCALL_NR)) > return xa_load(&syscalls_metadata_sparse, (unsigned long)nr); > @@ -132,7 +132,8 @@ print_syscall_enter(struct trace_iterator *iter, int > flags, > struct trace_entry *ent = iter->ent; > struct syscall_trace_enter *trace; > struct syscall_metadata *entry; > - int i, syscall; > + int i; > + long syscall; > > trace = (typeof(trace))ent; > syscall = trace->nr; > @@ -177,7 +178,7 @@ print_syscall_exit(struct trace_iterator *iter, int flags, > struct trace_seq *s = &iter->seq; > struct trace_entry *ent = iter->ent; > struct syscall_trace_exit *trace; > - int syscall; > + long syscall; > struct syscall_metadata *entry; > > trace = (typeof(trace))ent; > -- > 2.41.0 > >
Re: [PATCH v3 bpf-next 11/11] bpf: Test BPF_SK_REUSEPORT_SELECT_OR_MIGRATE.
On Tue, Apr 20, 2021 at 8:45 AM Kuniyuki Iwashima wrote: > > This patch adds a test for BPF_SK_REUSEPORT_SELECT_OR_MIGRATE and > removes 'static' from settimeo() in network_helpers.c. > > Signed-off-by: Kuniyuki Iwashima > --- Almost everything in prog_tests/migrate_reuseport.c should be static, functions and variables. Except the test_migrate_reuseport, of course. But thank you for using ASSERT_xxx()! :) > tools/testing/selftests/bpf/network_helpers.c | 2 +- > tools/testing/selftests/bpf/network_helpers.h | 1 + > .../bpf/prog_tests/migrate_reuseport.c| 483 ++ > .../bpf/progs/test_migrate_reuseport.c| 51 ++ > 4 files changed, 536 insertions(+), 1 deletion(-) > create mode 100644 tools/testing/selftests/bpf/prog_tests/migrate_reuseport.c > create mode 100644 tools/testing/selftests/bpf/progs/test_migrate_reuseport.c > [...]
Re: [PATCH bpf-next v2 4/4] libbpf: add selftests for TC-BPF API
On Mon, Apr 19, 2021 at 5:18 AM Kumar Kartikeya Dwivedi wrote: > > This adds some basic tests for the low level bpf_tc_cls_* API. > > Reviewed-by: Toke Høiland-Jørgensen > Signed-off-by: Kumar Kartikeya Dwivedi > --- > .../selftests/bpf/prog_tests/test_tc_bpf.c| 112 ++ > .../selftests/bpf/progs/test_tc_bpf_kern.c| 12 ++ > 2 files changed, 124 insertions(+) > create mode 100644 tools/testing/selftests/bpf/prog_tests/test_tc_bpf.c > create mode 100644 tools/testing/selftests/bpf/progs/test_tc_bpf_kern.c > > diff --git a/tools/testing/selftests/bpf/prog_tests/test_tc_bpf.c > b/tools/testing/selftests/bpf/prog_tests/test_tc_bpf.c > new file mode 100644 > index ..945f3a1a72f8 > --- /dev/null > +++ b/tools/testing/selftests/bpf/prog_tests/test_tc_bpf.c > @@ -0,0 +1,112 @@ > +// SPDX-License-Identifier: GPL-2.0 > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#define LO_IFINDEX 1 > + > +static int test_tc_cls_internal(int fd, __u32 parent_id) > +{ > + DECLARE_LIBBPF_OPTS(bpf_tc_cls_opts, opts, .handle = 1, .priority = > 10, > + .class_id = TC_H_MAKE(1UL << 16, 1), > + .chain_index = 5); > + struct bpf_tc_cls_attach_id id = {}; > + struct bpf_tc_cls_info info = {}; > + int ret; > + > + ret = bpf_tc_cls_attach(fd, LO_IFINDEX, parent_id, &opts, &id); > + if (CHECK_FAIL(ret < 0)) > + return ret; > + > + ret = bpf_tc_cls_get_info(fd, LO_IFINDEX, parent_id, NULL, &info); > + if (CHECK_FAIL(ret < 0)) > + goto end; > + > + ret = -1; > + > + if (CHECK_FAIL(info.id.handle != id.handle) || > + CHECK_FAIL(info.id.chain_index != id.chain_index) || > + CHECK_FAIL(info.id.priority != id.priority) || > + CHECK_FAIL(info.id.handle != 1) || > + CHECK_FAIL(info.id.priority != 10) || > + CHECK_FAIL(info.class_id != TC_H_MAKE(1UL << 16, 1)) || > + CHECK_FAIL(info.id.chain_index != 5)) > + goto end; > + > + ret = bpf_tc_cls_replace(fd, LO_IFINDEX, parent_id, &opts, &id); > + if (CHECK_FAIL(ret < 0)) > + return ret; > + > + if 
(CHECK_FAIL(info.id.handle != 1) || > + CHECK_FAIL(info.id.priority != 10) || > + CHECK_FAIL(info.class_id != TC_H_MAKE(1UL << 16, 1))) > + goto end; > + > + /* Demonstrate changing attributes */ > + opts.class_id = TC_H_MAKE(1UL << 16, 2); > + > + ret = bpf_tc_cls_change(fd, LO_IFINDEX, parent_id, &opts, &id); > + if (CHECK_FAIL(ret < 0)) > + goto end; > + > + ret = bpf_tc_cls_get_info(fd, LO_IFINDEX, parent_id, NULL, &info); > + if (CHECK_FAIL(ret < 0)) > + goto end; > + > + if (CHECK_FAIL(info.class_id != TC_H_MAKE(1UL << 16, 2))) > + goto end; > + if (CHECK_FAIL((info.bpf_flags & TCA_BPF_FLAG_ACT_DIRECT) != 1)) > + goto end; > + > +end: > + ret = bpf_tc_cls_detach(LO_IFINDEX, parent_id, &id); > + CHECK_FAIL(ret < 0); > + return ret; > +} > + > +void test_test_tc_bpf(void) > +{ > + const char *file = "./test_tc_bpf_kern.o"; > + struct bpf_program *clsp; > + struct bpf_object *obj; > + int cls_fd, ret; > + > + obj = bpf_object__open(file); > + if (CHECK_FAIL(IS_ERR_OR_NULL(obj))) > + return; > + > + clsp = bpf_object__find_program_by_title(obj, "classifier"); > + if (CHECK_FAIL(IS_ERR_OR_NULL(clsp))) > + goto end; > + > + ret = bpf_object__load(obj); > + if (CHECK_FAIL(ret < 0)) > + goto end; > + > + cls_fd = bpf_program__fd(clsp); > + > + system("tc qdisc del dev lo clsact"); > + > + ret = test_tc_cls_internal(cls_fd, BPF_TC_CLSACT_INGRESS); > + if (CHECK_FAIL(ret < 0)) > + goto end; > + > + if (CHECK_FAIL(system("tc qdisc del dev lo clsact"))) > + goto end; > + > + ret = test_tc_cls_internal(cls_fd, BPF_TC_CLSACT_EGRESS); > + if (CHECK_FAIL(ret < 0)) > + goto end; > + > + CHECK_FAIL(system("tc qdisc del dev lo clsact")); please don't use CHECK_FAIL. And prefer ASSERT_xxx over CHECK(). 
> + > +end: > + bpf_object__close(obj); > +} > diff --git a/tools/testing/selftests/bpf/progs/test_tc_bpf_kern.c > b/tools/testing/selftests/bpf/progs/test_tc_bpf_kern.c > new file mode 100644 > index ..3dd40e21af8e > --- /dev/null > +++ b/tools/testing/selftests/bpf/progs/test_tc_bpf_kern.c > @@ -0,0 +1,12 @@ > +// SPDX-License-Identifier: GPL-2.0 > + > +#include > +#include > + > +// Dummy prog to test TC-BPF API no C++-style comments, please (except for SPDX header, of course) > + > +SEC("classifier") > +int cls(struct __sk_buff *skb) > +{ > + return 0; > +} > -- > 2.30.2 >
Re: [PATCH bpf-next v5 0/6] Add a snprintf eBPF helper
On Mon, Apr 19, 2021 at 8:52 AM Florent Revest wrote: > > We have a usecase where we want to audit symbol names (if available) in > callback registration hooks. (ex: fentry/nf_register_net_hook) > > A few months back, I proposed a bpf_kallsyms_lookup series but it was > decided in the reviews that a more generic helper, bpf_snprintf, would > be more useful. > > This series implements the helper according to the feedback received in > https://lore.kernel.org/bpf/20201126165748.1748417-1-rev...@google.com/T/#u > > - A new arg type guarantees the NULL-termination of string arguments and > lets us pass format strings in only one arg > - A new helper is implemented using that guarantee. Because the format > string is known at verification time, the format string validation is > done by the verifier > - To implement a series of tests for bpf_snprintf, the logic for > marshalling variadic args in a fixed-size array is reworked as per: > https://lore.kernel.org/bpf/20210310015455.1095207-1-rev...@chromium.org/T/#u > > --- > Changes in v5: > - Fixed the bpf_printf_buf_used counter logic in try_get_fmt_tmp_buf > - Added a couple of extra incorrect specifiers tests > - Call test_snprintf_single__destroy unconditionally > - Fixed a C++-style comment > > --- > Changes in v4: > - Moved bpf_snprintf, bpf_printf_prepare and bpf_printf_cleanup to > kernel/bpf/helpers.c so that they get built without CONFIG_BPF_EVENTS > - Added negative test cases (various invalid format strings) > - Renamed put_fmt_tmp_buf() as bpf_printf_cleanup() > - Fixed a mistake that caused temporary buffers to be unconditionally > freed in bpf_printf_prepare > - Fixed a mistake that caused missing 0 character to be ignored > - Fixed a warning about integer to pointer conversion > - Misc cleanups > > --- > Changes in v3: > - Simplified temporary buffer acquisition with try_get_fmt_tmp_buf() > - Made zero-termination check more consistent > - Allowed NULL output_buffer > - Simplified the BPF_CAST_FMT_ARG macro 
> - Three new test cases: number padding, simple string with no arg and > string length extraction only with a NULL output buffer > - Clarified helper's description for edge cases (eg: str_size == 0) > - Lots of cosmetic changes > > --- > Changes in v2: > - Extracted the format validation/argument sanitization in a generic way > for all printf-like helpers. > - bpf_snprintf's str_size can now be 0 > - bpf_snprintf is now exposed to all BPF program types > - We now preempt_disable when using a per-cpu temporary buffer > - Addressed a few cosmetic changes > > Florent Revest (6): > bpf: Factorize bpf_trace_printk and bpf_seq_printf > bpf: Add a ARG_PTR_TO_CONST_STR argument type > bpf: Add a bpf_snprintf helper > libbpf: Initialize the bpf_seq_printf parameters array field by field > libbpf: Introduce a BPF_SNPRINTF helper macro > selftests/bpf: Add a series of tests for bpf_snprintf > > include/linux/bpf.h | 22 ++ > include/uapi/linux/bpf.h | 28 ++ > kernel/bpf/helpers.c | 306 ++ > kernel/bpf/verifier.c | 82 > kernel/trace/bpf_trace.c | 373 ++ > tools/include/uapi/linux/bpf.h| 28 ++ > tools/lib/bpf/bpf_tracing.h | 58 ++- > .../selftests/bpf/prog_tests/snprintf.c | 125 ++ > .../selftests/bpf/progs/test_snprintf.c | 73 > .../bpf/progs/test_snprintf_single.c | 20 + > 10 files changed, 770 insertions(+), 345 deletions(-) > create mode 100644 tools/testing/selftests/bpf/prog_tests/snprintf.c > create mode 100644 tools/testing/selftests/bpf/progs/test_snprintf.c > create mode 100644 tools/testing/selftests/bpf/progs/test_snprintf_single.c > > -- > 2.31.1.368.gbe11c130af-goog > Looks great, thank you! For the series: Acked-by: Andrii Nakryiko
Re: [PATCH bpf-next v4 6/6] selftests/bpf: Add a series of tests for bpf_snprintf
On Wed, Apr 14, 2021 at 11:54 AM Florent Revest wrote: > > The "positive" part tests all format specifiers when things go well. > > The "negative" part makes sure that incorrect format strings fail at > load time. > > Signed-off-by: Florent Revest > --- > .../selftests/bpf/prog_tests/snprintf.c | 124 ++ > .../selftests/bpf/progs/test_snprintf.c | 73 +++ > .../bpf/progs/test_snprintf_single.c | 20 +++ > 3 files changed, 217 insertions(+) > create mode 100644 tools/testing/selftests/bpf/prog_tests/snprintf.c > create mode 100644 tools/testing/selftests/bpf/progs/test_snprintf.c > create mode 100644 tools/testing/selftests/bpf/progs/test_snprintf_single.c > [...] > +/* Loads an eBPF object calling bpf_snprintf with up to 10 characters of fmt > */ > +static int load_single_snprintf(char *fmt) > +{ > + struct test_snprintf_single *skel; > + int ret; > + > + skel = test_snprintf_single__open(); > + if (!skel) > + return -EINVAL; > + > + memcpy(skel->rodata->fmt, fmt, min(strlen(fmt) + 1, 10)); > + > + ret = test_snprintf_single__load(skel); > + if (!ret) > + test_snprintf_single__destroy(skel); destroy unconditionally? > + > + return ret; > +} > + > +void test_snprintf_negative(void) > +{ > + ASSERT_OK(load_single_snprintf("valid %d"), "valid usage"); > + > + ASSERT_ERR(load_single_snprintf("0123456789"), "no terminating zero"); > + ASSERT_ERR(load_single_snprintf("%d %d"), "too many specifiers"); > + ASSERT_ERR(load_single_snprintf("%pi5"), "invalid specifier 1"); > + ASSERT_ERR(load_single_snprintf("%a"), "invalid specifier 2"); > + ASSERT_ERR(load_single_snprintf("%"), "invalid specifier 3"); > + ASSERT_ERR(load_single_snprintf("\x80"), "non ascii character"); > + ASSERT_ERR(load_single_snprintf("\x1"), "non printable character"); Some more cases that came up in my mind: 1. %123987129387192387 -- long and unterminated specifier 2. similarly %--- or something like that Do you think they are worth checking? 
> +} > + > +void test_snprintf(void) > +{ > + if (test__start_subtest("snprintf_positive")) > + test_snprintf_positive(); > + if (test__start_subtest("snprintf_negative")) > + test_snprintf_negative(); > +} [...] > +char _license[] SEC("license") = "GPL"; > diff --git a/tools/testing/selftests/bpf/progs/test_snprintf_single.c > b/tools/testing/selftests/bpf/progs/test_snprintf_single.c > new file mode 100644 > index ..15ccc5c43803 > --- /dev/null > +++ b/tools/testing/selftests/bpf/progs/test_snprintf_single.c > @@ -0,0 +1,20 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* Copyright (c) 2021 Google LLC. */ > + > +#include > +#include > + > +// The format string is filled from the userspace side such that loading > fails C++ style format > +static const char fmt[10]; > + > +SEC("raw_tp/sys_enter") > +int handler(const void *ctx) > +{ > + unsigned long long arg = 42; > + > + bpf_snprintf(NULL, 0, fmt, &arg, sizeof(arg)); > + > + return 0; > +} > + > +char _license[] SEC("license") = "GPL"; > -- > 2.31.1.295.g9ea45b61b8-goog >
Re: [PATCH bpf-next v4 3/6] bpf: Add a bpf_snprintf helper
On Wed, Apr 14, 2021 at 11:54 AM Florent Revest wrote: > > The implementation takes inspiration from the existing bpf_trace_printk > helper but there are a few differences: > > To allow for a large number of format-specifiers, parameters are > provided in an array, like in bpf_seq_printf. > > Because the output string takes two arguments and the array of > parameters also takes two arguments, the format string needs to fit in > one argument. Thankfully, ARG_PTR_TO_CONST_STR is guaranteed to point to > a zero-terminated read-only map so we don't need a format string length > arg. > > Because the format-string is known at verification time, we also do > a first pass of format string validation in the verifier logic. This > makes debugging easier. > > Signed-off-by: Florent Revest > --- LGTM. Acked-by: Andrii Nakryiko > include/linux/bpf.h| 1 + > include/uapi/linux/bpf.h | 28 +++ > kernel/bpf/helpers.c | 50 ++ > kernel/bpf/verifier.c | 41 > kernel/trace/bpf_trace.c | 2 ++ > tools/include/uapi/linux/bpf.h | 28 +++ > 6 files changed, 150 insertions(+) > [...]
Re: [PATCH bpf-next v4 1/6] bpf: Factorize bpf_trace_printk and bpf_seq_printf
On Thu, Apr 15, 2021 at 2:33 AM Florent Revest wrote: > > On Thu, Apr 15, 2021 at 2:38 AM Andrii Nakryiko > wrote: > > On Wed, Apr 14, 2021 at 11:54 AM Florent Revest wrote: > > > +static int try_get_fmt_tmp_buf(char **tmp_buf) > > > +{ > > > + struct bpf_printf_buf *bufs; > > > + int used; > > > + > > > + if (*tmp_buf) > > > + return 0; > > > + > > > + preempt_disable(); > > > + used = this_cpu_inc_return(bpf_printf_buf_used); > > > + if (WARN_ON_ONCE(used > 1)) { > > > + this_cpu_dec(bpf_printf_buf_used); > > > > this makes me uncomfortable. If used > 1, you won't preempt_enable() > > here, but you'll decrease count. Then later bpf_printf_cleanup() will > > be called (inside bpf_printf_prepare()) and will further decrease > > count (which it didn't increase, so it's a mess now). > > Awkward, yes. :( This code is untested because it only covers a niche > preempt_rt usecase that is hard to reproduce but I should have thought > harder about these corner cases. > > > > + i += 2; > > > + if (!final_args) > > > + goto fmt_next; > > > + > > > + if (try_get_fmt_tmp_buf(&tmp_buf)) { > > > + err = -EBUSY; > > > + goto out; > > > > this probably should bypass doing bpf_printf_cleanup() and > > try_get_fmt_tmp_buf() should enable preemption internally on error. > > Yes. I'll fix this and spend some more brain cycles thinking about > what I'm doing. ;) > > > > -static __printf(1, 0) int bpf_do_trace_printk(const char *fmt, ...) 
> > > +BPF_CALL_5(bpf_trace_printk, char *, fmt, u32, fmt_size, u64, arg1, > > > + u64, arg2, u64, arg3) > > > { > > > + u64 args[MAX_TRACE_PRINTK_VARARGS] = { arg1, arg2, arg3 }; > > > + enum bpf_printf_mod_type mod[MAX_TRACE_PRINTK_VARARGS]; > > > static char buf[BPF_TRACE_PRINTK_SIZE]; > > > unsigned long flags; > > > - va_list ap; > > > int ret; > > > > > > - raw_spin_lock_irqsave(&trace_printk_lock, flags); > > > - va_start(ap, fmt); > > > - ret = vsnprintf(buf, sizeof(buf), fmt, ap); > > > - va_end(ap); > > > - /* vsnprintf() will not append null for zero-length strings */ > > > + ret = bpf_printf_prepare(fmt, fmt_size, args, args, mod, > > > +MAX_TRACE_PRINTK_VARARGS); > > > + if (ret < 0) > > > + return ret; > > > + > > > + ret = snprintf(buf, sizeof(buf), fmt, BPF_CAST_FMT_ARG(0, args, > > > mod), > > > + BPF_CAST_FMT_ARG(1, args, mod), BPF_CAST_FMT_ARG(2, args, > > > mod)); > > > + /* snprintf() will not append null for zero-length strings */ > > > if (ret == 0) > > > buf[0] = '\0'; > > > + > > > + raw_spin_lock_irqsave(&trace_printk_lock, flags); > > > trace_bpf_trace_printk(buf); > > > raw_spin_unlock_irqrestore(&trace_printk_lock, flags); > > > > > > - return ret; > > > > see here, no + 1 :( > > I wonder if it's a bug or a feature though. The helper documentation > says the helper returns "the number of bytes written to the buffer". I > am not familiar with the internals of trace_printk but if the > terminating \0 is not outputted in the trace_printk buffer, then it > kind of makes sense. > > Also, if anyone uses this return value, I can imagine that the usecase > would be if (ret == 0) assume_nothing_was_written(). And if we > suddenly output 1 here, we might break something. > > Because the helper is quite old, maybe we should improve the helper > documentation instead? Your call :) Yeah, let's make helper's doc a bit more precise, otherwise let's not touch it. I doubt many users ever check return result of bpf_trace_printk() at all, tbh.
Re: [PATCH bpf-next 3/5] libbpf: add low level TC-BPF API
On Thu, Apr 15, 2021 at 3:10 PM Daniel Borkmann wrote: > > On 4/15/21 1:58 AM, Andrii Nakryiko wrote: > > On Wed, Apr 14, 2021 at 4:32 PM Daniel Borkmann > > wrote: > >> On 4/15/21 1:19 AM, Andrii Nakryiko wrote: > >>> On Wed, Apr 14, 2021 at 3:51 PM Toke Høiland-Jørgensen > >>> wrote: > >>>> Andrii Nakryiko writes: > >>>>> On Wed, Apr 14, 2021 at 3:58 AM Toke Høiland-Jørgensen > >>>>> wrote: > >>>>>> Andrii Nakryiko writes: > >>>>>>> On Tue, Apr 6, 2021 at 3:06 AM Toke Høiland-Jørgensen > >>>>>>> wrote: > >>>>>>>> Andrii Nakryiko writes: > >>>>>>>>> On Sat, Apr 3, 2021 at 10:47 AM Alexei Starovoitov > >>>>>>>>> wrote: > >>>>>>>>>> On Sat, Apr 03, 2021 at 12:38:06AM +0530, Kumar Kartikeya Dwivedi > >>>>>>>>>> wrote: > >>>>>>>>>>> On Sat, Apr 03, 2021 at 12:02:14AM IST, Alexei Starovoitov wrote: > >>>>>>>>>>>> On Fri, Apr 2, 2021 at 8:27 AM Kumar Kartikeya Dwivedi > >>>>>>>>>>>> wrote: > >>>>>>>>>>>>> [...] > >>>>>>>>>>>> > >>>>>>>>>>>> All of these things are messy because of tc legacy. bpf tried to > >>>>>>>>>>>> follow tc style > >>>>>>>>>>>> with cls and act distinction and it didn't quite work. cls with > >>>>>>>>>>>> direct-action is the only > >>>>>>>>>>>> thing that became mainstream while tc style attach wasn't really > >>>>>>>>>>>> addressed. > >>>>>>>>>>>> There were several incidents where tc had tens of thousands of > >>>>>>>>>>>> progs attached > >>>>>>>>>>>> because of this attach/query/index weirdness described above. > >>>>>>>>>>>> I think the only way to address this properly is to introduce > >>>>>>>>>>>> bpf_link style of > >>>>>>>>>>>> attaching to tc. Such bpf_link would support ingress/egress only. > >>>>>>>>>>>> direction-action will be implied. There won't be any index and > >>>>>>>>>>>> query > >>>>>>>>>>>> will be obvious. > >>>>>>>>>>> > >>>>>>>>>>> Note that we already have bpf_link support working (without > >>>>>>>>>>> support for pinning > >>>>>>>>>>> ofcourse) in a limited way. 
The ifindex, protocol, parent_id, > >>>>>>>>>>> priority, handle, > >>>>>>>>>>> chain_index tuple uniquely identifies a filter, so we stash this > >>>>>>>>>>> in the bpf_link > >>>>>>>>>>> and are able to operate on the exact filter during release. > >>>>>>>>>> > >>>>>>>>>> Except they're not unique. The library can stash them, but > >>>>>>>>>> something else > >>>>>>>>>> doing detach via iproute2 or their own netlink calls will detach > >>>>>>>>>> the prog. > >>>>>>>>>> This other app can attach to the same spot a different prog and now > >>>>>>>>>> bpf_link__destroy will be detaching somebody else prog. > >>>>>>>>>> > >>>>>>>>>>>> So I would like to propose to take this patch set a step further > >>>>>>>>>>>> from > >>>>>>>>>>>> what Daniel said: > >>>>>>>>>>>> int bpf_tc_attach(prog_fd, ifindex, {INGRESS,EGRESS}): > >>>>>>>>>>>> and make this proposed api to return FD. > >>>>>>>>>>>> To detach from tc ingress/egress just close(fd). > >>>>>>>>>>> > >>>>>>>>>>> You mean adding an fd-based TC API to the kernel? > >>>>>>
Re: [PATCH bpf-next 3/5] libbpf: add low level TC-BPF API
On Thu, Apr 15, 2021 at 8:57 AM Toke Høiland-Jørgensen wrote: > > Andrii Nakryiko writes: > > > On Wed, Apr 14, 2021 at 3:51 PM Toke Høiland-Jørgensen > > wrote: > >> > >> Andrii Nakryiko writes: > >> > >> > On Wed, Apr 14, 2021 at 3:58 AM Toke Høiland-Jørgensen > >> > wrote: > >> >> > >> >> Andrii Nakryiko writes: > >> >> > >> >> > On Tue, Apr 6, 2021 at 3:06 AM Toke Høiland-Jørgensen > >> >> > wrote: > >> >> >> > >> >> >> Andrii Nakryiko writes: > >> >> >> > >> >> >> > On Sat, Apr 3, 2021 at 10:47 AM Alexei Starovoitov > >> >> >> > wrote: > >> >> >> >> > >> >> >> >> On Sat, Apr 03, 2021 at 12:38:06AM +0530, Kumar Kartikeya Dwivedi > >> >> >> >> wrote: > >> >> >> >> > On Sat, Apr 03, 2021 at 12:02:14AM IST, Alexei Starovoitov > >> >> >> >> > wrote: > >> >> >> >> > > On Fri, Apr 2, 2021 at 8:27 AM Kumar Kartikeya Dwivedi > >> >> >> >> > > wrote: > >> >> >> >> > > > [...] > >> >> >> >> > > > >> >> >> >> > > All of these things are messy because of tc legacy. bpf tried > >> >> >> >> > > to follow tc style > >> >> >> >> > > with cls and act distinction and it didn't quite work. cls > >> >> >> >> > > with > >> >> >> >> > > direct-action is the only > >> >> >> >> > > thing that became mainstream while tc style attach wasn't > >> >> >> >> > > really addressed. > >> >> >> >> > > There were several incidents where tc had tens of thousands > >> >> >> >> > > of progs attached > >> >> >> >> > > because of this attach/query/index weirdness described above. > >> >> >> >> > > I think the only way to address this properly is to introduce > >> >> >> >> > > bpf_link style of > >> >> >> >> > > attaching to tc. Such bpf_link would support ingress/egress > >> >> >> >> > > only. > >> >> >> >> > > direction-action will be implied. There won't be any index > >> >> >> >> > > and query > >> >> >> >> > > will be obvious. > >> >> >> >> > > >> >> >> >> > Note that we already have bpf_link support working (without > >> >> >> >> > support for pinning > >> >> >> >> > ofcourse) in a limited way. 
The ifindex, protocol, parent_id, > >> >> >> >> > priority, handle, > >> >> >> >> > chain_index tuple uniquely identifies a filter, so we stash > >> >> >> >> > this in the bpf_link > >> >> >> >> > and are able to operate on the exact filter during release. > >> >> >> >> > >> >> >> >> Except they're not unique. The library can stash them, but > >> >> >> >> something else > >> >> >> >> doing detach via iproute2 or their own netlink calls will detach > >> >> >> >> the prog. > >> >> >> >> This other app can attach to the same spot a different prog and > >> >> >> >> now > >> >> >> >> bpf_link__destroy will be detaching somebody else prog. > >> >> >> >> > >> >> >> >> > > So I would like to propose to take this patch set a step > >> >> >> >> > > further from > >> >> >> >> > > what Daniel said: > >> >> >> >> > > int bpf_tc_attach(prog_fd, ifindex, {INGRESS,EGRESS}): > >> >> >> >> > > and make this proposed api to return FD. > >> >> >> >> > > To detach from tc ingress/egress just close(fd). > >> >> >> >> > > >> >> >> >> > You mean adding an fd-based TC API to the kernel? > >> >> >> >> > &
Re: [PATCH bpf-next v4 1/6] bpf: Factorize bpf_trace_printk and bpf_seq_printf
On Wed, Apr 14, 2021 at 11:54 AM Florent Revest wrote: > > Two helpers (trace_printk and seq_printf) have very similar > implementations of format string parsing and a third one is coming > (snprintf). To avoid code duplication and make the code easier to > maintain, this moves the operations associated with format string > parsing (validation and argument sanitization) into one generic > function. > > The implementation of the two existing helpers already drifted quite a > bit so unifying them entailed a lot of changes: > > - bpf_trace_printk always expected fmt[fmt_size] to be the terminating > NULL character, this is no longer true, the first 0 is terminating. > - bpf_trace_printk now supports %% (which produces the percentage char). > - bpf_trace_printk now skips width formating fields. > - bpf_trace_printk now supports the X modifier (capital hexadecimal). > - bpf_trace_printk now supports %pK, %px, %pB, %pi4, %pI4, %pi6 and %pI6 > - argument casting on 32 bit has been simplified into one macro and > using an enum instead of obscure int increments. > > - bpf_seq_printf now uses bpf_trace_copy_string instead of > strncpy_from_kernel_nofault and handles the %pks %pus specifiers. > - bpf_seq_printf now prints longs correctly on 32 bit architectures. > > - both were changed to use a global per-cpu tmp buffer instead of one > stack buffer for trace_printk and 6 small buffers for seq_printf. > - to avoid per-cpu buffer usage conflict, these helpers disable > preemption while the per-cpu buffer is in use. > - both helpers now support the %ps and %pS specifiers to print symbols. > > The implementation is also moved from bpf_trace.c to helpers.c because > the upcoming bpf_snprintf helper will be made available to all BPF > programs and will need it. > > Signed-off-by: Florent Revest > --- > include/linux/bpf.h | 20 +++ > kernel/bpf/helpers.c | 254 +++ > kernel/trace/bpf_trace.c | 371 --- > 3 files changed, 311 insertions(+), 334 deletions(-) > [...] 
> +static int try_get_fmt_tmp_buf(char **tmp_buf) > +{ > + struct bpf_printf_buf *bufs; > + int used; > + > + if (*tmp_buf) > + return 0; > + > + preempt_disable(); > + used = this_cpu_inc_return(bpf_printf_buf_used); > + if (WARN_ON_ONCE(used > 1)) { > + this_cpu_dec(bpf_printf_buf_used); this makes me uncomfortable. If used > 1, you won't preempt_enable() here, but you'll decrease count. Then later bpf_printf_cleanup() will be called (inside bpf_printf_prepare()) and will further decrease count (which it didn't increase, so it's a mess now). > + return -EBUSY; > + } > + bufs = this_cpu_ptr(&bpf_printf_buf); > + *tmp_buf = bufs->tmp_buf; > + > + return 0; > +} > + [...] > + i += 2; > + if (!final_args) > + goto fmt_next; > + > + if (try_get_fmt_tmp_buf(&tmp_buf)) { > + err = -EBUSY; > + goto out; this probably should bypass doing bpf_printf_cleanup() and try_get_fmt_tmp_buf() should enable preemption internally on error. > + } > + > + copy_size = (fmt[i + 2] == '4') ? 4 : 16; > + if (tmp_buf_len < copy_size) { > + err = -ENOSPC; > + goto out; > + } > + [...] > -static __printf(1, 0) int bpf_do_trace_printk(const char *fmt, ...)
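The fix Andrii is asking for — the error path must undo only its own increment and re-enable preemption before returning, so the caller never runs cleanup for a buffer it never acquired — can be modeled in plain userspace C. This is an illustrative sketch, not the kernel code: `try_get_buf`/`put_buf` are made-up names and the preempt calls are no-op stand-ins.

```c
#include <assert.h>
#include <stddef.h>

#define BUF_LEN 512
static char tmp_buf[BUF_LEN];
static int buf_used;                /* models the per-cpu bpf_printf_buf_used */

static void preempt_disable(void) { /* no-op stand-in for the kernel call */ }
static void preempt_enable(void)  { /* no-op stand-in for the kernel call */ }

/* On failure, undo the increment AND re-enable preemption internally,
 * so the caller can bail out without any cleanup call of its own. */
static int try_get_buf(char **buf)
{
	if (*buf)
		return 0;               /* already holding the buffer */
	preempt_disable();
	if (++buf_used > 1) {       /* nested user on this "CPU" */
		buf_used--;             /* undo only our own increment */
		preempt_enable();       /* balanced: no leaked preempt_disable */
		return -1;              /* models -EBUSY */
	}
	*buf = tmp_buf;
	return 0;
}

static void put_buf(char **buf)
{
	if (*buf) {                 /* release only if we really hold it */
		*buf = NULL;
		buf_used--;
		preempt_enable();
	}
}
```

With this shape, a failed acquisition leaves the counter balanced and a no-op `put_buf` on the loser's handle cannot double-decrement — exactly the mess the review points out in the v3 code.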
> +BPF_CALL_5(bpf_trace_printk, char *, fmt, u32, fmt_size, u64, arg1, > + u64, arg2, u64, arg3) > { > + u64 args[MAX_TRACE_PRINTK_VARARGS] = { arg1, arg2, arg3 }; > + enum bpf_printf_mod_type mod[MAX_TRACE_PRINTK_VARARGS]; > static char buf[BPF_TRACE_PRINTK_SIZE]; > unsigned long flags; > - va_list ap; > int ret; > > - raw_spin_lock_irqsave(&trace_printk_lock, flags); > - va_start(ap, fmt); > - ret = vsnprintf(buf, sizeof(buf), fmt, ap); > - va_end(ap); > - /* vsnprintf() will not append null for zero-length strings */ > + ret = bpf_printf_prepare(fmt, fmt_size, args, args, mod, > +MAX_TRACE_PRINTK_VARARGS); > + if (ret < 0) > + return ret; > + > + ret = snprintf(buf, sizeof(buf), fmt, BPF_CAST_FMT_ARG(0, args, mod), > + BPF_CAST_FMT_ARG(1, args, mod), BPF_CAST_FMT_ARG(2, args, > mod)); > + /* snprintf() will not append null for zero-length strings */ > if (ret == 0) > buf[0] = '\0'; > + > + raw_spin_lock_irqsave(&trace_printk_lock, flags); > trace_bpf_trace_printk(buf); > raw_spin_unlock_irqrestore(&trace_printk_lock, flags); > > - return ret; see here, no + 1 :( > -} > - > -/* > - * Only limited trace_printk() conversion specifiers allowed: > - * %d %i
Re: [PATCH] selftests/bpf: Fix the ASSERT_ERR_PTR macro
On Wed, Apr 14, 2021 at 11:58 AM Martin KaFai Lau wrote: > > On Wed, Apr 14, 2021 at 05:56:32PM +0200, Florent Revest wrote: > > It is just missing a ';'. This macro is not used by any test yet. > > > > Signed-off-by: Florent Revest > Fixes: 22ba36351631 ("selftests/bpf: Move and extend ASSERT_xxx() testing > macros") > Thanks, Martin. Added Fixes tag and applied to bpf-next. > Since it has not been used, it could be bpf-next. Please also tag > it in the future. > > Acked-by: Martin KaFai Lau
Re: [PATCH bpf-next 3/5] libbpf: add low level TC-BPF API
On Wed, Apr 14, 2021 at 4:32 PM Daniel Borkmann wrote: > > On 4/15/21 1:19 AM, Andrii Nakryiko wrote: > > On Wed, Apr 14, 2021 at 3:51 PM Toke Høiland-Jørgensen > > wrote: > >> Andrii Nakryiko writes: > >>> On Wed, Apr 14, 2021 at 3:58 AM Toke Høiland-Jørgensen > >>> wrote: > >>>> Andrii Nakryiko writes: > >>>>> On Tue, Apr 6, 2021 at 3:06 AM Toke Høiland-Jørgensen > >>>>> wrote: > >>>>>> Andrii Nakryiko writes: > >>>>>>> On Sat, Apr 3, 2021 at 10:47 AM Alexei Starovoitov > >>>>>>> wrote: > >>>>>>>> On Sat, Apr 03, 2021 at 12:38:06AM +0530, Kumar Kartikeya Dwivedi > >>>>>>>> wrote: > >>>>>>>>> On Sat, Apr 03, 2021 at 12:02:14AM IST, Alexei Starovoitov wrote: > >>>>>>>>>> On Fri, Apr 2, 2021 at 8:27 AM Kumar Kartikeya Dwivedi > >>>>>>>>>> wrote: > >>>>>>>>>>> [...] > >>>>>>>>>> > >>>>>>>>>> All of these things are messy because of tc legacy. bpf tried to > >>>>>>>>>> follow tc style > >>>>>>>>>> with cls and act distinction and it didn't quite work. cls with > >>>>>>>>>> direct-action is the only > >>>>>>>>>> thing that became mainstream while tc style attach wasn't really > >>>>>>>>>> addressed. > >>>>>>>>>> There were several incidents where tc had tens of thousands of > >>>>>>>>>> progs attached > >>>>>>>>>> because of this attach/query/index weirdness described above. > >>>>>>>>>> I think the only way to address this properly is to introduce > >>>>>>>>>> bpf_link style of > >>>>>>>>>> attaching to tc. Such bpf_link would support ingress/egress only. > >>>>>>>>>> direction-action will be implied. There won't be any index and > >>>>>>>>>> query > >>>>>>>>>> will be obvious. > >>>>>>>>> > >>>>>>>>> Note that we already have bpf_link support working (without support > >>>>>>>>> for pinning > >>>>>>>>> ofcourse) in a limited way. The ifindex, protocol, parent_id, > >>>>>>>>> priority, handle, > >>>>>>>>> chain_index tuple uniquely identifies a filter, so we stash this in > >>>>>>>>> the bpf_link > >>>>>>>>> and are able to operate on the exact filter during release. 
> >>>>>>>> > >>>>>>>> Except they're not unique. The library can stash them, but something > >>>>>>>> else > >>>>>>>> doing detach via iproute2 or their own netlink calls will detach the > >>>>>>>> prog. > >>>>>>>> This other app can attach to the same spot a different prog and now > >>>>>>>> bpf_link__destroy will be detaching somebody else prog. > >>>>>>>> > >>>>>>>>>> So I would like to propose to take this patch set a step further > >>>>>>>>>> from > >>>>>>>>>> what Daniel said: > >>>>>>>>>> int bpf_tc_attach(prog_fd, ifindex, {INGRESS,EGRESS}): > >>>>>>>>>> and make this proposed api to return FD. > >>>>>>>>>> To detach from tc ingress/egress just close(fd). > >>>>>>>>> > >>>>>>>>> You mean adding an fd-based TC API to the kernel? > >>>>>>>> > >>>>>>>> yes. > >>>>>>> > >>>>>>> I'm totally for bpf_link-based TC attachment. > >>>>>>> > >>>>>>> But I think *also* having "legacy" netlink-based APIs will allow > >>>>>>> applications to handle older kernels in a much nicer way without extra > >>>>>>> dependency on iproute2. We have a similar situation with kprobe, where > >>>>>>> currently libbpf only supports "modern" fd-based attachment, but users > >>>>>>> periodically ask questions and struggle to figure out issues on older > >>>>>>> kernels that don't support new APIs.
Re: [PATCH bpf-next 3/5] libbpf: add low level TC-BPF API
On Wed, Apr 14, 2021 at 3:51 PM Toke Høiland-Jørgensen wrote: > > Andrii Nakryiko writes: > > > On Wed, Apr 14, 2021 at 3:58 AM Toke Høiland-Jørgensen > > wrote: > >> > >> Andrii Nakryiko writes: > >> > >> > On Tue, Apr 6, 2021 at 3:06 AM Toke Høiland-Jørgensen > >> > wrote: > >> >> > >> >> Andrii Nakryiko writes: > >> >> > >> >> > On Sat, Apr 3, 2021 at 10:47 AM Alexei Starovoitov > >> >> > wrote: > >> >> >> > >> >> >> On Sat, Apr 03, 2021 at 12:38:06AM +0530, Kumar Kartikeya Dwivedi > >> >> >> wrote: > >> >> >> > On Sat, Apr 03, 2021 at 12:02:14AM IST, Alexei Starovoitov wrote: > >> >> >> > > On Fri, Apr 2, 2021 at 8:27 AM Kumar Kartikeya Dwivedi > >> >> >> > > wrote: > >> >> >> > > > [...] > >> >> >> > > > >> >> >> > > All of these things are messy because of tc legacy. bpf tried to > >> >> >> > > follow tc style > >> >> >> > > with cls and act distinction and it didn't quite work. cls with > >> >> >> > > direct-action is the only > >> >> >> > > thing that became mainstream while tc style attach wasn't really > >> >> >> > > addressed. > >> >> >> > > There were several incidents where tc had tens of thousands of > >> >> >> > > progs attached > >> >> >> > > because of this attach/query/index weirdness described above. > >> >> >> > > I think the only way to address this properly is to introduce > >> >> >> > > bpf_link style of > >> >> >> > > attaching to tc. Such bpf_link would support ingress/egress only. > >> >> >> > > direction-action will be implied. There won't be any index and > >> >> >> > > query > >> >> >> > > will be obvious. > >> >> >> > > >> >> >> > Note that we already have bpf_link support working (without > >> >> >> > support for pinning > >> >> >> > ofcourse) in a limited way. The ifindex, protocol, parent_id, > >> >> >> > priority, handle, > >> >> >> > chain_index tuple uniquely identifies a filter, so we stash this > >> >> >> > in the bpf_link > >> >> >> > and are able to operate on the exact filter during release. 
> >> >> >> > >> >> >> Except they're not unique. The library can stash them, but something > >> >> >> else > >> >> >> doing detach via iproute2 or their own netlink calls will detach the > >> >> >> prog. > >> >> >> This other app can attach to the same spot a different prog and now > >> >> >> bpf_link__destroy will be detaching somebody else prog. > >> >> >> > >> >> >> > > So I would like to propose to take this patch set a step further > >> >> >> > > from > >> >> >> > > what Daniel said: > >> >> >> > > int bpf_tc_attach(prog_fd, ifindex, {INGRESS,EGRESS}): > >> >> >> > > and make this proposed api to return FD. > >> >> >> > > To detach from tc ingress/egress just close(fd). > >> >> >> > > >> >> >> > You mean adding an fd-based TC API to the kernel? > >> >> >> > >> >> >> yes. > >> >> > > >> >> > I'm totally for bpf_link-based TC attachment. > >> >> > > >> >> > But I think *also* having "legacy" netlink-based APIs will allow > >> >> > applications to handle older kernels in a much nicer way without extra > >> >> > dependency on iproute2. We have a similar situation with kprobe, where > >> >> > currently libbpf only supports "modern" fd-based attachment, but users > >> >> > periodically ask questions and struggle to figure out issues on older > >> >> > kernels that don't support new APIs. > >> >> > >> >> +1; I am OK with adding a new bpf_link-based way to attach TC programs, > >> >> b
Re: [PATCH bpf-next v3 3/6] bpf: Add a bpf_snprintf helper
On Wed, Apr 14, 2021 at 11:30 AM Florent Revest wrote: > > Hey Geert! :) > > On Wed, Apr 14, 2021 at 8:02 PM Geert Uytterhoeven > wrote: > > On Wed, Apr 14, 2021 at 9:41 AM Andrii Nakryiko > > wrote: > > > On Mon, Apr 12, 2021 at 8:38 AM Florent Revest > > > wrote: > > > > + fmt = (char *)fmt_addr + fmt_map_off; > > > > + > > > > > > bot complained about lack of (long) cast before fmt_addr, please address > > > > (uintptr_t), I assume? > > (uintptr_t) seems more correct to me as well. However, I just had a > look at the rest of verifier.c and (long) casts are already used > pretty much everywhere whereas uintptr_t isn't used yet. > I'll send a v4 with a long cast for the sake of consistency with the > rest of the verifier. right, I don't care about long or uintptr_t, both are guaranteed to work, I just remember seeing a lot of code with (long) cast. I have no preference.
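The reason "both are guaranteed to work" is that on the LP64 targets this verifier code runs on, `(long)` and `(uintptr_t)` both round-trip a pointer-sized address stored in a `u64`. A quick userspace illustration (helper names are invented for the demo):

```c
#include <assert.h>
#include <stdint.h>

/* Recover a pointer from a u64-held address via a (long) intermediate,
 * as the verifier code under discussion does. */
static char *u64_to_ptr_long(uint64_t addr)
{
	return (char *)(long)addr;
}

/* Same round-trip via (uintptr_t), the cast Geert suggested. */
static char *u64_to_ptr_uintptr(uint64_t addr)
{
	return (char *)(uintptr_t)addr;
}
```

Either cast silences the "cast to pointer from integer of different size" complaint; the choice is purely stylistic, which is why consistency with the rest of verifier.c decided it.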
Re: [PATCH bpf-next v3 3/6] bpf: Add a bpf_snprintf helper
On Wed, Apr 14, 2021 at 2:46 AM Florent Revest wrote: > > On Wed, Apr 14, 2021 at 1:16 AM Andrii Nakryiko > wrote: > > On Mon, Apr 12, 2021 at 8:38 AM Florent Revest wrote: > > > +static int check_bpf_snprintf_call(struct bpf_verifier_env *env, > > > + struct bpf_reg_state *regs) > > > +{ > > > + struct bpf_reg_state *fmt_reg = &regs[BPF_REG_3]; > > > + struct bpf_reg_state *data_len_reg = &regs[BPF_REG_5]; > > > + struct bpf_map *fmt_map = fmt_reg->map_ptr; > > > + int err, fmt_map_off, num_args; > > > + u64 fmt_addr; > > > + char *fmt; > > > + > > > + /* data must be an array of u64 */ > > > + if (data_len_reg->var_off.value % 8) > > > + return -EINVAL; > > > + num_args = data_len_reg->var_off.value / 8; > > > + > > > + /* fmt being ARG_PTR_TO_CONST_STR guarantees that var_off is const > > > +* and map_direct_value_addr is set. > > > +*/ > > > + fmt_map_off = fmt_reg->off + fmt_reg->var_off.value; > > > + err = fmt_map->ops->map_direct_value_addr(fmt_map, &fmt_addr, > > > + fmt_map_off); > > > + if (err) > > > + return err; > > > + fmt = (char *)fmt_addr + fmt_map_off; > > > + > > > > bot complained about lack of (long) cast before fmt_addr, please address > > Will do. > > > > + /* Maximumly we can have MAX_SNPRINTF_VARARGS parameters, just > > > give > > > +* all of them to snprintf(). > > > +*/ > > > + err = snprintf(str, str_size, fmt, BPF_CAST_FMT_ARG(0, args, mod), > > > + BPF_CAST_FMT_ARG(1, args, mod), BPF_CAST_FMT_ARG(2, args, > > > mod), > > > + BPF_CAST_FMT_ARG(3, args, mod), BPF_CAST_FMT_ARG(4, args, > > > mod), > > > + BPF_CAST_FMT_ARG(5, args, mod), BPF_CAST_FMT_ARG(6, args, > > > mod), > > > + BPF_CAST_FMT_ARG(7, args, mod), BPF_CAST_FMT_ARG(8, args, > > > mod), > > > + BPF_CAST_FMT_ARG(9, args, mod), BPF_CAST_FMT_ARG(10, > > > args, mod), > > > + BPF_CAST_FMT_ARG(11, args, mod)); > > > + > > > + put_fmt_tmp_buf(); > > > > reading this for at least 3rd time, this put_fmt_tmp_buf() looks a bit > > out of place and kind of random.
I think bpf_printf_cleanup() name > > pairs with bpf_printf_prepare() better. > > Yes, I thought it would be clever to name that function > put_fmt_tmp_buf() as a clear parallel to try_get_fmt_tmp_buf() but > because it only puts the buffer if it is used and because they get > called in two different contexts, it's after all maybe not such a > clever name... I'll revert to bpf_printf_cleanup(). Thank you for your > patience with my naming adventures! :) > > > > + > > > + return err + 1; > > > > snprintf() already returns string length *including* terminating zero, > > so this is wrong > > lib/vsprintf.c says: > * The return value is the number of characters which would be > * generated for the given input, excluding the trailing null, > * as per ISO C99. > > Also if I look at the "no arg" test case in the selftest patch. > "simple case" is asserted to return 12 which seems correct to me > (includes the terminating zero only once). Am I missing something ? > no, you are right, but that means that bpf_trace_printk is broken, it doesn't do + 1 (which threw me off here), shall we fix that? > However that makes me wonder whether it would be more appropriate to > return the value excluding the trailing null. On one hand it makes > sense to be coherent with other BPF helpers that include the trailing > zero (as discussed in patch v1), on the other hand the helper is > clearly named after the standard "snprintf" function and it's likely > that users will assume it works the same as the std snprintf. Having zero included simplifies BPF code tremendously for cases like bpf_probe_read_str(). So no, let's stick with including zero terminator in return size.
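The libc behavior being debated — the return value is the would-be output length, excluding the trailing NUL, even on truncation — is what forces the `ret + 1` in bpf_snprintf (and what bpf_trace_printk was missing). A minimal demonstration with a thin wrapper (the `fmt_len` name is just for the demo):

```c
#include <stdio.h>
#include <string.h>
#include <assert.h>

/* snprintf returns the number of characters that *would* be generated,
 * excluding the trailing NUL, as per ISO C99 -- so a helper that wants
 * to report the full buffer usage including the terminator returns ret + 1. */
static int fmt_len(char *dst, size_t cap, const char *src)
{
	return snprintf(dst, cap, "%s", src);
}
```

So "Hello world!" reports 12 from libc, and the BPF helper's `return err + 1` turns that into the 13 bytes actually consumed — matching the selftest's "simple case" expectation of 12 only because the selftest compares against the pre-increment convention chosen for the helper.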
Re: [PATCH bpf-next 3/5] libbpf: add low level TC-BPF API
On Wed, Apr 14, 2021 at 3:58 AM Toke Høiland-Jørgensen wrote: > > Andrii Nakryiko writes: > > > On Tue, Apr 6, 2021 at 3:06 AM Toke Høiland-Jørgensen > > wrote: > >> > >> Andrii Nakryiko writes: > >> > >> > On Sat, Apr 3, 2021 at 10:47 AM Alexei Starovoitov > >> > wrote: > >> >> > >> >> On Sat, Apr 03, 2021 at 12:38:06AM +0530, Kumar Kartikeya Dwivedi wrote: > >> >> > On Sat, Apr 03, 2021 at 12:02:14AM IST, Alexei Starovoitov wrote: > >> >> > > On Fri, Apr 2, 2021 at 8:27 AM Kumar Kartikeya Dwivedi > >> >> > > wrote: > >> >> > > > [...] > >> >> > > > >> >> > > All of these things are messy because of tc legacy. bpf tried to > >> >> > > follow tc style > >> >> > > with cls and act distinction and it didn't quite work. cls with > >> >> > > direct-action is the only > >> >> > > thing that became mainstream while tc style attach wasn't really > >> >> > > addressed. > >> >> > > There were several incidents where tc had tens of thousands of > >> >> > > progs attached > >> >> > > because of this attach/query/index weirdness described above. > >> >> > > I think the only way to address this properly is to introduce > >> >> > > bpf_link style of > >> >> > > attaching to tc. Such bpf_link would support ingress/egress only. > >> >> > > direction-action will be implied. There won't be any index and query > >> >> > > will be obvious. > >> >> > > >> >> > Note that we already have bpf_link support working (without support > >> >> > for pinning > >> >> > ofcourse) in a limited way. The ifindex, protocol, parent_id, > >> >> > priority, handle, > >> >> > chain_index tuple uniquely identifies a filter, so we stash this in > >> >> > the bpf_link > >> >> > and are able to operate on the exact filter during release. > >> >> > >> >> Except they're not unique. The library can stash them, but something > >> >> else > >> >> doing detach via iproute2 or their own netlink calls will detach the > >> >> prog. 
> >> >> This other app can attach to the same spot a different prog and now > >> >> bpf_link__destroy will be detaching somebody else prog. > >> >> > >> >> > > So I would like to propose to take this patch set a step further > >> >> > > from > >> >> > > what Daniel said: > >> >> > > int bpf_tc_attach(prog_fd, ifindex, {INGRESS,EGRESS}): > >> >> > > and make this proposed api to return FD. > >> >> > > To detach from tc ingress/egress just close(fd). > >> >> > > >> >> > You mean adding an fd-based TC API to the kernel? > >> >> > >> >> yes. > >> > > >> > I'm totally for bpf_link-based TC attachment. > >> > > >> > But I think *also* having "legacy" netlink-based APIs will allow > >> > applications to handle older kernels in a much nicer way without extra > >> > dependency on iproute2. We have a similar situation with kprobe, where > >> > currently libbpf only supports "modern" fd-based attachment, but users > >> > periodically ask questions and struggle to figure out issues on older > >> > kernels that don't support new APIs. > >> > >> +1; I am OK with adding a new bpf_link-based way to attach TC programs, > >> but we still need to support the netlink API in libbpf. > >> > >> > So I think we'd have to support legacy TC APIs, but I agree with > >> > Alexei and Daniel that we should keep it to the simplest and most > >> > straightforward API of supporting direction-action attachments and > >> > setting up qdisc transparently (if I'm getting all the terminology > >> > right, after reading Quentin's blog post). That coincidentally should > >> > probably match how bpf_link-based TC API will look like, so all that > >> > can be abstracted behind a single bpf_link__attach_tc() API as well, > >> > right? That's the plan for dealing with kprobe right now, btw. Libbpf > >> > will detect the best available API and trans
Re: [PATCH bpf-next v3 6/6] selftests/bpf: Add a series of tests for bpf_snprintf
On Wed, Apr 14, 2021 at 2:21 AM Florent Revest wrote: > > On Wed, Apr 14, 2021 at 1:21 AM Andrii Nakryiko > wrote: > > > > On Mon, Apr 12, 2021 at 8:38 AM Florent Revest wrote: > > > > > > This exercises most of the format specifiers. > > > > > > Signed-off-by: Florent Revest > > > Acked-by: Andrii Nakryiko > > > --- > > > > As I mentioned on another patch, we probably need negative tests even > > more than positive ones. > > Agreed. > > > I think an easy and nice way to do this is to have a separate BPF > > skeleton where fmt string and arguments are provided through read-only > > global variables, so that user-space can re-use the same BPF skeleton > > to simulate multiple cases. BPF program itself would just call > > bpf_snprintf() and store the returned result. > > Ah, great idea! I was thinking of having one skeleton for each but it > would be a bit much indeed. > > Because the format string needs to be in a read only map though, I > hope it can be modified from userspace before loading. I'll try it out > and see :) if it doesn't work I'll just use more skeletons You need read-only variables (const volatile my_type). Their contents are statically verified by BPF verifier, yet user-space can pre-setup it at runtime. > > > Whether we need to validate the verifier log is up to debate (though > > it's not that hard to do by overriding libbpf_print_fn() callback), > > I'd be ok at least knowing that some bad format strings are rejected > > and don't crash the kernel. > > Alright :) > > > > > > .../selftests/bpf/prog_tests/snprintf.c | 81 +++ > > > .../selftests/bpf/progs/test_snprintf.c | 74 + > > > 2 files changed, 155 insertions(+) > > > create mode 100644 tools/testing/selftests/bpf/prog_tests/snprintf.c > > > create mode 100644 tools/testing/selftests/bpf/progs/test_snprintf.c > > > > > > > [...]
Re: [PATCH bpf-next 3/5] libbpf: add low level TC-BPF API
On Tue, Apr 6, 2021 at 3:06 AM Toke Høiland-Jørgensen wrote: > > Andrii Nakryiko writes: > > > On Sat, Apr 3, 2021 at 10:47 AM Alexei Starovoitov > > wrote: > >> > >> On Sat, Apr 03, 2021 at 12:38:06AM +0530, Kumar Kartikeya Dwivedi wrote: > >> > On Sat, Apr 03, 2021 at 12:02:14AM IST, Alexei Starovoitov wrote: > >> > > On Fri, Apr 2, 2021 at 8:27 AM Kumar Kartikeya Dwivedi > >> > > wrote: > >> > > > [...] > >> > > > >> > > All of these things are messy because of tc legacy. bpf tried to > >> > > follow tc style > >> > > with cls and act distinction and it didn't quite work. cls with > >> > > direct-action is the only > >> > > thing that became mainstream while tc style attach wasn't really > >> > > addressed. > >> > > There were several incidents where tc had tens of thousands of progs > >> > > attached > >> > > because of this attach/query/index weirdness described above. > >> > > I think the only way to address this properly is to introduce bpf_link > >> > > style of > >> > > attaching to tc. Such bpf_link would support ingress/egress only. > >> > > direction-action will be implied. There won't be any index and query > >> > > will be obvious. > >> > > >> > Note that we already have bpf_link support working (without support for > >> > pinning > >> > ofcourse) in a limited way. The ifindex, protocol, parent_id, priority, > >> > handle, > >> > chain_index tuple uniquely identifies a filter, so we stash this in the > >> > bpf_link > >> > and are able to operate on the exact filter during release. > >> > >> Except they're not unique. The library can stash them, but something else > >> doing detach via iproute2 or their own netlink calls will detach the prog. > >> This other app can attach to the same spot a different prog and now > >> bpf_link__destroy will be detaching somebody else prog. 
> >> > >> > > So I would like to propose to take this patch set a step further from > >> > > what Daniel said: > >> > > int bpf_tc_attach(prog_fd, ifindex, {INGRESS,EGRESS}): > >> > > and make this proposed api to return FD. > >> > > To detach from tc ingress/egress just close(fd). > >> > > >> > You mean adding an fd-based TC API to the kernel? > >> > >> yes. > > > > I'm totally for bpf_link-based TC attachment. > > > > But I think *also* having "legacy" netlink-based APIs will allow > > applications to handle older kernels in a much nicer way without extra > > dependency on iproute2. We have a similar situation with kprobe, where > > currently libbpf only supports "modern" fd-based attachment, but users > > periodically ask questions and struggle to figure out issues on older > > kernels that don't support new APIs. > > +1; I am OK with adding a new bpf_link-based way to attach TC programs, > but we still need to support the netlink API in libbpf. > > > So I think we'd have to support legacy TC APIs, but I agree with > > Alexei and Daniel that we should keep it to the simplest and most > > straightforward API of supporting direction-action attachments and > > setting up qdisc transparently (if I'm getting all the terminology > > right, after reading Quentin's blog post). That coincidentally should > > probably match how bpf_link-based TC API will look like, so all that > > can be abstracted behind a single bpf_link__attach_tc() API as well, > > right? That's the plan for dealing with kprobe right now, btw. Libbpf > > will detect the best available API and transparently fall back (maybe > > with some warning for awareness, due to inherent downsides of legacy > > APIs: no auto-cleanup being the most prominent one). > > Yup, SGTM: Expose both in the low-level API (in bpf.c), and make the > high-level API auto-detect. That way users can also still use the > netlink attach function if they don't want the fd-based auto-close > behaviour of bpf_link. 
So I thought a bit more about this, and it feels like the right move would be to expose only higher-level TC BPF API behind bpf_link. It will keep the API complexity and amount of APIs that libbpf will have to support to the minimum, and will keep the API itself simple: direct-attach with the minimum amount of input arguments. By not exposing low-level APIs we also table the whole bpf_tc_cls_attach_id design discussion, as we now can keep as much info as needed inside bpf_link_tc (which will embed bpf_link internally as well) to support detachment and possibly some additional querying, if needed. I think that's the best and least controversial step forward for getting this API into libbpf. > > -Toke >
Re: [PATCH bpf-next v3 6/6] selftests/bpf: Add a series of tests for bpf_snprintf
On Mon, Apr 12, 2021 at 8:38 AM Florent Revest wrote: > > This exercises most of the format specifiers. > > Signed-off-by: Florent Revest > Acked-by: Andrii Nakryiko > --- As I mentioned on another patch, we probably need negative tests even more than positive ones. I think an easy and nice way to do this is to have a separate BPF skeleton where fmt string and arguments are provided through read-only global variables, so that user-space can re-use the same BPF skeleton to simulate multiple cases. BPF program itself would just call bpf_snprintf() and store the returned result. Whether we need to validate the verifier log is up to debate (though it's not that hard to do by overriding libbpf_print_fn() callback), I'd be ok at least knowing that some bad format strings are rejected and don't crash the kernel. > .../selftests/bpf/prog_tests/snprintf.c | 81 +++ > .../selftests/bpf/progs/test_snprintf.c | 74 + > 2 files changed, 155 insertions(+) > create mode 100644 tools/testing/selftests/bpf/prog_tests/snprintf.c > create mode 100644 tools/testing/selftests/bpf/progs/test_snprintf.c > [...]
Re: [PATCH bpf-next v3 5/6] libbpf: Introduce a BPF_SNPRINTF helper macro
On Mon, Apr 12, 2021 at 8:38 AM Florent Revest wrote: > > Similarly to BPF_SEQ_PRINTF, this macro turns variadic arguments into an > array of u64, making it more natural to call the bpf_snprintf helper. > > Signed-off-by: Florent Revest > --- Nice! Acked-by: Andrii Nakryiko > tools/lib/bpf/bpf_tracing.h | 18 ++ > 1 file changed, 18 insertions(+) > > diff --git a/tools/lib/bpf/bpf_tracing.h b/tools/lib/bpf/bpf_tracing.h > index 1c2e91ee041d..8c954ebc0c7c 100644 > --- a/tools/lib/bpf/bpf_tracing.h > +++ b/tools/lib/bpf/bpf_tracing.h > @@ -447,4 +447,22 @@ static __always_inline typeof(name(0)) ____##name(struct > pt_regs *ctx, ##args) >___param, sizeof(___param)); \ > }) > > +/* > + * BPF_SNPRINTF wraps the bpf_snprintf helper with variadic arguments > instead of > + * an array of u64. > + */ > +#define BPF_SNPRINTF(out, out_size, fmt, args...) \ > +({ \ > + static const char ___fmt[] = fmt; \ > + unsigned long long ___param[___bpf_narg(args)]; \ > + \ > + _Pragma("GCC diagnostic push") \ > + _Pragma("GCC diagnostic ignored \"-Wint-conversion\"") \ > + ___bpf_fill(___param, args);\ > + _Pragma("GCC diagnostic pop") \ > + \ > + bpf_snprintf(out, out_size, ___fmt, \ > +___param, sizeof(___param)); \ > +}) > + > #endif > -- > 2.31.1.295.g9ea45b61b8-goog >
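The core trick of BPF_SNPRINTF — a variadic macro packs its arguments into a u64 array and hands the helper the array plus its byte size, from which the argument count is derived — can be sketched in plain C with a stand-in for the helper. This is a userspace model only: `fake_snprintf` and `MY_SNPRINTF` are invented names, and the real macro additionally needs `___bpf_fill` and the int-conversion pragmas to cope with pointer arguments.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for bpf_snprintf: just reports how many u64 args were packed,
 * which is exactly what the kernel derives from data_len / 8. */
static long fake_snprintf(char *out, size_t out_size, const char *fmt,
			  const unsigned long long *data, size_t data_len)
{
	(void)out; (void)out_size; (void)fmt; (void)data;
	return (long)(data_len / sizeof(*data));
}

/* Same shape as BPF_SNPRINTF: variadic args become one u64 array whose
 * byte size tells the callee the argument count. Uses a GNU statement
 * expression, like the real macro in bpf_tracing.h. */
#define MY_SNPRINTF(out, out_size, fmt, ...)				\
({									\
	static const char ___fmt[] = fmt;				\
	unsigned long long ___param[] = { __VA_ARGS__ };		\
	fake_snprintf(out, out_size, ___fmt, ___param,			\
		      sizeof(___param));				\
})
```

This is why the helper's UAPI takes `(data, data_len)` rather than true varargs: BPF calling conventions cap helpers at five arguments, so the array is the only way to smuggle in up to twelve format parameters.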
Re: [PATCH bpf-next v3 4/6] libbpf: Initialize the bpf_seq_printf parameters array field by field
On Mon, Apr 12, 2021 at 8:38 AM Florent Revest wrote: > > When initializing the __param array with a one liner, if all args are > const, the initial array value will be placed in the rodata section but > because libbpf does not support relocation in the rodata section, any > pointer in this array will stay NULL. > > Fixes: c09add2fbc5a ("tools/libbpf: Add bpf_iter support") > Signed-off-by: Florent Revest > --- Looks good! Acked-by: Andrii Nakryiko > tools/lib/bpf/bpf_tracing.h | 40 +++-- > 1 file changed, 29 insertions(+), 11 deletions(-) > [...]
Re: [PATCH bpf-next v3 3/6] bpf: Add a bpf_snprintf helper
On Mon, Apr 12, 2021 at 8:38 AM Florent Revest wrote: > > The implementation takes inspiration from the existing bpf_trace_printk > helper but there are a few differences: > > To allow for a large number of format-specifiers, parameters are > provided in an array, like in bpf_seq_printf. > > Because the output string takes two arguments and the array of > parameters also takes two arguments, the format string needs to fit in > one argument. Thankfully, ARG_PTR_TO_CONST_STR is guaranteed to point to > a zero-terminated read-only map so we don't need a format string length > arg. > > Because the format-string is known at verification time, we also do > a first pass of format string validation in the verifier logic. This > makes debugging easier. > > Signed-off-by: Florent Revest > --- > include/linux/bpf.h| 6 > include/uapi/linux/bpf.h | 28 +++ > kernel/bpf/helpers.c | 2 ++ > kernel/bpf/verifier.c | 41 > kernel/trace/bpf_trace.c | 50 ++ > tools/include/uapi/linux/bpf.h | 28 +++ > 6 files changed, 155 insertions(+) > [...] > diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c > index 5f46dd6f3383..d4020e5f91ee 100644 > --- a/kernel/bpf/verifier.c > +++ b/kernel/bpf/verifier.c > @@ -5918,6 +5918,41 @@ static int check_reference_leak(struct > bpf_verifier_env *env) > return state->acquired_refs ? -EINVAL : 0; > } > > +static int check_bpf_snprintf_call(struct bpf_verifier_env *env, > + struct bpf_reg_state *regs) > +{ > + struct bpf_reg_state *fmt_reg = &regs[BPF_REG_3]; > + struct bpf_reg_state *data_len_reg = &regs[BPF_REG_5]; > + struct bpf_map *fmt_map = fmt_reg->map_ptr; > + int err, fmt_map_off, num_args; > + u64 fmt_addr; > + char *fmt; > + > + /* data must be an array of u64 */ > + if (data_len_reg->var_off.value % 8) > + return -EINVAL; > + num_args = data_len_reg->var_off.value / 8; > + > + /* fmt being ARG_PTR_TO_CONST_STR guarantees that var_off is const > +* and map_direct_value_addr is set.
> +*/ > + fmt_map_off = fmt_reg->off + fmt_reg->var_off.value; > + err = fmt_map->ops->map_direct_value_addr(fmt_map, &fmt_addr, > + fmt_map_off); > + if (err) > + return err; > + fmt = (char *)fmt_addr + fmt_map_off; > + bot complained about lack of (long) cast before fmt_addr, please address [...] > + /* Maximumly we can have MAX_SNPRINTF_VARARGS parameters, just give > +* all of them to snprintf(). > +*/ > + err = snprintf(str, str_size, fmt, BPF_CAST_FMT_ARG(0, args, mod), > + BPF_CAST_FMT_ARG(1, args, mod), BPF_CAST_FMT_ARG(2, args, > mod), > + BPF_CAST_FMT_ARG(3, args, mod), BPF_CAST_FMT_ARG(4, args, > mod), > + BPF_CAST_FMT_ARG(5, args, mod), BPF_CAST_FMT_ARG(6, args, > mod), > + BPF_CAST_FMT_ARG(7, args, mod), BPF_CAST_FMT_ARG(8, args, > mod), > + BPF_CAST_FMT_ARG(9, args, mod), BPF_CAST_FMT_ARG(10, args, > mod), > + BPF_CAST_FMT_ARG(11, args, mod)); > + > + put_fmt_tmp_buf(); reading this for at least 3rd time, this put_fmt_tmp_buf() looks a bit out of place and kind of random. I think bpf_printf_cleanup() name pairs with bpf_printf_prepare() better. > + > + return err + 1; snprintf() already returns string length *including* terminating zero, so this is wrong > +} > + [...]
Re: [PATCH bpf-next v3 2/6] bpf: Add a ARG_PTR_TO_CONST_STR argument type
On Mon, Apr 12, 2021 at 8:38 AM Florent Revest wrote: > > This type provides the guarantee that an argument is going to be a const > pointer to somewhere in a read-only map value. It also checks that this > pointer is followed by a zero character before the end of the map value. > > Signed-off-by: Florent Revest > --- LGTM. Acked-by: Andrii Nakryiko > include/linux/bpf.h | 1 + > kernel/bpf/verifier.c | 41 + > 2 files changed, 42 insertions(+) > [...]
Re: [PATCH bpf-next v3 1/6] bpf: Factorize bpf_trace_printk and bpf_seq_printf
On Mon, Apr 12, 2021 at 8:38 AM Florent Revest wrote: > > Two helpers (trace_printk and seq_printf) have very similar > implementations of format string parsing and a third one is coming > (snprintf). To avoid code duplication and make the code easier to > maintain, this moves the operations associated with format string > parsing (validation and argument sanitization) into one generic > function. > > The implementation of the two existing helpers already drifted quite a > bit so unifying them entailed a lot of changes: > > - bpf_trace_printk always expected fmt[fmt_size] to be the terminating > NULL character, this is no longer true, the first 0 is terminating. > - bpf_trace_printk now supports %% (which produces the percentage char). > - bpf_trace_printk now skips width formating fields. > - bpf_trace_printk now supports the X modifier (capital hexadecimal). > - bpf_trace_printk now supports %pK, %px, %pB, %pi4, %pI4, %pi6 and %pI6 > - argument casting on 32 bit has been simplified into one macro and > using an enum instead of obscure int increments. > > - bpf_seq_printf now uses bpf_trace_copy_string instead of > strncpy_from_kernel_nofault and handles the %pks %pus specifiers. > - bpf_seq_printf now prints longs correctly on 32 bit architectures. > > - both were changed to use a global per-cpu tmp buffer instead of one > stack buffer for trace_printk and 6 small buffers for seq_printf. > - to avoid per-cpu buffer usage conflict, these helpers disable > preemption while the per-cpu buffer is in use. > - both helpers now support the %ps and %pS specifiers to print symbols. > > Signed-off-by: Florent Revest > --- > kernel/trace/bpf_trace.c | 529 ++- > 1 file changed, 248 insertions(+), 281 deletions(-) > [...] 
> +/* Per-cpu temp buffers which can be used by printf-like helpers for %s or %p > + */ > +#define MAX_PRINTF_BUF_LEN 512 > + > +struct bpf_printf_buf { > + char tmp_buf[MAX_PRINTF_BUF_LEN]; > +}; > +static DEFINE_PER_CPU(struct bpf_printf_buf, bpf_printf_buf); > +static DEFINE_PER_CPU(int, bpf_printf_buf_used); > + > +static int try_get_fmt_tmp_buf(char **tmp_buf) > { > - static char buf[BPF_TRACE_PRINTK_SIZE]; > - unsigned long flags; > - va_list ap; > - int ret; > + struct bpf_printf_buf *bufs = this_cpu_ptr(&bpf_printf_buf); why do this_cpu_ptr() here if, in the *tmp_buf case below, it is never used? Just a waste of CPU, no? > + int used; > > - raw_spin_lock_irqsave(&trace_printk_lock, flags); > - va_start(ap, fmt); > - ret = vsnprintf(buf, sizeof(buf), fmt, ap); > - va_end(ap); > - /* vsnprintf() will not append null for zero-length strings */ > - if (ret == 0) > - buf[0] = '\0'; > - trace_bpf_trace_printk(buf); > - raw_spin_unlock_irqrestore(&trace_printk_lock, flags); > + if (*tmp_buf) > + return 0; > > - return ret; > + preempt_disable(); > + used = this_cpu_inc_return(bpf_printf_buf_used); > + if (WARN_ON_ONCE(used > 1)) { > + this_cpu_dec(bpf_printf_buf_used); > + return -EBUSY; > + } get the bufs pointer here instead? > + *tmp_buf = bufs->tmp_buf; > + > + return 0; > +} > + > +static void put_fmt_tmp_buf(void) > +{ > + if (this_cpu_read(bpf_printf_buf_used)) { > + this_cpu_dec(bpf_printf_buf_used); > + preempt_enable(); > + } > } > > /* > - * Only limited trace_printk() conversion specifiers allowed: > - * %d %i %u %x %ld %li %lu %lx %lld %lli %llu %llx %p %pB %pks %pus %s > + * bpf_parse_fmt_str - Generic pass on format strings for printf-like helpers > + * > + * Returns a negative value if fmt is an invalid format string or 0 > otherwise.
> + * > + * This can be used in two ways: > + * - Format string verification only: when final_args and mod are NULL > + * - Arguments preparation: in addition to the above verification, it writes > in > + * final_args a copy of raw_args where pointers from BPF have been > sanitized > + * into pointers safe to use by snprintf. This also writes in the mod array > + * the size requirement of each argument, usable by BPF_CAST_FMT_ARG for > ex. > + * > + * In argument preparation mode, if 0 is returned, safe temporary buffers are > + * allocated and put_fmt_tmp_buf should be called to free them after use. > */ > -BPF_CALL_5(bpf_trace_printk, char *, fmt, u32, fmt_size, u64, arg1, > - u64, arg2, u64, arg3) > -{ > - int i, mod[3] = {}, fmt_cnt = 0; > - char buf[64], fmt_ptype; > - void *unsafe_ptr = NULL; > - bool str_seen = false; > +int bpf_printf_prepare(char *fmt, u32 fmt_size, const u64 *raw_args, > + u64 *final_args, enum bpf_printf_mod_type *mod, > + u32 num_args) > +{ > + int err, i, curr_specifier = 0, copy_size; > + char *unsafe_ptr = NULL, *tmp_buf = NULL; > + size_t
Re: mmotm 2021-04-11-20-47 uploaded (bpf: xsk.c)
On Mon, Apr 12, 2021 at 9:38 AM Randy Dunlap wrote: > > On 4/11/21 8:48 PM, a...@linux-foundation.org wrote: > > The mm-of-the-moment snapshot 2021-04-11-20-47 has been uploaded to > > > >https://www.ozlabs.org/~akpm/mmotm/ > > > > mmotm-readme.txt says > > > > README for mm-of-the-moment: > > > > https://www.ozlabs.org/~akpm/mmotm/ > > > > This is a snapshot of my -mm patch queue. Uploaded at random hopefully > > more than once a week. > > > > You will need quilt to apply these patches to the latest Linus release (5.x > > or 5.x-rcY). The series file is in broken-out.tar.gz and is duplicated in > > https://ozlabs.org/~akpm/mmotm/series > > > > The file broken-out.tar.gz contains two datestamp files: .DATE and > > .DATE-yyyy-mm-dd-hh-mm-ss. Both contain the string yyyy-mm-dd-hh-mm-ss, > > followed by the base kernel version against which this patch series is to > > be applied. > > > > This tree is partially included in linux-next. To see which patches are > > included in linux-next, consult the `series' file. Only the patches > > within the #NEXT_PATCHES_START/#NEXT_PATCHES_END markers are included in > > linux-next. > > > > > > A full copy of the full kernel tree with the linux-next and mmotm patches > > already applied is available through git within an hour of the mmotm > > release. Individual mmotm releases are tagged. The master branch always > > points to the latest release, so it's constantly rebasing. > > > > https://github.com/hnaz/linux-mm > > > > The directory https://www.ozlabs.org/~akpm/mmots/ (mm-of-the-second) > > contains daily snapshots of the -mm tree. It is updated more frequently > > than mmotm, and is untested.
> > > > A git copy of this tree is also available at > > > > https://github.com/hnaz/linux-mm > > on x86_64: > > xsk.c: In function ‘xsk_socket__create_shared’: > xsk.c:1027:7: error: redeclaration of ‘unmap’ with no linkage > bool unmap = umem->fill_save != fill; >^ > xsk.c:1020:7: note: previous declaration of ‘unmap’ was here > bool unmap, rx_setup_done = false, tx_setup_done = false; >^ > xsk.c:1028:7: error: redefinition of ‘rx_setup_done’ > bool rx_setup_done = false, tx_setup_done = false; >^ > xsk.c:1020:14: note: previous definition of ‘rx_setup_done’ was here > bool unmap, rx_setup_done = false, tx_setup_done = false; > ^ > xsk.c:1028:30: error: redefinition of ‘tx_setup_done’ > bool rx_setup_done = false, tx_setup_done = false; > ^ > xsk.c:1020:37: note: previous definition of ‘tx_setup_done’ was here > bool unmap, rx_setup_done = false, tx_setup_done = false; > ^ > > > Full randconfig file is attached. What SHA are you on? I checked that github tree, the source code there doesn't correspond to the errors here (i.e., there is no unmap redefinition on lines 1020 and 1027). Could it be some local merge conflict? > > -- > ~Randy > Reported-by: Randy Dunlap
Re: [PATCH bpf-next v2] libbpf: clarify flags in ringbuf helpers
On Mon, Apr 12, 2021 at 12:25 PM Pedro Tammela wrote: > > In 'bpf_ringbuf_reserve()' we require the flag to be '0' at the moment. > > For 'bpf_ringbuf_{discard,submit,output}' a flag of '0' might send a > notification to the process if needed. > > Signed-off-by: Pedro Tammela > --- Great, thanks! Applied to bpf-next. > include/uapi/linux/bpf.h | 16 > tools/include/uapi/linux/bpf.h | 16 > 2 files changed, 32 insertions(+) > [...]
Re: memory leak in bpf
On Wed, Apr 7, 2021 at 4:24 PM Rustam Kovhaev wrote: > > On Mon, Mar 01, 2021 at 09:43:00PM +0100, Dmitry Vyukov wrote: > > On Mon, Mar 1, 2021 at 9:39 PM Rustam Kovhaev wrote: > > > > > > On Mon, Mar 01, 2021 at 08:05:42PM +0100, Dmitry Vyukov wrote: > > > > On Mon, Mar 1, 2021 at 5:21 PM Rustam Kovhaev > > > > wrote: > > > > > > > > > > On Wed, Dec 09, 2020 at 10:58:10PM -0800, syzbot wrote: > > > > > > syzbot has found a reproducer for the following issue on: > > > > > > > > > > > > HEAD commit:a68a0262 mm/madvise: remove racy mm ownership check > > > > > > git tree: upstream > > > > > > console output: > > > > > > https://syzkaller.appspot.com/x/log.txt?x=11facf1750 > > > > > > kernel config: > > > > > > https://syzkaller.appspot.com/x/.config?x=4305fa9ea70c7a9f > > > > > > dashboard link: > > > > > > https://syzkaller.appspot.com/bug?extid=f3694595248708227d35 > > > > > > compiler: gcc (GCC) 10.1.0-syz 20200507 > > > > > > syz repro: > > > > > > https://syzkaller.appspot.com/x/repro.syz?x=159a961350 > > > > > > C reproducer: > > > > > > https://syzkaller.appspot.com/x/repro.c?x=11bf712350 > > > > > > > > > > > > IMPORTANT: if you fix the issue, please add the following tag to > > > > > > the commit: > > > > > > Reported-by: syzbot+f3694595248708227...@syzkaller.appspotmail.com > > > > > > > > > > > > Debian GNU/Linux 9 syzkaller ttyS0 > > > > > > Warning: Permanently added '10.128.0.9' (ECDSA) to the list of > > > > > > known hosts. > > > > > > executing program > > > > > > executing program > > > > > > executing program > > > > > > BUG: memory leak > > > > > > unreferenced object 0x88810efccc80 (size 64): > > > > > > comm "syz-executor334", pid 8460, jiffies 4294945724 (age 13.850s) > > > > > > hex dump (first 32 bytes): > > > > > > c0 cb 14 04 00 ea ff ff c0 c2 11 04 00 ea ff ff > > > > > > > > > > > > c0 56 3f 04 00 ea ff ff 40 18 38 04 00 ea ff ff > > > > > > .V?.@.8. 
> > > > > > backtrace: > > > > > > [<36ae98a7>] kmalloc_node include/linux/slab.h:575 > > > > > > [inline] > > > > > > [<36ae98a7>] bpf_ringbuf_area_alloc > > > > > > kernel/bpf/ringbuf.c:94 [inline] > > > > > > [<36ae98a7>] bpf_ringbuf_alloc kernel/bpf/ringbuf.c:135 > > > > > > [inline] > > > > > > [<36ae98a7>] ringbuf_map_alloc kernel/bpf/ringbuf.c:183 > > > > > > [inline] > > > > > > [<36ae98a7>] ringbuf_map_alloc+0x1be/0x410 > > > > > > kernel/bpf/ringbuf.c:150 > > > > > > [] find_and_alloc_map > > > > > > kernel/bpf/syscall.c:122 [inline] > > > > > > [ ] map_create kernel/bpf/syscall.c:825 > > > > > > [inline] > > > > > > [ ] __do_sys_bpf+0x7d0/0x30a0 > > > > > > kernel/bpf/syscall.c:4381 > > > > > > [<8feaf393>] do_syscall_64+0x2d/0x70 > > > > > > arch/x86/entry/common.c:46 > > > > > > [ ] entry_SYSCALL_64_after_hwframe+0x44/0xa9 > > > > > > > > > > > > > > > > > > > > > > i am pretty sure that this one is a false positive > > > > > the problem with reproducer is that it does not terminate all of the > > > > > child processes that it spawns > > > > > > > > > > i confirmed that it is a false positive by tracing __fput() and > > > > > bpf_map_release(), i ran reproducer, got kmemleak report, then i > > > > > manually killed those running leftover processes from reproducer and > > > > > then both functions were executed and memory was freed > > > > > > > > > > i am marking this one as: > > > > > #syz invalid > > > > > > > > Hi Rustam, > > > > > > > > Thanks for looking into this. > > > > > > > > I wonder how/where are these objects referenced? If they are not > > > > leaked and referenced somewhere, KMEMLEAK should not report them as > > > > leaks. > > > > So even if this is a false positive for BPF, this is a true positive > > > > bug and something to fix for KMEMLEAK ;) > > > > And syzbot will probably re-create this bug report soon as this still > > > > happens and is not a one-off thing. 
> > > hi Dmitry, i haven't thought of it this way, but i guess you are right, > > > it is a kmemleak bug, ideally kmemleak should be aware that there are > > > still running processes holding references to bpf fd/anonymous inodes > > > which in their turn hold references to allocated bpf maps > > > > KMEMLEAK scans whole memory, so if there are pointers to the object > > anywhere in memory, KMEMLEAK should not report them as leaked. Running > > processes have no direct effect on KMEMLEAK logic. > > So the question is: where are these pointers to these objects? If we > > answer this, we can check how/why KMEMLEAK misses them. Are they > > mangled in some way? > thank you for your comments, they make sense, and indeed, the pointer > gets vmapped. > i should have looked into this sooner, because syzbot did trigger the > issue again, and Andrii had to look into the same bug, sorry
Re: [PATCH bpf-next v2 3/6] bpf: Add a bpf_snprintf helper
On Tue, Apr 6, 2021 at 9:06 AM Florent Revest wrote: > > On Fri, Mar 26, 2021 at 11:55 PM Andrii Nakryiko > wrote: > > On Tue, Mar 23, 2021 at 7:23 PM Florent Revest wrote: > > > The implementation takes inspiration from the existing bpf_trace_printk > > > helper but there are a few differences: > > > > > > To allow for a large number of format-specifiers, parameters are > > > provided in an array, like in bpf_seq_printf. > > > > > > Because the output string takes two arguments and the array of > > > parameters also takes two arguments, the format string needs to fit in > > > one argument. But because ARG_PTR_TO_CONST_STR guarantees to point to a > > > NULL-terminated read-only map, we don't need a format string length arg. > > > > > > Because the format-string is known at verification time, we also move > > > most of the format string validation, currently done in formatting > > > helper calls, into the verifier logic. This makes debugging easier and > > > also slightly improves the runtime performance. 
> > > > > > Signed-off-by: Florent Revest > > > --- > > > include/linux/bpf.h| 6 > > > include/uapi/linux/bpf.h | 28 ++ > > > kernel/bpf/helpers.c | 2 ++ > > > kernel/bpf/verifier.c | 41 +++ > > > kernel/trace/bpf_trace.c | 52 ++ > > > tools/include/uapi/linux/bpf.h | 28 ++ > > > 6 files changed, 157 insertions(+) > > > > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h > > > index 7b5319d75b3e..f3d9c8fa60b3 100644 > > > --- a/include/linux/bpf.h > > > +++ b/include/linux/bpf.h > > > @@ -1893,6 +1893,7 @@ extern const struct bpf_func_proto > > > bpf_skc_to_tcp_request_sock_proto; > > > extern const struct bpf_func_proto bpf_skc_to_udp6_sock_proto; > > > extern const struct bpf_func_proto bpf_copy_from_user_proto; > > > extern const struct bpf_func_proto bpf_snprintf_btf_proto; > > > +extern const struct bpf_func_proto bpf_snprintf_proto; > > > extern const struct bpf_func_proto bpf_per_cpu_ptr_proto; > > > extern const struct bpf_func_proto bpf_this_cpu_ptr_proto; > > > extern const struct bpf_func_proto bpf_ktime_get_coarse_ns_proto; > > > @@ -2018,4 +2019,9 @@ int bpf_arch_text_poke(void *ip, enum > > > bpf_text_poke_type t, > > > struct btf_id_set; > > > bool btf_id_set_contains(const struct btf_id_set *set, u32 id); > > > > > > +enum bpf_printf_mod_type; > > > +int bpf_printf_preamble(char *fmt, u32 fmt_size, const u64 *raw_args, > > > + u64 *final_args, enum bpf_printf_mod_type *mod, > > > + u32 num_args); > > > + > > > #endif /* _LINUX_BPF_H */ > > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > > > index 2d3036e292a9..86af61e912c6 100644 > > > --- a/include/uapi/linux/bpf.h > > > +++ b/include/uapi/linux/bpf.h > > > @@ -4660,6 +4660,33 @@ union bpf_attr { > > > * Return > > > * The number of traversed map elements for success, > > > **-EINVAL** for > > > * invalid **flags**. 
> > > + * > > > + * long bpf_snprintf(char *str, u32 str_size, const char *fmt, u64 > > > *data, u32 data_len) > > > + * Description > > > + * Outputs a string into the **str** buffer of size > > > **str_size** > > > + * based on a format string stored in a read-only map > > > pointed by > > > + * **fmt**. > > > + * > > > + * Each format specifier in **fmt** corresponds to one u64 > > > element > > > + * in the **data** array. For strings and pointers where > > > pointees > > > + * are accessed, only the pointer values are stored in the > > > *data* > > > + * array. The *data_len* is the size of *data* in bytes. > > > + * > > > + * Formats **%s** and **%p{i,I}{4,6}** require to read kernel > > > + * memory. Reading kernel memory may fail due to either > > > invalid > > > + * address or valid address but requiring a major memory > > > fault. If > > > + * reading kernel memory fails, the string for **%s** will > > > be an > > > + * empty strin
Re: [PATCH bpf-next v2 1/6] bpf: Factorize bpf_trace_printk and bpf_seq_printf
On Tue, Apr 6, 2021 at 8:35 AM Florent Revest wrote: > > [Sorry for the late replies, I'm just back from a long easter break :)] > > On Fri, Mar 26, 2021 at 11:51 PM Andrii Nakryiko > wrote: > > On Fri, Mar 26, 2021 at 2:53 PM Andrii Nakryiko > > wrote: > > > On Tue, Mar 23, 2021 at 7:23 PM Florent Revest > > > wrote: > > > > Unfortunately, the implementation of the two existing helpers already > > > > drifted quite a bit and unifying them entailed a lot of changes: > > > > > > "Unfortunately" as in a lot of extra work for you? I think overall > > > though it was very fortunate that you ended up doing it, all > > > implementations are more feature-complete and saner now, no? Thanks a > > > lot for your hard work! > > Ahah, "unfortunately" a bit of extra work for me, indeed. But I find > this kind of refactoring patches even harder to review than to write > so thank you too! > > > > > - bpf_trace_printk always expected fmt[fmt_size] to be the terminating > > > > NULL character, this is no longer true, the first 0 is terminating. > > > > > > You mean if you had bpf_trace_printk("bla bla\0some more bla\0", 24) > > > it would emit that zero character? If yes, I don't think it was a sane > > > behavior anyways. > > The call to snprintf in bpf_do_trace_printk would eventually ignore > "some more bla" but the parsing done in bpf_trace_printk would indeed > read the whole string. > > > > This is great, you already saved some lines of code! I suspect I'll > > > have some complaints about mods (it feels like this preample should > > > provide extra information about which arguments have to be read from > > > kernel/user memory, but I'll see next patches first. > > > > Disregard the last part (at least for now). I had a mental model that > > it should be possible to parse a format string once and then remember > > "instructions" (i.e., arg1 is long, arg2 is string, and so on). But > > that's too complicated, so I think re-parsing the format string is > > much simpler. 
> > I also wanted to do that originally but realized it would keep a lot > of the complexity in the helpers themselves and not really move the > needle. > > > > > +/* Horrid workaround for getting va_list handling working with > > > > different > > > > + * argument type combinations generically for 32 and 64 bit archs. > > > > + */ > > > > +#define BPF_CAST_FMT_ARG(arg_nb, args, mod) > > > > \ > > > > + ((mod[arg_nb] == BPF_PRINTF_LONG_LONG || > > > > \ > > > > +(mod[arg_nb] == BPF_PRINTF_LONG && __BITS_PER_LONG == 64)) > > > > \ > > > > + ? args[arg_nb] > > > > \ > > > > + : ((mod[arg_nb] == BPF_PRINTF_LONG || > > > > \ > > > > +(mod[arg_nb] == BPF_PRINTF_INT && __BITS_PER_LONG == 32)) > > > > \ > > > > > > is this right? INT is always 32-bit, it's only LONG that differs. > > > Shouldn't the rule be > > > > > > (LONG_LONG || LONG && __BITS_PER_LONG) -> (__u64)args[args_nb] > > > (INT || LONG && __BITS_PER_LONG == 32) -> (__u32)args[args_nb] > > > > > > Does (long) cast do anything fancy when casting from u64? Sorry, maybe > > > I'm confused. > > To be honest, I am also confused by that logic... :p My patch tries to > conserve exactly the same logic as "88a5c690b6 bpf: fix > bpf_trace_printk on 32 bit archs" because I was also afraid of missing > something and could not test it on 32 bit arches. From that commit > description, it is unclear to me what "u32 and long are passed > differently to u64, since the result of C conditional operators > follows the "usual arithmetic conversions" rules" means. Maybe Daniel > can comment on this ? Yeah, no idea. Seems like the code above should work fine for 32 and 64 bitness and both little- and big-endianness. > > > > > +int bpf_printf_preamble(char *fmt, u32 fmt_size, const u64 *raw_args, > > > > + u64 *final_args, enum bpf_printf_mod_type *mod, > > > > + u32 num_args) > > > > +{ > > > > + struct bpf_printf_buf *bufs = this_cpu_ptr(_printf_buf); > > > > + int err, i, fmt_cnt = 0, copy_size, used; > &g
Re: [PATCH bpf-next] libbpf: clarify flags in ringbuf helpers
On Wed, Apr 7, 2021 at 1:10 PM Pedro Tammela wrote: > > On Wed, Apr 7, 2021 at 4:58 PM Andrii Nakryiko > wrote: > > > > On Wed, Apr 7, 2021 at 11:43 AM Joe Stringer wrote: > > > > > > Hi Pedro, > > > > > > On Tue, Apr 6, 2021 at 11:58 AM Pedro Tammela wrote: > > > > > > > > In 'bpf_ringbuf_reserve()' we require the flag to be '0' at the moment. > > > > > > > > For 'bpf_ringbuf_{discard,submit,output}' a flag of '0' might send a > > > > notification to the process if needed. > > > > > > > > Signed-off-by: Pedro Tammela > > > > --- > > > > include/uapi/linux/bpf.h | 7 +++ > > > > tools/include/uapi/linux/bpf.h | 7 +++ > > > > 2 files changed, 14 insertions(+) > > > > > > > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > > > > index 49371eba98ba..8c5c7a893b87 100644 > > > > --- a/include/uapi/linux/bpf.h > > > > +++ b/include/uapi/linux/bpf.h > > > > @@ -4061,12 +4061,15 @@ union bpf_attr { > > > > * of new data availability is sent. > > > > * If **BPF_RB_FORCE_WAKEUP** is specified in *flags*, > > > > notification > > > > * of new data availability is sent unconditionally. > > > > + * If **0** is specified in *flags*, notification > > > > + * of new data availability is sent if needed. > > > > > > Maybe a trivial question, but what does "if needed" mean? Does that > > > mean "when the buffer is full"? > > > > I used to call it an "adaptive notification", so maybe let's use that > > term instead of "if needed"? It means that the in-kernel BPF ringbuf code > > will check if the user-space consumer has caught up and consumed all > > the available data. In that case user-space might be waiting > > (sleeping) in epoll_wait() already and not processing samples > > actively. That means that we have to send a notification, otherwise > > user-space might never wake up.
But if the kernel sees that user-space > > is still processing a previous record (consumer position < producer > > position), then we can bypass sending another notification, because > > the user-space consumer protocol dictates that it needs to consume all the > > records until consumer position == producer position. So no > > notification is necessary for the newly submitted sample, as > > user-space will eventually see it without a notification. > > > > Of course there are careful writes and memory ordering involved to make > > sure that we never miss a notification. > > > > Does someone want to try to condense it into a succinct description? ;) > > OK. > > I can try to condense this and perhaps add it as code in the comment? Sure, though there is already a brief comment to that effect. But a high-level explanation in uapi/linux/bpf.h would be great for users.
Re: [PATCH bpf-next] libbpf: clarify flags in ringbuf helpers
On Wed, Apr 7, 2021 at 11:43 AM Joe Stringer wrote: > > Hi Pedro, > > On Tue, Apr 6, 2021 at 11:58 AM Pedro Tammela wrote: > > > > In 'bpf_ringbuf_reserve()' we require the flag to be '0' at the moment. > > > > For 'bpf_ringbuf_{discard,submit,output}' a flag of '0' might send a > > notification to the process if needed. > > > > Signed-off-by: Pedro Tammela > > --- > > include/uapi/linux/bpf.h | 7 +++ > > tools/include/uapi/linux/bpf.h | 7 +++ > > 2 files changed, 14 insertions(+) > > > > diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h > > index 49371eba98ba..8c5c7a893b87 100644 > > --- a/include/uapi/linux/bpf.h > > +++ b/include/uapi/linux/bpf.h > > @@ -4061,12 +4061,15 @@ union bpf_attr { > > * of new data availability is sent. > > * If **BPF_RB_FORCE_WAKEUP** is specified in *flags*, > > notification > > * of new data availability is sent unconditionally. > > + * If **0** is specified in *flags*, notification > > + * of new data availability is sent if needed. > > Maybe a trivial question, but what does "if needed" mean? Does that > mean "when the buffer is full"? I used to call it an "adaptive notification", so maybe let's use that term instead of "if needed"? It means that the in-kernel BPF ringbuf code will check if the user-space consumer has caught up and consumed all the available data. In that case user-space might be waiting (sleeping) in epoll_wait() already and not processing samples actively. That means that we have to send a notification, otherwise user-space might never wake up. But if the kernel sees that user-space is still processing a previous record (consumer position < producer position), then we can bypass sending another notification, because the user-space consumer protocol dictates that it needs to consume all the records until consumer position == producer position. So no notification is necessary for the newly submitted sample, as user-space will eventually see it without a notification.
Of course there are careful writes and memory ordering involved to make sure that we never miss a notification. Does someone want to try to condense it into a succinct description? ;)
Re: [PATCH bpf-next v2 2/3] libbpf: selftests: refactor 'BPF_PERCPU_TYPE()' and 'bpf_percpu()' macros
On Wed, Apr 7, 2021 at 12:30 PM Pedro Tammela wrote: > > On Wed, Apr 7, 2021 at 3:31 PM Andrii Nakryiko > wrote: > > > > On Tue, Apr 6, 2021 at 11:55 AM Pedro Tammela wrote: > > > > > > This macro was refactored out of the bpf selftests. > > > > > > Since percpu values are rounded up to '8' in the kernel, a careless > > > user in userspace might encounter unexpected values when parsing the > > > output of the batched operations. > > > > I wonder if a user has to be more careful, though? This > > BPF_PERCPU_TYPE, __bpf_percpu_align and bpf_percpu macros seem to > > create just another opaque layer. It actually seems detrimental to me. > > > > I'd rather emphasize in the documentation (e.g., in > > bpf_map_lookup_elem) that all per-cpu maps are aligning values at 8 > > bytes, so user has to make sure that array of values provided to > > bpf_map_lookup_elem() has each element size rounded up to 8. > > From my own experience, the documentation has been a very unreliable > source, to the point that I usually jump to the code first rather than > to the documentation nowadays[1]. I totally agree, which is why I think improving docs is necessary. Unfortunately docs are usually lagging behind, because generally people hate writing documentation and it's just a fact of life. > Tests, samples and projects have always been my source of truth and we > are already lacking a bit on those as well. For instance, the samples > directory contains programs that are very outdated (I didn't check if > they are still functional). Yeah, samples/bpf is bitrotting. selftests/bpf, though, are maintained and run regularly and vigorously, so making sure they set a good and realistic example is a good thing. > I think macros like these will be present in most of the projects > dealing with batched operations and as a daily user of libbpf I don't > see how this could not be offered by libbpf as a standardized way to > declare percpu types.
If I were using per-CPU maps a lot, I'd make sure I use u64 and aligned(8) types and bypass all the macro ugliness, because there is no need for it and it just hurts readability. So I don't want libbpf to incentivize bad choices here by providing seemingly convenient macros. Users have to be aware that values are 8-byte aligned/extended. That's not a big secret and not a very obscure thing to learn anyways. > > [1] So batched operations were introduced a little over a year > ago and yet the only reference I had for it was the selftests. The > documentation is on my TODO list, but that's just because I have to > deal with it daily. > Yeah, please do contribute them! > > > > In practice, I'd recommend users to always use __u64/__s64 when having > > primitive integers in a map (they are not saving anything by using > > int, it just creates an illusion of savings). Well, maybe on 32-bit > > arches they would save a bit of CPU, but not on typical 64-bit > > architectures. As for using structs as values, always mark them as > > __attribute__((aligned(8))). > > > > Basically, instead of obscuring the real use some more, let's clarify > > and maybe even provide some examples in documentation? > > Why not do both? > > Provide a standardized way to declare a percpu value with examples and > a good documentation with examples. > Let the user decide what is best for his use case. What is a standardized way? A custom macro with struct { T v; } inside? That's just one way of doing this, and it requires another macro to just access the value (because no one wants to write my_values[cpu].v, right?). I'd say the standardized way of reading values should look like `my_values[cpu]`, that's it. For that you use 64-bit integers or 8-byte aligned structs. And don't mess with macros for that at all. So if a user insists on using int/short/char as value, they can do their own struct { char v; } __aligned(8) trick. But I'd advise such users to reconsider and use u64.
If they are using structs for values, always mark __aligned(8) and forget about this in the rest of your code. As for allocating memory for an array of per-cpu values, there is also no single standardized way we can come up with. It could be malloc() on the heap. Or alloca() on the stack. Or it could be a pre-allocated one for up to the maximum supported number of CPUs. Or... whatever makes sense. So I think the best way to handle all that is to clearly explain how reading per-CPU values from per-CPU maps works in BPF and what the memory layout expectations are. > > > > > > > > > Now that both array and hash maps have support for batched ops in the > > > percpu variant, let's provide a convenient macro to declare percpu map > > > value types.
Re: [syzbot] memory leak in bpf (2)
On Wed, Mar 31, 2021 at 6:08 PM syzbot wrote: > > Hello, > > syzbot found the following issue on: > > HEAD commit:0f4498ce Merge tag 'for-5.12/dm-fixes-2' of git://git.kern.. > git tree: upstream > console output: https://syzkaller.appspot.com/x/log.txt?x=1250e126d0 > kernel config: https://syzkaller.appspot.com/x/.config?x=49f2683f4e7a4347 > dashboard link: https://syzkaller.appspot.com/bug?extid=5d895828587f49e7fe9b > syz repro: https://syzkaller.appspot.com/x/repro.syz?x=10a17016d0 > C reproducer: https://syzkaller.appspot.com/x/repro.c?x=10a32016d0 > > IMPORTANT: if you fix the issue, please add the following tag to the commit: > Reported-by: syzbot+5d895828587f49e7f...@syzkaller.appspotmail.com > > Warning: Permanently added '10.128.0.74' (ECDSA) to the list of known hosts. > executing program > executing program > BUG: memory leak > unreferenced object 0x8881133295c0 (size 64): > comm "syz-executor529", pid 8395, jiffies 4294943939 (age 8.130s) > hex dump (first 32 bytes): > 40 48 3c 04 00 ea ff ff 00 48 3c 04 00 ea ff ff @H<..H<. > c0 e7 3c 04 00 ea ff ff 80 e7 3c 04 00 ea ff ff ..<...<. > backtrace: > [] kmalloc_node include/linux/slab.h:577 [inline] > [] __bpf_map_area_alloc+0xfc/0x120 > kernel/bpf/syscall.c:300 > [] bpf_ringbuf_area_alloc kernel/bpf/ringbuf.c:90 > [inline] > [] bpf_ringbuf_alloc kernel/bpf/ringbuf.c:131 [inline] > [] ringbuf_map_alloc kernel/bpf/ringbuf.c:170 [inline] > [] ringbuf_map_alloc+0x134/0x350 > kernel/bpf/ringbuf.c:146 > [] find_and_alloc_map kernel/bpf/syscall.c:122 [inline] > [] map_create kernel/bpf/syscall.c:828 [inline] > [] __do_sys_bpf+0x7c3/0x2fe0 kernel/bpf/syscall.c:4375 > [] do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46 > [] entry_SYSCALL_64_after_hwframe+0x44/0xae > > I think either kmemleak or syzbot are mis-reporting this. I've added a bunch of printks around all allocations performed by BPF ringbuf. 
When I run repro, I see this: [ 26.013500] ALLOC rb_map 888118d7d000 [ 26.013946] ALLOC KMALLOC AREA 88810d538c00 [ 26.014439] ALLOC PAGES 88810d538c00 [ 26.014826] ALLOC PAGE[0] ea000419af00 [ 26.015272] ALLOC PAGE[1] ea000419aec0 [ 26.015686] ALLOC PAGE[2] ea000419ae80 [ 26.016090] ALLOC PAGE[3] ea00042e29c0 [ 26.016513] ALLOC PAGE[4] ea00042a1000 [ 26.016928] VMAP rb c9539000 [ 26.017291] ALLOC rb_map->rb c9539000 [ 26.017712] FINISHED ALLOC BPF_MAP 888118d7d000 [ 32.105069] ALLOC rb_map 888118d7d200 [ 32.105568] ALLOC KMALLOC AREA 88810d538c80 [ 32.106005] ALLOC PAGES 88810d538c80 [ 32.106407] ALLOC PAGE[0] ea000419aa80 [ 32.106805] ALLOC PAGE[1] ea000419ab00 [ 32.107206] ALLOC PAGE[2] ea000419abc0 [ 32.107607] ALLOC PAGE[3] ea0004284480 [ 32.108003] ALLOC PAGE[4] ea0004284440 [ 32.108419] VMAP rb c95ad000 [ 32.108765] ALLOC rb_map->rb c95ad000 [ 32.109186] FINISHED ALLOC BPF_MAP 888118d7d200 [ 33.592874] kmemleak: 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak) [ 40.526922] kmemleak: 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak) On repro side I get these two warnings: [vmuser@archvm bpf]$ sudo ./repro BUG: memory leak unreferenced object 0x88810d538c00 (size 64): comm "repro", pid 2140, jiffies 4294692933 (age 14.540s) hex dump (first 32 bytes): 00 af 19 04 00 ea ff ff c0 ae 19 04 00 ea ff ff 80 ae 19 04 00 ea ff ff c0 29 2e 04 00 ea ff ff .).. backtrace: [<77bfbfbd>] __bpf_map_area_alloc+0x31/0xc0 [<587fa522>] ringbuf_map_alloc.cold.4+0x48/0x218 [<44d49e96>] __do_sys_bpf+0x359/0x1d90 [] do_syscall_64+0x2d/0x40 [<43d3112a>] entry_SYSCALL_64_after_hwframe+0x44/0xae BUG: memory leak unreferenced object 0x88810d538c80 (size 64): comm "repro", pid 2143, jiffies 4294699025 (age 8.448s) hex dump (first 32 bytes): 80 aa 19 04 00 ea ff ff 00 ab 19 04 00 ea ff ff c0 ab 19 04 00 ea ff ff 80 44 28 04 00 ea ff ff .D(. 
  backtrace:
    [<77bfbfbd>] __bpf_map_area_alloc+0x31/0xc0
    [<587fa522>] ringbuf_map_alloc.cold.4+0x48/0x218
    [<44d49e96>] __do_sys_bpf+0x359/0x1d90
    [] do_syscall_64+0x2d/0x40
    [<43d3112a>] entry_SYSCALL_64_after_hwframe+0x44/0xae

Note that both reported "leaks" (88810d538c80 and 88810d538c00) correspond to
the pages array that bpf_ringbuf allocates and tracks properly internally.
Note also that the syzbot repro doesn't close the FDs of the created BPF
ringbufs, and even when ./repro itself exits with an error, there are still
two forked processes hanging around in my system. So clearly the ringbuf maps
are alive at that point. Reporting any memory leak looks weird at that point,
because that memory is being used by active, referenced BPF ringbuf maps.
Re: [PATCH bpf-next v2 2/3] libbpf: selftests: refactor 'BPF_PERCPU_TYPE()' and 'bpf_percpu()' macros
On Tue, Apr 6, 2021 at 11:55 AM Pedro Tammela wrote:
>
> This macro was refactored out of the bpf selftests.
>
> Since percpu values are rounded up to '8' in the kernel, a careless
> user in userspace might encounter unexpected values when parsing the
> output of the batched operations.

I wonder, though, whether a user really has to be more careful? These
BPF_PERCPU_TYPE, __bpf_percpu_align, and bpf_percpu macros seem to create
just another opaque layer. It actually seems detrimental to me.

I'd rather emphasize in the documentation (e.g., for bpf_map_lookup_elem())
that all per-cpu maps align values at 8 bytes, so the user has to make sure
that each element in the array of values provided to bpf_map_lookup_elem()
has its size rounded up to 8.

In practice, I'd recommend users always use __u64/__s64 when keeping
primitive integers in a map (they are not saving anything by using int, it
just creates an illusion of savings). Well, maybe on 32-bit arches they would
save a bit of CPU, but not on typical 64-bit architectures. As for structs
used as values, always mark them __attribute__((aligned(8))).

Basically, instead of obscuring the real use some more, let's clarify it and
maybe even provide some examples in the documentation?

> Now that both array and hash maps have support for batched ops in the
> percpu variant, let's provide a convenient macro to declare percpu map
> value types.
>
> Updates the tests to a "reference" usage of the new macro.
>
> Signed-off-by: Pedro Tammela
> ---
>  tools/lib/bpf/bpf.h                           | 10
>  tools/testing/selftests/bpf/bpf_util.h        |  7 ---
>  .../bpf/map_tests/htab_map_batch_ops.c        | 48 ++-
>  .../selftests/bpf/prog_tests/map_init.c       |  5 +-
>  tools/testing/selftests/bpf/test_maps.c       | 16 ---
>  5 files changed, 46 insertions(+), 40 deletions(-)

[...]
> @@ -400,11 +402,11 @@ static void test_arraymap(unsigned int task, void *data)
>  static void test_arraymap_percpu(unsigned int task, void *data)
>  {
>         unsigned int nr_cpus = bpf_num_possible_cpus();
> -       BPF_DECLARE_PERCPU(long, values);
> +       pcpu_map_value_t values[nr_cpus];
>         int key, next_key, fd, i;
>
>         fd = bpf_create_map(BPF_MAP_TYPE_PERCPU_ARRAY, sizeof(key),
> -                           sizeof(bpf_percpu(values, 0)), 2, 0);
> +                           sizeof(long), 2, 0);
>         if (fd < 0) {
>                 printf("Failed to create arraymap '%s'!\n", strerror(errno));
>                 exit(1);
> @@ -459,7 +461,7 @@ static void test_arraymap_percpu(unsigned int task, void *data)
>  static void test_arraymap_percpu_many_keys(void)
>  {
>         unsigned int nr_cpus = bpf_num_possible_cpus();

This just sets a bad example for anyone using selftests as an inspiration for
their own code. bpf_num_possible_cpus() does exit(1) internally if
libbpf_num_possible_cpus() returns an error. No one should write real
production code like that. So maybe let's provide a better example instead,
with error handling and malloc (or perhaps alloca)?

> -       BPF_DECLARE_PERCPU(long, values);
> +       pcpu_map_value_t values[nr_cpus];
>         /* nr_keys is not too large otherwise the test stresses percpu
>          * allocator more than anything else
>          */
> @@ -467,7 +469,7 @@ static void test_arraymap_percpu_many_keys(void)
>         int key, fd, i;
>
>         fd = bpf_create_map(BPF_MAP_TYPE_PERCPU_ARRAY, sizeof(key),
> -                           sizeof(bpf_percpu(values, 0)), nr_keys, 0);
> +                           sizeof(long), nr_keys, 0);
>         if (fd < 0) {
>                 printf("Failed to create per-cpu arraymap '%s'!\n",
>                        strerror(errno));
> --
> 2.25.1
>