Re: [PATCH v3] tracepoint: Do not fail unregistering a probe due to memory allocation
On Wed, Nov 18, 2020 at 09:34:05AM -0500, Steven Rostedt wrote:
> From: "Steven Rostedt (VMware)"
>
> The list of tracepoint callbacks is managed by an array that is protected
> by RCU. To update this array, a new array is allocated, the updates are
> copied over to the new array, and then the list of functions for the
> tracepoint is switched over to the new array. After a completion of an RCU
> grace period, the old array is freed.
>
> This process happens for both adding a callback as well as removing one.
> But on removing a callback, if the new array fails to be allocated, the
> callback is not removed, and may be used after it is freed by the clients
> of the tracepoint.
>
> There's really no reason to fail if the allocation for a new array fails
> when removing a function. Instead, the function can simply be replaced by a
> stub function that could be cleaned up on the next modification of the
> array. That is, instead of calling the function registered to the
> tracepoint, it would call a stub function in its place.
>
> Link: https://lore.kernel.org/r/20201115055256.65625-1-mmull...@mmlx.us
> Link: https://lore.kernel.org/r/20201116175107.02db3...@gandalf.local.home
> Link: https://lore.kernel.org/r/20201117211836.54aca...@oasis.local.home
>
> Cc: Mathieu Desnoyers
> Cc: Ingo Molnar
> Cc: Alexei Starovoitov
> Cc: Daniel Borkmann
> Cc: Dmitry Vyukov
> Cc: Martin KaFai Lau
> Cc: Song Liu
> Cc: Yonghong Song
> Cc: Andrii Nakryiko
> Cc: John Fastabend
> Cc: KP Singh
> Cc: netdev
> Cc: bpf
> Cc: Kees Cook
> Cc: Florian Weimer
> Fixes: 97e1c18e8d17b ("tracing: Kernel Tracepoints")
> Reported-by: syzbot+83aa762ef23b6f0d1...@syzkaller.appspotmail.com
> Reported-by: syzbot+d29e58bb557324e55...@syzkaller.appspotmail.com
> Reported-by: Matt Mullins
> Signed-off-by: Steven Rostedt (VMware)

I'm a bit late answering your initial query, but yes indeed this fixes the
bug I was hunting.
I just watched it live through the reproducer for about a half-hour, while
unpatched I get an instant "BUG: unable to handle page fault".

Tested-by: Matt Mullins

> ---
> Changes since v2:
>  - Went back to using a stub function and not touching the fast path.
>  - Removed adding __GFP_NOFAIL from the allocation of the removal.
>
>  kernel/tracepoint.c | 80 -
>  1 file changed, 64 insertions(+), 16 deletions(-)
>
> diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
> index 3f659f855074..3e261482296c 100644
> --- a/kernel/tracepoint.c
> +++ b/kernel/tracepoint.c
> @@ -53,6 +53,12 @@ struct tp_probes {
> 	struct tracepoint_func probes[];
> };
>
> +/* Called in removal of a func but failed to allocate a new tp_funcs */
> +static void tp_stub_func(void)
> +{
> +	return;
> +}
> +
> static inline void *allocate_probes(int count)
> {
> 	struct tp_probes *p = kmalloc(struct_size(p, probes, count),
> @@ -131,6 +137,7 @@ func_add(struct tracepoint_func **funcs, struct tracepoint_func *tp_func,
> {
> 	struct tracepoint_func *old, *new;
> 	int nr_probes = 0;
> +	int stub_funcs = 0;
> 	int pos = -1;
>
> 	if (WARN_ON(!tp_func->func))
> @@ -147,14 +154,34 @@ func_add(struct tracepoint_func **funcs, struct tracepoint_func *tp_func,
> 		if (old[nr_probes].func == tp_func->func &&
> 		    old[nr_probes].data == tp_func->data)
> 			return ERR_PTR(-EEXIST);
> +		if (old[nr_probes].func == tp_stub_func)
> +			stub_funcs++;
> 		}
> 	}
> -	/* + 2 : one for new probe, one for NULL func */
> -	new = allocate_probes(nr_probes + 2);
> +	/* + 2 : one for new probe, one for NULL func - stub functions */
> +	new = allocate_probes(nr_probes + 2 - stub_funcs);
> 	if (new == NULL)
> 		return ERR_PTR(-ENOMEM);
> 	if (old) {
> -		if (pos < 0) {
> +		if (stub_funcs) {
> +			/* Need to copy one at a time to remove stubs */
> +			int probes = 0;
> +
> +			pos = -1;
> +			for (nr_probes = 0; old[nr_probes].func; nr_probes++) {
> +				if (old[nr_probes].func == tp_stub_func)
> +					continue;
> +				if (pos < 0 && old[nr_probes].prio < prio)
> +					pos = probes++;
> +				new[probes++] = old[nr_probes];
> +			}
Re: [PATCH] bpf: don't fail kmalloc while releasing raw_tp
On Tue, Nov 17, 2020 at 06:05:51PM -0500, Mathieu Desnoyers wrote:
> ----- On Nov 16, 2020, at 5:10 PM, rostedt rost...@goodmis.org wrote:
>
> > On Mon, 16 Nov 2020 16:34:41 -0500 (EST)
> > Mathieu Desnoyers wrote:
> > [...]
> >
> > > I think you'll want a WRITE_ONCE(old[i].func, tp_stub_func) here, matched
> > > with a READ_ONCE() in __DO_TRACE. This introduces a new situation where the
> > > func pointer can be updated and loaded concurrently.
> >
> > I thought about this a little, and the only thing we really should worry
> > about is synchronizing with those that unregister. Because when we make
> > this update, there are now two states: __DO_TRACE either reads the
> > original func or the stub. And either should be OK to call.
> >
> > Only the func gets updated and not the data. So what exactly are we worried
> > about here?
>
> Indeed with a stub function, I don't see any need for READ_ONCE/WRITE_ONCE.

I'm not sure if this is a practical issue, but without WRITE_ONCE, can't the
write be torn? A racing __traceiter_ could potentially see a half-modified
function pointer, which wouldn't work out too well. This was actually my gut
instinct before I wrote the __GFP_NOFAIL version instead -- currently that
whole array's memory ordering is provided by RCU, and I didn't dive deep
enough to evaluate getting too clever with atomic modifications to it.

> However, if we want to compare the function pointer to some other value and
> conditionally do (or skip) the call, I think you'll need the READ_ONCE/WRITE_ONCE
> to make sure the pointer is not re-fetched between comparison and call.
>
> Thanks,
>
> Mathieu
>
> --
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com
[PATCH] bpf: don't fail kmalloc while releasing raw_tp
bpf_link_free is always called in process context, including from a workqueue
and from __fput. Neither of these have the ability to propagate an -ENOMEM to
the caller.

Reported-by: syzbot+83aa762ef23b6f0d1...@syzkaller.appspotmail.com
Reported-by: syzbot+d29e58bb557324e55...@syzkaller.appspotmail.com
Signed-off-by: Matt Mullins
---
I previously referenced a "pretty ugly" patch. This is not that one, because
I don't think there's any way I can make the caller of ->release() actually
handle errors like ENOMEM.

It also looks like most of the other ways tracepoint_probe_unregister is
called also don't check the error code (e.g. just a quick grep found
blk_unregister_tracepoints). Should this just be upgraded to GFP_NOFAIL
across the board instead of passing around a gfp_t?

 include/linux/trace_events.h |  6 --
 include/linux/tracepoint.h   |  7 +-
 kernel/bpf/syscall.c         |  2 +-
 kernel/trace/bpf_trace.c     |  6 --
 kernel/trace/trace_events.c  |  6 --
 kernel/tracepoint.c          | 20 ++--
 6 files changed, 28 insertions(+), 19 deletions(-)

diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 5c6943354049..166ad7646a98 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -625,7 +625,8 @@ int perf_event_attach_bpf_prog(struct perf_event *event, struct bpf_prog *prog);
 void perf_event_detach_bpf_prog(struct perf_event *event);
 int perf_event_query_prog_array(struct perf_event *event, void __user *info);
 int bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog *prog);
-int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf_prog *prog);
+int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf_prog *prog,
+			 gfp_t flags);
 struct bpf_raw_event_map *bpf_get_raw_tracepoint(const char *name);
 void bpf_put_raw_tracepoint(struct bpf_raw_event_map *btp);
 int bpf_get_perf_event_info(const struct perf_event *event, u32 *prog_id,
@@ -654,7 +655,8 @@ static inline int bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_p
 {
 	return -EOPNOTSUPP;
 }
-static inline int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf_prog *p)
+static inline int bpf_probe_unregister(struct bpf_raw_event_map *btp,
+				       struct bpf_prog *p, gfp_t flags)
 {
 	return -EOPNOTSUPP;
 }
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 598fec9f9dbf..7b02f92f3b8f 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -12,6 +12,7 @@
  * Heavily inspired from the Linux Kernel Markers.
  */

+#include
 #include
 #include
 #include
@@ -40,7 +41,8 @@
 extern int tracepoint_probe_register_prio(struct tracepoint *tp, void *probe,
 					  void *data, int prio);
 extern int
-tracepoint_probe_unregister(struct tracepoint *tp, void *probe, void *data);
+tracepoint_probe_unregister(struct tracepoint *tp, void *probe, void *data,
+			    gfp_t flags);
 extern void
 for_each_kernel_tracepoint(void (*fct)(struct tracepoint *tp, void *priv),
			    void *priv);
@@ -260,7 +262,8 @@ static inline struct tracepoint *tracepoint_ptr_deref(tracepoint_ptr_t *p)
 	unregister_trace_##name(void (*probe)(data_proto), void *data)	\
 	{								\
 		return tracepoint_probe_unregister(&__tracepoint_##name,\
-						   (void *)probe, data); \
+						   (void *)probe, data,	\
+						   GFP_KERNEL);		\
 	}								\
 	static inline void						\
 	check_trace_callback_type_##name(void (*cb)(data_proto))	\
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index b999e7ff2583..f6876681c4ab 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2601,7 +2601,7 @@ static void bpf_raw_tp_link_release(struct bpf_link *link)
 	struct bpf_raw_tp_link *raw_tp =
 		container_of(link, struct bpf_raw_tp_link, link);

-	bpf_probe_unregister(raw_tp->btp, raw_tp->link.prog);
+	bpf_probe_unregister(raw_tp->btp, raw_tp->link.prog, GFP_KERNEL | __GFP_NOFAIL);
 	bpf_put_raw_tracepoint(raw_tp->btp);
 }
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index a8d4f253ed77..a4ea58c7506d 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1955,9 +1955,11 @@ int bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog *prog)
 	return __bpf_probe_register(btp, prog);
 }

-int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf_prog *prog)
+int bpf_probe_unregister(struc
Re: KASAN: vmalloc-out-of-bounds Read in bpf_trace_run3
On Wed, Nov 11, 2020 at 03:57:50PM +0100, Dmitry Vyukov wrote:
> On Mon, Nov 2, 2020 at 12:54 PM syzbot wrote:
> >
> > Hello,
> >
> > syzbot found the following issue on:
> >
> > HEAD commit:    080b6f40 bpf: Don't rely on GCC __attribute__((optimize)) ..
> > git tree:       bpf
> > console output: https://syzkaller.appspot.com/x/log.txt?x=1089d37c50
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=58a4ca757d776bfe
> > dashboard link: https://syzkaller.appspot.com/bug?extid=d29e58bb557324e55e5e
> > compiler:       gcc (GCC) 10.1.0-syz 20200507
> > syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=10f4b03250
> > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=1371a47c50
> >
> > The issue was bisected to:
> >
> > commit 9df1c28bb75217b244257152ab7d788bb2a386d0
> > Author: Matt Mullins
> > Date:   Fri Apr 26 18:49:47 2019 +
> >
> >     bpf: add writable context for raw tracepoints
>
> We have a number of kernel memory corruptions related to bpf_trace_run now:
> https://groups.google.com/g/syzkaller-bugs/search?q=kernel%2Ftrace%2Fbpf_trace.c
>
> Can raw tracepoints "legally" corrupt kernel memory (a-la /dev/kmem)?
> Or they shouldn't?
>
> Looking at the description of Matt's commit, it seems that corruptions
> should not be possible (bounded buffer, checked size, etc). Then it
> means it's a real kernel bug?

This bug doesn't seem to be related to the writability of the tracepoint; it
bisected to that commit simply because the reproducer used
BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, which EINVALs before that program type
was introduced. The BPF program it loads is pretty much a no-op.

The problem here is a kmalloc failure injected into
tracepoint_probe_unregister, whose error is ignored -- so the bpf program is
freed even though the tracepoint is never unregistered.

I have a first pass at a patch to pipe through the error code, but it's
pretty ugly.
It's also called from the file_operations ->release(), for which errors are
solidly ignored in __fput(), so I'm not sure what the best way to handle
ENOMEM is...

> >
> > bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=12b6c4da50
> > final oops:     https://syzkaller.appspot.com/x/report.txt?x=11b6c4da50
> > console output: https://syzkaller.appspot.com/x/log.txt?x=16b6c4da50
> >
> > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > Reported-by: syzbot+d29e58bb557324e55...@syzkaller.appspotmail.com
> > Fixes: 9df1c28bb752 ("bpf: add writable context for raw tracepoints")
> >
> > ==
> > BUG: KASAN: vmalloc-out-of-bounds in __bpf_trace_run kernel/trace/bpf_trace.c:2045 [inline]
> > BUG: KASAN: vmalloc-out-of-bounds in bpf_trace_run3+0x3e0/0x3f0 kernel/trace/bpf_trace.c:2083
> > Read of size 8 at addr c9e6c030 by task kworker/0:3/3754
> >
> > CPU: 0 PID: 3754 Comm: kworker/0:3 Not tainted 5.9.0-syzkaller #0
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> > Workqueue: 0x0 (events)
> > Call Trace:
> >  __dump_stack lib/dump_stack.c:77 [inline]
> >  dump_stack+0x107/0x163 lib/dump_stack.c:118
> >  print_address_description.constprop.0.cold+0x5/0x4c8 mm/kasan/report.c:385
> >  __kasan_report mm/kasan/report.c:545 [inline]
> >  kasan_report.cold+0x1f/0x37 mm/kasan/report.c:562
> >  __bpf_trace_run kernel/trace/bpf_trace.c:2045 [inline]
> >  bpf_trace_run3+0x3e0/0x3f0 kernel/trace/bpf_trace.c:2083
> >  __bpf_trace_sched_switch+0xdc/0x120 include/trace/events/sched.h:138
> >  __traceiter_sched_switch+0x64/0xb0 include/trace/events/sched.h:138
> >  trace_sched_switch include/trace/events/sched.h:138 [inline]
> >  __schedule+0xeb8/0x2130 kernel/sched/core.c:4520
> >  schedule+0xcf/0x270 kernel/sched/core.c:4601
> >  worker_thread+0x14c/0x1120 kernel/workqueue.c:2439
> >  kthread+0x3af/0x4a0 kernel/kthread.c:292
> >  ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:296
> >
> > Memory state around the buggy address:
> >  c9e6bf00: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
> >  c9e6bf80: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
> > >c9e6c000: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
> >            ^
> >  c9e6c080: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
> >  c9e6c100: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
[tip:x86/urgent] x86/entry/32: Pass cr2 to do_async_page_fault()
Commit-ID:  b8f70953c1251d8b16276995816a95639f598e70
Gitweb:     https://git.kernel.org/tip/b8f70953c1251d8b16276995816a95639f598e70
Author:     Matt Mullins
AuthorDate: Tue, 23 Jul 2019 21:20:58 -0700
Committer:  Thomas Gleixner
CommitDate: Wed, 24 Jul 2019 12:17:39 +0200

x86/entry/32: Pass cr2 to do_async_page_fault()

Commit a0d14b8909de ("x86/mm, tracing: Fix CR2 corruption") added the
address parameter to do_async_page_fault(), but does not pass it from the
32-bit entry point.

To plumb it through, factor-out common_exception_read_cr2 in the same
fashion as common_exception, and uses it from both page_fault and
async_page_fault.

For a 32-bit KVM guest, this fixes:

  Run /sbin/init as init process
  Starting init: /sbin/init exists but couldn't execute it (error -14)

Fixes: a0d14b8909de ("x86/mm, tracing: Fix CR2 corruption")
Signed-off-by: Matt Mullins
Signed-off-by: Thomas Gleixner
Acked-by: Peter Zijlstra (Intel)
Link: https://lkml.kernel.org/r/20190724042058.24506-1-mmull...@fb.com
---
 arch/x86/entry/entry_32.S | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 2bb986f305ac..4f86928246e7 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1443,8 +1443,12 @@ BUILD_INTERRUPT3(hv_stimer0_callback_vector, HYPERV_STIMER0_VECTOR,
 ENTRY(page_fault)
 	ASM_CLAC
-	pushl	$0; /* %gs's slot on the stack */
+	pushl	$do_page_fault
+	jmp	common_exception_read_cr2
+END(page_fault)
+
+common_exception_read_cr2:
+	/* the function address is in %gs's slot on the stack */
 	SAVE_ALL switch_stacks=1 skip_gs=1
 	ENCODE_FRAME_POINTER
@@ -1452,6 +1456,7 @@

 	/* fixup %gs */
 	GS_TO_REG %ecx
+	movl	PT_GS(%esp), %edi
 	REG_TO_PTGS %ecx
 	SET_KERNEL_GS %ecx
@@ -1463,9 +1468,9 @@
 	TRACE_IRQS_OFF
 	movl	%esp, %eax			# pt_regs pointer
-	call	do_page_fault
+	CALL_NOSPEC %edi
 	jmp	ret_from_exception
-END(page_fault)
+END(common_exception_read_cr2)

 common_exception:
 	/* the function address is in %gs's slot on the stack */
@@ -1595,7 +1600,7 @@ END(general_protection)
 ENTRY(async_page_fault)
 	ASM_CLAC
 	pushl	$do_async_page_fault
-	jmp	common_exception
+	jmp	common_exception_read_cr2
 END(async_page_fault)
 #endif
[PATCH] x86/entry/32: pass cr2 to do_async_page_fault
Commit a0d14b8909de ("x86/mm, tracing: Fix CR2 corruption") added the
address parameter to do_async_page_fault, but does not pass it from the
32-bit entry point.

To plumb it through, factor-out common_exception_read_cr2 in the same
fashion as common_exception, and uses it from both page_fault and
async_page_fault.

For a 32-bit KVM guest, this fixes:

[1.148669][T1] Run /sbin/init as init process
[1.153328][T1] Starting init: /sbin/init exists but couldn't execute it (error -14)

Fixes: a0d14b8909de ("x86/mm, tracing: Fix CR2 corruption")
Signed-off-by: Matt Mullins
---
 arch/x86/entry/entry_32.S | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 2bb986f305ac..4f86928246e7 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1443,8 +1443,12 @@ BUILD_INTERRUPT3(hv_stimer0_callback_vector, HYPERV_STIMER0_VECTOR,
 ENTRY(page_fault)
 	ASM_CLAC
-	pushl	$0; /* %gs's slot on the stack */
+	pushl	$do_page_fault
+	jmp	common_exception_read_cr2
+END(page_fault)
+
+common_exception_read_cr2:
+	/* the function address is in %gs's slot on the stack */
 	SAVE_ALL switch_stacks=1 skip_gs=1
 	ENCODE_FRAME_POINTER
@@ -1452,6 +1456,7 @@

 	/* fixup %gs */
 	GS_TO_REG %ecx
+	movl	PT_GS(%esp), %edi
 	REG_TO_PTGS %ecx
 	SET_KERNEL_GS %ecx
@@ -1463,9 +1468,9 @@
 	TRACE_IRQS_OFF
 	movl	%esp, %eax			# pt_regs pointer
-	call	do_page_fault
+	CALL_NOSPEC %edi
 	jmp	ret_from_exception
-END(page_fault)
+END(common_exception_read_cr2)

 common_exception:
 	/* the function address is in %gs's slot on the stack */
@@ -1595,7 +1600,7 @@ END(general_protection)
 ENTRY(async_page_fault)
 	ASM_CLAC
 	pushl	$do_async_page_fault
-	jmp	common_exception
+	jmp	common_exception_read_cr2
 END(async_page_fault)
 #endif
--
2.17.1
Re: [PATCH] bpf: hide do_bpf_send_signal when unused
On Mon, 2019-06-17 at 19:09 -0400, Steven Rostedt wrote:
> On Mon, 17 Jun 2019 08:26:29 -0700
> Alexei Starovoitov wrote:
>
> > On Mon, Jun 17, 2019 at 5:59 AM Arnd Bergmann wrote:
> > >
> > > When CONFIG_MODULES is disabled, this function is never called:
> > >
> > > kernel/trace/bpf_trace.c:581:13: error: 'do_bpf_send_signal' defined but not used [-Werror=unused-function]
> >
> > hmm. it should work just fine without modules.
> > the bug is somewhere else.
>
> From what I see, the only use of do_bpf_send_signal is within a
> #ifdef CONFIG_MODULES, which means that you will get a warning about a
> static unused when CONFIG_MODULES is not defined.
>
> In kernel/trace/bpf_trace.c we have:
>
> static void do_bpf_send_signal(struct irq_work *entry)
>
> [..]
>
> #ifdef CONFIG_MODULES
>
> [..]
>
> 	for_each_possible_cpu(cpu) {
> 		work = per_cpu_ptr(&send_signal_work, cpu);
> 		init_irq_work(&work->irq_work, do_bpf_send_signal); <-- only use of do_bpf_send_signal
> 	}
> [..]
> #endif /* CONFIG_MODULES */
>
> The bug (really just a warning) reported is exactly here.

I don't think bpf_send_signal is tied to modules at all;
send_signal_irq_work_init and the corresponding initcall should be moved
outside that #ifdef.

> -- Steve
Re: [PATCH bpf v2] bpf: fix nested bpf tracepoints with per-cpu data
On Fri, 2019-06-14 at 16:50 +0200, Daniel Borkmann wrote:
> On 06/14/2019 02:51 AM, Matt Mullins wrote:
> > On Fri, 2019-06-14 at 00:47 +0200, Daniel Borkmann wrote:
> > > On 06/12/2019 07:00 AM, Andrii Nakryiko wrote:
> > > > On Tue, Jun 11, 2019 at 8:48 PM Matt Mullins wrote:
> > > > >
> > > > > BPF_PROG_TYPE_RAW_TRACEPOINTs can be executed nested on the same CPU, as
> > > > > they do not increment bpf_prog_active while executing.
> > > > >
> > > > > This enables three levels of nesting, to support
> > > > >  - a kprobe or raw tp or perf event,
> > > > >  - another one of the above that irq context happens to call, and
> > > > >  - another one in nmi context
> > > > > (at most one of which may be a kprobe or perf event).
> > > > >
> > > > > Fixes: 20b9d7ac4852 ("bpf: avoid excessive stack usage for perf_sample_data")
> > >
> > > Generally, looks good to me. Two things below:
> > >
> > > Nit, for stable, shouldn't fixes tag be c4f6699dfcb8 ("bpf: introduce BPF_RAW_TRACEPOINT")
> > > instead of the one you currently have?
> >
> > Ah, yeah, that's probably more reasonable; I haven't managed to come up
> > with a scenario where one could hit this without raw tracepoints. I'll
> > fix up the nits that've accumulated since v2.
> >
> > > One more question / clarification: we have __bpf_trace_run() vs trace_call_bpf().
> > >
> > > Only raw tracepoints can be nested since the rest has the bpf_prog_active per-CPU
> > > counter via trace_call_bpf() and would bail out otherwise, iiuc. And raw ones use
> > > the __bpf_trace_run() added in c4f6699dfcb8 ("bpf: introduce BPF_RAW_TRACEPOINT").
> > >
> > > 1) I tried to recall and find a rationale for mentioned trace_call_bpf() split in
> > > the c4f6699dfcb8 log, but couldn't find any. Is the raison d'être purely because of
> > > performance overhead (and desire to not miss events as a result of nesting)? (This
> > > also means we're not protected by bpf_prog_active in all the map ops, of course.)
> > > 2) Wouldn't this also mean that we only need to fix the raw tp programs via
> > > get_bpf_raw_tp_regs() / put_bpf_raw_tp_regs() and won't need this duplication for
> > > the rest which relies upon trace_call_bpf()? I'm probably missing something, but
> > > given they have separate pt_regs there, how could they be affected then?
> >
> > For the pt_regs, you're correct: I only used get/put_raw_tp_regs for
> > the _raw_tp variants. However, consider the following nesting:
> >
> >                                        trace_nest_level  raw_tp_nest_level
> > (kprobe) bpf_perf_event_output                1                 0
> > (raw_tp) bpf_perf_event_output_raw_tp         2                 1
> > (raw_tp) bpf_get_stackid_raw_tp               2                 2
> >
> > I need to increment a nest level (and ideally increment it only once)
> > between the kprobe and the first raw_tp, because they would otherwise
> > share the struct perf_sample_data. But I also need to increment a nest
>
> I'm not sure I follow on this one: the former would still keep using the
> bpf_trace_sd as-is today since only ever /one/ can be active on a given CPU
> as we otherwise bail out in trace_call_bpf() due to bpf_prog_active counter.
> Given these two are /not/ shared, you only need the code you have below for
> nesting to deal with the raw_tps via get_bpf_raw_tp_regs() / put_bpf_raw_tp_regs()
> which should also simplify the code quite a bit.

bpf_perf_event_output_raw_tp calls bpf_perf_event_output, so it currently
shares bpf_trace_sd with kprobes -- it _can_ be nested.

> > level between the two raw_tps, since they share the pt_regs -- I can't
> > use trace_nest_level for everything because it's not used by
> > get_stackid, and I can't use raw_tp_nest_level for everything because
> > it's not incremented by kprobes.
>
> (See above wrt kprobes.)

> > If raw tracepoints were to bump bpf_prog_active, then I could get away
> > with just using that count in these callsites -- I'm reluctant to do
> > that, though, since it would prevent kprobes from ever running inside a
> > raw_tp. I'd like to retain the ability to (e.g.)
> >
> > 	trace.py -K htab_map_update_elem
> >
> > and get some stack traces from at least within raw tracepoints.
> >
> > That said, as I wrote up this example, bpf_trace_nest_level seems to be
> > wildly misnamed; I should name those after the structure they're
> > protecting...
Re: [PATCH bpf v2] bpf: fix nested bpf tracepoints with per-cpu data
On Fri, 2019-06-14 at 00:47 +0200, Daniel Borkmann wrote:
> On 06/12/2019 07:00 AM, Andrii Nakryiko wrote:
> > On Tue, Jun 11, 2019 at 8:48 PM Matt Mullins wrote:
> > >
> > > BPF_PROG_TYPE_RAW_TRACEPOINTs can be executed nested on the same CPU, as
> > > they do not increment bpf_prog_active while executing.
> > >
> > > This enables three levels of nesting, to support
> > >  - a kprobe or raw tp or perf event,
> > >  - another one of the above that irq context happens to call, and
> > >  - another one in nmi context
> > > (at most one of which may be a kprobe or perf event).
> > >
> > > Fixes: 20b9d7ac4852 ("bpf: avoid excessive stack usage for perf_sample_data")
>
> Generally, looks good to me. Two things below:
>
> Nit, for stable, shouldn't fixes tag be c4f6699dfcb8 ("bpf: introduce BPF_RAW_TRACEPOINT")
> instead of the one you currently have?

Ah, yeah, that's probably more reasonable; I haven't managed to come up
with a scenario where one could hit this without raw tracepoints. I'll
fix up the nits that've accumulated since v2.

> One more question / clarification: we have __bpf_trace_run() vs trace_call_bpf().
>
> Only raw tracepoints can be nested since the rest has the bpf_prog_active per-CPU
> counter via trace_call_bpf() and would bail out otherwise, iiuc. And raw ones use
> the __bpf_trace_run() added in c4f6699dfcb8 ("bpf: introduce BPF_RAW_TRACEPOINT").
>
> 1) I tried to recall and find a rationale for mentioned trace_call_bpf() split in
> the c4f6699dfcb8 log, but couldn't find any. Is the raison d'être purely because of
> performance overhead (and desire to not miss events as a result of nesting)? (This
> also means we're not protected by bpf_prog_active in all the map ops, of course.)
> 2) Wouldn't this also mean that we only need to fix the raw tp programs via
> get_bpf_raw_tp_regs() / put_bpf_raw_tp_regs() and won't need this duplication for
> the rest which relies upon trace_call_bpf()? I'm probably missing something, but
> given they have separate pt_regs there, how could they be affected then?

For the pt_regs, you're correct: I only used get/put_raw_tp_regs for
the _raw_tp variants. However, consider the following nesting:

                                       trace_nest_level  raw_tp_nest_level
(kprobe) bpf_perf_event_output                1                 0
(raw_tp) bpf_perf_event_output_raw_tp         2                 1
(raw_tp) bpf_get_stackid_raw_tp               2                 2

I need to increment a nest level (and ideally increment it only once)
between the kprobe and the first raw_tp, because they would otherwise
share the struct perf_sample_data. But I also need to increment a nest
level between the two raw_tps, since they share the pt_regs -- I can't
use trace_nest_level for everything because it's not used by
get_stackid, and I can't use raw_tp_nest_level for everything because
it's not incremented by kprobes.

If raw tracepoints were to bump bpf_prog_active, then I could get away
with just using that count in these callsites -- I'm reluctant to do
that, though, since it would prevent kprobes from ever running inside a
raw_tp. I'd like to retain the ability to (e.g.)

	trace.py -K htab_map_update_elem

and get some stack traces from at least within raw tracepoints.

That said, as I wrote up this example, bpf_trace_nest_level seems to be
wildly misnamed; I should name those after the structure they're
protecting...

> Thanks,
> Daniel
>
> > > Signed-off-by: Matt Mullins
> > > ---
> >
> > LGTM, minor nit below.
> >
> > Acked-by: Andrii Nakryiko
> >
> > > v1->v2:
> > >  * reverse-Christmas-tree-ize the declarations in bpf_perf_event_output
> > >  * instantiate err more readably
> > >
> > > I've done additional testing with the original workload that hit the
> > > irq+raw-tp reentrancy problem, and as far as I can tell, it's still
> > > solved with this solution (as opposed to my earlier per-map-element
> > > version).
> > >
> > >  kernel/trace/bpf_trace.c | 100 ---
> > >  1 file changed, 84 insertions(+), 16 deletions(-)
> > >
> > > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> > > index f92d6ad5e080..1c9a4745e596 100644
> > > --- a/kernel/trace/bpf_trace.c
> > > +++ b/kernel/trace/bpf_trace.c
> > > @@ -410,8 +410,6 @@ static const struct bpf_func_proto bpf_perf_event_read_value_proto = {
[tip:x86/urgent] x86/kgdb: Return 0 from kgdb_arch_set_breakpoint()
Commit-ID:  71ab8323cc357c68985a2d6fc6cfc22b1dbbc1c3
Gitweb:     https://git.kernel.org/tip/71ab8323cc357c68985a2d6fc6cfc22b1dbbc1c3
Author:     Matt Mullins
AuthorDate: Fri, 31 May 2019 12:47:54 -0700
Committer:  Borislav Petkov
CommitDate: Wed, 12 Jun 2019 18:52:44 +0200

x86/kgdb: Return 0 from kgdb_arch_set_breakpoint()

err must be nonzero in order to reach text_poke(), which caused kgdb to
fail to set breakpoints:

  (gdb) break __x64_sys_sync
  Breakpoint 1 at 0x81288910: file ../fs/sync.c, line 124.
  (gdb) c
  Continuing.

  Warning:
  Cannot insert breakpoint 1.
  Cannot access memory at address 0x81288910

  Command aborted.

Fixes: 86a22057127d ("x86/kgdb: Avoid redundant comparison of patched code")
Signed-off-by: Matt Mullins
Signed-off-by: Borislav Petkov
Reviewed-by: Nadav Amit
Cc: Andy Lutomirski
Cc: Christophe Leroy
Cc: Daniel Thompson
Cc: Douglas Anderson
Cc: "Gustavo A. R. Silva"
Cc: "H. Peter Anvin"
Cc: Ingo Molnar
Cc: "Peter Zijlstra (Intel)"
Cc: Rick Edgecombe
Cc: Thomas Gleixner
Cc: x86-ml
Link: https://lkml.kernel.org/r/20190531194755.6320-1-mmull...@fb.com
---
 arch/x86/kernel/kgdb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/kgdb.c b/arch/x86/kernel/kgdb.c
index 9a8c1648fc9a..6690c5652aeb 100644
--- a/arch/x86/kernel/kgdb.c
+++ b/arch/x86/kernel/kgdb.c
@@ -758,7 +758,7 @@ int kgdb_arch_set_breakpoint(struct kgdb_bkpt *bpt)
 			   BREAK_INSTR_SIZE);
 	bpt->type = BP_POKE_BREAKPOINT;

-	return err;
+	return 0;
 }

 int kgdb_arch_remove_breakpoint(struct kgdb_bkpt *bpt)
[PATCH bpf v2] bpf: fix nested bpf tracepoints with per-cpu data
BPF_PROG_TYPE_RAW_TRACEPOINTs can be executed nested on the same CPU, as
they do not increment bpf_prog_active while executing.

This enables three levels of nesting, to support
  - a kprobe or raw tp or perf event,
  - another one of the above that irq context happens to call, and
  - another one in nmi context
(at most one of which may be a kprobe or perf event).

Fixes: 20b9d7ac4852 ("bpf: avoid excessive stack usage for perf_sample_data")
Signed-off-by: Matt Mullins
---
v1->v2:
  * reverse-Christmas-tree-ize the declarations in bpf_perf_event_output
  * instantiate err more readably

I've done additional testing with the original workload that hit the
irq+raw-tp reentrancy problem, and as far as I can tell, it's still solved
with this solution (as opposed to my earlier per-map-element version).

 kernel/trace/bpf_trace.c | 100 ++++++++++++++++++++++++++++++++++----------
 1 file changed, 84 insertions(+), 16 deletions(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index f92d6ad5e080..1c9a4745e596 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -410,8 +410,6 @@ static const struct bpf_func_proto bpf_perf_event_read_value_proto = {
 	.arg4_type	= ARG_CONST_SIZE,
 };
 
-static DEFINE_PER_CPU(struct perf_sample_data, bpf_trace_sd);
-
 static __always_inline u64
 __bpf_perf_event_output(struct pt_regs *regs, struct bpf_map *map,
 			u64 flags, struct perf_sample_data *sd)
@@ -442,24 +440,50 @@ __bpf_perf_event_output(struct pt_regs *regs, struct bpf_map *map,
 	return perf_event_output(event, sd, regs);
 }
 
+/*
+ * Support executing tracepoints in normal, irq, and nmi context that each call
+ * bpf_perf_event_output
+ */
+struct bpf_trace_sample_data {
+	struct perf_sample_data sds[3];
+};
+
+static DEFINE_PER_CPU(struct bpf_trace_sample_data, bpf_trace_sds);
+static DEFINE_PER_CPU(int, bpf_trace_nest_level);
+
 BPF_CALL_5(bpf_perf_event_output, struct pt_regs *, regs, struct bpf_map *,
 	   map, u64, flags, void *, data, u64, size)
 {
-	struct perf_sample_data *sd = this_cpu_ptr(&bpf_trace_sd);
+	struct bpf_trace_sample_data *sds = this_cpu_ptr(&bpf_trace_sds);
+	int nest_level = this_cpu_inc_return(bpf_trace_nest_level);
 	struct perf_raw_record raw = {
 		.frag = {
 			.size = size,
 			.data = data,
 		},
 	};
+	struct perf_sample_data *sd;
+	int err;
 
-	if (unlikely(flags & ~(BPF_F_INDEX_MASK)))
-		return -EINVAL;
+	if (WARN_ON_ONCE(nest_level > ARRAY_SIZE(sds->sds))) {
+		err = -EBUSY;
+		goto out;
+	}
+
+	sd = &sds->sds[nest_level - 1];
+
+	if (unlikely(flags & ~(BPF_F_INDEX_MASK))) {
+		err = -EINVAL;
+		goto out;
+	}
 
 	perf_sample_data_init(sd, 0, 0);
 	sd->raw = &raw;
 
-	return __bpf_perf_event_output(regs, map, flags, sd);
+	err = __bpf_perf_event_output(regs, map, flags, sd);
+
+out:
+	this_cpu_dec(bpf_trace_nest_level);
+	return err;
 }
 
 static const struct bpf_func_proto bpf_perf_event_output_proto = {
@@ -822,16 +846,48 @@ pe_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 /*
  * bpf_raw_tp_regs are separate from bpf_pt_regs used from skb/xdp
  * to avoid potential recursive reuse issue when/if tracepoints are added
- * inside bpf_*_event_output, bpf_get_stackid and/or bpf_get_stack
+ * inside bpf_*_event_output, bpf_get_stackid and/or bpf_get_stack.
+ *
+ * Since raw tracepoints run despite bpf_prog_active, support concurrent usage
+ * in normal, irq, and nmi context.
  */
-static DEFINE_PER_CPU(struct pt_regs, bpf_raw_tp_regs);
+struct bpf_raw_tp_regs {
+	struct pt_regs regs[3];
+};
+static DEFINE_PER_CPU(struct bpf_raw_tp_regs, bpf_raw_tp_regs);
+static DEFINE_PER_CPU(int, bpf_raw_tp_nest_level);
+static struct pt_regs *get_bpf_raw_tp_regs(void)
+{
+	struct bpf_raw_tp_regs *tp_regs = this_cpu_ptr(&bpf_raw_tp_regs);
+	int nest_level = this_cpu_inc_return(bpf_raw_tp_nest_level);
+
+	if (WARN_ON_ONCE(nest_level > ARRAY_SIZE(tp_regs->regs))) {
+		this_cpu_dec(bpf_raw_tp_nest_level);
+		return ERR_PTR(-EBUSY);
+	}
+
+	return &tp_regs->regs[nest_level - 1];
+}
+
+static void put_bpf_raw_tp_regs(void)
+{
+	this_cpu_dec(bpf_raw_tp_nest_level);
+}
+
 BPF_CALL_5(bpf_perf_event_output_raw_tp, struct bpf_raw_tracepoint_args *, args,
 	   struct bpf_map *, map, u64, flags, void *, data, u64, size)
 {
-	struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs);
+	struct pt_regs *regs = get_bpf_raw_tp_regs();
+	int ret;
+
+	if (IS_ERR(regs))
+		return PTR_ERR(regs);
 
 	perf_fetch_caller_regs(regs);
-	return bpf_perf_event_output(regs, m
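The per-CPU nesting scheme in the patch above can be illustrated outside the kernel. The sketch below is a userspace analogue, not the kernel code: plain globals stand in for `DEFINE_PER_CPU` variables, and the names `claim_buf`/`release_buf`/`MAX_NEST` are made up for the illustration. Each context (task, softirq, nmi) claims the buffer at its own nesting depth, and a fourth level is refused, mirroring the patch's `-EBUSY` path.

```c
#include <stddef.h>

/* Userspace sketch of the nesting pattern: one buffer per context,
 * selected by a depth counter.  In the kernel both globals would be
 * per-CPU and the increment would be this_cpu_inc_return(), which is
 * atomic with respect to interrupts on the local CPU. */
#define MAX_NEST 3

struct sample_buf { char data[64]; };

static struct sample_buf nest_bufs[MAX_NEST]; /* per-CPU in the kernel */
static int nest_level;                        /* per-CPU in the kernel */

/* Claim the buffer for the current nesting depth, or NULL if too deep. */
static struct sample_buf *claim_buf(void)
{
	int level = ++nest_level;   /* this_cpu_inc_return() in the kernel */

	if (level > MAX_NEST) {
		--nest_level;       /* undo, mirroring the -EBUSY path */
		return NULL;
	}
	return &nest_bufs[level - 1];
}

static void release_buf(void)
{
	--nest_level;               /* this_cpu_dec() in the kernel */
}
```

Because an interrupt always runs to completion before the interrupted context resumes, claims and releases nest like a stack, so indexing by depth never hands two live contexts the same buffer.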
Re: [PATCH bpf] bpf: fix nested bpf tracepoints with per-cpu data
On Thu, 2019-06-06 at 15:13 -0700, Jakub Kicinski wrote: > On Thu, 6 Jun 2019 11:54:27 -0700, Matt Mullins wrote: > > BPF_PROG_TYPE_RAW_TRACEPOINTs can be executed nested on the same CPU, as > > they do not increment bpf_prog_active while executing. > > > > This enables three levels of nesting, to support > > - a kprobe or raw tp or perf event, > > - another one of the above that irq context happens to call, and > > - another one in nmi context > > (at most one of which may be a kprobe or perf event). > > > > Fixes: 20b9d7ac4852 ("bpf: avoid excessive stack usage for > > perf_sample_data") > > No comment on the code, but you're definitely missing a sign-off. Oops, I totally am. I'll give it some more time for opinions to roll in, and I'll fix that before I resubmit :) > > > --- > > This is more lines of code, but possibly less intrusive than the > > per-array-element approach. > > > > I don't necessarily like that I duplicated the nest_level logic in two > > places, but I don't see a way to unify them: > > - kprobes' bpf_perf_event_output doesn't use bpf_raw_tp_regs, and does > > use the perf_sample_data, > > - raw tracepoints' bpf_get_stackid uses bpf_raw_tp_regs, but not > > the perf_sample_data, and > > - raw tracepoints' bpf_perf_event_output uses both... > >
[PATCH bpf] bpf: fix nested bpf tracepoints with per-cpu data
BPF_PROG_TYPE_RAW_TRACEPOINTs can be executed nested on the same CPU, as they do not increment bpf_prog_active while executing. This enables three levels of nesting, to support - a kprobe or raw tp or perf event, - another one of the above that irq context happens to call, and - another one in nmi context (at most one of which may be a kprobe or perf event). Fixes: 20b9d7ac4852 ("bpf: avoid excessive stack usage for perf_sample_data") --- This is more lines of code, but possibly less intrusive than the per-array-element approach. I don't necessarily like that I duplicated the nest_level logic in two places, but I don't see a way to unify them: - kprobes' bpf_perf_event_output doesn't use bpf_raw_tp_regs, and does use the perf_sample_data, - raw tracepoints' bpf_get_stackid uses bpf_raw_tp_regs, but not the perf_sample_data, and - raw tracepoints' bpf_perf_event_output uses both... kernel/trace/bpf_trace.c | 95 +--- 1 file changed, 80 insertions(+), 15 deletions(-) diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index f92d6ad5e080..4f5419837ddd 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -410,8 +410,6 @@ static const struct bpf_func_proto bpf_perf_event_read_value_proto = { .arg4_type = ARG_CONST_SIZE, }; -static DEFINE_PER_CPU(struct perf_sample_data, bpf_trace_sd); - static __always_inline u64 __bpf_perf_event_output(struct pt_regs *regs, struct bpf_map *map, u64 flags, struct perf_sample_data *sd) @@ -442,24 +440,47 @@ __bpf_perf_event_output(struct pt_regs *regs, struct bpf_map *map, return perf_event_output(event, sd, regs); } +/* + * Support executing tracepoints in normal, irq, and nmi context that each call + * bpf_perf_event_output + */ +struct bpf_trace_sample_data { + struct perf_sample_data sds[3]; +}; + +static DEFINE_PER_CPU(struct bpf_trace_sample_data, bpf_trace_sds); +static DEFINE_PER_CPU(int, bpf_trace_nest_level); BPF_CALL_5(bpf_perf_event_output, struct pt_regs *, regs, struct bpf_map *, map, u64, 
flags, void *, data, u64, size) { - struct perf_sample_data *sd = this_cpu_ptr(&bpf_trace_sd); + struct bpf_trace_sample_data *sds = this_cpu_ptr(&bpf_trace_sds); + struct perf_sample_data *sd; + int nest_level = this_cpu_inc_return(bpf_trace_nest_level); struct perf_raw_record raw = { .frag = { .size = size, .data = data, }, }; + int err = -EBUSY; + if (WARN_ON_ONCE(nest_level > ARRAY_SIZE(sds->sds))) + goto out; + + sd = &sds->sds[nest_level - 1]; + + err = -EINVAL; if (unlikely(flags & ~(BPF_F_INDEX_MASK))) - return -EINVAL; + goto out; perf_sample_data_init(sd, 0, 0); sd->raw = &raw; - return __bpf_perf_event_output(regs, map, flags, sd); + err = __bpf_perf_event_output(regs, map, flags, sd); + +out: + this_cpu_dec(bpf_trace_nest_level); + return err; } static const struct bpf_func_proto bpf_perf_event_output_proto = { @@ -822,16 +843,48 @@ pe_prog_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog) /* * bpf_raw_tp_regs are separate from bpf_pt_regs used from skb/xdp * to avoid potential recursive reuse issue when/if tracepoints are added - * inside bpf_*_event_output, bpf_get_stackid and/or bpf_get_stack + * inside bpf_*_event_output, bpf_get_stackid and/or bpf_get_stack. + * + * Since raw tracepoints run despite bpf_prog_active, support concurrent usage + * in normal, irq, and nmi context. 
*/ -static DEFINE_PER_CPU(struct pt_regs, bpf_raw_tp_regs); +struct bpf_raw_tp_regs { + struct pt_regs regs[3]; +}; +static DEFINE_PER_CPU(struct bpf_raw_tp_regs, bpf_raw_tp_regs); +static DEFINE_PER_CPU(int, bpf_raw_tp_nest_level); +static struct pt_regs *get_bpf_raw_tp_regs(void) +{ + struct bpf_raw_tp_regs *tp_regs = this_cpu_ptr(&bpf_raw_tp_regs); + int nest_level = this_cpu_inc_return(bpf_raw_tp_nest_level); + + if (WARN_ON_ONCE(nest_level > ARRAY_SIZE(tp_regs->regs))) { + this_cpu_dec(bpf_raw_tp_nest_level); + return ERR_PTR(-EBUSY); + } + + return &tp_regs->regs[nest_level - 1]; +} + +static void put_bpf_raw_tp_regs(void) +{ + this_cpu_dec(bpf_raw_tp_nest_level); +} + BPF_CALL_5(bpf_perf_event_output_raw_tp, struct bpf_raw_tracepoint_args *, args, struct bpf_map *, map, u64, flags, void *, data, u64, size) { - struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs); + struct pt_regs *regs = get_bpf_raw_tp_regs(); + int ret; + + if (IS_ERR(regs)) + return PTR_ERR(regs); perf_fetch_caller_regs(regs); - return bpf_perf_event_output(regs, map, flags, data, size); + ret = bpf_perf_event_output(regs, map, flags, d
Re: [PATCH bpf v2] bpf: preallocate a perf_sample_data per event fd
On Tue, 2019-06-04 at 02:43 +0200, Daniel Borkmann wrote: > On 06/04/2019 01:54 AM, Alexei Starovoitov wrote: > > On Mon, Jun 3, 2019 at 4:48 PM Daniel Borkmann wrote: > > > On 06/04/2019 01:27 AM, Alexei Starovoitov wrote: > > > > On Mon, Jun 3, 2019 at 3:59 PM Matt Mullins wrote: > > > > > > > > > > If these are invariably non-nested, I can easily keep bpf_misc_sd when > > > > > I resubmit. There was no technical reason other than keeping the two > > > > > codepaths as similar as possible. > > > > > > > > > > What resource gives you worry about doing this for the networking > > > > > codepath? > > > > > > > > my preference would be to keep tracing and networking the same. > > > > there is already minimal nesting in networking and probably we see > > > > more when reuseport progs will start running from xdp and clsbpf > > > > > > > > > > Aside from that it's also really bad to miss events like this as > > > > > > exporting > > > > > > through rb is critical. Why can't you have a per-CPU counter that > > > > > > selects a > > > > > > sample data context based on nesting level in tracing? (I don't see > > > > > > a discussion > > > > > > of this in your commit message.) > > > > > > > > > > This change would only drop messages if the same perf_event is > > > > > attempted to be used recursively (i.e. the same CPU on the same > > > > > PERF_EVENT_ARRAY map, as I haven't observed anything use index != > > > > > BPF_F_CURRENT_CPU in testing). > > > > > > > > > > I'll try to accomplish the same with a percpu nesting level and > > > > > allocating 2 or 3 perf_sample_data per cpu. I think that'll solve the > > > > > same problem -- a local patch keeping track of the nesting level is > > > > > how > > > > > I got the above stack trace, too. > > > > > > > > I don't think counter approach works. The amount of nesting is unknown. > > > > imo the approach taken in this patch is good. > > > > I don't see any issue when event_outputs will be dropped for valid > > > > progs. 
> > > > Only when user called the helper incorrectly without BPF_F_CURRENT_CPU.
> > > > But that's an error anyway.
> > >
> > > My main worry with this xchg() trick is that we'll miss to export crucial
> > > data with the EBUSY bailing out especially given nesting could increase in
> > > future as you state, so users might have a hard time debugging this kind of
> > > issue if they share the same perf event map among these programs, and no
> > > option to get to this data otherwise. Supporting nesting up to a certain
> > > level would still be better than a lost event which is also not reported
> > > through the usual way aka perf rb.

Tracing can already be lossy: trace_call_bpf() simply doesn't call the prog
and silently returns zero if bpf_prog_active != 1.

> > I simply don't see this 'miss to export data' in all but contrived
> > conditions.
> > Say two progs share the same perf event array.
> > One prog calls event_output and while rb logic is working
> > another prog needs to start executing and use the same event array
>
> Correct.
>
> > slot. Today it's only possible for tracing prog combined with networking,
> > but having two progs use the same event output array is pretty much
> > a user bug. Just like not passing BPF_F_CURRENT_CPU.
>
> I don't see the user bug part, why should that be a user bug? It's the same
> as if we would say that sharing a BPF hash map between networking programs
> attached to different hooks or networking and tracing would be a user bug
> which it is not. One concrete example would be cilium monitor where we
> currently expose skb trace and drop events as well as debug data through
> the same rb. This should be usable from any type that has perf_event_output
> helper enabled (e.g. XDP and tc/BPF) w/o requiring to walk yet another per
> cpu mmap rb from user space.
Neither of these solutions would affect the behavior of sharing the perf array between networking programs -- since they're never called in a nested fashion, you'll never hit the "xchg() returned NULL" case at all. That said, I think I can logically limit nesting in tracing to 3 lev
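The xchg()-claim scheme being debated here can be sketched in a few lines of userspace C11. This is an illustration of the protocol, not the kernel patch: the type and helper names are hypothetical, and `atomic_exchange` stands in for the kernel's xchg(). Each array element owns one preallocated sample buffer; claiming atomically swaps the pointer out, so a nested claim on the same element observes NULL and must bail with -EBUSY -- which is exactly the "lost event" Daniel is worried about.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Sketch of the per-element xchg() claim protocol (names hypothetical).
 * The element holds a pointer to its preallocated buffer; NULL means
 * "currently claimed by some context on this element". */
struct sample { char data[64]; };

struct event_entry {
	_Atomic(struct sample *) sd;   /* NULL while claimed */
};

/* Claim the element's buffer; returns NULL if already in use (nested). */
static struct sample *claim_sd(struct event_entry *ee)
{
	return atomic_exchange(&ee->sd, (struct sample *)NULL);
}

/* Give the buffer back so the next event_output can claim it. */
static void release_sd(struct event_entry *ee, struct sample *sd)
{
	atomic_store(&ee->sd, sd);
}
```

The trade-off visible in the sketch: the nested caller's event is dropped outright, whereas the depth-indexed per-CPU scheme keeps up to three concurrent buffers and only drops at the fourth level.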
Re: [PATCH bpf v2] bpf: preallocate a perf_sample_data per event fd
On Mon, 2019-06-03 at 15:22 +0200, Daniel Borkmann wrote: > On 06/03/2019 03:08 PM, Daniel Borkmann wrote: > > On 06/01/2019 12:37 AM, Matt Mullins wrote: > > > It is possible that a BPF program can be called while another BPF > > > program is executing bpf_perf_event_output. This has been observed with > > > I/O completion occurring as a result of an interrupt: > > > > > > bpf_prog_247fd1341cddaea4_trace_req_end+0x8d7/0x1000 > > > ? trace_call_bpf+0x82/0x100 > > > ? sch_direct_xmit+0xe2/0x230 > > > ? blk_mq_end_request+0x1/0x100 > > > ? blk_mq_end_request+0x5/0x100 > > > ? kprobe_perf_func+0x19b/0x240 > > > ? __qdisc_run+0x86/0x520 > > > ? blk_mq_end_request+0x1/0x100 > > > ? blk_mq_end_request+0x5/0x100 > > > ? kprobe_ftrace_handler+0x90/0xf0 > > > ? ftrace_ops_assist_func+0x6e/0xe0 > > > ? ip6_input_finish+0xbf/0x460 > > > ? 0xa01e80bf > > > ? nbd_dbg_flags_show+0xc0/0xc0 [nbd] > > > ? blkdev_issue_zeroout+0x200/0x200 > > > ? blk_mq_end_request+0x1/0x100 > > > ? blk_mq_end_request+0x5/0x100 > > > ? flush_smp_call_function_queue+0x6c/0xe0 > > > ? smp_call_function_single_interrupt+0x32/0xc0 > > > ? call_function_single_interrupt+0xf/0x20 > > > ? call_function_single_interrupt+0xa/0x20 > > > ? swiotlb_map_page+0x140/0x140 > > > ? refcount_sub_and_test+0x1a/0x50 > > > ? tcp_wfree+0x20/0xf0 > > > ? skb_release_head_state+0x62/0xc0 > > > ? skb_release_all+0xe/0x30 > > > ? napi_consume_skb+0xb5/0x100 > > > ? mlx5e_poll_tx_cq+0x1df/0x4e0 > > > ? mlx5e_poll_tx_cq+0x38c/0x4e0 > > > ? mlx5e_napi_poll+0x58/0xc30 > > > ? mlx5e_napi_poll+0x232/0xc30 > > > ? net_rx_action+0x128/0x340 > > > ? __do_softirq+0xd4/0x2ad > > > ? irq_exit+0xa5/0xb0 > > > ? do_IRQ+0x7d/0xc0 > > > ? common_interrupt+0xf/0xf > > > > > > ? __rb_free_aux+0xf0/0xf0 > > > ? perf_output_sample+0x28/0x7b0 > > > ? perf_prepare_sample+0x54/0x4a0 > > > ? perf_event_output+0x43/0x60 > > > ? bpf_perf_event_output_raw_tp+0x15f/0x180 > > > ? blk_mq_start_request+0x1/0x120 > > > ? 
bpf_prog_411a64a706fc6044_should_trace+0xad4/0x1000 > > > ? bpf_trace_run3+0x2c/0x80 > > > ? nbd_send_cmd+0x4c2/0x690 [nbd] > > > > > > This also cannot be alleviated by further splitting the per-cpu > > > perf_sample_data structs (as in commit 283ca526a9bd ("bpf: fix > > > corruption on concurrent perf_event_output calls")), as a raw_tp could > > > be attached to the block:block_rq_complete tracepoint and execute during > > > another raw_tp. Instead, keep a pre-allocated perf_sample_data > > > structure per perf_event_array element and fail a bpf_perf_event_output > > > if that element is concurrently being used. > > > > > > Fixes: 20b9d7ac4852 ("bpf: avoid excessive stack usage for > > > perf_sample_data") > > > Signed-off-by: Matt Mullins > > > > You do not elaborate why is this needed for all the networking programs that > > use this functionality. The bpf_misc_sd should therefore be kept as-is. > > There > > cannot be nested occurrences there (xdp, tc ingress/egress). Please explain > > why > > non-tracing should be affected here... If these are invariably non-nested, I can easily keep bpf_misc_sd when I resubmit. There was no technical reason other than keeping the two codepaths as similar as possible. What resource gives you worry about doing this for the networking codepath? > Aside from that it's also really bad to miss events like this as exporting > through rb is critical. Why can't you have a per-CPU counter that selects a > sample data context based on nesting level in tracing? (I don't see a > discussion > of this in your commit message.) This change would only drop messages if the same perf_event is attempted to be used recursively (i.e. the same CPU on the same PERF_EVENT_ARRAY map, as I haven't observed anything use index != BPF_F_CURRENT_CPU in testing). I'll try to accomplish the same with a percpu nesting level and allocating 2 or 3 perf_sample_data per cpu. 
I think that'll solve the same problem -- a local patch keeping track of the nesting level is how I got the above stack trace, too.
[PATCH bpf v2] bpf: preallocate a perf_sample_data per event fd
It is possible that a BPF program can be called while another BPF program is executing bpf_perf_event_output. This has been observed with I/O completion occurring as a result of an interrupt:

	bpf_prog_247fd1341cddaea4_trace_req_end+0x8d7/0x1000
	? trace_call_bpf+0x82/0x100
	? sch_direct_xmit+0xe2/0x230
	? blk_mq_end_request+0x1/0x100
	? blk_mq_end_request+0x5/0x100
	? kprobe_perf_func+0x19b/0x240
	? __qdisc_run+0x86/0x520
	? blk_mq_end_request+0x1/0x100
	? blk_mq_end_request+0x5/0x100
	? kprobe_ftrace_handler+0x90/0xf0
	? ftrace_ops_assist_func+0x6e/0xe0
	? ip6_input_finish+0xbf/0x460
	? 0xa01e80bf
	? nbd_dbg_flags_show+0xc0/0xc0 [nbd]
	? blkdev_issue_zeroout+0x200/0x200
	? blk_mq_end_request+0x1/0x100
	? blk_mq_end_request+0x5/0x100
	? flush_smp_call_function_queue+0x6c/0xe0
	? smp_call_function_single_interrupt+0x32/0xc0
	? call_function_single_interrupt+0xf/0x20
	? call_function_single_interrupt+0xa/0x20
	? swiotlb_map_page+0x140/0x140
	? refcount_sub_and_test+0x1a/0x50
	? tcp_wfree+0x20/0xf0
	? skb_release_head_state+0x62/0xc0
	? skb_release_all+0xe/0x30
	? napi_consume_skb+0xb5/0x100
	? mlx5e_poll_tx_cq+0x1df/0x4e0
	? mlx5e_poll_tx_cq+0x38c/0x4e0
	? mlx5e_napi_poll+0x58/0xc30
	? mlx5e_napi_poll+0x232/0xc30
	? net_rx_action+0x128/0x340
	? __do_softirq+0xd4/0x2ad
	? irq_exit+0xa5/0xb0
	? do_IRQ+0x7d/0xc0
	? common_interrupt+0xf/0xf

	? __rb_free_aux+0xf0/0xf0
	? perf_output_sample+0x28/0x7b0
	? perf_prepare_sample+0x54/0x4a0
	? perf_event_output+0x43/0x60
	? bpf_perf_event_output_raw_tp+0x15f/0x180
	? blk_mq_start_request+0x1/0x120
	? bpf_prog_411a64a706fc6044_should_trace+0xad4/0x1000
	? bpf_trace_run3+0x2c/0x80
	? nbd_send_cmd+0x4c2/0x690 [nbd]

This also cannot be alleviated by further splitting the per-cpu perf_sample_data structs (as in commit 283ca526a9bd ("bpf: fix corruption on concurrent perf_event_output calls")), as a raw_tp could be attached to the block:block_rq_complete tracepoint and execute during another raw_tp.
Instead, keep a pre-allocated perf_sample_data structure per perf_event_array element and fail a bpf_perf_event_output if that element is concurrently being used. Fixes: 20b9d7ac4852 ("bpf: avoid excessive stack usage for perf_sample_data") Signed-off-by: Matt Mullins --- v1->v2: keep a pointer to the struct perf_sample_data rather than directly embedding it in the structure, avoiding the circular include and removing the need for in_use. Suggested by Song. include/linux/bpf.h | 1 + kernel/bpf/arraymap.c| 3 ++- kernel/trace/bpf_trace.c | 29 - 3 files changed, 19 insertions(+), 14 deletions(-) diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 4fb3aa2dc975..47fd85cfbbaf 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -472,6 +472,7 @@ struct bpf_event_entry { struct file *perf_file; struct file *map_file; struct rcu_head rcu; + struct perf_sample_data *sd; }; bool bpf_prog_array_compatible(struct bpf_array *array, const struct bpf_prog *fp); diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c index 584636c9e2eb..c7f5d593e04f 100644 --- a/kernel/bpf/arraymap.c +++ b/kernel/bpf/arraymap.c @@ -654,11 +654,12 @@ static struct bpf_event_entry *bpf_event_entry_gen(struct file *perf_file, { struct bpf_event_entry *ee; - ee = kzalloc(sizeof(*ee), GFP_ATOMIC); + ee = kzalloc(sizeof(*ee) + sizeof(struct perf_sample_data), GFP_ATOMIC); if (ee) { ee->event = perf_file->private_data; ee->perf_file = perf_file; ee->map_file = map_file; + ee->sd = (void *)ee + sizeof(*ee); } return ee; diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index f92d6ad5e080..076f8e987355 100644 --- a/kernel/trace/bpf_trace.c +++ b/kernel/trace/bpf_trace.c @@ -410,17 +410,17 @@ static const struct bpf_func_proto bpf_perf_event_read_value_proto = { .arg4_type = ARG_CONST_SIZE, }; -static DEFINE_PER_CPU(struct perf_sample_data, bpf_trace_sd); - static __always_inline u64 __bpf_perf_event_output(struct pt_regs *regs, struct bpf_map *map, - u64 flags, struct 
perf_sample_data *sd) + u64 flags, struct perf_raw_record *raw) { struct bpf_array *array = container_of(map, struct bpf_array, map); unsigned int cpu = smp_processor_id(); u64 index = flags & BPF_F_INDEX_MASK; struct bpf_event_entry *ee; struct perf_event *event; + struct perf_sample_data *sd; + u64 ret
[PATCH] x86/kgdb: return 0 from kgdb_arch_set_breakpoint
err must be nonzero in order to reach text_poke(), which caused kgdb to
fail to set breakpoints:

  (gdb) break __x64_sys_sync
  Breakpoint 1 at 0x81288910: file ../fs/sync.c, line 124.
  (gdb) c
  Continuing.
  Warning:
  Cannot insert breakpoint 1.
  Cannot access memory at address 0x81288910

  Command aborted.

Fixes: 86a22057127d ("x86/kgdb: Avoid redundant comparison of patched code")
Signed-off-by: Matt Mullins
---
 arch/x86/kernel/kgdb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/kgdb.c b/arch/x86/kernel/kgdb.c
index 9a8c1648fc9a..6690c5652aeb 100644
--- a/arch/x86/kernel/kgdb.c
+++ b/arch/x86/kernel/kgdb.c
@@ -758,7 +758,7 @@ int kgdb_arch_set_breakpoint(struct kgdb_bkpt *bpt)
 			BREAK_INSTR_SIZE);
 	bpt->type = BP_POKE_BREAKPOINT;
 
-	return err;
+	return 0;
 }
 
 int kgdb_arch_remove_breakpoint(struct kgdb_bkpt *bpt)
-- 
2.17.1
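The bug fixed above is a classic "scratch variable leaks into the return value": err holds the (necessarily nonzero) result of an earlier comparison used only to decide whether patching is needed, so returning it reports failure even though the breakpoint was written. A condensed userspace illustration of the pattern follows -- the logic is paraphrased, not the real kgdb code, and the names are made up:

```c
#include <string.h>

/* Condensed illustration of the bug class (paraphrased, not kgdb).
 * err records whether the text still holds the original instruction;
 * reaching the "patch it" path requires err != 0, so returning err
 * reports failure even when patching succeeds. */
static char text[1] = { 0x55 };          /* stand-in for kernel text */

static int set_breakpoint_buggy(void)
{
	const char bkpt = (char)0xcc;
	int err = memcmp(text, &bkpt, 1);   /* nonzero: not yet patched */

	if (err) {
		text[0] = bkpt;             /* "text_poke": this works */
		return err;                 /* BUG: leftover nonzero err */
	}
	return 0;
}

static int set_breakpoint_fixed(void)
{
	const char bkpt = (char)0xcc;
	int err = memcmp(text, &bkpt, 1);

	if (err)
		text[0] = bkpt;
	return 0;                           /* the patch path succeeded */
}
```

Callers treating a nonzero return as "could not write memory" is precisely what produced gdb's "Cannot access memory" message above.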
Re: [PATCH bpf] bpf: preallocate a perf_sample_data per event fd
On Thu, 2019-05-30 at 23:28 +, Song Liu wrote: > > On May 30, 2019, at 3:55 PM, Matt Mullins wrote: > > > > It is possible that a BPF program can be called while another BPF > > program is executing bpf_perf_event_output. This has been observed with > > I/O completion occurring as a result of an interrupt: > > > > bpf_prog_247fd1341cddaea4_trace_req_end+0x8d7/0x1000 > > ? trace_call_bpf+0x82/0x100 > > ? sch_direct_xmit+0xe2/0x230 > > ? blk_mq_end_request+0x1/0x100 > > ? blk_mq_end_request+0x5/0x100 > > ? kprobe_perf_func+0x19b/0x240 > > ? __qdisc_run+0x86/0x520 > > ? blk_mq_end_request+0x1/0x100 > > ? blk_mq_end_request+0x5/0x100 > > ? kprobe_ftrace_handler+0x90/0xf0 > > ? ftrace_ops_assist_func+0x6e/0xe0 > > ? ip6_input_finish+0xbf/0x460 > > ? 0xa01e80bf > > ? nbd_dbg_flags_show+0xc0/0xc0 [nbd] > > ? blkdev_issue_zeroout+0x200/0x200 > > ? blk_mq_end_request+0x1/0x100 > > ? blk_mq_end_request+0x5/0x100 > > ? flush_smp_call_function_queue+0x6c/0xe0 > > ? smp_call_function_single_interrupt+0x32/0xc0 > > ? call_function_single_interrupt+0xf/0x20 > > ? call_function_single_interrupt+0xa/0x20 > > ? swiotlb_map_page+0x140/0x140 > > ? refcount_sub_and_test+0x1a/0x50 > > ? tcp_wfree+0x20/0xf0 > > ? skb_release_head_state+0x62/0xc0 > > ? skb_release_all+0xe/0x30 > > ? napi_consume_skb+0xb5/0x100 > > ? mlx5e_poll_tx_cq+0x1df/0x4e0 > > ? mlx5e_poll_tx_cq+0x38c/0x4e0 > > ? mlx5e_napi_poll+0x58/0xc30 > > ? mlx5e_napi_poll+0x232/0xc30 > > ? net_rx_action+0x128/0x340 > > ? __do_softirq+0xd4/0x2ad > > ? irq_exit+0xa5/0xb0 > > ? do_IRQ+0x7d/0xc0 > > ? common_interrupt+0xf/0xf > > > > ? __rb_free_aux+0xf0/0xf0 > > ? perf_output_sample+0x28/0x7b0 > > ? perf_prepare_sample+0x54/0x4a0 > > ? perf_event_output+0x43/0x60 > > ? bpf_perf_event_output_raw_tp+0x15f/0x180 > > ? blk_mq_start_request+0x1/0x120 > > ? bpf_prog_411a64a706fc6044_should_trace+0xad4/0x1000 > > ? bpf_trace_run3+0x2c/0x80 > > ? 
nbd_send_cmd+0x4c2/0x690 [nbd] > > > > This also cannot be alleviated by further splitting the per-cpu > > perf_sample_data structs (as in commit 283ca526a9bd ("bpf: fix > > corruption on concurrent perf_event_output calls")), as a raw_tp could > > be attached to the block:block_rq_complete tracepoint and execute during > > another raw_tp. Instead, keep a pre-allocated perf_sample_data > > structure per perf_event_array element and fail a bpf_perf_event_output > > if that element is concurrently being used. > > > > Fixes: 20b9d7ac4852 ("bpf: avoid excessive stack usage for > > perf_sample_data") > > Signed-off-by: Matt Mullins > > --- > > It felt a bit overkill, but I had to split bpf_event_entry into its own > > header file to break an include cycle from perf_event.h -> cgroup.h -> > > cgroup-defs.h -> bpf-cgroup.h -> bpf.h -> (potentially) perf_event.h. > > > > include/linux/bpf.h | 7 --- > > include/linux/bpf_event.h | 20 > > kernel/bpf/arraymap.c | 2 ++ > > kernel/trace/bpf_trace.c | 30 +- > > 4 files changed, 39 insertions(+), 20 deletions(-) > > create mode 100644 include/linux/bpf_event.h > > > > diff --git a/include/linux/bpf.h b/include/linux/bpf.h > > index 4fb3aa2dc975..13b253a36402 100644 > > --- a/include/linux/bpf.h > > +++ b/include/linux/bpf.h > > @@ -467,13 +467,6 @@ static inline bool bpf_map_flags_access_ok(u32 > > access_flags) > >(BPF_F_RDONLY_PROG | BPF_F_WRONLY_PROG); > > } > > > > I think we can avoid the include cycle as: > > +struct perf_sample_data *sd; > struct bpf_event_entry { > struct perf_event *event; > struct file *perf_file; > struct file *map_file; > struct rcu_head rcu; > + struct perf_sample_data *sd; > }; Yeah, that totally works. I was mostly doing this so we had only one kmalloc allocation, but I'm not too worried about having an extra object in kmalloc-64 if it simplifies the code a lot. 
> > > -struct bpf_event_entry { > > - struct perf_event *event; > > - struct file *perf_file; > > - struct file *map_file; > > - struct rcu_head rcu; > > -}; > > - > > bool bpf_prog_array_compatible(struct bpf_array *array, const struct > > bpf_pr
[PATCH bpf] bpf: preallocate a perf_sample_data per event fd
It is possible that a BPF program can be called while another BPF program is executing bpf_perf_event_output. This has been observed with I/O completion occurring as a result of an interrupt:

	bpf_prog_247fd1341cddaea4_trace_req_end+0x8d7/0x1000
	? trace_call_bpf+0x82/0x100
	? sch_direct_xmit+0xe2/0x230
	? blk_mq_end_request+0x1/0x100
	? blk_mq_end_request+0x5/0x100
	? kprobe_perf_func+0x19b/0x240
	? __qdisc_run+0x86/0x520
	? blk_mq_end_request+0x1/0x100
	? blk_mq_end_request+0x5/0x100
	? kprobe_ftrace_handler+0x90/0xf0
	? ftrace_ops_assist_func+0x6e/0xe0
	? ip6_input_finish+0xbf/0x460
	? 0xa01e80bf
	? nbd_dbg_flags_show+0xc0/0xc0 [nbd]
	? blkdev_issue_zeroout+0x200/0x200
	? blk_mq_end_request+0x1/0x100
	? blk_mq_end_request+0x5/0x100
	? flush_smp_call_function_queue+0x6c/0xe0
	? smp_call_function_single_interrupt+0x32/0xc0
	? call_function_single_interrupt+0xf/0x20
	? call_function_single_interrupt+0xa/0x20
	? swiotlb_map_page+0x140/0x140
	? refcount_sub_and_test+0x1a/0x50
	? tcp_wfree+0x20/0xf0
	? skb_release_head_state+0x62/0xc0
	? skb_release_all+0xe/0x30
	? napi_consume_skb+0xb5/0x100
	? mlx5e_poll_tx_cq+0x1df/0x4e0
	? mlx5e_poll_tx_cq+0x38c/0x4e0
	? mlx5e_napi_poll+0x58/0xc30
	? mlx5e_napi_poll+0x232/0xc30
	? net_rx_action+0x128/0x340
	? __do_softirq+0xd4/0x2ad
	? irq_exit+0xa5/0xb0
	? do_IRQ+0x7d/0xc0
	? common_interrupt+0xf/0xf

	? __rb_free_aux+0xf0/0xf0
	? perf_output_sample+0x28/0x7b0
	? perf_prepare_sample+0x54/0x4a0
	? perf_event_output+0x43/0x60
	? bpf_perf_event_output_raw_tp+0x15f/0x180
	? blk_mq_start_request+0x1/0x120
	? bpf_prog_411a64a706fc6044_should_trace+0xad4/0x1000
	? bpf_trace_run3+0x2c/0x80
	? nbd_send_cmd+0x4c2/0x690 [nbd]

This also cannot be alleviated by further splitting the per-cpu perf_sample_data structs (as in commit 283ca526a9bd ("bpf: fix corruption on concurrent perf_event_output calls")), as a raw_tp could be attached to the block:block_rq_complete tracepoint and execute during another raw_tp.
Instead, keep a pre-allocated perf_sample_data structure per perf_event_array element and fail a bpf_perf_event_output if that element is concurrently being used. Fixes: 20b9d7ac4852 ("bpf: avoid excessive stack usage for perf_sample_data") Signed-off-by: Matt Mullins --- It felt a bit overkill, but I had to split bpf_event_entry into its own header file to break an include cycle from perf_event.h -> cgroup.h -> cgroup-defs.h -> bpf-cgroup.h -> bpf.h -> (potentially) perf_event.h. include/linux/bpf.h | 7 --- include/linux/bpf_event.h | 20 kernel/bpf/arraymap.c | 2 ++ kernel/trace/bpf_trace.c | 30 +- 4 files changed, 39 insertions(+), 20 deletions(-) create mode 100644 include/linux/bpf_event.h diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 4fb3aa2dc975..13b253a36402 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -467,13 +467,6 @@ static inline bool bpf_map_flags_access_ok(u32 access_flags) (BPF_F_RDONLY_PROG | BPF_F_WRONLY_PROG); } -struct bpf_event_entry { - struct perf_event *event; - struct file *perf_file; - struct file *map_file; - struct rcu_head rcu; -}; - bool bpf_prog_array_compatible(struct bpf_array *array, const struct bpf_prog *fp); int bpf_prog_calc_tag(struct bpf_prog *fp); diff --git a/include/linux/bpf_event.h b/include/linux/bpf_event.h new file mode 100644 index ..9f415990f921 --- /dev/null +++ b/include/linux/bpf_event.h @@ -0,0 +1,20 @@ +/* SPDX-License-Identifier: GPL-2.0 */ + +#ifndef _LINUX_BPF_EVENT_H +#define _LINUX_BPF_EVENT_H + +#include +#include + +struct file; + +struct bpf_event_entry { + struct perf_event *event; + struct file *perf_file; + struct file *map_file; + struct rcu_head rcu; + struct perf_sample_data sd; + atomic_t in_use; +}; + +#endif /* _LINUX_BPF_EVENT_H */ diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c index 584636c9e2eb..08e5e486d563 100644 --- a/kernel/bpf/arraymap.c +++ b/kernel/bpf/arraymap.c @@ -11,6 +11,7 @@ * General Public License for more details. 
*/ #include +#include #include #include #include @@ -659,6 +660,7 @@ static struct bpf_event_entry *bpf_event_entry_gen(struct file *perf_file, ee->event = perf_file->private_data; ee->perf_file = perf_file; ee->map_file = map_file; + atomic_set(&ee->in_use, 0); } return ee; diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c index f92d6ad5e080..a03e29957698 100644 --- a/kernel/tr
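The in_use field in the v1 patch above amounts to a per-element trylock: claim with an atomic 0-to-1 compare-and-swap, drop the event when the swap loses, and release by storing 0. A minimal C11 sketch of that protocol follows -- the in_use field name comes from the patch, but the helper names and the stand-in buffer are made up, and atomic_compare_exchange_strong stands in for the kernel's atomic_cmpxchg():

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Minimal sketch of the v1 in_use protocol: the element's preallocated
 * sample data may be used by at most one context at a time; a nested
 * bpf_perf_event_output on the same element loses the cmpxchg and
 * returns -EBUSY instead of corrupting the in-flight sample. */
struct event_entry {
	atomic_int in_use;          /* 0 = free, 1 = claimed */
	char sd[64];                /* stand-in for perf_sample_data */
};

static bool try_claim(struct event_entry *ee)
{
	int expected = 0;
	/* atomic_cmpxchg(&ee->in_use, 0, 1) == 0 in kernel terms */
	return atomic_compare_exchange_strong(&ee->in_use, &expected, 1);
}

static void release_entry(struct event_entry *ee)
{
	atomic_store(&ee->in_use, 0);
}
```

v2 dropped in_use entirely: once bpf_event_entry only holds a pointer to the perf_sample_data, claiming can be expressed as swapping that pointer with NULL, which needs no separate flag.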
[PATCH bpf-next v3 5/5] selftests: bpf: test writable buffers in raw tps
This tests that: * a BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE cannot be attached if it uses either: * a variable offset to the tracepoint buffer, or * an offset beyond the size of the tracepoint buffer * a tracer can modify the buffer provided when attached to a writable tracepoint in bpf_prog_test_run Signed-off-by: Matt Mullins --- include/trace/events/bpf_test_run.h | 50 net/bpf/test_run.c| 4 + .../raw_tp_writable_reject_nbd_invalid.c | 40 ++ .../bpf/prog_tests/raw_tp_writable_test_run.c | 80 +++ .../selftests/bpf/verifier/raw_tp_writable.c | 34 5 files changed, 208 insertions(+) create mode 100644 include/trace/events/bpf_test_run.h create mode 100644 tools/testing/selftests/bpf/prog_tests/raw_tp_writable_reject_nbd_invalid.c create mode 100644 tools/testing/selftests/bpf/prog_tests/raw_tp_writable_test_run.c create mode 100644 tools/testing/selftests/bpf/verifier/raw_tp_writable.c diff --git a/include/trace/events/bpf_test_run.h b/include/trace/events/bpf_test_run.h new file mode 100644 index ..abf466839ea4 --- /dev/null +++ b/include/trace/events/bpf_test_run.h @@ -0,0 +1,50 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +#undef TRACE_SYSTEM +#define TRACE_SYSTEM bpf_test_run + +#if !defined(_TRACE_NBD_H) || defined(TRACE_HEADER_MULTI_READ) +#define _TRACE_BPF_TEST_RUN_H + +#include + +DECLARE_EVENT_CLASS(bpf_test_finish, + + TP_PROTO(int *err), + + TP_ARGS(err), + + TP_STRUCT__entry( + __field(int, err) + ), + + TP_fast_assign( + __entry->err = *err; + ), + + TP_printk("bpf_test_finish with err=%d", __entry->err) +); + +#ifdef DEFINE_EVENT_WRITABLE +#undef BPF_TEST_RUN_DEFINE_EVENT +#define BPF_TEST_RUN_DEFINE_EVENT(template, call, proto, args, size) \ + DEFINE_EVENT_WRITABLE(template, call, PARAMS(proto),\ + PARAMS(args), size) +#else +#undef BPF_TEST_RUN_DEFINE_EVENT +#define BPF_TEST_RUN_DEFINE_EVENT(template, call, proto, args, size) \ + DEFINE_EVENT(template, call, PARAMS(proto), PARAMS(args)) +#endif + +BPF_TEST_RUN_DEFINE_EVENT(bpf_test_finish, 
bpf_test_finish, + + TP_PROTO(int *err), + + TP_ARGS(err), + + sizeof(int) +); + +#endif + +/* This part must be outside protection */ +#include diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c index fab142b796ef..25e757102595 100644 --- a/net/bpf/test_run.c +++ b/net/bpf/test_run.c @@ -13,6 +13,9 @@ #include #include +#define CREATE_TRACE_POINTS +#include + static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat, u32 *retval, u32 *time) { @@ -100,6 +103,7 @@ static int bpf_test_finish(const union bpf_attr *kattr, if (err != -ENOSPC) err = 0; out: + trace_bpf_test_finish(&err); return err; } diff --git a/tools/testing/selftests/bpf/prog_tests/raw_tp_writable_reject_nbd_invalid.c b/tools/testing/selftests/bpf/prog_tests/raw_tp_writable_reject_nbd_invalid.c new file mode 100644 index ..328d5c4b084b --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/raw_tp_writable_reject_nbd_invalid.c @@ -0,0 +1,40 @@ +// SPDX-License-Identifier: GPL-2.0 + +#include +#include + +void test_raw_tp_writable_reject_nbd_invalid(void) +{ + __u32 duration = 0; + char error[4096]; + int bpf_fd = -1, tp_fd = -1; + + const struct bpf_insn program[] = { + /* r6 is our tp buffer */ + BPF_LDX_MEM(BPF_DW, BPF_REG_6, BPF_REG_1, 0), + BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_6, 128), + BPF_EXIT_INSN(), + }; + + struct bpf_load_program_attr load_attr = { + .prog_type = BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, + .license = "GPL v2", + .insns = program, + .insns_cnt = sizeof(program) / sizeof(struct bpf_insn), + .log_level = 2, + }; + + bpf_fd = bpf_load_program_xattr(&load_attr, error, sizeof(error)); + if (CHECK(bpf_fd < 0, "bpf_raw_tracepoint_writable loaded", + "failed: %d errno %d\n", bpf_fd, errno)) + return; + + tp_fd = bpf_raw_tracepoint_open("nbd_send_request", bpf_fd); + if (CHECK(tp_fd >= 0, "bpf_raw_tracepoint_writable opened", + "erroneously succeeded\n")) + goto out_bpffd; + + close(tp_fd); +out_bpffd: + close(bpf_fd); +} diff --git 
a/tools/testing/selftests/bpf/prog_tests/raw_tp_writable_test_run.c b/tools/testing/selftests/bpf/prog_tests/raw_tp_writable_test_run.c new file mode 100644 index ..4145925f9cab --- /dev/null +++ b/tools/testing/selftests/bpf/prog_tests/raw_tp_writable_test_run.c @@ -0,0 +1,80 @@ +// SPDX-Lice
Re: [PATCH bpf-next v2] bpf: support raw tracepoints in modules
On Thu, 2018-12-13 at 19:22 +, Martin Lau wrote: > On Wed, Dec 12, 2018 at 04:42:37PM -0800, Matt Mullins wrote: > > Distributions build drivers as modules, including network and filesystem > > drivers which export numerous tracepoints. This enables > > bpf(BPF_RAW_TRACEPOINT_OPEN) to attach to those tracepoints. > > > > Signed-off-by: Matt Mullins > > --- > > v1->v2: > > * avoid taking the mutex in bpf_event_notify when op is neither COMING nor > > GOING. > > * check that kzalloc actually succeeded > > > > I didn't try to check list_empty before taking the mutex since I want to > > avoid > > races between bpf_event_notify and bpf_get_raw_tracepoint. Additionally, > > list_for_each_entry_safe is not strictly necessary upon MODULE_STATE_GOING, > > but > > Alexei suggested I use it to protect against fragility if the subsequent > > break; > > eventually disappears. > > > > include/linux/module.h | 4 ++ > > include/linux/trace_events.h | 8 ++- > > kernel/bpf/syscall.c | 11 ++-- > > kernel/module.c | 5 ++ > > kernel/trace/bpf_trace.c | 99 +++- > > 5 files changed, 120 insertions(+), 7 deletions(-) > > > > diff --git a/include/linux/module.h b/include/linux/module.h > > index fce6b4335e36..5f147dd5e709 100644 > > --- a/include/linux/module.h > > +++ b/include/linux/module.h > > @@ -432,6 +432,10 @@ struct module { > > unsigned int num_tracepoints; > > tracepoint_ptr_t *tracepoints_ptrs; > > #endif > > +#ifdef CONFIG_BPF_EVENTS > > + unsigned int num_bpf_raw_events; > > + struct bpf_raw_event_map *bpf_raw_events; > > +#endif > > #ifdef HAVE_JUMP_LABEL > > struct jump_entry *jump_entries; > > unsigned int num_jump_entries; > > diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h > > index 4130a5497d40..8a62731673f7 100644 > > --- a/include/linux/trace_events.h > > +++ b/include/linux/trace_events.h > > @@ -471,7 +471,8 @@ void perf_event_detach_bpf_prog(struct perf_event > > *event); > > int perf_event_query_prog_array(struct perf_event *event, void 
__user > > *info); > > int bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog > > *prog); > > int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf_prog > > *prog); > > -struct bpf_raw_event_map *bpf_find_raw_tracepoint(const char *name); > > +struct bpf_raw_event_map *bpf_get_raw_tracepoint(const char *name); > > +void bpf_put_raw_tracepoint(struct bpf_raw_event_map *btp); > > int bpf_get_perf_event_info(const struct perf_event *event, u32 *prog_id, > > u32 *fd_type, const char **buf, > > u64 *probe_offset, u64 *probe_addr); > > @@ -502,10 +503,13 @@ static inline int bpf_probe_unregister(struct > > bpf_raw_event_map *btp, struct bpf > > { > > return -EOPNOTSUPP; > > } > > -static inline struct bpf_raw_event_map *bpf_find_raw_tracepoint(const char > > *name) > > +static inline struct bpf_raw_event_map *bpf_get_raw_tracepoint(const char > > *name) > > { > > return NULL; > > } > > +static inline void bpf_put_raw_tracepoint(struct bpf_raw_event_map *btp) > > +{ > > +} > > static inline int bpf_get_perf_event_info(const struct perf_event *event, > > u32 *prog_id, u32 *fd_type, > > const char **buf, u64 *probe_offset, > > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c > > index 70fb11106fc2..754370e3155e 100644 > > --- a/kernel/bpf/syscall.c > > +++ b/kernel/bpf/syscall.c > > @@ -1609,6 +1609,7 @@ static int bpf_raw_tracepoint_release(struct inode > > *inode, struct file *filp) > > bpf_probe_unregister(raw_tp->btp, raw_tp->prog); > > bpf_prog_put(raw_tp->prog); > > } > > + bpf_put_raw_tracepoint(raw_tp->btp); > > kfree(raw_tp); > > return 0; > > } > > @@ -1634,13 +1635,15 @@ static int bpf_raw_tracepoint_open(const union > > bpf_attr *attr) > > return -EFAULT; > > tp_name[sizeof(tp_name) - 1] = 0; > > > > - btp = bpf_find_raw_tracepoint(tp_name); > > + btp = bpf_get_raw_tracepoint(tp_name); > > if (!btp) > > return -ENOENT; > > > > raw_tp = kzalloc(sizeof(*raw_tp), GFP_USER); > > - if (!raw_tp) > > - return -ENOMEM; >
[PATCH bpf-next v2] bpf: support raw tracepoints in modules
Distributions build drivers as modules, including network and filesystem drivers which export numerous tracepoints. This enables bpf(BPF_RAW_TRACEPOINT_OPEN) to attach to those tracepoints. Signed-off-by: Matt Mullins --- v1->v2: * avoid taking the mutex in bpf_event_notify when op is neither COMING nor GOING. * check that kzalloc actually succeeded I didn't try to check list_empty before taking the mutex since I want to avoid races between bpf_event_notify and bpf_get_raw_tracepoint. Additionally, list_for_each_entry_safe is not strictly necessary upon MODULE_STATE_GOING, but Alexei suggested I use it to protect against fragility if the subsequent break; eventually disappears. include/linux/module.h | 4 ++ include/linux/trace_events.h | 8 ++- kernel/bpf/syscall.c | 11 ++-- kernel/module.c | 5 ++ kernel/trace/bpf_trace.c | 99 +++- 5 files changed, 120 insertions(+), 7 deletions(-) diff --git a/include/linux/module.h b/include/linux/module.h index fce6b4335e36..5f147dd5e709 100644 --- a/include/linux/module.h +++ b/include/linux/module.h @@ -432,6 +432,10 @@ struct module { unsigned int num_tracepoints; tracepoint_ptr_t *tracepoints_ptrs; #endif +#ifdef CONFIG_BPF_EVENTS + unsigned int num_bpf_raw_events; + struct bpf_raw_event_map *bpf_raw_events; +#endif #ifdef HAVE_JUMP_LABEL struct jump_entry *jump_entries; unsigned int num_jump_entries; diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h index 4130a5497d40..8a62731673f7 100644 --- a/include/linux/trace_events.h +++ b/include/linux/trace_events.h @@ -471,7 +471,8 @@ void perf_event_detach_bpf_prog(struct perf_event *event); int perf_event_query_prog_array(struct perf_event *event, void __user *info); int bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog *prog); int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf_prog *prog); -struct bpf_raw_event_map *bpf_find_raw_tracepoint(const char *name); +struct bpf_raw_event_map *bpf_get_raw_tracepoint(const char *name); 
+void bpf_put_raw_tracepoint(struct bpf_raw_event_map *btp); int bpf_get_perf_event_info(const struct perf_event *event, u32 *prog_id, u32 *fd_type, const char **buf, u64 *probe_offset, u64 *probe_addr); @@ -502,10 +503,13 @@ static inline int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf { return -EOPNOTSUPP; } -static inline struct bpf_raw_event_map *bpf_find_raw_tracepoint(const char *name) +static inline struct bpf_raw_event_map *bpf_get_raw_tracepoint(const char *name) { return NULL; } +static inline void bpf_put_raw_tracepoint(struct bpf_raw_event_map *btp) +{ +} static inline int bpf_get_perf_event_info(const struct perf_event *event, u32 *prog_id, u32 *fd_type, const char **buf, u64 *probe_offset, diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c index 70fb11106fc2..754370e3155e 100644 --- a/kernel/bpf/syscall.c +++ b/kernel/bpf/syscall.c @@ -1609,6 +1609,7 @@ static int bpf_raw_tracepoint_release(struct inode *inode, struct file *filp) bpf_probe_unregister(raw_tp->btp, raw_tp->prog); bpf_prog_put(raw_tp->prog); } + bpf_put_raw_tracepoint(raw_tp->btp); kfree(raw_tp); return 0; } @@ -1634,13 +1635,15 @@ static int bpf_raw_tracepoint_open(const union bpf_attr *attr) return -EFAULT; tp_name[sizeof(tp_name) - 1] = 0; - btp = bpf_find_raw_tracepoint(tp_name); + btp = bpf_get_raw_tracepoint(tp_name); if (!btp) return -ENOENT; raw_tp = kzalloc(sizeof(*raw_tp), GFP_USER); - if (!raw_tp) - return -ENOMEM; + if (!raw_tp) { + err = -ENOMEM; + goto out_put_btp; + } raw_tp->btp = btp; prog = bpf_prog_get_type(attr->raw_tracepoint.prog_fd, @@ -1668,6 +1671,8 @@ static int bpf_raw_tracepoint_open(const union bpf_attr *attr) bpf_prog_put(prog); out_free_tp: kfree(raw_tp); +out_put_btp: + bpf_put_raw_tracepoint(btp); return err; } diff --git a/kernel/module.c b/kernel/module.c index 49a405891587..06ec68f08387 100644 --- a/kernel/module.c +++ b/kernel/module.c @@ -3093,6 +3093,11 @@ static int find_module_sections(struct module *mod, struct load_info 
*info) sizeof(*mod->tracepoints_ptrs), &mod->num_tracepoints); #endif +#ifdef CONFIG_BPF_EVENTS + mod->bpf_raw_events = section_objs(info, "__bpf_raw_tp_map", + sizeof(*mod->bpf_raw_events), + &mod->num_b
Re: [PATCH net] tls: don't use stack memory in a scatterlist
On Thu, 2018-05-17 at 14:50 -0400, David Miller wrote:
> I'm surprised this problem wasn't discovered sooner. How exactly did you
> discover it? Did you actually see it trigger or is this purely from code
> inspection?

Honestly, I'm not sure how it got uncovered, but it was observed at runtime.
Doron Roberts-Kedes hit a null pointer dereference so we turned on
CONFIG_DEBUG_SG -- then it became a proper BUG_ON(!virt_addr_valid(buf)); in
sg_set_buf.
[PATCH net] tls: don't use stack memory in a scatterlist
scatterlist code expects virt_to_page() to work, which fails with
CONFIG_VMAP_STACK=y.

Fixes: c46234ebb4d1e ("tls: RX path for ktls")
Signed-off-by: Matt Mullins
---
 include/net/tls.h | 3 +++
 net/tls/tls_sw.c  | 9 -
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/include/net/tls.h b/include/net/tls.h
index b400d0bb7448..f5fb16da3860 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -97,6 +97,9 @@ struct tls_sw_context {
 	u8 control;
 	bool decrypted;
 
+	char rx_aad_ciphertext[TLS_AAD_SPACE_SIZE];
+	char rx_aad_plaintext[TLS_AAD_SPACE_SIZE];
+
 	/* Sending context */
 	char aad_space[TLS_AAD_SPACE_SIZE];
 
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 71e79597f940..e1c93ce74e0f 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -680,7 +680,6 @@ static int decrypt_skb(struct sock *sk, struct sk_buff *skb,
 	struct scatterlist *sgin = &sgin_arr[0];
 	struct strp_msg *rxm = strp_msg(skb);
 	int ret, nsg = ARRAY_SIZE(sgin_arr);
-	char aad_recv[TLS_AAD_SPACE_SIZE];
 	struct sk_buff *unused;
 
 	ret = skb_copy_bits(skb, rxm->offset + TLS_HEADER_SIZE,
@@ -698,13 +697,13 @@ static int decrypt_skb(struct sock *sk, struct sk_buff *skb,
 	}
 
 	sg_init_table(sgin, nsg);
-	sg_set_buf(&sgin[0], aad_recv, sizeof(aad_recv));
+	sg_set_buf(&sgin[0], ctx->rx_aad_ciphertext, TLS_AAD_SPACE_SIZE);
 
 	nsg = skb_to_sgvec(skb, &sgin[1],
 			   rxm->offset + tls_ctx->rx.prepend_size,
 			   rxm->full_len - tls_ctx->rx.prepend_size);
 
-	tls_make_aad(aad_recv,
+	tls_make_aad(ctx->rx_aad_ciphertext,
 		     rxm->full_len - tls_ctx->rx.overhead_size,
 		     tls_ctx->rx.rec_seq, tls_ctx->rx.rec_seq_size,
@@ -803,12 +802,12 @@ int tls_sw_recvmsg(struct sock *sk,
 		if (to_copy <= len && page_count < MAX_SKB_FRAGS &&
 		    likely(!(flags & MSG_PEEK))) {
 			struct scatterlist sgin[MAX_SKB_FRAGS + 1];
-			char unused[21];
 			int pages = 0;
 
 			zc = true;
 			sg_init_table(sgin, MAX_SKB_FRAGS + 1);
-			sg_set_buf(&sgin[0], unused, 13);
+			sg_set_buf(&sgin[0], ctx->rx_aad_plaintext,
+				   TLS_AAD_SPACE_SIZE);
 
 			err = zerocopy_from_iter(sk, &msg->msg_iter,
						 to_copy, &pages,
-- 
2.14.1
Re: [RFC] KVM: SVM: ignore type when setting segment registers
On Tue, May 30, 2017 at 02:54:21PM +0200, Radim Krčmář wrote:
> 2017-05-29 15:24+0200, Gioh Kim:
> > If so, why type is checked when setting segment registers?
>
> No idea. 19bca6ab75d8 ("KVM: SVM: Fix cross vendor migration issue with
> unusable bit") also moved the assigment up to initialize it before use
> and I think that is enough.

Was this perhaps intended to instead check for a zero selector, which is also
an unusable segment?
Re: [PATCH resend 4.9] hw_random: Don't use a stack buffer in add_early_randomness()
On Sat, Feb 04, 2017 at 11:47:38AM +0800, Yisheng Xie wrote:
> On 2016/10/18 1:06, Andy Lutomirski wrote:
> > hw_random carefully avoids using a stack buffer except in
> > add_early_randomness(). This causes a crash in virtio_rng if
> > CONFIG_VMAP_STACK=y.
>
> I try to understand this patch, but I do not know why it will cause
> a crash in virtio_rng with CONFIG_VMAP_STACK=y?
> Could you please give me more info. about it.

My original report was
https://lkml.kernel.org/r/20161016002151.ga18...@hydra.tuxags.com.

The virtio ring APIs use scatterlists to keep track of the buffers, and
scatterlist requires addresses to be in the kernel direct-mapped address
range. This is not the case for vmalloc()ed addresses, such as the original
on-stack "bytes" array when VMAP_STACK=y.
virtio-rng scatterlist null pointer dereference with VMAP_STACK=y
With VMAP_STACK=y and HW_RANDOM_VIRTIO=y, I get the following crash:

[1.470437] BUG: unable to handle kernel NULL pointer dereference at (null)
[1.473350] IP: [] sg_init_one+0x65/0x90
[1.474658] PGD 0
[1.475169] Oops: [#1] SMP
Entering kdb (current=0x880069b0c980, pid 1) on processor 1 Oops: (null)
due to oops @ 0x812f7c35
CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.8.0-08682-g41844e3 #246
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: 880069b0c980 task.stack: c9c54000
RIP: 0010:[] [] sg_init_one+0x65/0x90
RSP: :c9c57d00 EFLAGS: 00010246
RAX: RBX: c9c57d20 RCX: 00082000
RDX: 0d68 RSI: 00041c57 RDI: c9c57d40
RBP: 0820 R08: R09: c9c57d20
R10: 00019228 R11: 88017fff8500 R12: 0010
R13: 0010 R14: R15:
FS: () GS:88017fc0() knlGS:
CS: 0010 DS: ES: CR0: 80050033
CR2: CR3: 01806000 CR4: 001406a0
Stack:
 8800694bf500 0001 c9c57d68 8138286f
 0002 8800694bf500 c9c57d68 8800694bf500
Call Trace:
 [] ? virtio_read+0x9f/0xe0
 [] ? add_early_randomness+0x3e/0xa0
 [] ? set_current_rng+0x3f/0x160
 [] ? hwrng_register+0x186/0x1d0
 [] ? virtrng_scan+0x10/0x20
 [] ? virtio_dev_probe+0x21e/0x2b0
 [] ? really_probe+0x14e/0x250
 [] ? __driver_attach+0x82/0xa0
 [] ? really_probe+0x250/0x250
 [] ? bus_for_each_dev+0x52/0x80
 [] ? bus_add_driver+0x1a3/0x220
 [] ? hwrng_modinit+0xc/0xc
 [] ? driver_register+0x56/0xd0
 [] ? hwrng_modinit+0xc/0xc
 [] ? do_one_initcall+0x32/0x150
 [] ? parse_args+0x170/0x390
 [] ? kernel_init_freeable+0x11e/0x1a3
 [] ? set_debug_rodata+0xc/0xc
 [] ? rest_init+0x70/0x70
 [] ? kernel_init+0x5/0x100
 [] ? ret_from_fork+0x25/0x30
Code: 01 c5 48 89 ee 48 89 e9 48 c1 ed 23 48 8b 04 ed c0 49 9c 81 48 c1 ee 0c 48 c1 e9 1b 48 85 c0 74 0a 0f b6 c9 48 c1 e1 04 48 01 c8 <48> 8b 08 48 89 f0 89 53 08 48 c1 e0 06 44 89 63 0c 48 83 e1 fc

It looks like add_early_randomness() uses a buffer on the stack, which (with
VMAP_STACK) fails to be added to a scatterlist due to use of virt_to_page in
sg_set_buf().
None of this looks obviously wrong to me, but in combination it is not functional. Hence, I don't have a particular fix in mind, and I've CC'ed folks from both the hw_random and scatterlist history.
Re: [PATCH 2/6] x86/boot: Move compressed kernel to end of decompression buffer
On Tue, Aug 16, 2016 at 12:19:42PM -0700, Yinghai Lu wrote:
> On Mon, Aug 15, 2016 at 9:01 PM, Matt Mullins wrote:
> > This appears to have a negative effect on booting the Intel Edison
> > platform, as it uses u-boot as its bootloader. u-boot does not copy the
> > init_size parameter when booting a bzImage: it copies a fixed-size
> > setup_header [1], and its definition of setup_header doesn't include
> > the parameters beyond setup_data [2].
> >
> > With a zero value for init_size, this calculates a %rsp value of
> > 0x101ff9600. This causes the boot process to hard-stop at the
> > immediately-following pushq, as this platform has no usable physical
> > addresses above 4G.
> >
> > What are the options for getting this type of platform to function
> > again? For now, kexec from a working Linux system does seem to be a
> > work-around, but there appears to be other x86 hardware using u-boot:
> > the chromium.org folks seem to be maintaining the u-boot x86 tree.
>
> Then should fix the u-boot about header_size assumption.

I was hoping to avoid that, since the Edison's u-boot is a 10,000-line patch
atop the upstream -- I don't trust myself to build and flash one quite yet.

If this turned out to affect Chromebooks, I'd spend more effort pushing for a
kernel fix, but it seems that ChromeOS has a different kernel load procedure
and doesn't use "zboot".

For now, I'll probably just keep a local patch that hard-codes a value large
enough to decompress and launch the kernel. I may turn that local patch into
something gated by a Kconfig eventually, in hopes that users of the other x86
u-boot platforms will see it in a "make oldconfig" run.
Re: [PATCH 2/6] x86/boot: Move compressed kernel to end of decompression buffer
[added Simon Glass to CC in case there's some input from u-boot]

On Thu, Apr 28, 2016 at 05:09:04PM -0700, Kees Cook wrote:
> From: Yinghai Lu
>
> This patch adds BP_init_size (which is the INIT_SIZE as passed in from
> the boot_params) into asm-offsets.c to make it visible to the assembly
> code. Then when moving the ZO, it calculates the starting position of
> the copied ZO (via BP_init_size and the ZO run size) so that the VO__end
> will be at the end of the decompression buffer. To make the position
> calculation safe, the end of ZO is page aligned (and a comment is added
> to the existing VO alignment for good measure).

> diff --git a/arch/x86/boot/compressed/head_64.S b/arch/x86/boot/compressed/head_64.S
> index d43c30ed89ed..09cdc0c3ee7e 100644
> --- a/arch/x86/boot/compressed/head_64.S
> +++ b/arch/x86/boot/compressed/head_64.S
> @@ -338,7 +340,9 @@ preferred_addr:
>  1:
>
>  	/* Target address to relocate to for decompression */
> -	leaq	z_extract_offset(%rbp), %rbx
> +	movl	BP_init_size(%rsi), %ebx
> +	subl	$_end, %ebx
> +	addq	%rbp, %rbx
>
>  	/* Set up the stack */
>  	leaq	boot_stack_end(%rbx), %rsp

This appears to have a negative effect on booting the Intel Edison platform,
as it uses u-boot as its bootloader. u-boot does not copy the init_size
parameter when booting a bzImage: it copies a fixed-size setup_header [1],
and its definition of setup_header doesn't include the parameters beyond
setup_data [2].

With a zero value for init_size, this calculates a %rsp value of 0x101ff9600.
This causes the boot process to hard-stop at the immediately-following pushq,
as this platform has no usable physical addresses above 4G.

What are the options for getting this type of platform to function again? For
now, kexec from a working Linux system does seem to be a work-around, but
there appears to be other x86 hardware using u-boot: the chromium.org folks
seem to be maintaining the u-boot x86 tree.
[1] http://git.denx.de/?p=u-boot.git;a=blob;f=arch/x86/lib/zimage.c;h=1b33c771391f49ffe82864ff1582bdfd07e5e97d;hb=HEAD#l156 [2] http://git.denx.de/?p=u-boot.git;a=blob;f=arch/x86/include/asm/bootparam.h;h=140095117e5a2daef0a097c55f0ed10e08acc781;hb=HEAD#l24
Re: [PATCH] perf tools: get version from uname(2), not /proc
On Wed, Oct 07, 2015 at 11:19:17AM +0300, Adrian Hunter wrote:
> On 07/10/15 11:18, Jiri Olsa wrote:
> > On Tue, Oct 06, 2015 at 03:53:14PM -0700, Matt Mullins wrote:
> >> Tools in kmod (e.g. modprobe) compose the module path from the release
> >> from uname(2). Because we use the UNAME26 personality, we need perf to
> >> find modules located at the same path as the system tools.
> >
> > I guess it's easy to google this up, but could you
> > please state in the changelog what's the difference
> > between the current version and the UNAME26 one?
> >
> > Also state (and check) this change wouldn't affect other
> > parts of the code that use this version (if there's any).

This is the only caller of get_kernel_version.

> Isn't the machine root dir for guests? uname() won't work for them.

Aye, this would break --guestmount. Would it make sense to use uname() iff
root_dir is the empty string, or is that too much special-casing for good
taste?

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
[PATCH] perf tools: get version from uname(2), not /proc
Tools in kmod (e.g. modprobe) compose the module path from the release from
uname(2). Because we use the UNAME26 personality, we need perf to find
modules located at the same path as the system tools.

Signed-off-by: Matt Mullins
Cc: Vinson Lee
---
 tools/perf/util/machine.c | 28 ++--
 1 file changed, 6 insertions(+), 22 deletions(-)

diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 5ef90be2a249..51199bc271e9 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -11,6 +11,7 @@
 #include "vdso.h"
 #include 
 #include 
+#include 
 #include "unwind.h"
 #include "linux/hash.h"
@@ -903,31 +904,14 @@ static void map_groups__fixup_end(struct map_groups *mg)
 		__map_groups__fixup_end(mg, i);
 }
 
-static char *get_kernel_version(const char *root_dir)
+static char *get_kernel_version(void)
 {
-	char version[PATH_MAX];
-	FILE *file;
-	char *name, *tmp;
-	const char *prefix = "Linux version ";
+	struct utsname utsname;
 
-	sprintf(version, "%s/proc/version", root_dir);
-	file = fopen(version, "r");
-	if (!file)
+	if (uname(&utsname))
 		return NULL;
 
-	version[0] = '\0';
-	tmp = fgets(version, sizeof(version), file);
-	fclose(file);
-
-	name = strstr(version, prefix);
-	if (!name)
-		return NULL;
-	name += strlen(prefix);
-	tmp = strchr(name, ' ');
-	if (tmp)
-		*tmp = '\0';
-
-	return strdup(name);
+	return strdup(utsname.release);
 }
 
 static bool is_kmod_dso(struct dso *dso)
@@ -1027,7 +1011,7 @@ static int machine__set_modules_path(struct machine *machine)
 	char *version;
 	char modules_path[PATH_MAX];
 
-	version = get_kernel_version(machine->root_dir);
+	version = get_kernel_version();
 	if (!version)
 		return -1;
-- 
2.1.0
Re: [PATCH v11 21/21] brd: Rename XIP to DAX
On Thu, Sep 25, 2014 at 04:33:38PM -0400, Matthew Wilcox wrote:
> --- a/drivers/block/brd.c
> +++ b/drivers/block/brd.c
> @@ -97,13 +97,13 @@ static struct page *brd_insert_page(struct brd_device *brd, sector_t sector)
>  	 * Must use NOIO because we don't want to recurse back into the
>  	 * block or filesystem layers from page reclaim.
>  	 *
> -	 * Cannot support XIP and highmem, because our ->direct_access
> -	 * routine for XIP must return memory that is always addressable.
> -	 * If XIP was reworked to use pfns and kmap throughout, this
> +	 * Cannot support DAX and highmem, because our ->direct_access
> +	 * routine for DAX must return memory that is always addressable.
> +	 * If DAX was reworked to use pfns and kmap throughout, this
>  	 * restriction might be able to be lifted.
>  	 */
>  	gfp_flags = GFP_NOIO | __GFP_ZERO;
> -#ifndef CONFIG_BLK_DEV_XIP
> +#ifndef CONFIG_BLK_DEV_RAM_DAX
>  	gfp_flags |= __GFP_HIGHMEM;
>  #endif
>  	page = alloc_page(gfp_flags);

We're also developing a user of direct_access, and we ended up with some
questions about the sleeping guarantees of the direct_access API. Since brd
is currently the only (x86) implementation of DAX in Linus's tree, I've been
testing against that.

We noticed that the brd implementation of DAX can call into alloc_page() with
__GFP_WAIT if we call direct_access() on a page that has not yet been
allocated. This is compounded by the fact that brd does not support size >
PAGE_SIZE (and thus I call bdev_direct_access() on each use), though the
limitation makes sense -- I shouldn't expect the brd driver to be able to
allocate a gigabyte of contiguous memory.

The potential sleeping behavior was somewhat surprising to me, as I would
expect the NV-DIMM device implementation to simply offset the pfn at which
the device is located rather than perform a memory allocation. What are the
guaranteed and/or expected contexts from which direct_access() can be safely
called?
While I can easily punt this usage to a work queue (that's what we already do for devices where we need to submit a bio), part of our desire to use direct_access is to avoid additional latency. If it would make more sense for us to test against (for example) the pmem or an mtd-block driver instead, as you've discussed with Mathieu Desnoyers, then I'd be happy to work with those in our environment as well.
Re: [PATCH] kernel/sys.c: Fix UNAME26 for 4.0
I'll raise my hand in agreement with Andi -- this is functionality that we do use.

Tested-by: Matt Mullins
Re: [PATCH] perf hist browser: change print format from lu to llu to fit u64 type
On Tue, Nov 18, 2014 at 05:00:27PM -0600, Tom Huynh wrote:
> nr_events in tools/perf/ui/browsers/hists.c is of type u64, thus it should
> be printed as %llu instead of %lu. The print format %lu causes perf report
> to show 0 event count when running with 32-bit userspace without
> redirection. This patch fixes that problem.
>
> Signed-off-by: Tom Huynh
> ---
>  tools/perf/ui/browsers/hists.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/tools/perf/ui/browsers/hists.c b/tools/perf/ui/browsers/hists.c
> index cfb976b..49b2471 100644
> --- a/tools/perf/ui/browsers/hists.c
> +++ b/tools/perf/ui/browsers/hists.c
> @@ -1254,7 +1254,7 @@ static int hists__browser_title(struct hists *hists,
>
>  	nr_samples = convert_unit(nr_samples, &unit);
>  	printed = scnprintf(bf, size,
> -			    "Samples: %lu%c of event '%s', Event count (approx.): %lu",
> +			    "Samples: %lu%c of event '%s', Event count (approx.): %llu",
>  			    nr_samples, unit, ev_name, nr_events);

If I enable __attribute__((format(printf, ...))) on scnprintf, on a 64-bit
build this gives me:

ui/browsers/hists.c: In function 'hists__browser_title':
ui/browsers/hists.c:1258:7: error: format '%llu' expects argument of type
'long long unsigned int', but argument 7 has type 'u64' [-Werror=format=]
   nr_samples, unit, ev_name, nr_events);
   ^

I would usually suggest using the PRIu64 macro which should(TM) evaluate to
the right string constant for the platform, but that appears to be a C99
thing. I'm not sure if that's kosher in tools/perf, but it does seem to build
for me (on Fedora 20).
Re: [PATCH] mm/balloon_compaction: fix deflation when compaction is disabled
On Wed, Oct 29, 2014 at 02:51:07PM +0400, Konstantin Khlebnikov wrote:
> Fix for commit d6d86c0a7f8ddc5b38cf089222cb1d9540762dc2
> ("mm/balloon_compaction: redesign ballooned pages management").
>
> If CONFIG_BALLOON_COMPACTION=n balloon_page_insert() does not link
> pages with balloon and doesn't set PagePrivate flag, as a result
> balloon_page_dequeue cannot get any pages because it thinks that
> all of them are isolated. Without balloon compaction nobody can
> isolate ballooned pages, it's safe to remove this check.
>
> Signed-off-by: Konstantin Khlebnikov
> Reported-by: Matt Mullins
> Cc: Stable (v3.17)

That seems to do it, thanks!

Tested-by: Matt Mullins
Re: [PATCH v3 1/4] mm/balloon_compaction: redesign ballooned pages management
On Tue, Oct 28, 2014 at 11:56:15PM +0400, Konstantin Khlebnikov wrote: > On Tue, Oct 28, 2014 at 9:47 PM, Matt Mullins wrote: > > On Sat, Sep 27, 2014 at 11:15:16PM +0400, Konstantin Khlebnikov wrote: > >> This patch fixes all of them. > > > > It seems to have rendered virtio_balloon completely ineffective without > > CONFIG_COMPACTION and CONFIG_BALLOON_COMPACTION, even for the case that I'm > > expanding the memory available to my VM. > > What do you mean by ineffective? Ballooning does not change the amount of memory available at all. Bisecting the change got me back to this patch. I'm currently using 3.18-rc2 and running in qemu-kvm-1.6.2-9.fc20.x86_64 via libvirt-daemon-1.1.3.6-1.fc20.x86_64. I'll attach my .config; enabling COMPACTION and BALLOON_COMPACTION makes virtio-ballooning work again. > That it cannot handle fragmentation? I saw that without compaction > ballooning works even better: > it allocates pages without GFP_MOVABLE and buddy allocator returns > much less scattered pages. > > > > > Was this intended? Should Kconfig be updated so that VIRTIO_BALLOON > > depends on > > BALLOON_COMPACTION now? > > Nope, this is independent feature. Good to know, thanks. # # Automatically generated file; DO NOT EDIT. 
# Linux/x86 3.18.0-rc2 Kernel Configuration
#
CONFIG_64BIT=y
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_MMU=y
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
CONFIG_ZONE_DMA32=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_X86_64_SMP=y
CONFIG_X86_HT=y
CONFIG_ARCH_HWEIGHT_CFLAGS="-fcall-saved-rdi -fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11"
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_CROSS_COMPILE=""
# CONFIG_COMPILE_TEST is not set
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
# CONFIG_KERNEL_LZ4 is not set
CONFIG_DEFAULT_HOSTNAME="(none)"
# CONFIG_SWAP is not set
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
# CONFIG_CROSS_MEMORY_ATTACH is not set
# CONFIG_FHANDLE is not set
# CONFIG_USELIB is not set
CONFIG_AUDIT=y
CONFIG_HAVE_ARCH_AUDITSYSCALL=y
# CONFIG_AUDITSYSCALL is not set

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_IRQ_LEGACY_ALLOC_HWIRQ=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_DATA=y
CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y

#
# Timers subsystem
#
CONFIG_HZ_PERIODIC=y
# CONFIG_NO_HZ_IDLE is not set
# CONFIG_NO_HZ_FULL is not set
# CONFIG_NO_HZ is not set
# CONFIG_HIGH_RES_TIMERS is not set

#
# CPU/Task time and stats accounting
#
CONFIG_VIRT_CPU_ACCOUNTING=y
# CONFIG_TICK_CPU_ACCOUNTING is not set
CONFIG_VIRT_CPU_ACCOUNTING_GEN=y
# CONFIG_IRQ_TIME_ACCOUNTING is not set
# CONFIG_BSD_PROCESS_ACCT is not set
# CONFIG_TASKSTATS is not set

#
# RCU Subsystem
#
CONFIG_TREE_RCU=y
# CONFIG_PREEMPT_RCU is not set
# CONFIG_TASKS_RCU is not set
CONFIG_RCU_STALL_COMMON=y
CONFIG_CONTEXT_TRACKING=y
CONFIG_RCU_USER_QS=y
CONFIG_CONTEXT_TRACKING_FORCE=y
CONFIG_RCU_FANOUT=64
CONFIG_RCU_FANOUT_LEAF=16
# CONFIG_RCU_FANOUT_EXACT is not set
# CONFIG_TREE_RCU_TRACE is not set
CONFIG_RCU_NOCB_CPU=y
# CONFIG_RCU_NOCB_CPU_NONE is not set
# CONFIG_RCU_NOCB_CPU_ZERO is not set
CONFIG_RCU_NOCB_CPU_ALL=y
CONFIG_BUILD_BIN2C=y
CONFIG_IKCONFIG=y
CON
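The attachment is easiest to interrogate with grep.  A quick sketch (the heredoc here stands in for the attached file, with the balloon-related symbols set the way the thread describes, i.e. both compaction options off; on a live system you would grep /boot/config-$(uname -r) or a decompressed /proc/config.gz instead):

```shell
# Pull the balloon/compaction symbols out of a saved kernel .config.
# sample.config is a stand-in for the real attachment above.
cat > sample.config <<'EOF'
# CONFIG_COMPACTION is not set
# CONFIG_BALLOON_COMPACTION is not set
CONFIG_VIRTIO_BALLOON=y
EOF
grep -E 'CONFIG_(COMPACTION|BALLOON_COMPACTION|VIRTIO_BALLOON)' sample.config
```

This prints all three lines, which is the configuration the bug report describes: the balloon driver built in, both compaction options disabled.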
Re: [PATCH v3 1/4] mm/balloon_compaction: redesign ballooned pages management
On Sat, Sep 27, 2014 at 11:15:16PM +0400, Konstantin Khlebnikov wrote:
> This patch fixes all of them.

It seems to have rendered virtio_balloon completely ineffective without
CONFIG_COMPACTION and CONFIG_BALLOON_COMPACTION, even for the case that I'm
expanding the memory available to my VM.

Was this intended?  Should Kconfig be updated so that VIRTIO_BALLOON depends
on BALLOON_COMPACTION now?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
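The Kconfig question in that last paragraph would amount to something like the following.  This is only a sketch of the dependency being asked about, not a merged change; Konstantin's follow-up reply in the thread notes that balloon compaction is an independent feature, so nothing like this was applied:

```kconfig
# Hypothetical edit to drivers/virtio/Kconfig illustrating the question
# asked above; never merged.
config VIRTIO_BALLOON
	tristate "Virtio balloon driver"
	depends on VIRTIO
	depends on BALLOON_COMPACTION
```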
Re: [PATCH v2] nfs: fix kernel warning when removing proc entry
On Tue, Aug 19, 2014 at 09:20:38PM -0700, Eric W. Biederman wrote:
> I have applied this patch and will push it to Linus after it has a
> little bit to sit in linux-next.

I also fairly reliably hit this warning with trinity, and with Cong's patch
it seems to be gone.  If it helps get it pushed toward Linus:

Tested-by: Matt Mullins
Re: Problems with hda_intel, Santa Rosa, and suspend
I've been reading up on the Intel HD Audio specifications, and I found that
the problem is most likely that the driver has to resort to PI/O (single_cmd)
mode.  I really haven't had enough time or experience to figure out what is
actually causing this.  This is the first time I've done any kernel
development; thankfully, Intel has good documentation.  I'm going to keep
hacking to see what information I can figure out.

I just tried 2.6.22-rc5, and no behavior has changed.

Matt Mullins

On 6/17/07, Rafael J. Wysocki <[EMAIL PROTECTED]> wrote:
> On Sunday, 17 June 2007 03:36, Matt Mullins wrote:
> > I just received a Dell Latitude D630, with the new Intel Santa Rosa
> > platform. Currently, the only major driver issue I have is sound. It
> > worked fine in Ubuntu Feisty's 2.6.20 kernel, but now I am using Gutsy
> > so I can have graphics drivers. Gutsy's 2.6.22-rc3-based kernel no
> > longer recognized my soundcard, a Sigmatel STAC9205, which uses the
> > hda_intel driver. I have since fixed that problem by compiling the
> > hda_intel driver into the kernel (I used 2.6.22-rc4 vanilla sources),
> > instead of a module, but it is still a bug that it does not work as a
> > module.
> >
> > All the following debug information was obtained from kernel 2.6.22-rc4.
> >
> > However, the bug that is currently affecting me is that sound no
> > longer works after using ACPI suspend-to-RAM or swsusp. I compiled
> > ALSA debugging in, and as soon as I resume, I get tons of messages
> > like:
> > [8.474830] hda-intel: send_cmd timeout: IRS=0x1, val=0xd0970500
> > [8.474954] hda-intel: send_cmd timeout: IRS=0x1, val=0xd0af0009
> > [8.475078] hda-intel: send_cmd timeout: IRS=0x1, val=0xd0a70500
> > [8.475207] hda-intel: send_cmd timeout: IRS=0x1, val=0xd0bf0009
> > (etc)
> >
> > There may be more before that, but that is more than the kernel
> > message buffer can hold, so I can't ever see it. I get similar
> > timeout messages each time a PCM gets set up:
> > [8.972859] hda_codec_setup_stream: NID=0x10, stream=0x5, channel=0, format=0x11
> > [8.972987] hda-intel: send_cmd timeout: IRS=0x1, val=0x1070650
> > [9.006071] hda-intel: send_cmd timeout: IRS=0x1, val=0x1020011
> > [9.006081] hda_codec_setup_stream: NID=0x11, stream=0x5, channel=0, format=0x11
> > [9.006207] hda-intel: send_cmd timeout: IRS=0x1, val=0x1170650
> > [9.016117] hda-intel: send_cmd timeout: IRS=0x1, val=0x1120011
> >
> > I read some of sound/pci/hda/hda_intel.c, specifically the code that
> > output those messages. After octupling my kernel log buffer (to 1MB),
> > I noticed that for the first set of timeouts (immediately upon
> > resume), all the "codec" nibbles are odd, and the last word of the val
> > ("verb" and "para") is always either 0x0500 or 0x0009. For the second
> > set (when something tries to use the card) that last word is 0x0650 or
> > 0x0011.
> >
> > I have yet to go back and re-test compiling snd-hda-intel as a module.
>
> Well, the ALSA people should be notified of that in the first place
> (CCs added).
>
> Greetings,
> Rafael
>
> --
> "Premature optimization is the root of all evil." - Donald Knuth
Problems with hda_intel, Santa Rosa, and suspend
I just received a Dell Latitude D630, with the new Intel Santa Rosa
platform.  Currently, the only major driver issue I have is sound.  It
worked fine in Ubuntu Feisty's 2.6.20 kernel, but now I am using Gutsy so I
can have graphics drivers.  Gutsy's 2.6.22-rc3-based kernel no longer
recognized my soundcard, a Sigmatel STAC9205, which uses the hda_intel
driver.  I have since fixed that problem by compiling the hda_intel driver
into the kernel (I used 2.6.22-rc4 vanilla sources), instead of a module,
but it is still a bug that it does not work as a module.

All the following debug information was obtained from kernel 2.6.22-rc4.

However, the bug that is currently affecting me is that sound no longer
works after using ACPI suspend-to-RAM or swsusp.  I compiled ALSA debugging
in, and as soon as I resume, I get tons of messages like:
[8.474830] hda-intel: send_cmd timeout: IRS=0x1, val=0xd0970500
[8.474954] hda-intel: send_cmd timeout: IRS=0x1, val=0xd0af0009
[8.475078] hda-intel: send_cmd timeout: IRS=0x1, val=0xd0a70500
[8.475207] hda-intel: send_cmd timeout: IRS=0x1, val=0xd0bf0009
(etc)

There may be more before that, but that is more than the kernel message
buffer can hold, so I can't ever see it.  I get similar timeout messages
each time a PCM gets set up:
[8.972859] hda_codec_setup_stream: NID=0x10, stream=0x5, channel=0, format=0x11
[8.972987] hda-intel: send_cmd timeout: IRS=0x1, val=0x1070650
[9.006071] hda-intel: send_cmd timeout: IRS=0x1, val=0x1020011
[9.006081] hda_codec_setup_stream: NID=0x11, stream=0x5, channel=0, format=0x11
[9.006207] hda-intel: send_cmd timeout: IRS=0x1, val=0x1170650
[9.016117] hda-intel: send_cmd timeout: IRS=0x1, val=0x1120011

I read some of sound/pci/hda/hda_intel.c, specifically the code that output
those messages.  After octupling my kernel log buffer (to 1MB), I noticed
that for the first set of timeouts (immediately upon resume), all the
"codec" nibbles are odd, and the last word of the val ("verb" and "para") is
always either 0x0500 or 0x0009.  For the second set (when something tries to
use the card) that last word is 0x0650 or 0x0011.

I have yet to go back and re-test compiling snd-hda-intel as a module.

Thank you in advance,
Matt Mullins
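The val in those send_cmd messages can be pulled apart mechanically.  A small sketch, assuming the HD Audio specification's 12-bit verb layout for codec commands (bits [31:28] codec address, [27:20] node ID, [19:8] verb, [7:0] payload; some verbs instead use a 4-bit form with a 16-bit payload, which this does not handle):

```python
# Decode the "val" field of an hda-intel send_cmd timeout message,
# assuming the 12-bit-verb command layout from the HD Audio spec.
def decode_verb(val: int) -> dict:
    return {
        "codec": (val >> 28) & 0xF,    # codec address on the link
        "nid": (val >> 20) & 0xFF,     # target widget/node ID
        "verb": (val >> 8) & 0xFFF,    # 12-bit verb ID
        "payload": val & 0xFF,         # 8-bit payload
    }

# One of the resume-time values quoted above: verb 0x705 is
# SET_POWER_STATE with payload 0x00 (D0), consistent with the driver
# trying to power codecs back up after resume.
print(decode_verb(0xd0970500))
# -> {'codec': 13, 'nid': 9, 'verb': 1797, 'payload': 0}
```

Decoding the other resume-time value the same way, 0xd0af0009 comes out as verb 0xf00 (GET_PARAMETER) with parameter 0x09, which matches the report's observation that the flood alternates between the 0x0500 and 0x0009 endings.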