Re: [PATCH v3] tracepoint: Do not fail unregistering a probe due to memory allocation

2020-11-23 Thread Matt Mullins
On Wed, Nov 18, 2020 at 09:34:05AM -0500, Steven Rostedt wrote:
> From: "Steven Rostedt (VMware)" 
> 
> The list of tracepoint callbacks is managed by an array that is protected
> by RCU. To update this array, a new array is allocated, the updates are
> copied over to the new array, and then the list of functions for the
> tracepoint is switched over to the new array. After a completion of an RCU
> grace period, the old array is freed.
> 
> This process happens for both adding a callback as well as removing one.
> But on removing a callback, if the new array fails to be allocated, the
> callback is not removed, and may be used after it is freed by the clients
> of the tracepoint.
> 
> There's really no reason to fail if the allocation for a new array fails
> when removing a function. Instead, the function can simply be replaced by a
> stub function that could be cleaned up on the next modification of the
> array. That is, instead of calling the function registered to the
> tracepoint, it would call a stub function in its place.
> 
> Link: https://lore.kernel.org/r/20201115055256.65625-1-mmull...@mmlx.us
> Link: https://lore.kernel.org/r/20201116175107.02db3...@gandalf.local.home
> Link: https://lore.kernel.org/r/20201117211836.54aca...@oasis.local.home
> 
> Cc: Mathieu Desnoyers 
> Cc: Ingo Molnar 
> Cc: Alexei Starovoitov 
> Cc: Daniel Borkmann 
> Cc: Dmitry Vyukov 
> Cc: Martin KaFai Lau 
> Cc: Song Liu 
> Cc: Yonghong Song 
> Cc: Andrii Nakryiko 
> Cc: John Fastabend 
> Cc: KP Singh 
> Cc: netdev 
> Cc: bpf 
> Cc: Kees Cook 
> Cc: Florian Weimer 
> Fixes: 97e1c18e8d17b ("tracing: Kernel Tracepoints")
> Reported-by: syzbot+83aa762ef23b6f0d1...@syzkaller.appspotmail.com
> Reported-by: syzbot+d29e58bb557324e55...@syzkaller.appspotmail.com
> Reported-by: Matt Mullins 
> Signed-off-by: Steven Rostedt (VMware) 

I'm a bit late answering your initial query, but yes, this does indeed fix
the bug I was hunting.  I just watched it survive the reproducer for about
half an hour, whereas unpatched I get an instant "BUG: unable to
handle page fault".

Tested-by: Matt Mullins 

> ---
> Changes since v2:
>- Went back to using a stub function and not touching
>   the fast path.
>- Removed adding __GFP_NOFAIL from the allocation of the removal.
> 
>  kernel/tracepoint.c | 80 -
>  1 file changed, 64 insertions(+), 16 deletions(-)
> 
> diff --git a/kernel/tracepoint.c b/kernel/tracepoint.c
> index 3f659f855074..3e261482296c 100644
> --- a/kernel/tracepoint.c
> +++ b/kernel/tracepoint.c
> @@ -53,6 +53,12 @@ struct tp_probes {
>   struct tracepoint_func probes[];
>  };
>  
> +/* Called in removal of a func but failed to allocate a new tp_funcs */
> +static void tp_stub_func(void)
> +{
> + return;
> +}
> +
>  static inline void *allocate_probes(int count)
>  {
>   struct tp_probes *p  = kmalloc(struct_size(p, probes, count),
> @@ -131,6 +137,7 @@ func_add(struct tracepoint_func **funcs, struct 
> tracepoint_func *tp_func,
>  {
>   struct tracepoint_func *old, *new;
>   int nr_probes = 0;
> + int stub_funcs = 0;
>   int pos = -1;
>  
>   if (WARN_ON(!tp_func->func))
> @@ -147,14 +154,34 @@ func_add(struct tracepoint_func **funcs, struct 
> tracepoint_func *tp_func,
>   if (old[nr_probes].func == tp_func->func &&
>   old[nr_probes].data == tp_func->data)
>   return ERR_PTR(-EEXIST);
> + if (old[nr_probes].func == tp_stub_func)
> + stub_funcs++;
>   }
>   }
> - /* + 2 : one for new probe, one for NULL func */
> - new = allocate_probes(nr_probes + 2);
> + /* + 2 : one for new probe, one for NULL func - stub functions */
> + new = allocate_probes(nr_probes + 2 - stub_funcs);
>   if (new == NULL)
>   return ERR_PTR(-ENOMEM);
>   if (old) {
> - if (pos < 0) {
> + if (stub_funcs) {
> + /* Need to copy one at a time to remove stubs */
> + int probes = 0;
> +
> + pos = -1;
> + for (nr_probes = 0; old[nr_probes].func; nr_probes++) {
> + if (old[nr_probes].func == tp_stub_func)
> + continue;
> + if (pos < 0 && old[nr_probes].prio < prio)
> + pos = probes++;
> + new[probes++] = old[nr_probes];
> + }
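
For reference, a rough sketch of the removal-side behavior the commit
message describes (my reconstruction for illustration, not the verbatim
hunk, and the local variable names are assumptions): when allocating the
shrunken array fails, the entry being removed is overwritten in place with
tp_stub_func instead of failing the unregister.

	new = allocate_probes(nr_probes - nr_del + 1);
	if (new) {
		/* copy every entry except the ones being removed ... */
	} else {
		/*
		 * Allocation failed: keep the old array, but point the
		 * removed callback at a do-nothing stub so it can no
		 * longer be called after the caller frees its data.
		 */
		for (i = 0; old[i].func; i++) {
			if (old[i].func == tp_func->func &&
			    old[i].data == tp_func->data)
				old[i].func = tp_stub_func;
		}
		*funcs = old;
	}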

Re: [PATCH] bpf: don't fail kmalloc while releasing raw_tp

2020-11-17 Thread Matt Mullins
On Tue, Nov 17, 2020 at 06:05:51PM -0500, Mathieu Desnoyers wrote:
> - On Nov 16, 2020, at 5:10 PM, rostedt rost...@goodmis.org wrote:
> 
> > On Mon, 16 Nov 2020 16:34:41 -0500 (EST)
> > Mathieu Desnoyers  wrote:
> 
> [...]
> 
> >> I think you'll want a WRITE_ONCE(old[i].func, tp_stub_func) here, matched
> >> with a READ_ONCE() in __DO_TRACE. This introduces a new situation where the
> >> func pointer can be updated and loaded concurrently.
> > 
> > I thought about this a little, and the only thing we really should worry
> > about is synchronizing with those that unregister. Because when we make
> > this update, there are now two states. the __DO_TRACE either reads the
> > original func or the stub. And either should be OK to call.
> > 
> > Only the func gets updated and not the data. So what exactly are we worried
> > about here?
> 
> Indeed with a stub function, I don't see any need for READ_ONCE/WRITE_ONCE.

I'm not sure if this is a practical issue, but without WRITE_ONCE, can't
the write be torn?  A racing __traceiter_ could potentially see a
half-modified function pointer, which wouldn't work out too well.

This was actually my gut instinct before I wrote the __GFP_NOFAIL patch
instead -- currently that whole array's memory ordering is provided by
RCU, and I didn't dig deep enough to evaluate getting too clever with
atomic modifications to it.
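
To make the concern concrete, the pairing being discussed would look
roughly like this (a sketch paraphrased from kernel/tracepoint.c and the
__DO_TRACE() iteration, not a proposed patch):

	/* Writer: stubbing out a probe whose removal couldn't allocate */
	WRITE_ONCE(old[i].func, tp_stub_func);

	/* Reader: load each func pointer exactly once before calling it */
	it_func_ptr = rcu_dereference_raw(tp->funcs);
	if (it_func_ptr) {
		do {
			void *it_func = READ_ONCE(it_func_ptr->func);
			void *__data = it_func_ptr->data;

			/*
			 * With a plain load, a read racing against the
			 * WRITE_ONCE() above could in principle be torn
			 * and observe a half-updated pointer.
			 */
			((void (*)(void *))it_func)(__data);
		} while ((++it_func_ptr)->func);
	}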

> 
> However, if we want to compare the function pointer to some other value and
> conditionally do (or skip) the call, I think you'll need the 
> READ_ONCE/WRITE_ONCE
> to make sure the pointer is not re-fetched between comparison and call.
> 
> Thanks,
> 
> Mathieu
> 
> -- 
> Mathieu Desnoyers
> EfficiOS Inc.
> http://www.efficios.com


[PATCH] bpf: don't fail kmalloc while releasing raw_tp

2020-11-14 Thread Matt Mullins
bpf_link_free is always called in process context, including from a
workqueue and from __fput.  Neither of these has any way to propagate an
-ENOMEM to the caller.

Reported-by: syzbot+83aa762ef23b6f0d1...@syzkaller.appspotmail.com
Reported-by: syzbot+d29e58bb557324e55...@syzkaller.appspotmail.com
Signed-off-by: Matt Mullins 
---
I previously referenced a "pretty ugly" patch.  This is not that one,
because I don't think there's any way I can make the caller of
->release() actually handle errors like ENOMEM.

It also looks like most of the other callers of tracepoint_probe_unregister
don't check the error code either (e.g. a quick grep found
blk_unregister_tracepoints).  Should this just be upgraded to __GFP_NOFAIL
across the board instead of passing around a gfp_t?

 include/linux/trace_events.h |  6 --
 include/linux/tracepoint.h   |  7 +--
 kernel/bpf/syscall.c |  2 +-
 kernel/trace/bpf_trace.c |  6 --
 kernel/trace/trace_events.c  |  6 --
 kernel/tracepoint.c  | 20 ++--
 6 files changed, 28 insertions(+), 19 deletions(-)

diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 5c6943354049..166ad7646a98 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -625,7 +625,8 @@ int perf_event_attach_bpf_prog(struct perf_event *event, 
struct bpf_prog *prog);
 void perf_event_detach_bpf_prog(struct perf_event *event);
 int perf_event_query_prog_array(struct perf_event *event, void __user *info);
 int bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog *prog);
-int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf_prog *prog);
+int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf_prog *prog,
+gfp_t flags);
 struct bpf_raw_event_map *bpf_get_raw_tracepoint(const char *name);
 void bpf_put_raw_tracepoint(struct bpf_raw_event_map *btp);
 int bpf_get_perf_event_info(const struct perf_event *event, u32 *prog_id,
@@ -654,7 +655,8 @@ static inline int bpf_probe_register(struct 
bpf_raw_event_map *btp, struct bpf_p
 {
return -EOPNOTSUPP;
 }
-static inline int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct 
bpf_prog *p)
+static inline int bpf_probe_unregister(struct bpf_raw_event_map *btp,
+  struct bpf_prog *p, gfp_t flags)
 {
return -EOPNOTSUPP;
 }
diff --git a/include/linux/tracepoint.h b/include/linux/tracepoint.h
index 598fec9f9dbf..7b02f92f3b8f 100644
--- a/include/linux/tracepoint.h
+++ b/include/linux/tracepoint.h
@@ -12,6 +12,7 @@
  * Heavily inspired from the Linux Kernel Markers.
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -40,7 +41,8 @@ extern int
 tracepoint_probe_register_prio(struct tracepoint *tp, void *probe, void *data,
   int prio);
 extern int
-tracepoint_probe_unregister(struct tracepoint *tp, void *probe, void *data);
+tracepoint_probe_unregister(struct tracepoint *tp, void *probe, void *data,
+   gfp_t flags);
 extern void
 for_each_kernel_tracepoint(void (*fct)(struct tracepoint *tp, void *priv),
void *priv);
@@ -260,7 +262,8 @@ static inline struct tracepoint 
*tracepoint_ptr_deref(tracepoint_ptr_t *p)
unregister_trace_##name(void (*probe)(data_proto), void *data)  \
{   \
return tracepoint_probe_unregister(&__tracepoint_##name,\
-   (void *)probe, data);   \
+   (void *)probe, data,\
+   GFP_KERNEL);\
}   \
static inline void  \
check_trace_callback_type_##name(void (*cb)(data_proto))\
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index b999e7ff2583..f6876681c4ab 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2601,7 +2601,7 @@ static void bpf_raw_tp_link_release(struct bpf_link *link)
struct bpf_raw_tp_link *raw_tp =
container_of(link, struct bpf_raw_tp_link, link);
 
-   bpf_probe_unregister(raw_tp->btp, raw_tp->link.prog);
+   bpf_probe_unregister(raw_tp->btp, raw_tp->link.prog, GFP_KERNEL | 
__GFP_NOFAIL);
bpf_put_raw_tracepoint(raw_tp->btp);
 }
 
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index a8d4f253ed77..a4ea58c7506d 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -1955,9 +1955,11 @@ int bpf_probe_register(struct bpf_raw_event_map *btp, 
struct bpf_prog *prog)
return __bpf_probe_register(btp, prog);
 }
 
-int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf_prog *prog)
+int bpf_probe_unregister(struc

Re: KASAN: vmalloc-out-of-bounds Read in bpf_trace_run3

2020-11-12 Thread Matt Mullins
On Wed, Nov 11, 2020 at 03:57:50PM +0100, Dmitry Vyukov wrote:
> On Mon, Nov 2, 2020 at 12:54 PM syzbot
>  wrote:
> >
> > Hello,
> >
> > syzbot found the following issue on:
> >
> > HEAD commit:080b6f40 bpf: Don't rely on GCC __attribute__((optimize)) ..
> > git tree:   bpf
> > console output: https://syzkaller.appspot.com/x/log.txt?x=1089d37c50
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=58a4ca757d776bfe
> > dashboard link: https://syzkaller.appspot.com/bug?extid=d29e58bb557324e55e5e
> > compiler:   gcc (GCC) 10.1.0-syz 20200507
> > syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=10f4b03250
> > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=1371a47c50
> >
> > The issue was bisected to:
> >
> > commit 9df1c28bb75217b244257152ab7d788bb2a386d0
> > Author: Matt Mullins 
> > Date:   Fri Apr 26 18:49:47 2019 +
> >
> > bpf: add writable context for raw tracepoints
> 
> 
> We have a number of kernel memory corruptions related to bpf_trace_run now:
> https://groups.google.com/g/syzkaller-bugs/search?q=kernel%2Ftrace%2Fbpf_trace.c
> 
> Can raw tracepoints "legally" corrupt kernel memory (a-la /dev/kmem)?
> Or they shouldn't?
> 
> Looking at the description of Matt's commit, it seems that corruptions
> should not be possible (bounded buffer, checked size, etc). Then it
> means it's a real kernel bug?

This bug doesn't seem to be related to the writability of the
tracepoint; it bisected to that commit simply because the reproducer uses
BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE, which EINVALs on kernels that
predate that program type.  The BPF program it loads is pretty much a
no-op.

The problem here is a kmalloc failure injection into
tracepoint_probe_unregister, but the error is ignored -- so the bpf
program is freed even though the tracepoint is never unregistered.

I have a first pass at a patch to pipe through the error code, but it's
pretty ugly.  It's also called from the file_operations ->release(), for
which errors are solidly ignored in __fput(), so I'm not sure what the
best way to handle ENOMEM is...
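
Concretely, the release path in question looks roughly like this
(paraphrased from kernel/bpf/syscall.c, not a verbatim copy):

static void bpf_raw_tp_link_release(struct bpf_link *link)
{
	struct bpf_raw_tp_link *raw_tp =
		container_of(link, struct bpf_raw_tp_link, link);

	/*
	 * Any -ENOMEM from tracepoint_probe_unregister() inside
	 * bpf_probe_unregister() is dropped here, so the BPF program goes
	 * on to be freed while the probe stays registered and can still
	 * be invoked.
	 */
	bpf_probe_unregister(raw_tp->btp, raw_tp->link.prog);
	bpf_put_raw_tracepoint(raw_tp->btp);
}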

> 
> 
> 
> > bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=12b6c4da50
> > final oops: https://syzkaller.appspot.com/x/report.txt?x=11b6c4da50
> > console output: https://syzkaller.appspot.com/x/log.txt?x=16b6c4da50
> >
> > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > Reported-by: syzbot+d29e58bb557324e55...@syzkaller.appspotmail.com
> > Fixes: 9df1c28bb752 ("bpf: add writable context for raw tracepoints")
> >
> > ==
> > BUG: KASAN: vmalloc-out-of-bounds in __bpf_trace_run 
> > kernel/trace/bpf_trace.c:2045 [inline]
> > BUG: KASAN: vmalloc-out-of-bounds in bpf_trace_run3+0x3e0/0x3f0 
> > kernel/trace/bpf_trace.c:2083
> > Read of size 8 at addr c9e6c030 by task kworker/0:3/3754
> >
> > CPU: 0 PID: 3754 Comm: kworker/0:3 Not tainted 5.9.0-syzkaller #0
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> > Google 01/01/2011
> > Workqueue:  0x0 (events)
> > Call Trace:
> >  __dump_stack lib/dump_stack.c:77 [inline]
> >  dump_stack+0x107/0x163 lib/dump_stack.c:118
> >  print_address_description.constprop.0.cold+0x5/0x4c8 mm/kasan/report.c:385
> >  __kasan_report mm/kasan/report.c:545 [inline]
> >  kasan_report.cold+0x1f/0x37 mm/kasan/report.c:562
> >  __bpf_trace_run kernel/trace/bpf_trace.c:2045 [inline]
> >  bpf_trace_run3+0x3e0/0x3f0 kernel/trace/bpf_trace.c:2083
> >  __bpf_trace_sched_switch+0xdc/0x120 include/trace/events/sched.h:138
> >  __traceiter_sched_switch+0x64/0xb0 include/trace/events/sched.h:138
> >  trace_sched_switch include/trace/events/sched.h:138 [inline]
> >  __schedule+0xeb8/0x2130 kernel/sched/core.c:4520
> >  schedule+0xcf/0x270 kernel/sched/core.c:4601
> >  worker_thread+0x14c/0x1120 kernel/workqueue.c:2439
> >  kthread+0x3af/0x4a0 kernel/kthread.c:292
> >  ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:296
> >
> >
> > Memory state around the buggy address:
> >  c9e6bf00: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
> >  c9e6bf80: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
> > >c9e6c000: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
> >  ^
> >  c9e6c080: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8
> >  c9e6c100: f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8 f8

[tip:x86/urgent] x86/entry/32: Pass cr2 to do_async_page_fault()

2019-07-24 Thread tip-bot for Matt Mullins
Commit-ID:  b8f70953c1251d8b16276995816a95639f598e70
Gitweb: https://git.kernel.org/tip/b8f70953c1251d8b16276995816a95639f598e70
Author: Matt Mullins 
AuthorDate: Tue, 23 Jul 2019 21:20:58 -0700
Committer:  Thomas Gleixner 
CommitDate: Wed, 24 Jul 2019 12:17:39 +0200

x86/entry/32: Pass cr2 to do_async_page_fault()

Commit a0d14b8909de ("x86/mm, tracing: Fix CR2 corruption") added the
address parameter to do_async_page_fault(), but does not pass it from the
32-bit entry point.  To plumb it through, factor out
common_exception_read_cr2 in the same fashion as common_exception, and use
it from both page_fault and async_page_fault.

For a 32-bit KVM guest, this fixes:

  Run /sbin/init as init process
  Starting init: /sbin/init exists but couldn't execute it (error -14)

Fixes: a0d14b8909de ("x86/mm, tracing: Fix CR2 corruption")
Signed-off-by: Matt Mullins 
Signed-off-by: Thomas Gleixner 
Acked-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20190724042058.24506-1-mmull...@fb.com

---
 arch/x86/entry/entry_32.S | 13 +
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 2bb986f305ac..4f86928246e7 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1443,8 +1443,12 @@ BUILD_INTERRUPT3(hv_stimer0_callback_vector, 
HYPERV_STIMER0_VECTOR,
 
 ENTRY(page_fault)
ASM_CLAC
-   pushl   $0; /* %gs's slot on the stack */
+   pushl   $do_page_fault
+   jmp common_exception_read_cr2
+END(page_fault)
 
+common_exception_read_cr2:
+   /* the function address is in %gs's slot on the stack */
SAVE_ALL switch_stacks=1 skip_gs=1
 
ENCODE_FRAME_POINTER
@@ -1452,6 +1456,7 @@ ENTRY(page_fault)
 
/* fixup %gs */
GS_TO_REG %ecx
+   movlPT_GS(%esp), %edi
REG_TO_PTGS %ecx
SET_KERNEL_GS %ecx
 
@@ -1463,9 +1468,9 @@ ENTRY(page_fault)
 
TRACE_IRQS_OFF
movl%esp, %eax  # pt_regs pointer
-   calldo_page_fault
+   CALL_NOSPEC %edi
jmp ret_from_exception
-END(page_fault)
+END(common_exception_read_cr2)
 
 common_exception:
/* the function address is in %gs's slot on the stack */
@@ -1595,7 +1600,7 @@ END(general_protection)
 ENTRY(async_page_fault)
ASM_CLAC
pushl   $do_async_page_fault
-   jmp common_exception
+   jmp common_exception_read_cr2
 END(async_page_fault)
 #endif
 


[PATCH] x86/entry/32: pass cr2 to do_async_page_fault

2019-07-23 Thread Matt Mullins
Commit a0d14b8909de ("x86/mm, tracing: Fix CR2 corruption") added the
address parameter to do_async_page_fault, but does not pass it from the
32-bit entry point.  To plumb it through, factor out
common_exception_read_cr2 in the same fashion as common_exception, and
use it from both page_fault and async_page_fault.

For a 32-bit KVM guest, this fixes:

[1.148669][T1] Run /sbin/init as init process
[1.153328][T1] Starting init: /sbin/init exists but couldn't execute it 
(error -14)

Fixes: a0d14b8909de ("x86/mm, tracing: Fix CR2 corruption")
Signed-off-by: Matt Mullins 
---
 arch/x86/entry/entry_32.S | 13 +
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 2bb986f305ac..4f86928246e7 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1443,8 +1443,12 @@ BUILD_INTERRUPT3(hv_stimer0_callback_vector, 
HYPERV_STIMER0_VECTOR,
 
 ENTRY(page_fault)
ASM_CLAC
-   pushl   $0; /* %gs's slot on the stack */
+   pushl   $do_page_fault
+   jmp common_exception_read_cr2
+END(page_fault)
 
+common_exception_read_cr2:
+   /* the function address is in %gs's slot on the stack */
SAVE_ALL switch_stacks=1 skip_gs=1
 
ENCODE_FRAME_POINTER
@@ -1452,6 +1456,7 @@ ENTRY(page_fault)
 
/* fixup %gs */
GS_TO_REG %ecx
+   movlPT_GS(%esp), %edi
REG_TO_PTGS %ecx
SET_KERNEL_GS %ecx
 
@@ -1463,9 +1468,9 @@ ENTRY(page_fault)
 
TRACE_IRQS_OFF
movl%esp, %eax  # pt_regs pointer
-   calldo_page_fault
+   CALL_NOSPEC %edi
jmp ret_from_exception
-END(page_fault)
+END(common_exception_read_cr2)
 
 common_exception:
/* the function address is in %gs's slot on the stack */
@@ -1595,7 +1600,7 @@ END(general_protection)
 ENTRY(async_page_fault)
ASM_CLAC
pushl   $do_async_page_fault
-   jmp common_exception
+   jmp common_exception_read_cr2
 END(async_page_fault)
 #endif
 
-- 
2.17.1



Re: [PATCH] bpf: hide do_bpf_send_signal when unused

2019-06-17 Thread Matt Mullins
On Mon, 2019-06-17 at 19:09 -0400, Steven Rostedt wrote:
> On Mon, 17 Jun 2019 08:26:29 -0700
> Alexei Starovoitov  wrote:
> 
> > On Mon, Jun 17, 2019 at 5:59 AM Arnd Bergmann  wrote:
> > > 
> > > When CONFIG_MODULES is disabled, this function is never called:
> > > 
> > > kernel/trace/bpf_trace.c:581:13: error: 'do_bpf_send_signal' defined but 
> > > not used [-Werror=unused-function]  
> > 
> > hmm. it should work just fine without modules.
> > the bug is somewhere else.
> 
> From what I see, the only use of do_bpf_send_signal is within a
> #ifdef CONFIG_MODULES, which means that you will get a warning about a
> static unused when CONFIG_MODULES is not defined.
> 
> In kernel/trace/bpf_trace.c we have:
> 
> static void do_bpf_send_signal(struct irq_work *entry)
> 
> [..]
> 
> #ifdef CONFIG_MODULES
> 
> [..]
> 
> for_each_possible_cpu(cpu) {
> work = per_cpu_ptr(&send_signal_work, cpu);
> init_irq_work(&work->irq_work, do_bpf_send_signal);  <-- only use of do_bpf_send_signal
> }
> [..]
> #endif /* CONFIG_MODULES */
> 
> The bug (really just a warning) reported is exactly here.

I don't think bpf_send_signal is tied to modules at all;
send_signal_irq_work_init and the corresponding initcall should be
moved outside that #ifdef.
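
Something along these lines (a sketch against kernel/trace/bpf_trace.c as
it stands, not a tested diff):

/* Keep the per-cpu irq_work initialization unconditional... */
static int __init send_signal_irq_work_init(void)
{
	int cpu;
	struct send_signal_irq_work *work;

	for_each_possible_cpu(cpu) {
		work = per_cpu_ptr(&send_signal_work, cpu);
		init_irq_work(&work->irq_work, do_bpf_send_signal);
	}
	return 0;
}

subsys_initcall(send_signal_irq_work_init);

#ifdef CONFIG_MODULES
/* ...and leave only the module-notifier pieces under the ifdef. */
#endif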

> 
> -- Steve


Re: [PATCH bpf v2] bpf: fix nested bpf tracepoints with per-cpu data

2019-06-14 Thread Matt Mullins
On Fri, 2019-06-14 at 16:50 +0200, Daniel Borkmann wrote:
> On 06/14/2019 02:51 AM, Matt Mullins wrote:
> > On Fri, 2019-06-14 at 00:47 +0200, Daniel Borkmann wrote:
> > > On 06/12/2019 07:00 AM, Andrii Nakryiko wrote:
> > > > On Tue, Jun 11, 2019 at 8:48 PM Matt Mullins  wrote:
> > > > > 
> > > > > BPF_PROG_TYPE_RAW_TRACEPOINTs can be executed nested on the same CPU, 
> > > > > as
> > > > > they do not increment bpf_prog_active while executing.
> > > > > 
> > > > > This enables three levels of nesting, to support
> > > > >   - a kprobe or raw tp or perf event,
> > > > >   - another one of the above that irq context happens to call, and
> > > > >   - another one in nmi context
> > > > > (at most one of which may be a kprobe or perf event).
> > > > > 
> > > > > Fixes: 20b9d7ac4852 ("bpf: avoid excessive stack usage for 
> > > > > perf_sample_data")
> > > 
> > > Generally, looks good to me. Two things below:
> > > 
> > > Nit, for stable, shouldn't fixes tag be c4f6699dfcb8 ("bpf: introduce 
> > > BPF_RAW_TRACEPOINT")
> > > instead of the one you currently have?
> > 
> > Ah, yeah, that's probably more reasonable; I haven't managed to come up
> > with a scenario where one could hit this without raw tracepoints.  I'll
> > fix up the nits that've accumulated since v2.
> > 
> > > One more question / clarification: we have __bpf_trace_run() vs 
> > > trace_call_bpf().
> > > 
> > > Only raw tracepoints can be nested since the rest has the bpf_prog_active 
> > > per-CPU
> > > counter via trace_call_bpf() and would bail out otherwise, iiuc. And raw 
> > > ones use
> > > the __bpf_trace_run() added in c4f6699dfcb8 ("bpf: introduce 
> > > BPF_RAW_TRACEPOINT").
> > > 
> > > 1) I tried to recall and find a rationale for mentioned trace_call_bpf() 
> > > split in
> > > the c4f6699dfcb8 log, but couldn't find any. Is the raison d'être purely 
> > > because of
> > > performance overhead (and desire to not miss events as a result of 
> > > nesting)? (This
> > > also means we're not protected by bpf_prog_active in all the map ops, of 
> > > course.)
> > > 2) Wouldn't this also mean that we only need to fix the raw tp programs 
> > > via
> > > get_bpf_raw_tp_regs() / put_bpf_raw_tp_regs() and won't need this 
> > > duplication for
> > > the rest which relies upon trace_call_bpf()? I'm probably missing 
> > > something, but
> > > given they have separate pt_regs there, how could they be affected then?
> > 
> > For the pt_regs, you're correct: I only used get/put_raw_tp_regs for
> > the _raw_tp variants.  However, consider the following nesting:
> > 
> > trace_nest_level raw_tp_nest_level
> >   (kprobe) bpf_perf_event_output1   0
> >   (raw_tp) bpf_perf_event_output_raw_tp 2   1
> >   (raw_tp) bpf_get_stackid_raw_tp   2   2
> > 
> > I need to increment a nest level (and ideally increment it only once)
> > between the kprobe and the first raw_tp, because they would otherwise
> > share the struct perf_sample_data.  But I also need to increment a nest
> 
> I'm not sure I follow on this one: the former would still keep using the
> bpf_trace_sd as-is today since only ever /one/ can be active on a given CPU
> as we otherwise bail out in trace_call_bpf() due to bpf_prog_active counter.
> Given these two are /not/ shared, you only need the code you have below for
> nesting to deal with the raw_tps via get_bpf_raw_tp_regs() / 
> put_bpf_raw_tp_regs()
> which should also simplify the code quite a bit.

bpf_perf_event_output_raw_tp calls bpf_perf_event_output, so it
currently shares bpf_trace_sd with kprobes -- it _can_ be nested.

> 
> > level between the two raw_tps, since they share the pt_regs -- I can't
> > use trace_nest_level for everything because it's not used by
> > get_stackid, and I can't use raw_tp_nest_level for everything because
> > it's not incremented by kprobes.
> 
> (See above wrt kprobes.)
> 
> > If raw tracepoints were to bump bpf_prog_active, then I could get away
> > with just using that count in these callsites -- I'm reluctant to do
> > that, though, since it would prevent kprobes from ever running inside a
> > raw_tp.  I'd like to retain the ability to (e.g.)
> >   trace.py -K htab_map_update_elem
> > and get some stack traces from at least within raw tracepoints.
> > 
> > That said, as I wrote up this example, bpf_trace_nest_level seems to be
> > wildly misnamed; I should name those after the structure they're
> > protecting...


Re: [PATCH bpf v2] bpf: fix nested bpf tracepoints with per-cpu data

2019-06-13 Thread Matt Mullins
On Fri, 2019-06-14 at 00:47 +0200, Daniel Borkmann wrote:
> On 06/12/2019 07:00 AM, Andrii Nakryiko wrote:
> > On Tue, Jun 11, 2019 at 8:48 PM Matt Mullins  wrote:
> > > 
> > > BPF_PROG_TYPE_RAW_TRACEPOINTs can be executed nested on the same CPU, as
> > > they do not increment bpf_prog_active while executing.
> > > 
> > > This enables three levels of nesting, to support
> > >   - a kprobe or raw tp or perf event,
> > >   - another one of the above that irq context happens to call, and
> > >   - another one in nmi context
> > > (at most one of which may be a kprobe or perf event).
> > > 
> > > Fixes: 20b9d7ac4852 ("bpf: avoid excessive stack usage for 
> > > perf_sample_data")
> 
> Generally, looks good to me. Two things below:
> 
> Nit, for stable, shouldn't fixes tag be c4f6699dfcb8 ("bpf: introduce 
> BPF_RAW_TRACEPOINT")
> instead of the one you currently have?

Ah, yeah, that's probably more reasonable; I haven't managed to come up
with a scenario where one could hit this without raw tracepoints.  I'll
fix up the nits that've accumulated since v2.

> One more question / clarification: we have __bpf_trace_run() vs 
> trace_call_bpf().
> 
> Only raw tracepoints can be nested since the rest has the bpf_prog_active 
> per-CPU
> counter via trace_call_bpf() and would bail out otherwise, iiuc. And raw ones 
> use
> the __bpf_trace_run() added in c4f6699dfcb8 ("bpf: introduce 
> BPF_RAW_TRACEPOINT").
> 
> 1) I tried to recall and find a rationale for mentioned trace_call_bpf() 
> split in
> the c4f6699dfcb8 log, but couldn't find any. Is the raison d'être purely 
> because of
> performance overhead (and desire to not miss events as a result of nesting)? 
> (This
> also means we're not protected by bpf_prog_active in all the map ops, of 
> course.)
> 2) Wouldn't this also mean that we only need to fix the raw tp programs via
> get_bpf_raw_tp_regs() / put_bpf_raw_tp_regs() and won't need this duplication 
> for
> the rest which relies upon trace_call_bpf()? I'm probably missing something, 
> but
> given they have separate pt_regs there, how could they be affected then?

For the pt_regs, you're correct: I only used get/put_raw_tp_regs for
the _raw_tp variants.  However, consider the following nesting:

trace_nest_level raw_tp_nest_level
  (kprobe) bpf_perf_event_output1   0
  (raw_tp) bpf_perf_event_output_raw_tp 2   1
  (raw_tp) bpf_get_stackid_raw_tp   2   2

I need to increment a nest level (and ideally increment it only once)
between the kprobe and the first raw_tp, because they would otherwise
share the struct perf_sample_data.  But I also need to increment a nest
level between the two raw_tps, since they share the pt_regs -- I can't
use trace_nest_level for everything because it's not used by
get_stackid, and I can't use raw_tp_nest_level for everything because
it's not incremented by kprobes.

If raw tracepoints were to bump bpf_prog_active, then I could get away
with just using that count in these callsites -- I'm reluctant to do
that, though, since it would prevent kprobes from ever running inside a
raw_tp.  I'd like to retain the ability to (e.g.)
  trace.py -K htab_map_update_elem
and get some stack traces from at least within raw tracepoints.

That said, as I wrote up this example, bpf_trace_nest_level seems to be
wildly misnamed; I should name those after the structure they're
protecting...

> Thanks,
> Daniel
> 
> > > Signed-off-by: Matt Mullins 
> > > ---
> > 
> > LGTM, minor nit below.
> > 
> > Acked-by: Andrii Nakryiko 
> > 
> > > v1->v2:
> > >   * reverse-Christmas-tree-ize the declarations in bpf_perf_event_output
> > >   * instantiate err more readably
> > > 
> > > I've done additional testing with the original workload that hit the
> > > irq+raw-tp reentrancy problem, and as far as I can tell, it's still
> > > solved with this solution (as opposed to my earlier per-map-element
> > > version).
> > > 
> > >  kernel/trace/bpf_trace.c | 100 ---
> > >  1 file changed, 84 insertions(+), 16 deletions(-)
> > > 
> > > diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
> > > index f92d6ad5e080..1c9a4745e596 100644
> > > --- a/kernel/trace/bpf_trace.c
> > > +++ b/kernel/trace/bpf_trace.c
> > > @@ -410,8 +410,6 @@ static const struct bpf_func_proto 
> > > bpf_perf_event_read_value_proto = {

[tip:x86/urgent] x86/kgdb: Return 0 from kgdb_arch_set_breakpoint()

2019-06-12 Thread tip-bot for Matt Mullins
Commit-ID:  71ab8323cc357c68985a2d6fc6cfc22b1dbbc1c3
Gitweb: https://git.kernel.org/tip/71ab8323cc357c68985a2d6fc6cfc22b1dbbc1c3
Author: Matt Mullins 
AuthorDate: Fri, 31 May 2019 12:47:54 -0700
Committer:  Borislav Petkov 
CommitDate: Wed, 12 Jun 2019 18:52:44 +0200

x86/kgdb: Return 0 from kgdb_arch_set_breakpoint()

err must be nonzero in order to reach text_poke(), which caused kgdb to
fail to set breakpoints:

  (gdb) break __x64_sys_sync
  Breakpoint 1 at 0x81288910: file ../fs/sync.c, line 124.
  (gdb) c
  Continuing.
  Warning:
  Cannot insert breakpoint 1.
  Cannot access memory at address 0x81288910

  Command aborted.

Fixes: 86a22057127d ("x86/kgdb: Avoid redundant comparison of patched code")
Signed-off-by: Matt Mullins 
Signed-off-by: Borislav Petkov 
Reviewed-by: Nadav Amit 
Cc: Andy Lutomirski 
Cc: Christophe Leroy 
Cc: Daniel Thompson 
Cc: Douglas Anderson 
Cc: "Gustavo A. R. Silva" 
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: "Peter Zijlstra (Intel)" 
Cc: Rick Edgecombe 
Cc: Thomas Gleixner 
Cc: x86-ml 
Link: https://lkml.kernel.org/r/20190531194755.6320-1-mmull...@fb.com
---
 arch/x86/kernel/kgdb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/kgdb.c b/arch/x86/kernel/kgdb.c
index 9a8c1648fc9a..6690c5652aeb 100644
--- a/arch/x86/kernel/kgdb.c
+++ b/arch/x86/kernel/kgdb.c
@@ -758,7 +758,7 @@ int kgdb_arch_set_breakpoint(struct kgdb_bkpt *bpt)
   BREAK_INSTR_SIZE);
bpt->type = BP_POKE_BREAKPOINT;
 
-   return err;
+   return 0;
 }
 
 int kgdb_arch_remove_breakpoint(struct kgdb_bkpt *bpt)


[PATCH bpf v2] bpf: fix nested bpf tracepoints with per-cpu data

2019-06-11 Thread Matt Mullins
BPF_PROG_TYPE_RAW_TRACEPOINTs can be executed nested on the same CPU, as
they do not increment bpf_prog_active while executing.

This enables three levels of nesting, to support
  - a kprobe or raw tp or perf event,
  - another one of the above that irq context happens to call, and
  - another one in nmi context
(at most one of which may be a kprobe or perf event).

Fixes: 20b9d7ac4852 ("bpf: avoid excessive stack usage for perf_sample_data")
Signed-off-by: Matt Mullins 
---
v1->v2:
  * reverse-Christmas-tree-ize the declarations in bpf_perf_event_output
  * instantiate err more readably

I've done additional testing with the original workload that hit the
irq+raw-tp reentrancy problem, and as far as I can tell, it's still
solved with this solution (as opposed to my earlier per-map-element
version).

 kernel/trace/bpf_trace.c | 100 ---
 1 file changed, 84 insertions(+), 16 deletions(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index f92d6ad5e080..1c9a4745e596 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -410,8 +410,6 @@ static const struct bpf_func_proto 
bpf_perf_event_read_value_proto = {
.arg4_type  = ARG_CONST_SIZE,
 };
 
-static DEFINE_PER_CPU(struct perf_sample_data, bpf_trace_sd);
-
 static __always_inline u64
 __bpf_perf_event_output(struct pt_regs *regs, struct bpf_map *map,
u64 flags, struct perf_sample_data *sd)
@@ -442,24 +440,50 @@ __bpf_perf_event_output(struct pt_regs *regs, struct 
bpf_map *map,
return perf_event_output(event, sd, regs);
 }
 
+/*
+ * Support executing tracepoints in normal, irq, and nmi context that each call
+ * bpf_perf_event_output
+ */
+struct bpf_trace_sample_data {
+   struct perf_sample_data sds[3];
+};
+
+static DEFINE_PER_CPU(struct bpf_trace_sample_data, bpf_trace_sds);
+static DEFINE_PER_CPU(int, bpf_trace_nest_level);
 BPF_CALL_5(bpf_perf_event_output, struct pt_regs *, regs, struct bpf_map *, 
map,
   u64, flags, void *, data, u64, size)
 {
-   struct perf_sample_data *sd = this_cpu_ptr(&bpf_trace_sd);
+   struct bpf_trace_sample_data *sds = this_cpu_ptr(&bpf_trace_sds);
+   int nest_level = this_cpu_inc_return(bpf_trace_nest_level);
struct perf_raw_record raw = {
.frag = {
.size = size,
.data = data,
},
};
+   struct perf_sample_data *sd;
+   int err;
 
-   if (unlikely(flags & ~(BPF_F_INDEX_MASK)))
-   return -EINVAL;
+   if (WARN_ON_ONCE(nest_level > ARRAY_SIZE(sds->sds))) {
+   err = -EBUSY;
+   goto out;
+   }
+
+   sd = &sds->sds[nest_level - 1];
+
+   if (unlikely(flags & ~(BPF_F_INDEX_MASK))) {
+   err = -EINVAL;
+   goto out;
+   }
 
perf_sample_data_init(sd, 0, 0);
sd->raw = &raw;
 
-   return __bpf_perf_event_output(regs, map, flags, sd);
+   err = __bpf_perf_event_output(regs, map, flags, sd);
+
+out:
+   this_cpu_dec(bpf_trace_nest_level);
+   return err;
 }
 
 static const struct bpf_func_proto bpf_perf_event_output_proto = {
@@ -822,16 +846,48 @@ pe_prog_func_proto(enum bpf_func_id func_id, const struct 
bpf_prog *prog)
 /*
  * bpf_raw_tp_regs are separate from bpf_pt_regs used from skb/xdp
  * to avoid potential recursive reuse issue when/if tracepoints are added
- * inside bpf_*_event_output, bpf_get_stackid and/or bpf_get_stack
+ * inside bpf_*_event_output, bpf_get_stackid and/or bpf_get_stack.
+ *
+ * Since raw tracepoints run despite bpf_prog_active, support concurrent usage
+ * in normal, irq, and nmi context.
  */
-static DEFINE_PER_CPU(struct pt_regs, bpf_raw_tp_regs);
+struct bpf_raw_tp_regs {
+   struct pt_regs regs[3];
+};
+static DEFINE_PER_CPU(struct bpf_raw_tp_regs, bpf_raw_tp_regs);
+static DEFINE_PER_CPU(int, bpf_raw_tp_nest_level);
+static struct pt_regs *get_bpf_raw_tp_regs(void)
+{
+   struct bpf_raw_tp_regs *tp_regs = this_cpu_ptr(&bpf_raw_tp_regs);
+   int nest_level = this_cpu_inc_return(bpf_raw_tp_nest_level);
+
+   if (WARN_ON_ONCE(nest_level > ARRAY_SIZE(tp_regs->regs))) {
+   this_cpu_dec(bpf_raw_tp_nest_level);
+   return ERR_PTR(-EBUSY);
+   }
+
+   return &tp_regs->regs[nest_level - 1];
+}
+
+static void put_bpf_raw_tp_regs(void)
+{
+   this_cpu_dec(bpf_raw_tp_nest_level);
+}
+
 BPF_CALL_5(bpf_perf_event_output_raw_tp, struct bpf_raw_tracepoint_args *, 
args,
   struct bpf_map *, map, u64, flags, void *, data, u64, size)
 {
-   struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs);
+   struct pt_regs *regs = get_bpf_raw_tp_regs();
+   int ret;
+
+   if (IS_ERR(regs))
+   return PTR_ERR(regs);
 
perf_fetch_caller_regs(regs);
-   return bpf_perf_event_output(regs, m

Re: [PATCH bpf] bpf: fix nested bpf tracepoints with per-cpu data

2019-06-06 Thread Matt Mullins
On Thu, 2019-06-06 at 15:13 -0700, Jakub Kicinski wrote:
> On Thu, 6 Jun 2019 11:54:27 -0700, Matt Mullins wrote:
> > BPF_PROG_TYPE_RAW_TRACEPOINTs can be executed nested on the same CPU, as
> > they do not increment bpf_prog_active while executing.
> > 
> > This enables three levels of nesting, to support
> >   - a kprobe or raw tp or perf event,
> >   - another one of the above that irq context happens to call, and
> >   - another one in nmi context
> > (at most one of which may be a kprobe or perf event).
> > 
> > Fixes: 20b9d7ac4852 ("bpf: avoid excessive stack usage for 
> > perf_sample_data")
> 
> No comment on the code, but you're definitely missing a sign-off.

Oops, I totally am.  I'll give it some more time for opinions to roll
in, and I'll fix that before I resubmit :)

> 
> > ---
> > This is more lines of code, but possibly less intrusive than the
> > per-array-element approach.
> > 
> > I don't necessarily like that I duplicated the nest_level logic in two
> > places, but I don't see a way to unify them:
> >   - kprobes' bpf_perf_event_output doesn't use bpf_raw_tp_regs, and does
> > use the perf_sample_data,
> >   - raw tracepoints' bpf_get_stackid uses bpf_raw_tp_regs, but not
> > the perf_sample_data, and
> >   - raw tracepoints' bpf_perf_event_output uses both...
> 
> 


[PATCH bpf] bpf: fix nested bpf tracepoints with per-cpu data

2019-06-06 Thread Matt Mullins
BPF_PROG_TYPE_RAW_TRACEPOINTs can be executed nested on the same CPU, as
they do not increment bpf_prog_active while executing.

This enables three levels of nesting, to support
  - a kprobe or raw tp or perf event,
  - another one of the above that irq context happens to call, and
  - another one in nmi context
(at most one of which may be a kprobe or perf event).

Fixes: 20b9d7ac4852 ("bpf: avoid excessive stack usage for perf_sample_data")
---
This is more lines of code, but possibly less intrusive than the
per-array-element approach.

I don't necessarily like that I duplicated the nest_level logic in two
places, but I don't see a way to unify them:
  - kprobes' bpf_perf_event_output doesn't use bpf_raw_tp_regs, and does
use the perf_sample_data,
  - raw tracepoints' bpf_get_stackid uses bpf_raw_tp_regs, but not
the perf_sample_data, and
  - raw tracepoints' bpf_perf_event_output uses both...

 kernel/trace/bpf_trace.c | 95 +---
 1 file changed, 80 insertions(+), 15 deletions(-)

diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index f92d6ad5e080..4f5419837ddd 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -410,8 +410,6 @@ static const struct bpf_func_proto 
bpf_perf_event_read_value_proto = {
.arg4_type  = ARG_CONST_SIZE,
 };
 
-static DEFINE_PER_CPU(struct perf_sample_data, bpf_trace_sd);
-
 static __always_inline u64
 __bpf_perf_event_output(struct pt_regs *regs, struct bpf_map *map,
u64 flags, struct perf_sample_data *sd)
@@ -442,24 +440,47 @@ __bpf_perf_event_output(struct pt_regs *regs, struct 
bpf_map *map,
return perf_event_output(event, sd, regs);
 }
 
+/*
+ * Support executing tracepoints in normal, irq, and nmi context that each call
+ * bpf_perf_event_output
+ */
+struct bpf_trace_sample_data {
+   struct perf_sample_data sds[3];
+};
+
+static DEFINE_PER_CPU(struct bpf_trace_sample_data, bpf_trace_sds);
+static DEFINE_PER_CPU(int, bpf_trace_nest_level);
 BPF_CALL_5(bpf_perf_event_output, struct pt_regs *, regs, struct bpf_map *, 
map,
   u64, flags, void *, data, u64, size)
 {
-   struct perf_sample_data *sd = this_cpu_ptr(&bpf_trace_sd);
+   struct bpf_trace_sample_data *sds = this_cpu_ptr(&bpf_trace_sds);
+   struct perf_sample_data *sd;
+   int nest_level = this_cpu_inc_return(bpf_trace_nest_level);
struct perf_raw_record raw = {
.frag = {
.size = size,
.data = data,
},
};
+   int err = -EBUSY;
 
+   if (WARN_ON_ONCE(nest_level > ARRAY_SIZE(sds->sds)))
+   goto out;
+
+   sd = &sds->sds[nest_level - 1];
+
+   err = -EINVAL;
if (unlikely(flags & ~(BPF_F_INDEX_MASK)))
-   return -EINVAL;
+   goto out;
 
perf_sample_data_init(sd, 0, 0);
sd->raw = &raw;
 
-   return __bpf_perf_event_output(regs, map, flags, sd);
+   err = __bpf_perf_event_output(regs, map, flags, sd);
+
+out:
+   this_cpu_dec(bpf_trace_nest_level);
+   return err;
 }
 
 static const struct bpf_func_proto bpf_perf_event_output_proto = {
@@ -822,16 +843,48 @@ pe_prog_func_proto(enum bpf_func_id func_id, const struct 
bpf_prog *prog)
 /*
  * bpf_raw_tp_regs are separate from bpf_pt_regs used from skb/xdp
  * to avoid potential recursive reuse issue when/if tracepoints are added
- * inside bpf_*_event_output, bpf_get_stackid and/or bpf_get_stack
+ * inside bpf_*_event_output, bpf_get_stackid and/or bpf_get_stack.
+ *
+ * Since raw tracepoints run despite bpf_prog_active, support concurrent usage
+ * in normal, irq, and nmi context.
  */
-static DEFINE_PER_CPU(struct pt_regs, bpf_raw_tp_regs);
+struct bpf_raw_tp_regs {
+   struct pt_regs regs[3];
+};
+static DEFINE_PER_CPU(struct bpf_raw_tp_regs, bpf_raw_tp_regs);
+static DEFINE_PER_CPU(int, bpf_raw_tp_nest_level);
+static struct pt_regs *get_bpf_raw_tp_regs(void)
+{
+   struct bpf_raw_tp_regs *tp_regs = this_cpu_ptr(&bpf_raw_tp_regs);
+   int nest_level = this_cpu_inc_return(bpf_raw_tp_nest_level);
+
+   if (WARN_ON_ONCE(nest_level > ARRAY_SIZE(tp_regs->regs))) {
+   this_cpu_dec(bpf_raw_tp_nest_level);
+   return ERR_PTR(-EBUSY);
+   }
+
+   return &tp_regs->regs[nest_level - 1];
+}
+
+static void put_bpf_raw_tp_regs(void)
+{
+   this_cpu_dec(bpf_raw_tp_nest_level);
+}
+
 BPF_CALL_5(bpf_perf_event_output_raw_tp, struct bpf_raw_tracepoint_args *, 
args,
   struct bpf_map *, map, u64, flags, void *, data, u64, size)
 {
-   struct pt_regs *regs = this_cpu_ptr(&bpf_raw_tp_regs);
+   struct pt_regs *regs = get_bpf_raw_tp_regs();
+   int ret;
+
+   if (IS_ERR(regs))
+   return PTR_ERR(regs);
 
perf_fetch_caller_regs(regs);
-   return bpf_perf_event_output(regs, map, flags, data, size);
+   ret = bpf_perf_event_output(regs, map, flags, d

Re: [PATCH bpf v2] bpf: preallocate a perf_sample_data per event fd

2019-06-03 Thread Matt Mullins
On Tue, 2019-06-04 at 02:43 +0200, Daniel Borkmann wrote:
> On 06/04/2019 01:54 AM, Alexei Starovoitov wrote:
> > On Mon, Jun 3, 2019 at 4:48 PM Daniel Borkmann  wrote:
> > > On 06/04/2019 01:27 AM, Alexei Starovoitov wrote:
> > > > On Mon, Jun 3, 2019 at 3:59 PM Matt Mullins  wrote:
> > > > > 
> > > > > If these are invariably non-nested, I can easily keep bpf_misc_sd when
> > > > > I resubmit.  There was no technical reason other than keeping the two
> > > > > codepaths as similar as possible.
> > > > > 
> > > > > What resource gives you worry about doing this for the networking
> > > > > codepath?
> > > > 
> > > > my preference would be to keep tracing and networking the same.
> > > > there is already minimal nesting in networking and probably we see
> > > > more when reuseport progs will start running from xdp and clsbpf
> > > > 
> > > > > > Aside from that it's also really bad to miss events like this as 
> > > > > > exporting
> > > > > > through rb is critical. Why can't you have a per-CPU counter that 
> > > > > > selects a
> > > > > > sample data context based on nesting level in tracing? (I don't see 
> > > > > > a discussion
> > > > > > of this in your commit message.)
> > > > > 
> > > > > This change would only drop messages if the same perf_event is
> > > > > attempted to be used recursively (i.e. the same CPU on the same
> > > > > PERF_EVENT_ARRAY map, as I haven't observed anything use index !=
> > > > > BPF_F_CURRENT_CPU in testing).
> > > > > 
> > > > > I'll try to accomplish the same with a percpu nesting level and
> > > > > allocating 2 or 3 perf_sample_data per cpu.  I think that'll solve the
> > > > > same problem -- a local patch keeping track of the nesting level is 
> > > > > how
> > > > > I got the above stack trace, too.
> > > > 
> > > > I don't think counter approach works. The amount of nesting is unknown.
> > > > imo the approach taken in this patch is good.
> > > > I don't see any issue when event_outputs will be dropped for valid 
> > > > progs.
> > > > Only when user called the helper incorrectly without BPF_F_CURRENT_CPU.
> > > > But that's an error anyway.
> > > 
> > > My main worry with this xchg() trick is that we'll miss to export crucial
> > > data with the EBUSY bailing out especially given nesting could increase in
> > > future as you state, so users might have a hard time debugging this kind 
> > > of
> > > issue if they share the same perf event map among these programs, and no
> > > option to get to this data otherwise. Supporting nesting up to a certain
> > > level would still be better than a lost event which is also not reported
> > > through the usual way aka perf rb.

Tracing can already be lossy: trace_call_bpf() simply doesn't call the
prog, and silently returns zero instead, if bpf_prog_active != 1.
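
For reference, that guard looks roughly like this (paraphrased from
trace_call_bpf() in kernel/trace/bpf_trace.c; details elided, not a
verbatim copy):

unsigned int trace_call_bpf(struct trace_event_call *call, void *ctx)
{
	unsigned int ret;

	preempt_disable();
	if (unlikely(__this_cpu_inc_return(bpf_prog_active) != 1)) {
		/*
		 * Another BPF program is already running on this CPU, so
		 * don't run this one and don't emit an event; just tell
		 * the caller there was nothing to report.
		 */
		ret = 0;
		goto out;
	}

	ret = BPF_PROG_RUN_ARRAY_CHECK(call->prog_array, ctx, BPF_PROG_RUN);

out:
	__this_cpu_dec(bpf_prog_active);
	preempt_enable();
	return ret;
}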

> > 
> > I simply don't see this 'miss to export data' in all but contrived 
> > conditions.
> > Say two progs share the same perf event array.
> > One prog calls event_output and while rb logic is working
> > another prog needs to start executing and use the same event array
> 
> Correct.
> 
> > slot. Today it's only possible for tracing prog combined with networking,
> > but having two progs use the same event output array is pretty much
> > a user bug. Just like not passing BPF_F_CURRENT_CPU.
> 
> I don't see the user bug part, why should that be a user bug? It's the same
> as if we would say that sharing a BPF hash map between networking programs
> attached to different hooks or networking and tracing would be a user bug
> which it is not. One concrete example would be cilium monitor where we
> currently expose skb trace and drop events as well as debug data through
> the same rb. This should be usable from any type that has perf_event_output
> helper enabled (e.g. XDP and tc/BPF) w/o requiring to walk yet another per
> cpu mmap rb from user space.

Neither of these solutions would affect the behavior of sharing the
perf array between networking programs -- since they're never called in
a nested fashion, you'll never hit the "xchg() returned NULL" case at
all.

That said, I think I can logically limit nesting in tracing to 3 levels

Re: [PATCH bpf v2] bpf: preallocate a perf_sample_data per event fd

2019-06-03 Thread Matt Mullins
On Mon, 2019-06-03 at 15:22 +0200, Daniel Borkmann wrote:
> On 06/03/2019 03:08 PM, Daniel Borkmann wrote:
> > On 06/01/2019 12:37 AM, Matt Mullins wrote:
> > > It is possible that a BPF program can be called while another BPF
> > > program is executing bpf_perf_event_output.  This has been observed with
> > > I/O completion occurring as a result of an interrupt:
> > > 
> > >   bpf_prog_247fd1341cddaea4_trace_req_end+0x8d7/0x1000
> > >   ? trace_call_bpf+0x82/0x100
> > >   ? sch_direct_xmit+0xe2/0x230
> > >   ? blk_mq_end_request+0x1/0x100
> > >   ? blk_mq_end_request+0x5/0x100
> > >   ? kprobe_perf_func+0x19b/0x240
> > >   ? __qdisc_run+0x86/0x520
> > >   ? blk_mq_end_request+0x1/0x100
> > >   ? blk_mq_end_request+0x5/0x100
> > >   ? kprobe_ftrace_handler+0x90/0xf0
> > >   ? ftrace_ops_assist_func+0x6e/0xe0
> > >   ? ip6_input_finish+0xbf/0x460
> > >   ? 0xa01e80bf
> > >   ? nbd_dbg_flags_show+0xc0/0xc0 [nbd]
> > >   ? blkdev_issue_zeroout+0x200/0x200
> > >   ? blk_mq_end_request+0x1/0x100
> > >   ? blk_mq_end_request+0x5/0x100
> > >   ? flush_smp_call_function_queue+0x6c/0xe0
> > >   ? smp_call_function_single_interrupt+0x32/0xc0
> > >   ? call_function_single_interrupt+0xf/0x20
> > >   ? call_function_single_interrupt+0xa/0x20
> > >   ? swiotlb_map_page+0x140/0x140
> > >   ? refcount_sub_and_test+0x1a/0x50
> > >   ? tcp_wfree+0x20/0xf0
> > >   ? skb_release_head_state+0x62/0xc0
> > >   ? skb_release_all+0xe/0x30
> > >   ? napi_consume_skb+0xb5/0x100
> > >   ? mlx5e_poll_tx_cq+0x1df/0x4e0
> > >   ? mlx5e_poll_tx_cq+0x38c/0x4e0
> > >   ? mlx5e_napi_poll+0x58/0xc30
> > >   ? mlx5e_napi_poll+0x232/0xc30
> > >   ? net_rx_action+0x128/0x340
> > >   ? __do_softirq+0xd4/0x2ad
> > >   ? irq_exit+0xa5/0xb0
> > >   ? do_IRQ+0x7d/0xc0
> > >   ? common_interrupt+0xf/0xf
> > >   
> > >   ? __rb_free_aux+0xf0/0xf0
> > >   ? perf_output_sample+0x28/0x7b0
> > >   ? perf_prepare_sample+0x54/0x4a0
> > >   ? perf_event_output+0x43/0x60
> > >   ? bpf_perf_event_output_raw_tp+0x15f/0x180
> > >   ? blk_mq_start_request+0x1/0x120
> > >   ? bpf_prog_411a64a706fc6044_should_trace+0xad4/0x1000
> > >   ? bpf_trace_run3+0x2c/0x80
> > >   ? nbd_send_cmd+0x4c2/0x690 [nbd]
> > > 
> > > This also cannot be alleviated by further splitting the per-cpu
> > > perf_sample_data structs (as in commit 283ca526a9bd ("bpf: fix
> > > corruption on concurrent perf_event_output calls")), as a raw_tp could
> > > be attached to the block:block_rq_complete tracepoint and execute during
> > > another raw_tp.  Instead, keep a pre-allocated perf_sample_data
> > > structure per perf_event_array element and fail a bpf_perf_event_output
> > > if that element is concurrently being used.
> > > 
> > > Fixes: 20b9d7ac4852 ("bpf: avoid excessive stack usage for 
> > > perf_sample_data")
> > > Signed-off-by: Matt Mullins 
> > 
> > You do not elaborate why this is needed for all the networking programs that
> > use this functionality. The bpf_misc_sd should therefore be kept as-is. 
> > There
> > cannot be nested occurrences there (xdp, tc ingress/egress). Please explain 
> > why
> > non-tracing should be affected here...

If these are invariably non-nested, I can easily keep bpf_misc_sd when
I resubmit.  There was no technical reason other than keeping the two
codepaths as similar as possible.

What resource gives you worry about doing this for the networking
codepath?

> Aside from that it's also really bad to miss events like this as exporting
> through rb is critical. Why can't you have a per-CPU counter that selects a
> sample data context based on nesting level in tracing? (I don't see a 
> discussion
> of this in your commit message.)

This change would only drop messages if the same perf_event is
attempted to be used recursively (i.e. the same CPU on the same
PERF_EVENT_ARRAY map, as I haven't observed anything use index !=
BPF_F_CURRENT_CPU in testing).

I'll try to accomplish the same with a percpu nesting level and
allocating 2 or 3 perf_sample_data per cpu.  I think that'll solve the
same problem -- a local patch keeping track of the nesting level is how
I got the above stack trace, too.


[PATCH bpf v2] bpf: preallocate a perf_sample_data per event fd

2019-05-31 Thread Matt Mullins
It is possible that a BPF program can be called while another BPF
program is executing bpf_perf_event_output.  This has been observed with
I/O completion occurring as a result of an interrupt:

bpf_prog_247fd1341cddaea4_trace_req_end+0x8d7/0x1000
? trace_call_bpf+0x82/0x100
? sch_direct_xmit+0xe2/0x230
? blk_mq_end_request+0x1/0x100
? blk_mq_end_request+0x5/0x100
? kprobe_perf_func+0x19b/0x240
? __qdisc_run+0x86/0x520
? blk_mq_end_request+0x1/0x100
? blk_mq_end_request+0x5/0x100
? kprobe_ftrace_handler+0x90/0xf0
? ftrace_ops_assist_func+0x6e/0xe0
? ip6_input_finish+0xbf/0x460
? 0xa01e80bf
? nbd_dbg_flags_show+0xc0/0xc0 [nbd]
? blkdev_issue_zeroout+0x200/0x200
? blk_mq_end_request+0x1/0x100
? blk_mq_end_request+0x5/0x100
? flush_smp_call_function_queue+0x6c/0xe0
? smp_call_function_single_interrupt+0x32/0xc0
? call_function_single_interrupt+0xf/0x20
? call_function_single_interrupt+0xa/0x20
? swiotlb_map_page+0x140/0x140
? refcount_sub_and_test+0x1a/0x50
? tcp_wfree+0x20/0xf0
? skb_release_head_state+0x62/0xc0
? skb_release_all+0xe/0x30
? napi_consume_skb+0xb5/0x100
? mlx5e_poll_tx_cq+0x1df/0x4e0
? mlx5e_poll_tx_cq+0x38c/0x4e0
? mlx5e_napi_poll+0x58/0xc30
? mlx5e_napi_poll+0x232/0xc30
? net_rx_action+0x128/0x340
? __do_softirq+0xd4/0x2ad
? irq_exit+0xa5/0xb0
? do_IRQ+0x7d/0xc0
? common_interrupt+0xf/0xf

? __rb_free_aux+0xf0/0xf0
? perf_output_sample+0x28/0x7b0
? perf_prepare_sample+0x54/0x4a0
? perf_event_output+0x43/0x60
? bpf_perf_event_output_raw_tp+0x15f/0x180
? blk_mq_start_request+0x1/0x120
? bpf_prog_411a64a706fc6044_should_trace+0xad4/0x1000
? bpf_trace_run3+0x2c/0x80
? nbd_send_cmd+0x4c2/0x690 [nbd]

This also cannot be alleviated by further splitting the per-cpu
perf_sample_data structs (as in commit 283ca526a9bd ("bpf: fix
corruption on concurrent perf_event_output calls")), as a raw_tp could
be attached to the block:block_rq_complete tracepoint and execute during
another raw_tp.  Instead, keep a pre-allocated perf_sample_data
structure per perf_event_array element and fail a bpf_perf_event_output
if that element is concurrently being used.

Fixes: 20b9d7ac4852 ("bpf: avoid excessive stack usage for perf_sample_data")
Signed-off-by: Matt Mullins 
---
v1->v2:
keep a pointer to the struct perf_sample_data rather than directly
embedding it in the structure, avoiding the circular include and
removing the need for in_use.  Suggested by Song.

 include/linux/bpf.h  |  1 +
 kernel/bpf/arraymap.c|  3 ++-
 kernel/trace/bpf_trace.c | 29 -
 3 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 4fb3aa2dc975..47fd85cfbbaf 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -472,6 +472,7 @@ struct bpf_event_entry {
struct file *perf_file;
struct file *map_file;
struct rcu_head rcu;
+   struct perf_sample_data *sd;
 };
 
 bool bpf_prog_array_compatible(struct bpf_array *array, const struct bpf_prog 
*fp);
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 584636c9e2eb..c7f5d593e04f 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -654,11 +654,12 @@ static struct bpf_event_entry *bpf_event_entry_gen(struct 
file *perf_file,
 {
struct bpf_event_entry *ee;
 
-   ee = kzalloc(sizeof(*ee), GFP_ATOMIC);
+   ee = kzalloc(sizeof(*ee) + sizeof(struct perf_sample_data), GFP_ATOMIC);
if (ee) {
ee->event = perf_file->private_data;
ee->perf_file = perf_file;
ee->map_file = map_file;
+   ee->sd = (void *)ee + sizeof(*ee);
}
 
return ee;
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index f92d6ad5e080..076f8e987355 100644
--- a/kernel/trace/bpf_trace.c
+++ b/kernel/trace/bpf_trace.c
@@ -410,17 +410,17 @@ static const struct bpf_func_proto 
bpf_perf_event_read_value_proto = {
.arg4_type  = ARG_CONST_SIZE,
 };
 
-static DEFINE_PER_CPU(struct perf_sample_data, bpf_trace_sd);
-
 static __always_inline u64
 __bpf_perf_event_output(struct pt_regs *regs, struct bpf_map *map,
-   u64 flags, struct perf_sample_data *sd)
+   u64 flags, struct perf_raw_record *raw)
 {
struct bpf_array *array = container_of(map, struct bpf_array, map);
unsigned int cpu = smp_processor_id();
u64 index = flags & BPF_F_INDEX_MASK;
struct bpf_event_entry *ee;
struct perf_event *event;
+   struct perf_sample_data *sd;
+   u64 ret

[PATCH] x86/kgdb: return 0 from kgdb_arch_set_breakpoint

2019-05-31 Thread Matt Mullins
err must be nonzero in order to reach text_poke(), which caused kgdb to
fail to set breakpoints:

(gdb) break __x64_sys_sync
Breakpoint 1 at 0x81288910: file ../fs/sync.c, line 124.
(gdb) c
Continuing.
Warning:
Cannot insert breakpoint 1.
Cannot access memory at address 0x81288910

Command aborted.

Fixes: 86a22057127d ("x86/kgdb: Avoid redundant comparison of patched code")
Signed-off-by: Matt Mullins 
---
 arch/x86/kernel/kgdb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/kgdb.c b/arch/x86/kernel/kgdb.c
index 9a8c1648fc9a..6690c5652aeb 100644
--- a/arch/x86/kernel/kgdb.c
+++ b/arch/x86/kernel/kgdb.c
@@ -758,7 +758,7 @@ int kgdb_arch_set_breakpoint(struct kgdb_bkpt *bpt)
   BREAK_INSTR_SIZE);
bpt->type = BP_POKE_BREAKPOINT;
 
-   return err;
+   return 0;
 }
 
 int kgdb_arch_remove_breakpoint(struct kgdb_bkpt *bpt)
-- 
2.17.1



Re: [PATCH bpf] bpf: preallocate a perf_sample_data per event fd

2019-05-30 Thread Matt Mullins
On Thu, 2019-05-30 at 23:28 +, Song Liu wrote:
> > On May 30, 2019, at 3:55 PM, Matt Mullins  wrote:
> > 
> > It is possible that a BPF program can be called while another BPF
> > program is executing bpf_perf_event_output.  This has been observed with
> > I/O completion occurring as a result of an interrupt:
> > 
> > bpf_prog_247fd1341cddaea4_trace_req_end+0x8d7/0x1000
> > ? trace_call_bpf+0x82/0x100
> > ? sch_direct_xmit+0xe2/0x230
> > ? blk_mq_end_request+0x1/0x100
> > ? blk_mq_end_request+0x5/0x100
> > ? kprobe_perf_func+0x19b/0x240
> > ? __qdisc_run+0x86/0x520
> > ? blk_mq_end_request+0x1/0x100
> > ? blk_mq_end_request+0x5/0x100
> > ? kprobe_ftrace_handler+0x90/0xf0
> > ? ftrace_ops_assist_func+0x6e/0xe0
> > ? ip6_input_finish+0xbf/0x460
> > ? 0xa01e80bf
> > ? nbd_dbg_flags_show+0xc0/0xc0 [nbd]
> > ? blkdev_issue_zeroout+0x200/0x200
> > ? blk_mq_end_request+0x1/0x100
> > ? blk_mq_end_request+0x5/0x100
> > ? flush_smp_call_function_queue+0x6c/0xe0
> > ? smp_call_function_single_interrupt+0x32/0xc0
> > ? call_function_single_interrupt+0xf/0x20
> > ? call_function_single_interrupt+0xa/0x20
> > ? swiotlb_map_page+0x140/0x140
> > ? refcount_sub_and_test+0x1a/0x50
> > ? tcp_wfree+0x20/0xf0
> > ? skb_release_head_state+0x62/0xc0
> > ? skb_release_all+0xe/0x30
> > ? napi_consume_skb+0xb5/0x100
> > ? mlx5e_poll_tx_cq+0x1df/0x4e0
> > ? mlx5e_poll_tx_cq+0x38c/0x4e0
> > ? mlx5e_napi_poll+0x58/0xc30
> > ? mlx5e_napi_poll+0x232/0xc30
> > ? net_rx_action+0x128/0x340
> > ? __do_softirq+0xd4/0x2ad
> > ? irq_exit+0xa5/0xb0
> > ? do_IRQ+0x7d/0xc0
> > ? common_interrupt+0xf/0xf
> > 
> > ? __rb_free_aux+0xf0/0xf0
> > ? perf_output_sample+0x28/0x7b0
> > ? perf_prepare_sample+0x54/0x4a0
> > ? perf_event_output+0x43/0x60
> > ? bpf_perf_event_output_raw_tp+0x15f/0x180
> > ? blk_mq_start_request+0x1/0x120
> > ? bpf_prog_411a64a706fc6044_should_trace+0xad4/0x1000
> > ? bpf_trace_run3+0x2c/0x80
> > ? nbd_send_cmd+0x4c2/0x690 [nbd]
> > 
> > This also cannot be alleviated by further splitting the per-cpu
> > perf_sample_data structs (as in commit 283ca526a9bd ("bpf: fix
> > corruption on concurrent perf_event_output calls")), as a raw_tp could
> > be attached to the block:block_rq_complete tracepoint and execute during
> > another raw_tp.  Instead, keep a pre-allocated perf_sample_data
> > structure per perf_event_array element and fail a bpf_perf_event_output
> > if that element is concurrently being used.
> > 
> > Fixes: 20b9d7ac4852 ("bpf: avoid excessive stack usage for 
> > perf_sample_data")
> > Signed-off-by: Matt Mullins 
> > ---
> > It felt a bit overkill, but I had to split bpf_event_entry into its own
> > header file to break an include cycle from perf_event.h -> cgroup.h ->
> > cgroup-defs.h -> bpf-cgroup.h -> bpf.h -> (potentially) perf_event.h.
> > 
> > include/linux/bpf.h   |  7 ---
> > include/linux/bpf_event.h | 20 
> > kernel/bpf/arraymap.c |  2 ++
> > kernel/trace/bpf_trace.c  | 30 +-
> > 4 files changed, 39 insertions(+), 20 deletions(-)
> > create mode 100644 include/linux/bpf_event.h
> > 
> > diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> > index 4fb3aa2dc975..13b253a36402 100644
> > --- a/include/linux/bpf.h
> > +++ b/include/linux/bpf.h
> > @@ -467,13 +467,6 @@ static inline bool bpf_map_flags_access_ok(u32 
> > access_flags)
> >(BPF_F_RDONLY_PROG | BPF_F_WRONLY_PROG);
> > }
> > 
> 
> I think we can avoid the include cycle as:
> 
> +struct perf_sample_data;
> struct bpf_event_entry {
>   struct perf_event *event;
>   struct file *perf_file;
>   struct file *map_file;
>   struct rcu_head rcu;
> + struct perf_sample_data *sd;
> };

Yeah, that totally works.  I was mostly doing this so we had only one
kmalloc allocation, but I'm not too worried about having an extra
object in kmalloc-64 if it simplifies the code a lot.
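
The single-allocation variant being described is what an apparently later
revision does (see the bpf_event_entry_gen() fragment quoted earlier in this
archive): the perf_sample_data is carved out of the same kzalloc and ee->sd
points just past the entry. A minimal sketch of that layout trick, with
illustrative type names rather than the real structures:

#include <linux/slab.h>

/* Illustrative types; the real code uses bpf_event_entry and perf_sample_data. */
struct payload {
        char buf[64];
};

struct entry {
        int a;
        struct payload *sd;     /* points into the same allocation */
};

static struct entry *entry_alloc(void)
{
        struct entry *e = kzalloc(sizeof(*e) + sizeof(struct payload), GFP_ATOMIC);

        if (e)
                e->sd = (void *)e + sizeof(*e); /* payload lives right after the entry */
        return e;
}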

> 
> > -struct bpf_event_entry {
> > -   struct perf_event *event;
> > -   struct file *perf_file;
> > -   struct file *map_file;
> > -   struct rcu_head rcu;
> > -};
> > -
> > bool bpf_prog_array_compatible(struct bpf_array *array, const struct 
> > bpf_pr

[PATCH bpf] bpf: preallocate a perf_sample_data per event fd

2019-05-30 Thread Matt Mullins
It is possible that a BPF program can be called while another BPF
program is executing bpf_perf_event_output.  This has been observed with
I/O completion occurring as a result of an interrupt:

bpf_prog_247fd1341cddaea4_trace_req_end+0x8d7/0x1000
? trace_call_bpf+0x82/0x100
? sch_direct_xmit+0xe2/0x230
? blk_mq_end_request+0x1/0x100
? blk_mq_end_request+0x5/0x100
? kprobe_perf_func+0x19b/0x240
? __qdisc_run+0x86/0x520
? blk_mq_end_request+0x1/0x100
? blk_mq_end_request+0x5/0x100
? kprobe_ftrace_handler+0x90/0xf0
? ftrace_ops_assist_func+0x6e/0xe0
? ip6_input_finish+0xbf/0x460
? 0xa01e80bf
? nbd_dbg_flags_show+0xc0/0xc0 [nbd]
? blkdev_issue_zeroout+0x200/0x200
? blk_mq_end_request+0x1/0x100
? blk_mq_end_request+0x5/0x100
? flush_smp_call_function_queue+0x6c/0xe0
? smp_call_function_single_interrupt+0x32/0xc0
? call_function_single_interrupt+0xf/0x20
? call_function_single_interrupt+0xa/0x20
? swiotlb_map_page+0x140/0x140
? refcount_sub_and_test+0x1a/0x50
? tcp_wfree+0x20/0xf0
? skb_release_head_state+0x62/0xc0
? skb_release_all+0xe/0x30
? napi_consume_skb+0xb5/0x100
? mlx5e_poll_tx_cq+0x1df/0x4e0
? mlx5e_poll_tx_cq+0x38c/0x4e0
? mlx5e_napi_poll+0x58/0xc30
? mlx5e_napi_poll+0x232/0xc30
? net_rx_action+0x128/0x340
? __do_softirq+0xd4/0x2ad
? irq_exit+0xa5/0xb0
? do_IRQ+0x7d/0xc0
? common_interrupt+0xf/0xf

? __rb_free_aux+0xf0/0xf0
? perf_output_sample+0x28/0x7b0
? perf_prepare_sample+0x54/0x4a0
? perf_event_output+0x43/0x60
? bpf_perf_event_output_raw_tp+0x15f/0x180
? blk_mq_start_request+0x1/0x120
? bpf_prog_411a64a706fc6044_should_trace+0xad4/0x1000
? bpf_trace_run3+0x2c/0x80
? nbd_send_cmd+0x4c2/0x690 [nbd]

This also cannot be alleviated by further splitting the per-cpu
perf_sample_data structs (as in commit 283ca526a9bd ("bpf: fix
corruption on concurrent perf_event_output calls")), as a raw_tp could
be attached to the block:block_rq_complete tracepoint and execute during
another raw_tp.  Instead, keep a pre-allocated perf_sample_data
structure per perf_event_array element and fail a bpf_perf_event_output
if that element is concurrently being used.
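
One plausible shape of the claim/release described above, assuming the
atomic_t in_use field added below; the actual kernel/trace/bpf_trace.c hunk is
truncated in this archive, so this is only a sketch with a made-up helper:

static int output_with_preallocated_sd(struct bpf_event_entry *ee)
{
        int err;

        /* Claim the per-element sample data; bail out if a nested
         * bpf_perf_event_output on this element already holds it. */
        if (atomic_cmpxchg(&ee->in_use, 0, 1))
                return -EBUSY;

        err = do_perf_output(&ee->sd);  /* stand-in for the real output path */

        atomic_set(&ee->in_use, 0);
        return err;
}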

Fixes: 20b9d7ac4852 ("bpf: avoid excessive stack usage for perf_sample_data")
Signed-off-by: Matt Mullins 
---
It felt a bit overkill, but I had to split bpf_event_entry into its own
header file to break an include cycle from perf_event.h -> cgroup.h ->
cgroup-defs.h -> bpf-cgroup.h -> bpf.h -> (potentially) perf_event.h.

 include/linux/bpf.h   |  7 ---
 include/linux/bpf_event.h | 20 
 kernel/bpf/arraymap.c |  2 ++
 kernel/trace/bpf_trace.c  | 30 +-
 4 files changed, 39 insertions(+), 20 deletions(-)
 create mode 100644 include/linux/bpf_event.h

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 4fb3aa2dc975..13b253a36402 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -467,13 +467,6 @@ static inline bool bpf_map_flags_access_ok(u32 
access_flags)
   (BPF_F_RDONLY_PROG | BPF_F_WRONLY_PROG);
 }
 
-struct bpf_event_entry {
-   struct perf_event *event;
-   struct file *perf_file;
-   struct file *map_file;
-   struct rcu_head rcu;
-};
-
 bool bpf_prog_array_compatible(struct bpf_array *array, const struct bpf_prog 
*fp);
 int bpf_prog_calc_tag(struct bpf_prog *fp);
 
diff --git a/include/linux/bpf_event.h b/include/linux/bpf_event.h
new file mode 100644
index ..9f415990f921
--- /dev/null
+++ b/include/linux/bpf_event.h
@@ -0,0 +1,20 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _LINUX_BPF_EVENT_H
+#define _LINUX_BPF_EVENT_H
+
+#include 
+#include 
+
+struct file;
+
+struct bpf_event_entry {
+   struct perf_event *event;
+   struct file *perf_file;
+   struct file *map_file;
+   struct rcu_head rcu;
+   struct perf_sample_data sd;
+   atomic_t in_use;
+};
+
+#endif /* _LINUX_BPF_EVENT_H */
diff --git a/kernel/bpf/arraymap.c b/kernel/bpf/arraymap.c
index 584636c9e2eb..08e5e486d563 100644
--- a/kernel/bpf/arraymap.c
+++ b/kernel/bpf/arraymap.c
@@ -11,6 +11,7 @@
  * General Public License for more details.
  */
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -659,6 +660,7 @@ static struct bpf_event_entry *bpf_event_entry_gen(struct 
file *perf_file,
ee->event = perf_file->private_data;
ee->perf_file = perf_file;
ee->map_file = map_file;
+   atomic_set(&ee->in_use, 0);
}
 
return ee;
diff --git a/kernel/trace/bpf_trace.c b/kernel/trace/bpf_trace.c
index f92d6ad5e080..a03e29957698 100644
--- a/kernel/tr

[PATCH bpf-next v3 5/5] selftests: bpf: test writable buffers in raw tps

2019-04-19 Thread Matt Mullins
This tests that:
  * a BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE cannot be attached if it
uses either:
* a variable offset to the tracepoint buffer, or
* an offset beyond the size of the tracepoint buffer
  * a tracer can modify the buffer provided when attached to a writable
tracepoint in bpf_prog_test_run

Signed-off-by: Matt Mullins 
---
 include/trace/events/bpf_test_run.h   | 50 
 net/bpf/test_run.c|  4 +
 .../raw_tp_writable_reject_nbd_invalid.c  | 40 ++
 .../bpf/prog_tests/raw_tp_writable_test_run.c | 80 +++
 .../selftests/bpf/verifier/raw_tp_writable.c  | 34 
 5 files changed, 208 insertions(+)
 create mode 100644 include/trace/events/bpf_test_run.h
 create mode 100644 
tools/testing/selftests/bpf/prog_tests/raw_tp_writable_reject_nbd_invalid.c
 create mode 100644 
tools/testing/selftests/bpf/prog_tests/raw_tp_writable_test_run.c
 create mode 100644 tools/testing/selftests/bpf/verifier/raw_tp_writable.c

diff --git a/include/trace/events/bpf_test_run.h 
b/include/trace/events/bpf_test_run.h
new file mode 100644
index ..abf466839ea4
--- /dev/null
+++ b/include/trace/events/bpf_test_run.h
@@ -0,0 +1,50 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM bpf_test_run
+
+#if !defined(_TRACE_BPF_TEST_RUN_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_BPF_TEST_RUN_H
+
+#include 
+
+DECLARE_EVENT_CLASS(bpf_test_finish,
+
+   TP_PROTO(int *err),
+
+   TP_ARGS(err),
+
+   TP_STRUCT__entry(
+   __field(int, err)
+   ),
+
+   TP_fast_assign(
+   __entry->err = *err;
+   ),
+
+   TP_printk("bpf_test_finish with err=%d", __entry->err)
+);
+
+#ifdef DEFINE_EVENT_WRITABLE
+#undef BPF_TEST_RUN_DEFINE_EVENT
+#define BPF_TEST_RUN_DEFINE_EVENT(template, call, proto, args, size)   \
+   DEFINE_EVENT_WRITABLE(template, call, PARAMS(proto),\
+ PARAMS(args), size)
+#else
+#undef BPF_TEST_RUN_DEFINE_EVENT
+#define BPF_TEST_RUN_DEFINE_EVENT(template, call, proto, args, size)   \
+   DEFINE_EVENT(template, call, PARAMS(proto), PARAMS(args))
+#endif
+
+BPF_TEST_RUN_DEFINE_EVENT(bpf_test_finish, bpf_test_finish,
+
+   TP_PROTO(int *err),
+
+   TP_ARGS(err),
+
+   sizeof(int)
+);
+
+#endif
+
+/* This part must be outside protection */
+#include 
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index fab142b796ef..25e757102595 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -13,6 +13,9 @@
 #include 
 #include 
 
+#define CREATE_TRACE_POINTS
+#include 
+
 static int bpf_test_run(struct bpf_prog *prog, void *ctx, u32 repeat,
u32 *retval, u32 *time)
 {
@@ -100,6 +103,7 @@ static int bpf_test_finish(const union bpf_attr *kattr,
if (err != -ENOSPC)
err = 0;
 out:
+   trace_bpf_test_finish(&err);
return err;
 }
 
diff --git 
a/tools/testing/selftests/bpf/prog_tests/raw_tp_writable_reject_nbd_invalid.c 
b/tools/testing/selftests/bpf/prog_tests/raw_tp_writable_reject_nbd_invalid.c
new file mode 100644
index ..328d5c4b084b
--- /dev/null
+++ 
b/tools/testing/selftests/bpf/prog_tests/raw_tp_writable_reject_nbd_invalid.c
@@ -0,0 +1,40 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include 
+#include 
+
+void test_raw_tp_writable_reject_nbd_invalid(void)
+{
+   __u32 duration = 0;
+   char error[4096];
+   int bpf_fd = -1, tp_fd = -1;
+
+   const struct bpf_insn program[] = {
+   /* r6 is our tp buffer */
+   BPF_LDX_MEM(BPF_DW, BPF_REG_6, BPF_REG_1, 0),
+   BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_6, 128),
+   BPF_EXIT_INSN(),
+   };
+
+   struct bpf_load_program_attr load_attr = {
+   .prog_type = BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE,
+   .license = "GPL v2",
+   .insns = program,
+   .insns_cnt = sizeof(program) / sizeof(struct bpf_insn),
+   .log_level = 2,
+   };
+
+   bpf_fd = bpf_load_program_xattr(&load_attr, error, sizeof(error));
+   if (CHECK(bpf_fd < 0, "bpf_raw_tracepoint_writable loaded",
+ "failed: %d errno %d\n", bpf_fd, errno))
+   return;
+
+   tp_fd = bpf_raw_tracepoint_open("nbd_send_request", bpf_fd);
+   if (CHECK(tp_fd >= 0, "bpf_raw_tracepoint_writable opened",
+ "erroneously succeeded\n"))
+   goto out_bpffd;
+
+   close(tp_fd);
+out_bpffd:
+   close(bpf_fd);
+}
diff --git a/tools/testing/selftests/bpf/prog_tests/raw_tp_writable_test_run.c 
b/tools/testing/selftests/bpf/prog_tests/raw_tp_writable_test_run.c
new file mode 100644
index ..4145925f9cab
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/raw_tp_writable_test_run.c
@@ -0,0 +1,80 @@
+// SPDX-Lice

Re: [PATCH bpf-next v2] bpf: support raw tracepoints in modules

2018-12-13 Thread Matt Mullins
On Thu, 2018-12-13 at 19:22 +, Martin Lau wrote:
> On Wed, Dec 12, 2018 at 04:42:37PM -0800, Matt Mullins wrote:
> > Distributions build drivers as modules, including network and filesystem
> > drivers which export numerous tracepoints.  This enables
> > bpf(BPF_RAW_TRACEPOINT_OPEN) to attach to those tracepoints.
> > 
> > Signed-off-by: Matt Mullins 
> > ---
> > v1->v2:
> >   * avoid taking the mutex in bpf_event_notify when op is neither COMING nor
> > GOING.
> >   * check that kzalloc actually succeeded
> > 
> > I didn't try to check list_empty before taking the mutex since I want to 
> > avoid
> > races between bpf_event_notify and bpf_get_raw_tracepoint.  Additionally,
> > list_for_each_entry_safe is not strictly necessary upon MODULE_STATE_GOING, 
> > but
> > Alexei suggested I use it to protect against fragility if the subsequent 
> > break;
> > eventually disappears.
> > 
> >  include/linux/module.h   |  4 ++
> >  include/linux/trace_events.h |  8 ++-
> >  kernel/bpf/syscall.c | 11 ++--
> >  kernel/module.c  |  5 ++
> >  kernel/trace/bpf_trace.c | 99 +++-
> >  5 files changed, 120 insertions(+), 7 deletions(-)
> > 
> > diff --git a/include/linux/module.h b/include/linux/module.h
> > index fce6b4335e36..5f147dd5e709 100644
> > --- a/include/linux/module.h
> > +++ b/include/linux/module.h
> > @@ -432,6 +432,10 @@ struct module {
> > unsigned int num_tracepoints;
> > tracepoint_ptr_t *tracepoints_ptrs;
> >  #endif
> > +#ifdef CONFIG_BPF_EVENTS
> > +   unsigned int num_bpf_raw_events;
> > +   struct bpf_raw_event_map *bpf_raw_events;
> > +#endif
> >  #ifdef HAVE_JUMP_LABEL
> > struct jump_entry *jump_entries;
> > unsigned int num_jump_entries;
> > diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
> > index 4130a5497d40..8a62731673f7 100644
> > --- a/include/linux/trace_events.h
> > +++ b/include/linux/trace_events.h
> > @@ -471,7 +471,8 @@ void perf_event_detach_bpf_prog(struct perf_event 
> > *event);
> >  int perf_event_query_prog_array(struct perf_event *event, void __user 
> > *info);
> >  int bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog 
> > *prog);
> >  int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf_prog 
> > *prog);
> > -struct bpf_raw_event_map *bpf_find_raw_tracepoint(const char *name);
> > +struct bpf_raw_event_map *bpf_get_raw_tracepoint(const char *name);
> > +void bpf_put_raw_tracepoint(struct bpf_raw_event_map *btp);
> >  int bpf_get_perf_event_info(const struct perf_event *event, u32 *prog_id,
> > u32 *fd_type, const char **buf,
> > u64 *probe_offset, u64 *probe_addr);
> > @@ -502,10 +503,13 @@ static inline int bpf_probe_unregister(struct 
> > bpf_raw_event_map *btp, struct bpf
> >  {
> > return -EOPNOTSUPP;
> >  }
> > -static inline struct bpf_raw_event_map *bpf_find_raw_tracepoint(const char 
> > *name)
> > +static inline struct bpf_raw_event_map *bpf_get_raw_tracepoint(const char 
> > *name)
> >  {
> > return NULL;
> >  }
> > +static inline void bpf_put_raw_tracepoint(struct bpf_raw_event_map *btp)
> > +{
> > +}
> >  static inline int bpf_get_perf_event_info(const struct perf_event *event,
> >   u32 *prog_id, u32 *fd_type,
> >   const char **buf, u64 *probe_offset,
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index 70fb11106fc2..754370e3155e 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -1609,6 +1609,7 @@ static int bpf_raw_tracepoint_release(struct inode 
> > *inode, struct file *filp)
> > bpf_probe_unregister(raw_tp->btp, raw_tp->prog);
> > bpf_prog_put(raw_tp->prog);
> > }
> > +   bpf_put_raw_tracepoint(raw_tp->btp);
> > kfree(raw_tp);
> > return 0;
> >  }
> > @@ -1634,13 +1635,15 @@ static int bpf_raw_tracepoint_open(const union 
> > bpf_attr *attr)
> > return -EFAULT;
> > tp_name[sizeof(tp_name) - 1] = 0;
> >  
> > -   btp = bpf_find_raw_tracepoint(tp_name);
> > +   btp = bpf_get_raw_tracepoint(tp_name);
> > if (!btp)
> > return -ENOENT;
> >  
> > raw_tp = kzalloc(sizeof(*raw_tp), GFP_USER);
> > -   if (!raw_tp)
> > -   return -ENOMEM;
>

[PATCH bpf-next v2] bpf: support raw tracepoints in modules

2018-12-12 Thread Matt Mullins
Distributions build drivers as modules, including network and filesystem
drivers which export numerous tracepoints.  This enables
bpf(BPF_RAW_TRACEPOINT_OPEN) to attach to those tracepoints.

Signed-off-by: Matt Mullins 
---
v1->v2:
  * avoid taking the mutex in bpf_event_notify when op is neither COMING nor
GOING.
  * check that kzalloc actually succeeded

I didn't try to check list_empty before taking the mutex since I want to avoid
races between bpf_event_notify and bpf_get_raw_tracepoint.  Additionally,
list_for_each_entry_safe is not strictly necessary upon MODULE_STATE_GOING, but
Alexei suggested I use it to protect against fragility if the subsequent break;
eventually disappears.
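
For readers unfamiliar with the mechanism: the bpf_trace.c side (the hunk is
truncated further down in this archive) boils down to a module notifier that
records modules carrying raw tracepoints on MODULE_STATE_COMING and drops them
on MODULE_STATE_GOING. A simplified sketch with invented names, not the actual
patch hunk:

#include <linux/list.h>
#include <linux/module.h>
#include <linux/mutex.h>
#include <linux/notifier.h>
#include <linux/slab.h>

struct bpf_module_entry {
        struct list_head list;
        struct module *mod;
};

static LIST_HEAD(bpf_module_list);
static DEFINE_MUTEX(bpf_module_mutex);

static int bpf_module_notify(struct notifier_block *nb, unsigned long op,
                             void *data)
{
        struct module *mod = data;
        struct bpf_module_entry *e, *tmp;

        /* As in the changelog: don't take the mutex for other transitions. */
        if (op != MODULE_STATE_COMING && op != MODULE_STATE_GOING)
                return NOTIFY_OK;

        mutex_lock(&bpf_module_mutex);
        switch (op) {
        case MODULE_STATE_COMING:
                e = kzalloc(sizeof(*e), GFP_KERNEL);
                if (!e)
                        break;          /* the real patch checks and reports this */
                e->mod = mod;
                list_add(&e->list, &bpf_module_list);
                break;
        case MODULE_STATE_GOING:
                list_for_each_entry_safe(e, tmp, &bpf_module_list, list) {
                        if (e->mod == mod) {
                                list_del(&e->list);
                                kfree(e);
                                break;
                        }
                }
                break;
        }
        mutex_unlock(&bpf_module_mutex);
        return NOTIFY_OK;
}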

 include/linux/module.h   |  4 ++
 include/linux/trace_events.h |  8 ++-
 kernel/bpf/syscall.c | 11 ++--
 kernel/module.c  |  5 ++
 kernel/trace/bpf_trace.c | 99 +++-
 5 files changed, 120 insertions(+), 7 deletions(-)

diff --git a/include/linux/module.h b/include/linux/module.h
index fce6b4335e36..5f147dd5e709 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -432,6 +432,10 @@ struct module {
unsigned int num_tracepoints;
tracepoint_ptr_t *tracepoints_ptrs;
 #endif
+#ifdef CONFIG_BPF_EVENTS
+   unsigned int num_bpf_raw_events;
+   struct bpf_raw_event_map *bpf_raw_events;
+#endif
 #ifdef HAVE_JUMP_LABEL
struct jump_entry *jump_entries;
unsigned int num_jump_entries;
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index 4130a5497d40..8a62731673f7 100644
--- a/include/linux/trace_events.h
+++ b/include/linux/trace_events.h
@@ -471,7 +471,8 @@ void perf_event_detach_bpf_prog(struct perf_event *event);
 int perf_event_query_prog_array(struct perf_event *event, void __user *info);
 int bpf_probe_register(struct bpf_raw_event_map *btp, struct bpf_prog *prog);
 int bpf_probe_unregister(struct bpf_raw_event_map *btp, struct bpf_prog *prog);
-struct bpf_raw_event_map *bpf_find_raw_tracepoint(const char *name);
+struct bpf_raw_event_map *bpf_get_raw_tracepoint(const char *name);
+void bpf_put_raw_tracepoint(struct bpf_raw_event_map *btp);
 int bpf_get_perf_event_info(const struct perf_event *event, u32 *prog_id,
u32 *fd_type, const char **buf,
u64 *probe_offset, u64 *probe_addr);
@@ -502,10 +503,13 @@ static inline int bpf_probe_unregister(struct 
bpf_raw_event_map *btp, struct bpf
 {
return -EOPNOTSUPP;
 }
-static inline struct bpf_raw_event_map *bpf_find_raw_tracepoint(const char 
*name)
+static inline struct bpf_raw_event_map *bpf_get_raw_tracepoint(const char 
*name)
 {
return NULL;
 }
+static inline void bpf_put_raw_tracepoint(struct bpf_raw_event_map *btp)
+{
+}
 static inline int bpf_get_perf_event_info(const struct perf_event *event,
  u32 *prog_id, u32 *fd_type,
  const char **buf, u64 *probe_offset,
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 70fb11106fc2..754370e3155e 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1609,6 +1609,7 @@ static int bpf_raw_tracepoint_release(struct inode 
*inode, struct file *filp)
bpf_probe_unregister(raw_tp->btp, raw_tp->prog);
bpf_prog_put(raw_tp->prog);
}
+   bpf_put_raw_tracepoint(raw_tp->btp);
kfree(raw_tp);
return 0;
 }
@@ -1634,13 +1635,15 @@ static int bpf_raw_tracepoint_open(const union bpf_attr 
*attr)
return -EFAULT;
tp_name[sizeof(tp_name) - 1] = 0;
 
-   btp = bpf_find_raw_tracepoint(tp_name);
+   btp = bpf_get_raw_tracepoint(tp_name);
if (!btp)
return -ENOENT;
 
raw_tp = kzalloc(sizeof(*raw_tp), GFP_USER);
-   if (!raw_tp)
-   return -ENOMEM;
+   if (!raw_tp) {
+   err = -ENOMEM;
+   goto out_put_btp;
+   }
raw_tp->btp = btp;
 
prog = bpf_prog_get_type(attr->raw_tracepoint.prog_fd,
@@ -1668,6 +1671,8 @@ static int bpf_raw_tracepoint_open(const union bpf_attr 
*attr)
bpf_prog_put(prog);
 out_free_tp:
kfree(raw_tp);
+out_put_btp:
+   bpf_put_raw_tracepoint(btp);
return err;
 }
 
diff --git a/kernel/module.c b/kernel/module.c
index 49a405891587..06ec68f08387 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -3093,6 +3093,11 @@ static int find_module_sections(struct module *mod, 
struct load_info *info)
 sizeof(*mod->tracepoints_ptrs),
 &mod->num_tracepoints);
 #endif
+#ifdef CONFIG_BPF_EVENTS
+   mod->bpf_raw_events = section_objs(info, "__bpf_raw_tp_map",
+  sizeof(*mod->bpf_raw_events),
+  &mod->num_b

Re: [PATCH net] tls: don't use stack memory in a scatterlist

2018-05-17 Thread Matt Mullins
On Thu, 2018-05-17 at 14:50 -0400, David Miller wrote:
> I'm surprised this problem wasn't discovered sooner.  How exactly did you
> discover it?  Did you actually see it trigger or is this purely from code
> inspection?

Honestly, I'm not sure how it got uncovered, but it was observed at
runtime.  Doron Roberts-Kedes hit a null pointer dereference so we
turned on CONFIG_DEBUG_SG -- then it became a proper
BUG_ON(!virt_addr_valid(buf)); in sg_set_buf.

[PATCH net] tls: don't use stack memory in a scatterlist

2018-05-16 Thread Matt Mullins
scatterlist code expects virt_to_page() to work, which fails with
CONFIG_VMAP_STACK=y.
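
Put differently, the anti-pattern being removed is building a scatterlist
entry from an on-stack buffer: with CONFIG_VMAP_STACK=y the stack lives in
vmalloc space, virt_to_page() on it is meaningless, and CONFIG_DEBUG_SG turns
that into a BUG. A hypothetical illustration of the pattern (not the tls code
itself):

#include <linux/scatterlist.h>

static void sg_from_stack_buffer(struct scatterlist *sg)
{
        char aad[13];   /* on the stack, i.e. possibly vmalloc()ed memory */

        sg_init_table(sg, 1);
        /*
         * sg_set_buf() does virt_to_page(aad), which is only valid for
         * direct-mapped addresses -- not for a vmapped stack.
         */
        sg_set_buf(sg, aad, sizeof(aad));
}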

Fixes: c46234ebb4d1e ("tls: RX path for ktls")
Signed-off-by: Matt Mullins 
---
 include/net/tls.h | 3 +++
 net/tls/tls_sw.c  | 9 -
 2 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/include/net/tls.h b/include/net/tls.h
index b400d0bb7448..f5fb16da3860 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -97,6 +97,9 @@ struct tls_sw_context {
u8 control;
bool decrypted;
 
+   char rx_aad_ciphertext[TLS_AAD_SPACE_SIZE];
+   char rx_aad_plaintext[TLS_AAD_SPACE_SIZE];
+
/* Sending context */
char aad_space[TLS_AAD_SPACE_SIZE];
 
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 71e79597f940..e1c93ce74e0f 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -680,7 +680,6 @@ static int decrypt_skb(struct sock *sk, struct sk_buff *skb,
struct scatterlist *sgin = &sgin_arr[0];
struct strp_msg *rxm = strp_msg(skb);
int ret, nsg = ARRAY_SIZE(sgin_arr);
-   char aad_recv[TLS_AAD_SPACE_SIZE];
struct sk_buff *unused;
 
ret = skb_copy_bits(skb, rxm->offset + TLS_HEADER_SIZE,
@@ -698,13 +697,13 @@ static int decrypt_skb(struct sock *sk, struct sk_buff 
*skb,
}
 
sg_init_table(sgin, nsg);
-   sg_set_buf(&sgin[0], aad_recv, sizeof(aad_recv));
+   sg_set_buf(&sgin[0], ctx->rx_aad_ciphertext, TLS_AAD_SPACE_SIZE);
 
nsg = skb_to_sgvec(skb, &sgin[1],
   rxm->offset + tls_ctx->rx.prepend_size,
   rxm->full_len - tls_ctx->rx.prepend_size);
 
-   tls_make_aad(aad_recv,
+   tls_make_aad(ctx->rx_aad_ciphertext,
 rxm->full_len - tls_ctx->rx.overhead_size,
 tls_ctx->rx.rec_seq,
 tls_ctx->rx.rec_seq_size,
@@ -803,12 +802,12 @@ int tls_sw_recvmsg(struct sock *sk,
if (to_copy <= len && page_count < MAX_SKB_FRAGS &&
likely(!(flags & MSG_PEEK)))  {
struct scatterlist sgin[MAX_SKB_FRAGS + 1];
-   char unused[21];
int pages = 0;
 
zc = true;
sg_init_table(sgin, MAX_SKB_FRAGS + 1);
-   sg_set_buf(&sgin[0], unused, 13);
+   sg_set_buf(&sgin[0], ctx->rx_aad_plaintext,
+  TLS_AAD_SPACE_SIZE);
 
err = zerocopy_from_iter(sk, &msg->msg_iter,
 to_copy, &pages,
-- 
2.14.1



Re: [RFC] KVM: SVM: ignore type when setting segment registers

2017-05-30 Thread Matt Mullins
On Tue, May 30, 2017 at 02:54:21PM +0200, Radim Krčmář wrote:
> 2017-05-29 15:24+0200, Gioh Kim:
> > If so, why type is checked when setting segment registers?
> 
> No idea.  19bca6ab75d8 ("KVM: SVM: Fix cross vendor migration issue with
> unusable bit") also moved the assigment up to initialize it before use
> and I think that is enough.

Was this perhaps intended to instead check for a zero selector, which is also
an unusable segment?


Re: [PATCH resend 4.9] hw_random: Don't use a stack buffer in add_early_randomness()

2017-02-03 Thread Matt Mullins
On Sat, Feb 04, 2017 at 11:47:38AM +0800, Yisheng Xie wrote:
> On 2016/10/18 1:06, Andy Lutomirski wrote:
> > hw_random carefully avoids using a stack buffer except in
> > add_early_randomness().  This causes a crash in virtio_rng if
> > CONFIG_VMAP_STACK=y.
> I try to understand this patch, but I do not know why it will cause
> a crash in virtio_rng with CONFIG_VMAP_STACK=y?
> Could you please give me more info. about it.

My original report was
https://lkml.kernel.org/r/20161016002151.ga18...@hydra.tuxags.com.

The virtio ring APIs use scatterlists to keep track of the buffers, and
scatterlist requires addresses to be in the kernel direct-mapped address range.
This is not the case for vmalloc()ed addresses, such as the original on-stack
"bytes" array when VMAP_STACK=y.


virtio-rng scatterlist null pointer dereference with VMAP_STACK=y

2016-10-15 Thread Matt Mullins
With VMAP_STACK=y and HW_RANDOM_VIRTIO=y, I get the following crash:

[1.470437] BUG: unable to handle kernel NULL pointer dereference at  (null)
[1.473350] IP: [] sg_init_one+0x65/0x90
[1.474658] PGD 0 
[1.475169] Oops:  [#1] SMP

Entering kdb (current=0x880069b0c980, pid 1) on processor 1 Oops: (null)
due to oops @ 0x812f7c35
CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.8.0-08682-g41844e3 #246
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: 880069b0c980 task.stack: c9c54000
RIP: 0010:[]  [] sg_init_one+0x65/0x90
RSP: :c9c57d00  EFLAGS: 00010246
RAX:  RBX: c9c57d20 RCX: 00082000
RDX: 0d68 RSI: 00041c57 RDI: c9c57d40
RBP: 0820 R08:  R09: c9c57d20
R10: 00019228 R11: 88017fff8500 R12: 0010
R13: 0010 R14:  R15: 
FS:  () GS:88017fc0() knlGS:
CS:  0010 DS:  ES:  CR0: 80050033
CR2:  CR3: 01806000 CR4: 001406a0
Stack:
 8800694bf500 0001 c9c57d68 8138286f
 0002   
 8800694bf500 c9c57d68 8800694bf500 
Call Trace:
 [] ? virtio_read+0x9f/0xe0
 [] ? add_early_randomness+0x3e/0xa0
 [] ? set_current_rng+0x3f/0x160
 [] ? hwrng_register+0x186/0x1d0
 [] ? virtrng_scan+0x10/0x20
 [] ? virtio_dev_probe+0x21e/0x2b0
 [] ? really_probe+0x14e/0x250
 [] ? __driver_attach+0x82/0xa0
 [] ? really_probe+0x250/0x250
 [] ? bus_for_each_dev+0x52/0x80
 [] ? bus_add_driver+0x1a3/0x220
 [] ? hwrng_modinit+0xc/0xc
 [] ? driver_register+0x56/0xd0
 [] ? hwrng_modinit+0xc/0xc
 [] ? do_one_initcall+0x32/0x150
 [] ? parse_args+0x170/0x390
 [] ? kernel_init_freeable+0x11e/0x1a3
 [] ? set_debug_rodata+0xc/0xc
 [] ? rest_init+0x70/0x70
 [] ? kernel_init+0x5/0x100
 [] ? ret_from_fork+0x25/0x30
Code: 01 c5 48 89 ee 48 89 e9 48 c1 ed 23 48 8b 04 ed c0 49 9c 81 48 c1 ee 0c
48 c1 e9 1b 48 85 c0 74 0a 0f b6 c9 48 c1 e1 04 48 01 c8 <48> 8b 08 48 89 f0 89
53 08 48 c1 e0 06 44 89 63 0c 48 83 e1 fc 


It looks like add_early_randomness() uses a buffer on the stack, which (with
VMAP_STACK) fails to be added to a scatterlist due to use of virt_to_page in
sg_set_buf().

None of this looks obviously wrong to me, but in combination it is not
functional.  Hence, I don't have a particular fix in mind, and I've CC'ed folks
from both the hw_random and scatterlist history.


Re: [PATCH 2/6] x86/boot: Move compressed kernel to end of decompression buffer

2016-08-16 Thread Matt Mullins
On Tue, Aug 16, 2016 at 12:19:42PM -0700, Yinghai Lu wrote:
> On Mon, Aug 15, 2016 at 9:01 PM, Matt Mullins  wrote:
> >
> > This appears to have a negative effect on booting the Intel Edison 
> > platform, as
> > it uses u-boot as its bootloader.  u-boot does not copy the init_size 
> > parameter
> > when booting a bzImage: it copies a fixed-size setup_header [1], and its
> > definition of setup_header doesn't include the parameters beyond setup_data 
> > [2].
> >
> > With a zero value for init_size, this calculates a %rsp value of 
> > 0x101ff9600.
> > This causes the boot process to hard-stop at the immediately-following 
> > pushq, as
> > this platform has no usable physical addresses above 4G.
> >
> > What are the options for getting this type of platform to function again?  
> > For
> > now, kexec from a working Linux system does seem to be a work-around, but 
> > there
> > appears to be other x86 hardware using u-boot: the chromium.org folks seem 
> > to be
> > maintaining the u-boot x86 tree.
> >
> > [1] 
> > http://git.denx.de/?p=u-boot.git;a=blob;f=arch/x86/lib/zimage.c;h=1b33c771391f49ffe82864ff1582bdfd07e5e97d;hb=HEAD#l156
> > [2] 
> > http://git.denx.de/?p=u-boot.git;a=blob;f=arch/x86/include/asm/bootparam.h;h=140095117e5a2daef0a097c55f0ed10e08acc781;hb=HEAD#l24
> 
> Then should fix the u-boot about header_size assumption.

I was hoping to avoid that, since the Edison's u-boot is a 10,000-line patch atop
upstream -- I don't trust myself to build and flash one quite yet.

If this turned out to affect Chromebooks, I'd spend more effort pushing for
a kernel fix, but it seems that ChromeOS has a different kernel load procedure
and doesn't use "zboot".  For now, I'll probably just keep a local patch that
hard-codes a value large enough to decompress and launch the kernel.

I may turn that local patch into something gated by a Kconfig eventually, in
hopes that users of the other x86 u-boot platforms will see it in a "make
oldconfig" run.


Re: [PATCH 2/6] x86/boot: Move compressed kernel to end of decompression buffer

2016-08-15 Thread Matt Mullins
[added Simon Glass to CC in case there's some input from u-boot]

On Thu, Apr 28, 2016 at 05:09:04PM -0700, Kees Cook wrote:
> From: Yinghai Lu 
> 
> This patch adds BP_init_size (which is the INIT_SIZE as passed in from
> the boot_params) into asm-offsets.c to make it visible to the assembly
> code. Then when moving the ZO, it calculates the starting position of
> the copied ZO (via BP_init_size and the ZO run size) so that the VO__end
> will be at the end of the decompression buffer. To make the position
> calculation safe, the end of ZO is page aligned (and a comment is added
> to the existing VO alignment for good measure).

> diff --git a/arch/x86/boot/compressed/head_64.S 
> b/arch/x86/boot/compressed/head_64.S
> index d43c30ed89ed..09cdc0c3ee7e 100644
> --- a/arch/x86/boot/compressed/head_64.S
> +++ b/arch/x86/boot/compressed/head_64.S
> @@ -338,7 +340,9 @@ preferred_addr:
>  1:
>  
>   /* Target address to relocate to for decompression */
> - leaqz_extract_offset(%rbp), %rbx
> + movlBP_init_size(%rsi), %ebx
> + subl$_end, %ebx
> + addq%rbp, %rbx
>  
>   /* Set up the stack */
>   leaqboot_stack_end(%rbx), %rsp

This appears to have a negative effect on booting the Intel Edison platform, as
it uses u-boot as its bootloader.  u-boot does not copy the init_size parameter
when booting a bzImage: it copies a fixed-size setup_header [1], and its
definition of setup_header doesn't include the parameters beyond setup_data [2].

With a zero value for init_size, this calculates a %rsp value of 0x101ff9600.
This causes the boot process to hard-stop at the immediately-following pushq, as
this platform has no usable physical addresses above 4G.
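
To see how a number like 0x101ff9600 falls out of the quoted assembly: the
32-bit subl wraps when init_size is zero, so the computed relocation target
(and hence the stack) lands roughly 4 GiB above where the kernel was loaded. A
back-of-the-envelope version of the arithmetic, using made-up round numbers
for the load address and _end rather than the Edison's real values:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        uint32_t init_size = 0;          /* never filled in by u-boot's fixed-size setup_header */
        uint32_t end_sym   = 0x900000;   /* made-up value standing in for $_end */
        uint64_t rbp       = 0x2800000;  /* made-up physical load address */

        /* movl BP_init_size(%rsi), %ebx ; subl $_end, %ebx -- wraps in 32 bits */
        uint32_t ebx = init_size - end_sym;
        /* addq %rbp, %rbx */
        uint64_t rbx = rbp + ebx;

        printf("relocation target / stack base: %#llx\n",
               (unsigned long long)rbx);        /* ~4 GiB above the load address */
        return 0;
}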

What are the options for getting this type of platform to function again?  For
now, kexec from a working Linux system does seem to be a work-around, but there
appears to be other x86 hardware using u-boot: the chromium.org folks seem to be
maintaining the u-boot x86 tree.

[1] 
http://git.denx.de/?p=u-boot.git;a=blob;f=arch/x86/lib/zimage.c;h=1b33c771391f49ffe82864ff1582bdfd07e5e97d;hb=HEAD#l156
[2] 
http://git.denx.de/?p=u-boot.git;a=blob;f=arch/x86/include/asm/bootparam.h;h=140095117e5a2daef0a097c55f0ed10e08acc781;hb=HEAD#l24


Re: [PATCH] perf tools: get version from uname(2), not /proc

2015-10-07 Thread Matt Mullins
On Wed, Oct 07, 2015 at 11:19:17AM +0300, Adrian Hunter wrote:
> On 07/10/15 11:18, Jiri Olsa wrote:
> > On Tue, Oct 06, 2015 at 03:53:14PM -0700, Matt Mullins wrote:
> >> Tools in kmod (e.g. modprobe) compose the module path from the release
> >> from uname(2).  Because we use the UNAME26 personality, we need perf to
> >> find modules located at the same path as the system tools.
> > 
> > I guess it's easy to google this up, but could you
> > please state in the changelog what's the difference
> > between the current version and the UNAME26 one?
> > 
> > Also state (and check) this change wouldn't affect other
> > parts of the code that use this version (if there's any).

This is the only caller of get_kernel_version.

> Isn't the machine root dir for guests? uname() won't work for them.

Aye, this would break --guestmount.  Would it make sense to use uname()
iff root_dir is the empty string, or is that too much special-casing for
good taste?


[PATCH] perf tools: get version from uname(2), not /proc

2015-10-06 Thread Matt Mullins
Tools in kmod (e.g. modprobe) compose the module path from the release
from uname(2).  Because we use the UNAME26 personality, we need perf to
find modules located at the same path as the system tools.
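
The path composition in question is roughly the following: kmod tools take the
release string from uname(2) and look under /lib/modules/<release>, so perf
has to consult the same source. A small userspace sketch (illustrative, not
perf's code):

#include <stdio.h>
#include <sys/utsname.h>

int main(void)
{
        struct utsname u;
        char path[512];

        if (uname(&u))
                return 1;

        /* Under the UNAME26 personality this release string differs from what
         * /proc/version reports, which is why parsing /proc/version broke. */
        snprintf(path, sizeof(path), "/lib/modules/%s", u.release);
        printf("%s\n", path);
        return 0;
}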

Signed-off-by: Matt Mullins 
Cc: Vinson Lee 
---
 tools/perf/util/machine.c | 28 ++--
 1 file changed, 6 insertions(+), 22 deletions(-)

diff --git a/tools/perf/util/machine.c b/tools/perf/util/machine.c
index 5ef90be2a249..51199bc271e9 100644
--- a/tools/perf/util/machine.c
+++ b/tools/perf/util/machine.c
@@ -11,6 +11,7 @@
 #include "vdso.h"
 #include 
 #include 
+#include 
 #include "unwind.h"
 #include "linux/hash.h"
 
@@ -903,31 +904,14 @@ static void map_groups__fixup_end(struct map_groups *mg)
__map_groups__fixup_end(mg, i);
 }
 
-static char *get_kernel_version(const char *root_dir)
+static char *get_kernel_version(void)
 {
-   char version[PATH_MAX];
-   FILE *file;
-   char *name, *tmp;
-   const char *prefix = "Linux version ";
+   struct utsname utsname;
 
-   sprintf(version, "%s/proc/version", root_dir);
-   file = fopen(version, "r");
-   if (!file)
+   if (uname(&utsname))
return NULL;
 
-   version[0] = '\0';
-   tmp = fgets(version, sizeof(version), file);
-   fclose(file);
-
-   name = strstr(version, prefix);
-   if (!name)
-   return NULL;
-   name += strlen(prefix);
-   tmp = strchr(name, ' ');
-   if (tmp)
-   *tmp = '\0';
-
-   return strdup(name);
+   return strdup(utsname.release);
 }
 
 static bool is_kmod_dso(struct dso *dso)
@@ -1027,7 +1011,7 @@ static int machine__set_modules_path(struct machine 
*machine)
char *version;
char modules_path[PATH_MAX];
 
-   version = get_kernel_version(machine->root_dir);
+   version = get_kernel_version();
if (!version)
return -1;
 
-- 
2.1.0



Re: [PATCH v11 21/21] brd: Rename XIP to DAX

2015-03-24 Thread Matt Mullins
On Thu, Sep 25, 2014 at 04:33:38PM -0400, Matthew Wilcox wrote:
> --- a/drivers/block/brd.c
> +++ b/drivers/block/brd.c
> @@ -97,13 +97,13 @@ static struct page *brd_insert_page(struct brd_device 
> *brd, sector_t sector)
>* Must use NOIO because we don't want to recurse back into the
>* block or filesystem layers from page reclaim.
>*
> -  * Cannot support XIP and highmem, because our ->direct_access
> -  * routine for XIP must return memory that is always addressable.
> -  * If XIP was reworked to use pfns and kmap throughout, this
> +  * Cannot support DAX and highmem, because our ->direct_access
> +  * routine for DAX must return memory that is always addressable.
> +  * If DAX was reworked to use pfns and kmap throughout, this
>* restriction might be able to be lifted.
>*/
>   gfp_flags = GFP_NOIO | __GFP_ZERO;
> -#ifndef CONFIG_BLK_DEV_XIP
> +#ifndef CONFIG_BLK_DEV_RAM_DAX
>   gfp_flags |= __GFP_HIGHMEM;
>  #endif
>   page = alloc_page(gfp_flags);

We're also developing a user of direct_access, and we ended up with some
questions about the sleeping guarantees of the direct_access API.

Since brd is currently the only (x86) implementation of DAX in Linus's tree,
I've been testing against that.  We noticed that the brd implementation of DAX
can call into alloc_page() with __GFP_WAIT if we call direct_access() on a page
that has not yet been allocated.  This is compounded by the fact that brd does
not support size > PAGE_SIZE (and thus I call bdev_direct_access() on each use),
though the limitation makes sense -- I shouldn't expect the brd driver to be
able to allocate a gigabyte of contiguous memory.

The potential sleeping behavior was somewhat surprising to me, as I would expect
the NV-DIMM device implementation to simply offset the pfn at which the device
is located rather than perform a memory allocation.  What are the guaranteed
and/or expected contexts from which direct_access() can be safely called?
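
The kind of implementation being assumed here is one where ->direct_access
only translates a sector into an offset within an already-mapped region and
never allocates; a sketch of that shape, using an illustrative prototype and
made-up names rather than the exact block_device_operations signature of that
kernel version:

/* Illustrative only; struct, field, and function names are invented. */
struct pmemish_dev {
        void *virt_base;                /* kernel mapping of the whole device */
        unsigned long pfn_base;         /* first page frame of the device */
        size_t size;
};

static long pmemish_direct_access(struct pmemish_dev *dev, sector_t sector,
                                  void **kaddr, unsigned long *pfn)
{
        size_t offset = (size_t)sector << 9;    /* 512-byte sectors */

        if (offset >= dev->size)
                return -ERANGE;

        *kaddr = dev->virt_base + offset;
        *pfn = dev->pfn_base + (offset >> PAGE_SHIFT);
        return dev->size - offset;              /* bytes usable; no allocation, no sleep */
}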

While I can easily punt this usage to a work queue (that's what we already do
for devices where we need to submit a bio), part of our desire to use
direct_access is to avoid additional latency.

If it would make more sense for us to test against (for example) the pmem or an
mtd-block driver instead, as you've discussed with Mathieu Desnoyers, then I'd
be happy to work with those in our environment as well.


Re: [PATCH] kernel/sys.c: Fix UNAME26 for 4.0

2015-02-27 Thread Matt Mullins
I'll raise my hand in agreement with Andi -- this is functionality that
we do use.

Tested-by: Matt Mullins 


Re: [PATCH] perf hist browser: change print format from lu to llu to fit u64 type

2014-11-25 Thread Matt Mullins
On Tue, Nov 18, 2014 at 05:00:27PM -0600, Tom Huynh wrote:
> nr_events in tools/perf/ui/browsers/hists.c is of type u64, thus it should
> be printed as %llu instead of %lu. The print format %lu causes perf report
> to show 0 event count when running with 32-bit userspace without 
> redirection. This patch fixes that problem.
> 
> Signed-off-by: Tom Huynh 
> ---
>  tools/perf/ui/browsers/hists.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/tools/perf/ui/browsers/hists.c b/tools/perf/ui/browsers/hists.c
> index cfb976b..49b2471 100644
> --- a/tools/perf/ui/browsers/hists.c
> +++ b/tools/perf/ui/browsers/hists.c
> @@ -1254,7 +1254,7 @@ static int hists__browser_title(struct hists *hists,
>  
>   nr_samples = convert_unit(nr_samples, &unit);
>   printed = scnprintf(bf, size,
> -"Samples: %lu%c of event '%s', Event count 
> (approx.): %lu",
> +"Samples: %lu%c of event '%s', Event count 
> (approx.): %llu",
>  nr_samples, unit, ev_name, nr_events);

If I enable __attribute__((format(printf, ...))) on scnprintf, on a 64-bit
build this gives me:

ui/browsers/hists.c: In function ‘hists__browser_title’:
ui/browsers/hists.c:1258:7: error: format ‘%llu’ expects argument of type ‘long 
long unsigned int’, but argument 7 has type ‘u64’ [-Werror=format=]
   nr_samples, unit, ev_name, nr_events);
   ^

I would usually suggest using the PRIu64 macro which should(TM) evaluate to the
right string constant for the platform, but that appears to be a C99 thing.
I'm not sure if that's kosher in tools/perf, but it does seem to build for me
(on Fedora 20).
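
For reference, the PRIu64 approach mentioned above looks like this in plain
C99 (a generic example, not the perf code):

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
        uint64_t nr_events = 1234567890123ULL;

        /* PRIu64 expands to the right length modifier for the platform, so
         * this builds warning-free on 32-bit and 64-bit userspace alike. */
        printf("Event count (approx.): %" PRIu64 "\n", nr_events);
        return 0;
}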


Re: [PATCH] mm/balloon_compaction: fix deflation when compaction is disabled

2014-10-29 Thread Matt Mullins
On Wed, Oct 29, 2014 at 02:51:07PM +0400, Konstantin Khlebnikov wrote:
> Fix for commit d6d86c0a7f8ddc5b38cf089222cb1d9540762dc2
> ("mm/balloon_compaction: redesign ballooned pages management").
> 
> If CONFIG_BALLOON_COMPACTION=n balloon_page_insert() does not link
> pages with balloon and doesn't set PagePrivate flag, as a result
> balloon_page_dequeue cannot get any pages because it thinks that
> all of them are isolated. Without balloon compaction nobody can
> isolate ballooned pages, it's safe to remove this check.
> 
> Signed-off-by: Konstantin Khlebnikov 
> Reported-by: Matt Mullins 
> Cc: Stable    (v3.17)

That seems to do it, thanks!

Tested-by: Matt Mullins 


Re: [PATCH v3 1/4] mm/balloon_compaction: redesign ballooned pages management

2014-10-28 Thread Matt Mullins
On Tue, Oct 28, 2014 at 11:56:15PM +0400, Konstantin Khlebnikov wrote:
> On Tue, Oct 28, 2014 at 9:47 PM, Matt Mullins  wrote:
> > On Sat, Sep 27, 2014 at 11:15:16PM +0400, Konstantin Khlebnikov wrote:
> >> This patch fixes all of them.
> >
> > It seems to have rendered virtio_balloon completely ineffective without
> > CONFIG_COMPACTION and CONFIG_BALLOON_COMPACTION, even for the case that I'm
> > expanding the memory available to my VM.
> 
> What do you mean by ineffective?

Ballooning does not change the amount of memory available at all.  Bisecting
the change got me back to this patch.

I'm currently using 3.18-rc2 and running in qemu-kvm-1.6.2-9.fc20.x86_64 via
libvirt-daemon-1.1.3.6-1.fc20.x86_64.

I'll attach my .config; enabling COMPACTION and BALLOON_COMPACTION makes
virtio-ballooning work again.

> That it cannot handle fragmentation? I saw that without compaction
> ballooning works even better:
> it allocates pages without GFP_MOVABLE and buddy allocator returns
> much less scattered pages.
> 
> >
> > Was this intended?  Should Kconfig be updated so that VIRTIO_BALLOON 
> > depends on
> > BALLOON_COMPACTION now?
> 
> Nope, this is independent feature.

Good to know, thanks.
#
# Automatically generated file; DO NOT EDIT.
# Linux/x86 3.18.0-rc2 Kernel Configuration
#
CONFIG_64BIT=y
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_MMU=y
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
CONFIG_ZONE_DMA32=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_X86_64_SMP=y
CONFIG_X86_HT=y
CONFIG_ARCH_HWEIGHT_CFLAGS="-fcall-saved-rdi -fcall-saved-rsi -fcall-saved-rdx 
-fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 
-fcall-saved-r11"
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_CROSS_COMPILE=""
# CONFIG_COMPILE_TEST is not set
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
# CONFIG_KERNEL_LZ4 is not set
CONFIG_DEFAULT_HOSTNAME="(none)"
# CONFIG_SWAP is not set
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
# CONFIG_CROSS_MEMORY_ATTACH is not set
# CONFIG_FHANDLE is not set
# CONFIG_USELIB is not set
CONFIG_AUDIT=y
CONFIG_HAVE_ARCH_AUDITSYSCALL=y
# CONFIG_AUDITSYSCALL is not set

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_IRQ_LEGACY_ALLOC_HWIRQ=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_DATA=y
CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y

#
# Timers subsystem
#
CONFIG_HZ_PERIODIC=y
# CONFIG_NO_HZ_IDLE is not set
# CONFIG_NO_HZ_FULL is not set
# CONFIG_NO_HZ is not set
# CONFIG_HIGH_RES_TIMERS is not set

#
# CPU/Task time and stats accounting
#
CONFIG_VIRT_CPU_ACCOUNTING=y
# CONFIG_TICK_CPU_ACCOUNTING is not set
CONFIG_VIRT_CPU_ACCOUNTING_GEN=y
# CONFIG_IRQ_TIME_ACCOUNTING is not set
# CONFIG_BSD_PROCESS_ACCT is not set
# CONFIG_TASKSTATS is not set

#
# RCU Subsystem
#
CONFIG_TREE_RCU=y
# CONFIG_PREEMPT_RCU is not set
# CONFIG_TASKS_RCU is not set
CONFIG_RCU_STALL_COMMON=y
CONFIG_CONTEXT_TRACKING=y
CONFIG_RCU_USER_QS=y
CONFIG_CONTEXT_TRACKING_FORCE=y
CONFIG_RCU_FANOUT=64
CONFIG_RCU_FANOUT_LEAF=16
# CONFIG_RCU_FANOUT_EXACT is not set
# CONFIG_TREE_RCU_TRACE is not set
CONFIG_RCU_NOCB_CPU=y
# CONFIG_RCU_NOCB_CPU_NONE is not set
# CONFIG_RCU_NOCB_CPU_ZERO is not set
CONFIG_RCU_NOCB_CPU_ALL=y
CONFIG_BUILD_BIN2C=y
CONFIG_IKCONFIG=y
CON

Re: [PATCH v3 1/4] mm/balloon_compaction: redesign ballooned pages management

2014-10-28 Thread Matt Mullins
On Sat, Sep 27, 2014 at 11:15:16PM +0400, Konstantin Khlebnikov wrote:
> This patch fixes all of them.

It seems to have rendered virtio_balloon completely ineffective without
CONFIG_COMPACTION and CONFIG_BALLOON_COMPACTION, even for the case that I'm
expanding the memory available to my VM.

Was this intended?  Should Kconfig be updated so that VIRTIO_BALLOON depends on
BALLOON_COMPACTION now?


Re: [PATCH v2] nfs: fix kernel warning when removing proc entry

2014-09-08 Thread Matt Mullins
On Tue, Aug 19, 2014 at 09:20:38PM -0700, Eric W. Biederman wrote:
> I have applied this patch and will push it to Linus after it has a
> little bit to sit in linux-next.

I also fairly reliably hit this warning with trinity, and with Cong's
patch, it seems to be gone.

If it helps get it pushed toward Linus:
Tested-by: Matt Mullins 


Re: Problems with hda_intel, Santa Rosa, and suspend

2007-06-17 Thread Matt Mullins

I've been reading up on the Intel HD Audio specifications, and I found
that the problem is most likely that the driver has to resort to PI/O
(single_cmd) mode.  I really haven't had enough time or experience to
figure out what is actually causing this.  This is the first time I've
done any kernel development; thankfully, Intel has good documentation.
I'm going to keep hacking to see what information I can figure out.

I just tried 2.6.22-rc5, and no behavior has changed.

Matt Mullins

On 6/17/07, Rafael J. Wysocki <[EMAIL PROTECTED]> wrote:

On Sunday, 17 June 2007 03:36, Matt Mullins wrote:
> I just received a Dell Latitude D630, with the new Intel Santa Rosa
> platform.  Currently, the only major driver issue I have is sound.  It
> worked fine in Ubuntu Feisty's 2.6.20 kernel, but now I am using Gutsy
> so I can have graphics drivers.  Gutsy's 2.6.22-rc3-based kernel no
> longer recognized my soundcard, a Sigmatel STAC9205, which uses the
> hda_intel driver.  I have since fixed that problem by compiling the
> hda_intel driver into the kernel (I used 2.6.22-rc4 vanilla sources),
> instead of a module, but it is still a bug that it does not work as a
> module.
>
> All the following debug information was obtained from kernel 2.6.22-rc4.
>
> However, the bug that is currently affecting me is that sound no
> longer works after using ACPI suspend-to-RAM or swsusp.  I compiled
> ALSA debugging in, and as soon as I resume, I get tons of messages
> like:
> [8.474830] hda-intel: send_cmd timeout: IRS=0x1, val=0xd0970500
> [8.474954] hda-intel: send_cmd timeout: IRS=0x1, val=0xd0af0009
> [8.475078] hda-intel: send_cmd timeout: IRS=0x1, val=0xd0a70500
> [8.475207] hda-intel: send_cmd timeout: IRS=0x1, val=0xd0bf0009
> (etc)
>
> There may be more before that, but that is more than the kernel
> message buffer can hold, so I can't ever see it.  I get similar
> timeout messages each time a PCM gets set up:
> [8.972859] hda_codec_setup_stream: NID=0x10, stream=0x5, channel=0, 
format=0
> x11
> [8.972987] hda-intel: send_cmd timeout: IRS=0x1, val=0x1070650
> [9.006071] hda-intel: send_cmd timeout: IRS=0x1, val=0x1020011
> [9.006081] hda_codec_setup_stream: NID=0x11, stream=0x5, channel=0, 
format=0
> x11
> [9.006207] hda-intel: send_cmd timeout: IRS=0x1, val=0x1170650
> [9.016117] hda-intel: send_cmd timeout: IRS=0x1, val=0x1120011
>
> I read some of sound/pci/hda/hda_intel.c, specifically the code that
> output those messages.  After octupling my kernel log buffer (to 1MB),
> I noticed that for the first set of timeouts (immediately upon
> resume), all the "codec" nibbles are odd, the last word of the val
> ("verb" and "para") are all either 0x0500 or 0x0009.  For the second
> set (when something tries to use the card) that last word is 0x0650 or
> 0x0011.
>
> I have yet to go back and re-test compiling snd-hda-intel as a module.

Well, the ALSA people should be notified of that in the first place (CCs
added).

Greetings,
Rafael


--
"Premature optimization is the root of all evil." - Donald Knuth




Problems with hda_intel, Santa Rosa, and suspend

2007-06-16 Thread Matt Mullins

I just received a Dell Latitude D630, with the new Intel Santa Rosa
platform.  Currently, the only major driver issue I have is sound.  It
worked fine in Ubuntu Feisty's 2.6.20 kernel, but now I am using Gutsy
so I can have graphics drivers.  Gutsy's 2.6.22-rc3-based kernel no
longer recognized my soundcard, a Sigmatel STAC9205, which uses the
hda_intel driver.  I have since fixed that problem by compiling the
hda_intel driver into the kernel (I used 2.6.22-rc4 vanilla sources),
instead of a module, but it is still a bug that it does not work as a
module.

All the following debug information was obtained from kernel 2.6.22-rc4.

However, the bug that is currently affecting me is that sound no
longer works after using ACPI suspend-to-RAM or swsusp.  I compiled
ALSA debugging in, and as soon as I resume, I get tons of messages
like:
[8.474830] hda-intel: send_cmd timeout: IRS=0x1, val=0xd0970500
[8.474954] hda-intel: send_cmd timeout: IRS=0x1, val=0xd0af0009
[8.475078] hda-intel: send_cmd timeout: IRS=0x1, val=0xd0a70500
[8.475207] hda-intel: send_cmd timeout: IRS=0x1, val=0xd0bf0009
(etc)

There may be more before that, but that is more than the kernel
message buffer can hold, so I can't ever see it.  I get similar
timeout messages each time a PCM gets set up:
[8.972859] hda_codec_setup_stream: NID=0x10, stream=0x5, channel=0, format=0
x11
[8.972987] hda-intel: send_cmd timeout: IRS=0x1, val=0x1070650
[9.006071] hda-intel: send_cmd timeout: IRS=0x1, val=0x1020011
[9.006081] hda_codec_setup_stream: NID=0x11, stream=0x5, channel=0, format=0
x11
[9.006207] hda-intel: send_cmd timeout: IRS=0x1, val=0x1170650
[9.016117] hda-intel: send_cmd timeout: IRS=0x1, val=0x1120011

I read some of sound/pci/hda/hda_intel.c, specifically the code that
output those messages.  After octupling my kernel log buffer (to 1MB),
I noticed that for the first set of timeouts (immediately upon
resume), all the "codec" nibbles are odd, the last word of the val
("verb" and "para") are all either 0x0500 or 0x0009.  For the second
set (when something tries to use the card) that last word is 0x0650 or
0x0011.

I have yet to go back and re-test compiling snd-hda-intel as a module.

Thank you in advance,
Matt Mullins