Re: cgroup trace events acquire sleeping locks

2018-07-10 Thread Steven Rostedt
On Tue, 10 Jul 2018 11:26:35 +0200
Peter Zijlstra wrote:

> > Hmm, good question. I could possibly make all the tracepoint code into
> > its own section, and then look to see whether any spin locks exist in
> > them. That wouldn't be trivial to implement, though.
> 
> Pick a bit from the preempt_count (say, right above NMI_MASK), set it
> inside a trace event, and add in_trace().
> 
> Then make lockdep explode when in_trace().
> 
> Or something along those lines.

Sure. We have current->trace for special tracing flags. That could
easily be used.

-- Steve
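
A minimal sketch of the preempt_count idea discussed above follows; every
name in it (TRACE_SHIFT, TRACE_OFFSET, in_trace(), trace_event_enter(),
trace_event_exit()) is hypothetical and invented here purely for
illustration, and current->trace would simply be an alternative home for
the same marker flag.

        /*
         * Hypothetical sketch only: none of these names exist in the
         * kernel. Claim the next free preempt_count bit above NMI_MASK
         * to mean "currently executing a trace event".
         */
        #include <linux/preempt.h>
        #include <linux/bug.h>

        #define TRACE_SHIFT     (NMI_SHIFT + NMI_BITS)
        #define TRACE_OFFSET    (1UL << TRACE_SHIFT)

        #define in_trace()      (preempt_count() & TRACE_OFFSET)

        static inline void trace_event_enter(void)
        {
                __preempt_count_add(TRACE_OFFSET);  /* entering a trace event */
        }

        static inline void trace_event_exit(void)
        {
                __preempt_count_sub(TRACE_OFFSET);  /* leaving it again */
        }

        /*
         * Lockdep (or might_sleep()) would then "explode" on any
         * sleeping lock taken between enter and exit, e.g. with a
         * check such as:
         *
         *      WARN_ON_ONCE(in_trace());
         */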


Re: cgroup trace events acquire sleeping locks

2018-07-10 Thread Peter Zijlstra
On Mon, Jul 09, 2018 at 04:30:10PM -0400, Steven Rostedt wrote:
> On Mon, 9 Jul 2018 22:22:15 +0200
> Sebastian Andrzej Siewior wrote:
> 
> > On 2018-07-09 15:01:54 [-0400], Steven Rostedt wrote:
> > > > which is the trace_cgroup_rmdir() trace event in cgroup_rmdir(). The
> > > > trace event invokes cgroup_path(), which acquires a spin_lock_t, and
> > > > this is invoked within a preempt_disable()ed section.
> > > 
> > > Correct. And I wish no trace event took spin locks.  
> > 
> > Is there an easy way to detect this? I mean, instead of hitting the trace
> > event with debugging enabled and doing a review of each of them.
> 
> Hmm, good question. I could possibly make all the tracepoint code into
> its own section, and then look to see whether any spin locks exist in
> them. That wouldn't be trivial to implement, though.

Pick a bit from the preempt_count (say, right above NMI_MASK), set it
inside a trace event, and add in_trace().

Then make lockdep explode when in_trace().

Or something along those lines.


Re: cgroup trace events acquire sleeping locks

2018-07-09 Thread Steven Rostedt
On Mon, 9 Jul 2018 22:22:15 +0200
Sebastian Andrzej Siewior wrote:

> On 2018-07-09 15:01:54 [-0400], Steven Rostedt wrote:
> > > which is the trace_cgroup_rmdir() trace event in cgroup_rmdir(). The
> > > trace event invokes cgroup_path(), which acquires a spin_lock_t, and
> > > this is invoked within a preempt_disable()ed section.
> > 
> > Correct. And I wish no trace event took spin locks.  
> 
> Is there an easy way to detect this? I mean, instead of hitting the trace
> event with debugging enabled and doing a review of each of them.

Hmm, good question. I could possibly make all the tracepoint code into
its own section, and then look to see whether any spin locks exist in
them. That wouldn't be trivial to implement, though.

> 
> > > It says "Preemption disabled at" migrate_enable(), but this is not true.
> > > A printk() just before the lock reports a preempt_count() of two, and
> > > sometimes one. I *think*:
> > > - one is from rcu_read_lock_sched_notrace() in __DO_TRACE()
> > > - the second is from preempt_disable_notrace() in ring_buffer_lock_reserve()
> > > 
> > > I would prefer not to turn kernfs_rename_lock into a raw_spin_lock_t. We
> > > had a similar problem with an i915 trace event, which eventually vanished
> > > (and before that, I just disabled it).
> > > 
> > > So how likely is it that we can use rcu_read_lock() in __DO_TRACE()?
> > 
> > Not very.  
> 
> Is there a reason for this? I don't think this is documented. I changed
> it to the "normal" RCU read section and it appeared to work :)
> 

Well, there are tracepoints in RCU code. Not sure how they would react.

> > > And how likely is it that the preempt_disable() in
> > > ring_buffer_lock_reserve() could be avoided while the trace event is
> > > invoked?
> > 
> > Even less likely. The design of the ring buffer is based on the writer
> > not being preemptible.
> 
> I was expecting this.
> 
> > > I guess none of this is easy peasy. Any suggestions?
> > >   
> > 
> > One solution, albeit not so pretty, is to move the grabbing of the
> > path outside the trace event. But this should work.
> 
> Okay, I wasn't aware of the trace_cgroup_##type##_enabled() magic. Yes,
> this should work. Do you mind posting this upstream?

Sure.

-- Steve


Re: cgroup trace events acquire sleeping locks

2018-07-09 Thread Sebastian Andrzej Siewior
On 2018-07-09 15:01:54 [-0400], Steven Rostedt wrote:
> > which is the trace_cgroup_rmdir() trace event in cgroup_rmdir(). The
> > trace event invokes cgroup_path(), which acquires a spin_lock_t, and
> > this is invoked within a preempt_disable()ed section.
> 
> Correct. And I wish no trace event took spin locks.

Is there an easy way to detect this? I mean, instead of hitting the trace
event with debugging enabled and doing a review of each of them.

> > It says "Preemption disabled at" migrate_enable(), but this is not true.
> > A printk() just before the lock reports a preempt_count() of two, and
> > sometimes one. I *think*:
> > - one is from rcu_read_lock_sched_notrace() in __DO_TRACE()
> > - the second is from preempt_disable_notrace() in ring_buffer_lock_reserve()
> > 
> > I would prefer not to turn kernfs_rename_lock into a raw_spin_lock_t. We
> > had a similar problem with an i915 trace event, which eventually vanished
> > (and before that, I just disabled it).
> > 
> > So how likely is it that we can use rcu_read_lock() in __DO_TRACE()?
> 
> Not very.

Is there a reason for this? I don't think this is documented. I changed
it to the "normal" RCU read section and it appeared to work :)
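
For reference, this is roughly the shape of __DO_TRACE() in
<linux/tracepoint.h> at the time, heavily simplified (the cond/rcuidle
handling and the per-callback data plumbing are elided); the experiment
described above amounts to replacing the rcu_read_lock_sched_notrace()
pair, which disables preemption, with a plain rcu_read_lock():

        /*
         * Heavily simplified sketch of __DO_TRACE() from
         * <linux/tracepoint.h>; the real macro also handles the "cond"
         * check, the rcuidle variant, and each callback's private data.
         */
        #define __DO_TRACE_SKETCH(tp, proto, args)                      \
                do {                                                    \
                        struct tracepoint_func *it_func_ptr;            \
                        void *it_func;                                  \
                                                                        \
                        rcu_read_lock_sched_notrace();  /* disables preemption */ \
                        it_func_ptr = rcu_dereference_sched((tp)->funcs); \
                        if (it_func_ptr) {                              \
                                do {                                    \
                                        it_func = (it_func_ptr)->func;  \
                                        ((void (*)(proto))(it_func))(args); \
                                } while ((++it_func_ptr)->func);        \
                        }                                               \
                        rcu_read_unlock_sched_notrace();                \
                } while (0)

Swapping in rcu_read_lock() would leave the section preemptible on
PREEMPT_RCU kernels, which is why Steven notes above that there are
tracepoints inside the RCU code itself.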

> > And how likely is it that the preempt_disable() in
> > ring_buffer_lock_reserve() could be avoided while the trace event is
> > invoked?
> 
> > Even less likely. The design of the ring buffer is based on the writer
> > not being preemptible.

I was expecting this.

> > I guess none of this is easy peasy. Any suggestions?
> > 
> 
> One solution, albeit not so pretty, is to move the grabbing of the
> path outside the trace event. But this should work.

Okay, I wasn't aware of the trace_cgroup_##type##_enabled() magic. Yes,
this should work. Do you mind posting this upstream?
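
The "magic" referred to is the helper that the tracepoint machinery
generates for every event; simplified from the DECLARE_TRACE code in
<linux/tracepoint.h>, it boils down to a cheap static-key test:

        /*
         * Simplified from the DECLARE_TRACE machinery in
         * <linux/tracepoint.h>: every event "name" gets a
         * trace_<name>_enabled() helper, so callers can skip expensive
         * argument preparation (such as cgroup_path()) unless the
         * event is actually on.
         */
        static inline bool trace_cgroup_rmdir_enabled(void)
        {
                return static_key_false(&__tracepoint_cgroup_rmdir.key);
        }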

> -- Steve

Sebastian


Re: cgroup trace events acquire sleeping locks

2018-07-09 Thread Steven Rostedt
On Mon, 9 Jul 2018 18:38:05 +0200
Sebastian Andrzej Siewior wrote:

> Clark showed me this:
> 
> | BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:974
> | in_atomic(): 1, irqs_disabled(): 0, pid: 1, name: systemd
> | 5 locks held by systemd/1:
> |  #0:  (sb_writers#7){.+.+}, at: [<(ptrval)>] mnt_want_write+0x1f/0x50
> |  #1:  (&type->i_mutex_dir_key#3/1){+.+.}, at: [<(ptrval)>] do_rmdir+0x14d/0x1f0
> |  #2:  (&type->i_mutex_dir_key#5){}, at: [<(ptrval)>] vfs_rmdir+0x50/0x150
> |  #3:  (cgroup_mutex){+.+.}, at: [<(ptrval)>] cgroup_kn_lock_live+0xfb/0x200
> |  #4:  (kernfs_rename_lock){+.+.}, at: [<(ptrval)>] kernfs_path_from_node+0x23/0x50
> | Preemption disabled at:
> | [] migrate_enable+0x95/0x340
> | CPU: 1 PID: 1 Comm: systemd Not tainted 4.16.18-rt9+ #173
> | Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.11.1-1 04/01/2014
> | Call Trace:
> |  dump_stack+0x70/0xa7
> |  ___might_sleep+0x159/0x240
> |  rt_spin_lock+0x4e/0x60
> |  ? kernfs_path_from_node+0x23/0x50
> |  kernfs_path_from_node+0x23/0x50
> |  trace_event_raw_event_cgroup+0x54/0x110
> |  cgroup_rmdir+0xdb/0x1a0
> |  kernfs_iop_rmdir+0x53/0x80
> |  vfs_rmdir+0x74/0x150
> |  do_rmdir+0x191/0x1f0
> |  SyS_rmdir+0x11/0x20
> |  do_syscall_64+0x73/0x220
> |  entry_SYSCALL_64_after_hwframe+0x42/0xb7
> 
> which is the trace_cgroup_rmdir() trace event in cgroup_rmdir(). The
> trace event invokes cgroup_path(), which acquires a spin_lock_t, and
> this is invoked within a preempt_disable()ed section.

Correct. And I wish no trace event took spin locks.

> 
> It says "Preemption disabled at" migrate_enable(), but this is not true.
> A printk() just before the lock reports a preempt_count() of two, and
> sometimes one. I *think*:
> - one is from rcu_read_lock_sched_notrace() in __DO_TRACE()
> - the second is from preempt_disable_notrace() in ring_buffer_lock_reserve()
> 
> I would prefer not to turn kernfs_rename_lock into a raw_spin_lock_t. We
> had a similar problem with an i915 trace event, which eventually vanished
> (and before that, I just disabled it).
> 
> So how likely is it that we can use rcu_read_lock() in __DO_TRACE()?

Not very.

> And how likely is it that the preempt_disable() in
> ring_buffer_lock_reserve() could be avoided while the trace event is
> invoked?

Even less likely. The design of the ring buffer is based on the writer
not being preemptible.
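
To illustrate why, here is a heavily simplified sketch of the shape of
ring_buffer_lock_reserve(); the _sketch suffix and elided details mark
this as illustration, not the real implementation in
kernel/trace/ring_buffer.c, which additionally handles recursion, NMIs,
and much more. The buffer is per-CPU and reservation is lockless, so the
writer must not migrate or be preempted between reserve and commit.

        struct ring_buffer_event *
        ring_buffer_lock_reserve_sketch(struct ring_buffer *buffer,
                                        unsigned long length)
        {
                struct ring_buffer_per_cpu *cpu_buffer;
                struct ring_buffer_event *event;

                /* Pin the writer to this CPU's buffer: no migration, no sleeping */
                preempt_disable_notrace();

                cpu_buffer = buffer->buffers[raw_smp_processor_id()];

                /* Lockless (cmpxchg-based) reservation, details elided */
                event = rb_reserve_next_event(buffer, cpu_buffer, length);
                if (!event)
                        preempt_enable_notrace();       /* reservation failed */

                /* On success, preemption stays off until ring_buffer_unlock_commit() */
                return event;
        }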

> 
> I guess none of this is easy peasy. Any suggestions?
> 

One solution, albeit not so pretty, is to move the grabbing of the
path outside the trace event. But this should work.

-- Steve


diff --git a/include/trace/events/cgroup.h b/include/trace/events/cgroup.h
index d74722c2ac8b..d149f3a42122 100644
--- a/include/trace/events/cgroup.h
+++ b/include/trace/events/cgroup.h
@@ -53,24 +53,22 @@ DEFINE_EVENT(cgroup_root, cgroup_remount,
 
 DECLARE_EVENT_CLASS(cgroup,
 
-        TP_PROTO(struct cgroup *cgrp),
+        TP_PROTO(struct cgroup *cgrp, const char *path),
 
-        TP_ARGS(cgrp),
+        TP_ARGS(cgrp, path),
 
         TP_STRUCT__entry(
                 __field(        int,            root                    )
                 __field(        int,            id                      )
                 __field(        int,            level                   )
-                __dynamic_array(char,           path,
-                                cgroup_path(cgrp, NULL, 0) + 1)
+                __string(       path,           path            )
         ),
 
         TP_fast_assign(
                 __entry->root = cgrp->root->hierarchy_id;
                 __entry->id = cgrp->id;
                 __entry->level = cgrp->level;
-                cgroup_path(cgrp, __get_dynamic_array(path),
-                            __get_dynamic_array_len(path));
+                __assign_str(path, path);
         ),
 
         TP_printk("root=%d id=%d level=%d path=%s",
@@ -79,30 +77,30 @@ DECLARE_EVENT_CLASS(cgroup,
 
 DEFINE_EVENT(cgroup, cgroup_mkdir,
 
-        TP_PROTO(struct cgroup *cgroup),
+        TP_PROTO(struct cgroup *cgrp, const char *path),
 
-        TP_ARGS(cgroup)
+        TP_ARGS(cgrp, path)
 );
 
 DEFINE_EVENT(cgroup, cgroup_rmdir,
 
-        TP_PROTO(struct cgroup *cgroup),
+        TP_PROTO(struct cgroup *cgrp, const char *path),
 
-        TP_ARGS(cgroup)
+        TP_ARGS(cgrp, path)
 );
 
 DEFINE_EVENT(cgroup, cgroup_release,
 
-        TP_PROTO(struct cgroup *cgroup),
+        TP_PROTO(struct cgroup *cgrp, const char *path),
 
-        TP_ARGS(cgroup)
+        TP_ARGS(cgrp, path)
 );
 
 DEFINE_EVENT(cgroup, cgroup_rename,
 
-        TP_PROTO(struct cgroup *cgroup),
+        TP_PROTO(struct cgroup *cgrp, const char *path),
 
-        TP_ARGS(cgroup)
+        TP_ARGS(cgrp, path)
 );
 
 DECLARE_EVENT_CLASS(cgroup_migrate,
diff --git a/kernel/cgroup/cgroup-internal.h b/kernel/cgroup/cgroup-internal.h
index 77ff1cd6a252..d0ad35796ea1 100644
--- a/kernel/cgroup/cgroup-internal.h
+++ b/kernel/cgroup/
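
The cgroup-internal.h half of the patch is cut off above. As a hedged
sketch of what the caller side implied by the new (cgrp, path)
prototypes might look like: the TRACE_CGROUP_PATH name, the static
buffer, and its lock below are assumptions for illustration, not the
literal patch.

        /*
         * Illustrative sketch only -- not the truncated patch itself.
         * The path is built under an ordinary spinlock, *before* the
         * tracepoint disables preemption, and only when the event is
         * actually enabled. All names here are assumptions.
         */
        #define TRACE_CGROUP_PATH_LEN 1024
        static DEFINE_SPINLOCK(trace_cgroup_path_lock);
        static char trace_cgroup_path[TRACE_CGROUP_PATH_LEN];

        #define TRACE_CGROUP_PATH(type, cgrp)                           \
                do {                                                    \
                        if (trace_cgroup_##type##_enabled()) {          \
                                spin_lock(&trace_cgroup_path_lock);     \
                                cgroup_path(cgrp, trace_cgroup_path,    \
                                            TRACE_CGROUP_PATH_LEN);     \
                                trace_cgroup_##type(cgrp, trace_cgroup_path); \
                                spin_unlock(&trace_cgroup_path_lock);   \
                        }                                               \
                } while (0)

Call sites such as trace_cgroup_rmdir(cgrp) would then become
TRACE_CGROUP_PATH(rmdir, cgrp), so the sleeping kernfs_rename_lock
inside cgroup_path() is taken in preemptible context.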

cgroup trace events acquire sleeping locks

2018-07-09 Thread Sebastian Andrzej Siewior
Clark showed me this:

| BUG: sleeping function called from invalid context at kernel/locking/rtmutex.c:974
| in_atomic(): 1, irqs_disabled(): 0, pid: 1, name: systemd
| 5 locks held by systemd/1:
|  #0:  (sb_writers#7){.+.+}, at: [<(ptrval)>] mnt_want_write+0x1f/0x50
|  #1:  (&type->i_mutex_dir_key#3/1){+.+.}, at: [<(ptrval)>] do_rmdir+0x14d/0x1f0
|  #2:  (&type->i_mutex_dir_key#5){}, at: [<(ptrval)>] vfs_rmdir+0x50/0x150
|  #3:  (cgroup_mutex){+.+.}, at: [<(ptrval)>] cgroup_kn_lock_live+0xfb/0x200
|  #4:  (kernfs_rename_lock){+.+.}, at: [<(ptrval)>] kernfs_path_from_node+0x23/0x50
| Preemption disabled at:
| [] migrate_enable+0x95/0x340
| CPU: 1 PID: 1 Comm: systemd Not tainted 4.16.18-rt9+ #173
| Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.11.1-1 04/01/2014
| Call Trace:
|  dump_stack+0x70/0xa7
|  ___might_sleep+0x159/0x240
|  rt_spin_lock+0x4e/0x60
|  ? kernfs_path_from_node+0x23/0x50
|  kernfs_path_from_node+0x23/0x50
|  trace_event_raw_event_cgroup+0x54/0x110
|  cgroup_rmdir+0xdb/0x1a0
|  kernfs_iop_rmdir+0x53/0x80
|  vfs_rmdir+0x74/0x150
|  do_rmdir+0x191/0x1f0
|  SyS_rmdir+0x11/0x20
|  do_syscall_64+0x73/0x220
|  entry_SYSCALL_64_after_hwframe+0x42/0xb7

which is the trace_cgroup_rmdir() trace event in cgroup_rmdir(). The
trace event invokes cgroup_path(), which acquires a spin_lock_t, and
this is invoked within a preempt_disable()ed section.

It says "Preemption disabled at" migrate_enable(), but this is not true.
A printk() just before the lock reports a preempt_count() of two, and
sometimes one. I *think*:
- one is from rcu_read_lock_sched_notrace() in __DO_TRACE()
- the second is from preempt_disable_notrace() in ring_buffer_lock_reserve()

I would prefer not to turn kernfs_rename_lock into a raw_spin_lock_t. We
had a similar problem with an i915 trace event, which eventually vanished
(and before that, I just disabled it).

So how likely is it that we can use rcu_read_lock() in __DO_TRACE()?
And how likely is it that the preempt_disable() in
ring_buffer_lock_reserve() could be avoided while the trace event is
invoked?

I guess none of this is easy peasy. Any suggestions?

Sebastian