Re: RCU stall when using function_graph

2017-08-30 Thread Paul E. McKenney
On Wed, Aug 16, 2017 at 10:58:05AM -0700, Paul E. McKenney wrote:
> On Wed, Aug 16, 2017 at 12:41:40PM -0400, Steven Rostedt wrote:
> > On Wed, 16 Aug 2017 09:32:28 -0700
> > "Paul E. McKenney"  wrote:
> > 
> > > Let me see if I understand you...  About halfway to the stall limit,
> > > RCU triggers an irq_work (on each CPU that has not yet passed through
> > > a quiescent state, IPIing them in turn?), and if the irq_work has
> > > not completed by the end of the stall limit, RCU adds that to its
> > > stall-warning message.
> > 
> > Doesn't even have to be half way through. It could be done at the
> > limit, and then wait a little more.
> 
> Agreed.  However, I take other evasive action halfway through, so there
> is a nice place to put the code.
> 
> > > Or am I missing something here?
> > 
> > No, I think that's what I was suggesting.
> 
> I don't see a way of canceling an irq_work.
> 
> Ah, but there is an irq_work_queue_on() that fails if still pending.
> I should be able to use that instead of cancel.  Plus, -I- don't have
> to send the IPIs, I can just let irq_work_queue_on() do my dirty work!
> 
> This should provide sufficient entertainment for a bit!  ;-)

Finally got to this -- how does the patch below look to you?

It adds a digit to the stall warning saying for how many full grace
periods interrupts have been disabled, "." if interrupts were enabled
recently, and "!" if RCU was in too much trouble to tell one way or
the other.  (This last can happen if RCU's grace-period kthread has been
prevented from executing for most of the grace period.)

Thanx, Paul



commit 7b32575c82ed5c891b93a4a5961e608a97cdb192
Author: Paul E. McKenney 
Date:   Thu Aug 17 17:05:59 2017 -0700

rcu: Make RCU CPU stall warnings check for irq-disabled CPUs

One common question upon seeing an RCU CPU stall warning is "did
the stalled CPUs have interrupts disabled?"  However, the current
stall warnings are silent on this point.  This commit therefore
uses irq_work to check whether stalled CPUs still respond to IPIs,
and flags this state in the RCU CPU stall warning console messages.

Reported-by: Steven Rostedt 
Signed-off-by: Paul E. McKenney 

 tree.c|  103 +++---
 tree.h|5 ++
 tree_plugin.h |5 ++
 3 files changed, 100 insertions(+), 13 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 5152120f11d3..307c641149d4 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1197,6 +1197,22 @@ static int rcu_is_cpu_rrupt_from_idle(void)
 }
 
 /*
+ * We are reporting a quiescent state on behalf of some other CPU, so
+ * it is our responsibility to check for and handle potential overflow
+ * of the rcu_node ->gpnum counter with respect to the rcu_data counters.
+ * After all, the CPU might be in deep idle state, and thus executing no
+ * code whatsoever.
+ */
+static void rcu_gpnum_ovf(struct rcu_node *rnp, struct rcu_data *rdp)
+{
+   lockdep_assert_held(>lock);
+   if (ULONG_CMP_LT(READ_ONCE(rdp->gpnum) + ULONG_MAX / 4, rnp->gpnum))
+   WRITE_ONCE(rdp->gpwrap, true);
+   if (ULONG_CMP_LT(rdp->rcu_iw_gpnum + ULONG_MAX / 4, rnp->gpnum))
+   rdp->rcu_iw_gpnum = rnp->gpnum + ULONG_MAX / 4;
+}
+
+/*
  * Snapshot the specified CPU's dynticks counter so that we can later
  * credit them with an implicit quiescent state.  Return 1 if this CPU
  * is in dynticks idle mode, which is an extended quiescent state.
@@ -1206,15 +1222,34 @@ static int dyntick_save_progress_counter(struct 
rcu_data *rdp)
rdp->dynticks_snap = rcu_dynticks_snap(rdp->dynticks);
if (rcu_dynticks_in_eqs(rdp->dynticks_snap)) {
trace_rcu_fqs(rdp->rsp->name, rdp->gpnum, rdp->cpu, TPS("dti"));
-   if (ULONG_CMP_LT(READ_ONCE(rdp->gpnum) + ULONG_MAX / 4,
-rdp->mynode->gpnum))
-   WRITE_ONCE(rdp->gpwrap, true);
+   rcu_gpnum_ovf(rdp->mynode, rdp);
return 1;
}
return 0;
 }
 
 /*
+ * Handler for the irq_work request posted when a grace period has
+ * gone on for too long, but not yet long enough for an RCU CPU
+ * stall warning.  Set state appropriately, but just complain if
+ * there is unexpected state on entry.
+ */
+static void rcu_iw_handler(struct irq_work *iwp)
+{
+   struct rcu_data *rdp;
+   struct rcu_node *rnp;
+
+   rdp = container_of(iwp, struct rcu_data, rcu_iw);
+   rnp = rdp->mynode;
+   raw_spin_lock_rcu_node(rnp);
+   if (!WARN_ON_ONCE(!rdp->rcu_iw_pending)) {
+   rdp->rcu_iw_gpnum = rnp->gpnum;
+   rdp->rcu_iw_pending = false;
+   }
+   

Re: RCU stall when using function_graph

2017-08-30 Thread Paul E. McKenney
On Wed, Aug 16, 2017 at 10:58:05AM -0700, Paul E. McKenney wrote:
> On Wed, Aug 16, 2017 at 12:41:40PM -0400, Steven Rostedt wrote:
> > On Wed, 16 Aug 2017 09:32:28 -0700
> > "Paul E. McKenney"  wrote:
> > 
> > > Let me see if I understand you...  About halfway to the stall limit,
> > > RCU triggers an irq_work (on each CPU that has not yet passed through
> > > a quiescent state, IPIing them in turn?), and if the irq_work has
> > > not completed by the end of the stall limit, RCU adds that to its
> > > stall-warning message.
> > 
> > Doesn't even have to be half way through. It could be done at the
> > limit, and then wait a little more.
> 
> Agreed.  However, I take other evasive action halfway through, so there
> is a nice place to put the code.
> 
> > > Or am I missing something here?
> > 
> > No, I think that's what I was suggesting.
> 
> I don't see a way of canceling an irq_work.
> 
> Ah, but there is an irq_work_queue_on() that fails if still pending.
> I should be able to use that instead of cancel.  Plus, -I- don't have
> to send the IPIs, I can just let irq_work_queue_on() do my dirty work!
> 
> This should provide sufficient entertainment for a bit!  ;-)

Finally got to this -- how does the patch below look to you?

It adds a digit to the stall warning saying for how many full grace
periods interrupts have been disabled, "." if interrupts were enabled
recently, and "!" if RCU was in too much trouble to tell one way or
the other.  (This last can happen if RCU's grace-period kthread has been
prevented from executing for most of the grace period.)

Thanx, Paul



commit 7b32575c82ed5c891b93a4a5961e608a97cdb192
Author: Paul E. McKenney 
Date:   Thu Aug 17 17:05:59 2017 -0700

rcu: Make RCU CPU stall warnings check for irq-disabled CPUs

One common question upon seeing an RCU CPU stall warning is "did
the stalled CPUs have interrupts disabled?"  However, the current
stall warnings are silent on this point.  This commit therefore
uses irq_work to check whether stalled CPUs still respond to IPIs,
and flags this state in the RCU CPU stall warning console messages.

Reported-by: Steven Rostedt 
Signed-off-by: Paul E. McKenney 

 tree.c|  103 +++---
 tree.h|5 ++
 tree_plugin.h |5 ++
 3 files changed, 100 insertions(+), 13 deletions(-)

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 5152120f11d3..307c641149d4 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -1197,6 +1197,22 @@ static int rcu_is_cpu_rrupt_from_idle(void)
 }
 
 /*
+ * We are reporting a quiescent state on behalf of some other CPU, so
+ * it is our responsibility to check for and handle potential overflow
+ * of the rcu_node ->gpnum counter with respect to the rcu_data counters.
+ * After all, the CPU might be in deep idle state, and thus executing no
+ * code whatsoever.
+ */
+static void rcu_gpnum_ovf(struct rcu_node *rnp, struct rcu_data *rdp)
+{
+   lockdep_assert_held(>lock);
+   if (ULONG_CMP_LT(READ_ONCE(rdp->gpnum) + ULONG_MAX / 4, rnp->gpnum))
+   WRITE_ONCE(rdp->gpwrap, true);
+   if (ULONG_CMP_LT(rdp->rcu_iw_gpnum + ULONG_MAX / 4, rnp->gpnum))
+   rdp->rcu_iw_gpnum = rnp->gpnum + ULONG_MAX / 4;
+}
+
+/*
  * Snapshot the specified CPU's dynticks counter so that we can later
  * credit them with an implicit quiescent state.  Return 1 if this CPU
  * is in dynticks idle mode, which is an extended quiescent state.
@@ -1206,15 +1222,34 @@ static int dyntick_save_progress_counter(struct 
rcu_data *rdp)
rdp->dynticks_snap = rcu_dynticks_snap(rdp->dynticks);
if (rcu_dynticks_in_eqs(rdp->dynticks_snap)) {
trace_rcu_fqs(rdp->rsp->name, rdp->gpnum, rdp->cpu, TPS("dti"));
-   if (ULONG_CMP_LT(READ_ONCE(rdp->gpnum) + ULONG_MAX / 4,
-rdp->mynode->gpnum))
-   WRITE_ONCE(rdp->gpwrap, true);
+   rcu_gpnum_ovf(rdp->mynode, rdp);
return 1;
}
return 0;
 }
 
 /*
+ * Handler for the irq_work request posted when a grace period has
+ * gone on for too long, but not yet long enough for an RCU CPU
+ * stall warning.  Set state appropriately, but just complain if
+ * there is unexpected state on entry.
+ */
+static void rcu_iw_handler(struct irq_work *iwp)
+{
+   struct rcu_data *rdp;
+   struct rcu_node *rnp;
+
+   rdp = container_of(iwp, struct rcu_data, rcu_iw);
+   rnp = rdp->mynode;
+   raw_spin_lock_rcu_node(rnp);
+   if (!WARN_ON_ONCE(!rdp->rcu_iw_pending)) {
+   rdp->rcu_iw_gpnum = rnp->gpnum;
+   rdp->rcu_iw_pending = false;
+   }
+   raw_spin_unlock_rcu_node(rnp);
+}
+
+/*
  * Return true if the specified CPU has passed through a quiescent
  * state by virtue of 

Re: RCU stall when using function_graph

2017-08-16 Thread Paul E. McKenney
On Wed, Aug 16, 2017 at 12:41:40PM -0400, Steven Rostedt wrote:
> On Wed, 16 Aug 2017 09:32:28 -0700
> "Paul E. McKenney"  wrote:
> 
> > Let me see if I understand you...  About halfway to the stall limit,
> > RCU triggers an irq_work (on each CPU that has not yet passed through
> > a quiescent state, IPIing them in turn?), and if the irq_work has
> > not completed by the end of the stall limit, RCU adds that to its
> > stall-warning message.
> 
> Doesn't even have to be half way through. It could be done at the
> limit, and then wait a little more.

Agreed.  However, I take other evasive action halfway through, so there
is a nice place to put the code.

> > Or am I missing something here?
> 
> No, I think that's what I was suggesting.

I don't see a way of canceling an irq_work.

Ah, but there is an irq_work_queue_on() that fails if still pending.
I should be able to use that instead of cancel.  Plus, -I- don't have
to send the IPIs, I can just let irq_work_queue_on() do my dirty work!

This should provide sufficient entertainment for a bit!  ;-)

Thanx, Paul



Re: RCU stall when using function_graph

2017-08-16 Thread Paul E. McKenney
On Wed, Aug 16, 2017 at 12:41:40PM -0400, Steven Rostedt wrote:
> On Wed, 16 Aug 2017 09:32:28 -0700
> "Paul E. McKenney"  wrote:
> 
> > Let me see if I understand you...  About halfway to the stall limit,
> > RCU triggers an irq_work (on each CPU that has not yet passed through
> > a quiescent state, IPIing them in turn?), and if the irq_work has
> > not completed by the end of the stall limit, RCU adds that to its
> > stall-warning message.
> 
> Doesn't even have to be half way through. It could be done at the
> limit, and then wait a little more.

Agreed.  However, I take other evasive action halfway through, so there
is a nice place to put the code.

> > Or am I missing something here?
> 
> No, I think that's what I was suggesting.

I don't see a way of canceling an irq_work.

Ah, but there is an irq_work_queue_on() that fails if still pending.
I should be able to use that instead of cancel.  Plus, -I- don't have
to send the IPIs, I can just let irq_work_queue_on() do my dirty work!

This should provide sufficient entertainment for a bit!  ;-)

Thanx, Paul



Re: RCU stall when using function_graph

2017-08-16 Thread Steven Rostedt
On Wed, 16 Aug 2017 09:32:28 -0700
"Paul E. McKenney"  wrote:

> Let me see if I understand you...  About halfway to the stall limit,
> RCU triggers an irq_work (on each CPU that has not yet passed through
> a quiescent state, IPIing them in turn?), and if the irq_work has
> not completed by the end of the stall limit, RCU adds that to its
> stall-warning message.

Doesn't even have to be half way through. It could be done at the
limit, and then wait a little more.

> 
> Or am I missing something here?

No, I think that's what I was suggesting.

-- Steve


Re: RCU stall when using function_graph

2017-08-16 Thread Steven Rostedt
On Wed, 16 Aug 2017 09:32:28 -0700
"Paul E. McKenney"  wrote:

> Let me see if I understand you...  About halfway to the stall limit,
> RCU triggers an irq_work (on each CPU that has not yet passed through
> a quiescent state, IPIing them in turn?), and if the irq_work has
> not completed by the end of the stall limit, RCU adds that to its
> stall-warning message.

Doesn't even have to be half way through. It could be done at the
limit, and then wait a little more.

> 
> Or am I missing something here?

No, I think that's what I was suggesting.

-- Steve


Re: RCU stall when using function_graph

2017-08-16 Thread Paul E. McKenney
On Wed, Aug 16, 2017 at 10:04:21AM -0400, Steven Rostedt wrote:
> On Wed, 16 Aug 2017 10:42:15 +0200
> Daniel Lezcano  wrote:
> 
> > Hi Steven,
> > 
> > 
> > On 15/08/2017 15:29, Steven Rostedt wrote:
> > > 
> > > [ I'm back from vacation! ]  
> > 
> > Did you get the tapes? :)
> 
> Yes, but nothing in them would cause the reputation of the POTUS to
> become any worse than it already is.
> 
> > 
> > > On Wed, 9 Aug 2017 17:51:33 +0200
> > > Daniel Lezcano  wrote:
> > >   
> > >> Well, may be the instruction pointer thing is not a good idea.
> > >>
> > >> I learnt from this experience, an overloaded kernel with a lot of
> > >> interrupts can hang the console and issue RCU stall.
> > >>
> > >> However, someone else can face the same situation. Even if he reads the
> > >> RCU/stallwarn.txt documentation, it will be hard to figure out the issue.
> > >>
> > >> A message telling the grace period can't be reached because we are too
> > >> busy processing interrupts would have helped but I understand it is not
> > >> easy to implement.  
> > > 
> > > What if the stall code triggered an irqwork first? The irqwork would
> > > trigger as soon as interrupts were enabled again (or at the next tick,
> > > depending on the arch), and then it would know that RCU stalled due to
> > > an irq storm if the irqwork is being hit.  
> > 
> > Is that condition enough to tell the CPU is over utilized by the
> > interrupts handling?
> > 
> > And I'm wondering if it wouldn't make sense to have this detection in
> > the irq code. With or without the RCU stall warning kernel option set,
> > the irq framework will be warning about this situation. If the RCU stall
> > option is set, that will issue a second message. It will be easy to do
> > the connection between the first message and the second one, no ?
> 
> The thing is, the RCU code keeps track of the state of progress, I
> don't believe the interrupt code does. It just worries about handling
> interrupts. I'm not excited about adding infrastructure to the
> interrupt code to do accounting of IRQ storms.
> 
> On the other hand, the RCU code already does this. If it notices a
> stall, it can trigger a irq_work and wait a little more. If the
> irq_work doesn't fire, then it can do the normal RCU stall message. But
> if the irq_work does fire, and the RCU progress still hasn't moved
> forward, then it would be able to say this is due to an IRQ storm and
> produce a better error message.

Let me see if I understand you...  About halfway to the stall limit,
RCU triggers an irq_work (on each CPU that has not yet passed through
a quiescent state, IPIing them in turn?), and if the irq_work has
not completed by the end of the stall limit, RCU adds that to its
stall-warning message.

Or am I missing something here?

Thanx, Paul



Re: RCU stall when using function_graph

2017-08-16 Thread Paul E. McKenney
On Wed, Aug 16, 2017 at 10:04:21AM -0400, Steven Rostedt wrote:
> On Wed, 16 Aug 2017 10:42:15 +0200
> Daniel Lezcano  wrote:
> 
> > Hi Steven,
> > 
> > 
> > On 15/08/2017 15:29, Steven Rostedt wrote:
> > > 
> > > [ I'm back from vacation! ]  
> > 
> > Did you get the tapes? :)
> 
> Yes, but nothing in them would cause the reputation of the POTUS to
> become any worse than it already is.
> 
> > 
> > > On Wed, 9 Aug 2017 17:51:33 +0200
> > > Daniel Lezcano  wrote:
> > >   
> > >> Well, may be the instruction pointer thing is not a good idea.
> > >>
> > >> I learnt from this experience, an overloaded kernel with a lot of
> > >> interrupts can hang the console and issue RCU stall.
> > >>
> > >> However, someone else can face the same situation. Even if he reads the
> > >> RCU/stallwarn.txt documentation, it will be hard to figure out the issue.
> > >>
> > >> A message telling the grace period can't be reached because we are too
> > >> busy processing interrupts would have helped but I understand it is not
> > >> easy to implement.  
> > > 
> > > What if the stall code triggered an irqwork first? The irqwork would
> > > trigger as soon as interrupts were enabled again (or at the next tick,
> > > depending on the arch), and then it would know that RCU stalled due to
> > > an irq storm if the irqwork is being hit.  
> > 
> > Is that condition enough to tell the CPU is over utilized by the
> > interrupts handling?
> > 
> > And I'm wondering if it wouldn't make sense to have this detection in
> > the irq code. With or without the RCU stall warning kernel option set,
> > the irq framework will be warning about this situation. If the RCU stall
> > option is set, that will issue a second message. It will be easy to do
> > the connection between the first message and the second one, no ?
> 
> The thing is, the RCU code keeps track of the state of progress, I
> don't believe the interrupt code does. It just worries about handling
> interrupts. I'm not excited about adding infrastructure to the
> interrupt code to do accounting of IRQ storms.
> 
> On the other hand, the RCU code already does this. If it notices a
> stall, it can trigger a irq_work and wait a little more. If the
> irq_work doesn't fire, then it can do the normal RCU stall message. But
> if the irq_work does fire, and the RCU progress still hasn't moved
> forward, then it would be able to say this is due to an IRQ storm and
> produce a better error message.

Let me see if I understand you...  About halfway to the stall limit,
RCU triggers an irq_work (on each CPU that has not yet passed through
a quiescent state, IPIing them in turn?), and if the irq_work has
not completed by the end of the stall limit, RCU adds that to its
stall-warning message.

Or am I missing something here?

Thanx, Paul



Re: RCU stall when using function_graph

2017-08-16 Thread Steven Rostedt
On Wed, 16 Aug 2017 10:42:15 +0200
Daniel Lezcano  wrote:

> Hi Steven,
> 
> 
> On 15/08/2017 15:29, Steven Rostedt wrote:
> > 
> > [ I'm back from vacation! ]  
> 
> Did you get the tapes? :)

Yes, but nothing in them would cause the reputation of the POTUS to
become any worse than it already is.

> 
> > On Wed, 9 Aug 2017 17:51:33 +0200
> > Daniel Lezcano  wrote:
> >   
> >> Well, may be the instruction pointer thing is not a good idea.
> >>
> >> I learnt from this experience, an overloaded kernel with a lot of
> >> interrupts can hang the console and issue RCU stall.
> >>
> >> However, someone else can face the same situation. Even if he reads the
> >> RCU/stallwarn.txt documentation, it will be hard to figure out the issue.
> >>
> >> A message telling the grace period can't be reached because we are too
> >> busy processing interrupts would have helped but I understand it is not
> >> easy to implement.  
> > 
> > What if the stall code triggered an irqwork first? The irqwork would
> > trigger as soon as interrupts were enabled again (or at the next tick,
> > depending on the arch), and then it would know that RCU stalled due to
> > an irq storm if the irqwork is being hit.  
> 
> Is that condition enough to tell the CPU is over utilized by the
> interrupts handling?
> 
> And I'm wondering if it wouldn't make sense to have this detection in
> the irq code. With or without the RCU stall warning kernel option set,
> the irq framework will be warning about this situation. If the RCU stall
> option is set, that will issue a second message. It will be easy to do
> the connection between the first message and the second one, no ?

The thing is, the RCU code keeps track of the state of progress, I
don't believe the interrupt code does. It just worries about handling
interrupts. I'm not excited about adding infrastructure to the
interrupt code to do accounting of IRQ storms.

On the other hand, the RCU code already does this. If it notices a
stall, it can trigger a irq_work and wait a little more. If the
irq_work doesn't fire, then it can do the normal RCU stall message. But
if the irq_work does fire, and the RCU progress still hasn't moved
forward, then it would be able to say this is due to an IRQ storm and
produce a better error message.

-- Steve



Re: RCU stall when using function_graph

2017-08-16 Thread Steven Rostedt
On Wed, 16 Aug 2017 10:42:15 +0200
Daniel Lezcano  wrote:

> Hi Steven,
> 
> 
> On 15/08/2017 15:29, Steven Rostedt wrote:
> > 
> > [ I'm back from vacation! ]  
> 
> Did you get the tapes? :)

Yes, but nothing in them would cause the reputation of the POTUS to
become any worse than it already is.

> 
> > On Wed, 9 Aug 2017 17:51:33 +0200
> > Daniel Lezcano  wrote:
> >   
> >> Well, may be the instruction pointer thing is not a good idea.
> >>
> >> I learnt from this experience, an overloaded kernel with a lot of
> >> interrupts can hang the console and issue RCU stall.
> >>
> >> However, someone else can face the same situation. Even if he reads the
> >> RCU/stallwarn.txt documentation, it will be hard to figure out the issue.
> >>
> >> A message telling the grace period can't be reached because we are too
> >> busy processing interrupts would have helped but I understand it is not
> >> easy to implement.  
> > 
> > What if the stall code triggered an irqwork first? The irqwork would
> > trigger as soon as interrupts were enabled again (or at the next tick,
> > depending on the arch), and then it would know that RCU stalled due to
> > an irq storm if the irqwork is being hit.  
> 
> Is that condition enough to tell the CPU is over utilized by the
> interrupts handling?
> 
> And I'm wondering if it wouldn't make sense to have this detection in
> the irq code. With or without the RCU stall warning kernel option set,
> the irq framework will be warning about this situation. If the RCU stall
> option is set, that will issue a second message. It will be easy to do
> the connection between the first message and the second one, no ?

The thing is, the RCU code keeps track of the state of progress, I
don't believe the interrupt code does. It just worries about handling
interrupts. I'm not excited about adding infrastructure to the
interrupt code to do accounting of IRQ storms.

On the other hand, the RCU code already does this. If it notices a
stall, it can trigger a irq_work and wait a little more. If the
irq_work doesn't fire, then it can do the normal RCU stall message. But
if the irq_work does fire, and the RCU progress still hasn't moved
forward, then it would be able to say this is due to an IRQ storm and
produce a better error message.

-- Steve



Re: RCU stall when using function_graph

2017-08-16 Thread Daniel Lezcano

Hi Steven,


On 15/08/2017 15:29, Steven Rostedt wrote:
> 
> [ I'm back from vacation! ]

Did you get the tapes? :)

> On Wed, 9 Aug 2017 17:51:33 +0200
> Daniel Lezcano  wrote:
> 
>> Well, may be the instruction pointer thing is not a good idea.
>>
>> I learnt from this experience, an overloaded kernel with a lot of
>> interrupts can hang the console and issue RCU stall.
>>
>> However, someone else can face the same situation. Even if he reads the
>> RCU/stallwarn.txt documentation, it will be hard to figure out the issue.
>>
>> A message telling the grace period can't be reached because we are too
>> busy processing interrupts would have helped but I understand it is not
>> easy to implement.
> 
> What if the stall code triggered an irqwork first? The irqwork would
> trigger as soon as interrupts were enabled again (or at the next tick,
> depending on the arch), and then it would know that RCU stalled due to
> an irq storm if the irqwork is being hit.

Is that condition enough to tell the CPU is over utilized by the
interrupts handling?

And I'm wondering if it wouldn't make sense to have this detection in
the irq code. With or without the RCU stall warning kernel option set,
the irq framework will be warning about this situation. If the RCU stall
option is set, that will issue a second message. It will be easy to do
the connection between the first message and the second one, no ?




-- 
  Linaro.org │ Open source software for ARM SoCs

Follow Linaro:   Facebook |
 Twitter |
 Blog



Re: RCU stall when using function_graph

2017-08-16 Thread Daniel Lezcano

Hi Steven,


On 15/08/2017 15:29, Steven Rostedt wrote:
> 
> [ I'm back from vacation! ]

Did you get the tapes? :)

> On Wed, 9 Aug 2017 17:51:33 +0200
> Daniel Lezcano  wrote:
> 
>> Well, may be the instruction pointer thing is not a good idea.
>>
>> I learnt from this experience, an overloaded kernel with a lot of
>> interrupts can hang the console and issue RCU stall.
>>
>> However, someone else can face the same situation. Even if he reads the
>> RCU/stallwarn.txt documentation, it will be hard to figure out the issue.
>>
>> A message telling the grace period can't be reached because we are too
>> busy processing interrupts would have helped but I understand it is not
>> easy to implement.
> 
> What if the stall code triggered an irqwork first? The irqwork would
> trigger as soon as interrupts were enabled again (or at the next tick,
> depending on the arch), and then it would know that RCU stalled due to
> an irq storm if the irqwork is being hit.

Is that condition enough to tell the CPU is over utilized by the
interrupts handling?

And I'm wondering if it wouldn't make sense to have this detection in
the irq code. With or without the RCU stall warning kernel option set,
the irq framework will be warning about this situation. If the RCU stall
option is set, that will issue a second message. It will be easy to do
the connection between the first message and the second one, no ?




-- 
  Linaro.org │ Open source software for ARM SoCs

Follow Linaro:   Facebook |
 Twitter |
 Blog



Re: RCU stall when using function_graph

2017-08-15 Thread Steven Rostedt

[ I'm back from vacation! ]

On Wed, 9 Aug 2017 17:51:33 +0200
Daniel Lezcano  wrote:

> Well, may be the instruction pointer thing is not a good idea.
> 
> I learnt from this experience, an overloaded kernel with a lot of
> interrupts can hang the console and issue RCU stall.
> 
> However, someone else can face the same situation. Even if he reads the
> RCU/stallwarn.txt documentation, it will be hard to figure out the issue.
> 
> A message telling the grace period can't be reached because we are too
> busy processing interrupts would have helped but I understand it is not
> easy to implement.

What if the stall code triggered an irqwork first? The irqwork would
trigger as soon as interrupts were enabled again (or at the next tick,
depending on the arch), and then it would know that RCU stalled due to
an irq storm if the irqwork is being hit.

-- Steve

> 
> Perhaps, adding a new bullet in the documentation can help:
> 
> "If the interrupt processing time is longer than the interval between
> each interrupt, the CPU will keep processing the interrupts without
> allowing the RCU's grace period kthread. This situation can happen if
> there is a highly rated number of interrupts and the function_graph
> tracer is enabled".
> 
> 
> 
> 



Re: RCU stall when using function_graph

2017-08-15 Thread Steven Rostedt

[ I'm back from vacation! ]

On Wed, 9 Aug 2017 17:51:33 +0200
Daniel Lezcano  wrote:

> Well, may be the instruction pointer thing is not a good idea.
> 
> I learnt from this experience, an overloaded kernel with a lot of
> interrupts can hang the console and issue RCU stall.
> 
> However, someone else can face the same situation. Even if he reads the
> RCU/stallwarn.txt documentation, it will be hard to figure out the issue.
> 
> A message telling the grace period can't be reached because we are too
> busy processing interrupts would have helped but I understand it is not
> easy to implement.

What if the stall code triggered an irqwork first? The irqwork would
trigger as soon as interrupts were enabled again (or at the next tick,
depending on the arch), and then it would know that RCU stalled due to
an irq storm if the irqwork is being hit.

-- Steve

> 
> Perhaps, adding a new bullet in the documentation can help:
> 
> "If the interrupt processing time is longer than the interval between
> each interrupt, the CPU will keep processing the interrupts without
> allowing the RCU's grace period kthread. This situation can happen if
> there is a highly rated number of interrupts and the function_graph
> tracer is enabled".
> 
> 
> 
> 



Re: RCU stall when using function_graph

2017-08-11 Thread Daniel Lezcano
On 10/08/2017 23:39, Paul E. McKenney wrote:
> On Thu, Aug 10, 2017 at 11:45:09AM +0200, Daniel Lezcano wrote:

[ ... ]

>> Nothing coming in mind but may be worth to mention the slowness of the
>> CPU is the aggravating factor. In particular I was able to reproduce the
>> issue by setting to the min CPU frequency. With the ondemand governor,
>> we can have the frequency high (hence enough CPU power) at the moment we
>> set the function_graph because another CPU is loaded (and both CPUs are
>> sharing the same clock line). The system became stuck at the moment the
>> other CPU went idle with the lowest frequency. That introduced
>> randomness in the issue and made hard to figure out why the RCU stall
>> was happening.
> 
> Adding this, then?

Yes, sure.

Thanks Paul.

  -- Daniel

> 
> 
> commit f7d9ce95064f76be583c775fac32076fa59f1617
> Author: Paul E. McKenney 
> Date:   Thu Aug 10 14:33:17 2017 -0700
> 
> documentation: Slow systems can stall RCU grace periods
> 
> If a fast system has a worst-case grace-period duration of (say) ten
> seconds, then running the same workload on a system ten times as slow
> will get you an RCU CPU stall warning given default stall-warning
> timeout settings.  This commit therefore adds this possibility to
> stallwarn.txt.
> 
> Reported-by: Daniel Lezcano 
> Signed-off-by: Paul E. McKenney 
> 
> diff --git a/Documentation/RCU/stallwarn.txt b/Documentation/RCU/stallwarn.txt
> index 21b8913acbdf..238acbd94917 100644
> --- a/Documentation/RCU/stallwarn.txt
> +++ b/Documentation/RCU/stallwarn.txt
> @@ -70,6 +70,12 @@ o  A periodic interrupt whose handler takes longer than 
> the time
>   considerably longer than normal, which can in turn result in
>   RCU CPU stall warnings.
>  
> +oTesting a workload on a fast system, tuning the stall-warning
> + timeout down to just barely avoid RCU CPU stall warnings, and then
> + running the same workload with the same stall-warning timeout on a
> + slow system.  Note that thermal throttling and on-demand governors
> + can cause a single system to be sometimes fast and sometimes slow!
> +
>  oA hardware or software issue shuts off the scheduler-clock
>   interrupt on a CPU that is not in dyntick-idle mode.  This
>   problem really has happened, and seems to be most likely to
> 


-- 
  Linaro.org │ Open source software for ARM SoCs

Follow Linaro:   Facebook |
 Twitter |
 Blog



Re: RCU stall when using function_graph

2017-08-11 Thread Daniel Lezcano
On 10/08/2017 23:39, Paul E. McKenney wrote:
> On Thu, Aug 10, 2017 at 11:45:09AM +0200, Daniel Lezcano wrote:

[ ... ]

>> Nothing coming in mind but may be worth to mention the slowness of the
>> CPU is the aggravating factor. In particular I was able to reproduce the
>> issue by setting to the min CPU frequency. With the ondemand governor,
>> we can have the frequency high (hence enough CPU power) at the moment we
>> set the function_graph because another CPU is loaded (and both CPUs are
>> sharing the same clock line). The system became stuck at the moment the
>> other CPU went idle with the lowest frequency. That introduced
>> randomness in the issue and made hard to figure out why the RCU stall
>> was happening.
> 
> Adding this, then?

Yes, sure.

Thanks Paul.

  -- Daniel

> 
> 
> commit f7d9ce95064f76be583c775fac32076fa59f1617
> Author: Paul E. McKenney 
> Date:   Thu Aug 10 14:33:17 2017 -0700
> 
> documentation: Slow systems can stall RCU grace periods
> 
> If a fast system has a worst-case grace-period duration of (say) ten
> seconds, then running the same workload on a system ten times as slow
> will get you an RCU CPU stall warning given default stall-warning
> timeout settings.  This commit therefore adds this possibility to
> stallwarn.txt.
> 
> Reported-by: Daniel Lezcano 
> Signed-off-by: Paul E. McKenney 
> 
> diff --git a/Documentation/RCU/stallwarn.txt b/Documentation/RCU/stallwarn.txt
> index 21b8913acbdf..238acbd94917 100644
> --- a/Documentation/RCU/stallwarn.txt
> +++ b/Documentation/RCU/stallwarn.txt
> @@ -70,6 +70,12 @@ o  A periodic interrupt whose handler takes longer than 
> the time
>   considerably longer than normal, which can in turn result in
>   RCU CPU stall warnings.
>  
> +oTesting a workload on a fast system, tuning the stall-warning
> + timeout down to just barely avoid RCU CPU stall warnings, and then
> + running the same workload with the same stall-warning timeout on a
> + slow system.  Note that thermal throttling and on-demand governors
> + can cause a single system to be sometimes fast and sometimes slow!
> +
>  oA hardware or software issue shuts off the scheduler-clock
>   interrupt on a CPU that is not in dyntick-idle mode.  This
>   problem really has happened, and seems to be most likely to
> 


-- 
  Linaro.org │ Open source software for ARM SoCs

Follow Linaro:   Facebook |
 Twitter |
 Blog



Re: RCU stall when using function_graph

2017-08-10 Thread Paul E. McKenney
On Thu, Aug 10, 2017 at 11:45:09AM +0200, Daniel Lezcano wrote:
> On 09/08/2017 19:22, Paul E. McKenney wrote:
> > On Wed, Aug 09, 2017 at 05:51:33PM +0200, Daniel Lezcano wrote:
> >> On 09/08/2017 16:40, Paul E. McKenney wrote:
> >>> On Wed, Aug 09, 2017 at 03:28:05PM +0200, Daniel Lezcano wrote:
>  On 09/08/2017 14:58, Paul E. McKenney wrote:
> > On Wed, Aug 09, 2017 at 02:43:49PM +0530, Pratyush Anand wrote:
> >>
> >>
> >> On Sunday 06 August 2017 10:32 PM, Paul E. McKenney wrote:
> >>> On Sat, Aug 05, 2017 at 02:24:21PM +0900, 김동현 wrote:
>  Dear All
> 
>  As for me, after configuring function_graph as below, crash 
>  disappears.
>  "echo 0 > d/tracing/tracing_on"
>  "sleep 1"
> 
>  "echo function_graph > d/tracing/current_tracer"
>  "sleep 1"
> 
>  "echo smp_call_function_single > d/tracing/set_ftrace_filter"
> >>
> >> It will limit trace output to only for the filtered function
> >> (smp_call_function_single).
> >>
>  adb shell "sleep 1"
> 
>  "echo 1 > d/tracing/tracing_on"
>  adb shell "sleep 1"
> 
>  Right after function_graph is enabled, too many logs are traced upon 
>  IRQ
>  transaction which many times eventually causes stall.
> >>>
> >>> That would do it!
> >>>
> >>> Hmmm...
> >>>
> >>> Steven, would it be helpful if RCU were to inform tracing (say) 
> >>> halfway
> >>> through the RCU CPU stall interval, allowing the tracer to do 
> >>> something
> >>> like cond_resched_rcu_qs()?  I can imagine all sorts of reasons why 
> >>> this
> >>> wouldn't work, for example, if all the tracing was with irqs disabled
> >>> or some such, but figured I should ask.
> >>>
> >>> Does Guillermo's approach work for others?
> >>
> >> Limited output with a couple of filtered function will definitely
> >> not cause RCU schedule stall. But the question is whether we should
> >> expect a full function graph trace working on every platform or not
> >> (specially the one which generates high interrupts)?
> >
> > It might well be that the user must disable RCU CPU stall warnings via
> > the rcu_cpu_stall_suppress sysfs entry (or increase their timeout via th
> > rcu_cpu_stall_timeout sysfs entry) before doing something that greatly
> > increases overhead.  Like enabling large quantities of tracing.  ;-)
> >
> > It -might- be possible to do this automatically, but reliable
> > automation would require that tracing understand how often each
> > function was called, which sounds to me to be a bit of a stretch.
> >
> > Thoughts?
> 
>  A random thought:
> 
>  Is it possible to have a mid-timeout happening and store some
>  information like the instruction pointer, so when the timeout happen we
>  can compare if there was some progress, if yes, very likely, system
>  performance collapsed and we are not fast enough.
> >>>
> >>> RCU already does take various actions for an impending stall, so something
> >>> could be done.  But in most slowdowns, the instruction pointer will be
> >>> changing rapidly, just not as rapidly as it would normally.  So exactly
> >>> how would the forward-progress comparison be carried out?
> >>>
> >>> It would be easy to set up a notifier, so that if any notifier in the
> >>> chain returned an error, stall warnings would be suppressed.  It would
> >>> be harder to figure out when to re-enable them, though I suppose that
> >>> they could be suppressed only for the duration of the current grace
> >>> period or some such.
> >>>
> >>> But what exactly would you use such a notifier for?
> >>>
> >>> Or am I misunderstanding your suggestion?
> >>
> >> Well, may be the instruction pointer thing is not a good idea.
> >>
> >> I learnt from this experience, an overloaded kernel with a lot of
> >> interrupts can hang the console and issue RCU stall.
> >>
> >> However, someone else can face the same situation. Even if he reads the
> >> RCU/stallwarn.txt documentation, it will be hard to figure out the issue.
> >>
> >> A message telling the grace period can't be reached because we are too
> >> busy processing interrupts would have helped but I understand it is not
> >> easy to implement.
> >>
> >> Perhaps, adding a new bullet in the documentation can help:
> >>
> >> "If the interrupt processing time is longer than the interval between
> >> each interrupt, the CPU will keep processing the interrupts without
> >> allowing the RCU's grace period kthread. This situation can happen if
> >> there is a highly rated number of interrupts and the function_graph
> >> tracer is enabled".
> > 
> > How about this?
> 
> Yes, it is clear. Thanks for reformulating it.
> 
> > Any other debugging options that should be called out?  I bet that
> > the function_graph tracer isn't the only way 

Re: RCU stall when using function_graph

2017-08-10 Thread Paul E. McKenney
On Thu, Aug 10, 2017 at 11:45:09AM +0200, Daniel Lezcano wrote:
> On 09/08/2017 19:22, Paul E. McKenney wrote:
> > On Wed, Aug 09, 2017 at 05:51:33PM +0200, Daniel Lezcano wrote:
> >> On 09/08/2017 16:40, Paul E. McKenney wrote:
> >>> On Wed, Aug 09, 2017 at 03:28:05PM +0200, Daniel Lezcano wrote:
>  On 09/08/2017 14:58, Paul E. McKenney wrote:
> > On Wed, Aug 09, 2017 at 02:43:49PM +0530, Pratyush Anand wrote:
> >>
> >>
> >> On Sunday 06 August 2017 10:32 PM, Paul E. McKenney wrote:
> >>> On Sat, Aug 05, 2017 at 02:24:21PM +0900, 김동현 wrote:
>  Dear All
> 
>  As for me, after configuring function_graph as below, crash 
>  disappears.
>  "echo 0 > d/tracing/tracing_on"
>  "sleep 1"
> 
>  "echo function_graph > d/tracing/current_tracer"
>  "sleep 1"
> 
>  "echo smp_call_function_single > d/tracing/set_ftrace_filter"
> >>
> >> It will limit trace output to only for the filtered function
> >> (smp_call_function_single).
> >>
>  adb shell "sleep 1"
> 
>  "echo 1 > d/tracing/tracing_on"
>  adb shell "sleep 1"
> 
>  Right after function_graph is enabled, too many logs are traced upon 
>  IRQ
>  transaction which many times eventually causes stall.
> >>>
> >>> That would do it!
> >>>
> >>> Hmmm...
> >>>
> >>> Steven, would it be helpful if RCU were to inform tracing (say) 
> >>> halfway
> >>> through the RCU CPU stall interval, allowing the tracer to do 
> >>> something
> >>> like cond_resched_rcu_qs()?  I can imagine all sorts of reasons why 
> >>> this
> >>> wouldn't work, for example, if all the tracing was with irqs disabled
> >>> or some such, but figured I should ask.
> >>>
> >>> Does Guillermo's approach work for others?
> >>
> >> Limited output with a couple of filtered function will definitely
> >> not cause RCU schedule stall. But the question is whether we should
> >> expect a full function graph trace working on every platform or not
> >> (specially the one which generates high interrupts)?
> >
> > It might well be that the user must disable RCU CPU stall warnings via
> > the rcu_cpu_stall_suppress sysfs entry (or increase their timeout via th
> > rcu_cpu_stall_timeout sysfs entry) before doing something that greatly
> > increases overhead.  Like enabling large quantities of tracing.  ;-)
> >
> > It -might- be possible to do this automatically, but reliable
> > automation would require that tracing understand how often each
> > function was called, which sounds to me to be a bit of a stretch.
> >
> > Thoughts?
> 
>  A random thought:
> 
>  Is it possible to have a mid-timeout happening and store some
>  information like the instruction pointer, so when the timeout happen we
>  can compare if there was some progress, if yes, very likely, system
>  performance collapsed and we are not fast enough.
> >>>
> >>> RCU already does take various actions for an impending stall, so something
> >>> could be done.  But in most slowdowns, the instruction pointer will be
> >>> changing rapidly, just not as rapidly as it would normally.  So exactly
> >>> how would the forward-progress comparison be carried out?
> >>>
> >>> It would be easy to set up a notifier, so that if any notifier in the
> >>> chain returned an error, stall warnings would be suppressed.  It would
> >>> be harder to figure out when to re-enable them, though I suppose that
> >>> they could be suppressed only for the duration of the current grace
> >>> period or some such.
> >>>
> >>> But what exactly would you use such a notifier for?
> >>>
> >>> Or am I misunderstanding your suggestion?
> >>
> >> Well, may be the instruction pointer thing is not a good idea.
> >>
> >> I learnt from this experience, an overloaded kernel with a lot of
> >> interrupts can hang the console and issue RCU stall.
> >>
> >> However, someone else can face the same situation. Even if he reads the
> >> RCU/stallwarn.txt documentation, it will be hard to figure out the issue.
> >>
> >> A message telling the grace period can't be reached because we are too
> >> busy processing interrupts would have helped but I understand it is not
> >> easy to implement.
> >>
> >> Perhaps, adding a new bullet in the documentation can help:
> >>
> >> "If the interrupt processing time is longer than the interval between
> >> each interrupt, the CPU will keep processing the interrupts without
> >> allowing the RCU's grace period kthread. This situation can happen if
> >> there is a highly rated number of interrupts and the function_graph
> >> tracer is enabled".
> > 
> > How about this?
> 
> Yes, it is clear. Thanks for reformulating it.
> 
> > Any other debugging options that should be called out?  I bet that
> > the function_graph tracer isn't the only way 

Re: RCU stall when using function_graph

2017-08-10 Thread Daniel Lezcano
On 09/08/2017 19:22, Paul E. McKenney wrote:
> On Wed, Aug 09, 2017 at 05:51:33PM +0200, Daniel Lezcano wrote:
>> On 09/08/2017 16:40, Paul E. McKenney wrote:
>>> On Wed, Aug 09, 2017 at 03:28:05PM +0200, Daniel Lezcano wrote:
 On 09/08/2017 14:58, Paul E. McKenney wrote:
> On Wed, Aug 09, 2017 at 02:43:49PM +0530, Pratyush Anand wrote:
>>
>>
>> On Sunday 06 August 2017 10:32 PM, Paul E. McKenney wrote:
>>> On Sat, Aug 05, 2017 at 02:24:21PM +0900, 김동현 wrote:
 Dear All

 As for me, after configuring function_graph as below, crash disappears.
 "echo 0 > d/tracing/tracing_on"
 "sleep 1"

 "echo function_graph > d/tracing/current_tracer"
 "sleep 1"

 "echo smp_call_function_single > d/tracing/set_ftrace_filter"
>>
>> It will limit trace output to only for the filtered function
>> (smp_call_function_single).
>>
 adb shell "sleep 1"

 "echo 1 > d/tracing/tracing_on"
 adb shell "sleep 1"

 Right after function_graph is enabled, too many logs are traced upon 
 IRQ
 transaction which many times eventually causes stall.
>>>
>>> That would do it!
>>>
>>> Hmmm...
>>>
>>> Steven, would it be helpful if RCU were to inform tracing (say) halfway
>>> through the RCU CPU stall interval, allowing the tracer to do something
>>> like cond_resched_rcu_qs()?  I can imagine all sorts of reasons why this
>>> wouldn't work, for example, if all the tracing was with irqs disabled
>>> or some such, but figured I should ask.
>>>
>>> Does Guillermo's approach work for others?
>>
>> Limited output with a couple of filtered function will definitely
>> not cause RCU schedule stall. But the question is whether we should
>> expect a full function graph trace working on every platform or not
>> (specially the one which generates high interrupts)?
>
> It might well be that the user must disable RCU CPU stall warnings via
> the rcu_cpu_stall_suppress sysfs entry (or increase their timeout via th
> rcu_cpu_stall_timeout sysfs entry) before doing something that greatly
> increases overhead.  Like enabling large quantities of tracing.  ;-)
>
> It -might- be possible to do this automatically, but reliable
> automation would require that tracing understand how often each
> function was called, which sounds to me to be a bit of a stretch.
>
> Thoughts?

 A random thought:

 Is it possible to have a mid-timeout happening and store some
 information like the instruction pointer, so when the timeout happen we
 can compare if there was some progress, if yes, very likely, system
 performance collapsed and we are not fast enough.
>>>
>>> RCU already does take various actions for an impending stall, so something
>>> could be done.  But in most slowdowns, the instruction pointer will be
>>> changing rapidly, just not as rapidly as it would normally.  So exactly
>>> how would the forward-progress comparison be carried out?
>>>
>>> It would be easy to set up a notifier, so that if any notifier in the
>>> chain returned an error, stall warnings would be suppressed.  It would
>>> be harder to figure out when to re-enable them, though I suppose that
>>> they could be suppressed only for the duration of the current grace
>>> period or some such.
>>>
>>> But what exactly would you use such a notifier for?
>>>
>>> Or am I misunderstanding your suggestion?
>>
>> Well, may be the instruction pointer thing is not a good idea.
>>
>> I learnt from this experience, an overloaded kernel with a lot of
>> interrupts can hang the console and issue RCU stall.
>>
>> However, someone else can face the same situation. Even if he reads the
>> RCU/stallwarn.txt documentation, it will be hard to figure out the issue.
>>
>> A message telling the grace period can't be reached because we are too
>> busy processing interrupts would have helped but I understand it is not
>> easy to implement.
>>
>> Perhaps, adding a new bullet in the documentation can help:
>>
>> "If the interrupt processing time is longer than the interval between
>> each interrupt, the CPU will keep processing the interrupts without
>> allowing the RCU's grace period kthread. This situation can happen if
>> there is a highly rated number of interrupts and the function_graph
>> tracer is enabled".
> 
> How about this?

Yes, it is clear. Thanks for reformulating it.

> Any other debugging options that should be called out?  I bet that
> the function_graph tracer isn't the only way to make this happen.

Nothing coming in mind but may be worth to mention the slowness of the
CPU is the aggravating factor. In particular I was able to reproduce the
issue by setting to the min CPU frequency. With the ondemand governor,
we can have the frequency high (hence enough CPU power) at the moment we
set the 

Re: RCU stall when using function_graph

2017-08-10 Thread Daniel Lezcano
On 09/08/2017 19:22, Paul E. McKenney wrote:
> On Wed, Aug 09, 2017 at 05:51:33PM +0200, Daniel Lezcano wrote:
>> On 09/08/2017 16:40, Paul E. McKenney wrote:
>>> On Wed, Aug 09, 2017 at 03:28:05PM +0200, Daniel Lezcano wrote:
 On 09/08/2017 14:58, Paul E. McKenney wrote:
> On Wed, Aug 09, 2017 at 02:43:49PM +0530, Pratyush Anand wrote:
>>
>>
>> On Sunday 06 August 2017 10:32 PM, Paul E. McKenney wrote:
>>> On Sat, Aug 05, 2017 at 02:24:21PM +0900, 김동현 wrote:
 Dear All

 As for me, after configuring function_graph as below, crash disappears.
 "echo 0 > d/tracing/tracing_on"
 "sleep 1"

 "echo function_graph > d/tracing/current_tracer"
 "sleep 1"

 "echo smp_call_function_single > d/tracing/set_ftrace_filter"
>>
>> It will limit trace output to only for the filtered function
>> (smp_call_function_single).
>>
 adb shell "sleep 1"

 "echo 1 > d/tracing/tracing_on"
 adb shell "sleep 1"

 Right after function_graph is enabled, too many logs are traced upon 
 IRQ
 transaction which many times eventually causes stall.
>>>
>>> That would do it!
>>>
>>> Hmmm...
>>>
>>> Steven, would it be helpful if RCU were to inform tracing (say) halfway
>>> through the RCU CPU stall interval, allowing the tracer to do something
>>> like cond_resched_rcu_qs()?  I can imagine all sorts of reasons why this
>>> wouldn't work, for example, if all the tracing was with irqs disabled
>>> or some such, but figured I should ask.
>>>
>>> Does Guillermo's approach work for others?
>>
>> Limited output with a couple of filtered function will definitely
>> not cause RCU schedule stall. But the question is whether we should
>> expect a full function graph trace working on every platform or not
>> (specially the one which generates high interrupts)?
>
> It might well be that the user must disable RCU CPU stall warnings via
> the rcu_cpu_stall_suppress sysfs entry (or increase their timeout via th
> rcu_cpu_stall_timeout sysfs entry) before doing something that greatly
> increases overhead.  Like enabling large quantities of tracing.  ;-)
>
> It -might- be possible to do this automatically, but reliable
> automation would require that tracing understand how often each
> function was called, which sounds to me to be a bit of a stretch.
>
> Thoughts?

 A random thought:

 Is it possible to have a mid-timeout happening and store some
 information like the instruction pointer, so when the timeout happen we
 can compare if there was some progress, if yes, very likely, system
 performance collapsed and we are not fast enough.
>>>
>>> RCU already does take various actions for an impending stall, so something
>>> could be done.  But in most slowdowns, the instruction pointer will be
>>> changing rapidly, just not as rapidly as it would normally.  So exactly
>>> how would the forward-progress comparison be carried out?
>>>
>>> It would be easy to set up a notifier, so that if any notifier in the
>>> chain returned an error, stall warnings would be suppressed.  It would
>>> be harder to figure out when to re-enable them, though I suppose that
>>> they could be suppressed only for the duration of the current grace
>>> period or some such.
>>>
>>> But what exactly would you use such a notifier for?
>>>
>>> Or am I misunderstanding your suggestion?
>>
>> Well, may be the instruction pointer thing is not a good idea.
>>
>> I learnt from this experience, an overloaded kernel with a lot of
>> interrupts can hang the console and issue RCU stall.
>>
>> However, someone else can face the same situation. Even if he reads the
>> RCU/stallwarn.txt documentation, it will be hard to figure out the issue.
>>
>> A message telling the grace period can't be reached because we are too
>> busy processing interrupts would have helped but I understand it is not
>> easy to implement.
>>
>> Perhaps, adding a new bullet in the documentation can help:
>>
>> "If the interrupt processing time is longer than the interval between
>> each interrupt, the CPU will keep processing the interrupts without
>> allowing the RCU's grace period kthread. This situation can happen if
>> there is a highly rated number of interrupts and the function_graph
>> tracer is enabled".
> 
> How about this?

Yes, it is clear. Thanks for reformulating it.

> Any other debugging options that should be called out?  I bet that
> the function_graph tracer isn't the only way to make this happen.

Nothing coming in mind but may be worth to mention the slowness of the
CPU is the aggravating factor. In particular I was able to reproduce the
issue by setting to the min CPU frequency. With the ondemand governor,
we can have the frequency high (hence enough CPU power) at the moment we
set the 

Re: RCU stall when using function_graph

2017-08-09 Thread Paul E. McKenney
On Wed, Aug 09, 2017 at 05:51:33PM +0200, Daniel Lezcano wrote:
> On 09/08/2017 16:40, Paul E. McKenney wrote:
> > On Wed, Aug 09, 2017 at 03:28:05PM +0200, Daniel Lezcano wrote:
> >> On 09/08/2017 14:58, Paul E. McKenney wrote:
> >>> On Wed, Aug 09, 2017 at 02:43:49PM +0530, Pratyush Anand wrote:
> 
> 
>  On Sunday 06 August 2017 10:32 PM, Paul E. McKenney wrote:
> > On Sat, Aug 05, 2017 at 02:24:21PM +0900, 김동현 wrote:
> >> Dear All
> >>
> >> As for me, after configuring function_graph as below, crash disappears.
> >> "echo 0 > d/tracing/tracing_on"
> >> "sleep 1"
> >>
> >> "echo function_graph > d/tracing/current_tracer"
> >> "sleep 1"
> >>
> >> "echo smp_call_function_single > d/tracing/set_ftrace_filter"
> 
>  It will limit trace output to only for the filtered function
>  (smp_call_function_single).
> 
> >> adb shell "sleep 1"
> >>
> >> "echo 1 > d/tracing/tracing_on"
> >> adb shell "sleep 1"
> >>
> >> Right after function_graph is enabled, too many logs are traced upon 
> >> IRQ
> >> transaction which many times eventually causes stall.
> >
> > That would do it!
> >
> > Hmmm...
> >
> > Steven, would it be helpful if RCU were to inform tracing (say) halfway
> > through the RCU CPU stall interval, allowing the tracer to do something
> > like cond_resched_rcu_qs()?  I can imagine all sorts of reasons why this
> > wouldn't work, for example, if all the tracing was with irqs disabled
> > or some such, but figured I should ask.
> >
> > Does Guillermo's approach work for others?
> 
>  Limited output with a couple of filtered function will definitely
>  not cause RCU schedule stall. But the question is whether we should
>  expect a full function graph trace working on every platform or not
>  (specially the one which generates high interrupts)?
> >>>
> >>> It might well be that the user must disable RCU CPU stall warnings via
> >>> the rcu_cpu_stall_suppress sysfs entry (or increase their timeout via th
> >>> rcu_cpu_stall_timeout sysfs entry) before doing something that greatly
> >>> increases overhead.  Like enabling large quantities of tracing.  ;-)
> >>>
> >>> It -might- be possible to do this automatically, but reliable
> >>> automation would require that tracing understand how often each
> >>> function was called, which sounds to me to be a bit of a stretch.
> >>>
> >>> Thoughts?
> >>
> >> A random thought:
> >>
> >> Is it possible to have a mid-timeout happening and store some
> >> information like the instruction pointer, so when the timeout happen we
> >> can compare if there was some progress, if yes, very likely, system
> >> performance collapsed and we are not fast enough.
> > 
> > RCU already does take various actions for an impending stall, so something
> > could be done.  But in most slowdowns, the instruction pointer will be
> > changing rapidly, just not as rapidly as it would normally.  So exactly
> > how would the forward-progress comparison be carried out?
> > 
> > It would be easy to set up a notifier, so that if any notifier in the
> > chain returned an error, stall warnings would be suppressed.  It would
> > be harder to figure out when to re-enable them, though I suppose that
> > they could be suppressed only for the duration of the current grace
> > period or some such.
> > 
> > But what exactly would you use such a notifier for?
> > 
> > Or am I misunderstanding your suggestion?
> 
> Well, may be the instruction pointer thing is not a good idea.
> 
> I learnt from this experience, an overloaded kernel with a lot of
> interrupts can hang the console and issue RCU stall.
> 
> However, someone else can face the same situation. Even if he reads the
> RCU/stallwarn.txt documentation, it will be hard to figure out the issue.
> 
> A message telling the grace period can't be reached because we are too
> busy processing interrupts would have helped but I understand it is not
> easy to implement.
> 
> Perhaps, adding a new bullet in the documentation can help:
> 
> "If the interrupt processing time is longer than the interval between
> each interrupt, the CPU will keep processing the interrupts without
> allowing the RCU's grace period kthread. This situation can happen if
> there is a highly rated number of interrupts and the function_graph
> tracer is enabled".

How about this?

Any other debugging options that should be called out?  I bet that
the function_graph tracer isn't the only way to make this happen.

Thanx, Paul



commit 8b12d9919f59fe6855429e2eacc1c2e03ecdfe96
Author: Paul E. McKenney 
Date:   Wed Aug 9 10:16:29 2017 -0700

documentation: Long-running irq handlers can stall RCU grace periods

If a periodic interrupt's handler takes longer to 

Re: RCU stall when using function_graph

2017-08-09 Thread Paul E. McKenney
On Wed, Aug 09, 2017 at 05:51:33PM +0200, Daniel Lezcano wrote:
> On 09/08/2017 16:40, Paul E. McKenney wrote:
> > On Wed, Aug 09, 2017 at 03:28:05PM +0200, Daniel Lezcano wrote:
> >> On 09/08/2017 14:58, Paul E. McKenney wrote:
> >>> On Wed, Aug 09, 2017 at 02:43:49PM +0530, Pratyush Anand wrote:
> 
> 
>  On Sunday 06 August 2017 10:32 PM, Paul E. McKenney wrote:
> > On Sat, Aug 05, 2017 at 02:24:21PM +0900, 김동현 wrote:
> >> Dear All
> >>
> >> As for me, after configuring function_graph as below, crash disappears.
> >> "echo 0 > d/tracing/tracing_on"
> >> "sleep 1"
> >>
> >> "echo function_graph > d/tracing/current_tracer"
> >> "sleep 1"
> >>
> >> "echo smp_call_function_single > d/tracing/set_ftrace_filter"
> 
>  It will limit trace output to only for the filtered function
>  (smp_call_function_single).
> 
> >> adb shell "sleep 1"
> >>
> >> "echo 1 > d/tracing/tracing_on"
> >> adb shell "sleep 1"
> >>
> >> Right after function_graph is enabled, too many logs are traced upon 
> >> IRQ
> >> transaction which many times eventually causes stall.
> >
> > That would do it!
> >
> > Hmmm...
> >
> > Steven, would it be helpful if RCU were to inform tracing (say) halfway
> > through the RCU CPU stall interval, allowing the tracer to do something
> > like cond_resched_rcu_qs()?  I can imagine all sorts of reasons why this
> > wouldn't work, for example, if all the tracing was with irqs disabled
> > or some such, but figured I should ask.
> >
> > Does Guillermo's approach work for others?
> 
>  Limited output with a couple of filtered function will definitely
>  not cause RCU schedule stall. But the question is whether we should
>  expect a full function graph trace working on every platform or not
>  (specially the one which generates high interrupts)?
> >>>
> >>> It might well be that the user must disable RCU CPU stall warnings via
> >>> the rcu_cpu_stall_suppress sysfs entry (or increase their timeout via th
> >>> rcu_cpu_stall_timeout sysfs entry) before doing something that greatly
> >>> increases overhead.  Like enabling large quantities of tracing.  ;-)
> >>>
> >>> It -might- be possible to do this automatically, but reliable
> >>> automation would require that tracing understand how often each
> >>> function was called, which sounds to me to be a bit of a stretch.
> >>>
> >>> Thoughts?
> >>
> >> A random thought:
> >>
> >> Is it possible to have a mid-timeout happening and store some
> >> information like the instruction pointer, so when the timeout happen we
> >> can compare if there was some progress, if yes, very likely, system
> >> performance collapsed and we are not fast enough.
> > 
> > RCU already does take various actions for an impending stall, so something
> > could be done.  But in most slowdowns, the instruction pointer will be
> > changing rapidly, just not as rapidly as it would normally.  So exactly
> > how would the forward-progress comparison be carried out?
> > 
> > It would be easy to set up a notifier, so that if any notifier in the
> > chain returned an error, stall warnings would be suppressed.  It would
> > be harder to figure out when to re-enable them, though I suppose that
> > they could be suppressed only for the duration of the current grace
> > period or some such.
> > 
> > But what exactly would you use such a notifier for?
> > 
> > Or am I misunderstanding your suggestion?
> 
> Well, may be the instruction pointer thing is not a good idea.
> 
> I learnt from this experience, an overloaded kernel with a lot of
> interrupts can hang the console and issue RCU stall.
> 
> However, someone else can face the same situation. Even if he reads the
> RCU/stallwarn.txt documentation, it will be hard to figure out the issue.
> 
> A message telling the grace period can't be reached because we are too
> busy processing interrupts would have helped but I understand it is not
> easy to implement.
> 
> Perhaps, adding a new bullet in the documentation can help:
> 
> "If the interrupt processing time is longer than the interval between
> each interrupt, the CPU will keep processing the interrupts without
> allowing the RCU's grace period kthread. This situation can happen if
> there is a highly rated number of interrupts and the function_graph
> tracer is enabled".

How about this?

Any other debugging options that should be called out?  I bet that
the function_graph tracer isn't the only way to make this happen.

Thanx, Paul



commit 8b12d9919f59fe6855429e2eacc1c2e03ecdfe96
Author: Paul E. McKenney 
Date:   Wed Aug 9 10:16:29 2017 -0700

documentation: Long-running irq handlers can stall RCU grace periods

If a periodic interrupt's handler takes longer to execute than the period

Re: RCU stall when using function_graph

2017-08-09 Thread Daniel Lezcano
On 09/08/2017 16:40, Paul E. McKenney wrote:
> On Wed, Aug 09, 2017 at 03:28:05PM +0200, Daniel Lezcano wrote:
>> On 09/08/2017 14:58, Paul E. McKenney wrote:
>>> On Wed, Aug 09, 2017 at 02:43:49PM +0530, Pratyush Anand wrote:


 On Sunday 06 August 2017 10:32 PM, Paul E. McKenney wrote:
> On Sat, Aug 05, 2017 at 02:24:21PM +0900, 김동현 wrote:
>> Dear All
>>
>> As for me, after configuring function_graph as below, crash disappears.
>> "echo 0 > d/tracing/tracing_on"
>> "sleep 1"
>>
>> "echo function_graph > d/tracing/current_tracer"
>> "sleep 1"
>>
>> "echo smp_call_function_single > d/tracing/set_ftrace_filter"

 It will limit trace output to only for the filtered function
 (smp_call_function_single).

>> adb shell "sleep 1"
>>
>> "echo 1 > d/tracing/tracing_on"
>> adb shell "sleep 1"
>>
>> Right after function_graph is enabled, too many logs are traced upon IRQ
>> transaction which many times eventually causes stall.
>
> That would do it!
>
> Hmmm...
>
> Steven, would it be helpful if RCU were to inform tracing (say) halfway
> through the RCU CPU stall interval, allowing the tracer to do something
> like cond_resched_rcu_qs()?  I can imagine all sorts of reasons why this
> wouldn't work, for example, if all the tracing was with irqs disabled
> or some such, but figured I should ask.
>
> Does Guillermo's approach work for others?

 Limited output with a couple of filtered function will definitely
 not cause RCU schedule stall. But the question is whether we should
 expect a full function graph trace working on every platform or not
 (specially the one which generates high interrupts)?
>>>
>>> It might well be that the user must disable RCU CPU stall warnings via
>>> the rcu_cpu_stall_suppress sysfs entry (or increase their timeout via th
>>> rcu_cpu_stall_timeout sysfs entry) before doing something that greatly
>>> increases overhead.  Like enabling large quantities of tracing.  ;-)
>>>
>>> It -might- be possible to do this automatically, but reliable
>>> automation would require that tracing understand how often each
>>> function was called, which sounds to me to be a bit of a stretch.
>>>
>>> Thoughts?
>>
>> A random thought:
>>
>> Is it possible to have a mid-timeout happening and store some
>> information like the instruction pointer, so when the timeout happen we
>> can compare if there was some progress, if yes, very likely, system
>> performance collapsed and we are not fast enough.
> 
> RCU already does take various actions for an impending stall, so something
> could be done.  But in most slowdowns, the instruction pointer will be
> changing rapidly, just not as rapidly as it would normally.  So exactly
> how would the forward-progress comparison be carried out?
> 
> It would be easy to set up a notifier, so that if any notifier in the
> chain returned an error, stall warnings would be suppressed.  It would
> be harder to figure out when to re-enable them, though I suppose that
> they could be suppressed only for the duration of the current grace
> period or some such.
> 
> But what exactly would you use such a notifier for?
> 
> Or am I misunderstanding your suggestion?

Well, may be the instruction pointer thing is not a good idea.

I learnt from this experience, an overloaded kernel with a lot of
interrupts can hang the console and issue RCU stall.

However, someone else can face the same situation. Even if he reads the
RCU/stallwarn.txt documentation, it will be hard to figure out the issue.

A message telling the grace period can't be reached because we are too
busy processing interrupts would have helped but I understand it is not
easy to implement.

Perhaps, adding a new bullet in the documentation can help:

"If the interrupt processing time is longer than the interval between
each interrupt, the CPU will keep processing the interrupts without
allowing the RCU's grace period kthread. This situation can happen if
there is a highly rated number of interrupts and the function_graph
tracer is enabled".




-- 
  Linaro.org │ Open source software for ARM SoCs

Follow Linaro:   Facebook |
 Twitter |
 Blog



Re: RCU stall when using function_graph

2017-08-09 Thread Daniel Lezcano
On 09/08/2017 16:40, Paul E. McKenney wrote:
> On Wed, Aug 09, 2017 at 03:28:05PM +0200, Daniel Lezcano wrote:
>> On 09/08/2017 14:58, Paul E. McKenney wrote:
>>> On Wed, Aug 09, 2017 at 02:43:49PM +0530, Pratyush Anand wrote:


 On Sunday 06 August 2017 10:32 PM, Paul E. McKenney wrote:
> On Sat, Aug 05, 2017 at 02:24:21PM +0900, 김동현 wrote:
>> Dear All
>>
>> As for me, after configuring function_graph as below, crash disappears.
>> "echo 0 > d/tracing/tracing_on"
>> "sleep 1"
>>
>> "echo function_graph > d/tracing/current_tracer"
>> "sleep 1"
>>
>> "echo smp_call_function_single > d/tracing/set_ftrace_filter"

 It will limit trace output to only for the filtered function
 (smp_call_function_single).

>> adb shell "sleep 1"
>>
>> "echo 1 > d/tracing/tracing_on"
>> adb shell "sleep 1"
>>
>> Right after function_graph is enabled, too many logs are traced upon IRQ
>> transaction which many times eventually causes stall.
>
> That would do it!
>
> Hmmm...
>
> Steven, would it be helpful if RCU were to inform tracing (say) halfway
> through the RCU CPU stall interval, allowing the tracer to do something
> like cond_resched_rcu_qs()?  I can imagine all sorts of reasons why this
> wouldn't work, for example, if all the tracing was with irqs disabled
> or some such, but figured I should ask.
>
> Does Guillermo's approach work for others?

 Limited output with a couple of filtered function will definitely
 not cause RCU schedule stall. But the question is whether we should
 expect a full function graph trace working on every platform or not
 (specially the one which generates high interrupts)?
>>>
>>> It might well be that the user must disable RCU CPU stall warnings via
>>> the rcu_cpu_stall_suppress sysfs entry (or increase their timeout via th
>>> rcu_cpu_stall_timeout sysfs entry) before doing something that greatly
>>> increases overhead.  Like enabling large quantities of tracing.  ;-)
>>>
>>> It -might- be possible to do this automatically, but reliable
>>> automation would require that tracing understand how often each
>>> function was called, which sounds to me to be a bit of a stretch.
>>>
>>> Thoughts?
>>
>> A random thought:
>>
>> Is it possible to have a mid-timeout happening and store some
>> information like the instruction pointer, so when the timeout happen we
>> can compare if there was some progress, if yes, very likely, system
>> performance collapsed and we are not fast enough.
> 
> RCU already does take various actions for an impending stall, so something
> could be done.  But in most slowdowns, the instruction pointer will be
> changing rapidly, just not as rapidly as it would normally.  So exactly
> how would the forward-progress comparison be carried out?
> 
> It would be easy to set up a notifier, so that if any notifier in the
> chain returned an error, stall warnings would be suppressed.  It would
> be harder to figure out when to re-enable them, though I suppose that
> they could be suppressed only for the duration of the current grace
> period or some such.
> 
> But what exactly would you use such a notifier for?
> 
> Or am I misunderstanding your suggestion?

Well, may be the instruction pointer thing is not a good idea.

I learnt from this experience, an overloaded kernel with a lot of
interrupts can hang the console and issue RCU stall.

However, someone else can face the same situation. Even if he reads the
RCU/stallwarn.txt documentation, it will be hard to figure out the issue.

A message telling the grace period can't be reached because we are too
busy processing interrupts would have helped but I understand it is not
easy to implement.

Perhaps, adding a new bullet in the documentation can help:

"If the interrupt processing time is longer than the interval between
each interrupt, the CPU will keep processing the interrupts without
allowing the RCU's grace period kthread. This situation can happen if
there is a highly rated number of interrupts and the function_graph
tracer is enabled".




-- 
  Linaro.org │ Open source software for ARM SoCs

Follow Linaro:   Facebook |
 Twitter |
 Blog



Re: RCU stall when using function_graph

2017-08-09 Thread Paul E. McKenney
On Wed, Aug 09, 2017 at 03:28:05PM +0200, Daniel Lezcano wrote:
> On 09/08/2017 14:58, Paul E. McKenney wrote:
> > On Wed, Aug 09, 2017 at 02:43:49PM +0530, Pratyush Anand wrote:
> >>
> >>
> >> On Sunday 06 August 2017 10:32 PM, Paul E. McKenney wrote:
> >>> On Sat, Aug 05, 2017 at 02:24:21PM +0900, 김동현 wrote:
>  Dear All
> 
>  As for me, after configuring function_graph as below, crash disappears.
>  "echo 0 > d/tracing/tracing_on"
>  "sleep 1"
> 
>  "echo function_graph > d/tracing/current_tracer"
>  "sleep 1"
> 
>  "echo smp_call_function_single > d/tracing/set_ftrace_filter"
> >>
> >> It will limit trace output to only for the filtered function
> >> (smp_call_function_single).
> >>
>  adb shell "sleep 1"
> 
>  "echo 1 > d/tracing/tracing_on"
>  adb shell "sleep 1"
> 
>  Right after function_graph is enabled, too many logs are traced upon IRQ
>  transaction which many times eventually causes stall.
> >>>
> >>> That would do it!
> >>>
> >>> Hmmm...
> >>>
> >>> Steven, would it be helpful if RCU were to inform tracing (say) halfway
> >>> through the RCU CPU stall interval, allowing the tracer to do something
> >>> like cond_resched_rcu_qs()?  I can imagine all sorts of reasons why this
> >>> wouldn't work, for example, if all the tracing was with irqs disabled
> >>> or some such, but figured I should ask.
> >>>
> >>> Does Guillermo's approach work for others?
> >>
> >> Limited output with a couple of filtered function will definitely
> >> not cause RCU schedule stall. But the question is whether we should
> >> expect a full function graph trace working on every platform or not
> >> (specially the one which generates high interrupts)?
> > 
> > It might well be that the user must disable RCU CPU stall warnings via
> > the rcu_cpu_stall_suppress sysfs entry (or increase their timeout via th
> > rcu_cpu_stall_timeout sysfs entry) before doing something that greatly
> > increases overhead.  Like enabling large quantities of tracing.  ;-)
> > 
> > It -might- be possible to do this automatically, but reliable
> > automation would require that tracing understand how often each
> > function was called, which sounds to me to be a bit of a stretch.
> > 
> > Thoughts?
> 
> A random thought:
> 
> Is it possible to have a mid-timeout happening and store some
> information like the instruction pointer, so when the timeout happen we
> can compare if there was some progress, if yes, very likely, system
> performance collapsed and we are not fast enough.

RCU already does take various actions for an impending stall, so something
could be done.  But in most slowdowns, the instruction pointer will be
changing rapidly, just not as rapidly as it would normally.  So exactly
how would the forward-progress comparison be carried out?

It would be easy to set up a notifier, so that if any notifier in the
chain returned an error, stall warnings would be suppressed.  It would
be harder to figure out when to re-enable them, though I suppose that
they could be suppressed only for the duration of the current grace
period or some such.

But what exactly would you use such a notifier for?

Or am I misunderstanding your suggestion?

Thanx, Paul



Re: RCU stall when using function_graph

2017-08-09 Thread Paul E. McKenney
On Wed, Aug 09, 2017 at 03:28:05PM +0200, Daniel Lezcano wrote:
> On 09/08/2017 14:58, Paul E. McKenney wrote:
> > On Wed, Aug 09, 2017 at 02:43:49PM +0530, Pratyush Anand wrote:
> >>
> >>
> >> On Sunday 06 August 2017 10:32 PM, Paul E. McKenney wrote:
> >>> On Sat, Aug 05, 2017 at 02:24:21PM +0900, 김동현 wrote:
>  Dear All
> 
>  As for me, after configuring function_graph as below, crash disappears.
>  "echo 0 > d/tracing/tracing_on"
>  "sleep 1"
> 
>  "echo function_graph > d/tracing/current_tracer"
>  "sleep 1"
> 
>  "echo smp_call_function_single > d/tracing/set_ftrace_filter"
> >>
> >> It will limit trace output to only for the filtered function
> >> (smp_call_function_single).
> >>
>  adb shell "sleep 1"
> 
>  "echo 1 > d/tracing/tracing_on"
>  adb shell "sleep 1"
> 
>  Right after function_graph is enabled, too many logs are traced upon IRQ
>  transaction which many times eventually causes stall.
> >>>
> >>> That would do it!
> >>>
> >>> Hmmm...
> >>>
> >>> Steven, would it be helpful if RCU were to inform tracing (say) halfway
> >>> through the RCU CPU stall interval, allowing the tracer to do something
> >>> like cond_resched_rcu_qs()?  I can imagine all sorts of reasons why this
> >>> wouldn't work, for example, if all the tracing was with irqs disabled
> >>> or some such, but figured I should ask.
> >>>
> >>> Does Guillermo's approach work for others?
> >>
> >> Limited output with a couple of filtered function will definitely
> >> not cause RCU schedule stall. But the question is whether we should
> >> expect a full function graph trace working on every platform or not
> >> (specially the one which generates high interrupts)?
> > 
> > It might well be that the user must disable RCU CPU stall warnings via
> > the rcu_cpu_stall_suppress sysfs entry (or increase their timeout via th
> > rcu_cpu_stall_timeout sysfs entry) before doing something that greatly
> > increases overhead.  Like enabling large quantities of tracing.  ;-)
> > 
> > It -might- be possible to do this automatically, but reliable
> > automation would require that tracing understand how often each
> > function was called, which sounds to me to be a bit of a stretch.
> > 
> > Thoughts?
> 
> A random thought:
> 
> Is it possible to have a mid-timeout happening and store some
> information like the instruction pointer, so when the timeout happen we
> can compare if there was some progress, if yes, very likely, system
> performance collapsed and we are not fast enough.

RCU already does take various actions for an impending stall, so something
could be done.  But in most slowdowns, the instruction pointer will be
changing rapidly, just not as rapidly as it would normally.  So exactly
how would the forward-progress comparison be carried out?

It would be easy to set up a notifier, so that if any notifier in the
chain returned an error, stall warnings would be suppressed.  It would
be harder to figure out when to re-enable them, though I suppose that
they could be suppressed only for the duration of the current grace
period or some such.

But what exactly would you use such a notifier for?

Or am I misunderstanding your suggestion?

Thanx, Paul



Re: RCU stall when using function_graph

2017-08-09 Thread Daniel Lezcano
On 09/08/2017 14:58, Paul E. McKenney wrote:
> On Wed, Aug 09, 2017 at 02:43:49PM +0530, Pratyush Anand wrote:
>>
>>
>> On Sunday 06 August 2017 10:32 PM, Paul E. McKenney wrote:
>>> On Sat, Aug 05, 2017 at 02:24:21PM +0900, 김동현 wrote:
 Dear All

 As for me, after configuring function_graph as below, crash disappears.
 "echo 0 > d/tracing/tracing_on"
 "sleep 1"

 "echo function_graph > d/tracing/current_tracer"
 "sleep 1"

 "echo smp_call_function_single > d/tracing/set_ftrace_filter"
>>
>> It will limit trace output to only for the filtered function
>> (smp_call_function_single).
>>
 adb shell "sleep 1"

 "echo 1 > d/tracing/tracing_on"
 adb shell "sleep 1"

 Right after function_graph is enabled, too many logs are traced upon IRQ
 transaction which many times eventually causes stall.
>>>
>>> That would do it!
>>>
>>> Hmmm...
>>>
>>> Steven, would it be helpful if RCU were to inform tracing (say) halfway
>>> through the RCU CPU stall interval, allowing the tracer to do something
>>> like cond_resched_rcu_qs()?  I can imagine all sorts of reasons why this
>>> wouldn't work, for example, if all the tracing was with irqs disabled
>>> or some such, but figured I should ask.
>>>
>>> Does Guillermo's approach work for others?
>>
>> Limited output with a couple of filtered function will definitely
>> not cause RCU schedule stall. But the question is whether we should
>> expect a full function graph trace working on every platform or not
>> (specially the one which generates high interrupts)?
> 
> It might well be that the user must disable RCU CPU stall warnings via
> the rcu_cpu_stall_suppress sysfs entry (or increase their timeout via th
> rcu_cpu_stall_timeout sysfs entry) before doing something that greatly
> increases overhead.  Like enabling large quantities of tracing.  ;-)
> 
> It -might- be possible to do this automatically, but reliable
> automation would require that tracing understand how often each
> function was called, which sounds to me to be a bit of a stretch.
> 
> Thoughts?

A random thought:

Is it possible to have a mid-timeout happening and store some
information like the instruction pointer, so when the timeout happen we
can compare if there was some progress, if yes, very likely, system
performance collapsed and we are not fast enough.

-- 
  Linaro.org │ Open source software for ARM SoCs

Follow Linaro:   Facebook |
 Twitter |
 Blog



Re: RCU stall when using function_graph

2017-08-09 Thread Daniel Lezcano
On 09/08/2017 14:58, Paul E. McKenney wrote:
> On Wed, Aug 09, 2017 at 02:43:49PM +0530, Pratyush Anand wrote:
>>
>>
>> On Sunday 06 August 2017 10:32 PM, Paul E. McKenney wrote:
>>> On Sat, Aug 05, 2017 at 02:24:21PM +0900, 김동현 wrote:
 Dear All

 As for me, after configuring function_graph as below, crash disappears.
 "echo 0 > d/tracing/tracing_on"
 "sleep 1"

 "echo function_graph > d/tracing/current_tracer"
 "sleep 1"

 "echo smp_call_function_single > d/tracing/set_ftrace_filter"
>>
>> It will limit trace output to only for the filtered function
>> (smp_call_function_single).
>>
 adb shell "sleep 1"

 "echo 1 > d/tracing/tracing_on"
 adb shell "sleep 1"

 Right after function_graph is enabled, too many logs are traced upon IRQ
 transaction which many times eventually causes stall.
>>>
>>> That would do it!
>>>
>>> Hmmm...
>>>
>>> Steven, would it be helpful if RCU were to inform tracing (say) halfway
>>> through the RCU CPU stall interval, allowing the tracer to do something
>>> like cond_resched_rcu_qs()?  I can imagine all sorts of reasons why this
>>> wouldn't work, for example, if all the tracing was with irqs disabled
>>> or some such, but figured I should ask.
>>>
>>> Does Guillermo's approach work for others?
>>
>> Limited output with a couple of filtered function will definitely
>> not cause RCU schedule stall. But the question is whether we should
>> expect a full function graph trace working on every platform or not
>> (specially the one which generates high interrupts)?
> 
> It might well be that the user must disable RCU CPU stall warnings via
> the rcu_cpu_stall_suppress sysfs entry (or increase their timeout via th
> rcu_cpu_stall_timeout sysfs entry) before doing something that greatly
> increases overhead.  Like enabling large quantities of tracing.  ;-)
> 
> It -might- be possible to do this automatically, but reliable
> automation would require that tracing understand how often each
> function was called, which sounds to me to be a bit of a stretch.
> 
> Thoughts?

A random thought:

Is it possible to have a mid-timeout happening and store some
information like the instruction pointer, so when the timeout happen we
can compare if there was some progress, if yes, very likely, system
performance collapsed and we are not fast enough.

-- 
  Linaro.org │ Open source software for ARM SoCs

Follow Linaro:   Facebook |
 Twitter |
 Blog



Re: RCU stall when using function_graph

2017-08-09 Thread Paul E. McKenney
On Wed, Aug 09, 2017 at 02:43:49PM +0530, Pratyush Anand wrote:
> 
> 
> On Sunday 06 August 2017 10:32 PM, Paul E. McKenney wrote:
> >On Sat, Aug 05, 2017 at 02:24:21PM +0900, 김동현 wrote:
> >>Dear All
> >>
> >>As for me, after configuring function_graph as below, crash disappears.
> >>"echo 0 > d/tracing/tracing_on"
> >>"sleep 1"
> >>
> >>"echo function_graph > d/tracing/current_tracer"
> >>"sleep 1"
> >>
> >>"echo smp_call_function_single > d/tracing/set_ftrace_filter"
> 
> It will limit trace output to only for the filtered function
> (smp_call_function_single).
> 
> >>adb shell "sleep 1"
> >>
> >>"echo 1 > d/tracing/tracing_on"
> >>adb shell "sleep 1"
> >>
> >>Right after function_graph is enabled, too many logs are traced upon IRQ
> >>transaction which many times eventually causes stall.
> >
> >That would do it!
> >
> >Hmmm...
> >
> >Steven, would it be helpful if RCU were to inform tracing (say) halfway
> >through the RCU CPU stall interval, allowing the tracer to do something
> >like cond_resched_rcu_qs()?  I can imagine all sorts of reasons why this
> >wouldn't work, for example, if all the tracing was with irqs disabled
> >or some such, but figured I should ask.
> >
> >Does Guillermo's approach work for others?
> 
> Limited output with a couple of filtered function will definitely
> not cause RCU schedule stall. But the question is whether we should
> expect a full function graph trace working on every platform or not
> (specially the one which generates high interrupts)?

It might well be that the user must disable RCU CPU stall warnings via
the rcu_cpu_stall_suppress sysfs entry (or increase their timeout via the
rcu_cpu_stall_timeout sysfs entry) before doing something that greatly
increases overhead.  Like enabling large quantities of tracing.  ;-)

It -might- be possible to do this automatically, but reliable
automation would require that tracing understand how often each
function was called, which sounds to me to be a bit of a stretch.

Thoughts?

Thanx, Paul



Re: RCU stall when using function_graph

2017-08-09 Thread Paul E. McKenney
On Wed, Aug 09, 2017 at 02:43:49PM +0530, Pratyush Anand wrote:
> 
> 
> On Sunday 06 August 2017 10:32 PM, Paul E. McKenney wrote:
> >On Sat, Aug 05, 2017 at 02:24:21PM +0900, 김동현 wrote:
> >>Dear All
> >>
> >>As for me, after configuring function_graph as below, crash disappears.
> >>"echo 0 > d/tracing/tracing_on"
> >>"sleep 1"
> >>
> >>"echo function_graph > d/tracing/current_tracer"
> >>"sleep 1"
> >>
> >>"echo smp_call_function_single > d/tracing/set_ftrace_filter"
> 
> It will limit trace output to only for the filtered function
> (smp_call_function_single).
> 
> >>adb shell "sleep 1"
> >>
> >>"echo 1 > d/tracing/tracing_on"
> >>adb shell "sleep 1"
> >>
> >>Right after function_graph is enabled, too many logs are traced upon IRQ
> >>transaction which many times eventually causes stall.
> >
> >That would do it!
> >
> >Hmmm...
> >
> >Steven, would it be helpful if RCU were to inform tracing (say) halfway
> >through the RCU CPU stall interval, allowing the tracer to do something
> >like cond_resched_rcu_qs()?  I can imagine all sorts of reasons why this
> >wouldn't work, for example, if all the tracing was with irqs disabled
> >or some such, but figured I should ask.
> >
> >Does Guillermo's approach work for others?
> 
> Limited output with a couple of filtered function will definitely
> not cause RCU schedule stall. But the question is whether we should
> expect a full function graph trace working on every platform or not
> (specially the one which generates high interrupts)?

It might well be that the user must disable RCU CPU stall warnings via
the rcu_cpu_stall_suppress sysfs entry (or increase their timeout via the
rcu_cpu_stall_timeout sysfs entry) before doing something that greatly
increases overhead.  Like enabling large quantities of tracing.  ;-)

It -might- be possible to do this automatically, but reliable
automation would require that tracing understand how often each
function was called, which sounds to me to be a bit of a stretch.

Thoughts?

Thanx, Paul



Re: RCU stall when using function_graph

2017-08-09 Thread Pratyush Anand



On Sunday 06 August 2017 10:32 PM, Paul E. McKenney wrote:

On Sat, Aug 05, 2017 at 02:24:21PM +0900, 김동현 wrote:

Dear All

As for me, after configuring function_graph as below, crash disappears.
"echo 0 > d/tracing/tracing_on"
"sleep 1"

"echo function_graph > d/tracing/current_tracer"
"sleep 1"

"echo smp_call_function_single > d/tracing/set_ftrace_filter"


It will limit trace output to only for the filtered function 
(smp_call_function_single).



adb shell "sleep 1"

"echo 1 > d/tracing/tracing_on"
adb shell "sleep 1"

Right after function_graph is enabled, too many logs are traced upon IRQ
transaction which many times eventually causes stall.


That would do it!

Hmmm...

Steven, would it be helpful if RCU were to inform tracing (say) halfway
through the RCU CPU stall interval, allowing the tracer to do something
like cond_resched_rcu_qs()?  I can imagine all sorts of reasons why this
wouldn't work, for example, if all the tracing was with irqs disabled
or some such, but figured I should ask.

Does Guillermo's approach work for others?


Limited output with a couple of filtered function will definitely not cause 
RCU schedule stall. But the question is whether we should expect a full 
function graph trace working on every platform or not (specially the one which 
generates high interrupts)?


--
Regards
Pratyush


Re: RCU stall when using function_graph

2017-08-09 Thread Pratyush Anand



On Sunday 06 August 2017 10:32 PM, Paul E. McKenney wrote:

On Sat, Aug 05, 2017 at 02:24:21PM +0900, 김동현 wrote:

Dear All

As for me, after configuring function_graph as below, crash disappears.
"echo 0 > d/tracing/tracing_on"
"sleep 1"

"echo function_graph > d/tracing/current_tracer"
"sleep 1"

"echo smp_call_function_single > d/tracing/set_ftrace_filter"


It will limit trace output to only for the filtered function 
(smp_call_function_single).



adb shell "sleep 1"

"echo 1 > d/tracing/tracing_on"
adb shell "sleep 1"

Right after function_graph is enabled, too many logs are traced upon IRQ
transaction which many times eventually causes stall.


That would do it!

Hmmm...

Steven, would it be helpful if RCU were to inform tracing (say) halfway
through the RCU CPU stall interval, allowing the tracer to do something
like cond_resched_rcu_qs()?  I can imagine all sorts of reasons why this
wouldn't work, for example, if all the tracing was with irqs disabled
or some such, but figured I should ask.

Does Guillermo's approach work for others?


Limited output with a couple of filtered function will definitely not cause 
RCU schedule stall. But the question is whether we should expect a full 
function graph trace working on every platform or not (specially the one which 
generates high interrupts)?


--
Regards
Pratyush


Re: RCU stall when using function_graph

2017-08-06 Thread Paul E. McKenney
On Sat, Aug 05, 2017 at 02:24:21PM +0900, 김동현 wrote:
> Dear All
> 
> As for me, after configuring function_graph as below, crash disappears.
> "echo 0 > d/tracing/tracing_on"
> "sleep 1"
> 
> "echo function_graph > d/tracing/current_tracer"
> "sleep 1"
> 
> "echo smp_call_function_single > d/tracing/set_ftrace_filter"
> adb shell "sleep 1"
> 
> "echo 1 > d/tracing/tracing_on"
> adb shell "sleep 1"
> 
> Right after function_graph is enabled, too many logs are traced upon IRQ
> transaction which many times eventually causes stall.

That would do it!

Hmmm...

Steven, would it be helpful if RCU were to inform tracing (say) halfway
through the RCU CPU stall interval, allowing the tracer to do something
like cond_resched_rcu_qs()?  I can imagine all sorts of reasons why this
wouldn't work, for example, if all the tracing was with irqs disabled
or some such, but figured I should ask.

Does Guillermo's approach work for others?

Thanx, Paul

> BR,
> Guillermo Austin Kim
> 
> 2017. 8. 3. 오후 11:38에 "Daniel Lezcano" 님이 작성:
> 
> On Thu, Aug 03, 2017 at 05:44:21AM -0700, Paul E. McKenney wrote:
> 
> [ ... ]
> 
> > > > BTW, function_graph tracer is the most invasive of the tracers. It's
> 4x
> > > > slower than function tracer. I'm wondering if the tracer isn't the
> > > > cause, but just slows things down enough to cause a some other race
> > > > condition that triggers the bug.
> > >
> > > Yes, that could be true.
> > >
> > > I tried the following scenario:
> > >
> > >  - cpufreq governor => userspace + max_freq (1.2GHz)
> > >- function_graph set ==> OK
> > >
> > >  - cpufreq governor => userspace + min_freq (200MHz)
> > >- function_graph set ==> RCU stall
> > >
> > > Beside that, I realize the board is constantly processing SOF interrupts
> > > every 124us, so that adds more overhead.
> > >
> > > Removing the USB support, thus the associated processing for the SOF
> > > interrupts, I don't see anymore the RCU stall.
> >
> > Looks like Steve called this one!  ;-)
> 
> Yep :)
> 
> > > Is it the expected behavior to have the system hang after a RCU stall
> > > raises ?
> >
> > No, but if NMI stack traces are enabled and there are any NMI problems,
> > bad things can happen.  In addition, the bulk of output can cause problems
> > if you have a slow console connection.
> 
> Ok, thanks.
> 
>   -- Daniel
> 
> --
> 
>   Linaro.org │ Open source software for ARM SoCs
> 
> Follow Linaro:   Facebook |
>  Twitter |
>  Blog



Re: RCU stall when using function_graph

2017-08-06 Thread Paul E. McKenney
On Sat, Aug 05, 2017 at 02:24:21PM +0900, 김동현 wrote:
> Dear All
> 
> As for me, after configuring function_graph as below, crash disappears.
> "echo 0 > d/tracing/tracing_on"
> "sleep 1"
> 
> "echo function_graph > d/tracing/current_tracer"
> "sleep 1"
> 
> "echo smp_call_function_single > d/tracing/set_ftrace_filter"
> adb shell "sleep 1"
> 
> "echo 1 > d/tracing/tracing_on"
> adb shell "sleep 1"
> 
> Right after function_graph is enabled, too many logs are traced upon IRQ
> transaction which many times eventually causes stall.

That would do it!

Hmmm...

Steven, would it be helpful if RCU were to inform tracing (say) halfway
through the RCU CPU stall interval, allowing the tracer to do something
like cond_resched_rcu_qs()?  I can imagine all sorts of reasons why this
wouldn't work, for example, if all the tracing was with irqs disabled
or some such, but figured I should ask.

Does Guillermo's approach work for others?

Thanx, Paul

> BR,
> Guillermo Austin Kim
> 
> 2017. 8. 3. 오후 11:38에 "Daniel Lezcano" 님이 작성:
> 
> On Thu, Aug 03, 2017 at 05:44:21AM -0700, Paul E. McKenney wrote:
> 
> [ ... ]
> 
> > > > BTW, function_graph tracer is the most invasive of the tracers. It's
> 4x
> > > > slower than function tracer. I'm wondering if the tracer isn't the
> > > > cause, but just slows things down enough to cause a some other race
> > > > condition that triggers the bug.
> > >
> > > Yes, that could be true.
> > >
> > > I tried the following scenario:
> > >
> > >  - cpufreq governor => userspace + max_freq (1.2GHz)
> > >- function_graph set ==> OK
> > >
> > >  - cpufreq governor => userspace + min_freq (200MHz)
> > >- function_graph set ==> RCU stall
> > >
> > > Beside that, I realize the board is constantly processing SOF interrupts
> > > every 124us, so that adds more overhead.
> > >
> > > Removing the USB support, thus the associated processing for the SOF
> > > interrupts, I don't see anymore the RCU stall.
> >
> > Looks like Steve called this one!  ;-)
> 
> Yep :)
> 
> > > Is it the expected behavior to have the system hang after a RCU stall
> > > raises ?
> >
> > No, but if NMI stack traces are enabled and there are any NMI problems,
> > bad things can happen.  In addition, the bulk of output can cause problems
> > if you have a slow console connection.
> 
> Ok, thanks.
> 
>   -- Daniel
> 
> --
> 
>   Linaro.org │ Open source software for ARM SoCs
> 
> Follow Linaro:   Facebook |
>  Twitter |
>  Blog



Re: RCU stall when using function_graph

2017-08-03 Thread Daniel Lezcano
On Thu, Aug 03, 2017 at 05:44:21AM -0700, Paul E. McKenney wrote:

[ ... ]

> > > BTW, function_graph tracer is the most invasive of the tracers. It's 4x
> > > slower than function tracer. I'm wondering if the tracer isn't the
> > > cause, but just slows things down enough to cause a some other race
> > > condition that triggers the bug.
> > 
> > Yes, that could be true.
> > 
> > I tried the following scenario:
> > 
> >  - cpufreq governor => userspace + max_freq (1.2GHz)
> >- function_graph set ==> OK
> > 
> >  - cpufreq governor => userspace + min_freq (200MHz)
> >- function_graph set ==> RCU stall
> > 
> > Beside that, I realize the board is constantly processing SOF interrupts
> > every 124us, so that adds more overhead.
> > 
> > Removing the USB support, thus the associated processing for the SOF
> > interrupts, I don't see anymore the RCU stall.
> 
> Looks like Steve called this one!  ;-)

Yep :)

> > Is it the expected behavior to have the system hang after a RCU stall
> > raises ?
> 
> No, but if NMI stack traces are enabled and there are any NMI problems,
> bad things can happen.  In addition, the bulk of output can cause problems
> if you have a slow console connection.

Ok, thanks.

  -- Daniel

-- 

  Linaro.org │ Open source software for ARM SoCs

Follow Linaro:   Facebook |
 Twitter |
 Blog


Re: RCU stall when using function_graph

2017-08-03 Thread Daniel Lezcano
On Thu, Aug 03, 2017 at 05:44:21AM -0700, Paul E. McKenney wrote:

[ ... ]

> > > BTW, function_graph tracer is the most invasive of the tracers. It's 4x
> > > slower than function tracer. I'm wondering if the tracer isn't the
> > > cause, but just slows things down enough to cause a some other race
> > > condition that triggers the bug.
> > 
> > Yes, that could be true.
> > 
> > I tried the following scenario:
> > 
> >  - cpufreq governor => userspace + max_freq (1.2GHz)
> >- function_graph set ==> OK
> > 
> >  - cpufreq governor => userspace + min_freq (200MHz)
> >- function_graph set ==> RCU stall
> > 
> > Beside that, I realize the board is constantly processing SOF interrupts
> > every 124us, so that adds more overhead.
> > 
> > Removing the USB support, thus the associated processing for the SOF
> > interrupts, I don't see anymore the RCU stall.
> 
> Looks like Steve called this one!  ;-)

Yep :)

> > Is it the expected behavior to have the system hang after a RCU stall
> > raises ?
> 
> No, but if NMI stack traces are enabled and there are any NMI problems,
> bad things can happen.  In addition, the bulk of output can cause problems
> if you have a slow console connection.

Ok, thanks.

  -- Daniel

-- 

  Linaro.org │ Open source software for ARM SoCs

Follow Linaro:   Facebook |
 Twitter |
 Blog


Re: RCU stall when using function_graph

2017-08-03 Thread Paul E. McKenney
On Thu, Aug 03, 2017 at 01:41:11PM +0200, Daniel Lezcano wrote:
> On 02/08/2017 15:07, Steven Rostedt wrote:
> > On Wed, 2 Aug 2017 14:42:39 +0200
> > Daniel Lezcano  wrote:
> > 
> >> On Tue, Aug 01, 2017 at 08:12:14PM -0400, Steven Rostedt wrote:
> >>> On Wed, 2 Aug 2017 00:15:44 +0200
> >>> Daniel Lezcano  wrote:
> >>>   
>  On 02/08/2017 00:04, Paul E. McKenney wrote:  
> >> Hi Paul,
> >>
> >> I have been trying to set the function_graph tracer for ftrace and 
> >> each time I
> >> get a CPU stall.
> >>
> >> How to reproduce:
> >> -
> >>
> >> echo function_graph > 
> >> /sys/kernel/debug/tracing/current_tracer
> >>
> >> This error appears with v4.13-rc3 and v4.12-rc6.  
> >>>
> >>> Can you bisect this? It may be due to this commit:
> >>>
> >>> 0598e4f08 ("ftrace: Add use of synchronize_rcu_tasks() with dynamic 
> >>> trampolines")  
> >>
> >> Hi Steve,
> >>
> >> I git bisected but each time the issue occured. I went through the 
> >> different
> >> version down to v4.4 where the board was not fully supported and it ended 
> >> up to
> >> have the same issue.
> >>
> >> Finally, I had the intuition it could be related to the wall time (there 
> >> is no
> >> RTC clock with battery on the board and the wall time is Jan 1st, 1970).
> >>
> >> Setting up the with ntpdate solved the problem.
> 
> Actually, it did not solve the problem. The function_graph trace is set,
> I can use the system but after awhile (no tracing enabled at anytime),
> the stall appears.
> 
> >> Even if it is rarely the case to have the time not set, is it normal to 
> >> have a
> >> RCU cpu stall ?
> >>
> >>
> > 
> > BTW, function_graph tracer is the most invasive of the tracers. It's 4x
> > slower than function tracer. I'm wondering if the tracer isn't the
> > cause, but just slows things down enough to cause a some other race
> > condition that triggers the bug.
> 
> Yes, that could be true.
> 
> I tried the following scenario:
> 
>  - cpufreq governor => userspace + max_freq (1.2GHz)
>- function_graph set ==> OK
> 
>  - cpufreq governor => userspace + min_freq (200MHz)
>- function_graph set ==> RCU stall
> 
> Beside that, I realize the board is constantly processing SOF interrupts
> every 124us, so that adds more overhead.
> 
> Removing the USB support, thus the associated processing for the SOF
> interrupts, I don't see anymore the RCU stall.

Looks like Steve called this one!  ;-)

> Is it the expected behavior to have the system hang after a RCU stall
> raises ?

No, but if NMI stack traces are enabled and there are any NMI problems,
bad things can happen.  In addition, the bulk of output can cause problems
if you have a slow console connection.

Thanx, Paul



Re: RCU stall when using function_graph

2017-08-03 Thread Paul E. McKenney
On Thu, Aug 03, 2017 at 01:41:11PM +0200, Daniel Lezcano wrote:
> On 02/08/2017 15:07, Steven Rostedt wrote:
> > On Wed, 2 Aug 2017 14:42:39 +0200
> > Daniel Lezcano  wrote:
> > 
> >> On Tue, Aug 01, 2017 at 08:12:14PM -0400, Steven Rostedt wrote:
> >>> On Wed, 2 Aug 2017 00:15:44 +0200
> >>> Daniel Lezcano  wrote:
> >>>   
>  On 02/08/2017 00:04, Paul E. McKenney wrote:  
> >> Hi Paul,
> >>
> >> I have been trying to set the function_graph tracer for ftrace and 
> >> each time I
> >> get a CPU stall.
> >>
> >> How to reproduce:
> >> -
> >>
> >> echo function_graph > 
> >> /sys/kernel/debug/tracing/current_tracer
> >>
> >> This error appears with v4.13-rc3 and v4.12-rc6.  
> >>>
> >>> Can you bisect this? It may be due to this commit:
> >>>
> >>> 0598e4f08 ("ftrace: Add use of synchronize_rcu_tasks() with dynamic 
> >>> trampolines")  
> >>
> >> Hi Steve,
> >>
> >> I git bisected but each time the issue occured. I went through the 
> >> different
> >> version down to v4.4 where the board was not fully supported and it ended 
> >> up to
> >> have the same issue.
> >>
> >> Finally, I had the intuition it could be related to the wall time (there 
> >> is no
> >> RTC clock with battery on the board and the wall time is Jan 1st, 1970).
> >>
> >> Setting up the with ntpdate solved the problem.
> 
> Actually, it did not solve the problem. The function_graph trace is set,
> I can use the system but after awhile (no tracing enabled at anytime),
> the stall appears.
> 
> >> Even if it is rarely the case to have the time not set, is it normal to 
> >> have a
> >> RCU cpu stall ?
> >>
> >>
> > 
> > BTW, function_graph tracer is the most invasive of the tracers. It's 4x
> > slower than function tracer. I'm wondering if the tracer isn't the
> > cause, but just slows things down enough to cause a some other race
> > condition that triggers the bug.
> 
> Yes, that could be true.
> 
> I tried the following scenario:
> 
>  - cpufreq governor => userspace + max_freq (1.2GHz)
>- function_graph set ==> OK
> 
>  - cpufreq governor => userspace + min_freq (200MHz)
>- function_graph set ==> RCU stall
> 
> Beside that, I realize the board is constantly processing SOF interrupts
> every 124us, so that adds more overhead.
> 
> Removing the USB support, thus the associated processing for the SOF
> interrupts, I don't see anymore the RCU stall.

Looks like Steve called this one!  ;-)

> Is it the expected behavior to have the system hang after a RCU stall
> raises ?

No, but if NMI stack traces are enabled and there are any NMI problems,
bad things can happen.  In addition, the bulk of output can cause problems
if you have a slow console connection.

Thanx, Paul



Re: RCU stall when using function_graph

2017-08-03 Thread Daniel Lezcano
On 02/08/2017 15:07, Steven Rostedt wrote:
> On Wed, 2 Aug 2017 14:42:39 +0200
> Daniel Lezcano  wrote:
> 
>> On Tue, Aug 01, 2017 at 08:12:14PM -0400, Steven Rostedt wrote:
>>> On Wed, 2 Aug 2017 00:15:44 +0200
>>> Daniel Lezcano  wrote:
>>>   
 On 02/08/2017 00:04, Paul E. McKenney wrote:  
>> Hi Paul,
>>
>> I have been trying to set the function_graph tracer for ftrace and each 
>> time I
>> get a CPU stall.
>>
>> How to reproduce:
>> -
>>
>>   echo function_graph > /sys/kernel/debug/tracing/current_tracer
>>
>> This error appears with v4.13-rc3 and v4.12-rc6.  
>>>
>>> Can you bisect this? It may be due to this commit:
>>>
>>> 0598e4f08 ("ftrace: Add use of synchronize_rcu_tasks() with dynamic 
>>> trampolines")  
>>
>> Hi Steve,
>>
>> I git bisected but each time the issue occured. I went through the different
>> version down to v4.4 where the board was not fully supported and it ended up 
>> to
>> have the same issue.
>>
>> Finally, I had the intuition it could be related to the wall time (there is 
>> no
>> RTC clock with battery on the board and the wall time is Jan 1st, 1970).
>>
>> Setting up the with ntpdate solved the problem.

Actually, it did not solve the problem. The function_graph trace is set,
I can use the system but after awhile (no tracing enabled at anytime),
the stall appears.

>> Even if it is rarely the case to have the time not set, is it normal to have 
>> a
>> RCU cpu stall ?
>>
>>
> 
> BTW, function_graph tracer is the most invasive of the tracers. It's 4x
> slower than function tracer. I'm wondering if the tracer isn't the
> cause, but just slows things down enough to cause a some other race
> condition that triggers the bug.

Yes, that could be true.

I tried the following scenario:

 - cpufreq governor => userspace + max_freq (1.2GHz)
   - function_graph set ==> OK

 - cpufreq governor => userspace + min_freq (200MHz)
   - function_graph set ==> RCU stall

Beside that, I realize the board is constantly processing SOF interrupts
every 124us, so that adds more overhead.

Removing the USB support, thus the associated processing for the SOF
interrupts, I don't see anymore the RCU stall.

Is it the expected behavior to have the system hang after a RCU stall
raises ?

-- 
  Linaro.org │ Open source software for ARM SoCs

Follow Linaro:   Facebook |
 Twitter |
 Blog



Re: RCU stall when using function_graph

2017-08-03 Thread Daniel Lezcano
On 02/08/2017 15:07, Steven Rostedt wrote:
> On Wed, 2 Aug 2017 14:42:39 +0200
> Daniel Lezcano  wrote:
> 
>> On Tue, Aug 01, 2017 at 08:12:14PM -0400, Steven Rostedt wrote:
>>> On Wed, 2 Aug 2017 00:15:44 +0200
>>> Daniel Lezcano  wrote:
>>>   
 On 02/08/2017 00:04, Paul E. McKenney wrote:  
>> Hi Paul,
>>
>> I have been trying to set the function_graph tracer for ftrace and each 
>> time I
>> get a CPU stall.
>>
>> How to reproduce:
>> -
>>
>>   echo function_graph > /sys/kernel/debug/tracing/current_tracer
>>
>> This error appears with v4.13-rc3 and v4.12-rc6.  
>>>
>>> Can you bisect this? It may be due to this commit:
>>>
>>> 0598e4f08 ("ftrace: Add use of synchronize_rcu_tasks() with dynamic 
>>> trampolines")  
>>
>> Hi Steve,
>>
>> I git bisected but each time the issue occured. I went through the different
>> version down to v4.4 where the board was not fully supported and it ended up 
>> to
>> have the same issue.
>>
>> Finally, I had the intuition it could be related to the wall time (there is 
>> no
>> RTC clock with battery on the board and the wall time is Jan 1st, 1970).
>>
>> Setting up the with ntpdate solved the problem.

Actually, it did not solve the problem. The function_graph trace is set,
I can use the system but after awhile (no tracing enabled at anytime),
the stall appears.

>> Even if it is rarely the case to have the time not set, is it normal to have 
>> a
>> RCU cpu stall ?
>>
>>
> 
> BTW, function_graph tracer is the most invasive of the tracers. It's 4x
> slower than function tracer. I'm wondering if the tracer isn't the
> cause, but just slows things down enough to cause a some other race
> condition that triggers the bug.

Yes, that could be true.

I tried the following scenario:

 - cpufreq governor => userspace + max_freq (1.2GHz)
   - function_graph set ==> OK

 - cpufreq governor => userspace + min_freq (200MHz)
   - function_graph set ==> RCU stall

Beside that, I realize the board is constantly processing SOF interrupts
every 124us, so that adds more overhead.

Removing the USB support, thus the associated processing for the SOF
interrupts, I don't see anymore the RCU stall.

Is it the expected behavior to have the system hang after a RCU stall
raises ?

-- 
  Linaro.org │ Open source software for ARM SoCs

Follow Linaro:   Facebook |
 Twitter |
 Blog



Re: RCU stall when using function_graph

2017-08-02 Thread Paul E. McKenney
On Wed, Aug 02, 2017 at 09:07:44AM -0400, Steven Rostedt wrote:
> On Wed, 2 Aug 2017 14:42:39 +0200
> Daniel Lezcano  wrote:
> 
> > On Tue, Aug 01, 2017 at 08:12:14PM -0400, Steven Rostedt wrote:
> > > On Wed, 2 Aug 2017 00:15:44 +0200
> > > Daniel Lezcano  wrote:
> > >   
> > > > On 02/08/2017 00:04, Paul E. McKenney wrote:  
> > > > >> Hi Paul,
> > > > >>
> > > > >> I have been trying to set the function_graph tracer for ftrace and 
> > > > >> each time I
> > > > >> get a CPU stall.
> > > > >>
> > > > >> How to reproduce:
> > > > >> -
> > > > >>
> > > > >>   echo function_graph > 
> > > > >> /sys/kernel/debug/tracing/current_tracer
> > > > >>
> > > > >> This error appears with v4.13-rc3 and v4.12-rc6.  
> > > 
> > > Can you bisect this? It may be due to this commit:
> > > 
> > > 0598e4f08 ("ftrace: Add use of synchronize_rcu_tasks() with dynamic 
> > > trampolines")  
> > 
> > Hi Steve,
> > 
> > I git bisected but each time the issue occured. I went through the different
> > version down to v4.4 where the board was not fully supported and it ended 
> > up to
> > have the same issue.
> > 
> > Finally, I had the intuition it could be related to the wall time (there is 
> > no
> > RTC clock with battery on the board and the wall time is Jan 1st, 1970).
> > 
> > Setting up the with ntpdate solved the problem.
> > 
> > Even if it is rarely the case to have the time not set, is it normal to 
> > have a
> > RCU cpu stall ?
> > 
> >
> 
> BTW, function_graph tracer is the most invasive of the tracers. It's 4x
> slower than function tracer. I'm wondering if the tracer isn't the
> cause, but just slows things down enough to cause a some other race
> condition that triggers the bug.

Easy to check!  Use the rcupdate.rcu_cpu_stall_timeout kernel boot
parameter to increase this timeout by a factor of four.  Mainline
default is 21 seconds, but many distros set it to 60 seconds.
You can always check sysfs to find the value for your system, or
CONFIG_RCU_CPU_STALL_TIMEOUT in your .config file.

Thanx, Paul



Re: RCU stall when using function_graph

2017-08-02 Thread Paul E. McKenney
On Wed, Aug 02, 2017 at 09:07:44AM -0400, Steven Rostedt wrote:
> On Wed, 2 Aug 2017 14:42:39 +0200
> Daniel Lezcano  wrote:
> 
> > On Tue, Aug 01, 2017 at 08:12:14PM -0400, Steven Rostedt wrote:
> > > On Wed, 2 Aug 2017 00:15:44 +0200
> > > Daniel Lezcano  wrote:
> > >   
> > > > On 02/08/2017 00:04, Paul E. McKenney wrote:  
> > > > >> Hi Paul,
> > > > >>
> > > > >> I have been trying to set the function_graph tracer for ftrace and 
> > > > >> each time I
> > > > >> get a CPU stall.
> > > > >>
> > > > >> How to reproduce:
> > > > >> -
> > > > >>
> > > > >>   echo function_graph > 
> > > > >> /sys/kernel/debug/tracing/current_tracer
> > > > >>
> > > > >> This error appears with v4.13-rc3 and v4.12-rc6.  
> > > 
> > > Can you bisect this? It may be due to this commit:
> > > 
> > > 0598e4f08 ("ftrace: Add use of synchronize_rcu_tasks() with dynamic 
> > > trampolines")  
> > 
> > Hi Steve,
> > 
> > I git bisected but each time the issue occured. I went through the different
> > version down to v4.4 where the board was not fully supported and it ended 
> > up to
> > have the same issue.
> > 
> > Finally, I had the intuition it could be related to the wall time (there is 
> > no
> > RTC clock with battery on the board and the wall time is Jan 1st, 1970).
> > 
> > Setting up the with ntpdate solved the problem.
> > 
> > Even if it is rarely the case to have the time not set, is it normal to 
> > have a
> > RCU cpu stall ?
> > 
> >
> 
> BTW, function_graph tracer is the most invasive of the tracers. It's 4x
> slower than function tracer. I'm wondering if the tracer isn't the
> cause, but just slows things down enough to cause a some other race
> condition that triggers the bug.

Easy to check!  Use the rcupdate.rcu_cpu_stall_timeout kernel boot
parameter to increase this timeout by a factor of four.  Mainline
default is 21 seconds, but many distros set it to 60 seconds.
You can always check sysfs to find the value for your system, or
CONFIG_RCU_CPU_STALL_TIMEOUT in your .config file.

Thanx, Paul



Re: RCU stall when using function_graph

2017-08-02 Thread Paul E. McKenney
On Wed, Aug 02, 2017 at 02:42:39PM +0200, Daniel Lezcano wrote:
> On Tue, Aug 01, 2017 at 08:12:14PM -0400, Steven Rostedt wrote:
> > On Wed, 2 Aug 2017 00:15:44 +0200
> > Daniel Lezcano  wrote:
> > 
> > > On 02/08/2017 00:04, Paul E. McKenney wrote:
> > > >> Hi Paul,
> > > >>
> > > >> I have been trying to set the function_graph tracer for ftrace and 
> > > >> each time I
> > > >> get a CPU stall.
> > > >>
> > > >> How to reproduce:
> > > >> -
> > > >>
> > > >> echo function_graph > 
> > > >> /sys/kernel/debug/tracing/current_tracer
> > > >>
> > > >> This error appears with v4.13-rc3 and v4.12-rc6.
> > 
> > Can you bisect this? It may be due to this commit:
> > 
> > 0598e4f08 ("ftrace: Add use of synchronize_rcu_tasks() with dynamic 
> > trampolines")
> 
> Hi Steve,
> 
> I git bisected but each time the issue occured. I went through the different
> version down to v4.4 where the board was not fully supported and it ended up 
> to
> have the same issue.
> 
> Finally, I had the intuition it could be related to the wall time (there is no
> RTC clock with battery on the board and the wall time is Jan 1st, 1970).
> 
> Setting up the with ntpdate solved the problem.
> 
> Even if it is rarely the case to have the time not set, is it normal to have a
> RCU cpu stall ?

If the system is sufficiently confused about the time, you can indeed
get RCU CPU stall warnings.

In one memorable case, a pair of CPUs had a multi-minute disagreement
as to the current time, which meant that when one of the started an RCU
grace period, the other would immediately issue an RCU CPU stall warning.

Thanx, Paul

> > > >>
> > > >> Is it something already reported ?  
> > > > 
> > > > I have seen this sort of thing, but only when actually dumping the trace
> > > > out, and I though those got fixed.  You are seeing this just 
> > > > accumulating
> > > > the trace?  
> > > 
> > > No, just by changing the tracer. It is the first operation I do after
> > > rebooting and it is reproducible each time. That happens on an ARM64
> > > platform.
> > > 
> > > > These RCU CPU stall warnings usually occur when something grabs hold of
> > > > a CPU for too long, as in 21 seconds or so.  One way that they can 
> > > > happen
> > > > is excessive lock contention, another is having the kernel run through
> > > > too much data at one shot.
> > > > 
> > > > Adding Steven Rostedt on CC for his thoughts.
> > > > 
> > >
> 
> -- 
> 
>   Linaro.org │ Open source software for ARM SoCs
> 
> Follow Linaro:   Facebook |
>  Twitter |
>  Blog
> 



Re: RCU stall when using function_graph

2017-08-02 Thread Paul E. McKenney
On Wed, Aug 02, 2017 at 02:42:39PM +0200, Daniel Lezcano wrote:
> On Tue, Aug 01, 2017 at 08:12:14PM -0400, Steven Rostedt wrote:
> > On Wed, 2 Aug 2017 00:15:44 +0200
> > Daniel Lezcano  wrote:
> > 
> > > On 02/08/2017 00:04, Paul E. McKenney wrote:
> > > >> Hi Paul,
> > > >>
> > > >> I have been trying to set the function_graph tracer for ftrace and 
> > > >> each time I
> > > >> get a CPU stall.
> > > >>
> > > >> How to reproduce:
> > > >> -
> > > >>
> > > >> echo function_graph > 
> > > >> /sys/kernel/debug/tracing/current_tracer
> > > >>
> > > >> This error appears with v4.13-rc3 and v4.12-rc6.
> > 
> > Can you bisect this? It may be due to this commit:
> > 
> > 0598e4f08 ("ftrace: Add use of synchronize_rcu_tasks() with dynamic 
> > trampolines")
> 
> Hi Steve,
> 
> I git bisected but each time the issue occured. I went through the different
> version down to v4.4 where the board was not fully supported and it ended up 
> to
> have the same issue.
> 
> Finally, I had the intuition it could be related to the wall time (there is no
> RTC clock with battery on the board and the wall time is Jan 1st, 1970).
> 
> Setting up the with ntpdate solved the problem.
> 
> Even if it is rarely the case to have the time not set, is it normal to have a
> RCU cpu stall ?

If the system is sufficiently confused about the time, you can indeed
get RCU CPU stall warnings.

In one memorable case, a pair of CPUs had a multi-minute disagreement
as to the current time, which meant that when one of the started an RCU
grace period, the other would immediately issue an RCU CPU stall warning.

Thanx, Paul

> > > >>
> > > >> Is it something already reported ?  
> > > > 
> > > > I have seen this sort of thing, but only when actually dumping the trace
> > > > out, and I though those got fixed.  You are seeing this just 
> > > > accumulating
> > > > the trace?  
> > > 
> > > No, just by changing the tracer. It is the first operation I do after
> > > rebooting and it is reproducible each time. That happens on an ARM64
> > > platform.
> > > 
> > > > These RCU CPU stall warnings usually occur when something grabs hold of
> > > > a CPU for too long, as in 21 seconds or so.  One way that they can 
> > > > happen
> > > > is excessive lock contention, another is having the kernel run through
> > > > too much data at one shot.
> > > > 
> > > > Adding Steven Rostedt on CC for his thoughts.
> > > > 
> > >
> 
> -- 
> 
>   Linaro.org │ Open source software for ARM SoCs
> 
> Follow Linaro:   Facebook |
>  Twitter |
>  Blog
> 



Re: RCU stall when using function_graph

2017-08-02 Thread Steven Rostedt
On Wed, 2 Aug 2017 14:42:39 +0200
Daniel Lezcano  wrote:

> On Tue, Aug 01, 2017 at 08:12:14PM -0400, Steven Rostedt wrote:
> > On Wed, 2 Aug 2017 00:15:44 +0200
> > Daniel Lezcano  wrote:
> >   
> > > On 02/08/2017 00:04, Paul E. McKenney wrote:  
> > > >> Hi Paul,
> > > >>
> > > >> I have been trying to set the function_graph tracer for ftrace and 
> > > >> each time I
> > > >> get a CPU stall.
> > > >>
> > > >> How to reproduce:
> > > >> -
> > > >>
> > > >> echo function_graph > 
> > > >> /sys/kernel/debug/tracing/current_tracer
> > > >>
> > > >> This error appears with v4.13-rc3 and v4.12-rc6.  
> > 
> > Can you bisect this? It may be due to this commit:
> > 
> > 0598e4f08 ("ftrace: Add use of synchronize_rcu_tasks() with dynamic 
> > trampolines")  
> 
> Hi Steve,
> 
> I git bisected but each time the issue occured. I went through the different
> version down to v4.4 where the board was not fully supported and it ended up 
> to
> have the same issue.
> 
> Finally, I had the intuition it could be related to the wall time (there is no
> RTC clock with battery on the board and the wall time is Jan 1st, 1970).
> 
> Setting up the with ntpdate solved the problem.
> 
> Even if it is rarely the case to have the time not set, is it normal to have a
> RCU cpu stall ?
> 
>

BTW, function_graph tracer is the most invasive of the tracers. It's 4x
slower than function tracer. I'm wondering if the tracer isn't the
cause, but just slows things down enough to cause a some other race
condition that triggers the bug.

-- Steve


Re: RCU stall when using function_graph

2017-08-02 Thread Steven Rostedt
On Wed, 2 Aug 2017 14:42:39 +0200
Daniel Lezcano  wrote:

> On Tue, Aug 01, 2017 at 08:12:14PM -0400, Steven Rostedt wrote:
> > On Wed, 2 Aug 2017 00:15:44 +0200
> > Daniel Lezcano  wrote:
> >   
> > > On 02/08/2017 00:04, Paul E. McKenney wrote:  
> > > >> Hi Paul,
> > > >>
> > > >> I have been trying to set the function_graph tracer for ftrace and 
> > > >> each time I
> > > >> get a CPU stall.
> > > >>
> > > >> How to reproduce:
> > > >> -
> > > >>
> > > >> echo function_graph > 
> > > >> /sys/kernel/debug/tracing/current_tracer
> > > >>
> > > >> This error appears with v4.13-rc3 and v4.12-rc6.  
> > 
> > Can you bisect this? It may be due to this commit:
> > 
> > 0598e4f08 ("ftrace: Add use of synchronize_rcu_tasks() with dynamic 
> > trampolines")  
> 
> Hi Steve,
> 
> I git bisected but each time the issue occured. I went through the different
> version down to v4.4 where the board was not fully supported and it ended up 
> to
> have the same issue.
> 
> Finally, I had the intuition it could be related to the wall time (there is no
> RTC clock with battery on the board and the wall time is Jan 1st, 1970).
> 
> Setting up the with ntpdate solved the problem.
> 
> Even if it is rarely the case to have the time not set, is it normal to have a
> RCU cpu stall ?
> 
>

BTW, function_graph tracer is the most invasive of the tracers. It's 4x
slower than function tracer. I'm wondering if the tracer isn't the
cause, but just slows things down enough to cause a some other race
condition that triggers the bug.

-- Steve


Re: RCU stall when using function_graph

2017-08-02 Thread Paul E. McKenney
On Tue, Aug 01, 2017 at 03:04:05PM -0700, Paul E. McKenney wrote:
> > Hi Paul,
> > 
> > I have been trying to set the function_graph tracer for ftrace and each 
> > time I
> > get a CPU stall.
> > 
> > How to reproduce:
> > -
> > 
> >  echo function_graph > /sys/kernel/debug/tracing/current_tracer
> > 
> > This error appears with v4.13-rc3 and v4.12-rc6.
> > 
> > Is it something already reported ?
> 
> I have seen this sort of thing, but only when actually dumping the trace
> out, and I though those got fixed.  You are seeing this just accumulating
> the trace?
> 
> These RCU CPU stall warnings usually occur when something grabs hold of
> a CPU for too long, as in 21 seconds or so.  One way that they can happen
> is excessive lock contention, another is having the kernel run through
> too much data at one shot.
> 
> Adding Steven Rostedt on CC for his thoughts.

And Peter Zijlstra suggested this in response to a similar splat:

https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=locking/core=15377ef4fe0c86eb7fa1099575b2e86357d99e42

Thanx, Paul

> > [ ... ]
> > 
> > [  472.934986] INFO: rcu_preempt detected stalls on CPUs/tasks:
> > [  472.941302]  0-...: (3 GPs behind) idle=d1a/2/0 softirq=2556/2557 
> > fqs=3195
> > [  472.948982]  (detected by 3, t=6502 jiffies, g=549, c=548, q=10)
> > [  472.955180] Task dump for CPU 0:
> > [  472.958513] swapper/0   R  running task0 0  0 
> > 0x0002
> > [  550.954786] INFO: rcu_preempt detected stalls on CPUs/tasks:
> > [  550.960657]  0-...: (3 GPs behind) idle=d1a/2/0 softirq=2556/2557 
> > fqs=12926
> > [  550.967897]  (detected by 3, t=26007 jiffies, g=549, c=548, q=27)
> > [  550.974181] Task dump for CPU 0:
> > [  550.977513] swapper/0   R  running task0 0  0 
> > 0x0002
> > [  628.974788] INFO: rcu_preempt detected stalls on CPUs/tasks:
> > [  628.980659]  0-...: (3 GPs behind) idle=d1a/2/0 softirq=2556/2557 
> > fqs=22666
> > [  628.987900]  (detected by 3, t=45512 jiffies, g=549, c=548, q=43)
> > [  628.994184] Task dump for CPU 0:
> > [  628.997518] swapper/0   R  running task0 0  0 
> > 0x0002
> > [  706.994787] INFO: rcu_preempt detected stalls on CPUs/tasks:
> > [  707.000658]  0-...: (3 GPs behind) idle=d1a/2/0 softirq=2556/2557 
> > fqs=31785
> > [  707.007899]  (detected by 3, t=65017 jiffies, g=549, c=548, q=74)
> > [  707.014183] Task dump for CPU 0:
> > [  707.017516] swapper/0   R  running task0 0  0 
> > 0x0002
> > [  707.024902] rcu_preempt kthread starved for 1261 jiffies! g549 c548 f0x0 
> > RCU_GP_WAIT_FQS(3) ->state=0x1
> > [  707.034548] rcu_preempt S0 8  2 0x
> > [  726.049583] INFO: task jbd2/mmcblk1p2-:1120 blocked for more than 120 
> > seconds.
> > [  726.061772]   Not tainted 4.13.0-rc3-ktest-linaro-hikey #202
> > [  726.074294] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> > this message.
> > [  726.089027] jbd2/mmcblk1p2- D0  1120  2 0x
> > [  726.126737] INFO: task rs:main Q:Reg:2272 blocked for more than 120 
> > seconds.
> > [  726.140484]   Not tainted 4.13.0-rc3-ktest-linaro-hikey #202
> > [  726.154515] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> > this message.
> > [  726.170140] rs:main Q:Reg   D0  2272  1 0x
> > [  726.205626]
> > [  726.205626] Showing all locks held in the system:
> > [  726.272952]
> > [  726.278194] =
> > [  726.278194]
> > [  785.014779] INFO: rcu_preempt detected stalls on CPUs/tasks:
> > [  785.020658]  0-...: (3 GPs behind) idle=d1a/2/0 softirq=2556/2557 
> > fqs=41387
> > [  785.027905]  (detected by 3, t=84524 jiffies, g=549, c=548, q=94)
> > [  785.034194] Task dump for CPU 0:
> > [  785.037532] swapper/0   R  running task0 0  0 
> > 0x0002
> > [  846.879753] INFO: task jbd2/mmcblk1p2-:1120 blocked for more than 120 
> > seconds.
> > [  846.892966]   Not tainted 4.13.0-rc3-ktest-linaro-hikey #202
> > [  846.909597] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> > this message.
> > [  846.926726] jbd2/mmcblk1p2- D0  1120  2 0x
> > [  846.981356] INFO: task rs:main Q:Reg:2272 blocked for more than 120 
> > seconds.
> > [  847.001593]   Not tainted 4.13.0-rc3-ktest-linaro-hikey #202
> > [  847.017642] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> > this message.
> > [  847.034135] rs:main Q:Reg   D0  2272  1 0x
> > [  847.077209]
> > [  847.077209] Showing all locks held in the system:
> > [  847.180736]
> > [  847.186753] =
> > [  847.186753]
> > [  863.034778] INFO: rcu_preempt detected stalls on CPUs/tasks:
> > [  863.040660]  0-...: (3 GPs behind) idle=d1a/2/0 softirq=2556/2557 
> > fqs=50881
> > [  863.047907]  

Re: RCU stall when using function_graph

2017-08-02 Thread Paul E. McKenney
On Tue, Aug 01, 2017 at 03:04:05PM -0700, Paul E. McKenney wrote:
> > Hi Paul,
> > 
> > I have been trying to set the function_graph tracer for ftrace and each 
> > time I
> > get a CPU stall.
> > 
> > How to reproduce:
> > -
> > 
> >  echo function_graph > /sys/kernel/debug/tracing/current_tracer
> > 
> > This error appears with v4.13-rc3 and v4.12-rc6.
> > 
> > Is it something already reported ?
> 
> I have seen this sort of thing, but only when actually dumping the trace
> out, and I though those got fixed.  You are seeing this just accumulating
> the trace?
> 
> These RCU CPU stall warnings usually occur when something grabs hold of
> a CPU for too long, as in 21 seconds or so.  One way that they can happen
> is excessive lock contention, another is having the kernel run through
> too much data at one shot.
> 
> Adding Steven Rostedt on CC for his thoughts.

And Peter Zijlstra suggested this in response to a similar splat:

https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/commit/?h=locking/core=15377ef4fe0c86eb7fa1099575b2e86357d99e42

Thanx, Paul

> > [ ... ]
> > 
> > [  472.934986] INFO: rcu_preempt detected stalls on CPUs/tasks:
> > [  472.941302]  0-...: (3 GPs behind) idle=d1a/2/0 softirq=2556/2557 
> > fqs=3195
> > [  472.948982]  (detected by 3, t=6502 jiffies, g=549, c=548, q=10)
> > [  472.955180] Task dump for CPU 0:
> > [  472.958513] swapper/0   R  running task0 0  0 
> > 0x0002
> > [  550.954786] INFO: rcu_preempt detected stalls on CPUs/tasks:
> > [  550.960657]  0-...: (3 GPs behind) idle=d1a/2/0 softirq=2556/2557 
> > fqs=12926
> > [  550.967897]  (detected by 3, t=26007 jiffies, g=549, c=548, q=27)
> > [  550.974181] Task dump for CPU 0:
> > [  550.977513] swapper/0   R  running task0 0  0 
> > 0x0002
> > [  628.974788] INFO: rcu_preempt detected stalls on CPUs/tasks:
> > [  628.980659]  0-...: (3 GPs behind) idle=d1a/2/0 softirq=2556/2557 
> > fqs=22666
> > [  628.987900]  (detected by 3, t=45512 jiffies, g=549, c=548, q=43)
> > [  628.994184] Task dump for CPU 0:
> > [  628.997518] swapper/0   R  running task0 0  0 
> > 0x0002
> > [  706.994787] INFO: rcu_preempt detected stalls on CPUs/tasks:
> > [  707.000658]  0-...: (3 GPs behind) idle=d1a/2/0 softirq=2556/2557 
> > fqs=31785
> > [  707.007899]  (detected by 3, t=65017 jiffies, g=549, c=548, q=74)
> > [  707.014183] Task dump for CPU 0:
> > [  707.017516] swapper/0   R  running task0 0  0 
> > 0x0002
> > [  707.024902] rcu_preempt kthread starved for 1261 jiffies! g549 c548 f0x0 
> > RCU_GP_WAIT_FQS(3) ->state=0x1
> > [  707.034548] rcu_preempt S0 8  2 0x
> > [  726.049583] INFO: task jbd2/mmcblk1p2-:1120 blocked for more than 120 
> > seconds.
> > [  726.061772]   Not tainted 4.13.0-rc3-ktest-linaro-hikey #202
> > [  726.074294] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> > this message.
> > [  726.089027] jbd2/mmcblk1p2- D0  1120  2 0x
> > [  726.126737] INFO: task rs:main Q:Reg:2272 blocked for more than 120 
> > seconds.
> > [  726.140484]   Not tainted 4.13.0-rc3-ktest-linaro-hikey #202
> > [  726.154515] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> > this message.
> > [  726.170140] rs:main Q:Reg   D0  2272  1 0x
> > [  726.205626]
> > [  726.205626] Showing all locks held in the system:
> > [  726.272952]
> > [  726.278194] =
> > [  726.278194]
> > [  785.014779] INFO: rcu_preempt detected stalls on CPUs/tasks:
> > [  785.020658]  0-...: (3 GPs behind) idle=d1a/2/0 softirq=2556/2557 
> > fqs=41387
> > [  785.027905]  (detected by 3, t=84524 jiffies, g=549, c=548, q=94)
> > [  785.034194] Task dump for CPU 0:
> > [  785.037532] swapper/0   R  running task0 0  0 
> > 0x0002
> > [  846.879753] INFO: task jbd2/mmcblk1p2-:1120 blocked for more than 120 
> > seconds.
> > [  846.892966]   Not tainted 4.13.0-rc3-ktest-linaro-hikey #202
> > [  846.909597] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> > this message.
> > [  846.926726] jbd2/mmcblk1p2- D0  1120  2 0x
> > [  846.981356] INFO: task rs:main Q:Reg:2272 blocked for more than 120 
> > seconds.
> > [  847.001593]   Not tainted 4.13.0-rc3-ktest-linaro-hikey #202
> > [  847.017642] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> > this message.
> > [  847.034135] rs:main Q:Reg   D0  2272  1 0x
> > [  847.077209]
> > [  847.077209] Showing all locks held in the system:
> > [  847.180736]
> > [  847.186753] =
> > [  847.186753]
> > [  863.034778] INFO: rcu_preempt detected stalls on CPUs/tasks:
> > [  863.040660]  0-...: (3 GPs behind) idle=d1a/2/0 softirq=2556/2557 
> > fqs=50881
> > [  863.047907]  

Re: RCU stall when using function_graph

2017-08-02 Thread Daniel Lezcano
On Tue, Aug 01, 2017 at 08:12:14PM -0400, Steven Rostedt wrote:
> On Wed, 2 Aug 2017 00:15:44 +0200
> Daniel Lezcano  wrote:
> 
> > On 02/08/2017 00:04, Paul E. McKenney wrote:
> > >> Hi Paul,
> > >>
> > >> I have been trying to set the function_graph tracer for ftrace and each 
> > >> time I
> > >> get a CPU stall.
> > >>
> > >> How to reproduce:
> > >> -
> > >>
> > >>   echo function_graph > /sys/kernel/debug/tracing/current_tracer
> > >>
> > >> This error appears with v4.13-rc3 and v4.12-rc6.
> 
> Can you bisect this? It may be due to this commit:
> 
> 0598e4f08 ("ftrace: Add use of synchronize_rcu_tasks() with dynamic 
> trampolines")

Hi Steve,

I git bisected but each time the issue occured. I went through the different
version down to v4.4 where the board was not fully supported and it ended up to
have the same issue.

Finally, I had the intuition it could be related to the wall time (there is no
RTC clock with battery on the board and the wall time is Jan 1st, 1970).

Setting up the with ntpdate solved the problem.

Even if it is rarely the case to have the time not set, is it normal to have a
RCU cpu stall ?

> > >>
> > >> Is it something already reported ?  
> > > 
> > > I have seen this sort of thing, but only when actually dumping the trace
> > > out, and I though those got fixed.  You are seeing this just accumulating
> > > the trace?  
> > 
> > No, just by changing the tracer. It is the first operation I do after
> > rebooting and it is reproducible each time. That happens on an ARM64
> > platform.
> > 
> > > These RCU CPU stall warnings usually occur when something grabs hold of
> > > a CPU for too long, as in 21 seconds or so.  One way that they can happen
> > > is excessive lock contention, another is having the kernel run through
> > > too much data at one shot.
> > > 
> > > Adding Steven Rostedt on CC for his thoughts.
> > > 
> >

-- 

  Linaro.org │ Open source software for ARM SoCs

Follow Linaro:   Facebook |
 Twitter |
 Blog


Re: RCU stall when using function_graph

2017-08-02 Thread Daniel Lezcano
On Tue, Aug 01, 2017 at 08:12:14PM -0400, Steven Rostedt wrote:
> On Wed, 2 Aug 2017 00:15:44 +0200
> Daniel Lezcano  wrote:
> 
> > On 02/08/2017 00:04, Paul E. McKenney wrote:
> > >> Hi Paul,
> > >>
> > >> I have been trying to set the function_graph tracer for ftrace and each 
> > >> time I
> > >> get a CPU stall.
> > >>
> > >> How to reproduce:
> > >> -
> > >>
> > >>   echo function_graph > /sys/kernel/debug/tracing/current_tracer
> > >>
> > >> This error appears with v4.13-rc3 and v4.12-rc6.
> 
> Can you bisect this? It may be due to this commit:
> 
> 0598e4f08 ("ftrace: Add use of synchronize_rcu_tasks() with dynamic 
> trampolines")

Hi Steve,

I git bisected but each time the issue occured. I went through the different
version down to v4.4 where the board was not fully supported and it ended up to
have the same issue.

Finally, I had the intuition it could be related to the wall time (there is no
RTC clock with battery on the board and the wall time is Jan 1st, 1970).

Setting up the with ntpdate solved the problem.

Even if it is rarely the case to have the time not set, is it normal to have a
RCU cpu stall ?

> > >>
> > >> Is it something already reported ?  
> > > 
> > > I have seen this sort of thing, but only when actually dumping the trace
> > > out, and I though those got fixed.  You are seeing this just accumulating
> > > the trace?  
> > 
> > No, just by changing the tracer. It is the first operation I do after
> > rebooting and it is reproducible each time. That happens on an ARM64
> > platform.
> > 
> > > These RCU CPU stall warnings usually occur when something grabs hold of
> > > a CPU for too long, as in 21 seconds or so.  One way that they can happen
> > > is excessive lock contention, another is having the kernel run through
> > > too much data at one shot.
> > > 
> > > Adding Steven Rostedt on CC for his thoughts.
> > > 
> >

-- 

  Linaro.org │ Open source software for ARM SoCs

Follow Linaro:   Facebook |
 Twitter |
 Blog


Re: RCU stall when using function_graph

2017-08-01 Thread Steven Rostedt
On Wed, 2 Aug 2017 00:15:44 +0200
Daniel Lezcano  wrote:

> On 02/08/2017 00:04, Paul E. McKenney wrote:
> >> Hi Paul,
> >>
> >> I have been trying to set the function_graph tracer for ftrace and each 
> >> time I
> >> get a CPU stall.
> >>
> >> How to reproduce:
> >> -
> >>
> >> echo function_graph > /sys/kernel/debug/tracing/current_tracer
> >>
> >> This error appears with v4.13-rc3 and v4.12-rc6.

Can you bisect this? It may be due to this commit:

0598e4f08 ("ftrace: Add use of synchronize_rcu_tasks() with dynamic 
trampolines")

-- Steve

> >>
> >> Is it something already reported ?  
> > 
> > I have seen this sort of thing, but only when actually dumping the trace
> > out, and I though those got fixed.  You are seeing this just accumulating
> > the trace?  
> 
> No, just by changing the tracer. It is the first operation I do after
> rebooting and it is reproducible each time. That happens on an ARM64
> platform.
> 
> > These RCU CPU stall warnings usually occur when something grabs hold of
> > a CPU for too long, as in 21 seconds or so.  One way that they can happen
> > is excessive lock contention, another is having the kernel run through
> > too much data at one shot.
> > 
> > Adding Steven Rostedt on CC for his thoughts.
> > 
>


Re: RCU stall when using function_graph

2017-08-01 Thread Steven Rostedt
On Wed, 2 Aug 2017 00:15:44 +0200
Daniel Lezcano  wrote:

> On 02/08/2017 00:04, Paul E. McKenney wrote:
> >> Hi Paul,
> >>
> >> I have been trying to set the function_graph tracer for ftrace and each 
> >> time I
> >> get a CPU stall.
> >>
> >> How to reproduce:
> >> -
> >>
> >> echo function_graph > /sys/kernel/debug/tracing/current_tracer
> >>
> >> This error appears with v4.13-rc3 and v4.12-rc6.

Can you bisect this? It may be due to this commit:

0598e4f08 ("ftrace: Add use of synchronize_rcu_tasks() with dynamic 
trampolines")

-- Steve

> >>
> >> Is it something already reported ?  
> > 
> > I have seen this sort of thing, but only when actually dumping the trace
> > out, and I though those got fixed.  You are seeing this just accumulating
> > the trace?  
> 
> No, just by changing the tracer. It is the first operation I do after
> rebooting and it is reproducible each time. That happens on an ARM64
> platform.
> 
> > These RCU CPU stall warnings usually occur when something grabs hold of
> > a CPU for too long, as in 21 seconds or so.  One way that they can happen
> > is excessive lock contention, another is having the kernel run through
> > too much data at one shot.
> > 
> > Adding Steven Rostedt on CC for his thoughts.
> > 
>


Re: RCU stall when using function_graph

2017-08-01 Thread Daniel Lezcano
On 02/08/2017 00:04, Paul E. McKenney wrote:
>> Hi Paul,
>>
>> I have been trying to set the function_graph tracer for ftrace and each time 
>> I
>> get a CPU stall.
>>
>> How to reproduce:
>> -
>>
>>   echo function_graph > /sys/kernel/debug/tracing/current_tracer
>>
>> This error appears with v4.13-rc3 and v4.12-rc6.
>>
>> Is it something already reported ?
> 
> I have seen this sort of thing, but only when actually dumping the trace
> out, and I though those got fixed.  You are seeing this just accumulating
> the trace?

No, just by changing the tracer. It is the first operation I do after
rebooting and it is reproducible each time. That happens on an ARM64
platform.

> These RCU CPU stall warnings usually occur when something grabs hold of
> a CPU for too long, as in 21 seconds or so.  One way that they can happen
> is excessive lock contention, another is having the kernel run through
> too much data at one shot.
> 
> Adding Steven Rostedt on CC for his thoughts.
> 
>   Thanx, Paul
> 
>> [ ... ]
>>
>> [  472.934986] INFO: rcu_preempt detected stalls on CPUs/tasks:
>> [  472.941302]  0-...: (3 GPs behind) idle=d1a/2/0 softirq=2556/2557 fqs=3195
>> [  472.948982]  (detected by 3, t=6502 jiffies, g=549, c=548, q=10)
>> [  472.955180] Task dump for CPU 0:
>> [  472.958513] swapper/0   R  running task0 0  0 
>> 0x0002
>> [  550.954786] INFO: rcu_preempt detected stalls on CPUs/tasks:
>> [  550.960657]  0-...: (3 GPs behind) idle=d1a/2/0 softirq=2556/2557 
>> fqs=12926
>> [  550.967897]  (detected by 3, t=26007 jiffies, g=549, c=548, q=27)
>> [  550.974181] Task dump for CPU 0:
>> [  550.977513] swapper/0   R  running task0 0  0 
>> 0x0002
>> [  628.974788] INFO: rcu_preempt detected stalls on CPUs/tasks:
>> [  628.980659]  0-...: (3 GPs behind) idle=d1a/2/0 softirq=2556/2557 
>> fqs=22666
>> [  628.987900]  (detected by 3, t=45512 jiffies, g=549, c=548, q=43)
>> [  628.994184] Task dump for CPU 0:
>> [  628.997518] swapper/0   R  running task0 0  0 
>> 0x0002
>> [  706.994787] INFO: rcu_preempt detected stalls on CPUs/tasks:
>> [  707.000658]  0-...: (3 GPs behind) idle=d1a/2/0 softirq=2556/2557 
>> fqs=31785
>> [  707.007899]  (detected by 3, t=65017 jiffies, g=549, c=548, q=74)
>> [  707.014183] Task dump for CPU 0:
>> [  707.017516] swapper/0   R  running task0 0  0 
>> 0x0002
>> [  707.024902] rcu_preempt kthread starved for 1261 jiffies! g549 c548 f0x0 
>> RCU_GP_WAIT_FQS(3) ->state=0x1
>> [  707.034548] rcu_preempt S0 8  2 0x
>> [  726.049583] INFO: task jbd2/mmcblk1p2-:1120 blocked for more than 120 
>> seconds.
>> [  726.061772]   Not tainted 4.13.0-rc3-ktest-linaro-hikey #202
>> [  726.074294] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
>> this message.
>> [  726.089027] jbd2/mmcblk1p2- D0  1120  2 0x
>> [  726.126737] INFO: task rs:main Q:Reg:2272 blocked for more than 120 
>> seconds.
>> [  726.140484]   Not tainted 4.13.0-rc3-ktest-linaro-hikey #202
>> [  726.154515] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
>> this message.
>> [  726.170140] rs:main Q:Reg   D0  2272  1 0x
>> [  726.205626]
>> [  726.205626] Showing all locks held in the system:
>> [  726.272952]
>> [  726.278194] =
>> [  726.278194]
>> [  785.014779] INFO: rcu_preempt detected stalls on CPUs/tasks:
>> [  785.020658]  0-...: (3 GPs behind) idle=d1a/2/0 softirq=2556/2557 
>> fqs=41387
>> [  785.027905]  (detected by 3, t=84524 jiffies, g=549, c=548, q=94)
>> [  785.034194] Task dump for CPU 0:
>> [  785.037532] swapper/0   R  running task0 0  0 
>> 0x0002
>> [  846.879753] INFO: task jbd2/mmcblk1p2-:1120 blocked for more than 120 
>> seconds.
>> [  846.892966]   Not tainted 4.13.0-rc3-ktest-linaro-hikey #202
>> [  846.909597] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
>> this message.
>> [  846.926726] jbd2/mmcblk1p2- D0  1120  2 0x
>> [  846.981356] INFO: task rs:main Q:Reg:2272 blocked for more than 120 
>> seconds.
>> [  847.001593]   Not tainted 4.13.0-rc3-ktest-linaro-hikey #202
>> [  847.017642] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
>> this message.
>> [  847.034135] rs:main Q:Reg   D0  2272  1 0x
>> [  847.077209]
>> [  847.077209] Showing all locks held in the system:
>> [  847.180736]
>> [  847.186753] =
>> [  847.186753]
>> [  863.034778] INFO: rcu_preempt detected stalls on CPUs/tasks:
>> [  863.040660]  0-...: (3 GPs behind) idle=d1a/2/0 softirq=2556/2557 
>> fqs=50881
>> [  863.047907]  (detected by 4, t=104030 jiffies, g=549, c=548, q=114)
>> [  863.054373] Task dump for CPU 0:
>> [  863.057711] swapper/0   R  running task0 0  0 
>> 

Re: RCU stall when using function_graph

2017-08-01 Thread Daniel Lezcano
On 02/08/2017 00:04, Paul E. McKenney wrote:
>> Hi Paul,
>>
>> I have been trying to set the function_graph tracer for ftrace and each time 
>> I
>> get a CPU stall.
>>
>> How to reproduce:
>> -
>>
>>   echo function_graph > /sys/kernel/debug/tracing/current_tracer
>>
>> This error appears with v4.13-rc3 and v4.12-rc6.
>>
>> Is it something already reported ?
> 
> I have seen this sort of thing, but only when actually dumping the trace
> out, and I though those got fixed.  You are seeing this just accumulating
> the trace?

No, just by changing the tracer. It is the first operation I do after
rebooting and it is reproducible each time. That happens on an ARM64
platform.

> These RCU CPU stall warnings usually occur when something grabs hold of
> a CPU for too long, as in 21 seconds or so.  One way that they can happen
> is excessive lock contention, another is having the kernel run through
> too much data at one shot.
> 
> Adding Steven Rostedt on CC for his thoughts.
> 
>   Thanx, Paul
> 
>> [ ... ]
>>
>> [  472.934986] INFO: rcu_preempt detected stalls on CPUs/tasks:
>> [  472.941302]  0-...: (3 GPs behind) idle=d1a/2/0 softirq=2556/2557 fqs=3195
>> [  472.948982]  (detected by 3, t=6502 jiffies, g=549, c=548, q=10)
>> [  472.955180] Task dump for CPU 0:
>> [  472.958513] swapper/0   R  running task0 0  0 
>> 0x0002
>> [  550.954786] INFO: rcu_preempt detected stalls on CPUs/tasks:
>> [  550.960657]  0-...: (3 GPs behind) idle=d1a/2/0 softirq=2556/2557 
>> fqs=12926
>> [  550.967897]  (detected by 3, t=26007 jiffies, g=549, c=548, q=27)
>> [  550.974181] Task dump for CPU 0:
>> [  550.977513] swapper/0   R  running task0 0  0 
>> 0x0002
>> [  628.974788] INFO: rcu_preempt detected stalls on CPUs/tasks:
>> [  628.980659]  0-...: (3 GPs behind) idle=d1a/2/0 softirq=2556/2557 
>> fqs=22666
>> [  628.987900]  (detected by 3, t=45512 jiffies, g=549, c=548, q=43)
>> [  628.994184] Task dump for CPU 0:
>> [  628.997518] swapper/0   R  running task0 0  0 
>> 0x0002
>> [  706.994787] INFO: rcu_preempt detected stalls on CPUs/tasks:
>> [  707.000658]  0-...: (3 GPs behind) idle=d1a/2/0 softirq=2556/2557 
>> fqs=31785
>> [  707.007899]  (detected by 3, t=65017 jiffies, g=549, c=548, q=74)
>> [  707.014183] Task dump for CPU 0:
>> [  707.017516] swapper/0   R  running task0 0  0 
>> 0x0002
>> [  707.024902] rcu_preempt kthread starved for 1261 jiffies! g549 c548 f0x0 
>> RCU_GP_WAIT_FQS(3) ->state=0x1
>> [  707.034548] rcu_preempt S0 8  2 0x
>> [  726.049583] INFO: task jbd2/mmcblk1p2-:1120 blocked for more than 120 
>> seconds.
>> [  726.061772]   Not tainted 4.13.0-rc3-ktest-linaro-hikey #202
>> [  726.074294] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
>> this message.
>> [  726.089027] jbd2/mmcblk1p2- D0  1120  2 0x
>> [  726.126737] INFO: task rs:main Q:Reg:2272 blocked for more than 120 
>> seconds.
>> [  726.140484]   Not tainted 4.13.0-rc3-ktest-linaro-hikey #202
>> [  726.154515] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
>> this message.
>> [  726.170140] rs:main Q:Reg   D0  2272  1 0x
>> [  726.205626]
>> [  726.205626] Showing all locks held in the system:
>> [  726.272952]
>> [  726.278194] =
>> [  726.278194]
>> [  785.014779] INFO: rcu_preempt detected stalls on CPUs/tasks:
>> [  785.020658]  0-...: (3 GPs behind) idle=d1a/2/0 softirq=2556/2557 
>> fqs=41387
>> [  785.027905]  (detected by 3, t=84524 jiffies, g=549, c=548, q=94)
>> [  785.034194] Task dump for CPU 0:
>> [  785.037532] swapper/0   R  running task0 0  0 
>> 0x0002
>> [  846.879753] INFO: task jbd2/mmcblk1p2-:1120 blocked for more than 120 
>> seconds.
>> [  846.892966]   Not tainted 4.13.0-rc3-ktest-linaro-hikey #202
>> [  846.909597] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
>> this message.
>> [  846.926726] jbd2/mmcblk1p2- D0  1120  2 0x
>> [  846.981356] INFO: task rs:main Q:Reg:2272 blocked for more than 120 
>> seconds.
>> [  847.001593]   Not tainted 4.13.0-rc3-ktest-linaro-hikey #202
>> [  847.017642] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
>> this message.
>> [  847.034135] rs:main Q:Reg   D0  2272  1 0x
>> [  847.077209]
>> [  847.077209] Showing all locks held in the system:
>> [  847.180736]
>> [  847.186753] =
>> [  847.186753]
>> [  863.034778] INFO: rcu_preempt detected stalls on CPUs/tasks:
>> [  863.040660]  0-...: (3 GPs behind) idle=d1a/2/0 softirq=2556/2557 
>> fqs=50881
>> [  863.047907]  (detected by 4, t=104030 jiffies, g=549, c=548, q=114)
>> [  863.054373] Task dump for CPU 0:
>> [  863.057711] swapper/0   R  running task0 0  0 
>> 

Re: RCU stall when using function_graph

2017-08-01 Thread Paul E. McKenney
> Hi Paul,
> 
> I have been trying to set the function_graph tracer for ftrace and each time I
> get a CPU stall.
> 
> How to reproduce:
> -
> 
>echo function_graph > /sys/kernel/debug/tracing/current_tracer
> 
> This error appears with v4.13-rc3 and v4.12-rc6.
> 
> Is it something already reported ?

I have seen this sort of thing, but only when actually dumping the trace
out, and I though those got fixed.  You are seeing this just accumulating
the trace?

These RCU CPU stall warnings usually occur when something grabs hold of
a CPU for too long, as in 21 seconds or so.  One way that they can happen
is excessive lock contention, another is having the kernel run through
too much data at one shot.

Adding Steven Rostedt on CC for his thoughts.

Thanx, Paul

> [ ... ]
> 
> [  472.934986] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [  472.941302]  0-...: (3 GPs behind) idle=d1a/2/0 softirq=2556/2557 fqs=3195
> [  472.948982]  (detected by 3, t=6502 jiffies, g=549, c=548, q=10)
> [  472.955180] Task dump for CPU 0:
> [  472.958513] swapper/0   R  running task0 0  0 
> 0x0002
> [  550.954786] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [  550.960657]  0-...: (3 GPs behind) idle=d1a/2/0 softirq=2556/2557 fqs=12926
> [  550.967897]  (detected by 3, t=26007 jiffies, g=549, c=548, q=27)
> [  550.974181] Task dump for CPU 0:
> [  550.977513] swapper/0   R  running task0 0  0 
> 0x0002
> [  628.974788] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [  628.980659]  0-...: (3 GPs behind) idle=d1a/2/0 softirq=2556/2557 fqs=22666
> [  628.987900]  (detected by 3, t=45512 jiffies, g=549, c=548, q=43)
> [  628.994184] Task dump for CPU 0:
> [  628.997518] swapper/0   R  running task0 0  0 
> 0x0002
> [  706.994787] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [  707.000658]  0-...: (3 GPs behind) idle=d1a/2/0 softirq=2556/2557 fqs=31785
> [  707.007899]  (detected by 3, t=65017 jiffies, g=549, c=548, q=74)
> [  707.014183] Task dump for CPU 0:
> [  707.017516] swapper/0   R  running task0 0  0 
> 0x0002
> [  707.024902] rcu_preempt kthread starved for 1261 jiffies! g549 c548 f0x0 
> RCU_GP_WAIT_FQS(3) ->state=0x1
> [  707.034548] rcu_preempt S0 8  2 0x
> [  726.049583] INFO: task jbd2/mmcblk1p2-:1120 blocked for more than 120 
> seconds.
> [  726.061772]   Not tainted 4.13.0-rc3-ktest-linaro-hikey #202
> [  726.074294] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> this message.
> [  726.089027] jbd2/mmcblk1p2- D0  1120  2 0x
> [  726.126737] INFO: task rs:main Q:Reg:2272 blocked for more than 120 
> seconds.
> [  726.140484]   Not tainted 4.13.0-rc3-ktest-linaro-hikey #202
> [  726.154515] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> this message.
> [  726.170140] rs:main Q:Reg   D0  2272  1 0x
> [  726.205626]
> [  726.205626] Showing all locks held in the system:
> [  726.272952]
> [  726.278194] =
> [  726.278194]
> [  785.014779] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [  785.020658]  0-...: (3 GPs behind) idle=d1a/2/0 softirq=2556/2557 fqs=41387
> [  785.027905]  (detected by 3, t=84524 jiffies, g=549, c=548, q=94)
> [  785.034194] Task dump for CPU 0:
> [  785.037532] swapper/0   R  running task0 0  0 
> 0x0002
> [  846.879753] INFO: task jbd2/mmcblk1p2-:1120 blocked for more than 120 
> seconds.
> [  846.892966]   Not tainted 4.13.0-rc3-ktest-linaro-hikey #202
> [  846.909597] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> this message.
> [  846.926726] jbd2/mmcblk1p2- D0  1120  2 0x
> [  846.981356] INFO: task rs:main Q:Reg:2272 blocked for more than 120 
> seconds.
> [  847.001593]   Not tainted 4.13.0-rc3-ktest-linaro-hikey #202
> [  847.017642] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> this message.
> [  847.034135] rs:main Q:Reg   D0  2272  1 0x
> [  847.077209]
> [  847.077209] Showing all locks held in the system:
> [  847.180736]
> [  847.186753] =
> [  847.186753]
> [  863.034778] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [  863.040660]  0-...: (3 GPs behind) idle=d1a/2/0 softirq=2556/2557 fqs=50881
> [  863.047907]  (detected by 4, t=104030 jiffies, g=549, c=548, q=114)
> [  863.054373] Task dump for CPU 0:
> [  863.057711] swapper/0   R  running task0 0  0 
> 0x0002
> [  941.054780] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [  941.060659]  0-...: (3 GPs behind) idle=d1a/2/0 softirq=2556/2557 fqs=59726
> [  941.067903]  (detected by 7, t=123535 jiffies, g=549, c=548, q=303)
> [  941.074370] Task dump for CPU 0:
> [  941.077708] swapper/0   R  running task

Re: RCU stall when using function_graph

2017-08-01 Thread Paul E. McKenney
> Hi Paul,
> 
> I have been trying to set the function_graph tracer for ftrace and each time I
> get a CPU stall.
> 
> How to reproduce:
> -
> 
>echo function_graph > /sys/kernel/debug/tracing/current_tracer
> 
> This error appears with v4.13-rc3 and v4.12-rc6.
> 
> Is it something already reported ?

I have seen this sort of thing, but only when actually dumping the trace
out, and I though those got fixed.  You are seeing this just accumulating
the trace?

These RCU CPU stall warnings usually occur when something grabs hold of
a CPU for too long, as in 21 seconds or so.  One way that they can happen
is excessive lock contention, another is having the kernel run through
too much data at one shot.

Adding Steven Rostedt on CC for his thoughts.

Thanx, Paul

> [ ... ]
> 
> [  472.934986] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [  472.941302]  0-...: (3 GPs behind) idle=d1a/2/0 softirq=2556/2557 fqs=3195
> [  472.948982]  (detected by 3, t=6502 jiffies, g=549, c=548, q=10)
> [  472.955180] Task dump for CPU 0:
> [  472.958513] swapper/0   R  running task0 0  0 
> 0x0002
> [  550.954786] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [  550.960657]  0-...: (3 GPs behind) idle=d1a/2/0 softirq=2556/2557 fqs=12926
> [  550.967897]  (detected by 3, t=26007 jiffies, g=549, c=548, q=27)
> [  550.974181] Task dump for CPU 0:
> [  550.977513] swapper/0   R  running task0 0  0 
> 0x0002
> [  628.974788] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [  628.980659]  0-...: (3 GPs behind) idle=d1a/2/0 softirq=2556/2557 fqs=22666
> [  628.987900]  (detected by 3, t=45512 jiffies, g=549, c=548, q=43)
> [  628.994184] Task dump for CPU 0:
> [  628.997518] swapper/0   R  running task0 0  0 
> 0x0002
> [  706.994787] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [  707.000658]  0-...: (3 GPs behind) idle=d1a/2/0 softirq=2556/2557 fqs=31785
> [  707.007899]  (detected by 3, t=65017 jiffies, g=549, c=548, q=74)
> [  707.014183] Task dump for CPU 0:
> [  707.017516] swapper/0   R  running task0 0  0 
> 0x0002
> [  707.024902] rcu_preempt kthread starved for 1261 jiffies! g549 c548 f0x0 
> RCU_GP_WAIT_FQS(3) ->state=0x1
> [  707.034548] rcu_preempt S0 8  2 0x
> [  726.049583] INFO: task jbd2/mmcblk1p2-:1120 blocked for more than 120 
> seconds.
> [  726.061772]   Not tainted 4.13.0-rc3-ktest-linaro-hikey #202
> [  726.074294] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> this message.
> [  726.089027] jbd2/mmcblk1p2- D0  1120  2 0x
> [  726.126737] INFO: task rs:main Q:Reg:2272 blocked for more than 120 
> seconds.
> [  726.140484]   Not tainted 4.13.0-rc3-ktest-linaro-hikey #202
> [  726.154515] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> this message.
> [  726.170140] rs:main Q:Reg   D0  2272  1 0x
> [  726.205626]
> [  726.205626] Showing all locks held in the system:
> [  726.272952]
> [  726.278194] =
> [  726.278194]
> [  785.014779] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [  785.020658]  0-...: (3 GPs behind) idle=d1a/2/0 softirq=2556/2557 fqs=41387
> [  785.027905]  (detected by 3, t=84524 jiffies, g=549, c=548, q=94)
> [  785.034194] Task dump for CPU 0:
> [  785.037532] swapper/0   R  running task0 0  0 
> 0x0002
> [  846.879753] INFO: task jbd2/mmcblk1p2-:1120 blocked for more than 120 
> seconds.
> [  846.892966]   Not tainted 4.13.0-rc3-ktest-linaro-hikey #202
> [  846.909597] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> this message.
> [  846.926726] jbd2/mmcblk1p2- D0  1120  2 0x
> [  846.981356] INFO: task rs:main Q:Reg:2272 blocked for more than 120 
> seconds.
> [  847.001593]   Not tainted 4.13.0-rc3-ktest-linaro-hikey #202
> [  847.017642] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables 
> this message.
> [  847.034135] rs:main Q:Reg   D0  2272  1 0x
> [  847.077209]
> [  847.077209] Showing all locks held in the system:
> [  847.180736]
> [  847.186753] =
> [  847.186753]
> [  863.034778] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [  863.040660]  0-...: (3 GPs behind) idle=d1a/2/0 softirq=2556/2557 fqs=50881
> [  863.047907]  (detected by 4, t=104030 jiffies, g=549, c=548, q=114)
> [  863.054373] Task dump for CPU 0:
> [  863.057711] swapper/0   R  running task0 0  0 
> 0x0002
> [  941.054780] INFO: rcu_preempt detected stalls on CPUs/tasks:
> [  941.060659]  0-...: (3 GPs behind) idle=d1a/2/0 softirq=2556/2557 fqs=59726
> [  941.067903]  (detected by 7, t=123535 jiffies, g=549, c=548, q=303)
> [  941.074370] Task dump for CPU 0:
> [  941.077708] swapper/0   R  running task