Re: nohz fail (was: perf related boot hang.)

2014-09-04 Thread Frederic Weisbecker

On Thu, Sep 04, 2014 at 11:05:02PM +0200, Catalin Iacob wrote:
> On Thu, Sep 4, 2014 at 10:17 PM, Frederic Weisbecker wrote:
> > Yeah, that's expected. You need to apply the nine patches on top of -rc1:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> > nohz/fixes
> >
> > "nohz: Restore NMI safe local irq work for local nohz kick" only fixes
> > part of the issue.
> 
> Ok, but if the whole series is needed, isn't it better if it all goes
> into 3.17? Otherwise 3.17 is a clear regression for some users; it's
> definitely for me since before 3.17-rc1 I never saw this bug and now I
> see it every time I do something CPU intensive. Maybe the regression
> is acceptable because it's confined to some CONFIG_NO_HZ_*
> combination (I think) which is still rather experimental, that's your
> call to make, but it's still a regression.

Yeah, the bug has been there for a while, but likely something got merged in
the last -rc1 that made it more likely to happen.

This is probably due to the fact that we converted the remote nohz kick to use
irq work instead of the scheduler IPI. So it fires more often, and if we
are unlucky enough, some tick sees the irq work before the irq work IPI
can fire.

Or some code enqueues that irq work from the tick itself.
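
To illustrate the ordering problem, here is a minimal userspace sketch in C.
It is only a toy model under stated assumptions, not kernel code: all the
toy_* names are hypothetical, and the model just assumes that a queued kick
is run by whichever of the two events (tick or irq work IPI) reaches the CPU
first.

```c
#include <assert.h>

/* Events that can reach a CPU after a nohz kick has been queued. */
enum toy_event { TOY_EV_TICK, TOY_EV_IRQ_WORK_IPI };

/* Context in which the queued kick ends up running. */
enum toy_ctx { TOY_NOT_RUN, TOY_RAN_IN_TICK, TOY_RAN_IN_IPI };

/*
 * Hypothetical model: one kick is pending, and the first event to
 * arrive drains the queue. If the tick wins the race, the kick runs
 * from inside the tick handler, which is the problematic case.
 */
static enum toy_ctx toy_kick_runs_in(const enum toy_event *events, int n)
{
    assert(n >= 0);

    for (int i = 0; i < n; i++) {
        if (events[i] == TOY_EV_TICK)
            return TOY_RAN_IN_TICK;   /* tick fallback saw the work first */
        if (events[i] == TOY_EV_IRQ_WORK_IPI)
            return TOY_RAN_IN_IPI;    /* the intended path */
    }
    return TOY_NOT_RUN;               /* nothing arrived yet */
}
```

With the events ordered tick first, the model reports the kick running in
the tick; with the IPI first, it runs where it was meant to.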

Anyway, you're right that it belongs to the category of regressions.
Unfortunately, the fix is invasive.

Also, I don't know of many users of nohz full, so this probably won't
have much impact. Or this could be a good way to find out who uses this
feature after all :o)

I'm not sure what I should do. Let's see what the final fix will look like;
Peter is proposing some simplifications. Then we'll know better.

BTW, do you run some specific workloads to trigger this?

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: nohz fail (was: perf related boot hang.)

2014-09-04 Thread Catalin Iacob

On Thu, Sep 4, 2014 at 10:17 PM, Frederic Weisbecker  wrote:
> Yeah, that's expected. You need to apply the nine patches on top of -rc1:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> nohz/fixes
>
> "nohz: Restore NMI safe local irq work for local nohz kick" only fixes
> part of the issue.

Ok, but if the whole series is needed, isn't it better if it all goes
into 3.17? Otherwise 3.17 is a clear regression for some users; it's
definitely for me since before 3.17-rc1 I never saw this bug and now I
see it every time I do something CPU intensive. Maybe the regression
is acceptable because it's confined to some CONFIG_NO_HZ_*
combination (I think) which is still rather experimental, that's your
call to make, but it's still a regression.

Re: nohz fail (was: perf related boot hang.)

2014-09-04 Thread Frederic Weisbecker

On Thu, Sep 04, 2014 at 10:07:37PM +0200, Catalin Iacob wrote:
> On Tue, Sep 2, 2014 at 8:23 PM, Catalin Iacob  wrote:
> > On Mon, Sep 1, 2014 at 10:14 PM, Frederic Weisbecker wrote:
> >> I'll send "nohz: Restore NMI safe local irq work for local nohz kick"
> >> as a fix for 3.17 and the rest will have to wait for 3.18 as it's a
> >> complicated fix for a long standing bug.
> >
> > I've been running with the full series since you sent it and haven't
> > experienced the bug since. I'll try to test with just the 3.17 patch
> > to also check that it's enough on its own.
> 
> I tested with just "nohz: Restore NMI safe local irq work for local
> nohz kick" on top of 7505ceaf8635 (which is 3.17-rc3 plus the merge of
> 1 commit) and unfortunately it doesn't fix the issue. I got the same
> panic after some minutes of building Firefox.

Yeah, that's expected. You need to apply the nine patches on top of -rc1:

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
nohz/fixes

"nohz: Restore NMI safe local irq work for local nohz kick" only fixes
part of the issue.

Thanks.

Re: nohz fail (was: perf related boot hang.)

2014-09-04 Thread Catalin Iacob

On Tue, Sep 2, 2014 at 8:23 PM, Catalin Iacob  wrote:
> On Mon, Sep 1, 2014 at 10:14 PM, Frederic Weisbecker wrote:
>> I'll send "nohz: Restore NMI safe local irq work for local nohz kick"
>> as a fix for 3.17 and the rest will have to wait for 3.18 as it's a
>> complicated fix for a long standing bug.
>
> I've been running with the full series since you sent it and haven't
> experienced the bug since. I'll try to test with just the 3.17 patch
> to also check that it's enough on its own.

I tested with just "nohz: Restore NMI safe local irq work for local
nohz kick" on top of 7505ceaf8635 (which is 3.17-rc3 plus the merge of
1 commit) and unfortunately it doesn't fix the issue. I got the same
panic after some minutes of building Firefox.

Re: nohz fail (was: perf related boot hang.)

2014-09-02 Thread Catalin Iacob

On Mon, Sep 1, 2014 at 10:14 PM, Frederic Weisbecker  wrote:
> I'll send "nohz: Restore NMI safe local irq work for local nohz kick"
> as a fix for 3.17 and the rest will have to wait for 3.18 as it's a
> complicated fix for a long standing bug.

I've been running with the full series since you sent it and haven't
experienced the bug since. I'll try to test with just the 3.17 patch
to also check that it's enough on its own.

> Can I apply your Tested-by from you both?

Sure.

Re: nohz fail (was: perf related boot hang.)

2014-09-02 Thread Dave Jones

On Mon, Sep 01, 2014 at 10:14:31PM +0200, Frederic Weisbecker wrote:
 > On Fri, Aug 22, 2014 at 10:00:09AM -0400, Dave Jones wrote:
 > > On Fri, Aug 22, 2014 at 08:01:48AM +0200, Catalin Iacob wrote:
 > >  > On Thu, Aug 21, 2014 at 4:56 PM, Frederic Weisbecker wrote:
 > >  > > Can you please test the series I just posted: "[RFC PATCH 0/9] nohz:
 > >  > > Nohz full kick fixes"?
 > >  > 
 > >  > Before applying the series I tried one more Firefox build which
 > >  > triggered the panic after less than 5 minutes.
 > >  > 
 > >  > After applying the series I did 2 full builds (around 60 minutes in
 > >  > total) with no problem.
 > > 
 > > Seems to be working fine here too after fuzzing for 17 hours so far.
 > 
 > Thanks a lot for testing this guys!
 > 
 > I'll send "nohz: Restore NMI safe local irq work for local nohz kick"
 > as a fix for 3.17 and the rest will have to wait for 3.18 as it's a
 > complicated fix for a long standing bug.
 > 
 > Can I apply your Tested-by from you both?

Sure, I haven't seen any recurrence of this since applying your patches.

Dave



Re: nohz fail (was: perf related boot hang.)

2014-09-01 Thread Frederic Weisbecker

On Fri, Aug 22, 2014 at 10:00:09AM -0400, Dave Jones wrote:
> On Fri, Aug 22, 2014 at 08:01:48AM +0200, Catalin Iacob wrote:
>  > On Thu, Aug 21, 2014 at 4:56 PM, Frederic Weisbecker wrote:
>  > > Can you please test the series I just posted: "[RFC PATCH 0/9] nohz:
>  > > Nohz full kick fixes"?
>  > 
>  > Before applying the series I tried one more Firefox build which
>  > triggered the panic after less than 5 minutes.
>  > 
>  > After applying the series I did 2 full builds (around 60 minutes in
>  > total) with no problem.
> 
> Seems to be working fine here too after fuzzing for 17 hours so far.

Thanks a lot for testing this guys!

I'll send "nohz: Restore NMI safe local irq work for local nohz kick"
as a fix for 3.17 and the rest will have to wait for 3.18 as it's a complicated
fix for a long standing bug.

Can I apply your Tested-by from you both?

Thanks.

Re: nohz fail (was: perf related boot hang.)

2014-08-22 Thread Dave Jones

On Fri, Aug 22, 2014 at 08:01:48AM +0200, Catalin Iacob wrote:
 > On Thu, Aug 21, 2014 at 4:56 PM, Frederic Weisbecker wrote:
 > > Can you please test the series I just posted: "[RFC PATCH 0/9] nohz:
 > > Nohz full kick fixes"?
 > 
 > Before applying the series I tried one more Firefox build which
 > triggered the panic after less than 5 minutes.
 > 
 > After applying the series I did 2 full builds (around 60 minutes in
 > total) with no problem.

Seems to be working fine here too after fuzzing for 17 hours so far.

Dave


Re: nohz fail (was: perf related boot hang.)

2014-08-22 Thread Catalin Iacob

On Thu, Aug 21, 2014 at 4:56 PM, Frederic Weisbecker  wrote:
> Can you please test the series I just posted: "[RFC PATCH 0/9] nohz:
> Nohz full kick fixes"?

Before applying the series I tried one more Firefox build which
triggered the panic after less than 5 minutes.

After applying the series I did 2 full builds (around 60 minutes in
total) with no problem.

Re: nohz fail (was: perf related boot hang.)

2014-08-21 Thread Frederic Weisbecker

2014-08-20 22:31 GMT+02:00 Catalin Iacob:
> I've also just hit what seems to be the same panic in 3.17-rc1 (ignore the 1
> local patch, it's an unrelated change in a comment) twice in less than 1
> hour. Hitting this twice in a short amount of time seems to be proof that the
> 3.17 merge window made it trigger more often. Both times I was running a grep
> over a Firefox build tree which was taking a long time.
>
> The stacktraces are slightly different but both have the "cancel timer from a
> timer, followed by nmi" pattern. Pictures of the 2 stacktraces:
> https://drive.google.com/file/d/0B_fRjDygGZSNY0RIc2dyYTExTjg/edit?usp=sharing
> https://drive.google.com/file/d/0B_fRjDygGZSNS1pSWFkteURrOTQ/edit?usp=sharing

Hi Catalin, Dave,

Can you please test the series I just posted: "[RFC PATCH 0/9] nohz:
Nohz full kick fixes"?
It should fix the issues.

Thanks.

Re: nohz fail (was: perf related boot hang.)

2014-08-20 Thread Catalin Iacob

I've also just hit what seems to be the same panic in 3.17-rc1 (ignore the 1
local patch, it's an unrelated change in a comment) twice in less than 1
hour. Hitting this twice in a short amount of time seems to be proof that the
3.17 merge window made it trigger more often. Both times I was running a grep
over a Firefox build tree which was taking a long time.

The stacktraces are slightly different but both have the "cancel timer from a
timer, followed by nmi" pattern. Pictures of the 2 stacktraces:
https://drive.google.com/file/d/0B_fRjDygGZSNY0RIc2dyYTExTjg/edit?usp=sharing
https://drive.google.com/file/d/0B_fRjDygGZSNS1pSWFkteURrOTQ/edit?usp=sharing

Re: nohz fail (was: perf related boot hang.)

2014-08-11 Thread Dave Jones

On Thu, Aug 07, 2014 at 03:16:49PM +0200, Frederic Weisbecker wrote:
 
 > > >   <<EOE>>  <IRQ>  lock_acquired+0xaf/0x450
 > > >   ? lock_hrtimer_base.isra.20+0x25/0x50
 > > >   _raw_spin_lock_irqsave+0x78/0x90
 > > >   ? lock_hrtimer_base.isra.20+0x25/0x50
 > > >   lock_hrtimer_base.isra.20+0x25/0x50
 > > >   hrtimer_try_to_cancel+0x33/0x1e0
 > > >   hrtimer_cancel+0x1a/0x30
 > > >   tick_nohz_restart+0x17/0x90
 > > >   __tick_nohz_full_check+0xc3/0x100
 > > >   nohz_full_kick_work_func+0xe/0x10
 > > >   irq_work_run_list+0x44/0x70
 > > >   irq_work_run+0x2a/0x50
 > > >   update_process_times+0x5b/0x70
 > > >   tick_sched_handle.isra.21+0x25/0x60
 > > >   tick_sched_timer+0x41/0x60
 > > >   __run_hrtimer+0x72/0x470
 > > >   ? tick_sched_do_timer+0xb0/0xb0
 > > >   hrtimer_interrupt+0x117/0x270
 > > >   local_apic_timer_interrupt+0x37/0x60
 > > >   smp_apic_timer_interrupt+0x3f/0x50
 > > >   apic_timer_interrupt+0x6f/0x80
 > > 
 > > And that looks like someone trying to cancel a timer from a timer, I
 > > guess that won't work, seeing how cancel will wait for the timer handler
 > > completion etc.
 > > 
 > > This is because of the fallback irq_work_run() in the tick
 > > (update_process_times).
 > > 
 > 
 > Indeed, I saw that too but very rarely.

FWIW, I'm now seeing this quite often (several times a day) when I run
trinity on current git master.

Dave


Re: nohz fail (was: perf related boot hang.)

2014-08-07 Thread Frederic Weisbecker

On Thu, Aug 07, 2014 at 11:03:33AM +0200, Peter Zijlstra wrote:
> On Wed, Aug 06, 2014 at 03:46:56PM -0400, Dave Jones wrote:
> > This one happened during runtime, but I got a whole stack..
> > 
> > 
> >  Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 2
> >  CPU: 2 PID: 7538 Comm: kworker/u8:8 Not tainted 3.16.0+ #34
> >  Workqueue: btrfs-endio-write normal_work_helper [btrfs]
> >   880244c06c88 1b486fe1 880244c06bf0 8a7f1e37
> >   8ac52a18 880244c06c78 8a7ef928 0010
> >   880244c06c88 880244c06c20 1b486fe1 
> >  Call Trace:
> >   dump_stack+0x4e/0x7a
> >   panic+0xd4/0x207
> >   watchdog_overflow_callback+0x118/0x120
> >   __perf_event_overflow+0xae/0x350
> >   ? perf_event_task_disable+0xa0/0xa0
> >   ? x86_perf_event_set_period+0xbf/0x150
> >   perf_event_overflow+0x14/0x20
> >   intel_pmu_handle_irq+0x206/0x410
> >   perf_event_nmi_handler+0x2b/0x50
> >   nmi_handle+0xd2/0x390
> >   ? nmi_handle+0x5/0x390
> >   ? match_held_lock+0x8/0x1b0
> >   default_do_nmi+0x72/0x1c0
> >   do_nmi+0xb8/0x100
> >   end_repeat_nmi+0x1e/0x2e
> >   ? match_held_lock+0x8/0x1b0
> >   ? match_held_lock+0x8/0x1b0
> >   ? match_held_lock+0x8/0x1b0
> 
> Ok so that part is just the watchdog triggering, so the below part is
> the screwy bit:
> 
> >   <<EOE>>  <IRQ>  lock_acquired+0xaf/0x450
> >   ? lock_hrtimer_base.isra.20+0x25/0x50
> >   _raw_spin_lock_irqsave+0x78/0x90
> >   ? lock_hrtimer_base.isra.20+0x25/0x50
> >   lock_hrtimer_base.isra.20+0x25/0x50
> >   hrtimer_try_to_cancel+0x33/0x1e0
> >   hrtimer_cancel+0x1a/0x30
> >   tick_nohz_restart+0x17/0x90
> >   __tick_nohz_full_check+0xc3/0x100
> >   nohz_full_kick_work_func+0xe/0x10
> >   irq_work_run_list+0x44/0x70
> >   irq_work_run+0x2a/0x50
> >   update_process_times+0x5b/0x70
> >   tick_sched_handle.isra.21+0x25/0x60
> >   tick_sched_timer+0x41/0x60
> >   __run_hrtimer+0x72/0x470
> >   ? tick_sched_do_timer+0xb0/0xb0
> >   hrtimer_interrupt+0x117/0x270
> >   local_apic_timer_interrupt+0x37/0x60
> >   smp_apic_timer_interrupt+0x3f/0x50
> >   apic_timer_interrupt+0x6f/0x80
> 
> And that looks like someone trying to cancel a timer from a timer, I
> guess that won't work, seeing how cancel will wait for the timer handler
> completion etc.
> 
> This is because of the fallback irq_work_run() in the tick
> (update_process_times).
> 

Indeed, I saw that too but very rarely. The nohz kick needs to restart
the tick asynchronously (so we use irq work) and all is fine as long as
irq work actually runs through the irq work IRQ. But when it triggers
through the tick, that's when we fail like above. This should be a rare
scenario for archs that support raising irq_work, but it can happen.

I've been considering checking and restarting the tick from irq exit. But
that's going to add extra checks on all IRQs.

Ideally we should be able to force irq work to run from the irq work
interrupt when we know that the arch supports it (except for lazy irq
works). In fact that alone is a prerequisite for nohz full itself: if the
arch can't raise irq_work IRQs, nohz full can't work anyway. So I knew that
sooner or later I'd need to add a check for that.

nohz fail (was: perf related boot hang.)

2014-08-07 Thread Peter Zijlstra

On Wed, Aug 06, 2014 at 03:46:56PM -0400, Dave Jones wrote:
> This one happened during runtime, but I got a whole stack..
> 
> 
>  Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 2
>  CPU: 2 PID: 7538 Comm: kworker/u8:8 Not tainted 3.16.0+ #34
>  Workqueue: btrfs-endio-write normal_work_helper [btrfs]
>   880244c06c88 1b486fe1 880244c06bf0 8a7f1e37
>   8ac52a18 880244c06c78 8a7ef928 0010
>   880244c06c88 880244c06c20 1b486fe1 
>  Call Trace:
>   <NMI>  [] dump_stack+0x4e/0x7a
>   [] panic+0xd4/0x207
>   [] watchdog_overflow_callback+0x118/0x120
>   [] __perf_event_overflow+0xae/0x350
>   [] ? perf_event_task_disable+0xa0/0xa0
>   [] ? x86_perf_event_set_period+0xbf/0x150
>   [] perf_event_overflow+0x14/0x20
>   [] intel_pmu_handle_irq+0x206/0x410
>   [] perf_event_nmi_handler+0x2b/0x50
>   [] nmi_handle+0xd2/0x390
>   [] ? nmi_handle+0x5/0x390
>   [] ? match_held_lock+0x8/0x1b0
>   [] default_do_nmi+0x72/0x1c0
>   [] do_nmi+0xb8/0x100
>   [] end_repeat_nmi+0x1e/0x2e
>   [] ? match_held_lock+0x8/0x1b0
>   [] ? match_held_lock+0x8/0x1b0
>   [] ? match_held_lock+0x8/0x1b0

Ok so that part is just the watchdog triggering, so the below part is
the screwy bit:

>   <<EOE>>  <IRQ>  [] lock_acquired+0xaf/0x450
>   [] ? lock_hrtimer_base.isra.20+0x25/0x50
>   [] _raw_spin_lock_irqsave+0x78/0x90
>   [] ? lock_hrtimer_base.isra.20+0x25/0x50
>   [] lock_hrtimer_base.isra.20+0x25/0x50
>   [] hrtimer_try_to_cancel+0x33/0x1e0
>   [] hrtimer_cancel+0x1a/0x30
>   [] tick_nohz_restart+0x17/0x90
>   [] __tick_nohz_full_check+0xc3/0x100
>   [] nohz_full_kick_work_func+0xe/0x10
>   [] irq_work_run_list+0x44/0x70
>   [] irq_work_run+0x2a/0x50
>   [] update_process_times+0x5b/0x70
>   [] tick_sched_handle.isra.21+0x25/0x60
>   [] tick_sched_timer+0x41/0x60
>   [] __run_hrtimer+0x72/0x470
>   [] ? tick_sched_do_timer+0xb0/0xb0
>   [] hrtimer_interrupt+0x117/0x270
>   [] local_apic_timer_interrupt+0x37/0x60
>   [] smp_apic_timer_interrupt+0x3f/0x50
>   [] apic_timer_interrupt+0x6f/0x80

And that looks like someone trying to cancel a timer from a timer, I
guess that won't work, seeing how cancel will wait for the timer handler
completion etc.

This is because of the fallback irq_work_run() in the tick
(update_process_times).



