Re: nohz fail (was: perf related boot hang.)
On Thu, Sep 04, 2014 at 11:05:02PM +0200, Catalin Iacob wrote:
> On Thu, Sep 4, 2014 at 10:17 PM, Frederic Weisbecker
> <fweis...@gmail.com> wrote:
> > Yeah, that's expected. You need to apply the nine patches on top of -rc1:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> > nohz/fixes
> >
> > "nohz: Restore NMI safe local irq work for local nohz kick" only fixes
> > part of the issue.
>
> Ok, but if the whole series is needed, isn't it better if it all goes
> into 3.17? Otherwise 3.17 is a clear regression for some users; it's
> definitely for me since before 3.17-rc1 I never saw this bug and now I
> see it every time I do something CPU intensive. Maybe the regression
> is acceptable because it's confined to some CONFIG_NO_HZ_*
> combination (I think) which is still rather experimental, that's your
> call to make, but it's still a regression.

Yeah, the bug has been there for a while, but likely something got merged
in the last -rc1 that made it more likely to happen. This is probably due
to the fact that we converted the remote nohz kick to use irq work instead
of the scheduler IPI. So it fires more often and, if we are unlucky enough,
some tick sees the irq work before the irq work IPI can fire. Or some code
enqueues that irq work from the tick itself.

Anyway, you're right that it belongs to the category of regressions.
Unfortunately the fix is invasive. Also I don't know many users of nohz
full, so this probably won't have much impact. Or this could be a good way
to find out who uses this feature after all :o)

I'm not sure what I should do. Let's see what the final fix will look like;
Peter is proposing some simplifications. Then we'll know better.

BTW, do you run some specific workloads to trigger this?

Thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
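[Annotation] The ordering problem described in the message above ("some tick sees the irq work before the irq work IPI can fire") can be sketched as a toy userspace model. This is only an illustration under stated assumptions, not kernel code: `struct model_cpu`, `enum model_ctx` and every `model_*` function are hypothetical names invented for this sketch. The one thing taken from the thread is that the tick path (update_process_times) has a fallback call to irq_work_run(), so whichever interrupt arrives first drains the queue.

```c
#include <assert.h>
#include <stdbool.h>

/* Context in which the queued kick callback ended up running (model only). */
enum model_ctx { CTX_NONE, CTX_IRQ_WORK_IPI, CTX_TICK };

/* Hypothetical per-CPU state: just enough to show the ordering race. */
struct model_cpu {
    bool work_pending;      /* a nohz kick is queued on this CPU */
    bool ipi_raised;        /* irq work interrupt requested, not yet delivered */
    enum model_ctx ran_in;  /* who drained the queue */
};

/* Queue the kick and request the irq work interrupt. */
void model_queue_kick(struct model_cpu *cpu)
{
    assert(!cpu->work_pending); /* the model tracks a single pending kick */
    cpu->work_pending = true;
    cpu->ipi_raised = true;
}

/* Drain the queue; the first caller wins, later callers find it empty. */
static void model_irq_work_run(struct model_cpu *cpu, enum model_ctx ctx)
{
    if (!cpu->work_pending)
        return;
    cpu->work_pending = false;
    cpu->ran_in = ctx;
}

/* Intended path: the dedicated irq work interrupt runs the callback. */
void model_deliver_ipi(struct model_cpu *cpu)
{
    cpu->ipi_raised = false;
    model_irq_work_run(cpu, CTX_IRQ_WORK_IPI);
}

/* Fallback path: the tick also drains pending irq work. */
void model_tick(struct model_cpu *cpu)
{
    model_irq_work_run(cpu, CTX_TICK);
}
```

Calling model_queue_kick(), then model_deliver_ipi(), then model_tick() leaves ran_in at CTX_IRQ_WORK_IPI, the harmless case. Swapping the last two calls leaves it at CTX_TICK: the kick callback ran inside the tick handler, which is the situation the stack traces in this thread show.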
Re: nohz fail (was: perf related boot hang.)
On Thu, Sep 4, 2014 at 10:17 PM, Frederic Weisbecker <fweis...@gmail.com> wrote:
> Yeah, that's expected. You need to apply the nine patches on top of -rc1:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
> nohz/fixes
>
> "nohz: Restore NMI safe local irq work for local nohz kick" only fixes
> part of the issue.

Ok, but if the whole series is needed, isn't it better if it all goes
into 3.17? Otherwise 3.17 is a clear regression for some users; it's
definitely for me since before 3.17-rc1 I never saw this bug and now I
see it every time I do something CPU intensive. Maybe the regression
is acceptable because it's confined to some CONFIG_NO_HZ_*
combination (I think) which is still rather experimental, that's your
call to make, but it's still a regression.
Re: nohz fail (was: perf related boot hang.)
On Thu, Sep 04, 2014 at 10:07:37PM +0200, Catalin Iacob wrote:
> On Tue, Sep 2, 2014 at 8:23 PM, Catalin Iacob <iacobcata...@gmail.com> wrote:
> > On Mon, Sep 1, 2014 at 10:14 PM, Frederic Weisbecker
> > <fweis...@gmail.com> wrote:
> >> I'll send "nohz: Restore NMI safe local irq work for local nohz kick"
> >> as a fix for 3.17 and the rest will have to wait for 3.18 as it's a
> >> complicated fix for a long standing bug.
> >
> > I've been running with the full series since you sent it and haven't
> > experienced the bug since. I'll try to test with just the 3.17 patch
> > to also check that it's enough on its own.
>
> I tested with just "nohz: Restore NMI safe local irq work for local
> nohz kick" on top of 7505ceaf8635 (which is 3.17-rc3 plus the merge of
> 1 commit) and unfortunately it doesn't fix the issue. I got the same
> panic after some minutes of building Firefox.

Yeah, that's expected. You need to apply the nine patches on top of -rc1:

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/linux-dynticks.git
nohz/fixes

"nohz: Restore NMI safe local irq work for local nohz kick" only fixes
part of the issue.

Thanks.
Re: nohz fail (was: perf related boot hang.)
On Tue, Sep 2, 2014 at 8:23 PM, Catalin Iacob <iacobcata...@gmail.com> wrote:
> On Mon, Sep 1, 2014 at 10:14 PM, Frederic Weisbecker
> <fweis...@gmail.com> wrote:
>> I'll send "nohz: Restore NMI safe local irq work for local nohz kick"
>> as a fix for 3.17 and the rest will have to wait for 3.18 as it's a
>> complicated fix for a long standing bug.
>
> I've been running with the full series since you sent it and haven't
> experienced the bug since. I'll try to test with just the 3.17 patch
> to also check that it's enough on its own.

I tested with just "nohz: Restore NMI safe local irq work for local
nohz kick" on top of 7505ceaf8635 (which is 3.17-rc3 plus the merge of
1 commit) and unfortunately it doesn't fix the issue. I got the same
panic after some minutes of building Firefox.
Re: nohz fail (was: perf related boot hang.)
On Mon, Sep 1, 2014 at 10:14 PM, Frederic Weisbecker <fweis...@gmail.com> wrote:
> I'll send "nohz: Restore NMI safe local irq work for local nohz kick"
> as a fix for 3.17 and the rest will have to wait for 3.18 as it's a
> complicated fix for a long standing bug.

I've been running with the full series since you sent it and haven't
experienced the bug since. I'll try to test with just the 3.17 patch
to also check that it's enough on its own.

> Can I apply your Tested-by from you both?

Sure.
Re: nohz fail (was: perf related boot hang.)
On Mon, Sep 01, 2014 at 10:14:31PM +0200, Frederic Weisbecker wrote:
> On Fri, Aug 22, 2014 at 10:00:09AM -0400, Dave Jones wrote:
> > On Fri, Aug 22, 2014 at 08:01:48AM +0200, Catalin Iacob wrote:
> > > On Thu, Aug 21, 2014 at 4:56 PM, Frederic Weisbecker
> > > <fweis...@gmail.com> wrote:
> > > > Can you please test the series I just posted: "[RFC PATCH 0/9] nohz:
> > > > Nohz full kick fixes"?
> > >
> > > Before applying the series I tried one more Firefox build which
> > > triggered the panic after less than 5 minutes.
> > >
> > > After applying the series I did 2 full builds (around 60 minutes in
> > > total) with no problem.
> >
> > Seems to be working fine here too after fuzzing for 17 hours so far.
>
> Thanks a lot for testing this guys!
>
> I'll send "nohz: Restore NMI safe local irq work for local nohz kick"
> as a fix for 3.17 and the rest will have to wait for 3.18 as it's a
> complicated fix for a long standing bug.
>
> Can I apply your Tested-by from you both?

Sure, I haven't seen any recurrence of this since applying your patches.

Dave
Re: nohz fail (was: perf related boot hang.)
On Fri, Aug 22, 2014 at 10:00:09AM -0400, Dave Jones wrote:
> On Fri, Aug 22, 2014 at 08:01:48AM +0200, Catalin Iacob wrote:
> > On Thu, Aug 21, 2014 at 4:56 PM, Frederic Weisbecker
> > <fweis...@gmail.com> wrote:
> > > Can you please test the series I just posted: "[RFC PATCH 0/9] nohz:
> > > Nohz full kick fixes"?
> >
> > Before applying the series I tried one more Firefox build which
> > triggered the panic after less than 5 minutes.
> >
> > After applying the series I did 2 full builds (around 60 minutes in
> > total) with no problem.
>
> Seems to be working fine here too after fuzzing for 17 hours so far.

Thanks a lot for testing this guys!

I'll send "nohz: Restore NMI safe local irq work for local nohz kick"
as a fix for 3.17 and the rest will have to wait for 3.18 as it's a
complicated fix for a long standing bug.

Can I apply your Tested-by from you both?

Thanks.
Re: nohz fail (was: perf related boot hang.)
On Fri, Aug 22, 2014 at 08:01:48AM +0200, Catalin Iacob wrote:
> On Thu, Aug 21, 2014 at 4:56 PM, Frederic Weisbecker
> <fweis...@gmail.com> wrote:
> > Can you please test the series I just posted: "[RFC PATCH 0/9] nohz:
> > Nohz full kick fixes"?
>
> Before applying the series I tried one more Firefox build which
> triggered the panic after less than 5 minutes.
>
> After applying the series I did 2 full builds (around 60 minutes in
> total) with no problem.

Seems to be working fine here too after fuzzing for 17 hours so far.

Dave
Re: nohz fail (was: perf related boot hang.)
On Thu, Aug 21, 2014 at 4:56 PM, Frederic Weisbecker <fweis...@gmail.com> wrote:
> Can you please test the series I just posted: "[RFC PATCH 0/9] nohz:
> Nohz full kick fixes"?

Before applying the series I tried one more Firefox build which
triggered the panic after less than 5 minutes.

After applying the series I did 2 full builds (around 60 minutes in
total) with no problem.
Re: nohz fail (was: perf related boot hang.)
2014-08-20 22:31 GMT+02:00 Catalin Iacob <iacobcata...@gmail.com>:
> I've also just hit what seems to be the same panic in 3.17-rc1 (ignore the 1
> local patch, it's an unrelated change in a comment) twice in less than 1
> hour. Hitting this twice in a short amount of time seems to be proof that the
> 3.17 merge window made it trigger more often. Both times I was running a grep
> over a Firefox build tree which was taking a long time.
>
> The stacktraces are slightly different but both have the "cancel timer from a
> timer, followed by nmi" pattern. Pictures of the 2 stacktraces:
> https://drive.google.com/file/d/0B_fRjDygGZSNY0RIc2dyYTExTjg/edit?usp=sharing
> https://drive.google.com/file/d/0B_fRjDygGZSNS1pSWFkteURrOTQ/edit?usp=sharing

Hi Catalin, Dave,

Can you please test the series I just posted: "[RFC PATCH 0/9] nohz:
Nohz full kick fixes"?

It should fix the issues.

Thanks.
Re: nohz fail (was: perf related boot hang.)
I've also just hit what seems to be the same panic in 3.17-rc1 (ignore the 1
local patch, it's an unrelated change in a comment) twice in less than 1
hour. Hitting this twice in a short amount of time seems to be proof that the
3.17 merge window made it trigger more often. Both times I was running a grep
over a Firefox build tree which was taking a long time.

The stacktraces are slightly different but both have the "cancel timer from a
timer, followed by nmi" pattern. Pictures of the 2 stacktraces:
https://drive.google.com/file/d/0B_fRjDygGZSNY0RIc2dyYTExTjg/edit?usp=sharing
https://drive.google.com/file/d/0B_fRjDygGZSNS1pSWFkteURrOTQ/edit?usp=sharing
Re: nohz fail (was: perf related boot hang.)
On Thu, Aug 07, 2014 at 03:16:49PM +0200, Frederic Weisbecker wrote:
> > >  <<EOE>>  <IRQ>  [8a0ccd2f] lock_acquired+0xaf/0x450
> > >  [8a0f74c5] ? lock_hrtimer_base.isra.20+0x25/0x50
> > >  [8a7fc678] _raw_spin_lock_irqsave+0x78/0x90
> > >  [8a0f74c5] ? lock_hrtimer_base.isra.20+0x25/0x50
> > >  [8a0f74c5] lock_hrtimer_base.isra.20+0x25/0x50
> > >  [8a0f7723] hrtimer_try_to_cancel+0x33/0x1e0
> > >  [8a0f78ea] hrtimer_cancel+0x1a/0x30
> > >  [8a109237] tick_nohz_restart+0x17/0x90
> > >  [8a10a213] __tick_nohz_full_check+0xc3/0x100
> > >  [8a10a25e] nohz_full_kick_work_func+0xe/0x10
> > >  [8a17c884] irq_work_run_list+0x44/0x70
> > >  [8a17c8da] irq_work_run+0x2a/0x50
> > >  [8a0f700b] update_process_times+0x5b/0x70
> > >  [8a109005] tick_sched_handle.isra.21+0x25/0x60
> > >  [8a109b81] tick_sched_timer+0x41/0x60
> > >  [8a0f7aa2] __run_hrtimer+0x72/0x470
> > >  [8a109b40] ? tick_sched_do_timer+0xb0/0xb0
> > >  [8a0f8707] hrtimer_interrupt+0x117/0x270
> > >  [8a034357] local_apic_timer_interrupt+0x37/0x60
> > >  [8a80010f] smp_apic_timer_interrupt+0x3f/0x50
> > >  [8a7fe52f] apic_timer_interrupt+0x6f/0x80
> >
> > And that looks like someone trying to cancel a timer from a timer, I
> > guess that won't work, seeing how cancel will wait for the timer handler
> > completion etc.
> >
> > This is because of the fallback irq_work_run() in the tick
> > (update_process_times).
>
> Indeed, I saw that too but very rarely.

FWIW, I'm now seeing this quite often (several times a day) when I run
trinity on current git master.

Dave
Re: nohz fail (was: perf related boot hang.)
On Thu, Aug 07, 2014 at 11:03:33AM +0200, Peter Zijlstra wrote:
> On Wed, Aug 06, 2014 at 03:46:56PM -0400, Dave Jones wrote:
> > This one happened during runtime, but I got a whole stack..
> >
> >
> > Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 2
> > CPU: 2 PID: 7538 Comm: kworker/u8:8 Not tainted 3.16.0+ #34
> > Workqueue: btrfs-endio-write normal_work_helper [btrfs]
> >  880244c06c88 1b486fe1 880244c06bf0 8a7f1e37
> >  8ac52a18 880244c06c78 8a7ef928 0010
> >  880244c06c88 880244c06c20 1b486fe1
> > Call Trace:
> >  <NMI>  [8a7f1e37] dump_stack+0x4e/0x7a
> >  [8a7ef928] panic+0xd4/0x207
> >  [8a1450e8] watchdog_overflow_callback+0x118/0x120
> >  [8a186b0e] __perf_event_overflow+0xae/0x350
> >  [8a184f80] ? perf_event_task_disable+0xa0/0xa0
> >  [8a01a4cf] ? x86_perf_event_set_period+0xbf/0x150
> >  [8a187934] perf_event_overflow+0x14/0x20
> >  [8a020386] intel_pmu_handle_irq+0x206/0x410
> >  [8a01937b] perf_event_nmi_handler+0x2b/0x50
> >  [8a007b72] nmi_handle+0xd2/0x390
> >  [8a007aa5] ? nmi_handle+0x5/0x390
> >  [8a0cb7f8] ? match_held_lock+0x8/0x1b0
> >  [8a008062] default_do_nmi+0x72/0x1c0
> >  [8a008268] do_nmi+0xb8/0x100
> >  [8a7ff66a] end_repeat_nmi+0x1e/0x2e
> >  [8a0cb7f8] ? match_held_lock+0x8/0x1b0
> >  [8a0cb7f8] ? match_held_lock+0x8/0x1b0
> >  [8a0cb7f8] ? match_held_lock+0x8/0x1b0
>
> Ok so that part is just the watchdog triggering, so the below part is
> the screwy bit:
>
> >  <<EOE>>  <IRQ>  [8a0ccd2f] lock_acquired+0xaf/0x450
> >  [8a0f74c5] ? lock_hrtimer_base.isra.20+0x25/0x50
> >  [8a7fc678] _raw_spin_lock_irqsave+0x78/0x90
> >  [8a0f74c5] ? lock_hrtimer_base.isra.20+0x25/0x50
> >  [8a0f74c5] lock_hrtimer_base.isra.20+0x25/0x50
> >  [8a0f7723] hrtimer_try_to_cancel+0x33/0x1e0
> >  [8a0f78ea] hrtimer_cancel+0x1a/0x30
> >  [8a109237] tick_nohz_restart+0x17/0x90
> >  [8a10a213] __tick_nohz_full_check+0xc3/0x100
> >  [8a10a25e] nohz_full_kick_work_func+0xe/0x10
> >  [8a17c884] irq_work_run_list+0x44/0x70
> >  [8a17c8da] irq_work_run+0x2a/0x50
> >  [8a0f700b] update_process_times+0x5b/0x70
> >  [8a109005] tick_sched_handle.isra.21+0x25/0x60
> >  [8a109b81] tick_sched_timer+0x41/0x60
> >  [8a0f7aa2] __run_hrtimer+0x72/0x470
> >  [8a109b40] ? tick_sched_do_timer+0xb0/0xb0
> >  [8a0f8707] hrtimer_interrupt+0x117/0x270
> >  [8a034357] local_apic_timer_interrupt+0x37/0x60
> >  [8a80010f] smp_apic_timer_interrupt+0x3f/0x50
> >  [8a7fe52f] apic_timer_interrupt+0x6f/0x80
>
> And that looks like someone trying to cancel a timer from a timer, I
> guess that won't work, seeing how cancel will wait for the timer handler
> completion etc.
>
> This is because of the fallback irq_work_run() in the tick
> (update_process_times).
>

Indeed, I saw that too but very rarely.

The nohz kick needs to restart the tick asynchronously (so we use irq work),
and all is fine as long as the irq work actually runs through the irq work
IRQ. But when it triggers through the tick, that's when we fail like above.
This should be a rare scenario for archs that support raising irq_work, but
it can happen.

I've been considering checking and restarting the tick from irq exit. But
that's going to add extra checks on all IRQs.

Ideally we should be able to force irq work to run from the irq work
interrupt when we know that the arch supports it (except for lazy ones).

In fact that alone is a requisite for nohz full itself. If the arch can't
raise irq_work IRQs, nohz full can't work anyway. So I knew that sooner or
later I'd need to add a check for that.
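[Annotation] The failure discussed above (the tick's own handler ends up calling hrtimer_cancel() on the tick timer) can be illustrated with a small userspace model. It is a sketch under stated assumptions, not the kernel implementation: `struct model_timer` and all `model_*` names are hypothetical. What it borrows from the kernel is only the documented contract that hrtimer_try_to_cancel() returns -1 while the timer's callback is running, and that hrtimer_cancel() keeps retrying until it gets a non-negative result. The model bounds the retry loop with `max_spins` so that it terminates; the real loop has no such bound, hence the hard lockup.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for an hrtimer: only the state needed to show the hang. */
struct model_timer {
    bool queued;                       /* timer is armed */
    bool callback_running;             /* its handler is executing right now */
    void (*fn)(struct model_timer *);  /* the handler */
};

/*
 * Same return convention as hrtimer_try_to_cancel():
 *  1 = timer was queued and is now removed, 0 = timer was not queued,
 * -1 = handler is currently running and cannot be stopped.
 */
static int model_try_to_cancel(struct model_timer *t)
{
    if (t->callback_running)
        return -1;
    if (!t->queued)
        return 0;
    t->queued = false;
    return 1;
}

/*
 * Retry until try_to_cancel stops returning -1. Returns true if the cancel
 * completed, false if max_spins retries were exhausted (a hang in reality).
 */
static bool model_cancel(struct model_timer *t, int max_spins)
{
    assert(max_spins > 0);
    for (int i = 0; i < max_spins; i++) {
        if (model_try_to_cancel(t) >= 0)
            return true;
    }
    return false;
}

static bool cancel_completed;

/* Handler that cancels its own timer, like the kick run from the tick. */
static void model_tick_handler(struct model_timer *t)
{
    cancel_completed = model_cancel(t, 1000);
}

/* Expiry path: mark the handler as running for the duration of the call. */
static void model_run_timer(struct model_timer *t)
{
    t->queued = false;
    t->callback_running = true;
    t->fn(t);
    t->callback_running = false;
}

/* Kick arrives from a separate interrupt: the cancel succeeds. */
bool model_kick_from_ipi(void)
{
    struct model_timer tick = { true, false, model_tick_handler };
    return model_cancel(&tick, 1000);
}

/* Kick runs inside the tick handler: the cancel waits on itself. */
bool model_kick_from_tick(void)
{
    struct model_timer tick = { true, false, model_tick_handler };
    cancel_completed = true;
    model_run_timer(&tick);
    return cancel_completed;
}
```

In this model, model_kick_from_ipi() returns true while model_kick_from_tick() returns false, because the cancel cannot complete until the handler that issued it has returned.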
nohz fail (was: perf related boot hang.)
On Wed, Aug 06, 2014 at 03:46:56PM -0400, Dave Jones wrote:
> This one happened during runtime, but I got a whole stack..
>
> Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 2
> CPU: 2 PID: 7538 Comm: kworker/u8:8 Not tainted 3.16.0+ #34
> Workqueue: btrfs-endio-write normal_work_helper [btrfs]
>  880244c06c88 1b486fe1 880244c06bf0 8a7f1e37
>  8ac52a18 880244c06c78 8a7ef928 0010
>  880244c06c88 880244c06c20 1b486fe1
> Call Trace:
>  <NMI> [8a7f1e37] dump_stack+0x4e/0x7a
>  [8a7ef928] panic+0xd4/0x207
>  [8a1450e8] watchdog_overflow_callback+0x118/0x120
>  [8a186b0e] __perf_event_overflow+0xae/0x350
>  [8a184f80] ? perf_event_task_disable+0xa0/0xa0
>  [8a01a4cf] ? x86_perf_event_set_period+0xbf/0x150
>  [8a187934] perf_event_overflow+0x14/0x20
>  [8a020386] intel_pmu_handle_irq+0x206/0x410
>  [8a01937b] perf_event_nmi_handler+0x2b/0x50
>  [8a007b72] nmi_handle+0xd2/0x390
>  [8a007aa5] ? nmi_handle+0x5/0x390
>  [8a0cb7f8] ? match_held_lock+0x8/0x1b0
>  [8a008062] default_do_nmi+0x72/0x1c0
>  [8a008268] do_nmi+0xb8/0x100
>  [8a7ff66a] end_repeat_nmi+0x1e/0x2e
>  [8a0cb7f8] ? match_held_lock+0x8/0x1b0
>  [8a0cb7f8] ? match_held_lock+0x8/0x1b0
>  [8a0cb7f8] ? match_held_lock+0x8/0x1b0

Ok so that part is just the watchdog triggering, so the below part is the screwy bit:

>  <<EOE>> <IRQ> [8a0ccd2f] lock_acquired+0xaf/0x450
>  [8a0f74c5] ? lock_hrtimer_base.isra.20+0x25/0x50
>  [8a7fc678] _raw_spin_lock_irqsave+0x78/0x90
>  [8a0f74c5] ? lock_hrtimer_base.isra.20+0x25/0x50
>  [8a0f74c5] lock_hrtimer_base.isra.20+0x25/0x50
>  [8a0f7723] hrtimer_try_to_cancel+0x33/0x1e0
>  [8a0f78ea] hrtimer_cancel+0x1a/0x30
>  [8a109237] tick_nohz_restart+0x17/0x90
>  [8a10a213] __tick_nohz_full_check+0xc3/0x100
>  [8a10a25e] nohz_full_kick_work_func+0xe/0x10
>  [8a17c884] irq_work_run_list+0x44/0x70
>  [8a17c8da] irq_work_run+0x2a/0x50
>  [8a0f700b] update_process_times+0x5b/0x70
>  [8a109005] tick_sched_handle.isra.21+0x25/0x60
>  [8a109b81] tick_sched_timer+0x41/0x60
>  [8a0f7aa2] __run_hrtimer+0x72/0x470
>  [8a109b40] ? tick_sched_do_timer+0xb0/0xb0
>  [8a0f8707] hrtimer_interrupt+0x117/0x270
>  [8a034357] local_apic_timer_interrupt+0x37/0x60
>  [8a80010f] smp_apic_timer_interrupt+0x3f/0x50
>  [8a7fe52f] apic_timer_interrupt+0x6f/0x80

And that looks like someone trying to cancel a timer from a timer, I guess that won't work, seeing how cancel will wait for the timer handler completion etc.

This is because of the fallback irq_work_run() in the tick (update_process_times).
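Why cancelling a timer from its own handler cannot finish can be shown with a small userspace model. This is a sketch under assumptions, not the kernel implementation: `struct model_timer` and the `model_*` functions are hypothetical names, and only the return convention is taken from `hrtimer_try_to_cancel()` (-1 while the callback is running, 1 if an armed timer was removed, 0 if the timer was idle). The retry loop is bounded here so the demonstration terminates; `hrtimer_cancel()` keeps retrying until the result is no longer negative.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model only; this is not the kernel's struct hrtimer. */
struct model_timer {
    bool queued;            /* timer is armed */
    bool callback_running;  /* its handler is executing right now */
    int rejected;           /* cancel attempts refused while the handler ran */
};

/* Same return convention as hrtimer_try_to_cancel():
 * -1 handler running, 1 armed timer removed, 0 timer was idle. */
int model_try_to_cancel(struct model_timer *t)
{
    if (t->callback_running)
        return -1;
    if (t->queued) {
        t->queued = false;
        return 1;
    }
    return 0;
}

/* Retry like hrtimer_cancel(), but give up after max_spins attempts. */
int model_cancel_bounded(struct model_timer *t, int max_spins)
{
    for (int i = 0; i < max_spins; i++) {
        int ret = model_try_to_cancel(t);
        if (ret >= 0)
            return ret;
        t->rejected++;
    }
    return -1; /* an unbounded loop would still be spinning here */
}

/* Expire the timer and, from inside its handler, try to cancel it,
 * which is the situation the trace shows. Returns the cancel result. */
int model_expire_and_self_cancel(struct model_timer *t, int max_spins)
{
    int ret;

    assert(t->queued);
    t->queued = false;
    t->callback_running = true;   /* handler entered */
    ret = model_cancel_bounded(t, max_spins);
    t->callback_running = false;  /* handler returns */
    return ret;
}
```

In the model, every attempt made from inside the handler is refused, because the handler cannot complete while it is itself waiting in the cancel loop. The same cancel issued from outside the handler succeeds on the first try.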