Re: [linux-next] cpus stalls detected few hours after booting next kernel
On Fri, 2017-06-30 at 17:28 +1000, Nicholas Piggin wrote:
> On Fri, 30 Jun 2017 10:52:18 +0530
> Abdul Haleem wrote:
> 
> > On Fri, 2017-06-30 at 00:45 +1000, Nicholas Piggin wrote:
> > > On Thu, 29 Jun 2017 20:23:05 +1000
> > > Nicholas Piggin wrote:
> > > 
> > > > On Thu, 29 Jun 2017 19:36:14 +1000
> > > > Nicholas Piggin wrote:
> > > > 
> > > > > I don't *think* the replay-wakeup-interrupt patch is directly
> > > > > involved, but it's likely to be one of the idle patches.
> > > 
> > > Okay this turned out to be misconfigured sleep states I added for the
> > > simulator, sorry for the false alarm.
> > > 
> > > > Although you have this in the backtrace. I wonder if that's a stuck
> > > > lock in rcu_process_callbacks?
> > > 
> > > So this spinlock becomes top of the list of suspects. Can you try
> > > enabling lockdep and try to reproduce it?
> > 
> > Yes, recreated again with CONFIG_LOCKDEP=y & CONFIG_DEBUG_LOCKDEP=y set.
> > I do not see any difference in trace messages with and without LOCKDEP
> > enabled.
> > 
> > Please find the attached log file.
> 
> Can you get an rcu_invoke_callback event trace that Paul suggested?

Yes, I have collected the perf report.

> Does this bug show up with just the powerpc next branch?

The call trace has now started showing up on mainline too (4.13.0-rc2).

> Thanks,
> Nick

-- 
Regards,
Abdul Haleem
IBM Linux Technology Centre
Re: [linux-next] cpus stalls detected few hours after booting next kernel
On Fri, 2017-06-30 at 17:28 +1000, Nicholas Piggin wrote:
> On Fri, 30 Jun 2017 10:52:18 +0530
> Abdul Haleem wrote:
> 
> > On Fri, 2017-06-30 at 00:45 +1000, Nicholas Piggin wrote:
> > > On Thu, 29 Jun 2017 20:23:05 +1000
> > > Nicholas Piggin wrote:
> > > 
> > > > On Thu, 29 Jun 2017 19:36:14 +1000
> > > > Nicholas Piggin wrote:
> > > > 
> > > > > I don't *think* the replay-wakeup-interrupt patch is directly
> > > > > involved, but it's likely to be one of the idle patches.
> > > 
> > > Okay this turned out to be misconfigured sleep states I added for the
> > > simulator, sorry for the false alarm.
> > > 
> > > > Although you have this in the backtrace. I wonder if that's a stuck
> > > > lock in rcu_process_callbacks?
> > > 
> > > So this spinlock becomes top of the list of suspects. Can you try
> > > enabling lockdep and try to reproduce it?
> > 
> > Yes, recreated again with CONFIG_LOCKDEP=y & CONFIG_DEBUG_LOCKDEP=y set.
> > I do not see any difference in trace messages with and without LOCKDEP
> > enabled.
> > 
> > Please find the attached log file.
> 
> Can you get an rcu_invoke_callback event trace that Paul suggested?

Yes, I was able to collect the perf data for the rcu_invoke_callback event
on a recent next kernel (4.12.0-next-20170705); the issue is rare to hit.

After booting the next kernel, I started
'perf record -e rcu:rcu_invoke_callback -a -g -- cat' and waited for 30
minutes. Five minutes after the stall messages appeared, I hit CTRL-C to
end the perf command.

@Nicholas: the perf.data report is too large to attach here; shall I ping
you the internal location of the file on Slack/mail? The machine is also
still in the same state if you want to use it.

> Does this bug show up with just the powerpc next branch?

> Thanks,
> Nick

-- 
Regards,
Abdul Haleem
IBM Linux Technology Centre
Re: [linux-next] cpus stalls detected few hours after booting next kernel
On Tue, Jul 04, 2017 at 07:44:58PM +0530, Abdul Haleem wrote:
> On Fri, 2017-06-30 at 17:28 +1000, Nicholas Piggin wrote:
> > On Fri, 30 Jun 2017 10:52:18 +0530
> > Abdul Haleem wrote:
> > 
> > > On Fri, 2017-06-30 at 00:45 +1000, Nicholas Piggin wrote:
> > > > On Thu, 29 Jun 2017 20:23:05 +1000
> > > > Nicholas Piggin wrote:
> > > > 
> > > > > On Thu, 29 Jun 2017 19:36:14 +1000
> > > > > Nicholas Piggin wrote:
> > > > > 
> > > > > > I don't *think* the replay-wakeup-interrupt patch is directly
> > > > > > involved, but it's likely to be one of the idle patches.
> > > > 
> > > > Okay this turned out to be misconfigured sleep states I added for the
> > > > simulator, sorry for the false alarm.
> > > > 
> > > > > Although you have this in the backtrace. I wonder if that's a stuck
> > > > > lock in rcu_process_callbacks?
> > > > 
> > > > So this spinlock becomes top of the list of suspects. Can you try
> > > > enabling lockdep and try to reproduce it?
> > > 
> > > Yes, recreated again with CONFIG_LOCKDEP=y & CONFIG_DEBUG_LOCKDEP=y set.
> > > I do not see any difference in trace messages with and without LOCKDEP
> > > enabled.
> > > 
> > > Please find the attached log file.
> > 
> > Can you get an rcu_invoke_callback event trace that Paul suggested?
> 
> I could not reproduce the issue with the latest next kernel
> (4.12.0-rc7-next-20170703).
> 
> > Does this bug show up with just the powerpc next branch?
> 
> Perhaps. A stress-ng run on today's mainline kernel (4.12.0) triggered a
> different trace message, and it is not reproducible every time.
> 
> stress-ng --io 100 --vm 10 --vm-bytes 100g --timeout 1h --oomable
> 
> INFO: rcu_sched self-detected stall on CPU
> 8-...: (2099 ticks this GP) idle=37e/141/0 softirq=1404131/1404131 fqs=932
>  (t=2100 jiffies g=394894 c=394893 q=21997)
> Task dump for CPU 8:
> kworker/u162:2  R running task    11168 28468     2 0x0884
> Workqueue: writeback wb_workfn (flush-253:1)
> Call Trace:
> [c0036e3eb340] [c0114480] sched_show_task+0xf0/0x160 (unreliable)
> [c0036e3eb3b0] [c09adc2c] rcu_dump_cpu_stacks+0xd0/0x134
> [c0036e3eb400] [c015e4d0] rcu_check_callbacks+0x8f0/0xaf0
> [c0036e3eb530] [c0165e2c] update_process_times+0x3c/0x90
> [c0036e3eb560] [c017b73c] tick_sched_handle.isra.13+0x2c/0xc0
> [c0036e3eb590] [c017b828] tick_sched_timer+0x58/0xb0
> [c0036e3eb5d0] [c01669e8] __hrtimer_run_queues+0xf8/0x330
> [c0036e3eb650] [c0167744] hrtimer_interrupt+0xe4/0x280
> [c0036e3eb710] [c0022620] __timer_interrupt+0x90/0x270
> [c0036e3eb760] [c0022cf0] timer_interrupt+0xa0/0xe0
> [c0036e3eb790] [c00091e8] decrementer_common+0x158/0x160
> --- interrupt: 901 at move_expired_inodes+0x30/0x200
>     LR = queue_io+0x8c/0x190
> [c0036e3eba80] [c0036e3ebac0] 0xc0036e3ebac0 (unreliable)
> [c0036e3ebac0] [c0319b24] wb_writeback+0x2b4/0x420

The usual assumption would be that wb_writeback() is looping or...

> [c0036e3ebb90] [c031a980] wb_workfn+0xf0/0x4b0
> [c0036e3ebca0] [c00fa160] process_one_work+0x180/0x470
> [c0036e3ebd30] [c00fa6d4] worker_thread+0x284/0x500

...there is so much work being scheduled that worker_thread() cannot keep
up.  If tracing locates such a loop, the usual trick is to place a
cond_resched_rcu_qs() somewhere in it (but of course not where interrupts
or preemption are disabled).  Tracing can help detect this.  Or printk()s,
for that matter.  ;-)

							Thanx, Paul

> [c0036e3ebdc0] [c0101fc0] kthread+0x160/0x1a0
> [c0036e3ebe30] [c000bb60] ret_from_kernel_thread+0x5c/0x7c
> INFO: rcu_sched self-detected stall on CPU
> 51-...: (2099 ticks this GP) idle=7c2/141/0 softirq=1272749/1272749 fqs=995
>  (t=2100 jiffies g=394900 c=394899 q=32186)
> 
> stress-ng : http://kernel.ubuntu.com/git/cking/stress-ng.git
> 
> > Thanks,
> > Nick
> 
> -- 
> Regards,
> Abdul Haleem
> IBM Linux Technology Centre
Re: [linux-next] cpus stalls detected few hours after booting next kernel
On Fri, 2017-06-30 at 17:28 +1000, Nicholas Piggin wrote:
> On Fri, 30 Jun 2017 10:52:18 +0530
> Abdul Haleem wrote:
> 
> > On Fri, 2017-06-30 at 00:45 +1000, Nicholas Piggin wrote:
> > > On Thu, 29 Jun 2017 20:23:05 +1000
> > > Nicholas Piggin wrote:
> > > 
> > > > On Thu, 29 Jun 2017 19:36:14 +1000
> > > > Nicholas Piggin wrote:
> > > > 
> > > > > I don't *think* the replay-wakeup-interrupt patch is directly
> > > > > involved, but it's likely to be one of the idle patches.
> > > 
> > > Okay this turned out to be misconfigured sleep states I added for the
> > > simulator, sorry for the false alarm.
> > > 
> > > > Although you have this in the backtrace. I wonder if that's a stuck
> > > > lock in rcu_process_callbacks?
> > > 
> > > So this spinlock becomes top of the list of suspects. Can you try
> > > enabling lockdep and try to reproduce it?
> > 
> > Yes, recreated again with CONFIG_LOCKDEP=y & CONFIG_DEBUG_LOCKDEP=y set.
> > I do not see any difference in trace messages with and without LOCKDEP
> > enabled.
> > 
> > Please find the attached log file.
> 
> Can you get an rcu_invoke_callback event trace that Paul suggested?

I could not reproduce the issue with the latest next kernel
(4.12.0-rc7-next-20170703).

> Does this bug show up with just the powerpc next branch?

Perhaps. A stress-ng run on today's mainline kernel (4.12.0) triggered a
different trace message, and it is not reproducible every time.

stress-ng --io 100 --vm 10 --vm-bytes 100g --timeout 1h --oomable

INFO: rcu_sched self-detected stall on CPU
8-...: (2099 ticks this GP) idle=37e/141/0 softirq=1404131/1404131 fqs=932
 (t=2100 jiffies g=394894 c=394893 q=21997)
Task dump for CPU 8:
kworker/u162:2  R running task    11168 28468     2 0x0884
Workqueue: writeback wb_workfn (flush-253:1)
Call Trace:
[c0036e3eb340] [c0114480] sched_show_task+0xf0/0x160 (unreliable)
[c0036e3eb3b0] [c09adc2c] rcu_dump_cpu_stacks+0xd0/0x134
[c0036e3eb400] [c015e4d0] rcu_check_callbacks+0x8f0/0xaf0
[c0036e3eb530] [c0165e2c] update_process_times+0x3c/0x90
[c0036e3eb560] [c017b73c] tick_sched_handle.isra.13+0x2c/0xc0
[c0036e3eb590] [c017b828] tick_sched_timer+0x58/0xb0
[c0036e3eb5d0] [c01669e8] __hrtimer_run_queues+0xf8/0x330
[c0036e3eb650] [c0167744] hrtimer_interrupt+0xe4/0x280
[c0036e3eb710] [c0022620] __timer_interrupt+0x90/0x270
[c0036e3eb760] [c0022cf0] timer_interrupt+0xa0/0xe0
[c0036e3eb790] [c00091e8] decrementer_common+0x158/0x160
--- interrupt: 901 at move_expired_inodes+0x30/0x200
    LR = queue_io+0x8c/0x190
[c0036e3eba80] [c0036e3ebac0] 0xc0036e3ebac0 (unreliable)
[c0036e3ebac0] [c0319b24] wb_writeback+0x2b4/0x420
[c0036e3ebb90] [c031a980] wb_workfn+0xf0/0x4b0
[c0036e3ebca0] [c00fa160] process_one_work+0x180/0x470
[c0036e3ebd30] [c00fa6d4] worker_thread+0x284/0x500
[c0036e3ebdc0] [c0101fc0] kthread+0x160/0x1a0
[c0036e3ebe30] [c000bb60] ret_from_kernel_thread+0x5c/0x7c
INFO: rcu_sched self-detected stall on CPU
51-...: (2099 ticks this GP) idle=7c2/141/0 softirq=1272749/1272749 fqs=995
 (t=2100 jiffies g=394900 c=394899 q=32186)

stress-ng : http://kernel.ubuntu.com/git/cking/stress-ng.git

> Thanks,
> Nick

-- 
Regards,
Abdul Haleem
IBM Linux Technology Centre
Re: [linux-next] cpus stalls detected few hours after booting next kernel
On Fri, Jun 30, 2017 at 05:28:02PM +1000, Nicholas Piggin wrote:
> On Fri, 30 Jun 2017 10:52:18 +0530
> Abdul Haleem wrote:
> 
> > On Fri, 2017-06-30 at 00:45 +1000, Nicholas Piggin wrote:
> > > On Thu, 29 Jun 2017 20:23:05 +1000
> > > Nicholas Piggin wrote:
> > > 
> > > > On Thu, 29 Jun 2017 19:36:14 +1000
> > > > Nicholas Piggin wrote:
> > > > 
> > > > > I don't *think* the replay-wakeup-interrupt patch is directly
> > > > > involved, but it's likely to be one of the idle patches.
> > > 
> > > Okay this turned out to be misconfigured sleep states I added for the
> > > simulator, sorry for the false alarm.
> > > 
> > > > Although you have this in the backtrace. I wonder if that's a stuck
> > > > lock in rcu_process_callbacks?
> > > 
> > > So this spinlock becomes top of the list of suspects. Can you try
> > > enabling lockdep and try to reproduce it?
> > 
> > Yes, recreated again with CONFIG_LOCKDEP=y & CONFIG_DEBUG_LOCKDEP=y set.
> > I do not see any difference in trace messages with and without LOCKDEP
> > enabled.
> > 
> > Please find the attached log file.
> 
> Can you get an rcu_invoke_callback event trace that Paul suggested?
> 
> Does this bug show up with just the powerpc next branch?

And I must say that those RCU CPU stall warnings are spectacular!  Did you
by chance boot with a small subset of the CPUs online, and bring the rest
online later on?

							Thanx, Paul
Re: [linux-next] cpus stalls detected few hours after booting next kernel
On Fri, 30 Jun 2017 10:52:18 +0530
Abdul Haleem wrote:

> On Fri, 2017-06-30 at 00:45 +1000, Nicholas Piggin wrote:
> > On Thu, 29 Jun 2017 20:23:05 +1000
> > Nicholas Piggin wrote:
> > 
> > > On Thu, 29 Jun 2017 19:36:14 +1000
> > > Nicholas Piggin wrote:
> > > 
> > > > I don't *think* the replay-wakeup-interrupt patch is directly
> > > > involved, but it's likely to be one of the idle patches.
> > 
> > Okay this turned out to be misconfigured sleep states I added for the
> > simulator, sorry for the false alarm.
> > 
> > > Although you have this in the backtrace. I wonder if that's a stuck
> > > lock in rcu_process_callbacks?
> > 
> > So this spinlock becomes top of the list of suspects. Can you try
> > enabling lockdep and try to reproduce it?
> 
> Yes, recreated again with CONFIG_LOCKDEP=y & CONFIG_DEBUG_LOCKDEP=y set.
> I do not see any difference in trace messages with and without LOCKDEP
> enabled.
> 
> Please find the attached log file.

Can you get an rcu_invoke_callback event trace that Paul suggested?

Does this bug show up with just the powerpc next branch?

Thanks,
Nick
Re: [linux-next] cpus stalls detected few hours after booting next kernel
On Fri, 2017-06-30 at 00:45 +1000, Nicholas Piggin wrote:
> On Thu, 29 Jun 2017 20:23:05 +1000
> Nicholas Piggin wrote:
> 
> > On Thu, 29 Jun 2017 19:36:14 +1000
> > Nicholas Piggin wrote:
> > 
> > > I don't *think* the replay-wakeup-interrupt patch is directly
> > > involved, but it's likely to be one of the idle patches.
> 
> Okay this turned out to be misconfigured sleep states I added for the
> simulator, sorry for the false alarm.
> 
> > Although you have this in the backtrace. I wonder if that's a stuck
> > lock in rcu_process_callbacks?
> 
> So this spinlock becomes top of the list of suspects. Can you try
> enabling lockdep and try to reproduce it?

Yes, recreated again with CONFIG_LOCKDEP=y & CONFIG_DEBUG_LOCKDEP=y set.
I do not see any difference in trace messages with and without LOCKDEP
enabled.

Please find the attached log file.

> > [ 5948.345539] CPU: 63 PID: 7360 Comm: rs:main Q:Reg Not tainted 4.12.0-rc7-next-20170628 #2
> > [ 5948.345612] task: c00f1c14e600 task.stack: c00f1c1e8000
> > [ 5948.345672] NIP: c09c7a10 LR: c09c7a08 CTR: c015eda0
> > [ 5948.358553] REGS: c00f1c1eb150 TRAP: 0501 Not tainted (4.12.0-rc7-next-20170628)
> > [ 5948.358623] MSR: 90009033
> > [ 5948.358626] CR: 28082884  XER: 2000
> > [ 5948.358706] CFAR: c09c7a28 SOFTE: 1
> > GPR00: c015f14c c00f1c1eb3d0 c1062b00 0001
> > GPR04: c00fff6e6180 0001 00cc
> > GPR08: 0001 804a
> > GPR12: c015eda0 cfd55a80
> > [ 5948.358986] NIP [c09c7a10] _raw_spin_lock_irqsave+0x90/0x100
> > [ 5948.359043] LR [c09c7a08] _raw_spin_lock_irqsave+0x88/0x100
> > [ 5948.359099] Call Trace:
> > [ 5948.359123] [c00f1c1eb3d0] [0001] 0x1 (unreliable)
> > [ 5948.359182] [c00f1c1eb410] [c015f14c] rcu_process_callbacks+0x3ac/0x620
> > [ 5948.359252] [c00f1c1eb4c0] [c00e1e0c] __do_softirq+0x14c/0x3a0
> > [ 5948.365958] [c00f1c1eb5b0] [c00e2448] irq_exit+0x108/0x120
> > [ 5948.366016] [c00f1c1eb5d0] [c00232b4] timer_interrupt+0xa4/0xe0
> > [ 5948.366075] [c00f1c1eb600] [c0009208] decrementer_common+0x158/0x160
> > [ 5948.366149] --- interrupt: 901 at start_this_handle+0xd0/0x4b0
> >     LR = jbd2__journal_start+0x17c/0x2b0
> > [ 5948.366242] [c00f1c1eb8f0] [c00f20a5cb00] 0xc00f20a5cb00 (unreliable)
> > [ 5948.366314] [c00f1c1eba00] [c040717c] jbd2__journal_start+0x17c/0x2b0
> > [ 5948.366388] [c00f1c1eba70] [c038edf4] __ext4_journal_start_sb+0x84/0x180
> > [ 5948.366459] [c00f1c1ebad0] [c03b47dc] ext4_da_write_begin+0x17c/0x520
> > [ 5948.366532] [c00f1c1ebb90] [c021f9c8] generic_perform_write+0xe8/0x250
> > [ 5948.366604] [c00f1c1ebc20] [c0220d20] __generic_file_write_iter+0x200/0x240
> > [ 5948.366677] [c00f1c1ebc80] [c039d614] ext4_file_write_iter+0x2e4/0x4d0
> > [ 5948.373255] [c00f1c1ebd00] [c02e13c0] __vfs_write+0x120/0x200
> > [ 5948.373313] [c00f1c1ebd90] [c02e2c48] vfs_write+0xc8/0x240
> > [ 5948.373371] [c00f1c1ebde0] [c02e4940] SyS_write+0x60/0x110
> > [ 5948.373430] [c00f1c1ebe30] [c000b8e0] system_call+0x38/0xdc
> > [ 5948.373486] Instruction dump:
> > [ 5948.373521] 7fe3fb78 e8010010 eba1ffe8 ebc1fff0 ebe1fff8 7c0803a6 4e800020 8bad028a
> > [ 5948.373592] 7fe3fb78 4b64db15 6000 7c210b78 89290009 792affe3 40820048
> > [ 5948.374515] Sending NMI from CPU 74 to CPUs

-- 
Regards,
Abdul Haleem
IBM Linux Technology Centre

[ 9296.490900] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 9296.491039] 8-...: (18788 GPs behind) idle=0cc/0/0 softirq=653/653 fqs=0
[ 9296.491081] 9-...: (18788 GPs behind) idle=860/0/0 softirq=702/702 fqs=0
[ 9296.491114] 10-...: (47558 GPs behind) idle=f90/0/0 softirq=493/493 fqs=0
[ 9296.491147] 14-...: (18791 GPs behind) idle=840/0/0 softirq=816/816 fqs=0
[ 9296.491180] 17-...: (1976 GPs behind) idle=0ec/0/0 softirq=1147/1147 fqs=0
[ 9296.491221] 19-...: (60 GPs behind) idle=6a8/0/0 softirq=1551/1551 fqs=0
[ 9296.491256] 23-...: (965 GPs behind) idle=f2c/0/0 softirq=1130/1131 fqs=0
[ 9296.491289] 25-...: (37469 GPs behind) idle=170/0/0 softirq=599/599 fqs=0
[ 9296.491444] 27-...: (47543 GPs behind) idle=3b0/0/0 softirq=606/606 fqs=0
[ 9296.491479] 28-...: (47541 GPs behind) idle=5d4/0/0 softirq=503/504 fqs=0
[ 9296.491514] 29-...: (47541 GPs behind) idle=8b4/0/0 softirq=470/470 fqs=0
[ 9296.491547] 32-...: (2324 GPs behind) idle=7c4/0/0 softirq=946/946 fqs=0
[ 9296.491580] 33-...: (37473 GPs behind) idle=5a0/0/0 softirq=618/618 fqs=0
[ 9296.491616] 37-...: (9317 GPs behind) idle=c98/0/0 softirq=679/679
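For reference, the lockdep setup Abdul describes corresponds to a .config
fragment along these lines (a sketch only: CONFIG_LOCKDEP is normally
selected indirectly by enabling CONFIG_PROVE_LOCKING rather than set by
hand):

```
# Lock debugging used to test the stuck-spinlock theory
CONFIG_PROVE_LOCKING=y
CONFIG_LOCKDEP=y
CONFIG_DEBUG_LOCKDEP=y
```

With these set, lockdep would be expected to complain at the first
inconsistent acquisition rather than waiting for the stall detector.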
Re: [linux-next] cpus stalls detected few hours after booting next kernel
On Fri, 2017-06-30 at 00:45 +1000, Nicholas Piggin wrote:
> On Thu, 29 Jun 2017 20:23:05 +1000
> Nicholas Piggin wrote:
>
> > On Thu, 29 Jun 2017 19:36:14 +1000
> > Nicholas Piggin wrote:
> >
> > > I don't *think* the replay-wakeup-interrupt patch is directly involved, but
> > > it's likely to be one of the idle patches.
>
> Okay this turned out to be misconfigured sleep states I added for the
> simulator, sorry for the false alarm.
>
> > Although you have this in the backtrace. I wonder if that's a stuck
> > lock in rcu_process_callbacks?
>
> So this spinlock becomes top of the list of suspects. Can you try
> enabling lockdep and try to reproduce it?

Yes, recreated again with CONFIG_LOCKDEP=y & CONFIG_DEBUG_LOCKDEP=y set.
I do not see any difference in trace messages with and without LOCKDEP
enabled.

Please find the attached log file.

> > [ 5948.345539] CPU: 63 PID: 7360 Comm: rs:main Q:Reg Not tainted 4.12.0-rc7-next-20170628 #2
> > [ 5948.345612] task: c00f1c14e600 task.stack: c00f1c1e8000
> > [ 5948.345672] NIP: c09c7a10 LR: c09c7a08 CTR: c015eda0
> > [ 5948.358553] REGS: c00f1c1eb150 TRAP: 0501 Not tainted (4.12.0-rc7-next-20170628)
> > [ 5948.358623] MSR: 90009033
> > [ 5948.358626] CR: 28082884 XER: 2000
> > [ 5948.358706] CFAR: c09c7a28 SOFTE: 1
> > GPR00: c015f14c c00f1c1eb3d0 c1062b00 0001
> > GPR04: c00fff6e6180 0001 00cc
> > GPR08: 0001 804a
> > GPR12: c015eda0 cfd55a80
> > [ 5948.358986] NIP [c09c7a10] _raw_spin_lock_irqsave+0x90/0x100
> > [ 5948.359043] LR [c09c7a08] _raw_spin_lock_irqsave+0x88/0x100
> > [ 5948.359099] Call Trace:
> > [ 5948.359123] [c00f1c1eb3d0] [0001] 0x1 (unreliable)
> > [ 5948.359182] [c00f1c1eb410] [c015f14c] rcu_process_callbacks+0x3ac/0x620
> > [ 5948.359252] [c00f1c1eb4c0] [c00e1e0c] __do_softirq+0x14c/0x3a0
> > [ 5948.365958] [c00f1c1eb5b0] [c00e2448] irq_exit+0x108/0x120
> > [ 5948.366016] [c00f1c1eb5d0] [c00232b4] timer_interrupt+0xa4/0xe0
> > [ 5948.366075] [c00f1c1eb600] [c0009208] decrementer_common+0x158/0x160
> > [ 5948.366149] --- interrupt: 901 at start_this_handle+0xd0/0x4b0
> >     LR = jbd2__journal_start+0x17c/0x2b0
> > [ 5948.366242] [c00f1c1eb8f0] [c00f20a5cb00] 0xc00f20a5cb00 (unreliable)
> > [ 5948.366314] [c00f1c1eba00] [c040717c] jbd2__journal_start+0x17c/0x2b0
> > [ 5948.366388] [c00f1c1eba70] [c038edf4] __ext4_journal_start_sb+0x84/0x180
> > [ 5948.366459] [c00f1c1ebad0] [c03b47dc] ext4_da_write_begin+0x17c/0x520
> > [ 5948.366532] [c00f1c1ebb90] [c021f9c8] generic_perform_write+0xe8/0x250
> > [ 5948.366604] [c00f1c1ebc20] [c0220d20] __generic_file_write_iter+0x200/0x240
> > [ 5948.366677] [c00f1c1ebc80] [c039d614] ext4_file_write_iter+0x2e4/0x4d0
> > [ 5948.373255] [c00f1c1ebd00] [c02e13c0] __vfs_write+0x120/0x200
> > [ 5948.373313] [c00f1c1ebd90] [c02e2c48] vfs_write+0xc8/0x240
> > [ 5948.373371] [c00f1c1ebde0] [c02e4940] SyS_write+0x60/0x110
> > [ 5948.373430] [c00f1c1ebe30] [c000b8e0] system_call+0x38/0xdc
> > [ 5948.373486] Instruction dump:
> > [ 5948.373521] 7fe3fb78 e8010010 eba1ffe8 ebc1fff0 ebe1fff8 7c0803a6 4e800020 8bad028a
> > [ 5948.373592] 7fe3fb78 4b64db15 6000 7c210b78 89290009 792affe3 40820048
> > [ 5948.374515] Sending NMI from CPU 74 to CPUs

--
Regard's
Abdul Haleem
IBM Linux Technology Centre

[ 9296.490900] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 9296.491039] 8-...: (18788 GPs behind) idle=0cc/0/0 softirq=653/653 fqs=0
[ 9296.491081] 9-...: (18788 GPs behind) idle=860/0/0 softirq=702/702 fqs=0
[ 9296.491114] 10-...: (47558 GPs behind) idle=f90/0/0 softirq=493/493 fqs=0
[ 9296.491147] 14-...: (18791 GPs behind) idle=840/0/0 softirq=816/816 fqs=0
[ 9296.491180] 17-...: (1976 GPs behind) idle=0ec/0/0 softirq=1147/1147 fqs=0
[ 9296.491221] 19-...: (60 GPs behind) idle=6a8/0/0 softirq=1551/1551 fqs=0
[ 9296.491256] 23-...: (965 GPs behind) idle=f2c/0/0 softirq=1130/1131 fqs=0
[ 9296.491289] 25-...: (37469 GPs behind) idle=170/0/0 softirq=599/599 fqs=0
[ 9296.491444] 27-...: (47543 GPs behind) idle=3b0/0/0 softirq=606/606 fqs=0
[ 9296.491479] 28-...: (47541 GPs behind) idle=5d4/0/0 softirq=503/504 fqs=0
[ 9296.491514] 29-...: (47541 GPs behind) idle=8b4/0/0 softirq=470/470 fqs=0
[ 9296.491547] 32-...: (2324 GPs behind) idle=7c4/0/0 softirq=946/946 fqs=0
[ 9296.491580] 33-...: (37473 GPs behind) idle=5a0/0/0 softirq=618/618 fqs=0
[ 9296.491616] 37-...: (9317 GPs behind) idle=c98/0/0 softirq=679/679
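[Editor's note] Earlier in the thread Paul suggested collecting an rcu_invoke_callback event trace. A minimal sketch of one way to do that with perf follows; the tracepoint name (rcu:rcu_invoke_callback) is real, but the capture window and output file name are arbitrary choices, and system-wide tracing needs root:

```shell
# Capture the rcu:rcu_invoke_callback tracepoint system-wide for a few
# seconds, then print the top of the report. Sketch only: window length
# and file name are arbitrary; -a (all CPUs) requires root.
EVENT="rcu:rcu_invoke_callback"
echo "tracing $EVENT system-wide"
if command -v perf >/dev/null 2>&1; then
    perf record -e "$EVENT" -a -o rcu-cb.data -- sleep 5 \
        && perf report -i rcu-cb.data --stdio | head -n 40 \
        || echo "perf record failed (needs root and tracepoint support)"
else
    echo "perf not installed on this host"
fi
```

`perf script -i rcu-cb.data` would instead list each callback invocation with its timestamp and CPU, which is closer to the raw event trace requested.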
Re: [linux-next] cpus stalls detected few hours after booting next kernel
On Thu, 29 Jun 2017 20:23:05 +1000
Nicholas Piggin wrote:

> On Thu, 29 Jun 2017 19:36:14 +1000
> Nicholas Piggin wrote:
>
> > I don't *think* the replay-wakeup-interrupt patch is directly involved, but
> > it's likely to be one of the idle patches.

Okay this turned out to be misconfigured sleep states I added for the
simulator, sorry for the false alarm.

> Although you have this in the backtrace. I wonder if that's a stuck
> lock in rcu_process_callbacks?

So this spinlock becomes top of the list of suspects. Can you try
enabling lockdep and try to reproduce it?

> [ 5948.345539] CPU: 63 PID: 7360 Comm: rs:main Q:Reg Not tainted 4.12.0-rc7-next-20170628 #2
> [ 5948.345612] task: c00f1c14e600 task.stack: c00f1c1e8000
> [ 5948.345672] NIP: c09c7a10 LR: c09c7a08 CTR: c015eda0
> [ 5948.358553] REGS: c00f1c1eb150 TRAP: 0501 Not tainted (4.12.0-rc7-next-20170628)
> [ 5948.358623] MSR: 90009033
> [ 5948.358626] CR: 28082884 XER: 2000
> [ 5948.358706] CFAR: c09c7a28 SOFTE: 1
> GPR00: c015f14c c00f1c1eb3d0 c1062b00 0001
> GPR04: c00fff6e6180 0001 00cc
> GPR08: 0001 804a
> GPR12: c015eda0 cfd55a80
> [ 5948.358986] NIP [c09c7a10] _raw_spin_lock_irqsave+0x90/0x100
> [ 5948.359043] LR [c09c7a08] _raw_spin_lock_irqsave+0x88/0x100
> [ 5948.359099] Call Trace:
> [ 5948.359123] [c00f1c1eb3d0] [0001] 0x1 (unreliable)
> [ 5948.359182] [c00f1c1eb410] [c015f14c] rcu_process_callbacks+0x3ac/0x620
> [ 5948.359252] [c00f1c1eb4c0] [c00e1e0c] __do_softirq+0x14c/0x3a0
> [ 5948.365958] [c00f1c1eb5b0] [c00e2448] irq_exit+0x108/0x120
> [ 5948.366016] [c00f1c1eb5d0] [c00232b4] timer_interrupt+0xa4/0xe0
> [ 5948.366075] [c00f1c1eb600] [c0009208] decrementer_common+0x158/0x160
> [ 5948.366149] --- interrupt: 901 at start_this_handle+0xd0/0x4b0
>     LR = jbd2__journal_start+0x17c/0x2b0
> [ 5948.366242] [c00f1c1eb8f0] [c00f20a5cb00] 0xc00f20a5cb00 (unreliable)
> [ 5948.366314] [c00f1c1eba00] [c040717c] jbd2__journal_start+0x17c/0x2b0
> [ 5948.366388] [c00f1c1eba70] [c038edf4] __ext4_journal_start_sb+0x84/0x180
> [ 5948.366459] [c00f1c1ebad0] [c03b47dc] ext4_da_write_begin+0x17c/0x520
> [ 5948.366532] [c00f1c1ebb90] [c021f9c8] generic_perform_write+0xe8/0x250
> [ 5948.366604] [c00f1c1ebc20] [c0220d20] __generic_file_write_iter+0x200/0x240
> [ 5948.366677] [c00f1c1ebc80] [c039d614] ext4_file_write_iter+0x2e4/0x4d0
> [ 5948.373255] [c00f1c1ebd00] [c02e13c0] __vfs_write+0x120/0x200
> [ 5948.373313] [c00f1c1ebd90] [c02e2c48] vfs_write+0xc8/0x240
> [ 5948.373371] [c00f1c1ebde0] [c02e4940] SyS_write+0x60/0x110
> [ 5948.373430] [c00f1c1ebe30] [c000b8e0] system_call+0x38/0xdc
> [ 5948.373486] Instruction dump:
> [ 5948.373521] 7fe3fb78 e8010010 eba1ffe8 ebc1fff0 ebe1fff8 7c0803a6 4e800020 8bad028a
> [ 5948.373592] 7fe3fb78 4b64db15 6000 7c210b78 89290009 792affe3 40820048
> [ 5948.374515] Sending NMI from CPU 74 to CPUs
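[Editor's note] For the lockdep run Nick asks for above, a sketch of a .config fragment follows. CONFIG_LOCKDEP and CONFIG_DEBUG_LOCKDEP are the two options Abdul reports setting later in the thread; CONFIG_PROVE_LOCKING (which selects LOCKDEP and turns on full lock dependency validation) and CONFIG_LOCK_STAT are additions suggested here, not stated in the thread:

```
# Kconfig fragment (sketch)
CONFIG_PROVE_LOCKING=y     # full lock dependency checking; selects LOCKDEP
CONFIG_LOCKDEP=y
CONFIG_DEBUG_LOCKDEP=y     # extra self-consistency checks on lockdep itself
CONFIG_LOCK_STAT=y         # optional: lock contention statistics
```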
Re: [linux-next] cpus stalls detected few hours after booting next kernel
On Thu, 29 Jun 2017 19:36:14 +1000
Nicholas Piggin wrote:

> I've seen this as well (or something like it) in mambo at boot, but
> it's pretty rare to hit. I'm trying to debug it.
>
> I'm guessing possibly an idle vs interrupt race.
>
> > [ 4255.151192] Sending NMI from CPU 5 to CPUs 0:
> > [ 4255.151246] NMI backtrace for cpu 0
> > [ 4255.151287] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.12.0-rc7-next-20170628 #2
> > [ 4255.151363] task: c007f8495600 task.stack: c007f842c000
> > [ 4255.151428] NIP: c000adb4 LR: c0015584 CTR: c082f4b0
> > [ 4255.151504] REGS: c007f842fb60 TRAP: 0e81 Not tainted (4.12.0-rc7-next-20170628)
> > [ 4255.151578] MSR: 90009033
> > [ 4255.151586] CR: 22004884 XER:
> > [ 4255.151675] CFAR: c062c108 SOFTE: 1
> > [ 4255.151675] GPR00: c082d6c8 c007f842fde0 c1062b00 2800
> > [ 4255.151675] GPR04: 0003 c0089830 3aa8056bc35f 0001
> > [ 4255.151675] GPR08: 0002 c0d52d80 0007fe7d 90001003
> > [ 4255.151675] GPR12: c082a0c0 cfd4
> > [ 4255.152217] NIP [c000adb4] .L__replay_interrupt_return+0x0/0x4
> > [ 4255.152334] LR [c0015584] arch_local_irq_restore+0x74/0x90
> > [ 4255.152447] Call Trace:
> > [ 4255.152499] [c007f842fde0] [c017cec0] tick_broadcast_oneshot_control+0x40/0x60 (unreliable)
> > [ 4255.152662] [c007f842fe00] [c082d6c8] cpuidle_enter_state+0x108/0x3d0
> > [ 4255.152803] [c007f842fe60] [c0133e94] call_cpuidle+0x44/0x80
> > [ 4255.152921] [c007f842fe80] [c0134240] do_idle+0x290/0x2f0
> > [ 4255.153037] [c007f842fef0] [c0134474] cpu_startup_entry+0x34/0x40
> > [ 4255.153176] [c007f842ff20] [c0041944] start_secondary+0x304/0x360
> > [ 4255.153316] [c007f842ff90] [c000b16c] start_secondary_prolog+0x10/0x14
> > [ 4255.153455] Instruction dump:
> > [ 4255.153527] 7d200026 618c8000 2c030900 4182e320 2c030500 4182dd68 2c030e80 4182ffa4
> > [ 4255.153668] 2c030ea0 4182f078 2c030e60 4182edb0 <4e800020> 7c781b78 480003c9 480003e1
>
> What I think you're seeing here is just the lockup IPI causes the CPU to
> wake from idle, and so the backtrace gets recorded from where the interrupt
> gets replayed when the cpuidle code returns from the low level idle call
> then re-enables interrupts.
>
> I don't *think* the replay-wakeup-interrupt patch is directly involved, but
> it's likely to be one of the idle patches.

Although you have this in the backtrace. I wonder if that's a stuck
lock in rcu_process_callbacks?

[ 5948.345539] CPU: 63 PID: 7360 Comm: rs:main Q:Reg Not tainted 4.12.0-rc7-next-20170628 #2
[ 5948.345612] task: c00f1c14e600 task.stack: c00f1c1e8000
[ 5948.345672] NIP: c09c7a10 LR: c09c7a08 CTR: c015eda0
[ 5948.358553] REGS: c00f1c1eb150 TRAP: 0501 Not tainted (4.12.0-rc7-next-20170628)
[ 5948.358623] MSR: 90009033
[ 5948.358626] CR: 28082884 XER: 2000
[ 5948.358706] CFAR: c09c7a28 SOFTE: 1
GPR00: c015f14c c00f1c1eb3d0 c1062b00 0001
GPR04: c00fff6e6180 0001 00cc
GPR08: 0001 804a
GPR12: c015eda0 cfd55a80
[ 5948.358986] NIP [c09c7a10] _raw_spin_lock_irqsave+0x90/0x100
[ 5948.359043] LR [c09c7a08] _raw_spin_lock_irqsave+0x88/0x100
[ 5948.359099] Call Trace:
[ 5948.359123] [c00f1c1eb3d0] [0001] 0x1 (unreliable)
[ 5948.359182] [c00f1c1eb410] [c015f14c] rcu_process_callbacks+0x3ac/0x620
[ 5948.359252] [c00f1c1eb4c0] [c00e1e0c] __do_softirq+0x14c/0x3a0
[ 5948.365958] [c00f1c1eb5b0] [c00e2448] irq_exit+0x108/0x120
[ 5948.366016] [c00f1c1eb5d0] [c00232b4] timer_interrupt+0xa4/0xe0
[ 5948.366075] [c00f1c1eb600] [c0009208] decrementer_common+0x158/0x160
[ 5948.366149] --- interrupt: 901 at start_this_handle+0xd0/0x4b0
    LR = jbd2__journal_start+0x17c/0x2b0
[ 5948.366242] [c00f1c1eb8f0] [c00f20a5cb00] 0xc00f20a5cb00 (unreliable)
[ 5948.366314] [c00f1c1eba00] [c040717c] jbd2__journal_start+0x17c/0x2b0
[ 5948.366388] [c00f1c1eba70] [c038edf4] __ext4_journal_start_sb+0x84/0x180
[ 5948.366459] [c00f1c1ebad0] [c03b47dc] ext4_da_write_begin+0x17c/0x520
[ 5948.366532] [c00f1c1ebb90] [c021f9c8] generic_perform_write+0xe8/0x250
[ 5948.366604] [c00f1c1ebc20] [c0220d20] __generic_file_write_iter+0x200/0x240
[ 5948.366677] [c00f1c1ebc80] [c039d614] ext4_file_write_iter+0x2e4/0x4d0
[ 5948.373255] [c00f1c1ebd00] [c02e13c0]
Re: [linux-next] cpus stalls detected few hours after booting next kernel
I've seen this as well (or something like it) in mambo at boot, but
it's pretty rare to hit. I'm trying to debug it.

I'm guessing possibly an idle vs interrupt race.

> [ 4255.151192] Sending NMI from CPU 5 to CPUs 0:
> [ 4255.151246] NMI backtrace for cpu 0
> [ 4255.151287] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.12.0-rc7-next-20170628 #2
> [ 4255.151363] task: c007f8495600 task.stack: c007f842c000
> [ 4255.151428] NIP: c000adb4 LR: c0015584 CTR: c082f4b0
> [ 4255.151504] REGS: c007f842fb60 TRAP: 0e81 Not tainted (4.12.0-rc7-next-20170628)
> [ 4255.151578] MSR: 90009033
> [ 4255.151586] CR: 22004884 XER:
> [ 4255.151675] CFAR: c062c108 SOFTE: 1
> [ 4255.151675] GPR00: c082d6c8 c007f842fde0 c1062b00 2800
> [ 4255.151675] GPR04: 0003 c0089830 3aa8056bc35f 0001
> [ 4255.151675] GPR08: 0002 c0d52d80 0007fe7d 90001003
> [ 4255.151675] GPR12: c082a0c0 cfd4
> [ 4255.152217] NIP [c000adb4] .L__replay_interrupt_return+0x0/0x4
> [ 4255.152334] LR [c0015584] arch_local_irq_restore+0x74/0x90
> [ 4255.152447] Call Trace:
> [ 4255.152499] [c007f842fde0] [c017cec0] tick_broadcast_oneshot_control+0x40/0x60 (unreliable)
> [ 4255.152662] [c007f842fe00] [c082d6c8] cpuidle_enter_state+0x108/0x3d0
> [ 4255.152803] [c007f842fe60] [c0133e94] call_cpuidle+0x44/0x80
> [ 4255.152921] [c007f842fe80] [c0134240] do_idle+0x290/0x2f0
> [ 4255.153037] [c007f842fef0] [c0134474] cpu_startup_entry+0x34/0x40
> [ 4255.153176] [c007f842ff20] [c0041944] start_secondary+0x304/0x360
> [ 4255.153316] [c007f842ff90] [c000b16c] start_secondary_prolog+0x10/0x14
> [ 4255.153455] Instruction dump:
> [ 4255.153527] 7d200026 618c8000 2c030900 4182e320 2c030500 4182dd68 2c030e80 4182ffa4
> [ 4255.153668] 2c030ea0 4182f078 2c030e60 4182edb0 <4e800020> 7c781b78 480003c9 480003e1

What I think you're seeing here is just the lockup IPI causes the CPU to
wake from idle, and so the backtrace gets recorded from where the interrupt
gets replayed when the cpuidle code returns from the low level idle call
then re-enables interrupts.

I don't *think* the replay-wakeup-interrupt patch is directly involved, but
it's likely to be one of the idle patches.

Thanks,
Nick
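[Editor's note] The "Sending NMI from CPU ..." lines above come from the kernel's all-CPU backtrace machinery, which the stall/lockup detectors invoke. The same dump can be requested on demand via sysrq. A hedged sketch (needs root and CONFIG_MAGIC_SYSRQ=y; paths are the standard procfs ones):

```shell
# Ask the kernel to backtrace all active CPUs via sysrq 'l' rather than
# waiting for the RCU stall / hard lockup detector to fire. The traces
# land in the kernel log. Sketch only; requires root.
SYSRQ_KEY="l"
echo "sysrq '$SYSRQ_KEY' dumps backtraces on all active CPUs"
if [ -w /proc/sysrq-trigger ]; then
    echo 1 > /proc/sys/kernel/sysrq          # enable all sysrq functions
    echo "$SYSRQ_KEY" > /proc/sysrq-trigger  # trigger the all-CPU backtrace
    dmesg | tail -n 100 || true              # show the resulting traces
else
    echo "cannot write /proc/sysrq-trigger here (need root)"
fi
```

This is useful for comparing a known-good backtrace of an idle CPU against the ones the stall detector produced.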