Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On Thu, Jan 22, 2015 at 11:08:11AM +0800, Zhang Zhen wrote: > On 2015/1/22 4:06, Paul E. McKenney wrote: > > On Wed, Jan 21, 2015 at 10:10:51AM -0500, Don Zickus wrote: > >> On Wed, Jan 21, 2015 at 10:26:27AM +0800, Zhang Zhen wrote: > This may not cause other problems but what happens if you comment out the > 'touch_softlockup_watchdog' from the touch_nmi_watchdog function like > below (based on latest upstream cb59670870)? > > The idea is that console printing for that cpu won't reset the softlockup > detector. Again other bad things might happen and this patch may not be > a > good final solution, but it can help give me a clue about what is going > on. > >>> > >>> I commented out the 'touch_softlockup_watchdog' from the > >>> touch_nmi_watchdog function > >>> (based on latest upstream ec6f34e5b552). > >>> This triggered RCU stall and softlockup, but softlockup just printk only > >>> once. > >> > >> Yes, as expected. Instead of flooding the console with the same message, > >> report only once until the high priority task is re-kicked and can re-arm > >> the softlockup timer again. So this makes sense. Based on your double > >> spin lock test, I don't ever expect this to get re-armed, so you should > >> not expect to see another message. > >> > >> > >>> As you mentioned "other bad things" lead to softlockup just printk only > >>> once. > >>> What's the bad thing ? > >> > >> Well, one of the assumptions here is that if you are going to have > >> interrupts disabled for a long time and purposely kick the hardlockup > >> detector (with touch_nmi_watchdog), then you can reasonably assume that > >> softlockups will go off too if you don't also kick the softlockup detector > >> (even though there is a window where this isn't true: 2*thresh - thresh). > >> > >> > >> So the patch I gave you, can lead to false softlockup warnings. > >> > >> > >> On the flip side, printing to the console also blocks hardlockups from > >> showing up. 
But I believe lng ago, Paul adjusted the rcu stalls to be > >> longer than a hardlockup/softlockup timeout to prevent such scenarios. > >> > >> > >> > >> I am not sure what to do here. Printing to the console has traditionally > >> been slow (especially if doing so with interrupts disabled), so a > >> touch_nmi_watchdog has been necessary. But a side effect of that, is it > >> prevents the lockup detectors from going off if printing repeatedly. > >> > >> > >> Now we can hack up rcu stall to only print once until it is re-armed. > >> This would show rcu stall printing first, followed two minutes later by > >> a softlockup, which sorta achieves what you want. > >> > >> > >> > >> However, at the end of the day, an rcu stall, softlockup or hardlockup is > >> a bad thing, does it matter if you get one or all of them? One should be > >> enough to start an investigation, no? > > > > Hear, hear!!! > > > Ok, thanks for Don's and your patience! > We enriched softlockup printk information, so we want to get softlockup > warnning. Booting with "rcupdate.rcu_cpu_stall_suppress=1" will shut up the RCU CPU stall warnings. Thanx, Paul > >> I don't know. I am leaning towards 'working as expected' and nothing to > >> really fix here. Does anyone have any suggestions on what assumptions we > >> could change to handle this better? > >> > >> > >> Cheers, > >> Don > >> > >>> > >>> / # > >>> / # echo 60 > /proc/sys/kernel/watchdog_thresh > >>> / # busybox insmod softlockup_test.ko & > >>> / # [ 39.044058] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > >>> (detected by 0, t=21002 jiffies, g=78, c=77, q=4) > >>> [ 39.044058] INFO: Stall ended before state dump start > >>> [ 102.049045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > >>> (detected by 0, t=84007 jiffies, g=78, c=77, q=4) > >>> [ 102.049045] INFO: Stall ended before state dump start > >>> [ 160.482123] BUG: soft lockup - CPU#0 stuck for 134s! 
[busybox:54] > >>> [ 165.054075] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > >>> (detected by 0, t=147012 jiffies, g=78, c=77, q=4) > >>> [ 165.054075] INFO: Stall ended before state dump start > >>> [ 228.059045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > >>> (detected by 0, t=210017 jiffies, g=78, c=77, q=4) > >>> [ 228.059045] INFO: Stall ended before state dump start > >>> [ 291.064099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > >>> (detected by 0, t=273022 jiffies, g=78, c=77, q=4) > >>> [ 291.064099] INFO: Stall ended before state dump start > >>> [ 354.069074] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > >>> (detected by 0, t=336027 jiffies, g=78, c=77, q=4) > >>> [ 354.069099] INFO: Stall ended before state dump start > >>> [ 417.074045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > >>> (detected by 0, t=399032 jiffies, g=78, c=77, q=4) > >>> [ 417.074045] INFO: Stall ended before state dump
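For reference, the suppression knob Paul mentions can be set at boot or, on kernels where rcupdate exposes it as a writable module parameter, flipped at runtime; the paths below are the usual ones but worth verifying on your build:

```
# on the kernel command line:
rcupdate.rcu_cpu_stall_suppress=1

# or at runtime, if the parameter is exposed:
echo 1 > /sys/module/rcupdate/parameters/rcu_cpu_stall_suppress
```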
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On 2015/1/22 4:06, Paul E. McKenney wrote: > On Wed, Jan 21, 2015 at 10:10:51AM -0500, Don Zickus wrote: >> On Wed, Jan 21, 2015 at 10:26:27AM +0800, Zhang Zhen wrote: This may not cause other problems but what happens if you comment out the 'touch_softlockup_watchdog' from the touch_nmi_watchdog function like below (based on latest upstream cb59670870)? The idea is that console printing for that cpu won't reset the softlockup detector. Again other bad things might happen and this patch may not be a good final solution, but it can help give me a clue about what is going on. >>> >>> I commented out the 'touch_softlockup_watchdog' from the touch_nmi_watchdog >>> function >>> (based on latest upstream ec6f34e5b552). >>> This triggered RCU stall and softlockup, but softlockup just printk only >>> once. >> >> Yes, as expected. Instead of flooding the console with the same message, >> report only once until the high priority task is re-kicked and can re-arm >> the softlockup timer again. So this makes sense. Based on your double >> spin lock test, I don't ever expect this to get re-armed, so you should >> not expect to see another message. >> >> >>> As you mentioned "other bad things" lead to softlockup just printk only >>> once. >>> What's the bad thing ? >> >> Well, one of the assumptions here is that if you are going to have >> interrupts disabled for a long time and purposely kick the hardlockup >> detector (with touch_nmi_watchdog), then you can reasonably assume that >> softlockups will go off too if you don't also kick the softlockup detector >> (even though there is a window where this isn't true: 2*thresh - thresh). >> >> >> So the patch I gave you, can lead to false softlockup warnings. >> >> >> On the flip side, printing to the console also blocks hardlockups from >> showing up. But I believe lng ago, Paul adjusted the rcu stalls to be >> longer than a hardlockup/softlockup timeout to prevent such scenarios. >> >> >> >> I am not sure what to do here. 
Printing to the console has traditionally >> been slow (especially if doing so with interrupts disabled), so a >> touch_nmi_watchdog has been necessary. But a side effect of that is that it >> prevents the lockup detectors from going off if printing repeatedly. >> >> >> Now we can hack up rcu stall to only print once until it is re-armed. >> This would show rcu stall printing first, followed two minutes later by >> a softlockup, which sorta achieves what you want. >> >> >> >> However, at the end of the day, an rcu stall, softlockup or hardlockup is >> a bad thing, does it matter if you get one or all of them? One should be >> enough to start an investigation, no? > > Hear, hear!!! > Ok, thanks to Don and to you for your patience! We enriched the softlockup printk information, so we want to get the softlockup warning. > Thanx, Paul > >> I don't know. I am leaning towards 'working as expected' and nothing to >> really fix here. Does anyone have any suggestions on what assumptions we >> could change to handle this better? >> >> >> Cheers, >> Don >> >>> >>> / # >>> / # echo 60 > /proc/sys/kernel/watchdog_thresh >>> / # busybox insmod softlockup_test.ko & >>> / # [ 39.044058] INFO: rcu_preempt detected stalls on CPUs/tasks: {} >>> (detected by 0, t=21002 jiffies, g=78, c=77, q=4) >>> [ 39.044058] INFO: Stall ended before state dump start >>> [ 102.049045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} >>> (detected by 0, t=84007 jiffies, g=78, c=77, q=4) >>> [ 102.049045] INFO: Stall ended before state dump start >>> [ 160.482123] BUG: soft lockup - CPU#0 stuck for 134s! 
[busybox:54] >>> [ 165.054075] INFO: rcu_preempt detected stalls on CPUs/tasks: {} >>> (detected by 0, t=147012 jiffies, g=78, c=77, q=4) >>> [ 165.054075] INFO: Stall ended before state dump start >>> [ 228.059045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} >>> (detected by 0, t=210017 jiffies, g=78, c=77, q=4) >>> [ 228.059045] INFO: Stall ended before state dump start >>> [ 291.064099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} >>> (detected by 0, t=273022 jiffies, g=78, c=77, q=4) >>> [ 291.064099] INFO: Stall ended before state dump start >>> [ 354.069074] INFO: rcu_preempt detected stalls on CPUs/tasks: {} >>> (detected by 0, t=336027 jiffies, g=78, c=77, q=4) >>> [ 354.069099] INFO: Stall ended before state dump start >>> [ 417.074045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} >>> (detected by 0, t=399032 jiffies, g=78, c=77, q=4) >>> [ 417.074045] INFO: Stall ended before state dump start >>> [ 480.079099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} >>> (detected by 0, t=462037 jiffies, g=78, c=77, q=4) >>> [ 480.079099] INFO: Stall ended before state dump start >>> [ 543.084099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} >>> (detected by 0, t=525042 jiffies, g=78, c=77, q=4) >>> [ 543.084099] INFO: Stall ended
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On Wed, Jan 21, 2015 at 10:10:51AM -0500, Don Zickus wrote: > On Wed, Jan 21, 2015 at 10:26:27AM +0800, Zhang Zhen wrote: > > > This may not cause other problems but what happens if you comment out the > > > 'touch_softlockup_watchdog' from the touch_nmi_watchdog function like > > > below (based on latest upstream cb59670870)? > > > > > > The idea is that console printing for that cpu won't reset the softlockup > > > detector. Again other bad things might happen and this patch may not be a > > > good final solution, but it can help give me a clue about what is going > > > on. > > > > I commented out the 'touch_softlockup_watchdog' from the touch_nmi_watchdog > > function > > (based on latest upstream ec6f34e5b552). > > This triggered RCU stall and softlockup, but softlockup just printk only > > once. > > Yes, as expected. Instead of flooding the console with the same message, > report only once until the high priority task is re-kicked and can re-arm > the softlockup timer again. So this makes sense. Based on your double > spin lock test, I don't ever expect this to get re-armed, so you should > not expect to see another message. > > > > As you mentioned "other bad things" lead to softlockup just printk only > > once. > > What's the bad thing ? > > Well, one of the assumptions here is that if you are going to have > interrupts disabled for a long time and purposely kick the hardlockup > detector (with touch_nmi_watchdog), then you can reasonably assume that > softlockups will go off too if you don't also kick the softlockup detector > (even though there is a window where this isn't true: 2*thresh - thresh). > > > So the patch I gave you, can lead to false softlockup warnings. > > > On the flip side, printing to the console also blocks hardlockups from > showing up. But I believe lng ago, Paul adjusted the rcu stalls to be > longer than a hardlockup/softlockup timeout to prevent such scenarios. > > > > I am not sure what to do here. 
Printing to the console has traditionally > been slow (especially if doing so with interrupts disabled), so a > touch_nmi_watchdog has been necessary. But a side effect of that, is it > prevents the lockup detectors from going off if printing repeatedly. > > > Now we can hack up rcu stall to only print once until it is re-armed. > This would show rcu stall printing first, followed two minutes later by > a softlockup, which sorta achieves what you want. > > > > However, at the end of the day, an rcu stall, softlockup or hardlockup is > a bad thing, does it matter if you get one or all of them? One should be > enough to start an investigation, no? Hear, hear!!! Thanx, Paul > I don't know. I am leaning towards 'working as expected' and nothing to > really fix here. Does anyone have any suggestions on what assumptions we > could change to handle this better? > > > Cheers, > Don > > > > > / # > > / # echo 60 > /proc/sys/kernel/watchdog_thresh > > / # busybox insmod softlockup_test.ko & > > / # [ 39.044058] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > > (detected by 0, t=21002 jiffies, g=78, c=77, q=4) > > [ 39.044058] INFO: Stall ended before state dump start > > [ 102.049045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > > (detected by 0, t=84007 jiffies, g=78, c=77, q=4) > > [ 102.049045] INFO: Stall ended before state dump start > > [ 160.482123] BUG: soft lockup - CPU#0 stuck for 134s! 
[busybox:54] > > [ 165.054075] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > > (detected by 0, t=147012 jiffies, g=78, c=77, q=4) > > [ 165.054075] INFO: Stall ended before state dump start > > [ 228.059045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > > (detected by 0, t=210017 jiffies, g=78, c=77, q=4) > > [ 228.059045] INFO: Stall ended before state dump start > > [ 291.064099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > > (detected by 0, t=273022 jiffies, g=78, c=77, q=4) > > [ 291.064099] INFO: Stall ended before state dump start > > [ 354.069074] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > > (detected by 0, t=336027 jiffies, g=78, c=77, q=4) > > [ 354.069099] INFO: Stall ended before state dump start > > [ 417.074045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > > (detected by 0, t=399032 jiffies, g=78, c=77, q=4) > > [ 417.074045] INFO: Stall ended before state dump start > > [ 480.079099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > > (detected by 0, t=462037 jiffies, g=78, c=77, q=4) > > [ 480.079099] INFO: Stall ended before state dump start > > [ 543.084099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > > (detected by 0, t=525042 jiffies, g=78, c=77, q=4) > > [ 543.084099] INFO: Stall ended before state dump start > > [ 606.089101] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > > (detected by 0, t=588047 jiffies, g=78, c=77, q=4) > > [ 606.089101] INFO: Stall ended before
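The test module itself is never shown in this chunk; from Don's "double spin lock test" description and the _raw_spin_lock frame in the later traces, it is presumably something like the following self-deadlocking init. This is a hypothetical reconstruction, not the actual softlockup_test.ko:

```c
/* Hypothetical reconstruction of softlockup_test.ko: taking the same
 * spinlock twice self-deadlocks with preemption disabled, keeping the
 * CPU spinning forever -- which stalls RCU grace periods and arms the
 * softlockup detector. */
#include <linux/module.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(test_lock);

static int __init test_init(void)
{
	spin_lock(&test_lock);
	spin_lock(&test_lock);	/* never returns: spins on the held lock */
	return 0;
}
module_init(test_init);

MODULE_LICENSE("GPL");
```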
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On Wed, Jan 21, 2015 at 07:11:51PM +0800, Zhang Zhen wrote: > On 2015/1/21 18:16, Paul E. McKenney wrote: > > On Wed, Jan 21, 2015 at 05:05:50PM +0800, Zhang Zhen wrote: > >> On 2015/1/21 15:02, Paul E. McKenney wrote: > >>> On Wed, Jan 21, 2015 at 02:54:05PM +0800, Zhang Zhen wrote: > On 2015/1/21 11:13, Zhang Zhen wrote: > > On 2015/1/21 10:26, Zhang Zhen wrote: > >> On 2015/1/20 23:25, Don Zickus wrote: > >>> > >>> [ . . . ] > >>> > > Sorry, i made a mistake, the log above is based on v3.10.63. > > I have tested the latest upstream kernel (based on ec6f34e5b552), > > but my test case can't triggered RCU stall warning. > > > > I used git bisect to find the bad commit, but found a irrelevant > commit(db5d711e2db776 zram: avoid null access when fail to alloc meta). > Before this commit, my test case can easily trigger RCU stall warning, > but after this commit, my test case can't trigger RCU stall warning. > >>> > >>> I have lost track of exactly what kernel you are using. If you are > >>> using a recent kernel (v3.18 or thereabouts), one thing to try is to > >>> apply my stack of RCU CPU stall-warning changes that are on their way > >>> in, please see v3.19-rc1..630181c4a915 in -rcu, which is at: > >>> > >>> git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git > >>> > >>> These handle the problems that Dave Jones and a few others located this > >>> past December. Could you please give them a spin? > >>> > >> > >> Yeah, my test case can easily trigger RCU stall-warning based on commit > >> 630181c4a915 > >> in your linux-rcu git tree. > >> > >> In my .config, CONFIG_RCU_CPU_STALL_TIMEOUT=21. 
> >> Before commented out touch_softlockup_watchdog() in touch_nmi_watchdog: > >> > >> / # > >> / # echo 60 > /proc/sys/kernel/watchdog_thresh > >> [ 21.885200] NMI watchdog: disabled (cpu0): hardware events not enabled > >> / # busybox insmod softlockup_test.ko > >> [ 47.900117] INFO: rcu_sched detected stalls on CPUs/tasks: > >> [ 47.900117] (detected by 0, t=5252 jiffies, g=-202, c=-203, q=7) > >> [ 47.900117] All QSes seen, last rcu_sched kthread activity 5252 > >> (4294904271-4294899019), jiffies_till_next_fqs=1 > > > > Looks like you are starving RCU's grace-period kthreads. What happens if > > you boot with rcutree.kthread_prio=1 in order to run them at real-time > > priority? > > > >> [ 47.900117] busybox R running task042 41 > >> 0x0008 > >> [ 47.900117] 81afaf40 88007fa03d98 810781c0 > >> 81a5e5f8 > >> [ 47.900117] 81a45d80 88007fa03dd8 810af723 > >> 0083 > >> [ 47.900117] 88007fa13580 81a45d80 81a45d80 > >> 810c43e0 > >> [ 47.900117] Call Trace: > >> [ 47.900117][] sched_show_task+0xb0/0x110 > >> [ 47.900117] [] print_other_cpu_stall+0x2d3/0x2f0 > >> [ 47.900117] [] ? tick_nohz_handler+0xc0/0xc0 > >> [ 47.900117] [] __rcu_pending+0x178/0x220 > >> [ 47.900117] [] rcu_check_callbacks+0x105/0x190 > >> [ 47.900117] [] ? tick_nohz_handler+0xc0/0xc0 > >> [ 47.900117] [] update_process_times+0x32/0x60 > >> [ 47.900117] [] tick_sched_handle+0x32/0x80 > >> [ 47.900117] [] tick_sched_timer+0x4d/0x90 > >> [ 47.900117] [] __run_hrtimer+0xc7/0x1c0 > >> [ 47.900117] [] hrtimer_interrupt+0xe7/0x220 > >> [ 47.900117] [] ? 0xa0005000 > >> [ 47.900117] [] local_apic_timer_interrupt+0x34/0x60 > >> [ 47.900117] [] smp_apic_timer_interrupt+0x3c/0x60 > >> [ 47.900117] [] apic_timer_interrupt+0x6a/0x70 > >> [ 47.900117][] ? _raw_spin_lock+0x20/0x30 > >> [ 47.900117] [] test_init+0x5d/0x1000 > >> [softlockup_test] > >> [ 47.900117] [] ? 0xa0005000 > >> [ 47.900117] [] do_one_initcall+0xb8/0x1d0 > >> [ 47.900117] [] ? 
__vunmap+0x94/0xf0 > >> [ 47.900117] [] do_init_module+0x2b/0x1b0 > >> [ 47.900117] [] load_module+0x585/0x5f0 > >> [ 47.900117] [] ? mod_sysfs_teardown+0x150/0x150 > >> [ 47.900117] [] SyS_init_module+0x9b/0xc0 > >> [ 47.900117] [] system_call_fastpath+0x12/0x17 > >> [ 47.900117] rcu_sched kthread starved for 5252 jiffies! > >> > >> After commented out touch_softlockup_watchdog() in touch_nmi_watchdog: > >> > >> / # > >> / # echo 60 > /proc/sys/kernel/watchdog_thresh > >> [ 123.011980] NMI watchdog: disabled (cpu0): hardware events not enabled > >> / # busybox insmod softlockup_test.ko > >> [ 150.912055] INFO: rcu_sched detected stalls on CPUs/tasks: > >> [ 150.912055] (detected by 0, t=5252 jiffies, g=-198, c=-199, q=6) > >> [ 150.912055] All QSes seen, last rcu_sched kthread activity 5252 > >> (4294930024-4294924772), jiffies_till_next_fqs=1 > > > > Same here, looks like RCU's grace-period kthreads haven't gotten a chance > > to run for more than 5,000 jiffies. > > > >> [ 150.912055] busybox R running task042 41 > >> 0x0008 > >> [ 150.912055]
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On Wed, Jan 21, 2015 at 10:26:27AM +0800, Zhang Zhen wrote: > > This may not cause other problems but what happens if you comment out the > > 'touch_softlockup_watchdog' from the touch_nmi_watchdog function like > > below (based on latest upstream cb59670870)? > > > > The idea is that console printing for that cpu won't reset the softlockup > > detector. Again other bad things might happen and this patch may not be a > > good final solution, but it can help give me a clue about what is going > > on. > > I commented out the 'touch_softlockup_watchdog' from the touch_nmi_watchdog > function > (based on latest upstream ec6f34e5b552). > This triggered RCU stall and softlockup, but softlockup just printk only once. Yes, as expected. Instead of flooding the console with the same message, report only once until the high priority task is re-kicked and can re-arm the softlockup timer again. So this makes sense. Based on your double spin lock test, I don't ever expect this to get re-armed, so you should not expect to see another message. > As you mentioned "other bad things" lead to softlockup just printk only once. > What's the bad thing ? Well, one of the assumptions here is that if you are going to have interrupts disabled for a long time and purposely kick the hardlockup detector (with touch_nmi_watchdog), then you can reasonably assume that softlockups will go off too if you don't also kick the softlockup detector (even though there is a window where this isn't true: 2*thresh - thresh). So the patch I gave you can lead to false softlockup warnings. On the flip side, printing to the console also blocks hardlockups from showing up. But I believe long ago, Paul adjusted the rcu stalls to be longer than a hardlockup/softlockup timeout to prevent such scenarios. I am not sure what to do here. Printing to the console has traditionally been slow (especially if doing so with interrupts disabled), so a touch_nmi_watchdog has been necessary. 
But a side effect of that is that it prevents the lockup detectors from going off if printing repeatedly. Now we can hack up rcu stall to only print once until it is re-armed. This would show rcu stall printing first, followed two minutes later by a softlockup, which sorta achieves what you want. However, at the end of the day, an rcu stall, softlockup or hardlockup is a bad thing; does it matter if you get one or all of them? One should be enough to start an investigation, no? I don't know. I am leaning towards 'working as expected' and nothing to really fix here. Does anyone have any suggestions on what assumptions we could change to handle this better? Cheers, Don > > / # > / # echo 60 > /proc/sys/kernel/watchdog_thresh > / # busybox insmod softlockup_test.ko & > / # [ 39.044058] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > (detected by 0, t=21002 jiffies, g=78, c=77, q=4) > [ 39.044058] INFO: Stall ended before state dump start > [ 102.049045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected > by 0, t=84007 jiffies, g=78, c=77, q=4) > [ 102.049045] INFO: Stall ended before state dump start > [ 160.482123] BUG: soft lockup - CPU#0 stuck for 134s! 
[busybox:54] > [ 165.054075] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected > by 0, t=147012 jiffies, g=78, c=77, q=4) > [ 165.054075] INFO: Stall ended before state dump start > [ 228.059045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected > by 0, t=210017 jiffies, g=78, c=77, q=4) > [ 228.059045] INFO: Stall ended before state dump start > [ 291.064099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected > by 0, t=273022 jiffies, g=78, c=77, q=4) > [ 291.064099] INFO: Stall ended before state dump start > [ 354.069074] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected > by 0, t=336027 jiffies, g=78, c=77, q=4) > [ 354.069099] INFO: Stall ended before state dump start > [ 417.074045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected > by 0, t=399032 jiffies, g=78, c=77, q=4) > [ 417.074045] INFO: Stall ended before state dump start > [ 480.079099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected > by 0, t=462037 jiffies, g=78, c=77, q=4) > [ 480.079099] INFO: Stall ended before state dump start > [ 543.084099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected > by 0, t=525042 jiffies, g=78, c=77, q=4) > [ 543.084099] INFO: Stall ended before state dump start > [ 606.089101] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected > by 0, t=588047 jiffies, g=78, c=77, q=4) > [ 606.089101] INFO: Stall ended before state dump start > [ 669.094099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected > by 0, t=651052 jiffies, g=78, c=77, q=4) > [ 669.094099] INFO: Stall ended before state dump start > [ 732.099045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected > by 0, t=714057 jiffies, g=78, c=77, q=4) > [ 732.099045] INFO: Stall
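The experiment Don proposed (the patch itself was trimmed from the quoted mail) amounts to a one-line change in kernel/watchdog.c. The hunk below is a reconstruction from his description, not the original patch, and the function body is paraphrased from that era's source:

```diff
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ void touch_nmi_watchdog(void)
 void touch_nmi_watchdog(void)
 {
 	raw_cpu_write(watchdog_nmi_touch, true);
-	touch_softlockup_watchdog();
+	/* touch_softlockup_watchdog(); */
 }
```

With this applied, console printing from the stall-warning path still resets the hardlockup detector but no longer resets the softlockup detector, which is why Zhang then saw a (single) softlockup report alongside the stall spew.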
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On 2015/1/21 18:16, Paul E. McKenney wrote: > On Wed, Jan 21, 2015 at 05:05:50PM +0800, Zhang Zhen wrote: >> On 2015/1/21 15:02, Paul E. McKenney wrote: >>> On Wed, Jan 21, 2015 at 02:54:05PM +0800, Zhang Zhen wrote: On 2015/1/21 11:13, Zhang Zhen wrote: > On 2015/1/21 10:26, Zhang Zhen wrote: >> On 2015/1/20 23:25, Don Zickus wrote: >>> >>> [ . . . ] >>> > Sorry, i made a mistake, the log above is based on v3.10.63. > I have tested the latest upstream kernel (based on ec6f34e5b552), > but my test case can't triggered RCU stall warning. > I used git bisect to find the bad commit, but found a irrelevant commit(db5d711e2db776 zram: avoid null access when fail to alloc meta). Before this commit, my test case can easily trigger RCU stall warning, but after this commit, my test case can't trigger RCU stall warning. >>> >>> I have lost track of exactly what kernel you are using. If you are >>> using a recent kernel (v3.18 or thereabouts), one thing to try is to >>> apply my stack of RCU CPU stall-warning changes that are on their way >>> in, please see v3.19-rc1..630181c4a915 in -rcu, which is at: >>> >>> git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git >>> >>> These handle the problems that Dave Jones and a few others located this >>> past December. Could you please give them a spin? >>> >> >> Yeah, my test case can easily trigger RCU stall-warning based on commit >> 630181c4a915 >> in your linux-rcu git tree. >> >> In my .config, CONFIG_RCU_CPU_STALL_TIMEOUT=21. 
>> Before commented out touch_softlockup_watchdog() in touch_nmi_watchdog: >> >> / # >> / # echo 60 > /proc/sys/kernel/watchdog_thresh >> [ 21.885200] NMI watchdog: disabled (cpu0): hardware events not enabled >> / # busybox insmod softlockup_test.ko >> [ 47.900117] INFO: rcu_sched detected stalls on CPUs/tasks: >> [ 47.900117] (detected by 0, t=5252 jiffies, g=-202, c=-203, q=7) >> [ 47.900117] All QSes seen, last rcu_sched kthread activity 5252 >> (4294904271-4294899019), jiffies_till_next_fqs=1 > > Looks like you are starving RCU's grace-period kthreads. What happens if > you boot with rcutree.kthread_prio=1 in order to run them at real-time > priority? > >> [ 47.900117] busybox R running task042 41 >> 0x0008 >> [ 47.900117] 81afaf40 88007fa03d98 810781c0 >> 81a5e5f8 >> [ 47.900117] 81a45d80 88007fa03dd8 810af723 >> 0083 >> [ 47.900117] 88007fa13580 81a45d80 81a45d80 >> 810c43e0 >> [ 47.900117] Call Trace: >> [ 47.900117][] sched_show_task+0xb0/0x110 >> [ 47.900117] [] print_other_cpu_stall+0x2d3/0x2f0 >> [ 47.900117] [] ? tick_nohz_handler+0xc0/0xc0 >> [ 47.900117] [] __rcu_pending+0x178/0x220 >> [ 47.900117] [] rcu_check_callbacks+0x105/0x190 >> [ 47.900117] [] ? tick_nohz_handler+0xc0/0xc0 >> [ 47.900117] [] update_process_times+0x32/0x60 >> [ 47.900117] [] tick_sched_handle+0x32/0x80 >> [ 47.900117] [] tick_sched_timer+0x4d/0x90 >> [ 47.900117] [] __run_hrtimer+0xc7/0x1c0 >> [ 47.900117] [] hrtimer_interrupt+0xe7/0x220 >> [ 47.900117] [] ? 0xa0005000 >> [ 47.900117] [] local_apic_timer_interrupt+0x34/0x60 >> [ 47.900117] [] smp_apic_timer_interrupt+0x3c/0x60 >> [ 47.900117] [] apic_timer_interrupt+0x6a/0x70 >> [ 47.900117][] ? _raw_spin_lock+0x20/0x30 >> [ 47.900117] [] test_init+0x5d/0x1000 [softlockup_test] >> [ 47.900117] [] ? 0xa0005000 >> [ 47.900117] [] do_one_initcall+0xb8/0x1d0 >> [ 47.900117] [] ? __vunmap+0x94/0xf0 >> [ 47.900117] [] do_init_module+0x2b/0x1b0 >> [ 47.900117] [] load_module+0x585/0x5f0 >> [ 47.900117] [] ? 
mod_sysfs_teardown+0x150/0x150 >> [ 47.900117] [] SyS_init_module+0x9b/0xc0 >> [ 47.900117] [] system_call_fastpath+0x12/0x17 >> [ 47.900117] rcu_sched kthread starved for 5252 jiffies! >> >> After commented out touch_softlockup_watchdog() in touch_nmi_watchdog: >> >> / # >> / # echo 60 > /proc/sys/kernel/watchdog_thresh >> [ 123.011980] NMI watchdog: disabled (cpu0): hardware events not enabled >> / # busybox insmod softlockup_test.ko >> [ 150.912055] INFO: rcu_sched detected stalls on CPUs/tasks: >> [ 150.912055] (detected by 0, t=5252 jiffies, g=-198, c=-199, q=6) >> [ 150.912055] All QSes seen, last rcu_sched kthread activity 5252 >> (4294930024-4294924772), jiffies_till_next_fqs=1 > > Same here, looks like RCU's grace-period kthreads haven't gotten a chance > to run for more than 5,000 jiffies. > >> [ 150.912055] busybox R running task042 41 >> 0x0008 >> [ 150.912055] 81afaf40 88007fa03d98 810781c0 >> 81a5e5f8 >> [ 150.912055] 81a45d80 88007fa03dd8 810af723 >> 0083 >> [ 150.912055] 88007fa13580 81a45d80 81a45d80 >> 810c43e0 >> [ 150.912055] Call
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On Wed, Jan 21, 2015 at 05:05:50PM +0800, Zhang Zhen wrote:
> On 2015/1/21 15:02, Paul E. McKenney wrote:
> > On Wed, Jan 21, 2015 at 02:54:05PM +0800, Zhang Zhen wrote:
> >> On 2015/1/21 11:13, Zhang Zhen wrote:
> >>> On 2015/1/21 10:26, Zhang Zhen wrote:
> On 2015/1/20 23:25, Don Zickus wrote:
> >
> > [ . . . ]
> >
> >>> Sorry, I made a mistake; the log above is based on v3.10.63.
> >>> I have tested the latest upstream kernel (based on ec6f34e5b552),
> >>> but my test case can't trigger the RCU stall warning.
> >>
> >> I used git bisect to find the bad commit, but found an irrelevant
> >> commit (db5d711e2db776 "zram: avoid null access when fail to alloc meta").
> >> Before this commit, my test case could easily trigger the RCU stall
> >> warning, but after this commit it can't.
> >
> > I have lost track of exactly what kernel you are using.  If you are
> > using a recent kernel (v3.18 or thereabouts), one thing to try is to
> > apply my stack of RCU CPU stall-warning changes that are on their way
> > in; please see v3.19-rc1..630181c4a915 in -rcu, which is at:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git
> >
> > These handle the problems that Dave Jones and a few others located this
> > past December.  Could you please give them a spin?
>
> Yeah, my test case can easily trigger the RCU stall warning based on
> commit 630181c4a915 in your linux-rcu git tree.
>
> In my .config, CONFIG_RCU_CPU_STALL_TIMEOUT=21.
>
> Before commenting out touch_softlockup_watchdog() in touch_nmi_watchdog:
>
> / #
> / # echo 60 > /proc/sys/kernel/watchdog_thresh
> [ 21.885200] NMI watchdog: disabled (cpu0): hardware events not enabled
> / # busybox insmod softlockup_test.ko
> [ 47.900117] INFO: rcu_sched detected stalls on CPUs/tasks:
> [ 47.900117] (detected by 0, t=5252 jiffies, g=-202, c=-203, q=7)
> [ 47.900117] All QSes seen, last rcu_sched kthread activity 5252 (4294904271-4294899019), jiffies_till_next_fqs=1

Looks like you are starving RCU's grace-period kthreads.  What happens
if you boot with rcutree.kthread_prio=1 in order to run them at
real-time priority?

> [ 47.900117] busybox R running task042 41 0x0008
> [ 47.900117] 81afaf40 88007fa03d98 810781c0 81a5e5f8
> [ 47.900117] 81a45d80 88007fa03dd8 810af723 0083
> [ 47.900117] 88007fa13580 81a45d80 81a45d80 810c43e0
> [ 47.900117] Call Trace:
> [ 47.900117] [] sched_show_task+0xb0/0x110
> [ 47.900117] [] print_other_cpu_stall+0x2d3/0x2f0
> [ 47.900117] [] ? tick_nohz_handler+0xc0/0xc0
> [ 47.900117] [] __rcu_pending+0x178/0x220
> [ 47.900117] [] rcu_check_callbacks+0x105/0x190
> [ 47.900117] [] ? tick_nohz_handler+0xc0/0xc0
> [ 47.900117] [] update_process_times+0x32/0x60
> [ 47.900117] [] tick_sched_handle+0x32/0x80
> [ 47.900117] [] tick_sched_timer+0x4d/0x90
> [ 47.900117] [] __run_hrtimer+0xc7/0x1c0
> [ 47.900117] [] hrtimer_interrupt+0xe7/0x220
> [ 47.900117] [] ? 0xa0005000
> [ 47.900117] [] local_apic_timer_interrupt+0x34/0x60
> [ 47.900117] [] smp_apic_timer_interrupt+0x3c/0x60
> [ 47.900117] [] apic_timer_interrupt+0x6a/0x70
> [ 47.900117] [] ? _raw_spin_lock+0x20/0x30
> [ 47.900117] [] test_init+0x5d/0x1000 [softlockup_test]
> [ 47.900117] [] ? 0xa0005000
> [ 47.900117] [] do_one_initcall+0xb8/0x1d0
> [ 47.900117] [] ? __vunmap+0x94/0xf0
> [ 47.900117] [] do_init_module+0x2b/0x1b0
> [ 47.900117] [] load_module+0x585/0x5f0
> [ 47.900117] [] ? mod_sysfs_teardown+0x150/0x150
> [ 47.900117] [] SyS_init_module+0x9b/0xc0
> [ 47.900117] [] system_call_fastpath+0x12/0x17
> [ 47.900117] rcu_sched kthread starved for 5252 jiffies!
>
> After commenting out touch_softlockup_watchdog() in touch_nmi_watchdog:
>
> / #
> / # echo 60 > /proc/sys/kernel/watchdog_thresh
> [ 123.011980] NMI watchdog: disabled (cpu0): hardware events not enabled
> / # busybox insmod softlockup_test.ko
> [ 150.912055] INFO: rcu_sched detected stalls on CPUs/tasks:
> [ 150.912055] (detected by 0, t=5252 jiffies, g=-198, c=-199, q=6)
> [ 150.912055] All QSes seen, last rcu_sched kthread activity 5252 (4294930024-4294924772), jiffies_till_next_fqs=1

Same here, looks like RCU's grace-period kthreads haven't gotten a
chance to run for more than 5,000 jiffies.

> [ 150.912055] busybox R running task042 41 0x0008
> [ 150.912055] 81afaf40 88007fa03d98 810781c0 81a5e5f8
> [ 150.912055] 81a45d80 88007fa03dd8 810af723 0083
> [ 150.912055] 88007fa13580 81a45d80 81a45d80 810c43e0
> [ 150.912055] Call Trace:
> [ 150.912055] [] sched_show_task+0xb0/0x110
> [ 150.912055] [] print_other_cpu_stall+0x2d3/0x2f0
> [ 150.912055]
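Paul's suggestion above (booting with rcutree.kthread_prio=1) can be tried as follows. This is a hedged sketch: the kernel-command-line fragment is the parameter named in the message, while the verification step assumes the grace-period kthread is called rcu_sched, as it is in the stall messages above.

```shell
# Kernel command line: run RCU's grace-period kthreads at SCHED_FIFO
# priority 1, so a CPU spinning with interrupts off cannot starve them
# indefinitely:
#   ... rcutree.kthread_prio=1 ...

# After boot, check the resulting scheduling class/priority (kthread
# name assumed from the stall messages; adjust if yours differs):
chrt -p "$(pgrep -x rcu_sched)"
```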
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On 2015/1/21 15:02, Paul E. McKenney wrote:
> On Wed, Jan 21, 2015 at 02:54:05PM +0800, Zhang Zhen wrote:
>> On 2015/1/21 11:13, Zhang Zhen wrote:
>>> On 2015/1/21 10:26, Zhang Zhen wrote:
On 2015/1/20 23:25, Don Zickus wrote:
>
> [ . . . ]
>
>>> Sorry, I made a mistake; the log above is based on v3.10.63.
>>> I have tested the latest upstream kernel (based on ec6f34e5b552),
>>> but my test case can't trigger the RCU stall warning.
>>
>> I used git bisect to find the bad commit, but found an irrelevant
>> commit (db5d711e2db776 "zram: avoid null access when fail to alloc meta").
>> Before this commit, my test case could easily trigger the RCU stall
>> warning, but after this commit it can't.
>
> I have lost track of exactly what kernel you are using.  If you are
> using a recent kernel (v3.18 or thereabouts), one thing to try is to
> apply my stack of RCU CPU stall-warning changes that are on their way
> in; please see v3.19-rc1..630181c4a915 in -rcu, which is at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git
>
> These handle the problems that Dave Jones and a few others located this
> past December.  Could you please give them a spin?

Yeah, my test case can easily trigger the RCU stall warning based on
commit 630181c4a915 in your linux-rcu git tree.

In my .config, CONFIG_RCU_CPU_STALL_TIMEOUT=21.

Before commenting out touch_softlockup_watchdog() in touch_nmi_watchdog:

/ #
/ # echo 60 > /proc/sys/kernel/watchdog_thresh
[ 21.885200] NMI watchdog: disabled (cpu0): hardware events not enabled
/ # busybox insmod softlockup_test.ko
[ 47.900117] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 47.900117] (detected by 0, t=5252 jiffies, g=-202, c=-203, q=7)
[ 47.900117] All QSes seen, last rcu_sched kthread activity 5252 (4294904271-4294899019), jiffies_till_next_fqs=1
[ 47.900117] busybox R running task042 41 0x0008
[ 47.900117] 81afaf40 88007fa03d98 810781c0 81a5e5f8
[ 47.900117] 81a45d80 88007fa03dd8 810af723 0083
[ 47.900117] 88007fa13580 81a45d80 81a45d80 810c43e0
[ 47.900117] Call Trace:
[ 47.900117] [] sched_show_task+0xb0/0x110
[ 47.900117] [] print_other_cpu_stall+0x2d3/0x2f0
[ 47.900117] [] ? tick_nohz_handler+0xc0/0xc0
[ 47.900117] [] __rcu_pending+0x178/0x220
[ 47.900117] [] rcu_check_callbacks+0x105/0x190
[ 47.900117] [] ? tick_nohz_handler+0xc0/0xc0
[ 47.900117] [] update_process_times+0x32/0x60
[ 47.900117] [] tick_sched_handle+0x32/0x80
[ 47.900117] [] tick_sched_timer+0x4d/0x90
[ 47.900117] [] __run_hrtimer+0xc7/0x1c0
[ 47.900117] [] hrtimer_interrupt+0xe7/0x220
[ 47.900117] [] ? 0xa0005000
[ 47.900117] [] local_apic_timer_interrupt+0x34/0x60
[ 47.900117] [] smp_apic_timer_interrupt+0x3c/0x60
[ 47.900117] [] apic_timer_interrupt+0x6a/0x70
[ 47.900117] [] ? _raw_spin_lock+0x20/0x30
[ 47.900117] [] test_init+0x5d/0x1000 [softlockup_test]
[ 47.900117] [] ? 0xa0005000
[ 47.900117] [] do_one_initcall+0xb8/0x1d0
[ 47.900117] [] ? __vunmap+0x94/0xf0
[ 47.900117] [] do_init_module+0x2b/0x1b0
[ 47.900117] [] load_module+0x585/0x5f0
[ 47.900117] [] ? mod_sysfs_teardown+0x150/0x150
[ 47.900117] [] SyS_init_module+0x9b/0xc0
[ 47.900117] [] system_call_fastpath+0x12/0x17
[ 47.900117] rcu_sched kthread starved for 5252 jiffies!

After commenting out touch_softlockup_watchdog() in touch_nmi_watchdog:

/ #
/ # echo 60 > /proc/sys/kernel/watchdog_thresh
[ 123.011980] NMI watchdog: disabled (cpu0): hardware events not enabled
/ # busybox insmod softlockup_test.ko
[ 150.912055] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 150.912055] (detected by 0, t=5252 jiffies, g=-198, c=-199, q=6)
[ 150.912055] All QSes seen, last rcu_sched kthread activity 5252 (4294930024-4294924772), jiffies_till_next_fqs=1
[ 150.912055] busybox R running task042 41 0x0008
[ 150.912055] 81afaf40 88007fa03d98 810781c0 81a5e5f8
[ 150.912055] 81a45d80 88007fa03dd8 810af723 0083
[ 150.912055] 88007fa13580 81a45d80 81a45d80 810c43e0
[ 150.912055] Call Trace:
[ 150.912055] [] sched_show_task+0xb0/0x110
[ 150.912055] [] print_other_cpu_stall+0x2d3/0x2f0
[ 150.912055] [] ? tick_nohz_handler+0xc0/0xc0
[ 150.912055] [] __rcu_pending+0x178/0x220
[ 150.912055] [] rcu_check_callbacks+0x105/0x190
[ 150.912055] [] ? tick_nohz_handler+0xc0/0xc0
[ 150.912055] [] update_process_times+0x32/0x60
[ 150.912055] [] tick_sched_handle+0x32/0x80
[ 150.912055] [] tick_sched_timer+0x4d/0x90
[ 150.912055] [] __run_hrtimer+0xc7/0x1c0
[ 150.912055] [] hrtimer_interrupt+0xe7/0x220
[ 150.912055] [] ? 0xa0005000
[ 150.912055] [] local_apic_timer_interrupt+0x34/0x60
[ 150.912055] []
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On Wed, Jan 21, 2015 at 07:11:51PM +0800, Zhang Zhen wrote:
On 2015/1/21 18:16, Paul E. McKenney wrote:
On Wed, Jan 21, 2015 at 05:05:50PM +0800, Zhang Zhen wrote:
On 2015/1/21 15:02, Paul E. McKenney wrote:
On Wed, Jan 21, 2015 at 02:54:05PM +0800, Zhang Zhen wrote:
On 2015/1/21 11:13, Zhang Zhen wrote:
On 2015/1/21 10:26, Zhang Zhen wrote:
On 2015/1/20 23:25, Don Zickus wrote:

[ . . . ]

Sorry, I made a mistake; the log above is based on v3.10.63. I have tested the latest upstream kernel (based on ec6f34e5b552), but my test case can't trigger the RCU stall warning.

I used git bisect to find the bad commit, but found an irrelevant commit (db5d711e2db776 "zram: avoid null access when fail to alloc meta"). Before this commit, my test case could easily trigger the RCU stall warning, but after this commit it can't.

I have lost track of exactly what kernel you are using.  If you are using a recent kernel (v3.18 or thereabouts), one thing to try is to apply my stack of RCU CPU stall-warning changes that are on their way in; please see v3.19-rc1..630181c4a915 in -rcu, which is at:

git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git

These handle the problems that Dave Jones and a few others located this past December.  Could you please give them a spin?

Yeah, my test case can easily trigger the RCU stall warning based on commit 630181c4a915 in your linux-rcu git tree.

In my .config, CONFIG_RCU_CPU_STALL_TIMEOUT=21.

Before commenting out touch_softlockup_watchdog() in touch_nmi_watchdog:

/ #
/ # echo 60 > /proc/sys/kernel/watchdog_thresh
[ 21.885200] NMI watchdog: disabled (cpu0): hardware events not enabled
/ # busybox insmod softlockup_test.ko
[ 47.900117] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 47.900117] (detected by 0, t=5252 jiffies, g=-202, c=-203, q=7)
[ 47.900117] All QSes seen, last rcu_sched kthread activity 5252 (4294904271-4294899019), jiffies_till_next_fqs=1

Looks like you are starving RCU's grace-period kthreads.  What happens if you boot with rcutree.kthread_prio=1 in order to run them at real-time priority?

[ 47.900117] busybox R running task042 41 0x0008
[ 47.900117] 81afaf40 88007fa03d98 810781c0 81a5e5f8
[ 47.900117] 81a45d80 88007fa03dd8 810af723 0083
[ 47.900117] 88007fa13580 81a45d80 81a45d80 810c43e0
[ 47.900117] Call Trace:
[ 47.900117] <IRQ> [810781c0] sched_show_task+0xb0/0x110
[ 47.900117] [810af723] print_other_cpu_stall+0x2d3/0x2f0
[ 47.900117] [810c43e0] ? tick_nohz_handler+0xc0/0xc0
[ 47.900117] [810af8b8] __rcu_pending+0x178/0x220
[ 47.900117] [810b02a5] rcu_check_callbacks+0x105/0x190
[ 47.900117] [810c43e0] ? tick_nohz_handler+0xc0/0xc0
[ 47.900117] [810b3582] update_process_times+0x32/0x60
[ 47.900117] [810c41e2] tick_sched_handle+0x32/0x80
[ 47.900117] [810c442d] tick_sched_timer+0x4d/0x90
[ 47.900117] [810b5b57] __run_hrtimer+0xc7/0x1c0
[ 47.900117] [810b5e37] hrtimer_interrupt+0xe7/0x220
[ 47.900117] [a0005000] ? 0xa0005000
[ 47.900117] [81035654] local_apic_timer_interrupt+0x34/0x60
[ 47.900117] [81035d8c] smp_apic_timer_interrupt+0x3c/0x60
[ 47.900117] [814ff02a] apic_timer_interrupt+0x6a/0x70
[ 47.900117] <EOI> [814fdaa0] ? _raw_spin_lock+0x20/0x30
[ 47.900117] [a000505d] test_init+0x5d/0x1000 [softlockup_test]
[ 47.900117] [a0005000] ? 0xa0005000
[ 47.900117] [81000288] do_one_initcall+0xb8/0x1d0
[ 47.900117] [81168794] ? __vunmap+0x94/0xf0
[ 47.900117] [810ce36b] do_init_module+0x2b/0x1b0
[ 47.900117] [810d09b5] load_module+0x585/0x5f0
[ 47.900117] [810cdd40] ? mod_sysfs_teardown+0x150/0x150
[ 47.900117] [810d0bbb] SyS_init_module+0x9b/0xc0
[ 47.900117] [814fe252] system_call_fastpath+0x12/0x17
[ 47.900117] rcu_sched kthread starved for 5252 jiffies!

After commenting out touch_softlockup_watchdog() in touch_nmi_watchdog:

/ #
/ # echo 60 > /proc/sys/kernel/watchdog_thresh
[ 123.011980] NMI watchdog: disabled (cpu0): hardware events not enabled
/ # busybox insmod softlockup_test.ko
[ 150.912055] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 150.912055] (detected by 0, t=5252 jiffies, g=-198, c=-199, q=6)
[ 150.912055] All QSes seen, last rcu_sched kthread activity 5252 (4294930024-4294924772), jiffies_till_next_fqs=1

Same here, looks like RCU's grace-period kthreads haven't gotten a chance to run for more than 5,000 jiffies.

[ 150.912055] busybox R running task042 41 0x0008
[ 150.912055] 81afaf40 88007fa03d98 810781c0 81a5e5f8
[ 150.912055] 81a45d80 88007fa03dd8
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On Wed, Jan 21, 2015 at 10:10:51AM -0500, Don Zickus wrote:
On Wed, Jan 21, 2015 at 10:26:27AM +0800, Zhang Zhen wrote:

This may not cause other problems, but what happens if you comment out the 'touch_softlockup_watchdog' call in the touch_nmi_watchdog function like below (based on latest upstream cb59670870)? The idea is that console printing for that cpu won't reset the softlockup detector. Again, other bad things might happen and this patch may not be a good final solution, but it can help give me a clue about what is going on.

I commented out the 'touch_softlockup_watchdog' call in the touch_nmi_watchdog function (based on latest upstream ec6f34e5b552). This triggered an RCU stall and a softlockup, but the softlockup warning was printed only once.

Yes, as expected. Instead of flooding the console with the same message, we report only once until the high-priority task is re-kicked and can re-arm the softlockup timer again. So this makes sense. Based on your double-spin-lock test, I don't ever expect this to get re-armed, so you should not expect to see another message.

As you mentioned, "other bad things" lead to the softlockup being printed only once. What's the bad thing?

Well, one of the assumptions here is that if you are going to have interrupts disabled for a long time and purposely kick the hardlockup detector (with touch_nmi_watchdog), then you can reasonably assume that softlockups will go off too if you don't also kick the softlockup detector (even though there is a window where this isn't true: 2*thresh - thresh). So the patch I gave you can lead to false softlockup warnings.

On the flip side, printing to the console also blocks hardlockups from showing up. But I believe long ago, Paul adjusted the RCU stall timeouts to be longer than a hardlockup/softlockup timeout to prevent such scenarios. I am not sure what to do here. Printing to the console has traditionally been slow (especially if doing so with interrupts disabled), so a touch_nmi_watchdog call has been necessary.
But a side effect of that is that it prevents the lockup detectors from going off if printing repeatedly. Now we can hack up the RCU stall code to print only once until it is re-armed. This would show the RCU stall printing first, followed two minutes later by a softlockup, which sorta achieves what you want. However, at the end of the day, an RCU stall, softlockup or hardlockup is a bad thing; does it matter if you get one or all of them? One should be enough to start an investigation, no?

Hear, hear!!!

Thanx, Paul

I don't know. I am leaning towards 'working as expected' and nothing to really fix here. Does anyone have any suggestions on what assumptions we could change to handle this better?

Cheers,
Don

/ #
/ # echo 60 > /proc/sys/kernel/watchdog_thresh
/ # busybox insmod softlockup_test.ko
/ #
[ 39.044058] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=21002 jiffies, g=78, c=77, q=4)
[ 39.044058] INFO: Stall ended before state dump start
[ 102.049045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=84007 jiffies, g=78, c=77, q=4)
[ 102.049045] INFO: Stall ended before state dump start
[ 160.482123] BUG: soft lockup - CPU#0 stuck for 134s!
[busybox:54]
[ 165.054075] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=147012 jiffies, g=78, c=77, q=4)
[ 165.054075] INFO: Stall ended before state dump start
[ 228.059045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=210017 jiffies, g=78, c=77, q=4)
[ 228.059045] INFO: Stall ended before state dump start
[ 291.064099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=273022 jiffies, g=78, c=77, q=4)
[ 291.064099] INFO: Stall ended before state dump start
[ 354.069074] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=336027 jiffies, g=78, c=77, q=4)
[ 354.069099] INFO: Stall ended before state dump start
[ 417.074045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=399032 jiffies, g=78, c=77, q=4)
[ 417.074045] INFO: Stall ended before state dump start
[ 480.079099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=462037 jiffies, g=78, c=77, q=4)
[ 480.079099] INFO: Stall ended before state dump start
[ 543.084099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=525042 jiffies, g=78, c=77, q=4)
[ 543.084099] INFO: Stall ended before state dump start
[ 606.089101] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=588047 jiffies, g=78, c=77, q=4)
[ 606.089101] INFO: Stall ended before state dump start
[ 669.094099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=651052 jiffies, g=78, c=77, q=4)
[ 669.094099] INFO:
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On Thu, Jan 22, 2015 at 11:08:11AM +0800, Zhang Zhen wrote:
On 2015/1/22 4:06, Paul E. McKenney wrote:
On Wed, Jan 21, 2015 at 10:10:51AM -0500, Don Zickus wrote:
On Wed, Jan 21, 2015 at 10:26:27AM +0800, Zhang Zhen wrote:

This may not cause other problems, but what happens if you comment out the 'touch_softlockup_watchdog' call in the touch_nmi_watchdog function like below (based on latest upstream cb59670870)? The idea is that console printing for that cpu won't reset the softlockup detector. Again, other bad things might happen and this patch may not be a good final solution, but it can help give me a clue about what is going on.

I commented out the 'touch_softlockup_watchdog' call in the touch_nmi_watchdog function (based on latest upstream ec6f34e5b552). This triggered an RCU stall and a softlockup, but the softlockup warning was printed only once.

Yes, as expected. Instead of flooding the console with the same message, we report only once until the high-priority task is re-kicked and can re-arm the softlockup timer again. So this makes sense. Based on your double-spin-lock test, I don't ever expect this to get re-armed, so you should not expect to see another message.

As you mentioned, "other bad things" lead to the softlockup being printed only once. What's the bad thing?

Well, one of the assumptions here is that if you are going to have interrupts disabled for a long time and purposely kick the hardlockup detector (with touch_nmi_watchdog), then you can reasonably assume that softlockups will go off too if you don't also kick the softlockup detector (even though there is a window where this isn't true: 2*thresh - thresh). So the patch I gave you can lead to false softlockup warnings.

On the flip side, printing to the console also blocks hardlockups from showing up. But I believe long ago, Paul adjusted the RCU stall timeouts to be longer than a hardlockup/softlockup timeout to prevent such scenarios. I am not sure what to do here.
Printing to the console has traditionally been slow (especially if doing so with interrupts disabled), so a touch_nmi_watchdog call has been necessary. But a side effect of that is that it prevents the lockup detectors from going off if printing repeatedly. Now we can hack up the RCU stall code to print only once until it is re-armed. This would show the RCU stall printing first, followed two minutes later by a softlockup, which sorta achieves what you want. However, at the end of the day, an RCU stall, softlockup or hardlockup is a bad thing; does it matter if you get one or all of them? One should be enough to start an investigation, no?

Hear, hear!!!

Ok, thanks for Don's and your patience! We enriched the softlockup printk information, so we want to get the softlockup warning.

Booting with rcupdate.rcu_cpu_stall_suppress=1 will shut up the RCU CPU stall warnings.

Thanx, Paul

I don't know. I am leaning towards 'working as expected' and nothing to really fix here. Does anyone have any suggestions on what assumptions we could change to handle this better?

Cheers,
Don

/ #
/ # echo 60 > /proc/sys/kernel/watchdog_thresh
/ # busybox insmod softlockup_test.ko
/ #
[ 39.044058] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=21002 jiffies, g=78, c=77, q=4)
[ 39.044058] INFO: Stall ended before state dump start
[ 102.049045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=84007 jiffies, g=78, c=77, q=4)
[ 102.049045] INFO: Stall ended before state dump start
[ 160.482123] BUG: soft lockup - CPU#0 stuck for 134s!
[busybox:54]
[ 165.054075] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=147012 jiffies, g=78, c=77, q=4)
[ 165.054075] INFO: Stall ended before state dump start
[ 228.059045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=210017 jiffies, g=78, c=77, q=4)
[ 228.059045] INFO: Stall ended before state dump start
[ 291.064099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=273022 jiffies, g=78, c=77, q=4)
[ 291.064099] INFO: Stall ended before state dump start
[ 354.069074] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=336027 jiffies, g=78, c=77, q=4)
[ 354.069099] INFO: Stall ended before state dump start
[ 417.074045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=399032 jiffies, g=78, c=77, q=4)
[ 417.074045] INFO: Stall ended before state dump start
[ 480.079099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=462037 jiffies, g=78, c=77, q=4)
[ 480.079099] INFO: Stall ended before state dump start
[ 543.084099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=525042 jiffies, g=78, c=77, q=4)
[ 543.084099] INFO: Stall ended
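Paul's rcupdate.rcu_cpu_stall_suppress=1 hint admits both a boot-time and a runtime form; the sketch below is hedged in that the sysfs path follows the standard module-parameter layout and is an assumption if your kernel exposes the parameter differently.

```shell
# At boot, on the kernel command line:
#   ... rcupdate.rcu_cpu_stall_suppress=1 ...

# At runtime, via the module-parameter sysfs path (assumed present and
# writable; value 1 suppresses RCU CPU stall warnings, 0 re-enables):
echo 1 > /sys/module/rcupdate/parameters/rcu_cpu_stall_suppress
```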
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On 2015/1/22 4:06, Paul E. McKenney wrote:
On Wed, Jan 21, 2015 at 10:10:51AM -0500, Don Zickus wrote:
On Wed, Jan 21, 2015 at 10:26:27AM +0800, Zhang Zhen wrote:

This may not cause other problems, but what happens if you comment out the 'touch_softlockup_watchdog' call in the touch_nmi_watchdog function like below (based on latest upstream cb59670870)? The idea is that console printing for that cpu won't reset the softlockup detector. Again, other bad things might happen and this patch may not be a good final solution, but it can help give me a clue about what is going on.

I commented out the 'touch_softlockup_watchdog' call in the touch_nmi_watchdog function (based on latest upstream ec6f34e5b552). This triggered an RCU stall and a softlockup, but the softlockup warning was printed only once.

Yes, as expected. Instead of flooding the console with the same message, we report only once until the high-priority task is re-kicked and can re-arm the softlockup timer again. So this makes sense. Based on your double-spin-lock test, I don't ever expect this to get re-armed, so you should not expect to see another message.

As you mentioned, "other bad things" lead to the softlockup being printed only once. What's the bad thing?

Well, one of the assumptions here is that if you are going to have interrupts disabled for a long time and purposely kick the hardlockup detector (with touch_nmi_watchdog), then you can reasonably assume that softlockups will go off too if you don't also kick the softlockup detector (even though there is a window where this isn't true: 2*thresh - thresh). So the patch I gave you can lead to false softlockup warnings.

On the flip side, printing to the console also blocks hardlockups from showing up. But I believe long ago, Paul adjusted the RCU stall timeouts to be longer than a hardlockup/softlockup timeout to prevent such scenarios. I am not sure what to do here.
Printing to the console has traditionally been slow (especially if doing so with interrupts disabled), so a touch_nmi_watchdog call has been necessary. But a side effect of that is that it prevents the lockup detectors from going off if printing repeatedly. Now we can hack up the RCU stall code to print only once until it is re-armed. This would show the RCU stall printing first, followed two minutes later by a softlockup, which sorta achieves what you want. However, at the end of the day, an RCU stall, softlockup or hardlockup is a bad thing; does it matter if you get one or all of them? One should be enough to start an investigation, no?

Hear, hear!!!

Ok, thanks for Don's and your patience! We enriched the softlockup printk information, so we want to get the softlockup warning.

Thanx, Paul

I don't know. I am leaning towards 'working as expected' and nothing to really fix here. Does anyone have any suggestions on what assumptions we could change to handle this better?

Cheers,
Don

/ #
/ # echo 60 > /proc/sys/kernel/watchdog_thresh
/ # busybox insmod softlockup_test.ko
/ #
[ 39.044058] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=21002 jiffies, g=78, c=77, q=4)
[ 39.044058] INFO: Stall ended before state dump start
[ 102.049045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=84007 jiffies, g=78, c=77, q=4)
[ 102.049045] INFO: Stall ended before state dump start
[ 160.482123] BUG: soft lockup - CPU#0 stuck for 134s!
[busybox:54]
[ 165.054075] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=147012 jiffies, g=78, c=77, q=4)
[ 165.054075] INFO: Stall ended before state dump start
[ 228.059045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=210017 jiffies, g=78, c=77, q=4)
[ 228.059045] INFO: Stall ended before state dump start
[ 291.064099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=273022 jiffies, g=78, c=77, q=4)
[ 291.064099] INFO: Stall ended before state dump start
[ 354.069074] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=336027 jiffies, g=78, c=77, q=4)
[ 354.069099] INFO: Stall ended before state dump start
[ 417.074045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=399032 jiffies, g=78, c=77, q=4)
[ 417.074045] INFO: Stall ended before state dump start
[ 480.079099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=462037 jiffies, g=78, c=77, q=4)
[ 480.079099] INFO: Stall ended before state dump start
[ 543.084099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=525042 jiffies, g=78, c=77, q=4)
[ 543.084099] INFO: Stall ended before state dump start
[ 606.089101] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=588047 jiffies, g=78, c=77, q=4)
[ 606.089101] INFO: Stall ended before state dump start
[ 669.094099] INFO: rcu_preempt detected stalls on
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On 2015/1/21 18:16, Paul E. McKenney wrote: On Wed, Jan 21, 2015 at 05:05:50PM +0800, Zhang Zhen wrote: On 2015/1/21 15:02, Paul E. McKenney wrote: On Wed, Jan 21, 2015 at 02:54:05PM +0800, Zhang Zhen wrote: On 2015/1/21 11:13, Zhang Zhen wrote: On 2015/1/21 10:26, Zhang Zhen wrote: On 2015/1/20 23:25, Don Zickus wrote: [ . . . ] Sorry, i made a mistake, the log above is based on v3.10.63. I have tested the latest upstream kernel (based on ec6f34e5b552), but my test case can't triggered RCU stall warning. I used git bisect to find the bad commit, but found a irrelevant commit(db5d711e2db776 zram: avoid null access when fail to alloc meta). Before this commit, my test case can easily trigger RCU stall warning, but after this commit, my test case can't trigger RCU stall warning. I have lost track of exactly what kernel you are using. If you are using a recent kernel (v3.18 or thereabouts), one thing to try is to apply my stack of RCU CPU stall-warning changes that are on their way in, please see v3.19-rc1..630181c4a915 in -rcu, which is at: git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git These handle the problems that Dave Jones and a few others located this past December. Could you please give them a spin? Yeah, my test case can easily trigger RCU stall-warning based on commit 630181c4a915 in your linux-rcu git tree. In my .config, CONFIG_RCU_CPU_STALL_TIMEOUT=21. Before commented out touch_softlockup_watchdog() in touch_nmi_watchdog: / # / # echo 60 /proc/sys/kernel/watchdog_thresh [ 21.885200] NMI watchdog: disabled (cpu0): hardware events not enabled / # busybox insmod softlockup_test.ko [ 47.900117] INFO: rcu_sched detected stalls on CPUs/tasks: [ 47.900117] (detected by 0, t=5252 jiffies, g=-202, c=-203, q=7) [ 47.900117] All QSes seen, last rcu_sched kthread activity 5252 (4294904271-4294899019), jiffies_till_next_fqs=1 Looks like you are starving RCU's grace-period kthreads. 
What happens if you boot with rcutree.kthread_prio=1 in order to run them at real-time priority? [ 47.900117] busybox R running task042 41 0x0008 [ 47.900117] 81afaf40 88007fa03d98 810781c0 81a5e5f8 [ 47.900117] 81a45d80 88007fa03dd8 810af723 0083 [ 47.900117] 88007fa13580 81a45d80 81a45d80 810c43e0 [ 47.900117] Call Trace: [ 47.900117] IRQ [810781c0] sched_show_task+0xb0/0x110 [ 47.900117] [810af723] print_other_cpu_stall+0x2d3/0x2f0 [ 47.900117] [810c43e0] ? tick_nohz_handler+0xc0/0xc0 [ 47.900117] [810af8b8] __rcu_pending+0x178/0x220 [ 47.900117] [810b02a5] rcu_check_callbacks+0x105/0x190 [ 47.900117] [810c43e0] ? tick_nohz_handler+0xc0/0xc0 [ 47.900117] [810b3582] update_process_times+0x32/0x60 [ 47.900117] [810c41e2] tick_sched_handle+0x32/0x80 [ 47.900117] [810c442d] tick_sched_timer+0x4d/0x90 [ 47.900117] [810b5b57] __run_hrtimer+0xc7/0x1c0 [ 47.900117] [810b5e37] hrtimer_interrupt+0xe7/0x220 [ 47.900117] [a0005000] ? 0xa0005000 [ 47.900117] [81035654] local_apic_timer_interrupt+0x34/0x60 [ 47.900117] [81035d8c] smp_apic_timer_interrupt+0x3c/0x60 [ 47.900117] [814ff02a] apic_timer_interrupt+0x6a/0x70 [ 47.900117] EOI [814fdaa0] ? _raw_spin_lock+0x20/0x30 [ 47.900117] [a000505d] test_init+0x5d/0x1000 [softlockup_test] [ 47.900117] [a0005000] ? 0xa0005000 [ 47.900117] [81000288] do_one_initcall+0xb8/0x1d0 [ 47.900117] [81168794] ? __vunmap+0x94/0xf0 [ 47.900117] [810ce36b] do_init_module+0x2b/0x1b0 [ 47.900117] [810d09b5] load_module+0x585/0x5f0 [ 47.900117] [810cdd40] ? mod_sysfs_teardown+0x150/0x150 [ 47.900117] [810d0bbb] SyS_init_module+0x9b/0xc0 [ 47.900117] [814fe252] system_call_fastpath+0x12/0x17 [ 47.900117] rcu_sched kthread starved for 5252 jiffies! 
After commented out touch_softlockup_watchdog() in touch_nmi_watchdog: / # / # echo 60 /proc/sys/kernel/watchdog_thresh [ 123.011980] NMI watchdog: disabled (cpu0): hardware events not enabled / # busybox insmod softlockup_test.ko [ 150.912055] INFO: rcu_sched detected stalls on CPUs/tasks: [ 150.912055] (detected by 0, t=5252 jiffies, g=-198, c=-199, q=6) [ 150.912055] All QSes seen, last rcu_sched kthread activity 5252 (4294930024-4294924772), jiffies_till_next_fqs=1 Same here, looks like RCU's grace-period kthreads haven't gotten a chance to run for more than 5,000 jiffies. [ 150.912055] busybox R running task042 41 0x0008 [ 150.912055] 81afaf40 88007fa03d98 810781c0 81a5e5f8 [ 150.912055] 81a45d80 88007fa03dd8
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On Wed, Jan 21, 2015 at 05:05:50PM +0800, Zhang Zhen wrote:
> On 2015/1/21 15:02, Paul E. McKenney wrote:
> > On Wed, Jan 21, 2015 at 02:54:05PM +0800, Zhang Zhen wrote:
> > > On 2015/1/21 11:13, Zhang Zhen wrote:
> > > > On 2015/1/21 10:26, Zhang Zhen wrote:
> > > > > On 2015/1/20 23:25, Don Zickus wrote:

[ . . . ]

> > > > Sorry, i made a mistake, the log above is based on v3.10.63.
> > > > I have tested the latest upstream kernel (based on ec6f34e5b552),
> > > > but my test case can't trigger the RCU stall warning.
> > >
> > > I used git bisect to find the bad commit, but found an irrelevant
> > > commit (db5d711e2db776 "zram: avoid null access when fail to alloc meta").
> > > Before this commit, my test case can easily trigger the RCU stall warning,
> > > but after this commit, my test case can't trigger the RCU stall warning.
> >
> > I have lost track of exactly what kernel you are using. If you are
> > using a recent kernel (v3.18 or thereabouts), one thing to try is to
> > apply my stack of RCU CPU stall-warning changes that are on their way
> > in, please see v3.19-rc1..630181c4a915 in -rcu, which is at:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git
> >
> > These handle the problems that Dave Jones and a few others located this
> > past December. Could you please give them a spin?
>
> Yeah, my test case can easily trigger the RCU stall warning based on
> commit 630181c4a915 in your linux-rcu git tree.
> In my .config, CONFIG_RCU_CPU_STALL_TIMEOUT=21.
>
> Before commenting out touch_softlockup_watchdog() in touch_nmi_watchdog():
>
> / #
> / # echo 60 > /proc/sys/kernel/watchdog_thresh
> [   21.885200] NMI watchdog: disabled (cpu0): hardware events not enabled
> / # busybox insmod softlockup_test.ko
> [   47.900117] INFO: rcu_sched detected stalls on CPUs/tasks:
> [   47.900117] (detected by 0, t=5252 jiffies, g=-202, c=-203, q=7)
> [   47.900117] All QSes seen, last rcu_sched kthread activity 5252 (4294904271-4294899019), jiffies_till_next_fqs=1

Looks like you are starving RCU's grace-period kthreads. What happens
if you boot with rcutree.kthread_prio=1 in order to run them at
real-time priority?

> [   47.900117] busybox R running task042 41 0x0008
> [   47.900117]  81afaf40 88007fa03d98 810781c0 81a5e5f8
> [   47.900117]  81a45d80 88007fa03dd8 810af723 0083
> [   47.900117]  88007fa13580 81a45d80 81a45d80 810c43e0
> [   47.900117] Call Trace:
> [   47.900117]  <IRQ>  [810781c0] sched_show_task+0xb0/0x110
> [   47.900117]  [810af723] print_other_cpu_stall+0x2d3/0x2f0
> [   47.900117]  [810c43e0] ? tick_nohz_handler+0xc0/0xc0
> [   47.900117]  [810af8b8] __rcu_pending+0x178/0x220
> [   47.900117]  [810b02a5] rcu_check_callbacks+0x105/0x190
> [   47.900117]  [810c43e0] ? tick_nohz_handler+0xc0/0xc0
> [   47.900117]  [810b3582] update_process_times+0x32/0x60
> [   47.900117]  [810c41e2] tick_sched_handle+0x32/0x80
> [   47.900117]  [810c442d] tick_sched_timer+0x4d/0x90
> [   47.900117]  [810b5b57] __run_hrtimer+0xc7/0x1c0
> [   47.900117]  [810b5e37] hrtimer_interrupt+0xe7/0x220
> [   47.900117]  [a0005000] ? 0xa0005000
> [   47.900117]  [81035654] local_apic_timer_interrupt+0x34/0x60
> [   47.900117]  [81035d8c] smp_apic_timer_interrupt+0x3c/0x60
> [   47.900117]  [814ff02a] apic_timer_interrupt+0x6a/0x70
> [   47.900117]  <EOI>  [814fdaa0] ? _raw_spin_lock+0x20/0x30
> [   47.900117]  [a000505d] test_init+0x5d/0x1000 [softlockup_test]
> [   47.900117]  [a0005000] ? 0xa0005000
> [   47.900117]  [81000288] do_one_initcall+0xb8/0x1d0
> [   47.900117]  [81168794] ? __vunmap+0x94/0xf0
> [   47.900117]  [810ce36b] do_init_module+0x2b/0x1b0
> [   47.900117]  [810d09b5] load_module+0x585/0x5f0
> [   47.900117]  [810cdd40] ? mod_sysfs_teardown+0x150/0x150
> [   47.900117]  [810d0bbb] SyS_init_module+0x9b/0xc0
> [   47.900117]  [814fe252] system_call_fastpath+0x12/0x17
> [   47.900117] rcu_sched kthread starved for 5252 jiffies!
>
> After commenting out touch_softlockup_watchdog() in touch_nmi_watchdog():
>
> / #
> / # echo 60 > /proc/sys/kernel/watchdog_thresh
> [  123.011980] NMI watchdog: disabled (cpu0): hardware events not enabled
> / # busybox insmod softlockup_test.ko
> [  150.912055] INFO: rcu_sched detected stalls on CPUs/tasks:
> [  150.912055] (detected by 0, t=5252 jiffies, g=-198, c=-199, q=6)
> [  150.912055] All QSes seen, last rcu_sched kthread activity 5252 (4294930024-4294924772), jiffies_till_next_fqs=1

Same here, looks like RCU's grace-period kthreads haven't gotten a chance
to run for more than 5,000 jiffies.

> [  150.912055] busybox R running task042 41 0x0008
> [  150.912055]  81afaf40 88007fa03d98 810781c0 81a5e5f8
> [  150.912055]  81a45d80 88007fa03dd8
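[Editorial note: the "All QSes seen, last rcu_sched kthread activity 5252 (4294904271-4294899019)" line above boils down to a jiffies subtraction compared against the stall timeout. The sketch below is a toy Python model of that arithmetic, not the kernel's actual C implementation; the function name and the HZ=250 assumption are illustrative only.]

```python
def rcu_kthread_starved(jiffies_now, last_activity, stall_timeout_jiffies):
    """Toy model: the grace-period kthread counts as starved when it has
    not run for at least one full stall-timeout interval."""
    delta = jiffies_now - last_activity
    return delta, delta >= stall_timeout_jiffies

# Numbers from the log above, assuming HZ=250 and
# CONFIG_RCU_CPU_STALL_TIMEOUT=21, i.e. roughly 21 * 250 = 5250 jiffies.
delta, starved = rcu_kthread_starved(4294904271, 4294899019, 21 * 250)
print(delta, starved)  # -> 5252 True
```

This reproduces the "starved for 5252 jiffies" figure: the two timestamps in parentheses differ by 5252 jiffies, just past the 21-second timeout.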
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On Wed, Jan 21, 2015 at 10:26:27AM +0800, Zhang Zhen wrote:
> > This may not cause other problems but what happens if you comment out the
> > 'touch_softlockup_watchdog' from the touch_nmi_watchdog function like
> > below (based on latest upstream cb59670870)?
> >
> > The idea is that console printing for that cpu won't reset the softlockup
> > detector. Again other bad things might happen and this patch may not be a
> > good final solution, but it can help give me a clue about what is going
> > on.
>
> I commented out the 'touch_softlockup_watchdog' from the
> touch_nmi_watchdog function (based on latest upstream ec6f34e5b552).
> This triggered RCU stall and softlockup, but the softlockup printed only
> once.

Yes, as expected. Instead of flooding the console with the same message,
report only once until the high priority task is re-kicked and can re-arm
the softlockup timer again. So this makes sense. Based on your double
spin lock test, I don't ever expect this to get re-armed, so you should
not expect to see another message.

> As you mentioned, "other bad things" lead to the softlockup printing only
> once. What's the bad thing?

Well, one of the assumptions here is that if you are going to have
interrupts disabled for a long time and purposely kick the hardlockup
detector (with touch_nmi_watchdog), then you can reasonably assume that
softlockups will go off too if you don't also kick the softlockup detector
(even though there is a window where this isn't true: 2*thresh - thresh).

So the patch I gave you can lead to false softlockup warnings.

On the flip side, printing to the console also blocks hardlockups from
showing up. But I believe long ago, Paul adjusted the rcu stalls to be
longer than a hardlockup/softlockup timeout to prevent such scenarios.

I am not sure what to do here. Printing to the console has traditionally
been slow (especially if doing so with interrupts disabled), so a
touch_nmi_watchdog has been necessary.
But a side effect of that is it prevents the lockup detectors from going
off if printing repeatedly.

Now we can hack up rcu stall to only print once until it is re-armed.
This would show rcu stall printing first, followed two minutes later by a
softlockup, which sorta achieves what you want.

However, at the end of the day, an rcu stall, softlockup or hardlockup is
a bad thing; does it matter if you get one or all of them? One should be
enough to start an investigation, no? I don't know. I am leaning towards
'working as expected' and nothing to really fix here.

Does anyone have any suggestions on what assumptions we could change to
handle this better?

Cheers,
Don

> / #
> / # echo 60 > /proc/sys/kernel/watchdog_thresh
> / # busybox insmod softlockup_test.ko &
> / # [   39.044058] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=21002 jiffies, g=78, c=77, q=4)
> [   39.044058] INFO: Stall ended before state dump start
> [  102.049045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=84007 jiffies, g=78, c=77, q=4)
> [  102.049045] INFO: Stall ended before state dump start
> [  160.482123] BUG: soft lockup - CPU#0 stuck for 134s! [busybox:54]
> [  165.054075] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=147012 jiffies, g=78, c=77, q=4)
> [  165.054075] INFO: Stall ended before state dump start
> [  228.059045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=210017 jiffies, g=78, c=77, q=4)
> [  228.059045] INFO: Stall ended before state dump start
> [  291.064099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=273022 jiffies, g=78, c=77, q=4)
> [  291.064099] INFO: Stall ended before state dump start
> [  354.069074] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=336027 jiffies, g=78, c=77, q=4)
> [  354.069099] INFO: Stall ended before state dump start
> [  417.074045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=399032 jiffies, g=78, c=77, q=4)
> [  417.074045] INFO: Stall ended before state dump start
> [  480.079099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=462037 jiffies, g=78, c=77, q=4)
> [  480.079099] INFO: Stall ended before state dump start
> [  543.084099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=525042 jiffies, g=78, c=77, q=4)
> [  543.084099] INFO: Stall ended before state dump start
> [  606.089101] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=588047 jiffies, g=78, c=77, q=4)
> [  606.089101] INFO: Stall ended before state dump start
> [  669.094099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=651052 jiffies, g=78, c=77, q=4)
> [  669.094099] INFO: Stall ended before state dump start
> [  732.099045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=714057 jiffies, g=78, c=77, q=4)
> [  732.099045] INFO: Stall ended before state dump start
> [  795.104074] INFO: rcu_preempt
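[Editorial note: Don's observation that console printing keeps resetting the softlockup detector can be sketched with a toy timing model. This is illustrative Python, not the kernel's C code; the function name is hypothetical and the check is a simplification of the per-cpu timestamp comparison done in kernel/watchdog.c.]

```python
def softlockup_detected(now, touch_ts, thresh):
    """Toy model of the softlockup check: warn only when the per-cpu
    watchdog timestamp has not been touched for more than thresh seconds."""
    return (now - touch_ts) > thresh

THRESH = 60  # echo 60 > /proc/sys/kernel/watchdog_thresh

# CPU spins for 130s and nothing touches the watchdog: the check fires.
print(softlockup_detected(now=130, touch_ts=0, thresh=THRESH))    # -> True

# Same spin, but an RCU stall warning printed at t=125 went through
# touch_nmi_watchdog() -> touch_softlockup_watchdog(), resetting the
# timestamp, so the check at t=130 no longer fires.
print(softlockup_detected(now=130, touch_ts=125, thresh=THRESH))  # -> False
```

In other words, each periodic RCU stall spew re-arms the timestamp before the softlockup deadline expires, which is why the stall messages flood the console while the softlockup warning never (or only once) appears.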
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On 2015/1/21 15:02, Paul E. McKenney wrote:
> On Wed, Jan 21, 2015 at 02:54:05PM +0800, Zhang Zhen wrote:
>> On 2015/1/21 11:13, Zhang Zhen wrote:
>>> On 2015/1/21 10:26, Zhang Zhen wrote:
>>>> On 2015/1/20 23:25, Don Zickus wrote:
>
> [ . . . ]
>
>>> Sorry, i made a mistake, the log above is based on v3.10.63.
>>> I have tested the latest upstream kernel (based on ec6f34e5b552),
>>> but my test case can't trigger the RCU stall warning.
>>>
>>
>> I used git bisect to find the bad commit, but found an irrelevant
>> commit (db5d711e2db776 "zram: avoid null access when fail to alloc meta").
>> Before this commit, my test case can easily trigger the RCU stall warning,
>> but after this commit, my test case can't trigger the RCU stall warning.
>
> I have lost track of exactly what kernel you are using. If you are
> using a recent kernel (v3.18 or thereabouts), one thing to try is to
> apply my stack of RCU CPU stall-warning changes that are on their way
> in, please see v3.19-rc1..630181c4a915 in -rcu, which is at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git
>
> These handle the problems that Dave Jones and a few others located this
> past December. Could you please give them a spin?
>
Ok, i will try. Thanks!

> 							Thanx, Paul
>
>> I'm totally confused.
>>
>> My test case:
>>
>> //
>> #include <linux/kernel.h>
>> #include <linux/module.h>
>> #include <linux/spinlock.h>
>> #include <linux/kthread.h>
>> #include <linux/slab.h>
>>
>> struct foo {
>> 	int a;
>> 	char b;
>> 	long c;
>> };
>>
>> struct foo gbl_foo = {1, 'a', 2};
>> struct foo *pgbl_foo = &gbl_foo;
>>
>> static int my_kthread(void *data)
>> {
>> 	DEFINE_SPINLOCK(foo_mutex);
>> 	struct foo *new_fp;
>> 	struct foo *old_fp;
>>
>> 	new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL);
>> 	spin_lock(&foo_mutex);
>> 	old_fp = pgbl_foo;
>> 	*new_fp = *old_fp;
>> 	new_fp->a = 2;
>> 	rcu_assign_pointer(pgbl_foo, new_fp);
>> 	spin_unlock(&foo_mutex);
>> 	synchronize_rcu();
>> 	kfree(old_fp);
>>
>> 	return 0;
>> }
>>
>> static int hello_start(void)
>> {
>> 	struct task_struct *my_task = NULL;
>> 	DEFINE_SPINLOCK(hello_lock);
>>
>> 	my_task = kthread_run(my_kthread, NULL, "my_thread%d", 1);
>>
>> 	spin_lock_init(&hello_lock);
>> 	spin_lock(&hello_lock);
>> 	spin_lock(&hello_lock);
>> 	return 0;
>> }
>>
>> static int __init test_init(void)
>> {
>> 	hello_start();
>>
>> 	printk(KERN_INFO "Module init\n");
>> 	return 0;
>> }
>>
>> static void __exit test_exit(void)
>> {
>> 	printk(KERN_INFO "Module exit!\n");
>> }
>>
>> module_init(test_init)
>> module_exit(test_exit)
>> MODULE_LICENSE("GPL");
>> //
>
> Cheers,
> Don
>
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index 70bf118..833c015 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -209,7 +209,7 @@ void touch_nmi_watchdog(void)
>  	 * going off.
>  	 */
>  	raw_cpu_write(watchdog_nmi_touch, true);
> -	touch_softlockup_watchdog();
> +	//touch_softlockup_watchdog();
> }
> EXPORT_SYMBOL(touch_nmi_watchdog);

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
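[Editorial note: the test module above wedges the CPU by taking the same non-recursive spinlock twice from one task, so the second spin_lock() busy-waits forever with preemption disabled. The sketch below is a toy Python model of that behavior, illustration only; real kernel spinlocks are not implemented this way, and try_lock() here stands in for what would be an infinite spin.]

```python
class ToySpinLock:
    """Minimal non-recursive spinlock model: a second acquisition by the
    current holder would spin forever, so try_lock() reports failure
    instead of hanging."""
    def __init__(self):
        self.held = False

    def try_lock(self):
        if self.held:
            return False  # a real spin_lock() would busy-wait here forever
        self.held = True
        return True

hello_lock = ToySpinLock()
print(hello_lock.try_lock())  # first spin_lock(&hello_lock): acquired -> True
print(hello_lock.try_lock())  # second spin_lock(&hello_lock): never succeeds -> False
```

Because the spinning task holds the lock it is waiting for, no progress is ever possible: the CPU loops with preemption disabled, which is exactly the condition the softlockup detector and the RCU stall detector are built to report.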
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On 2015/1/20 23:25, Don Zickus wrote:
> On Tue, Jan 20, 2015 at 11:09:19AM +0800, Zhang Zhen wrote:
>>
>>> Of course back then, touch_nmi_watchdog touched all cpus. So a problem
>>> like this was masked. I believe this upstream commit 62572e29bc53, solved
>>> the problem.
>>
>> Thanks for your suggestion.
>>
>> Commit 62572e29bc53 changed the semantics of touch_nmi_watchdog and made it
>> only touch the local cpu, not every one.
>> But watchdog_nmi_touch = true only guarantees no hard lockup check on this
>> cpu.
>>
>> Commit 62572e29bc53 didn't change the semantics of
>> touch_softlockup_watchdog.
>
> Ah, yes. I reviewed the commit too quickly yesterday. I thought
> touch_softlockup_watchdog was called on every cpu and that commit changed
> it to the local cpu. But that was incorrect.
>
>>>
>>> You can apply that commit and see if you get both RCU stall
>>> messages _and_ softlockup messages. I believe that is what you were
>>> expecting, correct?
>>>
>> Correct, i expect i can get both RCU stall messages _and_ softlockup
>> messages.
>> I applied that commit, and i only got RCU stall messages.
>
> Hmm, I believe the act of printing to the console calls touch_nmi_watchdog
> which calls touch_softlockup_watchdog. I think that is the problem there.
>
Yeah, you are right.

> This may not cause other problems but what happens if you comment out the
> 'touch_softlockup_watchdog' from the touch_nmi_watchdog function like
> below (based on latest upstream cb59670870)?
>
> The idea is that console printing for that cpu won't reset the softlockup
> detector. Again other bad things might happen and this patch may not be a
> good final solution, but it can help give me a clue about what is going
> on.

I commented out the 'touch_softlockup_watchdog' from the
touch_nmi_watchdog function (based on latest upstream ec6f34e5b552).
This triggered RCU stall and softlockup, but the softlockup printed only
once.
As you mentioned "other bad things" lead to the softlockup printing only
once. What's the bad thing?

/ #
/ # echo 60 > /proc/sys/kernel/watchdog_thresh
/ # busybox insmod softlockup_test.ko &
/ # [   39.044058] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=21002 jiffies, g=78, c=77, q=4)
[   39.044058] INFO: Stall ended before state dump start
[  102.049045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=84007 jiffies, g=78, c=77, q=4)
[  102.049045] INFO: Stall ended before state dump start
[  160.482123] BUG: soft lockup - CPU#0 stuck for 134s! [busybox:54]
[  165.054075] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=147012 jiffies, g=78, c=77, q=4)
[  165.054075] INFO: Stall ended before state dump start
[  228.059045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=210017 jiffies, g=78, c=77, q=4)
[  228.059045] INFO: Stall ended before state dump start
[  291.064099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=273022 jiffies, g=78, c=77, q=4)
[  291.064099] INFO: Stall ended before state dump start
[  354.069074] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=336027 jiffies, g=78, c=77, q=4)
[  354.069099] INFO: Stall ended before state dump start
[  417.074045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=399032 jiffies, g=78, c=77, q=4)
[  417.074045] INFO: Stall ended before state dump start
[  480.079099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=462037 jiffies, g=78, c=77, q=4)
[  480.079099] INFO: Stall ended before state dump start
[  543.084099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=525042 jiffies, g=78, c=77, q=4)
[  543.084099] INFO: Stall ended before state dump start
[  606.089101] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=588047 jiffies, g=78, c=77, q=4)
[  606.089101] INFO: Stall ended before state dump start
[  669.094099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=651052 jiffies, g=78, c=77, q=4)
[  669.094099] INFO: Stall ended before state dump start
[  732.099045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=714057 jiffies, g=78, c=77, q=4)
[  732.099045] INFO: Stall ended before state dump start
[  795.104074] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=777062 jiffies, g=78, c=77, q=4)
[  795.104098] INFO: Stall ended before state dump start
[  858.109046] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=840067 jiffies, g=78, c=77, q=4)
[  858.109046] INFO: Stall ended before state dump start
[  921.114100] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=903072 jiffies, g=78, c=77, q=4)
[  921.114100] INFO: Stall ended before state dump start
[  984.119099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=966077 jiffies, g=78, c=77, q=4)
[  984.119099] INFO: Stall
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On Tue, Jan 20, 2015 at 11:09:19AM +0800, Zhang Zhen wrote:
>
>> Of course back then, touch_nmi_watchdog touched all cpus. So a problem
>> like this was masked. I believe this upstream commit 62572e29bc53, solved
>> the problem.
>
> Thanks for your suggestion.
>
> Commit 62572e29bc53 changed the semantics of touch_nmi_watchdog and made it
> only touch the local cpu, not every one.
> But watchdog_nmi_touch = true only guarantees no hard lockup check on this cpu.
>
> Commit 62572e29bc53 didn't change the semantics of touch_softlockup_watchdog.

Ah, yes. I reviewed the commit too quickly yesterday. I thought
touch_softlockup_watchdog was called on every cpu and that commit changed
it to the local cpu. But that was incorrect.

>>
>> You can apply that commit and see if you get both RCU stall
>> messages _and_ softlockup messages. I believe that is what you were
>> expecting, correct?
>>
> Correct, i expect i can get both RCU stall messages _and_ softlockup
> messages.
> I applied that commit, and i only got RCU stall messages.

Hmm, I believe the act of printing to the console calls touch_nmi_watchdog
which calls touch_softlockup_watchdog. I think that is the problem there.

This may not cause other problems but what happens if you comment out the
'touch_softlockup_watchdog' from the touch_nmi_watchdog function like
below (based on latest upstream cb59670870)?

The idea is that console printing for that cpu won't reset the softlockup
detector. Again other bad things might happen and this patch may not be a
good final solution, but it can help give me a clue about what is going
on.

Cheers,
Don

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 70bf118..833c015 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -209,7 +209,7 @@ void touch_nmi_watchdog(void)
 	 * going off.
 	 */
 	raw_cpu_write(watchdog_nmi_touch, true);
-	touch_softlockup_watchdog();
+	//touch_softlockup_watchdog();
 }
 EXPORT_SYMBOL(touch_nmi_watchdog);
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On 2015/1/20 23:25, Don Zickus wrote:
> On Tue, Jan 20, 2015 at 11:09:19AM +0800, Zhang Zhen wrote:

[ . . . ]

> Hmm, I believe the act of printing to the console calls touch_nmi_watchdog
> which calls touch_softlockup_watchdog. I think that is the problem there.

Yeah, you are right.

> This may not cause other problems, but what happens if you comment out the
> 'touch_softlockup_watchdog' from the touch_nmi_watchdog function like below
> (based on latest upstream cb59670870)?
>
> The idea is that console printing for that cpu won't reset the softlockup
> detector. Again, other bad things might happen and this patch may not be a
> good final solution, but it can help give me a clue about what is going on.

I commented out the 'touch_softlockup_watchdog' from the touch_nmi_watchdog
function (based on latest upstream ec6f34e5b552).
This triggered both the RCU stall and the softlockup, but the softlockup
message was printed only once.

As you mentioned, "other bad things" lead to the softlockup being printed
only once. What's the bad thing ?
/ #
/ # echo 60 > /proc/sys/kernel/watchdog_thresh
/ # busybox insmod softlockup_test.ko
/ # [ 39.044058] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=21002 jiffies, g=78, c=77, q=4)
[ 39.044058] INFO: Stall ended before state dump start
[ 102.049045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=84007 jiffies, g=78, c=77, q=4)
[ 102.049045] INFO: Stall ended before state dump start
[ 160.482123] BUG: soft lockup - CPU#0 stuck for 134s! [busybox:54]
[ 165.054075] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=147012 jiffies, g=78, c=77, q=4)
[ 165.054075] INFO: Stall ended before state dump start
[ 228.059045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=210017 jiffies, g=78, c=77, q=4)
[ 228.059045] INFO: Stall ended before state dump start
[ 291.064099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=273022 jiffies, g=78, c=77, q=4)
[ 291.064099] INFO: Stall ended before state dump start
[ 354.069074] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=336027 jiffies, g=78, c=77, q=4)
[ 354.069099] INFO: Stall ended before state dump start
[ 417.074045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=399032 jiffies, g=78, c=77, q=4)
[ 417.074045] INFO: Stall ended before state dump start
[ 480.079099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=462037 jiffies, g=78, c=77, q=4)
[ 480.079099] INFO: Stall ended before state dump start
[ 543.084099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=525042 jiffies, g=78, c=77, q=4)
[ 543.084099] INFO: Stall ended before state dump start
[ 606.089101] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=588047 jiffies, g=78, c=77, q=4)
[ 606.089101] INFO: Stall ended before state dump start
[ 669.094099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=651052 jiffies, g=78, c=77, q=4)
[ 669.094099] INFO: Stall ended before state dump start
[ 732.099045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=714057 jiffies, g=78, c=77, q=4)
[ 732.099045] INFO: Stall ended before state dump start
[ 795.104074] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=777062 jiffies, g=78, c=77, q=4)
[ 795.104098] INFO: Stall ended before state dump start
[ 858.109046] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=840067 jiffies, g=78, c=77, q=4)
[ 858.109046] INFO: Stall ended before state dump start
[ 921.114100] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=903072 jiffies, g=78, c=77, q=4)
[ 921.114100] INFO: Stall ended before state dump start
[ 984.119099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=966077 jiffies, g=78, c=77, q=4)
[ 984.119099] INFO: Stall ended before state dump start
[ 1047.124075] INFO: rcu_preempt detected
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On Wed, Jan 21, 2015 at 02:54:05PM +0800, Zhang Zhen wrote:
> On 2015/1/21 11:13, Zhang Zhen wrote:
> > On 2015/1/21 10:26, Zhang Zhen wrote:
> > > On 2015/1/20 23:25, Don Zickus wrote:

[ . . . ]

> Sorry, I made a mistake, the log above is based on v3.10.63.
> I have tested the latest upstream kernel (based on ec6f34e5b552), but my
> test case can't trigger the RCU stall warning.
> I used git bisect to find the bad commit, but found an irrelevant commit
> (db5d711e2db776 "zram: avoid null access when fail to alloc meta").
> Before this commit, my test case could easily trigger the RCU stall
> warning, but after this commit, my test case can't trigger it.

I have lost track of exactly what kernel you are using. If you are using a
recent kernel (v3.18 or thereabouts), one thing to try is to apply my stack
of RCU CPU stall-warning changes that are on their way in, please see
v3.19-rc1..630181c4a915 in -rcu, which is at:

git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git

These handle the problems that Dave Jones and a few others located this
past December. Could you please give them a spin?

							Thanx, Paul

I'm totally confused.
My test case:
//
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/spinlock.h>
#include <linux/slab.h>
#include <linux/kthread.h>

struct foo {
	int a;
	char b;
	long c;
};

struct foo gbl_foo = {1, 'a', 2};
struct foo *pgbl_foo = &gbl_foo;

static int my_kthread(void *data)
{
	DEFINE_SPINLOCK(foo_mutex);
	struct foo *new_fp;
	struct foo *old_fp;

	new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL);
	spin_lock(&foo_mutex);
	old_fp = pgbl_foo;
	*new_fp = *old_fp;
	new_fp->a = 2;
	rcu_assign_pointer(pgbl_foo, new_fp);
	spin_unlock(&foo_mutex);
	synchronize_rcu();
	kfree(old_fp);
	return 0;
}

static int hello_start(void)
{
	struct task_struct *my_task = NULL;
	DEFINE_SPINLOCK(hello_lock);

	my_task = kthread_run(my_kthread, NULL, "my_thread%d", 1);
	spin_lock_init(&hello_lock);
	spin_lock(&hello_lock);
	spin_lock(&hello_lock);
	return 0;
}

static int __init test_init(void)
{
	hello_start();
	printk(KERN_INFO "Module init\n");
	return 0;
}

static void __exit test_exit(void)
{
	printk(KERN_INFO "Module exit!\n");
}

module_init(test_init)
module_exit(test_exit)
MODULE_LICENSE("GPL");
//

Cheers,
Don

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 70bf118..833c015 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -209,7 +209,7 @@ void touch_nmi_watchdog(void)
 	 * going off.
 	 */
 	raw_cpu_write(watchdog_nmi_touch, true);
-	touch_softlockup_watchdog();
+	//touch_softlockup_watchdog();
 }
 EXPORT_SYMBOL(touch_nmi_watchdog);
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On 2015/1/21 15:02, Paul E. McKenney wrote:
> On Wed, Jan 21, 2015 at 02:54:05PM +0800, Zhang Zhen wrote:

[ . . . ]

> I have lost track of exactly what kernel you are using. If you are using a
> recent kernel (v3.18 or thereabouts), one thing to try is to apply my stack
> of RCU CPU stall-warning changes that are on their way in, please see
> v3.19-rc1..630181c4a915 in -rcu, which is at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git
>
> These handle the problems that Dave Jones and a few others located this
> past December. Could you please give them a spin?

Ok, I will try. Thanks!
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On Tue, Jan 20, 2015 at 11:17:01AM +0800, Zhang Zhen wrote: > On 2015/1/19 19:09, Paul E. McKenney wrote: > > On Mon, Jan 19, 2015 at 05:04:29PM +0800, Zhang Zhen wrote: > >> On 2015/1/19 16:42, Paul E. McKenney wrote: > >>> On Mon, Jan 19, 2015 at 04:07:15PM +0800, Zhang Zhen wrote: > Hi, > > On my x86_64 qemu virtual machine, RCU CPU stall console spews may > lead to soft lockup disabled. > > If softlockup_thresh > rcu_cpu_stall_timeout (softlockup_thresh = 2 * > watchdog_thresh): > > / # > / # busybox cat /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout > 21 > / # echo 60 > /proc/sys/kernel/watchdog_thresh > / # busybox insmod softlockup_test.ko > [ 44.959044] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > (detected by 0, t=21002 jiffies, g=85, c=84, q=4) > [ 44.959044] INFO: Stall ended before state dump start > [ 107.964045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > (detected by 0, t=84007 jiffies, g=85, c=84, q=4) > [ 107.964045] INFO: Stall ended before state dump start > [ 170.969060] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > (detected by 0, t=147012 jiffies, g=85, c=84, q=4) > [ 170.969060] INFO: Stall ended before state dump start > [ 233.974057] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > (detected by 0, t=210017 jiffies, g=85, c=84, q=4) > [ 233.974057] INFO: Stall ended before state dump start > [ 296.979059] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > (detected by 0, t=273022 jiffies, g=85, c=84, q=4) > [ 296.979059] INFO: Stall ended before state dump start > [ 359.984058] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > (detected by 0, t=336027 jiffies, g=85, c=84, q=4) > [ 359.984058] INFO: Stall ended before state dump start > [ 422.989059] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > (detected by 0, t=399032 jiffies, g=85, c=84, q=4) > [ 422.989059] INFO: Stall ended before state dump start > [ 485.994056] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > (detected by 0, 
t=462037 jiffies, g=85, c=84, q=4) > [ 485.994056] INFO: Stall ended before state dump start > [ 548.999059] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > (detected by 0, t=525042 jiffies, g=85, c=84, q=4) > [ 548.999059] INFO: Stall ended before state dump start > [ 612.004061] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > (detected by 0, t=588047 jiffies, g=85, c=84, q=4) > [ 612.004061] INFO: Stall ended before state dump start > [ 675.009058] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > (detected by 0, t=651052 jiffies, g=85, c=84, q=4) > [ 675.009058] INFO: Stall ended before state dump start > > If softlockup_thresh < rcu_cpu_stall_timeout: > > / # > / # busybox cat /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout > 21 > / # echo 5 > /proc/sys/kernel/watchdog_thresh > / # busybox insmod softlockup_test.ko > [ 38.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53] > [ 52.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53] > [ 66.450073] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53] > [ 80.450060] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53] > [ 94.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53] > > The softlockup_test.ko source code is: > // > #include > #include > #include > #include > > static int hello_start(void) > { > DEFINE_SPINLOCK(hello_lock); > spin_lock_init(&hello_lock); > spin_lock(&hello_lock); > spin_lock(&hello_lock); > >>> > >>> Did you really intend to acquire the same spinlock twice in a row, > >>> forcing a self-deadlock? If not, I of course suggest changing the second > >>> "spin_lock()" to "spin_unlock()". > >> > >> Yes, I acquire the same spinlock twice in order to reproduce the problem. > > > > Good, I was wondering about that. ;-) > > > >>> If your .config has CONFIG_TREE_RCU=y, the above is quite likely to > >>> give you an RCU CPU stall warning. > >> > >> In my .config CONFIG_TREE_RCU=y. > > > > Which is consistent.
> > >> If softlockup_thresh < rcu_cpu_stall_timeout, it will give a soft lockup >> warning. >> If softlockup_thresh > rcu_cpu_stall_timeout, it will likely give an RCU >> CPU stall warning >> just like above and not give a soft lockup warning. >> >> It means that RCU CPU stall console spews lead to soft lockup being disabled. >> Is this reasonable ? > > It depends. You will often see both of them, but they can interfere > with each other, especially if all these messages are going across a > serial line. And sometimes the activity of the one will suppress the > other, though I would
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On 2015/1/19 19:09, Paul E. McKenney wrote:
> On Mon, Jan 19, 2015 at 05:04:29PM +0800, Zhang Zhen wrote:
>> On 2015/1/19 16:42, Paul E. McKenney wrote:
>>> On Mon, Jan 19, 2015 at 04:07:15PM +0800, Zhang Zhen wrote:

[ . . . ]

>> If softlockup_thresh < rcu_cpu_stall_timeout, it will give a soft lockup
>> warning.
>> If softlockup_thresh > rcu_cpu_stall_timeout, it will likely give an RCU
>> CPU stall warning just like above and not give a soft lockup warning.
>>
>> It means that RCU CPU stall console spews lead to soft lockup being
>> disabled. Is this reasonable ?
>
> It depends. You will often see both of them, but they can interfere
> with each other, especially if all these messages are going across a
> serial line. And sometimes the activity of the one will suppress the
> other, though I would not expect that in your spinlock deadlock case.

Ok, what I expect is to get both RCU stall messages _and_ softlockup
messages even though all these messages are going across a serial line.
But in my test case the RCU stall messages suppress the softlockup
messages. The simplest way is
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On 2015/1/19 22:06, Don Zickus wrote: > On Mon, Jan 19, 2015 at 05:04:29PM +0800, Zhang Zhen wrote: >>> >>> Did you really intend to acquire the same spinlock twice in a row, >>> forcing a self-deadlock? If not, I of course suggest changing the second >>> "spin_lock()" to "spin_unlock()". >>> >> >> Yes, I acquire the same spinlock twice in order to reproduce the problem. >> >>> If your .config has CONFIG_TREE_RCU=y, the above is quite likely to >>> give you an RCU CPU stall warning. >>> >> >> In my .config CONFIG_TREE_RCU=y. >> >> If softlockup_thresh < rcu_cpu_stall_timeout, it will give a soft lockup >> warning. >> If softlockup_thresh > rcu_cpu_stall_timeout, it will likely give an RCU CPU >> stall warning >> just like above and no soft lockup warning. >> >> It means that RCU CPU stall console spews leads to soft lockup disabled. >> Is this reasonable ? > > I believe so. In kernel v3.10.., all activity to the console executed > touch_nmi_watchdog() which calls touch_softlockup_watchdog, which delayed > the softlockup for another round of 'softlockup_thresh'. > Yeah, you are right. It's the real reason. > Of course back then, touch_nmi_watchdog touched all cpus. So a problem > like this was masked. I believe this upstream commit 62572e29bc53 solved > the problem. Thanks for your suggestion. Commit 62572e29bc53 changed the semantics of touch_nmi_watchdog and makes it only touch the local cpu, not every one. But watchdog_nmi_touch = true only guarantees no hard lockup check on this cpu. Commit 62572e29bc53 didn't change the semantics of touch_softlockup_watchdog. > > You can apply that commit and see if you get both RCU stall > messages _and_ softlockup messages. I believe that is what you were > expecting, correct? > Correct, I expect I can get both RCU stall messages _and_ softlockup messages. I applied that commit, and I only got RCU stall messages. 
/ # / # echo 60 > /proc/sys/kernel/watchdog_thresh / # busybox insmod softlockup_test.ko [ 35.344060] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=21002 jiffies, g=85, c=84, q=4) [ 35.344060] INFO: Stall ended before state dump start [ 98.349079] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=84007 jiffies, g=85, c=84, q=4) [ 98.349079] INFO: Stall ended before state dump start [ 161.354100] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=147012 jiffies, g=85, c=84, q=4) [ 161.354100] INFO: Stall ended before state dump start > > Of course, on a non-virt guest, your test case would normally trigger a > hardlockup warning first. And a later kernel version for the guest may > actually do that (not quite sure if the emulated PMU stuff is upstream or > not yet). Just to set your expectations correctly. > Yes, on a non-virt guest, my test case triggered a hardlockup warning first. Best regards! > Cheers, > Don > >> >> Thanks! >> return 0; } static int __init test_init(void) { hello_start(); printk(KERN_INFO "Module init\n"); return 0; } static void __exit test_exit(void) { printk(KERN_INFO "Module exit!\n"); } module_init(test_init) module_exit(test_exit) MODULE_LICENSE("GPL"); // My kernel version is v3.10.63, and I checked the kernel source code, update_process_times -> run_local_timers -> hrtimer_run_queues -> __run_hrtimer -> watchdog_timer_fn -> is_softlockup -> rcu_check_callbacks -> __rcu_pending -> check_cpu_stall -> print_cpu_stall If softlockup_thresh > rcu_cpu_stall_timeout, print_cpu_stall will print a log to the serial port. The 8250 serial driver will call serial8250_console_write => touch_nmi_watchdog() which resets watchdog_touch_ts to 0. So the softlockup will not be triggered. Is this reasonable? Why? >>> >>> Is exactly what reasonable? ;-) >>> >>> Yes, it is reasonable that your code triggers an RCU CPU stall warning. 
>>> >>> No, it is not reasonable that the RCU CPU stall warning does not include >>> a stack trace, and the fix for that bug will be going into the next merge >>> window. >>> >>> Yes, it is reasonable that varying the softlockup and RCU CPU stall >>> timeouts might change the behavior. >>> >>> No, your code is not reasonable, except perhaps as a test of the >>> generation of softlockup and RCU CPU stall warnings. If you are not >>> trying to test softlockup and RCU CPU stall warnings, you should of course >>> not try to acquire any non-recursive exclusive lock that you already hold. >>> If it is not reasonable, we should adjust the printk loglevel from *KERN_ERR* to *KERN_INFO* in print_cpu_stall.
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On Mon, Jan 19, 2015 at 05:04:29PM +0800, Zhang Zhen wrote: > > > > Did you really intend to acquire the same spinlock twice in a row, > > forcing a self-deadlock? If not, I of course suggest changing the second > > "spin_lock()" to "spin_unlock()". > > > > Yes, I acquire the same spinlock twice in order to reproduce the problem. > > > If your .config has CONFIG_TREE_RCU=y, the above is quite likely to > > give you an RCU CPU stall warning. > > > > In my .config CONFIG_TREE_RCU=y. > > If softlockup_thresh < rcu_cpu_stall_timeout, it will give a soft lockup > warning. > If softlockup_thresh > rcu_cpu_stall_timeout, it will likely give an RCU CPU > stall warning > just like above and no soft lockup warning. > > It means that RCU CPU stall console spews leads to soft lockup disabled. > Is this reasonable ? I believe so. In kernel v3.10.., all activity to the console executed touch_nmi_watchdog() which calls touch_softlockup_watchdog, which delayed the softlockup for another round of 'softlockup_thresh'. Of course back then, touch_nmi_watchdog touched all cpus. So a problem like this was masked. I believe this upstream commit 62572e29bc53 solved the problem. You can apply that commit and see if you get both RCU stall messages _and_ softlockup messages. I believe that is what you were expecting, correct? Of course, on a non-virt guest, your test case would normally trigger a hardlockup warning first. And a later kernel version for the guest may actually do that (not quite sure if the emulated PMU stuff is upstream or not yet). Just to set your expectations correctly. Cheers, Don > > Thanks! 
> > >> return 0; > >> } > >> > >> static int __init test_init(void) > >> { > >> hello_start(); > >> > >> printk(KERN_INFO "Module init\n"); > >> return 0; > >> } > >> > >> static void __exit test_exit(void) > >> { > >> printk(KERN_INFO "Module exit!\n"); > >> } > >> > >> module_init(test_init) > >> module_exit(test_exit) > >> MODULE_LICENSE("GPL"); > >> // > >> > >> My kernel version is v3.10.63, and I checked the kernel source code, > >> > >> update_process_times > >>-> run_local_timers > >>-> hrtimer_run_queues > >>-> __run_hrtimer > >>-> watchdog_timer_fn > >>-> is_softlockup > >> > >>-> rcu_check_callbacks > >>-> __rcu_pending > >>-> check_cpu_stall > >>-> print_cpu_stall > >> > >> If softlockup_thresh > rcu_cpu_stall_timeout, print_cpu_stall will print > >> a log to the serial port. > >> > >> The 8250 serial driver will call serial8250_console_write => > >> touch_nmi_watchdog() which resets > >> watchdog_touch_ts to 0. So the softlockup will not be triggered. > >> > >> Is this reasonable? Why? > > > > Is exactly what reasonable? ;-) > > > > Yes, it is reasonable that your code triggers an RCU CPU stall warning. > > > > No, it is not reasonable that the RCU CPU stall warning does not include > > a stack trace, and the fix for that bug will be going into the next merge > > window. > > > > Yes, it is reasonable that varying the softlockup and RCU CPU stall > > timeouts might change the behavior. > > > > No, your code is not reasonable, except perhaps as a test of the > > generation of softlockup and RCU CPU stall warnings. If you are not > > trying to test softlockup and RCU CPU stall warnings, you should of course > > not try to acquire any non-recursive exclusive lock that you already hold. > > > >> If it is not reasonable, we should adjust the printk loglevel from > >> *KERN_ERR* to *KERN_INFO* > >> in print_cpu_stall. 
> > > > Given that RCU CPU stall warnings are supposed to be pointing out errors > > elsewhere in the kernel, and in this case are pointing out errors elsewhere > > in the kernel, namely in your hello_start() function, it is reasonable > > that the RCU CPU stall warnings use the KERN_ERR loglevel. > > > > Or am I missing something here? > > > > Thanx, Paul
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On Mon, Jan 19, 2015 at 05:04:29PM +0800, Zhang Zhen wrote: > On 2015/1/19 16:42, Paul E. McKenney wrote: > > On Mon, Jan 19, 2015 at 04:07:15PM +0800, Zhang Zhen wrote: > >> Hi, > >> > >> On my x86_64 qemu virtual machine, RCU CPU stall console spews may > >> lead to soft lockup disabled. > >> > >> If softlockup_thresh > rcu_cpu_stall_timeout (softlockup_thresh = 2 * > >> watchdog_thresh): > >> > >> / # > >> / # busybox cat /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout > >> 21 > >> / # echo 60 > /proc/sys/kernel/watchdog_thresh > >> / # busybox insmod softlockup_test.ko > >> [ 44.959044] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > >> (detected by 0, t=21002 jiffies, g=85, c=84, q=4) > >> [ 44.959044] INFO: Stall ended before state dump start > >> [ 107.964045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > >> (detected by 0, t=84007 jiffies, g=85, c=84, q=4) > >> [ 107.964045] INFO: Stall ended before state dump start > >> [ 170.969060] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > >> (detected by 0, t=147012 jiffies, g=85, c=84, q=4) > >> [ 170.969060] INFO: Stall ended before state dump start > >> [ 233.974057] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > >> (detected by 0, t=210017 jiffies, g=85, c=84, q=4) > >> [ 233.974057] INFO: Stall ended before state dump start > >> [ 296.979059] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > >> (detected by 0, t=273022 jiffies, g=85, c=84, q=4) > >> [ 296.979059] INFO: Stall ended before state dump start > >> [ 359.984058] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > >> (detected by 0, t=336027 jiffies, g=85, c=84, q=4) > >> [ 359.984058] INFO: Stall ended before state dump start > >> [ 422.989059] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > >> (detected by 0, t=399032 jiffies, g=85, c=84, q=4) > >> [ 422.989059] INFO: Stall ended before state dump start > >> [ 485.994056] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > >> (detected by 0, t=462037 
jiffies, g=85, c=84, q=4) > >> [ 485.994056] INFO: Stall ended before state dump start > >> [ 548.999059] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > >> (detected by 0, t=525042 jiffies, g=85, c=84, q=4) > >> [ 548.999059] INFO: Stall ended before state dump start > >> [ 612.004061] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > >> (detected by 0, t=588047 jiffies, g=85, c=84, q=4) > >> [ 612.004061] INFO: Stall ended before state dump start > >> [ 675.009058] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > >> (detected by 0, t=651052 jiffies, g=85, c=84, q=4) > >> [ 675.009058] INFO: Stall ended before state dump start > >> > >> If softlockup_thresh < rcu_cpu_stall_timeout: > >> > >> / # > >> / # busybox cat /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout > >> 21 > >> / # echo 5 > /proc/sys/kernel/watchdog_thresh > >> / # busybox insmod softlockup_test.ko > >> [ 38.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53] > >> [ 52.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53] > >> [ 66.450073] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53] > >> [ 80.450060] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53] > >> [ 94.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53] > >> > >> The softlockup_test.ko source code is: > >> // > >> #include <linux/kernel.h> > >> #include <linux/module.h> > >> #include <linux/spinlock.h> > >> #include <linux/slab.h> > >> > >> static int hello_start(void) > >> { > >> DEFINE_SPINLOCK(hello_lock); > >> spin_lock_init(&hello_lock); > >> spin_lock(&hello_lock); > >> spin_lock(&hello_lock); > > > > Did you really intend to acquire the same spinlock twice in a row, > > forcing a self-deadlock? If not, I of course suggest changing the second > > "spin_lock()" to "spin_unlock()". > > Yes, I acquire the same spinlock twice in order to reproduce the problem. Good, I was wondering about that. ;-) > > If your .config has CONFIG_TREE_RCU=y, the above is quite likely to > > give you an RCU CPU stall warning. > > In my .config CONFIG_TREE_RCU=y. Which is consistent. 
> If softlockup_thresh < rcu_cpu_stall_timeout, it will give a soft lockup > warning. > If softlockup_thresh > rcu_cpu_stall_timeout, it will likely give an RCU CPU > stall warning > just like above and no soft lockup warning. > > It means that RCU CPU stall console spews leads to soft lockup disabled. > Is this reasonable ? It depends. You will often see both of them, but they can interfere with each other, especially if all these messages are going across a serial line. And sometimes the activity of the one will suppress the other, though I would not expect that in your spinlock deadlock case. Thanx, Paul > Thanks! > > >> return 0; > >> } > >> > >> static int __init test_init(void) > >> { > >> hello_start(); > >> > >> printk(KERN_INFO "Module init\n"); > >> return 0; > >> } > >> > >> static void
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On 2015/1/19 16:42, Paul E. McKenney wrote: > On Mon, Jan 19, 2015 at 04:07:15PM +0800, Zhang Zhen wrote: >> Hi, >> >> On my x86_64 qemu virtual machine, RCU CPU stall console spews may >> lead to soft lockup disabled. >> >> If softlockup_thresh > rcu_cpu_stall_timeout (softlockup_thresh = 2 * >> watchdog_thresh): >> >> / # >> / # busybox cat /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout >> 21 >> / # echo 60 > /proc/sys/kernel/watchdog_thresh >> / # busybox insmod softlockup_test.ko >> [ 44.959044] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected >> by 0, t=21002 jiffies, g=85, c=84, q=4) >> [ 44.959044] INFO: Stall ended before state dump start >> [ 107.964045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected >> by 0, t=84007 jiffies, g=85, c=84, q=4) >> [ 107.964045] INFO: Stall ended before state dump start >> [ 170.969060] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected >> by 0, t=147012 jiffies, g=85, c=84, q=4) >> [ 170.969060] INFO: Stall ended before state dump start >> [ 233.974057] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected >> by 0, t=210017 jiffies, g=85, c=84, q=4) >> [ 233.974057] INFO: Stall ended before state dump start >> [ 296.979059] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected >> by 0, t=273022 jiffies, g=85, c=84, q=4) >> [ 296.979059] INFO: Stall ended before state dump start >> [ 359.984058] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected >> by 0, t=336027 jiffies, g=85, c=84, q=4) >> [ 359.984058] INFO: Stall ended before state dump start >> [ 422.989059] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected >> by 0, t=399032 jiffies, g=85, c=84, q=4) >> [ 422.989059] INFO: Stall ended before state dump start >> [ 485.994056] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected >> by 0, t=462037 jiffies, g=85, c=84, q=4) >> [ 485.994056] INFO: Stall ended before state dump start >> [ 548.999059] INFO: rcu_preempt detected stalls 
on CPUs/tasks: {} (detected >> by 0, t=525042 jiffies, g=85, c=84, q=4) >> [ 548.999059] INFO: Stall ended before state dump start >> [ 612.004061] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected >> by 0, t=588047 jiffies, g=85, c=84, q=4) >> [ 612.004061] INFO: Stall ended before state dump start >> [ 675.009058] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected >> by 0, t=651052 jiffies, g=85, c=84, q=4) >> [ 675.009058] INFO: Stall ended before state dump start >> >> If softlockup_thresh < rcu_cpu_stall_timeout: >> >> / # >> / # busybox cat /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout >> 21 >> / # echo 5 > /proc/sys/kernel/watchdog_thresh >> / # busybox insmod softlockup_test.ko >> [ 38.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53] >> [ 52.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53] >> [ 66.450073] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53] >> [ 80.450060] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53] >> [ 94.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53] >> >> The softlockup_test.ko source code is: >> // >> #include <linux/kernel.h> >> #include <linux/module.h> >> #include <linux/spinlock.h> >> #include <linux/slab.h> >> >> static int hello_start(void) >> { >> DEFINE_SPINLOCK(hello_lock); >> spin_lock_init(&hello_lock); >> spin_lock(&hello_lock); >> spin_lock(&hello_lock); > > Did you really intend to acquire the same spinlock twice in a row, > forcing a self-deadlock? If not, I of course suggest changing the second > "spin_lock()" to "spin_unlock()". > Yes, I acquire the same spinlock twice in order to reproduce the problem. > If your .config has CONFIG_TREE_RCU=y, the above is quite likely to > give you an RCU CPU stall warning. > In my .config CONFIG_TREE_RCU=y. If softlockup_thresh < rcu_cpu_stall_timeout, it will give a soft lockup warning. If softlockup_thresh > rcu_cpu_stall_timeout, it will likely give an RCU CPU stall warning just like above and no soft lockup warning. 
It means that RCU CPU stall console spews leads to soft lockup disabled. Is this reasonable ? Thanks! >> return 0; >> } >> >> static int __init test_init(void) >> { >> hello_start(); >> >> printk(KERN_INFO "Module init\n"); >> return 0; >> } >> >> static void __exit test_exit(void) >> { >> printk(KERN_INFO "Module exit!\n"); >> } >> >> module_init(test_init) >> module_exit(test_exit) >> MODULE_LICENSE("GPL"); >> // >> >> My kernel version is v3.10.63, and I checked the kernel source code, >> >> update_process_times >> -> run_local_timers >> -> hrtimer_run_queues >> -> __run_hrtimer >> -> watchdog_timer_fn >> -> is_softlockup >> >> -> rcu_check_callbacks >> -> __rcu_pending >> -> check_cpu_stall >> -> print_cpu_stall
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On Mon, Jan 19, 2015 at 04:07:15PM +0800, Zhang Zhen wrote: > Hi, > > On my x86_64 qemu virtual machine, RCU CPU stall console spews may > lead to soft lockup disabled. > > If softlockup_thresh > rcu_cpu_stall_timeout (softlockup_thresh = 2 * > watchdog_thresh): > > / # > / # busybox cat /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout > 21 > / # echo 60 > /proc/sys/kernel/watchdog_thresh > / # busybox insmod softlockup_test.ko > [ 44.959044] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected > by 0, t=21002 jiffies, g=85, c=84, q=4) > [ 44.959044] INFO: Stall ended before state dump start > [ 107.964045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected > by 0, t=84007 jiffies, g=85, c=84, q=4) > [ 107.964045] INFO: Stall ended before state dump start > [ 170.969060] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected > by 0, t=147012 jiffies, g=85, c=84, q=4) > [ 170.969060] INFO: Stall ended before state dump start > [ 233.974057] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected > by 0, t=210017 jiffies, g=85, c=84, q=4) > [ 233.974057] INFO: Stall ended before state dump start > [ 296.979059] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected > by 0, t=273022 jiffies, g=85, c=84, q=4) > [ 296.979059] INFO: Stall ended before state dump start > [ 359.984058] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected > by 0, t=336027 jiffies, g=85, c=84, q=4) > [ 359.984058] INFO: Stall ended before state dump start > [ 422.989059] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected > by 0, t=399032 jiffies, g=85, c=84, q=4) > [ 422.989059] INFO: Stall ended before state dump start > [ 485.994056] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected > by 0, t=462037 jiffies, g=85, c=84, q=4) > [ 485.994056] INFO: Stall ended before state dump start > [ 548.999059] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected > by 0, t=525042 jiffies, g=85, c=84, q=4) > [ 
548.999059] INFO: Stall ended before state dump start > [ 612.004061] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected > by 0, t=588047 jiffies, g=85, c=84, q=4) > [ 612.004061] INFO: Stall ended before state dump start > [ 675.009058] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected > by 0, t=651052 jiffies, g=85, c=84, q=4) > [ 675.009058] INFO: Stall ended before state dump start > > If softlockup_thresh < rcu_cpu_stall_timeout: > > / # > / # busybox cat /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout > 21 > / # echo 5 > /proc/sys/kernel/watchdog_thresh > / # busybox insmod softlockup_test.ko > [ 38.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53] > [ 52.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53] > [ 66.450073] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53] > [ 80.450060] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53] > [ 94.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53] > > The softlockup_test.ko source code is: > // > #include <linux/kernel.h> > #include <linux/module.h> > #include <linux/spinlock.h> > #include <linux/slab.h> > > static int hello_start(void) > { > DEFINE_SPINLOCK(hello_lock); > spin_lock_init(&hello_lock); > spin_lock(&hello_lock); > spin_lock(&hello_lock); Did you really intend to acquire the same spinlock twice in a row, forcing a self-deadlock? If not, I of course suggest changing the second "spin_lock()" to "spin_unlock()". If your .config has CONFIG_TREE_RCU=y, the above is quite likely to give you an RCU CPU stall warning. 
> return 0; > } > > static int __init test_init(void) > { > hello_start(); > > printk(KERN_INFO "Module init\n"); > return 0; > } > > static void __exit test_exit(void) > { > printk(KERN_INFO "Module exit!\n"); > } > > module_init(test_init) > module_exit(test_exit) > MODULE_LICENSE("GPL"); > // > > My kernel version is v3.10.63, and I checked the kernel source code, > > update_process_times > -> run_local_timers > -> hrtimer_run_queues > -> __run_hrtimer > -> watchdog_timer_fn > -> is_softlockup > > -> rcu_check_callbacks > -> __rcu_pending > -> check_cpu_stall > -> print_cpu_stall > > If softlockup_thresh > rcu_cpu_stall_timeout, print_cpu_stall will print a log > to the serial port. > > The 8250 serial driver will call serial8250_console_write => > touch_nmi_watchdog() which resets > watchdog_touch_ts to 0. So the softlockup will not be triggered. > > Is this reasonable? Why? Is exactly what reasonable? ;-) Yes, it is reasonable that your code triggers an RCU CPU stall warning. No, it is not reasonable that the RCU CPU stall warning does not include a stack trace, and the fix for that bug will be going into the next merge window. Yes, it is reasonable that varying the softlockup and RCU CPU stall timeouts might change the behavior.
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On Tue, Jan 20, 2015 at 11:17:01AM +0800, Zhang Zhen wrote:
On 2015/1/19 19:09, Paul E. McKenney wrote:
On Mon, Jan 19, 2015 at 05:04:29PM +0800, Zhang Zhen wrote:
On 2015/1/19 16:42, Paul E. McKenney wrote:
On Mon, Jan 19, 2015 at 04:07:15PM +0800, Zhang Zhen wrote:

Hi,

On my x86_64 qemu virtual machine, RCU CPU stall console spews may lead to soft lockup being disabled.

If softlockup_thresh > rcu_cpu_stall_timeout (softlockup_thresh = 2 * watchdog_thresh):

/ #
/ # busybox cat /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout
21
/ # echo 60 > /proc/sys/kernel/watchdog_thresh
/ # busybox insmod softlockup_test.ko
[   44.959044] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=21002 jiffies, g=85, c=84, q=4)
[   44.959044] INFO: Stall ended before state dump start
[  107.964045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=84007 jiffies, g=85, c=84, q=4)
[  107.964045] INFO: Stall ended before state dump start
[  170.969060] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=147012 jiffies, g=85, c=84, q=4)
[  170.969060] INFO: Stall ended before state dump start
[  233.974057] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=210017 jiffies, g=85, c=84, q=4)
[  233.974057] INFO: Stall ended before state dump start
[  296.979059] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=273022 jiffies, g=85, c=84, q=4)
[  296.979059] INFO: Stall ended before state dump start
[  359.984058] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=336027 jiffies, g=85, c=84, q=4)
[  359.984058] INFO: Stall ended before state dump start
[  422.989059] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=399032 jiffies, g=85, c=84, q=4)
[  422.989059] INFO: Stall ended before state dump start
[  485.994056] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=462037 jiffies, g=85, c=84, q=4)
[  485.994056] INFO: Stall ended before state dump start
[  548.999059] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=525042 jiffies, g=85, c=84, q=4)
[  548.999059] INFO: Stall ended before state dump start
[  612.004061] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=588047 jiffies, g=85, c=84, q=4)
[  612.004061] INFO: Stall ended before state dump start
[  675.009058] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=651052 jiffies, g=85, c=84, q=4)
[  675.009058] INFO: Stall ended before state dump start

If softlockup_thresh < rcu_cpu_stall_timeout:

/ #
/ # busybox cat /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout
21
/ # echo 5 > /proc/sys/kernel/watchdog_thresh
/ # busybox insmod softlockup_test.ko
[   38.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53]
[   52.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53]
[   66.450073] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53]
[   80.450060] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53]
[   94.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53]

The softlockup_test.ko source code is:

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/spinlock.h>
#include <linux/slab.h>

static int hello_start(void)
{
        DEFINE_SPINLOCK(hello_lock);
        spin_lock_init(&hello_lock);
        spin_lock(&hello_lock);
        spin_lock(&hello_lock);

Did you really intend to acquire the same spinlock twice in a row, forcing a self-deadlock? If not, I of course suggest changing the second spin_lock() to spin_unlock().

Yes, I acquired the same spinlock twice in order to reproduce the problem.

Good, I was wondering about that. ;-)

If your .config has CONFIG_TREE_RCU=y, the above is quite likely to give you an RCU CPU stall warning.

In my .config CONFIG_TREE_RCU=y.

Which is consistent.

If softlockup_thresh < rcu_cpu_stall_timeout, it will give a soft lockup warning. If softlockup_thresh > rcu_cpu_stall_timeout, it will likely give an RCU CPU stall warning just like the above and not give a soft lockup warning.
It means that RCU CPU stall console spews lead to soft lockup being disabled. Is this reasonable?

It depends. You will often see both of them, but they can interfere with each other, especially if all these messages are going across a serial line. And sometimes the activity of the one will suppress the other, though I would not expect that in your spinlock deadlock case.

Ok, my expectation is to get both RCU stall messages _and_ softlockup messages even though all these messages are going across a serial line. But in my test case the RCU stall messages suppress the other.

On some kernel versions, slow serial-line consoles can cause problems. For example, on a 9600-baud line, only 21K characters output at one go
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On Mon, Jan 19, 2015 at 04:07:15PM +0800, Zhang Zhen wrote:

Hi,

On my x86_64 qemu virtual machine, RCU CPU stall console spews may lead to soft lockup being disabled.

If softlockup_thresh > rcu_cpu_stall_timeout (softlockup_thresh = 2 * watchdog_thresh):

/ #
/ # busybox cat /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout
21
/ # echo 60 > /proc/sys/kernel/watchdog_thresh
/ # busybox insmod softlockup_test.ko
[... same series of "INFO: rcu_preempt detected stalls" warnings as quoted earlier in the thread, from t=21002 through t=651052 jiffies ...]

If softlockup_thresh < rcu_cpu_stall_timeout:

/ #
/ # busybox cat /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout
21
/ # echo 5 > /proc/sys/kernel/watchdog_thresh
/ # busybox insmod softlockup_test.ko
[   38.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53]
[   52.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53]
[   66.450073] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53]
[   80.450060] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53]
[   94.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53]

The softlockup_test.ko source code is:

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/spinlock.h>
#include <linux/slab.h>

static int hello_start(void)
{
        DEFINE_SPINLOCK(hello_lock);
        spin_lock_init(&hello_lock);
        spin_lock(&hello_lock);
        spin_lock(&hello_lock);

Did you really intend to acquire the same spinlock twice in a row, forcing a self-deadlock? If not, I of course suggest changing the second spin_lock() to spin_unlock(). If your .config has CONFIG_TREE_RCU=y, the above is quite likely to give you an RCU CPU stall warning.

        return 0;
}

static int __init test_init(void)
{
        hello_start();
        printk(KERN_INFO "Module init\n");
        return 0;
}

static void __exit test_exit(void)
{
        printk(KERN_INFO "Module exit!\n");
}

module_init(test_init);
module_exit(test_exit);
MODULE_LICENSE("GPL");

My kernel version is v3.10.63, and I checked the kernel source code:

update_process_times -> run_local_timers -> hrtimer_run_queues -> __run_hrtimer -> watchdog_timer_fn -> is_softlockup
rcu_check_callbacks -> __rcu_pending -> check_cpu_stall -> print_cpu_stall

If softlockup_thresh > rcu_cpu_stall_timeout, print_cpu_stall will print its log to the serial port.
The 8250 serial driver will call serial8250_console_write -> touch_nmi_watchdog(), which resets watchdog_touch_ts to 0, so the softlockup will not be triggered. Is this reasonable? Why?

Is exactly what reasonable? ;-)

Yes, it is reasonable that your code triggers an RCU CPU stall warning. No, it is not reasonable that the RCU CPU stall warning does not include a stack trace, and the fix for that bug will be going into the next merge window. Yes, it is reasonable that varying the softlockup and RCU CPU stall timeouts might
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On Mon, Jan 19, 2015 at 05:04:29PM +0800, Zhang Zhen wrote:

On 2015/1/19 16:42, Paul E. McKenney wrote:

On Mon, Jan 19, 2015 at 04:07:15PM +0800, Zhang Zhen wrote:

Hi,

On my x86_64 qemu virtual machine, RCU CPU stall console spews may lead to soft lockup being disabled.

If softlockup_thresh > rcu_cpu_stall_timeout (softlockup_thresh = 2 * watchdog_thresh):

/ #
/ # busybox cat /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout
21
/ # echo 60 > /proc/sys/kernel/watchdog_thresh
/ # busybox insmod softlockup_test.ko
[... same series of "INFO: rcu_preempt detected stalls" warnings as quoted earlier in the thread, from t=21002 through t=651052 jiffies ...]

If softlockup_thresh < rcu_cpu_stall_timeout:

/ #
/ # busybox cat /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout
21
/ # echo 5 > /proc/sys/kernel/watchdog_thresh
/ # busybox insmod softlockup_test.ko
[   38.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53]
[   52.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53]
[   66.450073] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53]
[   80.450060] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53]
[   94.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53]

The softlockup_test.ko source code is:

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/spinlock.h>
#include <linux/slab.h>

static int hello_start(void)
{
        DEFINE_SPINLOCK(hello_lock);
        spin_lock_init(&hello_lock);
        spin_lock(&hello_lock);
        spin_lock(&hello_lock);

Did you really intend to acquire the same spinlock twice in a row, forcing a self-deadlock? If not, I of course suggest changing the second spin_lock() to spin_unlock().

Yes, I acquired the same spinlock twice in order to reproduce the problem.

Good, I was wondering about that. ;-)

If your .config has CONFIG_TREE_RCU=y, the above is quite likely to give you an RCU CPU stall warning.

In my .config CONFIG_TREE_RCU=y.

Which is consistent.

If softlockup_thresh < rcu_cpu_stall_timeout, it will give a soft lockup warning. If softlockup_thresh > rcu_cpu_stall_timeout, it will likely give an RCU CPU stall warning just like the above and not give a soft lockup warning. It means that RCU CPU stall console spews lead to soft lockup being disabled. Is this reasonable?

It depends.
You will often see both of them, but they can interfere with each other, especially if all these messages are going across a serial line. And sometimes the activity of the one will suppress the other, though I would not expect that in your spinlock deadlock case.

Thanx, Paul

Thanks!

        return 0;
}

static int __init test_init(void)
{
        hello_start();
        printk(KERN_INFO "Module init\n");
        return 0;
}

static void __exit test_exit(void)
{
        printk(KERN_INFO "Module exit!\n");
}

module_init(test_init);
module_exit(test_exit);
MODULE_LICENSE("GPL");

My kernel version is v3.10.63, and I checked the kernel source code,
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On Mon, Jan 19, 2015 at 05:04:29PM +0800, Zhang Zhen wrote:

Did you really intend to acquire the same spinlock twice in a row, forcing a self-deadlock? If not, I of course suggest changing the second spin_lock() to spin_unlock().

Yes, I acquired the same spinlock twice in order to reproduce the problem.

If your .config has CONFIG_TREE_RCU=y, the above is quite likely to give you an RCU CPU stall warning.

In my .config CONFIG_TREE_RCU=y.

If softlockup_thresh < rcu_cpu_stall_timeout, it will give a soft lockup warning. If softlockup_thresh > rcu_cpu_stall_timeout, it will likely give an RCU CPU stall warning just like the above and not give a soft lockup warning. It means that RCU CPU stall console spews lead to soft lockup being disabled. Is this reasonable?

I believe so. In kernel v3.10, all activity to the console executed touch_nmi_watchdog(), which calls touch_softlockup_watchdog(), which delayed the softlockup for another round of 'softlockup_thresh'. Of course back then, touch_nmi_watchdog touched all cpus, so a problem like this was masked. I believe this upstream commit, 62572e29bc53, solved the problem. You can apply that commit and see if you get both RCU stall messages _and_ softlockup messages. I believe that is what you were expecting, correct?

Of course, on a non-virt guest, your test case would normally trigger a hardlockup warning first. And a later kernel version for the guest may actually do that (not quite sure if the emulated PMU stuff is upstream or not yet). Just to set your expectations correctly.

Cheers,
Don

Thanks!
        return 0;
}

static int __init test_init(void)
{
        hello_start();
        printk(KERN_INFO "Module init\n");
        return 0;
}

static void __exit test_exit(void)
{
        printk(KERN_INFO "Module exit!\n");
}

module_init(test_init);
module_exit(test_exit);
MODULE_LICENSE("GPL");

My kernel version is v3.10.63, and I checked the kernel source code:

update_process_times -> run_local_timers -> hrtimer_run_queues -> __run_hrtimer -> watchdog_timer_fn -> is_softlockup
rcu_check_callbacks -> __rcu_pending -> check_cpu_stall -> print_cpu_stall

If softlockup_thresh > rcu_cpu_stall_timeout, print_cpu_stall will print its log to the serial port. The 8250 serial driver will call serial8250_console_write -> touch_nmi_watchdog(), which resets watchdog_touch_ts to 0, so the softlockup will not be triggered. Is this reasonable? Why?

Is exactly what reasonable? ;-)

Yes, it is reasonable that your code triggers an RCU CPU stall warning.

No, it is not reasonable that the RCU CPU stall warning does not include a stack trace, and the fix for that bug will be going into the next merge window.

Yes, it is reasonable that varying the softlockup and RCU CPU stall timeouts might change the behavior.

No, your code is not reasonable, except perhaps as a test of the generation of softlockup and RCU CPU stall warnings. If you are not trying to test softlockup and RCU CPU stall warnings, you should of course not try to acquire any non-recursive exclusive lock that you already hold.

If it is not reasonable, we should adjust the printk loglevel from *KERN_ERR* to *KERN_INFO* in print_cpu_stall.

Given that RCU CPU stall warnings are supposed to be pointing out errors elsewhere in the kernel, and in this case are pointing out errors elsewhere in the kernel, namely in your hello_start() function, it is reasonable that the RCU CPU stall warnings use the KERN_ERR loglevel. Or am I missing something here?

Thanx, Paul
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On 2015/1/19 16:42, Paul E. McKenney wrote:

On Mon, Jan 19, 2015 at 04:07:15PM +0800, Zhang Zhen wrote:

Hi,

On my x86_64 qemu virtual machine, RCU CPU stall console spews may lead to soft lockup being disabled.

If softlockup_thresh > rcu_cpu_stall_timeout (softlockup_thresh = 2 * watchdog_thresh):

/ #
/ # busybox cat /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout
21
/ # echo 60 > /proc/sys/kernel/watchdog_thresh
/ # busybox insmod softlockup_test.ko
[... same series of "INFO: rcu_preempt detected stalls" warnings as quoted earlier in the thread, from t=21002 through t=651052 jiffies ...]

If softlockup_thresh < rcu_cpu_stall_timeout:

/ #
/ # busybox cat /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout
21
/ # echo 5 > /proc/sys/kernel/watchdog_thresh
/ # busybox insmod softlockup_test.ko
[   38.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53]
[   52.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53]
[   66.450073] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53]
[   80.450060] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53]
[   94.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53]

The softlockup_test.ko source code is:

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/spinlock.h>
#include <linux/slab.h>

static int hello_start(void)
{
        DEFINE_SPINLOCK(hello_lock);
        spin_lock_init(&hello_lock);
        spin_lock(&hello_lock);
        spin_lock(&hello_lock);

Did you really intend to acquire the same spinlock twice in a row, forcing a self-deadlock? If not, I of course suggest changing the second spin_lock() to spin_unlock().

Yes, I acquired the same spinlock twice in order to reproduce the problem.

If your .config has CONFIG_TREE_RCU=y, the above is quite likely to give you an RCU CPU stall warning.

In my .config CONFIG_TREE_RCU=y.

If softlockup_thresh < rcu_cpu_stall_timeout, it will give a soft lockup warning. If softlockup_thresh > rcu_cpu_stall_timeout, it will likely give an RCU CPU stall warning just like the above and not give a soft lockup warning. It means that RCU CPU stall console spews lead to soft lockup being disabled. Is this reasonable?

Thanks!
        return 0;
}

static int __init test_init(void)
{
        hello_start();
        printk(KERN_INFO "Module init\n");
        return 0;
}

static void __exit test_exit(void)
{
        printk(KERN_INFO "Module exit!\n");
}

module_init(test_init);
module_exit(test_exit);
MODULE_LICENSE("GPL");

My kernel version is v3.10.63, and I checked the kernel source code:

update_process_times -> run_local_timers -> hrtimer_run_queues -> __run_hrtimer -> watchdog_timer_fn -> is_softlockup
rcu_check_callbacks -> __rcu_pending -> check_cpu_stall -> print_cpu_stall

If softlockup_thresh > rcu_cpu_stall_timeout, print_cpu_stall will print its log to the serial port. The 8250 serial driver will call serial8250_console_write ->