Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On Thu, Jan 22, 2015 at 11:08:11AM +0800, Zhang Zhen wrote: > On 2015/1/22 4:06, Paul E. McKenney wrote: > > On Wed, Jan 21, 2015 at 10:10:51AM -0500, Don Zickus wrote: > >> On Wed, Jan 21, 2015 at 10:26:27AM +0800, Zhang Zhen wrote: > This may not cause other problems but what happens if you comment out the > 'touch_softlockup_watchdog' from the touch_nmi_watchdog function like > below (based on latest upstream cb59670870)? > > The idea is that console printing for that cpu won't reset the softlockup > detector. Again other bad things might happen and this patch may not be > a > good final solution, but it can help give me a clue about what is going > on. > >>> > >>> I commented out the 'touch_softlockup_watchdog' from the > >>> touch_nmi_watchdog function > >>> (based on latest upstream ec6f34e5b552). > >>> This triggered RCU stall and softlockup, but softlockup just printk only > >>> once. > >> > >> Yes, as expected. Instead of flooding the console with the same message, > >> report only once until the high priority task is re-kicked and can re-arm > >> the softlockup timer again. So this makes sense. Based on your double > >> spin lock test, I don't ever expect this to get re-armed, so you should > >> not expect to see another message. > >> > >> > >>> As you mentioned "other bad things" lead to softlockup just printk only > >>> once. > >>> What's the bad thing ? > >> > >> Well, one of the assumptions here is that if you are going to have > >> interrupts disabled for a long time and purposely kick the hardlockup > >> detector (with touch_nmi_watchdog), then you can reasonably assume that > >> softlockups will go off too if you don't also kick the softlockup detector > >> (even though there is a window where this isn't true: 2*thresh - thresh). > >> > >> > >> So the patch I gave you, can lead to false softlockup warnings. > >> > >> > >> On the flip side, printing to the console also blocks hardlockups from > >> showing up. 
But I believe lng ago, Paul adjusted the rcu stalls to be > >> longer than a hardlockup/softlockup timeout to prevent such scenarios. > >> > >> > >> > >> I am not sure what to do here. Printing to the console has traditionally > >> been slow (especially if doing so with interrupts disabled), so a > >> touch_nmi_watchdog has been necessary. But a side effect of that, is it > >> prevents the lockup detectors from going off if printing repeatedly. > >> > >> > >> Now we can hack up rcu stall to only print once until it is re-armed. > >> This would show rcu stall printing first, followed two minutes later by > >> a softlockup, which sorta achieves what you want. > >> > >> > >> > >> However, at the end of the day, an rcu stall, softlockup or hardlockup is > >> a bad thing, does it matter if you get one or all of them? One should be > >> enough to start an investigation, no? > > > > Hear, hear!!! > > > Ok, thanks for Don's and your patience! > We enriched softlockup printk information, so we want to get softlockup > warnning. Booting with "rcupdate.rcu_cpu_stall_suppress=1" will shut up the RCU CPU stall warnings. Thanx, Paul > >> I don't know. I am leaning towards 'working as expected' and nothing to > >> really fix here. Does anyone have any suggestions on what assumptions we > >> could change to handle this better? > >> > >> > >> Cheers, > >> Don > >> > >>> > >>> / # > >>> / # echo 60 > /proc/sys/kernel/watchdog_thresh > >>> / # busybox insmod softlockup_test.ko & > >>> / # [ 39.044058] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > >>> (detected by 0, t=21002 jiffies, g=78, c=77, q=4) > >>> [ 39.044058] INFO: Stall ended before state dump start > >>> [ 102.049045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > >>> (detected by 0, t=84007 jiffies, g=78, c=77, q=4) > >>> [ 102.049045] INFO: Stall ended before state dump start > >>> [ 160.482123] BUG: soft lockup - CPU#0 stuck for 134s! 
[busybox:54] > >>> [ 165.054075] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > >>> (detected by 0, t=147012 jiffies, g=78, c=77, q=4) > >>> [ 165.054075] INFO: Stall ended before state dump start > >>> [ 228.059045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > >>> (detected by 0, t=210017 jiffies, g=78, c=77, q=4) > >>> [ 228.059045] INFO: Stall ended before state dump start > >>> [ 291.064099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > >>> (detected by 0, t=273022 jiffies, g=78, c=77, q=4) > >>> [ 291.064099] INFO: Stall ended before state dump start > >>> [ 354.069074] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > >>> (detected by 0, t=336027 jiffies, g=78, c=77, q=4) > >>> [ 354.069099] INFO: Stall ended before state dump start > >>> [ 417.074045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > >>> (detected by 0, t=399032 jiffies, g=78, c=77, q=4) > >>> [ 417.074045] INFO: Stall ended before state dump
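For reference, the suppression knob Paul mentions can be set at boot or, on kernels where rcupdate exposes it as a writable module parameter, flipped at runtime; the paths below are the usual ones but worth verifying on your build:

```
# on the kernel command line:
rcupdate.rcu_cpu_stall_suppress=1

# or at runtime, if the parameter is exposed:
echo 1 > /sys/module/rcupdate/parameters/rcu_cpu_stall_suppress
```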
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On 2015/1/22 4:06, Paul E. McKenney wrote: > On Wed, Jan 21, 2015 at 10:10:51AM -0500, Don Zickus wrote: >> On Wed, Jan 21, 2015 at 10:26:27AM +0800, Zhang Zhen wrote: This may not cause other problems but what happens if you comment out the 'touch_softlockup_watchdog' from the touch_nmi_watchdog function like below (based on latest upstream cb59670870)? The idea is that console printing for that cpu won't reset the softlockup detector. Again other bad things might happen and this patch may not be a good final solution, but it can help give me a clue about what is going on. >>> >>> I commented out the 'touch_softlockup_watchdog' from the touch_nmi_watchdog >>> function >>> (based on latest upstream ec6f34e5b552). >>> This triggered RCU stall and softlockup, but softlockup just printk only >>> once. >> >> Yes, as expected. Instead of flooding the console with the same message, >> report only once until the high priority task is re-kicked and can re-arm >> the softlockup timer again. So this makes sense. Based on your double >> spin lock test, I don't ever expect this to get re-armed, so you should >> not expect to see another message. >> >> >>> As you mentioned "other bad things" lead to softlockup just printk only >>> once. >>> What's the bad thing ? >> >> Well, one of the assumptions here is that if you are going to have >> interrupts disabled for a long time and purposely kick the hardlockup >> detector (with touch_nmi_watchdog), then you can reasonably assume that >> softlockups will go off too if you don't also kick the softlockup detector >> (even though there is a window where this isn't true: 2*thresh - thresh). >> >> >> So the patch I gave you, can lead to false softlockup warnings. >> >> >> On the flip side, printing to the console also blocks hardlockups from >> showing up. But I believe lng ago, Paul adjusted the rcu stalls to be >> longer than a hardlockup/softlockup timeout to prevent such scenarios. >> >> >> >> I am not sure what to do here. 
Printing to the console has traditionally >> been slow (especially if doing so with interrupts disabled), so a >> touch_nmi_watchdog has been necessary. But a side effect of that is that it >> prevents the lockup detectors from going off if printing repeatedly. >> >> >> Now we can hack up rcu stall to only print once until it is re-armed. >> This would show rcu stall printing first, followed two minutes later by >> a softlockup, which sorta achieves what you want. >> >> >> >> However, at the end of the day, an rcu stall, softlockup or hardlockup is >> a bad thing, does it matter if you get one or all of them? One should be >> enough to start an investigation, no? > > Hear, hear!!! > Ok, thanks to Don and to you for your patience! We enriched the softlockup printk information, so we want to get the softlockup warning. > Thanx, Paul > >> I don't know. I am leaning towards 'working as expected' and nothing to >> really fix here. Does anyone have any suggestions on what assumptions we >> could change to handle this better? >> >> >> Cheers, >> Don >> >>> >>> / # >>> / # echo 60 > /proc/sys/kernel/watchdog_thresh >>> / # busybox insmod softlockup_test.ko & >>> / # [ 39.044058] INFO: rcu_preempt detected stalls on CPUs/tasks: {} >>> (detected by 0, t=21002 jiffies, g=78, c=77, q=4) >>> [ 39.044058] INFO: Stall ended before state dump start >>> [ 102.049045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} >>> (detected by 0, t=84007 jiffies, g=78, c=77, q=4) >>> [ 102.049045] INFO: Stall ended before state dump start >>> [ 160.482123] BUG: soft lockup - CPU#0 stuck for 134s! 
[busybox:54] >>> [ 165.054075] INFO: rcu_preempt detected stalls on CPUs/tasks: {} >>> (detected by 0, t=147012 jiffies, g=78, c=77, q=4) >>> [ 165.054075] INFO: Stall ended before state dump start >>> [ 228.059045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} >>> (detected by 0, t=210017 jiffies, g=78, c=77, q=4) >>> [ 228.059045] INFO: Stall ended before state dump start >>> [ 291.064099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} >>> (detected by 0, t=273022 jiffies, g=78, c=77, q=4) >>> [ 291.064099] INFO: Stall ended before state dump start >>> [ 354.069074] INFO: rcu_preempt detected stalls on CPUs/tasks: {} >>> (detected by 0, t=336027 jiffies, g=78, c=77, q=4) >>> [ 354.069099] INFO: Stall ended before state dump start >>> [ 417.074045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} >>> (detected by 0, t=399032 jiffies, g=78, c=77, q=4) >>> [ 417.074045] INFO: Stall ended before state dump start >>> [ 480.079099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} >>> (detected by 0, t=462037 jiffies, g=78, c=77, q=4) >>> [ 480.079099] INFO: Stall ended before state dump start >>> [ 543.084099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} >>> (detected by 0, t=525042 jiffies, g=78, c=77, q=4) >>> [ 543.084099] INFO: Stall ended
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On Wed, Jan 21, 2015 at 10:10:51AM -0500, Don Zickus wrote: > On Wed, Jan 21, 2015 at 10:26:27AM +0800, Zhang Zhen wrote: > > > This may not cause other problems but what happens if you comment out the > > > 'touch_softlockup_watchdog' from the touch_nmi_watchdog function like > > > below (based on latest upstream cb59670870)? > > > > > > The idea is that console printing for that cpu won't reset the softlockup > > > detector. Again other bad things might happen and this patch may not be a > > > good final solution, but it can help give me a clue about what is going > > > on. > > > > I commented out the 'touch_softlockup_watchdog' from the touch_nmi_watchdog > > function > > (based on latest upstream ec6f34e5b552). > > This triggered RCU stall and softlockup, but softlockup just printk only > > once. > > Yes, as expected. Instead of flooding the console with the same message, > report only once until the high priority task is re-kicked and can re-arm > the softlockup timer again. So this makes sense. Based on your double > spin lock test, I don't ever expect this to get re-armed, so you should > not expect to see another message. > > > > As you mentioned "other bad things" lead to softlockup just printk only > > once. > > What's the bad thing ? > > Well, one of the assumptions here is that if you are going to have > interrupts disabled for a long time and purposely kick the hardlockup > detector (with touch_nmi_watchdog), then you can reasonably assume that > softlockups will go off too if you don't also kick the softlockup detector > (even though there is a window where this isn't true: 2*thresh - thresh). > > > So the patch I gave you, can lead to false softlockup warnings. > > > On the flip side, printing to the console also blocks hardlockups from > showing up. But I believe lng ago, Paul adjusted the rcu stalls to be > longer than a hardlockup/softlockup timeout to prevent such scenarios. > > > > I am not sure what to do here. 
Printing to the console has traditionally > been slow (especially if doing so with interrupts disabled), so a > touch_nmi_watchdog has been necessary. But a side effect of that, is it > prevents the lockup detectors from going off if printing repeatedly. > > > Now we can hack up rcu stall to only print once until it is re-armed. > This would show rcu stall printing first, followed two minutes later by > a softlockup, which sorta achieves what you want. > > > > However, at the end of the day, an rcu stall, softlockup or hardlockup is > a bad thing, does it matter if you get one or all of them? One should be > enough to start an investigation, no? Hear, hear!!! Thanx, Paul > I don't know. I am leaning towards 'working as expected' and nothing to > really fix here. Does anyone have any suggestions on what assumptions we > could change to handle this better? > > > Cheers, > Don > > > > > / # > > / # echo 60 > /proc/sys/kernel/watchdog_thresh > > / # busybox insmod softlockup_test.ko & > > / # [ 39.044058] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > > (detected by 0, t=21002 jiffies, g=78, c=77, q=4) > > [ 39.044058] INFO: Stall ended before state dump start > > [ 102.049045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > > (detected by 0, t=84007 jiffies, g=78, c=77, q=4) > > [ 102.049045] INFO: Stall ended before state dump start > > [ 160.482123] BUG: soft lockup - CPU#0 stuck for 134s! 
[busybox:54] > > [ 165.054075] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > > (detected by 0, t=147012 jiffies, g=78, c=77, q=4) > > [ 165.054075] INFO: Stall ended before state dump start > > [ 228.059045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > > (detected by 0, t=210017 jiffies, g=78, c=77, q=4) > > [ 228.059045] INFO: Stall ended before state dump start > > [ 291.064099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > > (detected by 0, t=273022 jiffies, g=78, c=77, q=4) > > [ 291.064099] INFO: Stall ended before state dump start > > [ 354.069074] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > > (detected by 0, t=336027 jiffies, g=78, c=77, q=4) > > [ 354.069099] INFO: Stall ended before state dump start > > [ 417.074045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > > (detected by 0, t=399032 jiffies, g=78, c=77, q=4) > > [ 417.074045] INFO: Stall ended before state dump start > > [ 480.079099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > > (detected by 0, t=462037 jiffies, g=78, c=77, q=4) > > [ 480.079099] INFO: Stall ended before state dump start > > [ 543.084099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > > (detected by 0, t=525042 jiffies, g=78, c=77, q=4) > > [ 543.084099] INFO: Stall ended before state dump start > > [ 606.089101] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > > (detected by 0, t=588047 jiffies, g=78, c=77, q=4) > > [ 606.089101] INFO: Stall ended before
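The test module itself is never shown in this chunk; from Don's "double spin lock test" description and the _raw_spin_lock frame in the later traces, it is presumably something like the following self-deadlocking init. This is a hypothetical reconstruction, not the actual softlockup_test.ko:

```c
/* Hypothetical reconstruction of softlockup_test.ko: taking the same
 * spinlock twice self-deadlocks with preemption disabled, keeping the
 * CPU spinning forever -- which stalls RCU grace periods and arms the
 * softlockup detector. */
#include <linux/module.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(test_lock);

static int __init test_init(void)
{
	spin_lock(&test_lock);
	spin_lock(&test_lock);	/* never returns: spins on the held lock */
	return 0;
}
module_init(test_init);

MODULE_LICENSE("GPL");
```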
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On Wed, Jan 21, 2015 at 07:11:51PM +0800, Zhang Zhen wrote: > On 2015/1/21 18:16, Paul E. McKenney wrote: > > On Wed, Jan 21, 2015 at 05:05:50PM +0800, Zhang Zhen wrote: > >> On 2015/1/21 15:02, Paul E. McKenney wrote: > >>> On Wed, Jan 21, 2015 at 02:54:05PM +0800, Zhang Zhen wrote: > On 2015/1/21 11:13, Zhang Zhen wrote: > > On 2015/1/21 10:26, Zhang Zhen wrote: > >> On 2015/1/20 23:25, Don Zickus wrote: > >>> > >>> [ . . . ] > >>> > > Sorry, i made a mistake, the log above is based on v3.10.63. > > I have tested the latest upstream kernel (based on ec6f34e5b552), > > but my test case can't triggered RCU stall warning. > > > > I used git bisect to find the bad commit, but found a irrelevant > commit(db5d711e2db776 zram: avoid null access when fail to alloc meta). > Before this commit, my test case can easily trigger RCU stall warning, > but after this commit, my test case can't trigger RCU stall warning. > >>> > >>> I have lost track of exactly what kernel you are using. If you are > >>> using a recent kernel (v3.18 or thereabouts), one thing to try is to > >>> apply my stack of RCU CPU stall-warning changes that are on their way > >>> in, please see v3.19-rc1..630181c4a915 in -rcu, which is at: > >>> > >>> git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git > >>> > >>> These handle the problems that Dave Jones and a few others located this > >>> past December. Could you please give them a spin? > >>> > >> > >> Yeah, my test case can easily trigger RCU stall-warning based on commit > >> 630181c4a915 > >> in your linux-rcu git tree. > >> > >> In my .config, CONFIG_RCU_CPU_STALL_TIMEOUT=21. 
> >> Before commented out touch_softlockup_watchdog() in touch_nmi_watchdog: > >> > >> / # > >> / # echo 60 > /proc/sys/kernel/watchdog_thresh > >> [ 21.885200] NMI watchdog: disabled (cpu0): hardware events not enabled > >> / # busybox insmod softlockup_test.ko > >> [ 47.900117] INFO: rcu_sched detected stalls on CPUs/tasks: > >> [ 47.900117] (detected by 0, t=5252 jiffies, g=-202, c=-203, q=7) > >> [ 47.900117] All QSes seen, last rcu_sched kthread activity 5252 > >> (4294904271-4294899019), jiffies_till_next_fqs=1 > > > > Looks like you are starving RCU's grace-period kthreads. What happens if > > you boot with rcutree.kthread_prio=1 in order to run them at real-time > > priority? > > > >> [ 47.900117] busybox R running task042 41 > >> 0x0008 > >> [ 47.900117] 81afaf40 88007fa03d98 810781c0 > >> 81a5e5f8 > >> [ 47.900117] 81a45d80 88007fa03dd8 810af723 > >> 0083 > >> [ 47.900117] 88007fa13580 81a45d80 81a45d80 > >> 810c43e0 > >> [ 47.900117] Call Trace: > >> [ 47.900117][] sched_show_task+0xb0/0x110 > >> [ 47.900117] [] print_other_cpu_stall+0x2d3/0x2f0 > >> [ 47.900117] [] ? tick_nohz_handler+0xc0/0xc0 > >> [ 47.900117] [] __rcu_pending+0x178/0x220 > >> [ 47.900117] [] rcu_check_callbacks+0x105/0x190 > >> [ 47.900117] [] ? tick_nohz_handler+0xc0/0xc0 > >> [ 47.900117] [] update_process_times+0x32/0x60 > >> [ 47.900117] [] tick_sched_handle+0x32/0x80 > >> [ 47.900117] [] tick_sched_timer+0x4d/0x90 > >> [ 47.900117] [] __run_hrtimer+0xc7/0x1c0 > >> [ 47.900117] [] hrtimer_interrupt+0xe7/0x220 > >> [ 47.900117] [] ? 0xa0005000 > >> [ 47.900117] [] local_apic_timer_interrupt+0x34/0x60 > >> [ 47.900117] [] smp_apic_timer_interrupt+0x3c/0x60 > >> [ 47.900117] [] apic_timer_interrupt+0x6a/0x70 > >> [ 47.900117][] ? _raw_spin_lock+0x20/0x30 > >> [ 47.900117] [] test_init+0x5d/0x1000 > >> [softlockup_test] > >> [ 47.900117] [] ? 0xa0005000 > >> [ 47.900117] [] do_one_initcall+0xb8/0x1d0 > >> [ 47.900117] [] ? 
__vunmap+0x94/0xf0 > >> [ 47.900117] [] do_init_module+0x2b/0x1b0 > >> [ 47.900117] [] load_module+0x585/0x5f0 > >> [ 47.900117] [] ? mod_sysfs_teardown+0x150/0x150 > >> [ 47.900117] [] SyS_init_module+0x9b/0xc0 > >> [ 47.900117] [] system_call_fastpath+0x12/0x17 > >> [ 47.900117] rcu_sched kthread starved for 5252 jiffies! > >> > >> After commented out touch_softlockup_watchdog() in touch_nmi_watchdog: > >> > >> / # > >> / # echo 60 > /proc/sys/kernel/watchdog_thresh > >> [ 123.011980] NMI watchdog: disabled (cpu0): hardware events not enabled > >> / # busybox insmod softlockup_test.ko > >> [ 150.912055] INFO: rcu_sched detected stalls on CPUs/tasks: > >> [ 150.912055] (detected by 0, t=5252 jiffies, g=-198, c=-199, q=6) > >> [ 150.912055] All QSes seen, last rcu_sched kthread activity 5252 > >> (4294930024-4294924772), jiffies_till_next_fqs=1 > > > > Same here, looks like RCU's grace-period kthreads haven't gotten a chance > > to run for more than 5,000 jiffies. > > > >> [ 150.912055] busybox R running task042 41 > >> 0x0008 > >> [ 150.912055]
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On Wed, Jan 21, 2015 at 10:26:27AM +0800, Zhang Zhen wrote: > > This may not cause other problems but what happens if you comment out the > > 'touch_softlockup_watchdog' from the touch_nmi_watchdog function like > > below (based on latest upstream cb59670870)? > > > > The idea is that console printing for that cpu won't reset the softlockup > > detector. Again other bad things might happen and this patch may not be a > > good final solution, but it can help give me a clue about what is going > > on. > > I commented out the 'touch_softlockup_watchdog' from the touch_nmi_watchdog > function > (based on latest upstream ec6f34e5b552). > This triggered RCU stall and softlockup, but softlockup just printk only once. Yes, as expected. Instead of flooding the console with the same message, report only once until the high priority task is re-kicked and can re-arm the softlockup timer again. So this makes sense. Based on your double spin lock test, I don't ever expect this to get re-armed, so you should not expect to see another message. > As you mentioned "other bad things" lead to softlockup just printk only once. > What's the bad thing ? Well, one of the assumptions here is that if you are going to have interrupts disabled for a long time and purposely kick the hardlockup detector (with touch_nmi_watchdog), then you can reasonably assume that softlockups will go off too if you don't also kick the softlockup detector (even though there is a window where this isn't true: 2*thresh - thresh). So the patch I gave you can lead to false softlockup warnings. On the flip side, printing to the console also blocks hardlockups from showing up. But I believe long ago, Paul adjusted the rcu stalls to be longer than a hardlockup/softlockup timeout to prevent such scenarios. I am not sure what to do here. Printing to the console has traditionally been slow (especially if doing so with interrupts disabled), so a touch_nmi_watchdog has been necessary. 
But a side effect of that is that it prevents the lockup detectors from going off if printing repeatedly. Now we can hack up rcu stall to only print once until it is re-armed. This would show rcu stall printing first, followed two minutes later by a softlockup, which sorta achieves what you want. However, at the end of the day, an rcu stall, softlockup or hardlockup is a bad thing; does it matter if you get one or all of them? One should be enough to start an investigation, no? I don't know. I am leaning towards 'working as expected' and nothing to really fix here. Does anyone have any suggestions on what assumptions we could change to handle this better? Cheers, Don > > / # > / # echo 60 > /proc/sys/kernel/watchdog_thresh > / # busybox insmod softlockup_test.ko & > / # [ 39.044058] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > (detected by 0, t=21002 jiffies, g=78, c=77, q=4) > [ 39.044058] INFO: Stall ended before state dump start > [ 102.049045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected > by 0, t=84007 jiffies, g=78, c=77, q=4) > [ 102.049045] INFO: Stall ended before state dump start > [ 160.482123] BUG: soft lockup - CPU#0 stuck for 134s! 
[busybox:54] > [ 165.054075] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected > by 0, t=147012 jiffies, g=78, c=77, q=4) > [ 165.054075] INFO: Stall ended before state dump start > [ 228.059045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected > by 0, t=210017 jiffies, g=78, c=77, q=4) > [ 228.059045] INFO: Stall ended before state dump start > [ 291.064099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected > by 0, t=273022 jiffies, g=78, c=77, q=4) > [ 291.064099] INFO: Stall ended before state dump start > [ 354.069074] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected > by 0, t=336027 jiffies, g=78, c=77, q=4) > [ 354.069099] INFO: Stall ended before state dump start > [ 417.074045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected > by 0, t=399032 jiffies, g=78, c=77, q=4) > [ 417.074045] INFO: Stall ended before state dump start > [ 480.079099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected > by 0, t=462037 jiffies, g=78, c=77, q=4) > [ 480.079099] INFO: Stall ended before state dump start > [ 543.084099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected > by 0, t=525042 jiffies, g=78, c=77, q=4) > [ 543.084099] INFO: Stall ended before state dump start > [ 606.089101] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected > by 0, t=588047 jiffies, g=78, c=77, q=4) > [ 606.089101] INFO: Stall ended before state dump start > [ 669.094099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected > by 0, t=651052 jiffies, g=78, c=77, q=4) > [ 669.094099] INFO: Stall ended before state dump start > [ 732.099045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected > by 0, t=714057 jiffies, g=78, c=77, q=4) > [ 732.099045] INFO: Stall
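The experiment Don proposed (the patch itself was trimmed from the quoted mail) amounts to a one-line change in kernel/watchdog.c. The hunk below is a reconstruction from his description, not the original patch, and the function body is paraphrased from that era's source:

```diff
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ void touch_nmi_watchdog(void)
 void touch_nmi_watchdog(void)
 {
 	raw_cpu_write(watchdog_nmi_touch, true);
-	touch_softlockup_watchdog();
+	/* touch_softlockup_watchdog(); */
 }
```

With this applied, console printing from the stall-warning path still resets the hardlockup detector but no longer resets the softlockup detector, which is why Zhang then saw a (single) softlockup report alongside the stall spew.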
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On 2015/1/21 18:16, Paul E. McKenney wrote: > On Wed, Jan 21, 2015 at 05:05:50PM +0800, Zhang Zhen wrote: >> On 2015/1/21 15:02, Paul E. McKenney wrote: >>> On Wed, Jan 21, 2015 at 02:54:05PM +0800, Zhang Zhen wrote: On 2015/1/21 11:13, Zhang Zhen wrote: > On 2015/1/21 10:26, Zhang Zhen wrote: >> On 2015/1/20 23:25, Don Zickus wrote: >>> >>> [ . . . ] >>> > Sorry, i made a mistake, the log above is based on v3.10.63. > I have tested the latest upstream kernel (based on ec6f34e5b552), > but my test case can't triggered RCU stall warning. > I used git bisect to find the bad commit, but found a irrelevant commit(db5d711e2db776 zram: avoid null access when fail to alloc meta). Before this commit, my test case can easily trigger RCU stall warning, but after this commit, my test case can't trigger RCU stall warning. >>> >>> I have lost track of exactly what kernel you are using. If you are >>> using a recent kernel (v3.18 or thereabouts), one thing to try is to >>> apply my stack of RCU CPU stall-warning changes that are on their way >>> in, please see v3.19-rc1..630181c4a915 in -rcu, which is at: >>> >>> git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git >>> >>> These handle the problems that Dave Jones and a few others located this >>> past December. Could you please give them a spin? >>> >> >> Yeah, my test case can easily trigger RCU stall-warning based on commit >> 630181c4a915 >> in your linux-rcu git tree. >> >> In my .config, CONFIG_RCU_CPU_STALL_TIMEOUT=21. 
>> Before commented out touch_softlockup_watchdog() in touch_nmi_watchdog: >> >> / # >> / # echo 60 > /proc/sys/kernel/watchdog_thresh >> [ 21.885200] NMI watchdog: disabled (cpu0): hardware events not enabled >> / # busybox insmod softlockup_test.ko >> [ 47.900117] INFO: rcu_sched detected stalls on CPUs/tasks: >> [ 47.900117] (detected by 0, t=5252 jiffies, g=-202, c=-203, q=7) >> [ 47.900117] All QSes seen, last rcu_sched kthread activity 5252 >> (4294904271-4294899019), jiffies_till_next_fqs=1 > > Looks like you are starving RCU's grace-period kthreads. What happens if > you boot with rcutree.kthread_prio=1 in order to run them at real-time > priority? > >> [ 47.900117] busybox R running task042 41 >> 0x0008 >> [ 47.900117] 81afaf40 88007fa03d98 810781c0 >> 81a5e5f8 >> [ 47.900117] 81a45d80 88007fa03dd8 810af723 >> 0083 >> [ 47.900117] 88007fa13580 81a45d80 81a45d80 >> 810c43e0 >> [ 47.900117] Call Trace: >> [ 47.900117][] sched_show_task+0xb0/0x110 >> [ 47.900117] [] print_other_cpu_stall+0x2d3/0x2f0 >> [ 47.900117] [] ? tick_nohz_handler+0xc0/0xc0 >> [ 47.900117] [] __rcu_pending+0x178/0x220 >> [ 47.900117] [] rcu_check_callbacks+0x105/0x190 >> [ 47.900117] [] ? tick_nohz_handler+0xc0/0xc0 >> [ 47.900117] [] update_process_times+0x32/0x60 >> [ 47.900117] [] tick_sched_handle+0x32/0x80 >> [ 47.900117] [] tick_sched_timer+0x4d/0x90 >> [ 47.900117] [] __run_hrtimer+0xc7/0x1c0 >> [ 47.900117] [] hrtimer_interrupt+0xe7/0x220 >> [ 47.900117] [] ? 0xa0005000 >> [ 47.900117] [] local_apic_timer_interrupt+0x34/0x60 >> [ 47.900117] [] smp_apic_timer_interrupt+0x3c/0x60 >> [ 47.900117] [] apic_timer_interrupt+0x6a/0x70 >> [ 47.900117][] ? _raw_spin_lock+0x20/0x30 >> [ 47.900117] [] test_init+0x5d/0x1000 [softlockup_test] >> [ 47.900117] [] ? 0xa0005000 >> [ 47.900117] [] do_one_initcall+0xb8/0x1d0 >> [ 47.900117] [] ? __vunmap+0x94/0xf0 >> [ 47.900117] [] do_init_module+0x2b/0x1b0 >> [ 47.900117] [] load_module+0x585/0x5f0 >> [ 47.900117] [] ? 
mod_sysfs_teardown+0x150/0x150 >> [ 47.900117] [] SyS_init_module+0x9b/0xc0 >> [ 47.900117] [] system_call_fastpath+0x12/0x17 >> [ 47.900117] rcu_sched kthread starved for 5252 jiffies! >> >> After commented out touch_softlockup_watchdog() in touch_nmi_watchdog: >> >> / # >> / # echo 60 > /proc/sys/kernel/watchdog_thresh >> [ 123.011980] NMI watchdog: disabled (cpu0): hardware events not enabled >> / # busybox insmod softlockup_test.ko >> [ 150.912055] INFO: rcu_sched detected stalls on CPUs/tasks: >> [ 150.912055] (detected by 0, t=5252 jiffies, g=-198, c=-199, q=6) >> [ 150.912055] All QSes seen, last rcu_sched kthread activity 5252 >> (4294930024-4294924772), jiffies_till_next_fqs=1 > > Same here, looks like RCU's grace-period kthreads haven't gotten a chance > to run for more than 5,000 jiffies. > >> [ 150.912055] busybox R running task042 41 >> 0x0008 >> [ 150.912055] 81afaf40 88007fa03d98 810781c0 >> 81a5e5f8 >> [ 150.912055] 81a45d80 88007fa03dd8 810af723 >> 0083 >> [ 150.912055] 88007fa13580 81a45d80 81a45d80 >> 810c43e0 >> [ 150.912055] Call
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On Wed, Jan 21, 2015 at 05:05:50PM +0800, Zhang Zhen wrote:
> On 2015/1/21 15:02, Paul E. McKenney wrote:
> > On Wed, Jan 21, 2015 at 02:54:05PM +0800, Zhang Zhen wrote:
> >> On 2015/1/21 11:13, Zhang Zhen wrote:
> >>> On 2015/1/21 10:26, Zhang Zhen wrote:
> On 2015/1/20 23:25, Don Zickus wrote:
> >
> > [ . . . ]
> >
> >>> Sorry, I made a mistake; the log above is based on v3.10.63.
> >>> I have tested the latest upstream kernel (based on ec6f34e5b552),
> >>> but my test case can't trigger the RCU stall warning.
> >>
> >> I used git bisect to find the bad commit, but found an irrelevant
> >> commit (db5d711e2db776 "zram: avoid null access when fail to alloc meta").
> >> Before this commit, my test case could easily trigger the RCU stall
> >> warning, but after this commit it can't.
> >
> > I have lost track of exactly what kernel you are using.  If you are
> > using a recent kernel (v3.18 or thereabouts), one thing to try is to
> > apply my stack of RCU CPU stall-warning changes that are on their way
> > in; please see v3.19-rc1..630181c4a915 in -rcu, which is at:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git
> >
> > These handle the problems that Dave Jones and a few others located this
> > past December.  Could you please give them a spin?
>
> Yeah, my test case can easily trigger the RCU stall warning based on
> commit 630181c4a915 in your linux-rcu git tree.
>
> In my .config, CONFIG_RCU_CPU_STALL_TIMEOUT=21.
>
> Before commenting out touch_softlockup_watchdog() in touch_nmi_watchdog:
>
> / #
> / # echo 60 > /proc/sys/kernel/watchdog_thresh
> [ 21.885200] NMI watchdog: disabled (cpu0): hardware events not enabled
> / # busybox insmod softlockup_test.ko
> [ 47.900117] INFO: rcu_sched detected stalls on CPUs/tasks:
> [ 47.900117] (detected by 0, t=5252 jiffies, g=-202, c=-203, q=7)
> [ 47.900117] All QSes seen, last rcu_sched kthread activity 5252 (4294904271-4294899019), jiffies_till_next_fqs=1

Looks like you are starving RCU's grace-period kthreads.  What happens
if you boot with rcutree.kthread_prio=1 in order to run them at
real-time priority?

> [ 47.900117] busybox R running task042 41 0x0008
> [ 47.900117] 81afaf40 88007fa03d98 810781c0 81a5e5f8
> [ 47.900117] 81a45d80 88007fa03dd8 810af723 0083
> [ 47.900117] 88007fa13580 81a45d80 81a45d80 810c43e0
> [ 47.900117] Call Trace:
> [ 47.900117] [] sched_show_task+0xb0/0x110
> [ 47.900117] [] print_other_cpu_stall+0x2d3/0x2f0
> [ 47.900117] [] ? tick_nohz_handler+0xc0/0xc0
> [ 47.900117] [] __rcu_pending+0x178/0x220
> [ 47.900117] [] rcu_check_callbacks+0x105/0x190
> [ 47.900117] [] ? tick_nohz_handler+0xc0/0xc0
> [ 47.900117] [] update_process_times+0x32/0x60
> [ 47.900117] [] tick_sched_handle+0x32/0x80
> [ 47.900117] [] tick_sched_timer+0x4d/0x90
> [ 47.900117] [] __run_hrtimer+0xc7/0x1c0
> [ 47.900117] [] hrtimer_interrupt+0xe7/0x220
> [ 47.900117] [] ? 0xa0005000
> [ 47.900117] [] local_apic_timer_interrupt+0x34/0x60
> [ 47.900117] [] smp_apic_timer_interrupt+0x3c/0x60
> [ 47.900117] [] apic_timer_interrupt+0x6a/0x70
> [ 47.900117] [] ? _raw_spin_lock+0x20/0x30
> [ 47.900117] [] test_init+0x5d/0x1000 [softlockup_test]
> [ 47.900117] [] ? 0xa0005000
> [ 47.900117] [] do_one_initcall+0xb8/0x1d0
> [ 47.900117] [] ? __vunmap+0x94/0xf0
> [ 47.900117] [] do_init_module+0x2b/0x1b0
> [ 47.900117] [] load_module+0x585/0x5f0
> [ 47.900117] [] ? mod_sysfs_teardown+0x150/0x150
> [ 47.900117] [] SyS_init_module+0x9b/0xc0
> [ 47.900117] [] system_call_fastpath+0x12/0x17
> [ 47.900117] rcu_sched kthread starved for 5252 jiffies!
>
> After commenting out touch_softlockup_watchdog() in touch_nmi_watchdog:
>
> / #
> / # echo 60 > /proc/sys/kernel/watchdog_thresh
> [ 123.011980] NMI watchdog: disabled (cpu0): hardware events not enabled
> / # busybox insmod softlockup_test.ko
> [ 150.912055] INFO: rcu_sched detected stalls on CPUs/tasks:
> [ 150.912055] (detected by 0, t=5252 jiffies, g=-198, c=-199, q=6)
> [ 150.912055] All QSes seen, last rcu_sched kthread activity 5252 (4294930024-4294924772), jiffies_till_next_fqs=1

Same here, looks like RCU's grace-period kthreads haven't gotten a
chance to run for more than 5,000 jiffies.

> [ 150.912055] busybox R running task042 41 0x0008
> [ 150.912055] 81afaf40 88007fa03d98 810781c0 81a5e5f8
> [ 150.912055] 81a45d80 88007fa03dd8 810af723 0083
> [ 150.912055] 88007fa13580 81a45d80 81a45d80 810c43e0
> [ 150.912055] Call Trace:
> [ 150.912055] [] sched_show_task+0xb0/0x110
> [ 150.912055] [] print_other_cpu_stall+0x2d3/0x2f0
> [ 150.912055]
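Paul's suggestion above (booting with rcutree.kthread_prio=1) can be tried as follows. This is a hedged sketch: the kernel-command-line fragment is the parameter named in the message, while the verification step assumes the grace-period kthread is called rcu_sched, as it is in the stall messages above.

```shell
# Kernel command line: run RCU's grace-period kthreads at SCHED_FIFO
# priority 1, so a CPU spinning with interrupts off cannot starve them
# indefinitely:
#   ... rcutree.kthread_prio=1 ...

# After boot, check the resulting scheduling class/priority (kthread
# name assumed from the stall messages; adjust if yours differs):
chrt -p "$(pgrep -x rcu_sched)"
```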
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On 2015/1/21 15:02, Paul E. McKenney wrote:
> On Wed, Jan 21, 2015 at 02:54:05PM +0800, Zhang Zhen wrote:
>> On 2015/1/21 11:13, Zhang Zhen wrote:
>>> On 2015/1/21 10:26, Zhang Zhen wrote:
On 2015/1/20 23:25, Don Zickus wrote:
>
> [ . . . ]
>
>>> Sorry, I made a mistake; the log above is based on v3.10.63.
>>> I have tested the latest upstream kernel (based on ec6f34e5b552),
>>> but my test case can't trigger the RCU stall warning.
>>
>> I used git bisect to find the bad commit, but found an irrelevant
>> commit (db5d711e2db776 "zram: avoid null access when fail to alloc meta").
>> Before this commit, my test case could easily trigger the RCU stall
>> warning, but after this commit it can't.
>
> I have lost track of exactly what kernel you are using.  If you are
> using a recent kernel (v3.18 or thereabouts), one thing to try is to
> apply my stack of RCU CPU stall-warning changes that are on their way
> in; please see v3.19-rc1..630181c4a915 in -rcu, which is at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git
>
> These handle the problems that Dave Jones and a few others located this
> past December.  Could you please give them a spin?

Yeah, my test case can easily trigger the RCU stall warning based on
commit 630181c4a915 in your linux-rcu git tree.

In my .config, CONFIG_RCU_CPU_STALL_TIMEOUT=21.

Before commenting out touch_softlockup_watchdog() in touch_nmi_watchdog:

/ #
/ # echo 60 > /proc/sys/kernel/watchdog_thresh
[ 21.885200] NMI watchdog: disabled (cpu0): hardware events not enabled
/ # busybox insmod softlockup_test.ko
[ 47.900117] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 47.900117] (detected by 0, t=5252 jiffies, g=-202, c=-203, q=7)
[ 47.900117] All QSes seen, last rcu_sched kthread activity 5252 (4294904271-4294899019), jiffies_till_next_fqs=1
[ 47.900117] busybox R running task042 41 0x0008
[ 47.900117] 81afaf40 88007fa03d98 810781c0 81a5e5f8
[ 47.900117] 81a45d80 88007fa03dd8 810af723 0083
[ 47.900117] 88007fa13580 81a45d80 81a45d80 810c43e0
[ 47.900117] Call Trace:
[ 47.900117] [] sched_show_task+0xb0/0x110
[ 47.900117] [] print_other_cpu_stall+0x2d3/0x2f0
[ 47.900117] [] ? tick_nohz_handler+0xc0/0xc0
[ 47.900117] [] __rcu_pending+0x178/0x220
[ 47.900117] [] rcu_check_callbacks+0x105/0x190
[ 47.900117] [] ? tick_nohz_handler+0xc0/0xc0
[ 47.900117] [] update_process_times+0x32/0x60
[ 47.900117] [] tick_sched_handle+0x32/0x80
[ 47.900117] [] tick_sched_timer+0x4d/0x90
[ 47.900117] [] __run_hrtimer+0xc7/0x1c0
[ 47.900117] [] hrtimer_interrupt+0xe7/0x220
[ 47.900117] [] ? 0xa0005000
[ 47.900117] [] local_apic_timer_interrupt+0x34/0x60
[ 47.900117] [] smp_apic_timer_interrupt+0x3c/0x60
[ 47.900117] [] apic_timer_interrupt+0x6a/0x70
[ 47.900117] [] ? _raw_spin_lock+0x20/0x30
[ 47.900117] [] test_init+0x5d/0x1000 [softlockup_test]
[ 47.900117] [] ? 0xa0005000
[ 47.900117] [] do_one_initcall+0xb8/0x1d0
[ 47.900117] [] ? __vunmap+0x94/0xf0
[ 47.900117] [] do_init_module+0x2b/0x1b0
[ 47.900117] [] load_module+0x585/0x5f0
[ 47.900117] [] ? mod_sysfs_teardown+0x150/0x150
[ 47.900117] [] SyS_init_module+0x9b/0xc0
[ 47.900117] [] system_call_fastpath+0x12/0x17
[ 47.900117] rcu_sched kthread starved for 5252 jiffies!

After commenting out touch_softlockup_watchdog() in touch_nmi_watchdog:

/ #
/ # echo 60 > /proc/sys/kernel/watchdog_thresh
[ 123.011980] NMI watchdog: disabled (cpu0): hardware events not enabled
/ # busybox insmod softlockup_test.ko
[ 150.912055] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 150.912055] (detected by 0, t=5252 jiffies, g=-198, c=-199, q=6)
[ 150.912055] All QSes seen, last rcu_sched kthread activity 5252 (4294930024-4294924772), jiffies_till_next_fqs=1
[ 150.912055] busybox R running task042 41 0x0008
[ 150.912055] 81afaf40 88007fa03d98 810781c0 81a5e5f8
[ 150.912055] 81a45d80 88007fa03dd8 810af723 0083
[ 150.912055] 88007fa13580 81a45d80 81a45d80 810c43e0
[ 150.912055] Call Trace:
[ 150.912055] [] sched_show_task+0xb0/0x110
[ 150.912055] [] print_other_cpu_stall+0x2d3/0x2f0
[ 150.912055] [] ? tick_nohz_handler+0xc0/0xc0
[ 150.912055] [] __rcu_pending+0x178/0x220
[ 150.912055] [] rcu_check_callbacks+0x105/0x190
[ 150.912055] [] ? tick_nohz_handler+0xc0/0xc0
[ 150.912055] [] update_process_times+0x32/0x60
[ 150.912055] [] tick_sched_handle+0x32/0x80
[ 150.912055] [] tick_sched_timer+0x4d/0x90
[ 150.912055] [] __run_hrtimer+0xc7/0x1c0
[ 150.912055] [] hrtimer_interrupt+0xe7/0x220
[ 150.912055] [] ? 0xa0005000
[ 150.912055] [] local_apic_timer_interrupt+0x34/0x60
[ 150.912055] []
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On Wed, Jan 21, 2015 at 07:11:51PM +0800, Zhang Zhen wrote:
On 2015/1/21 18:16, Paul E. McKenney wrote:
On Wed, Jan 21, 2015 at 05:05:50PM +0800, Zhang Zhen wrote:
On 2015/1/21 15:02, Paul E. McKenney wrote:
On Wed, Jan 21, 2015 at 02:54:05PM +0800, Zhang Zhen wrote:
On 2015/1/21 11:13, Zhang Zhen wrote:
On 2015/1/21 10:26, Zhang Zhen wrote:
On 2015/1/20 23:25, Don Zickus wrote:

[ . . . ]

Sorry, I made a mistake; the log above is based on v3.10.63. I have tested the latest upstream kernel (based on ec6f34e5b552), but my test case can't trigger the RCU stall warning.

I used git bisect to find the bad commit, but found an irrelevant commit (db5d711e2db776 "zram: avoid null access when fail to alloc meta"). Before this commit, my test case could easily trigger the RCU stall warning, but after this commit it can't.

I have lost track of exactly what kernel you are using.  If you are using a recent kernel (v3.18 or thereabouts), one thing to try is to apply my stack of RCU CPU stall-warning changes that are on their way in; please see v3.19-rc1..630181c4a915 in -rcu, which is at:

git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git

These handle the problems that Dave Jones and a few others located this past December.  Could you please give them a spin?

Yeah, my test case can easily trigger the RCU stall warning based on commit 630181c4a915 in your linux-rcu git tree.

In my .config, CONFIG_RCU_CPU_STALL_TIMEOUT=21.

Before commenting out touch_softlockup_watchdog() in touch_nmi_watchdog:

/ #
/ # echo 60 > /proc/sys/kernel/watchdog_thresh
[ 21.885200] NMI watchdog: disabled (cpu0): hardware events not enabled
/ # busybox insmod softlockup_test.ko
[ 47.900117] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 47.900117] (detected by 0, t=5252 jiffies, g=-202, c=-203, q=7)
[ 47.900117] All QSes seen, last rcu_sched kthread activity 5252 (4294904271-4294899019), jiffies_till_next_fqs=1

Looks like you are starving RCU's grace-period kthreads.  What happens if you boot with rcutree.kthread_prio=1 in order to run them at real-time priority?

[ 47.900117] busybox R running task042 41 0x0008
[ 47.900117] 81afaf40 88007fa03d98 810781c0 81a5e5f8
[ 47.900117] 81a45d80 88007fa03dd8 810af723 0083
[ 47.900117] 88007fa13580 81a45d80 81a45d80 810c43e0
[ 47.900117] Call Trace:
[ 47.900117] <IRQ> [810781c0] sched_show_task+0xb0/0x110
[ 47.900117] [810af723] print_other_cpu_stall+0x2d3/0x2f0
[ 47.900117] [810c43e0] ? tick_nohz_handler+0xc0/0xc0
[ 47.900117] [810af8b8] __rcu_pending+0x178/0x220
[ 47.900117] [810b02a5] rcu_check_callbacks+0x105/0x190
[ 47.900117] [810c43e0] ? tick_nohz_handler+0xc0/0xc0
[ 47.900117] [810b3582] update_process_times+0x32/0x60
[ 47.900117] [810c41e2] tick_sched_handle+0x32/0x80
[ 47.900117] [810c442d] tick_sched_timer+0x4d/0x90
[ 47.900117] [810b5b57] __run_hrtimer+0xc7/0x1c0
[ 47.900117] [810b5e37] hrtimer_interrupt+0xe7/0x220
[ 47.900117] [a0005000] ? 0xa0005000
[ 47.900117] [81035654] local_apic_timer_interrupt+0x34/0x60
[ 47.900117] [81035d8c] smp_apic_timer_interrupt+0x3c/0x60
[ 47.900117] [814ff02a] apic_timer_interrupt+0x6a/0x70
[ 47.900117] <EOI> [814fdaa0] ? _raw_spin_lock+0x20/0x30
[ 47.900117] [a000505d] test_init+0x5d/0x1000 [softlockup_test]
[ 47.900117] [a0005000] ? 0xa0005000
[ 47.900117] [81000288] do_one_initcall+0xb8/0x1d0
[ 47.900117] [81168794] ? __vunmap+0x94/0xf0
[ 47.900117] [810ce36b] do_init_module+0x2b/0x1b0
[ 47.900117] [810d09b5] load_module+0x585/0x5f0
[ 47.900117] [810cdd40] ? mod_sysfs_teardown+0x150/0x150
[ 47.900117] [810d0bbb] SyS_init_module+0x9b/0xc0
[ 47.900117] [814fe252] system_call_fastpath+0x12/0x17
[ 47.900117] rcu_sched kthread starved for 5252 jiffies!

After commenting out touch_softlockup_watchdog() in touch_nmi_watchdog:

/ #
/ # echo 60 > /proc/sys/kernel/watchdog_thresh
[ 123.011980] NMI watchdog: disabled (cpu0): hardware events not enabled
/ # busybox insmod softlockup_test.ko
[ 150.912055] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 150.912055] (detected by 0, t=5252 jiffies, g=-198, c=-199, q=6)
[ 150.912055] All QSes seen, last rcu_sched kthread activity 5252 (4294930024-4294924772), jiffies_till_next_fqs=1

Same here, looks like RCU's grace-period kthreads haven't gotten a chance to run for more than 5,000 jiffies.

[ 150.912055] busybox R running task042 41 0x0008
[ 150.912055] 81afaf40 88007fa03d98 810781c0 81a5e5f8
[ 150.912055] 81a45d80 88007fa03dd8
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On Wed, Jan 21, 2015 at 10:10:51AM -0500, Don Zickus wrote:
On Wed, Jan 21, 2015 at 10:26:27AM +0800, Zhang Zhen wrote:

This may not cause other problems, but what happens if you comment out the 'touch_softlockup_watchdog' call in the touch_nmi_watchdog function like below (based on latest upstream cb59670870)? The idea is that console printing for that cpu won't reset the softlockup detector. Again, other bad things might happen and this patch may not be a good final solution, but it can help give me a clue about what is going on.

I commented out the 'touch_softlockup_watchdog' call in the touch_nmi_watchdog function (based on latest upstream ec6f34e5b552). This triggered an RCU stall and a softlockup, but the softlockup warning was printed only once.

Yes, as expected. Instead of flooding the console with the same message, we report only once until the high-priority task is re-kicked and can re-arm the softlockup timer again. So this makes sense. Based on your double-spin-lock test, I don't ever expect this to get re-armed, so you should not expect to see another message.

As you mentioned, "other bad things" lead to the softlockup being printed only once. What's the bad thing?

Well, one of the assumptions here is that if you are going to have interrupts disabled for a long time and purposely kick the hardlockup detector (with touch_nmi_watchdog), then you can reasonably assume that softlockups will go off too if you don't also kick the softlockup detector (even though there is a window where this isn't true: 2*thresh - thresh). So the patch I gave you can lead to false softlockup warnings.

On the flip side, printing to the console also blocks hardlockups from showing up. But I believe long ago, Paul adjusted the RCU stall timeouts to be longer than a hardlockup/softlockup timeout to prevent such scenarios. I am not sure what to do here. Printing to the console has traditionally been slow (especially if doing so with interrupts disabled), so a touch_nmi_watchdog call has been necessary.
But a side effect of that is that it prevents the lockup detectors from going off if printing repeatedly. Now we can hack up the RCU stall code to print only once until it is re-armed. This would show the RCU stall printing first, followed two minutes later by a softlockup, which sorta achieves what you want. However, at the end of the day, an RCU stall, softlockup or hardlockup is a bad thing; does it matter if you get one or all of them? One should be enough to start an investigation, no?

Hear, hear!!!

Thanx, Paul

I don't know. I am leaning towards 'working as expected' and nothing to really fix here. Does anyone have any suggestions on what assumptions we could change to handle this better?

Cheers,
Don

/ #
/ # echo 60 > /proc/sys/kernel/watchdog_thresh
/ # busybox insmod softlockup_test.ko
/ #
[ 39.044058] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=21002 jiffies, g=78, c=77, q=4)
[ 39.044058] INFO: Stall ended before state dump start
[ 102.049045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=84007 jiffies, g=78, c=77, q=4)
[ 102.049045] INFO: Stall ended before state dump start
[ 160.482123] BUG: soft lockup - CPU#0 stuck for 134s!
[busybox:54]
[ 165.054075] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=147012 jiffies, g=78, c=77, q=4)
[ 165.054075] INFO: Stall ended before state dump start
[ 228.059045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=210017 jiffies, g=78, c=77, q=4)
[ 228.059045] INFO: Stall ended before state dump start
[ 291.064099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=273022 jiffies, g=78, c=77, q=4)
[ 291.064099] INFO: Stall ended before state dump start
[ 354.069074] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=336027 jiffies, g=78, c=77, q=4)
[ 354.069099] INFO: Stall ended before state dump start
[ 417.074045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=399032 jiffies, g=78, c=77, q=4)
[ 417.074045] INFO: Stall ended before state dump start
[ 480.079099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=462037 jiffies, g=78, c=77, q=4)
[ 480.079099] INFO: Stall ended before state dump start
[ 543.084099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=525042 jiffies, g=78, c=77, q=4)
[ 543.084099] INFO: Stall ended before state dump start
[ 606.089101] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=588047 jiffies, g=78, c=77, q=4)
[ 606.089101] INFO: Stall ended before state dump start
[ 669.094099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=651052 jiffies, g=78, c=77, q=4)
[ 669.094099] INFO:
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On Thu, Jan 22, 2015 at 11:08:11AM +0800, Zhang Zhen wrote:
On 2015/1/22 4:06, Paul E. McKenney wrote:
On Wed, Jan 21, 2015 at 10:10:51AM -0500, Don Zickus wrote:
On Wed, Jan 21, 2015 at 10:26:27AM +0800, Zhang Zhen wrote:

This may not cause other problems, but what happens if you comment out the 'touch_softlockup_watchdog' call in the touch_nmi_watchdog function like below (based on latest upstream cb59670870)? The idea is that console printing for that cpu won't reset the softlockup detector. Again, other bad things might happen and this patch may not be a good final solution, but it can help give me a clue about what is going on.

I commented out the 'touch_softlockup_watchdog' call in the touch_nmi_watchdog function (based on latest upstream ec6f34e5b552). This triggered an RCU stall and a softlockup, but the softlockup warning was printed only once.

Yes, as expected. Instead of flooding the console with the same message, we report only once until the high-priority task is re-kicked and can re-arm the softlockup timer again. So this makes sense. Based on your double-spin-lock test, I don't ever expect this to get re-armed, so you should not expect to see another message.

As you mentioned, "other bad things" lead to the softlockup being printed only once. What's the bad thing?

Well, one of the assumptions here is that if you are going to have interrupts disabled for a long time and purposely kick the hardlockup detector (with touch_nmi_watchdog), then you can reasonably assume that softlockups will go off too if you don't also kick the softlockup detector (even though there is a window where this isn't true: 2*thresh - thresh). So the patch I gave you can lead to false softlockup warnings.

On the flip side, printing to the console also blocks hardlockups from showing up. But I believe long ago, Paul adjusted the RCU stall timeouts to be longer than a hardlockup/softlockup timeout to prevent such scenarios. I am not sure what to do here.
Printing to the console has traditionally been slow (especially if doing so with interrupts disabled), so a touch_nmi_watchdog call has been necessary. But a side effect of that is that it prevents the lockup detectors from going off if printing repeatedly. Now we can hack up the RCU stall code to print only once until it is re-armed. This would show the RCU stall printing first, followed two minutes later by a softlockup, which sorta achieves what you want. However, at the end of the day, an RCU stall, softlockup or hardlockup is a bad thing; does it matter if you get one or all of them? One should be enough to start an investigation, no?

Hear, hear!!!

Ok, thanks for Don's and your patience! We enriched the softlockup printk information, so we want to get the softlockup warning.

Booting with rcupdate.rcu_cpu_stall_suppress=1 will shut up the RCU CPU stall warnings.

Thanx, Paul

I don't know. I am leaning towards 'working as expected' and nothing to really fix here. Does anyone have any suggestions on what assumptions we could change to handle this better?

Cheers,
Don

/ #
/ # echo 60 > /proc/sys/kernel/watchdog_thresh
/ # busybox insmod softlockup_test.ko
/ #
[ 39.044058] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=21002 jiffies, g=78, c=77, q=4)
[ 39.044058] INFO: Stall ended before state dump start
[ 102.049045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=84007 jiffies, g=78, c=77, q=4)
[ 102.049045] INFO: Stall ended before state dump start
[ 160.482123] BUG: soft lockup - CPU#0 stuck for 134s!
[busybox:54]
[ 165.054075] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=147012 jiffies, g=78, c=77, q=4)
[ 165.054075] INFO: Stall ended before state dump start
[ 228.059045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=210017 jiffies, g=78, c=77, q=4)
[ 228.059045] INFO: Stall ended before state dump start
[ 291.064099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=273022 jiffies, g=78, c=77, q=4)
[ 291.064099] INFO: Stall ended before state dump start
[ 354.069074] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=336027 jiffies, g=78, c=77, q=4)
[ 354.069099] INFO: Stall ended before state dump start
[ 417.074045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=399032 jiffies, g=78, c=77, q=4)
[ 417.074045] INFO: Stall ended before state dump start
[ 480.079099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=462037 jiffies, g=78, c=77, q=4)
[ 480.079099] INFO: Stall ended before state dump start
[ 543.084099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=525042 jiffies, g=78, c=77, q=4)
[ 543.084099] INFO: Stall ended
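Paul's rcupdate.rcu_cpu_stall_suppress=1 hint admits both a boot-time and a runtime form; the sketch below is hedged in that the sysfs path follows the standard module-parameter layout and is an assumption if your kernel exposes the parameter differently.

```shell
# At boot, on the kernel command line:
#   ... rcupdate.rcu_cpu_stall_suppress=1 ...

# At runtime, via the module-parameter sysfs path (assumed present and
# writable; value 1 suppresses RCU CPU stall warnings, 0 re-enables):
echo 1 > /sys/module/rcupdate/parameters/rcu_cpu_stall_suppress
```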
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On 2015/1/22 4:06, Paul E. McKenney wrote:
On Wed, Jan 21, 2015 at 10:10:51AM -0500, Don Zickus wrote:
On Wed, Jan 21, 2015 at 10:26:27AM +0800, Zhang Zhen wrote:

This may not cause other problems, but what happens if you comment out the 'touch_softlockup_watchdog' call in the touch_nmi_watchdog function like below (based on latest upstream cb59670870)? The idea is that console printing for that cpu won't reset the softlockup detector. Again, other bad things might happen and this patch may not be a good final solution, but it can help give me a clue about what is going on.

I commented out the 'touch_softlockup_watchdog' call in the touch_nmi_watchdog function (based on latest upstream ec6f34e5b552). This triggered an RCU stall and a softlockup, but the softlockup warning was printed only once.

Yes, as expected. Instead of flooding the console with the same message, we report only once until the high-priority task is re-kicked and can re-arm the softlockup timer again. So this makes sense. Based on your double-spin-lock test, I don't ever expect this to get re-armed, so you should not expect to see another message.

As you mentioned, "other bad things" lead to the softlockup being printed only once. What's the bad thing?

Well, one of the assumptions here is that if you are going to have interrupts disabled for a long time and purposely kick the hardlockup detector (with touch_nmi_watchdog), then you can reasonably assume that softlockups will go off too if you don't also kick the softlockup detector (even though there is a window where this isn't true: 2*thresh - thresh). So the patch I gave you can lead to false softlockup warnings.

On the flip side, printing to the console also blocks hardlockups from showing up. But I believe long ago, Paul adjusted the RCU stall timeouts to be longer than a hardlockup/softlockup timeout to prevent such scenarios. I am not sure what to do here.
Printing to the console has traditionally been slow (especially if doing so with interrupts disabled), so a touch_nmi_watchdog call has been necessary. But a side effect of that is that it prevents the lockup detectors from going off if printing repeatedly. Now we can hack up the RCU stall code to print only once until it is re-armed. This would show the RCU stall printing first, followed two minutes later by a softlockup, which sorta achieves what you want. However, at the end of the day, an RCU stall, softlockup or hardlockup is a bad thing; does it matter if you get one or all of them? One should be enough to start an investigation, no?

Hear, hear!!!

Ok, thanks for Don's and your patience! We enriched the softlockup printk information, so we want to get the softlockup warning.

Thanx, Paul

I don't know. I am leaning towards 'working as expected' and nothing to really fix here. Does anyone have any suggestions on what assumptions we could change to handle this better?

Cheers,
Don

/ #
/ # echo 60 > /proc/sys/kernel/watchdog_thresh
/ # busybox insmod softlockup_test.ko
/ #
[ 39.044058] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=21002 jiffies, g=78, c=77, q=4)
[ 39.044058] INFO: Stall ended before state dump start
[ 102.049045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=84007 jiffies, g=78, c=77, q=4)
[ 102.049045] INFO: Stall ended before state dump start
[ 160.482123] BUG: soft lockup - CPU#0 stuck for 134s!
[busybox:54]
[ 165.054075] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=147012 jiffies, g=78, c=77, q=4)
[ 165.054075] INFO: Stall ended before state dump start
[ 228.059045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=210017 jiffies, g=78, c=77, q=4)
[ 228.059045] INFO: Stall ended before state dump start
[ 291.064099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=273022 jiffies, g=78, c=77, q=4)
[ 291.064099] INFO: Stall ended before state dump start
[ 354.069074] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=336027 jiffies, g=78, c=77, q=4)
[ 354.069099] INFO: Stall ended before state dump start
[ 417.074045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=399032 jiffies, g=78, c=77, q=4)
[ 417.074045] INFO: Stall ended before state dump start
[ 480.079099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=462037 jiffies, g=78, c=77, q=4)
[ 480.079099] INFO: Stall ended before state dump start
[ 543.084099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=525042 jiffies, g=78, c=77, q=4)
[ 543.084099] INFO: Stall ended before state dump start
[ 606.089101] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=588047 jiffies, g=78, c=77, q=4)
[ 606.089101] INFO: Stall ended before state dump start
[ 669.094099] INFO: rcu_preempt detected stalls on
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On 2015/1/21 18:16, Paul E. McKenney wrote: On Wed, Jan 21, 2015 at 05:05:50PM +0800, Zhang Zhen wrote: On 2015/1/21 15:02, Paul E. McKenney wrote: On Wed, Jan 21, 2015 at 02:54:05PM +0800, Zhang Zhen wrote: On 2015/1/21 11:13, Zhang Zhen wrote: On 2015/1/21 10:26, Zhang Zhen wrote: On 2015/1/20 23:25, Don Zickus wrote: [ . . . ] Sorry, i made a mistake, the log above is based on v3.10.63. I have tested the latest upstream kernel (based on ec6f34e5b552), but my test case can't triggered RCU stall warning. I used git bisect to find the bad commit, but found a irrelevant commit(db5d711e2db776 zram: avoid null access when fail to alloc meta). Before this commit, my test case can easily trigger RCU stall warning, but after this commit, my test case can't trigger RCU stall warning. I have lost track of exactly what kernel you are using. If you are using a recent kernel (v3.18 or thereabouts), one thing to try is to apply my stack of RCU CPU stall-warning changes that are on their way in, please see v3.19-rc1..630181c4a915 in -rcu, which is at: git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git These handle the problems that Dave Jones and a few others located this past December. Could you please give them a spin? Yeah, my test case can easily trigger RCU stall-warning based on commit 630181c4a915 in your linux-rcu git tree. In my .config, CONFIG_RCU_CPU_STALL_TIMEOUT=21. Before commented out touch_softlockup_watchdog() in touch_nmi_watchdog: / # / # echo 60 /proc/sys/kernel/watchdog_thresh [ 21.885200] NMI watchdog: disabled (cpu0): hardware events not enabled / # busybox insmod softlockup_test.ko [ 47.900117] INFO: rcu_sched detected stalls on CPUs/tasks: [ 47.900117] (detected by 0, t=5252 jiffies, g=-202, c=-203, q=7) [ 47.900117] All QSes seen, last rcu_sched kthread activity 5252 (4294904271-4294899019), jiffies_till_next_fqs=1 Looks like you are starving RCU's grace-period kthreads. 
What happens if you boot with rcutree.kthread_prio=1 in order to run them at real-time priority? [ 47.900117] busybox R running task042 41 0x0008 [ 47.900117] 81afaf40 88007fa03d98 810781c0 81a5e5f8 [ 47.900117] 81a45d80 88007fa03dd8 810af723 0083 [ 47.900117] 88007fa13580 81a45d80 81a45d80 810c43e0 [ 47.900117] Call Trace: [ 47.900117] IRQ [810781c0] sched_show_task+0xb0/0x110 [ 47.900117] [810af723] print_other_cpu_stall+0x2d3/0x2f0 [ 47.900117] [810c43e0] ? tick_nohz_handler+0xc0/0xc0 [ 47.900117] [810af8b8] __rcu_pending+0x178/0x220 [ 47.900117] [810b02a5] rcu_check_callbacks+0x105/0x190 [ 47.900117] [810c43e0] ? tick_nohz_handler+0xc0/0xc0 [ 47.900117] [810b3582] update_process_times+0x32/0x60 [ 47.900117] [810c41e2] tick_sched_handle+0x32/0x80 [ 47.900117] [810c442d] tick_sched_timer+0x4d/0x90 [ 47.900117] [810b5b57] __run_hrtimer+0xc7/0x1c0 [ 47.900117] [810b5e37] hrtimer_interrupt+0xe7/0x220 [ 47.900117] [a0005000] ? 0xa0005000 [ 47.900117] [81035654] local_apic_timer_interrupt+0x34/0x60 [ 47.900117] [81035d8c] smp_apic_timer_interrupt+0x3c/0x60 [ 47.900117] [814ff02a] apic_timer_interrupt+0x6a/0x70 [ 47.900117] EOI [814fdaa0] ? _raw_spin_lock+0x20/0x30 [ 47.900117] [a000505d] test_init+0x5d/0x1000 [softlockup_test] [ 47.900117] [a0005000] ? 0xa0005000 [ 47.900117] [81000288] do_one_initcall+0xb8/0x1d0 [ 47.900117] [81168794] ? __vunmap+0x94/0xf0 [ 47.900117] [810ce36b] do_init_module+0x2b/0x1b0 [ 47.900117] [810d09b5] load_module+0x585/0x5f0 [ 47.900117] [810cdd40] ? mod_sysfs_teardown+0x150/0x150 [ 47.900117] [810d0bbb] SyS_init_module+0x9b/0xc0 [ 47.900117] [814fe252] system_call_fastpath+0x12/0x17 [ 47.900117] rcu_sched kthread starved for 5252 jiffies! 
After commented out touch_softlockup_watchdog() in touch_nmi_watchdog: / # / # echo 60 /proc/sys/kernel/watchdog_thresh [ 123.011980] NMI watchdog: disabled (cpu0): hardware events not enabled / # busybox insmod softlockup_test.ko [ 150.912055] INFO: rcu_sched detected stalls on CPUs/tasks: [ 150.912055] (detected by 0, t=5252 jiffies, g=-198, c=-199, q=6) [ 150.912055] All QSes seen, last rcu_sched kthread activity 5252 (4294930024-4294924772), jiffies_till_next_fqs=1 Same here, looks like RCU's grace-period kthreads haven't gotten a chance to run for more than 5,000 jiffies. [ 150.912055] busybox R running task042 41 0x0008 [ 150.912055] 81afaf40 88007fa03d98 810781c0 81a5e5f8 [ 150.912055] 81a45d80 88007fa03dd8
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On Wed, Jan 21, 2015 at 05:05:50PM +0800, Zhang Zhen wrote:
> On 2015/1/21 15:02, Paul E. McKenney wrote:
> > On Wed, Jan 21, 2015 at 02:54:05PM +0800, Zhang Zhen wrote:
> > > On 2015/1/21 11:13, Zhang Zhen wrote:
> > > > On 2015/1/21 10:26, Zhang Zhen wrote:
> > > > > On 2015/1/20 23:25, Don Zickus wrote:

[ . . . ]

> > > > Sorry, i made a mistake, the log above is based on v3.10.63.
> > > > I have tested the latest upstream kernel (based on ec6f34e5b552),
> > > > but my test case can't trigger the RCU stall warning.
> > >
> > > I used git bisect to find the bad commit, but found an irrelevant
> > > commit (db5d711e2db776 "zram: avoid null access when fail to alloc meta").
> > > Before this commit, my test case can easily trigger the RCU stall warning,
> > > but after this commit, my test case can't trigger the RCU stall warning.
> >
> > I have lost track of exactly what kernel you are using. If you are
> > using a recent kernel (v3.18 or thereabouts), one thing to try is to
> > apply my stack of RCU CPU stall-warning changes that are on their way
> > in, please see v3.19-rc1..630181c4a915 in -rcu, which is at:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git
> >
> > These handle the problems that Dave Jones and a few others located this
> > past December. Could you please give them a spin?
>
> Yeah, my test case can easily trigger the RCU stall warning based on
> commit 630181c4a915 in your linux-rcu git tree.
> In my .config, CONFIG_RCU_CPU_STALL_TIMEOUT=21.
>
> Before commenting out touch_softlockup_watchdog() in touch_nmi_watchdog():
>
> / #
> / # echo 60 > /proc/sys/kernel/watchdog_thresh
> [   21.885200] NMI watchdog: disabled (cpu0): hardware events not enabled
> / # busybox insmod softlockup_test.ko
> [   47.900117] INFO: rcu_sched detected stalls on CPUs/tasks:
> [   47.900117] (detected by 0, t=5252 jiffies, g=-202, c=-203, q=7)
> [   47.900117] All QSes seen, last rcu_sched kthread activity 5252 (4294904271-4294899019), jiffies_till_next_fqs=1

Looks like you are starving RCU's grace-period kthreads. What happens
if you boot with rcutree.kthread_prio=1 in order to run them at
real-time priority?

> [   47.900117] busybox R running task042 41 0x0008
> [   47.900117]  81afaf40 88007fa03d98 810781c0 81a5e5f8
> [   47.900117]  81a45d80 88007fa03dd8 810af723 0083
> [   47.900117]  88007fa13580 81a45d80 81a45d80 810c43e0
> [   47.900117] Call Trace:
> [   47.900117]  <IRQ>  [810781c0] sched_show_task+0xb0/0x110
> [   47.900117]  [810af723] print_other_cpu_stall+0x2d3/0x2f0
> [   47.900117]  [810c43e0] ? tick_nohz_handler+0xc0/0xc0
> [   47.900117]  [810af8b8] __rcu_pending+0x178/0x220
> [   47.900117]  [810b02a5] rcu_check_callbacks+0x105/0x190
> [   47.900117]  [810c43e0] ? tick_nohz_handler+0xc0/0xc0
> [   47.900117]  [810b3582] update_process_times+0x32/0x60
> [   47.900117]  [810c41e2] tick_sched_handle+0x32/0x80
> [   47.900117]  [810c442d] tick_sched_timer+0x4d/0x90
> [   47.900117]  [810b5b57] __run_hrtimer+0xc7/0x1c0
> [   47.900117]  [810b5e37] hrtimer_interrupt+0xe7/0x220
> [   47.900117]  [a0005000] ? 0xa0005000
> [   47.900117]  [81035654] local_apic_timer_interrupt+0x34/0x60
> [   47.900117]  [81035d8c] smp_apic_timer_interrupt+0x3c/0x60
> [   47.900117]  [814ff02a] apic_timer_interrupt+0x6a/0x70
> [   47.900117]  <EOI>  [814fdaa0] ? _raw_spin_lock+0x20/0x30
> [   47.900117]  [a000505d] test_init+0x5d/0x1000 [softlockup_test]
> [   47.900117]  [a0005000] ? 0xa0005000
> [   47.900117]  [81000288] do_one_initcall+0xb8/0x1d0
> [   47.900117]  [81168794] ? __vunmap+0x94/0xf0
> [   47.900117]  [810ce36b] do_init_module+0x2b/0x1b0
> [   47.900117]  [810d09b5] load_module+0x585/0x5f0
> [   47.900117]  [810cdd40] ? mod_sysfs_teardown+0x150/0x150
> [   47.900117]  [810d0bbb] SyS_init_module+0x9b/0xc0
> [   47.900117]  [814fe252] system_call_fastpath+0x12/0x17
> [   47.900117] rcu_sched kthread starved for 5252 jiffies!
>
> After commenting out touch_softlockup_watchdog() in touch_nmi_watchdog():
>
> / #
> / # echo 60 > /proc/sys/kernel/watchdog_thresh
> [  123.011980] NMI watchdog: disabled (cpu0): hardware events not enabled
> / # busybox insmod softlockup_test.ko
> [  150.912055] INFO: rcu_sched detected stalls on CPUs/tasks:
> [  150.912055] (detected by 0, t=5252 jiffies, g=-198, c=-199, q=6)
> [  150.912055] All QSes seen, last rcu_sched kthread activity 5252 (4294930024-4294924772), jiffies_till_next_fqs=1

Same here, looks like RCU's grace-period kthreads haven't gotten a chance
to run for more than 5,000 jiffies.

> [  150.912055] busybox R running task042 41 0x0008
> [  150.912055]  81afaf40 88007fa03d98 810781c0 81a5e5f8
> [  150.912055]  81a45d80 88007fa03dd8
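[Editorial note: the "All QSes seen, last rcu_sched kthread activity 5252 (4294904271-4294899019)" line above boils down to a jiffies subtraction compared against the stall timeout. The sketch below is a toy Python model of that arithmetic, not the kernel's actual C implementation; the function name and the HZ=250 assumption are illustrative only.]

```python
def rcu_kthread_starved(jiffies_now, last_activity, stall_timeout_jiffies):
    """Toy model: the grace-period kthread counts as starved when it has
    not run for at least one full stall-timeout interval."""
    delta = jiffies_now - last_activity
    return delta, delta >= stall_timeout_jiffies

# Numbers from the log above, assuming HZ=250 and
# CONFIG_RCU_CPU_STALL_TIMEOUT=21, i.e. roughly 21 * 250 = 5250 jiffies.
delta, starved = rcu_kthread_starved(4294904271, 4294899019, 21 * 250)
print(delta, starved)  # -> 5252 True
```

This reproduces the "starved for 5252 jiffies" figure: the two timestamps in parentheses differ by 5252 jiffies, just past the 21-second timeout.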
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On Wed, Jan 21, 2015 at 10:26:27AM +0800, Zhang Zhen wrote:
> > This may not cause other problems but what happens if you comment out the
> > 'touch_softlockup_watchdog' from the touch_nmi_watchdog function like
> > below (based on latest upstream cb59670870)?
> >
> > The idea is that console printing for that cpu won't reset the softlockup
> > detector. Again other bad things might happen and this patch may not be a
> > good final solution, but it can help give me a clue about what is going
> > on.
>
> I commented out the 'touch_softlockup_watchdog' from the
> touch_nmi_watchdog function (based on latest upstream ec6f34e5b552).
> This triggered RCU stall and softlockup, but the softlockup printed only
> once.

Yes, as expected. Instead of flooding the console with the same message,
report only once until the high priority task is re-kicked and can re-arm
the softlockup timer again. So this makes sense. Based on your double
spin lock test, I don't ever expect this to get re-armed, so you should
not expect to see another message.

> As you mentioned, "other bad things" lead to the softlockup printing only
> once. What's the bad thing?

Well, one of the assumptions here is that if you are going to have
interrupts disabled for a long time and purposely kick the hardlockup
detector (with touch_nmi_watchdog), then you can reasonably assume that
softlockups will go off too if you don't also kick the softlockup detector
(even though there is a window where this isn't true: 2*thresh - thresh).

So the patch I gave you can lead to false softlockup warnings.

On the flip side, printing to the console also blocks hardlockups from
showing up. But I believe long ago, Paul adjusted the rcu stalls to be
longer than a hardlockup/softlockup timeout to prevent such scenarios.

I am not sure what to do here. Printing to the console has traditionally
been slow (especially if doing so with interrupts disabled), so a
touch_nmi_watchdog has been necessary.
But a side effect of that is it prevents the lockup detectors from going
off if printing repeatedly.

Now we can hack up rcu stall to only print once until it is re-armed.
This would show rcu stall printing first, followed two minutes later by a
softlockup, which sorta achieves what you want.

However, at the end of the day, an rcu stall, softlockup or hardlockup is
a bad thing; does it matter if you get one or all of them? One should be
enough to start an investigation, no? I don't know. I am leaning towards
'working as expected' and nothing to really fix here.

Does anyone have any suggestions on what assumptions we could change to
handle this better?

Cheers,
Don

> / #
> / # echo 60 > /proc/sys/kernel/watchdog_thresh
> / # busybox insmod softlockup_test.ko &
> / # [   39.044058] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=21002 jiffies, g=78, c=77, q=4)
> [   39.044058] INFO: Stall ended before state dump start
> [  102.049045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=84007 jiffies, g=78, c=77, q=4)
> [  102.049045] INFO: Stall ended before state dump start
> [  160.482123] BUG: soft lockup - CPU#0 stuck for 134s! [busybox:54]
> [  165.054075] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=147012 jiffies, g=78, c=77, q=4)
> [  165.054075] INFO: Stall ended before state dump start
> [  228.059045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=210017 jiffies, g=78, c=77, q=4)
> [  228.059045] INFO: Stall ended before state dump start
> [  291.064099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=273022 jiffies, g=78, c=77, q=4)
> [  291.064099] INFO: Stall ended before state dump start
> [  354.069074] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=336027 jiffies, g=78, c=77, q=4)
> [  354.069099] INFO: Stall ended before state dump start
> [  417.074045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=399032 jiffies, g=78, c=77, q=4)
> [  417.074045] INFO: Stall ended before state dump start
> [  480.079099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=462037 jiffies, g=78, c=77, q=4)
> [  480.079099] INFO: Stall ended before state dump start
> [  543.084099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=525042 jiffies, g=78, c=77, q=4)
> [  543.084099] INFO: Stall ended before state dump start
> [  606.089101] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=588047 jiffies, g=78, c=77, q=4)
> [  606.089101] INFO: Stall ended before state dump start
> [  669.094099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=651052 jiffies, g=78, c=77, q=4)
> [  669.094099] INFO: Stall ended before state dump start
> [  732.099045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=714057 jiffies, g=78, c=77, q=4)
> [  732.099045] INFO: Stall ended before state dump start
> [  795.104074] INFO: rcu_preempt
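[Editorial note: Don's observation that console printing keeps resetting the softlockup detector can be sketched with a toy timing model. This is illustrative Python, not the kernel's C code; the function name is hypothetical and the check is a simplification of the per-cpu timestamp comparison done in kernel/watchdog.c.]

```python
def softlockup_detected(now, touch_ts, thresh):
    """Toy model of the softlockup check: warn only when the per-cpu
    watchdog timestamp has not been touched for more than thresh seconds."""
    return (now - touch_ts) > thresh

THRESH = 60  # echo 60 > /proc/sys/kernel/watchdog_thresh

# CPU spins for 130s and nothing touches the watchdog: the check fires.
print(softlockup_detected(now=130, touch_ts=0, thresh=THRESH))    # -> True

# Same spin, but an RCU stall warning printed at t=125 went through
# touch_nmi_watchdog() -> touch_softlockup_watchdog(), resetting the
# timestamp, so the check at t=130 no longer fires.
print(softlockup_detected(now=130, touch_ts=125, thresh=THRESH))  # -> False
```

In other words, each periodic RCU stall spew re-arms the timestamp before the softlockup deadline expires, which is why the stall messages flood the console while the softlockup warning never (or only once) appears.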
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On 2015/1/21 15:02, Paul E. McKenney wrote:
> On Wed, Jan 21, 2015 at 02:54:05PM +0800, Zhang Zhen wrote:
>> On 2015/1/21 11:13, Zhang Zhen wrote:
>>> On 2015/1/21 10:26, Zhang Zhen wrote:
>>>> On 2015/1/20 23:25, Don Zickus wrote:
>
> [ . . . ]
>
>>> Sorry, i made a mistake, the log above is based on v3.10.63.
>>> I have tested the latest upstream kernel (based on ec6f34e5b552),
>>> but my test case can't trigger the RCU stall warning.
>>>
>>
>> I used git bisect to find the bad commit, but found an irrelevant
>> commit (db5d711e2db776 "zram: avoid null access when fail to alloc meta").
>> Before this commit, my test case can easily trigger the RCU stall warning,
>> but after this commit, my test case can't trigger the RCU stall warning.
>
> I have lost track of exactly what kernel you are using. If you are
> using a recent kernel (v3.18 or thereabouts), one thing to try is to
> apply my stack of RCU CPU stall-warning changes that are on their way
> in, please see v3.19-rc1..630181c4a915 in -rcu, which is at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git
>
> These handle the problems that Dave Jones and a few others located this
> past December. Could you please give them a spin?
>
Ok, i will try. Thanks!

> 							Thanx, Paul
>
>> I'm totally confused.
>>
>> My test case:
>>
>> //
>> #include <linux/kernel.h>
>> #include <linux/module.h>
>> #include <linux/spinlock.h>
>> #include <linux/kthread.h>
>> #include <linux/slab.h>
>>
>> struct foo {
>> 	int a;
>> 	char b;
>> 	long c;
>> };
>>
>> struct foo gbl_foo = {1, 'a', 2};
>> struct foo *pgbl_foo = &gbl_foo;
>>
>> static int my_kthread(void *data)
>> {
>> 	DEFINE_SPINLOCK(foo_mutex);
>> 	struct foo *new_fp;
>> 	struct foo *old_fp;
>>
>> 	new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL);
>> 	spin_lock(&foo_mutex);
>> 	old_fp = pgbl_foo;
>> 	*new_fp = *old_fp;
>> 	new_fp->a = 2;
>> 	rcu_assign_pointer(pgbl_foo, new_fp);
>> 	spin_unlock(&foo_mutex);
>> 	synchronize_rcu();
>> 	kfree(old_fp);
>>
>> 	return 0;
>> }
>>
>> static int hello_start(void)
>> {
>> 	struct task_struct *my_task = NULL;
>> 	DEFINE_SPINLOCK(hello_lock);
>>
>> 	my_task = kthread_run(my_kthread, NULL, "my_thread%d", 1);
>>
>> 	spin_lock_init(&hello_lock);
>> 	spin_lock(&hello_lock);
>> 	spin_lock(&hello_lock);
>> 	return 0;
>> }
>>
>> static int __init test_init(void)
>> {
>> 	hello_start();
>>
>> 	printk(KERN_INFO "Module init\n");
>> 	return 0;
>> }
>>
>> static void __exit test_exit(void)
>> {
>> 	printk(KERN_INFO "Module exit!\n");
>> }
>>
>> module_init(test_init)
>> module_exit(test_exit)
>> MODULE_LICENSE("GPL");
>> //
>
> Cheers,
> Don
>
> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
> index 70bf118..833c015 100644
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -209,7 +209,7 @@ void touch_nmi_watchdog(void)
>  	 * going off.
>  	 */
>  	raw_cpu_write(watchdog_nmi_touch, true);
> -	touch_softlockup_watchdog();
> +	//touch_softlockup_watchdog();
> }
> EXPORT_SYMBOL(touch_nmi_watchdog);

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
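[Editorial note: the test module above wedges the CPU by taking the same non-recursive spinlock twice from one task, so the second spin_lock() busy-waits forever with preemption disabled. The sketch below is a toy Python model of that behavior, illustration only; real kernel spinlocks are not implemented this way, and try_lock() here stands in for what would be an infinite spin.]

```python
class ToySpinLock:
    """Minimal non-recursive spinlock model: a second acquisition by the
    current holder would spin forever, so try_lock() reports failure
    instead of hanging."""
    def __init__(self):
        self.held = False

    def try_lock(self):
        if self.held:
            return False  # a real spin_lock() would busy-wait here forever
        self.held = True
        return True

hello_lock = ToySpinLock()
print(hello_lock.try_lock())  # first spin_lock(&hello_lock): acquired -> True
print(hello_lock.try_lock())  # second spin_lock(&hello_lock): never succeeds -> False
```

Because the spinning task holds the lock it is waiting for, no progress is ever possible: the CPU loops with preemption disabled, which is exactly the condition the softlockup detector and the RCU stall detector are built to report.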
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On 2015/1/20 23:25, Don Zickus wrote:
> On Tue, Jan 20, 2015 at 11:09:19AM +0800, Zhang Zhen wrote:
>>
>>> Of course back then, touch_nmi_watchdog touched all cpus. So a problem
>>> like this was masked. I believe this upstream commit 62572e29bc53, solved
>>> the problem.
>>
>> Thanks for your suggestion.
>>
>> Commit 62572e29bc53 changed the semantics of touch_nmi_watchdog and made it
>> only touch the local cpu, not every one.
>> But watchdog_nmi_touch = true only guarantees no hard lockup check on this
>> cpu.
>>
>> Commit 62572e29bc53 didn't change the semantics of
>> touch_softlockup_watchdog.
>
> Ah, yes. I reviewed the commit too quickly yesterday. I thought
> touch_softlockup_watchdog was called on every cpu and that commit changed
> it to the local cpu. But that was incorrect.
>
>>>
>>> You can apply that commit and see if you get both RCU stall
>>> messages _and_ softlockup messages. I believe that is what you were
>>> expecting, correct?
>>>
>> Correct, i expect i can get both RCU stall messages _and_ softlockup
>> messages.
>> I applied that commit, and i only got RCU stall messages.
>
> Hmm, I believe the act of printing to the console calls touch_nmi_watchdog
> which calls touch_softlockup_watchdog. I think that is the problem there.
>
Yeah, you are right.

> This may not cause other problems but what happens if you comment out the
> 'touch_softlockup_watchdog' from the touch_nmi_watchdog function like
> below (based on latest upstream cb59670870)?
>
> The idea is that console printing for that cpu won't reset the softlockup
> detector. Again other bad things might happen and this patch may not be a
> good final solution, but it can help give me a clue about what is going
> on.

I commented out the 'touch_softlockup_watchdog' from the
touch_nmi_watchdog function (based on latest upstream ec6f34e5b552).
This triggered RCU stall and softlockup, but the softlockup printed only
once.
As you mentioned "other bad things" lead to the softlockup printing only
once. What's the bad thing?

/ #
/ # echo 60 > /proc/sys/kernel/watchdog_thresh
/ # busybox insmod softlockup_test.ko &
/ # [   39.044058] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=21002 jiffies, g=78, c=77, q=4)
[   39.044058] INFO: Stall ended before state dump start
[  102.049045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=84007 jiffies, g=78, c=77, q=4)
[  102.049045] INFO: Stall ended before state dump start
[  160.482123] BUG: soft lockup - CPU#0 stuck for 134s! [busybox:54]
[  165.054075] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=147012 jiffies, g=78, c=77, q=4)
[  165.054075] INFO: Stall ended before state dump start
[  228.059045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=210017 jiffies, g=78, c=77, q=4)
[  228.059045] INFO: Stall ended before state dump start
[  291.064099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=273022 jiffies, g=78, c=77, q=4)
[  291.064099] INFO: Stall ended before state dump start
[  354.069074] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=336027 jiffies, g=78, c=77, q=4)
[  354.069099] INFO: Stall ended before state dump start
[  417.074045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=399032 jiffies, g=78, c=77, q=4)
[  417.074045] INFO: Stall ended before state dump start
[  480.079099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=462037 jiffies, g=78, c=77, q=4)
[  480.079099] INFO: Stall ended before state dump start
[  543.084099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=525042 jiffies, g=78, c=77, q=4)
[  543.084099] INFO: Stall ended before state dump start
[  606.089101] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=588047 jiffies, g=78, c=77, q=4)
[  606.089101] INFO: Stall ended before state dump start
[  669.094099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=651052 jiffies, g=78, c=77, q=4)
[  669.094099] INFO: Stall ended before state dump start
[  732.099045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=714057 jiffies, g=78, c=77, q=4)
[  732.099045] INFO: Stall ended before state dump start
[  795.104074] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=777062 jiffies, g=78, c=77, q=4)
[  795.104098] INFO: Stall ended before state dump start
[  858.109046] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=840067 jiffies, g=78, c=77, q=4)
[  858.109046] INFO: Stall ended before state dump start
[  921.114100] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=903072 jiffies, g=78, c=77, q=4)
[  921.114100] INFO: Stall ended before state dump start
[  984.119099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=966077 jiffies, g=78, c=77, q=4)
[  984.119099] INFO: Stall
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On Tue, Jan 20, 2015 at 11:09:19AM +0800, Zhang Zhen wrote:
>
>> Of course back then, touch_nmi_watchdog touched all cpus. So a problem
>> like this was masked. I believe this upstream commit 62572e29bc53, solved
>> the problem.
>
> Thanks for your suggestion.
>
> Commit 62572e29bc53 changed the semantics of touch_nmi_watchdog and made it
> only touch the local cpu, not every one.
> But watchdog_nmi_touch = true only guarantees no hard lockup check on this cpu.
>
> Commit 62572e29bc53 didn't change the semantics of touch_softlockup_watchdog.

Ah, yes. I reviewed the commit too quickly yesterday. I thought
touch_softlockup_watchdog was called on every cpu and that commit changed
it to the local cpu. But that was incorrect.

>>
>> You can apply that commit and see if you get both RCU stall
>> messages _and_ softlockup messages. I believe that is what you were
>> expecting, correct?
>>
> Correct, i expect i can get both RCU stall messages _and_ softlockup
> messages.
> I applied that commit, and i only got RCU stall messages.

Hmm, I believe the act of printing to the console calls touch_nmi_watchdog
which calls touch_softlockup_watchdog. I think that is the problem there.

This may not cause other problems but what happens if you comment out the
'touch_softlockup_watchdog' from the touch_nmi_watchdog function like
below (based on latest upstream cb59670870)?

The idea is that console printing for that cpu won't reset the softlockup
detector. Again other bad things might happen and this patch may not be a
good final solution, but it can help give me a clue about what is going
on.

Cheers,
Don

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 70bf118..833c015 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -209,7 +209,7 @@ void touch_nmi_watchdog(void)
 	 * going off.
 	 */
 	raw_cpu_write(watchdog_nmi_touch, true);
-	touch_softlockup_watchdog();
+	//touch_softlockup_watchdog();
 }
 EXPORT_SYMBOL(touch_nmi_watchdog);
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On 2015/1/20 23:25, Don Zickus wrote:
> On Tue, Jan 20, 2015 at 11:09:19AM +0800, Zhang Zhen wrote:

[ . . . ]

> Hmm, I believe the act of printing to the console calls touch_nmi_watchdog
> which calls touch_softlockup_watchdog. I think that is the problem there.

Yeah, you are right.

> This may not cause other problems, but what happens if you comment out the
> 'touch_softlockup_watchdog' from the touch_nmi_watchdog function like below
> (based on latest upstream cb59670870)?
>
> The idea is that console printing for that cpu won't reset the softlockup
> detector. Again, other bad things might happen and this patch may not be a
> good final solution, but it can help give me a clue about what is going on.

I commented out the 'touch_softlockup_watchdog' from the touch_nmi_watchdog
function (based on latest upstream ec6f34e5b552).
This triggered both the RCU stall and the softlockup, but the softlockup
message was printed only once.

As you mentioned, "other bad things" lead to the softlockup being printed
only once. What's the bad thing ?
/ #
/ # echo 60 > /proc/sys/kernel/watchdog_thresh
/ # busybox insmod softlockup_test.ko
/ # [ 39.044058] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=21002 jiffies, g=78, c=77, q=4)
[ 39.044058] INFO: Stall ended before state dump start
[ 102.049045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=84007 jiffies, g=78, c=77, q=4)
[ 102.049045] INFO: Stall ended before state dump start
[ 160.482123] BUG: soft lockup - CPU#0 stuck for 134s! [busybox:54]
[ 165.054075] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=147012 jiffies, g=78, c=77, q=4)
[ 165.054075] INFO: Stall ended before state dump start
[ 228.059045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=210017 jiffies, g=78, c=77, q=4)
[ 228.059045] INFO: Stall ended before state dump start
[ 291.064099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=273022 jiffies, g=78, c=77, q=4)
[ 291.064099] INFO: Stall ended before state dump start
[ 354.069074] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=336027 jiffies, g=78, c=77, q=4)
[ 354.069099] INFO: Stall ended before state dump start
[ 417.074045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=399032 jiffies, g=78, c=77, q=4)
[ 417.074045] INFO: Stall ended before state dump start
[ 480.079099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=462037 jiffies, g=78, c=77, q=4)
[ 480.079099] INFO: Stall ended before state dump start
[ 543.084099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=525042 jiffies, g=78, c=77, q=4)
[ 543.084099] INFO: Stall ended before state dump start
[ 606.089101] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=588047 jiffies, g=78, c=77, q=4)
[ 606.089101] INFO: Stall ended before state dump start
[ 669.094099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=651052 jiffies, g=78, c=77, q=4)
[ 669.094099] INFO: Stall ended before state dump start
[ 732.099045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=714057 jiffies, g=78, c=77, q=4)
[ 732.099045] INFO: Stall ended before state dump start
[ 795.104074] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=777062 jiffies, g=78, c=77, q=4)
[ 795.104098] INFO: Stall ended before state dump start
[ 858.109046] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=840067 jiffies, g=78, c=77, q=4)
[ 858.109046] INFO: Stall ended before state dump start
[ 921.114100] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=903072 jiffies, g=78, c=77, q=4)
[ 921.114100] INFO: Stall ended before state dump start
[ 984.119099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=966077 jiffies, g=78, c=77, q=4)
[ 984.119099] INFO: Stall ended before state dump start
[ 1047.124075] INFO: rcu_preempt detected
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On Wed, Jan 21, 2015 at 02:54:05PM +0800, Zhang Zhen wrote:
> On 2015/1/21 11:13, Zhang Zhen wrote:
> > On 2015/1/21 10:26, Zhang Zhen wrote:
> > > On 2015/1/20 23:25, Don Zickus wrote:

[ . . . ]

> Sorry, I made a mistake, the log above is based on v3.10.63.
> I have tested the latest upstream kernel (based on ec6f34e5b552), but my
> test case can't trigger the RCU stall warning.
> I used git bisect to find the bad commit, but found an irrelevant commit
> (db5d711e2db776 "zram: avoid null access when fail to alloc meta").
> Before this commit, my test case could easily trigger the RCU stall
> warning, but after this commit, my test case can't trigger it.

I have lost track of exactly what kernel you are using. If you are using a
recent kernel (v3.18 or thereabouts), one thing to try is to apply my stack
of RCU CPU stall-warning changes that are on their way in, please see
v3.19-rc1..630181c4a915 in -rcu, which is at:

git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git

These handle the problems that Dave Jones and a few others located this
past December. Could you please give them a spin?

							Thanx, Paul

I'm totally confused.
My test case:
//
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/spinlock.h>
#include <linux/slab.h>
#include <linux/kthread.h>

struct foo {
	int a;
	char b;
	long c;
};

struct foo gbl_foo = {1, 'a', 2};
struct foo *pgbl_foo = &gbl_foo;

static int my_kthread(void *data)
{
	DEFINE_SPINLOCK(foo_mutex);
	struct foo *new_fp;
	struct foo *old_fp;

	new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL);
	spin_lock(&foo_mutex);
	old_fp = pgbl_foo;
	*new_fp = *old_fp;
	new_fp->a = 2;
	rcu_assign_pointer(pgbl_foo, new_fp);
	spin_unlock(&foo_mutex);
	synchronize_rcu();
	kfree(old_fp);
	return 0;
}

static int hello_start(void)
{
	struct task_struct *my_task = NULL;
	DEFINE_SPINLOCK(hello_lock);

	my_task = kthread_run(my_kthread, NULL, "my_thread%d", 1);
	spin_lock_init(&hello_lock);
	spin_lock(&hello_lock);
	spin_lock(&hello_lock);
	return 0;
}

static int __init test_init(void)
{
	hello_start();
	printk(KERN_INFO "Module init\n");
	return 0;
}

static void __exit test_exit(void)
{
	printk(KERN_INFO "Module exit!\n");
}

module_init(test_init)
module_exit(test_exit)
MODULE_LICENSE("GPL");
//

Cheers,
Don

diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 70bf118..833c015 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -209,7 +209,7 @@ void touch_nmi_watchdog(void)
 	 * going off.
 	 */
 	raw_cpu_write(watchdog_nmi_touch, true);
-	touch_softlockup_watchdog();
+	//touch_softlockup_watchdog();
 }
 EXPORT_SYMBOL(touch_nmi_watchdog);
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On 2015/1/21 15:02, Paul E. McKenney wrote:
> On Wed, Jan 21, 2015 at 02:54:05PM +0800, Zhang Zhen wrote:

[ . . . ]

> I have lost track of exactly what kernel you are using. If you are using a
> recent kernel (v3.18 or thereabouts), one thing to try is to apply my stack
> of RCU CPU stall-warning changes that are on their way in, please see
> v3.19-rc1..630181c4a915 in -rcu, which is at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu.git
>
> These handle the problems that Dave Jones and a few others located this
> past December. Could you please give them a spin?

Ok, I will try. Thanks!
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On Tue, Jan 20, 2015 at 11:17:01AM +0800, Zhang Zhen wrote: > On 2015/1/19 19:09, Paul E. McKenney wrote: > > On Mon, Jan 19, 2015 at 05:04:29PM +0800, Zhang Zhen wrote: > >> On 2015/1/19 16:42, Paul E. McKenney wrote: > >>> On Mon, Jan 19, 2015 at 04:07:15PM +0800, Zhang Zhen wrote: > Hi, > > On my x86_64 qemu virtual machine, RCU CPU stall console spews may > lead to soft lockup disabled. > > If softlockup_thresh > rcu_cpu_stall_timeout (softlockup_thresh = 2 * > watchdog_thresh): > > / # > / # busybox cat /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout > 21 > / # echo 60 > /proc/sys/kernel/watchdog_thresh > / # busybox insmod softlockup_test.ko > [ 44.959044] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > (detected by 0, t=21002 jiffies, g=85, c=84, q=4) > [ 44.959044] INFO: Stall ended before state dump start > [ 107.964045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > (detected by 0, t=84007 jiffies, g=85, c=84, q=4) > [ 107.964045] INFO: Stall ended before state dump start > [ 170.969060] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > (detected by 0, t=147012 jiffies, g=85, c=84, q=4) > [ 170.969060] INFO: Stall ended before state dump start > [ 233.974057] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > (detected by 0, t=210017 jiffies, g=85, c=84, q=4) > [ 233.974057] INFO: Stall ended before state dump start > [ 296.979059] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > (detected by 0, t=273022 jiffies, g=85, c=84, q=4) > [ 296.979059] INFO: Stall ended before state dump start > [ 359.984058] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > (detected by 0, t=336027 jiffies, g=85, c=84, q=4) > [ 359.984058] INFO: Stall ended before state dump start > [ 422.989059] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > (detected by 0, t=399032 jiffies, g=85, c=84, q=4) > [ 422.989059] INFO: Stall ended before state dump start > [ 485.994056] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > (detected by 0, 
t=462037 jiffies, g=85, c=84, q=4) > [ 485.994056] INFO: Stall ended before state dump start > [ 548.999059] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > (detected by 0, t=525042 jiffies, g=85, c=84, q=4) > [ 548.999059] INFO: Stall ended before state dump start > [ 612.004061] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > (detected by 0, t=588047 jiffies, g=85, c=84, q=4) > [ 612.004061] INFO: Stall ended before state dump start > [ 675.009058] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > (detected by 0, t=651052 jiffies, g=85, c=84, q=4) > [ 675.009058] INFO: Stall ended before state dump start > > If softlockup_thresh < rcu_cpu_stall_timeout: > > / # > / # busybox cat /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout > 21 > / # echo 5 > /proc/sys/kernel/watchdog_thresh > / # busybox insmod softlockup_test.ko > [ 38.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53] > [ 52.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53] > [ 66.450073] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53] > [ 80.450060] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53] > [ 94.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53] > > The softlockup_test.ko source code is: > // > #include > #include > #include > #include > > static int hello_start(void) > { > DEFINE_SPINLOCK(hello_lock); > spin_lock_init(&hello_lock); > spin_lock(&hello_lock); > spin_lock(&hello_lock); > >>> > >>> Did you really intend to acquire the same spinlock twice in a row, > >>> forcing a self-deadlock? If not, I of course suggest changing the second > >>> "spin_lock()" to "spin_unlock()". > >> > >> Yes, I acquire the same spinlock twice in order to reproduce the problem. > > > > Good, I was wondering about that. ;-) > > > >>> If your .config has CONFIG_TREE_RCU=y, the above is quite likely to > >>> give you an RCU CPU stall warning. > >> > >> In my .config CONFIG_TREE_RCU=y. > > > > Which is consistent.
> > >> If softlockup_thresh < rcu_cpu_stall_timeout, it will give a soft lockup >> warning. >> If softlockup_thresh > rcu_cpu_stall_timeout, it will likely give an RCU >> CPU stall warning >> just like above and not give a soft lockup warning. >> >> It means that RCU CPU stall console spews lead to soft lockup being disabled. >> Is this reasonable ? > > It depends. You will often see both of them, but they can interfere > with each other, especially if all these messages are going across a > serial line. And sometimes the activity of the one will suppress the > other, though I would
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On 2015/1/19 19:09, Paul E. McKenney wrote:
> On Mon, Jan 19, 2015 at 05:04:29PM +0800, Zhang Zhen wrote:
>> On 2015/1/19 16:42, Paul E. McKenney wrote:
>>> On Mon, Jan 19, 2015 at 04:07:15PM +0800, Zhang Zhen wrote:

[ . . . ]

>> If softlockup_thresh < rcu_cpu_stall_timeout, it will give a soft lockup
>> warning.
>> If softlockup_thresh > rcu_cpu_stall_timeout, it will likely give an RCU
>> CPU stall warning just like above and not give a soft lockup warning.
>>
>> It means that RCU CPU stall console spews lead to soft lockup being
>> disabled. Is this reasonable ?
>
> It depends. You will often see both of them, but they can interfere
> with each other, especially if all these messages are going across a
> serial line. And sometimes the activity of the one will suppress the
> other, though I would not expect that in your spinlock deadlock case.

Ok, what I expect is to get both RCU stall messages _and_ softlockup
messages even though all these messages are going across a serial line.
But in my test case the RCU stall messages suppress the softlockup
messages. The simplest way is
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On 2015/1/19 22:06, Don Zickus wrote: > On Mon, Jan 19, 2015 at 05:04:29PM +0800, Zhang Zhen wrote: >>> >>> Did you really intend to acquire the same spinlock twice in a row, >>> forcing a self-deadlock? If not, I of course suggest changing the second >>> "spin_lock()" to "spin_unlock()". >>> >> >> Yes, I acquire the same spinlock twice in order to reproduce the problem. >> >>> If your .config has CONFIG_TREE_RCU=y, the above is quite likely to >>> give you an RCU CPU stall warning. >>> >> >> In my .config CONFIG_TREE_RCU=y. >> >> If softlockup_thresh < rcu_cpu_stall_timeout, it will give a soft lockup >> warning. >> If softlockup_thresh > rcu_cpu_stall_timeout, it will likely give an RCU CPU >> stall warning >> just like above and no soft lockup warning. >> >> It means that RCU CPU stall console spews leads to soft lockup disabled. >> Is this reasonable ? > > I believe so. In kernel v3.10.., all activity to the console executed > touch_nmi_watchdog() which calls touch_softlockup_watchdog, which delayed > the softlockup for another round of 'softlockup_thresh'. > Yeah, you are right. It's the real reason. > Of course back then, touch_nmi_watchdog touched all cpus. So a problem > like this was masked. I believe this upstream commit 62572e29bc53 solved > the problem. Thanks for your suggestion. Commit 62572e29bc53 changed the semantics of touch_nmi_watchdog and makes it only touch the local cpu, not every one. But watchdog_nmi_touch = true only guarantees no hard lockup check on this cpu. Commit 62572e29bc53 didn't change the semantics of touch_softlockup_watchdog. > > You can apply that commit and see if you get both RCU stall > messages _and_ softlockup messages. I believe that is what you were > expecting, correct? > Correct, I expect I can get both RCU stall messages _and_ softlockup messages. I applied that commit, and I only got RCU stall messages. 
/ # / # echo 60 > /proc/sys/kernel/watchdog_thresh / # busybox insmod softlockup_test.ko [ 35.344060] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=21002 jiffies, g=85, c=84, q=4) [ 35.344060] INFO: Stall ended before state dump start [ 98.349079] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=84007 jiffies, g=85, c=84, q=4) [ 98.349079] INFO: Stall ended before state dump start [ 161.354100] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=147012 jiffies, g=85, c=84, q=4) [ 161.354100] INFO: Stall ended before state dump start > > Of course, on a non-virt guest, your test case would normally trigger a > hardlockup warning first. And a later kernel version for the guest may > actually do that (not quite sure if the emulated PMU stuff is upstream or > not yet). Just to set your expectations correctly. > Yes, on a non-virt guest, my test case triggered a hardlockup warning first. Best regards! > Cheers, > Don > >> >> Thanks! >> return 0; } static int __init test_init(void) { hello_start(); printk(KERN_INFO "Module init\n"); return 0; } static void __exit test_exit(void) { printk(KERN_INFO "Module exit!\n"); } module_init(test_init) module_exit(test_exit) MODULE_LICENSE("GPL"); // My kernel version is v3.10.63, and I checked the kernel source code, update_process_times -> run_local_timers -> hrtimer_run_queues -> __run_hrtimer -> watchdog_timer_fn -> is_softlockup -> rcu_check_callbacks -> __rcu_pending -> check_cpu_stall -> print_cpu_stall If softlockup_thresh > rcu_cpu_stall_timeout, print_cpu_stall will print a log to the serial port. The 8250 serial driver will call serial8250_console_write => touch_nmi_watchdog() which resets watchdog_touch_ts to 0. So the softlockup will not be triggered. Is this reasonable? Why? >>> >>> Is exactly what reasonable? ;-) >>> >>> Yes, it is reasonable that your code triggers an RCU CPU stall warning. 
>>> >>> No, it is not reasonable that the RCU CPU stall warning does not include >>> a stack trace, and the fix for that bug will be going into the next merge >>> window. >>> >>> Yes, it is reasonable that varying the softlockup and RCU CPU stall >>> timeouts might change the behavior. >>> >>> No, your code is not reasonable, except perhaps as a test of the >>> generation of softlockup and RCU CPU stall warnings. If you are not >>> trying to test softlockup and RCU CPU stall warnings, you should of course >>> not try to acquire any non-recursive exclusive lock that you already hold. >>> If it is not reasonable, we should adjust the printk loglevel from *KERN_ERR* to *KERN_INFO* in print_cpu_stall.
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On Mon, Jan 19, 2015 at 05:04:29PM +0800, Zhang Zhen wrote: > > > > Did you really intend to acquire the same spinlock twice in a row, > > forcing a self-deadlock? If not, I of course suggest changing the second > > "spin_lock()" to "spin_unlock()". > > > > Yes, I acquire the same spinlock twice in order to reproduce the problem. > > > If your .config has CONFIG_TREE_RCU=y, the above is quite likely to > > give you an RCU CPU stall warning. > > > > In my .config CONFIG_TREE_RCU=y. > > If softlockup_thresh < rcu_cpu_stall_timeout, it will give a soft lockup > warning. > If softlockup_thresh > rcu_cpu_stall_timeout, it will likely give an RCU CPU > stall warning > just like above and no soft lockup warning. > > It means that RCU CPU stall console spews leads to soft lockup disabled. > Is this reasonable ? I believe so. In kernel v3.10.., all activity to the console executed touch_nmi_watchdog() which calls touch_softlockup_watchdog, which delayed the softlockup for another round of 'softlockup_thresh'. Of course back then, touch_nmi_watchdog touched all cpus. So a problem like this was masked. I believe this upstream commit 62572e29bc53 solved the problem. You can apply that commit and see if you get both RCU stall messages _and_ softlockup messages. I believe that is what you were expecting, correct? Of course, on a non-virt guest, your test case would normally trigger a hardlockup warning first. And a later kernel version for the guest may actually do that (not quite sure if the emulated PMU stuff is upstream or not yet). Just to set your expectations correctly. Cheers, Don > > Thanks! 
> > >> return 0; > >> } > >> > >> static int __init test_init(void) > >> { > >> hello_start(); > >> > >> printk(KERN_INFO "Module init\n"); > >> return 0; > >> } > >> > >> static void __exit test_exit(void) > >> { > >> printk(KERN_INFO "Module exit!\n"); > >> } > >> > >> module_init(test_init) > >> module_exit(test_exit) > >> MODULE_LICENSE("GPL"); > >> // > >> > >> My kernel version is v3.10.63, and I checked the kernel source code, > >> > >> update_process_times > >>-> run_local_timers > >>-> hrtimer_run_queues > >>-> __run_hrtimer > >>-> watchdog_timer_fn > >>-> is_softlockup > >> > >>-> rcu_check_callbacks > >>-> __rcu_pending > >>-> check_cpu_stall > >>-> print_cpu_stall > >> > >> If softlockup_thresh > rcu_cpu_stall_timeout, print_cpu_stall will print > >> a log to the serial port. > >> > >> The 8250 serial driver will call serial8250_console_write => > >> touch_nmi_watchdog() which resets > >> watchdog_touch_ts to 0. So the softlockup will not be triggered. > >> > >> Is this reasonable? Why? > > > > Is exactly what reasonable? ;-) > > > > Yes, it is reasonable that your code triggers an RCU CPU stall warning. > > > > No, it is not reasonable that the RCU CPU stall warning does not include > > a stack trace, and the fix for that bug will be going into the next merge > > window. > > > > Yes, it is reasonable that varying the softlockup and RCU CPU stall > > timeouts might change the behavior. > > > > No, your code is not reasonable, except perhaps as a test of the > > generation of softlockup and RCU CPU stall warnings. If you are not > > trying to test softlockup and RCU CPU stall warnings, you should of course > > not try to acquire any non-recursive exclusive lock that you already hold. > > > >> If it is not reasonable, we should adjust the printk loglevel from > >> *KERN_ERR* to *KERN_INFO* > >> in print_cpu_stall. 
> > > > Given that RCU CPU stall warnings are supposed to be pointing out errors > > elsewhere in the kernel, and in this case are pointing out errors elsewhere > > in the kernel, namely in your hello_start() function, it is reasonable > > that the RCU CPU stall warnings use the KERN_ERR loglevel. > > > > Or am I missing something here? > > > > Thanx, Paul
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On Mon, Jan 19, 2015 at 05:04:29PM +0800, Zhang Zhen wrote: > On 2015/1/19 16:42, Paul E. McKenney wrote: > > On Mon, Jan 19, 2015 at 04:07:15PM +0800, Zhang Zhen wrote: > >> Hi, > >> > >> On my x86_64 qemu virtual machine, RCU CPU stall console spews may > >> lead to soft lockup disabled. > >> > >> If softlockup_thresh > rcu_cpu_stall_timeout (softlockup_thresh = 2 * > >> watchdog_thresh): > >> > >> / # > >> / # busybox cat /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout > >> 21 > >> / # echo 60 > /proc/sys/kernel/watchdog_thresh > >> / # busybox insmod softlockup_test.ko > >> [ 44.959044] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > >> (detected by 0, t=21002 jiffies, g=85, c=84, q=4) > >> [ 44.959044] INFO: Stall ended before state dump start > >> [ 107.964045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > >> (detected by 0, t=84007 jiffies, g=85, c=84, q=4) > >> [ 107.964045] INFO: Stall ended before state dump start > >> [ 170.969060] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > >> (detected by 0, t=147012 jiffies, g=85, c=84, q=4) > >> [ 170.969060] INFO: Stall ended before state dump start > >> [ 233.974057] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > >> (detected by 0, t=210017 jiffies, g=85, c=84, q=4) > >> [ 233.974057] INFO: Stall ended before state dump start > >> [ 296.979059] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > >> (detected by 0, t=273022 jiffies, g=85, c=84, q=4) > >> [ 296.979059] INFO: Stall ended before state dump start > >> [ 359.984058] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > >> (detected by 0, t=336027 jiffies, g=85, c=84, q=4) > >> [ 359.984058] INFO: Stall ended before state dump start > >> [ 422.989059] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > >> (detected by 0, t=399032 jiffies, g=85, c=84, q=4) > >> [ 422.989059] INFO: Stall ended before state dump start > >> [ 485.994056] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > >> (detected by 0, t=462037 
jiffies, g=85, c=84, q=4) > >> [ 485.994056] INFO: Stall ended before state dump start > >> [ 548.999059] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > >> (detected by 0, t=525042 jiffies, g=85, c=84, q=4) > >> [ 548.999059] INFO: Stall ended before state dump start > >> [ 612.004061] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > >> (detected by 0, t=588047 jiffies, g=85, c=84, q=4) > >> [ 612.004061] INFO: Stall ended before state dump start > >> [ 675.009058] INFO: rcu_preempt detected stalls on CPUs/tasks: {} > >> (detected by 0, t=651052 jiffies, g=85, c=84, q=4) > >> [ 675.009058] INFO: Stall ended before state dump start > >> > >> If softlockup_thresh < rcu_cpu_stall_timeout: > >> > >> / # > >> / # busybox cat /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout > >> 21 > >> / # echo 5 > /proc/sys/kernel/watchdog_thresh > >> / # busybox insmod softlockup_test.ko > >> [ 38.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53] > >> [ 52.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53] > >> [ 66.450073] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53] > >> [ 80.450060] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53] > >> [ 94.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53] > >> > >> The softlockup_test.ko source code is: > >> // > >> #include <linux/kernel.h> > >> #include <linux/module.h> > >> #include <linux/spinlock.h> > >> #include <linux/slab.h> > >> > >> static int hello_start(void) > >> { > >> DEFINE_SPINLOCK(hello_lock); > >> spin_lock_init(&hello_lock); > >> spin_lock(&hello_lock); > >> spin_lock(&hello_lock); > > > > Did you really intend to acquire the same spinlock twice in a row, > > forcing a self-deadlock? If not, I of course suggest changing the second > > "spin_lock()" to "spin_unlock()". > > Yes, I acquire the same spinlock twice in order to reproduce the problem. Good, I was wondering about that. ;-) > > If your .config has CONFIG_TREE_RCU=y, the above is quite likely to > > give you an RCU CPU stall warning. > > In my .config CONFIG_TREE_RCU=y. Which is consistent. 
> If softlockup_thresh < rcu_cpu_stall_timeout, it will give a soft lockup > warning. > If softlockup_thresh > rcu_cpu_stall_timeout, it will likely give an RCU CPU > stall warning > just like above and no soft lockup warning. > > It means that RCU CPU stall console spews leads to soft lockup disabled. > Is this reasonable ? It depends. You will often see both of them, but they can interfere with each other, especially if all these messages are going across a serial line. And sometimes the activity of the one will suppress the other, though I would not expect that in your spinlock deadlock case. Thanx, Paul > Thanks! > > >> return 0; > >> } > >> > >> static int __init test_init(void) > >> { > >> hello_start(); > >> > >> printk(KERN_INFO "Module init\n"); > >> return 0; > >> } > >> > >> static void
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On 2015/1/19 16:42, Paul E. McKenney wrote: > On Mon, Jan 19, 2015 at 04:07:15PM +0800, Zhang Zhen wrote: >> Hi, >> >> On my x86_64 qemu virtual machine, RCU CPU stall console spews may >> lead to soft lockup disabled. >> >> If softlockup_thresh > rcu_cpu_stall_timeout (softlockup_thresh = 2 * >> watchdog_thresh): >> >> / # >> / # busybox cat /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout >> 21 >> / # echo 60 > /proc/sys/kernel/watchdog_thresh >> / # busybox insmod softlockup_test.ko >> [ 44.959044] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected >> by 0, t=21002 jiffies, g=85, c=84, q=4) >> [ 44.959044] INFO: Stall ended before state dump start >> [ 107.964045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected >> by 0, t=84007 jiffies, g=85, c=84, q=4) >> [ 107.964045] INFO: Stall ended before state dump start >> [ 170.969060] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected >> by 0, t=147012 jiffies, g=85, c=84, q=4) >> [ 170.969060] INFO: Stall ended before state dump start >> [ 233.974057] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected >> by 0, t=210017 jiffies, g=85, c=84, q=4) >> [ 233.974057] INFO: Stall ended before state dump start >> [ 296.979059] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected >> by 0, t=273022 jiffies, g=85, c=84, q=4) >> [ 296.979059] INFO: Stall ended before state dump start >> [ 359.984058] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected >> by 0, t=336027 jiffies, g=85, c=84, q=4) >> [ 359.984058] INFO: Stall ended before state dump start >> [ 422.989059] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected >> by 0, t=399032 jiffies, g=85, c=84, q=4) >> [ 422.989059] INFO: Stall ended before state dump start >> [ 485.994056] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected >> by 0, t=462037 jiffies, g=85, c=84, q=4) >> [ 485.994056] INFO: Stall ended before state dump start >> [ 548.999059] INFO: rcu_preempt detected stalls 
on CPUs/tasks: {} (detected >> by 0, t=525042 jiffies, g=85, c=84, q=4) >> [ 548.999059] INFO: Stall ended before state dump start >> [ 612.004061] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected >> by 0, t=588047 jiffies, g=85, c=84, q=4) >> [ 612.004061] INFO: Stall ended before state dump start >> [ 675.009058] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected >> by 0, t=651052 jiffies, g=85, c=84, q=4) >> [ 675.009058] INFO: Stall ended before state dump start >> >> If softlockup_thresh < rcu_cpu_stall_timeout: >> >> / # >> / # busybox cat /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout >> 21 >> / # echo 5 > /proc/sys/kernel/watchdog_thresh >> / # busybox insmod softlockup_test.ko >> [ 38.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53] >> [ 52.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53] >> [ 66.450073] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53] >> [ 80.450060] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53] >> [ 94.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53] >> >> The softlockup_test.ko source code is: >> // >> #include <linux/kernel.h> >> #include <linux/module.h> >> #include <linux/spinlock.h> >> #include <linux/slab.h> >> >> static int hello_start(void) >> { >> DEFINE_SPINLOCK(hello_lock); >> spin_lock_init(&hello_lock); >> spin_lock(&hello_lock); >> spin_lock(&hello_lock); > > Did you really intend to acquire the same spinlock twice in a row, > forcing a self-deadlock? If not, I of course suggest changing the second > "spin_lock()" to "spin_unlock()". > Yes, I acquire the same spinlock twice in order to reproduce the problem. > If your .config has CONFIG_TREE_RCU=y, the above is quite likely to > give you an RCU CPU stall warning. > In my .config CONFIG_TREE_RCU=y. If softlockup_thresh < rcu_cpu_stall_timeout, it will give a soft lockup warning. If softlockup_thresh > rcu_cpu_stall_timeout, it will likely give an RCU CPU stall warning just like above and no soft lockup warning. 
It means that RCU CPU stall console spews leads to soft lockup disabled. Is this reasonable ? Thanks! >> return 0; >> } >> >> static int __init test_init(void) >> { >> hello_start(); >> >> printk(KERN_INFO "Module init\n"); >> return 0; >> } >> >> static void __exit test_exit(void) >> { >> printk(KERN_INFO "Module exit!\n"); >> } >> >> module_init(test_init) >> module_exit(test_exit) >> MODULE_LICENSE("GPL"); >> // >> >> My kernel version is v3.10.63, and I checked the kernel source code, >> >> update_process_times >> -> run_local_timers >> -> hrtimer_run_queues >> -> __run_hrtimer >> -> watchdog_timer_fn >> -> is_softlockup >> >> -> rcu_check_callbacks >> -> __rcu_pending >> -> check_cpu_stall >> -> print_cpu_stall
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On Mon, Jan 19, 2015 at 04:07:15PM +0800, Zhang Zhen wrote: > Hi, > > On my x86_64 qemu virtual machine, RCU CPU stall console spews may > lead to soft lockup disabled. > > If softlockup_thresh > rcu_cpu_stall_timeout (softlockup_thresh = 2 * > watchdog_thresh): > > / # > / # busybox cat /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout > 21 > / # echo 60 > /proc/sys/kernel/watchdog_thresh > / # busybox insmod softlockup_test.ko > [ 44.959044] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected > by 0, t=21002 jiffies, g=85, c=84, q=4) > [ 44.959044] INFO: Stall ended before state dump start > [ 107.964045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected > by 0, t=84007 jiffies, g=85, c=84, q=4) > [ 107.964045] INFO: Stall ended before state dump start > [ 170.969060] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected > by 0, t=147012 jiffies, g=85, c=84, q=4) > [ 170.969060] INFO: Stall ended before state dump start > [ 233.974057] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected > by 0, t=210017 jiffies, g=85, c=84, q=4) > [ 233.974057] INFO: Stall ended before state dump start > [ 296.979059] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected > by 0, t=273022 jiffies, g=85, c=84, q=4) > [ 296.979059] INFO: Stall ended before state dump start > [ 359.984058] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected > by 0, t=336027 jiffies, g=85, c=84, q=4) > [ 359.984058] INFO: Stall ended before state dump start > [ 422.989059] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected > by 0, t=399032 jiffies, g=85, c=84, q=4) > [ 422.989059] INFO: Stall ended before state dump start > [ 485.994056] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected > by 0, t=462037 jiffies, g=85, c=84, q=4) > [ 485.994056] INFO: Stall ended before state dump start > [ 548.999059] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected > by 0, t=525042 jiffies, g=85, c=84, q=4) > [ 
548.999059] INFO: Stall ended before state dump start > [ 612.004061] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected > by 0, t=588047 jiffies, g=85, c=84, q=4) > [ 612.004061] INFO: Stall ended before state dump start > [ 675.009058] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected > by 0, t=651052 jiffies, g=85, c=84, q=4) > [ 675.009058] INFO: Stall ended before state dump start > > If softlockup_thresh < rcu_cpu_stall_timeout: > > / # > / # busybox cat /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout > 21 > / # echo 5 > /proc/sys/kernel/watchdog_thresh > / # busybox insmod softlockup_test.ko > [ 38.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53] > [ 52.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53] > [ 66.450073] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53] > [ 80.450060] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53] > [ 94.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53] > > The softlockup_test.ko source code is: > // > #include <linux/kernel.h> > #include <linux/module.h> > #include <linux/spinlock.h> > #include <linux/slab.h> > > static int hello_start(void) > { > DEFINE_SPINLOCK(hello_lock); > spin_lock_init(&hello_lock); > spin_lock(&hello_lock); > spin_lock(&hello_lock); Did you really intend to acquire the same spinlock twice in a row, forcing a self-deadlock? If not, I of course suggest changing the second "spin_lock()" to "spin_unlock()". If your .config has CONFIG_TREE_RCU=y, the above is quite likely to give you an RCU CPU stall warning. 
> return 0; > } > > static int __init test_init(void) > { > hello_start(); > > printk(KERN_INFO "Module init\n"); > return 0; > } > > static void __exit test_exit(void) > { > printk(KERN_INFO "Module exit!\n"); > } > > module_init(test_init) > module_exit(test_exit) > MODULE_LICENSE("GPL"); > // > > My kernel version is v3.10.63, and I checked the kernel source code, > > update_process_times > -> run_local_timers > -> hrtimer_run_queues > -> __run_hrtimer > -> watchdog_timer_fn > -> is_softlockup > > -> rcu_check_callbacks > -> __rcu_pending > -> check_cpu_stall > -> print_cpu_stall > > If softlockup_thresh > rcu_cpu_stall_timeout, print_cpu_stall will print a log > to the serial port. > > The 8250 serial driver will call serial8250_console_write => > touch_nmi_watchdog() which resets > watchdog_touch_ts to 0. So the softlockup will not be triggered. > > Is this reasonable? Why? Is exactly what reasonable? ;-) Yes, it is reasonable that your code triggers an RCU CPU stall warning. No, it is not reasonable that the RCU CPU stall warning does not include a stack trace, and the fix for that bug will be going into the next merge window. Yes, it is reasonable that varying the softlockup and RCU CPU stall timeouts might change the behavior.
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On Tue, Jan 20, 2015 at 11:17:01AM +0800, Zhang Zhen wrote:
On 2015/1/19 19:09, Paul E. McKenney wrote:
On Mon, Jan 19, 2015 at 05:04:29PM +0800, Zhang Zhen wrote:
On 2015/1/19 16:42, Paul E. McKenney wrote:
On Mon, Jan 19, 2015 at 04:07:15PM +0800, Zhang Zhen wrote:

Hi,

On my x86_64 qemu virtual machine, RCU CPU stall console spews may lead to soft lockup being disabled.

If softlockup_thresh > rcu_cpu_stall_timeout (softlockup_thresh = 2 * watchdog_thresh):

/ #
/ # busybox cat /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout
21
/ # echo 60 > /proc/sys/kernel/watchdog_thresh
/ # busybox insmod softlockup_test.ko
[   44.959044] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=21002 jiffies, g=85, c=84, q=4)
[   44.959044] INFO: Stall ended before state dump start
[  107.964045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=84007 jiffies, g=85, c=84, q=4)
[  107.964045] INFO: Stall ended before state dump start
[  170.969060] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=147012 jiffies, g=85, c=84, q=4)
[  170.969060] INFO: Stall ended before state dump start
[  233.974057] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=210017 jiffies, g=85, c=84, q=4)
[  233.974057] INFO: Stall ended before state dump start
[  296.979059] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=273022 jiffies, g=85, c=84, q=4)
[  296.979059] INFO: Stall ended before state dump start
[  359.984058] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=336027 jiffies, g=85, c=84, q=4)
[  359.984058] INFO: Stall ended before state dump start
[  422.989059] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=399032 jiffies, g=85, c=84, q=4)
[  422.989059] INFO: Stall ended before state dump start
[  485.994056] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=462037 jiffies, g=85, c=84, q=4)
[  485.994056] INFO: Stall ended before state dump start
[  548.999059] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=525042 jiffies, g=85, c=84, q=4)
[  548.999059] INFO: Stall ended before state dump start
[  612.004061] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=588047 jiffies, g=85, c=84, q=4)
[  612.004061] INFO: Stall ended before state dump start
[  675.009058] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=651052 jiffies, g=85, c=84, q=4)
[  675.009058] INFO: Stall ended before state dump start

If softlockup_thresh < rcu_cpu_stall_timeout:

/ #
/ # busybox cat /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout
21
/ # echo 5 > /proc/sys/kernel/watchdog_thresh
/ # busybox insmod softlockup_test.ko
[   38.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53]
[   52.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53]
[   66.450073] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53]
[   80.450060] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53]
[   94.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53]

The softlockup_test.ko source code is:

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/spinlock.h>
#include <linux/slab.h>

static int hello_start(void)
{
        DEFINE_SPINLOCK(hello_lock);
        spin_lock_init(&hello_lock);
        spin_lock(&hello_lock);
        spin_lock(&hello_lock);

Did you really intend to acquire the same spinlock twice in a row, forcing a self-deadlock? If not, I of course suggest changing the second spin_lock() to spin_unlock().

Yes, I acquired the same spinlock twice in order to reproduce the problem.

Good, I was wondering about that. ;-)

If your .config has CONFIG_TREE_RCU=y, the above is quite likely to give you an RCU CPU stall warning.

In my .config CONFIG_TREE_RCU=y.

Which is consistent.

If softlockup_thresh < rcu_cpu_stall_timeout, it will give a soft lockup warning. If softlockup_thresh > rcu_cpu_stall_timeout, it will likely give an RCU CPU stall warning just like the above and not give a soft lockup warning.
It means that RCU CPU stall console spews lead to soft lockup being disabled. Is this reasonable?

It depends. You will often see both of them, but they can interfere with each other, especially if all these messages are going across a serial line. And sometimes the activity of the one will suppress the other, though I would not expect that in your spinlock deadlock case.

Ok, my expectation is to get both RCU stall messages _and_ softlockup messages even though all these messages are going across a serial line. But in my test case the RCU stall messages suppress the other.

On some kernel versions, slow serial-line consoles can cause problems. For example, on a 9600-baud line, only 21K characters output at one go
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On Mon, Jan 19, 2015 at 04:07:15PM +0800, Zhang Zhen wrote:

Hi,

On my x86_64 qemu virtual machine, RCU CPU stall console spews may lead to soft lockup being disabled.

If softlockup_thresh > rcu_cpu_stall_timeout (softlockup_thresh = 2 * watchdog_thresh):

/ #
/ # busybox cat /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout
21
/ # echo 60 > /proc/sys/kernel/watchdog_thresh
/ # busybox insmod softlockup_test.ko
[... same series of "INFO: rcu_preempt detected stalls" warnings as quoted earlier in the thread, from t=21002 through t=651052 jiffies ...]

If softlockup_thresh < rcu_cpu_stall_timeout:

/ #
/ # busybox cat /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout
21
/ # echo 5 > /proc/sys/kernel/watchdog_thresh
/ # busybox insmod softlockup_test.ko
[   38.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53]
[   52.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53]
[   66.450073] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53]
[   80.450060] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53]
[   94.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53]

The softlockup_test.ko source code is:

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/spinlock.h>
#include <linux/slab.h>

static int hello_start(void)
{
        DEFINE_SPINLOCK(hello_lock);
        spin_lock_init(&hello_lock);
        spin_lock(&hello_lock);
        spin_lock(&hello_lock);

Did you really intend to acquire the same spinlock twice in a row, forcing a self-deadlock? If not, I of course suggest changing the second spin_lock() to spin_unlock(). If your .config has CONFIG_TREE_RCU=y, the above is quite likely to give you an RCU CPU stall warning.

        return 0;
}

static int __init test_init(void)
{
        hello_start();
        printk(KERN_INFO "Module init\n");
        return 0;
}

static void __exit test_exit(void)
{
        printk(KERN_INFO "Module exit!\n");
}

module_init(test_init);
module_exit(test_exit);
MODULE_LICENSE("GPL");

My kernel version is v3.10.63, and I checked the kernel source code:

update_process_times -> run_local_timers -> hrtimer_run_queues -> __run_hrtimer -> watchdog_timer_fn -> is_softlockup
rcu_check_callbacks -> __rcu_pending -> check_cpu_stall -> print_cpu_stall

If softlockup_thresh > rcu_cpu_stall_timeout, print_cpu_stall will print its log to the serial port.
The 8250 serial driver will call serial8250_console_write -> touch_nmi_watchdog(), which resets watchdog_touch_ts to 0, so the softlockup will not be triggered. Is this reasonable? Why?

Is exactly what reasonable? ;-)

Yes, it is reasonable that your code triggers an RCU CPU stall warning. No, it is not reasonable that the RCU CPU stall warning does not include a stack trace, and the fix for that bug will be going into the next merge window. Yes, it is reasonable that varying the softlockup and RCU CPU stall timeouts might
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On Mon, Jan 19, 2015 at 05:04:29PM +0800, Zhang Zhen wrote:

On 2015/1/19 16:42, Paul E. McKenney wrote:

On Mon, Jan 19, 2015 at 04:07:15PM +0800, Zhang Zhen wrote:

Hi,

On my x86_64 qemu virtual machine, RCU CPU stall console spews may lead to soft lockup being disabled.

If softlockup_thresh > rcu_cpu_stall_timeout (softlockup_thresh = 2 * watchdog_thresh):

/ #
/ # busybox cat /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout
21
/ # echo 60 > /proc/sys/kernel/watchdog_thresh
/ # busybox insmod softlockup_test.ko
[... same series of "INFO: rcu_preempt detected stalls" warnings as quoted earlier in the thread, from t=21002 through t=651052 jiffies ...]

If softlockup_thresh < rcu_cpu_stall_timeout:

/ #
/ # busybox cat /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout
21
/ # echo 5 > /proc/sys/kernel/watchdog_thresh
/ # busybox insmod softlockup_test.ko
[   38.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53]
[   52.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53]
[   66.450073] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53]
[   80.450060] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53]
[   94.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53]

The softlockup_test.ko source code is:

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/spinlock.h>
#include <linux/slab.h>

static int hello_start(void)
{
        DEFINE_SPINLOCK(hello_lock);
        spin_lock_init(&hello_lock);
        spin_lock(&hello_lock);
        spin_lock(&hello_lock);

Did you really intend to acquire the same spinlock twice in a row, forcing a self-deadlock? If not, I of course suggest changing the second spin_lock() to spin_unlock().

Yes, I acquired the same spinlock twice in order to reproduce the problem.

Good, I was wondering about that. ;-)

If your .config has CONFIG_TREE_RCU=y, the above is quite likely to give you an RCU CPU stall warning.

In my .config CONFIG_TREE_RCU=y.

Which is consistent.

If softlockup_thresh < rcu_cpu_stall_timeout, it will give a soft lockup warning. If softlockup_thresh > rcu_cpu_stall_timeout, it will likely give an RCU CPU stall warning just like the above and not give a soft lockup warning. It means that RCU CPU stall console spews lead to soft lockup being disabled. Is this reasonable?

It depends.
You will often see both of them, but they can interfere with each other, especially if all these messages are going across a serial line. And sometimes the activity of the one will suppress the other, though I would not expect that in your spinlock deadlock case.

Thanx, Paul

Thanks!

        return 0;
}

static int __init test_init(void)
{
        hello_start();
        printk(KERN_INFO "Module init\n");
        return 0;
}

static void __exit test_exit(void)
{
        printk(KERN_INFO "Module exit!\n");
}

module_init(test_init);
module_exit(test_exit);
MODULE_LICENSE("GPL");

My kernel version is v3.10.63, and I checked the kernel source code,
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On Mon, Jan 19, 2015 at 05:04:29PM +0800, Zhang Zhen wrote:

Did you really intend to acquire the same spinlock twice in a row, forcing a self-deadlock? If not, I of course suggest changing the second spin_lock() to spin_unlock().

Yes, I acquired the same spinlock twice in order to reproduce the problem.

If your .config has CONFIG_TREE_RCU=y, the above is quite likely to give you an RCU CPU stall warning.

In my .config CONFIG_TREE_RCU=y.

If softlockup_thresh < rcu_cpu_stall_timeout, it will give a soft lockup warning. If softlockup_thresh > rcu_cpu_stall_timeout, it will likely give an RCU CPU stall warning just like the above and not give a soft lockup warning. It means that RCU CPU stall console spews lead to soft lockup being disabled. Is this reasonable?

I believe so. In kernel v3.10, all activity to the console executed touch_nmi_watchdog(), which calls touch_softlockup_watchdog(), which delayed the softlockup for another round of 'softlockup_thresh'. Of course back then, touch_nmi_watchdog touched all cpus, so a problem like this was masked. I believe this upstream commit, 62572e29bc53, solved the problem. You can apply that commit and see if you get both RCU stall messages _and_ softlockup messages. I believe that is what you were expecting, correct?

Of course, on a non-virt guest, your test case would normally trigger a hardlockup warning first. And a later kernel version for the guest may actually do that (not quite sure if the emulated PMU stuff is upstream or not yet). Just to set your expectations correctly.

Cheers,
Don

Thanks!
        return 0;
}

static int __init test_init(void)
{
        hello_start();
        printk(KERN_INFO "Module init\n");
        return 0;
}

static void __exit test_exit(void)
{
        printk(KERN_INFO "Module exit!\n");
}

module_init(test_init);
module_exit(test_exit);
MODULE_LICENSE("GPL");

My kernel version is v3.10.63, and I checked the kernel source code:

update_process_times -> run_local_timers -> hrtimer_run_queues -> __run_hrtimer -> watchdog_timer_fn -> is_softlockup
rcu_check_callbacks -> __rcu_pending -> check_cpu_stall -> print_cpu_stall

If softlockup_thresh > rcu_cpu_stall_timeout, print_cpu_stall will print its log to the serial port. The 8250 serial driver will call serial8250_console_write -> touch_nmi_watchdog(), which resets watchdog_touch_ts to 0, so the softlockup will not be triggered. Is this reasonable? Why?

Is exactly what reasonable? ;-)

Yes, it is reasonable that your code triggers an RCU CPU stall warning.

No, it is not reasonable that the RCU CPU stall warning does not include a stack trace, and the fix for that bug will be going into the next merge window.

Yes, it is reasonable that varying the softlockup and RCU CPU stall timeouts might change the behavior.

No, your code is not reasonable, except perhaps as a test of the generation of softlockup and RCU CPU stall warnings. If you are not trying to test softlockup and RCU CPU stall warnings, you should of course not try to acquire any non-recursive exclusive lock that you already hold.

If it is not reasonable, we should adjust the printk loglevel from *KERN_ERR* to *KERN_INFO* in print_cpu_stall.

Given that RCU CPU stall warnings are supposed to be pointing out errors elsewhere in the kernel, and in this case are pointing out errors elsewhere in the kernel, namely in your hello_start() function, it is reasonable that the RCU CPU stall warnings use the KERN_ERR loglevel. Or am I missing something here?

Thanx, Paul
Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
On 2015/1/19 16:42, Paul E. McKenney wrote:

On Mon, Jan 19, 2015 at 04:07:15PM +0800, Zhang Zhen wrote:

Hi,

On my x86_64 qemu virtual machine, RCU CPU stall console spews may lead to soft lockup being disabled.

If softlockup_thresh > rcu_cpu_stall_timeout (softlockup_thresh = 2 * watchdog_thresh):

/ #
/ # busybox cat /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout
21
/ # echo 60 > /proc/sys/kernel/watchdog_thresh
/ # busybox insmod softlockup_test.ko
[... same series of "INFO: rcu_preempt detected stalls" warnings as quoted earlier in the thread, from t=21002 through t=651052 jiffies ...]

If softlockup_thresh < rcu_cpu_stall_timeout:

/ #
/ # busybox cat /sys/module/rcupdate/parameters/rcu_cpu_stall_timeout
21
/ # echo 5 > /proc/sys/kernel/watchdog_thresh
/ # busybox insmod softlockup_test.ko
[   38.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53]
[   52.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53]
[   66.450073] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53]
[   80.450060] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53]
[   94.450061] BUG: soft lockup - CPU#0 stuck for 11s! [busybox:53]

The softlockup_test.ko source code is:

#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/spinlock.h>
#include <linux/slab.h>

static int hello_start(void)
{
        DEFINE_SPINLOCK(hello_lock);
        spin_lock_init(&hello_lock);
        spin_lock(&hello_lock);
        spin_lock(&hello_lock);

Did you really intend to acquire the same spinlock twice in a row, forcing a self-deadlock? If not, I of course suggest changing the second spin_lock() to spin_unlock().

Yes, I acquired the same spinlock twice in order to reproduce the problem.

If your .config has CONFIG_TREE_RCU=y, the above is quite likely to give you an RCU CPU stall warning.

In my .config CONFIG_TREE_RCU=y.

If softlockup_thresh < rcu_cpu_stall_timeout, it will give a soft lockup warning. If softlockup_thresh > rcu_cpu_stall_timeout, it will likely give an RCU CPU stall warning just like the above and not give a soft lockup warning. It means that RCU CPU stall console spews lead to soft lockup being disabled. Is this reasonable?

Thanks!
        return 0;
}

static int __init test_init(void)
{
        hello_start();
        printk(KERN_INFO "Module init\n");
        return 0;
}

static void __exit test_exit(void)
{
        printk(KERN_INFO "Module exit!\n");
}

module_init(test_init);
module_exit(test_exit);
MODULE_LICENSE("GPL");

My kernel version is v3.10.63, and I checked the kernel source code:

update_process_times -> run_local_timers -> hrtimer_run_queues -> __run_hrtimer -> watchdog_timer_fn -> is_softlockup
rcu_check_callbacks -> __rcu_pending -> check_cpu_stall -> print_cpu_stall

If softlockup_thresh > rcu_cpu_stall_timeout, print_cpu_stall will print its log to the serial port. The 8250 serial driver will call serial8250_console_write ->