RE: [PATCH] hardlockup: detect hard lockups without NMIs using secondary cpus

2013-01-10 Thread Liu, Chuansheng


> -Original Message-
> From: ccr...@google.com [mailto:ccr...@google.com] On Behalf Of Colin
> Cross
> Sent: Friday, January 11, 2013 2:18 PM
> To: Liu, Chuansheng
> Cc: linux-kernel@vger.kernel.org; Andrew Morton; Don Zickus; Ingo Molnar;
> Thomas Gleixner; linux-arm-ker...@lists.infradead.org
> Subject: Re: [PATCH] hardlockup: detect hard lockups without NMIs using
> secondary cpus
> 
> On Thu, Jan 10, 2013 at 9:57 PM, Liu, Chuansheng
>  wrote:
> >
> >
> >> -Original Message-
> >> From: ccr...@google.com [mailto:ccr...@google.com] On Behalf Of Colin
> >> Cross
> >> Sent: Friday, January 11, 2013 1:34 PM
> >> To: Liu, Chuansheng
> >> Cc: linux-kernel@vger.kernel.org; Andrew Morton; Don Zickus; Ingo Molnar;
> >> Thomas Gleixner; linux-arm-ker...@lists.infradead.org
> >> Subject: Re: [PATCH] hardlockup: detect hard lockups without NMIs using
> >> secondary cpus
> >>
> >> On Thu, Jan 10, 2013 at 5:39 PM, Liu, Chuansheng
> >>  wrote:
> >> >
> >> >
> >> >> -Original Message-
> >> >> From: Colin Cross [mailto:ccr...@android.com]
> >> >> Sent: Thursday, January 10, 2013 9:58 AM
> >> >> To: linux-kernel@vger.kernel.org
> >> >> Cc: Andrew Morton; Don Zickus; Ingo Molnar; Thomas Gleixner; Liu,
> >> >> Chuansheng; linux-arm-ker...@lists.infradead.org; Colin Cross
> >> >> Subject: [PATCH] hardlockup: detect hard lockups without NMIs using
> >> >> secondary cpus
> >> >>
> >> >> Emulate NMIs on systems where they are not available by using timer
> >> >> interrupts on other cpus.  Each cpu will use its softlockup hrtimer
> >> >> to check that the next cpu is processing hrtimer interrupts by
> >> >> verifying that a counter is increasing.
> >> >>
> >> >> This patch is useful on systems where the hardlockup detector is not
> >> >> available due to a lack of NMIs, for example most ARM SoCs.
> >> >> Without this patch any cpu stuck with interrupts disabled can
> >> >> cause a hardware watchdog reset with no debugging information,
> >> >> but with this patch the kernel can detect the lockup and panic,
> >> >> which can result in useful debugging info.
> >> >>
> >> >> Signed-off-by: Colin Cross 
> >> >> +static void watchdog_check_hardlockup_other_cpu(void)
> >> >> +{
> >> >> + int cpu;
> >> >> + cpumask_t cpus = watchdog_cpus;
> >> >> +
> >> >> + /*
> >> >> +  * Test for hardlockups every 3 samples.  The sample period is
> >> >> +  *  watchdog_thresh * 2 / 5, so 3 samples gets us back to
> slightly
> >> over
> >> >> +  *  watchdog_thresh (over by 20%).
> >> >> +  */
> >> >> + if (__this_cpu_read(hrtimer_interrupts) % 3 != 0)
> >> >> + return;
> >> >> +
> > Another feeling is about __this_cpu_read(hrtimer_interrupts) % 3 != 0,
> > It will cause the actual timeout value for hard lockup detection is not 
> > very fix,
> or even
> > very short.
> > Sometimes using 3 samples can detect the lockup case, but sometimes 1
> sample.
> > Is it the case?
> 
> I'm not sure what you mean.  The mod 3 will cause every 3rd timer (12
> seconds, assuming watchdog_thresh = 10) to check hrtimer_interrupts
> vs. hrtimer_interrupts_saved, and then update it.  The sampling should
> be fixed and very accurate.  It will cause a panic/warning between 12
> and 24 seconds after a cpu stops processing timer interrupts,
> depending on the alignment of the hrtimers between the two cpus.
> 
You are right, thanks.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] hardlockup: detect hard lockups without NMIs using secondary cpus

2013-01-10 Thread Colin Cross
On Thu, Jan 10, 2013 at 9:57 PM, Liu, Chuansheng
 wrote:
>
>
>> -Original Message-
>> From: ccr...@google.com [mailto:ccr...@google.com] On Behalf Of Colin
>> Cross
>> Sent: Friday, January 11, 2013 1:34 PM
>> To: Liu, Chuansheng
>> Cc: linux-kernel@vger.kernel.org; Andrew Morton; Don Zickus; Ingo Molnar;
>> Thomas Gleixner; linux-arm-ker...@lists.infradead.org
>> Subject: Re: [PATCH] hardlockup: detect hard lockups without NMIs using
>> secondary cpus
>>
>> On Thu, Jan 10, 2013 at 5:39 PM, Liu, Chuansheng
>>  wrote:
>> >
>> >
>> >> -Original Message-
>> >> From: Colin Cross [mailto:ccr...@android.com]
>> >> Sent: Thursday, January 10, 2013 9:58 AM
>> >> To: linux-kernel@vger.kernel.org
>> >> Cc: Andrew Morton; Don Zickus; Ingo Molnar; Thomas Gleixner; Liu,
>> >> Chuansheng; linux-arm-ker...@lists.infradead.org; Colin Cross
>> >> Subject: [PATCH] hardlockup: detect hard lockups without NMIs using
>> >> secondary cpus
>> >>
>> >> Emulate NMIs on systems where they are not available by using timer
>> >> interrupts on other cpus.  Each cpu will use its softlockup hrtimer
>> >> to check that the next cpu is processing hrtimer interrupts by
>> >> verifying that a counter is increasing.
>> >>
>> >> This patch is useful on systems where the hardlockup detector is not
>> >> available due to a lack of NMIs, for example most ARM SoCs.
>> >> Without this patch any cpu stuck with interrupts disabled can
>> >> cause a hardware watchdog reset with no debugging information,
>> >> but with this patch the kernel can detect the lockup and panic,
>> >> which can result in useful debugging info.
>> >>
>> >> Signed-off-by: Colin Cross 
>> >> +static void watchdog_check_hardlockup_other_cpu(void)
>> >> +{
>> >> + int cpu;
>> >> + cpumask_t cpus = watchdog_cpus;
>> >> +
>> >> + /*
>> >> +  * Test for hardlockups every 3 samples.  The sample period is
>> >> +  *  watchdog_thresh * 2 / 5, so 3 samples gets us back to slightly
>> over
>> >> +  *  watchdog_thresh (over by 20%).
>> >> +  */
>> >> + if (__this_cpu_read(hrtimer_interrupts) % 3 != 0)
>> >> + return;
>> >> +
> Another feeling is about __this_cpu_read(hrtimer_interrupts) % 3 != 0,
> It will cause the actual timeout value for hard lockup detection is not very 
> fix, or even
> very short.
> Sometimes using 3 samples can detect the lockup case, but sometimes 1 sample.
> Is it the case?

I'm not sure what you mean.  The mod 3 will cause every 3rd timer (12
seconds, assuming watchdog_thresh = 10) to check hrtimer_interrupts
vs. hrtimer_interrupts_saved, and then update it.  The sampling should
be fixed and very accurate.  It will cause a panic/warning between 12
and 24 seconds after a cpu stops processing timer interrupts,
depending on the alignment of the hrtimers between the two cpus.

> And in NMI case, the NMI interrupt is coming at least every watchdog_thresh.

NMI interrupt will happen every 10 seconds instead of 12, meaning the
panic/warning will occur between 10 and 20 seconds after a cpu stops
processing timer interrupts, depending on the alignment of the NMI
with the hrtimer, but otherwise my patch should be very similar.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH] hardlockup: detect hard lockups without NMIs using secondary cpus

2013-01-10 Thread Liu, Chuansheng


> -Original Message-
> From: ccr...@google.com [mailto:ccr...@google.com] On Behalf Of Colin
> Cross
> Sent: Friday, January 11, 2013 1:34 PM
> To: Liu, Chuansheng
> Cc: linux-kernel@vger.kernel.org; Andrew Morton; Don Zickus; Ingo Molnar;
> Thomas Gleixner; linux-arm-ker...@lists.infradead.org
> Subject: Re: [PATCH] hardlockup: detect hard lockups without NMIs using
> secondary cpus
> 
> On Thu, Jan 10, 2013 at 5:39 PM, Liu, Chuansheng
>  wrote:
> >
> >
> >> -Original Message-
> >> From: Colin Cross [mailto:ccr...@android.com]
> >> Sent: Thursday, January 10, 2013 9:58 AM
> >> To: linux-kernel@vger.kernel.org
> >> Cc: Andrew Morton; Don Zickus; Ingo Molnar; Thomas Gleixner; Liu,
> >> Chuansheng; linux-arm-ker...@lists.infradead.org; Colin Cross
> >> Subject: [PATCH] hardlockup: detect hard lockups without NMIs using
> >> secondary cpus
> >>
> >> Emulate NMIs on systems where they are not available by using timer
> >> interrupts on other cpus.  Each cpu will use its softlockup hrtimer
> >> to check that the next cpu is processing hrtimer interrupts by
> >> verifying that a counter is increasing.
> >>
> >> This patch is useful on systems where the hardlockup detector is not
> >> available due to a lack of NMIs, for example most ARM SoCs.
> >> Without this patch any cpu stuck with interrupts disabled can
> >> cause a hardware watchdog reset with no debugging information,
> >> but with this patch the kernel can detect the lockup and panic,
> >> which can result in useful debugging info.
> >>
> >> Signed-off-by: Colin Cross 
> >> +static void watchdog_check_hardlockup_other_cpu(void)
> >> +{
> >> + int cpu;
> >> + cpumask_t cpus = watchdog_cpus;
> >> +
> >> + /*
> >> +  * Test for hardlockups every 3 samples.  The sample period is
> >> +  *  watchdog_thresh * 2 / 5, so 3 samples gets us back to slightly
> over
> >> +  *  watchdog_thresh (over by 20%).
> >> +  */
> >> + if (__this_cpu_read(hrtimer_interrupts) % 3 != 0)
> >> + return;
> >> +
Another feeling is about __this_cpu_read(hrtimer_interrupts) % 3 != 0,
It will cause the actual timeout value for hard lockup detection is not very 
fix, or even
very short.
Sometimes using 3 samples can detect the lockup case, but sometimes 1 sample.
Is it the case?

And in NMI case, the NMI interrupt is coming at least every watchdog_thresh.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] hardlockup: detect hard lockups without NMIs using secondary cpus

2013-01-10 Thread Colin Cross
On Thu, Jan 10, 2013 at 5:39 PM, Liu, Chuansheng
 wrote:
>
>
>> -Original Message-
>> From: Colin Cross [mailto:ccr...@android.com]
>> Sent: Thursday, January 10, 2013 9:58 AM
>> To: linux-kernel@vger.kernel.org
>> Cc: Andrew Morton; Don Zickus; Ingo Molnar; Thomas Gleixner; Liu,
>> Chuansheng; linux-arm-ker...@lists.infradead.org; Colin Cross
>> Subject: [PATCH] hardlockup: detect hard lockups without NMIs using
>> secondary cpus
>>
>> Emulate NMIs on systems where they are not available by using timer
>> interrupts on other cpus.  Each cpu will use its softlockup hrtimer
>> to check that the next cpu is processing hrtimer interrupts by
>> verifying that a counter is increasing.
>>
>> This patch is useful on systems where the hardlockup detector is not
>> available due to a lack of NMIs, for example most ARM SoCs.
>> Without this patch any cpu stuck with interrupts disabled can
>> cause a hardware watchdog reset with no debugging information,
>> but with this patch the kernel can detect the lockup and panic,
>> which can result in useful debugging info.
>>
>> Signed-off-by: Colin Cross 
>> +static void watchdog_check_hardlockup_other_cpu(void)
>> +{
>> + int cpu;
>> + cpumask_t cpus = watchdog_cpus;
>> +
>> + /*
>> +  * Test for hardlockups every 3 samples.  The sample period is
>> +  *  watchdog_thresh * 2 / 5, so 3 samples gets us back to slightly over
>> +  *  watchdog_thresh (over by 20%).
>> +  */
>> + if (__this_cpu_read(hrtimer_interrupts) % 3 != 0)
>> + return;
>> +
>> + /* check for a hardlockup on the next cpu */
>> + cpu = cpumask_next(smp_processor_id(), );
>> + if (cpu >= nr_cpu_ids)
>> + cpu = cpumask_first();
>> + if (cpu == smp_processor_id())
>> + return;
>> +
>> + smp_rmb();
>> +
>> + if (per_cpu(watchdog_nmi_touch, cpu) == true) {
>> + per_cpu(watchdog_nmi_touch, cpu) = false;
>> + return;
>> + }
>> +
>> + if (is_hardlockup_other_cpu(cpu)) {
>> + /* only warn once */
> One possible case for new hotplug CPU that false hardlockup case.
> 1/ Assume CPU1, CPU2 are online, CPU3 is being hotplug:
> CPU3:CPU2:
> watchdog_nmi_enable()
>  per_cpu(watchdog_nmi_touch, cpu) = true;
>  cpumask_set_cpu(cpu, _cpus);
>  
> watchdog_check_hardlockup_other_cpu()
>
> per_cpu(watchdog_nmi_touch, cpu) = false; == > Here cpu is CPU3
>
> 2/ Before CPU3's first hrtimer interrupt coming, CPU2 is been offlined.
>   Then CPU1's next CPU is CPU3. But we can not use CPU3's watchdog_nmi_touch 
> to defense
>   false CPU3 hardlock more. When CPU1's hrtimer interrupt coming, it is 
> possible report CPU3
>   false hard lockup.
>
> Is it the case?

Yes, this is the same race condition I pointed out in reply to Don
Zickus earlier in the thread.  I think the easiest solution is to set
per_cpu(watchdog_nmi_touch, next_cpu) = true during cpu offlining.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH] hardlockup: detect hard lockups without NMIs using secondary cpus

2013-01-10 Thread Liu, Chuansheng


> -Original Message-
> From: Colin Cross [mailto:ccr...@android.com]
> Sent: Thursday, January 10, 2013 9:58 AM
> To: linux-kernel@vger.kernel.org
> Cc: Andrew Morton; Don Zickus; Ingo Molnar; Thomas Gleixner; Liu,
> Chuansheng; linux-arm-ker...@lists.infradead.org; Colin Cross
> Subject: [PATCH] hardlockup: detect hard lockups without NMIs using
> secondary cpus
> 
> Emulate NMIs on systems where they are not available by using timer
> interrupts on other cpus.  Each cpu will use its softlockup hrtimer
> to check that the next cpu is processing hrtimer interrupts by
> verifying that a counter is increasing.
> 
> This patch is useful on systems where the hardlockup detector is not
> available due to a lack of NMIs, for example most ARM SoCs.
> Without this patch any cpu stuck with interrupts disabled can
> cause a hardware watchdog reset with no debugging information,
> but with this patch the kernel can detect the lockup and panic,
> which can result in useful debugging info.
> 
> Signed-off-by: Colin Cross 
> +static void watchdog_check_hardlockup_other_cpu(void)
> +{
> + int cpu;
> + cpumask_t cpus = watchdog_cpus;
> +
> + /*
> +  * Test for hardlockups every 3 samples.  The sample period is
> +  *  watchdog_thresh * 2 / 5, so 3 samples gets us back to slightly over
> +  *  watchdog_thresh (over by 20%).
> +  */
> + if (__this_cpu_read(hrtimer_interrupts) % 3 != 0)
> + return;
> +
> + /* check for a hardlockup on the next cpu */
> + cpu = cpumask_next(smp_processor_id(), );
> + if (cpu >= nr_cpu_ids)
> + cpu = cpumask_first();
> + if (cpu == smp_processor_id())
> + return;
> +
> + smp_rmb();
> +
> + if (per_cpu(watchdog_nmi_touch, cpu) == true) {
> + per_cpu(watchdog_nmi_touch, cpu) = false;
> + return;
> + }
> +
> + if (is_hardlockup_other_cpu(cpu)) {
> + /* only warn once */
One possible case for new hotplug CPU that false hardlockup case.
1/ Assume CPU1, CPU2 are online, CPU3 is being hotplug:
CPU3:CPU2:
watchdog_nmi_enable()
 per_cpu(watchdog_nmi_touch, cpu) = true;
 cpumask_set_cpu(cpu, _cpus);
 
watchdog_check_hardlockup_other_cpu()
   per_cpu(watchdog_nmi_touch, 
cpu) = false; == > Here cpu is CPU3

2/ Before CPU3's first hrtimer interrupt coming, CPU2 is been offlined.
  Then CPU1's next CPU is CPU3. But we can not use CPU3's watchdog_nmi_touch to 
defense
  false CPU3 hardlock more. When CPU1's hrtimer interrupt coming, it is 
possible report CPU3
  false hard lockup.

Is it the case?


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] hardlockup: detect hard lockups without NMIs using secondary cpus

2013-01-10 Thread Tony Lindgren
* Colin Cross  [130110 14:37]:
> On Thu, Jan 10, 2013 at 12:38 PM, Tony Lindgren  wrote:
> >
> > * Colin Cross  [130109 18:05]:
> > > +static void watchdog_check_hardlockup_other_cpu(void)
> > > +{
> > > + int cpu;
> > > + cpumask_t cpus = watchdog_cpus;
> > > +
> > > + /*
> > > +  * Test for hardlockups every 3 samples.  The sample period is
> > > +  *  watchdog_thresh * 2 / 5, so 3 samples gets us back to slightly 
> > > over
> > > +  *  watchdog_thresh (over by 20%).
> > > +  */
> > > + if (__this_cpu_read(hrtimer_interrupts) % 3 != 0)
> > > + return;
> > > +
> > > + /* check for a hardlockup on the next cpu */
> > > + cpu = cpumask_next(smp_processor_id(), );
> >
> > Hmm don't you want to check cpu_oneline_mask here and
> > return if the other CPU is offline?
> 
> watchdog_cpus is effectively a local copy of cpu_online_mask, but
> updated after the watchdog_nmi_touch in watchdog_nmi_enable.  This
> avoids a false positive after hotplugging in a cpu when
> cpu_online_mask is true but that cpu hasn't yet run it's first
> hrtimer.

OK thanks for clarifying that.

Tony
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] hardlockup: detect hard lockups without NMIs using secondary cpus

2013-01-10 Thread Colin Cross
On Thu, Jan 10, 2013 at 12:38 PM, Tony Lindgren  wrote:
>
> * Colin Cross  [130109 18:05]:
> > +static void watchdog_check_hardlockup_other_cpu(void)
> > +{
> > + int cpu;
> > + cpumask_t cpus = watchdog_cpus;
> > +
> > + /*
> > +  * Test for hardlockups every 3 samples.  The sample period is
> > +  *  watchdog_thresh * 2 / 5, so 3 samples gets us back to slightly 
> > over
> > +  *  watchdog_thresh (over by 20%).
> > +  */
> > + if (__this_cpu_read(hrtimer_interrupts) % 3 != 0)
> > + return;
> > +
> > + /* check for a hardlockup on the next cpu */
> > + cpu = cpumask_next(smp_processor_id(), );
>
> Hmm don't you want to check cpu_oneline_mask here and
> return if the other CPU is offline?

watchdog_cpus is effectively a local copy of cpu_online_mask, but
updated after the watchdog_nmi_touch in watchdog_nmi_enable.  This
avoids a false positive after hotplugging in a cpu when
cpu_online_mask is true but that cpu hasn't yet run it's first
hrtimer.

> > + if (cpu >= nr_cpu_ids)
> > + cpu = cpumask_first();
> > + if (cpu == smp_processor_id())
> > + return;
>
> Regards,
>
> Tony
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] hardlockup: detect hard lockups without NMIs using secondary cpus

2013-01-10 Thread Tony Lindgren
* Colin Cross  [130109 18:05]:
> +static void watchdog_check_hardlockup_other_cpu(void)
> +{
> + int cpu;
> + cpumask_t cpus = watchdog_cpus;
> +
> + /*
> +  * Test for hardlockups every 3 samples.  The sample period is
> +  *  watchdog_thresh * 2 / 5, so 3 samples gets us back to slightly over
> +  *  watchdog_thresh (over by 20%).
> +  */
> + if (__this_cpu_read(hrtimer_interrupts) % 3 != 0)
> + return;
> +
> + /* check for a hardlockup on the next cpu */
> + cpu = cpumask_next(smp_processor_id(), );

Hmm don't you want to check cpu_oneline_mask here and
return if the other CPU is offline?

> + if (cpu >= nr_cpu_ids)
> + cpu = cpumask_first();
> + if (cpu == smp_processor_id())
> + return;

Regards,

Tony
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] hardlockup: detect hard lockups without NMIs using secondary cpus

2013-01-10 Thread Don Zickus
On Thu, Jan 10, 2013 at 09:27:28AM -0800, Colin Cross wrote:
> On Thu, Jan 10, 2013 at 6:02 AM, Don Zickus  wrote:
> > On Wed, Jan 09, 2013 at 05:57:39PM -0800, Colin Cross wrote:
> >> Emulate NMIs on systems where they are not available by using timer
> >> interrupts on other cpus.  Each cpu will use its softlockup hrtimer
> >> to check that the next cpu is processing hrtimer interrupts by
> >> verifying that a counter is increasing.
> >>
> >> This patch is useful on systems where the hardlockup detector is not
> >> available due to a lack of NMIs, for example most ARM SoCs.
> >
> > I have seen other cpus, like Sparc I think, create a 'virtual NMI' by
> > reserving an IRQ line as 'special' (can not be masked).  Not sure if that
> > is something worth looking at here (or even possible).
> >
> >> Without this patch any cpu stuck with interrupts disabled can
> >> cause a hardware watchdog reset with no debugging information,
> >> but with this patch the kernel can detect the lockup and panic,
> >> which can result in useful debugging info.
> >
> > 
> >> +#ifdef CONFIG_HARDLOCKUP_DETECTOR_OTHER_CPU
> >> +static int is_hardlockup_other_cpu(int cpu)
> >> +{
> >> + unsigned long hrint = per_cpu(hrtimer_interrupts, cpu);
> >> +
> >> + if (per_cpu(hrtimer_interrupts_saved, cpu) == hrint)
> >> + return 1;
> >> +
> >> + per_cpu(hrtimer_interrupts_saved, cpu) = hrint;
> >> + return 0;
> >
> > Will this race with the other cpu you are checking?  For example if cpuA
> > just updated its hrtimer_interrupts_saved and cpuB goes to check cpuA's
> > hrtimer_interrupts_saved, it seems possible that cpuB could falsely assume
> > cpuA is stuck?
> 
> cpuA doesn't update its own hrtimer_interrupts_saved, cpuB does.
> However, there may be a similar race condition during hotplug if cpuB
> updates hrtimer_interrupts_saved for cpuA, then goes offline, then
> cpuC may try to check cpuA and see that hrtimer_interrupts_saved ==
> hrtimer_interrupts.  I think this can be solved by setting
> watchdog_nmi_touch for the next cpu when a cpu goes online or offline.

Ah, that is where my misunderstanding was.  I overlooked the fact that it
was only updated by the other cpu.  Sorry about that.

I'll re-review it again with that in mind.

Cheers,
Don
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] hardlockup: detect hard lockups without NMIs using secondary cpus

2013-01-10 Thread Colin Cross
On Thu, Jan 10, 2013 at 6:02 AM, Don Zickus  wrote:
> On Wed, Jan 09, 2013 at 05:57:39PM -0800, Colin Cross wrote:
>> Emulate NMIs on systems where they are not available by using timer
>> interrupts on other cpus.  Each cpu will use its softlockup hrtimer
>> to check that the next cpu is processing hrtimer interrupts by
>> verifying that a counter is increasing.
>>
>> This patch is useful on systems where the hardlockup detector is not
>> available due to a lack of NMIs, for example most ARM SoCs.
>
> I have seen other cpus, like Sparc I think, create a 'virtual NMI' by
> reserving an IRQ line as 'special' (can not be masked).  Not sure if that
> is something worth looking at here (or even possible).
>
>> Without this patch any cpu stuck with interrupts disabled can
>> cause a hardware watchdog reset with no debugging information,
>> but with this patch the kernel can detect the lockup and panic,
>> which can result in useful debugging info.
>
> 
>> +#ifdef CONFIG_HARDLOCKUP_DETECTOR_OTHER_CPU
>> +static int is_hardlockup_other_cpu(int cpu)
>> +{
>> + unsigned long hrint = per_cpu(hrtimer_interrupts, cpu);
>> +
>> + if (per_cpu(hrtimer_interrupts_saved, cpu) == hrint)
>> + return 1;
>> +
>> + per_cpu(hrtimer_interrupts_saved, cpu) = hrint;
>> + return 0;
>
> Will this race with the other cpu you are checking?  For example if cpuA
> just updated its hrtimer_interrupts_saved and cpuB goes to check cpuA's
> hrtimer_interrupts_saved, it seems possible that cpuB could falsely assume
> cpuA is stuck?

cpuA doesn't update its own hrtimer_interrupts_saved, cpuB does.
However, there may be a similar race condition during hotplug if cpuB
updates hrtimer_interrupts_saved for cpuA, then goes offline, then
cpuC may try to check cpuA and see that hrtimer_interrupts_saved ==
hrtimer_interrupts.  I think this can be solved by setting
watchdog_nmi_touch for the next cpu when a cpu goes online or offline.

>> +}
>> +
>> +static void watchdog_check_hardlockup_other_cpu(void)
>> +{
>> + int cpu;
>> + cpumask_t cpus = watchdog_cpus;
>> +
>> + /*
>> +  * Test for hardlockups every 3 samples.  The sample period is
>> +  *  watchdog_thresh * 2 / 5, so 3 samples gets us back to slightly over
>> +  *  watchdog_thresh (over by 20%).
>> +  */
>> + if (__this_cpu_read(hrtimer_interrupts) % 3 != 0)
>> + return;
>> +
>> + /* check for a hardlockup on the next cpu */
>> + cpu = cpumask_next(smp_processor_id(), );
>> + if (cpu >= nr_cpu_ids)
>> + cpu = cpumask_first();
>> + if (cpu == smp_processor_id())
>> + return;
>> +
>> + smp_rmb();
>> +
>> + if (per_cpu(watchdog_nmi_touch, cpu) == true) {
>> + per_cpu(watchdog_nmi_touch, cpu) = false;
>> + return;
>> + }
>
> Same race here.  Usually touch_nmi_watchdog is reserved for those
> functions that plan on disabling interrupts for a while.  cpuB could set
> cpuA's watchdog_nmi_touch to false here expecting not to revisit this
> variable for another couple of seconds.  While cpuA could read this
> variable milliseconds later after cpuB sets it and falsely assume there is
> a lockup?
>
> Perhaps I am misreading the code?

Again, cpuA won't ever read its own watchdog_nmi_touch variable, only
cpuB will.  The only variables cpuA updates for itself is
hrtimer_interrupts or setting watchdog_nmi_touch to true.
hrtimer_interrupts_saved and setting watchdog_nmi_touch to false are
done by the cpu watching over cpuA, so the only races here are when a
cpu goes offline and a different cpu starts watching over cpuA.

> If not, I don't have a good idea on how to solve those races off the top of my
> head unfortunately.
>
> Cheers,
> Don
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] hardlockup: detect hard lockups without NMIs using secondary cpus

2013-01-10 Thread Russell King - ARM Linux
On Thu, Jan 10, 2013 at 05:18:40PM +0100, Frederic Weisbecker wrote:
> 2013/1/10 Russell King - ARM Linux :
> > On Thu, Jan 10, 2013 at 09:02:15AM -0500, Don Zickus wrote:
> >> On Wed, Jan 09, 2013 at 05:57:39PM -0800, Colin Cross wrote:
> >> > Emulate NMIs on systems where they are not available by using timer
> >> > interrupts on other cpus.  Each cpu will use its softlockup hrtimer
> >> > to check that the next cpu is processing hrtimer interrupts by
> >> > verifying that a counter is increasing.
> >> >
> >> > This patch is useful on systems where the hardlockup detector is not
> >> > available due to a lack of NMIs, for example most ARM SoCs.
> >>
> >> I have seen other cpus, like Sparc I think, create a 'virtual NMI' by
> >> reserving an IRQ line as 'special' (can not be masked).  Not sure if that
> >> is something worth looking at here (or even possible).
> >
> > No it isn't, because that assumes that things like spin_lock_irqsave()
> > won't mask that interrupt.  We don't have the facility to do that.
> 
> I believe sparc is doing something like this though. Look at
> arch/sparc/include/asm/irqflags_64.h, it seems NMIs are implemented
> there using an irq number that is not masked by this function.

As I said, we don't have a facility to do that.

The CPU doesn't know about interrupt levels.  It's either all-IRQs-masked
or no-IRQs-masked.  If you want anything inbetween, you have to go outside
the CPU and fiddle with the IRQ controller, which may be one of _many_
different types, and some platforms even have a shadow IRQ controller.
Plus, doing such manipulation may in itself also require locking.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] hardlockup: detect hard lockups without NMIs using secondary cpus

2013-01-10 Thread Frederic Weisbecker
2013/1/10 Russell King - ARM Linux :
> On Thu, Jan 10, 2013 at 09:02:15AM -0500, Don Zickus wrote:
>> On Wed, Jan 09, 2013 at 05:57:39PM -0800, Colin Cross wrote:
>> > Emulate NMIs on systems where they are not available by using timer
>> > interrupts on other cpus.  Each cpu will use its softlockup hrtimer
>> > to check that the next cpu is processing hrtimer interrupts by
>> > verifying that a counter is increasing.
>> >
>> > This patch is useful on systems where the hardlockup detector is not
>> > available due to a lack of NMIs, for example most ARM SoCs.
>>
>> I have seen other cpus, like Sparc I think, create a 'virtual NMI' by
>> reserving an IRQ line as 'special' (can not be masked).  Not sure if that
>> is something worth looking at here (or even possible).
>
> No it isn't, because that assumes that things like spin_lock_irqsave()
> won't mask that interrupt.  We don't have the facility to do that.

I believe sparc is doing something like this though. Look at
arch/sparc/include/asm/irqflags_64.h, it seems NMIs are implemented
there using an irq number that is not masked by this function.

Not all archs can do that so easily I guess.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] hardlockup: detect hard lockups without NMIs using secondary cpus

2013-01-10 Thread Russell King - ARM Linux
On Thu, Jan 10, 2013 at 09:02:15AM -0500, Don Zickus wrote:
> On Wed, Jan 09, 2013 at 05:57:39PM -0800, Colin Cross wrote:
> > Emulate NMIs on systems where they are not available by using timer
> > interrupts on other cpus.  Each cpu will use its softlockup hrtimer
> > to check that the next cpu is processing hrtimer interrupts by
> > verifying that a counter is increasing.
> > 
> > This patch is useful on systems where the hardlockup detector is not
> > available due to a lack of NMIs, for example most ARM SoCs.
> 
> I have seen other cpus, like Sparc I think, create a 'virtual NMI' by
> reserving an IRQ line as 'special' (can not be masked).  Not sure if that
> is something worth looking at here (or even possible).

No it isn't, because that assumes that things like spin_lock_irqsave()
won't mask that interrupt.  We don't have the facility to do that.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] hardlockup: detect hard lockups without NMIs using secondary cpus

2013-01-10 Thread Don Zickus
On Wed, Jan 09, 2013 at 05:57:39PM -0800, Colin Cross wrote:
> Emulate NMIs on systems where they are not available by using timer
> interrupts on other cpus.  Each cpu will use its softlockup hrtimer
> to check that the next cpu is processing hrtimer interrupts by
> verifying that a counter is increasing.
> 
> This patch is useful on systems where the hardlockup detector is not
> available due to a lack of NMIs, for example most ARM SoCs.

I have seen other cpus, like Sparc I think, create a 'virtual NMI' by
reserving an IRQ line as 'special' (can not be masked).  Not sure if that
is something worth looking at here (or even possible).

> Without this patch any cpu stuck with interrupts disabled can
> cause a hardware watchdog reset with no debugging information,
> but with this patch the kernel can detect the lockup and panic,
> which can result in useful debugging info.


> +#ifdef CONFIG_HARDLOCKUP_DETECTOR_OTHER_CPU
> +static int is_hardlockup_other_cpu(int cpu)
> +{
> + unsigned long hrint = per_cpu(hrtimer_interrupts, cpu);
> +
> + if (per_cpu(hrtimer_interrupts_saved, cpu) == hrint)
> + return 1;
> +
> + per_cpu(hrtimer_interrupts_saved, cpu) = hrint;
> + return 0;

Will this race with the other cpu you are checking?  For example if cpuA
just updated its hrtimer_interrupts_saved and cpuB goes to check cpuA's
hrtimer_interrupts_saved, it seems possible that cpuB could falsely assume
cpuA is stuck?


> +}
> +
> +static void watchdog_check_hardlockup_other_cpu(void)
> +{
> + int cpu;
> + cpumask_t cpus = watchdog_cpus;
> +
> + /*
> +  * Test for hardlockups every 3 samples.  The sample period is
> +  *  watchdog_thresh * 2 / 5, so 3 samples gets us back to slightly over
> +  *  watchdog_thresh (over by 20%).
> +  */
> + if (__this_cpu_read(hrtimer_interrupts) % 3 != 0)
> + return;
> +
> + /* check for a hardlockup on the next cpu */
> + cpu = cpumask_next(smp_processor_id(), );
> + if (cpu >= nr_cpu_ids)
> + cpu = cpumask_first();
> + if (cpu == smp_processor_id())
> + return;
> +
> + smp_rmb();
> +
> + if (per_cpu(watchdog_nmi_touch, cpu) == true) {
> + per_cpu(watchdog_nmi_touch, cpu) = false;
> + return;
> + }

Same race here.  Usually touch_nmi_watchdog is reserved for those
functions that plan on disabling interrupts for a while.  cpuB could set
cpuA's watchdog_nmi_touch to false here expecting not to revisit this
variable for another couple of seconds.  While cpuA could read this
variable milliseconds later after cpuB sets it and falsely assume there is
a lockup?

Perhaps I am misreading the code?

If not, I don't have a good idea on how to solve those races off the top of my
head unfortunately.

Cheers,
Don
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] hardlockup: detect hard lockups without NMIs using secondary cpus

2013-01-10 Thread Frederic Weisbecker
2013/1/10 Russell King - ARM Linux li...@arm.linux.org.uk:
 On Thu, Jan 10, 2013 at 09:02:15AM -0500, Don Zickus wrote:
 On Wed, Jan 09, 2013 at 05:57:39PM -0800, Colin Cross wrote:
  Emulate NMIs on systems where they are not available by using timer
  interrupts on other cpus.  Each cpu will use its softlockup hrtimer
  to check that the next cpu is processing hrtimer interrupts by
  verifying that a counter is increasing.
 
  This patch is useful on systems where the hardlockup detector is not
  available due to a lack of NMIs, for example most ARM SoCs.

 I have seen other cpus, like Sparc I think, create a 'virtual NMI' by
 reserving an IRQ line as 'special' (can not be masked).  Not sure if that
 is something worth looking at here (or even possible).

 No it isn't, because that assumes that things like spin_lock_irqsave()
 won't mask that interrupt.  We don't have the facility to do that.

I believe sparc is doing something like this though. Look at
arch/sparc/include/asm/irqflags_64.h, it seems NMIs are implemented
there using an irq number that is not masked by this function.

Not all archs can do that so easily I guess.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] hardlockup: detect hard lockups without NMIs using secondary cpus

2013-01-10 Thread Russell King - ARM Linux
On Thu, Jan 10, 2013 at 05:18:40PM +0100, Frederic Weisbecker wrote:
 2013/1/10 Russell King - ARM Linux li...@arm.linux.org.uk:
  On Thu, Jan 10, 2013 at 09:02:15AM -0500, Don Zickus wrote:
  On Wed, Jan 09, 2013 at 05:57:39PM -0800, Colin Cross wrote:
   Emulate NMIs on systems where they are not available by using timer
   interrupts on other cpus.  Each cpu will use its softlockup hrtimer
   to check that the next cpu is processing hrtimer interrupts by
   verifying that a counter is increasing.
  
   This patch is useful on systems where the hardlockup detector is not
   available due to a lack of NMIs, for example most ARM SoCs.
 
  I have seen other cpus, like Sparc I think, create a 'virtual NMI' by
  reserving an IRQ line as 'special' (can not be masked).  Not sure if that
  is something worth looking at here (or even possible).
 
  No it isn't, because that assumes that things like spin_lock_irqsave()
  won't mask that interrupt.  We don't have the facility to do that.
 
 I believe sparc is doing something like this though. Look at
 arch/sparc/include/asm/irqflags_64.h, it seems NMIs are implemented
 there using an irq number that is not masked by this function.

As I said, we don't have a facility to do that.

The CPU doesn't know about interrupt levels.  It's either all-IRQs-masked
or no-IRQs-masked.  If you want anything inbetween, you have to go outside
the CPU and fiddle with the IRQ controller, which may be one of _many_
different types, and some platforms even have a shadow IRQ controller.
Plus, doing such manipulation may in itself also require locking.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] hardlockup: detect hard lockups without NMIs using secondary cpus

2013-01-10 Thread Colin Cross
On Thu, Jan 10, 2013 at 6:02 AM, Don Zickus dzic...@redhat.com wrote:
 On Wed, Jan 09, 2013 at 05:57:39PM -0800, Colin Cross wrote:
 Emulate NMIs on systems where they are not available by using timer
 interrupts on other cpus.  Each cpu will use its softlockup hrtimer
 to check that the next cpu is processing hrtimer interrupts by
 verifying that a counter is increasing.

 This patch is useful on systems where the hardlockup detector is not
 available due to a lack of NMIs, for example most ARM SoCs.

 I have seen other cpus, like Sparc I think, create a 'virtual NMI' by
 reserving an IRQ line as 'special' (can not be masked).  Not sure if that
 is something worth looking at here (or even possible).

 Without this patch any cpu stuck with interrupts disabled can
 cause a hardware watchdog reset with no debugging information,
 but with this patch the kernel can detect the lockup and panic,
 which can result in useful debugging info.

 SNIP
 +#ifdef CONFIG_HARDLOCKUP_DETECTOR_OTHER_CPU
 +static int is_hardlockup_other_cpu(int cpu)
 +{
 + unsigned long hrint = per_cpu(hrtimer_interrupts, cpu);
 +
 + if (per_cpu(hrtimer_interrupts_saved, cpu) == hrint)
 + return 1;
 +
 + per_cpu(hrtimer_interrupts_saved, cpu) = hrint;
 + return 0;

 Will this race with the other cpu you are checking?  For example if cpuA
 just updated its hrtimer_interrupts_saved and cpuB goes to check cpuA's
 hrtimer_interrupts_saved, it seems possible that cpuB could falsely assume
 cpuA is stuck?

cpuA doesn't update its own hrtimer_interrupts_saved, cpuB does.
However, there may be a similar race condition during hotplug if cpuB
updates hrtimer_interrupts_saved for cpuA, then goes offline, then
cpuC may try to check cpuA and see that hrtimer_interrupts_saved ==
hrtimer_interrupts.  I think this can be solved by setting
watchdog_nmi_touch for the next cpu when a cpu goes online or offline.

 +}
 +
 +static void watchdog_check_hardlockup_other_cpu(void)
 +{
 + int cpu;
 + cpumask_t cpus = watchdog_cpus;
 +
 + /*
 +  * Test for hardlockups every 3 samples.  The sample period is
 +  *  watchdog_thresh * 2 / 5, so 3 samples gets us back to slightly over
 +  *  watchdog_thresh (over by 20%).
 +  */
 + if (__this_cpu_read(hrtimer_interrupts) % 3 != 0)
 + return;
 +
 + /* check for a hardlockup on the next cpu */
 + cpu = cpumask_next(smp_processor_id(), cpus);
 + if (cpu = nr_cpu_ids)
 + cpu = cpumask_first(cpus);
 + if (cpu == smp_processor_id())
 + return;
 +
 + smp_rmb();
 +
 + if (per_cpu(watchdog_nmi_touch, cpu) == true) {
 + per_cpu(watchdog_nmi_touch, cpu) = false;
 + return;
 + }

 Same race here.  Usually touch_nmi_watchdog is reserved for those
 functions that plan on disabling interrupts for a while.  cpuB could set
 cpuA's watchdog_nmi_touch to false here expecting not to revisit this
 variable for another couple of seconds.  While cpuA could read this
 variable milliseconds later after cpuB sets it and falsely assume there is
 a lockup?

 Perhaps I am misreading the code?

Again, cpuA won't ever read its own watchdog_nmi_touch variable, only
cpuB will.  The only variables cpuA updates for itself is
hrtimer_interrupts or setting watchdog_nmi_touch to true.
hrtimer_interrupts_saved and setting watchdog_nmi_touch to false are
done by the cpu watching over cpuA, so the only races here are when a
cpu goes offline and a different cpu starts watching over cpuA.

 If not, I don't have a good idea on how to solve those races off the top of my
 head unfortunately.

 Cheers,
 Don
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] hardlockup: detect hard lockups without NMIs using secondary cpus

2013-01-10 Thread Don Zickus
On Thu, Jan 10, 2013 at 09:27:28AM -0800, Colin Cross wrote:
 On Thu, Jan 10, 2013 at 6:02 AM, Don Zickus dzic...@redhat.com wrote:
  On Wed, Jan 09, 2013 at 05:57:39PM -0800, Colin Cross wrote:
  Emulate NMIs on systems where they are not available by using timer
  interrupts on other cpus.  Each cpu will use its softlockup hrtimer
  to check that the next cpu is processing hrtimer interrupts by
  verifying that a counter is increasing.
 
  This patch is useful on systems where the hardlockup detector is not
  available due to a lack of NMIs, for example most ARM SoCs.
 
  I have seen other cpus, like Sparc I think, create a 'virtual NMI' by
  reserving an IRQ line as 'special' (can not be masked).  Not sure if that
  is something worth looking at here (or even possible).
 
  Without this patch any cpu stuck with interrupts disabled can
  cause a hardware watchdog reset with no debugging information,
  but with this patch the kernel can detect the lockup and panic,
  which can result in useful debugging info.
 
  SNIP
  +#ifdef CONFIG_HARDLOCKUP_DETECTOR_OTHER_CPU
  +static int is_hardlockup_other_cpu(int cpu)
  +{
  + unsigned long hrint = per_cpu(hrtimer_interrupts, cpu);
  +
  + if (per_cpu(hrtimer_interrupts_saved, cpu) == hrint)
  + return 1;
  +
  + per_cpu(hrtimer_interrupts_saved, cpu) = hrint;
  + return 0;
 
  Will this race with the other cpu you are checking?  For example if cpuA
  just updated its hrtimer_interrupts_saved and cpuB goes to check cpuA's
  hrtimer_interrupts_saved, it seems possible that cpuB could falsely assume
  cpuA is stuck?
 
 cpuA doesn't update its own hrtimer_interrupts_saved, cpuB does.
 However, there may be a similar race condition during hotplug if cpuB
 updates hrtimer_interrupts_saved for cpuA, then goes offline, then
 cpuC may try to check cpuA and see that hrtimer_interrupts_saved ==
 hrtimer_interrupts.  I think this can be solved by setting
 watchdog_nmi_touch for the next cpu when a cpu goes online or offline.

Ah, that is where my misunderstanding was.  I overlooked the fact that it
was only updated by the other cpu.  Sorry about that.

I'll re-review it again with that in mind.

Cheers,
Don
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] hardlockup: detect hard lockups without NMIs using secondary cpus

2013-01-10 Thread Tony Lindgren
* Colin Cross ccr...@android.com [130109 18:05]:
 +static void watchdog_check_hardlockup_other_cpu(void)
 +{
 + int cpu;
 + cpumask_t cpus = watchdog_cpus;
 +
 + /*
 +  * Test for hardlockups every 3 samples.  The sample period is
 +  *  watchdog_thresh * 2 / 5, so 3 samples gets us back to slightly over
 +  *  watchdog_thresh (over by 20%).
 +  */
 + if (__this_cpu_read(hrtimer_interrupts) % 3 != 0)
 + return;
 +
 + /* check for a hardlockup on the next cpu */
 + cpu = cpumask_next(smp_processor_id(), cpus);

Hmm don't you want to check cpu_oneline_mask here and
return if the other CPU is offline?

 + if (cpu = nr_cpu_ids)
 + cpu = cpumask_first(cpus);
 + if (cpu == smp_processor_id())
 + return;

Regards,

Tony
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] hardlockup: detect hard lockups without NMIs using secondary cpus

2013-01-10 Thread Colin Cross
On Thu, Jan 10, 2013 at 12:38 PM, Tony Lindgren t...@atomide.com wrote:

 * Colin Cross ccr...@android.com [130109 18:05]:
  +static void watchdog_check_hardlockup_other_cpu(void)
  +{
  + int cpu;
  + cpumask_t cpus = watchdog_cpus;
  +
  + /*
  +  * Test for hardlockups every 3 samples.  The sample period is
  +  *  watchdog_thresh * 2 / 5, so 3 samples gets us back to slightly 
  over
  +  *  watchdog_thresh (over by 20%).
  +  */
  + if (__this_cpu_read(hrtimer_interrupts) % 3 != 0)
  + return;
  +
  + /* check for a hardlockup on the next cpu */
  + cpu = cpumask_next(smp_processor_id(), cpus);

 Hmm don't you want to check cpu_oneline_mask here and
 return if the other CPU is offline?

watchdog_cpus is effectively a local copy of cpu_online_mask, but
updated after the watchdog_nmi_touch in watchdog_nmi_enable.  This
avoids a false positive after hotplugging in a cpu when
cpu_online_mask is true but that cpu hasn't yet run it's first
hrtimer.

  + if (cpu = nr_cpu_ids)
  + cpu = cpumask_first(cpus);
  + if (cpu == smp_processor_id())
  + return;

 Regards,

 Tony
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] hardlockup: detect hard lockups without NMIs using secondary cpus

2013-01-10 Thread Tony Lindgren
* Colin Cross ccr...@android.com [130110 14:37]:
 On Thu, Jan 10, 2013 at 12:38 PM, Tony Lindgren t...@atomide.com wrote:
 
  * Colin Cross ccr...@android.com [130109 18:05]:
   +static void watchdog_check_hardlockup_other_cpu(void)
   +{
   + int cpu;
   + cpumask_t cpus = watchdog_cpus;
   +
   + /*
   +  * Test for hardlockups every 3 samples.  The sample period is
   +  *  watchdog_thresh * 2 / 5, so 3 samples gets us back to slightly 
   over
   +  *  watchdog_thresh (over by 20%).
   +  */
   + if (__this_cpu_read(hrtimer_interrupts) % 3 != 0)
   + return;
   +
   + /* check for a hardlockup on the next cpu */
   + cpu = cpumask_next(smp_processor_id(), cpus);
 
  Hmm don't you want to check cpu_oneline_mask here and
  return if the other CPU is offline?
 
 watchdog_cpus is effectively a local copy of cpu_online_mask, but
 updated after the watchdog_nmi_touch in watchdog_nmi_enable.  This
 avoids a false positive after hotplugging in a cpu when
 cpu_online_mask is true but that cpu hasn't yet run it's first
 hrtimer.

OK thanks for clarifying that.

Tony
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH] hardlockup: detect hard lockups without NMIs using secondary cpus

2013-01-10 Thread Liu, Chuansheng


 -Original Message-
 From: Colin Cross [mailto:ccr...@android.com]
 Sent: Thursday, January 10, 2013 9:58 AM
 To: linux-kernel@vger.kernel.org
 Cc: Andrew Morton; Don Zickus; Ingo Molnar; Thomas Gleixner; Liu,
 Chuansheng; linux-arm-ker...@lists.infradead.org; Colin Cross
 Subject: [PATCH] hardlockup: detect hard lockups without NMIs using
 secondary cpus
 
 Emulate NMIs on systems where they are not available by using timer
 interrupts on other cpus.  Each cpu will use its softlockup hrtimer
 to check that the next cpu is processing hrtimer interrupts by
 verifying that a counter is increasing.
 
 This patch is useful on systems where the hardlockup detector is not
 available due to a lack of NMIs, for example most ARM SoCs.
 Without this patch any cpu stuck with interrupts disabled can
 cause a hardware watchdog reset with no debugging information,
 but with this patch the kernel can detect the lockup and panic,
 which can result in useful debugging info.
 
 Signed-off-by: Colin Cross ccr...@android.com
 +static void watchdog_check_hardlockup_other_cpu(void)
 +{
 + int cpu;
 + cpumask_t cpus = watchdog_cpus;
 +
 + /*
 +  * Test for hardlockups every 3 samples.  The sample period is
 +  *  watchdog_thresh * 2 / 5, so 3 samples gets us back to slightly over
 +  *  watchdog_thresh (over by 20%).
 +  */
 + if (__this_cpu_read(hrtimer_interrupts) % 3 != 0)
 + return;
 +
 + /* check for a hardlockup on the next cpu */
 + cpu = cpumask_next(smp_processor_id(), cpus);
 + if (cpu = nr_cpu_ids)
 + cpu = cpumask_first(cpus);
 + if (cpu == smp_processor_id())
 + return;
 +
 + smp_rmb();
 +
 + if (per_cpu(watchdog_nmi_touch, cpu) == true) {
 + per_cpu(watchdog_nmi_touch, cpu) = false;
 + return;
 + }
 +
 + if (is_hardlockup_other_cpu(cpu)) {
 + /* only warn once */
One possible case for new hotplug CPU that false hardlockup case.
1/ Assume CPU1, CPU2 are online, CPU3 is being hotplug:
CPU3:CPU2:
watchdog_nmi_enable()
 per_cpu(watchdog_nmi_touch, cpu) = true;
 cpumask_set_cpu(cpu, watchdog_cpus);
 
watchdog_check_hardlockup_other_cpu()
   per_cpu(watchdog_nmi_touch, 
cpu) = false; ==  Here cpu is CPU3

2/ Before CPU3's first hrtimer interrupt coming, CPU2 is been offlined.
  Then CPU1's next CPU is CPU3. But we can not use CPU3's watchdog_nmi_touch to 
defense
  false CPU3 hardlock more. When CPU1's hrtimer interrupt coming, it is 
possible report CPU3
  false hard lockup.

Is it the case?


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] hardlockup: detect hard lockups without NMIs using secondary cpus

2013-01-10 Thread Colin Cross
On Thu, Jan 10, 2013 at 5:39 PM, Liu, Chuansheng
chuansheng@intel.com wrote:


 -Original Message-
 From: Colin Cross [mailto:ccr...@android.com]
 Sent: Thursday, January 10, 2013 9:58 AM
 To: linux-kernel@vger.kernel.org
 Cc: Andrew Morton; Don Zickus; Ingo Molnar; Thomas Gleixner; Liu,
 Chuansheng; linux-arm-ker...@lists.infradead.org; Colin Cross
 Subject: [PATCH] hardlockup: detect hard lockups without NMIs using
 secondary cpus

 Emulate NMIs on systems where they are not available by using timer
 interrupts on other cpus.  Each cpu will use its softlockup hrtimer
 to check that the next cpu is processing hrtimer interrupts by
 verifying that a counter is increasing.

 This patch is useful on systems where the hardlockup detector is not
 available due to a lack of NMIs, for example most ARM SoCs.
 Without this patch any cpu stuck with interrupts disabled can
 cause a hardware watchdog reset with no debugging information,
 but with this patch the kernel can detect the lockup and panic,
 which can result in useful debugging info.

 Signed-off-by: Colin Cross ccr...@android.com
 +static void watchdog_check_hardlockup_other_cpu(void)
 +{
 + int cpu;
 + cpumask_t cpus = watchdog_cpus;
 +
 + /*
 +  * Test for hardlockups every 3 samples.  The sample period is
 +  *  watchdog_thresh * 2 / 5, so 3 samples gets us back to slightly over
 +  *  watchdog_thresh (over by 20%).
 +  */
 + if (__this_cpu_read(hrtimer_interrupts) % 3 != 0)
 + return;
 +
 + /* check for a hardlockup on the next cpu */
 + cpu = cpumask_next(smp_processor_id(), cpus);
 + if (cpu = nr_cpu_ids)
 + cpu = cpumask_first(cpus);
 + if (cpu == smp_processor_id())
 + return;
 +
 + smp_rmb();
 +
 + if (per_cpu(watchdog_nmi_touch, cpu) == true) {
 + per_cpu(watchdog_nmi_touch, cpu) = false;
 + return;
 + }
 +
 + if (is_hardlockup_other_cpu(cpu)) {
 + /* only warn once */
 One possible case for new hotplug CPU that false hardlockup case.
 1/ Assume CPU1, CPU2 are online, CPU3 is being hotplug:
 CPU3:CPU2:
 watchdog_nmi_enable()
  per_cpu(watchdog_nmi_touch, cpu) = true;
  cpumask_set_cpu(cpu, watchdog_cpus);
  
 watchdog_check_hardlockup_other_cpu()

 per_cpu(watchdog_nmi_touch, cpu) = false; ==  Here cpu is CPU3

 2/ Before CPU3's first hrtimer interrupt coming, CPU2 is been offlined.
   Then CPU1's next CPU is CPU3. But we can not use CPU3's watchdog_nmi_touch 
 to defense
   false CPU3 hardlock more. When CPU1's hrtimer interrupt coming, it is 
 possible report CPU3
   false hard lockup.

 Is it the case?

Yes, this is the same race condition I pointed out in reply to Don
Zickus earlier in the thread.  I think the easiest solution is to set
per_cpu(watchdog_nmi_touch, next_cpu) = true during cpu offlining.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH] hardlockup: detect hard lockups without NMIs using secondary cpus

2013-01-10 Thread Liu, Chuansheng


 -Original Message-
 From: ccr...@google.com [mailto:ccr...@google.com] On Behalf Of Colin
 Cross
 Sent: Friday, January 11, 2013 1:34 PM
 To: Liu, Chuansheng
 Cc: linux-kernel@vger.kernel.org; Andrew Morton; Don Zickus; Ingo Molnar;
 Thomas Gleixner; linux-arm-ker...@lists.infradead.org
 Subject: Re: [PATCH] hardlockup: detect hard lockups without NMIs using
 secondary cpus
 
 On Thu, Jan 10, 2013 at 5:39 PM, Liu, Chuansheng
 chuansheng@intel.com wrote:
 
 
  -Original Message-
  From: Colin Cross [mailto:ccr...@android.com]
  Sent: Thursday, January 10, 2013 9:58 AM
  To: linux-kernel@vger.kernel.org
  Cc: Andrew Morton; Don Zickus; Ingo Molnar; Thomas Gleixner; Liu,
  Chuansheng; linux-arm-ker...@lists.infradead.org; Colin Cross
  Subject: [PATCH] hardlockup: detect hard lockups without NMIs using
  secondary cpus
 
  Emulate NMIs on systems where they are not available by using timer
  interrupts on other cpus.  Each cpu will use its softlockup hrtimer
  to check that the next cpu is processing hrtimer interrupts by
  verifying that a counter is increasing.
 
  This patch is useful on systems where the hardlockup detector is not
  available due to a lack of NMIs, for example most ARM SoCs.
  Without this patch any cpu stuck with interrupts disabled can
  cause a hardware watchdog reset with no debugging information,
  but with this patch the kernel can detect the lockup and panic,
  which can result in useful debugging info.
 
  Signed-off-by: Colin Cross ccr...@android.com
  +static void watchdog_check_hardlockup_other_cpu(void)
  +{
  + int cpu;
  + cpumask_t cpus = watchdog_cpus;
  +
  + /*
  +  * Test for hardlockups every 3 samples.  The sample period is
  +  *  watchdog_thresh * 2 / 5, so 3 samples gets us back to slightly
 over
  +  *  watchdog_thresh (over by 20%).
  +  */
  + if (__this_cpu_read(hrtimer_interrupts) % 3 != 0)
  + return;
  +
Another feeling is about __this_cpu_read(hrtimer_interrupts) % 3 != 0,
It will cause the actual timeout value for hard lockup detection is not very 
fix, or even
very short.
Sometimes using 3 samples can detect the lockup case, but sometimes 1 sample.
Is it the case?

And in NMI case, the NMI interrupt is coming at least every watchdog_thresh.

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] hardlockup: detect hard lockups without NMIs using secondary cpus

2013-01-10 Thread Colin Cross
On Thu, Jan 10, 2013 at 9:57 PM, Liu, Chuansheng
chuansheng@intel.com wrote:


 -Original Message-
 From: ccr...@google.com [mailto:ccr...@google.com] On Behalf Of Colin
 Cross
 Sent: Friday, January 11, 2013 1:34 PM
 To: Liu, Chuansheng
 Cc: linux-kernel@vger.kernel.org; Andrew Morton; Don Zickus; Ingo Molnar;
 Thomas Gleixner; linux-arm-ker...@lists.infradead.org
 Subject: Re: [PATCH] hardlockup: detect hard lockups without NMIs using
 secondary cpus

 On Thu, Jan 10, 2013 at 5:39 PM, Liu, Chuansheng
 chuansheng@intel.com wrote:
 
 
  -Original Message-
  From: Colin Cross [mailto:ccr...@android.com]
  Sent: Thursday, January 10, 2013 9:58 AM
  To: linux-kernel@vger.kernel.org
  Cc: Andrew Morton; Don Zickus; Ingo Molnar; Thomas Gleixner; Liu,
  Chuansheng; linux-arm-ker...@lists.infradead.org; Colin Cross
  Subject: [PATCH] hardlockup: detect hard lockups without NMIs using
  secondary cpus
 
  Emulate NMIs on systems where they are not available by using timer
  interrupts on other cpus.  Each cpu will use its softlockup hrtimer
  to check that the next cpu is processing hrtimer interrupts by
  verifying that a counter is increasing.
 
  This patch is useful on systems where the hardlockup detector is not
  available due to a lack of NMIs, for example most ARM SoCs.
  Without this patch any cpu stuck with interrupts disabled can
  cause a hardware watchdog reset with no debugging information,
  but with this patch the kernel can detect the lockup and panic,
  which can result in useful debugging info.
 
  Signed-off-by: Colin Cross ccr...@android.com
  +static void watchdog_check_hardlockup_other_cpu(void)
  +{
  + int cpu;
  + cpumask_t cpus = watchdog_cpus;
  +
  + /*
  +  * Test for hardlockups every 3 samples.  The sample period is
  +  *  watchdog_thresh * 2 / 5, so 3 samples gets us back to slightly
 over
  +  *  watchdog_thresh (over by 20%).
  +  */
  + if (__this_cpu_read(hrtimer_interrupts) % 3 != 0)
  + return;
  +
 Another feeling is about __this_cpu_read(hrtimer_interrupts) % 3 != 0,
 It will cause the actual timeout value for hard lockup detection is not very 
 fix, or even
 very short.
 Sometimes using 3 samples can detect the lockup case, but sometimes 1 sample.
 Is it the case?

I'm not sure what you mean.  The mod 3 will cause every 3rd timer (12
seconds, assuming watchdog_thresh = 10) to check hrtimer_interrupts
vs. hrtimer_interrupts_saved, and then update it.  The sampling should
be fixed and very accurate.  It will cause a panic/warning between 12
and 24 seconds after a cpu stops processing timer interrupts,
depending on the alignment of the hrtimers between the two cpus.

 And in NMI case, the NMI interrupt is coming at least every watchdog_thresh.

NMI interrupt will happen every 10 seconds instead of 12, meaning the
panic/warning will occur between 10 and 20 seconds after a cpu stops
processing timer interrupts, depending on the alignment of the NMI
with the hrtimer, but otherwise my patch should be very similar.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


RE: [PATCH] hardlockup: detect hard lockups without NMIs using secondary cpus

2013-01-10 Thread Liu, Chuansheng


 -Original Message-
 From: ccr...@google.com [mailto:ccr...@google.com] On Behalf Of Colin
 Cross
 Sent: Friday, January 11, 2013 2:18 PM
 To: Liu, Chuansheng
 Cc: linux-kernel@vger.kernel.org; Andrew Morton; Don Zickus; Ingo Molnar;
 Thomas Gleixner; linux-arm-ker...@lists.infradead.org
 Subject: Re: [PATCH] hardlockup: detect hard lockups without NMIs using
 secondary cpus
 
 On Thu, Jan 10, 2013 at 9:57 PM, Liu, Chuansheng
 chuansheng@intel.com wrote:
 
 
  -Original Message-
  From: ccr...@google.com [mailto:ccr...@google.com] On Behalf Of Colin
  Cross
  Sent: Friday, January 11, 2013 1:34 PM
  To: Liu, Chuansheng
  Cc: linux-kernel@vger.kernel.org; Andrew Morton; Don Zickus; Ingo Molnar;
  Thomas Gleixner; linux-arm-ker...@lists.infradead.org
  Subject: Re: [PATCH] hardlockup: detect hard lockups without NMIs using
  secondary cpus
 
  On Thu, Jan 10, 2013 at 5:39 PM, Liu, Chuansheng
  chuansheng@intel.com wrote:
  
  
   -Original Message-
   From: Colin Cross [mailto:ccr...@android.com]
   Sent: Thursday, January 10, 2013 9:58 AM
   To: linux-kernel@vger.kernel.org
   Cc: Andrew Morton; Don Zickus; Ingo Molnar; Thomas Gleixner; Liu,
   Chuansheng; linux-arm-ker...@lists.infradead.org; Colin Cross
   Subject: [PATCH] hardlockup: detect hard lockups without NMIs using
   secondary cpus
  
   Emulate NMIs on systems where they are not available by using timer
   interrupts on other cpus.  Each cpu will use its softlockup hrtimer
   to check that the next cpu is processing hrtimer interrupts by
   verifying that a counter is increasing.
  
   This patch is useful on systems where the hardlockup detector is not
   available due to a lack of NMIs, for example most ARM SoCs.
   Without this patch any cpu stuck with interrupts disabled can
   cause a hardware watchdog reset with no debugging information,
   but with this patch the kernel can detect the lockup and panic,
   which can result in useful debugging info.
  
   Signed-off-by: Colin Cross ccr...@android.com
   +static void watchdog_check_hardlockup_other_cpu(void)
   +{
   + int cpu;
   + cpumask_t cpus = watchdog_cpus;
   +
   + /*
   +  * Test for hardlockups every 3 samples.  The sample period is
   +  *  watchdog_thresh * 2 / 5, so 3 samples gets us back to
 slightly
  over
   +  *  watchdog_thresh (over by 20%).
   +  */
   + if (__this_cpu_read(hrtimer_interrupts) % 3 != 0)
   + return;
   +
  Another feeling is about __this_cpu_read(hrtimer_interrupts) % 3 != 0,
  It will cause the actual timeout value for hard lockup detection is not 
  very fix,
 or even
  very short.
  Sometimes using 3 samples can detect the lockup case, but sometimes 1
 sample.
  Is it the case?
 
 I'm not sure what you mean.  The mod 3 will cause every 3rd timer (12
 seconds, assuming watchdog_thresh = 10) to check hrtimer_interrupts
 vs. hrtimer_interrupts_saved, and then update it.  The sampling should
 be fixed and very accurate.  It will cause a panic/warning between 12
 and 24 seconds after a cpu stops processing timer interrupts,
 depending on the alignment of the hrtimers between the two cpus.
 
You are right, thanks.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] hardlockup: detect hard lockups without NMIs using secondary cpus

2013-01-10 Thread Don Zickus
On Wed, Jan 09, 2013 at 05:57:39PM -0800, Colin Cross wrote:
 Emulate NMIs on systems where they are not available by using timer
 interrupts on other cpus.  Each cpu will use its softlockup hrtimer
 to check that the next cpu is processing hrtimer interrupts by
 verifying that a counter is increasing.
 
 This patch is useful on systems where the hardlockup detector is not
 available due to a lack of NMIs, for example most ARM SoCs.

I have seen other cpus, like Sparc I think, create a 'virtual NMI' by
reserving an IRQ line as 'special' (can not be masked).  Not sure if that
is something worth looking at here (or even possible).

 Without this patch any cpu stuck with interrupts disabled can
 cause a hardware watchdog reset with no debugging information,
 but with this patch the kernel can detect the lockup and panic,
 which can result in useful debugging info.

SNIP
 +#ifdef CONFIG_HARDLOCKUP_DETECTOR_OTHER_CPU
 +static int is_hardlockup_other_cpu(int cpu)
 +{
 + unsigned long hrint = per_cpu(hrtimer_interrupts, cpu);
 +
 + if (per_cpu(hrtimer_interrupts_saved, cpu) == hrint)
 + return 1;
 +
 + per_cpu(hrtimer_interrupts_saved, cpu) = hrint;
 + return 0;

Will this race with the other cpu you are checking?  For example if cpuA
just updated its hrtimer_interrupts_saved and cpuB goes to check cpuA's
hrtimer_interrupts_saved, it seems possible that cpuB could falsely assume
cpuA is stuck?


 +}
 +
 +static void watchdog_check_hardlockup_other_cpu(void)
 +{
 + int cpu;
 + cpumask_t cpus = watchdog_cpus;
 +
 + /*
 +  * Test for hardlockups every 3 samples.  The sample period is
 +  *  watchdog_thresh * 2 / 5, so 3 samples gets us back to slightly over
 +  *  watchdog_thresh (over by 20%).
 +  */
 + if (__this_cpu_read(hrtimer_interrupts) % 3 != 0)
 + return;
 +
 + /* check for a hardlockup on the next cpu */
 + cpu = cpumask_next(smp_processor_id(), cpus);
 + if (cpu = nr_cpu_ids)
 + cpu = cpumask_first(cpus);
 + if (cpu == smp_processor_id())
 + return;
 +
 + smp_rmb();
 +
 + if (per_cpu(watchdog_nmi_touch, cpu) == true) {
 + per_cpu(watchdog_nmi_touch, cpu) = false;
 + return;
 + }

Same race here.  Usually touch_nmi_watchdog is reserved for those
functions that plan on disabling interrupts for a while.  cpuB could set
cpuA's watchdog_nmi_touch to false here expecting not to revisit this
variable for another couple of seconds.  While cpuA could read this
variable milliseconds later after cpuB sets it and falsely assume there is
a lockup?

Perhaps I am misreading the code?

If not, I don't have a good idea on how to solve those races off the top of my
head unfortunately.

Cheers,
Don
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [PATCH] hardlockup: detect hard lockups without NMIs using secondary cpus

2013-01-10 Thread Russell King - ARM Linux
On Thu, Jan 10, 2013 at 09:02:15AM -0500, Don Zickus wrote:
 On Wed, Jan 09, 2013 at 05:57:39PM -0800, Colin Cross wrote:
  Emulate NMIs on systems where they are not available by using timer
  interrupts on other cpus.  Each cpu will use its softlockup hrtimer
  to check that the next cpu is processing hrtimer interrupts by
  verifying that a counter is increasing.
  
  This patch is useful on systems where the hardlockup detector is not
  available due to a lack of NMIs, for example most ARM SoCs.
 
 I have seen other cpus, like Sparc I think, create a 'virtual NMI' by
 reserving an IRQ line as 'special' (can not be masked).  Not sure if that
 is something worth looking at here (or even possible).

No it isn't, because that assumes that things like spin_lock_irqsave()
won't mask that interrupt.  We don't have the facility to do that.
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[PATCH] hardlockup: detect hard lockups without NMIs using secondary cpus

2013-01-09 Thread Colin Cross
Emulate NMIs on systems where they are not available by using timer
interrupts on other cpus.  Each cpu will use its softlockup hrtimer
to check that the next cpu is processing hrtimer interrupts by
verifying that a counter is increasing.

This patch is useful on systems where the hardlockup detector is not
available due to a lack of NMIs, for example most ARM SoCs.
Without this patch any cpu stuck with interrupts disabled can
cause a hardware watchdog reset with no debugging information,
but with this patch the kernel can detect the lockup and panic,
which can result in useful debugging info.

Signed-off-by: Colin Cross 
---
 include/linux/nmi.h |5 ++-
 kernel/watchdog.c   |   98 ---
 lib/Kconfig.debug   |   14 +++-
 3 files changed, 110 insertions(+), 7 deletions(-)

diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index db50840..c8f8aa0 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -14,8 +14,11 @@
  * may be used to reset the timeout - for code which intentionally
  * disables interrupts for a long time. This call is stateless.
  */
-#if defined(CONFIG_HAVE_NMI_WATCHDOG) || defined(CONFIG_HARDLOCKUP_DETECTOR)
+#if defined(CONFIG_HAVE_NMI_WATCHDOG) || 
defined(CONFIG_HARDLOCKUP_DETECTOR_NMI)
 #include 
+#endif
+
+#if defined(CONFIG_HAVE_NMI_WATCHDOG) || defined(CONFIG_HARDLOCKUP_DETECTOR)
 extern void touch_nmi_watchdog(void);
 #else
 static inline void touch_nmi_watchdog(void)
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 75a2ab3..94c231e 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -44,6 +44,11 @@
 static DEFINE_PER_CPU(bool, hard_watchdog_warn);
 static DEFINE_PER_CPU(bool, watchdog_nmi_touch);
 static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts_saved);
+#endif
+#ifdef CONFIG_HARDLOCKUP_DETECTOR_OTHER_CPU
+static cpumask_t __read_mostly watchdog_cpus;
+#endif
+#ifdef CONFIG_HARDLOCKUP_DETECTOR_NMI
 static DEFINE_PER_CPU(struct perf_event *, watchdog_ev);
 #endif
 
@@ -179,7 +184,7 @@ void touch_softlockup_watchdog_sync(void)
__raw_get_cpu_var(watchdog_touch_ts) = 0;
 }
 
-#ifdef CONFIG_HARDLOCKUP_DETECTOR
+#ifdef CONFIG_HARDLOCKUP_DETECTOR_NMI
 /* watchdog detector functions */
 static int is_hardlockup(void)
 {
@@ -193,6 +198,64 @@ static int is_hardlockup(void)
 }
 #endif
 
+#ifdef CONFIG_HARDLOCKUP_DETECTOR_OTHER_CPU
+static int is_hardlockup_other_cpu(int cpu)
+{
+   unsigned long hrint = per_cpu(hrtimer_interrupts, cpu);
+
+   if (per_cpu(hrtimer_interrupts_saved, cpu) == hrint)
+   return 1;
+
+   per_cpu(hrtimer_interrupts_saved, cpu) = hrint;
+   return 0;
+}
+
+static void watchdog_check_hardlockup_other_cpu(void)
+{
+   int cpu;
+   cpumask_t cpus = watchdog_cpus;
+
+   /*
+* Test for hardlockups every 3 samples.  The sample period is
+*  watchdog_thresh * 2 / 5, so 3 samples gets us back to slightly over
+*  watchdog_thresh (over by 20%).
+*/
+   if (__this_cpu_read(hrtimer_interrupts) % 3 != 0)
+   return;
+
+   /* check for a hardlockup on the next cpu */
+   cpu = cpumask_next(smp_processor_id(), );
+   if (cpu >= nr_cpu_ids)
+   cpu = cpumask_first();
+   if (cpu == smp_processor_id())
+   return;
+
+   smp_rmb();
+
+   if (per_cpu(watchdog_nmi_touch, cpu) == true) {
+   per_cpu(watchdog_nmi_touch, cpu) = false;
+   return;
+   }
+
+   if (is_hardlockup_other_cpu(cpu)) {
+   /* only warn once */
+   if (per_cpu(hard_watchdog_warn, cpu) == true)
+   return;
+
+   if (hardlockup_panic)
+   panic("Watchdog detected hard LOCKUP on cpu %d", cpu);
+   else
+   WARN(1, "Watchdog detected hard LOCKUP on cpu %d", cpu);
+
+   per_cpu(hard_watchdog_warn, cpu) = true;
+   } else {
+   per_cpu(hard_watchdog_warn, cpu) = false;
+   }
+}
+#else
+static inline void watchdog_check_hardlockup_other_cpu(void) { return; }
+#endif
+
 static int is_softlockup(unsigned long touch_ts)
 {
unsigned long now = get_timestamp(smp_processor_id());
@@ -204,7 +267,7 @@ static int is_softlockup(unsigned long touch_ts)
return 0;
 }
 
-#ifdef CONFIG_HARDLOCKUP_DETECTOR
+#ifdef CONFIG_HARDLOCKUP_DETECTOR_NMI
 
 static struct perf_event_attr wd_hw_attr = {
.type   = PERF_TYPE_HARDWARE,
@@ -252,7 +315,7 @@ static void watchdog_overflow_callback(struct perf_event 
*event,
__this_cpu_write(hard_watchdog_warn, false);
return;
 }
-#endif /* CONFIG_HARDLOCKUP_DETECTOR */
+#endif /* CONFIG_HARDLOCKUP_DETECTOR_NMI */
 
 static void watchdog_interrupt_count(void)
 {
@@ -272,6 +335,9 @@ static enum hrtimer_restart watchdog_timer_fn(struct 
hrtimer *hrtimer)
/* kick the hardlockup detector */
watchdog_interrupt_count();
 
+   /* test for hardlockups 

[PATCH] hardlockup: detect hard lockups without NMIs using secondary cpus

2013-01-09 Thread Colin Cross
Emulate NMIs on systems where they are not available by using timer
interrupts on other cpus.  Each cpu will use its softlockup hrtimer
to check that the next cpu is processing hrtimer interrupts by
verifying that a counter is increasing.

This patch is useful on systems where the hardlockup detector is not
available due to a lack of NMIs, for example most ARM SoCs.
Without this patch any cpu stuck with interrupts disabled can
cause a hardware watchdog reset with no debugging information,
but with this patch the kernel can detect the lockup and panic,
which can result in useful debugging info.

Signed-off-by: Colin Cross ccr...@android.com
---
 include/linux/nmi.h |5 ++-
 kernel/watchdog.c   |   98 ---
 lib/Kconfig.debug   |   14 +++-
 3 files changed, 110 insertions(+), 7 deletions(-)

diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index db50840..c8f8aa0 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -14,8 +14,11 @@
  * may be used to reset the timeout - for code which intentionally
  * disables interrupts for a long time. This call is stateless.
  */
-#if defined(CONFIG_HAVE_NMI_WATCHDOG) || defined(CONFIG_HARDLOCKUP_DETECTOR)
+#if defined(CONFIG_HAVE_NMI_WATCHDOG) || 
defined(CONFIG_HARDLOCKUP_DETECTOR_NMI)
 #include asm/nmi.h
+#endif
+
+#if defined(CONFIG_HAVE_NMI_WATCHDOG) || defined(CONFIG_HARDLOCKUP_DETECTOR)
 extern void touch_nmi_watchdog(void);
 #else
 static inline void touch_nmi_watchdog(void)
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 75a2ab3..94c231e 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -44,6 +44,11 @@
 static DEFINE_PER_CPU(bool, hard_watchdog_warn);
 static DEFINE_PER_CPU(bool, watchdog_nmi_touch);
 static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts_saved);
+#endif
+#ifdef CONFIG_HARDLOCKUP_DETECTOR_OTHER_CPU
+static cpumask_t __read_mostly watchdog_cpus;
+#endif
+#ifdef CONFIG_HARDLOCKUP_DETECTOR_NMI
 static DEFINE_PER_CPU(struct perf_event *, watchdog_ev);
 #endif
 
@@ -179,7 +184,7 @@ void touch_softlockup_watchdog_sync(void)
__raw_get_cpu_var(watchdog_touch_ts) = 0;
 }
 
-#ifdef CONFIG_HARDLOCKUP_DETECTOR
+#ifdef CONFIG_HARDLOCKUP_DETECTOR_NMI
 /* watchdog detector functions */
 static int is_hardlockup(void)
 {
@@ -193,6 +198,64 @@ static int is_hardlockup(void)
 }
 #endif
 
+#ifdef CONFIG_HARDLOCKUP_DETECTOR_OTHER_CPU
+static int is_hardlockup_other_cpu(int cpu)
+{
+   unsigned long hrint = per_cpu(hrtimer_interrupts, cpu);
+
+   if (per_cpu(hrtimer_interrupts_saved, cpu) == hrint)
+   return 1;
+
+   per_cpu(hrtimer_interrupts_saved, cpu) = hrint;
+   return 0;
+}
+
+static void watchdog_check_hardlockup_other_cpu(void)
+{
+   int cpu;
+   cpumask_t cpus = watchdog_cpus;
+
+   /*
+* Test for hardlockups every 3 samples.  The sample period is
+*  watchdog_thresh * 2 / 5, so 3 samples gets us back to slightly over
+*  watchdog_thresh (over by 20%).
+*/
+   if (__this_cpu_read(hrtimer_interrupts) % 3 != 0)
+   return;
+
+   /* check for a hardlockup on the next cpu */
+   cpu = cpumask_next(smp_processor_id(), cpus);
+   if (cpu = nr_cpu_ids)
+   cpu = cpumask_first(cpus);
+   if (cpu == smp_processor_id())
+   return;
+
+   smp_rmb();
+
+   if (per_cpu(watchdog_nmi_touch, cpu) == true) {
+   per_cpu(watchdog_nmi_touch, cpu) = false;
+   return;
+   }
+
+   if (is_hardlockup_other_cpu(cpu)) {
+   /* only warn once */
+   if (per_cpu(hard_watchdog_warn, cpu) == true)
+   return;
+
+   if (hardlockup_panic)
+   panic(Watchdog detected hard LOCKUP on cpu %d, cpu);
+   else
+   WARN(1, Watchdog detected hard LOCKUP on cpu %d, cpu);
+
+   per_cpu(hard_watchdog_warn, cpu) = true;
+   } else {
+   per_cpu(hard_watchdog_warn, cpu) = false;
+   }
+}
+#else
+static inline void watchdog_check_hardlockup_other_cpu(void) { return; }
+#endif
+
 static int is_softlockup(unsigned long touch_ts)
 {
unsigned long now = get_timestamp(smp_processor_id());
@@ -204,7 +267,7 @@ static int is_softlockup(unsigned long touch_ts)
return 0;
 }
 
-#ifdef CONFIG_HARDLOCKUP_DETECTOR
+#ifdef CONFIG_HARDLOCKUP_DETECTOR_NMI
 
 static struct perf_event_attr wd_hw_attr = {
.type   = PERF_TYPE_HARDWARE,
@@ -252,7 +315,7 @@ static void watchdog_overflow_callback(struct perf_event 
*event,
__this_cpu_write(hard_watchdog_warn, false);
return;
 }
-#endif /* CONFIG_HARDLOCKUP_DETECTOR */
+#endif /* CONFIG_HARDLOCKUP_DETECTOR_NMI */
 
 static void watchdog_interrupt_count(void)
 {
@@ -272,6 +335,9 @@ static enum hrtimer_restart watchdog_timer_fn(struct 
hrtimer *hrtimer)
/* kick the hardlockup detector */
watchdog_interrupt_count();
 
+