> On Tue, 15 Aug 2017, Liang, Kan wrote:
> > This patch which speed up the hrtimer (https://lkml.org/lkml/2017/6/26/685)
> > is decent to fix the spurious hard lockups.
> > Tested-by: Kan Liang
> >
> > Please consider to merge it into both mainline and stable tree.
>
> Well, it 'fixes' the
On Tue, 15 Aug 2017, Liang, Kan wrote:
> This patch which speed up the hrtimer (https://lkml.org/lkml/2017/6/26/685)
> is decent to fix the spurious hard lockups.
> Tested-by: Kan Liang
>
> Please consider to merge it into both mainline and stable tree.
Well, it 'fixes' the problem, but at the
On Mon, Aug 14, 2017 at 6:16 PM, Liang, Kan wrote:
>
> We have confirmed that the hardlock with "speed up the hrtimer" patch is
> actually another issue.
Good.
However:
> Tim has already proposed a patch to fix it.
> Here is his patch. https://lkml.org/lkml/2017/8/14/1000
Ugh. I hate that
> On Mon, Jul 17, 2017 at 01:24:23AM +, Liang, Kan wrote:
> > Hi Don & Thomas,
> >
> > Sorry for the late response. We just finished the tests for all proposed
> > patches.
> >
> > There are three proposed patches so far.
> > Patch 1: The patch as above which speed up the hrtimer.
> > Patch 2:
On Mon, 17 Jul 2017, Liang, Kan wrote:
> > > > > According to our test, only patch 3 works well.
> > > > > The other two patches will hang the system eventually.
> >
> > Hang the system eventually? Does that mean that the system stops working
> > and the watchdog does not catch the problem?
>
>
> On Mon, 17 Jul 2017, Liang, Kan wrote:
> > > That doesn't make sense. What's the exact test procedure?
> >
> > I don't know the exact test procedure. The test case is from our customer.
> > I only know that the test case makes calls into the x11 libs.
>
> Sigh. This starts to be silly. You
On Mon, Jul 17, 2017 at 01:24:23AM +, Liang, Kan wrote:
> Hi Don & Thomas,
>
> Sorry for the late response. We just finished the tests for all proposed
> patches.
>
> There are three proposed patches so far.
> Patch 1: The patch as above which speed up the hrtimer.
> Patch 2: Thomas's first
On Mon, 17 Jul 2017, Liang, Kan wrote:
> > That doesn't make sense. What's the exact test procedure?
>
> I don't know the exact test procedure. The test case is from our customer.
> I only know that the test case makes calls into the x11 libs.
Sigh. This starts to be silly. You test something
>
> On Mon, 17 Jul 2017, Liang, Kan wrote:
> > There are three proposed patches so far.
> > Patch 1: The patch as above which speed up the hrtimer.
> > Patch 2: Thomas's first proposal.
> > https://patchwork.kernel.org/patch/9803033/
> > https://patchwork.kernel.org/patch/9805903/
> > Patch 3: my
On Mon, 17 Jul 2017, Liang, Kan wrote:
> There are three proposed patches so far.
> Patch 1: The patch as above which speed up the hrtimer.
> Patch 2: Thomas's first proposal.
> https://patchwork.kernel.org/patch/9803033/
> https://patchwork.kernel.org/patch/9805903/
> Patch 3: my original
> On Mon, Jun 26, 2017 at 04:19:27PM -0400, Don Zickus wrote:
> > On Fri, Jun 23, 2017 at 11:50:25PM +0200, Thomas Gleixner wrote:
> > > On Fri, 23 Jun 2017, Don Zickus wrote:
> > > > Hmm, all this work for a temp fix. Kan, how much longer until the
> > > > real fix of having perf count the
> > Thomas' patch to modulate the frequency seemed reasonable to me.
> > It made the NMI watchdog depend on accurate ktime, but that's probably ok.
>
> Ok, did Kan finish testing this patch (with the small fix on top)?
Kan doesn't have the specific hardware to test it. We've been waiting
for
On Thu, Jun 29, 2017 at 09:12:20AM -0700, Andi Kleen wrote:
> On Thu, Jun 29, 2017 at 11:44:06AM -0400, Don Zickus wrote:
> > On Wed, Jun 28, 2017 at 01:14:04PM -0700, Andi Kleen wrote:
> > > It can be a useful debugging tool for a specific class of bugs:
> > > when kernel software is looping
On Thu, Jun 29, 2017 at 11:44:06AM -0400, Don Zickus wrote:
> On Wed, Jun 28, 2017 at 01:14:04PM -0700, Andi Kleen wrote:
> > It can be a useful debugging tool for a specific class of bugs:
> > when kernel software is looping forever.
> >
> > But if that happens does it really matter how many
On Wed, Jun 28, 2017 at 01:14:04PM -0700, Andi Kleen wrote:
> It can be a useful debugging tool for a specific class of bugs:
> when kernel software is looping forever.
>
> But if that happens does it really matter how many iterations the
> loop does before it is stopped?
>
> Even the current
On Wed, Jun 28, 2017 at 03:00:08PM -0400, Don Zickus wrote:
> On Tue, Jun 27, 2017 at 04:48:22PM -0700, Andi Kleen wrote:
> > > I haven't heard back any test result yet.
> > >
> > > The above patch looks good to me.
> >
> > This needs performance testing. It may slow down performance or latency
On Tue, Jun 27, 2017 at 04:48:22PM -0700, Andi Kleen wrote:
> > I haven't heard back any test result yet.
> >
> > The above patch looks good to me.
>
> This needs performance testing. It may slow down performance or latency
> sensitive workloads.
More motivation to work through the issues
> I haven't heard back any test result yet.
>
> The above patch looks good to me.
This needs performance testing. It may slow down performance or latency
sensitive workloads.
> Which workaround do you prefer, the above one or the one checking timestamp?
I prefer the earlier patch, it has far
On Tue, Jun 27, 2017 at 08:49:19PM +, Liang, Kan wrote:
>
> > On Mon, Jun 26, 2017 at 04:19:27PM -0400, Don Zickus wrote:
> > > On Fri, Jun 23, 2017 at 11:50:25PM +0200, Thomas Gleixner wrote:
> > > > On Fri, 23 Jun 2017, Don Zickus wrote:
> > > > > Hmm, all this work for a temp fix. Kan,
> On Mon, Jun 26, 2017 at 04:19:27PM -0400, Don Zickus wrote:
> > On Fri, Jun 23, 2017 at 11:50:25PM +0200, Thomas Gleixner wrote:
> > > On Fri, 23 Jun 2017, Don Zickus wrote:
> > > > Hmm, all this work for a temp fix. Kan, how much longer until the
> > > > real fix of having perf count the
On Mon, Jun 26, 2017 at 04:19:27PM -0400, Don Zickus wrote:
> On Fri, Jun 23, 2017 at 11:50:25PM +0200, Thomas Gleixner wrote:
> > On Fri, 23 Jun 2017, Don Zickus wrote:
> > > Hmm, all this work for a temp fix. Kan, how much longer until the real fix
> > > of having perf count the right
On Mon, 26 Jun 2017, Don Zickus wrote:
> On Fri, Jun 23, 2017 at 11:50:25PM +0200, Thomas Gleixner wrote:
> > On Fri, 23 Jun 2017, Don Zickus wrote:
> > > Hmm, all this work for a temp fix. Kan, how much longer until the real fix
> > > of having perf count the right cycles?
> >
> > Quite
On Fri, Jun 23, 2017 at 11:50:25PM +0200, Thomas Gleixner wrote:
> On Fri, 23 Jun 2017, Don Zickus wrote:
> > Hmm, all this work for a temp fix. Kan, how much longer until the real fix
> > of having perf count the right cycles?
>
> Quite a while. The approach is wilfully breaking the user space
On Fri, 23 Jun 2017, Don Zickus wrote:
> Hmm, all this work for a temp fix. Kan, how much longer until the real fix
> of having perf count the right cycles?
Quite a while. The approach is wilfully breaking the user space ABI, which
is not going to happen.
And there is a simpler solution as
On Fri, Jun 23, 2017 at 10:01:55AM +0200, Thomas Gleixner wrote:
> On Thu, 22 Jun 2017, Don Zickus wrote:
> > On Wed, Jun 21, 2017 at 11:53:57PM +0200, Thomas Gleixner wrote:
> > > On Wed, 21 Jun 2017, kan.li...@intel.com wrote:
> > > > We now have more and more systems where the Turbo range is
On Thu, 22 Jun 2017, Don Zickus wrote:
> On Wed, Jun 21, 2017 at 11:53:57PM +0200, Thomas Gleixner wrote:
> > On Wed, 21 Jun 2017, kan.li...@intel.com wrote:
> > > We now have more and more systems where the Turbo range is wide enough
> > > that the NMI watchdog expires faster than the soft
> Subject: Re: [PATCH V2] kernel/watchdog: fix spurious hard lockups
>
> On Wed, Jun 21, 2017 at 11:53:57PM +0200, Thomas Gleixner wrote:
> > On Wed, 21 Jun 2017, kan.li...@intel.com wrote:
> > > We now have more and more systems where the Turbo range is wide
> >
On Wed, Jun 21, 2017 at 11:53:57PM +0200, Thomas Gleixner wrote:
> On Wed, 21 Jun 2017, kan.li...@intel.com wrote:
> > We now have more and more systems where the Turbo range is wide enough
> > that the NMI watchdog expires faster than the soft watchdog timer that
> > updates the interrupt tick
On Wed, 21 Jun 2017, Thomas Gleixner wrote:
> On Wed, 21 Jun 2017, kan.li...@intel.com wrote:
> > We now have more and more systems where the Turbo range is wide enough
> > that the NMI watchdog expires faster than the soft watchdog timer that
> > updates the interrupt tick the NMI watchdog
On Wed, 21 Jun 2017, kan.li...@intel.com wrote:
> We now have more and more systems where the Turbo range is wide enough
> that the NMI watchdog expires faster than the soft watchdog timer that
> updates the interrupt tick the NMI watchdog relies on.
>
> This problem was originally added by
On Wed, 21 Jun 2017, Andi Kleen wrote:
> On Wed, Jun 21, 2017 at 05:12:06PM +0200, Thomas Gleixner wrote:
> > On Wed, 21 Jun 2017, kan.li...@intel.com wrote:
> > >
> > > #ifdef CONFIG_HARDLOCKUP_DETECTOR
> > > +/*
> > > + * The NMI watchdog relies on PERF_COUNT_HW_CPU_CYCLES event, which
> > >
On 06/21/2017 11:47 AM, Liang, Kan wrote:
>
>
>> On Wed, 21 Jun 2017, kan.li...@intel.com wrote:
>>>
>>> #ifdef CONFIG_HARDLOCKUP_DETECTOR
>>> +/*
>>> + * The NMI watchdog relies on PERF_COUNT_HW_CPU_CYCLES event, which
>>> + * can tick faster than the measured CPU Frequency due to Turbo
> On Wed, 21 Jun 2017, kan.li...@intel.com wrote:
> >
> > #ifdef CONFIG_HARDLOCKUP_DETECTOR
> > +/*
> > + * The NMI watchdog relies on PERF_COUNT_HW_CPU_CYCLES event, which
> > + * can tick faster than the measured CPU Frequency due to Turbo mode.
> > + * That can lead to spurious timeouts.
>
On Wed, Jun 21, 2017 at 05:12:06PM +0200, Thomas Gleixner wrote:
> On Wed, 21 Jun 2017, kan.li...@intel.com wrote:
> >
> > #ifdef CONFIG_HARDLOCKUP_DETECTOR
> > +/*
> > + * The NMI watchdog relies on PERF_COUNT_HW_CPU_CYCLES event, which
> > + * can tick faster than the measured CPU Frequency
On Wed, 21 Jun 2017, kan.li...@intel.com wrote:
>
> #ifdef CONFIG_HARDLOCKUP_DETECTOR
> +/*
> + * The NMI watchdog relies on PERF_COUNT_HW_CPU_CYCLES event, which
> + * can tick faster than the measured CPU Frequency due to Turbo mode.
> + * That can lead to spurious timeouts.
> + * To
From: Kan Liang
Some users reported spurious NMI watchdog timeouts.
We now have more and more systems where the Turbo range is wide enough
that the NMI watchdog expires faster than the soft watchdog timer that
updates the interrupt tick the NMI watchdog relies on.
This problem was originally