On Thu, Feb 12, 2015 at 3:09 AM, Martin van Es wrote:
>
> Best I can come up with now is try the next mainline that has all the
> fixes and ideas in this thread incorporated. Would that be 3.19?
Yes. I'm attaching a patch (very much experimental - it might
introduce new problems rather than fix
To follow up on this long-standing promise to bisect.
I've made two attempts at bisecting and both landed in limbo. It's
hard to explain but it feels like this bug has quantum properties;
I know for sure it's present in 3.17 and not in 3.16(.7). But once I
start bisecting it gets less
On Sun, 21 Dec 2014, Linus Torvalds wrote:
> On Sun, Dec 21, 2014 at 2:32 PM, Dave Jones wrote:
> > On Sun, Dec 21, 2014 at 02:19:03PM -0800, Linus Torvalds wrote:
> > >
> > > And finally, and stupidly, is there any chance that you have anything
> > > accessing /dev/hpet?
> >
> > Not knowingly
On Mon, Jan 5, 2015 at 5:25 PM, Linus Torvalds
wrote:
> On Mon, Jan 5, 2015 at 5:17 PM, John Stultz wrote:
>>
>> Anyway, it may be worth keeping the 50% margin (and dropping the 12%
>> reduction to simplify things)
>
> Again, the 50% margin is only on the multiplication overflow. Not on the
On Mon, Jan 5, 2015 at 5:17 PM, John Stultz wrote:
>
> Anyway, it may be worth keeping the 50% margin (and dropping the 12%
> reduction to simplify things)
Again, the 50% margin is only on the multiplication overflow. Not on the mask.
So it won't do anything at all for the case we actually care
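The distinction being drawn here (overflow margin vs. counter mask) can be sketched roughly as follows. This is an illustration of the argument, not the kernel's actual clocksource code; the function name and the constants are made up for the example:

```python
def max_safe_cycles(mult, mask):
    # Converting cycles to nanoseconds is roughly (cycles * mult) >> shift,
    # done in 64-bit arithmetic, so cycles * mult must not overflow 2^64.
    overflow_limit = ((1 << 64) - 1) // mult
    # The "50% margin" discussed above applies only to this overflow bound...
    margin_limited = overflow_limit // 2
    # ...while the counter mask is a separate, hard limit. When the mask is
    # the smaller of the two (e.g. a 32-bit HPET), the margin changes
    # nothing -- which is the point being made above.
    return min(margin_limited, mask)
```

With a 32-bit mask and any plausible `mult`, the `min()` is decided by the mask, so halving the overflow bound has no effect on the case that actually matters.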
On Sun, Jan 4, 2015 at 11:46 AM, Linus Torvalds
wrote:
> On Fri, Jan 2, 2015 at 4:27 PM, John Stultz wrote:
>>
>> So I sent out a first step validation check to warn us if we end up
>> with idle periods that are larger than we expect.
>
> .. not having tested it, this is just from reading the
On Fri, Jan 2, 2015 at 4:27 PM, John Stultz wrote:
>
> So I sent out a first step validation check to warn us if we end up
> with idle periods that are larger than we expect.
.. not having tested it, this is just from reading the patch, but it
would *seem* that it doesn't actually validate the
On 01/02/2015 07:27 PM, John Stultz wrote:
> On Fri, Dec 26, 2014 at 12:57 PM, Linus Torvalds
> wrote:
>> > On Fri, Dec 26, 2014 at 10:12 AM, Dave Jones
>> > wrote:
>>> >> On Fri, Dec 26, 2014 at 11:34:10AM -0500, Dave Jones wrote:
>>> >>
>>> >> > One thing I think I'll try is to try and
On Fri, Dec 26, 2014 at 12:57 PM, Linus Torvalds
wrote:
> On Fri, Dec 26, 2014 at 10:12 AM, Dave Jones wrote:
>> On Fri, Dec 26, 2014 at 11:34:10AM -0500, Dave Jones wrote:
>>
>> > One thing I think I'll try is to try and narrow down which
>> > syscalls are triggering those "Clocksource hpet
On Mon, Dec 22, 2014 at 04:46:42PM -0800, Linus Torvalds wrote:
> On Mon, Dec 22, 2014 at 3:59 PM, John Stultz wrote:
> >
> > * So 1/8th of the interval seems way too short, as there are
> > clocksources like the ACPI PM, which wrap every 2.5 seconds or so.
>
> Ugh. At the same time, 1/8th of a
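The trade-off in this exchange can be put in numbers. A rough sketch, using the ~2.5 s wrap figure quoted above (the helper function is hypothetical, not kernel code):

```python
def watchdog_threshold(wrap_seconds, fraction=8):
    # Treat any observed idle interval longer than 1/8th of the
    # clocksource's wrap time as suspect.
    return wrap_seconds / fraction

# With a clocksource that wraps every ~2.5 s, 1/8th of the interval
# is ~0.31 s -- far shorter than an ordinary NOHZ idle period,
# hence "way too short".
print(watchdog_threshold(2.5))  # -> 0.3125
```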
On Fri, Dec 26, 2014 at 07:14:55PM -0800, Linus Torvalds wrote:
> On Fri, Dec 26, 2014 at 4:36 PM, Dave Jones wrote:
> > >
> > > Oh - and have you actually seen the "TSC unstable (delta = xyz)" +
> > > "switched to hpet" messages there yet?
> >
> > not yet. 3 hrs in.
>
> Ok, so then
On Fri, Dec 26, 2014 at 4:36 PM, Dave Jones wrote:
> >
> > Oh - and have you actually seen the "TSC unstable (delta = xyz)" +
> > "switched to hpet" messages there yet?
>
> not yet. 3 hrs in.
Ok, so then the
INFO: rcu_preempt detected stalls on CPUs/tasks:
has nothing to do with HPET,
On Fri, Dec 26, 2014 at 03:30:20PM -0800, Linus Torvalds wrote:
> On Fri, Dec 26, 2014 at 2:57 PM, Dave Jones wrote:
> >
> > still running though..
>
> Btw, did you ever boot with "tsc=reliable" as a kernel command line option?
I'll check it again in the morning, but before I turn in for
On Fri, Dec 26, 2014 at 03:30:20PM -0800, Linus Torvalds wrote:
> On Fri, Dec 26, 2014 at 2:57 PM, Dave Jones wrote:
> >
> > still running though..
>
> Btw, did you ever boot with "tsc=reliable" as a kernel command line option?
I don't think so.
> For the last night, can you see if you
On Fri, Dec 26, 2014 at 03:16:41PM -0800, Linus Torvalds wrote:
> On Fri, Dec 26, 2014 at 2:57 PM, Dave Jones wrote:
> >
> > hm.
>
> So with the previous patch that had the false positives, you never saw
> this? You saw the false positives instead?
correct.
> I'm wondering if the added
On Fri, Dec 26, 2014 at 2:57 PM, Dave Jones wrote:
>
> still running though..
Btw, did you ever boot with "tsc=reliable" as a kernel command line option?
For the last night, can you see if you can just run it with that, and
things work? Because by now, my gut feel is that we should start
On Fri, Dec 26, 2014 at 2:57 PM, Dave Jones wrote:
>
> hm.
So with the previous patch that had the false positives, you never saw
this? You saw the false positives instead?
I'm wondering if the added debug noise just ended up helping. Doing a
printk() will automatically cause some scheduler
On Fri, Dec 26, 2014 at 12:57:07PM -0800, Linus Torvalds wrote:
> I have a newer version of the patch that gets rid of the false
> positives with some ordering rules instead, and just for you I hacked
> it up to say where the problem happens too, but it's likely too late.
hm.
[
On Fri, Dec 26, 2014 at 12:57:07PM -0800, Linus Torvalds wrote:
> I have a newer version of the patch that gets rid of the false
> positives with some ordering rules instead, and just for you I hacked
> it up to say where the problem happens too, but it's likely too late.
I'll give it a spin
On Fri, Dec 26, 2014 at 10:12 AM, Dave Jones wrote:
> On Fri, Dec 26, 2014 at 11:34:10AM -0500, Dave Jones wrote:
>
> > One thing I think I'll try is to try and narrow down which
> > syscalls are triggering those "Clocksource hpet had cycles off"
> > messages. I'm still unclear on exactly
On Tue, Dec 23, 2014 at 10:01:25PM -0500, Dave Jones wrote:
> On Mon, Dec 22, 2014 at 03:59:19PM -0800, Linus Torvalds wrote:
>
> > But in the meantime please do keep that thing running as long as you
> > can. Let's see if we get bigger jumps. Or perhaps we'll get a negative
> > result -
On Fri, Dec 26, 2014 at 11:34:10AM -0500, Dave Jones wrote:
One thing I think I'll try is to try and narrow down which
syscalls are triggering those Clocksource hpet had cycles off
messages. I'm still unclear on exactly what is doing
the stomping on the hpet.
First I ran trinity with -g
On 12/23/2014 09:56 AM, Dave Jones wrote:
> On Mon, Dec 22, 2014 at 03:59:19PM -0800, Linus Torvalds wrote:
>
> > But in the meantime please do keep that thing running as long as you
> > can. Let's see if we get bigger jumps. Or perhaps we'll get a negative
> > result - the original
On Mon, Dec 22, 2014 at 03:59:19PM -0800, Linus Torvalds wrote:
> But in the meantime please do keep that thing running as long as you
> can. Let's see if we get bigger jumps. Or perhaps we'll get a negative
> result - the original softlockup bug happening *without* any bigger
> hpet jumps.
On Mon, Dec 22, 2014 at 03:59:19PM -0800, Linus Torvalds wrote:
But in the meantime please do keep that thing running as long as you
can. Let's see if we get bigger jumps. Or perhaps we'll get a negative
result - the original softlockup bug happening *without* any bigger
hpet jumps.
So
On Mon, Dec 22, 2014 at 3:59 PM, John Stultz wrote:
>
> * So 1/8th of the interval seems way too short, as there are
> clocksources like the ACPI PM, which wrap every 2.5 seconds or so.
Ugh. At the same time, 1/8th of a range is actually bigger than I'd
like, since if there is some timer
On Mon, Dec 22, 2014 at 11:47 AM, Linus Torvalds
wrote:
> On Sun, Dec 21, 2014 at 4:41 PM, Linus Torvalds
> wrote:
>>
>> This is *not* to say that this is the bug you're hitting. But it does show
>> that
>>
>> (a) a flaky HPET can do some seriously bad stuff
>> (b) the kernel is very fragile
On Mon, Dec 22, 2014 at 2:57 PM, Dave Jones wrote:
>
> I tried the nohpet thing for a few hours this morning and didn't see
> anything weird, but it may have been that I just didn't run long enough.
> When I saw your patch, I gave that a shot instead, with hpet enabled
> again. Just got back to
On Mon, Dec 22, 2014 at 11:47:37AM -0800, Linus Torvalds wrote:
> And again: this is not trying to make the kernel clock not jump. There
> is no way I can come up with even in theory to try to really *fix* a
> fundamentally broken clock.
>
> So this is not meant to be a real "fix" for
On Mon, Dec 22, 2014 at 11:47 AM, Linus Torvalds
wrote:
>
> .. and we might still lock up under some circumstances. But at least
> from my limited testing, it is infinitely much better, even if it
> might not be perfect. Also note that my "testing" has been writing
> zero to the HPET lock (so the
On Sun, Dec 21, 2014 at 4:41 PM, Linus Torvalds
wrote:
>
> This is *not* to say that this is the bug you're hitting. But it does show
> that
>
> (a) a flaky HPET can do some seriously bad stuff
> (b) the kernel is very fragile wrt time going backwards.
>
> and maybe we can use this test
On Sun, Dec 21, 2014 at 04:52:28PM -0800, Linus Torvalds wrote:
> On Sun, Dec 21, 2014 at 4:41 PM, Linus Torvalds
> wrote:
> >
> > The second time (or third, or fourth - it might not take immediately)
> > you get a lockup or similar. Bad things happen.
>
> I've only tested it twice now, but the
On Sun, Dec 21, 2014 at 04:52:28PM -0800, Linus Torvalds wrote:
> > The second time (or third, or fourth - it might not take immediately)
> > you get a lockup or similar. Bad things happen.
>
> I've only tested it twice now, but the first time I got a weird
> lockup-like thing (things *kind*
On Sun, Dec 21, 2014 at 4:41 PM, Linus Torvalds
wrote:
>
> The second time (or third, or fourth - it might not take immediately)
> you get a lockup or similar. Bad things happen.
I've only tested it twice now, but the first time I got a weird
lockup-like thing (things *kind* of worked, but I
On Sun, Dec 21, 2014 at 3:58 PM, Linus Torvalds
wrote:
>
> I can do the mmap(/dev/mem) thing and access the HPET by hand, and
> when I write zero to it I immediately get something like this:
>
> Clocksource tsc unstable (delta = -284317725450 ns)
> Switched to clocksource hpet
>
> just to
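Why zeroing the counter is so destructive follows from how clocksource deltas are computed: subtraction modulo the counter width. A minimal model (the values are illustrative, not taken from the machine in question):

```python
MASK32 = (1 << 32) - 1  # a 32-bit HPET main counter

def clocksource_delta(now, last, mask=MASK32):
    # Deltas are taken modulo the counter width: (now - last) & mask.
    return (now - last) & mask

last_read = 0x2000_0000
# Normal forward progress: a small positive delta.
assert clocksource_delta(0x2000_1000, last_read) == 0x1000
# After the counter is overwritten with zero, the same arithmetic turns
# a step backwards into an almost-complete wrap forwards, so time
# appears to leap ahead by most of the counter's range.
assert clocksource_delta(0, last_read) == MASK32 + 1 - 0x2000_0000
```

The watchdog then sees the TSC and HPET intervals wildly disagree, marks the TSC unstable with a huge negative delta, and switches to the (now corrupted) HPET, as in the log lines above.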
On Sun, Dec 21, 2014 at 2:32 PM, Dave Jones wrote:
> On Sun, Dec 21, 2014 at 02:19:03PM -0800, Linus Torvalds wrote:
> >
> > And finally, and stupidly, is there any chance that you have anything
> > accessing /dev/hpet?
>
> Not knowingly at least, but who the hell knows what systemd has its
>
On Sun, Dec 21, 2014 at 02:19:03PM -0800, Linus Torvalds wrote:
> > So the range of 1-251 seconds is not entirely random. It's all in
> > that "32-bit HPET range".
>
> DaveJ, I assume it's too late now, and you don't effectively have any
> access to the machine any more, but "hpet=disable"
On Sun, Dec 21, 2014 at 1:22 PM, Linus Torvalds
wrote:
>
> So the range of 1-251 seconds is not entirely random. It's all in
> that "32-bit HPET range".
DaveJ, I assume it's too late now, and you don't effectively have any
access to the machine any more, but "hpet=disable" or "nohpet" on the
On Sat, Dec 20, 2014 at 1:16 PM, Linus Torvalds
wrote:
>
> Hmm, ok, I've re-acquainted myself with it. And I have to admit that I
> can't see anything wrong. The whole "update_wall_clock" and the shadow
> timekeeping state is confusing as hell, but seems fine. We'd have to
> avoid
On Sat, Dec 20, 2014 at 01:16:29PM -0800, Linus Torvalds wrote:
> On Sat, Dec 20, 2014 at 10:25 AM, Linus Torvalds
> wrote:
> >
> > How/where is the HPET overflow case handled? I don't know the code enough.
>
> Hmm, ok, I've re-acquainted myself with it. And I have to admit that I
> can't see
On Sat, Dec 20, 2014 at 10:25 AM, Linus Torvalds
wrote:
>
> How/where is the HPET overflow case handled? I don't know the code enough.
Hmm, ok, I've re-acquainted myself with it. And I have to admit that I
can't see anything wrong. The whole "update_wall_clock" and the shadow
timekeeping state
On Fri, Dec 19, 2014 at 5:57 PM, Linus Torvalds
wrote:
>
> I'm claiming that the race happened *once*. And it then corrupted some
> data structure or similar sufficiently that CPU0 keeps looping.
>
> Perhaps something keeps re-adding itself to the head of the timerqueue
> due to the race.
So
On Fri, Dec 19, 2014 at 02:05:20PM -0800, Linus Torvalds wrote:
> > Right now I'm doing Chris' idea of "turn debugging back on,
> > and try without serial console". Shall I try your suggestion
> > on top of that ?
>
> Might as well. I doubt it really will make any difference, but I also
>
On Fri, Dec 19, 2014 at 5:00 PM, Thomas Gleixner wrote:
>
> The watchdog timer runs on a fully periodic schedule. It's self
> rearming via
>
> hrtimer_forward_now(hrtimer, ns_to_ktime(sample_period));
>
> So if that aligns with the equally periodic tick interrupt on the
> other CPU then
On Fri, 19 Dec 2014, Chris Mason wrote:
> On Fri, Dec 19, 2014 at 6:22 PM, Thomas Gleixner wrote:
> > But at the very end this would be detected by the runtime check of the
> > hrtimer interrupt, which does not trigger. And it would trigger at
> > some point as ALL cpus including CPU0 in that
On Fri, 19 Dec 2014, Linus Torvalds wrote:
> On Fri, Dec 19, 2014 at 3:14 PM, Thomas Gleixner wrote:
> > Now that all looks correct. So there is something else going on. After
> > staring some more at it, I think we are looking at it from the wrong
> > angle.
> >
> > The watchdog always detects
On Fri, Dec 19, 2014 at 6:22 PM, Thomas Gleixner
wrote:
On Fri, 19 Dec 2014, Chris Mason wrote:
On Fri, Dec 19, 2014 at 11:15:21AM -0800, Linus Torvalds wrote:
> Here's another pattern. In your latest thing, every single time
that
> CPU1 is waiting for some other CPU to pick up the IPI,
On Fri, Dec 19, 2014 at 3:14 PM, Thomas Gleixner wrote:
>
> Now that all looks correct. So there is something else going on. After
> staring some more at it, I think we are looking at it from the wrong
> angle.
>
> The watchdog always detects CPU1 as stuck and we got completely
> fixated on the
On Fri, 19 Dec 2014, Chris Mason wrote:
> On Fri, Dec 19, 2014 at 11:15:21AM -0800, Linus Torvalds wrote:
> > Here's another pattern. In your latest thing, every single time that
> > CPU1 is waiting for some other CPU to pick up the IPI, we have CPU0
> > doing this:
> >
> > [24998.060963] NMI
On Fri, 19 Dec 2014, Linus Torvalds wrote:
> Here's another pattern. In your latest thing, every single time that
> CPU1 is waiting for some other CPU to pick up the IPI, we have CPU0
> doing this:
>
> [24998.060963] NMI backtrace for cpu 0
> [24998.061989] CPU: 0 PID: 2940 Comm: trinity-c150 Not
On Fri, Dec 19, 2014 at 12:54 PM, Dave Jones wrote:
>
> Right now I'm doing Chris' idea of "turn debugging back on,
> and try without serial console". Shall I try your suggestion
> on top of that ?
Might as well. I doubt it really will make any difference, but I also
don't think it will
On Fri, Dec 19, 2014 at 12:46:16PM -0800, Linus Torvalds wrote:
> On Fri, Dec 19, 2014 at 11:51 AM, Linus Torvalds
> wrote:
> >
> > I do note that we depend on the "new mwait" semantics where we do
> > mwait with interrupts disabled and a non-zero RCX value. Are there
> > possibly even any
On Fri, Dec 19, 2014 at 11:51 AM, Linus Torvalds
wrote:
>
> I do note that we depend on the "new mwait" semantics where we do
> mwait with interrupts disabled and a non-zero RCX value. Are there
> possibly even any known CPU errata in that area? Not that it sounds
> likely, but still..
Remind me
On Fri, Dec 19, 2014 at 03:31:36PM -0500, Chris Mason wrote:
> > So it's not stuck *inside* read_hpet(), and it's almost certainly not
> > the loop over the sequence counter in ktime_get() either (it's not
> > increasing *that* quickly). But some basically infinite __run_hrtimer
> > thing or
On Fri, Dec 19, 2014 at 11:15:21AM -0800, Linus Torvalds wrote:
> Here's another pattern. In your latest thing, every single time that
> CPU1 is waiting for some other CPU to pick up the IPI, we have CPU0
> doing this:
>
> [24998.060963] NMI backtrace for cpu 0
> [24998.061989] CPU: 0 PID: 2940
On Fri, Dec 19, 2014 at 11:15 AM, Linus Torvalds
wrote:
>
> In your earlier trace (with spinlock debugging), the softlockup
> detection was in lock_acquire for copy_page_range(), but CPU2 was
> always in that "generic_exec_single" due to a TLB flush from that
> zap_page_range thing again. But
On Fri, Dec 19, 2014 at 11:15:21AM -0800, Linus Torvalds wrote:
> sched: RT throttling activated
>
> And after RT throttling, it's random (not even always trinity), but
> that's probably because the watchdog thread doesn't run reliably any
> more.
So if we want to shoot that RT throttling