subject:"Re\: frequent lockups in 3.18rc4"

On Mon, Jan 5, 2015 at 5:25 PM, Linus Torvalds
 wrote:
> On Mon, Jan 5, 2015 at 5:17 PM, John Stultz  wrote:
>>
>> Anyway, It may be worth keeping the 50% margin (and dropping the 12%
>> reduction to simplify things)
>
> Again, the 50% margin is only on the multiplication overflow. Not on the mask.

Right, but we calculate the mult value based on the mask (or 10 mins,
whichever is shorter).

So then when we go back and calculate the max_cycles/max_idle_ns using
the mult, we end up with a value smaller then the mask. So the
scheduler shouldn't push idle times out beyond that and the debug
logic in my patch should be able to catch strangely large values.

> So it won't do anything at all for the case we actually care about,
> namely a broken HPET, afaik.

Yea, the case my code doesn't catch that yours did is for slightly
broken clocksources (I'm thinking two cpus which virtual hpets
embedded in them that are slightly off) where you could get negative
deltas right after the update. In that case the capping on read is
really needed since by the next update the stale value has grown large
enough to look like a reasonable offset. The TSC has a similar issue,
but its easier to check for negative values because it won't
reasonably ever overflow.

>
> I'd much rather limit to 50% of the mask too.

Ok, I'll try to rework the code to make this choice and make it more
explicitly clear.

> Also, why do we actually play games with ilog2 for that overflow
> calculation? It seems pointless. This is for the setup code, doing a
> real division there would seem to be a whole lot more straightforward,
> and not need that big comment. And there's no performance issue. Am I
> missing something?

I feel like there was a time when this may have been called by some of
the clocksource code if it they changed frequency (I think over
suspend/resume), but I'm not seeing it in the current source. So yea,
likely something to simplify.

>> I've also got a capping patch that I'm testing that keeps time reads
>> from passing that interval. The only thing I'm really cautious about
>> with that change is that we have to make sure the hrtimer that
>> triggers update_wall_clock is always set to expire within that cap (I
>> need to review it again) or else we'll hang ourselves.
>
>  Yeah, that thing is fragile. And quite possibly part of the problem.

"Time is a flat circle..." and thus unfortunately requires some
circular logic. :)

thanks
-john
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

2015-01-05 Thread Linus Torvalds

On Mon, Jan 5, 2015 at 5:17 PM, John Stultz  wrote:
>
> Anyway, It may be worth keeping the 50% margin (and dropping the 12%
> reduction to simplify things)

Again, the 50% margin is only on the multiplication overflow. Not on the mask.

So it won't do anything at all for the case we actually care about,
namely a broken HPET, afaik.

I'd much rather limit to 50% of the mask too.

Also, why do we actually play games with ilog2 for that overflow
calculation? It seems pointless. This is for the setup code, doing a
real division there would seem to be a whole lot more straightforward,
and not need that big comment. And there's no performance issue. Am I
missing something?

> I've also got a capping patch that I'm testing that keeps time reads
> from passing that interval. The only thing I'm really cautious about
> with that change is that we have to make sure the hrtimer that
> triggers update_wall_clock is always set to expire within that cap (I
> need to review it again) or else we'll hang ourselves.

 Yeah, that thing is fragile. And quite possibly part of the problem.

   Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Sun, Jan 4, 2015 at 11:46 AM, Linus Torvalds
 wrote:
> On Fri, Jan 2, 2015 at 4:27 PM, John Stultz  wrote:
>>
>> So I sent out a first step validation check to warn us if we end up
>> with idle periods that are larger then we expect.
>
> .. not having tested it, this is just from reading the patch, but it
> would *seem* that it doesn't actually validate the clock reading much
> at all.
>
> Why? Because most of the time, for crap clocks like HPET, the real
> limitation will be not the multiplication overflow, but the "mask",
> which is just 32-bit (or worse - I think the ACPI PM timer might be
> just 24 bits).
>
> So then you effectively "validate" that the timer difference value
> fits in mask, but that isn't any validation at all - it's just a
> truism. Since we by definition mask the difference to just the valid
> bitmask.
>
> So I really think that the maximum valid clock needs to be narrowed
> down from the "technically, this clock can count to X".
>
> But maybe I'm wrong, and the multiplication overflow is actually often
> the real limit. What are the actual values for real timer sources?

As you point out, for clocksources where the mask is 32bits or under,
we shouldn't have any risk of multiplication overflow, since mult is a
32bit. So yes, the max_cycles on those probably should be the same as
the mask, and isn't useful on those clocksources to test if we run
over (though warning if we're within the 12% margin could be useful).
But for clocksources who have larger masks, it still could be a useful
check (@2ghz tscs overflow 32bits in 2 seconds), although the mult
value is targets an mult overflow at ~10 minutes, so its less likely
that we really hit it.

However, it ends up the calculations we use are a little more
conservative, treating the result as signed and so they avoid
multiplications that could run into a that sign bit. This looks like
an error to me (the code is also used for clockevents, so I haven't
run through to see if the requirements are different there), but a
conservative one, which results in the maximum idle interval to be
~halved what they could be (and we add yet another 12% margin on that
- so we probably need to just pick one or the other, not both).

So even on 32bit masks, max_cycles in my patch is smaller then 32bits.
That's why I was able to hit both warnings in my testing with the hpet
by sending SIGSTOP to qemu.

Anyway, It may be worth keeping the 50% margin (and dropping the 12%
reduction to simplify things), since I've not heard recent complaints
about timekeeping limiting idle lengths (but could be wrong here).
This would give you something closer to the 1/8th of the mask that you
were using in your patch (and on larger mask clocksources, we do
already cap the interval at 10 minutes - so most really crazy values
would be caught for clocksources like the TSC - and maybe we can make
this more configurable so we can shorten it as done in your patch to
try to debug things).

I've also got a capping patch that I'm testing that keeps time reads
from passing that interval. The only thing I'm really cautious about
with that change is that we have to make sure the hrtimer that
triggers update_wall_clock is always set to expire within that cap (I
need to review it again) or else we'll hang ourselves.

thanks
-john
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

2015-01-05 Thread Linus Torvalds

On Mon, Jan 5, 2015 at 5:17 PM, John Stultz john.stu...@linaro.org wrote:

 Anyway, It may be worth keeping the 50% margin (and dropping the 12%
 reduction to simplify things)

Again, the 50% margin is only on the multiplication overflow. Not on the mask.

So it won't do anything at all for the case we actually care about,
namely a broken HPET, afaik.

I'd much rather limit to 50% of the mask too.

Also, why do we actually play games with ilog2 for that overflow
calculation? It seems pointless. This is for the setup code, doing a
real division there would seem to be a whole lot more straightforward,
and not need that big comment. And there's no performance issue. Am I
missing something?

 I've also got a capping patch that I'm testing that keeps time reads
 from passing that interval. The only thing I'm really cautious about
 with that change is that we have to make sure the hrtimer that
 triggers update_wall_clock is always set to expire within that cap (I
 need to review it again) or else we'll hang ourselves.

 Yeah, that thing is fragile. And quite possibly part of the problem.

   Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Mon, Jan 5, 2015 at 5:25 PM, Linus Torvalds
torva...@linux-foundation.org wrote:
 On Mon, Jan 5, 2015 at 5:17 PM, John Stultz john.stu...@linaro.org wrote:

 Anyway, It may be worth keeping the 50% margin (and dropping the 12%
 reduction to simplify things)

 Again, the 50% margin is only on the multiplication overflow. Not on the mask.

Right, but we calculate the mult value based on the mask (or 10 mins,
whichever is shorter).

So then when we go back and calculate the max_cycles/max_idle_ns using
the mult, we end up with a value smaller then the mask. So the
scheduler shouldn't push idle times out beyond that and the debug
logic in my patch should be able to catch strangely large values.

 So it won't do anything at all for the case we actually care about,
 namely a broken HPET, afaik.

Yea, the case my code doesn't catch that yours did is for slightly
broken clocksources (I'm thinking two cpus which virtual hpets
embedded in them that are slightly off) where you could get negative
deltas right after the update. In that case the capping on read is
really needed since by the next update the stale value has grown large
enough to look like a reasonable offset. The TSC has a similar issue,
but its easier to check for negative values because it won't
reasonably ever overflow.


 I'd much rather limit to 50% of the mask too.

Ok, I'll try to rework the code to make this choice and make it more
explicitly clear.


 Also, why do we actually play games with ilog2 for that overflow
 calculation? It seems pointless. This is for the setup code, doing a
 real division there would seem to be a whole lot more straightforward,
 and not need that big comment. And there's no performance issue. Am I
 missing something?

I feel like there was a time when this may have been called by some of
the clocksource code if it they changed frequency (I think over
suspend/resume), but I'm not seeing it in the current source. So yea,
likely something to simplify.

 I've also got a capping patch that I'm testing that keeps time reads
 from passing that interval. The only thing I'm really cautious about
 with that change is that we have to make sure the hrtimer that
 triggers update_wall_clock is always set to expire within that cap (I
 need to review it again) or else we'll hang ourselves.

  Yeah, that thing is fragile. And quite possibly part of the problem.

Time is a flat circle... and thus unfortunately requires some
circular logic. :)

thanks
-john
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Sun, Jan 4, 2015 at 11:46 AM, Linus Torvalds
torva...@linux-foundation.org wrote:
 On Fri, Jan 2, 2015 at 4:27 PM, John Stultz john.stu...@linaro.org wrote:

 So I sent out a first step validation check to warn us if we end up
 with idle periods that are larger then we expect.

 .. not having tested it, this is just from reading the patch, but it
 would *seem* that it doesn't actually validate the clock reading much
 at all.

 Why? Because most of the time, for crap clocks like HPET, the real
 limitation will be not the multiplication overflow, but the mask,
 which is just 32-bit (or worse - I think the ACPI PM timer might be
 just 24 bits).

 So then you effectively validate that the timer difference value
 fits in mask, but that isn't any validation at all - it's just a
 truism. Since we by definition mask the difference to just the valid
 bitmask.

 So I really think that the maximum valid clock needs to be narrowed
 down from the technically, this clock can count to X.

 But maybe I'm wrong, and the multiplication overflow is actually often
 the real limit. What are the actual values for real timer sources?


As you point out, for clocksources where the mask is 32bits or under,
we shouldn't have any risk of multiplication overflow, since mult is a
32bit. So yes, the max_cycles on those probably should be the same as
the mask, and isn't useful on those clocksources to test if we run
over (though warning if we're within the 12% margin could be useful).
But for clocksources who have larger masks, it still could be a useful
check (@2ghz tscs overflow 32bits in 2 seconds), although the mult
value is targets an mult overflow at ~10 minutes, so its less likely
that we really hit it.

However, it ends up the calculations we use are a little more
conservative, treating the result as signed and so they avoid
multiplications that could run into a that sign bit. This looks like
an error to me (the code is also used for clockevents, so I haven't
run through to see if the requirements are different there), but a
conservative one, which results in the maximum idle interval to be
~halved what they could be (and we add yet another 12% margin on that
- so we probably need to just pick one or the other, not both).

So even on 32bit masks, max_cycles in my patch is smaller then 32bits.
That's why I was able to hit both warnings in my testing with the hpet
by sending SIGSTOP to qemu.

Anyway, It may be worth keeping the 50% margin (and dropping the 12%
reduction to simplify things), since I've not heard recent complaints
about timekeeping limiting idle lengths (but could be wrong here).
This would give you something closer to the 1/8th of the mask that you
were using in your patch (and on larger mask clocksources, we do
already cap the interval at 10 minutes - so most really crazy values
would be caught for clocksources like the TSC - and maybe we can make
this more configurable so we can shorten it as done in your patch to
try to debug things).

I've also got a capping patch that I'm testing that keeps time reads
from passing that interval. The only thing I'm really cautious about
with that change is that we have to make sure the hrtimer that
triggers update_wall_clock is always set to expire within that cap (I
need to review it again) or else we'll hang ourselves.

thanks
-john
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

2015-01-04 Thread Linus Torvalds

On Fri, Jan 2, 2015 at 4:27 PM, John Stultz  wrote:
>
> So I sent out a first step validation check to warn us if we end up
> with idle periods that are larger then we expect.

.. not having tested it, this is just from reading the patch, but it
would *seem* that it doesn't actually validate the clock reading much
at all.

Why? Because most of the time, for crap clocks like HPET, the real
limitation will be not the multiplication overflow, but the "mask",
which is just 32-bit (or worse - I think the ACPI PM timer might be
just 24 bits).

So then you effectively "validate" that the timer difference value
fits in mask, but that isn't any validation at all - it's just a
truism. Since we by definition mask the difference to just the valid
bitmask.

So I really think that the maximum valid clock needs to be narrowed
down from the "technically, this clock can count to X".

But maybe I'm wrong, and the multiplication overflow is actually often
the real limit. What are the actual values for real timer sources?

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

2015-01-04 Thread Linus Torvalds

On Fri, Jan 2, 2015 at 4:27 PM, John Stultz john.stu...@linaro.org wrote:

 So I sent out a first step validation check to warn us if we end up
 with idle periods that are larger then we expect.

.. not having tested it, this is just from reading the patch, but it
would *seem* that it doesn't actually validate the clock reading much
at all.

Why? Because most of the time, for crap clocks like HPET, the real
limitation will be not the multiplication overflow, but the mask,
which is just 32-bit (or worse - I think the ACPI PM timer might be
just 24 bits).

So then you effectively validate that the timer difference value
fits in mask, but that isn't any validation at all - it's just a
truism. Since we by definition mask the difference to just the valid
bitmask.

So I really think that the maximum valid clock needs to be narrowed
down from the technically, this clock can count to X.

But maybe I'm wrong, and the multiplication overflow is actually often
the real limit. What are the actual values for real timer sources?

Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

2015-01-03 Thread Sasha Levin

On 01/02/2015 07:27 PM, John Stultz wrote:
> On Fri, Dec 26, 2014 at 12:57 PM, Linus Torvalds
>  wrote:
>> > On Fri, Dec 26, 2014 at 10:12 AM, Dave Jones  
>> > wrote:
>>> >> On Fri, Dec 26, 2014 at 11:34:10AM -0500, Dave Jones wrote:
>>> >>
>>> >>  > One thing I think I'll try is to try and narrow down which
>>> >>  > syscalls are triggering those "Clocksource hpet had cycles off"
>>> >>  > messages.  I'm still unclear on exactly what is doing
>>> >>  > the stomping on the hpet.
>>> >>
>>> >> First I ran trinity with "-g vm" which limits it to use just
>>> >> a subset of syscalls, specifically VM related ones.
>>> >> That triggered the messages. Further experiments revealed:
>> >
>> > So I can trigger the false positives with my original patch quite
>> > easily by just putting my box under some load. My numbers are nowhere
>> > near as bad as yours, but then, I didn't put it under as much load
>> > anyway. Just a regular "make -j64" of the kernel.
>> >
>> > I suspect your false positives are bigger partly because of the load,
>> > but mostly because you presumably have preemption enabled too. I don't
>> > do preemption in my normal kernels, and that limits the damage of the
>> > race a bit.
>> >
>> > I have a newer version of the patch that gets rid of the false
>> > positives with some ordering rules instead, and just for you I hacked
>> > it up to say where the problem happens too, but it's likely too late.
>> >
>> > The fact that the original racy patch seems to make a difference for
>> > you does say that yes, we seem to be zeroing in on the right area
>> > here, but I'm not seeing what's wrong. I was hoping for big jumps from
>> > your HPET, since your "TSC unstable" messages do kind of imply that
>> > such really big jumps can happen.
>> >
>> > I'm attaching my updated hacky patch, although I assume it's much too
>> > late for that machine. Don't look too closely at the backtrace
>> > generation part, that's just a quick hack, and only works with frame
>> > pointers enabled anyway.
>> >
>> > So I'm still a bit unhappy about not figuring out *what* is wrong. And
>> > I'd still like the dmidecode from that machine, just for posterity. In
>> > case we can figure out some pattern.
>> >
>> > So right now I can imagine several reasons:
>> >
>> >  - actual hardware bug.
>> >
>> >This is *really* unlikely, though. It should hit everybody. The
>> > HPET is in the core intel chipset, we're not talking random unusual
>> > hardware by fly-by-night vendors here.
>> >
>> >  - some SMM/BIOS "power management" feature.
>> >
>> >We've seen this before, where the SMM saves/restores the TSC on
>> > entry/exit in order to hide itself from the system. I could imagine
>> > similar code for the HPET counter. SMM writers use some bad drugs to
>> > dull their pain.
>> >
>> >And with the HPET counter, since it's not even per-CPU, the "save
>> > and restore HPET" will actually show up as "HPET went backwards" to
>> > the other non-SMM CPU's if it happens
>> >
>> >  - a bug in our own clocksource handling.
>> >
>> >I'm not seeing it. But maybe my patch hides it for some magical reason.
> So I sent out a first step validation check to warn us if we end up
> with idle periods that are larger then we expect.
> 
> It doesn't yet cap the timekeeping_get_ns() output (like you're patch
> effectively does), but it would be easy to do that in a following
> patch.
> 
> I did notice while testing this that the max_idle_ns (max idle time we
> report to the scheduler) for the hpet is only ~16sec, and we'll
> overflow after just ~21seconds. This second number maps closely to the
> 22 second stalls seen in the  nmi watchdog reports which seems
> interesting, but I also realize that qemu uses a 100MHz hpet, where as
> real hardware is likely to be a bit slower, so maybe that's just
> chance..
> 
> I'd be interested if folks seeing anything similar to Dave would give
> my patch a shot.

I ran it overnight, but I didn't see any of the new warnings in the logs.


Thanks,
Sasha
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

2015-01-03 Thread Sasha Levin

On 01/02/2015 07:27 PM, John Stultz wrote:
 On Fri, Dec 26, 2014 at 12:57 PM, Linus Torvalds
 torva...@linux-foundation.org wrote:
  On Fri, Dec 26, 2014 at 10:12 AM, Dave Jones da...@codemonkey.org.uk 
  wrote:
  On Fri, Dec 26, 2014 at 11:34:10AM -0500, Dave Jones wrote:
 
One thing I think I'll try is to try and narrow down which
syscalls are triggering those Clocksource hpet had cycles off
messages.  I'm still unclear on exactly what is doing
the stomping on the hpet.
 
  First I ran trinity with -g vm which limits it to use just
  a subset of syscalls, specifically VM related ones.
  That triggered the messages. Further experiments revealed:
 
  So I can trigger the false positives with my original patch quite
  easily by just putting my box under some load. My numbers are nowhere
  near as bad as yours, but then, I didn't put it under as much load
  anyway. Just a regular make -j64 of the kernel.
 
  I suspect your false positives are bigger partly because of the load,
  but mostly because you presumably have preemption enabled too. I don't
  do preemption in my normal kernels, and that limits the damage of the
  race a bit.
 
  I have a newer version of the patch that gets rid of the false
  positives with some ordering rules instead, and just for you I hacked
  it up to say where the problem happens too, but it's likely too late.
 
  The fact that the original racy patch seems to make a difference for
  you does say that yes, we seem to be zeroing in on the right area
  here, but I'm not seeing what's wrong. I was hoping for big jumps from
  your HPET, since your TSC unstable messages do kind of imply that
  such really big jumps can happen.
 
  I'm attaching my updated hacky patch, although I assume it's much too
  late for that machine. Don't look too closely at the backtrace
  generation part, that's just a quick hack, and only works with frame
  pointers enabled anyway.
 
  So I'm still a bit unhappy about not figuring out *what* is wrong. And
  I'd still like the dmidecode from that machine, just for posterity. In
  case we can figure out some pattern.
 
  So right now I can imagine several reasons:
 
   - actual hardware bug.
 
 This is *really* unlikely, though. It should hit everybody. The
  HPET is in the core intel chipset, we're not talking random unusual
  hardware by fly-by-night vendors here.
 
   - some SMM/BIOS power management feature.
 
 We've seen this before, where the SMM saves/restores the TSC on
  entry/exit in order to hide itself from the system. I could imagine
  similar code for the HPET counter. SMM writers use some bad drugs to
  dull their pain.
 
 And with the HPET counter, since it's not even per-CPU, the save
  and restore HPET will actually show up as HPET went backwards to
  the other non-SMM CPU's if it happens
 
   - a bug in our own clocksource handling.
 
 I'm not seeing it. But maybe my patch hides it for some magical reason.
 So I sent out a first step validation check to warn us if we end up
 with idle periods that are larger then we expect.
 
 It doesn't yet cap the timekeeping_get_ns() output (like you're patch
 effectively does), but it would be easy to do that in a following
 patch.
 
 I did notice while testing this that the max_idle_ns (max idle time we
 report to the scheduler) for the hpet is only ~16sec, and we'll
 overflow after just ~21seconds. This second number maps closely to the
 22 second stalls seen in the  nmi watchdog reports which seems
 interesting, but I also realize that qemu uses a 100MHz hpet, where as
 real hardware is likely to be a bit slower, so maybe that's just
 chance..
 
 I'd be interested if folks seeing anything similar to Dave would give
 my patch a shot.

I ran it overnight, but I didn't see any of the new warnings in the logs.


Thanks,
Sasha
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

2015-01-02 Thread John Stultz

On Fri, Dec 26, 2014 at 12:57 PM, Linus Torvalds
 wrote:
> On Fri, Dec 26, 2014 at 10:12 AM, Dave Jones  wrote:
>> On Fri, Dec 26, 2014 at 11:34:10AM -0500, Dave Jones wrote:
>>
>>  > One thing I think I'll try is to try and narrow down which
>>  > syscalls are triggering those "Clocksource hpet had cycles off"
>>  > messages.  I'm still unclear on exactly what is doing
>>  > the stomping on the hpet.
>>
>> First I ran trinity with "-g vm" which limits it to use just
>> a subset of syscalls, specifically VM related ones.
>> That triggered the messages. Further experiments revealed:
>
> So I can trigger the false positives with my original patch quite
> easily by just putting my box under some load. My numbers are nowhere
> near as bad as yours, but then, I didn't put it under as much load
> anyway. Just a regular "make -j64" of the kernel.
>
> I suspect your false positives are bigger partly because of the load,
> but mostly because you presumably have preemption enabled too. I don't
> do preemption in my normal kernels, and that limits the damage of the
> race a bit.
>
> I have a newer version of the patch that gets rid of the false
> positives with some ordering rules instead, and just for you I hacked
> it up to say where the problem happens too, but it's likely too late.
>
> The fact that the original racy patch seems to make a difference for
> you does say that yes, we seem to be zeroing in on the right area
> here, but I'm not seeing what's wrong. I was hoping for big jumps from
> your HPET, since your "TSC unstable" messages do kind of imply that
> such really big jumps can happen.
>
> I'm attaching my updated hacky patch, although I assume it's much too
> late for that machine. Don't look too closely at the backtrace
> generation part, that's just a quick hack, and only works with frame
> pointers enabled anyway.
>
> So I'm still a bit unhappy about not figuring out *what* is wrong. And
> I'd still like the dmidecode from that machine, just for posterity. In
> case we can figure out some pattern.
>
> So right now I can imagine several reasons:
>
>  - actual hardware bug.
>
>This is *really* unlikely, though. It should hit everybody. The
> HPET is in the core intel chipset, we're not talking random unusual
> hardware by fly-by-night vendors here.
>
>  - some SMM/BIOS "power management" feature.
>
>We've seen this before, where the SMM saves/restores the TSC on
> entry/exit in order to hide itself from the system. I could imagine
> similar code for the HPET counter. SMM writers use some bad drugs to
> dull their pain.
>
>And with the HPET counter, since it's not even per-CPU, the "save
> and restore HPET" will actually show up as "HPET went backwards" to
> the other non-SMM CPU's if it happens
>
>  - a bug in our own clocksource handling.
>
>I'm not seeing it. But maybe my patch hides it for some magical reason.

So I sent out a first step validation check to warn us if we end up
with idle periods that are larger then we expect.

It doesn't yet cap the timekeeping_get_ns() output (like you're patch
effectively does), but it would be easy to do that in a following
patch.

I did notice while testing this that the max_idle_ns (max idle time we
report to the scheduler) for the hpet is only ~16sec, and we'll
overflow after just ~21seconds. This second number maps closely to the
22 second stalls seen in the  nmi watchdog reports which seems
interesting, but I also realize that qemu uses a 100MHz hpet, where as
real hardware is likely to be a bit slower, so maybe that's just
chance..

I'd be interested if folks seeing anything similar to Dave would give
my patch a shot.

thanks
-john
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

2015-01-02 Thread John Stultz

On Fri, Dec 26, 2014 at 12:57 PM, Linus Torvalds
torva...@linux-foundation.org wrote:
 On Fri, Dec 26, 2014 at 10:12 AM, Dave Jones da...@codemonkey.org.uk wrote:
 On Fri, Dec 26, 2014 at 11:34:10AM -0500, Dave Jones wrote:

   One thing I think I'll try is to try and narrow down which
   syscalls are triggering those Clocksource hpet had cycles off
   messages.  I'm still unclear on exactly what is doing
   the stomping on the hpet.

 First I ran trinity with -g vm which limits it to use just
 a subset of syscalls, specifically VM related ones.
 That triggered the messages. Further experiments revealed:

 So I can trigger the false positives with my original patch quite
 easily by just putting my box under some load. My numbers are nowhere
 near as bad as yours, but then, I didn't put it under as much load
 anyway. Just a regular make -j64 of the kernel.

 I suspect your false positives are bigger partly because of the load,
 but mostly because you presumably have preemption enabled too. I don't
 do preemption in my normal kernels, and that limits the damage of the
 race a bit.

 I have a newer version of the patch that gets rid of the false
 positives with some ordering rules instead, and just for you I hacked
 it up to say where the problem happens too, but it's likely too late.

 The fact that the original racy patch seems to make a difference for
 you does say that yes, we seem to be zeroing in on the right area
 here, but I'm not seeing what's wrong. I was hoping for big jumps from
 your HPET, since your TSC unstable messages do kind of imply that
 such really big jumps can happen.

 I'm attaching my updated hacky patch, although I assume it's much too
 late for that machine. Don't look too closely at the backtrace
 generation part, that's just a quick hack, and only works with frame
 pointers enabled anyway.

 So I'm still a bit unhappy about not figuring out *what* is wrong. And
 I'd still like the dmidecode from that machine, just for posterity. In
 case we can figure out some pattern.

 So right now I can imagine several reasons:

  - actual hardware bug.

This is *really* unlikely, though. It should hit everybody. The
 HPET is in the core intel chipset, we're not talking random unusual
 hardware by fly-by-night vendors here.

  - some SMM/BIOS power management feature.

We've seen this before, where the SMM saves/restores the TSC on
 entry/exit in order to hide itself from the system. I could imagine
 similar code for the HPET counter. SMM writers use some bad drugs to
 dull their pain.

And with the HPET counter, since it's not even per-CPU, the save
 and restore HPET will actually show up as HPET went backwards to
 the other non-SMM CPU's if it happens

  - a bug in our own clocksource handling.

I'm not seeing it. But maybe my patch hides it for some magical reason.

So I sent out a first step validation check to warn us if we end up
with idle periods that are larger then we expect.

It doesn't yet cap the timekeeping_get_ns() output (like you're patch
effectively does), but it would be easy to do that in a following
patch.

I did notice while testing this that the max_idle_ns (max idle time we
report to the scheduler) for the hpet is only ~16sec, and we'll
overflow after just ~21seconds. This second number maps closely to the
22 second stalls seen in the  nmi watchdog reports which seems
interesting, but I also realize that qemu uses a 100MHz hpet, where as
real hardware is likely to be a bit slower, so maybe that's just
chance..

I'd be interested if folks seeing anything similar to Dave would give
my patch a shot.

thanks
-john
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

2014-12-28 Thread Paul E. McKenney

On Mon, Dec 22, 2014 at 04:46:42PM -0800, Linus Torvalds wrote:
> On Mon, Dec 22, 2014 at 3:59 PM, John Stultz  wrote:
> >
> > * So 1/8th of the interval seems way too short, as there's
> > clocksources like the ACP PM, which wrap every 2.5 seconds or so.
> 
> Ugh. At the same time, 1/8th of a range is actually bigger than I'd
> like, since if there is some timer corruption, it means that we only
> catch it when it's in really big.
> 
> But as I said, I'd actually prefer it to be time-based, because it
> would be good if this approach worked on things like the TSC which is
> a 64-bit counter..
> 
> So yes, that capping was very much arbitrary, and was mostly a case of
> "this works with the one timer source that I can easily trigger"
> 
> > * I suspect something closer to the clocksource_max_deferment() value
> > (which I think is max interval before multiplication overflows could
> > happen - ~12%) which we use in the scheduler would make more sense.
> > Especially since the timer scheduler uses that to calculate how long
> > we can idle for.
> 
> I'd rather not be anywhere *close* to any overflow problems. Even for
> the scheduler all-idle case, I'd argue that there is rather quickly
> diminishing returns. Yes, a thousand timer interrupts per second are
> expensive and a noticeable power draw. The difference between "one
> timer interrupt every two seconds" and "every 20 seconds" is rather
> less noticeable.
> 
> Of course, reasonable clock sources have *much* longer periods than a
> second (yeah, the acpi pm timer really isn't a good one), so there are
> probably good middle grounds, The 1/8th was a hack, and one that was
> aware of teh 300s cycle of the HPET at that..

I of course very much like the idea of the timekeeping system doing a
bit of self-checking.  ;-)

Can we simplify things by just not doing self-checking on clocks that
overflow in less than a few minutes?  In other words, if someone
reports oddball RCU CPU stall warnings when using the ACPI PM
timer, am I within my rights to tell them to reproduce the stall
using a better timer?

If so, we could possibly get away with the assumption that preemptions
don't last longer than the soft-lockup interval, which is currently
a bit over 20 seconds (hence "a few minutes" above).  And yes, you
might avoid a soft lockup by having a series of 15-second preemptions
between each pair of instructions, but my response would be to increase
my "a few minutes" a bit and to invoke probabilities.

I don't claim to understand the timer code for all the reasons that
Linus calls out below, but I believe that this simplifying
assumption would in turn simplify the self-check code.

Thanx, Paul

> > * Nulling out delta in timekeeping_get_ns() seems like it could cause
> > problems since time would then possibly go backwards compared to
> > previous reads (as you mentioned, resulting in smaller time jumps).
> > Instead it would probably make more sense to cap the delta at the
> > maximum value (though this assumes the clock doesn't jump back in the
> > interval before the next call to update_wall_time).
> 
> So part of the nulling was that it was simpler, and part of it was
> that I expected to get backwards jumps (see the other email to Dave
> about the inherent races). And with the whole timer mask modulo
> arithmetic, those backwards jumps just look like biggish positive
> numbers, not even negative. So it ends up being things like "is it an
> unsigned number larger than half the mask? Consider it negative" etc.
> 
> The "zero it out" was simple, and it worked for my test-case, which
> was "ok, my machine no longer locks up when I mess with the timer".
> 
> And I didn't post the earlier versions of that patch that didn't even *boot*.
> 
> I started out trying to do it at a higher level (not on a clock read
> level, but outside the whole 'convert-to-ns and do the sequence  value
> check'), but during bootup we play a lot of games with initializing
> the timer sources etc.
> 
> So that explains the approach of doing it at that
> 
>cycle_now = tkr->read(tkr->clock);
> 
> level, and keeping it very low-level.
> 
> But as I already explained in the email that crossed, that low-level
> thing also results in some fundamental races.
> 
> > * Also, as you note, this would just cause the big time jump to only
> > happen at the next update, since there's no logic in
> > update_wall_time() to limit the jump. I'm not sure if "believing" the
> > large jump at write time make that much more sense, though.
> 
> So I considered just capping it there (to a single interval or
> something). Again, just ignoring - like the read side does - it would
> have been easier, but at the same time I *really* wanted to make time
> go forward, so just taking the big value seemed safest.
> 
> But yes. this was very much a RFC patch. It's not even ready for real
> use, as DaveJ found out (although it might be good enough in practice,
> despite

Re: frequent lockups in 3.18rc4

2014-12-28 Thread Paul E. McKenney

On Mon, Dec 22, 2014 at 04:46:42PM -0800, Linus Torvalds wrote:
 On Mon, Dec 22, 2014 at 3:59 PM, John Stultz john.stu...@linaro.org wrote:
 
  * So 1/8th of the interval seems way too short, as there's
  clocksources like the ACP PM, which wrap every 2.5 seconds or so.
 
 Ugh. At the same time, 1/8th of a range is actually bigger than I'd
 like, since if there is some timer corruption, it means that we only
 catch it when it's in really big.
 
 But as I said, I'd actually prefer it to be time-based, because it
 would be good if this approach worked on things like the TSC which is
 a 64-bit counter..
 
 So yes, that capping was very much arbitrary, and was mostly a case of
 this works with the one timer source that I can easily trigger
 
  * I suspect something closer to the clocksource_max_deferment() value
  (which I think is max interval before multiplication overflows could
  happen - ~12%) which we use in the scheduler would make more sense.
  Especially since the timer scheduler uses that to calculate how long
  we can idle for.
 
 I'd rather not be anywhere *close* to any overflow problems. Even for
 the scheduler all-idle case, I'd argue that there is rather quickly
 diminishing returns. Yes, a thousand timer interrupts per second are
 expensive and a noticeable power draw. The difference between one
 timer interrupt every two seconds and every 20 seconds is rather
 less noticeable.
 
 Of course, reasonable clock sources have *much* longer periods than a
 second (yeah, the acpi pm timer really isn't a good one), so there are
 probably good middle grounds, The 1/8th was a hack, and one that was
 aware of teh 300s cycle of the HPET at that..

I of course very much like the idea of the timekeeping system doing a
bit of self-checking.  ;-)

Can we simplify things by just not doing self-checking on clocks that
overflow in less than a few minutes?  In other words, if someone
reports oddball RCU CPU stall warnings when using the ACPI PM
timer, am I within my rights to tell them to reproduce the stall
using a better timer?

If so, we could possibly get away with the assumption that preemptions
don't last longer than the soft-lockup interval, which is currently
a bit over 20 seconds (hence a few minutes above).  And yes, you
might avoid a soft lockup by having a series of 15-second preemptions
between each pair of instructions, but my response would be to increase
my a few minutes a bit and to invoke probabilities.

I don't claim to understand the timer code for all the reasons that
Linus calls out below, but I believe that this simplifying
assumption would in turn simplify the self-check code.

Thanx, Paul

  * Nulling out delta in timekeeping_get_ns() seems like it could cause
  problems since time would then possibly go backwards compared to
  previous reads (as you mentioned, resulting in smaller time jumps).
  Instead it would probably make more sense to cap the delta at the
  maximum value (though this assumes the clock doesn't jump back in the
  interval before the next call to update_wall_time).
 
 So part of the nulling was that it was simpler, and part of it was
 that I expected to get backwards jumps (see the other email to Dave
 about the inherent races). And with the whole timer mask modulo
 arithmetic, those backwards jumps just look like biggish positive
 numbers, not even negative. So it ends up being things like is it an
 unsigned number larger than half the mask? Consider it negative etc.
 
 The zero it out was simple, and it worked for my test-case, which
 was ok, my machine no longer locks up when I mess with the timer.
 
 And I didn't post the earlier versions of that patch that didn't even *boot*.
 
 I started out trying to do it at a higher level (not on a clock read
 level, but outside the whole 'convert-to-ns and do the sequence  value
 check'), but during bootup we play a lot of games with initializing
 the timer sources etc.
 
 So that explains the approach of doing it at that
 
cycle_now = tkr-read(tkr-clock);
 
 level, and keeping it very low-level.
 
 But as I already explained in the email that crossed, that low-level
 thing also results in some fundamental races.
 
  * Also, as you note, this would just cause the big time jump to only
  happen at the next update, since there's no logic in
  update_wall_time() to limit the jump. I'm not sure if believing the
  large jump at write time make that much more sense, though.
 
 So I considered just capping it there (to a single interval or
 something). Again, just ignoring - like the read side does - it would
 have been easier, but at the same time I *really* wanted to make time
 go forward, so just taking the big value seemed safest.
 
 But yes. this was very much a RFC patch. It's not even ready for real
 use, as DaveJ found out (although it might be good enough in practice,
 despite its flaws)
 
  * Finally, you're writing to error while only holding a read lock, but
  that's

Re: frequent lockups in 3.18rc4

2014-12-27 Thread Dave Jones

On Fri, Dec 26, 2014 at 07:14:55PM -0800, Linus Torvalds wrote:
 > On Fri, Dec 26, 2014 at 4:36 PM, Dave Jones  wrote:
 > >  >
 > >  > Oh - and have you actually seen the "TSC unstable (delta = xyz)" +
 > >  > "switched to hpet" messages there yet?
 > >
 > > not yet. 3 hrs in.
 > 
 > Ok, so then the
 > 
 >  INFO: rcu_preempt detected stalls on CPUs/tasks:
 > 
 > has nothing to do with HPET, since you'd still be running with the TSC 
 > enabled.

right. 16hrs later, that's the only thing that's spewed.

 > My googling around did find a number of "machine locks up a few hours
 > after switching to hpet" reports, so it is possible that the whole rcu
 > stall and nmi watchdog thing is independent and unrelated to the
 > actual locking up.

possible.  I'm heading home in a few hours to start the wipe of that
box. This is going to be 'the one that got away', but at least we've
managed to find a number of other things that needed fixing along the way.

Dave
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

2014-12-27 Thread Dave Jones

On Fri, Dec 26, 2014 at 07:14:55PM -0800, Linus Torvalds wrote:
  On Fri, Dec 26, 2014 at 4:36 PM, Dave Jones da...@codemonkey.org.uk wrote:

 Oh - and have you actually seen the TSC unstable (delta = xyz) +
 switched to hpet messages there yet?
  
   not yet. 3 hrs in.
  
  Ok, so then the
  
   INFO: rcu_preempt detected stalls on CPUs/tasks:
  
  has nothing to do with HPET, since you'd still be running with the TSC 
  enabled.

right. 16hrs later, that's the only thing that's spewed.

  My googling around did find a number of machine locks up a few hours
  after switching to hpet reports, so it is possible that the whole rcu
  stall and nmi watchdog thing is independent and unrelated to the
  actual locking up.

possible.  I'm heading home in a few hours to start the wipe of that
box. This is going to be 'the one that got away', but at least we've
managed to find a number of other things that needed fixing along the way.

Dave
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Fri, Dec 26, 2014 at 4:36 PM, Dave Jones  wrote:
>  >
>  > Oh - and have you actually seen the "TSC unstable (delta = xyz)" +
>  > "switched to hpet" messages there yet?
>
> not yet. 3 hrs in.

Ok, so then the

 INFO: rcu_preempt detected stalls on CPUs/tasks:

has nothing to do with HPET, since you'd still be running with the TSC enabled.

My googling around did find a number of "machine locks up a few hours
after switching to hpet" reports, so it is possible that the whole rcu
stall and nmi watchdog thing is independent and unrelated to the
actual locking up.

It *is* intriguing that my broken patch seemed to prevent it from
happening, though. And both NMI watchdogs and the rcu stall are
related to wall-clock time.  But hey, maybe there really is some odd
loop in the kernel that stops scheduling or RCU grace periods. It just
seems to be never caught by your backtraces..

 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Fri, Dec 26, 2014 at 03:30:20PM -0800, Linus Torvalds wrote:
 > On Fri, Dec 26, 2014 at 2:57 PM, Dave Jones  wrote:
 > >
 > > still running though..
 > 
 > Btw, did you ever boot with "tsc=reliable" as a kernel command line option?

I'll check it again in the morning, but before I turn in for the night,
so far the only thing is this:


[ 6713.394395] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 6713.394489]  Tasks blocked on level-0 rcu_node (CPUs 0-7):
[ 6713.394513]  Tasks blocked on level-0 rcu_node (CPUs 0-7):
[ 6713.394536]  (detected by 3, t=6502 jiffies, g=141292, c=141291, q=0)
[ 6713.394564] INFO: Stall ended before state dump start


Dave
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Fri, Dec 26, 2014 at 03:30:20PM -0800, Linus Torvalds wrote:
 > On Fri, Dec 26, 2014 at 2:57 PM, Dave Jones  wrote:
 > >
 > > still running though..
 > 
 > Btw, did you ever boot with "tsc=reliable" as a kernel command line option?

I don't think so.

 > For the last night, can you see if you can just run it with that, and
 > things work?

Sure.

 > So I'm not even convinced that trying to debug some HPET issue is
 > really worth it. Especially if your machine is a preproduction board
 > from Intel.

Yeah, I agree. Even though it's strange that this only became a problem
this last few months for me, after over a year of abuse.

Hopefully the new year brings more trustworthy hardware.

Dave

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Fri, Dec 26, 2014 at 03:16:41PM -0800, Linus Torvalds wrote:
 > On Fri, Dec 26, 2014 at 2:57 PM, Dave Jones  wrote:
 > >
 > > hm.
 > 
 > So with the previous patch that had the false positives, you never saw
 > this? You saw the false positives instead?
 
correct.

 > I'm wondering if the added debug noise just ended up helping. Doing a
 > printk() will automatically cause some scheduler activity. And they
 > also caused the time reading jiggle.
 > 
 > That said, it's also possible that I screwed something up in the
 > second version of the patch, just breaking it and making it generally
 > ineffective.
 > 
 > Oh - and have you actually seen the "TSC unstable (delta = xyz)" +
 > "switched to hpet" messages there yet?

not yet. 3 hrs in.

Dave
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Fri, Dec 26, 2014 at 2:57 PM, Dave Jones  wrote:
>
> still running though..

Btw, did you ever boot with "tsc=reliable" as a kernel command line option?

For the last night, can you see if you can just run it with that, and
things work? Because by now, my gut feel is that we should start
derating the HPET rather than the TSC, especially going forward on
modern hardware. And if this is some subtle timing issue with our hpet
code (the whole -ETIME thing when getting close to setting a timer is
subtle, for example, even if the HPET hardware itself would be ok),
I'm wondering if the fix isn't to just stop believing in HPET if there
are better alternatives around.

So I'm not even convinced that trying to debug some HPET issue is
really worth it. Especially if your machine is a preproduction board
from Intel.

But verifying that with just the TSC everything is ok might still be worth it.

   Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Fri, Dec 26, 2014 at 2:57 PM, Dave Jones  wrote:
>
> hm.

So with the previous patch that had the false positives, you never saw
this? You saw the false positives instead?

I'm wondering if the added debug noise just ended up helping. Doing a
printk() will automatically cause some scheduler activity. And they
also caused the time reading jiggle.

That said, it's also possible that I screwed something up in the
second version of the patch, just breaking it and making it generally
ineffective.

Oh - and have you actually seen the "TSC unstable (delta = xyz)" +
"switched to hpet" messages there yet?

 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Fri, Dec 26, 2014 at 12:57:07PM -0800, Linus Torvalds wrote:
 
 > I have a newer version of the patch that gets rid of the false
 > positives with some ordering rules instead, and just for you I hacked
 > it up to say where the problem happens too, but it's likely too late.

hm.


[ 2733.047100] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 2733.047188]  Tasks blocked on level-0 rcu_node (CPUs 0-7): P25811
[ 2733.047216]  Tasks blocked on level-0 rcu_node (CPUs 0-7): P25811
[ 2733.047242]  (detected by 0, t=6502 jiffies, g=52141, c=52140, q=0)
[ 2733.047271] trinity-c406R  running task13416 25811  24907 0x
[ 2733.047305]  88022208fd28 0002 a819f627 
8801df2c
[ 2733.047341]  001d31c0 0002 88022208ffd8 
001d31c0
[ 2733.047375]  8800806e1780 8801df2c 88022208fd18 
88022208ffd8
[ 2733.047411] Call Trace:
[ 2733.047429]  [] ? context_tracking_user_exit+0x67/0x280
[ 2733.047457]  [] preempt_schedule_irq+0x52/0xb0
[ 2733.047482]  [] retint_kernel+0x20/0x30
[ 2733.047505]  [] ? check_kill_permission+0xb1/0x1e0
[ 2733.047531]  [] ? check_kill_permission+0x152/0x1e0
[ 2733.047557]  [] group_send_sig_info+0x65/0x150
[ 2733.047581]  [] ? group_send_sig_info+0x5/0x150
[ 2733.047607]  [] ? rcu_read_lock_held+0x6e/0x80
[ 2733.047632]  [] kill_pid_info+0x78/0x130
[ 2733.047654]  [] ? kill_pid_info+0x5/0x130
[ 2733.047677]  [] SYSC_kill+0xf2/0x2f0
[ 2733.047699]  [] ? SYSC_kill+0x9b/0x2f0
[ 2733.047721]  [] ? trace_hardirqs_on+0xd/0x10
[ 2733.047745]  [] ? syscall_trace_enter_phase1+0x125/0x1a0
[ 2733.048607]  [] ? trace_hardirqs_on_caller+0x10d/0x1d0
[ 2733.049469]  [] SyS_kill+0xe/0x10
[ 2733.050332]  [] system_call_fastpath+0x12/0x17
[ 2733.051197] trinity-c406R  running task13416 25811  24907 0x
[ 2733.052064]  88022208fd28 0002 a819f627 
8801df2c
[ 2733.052932]  001d31c0 0002 88022208ffd8 
001d31c0
[ 2733.053792]  880209e2c680 8801df2c 88022208fd18 
88022208ffd8
[ 2733.054651] Call Trace:
[ 2733.055500]  [] ? context_tracking_user_exit+0x67/0x280
[ 2733.056362]  [] preempt_schedule_irq+0x52/0xb0
[ 2733.057222]  [] retint_kernel+0x20/0x30
[ 2733.058076]  [] ? check_kill_permission+0xb1/0x1e0
[ 2733.058930]  [] ? check_kill_permission+0x152/0x1e0
[ 2733.059778]  [] group_send_sig_info+0x65/0x150
[ 2733.060624]  [] ? group_send_sig_info+0x5/0x150
[ 2733.061472]  [] ? rcu_read_lock_held+0x6e/0x80
[ 2733.062322]  [] kill_pid_info+0x78/0x130
[ 2733.063168]  [] ? kill_pid_info+0x5/0x130
[ 2733.064015]  [] SYSC_kill+0xf2/0x2f0
[ 2733.064863]  [] ? SYSC_kill+0x9b/0x2f0
[ 2733.065704]  [] ? trace_hardirqs_on+0xd/0x10
[ 2733.066541]  [] ? syscall_trace_enter_phase1+0x125/0x1a0
[ 2733.067384]  [] ? trace_hardirqs_on_caller+0x10d/0x1d0
[ 2733.068217]  [] SyS_kill+0xe/0x10
[ 2733.069045]  [] system_call_fastpath+0x12/0x17
[ 3708.217920] perf interrupt took too long (2503 > 2500), lowering 
kernel.perf_event_max_sample_rate to 5
[ 4583.530580] request_module: runaway loop modprobe personality-87


still running though..

Dave

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Fri, Dec 26, 2014 at 12:57:07PM -0800, Linus Torvalds wrote:

 > I have a newer version of the patch that gets rid of the false
 > positives with some ordering rules instead, and just for you I hacked
 > it up to say where the problem happens too, but it's likely too late.

I'll give it a spin and see what falls out this evening.

 > So I'm still a bit unhappy about not figuring out *what* is wrong. And
 > I'd still like the dmidecode from that machine, just for posterity. In
 > case we can figure out some pattern.

So this is something I should have done a long time ago.
Googling for the board name shows up a very similar report
from a year ago, except that was within kvm, and was aparently fixed.
https://lkml.org/lkml/2013/10/9/206 and 
https://bugzilla.kernel.org/show_bug.cgi?id=69491
(dmidecode attachment there is pretty much the same as mine)

 >  - actual hardware bug.
 >This is *really* unlikely, though. It should hit everybody. The
 > HPET is in the core intel chipset, we're not talking random unusual
 > hardware by fly-by-night vendors here.

This machine is allegedly a 'production' box from Intel, but
given Kashyap saw something very similar I'm wondering now if there
was some board/bios errata for this system.

There's a few Intel folks cc'd here, maybe one those can dig up whether
there was anything peculiar about Shark Bay systems that would
explain the HPET getting screwed up.

Dave
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Fri, Dec 26, 2014 at 10:12 AM, Dave Jones  wrote:
> On Fri, Dec 26, 2014 at 11:34:10AM -0500, Dave Jones wrote:
>
>  > One thing I think I'll try is to try and narrow down which
>  > syscalls are triggering those "Clocksource hpet had cycles off"
>  > messages.  I'm still unclear on exactly what is doing
>  > the stomping on the hpet.
>
> First I ran trinity with "-g vm" which limits it to use just
> a subset of syscalls, specifically VM related ones.
> That triggered the messages. Further experiments revealed:

So I can trigger the false positives with my original patch quite
easily by just putting my box under some load. My numbers are nowhere
near as bad as yours, but then, I didn't put it under as much load
anyway. Just a regular "make -j64" of the kernel.

I suspect your false positives are bigger partly because of the load,
but mostly because you presumably have preemption enabled too. I don't
do preemption in my normal kernels, and that limits the damage of the
race a bit.

I have a newer version of the patch that gets rid of the false
positives with some ordering rules instead, and just for you I hacked
it up to say where the problem happens too, but it's likely too late.

The fact that the original racy patch seems to make a difference for
you does say that yes, we seem to be zeroing in on the right area
here, but I'm not seeing what's wrong. I was hoping for big jumps from
your HPET, since your "TSC unstable" messages do kind of imply that
such really big jumps can happen.

I'm attaching my updated hacky patch, although I assume it's much too
late for that machine. Don't look too closely at the backtrace
generation part, that's just a quick hack, and only works with frame
pointers enabled anyway.

So I'm still a bit unhappy about not figuring out *what* is wrong. And
I'd still like the dmidecode from that machine, just for posterity. In
case we can figure out some pattern.

So right now I can imagine several reasons:

 - actual hardware bug.

   This is *really* unlikely, though. It should hit everybody. The
HPET is in the core intel chipset, we're not talking random unusual
hardware by fly-by-night vendors here.

 - some SMM/BIOS "power management" feature.

   We've seen this before, where the SMM saves/restores the TSC on
entry/exit in order to hide itself from the system. I could imagine
similar code for the HPET counter. SMM writers use some bad drugs to
dull their pain.

   And with the HPET counter, since it's not even per-CPU, the "save
and restore HPET" will actually show up as "HPET went backwards" to
the other non-SMM CPU's if it happens

 - a bug in our own clocksource handling.

   I'm not seeing it. But maybe my patch hides it for some magical reason.

 - gremlins.

So I dunno. I hope more people will look at this after the holidays,
even if your machine is gone. My test-program to do bad things to the
HPET shows *something*, and works on any machine.

Linus
 arch/x86/kernel/entry_64.S  |  5 +++
 include/linux/timekeeper_internal.h |  1 +
 kernel/time/timekeeping.c   | 78 +++--
 3 files changed, 81 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 9ebaf63ba182..0a4c34b4658e 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -312,6 +312,11 @@ ENTRY(save_paranoid)
CFI_ENDPROC
 END(save_paranoid)

+ENTRY(save_back_trace)
+   movq %rbp,%rdi
+   jmp do_save_back_trace
+END(save_back_trace)
+
 /*
  * A newly forked process directly context switches into this address.
  *
diff --git a/include/linux/timekeeper_internal.h 
b/include/linux/timekeeper_internal.h
index 05af9a334893..0fcb60d77079 100644
--- a/include/linux/timekeeper_internal.h
+++ b/include/linux/timekeeper_internal.h
@@ -32,6 +32,7 @@ struct tk_read_base {
cycle_t (*read)(struct clocksource *cs);
cycle_t mask;
cycle_t cycle_last;
+   cycle_t cycle_error;
u32 mult;
u32 shift;
u64 xtime_nsec;
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 6a931852082f..1c924c80b462 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -140,6 +140,7 @@ static void tk_setup_internals(struct timekeeper *tk, 
struct clocksource *clock)
tk->tkr.read = clock->read;
tk->tkr.mask = clock->mask;
tk->tkr.cycle_last = tk->tkr.read(clock);
+   tk->tkr.cycle_error = 0;

/* Do the ns -> cycle conversion first, using original mult */
tmp = NTP_INTERVAL_LENGTH;
@@ -191,16 +192,59 @@ u32 (*arch_gettimeoffset)(void) = 
default_arch_gettimeoffset;
 static inline u32 arch_gettimeoffset(void) { return 0; }
 #endif

+unsigned long tracebuffer[16];
+
+extern void save_back_trace(long dummy, void *ptr);
+
+void do_save_back_trace(long rbp,

Re: frequent lockups in 3.18rc4

On Fri, Dec 26, 2014 at 11:34:10AM -0500, Dave Jones wrote:

 > One thing I think I'll try is to try and narrow down which
 > syscalls are triggering those "Clocksource hpet had cycles off"
 > messages.  I'm still unclear on exactly what is doing
 > the stomping on the hpet.

First I ran trinity with "-g vm" which limits it to use just
a subset of syscalls, specifically VM related ones.
That triggered the messages. Further experiments revealed:

-c mremap triggered it, but only when I also passed -C256
to crank up the number of child processes. The same thing
occured with mprotect, madvise, remap_file_pages.

I couldn't trigger it with -c mmap, or msync, mbind, move_pages,
migrate_pages, mlock, regardless of how many child processes there were.

Given the high child count necessary to trigger it,
it's nigh on impossible to weed through all the calls
that trinity made to figure out which one actually
triggered the messages.

I'm not even convinced that the syscall parameters are
even particularly interesting.  The "needs high load to trigger"
aspect of the bug still has a smell of scheduler interaction or
side effect of lock contention. Looking at one childs
syscall params in isolation might look quite dull, but if
we have N processes hammering on the same mapping, that's
probably a lot more interesting.

Dave

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Tue, Dec 23, 2014 at 10:01:25PM -0500, Dave Jones wrote:
 > On Mon, Dec 22, 2014 at 03:59:19PM -0800, Linus Torvalds wrote:
 >  
 >  > But in the meantime please do keep that thing running as long as you
 >  > can. Let's see if we get bigger jumps. Or perhaps we'll get a negative
 >  > result - the original softlockup bug happening *without* any bigger
 >  > hpet jumps.
 > 
 > So I've got this box a *little* longer than anticipated.
 > It's now been running 30 hours with not a single NMI lockup.
 > and that's with my kitchen-sink debugging kernel.
 > 
 > The 'hpet off' messages continue to be spewed, and again they're
 > all in the same range of 4293198075 -> 4294967266

In case there was any doubt remaining, it's now been running
3 days, 20 hours with no lockups at all.  I haven't seen it
run this long in months.

Either tomorrow or Sunday I'm finally wiping that box
to give it back on Monday, so if there's anything else
you'd like to try, the next 24hrs are pretty much the only
remaining time I have.

One thing I think I'll try is to try and narrow down which
syscalls are triggering those "Clocksource hpet had cycles off"
messages.  I'm still unclear on exactly what is doing
the stomping on the hpet.

Dave

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Tue, Dec 23, 2014 at 10:01:25PM -0500, Dave Jones wrote:
  On Mon, Dec 22, 2014 at 03:59:19PM -0800, Linus Torvalds wrote:
   
But in the meantime please do keep that thing running as long as you
can. Let's see if we get bigger jumps. Or perhaps we'll get a negative
result - the original softlockup bug happening *without* any bigger
hpet jumps.
  
  So I've got this box a *little* longer than anticipated.
  It's now been running 30 hours with not a single NMI lockup.
  and that's with my kitchen-sink debugging kernel.
  
  The 'hpet off' messages continue to be spewed, and again they're
  all in the same range of 4293198075 - 4294967266

In case there was any doubt remaining, it's now been running
3 days, 20 hours with no lockups at all.  I haven't seen it
run this long in months.

Either tomorrow or Sunday I'm finally wiping that box
to give it back on Monday, so if there's anything else
you'd like to try, the next 24hrs are pretty much the only
remaining time I have.

One thing I think I'll try is to try and narrow down which
syscalls are triggering those Clocksource hpet had cycles off
messages.  I'm still unclear on exactly what is doing
the stomping on the hpet.

Dave

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Fri, Dec 26, 2014 at 11:34:10AM -0500, Dave Jones wrote:

  One thing I think I'll try is to try and narrow down which
  syscalls are triggering those Clocksource hpet had cycles off
  messages.  I'm still unclear on exactly what is doing
  the stomping on the hpet.

First I ran trinity with -g vm which limits it to use just
a subset of syscalls, specifically VM related ones.
That triggered the messages. Further experiments revealed:

-c mremap triggered it, but only when I also passed -C256
to crank up the number of child processes. The same thing
occured with mprotect, madvise, remap_file_pages.

I couldn't trigger it with -c mmap, or msync, mbind, move_pages,
migrate_pages, mlock, regardless of how many child processes there were.


Given the high child count necessary to trigger it,
it's nigh on impossible to weed through all the calls
that trinity made to figure out which one actually
triggered the messages.

I'm not even convinced that the syscall parameters are
even particularly interesting.  The needs high load to trigger
aspect of the bug still has a smell of scheduler interaction or
side effect of lock contention. Looking at one childs
syscall params in isolation might look quite dull, but if
we have N processes hammering on the same mapping, that's
probably a lot more interesting.

Dave

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Fri, Dec 26, 2014 at 10:12 AM, Dave Jones da...@codemonkey.org.uk wrote:
 On Fri, Dec 26, 2014 at 11:34:10AM -0500, Dave Jones wrote:

   One thing I think I'll try is to try and narrow down which
   syscalls are triggering those Clocksource hpet had cycles off
   messages.  I'm still unclear on exactly what is doing
   the stomping on the hpet.

 First I ran trinity with -g vm which limits it to use just
 a subset of syscalls, specifically VM related ones.
 That triggered the messages. Further experiments revealed:

So I can trigger the false positives with my original patch quite
easily by just putting my box under some load. My numbers are nowhere
near as bad as yours, but then, I didn't put it under as much load
anyway. Just a regular make -j64 of the kernel.

I suspect your false positives are bigger partly because of the load,
but mostly because you presumably have preemption enabled too. I don't
do preemption in my normal kernels, and that limits the damage of the
race a bit.

I have a newer version of the patch that gets rid of the false
positives with some ordering rules instead, and just for you I hacked
it up to say where the problem happens too, but it's likely too late.

The fact that the original racy patch seems to make a difference for
you does say that yes, we seem to be zeroing in on the right area
here, but I'm not seeing what's wrong. I was hoping for big jumps from
your HPET, since your TSC unstable messages do kind of imply that
such really big jumps can happen.

I'm attaching my updated hacky patch, although I assume it's much too
late for that machine. Don't look too closely at the backtrace
generation part, that's just a quick hack, and only works with frame
pointers enabled anyway.

So I'm still a bit unhappy about not figuring out *what* is wrong. And
I'd still like the dmidecode from that machine, just for posterity. In
case we can figure out some pattern.

So right now I can imagine several reasons:

 - actual hardware bug.

   This is *really* unlikely, though. It should hit everybody. The
HPET is in the core intel chipset, we're not talking random unusual
hardware by fly-by-night vendors here.

 - some SMM/BIOS power management feature.

   We've seen this before, where the SMM saves/restores the TSC on
entry/exit in order to hide itself from the system. I could imagine
similar code for the HPET counter. SMM writers use some bad drugs to
dull their pain.

   And with the HPET counter, since it's not even per-CPU, the save
and restore HPET will actually show up as HPET went backwards to
the other non-SMM CPU's if it happens

 - a bug in our own clocksource handling.

   I'm not seeing it. But maybe my patch hides it for some magical reason.

 - gremlins.

So I dunno. I hope more people will look at this after the holidays,
even if your machine is gone. My test-program to do bad things to the
HPET shows *something*, and works on any machine.

Linus
 arch/x86/kernel/entry_64.S  |  5 +++
 include/linux/timekeeper_internal.h |  1 +
 kernel/time/timekeeping.c   | 78 +++--
 3 files changed, 81 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 9ebaf63ba182..0a4c34b4658e 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -312,6 +312,11 @@ ENTRY(save_paranoid)
CFI_ENDPROC
 END(save_paranoid)
 
+ENTRY(save_back_trace)
+   movq %rbp,%rdi
+   jmp do_save_back_trace
+END(save_back_trace)
+
 /*
  * A newly forked process directly context switches into this address.
  *
diff --git a/include/linux/timekeeper_internal.h 
b/include/linux/timekeeper_internal.h
index 05af9a334893..0fcb60d77079 100644
--- a/include/linux/timekeeper_internal.h
+++ b/include/linux/timekeeper_internal.h
@@ -32,6 +32,7 @@ struct tk_read_base {
cycle_t (*read)(struct clocksource *cs);
cycle_t mask;
cycle_t cycle_last;
+   cycle_t cycle_error;
u32 mult;
u32 shift;
u64 xtime_nsec;
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 6a931852082f..1c924c80b462 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -140,6 +140,7 @@ static void tk_setup_internals(struct timekeeper *tk, 
struct clocksource *clock)
tk-tkr.read = clock-read;
tk-tkr.mask = clock-mask;
tk-tkr.cycle_last = tk-tkr.read(clock);
+   tk-tkr.cycle_error = 0;
 
/* Do the ns - cycle conversion first, using original mult */
tmp = NTP_INTERVAL_LENGTH;
@@ -191,16 +192,59 @@ u32 (*arch_gettimeoffset)(void) = 
default_arch_gettimeoffset;
 static inline u32 arch_gettimeoffset(void) { return 0; }
 #endif
 
+unsigned long tracebuffer[16];
+
+extern void save_back_trace(long dummy, void *ptr);
+
+void do_save_back_trace(long rbp, void *ptr)

Re: frequent lockups in 3.18rc4

On Fri, Dec 26, 2014 at 12:57:07PM -0800, Linus Torvalds wrote:

  I have a newer version of the patch that gets rid of the false
  positives with some ordering rules instead, and just for you I hacked
  it up to say where the problem happens too, but it's likely too late.

I'll give it a spin and see what falls out this evening.

  So I'm still a bit unhappy about not figuring out *what* is wrong. And
  I'd still like the dmidecode from that machine, just for posterity. In
  case we can figure out some pattern.

So this is something I should have done a long time ago.
Googling for the board name shows up a very similar report
from a year ago, except that was within kvm, and was aparently fixed.
https://lkml.org/lkml/2013/10/9/206 and 
https://bugzilla.kernel.org/show_bug.cgi?id=69491
(dmidecode attachment there is pretty much the same as mine)

   - actual hardware bug.
 This is *really* unlikely, though. It should hit everybody. The
  HPET is in the core intel chipset, we're not talking random unusual
  hardware by fly-by-night vendors here.

This machine is allegedly a 'production' box from Intel, but
given Kashyap saw something very similar I'm wondering now if there
was some board/bios errata for this system.

There's a few Intel folks cc'd here, maybe one those can dig up whether
there was anything peculiar about Shark Bay systems that would
explain the HPET getting screwed up.

Dave
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Fri, Dec 26, 2014 at 12:57:07PM -0800, Linus Torvalds wrote:
 
  I have a newer version of the patch that gets rid of the false
  positives with some ordering rules instead, and just for you I hacked
  it up to say where the problem happens too, but it's likely too late.

hm.


[ 2733.047100] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 2733.047188]  Tasks blocked on level-0 rcu_node (CPUs 0-7): P25811
[ 2733.047216]  Tasks blocked on level-0 rcu_node (CPUs 0-7): P25811
[ 2733.047242]  (detected by 0, t=6502 jiffies, g=52141, c=52140, q=0)
[ 2733.047271] trinity-c406R  running task13416 25811  24907 0x
[ 2733.047305]  88022208fd28 0002 a819f627 
8801df2c
[ 2733.047341]  001d31c0 0002 88022208ffd8 
001d31c0
[ 2733.047375]  8800806e1780 8801df2c 88022208fd18 
88022208ffd8
[ 2733.047411] Call Trace:
[ 2733.047429]  [a819f627] ? context_tracking_user_exit+0x67/0x280
[ 2733.047457]  [a88522a2] preempt_schedule_irq+0x52/0xb0
[ 2733.047482]  [a8859820] retint_kernel+0x20/0x30
[ 2733.047505]  [a808a361] ? check_kill_permission+0xb1/0x1e0
[ 2733.047531]  [a808a402] ? check_kill_permission+0x152/0x1e0
[ 2733.047557]  [a808dc25] group_send_sig_info+0x65/0x150
[ 2733.047581]  [a808dbc5] ? group_send_sig_info+0x5/0x150
[ 2733.047607]  [a80ed71e] ? rcu_read_lock_held+0x6e/0x80
[ 2733.047632]  [a808dee8] kill_pid_info+0x78/0x130
[ 2733.047654]  [a808de75] ? kill_pid_info+0x5/0x130
[ 2733.047677]  [a808e0b2] SYSC_kill+0xf2/0x2f0
[ 2733.047699]  [a808e05b] ? SYSC_kill+0x9b/0x2f0
[ 2733.047721]  [a80d7ffd] ? trace_hardirqs_on+0xd/0x10
[ 2733.047745]  [a8013765] ? syscall_trace_enter_phase1+0x125/0x1a0
[ 2733.048607]  [a80d7f2d] ? trace_hardirqs_on_caller+0x10d/0x1d0
[ 2733.049469]  [a809079e] SyS_kill+0xe/0x10
[ 2733.050332]  [a8858aa2] system_call_fastpath+0x12/0x17
[ 2733.051197] trinity-c406R  running task13416 25811  24907 0x
[ 2733.052064]  88022208fd28 0002 a819f627 
8801df2c
[ 2733.052932]  001d31c0 0002 88022208ffd8 
001d31c0
[ 2733.053792]  880209e2c680 8801df2c 88022208fd18 
88022208ffd8
[ 2733.054651] Call Trace:
[ 2733.055500]  [a819f627] ? context_tracking_user_exit+0x67/0x280
[ 2733.056362]  [a88522a2] preempt_schedule_irq+0x52/0xb0
[ 2733.057222]  [a8859820] retint_kernel+0x20/0x30
[ 2733.058076]  [a808a361] ? check_kill_permission+0xb1/0x1e0
[ 2733.058930]  [a808a402] ? check_kill_permission+0x152/0x1e0
[ 2733.059778]  [a808dc25] group_send_sig_info+0x65/0x150
[ 2733.060624]  [a808dbc5] ? group_send_sig_info+0x5/0x150
[ 2733.061472]  [a80ed71e] ? rcu_read_lock_held+0x6e/0x80
[ 2733.062322]  [a808dee8] kill_pid_info+0x78/0x130
[ 2733.063168]  [a808de75] ? kill_pid_info+0x5/0x130
[ 2733.064015]  [a808e0b2] SYSC_kill+0xf2/0x2f0
[ 2733.064863]  [a808e05b] ? SYSC_kill+0x9b/0x2f0
[ 2733.065704]  [a80d7ffd] ? trace_hardirqs_on+0xd/0x10
[ 2733.066541]  [a8013765] ? syscall_trace_enter_phase1+0x125/0x1a0
[ 2733.067384]  [a80d7f2d] ? trace_hardirqs_on_caller+0x10d/0x1d0
[ 2733.068217]  [a809079e] SyS_kill+0xe/0x10
[ 2733.069045]  [a8858aa2] system_call_fastpath+0x12/0x17
[ 3708.217920] perf interrupt took too long (2503  2500), lowering 
kernel.perf_event_max_sample_rate to 5
[ 4583.530580] request_module: runaway loop modprobe personality-87


still running though..

Dave

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Fri, Dec 26, 2014 at 2:57 PM, Dave Jones da...@codemonkey.org.uk wrote:

 hm.

So with the previous patch that had the false positives, you never saw
this? You saw the false positives instead?

I'm wondering if the added debug noise just ended up helping. Doing a
printk() will automatically cause some scheduler activity. And they
also caused the time reading jiggle.

That said, it's also possible that I screwed something up in the
second version of the patch, just breaking it and making it generally
ineffective.

Oh - and have you actually seen the TSC unstable (delta = xyz) +
switched to hpet messages there yet?

 Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Fri, Dec 26, 2014 at 2:57 PM, Dave Jones da...@codemonkey.org.uk wrote:

 still running though..

Btw, did you ever boot with tsc=reliable as a kernel command line option?

For the last night, can you see if you can just run it with that, and
things work? Because by now, my gut feel is that we should start
derating the HPET rather than the TSC, especially going forward on
modern hardware. And if this is some subtle timing issue with our hpet
code (the whole -ETIME thing when getting close to setting a timer is
subtle, for example, even if the HPET hardware itself would be ok),
I'm wondering if the fix isn't to just stop believing in HPET if there
are better alternatives around.

So I'm not even convinced that trying to debug some HPET issue is
really worth it. Especially if your machine is a preproduction board
from Intel.

But verifying that with just the TSC everything is ok might still be worth it.

   Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Fri, Dec 26, 2014 at 03:16:41PM -0800, Linus Torvalds wrote:
  On Fri, Dec 26, 2014 at 2:57 PM, Dave Jones da...@codemonkey.org.uk wrote:
  
   hm.
  
  So with the previous patch that had the false positives, you never saw
  this? You saw the false positives instead?
 
correct.

  I'm wondering if the added debug noise just ended up helping. Doing a
  printk() will automatically cause some scheduler activity. And they
  also caused the time reading jiggle.
  
  That said, it's also possible that I screwed something up in the
  second version of the patch, just breaking it and making it generally
  ineffective.
  
  Oh - and have you actually seen the TSC unstable (delta = xyz) +
  switched to hpet messages there yet?

not yet. 3 hrs in.

Dave
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Fri, Dec 26, 2014 at 03:30:20PM -0800, Linus Torvalds wrote:
  On Fri, Dec 26, 2014 at 2:57 PM, Dave Jones da...@codemonkey.org.uk wrote:
  
   still running though..
  
  Btw, did you ever boot with tsc=reliable as a kernel command line option?

I don't think so.

  For the last night, can you see if you can just run it with that, and
  things work?

Sure.

  So I'm not even convinced that trying to debug some HPET issue is
  really worth it. Especially if your machine is a preproduction board
  from Intel.

Yeah, I agree. Even though it's strange that this only became a problem
this last few months for me, after over a year of abuse.

Hopefully the new year brings more trustworthy hardware.

Dave

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Fri, Dec 26, 2014 at 03:30:20PM -0800, Linus Torvalds wrote:
  On Fri, Dec 26, 2014 at 2:57 PM, Dave Jones da...@codemonkey.org.uk wrote:
  
   still running though..
  
  Btw, did you ever boot with tsc=reliable as a kernel command line option?

I'll check it again in the morning, but before I turn in for the night,
so far the only thing is this:


[ 6713.394395] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 6713.394489]  Tasks blocked on level-0 rcu_node (CPUs 0-7):
[ 6713.394513]  Tasks blocked on level-0 rcu_node (CPUs 0-7):
[ 6713.394536]  (detected by 3, t=6502 jiffies, g=141292, c=141291, q=0)
[ 6713.394564] INFO: Stall ended before state dump start


Dave
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Fri, Dec 26, 2014 at 4:36 PM, Dave Jones da...@codemonkey.org.uk wrote:
  
   Oh - and have you actually seen the TSC unstable (delta = xyz) +
   switched to hpet messages there yet?

 not yet. 3 hrs in.

Ok, so then the

 INFO: rcu_preempt detected stalls on CPUs/tasks:

has nothing to do with HPET, since you'd still be running with the TSC enabled.

My googling around did find a number of machine locks up a few hours
after switching to hpet reports, so it is possible that the whole rcu
stall and nmi watchdog thing is independent and unrelated to the
actual locking up.

It *is* intriguing that my broken patch seemed to prevent it from
happening, though. And both NMI watchdogs and the rcu stall are
related to wall-clock time.  But hey, maybe there really is some odd
loop in the kernel that stops scheduling or RCU grace periods. It just
seems to be never caught by your backtraces..

 Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

2014-12-24 Thread Sasha Levin

On 12/23/2014 09:56 AM, Dave Jones wrote:
> On Mon, Dec 22, 2014 at 03:59:19PM -0800, Linus Torvalds wrote:
>  
>  > But in the meantime please do keep that thing running as long as you
>  > can. Let's see if we get bigger jumps. Or perhaps we'll get a negative
>  > result - the original softlockup bug happening *without* any bigger
>  > hpet jumps.
> 
> It's been going for 18 hours, with just a bunch more of those hpet
> messages, all in the same range.  I'll leave it go a few more hours,
> before I have to wipe it, but I've got feel-good vibes about this.
> Even if that patch isn't the solution, It seems like we're finally
> looking in the right direction.

I've got myself a physical server to play with, and running trinity on it
seems to cause similar stalls:

 2338.389210] INFO: rcu_sched self-detected stall on CPU[ 2338.429153] INFO: 
rcu_sched detected stalls on CPUs/tasks:[ 2338.429164] 16: (5999 ticks this 
GP) idle=4b5/141/0 softirq=24859/24860 last_accelerate: 039d/1b78, 
nonlazy_posted: 64, ..
[ 2338.429165]  
[ 2338.680231]  16: (5999 ticks this GP) idle=4b5/141/0 
softirq=24859/24860 last_accelerate: 039d/1b91, nonlazy_posted: 64, ..
[ 2338.828353]   (t=6044 jiffies g=16473 c=16472 q=4915881)

Oddly enough, there's no stacktrace...


Thanks,
Sasha
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

2014-12-24 Thread Sasha Levin

On 12/23/2014 09:56 AM, Dave Jones wrote:
 On Mon, Dec 22, 2014 at 03:59:19PM -0800, Linus Torvalds wrote:
  
   But in the meantime please do keep that thing running as long as you
   can. Let's see if we get bigger jumps. Or perhaps we'll get a negative
   result - the original softlockup bug happening *without* any bigger
   hpet jumps.
 
 It's been going for 18 hours, with just a bunch more of those hpet
 messages, all in the same range.  I'll leave it go a few more hours,
 before I have to wipe it, but I've got feel-good vibes about this.
 Even if that patch isn't the solution, It seems like we're finally
 looking in the right direction.

I've got myself a physical server to play with, and running trinity on it
seems to cause similar stalls:

 2338.389210] INFO: rcu_sched self-detected stall on CPU[ 2338.429153] INFO: 
rcu_sched detected stalls on CPUs/tasks:[ 2338.429164] 16: (5999 ticks this 
GP) idle=4b5/141/0 softirq=24859/24860 last_accelerate: 039d/1b78, 
nonlazy_posted: 64, ..
[ 2338.429165]  
[ 2338.680231]  16: (5999 ticks this GP) idle=4b5/141/0 
softirq=24859/24860 last_accelerate: 039d/1b91, nonlazy_posted: 64, ..
[ 2338.828353]   (t=6044 jiffies g=16473 c=16472 q=4915881)

Oddly enough, there's no stacktrace...


Thanks,
Sasha
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Mon, Dec 22, 2014 at 03:59:19PM -0800, Linus Torvalds wrote:
 
 > But in the meantime please do keep that thing running as long as you
 > can. Let's see if we get bigger jumps. Or perhaps we'll get a negative
 > result - the original softlockup bug happening *without* any bigger
 > hpet jumps.

So I've got this box a *little* longer than anticipated.
It's now been running 30 hours with not a single NMI lockup.
and that's with my kitchen-sink debugging kernel.

The 'hpet off' messages continue to be spewed, and again they're
all in the same range of 4293198075 -> 4294967266

Dave
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Mon, Dec 22, 2014 at 03:59:19PM -0800, Linus Torvalds wrote:

 > But in the meantime please do keep that thing running as long as you
 > can. Let's see if we get bigger jumps. Or perhaps we'll get a negative
 > result - the original softlockup bug happening *without* any bigger
 > hpet jumps.

It's been going for 18 hours, with just a bunch more of those hpet
messages, all in the same range.  I'll leave it go a few more hours,
before I have to wipe it, but I've got feel-good vibes about this.
Even if that patch isn't the solution, It seems like we're finally
looking in the right direction.

Dave
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Mon, Dec 22, 2014 at 03:59:19PM -0800, Linus Torvalds wrote:
 
  But in the meantime please do keep that thing running as long as you
  can. Let's see if we get bigger jumps. Or perhaps we'll get a negative
  result - the original softlockup bug happening *without* any bigger
  hpet jumps.

It's been going for 18 hours, with just a bunch more of those hpet
messages, all in the same range.  I'll leave it go a few more hours,
before I have to wipe it, but I've got feel-good vibes about this.
Even if that patch isn't the solution, It seems like we're finally
looking in the right direction.

Dave
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Mon, Dec 22, 2014 at 03:59:19PM -0800, Linus Torvalds wrote:
 
  But in the meantime please do keep that thing running as long as you
  can. Let's see if we get bigger jumps. Or perhaps we'll get a negative
  result - the original softlockup bug happening *without* any bigger
  hpet jumps.

So I've got this box a *little* longer than anticipated.
It's now been running 30 hours with not a single NMI lockup.
and that's with my kitchen-sink debugging kernel.

The 'hpet off' messages continue to be spewed, and again they're
all in the same range of 4293198075 - 4294967266

Dave
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Mon, Dec 22, 2014 at 3:59 PM, John Stultz  wrote:
>
> * So 1/8th of the interval seems way too short, as there's
> clocksources like the ACP PM, which wrap every 2.5 seconds or so.

Ugh. At the same time, 1/8th of a range is actually bigger than I'd
like, since if there is some timer corruption, it means that we only
catch it when it's in really big.

But as I said, I'd actually prefer it to be time-based, because it
would be good if this approach worked on things like the TSC which is
a 64-bit counter..

So yes, that capping was very much arbitrary, and was mostly a case of
"this works with the one timer source that I can easily trigger"

> * I suspect something closer to the clocksource_max_deferment() value
> (which I think is max interval before multiplication overflows could
> happen - ~12%) which we use in the scheduler would make more sense.
> Especially since the timer scheduler uses that to calculate how long
> we can idle for.

I'd rather not be anywhere *close* to any overflow problems. Even for
the scheduler all-idle case, I'd argue that there is rather quickly
diminishing returns. Yes, a thousand timer interrupts per second are
expensive and a noticeable power draw. The difference between "one
timer interrupt every two seconds" and "every 20 seconds" is rather
less noticeable.

Of course, reasonable clock sources have *much* longer periods than a
second (yeah, the acpi pm timer really isn't a good one), so there are
probably good middle grounds, The 1/8th was a hack, and one that was
aware of teh 300s cycle of the HPET at that..

> * Nulling out delta in timekeeping_get_ns() seems like it could cause
> problems since time would then possibly go backwards compared to
> previous reads (as you mentioned, resulting in smaller time jumps).
> Instead it would probably make more sense to cap the delta at the
> maximum value (though this assumes the clock doesn't jump back in the
> interval before the next call to update_wall_time).

So part of the nulling was that it was simpler, and part of it was
that I expected to get backwards jumps (see the other email to Dave
about the inherent races). And with the whole timer mask modulo
arithmetic, those backwards jumps just look like biggish positive
numbers, not even negative. So it ends up being things like "is it an
unsigned number larger than half the mask? Consider it negative" etc.

The "zero it out" was simple, and it worked for my test-case, which
was "ok, my machine no longer locks up when I mess with the timer".

And I didn't post the earlier versions of that patch that didn't even *boot*.

I started out trying to do it at a higher level (not on a clock read
level, but outside the whole 'convert-to-ns and do the sequence  value
check'), but during bootup we play a lot of games with initializing
the timer sources etc.

So that explains the approach of doing it at that

   cycle_now = tkr->read(tkr->clock);

level, and keeping it very low-level.

But as I already explained in the email that crossed, that low-level
thing also results in some fundamental races.

> * Also, as you note, this would just cause the big time jump to only
> happen at the next update, since there's no logic in
> update_wall_time() to limit the jump. I'm not sure if "believing" the
> large jump at write time make that much more sense, though.

So I considered just capping it there (to a single interval or
something). Again, just ignoring - like the read side does - it would
have been easier, but at the same time I *really* wanted to make time
go forward, so just taking the big value seemed safest.

But yes. this was very much a RFC patch. It's not even ready for real
use, as DaveJ found out (although it might be good enough in practice,
despite its flaws)

> * Finally, you're writing to error while only holding a read lock, but
> that's sort of a minor thing.

It's not a minor thing, but the alternatives looked worse.

I really wanted to make it per-cpu, and do this with interrupts
disabled or something. But that then pushes a big problem to the write
time to go over all cpu's and see if there are errors.

So it's not right. But .. It's a hacky patch to get discussion
started, and it's actually hard to do "right" when this code has to be
basically lockless.

> * Checking the accumulation interval isn't beyond the
> clocksource_max_deferment() value seems like a very good check to have
> in update_wall_time().

Sounds like a good idea. Also, quite frankly, reading all the code I
wasn't ever really able to figure out that things don't overflow. The
overflow protection is a bit ad-hoc (that maxshift thing in
update_wall_time() really makes baby Jesus cry, despite the season,
and it wasn't at all obvious that ntp_tick_length() is fundamentally
bigger than xtime_interval, for example).

It's also not clear that the complicated and frankly not-very-obvious
shift-loop is any faster than just using a divide - possibly with the
"single interval" case being a special case to avoid

Re: frequent lockups in 3.18rc4

2014-12-22 Thread John Stultz

On Mon, Dec 22, 2014 at 11:47 AM, Linus Torvalds
 wrote:
> On Sun, Dec 21, 2014 at 4:41 PM, Linus Torvalds
>  wrote:
>>
>> This is *not* to say that this is the bug you're hitting. But it does show 
>> that
>>
>>  (a) a flaky HPET can do some seriously bad stuff
>>  (b) the kernel is very fragile wrt time going backwards.
>>
>> and maybe we can use this test program to at least try to alleviate problem 
>> (b).
>
> Ok, so after several false starts (ktime_get() is really really
> fragile - called in scheduler things, and doing magic things at
> bootup), here is something that seems to alleviate the problem for me.
>
> I still get a lot of RCU  messages like "self-detected stall" etc, but
> that's to be expected. When the clock does odd things, crap *will*
> happen.
>
> But what this does is:
>
>  (a) make the error more visible as a clock error rather than various
> random downstream users
>
>  IOW, it prints things out when it looks like we're getting odd
> clock read errors (arbitrary cut-off: we expect clock read-outs to be
> withing 1/8th of the range of the expected clock value)

(Warning: I'm replying with my vacation goggles on)

A few thoughts from quickly looking at the patch (some of this is
repeating your comments here):

* So 1/8th of the interval seems way too short, as there's
clocksources like the ACP PM, which wrap every 2.5 seconds or so. And
even with more reasonable clocksource wrapping intervals, the tick
scheduler may not schedule the next tick till after that time, which
could cause major problems (we don't want the hrtimer expiration
calculation to get capped out here, since the timer may be scheduled
past 1/8th of the interval, which would keep us from ever accumulating
time and clearing the cycle_error added here)

* I suspect something closer to the clocksource_max_deferment() value
(which I think is max interval before multiplication overflows could
happen - ~12%) which we use in the scheduler would make more sense.
Especially since the timer scheduler uses that to calculate how long
we can idle for.

* Nulling out delta in timekeeping_get_ns() seems like it could cause
problems since time would then possibly go backwards compared to
previous reads (as you mentioned, resulting in smaller time jumps).
Instead it would probably make more sense to cap the delta at the
maximum value (though this assumes the clock doesn't jump back in the
interval before the next call to update_wall_time).

* Also, as you note, this would just cause the big time jump to only
happen at the next update, since there's no logic in
update_wall_time() to limit the jump. I'm not sure if "believing" the
large jump at write time make that much more sense, though.

* Finally, you're writing to error while only holding a read lock, but
that's sort of a minor thing.

I do agree something that is more helpful in validating the
timekeeping here would be nice to avoid further geese chases in the
future.

Some possible ideas:
* Checking the accumulation interval isn't beyond the
clocksource_max_deferment() value seems like a very good check to have
in update_wall_time().

* Maybe when we schedule the next timekeeping update, the tick
scheduler could store the expected time for that to fire, and then we
could validate that we're relatively close after that value when we do
accumulate time (warning if we're running too early or far too late -
though with virtualziation, defining a "reasonable" late value is
difficult).

* This "expected next tick" time could be used to try to cap read-time
intervals in a similar fashion as done here. (Of course, again, we'd
have to be careful, since if that expected next tick ends up somehow
being before the actual hrtimer expiration value, we could end up
stopping time - and the system).

I can try to add some of this when I'm back from holiday in the new year.

Maybe Thomas will have some other ideas?

thanks
-john
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Mon, Dec 22, 2014 at 2:57 PM, Dave Jones  wrote:
>
> I tried the nohpet thing for a few hours this morning and didn't see
> anything weird, but it may have been that I just didn't run long enough.
> When I saw your patch, I gave that a shot instead, with hpet enabled
> again.  Just got back to find lots of messages in dmesg, but none
> of the usual NMI/lockup messages.

Hmm. So my patch is a bit sloppy, and I suspect that the sloppiness
may account for some - and quite probably all - of those numbers.

Some of your numbers are pretty big (it's a 32-bit mask, so they are
all really just pretty small negative numbers), but they are still in
the 2us .. 165ms range when taking the 14MHz HPET counter into
account.  So not huge timer shifts.

And the sloppiness of the patch is two-fold:

One problem with my patch is that it does that "tkr->cycle_error"
without any locking (because it's running in various environmetns
where locking really doesn't work, and we can't even sanely disable
preemption because we might well be inside the scheduler etc.

So this:

  cycle_now = tkr->read(tkr->clock) + tkr->cycle_error;

  /* calculate the delta since the last update_wall_time: */
  delta = clocksource_delta(cycle_now, tkr->cycle_last, tkr->mask);

  /* Hmm? This is really not good, we're too close to overflowing */
  if (unlikely(delta > (tkr->mask >> 3))) {
  tkr->cycle_error = delta;
 delta = 0;
  }

might run concurrently on two CPU's, and then that tkr->cycle_error
access isn't serialized, so a "later" read of the HPET clock could end
up being written to tkr->cycle_error before. So that can account for
small errors: you'd have a "cycle_error" that gets updated on one CPU,
and then used to correct for an "earlier" read of the clock on another
CPU, and that could make the cycle error possibly worse.

However, that first race only matters if you get errors to begin with,
so if that was the only race, it would still show that some real error
happened.

BUT.

The *bigger* problem is that since the reading side cannot hold any
locks at all, it can also race against the writing side. That's by
design, and we will use the sequence counter to recover from it and
try again, but it means that some of those small errors are just a
reader racing with the wall-time update code, and since this error
code is done _inside_ the read-sequence code, it's not aware of the
retry, and will give a false positive even if we then later on throw
the known-bad result out and re-try.

So your small negative numbers are most likely just those false positives.

I was more hoping to see some big sudden jumps on the order of your
20-second delays - the kinds of jumps that your "tsc unstable"
messages implied (which weren't in the 2us .. 165ms range, but in the
2s to 250s range)

Ugh. I guess I'll have to try to figure out a non-sloppy thing, but
quite frankly, the non-sloppy things I tried first were rather painful
failures. The sloppy thing was sloppy, but worked well to see the
disaster case.

I'll put on my thinking cap. Maybe I can move the "cycle_error" logic
to outside the sequence lock retry loop.

But in the meantime please do keep that thing running as long as you
can. Let's see if we get bigger jumps. Or perhaps we'll get a negative
result - the original softlockup bug happening *without* any bigger
hpet jumps.

 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

2014-12-22 Thread Dave Jones

On Mon, Dec 22, 2014 at 11:47:37AM -0800, Linus Torvalds wrote:
 > And again: this is not trying to make the kernel clock not jump. There
 > is no way I can come up with even in theory to try to really *fix* a
 > fundamentally broken clock.
 > 
 > So this is not meant to be a real "fix" for anything, but is meant to
 > make sure that if the clock is unreliable, we pinpoint the clock
 > itself, and it mitigates the absolutely horrendously bad behavior we
 > currently with bad clocks. So think of this as debug and band-aid
 > rather than "this makes clocks magically reliable".
 > 
 > .. and we might still lock up under some circumstances. But at least
 > from my limited testing, it is infinitely much better, even if it
 > might not be perfect. Also note that my "testing" has been writing
 > zero to the HPET lock (so the HPET clock difference tends to be pretty
 > specific), while my next step is to see what happens when I write
 > random values (and a lot of them).
 > 
 > Since I expect that to cause more problems, I thought I'd send this
 > RFC out before I start killing my machine again ;)

I tried the nohpet thing for a few hours this morning and didn't see
anything weird, but it may have been that I just didn't run long enough.
When I saw your patch, I gave that a shot instead, with hpet enabled
again.  Just got back to find lots of messages in dmesg, but none
of the usual NMI/lockup messages.

[ 2256.430694] Clocksource tsc unstable (delta = -142018897190 ns)
[ 2256.437433] Switched to clocksource hpet
[ 2279.788605] Clocksource hpet had cycles off by 4294946559
[ 2280.191272] Clocksource hpet had cycles off by 4294905111
[ 2282.605990] Clocksource hpet had cycles off by 4294960721
[ 2284.485410] Clocksource hpet had cycles off by 4294953427
[ 2288.954844] Clocksource hpet had cycles off by 4294924880
[ 2305.202931] Clocksource hpet had cycles off by 4294960429
[ 2315.527247] Clocksource hpet had cycles off by 4294956296
[ 2318.954066] Clocksource hpet had cycles off by 4293673652
[ 2332.370923] Clocksource hpet had cycles off by 4294907221
[ 2332.739861] Clocksource hpet had cycles off by 4294919496
[ 2345.459694] Clocksource hpet had cycles off by 4294959592
[ 2346.159780] Clocksource hpet had cycles off by 4294952613
[ 2348.132071] Clocksource hpet had cycles off by 4294903415
[ 2348.207593] Clocksource hpet had cycles off by 4294966900
[ 2351.699779] Clocksource hpet had cycles off by 4294906755
[ 2354.125982] Clocksource hpet had cycles off by 4294941028
[ 2365.249438] Clocksource hpet had cycles off by 4294942458
[ 2370.247560] Clocksource hpet had cycles off by 4294927938
[ 2372.554642] Clocksource hpet had cycles off by 4294950723
[ 2377.361721] Clocksource hpet had cycles off by 4294952569
[ 2384.747820] Clocksource hpet had cycles off by 4294947263
[ 2389.133886] Clocksource hpet had cycles off by 4294967233
[ 2392.423458] Clocksource hpet had cycles off by 4294946214
[ 2397.648955] Clocksource hpet had cycles off by 4294967205
[ 2405.228015] Clocksource hpet had cycles off by 4294917938
[ 2429.571163] Clocksource hpet had cycles off by 4294957112
[ 2434.214788] Clocksource hpet had cycles off by 429482
[ 2438.686705] Clocksource hpet had cycles off by 4294945380
[ 2440.280478] Clocksource hpet had cycles off by 4294878090
[ 2458.370164] Clocksource hpet had cycles off by 4294875577
[ 2496.916971] Clocksource hpet had cycles off by 4294887574
[ 2516.314875] Clocksource hpet had cycles off by 4294899744
[ 2519.857221] Clocksource hpet had cycles off by 4294836752
[ 2522.696576] Clocksource hpet had cycles off by 4294965711
[ 2527.599967] Clocksource hpet had cycles off by 4294876467
[ 2528.573678] Clocksource hpet had cycles off by 4294815154
[ 2537.325296] Clocksource hpet had cycles off by 4294862624
[ 2542.296016] Clocksource hpet had cycles off by 4294954228
[ 2558.634123] Clocksource hpet had cycles off by 4294845883
[ 2560.804973] Clocksource hpet had cycles off by 4294958781
[ 2579.057030] Clocksource hpet had cycles off by 4294921012
[ 2588.139716] Clocksource hpet had cycles off by 4294950381
[ 2594.076877] Clocksource hpet had cycles off by 4294941777
[ 2597.645800] Clocksource hpet had cycles off by 4294927609
[ 2605.032338] Clocksource hpet had cycles off by 4294915823
[ 2605.239672] Clocksource hpet had cycles off by 4294952275
[ 2605.294230] Clocksource hpet had cycles off by 4294886603
[ 2609.801532] Clocksource hpet had cycles off by 4294887976
[ 2615.003674] Clocksource hpet had cycles off by 4294957202
[ 2641.039536] Clocksource hpet had cycles off by 4294943689
[ 2644.554947] Clocksource hpet had cycles off by 4294837076
[ 2648.576203] Clocksource hpet had cycles off by 4294928887
[ 2648.627249] Clocksource hpet had cycles off by 4294913656
[ 2680.465314] Clocksource hpet had cycles off by 4294963565
[ 2705.231925] Clocksource hpet had cycles off by 4294949762
[ 2708.181981] Clocksource hpet had cycles off by 4294924526
[ 2713.622343] Clocksource hpet had cycles off by

Re: frequent lockups in 3.18rc4

On Mon, Dec 22, 2014 at 11:47 AM, Linus Torvalds
 wrote:
>
> .. and we might still lock up under some circumstances. But at least
> from my limited testing, it is infinitely much better, even if it
> might not be perfect. Also note that my "testing" has been writing
> zero to the HPET lock (so the HPET clock difference tends to be pretty
> specific), while my next step is to see what happens when I write
> random values (and a lot of them).
>
> Since I expect that to cause more problems, I thought I'd send this
> RFC out before I start killing my machine again ;)

Ok, not horrible. Although I'd suggest not testing in a terminal
window while running X. The time jumping will confuse X input timing
and the screensaver, to the point that the machine may not be dead,
but it isn't exactly usable. Do it in a virtual console.

Again, making the limit tighter (one second?) and perhaps not trusting
insane values too much at walltime clock update time either, might
make it all work smoother still.

I did manage to confuse systemd with all the garbage the kernel
spewed, with a lot of stuff like:

   systemd-journald[779]: Failed to write entry (9 items, 276 bytes),
ignoring: Invalid argument

showing up in the logs, but I'm writing this without having had to
reboot the machine.

 Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Sun, Dec 21, 2014 at 4:41 PM, Linus Torvalds
 wrote:
>
> This is *not* to say that this is the bug you're hitting. But it does show 
> that
>
>  (a) a flaky HPET can do some seriously bad stuff
>  (b) the kernel is very fragile wrt time going backwards.
>
> and maybe we can use this test program to at least try to alleviate problem 
> (b).

Ok, so after several false starts (ktime_get() is really really
fragile - called in scheduler things, and doing magic things at
bootup), here is something that seems to alleviate the problem for me.

I still get a lot of RCU  messages like "self-detected stall" etc, but
that's to be expected. When the clock does odd things, crap *will*
happen.

But what this does is:

 (a) make the error more visible as a clock error rather than various
random downstream users

 IOW, it prints things out when it looks like we're getting odd
clock read errors (arbitrary cut-off: we expect clock read-outs to be
withing 1/8th of the range of the expected clock value)

 (b) try to alleviate the horrible things that happen when the clock
error is big

 The patch tries to "correct" for the huge time jump by basically
undoing it. We'll still see time jumps (there really is no way to
avoid it), but we limit the range of them.

With the attached patch, my machine seems to survive me writing to the
HPET master counter register. It spews warnings, and it is noisy about
the odd clock reads:

...
Clocksource hpet had cycles off by 642817751
Cutting it too close for hpet in in update_wall_time (offset = 4034102337)
INFO: rcu_sched self-detected stall on CPU { 0}  (t=281743 jiffies
g=4722 c=4721 q=14)
...

and there may still be situations where it does horrible horrible
things due to the (smaller) time leaps, but it does seem a lot more
robust.

NOTE! There's an (intentional) difference in how we handle the time
leaps at time read time vs write (wall-clock update).

At time read time, we just refuse to believe the big delta, and we set
the "cycle_error" value so that future time reads will be relative to
the error we just got. We also don't print anything out, because we're
possibly deep in the scheduler or in tracing, and we do not want to
spam the console about our fixup.

At time *write* time, we first report about any read-time errors, and
then we report (but believe in) overlarge clocksource delta errors as
we update the time.

This seems to be the best way to limit the damage.

Also note that the patch is entirely clock-agnostic. It's just that I
can trivially screw up my HPET, I didn't look at other clocks.

One note: my current limit of clocksource delta errors is based on the
range of the clock (1/8th of the range). I actually think that's
bogus, and it should instead be based on the expected frequency of the
clock (ie "we are guaranteed to update the wall clock at least once
every second, so if the clock source delta read is larger than one
second, we've done something wrong"). So this patch is meant very much
as an RFC, rather than anything else. It's pretty hacky. But it does
actually make a huge difference for me wrt the "mess up HPET time on
purpose". That used to crash my machine pretty hard, and pretty
reliably. With this patch, I've done it ten+ times, and while it spews
a lot of garbage, the machine stays up and _works_.

Making the sanity check tighter (ie the "one second" band rather than
"1/8th of the clock range") would probably just improve it further.

Thomas, what do you think? Hate it? Any better ideas?

And again: this is not trying to make the kernel clock not jump. There
is no way I can come up with even in theory to try to really *fix* a
fundamentally broken clock.

So this is not meant to be a real "fix" for anything, but is meant to
make sure that if the clock is unreliable, we pinpoint the clock
itself, and it mitigates the absolutely horrendously bad behavior we
currently with bad clocks. So think of this as debug and band-aid
rather than "this makes clocks magically reliable".

.. and we might still lock up under some circumstances. But at least
from my limited testing, it is infinitely much better, even if it
might not be perfect. Also note that my "testing" has been writing
zero to the HPET lock (so the HPET clock difference tends to be pretty
specific), while my next step is to see what happens when I write
random values (and a lot of them).

Since I expect that to cause more problems, I thought I'd send this
RFC out before I start killing my machine again ;)

 Linus
 include/linux/timekeeper_internal.h |  1 +
 kernel/time/timekeeping.c   | 25 -
 2 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/include/linux/timekeeper_internal.h 
b/include/linux/timekeeper_internal.h
index 05af9a334893..0fcb60d77079 100644
--- a/include/linux/timekeeper_internal.h
+++ b/include/linux/timekeeper_internal.h
@@ -32,6 +32,7 @@ struct tk_read_base {
cycle_t

Re: frequent lockups in 3.18rc4

On Sun, Dec 21, 2014 at 4:41 PM, Linus Torvalds
torva...@linux-foundation.org wrote:

 This is *not* to say that this is the bug you're hitting. But it does show 
 that

  (a) a flaky HPET can do some seriously bad stuff
  (b) the kernel is very fragile wrt time going backwards.

 and maybe we can use this test program to at least try to alleviate problem 
 (b).

Ok, so after several false starts (ktime_get() is really really
fragile - called in scheduler things, and doing magic things at
bootup), here is something that seems to alleviate the problem for me.

I still get a lot of RCU  messages like self-detected stall etc, but
that's to be expected. When the clock does odd things, crap *will*
happen.

But what this does is:

 (a) make the error more visible as a clock error rather than various
random downstream users

 IOW, it prints things out when it looks like we're getting odd
clock read errors (arbitrary cut-off: we expect clock read-outs to be
withing 1/8th of the range of the expected clock value)

 (b) try to alleviate the horrible things that happen when the clock
error is big

 The patch tries to correct for the huge time jump by basically
undoing it. We'll still see time jumps (there really is no way to
avoid it), but we limit the range of them.

With the attached patch, my machine seems to survive me writing to the
HPET master counter register. It spews warnings, and it is noisy about
the odd clock reads:

...
Clocksource hpet had cycles off by 642817751
Cutting it too close for hpet in in update_wall_time (offset = 4034102337)
INFO: rcu_sched self-detected stall on CPU { 0}  (t=281743 jiffies
g=4722 c=4721 q=14)
...

and there may still be situations where it does horrible horrible
things due to the (smaller) time leaps, but it does seem a lot more
robust.

NOTE! There's an (intentional) difference in how we handle the time
leaps at time read time vs write (wall-clock update).

At time read time, we just refuse to believe the big delta, and we set
the cycle_error value so that future time reads will be relative to
the error we just got. We also don't print anything out, because we're
possibly deep in the scheduler or in tracing, and we do not want to
spam the console about our fixup.

At time *write* time, we first report about any read-time errors, and
then we report (but believe in) overlarge clocksource delta errors as
we update the time.

This seems to be the best way to limit the damage.

Also note that the patch is entirely clock-agnostic. It's just that I
can trivially screw up my HPET, I didn't look at other clocks.

One note: my current limit of clocksource delta errors is based on the
range of the clock (1/8th of the range). I actually think that's
bogus, and it should instead be based on the expected frequency of the
clock (ie we are guaranteed to update the wall clock at least once
every second, so if the clock source delta read is larger than one
second, we've done something wrong). So this patch is meant very much
as an RFC, rather than anything else. It's pretty hacky. But it does
actually make a huge difference for me wrt the mess up HPET time on
purpose. That used to crash my machine pretty hard, and pretty
reliably. With this patch, I've done it ten+ times, and while it spews
a lot of garbage, the machine stays up and _works_.

Making the sanity check tighter (ie the one second band rather than
1/8th of the clock range) would probably just improve it further.

Thomas, what do you think? Hate it? Any better ideas?

And again: this is not trying to make the kernel clock not jump. There
is no way I can come up with even in theory to try to really *fix* a
fundamentally broken clock.

So this is not meant to be a real fix for anything, but is meant to
make sure that if the clock is unreliable, we pinpoint the clock
itself, and it mitigates the absolutely horrendously bad behavior we
currently with bad clocks. So think of this as debug and band-aid
rather than this makes clocks magically reliable.

.. and we might still lock up under some circumstances. But at least
from my limited testing, it is infinitely much better, even if it
might not be perfect. Also note that my testing has been writing
zero to the HPET lock (so the HPET clock difference tends to be pretty
specific), while my next step is to see what happens when I write
random values (and a lot of them).

Since I expect that to cause more problems, I thought I'd send this
RFC out before I start killing my machine again ;)

 Linus
 include/linux/timekeeper_internal.h |  1 +
 kernel/time/timekeeping.c   | 25 -
 2 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/include/linux/timekeeper_internal.h 
b/include/linux/timekeeper_internal.h
index 05af9a334893..0fcb60d77079 100644
--- a/include/linux/timekeeper_internal.h
+++ b/include/linux/timekeeper_internal.h
@@ -32,6 +32,7 @@ struct tk_read_base {
cycle_t

Re: frequent lockups in 3.18rc4

On Mon, Dec 22, 2014 at 11:47 AM, Linus Torvalds
torva...@linux-foundation.org wrote:

 .. and we might still lock up under some circumstances. But at least
 from my limited testing, it is infinitely much better, even if it
 might not be perfect. Also note that my testing has been writing
 zero to the HPET lock (so the HPET clock difference tends to be pretty
 specific), while my next step is to see what happens when I write
 random values (and a lot of them).

 Since I expect that to cause more problems, I thought I'd send this
 RFC out before I start killing my machine again ;)

Ok, not horrible. Although I'd suggest not testing in a terminal
window while running X. The time jumping will confuse X input timing
and the screensaver, to the point that the machine may not be dead,
but it isn't exactly usable. Do it in a virtual console.

Again, making the limit tighter (one second?) and perhaps not trusting
insane values too much at walltime clock update time either, might
make it all work smoother still.

I did manage to confuse systemd with all the garbage the kernel
spewed, with a lot of stuff like:

   systemd-journald[779]: Failed to write entry (9 items, 276 bytes),
ignoring: Invalid argument

showing up in the logs, but I'm writing this without having had to
reboot the machine.

 Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

2014-12-22 Thread Dave Jones

On Mon, Dec 22, 2014 at 11:47:37AM -0800, Linus Torvalds wrote:
  And again: this is not trying to make the kernel clock not jump. There
  is no way I can come up with even in theory to try to really *fix* a
  fundamentally broken clock.
  
  So this is not meant to be a real fix for anything, but is meant to
  make sure that if the clock is unreliable, we pinpoint the clock
  itself, and it mitigates the absolutely horrendously bad behavior we
  currently with bad clocks. So think of this as debug and band-aid
  rather than this makes clocks magically reliable.
  
  .. and we might still lock up under some circumstances. But at least
  from my limited testing, it is infinitely much better, even if it
  might not be perfect. Also note that my testing has been writing
  zero to the HPET lock (so the HPET clock difference tends to be pretty
  specific), while my next step is to see what happens when I write
  random values (and a lot of them).
  
  Since I expect that to cause more problems, I thought I'd send this
  RFC out before I start killing my machine again ;)

I tried the nohpet thing for a few hours this morning and didn't see
anything weird, but it may have been that I just didn't run long enough.
When I saw your patch, I gave that a shot instead, with hpet enabled
again.  Just got back to find lots of messages in dmesg, but none
of the usual NMI/lockup messages.

[ 2256.430694] Clocksource tsc unstable (delta = -142018897190 ns)
[ 2256.437433] Switched to clocksource hpet
[ 2279.788605] Clocksource hpet had cycles off by 4294946559
[ 2280.191272] Clocksource hpet had cycles off by 4294905111
[ 2282.605990] Clocksource hpet had cycles off by 4294960721
[ 2284.485410] Clocksource hpet had cycles off by 4294953427
[ 2288.954844] Clocksource hpet had cycles off by 4294924880
[ 2305.202931] Clocksource hpet had cycles off by 4294960429
[ 2315.527247] Clocksource hpet had cycles off by 4294956296
[ 2318.954066] Clocksource hpet had cycles off by 4293673652
[ 2332.370923] Clocksource hpet had cycles off by 4294907221
[ 2332.739861] Clocksource hpet had cycles off by 4294919496
[ 2345.459694] Clocksource hpet had cycles off by 4294959592
[ 2346.159780] Clocksource hpet had cycles off by 4294952613
[ 2348.132071] Clocksource hpet had cycles off by 4294903415
[ 2348.207593] Clocksource hpet had cycles off by 4294966900
[ 2351.699779] Clocksource hpet had cycles off by 4294906755
[ 2354.125982] Clocksource hpet had cycles off by 4294941028
[ 2365.249438] Clocksource hpet had cycles off by 4294942458
[ 2370.247560] Clocksource hpet had cycles off by 4294927938
[ 2372.554642] Clocksource hpet had cycles off by 4294950723
[ 2377.361721] Clocksource hpet had cycles off by 4294952569
[ 2384.747820] Clocksource hpet had cycles off by 4294947263
[ 2389.133886] Clocksource hpet had cycles off by 4294967233
[ 2392.423458] Clocksource hpet had cycles off by 4294946214
[ 2397.648955] Clocksource hpet had cycles off by 4294967205
[ 2405.228015] Clocksource hpet had cycles off by 4294917938
[ 2429.571163] Clocksource hpet had cycles off by 4294957112
[ 2434.214788] Clocksource hpet had cycles off by 429482
[ 2438.686705] Clocksource hpet had cycles off by 4294945380
[ 2440.280478] Clocksource hpet had cycles off by 4294878090
[ 2458.370164] Clocksource hpet had cycles off by 4294875577
[ 2496.916971] Clocksource hpet had cycles off by 4294887574
[ 2516.314875] Clocksource hpet had cycles off by 4294899744
[ 2519.857221] Clocksource hpet had cycles off by 4294836752
[ 2522.696576] Clocksource hpet had cycles off by 4294965711
[ 2527.599967] Clocksource hpet had cycles off by 4294876467
[ 2528.573678] Clocksource hpet had cycles off by 4294815154
[ 2537.325296] Clocksource hpet had cycles off by 4294862624
[ 2542.296016] Clocksource hpet had cycles off by 4294954228
[ 2558.634123] Clocksource hpet had cycles off by 4294845883
[ 2560.804973] Clocksource hpet had cycles off by 4294958781
[ 2579.057030] Clocksource hpet had cycles off by 4294921012
[ 2588.139716] Clocksource hpet had cycles off by 4294950381
[ 2594.076877] Clocksource hpet had cycles off by 4294941777
[ 2597.645800] Clocksource hpet had cycles off by 4294927609
[ 2605.032338] Clocksource hpet had cycles off by 4294915823
[ 2605.239672] Clocksource hpet had cycles off by 4294952275
[ 2605.294230] Clocksource hpet had cycles off by 4294886603
[ 2609.801532] Clocksource hpet had cycles off by 4294887976
[ 2615.003674] Clocksource hpet had cycles off by 4294957202
[ 2641.039536] Clocksource hpet had cycles off by 4294943689
[ 2644.554947] Clocksource hpet had cycles off by 4294837076
[ 2648.576203] Clocksource hpet had cycles off by 4294928887
[ 2648.627249] Clocksource hpet had cycles off by 4294913656
[ 2680.465314] Clocksource hpet had cycles off by 4294963565
[ 2705.231925] Clocksource hpet had cycles off by 4294949762
[ 2708.181981] Clocksource hpet had cycles off by 4294924526
[ 2713.622343] Clocksource hpet had cycles off by 4294874217
[ 2714.725619]

Re: frequent lockups in 3.18rc4

On Mon, Dec 22, 2014 at 2:57 PM, Dave Jones da...@codemonkey.org.uk wrote:

 I tried the nohpet thing for a few hours this morning and didn't see
 anything weird, but it may have been that I just didn't run long enough.
 When I saw your patch, I gave that a shot instead, with hpet enabled
 again.  Just got back to find lots of messages in dmesg, but none
 of the usual NMI/lockup messages.

Hmm. So my patch is a bit sloppy, and I suspect that the sloppiness
may account for some - and quite probably all - of those numbers.

Some of your numbers are pretty big (it's a 32-bit mask, so they are
all really just pretty small negative numbers), but they are still in
the 2us .. 165ms range when taking the 14MHz HPET counter into
account.  So not huge timer shifts.

And the sloppiness of the patch is two-fold:

One problem with my patch is that it does that tkr-cycle_error
without any locking (because it's running in various environmetns
where locking really doesn't work, and we can't even sanely disable
preemption because we might well be inside the scheduler etc.

So this:

  cycle_now = tkr-read(tkr-clock) + tkr-cycle_error;

  /* calculate the delta since the last update_wall_time: */
  delta = clocksource_delta(cycle_now, tkr-cycle_last, tkr-mask);

  /* Hmm? This is really not good, we're too close to overflowing */
  if (unlikely(delta  (tkr-mask  3))) {
  tkr-cycle_error = delta;
 delta = 0;
  }

might run concurrently on two CPU's, and then that tkr-cycle_error
access isn't serialized, so a later read of the HPET clock could end
up being written to tkr-cycle_error before. So that can account for
small errors: you'd have a cycle_error that gets updated on one CPU,
and then used to correct for an earlier read of the clock on another
CPU, and that could make the cycle error possibly worse.

However, that first race only matters if you get errors to begin with,
so if that was the only race, it would still show that some real error
happened.

BUT.

The *bigger* problem is that since the reading side cannot hold any
locks at all, it can also race against the writing side. That's by
design, and we will use the sequence counter to recover from it and
try again, but it means that some of those small errors are just a
reader racing with the wall-time update code, and since this error
code is done _inside_ the read-sequence code, it's not aware of the
retry, and will give a false positive even if we then later on throw
the known-bad result out and re-try.

So your small negative numbers are most likely just those false positives.

I was more hoping to see some big sudden jumps on the order of your
20-second delays - the kinds of jumps that your tsc unstable
messages implied (which weren't in the 2us .. 165ms range, but in the
2s to 250s range)

Ugh. I guess I'll have to try to figure out a non-sloppy thing, but
quite frankly, the non-sloppy things I tried first were rather painful
failures. The sloppy thing was sloppy, but worked well to see the
disaster case.

I'll put on my thinking cap. Maybe I can move the cycle_error logic
to outside the sequence lock retry loop.

But in the meantime please do keep that thing running as long as you
can. Let's see if we get bigger jumps. Or perhaps we'll get a negative
result - the original softlockup bug happening *without* any bigger
hpet jumps.

 Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

2014-12-22 Thread John Stultz

On Mon, Dec 22, 2014 at 11:47 AM, Linus Torvalds
torva...@linux-foundation.org wrote:
 On Sun, Dec 21, 2014 at 4:41 PM, Linus Torvalds
 torva...@linux-foundation.org wrote:

 This is *not* to say that this is the bug you're hitting. But it does show 
 that

  (a) a flaky HPET can do some seriously bad stuff
  (b) the kernel is very fragile wrt time going backwards.

 and maybe we can use this test program to at least try to alleviate problem 
 (b).

 Ok, so after several false starts (ktime_get() is really really
 fragile - called in scheduler things, and doing magic things at
 bootup), here is something that seems to alleviate the problem for me.

 I still get a lot of RCU  messages like self-detected stall etc, but
 that's to be expected. When the clock does odd things, crap *will*
 happen.

 But what this does is:

  (a) make the error more visible as a clock error rather than various
 random downstream users

  IOW, it prints things out when it looks like we're getting odd
 clock read errors (arbitrary cut-off: we expect clock read-outs to be
 withing 1/8th of the range of the expected clock value)

(Warning: I'm replying with my vacation goggles on)

A few thoughts from quickly looking at the patch (some of this is
repeating your comments here):

* So 1/8th of the interval seems way too short, as there's
clocksources like the ACP PM, which wrap every 2.5 seconds or so. And
even with more reasonable clocksource wrapping intervals, the tick
scheduler may not schedule the next tick till after that time, which
could cause major problems (we don't want the hrtimer expiration
calculation to get capped out here, since the timer may be scheduled
past 1/8th of the interval, which would keep us from ever accumulating
time and clearing the cycle_error added here)

* I suspect something closer to the clocksource_max_deferment() value
(which I think is max interval before multiplication overflows could
happen - ~12%) which we use in the scheduler would make more sense.
Especially since the timer scheduler uses that to calculate how long
we can idle for.

* Nulling out delta in timekeeping_get_ns() seems like it could cause
problems since time would then possibly go backwards compared to
previous reads (as you mentioned, resulting in smaller time jumps).
Instead it would probably make more sense to cap the delta at the
maximum value (though this assumes the clock doesn't jump back in the
interval before the next call to update_wall_time).

* Also, as you note, this would just cause the big time jump to only
happen at the next update, since there's no logic in
update_wall_time() to limit the jump. I'm not sure if believing the
large jump at write time make that much more sense, though.

* Finally, you're writing to error while only holding a read lock, but
that's sort of a minor thing.

I do agree something that is more helpful in validating the
timekeeping here would be nice to avoid further geese chases in the
future.

Some possible ideas:
* Checking the accumulation interval isn't beyond the
clocksource_max_deferment() value seems like a very good check to have
in update_wall_time().

* Maybe when we schedule the next timekeeping update, the tick
scheduler could store the expected time for that to fire, and then we
could validate that we're relatively close after that value when we do
accumulate time (warning if we're running too early or far too late -
though with virtualziation, defining a reasonable late value is
difficult).

* This expected next tick time could be used to try to cap read-time
intervals in a similar fashion as done here. (Of course, again, we'd
have to be careful, since if that expected next tick ends up somehow
being before the actual hrtimer expiration value, we could end up
stopping time - and the system).

I can try to add some of this when I'm back from holiday in the new year.

Maybe Thomas will have some other ideas?

thanks
-john
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Mon, Dec 22, 2014 at 3:59 PM, John Stultz john.stu...@linaro.org wrote:

 * So 1/8th of the interval seems way too short, as there's
 clocksources like the ACP PM, which wrap every 2.5 seconds or so.

Ugh. At the same time, 1/8th of a range is actually bigger than I'd
like, since if there is some timer corruption, it means that we only
catch it when it's in really big.

But as I said, I'd actually prefer it to be time-based, because it
would be good if this approach worked on things like the TSC which is
a 64-bit counter..

So yes, that capping was very much arbitrary, and was mostly a case of
this works with the one timer source that I can easily trigger

 * I suspect something closer to the clocksource_max_deferment() value
 (which I think is max interval before multiplication overflows could
 happen - ~12%) which we use in the scheduler would make more sense.
 Especially since the timer scheduler uses that to calculate how long
 we can idle for.

I'd rather not be anywhere *close* to any overflow problems. Even for
the scheduler all-idle case, I'd argue that there is rather quickly
diminishing returns. Yes, a thousand timer interrupts per second are
expensive and a noticeable power draw. The difference between one
timer interrupt every two seconds and every 20 seconds is rather
less noticeable.

Of course, reasonable clock sources have *much* longer periods than a
second (yeah, the acpi pm timer really isn't a good one), so there are
probably good middle grounds, The 1/8th was a hack, and one that was
aware of teh 300s cycle of the HPET at that..

 * Nulling out delta in timekeeping_get_ns() seems like it could cause
 problems since time would then possibly go backwards compared to
 previous reads (as you mentioned, resulting in smaller time jumps).
 Instead it would probably make more sense to cap the delta at the
 maximum value (though this assumes the clock doesn't jump back in the
 interval before the next call to update_wall_time).

So part of the nulling was that it was simpler, and part of it was
that I expected to get backwards jumps (see the other email to Dave
about the inherent races). And with the whole timer mask modulo
arithmetic, those backwards jumps just look like biggish positive
numbers, not even negative. So it ends up being things like is it an
unsigned number larger than half the mask? Consider it negative etc.

The zero it out was simple, and it worked for my test-case, which
was ok, my machine no longer locks up when I mess with the timer.

And I didn't post the earlier versions of that patch that didn't even *boot*.

I started out trying to do it at a higher level (not on a clock read
level, but outside the whole 'convert-to-ns and do the sequence  value
check'), but during bootup we play a lot of games with initializing
the timer sources etc.

So that explains the approach of doing it at that

   cycle_now = tkr-read(tkr-clock);

level, and keeping it very low-level.

But as I already explained in the email that crossed, that low-level
thing also results in some fundamental races.

 * Also, as you note, this would just cause the big time jump to only
 happen at the next update, since there's no logic in
 update_wall_time() to limit the jump. I'm not sure if believing the
 large jump at write time make that much more sense, though.

So I considered just capping it there (to a single interval or
something). Again, just ignoring - like the read side does - it would
have been easier, but at the same time I *really* wanted to make time
go forward, so just taking the big value seemed safest.

But yes. this was very much a RFC patch. It's not even ready for real
use, as DaveJ found out (although it might be good enough in practice,
despite its flaws)

 * Finally, you're writing to error while only holding a read lock, but
 that's sort of a minor thing.

It's not a minor thing, but the alternatives looked worse.

I really wanted to make it per-cpu, and do this with interrupts
disabled or something. But that then pushes a big problem to the write
time to go over all cpu's and see if there are errors.

So it's not right. But .. It's a hacky patch to get discussion
started, and it's actually hard to do right when this code has to be
basically lockless.

 * Checking the accumulation interval isn't beyond the
 clocksource_max_deferment() value seems like a very good check to have
 in update_wall_time().

Sounds like a good idea. Also, quite frankly, reading all the code I
wasn't ever really able to figure out that things don't overflow. The
overflow protection is a bit ad-hoc (that maxshift thing in
update_wall_time() really makes baby Jesus cry, despite the season,
and it wasn't at all obvious that ntp_tick_length() is fundamentally
bigger than xtime_interval, for example).

It's also not clear that the complicated and frankly not-very-obvious
shift-loop is any faster than just using a divide - possibly with the
single interval case being a special case to avoid dividing then.

I was a

Re: frequent lockups in 3.18rc4

2014-12-21 Thread Paul E. McKenney

On Sun, Dec 21, 2014 at 04:52:28PM -0800, Linus Torvalds wrote:
> On Sun, Dec 21, 2014 at 4:41 PM, Linus Torvalds
>  wrote:
> >
> > The second time (or third, or fourth - it might not take immediately)
> > you get a lockup or similar. Bad things happen.
> 
> I've only tested it twice now, but the first time I got a weird
> lockup-like thing (things *kind* of worked, but I could imagine that
> one CPU was stuck with a lock held, because things eventually ground
> to a screeching halt.
> 
> The second time I got
> 
>   INFO: rcu_sched self-detected stall on CPU { 5}  (t=84533 jiffies
> g=11971 c=11970 q=17)
> 
> and then
> 
>INFO: rcu_sched detected stalls on CPUs/tasks: { 1 2 3 4 5 6 7}
> (detected by 0, t=291309 jiffies, g=12031, c=12030, q=57)
> 
> with backtraces that made no sense (because obviously no actual stall
> had taken place), and were the CPU's mostly being idle.

Yep, if time gets messed up too much, RCU can incorrectly decide that
21 seconds have elapsed since the grace period started, and can even
decide this pretty much immediately after the grace period starts.

Thanx, Paul

> I could easily see it resulting in your softlockup scenario too.
> 
>   Linus
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Sun, Dec 21, 2014 at 04:52:28PM -0800, Linus Torvalds wrote:
 > > The second time (or third, or fourth - it might not take immediately)
 > > you get a lockup or similar. Bad things happen.
 > 
 > I've only tested it twice now, but the first time I got a weird
 > lockup-like thing (things *kind* of worked, but I could imagine that
 > one CPU was stuck with a lock held, because things eventually ground
 > to a screeching halt.
 > 
 > The second time I got
 > 
 >   INFO: rcu_sched self-detected stall on CPU { 5}  (t=84533 jiffies
 > g=11971 c=11970 q=17)
 > 
 > and then
 > 
 >INFO: rcu_sched detected stalls on CPUs/tasks: { 1 2 3 4 5 6 7}
 > (detected by 0, t=291309 jiffies, g=12031, c=12030, q=57)
 > 
 > with backtraces that made no sense (because obviously no actual stall
 > had taken place), and were the CPU's mostly being idle.
 > 
 > I could easily see it resulting in your softlockup scenario too.

So something trinity does when it doesn't have a better idea of
something to pass a syscall is to generate a random number.

A wild hypothesis could be that we're in one of these situations,
and we randomly generated 0xfed000f0 and passed that as a value to
a syscall, and the kernel wrote 0 to that address.

What syscall could do that, and not just fail a access_ok() or similar
is a mystery though.

Dave

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Sun, Dec 21, 2014 at 4:41 PM, Linus Torvalds
 wrote:
>
> The second time (or third, or fourth - it might not take immediately)
> you get a lockup or similar. Bad things happen.

I've only tested it twice now, but the first time I got a weird
lockup-like thing (things *kind* of worked, but I could imagine that
one CPU was stuck with a lock held, because things eventually ground
to a screeching halt.

The second time I got

  INFO: rcu_sched self-detected stall on CPU { 5}  (t=84533 jiffies
g=11971 c=11970 q=17)

and then

   INFO: rcu_sched detected stalls on CPUs/tasks: { 1 2 3 4 5 6 7}
(detected by 0, t=291309 jiffies, g=12031, c=12030, q=57)

with backtraces that made no sense (because obviously no actual stall
had taken place), and were the CPU's mostly being idle.

I could easily see it resulting in your softlockup scenario too.

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Sun, Dec 21, 2014 at 3:58 PM, Linus Torvalds
 wrote:
>
> I can do the mmap(/dev/mem) thing and access the HPET by hand, and
> when I write zero to it I immediately get something like this:
>
>   Clocksource tsc unstable (delta = -284317725450 ns)
>   Switched to clocksource hpet
>
> just to confirm that yes, a jump in the HPET counter would indeed give
> those kinds of symptoms:blaming the TSC with a negative delta in the
> 0-300s range, even though it's the HPET that is broken.
>
> And if the HPET then occasionally jumps around afterwards, it would
> show up as ktime_get() occasionally going backwards, which in turn
> would - as far as I can tell - result in exactly that pseudo-infirnite
> loop with timers.

Ok, so I tried that too.

It's actually a pretty easy experiment to do: just mmap(/dev/mem) at
the HPET offset (the kernel prints it out at boot, it should normally
be at 0xfed0). And then just write a zero to offset 0xf0, which is
the main counter.

The first time, you get the "Clocksource tsc unstable".

The second time (or third, or fourth - it might not take immediately)
you get a lockup or similar. Bad things happen.

This is *not* to say that this is the bug you're hitting. But it does show that

 (a) a flaky HPET can do some seriously bad stuff
 (b) the kernel is very fragile wrt time going backwards.

and maybe we can use this test program to at least try to alleviate problem (b).

Trivial HPET mess-up program attached.

Linus
#include 
#include 
#include 
#include 
#include 

int main(int argc, char **argv)
{
	int fd = open("/dev/mem", O_RDWR);
	void *base;

	if (fd < 0) {
		fputs("Unable to open /dev/mem\n", stderr);
		return -1;
	}
	base = mmap(NULL, 4096 ,PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0xfed0);
	if ((long)base == -1) {
		fputs("Unable to mmap HPET\n", stderr);
		return -1;
	}
	*(unsigned long *) (base+0xf0) = 0;
	return 0;
}

Re: frequent lockups in 3.18rc4

On Sun, Dec 21, 2014 at 2:32 PM, Dave Jones  wrote:
> On Sun, Dec 21, 2014 at 02:19:03PM -0800, Linus Torvalds wrote:
> >
>  > And finally, and stupidly, is there any chance that you have anything
>  > accessing /dev/hpet?
>
> Not knowingly at least, but who the hell knows what systemd has its
> fingers in these days.

Actually, it looks like /dev/hpet doesn't allow write access.

I can do the mmap(/dev/mem) thing and access the HPET by hand, and
when I write zero to it I immediately get something like this:

  Clocksource tsc unstable (delta = -284317725450 ns)
  Switched to clocksource hpet

just to confirm that yes, a jump in the HPET counter would indeed give
those kinds of symptoms:blaming the TSC with a negative delta in the
0-300s range, even though it's the HPET that is broken.

And if the HPET then occasionally jumps around afterwards, it would
show up as ktime_get() occasionally going backwards, which in turn
would - as far as I can tell - result in exactly that pseudo-infirnite
loop with timers.

Anyway, any wild kernel pointer access *could* happen to just hit the
HPET and write to the main counter value, although I'd personally be
more inclined to blame BIOS/SMM kind of code playing tricks with
time.. We do have a few places where we explicitly write the value on
purpose, but they are in the HPET init code, and in the clocksource
resume code, so they should not be involved.

Thomas - have you had reports of HPET breakage in RT circles, the same
way BIOSes have been tinkering with TSC?

Also, would it perhaps be a good idea to make "ktime_get()" save the
last time in a percpu variable, and warn if time ever goes backwards
on a particular CPU?  A percpu thing should be pretty cheap, even if
we write to it every time somebody asks for time..

   Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Sun, Dec 21, 2014 at 02:19:03PM -0800, Linus Torvalds wrote:

 > > So the range of 1-251 seconds  is not entirely random. It's all in
 > > that "32-bit HPET range".
 > 
 > DaveJ, I assume it's too late now, and you don't effectively have any
 > access to the machine any more, but "hpet=disable" or "nohpet" on the
 > command line might be worth trying if you ever see that piece of
 > hardware again.

I can give it a try tomorrow. I'm probably saying goodbye to that
machine on Tuesday, so we'll have 24hrs of testing at least.

 > And for posterity, do you have a dmidecode with motherboard/BIOS
 > information for the problematic machine? And your configuration?

I can grab that in the morning too.

 > And finally, and stupidly, is there any chance that you have anything
 > accessing /dev/hpet?

Not knowingly at least, but who the hell knows what systemd has its
fingers in these days.

Dave

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Sun, Dec 21, 2014 at 1:22 PM, Linus Torvalds
 wrote:
>
> So the range of 1-251 seconds  is not entirely random. It's all in
> that "32-bit HPET range".

DaveJ, I assume it's too late now, and you don't effectively have any
access to the machine any more, but "hpet=disable" or "nohpet" on the
command line might be worth trying if you ever see that piece of
hardware again.

And for posterity, do you have a dmidecode with motherboard/BIOS
information for the problematic machine? And your configuration?

And finally, and stupidly, is there any chance that you have anything
accessing /dev/hpet?

   Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Sat, Dec 20, 2014 at 1:16 PM, Linus Torvalds
 wrote:
>
> Hmm, ok, I've re-acquainted myself with it. And I have to admit that I
> can't see anything wrong. The whole "update_wall_clock" and the shadow
> timekeeping state is confusing as hell, but seems fine. We'd have to
> avoid update_wall_clock for a *long* time for overflows to occur.
>
> And the overflow in 32 bits isn't that special, since the only thing
> that really matters is the overflow of "cycle_now - tkr->cycle_last"
> within the mask.
>
> So I'm not seeing anything even halfway suspicious.

.. of course, this reminds me of the "clocksource TSC unstable" issue.

Then *simple* solution may actually be that the HPET itself is
buggered. That would explain both the "clocksource TSC unstable"
messages _and_ the "time went backwards, so now we're re-arming the
scheduler tick 'forever' until time has gone forwards again".

And googling for this actually shows other people seeing similar
issues, including hangs after switching to another clocksource. See
for example

http://stackoverflow.com/questions/13796944/system-hang-with-possible-relevance-to-clocksource-tsc-unstable

which switches to acpi_pm (not HPET) and then hangs afterwards.

Of course, it may be the switching itself that causes some issue.

Btw, there's another reason to think that it's the HPET, I just realized.

DaveJ posted all his odd TSC unstable things, and the delta was pretty
damn random. But it did have a range: it was in the 1-251 second
range.

With a 14.318MHz clock (which is, I think, the normal HPET frequency),
a 32-bit overflow happens in about 300 seconds.

So the range of 1-251 seconds  is not entirely random. It's all in
that "32-bit HPET range".

In contrast, wrt the TSC frequency, that kind of range makes no sense at all.

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Sat, Dec 20, 2014 at 1:16 PM, Linus Torvalds
torva...@linux-foundation.org wrote:

Hmm, ok, I've re-acquainted myself with it. And I have to admit that I
can't see anything wrong. The whole update_wall_clock and the shadow
timekeeping state is confusing as hell, but seems fine. We'd have to
avoid update_wall_clock for a *long* time for overflows to occur.

And the overflow in 32 bits isn't that special, since the only thing
that really matters is the overflow of cycle_now - tkr-cycle_last
within the mask.

So I'm not seeing anything even halfway suspicious.

.. of course, this reminds me of the clocksource TSC unstable issue.

Then *simple* solution may actually be that the HPET itself is
buggered. That would explain both the clocksource TSC unstable
messages _and_ the time went backwards, so now we're re-arming the
scheduler tick 'forever' until time has gone forwards again.

And googling for this actually shows other people seeing similar
issues, including hangs after switching to another clocksource. See
for example

http://stackoverflow.com/questions/13796944/system-hang-with-possible-relevance-to-clocksource-tsc-unstable

which switches to acpi_pm (not HPET) and then hangs afterwards.

Of course, it may be the switching itself that causes some issue.

Btw, there's another reason to think that it's the HPET, I just realized.

DaveJ posted all his odd TSC unstable things, and the delta was pretty
damn random. But it did have a range: it was in the 1-251 second
range.

With a 14.318MHz clock (which is, I think, the normal HPET frequency),
a 32-bit overflow happens in about 300 seconds.

So the range of 1-251 seconds is not entirely random. It's all in
that 32-bit HPET range.

In contrast, wrt the TSC frequency, that kind of range makes no sense at all.

Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Sun, Dec 21, 2014 at 1:22 PM, Linus Torvalds
torva...@linux-foundation.org wrote:

 So the range of 1-251 seconds  is not entirely random. It's all in
 that 32-bit HPET range.

DaveJ, I assume it's too late now, and you don't effectively have any
access to the machine any more, but hpet=disable or nohpet on the
command line might be worth trying if you ever see that piece of
hardware again.

And for posterity, do you have a dmidecode with motherboard/BIOS
information for the problematic machine? And your configuration?

And finally, and stupidly, is there any chance that you have anything
accessing /dev/hpet?

   Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Sun, Dec 21, 2014 at 02:19:03PM -0800, Linus Torvalds wrote:

   So the range of 1-251 seconds  is not entirely random. It's all in
   that 32-bit HPET range.
  
  DaveJ, I assume it's too late now, and you don't effectively have any
  access to the machine any more, but hpet=disable or nohpet on the
  command line might be worth trying if you ever see that piece of
  hardware again.

I can give it a try tomorrow. I'm probably saying goodbye to that
machine on Tuesday, so we'll have 24hrs of testing at least.

  And for posterity, do you have a dmidecode with motherboard/BIOS
  information for the problematic machine? And your configuration?

I can grab that in the morning too.

  And finally, and stupidly, is there any chance that you have anything
  accessing /dev/hpet?

Not knowingly at least, but who the hell knows what systemd has its
fingers in these days.

Dave

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Sun, Dec 21, 2014 at 2:32 PM, Dave Jones da...@codemonkey.org.uk wrote:
 On Sun, Dec 21, 2014 at 02:19:03PM -0800, Linus Torvalds wrote:
 
   And finally, and stupidly, is there any chance that you have anything
   accessing /dev/hpet?

 Not knowingly at least, but who the hell knows what systemd has its
 fingers in these days.

Actually, it looks like /dev/hpet doesn't allow write access.

I can do the mmap(/dev/mem) thing and access the HPET by hand, and
when I write zero to it I immediately get something like this:

  Clocksource tsc unstable (delta = -284317725450 ns)
  Switched to clocksource hpet

just to confirm that yes, a jump in the HPET counter would indeed give
those kinds of symptoms:blaming the TSC with a negative delta in the
0-300s range, even though it's the HPET that is broken.

And if the HPET then occasionally jumps around afterwards, it would
show up as ktime_get() occasionally going backwards, which in turn
would - as far as I can tell - result in exactly that pseudo-infirnite
loop with timers.

Anyway, any wild kernel pointer access *could* happen to just hit the
HPET and write to the main counter value, although I'd personally be
more inclined to blame BIOS/SMM kind of code playing tricks with
time.. We do have a few places where we explicitly write the value on
purpose, but they are in the HPET init code, and in the clocksource
resume code, so they should not be involved.

Thomas - have you had reports of HPET breakage in RT circles, the same
way BIOSes have been tinkering with TSC?

Also, would it perhaps be a good idea to make ktime_get() save the
last time in a percpu variable, and warn if time ever goes backwards
on a particular CPU?  A percpu thing should be pretty cheap, even if
we write to it every time somebody asks for time..

   Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Sun, Dec 21, 2014 at 3:58 PM, Linus Torvalds
torva...@linux-foundation.org wrote:

 I can do the mmap(/dev/mem) thing and access the HPET by hand, and
 when I write zero to it I immediately get something like this:

   Clocksource tsc unstable (delta = -284317725450 ns)
   Switched to clocksource hpet

 just to confirm that yes, a jump in the HPET counter would indeed give
 those kinds of symptoms:blaming the TSC with a negative delta in the
 0-300s range, even though it's the HPET that is broken.

 And if the HPET then occasionally jumps around afterwards, it would
 show up as ktime_get() occasionally going backwards, which in turn
 would - as far as I can tell - result in exactly that pseudo-infirnite
 loop with timers.

Ok, so I tried that too.

It's actually a pretty easy experiment to do: just mmap(/dev/mem) at
the HPET offset (the kernel prints it out at boot, it should normally
be at 0xfed0). And then just write a zero to offset 0xf0, which is
the main counter.

The first time, you get the Clocksource tsc unstable.

The second time (or third, or fourth - it might not take immediately)
you get a lockup or similar. Bad things happen.

This is *not* to say that this is the bug you're hitting. But it does show that

 (a) a flaky HPET can do some seriously bad stuff
 (b) the kernel is very fragile wrt time going backwards.

and maybe we can use this test program to at least try to alleviate problem (b).

Trivial HPET mess-up program attached.

Linus
#include sys/types.h
#include sys/stat.h
#include fcntl.h
#include sys/mman.h
#include stdio.h


int main(int argc, char **argv)
{
	int fd = open(/dev/mem, O_RDWR);
	void *base;

	if (fd  0) {
		fputs(Unable to open /dev/mem\n, stderr);
		return -1;
	}
	base = mmap(NULL, 4096 ,PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0xfed0);
	if ((long)base == -1) {
		fputs(Unable to mmap HPET\n, stderr);
		return -1;
	}
	*(unsigned long *) (base+0xf0) = 0;
	return 0;
}

Re: frequent lockups in 3.18rc4

On Sun, Dec 21, 2014 at 4:41 PM, Linus Torvalds
torva...@linux-foundation.org wrote:

 The second time (or third, or fourth - it might not take immediately)
 you get a lockup or similar. Bad things happen.

I've only tested it twice now, but the first time I got a weird
lockup-like thing (things *kind* of worked, but I could imagine that
one CPU was stuck with a lock held, because things eventually ground
to a screeching halt.

The second time I got

  INFO: rcu_sched self-detected stall on CPU { 5}  (t=84533 jiffies
g=11971 c=11970 q=17)

and then

   INFO: rcu_sched detected stalls on CPUs/tasks: { 1 2 3 4 5 6 7}
(detected by 0, t=291309 jiffies, g=12031, c=12030, q=57)

with backtraces that made no sense (because obviously no actual stall
had taken place), and were the CPU's mostly being idle.

I could easily see it resulting in your softlockup scenario too.

  Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Sun, Dec 21, 2014 at 04:52:28PM -0800, Linus Torvalds wrote:
   The second time (or third, or fourth - it might not take immediately)
   you get a lockup or similar. Bad things happen.
  
  I've only tested it twice now, but the first time I got a weird
  lockup-like thing (things *kind* of worked, but I could imagine that
  one CPU was stuck with a lock held, because things eventually ground
  to a screeching halt.
  
  The second time I got
  
INFO: rcu_sched self-detected stall on CPU { 5}  (t=84533 jiffies
  g=11971 c=11970 q=17)
  
  and then
  
 INFO: rcu_sched detected stalls on CPUs/tasks: { 1 2 3 4 5 6 7}
  (detected by 0, t=291309 jiffies, g=12031, c=12030, q=57)
  
  with backtraces that made no sense (because obviously no actual stall
  had taken place), and were the CPU's mostly being idle.
  
  I could easily see it resulting in your softlockup scenario too.

So something trinity does when it doesn't have a better idea of
something to pass a syscall is to generate a random number.

A wild hypothesis could be that we're in one of these situations,
and we randomly generated 0xfed000f0 and passed that as a value to
a syscall, and the kernel wrote 0 to that address.

What syscall could do that, and not just fail a access_ok() or similar
is a mystery though.

Dave

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

2014-12-21 Thread Paul E. McKenney

On Sun, Dec 21, 2014 at 04:52:28PM -0800, Linus Torvalds wrote:
 On Sun, Dec 21, 2014 at 4:41 PM, Linus Torvalds
 torva...@linux-foundation.org wrote:
 
  The second time (or third, or fourth - it might not take immediately)
  you get a lockup or similar. Bad things happen.
 
 I've only tested it twice now, but the first time I got a weird
 lockup-like thing (things *kind* of worked, but I could imagine that
 one CPU was stuck with a lock held, because things eventually ground
 to a screeching halt.
 
 The second time I got
 
   INFO: rcu_sched self-detected stall on CPU { 5}  (t=84533 jiffies
 g=11971 c=11970 q=17)
 
 and then
 
INFO: rcu_sched detected stalls on CPUs/tasks: { 1 2 3 4 5 6 7}
 (detected by 0, t=291309 jiffies, g=12031, c=12030, q=57)
 
 with backtraces that made no sense (because obviously no actual stall
 had taken place), and were the CPU's mostly being idle.

Yep, if time gets messed up too much, RCU can incorrectly decide that
21 seconds have elapsed since the grace period started, and can even
decide this pretty much immediately after the grace period starts.

Thanx, Paul

 I could easily see it resulting in your softlockup scenario too.
 
   Linus
 

--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

2014-12-20 Thread Paul E. McKenney

On Sat, Dec 20, 2014 at 01:16:29PM -0800, Linus Torvalds wrote:
> On Sat, Dec 20, 2014 at 10:25 AM, Linus Torvalds
>  wrote:
> >
> > How/where is the HPET overflow case handled? I don't know the code enough.
> 
> Hmm, ok, I've re-acquainted myself with it. And I have to admit that I
> can't see anything wrong. The whole "update_wall_clock" and the shadow
> timekeeping state is confusing as hell, but seems fine. We'd have to
> avoid update_wall_clock for a *long* time for overflows to occur.
> 
> And the overflow in 32 bits isn't that special, since the only thing
> that really matters is the overflow of "cycle_now - tkr->cycle_last"
> within the mask.
> 
> So I'm not seeing anything even halfway suspicious.

One long shot is a bug in rcu_barrier() that I introduced in v3.18-rc1.
This is a low-probability race that can cause rcu_barrier() and friends
to return too soon, which can of course result in arbitrary misbehavior.
Please see below for a fix which looks good thus far in reasonably
intense rcutorture testing.

Might be what Dave and Sasha are seeing.  Or not.

Thanx, Paul



rcu: Fix rcu_barrier() race that could result in too-short wait

The rcu_barrier() no-callbacks check for no-CBs CPUs has race conditions.
It checks a given CPU's lists of callbacks, and if all three no-CBs lists
are empty, ignores that CPU.  However, these three lists could potentially
be empty even when callbacks are present if the check executed just as
the callbacks were being moved from one list to another.  It turns out
that recent versions of rcutorture can spot this race.

This commit plugs this hole by consolidating the per-list counts of
no-CBs callbacks into a single count, which is incremented before
the corresponding callback is posted and after it is invoked.  Then
rcu_barrier() checks this single count to reliably determine whether
the corresponding CPU has no-CBs callbacks.

Signed-off-by: Paul E. McKenney 

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 7680fc275036..658b691dc32b 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3318,6 +3318,7 @@ static void _rcu_barrier(struct rcu_state *rsp)
} else {
_rcu_barrier_trace(rsp, "OnlineNoCB", cpu,
   rsp->n_barrier_done);
+   smp_mb__before_atomic();
atomic_inc(>barrier_cpu_count);
__call_rcu(>barrier_head,
   rcu_barrier_callback, rsp, cpu, 0);
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 8e7b1843896e..cb5908672f11 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -340,14 +340,10 @@ struct rcu_data {
 #ifdef CONFIG_RCU_NOCB_CPU
struct rcu_head *nocb_head; /* CBs waiting for kthread. */
struct rcu_head **nocb_tail;
-   atomic_long_t nocb_q_count; /* # CBs waiting for kthread */
-   atomic_long_t nocb_q_count_lazy; /*  (approximate). */
+   atomic_long_t nocb_q_count; /* # CBs waiting for nocb */
+   atomic_long_t nocb_q_count_lazy; /*  invocation (all stages). */
struct rcu_head *nocb_follower_head; /* CBs ready to invoke. */
struct rcu_head **nocb_follower_tail;
-   atomic_long_t nocb_follower_count; /* # CBs ready to invoke. */
-   atomic_long_t nocb_follower_count_lazy; /*  (approximate). */
-   int nocb_p_count;   /* # CBs being invoked by kthread */
-   int nocb_p_count_lazy;  /*  (approximate). */
wait_queue_head_t nocb_wq;  /* For nocb kthreads to sleep on. */
struct task_struct *nocb_kthread;
int nocb_defer_wakeup;  /* Defer wakeup of nocb_kthread. */
@@ -356,8 +352,6 @@ struct rcu_data {
struct rcu_head *nocb_gp_head cacheline_internodealigned_in_smp;
/* CBs waiting for GP. */
struct rcu_head **nocb_gp_tail;
-   long nocb_gp_count;
-   long nocb_gp_count_lazy;
bool nocb_leader_sleep; /* Is the nocb leader thread asleep? */
struct rcu_data *nocb_next_follower;
/* Next follower in wakeup chain. */
@@ -622,24 +616,15 @@ static void rcu_dynticks_task_exit(void);
 #endif /* #ifndef RCU_TREE_NONCORE */
 
 #ifdef CONFIG_RCU_TRACE
-#ifdef CONFIG_RCU_NOCB_CPU
-/* Sum up queue lengths for tracing. */
+/* Read out queue lengths for tracing. */
 static inline void rcu_nocb_q_lengths(struct rcu_data *rdp, long *ql, long 
*qll)
 {
-   *ql = atomic_long_read(>nocb_q_count) +
- rdp->nocb_p_count +
- atomic_long_read(>nocb_follower_count) +
- rdp->nocb_p_count + rdp->nocb_gp_count;
-   *qll = atomic_long_read(>nocb_q_count_lazy) +
-  rdp->nocb_p_count_lazy +
-

Re: frequent lockups in 3.18rc4

On Sat, Dec 20, 2014 at 10:25 AM, Linus Torvalds
 wrote:
>
> How/where is the HPET overflow case handled? I don't know the code enough.

Hmm, ok, I've re-acquainted myself with it. And I have to admit that I
can't see anything wrong. The whole "update_wall_clock" and the shadow
timekeeping state is confusing as hell, but seems fine. We'd have to
avoid update_wall_clock for a *long* time for overflows to occur.

And the overflow in 32 bits isn't that special, since the only thing
that really matters is the overflow of "cycle_now - tkr->cycle_last"
within the mask.

So I'm not seeing anything even halfway suspicious.

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Fri, Dec 19, 2014 at 5:57 PM, Linus Torvalds
 wrote:
>
> I'm claiming that the race happened *once*. And it then corrupted some
> data structure or similar sufficiently that CPU0 keeps looping.
>
> Perhaps something keeps re-adding itself to the head of the timerqueue
> due to the race.

So tick_sched_timer() does

ktime_t now = ktime_get();
...
hrtimer_forward(timer, now, tick_period);
return HRTIMER_RESTART;

and then __run_hrtimer does

enqueue_hrtimer(timer, base);

which just adds the timer back on the tiemr heap.

So all you need to get an infinite loop (as far as I can see) is that
hrtimer_forward() doesn't actually move the timer forward.

The most likely reason would seem to be this:

delta = ktime_sub(now, hrtimer_get_expires(timer));

if (delta.tv64 < 0)
return 0;

and clearly it *should* return a positive number, since the timer has
expired, so the expiry time _should_ be smaller than "now". So it
should never trigger, and this bug is clearly impossible.

HOWEVER.

It turns out that while tick_sched_timer() does "ktime_get()" to get
the current time, the actual timer machinery does *not* do that at
all. The actual timer machinery does

entry_time = now = hrtimer_update_base(cpu_base);

base = cpu_base->clock_base + i;
basenow = ktime_add(now, base->offset);

_once_ per hrtimer_clock_base. And then it iterates using that
"basenow" thing, and  compares it to the timer expiry.

So we have two different times. Now, let's think about what happens if
those clocks aren't quote in sync.

We know (since __run_hrtimer was called) that

basenow.tv64 > hrtimer_get_softexpires_tv64(timer)

but here we have "basenow" - which is not that ktime_get(), and we
have "hrtimer_get_softexpires_tv64()" (which is not
hrtimer_get_expires() in general - we have all that "delta" range
handling, but for the scheduling tick it *should* be the same).

So I can see at least one lockup:

 - if "expires < basenow" hrtimer_interrupt() will run the timer
 - if "now < expires" hrtimer_forward() will not do anything, and will
just reschedule the timer with the same expiration

iow, all you need for a re-arming of the same timer is:

   now < expires < basenow

now, the two clocks (now and basenow) are not the same, but they do
have the same *base*. So they are related, but even the base time was
gotten under two different sequence locks, so even the base could have
been updated in between the hrtimer_update_base() time and the
ktime_get(). And even though they have the same base, they have
different offsets: basenow does that "base->offset" thing (and
ktime_get_update_offsets_now() does timekeeping_get_ns()

 - now = ktime_get() does

base = tk->tkr.base_mono;
nsecs = timekeeping_get_ns(>tkr);

 - basenow = ktime_get_update_offsets_now() does

base = tk->tkr.base_mono;
nsecs = timekeeping_get_ns(>tkr);
  .. and then ..
ktime_add(.., base->offset);

and if I read the thing right, the ktime_add() should be a no-op,
because base->offset should be 0 for the normal monotonic clock.
Right?

So the two times (now and basenow) *should* be the same time, and the
whole "now < expires < basenow" situation can never happen. Right?

Potentially wrong.

Because that's where the whole "different sequence locks" comes in.
The one-time race could be something that updates the base in between
the (one-time) ktime_get_update_offsets_now() and the (then as a
result pseudo-infinitely repeating) ktime_get.

Hmm? If "base" ever goes backwards, or if "base" does *not* update
atomically with the HPET timer overflows, I can see that happening. Of
course, that would imply that ktime_get() is not monotonic. But we do
know that we've had odd time issues on that machine.

I think you already had DaveJ check clock monotonicity. But that was
with the TSC, wasn't it? I'm claiming maybe the HPET isn't monotonic,
and there is some HPET clocksource issue with overflow in 32 bits.

(I think the HPET *should* be 64-bit, and just the comparators for
interrupts may be 32-bit, but we use a "readl()" and only use the low
32-bits even if the upper 32 bits *might* be ok).

I keep harping on that HPET overflow, because we actually have the 6
"locked up" HPET traces, and then we have a seventh without that
lockup, and there definitely was a overflow in 32 bits:

  [torvalds@i7 linux]$ grep -3 read_hpet+0x16 ~/dj-1.txt | grep RAX
   RAX: 61fece8a RBX: 00510792 RCX: 
   RAX: 79e588fc RBX: 00511d6e RCX: 
   RAX: 91ca7f65 RBX: 00513346 RCX: 
   RAX: a9afbd0d RBX: 0051491e RCX: 
   RAX: cbd1340c RBX: 0051684a RCX: 
   RAX: fb9d303f RBX: 005193fc RCX: 
   RAX: 2b67efe4 RBX:

Re: frequent lockups in 3.18rc4

2014-12-20 Thread Dave Jones

On Fri, Dec 19, 2014 at 02:05:20PM -0800, Linus Torvalds wrote:
 > > Right now I'm doing Chris' idea of "turn debugging back on,
 > > and try without serial console".  Shall I try your suggestion
 > > on top of that ?
 > 
 > Might as well. I doubt it really will make any difference, but I also
 > don't think it will interact badly in any way, so..

It locked up. It's not even responding to icmp.
It might be Monday before I can see what's on the console.

Dave


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

2014-12-20 Thread Dave Jones

On Fri, Dec 19, 2014 at 02:05:20PM -0800, Linus Torvalds wrote:
   Right now I'm doing Chris' idea of turn debugging back on,
   and try without serial console.  Shall I try your suggestion
   on top of that ?
  
  Might as well. I doubt it really will make any difference, but I also
  don't think it will interact badly in any way, so..

It locked up. It's not even responding to icmp.
It might be Monday before I can see what's on the console.

Dave


--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Fri, Dec 19, 2014 at 5:57 PM, Linus Torvalds
torva...@linux-foundation.org wrote:

 I'm claiming that the race happened *once*. And it then corrupted some
 data structure or similar sufficiently that CPU0 keeps looping.

 Perhaps something keeps re-adding itself to the head of the timerqueue
 due to the race.

So tick_sched_timer() does

ktime_t now = ktime_get();
...
hrtimer_forward(timer, now, tick_period);
return HRTIMER_RESTART;

and then __run_hrtimer does

enqueue_hrtimer(timer, base);

which just adds the timer back on the tiemr heap.

So all you need to get an infinite loop (as far as I can see) is that
hrtimer_forward() doesn't actually move the timer forward.

The most likely reason would seem to be this:

delta = ktime_sub(now, hrtimer_get_expires(timer));

if (delta.tv64  0)
return 0;

and clearly it *should* return a positive number, since the timer has
expired, so the expiry time _should_ be smaller than now. So it
should never trigger, and this bug is clearly impossible.

HOWEVER.

It turns out that while tick_sched_timer() does ktime_get() to get
the current time, the actual timer machinery does *not* do that at
all. The actual timer machinery does

entry_time = now = hrtimer_update_base(cpu_base);

base = cpu_base-clock_base + i;
basenow = ktime_add(now, base-offset);

_once_ per hrtimer_clock_base. And then it iterates using that
basenow thing, and  compares it to the timer expiry.

So we have two different times. Now, let's think about what happens if
those clocks aren't quote in sync.

We know (since __run_hrtimer was called) that

basenow.tv64  hrtimer_get_softexpires_tv64(timer)

but here we have basenow - which is not that ktime_get(), and we
have hrtimer_get_softexpires_tv64() (which is not
hrtimer_get_expires() in general - we have all that delta range
handling, but for the scheduling tick it *should* be the same).

So I can see at least one lockup:

 - if expires  basenow hrtimer_interrupt() will run the timer
 - if now  expires hrtimer_forward() will not do anything, and will
just reschedule the timer with the same expiration

iow, all you need for a re-arming of the same timer is:

   now  expires  basenow

now, the two clocks (now and basenow) are not the same, but they do
have the same *base*. So they are related, but even the base time was
gotten under two different sequence locks, so even the base could have
been updated in between the hrtimer_update_base() time and the
ktime_get(). And even though they have the same base, they have
different offsets: basenow does that base-offset thing (and
ktime_get_update_offsets_now() does timekeeping_get_ns()

 - now = ktime_get() does

base = tk-tkr.base_mono;
nsecs = timekeeping_get_ns(tk-tkr);

 - basenow = ktime_get_update_offsets_now() does

base = tk-tkr.base_mono;
nsecs = timekeeping_get_ns(tk-tkr);
  .. and then ..
ktime_add(.., base-offset);

and if I read the thing right, the ktime_add() should be a no-op,
because base-offset should be 0 for the normal monotonic clock.
Right?

So the two times (now and basenow) *should* be the same time, and the
whole now  expires  basenow situation can never happen. Right?

Potentially wrong.

Because that's where the whole different sequence locks comes in.
The one-time race could be something that updates the base in between
the (one-time) ktime_get_update_offsets_now() and the (then as a
result pseudo-infinitely repeating) ktime_get.

Hmm? If base ever goes backwards, or if base does *not* update
atomically with the HPET timer overflows, I can see that happening. Of
course, that would imply that ktime_get() is not monotonic. But we do
know that we've had odd time issues on that machine.

I think you already had DaveJ check clock monotonicity. But that was
with the TSC, wasn't it? I'm claiming maybe the HPET isn't monotonic,
and there is some HPET clocksource issue with overflow in 32 bits.

(I think the HPET *should* be 64-bit, and just the comparators for
interrupts may be 32-bit, but we use a readl() and only use the low
32-bits even if the upper 32 bits *might* be ok).

I keep harping on that HPET overflow, because we actually have the 6
locked up HPET traces, and then we have a seventh without that
lockup, and there definitely was a overflow in 32 bits:

  [torvalds@i7 linux]$ grep -3 read_hpet+0x16 ~/dj-1.txt | grep RAX
   RAX: 61fece8a RBX: 00510792 RCX: 
   RAX: 79e588fc RBX: 00511d6e RCX: 
   RAX: 91ca7f65 RBX: 00513346 RCX: 
   RAX: a9afbd0d RBX: 0051491e RCX: 
   RAX: cbd1340c RBX: 0051684a RCX: 
   RAX: fb9d303f RBX: 005193fc RCX: 
   RAX: 2b67efe4 RBX: 0051c224 RCX:

Re: frequent lockups in 3.18rc4

On Sat, Dec 20, 2014 at 10:25 AM, Linus Torvalds
torva...@linux-foundation.org wrote:

 How/where is the HPET overflow case handled? I don't know the code enough.

Hmm, ok, I've re-acquainted myself with it. And I have to admit that I
can't see anything wrong. The whole update_wall_clock and the shadow
timekeeping state is confusing as hell, but seems fine. We'd have to
avoid update_wall_clock for a *long* time for overflows to occur.

And the overflow in 32 bits isn't that special, since the only thing
that really matters is the overflow of cycle_now - tkr-cycle_last
within the mask.

So I'm not seeing anything even halfway suspicious.

  Linus
--
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

2014-12-20 Thread Paul E. McKenney

On Sat, Dec 20, 2014 at 01:16:29PM -0800, Linus Torvalds wrote:
 On Sat, Dec 20, 2014 at 10:25 AM, Linus Torvalds
 torva...@linux-foundation.org wrote:
 
  How/where is the HPET overflow case handled? I don't know the code enough.
 
 Hmm, ok, I've re-acquainted myself with it. And I have to admit that I
 can't see anything wrong. The whole update_wall_clock and the shadow
 timekeeping state is confusing as hell, but seems fine. We'd have to
 avoid update_wall_clock for a *long* time for overflows to occur.
 
 And the overflow in 32 bits isn't that special, since the only thing
 that really matters is the overflow of cycle_now - tkr-cycle_last
 within the mask.
 
 So I'm not seeing anything even halfway suspicious.

One long shot is a bug in rcu_barrier() that I introduced in v3.18-rc1.
This is a low-probability race that can cause rcu_barrier() and friends
to return too soon, which can of course result in arbitrary misbehavior.
Please see below for a fix which looks good thus far in reasonably
intense rcutorture testing.

Might be what Dave and Sasha are seeing.  Or not.

Thanx, Paul



rcu: Fix rcu_barrier() race that could result in too-short wait

The rcu_barrier() no-callbacks check for no-CBs CPUs has race conditions.
It checks a given CPU's lists of callbacks, and if all three no-CBs lists
are empty, ignores that CPU.  However, these three lists could potentially
be empty even when callbacks are present if the check executed just as
the callbacks were being moved from one list to another.  It turns out
that recent versions of rcutorture can spot this race.

This commit plugs this hole by consolidating the per-list counts of
no-CBs callbacks into a single count, which is incremented before
the corresponding callback is posted and after it is invoked.  Then
rcu_barrier() checks this single count to reliably determine whether
the corresponding CPU has no-CBs callbacks.

Signed-off-by: Paul E. McKenney paul...@linux.vnet.ibm.com

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 7680fc275036..658b691dc32b 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3318,6 +3318,7 @@ static void _rcu_barrier(struct rcu_state *rsp)
} else {
_rcu_barrier_trace(rsp, OnlineNoCB, cpu,
   rsp-n_barrier_done);
+   smp_mb__before_atomic();
atomic_inc(rsp-barrier_cpu_count);
__call_rcu(rdp-barrier_head,
   rcu_barrier_callback, rsp, cpu, 0);
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 8e7b1843896e..cb5908672f11 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -340,14 +340,10 @@ struct rcu_data {
 #ifdef CONFIG_RCU_NOCB_CPU
struct rcu_head *nocb_head; /* CBs waiting for kthread. */
struct rcu_head **nocb_tail;
-   atomic_long_t nocb_q_count; /* # CBs waiting for kthread */
-   atomic_long_t nocb_q_count_lazy; /*  (approximate). */
+   atomic_long_t nocb_q_count; /* # CBs waiting for nocb */
+   atomic_long_t nocb_q_count_lazy; /*  invocation (all stages). */
struct rcu_head *nocb_follower_head; /* CBs ready to invoke. */
struct rcu_head **nocb_follower_tail;
-   atomic_long_t nocb_follower_count; /* # CBs ready to invoke. */
-   atomic_long_t nocb_follower_count_lazy; /*  (approximate). */
-   int nocb_p_count;   /* # CBs being invoked by kthread */
-   int nocb_p_count_lazy;  /*  (approximate). */
wait_queue_head_t nocb_wq;  /* For nocb kthreads to sleep on. */
struct task_struct *nocb_kthread;
int nocb_defer_wakeup;  /* Defer wakeup of nocb_kthread. */
@@ -356,8 +352,6 @@ struct rcu_data {
struct rcu_head *nocb_gp_head cacheline_internodealigned_in_smp;
/* CBs waiting for GP. */
struct rcu_head **nocb_gp_tail;
-   long nocb_gp_count;
-   long nocb_gp_count_lazy;
bool nocb_leader_sleep; /* Is the nocb leader thread asleep? */
struct rcu_data *nocb_next_follower;
/* Next follower in wakeup chain. */
@@ -622,24 +616,15 @@ static void rcu_dynticks_task_exit(void);
 #endif /* #ifndef RCU_TREE_NONCORE */
 
 #ifdef CONFIG_RCU_TRACE
-#ifdef CONFIG_RCU_NOCB_CPU
-/* Sum up queue lengths for tracing. */
+/* Read out queue lengths for tracing. */
 static inline void rcu_nocb_q_lengths(struct rcu_data *rdp, long *ql, long 
*qll)
 {
-   *ql = atomic_long_read(rdp-nocb_q_count) +
- rdp-nocb_p_count +
- atomic_long_read(rdp-nocb_follower_count) +
- rdp-nocb_p_count + rdp-nocb_gp_count;
-   *qll = atomic_long_read(rdp-nocb_q_count_lazy) +
-

Re: frequent lockups in 3.18rc4

On Fri, Dec 19, 2014 at 5:00 PM, Thomas Gleixner  wrote:
>
> The watchdog timer runs on a fully periodic schedule. It's self
> rearming via
>
>  hrtimer_forward_now(hrtimer, ns_to_ktime(sample_period));
>
> So if that aligns with the equally periodic tick interrupt on the
> other CPU then you might get into that situation due to the fully
> synchronizing and serializing nature of HPET reads.

No. Really. No.

There is no way in hell that a modern system ends up being that
synchronized. Not to the point that the interrupt also happens at the
*exact* same point.

Thomas, you're in denial. It's not just the exact same instruction in
the timer interrupt handler ("timer interrupts and NMI's are in
sync"), the call chain *past* the timer interrupt is the exact same
thing too.  The stack has the same contents. So not only must the two
CPU's be perfectly in sync so that the NMI always triggers on exactly
the same instruction, the CPU that takes the timer interrupt (in your
schenario, over and over again) must be magically filling the stack
with the exact same thing every time.

That CPU isn't making progress that just happens to be "synchronized".

Admit it, you've not seen a busy system that is *so* synchronized that
you get timer interrupts and NMI's that hit on the same instruction
over a sequence of minutes.

> So lets assume there is that hrtimer_remove race (I'm certainly going
> to stare at that tomorrow with fully awake brain. It's past beer
> o'clock here). How do you explain that:
>
> 1) the watchdog always triggers on CPU1?
>
> 2) the race only ever happens on CPU0?

I'm claiming that the race happened *once*. And it then corrupted some
data structure or similar sufficiently that CPU0 keeps looping.

Perhaps something keeps re-adding itself to the head of the timerqueue
due to the race.

The watchdog doesn't trigger on CPU0, because this is the software
watchdog, and interrupts are disabled on CPU0, and CPU0 isn't making
any progress. Because it's looping in a fairly tight loop.

The watchdog triggers on CPU1 (over and over again) because CPU1 is
waiting for the TLB shootdown to complete. And it doesn't, because
interrupts are disabled on CPU0 that it's trying to shoot down the TLB
on.

That theory at least fits the data. So CPU1 isn't doing anything odd at all.

In a way that "magic happens so that everything is so synchronized
that you cross-syncornize two CPU's making real progress over many
minutes". THAT sounds like just a fairy tale to me.

Your argument "it has a question mark in front of it" objection is
bogus. We got an *interrupt* in the middle of the call chain. Of
*course* the call chain is unreliable. It doesn't matter. What matters
is that the stack under the interrupt DOES NOT CHANGE. It doesn't even
matter if it's a real honest-to-god callchain or not, what matters is
that the kernel stack under the interrupt is unchanged. No way does
that happen if it's making progress at the same time.

> 3) the hrtimer interrupt took too long message never appears?

At a guess, it's looping (for a long long time) on
timerqueue_getnext() in hrtimer_interrupt(), and then returns. Never
gets to the retry or the "interrupt took %llu" messages.

And we know what timer entry it is looping on:  tick_sched_timer.
Which does a HRTIMER_RESTART, so the re-enqueueing isn't exactly
unlikely.

All it needs is that the re-enqueueing has gotten confused enough that
it re-enqueues it on the same queue over and over again.  Which would
run tick_sched_timer over and over again. No? Until the re-enqueuing
magically stops (and we do actually have a HPET counter overflow
there. Just look at the RAX values:

 RAX: fb9d303f
 RAX: 2b67efe4

that fb9d303f is the last time we see that particular
callchain. The next time we see hpet_read(), it's that
2b67efe4 thing.

So my "maybe it has something to do with HPET overflow" wasn't just a
random throw-away comment. We actually have real data saying that the
HPET *did* overflow, and it in fact happened somewhere around the time
when the lockup went away.

Are they related? Maybe not. But dammit, there's a lot of
"coincidences" here. Not just the "oh, it always takes the NMI on the
exact same instruction".

> 4) the RT throttler hit?
> 5) that the system makes progress afterwards?

.. something eventually clears the problem, maybe because of the HPET
overflow. I dunno. I'm just saying that your arguments to ignore CPU0
are pretty damn weak.

So I think the CPU1 behavior is 100% consistent with CPU0 just having
interrupts disabled.

So I think CPU1 is _trivial_ to explain if you accept that CPU0 is
doing something weird.

Which is why I think your "look at CPU1" sounds so strange. I don't
think CPU1 is all that interesting. I can *trivially* explain it with
a single sentence, and did exactly that above.

Can you trivially explain CPU0? Without the "it's just a big
coincidence that we take  the NMI on the same instruction for several
minutes,

Re: frequent lockups in 3.18rc4

On Fri, 19 Dec 2014, Chris Mason wrote:
> On Fri, Dec 19, 2014 at 6:22 PM, Thomas Gleixner  wrote:
> > But at the very end this would be detected by the runtime check of the
> > hrtimer interrupt, which does not trigger. And it would trigger at
> > some point as ALL cpus including CPU0 in that trace dump make
> > progress.
> 
> I'll admit that at some point we should be hitting one of the WARN or BUG_ON,
> but it's possible to thread that needle and corrupt the timer list, without
> hitting a warning (CPU 1 in my example has to enqueue last).  Once the rbtree
> is hosed, it can go forever.  Probably not the bug we're looking for, but
> still suspect in general.

I surely have a close look at that, but in that case we get out of
that state later on and I doubt that we have 

 A) a corruption of the rbtree
 B) a self healing of the rbtree afterwards

I doubt it, but who knows.

Though even if A & B would happen we would still get the 'hrtimer
interrupt took a gazillion of seconds' warning because CPU0 definitely
leaves the timer interrupt at some point otherwise we would not see
backtraces from usb, userspace and idle later on.

Thanks,

tglx




--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Fri, 19 Dec 2014, Linus Torvalds wrote:
> On Fri, Dec 19, 2014 at 3:14 PM, Thomas Gleixner  wrote:
> > Now that all looks correct. So there is something else going on. After
> > staring some more at it, I think we are looking at it from the wrong
> > angle.
> >
> > The watchdog always detects CPU1 as stuck and we got completely
> > fixated on the csd_wait() in the stack trace on CPU1. Now we have
> > stack traces which show a different picture, i.e. CPU1 makes progress
> > after a gazillion of seconds.
> 
> .. but that doesn't explain why CPU0 ends up always being at that
> *exact* same instruction in the NMI backtrace.

Up to the point where it exposes different instructions and completely
different code pathes. I'm not agreeing with your theory that after
the RT throttler hit the watchdog and everything else goes completely
bonkers. And even before that happens we see a different backtrace on
CPU0:

[25149.982766] RIP: 0010:[]  [] 
invoke_rcu_core+0x2b/0x50

Though I have to admit that this "very same instruction" pattern
puzzled me for quite a while as well.

> While a fairly tight loop, together with "mmio read is very expensive
> and synchronizing" would explain it. An MMIO read can easily be as
> expensive as several thousand instructions.

The watchdog timer runs on a fully periodic schedule. It's self
rearming via

 hrtimer_forward_now(hrtimer, ns_to_ktime(sample_period));

So if that aligns with the equally periodic tick interrupt on the
other CPU then you might get into that situation due to the fully
synchronizing and serializing nature of HPET reads.

That can drift apart over time because the timer device (apic or that
new fangled tscdeadline timer) are not frequency corrected versus
timekeeping.

> > I think we really need to look at CPU1 itself.
> 
> Not so fast. Take another look at CPU0.
> 
> [24998.083577]  [] ktime_get+0x3e/0xa0
> [24998.084450]  [] tick_sched_timer+0x23/0x160
> [24998.085315]  [] __run_hrtimer+0x76/0x1f0
> [24998.086173]  [] ? tick_init_highres+0x20/0x20
> [24998.087025]  [] hrtimer_interrupt+0x107/0x260
> [24998.087877]  [] local_apic_timer_interrupt+0x3b/0x70
> [24998.088732]  [] smp_apic_timer_interrupt+0x45/0x60
> [24998.089583]  [] apic_timer_interrupt+0x6f/0x80
> [24998.090435]  
> [24998.091279]  [] ? __remove_hrtimer+0x4e/0xa0
> [24998.092118]  [] ? ipcget+0x8a/0x1e0
> [24998.092951]  [] ? ipcget+0x7c/0x1e0
> [24998.093779]  [] SyS_msgget+0x4d/0x70
> 
> 
> Really. None of that changed. NONE. The likelihood that we hit the
> exact same instruction every time? Over a timefraem of more than a
> minute?
> 
> The only way I see that happening is (a) NMI is completely buggered,
> and the backtrace is just random crap that is always the same.  Or (b)
> it's really a fairly tight loop.
> 
> The fact that you had a hrtimer interrupt happen in the *middle* of
> __remove_hrtimer() is really another fairly strong hint. That smells
> like "__remove_hrtimer() has a race with hrtimer interrupts".

That __remove_hrtimer has a '?' in front of it. So its not a reliable
trace entry.

There is NO hrtimer related operation in the msgget() syscall at all
and SyS_msgget() is the only reliable entry on that stack trace.

So that __remove_hrtimer operation happened before that msgget()
syscall and is just a stack artifact. poll/select/nanosleep whatever.

> And that race results in a basically endless loop (which perhaps ends
> when the HRtimer overflows, in what, a few minutes?)

hrtimers overflow in about 584 years

> I really don't think you should look at CPU1. Not when CPU0 has such
> an interesting pattern that you dismissed just because the HPET is
> making progress.

No. I did not dismiss it because HPET is making progress. I looked at
it from a different angle.

So lets assume there is that hrtimer_remove race (I'm certainly going
to stare at that tomorrow with fully awake brain. It's past beer
o'clock here). How do you explain that:

1) the watchdog always triggers on CPU1?

2) the race only ever happens on CPU0?

3) the hrtimer interrupt took too long message never appears?

   If that timer interrupt loops forever then it will complain about
   that. And it leaves that code for sure as the backtrace of CPU0
   hits user space later on.

4) the RT throttler hit?

   Admittetly we dont know from which CPU and which task that comes
   but that's very simple to figure out. Debug patch below.

5) that the system makes progress afterwards?

6) 

If my assumption about an interrupt storm turns out to be true, then
it explain all of the above. I might be wrong as usual, but I still
think its worth to have a look.

Thanks,

tglx
---

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index ee15f5a0d1c1..d9e4153d405b 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -895,7 +895,8 @@ static int sched_rt_runtime_exceeded(struct rt_rq *rt_rq)
 */
if (likely(rt_b->rt_runtime)) {
rt_rq->rt_throttled =

Re: frequent lockups in 3.18rc4

2014-12-19 Thread Chris Mason




On Fri, Dec 19, 2014 at 6:22 PM, Thomas Gleixner  
wrote:

On Fri, 19 Dec 2014, Chris Mason wrote:

 On Fri, Dec 19, 2014 at 11:15:21AM -0800, Linus Torvalds wrote:
 > Here's another pattern. In your latest thing, every single time 
that
 > CPU1 is waiting for some other CPU to pick up the IPI, we have 
CPU0

 > doing this:
 >
 > [24998.060963] NMI backtrace for cpu 0
 > [24998.061989] CPU: 0 PID: 2940 Comm: trinity-c150 Not tainted 
3.18.0+ #108
 > [24998.064073] task: 8801bf3536b0 ti: 880197e0c000 
task.ti:

 > 880197e0c000
 > [24998.065137] RIP: 0010:[]  
[]

 > read_hpet+0x16/0x20
 > [24998.083577]  [] ktime_get+0x3e/0xa0
 > [24998.084450]  [] tick_sched_timer+0x23/0x160
 > [24998.085315]  [] __run_hrtimer+0x76/0x1f0
 > [24998.086173]  [] ? tick_init_highres+0x20/0x20
 > [24998.087025]  [] hrtimer_interrupt+0x107/0x260
 > [24998.087877]  [] 
local_apic_timer_interrupt+0x3b/0x70
 > [24998.088732]  [] 
smp_apic_timer_interrupt+0x45/0x60
 > [24998.089583]  [] 
apic_timer_interrupt+0x6f/0x80

 > [24998.090435]  
 > [24998.091279]  [] ? __remove_hrtimer+0x4e/0xa0
 > [24998.092118]  [] ? ipcget+0x8a/0x1e0
 > [24998.092951]  [] ? ipcget+0x7c/0x1e0
 > [24998.093779]  [] SyS_msgget+0x4d/0x70
 >
 > and I think that's the smoking gun. The reason CPU0 isn't picking 
up
 > any IPI's is because it is in some endless loop around 
read_hpet().

 >
 > There is even time information in the register dump:
 >
 >  RAX: 61fece8a RBX: 00510792 RCX: 
 >  RAX: 79e588fc RBX: 00511d6e RCX: 
 >  RAX: 91ca7f65 RBX: 00513346 RCX: 
 >  RAX: a9afbd0d RBX: 0051491e RCX: 
 >  RAX: cbd1340c RBX: 0051684a RCX: 
 >  RAX: fb9d303f RBX: 005193fc RCX: 
 >  RAX: 2b67efe4 RBX: 0051c224 RCX: 0004
 >
 > That RAX value is the value we just read from the HPET, and RBX 
seems

 > to be monotonically increasing too, so it's likely the sequence
 > counter in ktime_get().
 >
 > So it's not stuck *inside* read_hpet(), and it's almost certainly 
not

 > the loop over the sequence counter in ktime_get() either (it's not
 > increasing *that* quickly). But some basically infinite 
__run_hrtimer

 > thing or something?

 Really interesting.

 So, we're calling __run_hrtimer in a loop:

 while ((node = timerqueue_getnext(>active))) {
...
__run_hrtimer(timer, );
...
}

 The easy question is how often does trinity call nanosleep?

 Looking at __run_hrtimer(), it drops the lock and runs the function 
and then

 takes the lock again, maybe enqueueing us again right away.

 timer->state is supposed to protect us from other CPUs jumping in 
and doing
 something else with the timer, but it feels racey wrt 
remove_hrtimer().
 Something like this, but I'm not sure how often 
__hrtimer_start_range_ns gets

 called

 CPU 0  CPU 1
 __run_hrtimer()
 timer->state = HRTIMER_STATE_CALLBACK
 removed from list
 unlock cpu_base->lock
 restrt = fn(timer)
__hrtimer_start_range_ns()
base = lock_hrtimer_base()
ret = remove_hrtimer()
finds timer->state = 
HRTIMER_STATE_CALLBACK
does nothing
new_base = switch_hrtimer_base()
now we're on a different 
base, different lock
 lock(cpu_base->lock)
 enqueue the timer
enqueue the timer


But at the very end this would be detected by the runtime check of the
hrtimer interrupt, which does not trigger. And it would trigger at
some point as ALL cpus including CPU0 in that trace dump make
progress.


I'll admit that at some point we should be hitting one of the WARN or 
BUG_ON, but it's possible to thread that needle and corrupt the timer 
list, without hitting a warning (CPU 1 in my example has to enqueue 
last).  Once the rbtree is hosed, it can go forever.  Probably not the 
bug we're looking for, but still suspect in general.


-chris



--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Fri, Dec 19, 2014 at 3:14 PM, Thomas Gleixner  wrote:
>
> Now that all looks correct. So there is something else going on. After
> staring some more at it, I think we are looking at it from the wrong
> angle.
>
> The watchdog always detects CPU1 as stuck and we got completely
> fixated on the csd_wait() in the stack trace on CPU1. Now we have
> stack traces which show a different picture, i.e. CPU1 makes progress
> after a gazillion of seconds.

.. but that doesn't explain why CPU0 ends up always being at that
*exact* same instruction in the NMI backtrace.

While a fairly tight loop, together with "mmio read is very expensive
and synchronizing" would explain it. An MMIO read can easily be as
expensive as several thousand instructions.

> I think we really need to look at CPU1 itself.

Not so fast. Take another look at CPU0.

[24998.083577]  [] ktime_get+0x3e/0xa0
[24998.084450]  [] tick_sched_timer+0x23/0x160
[24998.085315]  [] __run_hrtimer+0x76/0x1f0
[24998.086173]  [] ? tick_init_highres+0x20/0x20
[24998.087025]  [] hrtimer_interrupt+0x107/0x260
[24998.087877]  [] local_apic_timer_interrupt+0x3b/0x70
[24998.088732]  [] smp_apic_timer_interrupt+0x45/0x60
[24998.089583]  [] apic_timer_interrupt+0x6f/0x80
[24998.090435]  
[24998.091279]  [] ? __remove_hrtimer+0x4e/0xa0
[24998.092118]  [] ? ipcget+0x8a/0x1e0
[24998.092951]  [] ? ipcget+0x7c/0x1e0
[24998.093779]  [] SyS_msgget+0x4d/0x70

Really. None of that changed. NONE. The likelihood that we hit the
exact same instruction every time? Over a timefraem of more than a
minute?

The only way I see that happening is (a) NMI is completely buggered,
and the backtrace is just random crap that is always the same.  Or (b)
it's really a fairly tight loop.

The fact that you had a hrtimer interrupt happen in the *middle* of
__remove_hrtimer() is really another fairly strong hint. That smells
like "__remove_hrtimer() has a race with hrtimer interrupts".

And that race results in a basically endless loop (which perhaps ends
when the HRtimer overflows, in what, a few minutes?)

I really don't think you should look at CPU1. Not when CPU0 has such
an interesting pattern that you dismissed just because the HPET is
making progress.

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Fri, 19 Dec 2014, Chris Mason wrote:
> On Fri, Dec 19, 2014 at 11:15:21AM -0800, Linus Torvalds wrote:
> > Here's another pattern. In your latest thing, every single time that
> > CPU1 is waiting for some other CPU to pick up the IPI, we have CPU0
> > doing this:
> > 
> > [24998.060963] NMI backtrace for cpu 0
> > [24998.061989] CPU: 0 PID: 2940 Comm: trinity-c150 Not tainted 3.18.0+ #108
> > [24998.064073] task: 8801bf3536b0 ti: 880197e0c000 task.ti:
> > 880197e0c000
> > [24998.065137] RIP: 0010:[]  []
> > read_hpet+0x16/0x20
> > [24998.083577]  [] ktime_get+0x3e/0xa0
> > [24998.084450]  [] tick_sched_timer+0x23/0x160
> > [24998.085315]  [] __run_hrtimer+0x76/0x1f0
> > [24998.086173]  [] ? tick_init_highres+0x20/0x20
> > [24998.087025]  [] hrtimer_interrupt+0x107/0x260
> > [24998.087877]  [] local_apic_timer_interrupt+0x3b/0x70
> > [24998.088732]  [] smp_apic_timer_interrupt+0x45/0x60
> > [24998.089583]  [] apic_timer_interrupt+0x6f/0x80
> > [24998.090435]  
> > [24998.091279]  [] ? __remove_hrtimer+0x4e/0xa0
> > [24998.092118]  [] ? ipcget+0x8a/0x1e0
> > [24998.092951]  [] ? ipcget+0x7c/0x1e0
> > [24998.093779]  [] SyS_msgget+0x4d/0x70
> > 
> > and I think that's the smoking gun. The reason CPU0 isn't picking up
> > any IPI's is because it is in some endless loop around read_hpet().
> > 
> > There is even time information in the register dump:
> > 
> >  RAX: 61fece8a RBX: 00510792 RCX: 
> >  RAX: 79e588fc RBX: 00511d6e RCX: 
> >  RAX: 91ca7f65 RBX: 00513346 RCX: 
> >  RAX: a9afbd0d RBX: 0051491e RCX: 
> >  RAX: cbd1340c RBX: 0051684a RCX: 
> >  RAX: fb9d303f RBX: 005193fc RCX: 
> >  RAX: 2b67efe4 RBX: 0051c224 RCX: 0004
> > 
> > That RAX value is the value we just read from the HPET, and RBX seems
> > to be monotonically increasing too, so it's likely the sequence
> > counter in ktime_get().
> > 
> > So it's not stuck *inside* read_hpet(), and it's almost certainly not
> > the loop over the sequence counter in ktime_get() either (it's not
> > increasing *that* quickly). But some basically infinite __run_hrtimer
> > thing or something?
> 
> Really interesting.
> 
> So, we're calling __run_hrtimer in a loop:
> 
> while ((node = timerqueue_getnext(>active))) {
>   ...
>   __run_hrtimer(timer, );
>   ...
>   }
> 
> The easy question is how often does trinity call nanosleep?
> 
> Looking at __run_hrtimer(), it drops the lock and runs the function and then
> takes the lock again, maybe enqueueing us again right away.
> 
> timer->state is supposed to protect us from other CPUs jumping in and doing
> something else with the timer, but it feels racey wrt remove_hrtimer().
> Something like this, but I'm not sure how often __hrtimer_start_range_ns gets
> called
> 
> CPU 0 CPU 1
> __run_hrtimer()
> timer->state = HRTIMER_STATE_CALLBACK
> removed from list
> unlock cpu_base->lock
> restrt = fn(timer)
>   __hrtimer_start_range_ns()
>   base = lock_hrtimer_base()
>   ret = remove_hrtimer()
>   finds timer->state = 
> HRTIMER_STATE_CALLBACK
>   does nothing
>   new_base = switch_hrtimer_base()
>   now we're on a different 
> base, different lock
> lock(cpu_base->lock)
> enqueue the timer
>   enqueue the timer

But at the very end this would be detected by the runtime check of the
hrtimer interrupt, which does not trigger. And it would trigger at
some point as ALL cpus including CPU0 in that trace dump make
progress.

Thanks,

tglx


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Fri, 19 Dec 2014, Linus Torvalds wrote:
> Here's another pattern. In your latest thing, every single time that
> CPU1 is waiting for some other CPU to pick up the IPI, we have CPU0
> doing this:
> 
> [24998.060963] NMI backtrace for cpu 0
> [24998.061989] CPU: 0 PID: 2940 Comm: trinity-c150 Not tainted 3.18.0+ #108
> [24998.064073] task: 8801bf3536b0 ti: 880197e0c000 task.ti:
> 880197e0c000
> [24998.065137] RIP: 0010:[]  []
> read_hpet+0x16/0x20
> [24998.083577]  [] ktime_get+0x3e/0xa0
> [24998.084450]  [] tick_sched_timer+0x23/0x160
> [24998.085315]  [] __run_hrtimer+0x76/0x1f0
> [24998.086173]  [] ? tick_init_highres+0x20/0x20
> [24998.087025]  [] hrtimer_interrupt+0x107/0x260
> [24998.087877]  [] local_apic_timer_interrupt+0x3b/0x70
> [24998.088732]  [] smp_apic_timer_interrupt+0x45/0x60
> [24998.089583]  [] apic_timer_interrupt+0x6f/0x80
> [24998.090435]  
> [24998.091279]  [] ? __remove_hrtimer+0x4e/0xa0
> [24998.092118]  [] ? ipcget+0x8a/0x1e0
> [24998.092951]  [] ? ipcget+0x7c/0x1e0
> [24998.093779]  [] SyS_msgget+0x4d/0x70
> 
> and I think that's the smoking gun. The reason CPU0 isn't picking up
> any IPI's is because it is in some endless loop around read_hpet().
> 
> There is even time information in the register dump:
> 
>  RAX: 61fece8a RBX: 00510792 RCX: 
>  RAX: 79e588fc RBX: 00511d6e RCX: 
>  RAX: 91ca7f65 RBX: 00513346 RCX: 
>  RAX: a9afbd0d RBX: 0051491e RCX: 
>  RAX: cbd1340c RBX: 0051684a RCX: 
>  RAX: fb9d303f RBX: 005193fc RCX: 
>  RAX: 2b67efe4 RBX: 0051c224 RCX: 0004
> 
> That RAX value is the value we just read from the HPET, and RBX seems
> to be monotonically increasing too, so it's likely the sequence
> counter in ktime_get().

Here is the full diff of the first and the second splat for CPU0

 task: 8801bf3536b0 ti: 880197e0c000 task.ti: 880197e0c000
 read_hpet+0x16/0x20
 RSP: 0018:88024e203e38  EFLAGS: 0046
-RAX: 61fece8a RBX: 00510792 RCX: 
+RAX: 79e588fc RBX: 00511d6e RCX: 

RAX: 
(0x79e588fc - 0x61fece8a) / 14.318180e6 ~= 28.0061

So HPET @14.3MHz progressed by 28 seconds, which matches the splat
delta between the first and the second one.

25026.001132 - 24998.017355 = 27.9838

RBX:
0x00511d6e - 0x00510792 = 5596 

The sequence counter increments by 2 per tick. So:

28 / (5596/2) ~= 0.0025 s

==> HZ = 400

The sequence counter is even, so ktime_get() will succeed.

 RDX:  RSI: 88024e20c710 RDI: 81c26f40
 RBP: 88024e203e38 R08:  R09: 000f
-R10: 0526 R11: 000f R12: 16bf99600917
+R10: 0526 R11: 000f R12: 16c61e4e2117

R12:
0x16c61e4e2117 - 0x16bf99600917 = 2.8e+10

That's the nanoseconds timestamp: 2.8e10/1e9 = 28

Now that all looks correct. So there is something else going on. After
staring some more at it, I think we are looking at it from the wrong
angle.

The watchdog always detects CPU1 as stuck and we got completely
fixated on the csd_wait() in the stack trace on CPU1. Now we have
stack traces which show a different picture, i.e. CPU1 makes progress
after a gazillion of seconds.

I think we really need to look at CPU1 itself.

AFAICT all these 'stuck' events happen in fully interruptible
context. So an undetected interrupt storm can cause that.

We only detect interrupt storms for unhandled interrupts, but for
those where the handler returns IRQ_HANDLED, we just count them.

For directly handled vectors we do not even have a detection mechanism
at all.

That also might explain the RT throttling. If that storm hits a high
prio task, the throttler will trigger.

Just a theory, but worth to explore, IMO.

So adding a dump of the total interrupt counts to the watchdog trace
might give us some insight.

Debug patch below.

Thanks,

tglx
---
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 501baa9ac1be..2021662663c7 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -315,11 +315,21 @@ asmlinkage __visible void do_softirq(void)
local_irq_restore(flags);
 }
 
+static DEFINE_PER_CPU(unsigned long, irqcnt);
+
+void show_irqcnt(int cpu)
+{
+   pr_emerg("CPU#%d: IRQ %lu NMI %u\n", cpu, this_cpu_read(irqcnt),
+this_cpu_read(irq_stat.__nmi_count));
+}
+
 /*
  * Enter an interrupt context.
  */
 void irq_enter(void)
 {
+   this_cpu_inc(irqcnt);
+
rcu_irq_enter();
if (is_idle_task(current) && !in_interrupt()) {
/*
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 70bf11815f84..f505cc58d354 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -306,6 +306,8 @@ static void watchdog_interrupt_count(void)
 static int

Re: frequent lockups in 3.18rc4

On Fri, Dec 19, 2014 at 12:54 PM, Dave Jones  wrote:
>
> Right now I'm doing Chris' idea of "turn debugging back on,
> and try without serial console".  Shall I try your suggestion
> on top of that ?

Might as well. I doubt it really will make any difference, but I also
don't think it will interact badly in any way, so..

  Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

2014-12-19 Thread Dave Jones

On Fri, Dec 19, 2014 at 12:46:16PM -0800, Linus Torvalds wrote:
 > On Fri, Dec 19, 2014 at 11:51 AM, Linus Torvalds
 >  wrote:
 > >
 > > I do note that we depend on the "new mwait" semantics where we do
 > > mwait with interrupts disabled and a non-zero RCX value. Are there
 > > possibly even any known CPU errata in that area? Not that it sounds
 > > likely, but still..
 > 
 > Remind me what CPU you have in that machine again? The %rax value for
 > the mwait cases in question seems to be 0x32, which is either C7s-HSW
 > or C7s-BDW, and in both cases has the "TLB flushed" flag set.
 > 
 > I'm pretty sure you have a Haswell, I'm just checking. Which model?
 > I'm assuming it's family 6, model 60, stepping 3? I found you
 > mentioning i5-4670T in a perf thread.. That the one?

Yep.

vendor_id   : GenuineIntel
cpu family  : 6
model   : 60
model name  : Intel(R) Core(TM) i5-4670T CPU @ 2.30GHz
stepping: 3
microcode   : 0x1a

 > Anyway, I don't actually believe in any CPU bugs, but you could try
 > "intel_idle.max_cstate=0" and see if that makes any difference, for
 > example.
 > 
 > Or perhaps just "intel_idle.max_cstate=1", which leaves intel_idle
 > active, but gets rid of the deeper sleep states (that incidentally
 > also play games with leave_mm() etc)

So I'm leaving Red Hat on Tuesday, and can realistically only do one
more experiment over the weekend before I give them this box back.

Right now I'm doing Chris' idea of "turn debugging back on,
and try without serial console".  Shall I try your suggestion
on top of that ?

I *hate* for this to be "the one that got away", but we've
at least gotten some good mileage out of this bug in the last
two months.  Who knows, maybe I'll find some new hardware that
will exhibit the same behaviour in the new year.

Dave

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

On Fri, Dec 19, 2014 at 11:51 AM, Linus Torvalds
 wrote:
>
> I do note that we depend on the "new mwait" semantics where we do
> mwait with interrupts disabled and a non-zero RCX value. Are there
> possibly even any known CPU errata in that area? Not that it sounds
> likely, but still..

Remind me what CPU you have in that machine again? The %rax value for
the mwait cases in question seems to be 0x32, which is either C7s-HSW
or C7s-BDW, and in both cases has the "TLB flushed" flag set.

I'm pretty sure you have a Haswell, I'm just checking. Which model?
I'm assuming it's family 6, model 60, stepping 3? I found you
mentioning i5-4670T in a perf thread.. That the one?

Anyway, I don't actually believe in any CPU bugs, but you could try
"intel_idle.max_cstate=0" and see if that makes any difference, for
example.

Or perhaps just "intel_idle.max_cstate=1", which leaves intel_idle
active, but gets rid of the deeper sleep states (that incidentally
also play games with leave_mm() etc)

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

2014-12-19 Thread Dave Jones

On Fri, Dec 19, 2014 at 03:31:36PM -0500, Chris Mason wrote:

 > > So it's not stuck *inside* read_hpet(), and it's almost certainly not
 > > the loop over the sequence counter in ktime_get() either (it's not
 > > increasing *that* quickly). But some basically infinite __run_hrtimer
 > > thing or something?
 > 
 > Really interesting.
 > 
 > So, we're calling __run_hrtimer in a loop:
 > 
 > while ((node = timerqueue_getnext(>active))) {
 >  ...
 >  __run_hrtimer(timer, );
 >  ...
 >  }
 > 
 > The easy question is how often does trinity call nanosleep?

It shouldn't call it directly. (syscalls/nanosleep.c)

/*
 * SYSCALL_DEFINE2(nanosleep, struct timespec __user *, rqtp, struct timespec 
__user *, rmtp)
 */
#include "sanitise.h"

struct syscallentry syscall_nanosleep = {
.name = "nanosleep",
.num_args = 2,
.arg1name = "rqtp",
.arg1type = ARG_ADDRESS,
.arg2name = "rmtp",
.arg2type = ARG_ADDRESS,
.flags = AVOID_SYSCALL, // Boring.  Can cause long sleeps.
};


That last line being the key one.  We used to do it, but it's well.. boring.
We could do something smarter, but given it's never triggered anything
interesting in the past, focussing runtime on the more interesting
syscalls seems to have been more fruitful.

Dave

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4

2014-12-19 Thread Chris Mason

On Fri, Dec 19, 2014 at 11:15:21AM -0800, Linus Torvalds wrote:
> Here's another pattern. In your latest thing, every single time that
> CPU1 is waiting for some other CPU to pick up the IPI, we have CPU0
> doing this:
> 
> [24998.060963] NMI backtrace for cpu 0
> [24998.061989] CPU: 0 PID: 2940 Comm: trinity-c150 Not tainted 3.18.0+ #108
> [24998.064073] task: 8801bf3536b0 ti: 880197e0c000 task.ti:
> 880197e0c000
> [24998.065137] RIP: 0010:[]  []
> read_hpet+0x16/0x20
> [24998.083577]  [] ktime_get+0x3e/0xa0
> [24998.084450]  [] tick_sched_timer+0x23/0x160
> [24998.085315]  [] __run_hrtimer+0x76/0x1f0
> [24998.086173]  [] ? tick_init_highres+0x20/0x20
> [24998.087025]  [] hrtimer_interrupt+0x107/0x260
> [24998.087877]  [] local_apic_timer_interrupt+0x3b/0x70
> [24998.088732]  [] smp_apic_timer_interrupt+0x45/0x60
> [24998.089583]  [] apic_timer_interrupt+0x6f/0x80
> [24998.090435]  
> [24998.091279]  [] ? __remove_hrtimer+0x4e/0xa0
> [24998.092118]  [] ? ipcget+0x8a/0x1e0
> [24998.092951]  [] ? ipcget+0x7c/0x1e0
> [24998.093779]  [] SyS_msgget+0x4d/0x70
> 
> and I think that's the smoking gun. The reason CPU0 isn't picking up
> any IPI's is because it is in some endless loop around read_hpet().
> 
> There is even time information in the register dump:
> 
>  RAX: 61fece8a RBX: 00510792 RCX: 
>  RAX: 79e588fc RBX: 00511d6e RCX: 
>  RAX: 91ca7f65 RBX: 00513346 RCX: 
>  RAX: a9afbd0d RBX: 0051491e RCX: 
>  RAX: cbd1340c RBX: 0051684a RCX: 
>  RAX: fb9d303f RBX: 005193fc RCX: 
>  RAX: 2b67efe4 RBX: 0051c224 RCX: 0004
> 
> That RAX value is the value we just read from the HPET, and RBX seems
> to be monotonically increasing too, so it's likely the sequence
> counter in ktime_get().
> 
> So it's not stuck *inside* read_hpet(), and it's almost certainly not
> the loop over the sequence counter in ktime_get() either (it's not
> increasing *that* quickly). But some basically infinite __run_hrtimer
> thing or something?

Really interesting.

So, we're calling __run_hrtimer in a loop:

while ((node = timerqueue_getnext(>active))) {
...
__run_hrtimer(timer, );
...
}

The easy question is how often does trinity call nanosleep?

Looking at __run_hrtimer(), it drops the lock and runs the function and then
takes the lock again, maybe enqueueing us again right away.

timer->state is supposed to protect us from other CPUs jumping in and doing
something else with the timer, but it feels racey wrt remove_hrtimer().
Something like this, but I'm not sure how often __hrtimer_start_range_ns gets
called

CPU 0   CPU 1
__run_hrtimer()
timer->state = HRTIMER_STATE_CALLBACK
removed from list
unlock cpu_base->lock
restrt = fn(timer)
__hrtimer_start_range_ns()
base = lock_hrtimer_base()
ret = remove_hrtimer()
finds timer->state = 
HRTIMER_STATE_CALLBACK
does nothing
new_base = switch_hrtimer_base()
now we're on a different 
base, different lock
lock(cpu_base->lock)
enqueue the timer
enqueue the timer
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: frequent lockups in 3.18rc4