Re: frequent lockups in 3.18rc4
On Thu, Feb 12, 2015 at 3:09 AM, Martin van Es wrote:
>
> Best I can come up with now is try the next mainline that has all the
> fixes and ideas in this thread incorporated. Would that be 3.19?

Yes. I'm attaching a patch (very much experimental - it might introduce
new problems rather than fix old ones) that might also be worth testing
on top of 3.19.

> I'm sorry I couldn't be more helpful.

Hey, so far nobody else has been able to pin this down either. It seems
to be very timing-specific, and it's possible (even likely, considering
DaveJ's adventures) that while you cannot trigger it with 3.16.7, it
might be lurking there too, just not with the kind of timing that can
trigger it on your machine. Which would explain the bisection trouble.

It would have been wonderful if somebody had been able to really
reproduce it truly reliably, but it seems to be very slippery.

                        Linus

 kernel/smp.c | 78
 1 file changed, 47 insertions(+), 31 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index f38a1e692259..2aaac2c47683 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -19,7 +19,7 @@
 enum {
	CSD_FLAG_LOCK		= 0x01,
-	CSD_FLAG_WAIT		= 0x02,
+	CSD_FLAG_SYNCHRONOUS	= 0x02,
 };

 struct call_function_data {
@@ -107,7 +107,7 @@ void __init call_function_init(void)
  */
 static void csd_lock_wait(struct call_single_data *csd)
 {
-	while (csd->flags & CSD_FLAG_LOCK)
+	while (smp_load_acquire(&csd->flags) & CSD_FLAG_LOCK)
		cpu_relax();
 }

@@ -121,19 +121,17 @@ static void csd_lock(struct call_single_data *csd)
	 * to ->flags with any subsequent assignments to other
	 * fields of the specified call_single_data structure:
	 */
-	smp_mb();
+	smp_wmb();
 }

 static void csd_unlock(struct call_single_data *csd)
 {
-	WARN_ON((csd->flags & CSD_FLAG_WAIT) && !(csd->flags & CSD_FLAG_LOCK));
+	WARN_ON(!(csd->flags & CSD_FLAG_LOCK));

	/*
	 * ensure we're all done before releasing data:
	 */
-	smp_mb();
-
-	csd->flags &= ~CSD_FLAG_LOCK;
+	smp_store_release(&csd->flags, 0);
 }

 static DEFINE_PER_CPU_SHARED_ALIGNED(struct call_single_data, csd_data);
@@ -144,13 +142,16 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(struct call_single_data, csd_data);
  * ->func, ->info, and ->flags set.
  */
 static int generic_exec_single(int cpu, struct call_single_data *csd,
-			       smp_call_func_t func, void *info, int wait)
+			       smp_call_func_t func, void *info)
 {
-	struct call_single_data csd_stack = { .flags = 0 };
-	unsigned long flags;
-
	if (cpu == smp_processor_id()) {
+		unsigned long flags;
+
+		/*
+		 * We can unlock early even for the synchronous on-stack case,
+		 * since we're doing this from the same CPU..
+		 */
+		csd_unlock(csd);
		local_irq_save(flags);
		func(info);
		local_irq_restore(flags);
@@ -161,21 +162,9 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
	if ((unsigned)cpu >= nr_cpu_ids || !cpu_online(cpu))
		return -ENXIO;

-	if (!csd) {
-		csd = &csd_stack;
-		if (!wait)
-			csd = this_cpu_ptr(&csd_data);
-	}
-
-	csd_lock(csd);
-
	csd->func = func;
	csd->info = info;

-	if (wait)
-		csd->flags |= CSD_FLAG_WAIT;
-
	/*
	 * The list addition should be visible before sending the IPI
	 * handler locks the list to pull the entry off it because of
@@ -190,9 +179,6 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
	if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
		arch_send_call_function_single_ipi(cpu);

-	if (wait)
-		csd_lock_wait(csd);
-
	return 0;
 }

@@ -250,8 +236,17 @@ static void flush_smp_call_function_queue(bool warn_cpu_offline)
	}

	llist_for_each_entry_safe(csd, csd_next, entry, llist) {
-		csd->func(csd->info);
-		csd_unlock(csd);
+		smp_call_func_t func = csd->func;
+		void *info = csd->info;
+
+		/* Do we wait until *after* callback? */
+		if (csd->flags & CSD_FLAG_SYNCHRONOUS) {
+			func(info);
+			csd_unlock(csd);
+		} else {
+			csd_unlock(csd);
+			func(info);
+		}
	}

	/*
@@ -274,6 +269,8 @@ static void flush_smp_call_function_queue(bool warn_cpu_offline)
 int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
			     int wait)
 {
+	struct call_single_data *csd;
+	struct call_single_data
Re: frequent lockups in 3.18rc4
To follow up on this long standing promise to bisect. I've made two
attempts at bisecting and both landed in limbo. It's hard to explain but
it feels like this bug has quantum properties; I know for sure it's
present in 3.17 and not in 3.16(.7). But once I start bisecting it gets
less pronounced. I know it sounds vague, but the best explanations I can
come up with are:

- The out-of-kernel dvb-c driver and hardware are playing funny games
  (although being absolutely stable in 3.16?)
- and/or the problem consists of two or more commits that independently
  don't express themselves like I see in 3.17?

So, my second bisecting ended at 3be738ad55d8e2b9d949eb0d830de5aa4d4f8e05,
which is nonsense because it's a commit to a staging module that I don't
even compile.

Best I can come up with now is try the next mainline that has all the
fixes and ideas in this thread incorporated. Would that be 3.19?

I'm sorry I couldn't be more helpful.

Best regards,
Martin

On Sat, Dec 6, 2014 at 10:14 PM, Martin van Es wrote:
> On Sat, Dec 6, 2014 at 9:09 PM, Linus Torvalds wrote:
>> On Sat, Dec 6, 2014 at 8:22 AM, Martin van Es wrote:
>>>
>>> Hope this may help in finding the right direction for this bug?
>>
>> If you can reproduce it with your spare J1900 system and could perhaps
>> bisect it there, that would be a huge help.
>
> I'll give it a shot and see if I can get to freeze it on 3.17.3 as a
> start by playing content from the prd backend, but don't expect fast
> response times... busy man...
>
> M.

--
If 'but' was any useful, it would be a logic operator
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: frequent lockups in 3.18rc4
On Sun, 21 Dec 2014, Linus Torvalds wrote:
> On Sun, Dec 21, 2014 at 2:32 PM, Dave Jones wrote:
> > On Sun, Dec 21, 2014 at 02:19:03PM -0800, Linus Torvalds wrote:
> > >
> > > And finally, and stupidly, is there any chance that you have anything
> > > accessing /dev/hpet?
> >
> > Not knowingly at least, but who the hell knows what systemd has its
> > fingers in these days.
>
> Actually, it looks like /dev/hpet doesn't allow write access.
>
> I can do the mmap(/dev/mem) thing and access the HPET by hand, and
> when I write zero to it I immediately get something like this:
>
>   Clocksource tsc unstable (delta = -284317725450 ns)
>   Switched to clocksource hpet
>
> just to confirm that yes, a jump in the HPET counter would indeed give
> those kinds of symptoms: blaming the TSC with a negative delta in the
> 0-300s range, even though it's the HPET that is broken.
>
> And if the HPET then occasionally jumps around afterwards, it would
> show up as ktime_get() occasionally going backwards, which in turn
> would - as far as I can tell - result in exactly that pseudo-infinite
> loop with timers.
>
> Anyway, any wild kernel pointer access *could* happen to just hit the
> HPET and write to the main counter value, although I'd personally be
> more inclined to blame BIOS/SMM kind of code playing tricks with
> time.. We do have a few places where we explicitly write the value on
> purpose, but they are in the HPET init code, and in the clocksource
> resume code, so they should not be involved.

Right.

> Thomas - have you had reports of HPET breakage in RT circles, the same
> way BIOSes have been tinkering with TSC?

Not that I'm aware of.

> Also, would it perhaps be a good idea to make "ktime_get()" save the
> last time in a percpu variable, and warn if time ever goes backwards
> on a particular CPU? A percpu thing should be pretty cheap, even if
> we write to it every time somebody asks for time..

That should be simple enough to implement.

Thanks,

	tglx
Re: frequent lockups in 3.18rc4
On Mon, Jan 5, 2015 at 5:25 PM, Linus Torvalds wrote:
> On Mon, Jan 5, 2015 at 5:17 PM, John Stultz wrote:
>>
>> Anyway, It may be worth keeping the 50% margin (and dropping the 12%
>> reduction to simplify things)
>
> Again, the 50% margin is only on the multiplication overflow. Not on the mask.

Right, but we calculate the mult value based on the mask (or 10 mins,
whichever is shorter). So then when we go back and calculate the
max_cycles/max_idle_ns using the mult, we end up with a value smaller
than the mask. So the scheduler shouldn't push idle times out beyond
that, and the debug logic in my patch should be able to catch strangely
large values.

> So it won't do anything at all for the case we actually care about,
> namely a broken HPET, afaik.

Yea, the case my code doesn't catch that yours did is for slightly
broken clocksources (I'm thinking two cpus with virtual hpets embedded
in them that are slightly off) where you could get negative deltas right
after the update. In that case the capping on read is really needed,
since by the next update the stale value has grown large enough to look
like a reasonable offset. The TSC has a similar issue, but it's easier
to check for negative values because it won't reasonably ever overflow.

> I'd much rather limit to 50% of the mask too.

Ok, I'll try to rework the code to make this choice and make it more
explicitly clear.

> Also, why do we actually play games with ilog2 for that overflow
> calculation? It seems pointless. This is for the setup code, doing a
> real division there would seem to be a whole lot more straightforward,
> and not need that big comment. And there's no performance issue. Am I
> missing something?

I feel like there was a time when this may have been called by some of
the clocksource code if it changed frequency (I think over
suspend/resume), but I'm not seeing it in the current source. So yea,
likely something to simplify.

>> I've also got a capping patch that I'm testing that keeps time reads
>> from passing that interval. The only thing I'm really cautious about
>> with that change is that we have to make sure the hrtimer that
>> triggers update_wall_clock is always set to expire within that cap (I
>> need to review it again) or else we'll hang ourselves.
>
> Yeah, that thing is fragile. And quite possibly part of the problem.

"Time is a flat circle..." and thus unfortunately requires some circular
logic. :)

thanks
-john
Re: frequent lockups in 3.18rc4
On Mon, Jan 5, 2015 at 5:17 PM, John Stultz wrote:
>
> Anyway, It may be worth keeping the 50% margin (and dropping the 12%
> reduction to simplify things)

Again, the 50% margin is only on the multiplication overflow. Not on the
mask.

So it won't do anything at all for the case we actually care about,
namely a broken HPET, afaik.

I'd much rather limit to 50% of the mask too.

Also, why do we actually play games with ilog2 for that overflow
calculation? It seems pointless. This is for the setup code, doing a
real division there would seem to be a whole lot more straightforward,
and not need that big comment. And there's no performance issue. Am I
missing something?

> I've also got a capping patch that I'm testing that keeps time reads
> from passing that interval. The only thing I'm really cautious about
> with that change is that we have to make sure the hrtimer that
> triggers update_wall_clock is always set to expire within that cap (I
> need to review it again) or else we'll hang ourselves.

Yeah, that thing is fragile. And quite possibly part of the problem.

                        Linus
Re: frequent lockups in 3.18rc4
On Sun, Jan 4, 2015 at 11:46 AM, Linus Torvalds wrote:
> On Fri, Jan 2, 2015 at 4:27 PM, John Stultz wrote:
>>
>> So I sent out a first step validation check to warn us if we end up
>> with idle periods that are larger than we expect.
>
> .. not having tested it, this is just from reading the patch, but it
> would *seem* that it doesn't actually validate the clock reading much
> at all.
>
> Why? Because most of the time, for crap clocks like HPET, the real
> limitation will be not the multiplication overflow, but the "mask",
> which is just 32-bit (or worse - I think the ACPI PM timer might be
> just 24 bits).
>
> So then you effectively "validate" that the timer difference value
> fits in mask, but that isn't any validation at all - it's just a
> truism. Since we by definition mask the difference to just the valid
> bitmask.
>
> So I really think that the maximum valid clock needs to be narrowed
> down from the "technically, this clock can count to X".
>
> But maybe I'm wrong, and the multiplication overflow is actually often
> the real limit. What are the actual values for real timer sources?

As you point out, for clocksources where the mask is 32 bits or under,
we shouldn't have any risk of multiplication overflow, since mult is a
32-bit value. So yes, the max_cycles on those probably should be the
same as the mask, and it isn't useful on those clocksources to test if
we run over (though warning if we're within the 12% margin could be
useful).

But for clocksources that have larger masks, it still could be a useful
check (@2GHz, TSCs overflow 32 bits in 2 seconds), although the mult
value targets a mult overflow at ~10 minutes, so it's less likely that
we really hit it. However, it ends up the calculations we use are a
little more conservative, treating the result as signed, so they avoid
multiplications that could run into that sign bit.

This looks like an error to me (the code is also used for clockevents,
so I haven't run through to see if the requirements are different
there), but a conservative one, which results in the maximum idle
interval being ~halved from what it could be (and we add yet another
12% margin on that - so we probably need to just pick one or the other,
not both). So even on 32-bit masks, max_cycles in my patch is smaller
than 32 bits. That's why I was able to hit both warnings in my testing
with the hpet by sending SIGSTOP to qemu.

Anyway, it may be worth keeping the 50% margin (and dropping the 12%
reduction to simplify things), since I've not heard recent complaints
about timekeeping limiting idle lengths (but could be wrong here). This
would give you something closer to the 1/8th of the mask that you were
using in your patch (and on larger-mask clocksources, we do already cap
the interval at 10 minutes - so most really crazy values would be
caught for clocksources like the TSC - and maybe we can make this more
configurable so we can shorten it as done in your patch to try to debug
things).

I've also got a capping patch that I'm testing that keeps time reads
from passing that interval. The only thing I'm really cautious about
with that change is that we have to make sure the hrtimer that triggers
update_wall_clock is always set to expire within that cap (I need to
review it again) or else we'll hang ourselves.

thanks
-john
Re: frequent lockups in 3.18rc4
On Fri, Jan 2, 2015 at 4:27 PM, John Stultz wrote:
>
> So I sent out a first step validation check to warn us if we end up
> with idle periods that are larger than we expect.

.. not having tested it, this is just from reading the patch, but it
would *seem* that it doesn't actually validate the clock reading much
at all.

Why? Because most of the time, for crap clocks like HPET, the real
limitation will be not the multiplication overflow, but the "mask",
which is just 32-bit (or worse - I think the ACPI PM timer might be
just 24 bits).

So then you effectively "validate" that the timer difference value fits
in mask, but that isn't any validation at all - it's just a truism.
Since we by definition mask the difference to just the valid bitmask.

So I really think that the maximum valid clock needs to be narrowed
down from the "technically, this clock can count to X".

But maybe I'm wrong, and the multiplication overflow is actually often
the real limit. What are the actual values for real timer sources?

                        Linus
Re: frequent lockups in 3.18rc4
On 01/02/2015 07:27 PM, John Stultz wrote: > On Fri, Dec 26, 2014 at 12:57 PM, Linus Torvalds > wrote: >> > On Fri, Dec 26, 2014 at 10:12 AM, Dave Jones >> > wrote: >>> >> On Fri, Dec 26, 2014 at 11:34:10AM -0500, Dave Jones wrote: >>> >> >>> >> > One thing I think I'll try is to try and narrow down which >>> >> > syscalls are triggering those "Clocksource hpet had cycles off" >>> >> > messages. I'm still unclear on exactly what is doing >>> >> > the stomping on the hpet. >>> >> >>> >> First I ran trinity with "-g vm" which limits it to use just >>> >> a subset of syscalls, specifically VM related ones. >>> >> That triggered the messages. Further experiments revealed: >> > >> > So I can trigger the false positives with my original patch quite >> > easily by just putting my box under some load. My numbers are nowhere >> > near as bad as yours, but then, I didn't put it under as much load >> > anyway. Just a regular "make -j64" of the kernel. >> > >> > I suspect your false positives are bigger partly because of the load, >> > but mostly because you presumably have preemption enabled too. I don't >> > do preemption in my normal kernels, and that limits the damage of the >> > race a bit. >> > >> > I have a newer version of the patch that gets rid of the false >> > positives with some ordering rules instead, and just for you I hacked >> > it up to say where the problem happens too, but it's likely too late. >> > >> > The fact that the original racy patch seems to make a difference for >> > you does say that yes, we seem to be zeroing in on the right area >> > here, but I'm not seeing what's wrong. I was hoping for big jumps from >> > your HPET, since your "TSC unstable" messages do kind of imply that >> > such really big jumps can happen. >> > >> > I'm attaching my updated hacky patch, although I assume it's much too >> > late for that machine. 
Don't look too closely at the backtrace >> > generation part, that's just a quick hack, and only works with frame >> > pointers enabled anyway. >> > >> > So I'm still a bit unhappy about not figuring out *what* is wrong. And >> > I'd still like the dmidecode from that machine, just for posterity. In >> > case we can figure out some pattern. >> > >> > So right now I can imagine several reasons: >> > >> > - actual hardware bug. >> > >> >This is *really* unlikely, though. It should hit everybody. The >> > HPET is in the core intel chipset, we're not talking random unusual >> > hardware by fly-by-night vendors here. >> > >> > - some SMM/BIOS "power management" feature. >> > >> >We've seen this before, where the SMM saves/restores the TSC on >> > entry/exit in order to hide itself from the system. I could imagine >> > similar code for the HPET counter. SMM writers use some bad drugs to >> > dull their pain. >> > >> >And with the HPET counter, since it's not even per-CPU, the "save >> > and restore HPET" will actually show up as "HPET went backwards" to >> > the other non-SMM CPU's if it happens >> > >> > - a bug in our own clocksource handling. >> > >> >I'm not seeing it. But maybe my patch hides it for some magical reason. > So I sent out a first step validation check to warn us if we end up > with idle periods that are larger than we expect. > > It doesn't yet cap the timekeeping_get_ns() output (like your patch > effectively does), but it would be easy to do that in a following > patch. > > I did notice while testing this that the max_idle_ns (max idle time we > report to the scheduler) for the hpet is only ~16 sec, and we'll > overflow after just ~21 seconds. This second number maps closely to the > 22 second stalls seen in the nmi watchdog reports, which seems > interesting, but I also realize that qemu uses a 100MHz hpet, whereas > real hardware is likely to be a bit slower, so maybe that's just > chance.. 
> > I'd be interested if folks seeing anything similar to Dave would give > my patch a shot. I ran it overnight, but I didn't see any of the new warnings in the logs. Thanks, Sasha
Re: frequent lockups in 3.18rc4
On Fri, Dec 26, 2014 at 12:57 PM, Linus Torvalds wrote: > On Fri, Dec 26, 2014 at 10:12 AM, Dave Jones wrote: >> On Fri, Dec 26, 2014 at 11:34:10AM -0500, Dave Jones wrote: >> >> > One thing I think I'll try is to try and narrow down which >> > syscalls are triggering those "Clocksource hpet had cycles off" >> > messages. I'm still unclear on exactly what is doing >> > the stomping on the hpet. >> >> First I ran trinity with "-g vm" which limits it to use just >> a subset of syscalls, specifically VM related ones. >> That triggered the messages. Further experiments revealed: > > So I can trigger the false positives with my original patch quite > easily by just putting my box under some load. My numbers are nowhere > near as bad as yours, but then, I didn't put it under as much load > anyway. Just a regular "make -j64" of the kernel. > > I suspect your false positives are bigger partly because of the load, > but mostly because you presumably have preemption enabled too. I don't > do preemption in my normal kernels, and that limits the damage of the > race a bit. > > I have a newer version of the patch that gets rid of the false > positives with some ordering rules instead, and just for you I hacked > it up to say where the problem happens too, but it's likely too late. > > The fact that the original racy patch seems to make a difference for > you does say that yes, we seem to be zeroing in on the right area > here, but I'm not seeing what's wrong. I was hoping for big jumps from > your HPET, since your "TSC unstable" messages do kind of imply that > such really big jumps can happen. > > I'm attaching my updated hacky patch, although I assume it's much too > late for that machine. Don't look too closely at the backtrace > generation part, that's just a quick hack, and only works with frame > pointers enabled anyway. > > So I'm still a bit unhappy about not figuring out *what* is wrong. And > I'd still like the dmidecode from that machine, just for posterity. 
In > case we can figure out some pattern. > > So right now I can imagine several reasons: > > - actual hardware bug. > >This is *really* unlikely, though. It should hit everybody. The > HPET is in the core intel chipset, we're not talking random unusual > hardware by fly-by-night vendors here. > > - some SMM/BIOS "power management" feature. > >We've seen this before, where the SMM saves/restores the TSC on > entry/exit in order to hide itself from the system. I could imagine > similar code for the HPET counter. SMM writers use some bad drugs to > dull their pain. > >And with the HPET counter, since it's not even per-CPU, the "save > and restore HPET" will actually show up as "HPET went backwards" to > the other non-SMM CPU's if it happens > > - a bug in our own clocksource handling. > >I'm not seeing it. But maybe my patch hides it for some magical reason. So I sent out a first step validation check to warn us if we end up with idle periods that are larger than we expect. It doesn't yet cap the timekeeping_get_ns() output (like your patch effectively does), but it would be easy to do that in a following patch. I did notice while testing this that the max_idle_ns (max idle time we report to the scheduler) for the hpet is only ~16 sec, and we'll overflow after just ~21 seconds. This second number maps closely to the 22 second stalls seen in the nmi watchdog reports, which seems interesting, but I also realize that qemu uses a 100MHz hpet, whereas real hardware is likely to be a bit slower, so maybe that's just chance.. I'd be interested if folks seeing anything similar to Dave would give my patch a shot. thanks -john
Re: frequent lockups in 3.18rc4
On Mon, Dec 22, 2014 at 04:46:42PM -0800, Linus Torvalds wrote: > On Mon, Dec 22, 2014 at 3:59 PM, John Stultz wrote: > > > > * So 1/8th of the interval seems way too short, as there's > > clocksources like the ACPI PM, which wrap every 2.5 seconds or so. > > Ugh. At the same time, 1/8th of a range is actually bigger than I'd > like, since if there is some timer corruption, it means that we only > catch it when it's really big. > > But as I said, I'd actually prefer it to be time-based, because it > would be good if this approach worked on things like the TSC which is > a 64-bit counter.. > > So yes, that capping was very much arbitrary, and was mostly a case of > "this works with the one timer source that I can easily trigger" > > > * I suspect something closer to the clocksource_max_deferment() value > > (which I think is max interval before multiplication overflows could > > happen - ~12%) which we use in the scheduler would make more sense. > > Especially since the timer scheduler uses that to calculate how long > > we can idle for. > > I'd rather not be anywhere *close* to any overflow problems. Even for > the scheduler all-idle case, I'd argue that there is rather quickly > diminishing returns. Yes, a thousand timer interrupts per second are > expensive and a noticeable power draw. The difference between "one > timer interrupt every two seconds" and "every 20 seconds" is rather > less noticeable. > > Of course, reasonable clock sources have *much* longer periods than a > second (yeah, the acpi pm timer really isn't a good one), so there are > probably good middle grounds. The 1/8th was a hack, and one that was > aware of the 300s cycle of the HPET at that.. I of course very much like the idea of the timekeeping system doing a bit of self-checking. ;-) Can we simplify things by just not doing self-checking on clocks that overflow in less than a few minutes? 
In other words, if someone reports oddball RCU CPU stall warnings when using the ACPI PM timer, am I within my rights to tell them to reproduce the stall using a better timer? If so, we could possibly get away with the assumption that preemptions don't last longer than the soft-lockup interval, which is currently a bit over 20 seconds (hence "a few minutes" above). And yes, you might avoid a soft lockup by having a series of 15-second preemptions between each pair of instructions, but my response would be to increase my "a few minutes" a bit and to invoke probabilities. I don't claim to understand the timer code for all the reasons that Linus calls out below, but I believe that this simplifying assumption would in turn simplify the self-check code. Thanx, Paul > > * Nulling out delta in timekeeping_get_ns() seems like it could cause > > problems since time would then possibly go backwards compared to > > previous reads (as you mentioned, resulting in smaller time jumps). > > Instead it would probably make more sense to cap the delta at the > > maximum value (though this assumes the clock doesn't jump back in the > > interval before the next call to update_wall_time). > > So part of the nulling was that it was simpler, and part of it was > that I expected to get backwards jumps (see the other email to Dave > about the inherent races). And with the whole timer mask modulo > arithmetic, those backwards jumps just look like biggish positive > numbers, not even negative. So it ends up being things like "is it an > unsigned number larger than half the mask? Consider it negative" etc. > > The "zero it out" was simple, and it worked for my test-case, which > was "ok, my machine no longer locks up when I mess with the timer". > > And I didn't post the earlier versions of that patch that didn't even *boot*. 
> > I started out trying to do it at a higher level (not on a clock read > level, but outside the whole 'convert-to-ns and do the sequence value > check'), but during bootup we play a lot of games with initializing > the timer sources etc. > > So that explains the approach of doing it at that > >cycle_now = tkr->read(tkr->clock); > > level, and keeping it very low-level. > > But as I already explained in the email that crossed, that low-level > thing also results in some fundamental races. > > > * Also, as you note, this would just cause the big time jump to only > > happen at the next update, since there's no logic in > > update_wall_time() to limit the jump. I'm not sure if "believing" the > > large jump at write time makes that much more sense, though. > > So I considered just capping it there (to a single interval or > something). Again, just ignoring - like the read side does - it would > have been easier, but at the same time I *really* wanted to make time > go forward, so just taking the big value seemed safest. > > But yes, this was very much an RFC patch. It's not even ready for real > use, as DaveJ found out (although it might be good enough in practice, > despite its flaws) > > > * Finally, you're writing to error while only holding a read lock, but that's
Re: frequent lockups in 3.18rc4
On Fri, Dec 26, 2014 at 07:14:55PM -0800, Linus Torvalds wrote: > On Fri, Dec 26, 2014 at 4:36 PM, Dave Jones wrote: > > > > > > Oh - and have you actually seen the "TSC unstable (delta = xyz)" + > > > "switched to hpet" messages there yet? > > > > not yet. 3 hrs in. > > Ok, so then the > > INFO: rcu_preempt detected stalls on CPUs/tasks: > > has nothing to do with HPET, since you'd still be running with the TSC > enabled. right. 16hrs later, that's the only thing that's spewed. > My googling around did find a number of "machine locks up a few hours > after switching to hpet" reports, so it is possible that the whole rcu > stall and nmi watchdog thing is independent and unrelated to the > actual locking up. possible. I'm heading home in a few hours to start the wipe of that box. This is going to be 'the one that got away', but at least we've managed to find a number of other things that needed fixing along the way. Dave
Re: frequent lockups in 3.18rc4
On Fri, Dec 26, 2014 at 4:36 PM, Dave Jones wrote: > > > > Oh - and have you actually seen the "TSC unstable (delta = xyz)" + > > "switched to hpet" messages there yet? > > not yet. 3 hrs in. Ok, so then the INFO: rcu_preempt detected stalls on CPUs/tasks: has nothing to do with HPET, since you'd still be running with the TSC enabled. My googling around did find a number of "machine locks up a few hours after switching to hpet" reports, so it is possible that the whole rcu stall and nmi watchdog thing is independent and unrelated to the actual locking up. It *is* intriguing that my broken patch seemed to prevent it from happening, though. And both NMI watchdogs and the rcu stall are related to wall-clock time. But hey, maybe there really is some odd loop in the kernel that stops scheduling or RCU grace periods. It just seems to be never caught by your backtraces.. Linus
Re: frequent lockups in 3.18rc4
On Fri, Dec 26, 2014 at 03:30:20PM -0800, Linus Torvalds wrote: > On Fri, Dec 26, 2014 at 2:57 PM, Dave Jones wrote: > > > > still running though.. > > Btw, did you ever boot with "tsc=reliable" as a kernel command line option? I'll check it again in the morning, but before I turn in for the night, so far the only thing is this:
[ 6713.394395] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 6713.394489] Tasks blocked on level-0 rcu_node (CPUs 0-7):
[ 6713.394513] Tasks blocked on level-0 rcu_node (CPUs 0-7):
[ 6713.394536] (detected by 3, t=6502 jiffies, g=141292, c=141291, q=0)
[ 6713.394564] INFO: Stall ended before state dump start
Dave
Re: frequent lockups in 3.18rc4
On Fri, Dec 26, 2014 at 03:30:20PM -0800, Linus Torvalds wrote: > On Fri, Dec 26, 2014 at 2:57 PM, Dave Jones wrote: > > > > still running though.. > > Btw, did you ever boot with "tsc=reliable" as a kernel command line option? I don't think so. > For the last night, can you see if you can just run it with that, and > things work? Sure. > So I'm not even convinced that trying to debug some HPET issue is > really worth it. Especially if your machine is a preproduction board > from Intel. Yeah, I agree. Even though it's strange that this only became a problem these last few months for me, after over a year of abuse. Hopefully the new year brings more trustworthy hardware. Dave
Re: frequent lockups in 3.18rc4
On Fri, Dec 26, 2014 at 03:16:41PM -0800, Linus Torvalds wrote:
> On Fri, Dec 26, 2014 at 2:57 PM, Dave Jones wrote:
> >
> > hm.
>
> So with the previous patch that had the false positives, you never saw
> this? You saw the false positives instead?

correct.

> I'm wondering if the added debug noise just ended up helping. Doing a
> printk() will automatically cause some scheduler activity. And they
> also caused the time reading jiggle.
>
> That said, it's also possible that I screwed something up in the
> second version of the patch, just breaking it and making it generally
> ineffective.
>
> Oh - and have you actually seen the "TSC unstable (delta = xyz)" +
> "switched to hpet" messages there yet?

not yet. 3 hrs in.

Dave
Re: frequent lockups in 3.18rc4
On Fri, Dec 26, 2014 at 2:57 PM, Dave Jones wrote:
>
> still running though..

Btw, did you ever boot with "tsc=reliable" as a kernel command line option? For the last night, can you see if you can just run it with that, and things work?

Because by now, my gut feel is that we should start derating the HPET rather than the TSC, especially going forward on modern hardware. And if this is some subtle timing issue with our hpet code (the whole -ETIME thing when getting close to setting a timer is subtle, for example, even if the HPET hardware itself would be ok), I'm wondering if the fix isn't to just stop believing in HPET if there are better alternatives around.

So I'm not even convinced that trying to debug some HPET issue is really worth it. Especially if your machine is a preproduction board from Intel. But verifying that with just the TSC everything is ok might still be worth it.

Linus
Re: frequent lockups in 3.18rc4
On Fri, Dec 26, 2014 at 2:57 PM, Dave Jones wrote:
>
> hm.

So with the previous patch that had the false positives, you never saw this? You saw the false positives instead?

I'm wondering if the added debug noise just ended up helping. Doing a printk() will automatically cause some scheduler activity. And they also caused the time reading jiggle.

That said, it's also possible that I screwed something up in the second version of the patch, just breaking it and making it generally ineffective.

Oh - and have you actually seen the "TSC unstable (delta = xyz)" + "switched to hpet" messages there yet?

Linus
Re: frequent lockups in 3.18rc4
On Fri, Dec 26, 2014 at 12:57:07PM -0800, Linus Torvalds wrote:
> I have a newer version of the patch that gets rid of the false
> positives with some ordering rules instead, and just for you I hacked
> it up to say where the problem happens too, but it's likely too late.

hm.

[ 2733.047100] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 2733.047188] Tasks blocked on level-0 rcu_node (CPUs 0-7): P25811
[ 2733.047216] Tasks blocked on level-0 rcu_node (CPUs 0-7): P25811
[ 2733.047242] (detected by 0, t=6502 jiffies, g=52141, c=52140, q=0)
[ 2733.047271] trinity-c406R running task13416 25811 24907 0x
[ 2733.047305] 88022208fd28 0002 a819f627 8801df2c
[ 2733.047341] 001d31c0 0002 88022208ffd8 001d31c0
[ 2733.047375] 8800806e1780 8801df2c 88022208fd18 88022208ffd8
[ 2733.047411] Call Trace:
[ 2733.047429] [] ? context_tracking_user_exit+0x67/0x280
[ 2733.047457] [] preempt_schedule_irq+0x52/0xb0
[ 2733.047482] [] retint_kernel+0x20/0x30
[ 2733.047505] [] ? check_kill_permission+0xb1/0x1e0
[ 2733.047531] [] ? check_kill_permission+0x152/0x1e0
[ 2733.047557] [] group_send_sig_info+0x65/0x150
[ 2733.047581] [] ? group_send_sig_info+0x5/0x150
[ 2733.047607] [] ? rcu_read_lock_held+0x6e/0x80
[ 2733.047632] [] kill_pid_info+0x78/0x130
[ 2733.047654] [] ? kill_pid_info+0x5/0x130
[ 2733.047677] [] SYSC_kill+0xf2/0x2f0
[ 2733.047699] [] ? SYSC_kill+0x9b/0x2f0
[ 2733.047721] [] ? trace_hardirqs_on+0xd/0x10
[ 2733.047745] [] ? syscall_trace_enter_phase1+0x125/0x1a0
[ 2733.048607] [] ? trace_hardirqs_on_caller+0x10d/0x1d0
[ 2733.049469] [] SyS_kill+0xe/0x10
[ 2733.050332] [] system_call_fastpath+0x12/0x17
[ 2733.051197] trinity-c406R running task13416 25811 24907 0x
[ 2733.052064] 88022208fd28 0002 a819f627 8801df2c
[ 2733.052932] 001d31c0 0002 88022208ffd8 001d31c0
[ 2733.053792] 880209e2c680 8801df2c 88022208fd18 88022208ffd8
[ 2733.054651] Call Trace:
[ 2733.055500] [] ? context_tracking_user_exit+0x67/0x280
[ 2733.056362] [] preempt_schedule_irq+0x52/0xb0
[ 2733.057222] [] retint_kernel+0x20/0x30
[ 2733.058076] [] ? check_kill_permission+0xb1/0x1e0
[ 2733.058930] [] ? check_kill_permission+0x152/0x1e0
[ 2733.059778] [] group_send_sig_info+0x65/0x150
[ 2733.060624] [] ? group_send_sig_info+0x5/0x150
[ 2733.061472] [] ? rcu_read_lock_held+0x6e/0x80
[ 2733.062322] [] kill_pid_info+0x78/0x130
[ 2733.063168] [] ? kill_pid_info+0x5/0x130
[ 2733.064015] [] SYSC_kill+0xf2/0x2f0
[ 2733.064863] [] ? SYSC_kill+0x9b/0x2f0
[ 2733.065704] [] ? trace_hardirqs_on+0xd/0x10
[ 2733.066541] [] ? syscall_trace_enter_phase1+0x125/0x1a0
[ 2733.067384] [] ? trace_hardirqs_on_caller+0x10d/0x1d0
[ 2733.068217] [] SyS_kill+0xe/0x10
[ 2733.069045] [] system_call_fastpath+0x12/0x17
[ 3708.217920] perf interrupt took too long (2503 > 2500), lowering kernel.perf_event_max_sample_rate to 5
[ 4583.530580] request_module: runaway loop modprobe personality-87

still running though..

Dave
Re: frequent lockups in 3.18rc4
On Fri, Dec 26, 2014 at 12:57:07PM -0800, Linus Torvalds wrote:
> I have a newer version of the patch that gets rid of the false
> positives with some ordering rules instead, and just for you I hacked
> it up to say where the problem happens too, but it's likely too late.

I'll give it a spin and see what falls out this evening.

> So I'm still a bit unhappy about not figuring out *what* is wrong. And
> I'd still like the dmidecode from that machine, just for posterity. In
> case we can figure out some pattern.

So this is something I should have done a long time ago. Googling for the board name shows up a very similar report from a year ago, except that was within kvm, and was apparently fixed.

https://lkml.org/lkml/2013/10/9/206
and
https://bugzilla.kernel.org/show_bug.cgi?id=69491

(dmidecode attachment there is pretty much the same as mine)

> - actual hardware bug.
>   This is *really* unlikely, though. It should hit everybody. The
>   HPET is in the core intel chipset, we're not talking random unusual
>   hardware by fly-by-night vendors here.

This machine is allegedly a 'production' box from Intel, but given Kashyap saw something very similar I'm wondering now if there was some board/bios errata for this system. There's a few Intel folks cc'd here, maybe one of those can dig up whether there was anything peculiar about Shark Bay systems that would explain the HPET getting screwed up.

Dave
Re: frequent lockups in 3.18rc4
On Fri, Dec 26, 2014 at 10:12 AM, Dave Jones wrote:
> On Fri, Dec 26, 2014 at 11:34:10AM -0500, Dave Jones wrote:
> >
> > One thing I think I'll try is to try and narrow down which
> > syscalls are triggering those "Clocksource hpet had cycles off"
> > messages. I'm still unclear on exactly what is doing
> > the stomping on the hpet.
>
> First I ran trinity with "-g vm" which limits it to use just
> a subset of syscalls, specifically VM related ones.
> That triggered the messages. Further experiments revealed:

So I can trigger the false positives with my original patch quite easily by just putting my box under some load. My numbers are nowhere near as bad as yours, but then, I didn't put it under as much load anyway. Just a regular "make -j64" of the kernel.

I suspect your false positives are bigger partly because of the load, but mostly because you presumably have preemption enabled too. I don't do preemption in my normal kernels, and that limits the damage of the race a bit.

I have a newer version of the patch that gets rid of the false positives with some ordering rules instead, and just for you I hacked it up to say where the problem happens too, but it's likely too late. The fact that the original racy patch seems to make a difference for you does say that yes, we seem to be zeroing in on the right area here, but I'm not seeing what's wrong. I was hoping for big jumps from your HPET, since your "TSC unstable" messages do kind of imply that such really big jumps can happen.

I'm attaching my updated hacky patch, although I assume it's much too late for that machine. Don't look too closely at the backtrace generation part, that's just a quick hack, and only works with frame pointers enabled anyway.

So I'm still a bit unhappy about not figuring out *what* is wrong. And I'd still like the dmidecode from that machine, just for posterity. In case we can figure out some pattern. So right now I can imagine several reasons:

 - actual hardware bug.
   This is *really* unlikely, though. It should hit everybody. The
   HPET is in the core intel chipset, we're not talking random unusual
   hardware by fly-by-night vendors here.

 - some SMM/BIOS "power management" feature. We've seen this before,
   where the SMM saves/restores the TSC on entry/exit in order to hide
   itself from the system. I could imagine similar code for the HPET
   counter. SMM writers use some bad drugs to dull their pain.

   And with the HPET counter, since it's not even per-CPU, the "save
   and restore HPET" will actually show up as "HPET went backwards" to
   the other non-SMM CPU's if it happens

 - a bug in our own clocksource handling. I'm not seeing it. But maybe
   my patch hides it for some magical reason.

 - gremlins.

So I dunno. I hope more people will look at this after the holidays, even if your machine is gone. My test-program to do bad things to the HPET shows *something*, and works on any machine.

Linus

 arch/x86/kernel/entry_64.S          |  5 +++
 include/linux/timekeeper_internal.h |  1 +
 kernel/time/timekeeping.c           | 78 +++--
 3 files changed, 81 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 9ebaf63ba182..0a4c34b4658e 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -312,6 +312,11 @@ ENTRY(save_paranoid)
 	CFI_ENDPROC
 END(save_paranoid)
 
+ENTRY(save_back_trace)
+	movq %rbp,%rdi
+	jmp do_save_back_trace
+END(save_back_trace)
+
 /*
  * A newly forked process directly context switches into this address.
  *
diff --git a/include/linux/timekeeper_internal.h b/include/linux/timekeeper_internal.h
index 05af9a334893..0fcb60d77079 100644
--- a/include/linux/timekeeper_internal.h
+++ b/include/linux/timekeeper_internal.h
@@ -32,6 +32,7 @@ struct tk_read_base {
 	cycle_t (*read)(struct clocksource *cs);
 	cycle_t mask;
 	cycle_t cycle_last;
+	cycle_t cycle_error;
 	u32 mult;
 	u32 shift;
 	u64 xtime_nsec;
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 6a931852082f..1c924c80b462 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -140,6 +140,7 @@ static void tk_setup_internals(struct timekeeper *tk, struct clocksource *clock)
 	tk->tkr.read = clock->read;
 	tk->tkr.mask = clock->mask;
 	tk->tkr.cycle_last = tk->tkr.read(clock);
+	tk->tkr.cycle_error = 0;
 
 	/* Do the ns -> cycle conversion first, using original mult */
 	tmp = NTP_INTERVAL_LENGTH;
@@ -191,16 +192,59 @@ u32 (*arch_gettimeoffset)(void) = default_arch_gettimeoffset;
 static inline u32 arch_gettimeoffset(void) { return 0; }
 #endif
 
+unsigned long tracebuffer[16];
+
+extern void save_back_trace(long dummy, void *ptr);
+
+void do_save_back_trace(long rbp,
Re: frequent lockups in 3.18rc4
On Fri, Dec 26, 2014 at 11:34:10AM -0500, Dave Jones wrote:
> One thing I think I'll try is to try and narrow down which
> syscalls are triggering those "Clocksource hpet had cycles off"
> messages. I'm still unclear on exactly what is doing
> the stomping on the hpet.

First I ran trinity with "-g vm" which limits it to use just a subset of syscalls, specifically VM related ones. That triggered the messages. Further experiments revealed:

"-c mremap" triggered it, but only when I also passed -C256 to crank up the number of child processes. The same thing occurred with mprotect, madvise, remap_file_pages.

I couldn't trigger it with -c mmap, or msync, mbind, move_pages, migrate_pages, mlock, regardless of how many child processes there were.

Given the high child count necessary to trigger it, it's nigh on impossible to weed through all the calls that trinity made to figure out which one actually triggered the messages. I'm not even convinced that the syscall parameters are even particularly interesting. The "needs high load to trigger" aspect of the bug still has a smell of scheduler interaction or side effect of lock contention. Looking at one child's syscall params in isolation might look quite dull, but if we have N processes hammering on the same mapping, that's probably a lot more interesting.

Dave
Re: frequent lockups in 3.18rc4
On Tue, Dec 23, 2014 at 10:01:25PM -0500, Dave Jones wrote:
> On Mon, Dec 22, 2014 at 03:59:19PM -0800, Linus Torvalds wrote:
> >
> > But in the meantime please do keep that thing running as long as you
> > can. Let's see if we get bigger jumps. Or perhaps we'll get a negative
> > result - the original softlockup bug happening *without* any bigger
> > hpet jumps.
>
> So I've got this box a *little* longer than anticipated.
> It's now been running 30 hours with not a single NMI lockup.
> and that's with my kitchen-sink debugging kernel.
>
> The 'hpet off' messages continue to be spewed, and again they're
> all in the same range of 4293198075 -> 4294967266

In case there was any doubt remaining, it's now been running 3 days, 20 hours with no lockups at all. I haven't seen it run this long in months.

Either tomorrow or Sunday I'm finally wiping that box to give it back on Monday, so if there's anything else you'd like to try, the next 24hrs are pretty much the only remaining time I have.

One thing I think I'll try is to try and narrow down which syscalls are triggering those "Clocksource hpet had cycles off" messages. I'm still unclear on exactly what is doing the stomping on the hpet.

Dave
Re: frequent lockups in 3.18rc4
On Tue, Dec 23, 2014 at 10:01:25PM -0500, Dave Jones wrote: On Mon, Dec 22, 2014 at 03:59:19PM -0800, Linus Torvalds wrote: But in the meantime please do keep that thing running as long as you can. Let's see if we get bigger jumps. Or perhaps we'll get a negative result - the original softlockup bug happening *without* any bigger hpet jumps. So I've got this box a *little* longer than anticipated. It's now been running 30 hours with not a single NMI lockup. and that's with my kitchen-sink debugging kernel. The 'hpet off' messages continue to be spewed, and again they're all in the same range of 4293198075 - 4294967266 In case there was any doubt remaining, it's now been running 3 days, 20 hours with no lockups at all. I haven't seen it run this long in months. Either tomorrow or Sunday I'm finally wiping that box to give it back on Monday, so if there's anything else you'd like to try, the next 24hrs are pretty much the only remaining time I have. One thing I think I'll try is to try and narrow down which syscalls are triggering those Clocksource hpet had cycles off messages. I'm still unclear on exactly what is doing the stomping on the hpet. Dave -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: frequent lockups in 3.18rc4
On Fri, Dec 26, 2014 at 11:34:10AM -0500, Dave Jones wrote: One thing I think I'll try is to try and narrow down which syscalls are triggering those Clocksource hpet had cycles off messages. I'm still unclear on exactly what is doing the stomping on the hpet. First I ran trinity with -g vm which limits it to use just a subset of syscalls, specifically VM related ones. That triggered the messages. Further experiments revealed: -c mremap triggered it, but only when I also passed -C256 to crank up the number of child processes. The same thing occured with mprotect, madvise, remap_file_pages. I couldn't trigger it with -c mmap, or msync, mbind, move_pages, migrate_pages, mlock, regardless of how many child processes there were. Given the high child count necessary to trigger it, it's nigh on impossible to weed through all the calls that trinity made to figure out which one actually triggered the messages. I'm not even convinced that the syscall parameters are even particularly interesting. The needs high load to trigger aspect of the bug still has a smell of scheduler interaction or side effect of lock contention. Looking at one childs syscall params in isolation might look quite dull, but if we have N processes hammering on the same mapping, that's probably a lot more interesting. Dave -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: frequent lockups in 3.18rc4
On Fri, Dec 26, 2014 at 10:12 AM, Dave Jones da...@codemonkey.org.uk wrote: On Fri, Dec 26, 2014 at 11:34:10AM -0500, Dave Jones wrote: One thing I think I'll try is to try and narrow down which syscalls are triggering those Clocksource hpet had cycles off messages. I'm still unclear on exactly what is doing the stomping on the hpet. First I ran trinity with -g vm which limits it to use just a subset of syscalls, specifically VM related ones. That triggered the messages. Further experiments revealed: So I can trigger the false positives with my original patch quite easily by just putting my box under some load. My numbers are nowhere near as bad as yours, but then, I didn't put it under as much load anyway. Just a regular make -j64 of the kernel. I suspect your false positives are bigger partly because of the load, but mostly because you presumably have preemption enabled too. I don't do preemption in my normal kernels, and that limits the damage of the race a bit. I have a newer version of the patch that gets rid of the false positives with some ordering rules instead, and just for you I hacked it up to say where the problem happens too, but it's likely too late. The fact that the original racy patch seems to make a difference for you does say that yes, we seem to be zeroing in on the right area here, but I'm not seeing what's wrong. I was hoping for big jumps from your HPET, since your TSC unstable messages do kind of imply that such really big jumps can happen. I'm attaching my updated hacky patch, although I assume it's much too late for that machine. Don't look too closely at the backtrace generation part, that's just a quick hack, and only works with frame pointers enabled anyway. So I'm still a bit unhappy about not figuring out *what* is wrong. And I'd still like the dmidecode from that machine, just for posterity. In case we can figure out some pattern. So right now I can imagine several reasons: - actual hardware bug. This is *really* unlikely, though. 
It should hit everybody. The HPET is in the core intel chipset, we're not talking random unusual hardware by fly-by-night vendors here. - some SMM/BIOS power management feature. We've seen this before, where the SMM saves/restores the TSC on entry/exit in order to hide itself from the system. I could imagine similar code for the HPET counter. SMM writers use some bad drugs to dull their pain. And with the HPET counter, since it's not even per-CPU, the save and restore HPET will actually show up as HPET went backwards to the other non-SMM CPU's if it happens - a bug in our own clocksource handling. I'm not seeing it. But maybe my patch hides it for some magical reason. - gremlins. So I dunno. I hope more people will look at this after the holidays, even if your machine is gone. My test-program to do bad things to the HPET shows *something*, and works on any machine. Linus arch/x86/kernel/entry_64.S | 5 +++ include/linux/timekeeper_internal.h | 1 + kernel/time/timekeeping.c | 78 +++-- 3 files changed, 81 insertions(+), 3 deletions(-) diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S index 9ebaf63ba182..0a4c34b4658e 100644 --- a/arch/x86/kernel/entry_64.S +++ b/arch/x86/kernel/entry_64.S @@ -312,6 +312,11 @@ ENTRY(save_paranoid) CFI_ENDPROC END(save_paranoid) +ENTRY(save_back_trace) + movq %rbp,%rdi + jmp do_save_back_trace +END(save_back_trace) + /* * A newly forked process directly context switches into this address. 
* diff --git a/include/linux/timekeeper_internal.h b/include/linux/timekeeper_internal.h index 05af9a334893..0fcb60d77079 100644 --- a/include/linux/timekeeper_internal.h +++ b/include/linux/timekeeper_internal.h @@ -32,6 +32,7 @@ struct tk_read_base { cycle_t (*read)(struct clocksource *cs); cycle_t mask; cycle_t cycle_last; + cycle_t cycle_error; u32 mult; u32 shift; u64 xtime_nsec; diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c index 6a931852082f..1c924c80b462 100644 --- a/kernel/time/timekeeping.c +++ b/kernel/time/timekeeping.c @@ -140,6 +140,7 @@ static void tk_setup_internals(struct timekeeper *tk, struct clocksource *clock) tk-tkr.read = clock-read; tk-tkr.mask = clock-mask; tk-tkr.cycle_last = tk-tkr.read(clock); + tk-tkr.cycle_error = 0; /* Do the ns - cycle conversion first, using original mult */ tmp = NTP_INTERVAL_LENGTH; @@ -191,16 +192,59 @@ u32 (*arch_gettimeoffset)(void) = default_arch_gettimeoffset; static inline u32 arch_gettimeoffset(void) { return 0; } #endif +unsigned long tracebuffer[16]; + +extern void save_back_trace(long dummy, void *ptr); + +void do_save_back_trace(long rbp, void *ptr)
Re: frequent lockups in 3.18rc4
On Fri, Dec 26, 2014 at 12:57:07PM -0800, Linus Torvalds wrote: I have a newer version of the patch that gets rid of the false positives with some ordering rules instead, and just for you I hacked it up to say where the problem happens too, but it's likely too late. I'll give it a spin and see what falls out this evening. So I'm still a bit unhappy about not figuring out *what* is wrong. And I'd still like the dmidecode from that machine, just for posterity. In case we can figure out some pattern. So this is something I should have done a long time ago. Googling for the board name shows up a very similar report from a year ago, except that was within kvm, and was aparently fixed. https://lkml.org/lkml/2013/10/9/206 and https://bugzilla.kernel.org/show_bug.cgi?id=69491 (dmidecode attachment there is pretty much the same as mine) - actual hardware bug. This is *really* unlikely, though. It should hit everybody. The HPET is in the core intel chipset, we're not talking random unusual hardware by fly-by-night vendors here. This machine is allegedly a 'production' box from Intel, but given Kashyap saw something very similar I'm wondering now if there was some board/bios errata for this system. There's a few Intel folks cc'd here, maybe one those can dig up whether there was anything peculiar about Shark Bay systems that would explain the HPET getting screwed up. Dave -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: frequent lockups in 3.18rc4
On Fri, Dec 26, 2014 at 12:57:07PM -0800, Linus Torvalds wrote: I have a newer version of the patch that gets rid of the false positives with some ordering rules instead, and just for you I hacked it up to say where the problem happens too, but it's likely too late. hm. [ 2733.047100] INFO: rcu_preempt detected stalls on CPUs/tasks: [ 2733.047188] Tasks blocked on level-0 rcu_node (CPUs 0-7): P25811 [ 2733.047216] Tasks blocked on level-0 rcu_node (CPUs 0-7): P25811 [ 2733.047242] (detected by 0, t=6502 jiffies, g=52141, c=52140, q=0) [ 2733.047271] trinity-c406R running task13416 25811 24907 0x [ 2733.047305] 88022208fd28 0002 a819f627 8801df2c [ 2733.047341] 001d31c0 0002 88022208ffd8 001d31c0 [ 2733.047375] 8800806e1780 8801df2c 88022208fd18 88022208ffd8 [ 2733.047411] Call Trace: [ 2733.047429] [a819f627] ? context_tracking_user_exit+0x67/0x280 [ 2733.047457] [a88522a2] preempt_schedule_irq+0x52/0xb0 [ 2733.047482] [a8859820] retint_kernel+0x20/0x30 [ 2733.047505] [a808a361] ? check_kill_permission+0xb1/0x1e0 [ 2733.047531] [a808a402] ? check_kill_permission+0x152/0x1e0 [ 2733.047557] [a808dc25] group_send_sig_info+0x65/0x150 [ 2733.047581] [a808dbc5] ? group_send_sig_info+0x5/0x150 [ 2733.047607] [a80ed71e] ? rcu_read_lock_held+0x6e/0x80 [ 2733.047632] [a808dee8] kill_pid_info+0x78/0x130 [ 2733.047654] [a808de75] ? kill_pid_info+0x5/0x130 [ 2733.047677] [a808e0b2] SYSC_kill+0xf2/0x2f0 [ 2733.047699] [a808e05b] ? SYSC_kill+0x9b/0x2f0 [ 2733.047721] [a80d7ffd] ? trace_hardirqs_on+0xd/0x10 [ 2733.047745] [a8013765] ? syscall_trace_enter_phase1+0x125/0x1a0 [ 2733.048607] [a80d7f2d] ? 
trace_hardirqs_on_caller+0x10d/0x1d0 [ 2733.049469] [a809079e] SyS_kill+0xe/0x10 [ 2733.050332] [a8858aa2] system_call_fastpath+0x12/0x17 [ 2733.051197] trinity-c406R running task13416 25811 24907 0x [ 2733.052064] 88022208fd28 0002 a819f627 8801df2c [ 2733.052932] 001d31c0 0002 88022208ffd8 001d31c0 [ 2733.053792] 880209e2c680 8801df2c 88022208fd18 88022208ffd8 [ 2733.054651] Call Trace: [ 2733.055500] [a819f627] ? context_tracking_user_exit+0x67/0x280 [ 2733.056362] [a88522a2] preempt_schedule_irq+0x52/0xb0 [ 2733.057222] [a8859820] retint_kernel+0x20/0x30 [ 2733.058076] [a808a361] ? check_kill_permission+0xb1/0x1e0 [ 2733.058930] [a808a402] ? check_kill_permission+0x152/0x1e0 [ 2733.059778] [a808dc25] group_send_sig_info+0x65/0x150 [ 2733.060624] [a808dbc5] ? group_send_sig_info+0x5/0x150 [ 2733.061472] [a80ed71e] ? rcu_read_lock_held+0x6e/0x80 [ 2733.062322] [a808dee8] kill_pid_info+0x78/0x130 [ 2733.063168] [a808de75] ? kill_pid_info+0x5/0x130 [ 2733.064015] [a808e0b2] SYSC_kill+0xf2/0x2f0 [ 2733.064863] [a808e05b] ? SYSC_kill+0x9b/0x2f0 [ 2733.065704] [a80d7ffd] ? trace_hardirqs_on+0xd/0x10 [ 2733.066541] [a8013765] ? syscall_trace_enter_phase1+0x125/0x1a0 [ 2733.067384] [a80d7f2d] ? trace_hardirqs_on_caller+0x10d/0x1d0 [ 2733.068217] [a809079e] SyS_kill+0xe/0x10 [ 2733.069045] [a8858aa2] system_call_fastpath+0x12/0x17 [ 3708.217920] perf interrupt took too long (2503 2500), lowering kernel.perf_event_max_sample_rate to 5 [ 4583.530580] request_module: runaway loop modprobe personality-87 still running though.. Dave -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: frequent lockups in 3.18rc4
On Fri, Dec 26, 2014 at 2:57 PM, Dave Jones da...@codemonkey.org.uk wrote: hm. So with the previous patch that had the false positives, you never saw this? You saw the false positives instead? I'm wondering if the added debug noise just ended up helping. Doing a printk() will automatically cause some scheduler activity. And they also caused the time reading jiggle. That said, it's also possible that I screwed something up in the second version of the patch, just breaking it and making it generally ineffective. Oh - and have you actually seen the TSC unstable (delta = xyz) + switched to hpet messages there yet? Linus -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: frequent lockups in 3.18rc4
On Fri, Dec 26, 2014 at 2:57 PM, Dave Jones da...@codemonkey.org.uk wrote: still running though.. Btw, did you ever boot with tsc=reliable as a kernel command line option? For the last night, can you see if you can just run it with that, and things work? Because by now, my gut feel is that we should start derating the HPET rather than the TSC, especially going forward on modern hardware. And if this is some subtle timing issue with our hpet code (the whole -ETIME thing when getting close to setting a timer is subtle, for example, even if the HPET hardware itself would be ok), I'm wondering if the fix isn't to just stop believing in HPET if there are better alternatives around. So I'm not even convinced that trying to debug some HPET issue is really worth it. Especially if your machine is a preproduction board from Intel. But verifying that with just the TSC everything is ok might still be worth it. Linus -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: frequent lockups in 3.18rc4
On Fri, Dec 26, 2014 at 03:16:41PM -0800, Linus Torvalds wrote: On Fri, Dec 26, 2014 at 2:57 PM, Dave Jones da...@codemonkey.org.uk wrote: hm. So with the previous patch that had the false positives, you never saw this? You saw the false positives instead? correct. I'm wondering if the added debug noise just ended up helping. Doing a printk() will automatically cause some scheduler activity. And they also caused the time reading jiggle. That said, it's also possible that I screwed something up in the second version of the patch, just breaking it and making it generally ineffective. Oh - and have you actually seen the TSC unstable (delta = xyz) + switched to hpet messages there yet? not yet. 3 hrs in. Dave -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: frequent lockups in 3.18rc4
On Fri, Dec 26, 2014 at 03:30:20PM -0800, Linus Torvalds wrote: On Fri, Dec 26, 2014 at 2:57 PM, Dave Jones da...@codemonkey.org.uk wrote: still running though.. Btw, did you ever boot with tsc=reliable as a kernel command line option? I don't think so. For the last night, can you see if you can just run it with that, and things work? Sure. So I'm not even convinced that trying to debug some HPET issue is really worth it. Especially if your machine is a preproduction board from Intel. Yeah, I agree. Even though it's strange that this only became a problem this last few months for me, after over a year of abuse. Hopefully the new year brings more trustworthy hardware. Dave -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: frequent lockups in 3.18rc4
On Fri, Dec 26, 2014 at 03:30:20PM -0800, Linus Torvalds wrote:
 > On Fri, Dec 26, 2014 at 2:57 PM, Dave Jones da...@codemonkey.org.uk wrote:
 > >
 > > still running though..
 >
 > Btw, did you ever boot with "tsc=reliable" as a kernel command line option?

I'll check it again in the morning, but before I turn in for the night, so far the only thing is this:

[ 6713.394395] INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 6713.394489] Tasks blocked on level-0 rcu_node (CPUs 0-7):
[ 6713.394513] Tasks blocked on level-0 rcu_node (CPUs 0-7):
[ 6713.394536] (detected by 3, t=6502 jiffies, g=141292, c=141291, q=0)
[ 6713.394564] INFO: Stall ended before state dump start

	Dave
Re: frequent lockups in 3.18rc4
On Fri, Dec 26, 2014 at 4:36 PM, Dave Jones da...@codemonkey.org.uk wrote:
>
> > Oh - and have you actually seen the "TSC unstable (delta = xyz)" +
> > "switched to hpet" messages there yet?
>
> not yet. 3 hrs in.

Ok, so then the "INFO: rcu_preempt detected stalls on CPUs/tasks:" has nothing to do with HPET, since you'd still be running with the TSC enabled.

My googling around did find a number of "machine locks up a few hours after switching to hpet" reports, so it is possible that the whole rcu stall and nmi watchdog thing is independent and unrelated to the actual locking up. It *is* intriguing that my broken patch seemed to prevent it from happening, though. And both NMI watchdogs and the rcu stall are related to wall-clock time.

But hey, maybe there really is some odd loop in the kernel that stops scheduling or RCU grace periods. It just seems to be never caught by your backtraces..

	Linus
Re: frequent lockups in 3.18rc4
On 12/23/2014 09:56 AM, Dave Jones wrote:
> On Mon, Dec 22, 2014 at 03:59:19PM -0800, Linus Torvalds wrote:
> > But in the meantime please do keep that thing running as long as you
> > can. Let's see if we get bigger jumps. Or perhaps we'll get a negative
> > result - the original softlockup bug happening *without* any bigger
> > hpet jumps.
>
> It's been going for 18 hours, with just a bunch more of those hpet
> messages, all in the same range. I'll leave it go a few more hours,
> before I have to wipe it, but I've got feel-good vibes about this.
> Even if that patch isn't the solution, it seems like we're finally
> looking in the right direction.

I've got myself a physical server to play with, and running trinity on it seems to cause similar stalls:

[ 2338.389210] INFO: rcu_sched self-detected stall on CPU
[ 2338.429153] INFO: rcu_sched detected stalls on CPUs/tasks:
[ 2338.429164] 16: (5999 ticks this GP) idle=4b5/141/0 softirq=24859/24860 last_accelerate: 039d/1b78, nonlazy_posted: 64, ..
[ 2338.429165]
[ 2338.680231] 16: (5999 ticks this GP) idle=4b5/141/0 softirq=24859/24860 last_accelerate: 039d/1b91, nonlazy_posted: 64, ..
[ 2338.828353] (t=6044 jiffies g=16473 c=16472 q=4915881)

Oddly enough, there's no stacktrace...

Thanks,
Sasha
Re: frequent lockups in 3.18rc4
On Mon, Dec 22, 2014 at 03:59:19PM -0800, Linus Torvalds wrote:
> But in the meantime please do keep that thing running as long as you
> can. Let's see if we get bigger jumps. Or perhaps we'll get a negative
> result - the original softlockup bug happening *without* any bigger
> hpet jumps.

So I've got this box a *little* longer than anticipated. It's now been running 30 hours with not a single NMI lockup, and that's with my kitchen-sink debugging kernel. The 'hpet off' messages continue to be spewed, and again they're all in the same range of 4293198075 -> 4294967266

	Dave
Re: frequent lockups in 3.18rc4
On Mon, Dec 22, 2014 at 03:59:19PM -0800, Linus Torvalds wrote:
> But in the meantime please do keep that thing running as long as you
> can. Let's see if we get bigger jumps. Or perhaps we'll get a negative
> result - the original softlockup bug happening *without* any bigger
> hpet jumps.

It's been going for 18 hours, with just a bunch more of those hpet messages, all in the same range. I'll leave it go a few more hours, before I have to wipe it, but I've got feel-good vibes about this. Even if that patch isn't the solution, it seems like we're finally looking in the right direction.

	Dave
Re: frequent lockups in 3.18rc4
On Mon, Dec 22, 2014 at 3:59 PM, John Stultz wrote:
>
> * So 1/8th of the interval seems way too short, as there's
> clocksources like the ACPI PM, which wrap every 2.5 seconds or so.

Ugh. At the same time, 1/8th of a range is actually bigger than I'd like, since if there is some timer corruption, it means that we only catch it when it's really big. But as I said, I'd actually prefer it to be time-based, because it would be good if this approach worked on things like the TSC, which is a 64-bit counter..

So yes, that capping was very much arbitrary, and was mostly a case of "this works with the one timer source that I can easily trigger".

> * I suspect something closer to the clocksource_max_deferment() value
> (which I think is max interval before multiplication overflows could
> happen - ~12%) which we use in the scheduler would make more sense.
> Especially since the timer scheduler uses that to calculate how long
> we can idle for.

I'd rather not be anywhere *close* to any overflow problems. Even for the scheduler all-idle case, I'd argue that there are rather quickly diminishing returns. Yes, a thousand timer interrupts per second are expensive and a noticeable power draw. The difference between "one timer interrupt every two seconds" and "every 20 seconds" is rather less noticeable.

Of course, reasonable clock sources have *much* longer periods than a second (yeah, the acpi pm timer really isn't a good one), so there are probably good middle grounds. The 1/8th was a hack, and one that was aware of the 300s cycle of the HPET at that..

> * Nulling out delta in timekeeping_get_ns() seems like it could cause
> problems since time would then possibly go backwards compared to
> previous reads (as you mentioned, resulting in smaller time jumps).
> Instead it would probably make more sense to cap the delta at the
> maximum value (though this assumes the clock doesn't jump back in the
> interval before the next call to update_wall_time).
So part of the nulling was that it was simpler, and part of it was that I expected to get backwards jumps (see the other email to Dave about the inherent races). And with the whole timer mask modulo arithmetic, those backwards jumps just look like biggish positive numbers, not even negative. So it ends up being things like "is it an unsigned number larger than half the mask? Consider it negative" etc.

The "zero it out" was simple, and it worked for my test-case, which was "ok, my machine no longer locks up when I mess with the timer". And I didn't post the earlier versions of that patch that didn't even *boot*. I started out trying to do it at a higher level (not on a clock read level, but outside the whole 'convert-to-ns and do the sequence value check'), but during bootup we play a lot of games with initializing the timer sources etc.

So that explains the approach of doing it at that

	cycle_now = tkr->read(tkr->clock);

level, and keeping it very low-level. But as I already explained in the email that crossed, that low-level thing also results in some fundamental races.

> * Also, as you note, this would just cause the big time jump to only
> happen at the next update, since there's no logic in
> update_wall_time() to limit the jump. I'm not sure if "believing" the
> large jump at write time make that much more sense, though.

So I considered just capping it there (to a single interval or something). Again, just ignoring it - like the read side does - would have been easier, but at the same time I *really* wanted to make time go forward, so just taking the big value seemed safest.

But yes, this was very much a RFC patch. It's not even ready for real use, as DaveJ found out (although it might be good enough in practice, despite its flaws).

> * Finally, you're writing to error while only holding a read lock, but
> that's sort of a minor thing.

It's not a minor thing, but the alternatives looked worse.
I really wanted to make it per-cpu, and do this with interrupts disabled or something. But that then pushes a big problem to the write time, which would have to go over all cpus and see if there are errors. So it's not right. But.. it's a hacky patch to get discussion started, and it's actually hard to do "right" when this code has to be basically lockless.

> * Checking the accumulation interval isn't beyond the
> clocksource_max_deferment() value seems like a very good check to have
> in update_wall_time().

Sounds like a good idea. Also, quite frankly, reading all the code I wasn't ever really able to figure out that things don't overflow. The overflow protection is a bit ad-hoc (that maxshift thing in update_wall_time() really makes baby Jesus cry, despite the season, and it wasn't at all obvious that ntp_tick_length() is fundamentally bigger than xtime_interval, for example).

It's also not clear that the complicated and frankly not-very-obvious shift-loop is any faster than just using a divide - possibly with the "single interval" case being a special case to avoid
Re: frequent lockups in 3.18rc4
On Mon, Dec 22, 2014 at 11:47 AM, Linus Torvalds wrote:
> On Sun, Dec 21, 2014 at 4:41 PM, Linus Torvalds wrote:
>>
>> This is *not* to say that this is the bug you're hitting. But it does show that
>>
>> (a) a flaky HPET can do some seriously bad stuff
>> (b) the kernel is very fragile wrt time going backwards.
>>
>> and maybe we can use this test program to at least try to alleviate problem (b).
>
> Ok, so after several false starts (ktime_get() is really really
> fragile - called in scheduler things, and doing magic things at
> bootup), here is something that seems to alleviate the problem for me.
>
> I still get a lot of RCU messages like "self-detected stall" etc, but
> that's to be expected. When the clock does odd things, crap *will* happen.
>
> But what this does is:
>
> (a) make the error more visible as a clock error rather than various
> random downstream users
>
> IOW, it prints things out when it looks like we're getting odd
> clock read errors (arbitrary cut-off: we expect clock read-outs to be
> within 1/8th of the range of the expected clock value)

(Warning: I'm replying with my vacation goggles on)

A few thoughts from quickly looking at the patch (some of this is repeating your comments here):

* So 1/8th of the interval seems way too short, as there's clocksources like the ACPI PM, which wrap every 2.5 seconds or so. And even with more reasonable clocksource wrapping intervals, the tick scheduler may not schedule the next tick till after that time, which could cause major problems (we don't want the hrtimer expiration calculation to get capped out here, since the timer may be scheduled past 1/8th of the interval, which would keep us from ever accumulating time and clearing the cycle_error added here)

* I suspect something closer to the clocksource_max_deferment() value (which I think is the max interval before multiplication overflows could happen - ~12%) which we use in the scheduler would make more sense.
Especially since the timer scheduler uses that to calculate how long we can idle for.

* Nulling out delta in timekeeping_get_ns() seems like it could cause problems since time would then possibly go backwards compared to previous reads (as you mentioned, resulting in smaller time jumps). Instead it would probably make more sense to cap the delta at the maximum value (though this assumes the clock doesn't jump back in the interval before the next call to update_wall_time).

* Also, as you note, this would just cause the big time jump to only happen at the next update, since there's no logic in update_wall_time() to limit the jump. I'm not sure if "believing" the large jump at write time makes that much more sense, though.

* Finally, you're writing to error while only holding a read lock, but that's sort of a minor thing.

I do agree something that is more helpful in validating the timekeeping here would be nice to avoid further goose chases in the future. Some possible ideas:

* Checking the accumulation interval isn't beyond the clocksource_max_deferment() value seems like a very good check to have in update_wall_time().

* Maybe when we schedule the next timekeeping update, the tick scheduler could store the expected time for that to fire, and then we could validate that we're relatively close after that value when we do accumulate time (warning if we're running too early or far too late - though with virtualization, defining a "reasonable" late value is difficult).

* This "expected next tick" time could be used to try to cap read-time intervals in a similar fashion as done here. (Of course, again, we'd have to be careful, since if that expected next tick ends up somehow being before the actual hrtimer expiration value, we could end up stopping time - and the system.)

I can try to add some of this when I'm back from holiday in the new year. Maybe Thomas will have some other ideas?
thanks
-john
Re: frequent lockups in 3.18rc4
On Mon, Dec 22, 2014 at 2:57 PM, Dave Jones wrote:
>
> I tried the nohpet thing for a few hours this morning and didn't see
> anything weird, but it may have been that I just didn't run long enough.
> When I saw your patch, I gave that a shot instead, with hpet enabled
> again. Just got back to find lots of messages in dmesg, but none
> of the usual NMI/lockup messages.

Hmm. So my patch is a bit sloppy, and I suspect that the sloppiness may account for some - and quite probably all - of those numbers.

Some of your numbers are pretty big (it's a 32-bit mask, so they are all really just pretty small negative numbers), but they are still in the 2us .. 165ms range when taking the 14MHz HPET counter into account. So not huge timer shifts.

And the sloppiness of the patch is two-fold:

One problem with my patch is that it does that "tkr->cycle_error" thing without any locking, because it's running in various environments where locking really doesn't work, and we can't even sanely disable preemption because we might well be inside the scheduler etc. So this:

	cycle_now = tkr->read(tkr->clock) + tkr->cycle_error;

	/* calculate the delta since the last update_wall_time: */
	delta = clocksource_delta(cycle_now, tkr->cycle_last, tkr->mask);

	/* Hmm? This is really not good, we're too close to overflowing */
	if (unlikely(delta > (tkr->mask >> 3))) {
		tkr->cycle_error = delta;
		delta = 0;
	}

might run concurrently on two CPUs, and then that tkr->cycle_error access isn't serialized, so a "later" read of the HPET clock could end up being written to tkr->cycle_error before an "earlier" one.

So that can account for small errors: you'd have a "cycle_error" that gets updated on one CPU, and then used to correct for an "earlier" read of the clock on another CPU, and that could make the cycle error possibly worse.

However, that first race only matters if you get errors to begin with, so if that was the only race, it would still show that some real error happened.

BUT.
The *bigger* problem is that since the reading side cannot hold any locks at all, it can also race against the writing side. That's by design, and we will use the sequence counter to recover from it and try again, but it means that some of those small errors are just a reader racing with the wall-time update code. And since this error code is done _inside_ the read-sequence code, it's not aware of the retry, and will give a false positive even if we then later on throw the known-bad result out and re-try.

So your small negative numbers are most likely just those false positives. I was more hoping to see some big sudden jumps on the order of your 20-second delays - the kinds of jumps that your "tsc unstable" messages implied (which weren't in the 2us .. 165ms range, but in the 2s to 250s range).

Ugh. I guess I'll have to try to figure out a non-sloppy thing, but quite frankly, the non-sloppy things I tried first were rather painful failures. The sloppy thing was sloppy, but worked well to see the disaster case.

I'll put on my thinking cap. Maybe I can move the "cycle_error" logic to outside the sequence lock retry loop.

But in the meantime please do keep that thing running as long as you can. Let's see if we get bigger jumps. Or perhaps we'll get a negative result - the original softlockup bug happening *without* any bigger hpet jumps.

	Linus
Re: frequent lockups in 3.18rc4
On Mon, Dec 22, 2014 at 11:47:37AM -0800, Linus Torvalds wrote:
> And again: this is not trying to make the kernel clock not jump. There
> is no way I can come up with even in theory to try to really *fix* a
> fundamentally broken clock.
>
> So this is not meant to be a real "fix" for anything, but is meant to
> make sure that if the clock is unreliable, we pinpoint the clock
> itself, and it mitigates the absolutely horrendously bad behavior we
> currently have with bad clocks. So think of this as debug and band-aid
> rather than "this makes clocks magically reliable".
>
> .. and we might still lock up under some circumstances. But at least
> from my limited testing, it is infinitely much better, even if it
> might not be perfect. Also note that my "testing" has been writing
> zero to the HPET lock (so the HPET clock difference tends to be pretty
> specific), while my next step is to see what happens when I write
> random values (and a lot of them).
>
> Since I expect that to cause more problems, I thought I'd send this
> RFC out before I start killing my machine again ;)

I tried the nohpet thing for a few hours this morning and didn't see anything weird, but it may have been that I just didn't run long enough. When I saw your patch, I gave that a shot instead, with hpet enabled again. Just got back to find lots of messages in dmesg, but none of the usual NMI/lockup messages.
[ 2256.430694] Clocksource tsc unstable (delta = -142018897190 ns)
[ 2256.437433] Switched to clocksource hpet
[ 2279.788605] Clocksource hpet had cycles off by 4294946559
[ 2280.191272] Clocksource hpet had cycles off by 4294905111
[ 2282.605990] Clocksource hpet had cycles off by 4294960721
[ 2284.485410] Clocksource hpet had cycles off by 4294953427
[ 2288.954844] Clocksource hpet had cycles off by 4294924880
[ 2305.202931] Clocksource hpet had cycles off by 4294960429
[ 2315.527247] Clocksource hpet had cycles off by 4294956296
[ 2318.954066] Clocksource hpet had cycles off by 4293673652
[ 2332.370923] Clocksource hpet had cycles off by 4294907221
[ 2332.739861] Clocksource hpet had cycles off by 4294919496
[ 2345.459694] Clocksource hpet had cycles off by 4294959592
[ 2346.159780] Clocksource hpet had cycles off by 4294952613
[ 2348.132071] Clocksource hpet had cycles off by 4294903415
[ 2348.207593] Clocksource hpet had cycles off by 4294966900
[ 2351.699779] Clocksource hpet had cycles off by 4294906755
[ 2354.125982] Clocksource hpet had cycles off by 4294941028
[ 2365.249438] Clocksource hpet had cycles off by 4294942458
[ 2370.247560] Clocksource hpet had cycles off by 4294927938
[ 2372.554642] Clocksource hpet had cycles off by 4294950723
[ 2377.361721] Clocksource hpet had cycles off by 4294952569
[ 2384.747820] Clocksource hpet had cycles off by 4294947263
[ 2389.133886] Clocksource hpet had cycles off by 4294967233
[ 2392.423458] Clocksource hpet had cycles off by 4294946214
[ 2397.648955] Clocksource hpet had cycles off by 4294967205
[ 2405.228015] Clocksource hpet had cycles off by 4294917938
[ 2429.571163] Clocksource hpet had cycles off by 4294957112
[ 2434.214788] Clocksource hpet had cycles off by 429482
[ 2438.686705] Clocksource hpet had cycles off by 4294945380
[ 2440.280478] Clocksource hpet had cycles off by 4294878090
[ 2458.370164] Clocksource hpet had cycles off by 4294875577
[ 2496.916971] Clocksource hpet had cycles off by 4294887574
[ 2516.314875] Clocksource hpet had cycles off by 4294899744
[ 2519.857221] Clocksource hpet had cycles off by 4294836752
[ 2522.696576] Clocksource hpet had cycles off by 4294965711
[ 2527.599967] Clocksource hpet had cycles off by 4294876467
[ 2528.573678] Clocksource hpet had cycles off by 4294815154
[ 2537.325296] Clocksource hpet had cycles off by 4294862624
[ 2542.296016] Clocksource hpet had cycles off by 4294954228
[ 2558.634123] Clocksource hpet had cycles off by 4294845883
[ 2560.804973] Clocksource hpet had cycles off by 4294958781
[ 2579.057030] Clocksource hpet had cycles off by 4294921012
[ 2588.139716] Clocksource hpet had cycles off by 4294950381
[ 2594.076877] Clocksource hpet had cycles off by 4294941777
[ 2597.645800] Clocksource hpet had cycles off by 4294927609
[ 2605.032338] Clocksource hpet had cycles off by 4294915823
[ 2605.239672] Clocksource hpet had cycles off by 4294952275
[ 2605.294230] Clocksource hpet had cycles off by 4294886603
[ 2609.801532] Clocksource hpet had cycles off by 4294887976
[ 2615.003674] Clocksource hpet had cycles off by 4294957202
[ 2641.039536] Clocksource hpet had cycles off by 4294943689
[ 2644.554947] Clocksource hpet had cycles off by 4294837076
[ 2648.576203] Clocksource hpet had cycles off by 4294928887
[ 2648.627249] Clocksource hpet had cycles off by 4294913656
[ 2680.465314] Clocksource hpet had cycles off by 4294963565
[ 2705.231925] Clocksource hpet had cycles off by 4294949762
[ 2708.181981] Clocksource hpet had cycles off by 4294924526
[ 2713.622343] Clocksource hpet had cycles off by
Re: frequent lockups in 3.18rc4
On Mon, Dec 22, 2014 at 11:47 AM, Linus Torvalds wrote:
>
> .. and we might still lock up under some circumstances. But at least
> from my limited testing, it is infinitely much better, even if it
> might not be perfect. Also note that my "testing" has been writing
> zero to the HPET lock (so the HPET clock difference tends to be pretty
> specific), while my next step is to see what happens when I write
> random values (and a lot of them).
>
> Since I expect that to cause more problems, I thought I'd send this
> RFC out before I start killing my machine again ;)

Ok, not horrible. Although I'd suggest not testing in a terminal window while running X. The time jumping will confuse X input timing and the screensaver, to the point that the machine may not be dead, but it isn't exactly usable. Do it in a virtual console.

Again, making the limit tighter (one second?) and perhaps not trusting insane values too much at walltime clock update time either, might make it all work smoother still.

I did manage to confuse systemd with all the garbage the kernel spewed, with a lot of stuff like:

  systemd-journald[779]: Failed to write entry (9 items, 276 bytes), ignoring: Invalid argument

showing up in the logs, but I'm writing this without having had to reboot the machine.

	Linus
Re: frequent lockups in 3.18rc4
On Sun, Dec 21, 2014 at 4:41 PM, Linus Torvalds wrote:
>
> This is *not* to say that this is the bug you're hitting. But it does show that
>
> (a) a flaky HPET can do some seriously bad stuff
> (b) the kernel is very fragile wrt time going backwards.
>
> and maybe we can use this test program to at least try to alleviate problem (b).

Ok, so after several false starts (ktime_get() is really really fragile - called in scheduler things, and doing magic things at bootup), here is something that seems to alleviate the problem for me.

I still get a lot of RCU messages like "self-detected stall" etc, but that's to be expected. When the clock does odd things, crap *will* happen.

But what this does is:

 (a) make the error more visible as a clock error rather than various random downstream users

     IOW, it prints things out when it looks like we're getting odd clock read errors (arbitrary cut-off: we expect clock read-outs to be within 1/8th of the range of the expected clock value)

 (b) try to alleviate the horrible things that happen when the clock error is big

     The patch tries to "correct" for the huge time jump by basically undoing it. We'll still see time jumps (there really is no way to avoid it), but we limit the range of them.

With the attached patch, my machine seems to survive me writing to the HPET master counter register. It spews warnings, and it is noisy about the odd clock reads:

  ...
  Clocksource hpet had cycles off by 642817751
  Cutting it too close for hpet in in update_wall_time (offset = 4034102337)
  INFO: rcu_sched self-detected stall on CPU { 0} (t=281743 jiffies g=4722 c=4721 q=14)
  ...

and there may still be situations where it does horrible horrible things due to the (smaller) time leaps, but it does seem a lot more robust.

NOTE! There's an (intentional) difference in how we handle the time leaps at time read time vs write (wall-clock update) time.
At time read time, we just refuse to believe the big delta, and we set the "cycle_error" value so that future time reads will be relative to the error we just got. We also don't print anything out, because we're possibly deep in the scheduler or in tracing, and we do not want to spam the console about our fixup.

At time *write* time, we first report about any read-time errors, and then we report (but believe in) overlarge clocksource delta errors as we update the time. This seems to be the best way to limit the damage.

Also note that the patch is entirely clock-agnostic. It's just that I can trivially screw up my HPET, so I didn't look at other clocks.

One note: my current limit of clocksource delta errors is based on the range of the clock (1/8th of the range). I actually think that's bogus, and it should instead be based on the expected frequency of the clock (ie "we are guaranteed to update the wall clock at least once every second, so if the clock source delta read is larger than one second, we've done something wrong").

So this patch is meant very much as an RFC, rather than anything else. It's pretty hacky. But it does actually make a huge difference for me wrt the "mess up HPET time on purpose" test. That used to crash my machine pretty hard, and pretty reliably. With this patch, I've done it ten+ times, and while it spews a lot of garbage, the machine stays up and _works_.

Making the sanity check tighter (ie the "one second" band rather than "1/8th of the clock range") would probably just improve it further.

Thomas, what do you think? Hate it? Any better ideas?

And again: this is not trying to make the kernel clock not jump. There is no way I can come up with, even in theory, to try to really *fix* a fundamentally broken clock.

So this is not meant to be a real "fix" for anything, but is meant to make sure that if the clock is unreliable, we pinpoint the clock itself, and it mitigates the absolutely horrendously bad behavior we currently have with bad clocks.
So think of this as debug and band-aid rather than "this makes clocks magically reliable".

.. and we might still lock up under some circumstances. But at least from my limited testing, it is infinitely much better, even if it might not be perfect. Also note that my "testing" has been writing zero to the HPET lock (so the HPET clock difference tends to be pretty specific), while my next step is to see what happens when I write random values (and a lot of them).

Since I expect that to cause more problems, I thought I'd send this RFC out before I start killing my machine again ;)

	Linus
---
 include/linux/timekeeper_internal.h |  1 +
 kernel/time/timekeeping.c           | 25 -
 2 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/include/linux/timekeeper_internal.h b/include/linux/timekeeper_internal.h
index 05af9a334893..0fcb60d77079 100644
--- a/include/linux/timekeeper_internal.h
+++ b/include/linux/timekeeper_internal.h
@@ -32,6 +32,7 @@ struct tk_read_base {
 	cycle_t
Re: frequent lockups in 3.18rc4
On Sun, Dec 21, 2014 at 4:41 PM, Linus Torvalds torva...@linux-foundation.org wrote: This is *not* to say that this is the bug you're hitting. But it does show that (a) a flaky HPET can do some seriously bad stuff (b) the kernel is very fragile wrt time going backwards. and maybe we can use this test program to at least try to alleviate problem (b). Ok, so after several false starts (ktime_get() is really really fragile - called in scheduler things, and doing magic things at bootup), here is something that seems to alleviate the problem for me. I still get a lot of RCU messages like self-detected stall etc, but that's to be expected. When the clock does odd things, crap *will* happen. But what this does is: (a) make the error more visible as a clock error rather than various random downstream users IOW, it prints things out when it looks like we're getting odd clock read errors (arbitrary cut-off: we expect clock read-outs to be within 1/8th of the range of the expected clock value) (b) try to alleviate the horrible things that happen when the clock error is big The patch tries to correct for the huge time jump by basically undoing it. We'll still see time jumps (there really is no way to avoid it), but we limit the range of them. With the attached patch, my machine seems to survive me writing to the HPET master counter register. It spews warnings, and it is noisy about the odd clock reads: ... Clocksource hpet had cycles off by 642817751 Cutting it too close for hpet in in update_wall_time (offset = 4034102337) INFO: rcu_sched self-detected stall on CPU { 0} (t=281743 jiffies g=4722 c=4721 q=14) ... and there may still be situations where it does horrible horrible things due to the (smaller) time leaps, but it does seem a lot more robust. NOTE! There's an (intentional) difference in how we handle the time leaps at time read time vs write (wall-clock update). 
At time read time, we just refuse to believe the big delta, and we set the "cycle_error" value so that future time reads will be relative to the error we just got. We also don't print anything out, because we're possibly deep in the scheduler or in tracing, and we do not want to spam the console about our fixup. At time *write* time, we first report about any read-time errors, and then we report (but believe in) overlarge clocksource delta errors as we update the time. This seems to be the best way to limit the damage. Also note that the patch is entirely clock-agnostic. It's just that I can trivially screw up my HPET, I didn't look at other clocks. One note: my current limit of clocksource delta errors is based on the range of the clock (1/8th of the range). I actually think that's bogus, and it should instead be based on the expected frequency of the clock (ie "we are guaranteed to update the wall clock at least once every second, so if the clock source delta read is larger than one second, we've done something wrong"). So this patch is meant very much as an RFC, rather than anything else. It's pretty hacky. But it does actually make a huge difference for me wrt the "mess up HPET time on purpose". That used to crash my machine pretty hard, and pretty reliably. With this patch, I've done it ten+ times, and while it spews a lot of garbage, the machine stays up and _works_. Making the sanity check tighter (ie the "one second" band rather than "1/8th of the clock range") would probably just improve it further. Thomas, what do you think? Hate it? Any better ideas? And again: this is not trying to make the kernel clock not jump. There is no way I can come up with even in theory to try to really *fix* a fundamentally broken clock. So this is not meant to be a real "fix" for anything, but is meant to make sure that if the clock is unreliable, we pinpoint the clock itself, and it mitigates the absolutely horrendously bad behavior we currently have with bad clocks. 
So think of this as debug and band-aid rather than this makes clocks magically reliable. .. and we might still lock up under some circumstances. But at least from my limited testing, it is infinitely much better, even if it might not be perfect. Also note that my testing has been writing zero to the HPET lock (so the HPET clock difference tends to be pretty specific), while my next step is to see what happens when I write random values (and a lot of them). Since I expect that to cause more problems, I thought I'd send this RFC out before I start killing my machine again ;) Linus include/linux/timekeeper_internal.h | 1 + kernel/time/timekeeping.c | 25 - 2 files changed, 25 insertions(+), 1 deletion(-) diff --git a/include/linux/timekeeper_internal.h b/include/linux/timekeeper_internal.h index 05af9a334893..0fcb60d77079 100644 --- a/include/linux/timekeeper_internal.h +++ b/include/linux/timekeeper_internal.h @@ -32,6 +32,7 @@ struct tk_read_base { cycle_t
Re: frequent lockups in 3.18rc4
On Mon, Dec 22, 2014 at 11:47 AM, Linus Torvalds torva...@linux-foundation.org wrote: .. and we might still lock up under some circumstances. But at least from my limited testing, it is infinitely much better, even if it might not be perfect. Also note that my testing has been writing zero to the HPET lock (so the HPET clock difference tends to be pretty specific), while my next step is to see what happens when I write random values (and a lot of them). Since I expect that to cause more problems, I thought I'd send this RFC out before I start killing my machine again ;) Ok, not horrible. Although I'd suggest not testing in a terminal window while running X. The time jumping will confuse X input timing and the screensaver, to the point that the machine may not be dead, but it isn't exactly usable. Do it in a virtual console. Again, making the limit tighter (one second?) and perhaps not trusting insane values too much at walltime clock update time either, might make it all work smoother still. I did manage to confuse systemd with all the garbage the kernel spewed, with a lot of stuff like: systemd-journald[779]: Failed to write entry (9 items, 276 bytes), ignoring: Invalid argument showing up in the logs, but I'm writing this without having had to reboot the machine. Linus -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: frequent lockups in 3.18rc4
On Mon, Dec 22, 2014 at 11:47:37AM -0800, Linus Torvalds wrote: And again: this is not trying to make the kernel clock not jump. There is no way I can come up with even in theory to try to really *fix* a fundamentally broken clock. So this is not meant to be a real fix for anything, but is meant to make sure that if the clock is unreliable, we pinpoint the clock itself, and it mitigates the absolutely horrendously bad behavior we currently with bad clocks. So think of this as debug and band-aid rather than this makes clocks magically reliable. .. and we might still lock up under some circumstances. But at least from my limited testing, it is infinitely much better, even if it might not be perfect. Also note that my testing has been writing zero to the HPET lock (so the HPET clock difference tends to be pretty specific), while my next step is to see what happens when I write random values (and a lot of them). Since I expect that to cause more problems, I thought I'd send this RFC out before I start killing my machine again ;) I tried the nohpet thing for a few hours this morning and didn't see anything weird, but it may have been that I just didn't run long enough. When I saw your patch, I gave that a shot instead, with hpet enabled again. Just got back to find lots of messages in dmesg, but none of the usual NMI/lockup messages. 
[ 2256.430694] Clocksource tsc unstable (delta = -142018897190 ns)
[ 2256.437433] Switched to clocksource hpet
[ 2279.788605] Clocksource hpet had cycles off by 4294946559
[ 2280.191272] Clocksource hpet had cycles off by 4294905111
[ 2282.605990] Clocksource hpet had cycles off by 4294960721
[ 2284.485410] Clocksource hpet had cycles off by 4294953427
[ 2288.954844] Clocksource hpet had cycles off by 4294924880
[ 2305.202931] Clocksource hpet had cycles off by 4294960429
[ 2315.527247] Clocksource hpet had cycles off by 4294956296
[ 2318.954066] Clocksource hpet had cycles off by 4293673652
[ 2332.370923] Clocksource hpet had cycles off by 4294907221
[ 2332.739861] Clocksource hpet had cycles off by 4294919496
[ 2345.459694] Clocksource hpet had cycles off by 4294959592
[ 2346.159780] Clocksource hpet had cycles off by 4294952613
[ 2348.132071] Clocksource hpet had cycles off by 4294903415
[ 2348.207593] Clocksource hpet had cycles off by 4294966900
[ 2351.699779] Clocksource hpet had cycles off by 4294906755
[ 2354.125982] Clocksource hpet had cycles off by 4294941028
[ 2365.249438] Clocksource hpet had cycles off by 4294942458
[ 2370.247560] Clocksource hpet had cycles off by 4294927938
[ 2372.554642] Clocksource hpet had cycles off by 4294950723
[ 2377.361721] Clocksource hpet had cycles off by 4294952569
[ 2384.747820] Clocksource hpet had cycles off by 4294947263
[ 2389.133886] Clocksource hpet had cycles off by 4294967233
[ 2392.423458] Clocksource hpet had cycles off by 4294946214
[ 2397.648955] Clocksource hpet had cycles off by 4294967205
[ 2405.228015] Clocksource hpet had cycles off by 4294917938
[ 2429.571163] Clocksource hpet had cycles off by 4294957112
[ 2434.214788] Clocksource hpet had cycles off by 429482
[ 2438.686705] Clocksource hpet had cycles off by 4294945380
[ 2440.280478] Clocksource hpet had cycles off by 4294878090
[ 2458.370164] Clocksource hpet had cycles off by 4294875577
[ 2496.916971] Clocksource hpet had cycles off by 4294887574
[ 2516.314875] Clocksource hpet had cycles off by 4294899744
[ 2519.857221] Clocksource hpet had cycles off by 4294836752
[ 2522.696576] Clocksource hpet had cycles off by 4294965711
[ 2527.599967] Clocksource hpet had cycles off by 4294876467
[ 2528.573678] Clocksource hpet had cycles off by 4294815154
[ 2537.325296] Clocksource hpet had cycles off by 4294862624
[ 2542.296016] Clocksource hpet had cycles off by 4294954228
[ 2558.634123] Clocksource hpet had cycles off by 4294845883
[ 2560.804973] Clocksource hpet had cycles off by 4294958781
[ 2579.057030] Clocksource hpet had cycles off by 4294921012
[ 2588.139716] Clocksource hpet had cycles off by 4294950381
[ 2594.076877] Clocksource hpet had cycles off by 4294941777
[ 2597.645800] Clocksource hpet had cycles off by 4294927609
[ 2605.032338] Clocksource hpet had cycles off by 4294915823
[ 2605.239672] Clocksource hpet had cycles off by 4294952275
[ 2605.294230] Clocksource hpet had cycles off by 4294886603
[ 2609.801532] Clocksource hpet had cycles off by 4294887976
[ 2615.003674] Clocksource hpet had cycles off by 4294957202
[ 2641.039536] Clocksource hpet had cycles off by 4294943689
[ 2644.554947] Clocksource hpet had cycles off by 4294837076
[ 2648.576203] Clocksource hpet had cycles off by 4294928887
[ 2648.627249] Clocksource hpet had cycles off by 4294913656
[ 2680.465314] Clocksource hpet had cycles off by 4294963565
[ 2705.231925] Clocksource hpet had cycles off by 4294949762
[ 2708.181981] Clocksource hpet had cycles off by 4294924526
[ 2713.622343] Clocksource hpet had cycles off by 4294874217
[ 2714.725619]
Re: frequent lockups in 3.18rc4
On Mon, Dec 22, 2014 at 2:57 PM, Dave Jones da...@codemonkey.org.uk wrote: I tried the nohpet thing for a few hours this morning and didn't see anything weird, but it may have been that I just didn't run long enough. When I saw your patch, I gave that a shot instead, with hpet enabled again. Just got back to find lots of messages in dmesg, but none of the usual NMI/lockup messages. Hmm. So my patch is a bit sloppy, and I suspect that the sloppiness may account for some - and quite probably all - of those numbers. Some of your numbers are pretty big (it's a 32-bit mask, so they are all really just pretty small negative numbers), but they are still in the 2us .. 165ms range when taking the 14MHz HPET counter into account. So not huge timer shifts. And the sloppiness of the patch is two-fold: One problem with my patch is that it does that tkr->cycle_error without any locking (because it's running in various environments where locking really doesn't work, and we can't even sanely disable preemption because we might well be inside the scheduler etc). So this: cycle_now = tkr->read(tkr->clock) + tkr->cycle_error; /* calculate the delta since the last update_wall_time: */ delta = clocksource_delta(cycle_now, tkr->cycle_last, tkr->mask); /* Hmm? This is really not good, we're too close to overflowing */ if (unlikely(delta > (tkr->mask >> 3))) { tkr->cycle_error = delta; delta = 0; } might run concurrently on two CPU's, and then that tkr->cycle_error access isn't serialized, so a later read of the HPET clock could end up being written to tkr->cycle_error before. So that can account for small errors: you'd have a cycle_error that gets updated on one CPU, and then used to correct for an earlier read of the clock on another CPU, and that could make the cycle error possibly worse. However, that first race only matters if you get errors to begin with, so if that was the only race, it would still show that some real error happened. BUT. 
The *bigger* problem is that since the reading side cannot hold any locks at all, it can also race against the writing side. That's by design, and we will use the sequence counter to recover from it and try again, but it means that some of those small errors are just a reader racing with the wall-time update code, and since this error code is done _inside_ the read-sequence code, it's not aware of the retry, and will give a false positive even if we then later on throw the known-bad result out and re-try. So your small negative numbers are most likely just those false positives. I was more hoping to see some big sudden jumps on the order of your 20-second delays - the kinds of jumps that your tsc unstable messages implied (which weren't in the 2us .. 165ms range, but in the 2s to 250s range) Ugh. I guess I'll have to try to figure out a non-sloppy thing, but quite frankly, the non-sloppy things I tried first were rather painful failures. The sloppy thing was sloppy, but worked well to see the disaster case. I'll put on my thinking cap. Maybe I can move the cycle_error logic to outside the sequence lock retry loop. But in the meantime please do keep that thing running as long as you can. Let's see if we get bigger jumps. Or perhaps we'll get a negative result - the original softlockup bug happening *without* any bigger hpet jumps. Linus -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: frequent lockups in 3.18rc4
On Mon, Dec 22, 2014 at 11:47 AM, Linus Torvalds torva...@linux-foundation.org wrote: On Sun, Dec 21, 2014 at 4:41 PM, Linus Torvalds torva...@linux-foundation.org wrote: This is *not* to say that this is the bug you're hitting. But it does show that (a) a flaky HPET can do some seriously bad stuff (b) the kernel is very fragile wrt time going backwards. and maybe we can use this test program to at least try to alleviate problem (b). Ok, so after several false starts (ktime_get() is really really fragile - called in scheduler things, and doing magic things at bootup), here is something that seems to alleviate the problem for me. I still get a lot of RCU messages like self-detected stall etc, but that's to be expected. When the clock does odd things, crap *will* happen. But what this does is: (a) make the error more visible as a clock error rather than various random downstream users IOW, it prints things out when it looks like we're getting odd clock read errors (arbitrary cut-off: we expect clock read-outs to be within 1/8th of the range of the expected clock value) (Warning: I'm replying with my vacation goggles on) A few thoughts from quickly looking at the patch (some of this is repeating your comments here): * So 1/8th of the interval seems way too short, as there's clocksources like the ACPI PM, which wraps every 2.5 seconds or so. And even with more reasonable clocksource wrapping intervals, the tick scheduler may not schedule the next tick till after that time, which could cause major problems (we don't want the hrtimer expiration calculation to get capped out here, since the timer may be scheduled past 1/8th of the interval, which would keep us from ever accumulating time and clearing the cycle_error added here) * I suspect something closer to the clocksource_max_deferment() value (which I think is max interval before multiplication overflows could happen - ~12%) which we use in the scheduler would make more sense. 
Especially since the timer scheduler uses that to calculate how long we can idle for. * Nulling out delta in timekeeping_get_ns() seems like it could cause problems since time would then possibly go backwards compared to previous reads (as you mentioned, resulting in smaller time jumps). Instead it would probably make more sense to cap the delta at the maximum value (though this assumes the clock doesn't jump back in the interval before the next call to update_wall_time). * Also, as you note, this would just cause the big time jump to only happen at the next update, since there's no logic in update_wall_time() to limit the jump. I'm not sure if believing the large jump at write time makes that much more sense, though. * Finally, you're writing to error while only holding a read lock, but that's sort of a minor thing. I do agree something that is more helpful in validating the timekeeping here would be nice to avoid further goose chases in the future. Some possible ideas: * Checking the accumulation interval isn't beyond the clocksource_max_deferment() value seems like a very good check to have in update_wall_time(). * Maybe when we schedule the next timekeeping update, the tick scheduler could store the expected time for that to fire, and then we could validate that we're relatively close after that value when we do accumulate time (warning if we're running too early or far too late - though with virtualization, defining a reasonable late value is difficult). * This expected next tick time could be used to try to cap read-time intervals in a similar fashion as done here. (Of course, again, we'd have to be careful, since if that expected next tick ends up somehow being before the actual hrtimer expiration value, we could end up stopping time - and the system). I can try to add some of this when I'm back from holiday in the new year. Maybe Thomas will have some other ideas? 
thanks -john
Re: frequent lockups in 3.18rc4
On Mon, Dec 22, 2014 at 3:59 PM, John Stultz john.stu...@linaro.org wrote: * So 1/8th of the interval seems way too short, as there's clocksources like the ACPI PM, which wraps every 2.5 seconds or so. Ugh. At the same time, 1/8th of a range is actually bigger than I'd like, since if there is some timer corruption, it means that we only catch it when it's really big. But as I said, I'd actually prefer it to be time-based, because it would be good if this approach worked on things like the TSC which is a 64-bit counter.. So yes, that capping was very much arbitrary, and was mostly a case of "this works with the one timer source that I can easily trigger" * I suspect something closer to the clocksource_max_deferment() value (which I think is max interval before multiplication overflows could happen - ~12%) which we use in the scheduler would make more sense. Especially since the timer scheduler uses that to calculate how long we can idle for. I'd rather not be anywhere *close* to any overflow problems. Even for the scheduler all-idle case, I'd argue that there is rather quickly diminishing returns. Yes, a thousand timer interrupts per second are expensive and a noticeable power draw. The difference between one timer interrupt every two seconds and every 20 seconds is rather less noticeable. Of course, reasonable clock sources have *much* longer periods than a second (yeah, the acpi pm timer really isn't a good one), so there are probably good middle grounds. The 1/8th was a hack, and one that was aware of the 300s cycle of the HPET at that.. * Nulling out delta in timekeeping_get_ns() seems like it could cause problems since time would then possibly go backwards compared to previous reads (as you mentioned, resulting in smaller time jumps). Instead it would probably make more sense to cap the delta at the maximum value (though this assumes the clock doesn't jump back in the interval before the next call to update_wall_time). 
So part of the nulling was that it was simpler, and part of it was that I expected to get backwards jumps (see the other email to Dave about the inherent races). And with the whole timer mask modulo arithmetic, those backwards jumps just look like biggish positive numbers, not even negative. So it ends up being things like "is it an unsigned number larger than half the mask? Consider it negative" etc. The "zero it out" was simple, and it worked for my test-case, which was "ok, my machine no longer locks up when I mess with the timer". And I didn't post the earlier versions of that patch that didn't even *boot*. I started out trying to do it at a higher level (not on a clock read level, but outside the whole 'convert-to-ns and do the sequence value check'), but during bootup we play a lot of games with initializing the timer sources etc. So that explains the approach of doing it at that cycle_now = tkr->read(tkr->clock); level, and keeping it very low-level. But as I already explained in the email that crossed, that low-level thing also results in some fundamental races. * Also, as you note, this would just cause the big time jump to only happen at the next update, since there's no logic in update_wall_time() to limit the jump. I'm not sure if believing the large jump at write time makes that much more sense, though. So I considered just capping it there (to a single interval or something). Again, just ignoring - like the read side does - it would have been easier, but at the same time I *really* wanted to make time go forward, so just taking the big value seemed safest. But yes, this was very much an RFC patch. It's not even ready for real use, as DaveJ found out (although it might be good enough in practice, despite its flaws) * Finally, you're writing to error while only holding a read lock, but that's sort of a minor thing. It's not a minor thing, but the alternatives looked worse. I really wanted to make it per-cpu, and do this with interrupts disabled or something. 
But that then pushes a big problem to the write time to go over all cpu's and see if there are errors. So it's not right. But .. It's a hacky patch to get discussion started, and it's actually hard to do right when this code has to be basically lockless. * Checking the accumulation interval isn't beyond the clocksource_max_deferment() value seems like a very good check to have in update_wall_time(). Sounds like a good idea. Also, quite frankly, reading all the code I wasn't ever really able to figure out that things don't overflow. The overflow protection is a bit ad-hoc (that maxshift thing in update_wall_time() really makes baby Jesus cry, despite the season, and it wasn't at all obvious that ntp_tick_length() is fundamentally bigger than xtime_interval, for example). It's also not clear that the complicated and frankly not-very-obvious shift-loop is any faster than just using a divide - possibly with the single interval case being a special case to avoid dividing then. I was a
Re: frequent lockups in 3.18rc4
On Sun, Dec 21, 2014 at 04:52:28PM -0800, Linus Torvalds wrote: > On Sun, Dec 21, 2014 at 4:41 PM, Linus Torvalds > wrote: > > > > The second time (or third, or fourth - it might not take immediately) > > you get a lockup or similar. Bad things happen. > > I've only tested it twice now, but the first time I got a weird > lockup-like thing (things *kind* of worked, but I could imagine that > one CPU was stuck with a lock held, because things eventually ground > to a screeching halt. > > The second time I got > > INFO: rcu_sched self-detected stall on CPU { 5} (t=84533 jiffies > g=11971 c=11970 q=17) > > and then > >INFO: rcu_sched detected stalls on CPUs/tasks: { 1 2 3 4 5 6 7} > (detected by 0, t=291309 jiffies, g=12031, c=12030, q=57) > > with backtraces that made no sense (because obviously no actual stall > had taken place), and were the CPU's mostly being idle. Yep, if time gets messed up too much, RCU can incorrectly decide that 21 seconds have elapsed since the grace period started, and can even decide this pretty much immediately after the grace period starts. Thanx, Paul > I could easily see it resulting in your softlockup scenario too. > > Linus > -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: frequent lockups in 3.18rc4
On Sun, Dec 21, 2014 at 04:52:28PM -0800, Linus Torvalds wrote: > > The second time (or third, or fourth - it might not take immediately) > > you get a lockup or similar. Bad things happen. > > I've only tested it twice now, but the first time I got a weird > lockup-like thing (things *kind* of worked, but I could imagine that > one CPU was stuck with a lock held, because things eventually ground > to a screeching halt. > > The second time I got > > INFO: rcu_sched self-detected stall on CPU { 5} (t=84533 jiffies > g=11971 c=11970 q=17) > > and then > >INFO: rcu_sched detected stalls on CPUs/tasks: { 1 2 3 4 5 6 7} > (detected by 0, t=291309 jiffies, g=12031, c=12030, q=57) > > with backtraces that made no sense (because obviously no actual stall > had taken place), and were the CPU's mostly being idle. > > I could easily see it resulting in your softlockup scenario too. So something trinity does when it doesn't have a better idea of something to pass a syscall is to generate a random number. A wild hypothesis could be that we're in one of these situations, and we randomly generated 0xfed000f0 and passed that as a value to a syscall, and the kernel wrote 0 to that address. What syscall could do that, and not just fail a access_ok() or similar is a mystery though. Dave -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: frequent lockups in 3.18rc4
On Sun, Dec 21, 2014 at 4:41 PM, Linus Torvalds wrote: > > The second time (or third, or fourth - it might not take immediately) > you get a lockup or similar. Bad things happen. I've only tested it twice now, but the first time I got a weird lockup-like thing (things *kind* of worked, but I could imagine that one CPU was stuck with a lock held, because things eventually ground to a screeching halt. The second time I got INFO: rcu_sched self-detected stall on CPU { 5} (t=84533 jiffies g=11971 c=11970 q=17) and then INFO: rcu_sched detected stalls on CPUs/tasks: { 1 2 3 4 5 6 7} (detected by 0, t=291309 jiffies, g=12031, c=12030, q=57) with backtraces that made no sense (because obviously no actual stall had taken place), and were the CPU's mostly being idle. I could easily see it resulting in your softlockup scenario too. Linus -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: frequent lockups in 3.18rc4
On Sun, Dec 21, 2014 at 3:58 PM, Linus Torvalds wrote: > > I can do the mmap(/dev/mem) thing and access the HPET by hand, and > when I write zero to it I immediately get something like this: > > Clocksource tsc unstable (delta = -284317725450 ns) > Switched to clocksource hpet > > just to confirm that yes, a jump in the HPET counter would indeed give > those kinds of symptoms: blaming the TSC with a negative delta in the > 0-300s range, even though it's the HPET that is broken. > > And if the HPET then occasionally jumps around afterwards, it would > show up as ktime_get() occasionally going backwards, which in turn > would - as far as I can tell - result in exactly that pseudo-infinite > loop with timers. Ok, so I tried that too. It's actually a pretty easy experiment to do: just mmap(/dev/mem) at the HPET offset (the kernel prints it out at boot, it should normally be at 0xfed00000). And then just write a zero to offset 0xf0, which is the main counter. The first time, you get the "Clocksource tsc unstable". The second time (or third, or fourth - it might not take immediately) you get a lockup or similar. Bad things happen. This is *not* to say that this is the bug you're hitting. But it does show that (a) a flaky HPET can do some seriously bad stuff (b) the kernel is very fragile wrt time going backwards. and maybe we can use this test program to at least try to alleviate problem (b). Trivial HPET mess-up program attached. Linus

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

int main(int argc, char **argv)
{
	int fd = open("/dev/mem", O_RDWR);
	void *base;

	if (fd < 0) {
		fputs("Unable to open /dev/mem\n", stderr);
		return -1;
	}
	base = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0xfed00000);
	if ((long)base == -1) {
		fputs("Unable to mmap HPET\n", stderr);
		return -1;
	}
	*(unsigned long *) (base+0xf0) = 0;
	return 0;
}
Re: frequent lockups in 3.18rc4
On Sun, Dec 21, 2014 at 2:32 PM, Dave Jones wrote: > On Sun, Dec 21, 2014 at 02:19:03PM -0800, Linus Torvalds wrote: > > > > And finally, and stupidly, is there any chance that you have anything > > accessing /dev/hpet? > > Not knowingly at least, but who the hell knows what systemd has its > fingers in these days. Actually, it looks like /dev/hpet doesn't allow write access. I can do the mmap(/dev/mem) thing and access the HPET by hand, and when I write zero to it I immediately get something like this: Clocksource tsc unstable (delta = -284317725450 ns) Switched to clocksource hpet just to confirm that yes, a jump in the HPET counter would indeed give those kinds of symptoms: blaming the TSC with a negative delta in the 0-300s range, even though it's the HPET that is broken. And if the HPET then occasionally jumps around afterwards, it would show up as ktime_get() occasionally going backwards, which in turn would - as far as I can tell - result in exactly that pseudo-infinite loop with timers. Anyway, any wild kernel pointer access *could* happen to just hit the HPET and write to the main counter value, although I'd personally be more inclined to blame BIOS/SMM kind of code playing tricks with time.. We do have a few places where we explicitly write the value on purpose, but they are in the HPET init code, and in the clocksource resume code, so they should not be involved. Thomas - have you had reports of HPET breakage in RT circles, the same way BIOSes have been tinkering with TSC? Also, would it perhaps be a good idea to make "ktime_get()" save the last time in a percpu variable, and warn if time ever goes backwards on a particular CPU? A percpu thing should be pretty cheap, even if we write to it every time somebody asks for time.. 
Linus
Re: frequent lockups in 3.18rc4
On Sun, Dec 21, 2014 at 02:19:03PM -0800, Linus Torvalds wrote: > > So the range of 1-251 seconds is not entirely random. It's all in > > that "32-bit HPET range". > > DaveJ, I assume it's too late now, and you don't effectively have any > access to the machine any more, but "hpet=disable" or "nohpet" on the > command line might be worth trying if you ever see that piece of > hardware again. I can give it a try tomorrow. I'm probably saying goodbye to that machine on Tuesday, so we'll have 24hrs of testing at least. > And for posterity, do you have a dmidecode with motherboard/BIOS > information for the problematic machine? And your configuration? I can grab that in the morning too. > And finally, and stupidly, is there any chance that you have anything > accessing /dev/hpet? Not knowingly at least, but who the hell knows what systemd has its fingers in these days. Dave
Re: frequent lockups in 3.18rc4
On Sun, Dec 21, 2014 at 1:22 PM, Linus Torvalds wrote: > > So the range of 1-251 seconds is not entirely random. It's all in > that "32-bit HPET range". DaveJ, I assume it's too late now, and you don't effectively have any access to the machine any more, but "hpet=disable" or "nohpet" on the command line might be worth trying if you ever see that piece of hardware again. And for posterity, do you have a dmidecode with motherboard/BIOS information for the problematic machine? And your configuration? And finally, and stupidly, is there any chance that you have anything accessing /dev/hpet? Linus
Re: frequent lockups in 3.18rc4
On Sat, Dec 20, 2014 at 1:16 PM, Linus Torvalds wrote: > > Hmm, ok, I've re-acquainted myself with it. And I have to admit that I > can't see anything wrong. The whole "update_wall_clock" and the shadow > timekeeping state is confusing as hell, but seems fine. We'd have to > avoid update_wall_clock for a *long* time for overflows to occur. > > And the overflow in 32 bits isn't that special, since the only thing > that really matters is the overflow of "cycle_now - tkr->cycle_last" > within the mask. > > So I'm not seeing anything even halfway suspicious. .. of course, this reminds me of the "clocksource TSC unstable" issue. The *simple* solution may actually be that the HPET itself is buggered. That would explain both the "clocksource TSC unstable" messages _and_ the "time went backwards, so now we're re-arming the scheduler tick 'forever' until time has gone forwards again". And googling for this actually shows other people seeing similar issues, including hangs after switching to another clocksource. See for example http://stackoverflow.com/questions/13796944/system-hang-with-possible-relevance-to-clocksource-tsc-unstable which switches to acpi_pm (not HPET) and then hangs afterwards. Of course, it may be the switching itself that causes some issue. Btw, there's another reason to think that it's the HPET, I just realized. DaveJ posted all his odd TSC unstable things, and the delta was pretty damn random. But it did have a range: it was in the 1-251 second range. With a 14.318MHz clock (which is, I think, the normal HPET frequency), a 32-bit overflow happens in about 300 seconds. So the range of 1-251 seconds is not entirely random. It's all in that "32-bit HPET range". In contrast, wrt the TSC frequency, that kind of range makes no sense at all.
Linus
Re: frequent lockups in 3.18rc4
On Sun, Dec 21, 2014 at 3:58 PM, Linus Torvalds torva...@linux-foundation.org wrote: I can do the mmap(/dev/mem) thing and access the HPET by hand, and when I write zero to it I immediately get something like this: Clocksource tsc unstable (delta = -284317725450 ns) Switched to clocksource hpet just to confirm that yes, a jump in the HPET counter would indeed give those kinds of symptoms: blaming the TSC with a negative delta in the 0-300s range, even though it's the HPET that is broken. And if the HPET then occasionally jumps around afterwards, it would show up as ktime_get() occasionally going backwards, which in turn would - as far as I can tell - result in exactly that pseudo-infinite loop with timers. Ok, so I tried that too. It's actually a pretty easy experiment to do: just mmap(/dev/mem) at the HPET offset (the kernel prints it out at boot, it should normally be at 0xfed00000). And then just write a zero to offset 0xf0, which is the main counter. The first time, you get the Clocksource tsc unstable. The second time (or third, or fourth - it might not take immediately) you get a lockup or similar. Bad things happen. This is *not* to say that this is the bug you're hitting. But it does show that (a) a flaky HPET can do some seriously bad stuff (b) the kernel is very fragile wrt time going backwards. and maybe we can use this test program to at least try to alleviate problem (b). Trivial HPET mess-up program attached. Linus

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <stdio.h>

int main(int argc, char **argv)
{
	int fd = open("/dev/mem", O_RDWR);
	void *base;

	if (fd < 0) {
		fputs("Unable to open /dev/mem\n", stderr);
		return -1;
	}
	base = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0xfed00000);
	if ((long)base == -1) {
		fputs("Unable to mmap HPET\n", stderr);
		return -1;
	}
	*(unsigned long *) (base+0xf0) = 0;
	return 0;
}
Re: frequent lockups in 3.18rc4
On Sun, Dec 21, 2014 at 4:41 PM, Linus Torvalds torva...@linux-foundation.org wrote: The second time (or third, or fourth - it might not take immediately) you get a lockup or similar. Bad things happen. I've only tested it twice now, but the first time I got a weird lockup-like thing (things *kind* of worked, but I could imagine that one CPU was stuck with a lock held, because things eventually ground to a screeching halt). The second time I got INFO: rcu_sched self-detected stall on CPU { 5} (t=84533 jiffies g=11971 c=11970 q=17) and then INFO: rcu_sched detected stalls on CPUs/tasks: { 1 2 3 4 5 6 7} (detected by 0, t=291309 jiffies, g=12031, c=12030, q=57) with backtraces that made no sense (because obviously no actual stall had taken place), and with the CPUs mostly being idle. I could easily see it resulting in your softlockup scenario too. Linus
Re: frequent lockups in 3.18rc4
On Sun, Dec 21, 2014 at 04:52:28PM -0800, Linus Torvalds wrote: The second time (or third, or fourth - it might not take immediately) you get a lockup or similar. Bad things happen. I've only tested it twice now, but the first time I got a weird lockup-like thing (things *kind* of worked, but I could imagine that one CPU was stuck with a lock held, because things eventually ground to a screeching halt. The second time I got INFO: rcu_sched self-detected stall on CPU { 5} (t=84533 jiffies g=11971 c=11970 q=17) and then INFO: rcu_sched detected stalls on CPUs/tasks: { 1 2 3 4 5 6 7} (detected by 0, t=291309 jiffies, g=12031, c=12030, q=57) with backtraces that made no sense (because obviously no actual stall had taken place), and were the CPU's mostly being idle. I could easily see it resulting in your softlockup scenario too. So something trinity does when it doesn't have a better idea of something to pass a syscall is to generate a random number. A wild hypothesis could be that we're in one of these situations, and we randomly generated 0xfed000f0 and passed that as a value to a syscall, and the kernel wrote 0 to that address. What syscall could do that, and not just fail an access_ok() or similar is a mystery though. Dave
Re: frequent lockups in 3.18rc4
On Sun, Dec 21, 2014 at 04:52:28PM -0800, Linus Torvalds wrote: On Sun, Dec 21, 2014 at 4:41 PM, Linus Torvalds torva...@linux-foundation.org wrote: The second time (or third, or fourth - it might not take immediately) you get a lockup or similar. Bad things happen. I've only tested it twice now, but the first time I got a weird lockup-like thing (things *kind* of worked, but I could imagine that one CPU was stuck with a lock held, because things eventually ground to a screeching halt. The second time I got INFO: rcu_sched self-detected stall on CPU { 5} (t=84533 jiffies g=11971 c=11970 q=17) and then INFO: rcu_sched detected stalls on CPUs/tasks: { 1 2 3 4 5 6 7} (detected by 0, t=291309 jiffies, g=12031, c=12030, q=57) with backtraces that made no sense (because obviously no actual stall had taken place), and were the CPU's mostly being idle. Yep, if time gets messed up too much, RCU can incorrectly decide that 21 seconds have elapsed since the grace period started, and can even decide this pretty much immediately after the grace period starts. Thanx, Paul I could easily see it resulting in your softlockup scenario too. Linus
Re: frequent lockups in 3.18rc4
On Sat, Dec 20, 2014 at 01:16:29PM -0800, Linus Torvalds wrote: > On Sat, Dec 20, 2014 at 10:25 AM, Linus Torvalds > wrote: > > > > How/where is the HPET overflow case handled? I don't know the code enough. > > Hmm, ok, I've re-acquainted myself with it. And I have to admit that I > can't see anything wrong. The whole "update_wall_clock" and the shadow > timekeeping state is confusing as hell, but seems fine. We'd have to > avoid update_wall_clock for a *long* time for overflows to occur. > > And the overflow in 32 bits isn't that special, since the only thing > that really matters is the overflow of "cycle_now - tkr->cycle_last" > within the mask. > > So I'm not seeing anything even halfway suspicious. One long shot is a bug in rcu_barrier() that I introduced in v3.18-rc1. This is a low-probability race that can cause rcu_barrier() and friends to return too soon, which can of course result in arbitrary misbehavior. Please see below for a fix which looks good thus far in reasonably intense rcutorture testing. Might be what Dave and Sasha are seeing. Or not. Thanx, Paul rcu: Fix rcu_barrier() race that could result in too-short wait The rcu_barrier() no-callbacks check for no-CBs CPUs has race conditions. It checks a given CPU's lists of callbacks, and if all three no-CBs lists are empty, ignores that CPU. However, these three lists could potentially be empty even when callbacks are present if the check executed just as the callbacks were being moved from one list to another. It turns out that recent versions of rcutorture can spot this race. This commit plugs this hole by consolidating the per-list counts of no-CBs callbacks into a single count, which is incremented before the corresponding callback is posted and decremented after it is invoked. Then rcu_barrier() checks this single count to reliably determine whether the corresponding CPU has no-CBs callbacks. Signed-off-by: Paul E.
McKenney diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c index 7680fc275036..658b691dc32b 100644 --- a/kernel/rcu/tree.c +++ b/kernel/rcu/tree.c @@ -3318,6 +3318,7 @@ static void _rcu_barrier(struct rcu_state *rsp) } else { _rcu_barrier_trace(rsp, "OnlineNoCB", cpu, rsp->n_barrier_done); + smp_mb__before_atomic(); atomic_inc(&rsp->barrier_cpu_count); __call_rcu(&rdp->barrier_head, rcu_barrier_callback, rsp, cpu, 0); diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h index 8e7b1843896e..cb5908672f11 100644 --- a/kernel/rcu/tree.h +++ b/kernel/rcu/tree.h @@ -340,14 +340,10 @@ struct rcu_data { #ifdef CONFIG_RCU_NOCB_CPU struct rcu_head *nocb_head; /* CBs waiting for kthread. */ struct rcu_head **nocb_tail; - atomic_long_t nocb_q_count; /* # CBs waiting for kthread */ - atomic_long_t nocb_q_count_lazy; /* (approximate). */ + atomic_long_t nocb_q_count; /* # CBs waiting for nocb */ + atomic_long_t nocb_q_count_lazy; /* invocation (all stages). */ struct rcu_head *nocb_follower_head; /* CBs ready to invoke. */ struct rcu_head **nocb_follower_tail; - atomic_long_t nocb_follower_count; /* # CBs ready to invoke. */ - atomic_long_t nocb_follower_count_lazy; /* (approximate). */ - int nocb_p_count; /* # CBs being invoked by kthread */ - int nocb_p_count_lazy; /* (approximate). */ wait_queue_head_t nocb_wq; /* For nocb kthreads to sleep on. */ struct task_struct *nocb_kthread; int nocb_defer_wakeup; /* Defer wakeup of nocb_kthread. */ @@ -356,8 +352,6 @@ struct rcu_data { struct rcu_head *nocb_gp_head cacheline_internodealigned_in_smp; /* CBs waiting for GP. */ struct rcu_head **nocb_gp_tail; - long nocb_gp_count; - long nocb_gp_count_lazy; bool nocb_leader_sleep; /* Is the nocb leader thread asleep? */ struct rcu_data *nocb_next_follower; /* Next follower in wakeup chain. */ @@ -622,24 +616,15 @@ static void rcu_dynticks_task_exit(void); #endif /* #ifndef RCU_TREE_NONCORE */ #ifdef CONFIG_RCU_TRACE -#ifdef CONFIG_RCU_NOCB_CPU -/* Sum up queue lengths for tracing.
*/ +/* Read out queue lengths for tracing. */ static inline void rcu_nocb_q_lengths(struct rcu_data *rdp, long *ql, long *qll) { - *ql = atomic_long_read(&rdp->nocb_q_count) + - rdp->nocb_p_count + - atomic_long_read(&rdp->nocb_follower_count) + - rdp->nocb_p_count + rdp->nocb_gp_count; - *qll = atomic_long_read(&rdp->nocb_q_count_lazy) + - rdp->nocb_p_count_lazy + -
Re: frequent lockups in 3.18rc4
On Sat, Dec 20, 2014 at 10:25 AM, Linus Torvalds wrote: > > How/where is the HPET overflow case handled? I don't know the code enough. Hmm, ok, I've re-acquainted myself with it. And I have to admit that I can't see anything wrong. The whole "update_wall_clock" and the shadow timekeeping state is confusing as hell, but seems fine. We'd have to avoid update_wall_clock for a *long* time for overflows to occur. And the overflow in 32 bits isn't that special, since the only thing that really matters is the overflow of "cycle_now - tkr->cycle_last" within the mask. So I'm not seeing anything even halfway suspicious. Linus
Re: frequent lockups in 3.18rc4
On Fri, Dec 19, 2014 at 5:57 PM, Linus Torvalds wrote: > > I'm claiming that the race happened *once*. And it then corrupted some > data structure or similar sufficiently that CPU0 keeps looping. > > Perhaps something keeps re-adding itself to the head of the timerqueue > due to the race. So tick_sched_timer() does ktime_t now = ktime_get(); ... hrtimer_forward(timer, now, tick_period); return HRTIMER_RESTART; and then __run_hrtimer does enqueue_hrtimer(timer, base); which just adds the timer back on the timer heap. So all you need to get an infinite loop (as far as I can see) is that hrtimer_forward() doesn't actually move the timer forward. The most likely reason would seem to be this: delta = ktime_sub(now, hrtimer_get_expires(timer)); if (delta.tv64 < 0) return 0; and clearly it *should* return a positive number, since the timer has expired, so the expiry time _should_ be smaller than "now". So it should never trigger, and this bug is clearly impossible. HOWEVER. It turns out that while tick_sched_timer() does "ktime_get()" to get the current time, the actual timer machinery does *not* do that at all. The actual timer machinery does entry_time = now = hrtimer_update_base(cpu_base); base = cpu_base->clock_base + i; basenow = ktime_add(now, base->offset); _once_ per hrtimer_clock_base. And then it iterates using that "basenow" thing, and compares it to the timer expiry. So we have two different times. Now, let's think about what happens if those clocks aren't quite in sync. We know (since __run_hrtimer was called) that basenow.tv64 > hrtimer_get_softexpires_tv64(timer) but here we have "basenow" - which is not that ktime_get(), and we have "hrtimer_get_softexpires_tv64()" (which is not hrtimer_get_expires() in general - we have all that "delta" range handling, but for the scheduling tick it *should* be the same).
So I can see at least one lockup: - if "expires < basenow" hrtimer_interrupt() will run the timer - if "now < expires" hrtimer_forward() will not do anything, and will just reschedule the timer with the same expiration iow, all you need for a re-arming of the same timer is: now < expires < basenow now, the two clocks (now and basenow) are not the same, but they do have the same *base*. So they are related, but even the base time was gotten under two different sequence locks, so even the base could have been updated in between the hrtimer_update_base() time and the ktime_get(). And even though they have the same base, they have different offsets: basenow does that "base->offset" thing (and ktime_get_update_offsets_now() does timekeeping_get_ns() - now = ktime_get() does base = tk->tkr.base_mono; nsecs = timekeeping_get_ns(&tk->tkr); - basenow = ktime_get_update_offsets_now() does base = tk->tkr.base_mono; nsecs = timekeeping_get_ns(&tk->tkr); .. and then .. ktime_add(.., base->offset); and if I read the thing right, the ktime_add() should be a no-op, because base->offset should be 0 for the normal monotonic clock. Right? So the two times (now and basenow) *should* be the same time, and the whole "now < expires < basenow" situation can never happen. Right? Potentially wrong. Because that's where the whole "different sequence locks" comes in. The one-time race could be something that updates the base in between the (one-time) ktime_get_update_offsets_now() and the (then as a result pseudo-infinitely repeating) ktime_get. Hmm? If "base" ever goes backwards, or if "base" does *not* update atomically with the HPET timer overflows, I can see that happening. Of course, that would imply that ktime_get() is not monotonic. But we do know that we've had odd time issues on that machine. I think you already had DaveJ check clock monotonicity. But that was with the TSC, wasn't it?
I'm claiming maybe the HPET isn't monotonic, and there is some HPET clocksource issue with overflow in 32 bits. (I think the HPET *should* be 64-bit, and just the comparators for interrupts may be 32-bit, but we use a "readl()" and only use the low 32-bits even if the upper 32 bits *might* be ok). I keep harping on that HPET overflow, because we actually have the 6 "locked up" HPET traces, and then we have a seventh without that lockup, and there definitely was an overflow in 32 bits: [torvalds@i7 linux]$ grep -3 read_hpet+0x16 ~/dj-1.txt | grep RAX RAX: 61fece8a RBX: 00510792 RCX: RAX: 79e588fc RBX: 00511d6e RCX: RAX: 91ca7f65 RBX: 00513346 RCX: RAX: a9afbd0d RBX: 0051491e RCX: RAX: cbd1340c RBX: 0051684a RCX: RAX: fb9d303f RBX: 005193fc RCX: RAX: 2b67efe4 RBX:
Re: frequent lockups in 3.18rc4
On Fri, Dec 19, 2014 at 02:05:20PM -0800, Linus Torvalds wrote: > > Right now I'm doing Chris' idea of "turn debugging back on, > > and try without serial console". Shall I try your suggestion > > on top of that ? > > Might as well. I doubt it really will make any difference, but I also > don't think it will interact badly in any way, so.. It locked up. It's not even responding to icmp. It might be Monday before I can see what's on the console. Dave
Re: frequent lockups in 3.18rc4
On Fri, Dec 19, 2014 at 02:05:20PM -0800, Linus Torvalds wrote: Right now I'm doing Chris' idea of turn debugging back on, and try without serial console. Shall I try your suggestion on top of that ? Might as well. I doubt it really will make any difference, but I also don't think it will interact badly in any way, so.. It locked up. It's not even responding to icmp. It might be Monday before I can see what's on the console. Dave -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: frequent lockups in 3.18rc4
On Fri, Dec 19, 2014 at 5:57 PM, Linus Torvalds torva...@linux-foundation.org wrote: I'm claiming that the race happened *once*. And it then corrupted some data structure or similar sufficiently that CPU0 keeps looping. Perhaps something keeps re-adding itself to the head of the timerqueue due to the race. So tick_sched_timer() does ktime_t now = ktime_get(); ... hrtimer_forward(timer, now, tick_period); return HRTIMER_RESTART; and then __run_hrtimer does enqueue_hrtimer(timer, base); which just adds the timer back on the tiemr heap. So all you need to get an infinite loop (as far as I can see) is that hrtimer_forward() doesn't actually move the timer forward. The most likely reason would seem to be this: delta = ktime_sub(now, hrtimer_get_expires(timer)); if (delta.tv64 0) return 0; and clearly it *should* return a positive number, since the timer has expired, so the expiry time _should_ be smaller than now. So it should never trigger, and this bug is clearly impossible. HOWEVER. It turns out that while tick_sched_timer() does ktime_get() to get the current time, the actual timer machinery does *not* do that at all. The actual timer machinery does entry_time = now = hrtimer_update_base(cpu_base); base = cpu_base-clock_base + i; basenow = ktime_add(now, base-offset); _once_ per hrtimer_clock_base. And then it iterates using that basenow thing, and compares it to the timer expiry. So we have two different times. Now, let's think about what happens if those clocks aren't quote in sync. We know (since __run_hrtimer was called) that basenow.tv64 hrtimer_get_softexpires_tv64(timer) but here we have basenow - which is not that ktime_get(), and we have hrtimer_get_softexpires_tv64() (which is not hrtimer_get_expires() in general - we have all that delta range handling, but for the scheduling tick it *should* be the same). 
So I can see at least one lockup: - if expires basenow hrtimer_interrupt() will run the timer - if now expires hrtimer_forward() will not do anything, and will just reschedule the timer with the same expiration iow, all you need for a re-arming of the same timer is: now expires basenow now, the two clocks (now and basenow) are not the same, but they do have the same *base*. So they are related, but even the base time was gotten under two different sequence locks, so even the base could have been updated in between the hrtimer_update_base() time and the ktime_get(). And even though they have the same base, they have different offsets: basenow does that base-offset thing (and ktime_get_update_offsets_now() does timekeeping_get_ns() - now = ktime_get() does base = tk-tkr.base_mono; nsecs = timekeeping_get_ns(tk-tkr); - basenow = ktime_get_update_offsets_now() does base = tk-tkr.base_mono; nsecs = timekeeping_get_ns(tk-tkr); .. and then .. ktime_add(.., base-offset); and if I read the thing right, the ktime_add() should be a no-op, because base-offset should be 0 for the normal monotonic clock. Right? So the two times (now and basenow) *should* be the same time, and the whole now expires basenow situation can never happen. Right? Potentially wrong. Because that's where the whole different sequence locks comes in. The one-time race could be something that updates the base in between the (one-time) ktime_get_update_offsets_now() and the (then as a result pseudo-infinitely repeating) ktime_get. Hmm? If base ever goes backwards, or if base does *not* update atomically with the HPET timer overflows, I can see that happening. Of course, that would imply that ktime_get() is not monotonic. But we do know that we've had odd time issues on that machine. I think you already had DaveJ check clock monotonicity. But that was with the TSC, wasn't it? I'm claiming maybe the HPET isn't monotonic, and there is some HPET clocksource issue with overflow in 32 bits. 
(I think the HPET *should* be 64-bit, and just the comparators for interrupts may be 32-bit, but we use a readl() and only use the low 32-bits even if the upper 32 bits *might* be ok). I keep harping on that HPET overflow, because we actually have the 6 locked up HPET traces, and then we have a seventh without that lockup, and there definitely was a overflow in 32 bits: [torvalds@i7 linux]$ grep -3 read_hpet+0x16 ~/dj-1.txt | grep RAX RAX: 61fece8a RBX: 00510792 RCX: RAX: 79e588fc RBX: 00511d6e RCX: RAX: 91ca7f65 RBX: 00513346 RCX: RAX: a9afbd0d RBX: 0051491e RCX: RAX: cbd1340c RBX: 0051684a RCX: RAX: fb9d303f RBX: 005193fc RCX: RAX: 2b67efe4 RBX: 0051c224 RCX:
Re: frequent lockups in 3.18rc4
On Sat, Dec 20, 2014 at 10:25 AM, Linus Torvalds torva...@linux-foundation.org wrote: How/where is the HPET overflow case handled? I don't know the code enough. Hmm, ok, I've re-acquainted myself with it. And I have to admit that I can't see anything wrong. The whole update_wall_clock and the shadow timekeeping state is confusing as hell, but seems fine. We'd have to avoid update_wall_clock for a *long* time for overflows to occur. And the overflow in 32 bits isn't that special, since the only thing that really matters is the overflow of cycle_now - tkr-cycle_last within the mask. So I'm not seeing anything even halfway suspicious. Linus -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: frequent lockups in 3.18rc4
On Sat, Dec 20, 2014 at 01:16:29PM -0800, Linus Torvalds wrote:
> On Sat, Dec 20, 2014 at 10:25 AM, Linus Torvalds torva...@linux-foundation.org wrote:
> >
> > How/where is the HPET overflow case handled? I don't know the code enough.
>
> Hmm, ok, I've re-acquainted myself with it. And I have to admit that
> I can't see anything wrong. The whole update_wall_clock and the shadow
> timekeeping state is confusing as hell, but seems fine. We'd have to
> avoid update_wall_clock for a *long* time for overflows to occur.
>
> And the overflow in 32 bits isn't that special, since the only thing
> that really matters is the overflow of "cycle_now - tkr->cycle_last"
> within the mask.
>
> So I'm not seeing anything even halfway suspicious.

One long shot is a bug in rcu_barrier() that I introduced in v3.18-rc1. This is a low-probability race that can cause rcu_barrier() and friends to return too soon, which can of course result in arbitrary misbehavior. Please see below for a fix which looks good thus far in reasonably intense rcutorture testing. Might be what Dave and Sasha are seeing. Or not.

							Thanx, Paul

rcu: Fix rcu_barrier() race that could result in too-short wait

The rcu_barrier() no-callbacks check for no-CBs CPUs has race conditions. It checks a given CPU's lists of callbacks, and if all three no-CBs lists are empty, ignores that CPU. However, these three lists could potentially be empty even when callbacks are present if the check executed just as the callbacks were being moved from one list to another.

It turns out that recent versions of rcutorture can spot this race.

This commit plugs this hole by consolidating the per-list counts of no-CBs callbacks into a single count, which is incremented before the corresponding callback is posted and decremented after it is invoked. Then rcu_barrier() checks this single count to reliably determine whether the corresponding CPU has no-CBs callbacks.

Signed-off-by: Paul E.
McKenney paul...@linux.vnet.ibm.com

diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 7680fc275036..658b691dc32b 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -3318,6 +3318,7 @@ static void _rcu_barrier(struct rcu_state *rsp)
 		} else {
 			_rcu_barrier_trace(rsp, "OnlineNoCB", cpu,
 					   rsp->n_barrier_done);
+			smp_mb__before_atomic();
 			atomic_inc(&rsp->barrier_cpu_count);
 			__call_rcu(&rdp->barrier_head,
 				   rcu_barrier_callback, rsp, cpu, 0);
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index 8e7b1843896e..cb5908672f11 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -340,14 +340,10 @@ struct rcu_data {
 #ifdef CONFIG_RCU_NOCB_CPU
 	struct rcu_head *nocb_head;	/* CBs waiting for kthread. */
 	struct rcu_head **nocb_tail;
-	atomic_long_t nocb_q_count;	/* # CBs waiting for kthread */
-	atomic_long_t nocb_q_count_lazy; /*  (approximate). */
+	atomic_long_t nocb_q_count;	/* # CBs waiting for nocb */
+	atomic_long_t nocb_q_count_lazy; /*  invocation (all stages). */
 	struct rcu_head *nocb_follower_head; /* CBs ready to invoke. */
 	struct rcu_head **nocb_follower_tail;
-	atomic_long_t nocb_follower_count; /* # CBs ready to invoke. */
-	atomic_long_t nocb_follower_count_lazy; /*  (approximate). */
-	int nocb_p_count;		/* # CBs being invoked by kthread */
-	int nocb_p_count_lazy;		/*  (approximate). */
 	wait_queue_head_t nocb_wq;	/* For nocb kthreads to sleep on. */
 	struct task_struct *nocb_kthread;
 	int nocb_defer_wakeup;		/* Defer wakeup of nocb_kthread. */
@@ -356,8 +352,6 @@ struct rcu_data {
 	struct rcu_head *nocb_gp_head cacheline_internodealigned_in_smp;
 					/* CBs waiting for GP. */
 	struct rcu_head **nocb_gp_tail;
-	long nocb_gp_count;
-	long nocb_gp_count_lazy;
 	bool nocb_leader_sleep;	/* Is the nocb leader thread asleep? */
 	struct rcu_data *nocb_next_follower; /* Next follower in wakeup chain.
					 */
@@ -622,24 +616,15 @@ static void rcu_dynticks_task_exit(void);
 #endif /* #ifndef RCU_TREE_NONCORE */

 #ifdef CONFIG_RCU_TRACE
-#ifdef CONFIG_RCU_NOCB_CPU
-/* Sum up queue lengths for tracing. */
+/* Read out queue lengths for tracing. */
 static inline void rcu_nocb_q_lengths(struct rcu_data *rdp, long *ql, long *qll)
 {
-	*ql = atomic_long_read(&rdp->nocb_q_count) +
-	      rdp->nocb_p_count +
-	      atomic_long_read(&rdp->nocb_follower_count) +
-	      rdp->nocb_p_count + rdp->nocb_gp_count;
-	*qll = atomic_long_read(&rdp->nocb_q_count_lazy) +
-
Re: frequent lockups in 3.18rc4
On Fri, Dec 19, 2014 at 5:00 PM, Thomas Gleixner wrote:
>
> The watchdog timer runs on a fully periodic schedule. It's self
> rearming via
>
>     hrtimer_forward_now(hrtimer, ns_to_ktime(sample_period));
>
> So if that aligns with the equally periodic tick interrupt on the
> other CPU then you might get into that situation due to the fully
> synchronizing and serializing nature of HPET reads.

No. Really. No.

There is no way in hell that a modern system ends up being that synchronized. Not to the point that the interrupt also happens at the *exact* same point.

Thomas, you're in denial. It's not just the exact same instruction in the timer interrupt handler ("timer interrupts and NMI's are in sync"), the call chain *past* the timer interrupt is the exact same thing too. The stack has the same contents.

So not only must the two CPU's be perfectly in sync so that the NMI always triggers on exactly the same instruction, the CPU that takes the timer interrupt (in your scenario, over and over again) must be magically filling the stack with the exact same thing every time. That CPU isn't making progress that just happens to be "synchronized".

Admit it, you've not seen a busy system that is *so* synchronized that you get timer interrupts and NMI's that hit on the same instruction over a sequence of minutes.

> So lets assume there is that hrtimer_remove race (I'm certainly going
> to stare at that tomorrow with fully awake brain. It's past beer
> o'clock here). How do you explain that:
>
> 1) the watchdog always triggers on CPU1?
>
> 2) the race only ever happens on CPU0?

I'm claiming that the race happened *once*. And it then corrupted some data structure or similar sufficiently that CPU0 keeps looping. Perhaps something keeps re-adding itself to the head of the timerqueue due to the race.

The watchdog doesn't trigger on CPU0, because this is the software watchdog, and interrupts are disabled on CPU0, and CPU0 isn't making any progress.
Because it's looping in a fairly tight loop.

The watchdog triggers on CPU1 (over and over again) because CPU1 is waiting for the TLB shootdown to complete. And it doesn't, because interrupts are disabled on CPU0 that it's trying to shoot down the TLB on. That theory at least fits the data. So CPU1 isn't doing anything odd at all.

In contrast, "magic happens so that everything is so synchronized that you cross-synchronize two CPU's making real progress over many minutes" sounds like just a fairy tale to me.

Your "it has a question mark in front of it" objection is bogus. We got an *interrupt* in the middle of the call chain. Of *course* the call chain is unreliable. It doesn't matter. What matters is that the stack under the interrupt DOES NOT CHANGE. It doesn't even matter if it's a real honest-to-god callchain or not, what matters is that the kernel stack under the interrupt is unchanged. No way does that happen if it's making progress at the same time.

> 3) the hrtimer interrupt took too long message never appears?

At a guess, it's looping (for a long long time) on timerqueue_getnext() in hrtimer_interrupt(), and then returns. Never gets to the retry or the "interrupt took %llu" messages.

And we know what timer entry it is looping on: tick_sched_timer. Which does a HRTIMER_RESTART, so the re-enqueueing isn't exactly unlikely. All it needs is that the re-enqueueing has gotten confused enough that it re-enqueues it on the same queue over and over again. Which would run tick_sched_timer over and over again.

No? Until the re-enqueuing magically stops (and we do actually have a HPET counter overflow there). Just look at the RAX values:

  RAX: fb9d303f
  RAX: 2b67efe4

that fb9d303f is the last time we see that particular callchain. The next time we see hpet_read(), it's that 2b67efe4 thing. So my "maybe it has something to do with HPET overflow" wasn't just a random throw-away comment.
We actually have real data saying that the HPET *did* overflow, and it in fact happened somewhere around the time when the lockup went away. Are they related? Maybe not. But dammit, there's a lot of "coincidences" here. Not just the "oh, it always takes the NMI on the exact same instruction". > 4) the RT throttler hit? > 5) that the system makes progress afterwards? .. something eventually clears the problem, maybe because of the HPET overflow. I dunno. I'm just saying that your arguments to ignore CPU0 are pretty damn weak. So I think the CPU1 behavior is 100% consistent with CPU0 just having interrupts disabled. So I think CPU1 is _trivial_ to explain if you accept that CPU0 is doing something weird. Which is why I think your "look at CPU1" sounds so strange. I don't think CPU1 is all that interesting. I can *trivially* explain it with a single sentence, and did exactly that above. Can you trivially explain CPU0? Without the "it's just a big coincidence that we take the NMI on the same instruction for several minutes,
Re: frequent lockups in 3.18rc4
On Fri, 19 Dec 2014, Chris Mason wrote:
> On Fri, Dec 19, 2014 at 6:22 PM, Thomas Gleixner wrote:
> > But at the very end this would be detected by the runtime check of the
> > hrtimer interrupt, which does not trigger. And it would trigger at
> > some point as ALL cpus including CPU0 in that trace dump make
> > progress.
>
> I'll admit that at some point we should be hitting one of the WARN or BUG_ON,
> but it's possible to thread that needle and corrupt the timer list, without
> hitting a warning (CPU 1 in my example has to enqueue last). Once the rbtree
> is hosed, it can go forever. Probably not the bug we're looking for, but
> still suspect in general.

I surely have a close look at that, but in that case we get out of that state later on and I doubt that we have

 A) a corruption of the rbtree
 B) a self healing of the rbtree afterwards

I doubt it, but who knows.

Though even if A & B would happen we would still get the 'hrtimer interrupt took a gazillion of seconds' warning because CPU0 definitely leaves the timer interrupt at some point otherwise we would not see backtraces from usb, userspace and idle later on.

Thanks, tglx
Re: frequent lockups in 3.18rc4
On Fri, 19 Dec 2014, Linus Torvalds wrote:
> On Fri, Dec 19, 2014 at 3:14 PM, Thomas Gleixner wrote:
> > Now that all looks correct. So there is something else going on. After
> > staring some more at it, I think we are looking at it from the wrong
> > angle.
> >
> > The watchdog always detects CPU1 as stuck and we got completely
> > fixated on the csd_wait() in the stack trace on CPU1. Now we have
> > stack traces which show a different picture, i.e. CPU1 makes progress
> > after a gazillion of seconds.
>
> .. but that doesn't explain why CPU0 ends up always being at that
> *exact* same instruction in the NMI backtrace.

Up to the point where it exposes different instructions and completely different code paths. I'm not agreeing with your theory that after the RT throttler hit the watchdog and everything else goes completely bonkers. And even before that happens we see a different backtrace on CPU0:

[25149.982766] RIP: 0010:[] [] invoke_rcu_core+0x2b/0x50

Though I have to admit that this "very same instruction" pattern puzzled me for quite a while as well.

> While a fairly tight loop, together with "mmio read is very expensive
> and synchronizing" would explain it. An MMIO read can easily be as
> expensive as several thousand instructions.

The watchdog timer runs on a fully periodic schedule. It's self rearming via

    hrtimer_forward_now(hrtimer, ns_to_ktime(sample_period));

So if that aligns with the equally periodic tick interrupt on the other CPU then you might get into that situation due to the fully synchronizing and serializing nature of HPET reads. That can drift apart over time because the timer device (apic or that new fangled tscdeadline timer) are not frequency corrected versus timekeeping.

> > I think we really need to look at CPU1 itself.
>
> Not so fast. Take another look at CPU0.
>
> [24998.083577] [] ktime_get+0x3e/0xa0
> [24998.084450] [] tick_sched_timer+0x23/0x160
> [24998.085315] [] __run_hrtimer+0x76/0x1f0
> [24998.086173] [] ? tick_init_highres+0x20/0x20
> [24998.087025] [] hrtimer_interrupt+0x107/0x260
> [24998.087877] [] local_apic_timer_interrupt+0x3b/0x70
> [24998.088732] [] smp_apic_timer_interrupt+0x45/0x60
> [24998.089583] [] apic_timer_interrupt+0x6f/0x80
> [24998.090435]
> [24998.091279] [] ? __remove_hrtimer+0x4e/0xa0
> [24998.092118] [] ? ipcget+0x8a/0x1e0
> [24998.092951] [] ? ipcget+0x7c/0x1e0
> [24998.093779] [] SyS_msgget+0x4d/0x70
>
> Really. None of that changed. NONE. The likelihood that we hit the
> exact same instruction every time? Over a timeframe of more than a
> minute?
>
> The only way I see that happening is (a) NMI is completely buggered,
> and the backtrace is just random crap that is always the same. Or (b)
> it's really a fairly tight loop.
>
> The fact that you had a hrtimer interrupt happen in the *middle* of
> __remove_hrtimer() is really another fairly strong hint. That smells
> like "__remove_hrtimer() has a race with hrtimer interrupts".

That __remove_hrtimer has a '?' in front of it. So it's not a reliable trace entry. There is NO hrtimer related operation in the msgget() syscall at all and SyS_msgget() is the only reliable entry on that stack trace. So that __remove_hrtimer operation happened before that msgget() syscall and is just a stack artifact. poll/select/nanosleep whatever.

> And that race results in a basically endless loop (which perhaps ends
> when the HRtimer overflows, in what, a few minutes?)

hrtimers overflow in about 584 years

> I really don't think you should look at CPU1. Not when CPU0 has such
> an interesting pattern that you dismissed just because the HPET is
> making progress.

No. I did not dismiss it because HPET is making progress. I looked at it from a different angle.

So lets assume there is that hrtimer_remove race (I'm certainly going to stare at that tomorrow with fully awake brain. It's past beer o'clock here). How do you explain that:

1) the watchdog always triggers on CPU1?

2) the race only ever happens on CPU0?
3) the hrtimer interrupt took too long message never appears?

   If that timer interrupt loops forever then it will complain about
   that. And it leaves that code for sure as the backtrace of CPU0
   hits user space later on.

4) the RT throttler hit?

   Admittedly we dont know from which CPU and which task that comes
   but that's very simple to figure out. Debug patch below.

5) that the system makes progress afterwards?

6) If my assumption about an interrupt storm turns out to be true,
   then it explains all of the above.

I might be wrong as usual, but I still think it's worth to have a look.

Thanks, tglx

---
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index ee15f5a0d1c1..d9e4153d405b 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -895,7 +895,8 @@ static int sched_rt_runtime_exceeded(struct rt_rq *rt_rq)
 	 */
 	if (likely(rt_b->rt_runtime)) {
 		rt_rq->rt_throttled =
Re: frequent lockups in 3.18rc4
On Fri, Dec 19, 2014 at 6:22 PM, Thomas Gleixner wrote: On Fri, 19 Dec 2014, Chris Mason wrote: On Fri, Dec 19, 2014 at 11:15:21AM -0800, Linus Torvalds wrote: > Here's another pattern. In your latest thing, every single time that > CPU1 is waiting for some other CPU to pick up the IPI, we have CPU0 > doing this: > > [24998.060963] NMI backtrace for cpu 0 > [24998.061989] CPU: 0 PID: 2940 Comm: trinity-c150 Not tainted 3.18.0+ #108 > [24998.064073] task: 8801bf3536b0 ti: 880197e0c000 task.ti: > 880197e0c000 > [24998.065137] RIP: 0010:[] [] > read_hpet+0x16/0x20 > [24998.083577] [] ktime_get+0x3e/0xa0 > [24998.084450] [] tick_sched_timer+0x23/0x160 > [24998.085315] [] __run_hrtimer+0x76/0x1f0 > [24998.086173] [] ? tick_init_highres+0x20/0x20 > [24998.087025] [] hrtimer_interrupt+0x107/0x260 > [24998.087877] [] local_apic_timer_interrupt+0x3b/0x70 > [24998.088732] [] smp_apic_timer_interrupt+0x45/0x60 > [24998.089583] [] apic_timer_interrupt+0x6f/0x80 > [24998.090435] > [24998.091279] [] ? __remove_hrtimer+0x4e/0xa0 > [24998.092118] [] ? ipcget+0x8a/0x1e0 > [24998.092951] [] ? ipcget+0x7c/0x1e0 > [24998.093779] [] SyS_msgget+0x4d/0x70 > > and I think that's the smoking gun. The reason CPU0 isn't picking up > any IPI's is because it is in some endless loop around read_hpet(). > > There is even time information in the register dump: > > RAX: 61fece8a RBX: 00510792 RCX: > RAX: 79e588fc RBX: 00511d6e RCX: > RAX: 91ca7f65 RBX: 00513346 RCX: > RAX: a9afbd0d RBX: 0051491e RCX: > RAX: cbd1340c RBX: 0051684a RCX: > RAX: fb9d303f RBX: 005193fc RCX: > RAX: 2b67efe4 RBX: 0051c224 RCX: 0004 > > That RAX value is the value we just read from the HPET, and RBX seems > to be monotonically increasing too, so it's likely the sequence > counter in ktime_get(). > > So it's not stuck *inside* read_hpet(), and it's almost certainly not > the loop over the sequence counter in ktime_get() either (it's not > increasing *that* quickly). 
> But some basically infinite __run_hrtimer thing or something?

Really interesting.

So, we're calling __run_hrtimer in a loop:

	while ((node = timerqueue_getnext(&base->active))) {
		...
		__run_hrtimer(timer, &basenow);
		...
	}

The easy question is how often does trinity call nanosleep?

Looking at __run_hrtimer(), it drops the lock and runs the function and then takes the lock again, maybe enqueueing us again right away.

timer->state is supposed to protect us from other CPUs jumping in and doing something else with the timer, but it feels racey wrt remove_hrtimer(). Something like this, but I'm not sure how often __hrtimer_start_range_ns gets called:

    CPU 0                                  CPU 1
    __run_hrtimer()
        timer->state = HRTIMER_STATE_CALLBACK
        removed from list
        unlock cpu_base->lock
        restart = fn(timer)
                                           __hrtimer_start_range_ns()
                                               base = lock_hrtimer_base()
                                               ret = remove_hrtimer()
                                                   finds timer->state =
                                                       HRTIMER_STATE_CALLBACK
                                                   does nothing
                                               new_base = switch_hrtimer_base()
                                               now we're on a different
                                               base, different lock
                                               lock(cpu_base->lock)
    enqueue the timer
                                               enqueue the timer

But at the very end this would be detected by the runtime check of the hrtimer interrupt, which does not trigger. And it would trigger at some point as ALL cpus including CPU0 in that trace dump make progress.

I'll admit that at some point we should be hitting one of the WARN or BUG_ON, but it's possible to thread that needle and corrupt the timer list, without hitting a warning (CPU 1 in my example has to enqueue last). Once the rbtree is hosed, it can go forever. Probably not the bug we're looking for, but still suspect in general.

-chris
Re: frequent lockups in 3.18rc4
On Fri, Dec 19, 2014 at 3:14 PM, Thomas Gleixner wrote:
>
> Now that all looks correct. So there is something else going on. After
> staring some more at it, I think we are looking at it from the wrong
> angle.
>
> The watchdog always detects CPU1 as stuck and we got completely
> fixated on the csd_wait() in the stack trace on CPU1. Now we have
> stack traces which show a different picture, i.e. CPU1 makes progress
> after a gazillion of seconds.

.. but that doesn't explain why CPU0 ends up always being at that *exact* same instruction in the NMI backtrace.

While a fairly tight loop, together with "mmio read is very expensive and synchronizing" would explain it. An MMIO read can easily be as expensive as several thousand instructions.

> I think we really need to look at CPU1 itself.

Not so fast. Take another look at CPU0.

[24998.083577] [] ktime_get+0x3e/0xa0
[24998.084450] [] tick_sched_timer+0x23/0x160
[24998.085315] [] __run_hrtimer+0x76/0x1f0
[24998.086173] [] ? tick_init_highres+0x20/0x20
[24998.087025] [] hrtimer_interrupt+0x107/0x260
[24998.087877] [] local_apic_timer_interrupt+0x3b/0x70
[24998.088732] [] smp_apic_timer_interrupt+0x45/0x60
[24998.089583] [] apic_timer_interrupt+0x6f/0x80
[24998.090435]
[24998.091279] [] ? __remove_hrtimer+0x4e/0xa0
[24998.092118] [] ? ipcget+0x8a/0x1e0
[24998.092951] [] ? ipcget+0x7c/0x1e0
[24998.093779] [] SyS_msgget+0x4d/0x70

Really. None of that changed. NONE. The likelihood that we hit the exact same instruction every time? Over a timeframe of more than a minute?

The only way I see that happening is (a) NMI is completely buggered, and the backtrace is just random crap that is always the same. Or (b) it's really a fairly tight loop.

The fact that you had a hrtimer interrupt happen in the *middle* of __remove_hrtimer() is really another fairly strong hint. That smells like "__remove_hrtimer() has a race with hrtimer interrupts".
And that race results in a basically endless loop (which perhaps ends when the HRtimer overflows, in what, a few minutes?)

I really don't think you should look at CPU1. Not when CPU0 has such an interesting pattern that you dismissed just because the HPET is making progress.

Linus
Re: frequent lockups in 3.18rc4
On Fri, 19 Dec 2014, Chris Mason wrote: > On Fri, Dec 19, 2014 at 11:15:21AM -0800, Linus Torvalds wrote: > > Here's another pattern. In your latest thing, every single time that > > CPU1 is waiting for some other CPU to pick up the IPI, we have CPU0 > > doing this: > > > > [24998.060963] NMI backtrace for cpu 0 > > [24998.061989] CPU: 0 PID: 2940 Comm: trinity-c150 Not tainted 3.18.0+ #108 > > [24998.064073] task: 8801bf3536b0 ti: 880197e0c000 task.ti: > > 880197e0c000 > > [24998.065137] RIP: 0010:[] [] > > read_hpet+0x16/0x20 > > [24998.083577] [] ktime_get+0x3e/0xa0 > > [24998.084450] [] tick_sched_timer+0x23/0x160 > > [24998.085315] [] __run_hrtimer+0x76/0x1f0 > > [24998.086173] [] ? tick_init_highres+0x20/0x20 > > [24998.087025] [] hrtimer_interrupt+0x107/0x260 > > [24998.087877] [] local_apic_timer_interrupt+0x3b/0x70 > > [24998.088732] [] smp_apic_timer_interrupt+0x45/0x60 > > [24998.089583] [] apic_timer_interrupt+0x6f/0x80 > > [24998.090435] > > [24998.091279] [] ? __remove_hrtimer+0x4e/0xa0 > > [24998.092118] [] ? ipcget+0x8a/0x1e0 > > [24998.092951] [] ? ipcget+0x7c/0x1e0 > > [24998.093779] [] SyS_msgget+0x4d/0x70 > > > > and I think that's the smoking gun. The reason CPU0 isn't picking up > > any IPI's is because it is in some endless loop around read_hpet(). > > > > There is even time information in the register dump: > > > > RAX: 61fece8a RBX: 00510792 RCX: > > RAX: 79e588fc RBX: 00511d6e RCX: > > RAX: 91ca7f65 RBX: 00513346 RCX: > > RAX: a9afbd0d RBX: 0051491e RCX: > > RAX: cbd1340c RBX: 0051684a RCX: > > RAX: fb9d303f RBX: 005193fc RCX: > > RAX: 2b67efe4 RBX: 0051c224 RCX: 0004 > > > > That RAX value is the value we just read from the HPET, and RBX seems > > to be monotonically increasing too, so it's likely the sequence > > counter in ktime_get(). > > > > So it's not stuck *inside* read_hpet(), and it's almost certainly not > > the loop over the sequence counter in ktime_get() either (it's not > > increasing *that* quickly). 
> > But some basically infinite __run_hrtimer thing or something?
>
> Really interesting.
>
> So, we're calling __run_hrtimer in a loop:
>
>	while ((node = timerqueue_getnext(&base->active))) {
>		...
>		__run_hrtimer(timer, &basenow);
>		...
>	}
>
> The easy question is how often does trinity call nanosleep?
>
> Looking at __run_hrtimer(), it drops the lock and runs the function and then
> takes the lock again, maybe enqueueing us again right away.
>
> timer->state is supposed to protect us from other CPUs jumping in and doing
> something else with the timer, but it feels racey wrt remove_hrtimer().
> Something like this, but I'm not sure how often __hrtimer_start_range_ns gets
> called
>
>	CPU 0                                  CPU 1
>	__run_hrtimer()
>	    timer->state = HRTIMER_STATE_CALLBACK
>	    removed from list
>	    unlock cpu_base->lock
>	    restart = fn(timer)
>	                                       __hrtimer_start_range_ns()
>	                                           base = lock_hrtimer_base()
>	                                           ret = remove_hrtimer()
>	                                               finds timer->state =
>	                                                   HRTIMER_STATE_CALLBACK
>	                                               does nothing
>	                                           new_base = switch_hrtimer_base()
>	                                           now we're on a different
>	                                           base, different lock
>	                                           lock(cpu_base->lock)
>	enqueue the timer
>	                                           enqueue the timer

But at the very end this would be detected by the runtime check of the hrtimer interrupt, which does not trigger. And it would trigger at some point as ALL cpus including CPU0 in that trace dump make progress.

Thanks, tglx
Re: frequent lockups in 3.18rc4
On Fri, 19 Dec 2014, Linus Torvalds wrote: > Here's another pattern. In your latest thing, every single time that > CPU1 is waiting for some other CPU to pick up the IPI, we have CPU0 > doing this: > > [24998.060963] NMI backtrace for cpu 0 > [24998.061989] CPU: 0 PID: 2940 Comm: trinity-c150 Not tainted 3.18.0+ #108 > [24998.064073] task: 8801bf3536b0 ti: 880197e0c000 task.ti: > 880197e0c000 > [24998.065137] RIP: 0010:[] [] > read_hpet+0x16/0x20 > [24998.083577] [] ktime_get+0x3e/0xa0 > [24998.084450] [] tick_sched_timer+0x23/0x160 > [24998.085315] [] __run_hrtimer+0x76/0x1f0 > [24998.086173] [] ? tick_init_highres+0x20/0x20 > [24998.087025] [] hrtimer_interrupt+0x107/0x260 > [24998.087877] [] local_apic_timer_interrupt+0x3b/0x70 > [24998.088732] [] smp_apic_timer_interrupt+0x45/0x60 > [24998.089583] [] apic_timer_interrupt+0x6f/0x80 > [24998.090435] > [24998.091279] [] ? __remove_hrtimer+0x4e/0xa0 > [24998.092118] [] ? ipcget+0x8a/0x1e0 > [24998.092951] [] ? ipcget+0x7c/0x1e0 > [24998.093779] [] SyS_msgget+0x4d/0x70 > > and I think that's the smoking gun. The reason CPU0 isn't picking up > any IPI's is because it is in some endless loop around read_hpet(). > > There is even time information in the register dump: > > RAX: 61fece8a RBX: 00510792 RCX: > RAX: 79e588fc RBX: 00511d6e RCX: > RAX: 91ca7f65 RBX: 00513346 RCX: > RAX: a9afbd0d RBX: 0051491e RCX: > RAX: cbd1340c RBX: 0051684a RCX: > RAX: fb9d303f RBX: 005193fc RCX: > RAX: 2b67efe4 RBX: 0051c224 RCX: 0004 > > That RAX value is the value we just read from the HPET, and RBX seems > to be monotonically increasing too, so it's likely the sequence > counter in ktime_get(). 
Here is the full diff of the first and the second splat for CPU0:

 task: 8801bf3536b0 ti: 880197e0c000 task.ti: 880197e0c000
 read_hpet+0x16/0x20
 RSP: 0018:88024e203e38 EFLAGS: 0046
-RAX: 61fece8a RBX: 00510792 RCX:
+RAX: 79e588fc RBX: 00511d6e RCX:

RAX: (0x79e588fc - 0x61fece8a) / 14.318180e6 ~= 28.0061

So HPET @14.3MHz progressed by 28 seconds, which matches the splat delta between the first and the second one:

    25026.001132 - 24998.017355 = 27.9838

RBX: 0x00511d6e - 0x00510792 = 5596

The sequence counter increments by 2 per tick. So:

    28 / (5596/2) ~= 0.01 s ==> HZ = 100

The sequence counter is even, so ktime_get() will succeed.

 RDX: RSI: 88024e20c710 RDI: 81c26f40 RBP: 88024e203e38
 R08: R09: 000f
-R10: 0526 R11: 000f R12: 16bf99600917
+R10: 0526 R11: 000f R12: 16c61e4e2117

R12: 0x16c61e4e2117 - 0x16bf99600917 = 2.8e+10

That's the nanoseconds timestamp: 2.8e10/1e9 = 28

Now that all looks correct. So there is something else going on. After staring some more at it, I think we are looking at it from the wrong angle.

The watchdog always detects CPU1 as stuck and we got completely fixated on the csd_wait() in the stack trace on CPU1. Now we have stack traces which show a different picture, i.e. CPU1 makes progress after a gazillion of seconds.

I think we really need to look at CPU1 itself. AFAICT all these 'stuck' events happen in fully interruptible context. So an undetected interrupt storm can cause that. We only detect interrupt storms for unhandled interrupts, but for those where the handler returns IRQ_HANDLED, we just count them. For directly handled vectors we do not even have a detection mechanism at all.

That also might explain the RT throttling. If that storm hits a high prio task, the throttler will trigger. Just a theory, but worth to explore, IMO. So adding a dump of the total interrupt counts to the watchdog trace might give us some insight. Debug patch below.
Thanks, tglx

---
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 501baa9ac1be..2021662663c7 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -315,11 +315,21 @@ asmlinkage __visible void do_softirq(void)
 	local_irq_restore(flags);
 }

+static DEFINE_PER_CPU(unsigned long, irqcnt);
+
+void show_irqcnt(int cpu)
+{
+	pr_emerg("CPU#%d: IRQ %lu NMI %u\n", cpu, this_cpu_read(irqcnt),
+		 this_cpu_read(irq_stat.__nmi_count));
+}
+
 /*
  * Enter an interrupt context.
  */
 void irq_enter(void)
 {
+	this_cpu_inc(irqcnt);
+
 	rcu_irq_enter();
 	if (is_idle_task(current) && !in_interrupt()) {
 		/*
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 70bf11815f84..f505cc58d354 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -306,6 +306,8 @@ static void watchdog_interrupt_count(void)
 static int
Re: frequent lockups in 3.18rc4
On Fri, Dec 19, 2014 at 12:54 PM, Dave Jones wrote:
>
> Right now I'm doing Chris' idea of "turn debugging back on,
> and try without serial console". Shall I try your suggestion
> on top of that ?

Might as well. I doubt it really will make any difference, but I also don't think it will interact badly in any way, so..

Linus
Re: frequent lockups in 3.18rc4
On Fri, Dec 19, 2014 at 12:46:16PM -0800, Linus Torvalds wrote:
> On Fri, Dec 19, 2014 at 11:51 AM, Linus Torvalds wrote:
> >
> > I do note that we depend on the "new mwait" semantics where we do
> > mwait with interrupts disabled and a non-zero RCX value. Are there
> > possibly even any known CPU errata in that area? Not that it sounds
> > likely, but still..
>
> Remind me what CPU you have in that machine again? The %rax value for
> the mwait cases in question seems to be 0x32, which is either C7s-HSW
> or C7s-BDW, and in both cases has the "TLB flushed" flag set.
>
> I'm pretty sure you have a Haswell, I'm just checking. Which model?
> I'm assuming it's family 6, model 60, stepping 3? I found you
> mentioning i5-4670T in a perf thread.. That the one?

Yep.

vendor_id	: GenuineIntel
cpu family	: 6
model		: 60
model name	: Intel(R) Core(TM) i5-4670T CPU @ 2.30GHz
stepping	: 3
microcode	: 0x1a

> Anyway, I don't actually believe in any CPU bugs, but you could try
> "intel_idle.max_cstate=0" and see if that makes any difference, for
> example.
>
> Or perhaps just "intel_idle.max_cstate=1", which leaves intel_idle
> active, but gets rid of the deeper sleep states (that incidentally
> also play games with leave_mm() etc)

So I'm leaving Red Hat on Tuesday, and can realistically only do one more experiment over the weekend before I give them this box back. Right now I'm doing Chris' idea of "turn debugging back on, and try without serial console". Shall I try your suggestion on top of that ?

I *hate* for this to be "the one that got away", but we've at least gotten some good mileage out of this bug in the last two months. Who knows, maybe I'll find some new hardware that will exhibit the same behaviour in the new year.

Dave
Re: frequent lockups in 3.18rc4
On Fri, Dec 19, 2014 at 11:51 AM, Linus Torvalds wrote:
>
> I do note that we depend on the "new mwait" semantics where we do
> mwait with interrupts disabled and a non-zero RCX value. Are there
> possibly even any known CPU errata in that area? Not that it sounds
> likely, but still..

Remind me what CPU you have in that machine again? The %rax value for
the mwait cases in question seems to be 0x32, which is either C7s-HSW
or C7s-BDW, and in both cases has the "TLB flushed" flag set.

I'm pretty sure you have a Haswell, I'm just checking. Which model?
I'm assuming it's family 6, model 60, stepping 3? I found you
mentioning i5-4670T in a perf thread.. That the one?

Anyway, I don't actually believe in any CPU bugs, but you could try
"intel_idle.max_cstate=0" and see if that makes any difference, for
example.

Or perhaps just "intel_idle.max_cstate=1", which leaves intel_idle
active, but gets rid of the deeper sleep states (that incidentally
also play games with leave_mm() etc)

	Linus
Re: frequent lockups in 3.18rc4
On Fri, Dec 19, 2014 at 03:31:36PM -0500, Chris Mason wrote:
> > So it's not stuck *inside* read_hpet(), and it's almost certainly not
> > the loop over the sequence counter in ktime_get() either (it's not
> > increasing *that* quickly). But some basically infinite __run_hrtimer
> > thing or something?
>
> Really interesting.
>
> So, we're calling __run_hrtimer in a loop:
>
> 	while ((node = timerqueue_getnext(&base->active))) {
> 		...
> 		__run_hrtimer(timer, &basenow);
> 		...
> 	}
>
> The easy question is how often does trinity call nanosleep?

It shouldn't call it directly. (syscalls/nanosleep.c)

	/*
	 * SYSCALL_DEFINE2(nanosleep, struct timespec __user *, rqtp,
	 *	struct timespec __user *, rmtp)
	 */
	#include "sanitise.h"

	struct syscallentry syscall_nanosleep = {
		.name = "nanosleep",
		.num_args = 2,
		.arg1name = "rqtp",
		.arg1type = ARG_ADDRESS,
		.arg2name = "rmtp",
		.arg2type = ARG_ADDRESS,
		.flags = AVOID_SYSCALL, // Boring. Can cause long sleeps.
	};

That last line being the key one. We used to do it, but it's, well..
boring. We could do something smarter, but given it's never triggered
anything interesting in the past, focussing runtime on the more
interesting syscalls seems to have been more fruitful.

	Dave
Re: frequent lockups in 3.18rc4
On Fri, Dec 19, 2014 at 11:15:21AM -0800, Linus Torvalds wrote:
> Here's another pattern. In your latest thing, every single time that
> CPU1 is waiting for some other CPU to pick up the IPI, we have CPU0
> doing this:
>
> [24998.060963] NMI backtrace for cpu 0
> [24998.061989] CPU: 0 PID: 2940 Comm: trinity-c150 Not tainted 3.18.0+ #108
> [24998.064073] task: 8801bf3536b0 ti: 880197e0c000 task.ti: 880197e0c000
> [24998.065137] RIP: 0010:[] [] read_hpet+0x16/0x20
> [24998.083577] [] ktime_get+0x3e/0xa0
> [24998.084450] [] tick_sched_timer+0x23/0x160
> [24998.085315] [] __run_hrtimer+0x76/0x1f0
> [24998.086173] [] ? tick_init_highres+0x20/0x20
> [24998.087025] [] hrtimer_interrupt+0x107/0x260
> [24998.087877] [] local_apic_timer_interrupt+0x3b/0x70
> [24998.088732] [] smp_apic_timer_interrupt+0x45/0x60
> [24998.089583] [] apic_timer_interrupt+0x6f/0x80
> [24998.090435]
> [24998.091279] [] ? __remove_hrtimer+0x4e/0xa0
> [24998.092118] [] ? ipcget+0x8a/0x1e0
> [24998.092951] [] ? ipcget+0x7c/0x1e0
> [24998.093779] [] SyS_msgget+0x4d/0x70
>
> and I think that's the smoking gun. The reason CPU0 isn't picking up
> any IPI's is because it is in some endless loop around read_hpet().
>
> There is even time information in the register dump:
>
>   RAX: 61fece8a RBX: 00510792 RCX:
>   RAX: 79e588fc RBX: 00511d6e RCX:
>   RAX: 91ca7f65 RBX: 00513346 RCX:
>   RAX: a9afbd0d RBX: 0051491e RCX:
>   RAX: cbd1340c RBX: 0051684a RCX:
>   RAX: fb9d303f RBX: 005193fc RCX:
>   RAX: 2b67efe4 RBX: 0051c224 RCX: 0004
>
> That RAX value is the value we just read from the HPET, and RBX seems
> to be monotonically increasing too, so it's likely the sequence
> counter in ktime_get().
>
> So it's not stuck *inside* read_hpet(), and it's almost certainly not
> the loop over the sequence counter in ktime_get() either (it's not
> increasing *that* quickly). But some basically infinite __run_hrtimer
> thing or something?

Really interesting.

So, we're calling __run_hrtimer in a loop:

	while ((node = timerqueue_getnext(&base->active))) {
		...
		__run_hrtimer(timer, &basenow);
		...
	}

The easy question is how often does trinity call nanosleep?

Looking at __run_hrtimer(), it drops the lock and runs the function and
then takes the lock again, maybe enqueueing us again right away.

timer->state is supposed to protect us from other CPUs jumping in and
doing something else with the timer, but it feels racey wrt
remove_hrtimer(). Something like this, but I'm not sure how often
__hrtimer_start_range_ns gets called:

CPU 0                                  CPU 1
__run_hrtimer()
  timer->state = HRTIMER_STATE_CALLBACK
  removed from list
  unlock cpu_base->lock
  restart = fn(timer)
                                       __hrtimer_start_range_ns()
                                         base = lock_hrtimer_base()
                                         ret = remove_hrtimer()
                                           finds timer->state ==
                                           HRTIMER_STATE_CALLBACK,
                                           does nothing
                                         new_base = switch_hrtimer_base()
                                         now we're on a different base,
                                         different lock
  lock(cpu_base->lock)
  enqueue the timer                      enqueue the timer
Re: frequent lockups in 3.18rc4
On Fri, Dec 19, 2014 at 11:15 AM, Linus Torvalds wrote:
>
> In your earlier trace (with spinlock debugging), the softlockup
> detection was in lock_acquire for copy_page_range(), but CPU2 was
> always in that "generic_exec_single" due to a TLB flush from that
> zap_page_range thing again. But there are no timer traces from that
> one, so I dunno.

Ahh, and that's because the TLB flushing is done under the page table
lock these days (see commit 1cf35d47712d: "mm: split 'tlb_flush_mmu()'
into tlb flushing and memory freeing parts"). Which means that if the
TLB flushing gets stuck on CPU#2, CPU#1 that is trying to get the page
table lock will be locked up too.

So this is all very consistent, actually. The underlying bug in both
cases seems to be that the IPI for the TLB flushing doesn't happen for
some reason.

In your second trace, that's explained by the fact that CPU0 is in a
timer interrupt. In the first trace with spinlock debugging, no such
obvious explanation exists. It could be that an IPI has gotten lost for
some reason.

However, the first trace does have this:

	NMI backtrace for cpu 3
	INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 66.180 msecs
	CPU: 3 PID: 0 Comm: swapper/3 Not tainted 3.18.0+ #107
	RIP: 0010: intel_idle+0xdb/0x180
	Code: 31 d2 65 48 8b 34 ...
	INFO: NMI handler (arch_trigger_all_cpu_backtrace_handler) took too long to run: 95.053 msecs

so something odd is happening (probably on CPU3). It took a long time
to react to the NMI IPI too. So there's definitely something screwy
going on here in IPI-land.

I do note that we depend on the "new mwait" semantics where we do
mwait with interrupts disabled and a non-zero RCX value. Are there
possibly even any known CPU errata in that area? Not that it sounds
likely, but still..
	Linus
Re: frequent lockups in 3.18rc4
On Fri, Dec 19, 2014 at 11:15:21AM -0800, Linus Torvalds wrote:
> sched: RT throttling activated
>
> And after RT throttling, it's random (not even always trinity), but
> that's probably because the watchdog thread doesn't run reliably any
> more.

So if you want to shoot that RT throttling in the head, you can do:

	echo -1 > /proc/sys/kernel/sched_rt_runtime_us

That should completely disable that stuff; of course, at that point a
run-away RR/FIFO thread will hog your cpu indefinitely.
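[Editor's note: a sketch of toggling that knob safely for a debugging session, as root. Saving the existing value rather than assuming the usual 950000 default avoids surprises on restore:]

```sh
# Save the current RT runtime limit so it can be restored later
old=$(cat /proc/sys/kernel/sched_rt_runtime_us)

# -1 disables RT throttling entirely: a runaway SCHED_FIFO/SCHED_RR
# task can now monopolize a CPU, so only do this while debugging
echo -1 > /proc/sys/kernel/sched_rt_runtime_us

# ... reproduce the lockup, let the watchdog thread run unthrottled ...

# Restore the saved value when done
echo "$old" > /proc/sys/kernel/sched_rt_runtime_us
```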