Re: frequent lockups in 3.18rc4

2015-02-12 Thread Linus Torvalds
On Thu, Feb 12, 2015 at 3:09 AM, Martin van Es wrote: > > Best I can come up with now is try the next mainline that has all the > fixes and ideas in this thread incorporated. Would that be 3.19? Yes. I'm attaching a patch (very much experimental - it might introduce new problems rather than fix

Re: frequent lockups in 3.18rc4

2015-02-12 Thread Martin van Es
To follow up on this long standing promise to bisect. I've made two attempts at bisecting and both landed in limbo. It's hard to explain but it feels like this bug has quantum properties; I know for sure it's present in 3.17 and not in 3.16(.7). But once I start bisecting it gets less

Re: frequent lockups in 3.18rc4

2015-02-12 Thread Linus Torvalds
On Thu, Feb 12, 2015 at 3:09 AM, Martin van Es mrva...@gmail.com wrote: Best I can come up with now is try the next mainline that has all the fixes and ideas in this thread incorporated. Would that be 3.19? Yes. I'm attaching a patch (very much experimental - it might introduce new problems

Re: frequent lockups in 3.18rc4

2015-01-12 Thread Thomas Gleixner
On Sun, 21 Dec 2014, Linus Torvalds wrote: > On Sun, Dec 21, 2014 at 2:32 PM, Dave Jones wrote: > > On Sun, Dec 21, 2014 at 02:19:03PM -0800, Linus Torvalds wrote: > > > > > > And finally, and stupidly, is there any chance that you have anything > > > accessing /dev/hpet? > > > > Not knowingly

Re: frequent lockups in 3.18rc4

2015-01-12 Thread Thomas Gleixner
On Sun, 21 Dec 2014, Linus Torvalds wrote: On Sun, Dec 21, 2014 at 2:32 PM, Dave Jones da...@codemonkey.org.uk wrote: On Sun, Dec 21, 2014 at 02:19:03PM -0800, Linus Torvalds wrote: And finally, and stupidly, is there any chance that you have anything accessing /dev/hpet? Not

Re: frequent lockups in 3.18rc4

2015-01-05 Thread John Stultz
On Mon, Jan 5, 2015 at 5:25 PM, Linus Torvalds wrote: > On Mon, Jan 5, 2015 at 5:17 PM, John Stultz wrote: >> >> Anyway, It may be worth keeping the 50% margin (and dropping the 12% >> reduction to simplify things) > > Again, the 50% margin is only on the multiplication overflow. Not on the

Re: frequent lockups in 3.18rc4

2015-01-05 Thread Linus Torvalds
On Mon, Jan 5, 2015 at 5:17 PM, John Stultz wrote: > > Anyway, It may be worth keeping the 50% margin (and dropping the 12% > reduction to simplify things) Again, the 50% margin is only on the multiplication overflow. Not on the mask. So it won't do anything at all for the case we actually care

Re: frequent lockups in 3.18rc4

2015-01-05 Thread John Stultz
On Sun, Jan 4, 2015 at 11:46 AM, Linus Torvalds wrote: > On Fri, Jan 2, 2015 at 4:27 PM, John Stultz wrote: >> >> So I sent out a first step validation check to warn us if we end up >> with idle periods that are larger than we expect. > > .. not having tested it, this is just from reading the

Re: frequent lockups in 3.18rc4

2015-01-05 Thread Linus Torvalds
On Mon, Jan 5, 2015 at 5:17 PM, John Stultz john.stu...@linaro.org wrote: Anyway, It may be worth keeping the 50% margin (and dropping the 12% reduction to simplify things) Again, the 50% margin is only on the multiplication overflow. Not on the mask. So it won't do anything at all for the

Re: frequent lockups in 3.18rc4

2015-01-05 Thread John Stultz
On Mon, Jan 5, 2015 at 5:25 PM, Linus Torvalds torva...@linux-foundation.org wrote: On Mon, Jan 5, 2015 at 5:17 PM, John Stultz john.stu...@linaro.org wrote: Anyway, It may be worth keeping the 50% margin (and dropping the 12% reduction to simplify things) Again, the 50% margin is only on

Re: frequent lockups in 3.18rc4

2015-01-05 Thread John Stultz
On Sun, Jan 4, 2015 at 11:46 AM, Linus Torvalds torva...@linux-foundation.org wrote: On Fri, Jan 2, 2015 at 4:27 PM, John Stultz john.stu...@linaro.org wrote: So I sent out a first step validation check to warn us if we end up with idle periods that are larger than we expect. .. not having

Re: frequent lockups in 3.18rc4

2015-01-04 Thread Linus Torvalds
On Fri, Jan 2, 2015 at 4:27 PM, John Stultz wrote: > > So I sent out a first step validation check to warn us if we end up > with idle periods that are larger than we expect. .. not having tested it, this is just from reading the patch, but it would *seem* that it doesn't actually validate the

Re: frequent lockups in 3.18rc4

2015-01-04 Thread Linus Torvalds
On Fri, Jan 2, 2015 at 4:27 PM, John Stultz john.stu...@linaro.org wrote: So I sent out a first step validation check to warn us if we end up with idle periods that are larger than we expect. .. not having tested it, this is just from reading the patch, but it would *seem* that it doesn't

Re: frequent lockups in 3.18rc4

2015-01-03 Thread Sasha Levin
On 01/02/2015 07:27 PM, John Stultz wrote: > On Fri, Dec 26, 2014 at 12:57 PM, Linus Torvalds > wrote: >> > On Fri, Dec 26, 2014 at 10:12 AM, Dave Jones >> > wrote: >>> >> On Fri, Dec 26, 2014 at 11:34:10AM -0500, Dave Jones wrote: >>> >> >>> >> > One thing I think I'll try is to try and

Re: frequent lockups in 3.18rc4

2015-01-03 Thread Sasha Levin
On 01/02/2015 07:27 PM, John Stultz wrote: On Fri, Dec 26, 2014 at 12:57 PM, Linus Torvalds torva...@linux-foundation.org wrote: On Fri, Dec 26, 2014 at 10:12 AM, Dave Jones da...@codemonkey.org.uk wrote: On Fri, Dec 26, 2014 at 11:34:10AM -0500, Dave Jones wrote: One thing I think

Re: frequent lockups in 3.18rc4

2015-01-02 Thread John Stultz
On Fri, Dec 26, 2014 at 12:57 PM, Linus Torvalds wrote: > On Fri, Dec 26, 2014 at 10:12 AM, Dave Jones wrote: >> On Fri, Dec 26, 2014 at 11:34:10AM -0500, Dave Jones wrote: >> >> > One thing I think I'll try is to try and narrow down which >> > syscalls are triggering those "Clocksource hpet

Re: frequent lockups in 3.18rc4

2015-01-02 Thread John Stultz
On Fri, Dec 26, 2014 at 12:57 PM, Linus Torvalds torva...@linux-foundation.org wrote: On Fri, Dec 26, 2014 at 10:12 AM, Dave Jones da...@codemonkey.org.uk wrote: On Fri, Dec 26, 2014 at 11:34:10AM -0500, Dave Jones wrote: One thing I think I'll try is to try and narrow down which syscalls

Re: frequent lockups in 3.18rc4

2014-12-28 Thread Paul E. McKenney
On Mon, Dec 22, 2014 at 04:46:42PM -0800, Linus Torvalds wrote: > On Mon, Dec 22, 2014 at 3:59 PM, John Stultz wrote: > > > > * So 1/8th of the interval seems way too short, as there's > > clocksources like the ACPI PM, which wrap every 2.5 seconds or so. > > Ugh. At the same time, 1/8th of a

Re: frequent lockups in 3.18rc4

2014-12-28 Thread Paul E. McKenney
On Mon, Dec 22, 2014 at 04:46:42PM -0800, Linus Torvalds wrote: On Mon, Dec 22, 2014 at 3:59 PM, John Stultz john.stu...@linaro.org wrote: * So 1/8th of the interval seems way too short, as there's clocksources like the ACPI PM, which wrap every 2.5 seconds or so. Ugh. At the same time,

Re: frequent lockups in 3.18rc4

2014-12-27 Thread Dave Jones
On Fri, Dec 26, 2014 at 07:14:55PM -0800, Linus Torvalds wrote: > On Fri, Dec 26, 2014 at 4:36 PM, Dave Jones wrote: > > > > > > Oh - and have you actually seen the "TSC unstable (delta = xyz)" + > > > "switched to hpet" messages there yet? > > > > not yet. 3 hrs in. > > Ok, so then

Re: frequent lockups in 3.18rc4

2014-12-27 Thread Dave Jones
On Fri, Dec 26, 2014 at 07:14:55PM -0800, Linus Torvalds wrote: On Fri, Dec 26, 2014 at 4:36 PM, Dave Jones da...@codemonkey.org.uk wrote: Oh - and have you actually seen the TSC unstable (delta = xyz) + switched to hpet messages there yet? not yet. 3 hrs in. Ok, so

Re: frequent lockups in 3.18rc4

2014-12-26 Thread Linus Torvalds
On Fri, Dec 26, 2014 at 4:36 PM, Dave Jones wrote: > > > > Oh - and have you actually seen the "TSC unstable (delta = xyz)" + > > "switched to hpet" messages there yet? > > not yet. 3 hrs in. Ok, so then the INFO: rcu_preempt detected stalls on CPUs/tasks: has nothing to do with HPET,

Re: frequent lockups in 3.18rc4

2014-12-26 Thread Dave Jones
On Fri, Dec 26, 2014 at 03:30:20PM -0800, Linus Torvalds wrote: > On Fri, Dec 26, 2014 at 2:57 PM, Dave Jones wrote: > > > > still running though.. > > Btw, did you ever boot with "tsc=reliable" as a kernel command line option? I'll check it again in the morning, but before I turn in for

Re: frequent lockups in 3.18rc4

2014-12-26 Thread Dave Jones
On Fri, Dec 26, 2014 at 03:30:20PM -0800, Linus Torvalds wrote: > On Fri, Dec 26, 2014 at 2:57 PM, Dave Jones wrote: > > > > still running though.. > > Btw, did you ever boot with "tsc=reliable" as a kernel command line option? I don't think so. > For the last night, can you see if you

Re: frequent lockups in 3.18rc4

2014-12-26 Thread Dave Jones
On Fri, Dec 26, 2014 at 03:16:41PM -0800, Linus Torvalds wrote: > On Fri, Dec 26, 2014 at 2:57 PM, Dave Jones wrote: > > > > hm. > > So with the previous patch that had the false positives, you never saw > this? You saw the false positives instead? correct. > I'm wondering if the added

Re: frequent lockups in 3.18rc4

2014-12-26 Thread Linus Torvalds
On Fri, Dec 26, 2014 at 2:57 PM, Dave Jones wrote: > > still running though.. Btw, did you ever boot with "tsc=reliable" as a kernel command line option? For the last night, can you see if you can just run it with that, and things work? Because by now, my gut feel is that we should start

Re: frequent lockups in 3.18rc4

2014-12-26 Thread Linus Torvalds
On Fri, Dec 26, 2014 at 2:57 PM, Dave Jones wrote: > > hm. So with the previous patch that had the false positives, you never saw this? You saw the false positives instead? I'm wondering if the added debug noise just ended up helping. Doing a printk() will automatically cause some scheduler

Re: frequent lockups in 3.18rc4

2014-12-26 Thread Dave Jones
On Fri, Dec 26, 2014 at 12:57:07PM -0800, Linus Torvalds wrote: > I have a newer version of the patch that gets rid of the false > positives with some ordering rules instead, and just for you I hacked > it up to say where the problem happens too, but it's likely too late. hm. [

Re: frequent lockups in 3.18rc4

2014-12-26 Thread Dave Jones
On Fri, Dec 26, 2014 at 12:57:07PM -0800, Linus Torvalds wrote: > I have a newer version of the patch that gets rid of the false > positives with some ordering rules instead, and just for you I hacked > it up to say where the problem happens too, but it's likely too late. I'll give it a spin

Re: frequent lockups in 3.18rc4

2014-12-26 Thread Linus Torvalds
On Fri, Dec 26, 2014 at 10:12 AM, Dave Jones wrote: > On Fri, Dec 26, 2014 at 11:34:10AM -0500, Dave Jones wrote: > > > One thing I think I'll try is to try and narrow down which > > syscalls are triggering those "Clocksource hpet had cycles off" > > messages. I'm still unclear on exactly

Re: frequent lockups in 3.18rc4

2014-12-26 Thread Dave Jones
On Fri, Dec 26, 2014 at 11:34:10AM -0500, Dave Jones wrote: > One thing I think I'll try is to try and narrow down which > syscalls are triggering those "Clocksource hpet had cycles off" > messages. I'm still unclear on exactly what is doing > the stomping on the hpet. First I ran trinity

Re: frequent lockups in 3.18rc4

2014-12-26 Thread Dave Jones
On Tue, Dec 23, 2014 at 10:01:25PM -0500, Dave Jones wrote: > On Mon, Dec 22, 2014 at 03:59:19PM -0800, Linus Torvalds wrote: > > > But in the meantime please do keep that thing running as long as you > > can. Let's see if we get bigger jumps. Or perhaps we'll get a negative > > result -

Re: frequent lockups in 3.18rc4

2014-12-26 Thread Dave Jones
On Tue, Dec 23, 2014 at 10:01:25PM -0500, Dave Jones wrote: On Mon, Dec 22, 2014 at 03:59:19PM -0800, Linus Torvalds wrote: But in the meantime please do keep that thing running as long as you can. Let's see if we get bigger jumps. Or perhaps we'll get a negative result - the

Re: frequent lockups in 3.18rc4

2014-12-26 Thread Dave Jones
On Fri, Dec 26, 2014 at 11:34:10AM -0500, Dave Jones wrote: One thing I think I'll try is to try and narrow down which syscalls are triggering those Clocksource hpet had cycles off messages. I'm still unclear on exactly what is doing the stomping on the hpet. First I ran trinity with -g

Re: frequent lockups in 3.18rc4

2014-12-26 Thread Linus Torvalds
On Fri, Dec 26, 2014 at 10:12 AM, Dave Jones da...@codemonkey.org.uk wrote: On Fri, Dec 26, 2014 at 11:34:10AM -0500, Dave Jones wrote: One thing I think I'll try is to try and narrow down which syscalls are triggering those Clocksource hpet had cycles off messages. I'm still unclear

Re: frequent lockups in 3.18rc4

2014-12-26 Thread Dave Jones
On Fri, Dec 26, 2014 at 12:57:07PM -0800, Linus Torvalds wrote: I have a newer version of the patch that gets rid of the false positives with some ordering rules instead, and just for you I hacked it up to say where the problem happens too, but it's likely too late. I'll give it a spin and

Re: frequent lockups in 3.18rc4

2014-12-26 Thread Dave Jones
On Fri, Dec 26, 2014 at 12:57:07PM -0800, Linus Torvalds wrote: I have a newer version of the patch that gets rid of the false positives with some ordering rules instead, and just for you I hacked it up to say where the problem happens too, but it's likely too late. hm. [ 2733.047100]

Re: frequent lockups in 3.18rc4

2014-12-26 Thread Linus Torvalds
On Fri, Dec 26, 2014 at 2:57 PM, Dave Jones da...@codemonkey.org.uk wrote: hm. So with the previous patch that had the false positives, you never saw this? You saw the false positives instead? I'm wondering if the added debug noise just ended up helping. Doing a printk() will automatically

Re: frequent lockups in 3.18rc4

2014-12-26 Thread Linus Torvalds
On Fri, Dec 26, 2014 at 2:57 PM, Dave Jones da...@codemonkey.org.uk wrote: still running though.. Btw, did you ever boot with tsc=reliable as a kernel command line option? For the last night, can you see if you can just run it with that, and things work? Because by now, my gut feel is that we

Re: frequent lockups in 3.18rc4

2014-12-26 Thread Dave Jones
On Fri, Dec 26, 2014 at 03:16:41PM -0800, Linus Torvalds wrote: On Fri, Dec 26, 2014 at 2:57 PM, Dave Jones da...@codemonkey.org.uk wrote: hm. So with the previous patch that had the false positives, you never saw this? You saw the false positives instead? correct. I'm

Re: frequent lockups in 3.18rc4

2014-12-26 Thread Dave Jones
On Fri, Dec 26, 2014 at 03:30:20PM -0800, Linus Torvalds wrote: On Fri, Dec 26, 2014 at 2:57 PM, Dave Jones da...@codemonkey.org.uk wrote: still running though.. Btw, did you ever boot with tsc=reliable as a kernel command line option? I don't think so. For the last night, can you

Re: frequent lockups in 3.18rc4

2014-12-26 Thread Dave Jones
On Fri, Dec 26, 2014 at 03:30:20PM -0800, Linus Torvalds wrote: On Fri, Dec 26, 2014 at 2:57 PM, Dave Jones da...@codemonkey.org.uk wrote: still running though.. Btw, did you ever boot with tsc=reliable as a kernel command line option? I'll check it again in the morning, but before I

Re: frequent lockups in 3.18rc4

2014-12-26 Thread Linus Torvalds
On Fri, Dec 26, 2014 at 4:36 PM, Dave Jones da...@codemonkey.org.uk wrote: Oh - and have you actually seen the TSC unstable (delta = xyz) + switched to hpet messages there yet? not yet. 3 hrs in. Ok, so then the INFO: rcu_preempt detected stalls on CPUs/tasks: has nothing to do

Re: frequent lockups in 3.18rc4

2014-12-24 Thread Sasha Levin
On 12/23/2014 09:56 AM, Dave Jones wrote: > On Mon, Dec 22, 2014 at 03:59:19PM -0800, Linus Torvalds wrote: > > > But in the meantime please do keep that thing running as long as you > > can. Let's see if we get bigger jumps. Or perhaps we'll get a negative > > result - the original

Re: frequent lockups in 3.18rc4

2014-12-24 Thread Sasha Levin
On 12/23/2014 09:56 AM, Dave Jones wrote: On Mon, Dec 22, 2014 at 03:59:19PM -0800, Linus Torvalds wrote: But in the meantime please do keep that thing running as long as you can. Let's see if we get bigger jumps. Or perhaps we'll get a negative result - the original softlockup bug

Re: frequent lockups in 3.18rc4

2014-12-23 Thread Dave Jones
On Mon, Dec 22, 2014 at 03:59:19PM -0800, Linus Torvalds wrote: > But in the meantime please do keep that thing running as long as you > can. Let's see if we get bigger jumps. Or perhaps we'll get a negative > result - the original softlockup bug happening *without* any bigger > hpet jumps.

Re: frequent lockups in 3.18rc4

2014-12-23 Thread Dave Jones
On Mon, Dec 22, 2014 at 03:59:19PM -0800, Linus Torvalds wrote: > But in the meantime please do keep that thing running as long as you > can. Let's see if we get bigger jumps. Or perhaps we'll get a negative > result - the original softlockup bug happening *without* any bigger > hpet jumps.

Re: frequent lockups in 3.18rc4

2014-12-23 Thread Dave Jones
On Mon, Dec 22, 2014 at 03:59:19PM -0800, Linus Torvalds wrote: But in the meantime please do keep that thing running as long as you can. Let's see if we get bigger jumps. Or perhaps we'll get a negative result - the original softlockup bug happening *without* any bigger hpet jumps.

Re: frequent lockups in 3.18rc4

2014-12-23 Thread Dave Jones
On Mon, Dec 22, 2014 at 03:59:19PM -0800, Linus Torvalds wrote: But in the meantime please do keep that thing running as long as you can. Let's see if we get bigger jumps. Or perhaps we'll get a negative result - the original softlockup bug happening *without* any bigger hpet jumps. So

Re: frequent lockups in 3.18rc4

2014-12-22 Thread Linus Torvalds
On Mon, Dec 22, 2014 at 3:59 PM, John Stultz wrote: > > * So 1/8th of the interval seems way too short, as there's > clocksources like the ACPI PM, which wrap every 2.5 seconds or so. Ugh. At the same time, 1/8th of a range is actually bigger than I'd like, since if there is some timer

Re: frequent lockups in 3.18rc4

2014-12-22 Thread John Stultz
On Mon, Dec 22, 2014 at 11:47 AM, Linus Torvalds wrote: > On Sun, Dec 21, 2014 at 4:41 PM, Linus Torvalds > wrote: >> >> This is *not* to say that this is the bug you're hitting. But it does show >> that >> >> (a) a flaky HPET can do some seriously bad stuff >> (b) the kernel is very fragile

Re: frequent lockups in 3.18rc4

2014-12-22 Thread Linus Torvalds
On Mon, Dec 22, 2014 at 2:57 PM, Dave Jones wrote: > > I tried the nohpet thing for a few hours this morning and didn't see > anything weird, but it may have been that I just didn't run long enough. > When I saw your patch, I gave that a shot instead, with hpet enabled > again. Just got back to

Re: frequent lockups in 3.18rc4

2014-12-22 Thread Dave Jones
On Mon, Dec 22, 2014 at 11:47:37AM -0800, Linus Torvalds wrote: > And again: this is not trying to make the kernel clock not jump. There > is no way I can come up with even in theory to try to really *fix* a > fundamentally broken clock. > > So this is not meant to be a real "fix" for

Re: frequent lockups in 3.18rc4

2014-12-22 Thread Linus Torvalds
On Mon, Dec 22, 2014 at 11:47 AM, Linus Torvalds wrote: > > .. and we might still lock up under some circumstances. But at least > from my limited testing, it is infinitely much better, even if it > might not be perfect. Also note that my "testing" has been writing > zero to the HPET lock (so the

Re: frequent lockups in 3.18rc4

2014-12-22 Thread Linus Torvalds
On Sun, Dec 21, 2014 at 4:41 PM, Linus Torvalds wrote: > > This is *not* to say that this is the bug you're hitting. But it does show > that > > (a) a flaky HPET can do some seriously bad stuff > (b) the kernel is very fragile wrt time going backwards. > > and maybe we can use this test

Re: frequent lockups in 3.18rc4

2014-12-22 Thread Linus Torvalds
On Sun, Dec 21, 2014 at 4:41 PM, Linus Torvalds torva...@linux-foundation.org wrote: This is *not* to say that this is the bug you're hitting. But it does show that (a) a flaky HPET can do some seriously bad stuff (b) the kernel is very fragile wrt time going backwards. and maybe we can

Re: frequent lockups in 3.18rc4

2014-12-22 Thread Linus Torvalds
On Mon, Dec 22, 2014 at 11:47 AM, Linus Torvalds torva...@linux-foundation.org wrote: .. and we might still lock up under some circumstances. But at least from my limited testing, it is infinitely much better, even if it might not be perfect. Also note that my testing has been writing zero to

Re: frequent lockups in 3.18rc4

2014-12-22 Thread Dave Jones
On Mon, Dec 22, 2014 at 11:47:37AM -0800, Linus Torvalds wrote: And again: this is not trying to make the kernel clock not jump. There is no way I can come up with even in theory to try to really *fix* a fundamentally broken clock. So this is not meant to be a real fix for anything, but

Re: frequent lockups in 3.18rc4

2014-12-22 Thread Linus Torvalds
On Mon, Dec 22, 2014 at 2:57 PM, Dave Jones da...@codemonkey.org.uk wrote: I tried the nohpet thing for a few hours this morning and didn't see anything weird, but it may have been that I just didn't run long enough. When I saw your patch, I gave that a shot instead, with hpet enabled again.

Re: frequent lockups in 3.18rc4

2014-12-22 Thread John Stultz
On Mon, Dec 22, 2014 at 11:47 AM, Linus Torvalds torva...@linux-foundation.org wrote: On Sun, Dec 21, 2014 at 4:41 PM, Linus Torvalds torva...@linux-foundation.org wrote: This is *not* to say that this is the bug you're hitting. But it does show that (a) a flaky HPET can do some seriously

Re: frequent lockups in 3.18rc4

2014-12-22 Thread Linus Torvalds
On Mon, Dec 22, 2014 at 3:59 PM, John Stultz john.stu...@linaro.org wrote: * So 1/8th of the interval seems way too short, as there's clocksources like the ACPI PM, which wrap every 2.5 seconds or so. Ugh. At the same time, 1/8th of a range is actually bigger than I'd like, since if there is

Re: frequent lockups in 3.18rc4

2014-12-21 Thread Paul E. McKenney
On Sun, Dec 21, 2014 at 04:52:28PM -0800, Linus Torvalds wrote: > On Sun, Dec 21, 2014 at 4:41 PM, Linus Torvalds > wrote: > > > > The second time (or third, or fourth - it might not take immediately) > > you get a lockup or similar. Bad things happen. > > I've only tested it twice now, but the

Re: frequent lockups in 3.18rc4

2014-12-21 Thread Dave Jones
On Sun, Dec 21, 2014 at 04:52:28PM -0800, Linus Torvalds wrote: > > The second time (or third, or fourth - it might not take immediately) > > you get a lockup or similar. Bad things happen. > > I've only tested it twice now, but the first time I got a weird > lockup-like thing (things *kind*

Re: frequent lockups in 3.18rc4

2014-12-21 Thread Linus Torvalds
On Sun, Dec 21, 2014 at 4:41 PM, Linus Torvalds wrote: > > The second time (or third, or fourth - it might not take immediately) > you get a lockup or similar. Bad things happen. I've only tested it twice now, but the first time I got a weird lockup-like thing (things *kind* of worked, but I

Re: frequent lockups in 3.18rc4

2014-12-21 Thread Linus Torvalds
On Sun, Dec 21, 2014 at 3:58 PM, Linus Torvalds wrote: > > I can do the mmap(/dev/mem) thing and access the HPET by hand, and > when I write zero to it I immediately get something like this: > > Clocksource tsc unstable (delta = -284317725450 ns) > Switched to clocksource hpet > > just to

Re: frequent lockups in 3.18rc4

2014-12-21 Thread Linus Torvalds
On Sun, Dec 21, 2014 at 2:32 PM, Dave Jones wrote: > On Sun, Dec 21, 2014 at 02:19:03PM -0800, Linus Torvalds wrote: > > > > And finally, and stupidly, is there any chance that you have anything > > accessing /dev/hpet? > > Not knowingly at least, but who the hell knows what systemd has its >

Re: frequent lockups in 3.18rc4

2014-12-21 Thread Dave Jones
On Sun, Dec 21, 2014 at 02:19:03PM -0800, Linus Torvalds wrote: > > So the range of 1-251 seconds is not entirely random. It's all in > > that "32-bit HPET range". > > DaveJ, I assume it's too late now, and you don't effectively have any > access to the machine any more, but "hpet=disable"

Re: frequent lockups in 3.18rc4

2014-12-21 Thread Linus Torvalds
On Sun, Dec 21, 2014 at 1:22 PM, Linus Torvalds wrote: > > So the range of 1-251 seconds is not entirely random. It's all in > that "32-bit HPET range". DaveJ, I assume it's too late now, and you don't effectively have any access to the machine any more, but "hpet=disable" or "nohpet" on the

Re: frequent lockups in 3.18rc4

2014-12-21 Thread Linus Torvalds
On Sat, Dec 20, 2014 at 1:16 PM, Linus Torvalds wrote: > > Hmm, ok, I've re-acquainted myself with it. And I have to admit that I > can't see anything wrong. The whole "update_wall_clock" and the shadow > timekeeping state is confusing as hell, but seems fine. We'd have to > avoid

Re: frequent lockups in 3.18rc4

2014-12-21 Thread Linus Torvalds
On Sat, Dec 20, 2014 at 1:16 PM, Linus Torvalds torva...@linux-foundation.org wrote: Hmm, ok, I've re-acquainted myself with it. And I have to admit that I can't see anything wrong. The whole update_wall_clock and the shadow timekeeping state is confusing as hell, but seems fine. We'd have to

Re: frequent lockups in 3.18rc4

2014-12-21 Thread Linus Torvalds
On Sun, Dec 21, 2014 at 1:22 PM, Linus Torvalds torva...@linux-foundation.org wrote: So the range of 1-251 seconds is not entirely random. It's all in that 32-bit HPET range. DaveJ, I assume it's too late now, and you don't effectively have any access to the machine any more, but hpet=disable

Re: frequent lockups in 3.18rc4

2014-12-21 Thread Dave Jones
On Sun, Dec 21, 2014 at 02:19:03PM -0800, Linus Torvalds wrote: So the range of 1-251 seconds is not entirely random. It's all in that 32-bit HPET range. DaveJ, I assume it's too late now, and you don't effectively have any access to the machine any more, but hpet=disable or nohpet

Re: frequent lockups in 3.18rc4

2014-12-21 Thread Linus Torvalds
On Sun, Dec 21, 2014 at 2:32 PM, Dave Jones da...@codemonkey.org.uk wrote: On Sun, Dec 21, 2014 at 02:19:03PM -0800, Linus Torvalds wrote: And finally, and stupidly, is there any chance that you have anything accessing /dev/hpet? Not knowingly at least, but who the hell knows what

Re: frequent lockups in 3.18rc4

2014-12-21 Thread Linus Torvalds
On Sun, Dec 21, 2014 at 3:58 PM, Linus Torvalds torva...@linux-foundation.org wrote: I can do the mmap(/dev/mem) thing and access the HPET by hand, and when I write zero to it I immediately get something like this: Clocksource tsc unstable (delta = -284317725450 ns) Switched to

Re: frequent lockups in 3.18rc4

2014-12-21 Thread Linus Torvalds
On Sun, Dec 21, 2014 at 4:41 PM, Linus Torvalds torva...@linux-foundation.org wrote: The second time (or third, or fourth - it might not take immediately) you get a lockup or similar. Bad things happen. I've only tested it twice now, but the first time I got a weird lockup-like thing (things

Re: frequent lockups in 3.18rc4

2014-12-21 Thread Dave Jones
On Sun, Dec 21, 2014 at 04:52:28PM -0800, Linus Torvalds wrote: The second time (or third, or fourth - it might not take immediately) you get a lockup or similar. Bad things happen. I've only tested it twice now, but the first time I got a weird lockup-like thing (things *kind* of

Re: frequent lockups in 3.18rc4

2014-12-21 Thread Paul E. McKenney
On Sun, Dec 21, 2014 at 04:52:28PM -0800, Linus Torvalds wrote: On Sun, Dec 21, 2014 at 4:41 PM, Linus Torvalds torva...@linux-foundation.org wrote: The second time (or third, or fourth - it might not take immediately) you get a lockup or similar. Bad things happen. I've only tested it

Re: frequent lockups in 3.18rc4

2014-12-20 Thread Paul E. McKenney
On Sat, Dec 20, 2014 at 01:16:29PM -0800, Linus Torvalds wrote: > On Sat, Dec 20, 2014 at 10:25 AM, Linus Torvalds > wrote: > > > > How/where is the HPET overflow case handled? I don't know the code enough. > > Hmm, ok, I've re-acquainted myself with it. And I have to admit that I > can't see

Re: frequent lockups in 3.18rc4

2014-12-20 Thread Linus Torvalds
On Sat, Dec 20, 2014 at 10:25 AM, Linus Torvalds wrote: > > How/where is the HPET overflow case handled? I don't know the code enough. Hmm, ok, I've re-acquainted myself with it. And I have to admit that I can't see anything wrong. The whole "update_wall_clock" and the shadow timekeeping state

Re: frequent lockups in 3.18rc4

2014-12-20 Thread Linus Torvalds
On Fri, Dec 19, 2014 at 5:57 PM, Linus Torvalds wrote: > > I'm claiming that the race happened *once*. And it then corrupted some > data structure or similar sufficiently that CPU0 keeps looping. > > Perhaps something keeps re-adding itself to the head of the timerqueue > due to the race. So

Re: frequent lockups in 3.18rc4

2014-12-20 Thread Dave Jones
On Fri, Dec 19, 2014 at 02:05:20PM -0800, Linus Torvalds wrote: > > Right now I'm doing Chris' idea of "turn debugging back on, > > and try without serial console". Shall I try your suggestion > > on top of that ? > > Might as well. I doubt it really will make any difference, but I also >

Re: frequent lockups in 3.18rc4

2014-12-20 Thread Dave Jones
On Fri, Dec 19, 2014 at 02:05:20PM -0800, Linus Torvalds wrote: Right now I'm doing Chris' idea of turn debugging back on, and try without serial console. Shall I try your suggestion on top of that ? Might as well. I doubt it really will make any difference, but I also don't think

Re: frequent lockups in 3.18rc4

2014-12-20 Thread Linus Torvalds
On Fri, Dec 19, 2014 at 5:57 PM, Linus Torvalds torva...@linux-foundation.org wrote: I'm claiming that the race happened *once*. And it then corrupted some data structure or similar sufficiently that CPU0 keeps looping. Perhaps something keeps re-adding itself to the head of the timerqueue

Re: frequent lockups in 3.18rc4

2014-12-20 Thread Linus Torvalds
On Sat, Dec 20, 2014 at 10:25 AM, Linus Torvalds torva...@linux-foundation.org wrote: How/where is the HPET overflow case handled? I don't know the code enough. Hmm, ok, I've re-acquainted myself with it. And I have to admit that I can't see anything wrong. The whole update_wall_clock and the

Re: frequent lockups in 3.18rc4

2014-12-20 Thread Paul E. McKenney
On Sat, Dec 20, 2014 at 01:16:29PM -0800, Linus Torvalds wrote: On Sat, Dec 20, 2014 at 10:25 AM, Linus Torvalds torva...@linux-foundation.org wrote: How/where is the HPET overflow case handled? I don't know the code enough. Hmm, ok, I've re-acquainted myself with it. And I have to admit

Re: frequent lockups in 3.18rc4

2014-12-19 Thread Linus Torvalds
On Fri, Dec 19, 2014 at 5:00 PM, Thomas Gleixner wrote: > > The watchdog timer runs on a fully periodic schedule. It's self > rearming via > > hrtimer_forward_now(hrtimer, ns_to_ktime(sample_period)); > > So if that aligns with the equally periodic tick interrupt on the > other CPU then

Re: frequent lockups in 3.18rc4

2014-12-19 Thread Thomas Gleixner
On Fri, 19 Dec 2014, Chris Mason wrote: > On Fri, Dec 19, 2014 at 6:22 PM, Thomas Gleixner wrote: > > But at the very end this would be detected by the runtime check of the > > hrtimer interrupt, which does not trigger. And it would trigger at > > some point as ALL cpus including CPU0 in that

Re: frequent lockups in 3.18rc4

2014-12-19 Thread Thomas Gleixner
On Fri, 19 Dec 2014, Linus Torvalds wrote: > On Fri, Dec 19, 2014 at 3:14 PM, Thomas Gleixner wrote: > > Now that all looks correct. So there is something else going on. After > > staring some more at it, I think we are looking at it from the wrong > > angle. > > > > The watchdog always detects

Re: frequent lockups in 3.18rc4

2014-12-19 Thread Chris Mason
On Fri, Dec 19, 2014 at 6:22 PM, Thomas Gleixner wrote: On Fri, 19 Dec 2014, Chris Mason wrote: On Fri, Dec 19, 2014 at 11:15:21AM -0800, Linus Torvalds wrote: > Here's another pattern. In your latest thing, every single time that > CPU1 is waiting for some other CPU to pick up the IPI,

Re: frequent lockups in 3.18rc4

2014-12-19 Thread Linus Torvalds
On Fri, Dec 19, 2014 at 3:14 PM, Thomas Gleixner wrote: > > Now that all looks correct. So there is something else going on. After > staring some more at it, I think we are looking at it from the wrong > angle. > > The watchdog always detects CPU1 as stuck and we got completely > fixated on the

Re: frequent lockups in 3.18rc4

2014-12-19 Thread Thomas Gleixner
On Fri, 19 Dec 2014, Chris Mason wrote: > On Fri, Dec 19, 2014 at 11:15:21AM -0800, Linus Torvalds wrote: > > Here's another pattern. In your latest thing, every single time that > > CPU1 is waiting for some other CPU to pick up the IPI, we have CPU0 > > doing this: > > > > [24998.060963] NMI

Re: frequent lockups in 3.18rc4

2014-12-19 Thread Thomas Gleixner
On Fri, 19 Dec 2014, Linus Torvalds wrote: > Here's another pattern. In your latest thing, every single time that > CPU1 is waiting for some other CPU to pick up the IPI, we have CPU0 > doing this: > > [24998.060963] NMI backtrace for cpu 0 > [24998.061989] CPU: 0 PID: 2940 Comm: trinity-c150 Not

Re: frequent lockups in 3.18rc4

2014-12-19 Thread Linus Torvalds
On Fri, Dec 19, 2014 at 12:54 PM, Dave Jones wrote: > > Right now I'm doing Chris' idea of "turn debugging back on, > and try without serial console". Shall I try your suggestion > on top of that ? Might as well. I doubt it really will make any difference, but I also don't think it will

Re: frequent lockups in 3.18rc4

2014-12-19 Thread Dave Jones
On Fri, Dec 19, 2014 at 12:46:16PM -0800, Linus Torvalds wrote: > On Fri, Dec 19, 2014 at 11:51 AM, Linus Torvalds > wrote: > > > > I do note that we depend on the "new mwait" semantics where we do > > mwait with interrupts disabled and a non-zero RCX value. Are there > > possibly even any

Re: frequent lockups in 3.18rc4

2014-12-19 Thread Linus Torvalds
On Fri, Dec 19, 2014 at 11:51 AM, Linus Torvalds wrote: > > I do note that we depend on the "new mwait" semantics where we do > mwait with interrupts disabled and a non-zero RCX value. Are there > possibly even any known CPU errata in that area? Not that it sounds > likely, but still.. Remind me

Re: frequent lockups in 3.18rc4

2014-12-19 Thread Dave Jones
On Fri, Dec 19, 2014 at 03:31:36PM -0500, Chris Mason wrote: > > So it's not stuck *inside* read_hpet(), and it's almost certainly not > > the loop over the sequence counter in ktime_get() either (it's not > > increasing *that* quickly). But some basically infinite __run_hrtimer > > thing or

Re: frequent lockups in 3.18rc4

2014-12-19 Thread Chris Mason
On Fri, Dec 19, 2014 at 11:15:21AM -0800, Linus Torvalds wrote: > Here's another pattern. In your latest thing, every single time that > CPU1 is waiting for some other CPU to pick up the IPI, we have CPU0 > doing this: > > [24998.060963] NMI backtrace for cpu 0 > [24998.061989] CPU: 0 PID: 2940

Re: frequent lockups in 3.18rc4

2014-12-19 Thread Linus Torvalds
On Fri, Dec 19, 2014 at 11:15 AM, Linus Torvalds wrote: > > In your earlier trace (with spinlock debugging), the softlockup > detection was in lock_acquire for copy_page_range(), but CPU2 was > always in that "generic_exec_single" due to a TLB flush from that > zap_page_range thing again. But

Re: frequent lockups in 3.18rc4

2014-12-19 Thread Peter Zijlstra
On Fri, Dec 19, 2014 at 11:15:21AM -0800, Linus Torvalds wrote: > sched: RT throttling activated > > And after RT throttling, it's random (not even always trinity), but > that's probably because the watchdog thread doesn't run reliably any > more. So if we want to shoot that RT throttling
