Re: kernel BUG at kernel/sched_rt.c:493!
On Fri, 2013-01-11 at 06:22 +0100, Mike Galbraith wrote:
> On Thu, 2013-01-10 at 13:58 -0600, Shawn Bohrer wrote:
> > Here is the output:
> >
> > [ 81.278842] SysRq : Changing Loglevel
> > [ 81.279027] Loglevel set to 9
> > [ 83.285456] Initial want: 5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 90000
> > [ 85.286452] Initial want: 5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 90000
> > [ 85.289625] Initial want: 5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 90000
> > [ 87.287435] Initial want: 10000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 85000
> > [ 87.290718] Initial want: 5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 90000
> > [ 89.288469] Initial want: -5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 100000
> > [ 89.291550] Initial want: 15000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 80000
> > [ 89.292940] Initial want: 10000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 85000
> > [ 89.294082] Initial want: 10000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 85000
> > [ 89.295194] Initial want: 5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 90000
> > [ 89.296274] Initial want: 5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 90000
> > [ 90.959004] [sched_delayed] sched: RT throttling activated
> > [ 91.289470] Initial want: 20000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 75000
> > [ 91.292767] Initial want: 20000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 75000
> > [ 91.294037] Initial want: 20000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 75000
> > [ 91.295364] Initial want: 20000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 75000
> > [ 91.296355] BUG triggered, want: 20000
> > [ 91.296355]
> > [ 91.296355] rt_rq[7]:
> > [ 91.296355]   .rt_nr_running : 0
> > [ 91.296355]   .rt_throttled  : 0
> > [ 91.296355]   .rt_time       : 0.00
> > [ 91.296355]   .rt_runtime    : 750.00
> > [ 91.307332] Initial want: -5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 100000
> > [ 91.308440] Initial want: -10000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 105000
> > [ 91.309586] Initial want: -15000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 110000
> > [ 91.310716] Initial want: -20000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 115000
> > [ 91.311707] BUG triggered, want: -20000
> > [ 91.311707]
> > [ 91.311707] rt_rq[6]:
> > [ 91.311707]   .rt_nr_running : 1
> > [ 91.311707]   .rt_throttled  : 0
> > [ 91.311707]   .rt_time       : 307.209987
> > [ 91.311707]   .rt_runtime    : 1150.00
>
> That makes about as much sense as my crash did. There is no leak, but
> cpu found nada. So rd/span is changing on us?

So I looked at the locking (yet again), and (for the umpteenth time) see
no way in the world that can happen. Hrmph.

-Mike
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: kernel BUG at kernel/sched_rt.c:493!
On Thu, 2013-01-10 at 13:58 -0600, Shawn Bohrer wrote:
> Here is the output:
>
> [ 81.278842] SysRq : Changing Loglevel
> [ 81.279027] Loglevel set to 9
> [ 83.285456] Initial want: 5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 90000
> [ 85.286452] Initial want: 5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 90000
> [ 85.289625] Initial want: 5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 90000
> [ 87.287435] Initial want: 10000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 85000
> [ 87.290718] Initial want: 5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 90000
> [ 89.288469] Initial want: -5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 100000
> [ 89.291550] Initial want: 15000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 80000
> [ 89.292940] Initial want: 10000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 85000
> [ 89.294082] Initial want: 10000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 85000
> [ 89.295194] Initial want: 5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 90000
> [ 89.296274] Initial want: 5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 90000
> [ 90.959004] [sched_delayed] sched: RT throttling activated
> [ 91.289470] Initial want: 20000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 75000
> [ 91.292767] Initial want: 20000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 75000
> [ 91.294037] Initial want: 20000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 75000
> [ 91.295364] Initial want: 20000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 75000
> [ 91.296355] BUG triggered, want: 20000
> [ 91.296355]
> [ 91.296355] rt_rq[7]:
> [ 91.296355]   .rt_nr_running : 0
> [ 91.296355]   .rt_throttled  : 0
> [ 91.296355]   .rt_time       : 0.00
> [ 91.296355]   .rt_runtime    : 750.00
> [ 91.307332] Initial want: -5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 100000
> [ 91.308440] Initial want: -10000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 105000
> [ 91.309586] Initial want: -15000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 110000
> [ 91.310716] Initial want: -20000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 115000
> [ 91.311707] BUG triggered, want: -20000
> [ 91.311707]
> [ 91.311707] rt_rq[6]:
> [ 91.311707]   .rt_nr_running : 1
> [ 91.311707]   .rt_throttled  : 0
> [ 91.311707]   .rt_time       : 307.209987
> [ 91.311707]   .rt_runtime    : 1150.00

That makes about as much sense as my crash did. There is no leak, but
cpu found nada. So rd/span is changing on us? We saw nada as we
traversed, release locks, poof, it's there.

In my dump, at crash time, rd/span was fine, but at the top of the
stack, I found a mighty suspicious reference to def_root_domain.span.
At _crash_ time, both it and rq->rd.span had pilfer-able goodies, but
whatever it was that we traversed had nada just a wee bit earlier.

If the darn thing would trigger for me again, I'd use trace_printk(),
check rd and cpumask_weight(rd->span) before/after traverse, print cpu
and rt_runtime as we traverse, set /proc/sys/kernel/ftrace_dump_on_oops
to 1.. and hope bug doesn't like to play heisenbug games.

-Mike
Re: kernel BUG at kernel/sched_rt.c:493!
On Thu, Jan 10, 2013 at 05:13:11AM +0100, Mike Galbraith wrote:
> On Tue, 2013-01-08 at 09:01 -0600, Shawn Bohrer wrote:
> > On Tue, Jan 08, 2013 at 09:36:05AM -0500, Steven Rostedt wrote:
> > >
> > > > I've also managed to reproduce this on 3.8.0-rc2 so it appears the bug
> > > > is still present in the latest kernel.
> > >
> > > Shawn,
> > >
> > > Can you send me your .config file.
> >
> > I've attached the 3.8.0-rc2 config that I used to reproduce this in an
> > 8 core kvm image. Let me know if you need anything else.
>
> I tried beating on my little Q6600 with no success. I even tried
> setting the entire box rt, GUI and all, nada.
>
> Hm, maybe re-installing systemd..

I don't know if Steve has had any success. I can reproduce this easily
now so I'm happy to do some debugging if anyone has some things they
want me to try.

Here is some info on my setup at the moment. I'm using an 8 core KVM
image now with an xfs file system. We do use systemd if that is
relevant. My cpuset controller is mounted on /cgroup/cpuset and we use
libcgroup-tools to move everything on the system that can be moved into
/cgroup/cpuset/sysdefault/. I've also boosted all kworker threads to
run as SCHED_FIFO with a priority of 51. From there I just drop the
three attached shell scripts (burn.sh, sched_domain_bug.sh and
sched_domain_burn.sh) in /root/ and run /root/sched_domain_bug.sh as
root. Usually the bug triggers in less than a minute. You may need to
tweak my shell scripts if your setup is different but they are very
rudimentary.

In order to try digging up some more info I applied the following
patch, and triggered the bug a few times. The results are always
essentially the same:

---
 kernel/sched/rt.c |    9 ++++++++-
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 418feb0..fba7f01 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -650,6 +650,8 @@ static void __disable_runtime(struct rq *rq)
 		 * we lend and now have to reclaim.
 		 */
 		want = rt_b->rt_runtime - rt_rq->rt_runtime;
+		printk(KERN_INFO "Initial want: %lld rt_b->rt_runtime: %llu rt_rq->rt_runtime: %llu\n",
+		       want, rt_b->rt_runtime, rt_rq->rt_runtime);
 
 		/*
 		 * Greedy reclaim, take back as much as we can.
@@ -684,7 +686,12 @@ static void __disable_runtime(struct rq *rq)
 		 * We cannot be left wanting - that would mean some runtime
 		 * leaked out of the system.
 		 */
-		BUG_ON(want);
+		if (want) {
+			printk(KERN_ERR "BUG triggered, want: %lld\n", want);
+			for_each_cpu(i, rd->span) {
+				print_rt_stats(NULL, i);
+			}
+		}
 balanced:
 		/*
 		 * Disable all the borrow logic by pretending we have inf
---

Here is the output:

[ 81.278842] SysRq : Changing Loglevel
[ 81.279027] Loglevel set to 9
[ 83.285456] Initial want: 5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 90000
[ 85.286452] Initial want: 5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 90000
[ 85.289625] Initial want: 5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 90000
[ 87.287435] Initial want: 10000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 85000
[ 87.290718] Initial want: 5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 90000
[ 89.288469] Initial want: -5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 100000
[ 89.291550] Initial want: 15000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 80000
[ 89.292940] Initial want: 10000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 85000
[ 89.294082] Initial want: 10000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 85000
[ 89.295194] Initial want: 5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 90000
[ 89.296274] Initial want: 5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 90000
[ 90.959004] [sched_delayed] sched: RT throttling activated
[ 91.289470] Initial want: 20000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 75000
[ 91.292767] Initial want: 20000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 75000
[ 91.294037] Initial want: 20000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 75000
[ 91.295364] Initial want: 20000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 75000
[ 91.296355] BUG triggered, want: 20000
[ 91.296355]
[ 91.296355] rt_rq[7]:
[ 91.296355]   .rt_nr_running : 0
[ 91.296355]   .rt_throttled  : 0
[ 91.296355]   .rt_time       : 0.00
[ 91.296355]   .rt_runtime    : 750.00
[ 91.307332] Initial want: -5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 100000
[ 91.308440] Initial want: -10000 rt_b->rt_runtime: 95000
Re: kernel BUG at kernel/sched_rt.c:493!
On Tue, 2013-01-08 at 09:01 -0600, Shawn Bohrer wrote:
> On Tue, Jan 08, 2013 at 09:36:05AM -0500, Steven Rostedt wrote:
> >
> > > I've also managed to reproduce this on 3.8.0-rc2 so it appears the bug
> > > is still present in the latest kernel.
> >
> > Shawn,
> >
> > Can you send me your .config file.
>
> I've attached the 3.8.0-rc2 config that I used to reproduce this in an
> 8 core kvm image. Let me know if you need anything else.

I tried beating on my little Q6600 with no success. I even tried
setting the entire box rt, GUI and all, nada.

Hm, maybe re-installing systemd..

-Mike
Re: kernel BUG at kernel/sched_rt.c:493!
On Tue, Jan 08, 2013 at 09:36:05AM -0500, Steven Rostedt wrote:
> > I've also managed to reproduce this on 3.8.0-rc2 so it appears the bug
> > is still present in the latest kernel.
>
> Shawn,
>
> Can you send me your .config file.

I've attached the 3.8.0-rc2 config that I used to reproduce this in an
8 core kvm image. Let me know if you need anything else.

Thanks,
Shawn

--
This email, along with any attachments, is confidential. If you believe
you received this message in error, please contact the sender
immediately and delete all copies of the message. Thank you.

3.8.0-rc2.config.gz
Description: GNU Zip compressed data
Re: kernel BUG at kernel/sched_rt.c:493!
> > I've also managed to reproduce this on 3.8.0-rc2 so it appears the bug
> > is still present in the latest kernel.

Shawn,

Can you send me your .config file.

Thanks,

-- Steve
Re: kernel BUG at kernel/sched_rt.c:493!
On Mon, 2013-01-07 at 17:02 -0600, Shawn Bohrer wrote:
> On Mon, Jan 07, 2013 at 11:58:18AM -0600, Shawn Bohrer wrote:
> > On Sat, Jan 05, 2013 at 11:46:32AM -0600, Shawn Bohrer wrote:
> > > I've tried reproducing the issue, but so far I've been unsuccessful
> > > but I believe that is because my RT tasks aren't using enough CPU to
> > > cause borrowing from the other runqueues. Normally our RT tasks use
> > > very little CPU so I'm not entirely sure what conditions caused them
> > > to run into throttling on the day that this happened.
> >
> > I've managed to reproduce this a couple times now on 3.1.9. I'll give
> > this a try later with a more recent kernel. Here is what I've done to
> > reproduce the issue.
> >
> > # Setup in shell 1
> > root@berbox39:/cgroup/cpuset# mkdir package0
> > root@berbox39:/cgroup/cpuset# echo 0 > package0/cpuset.mems
> > root@berbox39:/cgroup/cpuset# echo 0,2,4,6 > package0/cpuset.cpus
> > root@berbox39:/cgroup/cpuset# cat cpuset.sched_load_balance
> > 1
> > root@berbox39:/cgroup/cpuset# cat package0/cpuset.sched_load_balance
> > 1
> > root@berbox39:/cgroup/cpuset# cat sysdefault/cpuset.sched_load_balance
> > 1
> > root@berbox39:/cgroup/cpuset# echo 1,3,5,7 > sysdefault/cpuset.cpus
> > root@berbox39:/cgroup/cpuset# echo 0 > sysdefault/cpuset.mems
> > root@berbox39:/cgroup/cpuset# echo $$ > package0/tasks
> >
> > # Setup in shell 2
> > root@berbox39:~# cd /cgroup/cpuset/
> > root@berbox39:/cgroup/cpuset# chrt -f -p 60 $$
> > root@berbox39:/cgroup/cpuset# echo $$ > sysdefault/tasks
> >
> > # In shell 1
> > root@berbox39:/cgroup/cpuset# chrt -f 1 /root/burn.sh &
> > root@berbox39:/cgroup/cpuset# chrt -f 1 /root/burn.sh &
> >
> > # In shell 2
> > root@berbox39:/cgroup/cpuset# echo 0 > cpuset.sched_load_balance
> > root@berbox39:/cgroup/cpuset# echo 1 > cpuset.sched_load_balance
> > root@berbox39:/cgroup/cpuset# echo 0 > cpuset.sched_load_balance
> > root@berbox39:/cgroup/cpuset# echo 1 > cpuset.sched_load_balance
> >
> > I haven't found the exact magic combination but I've been going back
> > and forth adding/killing burn.sh processes and toggling
> > cpuset.sched_load_balance and in a couple of minutes I can usually get
> > the machine to trigger the bug.
>
> I've also managed to reproduce this on 3.8.0-rc2 so it appears the bug
> is still present in the latest kernel.
>
> Also just re-reading my instructions above /root/burn.sh is just a
> simple:
>
> while true; do : ; done
>
> I've also had to make the kworker threads SCHED_FIFO with a higher
> priority than burn.sh or as expected I can lock up the system due to
> some xfs threads getting starved.
>
> Let me know if anyone needs any more information, or needs me to try
> anything since it appears I can trigger this fairly easily now.

I hit something very similar. In my case, I had booted with
sched_debug, and a 57600 serial console connected to my local box via a
horribly slow internet link, which led to it taking 40 seconds to get 2
of 80 cpus down for reboot.

There were a gaggle of CPUs with pilfered rt_runtime, a matching gaggle
of bandits, no runqueues were throttled, nor were there any with
rt_time accrued at crash time, but as CPU4 tried to go down, it
couldn't find its property, so cleverly fired a thermonuclear
pop-flare, lest thieves escape unseen, precious rt_runtime in hand.

runqueues
0x88047f811180 rt_runtime = 888192922    cpu0
0x880c7f811180 rt_runtime = RUNTIME_INF  cpu1
0x88087f811180 rt_runtime = RUNTIME_INF  cpu2
0x880e7f811180 rt_runtime = 699462573    cpu3
0x88047f851180 rt_runtime = 841094505    cpu4
0x880c7f851180 rt_runtime = 92625        cpu5
0x88087f851180 rt_runtime = 92625        cpu6
0x880e7f851180 rt_runtime = 92625        cpu7
0x88047f891180 rt_runtime = 92625        cpu8
0x880c7f891180 rt_runtime = 92625        cpu9
0x88087f891180 rt_runtime = 92625        cpu10
0x880e7f891180 rt_runtime = 92625        cpu11
0x88047f8d1180 rt_runtime = 92625        cpu12
0x880c7f8d1180 rt_runtime = 92625        cpu13
0x88087f8d1180 rt_runtime = 10           cpu14
0x880e7f8d1180 rt_runtime = 10           cpu15
0x88047f911180 rt_runtime = 10           cpu16
0x880c7f911180 rt_runtime = 10           cpu17
0x88087f911180 rt_runtime = 10           cpu18
0x880e7f911180 rt_runtime = 10           cpu19
0x88047f951180 rt_runtime = 10           cpu20
0x880c7f951180 rt_runtime = 10           cpu21
0x88087f951180 rt_runtime = 10           cpu23
0x880e7f951180 rt_runtime = 10           cpu24
0x88047f991180 rt_runtime = 10           cpu25
0x880c7f991180 rt_runtime = 95000        cpu26
0x88087f991180 rt_runtime = 10           cpu27
0x880e7f991180 rt_runtime = 95000        cpu28
0x88087fcd1180 rt_runtime = 95000        cpu79

crash bt
PID:
Re: kernel BUG at kernel/sched_rt.c:493!
On Tue, Jan 08, 2013 at 09:36:05AM -0500, Steven Rostedt wrote:
> > I've also managed to reproduce this on 3.8.0-rc2 so it appears the
> > bug is still present in the latest kernel.
>
> Shawn,
>
> Can you send me your .config file.

I've attached the 3.8.0-rc2 config that I used to reproduce this in an
8 core kvm image. Let me know if you need anything else.

Thanks,
Shawn

--
---
This email, along with any attachments, is confidential. If you believe
you received this message in error, please contact the sender
immediately and delete all copies of the message. Thank you.

3.8.0-rc2.config.gz
Description: GNU Zip compressed data
Re: kernel BUG at kernel/sched_rt.c:493!
On Mon, Jan 07, 2013 at 11:58:18AM -0600, Shawn Bohrer wrote:
> On Sat, Jan 05, 2013 at 11:46:32AM -0600, Shawn Bohrer wrote:
> > I've tried reproducing the issue, but so far I've been unsuccessful,
> > but I believe that is because my RT tasks aren't using enough CPU to
> > cause borrowing from the other runqueues. Normally our RT tasks use
> > very little CPU so I'm not entirely sure what conditions caused them
> > to run into throttling on the day that this happened.
>
> I've managed to reproduce this a couple times now on 3.1.9. I'll give
> this a try later with a more recent kernel. Here is what I've done to
> reproduce the issue.
>
> # Setup in shell 1
> root@berbox39:/cgroup/cpuset# mkdir package0
> root@berbox39:/cgroup/cpuset# echo 0 > package0/cpuset.mems
> root@berbox39:/cgroup/cpuset# echo 0,2,4,6 > package0/cpuset.cpus
> root@berbox39:/cgroup/cpuset# cat cpuset.sched_load_balance
> 1
> root@berbox39:/cgroup/cpuset# cat package0/cpuset.sched_load_balance
> 1
> root@berbox39:/cgroup/cpuset# cat sysdefault/cpuset.sched_load_balance
> 1
> root@berbox39:/cgroup/cpuset# echo 1,3,5,7 > sysdefault/cpuset.cpus
> root@berbox39:/cgroup/cpuset# echo 0 > sysdefault/cpuset.mems
> root@berbox39:/cgroup/cpuset# echo $$ > package0/tasks
>
> # Setup in shell 2
> root@berbox39:~# cd /cgroup/cpuset/
> root@berbox39:/cgroup/cpuset# chrt -f -p 60 $$
> root@berbox39:/cgroup/cpuset# echo $$ > sysdefault/tasks
>
> # In shell 1
> root@berbox39:/cgroup/cpuset# chrt -f 1 /root/burn.sh &
> root@berbox39:/cgroup/cpuset# chrt -f 1 /root/burn.sh &
>
> # In shell 2
> root@berbox39:/cgroup/cpuset# echo 0 > cpuset.sched_load_balance
> root@berbox39:/cgroup/cpuset# echo 1 > cpuset.sched_load_balance
> root@berbox39:/cgroup/cpuset# echo 0 > cpuset.sched_load_balance
> root@berbox39:/cgroup/cpuset# echo 1 > cpuset.sched_load_balance
>
> I haven't found the exact magic combination but I've been going back
> and forth adding/killing burn.sh processes and toggling
> cpuset.sched_load_balance and
> in a couple of minutes I can usually get the machine to trigger the
> bug.

I've also managed to reproduce this on 3.8.0-rc2 so it appears the bug
is still present in the latest kernel.

Also, just re-reading my instructions above, /root/burn.sh is just a
simple:

while true; do : ; done

I've also had to make the kworker threads SCHED_FIFO with a higher
priority than burn.sh or, as expected, I can lock up the system due to
some xfs threads getting starved.

Let me know if anyone needs any more information, or needs me to try
anything, since it appears I can trigger this fairly easily now.

--
Shawn
Re: kernel BUG at kernel/sched_rt.c:493!
On Sat, Jan 05, 2013 at 11:46:32AM -0600, Shawn Bohrer wrote:
> I've tried reproducing the issue, but so far I've been unsuccessful,
> but I believe that is because my RT tasks aren't using enough CPU to
> cause borrowing from the other runqueues. Normally our RT tasks use
> very little CPU so I'm not entirely sure what conditions caused them
> to run into throttling on the day that this happened.

I've managed to reproduce this a couple times now on 3.1.9. I'll give
this a try later with a more recent kernel. Here is what I've done to
reproduce the issue.

# Setup in shell 1
root@berbox39:/cgroup/cpuset# mkdir package0
root@berbox39:/cgroup/cpuset# echo 0 > package0/cpuset.mems
root@berbox39:/cgroup/cpuset# echo 0,2,4,6 > package0/cpuset.cpus
root@berbox39:/cgroup/cpuset# cat cpuset.sched_load_balance
1
root@berbox39:/cgroup/cpuset# cat package0/cpuset.sched_load_balance
1
root@berbox39:/cgroup/cpuset# cat sysdefault/cpuset.sched_load_balance
1
root@berbox39:/cgroup/cpuset# echo 1,3,5,7 > sysdefault/cpuset.cpus
root@berbox39:/cgroup/cpuset# echo 0 > sysdefault/cpuset.mems
root@berbox39:/cgroup/cpuset# echo $$ > package0/tasks

# Setup in shell 2
root@berbox39:~# cd /cgroup/cpuset/
root@berbox39:/cgroup/cpuset# chrt -f -p 60 $$
root@berbox39:/cgroup/cpuset# echo $$ > sysdefault/tasks

# In shell 1
root@berbox39:/cgroup/cpuset# chrt -f 1 /root/burn.sh &
root@berbox39:/cgroup/cpuset# chrt -f 1 /root/burn.sh &

# In shell 2
root@berbox39:/cgroup/cpuset# echo 0 > cpuset.sched_load_balance
root@berbox39:/cgroup/cpuset# echo 1 > cpuset.sched_load_balance
root@berbox39:/cgroup/cpuset# echo 0 > cpuset.sched_load_balance
root@berbox39:/cgroup/cpuset# echo 1 > cpuset.sched_load_balance

I haven't found the exact magic combination but I've been going back
and forth adding/killing burn.sh processes and toggling
cpuset.sched_load_balance, and in a couple of minutes I can usually
get the machine to trigger the bug.

--
Shawn
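[Editorial aside, not part of the thread.] Shawn's two-shell transcript above can be collapsed into one script for convenience. This is a sketch assembled from the transcript, not something posted in the thread: it assumes the cpuset controller is already mounted at /cgroup/cpuset with an existing `sysdefault` group (as on Shawn's box), it must run as root, and per the follow-up message it can wedge the machine unless the kworker threads are made SCHED_FIFO above the burners first.

```shell
#!/bin/bash
# Repro sketch for the sched_rt.c:493 BUG, assembled from Shawn's
# transcript.  Assumptions: cpuset hierarchy mounted at /cgroup/cpuset,
# a "sysdefault" cpuset already exists, running as root on an 8-CPU box.
# WARNING: FIFO burners can starve kworker/xfs threads and hang the box.
set -e
cd /cgroup/cpuset

# Partition the CPUs into two cpusets on memory node 0.
mkdir -p package0
echo 0       > package0/cpuset.mems
echo 0,2,4,6 > package0/cpuset.cpus
echo 1,3,5,7 > sysdefault/cpuset.cpus
echo 0       > sysdefault/cpuset.mems

# burn.sh is just a busy loop, as described in the thread.
printf '#!/bin/bash\nwhile true; do : ; done\n' > /root/burn.sh
chmod +x /root/burn.sh

# "Shell 1": place ourselves in package0 and start two FIFO-1 burners.
echo $$ > package0/tasks
chrt -f 1 /root/burn.sh &
chrt -f 1 /root/burn.sh &

# "Shell 2": move to sysdefault at FIFO 60 so we keep running, then
# toggle root-level load balancing to split/rejoin the sched domains,
# which is what shuffles rt_runtime between runqueues.
chrt -f -p 60 $$
echo $$ > sysdefault/tasks
for i in 0 1 0 1; do
    echo $i > cpuset.sched_load_balance
    sleep 1
done
```

As in the original transcript, one pass rarely suffices; repeating the add/kill-burners and toggle cycle for a couple of minutes is what eventually trips the BUG.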