Re: kernel BUG at kernel/sched_rt.c:493!

2013-01-10 Thread Mike Galbraith
On Fri, 2013-01-11 at 06:22 +0100, Mike Galbraith wrote: 
> On Thu, 2013-01-10 at 13:58 -0600, Shawn Bohrer wrote:
> 
> > Here is the output:
> > 
> > [   81.278842] SysRq : Changing Loglevel
> > [   81.279027] Loglevel set to 9
> > [   83.285456] Initial want: 5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 9
> > [   85.286452] Initial want: 5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 9
> > [   85.289625] Initial want: 5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 9
> > [   87.287435] Initial want: 1 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 85000
> > [   87.290718] Initial want: 5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 9
> > [   89.288469] Initial want: -5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 10
> > [   89.291550] Initial want: 15000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 8
> > [   89.292940] Initial want: 1 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 85000
> > [   89.294082] Initial want: 1 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 85000
> > [   89.295194] Initial want: 5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 9
> > [   89.296274] Initial want: 5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 9
> > [   90.959004] [sched_delayed] sched: RT throttling activated
> > [   91.289470] Initial want: 2 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 75000
> > [   91.292767] Initial want: 2 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 75000
> > [   91.294037] Initial want: 2 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 75000
> > [   91.295364] Initial want: 2 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 75000
> > [   91.296355] BUG triggered, want: 2
> > [   91.296355] 
> > [   91.296355] rt_rq[7]:
> > [   91.296355]   .rt_nr_running : 0
> > [   91.296355]   .rt_throttled  : 0
> > [   91.296355]   .rt_time   : 0.00
> > [   91.296355]   .rt_runtime: 750.00
> > [   91.307332] Initial want: -5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 10
> > [   91.308440] Initial want: -1 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 105000
> > [   91.309586] Initial want: -15000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 11
> > [   91.310716] Initial want: -2 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 115000
> > [   91.311707] BUG triggered, want: -2
> > [   91.311707] 
> > [   91.311707] rt_rq[6]:
> > [   91.311707]   .rt_nr_running : 1
> > [   91.311707]   .rt_throttled  : 0
> > [   91.311707]   .rt_time   : 307.209987
> > [   91.311707]   .rt_runtime: 1150.00
> 
> That makes about as much sense as my crash did.  There is no leak, but
> cpu found nada.  So rd/span is changing on us?

So I looked at the locking (yet again), and (for the umpteenth time) see
no way in the world that can happen.  Hrmph.

-Mike

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: kernel BUG at kernel/sched_rt.c:493!

2013-01-10 Thread Mike Galbraith
On Thu, 2013-01-10 at 13:58 -0600, Shawn Bohrer wrote:

> Here is the output:
> 
> [   81.278842] SysRq : Changing Loglevel
> [   81.279027] Loglevel set to 9
> [   83.285456] Initial want: 5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 9
> [   85.286452] Initial want: 5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 9
> [   85.289625] Initial want: 5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 9
> [   87.287435] Initial want: 1 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 85000
> [   87.290718] Initial want: 5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 9
> [   89.288469] Initial want: -5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 10
> [   89.291550] Initial want: 15000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 8
> [   89.292940] Initial want: 1 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 85000
> [   89.294082] Initial want: 1 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 85000
> [   89.295194] Initial want: 5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 9
> [   89.296274] Initial want: 5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 9
> [   90.959004] [sched_delayed] sched: RT throttling activated
> [   91.289470] Initial want: 2 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 75000
> [   91.292767] Initial want: 2 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 75000
> [   91.294037] Initial want: 2 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 75000
> [   91.295364] Initial want: 2 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 75000
> [   91.296355] BUG triggered, want: 2
> [   91.296355] 
> [   91.296355] rt_rq[7]:
> [   91.296355]   .rt_nr_running : 0
> [   91.296355]   .rt_throttled  : 0
> [   91.296355]   .rt_time   : 0.00
> [   91.296355]   .rt_runtime: 750.00
> [   91.307332] Initial want: -5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 10
> [   91.308440] Initial want: -1 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 105000
> [   91.309586] Initial want: -15000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 11
> [   91.310716] Initial want: -2 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 115000
> [   91.311707] BUG triggered, want: -2
> [   91.311707] 
> [   91.311707] rt_rq[6]:
> [   91.311707]   .rt_nr_running : 1
> [   91.311707]   .rt_throttled  : 0
> [   91.311707]   .rt_time   : 307.209987
> [   91.311707]   .rt_runtime: 1150.00

That makes about as much sense as my crash did.  There is no leak, but
cpu found nada.  So rd/span is changing on us?  We saw nada as we
traversed, released locks, poof, it's there.  In my dump, at crash time,
rd/span was fine, but at the top of the stack, I found a mighty
suspicious reference to def_root_domain.span.  At _crash_ time, both it
and rq->rd.span had pilfer-able goodies, but whatever it was that we
traversed had nada just a wee bit earlier.

If the darn thing would trigger for me again, I'd use trace_printk():
check rd and cpumask_weight(rd->span) before/after the traverse, print cpu
and rt_runtime as we traverse, set /proc/sys/kernel/ftrace_dump_on_oops
to 1... and hope the bug doesn't like to play heisenbug games.
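For reference, the instrumentation described above might look something like the following sketch against the 3.8-era __disable_runtime() (untested; the exact context lines and hunk placement are approximations, not a submitted patch):

```diff
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ static void __disable_runtime(struct rq *rq)
 		want = rt_b->rt_runtime - rt_rq->rt_runtime;
+		trace_printk("pre: rd=%p weight=%u want=%lld\n",
+			     rd, cpumask_weight(rd->span), want);
 
 		/*
 		 * Greedy reclaim, take back as much as we can.
 		 */
 		for_each_cpu(i, rd->span) {
 			struct rt_rq *iter = sched_rt_period_rt_rq(rt_b, i);
 			s64 diff;
+
+			trace_printk("cpu%d rt_runtime=%llu\n",
+				     i, iter->rt_runtime);
@@ static void __disable_runtime(struct rq *rq)
+		trace_printk("post: rd=%p weight=%u want=%lld\n",
+			     rd, cpumask_weight(rd->span), want);
 		BUG_ON(want);
```

With ftrace_dump_on_oops set, the trace buffer (including any rd pointer or span-weight change between the pre and post lines) would land on the console when the BUG fires.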

-Mike 





Re: kernel BUG at kernel/sched_rt.c:493!

2013-01-10 Thread Shawn Bohrer
On Thu, Jan 10, 2013 at 05:13:11AM +0100, Mike Galbraith wrote:
> On Tue, 2013-01-08 at 09:01 -0600, Shawn Bohrer wrote: 
> > On Tue, Jan 08, 2013 at 09:36:05AM -0500, Steven Rostedt wrote:
> > > > 
> > > > I've also managed to reproduce this on 3.8.0-rc2 so it appears the bug
> > > > is still present in the latest kernel.
> > > 
> > > Shawn,
> > > 
> > > Can you send me your .config file.
> > 
> > I've attached the 3.8.0-rc2 config that I used to reproduce this in an
> > 8 core kvm image.  Let me know if you need anything else.
> 
> I tried beating on my little Q6600 with no success.  I even tried
> setting the entire box rt, GUI and all, nada.
> 
> Hm, maybe re-installing systemd..

I don't know if Steve has had any success.  I can reproduce this easily
now so I'm happy to do some debugging if anyone has some things they
want me to try.

Here is some info on my setup at the moment.  I'm using an 8 core KVM
image now with an xfs file system.  We do use systemd if that is
relevant.  My cpuset controller is mounted on /cgroup/cpuset and we
use libcgroup-tools to move everything on the system that can be moved
into /cgroup/cpuset/sysdefault/.  I've also boosted all kworker threads
to run as SCHED_FIFO with a priority of 51.  From there I just drop
the three attached shell scripts (burn.sh, sched_domain_bug.sh and
sched_domain_burn.sh) in /root/ and run /root/sched_domain_bug.sh as
root.  Usually the bug triggers in less than a minute.  You may need
to tweak my shell scripts if your setup is different but they are very
rudimentary.

In order to try digging up some more info I applied the following
patch, and triggered the bug a few times.  The results are always
essentially the same:

---
 kernel/sched/rt.c |    9 ++++++++-
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 418feb0..fba7f01 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -650,6 +650,8 @@ static void __disable_runtime(struct rq *rq)
 * we lend and now have to reclaim.
 */
want = rt_b->rt_runtime - rt_rq->rt_runtime;
+   printk(KERN_INFO "Initial want: %lld rt_b->rt_runtime: %llu rt_rq->rt_runtime: %llu\n",
+          want, rt_b->rt_runtime, rt_rq->rt_runtime);
 
/*
 * Greedy reclaim, take back as much as we can.
@@ -684,7 +686,12 @@ static void __disable_runtime(struct rq *rq)
 * We cannot be left wanting - that would mean some runtime
 * leaked out of the system.
 */
-   BUG_ON(want);
+   if (want) {
+   printk(KERN_ERR "BUG triggered, want: %lld\n", want);
+   for_each_cpu(i, rd->span) {
+   print_rt_stats(NULL, i);
+   }
+   }
 balanced:
/*
 * Disable all the borrow logic by pretending we have inf
---

Here is the output:

[   81.278842] SysRq : Changing Loglevel
[   81.279027] Loglevel set to 9
[   83.285456] Initial want: 5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 9
[   85.286452] Initial want: 5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 9
[   85.289625] Initial want: 5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 9
[   87.287435] Initial want: 1 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 85000
[   87.290718] Initial want: 5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 9
[   89.288469] Initial want: -5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 10
[   89.291550] Initial want: 15000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 8
[   89.292940] Initial want: 1 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 85000
[   89.294082] Initial want: 1 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 85000
[   89.295194] Initial want: 5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 9
[   89.296274] Initial want: 5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 9
[   90.959004] [sched_delayed] sched: RT throttling activated
[   91.289470] Initial want: 2 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 75000
[   91.292767] Initial want: 2 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 75000
[   91.294037] Initial want: 2 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 75000
[   91.295364] Initial want: 2 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 75000
[   91.296355] BUG triggered, want: 2
[   91.296355] 
[   91.296355] rt_rq[7]:
[   91.296355]   .rt_nr_running : 0
[   91.296355]   .rt_throttled  : 0
[   91.296355]   .rt_time   : 0.00
[   91.296355]   .rt_runtime: 750.00
[   91.307332] Initial want: -5000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 10
[   91.308440] Initial want: -1 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 105000
[   91.309586] Initial want: -15000 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 11
[   91.310716] Initial want: -2 rt_b->rt_runtime: 95000 rt_rq->rt_runtime: 115000
[   91.311707] BUG triggered, want: -2
[   91.311707] 
[   91.311707] rt_rq[6]:
[   91.311707]   .rt_nr_running : 1
[   91.311707]   .rt_throttled  : 0
[   91.311707]   .rt_time   : 307.209987
[   91.311707]   .rt_runtime: 1150.00



Re: kernel BUG at kernel/sched_rt.c:493!

2013-01-09 Thread Mike Galbraith
On Tue, 2013-01-08 at 09:01 -0600, Shawn Bohrer wrote: 
> On Tue, Jan 08, 2013 at 09:36:05AM -0500, Steven Rostedt wrote:
> > > 
> > > I've also managed to reproduce this on 3.8.0-rc2 so it appears the bug
> > > is still present in the latest kernel.
> > 
> > Shawn,
> > 
> > Can you send me your .config file.
> 
> I've attached the 3.8.0-rc2 config that I used to reproduce this in an
> 8 core kvm image.  Let me know if you need anything else.

I tried beating on my little Q6600 with no success.  I even tried
setting the entire box rt, GUI and all, nada.

Hm, maybe re-installing systemd..

-Mike




Re: kernel BUG at kernel/sched_rt.c:493!

2013-01-08 Thread Shawn Bohrer
On Tue, Jan 08, 2013 at 09:36:05AM -0500, Steven Rostedt wrote:
> > 
> > I've also managed to reproduce this on 3.8.0-rc2 so it appears the bug
> > is still present in the latest kernel.
> 
> Shawn,
> 
> Can you send me your .config file.

I've attached the 3.8.0-rc2 config that I used to reproduce this in an
8 core kvm image.  Let me know if you need anything else.

Thanks,
Shawn



3.8.0-rc2.config.gz
Description: GNU Zip compressed data


Re: kernel BUG at kernel/sched_rt.c:493!

2013-01-08 Thread Steven Rostedt
> 
> I've also managed to reproduce this on 3.8.0-rc2 so it appears the bug
> is still present in the latest kernel.

Shawn,

Can you send me your .config file.

Thanks,

-- Steve


Re: kernel BUG at kernel/sched_rt.c:493!

2013-01-08 Thread Mike Galbraith
On Mon, 2013-01-07 at 17:02 -0600, Shawn Bohrer wrote: 
> On Mon, Jan 07, 2013 at 11:58:18AM -0600, Shawn Bohrer wrote:
> > On Sat, Jan 05, 2013 at 11:46:32AM -0600, Shawn Bohrer wrote:
> > > I've tried reproducing the issue, but so far I've been unsuccessful
> > > but I believe that is because my RT tasks aren't using enough CPU to
> > > cause borrowing from the other runqueues.  Normally our RT tasks use
> > > very little CPU so I'm not entirely sure what conditions caused them
> > > to run into throttling on the day that this happened.
> > 
> > I've managed to reproduce this a couple times now on 3.1.9 I'll give
> > this a try later with a more recent kernel.  Here is what I've done to
> > reproduce the issue.
> > 
> > 
> > # Setup in shell 1
> > root@berbox39:/cgroup/cpuset# mkdir package0
> > root@berbox39:/cgroup/cpuset# echo 0 > package0/cpuset.mems
> > root@berbox39:/cgroup/cpuset# echo 0,2,4,6 > package0/cpuset.cpus
> > root@berbox39:/cgroup/cpuset# cat cpuset.sched_load_balance
> > 1
> > root@berbox39:/cgroup/cpuset# cat package0/cpuset.sched_load_balance
> > 1
> > root@berbox39:/cgroup/cpuset# cat sysdefault/cpuset.sched_load_balance
> > 1
> > root@berbox39:/cgroup/cpuset# echo 1,3,5,7 > sysdefault/cpuset.cpus
> > root@berbox39:/cgroup/cpuset# echo 0 > sysdefault/cpuset.mems
> > root@berbox39:/cgroup/cpuset# echo $$ > package0/tasks
> > 
> > # Setup in shell 2
> > root@berbox39:~# cd /cgroup/cpuset/
> > root@berbox39:/cgroup/cpuset# chrt -f -p 60 $$
> > root@berbox39:/cgroup/cpuset# echo $$ > sysdefault/tasks
> > 
> > # In shell 1
> > root@berbox39:/cgroup/cpuset# chrt -f 1 /root/burn.sh &
> > root@berbox39:/cgroup/cpuset# chrt -f 1 /root/burn.sh &
> > 
> > # In shell 2
> > root@berbox39:/cgroup/cpuset# echo 0 > cpuset.sched_load_balance
> > root@berbox39:/cgroup/cpuset# echo 1 > cpuset.sched_load_balance
> > root@berbox39:/cgroup/cpuset# echo 0 > cpuset.sched_load_balance
> > root@berbox39:/cgroup/cpuset# echo 1 > cpuset.sched_load_balance
> > 
> > I haven't found the exact magic combination but I've been going back
> > and forth adding/killing burn.sh processes and toggling
> > cpuset.sched_load_balance and in a couple of minutes I can usually get
> > the machine to trigger the bug.
> 
> I've also managed to reproduce this on 3.8.0-rc2 so it appears the bug
> is still present in the latest kernel.
> 
> Also just re-reading my instructions above /root/burn.sh is just a
> simple:
> 
> while true; do : ; done
> 
> I've also had to make the kworker threads SCHED_FIFO with a higher
> priority than burn.sh or as expected I can lock up the system due to
> some xfs threads getting starved.
> 
> Let me know if anyone needs any more information, or needs me to try
> anything since it appears I can trigger this fairly easily now.

I hit something very similar.

In my case, I had booted with sched_debug, and a 57600 serial console
connected to my local box via a horribly slow internet link, which led
to it taking 40 seconds to get 2 of 80 cpus down for reboot.  There were
a gaggle of CPUs with pilfered rt_runtime, a matching gaggle of bandits,
no runqueues were throttled, nor were there any with rt_time accrued at
crash time, but as CPU4 tried to go down, it couldn't find its property,
so cleverly fired a thermonuclear pop-flare, lest thieves escape unseen,
precious rt_runtime in hand.

 runqueues 
 0x88047f811180  rt_runtime = 888192922   cpu0
 0x880c7f811180  rt_runtime = RUNTIME_INF cpu1
 0x88087f811180  rt_runtime = RUNTIME_INF cpu2
 0x880e7f811180  rt_runtime = 699462573   cpu3
 0x88047f851180  rt_runtime = 841094505   cpu4
 0x880c7f851180  rt_runtime = 92625   cpu5
 0x88087f851180  rt_runtime = 92625   cpu6
 0x880e7f851180  rt_runtime = 92625   cpu7
 0x88047f891180  rt_runtime = 92625   cpu8
 0x880c7f891180  rt_runtime = 92625   cpu9
 0x88087f891180  rt_runtime = 92625   cpu10
 0x880e7f891180  rt_runtime = 92625   cpu11
 0x88047f8d1180  rt_runtime = 92625   cpu12
 0x880c7f8d1180  rt_runtime = 92625   cpu13
 0x88087f8d1180  rt_runtime = 10  cpu14
 0x880e7f8d1180  rt_runtime = 10  cpu15
 0x88047f911180  rt_runtime = 10  cpu16
 0x880c7f911180  rt_runtime = 10  cpu17
 0x88087f911180  rt_runtime = 10  cpu18
 0x880e7f911180  rt_runtime = 10  cpu19
 0x88047f951180  rt_runtime = 10  cpu20
 0x880c7f951180  rt_runtime = 10  cpu21
 0x88087f951180  rt_runtime = 10  cpu23
 0x880e7f951180  rt_runtime = 10  cpu24
 0x88047f991180  rt_runtime = 10  cpu25
 0x880c7f991180  rt_runtime = 95000   cpu26
 0x88087f991180  rt_runtime = 10  cpu27
 0x880e7f991180  rt_runtime = 95000   cpu28
 
 0x88087fcd1180  rt_runtime = 95000   cpu79

Re: kernel BUG at kernel/sched_rt.c:493!

2013-01-08 Thread Mike Galbraith
On Mon, 2013-01-07 at 17:02 -0600, Shawn Bohrer wrote: 
 On Mon, Jan 07, 2013 at 11:58:18AM -0600, Shawn Bohrer wrote:
  On Sat, Jan 05, 2013 at 11:46:32AM -0600, Shawn Bohrer wrote:
   I've tried reproducing the issue, but so far I've been unsuccessful
   but I believe that is because my RT tasks aren't using enough CPU
   cause borrowing from the other runqueues.  Normally our RT tasks use
   very little CPU so I'm not entirely sure what conditions caused them
   to run into throttling on the day that this happened.
  
  I've managed to reproduce this a couple times now on 3.1.9 I'll give
  this a try later with a more recent kernel.  Here is what I've done to
  reproduce the issue.
  
  
  # Setup in shell 1
  root@berbox39:/cgroup/cpuset# mkdir package0
  root@berbox39:/cgroup/cpuset# echo 0  package0/cpuset.mems
  root@berbox39:/cgroup/cpuset# echo 0,2,4,6  package0/cpuset.cpus
  root@berbox39:/cgroup/cpuset# cat cpuset.sched_load_balance
  1
  root@berbox39:/cgroup/cpuset# cat package0/cpuset.sched_load_balance
  1
  root@berbox39:/cgroup/cpuset# cat sysdefault/cpuset.sched_load_balance
  1
  root@berbox39:/cgroup/cpuset# echo 1,3,5,7  sysdefault/cpuset.cpus
  root@berbox39:/cgroup/cpuset# echo 0  sysdefault/cpuset.mems
  root@berbox39:/cgroup/cpuset# echo $$  package0/tasks
  
  # Setup in shell 2
  root@berbox39:~# cd /cgroup/cpuset/
  root@berbox39:/cgroup/cpuset# chrt -f -p 60 $$
  root@berbox39:/cgroup/cpuset# echo $$  sysdefault/tasks
  
  # In shell 1
  root@berbox39:/cgroup/cpuset# chrt -f 1 /root/burn.sh 
  root@berbox39:/cgroup/cpuset# chrt -f 1 /root/burn.sh 
  
  # In shell 2
  root@berbox39:/cgroup/cpuset# echo 0  cpuset.sched_load_balance
  root@berbox39:/cgroup/cpuset# echo 1  cpuset.sched_load_balance
  root@berbox39:/cgroup/cpuset# echo 0  cpuset.sched_load_balance
  root@berbox39:/cgroup/cpuset# echo 1  cpuset.sched_load_balance
  
  I haven't found the exact magic combination but I've been going back
  and forth adding/killing burn.sh processes and toggling
  cpuset.sched_load_balance and in a couple of minutes I can usually get
  the machine to trigger the bug.
 
 I've also managed to reproduce this on 3.8.0-rc2 so it appears the bug
 is still present in the latest kernel.
 
 Also just re-reading my instructions above /root/burn.sh is just a
 simple:
 
 while true; do : ; done
 
 I've also had to make the kworker threads SCHED_FIFO with a higher
 priority than burn.sh or as expected I can lock up the system due to
 some xfs threads getting starved.
 
 Let me know if anyone needs any more information, or needs me to try
 anything since it appears I can trigger this fairly easily now.

I hit something very similar.

In my case, I had booted with sched_debug, and a 57600 serial console
connected to my local box via a horribly slow internet link, which led
to it taking 40 seconds to get 2 of 80 cpus down for reboot.  There were
a gaggle of CPUs with pilfered rt_runtime, a matching gaggle of bandits,
no runqueues were throttled, nor were there any with rt_time accrued at
crash time, but as CPU4 tried to go down, it couldn't find its property,
so cleverly fired a thermonuclear pop-flare, lest thieves escape unseen,
precious rt_runtime in hand.
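To make that accounting concrete, here is a toy shell model of the reclaim walk a dying CPU performs: it wants back the difference between its base and current runtime, it can only reclaim from siblings holding more than base, and a nonzero residue is the BUG_ON case. All runtime values below are invented for illustration, not taken from the dump.

```shell
#!/bin/sh
# Toy model of the runtime-reclaim walk at CPU offline; numbers invented.
base=950000000              # rt_b->rt_runtime, the per-runqueue base
mine=750000000              # this rt_rq->rt_runtime after lending 200ms out
want=$((base - mine))       # how much must be clawed back before offlining
for other in 1000000000 1050000000; do    # hypothetical sibling runtimes
    diff=$((other - base))                # only over-endowed siblings help
    [ "$diff" -gt 0 ] || continue
    if [ "$diff" -gt "$want" ]; then diff=$want; fi
    want=$((want - diff))
done
echo "residual want: $want"  # nonzero residue is the BUG_ON(want) case
```

With these made-up numbers the siblings only hold 150ms of the 200ms lent out, so 50ms of runtime has gone missing and the BUG_ON would trip, which matches the flavor of the nonzero "want" values reported earlier in the thread.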

 runqueues 
 0x88047f811180  rt_runtime = 888192922   cpu0
 0x880c7f811180  rt_runtime = RUNTIME_INF cpu1
 0x88087f811180  rt_runtime = RUNTIME_INF cpu2
 0x880e7f811180  rt_runtime = 699462573   cpu3
 0x88047f851180  rt_runtime = 841094505   cpu4
 0x880c7f851180  rt_runtime = 92625   cpu5
 0x88087f851180  rt_runtime = 92625   cpu6
 0x880e7f851180  rt_runtime = 92625   cpu7
 0x88047f891180  rt_runtime = 92625   cpu8
 0x880c7f891180  rt_runtime = 92625   cpu9
 0x88087f891180  rt_runtime = 92625   cpu10
 0x880e7f891180  rt_runtime = 92625   cpu11
 0x88047f8d1180  rt_runtime = 92625   cpu12
 0x880c7f8d1180  rt_runtime = 92625   cpu13
 0x88087f8d1180  rt_runtime = 10  cpu14
 0x880e7f8d1180  rt_runtime = 10  cpu15
 0x88047f911180  rt_runtime = 10  cpu16
 0x880c7f911180  rt_runtime = 10  cpu17
 0x88087f911180  rt_runtime = 10  cpu18
 0x880e7f911180  rt_runtime = 10  cpu19
 0x88047f951180  rt_runtime = 10  cpu20
 0x880c7f951180  rt_runtime = 10  cpu21
 0x88087f951180  rt_runtime = 10  cpu23
 0x880e7f951180  rt_runtime = 10  cpu24
 0x88047f991180  rt_runtime = 10  cpu25
 0x880c7f991180  rt_runtime = 95000   cpu26
 0x88087f991180  rt_runtime = 10  cpu27
 0x880e7f991180  rt_runtime = 95000   cpu28
 
 0x88087fcd1180  rt_runtime = 95000   cpu79

crash bt
PID: 

Re: kernel BUG at kernel/sched_rt.c:493!

2013-01-08 Thread Steven Rostedt
 
> I've also managed to reproduce this on 3.8.0-rc2 so it appears the bug
> is still present in the latest kernel.

Shawn,

Can you send me your .config file.

Thanks,

-- Steve
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: kernel BUG at kernel/sched_rt.c:493!

2013-01-08 Thread Shawn Bohrer
On Tue, Jan 08, 2013 at 09:36:05AM -0500, Steven Rostedt wrote:
  
  I've also managed to reproduce this on 3.8.0-rc2 so it appears the bug
  is still present in the latest kernel.
 
 Shawn,
 
 Can you send me your .config file.

I've attached the 3.8.0-rc2 config that I used to reproduce this in an
8 core kvm image.  Let me know if you need anything else.

Thanks,
Shawn

-- 

---
This email, along with any attachments, is confidential. If you 
believe you received this message in error, please contact the 
sender immediately and delete all copies of the message.  
Thank you.


3.8.0-rc2.config.gz
Description: GNU Zip compressed data


Re: kernel BUG at kernel/sched_rt.c:493!

2013-01-07 Thread Shawn Bohrer
On Mon, Jan 07, 2013 at 11:58:18AM -0600, Shawn Bohrer wrote:
> On Sat, Jan 05, 2013 at 11:46:32AM -0600, Shawn Bohrer wrote:
> > I've tried reproducing the issue, but so far I've been unsuccessful
> > but I believe that is because my RT tasks aren't using enough CPU to
> > cause borrowing from the other runqueues.  Normally our RT tasks use
> > very little CPU so I'm not entirely sure what conditions caused them
> > to run into throttling on the day that this happened.
> 
> I've managed to reproduce this a couple times now on 3.1.9.  I'll give
> this a try later with a more recent kernel.  Here is what I've done to
> reproduce the issue.
> 
> 
> # Setup in shell 1
> root@berbox39:/cgroup/cpuset# mkdir package0
> root@berbox39:/cgroup/cpuset# echo 0 > package0/cpuset.mems
> root@berbox39:/cgroup/cpuset# echo 0,2,4,6 > package0/cpuset.cpus
> root@berbox39:/cgroup/cpuset# cat cpuset.sched_load_balance
> 1
> root@berbox39:/cgroup/cpuset# cat package0/cpuset.sched_load_balance
> 1
> root@berbox39:/cgroup/cpuset# cat sysdefault/cpuset.sched_load_balance
> 1
> root@berbox39:/cgroup/cpuset# echo 1,3,5,7 > sysdefault/cpuset.cpus
> root@berbox39:/cgroup/cpuset# echo 0 > sysdefault/cpuset.mems
> root@berbox39:/cgroup/cpuset# echo $$ > package0/tasks
> 
> # Setup in shell 2
> root@berbox39:~# cd /cgroup/cpuset/
> root@berbox39:/cgroup/cpuset# chrt -f -p 60 $$
> root@berbox39:/cgroup/cpuset# echo $$ > sysdefault/tasks
> 
> # In shell 1
> root@berbox39:/cgroup/cpuset# chrt -f 1 /root/burn.sh &
> root@berbox39:/cgroup/cpuset# chrt -f 1 /root/burn.sh &
> 
> # In shell 2
> root@berbox39:/cgroup/cpuset# echo 0 > cpuset.sched_load_balance
> root@berbox39:/cgroup/cpuset# echo 1 > cpuset.sched_load_balance
> root@berbox39:/cgroup/cpuset# echo 0 > cpuset.sched_load_balance
> root@berbox39:/cgroup/cpuset# echo 1 > cpuset.sched_load_balance
> 
> I haven't found the exact magic combination but I've been going back
> and forth adding/killing burn.sh processes and toggling
> cpuset.sched_load_balance and in a couple of minutes I can usually get
> the machine to trigger the bug.

I've also managed to reproduce this on 3.8.0-rc2 so it appears the bug
is still present in the latest kernel.

Also, just re-reading my instructions above: /root/burn.sh is just a
simple:

while true; do : ; done
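As a standalone file, a sketch of that burner might look like the following; the optional count argument is an addition purely so the script can be sanity-checked without pinning a core forever, and the real repro runs it unbounded:

```shell
#!/bin/sh
# burn.sh sketch: pure CPU spinner.  With no bound it behaves like the
# original `while true; do : ; done`; a numeric argument caps the loop
# so the script itself can be tested.
count=${1:-100000}          # pass 0 to spin forever, as in the repro
i=0
while [ "$count" -eq 0 ] || [ "$i" -lt "$count" ]; do
    i=$((i + 1))
done
echo "burned $i iterations"
```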

I've also had to make the kworker threads SCHED_FIFO with a higher
priority than burn.sh; otherwise, as expected, I can lock up the system
due to some xfs threads getting starved.
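A hedged sketch of that kworker adjustment follows; priority 2 is an assumption, chosen only to sit above the prio-1 burners, and the change needs root to take effect:

```shell
#!/bin/sh
# Sketch: put every kworker thread in SCHED_FIFO at priority 2, one step
# above the priority-1 burn.sh tasks, so xfs work items keep running.
# Without root, chrt fails and the pid is merely reported.
PRIO=2
for pid in $(pgrep kworker); do
    chrt -f -p "$PRIO" "$pid" 2>/dev/null || echo "could not set $pid"
done
echo "kworker pass complete"
```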

Let me know if anyone needs any more information, or needs me to try
anything since it appears I can trigger this fairly easily now.

--
Shawn



Re: kernel BUG at kernel/sched_rt.c:493!

2013-01-07 Thread Shawn Bohrer
On Sat, Jan 05, 2013 at 11:46:32AM -0600, Shawn Bohrer wrote:
> I've tried reproducing the issue, but so far I've been unsuccessful
> but I believe that is because my RT tasks aren't using enough CPU to
> cause borrowing from the other runqueues.  Normally our RT tasks use
> very little CPU so I'm not entirely sure what conditions caused them
> to run into throttling on the day that this happened.

I've managed to reproduce this a couple times now on 3.1.9.  I'll give
this a try later with a more recent kernel.  Here is what I've done to
reproduce the issue.


# Setup in shell 1
root@berbox39:/cgroup/cpuset# mkdir package0
root@berbox39:/cgroup/cpuset# echo 0 > package0/cpuset.mems
root@berbox39:/cgroup/cpuset# echo 0,2,4,6 > package0/cpuset.cpus
root@berbox39:/cgroup/cpuset# cat cpuset.sched_load_balance
1
root@berbox39:/cgroup/cpuset# cat package0/cpuset.sched_load_balance
1
root@berbox39:/cgroup/cpuset# cat sysdefault/cpuset.sched_load_balance
1
root@berbox39:/cgroup/cpuset# echo 1,3,5,7 > sysdefault/cpuset.cpus
root@berbox39:/cgroup/cpuset# echo 0 > sysdefault/cpuset.mems
root@berbox39:/cgroup/cpuset# echo $$ > package0/tasks

# Setup in shell 2
root@berbox39:~# cd /cgroup/cpuset/
root@berbox39:/cgroup/cpuset# chrt -f -p 60 $$
root@berbox39:/cgroup/cpuset# echo $$ > sysdefault/tasks

# In shell 1
root@berbox39:/cgroup/cpuset# chrt -f 1 /root/burn.sh &
root@berbox39:/cgroup/cpuset# chrt -f 1 /root/burn.sh &

# In shell 2
root@berbox39:/cgroup/cpuset# echo 0 > cpuset.sched_load_balance
root@berbox39:/cgroup/cpuset# echo 1 > cpuset.sched_load_balance
root@berbox39:/cgroup/cpuset# echo 0 > cpuset.sched_load_balance
root@berbox39:/cgroup/cpuset# echo 1 > cpuset.sched_load_balance

I haven't found the exact magic combination but I've been going back
and forth adding/killing burn.sh processes and toggling
cpuset.sched_load_balance and in a couple of minutes I can usually get
the machine to trigger the bug.
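For anyone who wants the two-shell sequence in one place, the sketch below only prints the command checklist; the /cgroup/cpuset mount point, CPU lists, and pre-existing sysdefault group are taken from the transcript above, and it should be reviewed before running any of it as root:

```shell
#!/bin/sh
# Print the repro sequence as a checklist; nothing here touches cgroups.
CG=/cgroup/cpuset
cat <<EOF
mkdir $CG/package0
echo 0 > $CG/package0/cpuset.mems
echo 0,2,4,6 > $CG/package0/cpuset.cpus
echo 1,3,5,7 > $CG/sysdefault/cpuset.cpus
echo 0 > $CG/sysdefault/cpuset.mems
# shell 1, moved into package0: start two SCHED_FIFO prio-1 burners
chrt -f 1 /root/burn.sh &
chrt -f 1 /root/burn.sh &
# shell 2, in sysdefault at SCHED_FIFO prio 60: toggle load balancing
echo 0 > $CG/cpuset.sched_load_balance
echo 1 > $CG/cpuset.sched_load_balance
EOF
```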

--
Shawn


