Re: NMI watchdog triggering during load_balance

2015-03-09 Thread David Ahern

On 3/6/15 12:29 PM, Mike Galbraith wrote:

> On Fri, 2015-03-06 at 11:37 -0700, David Ahern wrote:


>> But, I do not understand how the wrong topology is causing the NMI
>> watchdog to trigger. In the end there are still N domains, M groups per
>> domain and P cpus per group. Doesn't the balancing walk over all of them
>> irrespective of physical topology?


> You have this size extra large CPU domain that you shouldn't have;
> massive collisions therein ensue.



I was able to get the socket/cores/threads issue resolved, so the 
topology is correct. But I still need to check out a few things. Thanks 
Mike and Peter for the suggestions.


David


Re: NMI watchdog triggering during load_balance

2015-03-07 Thread Peter Zijlstra
On Fri, Mar 06, 2015 at 11:37:11AM -0700, David Ahern wrote:
> On 3/6/15 11:11 AM, Mike Galbraith wrote:
> In responding earlier today I realized that the topology is all wrong as you
> were pointing out. There should be 16 NUMA domains (4 memory controllers per
> socket and 4 sockets). There should be 8 sibling cores. I will look into why
> that is not getting set up properly and what we can do about fixing it.

So we changed the numa topology setup a while back; see commit
cb83b629bae0 ("sched/numa: Rewrite the CONFIG_NUMA sched domain
support").

> But, I do not understand how the wrong topology is causing the NMI watchdog
> to trigger. In the end there are still N domains, M groups per domain and P
> cpus per group. Doesn't the balancing walk over all of them irrespective of
> physical topology?

Not quite; so for regular load balancing only the first CPU in the
domain will iterate up.

So if you have 4 'nodes' only 4 CPUs will iterate the entire machine,
not all 1024.
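
(A rough sketch of that gating, loosely modeled on the should_we_balance()
logic in kernel/sched/fair.c around v3.18; the struct and helper below are
illustrative stand-ins, not kernel code:)

/*
 * Toy model: for a regular (non-newidle) balance pass at some domain
 * level, only the first idle CPU in the local group -- or, failing
 * that, the group's first CPU -- goes on to scan the whole domain.
 * Every other CPU bails out immediately.
 */
#include <stdbool.h>

struct group_model {
        const int *cpus;        /* CPUs in this CPU's local group */
        int nr_cpus;
};

static bool should_balance(const struct group_model *g,
                           const bool *cpu_is_idle, int this_cpu)
{
        int balance_cpu = g->cpus[0];   /* default: first CPU in the group */

        for (int i = 0; i < g->nr_cpus; i++) {
                if (cpu_is_idle[g->cpus[i]]) {
                        balance_cpu = g->cpus[i];       /* first idle CPU wins */
                        break;
                }
        }

        return balance_cpu == this_cpu; /* everyone else skips this level */
}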



> Call Trace:
>  [0045dc30] double_rq_lock+0x4c/0x68
>  [0046a23c] load_balance+0x278/0x740
>  [008aa178] __schedule+0x378/0x8e4
>  [008aab1c] schedule+0x68/0x78
>  [004718ac] do_exit+0x798/0x7c0
>  [0047195c] do_group_exit+0x88/0xc0
>  [00481148] get_signal_to_deliver+0x3ec/0x4c8
>  [0042cbc0] do_signal+0x70/0x5e4
>  [0042d14c] do_notify_resume+0x18/0x50
>  [004049c4] __handle_signal+0xc/0x2c
> 
> 
> For example the stream program has 1024 threads (1 for each CPU). If you
> ctrl-c the program or wait for it to terminate, that's when it trips. Other
> workloads that routinely trip it are make -j N, N some number (e.g., on a
> 256 cpu system 'make -j 128'); 10 seconds later stop that build with ctrl-c
> ... boom with the above stack trace.
> 
> Code wise ... and this is still present in 3.18 and 3.20:
> 
> schedule()
> - __schedule()
>   + irqs disabled: raw_spin_lock_irq(&rq->lock);
> 
>  pick_next_task
>  - idle_balance()

> For 2.6.39 it's the invocation of idle_balance which is triggering load
> balancing with IRQs disabled. That's when the NMI watchdog trips.

So for idle_balance() look at SD_BALANCE_NEWIDLE, only domains with that
set will get iterated.
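
(Sketched out, the newidle pass walks every domain the CPU belongs to but
only acts on levels that still carry the flag; a simplified model, not an
excerpt of idle_balance(), and the flag value here is illustrative:)

#define SD_BALANCE_NEWIDLE_MODEL 0x02   /* illustrative flag value */

struct sd_model {
        unsigned int flags;
        struct sd_model *parent;        /* next larger domain level */
};

/*
 * Count how many levels a newly idle CPU would try to balance.  The real
 * code calls load_balance(cpu, rq, sd, CPU_NEWLY_IDLE, ...) for each such
 * level and stops early once it pulls a task; clearing the flag on distant
 * NUMA levels (as in the patch below) shrinks this walk.
 */
static int newidle_levels(const struct sd_model *sd)
{
        int levels = 0;

        for (; sd; sd = sd->parent)
                if (sd->flags & SD_BALANCE_NEWIDLE_MODEL)
                        levels++;

        return levels;
}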

I suppose you could try something like the below on 3.18.

It will disable SD_BALANCE_NEWIDLE on all 'distant' nodes; but first
check how your fixed NUMA topology looks and whether you trigger that
case at all.

---
 kernel/sched/core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 17141da77c6e..7fce683928fe 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6268,6 +6268,7 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
 		if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
 			sd->flags &= ~(SD_BALANCE_EXEC |
 				       SD_BALANCE_FORK |
+				       SD_BALANCE_NEWIDLE |
 				       SD_WAKE_AFFINE);
 		}
 



Re: NMI watchdog triggering during load_balance

2015-03-06 Thread Mike Galbraith
On Fri, 2015-03-06 at 11:37 -0700, David Ahern wrote:

> But, I do not understand how the wrong topology is causing the NMI 
> watchdog to trigger. In the end there are still N domains, M groups per 
> domain and P cpus per group. Doesn't the balancing walk over all of them 
> irrespective of physical topology?

You have this size extra large CPU domain that you shouldn't have;
massive collisions therein ensue.

-Mike



Re: NMI watchdog triggering during load_balance

2015-03-06 Thread David Ahern

On 3/6/15 11:11 AM, Mike Galbraith wrote:

> That was the question, _do_ you have any control, because that topology
> is toxic.  I guess your reply means 'nope'.


>> The system has 4 physical cpus (sockets). Each cpu has 32 cores with 8
>> threads per core and each cpu has 4 memory controllers.


> Thank god I've never met one of these, looks like the box from hell :)


>> If I disable SCHED_MC and CGROUPS_SCHED (group scheduling) there is a
>> noticeable improvement -- watchdog does not trigger and I do not get the
>> rq locks held for 2-3 seconds. But there is still fairly high cpu usage
>> for an idle system. Perhaps I should leave SCHED_MC on and disable
>> SCHED_SMT; I'll try that today.


> Well, if you disable SMT, your troubles _should_ shrink radically, as
> your box does. You should probably look at why you have CPU domains.
> You don't ever want to see that on a NUMA box.


In responding earlier today I realized that the topology is all wrong as 
you were pointing out. There should be 16 NUMA domains (4 memory 
controllers per socket and 4 sockets). There should be 8 sibling cores. 
I will look into why that is not getting set up properly and what we can 
do about fixing it.


--

But, I do not understand how the wrong topology is causing the NMI 
watchdog to trigger. In the end there are still N domains, M groups per 
domain and P cpus per group. Doesn't the balancing walk over all of them 
irrespective of physical topology?


Here's another data point that jelled this morning while explaining the 
problem to someone: the NMI watchdog trips on a mass exit:


TPC: <_raw_spin_trylock_bh+0x38/0x100>
g0: 7fff g1: 00ff g2: 00070f8c g3: fffe403b97891c98
g4: fffe803b963eda00 g5: 00010036c000 g6: fffe803b84108000 g7: 0093
o0: 0fe0 o1: 0fe0 o2: ff00 o3: 00200200
o4: 00a98080 o5:  sp: fffe803b8410ada1 ret_pc: 006800dc
RPC: 
l0: 00e9b114 l1: 0001 l2: 0001 l3: 0005
l4: 2000 l5: fffe803b8410b990 l6: 0004 l7: 00f267b0
i0: 000100b10700 i1:  i2: 000101324d80 i3: fffe803b8410b6c0
i4: 0038 i5: 0498 i6: fffe803b8410ae51 i7: 0045dc30
I7: 
Call Trace:
 [0045dc30] double_rq_lock+0x4c/0x68
 [0046a23c] load_balance+0x278/0x740
 [008aa178] __schedule+0x378/0x8e4
 [008aab1c] schedule+0x68/0x78
 [004718ac] do_exit+0x798/0x7c0
 [0047195c] do_group_exit+0x88/0xc0
 [00481148] get_signal_to_deliver+0x3ec/0x4c8
 [0042cbc0] do_signal+0x70/0x5e4
 [0042d14c] do_notify_resume+0x18/0x50
 [004049c4] __handle_signal+0xc/0x2c


For example the stream program has 1024 threads (1 for each CPU). If you 
ctrl-c the program or wait for it to terminate, that's when it trips. Other 
workloads that routinely trip it are make -j N, N some number (e.g., on 
a 256 cpu system 'make -j 128'); 10 seconds later stop that build with 
ctrl-c ... boom with the above stack trace.


Code wise ... and this is still present in 3.18 and 3.20:

schedule()
- __schedule()
  + irqs disabled: raw_spin_lock_irq(&rq->lock);

  pick_next_task
  - idle_balance()

  + irqs enabled:
    different task: context_switch(rq, prev, next)
      --> finish_lock_switch eventually
    or same task: raw_spin_unlock_irq(&rq->lock)


For 2.6.39 it's the invocation of idle_balance which is triggering load 
balancing with IRQs disabled. That's when the NMI watchdog trips.


I'll pound on 3.18 and see if I can reproduce something similar there.

David


Re: NMI watchdog triggering during load_balance

2015-03-06 Thread Mike Galbraith
On Fri, 2015-03-06 at 08:01 -0700, David Ahern wrote:
> On 3/5/15 9:52 PM, Mike Galbraith wrote:
> >> CPU970 attaching sched-domain:
> >>  domain 0: span 968-975 level SIBLING
> >>   groups: 8 single CPU groups
> >>   domain 1: span 968-975 level MC
> >>    groups: 1 group with 8 cpus
> >>    domain 2: span 768-1023 level CPU
> >>     groups: 4 groups with 256 cpus per group
> >
> > Wow, that topology is horrid.  I'm not surprised that your box is
> > writhing in agony.  Can you twiddle that?
> >
> 
> twiddle that how?

That was the question, _do_ you have any control, because that topology
is toxic.  I guess your reply means 'nope'.

> The system has 4 physical cpus (sockets). Each cpu has 32 cores with 8 
> threads per core and each cpu has 4 memory controllers.

Thank god I've never met one of these, looks like the box from hell :)

> If I disable SCHED_MC and CGROUPS_SCHED (group scheduling) there is a 
> noticeable improvement -- watchdog does not trigger and I do not get the 
> rq locks held for 2-3 seconds. But there is still fairly high cpu usage 
> for an idle system. Perhaps I should leave SCHED_MC on and disable 
> SCHED_SMT; I'll try that today.

Well, if you disable SMT, your troubles _should_ shrink radically, as
your box does. You should probably look at why you have CPU domains.
You don't ever want to see that on a NUMA box.

-Mike



Re: NMI watchdog triggering during load_balance

2015-03-06 Thread David Ahern

On 3/6/15 2:12 AM, Peter Zijlstra wrote:

> On Thu, Mar 05, 2015 at 09:05:28PM -0700, David Ahern wrote:

>> Socket(s): 32
>> NUMA node(s):  4


> Urgh, with 32 'cpus' per socket, you still do _8_ sockets per node, for
> a total of 256 cpus per node.


Per the response to Mike, the system has 4 physical cpus. Each cpu has 
32 cores with 8 threads per core and 4 memory controllers (one mcu per 8 
cores). Yes, there are 256 logical cpus per node.


David


Re: NMI watchdog triggering during load_balance

2015-03-06 Thread David Ahern

On 3/6/15 2:07 AM, Peter Zijlstra wrote:

> On Thu, Mar 05, 2015 at 09:05:28PM -0700, David Ahern wrote:

>> Since each domain is a superset of the lower one, each pass through
>> load_balance regularly repeats the processing of the previous domain (e.g.,
>> the NODE domain repeats the cpus in the CPU domain). Multiply that
>> across 1024 cpus and it seems like a lot of duplication.


> It is, _but_ each domain has an interval, bigger domains _should_ load
> balance at a bigger interval (iow lower frequency), and all this is
> lockless data gathering, so reusing stuff from the previous round could
> be quite stale indeed.



Yes, and I have twiddled the intervals. The defaults for min_interval and 
max_interval (msec):

        min_interval  max_interval
  SMT              1             2
  MC               1             4
  CPU              1             4
  NODE             8            32

Increasing those values (e.g. moving NODE to 50 and 100) drops idle-time 
cpu usage but does not solve the fundamental problem -- under load the 
balancing of domains seems to be lining up and the system comes to a 
halt in a load-balancing frenzy.


David


Re: NMI watchdog triggering during load_balance

2015-03-06 Thread David Ahern

On 3/6/15 1:51 AM, Peter Zijlstra wrote:

> On Thu, Mar 05, 2015 at 09:05:28PM -0700, David Ahern wrote:

>> Hi Peter/Mike/Ingo:
>>
>> Does that make sense or am I off in the weeds?


> How much of your story pertains to 3.18? I'm not particularly interested
> in anything much older than that.



No. All of the data in the opening email are from 2.6.39. Each kernel 
(2.6.39, 3.8 and 3.18) has a different performance problem. I will look 
at 3.18 in depth soon, but from what I can see the fundamental concepts 
of the load balancing have not changed (e.g., my tracepoints from 2.6.39 
still apply to 3.18).


David


Re: NMI watchdog triggering during load_balance

2015-03-06 Thread David Ahern

On 3/5/15 9:52 PM, Mike Galbraith wrote:

>> CPU970 attaching sched-domain:
>>  domain 0: span 968-975 level SIBLING
>>   groups: 8 single CPU groups
>>   domain 1: span 968-975 level MC
>>    groups: 1 group with 8 cpus
>>    domain 2: span 768-1023 level CPU
>>     groups: 4 groups with 256 cpus per group


> Wow, that topology is horrid.  I'm not surprised that your box is
> writhing in agony.  Can you twiddle that?



twiddle that how?

The system has 4 physical cpus (sockets). Each cpu has 32 cores with 8 
threads per core and each cpu has 4 memory controllers.


If I disable SCHED_MC and CGROUPS_SCHED (group scheduling) there is a 
noticeable improvement -- watchdog does not trigger and I do not get the 
rq locks held for 2-3 seconds. But there is still fairly high cpu usage 
for an idle system. Perhaps I should leave SCHED_MC on and disable 
SCHED_SMT; I'll try that today.


Thanks,
David


Re: NMI watchdog triggering during load_balance

2015-03-06 Thread Peter Zijlstra
On Thu, Mar 05, 2015 at 09:05:28PM -0700, David Ahern wrote:
> Socket(s): 32
> NUMA node(s):  4

Urgh, with 32 'cpus' per socket, you still do _8_ sockets per node, for
a total of 256 cpus per node.

That's painful. I don't suppose you can really change the hardware, but
that's a 'curious' choice.


Re: NMI watchdog triggering during load_balance

2015-03-06 Thread Peter Zijlstra
On Thu, Mar 05, 2015 at 09:05:28PM -0700, David Ahern wrote:
> Since each domain is a superset of the lower one, each pass through
> load_balance regularly repeats the processing of the previous domain (e.g.,
> the NODE domain repeats the cpus in the CPU domain). Multiply that
> across 1024 cpus and it seems like a lot of duplication.

It is, _but_ each domain has an interval, bigger domains _should_ load
balance at a bigger interval (iow lower frequency), and all this is
lockless data gathering, so reusing stuff from the previous round could
be quite stale indeed.
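
(Roughly, the per-level spacing works like this; a simplified model of the
interval handling in rebalance_domains(), with field names borrowed from
struct sched_domain but not an excerpt of the kernel code:)

#include <stdbool.h>

struct sd_interval_model {
        unsigned long balance_interval; /* ms; small for SMT/MC, larger for NODE */
        unsigned int busy_factor;       /* stretches the interval when the CPU is busy */
        unsigned long last_balance;     /* ms timestamp of the last pass at this level */
};

/* Decide whether this domain level is due for another load_balance() pass. */
static bool time_to_balance(struct sd_interval_model *sd,
                            unsigned long now_ms, bool cpu_busy)
{
        unsigned long interval = sd->balance_interval;

        if (cpu_busy)
                interval *= sd->busy_factor;    /* busy CPUs balance less often */

        if (now_ms - sd->last_balance < interval)
                return false;                   /* too soon for this level */

        sd->last_balance = now_ms;
        return true;
}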


Re: NMI watchdog triggering during load_balance

2015-03-06 Thread Peter Zijlstra
On Thu, Mar 05, 2015 at 09:05:28PM -0700, David Ahern wrote:
> Hi Peter/Mike/Ingo:
> 
> Does that make sense or am I off in the weeds?

How much of your story pertains to 3.18? I'm not particularly interested
in anything much older than that.


Re: NMI watchdog triggering during load_balance

2015-03-05 Thread Mike Galbraith
On Thu, 2015-03-05 at 21:05 -0700, David Ahern wrote:
> Hi Peter/Mike/Ingo:
> 
> I've been banging my head against this wall for a week now and hoping you or 
> someone could shed some light on the problem.
> 
> On larger systems (256 to 1024 cpus) there are several use cases (e.g., 
> http://www.cs.virginia.edu/stream/) that regularly trigger the NMI 
> watchdog with the stack trace:
> 
> Call Trace:
> @  [0045d3d0] double_rq_lock+0x4c/0x68
> @  [004699c4] load_balance+0x278/0x740
> @  [008a7b88] __schedule+0x378/0x8e4
> @  [008a852c] schedule+0x68/0x78
> @  [0042c82c] cpu_idle+0x14c/0x18c
> @  [008a3a50] after_lock_tlb+0x1b4/0x1cc
> 
> Capturing data for all CPUs I tend to see load_balance related stack 
> traces on 700-800 cpus, with a few hundred blocked on _raw_spin_trylock_bh.
> 
> I originally thought it was a deadlock in the rq locking, but if I bump 
> the watchdog timeout the system eventually recovers (after 10-30+ 
> seconds of unresponsiveness) so it does not seem likely to be a deadlock.
> 
> This particular system has 1024 cpus:
> # lscpu
> Architecture:          sparc64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Big Endian
> CPU(s):                1024
> On-line CPU(s) list:   0-1023
> Thread(s) per core:    8
> Core(s) per socket:    4
> Socket(s):             32
> NUMA node(s):          4
> NUMA node0 CPU(s):     0-255
> NUMA node1 CPU(s):     256-511
> NUMA node2 CPU(s):     512-767
> NUMA node3 CPU(s):     768-1023
> 
> and there are 4 scheduling domains. An example of the domain debug 
> output (condensed for the email):
> 
> CPU970 attaching sched-domain:
>  domain 0: span 968-975 level SIBLING
>   groups: 8 single CPU groups
>   domain 1: span 968-975 level MC
>    groups: 1 group with 8 cpus
>    domain 2: span 768-1023 level CPU
>     groups: 4 groups with 256 cpus per group

Wow, that topology is horrid.  I'm not surprised that your box is
writhing in agony.  Can you twiddle that?

-Mike



NMI watchdog triggering during load_balance

2015-03-05 Thread David Ahern

Hi Peter/Mike/Ingo:

I've been banging my head against this wall for a week now and hoping you or 
someone could shed some light on the problem.


On larger systems (256 to 1024 cpus) there are several use cases (e.g., 
http://www.cs.virginia.edu/stream/) that regularly trigger the NMI 
watchdog with the stack trace:


Call Trace:
@  [0045d3d0] double_rq_lock+0x4c/0x68
@  [004699c4] load_balance+0x278/0x740
@  [008a7b88] __schedule+0x378/0x8e4
@  [008a852c] schedule+0x68/0x78
@  [0042c82c] cpu_idle+0x14c/0x18c
@  [008a3a50] after_lock_tlb+0x1b4/0x1cc

Capturing data for all CPUs I tend to see load_balance related stack 
traces on 700-800 cpus, with a few hundred blocked on _raw_spin_trylock_bh.


I originally thought it was a deadlock in the rq locking, but if I bump 
the watchdog timeout the system eventually recovers (after 10-30+ 
seconds of unresponsiveness) so it does not seem likely to be a deadlock.


This particular system has 1024 cpus:
# lscpu
Architecture:          sparc64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Big Endian
CPU(s):                1024
On-line CPU(s) list:   0-1023
Thread(s) per core:    8
Core(s) per socket:    4
Socket(s):             32
NUMA node(s):          4
NUMA node0 CPU(s):     0-255
NUMA node1 CPU(s):     256-511
NUMA node2 CPU(s):     512-767
NUMA node3 CPU(s):     768-1023

and there are 4 scheduling domains. An example of the domain debug 
output (condensed for the email):


CPU970 attaching sched-domain:
 domain 0: span 968-975 level SIBLING
  groups: 8 single CPU groups
  domain 1: span 968-975 level MC
   groups: 1 group with 8 cpus
   domain 2: span 768-1023 level CPU
    groups: 32 groups with 8 cpus per group
    domain 3: span 0-1023 level NODE
     groups: 4 groups with 256 cpus per group


On an idle system (20 or so non-kernel threads such as mingetty, udev, 
...) perf top shows the task scheduler is consuming significant time:



   PerfTop:  136580 irqs/sec  kernel:99.9%  exact:  0.0% [1000Hz cycles],  (all, 1024 CPUs)
---------------------------------------------------------------------------

20.22%  [kernel]  [k] find_busiest_group
16.00%  [kernel]  [k] find_next_bit
 6.37%  [kernel]  [k] ktime_get_update_offsets
 5.70%  [kernel]  [k] ktime_get
...


This is a 2.6.39 kernel (yes, a relatively old one); 3.8 shows similar 
symptoms. 3.18 is much better.


From what I can tell load balancing is happening non-stop and there is 
heavy contention in the run queue locks. I instrumented the rq locking 
and under load (e.g., the stream test) regularly see single rq locks held 
continuously for 2-3 seconds (e.g., at the end of the stream run which 
has 1024 threads and the process is terminating).


I have been staring at and instrumenting the scheduling code for days. 
It seems like the balancing of domains is regularly lining up on all or 
almost all CPUs and it seems like the NODE domain causes the most damage 
since it scans all cpus (i.e., in rebalance_domains() each domain pass 
triggers a call to load_balance on all cpus at the same time). Just in 
random snapshots during a stream test I have seen 1 pass through 
rebalance_domains take > 17 seconds (custom tracepoints to tag start and 
end).


Since each domain is a superset of the lower one, each pass through 
load_balance regularly repeats the processing of the previous domain 
(e.g., the NODE domain repeats the cpus in the CPU domain). Multiply 
that across 1024 cpus and it seems like a lot of duplication.
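
(As a back-of-the-envelope illustration of that duplication, using the
spans from the CPU970 sched-domain dump above; purely a toy calculation,
not kernel code:)

#include <stdio.h>

int main(void)
{
        /* span (CPUs examined) per domain level for one CPU, per the dump above */
        const struct { const char *name; long span; } levels[] = {
                { "SIBLING", 8 }, { "MC", 8 }, { "CPU", 256 }, { "NODE", 1024 },
        };
        const long nr_cpus = 1024;
        long per_cpu = 0;

        for (unsigned int i = 0; i < sizeof(levels) / sizeof(levels[0]); i++)
                per_cpu += levels[i].span;

        printf("one full pass per CPU looks at %ld cpu slots; "
               "across %ld CPUs that is %ld\n",
               per_cpu, nr_cpus, per_cpu * nr_cpus);
        return 0;
}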


Does that make sense or am I off in the weeds?

Thanks,
David