[PATCH v2 09/10] sched/fair: disable stealing if too many NUMA nodes
The STEAL feature causes regressions on hackbench on larger NUMA systems, so disable it on systems with more than sched_steal_node_limit nodes (default 2). Note that the feature remains enabled as seen in features.h and /sys/kernel/debug/sched_features, but stealing is only performed if nodes <= sched_steal_node_limit. This arrangement allows users to activate stealing on reboot by setting the kernel parameter sched_steal_node_limit on kernels built without CONFIG_SCHED_DEBUG. The parameter is temporary and will be deleted when the regression is fixed. Details of the regression follow. With the STEAL feature set, hackbench is slower on many-node systems: X5-8: 8 sockets * 18 cores * 2 hyperthreads = 288 CPUs Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz Average of 10 runs of: hackbench processes 5 --- base ----- new --- groupstime %stdevtime %stdev %speedup 1 3.627 15.8 3.8767.3 -6.5 2 4.545 24.7 5.583 16.7 -18.6 3 5.716 25.0 7.367 14.2 -22.5 4 6.901 32.9 7.718 14.5 -10.6 8 8.604 38.5 9.111 16.0 -5.6 16 7.7346.8 11.0078.2 -29.8 Total CPU time increases. Profiling shows that CPU time increases uniformly across all functions, suggesting a systemic increase in cache or memory latency. This may be due to NUMA migrations, as they cause loss of LLC cache footprint and remote memory latencies. The domains for this system and their flags are: domain0 (SMT) : 1 core SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING SD_SHARE_CPUCAPACITY SD_WAKE_AFFINE domain1 (MC) : 1 socket SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING SD_WAKE_AFFINE domain2 (NUMA) : 4 sockets SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_SERIALIZE SD_OVERLAP SD_NUMA SD_WAKE_AFFINE domain3 (NUMA) : 8 sockets SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_SERIALIZE SD_OVERLAP SD_NUMA Schedstats point to the root cause of the regression. hackbench is run 10 times per group and the average schedstat accumulation per-run and per-cpu is shown below. Note that domain3 moves are zero because SD_WAKE_AFFINE is not set there. NO_STEAL --- domain2 --- --- domain3 --- grp time %busy sched idle wake steal remote move pull remote move pull 1 20.3 10.3 28710 14346 14366 0490 33780 4039 00 2 26.4 18.8 56721 28258 28469 0792 7026 12 9229 07 3 29.9 28.3 90191 44933 45272 0 5380 7204 19 16481 03 4 30.2 35.8 121324 60409 60933 0 7012 9372 27 21438 05 8 27.7 64.2 229174 111917 117272 0 11991 1837 168 44006 0 32 16 32.6 74.0 334615 146784 188043 0 3404 1468 49 61405 08 STEAL --- domain2 --- --- domain3 --- grp time %busy sched idle wake steal remote move pull remote move pull 1 20.6 10.2 28490 14232 1426118 3 35250 4254 00 2 27.9 18.8 56757 28203 28562 303 1675 78395 9690 02 3 35.3 27.7 87337 43274 44085 698741 12785 14 15689 03 4 36.8 36.0 118630 58437 60216 1579 2973 14101 28 18732 07 8 48.1 73.8 289374 133681 155600 18646 35340 10179 171 65889 0 34 16 41.4 82.5 268925 91908 177172 47498 17206 6940 176 71776 0 20 Cross-numa-node migrations are caused by load balancing pulls and wake_affine moves. Pulls are small and similar for no_steal and steal. However, moves are significantly higher for steal, and rows above with the highest moves have the worst regressions for time; see for example grp=8. Moves increase for steal due to the following logic in wake_affine_idle() for synchronous wakeup: if (sync && cpu_rq(this_cpu)->nr_running == 1) return this_cpu;// move the task The steal feature does a better job of smoothing the load between idle and busy CPUs, so nr_running is 1 more often, and moves are performed more often. For hackbench, cross-node affine moves early in the run are good because they colocate wakers and wakees from the same group on the same node, but continued moves later in the run are bad, because the wakee is moved away from peers on its previous node. Note that even no_steal is far from optimal; binding an instance of "hackbench 2" to each of the 8 NUMA nodes runs much faster than running "hackbench 16" with no binding. Clearing SD_WAKE_AFFINE for domain2 eliminates the affine cross-node migrations and eliminates the difference between no_steal and steal performance. However, overall performance is lower than WA_IDLE because some migrations are helpful as explained above. I have tried many heuristics in a attempt to optimize the number of cross-node moves in
[PATCH v2 09/10] sched/fair: disable stealing if too many NUMA nodes
The STEAL feature causes regressions on hackbench on larger NUMA systems, so disable it on systems with more than sched_steal_node_limit nodes (default 2). Note that the feature remains enabled as seen in features.h and /sys/kernel/debug/sched_features, but stealing is only performed if nodes <= sched_steal_node_limit. This arrangement allows users to activate stealing on reboot by setting the kernel parameter sched_steal_node_limit on kernels built without CONFIG_SCHED_DEBUG. The parameter is temporary and will be deleted when the regression is fixed. Details of the regression follow. With the STEAL feature set, hackbench is slower on many-node systems: X5-8: 8 sockets * 18 cores * 2 hyperthreads = 288 CPUs Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz Average of 10 runs of: hackbench processes 5 --- base ----- new --- groupstime %stdevtime %stdev %speedup 1 3.627 15.8 3.8767.3 -6.5 2 4.545 24.7 5.583 16.7 -18.6 3 5.716 25.0 7.367 14.2 -22.5 4 6.901 32.9 7.718 14.5 -10.6 8 8.604 38.5 9.111 16.0 -5.6 16 7.7346.8 11.0078.2 -29.8 Total CPU time increases. Profiling shows that CPU time increases uniformly across all functions, suggesting a systemic increase in cache or memory latency. This may be due to NUMA migrations, as they cause loss of LLC cache footprint and remote memory latencies. The domains for this system and their flags are: domain0 (SMT) : 1 core SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING SD_SHARE_CPUCAPACITY SD_WAKE_AFFINE domain1 (MC) : 1 socket SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_SHARE_PKG_RESOURCES SD_PREFER_SIBLING SD_WAKE_AFFINE domain2 (NUMA) : 4 sockets SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_BALANCE_EXEC SD_BALANCE_FORK SD_SERIALIZE SD_OVERLAP SD_NUMA SD_WAKE_AFFINE domain3 (NUMA) : 8 sockets SD_LOAD_BALANCE SD_BALANCE_NEWIDLE SD_SERIALIZE SD_OVERLAP SD_NUMA Schedstats point to the root cause of the regression. hackbench is run 10 times per group and the average schedstat accumulation per-run and per-cpu is shown below. Note that domain3 moves are zero because SD_WAKE_AFFINE is not set there. NO_STEAL --- domain2 --- --- domain3 --- grp time %busy sched idle wake steal remote move pull remote move pull 1 20.3 10.3 28710 14346 14366 0490 33780 4039 00 2 26.4 18.8 56721 28258 28469 0792 7026 12 9229 07 3 29.9 28.3 90191 44933 45272 0 5380 7204 19 16481 03 4 30.2 35.8 121324 60409 60933 0 7012 9372 27 21438 05 8 27.7 64.2 229174 111917 117272 0 11991 1837 168 44006 0 32 16 32.6 74.0 334615 146784 188043 0 3404 1468 49 61405 08 STEAL --- domain2 --- --- domain3 --- grp time %busy sched idle wake steal remote move pull remote move pull 1 20.6 10.2 28490 14232 1426118 3 35250 4254 00 2 27.9 18.8 56757 28203 28562 303 1675 78395 9690 02 3 35.3 27.7 87337 43274 44085 698741 12785 14 15689 03 4 36.8 36.0 118630 58437 60216 1579 2973 14101 28 18732 07 8 48.1 73.8 289374 133681 155600 18646 35340 10179 171 65889 0 34 16 41.4 82.5 268925 91908 177172 47498 17206 6940 176 71776 0 20 Cross-numa-node migrations are caused by load balancing pulls and wake_affine moves. Pulls are small and similar for no_steal and steal. However, moves are significantly higher for steal, and rows above with the highest moves have the worst regressions for time; see for example grp=8. Moves increase for steal due to the following logic in wake_affine_idle() for synchronous wakeup: if (sync && cpu_rq(this_cpu)->nr_running == 1) return this_cpu;// move the task The steal feature does a better job of smoothing the load between idle and busy CPUs, so nr_running is 1 more often, and moves are performed more often. For hackbench, cross-node affine moves early in the run are good because they colocate wakers and wakees from the same group on the same node, but continued moves later in the run are bad, because the wakee is moved away from peers on its previous node. Note that even no_steal is far from optimal; binding an instance of "hackbench 2" to each of the 8 NUMA nodes runs much faster than running "hackbench 16" with no binding. Clearing SD_WAKE_AFFINE for domain2 eliminates the affine cross-node migrations and eliminates the difference between no_steal and steal performance. However, overall performance is lower than WA_IDLE because some migrations are helpful as explained above. I have tried many heuristics in a attempt to optimize the number of cross-node moves in