[PATCH v2 08/10] sched/fair: Steal work from an overloaded CPU when CPU goes idle
When a CPU has no more CFS tasks to run, and idle_balance() fails to find a task, then attempt to steal a task from an overloaded CPU in the same LLC, using the cfs_overload_cpus bitmap to efficiently identify candidates. To minimize search time, steal the first migratable task that is found when the bitmap is traversed. For fairness, search for migratable tasks on an overloaded CPU in order of next to run.

This simple stealing yields a higher CPU utilization than idle_balance() alone, because the search is cheap, so it may be called every time the CPU is about to go idle. idle_balance() does more work because it searches widely for the busiest queue, so to limit its CPU consumption, it declines to search if the system is too busy. Simple stealing does not offload the globally busiest queue, but it is much better than running nothing at all.

Stealing is controlled by the sched feature SCHED_STEAL, which is enabled by default.

Stealing improves utilization with only a modest CPU overhead in scheduler code. In the following experiment, hackbench is run with varying numbers of groups (40 tasks per group), and the delta in /proc/schedstat is shown for each run, averaged per CPU, augmented with these non-standard stats:

  %find - percent of time spent in old and new functions that search for
          idle CPUs and tasks to steal and set the overloaded CPUs bitmap.

  steal - number of times a task is stolen from another CPU.
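The stealing pass described above can be sketched as a userspace model. This is illustrative only, not the kernel code: the real implementation lives in kernel/sched/fair.c and uses the sparsemask-based cfs_overload_cpus plus can_migrate_task(); the names and flat arrays below (try_steal, struct rq as a plain task array, a single unsigned long bitmap) are stand-ins.

```c
/*
 * Userspace model of the stealing pass, for illustration only.
 * The bitmap stands in for cfs_overload_cpus, and the per-CPU task
 * arrays stand in for runqueues ordered by next to run.
 */
#include <stdbool.h>

#define NR_CPUS   8
#define MAX_TASKS 4

struct task {
	int pid;
	bool migratable;	/* models can_migrate_task(): affinity, etc. */
};

struct rq {
	int nr_running;
	struct task tasks[MAX_TASKS];	/* tasks[0] is next to run */
};

struct rq runqueues[NR_CPUS];
unsigned long cfs_overload_cpus;	/* bit n set => CPU n has extra CFS tasks */

/*
 * Called when dst_cpu is about to go idle.  Traverse the overload bitmap
 * and move the first migratable task found, searching each overloaded
 * CPU's tasks in order of next to run.  Returns the stolen pid, or -1.
 */
int try_steal(int dst_cpu)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (cpu == dst_cpu || !(cfs_overload_cpus & (1UL << cpu)))
			continue;

		struct rq *src = &runqueues[cpu];
		for (int i = 0; i < src->nr_running; i++) {
			if (!src->tasks[i].migratable)
				continue;

			/* Detach from src ... */
			struct task t = src->tasks[i];
			for (int j = i; j < src->nr_running - 1; j++)
				src->tasks[j] = src->tasks[j + 1];
			if (--src->nr_running < 2)
				cfs_overload_cpus &= ~(1UL << cpu);

			/* ... attach to dst, and stop: the search stays cheap. */
			struct rq *dst = &runqueues[dst_cpu];
			dst->tasks[dst->nr_running++] = t;
			return t.pid;
		}
	}
	return -1;	/* nothing stealable; dst_cpu goes idle */
}
```

For example, if CPU 2 holds a pinned task and one migratable task, a newly idle CPU 0 steals the migratable one and clears CPU 2's overload bit, since CPU 2 no longer has a surplus.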
X6-2: 1 socket * 10 cores * 2 hyperthreads = 20 CPUs
Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
hackbench process 10
sched_wakeup_granularity_ns=15000000

  baseline
  grps  time   %busy  slice  sched   idle    wake    %find  steal
  1     8.084  75.02  0.10   105476  46291   59183   0.31   0
  2    13.892  85.33  0.10   190225  70958   119264  0.45   0
  3    19.668  89.04  0.10   263896  87047   176850  0.49   0
  4    25.279  91.28  0.10   322171  94691   227474  0.51   0
  8    47.832  94.86  0.09   630636  144141  486322  0.56   0

  new
  grps  time   %busy  slice  sched   idle   wake    %find  steal  %speedup
  1     5.938  96.80  0.24   31255   7190   24061   0.63   7433   36.1
  2    11.491  99.23  0.16   74097   4578   69512   0.84   19463  20.9
  3    16.987  99.66  0.15   115824  1985   113826  0.77   24707  15.8
  4    22.504  99.80  0.14   167188  2385   164786  0.75   29353  12.3
  8    44.441  99.86  0.11   389153  1616   387401  0.67   38190  7.6

Elapsed time improves by 8 to 36%, and CPU busy utilization is up by 5 to 22%, hitting 99% for 2 or more groups (80 or more tasks). The cost is at most 0.4% more find time.

Additional performance results follow. A negative "speedup" is a regression. Note: for all hackbench runs, sched_wakeup_granularity_ns is set to 15 msec. Otherwise, preemptions increase at higher loads and distort the comparison between baseline and new.
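The tables do not state how %speedup is derived; the column is consistent with the elapsed-time improvement relative to the new time, truncated to one decimal (e.g. (8.084 - 5.938) / 5.938 * 100 ~= 36.1). A sketch under that inferred reading:

```c
#include <assert.h>
#include <math.h>

/*
 * %speedup as the tables appear to compute it: elapsed-time improvement
 * relative to the new time.  Negative values are regressions.
 * This formula is inferred from the reported numbers, not stated in the text.
 */
double pct_speedup(double base_time, double new_time)
{
	return (base_time - new_time) / new_time * 100.0;
}
```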
-- 1 Socket Results --

X6-2: 1 socket * 10 cores * 2 hyperthreads = 20 CPUs
Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
Average of 10 runs of: hackbench process 10

            --- base --    --- new ---
  groups    time %stdev    time %stdev  %speedup
       1   8.008    0.1   5.905    0.2      35.6
       2  13.814    0.2  11.438    0.1      20.7
       3  19.488    0.2  16.919    0.1      15.1
       4  25.059    0.1  22.409    0.1      11.8
       8  47.478    0.1  44.221    0.1       7.3

X6-2: 1 socket * 22 cores * 2 hyperthreads = 44 CPUs
Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
Average of 10 runs of: hackbench process 10

            --- base --    --- new ---
  groups    time %stdev    time %stdev  %speedup
       1   4.586    0.8   4.596    0.6      -0.3
       2   7.693    0.2   5.775    1.3      33.2
       3  10.442    0.3   8.288    0.3      25.9
       4  13.087    0.2  11.057    0.1      18.3
       8  24.145    0.2  22.076    0.3       9.3
      16  43.779    0.1  41.741    0.2       4.8

KVM 4-cpu
Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz
tbench, average of 11 runs

  clients    %speedup
        1        16.2
        2        11.7
        4         9.9
        8        12.8
       16        13.7

KVM 2-cpu
Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz

  Benchmark                     %speedup
  specjbb2015_critical_jops          5.7
  mysql_sysb1.0.14_mutex_2          40.6
  mysql_sysb1.0.14_oltp_2            3.9

-- 2 Socket Results --

X6-2: 2 sockets * 10 cores * 2 hyperthreads = 40 CPUs
Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz
Average of 10 runs of: hackbench process 10

            --- base --    --- new ---
  groups    time %stdev    time %stdev  %speedup
       1   7.945    0.2   7.219    8.7      10.0
       2   8.444    0.4   6.689    1.5      26.2
       3  12.100    1.1   9.962    2.0      21.4
       4  15.001    0.4  13.109    1.1      14.4
       8  27.960    0.2  26.127    0.3       7.0

X6-2: 2 sockets * 22 cores * 2 hyperthreads = 88 CPUs
Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz